Career Opportunities in Internet Data Science
The Promise of Big Data
Thanks to cloud computing and continuing advances in technology, big data – data sets so massive and complex that they require new management tools – is here to stay.
- Natural-language processing has opened up unstructured data (e.g., emails, documents, social media posts) to analysis.
- Cloud-based data storage and high-speed data processing have become increasingly cheap.
- Sophisticated analytics software and tools are widely available.
Telecom and Internet companies are in a particularly strong position to take advantage of big data. In 2009:
- Google processed approximately 24 petabytes of data per day
- AT&T transferred around 30 petabytes of data through its network daily
- Facebook users had already uploaded over 15 billion photos
That’s creating the need for data scientists, who hold what has been called the sexiest job of the 21st century. In addition to being analysts, data scientists are free-wheelers, hackers and problem-solvers who manage and analyze information. Companies approach them with a conundrum. Data scientists find the answers hidden in big data.
Like This Ad?
One of the more famous petri dishes for data science was Facebook. In 2005, a young financial wizard abandoned a life of ease and overpriced cocktails on Wall Street and decamped to Silicon Valley.
Landing at Facebook, Jeffrey Hammerbacher put together a team of gurus to mine huge hoards of the site’s social network data, crunch the numbers for insights and use their discoveries to improve the service and create targeted advertising. (Two years later, Hammerbacher decamped again to form Cloudera.)
Over at LinkedIn, D.J. Patil and his colleagues were hard at work on a similar approach. Their work produced many tools users are now familiar with – recommendation technology such as “Groups you may like”, as well as features such as Career Explorer and Jobs Recommendation.
Consumers ‘R’ Us
Of course, a lot of demand for data scientists is coming from e-commerce. Business intelligence and the retail industry have a long history together (AC Nielsen began asking impertinent questions in 1923), and the partnership has grown a whole lot deeper and stronger with the arrival of the Internet.
Now retailers are asking data scientists to combine a huge host of data sets to create:
- Personalized recommendations based on weather, seasonal trends, traffic reports, past purchase history, your dog’s favorite chew toys…
- Smarter sentiment analysis
- Product insights gleaned from RFID and sensor data
- Detailed market basket and video analysis
- Geo-targeted marketing
- Real-time pricing and inventory management
The wish list goes on. And on. And on.
And Google, SAS and IBM have been right there to help. In recent years, these big fish have gone on buying streaks, snapping up smaller fry with useful analytics technologies.
- In 2013, Google bought the e-commerce specialist Channel Intelligence for $125 million. With this purchase, it acquired CI’s sales and event transaction data gathering platform (TrueTag) – technology which CI claims tracks nearly 15 percent of U.S. transactions online.
- In the same year, IBM rolled out a number of analytics-backed retail apps and services that incorporated components from Tealeaf Technology, a customer experience management specialist.
Humans, We Have a Problem
Though many data scientists may use the Web as a data source, they’re not limited by it. In fact, most successful business intelligence and data analytics companies are pulling data from as many sources as they can find.
Take Splunk, big data’s “first IPO”. Its specialty is harvesting big data from machines. These include Web servers, mobile devices – even something as prosaic as an air conditioning unit.
- As soon as a machine creates a piece of data, Splunk seizes it and stores it in a cloud-based database.
- The aim is to trace a machine’s repetitive patterns, find anomalies and diagnose problems.
- Once it’s found an issue, its programs can create immediate alerts (as well as less urgent graphs and reports) for the client.
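The pattern-then-anomaly idea in the steps above can be sketched in a few lines. This is not Splunk's actual method, which the article does not describe. It is a minimal illustration that assumes a simple z-score rule, a hypothetical `find_anomalies` helper and made-up temperature readings from an air conditioning unit.

```python
from statistics import mean, stdev

def find_anomalies(readings, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []  # every reading is identical, so nothing stands out
    return [(i, x) for i, x in enumerate(readings)
            if abs(x - mu) / sigma > threshold]

# Hypothetical temperature readings (degrees C), invented for illustration.
temps = [21.0, 21.2, 20.9, 21.1, 21.0, 35.5, 21.1, 20.8, 21.0, 21.2]

alerts = find_anomalies(temps)
for index, value in alerts:
    print(f"ALERT: reading {index} = {value}")
```

A real monitoring service would work on streaming data and far richer models. The point here is only that "find what is normal, then flag what is not" can start from very simple statistics.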
This kind of process has many applications. For instance, MetroPCS and T-Mobile have used Splunk to monitor their networks, while government agencies use it to watch for attacks on theirs.
Splunk isn’t the only kid on the block to see the potential. In 2013, Loggly – a cloud-based log management service with SaaS architecture – managed to raise $20.9 million from investors such as Cisco, Trinity Ventures, Matrix Partners and others.
Saving the World
But maybe the most positive trend appearing in Web-based data science has nothing to do with profit or gain. As Viktor Mayer-Schönberger and Kenneth Cukier relate in their 2013 book on big data, it might just help save the world.
In 2008, a group of Google’s data scientists began to take a good hard look at the flu. Putting together the 50 million most common search terms in their U.S. database, they compared them against a set of CDC data on the spread of seasonal flu from 2003 to 2008.
They went at this task with no preconceptions. They simply designed a system that would look for correlations between the recorded spread of flu and the frequency of specific search queries. To test their flu predictions, they processed 450 million mathematical models.
And they found the link. A combination of 45 search terms, used together in a mathematical model, could tell them – in real-time – how flu epidemics were spreading.
This real-time monitoring was far timelier than any government report to date. When the H1N1 virus hit in 2009, public health officials were right on top of its spread.
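To make the correlation hunt described above a little more concrete, here is a minimal sketch. It is not Google's system. The search terms, the weekly counts and the `rank_terms` helper are all invented for illustration, and the only technique shown is a plain Pearson correlation between each term's frequency and the recorded case counts.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_terms(queries, cases):
    """Sort search terms from strongest to weakest correlation with the cases."""
    scored = [(term, pearson(counts, cases)) for term, counts in queries.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical weekly figures, invented for illustration only.
flu_cases = [120, 180, 260, 400, 520, 610, 480, 300]
query_counts = {
    "flu symptoms": [30, 45, 70, 110, 140, 160, 130, 80],
    "cough medicine": [50, 60, 85, 120, 150, 170, 140, 95],
    "cheap flights": [90, 88, 91, 87, 92, 89, 90, 88],
}

ranked = rank_terms(query_counts, flu_cases)
for term, r in ranked:
    print(f"{term}: r = {r:+.2f}")
```

In this toy data the two health-related terms track the case counts closely, while the travel term does not. A strong correlation still says nothing about cause, a limitation the article returns to later.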
Data Risks and Regulations
You’re Only as Good as Your Data
With all its potential, big data isn’t the answer to everything. As Lei Yao and Chen Wei point out in Regulation[s] Must Adapt to [the] Big Data Revolution, big data:
- Cannot prevent misinterpretation or prejudice
- Does not bridge the gap between correlation and causation
- Suffers from selective and incomplete coverage (e.g., social media users tend to be young and urban)
What’s more, the volume, velocity and variety of big data are only going to get bigger. Mobile use is exploding. Developing countries are coming online. The Internet of Things is spawning a whole new information universe.
Internet companies, which are already dealing with astronomical numbers, will have to be ready to handle the load. Data scientists will have to be primed to know where to look.
Private! Keep Out!
Or decide whether they should be looking at all. Nothing is more contentious in today’s data climate than privacy. Consumers are rightly disturbed by reports that:
- A 2013 study of anonymous mobile data from 1.5 million European mobile users found that up to 95 percent could be identified from only four data points.
- Health insurers are buying giant databases to flag purchasers of plus-sized clothing as being at risk of obesity.
- Digital billboards can analyze your facial features or access your smartphone data to target you with customized advertising.
Not to mention the daily news stories on what Google and Facebook have done with content.
In a global economy – and with increasing demands from the government to access citizens’ information – the Internet industry is facing a set of complicated questions:
- Who owns the rights to personal data? Are there exceptions to the rule?
- Now that the cloud is here, what safeguards are needed to protect private information?
- As data collaboration increases, how much can Internet companies share with commercial partners, business intelligence vendors and non-profits?
- What does privacy mean in the 21st century?
Unfortunately, as with all things regarding data science, there are no easy answers.
History of Data Science and the Internet
“I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers.” – Tim Berners-Lee
In early October 1957, every headline in the world was screaming:
- “RED ‘MOON’ OVER LONDON!”
- “RUSSIA WINS THE RACE INTO OUTER SPACE!”
- “SPACE AGE IS HERE!”
Sputnik, the first artificial earth satellite, was in the sky.
A few months later the U.S. Department of Defense hastily issued directive 5105.15 – the establishment of the Advanced Research Projects Agency (ARPA). Packed with the country’s best and brightest, this agency spent the sixties tackling a communications challenge. The solution to the challenge was known as ARPANET and would transform the world forever.
Do You See the L?
In 1962, J.C.R. Licklider (aka “computing’s Johnny Appleseed”) became Director of the Information Processing Techniques Office (IPTO) within ARPA. His job? Find a way to unite the department’s main computers at Cheyenne Mountain with the Pentagon and Strategic Air Command (SAC) headquarters in a wide-area network.
His idea was clear and simple:
“A network of such [computers], connected to one another by wide-band communication lines.”
This dream came one step closer to reality when Paul Baran, Donald Davies and others came up with the concept of packet-switching. By bundling data into arbitrary packets and routing these “digital envelopes” onward, computer engineers could save precious bandwidth on connection lines.
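The “digital envelope” idea can be sketched in a few lines. This is a simplified illustration, not a historical protocol: the `to_packets` and `reassemble` helpers, the 8-byte packet size and the use of a bare sequence number as the only header are all assumptions made for the example.

```python
import random

def to_packets(message: bytes, size: int):
    """Split a message into (sequence_number, payload) packets of at most `size` bytes."""
    return [(seq, message[start:start + size])
            for seq, start in enumerate(range(0, len(message), size))]

def reassemble(packets):
    """Rebuild the message by sorting packets on their sequence number."""
    return b"".join(payload for _, payload in sorted(packets))

message = b"A network of computers connected by wide-band lines"
packets = to_packets(message, size=8)

# Packets may take different routes and arrive out of order.
random.shuffle(packets)

restored = reassemble(packets)
print(restored == message)
```

Because each packet carries its own sequence number, the receiver can put the message back together regardless of arrival order, which is what lets many conversations share the same line.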
By the late 1960s, Licklider had moved on to other projects and Robert Taylor had become head of IPTO. Working with smart people like MIT’s Larry Roberts and Leonard Kleinrock, he threw even more resources into the project.
On October 29, 1969, Kleinrock was at UCLA, on the phone with colleagues at the Stanford Research Institute (SRI). Their computers had been linked; their systems were running:
“We typed the L, and we asked on the phone, ‘Do you see the L?’
‘Yes, we see the L,’ came the response.
We typed the O, and we asked, ‘Do you see the O?’
‘Yes, we see the O.’
Then we typed the G, and the system crashed.”
And LO, the Internet was born.
After IPTO had recovered from its first crash, events began to snowball:
- 1971: Abhay Bhushan publishes the original specification for File Transfer Protocol (FTP).
- 1971: Michigan’s Merit Network makes its debut.
- 1972: Larry Roberts conducts a public demonstration of ARPANET at the International Conference on Computer Communications.
- 1972: The first complete TELNET protocol comes out.
- 1974: Vinton Cerf, Yogen Dalal and Carl Sunshine publish the initial specifications of TCP (Transmission Control Protocol).
- 1976: Robert Metcalfe and his colleagues launch Ethernet, a family of technologies for local area networks (LANs).
By the 1980s, email, newsgroups and LANs were in common use at universities and research groups. In 1985, the National Science Foundation Network (NSFnet) linked five national supercomputer centers, eventually replacing ARPANET as the de facto educational network.
This meant data. Lots and lots of data.
The Roaring 90s
Throughout the 1980s and early 1990s, a number of groups were busy developing ways to organize this data and accommodate predicted growth. In 1989, Tim Berners-Lee proposed a simple but far-reaching idea: Why not use the Internet as a platform to create a global hypertext system available to all? A few years later, the World Wide Web appeared as a publicly available service.
From the beginning, it was important that the Web be readily accessible by ordinary users. Browsers like Mosaic (1993) and Netscape (1994) helped them make sense of it. Yahoo! (1994) and AltaVista (1995) made it much easier to search. E-commerce companies such as Book Stacks Unlimited (1992) and Amazon (1995) took advantage of it to create a new paradigm for business.
By the end of the century, the Internet was well on its way to becoming the primary avenue of information flowing through two-way telecom networks.
Then along came Google. The word is a deliberate misspelling of the word “googol” – the name for a very large number: one followed by 100 zeros. As this name suggests, the company was thinking big from the beginning.
Throughout the late 1990s and early 2000s, Google:
- Improved on existing search algorithms by tallying the hyperlinks pointing to each Web page as a measure of popularity, in addition to parsing the text on the page (an approach called PageRank, named for co-founder Larry Page)
- Began selling advertisements associated with search keywords
- Developed MapReduce, a programming model for processing large data sets (later implemented in the open source software Apache Hadoop)
- Launched Gmail, Google Translate, Google Maps, Google News, Google Books, and many more experiments
- Snapped up YouTube
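The link-tallying idea behind PageRank can be sketched as a short iterative calculation. This is a textbook-style simplification, not Google's production algorithm. The four-page `web` graph is invented, and the damping factor of 0.85 and the 50 iterations are assumed values chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance. `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web, invented for illustration.
web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

Here "home" comes out on top because every other page links to it, while "orphan" ranks last because nothing links to it, even though it links out.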
It ate and digested data.
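For a rough sense of how the MapReduce model digests data, here is a single-machine word count split into the model's map, shuffle and reduce steps. It is only a sketch: a real MapReduce or Hadoop job distributes these phases across many machines, and the function names and sample documents below are invented for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input documents.
documents = ["big data is big", "data is messy"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over as many machines as the data requires.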
And Google’s approach worked. As Viktor Mayer-Schönberger and Kenneth Cukier discuss in their 2013 book, Big Data: A Revolution That Will Transform How We Live, Work, and Think:
“The reason Google’s translation system works well is not that it has a smarter algorithm. It works well because its creators, like Banko and Brill at Microsoft, fed in more data – and not just of high quality. Google was able to use a dataset tens of thousands of times larger than IBM’s Candide because it accepted messiness.”
Messy data? Unstructured data? Didn’t matter. In their 2009 paper, The Unreasonable Effectiveness of Data, Google’s AI expert Peter Norvig and colleagues note:
“Simple models and a lot of data trump more elaborate models based on less data.”