Data Science in the Internet Industry

Career Opportunities in Internet Data Science

The Promise of Big Data
Thanks to cloud computing and continuing advances in technology, there are new opportunities to analyze big data.

What’s more:

Natural-language processing has opened up unstructured data (e.g., emails, documents, social media posts) to analysis.
Cloud-based data storage and high-speed data processing have become increasingly cheap.
Sophisticated analytics software and tools are widely available.

Telecom and Internet companies are in a particularly strong position to take advantage of big data. As of 2020, Google processes approximately 3.5 billion searches per day and 2.5 quintillion bytes of data are created each day.

Like This Ad?

One of the more famous petri dishes for data science was Facebook. In 2005, Jeffrey Hammerbacher put together a team of gurus to mine huge hoards of the site’s social network data, crunch the numbers for insights, and use their discoveries to improve the service and create targeted advertising. (Two years later, Hammerbacher decamped to form Cloudera.)

Over at LinkedIn, D.J. Patil and his colleagues were hard at work on a similar approach. Their work produced many tools users are now familiar with – recommendation technology such as “Groups you may like”, as well as features such as Career Explorer and Jobs Recommendation.

Consumers ‘R’ Us

Business intelligence and the retail industry have a long history together (AC Nielsen began asking impertinent questions in 1923), and the partnership has only deepened with the rise of the Internet.

Data scientists working in retail can combine data sets to create:

Personalized recommendations based on weather, seasonal trends, traffic reports, past purchase history, your dog’s favorite chew toys…
Smarter sentiment analysis
Product insights gleaned from RFID and sensor data
Detailed market basket and video analysis
Geo-targeted marketing
Real-time pricing and inventory management

In recent years Google, SAS and IBM have gone on buying streaks, snapping up smaller companies with useful analytics technologies.

For example:

In 2013, Google bought the e-commerce specialist Channel Intelligence for $125 million. With this purchase, it acquired CI’s sales and event transaction data gathering platform (TrueTag) – technology which CI claims tracks nearly 15 percent of U.S. transactions online.
In the same year, IBM rolled out a number of analytics-backed retail apps and services that incorporated components from Tealeaf Technology, a customer experience management specialist.

Humans, We Have a Problem

Though many data scientists may use the Web as a data source, they’re not limited by it. In fact, most successful business intelligence and data analytics companies are pulling data from as many sources as they can find.

Take Splunk. Its specialty is harvesting big data from machines. These include Web servers, mobile devices – even something as prosaic as an air conditioning unit.

As soon as a machine creates a piece of data, Splunk seizes it and stores it in a cloud-based database.
The aim is to trace a machine’s repetitive patterns, find anomalies and diagnose problems.
Once it’s found an issue, its programs can create immediate alerts (as well as less urgent graphs and reports) for the client.
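
The pattern-then-anomaly idea in the steps above can be illustrated with a minimal sketch. This is not Splunk's implementation; it simply flags any reading that sits far from the average of the readings just before it. The function name, window size, threshold and sample temperatures are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that deviate sharply from the recent baseline.

    Hypothetical helper: each reading is compared with the mean and standard
    deviation of the `window` readings that precede it.
    """
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all counts as unusual.
            is_anomaly = readings[i] != mu
        else:
            is_anomaly = abs(readings[i] - mu) / sigma > threshold
        if is_anomaly:
            anomalies.append(i)
    return anomalies


# Made-up temperature readings from an air conditioning unit (degrees C).
temps = [21.0, 21.2, 20.9, 21.1, 21.0, 21.1, 20.8, 35.5, 21.0, 21.2]
print(find_anomalies(temps))  # prints [7], the index of the 35.5 spike
```

A production system would also have to cope with seasonality, missing data and alert fatigue; the sketch only shows the core comparison of "what just happened" against "what usually happens".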

Saving the World

But maybe the most positive trend appearing in Web-based data science has nothing to do with profit or gain. As Viktor Mayer-Schönberger and Kenneth Cukier relate in their book, Big Data: A Revolution That Will Transform How We Live, Work, and Think, big data might just help save the world.

In 2008, data scientists at Google used search and CDC data to look at the spread of the flu. Putting together the 50 million most common search terms in their U.S. database, they compared them against a set of CDC data on the spread of seasonal flu from 2003 to 2008.

They went at this task with no preconceptions. They simply designed a system that would look for correlations between the recorded spread of flu and the frequency of specific search queries. To test their flu predictions, they processed 450 million mathematical models.

And they found the link. A combination of 45 search terms, used together in a mathematical model, could tell them – in real-time – how flu epidemics were spreading.
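
To make the idea of "looking for correlations" concrete, here is a minimal sketch that scores each search term by how closely its weekly frequency tracks reported flu cases, using the Pearson correlation coefficient. It is not Google's actual system, which tested combinations of terms across hundreds of millions of models; the function names and all of the figures below are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / sqrt(var_x * var_y)


def rank_queries(query_counts, flu_cases, top_n=2):
    """Return the top_n search terms whose weekly counts best track flu cases."""
    scored = [(pearson(counts, flu_cases), term)
              for term, counts in query_counts.items()]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_n]]


# Made-up weekly figures, for illustration only.
flu_cases = [10, 20, 40, 80, 60, 30]
query_counts = {
    "fever remedy": [12, 25, 38, 85, 55, 33],
    "cough medicine": [15, 22, 45, 70, 65, 28],
    "football scores": [50, 48, 52, 49, 51, 50],
}
print(rank_queries(query_counts, flu_cases))
```

With these numbers the two flu-related terms rise to the top and the unrelated term drops out. Correlation alone says nothing about cause, which is why a system like this needs to be checked against fresh data.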

This real-time monitoring was far superior to any government report to date. When the H1N1 virus hit in 2009, public health officials were right on top of its spread.

Data Risks and Regulations

You’re Only as Good as Your Data

For all its potential, big data isn’t the answer to everything.

What’s more, the volume, velocity and variety of big data are only going to get bigger. Mobile use is exploding. Developing countries are coming online. The Internet of Things is spawning a whole new information universe.

Internet companies, which are already dealing with astronomical numbers, will have to be ready to handle the load. Data scientists will have to be primed to know where to look.

Private! Keep Out!

In a global economy – and with increasing demands from the government to access citizens’ information – the Internet industry is facing a set of complicated questions:

Who owns the rights to personal data? Are there exceptions to the rule?
Now that the cloud is here, what safeguards are needed to protect private information?
As data collaboration increases, how much can Internet companies share with commercial partners, business intelligence vendors and non-profits?
What does privacy mean in the 21st century?

Unfortunately, as with many things regarding data science, there are no easy answers.

History of Data Science and the Internet

“I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers.” – Tim Berners-Lee, inventor of the World Wide Web

In early October 1957, headlines read:

“RED ‘MOON’ OVER LONDON!”
“RUSSIA WINS THE RACE INTO OUTER SPACE!”
“SPACE AGE IS HERE!”

Sputnik, the first artificial earth satellite, was in the sky.

A few months later the U.S. Department of Defense hastily issued directive 5105.15 – the establishment of the Advanced Research Projects Agency (ARPA). Packed with the country’s best and brightest, this agency spent the sixties tackling a communications challenge. The solution to the challenge was known as ARPANET and would transform the world forever.

Do You See the L?

In 1962, J.C.R. Licklider (aka “computing’s Johnny Appleseed”) became Director of the Information Processing Techniques Office (IPTO) within ARPA (later renamed the Defense Advanced Research Projects Agency, or DARPA). His job? Find a way to unite the department’s main computers at Cheyenne Mountain with the Pentagon and Strategic Air Command (SAC) headquarters in a wide-area network.

His idea was clear and simple:

“A network of such [computers], connected to one another by wide-band communication lines”, from Man-Computer Symbiosis (1960).

This dream came one step closer to reality when Paul Baran, Donald Davies and others came up with the concept of packet-switching. By bundling data into arbitrary packets and routing these “digital envelopes” onward, computer engineers could save precious bandwidth on connection lines.
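
The "digital envelope" idea can be sketched in a few lines. The example below splits a message into numbered packets, shuffles them to mimic packets arriving out of order, and rebuilds the original from the sequence numbers. It is a toy illustration, not a description of any real protocol: actual packet-switched networks also carry addresses and checksums and handle lost packets, and the function names and packet size here are assumptions.

```python
import random


def to_packets(message, size=8):
    """Split a message into (sequence number, payload) packets of at most `size` characters."""
    return [(seq, message[start:start + size])
            for seq, start in enumerate(range(0, len(message), size))]


def reassemble(packets):
    """Rebuild the message by sorting packets on their sequence number."""
    return "".join(payload for _, payload in sorted(packets))


message = "A network of computers connected by wide-band lines"
packets = to_packets(message)
random.shuffle(packets)  # packets may take different routes and arrive out of order
assert reassemble(packets) == message
```

Because each packet carries its own sequence number, no single line has to stay reserved for the whole conversation; the receiver can put the pieces back in order whenever they turn up.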

By the late 1960s, Licklider had moved on to other projects and Robert Taylor had become head of IPTO. Working with smart people like MIT’s Larry Roberts and Leonard Kleinrock, he threw even more resources into the project.

On October 29, 1969, Kleinrock was at UCLA, on the phone with colleagues at the Stanford Research Institute (SRI). Their computers had been linked; their systems were running:

“We typed the L, and we asked on the phone, ‘Do you see the L?’

‘Yes, we see the L,’ came the response.

We typed the O, and we asked, ‘Do you see the O?’

‘Yes, we see the O.’

Then we typed the G, and the system crashed.”

And LO, the Internet was born.

Following Protocol

After the IPTO had recovered from its first crash, events began to snowball:

1971: Abhay Bhushan publishes the original specification for File Transfer Protocol (FTP).
1971: Michigan’s Merit Network makes its debut.
1972: Larry Roberts conducts a public demonstration of ARPANET at the International Conference on Computer Communications.
1972: The first complete TELNET protocol comes out.
1974: Vinton Cerf, Yogen Dalal and Carl Sunshine publish the initial specifications of TCP (Transmission Control Protocol).
1976: Robert Metcalfe and his colleagues launch Ethernet, a family of technologies for local area networks (LANs).

By the 1980s, email, newsgroups and LANs were in common use at universities and research groups. In 1985, the National Science Foundation Network (NSFnet) linked five national supercomputer centers, eventually replacing ARPANET as the de facto educational network.

The Roaring 90s

Throughout the 1980s and early 1990s, a number of groups were busy developing ways to organize the data flowing across these networks and accommodate projected growth. In 1989, Tim Berners-Lee proposed a simple but far-reaching idea: Why not use the Internet as a platform to create a global hypertext system available to all? A few years later, the World Wide Web appeared as a publicly available service.

From the beginning, it was important that the Web be readily accessible by ordinary users. Browsers like Mosaic (1993) and Netscape (1994) helped them make sense of it. Yahoo! (1994) and AltaVista (1995) made it easier to search. E-commerce companies such as Book Stacks Unlimited (1992) and Amazon (1995) created new businesses.

By the end of the century, the Internet was well on its way to becoming the primary avenue of information flowing through two-way telecom networks.

Google Googol

Then along came Google. The word is a deliberate misspelling of the word “googol” – the name for a very large number: one followed by 100 zeros. As this name suggests, the company was thinking big from the beginning.

Throughout the late 1990s and early 2000s, Google:

Improved on existing search algorithms by tallying hyperlinked references to each Web page (to measure popularity) in addition to parsing the text on the page (PageRank, named for the co-founder, Larry Page)
Began selling advertisements associated with search keywords
Developed MapReduce, a programming model for processing large data sets (later reimplemented by others in the open source software Apache Hadoop)
Launched Gmail, Google Translate, Google Maps, Google News, Google Books, and many more experiments
Snapped up YouTube
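
The link-tallying idea behind PageRank, mentioned in the list above, can be sketched as follows. This is a simplified textbook version, not Google's production algorithm: every page starts with an equal score and repeatedly passes a share of that score along its outgoing links, so pages with many (or important) inbound links end up ranked higher. The damping value of 0.85 is a commonly cited default, and the four-page "web" is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from a link graph by repeated redistribution.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "home".
links = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # prints home
```

Note that "blog" and "contact" score lowest even though they link out generously: in this scheme a page earns rank from the links pointing to it, not from the links it hands out.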

And Google’s approach worked. As Viktor Mayer-Schönberger and Kenneth Cukier discuss in their 2013 book, Big Data: A Revolution That Will Transform How We Live, Work, and Think:

“The reason Google’s translation system works well is not that it has a smarter algorithm. It works well because its creators, like Banko and Brill at Microsoft, fed in more data – and not just of high quality. Google was able to use a dataset tens of thousands of times larger than IBM’s Candide because it accepted messiness.”

Messy data? Unstructured data? Didn’t matter. In their 2009 paper, The Unreasonable Effectiveness of Data, Google’s AI expert Peter Norvig and colleagues note:

“Simple models and a lot of data trump more elaborate models based on less data.”

Last updated: June 2020