Data Science in the Internet Industry

November 1, 2013

Career Opportunities in Internet Data Science

The Promise of Big Data
Thanks to cloud computing and continuing advances in technology, there are new opportunities to analyze big data.

What’s more:

Natural-language processing has opened up unstructured data (e.g., emails, documents, social media posts) to analysis.
Cloud-based data storage and high-speed data processing have become increasingly cheap.
Sophisticated analytics software and tools are widely available.

Telecom and Internet companies are in a particularly strong position to take advantage of big data. As of 2020, Google processes approximately 3.5 billion searches per day, and 2.5 quintillion bytes of data are created each day.

Data Risks and Regulations

You’re Only as Good as Your Data

With all its potential, big data isn’t the answer to everything.

What’s more, the volume, velocity and variety of big data are only going to get bigger. Mobile use is exploding. Developing countries are coming online. The Internet of Things is spawning a whole new information universe.

Internet companies, which are already dealing with astronomical numbers, will have to be ready to handle the load. Data scientists will have to be primed to know where to look.

Private! Keep Out!

In a global economy – and with increasing demands from the government to access citizens’ information – the Internet industry is facing a set of complicated questions:

Who owns the rights to personal data? Are there exceptions to the rule?
Now that the cloud is here, what safeguards are needed to protect private information?
As data collaboration increases, how much can Internet companies share with commercial partners, business intelligence vendors and non-profits?
What does privacy mean in the 21st century?

Unfortunately, as with many things regarding data science, there are no easy answers.

History of Data Science and the Internet

“I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers.” – Tim Berners-Lee, inventor of the World Wide Web

In early October 1957, headlines read:

“RED ‘MOON’ OVER LONDON!”
“RUSSIA WINS THE RACE INTO OUTER SPACE!”
“SPACE AGE IS HERE!”

Sputnik, the first artificial earth satellite, was in the sky.

A few months later the U.S. Department of Defense hastily issued directive 5105.15 – the establishment of the Advanced Research Projects Agency (ARPA). Packed with the country’s best and brightest, this agency spent the sixties tackling a communications challenge. The solution to the challenge was known as ARPANET and would transform the world forever.

Do You See the L?

In 1962, J.C.R. Licklider (aka “computing’s Johnny Appleseed”) became Director of the Information Processing Techniques Office (IPTO) within ARPA (later renamed the Defense Advanced Research Projects Agency, or DARPA). His job? Find a way to unite the department’s main computers at Cheyenne Mountain with the Pentagon and Strategic Air Command (SAC) headquarters in a wide-area network.

His idea was clear and simple:

“A network of such [computers], connected to one another by wide-band communication lines,” from Man-Computer Symbiosis (1960).

This dream came one step closer to reality when Paul Baran, Donald Davies and others came up with the concept of packet-switching. By bundling data into arbitrary packets and routing these “digital envelopes” onward, computer engineers could save precious bandwidth on connection lines.
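
The packet idea can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the `Packet` fields, the 8-byte packet size, and the shuffle that stands in for packets taking different routes are all hypothetical choices made for illustration, not details of how ARPANET actually worked.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """A hypothetical 'digital envelope' (field names are assumptions)."""
    seq: int        # position of this packet in the original message
    total: int      # number of packets that make up the whole message
    payload: bytes  # the slice of data carried by this packet


def packetize(message: bytes, size: int = 8) -> list[Packet]:
    """Split a message into fixed-size packets, each labeled with its order."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return [Packet(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def reassemble(packets: list[Packet]) -> bytes:
    """Put packets back in order and join their payloads."""
    ordered = sorted(packets, key=lambda p: p.seq)
    if not ordered or len(ordered) != ordered[0].total:
        raise ValueError("some packets are missing")
    return b"".join(p.payload for p in ordered)


message = b"LOGIN request from UCLA to SRI"
packets = packetize(message)
random.shuffle(packets)          # simulate packets arriving out of order
restored = reassemble(packets)   # the receiver rebuilds the original message
```

Because every packet carries its own sequence number, the receiver can rebuild the message no matter what order the packets arrive in, which is what lets each packet be routed onward independently.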

By the late 1960s, Licklider had moved on to other projects and Robert Taylor had become head of IPTO. Working with smart people like MIT’s Larry Roberts and Leonard Kleinrock, he threw even more resources into the project.

On October 29, 1969, Kleinrock was at UCLA, on the phone with colleagues at the Stanford Research Institute (SRI). Their computers had been linked; their systems were running:

“We typed the L, and we asked on the phone, ‘Do you see the L?’

‘Yes, we see the L,’ came the response.

We typed the O, and we asked, ‘Do you see the O?’

‘Yes, we see the O.’

Then we typed the G, and the system crashed.”

And LO, the Internet was born.

Following Protocol

After the IPTO had recovered from its first crash, events began to snowball:

1971: Abhay Bhushan publishes the original specification for File Transfer Protocol (FTP).
1971: Michigan’s Merit Network makes its debut.
1972: Larry Roberts conducts a public demonstration of ARPANET at the International Conference on Computer Communications.
1972: The first complete TELNET protocol comes out.
1974: Vinton Cerf, Yogen Dalal and Carl Sunshine publish the initial specifications of TCP (Transmission Control Protocol).
1976: Robert Metcalfe and his colleagues launch Ethernet, a family of technologies for local area networks (LANs).

By the 1980s, email, newsgroups and LANs were in common use at universities and research groups. In 1985, the National Science Foundation Network (NSFnet) linked five national supercomputer centers, eventually replacing ARPANET as the de facto educational network.

The Roaring 90s

Throughout the 1980s and early 1990s, a number of groups were busy developing ways to organize this data and accommodate projected growth. In 1989, Tim Berners-Lee proposed a simple but far-reaching idea: Why not use the Internet as a platform to create a global hypertext system available to all? A few years later, the World Wide Web appeared as a publicly available service.

From the beginning, it was important that the Web be readily accessible by ordinary users. Browsers like Mosaic (1993) and Netscape (1994) helped them make sense of it. Yahoo! (1994) and AltaVista (1995) made it easier to search. E-commerce companies such as Book Stacks Unlimited (1992) and Amazon (1995) created new businesses.

By the end of the century, the Internet was well on its way to becoming the primary avenue of information flowing through two-way telecom networks.

Google Googol

Then along came Google. The word is a deliberate misspelling of the word “googol” – the name for a very large number: one followed by 100 zeros. As this name suggests, the company was thinking big from the beginning.

Throughout the late 1990s and early 2000s, Google:

Improved on existing search algorithms by tallying hyperlinked references to each Web page (to measure popularity) in addition to parsing the text on the page (PageRank, named for the co-founder, Larry Page)
Began selling advertisements associated with search keywords
Developed MapReduce, a programming model for processing large data sets (later implemented in the open source software Apache Hadoop)
Launched Gmail, Google Translate, Google Maps, Google News, Google Books, and many more experiments
Snapped up YouTube
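
The link-tallying idea behind PageRank can be sketched in a few lines of Python. This is a simplified illustration, not Google's production algorithm: it follows the commonly published formulation in which each page repeatedly passes a share of its score to the pages it links to. The damping factor of 0.85, the fixed iteration count, and the `toy_web` link graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of inbound links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nothing links to "orphan".
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(toy_web)
```

In this toy graph, "home" ends up with the highest score because every other page links to it, while "orphan" ends up with the lowest because no page links to it. A link counts for more when it comes from a page that is itself well linked.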

And Google’s approach worked. As Viktor Mayer-Schönberger and Kenneth Cukier discuss in their 2013 book, Big Data: A Revolution That Will Transform How We Live, Work, and Think:

“The reason Google’s translation system works well is not that it has a smarter algorithm. It works well because its creators, like Banko and Brill at Microsoft, fed in more data – and not just of high quality. Google was able to use a dataset tens of thousands of times larger than IBM’s Candide because it accepted messiness.”

Messy data? Unstructured data? Didn’t matter. In their 2009 paper, The Unreasonable Effectiveness of Data, Google’s AI expert Peter Norvig and colleagues note:

“Simple models and a lot of data trump more elaborate models based on less data.”

Last updated: June 2020
