Using Hadoop for Data Science

What Is Hadoop?

Hadoop is an open-source software framework for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines.

Hadoop grew out of an open-source search engine called Nutch, developed by Doug Cutting and Mike Cafarella. Back in the early days of the Internet, the pair wanted to invent a way to return web search results faster by distributing data and calculations across different computers so multiple tasks could be executed at the same time.

The distributed computing and processing portion of Nutch was eventually split off and named Hadoop (after Cutting’s son’s toy elephant). Yahoo released Hadoop as an open-source project in 2008, and today the Hadoop ecosystem is managed and maintained by the non-profit Apache Software Foundation (ASF), an international community of software developers and contributors.

Four core modules are included in the ASF’s basic framework:

  • Hadoop Common consists of the common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS) is a distributed file system that provides high-throughput access to application data.
  • Hadoop YARN is a framework for job scheduling and cluster resource management.
  • Hadoop MapReduce is a YARN-based system for parallel processing of large data sets (a short example follows this list).
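
To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets you write the map and reduce steps as ordinary scripts that read standard input and write standard output. The file names and HDFS paths are illustrative, not part of any standard.

    #!/usr/bin/env python3
    # mapper.py -- emit a tab-separated (word, 1) pair for every word seen.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sum the counts for each word. Hadoop sorts mapper output
    # by key, so all lines for a given word arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A job like this is staged in HDFS and submitted with the stock command-line tools, along these lines (the streaming jar's exact path varies by distribution):

    hdfs dfs -put input.txt /user/me/input
    hadoop jar /path/to/hadoop-streaming.jar \
        -files mapper.py,reducer.py \
        -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
        -input /user/me/input -output /user/me/output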

Other elements of the Hadoop ecosystem of technologies include the following:

  • Pig is a high-level data-flow language and execution framework for parallel computation. It allows users to perform data extractions and transformations and basic analysis without having to write MapReduce programs.
  • Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying. It was initially developed by Facebook.
  • HBase is a scalable, distributed database that supports structured data storage for large tables.
  • Ambari is a web interface for provisioning, managing, and monitoring Hadoop services and components.
  • Cassandra is a scalable multi-master database system.
  • Oozie is a Hadoop job scheduler.
  • Sqoop is a connection and transfer mechanism that moves data between Hadoop and relational databases.
  • Spark is an open-source cluster computing framework with in-memory analytics (see the sketch after this list).
  • ZooKeeper is a high-performance coordination service for distributed applications.
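
Of these, Spark is the one most likely to appear in a data science workflow. As a rough sketch (assuming a cluster where PySpark is installed, with illustrative HDFS paths), the same word count expressed in Spark runs in memory rather than as a chain of MapReduce passes:

    # wordcount_spark.py -- a minimal PySpark sketch of the word count above
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    counts = (spark.sparkContext.textFile("hdfs:///user/me/input")
              .flatMap(lambda line: line.split())   # one record per word
              .map(lambda word: (word, 1))          # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))     # sum the counts per word

    counts.saveAsTextFile("hdfs:///user/me/output")
    spark.stop()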

Hadoop can be downloaded for free; however, commercial distributions such as Cloudera, Hortonworks, and MapR are also available. For a fee, you get the software vendor’s version of the framework along with additional software components, tools, training, and documentation.

Why Use Hadoop?

Hadoop has a lot to offer. SAS Institute identifies the following five benefits:

  • Computing power: Hadoop’s distributed computing model allows it to process huge amounts of data. The more nodes you use, the more processing power you have.
  • Flexibility: Hadoop stores data without requiring any preprocessing. Store data—even unstructured data such as text, images, and video—now; decide what to do with it later.
  • Fault tolerance: Hadoop automatically stores multiple copies of all data, and if one node fails during data processing, jobs are redirected to other nodes and distributed computing continues (see the configuration sketch after this list).
  • Low cost: The open-source framework is free, and data is stored on commodity hardware.
  • Scalability: You can easily grow your Hadoop system, simply by adding more nodes.
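
The fault tolerance above comes from HDFS block replication, and the number of copies is a cluster setting. A minimal hdfs-site.xml sketch showing the relevant property (three copies is the usual default):

    <configuration>
      <property>
        <!-- how many copies HDFS keeps of each data block -->
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>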

Although the development of Hadoop was motivated by the need to search millions of webpages and return relevant results, today it serves a variety of purposes. Hadoop’s low-cost storage makes it an appealing option for storing information that is not currently critical but that might be analyzed later. Because Hadoop storage is unencumbered by the schema-related constraints commonly found in SQL-based systems, organizations also use it to stage large amounts of raw, sometimes unstructured data for loading into enterprise data warehouses. And many of Hadoop’s largest adopters use it for the real-time data analysis that powers web-based recommendation systems.

Who Uses Hadoop?

The Apache Software Foundation maintains a list of companies using Hadoop, and usage goes beyond powering search engines or analyzing customer behavior to better target ads. Here’s how some big names are using Hadoop:

  • eBay uses Hadoop for search optimization.
  • The University of Maryland uses it as part of the Google/IBM academic cloud computing initiative.
  • At Facebook, Hadoop is used to store copies of internal log and dimension data sources and as a source for not only reporting and analytics but also machine learning.
  • Hadoop powers LinkedIn’s People You May Know feature.
  • Hadoop enables Opower to suggest ways for consumers to save money on energy bills.
  • To determine user preferences, Orbitz uses Hadoop to analyze every aspect of visitors’ sessions on its sites.
  • Spotify uses Hadoop for content generation and for data aggregation, reporting, and analysis.
  • Twitter uses Hadoop to store and process tweets and log files.
  • Yahoo! has more than 40,000 computers running Hadoop to support research for Ad Systems and Web Search.

Learn Hadoop

You’ve got lots of choices for online Hadoop training, from free tutorials to instructor-led courses.

Last updated: June 2020