Open source tools are the lifeblood of big data. You can’t have one without the other (and, really, why would you want to?).
If you’re after the best of the best, your first stop should be the Bossie Awards. Every year, the folks at InfoWorld choose their favorite “Best of Open Source Software” business products, culled from a flood of user nominations.
I’ve collected a few familiar names below. For a general list of open source platforms, check out this handy infographic from BigData Startups.
Data Analysis and Platforms
The granddaddy of them all, Apache Hadoop is a framework that allows for the distributed processing of huge data sets across clusters of computers using simple programming models.
To do this, Hadoop uses the parallel-processing system MapReduce. Large computational tasks are divvied up into small jobs, which are then shared out among as many nodes in the cluster as needed and executed (or re-executed on another node if one fails).
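To make that divide-and-conquer idea concrete, here is a minimal sketch of the canonical word count job against Hadoop's Java MapReduce API; the input and output paths are just placeholders for directories in HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node turns its slice of the input into (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: counts for the same word are summed, wherever they were produced.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory of text files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```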
“There aren’t many Apache projects that support even one heavily capitalized startup. Hadoop supports several. Analysts estimate that Hadoop will be a ballooning market worth tens of billions per year. If you slipped into a coma during the financial crisis and just woke up, this is the biggest thing you missed.”
Hadoop is part of a large family including Cassandra, Hive, HBase, and Pig – and has plenty of fans. In 2013, Apache gave MapReduce a complete overhaul and created the YARN framework (aka MapReduce 2.0). The MapReduce algorithm remains, but can now be swapped out for other processing algorithms, including ones that run interactively.
Slightly irrelevant trivia: In 2005, Doug Cutting and Mike Cafarella were putting the finishing touches on their project. Searching for a memorable name, Cutting hit upon his son’s yellow toy elephant – Hadoop.
Storm’s claim to fame is its ability to process unbounded streams of data in real-time, “doing for real-time processing what Hadoop did for batch processing.” It will integrate with any queuing system and any database system.
Storm’s real-time processing capabilities come in handy when analysts are dealing with highly dynamic sources. Twitter, for example, uses it to help generate its trending topics. Other applications include online machine learning, continuous computation, distributed RPC, ETL – and the list goes on.
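For a feel of how a Storm topology is wired together, here is a small sketch against the pre-Apache Java API (the backtype.storm packages of the 0.8/0.9 era): a spout that stands in for an unbounded source and a bolt that keeps a running count. It is an illustration of the spout/bolt model, not a production topology.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountTopology {

  // Spout: stands in for an unbounded source such as a message queue or Twitter firehose.
  public static class RandomWordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] words = {"storm", "hadoop", "drill", "hive"};
    private final Random rand = new Random();

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(words[rand.nextInt(words.length)]));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  // Bolt: updates a running count as tuples stream through, one at a time.
  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void execute(Tuple input, BasicOutputCollector collector) {
      String word = input.getStringByField("word");
      Integer count = counts.get(word);
      counts.put(word, count == null ? 1 : count + 1);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      // terminal bolt; nothing emitted downstream
    }
  }

  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new RandomWordSpout());
    builder.setBolt("counter", new CountBolt(), 2).shuffleGrouping("words");

    // Run in-process for testing; a real deployment would use StormSubmitter instead.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", new Config(), builder.createTopology());
  }
}
```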
Hadoop is outstanding at handling massive batches of data, but exploring ideas with it can make the clock drag.
Taking its inspiration from Google’s Dremel system, Apache Drill is designed to permit super-fast ad-hoc querying and interactive analysis of giant data sets.
Drill can tap into multiple sources of data – HBase, Cassandra, MongoDB, etc. – in addition to traditional databases. When you need a tool to scan petabytes of data and trillions of records in seconds, Drill is an excellent choice.
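Because Drill speaks ANSI SQL over whatever it is querying, the client side can be as plain as JDBC. In this sketch the driver class, the zk=local connection string and the employee.json classpath sample are taken from Drill's documented JDBC interface, but treat them as assumptions to verify against your own install; the point is that one SQL statement can reach a file, an HBase table or a MongoDB collection.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQuery {
  public static void main(String[] args) throws Exception {
    // "zk=local" points at a Drillbit running in embedded mode.
    Class.forName("org.apache.drill.jdbc.Driver");
    Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");

    // cp.`employee.json` is the sample file Drill ships on its classpath;
    // the same SQL could target dfs, HBase, MongoDB or Hive storage plugins.
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT full_name, position_title FROM cp.`employee.json` LIMIT 5");
    while (rs.next()) {
      System.out.println(rs.getString("full_name") + " - " + rs.getString("position_title"));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}
```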
Where once there was S, there now is R. This freely available statistical programming language and environment is rapidly becoming the popular standard for statistics software and data analysis. It combines the fundamentals of the S programming language (created by John Chambers at Bell Labs) with lexical scoping semantics derived from Scheme.
Developers like it because it’s cheap, powerful and plays well with Hadoop.
“R is making serious headway in ousting SAS and SPSS from their thrones, and has become the tool of choice for the world’s best statisticians (and data scientists, and analysts too).”
“Because it has an unusually strong community around it, you can find R libraries for almost anything under the sun — making virtually any kind of data science capability accessible without new code.”
Slightly irrelevant trivia: First created in 1997, R is named for Robert Gentleman and Ross Ihaka – the “R & R” of the Statistics Department of the University of Auckland.
This Apache project aims to teach machines some of what humans do: to recognize patterns in data. The stated goal of Mahout, a machine learning and data mining library, is to build freely distributed and scalable machine learning algorithms. The current algorithms focus on grouping related documents together (clustering), learning to match documents to categories (classification), matching users to probable favorites (collaborative filtering), and identifying items that typically fall into groups together (frequent pattern mining).
Though many of Mahout’s offerings are implemented on top of Hadoop using the map/reduce paradigm, it’s still pretty democratic. Contributors are welcome to introduce algorithms that run on a single node or a non-Hadoop cluster.
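For a taste of the collaborative filtering side, here is a minimal single-node sketch using Mahout's Taste recommender API; ratings.csv is a hypothetical userID,itemID,rating file, and user 42 and the neighborhood size are likewise just placeholders.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    // ratings.csv: one "userID,itemID,rating" line per preference (hypothetical file).
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Users are "similar" when their ratings correlate; keep the 10 nearest neighbors.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 items user 42 hasn't rated yet but will probably like.
    List<RecommendedItem> recs = recommender.recommend(42L, 3);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " -> " + rec.getValue());
    }
  }
}
```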
Raise a glass to the folks in the other down under – New Zealand. Developed at the University of Waikato, this machine learning library contains a host of useful tools and algorithms for data analysis and predictive modeling.
Weka’s Java-based tools support a range of data mining tasks, including data preprocessing, clustering, classification, regression, and visualization, among others.
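Here is a minimal sketch of that Java API in action, training a J48 decision tree and cross-validating it; iris.arff stands in for whatever ARFF data set you have on hand.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaDemo {
  public static void main(String[] args) throws Exception {
    // Load an ARFF file (iris.arff is a stand-in; Weka ships sample data sets).
    Instances data = DataSource.read("iris.arff");
    data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class label

    // Train a C4.5-style decision tree.
    J48 tree = new J48();
    tree.buildClassifier(data);

    // 10-fold cross-validation to estimate accuracy.
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(tree, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}
```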
Slightly irrelevant trivia: Officially, Weka stands for “Waikato Environment for Knowledge Analysis.” Unofficially, it refers to the native Weka, a flightless bird with a feisty temperament.
Once known as YALE (Yet Another Learning Environment), RapidMiner has been around since the beginning of the millennium. Its data mining system is available as a stand-alone application for data analysis or as an engine for integration into other products.
RapidMiner pulls its learning schemes and attribute evaluators from Weka, and for statistical modeling, offers either the native Rapid-I scripting environment or the R language. Machine learning, data mining, text mining, predictive analytics – it’s equipped to handle them all. Written in Java, it runs on every major platform and operating system.
Databases / Data Warehousing
Apache’s popular key-value oriented database management system is built to juggle large amounts of data across multiple commodity servers. It prides itself on availability, scalability and fault-tolerance, and avoids bottlenecks and single points of failure.
Because data is automatically replicated to multiple nodes, failed nodes can be replaced with no downtime. That’s good news for data-critical applications.
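Here is a rough sketch of that replication story from the client side, using the DataStax Java driver (method names vary slightly between driver versions); the contact point, keyspace and table names are placeholders. Setting replication_factor to 3 is what lets a node disappear without the data going with it.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraDemo {
  public static void main(String[] args) {
    // Connect to any one node; the driver discovers the rest of the ring from it.
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect();

    // Every row is automatically copied to 3 nodes, so a single failure costs no data.
    session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
        + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
    session.execute("CREATE TABLE IF NOT EXISTS demo.users "
        + "(id int PRIMARY KEY, name text)");
    session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Ada')");

    ResultSet rs = session.execute("SELECT id, name FROM demo.users");
    for (Row row : rs) {
      System.out.println(row.getInt("id") + " " + row.getString("name"));
    }
    cluster.close();
  }
}
```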
Cassandra started its life at Facebook, when Avinash Lakshman and Prashant Malik created it to power the Inbox Search feature. Today it’s working for Netflix, eBay, Twitter, Reddit, Ooyala and more. The largest known Cassandra cluster boasts over 300 terabytes of data in over 400 computers.
Apache’s data warehouse infrastructure is designed to facilitate querying and management of large data sets in distributed storage. Built on top of Hadoop, it allows you to overlay structure on a variety of data formats and provides tools for data querying and analysis.
Hive’s versions as far back as 0.8.x are compatible with Hadoop 2.0. As of late 2013, Hortonworks is looking to boost Hive’s power with the Stinger Initiative. Among other improvements, Hive will use YARN to query Hadoop directly and add a new runtime framework, Tez, for greater efficiency.
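The "overlay structure" idea looks like this in practice: a HiveQL external table points at files already sitting in HDFS, and you then query it like any SQL table. This sketch goes through the HiveServer2 JDBC driver; the host, port, credentials and paths are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDemo {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = conn.createStatement();

    // Overlay a schema on raw, tab-delimited log files already in HDFS;
    // no data is moved or converted.
    stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS web_logs "
        + "(ip STRING, ts STRING, url STRING) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        + "LOCATION '/data/web_logs'");

    // Hive compiles this into MapReduce (or, post-Stinger, Tez) jobs behind the scenes.
    ResultSet rs = stmt.executeQuery(
        "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url");
    while (rs.next()) {
      System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
    }
    conn.close();
  }
}
```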
OrientDB, a NoSQL DBMS written in Java, provides the schema-less flexibility of document databases, the graph model's direct relationships among document records, and object orientation for added power and flexibility.
In addition to schema-less mode, OrientDB also functions in schema-full or hybrid mode. It ensures reliability with ACID transactions and multi-master replication, and it’s fast – storing 150,000 records per second on ordinary hardware.
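Here is a minimal sketch of the schema-less document side using the 1.x-era ODocument classes from OrientDB's Java API; the database URL and the "Person" class name are placeholders, and in schema-less mode no structure needs to be declared up front.

```java
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class OrientDemo {
  public static void main(String[] args) {
    // "memory:" keeps everything in RAM; a plocal: URL would persist to disk.
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("memory:demo").create();
    try {
      // Schema-less mode: fields are added as you go, not declared in advance.
      ODocument person = new ODocument("Person");
      person.field("name", "Ada");
      person.field("age", 36);
      person.save();

      System.out.println(person.toJSON());
    } finally {
      db.close();
    }
  }
}
```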
For random, realtime read/write access to big data, many data miners turn to Apache’s HBase, the Hadoop database. Written in Java, this scalable, distributed database is modeled after Google’s BigTable.
Running on top of Hadoop and HDFS (Hadoop Distributed Filesystem), HBase is particularly good at fault-tolerant storing of sparse data (i.e., small bits of information caught within a larger collection of empty or unimportant data that must be waded through efficiently).
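For a feel of that random read/write access, here is a rough sketch using the 2013-era HBase Java client (the HTable API); the table and column family names are placeholders and are assumed to have been created beforehand.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath

    // Assumes a table 'metrics' with column family 'd' already exists,
    // e.g. created in the shell with: create 'metrics', 'd'
    HTable table = new HTable(conf, "metrics");

    // Sparse by design: only the cells you actually write are stored.
    Put put = new Put(Bytes.toBytes("sensor42#2013-10-07"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
    table.put(put);

    // Random read of a single row by key.
    Result result = table.get(new Get(Bytes.toBytes("sensor42#2013-10-07")));
    byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}
```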
Written in Java, Gephi is a tool to help data scientists visualize and explore all kinds of networks (as large as 1 million nodes) and complex systems, resulting in some pretty eye-catching dynamic and hierarchical graphs.
Thanks to the hard work of its community, there are also plenty of plug-ins, wikis and tutorials to ease the way.
You can use Gephi on everything from patterns of biological data to analyses of suspected insider trading. It turns up as LinkedIn Maps, visualizations of the connectivity of New York Times content, and Twitter network traffic during social unrest.
Giraph, Apache's iterative graph processing system, is the open-source counterpart to Google's Pregel. Giraph extends Pregel's basic model by adding master computation, sharded aggregators, edge-oriented input, out-of-core computation and a variety of other useful features.
Facebook is a particular fan of its speed and scalability. As Avery Ching explains in Scaling Apache Giraph to a Trillion Edges:
“We ended up choosing Giraph for several compelling reasons. Giraph directly interfaces with our internal version of HDFS (since Giraph is written in Java) and talks directly to Hive. Since Giraph runs as a MapReduce job, we can leverage our existing MapReduce (Corona) infrastructure stack with little operational overhead. With respect to performance, at the time of testing Giraph was faster than the other frameworks – much faster than Hive.”
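As a sketch of what "thinking like a vertex" means in Giraph, here is a single-source shortest paths compute class in the style of Giraph's own example. Exact package and base-class names shift between Giraph releases, so treat this as an outline to check against your version rather than drop-in code; the source vertex id is a placeholder.

```java
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Each vertex holds its current best-known distance from a chosen source vertex.
public class ShortestPaths extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1L;   // hypothetical source vertex

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }

    // Best distance offered by any neighbor in this superstep.
    double minDist = vertex.getId().get() == SOURCE_ID ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }

    // If we improved, tell the neighbors so they can try to improve too.
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    vertex.voteToHalt();   // go dormant until a new message arrives
  }
}
```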
MongoDB is a cross-platform, document-oriented database system and currently the most popular NoSQL database.
It ditches the rigid schemas of relational databases in favor of BSON, a binary form of JSON documents with dynamic schemas, giving data miners a lot of power.
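A rough sketch with the MongoDB Java driver of the period (the BasicDBObject API): no schema is declared anywhere, and the two inserted documents don't even share the same fields. The host, database and collection names are placeholders.

```java
import java.util.Arrays;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class MongoDemo {
  public static void main(String[] args) throws Exception {
    MongoClient client = new MongoClient("localhost", 27017);
    DB db = client.getDB("demo");
    DBCollection people = db.getCollection("people");

    // Dynamic schema: these two documents have different shapes, and that's fine.
    people.insert(new BasicDBObject("name", "Ada")
        .append("languages", Arrays.asList("python", "r")));
    people.insert(new BasicDBObject("name", "Grace")
        .append("employer", "Navy")
        .append("awards", 7));

    // Query by example document.
    DBObject ada = people.findOne(new BasicDBObject("name", "Ada"));
    System.out.println(ada);

    client.close();
  }
}
```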
You’ll find MongoDB at work in Shutterfly’s photo platform, eBay’s search suggestion, Forbes’s storage system and MetLife’s “The Wall.”
Even better, it’s recently been updated. 2013 features include:
- Text search (beta) and geospatial capabilities
- Concurrent index builds
- And many more
Apache CouchDB, a NoSQL database, uses a trio of components making it extremely Web-friendly:
- JSON for documents
- JavaScript for MapReduce views
- HTTP for an API
In his comparison of NoSQL databases, Kristof Kovacs points out that CouchDB works well with data that accumulates, changes only occasionally, is read through pre-defined queries, and where versioning is a priority (e.g., CRM and CMS systems).
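Because the whole API is plain HTTP and JSON, talking to CouchDB from Java needs nothing beyond the standard library. This sketch creates a database and stores a document; the host, database name and document body are placeholders, and error handling is omitted.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CouchDemo {

  // Minimal helper: send a request, print the JSON that comes back.
  static void send(String method, String url, String jsonBody) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod(method);
    if (jsonBody != null) {
      conn.setDoOutput(true);
      conn.setRequestProperty("Content-Type", "application/json");
      OutputStream out = conn.getOutputStream();
      out.write(jsonBody.getBytes("UTF-8"));
      out.close();
    }
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    System.out.println(conn.getResponseCode() + " " + in.readLine());
    in.close();
  }

  public static void main(String[] args) throws Exception {
    String couch = "http://localhost:5984";

    // PUT /dbname creates a database; PUT /dbname/docid stores a JSON document.
    send("PUT", couch + "/contacts", null);
    send("PUT", couch + "/contacts/ada",
        "{\"name\": \"Ada Lovelace\", \"tags\": [\"math\", \"programming\"]}");

    // GET it straight back, revision id and all.
    send("GET", couch + "/contacts/ada", null);
  }
}
```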
Couchbase has a lot going on nowadays. As of 2013, its active open-source projects include Couchbase Client SDKs, Couchbase Mobile, Couchbase Labs and its flagship NoSQL database, Couchbase Server.
Unlike relational database systems, Couchbase Server allows you to modify applications without the constraints of a fixed database schema. Plus, as noted in the 2013 Bossie Awards:
“One unique attribute of Couchbase Server is its memcached library. This feature allows developers to seamlessly transition from a memcached environment and gain data replication, durability, and zero application downtime.”
Couchbase has had two recent updates – 2.0 added document database capability, and 2.1 gave it cross-data center replication and improved storage performance.
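Here is what that memcached-style interface looks like from the Java SDK of the time (the 1.x CouchbaseClient API); the node URI and bucket name are placeholders, and the get/set calls are the same verbs a memcached client would use, only backed by replicated, persistent storage.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

import com.couchbase.client.CouchbaseClient;

public class CouchbaseDemo {
  public static void main(String[] args) throws Exception {
    // Point at one node; the client learns the rest of the cluster topology from it.
    List<URI> nodes = Arrays.asList(URI.create("http://127.0.0.1:8091/pools"));
    CouchbaseClient client = new CouchbaseClient(nodes, "default", "");

    // Same get/set verbs as memcached, but the value is replicated and persisted.
    client.set("user::ada", 0, "{\"name\": \"Ada\", \"role\": \"analyst\"}").get();
    Object doc = client.get("user::ada");
    System.out.println(doc);

    client.shutdown();
  }
}
```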
Winner of a 2013 Bossie, Sqoop is Apache’s tool for transferring data back and forth between structured data stores (relational databases) and Hadoop. To do this, it uses concurrent connections, customizable mapping of data types and metadata propagation.
As James R. Borck notes in his Bossie citation, Sqoop gives you the option to tailor imports to HDFS, Hive and HBase and then export results back to databases. Microsoft, for example, uses a Sqoop-based connector to transfer data from its SQL server databases into Hadoop.
For all you scientists out there dealing with big data, here are a couple of projects that are simply too good to leave off the list.
The Broad Institute's genomic analysis platform has hundreds of outstanding (and free) tools for researchers who have to do gene expression analysis, proteomics, SNP analysis, flow cytometry and RNA-seq analysis, as well as common data-processing tasks.
Created by IBM and supported by the Eclipse Foundation, the Spatiotemporal Epidemiological Modeler (STEM) enables scientists and public health officials to create models of emerging infectious diseases as they develop and thus more effectively stem their spread.
STEM was recently put to work in a project in which IBM teamed up with Johns Hopkins University and UCSF to develop new models of the spread of dengue fever and malaria. The researchers wanted to be able to predict future spread patterns by taking into account a wide variety of factors like temperature, precipitation, and soil characteristics. STEM enables scientists to correlate all this data with disease data to forecast where a disease is most likely to crop up next.