How Big Data Pieces, Technology, and Animals fit together

How Big Data Pieces and animals fit together: MapReduce, HDFS, Apache Spark, Pregel, Zookeeper, Flume, Hive, Pig, and more, explained by a Quora (and past Facebook) Data Scientist.



By Gregory Piatetsky, @kdnuggets.

I came across this excellent post on Quora by Abhinav Sharma, which gives a great explanation of how Big Data technologies fit together.

Abhinav is a Product Designer at Quora, and earlier was a Software Engineer and Data Scientist at Facebook, Coursera, Mozilla, Duolingo and other places.

I added a few links where appropriate, and an image from the Databricks slideshare Big Data Analytics: A Zoo of Innovation.

Big Data Zoo

  • MapReduce is the Google paper that started it all. It's a paradigm for writing distributed code inspired by some elements of functional programming. You don't have to do things this way, but it neatly fits a lot of problems we try to solve in a distributed way. The Google internal implementation is called MapReduce and Hadoop is its open-source implementation. Amazon's Hadoop instance is called Elastic MapReduce (EMR) and has plugins for multiple languages. (A minimal word-count sketch in this style appears after this list.)
  • HDFS is an implementation inspired by the Google File System (GFS) to store files across a bunch of machines when they are too big for one. Hadoop consumes data in HDFS (Hadoop Distributed File System).
  • Apache Spark is an emerging platform that has more flexibility than MapReduce but more structure than a basic message passing interface. It relies on the concept of distributed data structures (what it calls RDDs) and operators. (A short RDD example appears after this list.)
  • Because Spark is a lower-level engine that sits on top of a message passing interface, it has higher-level libraries to make it more accessible to data scientists. The machine learning library built on top of it is called MLlib and there's a distributed graph library called GraphX.
  • Pregel and its open-source twin Giraph are a way to do graph algorithms on billions of nodes and trillions of edges over a cluster of machines. Notably, the MapReduce model is not well suited to graph processing, so Hadoop/MapReduce is avoided in this model, but HDFS/GFS is still used as a data store. (A toy sketch of the vertex-centric model appears after this list.)
  • Zookeeper is a coordination and synchronization service that helps a distributed set of computers make decisions by consensus, handle failures, etc.
  • Flume and Scribe are logging services: Flume is an Apache project and Scribe is an open-source Facebook project. Both aim to make it easy to collect tons of logged data, analyze it, tail it, move it around, and store it in a distributed store.
  • Google BigTable and its open-source twin HBase were meant to be read-write distributed databases, originally built for the Google crawler, that sit on top of GFS/HDFS and MapReduce/Hadoop. See the Google Research publication: BigTable. (A small read/write sketch against HBase appears after this list.)
  • Hive and Pig are abstractions on top of Hadoop designed to help with analysis of tabular data stored in a distributed file system (think of Excel sheets too big to store on one machine). They operate on top of a data warehouse, so the high-level idea is to dump data once and analyze it by reading and processing it, rather than updating individual cells, rows, and columns. Hive has a language similar to SQL, while Pig is inspired by the Google research paper Interpreting the Data: Parallel Analysis with Sawzall. You generally don't update a single cell in a table when processing it with Hive or Pig. (A sample Hive query appears after this list.)
  • Hive and Pig turned out to be slow because they were built on Hadoop, which optimizes for the volume of data moved around, not latency. To get around this, engineers bypassed Hadoop and went straight to HDFS. They also threw in some memory and caching, and this resulted in a newer crop of lower-latency query engines. They all have slightly different semantics but are essentially meant to be programmer- or analyst-friendly abstractions to analyze tabular data stored in distributed data warehouses.
  • Mahout (Scalable machine learning and data mining) is a collection of machine learning libraries written in the MapReduce paradigm, specifically for Hadoop. Google has its own internal version but they haven't published a paper on it as far as I know.
  • Oozie is a workflow scheduler. The oversimplified description would be that it's something that puts together a pipeline of the tools described above. For example, you can write an Oozie script that will scrape your production HBase data into a Hive warehouse nightly, then a Mahout script will train with this data. At the same time, you might use Pig to pull the test set into another file, and when Mahout is done creating a model you can pass the testing data through the model and get results. You specify the dependency graph of these tasks through Oozie (I may be messing up terminology since I've never used Oozie but have used the Facebook equivalent).
  • Lucene is a bunch of search-related and NLP tools, but its core feature is being a search index and retrieval system. It takes data from a store like HBase and indexes it for fast retrieval from a search query. Solr uses Lucene under the hood to provide a convenient REST API for indexing and searching data. ElasticSearch is similar to Solr.
  • Sqoop is a command-line interface for moving SQL data into a distributed warehouse. It's what you might use to snapshot and copy your database tables to a Hive warehouse every night.
  • Hue is a web-based GUI to a subset of the above tools.
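
To make the MapReduce paradigm from the first bullet concrete, here is a minimal word-count sketch in the Hadoop Streaming style. It is an illustrative example and not from the original post; the script names and any paths are hypothetical.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming feeds input file splits to this script on stdin.
# The map step emits one intermediate "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop sorts the mapper output by key, so identical words arrive
# together; the reduce step just sums the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

You can test the pair locally with an ordinary shell pipeline (cat input.txt | python mapper.py | sort | python reducer.py) before handing the scripts to the Hadoop Streaming jar to run over files in HDFS.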

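For contrast, here is roughly the same computation expressed with Spark's RDD operators, as described in the Spark bullets above. This is a sketch that assumes a local PySpark installation; the input path is made up.

```python
from pyspark import SparkContext

# A local SparkContext for experimentation; on a real cluster the master URL
# would point at the cluster manager instead of "local[*]".
sc = SparkContext("local[*]", "wordcount-sketch")

# RDDs are Spark's distributed data structures; each step below is a lazy operator.
counts = (sc.textFile("hdfs:///some/input/path")   # hypothetical input path
            .flatMap(lambda line: line.split())    # one record per word
            .map(lambda word: (word, 1))           # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))      # sum the counts per word

print(counts.take(10))  # nothing actually runs until an action like take()
sc.stop()
```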
 
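The Pregel/Giraph bullet describes a vertex-centric model: in each superstep every vertex reads the messages sent to it, updates its own value, and sends new messages to its neighbours. The toy, single-process sketch below (not from the original post) imitates that loop to label connected components; a real Pregel or Giraph job would spread the vertices over a cluster.

```python
# Toy imitation of the Pregel superstep loop on a single machine:
# every vertex keeps the smallest component id it has seen and keeps
# forwarding it to its neighbours until nothing changes.
graph = {                      # adjacency lists of a small undirected graph
    1: [2], 2: [1, 3], 3: [2],
    4: [5], 5: [4],
}
value = {v: v for v in graph}  # each vertex starts as its own component

changed, supersteps = True, 0
while changed:                 # one iteration of this loop == one superstep
    changed = False
    # "messages": every vertex receives its neighbours' current values
    inbox = {v: [value[u] for u in graph[v]] for v in graph}
    for v, messages in inbox.items():
        best = min([value[v]] + messages)
        if best < value[v]:    # the vertex updates itself and stays active
            value[v] = best
            changed = True
    supersteps += 1

print(value)       # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
print(supersteps)  # number of supersteps until the graph converged
```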

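The BigTable/HBase bullet is about low-latency reads and writes of individual rows, in contrast to the batch-oriented tools above. The sketch below uses the third-party happybase client against HBase's Thrift gateway; the host, table name, and column family are assumptions for illustration.

```python
import happybase  # third-party client that talks to the HBase Thrift server

# Assumes a Thrift server on localhost and an existing table 'pages'
# with a column family 'content' (both hypothetical).
connection = happybase.Connection("localhost")
table = connection.table("pages")

# Unlike Hive/Pig, HBase is built for updating and fetching single rows quickly.
table.put(b"com.example/index.html", {b"content:html": b"<html>...</html>"})
row = table.row(b"com.example/index.html")
print(row[b"content:html"])

connection.close()
```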

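Finally, the Hive bullet mentions a SQL-like language over data dumped into the warehouse. The sketch below assumes a running HiveServer2 instance and uses the third-party PyHive client; the table and columns are made up for illustration.

```python
from pyhive import hive  # third-party DB-API client for HiveServer2 (pip install pyhive)

# Connection details are assumptions; point them at your own HiveServer2.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# HiveQL reads like SQL, but it compiles into batch jobs over files in the
# warehouse rather than updating rows in place. Table and columns are hypothetical.
cursor.execute("""
    SELECT country, COUNT(*) AS signups
    FROM user_signups
    WHERE signup_date >= '2014-01-01'
    GROUP BY country
    ORDER BY signups DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)

conn.close()
```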
See also the Quora question How do I learn about big data? for more answers.