Hadoop Key Terms, Explained
A straightforward overview of 16 core Hadoop ecosystem concepts. No big-picture discussion, just the facts.
Now, let us look at the other related terms in the Hadoop ecosystem.
7. Apache Hive
Hive is data warehouse software that supports reading, writing and managing large volumes of data stored in a distributed storage system. It provides an SQL-like query language, known as HiveQL (HQL), for querying datasets. Hive supports storage in HDFS as well as compatible file systems such as Amazon S3.
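As an illustration, a minimal HiveQL query reads much like standard SQL; the table and column names below are hypothetical:

```sql
-- Hypothetical table of web server logs stored in HDFS
CREATE TABLE logs (ip STRING, url STRING, ts TIMESTAMP)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Familiar SQL-style aggregation, which Hive compiles into batch jobs
SELECT url, COUNT(*) AS hits
FROM logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```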
8. Apache Pig
Apache Pig is a high-level platform for analysing large data sets. The language used to write Pig scripts is known as Pig Latin. It essentially abstracts the underlying MapReduce programs, making it easier for developers to work with the MapReduce model without writing the actual code.
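For example, the classic word count, which would otherwise require a hand-written MapReduce program, can be sketched in a few lines of Pig Latin (the input and output paths here are placeholders):

```pig
-- Load lines from a hypothetical input file in HDFS
lines   = LOAD 'input.txt' AS (line:chararray);
-- Split each line into individual words
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'output';
```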
9. Apache Spark
Spark is a cluster computing framework and general compute engine for large-scale Hadoop data. It runs up to 100 times faster than MapReduce for in-memory processing, and up to 10 times faster when working from disk. Spark can run in different environments/modes: stand-alone, on Hadoop, on EC2 and so on. It can access data from HDFS, HBase, Hive or any other Hadoop data source.
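As a sketch, here is the same word count in Spark's Python API (PySpark). It assumes a running Spark installation, and the input/output paths are placeholders; the chained transformations are held in memory until an action forces evaluation:

```python
from pyspark.sql import SparkSession

# Requires a local or cluster Spark installation; paths are placeholders
spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///input.txt")
          .flatMap(lambda line: line.split())   # split lines into words
          .map(lambda word: (word, 1))          # pair each word with 1
          .reduceByKey(lambda a, b: a + b))     # sum the counts per word

counts.saveAsTextFile("hdfs:///output")
spark.stop()
```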
10. Apache Sqoop
Sqoop is a command-line tool for transferring data between relational databases and Hadoop. It is mainly used to import/export data between relational and Hadoop data stores. The name Sqoop is formed by combining the first and last parts of two other terms: SQL + Hadoop.
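A typical import copies a relational table into HDFS; in this sketch the hostname, database, table and credentials are all placeholders:

```shell
# Import a MySQL table into HDFS (connection details are placeholders)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders
```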
11. Apache Oozie
Oozie is a Hadoop workflow engine. It schedules workflows to manage Hadoop jobs.
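Workflows are defined in XML. A minimal sketch with a single MapReduce action (names and properties here are hypothetical) might look like:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-step"/>
    <action name="mr-step">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```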
12. Apache ZooKeeper
ZooKeeper is an open source platform that provides a high-performance coordination service for distributed Hadoop applications. It is a centralized service for maintaining configuration information, naming registries, distributed synchronization and group services.
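For a feel of the configuration-store use case, data lives in small nodes called znodes; from the ZooKeeper command-line shell (zkCli.sh), with placeholder paths and data:

```shell
# Store a piece of shared configuration in a znode
create /app/config "host=10.0.0.1"
# Any node in the cluster can then read it back
get /app/config
```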
13. Apache Flume
Apache Flume is a distributed service, mainly used for collecting, aggregating and moving data. It works very efficiently with large amounts of log and event data.
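A Flume agent is configured as sources, channels and sinks. A minimal sketch that tails a log file into HDFS, with arbitrary agent and component names, looks like:

```
# Name the components of agent a1 (names are placeholders)
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Read events by tailing a log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Buffer events in memory between source and sink
a1.channels.c1.type = memory

# Write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```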
14. Hue
Hue is basically a web interface for analysing Hadoop data. It is an open source project that supports Hadoop and its ecosystem, and its main purpose is to provide a better user experience. It offers drag-and-drop facilities and editors for Spark, Hive, HBase and more.
15. Apache Mahout
Mahout is open source software for quickly building scalable machine learning and data mining applications.
16. Apache Ambari
Ambari is basically a web-based tool for monitoring and managing Hadoop clusters. It includes support for ecosystem services and tools such as HDFS, MapReduce, HBase, ZooKeeper, Pig and Sqoop. Its three main functions are provisioning, managing and monitoring Hadoop clusters.
As the Hadoop ecosystem continues to evolve, new software, services and tools keep emerging, and with them new terms and jargon in the big data world. We need to keep a close watch and understand them as they appear.
In this article we have tried to identify the most important key terms in the Hadoop ecosystem. We have also briefly discussed the ecosystem itself and why we need to know these terms. Hadoop has now become a mainstream technology, and more and more people are getting involved with it, so this is the right time to understand the basic concepts and terms used in the Hadoop world. Plenty of new concepts and terms will arrive in future, and we must keep ourselves up to date accordingly.