
Hadoop Key Terms, Explained


A straightforward overview of 16 core Hadoop ecosystem concepts. No big-picture discussion, just the facts.



In the current technology landscape, big data and analytics are two of the areas attracting the most interest. The obvious reason behind this traction is that enterprises are getting real business benefit out of big data and BI applications. Hadoop has now become a mainstream technology, so its coverage and discussion have spread beyond the tech media. But what we have observed is that people still find it difficult to understand the actual concepts, and often form only a vague idea of Hadoop and the related technologies.

In this article, we make an honest effort to explain the key Hadoop terms in a simple way, so that both technical and non-technical audiences can understand them.

Hadoop is a powerful open source platform managed by the Apache Software Foundation. The platform is built on Java technologies and is capable of processing huge volumes of heterogeneous data in a distributed, clustered environment. Its scaling capability makes it a perfect fit for distributed computing.

The Hadoop ecosystem consists of the Hadoop core components and other associated tools. Among the core components, the Hadoop Distributed File System (HDFS) and the MapReduce programming model are the two most important concepts. Among the associated tools, Hive for SQL, Pig for dataflow, and ZooKeeper for managing services are important. We will explain these terms in detail.

Hadoop ecosystem

We have already noted that Hadoop is a very popular topic nowadays, and everybody is talking about it, knowingly or unknowingly. The problem is that if you are discussing or listening to something without knowing what it actually means, you will not be able to connect the dots or digest it. The problem is more visible when people come from a different domain, such as business, marketing, or top management. These people do not need to know how Hadoop works; they are more interested in how it can bring business benefit. To realize that benefit, a basic understanding of the Hadoop terms is important across all layers of the organization. At the same time, the terms should be explained simply, without complex jargon, so that readers stay comfortable.

In this section we will explore the different terms in Hadoop and its ecosystem, with a short explanation of each. For clarity, we will use two broad categories: one is the base modules, and the other is the additional software packages and tools that can be installed separately or on top of Hadoop. "Hadoop" generally refers to all of these entities.

First, let us have a look at the terms that make up the base modules.

1. Apache Hadoop
 
Apache Hadoop is an open-source framework for processing large volumes of data in a clustered environment. It uses the simple MapReduce programming model for reliable, scalable and distributed computing. Both storage and computation are distributed in this framework.

2. MapReduce
 
MapReduce is a programming model for the parallel processing of large volumes of data in a distributed environment. The paradigm has two main components: the Map() step, which performs filtering and sorting, and the Reduce() step, which summarizes the output of the Map step.

MapReduce
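
To make the Map and Reduce steps concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API. It is an illustration only; the class names and the input/output paths are placeholders, not part of any particular application.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a jar, the job would be submitted with something like "hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output", assuming those paths exist on the cluster.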

3. Hadoop Common
 
Hadoop Common contains the common utilities that support the other Hadoop modules. It is essentially a library of shared tools and utilities, and it is mainly used by developers during application development.
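
As a small, hedged illustration, the snippet below uses two widely used classes shipped in Hadoop Common, Configuration and Path. The NameNode address is a hypothetical placeholder, not from this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CommonDemo {
  public static void main(String[] args) {
    // Configuration reads core-site.xml, hdfs-site.xml, etc. from the classpath.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical host

    // Path is the file-system-neutral path abstraction used across Hadoop.
    Path input = new Path("/user/demo/input");

    System.out.println("Default FS : " + conf.get("fs.defaultFS"));
    System.out.println("Input path : " + input.toUri());
  }
}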

4. Hadoop Distributed File System (HDFS)
 
The Hadoop Distributed File System (HDFS) is a distributed file system that spans commodity hardware. It scales easily and provides high throughput. Data blocks are replicated and stored in a distributed way across the cluster.
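
The sketch below shows, assuming a reachable cluster, how an application might write and read an HDFS file through the Java FileSystem API. The NameNode address and file path are illustrative placeholders only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical cluster

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/hello.txt");

      // Write a small file; HDFS replicates its blocks across DataNodes.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
      }

      // Read it back.
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
        System.out.println(reader.readLine());
      }
    }
  }
}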

5. Yet Another Resource Negotiator (YARN)
 
YARN is the resource manager introduced in Hadoop 2. Its role is to manage and schedule computing resources in a clustered environment.
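
As a hedged sketch of what "managing computing resources" looks like from client code, the snippet below uses the YarnClient API to ask the ResourceManager for the cluster's node reports. It assumes a yarn-site.xml on the classpath that points at a running ResourceManager.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnDemo {
  public static void main(String[] args) throws Exception {
    // YarnClient talks to the ResourceManager configured in yarn-site.xml.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    // One report per NodeManager: its id, state, and resource capability.
    List<NodeReport> nodes = yarnClient.getNodeReports();
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId() + " state=" + node.getNodeState()
          + " capability=" + node.getCapability());
    }

    yarnClient.stop();
  }
}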