18 essential Hadoop tools

Hadoop tools develop at a rapid rate, and keeping up with the latest can be difficult. Here we detail 18 of the most essential tools that work well with Hadoop.

Hadoop is an essential part of many data science projects. New technologies built on top of Hadoop are released all the time, and it can be difficult to keep up with the wide array of tools at your disposal. Here is a list of 18 of the most essential:

  • Apache Hadoop, the official distribution.
  • Apache Ambari, a software package for managing Hadoop clusters.
  • HDFS (Hadoop Distributed File System), the storage layer underpinning Hadoop, which splits data across the machines in a cluster.
  • Apache HBase, a table-oriented database built on top of Hadoop.
  • Apache Hive, a data warehouse built on top of Hadoop that makes data accessible through an SQL-like language.
  • Apache Sqoop, a tool for transferring data between Hadoop and other data stores.
  • Apache Pig, a platform for writing data-processing scripts (in its Pig Latin language) that run in parallel on data in Hadoop.
  • ZooKeeper, a tool for configuring and synchronizing Hadoop clusters.
  • NoSQL, a type of database that breaks from traditional relational database management systems using SQL. Popular NoSQL databases include Cassandra, Riak, and MongoDB.
  • Apache Mahout, a machine learning library designed to run on data stored in Hadoop.
  • Apache Lucene/Apache Solr, a tool for indexing text data that integrates well with Hadoop.
  • Apache Avro, a data serialization system.
  • Oozie, a workflow manager for the Apache toolchain.
  • GIS Tools, a set of tools to help manage geographical components of your data.
  • Apache Flume, a system for collecting log data and moving it into HDFS.
  • SQL on Hadoop, a family of engines for running SQL queries on Hadoop data. Some of the most popular options are Apache Hive, Cloudera Impala, Presto (Facebook), Shark, Apache Drill, EMC/Pivotal HAWQ, BigSQL by IBM, Apache Phoenix (for HBase), and Apache Tajo.
  • Clouds, managed servers and services that remove the hassle of running your own infrastructure.
  • Apache Spark, an engine that keeps working data in memory to run algorithms on Hadoop data even faster.
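
To make the HDFS entry in the list a little more concrete, here is a toy sketch of the general idea of splitting data into blocks and spreading copies across a cluster. It is plain Python and does not use any Hadoop API. The function names, node names, block size, and round-robin placement are all assumptions made for illustration; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replicas: int) -> dict[int, list[str]]:
    """Assign each block to `replicas` distinct nodes, round-robin (hypothetical policy)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


# Tiny example: a 10-byte "file", 4-byte blocks, 3 made-up nodes, 2 copies of each block.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c"], replicas=2)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on more than one node, losing a single machine in this sketch never loses data, which is the property that lets tools higher up the list treat the cluster as one large, reliable store.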

Each of these technologies adds another tool to your data analysis tool belt and can make your job easier in the right conditions. More info at NetworkWorld:

slide show: 18 essential Hadoop tools for crunching big data