To Hadoop or Not to Hadoop?

Hadoop is very popular, but it is not a solution for all Big Data cases. Here are the questions to ask to determine whether Hadoop is right for your problem.

Guest blog By Anand Krishnaswamy, ThoughtWorks, Oct 4, 2013.

Hadoop is often positioned as the one framework your organization needs to solve nearly all your problems. Mention "Big Data" or "Analytics" and pat comes the reply: Hadoop!

Hadoop, however, was purpose-built for a clear set of problems; for others it is, at best, a poor fit or, even worse, a mistake.

While data transformation (or, more broadly, ETL operations) benefits significantly from a Hadoop setup, if your organization's needs fall into any of the following categories, Hadoop might be a misfit.

1. Big Data Cravings

While many businesses like to believe that they have a Big Data dataset, it is often not the case.

Ask Yourself:

  • Do I have several terabytes of data or more?
  • Do I have a steady, huge influx of data?
  • How much of my data am I going to operate on?

2. You Are in the Queue

When submitting jobs, Hadoop's minimum latency is about a minute. Only a loyal and patient customer would stare at the screen for 60+ seconds waiting for a response.

Ask Yourself:

  • What are user expectations around response time?
  • Which of my jobs can be batched up?

3. Your Call will be Answered In...

Hadoop has not served businesses requiring real-time responses to their queries. Jobs that go through the map and reduce phases also spend time in the shuffle phase. None of these phases is time-bound, which makes developing real-time applications on top of Hadoop very difficult.

Ask Yourself:

  • What is the level of interaction users/analysts expect with my data?
  • Do they wish to have interactivity with terabytes of data or just a subset?

Let's say it together: Hadoop works in batch mode. That means that as new data is added, the jobs need to run over the entire set again. Hence, analysis time keeps increasing.
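
A toy sketch of that cost (plain Python, not Hadoop code; the dataset and counting job are hypothetical): each batch run re-scans the whole accumulated dataset, not just the newly arrived records, so run time grows with history.

```python
def batch_count(events):
    """Count events per user by scanning the FULL dataset, batch-style."""
    counts = {}
    for user, _payload in events:
        counts[user] = counts.get(user, 0) + 1
    return counts

# Day 1: three events.
events = [("alice", "login"), ("bob", "login"), ("alice", "click")]
print(batch_count(events))   # {'alice': 2, 'bob': 1}

# Day 2: new data arrives, but the batch job must re-scan everything,
# including day 1's records, to produce updated results.
events += [("bob", "click"), ("carol", "login")]
print(batch_count(events))   # {'alice': 2, 'bob': 2, 'carol': 1}
```

If your results must incorporate new data incrementally rather than via a full nightly re-scan, that is a sign batch-mode Hadoop is the wrong tool.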

4. I Just Broke Up With My Social Network

Hadoop, especially MapReduce, is best suited for data that can be decomposed to key-value pairs without fear of losing context or any implicit relationship. If your primary data structure is a graph or a network, then you are probably better off using a graph database like Neo4J or Dex or you could explore recent entries on the scene like Google's Pregel or Apache Giraph.
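
To see what "decomposes to key-value pairs" means in practice, here is a minimal in-memory sketch of the MapReduce model (plain Python, not the Hadoop API) using the classic word-count job: every emitted pair is self-contained, so records can be processed independently and in any order.

```python
from collections import defaultdict

def map_phase(doc):
    # Emit a (word, 1) pair for every word; no pair depends on any other.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

docs = ["to hadoop or not", "to hadoop"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))  # {'to': 2, 'hadoop': 2, 'or': 1, 'not': 1}
```

A graph query like "shortest path from A to B" resists this decomposition: the answer depends on relationships spanning many records, which is exactly the context a key-value split throws away.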

Ask Yourself:

  • Is the underlying structure of my data as vital as the data itself?
  • Is the insight I wish to gain reflective of the structure as much as or more than the data?

5. The Mold of MapReduce

Some tasks/jobs/algorithms simply do not yield to the programming model of MapReduce. Add to these the business cases where the data is not significantly large, or where the total data set is large but made up of billions of small files (e.g. many image files that need to be scanned for a particular shape) that can't simply be concatenated.
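
Why can't such files simply be concatenated? Raw concatenation destroys file boundaries, so packing them requires a container format with an index, which is the idea behind Hadoop's SequenceFile and HAR formats. A hedged sketch of that idea (plain Python with made-up file names, not a Hadoop API):

```python
import io

def pack(files):
    """Pack {name: bytes} into one blob plus an index of (offset, length)."""
    blob, index, offset = io.BytesIO(), {}, 0
    for name, data in files.items():
        blob.write(data)
        index[name] = (offset, len(data))
        offset += len(data)
    return blob.getvalue(), index

def unpack(blob, index, name):
    # The index preserves the boundaries that raw concatenation would lose.
    offset, length = index[name]
    return blob[offset:offset + length]

files = {"img001.png": b"\x89PNG...", "img002.png": b"\x89PNG,,,"}
blob, index = pack(files)
print(unpack(blob, index, "img002.png"))  # b'\x89PNG,,,'
```

If your pipeline would need this kind of repackaging just to feed Hadoop efficiently, that overhead belongs in the cost-benefit analysis.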

Ask Yourself:

  • Does my business place great emphasis on highly specialized algorithms or domain-specific processes?
  • Wouldn't the technical team be better equipped to analyse whether the algorithms are MapReducible?

Now that we have explored some of the reasons when Hadoop might be a misfit, let's look at when it might make sense.

Does your organization...

  1. Want to extract information from piles of, say, text logs?
  2. Want to transform largely unstructured or semi-structured data into some other usable and structured format?
  3. Have tasks that can run over the entire set of data overnight (like credit card companies do with the day's transactions)?
  4. Treat conclusions drawn from a single processing of data as valid till the next scheduled processing (unlike stock market prices which definitely change between end of day values)?

Then, most certainly you should explore Hadoop.

Anand Krishnaswamy is a senior consultant and developer who dabbles in big data analytics with ThoughtWorks, a global technology company that provides fresh thinking to solve some of the world's toughest problems.