With the modern world's unrelenting deluge of data, settling on the exact sizes which make data "big" is somewhat futile, with practical processing needs trumping the imposition of theoretical bounds. Like the term Artificial Intelligence, Big Data is a moving target; just as the expectations of AI of decades ago have largely been met and are no longer referred to as AI, today's Big Data is tomorrow's "that's cute," owing to the exponential growth in the data that we, as a society, are creating, keeping, and wanting to process. As such, traditional data processing tools which do not scale to big data will eventually become obsolete.
So the question is, what are we doing with this data? The answer, of course, is very context-dependent. But everyone is processing Big Data, and it turns out that this processing can be abstracted to a degree that can be dealt with by all sorts of Big Data processing frameworks. A few of these frameworks are very well-known (Hadoop and Spark, I'm looking at you!), while others are more niche in their usage, but have still managed to carve out respectable market shares and reputations.
We will take a look at 5 of the top open source Big Data processing frameworks being used today. Of course, these aren't the only ones in use, but hopefully they are considered to be a small representative sample of what is available, and a brief overview of what can be accomplished with the selected tools.
First up is the all-time classic, and one of the top frameworks in use today. So prevalent is it, that it has almost become synonymous with Big Data. But you already know about Hadoop, and MapReduce, and its ecosystem of tools and technologies including Pig, and Hive, and Flume, and HDFS. And all the others. Hadoop was first out of the gate, and enjoyed (and still does enjoy) widespread adoption in industry.
So why would you still use Hadoop, given all of the other options out there today? Despite the fact that Hadoop processes often complex Big Data, and has a slew of tools that follow it around like an entourage, Hadoop (and its underlying MapReduce) is actually quite simple. If your data can be processed in batch, and split into smaller processing jobs, spread across a cluster, and their efforts recombined, all in a logical manner, Hadoop will probably work just fine for you.
A number of tools in the Hadoop ecosystem are useful far beyond supporting the original MapReduce algorithm that Hadoop started as. Of particular note, and of a foreshadowing nature, is YARN, the resource management layer for the Apache Hadoop ecosystem. It can be used by systems beyond Hadoop, including Apache Spark. Here is an in-depth article on cluster and YARN basics.
Spark is the heir apparent to the Big Data processing kingdom. Spark and Hadoop are often contrasted as an "either/or" choice, but that isn't really the case. The Hadoop ecosystem can accommodate the Spark processing engine in place of MapReduce, leading to all sorts of different environment make-ups that may include a mix of tools and technologies from both ecosystems. As one specific example of this interplay, Big Data powerhouse Cloudera is now replacing MapReduce with Spark as the default processing engine in all of its Hadoop implementations moving forward. As another example, Spark does not include its own distributed storage layer, and as such it may take advantage of Hadoop's distributed filesystem (HDFS), among other technologies unrelated to Hadoop (such as Mesos).
Spark differs from Hadoop and the MapReduce paradigm in that it works in-memory, speeding up processing times. Spark also circumvents the imposed linear dataflow of Hadoop's default MapReduce engine, allowing for a more flexible pipeline construction.
When would you choose Spark? If you don't want to be shackled by the MapReduce paradigm and don't already have a Hadoop environment to work with, or if in-memory processing will have a noticeable effect on processing times, this would be a good reason to look at Spark's processing engine. Also, if you are interested in tightly-integrated machine learning, MLib, Spark's machine learning library, exploits its architecture for distributed modeling.
Again, keep in mind that Hadoop and Spark are not mutually exclusive.
When it comes to processing Big Data, Hadoop and Spark may be the big dogs, but they aren't the only options. We look at 3 additional Big Data processing frameworks below, what their strengths are, and when to consider using them. Their search term prevalence is displayed above; Storm is clearly the most popular of the 3, Flink is a newcomer seemingly building quick interest, and Samza fits somewhere in the middle, but looks as though interest may be dwindling. They will be given treatment in alphabetical order.
Apache Flink is a streaming dataflow engine, aiming to provide facilities for distributed computation over streams of data. Treating batch processes as a special case of streaming data, Flink is effectively both a batch and real-time processing framework, but one which clearly puts streaming first.
Flink provides a number of APIs, including a streaming API for Java and Scala, a static data API for Java, Scala, and Python, and an SQL-like query API for embedding in Java and Scala code. It also has its own machine learning and graph processing libraries. Flink has an impressive set of additional features, including:
- High Performance & Low Latency
- Support for Event Time and Out-of-Order Events
- Exactly-once Semantics for Stateful Computations
- Continuous Streaming Model with Backpressure
- Fault-tolerance via Lightweight Distributed Snapshots
Why use Flink over, say, Spark? Flink is truly stream-oriented. Spark operates in batch mode, and even though it is able to cut the batch operating times down to very frequently occurring, it cannot operate on rows as Flink can. If you are processing stream data in real-time (real real-time), Spark probably won't cut it. In such cases, a framework such as Flink (or one of the others below) will be necessary.
If you are interested in more on the contrast between Spark and Flink, have a look at this article, which discusses, among other things, the similarity of API syntax between the 2 projects (which could lead to easier adoption). Another comparison discussion can be found on Stack Overflow.
Apache Storm is a distributed real-time computation system, whose applications are designed as directed acyclic graphs. Storm is designed for easily processing unbounded streams, and can be used with any programming language. It has been benchmarked at processing over one million tuples per second per node, is highly scalable, and provides processing job guarantees. Unique for items on this list, Storm is written in Clojure, the Lisp-like functional-first programming language.
Apache Storm can be used for real-time analytics, distributed machine learning, and numerous other cases, especially those of high data velocity. Storm can run on YARN and integrate into Hadoop ecosystems, providing existing implementations a solution for real-time stream processing. Five characteristics which make Storm ideal for real-time processing workloads are (taken from HortonWorks):
- Fast - benchmarked as processing one million 100 byte messages per second per node
- Scalable - with parallel calculations that run across a cluster of machines
- Fault-tolerant - when workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.
- Reliable - Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
- Easy to operate - standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.
Keep in mind that Storm is a stream processing engine without batch support. Storm does not support state management natively; however, Trident, a high level abstraction layer for Storm, can be used to accomplish state persistence. Trident also brings functionality similar to Spark, as it operates on mini-batches.
Here is a discussion on Storm vs Flink.
Finally, Apache Samza is another distributed stream processing framework. Samza is built on Apache Kafka for messaging and YARN for cluster resource management. Its website provides the following overview of Samza:
- Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based “process message” API comparable to MapReduce.
- Managed state: Samza manages snapshotting and restoration of a stream processor’s state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).
- Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.
- Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.
- Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
- Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
- Processor isolation: Samza works with Apache YARN, which supports Hadoop’s security model, and resource isolation through Linux CGroups.
This article discusses Storm vs Spark vs Samza, which also describes Samza as perhaps the most underrated of the stream processing frameworks (which ultimately tipped the scales in favor of its inclusion in this post).
The first 2 of 5 frameworks are the most well-known and most implemented of the projects in the space. They are also mainly batch processing frameworks (though Spark can do a good job emulating near-real-time processing via very short batch intervals). The final 3 frameworks are all real-time or real-time-first processing frameworks; as such, this post does not purport to be an apples-to-apples comparison of frameworks. Instead, these various frameworks have been presented to get to know them a bit better, and understand where they may fit in.
This post provides some discussion and comparison of further aspects of Spark, Samza, and Storm, with Flink thrown in as an afterthought. The post also links to some other sources, including one which discusses more precise conditions of when and where to use particular frameworks. The conclusion, as it turns out, is that there are no hard and fast rules, and, instead, a series of guidelines and suggestions exist. This is worth remembering when in the market for a data processing framework.
Also note that these apples-to-orange comparisons mean that none of these projects are mutually exclusive. There are good reasons to mix and match pieces from a number of them to accomplish particular goals. The fallacious "Hadoop vs Spark" debate need not be extended to include these particular frameworks as well.
A final word regarding distributed processing, clusters, and cluster management: each processing framework listed herein can be configured to run on both YARN and Mesos, both of which are Apache projects, and both of which are cluster management common denominators. That YARN is a Hadoop component that has been adapted by numerous applications beyond what is listed here is a testament to Hadoop's innovation, and its framework's adoption beyond the strictly-Hadoop ecosystem. Of any transferable and lasting skill to attain that has been alluded to herein, it seems that the cluster and resource management layer, including YARN and Mesos, would be a good bet.