Apache Flink and the case for stream processing
Real-time analytics has proven challenging in the past, but with new tools it is possible to set up your pipelines in a relatively short time. Apache Flink is one such framework; find out how you can put it to work for your needs.
By Stephan Ewen, Kostas Tzoumas (Data Artisans).
Apache Flink is a relatively new framework in the Apache Software Foundation that puts streaming first: it supports batch analytics, continuous stream analytics, as well as machine learning and graph processing natively on top of a streaming engine.
What does streaming mean? Streaming is a new movement that advocates a shift in how we think about data infrastructure. Many data sources are series of events that are produced continuously (e.g., web logs, user transactions, system logs, sensor networks). Traditionally, these events were batched into periods (say, a day) and stored somewhere to be analyzed later as a batch. This introduces unnecessary latency between data generation and action on the results, and carries the implicit assumption that the data is at some point complete and can therefore be used to make accurate predictions. The next logical step beyond this traditional batch approach is to remove those artificial barriers and process event streams as event streams - this is what stream processing is all about. Of course, batch systems like Apache Hadoop are here to stay (some datasets are inherently batch), and streaming architectures often start out as an augmentation of, not a replacement for, Hadoop clusters.
A typical design pattern in streaming architectures is to collect the streams in a central broker (e.g., Apache Kafka) and use a stream processor to run analytics on the streams, create alerts, maintain state and summaries/views, and create derived streams that can be fed back into the central stream log.
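To make the pattern concrete, here is a minimal sketch of a Flink job consuming from such a central broker via the DataStream API. The Kafka connector class names are those of more recent Flink releases (they have changed across versions), and the broker address and topic name are placeholders:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaToFlink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Connection details for the central broker; "events" is a hypothetical topic name.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-analytics");

        // Subscribe to the raw event stream gathered in Kafka.
        env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
           .print(); // stand-in for real analytics, alerts, or derived output streams

        env.execute("Kafka to Flink sketch");
    }
}
```

From here, the stream can be transformed and aggregated, and derived results can be written back out to the broker as new streams.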
In the ecosystem of open source stream processors, Flink is unique in satisfying all of the following requirements at once:
- It provides exactly-once guarantees for state updates (since Flink 0.9.0), so that application developers do not have to worry about duplicates.
- It has a high-throughput engine that controllably buffers events before sending them over the network.
- It achieves low latency by processing events one at a time rather than in micro-batches, using a lightweight checkpointing mechanism based on the Chandy-Lamport algorithm for fault tolerance.
- By virtue of using the pure streaming model of continuous operators and mutable state, it offers a powerful streaming programming model with flexible windowing schemes (see the sketch after this list).
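Here is a minimal sketch of how two of these pieces look in the DataStream API: enabling the Chandy-Lamport-based checkpointing that underpins the exactly-once guarantee, and applying one of the windowing schemes to a keyed, stateful stream. Method names are shown as of Flink 1.x (the 0.9-era API differed slightly), and the input elements are toy data:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedCounts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 5 seconds; on failure, Flink restores
        // the latest snapshot so each event's state update counts exactly once.
        env.enableCheckpointing(5000);

        // A toy in-memory stream of (key, value) pairs standing in for real events.
        env.fromElements(
                Tuple2.of("user-a", 1L), Tuple2.of("user-b", 1L), Tuple2.of("user-a", 1L))
           // Partition the stream by key; each key keeps its own mutable operator state.
           .keyBy(t -> t.f0)
           // One of many windowing schemes: tumbling 10-second processing-time windows.
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
           .sum(1)
           .print();

        env.execute("Windowed counts sketch");
    }
}
```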
See, for example, our experience clocking Flink at throughputs of millions of records per second per core, with latencies well below 50 milliseconds, down to the 1-millisecond range.
What about batch?
So, Flink can be a very good match for real-time stream processing use cases. Another point that many people find hard to grasp is that a good stream processor can also serve as a great foundation for batch data processing! Having lived with batch processors for a long time, we tend to think of batch and streaming use cases as something very different. They actually are not! Batch programs are streaming programs that work on finite data sets (as opposed to infinite streams), and can thus have a view of all data inside an operation, rather than a window of the stream.
Flink builds both a DataStream API (for stream analytics) and a full-featured DataSet API (for batch analytics) on top of the underlying stream processing engine. In fact, Flink also contains a set of built-in libraries, e.g., for machine learning and graph processing.
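As an illustration of the "batch is a special case of streaming" point, here is a minimal word count against the DataSet API (a sketch assuming Flink 1.x; the input is toy data). Written against the DataStream API, essentially the same program would aggregate over windows of an unbounded stream instead of over the whole input:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> counts = env
            // A finite "stream": the engine knows this input ends.
            .fromElements("to be or not to be")
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.toLowerCase().split("\\s+")) {
                    out.collect(Tuple2.of(word, 1));
                }
            })
            // Java lambdas erase generic types, so declare the result type explicitly.
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .groupBy(0)  // group by the word field
            .sum(1);     // aggregate over ALL data, not just a window of it

        counts.print();  // print() also triggers execution of the batch job
    }
}
```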
The project also integrates with a wealth of other open source tools. Flink itself can consume data from sources such as HDFS (supporting all Hadoop input formats, including Apache Avro), HBase, Tachyon, and others. Flink offers compatibility APIs for Hadoop MapReduce and, more recently, Apache Storm, and other projects such as Apache Zeppelin, Apache SAMOA, Apache MRQL, Apache BigTop, Apache Mahout, Google Cloud Dataflow, and Cascading are adding support for Flink as a backend.
A bit of history
Flink has its origins at the Technical University of Berlin, where, in 2009, a team of researchers set out to address the limitations of Hadoop and other systems. Flink entered the Apache Incubator in April 2014 and graduated eight months later, in December 2014.
Since then, five software releases later, the community has done an impressive job of developing new features and enlarging the scope of the project. What started out as a pipelined data processing engine has evolved into an engine that can do both batch and streaming analytics natively, and now also includes higher-level functionality such as a machine learning library and a relational API. The code growth has been accompanied by substantial community growth: Flink now has more than 120 contributors who form a very helpful community.
If you are interested in Flink, check out a local meetup in your city, and consider attending Flink Forward 2015, the first full conference about Flink in Berlin, Germany.
Bio: Stephan Ewen is a Flink committer and co-founder and CTO of data Artisans. Before founding data Artisans, Stephan had been developing Flink since the early days of the project (then called Stratosphere). Stephan has a PhD in Computer Science from TU Berlin and has worked with IBM and Microsoft.
Kostas Tzoumas is a committer at Apache Flink and co-founder and CEO of data Artisans. Before founding data Artisans, Kostas was a postdoctoral researcher at TU Berlin (on Flink's predecessor Stratosphere) and received a PhD in Computer Science from Aalborg University.
Related:
- Patterns for Streaming Realtime Analytics
- Getting Business Value from Real-time Streaming Analytics
- SQL-like Query Language for Real-time Streaming Analytics