Beginner’s Guide to Apache Flink – 12 Key Terms, Explained
We review 12 core Apache Flink concepts, to better understand what it does and how it works, including streaming engine terminology.
7. Flink DataSet API (for Batch Processing)
Apache Flink provides a DataSetAPI that allows to developers can develop programs (in Java, Scala and Python), that implement transformations on data sets (see examples in 7.1). The data sets are initially created from multiple sources such as File-Based, Collection-Based, Socket-based, Custom. The results of the data sets return via Data Sinks, which allow write the data to distributed files or for example command line terminal.
7.1 Examples of transformations in Flink:
- Map
- FlatMap
- Map Partition
- Filter
- Reduce
- ReduceGroup
- Aggregate
- Distinct
- Join
- OuterJoin
- CoGroup
- Cross
- Union
- Rebalance
- Hash-Partition
- Range-Partition
- Custom Partitioning
- Sort Partition
- First-n
- Project (for data streams of Tuples)
- MinBy / MaxBy (for data streams of Tuples)
8. FlinkCEP – Complex event processing for Flink
Apache Flink includes a complex event processing library that allows to developers detect complex event patterns in a stream of endless data. With the analysis of Matching sequences data scientist can construct complex events and do a deep analysis of the data.
9. Streaming SQL
Streaming SQL is a query language that extends the traditional SQL capabilities to process real-time data streams. The main challenge is incorporate aggregations, windows and time semantics on streams. Nowadays, Apache Flink community is working with Apache Calcite community to develop a new model for solve these challenges and improvements.
10. Flink Table API & SQL
Flink Table API and SQL are experimental features focusing in Streaming SQL that allow to work with SQL-like expressions for relational stream and batch processing. The Table API and SQL interface operate on a relational Table abstraction and provide tight integration with Flink DataSet API.
11. Flink ML (Machine Learning)
Apache Flink provides an extensive and newly scalable Machine Learning (ML) library for Flink developers, the main two goals of Flink ML are to help to developers keep glue code to a minimum and second goal is to make it easy to use providing detailed documentation with examples. Flink ML allows to data scientist to test their algorithms and models locally with a subset of the total data and hence with the same written code this can be executed at a much larger scale in a cluster setting.
The Machine learning algorithms supported at the moment are:
Supervised Learning
- SVM using Communication efficient distributed dual coordinate ascent (CoCoA)
- Multiple linear regression
- Optimization Framework
Data Preprocessing
- Polynomial Features
- Standard Scaler
- MinMax Scaler
Recommendation
- Alternating Least Squares (ALS)
Utilities
- Distance Metrics
12. Flink Gelly (Graph Processing)
Gelly is a Graph API for Flink that contains variety of methods and tool (such as graph transformations and utilities, iterative graph processing, library of graph algorithms) for doing graph analysis applications in Flink. Gelly can be seamlessly mixed with the DataSet Flink API for developing programs that use both record-based and graph-based analysis.
Flink Gelly provide the next Graph Methods:
- Graph Properties (e.g. getVertexIds, getEdgelds, numberOfVerices, numberOfEdgest, etc)
- Transformations (e.g. map, filter, join, subgraph, union, difference, reverse, undirected, getTriplets, etc.)
- Mutations (e.g. add vertex, add edge, remove vertex, remove edge)
- Neighborhood Methods.
Some Algorithms provides in Flink Gelly:
- PageRank
- Single Source Shortest Paths
- Label Propagation
- Weakly Connected Components
- Community Detection
- Triangle Count & Enumeration
- Graph Summarization
BONUS term
Blink (Alibaba Flink)
Blink is a project (improvements to Flink) from Alibaba Group, which operates the world’s largest online marketplace and it profits bigger than Amazon and eBay combined. These improvements include better changes in Flink Table API (such as unification of SQL layer for batch and streaming, adding features stream-stream join, aggregations, windowing, retraction) and Runtime Compatibility with Flink API and Ecosystem (such as new runtime architecture on YARN, optimized state, checkpoint & failover, reliable and production quality and others).
Summary.
Apache Flink could be considering as the 4th generation of Big Data Framework, instead of waiting for a long cycle of batch processing until data could be available, data scientist can work in real-time with data generated and processed continuously.
Bio: Andrés Vivanco a Master student at TU Berlin as part of the Erasmus Mundus IT4BI. Currently, he is researching in DIMA about Data Science, Data Generation, Machine & Deep Learning, Benchmarking, and Big Data.