To Code or Not to Code with KNIME

Find out how KNIME allows us to integrating analytical languages, such as R and Python and visual design of SQL code. Also, learn to integrate your Hadoop, visualization and ETL systems with the KNIME.

ETL: Visual SQL

Data loading and integration is the part nobody really talks about much. But here too, experts are capable of writing a few lines of SQL a lot faster than putting together a few modules graphically. However, not everyone is sufficiently fluent in SQL and also these users should still be given the ability to mix and match their data, giving them access to the majority of the functionality available via SQL (and remembering the many little nuances of it for different databases). In this case we cannot use the parameter based customization from above. Instead, KNIME encapsulates SQL fragments visually as individual nodes. The figure below shows what this could look like for an example workflow accessing two tables in a database and joining them after a couple of aggregation operations. Notable here is that this series of nodes will not be executed within the KNIME workflow environment itself but pushed out to the database – in fact KNIME silently assembles the underlying SQL statement under the hood (and if you’re curious, it is available on the node view). And, to make things even easier, depending on the initial database connector node, KNIME also makes sure only supported operations (such as for the aggregation node) are available and creates the right SQL dialect for that particular database.

Big Data: Wrapping Hadoop

The visual SQL approach described above also allows KNIME workflows to model and control ETL operations natively on Hadoop – using one of the Cloudera, Hortonworks, or MapR certified connectors works just the same as running operations in your local MySQL database ‒ but suddenly things get executed on Hadoop.

But why stop there? If you can reach out to R and Python and SQL, why not also add nodes that allow the integration of code that runs directly on Hadoop as well? With version 2.12, KNIME has nodes to encapsulate calls to MLlib and enables Spark operations to be modeled. A special Spark Scripting node encapsulates functionality that has not yet been exposed as a native node as well. The figure below shows an example of such a workflow.

Visualization: JavaScript, the fourth musketeer

And finally, instead of building yet another visualization library or tightly coupling pre-existing ones, KNIME has recently also introduced a JavaScript node that quickly builds new visualizations and exposes them as KNIME nodes, the figure below shows an example of network graphics generated using the well-known D3 library, wrapped nicely into a KNIME node. But other libraries are available as well, of course, and one of the neat features of this concept is that the resulting view nodes can be used to display visualizations in the KNIME WebPortal as well, allowing the deployment of web based interactive workflow controls, enabling true Guided Analytics – but that’s a story for another blog post.

Bio: Prof. Dr. Michael Berthold is the founder and president of AG, makers of the popular KNIME open source data mining and processing platform.