To Code or Not to Code with KNIME
Find out how KNIME lets us integrate analytical languages, such as R and Python, as well as the visual design of SQL code. Also, learn to integrate your Hadoop, visualization, and ETL systems with KNIME.
By Michael Berthold (KNIME).
Many modern data analysis environments allow for code-free creation of advanced analytics workflows. The advantages are obvious: more casual users, who cannot possibly stay on top of the complexity of working in a programming environment, are empowered to use existing workflows as templates and modify them to fit their needs, thus creating complex analytics protocols that they would never have been able to create in a programming environment. At the same time, these visual environments serve as an excellent means of documentation. Instead of having to read code, users can rely on the visual representation, which intuitively explains which steps have been performed, and, in most environments at least, the configuration of each module is self-explanatory as well. This enables a broad set of intuitively reusable workflows to be built up, capturing the data scientists’ wisdom.
This is only one side of the coin, however. Directly writing code still is and always will be the most versatile way to quickly and flexibly create a new analysis. In some areas this may not be as dramatic, as the need for new ways of solving (parts of) problems isn’t as critical anymore and a carefully designed visual environment may capture everything needed. In Advanced Analytics, however, the opposite is true: this is still very much a field under active development, as we currently witness in Deep Learning and Distributed Algorithms for Big Data (to name just two examples). Hence, to truly get full value out of your data, your experts need to be able to quickly try out a new routine, written either by themselves or by their colleagues.
The real question is therefore not which of the two paradigms to choose but how to make sure that users get the best of both worlds: ease of (re)use and versatility. Again, this boils down to the openness of the platform you are choosing for your analytics: can you really afford to lock today’s and tomorrow’s data scientists into a visual workbench that supports just one analytical language? Which one? A truly open platform allows you to choose what you and, more importantly, your data scientists are comfortable using, and allows them to collaboratively use what they know best without having to learn the nuts and bolts of every other coding paradigm used in your organization.
Analytics: Modular R and Python
The most important scripting languages for data analysis are, of course, R and Python. The screenshot below shows how expert code written in those two languages can be integrated in a KNIME analytical workflow. In this example R is used to create a graphic and Python for the model building (not because this is necessarily a good choice but simply to demonstrate that we can do that). A different user can now simply pick up this workflow and re-use it, possibly never even looking at the underlying code pieces. In KNIME, the entire sub-workflow could even be encapsulated in a metanode and exposed as a new functional unit without anyone even needing to see what’s going on inside; we can also expose only those parameters that we want to be controlled from the outside, modeled by the green QuickForm nodes, taking this to yet another level of abstraction.
Similar integrations also exist for other languages such as Groovy and Matlab, some contributed by the ever-active KNIME community.
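To make the idea of a reusable code piece more concrete, here is a minimal sketch of the kind of self-contained model-building step an expert might wrap inside a Python scripting node. It is not taken from the workflow described above and deliberately does not use KNIME's own scripting API: the function names, the column names "x" and "y", and the sample rows are all hypothetical, and how the node hands the input table to the script is left out.

```python
from statistics import mean


def fit_line(rows, x_col, y_col):
    """Fit y = slope * x + intercept by ordinary least squares.

    `rows` is a list of dicts standing in for an input table
    (an assumption for this sketch, not a KNIME data structure).
    """
    xs = [float(r[x_col]) for r in rows]
    ys = [float(r[y_col]) for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)

    # Sum of squared deviations of x; zero means x never varies.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x column is constant; slope is undefined")

    # Co-deviation of x and y.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    slope = sxy / sxx
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def predict(model, x):
    """Apply a model returned by fit_line to a single x value."""
    return model["slope"] * x + model["intercept"]


# Hypothetical input rows, standing in for the table a node would receive.
rows = [{"x": 1, "y": 3}, {"x": 2, "y": 5}, {"x": 3, "y": 7}]
model = fit_line(rows, "x", "y")
```

The point of keeping the step this narrow is the one made above: a colleague reusing the workflow only needs to know which columns go in and what comes out, while the column names are the sort of parameters one might choose to expose to the outside.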