Open Innovation in the Age of Big Data

Open platforms for data analysis are more important than ever with big data and increasing access to heterogeneous data sources for analysts.

By Michael Berthold (KNIME), Dec 2014.

We don’t want to join the usual discussion here about what Big Data really is – ultimately, everybody’s interpretation contains a grain of truth: Big Data is essentially about huge and/or highly heterogeneous and/or semi- or unstructured data. However, the real issue is: what do we use Big Data for?

To answer this question we need to look at the different stages of innovation or discovery. Initially, all we do is poke holes in the darkness: we look for connections of some sort, such as correlations or clusters – basically anything that can help us gain insight, no matter how small, into the underlying system; sometimes that alone is enough to create business value. In the second stage, data analysis goes on to create more abstract descriptions by focusing on parts of the system. This is classical data analysis, which uses more advanced analytical methods. In the third and final stage we have a complete picture of the underlying system, and our goal is merely to fit the model’s parameters. This last stage is no longer really part of data analysis, as we already fully understand how the system works. Typical advanced analytics focuses mainly on the second phase, whereas exploratory data analysis is more focused on the first.

Big Data drives us back to the first phase of that process. Once again we don’t really know much, if anything, about the underlying system and are looking for interesting connections in the data to gain initial insights. Establishing these connections can create more value than before, because they are now derived from much bigger and more complex (i.e. “big” – see above) data sources. This sometimes leads to the claim that in the age of Big Data all we need are such correlations. True, correlations based on much more, and more diverse, data are likely – albeit not always – to be more meaningful (beware of spurious correlations!). But that is only part of the story – we should not lose sight of the big picture: the goal of data science is to find a model that describes as much of the underlying system as possible. Big or small.

Tools that create these initial insights from Big Data are generating plenty of hype right now. This also raises an interesting question, however: if we know how little we know about the Big Data world, how can we trust any one monolithic, proprietary platform provider to know what will keep us innovating and discovering new insights, now and in the future?

The need for open platforms, already evident in classic data analytics, is therefore even more pressing now, in the age of Big Data, when data analysts have easy access to an ever-growing number of internal and external data sources. To tackle this challenge they need quick and easy access to best-of-breed tools so they can intuitively explore new analysis ideas, unburdened by the artificial barriers of closed environments.

Here, too, five key pillars of an open analytics platform are vital for success:

  • Open platforms are integrative. They play well with existing systems (but don’t have to). They support various data sources, both large and small, and they integrate new, existing, and legacy tools – whether in-house or from external specialists. This is instrumental to securing the best of both worlds: emerging Big Data platforms and advanced analytics tools.
  • Open platforms are transparent. They are intuitive to use, which enables quick prototyping; they can be used in production and serve as templates for business analysts. They are, of course, open source, so anyone can build on top of them. It is interesting to note that many of the Big Data storage and processing platforms are already open source!
  • Open platforms are flexible and agile. They make processes reproducible and reusable, and they enable users to quickly and effortlessly explore alternative processes at the same time. They also future-proof your Big Data investments – there is no need to wait for a proprietary vendor to add desired functionality: the community or one of the many platform partners will already have done so.
  • Open platforms are collaborative at many levels. Users of – but also developers for – open platforms know that they can boost the impact of their work by sharing their latest tools and learnings rather than keeping them to themselves.
  • As a result, open platforms are much more powerful than any monolithic application can ever be. Because best-of-breed technologies can easily be mixed and matched within the same intuitive environment, breakthrough discoveries and innovation can come from anyone, anywhere.

And here is another thought: history keeps repeating itself. Expert Systems were supposed to be the solution to knowledge capture – but ended up “only” being an important piece of a larger puzzle. Data Warehouses were supposed to solve all of our ETL needs once and for all; they ended up being a solution for fairly static data structures but never really captured all of the data. And now we believe that by pumping all of our data into a large, distributed storage environment we will solve all of our “data problems”?

Instead, I foresee an interesting mix of unstructured, messy, heterogeneous, distributed Big Data storage facilities playing in concert with more organized, much better structured data repositories. Do we already know what this mix will look like? Do any of the analytics tool vendors know? Do we really want to repeat the mistake of locking ourselves and/or our data in with a single vendor, trusting that this vendor will know what we will need a year or two from now?

My personal bet is on an open platform that allows selection of the best resources (data, tools, or expertise), unconstrained by a proprietary toolbox. Now and in the exciting years to come.

Bio: Michael Berthold is co-founder of KNIME, the open analytics platform used by thousands of data experts around the world. He is currently president of KNIME.com AG and a professor at Konstanz University, where his research interests include bisociative data analysis and the widening of data mining algorithms.