Interview: Michael Berthold, KNIME Founder, on Research, Creativity, Big Data, and Privacy, Part 2

We discuss interesting research projects, scientific research and creativity, Big Data hype and reality, whether privacy is still possible, and advice for beginning Data Scientists.

By Gregory Piatetsky, @kdnuggets, Aug 12, 2014.

Prof. Dr. Michael Berthold is the founder and president of KNIME.com AG, makers of the popular KNIME open source data mining and processing platform.

Here is the first part of the interview.

Gregory Piatetsky, Q6. Besides KNIME, what were some of your most interesting research projects?

Michael Berthold: Two come to mind - first the learning in parallel universes. The idea that objects can have different representations has been floating around for a while (multi view / multi instance learning) but creating a model that combines the patterns from across those representations rather than learning one representation from all those was a new way to think about this. It resulted in a couple of neat algorithms that are now, of course, also part of KNIME.

The other one created a larger EU consortium and resulted in a nice LNCS volume: Bisociative Knowledge Discovery, the idea that creativity is triggered by connecting diverse domains. The concept is not new (Arthur Koestler published about this in the 60s) but creating computer based methods to mimic or at least support it was. Some of the output of the project is now also available in KNIME, for instance as part of the network and text mining extensions.

GP: Q7. In your chapter in "Journeys to Data Mining" book [Springer, 2012] you talk about 3 phases of scientific research -
- "Parametrization" (e.g. fine-tuning a parameter),
- "Pattern Detection" (looking for a pattern of known type), and
- "Hypothesis Generation" (open-ended exploration).
Data mining can work well for the first 2 types. What is needed to help automate the third phase, "Hypothesis Generation" (search for "unknown unknowns")? Can such human creativity be automated?

MB: Well, see above. We were (and are) not shooting for a truly creative system but more for a system that supports creativity. Ideally you have a system that presents interesting connections/patterns to a user and she says "what the heck is that???" and, after some thinking and exploring where that connection comes from, creates an entirely new question that is then answered using the data.

So, in a way, we are trying to build a system that helps you come up with completely new patterns to look for or questions to ask. That would already be a huge improvement over current systems that really only provide answers to questions that you already know how to phrase.

GP: Q8. What do you think of the Big Data "buzz" - how much is hype, and how much is reality? Do you expect Big Data to fall into a "trough of disillusionment" soon or will Big Data tools (perhaps without using the "Big Data" name) become part of the mainstream and reach a plateau of productivity?

MB: It's hype. And as I said above, it's also a problem that pharmas had for at least a decade without calling it "big data". I see two problems:

Problem 1: there is a tendency for analyses to become a lot more shallow when more data is available.
Many people seem to believe that if they have more data then they don't really need to understand the data and/or problem anymore. And that leads to the
Problem 2: a lot of Big Data problems aren't Big Analytics problems but the ETL part is the real challenge.

Quite a few of the applications we see really boil down to doing Big ETL and running the analysis on a carefully (and meaningfully!) created aggregate of the data. Don't get me wrong, there are some Big Analytics problems but I agree with David Hand who said something along the lines of "only a small fraction of all problems are Big Data problems".
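The "Big ETL first, analysis on an aggregate" pattern described above can be sketched in a few lines. This is only a toy illustration, not anything from KNIME: the event format `(customer_id, purchase_amount)`, the function names `aggregate` and `average_per_key`, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(events: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[int, float]]:
    """ETL step: one streaming pass that reduces raw events to (count, total) per key."""
    counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, float] = defaultdict(float)
    for key, amount in events:
        counts[key] += 1
        totals[key] += amount
    return {key: (counts[key], totals[key]) for key in counts}


def average_per_key(summary: Dict[str, Tuple[int, float]]) -> Dict[str, float]:
    """Analysis step: runs on the small aggregate, not on the raw events."""
    return {key: total / count for key, (count, total) in summary.items()}


# Hypothetical raw events: (customer_id, purchase_amount).
raw_events = [("a", 10.0), ("b", 4.0), ("a", 30.0), ("b", 6.0), ("c", 5.0)]
summary = aggregate(raw_events)
averages = average_per_key(summary)
```

The expensive part is the single pass over the raw events. Once the aggregate exists, the analysis itself touches only one row per key, which is why it does not need to run in a "big" mode.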

So I think Big Data will cease to be such a hype but it will stick around. Fact is that people are now collecting lots more data and now at least they stuff it into Hadoop instead of burning it onto DVDs. Getting meaningful answers from this pile of data will require a combination of ETL and analytics, not necessarily all run in a "big" mode, though. So I think the real power will come from combining Big Data tools with the classic, very powerful analytics environments - one of the reasons why we recently added our Big Data Extensions. We have seen people who accessed huge amounts of data in Hadoop, ran some analyses in KNIME and even handed samples over to our R integration to run some bleeding edge algorithm.

As a side note: I am curious to see how the algorithm development will change in the coming years - we are working on something called "Widened Data Mining", where we invest distributed resources not to speed up analysis but to find better results. I think in the long run we will need to revisit some of the assumptions we have made in the past. Why use a greedy algorithm to find a decision tree, for instance, if I have a million cores that can explore the search space much better than that sequential, greedy algorithm?
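To make the greedy-versus-wider-search question concrete, here is a minimal sketch. It is not the Widened Data Mining method itself, which the interview does not spell out; it uses a plain beam search on a toy path-sum problem as a stand-in, and the function name `widened_search`, the `width` parameter, and the example triangle are assumptions for illustration. The candidates are processed in a sequential loop here; with distributed resources each one could be expanded in parallel.

```python
from typing import List, Tuple


def widened_search(triangle: List[List[int]], width: int) -> int:
    """Best top-to-bottom path sum found when keeping `width` partial paths per level."""
    # Each candidate is (running_sum, column_index).
    frontier: List[Tuple[int, int]] = [(triangle[0][0], 0)]
    for row in triangle[1:]:
        expanded: List[Tuple[int, int]] = []
        for total, col in frontier:
            # From column `col` a path may continue to `col` or `col + 1`.
            for next_col in (col, col + 1):
                expanded.append((total + row[next_col], next_col))
        # Keep only the `width` best candidates; width=1 is plain greedy search.
        expanded.sort(reverse=True)
        frontier = expanded[:width]
    return max(total for total, _ in frontier)


# Toy problem: the locally best first step (9) hides the best path (1 -> 2 -> 50).
TRIANGLE = [
    [1],
    [9, 2],
    [1, 1, 50],
]

greedy_result = widened_search(TRIANGLE, width=1)
wider_result = widened_search(TRIANGLE, width=2)
```

With `width=1` the search commits to the 9 and ends at 11; with `width=2` it also keeps the path through 2 alive and reaches 53. The extra resources are spent on a better result, not on a faster one.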

GP: Q9. Big Data also seems to greatly reduce privacy and make anonymity less and less possible [e.g. see the paper "No silver bullet: De-identification still doesn't work"].
What are your thoughts about Privacy and Big Data - will there be ethical guidelines for Privacy? EU privacy rules are much stricter than US - what are the implications?

MB: I am curious about that myself. I don't think we can go back to before and simply ignore the problem. I think we have good ethical guidelines (and if not - use your common sense). But I am not sure they can be implemented so easily. For instance I am not sure if forbidding Google to deliver certain results is a good way forward, I just don't think that's feasible to enforce. Honestly? I don't know. I am looking forward to interesting discussions about that with my kids. I do wonder what someone thinks about that who grew up with "all" information out there in the net...

GP: Q10. What advice will you offer for aspiring Data Scientists?

MB: Practice.
I strongly believe that being a data scientist is an interesting merger of science and craftsmanship.
You need to really understand the theory but at the same time you also need to exercise your gut feeling. I often have really smart students who really understand the concepts and then they apply algorithms and you just look over their shoulder and say "think! Can this result even make sense?!". And then they say "ah, you are right, that's impossible!".

This actually triggered our data generation nodes in KNIME - we wanted to have data sets accompany our "Guide to Intelligent Data Analysis" textbook that had a lot of those weird cases built-in. So that students can already fall into a lot of traps and gain this experience first-hand, so to speak. But in the end you need to practice running real world data analyses. Over and over again. And then you learn to trust your instincts that tell you "wait, this is too good to/can't be true!"

GP: Q11. What do you like to do in your free time?

MB: I don't have or need much free time. I have a wonderful family and I have a hobby that I love and was lucky enough to turn into a job. Ok, if I had a bit more flexibility in my hobby/job I would go back and write a few more algorithms for KNIME. That would scare the hell out of the other KNIMErs, though.

Bio: Michael Berthold holds the Nycomed-Chair for Bioinformatics and Information Mining at Konstanz University, Germany where his research focuses on using machine learning methods for the interactive analysis of large information repositories in the Life Sciences.

Most of the research results of Michael Berthold's group are made available to the public via the open source data processing platform KNIME. In 2008 Michael Berthold co-founded KNIME.com AG, located in Zurich, Switzerland. KNIME.com offers consulting and training for the KNIME platform in addition to an increasing range of enterprise products.

M. Berthold is a Fellow of the IEEE, Past President of the North American Fuzzy Information Processing Society, Associate Editor of several journals and Past President of the IEEE Systems, Man, and Cybernetics Society. He was involved in the organization of various conferences, most notably the IDA series of symposia on Intelligent Data Analysis and the conference series on Computational Life Science. With David Hand, he co-edited a successful textbook, "Intelligent Data Analysis: An Introduction". He is also a co-author of the "Guide to Intelligent Data Analysis" (2010).