Interview: Stefan Groschupf, Datameer on Why Domain Expertise is More Important than Algorithms

We discuss large-scale data architectures in 2020, career path, open source involvement, advice, and more.

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun off Hadoop, which, 10 years later, is considered a 20 billion dollar business.

Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. Stefan is a frequent conference speaker and a contributor to industry publications and books. He holds patents and advises a set of startups on product, scale and operations.

First part of interview

Second part of interview

Here is the third and last part of my interview with him:

Anmol Rajpurohit: Q12. Based on your strong technical background and thought leadership in the field, how do you foresee data architectures in 2020 for large-scale data processing? Would RDBMSs co-exist with Hadoop? If so, what role would RDBMSs play and how would they be integrated with the larger Hadoop-based architecture?

Stefan Groschupf: I don’t think it will matter in the future. Hadoop is a virtualization environment; it is a way to cluster machines together. What is cool is that it will be virtualized and it will be somewhere in the cloud. By virtualized, think of VMware: one physical machine with many virtual machines on top of it. Hadoop does the opposite: it clusters many machines together into one supercomputer. To the end user, it will not matter what it runs. That's important because then we can focus on the data and analytics rather than on which technology we need to use.

If you really take apart Oracle or Teradata, there's so much optimization technology in there. They use a different technology to do a full table scan versus a B-tree scan; they do a bunch of caching and have hot and cold data, etc. The same thing will happen with Hadoop, where you will have decision engines, like Smart Execution, that will decide whether you need a graph engine for a query or whether it should run in-memory or on your hard drive. You will have cross-space optimizers. It will get much more complex and much faster.
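To make the idea of a decision engine concrete, here is a minimal sketch in Python. It is not how Smart Execution or any real optimizer works; the `Query` fields, the `choose_engine` function, the engine names, and the single memory threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """Hypothetical summary of a query, as a planner might estimate it."""
    input_bytes: int          # estimated size of the data the query touches
    is_graph_traversal: bool  # True if the query walks relationships


def choose_engine(query: Query, free_memory_bytes: int) -> str:
    """Pick an execution engine using simple, made-up rules.

    A real optimizer would weigh many more signals (caching, data
    temperature, cluster load); this only shows the shape of the decision.
    """
    if query.is_graph_traversal:
        return "graph"
    if query.input_bytes <= free_memory_bytes:
        return "in-memory"
    return "disk"


if __name__ == "__main__":
    gib = 1024 ** 3
    print(choose_engine(Query(2 * gib, False), 16 * gib))
    print(choose_engine(Query(64 * gib, False), 16 * gib))
    print(choose_engine(Query(1 * gib, True), 16 * gib))
```

The point of the sketch is that the user only submits the query; the choice of engine is made for them, which is the shift described in the answer above.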

AR: Q13. You started your career as a software developer and data architect, and are currently the CEO of a technology company. When you look back on your career path so far, what do you see as the major milestones and what were the key inspirations behind achieving those milestones?

SG: That’s not entirely correct. I actually started my career as a designer and then became a user-interface designer before really coding as a software developer.

My initial inspiration stems from when I was 16. One of the first movies you could see after the wall came down in East Germany was “Three Days of the Condor,” where Robert Redford used a PDP-8 to analyze books. I always found that I couldn’t read enough books, and the idea of being able to write software that could analyze text really fascinated me. That's why I developed, at some point, text classification and clustering algorithms. Back then we used hidden Markov models for named entity extraction, which KDnuggets readers might appreciate.

There have been six stages of creativity that have helped bring me to where I am now:
  • My first stage was learning Photoshop to create still photos when I worked at a music magazine.
  • I wanted to build on top of that, so my second stage of creativity was adding a timeline to become a video editor. I cut a film for the Berlinale and cut ads for BMW.
  • My third dimension was 3D animation. I was a very early user of Autodesk Softimage and then later Maya, again, mostly for video advertisements.
  • Then, I discovered interactivity. I was one of the power users of Macromedia Director. This is where I really found my passion for object-oriented programming, which became my fourth stage.
  • Then I realized that I could create functionality – my fifth stage of creativity. That's how I really got into hard-core programming. I always loved data-visualization.
  • The sixth, and current, stage of creativity is working with the most difficult of all materials you can design on the planet, and that's humans. Putting them together on functional teams.

AR: Q14. When and how did you get started with coding for Open Source? What were the key learnings from your contributions to Nutch (search engine) and Katta (distributed Lucene index)? How has your involvement with the Open Source community impacted your career progress and decisions?

SG: I was fascinated early on by the creative process of creating functionality around data, specifically text data. I worked on network word graphs with early thesaurus datasets and Weka, which is one of the first open-source data-mining frameworks.

Open source is a great way to learn new technologies, while also enjoying the creative process. Writing a piece of beautiful code and having thousands of people use it and the chance to impact millions of people is phenomenal. I would never have expected that. For example, major dating sites use Hadoop to match people daily. Had I known when I was in my living room coding away on Nutch, and later on Hadoop, that eventually that code would be used to find love, I would have written more code. No, I'm just kidding.

AR: Q15. What advice would you give to people aspiring to a long career in Data Science?

SG: In five to ten years, we will have artificial intelligence that will decide which algorithm is the best. It's really good to understand the basics, but I think what's more important is to become a subject-matter expert in something and dive really deep on the domain rather than on the algorithms. Moore's Law will make data science, as it is today, go away tomorrow. As you're looking at your career choices, I think it’s important to understand what Moore's Law and logarithmic growth of knowledge really mean.

For a Java programmer, it's cool to know assembly, but no one is working in assembly anymore. We have entire school programs dedicated to data science. This is great for the next few years, but it will go away and there will be completely different technology in the future. Become a domain expert.

AR: Q16. What was the last book that you read and liked? What do you like to do when you are not working?

SG: I will give you a few because they’re really good. “Cracking the Sales Management Code,” “The Rise of Superman,” and “Thinking, Fast and Slow.” I'm also a fan of Malcolm Gladwell’s writing.

Outside of work, I like to train for Ironman competitions. I also really enjoy projects like building my own Internet of Things devices and researching how to convert data into soundscapes.

Anmol Rajpurohit is a software development intern at Salesforce. He is a former MDP Fellow and a graduate mentor for IoT-SURF at UCI-Calit2. He has presented his research work at various conferences including IEEE Big Data 2013. He is currently a graduate student (MS, Computer Science) at UC Irvine.