Interview: Stefan Groschupf, Datameer on Why SQL on Hadoop is a Bad Idea
We discuss the startup landscape in Big Data, the valuation of Big Data companies, the recognition earned by Datameer, and why SQL on Hadoop is a bad idea.
Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun off Hadoop, which, 10 years later, is considered a 20 billion dollar business.
Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. Stefan is a frequent conference speaker, contributor to industry publications and books, holds patents and is advising a set of startups on product, scale and operations.
First part of interview
Here is the second part of my interview with him:
Anmol Rajpurohit: Q7. When one looks at the Big Data landscape, one of the fairly obvious observations is the increasingly large number of startups (and companies that had initial growth, but stagnated very soon). Where do you position Datameer in comparison to its competitors and in general, looking at the overall big picture?
Stefan Groschupf: We feel really, really good. The growth of Datameer is amazing. We have raised $37 million so far and our next competitor just spent that last year.
We have built a sustainable company and have phenomenal growth year-over-year, but we do this by having a great product that people are willing to pay for. I think there's a lot of hype in this space for good reason. There are potentially huge returns on investment and so companies with unlimited money will just buy their growth and logos.
When a shift as unique as big data, or data in general, comes to the market, there will be a number of start-ups that want to jump on that bandwagon. But guess what? Building a company is really hard. Not everyone will succeed at it.
AR: Q8. One of the on-going debates about Big Data is whether we are currently in a bubble, based on hype and over-expectations (or ahead-of-time expectations). From a CEO perspective, what are your thoughts on the valuation of Big Data companies, including both public and non-public companies? Would you say that most Big Data companies are under-valued, appropriately valued or over-valued?
SG: I think most big data companies are absolutely overvalued. That's very important because now they have to deliver on that value. The market will clean itself.
If you do a good job of looking into whether there's really value-creation in the company, then you should be fine. We see companies that create a tremendous amount of value and have really solid bookings and revenue numbers. We also see companies where, again, the growth is mostly bought. It's not really emerging, next-generation technology; it's just the same thing in a slightly different color with a Hadoop sticker on top of it.
It's important that you truly bring something new to the game. If you can pull this off, customers will pay for your product.
AR: Q9. In its short history of six years, Datameer has won several awards and recognitions. If you had to choose the one closest to your heart, which would it be and why?
SG: That is a tough question. I am incredibly proud of every award and achievement we’ve received, though “Most Innovative Product” by Fast Company is right near the top along with “Best Data Discovery Product” by GigaOm Research.
If you ask what legacy I want to leave behind, it would be something that recognizes Datameer as the best place to work. We hire amazing people and I’m having the time of my life coming to work every day.
AR: Q10. What are the most important lessons that you have learned through the experience of starting a company, raising capital and leading the company in a new, but heavily competitive field?
SG: It's all about staying honest with your customers, your colleagues and yourself. Building a company is incredibly difficult. Sometimes, especially in a market as hyped as the one we are in, you're scratching your head asking, "What am I doing wrong? Why are those companies raising so much money?" Then a year later, they're imploding because they couldn't sell to a single customer.
AR: Q11. Hadoop was primarily designed for sequential-access (leading to low memory requirements) of large data-sets (structured as well as unstructured) over commodity hardware. However, most of the current SQL-on-Hadoop solutions (Hive, Stinger, Impala, Drill, Hadapt, etc.) treat Hadoop in a very different way by enforcing schema (creating challenges for unstructured data), fast access through caching (creating costly memory requirements), etc. What are your thoughts on enabling SQL access for Hadoop? Is there a better way to make data in Hadoop accessible through SQL, while not restricting the true power of Hadoop? (Or, are we asking the wrong question here and SQL-on-Hadoop is a bad idea?)
SG: Yes. SQL on Hadoop is a bad idea.
When you put data into a structure, like SQL, you limit what you can do with the data in the future. Hadoop is the same thing for data as 3D printing is for manufacturing. It's absolutely disruptive. You have the raw material data. At any given time, with schema on read, you can extract what you want to get out there. Like a 3D printer, you press a button and it prints any kind of tool that you need.
It's really a question of waterfall versus agile in data analytics. SQL requires a waterfall design approach and you're limited. You will absolutely not hit your deadlines; you will always fail to deliver. Hadoop truly enables an agile analytics approach. You first collect all the data, and then you pull out whatever you want. Everything that is SQL on top of Hadoop is nonsense.
Oracle already figured SQL out. If they thought it would be cool to run SQL on top of Hadoop, they would have done it already. If you need a structured query language that, by the way, is forty years old, then use a database. You don't need Hadoop.
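For readers unfamiliar with the "schema on read" idea mentioned in this answer, the following is a minimal sketch of the concept, not anything taken from the interview or from Datameer's product. The sample records, the field names, and the helper `read_with_schema` are all hypothetical and chosen only for illustration; plain Python stands in for a Hadoop job.

```python
import json

# Hypothetical raw event log, stored exactly as it arrived.
# No schema is enforced when the data is written.
RAW_LINES = [
    '{"user": "ann", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "view"}',
    '{"user": "ann", "action": "click", "ms": 80, "campaign": "spring"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time.

    Each raw line is parsed, only the requested fields are kept,
    and fields missing from a record are filled with None.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# First question asked of the data: who did what?
actions = list(read_with_schema(RAW_LINES, ["user", "action"]))

# A later, different question over the same untouched raw data:
# which latency was recorded for which campaign?
latencies = list(read_with_schema(RAW_LINES, ["campaign", "ms"]))

if __name__ == "__main__":
    for row in actions:
        print(row)
    for row in latencies:
        print(row)
```

The point of the sketch is that the second question did not require reloading or restructuring the stored data; only the fields requested at read time changed. A schema-on-write system would instead fix the columns before loading.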
The third part of the interview will be published soon.
Anmol Rajpurohit is a software development intern at Salesforce. He is a former MDP Fellow and a graduate mentor for IoT-SURF at UCI-Calit2. He has presented his research work at various conferences including IEEE Big Data 2013. He is currently a graduate student (MS, Computer Science) at UC Irvine.