Interview: Ted Dunning, MapR on Apache Mahout & Technology Landscape in ML

We discuss Apache Mahout, its comparison with Spark and H2O, trends, advice, desired qualities in data scientists and more.

Ted Dunning is Chief Applications Architect at MapR Technologies, a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects, and a mentor for the Apache Storm, DataFu, Flink and Optiq projects. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (LifeLock), and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

First part of interview

Here is the second and final part of my interview with him:

Anmol Rajpurohit: Q7. Which trends in the space of recommender systems appear the most interesting to you? Why?

Ted Dunning: I think that the idea of synthetic indicators is very exciting. These are tags that can be applied to content and users based on external properties or relations that allow some pretty amazing capabilities.

AR: Q8. What motivated you to work on Apache Mahout? How do you compare Mahout with Spark and H2O?

TD: Well, some good friends asked me to answer some questions. From there it was a downhill slope. First a few questions to be answered. Then some code to be reviewed. Then a few implementations. Suddenly I was a committer and was strongly committed to the project.

With respect to Spark and H2O, it is difficult to make direct comparisons. Mahout was many years ahead of these other systems and thus had to commit early on to much more primitive forms of scalable computing in order to succeed. That commitment has lately changed and the new generation of Mahout code supports both Spark and H2O as computational back-ends for modern work.
That inter-relationship makes direct comparison even harder in some ways. I think that there is so much to work on in machine learning that it is hard to say that one project is directly competitive with another when, in fact, they actually work together in many ways.

Clearly Mahout has a huge lead over the other systems in the way that it compiles linear algebra expressions into efficient programs for back-ends like Spark (or H2O). Clearly also, H2O has a huge lead over Spark's MLlib in terms of numerical performance and sophisticated learning algorithms. Mahout is also the only system that fully supports indicator-based recommendation systems, which is a huge difference as well.

AR: Q9. What is the best advice you have got in your career?

TD: I have been lucky enough to have had a large number of people who have helped me over the years and I don't think that I could distill that help into a single bit of advice. One inspiration that I have had from a number of mentors over and over again is to maintain a sense of wonder and curiosity about the world. Many of the things that are most quoted in my work have been things that I learned about while talking to people who are experts in fields other than my own. The LLR test came from some astrophysicist friends. The t-digest came from clustering. Cross recommendations came from symmetry considerations.

AR: Q10. Is "talent crunch" a real problem in Big Data? What has been your personal experience around it?

TD: Yes. The talent crunch is a real problem. But finding really good people is always hard.

People over-rate specific qualifications. Some of the best programmers and data scientists I have known did not have specific training as programmers or data scientists. Jacques Nadeau leads the MapR effort to contribute to Apache Drill, for instance, and he has a degree in philosophy, not computing. One of the better data scientists I know has a degree in literature. These are widely curious people who are voracious learners. Combine that with a good sense of mathematical reasoning and a person can go quite far.

Limiting your hiring to people who have a CS degree from a top-10 university and 5-10 years of experience in exactly what you want them to do makes it very hard to hire good people and very much limits how many new ideas they can bring to you.

A great example of this same bias happens when people ask questions in interviews to which they already know the answer. I don't want to hire people who know what I know. I want to hire people who know what I don't know. If I learn something important from a candidate during an interview, that is one of the best indications that they are a good hire. If they learn from me, I don't consider that a great indicator.

AR: Q11. What key qualities do you look for when interviewing for Data Science related positions on your team?

TD: I want people who are switched-on, curious about things, willing to try new things and who are willing to tell me when I am wrong (hopefully somewhat gently). I also want people who get things done and understand the value of simplicity.

AR: Q12. What was the last book that you read and liked?

TD: That is a really hard question, partly because I love reading. I read "What is Life?" by Erwin Schrödinger and was really fascinated by how he could drive to the heart of problems. Even in the early 1940s, and even as a physicist rather than a biologist or chemist, he was able to think clearly and succinctly and surprisingly correctly about the mechanisms of life. Very impressive. It is also very impressive that he could write so very well even using a second language which, as he put it, can never fit as well as the original.