Rich Data Summit Takeaways

Data scientists get excited about algorithms. But nearly all time spent working with data involves acquiring, pipelining, annotating and cleaning it. At the Rich Data Summit in SF, data's dirty work took center stage.

Last Wednesday, I attended the Rich Data Summit in San Francisco, an event hosted by crowdsourcing startup CrowdFlower. By "rich data", the organizers mean "high quality data". Advertisements for the summit cited the ubiquitous gripe that data scientists spend 80% of their time acquiring, cleaning, annotating, and processing data, leaving only 20% of their time for algorithmic and analytic work. Unlike academic conferences, which focus exclusively on the intellectual work, this one set out to focus on the plumbing.


Staged at SF's The Village and exceptionally well-organized, the summit mainly featured a series of talks that ran continuously from morning until evening, and a trade-show style demo session that ran in parallel. Demos from sponsors included publisher O'Reilly, machine learning platform Dato, security / fraud detection platform SMYTE, and in-memory database startup MemSQL. The speakers reflected a broad set of backgrounds, with CrowdFlower CEO Lukas Biewald and celebrity journalist Nate Silver delivering the opening keynotes. Subsequent speakers included Silicon Valley CEOs, data scientists, and university professors.

Generally, the event had the flavor of a TED event but focused squarely on data science. Talks were high-level and technical details scarce. But many of the best talks were still concrete, making convincing demonstrations. These included MetaMind CEO Richard Socher's image classification demonstration, CrowdFlower's new service for active learning, and Idibon's human-in-the-loop systems for natural language processing, demonstrated by CEO Robert Munro. Other speakers, such as Uber's Silvanus Lee, demonstrated specific business problems, giving a flavor of the kind of data they work with. In contrast, the less exciting talks were those with the flavor of motivational speeches, issuing big data platitudes with low information density. The following paragraphs capture the most salient takeaways.


Not Just Ads

In his keynote, when discussing the rise of data science as an industry, Biewald invoked the Jeff Hammerbacher quote, "the best minds of my generation are thinking about how to make people click ads. That sucks." The implication was that these best minds were largely machine learners and data scientists. Biewald went on, suggesting that data science has moved beyond ads and that machine learning and data science now touch a much wider range of applications. The summit's broad participation certainly supported his claim. While ad/marketing firms were represented with a talk by 6sense CEO Amanda Kahlow, other participants discussed government, speech recognition, the sharing economy, fraud prevention, and more.

Humans in the Loop

Not surprisingly, a prominent theme of the conference was data science with humans in the loop. I lost count of how many talks restated something to the effect of "let computers do what they do best, let humans do what they do best." This sentiment was clear in talks ranging from the keynote by Nate Silver, who has achieved celebrity as a political and sports journalist by pairing a small toolbox of statistical tools with his own deep reservoir of common sense and domain expertise, to the talk from Idibon CEO Robert Munro, who described tools for using active learning with deep learning.

In some ways, it seems backwards that the next phase of computing should have humans in the loop. Presumably, over time, computers should become more autonomous, and in the end not depend on human intervention. In that sense, a push towards processes with humans in the loop seems to be a regression.

However, as I thought about this more, I shifted perspectives. Namely, data science work now depends heavily on labeled datasets. In this sense, humans are not only in the loop but indiscriminately in the loop in a brute force manner. A shift towards an active learning approach could represent a progression towards more autonomous systems, requiring less human involvement.
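
To make the contrast concrete, here is a minimal sketch of one common active learning strategy, uncertainty sampling, in which only the examples the model is least sure about are sent to human annotators. It is an illustration under my own assumptions, not the method used by any of the speakers or vendors; the function name `select_for_labeling`, the `budget` parameter, and the scores are all hypothetical.

```python
def select_for_labeling(probabilities, budget):
    """Return indices of the `budget` items the model is least sure about.

    `probabilities` holds a binary classifier's P(positive) for each
    unlabeled item (hypothetical values, for illustration only).
    """
    # Uncertainty for a binary classifier: how close P(positive) is to 0.5.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


# Hypothetical model scores for six unlabeled items.
scores = [0.97, 0.52, 0.08, 0.45, 0.88, 0.61]

# Only these items would be routed to human annotators.
to_label = select_for_labeling(scores, budget=2)
print(to_label)
```

With a labeling budget of two, only the two items scored closest to 0.5 go to annotators, rather than all six. Brute-force labeling would have a human look at everything, including the items the model already handles confidently.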

The Low Hanging Fruit is Really Low

Coming out of an academic machine learning conference, I'm often awed by the remarkable capabilities of modern machine learning systems. While there's much more research to be done, the field feels increasingly mature, and new discoveries are generally not trivial. But the successful applications tend to be in domains like image classification and machine translation, where clean data is abundant.

At the Rich Data Summit, as I took in the talks, chatted with start-up founders, and played with the demos, I realized that many companies are still working out how to acquire and pipeline the correct data, and how to develop the software infrastructure to handle real-time data analysis. For these companies, even the simplest machine learning practices still hang higher than the lowest-hanging fruit.

Data Ethics is Nascent

While the data science community races to find creative ways to acquire data, the rest of the world is increasingly concerned about how their data might be used. However, it seems that industrial thinking on this topic is still immature. In a talk about data privacy, chief data scientist Daniela Braga suggested that sensitive voice information could be anonymized by subtracting the fundamental frequency from each recording. A technical glitch prevented playback, but she had prepared a demo that, we were assured, would present voices that differed in raw audio yet were indistinguishable to the human ear. This might seem superficially convincing, but as a machine learning practitioner, I'm pretty confident that I could train a classifier to re-identify these voices.

"What is a Data Scientist?" Remains an Open Question

At least three different speakers led with some variation on "how many people in the audience are data scientists?" At these times, I'd watch a number of people look cautiously around the room, trying to decide if the term applied to them. In the past, I've written satirically about how comically vague and inflated the term has become (Will the Real Data Scientists Please Stand Up?).

I had a chance to ask Biewald for his view on the term, and who he had in mind when organizing a data science conference. According to Biewald, "I don't feel protective about the term. If people want to call themselves that, they can. I think that what happened is you have a lot of people who call themselves statisticians or data analysts. And I watched this happen when I was in school. As data has grown from thousands of rows to billions of rows, it becomes a lot less important to do fancy tests and more important to know, can you actually write a query that will get at this data? So I think the whole field of statistics became a lot more computational. You had a lot of people from computer science that ended up being able to do a lot more interesting statistics than could have been done before. So in my view that's why they thought of a new term."

While this might not constitute a definition, it's as reasonable an explanation of the term as I've heard.

On the topic of who is or isn't a data scientist, I was initially puzzled over Nate Silver's inclusion as a keynote speaker. For the record, I admire his journalism and looked forward to meeting him. He has a rare combination of domain expertise, an ability to combine prior knowledge with simple statistical tools to make accurate predictions, and a gift for writing clearly for a wide audience. Amid a dearth of statistical literacy in journalism, Nate Silver is a shining example of quantitative journalism done right. At the same time, he's not exactly a data scientist. Many of his datasets consist of a single example for which the label is unknown. He is simultaneously the most famous statistician in the world, and yet not exactly a statistician.

In his talk, Silver broadly dismissed the trend towards big data and correlative predictive models in favor of models built solidly atop a firm grasp of causality. This might be sensible in Nate Silver's work, but not in other domains. I can't imagine how long it would take us to build a working speech recognition system if we had to understand a causal relationship between each individual audio sample and the predicted phonemes. While Silver's approach requires clean data for which causal relationships are already known, the strength of machine learning is often its ability to function in domains where data is neither clean nor well-understood.

In our conversation, Biewald offered a perspective on the choice: "I think he's done more for statistical literacy than anyone else. He doesn't use a lot of machinery but I think his thought process is very good. He's also very good at combining quantitative data with qualitative data. And at the same time, he has total cross-over mainstream appeal. It's so amazing that he doesn't have to dumb down the statistics to have mass appeal. I think that he's doing something that most people have to do in the real world, which is combining a lot of data sources when it's not really clear how you would want to combine them." On the topic of the connection between Silver's simple stats methodology and industrial big data work, Biewald added, "clearly he has a point of view that you should use a simpler tool. I have data scientists that work for me and I think they tend to throw too much machinery at problems. ... Really good data scientists realize that every piece of complexity you add has a huge hidden cost."

Zachary Chase Lipton is a PhD student in the Computer Science and Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs and as a Machine Learning Scientist at Amazon, is a Contributing Editor at KDnuggets, and has signed on as an author at Manning Publications.