EMC Data Science Summit Report by Angeliki Papagiannopoulou

A report on highlights and insights from EMC Data Science Summit, exclusively for KDnuggets. No more talk about Big Data.

Here is a report, exclusively for KDnuggets, from EMC Data Science Summit by Angeliki Papagiannopoulou, a researcher in sentiment analysis, BI, and data mining.

EMC also made available videos from the DS Summit.

EMC Data Science Summit 2012

Data Science Summit 2012 Insights, by Angeliki Papagiannopoulou

"No more talk about Big Data" was the mandate that officially kicked off this year's Data Science Summit at the Venetian in Las Vegas; anyone who broke it was to be penalized with a sip from their drink. After this very important statement, the opening reception proceeded, emphasizing the urgent need to handle the fast-growing volumes of Big Data. To tackle Big Data, you need to combine problem solving with curiosity, based on a specific story.

Business executives, social media analysts, data scientists, system engineers, strategic marketers, data journalists, university professors and many other experts took the opportunity to get together, share their professional experiences, exchange academic approaches to data manipulation and interact with their peers, while also hearing about the scientific and business trends discussed in the different Summit sessions.

To begin with, Nate Silver, who writes the New York Times political/analytical blog FiveThirtyEight.com, explained the role of state-of-the-art prediction algorithms in all aspects of our lives. To profit from such practices, he pointed out, one needs to overcome obstacles such as noise in the signal and overfitting to the data. A roundtable followed on the impact of social data on privacy, another hot topic of the Summit. The speakers concluded that private is private and stays that way.

One of the most interesting sessions covered case studies in which big data clustering and classification prove helpful to humanity as well as to B2B and B2C companies. John Brownstein, Associate Professor at Harvard, demonstrated how specialists at his company, HealthMap, deal with early disease detection. Nora Denzel, Senior Vice President at Intuit, stressed the significance of constructing spending profiles and of big data for small companies, noting that 1/4 of US payrolls are delivered through the company's systems.

Tarek Kamil, Executive Director at InfoMotion Sports Technologies, showed how practices like attaching sensors to sports equipment may improve overall team performance. Last but not least, Oren Etzioni, Professor at the University of Washington, presented best practices in product comparison through the e-commerce solutions implemented at Decide.com, where approximately 5000 products are ranked to offer best-buy recommendations.

How does an organization become data-driven? Answers were provided by Michael Chui, Senior Fellow at McKinsey Global Institute. He pinpointed the effective use of big data across the economy and the competition in the field of its processing. Moreover, he urged organizations to incorporate data from outside their own walls, and he contrasted the data analysis and prediction techniques currently employed with old-school Business Intelligence (BI). Data need to be segmented and differentiated, for instance into retail and business data. Finally, educators, as well as policy and decision makers, have to build up their competence for the new era.

Old-school BI was also questioned by Piyanka Jain, President of Aryng, who shifted the focus from BI as Business Intelligence to BI as Business Impact. Drawing on her long-standing presence in business analytics, and on the interplay between calculus and conditional probability, she envisages decisions that are based on data. This view goes along with a natural progression from being data savvy to being intelligence savvy.

Big data carry along too much information; in other words, it is important to get the gist out of them. Jeremy Howard, the leading prediction data scientist according to Kaggle.com, spoke about the importance of "creepy" data, which provide the most beneficial information. Tony Jebara, Associate Professor at Columbia University, stressed the importance of handling outliers, which results in discarding noise. All in all, it is hard to build new data and insights and to separate correlation from causality. Yet again, privacy remains a major issue.

Data science is a multi-disciplinary field after all, and it comes down to statistics, programming and curiosity. Berkeley and Stanford are working on a project to bring data to the people through database reasoning and human-computer interaction. Since around 1999, the community has been studying the traces of human activity by examining the topology of the internet and, specifically, social media data, mostly through graph-based representations, as asserted by Jure Leskovec, Associate Professor at Stanford University. Going beyond the analysis of such current trends, people in his group collect data from companies in order to predict future ones, for example new Facebook friendships.

Finally, it is worth mentioning that behavioral and cognitive scientists take advantage of data visualization systems, and that Jonathan Harris took up visualization by presenting his personal journey through data handling. Furthermore, with examples from his own experience in the broader field of data science, he affirmed the multidisciplinary nature of making sense of big data. So, to wrap up the motivations that arose at this year's Data Science Summit, I would propose taking action by starting to substantiate all these findings and conclusions, and this would mean: "No more talk about big data".