KDD-2014 – The Biggest, Best, and Booming Data Science Meeting

KDD-2014 was the largest (over 2,300 people) and the best Data Science meeting, highlighting the huge progress Data Science has made with Big Data, and its even more amazing potential.

By Gregory Piatetsky, @kdnuggets, Aug 28, 2014.

I have just attended KDD-2014, the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, held August 24-27, 2014, in New York City.

This is part 1 of the report - here is part 2.

With an amazing 2,300 people attending (twice the size of KDD-2013), KDD showed that it was not just the leading research conference in Data Mining, Data Science, and Knowledge Discovery, but also the largest conference in the field. Kudos to KDD-2014 general chairs Claudia Perlich and Sofus Macskássy for excellent organization.

There were 151 research papers accepted from a record 1,036 submissions.

For the first time at KDD, over 50% of attendees were from industry. Attendees came from a record 52 countries, with most from the US (1,506 people), followed by China (86), Japan (57), and Germany (40). There is more data mining on KDD statistics in the second part of my post.
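
As a quick check on the figures above, here is a minimal sketch that turns the raw counts into percentages. The counts come from this report; the helper name `share` is only illustrative, and the attendee total of 2,300 is a rounded figure, so the US share is a rough estimate.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a fraction between 0 and 1."""
    return part / whole


# Research papers: 151 accepted out of 1,036 submissions.
acceptance_rate = share(151, 1036)

# US attendees: 1,506 out of roughly 2,300 (the total is approximate).
us_share = share(1506, 2300)

print(f"Acceptance rate: {acceptance_rate:.1%}")  # about 14.6%
print(f"US attendees:    {us_share:.1%}")         # about 65.5%
```

So fewer than one in six submitted research papers was accepted, and roughly two thirds of attendees came from the US.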

For a detailed overview of the papers, the review process, which keywords were more likely to get papers accepted or rejected, and other statistics, see a great presentation compiled by PC Co-chair @Jure Leskovec.

Jure and PC co-chair Wei Wang created a very full program, but with up to 6 parallel tracks I frequently wanted to be in 3 places at once!

The success of the conference was very evident on the first day, with part of the workshops held in the beautiful new Bloomberg headquarters. Bloomberg corporation and its CEO @DanDoctoroff deserve a lot of gratitude from data scientists for their support of KDD-2014 and its theme, Data Science for Social Good.

There were many presentations and talks about social good, including
  • Workshop on Data Science for Social Good from
  • @DrewConway talked about using Data Science in New York City, and how simple analytics led to a big improvement in FDNY effectiveness
  • Nathan Eagle gave a very impressive presentation about Jana, a mobile platform with a huge presence (~3.5 billion users) in developing countries. Jana lets users watch mobile ads on Android phones in exchange for airtime credits. This works for both users and advertisers, and creates a lot of social good in the process. Because it is so large, the Jana mobile network also has interesting additional applications, like detecting earthquakes or epidemics.
  • presentations from Data for Social Good fellowships run by @RayidGhani

To remind us that, alas, not all is nice and good in our world, Rand Waltzman from DARPA gave a rather scary talk about Information Environment Security: how rumors in social media can be started and detected, and how some terrorist organizations are surprisingly effective users of social media for their ends.

However, the core of the conference was serious and excellent data science. Many popular sessions and workshops were packed wall-to-wall, with barely a place to lean against!
Some technical topics that I found especially notable/popular include:
  • Deep Learning, with 2 tutorials and several talks
  • Social Networks and graph analytics (popular for the last 10 years, and even more so this year)
  • Topic modeling
  • Recommendations
  • Workforce analytics

KDD presented several significant awards, including Innovation, Service, Test of Time, Best Papers, and Best Dissertation awards.

See also my interview with Pedro Domingos, Winner of KDD 2014 Data Mining/Data Science Innovation Award.

Pedro Domingos's Innovation Award talk on Sunday, Aug 24, was on Scalable Data Science and Very Large Scale models. He drew an analogy between Very Large Scale models and VLSI: we are now beginning the transition to very large scale models. He introduced and explained 3 principles for very large scale models and proposed Markov Logic-based Sum-Product Networks as a way to build Sum-Product models efficiently.

Here are my selected tweets from his talk (with prefix "#kdd2014 Pedro Domingos" removed):
  • like with VLSI, design is independent of fabrication. KDD is now going thru a similar transition with Very Large Scale models
  • going from a Large Model (customer, neuron, organism) to a Very Large Model (Social Network, Brain, Ecosystem)
  • principle 1 for building Very Large Models: model the whole, not just the parts
  • people (customers) influence each other - model the whole network, not each person separately
  • model interactions and relationships using probabilistic Markov Logic Networks
  • principle 2 for building Very Large Models: tame complexity by hierarchical decomposition
  • most #DataScience models were not hierarchical because there was little need
  • the world is hierarchical, with many taxonomies and we can exploit it to make inference tractable
  • we can make 2 assumptions: subparts are independent given the part; probability for class is avg over subclasses
  • using hierarchy and 2 previous assumptions makes our inference tractable
  • building Very Large Models can be done efficiently using Sum-Product Networks
  • Markov Logic Network + Sum-Product Theorem = Tractable Markov Log
  • principle 3 for building Very Large Models: Time and Space should not depend on data size
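
To make the two assumptions in the tweets above concrete, here is a toy sketch of how a sum-product structure can be evaluated. It is not code from the talk: the class names, the customer example, and all the probabilities are hypothetical, chosen only for illustration. A product node encodes "subparts are independent given the part", and a sum node encodes "the probability for a class is a weighted average over its subclasses".

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    """Distribution over the values of one discrete variable."""
    var: str
    probs: Dict[str, float]


@dataclass
class Product:
    """A part whose subparts are independent given the part."""
    children: List["Node"]


@dataclass
class Sum:
    """A class whose probability is a weighted average over subclasses."""
    weighted_children: List[Tuple[float, "Node"]]


Node = Union[Leaf, Product, Sum]


def evaluate(node: Node, evidence: Dict[str, str]) -> float:
    """Probability of the evidence; unobserved variables are summed out."""
    if isinstance(node, Leaf):
        if node.var not in evidence:
            return 1.0  # leaf probabilities sum to 1, so the variable drops out
        return node.probs.get(evidence[node.var], 0.0)
    if isinstance(node, Product):
        result = 1.0
        for child in node.children:
            result *= evaluate(child, evidence)
        return result
    return sum(w * evaluate(child, evidence) for w, child in node.weighted_children)


# Hypothetical example: a "customer" class with two subclasses.
student = Product([
    Leaf("device", {"phone": 0.8, "laptop": 0.2}),
    Leaf("plan", {"free": 0.9, "paid": 0.1}),
])
professional = Product([
    Leaf("device", {"phone": 0.3, "laptop": 0.7}),
    Leaf("plan", {"free": 0.2, "paid": 0.8}),
])
customer = Sum([(0.4, student), (0.6, professional)])

p_joint = evaluate(customer, {"device": "laptop", "plan": "paid"})
p_paid = evaluate(customer, {"plan": "paid"})
print(round(p_joint, 3), round(p_paid, 3))
```

Each query is answered in a single pass over the structure, visiting every node once, which is the sense in which the hierarchy plus the two assumptions keeps inference tractable in this toy setting.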

The conference program was packed with interesting research and industry presentations, sometimes with up to 6 parallel activities, so I can only mention a few.

On Monday, Aug 25, Oren @Etzioni, a distinguished researcher, entrepreneur, and now the director of the Allen Institute for AI, gave a controversial keynote on "The Battle for the Future of Data Mining".

He argued that the traditional data-driven approach, including Deep Learning, is limited in its potential, and proposed a knowledge-driven method: building a very complex knowledge base. As a test of this approach, he is developing an open question answering system which will be able to answer questions from 4th grade science exams. Here are some of my tweets from his talk (with "#kdd2014 Oren Etzioni" removed from the 2nd and following tweets):
  • #kdd2014 Oren @Etzioni Allen Institute for #AI wants to go beyond IBM #Watson. Next step - answer 4th grade science test
  • what is next after #BigData wave crests?
  • you cannot play 20 questions with Nature and win (Newell) - need deeper Artificial Intelligence

On Tue, Aug 26, Eric Horvitz, @EricHorvitz, Director of Microsoft Research, talked about "Data, Predictions, and Decisions in Support of People and Society". Here are some of my tweets from his keynote (with the repeated part "@EricHorvitz #kdd2014 keynote" deleted for brevity):
  • Renaissance of rich representations - amazing progress in speech translation
  • early work on forecasting future traffic, surprise problems in Seattle
  • companies should not be slurping data willy-nilly, but work with user cooperation #BigData #privacy
  • MSR built a real Azure "cloud" service to predict winds using airplane info
  • Readmissions Manager product predicts patient readmission within 30 days
  • Interpretability vs Power trade-off; a sweet spot is to capture pair-wise interactions
  • 44,000-98,000 preventable deaths/year in US
  • 1 in 250 people query on top 100 drugs in US
  • web-scale pharmacovigilance - use web searches to detect side-effects of drugs
  • sodium content of downloaded recipes correlates with hospital admissions for heart failure
  • cell call metadata can be used to detect an earthquake's size and location
  • AI-D.org - #AI for development, help in less-developed countries
Here are his slides: Data, Predictions, and Decisions in Support of People and Society (PDF).
Both keynotes were covered in NYTimes: Looking to the Future of Data Science.

Also, I just found a very nice page with KDD-2014 Tweet, Photo, and Video highlights, created using the seen.co platform. According to that page, the top Twitter accounts for #kdd2014 were @Bloomberg, @kdnuggets, @DataKind, @DanDoctoroff, and @erichorvitz.

Here is Part 2 of my report: The Magic Module network and Privacy vs Big Data.