KDD-2014 report, part 2: The Magic Module network and Privacy vs Big Data

Here is part 2 of my report on KDD-2014, the biggest and the best Data Science meeting: The Magic Module genes, Privacy vs Big Data, and should we ask for consent of data subjects?

By Gregory Piatetsky, @kdnuggets, Sep 2, 2014.

This is part 2 of my report on KDD-2014 (Data Mining for Social Good), the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, held August 24-27, 2014, in New York City.

Here is part 1: KDD-2014 - The Biggest, Best, and Booming Data Science Meeting.

A novel and great feature of the conference was KDD Madness: 30-second, 1-slide summaries of all the talks, ably run by Aris Gionis (Aalto U.) and Jie Tang (Tsinghua U.). This gave enough of a flavor of each talk to help people decide where to go, and was probably just long enough to hold the attention of today's researchers, who would turn to their computers and smartphones as soon as they felt even a little bored.

Notably, MacBooks were more visible than Windows laptops. Despite 2000+ attendees, the KDD_ORACLE WiFi (thanks to Oracle, which sponsored it) worked very well. Perhaps too well, since many attendees were checking email or Facebook, or even running some cool R data mining scripts, during the talks. I can imagine an app that measures in real time the interestingness of a talk by changes in WiFi activity during the talk. Any takers?

KDD-2014 reception sponsored by Bloomberg, photo by @kdd_news

On the evening of Tue, Aug 26, Bloomberg sponsored a great reception at Pier 60 for all KDD attendees. The views were fantastic, the food excellent, and the drinks free, but it was hard to keep up a conversation because the music was much too loud.

On Wed, Aug 27, Eric Schadt, Director of the Icahn Institute for Genomics and Multiscale Biology, gave a keynote on
A Data Driven Approach to Diagnosing and Treating Disease. He came to KDD-2014 shortly after giving a talk at the Deepak Chopra foundation, so his work spans both Big Data and Spirituality!

Pathways associated with co-expression nodes, from @IcahnInstitute tweet

If I understood it correctly, Eric Schadt reported on an extremely significant potential discovery: his team found a kind of "magic module" network of genes connected to inflammation that is involved in many diseases, including Alzheimer's. Even more amazingly, this network is also active during meditation (hence the Chopra connection) and when people are given a placebo.

My selected tweets and notes from his talk:
  • Mendelian randomization as a path to causal inference - leverage natural variation to find causality
  • Random assortment of chromosomes is biological equivalent of data randomization, enables causal inference
  • beyond amazing visualizations, amazing potential for precision, personalized, better medicine
  • diseases are not independent - the human system is highly interconnected
  • key "magic module" network is involved in many diseases, even active in placebo effect
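To make the Mendelian randomization idea concrete, here is a minimal simulation sketch. It is not Eric Schadt's method or data: the effect sizes, the allele frequency, and the function names are all assumptions chosen for illustration. A randomly inherited genotype serves as a natural "randomizer" for an exposure, so the ratio of covariances (the Wald ratio) can recover the causal effect even when a hidden confounder biases the naive regression.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def simulate(n, true_effect, seed=0):
    """Simulate genotype, exposure, and outcome with a hidden confounder.

    All numeric settings below are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    genotype, exposure, outcome = [], [], []
    for _ in range(n):
        # Random assortment: 0, 1, or 2 copies of an allele (frequency 0.3).
        g = sum(rng.random() < 0.3 for _ in range(2))
        u = rng.gauss(0, 1)  # unobserved confounder affecting both x and y
        x = 1.0 * g + u + rng.gauss(0, 1)
        y = true_effect * x + u + rng.gauss(0, 1)
        genotype.append(g)
        exposure.append(x)
        outcome.append(y)
    return genotype, exposure, outcome


TRUE_EFFECT = 0.3
g, x, y = simulate(20000, TRUE_EFFECT)

# Naive regression slope of outcome on exposure: biased by the confounder.
naive_estimate = covariance(x, y) / covariance(x, x)

# Wald ratio: the genotype is independent of the confounder, so this
# ratio isolates the causal effect of exposure on outcome.
wald_estimate = covariance(g, y) / covariance(g, x)

print(f"true effect:    {TRUE_EFFECT}")
print(f"naive estimate: {naive_estimate:.3f}")
print(f"Wald estimate:  {wald_estimate:.3f}")
```

In this toy setup the naive estimate lands well above the true effect, while the genotype-based estimate lands close to it, which is the sense in which natural variation enables causal inference.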

The Wed, Aug 27 afternoon Panel, with leading researchers Rakesh Agrawal (Microsoft), Solon Barocas @s010n, Chris Clifton (Purdue), Corinna Cortes (Google), and Rayid Ghani (UIC and Edgeflip), focused on the question

Does Social Good Justify Risking Personal Privacy?

Rakesh proposed Hippocratic Data Systems, with controlled data management and privacy preserving analytics.

Chris Clifton argued data scientists need to worry about privacy issues now, otherwise they will eventually not have data or not be able to do data mining because of government regulations. He proposed to ask for informed consent of data subjects before doing data mining on their data.

Solon Barocas said that often the most socially useful inferences are also the most sensitive. He gave the example of a FICO predictor for prescription compliance, which was well-intentioned and effective, but became very controversial because of privacy issues.

Rayid Ghani said that data miners can know a little about you and predict the rest, and most people don't realize it. This power can be used for good (e.g., to improve healthcare or education) or for bad purposes. He gave the example of a model for predicting high-school drop-outs. If the model says Johnny is likely to fail and needs help with this subject, is this an invasion of privacy? Rayid pointed to problems with requiring subject consent in advance: data is frequently collected for transactions, and we usually don't know in advance how to use it for social good.

Here are my selected tweets and notes:
  • Rakesh Agrawal: choice between social good or privacy is a false choice; technology can help provide both
  • Rakesh Agrawal: when education info is online, many opportunities to improve education, but also risks to privacy
  • Chris Clifton: providing false data, being afraid to talk and share information, stops discourse and creates social harm
  • Chris Clifton: if we won't worry about privacy, we will eventually not have data or not be able to do data mining
  • Corinna Cortes: Google Health promised many benefits, but people did not want to upload their information
  • Corinna Cortes: need to focus on education of public that good can come out from sharing data - Denmark is an example
  • @RayidGhani: the larger world thinks "data mining" is looking at all possible correlations, while actually we look at specific targets
  • @RayidGhani: data miners can know a little about you, predict the rest; most people don't realize it. Can be used for good or bad
  • Chris Clifton: we can do controlled randomized experiments and respect privacy by asking for *informed* consent
  • Chris Clifton: data mining experiments involving people should employ ethical review boards
  • great question from the audience: When beginning a new data science project, what should you ask yourself to be sure you are preserving #privacy?
  • Rakesh Agrawal: guideline 1: assume that it is your data - would you feel comfortable with the data mining?
  • Rakesh Agrawal: guideline 2: if your data mining project is published, would you feel comfortable with it? If not, don't start
  • Rakesh Agrawal: guideline 3: if I collect data about a person, what does that person get in return?
  • Corinna Cortes: there are also good reasons to connect accounts - we at AT&T did this after Sep 11
  • Solon Barocas @s010n: post #Snowden there was a shift in fed govt response: debate went from "compliance" to "what is the right thing to do" #privacy
  • @RayidGhani: data is frequently collected for transactions; don't know in advance how to use it for social good
  • question: the beneficiaries of social good may be the same or different groups/people as those risking privacy. How to balance risks/benefits?
  • Chris Clifton: whether people are benefiting themselves or not from data analysis, they should have a choice
  • @RayidGhani: people can be manipulated; need to educate people about consequences of their choices
  • @RayidGhani: #datamining community for many years focused only on utility; it is good we now focus on #privacy issues; need more case studies
  • Corinna Cortes: when people think about #DataMining solving a specific task, they are less worried about it
  • Corinna Cortes: however we need #DataMining technology for finding all the correlations

The panel seemed to agree that there were no easy answers. If the goal is to help real people, anonymization may not help.

The panel was followed by an excellent and entertaining keynote from

Sendhil Mullainathan (Harvard), @m_sendhil, Bugbears or Legitimate Threats? (Social) Scientists' Criticisms of Machine Learning

Sendhil showed this picture (thanks to Xavier Amatriain @xamat for the tweet), which is a good example of the human contradictions studied in behavioral economics.

Fitness club

Prof. Mullainathan talked about how he would redo some of his earlier work, and focused on two barriers:
  • Barrier 1: Predicting "vs" Theory Testing
  • Barrier 2: Correlation vs Causation

He described his groundbreaking work analyzing the unemployment rate among US college graduates. He found a big gap by race. Was it caused by discrimination or by a skills gap?

His group sent out a large number of fake resumes, half with "white" names and half with "black" names. The shocking result was that resumes with "white" names had a 50% higher call-back rate than those with "black" names.
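As a rough illustration of how such a call-back gap can be checked statistically, here is a minimal sketch of a two-proportion z-test. The counts below are hypothetical, picked only so that the ratio of call-back rates is 1.5 (that is, 50% higher); they are not the numbers from Prof. Mullainathan's study, and the function name is my own.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare two call-back rates with a pooled two-proportion z-test.

    Returns (rate_a, rate_b, z, two_sided_p_value).
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# HYPOTHETICAL counts for illustration only (not the study's data):
# 1000 resumes per group, 90 vs 60 call-backs.
rate_white, rate_black, z, p_value = two_proportion_z_test(90, 1000, 60, 1000)
ratio = rate_white / rate_black

print(f"call-back rates: {rate_white:.1%} vs {rate_black:.1%}")
print(f"ratio of rates:  {ratio:.2f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

Because names are assigned to otherwise comparable resumes at random, a gap like this cannot be explained by a skills difference, which is what makes the design a test of discrimination rather than a mere correlation.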

More selected tweets from me and others from his talk:
  • You don't always need causal inference for decisions; many policy problems are prediction problems, like bail
  • It is not hard to come up with an algorithm that predicts crime much better than a human judge
  • with better crime prediction, crime rate can be cut in half, keeping the same number of people in jail #DataScience
  • good experiments can yield bad science: beware gaps between the theory's structural statement and hypotheses tested.
  • Use induction to test your theory variables: don't curate inclusion, curate exclusion.

One conclusion from Prof. Mullainathan's talk was that we frequently don't know in advance what data we need to collect, so asking data subjects for specific consent will significantly restrict research.

Among many other talks, I want to mention EMBERS: Forecasting Civil Unrest using Open Source Indicators.
EMBERS is an automated, 24x7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources. Unlike retrospective studies, EMBERS has been making forecasts into the future since Nov 2012 which have been (and continue to be) evaluated by an independent T&E team (MITRE).

For those who could not attend, the KDD-2014 proceedings are available in the ACM Digital Library. I was told that all sessions were video-recorded, so videos should also be available soon.

Just before the conference, I also attended the BPDM workshop: Broadening Participation in Data Mining, ably organized by Brandi Marshall and Caio Soares.

BPDM's goal is to increase diversity and the representation of minorities in data science, and there were many good students there with whom I had the pleasure of interacting.

As a fun and refreshing activity, several KDD organizers, including me, received and completed a KDD-2014 Ice Bucket Challenge.

At KDD-2014, I was surprised to find myself in the position of a minor celebrity, with students and non-students asking to take a picture with me.

KDD-2014 marks 25 years since the very first KDD-89 workshop, which I organized, and I could not have imagined 25 years ago how the field would grow and change.

Looking forward to the next 25 years of KDD success!

Other reports on KDD-2014