Key Takeaways from KDD 2018: a Deconfounder, Machine Learning at Pinterest, Knowledge Graph
Highlights and key takeaways from KDD 2018, the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining: what a deconfounder is, how Pinterest approaches Machine Learning, a Knowledge Graph for Products, and Differential Privacy.
Last month I attended KDD 2018, the premier interdisciplinary conference bringing together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data. This year about 3,300 delegates (an all-time high!) from 99 countries attended the conference. I’m happy to share that in 1989, KDnuggets Founder and President, Dr. Gregory Piatetsky-Shapiro, organized the first KDD meeting - the workshop on Knowledge Discovery in Data (KDD-89), held at IJCAI-1989 in Detroit, MI.
Here is the summary of key points from selected talks at KDD 2018:
Note: The speaker slides have not been shared yet, so I am including the pictures of slides I took, which are far from perfect.
The conference began with a keynote from Dr. Jeannette M. Wing, Professor of Computer Science, Columbia University, on “Data for Good”. She started with an overview of the various phases of the data life-cycle (Generation -> Collection -> Processing -> Storage -> Management -> Analysis -> Visualization -> Interpretation), while explaining the privacy and ethical concerns throughout the life-cycle. She summarized classical causal inference as:
- Confounders affect both the causes and the outcomes.
- We should correct for all confounders in causal inference, which in theory requires measuring all confounders.
- But whether we have measured all confounders is (famously) untestable.
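The role of confounders can be illustrated with a small simulation (my own illustrative sketch, not from the talk): a hidden variable `z` drives both the cause `x` and the outcome `y`, so regressing `y` on `x` alone overstates the causal effect, while adjusting for `z` recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# z is the confounder: it drives both the cause x and the outcome y.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)              # cause, influenced by z
y = 1.0 * x + 3.0 * z + rng.normal(size=n)    # outcome; true effect of x is 1.0

# Naive regression of y on x alone is biased by the unadjusted confounder.
naive = np.polyfit(x, y, 1)[0]

# Correcting for the confounder (regressing on both x and z) recovers the effect.
X = np.column_stack([x, z, np.ones(n)])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][0]

print(f"naive slope:    {naive:.2f}")     # inflated by the confounder
print(f"adjusted slope: {adjusted:.2f}")  # close to the true effect of 1.0
```

The catch, as the keynote noted, is that this only works when `z` is measured; the deconfounder idea is aimed at the case where it is not.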
She introduced the interesting idea of a deconfounder and its merits, as shown in the picture below.
Dr. Wing shed light on research and education opportunities to excite the research-minded audience. Lastly, she described the pervasiveness of data science applications across Columbia University.
Grace Huang (Data Science Manager @ Pinterest) gave an insightful talk on “The Pinterest Approach to Machine Learning”, in which she highlighted the following key lessons learned by her team:
- Beware of data and system bias
- Testing and monitoring are a must
- Good infrastructure speeds up iteration
- Measurement and understanding are crucial
- Build a sustainable ecosystem
- Design a ML minded product, and a product minded ML system
While explaining the second lesson, she emphasized that “Offline Data Distribution != Online Data Distribution”. When her team migrated from GBDT to a neural network, the migration caused data corruption that went undiscovered for several weeks because the failures were silent.
She also emphasized that “Offline Performance != Online Performance”, and thus the final bar must always be running models on live traffic.
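Silent offline/online mismatches of this kind are often caught with a simple distribution-drift check. Here is a hedged sketch (my own illustration, not Pinterest's actual tooling) that compares an offline training sample of a feature against live traffic using the Population Stability Index:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between an offline (expected) and an online
    (actual) sample of one feature; > 0.2 is a common 'investigate' threshold."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                  # cover the whole line
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
offline = rng.normal(0.0, 1.0, 50_000)         # feature as seen at training time
online_ok = rng.normal(0.0, 1.0, 50_000)       # live traffic, unchanged
online_shifted = rng.normal(0.5, 1.3, 50_000)  # live traffic after a silent shift

print(f"PSI (no shift): {psi(offline, online_ok):.3f}")
print(f"PSI (shifted):  {psi(offline, online_shifted):.3f}")
```

Wiring a check like this into serving-side monitoring is one way to turn a silent failure into a loud one.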
Xin Luna Dong, Principal Scientist @ Amazon, gave an interesting talk on “Building a Broad Knowledge Graph for Products”. Through intuitive examples related to movies, Dr. Dong shared how a knowledge graph is different from a product graph. The mission of the Product Graph is to answer any question about products and related knowledge in the world. She explained how the Product Graph (PG) is related to a generic Knowledge Graph (KG), as shown below.
She summarized the challenges in building Product Graph into four categories:
- No major sources to curate knowledge from
  - Wikipedia does not help much
  - A lot of structured data is buried in text descriptions in the catalog
  - Retailers game the system, so the data is noisy
- Large number of new products every day
  - Curation is impossible
  - Freshness is a challenge
- Large number of product categories
  - A lot of work to manually define an ontology
  - Hard to catch the trend of new product categories and properties
- Many entities are not named entities
  - Named Entity Recognition does not apply
  - New challenges for extraction, linking, and search
She briefly covered some papers published to tackle these challenges while building the Product Graph. The knowledge for the Product Graph comes from Amazon marketplace data, the World Wide Web, and some structured data. Refer to the picture below to understand the building of a broad graph:
Dr. Dong and her team at Amazon aim to build an authoritative knowledge graph for all products in the world. In this pursuit, they follow the research philosophy shown below.
Joseph Sirosh, Corporate Vice President, Cloud AI Platform @ Microsoft, gave an inspiring talk on “Planet-Scale Land Cover Classification with FPGAs”. He walked through how FPGAs are used within Microsoft, and how we can tap the power of FPGAs for real-time AI. DNN inference is challenging to deploy cost-effectively in large-scale online services that demand low latency and good price/performance. Project Brainwave is a hardware architecture designed to enable high-performance, real-time AI computations; it is deployed on field-programmable gate arrays (FPGAs) and fundamentally transforms latency and price/performance for large-scale use of DNNs. He shared several examples of the application of this technology, including one where it helped stop poaching.
In summary, he shared the following takeaways:
- Cloud FPGAs enable real-time AI inferencing at planet scale
- Synthetic training data helps when real training data is scarce and expensive
- AirSim (see: Github) can help synthesize vast amounts of visual data.
- Deep domain adaptation makes the imagery closer to real training data
John M. Abowd, Chief Scientist and Associate Director for Research and Methodology, U.S. Census Bureau, gave a very interesting talk on “The U.S. Census Bureau Adopts Differential Privacy”. Earlier this year the U.S. Census Bureau declared that it is adopting differential privacy for the data used to redraw electoral districts. Dr. Abowd started by sharing a brief history of differential privacy at the U.S. Census Bureau. In 2008, the Bureau deployed the first production implementation of differential privacy worldwide (link).
In 2018, the U.S. Census Bureau undertook the first large-scale implementation of central differential privacy worldwide.
Referring to the database reconstruction theorem, he emphasized that:
- Powerful result from Dinur and Nissim (2003) [link]
- Too many statistics published too accurately from a confidential database expose the entire database with near certainty
- How accurately is "too accurately"?
- Cumulative noise must be of the order √N
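To make the noise-injection idea concrete, here is a minimal sketch (my own illustration, not the Census Bureau's actual system) of the standard Laplace mechanism for a differentially private count, where smaller ε means more noise and stronger privacy:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical confidential microdata: one row per person, 1 = has the attribute.
data = rng.integers(0, 2, size=10_000)
true_count = int(data.sum())

def laplace_count(data, epsilon):
    """ε-differentially-private count. One person changes the count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/ε gives ε-DP."""
    return data.sum() + rng.laplace(loc=0.0, scale=1.0 / epsilon)

for eps in (0.1, 1.0, 10.0):
    print(f"eps={eps:>4}: true={true_count}, noisy={laplace_count(data, eps):.1f}")
```

The accuracy/privacy trade-off is visible directly in the scale parameter: publishing many such statistics too accurately (too little noise) is exactly what the reconstruction theorem exploits.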
While sharing data from the 2010 Census of Population and Housing, he demonstrated that the database reconstruction theorem is the death knell for traditional data publication systems from confidential sources. He also discussed the pros and cons of Disclosure Avoidance System relying on Noise Injection for formal privacy rules.
There is on-going research and debate on the optimal solution for the social choice problem. The Marginal Social Benefit (MSB) is the sum of all persons' willingness-to-pay for data accuracy with increased privacy loss. The marginal rate of transformation is the slope of the privacy-loss vs. accuracy graphs. This is exactly the same problem being addressed by Google in RAPPOR, Apple in iOS 11, and Microsoft in Windows 10 telemetry.
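The local-privacy approach used by systems like RAPPOR can be sketched with the classic randomized-response mechanism (an illustrative simplification of mine, not Google's actual encoding): each user flips their true bit with some probability, and the aggregator debiases the reported rate.

```python
import numpy as np

rng = np.random.default_rng(7)

def randomized_response(truth, p=0.75):
    """Each user reports their true bit with probability p and the flipped bit
    otherwise, so no individual report reveals the truth with certainty."""
    flip = rng.random(truth.shape) >= p
    return np.where(flip, 1 - truth, truth)

# Hypothetical population: 30% truly have the sensitive attribute.
truth = (rng.random(200_000) < 0.30).astype(int)
p = 0.75
reports = randomized_response(truth, p)

# Debias: E[report] = (1 - p) + q * (2p - 1)  =>  solve for the true rate q.
estimate = (reports.mean() - (1 - p)) / (2 * p - 1)
print(f"reported rate: {reports.mean():.3f}, debiased estimate: {estimate:.3f}")
```

Here privacy loss is set by `p`: as `p` approaches 0.5 each report is pure noise (maximal privacy), while a larger `p` trades privacy for accuracy, the same trade-off framed above in social-choice terms.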
He shared the following graph, which shows social optima in production possibilities for alternative mechanisms. In the diagram, MSB stands for Marginal Social Benefit and MSC for Marginal Social Cost.
KDD 2018 offered an excellent insight into the current research in the areas of Machine Learning, Deep Learning, and Artificial Intelligence. There were over a dozen tracks running in parallel, so choosing which session to attend was a tough decision. There was certainly a lot of quality content densely packed during the conference. Most of the speakers were key leaders of their field and their sessions reflected a strong sense of purpose, a drive for innovation, and rigor for quality research (and applications). The conference provided ample opportunities to socialize with like-minded people from around the world. After this great experience at my first KDD, I would highly recommend it to anyone who wants to learn about cutting-edge research in this space. Looking forward to KDD'19 in Anchorage, Alaska!
- Journey to Machine Learning – 100 Days of ML Code
- Deep Conversations: Lisha Li, Principal at Amplify Partners
- Populating a GRAKN.AI Knowledge Graph with the World