Strata SF day 1 Highlights: from Edge to AI, scoring AI projects, cyberconflict, cryptography
Journey from “Edge to AI”, scoring your AI projects, cyberconflict, role of cryptography in AI and more insights from a leading conference.
Last week, data scientists, analysts, executives, engineers, developers, and AI researchers from a wide range of industries flew into San Francisco, California, the city of seven hills. Each of the more than 1,000 attendees was super pumped to share and learn about the emerging trends that are transforming data and businesses. I was one of them, and I would like to share some key takeaways with the data science community around the globe.
In my opinion, the Strata Data Conference offers a perfect intersection of cutting-edge science and evolving business models. The conference featured more than 300 speakers, 10 keynotes, 10 tutorials, and 150+ technical sessions.
Key takeaways from Day 1 (March 27, 2019):
After a welcome keynote from the conference organizers, Amy O’Connor, Chief Data and Information Officer, Cloudera, gave a talk on The Journey to the Data-Driven Enterprise from the Edge to AI. Talking about Cloudera’s journey to become more data-driven, she explained that the key challenges come from the complexity of use cases, the diversity and magnitude of data, and managing data from the Edge all the way to AI insights.
Getting data from the Edge requires one to ingest, process, stream, and manage data securely and quickly. Furthermore, the results of AI models need to be pushed back to the Edge in real time. Talking about the recent merger of Cloudera and Hortonworks, she mentioned that their Sales Operations team derives insights from propensity models to identify which potential customers to focus on.
Jed Dougherty, Lead Data Scientist, Dataiku, delivered a keynote talk on Scoring your business in the AI matrix. He started by sharing his slightly personalized recent history of the hype around AI and data science. Shortly after Alexa was released in 2015, interest in AI increased, and it is now exploding. Is that a singularity? Probably not; it’s a hype rocket. We can’t afford to be wobbly, touchy, feely in our definitions. We need a rubric that defines a project’s “AI Score” so we can decide whether to do it or not. An AI score should do two things:
- Define whether current projects align with “True” AI
- Help an organization decide whether they even need to be aiming for “True” AI
He shared the following AI score matrix (developed by Florian Douetteau, CEO, Dataiku):
Jed also shared how some example projects score on this matrix:
The moral of the talk was that companies should build their foundations first, and then decide whether and how they would benefit from the kinds of technologies that register a high AI score, before climbing into the hype rocket.
David Sanger, National Security Correspondent, The New York Times, gave a compelling keynote on Cyberconflict: A new era of war, sabotage, and fear. He explained how the rise of cyberweapons has transformed geopolitics like nothing since the invention of the atomic bomb. He shared an iconic image that shows a filing cabinet from the Watergate break-in next to the recently hacked DNC server.
A filing cabinet broken into in 1972 as part of the Watergate burglary sits beside a computer server that Russian hackers breached during the 2016 presidential campaign at the Democratic National Committee’s headquarters in Washington
Talking about the state of cyber attacks and the lack of an effective approach to combating them, he emphasized, “We know the delivery vehicle, but we don’t know the warhead.” In this new world, whoever owns the network owns society. Today, cyber attacks have become pervasive: they are the primary way that states exercise power without triggering war.
For many years, we have investigated technological solutions for it. Now, we recognize that, like terrorism and climate change, this is a problem we’ll have to manage for decades. There are 4 ways nations use cyber:
- For espionage
- For manipulation of data
- For destructive purposes: To do via cyber what previously could be accomplished by sabotage or bombing
- For achieving political goals
He concluded the talk by stating that the problem is about to get worse as the number of IoT devices multiplies.
Shafi Goldwasser, Professor, UC Berkeley, and Co-founder & Chief Scientist, Duality Technologies, gave a thought-provoking keynote on AI and Cryptography: challenges and opportunities. She emphasized that cryptography can play a key role in enabling “safe” machine learning. There is a shift of power from human decision making to machine-led decision making through AI/ML. In the absence of proper safety and security measures, this shift makes us more vulnerable and unpredictable. In order to prevent “power abuse” via machine learning, we need to ensure that all data models work in accordance with privacy policies and cannot be maliciously tampered with for profit or control.
Cryptography has recently been focused on the privacy and correctness of computation (rather than the legacy focus on communication). In the context of ML, a few years back, the prevalent approach was to integrate cryptographic building blocks (Homomorphic Encryption, Garbled Circuits) with ML-based classifiers to ensure privacy and security. In recent years, as the focus has moved to deep neural networks, it has become more challenging to preserve privacy, as it’s harder to work on encrypted data without losing performance (mostly because of non-linear activation functions).
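To make the point about non-linear activation functions concrete, here is a minimal sketch of my own (it was not shown in the talk). It assumes a toy Paillier cryptosystem, which is one well-known additively homomorphic scheme, with tiny hard-coded primes that are not secure. The helper names (`keygen`, `encrypt`, `decrypt`, `encrypted_linear`) are made up for illustration. The sketch evaluates a linear layer, w · x + b, directly on encrypted inputs.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two small primes (NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt a non-negative integer m < n under public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c with the private key."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def encrypted_linear(n, enc_x, weights, bias):
    """Return Enc(w . x + b) given Enc(x), without ever decrypting x."""
    n2 = n * n
    acc = encrypt(n, bias)
    for c, w in zip(enc_x, weights):
        # Enc(a) * Enc(x)^w mod n^2 is an encryption of a + w * x
        acc = (acc * pow(c, w, n2)) % n2
    return acc


# Illustrative values only: small non-negative integers that fit below n.
n, priv = keygen(293, 433)
x = [3, 7, 2]
w = [4, 1, 5]
b = 6
enc_x = [encrypt(n, v) for v in x]
result = decrypt(n, priv, encrypted_linear(n, enc_x, w, b))
print(result)  # 4*3 + 1*7 + 5*2 + 6 = 35
```

The linear layer works because the scheme supports exactly two operations on ciphertexts: adding two encrypted values and multiplying an encrypted value by a plaintext constant. An activation such as ReLU needs a comparison, and the scheme has no operation for that. Deep networks therefore typically need costlier schemes, polynomial approximations of the activation, or extra interaction between the parties.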
It’s also hard to track the usage of data so that the institutions sharing data can keep control over how that data is being used. In particular, how can they know if someone used their training data without a proper “privacy preserving” mechanism? One way to trace the unauthorized use of a data model is to watermark DNN models by training the network to accept watermarks (planted adversarial examples).
She concluded with the assertion that there is a real opportunity for developing new cryptography motivated by machine learning, and new machine learning motivated by cryptography.
Kurt Brown, Director, Data Platform, Netflix, gave an interesting session on The journey toward a self-service data platform at Netflix. He explained how Netflix is working toward an easier, "self-service" data platform without sacrificing any enabling capabilities. He started by sharing the key characteristics of a “self-service” platform:
- Clear, easy, enabling
- Minimal (to no) coordination
- “Product mindset”
Over the last 10 years, the Netflix data platform has evolved to handle about 100 thousand times more events (from tens of millions to more than 1 trillion), and a number of technologies and products are now in use. Meanwhile, the underlying data has grown from 34 TB to 100+ PB.
He shared the following principles and approaches for the different components used to create a “self-service” platform:
- Machine Learning Infrastructure
- Meet users where they are
- Simplify the integrations
- Building blocks
- Rich metadata
- Virtuous cycle
- Little things
- Buy yourself credibility
- Engineers != Designers
- Choose your own adventure
- Self-service != expertise-free
- Heavy lifting handled
- Weave it into your world
- Extend ecosystem
- (Big Data) Scheduler – Meson
- Do the work once
- Create a center of gravity
- Minimize cognitive load
- Don’t take away what’s working
- Go with (and to) the flow
- Education and Support
- Need clarity (all around)
- Good at it / like it
- “Drinking your own champagne”
- Establish paved paths
He also shared a quote by Phil Karlton: “There are only two hard things in Computer Science: cache invalidation and naming things.” He added that folks should keep in mind that clear is better than clever when naming projects.