Strata SF Day 2 Highlights: AI and Politics, Chatbot Insights, Forecasting Uncertainty, Scalable Video Analysis, and More

AI's influence on politics, insights from chatbots, the enterprise data cloud, handling video big data, and more takeaways from the Strata Data Conference 2019, San Francisco.



See also Highlights from Strata Data Conference 2019 - San Francisco, Day 1

Last month, data scientists, analysts, executives, engineers, developers, and AI researchers from a wide range of industries flew into the city of seven hills: San Francisco, California. Each of the 1,000+ attendees was super pumped to share and learn about the emerging trends transforming data and business. I was one of them, and I would like to share some key takeaways with the data science community around the globe. In my opinion, the Strata Data Conference offers a perfect intersection of cutting-edge science and evolving business models. This edition featured more than 300 speakers, 10 keynotes, 10 tutorials, and 150+ technical sessions.

Strata Data Conference Day 2 Figure 1

Elizabeth Svoboda, award-winning journalist and author of “What Makes a Hero? The Surprising Science of Selflessness”, gave a very interesting keynote on “Hacking the vote: The neuropolitical universe”. Using biosensors and predictive analytics, political campaigns aim to decode your true desires, and influence your vote, without your knowledge. The future of free and fair elections is at stake, and we need to educate ourselves about the AI capabilities of political campaigns to avoid unconscious manipulation of our thoughts and actions.

Marketing has long used advanced techniques that tap brain waves, facial expressions, reaction times, and other biodata. By applying EEG electrodes, one can get deep neurological insight into the kind of emotional reaction an ad campaign creates, and based on that knowledge, tweak the campaign to achieve the desired emotional response.

Another rising technology reads micro-expressions on people’s faces. These emotions often flicker across a face in less than a second and are thus hard for any human to catch. Maria Pocovi and her team at Emotion Research Lab have built software that can read micro-expressions with above 90% accuracy. The use of such AI is becoming increasingly popular among political consultants, in countries including the US and Mexico.

Lauren Kunze, CEO of Pandorabots, delivered an insightful keynote on “Chatting with machines: Strange things 60 billion bot logs say about human nature”. She emphasized that natural language understanding is still far from solved; that is why Siri and annoying customer chatbots are still apologizing: “I’m not sure! I didn’t understand you.” Even with a dataset sufficiently labeled to train an ML model on what people say to your chatbot, you would still need to craft descriptive responses. Conversational AI designer is a hot new job in Silicon Valley.

The reason we can’t let a computer formulate a theoretically sensible answer based on what it learns is people’s fault: people say terrible things on the internet. Microsoft’s Tay famously turned into a Hitler-loving sex robot within 24 hours of being fed a stream of tweets. This says more about human nature than about robots; our biases creep into our datasets, and the biased software that results is a reflection of our worst selves. How can we solve this problem? The frequency at which words appear in a language obeys Zipf’s law: the most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on. The same distribution curve holds true not just for words but also for phrases and sentences.

Strata Data Conference Day 2 Figure 2
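To make that distribution concrete, here is a minimal sketch that ranks chatbot inputs by frequency and tracks cumulative coverage; the toy log below is a stand-in, since the actual Pandorabots logs are not public:

```python
from collections import Counter

# Stand-in for real bot logs; each entry would be a user's first input.
inputs = ["hi", "hi", "hi", "hello", "hello", "how are you",
          "hi", "what is your name", "hello", "hi"]

counts = Counter(inputs)
ranked = counts.most_common()

# Zipf's law predicts the rank-r phrase appears about f(1) / r times.
for rank, (phrase, freq) in enumerate(ranked, start=1):
    print(f"rank {rank}: {phrase!r} x{freq} (Zipf predicts ~{ranked[0][1] / rank:.1f})")

# Cumulative coverage: how few distinct phrases cover most of the traffic.
total = sum(counts.values())
covered = 0
for rank, (phrase, freq) in enumerate(ranked, start=1):
    covered += freq
    print(f"top {rank} phrases cover {covered / total:.0%} of inputs")
```

Run over real logs, the second loop is what produces numbers like the one below: a tiny head of the distribution covers almost all traffic.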

Just 1,800 words cover 95% of the first inputs across all chatbots. The average branching factor decreases with each successive word, implying that people are boring: we walk around saying the same things most of the time. The following image shows the top 50 inputs across all English-speaking chatbots on the platform (the universal domain), regardless of use case:

Strata Data Conference Day 2 Figure 3

Pandorabots has open-sourced chat libraries so that developers do not need to reinvent the wheel. She mentioned that 10,000 is the magic number of responses/rules required, on average, to cover a specific domain, which boils down to about a month of work prior to launch.
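Pandorabots writes such rules in AIML; purely as a toy illustration of what one of those 10,000 rules looks like, here is a hypothetical pattern-to-response matcher in Python (all patterns and responses are made up):

```python
import re

# Toy rules mapping input patterns to canned responses. A production bot
# covering one domain needs on the order of 10,000 of these.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.I), "Hello! How can I help?"),
    (re.compile(r"\bhours?\b", re.I), "We're open 9am-5pm, Monday to Friday."),
    (re.compile(r"\brefund\b", re.I), "I can help with refunds. What's your order number?"),
]
FALLBACK = "I'm not sure! I didn't understand you."

def respond(user_input: str) -> str:
    for pattern, response in RULES:
        if pattern.search(user_input):
            return response
    return FALLBACK  # the apology Kunze joked about

print(respond("hey there"))        # matched by the greeting rule
print(respond("quantum physics"))  # falls through to the apology
```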

Mike Olson, Chief Strategy Officer at Cloudera, delivered a keynote on “The enterprise data cloud”. Disruptive advances such as big data, powerful analytics, and public cloud services have fundamentally changed our expectations of technology: it should be fast, simple, and flexible to use. He explained the key capabilities an enterprise data cloud system requires, and why the future belongs to multi-cloud (or hybrid cloud).

Strata Data Conference Day 2 Figure 4

He shared that, for large enterprises, the cloud isn’t going to happen; it has already happened. Yet the data center is not dead: virtually every large enterprise still operates one. What they really want is a single consistent picture of their data regardless of where it happens to live, and the ability to set, enforce, and monitor security and data governance policies to ensure they are doing the right thing.

We will see data privacy evolve into a critical requirement across the globe. The main lesson we have learned from the cloud is that ease of use seriously matters: these systems have to spin up and spin down on demand, and users must be able to self-serve their way to whatever analytic framework they run.

Theresa Johnson, Product Manager for Metrics & Forecasting Products at Airbnb, gave a compelling keynote on forecasting uncertainty. She outlined the following challenges in forecasting:

  • Building the model
  • Guiding the company
  • Scaling the work

She mentioned that it is actually hard (some say impossible) to come up with a theory of a business that lets you map its input metrics and how they flow together. As data scientists, we can start by modeling how those metrics flow into business activities, which leads to a better theory of the case, and at each step of the process we can quantify the uncertainty. Decision trees require theory, but building a theoretical model of a business can be daunting. To get started, you need:

  • Insight into how the world works
  • An explanation of phenomena in terms of their constituent parts
  • An acceptance that the forecast will only be as good as the theory

For example: how do you forecast the number of pool cleaners in Chicago?

Airbnb’s way of solving it: build a model of how pool cleaners get hired, with both a supply side and a demand side. Bringing the two sides together in this two-sided marketplace of pool cleaners gives the final number of pool cleaners in Chicago.

Strata Data Conference Day 2 Figure 5

The interesting aspect here is that when each individual metric is forecast separately, we know where the error/uncertainty sits around each specific part of the overall forecast. We can then titrate each input and figure out how it would impact the outcome.

Strata Data Conference Day 2 Figure 6
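A minimal sketch of that idea, with invented numbers and a deliberately simplified supply/demand model (Airbnb’s actual model was not shared): forecast each input metric with its own uncertainty, propagate through the model by Monte Carlo, then titrate one input to see how it moves the final interval.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # Monte Carlo samples

# Each input metric is forecast separately, with its own uncertainty.
# All numbers are invented for illustration.
pools = rng.normal(20_000, 2_000, N)       # demand: pools in the city
jobs_per_pool = rng.normal(26, 4, N)       # demand: cleanings per pool per year
jobs_per_cleaner = rng.normal(400, 60, N)  # supply: annual capacity of one cleaner

# Bring supply and demand together: cleaners needed to meet demand.
cleaners = pools * jobs_per_pool / jobs_per_cleaner

lo, mid, hi = np.percentile(cleaners, [5, 50, 95])
print(f"pool cleaners: median {mid:,.0f}, 90% interval [{lo:,.0f}, {hi:,.0f}]")

# Titrate one input: halve the uncertainty on the pool count and watch
# how much the overall forecast interval tightens.
pools_tight = rng.normal(20_000, 1_000, N)
cleaners_tight = pools_tight * jobs_per_pool / jobs_per_cleaner
lo2, hi2 = np.percentile(cleaners_tight, [5, 95])
print(f"interval width: {hi - lo:,.0f} -> {hi2 - lo2:,.0f}")
```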

Alex Poms and Will Crichton, Ph.D. students at Stanford University, gave an interesting talk on “Scanner: Efficient video analysis at scale”. An increasing number of applications depend on processing large amounts of video, and ever more video datasets are available in a world instrumented with cameras, from millions of unlabeled YouTube videos to millions of labeled action clips. They outlined how big video data differs from traditional big data:

  • Must keep video compressed
    • Storage & I/O costs: at least a 4x increase otherwise
    • Compute costs: 6 FPS/core to decode and write at 1080p. With 200,000 hours of video at GCE pricing, that is 500k compute hours = $17,000; in addition, 320 TB of JPEGs = $9,000/month (see the rough cost sketch after this list)
  • Must run and coordinate low-level libraries with minimal overhead
    Strata Data Conference Day 2 Figure 7
  • Must use hardware accelerators (GPUs, TPUs, FPGAs)
  • Must carefully orchestrate data movement between heterogeneous devices
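As a rough back-of-envelope reconstruction of those slide numbers (the 30 fps source frame rate and the unit prices below are my assumptions, not figures from the talk):

```python
# Back-of-envelope costs for processing 200,000 hours of 1080p video.
HOURS = 200_000
SOURCE_FPS = 30              # assumed frame rate of the corpus
DECODE_FPS_PER_CORE = 6      # decode + write throughput at 1080p (from the talk)

frames = HOURS * 3600 * SOURCE_FPS
core_hours = frames / DECODE_FPS_PER_CORE / 3600
print(f"decode: ~{core_hours:,.0f} core-hours")
# The slide quotes 500k compute hours = $17,000 (~$0.034/core-hour); the
# core-hour count is sensitive to the assumed frame rate and any frame
# sampling, which is why this crude estimate lands higher.

jpeg_gb = 320_000            # 320 TB of JPEGs (from the talk)
price_per_gb_month = 0.028   # assumed object-storage price
print(f"storage: ~${jpeg_gb * price_per_gb_month:,.0f}/month")  # ~$9,000/month
```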

They had the following goals in mind while architecting Scanner:

  1. Use a single machine as efficiently as possible
  2. As productive to work in as Spark, etc.
  3. Effectively execute applications at scale
  4. Interoperate with the big data ecosystem

Video frames are big: one hour of video can decode to roughly 12 GB of pixels, the equivalent of 12 Wikipedias. They did a live demo of Scanner and went into some detail on its programming model and dataflow.
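Scanner’s own API isn’t reproduced here; for contrast, this is the naive single-machine loop (sketched with OpenCV) whose decode bottleneck Scanner’s compressed storage and multi-device scheduling are designed to overcome:

```python
import cv2  # pip install opencv-python

def process_video(path: str) -> int:
    """Naive baseline: decode every frame on one core, then analyze it."""
    cap = cv2.VideoCapture(path)
    n = 0
    while True:
        ok, frame = cap.read()  # per-frame decode is the bottleneck
        if not ok:
            break
        # Placeholder analysis: downsample the frame. A real pipeline would
        # run a detector or feature extractor here, ideally on a GPU.
        cv2.resize(frame, (480, 270))
        n += 1
    cap.release()
    return n

# Hypothetical usage: process_video("lecture.mp4")
```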

The event also included a prize ceremony for the Strata Data Awards, which recognize outstanding achievements in data science and machine learning. This year’s winners, decided by a panel of judges along with audience votes:

  • Most Disruptive Startup: Okera
  • Most Innovative Product: Sigma
  • Most Impactful Data Initiative: Civis
  • Most Notable Open Source Contribution: Spark NLP
