Key Takeaways from Strata + Hadoop World 2017 San Jose, Day 2

The focus is increasingly shifting from storing and processing Big Data efficiently to applying traditional and new machine learning techniques to drive higher value from the data at hand.

Tom Reilly, CEO, Cloudera and Khalid Al-Kofahi, VP of Research & Development, Thomson Reuters gave the opening keynote on one of the hottest public concerns – fake news. In their talk titled “Becoming smarter about credible news” they described the solution prepared by their joint effort. Khalid mentioned that an independent survey found that today 20% of news stories are breaking first on Twitter. So, social media is clearly a valuable source of breaking news as well as for additional information on a currently running news story. However, social media is also filled with fake news, advertisements disguised as news, rumors, etc. Thus, it is becoming increasingly challenging for journalists to leverage social media for news reporting while ensuring the trust and accuracy of their news stories.

The technological solution works like this: First, it consumes Twitter feeds and filters out noise. Then, metadata (topics, relationships, etc.) is added to the tweet content. After that, tweets are grouped into clusters based on semantic and syntactic similarities. Finally, a veracity score is calculated for each news story, which measures how newsworthy the story is through checks such as: for a factual story, is the content true? For an opinion-based story, is the entity expressing the opinion a subject-matter expert? The machine learning models behind the system’s intelligence take into account the experiential wisdom of investigative journalists. The system completes this entire process for any news story in under 40 milliseconds.
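The filtering and clustering steps described above can be illustrated with a small sketch. This is not the Thomson Reuters implementation, whose details were not given in the talk. It assumes a simple word-overlap (Jaccard) similarity and a greedy single-pass clustering, and the function names, the `min_tokens` noise rule, and the `threshold` value are all hypothetical choices made for illustration.

```python
import re


def tokenize(text):
    """Lowercase a tweet, drop URLs and @mentions, and return its set of word tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return set(re.findall(r"[a-z0-9']+", text))


def is_noise(tokens, min_tokens=3):
    """Toy noise rule (an assumption): very short tweets carry too little signal."""
    return len(tokens) < min_tokens


def jaccard(a, b):
    """Jaccard similarity between two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_tweets(tweets, threshold=0.3):
    """Greedy single-pass clustering.

    Each tweet joins the first cluster whose seed tweet is similar enough;
    otherwise it starts a new cluster. Noisy tweets are skipped.
    """
    clusters = []  # each entry: {"seed": token set, "tweets": [original texts]}
    for tweet in tweets:
        tokens = tokenize(tweet)
        if is_noise(tokens):
            continue
        for cluster in clusters:
            if jaccard(tokens, cluster["seed"]) >= threshold:
                cluster["tweets"].append(tweet)
                break
        else:
            clusters.append({"seed": tokens, "tweets": [tweet]})
    return [cluster["tweets"] for cluster in clusters]
```

A production system would presumably use richer semantic features than word overlap. The sketch only shows the shape of the pipeline: clean, filter, then group tweets into candidate stories that a later scoring step could evaluate.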

Talking about the quality of results, Khalid reported that given a group of tweets (minimum 5) on a news story, the system is able to classify the news as real or fake with 78% accuracy. This system is currently being used by Thomson Reuters journalists across the world to decide which local stories to report on and to check the accuracy of their news content.

Andra Keay, MD, Silicon Valley Robotics shared insights on “Making good robots”. Talking about her experience of commercializing innovative robotic technologies, she highlighted some important issues. Robots are already coming with cultural baggage. They are typically referred to as male rather than being considered neutral. They are considered female only when their physical appearance is designed to enhance their gender characteristics.

Robots should obey our laws, including the societal covenants. Robotic intelligence should be based on our real-life values and societal principles, and not on the unconscious bias and stereotypes of its creators. We need to understand that our relationship with robots is not one-way: the way we design them to work and interact will have a significant impact on our lives in the near future.

Robotics has now hit an inflection point in its evolution. Low-cost sensors, increasing intelligence augmented by ubiquitous connectivity, improvements in multi-dimensional, multi-terrain mobility, and a comprehensive understanding of the surrounding environment are enabling robots to become a significant part of our future lives. Thus, we need feedback loops (such as design guidelines) to identify and rectify characteristics that are not best suited for our world.

We are now witnessing a huge wave of semi-autonomous robots (embedded in larger devices) capable of taking certain actions reliably. The various sensors on these robots are collecting a tremendous amount of data. Intel reported that a self-driving car generates about 4 TB of data per day. Most of this data has to be processed locally for insights that drive the robots’ actions. While we typically think of robots as objects, they will soon be in our lives as environments, such as Amazon Go or smart homes.

She also emphasized that robots need to be transparent about how, where, and when data is stored, who owns the data, what decisions are being taken based on this data, etc.

Vijay Narayanan, Partner Director, Microsoft gave a keynote on “Big data, AI, the genome, and everything”. Recently, for the first time in history, a quick DNA test diagnosed a boy’s illness (that several other medical tests had failed to diagnose) and saved the boy’s life. This analysis of DNA sequences required just 96 minutes. A similar analysis by previous generations of software on the same hardware would have taken 24 hours or more to complete.

The cost of DNA sequencing has dropped radically in the last few years, from $10M in 2007 to under $1K today. Simultaneously, the number of human genomes sequenced during these years has increased exponentially. Recently, it has been doubling every 7 months. Such growth means that genomic labs today need an enormous amount of storage and computing power, for which they are now relying on the cloud.
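To make the growth figures above concrete, here is a minimal arithmetic sketch. It assumes a constant doubling period of 7 months, as quoted in the talk. The function names are illustrative, and the constant-rate assumption is a simplification rather than a forecast.

```python
import math


def growth_factor(months, doubling_period_months=7.0):
    """Multiplicative growth after `months`, assuming a constant doubling period."""
    return 2.0 ** (months / doubling_period_months)


def months_to_grow(factor, doubling_period_months=7.0):
    """Months needed to grow by `factor`, under the same constant-doubling assumption."""
    return doubling_period_months * math.log2(factor)


# Cost figures quoted in the talk: about $10M in 2007, under $1K today.
cost_drop_factor = 10_000_000 / 1_000  # a 10,000-fold reduction

annual_growth = growth_factor(12)  # roughly 3.3x more genomes sequenced per year
```

Under this assumption, storage and compute demand would need to grow by a bit more than three times every year just to keep pace, which is consistent with the move to cloud infrastructure mentioned above.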

The Human Genome Project aimed at sequencing all 3 billion letters (base pairs) in the human genome. It started in 1990 and took 13 years and approximately $3 billion to complete. It has already had a substantial impact: the discovery of 1,800 disease genes and the development of over 2,000 genetic tests that patients can take to learn their risk of inherited diseases. The project has enabled a new generation of Precision Medicine, one that is personalized and preemptive.

He mentioned that today when we think of Big Data we are primarily talking about data from social media, astronomy, video content, etc. But 8 years from now, the data generated by genomic labs will overshadow all of them and become the predominant Big Data challenge: How to collect and store such data efficiently? How to scale up affordably? How to process this data and generate insights in reasonable time? The answers would come from fast algorithms and scalable computing infrastructure. One of the major breakthroughs in this direction is the SNAP (Scalable Nucleotide Alignment Program) system (developed by Matei Zaharia et al.), based on a simple hash index of short seed sequences from the genome.
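The idea of a hash index of short seed sequences can be sketched in a few lines. This is a toy illustration of seed-based alignment in general, not SNAP's actual code. It assumes exact-match seeds, a voting step over candidate positions, and a Hamming-distance check in place of the edit-distance verification a real aligner would use. All names and parameters are hypothetical.

```python
from collections import Counter, defaultdict


def build_seed_index(reference, seed_len):
    """Map every substring of length `seed_len` in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, index, seed_len, max_mismatches=2):
    """Return (start, mismatches) for the best placement of `read`, or None.

    Each seed in the read votes for the reference position where the whole
    read would start; candidates are then verified by Hamming distance.
    """
    votes = Counter()
    for offset in range(len(read) - seed_len + 1):
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                votes[start] += 1

    best = None
    for start, _ in votes.most_common():
        distance = hamming(read, reference[start:start + len(read)])
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (start, distance)
    return best
```

The index is built once per reference genome, so each read costs only a handful of hash lookups plus a few verifications. That is the property that makes this family of aligners fast.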

Michael Jordan, Professor, UC Berkeley gave an exciting talk on his research work “Ray: A Distributed Execution Framework for Emerging AI Applications”. Project Ray aims at replacing Hadoop-like thinking with something better suited for a wide range of ML algorithms. The ML capabilities of most of the Big Data platforms today are focused on just one ML method: neural networks (which includes deep learning).

In January 2017, UC Berkeley launched RISELab (Real-time Intelligent Secure Execution Lab) as the successor to AMPLab. In its 5-year history, AMPLab produced several notable software projects, including Spark, CoCoA, MLBase, and BlinkDB. Spark was developed to address latency issues with running simple ML methods such as logistic regression on Big Data. While Spark did well in that arena, it is now time to replace it with something with greater support for generic ML. Project Ray is currently being developed as a successor to Spark that would address the priority areas shown in the slide below.

Most of today’s ML libraries are focused on either supervised learning or neural networks. But those scenarios exclude a majority of real-life ML challenges. Thus, Project Ray is based on reinforcement learning, where the model learns on its own from limited positive or negative feedback on its actions. The code for Project Ray is available on GitHub and there will be an alpha release soon.
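The notion of learning from limited positive or negative feedback can be illustrated with one of the simplest reinforcement learning settings, a multi-armed bandit with an epsilon-greedy policy. This sketch is plain Python and does not use or represent Ray's API. The reward probabilities, the +1/-1 reward scheme, and the function name are assumptions chosen for illustration.

```python
import random


def run_bandit(success_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit.

    The agent never sees `success_probs`; it only receives +1 or -1 after
    each action and keeps a running average reward per action.
    Returns (estimated_values, pull_counts).
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: take the action with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: values[a])

        # The environment gives only positive or negative feedback.
        reward = 1.0 if rng.random() < success_probs[arm] else -1.0

        # Incremental update of the running mean reward for this action.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    return values, counts
```

Calling `run_bandit([0.2, 0.5, 0.8])` should lead the agent to favor the third action over time, even though it was never told which action is best. Workloads of this kind involve many small simulations and frequent updates rather than one large batch job.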

Desiree Matel-Anderson, Chief Wrangler, The Field Innovation Team (FIT) talked about doing good with data in her talk “Data in disasters: Saving lives and innovating in real time”. FIT’s mission is to empower humans to create cutting-edge disaster solutions. She shared several case studies, including Hurricane Sandy, the Boston Marathon bombings, and the Syrian refugee crisis. She highlighted how the creative application of data collection, processing, and visualization can make an impact in real time when addressing natural (e.g. hurricane) and man-made (e.g. war) disasters.

Simple techniques such as public sentiment analysis based on Twitter data can provide powerful insights to direct aid efforts towards maximum effectiveness. An AI-based chatbot was used to provide psycho-social services to survivors of the Syrian refugee crisis. In summary, she emphasized that “data saves lives” and invited data science intellectuals to participate in social projects.
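As a rough illustration of how simple such sentiment analysis can be, the sketch below scores location-tagged tweets against a tiny word list and ranks areas by average sentiment. It is not FIT's method. The lexicon, the area tags, and the function names are all hypothetical, and a real deployment would need a far more careful lexicon or a trained model.

```python
import re
from collections import defaultdict

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"safe", "rescued", "restored", "thankful", "reopened"}
NEGATIVE = {"trapped", "flooded", "injured", "stranded", "outage"}


def sentiment_score(text):
    """Count positive words minus negative words in a tweet."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def average_sentiment_by_area(tagged_tweets):
    """Given (area, text) pairs, return {area: mean sentiment score}."""
    scores = defaultdict(list)
    for area, text in tagged_tweets:
        scores[area].append(sentiment_score(text))
    return {area: sum(vals) / len(vals) for area, vals in scores.items()}


def areas_by_need(tagged_tweets):
    """Rank areas from most negative to most positive average sentiment."""
    averages = average_sentiment_by_area(tagged_tweets)
    return sorted(averages, key=lambda area: averages[area])
```

Ranking areas this way gives responders a quick, if crude, signal about where distress is concentrated, which they could then confirm through other channels.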

Rob Craft, Product Lead, Machine Learning, Google gave an interesting talk on “Machine Learning at Google”. He differentiated Artificial Intelligence (AI) from Machine Learning (ML), stating that AI is the overarching science and ML is a group of techniques that enables AI. He described ML as just math with some statistics, a large chunk of data, and some smart talent. ML is no longer black-box magic limited to the realm of research; it is present in our day-to-day lives, solving real-world problems.

On average, the input for Google Search is just two and a half words. This has remained constant over the last decade. In order to deliver relevant web search results, Google needs to supplement that limited text input with knowledge about the user’s intent and overall context. That’s a big challenge. It gets a bit easier when people use voice search, because in that case the input is usually full sentences that provide a lot more information about what the user is looking for than the two and a half words typed into the search box.

For many years, Google Search was driven by a hand-crafted business rules engine. Eventually, Google had millions of those rules, making them extremely tedious and complex to maintain and update. So, ML was used to understand the underlying context and get a better understanding of the intent, leading to simpler rules managed automatically. It took just 3-4 months to get the first ML model to replace those hand-crafted rules.

He concluded his talk with a brief summary of Google’s offerings across the ML spectrum: TensorFlow for ML researchers, CloudML for data scientists, and ML APIs (Translate API, Vision API, Speech API, Natural Language API, etc.) for app developers.