Key Takeaways from the Strata San Jose 2018

By dropping 'Hadoop' from its name, the @strataconf 2018 in San Jose signaled the emphasis on machine learning, cloud, streaming and real-time applications.  

These are my takeaways from the popular Strata Data Conference 2018, held on March 6-8, at the McEnery Convention Center in San Jose. Regular attendees were quick to notice the name change from ‘Strata + Hadoop World’ back to ‘Strata Data Conference’.  Here is the link to the listed topics, speakers and slide-decks from this Conference and if you subscribe to O’Reilly’s Safari, you can view the event recordings by logging into your account.  Safari also provides a free 90-day trial. Related events: Here are some highlights from the London Strata, and the Strata Data Conference NY dates have been announced.

Strata 2018 in San Jose, was big and the buzz was real. Here are some details for the number lovers:
• 3800 registrations
• 313 speakers
• 165 sessions
• 22 tutorials
• Seven 2-day training sessions
• 16 keynotes
• 99 exhibitors

There were about a similar number of sessions on Data Engineering / Architecture as on Data Science and Machine Learning, with a lot of emphasis on cloud, streaming and real-time applications. The shift away from just Hadoop was evident. You can view the Keynotes on the O’Reilly YouTube channel here.

The following is a quick drive-through of some selected sessions.

Evan Kriminger, Data Scientist @ZestFinance

Explaining Machine Learning Models

  • Need for Interpretability: Just the loss function is not enough to describe everything that the model sets out to do. Objective misalignment can happen when, for example, to reduce cholesterol, the model says “Don’t eat anything.”  In other applications, there could be bigger implications: for example, a person could be requesting an increase in credit limit to improve his/her credit scores.

  • Who Needs the Interpretability: May be needed by the model developer and tester to debug and to find failure cases. It could be used by the end-user to understand how to apply.  It could be used by the practitioner, who uses interpretability to condense the model output for human use.  Interpretability is very useful in understanding scientific discoveries.


  • As the complexity of the models increases, then it becomes even more difficult to interpret the understanding. For example, in adversarial cases, interpretation plays a vital role in trying to understand how a tiny perturbation caused a big change in the output.
  • Barrier to Adoption in Underwriting: For some domains, the actual learning of the models is trivial. The explainability truly needs focus.  FICO earlier posted why it cannot adopt AI models. It has been working with Google and UC Berkeley to make these more interpretable.
  • Conveying Interpretations: Couple of dichotomies to consider – are we trying to characterize the model as a whole – are we just trying to explain a test point? Or, are looking at the sample or the features – coarse grain to fine grain?  Visualization has been the key to conveying interpretations – reducing dimensionality to help humans understand what the model is trying to show.  Influence functions help ‘up-weight’ which training samples helped to a classification.
  • Conveying Interpretations: Two additional ways to convey interpretations are at the feature level – that is, which features do what:
    • Sensitivity: Explanation here comes from the saliency map. We measure the sensitivity of the output as a function of each feature, by taking the gradient at each value.
    • Decomposition helps to explain the prediction as a whole – how does ‘this’ specific feature contribute to the prediction? Decomposition breaks down each feature and its relevance into its path to prediction.  This is more continuous and related to the goal.



  • Related Concepts:
    • Transparency: This is more about understanding the inner working of the model
    • Attention: In attention methods, the model learns what to focus on and what not to


  • Naturally Interpretable Models: Both the following approaches are commonly used in the Finance industry:
    • Linear Models: Naturally easy to interpret. Assigning blame is easy in such models.  If you take the gradient of this model with respect to the input, you get the sensitivity.
    • Decision Trees: Easy to find the path to a decision. You can find the feature importance from a decision tree – that is, how often did we use this feature in order to make a decision.  These models are versatile as it is easy to do both: sensitivity and decomposition.



  • Taxonomy of Interpretable Models: The baseline for taxonomy is attribution. Based on the model type, the input type and the goals of what the model seeks to achieve, there are broadly four approaches to coming up with a taxonomy.



  • Marginal Feature Contributions: Take the inputs one at a time, perturb them and find impact on the output. Biggest problem is attribution may be difficult.  For example, you perturb one pixel and it may not affect the output at all.  This works best when few features are more important (than the others) and work independently.
  • LIME: LIME is the most well-known package used for interpretation – it is model agnostic. It is computationally expensive.  You take a model with a complex decision boundary, for the point of interest, we create a dataset by weighing the samples across it proportional to samples around it using local neighborhood.  Package has the pick-step – presents the whole dataset – allowing specific set of samples to look at, which may be more interesting for interpretability.
  • Backprop Approaches: This is the most common and natural way used with the differentiable models. How sensitive is the output with respect to specific inputs?  It is easy to implement – runs on a few GPUs, easy to track the boundaries of interest.  But, a big problem is that it fails in flat regions where it will give zero attributions (so, for images using ReLU, it can show no attributions).
  • Backprop Approaches (contd.): Using simple gradients measures just the sensitivity but not decomposition. So, we need some axioms that the interpretability measure should have to help us here to decompose the model.   If two input vectors differ by a single feature, then the attributions for that feature should be greater than zero.  It should be implementation invariant – we are only worried about the output-input relationship.  Linearity helps us to use ensemble methods to combine the models.
  • Backprop Approaches (contd.): Includes an array of methods. Top 4 are just trying to figure out a better way to backprop through ReLUs – specific to image classification – how to backprop the attributions.  DeepLIFT has a discrete gradient which means you lose the chain rule.
  • Evaluating Interpretability Models: We need some metrics to compare how well the models fare at being interpretable.
  • Research Directions:
    • Need better loss functions
    • Need to understand when the interpretation fails
    • Can we explain in other domains?
    • Can we build tools?


  • Note: Do check out what Mike Lee Williams (Cloudera Fast Forward Labs) had to say in his talk ‘Interpretable Machine Learning Products’ in this same conference.


Miryung Kim, Professor of CS @ UCLA


Who are we?  The largest-scale study of professional data scientists

  • Professor Miryung Kim leads the Software Engineering and Analysis Lab at UCLA and has lead multiple studies covering a large-scale study of professional data scientists and shares her findings via two parts in this talk. Her webpage is full of very interesting links and two relevant ones here are: her Google Scholar page, the slide deck for a talk similar to Strata.
  • One may recollect the Kaggle survey here that claimed 16,000 responses to its questionnaire.
  • The following highlights cover the first part only.  Based on your feedback, we could have a separate post on the second part.


Part 1: Data Science Elevating Software Engineering


  • Professor Kim’s team used K-means clustering to identify the relative time spent on different activities to come up with 5 out of 9 representative categories of roles played in a data science team, summarized in the following graphic by me.



  • Challenges that Data Scientists faced fell into three buckets: related to data, related to analysis and related to people.
  • Challenges related to data: Poor data quality, lack of data, missing values, delayed data and finally, making sense of the spaghetti data stream coming from various sources.
  • Challenges related to analysis: Scaling is a pain (e.g. owing to huge data size, batch processing jobs like Hadoop make iterative work expensive and it is painful to do a quick visualization of the large pool), too many tools and, of course, the pain of feature engineering in ML.
  • Challenges related to people: Very difficult to convince teams the value of data science, ability to get the buy-in from the engineering team to collect high quality data and finally, conveying the insights to the leaders and stakeholders in an effective manner.
  • Best Practices collected from the surveys: Professionalize the practices, processes; foster a community around the practices. Choose the right business goals to dig into and determine a project’s success metrics in the early planning phases.  Always question the data and focus on the explainability.
  • Challenges around quality assurance: Validation is very tough – for example, how do you know if a query is misbehaving?


  • Success strategies around quality assurance: Rigorous cross validation and peer reviews are very useful. Dogfood simulations are important and always try to check the implicit constraints of your system.


Part 2: Big Data Debugging


  • This part of the talk focused on Software Engineering Elevating Data Science via debugging for Big Data Analytics, testing, data summary and explanation and optimization for iterative development. Read more about this part from her published papers or here, or by exercising your account on the O’Reilly Safari account.  Alternately, check out a previous video here.


Alexandra Gunderson @ Arundo Analytics


Machine Learning to tackle Industrial Data Fusion


  • Arundo Analytics deploys and manages data science models at scale, on the edge or in the cloud for Industrial customers.


  • The two main reasons Heavy Industries like Oil and Gas, Maritime and Energy industries want to go into ML is:
    • Decrease downtime
    • Increase efficiency


  • The opportunity in these markets is very large.


  • The lack of labeled data is a critical shortcoming for deploying Data Science and Machine Learning at scale. There are also human and cultural barriers.  Human barriers can be at one end of the spectrum, where they have just a fuzzy idea that ML is going to take away jobs via self-driven oil rigs.  At the other end, there are people who have unreasonable expectations from ML without having had data on hand to answer their questions.


  • The data infrastructure of heavy industry assets was not built with ML in mind. Many just don’t have the data.  There are barriers to change that extend beyond data access (culture, for instance).  They demand models that will tell them when the equipment will fail, why it will fail, and what actions to take – and, there’s no data.


  • Three questions to ask:
  1. How do we gather data we need to build a ML model?
  2. What are some of the steps to take to expedite the process?
  3. Once we have all the data, what can we achieve?


  • Take the example of a compressor on Rig X for an oil exploration company. Arundo needed two types of data:
    • Numerical Data – How does the equipment behave?
    • Event Data – When and what failures happened?


  • The process of understanding the problems involves a lot of interviewing, understanding the use-case, getting the data, etc.



  • The customer team shared the process and instrumentation diagrams – Arundo sat down with the engineers and the subject experts to go through these diagrams to find the right sensors that matter to a specific compressor failure.


  • Arundo then tried to understand and interpret the numerical data and make a better understanding of the failure characteristics – what were the financial implications, how did they classify the failures, etc.


  • The availability of data is a big issue – if there’s data for a five-year period, then it is considered very lucky. However, there is a large variety of data in most such examples.


  • The success of data science depends on what adds the most value. So, in this specific case, Arundo looked at the Dry Gas Seal Failure.  In a 20-year lifespan, there could be, maybe, two failures.  And if the data is available for only last five years, then it is not going to be scalable or even usable.


  • Arundo then puts all data into the right context, split by system, sub-system and component, and hierarchy. May be Event A is mapped to only to a specific pump or a meter.  Next step would be to take the same approach to all their assets.  That will help to scale, collect enough data and start to attempt finding patterns.


  • Connecting to the right people is very important – the experts who hold pieces of the puzzles.


  • Event Labeling: These are typically CSV files – and there could be about 90 different causes of a sensor failure. Though there are drop-down lists to make this labeling easier, most are labeled ‘Other’.  The best is to structure the data automatically by looking at the text.  This makes creating a process around this very important.


  • Sensor Labeling: Given a list of all tags on an asset, Arundo then determined where in an equipment hierarchy would each belong. Typically, there are just too many sensors, so experts/engineers shortlist (say, about 40 sensors) that are most important.  This comes from the Process/Equipment/Process diagrams.  The tags, by themselves, are usually insufficient.  So, Arundo sat with the Engineers and went through each entry to understand where each sensor goes, and what equipment is each associated with.  That starts to generate patterns from the sensor names and Arundo then used this information to start coding and automating this part of the sensor labeling.



  • Best way to minimize the work on sensor labeling is by creating a simple tool. Arundo has built a tool P&ID Miner which Alexandra demoed.  It uses optical character recognition to go through the P&ID diagrams, reads the P&ID from the figures and puts them into Excel sheet.


  • Arundo built an Anomaly Detection Model for the Compressor which shows the most anomalous sensors associated with each compressor. It used the K-means clustering algorithm.  Such a model helps to train a new engineer on the team too.  This helped them find defective compressors about two weeks before it happened – tremendous value delivered by proper implementation of machine learning.


Evan Sparks @ Determined AI


Taming Deep Learning


  • Deep Learning is complex, expensive and time-consuming. We measure time to market of ML based systems in years, not in months or weeks.  We need the immense expertise of ML, systems, and domains.  We need 100s of GPUs to deliver world class solutions.


  • Software involves inadequate point solutions – stone age stuff hacked together. Software development needs to be reinvented for the AI age.


  • Evan began with a glimpse at AI Development Workflow. One begins with: framing the business problem to be solved using AI, and the corresponding mathematical cost function for AI to optimize.



  • All this is a highly iterative process



  • After hundreds of iterations, one may land on a model that is ready to deploy.



  • Evan focused on the Training block as it takes a lot of time and resources. Two key solutions that the Determined AI team has solved with help from others:
    • Hyperparameter Optimization: They have come up with an algorithm – published by Ameet Talwalkar and others in Academia.
    • PALEO: Is an analytical model to estimate the scalability and the performance of deep learning systems.


  • Hyperband is a bandit-based approach to Hyperparameter Optimization
    • Key Question: How can we efficiently identify high-quality hyperparameters? Efficiency is in terms of resources consumed and quality is defined in terms of predictive error.



  • Existing Methods come in two camps:
    • Random Search: A selected random bunch of points in the space, use off-the-shelf software to evaluate the points, and get a predictive error out on the validation
    • Adaptive Methods: Start with a few points (even if randomly selected), try and learn something about the surface and the contour points, and hopefully, it will do a better job of high-quality points
    • Case 1: When n is exponential in d



    • Case 2: When n is linear in d


  • Many of the learning algorithms are iterative, and hence provide insights on how the models are behaving over time. Key intuition behind Hyperband is that there is a band of wasted iterations – so, just focus on the models that provide the lowest predictive errors over sequences of iterations.  What could go wrong with this approach?  One, the sequences could be non-monotonic and non-smooth.


  • Why Hyperband works:



  • How do we say this theoretically? Ameet and others called this the “Best arm identification” – that is, given n configurations, how many resources are required to find the best one?


  • Paleo: Enter Performance Model for Deep Neural Networks (PALEO). How do I do distributed training really fast?  The high-level strategy for this was: Model architecture provides a declarative specification of computation, while the map computation to the specific choice of computational platform, how much time would this thing take to execute?


  • Cost of deep learning: One of the major drawbacks of the current architectures is that they are expensive to train, typically requiring dedicated hardware and resources


  • Today there are some 6+ distributed deep learning algorithms on Apache Spark: like SparkNet, CaffeOnSpark, etc. DDL, in general, has picked up: TensorFlow showed 56X speedups for Inception on 100GPUs.  Mxnet showed 109x speedup for Inception on 128 GPUs.  Maybe one needs to come up with a systematic way to analyze these distributed systems.






  • Evan walked through specific examples of how PALEO can efficiently and accurately explore the design space of architectures and computing platforms.  Some points to bear in mind: Carefully consider the strong versus weak scalability.  Weak scaling changes workloads.  Ask the vendors for convergence plots so that you have a clearer picture of the convergence time versus final accuracy.


Jeff Dean, Google Senior Fellow @ Google


Using Deep Learning to Solve Challenging Problems


This was one of the most popular talks – the room filled out and overflowed – many people were turned away.  Suggestion: Strata should provide a live screen outside the halls for the overflow population to watch.


  • There has been an explosion of interest, research and applications around Machine Learning, Neural Networks and Deep Learning in the last few years.


  • Neural Networks are now able to solve a whole gamut of problems. You can input a bunch of pixels so that it just recognizes the image.  You can now provide an audio recording and get Neural Networks to provide an end-to-end transcription of it (earlier, one needed a complex chain - acoustic modeling, followed by linguistic modeling and so on).  You can feed Neural Networks a complete sentence in one language and get a complete translation in another language.  You can feed Neural Networks a bunch of pixels and it can now recognize the picture and the full context (e.g. A blue and yellow train traveling down the tracks.).


  • Consider the ImageNet Challenge. In 2011, the best models (that did not use Neural Networks) topped out at 26% error rate.  Humans can reach up to 5% error rate.  But today (2016), with the best Neural Networks, we can reach 3% error rate, much better than humans.



  • Jeff began with the 2008 list of Grand Engineering Challenges for 21st Century as many are still being solved, and added some of his own to update the list.



  • Jeff selected some of the topics from this list to provide a sample of the progress in each.


  • Restore and Improve Urban Infrastructure: Autonomous cars are going to become reality. Raw sensor data is going to build a very high-level view of the world around us to make autonomous cars safer and reliable.  Waymo recently tested autonomous cars without any safety drivers.


  • Advanced Health Informatics: Diagnosing eye diseases (specifically, diabetic retinopathy, with about 400 million people at risk in the world) via retinal images. For training a Neural Network, it is vital to have a large number of labeled images.  However, if you ask two ophthalmologists opinion on the same image, they will agree only 60% of the time.  Worse, if you ask the same ophthalmologist his/her opinion on the same image sometime later, he/she will agree with previous self-opinion only 65% of the time.  So, to get a good training labeled database, Google asked seven ophthalmologists for opinions and then used the majority to classify and label the images for training.  Now, this kind of application is spreading to radiology.  A very interesting discovery happened when a new person joined the Google team working on eye-images.  This person was given a bunch of labeled images to play with, train and learn something from them.  This person came back and was, surprisingly able to predict the gender and the cardiovascular health from these images -- which turned out to be quite prescient.  There are numerous non-imaging diagnosis methods for predicting the medical future of patients.



  • Neural Machine Translation: Leaps and bounds better than the earlier phrase-based translation. The performance gap between algorithms and the humans who are bilingual is now very small when compared to the Neural Network models.




  • The field of Connectomics is about reconstructing Neural circuits from high-resolution brain imaging. Using its vast computing resources, Google has made some progress, trying to reconstruct the complete connectivity of the songbird brain wiring.   Another example: Using RNN to iteratively fill out an object based on image content and its own previous predictions.  Check out this paper for details.




  • Engineering the tools for scientific discovery: Google team built Tensorflow and open-sourced it in end-2015, to establish a common platform for expressing ML and systems. Google is developing Tensor Processing Units (TPUs) as dedicated hardware to solve ML computation needs.  Again, available as part of Google Cloud services.  It is making 1000 Cloud TPUs available for free to top researchers who are committed to open machine learning research.


  • Automated Machine Learning (Learning to Learn or AutoML): Today: Solution = ML Expertise + data + computation. To explore for tomorrow Google is attempting to try: Solution = Data + 100x computation.  AutoML is progressing well – it is able to use brute force computation to search and evaluate a large number of Neural Architecture models to suit a particular problem and outperforms handcrafted models.  It provides better accuracy at the lower end of operations (a lower cost deployment) as well at the higher end.  AutoML is now available as part of the Google Cloud services.


  • Neural Networks have two nice properties: Neural Nets are very tolerant around reduced precision – which makes them very useful for rough estimates. Second, all basic operations in their execution are composed of a small set of specific operations which allows developing specialized hardware, for example, hardware units that accelerate linear algebra as components of a faster solution.  E.g. TPU v2


  • To conclude, deep neural networks and ML are producing significant breakthroughs that are solving some of the world’s grand challenges.



var disqus_shortname = 'kdnuggets';