Strata 2014 Santa Clara: Highlights of Day 2 (Feb 12)

Strata 2014 was a great conference, and here are key insights from some of the best sessions on day 2: Big Data Vendor Landscape, Machine Learning for Social Change, Secrets of Gertrude Stein, and Facebook Exascale Analytics.

One of the biggest challenges at big conferences such as Strata 2014 is that so much happens quickly and simultaneously that it is almost impossible to catch all the action.

We help you by summarizing the key insights from some of the best and most popular sessions at the conference. These concise, takeaway-oriented summaries are designed both for people who attended the conference but would like to revisit the key sessions for a deeper understanding and for people who could not attend.

See also: Strata 2014 Santa Clara: Highlights from Day 3 (Feb 13).

Navigating the Big Data Vendor Landscape by Edd Dumbill, Silicon Valley Data Science

Edd’s talk focused on how the changing landscape is reshaping our data needs. Describing the big picture, he emphasized that Hadoop now has over 208 partners and that the number is still growing. He shared the concept of data gravity, i.e. data attracts data, which eventually leads to Hadoop becoming the center of gravity for all data needs. He noted that although the road to Big Data involves considerable cost, once an organization is inside, the benefits of scale and agility are well understood.

One of his most interesting points concerned his vision of the “Data Lake”, which describes an organization with ideal processes and resources for Big Data. Referring to it as the “data lake dream”, he argued that it is an attainable dream despite its somewhat utopian goals: a data-centered architecture with no organizational silos, and diverse applications residing within the same data cloud and leveraging a scalable, distributed environment. Describing the path to realizing this dream, he talked about the four levels of Hadoop maturity:

  1. Life before Hadoop
  2. Hadoop is introduced
  3. Growing the data lake
  4. Data lake and application cloud

Talking about how to make the right choices, he shared the concept of the experimental enterprise, which he claimed is necessary to compete in a world that’s being rapidly digitized. The key characteristics of the experimental enterprise are:

  • Experimentation must be cheap in order to de-risk failure
  • Experimentation must be quick, so a feedback loop can be used to learn from the market and environment, and allow the business to expand rapidly
  • Experimentation must not break the important production processes of business

These principles can be used within IT to build an experimental enterprise by employing six building blocks: cloud, DevOps, open source, agile development, platforms, and data science. Along with a brief explanation of these blocks, he demonstrated how they fit together to create an IT infrastructure ready to serve as a strategic advantage. It is also important to remain on track with technical demands when developing a solution. Regarding testing, Edd shared his mantra: “always test and don’t trust”.

He emphasized the importance of making conscious trade-offs and scoring options relative to the environment in question. With regard to vendor maturity, he said that Big Data is still an immature sector, that Hadoop 2.0 is a key development because it introduced in-memory databases, that solutions are trending toward becoming platforms, and that services play a large role in any spending. He concluded the session by stating that understanding a vendor’s own requirement and conducting a thorough examination of options are the key points when developing a solution.

Machine Learning for Social Change by Fernand Pajot, Change.org

Change.org is the world’s largest petition platform, with about 300 million signatures, 4000 declared victories in 121 countries and about 20 million users who have experienced victory by making petitions successful. Talking about the data challenges at Change.org, Fernand explained what it takes to identify the users most likely to sign a particular petition and to predict how many of them would actually sign it. Key insights from his approach to this challenge are as follows:

  1. Behavioral data trumps demographics and third party data sources
  2. Make all your features binary to have flexibility on which algorithm to pick
  3. Random forests work well for imbalanced data
  4. Performance was not the main criterion when choosing the learning model; use case and flexibility were far more important
  5. Big RAM/CPU instances are cheap, and sparse binary datasets are small. If you can do in-memory training, do it
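
To make the second and fifth points concrete, here is a minimal sketch of how a mixed user record might be turned into sparse binary features. The field names, the count thresholds and the `binarize` helper are all hypothetical, chosen only for illustration; the talk did not describe the actual feature pipeline.

```python
from typing import Dict, Set, Union

# Illustrative thresholds for bucketing counts; not taken from the talk.
COUNT_THRESHOLDS = (1, 5, 20)


def binarize(record: Dict[str, Union[bool, int, str]]) -> Set[str]:
    """Return the names of the binary features that are 'on' for one record.

    Storing only the active feature names keeps the dataset sparse, so
    even a wide feature space stays small enough to hold in memory.
    """
    active: Set[str] = set()
    for name, value in record.items():
        if isinstance(value, bool):
            # Flags are already binary: keep them only when true.
            if value:
                active.add(name)
        elif isinstance(value, int):
            # Counts become one indicator per threshold reached.
            for threshold in COUNT_THRESHOLDS:
                if value >= threshold:
                    active.add(f"{name}>={threshold}")
        else:
            # Categories become one indicator per observed value.
            active.add(f"{name}={value}")
    return active


# Hypothetical behavioral record for a single user.
user = {
    "signed_count": 7,
    "last_topic": "environment",
    "opened_email": True,
    "shared": False,
}
features = binarize(user)
```

Because every feature is a simple on/off indicator, the same representation can be fed to tree ensembles, linear models or set-based similarity measures without re-encoding, which is the flexibility the second point refers to.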

He suggested treating recommendation and discovery as a coherent product. By matching synonymous terms used across the data landscape and the delivery channels, he explained their deep similarity. For example, he described featured petitions and similar petitions as analogous to online feed and email push, respectively.

Although there are tons of input sources to try, picking the right metric to optimize is the most difficult part. Sharing the results of his work, he showed that collaborative filtering led to a 30% increase in overall signatures and a 3x increase over baseline in additional signatures. A key learning is to always start with the simplest models, such as basic similarity metrics. He concluded the session by talking about the data-driven approach followed by Change.org for sponsored petition targeting.
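
As an illustration of what "starting with a basic similarity metric" can look like, the sketch below ranks petitions by the Jaccard similarity of their signer sets, a simple form of item-to-item collaborative filtering. The petition names, user names and the `similar_petitions` helper are made up for this example, and Jaccard is only one possible choice; the talk did not specify which metric was used.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of signers two petitions have in common (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_petitions(
    target: str, signers: Dict[str, Set[str]], top_n: int = 3
) -> List[Tuple[str, float]]:
    """Rank the other petitions by signer overlap with the target petition."""
    scores = [
        (other, jaccard(signers[target], users))
        for other, users in signers.items()
        if other != target
    ]
    # Highest similarity first; ties are broken alphabetically for stability.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Hypothetical data: petition -> set of users who signed it.
signers = {
    "save-the-park": {"ana", "ben", "cho"},
    "plant-more-trees": {"ana", "ben", "dev"},
    "fix-the-bridge": {"eli", "cho"},
    "lower-bus-fares": {"fay"},
}

recommendations = similar_petitions("save-the-park", signers, top_n=2)
# → [("plant-more-trees", 0.5), ("fix-the-bridge", 0.25)]
```

A baseline like this needs only the signature log, which makes it cheap to build and a useful yardstick before trying more complex models.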


Unlocking the Secrets of Gertrude Stein by Ian Timourian from Paxata

Ian Timourian gave an interesting talk in which he used advanced methods of data science and visualization to analyze the poetry of Gertrude Stein. Ian has been exploring, through visualization, the algorithms and techniques Stein used. His talk touched on important topics such as novel techniques for visualizing n-grams, methods of turning poetry into music, and the intersection of art and data visualization.

He expressed the immense joy of data exploration, mashups, generative art, etc. The highlight of his talk was a transcription of the structure of Stein’s poetry into music. The next step for Timourian is to find a way to create a videogame-style visualization tool for cleaning and distilling data. This is not so far-fetched, since he works at Paxata, a startup that makes a product for that purpose.

Exascale Data Analytics @ Facebook by Sambavi Muthukrishnan, Engineering, Facebook

Sambavi started the session by stating that Facebook’s data warehouse has grown rapidly over the years and has posed unique scalability challenges. She briefly outlined the evolution of the analytics software stack over the last year (both storage and query engines) and shared details, based on her experience, of the data management and compute challenges at this scale. During the talk she mentioned various components of the analytics software stack, as follows:

  1. Scribe – Server for aggregating log data streamed in real time from a large number of servers
  2. Scuba – System for real-time, ad-hoc analysis of arbitrary datasets
  3. Puma – System for real-time analytics
  4. Corona – Scheduling framework that separates cluster resource management from job coordination
  5. Hive – Open source, petabyte-scale data warehousing framework based on Hadoop
  6. Giraph – Programming framework to express a wide range of graph algorithms in a simple way and scale them to massive datasets
  7. Presto – Distributed SQL query engine optimized for ad-hoc analysis at interactive speed
  8. HDFS and Data Pipelines

Talking about data lifecycles, she categorized data as hot data, warm data and cold data. She explained how her team improved Corona by introducing resource sandboxing, online upgrades and restartable job trackers.

Corona can now scale to 4000+ node clusters and 120+ jobs/day. With some key challenges solved, Sambavi’s team is now focusing on the next big challenges: multi-temperature data, seamless multi-namespace queries and distributed machine learning.
