Strata + Hadoop World 2015 San Jose – Day 2 Highlights

Strata + Hadoop World 2015 was a great conference; here are key insights from some of the best sessions on Day 2.

One of the biggest challenges at big conferences such as Strata + Hadoop World 2015 is that so much happens quickly and simultaneously that it is almost impossible to catch all the action.

We help you by summarizing the key insights from some of the best and most popular sessions at the conference. These concise, takeaway-oriented summaries are designed both for people who attended the conference and would like to revisit the key sessions for a deeper understanding, and for those who could not attend.

Highlights from Day 1

Highlights from Day 2:

Key Ideas from Keynotes:

Eddie Garcia, Chief Security Architect, Cloudera presented a new standard to protect all data, including open data: "secured-by-default data" -> data that carries built-in properties to secure its confidentiality and integrity, and that comes with a built-in security contract between its producers and consumers.

Alistair Croll, Founder, Solve for Interesting mentioned that as Big Data turns into a consumer product, we should ensure that nobody knows more about our lives than we do. He said "When machines get intelligent, we might not even notice, because they'll be us and we'll be them."

Michael Greene, VP, Software and Services, Intel announced Intel's collaboration with Databricks and AMPLab, UC Berkeley to accelerate Spark analytics.

Eden Medina, Data Scientist, Indiana University talked about Project Cybersyn and Big Data lessons from our cybernetic past.

Matei Zaharia, CTO, Databricks mentioned that Spark 1.3 would include Spark DataFrames, which are automatically optimized and therefore faster than equivalent hand-written code in Python and Scala. He also stated that Spark is the most active project in the Big Data space, as well as overall at the Apache Software Foundation. He announced that Spark 1.4 will be released in June this year and will include an R interface, "SparkR".

Joseph Sirosh, CVP, Machine Learning, Microsoft gave a fascinating talk on “the Connected Cow”, showing that data is the true power and every company is a data company. He said "It all starts, not with data & analytics, but with an idea - a question."

Jeffrey Heer, Co-Founder, Trifacta emphasized that visualization tools should show data variation, not design variation. The future of data visualization will rely on how we move from tools that work well for designers to tools that enable analysis and better decision-making in industry and government.

Kurt Brown, Netflix talked about what the Netflix data platform team is up to and why. They are particularly focused on performance and ease of use. They have recently upgraded to Hadoop 2, partnered with the community developing Pig on Tez, adopted the Parquet file format, and fully integrated Presto into their stack. They chose Presto over the alternatives because it works well with AWS, is open source, and is developed in Java, providing a real-world Big Data solution. In their experience, Spark's strong points are a cohesive environment, multiple language support, performance, and a great community; however, they found it still lacks maturity, multi-tenancy/concurrency, robust shuffling, etc.

They use the Parquet file format since it is columnar, works well across projects, and has a great community with strong contributions. They are continuously adding to their big data open source suite; their latest contribution is Inviso, which provides easy searching and visibility into Hadoop execution and performance. They are also working on a cohesive framework for easy platform interaction (via their Big Data API and Big Data Portal).
Patrick Wendell, Databricks shared insights about tuning and debugging a production Spark deployment. Talking about Spark's execution model, he said that the key to tuning Spark apps is a sound grasp of Spark's internal mechanisms. To explain how a user program gets translated into units of physical execution such as jobs, stages, and tasks, he started by explaining the RDD API.

RDDs are a distributed collection of records. Transformations create new RDDs from existing ones, and actions materialize a value in the user program. In essence, transformations build up a DAG, which is materialized through the runJob method. He shared a stage graph to show how two stages perform their tasks and share data. He briefly explained the units of physical execution:
  • Jobs -> Work required to compute an RDD in runJob
  • Stages -> A wave of work within a job, corresponding to one or more pipelined RDDs
  • Tasks -> A unit of work within a stage, corresponding to one RDD partition
  • Shuffle -> Transfer of data between stages
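The transformation/action split above can be sketched in plain Python. This is an illustrative toy, not Spark's actual implementation: transformations only record work into a pipeline (the "DAG"), and an action materializes the result, much as Spark's runJob does.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a dataset plus a recorded pipeline of work."""

    def __init__(self, data, pipeline=None):
        self.data = data
        self.pipeline = pipeline or []  # recorded transformations, not yet run

    # Transformations: return a new ToyRDD, compute nothing yet.
    def map(self, f):
        return ToyRDD(self.data, self.pipeline + [("map", f)])

    def filter(self, f):
        return ToyRDD(self.data, self.pipeline + [("filter", f)])

    # Action: materialize a value in the user program (like collect()).
    def collect(self):
        records = self.data
        for kind, f in self.pipeline:
            if kind == "map":
                records = [f(r) for r in records]
            else:
                records = [r for r in records if f(r)]
        return records

# Nothing executes on these two lines; they only build up the pipeline.
rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # the action runs everything -> [0, 4, 16, 36, 64]
```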

Talking about performance, he mentioned that, in general, avoiding shuffles makes a program run faster. Regarding parallelism, repartitioning can increase parallelism if you are not already using all the slots in the cluster, but not otherwise. Serialization is sometimes a bottleneck when shuffling and caching data, and using the Kryo serializer often makes code run faster. Since Spark scales horizontally, more hardware is generally better. He recommended keeping the executor heap size to 64 GB or less.
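The serializer, parallelism, and heap settings he discussed live in Spark's configuration. A hedged sketch of a spark-defaults.conf; the specific values are illustrative, and only the 64 GB heap ceiling comes from the talk:

```
# spark-defaults.conf (illustrative values)
spark.serializer           org.apache.spark.serializer.KryoSerializer
spark.executor.memory      48g    # keep the executor heap at 64 GB or less
spark.default.parallelism  200    # raise if cluster slots are sitting idle
```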

Prof. Robert Grossman, University of Chicago shared lessons learned from building anomaly detection systems for three operational systems. He mentioned that anomaly detection has three common cases: supervised, unsupervised and semi-supervised learning.

In supervised learning, we have two classes of data, one normal and one consisting of anomalies. This is a standard classification problem, perhaps with imbalanced classes, and it is well understood. In unsupervised learning, however, we have no labeled data representing either anomalies or normal behavior. And in semi-supervised learning, we have some labeled data available for training, but not very much; perhaps labels only for normal data.

He mentioned that "The core of most large scale systems that process anomalies is a ranking and packaging of candidate anomalies so that they may be carefully investigated by subject matter experts." He then presented various use cases and concluded the talk with three approaches for when we have a lot of unlabeled data:

Approach 1:
  1. Create clusters or micro-clusters
  2. Manually curate some of the clusters with tags or labels (even with “orthogonal” data)
  3. Produce scores from 0 to 1000 based upon distances to interesting clusters
  4. Visualize
  5. Discuss weekly with subject matter experts and use these sessions to improve the model.
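Step 3 of Approach 1 might look like the following pure-Python sketch, where the score decays from 1000 toward 0 with distance to the nearest "interesting" cluster centroid. The centroids, the decay scale, and the exponential form are all illustrative assumptions, not details from the talk:

```python
import math

def score(point, centroids, scale=1.0):
    """Score a point 0-1000 by proximity to the nearest interesting centroid."""
    d = min(math.dist(point, c) for c in centroids)  # nearest-centroid distance
    return round(1000 * math.exp(-d / scale))        # 1000 at distance 0, decaying

# Made-up centroids of two "interesting" clusters in 2-D feature space.
centroids = [(0.0, 0.0), (5.0, 5.0)]

print(score((0.0, 0.0), centroids))  # -> 1000 (sits on a centroid)
print(score((1.0, 0.0), centroids))  # lower: one unit away from the nearest
```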

Approach 2:
  1. Create clusters or micro-clusters
  2. Rank order the findings from most significant to least significant.
  3. Compare your findings with other ranking methods. Enrich with other covariates.
  4. Visualize
  5. Discuss weekly with subject matter experts and use these sessions to improve the model.
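The rank-ordering step in Approach 2 reduces, at its simplest, to sorting candidates by a significance score so that experts review the top of the list first. The candidate names and scores below are made up for illustration:

```python
# Made-up (candidate, significance) pairs; higher significance = more anomalous.
candidates = [("host-17", 0.42), ("host-03", 0.91), ("host-88", 0.10)]

# Rank order the findings from most significant to least significant.
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
print([name for name, _ in ranked])  # -> ['host-03', 'host-17', 'host-88']
```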

Approach 3:
  1. Build baseline models for each entity of interest, even if that means thousands or millions of models.
  2. Produce scores from 0 to 1000 based upon deviation of observed behavior from baseline
  3. Visualize
  4. Discuss weekly with subject matter experts and use these sessions to improve the model.
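Approach 3's baseline-and-deviation scoring could be sketched like this in plain Python. The per-entity history, the z-score deviation measure, and the 3-sigma saturation point are illustrative assumptions, not details from the talk:

```python
import statistics

def baseline(history):
    """Baseline model for one entity: mean and stdev of its past behavior."""
    return statistics.mean(history), statistics.stdev(history)

def deviation_score(value, mean, stdev, max_sigmas=3.0):
    """Score 0-1000 by deviation from baseline, saturating at max_sigmas."""
    z = abs(value - mean) / stdev
    return round(1000 * min(z / max_sigmas, 1.0))

# Made-up history for one entity (e.g. logins per day).
logins_per_day = [10, 12, 11, 9, 10, 11, 12, 10]
m, s = baseline(logins_per_day)

print(deviation_score(10, m, s))  # near baseline -> low score
print(deviation_score(40, m, s))  # far from baseline -> 1000 (saturated)
```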