Strata + Hadoop World 2015 Singapore – Day 2 Highlights

Here are the quick takeaways and valuable insights from selected talks at one of the most reputed conferences in Big Data – Strata + Hadoop World 2015, Singapore, day 2.

This was the first Strata + Hadoop World conference in Asia, held in Singapore, and it drew a great response from big enterprises and SMEs alike. Data scientists, big data engineers and leaders from all over the world came to share their experiences and explore where the big data revolution is heading.

Strata Hadoop Singapore 2015

We help you by summarizing the key insights from some of the best and most popular sessions at the conference. These concise, takeaway-oriented summaries are designed for both people who attended the conference but would like to re-visit the key sessions for a deeper understanding and people who could not attend the conference.

Summary of Day 1:

  • “Hadoop will disappear”: the discussion will shift to what we do with the Hadoop ecosystem rather than whether or not to have one.
  • After the improvements in data pipelines, better tools will become available for data science and analytics.
  • To make a data science project successful and well accepted, we should target the use-cases which will impact the business as a whole rather than parts of it.
  • Even though data science is traditionally considered a mix of art and science, it will get standardised through tools and workflows.

Highlights from Day 1

Quick Takeaways from Keynotes:

Doug Cutting, Chief Architect, Cloudera, discussed the growth of and challenges for the data ecosystem. He began his talk with the history of the data science ecosystem and the three pillars that have brought us to this big data century: 1. improved hardware capabilities, 2. data and engineering, which transformed industries, and 3. open source technologies, which allowed smaller businesses to develop trustworthy and scalable software. Apache Hadoop not only made solutions much more powerful and cost-effective, it also made them more flexible, without any centralised controller. This made it possible to add new components and to replace older ones with faster and better ones. He concluded with the major challenges going forward:

  • Maturity: these technologies are fairly young and have to be tested in different scenarios and industries.
  • Talent: We have a huge shortage of skilled engineers who can use these tools.
  • Complexity: The decentralized architecture allows us to have more data, more tools and more points of interaction. To address this, we can use common formats for data transfer, such as Parquet or Avro, along with lineage and auditing processes.
  • Security: Adding better authentication and finer-and-finer grained authorization throughout the stack.
  • Trust: Building systems that people trust. As we are collecting more personal data, we should build customers' confidence in the applications by providing transparency.
  • Change: The rate at which technologies and applications are changing is very fast.

Ivan Teh, Managing Director, Fusionex, talked about how big data is changing and reshaping the world. The physical and digital worlds have converged because of technological advancements, and will continue to do so. With this come big data and the big challenge of deriving insights from the data, which can be addressed with data science and analytics.


Sanqi Li, CTO of Products and Solutions, Huawei, described big data driven networks, where infrastructure is geo-distributed and consumes PaaS, cloud OS, network OS and so on. Big data will be powered by these systems, which have no centralized controller and will need next-generation analytical tools. For those tools we turn to the open source community, as it provides the best way to learn through collaboration, consolidation and co-innovation. Later he shared how Huawei used this experience to transform the business of Shanghai Unicom by deriving insights from its data and monetizing them.


In the coming age of IoT, we will be operating on edge networks to get insights, and we will face the following major challenges:

  • The lack of a geo-distributed, tiered streaming/analytics architecture that goes beyond the current focus on centralized data center analytics
  • The lack of a network telemetry framework in SDN/NFV activities
  • The need to move towards an open E2E analytics ecosystem across the data center, network edge and IoT client devices for intelligent E2E IoT connectivity

Ziya Ma, Director, Big Data Technologies, Intel Corp, talked about new optimizations for big data and analytics. She explained how Intel’s enterprise platform is reshaping the retail, healthcare and transportation industries.

Melanie Warrick, Deep Learning Engineer, Skymind, gave an overview of deep learning, how it learns and why it is used. In overly simplified terms, deep learning can be compared to the way the human brain learns, in an evolutionary way. It is mostly used for feature engineering, language and image processing, and unsupervised modelling. The major types of deep learning networks are Feed Forward Neural Networks (FFNN) for finding patterns with deep architectures, Convolutional Neural Networks (CNN) for image datasets, Recurrent Neural Networks (RNN) for linguistic data and time series, and Restricted Boltzmann Machines (RBM) for generating, deconstructing and restructuring data. She shared well-known use-cases of deep learning, such as reading handwriting, playing games, writing stories, speech recognition, describing images and self-driving cars.

Reynold Xin, Cofounder, Databricks, shared the current state of Apache Spark and where it is going. In 2015 the Spark community grew substantially, both in attendance at summits and meetups (about four times) and in the number of contributors to Spark. The industries, datasets and applications running on top of Spark have become more varied and broader in scope. Talking about trends, newer Spark users prefer standalone deployment (48%) to running on top of Hadoop (41%) or Mesos (11%) clusters.


In terms of use-cases (implemented in Asia), the largest Spark cluster is run by Tencent, with 8000+ nodes churning through 150+ PB of data, which is growing at 1 PB per day. Alibaba’s Taobao uses Spark for graph processing, fraud detection and recommendation systems. Banks use it for credit risk analysis and customer acquisition along with traditional batch processing. In 2016, Spark will add more API support and optimized algorithms for ML. From version 1.6 onwards, programmers will be able to use the type-safe Dataset API for stronger contracts and better engineering.

Farrah Bostic, Founder, The Difference Engine, talked about how to use qualitative data for research and enterprise. The challenges of collecting qualitative data for research are the cost and the time involved. The following are the major issues when performing qualitative research:

  • Too late to begin with: the company or team has already come up with a solution and just wants to validate it.
  • You start out wrong about everything: knowledge is misrepresented and researchers lose sight of their goal along the way, or they hold biases well before beginning the research, which discourages experimentation and investigation.
  • You don’t use the tools at your disposal to learn all you can: often teams haven’t looked at their own data or done their own research on existing data.
  • You rely on the horse’s mouth: people usually don’t pay attention to differing opinions, and they select samples which are not representative.

Why do qualitative research? It tells you the why and how behind the what of quantitative datasets, and it can also reveal opportunities for innovation with existing datasets.

Valuable Insights from Selected Sessions:

Building and deploying real time big data prediction models by Deepak Agrawal (24[7] Inc.). Deepak shared his experience of building real-time models with web analytics data. In customer support, about 98% of customer issues are solved by self help (McKinsey); only 1.1% go through assisted, non-self-help channels, which still amounts to 7 million+ users in Europe. So it is important to provide instant support, and a preferable way to do so is to predict where a user might need help. 24[7] provides services to clients who generate 3 billion transactions and 50 TB of data per week, predicting customer journeys and support needs. Lessons learned while developing these real-time big data models:

  1. Websites are dynamic, and changes to a website will break the models, as sites are changed on the fly and the rate of change is high. They deal with this by working with the client through an alert-based system.
  2. As models become more complex, evaluation time also increases exponentially, resulting in latency. So while designing your model, keep strict metrics for executing it on real-time data (such as a target of < 50 milliseconds).
  3. You need a robust A/B testing platform which continuously monitors model performance and also indicates the right time to intervene. Split the funnel 50-50 between a control group and a non-control (test) group, with the target population clearly defined.
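
The 50-50 split in the last lesson can be sketched as a deterministic, hash-based assignment. This is only an illustration and not the platform described in the talk: the function names, the experiment label and the visitor id format are all hypothetical.

```python
import hashlib


def assign_group(visitor_id: str, experiment: str = "support-model-v1") -> str:
    """Deterministically assign a visitor to 'control' or 'test' (50-50).

    Both the experiment label and the visitor id format are hypothetical.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer: even -> control, odd -> test.
    bucket = int.from_bytes(digest[:8], "big") % 2
    return "control" if bucket == 0 else "test"


def split_counts(visitor_ids):
    """Count how many visitors land in each group."""
    counts = {"control": 0, "test": 0}
    for visitor_id in visitor_ids:
        counts[assign_group(visitor_id)] += 1
    return counts


if __name__ == "__main__":
    print(split_counts(f"visitor-{i}" for i in range(10_000)))
```

Hashing the visitor id, rather than drawing a random number on each visit, keeps a returning visitor in the same group for the life of the experiment, which is what makes the two groups comparable over time.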

Data problems while modelling:

  • Data sparseness, driven by a high ratio of new to repeat visitors and long gaps between visits
  • The ability to stitch customer data across channels and devices for an omni-channel view of customers (tracking users across mobile, desktop, tablets and different networks)
  • The ability to capture, process and store ever-expanding data streams while maintaining data quality
  • The ability to build and deploy models at scale; currently it takes 3 days to deploy a model after testing

Computational privacy: The privacy bounds of human behavior by Yves-Alexandre de Montjoye (MIT Media Lab). In his intriguing talk, de Montjoye made it clear that even with anonymization we won’t be able to protect users’ privacy.

He developed the concept of unicity to study the risks of re-identification in large-scale metadata datasets. He showed that 4 spatio-temporal points are enough to uniquely identify 95% of people in a mobile phone database of 1.5M people, and 90% of people in a credit card database of 1M people. He used machine learning techniques to study what can be inferred about individuals from metadata. For example, using behavioral indicators computed from metadata with the Bandicoot toolbox, he was able to predict people’s personality up to 1.7x better than random. He concluded by mentioning that the only way to protect users’ identities is to provide limited access to the customer datastore, through web querying or multi-level secure access to the data.
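
To make the idea of unicity concrete, here is a minimal sketch under a simplified reading of the concept: for each user, draw p points from their trace and check whether any other user's trace also contains all of them. The function name, the (hour, cell) point format and the toy data are assumptions made for illustration, not the method or data from the talk.

```python
import random


def unicity(traces, p, seed=0):
    """Fraction of users uniquely identified by p points from their own trace.

    traces maps a user id to a set of (hour, cell) points. This is a
    simplified, illustrative estimate with one random draw per user.
    """
    if not traces:
        return 0.0
    rng = random.Random(seed)
    unique = 0
    for user, trace in traces.items():
        points = sorted(trace)  # random.sample needs a sequence, not a set
        sample = set(rng.sample(points, min(p, len(points))))
        # Every user whose trace contains all of the sampled points.
        matches = [u for u, t in traces.items() if sample <= t]
        if matches == [user]:
            unique += 1
    return unique / len(traces)


if __name__ == "__main__":
    # Hypothetical toy traces: each point is (hour of day, cell tower id).
    toy = {
        "a": {(1, 1), (2, 2)},
        "b": {(1, 1), (2, 3)},
        "c": {(1, 1), (2, 2)},
    }
    print(unicity(toy, p=2))
```

In this toy data, users "a" and "c" share an identical trace, so only "b" can be singled out when both points are known. On real mobility data traces are long and rarely coincide, which is why a handful of points is enough to identify most people.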

Modeling the smart and connected city of the future with Kafka and Spark by Eric Frenkiel (MemSQL). For building smart cities we will need real-time queuing systems, a high-speed batch processing engine and a scalable, fast database (one that provides read and write speeds comparable with SQL databases). Eric talked about how we can architect such a pipeline with Kafka, Spark and MemSQL.

Here’s how it works:

  • Data from streams is pushed to Apache Kafka.
  • Spark Streaming ingests event data from Apache Kafka, filters it by event type and enriches each event with time and geo-location data.
  • Using the MemSQL Spark Connector, data is then written to MemSQL, with each event type flowing into a separate table. MemSQL handles record deduplication (Kafka’s “at least once” semantics guarantee fault tolerance but not uniqueness).
  • As data streams in, users can run queries in MemSQL to generate different metrics and report on various events.
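
The flow of these steps can be sketched in plain Python. This is a stand-in only: it does not use Kafka, Spark or the MemSQL Spark Connector, and the event fields, the KNOWN_TYPES set and the dict-based "tables" are hypothetical. Its one purpose is to show why keying each table by event id makes "at least once" delivery safe.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical event types to keep; everything else is filtered out.
KNOWN_TYPES = {"traffic", "parking"}


def enrich(event, sensor_locations):
    """Return a copy of the event with a processing time and geo-location."""
    enriched = dict(event)
    enriched["processed_at"] = datetime.now(timezone.utc).isoformat()
    lat, lon = sensor_locations.get(event["sensor_id"], (None, None))
    enriched["lat"], enriched["lon"] = lat, lon
    return enriched


def process_batch(events, sensor_locations, tables):
    """Filter, enrich and write one micro-batch into per-type tables.

    Each table is a dict keyed by event_id, standing in for a primary key,
    so an event that is delivered twice overwrites itself instead of
    creating a duplicate row.
    """
    for event in events:
        if event["type"] not in KNOWN_TYPES:
            continue
        row = enrich(event, sensor_locations)
        tables[row["type"]][row["event_id"]] = row
    return tables


if __name__ == "__main__":
    locations = {"s1": (1.29, 103.85)}
    batch = [
        {"event_id": 1, "type": "traffic", "sensor_id": "s1"},
        {"event_id": 1, "type": "traffic", "sensor_id": "s1"},  # redelivered
        {"event_id": 2, "type": "parking", "sensor_id": "s1"},
        {"event_id": 3, "type": "noise", "sensor_id": "s1"},  # filtered out
    ]
    tables = process_batch(batch, locations, defaultdict(dict))
    print({name: len(rows) for name, rows in tables.items()})
```

In the architecture described in the talk, the deduplication is done by MemSQL itself; the dict keyed by event id only imitates that behaviour.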

Enterprise Deep Learning Workflows with DL4J by Josh Patterson (Patterson Consulting). With the evolution of deep learning, many industries are eager to try it out with their current enterprise solutions. Josh talked about why DeepLearning4J is one of the most suitable frameworks for enterprises and what its advantages are. He explained that enterprise pipelines are interconnected and operate with varied frameworks and tools, so languages like Python or R might create a bottleneck, with developers spending more time resolving integration issues than modelling. DL4J runs on Java, and it can run on top of Spark and CUDA.

Bio: Devendra Desale(@DevendraDesale) is a data science graduate student currently working on text mining and big data technologies. He is also interested in enterprise architectures and data-driven business. When away from the computer, he also enjoys attending meetups and venturing into the unknown.