This was the first Strata + Hadoop World conference in Asia, held in Singapore, and it drew a great response from large enterprises and SMEs alike. Data scientists, big data engineers, and leaders from all over the world came to share their experiences and explore where the big data revolution is heading.
One of the biggest challenges at big conferences such as Strata + Hadoop World 2015 is that there is so much happening quickly and simultaneously that it is almost impossible to catch all the action.
We help you by summarizing the key insights from some of the best and most popular sessions at the conference. These concise, takeaway-oriented summaries are designed both for people who attended the conference and would like to revisit the key sessions for a deeper understanding, and for people who could not attend.
Quick Takeaways from Keynotes:
In his keynote, Mike Olson, Chief Strategy Officer of Cloudera, shared how the next generation of analytics is evolving. When he said Hadoop will "disappear", he meant that the Hadoop ecosystem will become so ubiquitous that we won't need to talk about it; instead we will talk about insights and analytics, and we will see richer tools for data exploration, insight, and decision making. He also discussed how Cloudera is embracing Apache Spark as the core engine for developing big data and machine learning applications, and mentioned Cloudera's latest analytics tools, such as RecordService, Kudu, and Cloudera Navigator Optimizer, which will enable businesses to create secure, faster, and more optimized data analytics systems.
Rod Smith, VP of Emerging Technologies at IBM, talked about big data technologies maturing and thereby creating new opportunities for data scientists, analysts, and engineers. He also discussed how machine learning will become more domain-specific, with technology tailored to the problem context. This will demand more domain-oriented data scientists who understand the business units and the organisation.
Kevin Lee, VP at GrabTaxi, shared how they are simplifying choices for drivers and passengers. They used deep learning alongside other feature extraction techniques to build models for matching drivers with passengers. The deep learning model beat human feature engineering on all of their model success criteria, including accuracy, training time, and production performance.
Tara Hirebet, Regional New Business Director at R/GA, talked about the risk of smart cities becoming data "big daddies", over-authoritative toward the people living in them. People come to cities for the challenges, freedom, and opportunities they provide, yet they might end up overburdened by the cities' management. The golden mean here could be "data empowerment", which would motivate citizens to make better decisions without worrying about a big daddy watching over them.
Deepak Ramanathan, Chief Technology Officer, SAS Asia Pacific, shared the key patterns emerging from a wide cross-section of corporate and institutional Hadoop journeys. Data modelling is set to become a well-structured process, and tools for it will emerge.
Rishi Malhotra, Co-founder and CEO of Saavn, shared how a music streaming company whose user base is largely mobile (95%) has used this rich data to improve its services. Data has to be part of the company; "Eat data for breakfast" is one of the company's mottos. Their datasets act as a proxy for a billion users: they track how users access the service and find patterns. He emphasized the data ripple effect, and how corporations should consider collaborating and developing cross-industry strategies based on data.
Fig. 1: Different Music Streaming Patterns in India, US, Singapore from mobile user data (from Saavn)
Amit Bansal, MD of Accenture Digital, shared the idea of "liquid expectations": faster development cycles and far less time to react. He shared how machine learning is evolving and how it will replace a lot of routine work, giving the example of The Grid, a site that builds websites entirely by means of machine learning.
Jana Eggers, President of Nara Logics, shared ideas on how to deal with all the change AI is going to bring to organisations. Don't think of AI as robots; AIs are algorithms that act as enablers or helpers for existing systems. We need mathematicians, product owners, data engineers, and business people to build the next generation of products using AI. To be successful in the future, we need to collaborate and interact with these algorithms.
Valuable Insights from Selected Sessions:
Building South East Asia's largest e-commerce recommender, by Kai Xin Thia (Lazada), covered lessons learned while developing a large-scale recommendation system and how they want to improve it in the future. When selecting metrics, they consider not only historical performance but also how to improve user experience and serendipity during search. The key lessons:
- Building a single recommender that works across multiple countries and cultural preferences is very complex and expensive
- Tuning different types of recommenders, each with different strengths and weaknesses, requires considering multiple trade-offs
- Some models might do well on new customers (cold start problem) while others perform better on customers with a good shopping history
- Apache Spark is powerful, but it needs a strong underlying Hadoop infrastructure and tools like Kafka to handle clickstream data in order to perform analytics on millions of users
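The cold-start trade-off in the lessons above can be sketched in a few lines. This is an illustrative toy, not Lazada's actual code: it blends a popularity recommender (which works for brand-new users) with a co-occurrence recommender (which works for users with shopping history), with a hypothetical weight that ramps up over the user's first five purchases.

```python
# Illustrative sketch (not Lazada's system): blending a cold-start-friendly
# popularity recommender with a history-based recommender.
from collections import Counter

def popularity_scores(all_purchases):
    """Score items by global purchase share -- usable for brand-new users."""
    counts = Counter(item for items in all_purchases.values() for item in items)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}

def history_scores(user_items, item_neighbors):
    """Score items by similarity to what this user already bought."""
    scores = Counter()
    for bought in user_items:
        for item, sim in item_neighbors.get(bought, {}).items():
            scores[item] += sim
    return dict(scores)

def blended_recommendations(user_items, all_purchases, item_neighbors, k=3):
    """New users lean on popularity; heavy shoppers lean on their history."""
    w = min(len(user_items) / 5.0, 1.0)  # hypothetical ramp-up over 5 purchases
    pop = popularity_scores(all_purchases)
    hist = history_scores(user_items, item_neighbors)
    candidates = (set(pop) | set(hist)) - set(user_items)  # don't re-recommend
    scored = {i: (1 - w) * pop.get(i, 0) + w * hist.get(i, 0) for i in candidates}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

A model like this makes the trade-off explicit: the blend weight is one more hyperparameter to tune per market, which is part of why a single recommender across countries is expensive.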
Below is the architectural diagram of the Lazada recommender system. The red circle indicates the area of focus for further improvement.
Fig. 2: Lazada Recommender System Architecture and Future Work
Fast big data analytics with Spark on Tachyon in Baidu, by Bin Fan (Tachyon Nexus) and Xiang Wen (Baidu), covered Tachyon, an open source memory-centric distributed storage system that enables reliable data sharing at memory speed across cluster jobs, possibly written in different computation frameworks such as Spark and MapReduce. Tachyon sits between computation frameworks or jobs and various kinds of storage systems.
Fig. 3: Tachyon's Location in Data Processing and Supported Tools
At Baidu, Tachyon was used to improve data analytics performance by 30 times. Their earlier Hadoop/HDFS combination for processing data was replaced by Spark and Tachyon to churn through 1 PB of data storage.
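The idea behind that speedup can be illustrated with a minimal sketch. This is a conceptual toy, not the Tachyon API: a memory tier sits in front of a slow backing store, so the first job pays the storage cost and every later read of the same dataset is served from memory.

```python
# Conceptual illustration (not the Tachyon API) of a memory-centric tier
# between compute jobs and slow storage.

class SlowStore:
    """Stands in for HDFS or another disk-backed storage system."""
    def __init__(self, data):
        self.data = data
        self.reads = 0  # count of expensive reads, for illustration

    def read(self, key):
        self.reads += 1          # each call simulates a slow disk/network read
        return self.data[key]

class MemoryTier:
    """Stands in for a Tachyon-like in-memory layer shared across jobs."""
    def __init__(self, backing):
        self.backing = backing
        self.cache = {}

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.backing.read(key)  # cold read hits storage
        return self.cache[key]   # warm reads are served at memory speed
```

Two jobs scanning the same dataset through the memory tier trigger only one backing-store read, which is the effect Tachyon exploits when different frameworks share the same working set.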
The journey to value using advanced analytics, by Thomas Beaujard (Accenture Digital) and Tom Ridsdill-Smith (Woodside), covered how they identified opportunities for analytics in the oil and gas domain and their approach to solving the problems with machine learning.
Fig. 4: Data Pipeline Designed by Accenture for Woodside on AWS
In the initial stages they quickly (6-9 months) developed a platform on the cloud to integrate data from all the sensors and machinery operating within the company. Once the data was ready, they identified use cases and prioritised them by ROI. They showcased two use cases: one for predictive maintenance of valves, and one for anomaly detection to generate alerts for AGRU foaming. You can find their presentation here.
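The anomaly-detection use case can be sketched in its simplest form. The actual Accenture/Woodside models were not presented in code, so this is a hypothetical stand-in: flag any sensor reading that deviates more than a threshold number of standard deviations from its recent rolling window.

```python
# Hypothetical sketch of anomaly detection on a sensor stream (not the
# actual Accenture/Woodside model): a rolling z-score alert.
from statistics import mean, stdev

def rolling_zscore_alerts(readings, window=10, threshold=3.0):
    """Return indices of readings that look anomalous vs. the prior window."""
    alerts = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        # Guard against a perfectly flat window (stdev of 0).
        if sigma > 0 and abs(readings[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts
```

Production systems would add domain knowledge (operating modes, valve cycles, multivariate signals), but the alert-on-deviation structure is the same.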
Building a self-serve real-time reporting platform at LinkedIn, by Shirshanka Das (LinkedIn), shared how LinkedIn transformed its internal data pipelines to build an integrated, high-speed, scalable system. Along the way they developed tools such as Gobblin, a central ingestion platform for both stream and batch data; Pinot, which serves the data through easy queries; and Raptor, a big data visualization framework. The company also adopted a Unified Metrics Platform (UMP), which gave them reusable metrics and a standardised process. Presentation here.
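The "reusable metrics" idea behind a platform like UMP can be sketched as a metric registry: each metric is defined once under a canonical name, and every consumer, batch or real-time, evaluates it by name so all reports agree. The names and structure below are hypothetical, not LinkedIn's actual UMP schema.

```python
# Hedged sketch of the define-once, reuse-everywhere metrics idea
# (hypothetical structure, not LinkedIn's actual UMP).

METRICS = {}

def metric(name):
    """Register a metric computation under a single canonical name."""
    def register(fn):
        METRICS[name] = fn
        return fn
    return register

@metric("daily_active_users")
def daily_active_users(events):
    # Count distinct users, ignoring automated "ping" events.
    return len({e["user"] for e in events if e["action"] != "ping"})

def report(events, names):
    """Any pipeline (batch report or live dashboard) evaluates metrics by
    name, so every consumer sees exactly the same definition."""
    return {name: METRICS[name](events) for name in names}
```

The payoff is standardisation: changing a metric's definition in one place changes it for every dashboard and report that references it.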
Fig. 5: Unified Metrics Platform Data Flow by LinkedIn
Highlights from Day 2
Bio: Devendra Desale(@DevendraDesale) is a data science graduate student currently working on text mining and big data technologies. He is also interested in enterprise architectures and data-driven business. When away from the computer, he also enjoys attending meetups and venturing into the unknown.