Big Data Innovation Summit 2014 Santa Clara: Highlights of Selected Talks on Day 2

Highlights from the presentations by big data technology practitioners from NYSE, Glassdoor, Slice and Paychex on day 2 of Big Data Innovation Summit 2014 in Santa Clara.

By Anmol Rajpurohit (@hey_anmol), April 29, 2014.

Big Data Innovation Summit 2014 (Apr 9-10, 2014) was organized by Innovation Enterprise at Santa Clara Convention Center in Santa Clara, CA. The summit brought together experts from industry as well as academia for two days of insightful presentations, workshops, discussions, panels and networking. It covered areas including Big Data Innovation, Data Analytics, Hadoop & Open-Source Software, Data Science, Algorithms & Machine Learning, Data Driven Business Decisions and more.

People attending such conferences would agree that so much happens quickly, and often simultaneously, that it is almost impossible to catch all the action. KDnuggets helps by summarizing the key insights from the talks at the conference. These concise, takeaway-oriented summaries are designed both for people who attended the conference and would like to revisit key sessions for a deeper understanding, and for people who could not attend. If you find a talk interesting, check KDnuggets, as we will soon publish exclusive interviews with some of these speakers.

Highlights from selected talks on day 1 (Wed Apr 9)

Here are highlights from selected talks on day 2 (Thu Apr 10):

Andrew Ahn, Senior Director of the Big Data group at ICE/NYSE, opened his talk by describing common business challenges: multi-year data retention requirements, fast access to disparate data for complex analytics under strict SLAs, and so on. The legacy model of data presentation does not meet the demands of today's data consumption model.

Responsible for the development roadmap and go-to-market (GTM) strategy of Pivotal Data Dispatch (PDD), Andrew described a new approach called the "Data Lake". Under this approach, the "Lake" or reservoir is where all "at rest" enterprise data is stored. A key feature of PDD is that it allows virtualization of the Lake(s) and enterprise data platforms.

Design rules followed for Data Dispatch were:
  1. All data will never lie in one physical place
  2. No single technology will be used to store data
  3. Multiple toolsets will be used to analyze data

Key benefits of this concept were:
  1. Users can Discover Data and then provision to analytics space
  2. Admins define meta data and access policies
  3. Cost management

At the end, he mentioned that NYSE and Pivotal have joined forces to bring Data Dispatch to market; the product was launched at the Strata conference in October 2013.

Vikas Sabnani, Senior Director of Data Science & Analytics at Glassdoor, discussed data products and their use cases. His team (which includes professionals from core data science, engineering, product and design) makes Glassdoor's products and business smarter. In many cases, like Glassdoor's, data is the product itself (for example, salary data). Just as design did in the last decade, data will likely form the backbone of the most revolutionary products in the coming decades. He also advised firms to set up good benchmarks and create ambitious goals.

Explaining the first use case, "user acquisition", he advised: start small and scrappy; iterate really, really fast; don't annoy (a lot of) users. Based on insights from the product team, when recommendations for relevant jobs were placed in email alerts, user acquisition increased by a staggering 24% (compared to a mere 5% lift from placing recommendations on the website). Lesson: with recommendations, how and where you deliver them matters a great deal.

The second use case, "Monday Hack", grew out of exploring an unexpected spike in daily visits from job alert emails. The root cause was a software bug: instead of sending jobs from the past 24 hours, users were sent jobs from their entire history. While fixing the bug, the team decided to benefit from the observation by tweaking job alerts to show more recommendations on Mondays, matching users' cyclical increase in appetite for content.

The final use case was about optimizing send times for job alerts. Challenged with increasing workload (due to growth in the number of users), the team wondered whether it would be valuable to personalize send times and spread them throughout the day. A controlled experiment with randomized send times was run and returned positive results: visits were up by 15%.
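A controlled experiment like the one described can be sketched in a few lines: deterministically split users into a control group (one fixed send time) and a treatment group (a randomized send hour), then compare visit counts. This is a minimal illustrative sketch; the function names, the 9am default and the 50/50 split are assumptions, not Glassdoor's actual implementation.

```python
import random

def assign_send_time(user_id, hours=range(6, 22), seed=42):
    """Deterministically place a user in the control group (fixed 9am blast,
    an assumed default) or the treatment group (a randomized send hour).
    Seeding on (seed, user_id) keeps assignment stable across runs."""
    rng = random.Random(hash((seed, user_id)))
    if rng.random() < 0.5:
        return ("control", 9)
    return ("treatment", rng.choice(list(hours)))

def relative_lift(visits_treatment, visits_control):
    """Relative uplift of treatment over control; 0.15 means +15% visits,
    the kind of result reported in the talk."""
    return visits_treatment / visits_control - 1
```

Deterministic per-user assignment matters here: a user must land in the same arm every day of the experiment, or the measured lift is contaminated by crossover.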

Conal Sathi, Data Scientist at Slice, started by briefly describing Slice. Since 2010, Slice has been building products to organize all the e-commerce data locked away in email. It is increasingly difficult for shoppers to manage purchase receipts and track orders across online shopping portals. Slice solves this problem with an integrated e-commerce management system that is highly intuitive and convenient to use. In addition, Slice benefits shoppers with features such as price drop alerts, digital purchase tracking, spending analysis, package tracking and much more.

Meanwhile, Slice mines all this e-commerce transaction data to create a purchase network intended to revolutionize shopping, which they call "the Purchase Graph". Nodes represent what and where people buy; edges represent similarities between nodes, i.e. two nodes are similar if people who buy the first item also buy the second (correlation, not causation). He gave some examples of purchase graphs, pin-pointing insights that can be read from them. He showed a purchase graph in which node color indicates the type of seller; the shorter the edge, the more similar the nodes, and two nodes must be directly connected to be similar. He concluded the talk by highlighting how the purchase graph has helped understand:

  • Psychographics of online shoppers
  • Purchase behavior of shoppers by merchants, brands or even products
  • Competitors/complements
  • User loyalty
  • Advertising (where should one company put its ads)
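The core idea behind such a graph, as described, is co-purchase similarity: items are connected when the same people buy both. A minimal sketch of that construction, using Jaccard overlap of buyer sets as an assumed similarity measure (the talk did not specify Slice's actual metric):

```python
from collections import defaultdict
from itertools import combinations

def purchase_graph(orders):
    """Build a toy co-purchase similarity graph.

    orders: iterable of (user, item) pairs.
    Returns {(item_a, item_b): similarity}, where similarity is the Jaccard
    overlap of the two items' buyer sets -- high when people who buy one
    also buy the other (correlation, not causation).
    """
    buyers = defaultdict(set)
    for user, item in orders:
        buyers[item].add(user)

    edges = {}
    for a, b in combinations(sorted(buyers), 2):
        shared = len(buyers[a] & buyers[b])
        if shared:  # only connect items with at least one common buyer
            edges[(a, b)] = shared / len(buyers[a] | buyers[b])
    return edges
```

In a real purchase graph the same structure supports the use cases listed above: edge weights between a merchant's items and a competitor's reveal substitutes, while strong edges across categories reveal complements.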

Tom Kern, Risk Modelling Manager at Paychex, focused his talk on how predictive modeling brings art and science together to play a key role in business strategy. Without predictive modeling, many strategic decisions are left to the 'gut', ignoring enormous opportunities for data-driven decision making in the age of big data. Paychex leveraged its expertise in predictive analytics to add an empirical layer to sales strategy decisions. With models to predict likely sales units and a yardstick to measure sales value by zip code, sales management became statistically informed when making decisions on quota setting, territory alignment and market expansion.

Tom described the internally built solution, the "Sales Anticipation Model" (SAM), aimed at accurately predicting fiscal sales for each sales territory. The model variables fell into five categories: client demographics, sales history, zip code demographics, economic indicators and loss history. The success of this predictive model helped the ERM (Enterprise Risk Management) team use its analytics skills to gain a voice in senior leadership's strategic decisions.
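A territory-level sales model of this kind can be sketched as a simple regression over numeric features drawn from those five categories. The talk did not describe SAM's actual form, so this ordinary-least-squares stand-in, with made-up feature and function names, is purely illustrative:

```python
import numpy as np

def fit_sales_model(X, y):
    """Fit an OLS model predicting territory sales y from a feature matrix X.
    Columns of X might be features from the categories named in the talk
    (client demographics, sales history, zip code demographics, economic
    indicators, loss history) -- an assumption, not Paychex's actual model."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_sales(coef, X):
    """Predicted sales for new territories, using the fitted coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef
```

Per-zip-code predictions from such a model give exactly the kind of yardstick the talk describes: a statistically grounded baseline against which quotas and territory boundaries can be set.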