Big Data Innovation Summit 2014 Santa Clara: Highlights of Selected Talks on Day 1

Highlights from the presentations by big data technology practitioners from eBay, YarcData, LinkedIn, Trulia, and other leading companies on day 1 of Big Data Innovation Summit 2014 in Santa Clara.

Big Data Innovation Summit 2014 (Apr 9-10, 2014) was organized by Innovation Enterprise at Santa Clara Convention Center in Santa Clara, CA. The summit brought together experts from industry as well as academia for two days of insightful presentations, workshops, discussions, panels and networking. It covered areas including Big Data Innovation, Data Analytics, Hadoop & Open-Source Software, Data Science, Algorithms & Machine Learning, Data Driven Business Decisions and more.

Anyone who attends such conferences would agree that so much happens so quickly, and often simultaneously, that it is almost impossible to catch all the action. KDnuggets helps by summarizing the key insights from the talks at the conference. These concise, takeaway-oriented summaries are designed both for people who attended the conference but would like to revisit the key sessions for a deeper understanding and for people who could not attend. If any talk here interests you, check KDnuggets, as we will soon publish exclusive interviews with some of these speakers.

Here are highlights from selected talks on day 1 (Wed Apr 9):

Bharath Sudarshan, Chief Data Scientist at WellDoc Inc., started by establishing how serious a problem we face when it comes to health issues. 22.3 million people (about 7.1% of the US population) suffer from diabetes, and given the steep rise in the number of patients, its prevalence is expected to almost double by 2030.

He proposed addressing such problems through mobile prescription therapy, real-time coaching for patients, and a cloud-based expert analytics system. He explained that Mobile Health (mHealth) factors are of three kinds: Clinical, Behavioral and Engagement, and that working at the intersection of these three is the need of the hour. Drawing a parallel with the Data Science Value Chain, he walked through the steps of the mHealth Data Journey with examples and made the case that analytics-based mobile applications can improve a great deal in the healthcare sector.

[Figure: mHealth Factors]

Matthias Spycher, Chief Engineer at eBay, talked about the Big Data ecosystem at eBay and discussed a number of use cases where near-line and offline analytics enable personalization of the customer experience. He explained the complicated process behind personalizing and engaging users. The stack for this is built on a data platform base that performs tasks such as tracking, experimenting and analyzing. On top of it lies the Insights layer, which helps personalize, recommend and market. Above the Insights layer sit two parallel layers, User Engagement and User Messaging, which together enable a compelling customer experience. The challenges he cited included multi-screen, data quality and governance.

He also discussed the Personalization Service Data Flow and In-cache Predictive Model Evaluation. He emphasized that because site speed matters the most, his team performs the machine translation process in less than 50 ms using the architecture displayed below.

[Figure: Stack for user personalization and engagement]

Venkat Krishnamurthy, Director, Project Management at YarcData, focused on delivering high-performance analytic platforms for the data discovery market. YarcData is the analytics subsidiary of Cray, a supercomputing pioneer. He quoted the Financial Times: “Big data is a vague term for a massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media”. YarcData applies Cray's technology to problems involving large volumes of data, enabling a move from the old-fashioned batch-oriented approach to interactive processing. He also discussed a few use cases from various verticals.

[Figure: Big Data question]

Anuj Goyal, Software Engineer at LinkedIn, began with some statistics about LinkedIn: 270M+ users as of March 2014, with 2 new members joining per second; 3M company pages; and 90% of companies using LinkedIn to hire. One of the main reasons users log in to LinkedIn is to find jobs. The actual value that a recommendation has is 50%, based on LinkedIn data. He discussed two kinds of evaluation metrics: Upside and Downside.

Upside metrics are about users getting relevant jobs, while downside metrics are about users getting offending or irrelevant jobs. Giving a quick overview of the recommendation algorithm, he highlighted a few challenges: Entity Resolution (IBM has 13,000+ name variations, and many different job titles describe the same role), Geo Location (for example, recommending a job in New York to a developer in the Bay Area; some locations can be categorized as "sticky", meaning very few professionals migrate away from them), and the network effect (estimating the chances of a person leaving a company by looking at his or her network).
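The entity resolution challenge mentioned above can be illustrated with a minimal sketch. This is not LinkedIn's method, which the talk summary does not describe; it is a toy example that assumes a small hand-built alias table, a short list of legal suffixes, and a fuzzy-match threshold of 0.85, all of which are hypothetical. It uses only Python's standard `re` and `difflib` modules.

```python
import re
from difflib import SequenceMatcher

# Hypothetical alias table (illustrative only, not LinkedIn data):
# normalized alias -> canonical company name.
CANONICAL = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "linkedin": "LinkedIn",
}

# Assumed, incomplete list of legal suffixes to drop when normalizing.
SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}


def normalize(name: str) -> str:
    """Lowercase, delete punctuation, and drop trailing legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    tokens = cleaned.split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def resolve(name: str, threshold: float = 0.85):
    """Map a raw company string to a canonical name, or None if unsure."""
    key = normalize(name)
    if key in CANONICAL:
        return CANONICAL[key]
    # Fall back to fuzzy matching against the known aliases.
    best_alias, best_score = None, 0.0
    for alias in CANONICAL:
        score = SequenceMatcher(None, key, alias).ratio()
        if score > best_score:
            best_alias, best_score = alias, score
    return CANONICAL[best_alias] if best_score >= threshold else None


if __name__ == "__main__":
    for raw in ["I.B.M. Corp.", "Internatonal Business Machines", "Acme Widgets"]:
        print(raw, "->", resolve(raw))
```

Exact lookup after normalization handles punctuation and suffix variants, while the fuzzy fallback catches misspellings. A production system with thousands of variations per company would likely need richer signals than string similarity alone.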

He closed by presenting a hybrid recommendation algorithm as a solution to the challenges mentioned above.

[Figure: Sticky locations map]

Amy Gershkoff, Director of Analytics at eBay, shared Big Data lessons from the Obama campaign. She recommended that data acquisition should not be our focus; rather, we should focus on building good business intelligence tools that enable us to make use of all the data we already have. A best-in-class BI tool will leverage data from multiple sources, reorganize data around key business questions, simplify the data to display only the most relevant metrics, visualize the data to make trends easy to see, and benchmark results against KPIs.

It is very important to measure daily and, based on the insights from the data, to pivot the message, channel and targeting for the campaign. She concluded her talk with an emphasis on content, saying: "If you are only measuring dollars spent, you are missing more than half of the story".

[Figure: Content is king]

Todd Holloway, Data Science Lead at Trulia, leads the company's Data Science team, which applies machine learning, network science, NLP, and computer vision technologies to the large datasets found in the real estate domain. Trulia is a real estate portal with over 35 M monthly users and 14.5 M mobile users.

Trulia has the following teams: the Econometrics/PR team, the Analytics team, the Geo team and the Data Science team. The Data Science team's mission is three-fold: to transform data into new content, improve content relevance and improve monetization. The team's skills include machine learning, hacking, text mining, image mining, network science and data visualization. The analytics process followed by his team can be summarized as:
Understand problem -> Process Data -> Experiment -> Productionize -> Integrate

[Figure: Desired team skills]