Big Data Developer Conference, Santa Clara: Day 2 Highlights

Highlights from the presentations/tutorials by Data Science leaders from Cloudera, LinkedIn, Intel, MapR, Locbit and others on day 2 of Big Data Developer Conference 2015.

The Big Data Developer Conference, organized by Global Big Data Conference, was held last week in Santa Clara from March 23 to 25. It brought together data scientists and professionals who handle data or perform data analytics in a variety of domains, all eager to learn the best and latest in the Big Data ecosystem. The conference tutorials and talks covered a wide variety of topics, including Hadoop, Lambda Architecture, MapReduce, Hive, Pig, Spark and MongoDB.

Highlights from Day 1

Highlights from Day 2 (Tuesday, March 24):

Two parallel workshops were held in the first session of the day. Daniel Templeton from Cloudera and Avkash Chauhan from Big Data Perspective gave workshops on Data Analysis and Practical Predictive Analytics using R, respectively. Daniel taught data management with HDFS and data analysis with Hadoop using Pig, Hive and Impala. In the parallel track, Avkash taught the basics of R and how to perform large-scale machine learning in R using the open source library H2O. Participants in both workshops did hands-on exercises to practice the newly learned concepts.

David Freeman, Head of Security Data Science, LinkedIn delivered an interesting talk titled “Data Science vs. The Bad Guys – Using data to defend LinkedIn against fraud and abuse”. He said that not everyone follows the rules and regulations of the world’s largest professional network. People and bots try to send spam messages, create fake companies and fake jobs, introduce malicious URLs, scrape data, etc., and they go about it in a number of ways. As an example, if lots of fake accounts come from one IP, LinkedIn can block the IP, limit the signup rate from any IP using heuristic rules, or train a model on historical data incorporating signups/IP/hour, number of good / bad accounts on the IP, etc. LinkedIn stops abusers using a separate DB termed “Abuse DB”, which scores each request. Based on the score, the user is either allowed to perform the action or restricted from it.
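The heuristic of limiting the signup rate per IP can be illustrated with a small sliding-window counter. This is only a minimal sketch of the general idea, not LinkedIn's implementation: the class name `SignupRateLimiter`, the threshold of 5 signups and the one-hour window are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class SignupRateLimiter:
    """Hypothetical heuristic: cap accepted signups per IP in a sliding window."""

    def __init__(self, max_signups=5, window_seconds=3600):
        # Both limits are illustrative assumptions, not values from the talk.
        self.max_signups = max_signups
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # ip -> timestamps of accepted signups

    def allow(self, ip, now):
        """Return True if a signup from `ip` at time `now` (seconds) is allowed."""
        events = self._events[ip]
        # Discard signups that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_signups:
            return False  # too many recent signups from this IP
        events.append(now)
        return True
```

In use, each registration request would call `allow(ip, now)` and reject or challenge the request when it returns `False`. A trained model, as mentioned in the talk, could replace the fixed threshold with a learned one.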

He shared three fascinating case studies involving registration, fake accounts and account takeover. To determine whether a registering user is real or fake, LinkedIn has asset reputation systems, which assign a reputation score to each asset based on the level of abuse seen in the past. Reputation scoring is performed both in real time and offline. For each registrant, a machine learning model combines reputation features (offline and online) to produce a registration score. Fake accounts are detected by estimating the probability [Member Reputation] that a given member is not real. He also shared how they estimate the likelihood of an attack using asset reputation, member history, site history and member reputation. He concluded the talk by mentioning that it is best to stop bad guys at the entry points, and that taking care not to bother good members is also critical.
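To make the idea of combining reputation features into a registration score concrete, here is a minimal sketch. The talk summary does not say which model LinkedIn uses, so the logistic combination, the smoothing priors, and the function names `asset_reputation` and `registration_score` are all assumptions chosen for illustration.

```python
import math


def asset_reputation(bad, good, prior_bad=1.0, prior_good=1.0):
    """Smoothed fraction of abusive accounts previously seen on an asset.

    The priors are an assumption; they keep the score at 0.5 for an
    asset (e.g. an IP) with no history instead of dividing by zero.
    """
    return (bad + prior_bad) / (bad + good + prior_bad + prior_good)


def registration_score(features, weights, bias=0.0):
    """Logistic combination of reputation features (hypothetical model).

    Returns a value in (0, 1); higher means the registration looks
    more suspicious. In practice the weights would be learned from
    labeled historical registrations.
    """
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

A caller would compute one reputation value per asset (IP, email domain, and so on), pass them in as `features`, and compare the resulting score with a threshold to decide whether to allow, challenge or block the registration.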

Boain Spassov, CEO, Locbit gave a talk explaining why Locbit chose MongoDB and Redis over Hadoop to start. He began by explaining the problems that IoT (Internet of Things) can solve for an enterprise and described how the Locbit platform helps enterprises analyze their data across platforms.
He mentioned the following reasons for preferring MongoDB & Redis over Hadoop:
  1. MongoDB can exist in an edge environment – One of the biggest problems is that they need business rules to exist on the edge and Hadoop is too heavy for that
  2. Full data flow through Binary JSON from DB to client, eliminating the need for a mid-tier, big processing layer
  3. Real-time analytics with Node
  4. MapReduce and Computation