Strata Conference Reports and Highlights

Highlights from the Strata Feb 2013 Conference on Big Data, covering Hadoop, Python Data Science, game data mining, Groundhog Day, data thoughtcrime, situational awareness, and more.

Here are highlights from interesting reports on the recently completed
Strata 2013 Conference on Making Data Work,
Feb 26-28, Santa Clara, CA.

See also
If you missed Strata 2013, here are 53 videos.

I particularly liked 3 great and informative reports from Michael Malak.

Strata Trip Report: Days 0 & 1, Michael Malak

  • Berkeley Stack: Spark, Shark, and Spark Streaming. Map/Reduce is 5-10 years old now, and there are many replacements which promise to be faster than Hadoop, including Impala (Cloudera), Apache Drill (MapR), Hawq (Greenplum, announced today), and Stinger (Hortonworks, announced last week).
    Spark beats them all because it's both a) free and b) available (and actually "mature", at version 0.7 this week with earlier versions having been out for a couple of years).

  • Python Data Science, with an interactive shell called iPython Notebook.
  • Platfora, like Tableau but works directly with Hadoop and allows end-users to construct reduced datasets on the fly. At only $60k/year/server, it reduces every Big Data project to an ingestion problem: load the data, then just hook up Platfora.

Strata Trip Report: Day 2, Michael Malak

  • Electronic Arts: Real-time processing of big data because it's critical to monetization of their games, now that games are free and sales come from in-game ad placement and sales of virtual goods.
  • Jeanne Harris (Accenture) said that in 1990, everyone was quoting Field of Dreams "Build it And They Will Come" when they should have been quoting Groundhog Day, because every five years there's a new cycle of "imagine the possibilities of databasing all this extra data."
  • Big Data is a Hotbed of Thoughtcrime, by Jim Adler: Is inferring private information using only public information a thoughtcrime (unethical)? Is it thoughtcrime to detect thoughtcrimes?

Strata Trip Report: Day 3, Michael Malak

  • Intel is going to have its own Hadoop distribution.
  • Third Generation Tools for Realizing Machine Learning Algorithms. The speaker classified first generation as desktop (or single server), such as R, and second generation as Map Reduce (e.g. Mahout), and third generation as post-Map Reduce. He actually called Spark a third-generation machine learning tool.
  • "Excel Big Data" demo from Microsoft has point-and-click querying of HDFS that translates into Map/Reduce. Once the data was populated in the spreadsheet, it looked like you were just left to your own regular Excel visualization devices.

Summary of My First Trip to Strata, Ryan Rosario, Byte Mining Blog.

  • Code for America, democratizing data and open data initiatives in local governments, and studying bail amounts and the outcome of a criminal trial.
  • Visualization strand: Agile Data Wrangling and Web-based Visualizations, including analysis of the Federal Election Commission dataset using the pandas Python package.
  • Law, Ethics and Open Data Strand: Sci vs. Sci: Attack Vectors for Black-Hat Data Scientists and Possible Countermeasures. Every skill has a good use and an evil use and Data Science is no exception. We create models to try to combat fraud, detect spam, measure influence and much more.
  • Data Science: The IPython Notebook: a Comprehensive Tool for Data Science
  • Adversarial Learning: What To Do When Your Machine Learning Gets Attacked

Contextual + Situational Awareness: The Next Big Thing in Big Data, Silicon Angle

In a roundup of what was hot at Strata, Croll mentioned EA's presentation about how video game data is being used. While they play, "humans leave a bread crumb shell" that is worth analyzing. Beyond determining customer behavior, data gathered in-game has many other practical applications, such as rearchitecting cities based on places where people tend to get stuck in a game.

Croll pointed to a future evolution toward situational awareness. For example, apps would react differently depending on what the user is doing (walking, driving, or sitting), adapting to the current situation.

O'Reilly Strata: Busting Big Data Adoption Myths-Part 1, SQL Server Team.

Think infrastructure and scalability will impede your path to big data analytics? Windows Azure HDInsight is your big data solution.

Big Data insights from the Wikibon project:

Two major themes have emerged: (1) the Hadoop distribution competition is hot and getting hotter and (2) bringing SQL to Big Data is gaining acceptance as the preferred way to democratize Big Data.

O'Reilly Strata: Busting Big Data Adoption Myths-Part 2, SQL Server Team.

If concerns about having the right skillsets on staff are stopping you from trying big data, Microsoft's BI tools may hold the key to busting down those barriers.

Strata Keynoters show big data in action, FierceBigData.

EA Games such as "Battlefield" crank out 1TB of data per day and even "The Simpsons" generates 150 GB per day, as a global marketplace makes gaming a 24/7 activity.
The company moved from a descriptive look at its business to a predictive one. And since gamers were playing on multiple devices, it needed a single view of how the games were performing for each and how users played differently on each device.
In the end, it moved from big iron to a Hadoop platform using MapReduce and new algorithms to do propensity modeling.

From Strata, the New Big Data on the Block, Steve Miller, Information Management.

Starting from a big data platform built on infrastructure, storage, data processing and applications, the goals of BDAS (the Berkeley Data Analytics Stack) are:

  • To combine the now-disparate handling of batch, interactive and streaming data into a single execution engine
  • To readily accommodate sophisticated machine learning algorithms, and
  • To be compatible with the existing Hadoop ecosystem.

Spark is much better than MapReduce in integration and high-level accessibility. Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce. To make programming faster, Spark provides clean, concise APIs in both Scala and Java.
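
The "load once, query repeatedly" idea can be illustrated without Spark itself. What follows is a minimal single-machine Python sketch, not Spark's API: the class names `DiskDataset` and `CachedDataset` and the helper `demo` are hypothetical, and a small temporary file stands in for distributed storage. It only shows why keeping the working set in memory avoids a full disk pass per query.

```python
import os
import tempfile


class DiskDataset:
    """Hypothetical stand-in for a disk-based job: every query re-reads the file."""

    def __init__(self, path):
        self.path = path
        self.reads = 0  # number of full passes made over the file

    def _records(self):
        self.reads += 1
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")

    def count_matching(self, word):
        # One query = one scan over all records.
        return sum(1 for rec in self._records() if word in rec)


class CachedDataset(DiskDataset):
    """Hypothetical in-memory variant: read the file once, then query the cached list."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def _records(self):
        if self._cache is None:
            # The only disk pass; later queries reuse the in-memory list.
            self._cache = list(super()._records())
        return self._cache


def demo():
    # Write a tiny sample log to a temp file (assumed data, for illustration only).
    with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
        f.write("ERROR disk full\nINFO ok\nERROR timeout\nWARN slow\n")
        path = f.name
    try:
        disk, cached = DiskDataset(path), CachedDataset(path)
        queries = ["ERROR", "INFO", "timeout"]
        disk_counts = [disk.count_matching(q) for q in queries]
        cached_counts = [cached.count_matching(q) for q in queries]
        return disk_counts, cached_counts, disk.reads, cached.reads
    finally:
        os.remove(path)
```

Calling `demo()` runs the same three queries against both variants. The answers are identical, but the disk-based variant makes one pass over the file per query while the cached variant makes a single pass in total. On a real cluster the gap would come from avoiding repeated reads of distributed storage rather than of a local file.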