Big Data TechCon Boston: Hadoop is not dead yet
I report from Big Data TechCon Boston (Apr 8-10), which differed from other Big Data conferences in its focus on applied, how-to tutorials and classes. Despite its limitations, Hadoop is still the most popular tool for Big Data.
Gregory Piatetsky, Apr 11, 2013.
I attended the opening day of Big Data TechCon, held April 8-10 in Boston. This conference differed from most other conferences in Analytics, Big Data, Data Mining, and Data Science in its focus on practical, how-to tutorials and classes. The majority of the classes actually presented code, and people were deep into Java, R, and Hadoop. Over 600 people from 28 countries participated, and KDnuggets was a media sponsor of the conference.
The conference started with an excellent keynote by Mike Stonebraker, who was, as always, very entertaining, opinionated, and informative. He talked about Big Data and how it needs complex analytics, such as computing the covariance between the daily closing prices of 15,000 stocks over the last 10 years. Such problems usually require array operations, which are very hard to do in an RDBMS. Stonebraker also highlighted problems with Hadoop: it is only good for "embarrassingly parallel" problems, such as building an index for the web, but not for complex analytics. Stonebraker is a serial entrepreneur, and his solution for complex analytics was SciDB, one of his latest start-ups. SciDB is an array database system for data-intensive scientific computations, freely downloadable from www.scidb.org.
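To make the array-operation point concrete, here is a minimal Python sketch of the covariance computation on toy data. It is an illustration only, not code from the keynote and not SciDB's API; the function name `covariance_matrix` and the sample prices are assumptions. The data is an array with one row per trading day and one column per stock, and the result is a stocks-by-stocks matrix, so 15,000 stocks would yield a 15,000 x 15,000 result.

```python
def covariance_matrix(prices):
    """Sample covariance between columns of a days-by-stocks price array.

    prices: list of rows, one per trading day; each row holds one
    closing price per stock. Returns a stocks-by-stocks nested list.
    """
    n_days = len(prices)
    n_stocks = len(prices[0])

    # Mean closing price of each stock (column means).
    means = [sum(row[j] for row in prices) / n_days for j in range(n_stocks)]

    # Subtract the column mean from every entry.
    centered = [[row[j] - means[j] for j in range(n_stocks)] for row in prices]

    # Accumulate X^T X over the days, filling only the upper triangle.
    cov = [[0.0] * n_stocks for _ in range(n_stocks)]
    for row in centered:
        for i in range(n_stocks):
            for j in range(i, n_stocks):
                cov[i][j] += row[i] * row[j]

    # Normalize by (n_days - 1) and mirror into the lower triangle.
    for i in range(n_stocks):
        for j in range(i, n_stocks):
            cov[i][j] /= n_days - 1
            cov[j][i] = cov[i][j]
    return cov


if __name__ == "__main__":
    # Hypothetical closing prices: 4 days, 3 stocks.
    toy_prices = [
        [10.0, 20.0, 5.0],
        [11.0, 19.5, 5.5],
        [12.0, 21.0, 5.2],
        [11.5, 20.5, 5.8],
    ]
    for line in covariance_matrix(toy_prices):
        print(line)
```

The whole computation is a column-wise mean followed by a matrix product of the centered array with its own transpose, which is the kind of operation that is natural on arrays and awkward to express over relational rows.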
My tweets from his talk:
- Mike Stonebraker: I was doing Databases for 40 years, now I discover it is called Big Data #BigDataTechCon
- Mike Stonebraker: 3 types of DW vendors: column-store, row-store, 2nd converting to 1st. Column-store is 50x faster #BigDataTechCon
- Mike Stonebraker: auto insurance companies want to put sensors in cars, mass personalization based on driver profile #BigDataTechCon
- Mike Stonebraker: BI, SQL, and Datawarehousing are useless for Complex Analytics #BigDataTechCon
- Mike Stonebraker: analysts now use 2 systems - one for stats, one for DBMS, copy back and forth, and hate it #BigDataTechCon
- Mike Stonebraker: Hadoop is very good at "embarrassingly parallel" problems, sucks on all other problems #BigDataTechCon
- Mike Stonebraker: if you drink "Hadoop" Kool-Aid, u will hit performance wall on array ops; instead use array DBMS eg SciDB #BigDataTechCon
- Mike Stonebraker: Lots of people love R, or at least love to hate R #rstats #BigDataTechCon
- Mike Stonebraker: SciDB is 100x Postgres on analytics, comparable to R on analytics, but scales #BigDataTechCon
- Mike Stonebraker: successful enterprises will move from stupid (SQL) analytics to smart (complex) analytics. #BigDataTechCon
- Mike Stonebraker: for complex analytics RDBMS likely to fail; Hadoop unlikely to scale; check SciDB #BigDataTechCon
Here is a public Wordle cloud I created from the classes at this conference; it shows that Hadoop is still the top tool for Big Data.
The well-attended tutorials and classes during the day covered Apache Cassandra, Machine Learning, Hadoop, Cassandra, NoSQL, Apache Hive, Hadoop, Map/Reduce and HDFS, Data Visualization, ZooKeeper, Data Modeling and Relational Analysis in a NoSQL World, Distributed Search and Real Time Analytics, Structured and Unstructured Data with Avro, Visualizing Your Graph, Analytics Maturity Model, Beyond Map/Reduce, Getting Started with R and Hadoop, and more.
Oscar Boykin talked about "Addition in the large: Simple counts and not-so-simple counts", and showed how some interesting computations, like computing the number of followers of followers, can be done with clever programming and approximate sets.
My tweets from Oscar Boykin's talk:
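As a rough illustration of the approximate-set idea, here is a minimal Python sketch that estimates the number of followers of followers. This report does not record which data structure the talk used, so the K-Minimum-Values sketch below is a swapped-in technique chosen for brevity; the names `make_sketch`, `merge`, `estimate`, and `followers_of_followers`, and the sketch size `K`, are all assumptions.

```python
import hashlib

K = 256  # assumed sketch size; a larger K gives a tighter estimate


def _hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def make_sketch(items, k=K):
    """Keep only the k smallest distinct hash values of the items."""
    return sorted({_hash01(x) for x in items})[:k]


def merge(a, b, k=K):
    """Sketch of the union of two sets; associative and commutative."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch, k=K):
    """Estimate how many distinct items are behind a sketch."""
    if len(sketch) < k:
        # Fewer than k hashes were ever seen, so the count is exact.
        return len(sketch)
    # The k-th smallest of n uniform hashes sits near k / n.
    return int((k - 1) / sketch[k - 1])


def followers_of_followers(user, followers, sketches, k=K):
    """Estimate the distinct followers of a user's followers.

    followers: dict mapping a user to the set of that user's followers.
    sketches:  dict mapping a user to a sketch of that user's followers.
    """
    combined = []
    for f in followers[user]:
        combined = merge(combined, sketches[f], k)
    return estimate(combined, k)


if __name__ == "__main__":
    # Hypothetical graph: "a" has 10 followers with overlapping audiences.
    followers = {"a": {"f%d" % i for i in range(10)}}
    sketches = {
        "f%d" % i: make_sketch(range(i * 1000, i * 1000 + 2000))
        for i in range(10)
    }
    print(followers_of_followers("a", followers, sketches))
```

Each sketch holds at most `K` numbers no matter how many followers a user has, and overlapping audiences are counted once because duplicates hash to the same value. The price is an estimate rather than an exact count.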
- Twitter Super-Overload: if Barack Obama, Lady Gaga, and Justin Bieber started messaging each other #BigDataTechCon
- Oscar Boykin: Map/Reduce framework, in principle, is suitable for streaming #BigDataTechCon
- Oscar Boykin: Important Twitter measure of influence: How many followers of followers are there (not a feature yet) #BigDataTechCon
- Oscar Boykin: Tip for hackers: associative and commutative ops can be pushed up on Map/Reduce process #BigDataTechCon
- Oscar Boykin: You can approximate complex analytics with Map/Reduce if you use more complex objects #BigDataTechCon
- Oscar Boykin: Why approximation algorithms work so well? Because real data is noisy anyway #BigDataTechCon
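The tip about associative and commutative operations can be sketched in a few lines of Python. This is a toy, single-process imitation of Map/Reduce, not Hadoop code and not code from the talk; the function names and the edge data are assumptions. Because addition is associative and commutative, each mapper can pre-aggregate its own shard before anything is sent to the reducer, and the merged result does not depend on how the data was split or in which order the partial results arrive.

```python
from collections import Counter
from functools import reduce


def map_with_combiner(edges):
    """Map phase with local combining.

    edges: iterable of (follower, followee) pairs in one shard.
    Returns partial follower counts per followee for this shard only.
    """
    partial = Counter()
    for _follower, followee in edges:
        partial[followee] += 1
    return partial


def reduce_partials(partials):
    """Reduce phase: add up the partial counts from all shards."""
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    # Hypothetical follow edges split across two shards.
    shards = [
        [("x", "a"), ("y", "a")],
        [("z", "a"), ("x", "b")],
    ]
    partials = [map_with_combiner(shard) for shard in shards]
    print(reduce_partials(partials))
```

The same pattern applies to any value with an associative and commutative merge, which is one reading of the advice to use "more complex objects" with Map/Reduce.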
Here are selected other tweets from the conference:
- Dean Wampler, @thinkBigA: "Hadoop is the enterprise 'java beans' of our time." #BigDataTechCon
- SearchDataManagement @sDataManagement: Don Mallinger: It will never be the case that data scientist gets clean data set. @bigdatatechcon #bigdatatechcon #bigdata
- SearchDataManagement @sDataManagement: The data scientist spends half their time wrangling w data access. @bigdatatechcon #bigdatatechcon #bigdata