On Feb 15 I attended the Big Data Disruption Summit at the appropriately named Microsoft NERD, in the high-tech glass building overlooking the Charles River and Boston. The meeting was well organized by MassTLC, the Massachusetts Technology Leadership Council.
Big Data clearly has a lot of buzz this year, and is approaching the top of the hype cycle according to Gartner. MassTLC's recent report, Big Data and Analytics: A Major Market Opportunity for Massachusetts, identified 100+ Big Data related companies in Massachusetts.
The meeting was packed with business leaders, entrepreneurs, and venture capitalists, and the data scientists who were present were very much in demand.
The event was opened by Prof. Michael Stonebraker, a leading database researcher for over 30 years and a serial entrepreneur - he started several successful database companies including Ingres, Illustra, Vertica, VoltDB, and most recently Paradigm4. Michael succeeds at doing several things at once, so some described him as a "parallel" entrepreneur.
After a familiar overview of the 3 Vs of Big Data: Volume, Velocity, and Variety, Stonebraker argued that the focus of analytics in the past was "Little Analytics" on Big Volume, like finding an average closing price of MSFT on all trading days in the last 3 years - a request easily expressed in SQL.
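To make the "Little Analytics" case concrete, here is a minimal sketch of that kind of request, run through Python's built-in sqlite3 module. The table name `prices`, its columns, and the three sample closing prices are made up for illustration and were not part of the talk.

```python
import sqlite3

# Hypothetical in-memory table of daily closing prices (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (symbol TEXT, trade_date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("MSFT", "2012-02-13", 30.0),
        ("MSFT", "2012-02-14", 31.0),
        ("MSFT", "2012-02-15", 32.0),
        ("IBM", "2012-02-15", 190.0),
    ],
)

# "Little Analytics": one aggregate over a filtered set of rows.
avg_close = conn.execute(
    "SELECT AVG(close) FROM prices WHERE symbol = ? AND trade_date >= ?",
    ("MSFT", "2009-02-15"),  # roughly three years before the summit
).fetchone()[0]

print(avg_close)
```

The whole computation is a single aggregate query, which is exactly why this class of request fits SQL so comfortably.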
Now there is demand for "Big Analytics" on Big Data, which may include complex math operations, such as machine learning or clustering. Stonebraker argued that most of these can be specified as linear algebra operations on array data. A typical inner loop in such algorithms may include matrix multiplication, singular value decomposition (SVD), or linear regression.
He gave an example of Big Analytics: computing the covariance between closing prices of stocks, which can be expressed easily as array operations, but not so easily in SQL. Imagine computing the covariance for all pairs of stocks on the New York Stock Exchange for the last 1000 days. If you can do that, then try it for hourly prices, and so on.
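The covariance example can be sketched as array operations in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the prices are synthetic random numbers, the 4-stock universe is arbitrary, and the function name `covariance_matrix` is mine, not something shown in the talk.

```python
import numpy as np

def covariance_matrix(prices):
    """Sample covariance for all pairs of columns.

    prices: 2-D array with one row per day and one column per stock.
    """
    prices = np.asarray(prices, dtype=float)
    # Subtract each stock's mean closing price from its column.
    centered = prices - prices.mean(axis=0)
    # One matrix multiplication yields every pairwise covariance at once.
    return centered.T @ centered / (prices.shape[0] - 1)

# Synthetic data (assumption): 1000 days of closing prices for 4 stocks.
rng = np.random.default_rng(0)
prices = rng.normal(loc=100.0, scale=5.0, size=(1000, 4))

cov = covariance_matrix(prices)
print(cov.shape)
```

The result agrees with NumPy's own `np.cov(prices, rowvar=False)`. The point of the sketch is that the all-pairs computation collapses into a centering step and one matrix product, which is the kind of inner loop Stonebraker described.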
Stonebraker outlined and criticized different approaches to Big Analytics, which included:
- Math/stats package: SPSS, SAS, R: these suffer from weak or nonexistent data management; R does not scale well - not a parallel system.
- RDBMS: very sloooow for matrix operations; matrix multiplication can't be easily expressed in SQL.
- RDBMS + R: a "learn 2 systems" problem and a "move the world" nightmare.
- Hadoop: top of the hype cycle, weak on data management, low-level interface. Better to move to Pig/Hive. Also no support for math operations (but Mahout is a good option). Hadoop is very inefficient on math that is not embarrassingly parallel.
Paradigm4, Stonebraker's most recent company, is sponsoring and distributing SciDB, an open-source scientific data management and analytics software platform.
Chris Ahlberg, CEO of Recorded Future (and co-founder of Spotfire, which was sold to Tibco in 2007), talked about the unstructured web as the most compelling source of predictive information. He described a project where they are monitoring South American cities for potential unrest by scanning documents from 70,000 sources, with the need to visualize the results quickly. He described the evolution of their architecture from a key-value store to MongoDB + Sphinx. They have many users in finance, and interestingly, processing the overnight accumulation of information is very important for the trading signal in the first second of trading in New York.
Recorded Future backers include Google Ventures, IQT, and IA Ventures.
Other speakers in the first panel outlined different approaches. Fritz Knabe of Netezza talked about new possibilities enabled by the rapidly falling price of flash storage. When terabytes of memory cost about $1000, very different and much faster architectures become possible. However, the seamy underside of the advent of flash is that the bottleneck moves from storage to power supply. This leads to interesting ideas like microservers, which have 8 servers on one board.
Mark Watkins, co-founder of Goby and currently General Manager, Entertainment Content at Telenav, talked about mobile applications. His company is a pioneer in location services and provides a traffic-aware routing engine, which learns from traffic behavior. He also described an already deployed mobile recommendation system at Telenav, which can recommend interesting restaurants, events, and activities to you based on your interests and the data it has.
The first half was followed by the keynote presentation by Deepak Advani, Vice President, Business Analytics, Products and Solutions, IBM. He gave a very good overview of the many use cases where analytics and IBM technology produce good results, from IBM Watson technology now being applied to improve health care diagnoses, to The Oscar Senti-meter, which provides sentiment analysis of Twitter messages about Oscar nominations.
The second half of the event was focused on case studies of 4 start-ups and their learning experiences and challenges.
Bill Simmons, CTO, DataXu, talked about how their company tracks ad performance by using anonymous cookies. They build models to predict which ad impressions will lead to purchasing activity, and this is hard since for a million impressions there may be only hundreds of purchases (a very unbalanced class distribution). However, the ad cost is low and the economics work since their model is 2-3 times more accurate than random ads. Their software stack includes Hadoop, Hive, Postgres, HBase, and Greenplum.
Alan Hoffman, Founder & President, Cloudant, talked about his experience as a physicist, where he dealt with GB/sec of particle data. His company provides a NoSQL data layer service and uses CouchDB. He suggested there is no big magic solution to Big Data, but lots of small useful solutions.
George Radford, Field CTO, EMC Greenplum, talked about adjusting to EMC's acquisition of Greenplum. He commented that the last thing you want to do with big data is move it.
Andy Palmer, the moderator, emphasized the need to think about a continuous upgrade path. Since design patterns in the system change rapidly, he argued for an MPP shared-nothing architecture, which scales well.
George Radford said that you can get from TB to PB with an MPP shared-nothing architecture.
I asked the panel if they thought the potential of Big Data was overhyped.
Bill Simmons (DataXu) said that their method works and is able to improve accuracy by a factor of 2 to 3. However, the cost also needs to be considered - what works in the US, where media is expensive, would not be cost-effective in China, where media is cheap.
Puneet Batra (Kyruus) suggested that one of the results of big data would be exposing bad decisions made without it, and that it may bring more rationality into business decisions.
One question was whether, as a result of changes in privacy practices driven by Facebook, we can expect changes in medical privacy - more medical data available for sharing. The panel thought it was unlikely and that the more likely source of personal medical data was the Quantified Self movement.
The meeting had a lot of energy and showcased the depth of the Big Data industry in Massachusetts. New ideas will likely percolate to new start-ups!
Finally, many thanks to Sara Fraim and MassTLC for organizing such a stimulating and interesting meeting.
CNET report on Big Data Meeting: Why 'big data' is a magnet for startups
Another report on Big Data meeting at www.goinvo.com/big-data-in-boston/