Boston Data Festival and Next Trends in Big Data

Boston's first-ever Data Festival gathered hundreds of analytics professionals, and an All-Star Big Data panel discussed what is next in Big Data and which trends are least likely to succeed.

By Gregory Piatetsky, Nov 14, 2013.

Last week I attended, and moderated a panel at, the first-ever Boston Data Festival, held Nov 4-10 in the Boston area.

The festival opened on November 4 at the Hyatt Regency Cambridge, with a full room of several hundred people.

John Verostek, the founder of Boston Predictive Analytics Meetup, and one of the key organizers of this event (along with Adrienne Cochrane of hack/reduce and Sheamus McGovern), presented interesting statistics on the very rapid growth of Boston meetups.

Boston Meetup Groups

The meetup stats are publicly available, so John also analyzed the relationship between Meetup and LinkedIn membership for Python-related Meetup groups in 22 cities, and it appears to be roughly linear. Not surprisingly, Silicon Valley leads in membership, but Boston and NYC are nearly tied for second place.

Meetup and LinkedIn Group membership in 22 cities

Next was a keynote presentation by Dr. William Kahn on "Ten Challenges for the next generation data scientist from a last generation statistician".

Dr. Kahn first established his bona fides as a last generation statistician: his teachers included founders of statistics such as Neyman, Savage, and Anscombe, and he was in Leo Breiman's first-ever CART class.

His first challenge (challenge Zero) was

"What does zero mean?" - how do we compare 0/50 to 1/400?

If you have two possible events, and one happened 0 out of 50 tries while the other happened 1 out of 400 tries, which one is more significant?

In my opinion this question becomes less important in the era of Big Data, when we frequently don't have the precision of exact counts and need to get approximate answers quickly. Both of these events are significant, and it is not that important to know which one is more significant. A more typical task would be to find, in less than 1 second, "almost all" relatively rare events that happened in the last N hours.
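One way to make the 0/50 versus 1/400 comparison concrete is to put a confidence interval around each observed rate. The sketch below is an illustration only and is not taken from Dr. Kahn's talk: it assumes a Wilson score interval at roughly 95% confidence (z = 1.96), and the function names wilson_interval and intervals_overlap are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence (an assumption
    made for this illustration, not a value from the talk).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    zero_of_50 = wilson_interval(0, 50)
    one_of_400 = wilson_interval(1, 400)
    print(f"0/50  -> [{zero_of_50[0]:.4%}, {zero_of_50[1]:.4%}]")
    print(f"1/400 -> [{one_of_400[0]:.4%}, {one_of_400[1]:.4%}]")
    print("Intervals overlap:", intervals_overlap(zero_of_50, one_of_400))
```

Under these assumptions the 0/50 interval runs from 0 to about 7%, while the 1/400 interval runs from about 0.04% to about 1.4%. The two overlap, so the counts alone cannot say which event is rarer.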

Here are Dr. Kahn's challenges:

  1. Invent new stuff: What should be done, either in, or with, data science, that no one yet has?
  2. Decide on what data might matter: we should not be passive receivers of what has been measured. Think about and drive the collection of high value data.
  3. Get and understand data: go beyond the data on your computer. Understand the full data process. Garbage in. Garbage out.
  4. Predict what is needed. Not just what is easy. Description is not enough.
  5. Integrate Probabilistic and Algorithmic models. Each does well. Each has holes. How are they two views into the same larger entity?
  6. Theory matters
  7. Experiments matter. Kahn showed a chart of speed-of-light measurements from 1879 to 1983, a good lesson about the overconfidence of scientists. Note that the current value (299,792,458 m/sec) falls outside the error range of many previous measurements.
    Speed of light measurement from 1879 to 1983

  8. Remove barrier to adoption: why is this taking so long?
  9. Help the Next Next-generation

The keynote talk was followed by the All-Star Data Panel, consisting of

  • Prof. Samuel Madden, Assoc. Professor at MIT CS and AI Laboratory, head of BigData@CSAIL industry initiative, co-founder of Vertica, and a leading researcher in databases, Big Data, and mobility.
  • Chris Lynch, a partner at Atlas Venture, focusing on Big Data technologies and business models. Chris is a serial entrepreneur and was most recently SVP & GM of HP Data Analytics Business Unit, after the acquisition of Vertica Systems where he was CEO.
  • Dr. Willard (Bill) Simmons, Co-Founder and CTO of DataXu, is a real rocket scientist (with a PhD in Aeronautics and Astronautics from MIT).

Boston DataFest All-Star Data Panel, Samuel Madden, Chris Lynch, Bill Simmons, and moderator Gregory Piatetsky-Shapiro

I (Gregory Piatetsky-Shapiro) moderated this panel.

Some of the interesting points from the discussion:

Chris Lynch said he is very bullish on the Boston area and his investments are exclusively there. He believes that the Boston ecosystem is second to none. Boston has 5,000 computer science graduates/year and 35 top universities within 20 square miles. [GPS: in another talk Chris also mentioned the great density of talent and the convenience of movement along the Red Line, from Alewife, to Harvard and MIT, to South Station].

What is happening in the Big Data and Data Science space now reminded him of what was happening in the internet space in the 1990s.

Sam Madden said that among the things that excite him about Big Data is the opportunity to revisit how we teach students and to apply machine learning to this process. How do we change the whole organizational structure of a university like MIT in the face of Big Data?

He also addressed the hype in Big Data, and highlighted one very difficult real-world problem: data integration. If a company acquired 7 different companies, then there are 7 different systems, 7 different schemas, etc - how do we integrate them? There have been many attempts to build a "silver bullet" solution to this problem, but there are no silver bullets in this area.

Among exciting Big Data trends Sam Madden mentioned smartphones. One interesting application is in insurance, with insurance companies looking at setting the rates based on how people actually drive, as recorded by their smartphones. This will be much more accurate than setting rates based on very coarse age and gender brackets, and although it may look scary to consumers, it looks very promising to insurance companies.

I asked the panelists to name a couple of the most promising startups in Big Data and Analytics, excluding the ones they started or invested in.

Chris Lynch said that the most opportunities for success are among start-ups that are applying Big Data to disrupt existing businesses (he gave Hopper as one example in the travel business) and developing Big Data applications. Least interesting are companies that are building platforms and new data stores, because it is very hard to be successful there.

Chris Lynch also pointed out that there is more value the closer you get to end-users. He said that Vertica was acquired by HP for around $350M, while Zynga, which used Vertica technology extensively, had a market value of over $10B, roughly 30 times higher.

I also asked the panelists afterwards to answer 2 questions:

1. What is the next big trend related to Big Data and Data Science?

Chris Lynch:

  • applications, applications, applications, in every major industry and use case.
  • artificial intelligence automation of data science - "data scientist in a box"
  • real time analytics integrated into every component of the hw/sw stack from storage, compute, networks and the app

Bill Simmons:

The biggest trend I'm anticipating in big data applications is a shift from just using big data for insights to a widespread use of big data for insights and automated action. We've seen big data used for automated action in a few fields, like 'products you may like', 'people you may know', and automated financial trading engines, but have rarely seen automated action in other fields.

I believe the real value of big data is still locked up in many fields, since many applications still just provide reports and then require a person to read these complicated reports and initiate processes within their organization to take action. I anticipate that new applications will be more bold and automate the outcome of the analysis in real-time. For example, at DataXu, we're allowing our users to specify rules within which the machine is able to take action and fully automate the rebalancing of a complicated, cross-channel media plan. This saves a tremendous amount of labor for our clients, enabling them to focus more time on high-value activities, such as strategic decision making and creative design.

2. What are good Boston resources for Big Data and Data Science?

Chris Lynch:

hack/reduce is the single best resource for aspiring big data science minds.

Bill Simmons:

Boston has a lively meetup scene for big data and data science. I've personally expanded my capabilities by attending these events, meeting other data scientists and big data architects and sharing ideas. In the past year, I've seen these groups really blossom and the quality of the content has gone up tremendously.

Finally, the panel ended around 9 pm. Everything was running late, and not many people stayed for the networking social at 9 pm, despite the free wine and beer. However, so many interesting ideas were expressed that I am already looking forward to the next event, perhaps in the spring.