MassTLC Big Data Meeting Delivers Insights, Perspective

Summit highlights: Digitization and Datification - a love story, Strategies for creating a competitive advantage in #BigData world, Boston open data, Balancing privacy and governance, and the most widely used #BigData tool in the future.

By Gregory Piatetsky, May 8, 2014.

Last week I attended the @MassTLC (Massachusetts Technology Leadership Council) Big Data Summit: You Have the Data, Now What?

The summit was held in the beautiful Microsoft NERD building on Memorial Drive in Cambridge, MA, but because of construction, I had to circle several times before finally managing to sneak into the parking lot. However, data scientists must be either good at discovering parking or working in Cambridge, because the room was quite full by 8:30 am.

For those who could not attend, 3 excellent MassTLC #BigData2014 reports are available for download online at , covering
  • Big Data and Connected Cities,
  • Big Data and Healthcare, and
  • Big Data and Life Sciences.

Here are my notes and selected tweets from the meeting with hashtag #BigData2014. Joe Johnson also actively tweeted from this summit.

Paul Sonderegger, @PaulSonderegger, Oracle's Big Data Strategist, opened his keynote by asking: Big Data is a real thing, but what is it?

Here are my @KDnuggets tweets of his presentation:
  • Big Data is the capture and the use of more data in more daily activities
  • We are at a pivotal moment with #BigData - it flows into all activities
  • #BigData is a simple idea - datification of daily activities #BigData2014
  • Digitization and Datification - a love story
  • Dr. Snow map - #DataScience in the time of Cholera
  • Michigan has tagged every cow at birth; there is datification of buildings, of smaller activities
  • What happens in Las Vegas no longer stays in Vegas - now it stays on Facebook and YouTube forever
  • Michael Porter (HBS) is an Aristotle of Business Strategy - everything after him is a footnote
  • IKEA strategy is to break everything into flat items
  • Strategy is choosing to create a unique value in a unique way
  • Only 12% of execs feel they understand the impact #BigData will have on their organization
  • The NEW thing is the opportunity to learn from unstructured #BigData before it is organized
  • #BigData can 1. Get fast answers to new questions
  • #BigData can 2. Predict many things (a little) more accurately
  • #BigData at work: 3. Create a data reservoir, eg with #Hadoop
  • #BigData at work: 4. Accelerate Data-Driven Action
  • To create a competitive advantage in #BigData world: 1. Think in terms of data market share
  • Be the first to capture the data from that activity
  • New Big Data land grab - get to new source of data before your rivals - datify activities, improve service
  • 2nd strategy: Create proprietary data assets (combining your data with open data)
  • Secret sauce of every #BigData strategy: Use Data to Make Data

This very lively presentation was followed by a panel discussion about data science at the May 1 summit, moderated by Chris Baker (@Dyn), with Paul Sonderegger; Joe Hendrickson, VP @athenahealth; Ingo Mierswa, CEO @RapidMiner; and Pete Martin, VP of Engineering, @pixability.

RapidMiner CEO Ingo Mierswa talked about Data Science vs Statistics. He said that the biggest difference is that Data Science needs to be aware of computing infrastructure, be aware of business needs, and effectively present results to the management. Although "Data Science" is currently a marketing term, it is useful for capturing resumes of the right people.

Joe Hendrickson talked about team selection and suggested starting with business analysts, then adding data engineers.

Pete Martin noted that the more mature an organization is, the more distinct a group Data Science becomes.

Joe Hendrickson said that the data can be a weapon inside a company.

Ingo Mierswa noted: we need to communicate not just results, but also the process. Even 51% accuracy is very valuable.

There was a discussion of the #BigData backlash. Are we making a big mistake by trusting in Big Data?

The panel discussed a recent Financial Times article, "Big data: are we making a big mistake?", which presented 4 "straw-men" articles of faith of Big Data, such as
  • big data analysis produces very accurate results;
  • every single data point can be captured, making sampling techniques obsolete;
  • there is no need to worry about causation, because we have correlation;
  • models are not needed because, to quote "The End of Theory", a provocative essay published in Wired in 2008, "with enough data, the numbers speak for themselves".

The panel pointed out that these 4 items are not true about Big Data and quoted the famous statistician George Box:
"All models are wrong, but some are useful."

Ingo Mierswa said that BigData does not give Big wins, but many small decisions, which can result in great value.
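The arithmetic behind "even 51% accuracy is very valuable" and "many small decisions" can be sketched in a few lines. This is only an illustration, not something shown at the panel: it assumes a hypothetical even-money setting where each correct decision gains one unit and each wrong one loses one unit, and the function name `expected_net_gain` is made up for this example.

```python
def expected_net_gain(accuracy: float, decisions: int, stake: float = 1.0) -> float:
    """Expected net gain over many independent even-money decisions.

    Assumption (not from the panel): a correct call wins `stake`,
    a wrong call loses `stake`.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Per-decision edge: p * stake - (1 - p) * stake = (2p - 1) * stake
    edge_per_decision = (2.0 * accuracy - 1.0) * stake
    return decisions * edge_per_decision


if __name__ == "__main__":
    # The edge per decision is tiny, but it scales with the number of decisions.
    for n in (100, 10_000, 1_000_000):
        print(n, round(expected_net_gain(0.51, n), 2))
```

Under these assumptions a 51% model is worth about 0.02 units per decision, which is negligible for one decision and substantial over a million of them. Real decisions rarely have symmetric payoffs, so the numbers are only indicative.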

The next event was a round of fast vendor pitches from Paradigm4, Quant5, and Prelert, which showed their very impressive systems.

Paradigm4 showed a quadrant with Complex vs Simple Analytics on the Y-axis and Small Views/Data vs Big Views/Data on the X-axis, and said that their product SciDB was well-positioned for the upper corner.

Next, Paul Barth from NewVantage Partners NVPBigData and Boston CIO Justin Holmes @JustinCHolmes talked about Privacy and Governance. Paul Barth noted:
  • most big data & data science people are concerned about privacy and security of their own data.
  • #BigData will exacerbate the unintended consequences, especially for privacy
  • the approach of building the data lake quickly risks skipping the important privacy issues
  • if #BigData is stored in one well-engineered Data Lake, it is easier to safeguard & protect the data

Boston CIO @JustinCHolmes:
  • we collect lots of data, have over 2000 KPI, open all to public
  • can find problem areas by correlating 911 calls, property
  • for the first time we can provide real-time awareness on Mayor's dashboard
  • Big Data is a large opportunity
  • one of the highest value datasets - restaurant inspections
  • many data sources open to public at
  • the general approach is an open data policy, but needs to examine risks
  • key questions for opening the data: who owns it, in what context was it collected, risks?
  • What are the risks for opening the data? Eg anonymize 911 calls to a block
  • opening data for "Where is my schoolbus" App?
  • We don't want to disclose the location of every school bus
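The idea of anonymizing 911 calls "to a block" can be illustrated with a small sketch. This is not the City of Boston's actual method: the address format (a leading house number followed by a street name), the 100-number block size, and the helper name `to_block` are all assumptions made for this example.

```python
import re

# Assumed format: a leading house number, whitespace, then the street name.
_ADDRESS = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def to_block(address: str, block_size: int = 100) -> str:
    """Coarsen a street address to its block, dropping the exact house number."""
    match = _ADDRESS.match(address)
    if match is None:
        raise ValueError(f"no leading house number in {address!r}")
    number, street = int(match.group(1)), match.group(2)
    # Round the house number down to the start of its block.
    block_start = (number // block_size) * block_size
    return f"{block_start} block of {street}"


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    print(to_block("1234 Main St"))
    print(to_block("815 Memorial Dr"))
```

Coarsening the location this way hides the exact house, but it does not by itself guarantee anonymity: a block with a single residence, or a record combined with other datasets, can still identify someone, which is why the risk questions above still apply.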

Fire hydrants are a good example - although they are considered critical infrastructure, they are very visible, so common sense says the database of fire hydrants can be open.

You can explore Boston data sets at Data

Paul Barth suggested 3 Data Levels: bronze (raw), silver (initial cleanup), and gold (good quality, auditable). Bronze (raw) can be loaded quickly, Bronze to Silver can take a week, and Silver to Gold quality can take a month.

Paul Barth recommended the work of MIT's Alex (Sandy) Pentland with the EU on information rights - information buyers and sellers do not have equal rights.

Paul noted that insurance companies disclose what info they collect. They usually collect acceleration/deceleration, but not speed or GPS location, to avoid privacy issues.

I asked this panel about an opt-in approach to privacy: Big Data can make uncomfortably accurate predictions - can there be market-based, opt-in solutions?

Justin Holmes replied that Boston is trying opt-in with the Boston School Bus app, but also needs to educate people more about data. Boston has a unionized environment and has GPS on police cruisers, so it needs to address union concerns about how the data is used.

He also pointed out that minority rights should be protected.

A predictive model cannot decide not to lend in a particular low-income area - this is called red-lining and is illegal.

The Summit ended with the Sourcing Next Big Thing panel, with Chris Selland (HP Vertica) as moderator, Richard Dale (Optum Labs) @rdale, Steve Dodson (@prelert), and Robert Nagle (@InterSystems).

Panel highlights:
  • Chris Selland, @CSelland HP Vertica CMO: When you say #Hadoop to a CMO, they usually say "God bless you"
  • At @MassTLC #bigdata2014 key steps to getting value from #BigData: from Art to (Data) Science to Platform
  • At @MassTLC #bigdata2014 Richard Dale @rdale: First 90% of the #BigData work is technical, the other 90% is governance
  • correlate social media and machine sensor data - reduce errors, improve customer service quality
  • Is #Hadoop a safe landing zone for data, even when you don't know what to do with it
  • Robert Nagle @InterSystems: The most widely used tool for #BigData in the future will be the same as today: #SQL

The final question for the panel was "What will be the next disruptive innovation?"

Some predictions:
  • What is unusual - Search for enterprise
  • In healthcare space - analysis, personalization for people with multiple diseases
  • Going beyond individuals to insights on groups
  • When #BigData insights will be available in multiple operational environments

Overall, an excellent meeting and big thanks to @MassTLC and @SaraFraim for organizing it.