Open Data: Why the Crowd Can Be Your Best Analytics Tool


Mashable, Dec 30, 2010, Sean Gorman

The web will continue to generate data at an explosive rate. It will generate even more now that mobile devices have created yet another path to reach that data. For example, mobile traffic alone is predicted to exceed two exabytes per month by 2013. There are more than 90 million tweets per day and more than 60 billion images on Facebook. This is just the tip of the iceberg.

Out of this bounty of data emerged data science and a plethora of new tools to deal with the size and speed of information. Hadoop, HBase, Cassandra, MongoDB, Node.js, Hive, R, and Pig are just a few of the tools and techniques that have emerged to wrestle the growing juggernaut of data. The demand for people who can implement these new tools has far outstripped the number of data scientists available.

The rapid rise in demand and the shortage of trained experts have led to the emergence of tools to democratize access to big data. Innovative startups like Datameer and Factual offer simple spreadsheet interfaces for basic slicing and dicing. Larger players like Google have launched Fusion Tables to allow slicing and visualization of medium-sized (100MB) data sets.

The Challenges of Big Data

This sprawling mass of emerging data brings with it a host of challenges. As we slice and dice data, how do we keep track of the many permutations that it creates? What bits are meaningful and validated? How do we move beyond just counting and binning the data and answer more meaningful questions for businesses?

...

Seeing Kate's analysis, another reader, "Bill," wonders how tweets about Walmart relate to the locations of its stores. How often is a Walmart store nearby when someone is tweeting about Walmart? He finds that 67% of the variation in tweets is explained by the number of Walmarts located in each county.

[Image: Walmart locations]
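A figure like "67% of the variation is explained" is usually an R-squared value from a regression. The article does not say how Bill computed his number, so the following is only a minimal sketch, assuming a one-variable least-squares fit of tweet counts on store counts per county. The function name and the county figures are hypothetical and invented purely for illustration.

```python
from math import fsum


def r_squared(x, y):
    """Share of the variance in y explained by a one-variable least-squares fit on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = fsum(x) / n
    mean_y = fsum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = fsum((xi - mean_x) ** 2 for xi in x)
    syy = fsum((yi - mean_y) ** 2 for yi in y)
    sxy = fsum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    # Fit y = intercept + slope * x by ordinary least squares.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: the variation the fit does not explain.
    ss_res = fsum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_res / syy


# Hypothetical per-county counts, NOT data from the article.
stores_per_county = [0, 1, 1, 2, 3, 4, 5, 6]
tweets_per_county = [3, 9, 4, 15, 10, 22, 18, 30]

print(f"R^2 = {r_squared(stores_per_county, tweets_per_county):.2f}")
```

A value near 1 would mean store counts account for nearly all of the county-to-county differences in tweet volume, while a value near 0 would mean they account for almost none. Either way, the number describes association, not cause.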

One of the early premises of Web 2.0 was that data would be "the Intel inside" and firms like NAVTEQ that provide data would be the big winners. Today we are seeing crowdsourcing increasingly commoditize data, and projects like OpenStreetMap replacing the NAVTEQs of the world. As the market moves up the chain, the future value will be the meaningful questions we can answer with data. This will mean more focus on the "science" side of "data science." The more social and collaborative we make the science, the better the answers we'll create at a scale that is needed for an explosive market.
