As the "coming out" year for big data and data science draws to a close, what can we expect over the next 12 months? Streaming data processing, data science workflow, rise of data marketplaces, and more.
O'Reilly Radar, by Edd Dumbill, 14 December 2011
1. More powerful and expressive tools for analysis
This year has seen consolidation and engineering around improving the basic storage and data processing engines of NoSQL and Hadoop. That will doubtless continue, as we see the unruly menagerie of the Hadoop universe increasingly packaged into distributions, appliances and on-demand cloud services. Hopefully it won't be long before that's dull, yet necessary, infrastructure.
Looking up the stack, there's already an early cohort of tools directed at programmers and data scientists (Karmasphere, Datameer), as well as Hadoop connectors for established analytical tools such as
2. Streaming data processing
... Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop was born out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use. ...
Emerging contenders in the real-time framework category include Storm, from Twitter, and S4, from Yahoo.
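To give a feel for what near real-time processing means in practice, here is a minimal sketch in plain Python of one common streaming pattern: counting events over a sliding time window. It does not use the Storm or S4 APIs. The class name `SlidingWindowCounter`, the 60-second window, and the sample check-in events are all invented for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts keys seen within the last `window_seconds` of event time."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # live counts for events still in the window

    def add(self, timestamp, key):
        # Assumption: timestamps arrive in non-decreasing order.
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n):
        return self.counts.most_common(n)


# Hypothetical location check-ins: (seconds since start, place).
stream = [(0, "cafe"), (10, "park"), (20, "cafe"), (70, "park"), (75, "park")]

counter = SlidingWindowCounter(window_seconds=60)
for ts, place in stream:
    counter.add(ts, place)

print(counter.top(1))
```

The point of the pattern is that results are updated as each event arrives, rather than recomputed in a batch over stored data. A production framework would add distribution across machines, fault tolerance and handling of out-of-order events, none of which this sketch attempts.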
3. Development of data science workflows and tools
As data science teams become a recognized part of companies, we'll see a more regularized expectation of their roles and processes. One of the driving attributes of a successful data science team is its level of integration into a company's business operations, as opposed to being a sidecar analysis team. ...
4. Rise of data marketplaces
Your own data can become that much more potent when mixed with other datasets. For instance, add weather conditions to your customer data and discover whether there are weather-related patterns in your customers' purchasing. Acquiring these datasets can be a pain, especially if you want to do it outside of the IT department, and with some exactness. The value of data marketplaces is in providing a directory to this data, as well as streamlined, standardized methods of delivering it.
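The weather example can be sketched in a few lines of Python. Everything here is hypothetical: the purchase records, the weather lookup, and the function name `average_spend_by_condition` are made up for illustration, and a real marketplace dataset would need to be matched by location as well as date.

```python
from collections import defaultdict

# Hypothetical in-house data: one record per customer purchase.
purchases = [
    {"date": "2011-12-01", "amount": 20.0},
    {"date": "2011-12-01", "amount": 30.0},
    {"date": "2011-12-02", "amount": 10.0},
    {"date": "2011-12-03", "amount": 40.0},
]

# Hypothetical external dataset: weather condition per day.
weather = {
    "2011-12-01": "rain",
    "2011-12-02": "sun",
    "2011-12-03": "rain",
}


def average_spend_by_condition(purchases, weather):
    """Join purchases to weather on date, then average spend per condition."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for purchase in purchases:
        condition = weather.get(purchase["date"])
        if condition is None:
            continue  # no weather record for that day, so skip the purchase
        totals[condition] += purchase["amount"]
        counts[condition] += 1
    return {condition: totals[condition] / counts[condition] for condition in totals}


result = average_spend_by_condition(purchases, weather)
print(result)
```

The join itself is trivial once both datasets share a clean key. The hard part, and the part a marketplace is meant to ease, is obtaining the second dataset in a consistent format in the first place.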
Microsoft's direction of integrating its
right into analytical tools foreshadows the coming convenience of access to data.
5. Increased understanding of and demand for visualization
Visualization fulfills two purposes in a data workflow: explanation and exploration. While business people might think of a visualization as the end result, data scientists also use visualization as a way of looking for questions to ask and discovering new features of a dataset.