... Hadoop sprang from two research papers Google published in late 2003 and 2004. One described the Google File System, a way of storing massive amounts of data across thousands of dirt-cheap servers, and the other detailed MapReduce, which pooled the processing power inside all those servers and crunched all that data into something useful. Eight years later, Hadoop is widely used across the web for data analysis and all sorts of other number-crunching tasks. But Google has moved on.
In 2009, the web giant started replacing GFS and MapReduce with new technologies, and Cloudera CEO Mike Olson will tell you that these technologies are where the world is going. "If you want to know what the large-scale, high-performance data processing infrastructure of the future looks like, my advice would be to read the Google research papers that are coming out right now," Olson said during a recent panel discussion with Wired.
Since the rise of Hadoop, Google has published three particularly interesting papers on the infrastructure that underpins its massive web operation. One details Caffeine, the software platform that builds the index for Google's web search engine. Another shows off Pregel, a "graph database" designed to map the relationships between vast amounts of online information. But the most intriguing paper is the one that describes a tool called Dremel.
... you can use Dremel today - even if you're not a Google engineer. Google now offers a Dremel web service it calls BigQuery. You can use the platform via an online API, or application programming interface. Basically, you upload your data to Google, and it lets you run queries on its internal infrastructure.
See also the Google Research paper "Dremel: Interactive Analysis of Web-Scale Datasets," by Sergey Melnik et al.
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
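The columnar layout the abstract credits for Dremel's speed can be illustrated in a few lines of Python. This is a toy sketch, not Google's implementation: the record fields are invented, and the point is only that an aggregation such as a sum over one field needs to scan just that field's column, not every full record.

```python
# Toy illustration of row-oriented vs. column-oriented storage
# (invented data; not Google's actual Dremel implementation).

# Row layout: each record is stored whole, all fields together.
rows = [
    {"url": "a.com", "clicks": 3, "country": "US"},
    {"url": "b.com", "clicks": 7, "country": "DE"},
    {"url": "c.com", "clicks": 5, "country": "US"},
]

# Aggregating over rows touches every complete record...
total_row = sum(r["clicks"] for r in rows)

# ...while a columnar layout stores each field contiguously, so a
# query like SUM(clicks) reads only the single column it needs and
# skips the unrelated url and country data entirely.
columns = {
    "url":     [r["url"] for r in rows],
    "clicks":  [r["clicks"] for r in rows],
    "country": [r["country"] for r in rows],
}
total_col = sum(columns["clicks"])

assert total_row == total_col == 15
print(total_col)
```

At Dremel's scale the same idea applies to trillion-row tables: scanning one compressed column of a nested record is vastly cheaper than deserializing every record, which is a large part of how aggregation queries finish in seconds.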