How Twitter manages Big Data (12 TB/day) with Scribe, Hadoop, Pig, Cloudera, and other software
ReadWriteWeb, By Klint Finley / January 2, 2011
InfoQ has released a video of Twitter's Kevin Weil speaking at Strange Loop earlier this year on how the company uses NoSQL.
Weil is quick to point out that Twitter is heavily dependent on MySQL. However, Twitter does employ NoSQL solutions for many purposes where MySQL isn't ideal. According to Weil, Twitter users generate 12 terabytes of data a day - about four petabytes per year - and that amount is multiplying every year.
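A quick back-of-the-envelope check shows how the daily figure Weil cites extrapolates to the yearly one (using decimal units, 1 PB = 1,000 TB):

```python
# Sanity-check the figures from the talk: 12 TB/day over a year.
TB_PER_DAY = 12
DAYS_PER_YEAR = 365

tb_per_year = TB_PER_DAY * DAYS_PER_YEAR
pb_per_year = tb_per_year / 1000  # decimal units: 1 PB = 1,000 TB

print(f"{tb_per_year} TB/year is about {pb_per_year:.1f} PB/year")
```

That works out to 4,380 TB, or roughly 4.4 PB per year, matching the "about four petabytes" figure.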
Syslog stopped scaling for Twitter after a while, so instead it uses Scribe, a log collection framework created and open-sourced by Facebook. ...
Twitter uses Scribe to write logs to Hadoop. Scribe made it so easy for Twitter to log data that it started logging much more of it. It now logs 80 different categories of data.
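Scribe's core abstraction is the category-tagged log entry: clients send (category, message) pairs, and the category determines where each message is routed. The sketch below is purely illustrative - the real Scribe client is Thrift-generated and talks to a scribed daemon, and the category names here are invented - but it shows the routing idea that let Twitter fan out 80 categories of logs toward Hadoop:

```python
from collections import defaultdict

# Toy stand-in for Scribe's category-based routing (not the real client,
# which is Thrift-generated and speaks to a scribed daemon). Category
# names below are hypothetical examples.
class ToyScribe:
    def __init__(self):
        # category -> buffered messages; in Twitter's pipeline these
        # buffers would ultimately be flushed into Hadoop/HDFS.
        self.stores = defaultdict(list)

    def log(self, category, message):
        # Each entry is a (category, message) pair; the category
        # decides which store the message is filed under.
        self.stores[category].append(message)

scribe = ToyScribe()
scribe.log("search_queries", "q=strange+loop")
scribe.log("tweets", "user=123 text=hello")
scribe.log("search_queries", "q=hadoop")

print(len(scribe.stores))                    # distinct categories seen
print(len(scribe.stores["search_queries"]))  # entries in one category
```

Because adding a category is just a new string tag rather than new plumbing, logging more kinds of data becomes cheap, which is consistent with Weil's point that easy logging led to much more logging.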
Twitter needs to store more data per day than it can reliably write to a single hard drive, so it stores its data across clusters of machines, powered by Cloudera's Hadoop distribution.
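The arithmetic below suggests why a single drive was never an option. The drive capacity is an assumption on my part (roughly 2 TB for a commodity drive circa 2010; the article doesn't give Twitter's hardware), but the 3x replication factor is HDFS's default:

```python
# Rough capacity math. ASSUMPTIONS: ~2 TB commodity drives (circa 2010);
# HDFS default replication factor of 3. Twitter's actual hardware specs
# are not given in the talk.
DAILY_TB = 12          # new data per day, from the talk
DRIVE_TB = 2           # assumed per-drive capacity
HDFS_REPLICATION = 3   # HDFS default: each block stored 3 times

raw_tb_per_day = DAILY_TB * HDFS_REPLICATION  # raw capacity consumed daily
drives_per_day = raw_tb_per_day / DRIVE_TB    # drives' worth of data daily

print(raw_tb_per_day, drives_per_day)
```

Under those assumptions, a day's logs consume 36 TB of raw capacity - many drives' worth every single day - so a replicated, clustered filesystem like HDFS is the natural fit.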
Weil says MySQL isn't efficient at analytics at the scale Twitter needs. Instead, Twitter uses Hadoop alongside FlockDB, its own open-source graph store. Hadoop jobs can run analytics and hit FlockDB in parallel to assemble social-graph aggregates.
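The kind of social-graph aggregate Weil describes maps naturally onto a MapReduce job. The sketch below is illustrative only - it is plain Python, not Twitter's actual Hadoop code, and the edge data is made up - but it shows the map/reduce shape of one such aggregate, counting followers per user from a list of follow edges:

```python
from collections import Counter

# Illustrative MapReduce-style aggregate (not Twitter's actual code):
# count followers per user from (follower, followed) edges, the kind of
# social-graph rollup computed with Hadoop alongside FlockDB.
edges = [("alice", "bob"), ("carol", "bob"), ("alice", "carol")]

# Map phase: emit (followed_user, 1) for every follow edge.
mapped = ((followed, 1) for _follower, followed in edges)

# Reduce phase: sum the emitted counts per user.
follower_counts = Counter()
for user, count in mapped:
    follower_counts[user] += count

print(follower_counts["bob"])  # bob has 2 followers in this toy graph
```

On a real cluster the map and reduce phases would run across many machines in parallel, which is what makes this style of aggregation tractable at petabyte scale where a single MySQL instance is not.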