How Twitter Uses NoSQL

How Twitter manages Big Data (12 TB/day) with Scribe, Hadoop, Pig, Cloudera, and other software

ReadWriteWeb, By Klint Finley / January 2, 2011

InfoQ has released a video of Twitter's Kevin Weil speaking at Strange Loop earlier this year on how the company uses NoSQL. Weil is quick to point out that Twitter is heavily dependent on MySQL. However, Twitter does employ NoSQL solutions for many purposes for which MySQL isn't ideal. According to Weil, Twitter users generate 12 terrabytes of data a day - about four petabytes per year. And that amount is multiplying every year. ...

Scribe
Syslog stopped scaling for Twitter after a while, so instead it uses Scribe, a log collection framework created and open-sourced by Facebook. ... Twitter uses Scribe to write logs to Hadoop. Scribe made it so easy for Twitter to log data, it started to log much more data. It now logs 80 different categories of data.

Hadoop
Twitter needs to store more data per day than it can reliably write to a single hard drive, so it needs to store data on clusters. Twitter uses Cloudera's Hadoop distribution to power its clusters.

Weil says MySQL isn't efficient at doing analytics at the scale Twitter needs. Instead, Twitter uses Hadoop and its own open source project called FlockDB. Hadoop can run analytics and hit FlockDB in parallel to assemble social graph aggregates.

A Pig Script

Read more.