Big Data Insights: Traackr migration from HBase to MongoDB

A Traackr engineer describes how the company's data storage needs evolved from Apache HBase to MongoDB and what lessons they learned along the way.

Traackr Blog, Feb 8, 2012, by george

HBase to MongoDB

Let's get one thing out of the way before we start: this post is not an attempt to disparage HBase. HBase is an extremely powerful tool; applied appropriately and skillfully under the right scenarios, it can move mountains. This post is about the evolution of Traackr's data storage needs and how MongoDB ended up satisfying them. It's also a tip of the hat to the MongoDB team and 10gen for the tremendous work they have done.

Back in late 2009 and early 2010, Traackr was designing the foundations of its search engine and hunting for an appropriate datastore to back it up. Some of the requirements were:

Built-in support for storing terabytes of text: that meant that we shouldn't have to use or modify the software in an unconventional fashion beyond its original design to get it to store and retrieve the quantities of data we wanted.
Flexible schema: Traackr deals with heterogeneous data sources from the web, constantly discovering new content and new properties that characterize that content.
Ability to batch process the data: Traackr's scoring algorithms take into account statistical measurements derived from our entire active data set. Those computations need to be run at least once a week to account for the continuous growth and shifts in data samples. ...

Some of the contenders in our product selection matrix were:

Traditional enterprise packages such as Oracle: ruled out because they were way out of our budget.
MySQL: our content sizes vary from 140-character tweets to multi-page articles, and using one-size-fits-all BLOBs would be a tremendous waste of space. ...
Cassandra: It fit the bill in terms of schema flexibility and storage capacity ...
MongoDB: it was still new at the time, so we had concerns about its stability and adoption. ...
Riak: it was a serious contender for us; most of our requirements were being met and it presented the same promise of ease of use and deployment as Cassandra did. ...
HBase: back then, it was one of the most polished solutions with quite a bit of traction. ...

...

Having worked with it now, it's no wonder MongoDB is currently enjoying such growth. While the migration from HBase took us about three months, the integration with MongoDB itself was achieved in just a couple of weeks, since we already had a DAO layer abstracted from the rest of our applications. The rest of the time was spent tweaking our new model and rewriting our content acquisition and attribution services. And at every step of that refactoring, we found that MongoDB was making things easier for us.
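
To illustrate why an abstracted DAO layer can keep a datastore swap short, here is a minimal sketch in Python. It is not Traackr's code: the post does not show their DAO, and every name here (ContentDao, InMemoryContentDao, index_post) is hypothetical. The in-memory class stands in for a real backend; an HBase-backed or MongoDB-backed class would implement the same two methods.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ContentDao(ABC):
    """Hypothetical storage-agnostic interface the application codes against."""

    @abstractmethod
    def save(self, content_id: str, document: Dict) -> None:
        ...

    @abstractmethod
    def find(self, content_id: str) -> Optional[Dict]:
        ...


class InMemoryContentDao(ContentDao):
    """Stand-in backend; a real datastore client would replace the dict."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict] = {}

    def save(self, content_id: str, document: Dict) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self._rows[content_id] = dict(document)

    def find(self, content_id: str) -> Optional[Dict]:
        row = self._rows.get(content_id)
        return dict(row) if row is not None else None


def index_post(dao: ContentDao, content_id: str, text: str) -> None:
    """Application code: depends only on the ContentDao interface."""
    dao.save(content_id, {"text": text, "length": len(text)})


dao = InMemoryContentDao()
index_post(dao, "post-1", "hello")
print(dao.find("post-1"))
```

Because index_post only sees the interface, changing the backend means writing one new ContentDao subclass and changing the line that constructs it, while the calling code stays untouched.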

