BigData Choice: Which database to use?

In the era of big data, RDBMS is no longer the only choice. Here is a guide for what type of DB to choose, including Key-value pair, Column family/big table, Document, and Graph databases.

By Andrew Oliver | InfoWorld | 3 August 2012

NoSQL There are several types of NoSQL databases and rational reasons to use them in different situations for different datasets. It's much more complicated than tech industry marketing nonsense like "NoSQL = scale."

Part of the reason there are so many different types of NoSQL databases lies in the CAP theorem, aka Brewer's Theorem. The CAP theorem states you can provide only two out of the following three characteristics: consistency, availability, and partition tolerance. Different datasets and different runtime rules cause you to make different trade-offs. Different database technologies focus on different trade-offs. The complexity of the data and the scalability of the system also come into play.

Another reason for this divergence can be found in basic computer science or even more basic mathematics. Some datasets can be mapped easily to key-value pairs; in essence, flattening the data doesn't make it any less meaningful, and no reconstruction of its relationships is necessary. On the other hand, there are datasets where the relationship to other items of data is as important as the items of data themselves.

Cassandra Key-value pair databases

Key-value pair databases include the current 1.8 edition of Couchbase and Apache Cassandra. These are highly scalable, but offer no assistance to developers with complex datasets. If you essentially need a disk-backed, distributed hash table and can look everything up by identity, these will scale well and be lightning fast. However, if you find that you're looking up a key to get to another key to get to another key to get to your value, you probably have a more complicated case. ...

Column family/big table databases

Most key-value stores (including Cassandra) offer some form of grouping for columns and can be considered "column family" or "big table" as well. Some databases such as HBase were designed as column family stores from the beginning. This is a more advanced form of a key-value pair database. Essentially, the keys and values become composite. Think of this as a hash map crossed with a multidimensional array. Essentially each column contains a row of data. ...

Document databases

Many developers think document databases are the Holy Grail since they fit neatly with object-oriented programming. With high-flying vendors like 10gen ( MongoDB), Couchbase, and Apache's CouchDB, this is where most of the vendor buzz is generated. ...

Graph databases

Graph databases are really less about the volume of data or availability and more about how your data is related and what calculations you're attempting to perform. As Philip Rathle, senior director of product engineering at Neo Technologies (makers of Neo4j), told me, graph databases are especially useful when "the data set is fundamentally interconnected and non-tabular. The primary data access pattern is transactional, i.e., OLTP/system of record vs. batch... bearing in mind that graph databases allow relatedness operations to occur transactionally that, in an RDBMS world, would need to take place in batch." ...

Read more.

Related
→ Data Mining Software