DynamoDB vs. Cassandra: from “no idea” to “it’s a no-brainer”

DynamoDB vs. Cassandra: have they got anything in common? If yes, what? If no, what are the differences? We answer these questions and examine the performance of both databases.

By Alex Bekker, ScienceSoft.

Apache Cassandra is an open-source database, while Amazon DynamoDB is a database service offered by AWS. It’s a common misconception that this is the biggest, if not the only, difference between the two technologies. To refute this misconception, let’s look at them more closely in terms of:

Data models
Architectures
Security features
Performance issues
Use cases

1. Data model

DynamoDB’s data model:
DynamoDB vs Cassandra, Fig. 1

Here’s a simple DynamoDB table. Its rows are items, and cells are attributes. In DynamoDB, it’s possible to define a schema for each item, rather than for the whole table.

Each table has a primary key, which can be either simple or composite. A simple primary key contains only a partition key, which determines which partition will physically store the data. A composite primary key consists of both a partition key and a sort key. In this case, the partition key performs the same function, and the sort key, as its name suggests, orders the items that share the same partition key.
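To make the roles of the two key parts more tangible, here is a minimal in-memory sketch. It is not DynamoDB's API: the Table class and the sample attribute names are hypothetical, and the only thing modeled is "group by partition key, order by sort key, let every item carry its own attributes."

```python
from collections import defaultdict


class Table:
    """Toy model of a table with a composite primary key (hypothetical, not a real API)."""

    def __init__(self):
        # partition key -> list of (sort key, item), kept ordered by sort key
        self._partitions = defaultdict(list)

    def put(self, partition_key, sort_key, attributes):
        rows = self._partitions[partition_key]
        # An item with the same full primary key replaces the previous one.
        rows[:] = [(sk, item) for sk, item in rows if sk != sort_key]
        rows.append((sort_key, attributes))
        rows.sort(key=lambda row: row[0])

    def query(self, partition_key):
        """Return all items that share a partition key, ordered by sort key."""
        return [item for _, item in self._partitions.get(partition_key, [])]


orders = Table()
orders.put("user-42", "2018-03-01", {"total": 30})
orders.put("user-42", "2018-01-15", {"total": 12, "coupon": "NEW"})  # extra attribute
orders.put("user-7", "2018-02-02", {"total": 99})

user_42_orders = orders.query("user-42")
```

Note that the two items of user-42 come back in sort-key order even though they were written out of order, and that they don't have to share the same set of attributes.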

Cassandra’s data model:

DynamoDB vs Cassandra, Fig. 2

Here’s a simple Cassandra column family (also called a table). It consists of rows that contain varying numbers of columns.

Every column family has a primary key. It may be simple or compound. If the primary key is simple, it contains only a partition key that determines what node and what partition are going to store the data. If the primary key is compound, it includes both a partition key and clustering columns. The former is used for the same purposes as in a simple primary key, while the latter sorts data within one partition.

DynamoDB vs. Cassandra. Data models in comparison:

First of all, Cassandra and DynamoDB do have some things in common: they both allow creating ‘schemaless’ tables and both have two similar parts of a primary key (partition key and sort key/clustering columns).

But there’re also tangible differences:

Amazon DynamoDB is a key-value and document-oriented store, while Apache Cassandra is a column-oriented data store.
Although DynamoDB can store numerous data types, Cassandra’s list of supported data types is more extensive: it includes, for instance, tuples, varints, timeuuids, etc.
In DynamoDB, partition keys and sort keys can contain only one attribute each, while Cassandra allows including more than one column (attribute) in partition keys and clustering columns.
DynamoDB’s ‘partition’ ≠ Cassandra’s ‘partition.’ In DynamoDB, it’s a physical part of storage allocated for a particular chunk of a table (each partition can ‘weigh’ up to 10 GB). Cassandra’s partition is a set of rows in a column family that share the same partition key and are therefore stored on one node.
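The single-attribute vs. multi-column difference can be illustrated with a short sketch. Both helper functions and the sample row are hypothetical. The second function shows a workaround that is often used with single-attribute keys (concatenating several values into one string); that pattern is our assumption for illustration, not something either database does for you.

```python
def cassandra_style_key(row, partition_columns, clustering_columns):
    """Several columns can form the partition key and the clustering columns."""
    partition = tuple(row[c] for c in partition_columns)
    clustering = tuple(row[c] for c in clustering_columns)
    return partition, clustering


def dynamodb_style_key(row, partition_columns, sort_columns, sep="#"):
    """One attribute per key part: several values are packed into one string.

    Caveat: packed strings sort lexicographically, so numeric parts would
    need zero padding to keep a meaningful order.
    """
    partition = sep.join(str(row[c]) for c in partition_columns)
    sort = sep.join(str(row[c]) for c in sort_columns)
    return partition, sort


reading = {"country": "DE", "city": "Berlin", "year": 2018, "month": 3, "temp_c": 7}

cassandra_key = cassandra_style_key(reading, ["country", "city"], ["year", "month"])
dynamodb_key = dynamodb_style_key(reading, ["country", "city"], ["year", "month"])
```

In the first case the key parts keep their own types and columns; in the second, the application has to build and parse the packed strings itself.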

2. Architecture

Cassandra’s architecture:

All of Cassandra’s nodes are equal, and any of them can function as a coordinator that ‘communicates’ with the client app. Without master nodes, there’s no single point of failure. This allows Cassandra to be always (or almost always) available.

Cassandra’s data distribution is based on consistent hashing. It works like this: every node has a token defining the range of this node’s hash values. During the write, Cassandra transforms the data’s partition key into a hash value and checks the tokens to identify the needed node. When Cassandra finds the needed node, it stores the data on it and replicates it to a number of other nodes. This particular number depends on the tunable replication factor, but usually, it’s 3. This means that your data is stored on 3 separate nodes, and if one or even two of them fail, your data will still be available.
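The mechanism above can be sketched in a few lines. This is a simplified model under stated assumptions: the TokenRing class and node names are hypothetical, each node gets a single token derived from an MD5 hash of its name (real clusters assign tokens differently and use another hash function), and replicas are simply the next nodes clockwise on the ring.

```python
import hashlib
from bisect import bisect_left


class TokenRing:
    """Simplified consistent-hashing ring (illustrative only)."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Assumption: one token per node, derived from the node's name.
        self._ring = sorted((self._hash(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, partition_key):
        """Return the nodes that store the data for this partition key."""
        key_hash = self._hash(partition_key)
        # First node whose token is >= the key's hash; wrap around past the end.
        start = bisect_left(self._tokens, key_hash) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = TokenRing(nodes, replication_factor=3)
owners = ring.replicas("user-42")
```

With a replication factor of 3, owners always holds three distinct nodes, so the data stays reachable even if two of them go down.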

DynamoDB’s architecture:

It would be nice to know what DynamoDB has got under the hood, but to us all, it’s a big black box. We can only state this:

With DynamoDB, you don’t think servers: the biggest entity that concerns you is a table. And anything beyond that is in the ‘dark’ area.
By default, DynamoDB works within one AWS region and replicates data to 3 separate availability zones, which ensures high availability. If having data in only one region isn’t enough, it’s possible to do cross-region replication, but it comes with its own pitfalls, which we cover later on.

Cassandra vs. DynamoDB. Architectures in comparison:

Given the non-exhaustive info about DynamoDB’s ‘insides,’ we can’t really compare the two architectures. However, we know one thing for sure: according to the CAP theorem, both databases are targeted at availability and partition tolerance. And this will lead to problems with consistency for both databases. Off the record: given the similarities in data models and the shared CAP theorem guarantees, we could suggest that some of Cassandra’s architectural features may also be present in DynamoDB (data distribution mechanism and masterless cluster organization, for instance). Still, this is just a wild guess.

3. Security features

Just like most other NoSQL databases, Cassandra provides user authentication and access authorization. Data access is role-based, and the smallest level of granularity is a row. Besides that, Cassandra offers client-to-node and inter-node encryption.

DynamoDB also provides ways to work with user authentication and access authorization. It’s a more common practice to assign certain permissions and access keys to users than go with user roles. And the smallest level of access granularity is an attribute. But there’s a huge security advantage on DynamoDB’s side. Instead of securing data only in transit, AWS has recently expanded the list of their security features with encryption at rest based on Advanced Encryption Standard (AES-256). Although they say it doesn’t affect performance in the least, you should still keep such a possibility in mind.

4. Performance issues

Cassandra’s issues:

Here, we don’t aim to provide a comprehensive overview of Cassandra’s performance (you can surely find that by following the link). In this section, we will focus on its major performance issues only.

a) Consistency and read speed.

Cassandra creates multiple data replicas to ensure data availability and, for read speed purposes, doesn’t always check every node that has the data to find the latest data version. This causes data consistency problems. But that’s only the tip of the iceberg. Cassandra treats all write operations as pure adds. When you need to update, it creates another data version with an updated value and a fresher timestamp. And if it happens a lot, there are tons of versions of the same data record, which is why fetching obsolete ones becomes a common thing. Moreover, Cassandra deletes data somewhat similarly: it first adds a tombstone to the to-be-deleted records and only later (during a compaction process) physically deletes them. This creates problems related to consistency and read speed, since the more tombstones there are, the more difficult it is to find the needed data version. However, all these issues are solvable through tunable consistency (with the help of the replication factor and the data consistency level) and an appropriate compaction strategy depending on your particular tasks.
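The interplay of timestamps, tombstones and tunable consistency can be illustrated with a small model. All names here (read_latest, reads_see_latest_write, TOMBSTONE) are hypothetical. The sketch assumes that each replica answers with a (timestamp, value) pair and that the freshest timestamp wins; the overlap rule "read replicas + write replicas > replication factor" is the usual way to reason about when a read is guaranteed to see the latest write.

```python
from itertools import combinations

TOMBSTONE = object()  # marker written instead of physically deleting a record


def read_latest(replica_answers):
    """Pick the version with the freshest timestamp; a tombstone means 'deleted'."""
    _, value = max(replica_answers, key=lambda answer: answer[0])
    return None if value is TOMBSTONE else value


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


# Replication factor 3; the update (timestamp 200) reached only two replicas.
replicas = [(100, "old@example.com"), (200, "new@example.com"), (200, "new@example.com")]

read_from_one = read_latest(replicas[:1])  # may hit the stale replica
read_from_any_two = {read_latest(pair) for pair in combinations(replicas, 2)}
after_delete = read_latest([(200, "new@example.com"), (300, TOMBSTONE)])
```

Reading from a single replica can return the obsolete value, while any two replicas together always include the fresh version. That is the trade-off you tune: stricter levels cost read or write speed, looser ones risk stale data.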

b) Scans.

Even with the above-mentioned issues, Cassandra’s reads are still very quick and efficient, but only as long as you know the primary key of the data you need. If you don’t, you may need to resort to scanning to find the required data. And Cassandra doesn’t like scans: if a scan takes longer than a particular time, it returns an error and your data will probably not be found. However, if you integrate Cassandra with Apache Spark, performant scans become more feasible.

c) Secondary indexes.

If you are used to indexing, be prepared: Cassandra’s secondary indexes may not meet your expectations. They may be used in some situations, but mostly it’s preferable to avoid them since they lead to scans, and that isn’t something Cassandra favors. However, the database provides an alternative indexing method called materialized views. They presuppose creating another version of the base table and including the indexed column in the partition key, which makes the materialized views easily searchable without scans. But data volume and write performance obviously get affected.
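Here is a toy model of the idea behind materialized views, showing why lookups become cheap while storage and writes grow. The UserTable class, its fields and the physical_writes counter are hypothetical and only mimic the principle: a second copy of the data keyed by the indexed column, maintained on every write.

```python
class UserTable:
    """Base table keyed by user_id plus a view keyed by email (illustrative only)."""

    def __init__(self):
        self.base = {}            # user_id -> row
        self.view_by_email = {}   # email -> {user_id: row}
        self.physical_writes = 0  # rough count of underlying write operations

    def upsert(self, user_id, email, name):
        previous = self.base.get(user_id)
        if previous is not None and previous["email"] != email:
            # The indexed column changed: the old view row has to be removed.
            self.view_by_email[previous["email"]].pop(user_id)
            self.physical_writes += 1
        row = {"user_id": user_id, "email": email, "name": name}
        self.base[user_id] = row
        self.view_by_email.setdefault(email, {})[user_id] = dict(row)
        self.physical_writes += 2  # one write to the base table, one to the view

    def find_by_email(self, email):
        """Direct key lookup in the view, no scan of the base table."""
        return list(self.view_by_email.get(email, {}).values())


users = UserTable()
users.upsert(1, "ann@example.com", "Ann")
users.upsert(2, "bob@example.com", "Bob")
users.upsert(1, "ann@corp.example", "Ann")  # Ann changes her email

found = users.find_by_email("ann@corp.example")
```

Three logical writes turned into seven physical ones in this model, and every row is stored twice. That is the price paid for scan-free lookups by email.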

DynamoDB’s issues:

As DynamoDB is a black box, it’s fairly difficult to describe its performance systematically. Here are some issues we’ve found.

a) Auto scaling.

DynamoDB’s users are charged primarily for read and write throughput rather than for the amount of storage. For each table or index, you specify how many read/write capacity units (RCUs and WCUs) it will need per second, which essentially determines how quickly it will work.

If your app experiences occasional peak times and activity drops, throughput capacity should be easy to adjust. There is an option of reserved burst capacity in DynamoDB (some capacity allocated for emergencies), but it’s usually not enough. So, you need to either manually tune throughput or use auto scaling, where you only set the target and DynamoDB handles activity fluctuations for you. Sounds too nice to be true, right? And the doubt is justified. DynamoDB’s auto scaling has a number of issues: it reacts to activity variations very slowly (within 10-15 minutes) and still doesn’t manage them too effectively. However, AWS states that using DynamoDB Accelerator (DAX) with auto scaling significantly improves its ability to handle unpredictable bursts of activity.

b) Throttling and hot keys.

(Editor - see comment with updated info below from Jum Scharf from Amazon DynamoDB team)

When your app starts to send more read/write requests than your provisioned capacity allows (assuming you don’t tune throughput), the requests start to fail, or throttle. For them to succeed, the app has to wait and do retries. The latency doesn’t grow dramatically in such cases, but it’s still quite unpleasant.
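The "wait and retry" part usually looks like exponential backoff with jitter. Below is a generic sketch of that pattern, not the behavior of any particular AWS SDK: ThrottledError, call_with_backoff and flaky_write are hypothetical names, and the delays are made-up defaults.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 'provisioned throughput exceeded' error."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Retry a throttled operation, waiting longer (with jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller deal with it
            # Random delay between 0 and base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


# Simulated write that is throttled twice before it succeeds.
attempts = {"count": 0}


def flaky_write():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ThrottledError("capacity exceeded")
    return "ok"


delays = []
result = call_with_backoff(flaky_write, sleep=delays.append)  # record delays instead of sleeping
```

The request eventually succeeds, but only after two extra round trips plus the waiting time, which is exactly where the added latency comes from.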

And that’s not all. The provisioned throughput of a table is distributed between its partitions. So, if you have a 100-WCU throughput per table with 20 partitions, each gets only 5. Supposing your app’s user starts to perform ordinary not-too-abundant activities that are written to a table with the partition key being, say, user ID, 5 WCUs can get exceeded very quickly. Thus, the key becomes hot and the write requests start to throttle, increasing overall latency.

“What kind of problem is that, if you could simply add more throughput to a table?” you could ask. Not so fast. At scale, it can be fairly difficult to know the number of your partitions, which means it’s hard to understand how much throughput you need. Besides, when a partition grows and reaches its size limit (10 GB), it gets separated into 2 new partitions, whose throughput will be equal to half the provisioned capacity of the parent partition. So, if you have 20 partitions with 5 WCUs each and one of them exceeds the limit, the 2 new partitions will get 2.5 WCUs each, which could be catastrophically little.
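The arithmetic above is easy to reproduce. This sketch simply follows the even-split behavior as described in this article; the function names are hypothetical, and the numbers are the ones from the example.

```python
def per_partition_capacity(table_capacity, partitions):
    """Provisioned throughput is divided evenly between a table's partitions."""
    return table_capacity / partitions


def split_partition(parent_capacity):
    """A partition that hits its size limit splits in two; each half gets half the capacity."""
    return parent_capacity / 2, parent_capacity / 2


wcu_per_partition = per_partition_capacity(table_capacity=100, partitions=20)
child_a, child_b = split_partition(wcu_per_partition)
```

Here wcu_per_partition is 5.0, and each of the two new partitions ends up with 2.5 WCUs, even though the table as a whole still has 100 WCUs provisioned.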

This is how DynamoDB pushes you to provision too much throughput (just in case), which costs big time. And, to top it all off, it makes you think really hard about the right partition keys to keep them from getting hot, which can be excruciating.

c) Cross-region replication.

You may think that having your data in only one AWS region won’t do you good, which is why you’ll have to do cross-region replication. So, you’ll need global tables which, as AWS claims, ‘don’t require any code or system changes.’ What they require, though, is substantial setup and maintenance: global tables in all regions must have the same auto scaling and throughput settings, time to live, number of global and local secondary indexes and so on. And besides that, they have some limitations:

Logical conflicts. There’s a rule: last writer wins. And generally, it works nicely. But due to network failures and time delays, the timestamp of the more up-to-date data can turn out to be older than that of an obsolete data version in another region. And according to the rule above, the write of the informationally fresher data version will be rejected. Another example: different apps running in different regions can attempt to make a change to a common global table with only a nanosecond time difference. This way, according to the very same rule, only one change will be admitted. And these are only two of the ways consistency problems can occur, which means, if you need strong-consistency reads, global tables are not for you.
Migration troubles. It can be really difficult to migrate to global tables from the already existing tables, especially if you can’t stop your incoming data traffic. The migration process requires additional tools, such as Amazon S3 and Data Pipeline (or, instead, DynamoDB Streams and a Lambda function). Or you may even have to temporarily modify your application code (which global tables are initially said to avoid).
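The first logical conflict is easier to see in code. This is a minimal sketch of the ‘last writer wins’ rule, assuming each write carries the timestamp recorded in its own region. The Write class, the region names and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Write:
    region: str
    value: str
    true_order: int   # the order in which the writes really happened
    timestamp: float  # the timestamp recorded in the region (may be skewed or delayed)


def last_writer_wins(writes):
    """Keep the write with the latest recorded timestamp, discard the rest."""
    return max(writes, key=lambda write: write.timestamp)


# The obsolete write happened first, but its recorded timestamp is later.
stale = Write(region="region-1", value="address=old", true_order=1, timestamp=1000.200)
fresh = Write(region="region-2", value="address=new", true_order=2, timestamp=1000.150)

winner = last_writer_wins([stale, fresh])
```

The rule picks the obsolete address because it only compares timestamps and knows nothing about the real order of events.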

Cassandra vs. DynamoDB. Performance issues in comparison:

Cassandra doesn’t suffer from the hot key issue and provides lower overall latency.
Cassandra can do replication across multiple data centers much more easily than DynamoDB can do cross-region replication.
Cassandra doesn’t support auto scaling, but expanding the number of nodes in a cluster does allow linear performance scalability.
DynamoDB can do updates nicely without creating a mess like Cassandra.
DynamoDB can do scans better than Cassandra, but you still have to be careful with them, since they cost a lot. Besides, DynamoDB works much better with local secondary indexes than Cassandra.
DynamoDB doesn’t require any major changes to work with strong consistency, but it’s twice as expensive as eventual consistency.
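The ‘twice as expensive’ point about strong consistency follows from how read capacity units are defined. The sketch below assumes the definition from AWS’s documentation (one RCU covers one strongly consistent read per second, or two eventually consistent reads per second, of an item up to 4 KB); the function name and the sample workload are hypothetical.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent):
    """Estimate the RCUs needed for a steady read workload."""
    blocks_per_read = math.ceil(item_size_bytes / 4096)  # reads are billed in 4 KB blocks
    units = blocks_per_read * reads_per_second
    # Eventually consistent reads cost half as much.
    return units if strongly_consistent else math.ceil(units / 2)


strong = read_capacity_units(3 * 1024, reads_per_second=100, strongly_consistent=True)
eventual = read_capacity_units(3 * 1024, reads_per_second=100, strongly_consistent=False)
```

For 100 reads per second of a 3 KB item, this gives 100 RCUs with strong consistency and 50 with eventual consistency.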

5. Use cases

DynamoDB is supposed to be a good choice for IoT, real-time bidding platforms, recommendation engines and gaming applications (so says the official AWS website). Such features as high availability, relatively low latency and rapid scalability indeed can help DynamoDB work nicely in these cases.

Cassandra is good for IoT, recommendation and personalization engines, fraud detection, messaging systems, etc. Cassandra’s quick write and read operations coupled with extremely low latency and linear scalability make it a nice fit for these applications.

You can see that some use cases overlap and that most of them are based on write-intensive workloads. Given that Cassandra’s write operation is incredibly cheap and quick, it’s no surprise that it handles such tasks nicely. But as to DynamoDB, there is some contradiction. According to AWS’s pricing model, DynamoDB’s writes are 4 to 8 times more expensive than reads. And this fact makes the abundance of DynamoDB’s write-oriented use cases quite puzzling.

6. The long-awaited conclusion

DynamoDB’s advantages are: easy start; absence of the database management burden; sufficient flexibility, availability and scalability; in-built metrics for monitoring; encryption of data at rest.

Cassandra’s main advantages are: lightning speed of writes and reads; constant availability; SQL-like Cassandra Query Language instead of DynamoDB’s complex API; cross-data-center replication; linear scalability and high performance.

However, the technical details of the two databases shouldn’t be the only aspect to analyze before making a choice. You need to look at your application as a whole and see what other technologies you’ll need to accompany your database. If, say, you’ll need the open-source Apache Spark, Cassandra is your choice. If you plan to use AWS tools extensively, then it’s DynamoDB. And whichever you choose, beware of the database’s pitfalls that we covered above.

Bio: Alex Bekker is the Head of Data Analytics Department at ScienceSoft, an IT consulting and software development company headquartered in McKinney, Texas. Combining 20+ years of expertise in delivering data analytics solutions with 10+ years in project management, Alex has been leading both business intelligence and big data projects, as well as helping companies embrace the advantages that data science and machine learning can bring. Among his largest projects are: big data analytics revealing media consumption patterns in 10+ countries, private labels product analysis for 18,500+ manufacturers, BI for 200 healthcare centers.