What do Postgres, Kafka, and Bitcoin Have in Common?
On the surface, these three technologies couldn't look more different, but under the hood they have one interesting thing in common.
By Jeff Hsu, Traintracks.io.
Postgres, Kafka, and Bitcoin. These three technologies on the surface couldn't look any more different.
Postgres is the go-to object-relational database for developers these days, while Kafka is unquestionably the de facto publish-subscribe messaging system for streaming data at scale. And of course, Bitcoin achieved fame (notoriety?) as one of the first cryptocurrencies to really gain momentum.
Underneath the hood, all three of these technologies have one interesting thing in common: their use of the immutable log.
To achieve replication, Postgres uses an immutable write-ahead log (WAL). Changes to state are appended to the WAL before they are applied to permanent storage. Followers can then read the log and apply those changes to their own copies, guaranteeing consistency.
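To make the idea concrete, here is a toy sketch of that pattern, with a hypothetical key/value store standing in for the database. (This is illustrative only; Postgres's real WAL records binary, page-level changes, not key/value pairs.)

```python
# Toy write-ahead log: the leader appends every change to an
# append-only log *before* updating its state, and a follower
# replays the log to reach the same state.

class Leader:
    def __init__(self):
        self.wal = []      # append-only log of state changes
        self.state = {}    # stands in for "permanent storage"

    def put(self, key, value):
        # The change hits the log first, then the stored state.
        self.wal.append((key, value))
        self.state[key] = value

class Follower:
    def __init__(self):
        self.state = {}
        self.applied = 0   # how far into the leader's log we've replayed

    def catch_up(self, wal):
        # Replay only the entries we haven't seen yet, in order.
        for key, value in wal[self.applied:]:
            self.state[key] = value
        self.applied = len(wal)

leader = Leader()
leader.put("a", 1)
leader.put("b", 2)

replica = Follower()
replica.catch_up(leader.wal)
assert replica.state == leader.state
```

Because the follower only ever replays an ordered, append-only log, it can fall behind and catch up later without ever diverging from the leader.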
In Kafka, streams of data are divided up into partitions, where each partition is an immutable log of ordered messages. Every new message that enters the system is appended to a partition. All consumers of the log keep track of what they've consumed themselves, so in case of an error, they can pick up from where they left off. And because the messages are ordered, all consumers get them in the same, intended order! No need to worry about race conditions.
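The same idea can be sketched in a few lines: an ordered, append-only partition, with each consumer tracking its own offset. (These class names are invented for illustration; this is not Kafka's actual API.)

```python
# Toy Kafka-style partition: an immutable, ordered log where each
# consumer keeps track of its own read position (offset).

class Partition:
    def __init__(self):
        self.log = []  # append-only; messages are never modified in place

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1   # offset of the newly appended message

class Consumer:
    def __init__(self, partition):
        self.partition = partition
        self.offset = 0            # each consumer tracks what it has consumed

    def poll(self):
        # Read everything from our offset onward, then advance.
        messages = self.partition.log[self.offset:]
        self.offset = len(self.partition.log)
        return messages

p = Partition()
for msg in ["order-1", "order-2", "order-3"]:
    p.append(msg)

fast, slow = Consumer(p), Consumer(p)
assert fast.poll() == ["order-1", "order-2", "order-3"]
assert slow.poll() == ["order-1", "order-2", "order-3"]  # same order for all
```

Note that consuming never mutates the log, so if a consumer crashes it can simply resume from its last saved offset, and every consumer sees the messages in the same order.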
Well, what about Bitcoin? Bitcoin uses an immutable log in the form of a blockchain, which lets all parties agree on a single history of transactions. In a blockchain, transactions are bundled into blocks, and each block records the previous block's hash, creating a chain of blocks. Cryptographic hashing makes it really hard for malicious agents to tamper with the records: if a transaction in block N is tampered with, block N's hash changes, which means block N+1 now references a block that no longer exists, breaking the blockchain. Even if a malicious agent successfully tampered with and re-mined the whole blockchain (which already takes significant processing power), a simple comparison with a different copy of the blockchain would easily reveal the tampering, making it hard for anyone to hack the system.
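That hash-chaining property is easy to demonstrate in miniature. The sketch below builds a tiny chain of blocks and shows that tampering with an old transaction breaks the link to the next block. (This is a bare hash chain for illustration; real Bitcoin blocks also carry Merkle roots, nonces, and proof-of-work, all omitted here.)

```python
import hashlib
import json

def block_hash(block):
    # Hash a canonical serialization of the block's contents.
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def make_block(transactions, prev_hash):
    return {"transactions": transactions, "prev_hash": prev_hash}

def verify(chain):
    # Each block must reference the hash of the block before it.
    for prev, curr in zip(chain, chain[1:]):
        if curr["prev_hash"] != block_hash(prev):
            return False
    return True

genesis = make_block(["alice->bob:5"], prev_hash="0" * 64)
block1 = make_block(["bob->carol:2"], prev_hash=block_hash(genesis))
chain = [genesis, block1]
assert verify(chain)

# Tamper with a transaction in the genesis block...
genesis["transactions"][0] = "alice->mallory:5"
# ...and block1's stored prev_hash no longer matches, breaking the chain.
assert not verify(chain)
```

To hide the tampering, an attacker would have to recompute the hash of every subsequent block, which is exactly the expensive re-mining the article describes.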
Immutability as a concept seems so intuitive, so why have update-in-place approaches been the de facto way of doing things for so long? A big reason is that storage used to be expensive. Now, with advances in processing power and a significant drop in the cost of storage, immutable logs are far more economically feasible, which means you'll be seeing more and more of them in the technologies of the future.
Bio: Jeff Hsu is the CTO of Traintracks.io, a big data analytics platform built on top of a Virtual ETL engine that enables answers to questions about behavior in seconds that would typically take enterprises weeks to answer. Previously an engineer at Apple, Microsoft Research Asia, and the AMPLab at UC Berkeley, Jeff has also authored several papers about his research on energy monitoring and smart buildings.
Original. Reposted with permission.