The (in)famous CAP Theorem
In the previous post, we discussed ACID and transactions for relational database. However, ACID is very expensive and infeasible to achieve for distributed data systems due to the challenges we have discussed in this post. As the trends of large scale distributed system arise, people begin to discuss CAP theorem. The CAP theorem, also named Brewer’s theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees, Consistency, Availability and Partition tolerance.
However, the popular CAP theorem is proposed as a rule of thumb without precise definition. As it is well-cited by a lot of blogs or even in database specification documents, there are quite a lot of misleading even misunderstood information in the industry. Dr. Kleppmann even states that the CAP theorem is credited for the explosion of new databases technologies since 2000s, but it should be avoided when we try to understand a distributed system and its characteristics.
CAP or CA?
Although there is a lack of precision of consistency, availability and partition tolerance in the community, I will start with the definition of Wikipedia,
- Consistency: Every read receives the most recent write or an error
- Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
- Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
Although the Wikipedia’s definition of partition tolerance emphasizes on the network partition, it is reasonable to expand it the all fault tolerance since we already knew it is difficult to distinguish remote host’s hardware fault from network issues. As we have discussed in the last post, fault is inevitable when we scale the system, which indicates the necessity of partition tolerance when designing a distributed data system. Therefore, the CAP should be re-stated as a system can only achieve availability or consistency when partitioned or I would like to focus on the tradeoff between availability and consistency (CA) when design a distributed data system.
Despite implementation details of various database, I will try to explain the CA in a simple way. When data nodes can communicate with each other, the system behaviors as there is only one copy of the data, which is replicated and distributed among data nodes. When the system receives a request, each node can response it exactly the same. Additionally, if some nodes are offline and unavailable from external point of view, replications can take the responsibility through fault tolerance design without interruption. However, the situation becomes tricky if there is network partition. Assuming a node lost data synchronization with other nodes, there are essentially two options when it receives a data request. It can either serve the client with local data or refuse the request. The earlier may lead to an inconsistent response due to some stale local data while the latter option makes the system unavailable.
Availability is the least discussed among the CAP triangle since the system availability is usually well defined and rare misunderstood. Distributed system achieve high availability and fault tolerance through data replication. In a nutshell, each nodes in the system can be replaced by a few other nodes (replication) without serious system downtime. However, there are some caveats when we discuss the availability. The precise requirement of availability in fact guides us toward a proper architecture for given use cases.
Read Availability vs. Write Availability
The read operations scan the data without any modification, which is as idempotent. All replicated data node can answer read request. Therefore, high read availability is usually not concerning for distributed data system design.
However, there are more considerations when process a write request, especially for those are not idempotent. Therefore, people make tradeoffs to achieve some degree of ACID. For example, single leader architecture system, like BigTable, HBase, favors consistency over write availability . There is usually only one node (leader node), which can handle write request and be responsible to distribute data among follower nodes. Consequently, the system is unavailable to write request if leader fails. On the other side of the spectrum, leaderless architecture, such as Cassandra, treats each node equally. Each replication node can serve write request and synchronize data with peers. Therefore, it is possible that peers cannot read an agreement due to concurrent write or synchronization delay.
In both architecture, read availability can be guaranteed with similar replications while it is very different for writing availability.
Availability vs. Scalability
Given the limited processing of a single commodity machine, the availability and scalability is usually very related. If there are more data size (storage scalability) or queries per second (computation scalability), the system will run out of resource and become unavailable. We will have to scale the system either vertical or horizontal. Therefore, a system with crippled availability can seldom scale well. For example, HBase is well criticized for its write operation scalability due to its single leader architecture.
It is important to understand the scale of write operations and read operations individually. For write-intensive system, a leaderless system is usually a wise choice while we should pay more attention to replication to support read-intensive system.
Almost all distributed data system tolerate node tolerance through data replication. Every piece of data is duplicated and stored on difference nodes to reduce the impact on individual node failure. Some data nodes may fail to keep in sync with each other. The data replication may be delayed due to a congested network. Some data nodes may be recovering from a system fault and catching up with recent data. There are countless possibilities to cause data discrepancy among data nodes. The consistency is system properties to quantify the discrepancy. In my opinion, we can read different consistency from timely and integral perspective. The timeliness ensures the user always sees the up-to-date data while the integrity guarantees no data loss or corrupted.
Most replicated data systems provide at least eventual consistency, which means that if you stop writing to database, then eventually all read requests will return the same value. In other words, the inconsistency is temporary and data will eventually coverage to the same value. We can see eventual consistency guarantees data integrity. However, it is a very weak consistency level since there is no statement about when the replicas will coverage.
The linearizability is considered as the strongest level of consistency. A linearizable system acts as there is one and only one copy of the data regardless of underlying data replication strategies. Linearizability guarantees both data integrity and timeliness.
From implementation perspective, a linearizable system requires synchronous replication and consensus algorithms. In practice, consensus algorithm is only used for leader election due to its crippled ability to handle high throughput load. Therefore, a linearizable system is usually a single-leader architecture with careful concurrency design.
Multi-leader data system is never linearizable. Some leaderless system claim they obtain “strong consistency” by requiring quorum reads and writes. However, it also requires the reader to perform synchronous read repair at the cost of reduced performance. System like Cassandra use “last write wins” conflict resolution is also not linearizable neither. Considering the limitation above is contradict to the scalability benefits of leaderless architecture, it is safe to claim a leaderless system is not linearizable neither.
Causal consistency captures the potential causal relationships between operations, and guarantees that all processes observe causally-related operations in a common order. In other words, all processes in the system agree on the order of the causally-related operations. They may disagree on the order of operations that are causally unrelated. Causal consistency is also a weak consistency but since it matches application’s intuition about time and logical ordering. Additionally due to the cost of linearizability, most distributed data system claims causal consistency to boost performance. Causal consistency can be summarized in four shades:
- Read Your Writes: If a process performs a write, the same process later observes the result of its write.
- Monotonic Reads: the set of writes observed (read) by a process is guaranteed to be monotonically non-decreasing.
- Writes Follow Reads: if some process performs a read followed by a write, and another process observes the result of the write, then it can also observe the read (unless it has been overwritten).
- Monotonic Writes: If some process performs a write, followed some time later by another write, other processes will observe them in the same order.
It seems that the causal consistency preserves causality, which is not promised by eventual consistency. However, there is nothing new from timeliness or integrity. In fact, the difference between causal consistency and eventual consistency is about concurrency or isolation.
What is Next
In this blog, we dive into the details of CAP theorem and clarified some misunderstanding about distributed system properties. In the last section, we briefly discussed the causal consistency needs proper concurrency or isolation design, which is an important properties of transactions. In the next blog, we will expand the scope of transaction and discuss distributed transactions, including the use cases and some implementation details.
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer…