Distributed Data System Architectures
In previous posts, we saw that a distributed system replicates data across multiple nodes for fault tolerance. Among the replicas there are usually two roles, leader and follower. The leader node is responsible for processing write requests and sending data to all followers as part of a replication log or change stream. When a client wants to read data, it can query either the leader or any of the followers; write requests, however, are only accepted by the leader.
The replication model is a built-in feature of many relational databases and is widely adopted by NoSQL databases as well. In this post, I will discuss some common distributed data system architectures with respect to the replication model. There are usually three architectures in the industry, depending on the number of leader nodes per data partition.
Single Leader Architecture
As the name indicates, there is one leader node per partition. The leader is responsible for all write requests and for replicating data to the followers. Single-leader architecture is simple but commonly used. However, there are some implementation details worth discussing.
Asynchronous Vs. Synchronous Replication
The first important detail for a replicated system is whether replication happens asynchronously or synchronously. The figure shows the communication among the client, the leader, and the followers for a write request. In this example, the replication to follower 1 is synchronous, since the leader waits for follower 1's response before acknowledging the client. On the other hand, the replication to follower 2 is asynchronous and does not block the write request.
The advantage of synchronous replication is a stronger consistency guarantee: the synchronous follower always sees an up-to-date view of the data, which is guaranteed to match the leader. However, synchronous replication blocks the processing of write requests, so it is impractical for all followers to be synchronous; otherwise, any follower fault could halt the whole system. When we enable synchronous replication for a data system, it usually means one follower replicates through synchronous requests while the rest remain asynchronous. The synchronous follower ensures the durability of write requests in case of a leader node failure.
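The write path above can be sketched in a few lines. This is a toy model, not a real replication protocol: the `Leader` and `Follower` classes and their in-memory logs are hypothetical, and replication is just an `apply` call, synchronous for one follower and fired on a background thread for the other.

```python
import threading

class Follower:
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)

class Leader:
    """Toy leader with one synchronous and one asynchronous follower."""

    def __init__(self, sync_follower, async_follower):
        self.log = []
        self.sync_follower = sync_follower
        self.async_follower = async_follower

    def write(self, record):
        self.log.append(record)
        # Synchronous replication: block until this follower has applied the record.
        self.sync_follower.apply(record)
        # Asynchronous replication: fire and forget; does not block the write.
        threading.Thread(target=self.async_follower.apply, args=(record,)).start()
        # The client is acknowledged only after the sync follower applied the write.
        return "ok"
```

Note that after `write` returns, the synchronous follower's log is guaranteed to contain the record, while the asynchronous follower's log may still be lagging behind.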
A key difference between a synchronous follower and the leader is that when the synchronous follower becomes slow, it can be replaced by any other follower without consensus or data repair. A leader node failover, however, usually requires a leader election among the followers and a catch-up of the replication lag.
Leader Failover
When the leader node becomes unavailable, one of the followers needs to be promoted. Clients must be reconfigured to contact the new leader for write requests, and the remaining followers must be notified of the new leader and begin replicating data from it. This process is called leader failover. A leader failover can be done manually by a system administrator. Some systems also provide an automatic failover strategy to reduce system downtime. Automatic failover is tricky for the following reasons:
- Declaring the leader dead is tricky over an unreliable network. A long timeout may result in large data loss during the outage; a short timeout may cause false alarms and unnecessary failovers.
- If asynchronous replication is used, un-replicated writes will be lost, which breaks the durability guarantee.
- When the former leader rejoins, its unreplicated writes may conflict with writes accepted by the new leader.
- In some fault scenarios, two or more followers may be promoted to leader, which leads to a split-brain situation. Split brain makes a single-leader system behave like a multi-leader system without proper conflict-resolution logic, and the data can be corrupted.
Because of these challenges, some DevOps teams prefer to perform leader failover manually even when the off-the-shelf software provides an auto-failover feature.
Despite all the potential issues, an automatic failover strategy is usually implemented in the following steps:
- Detect the leader’s failure. As discussed above, there is no foolproof way to do this over an unreliable network; most systems use a simple timeout.
- Elect a new leader. The new leader is usually the follower with the least replication lag, or the synchronous follower. Regardless of the preference, all followers must agree on the election result, which is usually achieved through a consensus algorithm.
- Reconfigure the system to use the elected leader. If the old leader comes back, it needs to demote itself to a follower, which is usually enforced with a fencing token.
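The steps above can be sketched as follows. This is a deliberately simplified model: real systems run a consensus protocol for the election, whereas here the `Cluster` class, the lag table, and the epoch-based fencing token are all hypothetical stand-ins.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.epoch = 0          # fencing token: the election epoch this node holds
        self.is_leader = False

class Cluster:
    """Toy failover: elect the follower with the least replication lag and
    bump the epoch so the old leader's writes can be fenced off."""

    def __init__(self, nodes, lags):
        self.nodes = nodes
        self.lags = lags        # node name -> replication lag (lower is better)
        self.epoch = 0

    def failover(self, dead_leader):
        dead_leader.is_leader = False
        # Step 2: pick the candidate with the least replication lag.
        candidates = [n for n in self.nodes if n is not dead_leader]
        new_leader = min(candidates, key=lambda n: self.lags[n.name])
        # Step 3: issue a new fencing token so stale leaders are rejected.
        self.epoch += 1
        new_leader.epoch = self.epoch
        new_leader.is_leader = True
        return new_leader

    def accept_write(self, node):
        # Writes carrying a stale epoch (e.g. from the old leader) are fenced off.
        return node.is_leader and node.epoch == self.epoch
```

If the old leader rejoins and tries to write, its epoch no longer matches the cluster's, so the write is rejected; it must demote itself and catch up as a follower.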
Follower Failover / Setup
The follower node may suffer a fault and become unavailable. Additionally, a system administrator may increase the replication factor or add new followers. In both cases, the follower in question proceeds as follows:
- Check the local log for the last processed transaction. For a new follower, the last processed transaction is simply empty.
- Contact the leader for the unprocessed data changes.
- Process the data changes locally, a phase called catch-up. In the meantime, the follower also receives the new stream of data changes.
- Once all data changes are processed, the follower declares itself fully caught up and begins to serve read requests from clients.
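The catch-up step can be sketched with logs modeled as plain lists. The function below is a hypothetical simplification: real systems track positions with log sequence numbers or offsets rather than list lengths.

```python
def catch_up(follower_log, leader_log):
    """Toy catch-up: the follower asks the leader for every change after
    its last processed position and applies the missing changes locally.
    A brand-new follower simply starts from an empty log."""
    last = len(follower_log)        # position of the last processed transaction
    missing = leader_log[last:]     # unprocessed data changes from the leader
    follower_log.extend(missing)    # apply locally ("catch up")
    return follower_log
```

Once `catch_up` returns the follower's log matches the leader's snapshot, and the follower can start serving reads while continuing to consume the live change stream.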
For leader-based architectures, the last piece is the replication log. Several replication log formats have been proposed and implemented for different systems.
Statement based replication
Statement-based replication records the write request as-is and broadcasts the statement to the followers. The benefit of this approach is the compact storage format. However, a few things can go wrong:
- Any statement with nondeterministic functions, such as NOW(), may produce inconsistent data.
- If a statement uses an auto-increment column, it must be executed in exactly the same order as on the leader node. This requirement limits the concurrency control design.
- Statements with side effects may cause unexpected results on the followers.
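The nondeterminism problem is easy to demonstrate. The toy executor below supports a single hypothetical statement form, `INSERT <key> NOW()`; when the same statement is replayed later on a follower, `NOW()` evaluates to a different timestamp and the replicas diverge.

```python
import time

def execute_statement(db, statement):
    """Toy statement executor for one form: 'INSERT <key> NOW()'.
    NOW() is nondeterministic -- it is evaluated at execution time."""
    key = statement.split()[1]
    db[key] = time.time()

leader, follower = {}, {}
stmt = "INSERT created_at NOW()"
execute_statement(leader, stmt)      # executed on the leader first
time.sleep(0.01)                     # replication delay
execute_statement(follower, stmt)    # replayed verbatim on the follower
# The two replicas now hold different values for the same logical write.
assert leader["created_at"] != follower["created_at"]
```

Row-based replication avoids this by shipping the already-computed value rather than the statement that computes it.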
Row based replication
A row-based replication log is usually a sequence of records describing the writes to storage at the granularity of a row. For example, an inserted row is recorded as the new values of all columns. A deleted row is usually recorded as a tombstone identified by its primary key. For an updated row, the log contains enough information to identify the row, along with the new values of all columns.
Row-based replication decouples the log from the storage engine and can more easily achieve backward compatibility. The disadvantage is inefficiency: every changed row must be logged in full.
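A follower applying a row-based log might look like the sketch below. The record shapes (`op`, `key`, `columns`) are hypothetical; real formats such as MySQL's row-based binlog are binary and engine-specific, but the structure is analogous.

```python
def apply_row_change(table, change):
    """Apply one row-based replication record to a table, modeled as a
    dict keyed by primary key. Assumed record shapes:
      {"op": "insert", "key": k, "columns": {...}}
      {"op": "update", "key": k, "columns": {...}}  # new values of all columns
      {"op": "delete", "key": k}                    # tombstone: primary key only
    """
    if change["op"] == "delete":
        table.pop(change["key"], None)
    else:
        # Inserts and updates both carry the full new row.
        table[change["key"]] = change["columns"]
    return table
```

Because the records describe data values rather than statements, the follower needs no SQL engine at all to apply them, which is what decouples replication from the storage engine.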
Besides the approaches above, there are other alternatives, such as trigger-based replication and write-ahead-log-based replication.
Multi-Leader Architecture
Single-leader replication has one major downside: there is only one leader, and all writes must go through it. If we allow more than one leader node in the system, we get a multi-leader architecture. The use cases for a multi-leader architecture are as follows:
- Multi-datacenter operation. A globally distributed data system usually spans more than one datacenter. With a multi-datacenter setup, we usually have one leader per datacenter. Followers receive the data change stream only from their local leader over the local network, while the leaders exchange data changes over the Internet. The benefit is reduced Internet traffic and improved performance. Additionally, a multi-leader system can tolerate a datacenter outage.
- Clients with poor connectivity. A modern web application may operate on locally cached data when it cannot connect to the server. In effect, each disconnected client acts as a leader. When the client comes back online, the server or the client has to resolve any conflicting writes. We see a similar use case in collaborative editing tools such as Google Docs.
The biggest challenge for a multi-leader data system is concurrent writes, which may lead to write conflicts. Since the leaders usually share nothing, coordination is difficult. An important implementation detail for a multi-leader data system is therefore the conflict resolution strategy.
The simplest solution is to avoid conflicts by ensuring that all writes for a particular record go through the same leader. In this case, a multi-leader system acts like a single-leader system for each record, and conflicts can never occur.
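Conflict avoidance is typically implemented by routing: every write for a given key is sent to a deterministic "home" leader. A minimal sketch, assuming a hypothetical `home_leader` routing function over a fixed list of leaders:

```python
import hashlib

def home_leader(key, leaders):
    """Route all writes for a key to the same leader. Because the mapping is
    deterministic, concurrent writes to one record can never hit two leaders,
    so no conflict can arise for that record."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return leaders[digest % len(leaders)]
```

The trade-off is that a client may have to write to a remote datacenter, giving up the latency benefit that motivated multi-leader in the first place; and the scheme breaks down if the set of leaders changes or a home leader fails.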
The convergence strategy ensures that all replicas eventually reach a consistent state. In a single-leader data system, the leader can assign a timestamp or transaction ID to each write request, and the last write determines the final state. Although it is harder to coordinate leaders and order events in a multi-leader system, concurrent writes can still be resolved by a few alternatives that make every replica pick the same winning write and converge to a consistent state:
- A unique transaction ID. The ID can be generated from a timestamp, a random number, or a hash of the transaction. The system always picks the transaction with the highest ID.
- A leader ID. Assign a unique ID to each leader and always take the write from the leader with the highest ID.
- Record conflicts in a special data structure and defer to customized application logic to resolve them.
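The first strategy, often called last-write-wins, fits in one function. Conflicting versions are modeled here as hypothetical `(transaction_id, value)` pairs; because every replica applies the same deterministic rule, they all converge to the same value.

```python
def resolve_conflict(versions):
    """Last-write-wins: among conflicting versions of a record, keep the
    value carried by the highest transaction ID. Deterministic, so every
    replica that sees the same set of versions picks the same winner."""
    return max(versions, key=lambda v: v[0])[1]
```

Note the cost of this simplicity: all but the winning write are silently discarded, which is exactly the data-loss hazard that makes last-write-wins controversial.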
Despite all these conflict resolution designs, multi-leader architecture is still considered dangerous territory. There are other pitfalls as well, such as auto-increment keys, triggers, and integrity constraints. Therefore, if possible, multi-leader architecture should be avoided.
Leaderless Architecture
Amazon uses a leaderless architecture for its in-house database Dynamo, which inspired Apache Cassandra, Voldemort, and other open source solutions. The advantages of a leaderless data system are high availability and scalability.
In a leaderless architecture, each node can handle both write and read requests. In theory, the system can tolerate up to n-1 node faults and therefore provides high availability. As we discussed with the CAP theorem, consistency is usually the concern. To address it, leaderless data systems use replica quorums, meaning the majority decides the truth.
Quorum Reads and Writes
Generally speaking, if there are n replicas, every write must be confirmed by w nodes and every read must consult r nodes. As long as w + r > n, we can expect to read an up-to-date value, because the set of nodes written to and the set of nodes read from must overlap in at least one node. Reads and writes that obey this rule are called quorum reads and writes.
In most leaderless systems, w and r are configurable to favor read-intensive (small r) or write-intensive (small w) scenarios. By default, we usually set w = r = (n + 1) / 2, rounded up. This default configuration provides the highest availability.
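The quorum condition and a quorum read can be sketched directly. The `quorum_read` function below is a hypothetical simplification in which each replica reply is a `(version, value)` pair and the reader keeps the value with the highest version number:

```python
def is_quorum(n, w, r):
    """Quorum condition: with n replicas, w write acks, and r read replies,
    every read overlaps at least one replica holding the latest write
    exactly when w + r > n."""
    return w + r > n

def quorum_read(replies, r):
    """Toy quorum read: consult r replica replies, each a (version, value)
    pair, and return the value carried by the highest version."""
    return max(replies[:r], key=lambda reply: reply[0])[1]
```

For example, with n = 3 the default w = r = 2 satisfies the condition, while w = r = 1 (fast but unsafe) does not.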
Although quorum reads and writes sound promising for resolving the consistency issue, there are some important caveats:
- If there are concurrent writes, the system still has to deal with write conflicts.
- If a write and a read happen concurrently, the read may return stale data, which breaks “read your writes” consistency.
- A write may succeed on some replicas but fail on others. If the write is not aborted, subsequent reads may or may not return its data.
- A quorum is NOT linearizability.
Therefore, although r and w let you adjust the probability of reading stale data, they are not an absolute guarantee. It is wise to treat quorums as an eventual consistency guarantee only.
What is next?
In the next two posts, I will discuss Apache HBase and Apache Cassandra as demonstrations of the single-leader and leaderless architectures, respectively.