Distributed Data System Architectures

Single Leader Architecture

As the name indicates, there is one leader node per partition. The leader is responsible for all write requests and for replicating data to followers. Single leader architecture is simple but commonly used. However, there are some important implementation details.

Asynchronous Vs. Synchronous Replication

The first important detail for a replicated system is whether the replication happens asynchronously or synchronously. The figure shows the communication among the client, the leader, and the followers for a write request. In this example, the replication to follower 1 is synchronous, since the leader waits for follower 1's response before acknowledging the client. On the other hand, the replication to follower 2 is asynchronous and does not block the write request.
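The difference can be sketched in a few lines of Python. This is a toy model, not any real system's API: the `Leader` and `Follower` classes and their `write`/`apply` methods are invented for illustration. The leader blocks on the synchronous follower before acknowledging the client, while writes to the asynchronous follower are merely enqueued and applied by a background thread.

```python
import queue
import threading

class Follower:
    """Toy follower: just appends replicated records (illustrative)."""
    def __init__(self):
        self.data = []

    def apply(self, record):
        self.data.append(record)

class Leader:
    """Toy leader with one synchronous and one asynchronous follower."""
    def __init__(self, sync_follower, async_follower):
        self.sync_follower = sync_follower
        self.async_queue = queue.Queue()
        # Background thread drains the queue to the async follower.
        threading.Thread(
            target=self._drain, args=(async_follower,), daemon=True
        ).start()

    def _drain(self, follower):
        while True:
            follower.apply(self.async_queue.get())

    def write(self, record):
        # Synchronous replication: wait for follower 1 to apply.
        self.sync_follower.apply(record)
        # Asynchronous replication: enqueue for follower 2, don't wait.
        self.async_queue.put(record)
        return "ack"  # only now acknowledge the client
```

After `write` returns, follower 1 is guaranteed to have the record, while follower 2 may still be lagging behind.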

Leader Failover

When the leader node becomes unavailable, one of the followers needs to be promoted. Clients should be reconfigured to send write requests to the new leader, and followers should be notified of the new leader and begin to replicate data from it. This process is called leader failover. A leader failover can be done manually by a system administrator. Some systems also implement an automatic failover strategy to reduce system downtime. Automatic failover is tricky for the following reasons:

  1. Declaring a leader dead is tricky over an unreliable network. A long timeout means more downtime and potentially more lost writes; a short timeout may raise false alarms, which trigger unnecessary failovers.
  2. If asynchronous replication is used, unreplicated writes will be lost, which breaks the durability guarantees.
  3. When the former leader rejoins, its unreplicated writes may conflict with writes accepted by the new leader.
  4. In some fault scenarios, two or more followers may be promoted to leader, which leads to a split-brain situation. Without proper conflict-resolution logic, split brain makes a single-leader system behave like a multi-leader one, and the data can be corrupted.

Despite these challenges, an automatic failover generally consists of three steps:

  1. Detect the leader's failure. As discussed above, there is no foolproof way over an unreliable network. Most systems use a simple timeout.
  2. Elect a new leader. The new leader is usually the follower with the least replication lag, or the synchronous follower. Regardless of the preference, all followers must agree on the election result, which is usually achieved through a consensus algorithm.
  3. Reconfigure the system to use the newly elected leader. If the old leader comes back, it must convert itself into a follower, which is usually enforced with a fencing token.
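The fencing-token idea in step 3 can be sketched as follows. This is an illustrative model, not a real system's API: the `Storage` class and its `write` method are invented for this example. Each elected leader receives a monotonically increasing token, and the storage layer rejects writes carrying an older token, so a former leader that comes back cannot corrupt data.

```python
class Storage:
    """Toy storage with a fencing-token check (illustrative)."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # Reject writes from any leader holding a stale token.
        if token < self.highest_token:
            raise PermissionError("stale fencing token; demote to follower")
        self.highest_token = token
        self.data[key] = value

store = Storage()
store.write(1, "k", "old-leader")   # original leader holds token 1
store.write(2, "k", "new-leader")   # after failover, new leader holds token 2
try:
    store.write(1, "k", "zombie")   # old leader rejoins with its stale token
except PermissionError:
    pass                            # rejected; the old leader must step down
```

The key property is that the token only ever increases, so the storage layer can tell the old leader and the new leader apart without knowing anything about the election itself.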

Follower Failover / Setup

A follower node may suffer a fault and become unavailable. Additionally, a system administrator may increase the replication factor or add new followers. In both cases, the follower of interest proceeds as follows:

  1. Check the local log for the last processed transaction. For a new follower, the log is simply empty.
  2. The follower contacts the leader for the unprocessed data changes.
  3. The follower processes the data changes locally, which is called catching up. In the meantime, it receives the new stream of data changes as well.
  4. Once all data changes are processed, it declares itself fully caught up and begins to serve read requests from clients.
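The steps above can be sketched as a short function. The representation is invented for illustration: a log is a list of `(txn_id, record)` pairs, and `catch_up` is a hypothetical helper, not a real system's API.

```python
def catch_up(follower_log, leader_log):
    """Bring a follower's log up to date with the leader's log.

    Both logs are lists of (txn_id, record) pairs, ordered by txn_id.
    Returns True once the follower is fully caught up.
    """
    # Step 1: find the last processed transaction (0 if the log is empty,
    # as for a brand-new follower).
    last = follower_log[-1][0] if follower_log else 0
    # Step 2: ask the leader for everything after that transaction.
    missing = [entry for entry in leader_log if entry[0] > last]
    # Step 3: process the missing data changes locally.
    follower_log.extend(missing)
    # Step 4: declare fully caught up once the logs match.
    return follower_log == leader_log
```

A new follower simply starts from an empty log and replays the leader's entire log.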

Replication Log

For leader-based architectures, the last piece is the replication log. Several replication log formats have been proposed and implemented in different systems.

Statement based replication

Statement-based replication records the write request as-is and broadcasts the statement to followers. The benefit of this approach is its compact storage format. However, a few things can go wrong:

  1. Any statement with a nondeterministic function, such as NOW() , may cause inconsistent data.
  2. If a statement uses an auto-increment column, it must be executed in exactly the same order as on the leader node. This requirement limits the concurrency-control design.
  3. Statements with side effects may cause unexpected results on followers.
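The nondeterminism problem in item 1 is easy to demonstrate with a toy executor. Everything here is invented for illustration (the `execute` helper, the `RAND()` substitution, the explicit seeds standing in for each replica's independent randomness): replaying the same statement *text* on two replicas produces two different rows.

```python
import random

def execute(statement, seed):
    """Toy statement executor: expands RAND() using replica-local
    randomness, modeled here by a per-replica seed (illustrative)."""
    rng = random.Random(seed)
    if "RAND()" in statement:
        return statement.replace("RAND()", str(rng.random()))
    return statement

# The leader and a follower replay the exact same statement text...
leader_row = execute("INSERT INTO t VALUES (RAND())", seed=1)
follower_row = execute("INSERT INTO t VALUES (RAND())", seed=2)
# ...and end up with different rows: the replicas silently diverge.
```

A common fix is for the leader to evaluate the nondeterministic function once and ship the resulting literal value instead of the statement.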

Row based replication

A row-based replication log is usually a sequence of records describing the writes to storage at the granularity of a row. For example, an inserted row is recorded as the new values of all columns. A deleted row is usually recorded as the primary key in a tombstone log. For an updated row, the log contains enough information to identify the updated row, plus the new values of all columns.
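A row-based log might look like the following sketch. The entry layout and field names (`op`, `table`, `key`, `row`) are made up for illustration, not any real system's log format; `apply` replays one entry against an in-memory table keyed by primary key.

```python
# Illustrative row-based log: insert, then update, then delete (tombstone).
row_log = [
    {"op": "insert", "table": "users",
     "row": {"id": 7, "name": "Ann", "email": "ann@example.com"}},
    {"op": "update", "table": "users", "key": {"id": 7},
     "row": {"id": 7, "name": "Ann", "email": "ann@new.com"}},
    {"op": "delete", "table": "users", "key": {"id": 7}},  # tombstone
]

def apply(table, entry):
    """Replay one log entry against an in-memory table keyed by id."""
    if entry["op"] == "delete":
        table.pop(entry["key"]["id"], None)
    else:
        # Insert and update both carry the full new row.
        table[entry["row"]["id"]] = entry["row"]
```

A follower reconstructs its state simply by replaying the entries in order; no statement text is ever re-executed, so nondeterminism is not an issue.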

Multi-leader Architecture

Single-leader replication has one major downside: there is only one leader, and all writes must go through it. If we allow more than one leader node in the system, we have a multi-leader architecture. The use cases for a multi-leader architecture are as follows:

  1. Multi-datacenter operation. A globally distributed data system usually spans more than one data center. With a multi-datacenter setup, we usually have one leader per data center. Followers receive the data change stream only from their local leader over the local network, while the leaders exchange data changes over the Internet. The benefit is reduced cross-datacenter traffic and improved performance. Additionally, a multi-leader system can tolerate a data center outage.
  2. Clients with poor network connectivity. A modern web application may use locally cached data when it cannot connect to the server. As a result, each client with a poor connection acts as a leader. When the client comes back online, the server or the client has to resolve the conflicting writes. We see a similar use case in collaborative editing tools such as Google Docs.

Conflict Avoidance

The simplest solution is to avoid conflicts by ensuring that all writes for a particular record go through the same leader. In this case, a multi-leader system acts like a single-leader system, and conflicts can never occur.
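One common way to do this is to hash each record's key to a "home" leader, so every write for that key lands on the same node. The leader names and the routing function below are invented for illustration.

```python
import hashlib

# Hypothetical leader names, one per data center.
LEADERS = ["leader-0", "leader-1", "leader-2"]

def home_leader(record_key: str) -> str:
    """Route all writes for a given key to the same home leader.

    Because the mapping is deterministic, two clients writing the
    same record always reach the same leader, so their writes are
    ordered by that leader and never conflict across leaders.
    """
    digest = hashlib.sha256(record_key.encode()).digest()
    return LEADERS[digest[0] % len(LEADERS)]
```

The trade-off is that a write for a record whose home leader sits in a remote data center loses the low-latency benefit of writing locally.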


Conflict Convergence

The convergence strategy is to ensure that all replicas eventually reach a consistent state. When there are write conflicts in a single-leader data system, the leader can assign a timestamp or transaction ID to each write request, and the last write determines the final state. Although it is more difficult to coordinate leaders and order events in a multi-leader system, it is still possible to resolve concurrent writes, ensuring all replicas pick the same winner and converge to a consistent state, via a few alternatives:

  1. A unique transaction ID. The ID can be generated from a timestamp, a random long number, or a hash of the transaction. The system always picks the transaction with the highest ID.
  2. Leader ID. Assign a unique ID to each leader and always take the write from the leader with the highest ID.
  3. Record the conflict in a special data structure and prompt customized application logic to resolve it.
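The first alternative, often called last-write-wins, can be sketched as below. The write layout (`ts`, `txn_id`, `value`) is invented for illustration; the point is that every replica applies the same deterministic rule and therefore converges on the same value.

```python
def resolve(conflicting_writes):
    """Last-write-wins: pick the write with the highest
    (timestamp, transaction_id) pair. Ties on timestamp are broken
    by the unique transaction ID, so every replica agrees."""
    return max(conflicting_writes, key=lambda w: (w["ts"], w["txn_id"]))

# Three concurrent writes to the same record, seen by every replica.
writes = [
    {"ts": 100, "txn_id": "a1", "value": "blue"},
    {"ts": 105, "txn_id": "b7", "value": "green"},
    {"ts": 105, "txn_id": "c2", "value": "red"},
]
winner = resolve(writes)
```

Note the downside of last-write-wins: the "losing" writes are silently discarded, which is only acceptable if lost writes are tolerable for the application.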

Leaderless Architecture

Amazon built its in-house database Dynamo on a leaderless architecture, which inspired Apache Cassandra, Voldemort, and other open-source solutions. The advantages of a leaderless data system are high availability and scalability.

Quorum Reads and Writes

Generally speaking, if there are n replicas, every write must be confirmed by w nodes and every read must consult r nodes. As long as w + r > n, we can expect to read an up-to-date value, because the set of nodes read must overlap the set of nodes written. Reads and writes that obey this rule are called quorum reads and writes.

However, even with quorums, there are caveats:

  1. If there are concurrent writes, the system still has to deal with write conflicts.
  2. If a write and a read are concurrent, the read may return stale data, which breaks "read your writes" consistency.
  3. A write may succeed on some replicas but fail on others. If the write is not abortable, a subsequent read may or may not return the written data.
  4. Quorum is NOT linearizability.
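The quorum rule can be sketched with a toy in-memory store. The `QuorumKV` class, its versioned records, and the way writes stop after w acknowledgements are all invented for illustration; real systems like Cassandra handle failures, read repair, and per-request consistency levels on top of this basic idea.

```python
class QuorumKV:
    """Toy quorum reads and writes over n in-memory replicas.

    Writes succeed once w replicas acknowledge; reads consult r
    replicas and return the value with the newest version. With
    w + r > n, the read set always overlaps the latest write set.
    """
    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "quorum condition violated"
        self.replicas = [dict() for _ in range(n)]
        self.n, self.w, self.r = n, w, r

    def put(self, key, value, version):
        acked = 0
        for replica in self.replicas:
            replica[key] = (version, value)
            acked += 1
            if acked >= self.w:
                break  # remaining replicas are left lagging

    def get(self, key):
        # Consult r replicas (here: the last r, which in this toy
        # model are the ones most likely to be stale).
        replies = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(replies)[1]  # the newest version wins
```

With n = 3, w = 2, r = 2, the write lands on replicas 0 and 1 while the read consults replicas 1 and 2; replica 1 is in both sets, so the read always sees the latest version.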

What is next?

In the next two posts, I will discuss Apache HBase and Apache Cassandra as demonstrations of the single-leader and leaderless architectures, respectively.


