Challenges Building a Distributed Data System

Faults and Partial Failure

When we run a program on a single computer, it is reasonable to expect it to behave in a fairly predictable way. An individual computer with good software should be either fully functional or entirely broken, but not in between. When we design an application running on single machine, we usually choose to crash the program completely and ask the user to handle the failure. The reason is that the fault computer usually cannot recover automatically.

Unreliable Networks

In this section, we focus on the system with IP network connected commodity hardware. The Internet and internal networks in data centers are asynchronous packet networks. In this kind of network, a node can send a message to another but there is no guarantee of the message reception or delivery latency. Therefore, many things can go wrong during the message delivery

  1. The request may have been lost
  2. The request may be waiting in a queue and will be delivered later
  3. The remote node may fail
  4. The remote node may be temporally unresponsive, but it may begin responding later


The usually way to handle the unreliable network is a timeout. The sender will give up waiting and assume the response is not going to arrive after a pre-defined time period. A lot of systems also depends on active or passive request timeout to detect nodes failure. For example, load balancer period sends periodic health check request to nodes in the pool. A timeout health check usually indicates a dead node and the load balancer removes it from the pool. Apache ZooKeeper deploys a similar strategy for failure detection but through a passive heartbeat.

Synchronous Network

A synchronous network allocate connections with fixed bandwidth between sender and recipient and guarantee the delivery time. Such network is widely used for traditional fix line telephone and ATM. There are some proposals to simulate a synchronous network through careful use of QoS and rate limiting. This network can provide statistical bounded delay, which benefits the data system to detect and response undelivered message.

Unreliable Clocks

People use clocks to measure time for centuries. Accordingly to International System of Unit, the second has been defined as exactly “the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium-133 atom” (at a temperature of 0 K). Unfortunately, high accuracy is infeasible to achieve in most practical systems. Additionally, there are leap seconds applied to Coordinated Universal Time (UTC), which makes the UTC time neither continuous nor monotonic increasing. In the context of computer system, there are two types of clock: time of day clock and monotonic clock. They come with their own strength and weakness, therefore are used for different use cases.

Time-of-day Clock and Monotonic Clock

Time-of-day clock is what we intuitively expect of a clock, which returns the current date and time accordingly to Gregorian calendar without leap seconds. Due to the accuracy issue, Time-of-day clock is usually synchronized with Network Time Protocol (NTP), which helps multiple hosts to agree on the same clock with certain accuracy. However, this synchronization forces local time-of-day clock appears to jump backward or leap forward, which makes it unsuitable to measure elapsed time.

  1. The quartz time in a computer is not very accurate. Google assumes a clock drifts 200pm for its server, which is equivalent to 6ms drive for a clock that is re-synchronized with a server every 30 seconds.
  2. Application may observe time go backward or suddenly jump forward due to synchronization
  3. A misconfigured NTP may go unnoticed for some time.
  4. NTP synchronization can only be as good as the network delay.
  5. Leap seconds
  6. Server cannot control the NTP synchronization on client device.

Why Does It Matters?

Is it really important to have a synchronized accurate time? The answer is surprisingly no. In fact, people use synchronous time for two cases

  1. An absolute time stamp to record when the events happen. In most cases, fluctuation of +/-10ms is usually tolerable.
  2. A timestamp to order events. In this scenario, a logical clock, which is a synchronous logical order, is sufficient.

What is Next

In the next post, I will share the common characteristics of distributed data system. Additionally I will try to categorize some popular distributed database based on these characteristics.




Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store