Challenges Building a Distributed Data System

Humbo
6 min read · Nov 15, 2020

In the last post, I discussed transactions and relational databases. Since the 2000s, the world has entered the data era, and many people consider data an essential resource, much like fossil fuel. This trend motivates the industry to build distributed data systems, since a traditional database running on a single machine cannot handle workloads of terabytes or even petabytes of data. In this post, I will focus on a few challenges in building a distributed data system. In future posts, I will discuss some concrete solutions.

Faults and Partial Failure

When we run a program on a single computer, it is reasonable to expect it to behave in a fairly predictable way. An individual computer with good software is either fully functional or entirely broken, but not something in between. When we design an application running on a single machine, we usually choose to crash the program completely and ask the user to handle the failure, because a faulty computer usually cannot recover automatically.

However, when we scale the system to run on several hosts connected by a network, the situation is fundamentally different. We are no longer running our application in an idealized model. The overall reliability of a distributed system decreases exponentially as the cluster scales; it is just Murphy's Law at work. If we assume a mere 1% failure rate on a single machine, the probability that at least one node fails in a 100-node cluster is about 63.4%. It is better to admit that failures are inevitable and handle them properly. Additionally, some faults, such as transient network problems or GC pauses, recover by themselves without human intervention, which makes the outcome even less deterministic. These messy physical details require us to define a partial failure state for distributed systems.
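
To make the arithmetic concrete, here is a quick sanity check (a plain Python sketch, assuming node failures are independent):

```python
# If each node fails independently with probability 1%, the chance that
# at least one node in an n-node cluster fails grows rapidly with n.
p_single = 0.01

for n in (1, 10, 100, 1000):
    p_any = 1 - (1 - p_single) ** n
    print(f"{n:>4} nodes: P(at least one failure) = {p_any:.1%}")

# Output for 100 nodes: 63.4%, matching the figure above.
```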

The strategy for handling and recovering from partial failures, known as fault tolerance, is an important system design requirement. Large-scale distributed systems span a spectrum of use cases. At one end, supercomputers with thousands of individual computation units are used to solve compute-intensive scientific tasks. A supercomputer behaves more like a scaled-up single computer: people usually flush results to persistent storage periodically and design fault-recovery strategies around this checkpointed data. At the other end, people run heterogeneous workloads on elastic compute clouds, which are built from commodity hardware connected by an IP network. Solutions at this end usually involve data replication and consensus.

Unreliable Networks

In this section, we focus on systems built from commodity hardware connected by an IP network. The Internet and the internal networks in data centers are asynchronous packet networks. In this kind of network, a node can send a message to another, but there is no guarantee that the message will arrive, nor any bound on its delivery latency. Therefore, many things can go wrong during message delivery:

  1. The request may have been lost.
  2. The request may be waiting in a queue and will be delivered later.
  3. The remote node may have failed.
  4. The remote node may be temporarily unresponsive, but it may begin responding again later.

The sender cannot tell whether the request was delivered. The only option is to ask the recipient for an acknowledgment; however, for the same reasons, the response may itself be lost or delayed. These situations are indistinguishable in an asynchronous packet network.

Timeout

The usual way to handle an unreliable network is a timeout: the sender gives up waiting and assumes the response is not going to arrive after a pre-defined time period. Many systems also depend on active or passive request timeouts to detect node failures. For example, a load balancer sends periodic health-check requests to the nodes in its pool; a health check that times out usually indicates a dead node, and the load balancer removes it from the pool. Apache ZooKeeper deploys a similar strategy for failure detection, but through passive heartbeats.
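
As a rough illustration, here is a minimal health checker in Python. It is a sketch of the idea, not any real load balancer's implementation; the host/port pool is hypothetical:

```python
import socket

def is_node_alive(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection succeeds within timeout_s.

    Note: False is ambiguous -- the node may be dead, merely slow, or
    unreachable due to a network fault. The checker cannot tell which.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # includes socket.timeout
        return False

pool = [("10.0.0.1", 8080), ("10.0.0.2", 8080)]  # hypothetical node pool
healthy = [node for node in pool if is_node_alive(*node)]
```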

However, a timeout only mitigates the sender's uncertainty, and it comes with side effects. For example, the remote node may have received the request and processed it, yet the response was lost or delayed; the system may then need to behave differently than it would for a truly lost request. Additionally, setting a proper timeout value is challenging. For example, if the remote node stops responding because it is overloaded, a timeout-and-retry strategy may make the situation worse and lead to cascading failure.
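
One common mitigation for such retry storms is exponential backoff with jitter. Below is a minimal sketch; `send_request` is a hypothetical RPC call that raises `TimeoutError` on timeout:

```python
import random
import time

def call_with_backoff(send_request, max_attempts=5, base_delay_s=0.1):
    """Retry a flaky call, sleeping longer (and randomly) between attempts."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Full jitter: sleep a random fraction of an exponentially
            # growing cap, so that many clients retrying at once do not
            # synchronize and hammer the overloaded node together.
            time.sleep(random.uniform(0, base_delay_s * 2 ** attempt))
```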

Synchronous Network

A synchronous network allocates a connection with fixed bandwidth between sender and recipient and guarantees the delivery time. Such networks are widely used in traditional fixed-line telephony and Asynchronous Transfer Mode (ATM). There are also proposals to emulate a synchronous network through careful use of quality of service (QoS) and rate limiting. Such a network can provide statistically bounded delay, which helps a data system detect and respond to undelivered messages.
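
To give a flavor of the rate-limiting side, here is a sketch of a token bucket, one common primitive behind such QoS schemes (my illustration, not any specific product's implementation):

```python
import time

class TokenBucket:
    """Tokens accrue at a fixed rate up to a burst cap; each packet
    (or request) spends one token, bounding the average rate."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```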

Synchronous networks are not generally available today because of efficiency concerns. In general, we can think of a bounded delay as a consequence of static resource allocation. However, static resource allocation cannot handle bursty traffic within a connection, and it cannot utilize unallocated bandwidth either.

Unreliable Clocks

People have used clocks to measure time for centuries. According to the International System of Units, the second is defined as exactly "the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium-133 atom" (at a temperature of 0 K). Unfortunately, such high accuracy is infeasible to achieve in most practical systems. Additionally, leap seconds are applied to Coordinated Universal Time (UTC), which makes UTC neither continuous nor monotonically increasing. In the context of computer systems, there are two types of clock: the time-of-day clock and the monotonic clock. Each comes with its own strengths and weaknesses, and they are therefore used for different purposes.

Time-of-day Clock and Monotonic Clock

A time-of-day clock is what we intuitively expect of a clock: it returns the current date and time according to the Gregorian calendar, without leap seconds. Because of accuracy issues, a time-of-day clock is usually synchronized via the Network Time Protocol (NTP), which helps multiple hosts agree on the same time within a certain accuracy. However, this synchronization can force the local time-of-day clock to jump backward or leap forward, which makes it unsuitable for measuring elapsed time.

A monotonic clock, on the other hand, is usually used to measure time elapsed on a single host, such as a timeout or a server's response time. As the name suggests, the clock always moves forward, although its accuracy is limited by the local quartz oscillator. However, the absolute value of a monotonic clock makes no sense, and it is meaningless to compare monotonic clock values from two different computers, since they do not refer to the same thing.
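
The distinction maps directly onto standard APIs. In Python, for example, `time.time()` reads the time-of-day clock while `time.monotonic()` reads the monotonic clock:

```python
import time

start_wall = time.time()       # seconds since the Unix epoch; may jump
start_mono = time.monotonic()  # absolute value is meaningless on its own

time.sleep(0.5)                # stand-in for some real work

elapsed = time.monotonic() - start_mono  # safe for durations and timeouts
print(f"happened at {start_wall:.3f}, took {elapsed:.3f}s")
```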

Although an NTP-synchronized clock is neither monotonic nor continuous, it seems like a good candidate for helping different nodes reach a consensus about time. However, our methods for getting a clock to tell the correct time via NTP are not as reliable as we might hope. There are several issues, for example:

  1. The quartz clock in a computer is not very accurate. Google assumes a drift of 200 ppm for its servers, which is equivalent to 6 ms of drift for a clock that is re-synchronized with a server every 30 seconds (see the arithmetic check after this list).
  2. An application may observe time going backward or suddenly jumping forward due to synchronization.
  3. A misconfigured NTP daemon may go unnoticed for some time.
  4. NTP synchronization can only be as good as the network delay.
  5. Leap seconds.
  6. A server cannot control the NTP configuration on client devices.
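
The arithmetic in item 1 checks out:

```python
drift_rate = 200e-6        # 200 ppm, i.e., 200 microseconds per second
sync_interval_s = 30       # re-synchronize with an NTP server every 30 s
max_drift_s = drift_rate * sync_interval_s
print(f"{max_drift_s * 1e3:.1f} ms")  # -> 6.0 ms between re-syncs
```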

Why Does It Matter?

Is it really important to have accurately synchronized time? Surprisingly, the answer is no. In practice, people use synchronized time for two purposes:

  1. An absolute timestamp to record when events happen. In most cases, a fluctuation of ±10 ms is tolerable.
  2. A timestamp to order events. In this scenario, a logical clock, which provides only a consistent logical ordering, is sufficient.

A synchronized clock could certainly support both cases in one shot. But as we discussed before, it is really difficult to achieve a high-precision synchronized clock in a distributed system. Therefore, most systems use NTP for the timestamp use case and a consensus algorithm for nodes to agree on a logical order, which will be discussed in a future post. The notable exception is Google Spanner, which depends on Google's TrueTime API to synchronize time; to achieve this, Google deploys high-precision atomic clocks (alongside GPS receivers) in every data center.
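
For the ordering use case, a Lamport clock is one classic form of logical clock (my sketch; nothing here prescribes a particular scheme):

```python
class LamportClock:
    """A counter that preserves causality: if event A happens before
    event B, then L(A) < L(B). No wall-clock synchronization needed."""

    def __init__(self):
        self.counter = 0

    def local_event(self) -> int:
        self.counter += 1
        return self.counter

    def send(self) -> int:
        """Tick and return a timestamp to attach to an outgoing message."""
        self.counter += 1
        return self.counter

    def receive(self, msg_timestamp: int) -> int:
        """Merge the sender's timestamp: jump past anything seen so far."""
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t = a.send()     # node A attaches timestamp 1 to its message
b.receive(t)     # node B's counter jumps to max(0, 1) + 1 = 2
```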

What is Next

In the next post, I will share the common characteristics of distributed data systems. Additionally, I will try to categorize some popular distributed databases based on these characteristics.

Reference

Martin Kleppmann, Designing Data-Intensive Applications, O'Reilly Media, 2017
