Distributed Systems and Asynchronous I/O

How different forms of handling I/O affect the performance, availability, and fault-tolerance of network applications.

Dealing with distributed systems is a difficult job. A single system may comprise numerous components developed by different teams over long periods. Human mistakes, such as shipping bugs or incorrectly configured parameters, happen routinely in large systems, no matter how excellent the engineering practices are. Hardware faults also regularly occur in large fleets. On top of that, unpredictable workloads and the sheer number of permutations of possible states and conditions make it virtually impossible to foresee everything that might go wrong.

That’s why it’s essential to limit the blast radius to avoid cascading failures and amplified outages. Also, it is no less critical to strive for simplicity to make systems easier to reason about.

There are multiple aspects and techniques for dealing with distributed systems, and I will cover some of them in the future. While this domain offers great depth to explore, this post focuses on the basics: comparing different forms of handling I/O and how they affect network applications' availability, performance, and fault-tolerance.

Non-blocking I/O is a mature and widely known concept, yet plenty of services and SDKs are still built using blocking I/O. Some engineers assume the difference is just a matter of thread count, and that since blocking/synchronous programs are much easier to write, read, and maintain, dealing with async I/O is not worth the trouble.

I'll try to use simple analogies and examples to show why the difference is more than just the number of threads. I'll also touch on writing asynchronous code, because: "I hear and I forget, I see and I remember, I do and I understand."

I created a modeling tool for visualizing and empirically verifying the points presented below, so you can also play around with different scenarios if you'd like to explore the topic.

Let me start with some basic concepts.

Thread-safety and concurrency in intuitive terms

Balancing thread-safety and concurrency is non-trivial. On the one hand, we want to avoid data races: modifying the same data from more than one thread simultaneously, which may lead to data corruption and inconsistency. On the other hand, we want our applications to run fast and minimize thread contention. Thread-safety and concurrency are somewhat at odds, and if thread-safety is done wrong, it may result in low concurrency.

I find the road traffic analogy very helpful to reason intuitively about thread-safety and concurrency. Let’s say we have tasks, which are vehicles, and threads, which are roads. Then, thread-safety is similar to avoiding collisions, while concurrency is about the road system’s bandwidth.

There are different ways to ensure the absence of collisions, each with its own pros and cons.

Immutable data

If data is never changed, it is safe to share it between threads. This is the best combination of thread-safety and concurrency, but it has limited application. It is like road signs: there is no limit on how many people can simultaneously look at them, but no one can interact with them. Immutable data is also suitable for rarely changing snapshots where consistency requirements are relaxed.
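As a minimal sketch of this idea (the data structure and values here are hypothetical, using only the standard library's `Arc`): once data is built and never mutated, any number of threads can read it concurrently without locks.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // An immutable "road sign": built once, then only ever read.
    let signs = Arc::new(vec!["speed limit 50", "no overtaking"]);

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let signs = Arc::clone(&signs);
            // Any number of threads can read the shared data simultaneously;
            // no locking is needed because nothing is ever modified.
            thread::spawn(move || signs.len())
        })
        .collect();

    for handle in handles {
        assert_eq!(handle.join().unwrap(), 2);
    }
}
```

`Arc` only reference-counts the allocation; since the vector itself is never mutated, the reads need no further synchronization.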

Thread-local data

Each thread has its own isolated replica, which is similar to "each car has its own lane." It is like single-lane roads that never cross: safe but not always useful. This approach is performant but might be expensive and often leads to memory over-subscription.
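A minimal sketch of the "own lane" idea (the counter workload is hypothetical): each thread works on purely local state and results are combined only after the threads finish, so no synchronization is needed along the way.

```rust
use std::thread;

fn main() {
    // Each worker keeps its own counter in its "own lane": no sharing, no locks.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                let mut local_count: usize = 0; // state private to this thread
                for _ in 0..1_000 {
                    local_count += 1; // no synchronization needed
                }
                local_count
            })
        })
        .collect();

    // The lanes merge only at the very end, after all threads are done.
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(total, 4_000);
}
```

The cost noted above shows up here as well: four counters instead of one, which for larger per-thread state (buffers, caches) turns into real memory over-subscription.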

Compare-and-swap/atomic operations

Atomic operations guarantee data consistency without locking. I think of them as a 4-way stop sign: each thread needs to stop and check its surroundings before crossing the intersection. This approach works well for sharing pointers, flags, and counters between threads, but it may hurt performance if misused. You can learn more here: Chapter "8.2.2. Memory ordering."


Mutexes/locks

Similar to traffic lights, they enable only a single thread to access a resource at a given point in time. In case of concurrent access, all other threads wait, sometimes for a freight train to pass. But it might be better to have occasional traffic lights than too many stop signs.
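A minimal sketch of the "traffic light" (the shared vector is a hypothetical stand-in for any resource that cannot be updated atomically): `lock()` makes each thread wait for exclusive access, and the guard releases it when dropped.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // The Mutex is the traffic light: only one thread may hold the lock
    // (enter the intersection) at a time; the rest queue up and wait.
    let shared = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                let mut guard = shared.lock().unwrap(); // wait for a green light
                guard.push(i); // exclusive access while the guard is alive
            }) // guard dropped here: the light turns green for the next thread
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(shared.lock().unwrap().len(), 4);
}
```

The "freight train" scenario is a thread that holds the guard for a long time; every other thread queues behind it, which is why critical sections should stay short.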

So, if your road system (service) has too many traffic lights (mutexes/locks) and stop signs (volatile/atomic/memory barriers), the traffic would move slowly.

For instance, let’s benchmark incrementing a counter shared between multiple threads (using Rust and Criterion):

fn benchmark_batched_increment(
    repetitions: usize,
    mut increment: usize