Worst-Case Distributed Systems Design

03 Feb 2015

Designing distributed systems that handle worst-case scenarios gracefully can—perhaps surprisingly—improve average-case behavior as well.

When designing CAP “AP” (or coordination-free) systems, we typically analyze their behavior under worst-case environmental conditions. We often make statements of the form “in the event of arbitrary loss of network connectivity between servers, every request received by a non-failing server in the system will result in a response.” These claims pertain to behavior under worst-case scenarios (“arbitrary loss of network connectivity”), which is indeed useful (e.g., DBAs don’t get paged when a network partition strikes). However, taken at face value, this language can lead to confusion (e.g., “in my experience, network failures don’t occur, so why bother?”). More importantly, this language tends to obscure a more serious benefit of this kind of distributed systems design.

When we design coordination-free systems that don’t have to communicate in the worst case, we’re designing systems that don’t have to communicate in the average case, either. If my servers don’t have to communicate to guarantee a response in the event of network partitions, they also don’t have to pay the cost of a round-trip within a datacenter or, even better, across geographically distant datacenters. I can add more servers and make use of them—without placing additional load on my existing cluster! I’m able to achieve effectively indefinite scale-out—simply because my servers never have to communicate on the fast-path.

This notion of worst-case-improves-average-case is particularly interesting because designing for the worst case doesn’t always work out so nicely. For example, when I bike to my lab, I put on a helmet to guard against collisions, knowing that my helmet will help in some but not all situations. But my helmet is no real match for a true worst case—say, a large meteor, or maybe just an eighteen-wheeler. To adequately defend myself against an eighteen-wheeler, I’d need more serious protection that’d undoubtedly encumber my bicycling. By handling the worst case, I lose in the average case. In fact, this pattern of worst-case-degrades-average-case is common, particularly in the real world: consider automotive design, building architecture, and processor design (e.g., for thermal, voltage, and process variations).1 Often, there’s a pragmatic trade-off between how much we’re willing to pay to handle extreme conditions and how much we’re willing to pay in terms of average-case performance.

So why do distributed systems exhibit this pattern? One possibility is that networks are unpredictable and, in the field, pretty terrible to work with. Despite exciting advances in networking research, we still don’t have reliable SLAs from our networks. A line of research in distributed computing has asked what we could do if we had better-behaved networks (e.g., with bounded delay)—but we (still) don’t yet.2 Given the inability to (easily and practically) distinguish between message delays, link failures, and (both permanent and transient) server failures, we do well by assuming the worst. Essentially, the defining feature of our distributed systems—the network—encourages and rewards us to minimize our reliance on it.

Over time, I’ve gained a greater appreciation for the subtle power of this worst-case thinking. It’s often instrumental in determining the fundamental overheads of a given design rather than superficial (albeit important) differences in implementations or engineering quality.3 It’s a clean (and often elegant) way to reason about system behavior and is a useful tool for systems architects. We’d do well by paying more attention to this pattern while fixating less on failures. Although formulations such as Abadi’s PACELC are a step in the right direction, the connections between latency, availability and scale-out performance deserve more attention.


  1. There’s an interesting related challenge (design theme/meme?) regarding how to design systems that behave sanely in the worst case but take advantage of the gap between average-case and worst-case conditions (when it exists). I used to hang out with computer architects, and there’s some cool work on this topic in their community (e.g., “Better than Worst-Case design”). One of my favorite papers is called DIVA and employs a dirt-simple, in-order co-processor to verify the computations of a less reliable primary processor, which employs all sorts of microarchitectural bells and whistles to speed up the primary computation. DIVA only “pays” for errors (worst-case, due to, say, noise, process variation, bugs in the processor design) when they occur (and are subsequently caught by the checker, which is much easier to verify and build correctly). (Fun fact, from Footnote 1: “This [single-author paper] was performed while the author was (un)employed as an independent consultant in late May and June of 1999.”) Some of my favorite algorithms (selfishly, e.g., RAMP) have this flavor, where we’re only penalized when a bad thing occurs (in the case of RAMP-Fast, readers only pay—using a second RTT with servers to repair missing values—when there is actually a race).

  2. Lest I come off as grumpy: I am actually very excited about this possibility, but it’s a challenging chicken-and-egg problem. Which comes first: better hardware, or applications that could make use of the hardware if the hardware existed? I think we can go either way, but I’ll note that (in the latter scenario), there are troves of brilliant ideas in the distributed computing and database literature that are lying in wait for better hardware. As one of my favorite examples, Barbara Liskov’s PODC 1991 keynote on “Practical Uses of Synchronized Clocks in Distributed Systems” has a bunch of great ideas (and associated papers). Now that Barbara’s former student has managed to convince Google to add atomic clocks to their datacenters, these algorithms can be efficiently implemented at scale. I think academics have a serious role to play in this conversation, and this type of co-design is on my research agenda. But there’s probably lower-hanging fruit for practitioners today.

  3. Three scenarios where this kind of analysis has useful implications:

    a.) Whenever someone claims their coordinated (e.g., serializable) system achieves unilateral performance or availability with a coordination-free design, we know they’re either misguided, are not exercising an interesting workload, and/or are simply being misleading. In these cases, we know that, regardless of how the system is implemented, because they’re choosing to implement a fundamentally “hard” problem (e.g., serializability, atomic commitment, or maybe just a consensus register) that requires coordination to prevent some “bad” execution, there’s going to be a cost associated. This is effectively the converse of our thesis: if you have to pay in the worst case, you (may) have to pay in the average case. Moreover (and perhaps most importantly), we don’t have to point fingers and argue about someone’s code or implementation—in many cases, there is literally no way to implement a better system, and we can prove that there’s no use in trying to do so. (Of course, there’s plenty of fun [and money] in building fast databases, but the point is that we can separate out what’s fundamentally new—which also happens to be a key criterion for good research—and what’s simply built better.)

    b.) “Scale out” versus “scale up” systems design is a hot topic—should we build (often more sophisticated) single-node systems or build simpler systems that are geared toward horizontal scalability? These two are often opposed, and proponents of each strategy will argue that the other is misguided for some particular workload or system implementation. Yet, couched in the terms of worst-case systems design, advances in single-node performance simply serve to increase the capacity of each node within a cluster. Making each node in a coordination-free system two times faster via “scale up” techniques makes the whole cluster two times faster. If the conversation is typically “scale out” versus “scale up,” if we’re coordination-free, we get to choose “scale out” while “scaling up.” This may seem obvious, but consider the fact that speeding up single-node performance does not always, for example, improve the throughput of distributed transaction processing if (ha!) you are bottlenecked on the network.

    c.) As hardware changes, the constants associated with communication costs may change—thus affecting the relative benefit of our worst-case designs—but the fundamental gains likely won’t. For example, for decades we’ve been running shared-everything architectures on single-node servers. However, there’s mounting evidence that a shared-nothing approach is desirable in a multi- and (especially) many-core setting. Andy Pavlo and friends recently did a study of OLTP scalability to 1000 cores (in simulation, with some folks at MIT). They found that—regardless of implementation—serializability tanks in performance (due to a variety of factors, depending on the algorithm, but—I’d argue—ultimately due to the fundamental costs of providing serializability). In the distributed setting, today’s datacenter networks make serializable transaction performance an easy target to beat—yet, even as network performance improves and performance bottlenecks lessen in severity, we’ll still see benefits. (An argument I like to make is that, despite DRAM accesses being ridiculously fast—at least relative to remote memory accesses or RPCs—we still spend a lot of time devising cache-friendly algorithms—albeit not for every application, but for many that matter.)

You can follow me on Twitter here.