Doing Redundant Work to Speed Up Distributed Queries

20 Sep 2012

tl;dr: In distributed data stores, redundant operations can dramatically drop tail latency at the expense of increased system load; different Dynamo-style stores handle this trade-off differently, and there’s room for improvement.

Update 10/2013: Cassandra has since added support for “speculative retry”: effectively, Dean’s suggestions applied to Dynamo reads, as described below.

Update 9/2014: Akka 2.3.5 introduced support for this kind of speculative retry.

At scale, tail latencies matter. When serving high volumes of traffic, even a minuscule fraction of requests corresponds to a large number of operations. Latency has a huge impact on service quality, and looking at the average service latency alone is often insufficient. Instead, folks running high-performance systems at places like Amazon and Google look to the tail when measuring their performance.[1] High variance often hides in distribution tails: in a talk at Berkeley last spring, Jeff Dean reported a 95th percentile latency of 24ms and a 99.9th percentile latency of 994ms in Google’s BigTable service—more than a 40x difference!

In distributed systems, there’s a subtle and somewhat underappreciated strategy for reducing tail latencies: doing redundant work. If you send the same request to multiple servers, you’re going to get an answer back faster (all else equal) than if you wait on a single server. Waiting for, say, one of three servers to reply is often faster than waiting for one of one. The basic cause is variance in modern service components: requests take different amounts of time in the network and on different servers at different times.[2] In Dean’s experiments, BigTable’s 99.9th percentile latency dropped to 50ms when he sent out a second, redundant request if the initial request hadn’t come back within 10ms—roughly a 20x improvement. While there’s a cost associated with redundant work—increased service load—the load increase may be modest. In the example I’ve mentioned, Dean recorded only a 5% increase in the total number of requests.[3]
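To make the mechanism concrete, here’s a minimal sketch (in Python, standard library only) of a client that issues the same idempotent read to several replicas and returns whichever response arrives first. The REPLICAS list and read_from_replica function are hypothetical placeholders, not any particular store’s API.

```python
# Sketch: send the same idempotent read to several replicas and take the
# first response that arrives. REPLICAS and read_from_replica are
# hypothetical placeholders for a real data store's client.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

REPLICAS = ["replica-1", "replica-2", "replica-3"]
pool = ThreadPoolExecutor(max_workers=len(REPLICAS))

def read_from_replica(replica, key):
    """Placeholder for an RPC that reads `key` from one replica."""
    ...

def redundant_read(key, num_requests=2):
    """Issue the read to `num_requests` replicas; return the fastest reply."""
    futures = [pool.submit(read_from_replica, replica, key)
               for replica in REPLICAS[:num_requests]]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for f in pending:
        f.cancel()  # best-effort: only drops requests that haven't started yet
    return next(iter(done)).result()
```

The knob here is num_requests: more redundancy trims the tail but multiplies read load, which is exactly the trade-off the rest of this post is about.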

Learning about Google’s systems is instructive, but we can also observe the trade-off between tail latency and load in several publicly available distributed data stores patterned on Amazon’s influential Dynamo data store. In the original paper, Dynamo sends a client’s read and write requests to all replicas for a given key. For writes, the system needs to update all replicas anyway. For reads, requests are idempotent, so the system doesn’t necessarily need to contact all replicas—but should it? Sending read requests to all replicas results in a linear increase in load compared to sending to the minimum required number of replicas.[4] For read-dominated workloads (like many internet applications), this optimization has a cost. When is it worthwhile?

Open-source Dynamo-style stores have different answers. Apache Cassandra originally sent reads to all replicas, but CASSANDRA-930 and CASSANDRA-982 changed this: one commenter argued that “in IO overloaded situations” it was better to send read requests only to the minimum number of replicas. By default, Cassandra now sends reads to the minimum number of replicas 90% of the time and to all replicas 10% of the time, primarily for consistency purposes.[5] (Surprisingly, the relevant JIRA issues don’t even mention the latency impact.) LinkedIn’s Voldemort also uses a send-to-minimum strategy (and has evidently done so since it was open-sourced). In contrast, Basho Riak chooses the “true” Dynamo-style send-to-all read policy.

Who’s right? What do these choices mean for a real NoSQL deployment? We can do a back-of-the-envelope analysis pretty easily. For one of our recent papers on latency-consistency trade-offs in Dynamo-style systems, we obtained latency data from Yammer’s Riak clusters. If we run some simple Monte Carlo analysis (script available here), we see that—perhaps unsurprisingly—redundant work can have a big effect on latencies. For example, at the 99.9th percentile, sending a read request to two servers and waiting for the first response is roughly 16x faster than sending to a single server—maybe worth the 2x load increase. Sending reads to three servers and waiting for one is 30x faster. Pretty good!

                                     Requests sent
Responses waited for        1        2        3        4        5
                   1    170.0     10.7      5.6      4.8      4.5
                   2              200.6     33.9      6.5      5.3
                   3                        218.2     50.0      7.5
                   4                                 231.1     59.8
                   5                                          242.2

99.9th percentile read latencies (in ms) for the Yammer Dynamo-style latency model.
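The analysis script itself isn’t reproduced here, but the shape of the Monte Carlo computation is simple: for each trial, draw one latency sample per request sent, record the time at which the r-th response arrives (the r-th smallest sample), and then look at the tail of those times. Here’s a rough sketch; sample_latency is a hypothetical stand-in for drawing from the measured Yammer distribution (the lognormal below is only an arbitrary example).

```python
# Monte Carlo sketch: estimate the tail latency of sending `requests_sent`
# copies of a read and waiting for `responses_needed` of them to return.
# `sample_latency` is a hypothetical stand-in for the empirical Yammer
# distribution; the lognormal here is only an arbitrary example.
import random

def sample_latency():
    return random.lognormvariate(1.0, 1.5)  # replace with empirical samples

def percentile(values, p):
    ordered = sorted(values)
    return ordered[int(p * (len(ordered) - 1))]

def tail_latency(requests_sent, responses_needed, trials=100_000, p=0.999):
    observed = []
    for _ in range(trials):
        latencies = sorted(sample_latency() for _ in range(requests_sent))
        observed.append(latencies[responses_needed - 1])  # r-th response arrives
    return percentile(observed, p)

# e.g., 99.9th percentile when sending 3 requests and waiting for the first:
# tail_latency(requests_sent=3, responses_needed=1)
```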

The numbers above assume you send the redundant requests at the same time, but this need not be the case. For example, sending a second request only if the first hasn’t come back within 8ms results in a modest 4.2% increase in requests sent and a 99.9th percentile read latency of 11.0ms. This works because of the long tail of the latency distributions we see in Yammer’s clusters—we only have to speed up a small fraction of queries to improve overall performance.
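In simulation, the deferred variant is a small tweak to the sketch above: the second request only exists (and only costs anything) when the first response is slower than the delay. A rough sketch, reusing the hypothetical sample_latency and percentile helpers from the previous block:

```python
# Deferred (speculative) variant: send one request, and send a second only if
# the first hasn't returned within `delay_ms`. Reuses the hypothetical
# `sample_latency` and `percentile` helpers from the sketch above.
def speculative_read_latency(delay_ms=8.0):
    first = sample_latency()
    if first <= delay_ms:
        return first, 1                   # one request was enough
    second = delay_ms + sample_latency()  # second request starts at t = delay
    return min(first, second), 2          # one extra request sent

def speculative_tail(delay_ms=8.0, trials=100_000, p=0.999):
    observed, extra_requests = [], 0
    for _ in range(trials):
        latency, sent = speculative_read_latency(delay_ms)
        observed.append(latency)
        extra_requests += sent - 1
    # (99.9th percentile latency, fraction of reads that sent an extra request)
    return percentile(observed, p), extra_requests / trials
```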

To preempt any dissatisfaction, I’ll admit that this analysis is simplistic. First, I’m not considering the increased load on each server due to sending multiple requests. The increased load may in turn increase latencies, which would decrease the benefits we see here; how much depends on the system and workload. Second, I’m assuming that each request’s latency is independently and identically distributed: every server behaves the same (according to the Yammer latency distribution we have). This models a system of equally loaded, equally powerful servers, but this too may differ in practice.[6] Third, with a different latency distribution, the numbers will change. Real-world benchmarking is the best source of truth, but this analysis is a starting point, and you can easily play with distributions of your own, either with the provided script or in your browser using an older demo I built.

Ultimately, the balance between redundant work and tail latency depends on the application. However, in latency-sensitive environments (particularly when there are serial dependencies between requests) this redundant work has a massive impact. And we’ve only begun: we don’t need to send to all replicas to see benefits—even one extra request can help—while delay and cancellation mechanisms like the ones that Jeff Dean hints at can further reduce load penalties. There’s a large amount of hard work and research to be done designing, implementing, and battle-testing these strategies, but I suspect that these kinds of techniques will have a substantial impact on future large-scale distributed data systems.

Thanks to Shivaram Venkataraman, Ali Ghodsi, Kay Ousterhout, Patrick Wendell, Sean Cribbs, and Andy Gross for assistance and feedback that contributed to this post.


Footnotes
[1] There are many studies of the importance of latency for different services. For example, an often-cited statistic is that an additional 100ms of latency cost Amazon 1% of sales. In the systems community, David Anderson attributes this sensitivity to tail latencies to both the original Dynamo paper and Werner Vogels's subsequent evangelizing (buried in the comments here).
[2] There's a large body of theoretical research on "the power of two choices" that's related to this phenomenon: if you place each request on the less loaded of two randomly chosen servers instead of on a single random server, you can exponentially improve a cluster's load balance. Theoreticians might quibble with this analogy: after all, here, we're usually still sending requests to both of the servers, whereas the original power-of-two-choices work sends each request only to the less loaded one. Still, this research is interesting to consider as a precursor to many of the techniques here and as the start of a more rigorous understanding of these trade-offs.
[3] Dean also describes the effect of different delay and cancellation mechanisms. Cancellation seems tricky to get right, and its difficulty depends on the application: intercepting and canceling a lightweight read request is harder (i.e., needs to be faster) than, say, canceling a slower, more complex query. Recent work from some of my colleagues at UC Berkeley demonstrates alternative algorithms for limiting the overhead of redundant work in general-purpose cluster computing frameworks like Hadoop.
[4] Determining the minimum number of replicas to read from depends on the desired consistency of the operation. This is a complicated subject and is related to (but too involved for) the main discussion here. In short, if we denote the total number of replicas as N, the number of replicas a read blocks for as R, and the number a write blocks for as W, then R + W > N means you'll read your writes (each key acts as a regular register), while anything less gives you weak guarantees. (A small worked check of this condition appears after these footnotes.) For more info, check out some research we recently did on these latency-consistency trade-offs.
[5] For readers into hardcore Cassandra internals: currently, Cassandra fetches the data from one server and requests "digests," or value hashes, from the remaining (R-1) servers. In the case that the digests don't match, Cassandra will perform another round of requests for the actual values. Now, if a mechanism called read repair is enabled, Cassandra will, with a configurable probability, send digest requests to all replicas (as opposed to just the R-1). In the original patch, the default probability of sending to all was 100% (all the time); it is now 10% (due—somewhat cryptically—to CASSANDRA-3169). (Sources: original patch here; latest code here; filtering of endpoints done here) This digest-based scheme is a substantial deviation from the Dynamo design and exposes a number of other interesting trade-offs that probably deserve further examination. (A toy sketch of the digest-read flow also appears after these footnotes.)
[6] For example, Cassandra's "endpoint snitches" keep track of which nodes are "closest" according to several different, configurable dimensions, including historical latencies. Depending on the configuration, if a given node is slow or overloaded, Cassandra may choose not to read from it. I haven't seen a performance analysis of this strategy, but, at first glance, it seems reasonable.
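As a quick sanity check of footnote 4's R + W > N condition, here is a toy enumeration (not tied to any particular store): whenever R + W > N, every read quorum must overlap every write quorum, so a read always contacts at least one replica that holds the latest write.

```python
# Toy check of the quorum condition from footnote 4: with N replicas, any
# read quorum of size R and write quorum of size W must intersect whenever
# R + W > N, so a read contacts at least one replica with the latest write.
from itertools import combinations

def quorums_always_intersect(n, r, w):
    replicas = range(n)
    return all(set(read_q) & set(write_q)
               for read_q in combinations(replicas, r)
               for write_q in combinations(replicas, w))

assert quorums_always_intersect(n=3, r=2, w=2)      # R + W = 4 > 3: always overlap
assert not quorums_always_intersect(n=3, r=1, w=2)  # R + W = 3: a read can miss the write
```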
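And for footnote 5, here is a toy sketch of the digest-read flow described there: fetch the full value from one replica and hashes from the others, paying for a second, full round only if the hashes disagree. The fetch_value, fetch_digest, and resolve helpers are hypothetical placeholders; this is not Cassandra's actual code.

```python
# Toy sketch of the digest-read flow from footnote 5 (hypothetical helpers,
# not Cassandra's actual code): one replica returns the full value, the rest
# return hashes; any mismatched hash triggers a second round of full reads.
import hashlib

def digest(value: bytes) -> str:
    return hashlib.md5(value).hexdigest()

def digest_read(key, replicas, fetch_value, fetch_digest, resolve):
    data_replica, *digest_replicas = replicas
    value = fetch_value(data_replica, key)
    if all(fetch_digest(r, key) == digest(value) for r in digest_replicas):
        return value  # fast path: one full value plus R-1 cheap hashes
    # Slow path: replicas disagree, so fetch full values and reconcile them
    # (e.g., by timestamp) via the caller-supplied `resolve` function.
    values = [value] + [fetch_value(r, key) for r in digest_replicas]
    return resolve(values)
```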
You can follow me on Twitter here.