07 Apr 2014
We recently wrote a paper that will appear at SIGMOD called Scalable Atomic Visibility with RAMP Transactions. This post introduces RAMP Transactions, explains why you should care, and briefly describes how they work.
Executive Summary (a.k.a. read this if nothing else)
What they are: We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are.
Why they’re useful: In addition to general-purpose multi-key updates, atomic visibility is required for correctly maintaining foreign key constraints, secondary indexes, and materialized views.
Why they’re needed: Existing protocols like locking couple mutual exclusion and atomic visibility. Mutual exclusion in a distributed environment can lead to serious performance degradation and availability problems.
How they work: RAMP transactions allow readers and writers to proceed concurrently. Operations race, but readers autonomously detect the races and repair any non-atomic reads. The write protocol ensures readers never stall waiting for writes to arrive.
Why they scale: Clients can’t cause other clients to stall (via synchronization independence) and clients only have to contact the servers responsible for items in their transactions (via partition independence). As a consequence, there’s no mutual exclusion or synchronous coordination across servers.
The end result: RAMP transactions outperform existing approaches across a variety of workloads, and, for a workload of 95% reads, RAMP transactions scale to over 7 million ops/second on 100 servers at less than 5% overhead.
Where the overhead goes: Writes take 2 RTTs and attach either a constant (in RAMP-Small and RAMP-Hybrid algorithms) or linear (in RAMP-Fast) amount of metadata to each write, while reads take 1-2 RTTs depending on the algorithm.
Together with colleagues at Berkeley and the University of Sydney, I’ve spent the last several years exploring the limits of coordination-free, “AP” execution in distributed databases. Coordination-free execution is powerful: it guarantees a response from any non-failing server, provides low latency, and allows scale out, even at the granularity of a single database record. Of course, coordination-free execution permits races between concurrent operations, so the question is: what useful guarantees are achievable under coordination-free execution?
We recently spent time studying traditional database isolation levels from the perspective of coordination-free execution. In this work, we realized that a classic property from database systems was achievable without coordination—even though almost all databases implement it using expensive mechanisms! That is, we realized that we could provide the atomic visibility property—either all of a transaction’s updates are visible to another transaction or none are—without resorting to typical strategies like locking. We introduced an early version of our algorithms in a paragraph of our VLDB 2014 paper on Highly Available Transactions but felt there was more work to be done. The positive reaction to my post on our prior algorithm further inspired us to write a full paper.
Why Atomic Visibility?
It turns out that many use cases require atomic visibility: either all or none of a transaction’s updates should be visible to other transactions. For example, if I update a table in my database, any secondary indexes associated with that table should also be updated. In the paper, we outline three use cases in detail: foreign key constraints, secondary indexing, and materialized view maintenance. These use cases crop up in a surprising number of real-world applications. For example, the authors of Facebook’s Tao, LinkedIn Espresso, and Yahoo! PNUTS each describe how users can observe inconsistent data due to fast but non-atomic updates to multiple entities and secondary indexes in their systems. These systems have chosen scalability over correctness. We wanted to know: can we provide these use cases with both scalability and correctness?
If you pick up a database textbook today and try to figure out how to implement atomic visibility, you’re likely to end up with a solution like two-phase locking or some version of optimistic concurrency control. This works great on a single machine, but, as soon as your operations span multiple servers, you’re likely to run into problems: if an operation holds a lock while waiting on an RPC, any concurrent lock requests are going to have to wait. Throughput effectively drops to 1/(Round Trip Time): a few thousand requests per second on a modern cloud computing network like EC2. To allow partitioned (sharded) operations to perform well, we’ll have to avoid any kind of blocking, or synchronous coordination.
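To make the 1/RTT ceiling concrete, here’s the back-of-the-envelope arithmetic as a tiny sketch. The 0.5 ms round trip time is an illustrative assumption, not a measurement from the paper:

```python
# If a lock on a contended item is held for the duration of one RPC,
# that item can be locked at most ~1/RTT times per second, regardless
# of how many servers you add.
rtt_seconds = 0.0005                 # hypothetical ~0.5 ms cloud-network RTT
max_ops_per_item = 1 / rtt_seconds   # roughly 2,000 lock acquisitions/second
```

Halving the RTT only doubles this ceiling; avoiding the synchronous coordination removes it entirely, which is the point of the RAMP design.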
Our thesis in this work is that traditional approaches like locking (often unnecessarily) couple atomic visibility and mutual exclusion. We can offer the benefits of the former without the penalties of the latter.
RAMP Internals, In Brief
In RAMP transactions, we allow reads and writes to proceed concurrently over the same data. This provides excellent scalability, but it introduces a race condition: what if a transaction only observes a subset of another, concurrent transaction’s writes? We address this race condition via two mechanisms (for now, assume read-only and write-only transactions):
Use metadata to detect the race: Clients attach metadata to each write that allows other clients to detect partial reads from in-flight writes. For example, in the RAMP-Fast algorithm, writers attach a unique, per-transaction timestamp and a list of items written in the transaction. Clients can combine all of the metadata from versions they have actually read to determine the highest timestamp that they should have read for each item. If any version they read has a lower timestamp than the one calculated for its item, the corresponding write must have been in-flight at the time of the read. The client can subsequently fetch the right version(s) from the server(s) in a second (parallel) set of requests and return the resulting set of versions.
Prevent readers from waiting: To prevent readers from ever having to wait for the “right version” to arrive at a partition, writes proceed in two rounds. The first round places each write on its respective partition but doesn’t yet make it visible to readers (i.e., prepares the write). The second round makes each write visible to readers (i.e., commits the write). This means that, if a reader observes a write from a transaction, all of the other writes the reader might request are guaranteed to be present on their respective servers. The reader simply has to identify the right version (using metadata above) and the server can return it.
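The two-round write protocol above can be sketched roughly as follows. This is a minimal single-process sketch with hypothetical names (`Partition`, `write_transaction`); a real implementation would issue each round as parallel RPCs and handle failures:

```python
class Partition:
    """One shard. Prepared writes are staged but invisible to readers."""
    def __init__(self):
        self.prepared = {}   # ts -> (item, value, write_set)
        self.visible = {}    # item -> (value, ts, write_set)

    def prepare(self, item, value, ts, write_set):
        # Round 1: durably stage the write; readers can't see it yet.
        self.prepared[ts] = (item, value, write_set)

    def commit(self, ts):
        # Round 2: make the staged write visible to readers.
        item, value, write_set = self.prepared.pop(ts)
        if item not in self.visible or self.visible[item][1] < ts:
            self.visible[item] = (value, ts, write_set)

def write_transaction(partitions, writes, ts):
    """partitions: item -> Partition; writes: item -> value."""
    write_set = frozenset(writes)
    # Round 1: prepare on every partition (in parallel in practice).
    for item, value in writes.items():
        partitions[item].prepare(item, value, ts, write_set)
    # Round 2: commit. Because commits only start after *all* prepares
    # finish, a reader that observes any committed write is guaranteed
    # that sibling writes are at least prepared on their partitions.
    for item in writes:
        partitions[item].commit(ts)
```

The key invariant is in the comment on round 2: no reader ever has to wait for a missing sibling write to arrive, because by the time any write is visible, its siblings are already present.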
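The first mechanism, metadata-based race detection, can be sketched as follows. Names here are hypothetical; each `Version` carries RAMP-Fast-style metadata (a per-transaction timestamp and the transaction’s write set):

```python
from collections import namedtuple

# A committed version of an item, with RAMP-Fast-style metadata:
# the writing transaction's timestamp and its full write set.
Version = namedtuple("Version", ["item", "value", "ts", "items"])

def detect_missing(versions):
    """First-round read check: given the latest visible version read for
    each item, compute the highest timestamp each item *should* have been
    read at, and return (item, required_ts) pairs needing a second-round
    fetch by timestamp."""
    read_items = {v.item for v in versions}
    required = {}
    for v in versions:
        # v's transaction also wrote every item in v.items at v.ts.
        for sibling in v.items:
            if sibling in read_items:
                required[sibling] = max(required.get(sibling, 0), v.ts)
    # A version older than required means its sibling write was in flight
    # at read time; fetch the matching version in a second parallel round.
    return [(v.item, required[v.item])
            for v in versions if v.ts < required.get(v.item, 0)]
```

For example, if transaction T1 wrote x and y at timestamp 2 but a reader observed x at timestamp 2 and y at timestamp 1, x’s metadata reveals that y should have been read at timestamp 2, triggering a second-round fetch for y.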
The three RAMP algorithms we develop in the paper offer a trade-off between metadata size and round trip times (RTTs) required for reads. RAMP-Fast requires 1 RTT in the race-free case and 2 RTTs in the event of a race but stores the list of keys written in the transaction as metadata. RAMP-Small always requires 2 RTTs for reads but only uses a single timestamp as metadata. RAMP-Hybrid uses a Bloom filter to compress the RAMP-Fast metadata and requires between 1-2 RTTs per read (determined by the Bloom filter false positive rate). Writes always require 2 RTTs.
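RAMP-Hybrid’s compressed metadata can be illustrated with a toy Bloom filter over the write set (parameters and names below are illustrative, not taken from the paper):

```python
import hashlib

class WriteSetBloom:
    """Toy Bloom filter standing in for RAMP-Hybrid's compressed
    write-set metadata. m bits, k hash functions (illustrative sizes)."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # No false negatives: items actually written always match.
        # False positives merely cost an extra (harmless) read RTT.
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

The asymmetry in the comments is what makes the compression safe: a false positive only triggers an unnecessary second-round read, while the absence of false negatives preserves atomic visibility.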
As we show in the paper, RAMP-Fast and RAMP-Hybrid perform especially well for read-dominated workloads; there’s no overhead on reads that don’t race, and, for reads that do race writes, reads take an extra RTT. The worst-case throughput penalty is about 50% due to extra RTTs (either by using RAMP-Small or executing write-heavy workloads), and we observed a linear relationship between write proportion and throughput penalty. None of the existing algorithms we benchmarked (and describe in the paper) performed as well.
Scalability, For Reals
The term “scalability” is badly abused these days. (It’s almost as meaningless as the use of the term “ACID” in modern databases.) When devising the RAMP transaction protocols, we wanted a more rigorous set of criteria to capture our goals for “scalability” in a partitioned database. We decided on the following:
Synchronization independence prevents one client’s operations from causing another’s to stall or fail. This means locks are out of the question.
Partition independence means that each client only has to contact partitions responsible for items it wants to access.
We can (and, in the paper, do) prove that RAMP transactions satisfy these two criteria. Moreover, when we actually implemented and experimentally compared our algorithms to existing algorithms that failed one or both of these properties, the effects were measurable: algorithms that lacked partition independence required additional communication (in pathological cases, a lot more), while algorithms that lacked synchronization independence simply didn’t work well under high read/write contention. You can see the gory details in Section 5 and read more about these properties in Section 3 of the paper.
Note that, by providing the above criteria, we’re disallowing some useful semantics. For example, synchronization independence effectively prevents writers from stalling other writers. So, if you might have update conflicts that you need to resolve up-front, you shouldn’t use these algorithms and you should expect to incur throughput penalties due to coordination.
For Distributed Systems Nerds
If you’ve made it this far, you’re probably a distributed systems nerd (welcome!). As a distributed systems researcher, there are two aspects of our approach that I think are particularly interesting:
First, we use an atomic commitment protocol (ACP) for writes. However, every ACP can (provably) block during failures. (Remember, AC is harder than consensus!) Our observation is that ACPs aren’t harmful on their own; ACPs get a bad rap because they’re usually used along with blocking concurrency control primitives like locking! By employing an ACP with non-blocking concurrency control and semantics, we allow individual ACP rounds to stall without blocking non-failed clients. We talk about (and evaluate the overhead of) unblocking blocked ACP rounds in the paper.
Second, in the paper, we use “AP” (i.e., coordination-free) algorithms in a “CP” environment. That is, we can execute RAMP transactions in an available active-active (multi-master) environment, but we instead chose to implement them in a single-master-per-partition system. Master-based systems can also benefit from coordination-free algorithms; it’s as if each concurrent operation gets to execute on its own (logical) replica, plus we can make strong guarantees on data recency. For example, after a write completes, all later transactions are guaranteed to observe its effects. This probably deserves another post, but I see the application of coordination-free algorithms to otherwise linearizable systems as an exciting area for further research.
RAMP Transactions provide a scalable alternative to locking and other coordination-intensive solutions to atomic visibility. We’ve found them very useful (e.g., our recent 25x improvement over the prior best result on the TPC-C New-Order benchmark was implemented using a variant of RAMP-Fast), and there’s a principled reason for their performance: synchronization and partition independence. Please check out our SIGMOD paper for more details, and many thanks for the encouraging feedback so far. Happy scaling!