Understanding Weak Isolation Is a Serious Problem

16 Sep 2014

Modern transactional databases overwhelmingly don’t operate under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, Snapshot Isolation. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students wrote an entire dissertation on the topic, and, even then, the definitions we have still aren’t perfect and can vary between databases.

The core problem: a monstrous abstraction in (most) every database

Despite the ubiquity of weak isolation, I haven’t found a database architect, researcher, or user who’s been able to offer an explanation of when, and, probably more importantly, why isolation models such as Read Committed are sufficient for correct execution. It’s reasonably well known that these weak isolation models represent “ACID in practice,” but I don’t think we have any real understanding of how so many applications are seemingly (!?) okay running under them. (If you haven’t seen these models before, they’re a little weird. For example, Read Committed isolation generally prevents users from reading uncommitted or non-final writes but allows a number of bad things to happen, like lost updates during concurrent read-modify-write operations. Why is this apparently okay for many applications?) In the research community and in our classrooms, we’ve historically contented ourselves with studying serializability and, to a certain extent, Snapshot Isolation. Understanding weak isolation as deployed in real-world databases is a wide open problem with serious consequences.1
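
To make that read-modify-write race concrete, here's a minimal sketch of a lost update under Read Committed, in Python with psycopg2 against a hypothetical local Postgres (the DSN and the accounts table are illustrative, not drawn from any particular application):

```python
# Hypothetical lost-update demonstration under Read Committed.
# Assumes a local Postgres with a table accounts(id int, balance int),
# seeded with (1, 100). Read Committed is Postgres's default level.
import psycopg2

DSN = "dbname=test"  # illustrative connection string

t1 = psycopg2.connect(DSN)
t2 = psycopg2.connect(DSN)
c1, c2 = t1.cursor(), t2.cursor()

# Both transactions read the same committed balance (100)...
c1.execute("SELECT balance FROM accounts WHERE id = 1")
c2.execute("SELECT balance FROM accounts WHERE id = 1")
b1, b2 = c1.fetchone()[0], c2.fetchone()[0]

# ...and each writes back its own application-level increment.
c1.execute("UPDATE accounts SET balance = %s WHERE id = 1", (b1 + 10,))
t1.commit()  # commits balance = 110
c2.execute("UPDATE accounts SET balance = %s WHERE id = 1", (b2 + 20,))
t2.commit()  # commits balance = 120: T1's deposit of 10 is silently lost

# A serializable database would abort or reorder one of these transactions;
# under Read Committed, both commit and the final balance is 120, not 130.
```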

To put this problem in perspective, there’s a flood of interesting new research that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional adherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.2

Why might weak isolation actually work, and how did we get into this mess?

I can offer a few guesses as to why weak isolation works:

  1. Non-serializable isolation “anomalies” are really just different kinds of race conditions. To have a race, you need concurrency. For low-traffic and low-contention applications, it’s possible that anomalies don’t arise (e.g., the read-modify-write race from above might not affect applications because there simply aren’t concurrent transactions).
  2. When anomalies do arise, it’s possible that the read-write anomalies don’t translate into application data corruption. Just because a read/write race occurs doesn’t mean all outcomes are necessarily wrong (e.g., two writers might perform the exact same read-modify-write operations with the same outcome regardless of order; see the sketch after this list).
  3. It’s possible data is actually (occasionally) corrupted, and apps just don’t care. Or, when they do, they manually issue the customer an IOU and/or send them an overdraft notice.
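
To make guess 2 concrete, here's a toy Python sketch (a hand-replayed interleaving, no database involved) of the same write-write race ending benignly or badly depending on what the application actually writes:

```python
# Toy replay of two transactions racing on one record: both read the same
# initial state, then write in turn, with the second write overwriting the
# first. No database involved; this just simulates the interleaving.

def interleave(txn1, txn2, initial):
    """Each transaction computes its write from the same (stale) read."""
    v1 = txn1(initial)  # T1's write, computed from the initial state
    v2 = txn2(initial)  # T2's write, computed from the same stale state
    return v2           # T2's write lands last and overwrites T1's

# Harmful race: an increment computed from a stale read gets lost.
print(interleave(lambda bal: bal + 10, lambda bal: bal + 20, 100))
# -> 120, not the 130 a serial execution would produce

# Benign race: both writers set the same final state, so order is irrelevant.
print(interleave(lambda st: "shipped", lambda st: "shipped", "pending"))
# -> "shipped" either way: the anomaly occurred, but nothing broke
```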

However, these are still conjectures. With a few exceptions,3 we’re in uncharted territory.4

Why haven’t we already solved this problem? One possibility: as a community tradition, database architects have prided themselves on top-down design of systems that fulfill beautiful interfaces. For example, the relational database revolution in the 1970s was fomented by the introduction of an amazing abstraction: relational algebra. Serializability is a similarly powerful abstraction that vastly simplifies end-user programming. In contrast, weak isolation is grotesque. Its development has been overwhelmingly mechanism-driven rather than policy- or application-driven. Do you know how Jim Gray et al. invented Read Committed back in 1975? Because realizing serializability via two-phase locking was expensive, the System R gang changed their long read locks to short read locks. No joke. (Aside: look at that typesetting!) What about Snapshot Isolation? In the early 1990s, companies like Interbase and Microsoft started shipping databases with a fancy new multi-version concurrency control mechanism. It took another paper by Gray and friends to define what these systems had actually implemented.
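
For intuition about what that lock-duration change trades away, here's a toy sketch of my own (nothing like System R's actual code) contrasting a long read lock held from read through write with a short read lock released immediately after the read:

```python
# Toy lock-manager illustration of long vs. short read locks. Holding the
# lock across read and write (two-phase-locking style) makes the
# read-modify-write race impossible; releasing it right after the read
# brings back the lost update.
import threading
import time

balance = 100
item_lock = threading.Lock()  # stand-in for a per-item lock

def deposit_long(amount):
    """Serializable-style: lock held from the read through the write."""
    global balance
    with item_lock:
        snapshot = balance
        balance = snapshot + amount

def deposit_short(amount):
    """Read Committed-style: read lock released right after the read."""
    global balance
    with item_lock:
        snapshot = balance
    time.sleep(0.001)  # widen the race window so it's easy to observe
    with item_lock:
        balance = snapshot + amount  # may clobber other deposits

threads = [threading.Thread(target=deposit_short, args=(10,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)  # far less than the expected 1100: most deposits vanish;
                # swapping in deposit_long always yields exactly 1100
```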

Towards a better understanding and better system design

Understanding weak isolation is not “just” an academic problem—it’s a problem database users face every day. If a user wants to make responsible use of weak isolation, she first needs to learn the particulars of each isolation model (which receive only partial treatment even in the best textbooks). Second, she has to manually translate the set of low-level read/write anomalies that define each model to her application logic. This is hard. With effort and a lot of education, it can be done. But remember, if our user’s database doesn’t support serializability and she cares about correctness, it must be done. Is this good design? Is this the best we can do? Exposing models that benefit the systems builder rather than the end user is, in my opinion, antithetical to the database tradition and to empathetic, user-centric design.

I think there are at least three fronts for making progress on this problem:

  1. As a community, I think it’s time to pay attention to and give existing weak isolation models the deep treatment they deserve. We’ve a few hints already. For example, my dissertation work (and plenty of excellent related work) helps identify and apply conditions under which we can race and preserve correctness (and concurrent execution). But there’s much work to be done in specifically examining existing and widely deployed models. There are numerous Ph.D. dissertations to be written in this space and, more importantly, serious potential for impact on practice.

  2. As we develop new systems, we can avoid making the same mistakes as our architectural ancestors by focusing on applications, not mechanisms. I’d personally welcome a moratorium on writing papers on new isolation, data consistency, and read/write transaction models unless there’s a clear and specific set of motivating, application-driven use cases.5

  3. There’s real promise in moving beyond the read-write interface of traditional isolation models entirely and building concurrency control systems that operate at a higher level of abstraction (holy grail: the application level). This helps both systems and users. Systems can exploit greater concurrency and therefore provide higher performance and availability. Users don’t have to think about data races. This is a topic for another post, but work on Bloom/CALM, CRDTs, and I-confluent coordination avoidance hints at what’s achievable by reasoning about applications, not read/write histories.
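
As a taste of what operating above the read-write interface can look like, here's a minimal grow-only counter CRDT in Python (a textbook construction, not code from any of the systems above): because merge is commutative, associative, and idempotent, replicas can accept writes concurrently and still converge without coordination.

```python
# Minimal grow-only counter (G-Counter) CRDT. Each replica increments only
# its own slot; merge takes the per-replica maximum, so concurrent updates
# commute and replicas converge without locks or coordination.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Pairwise max: safe to apply in any order, any number of times."""
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

# Two replicas accept increments concurrently, then exchange state:
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(4)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 7 7 -- convergent, with no coordination required
```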

We can do better, and there’s a ton of interesting research and design to be done. Go!

Notes

  1. There’s actually some research (see below) on this topic, but I don’t think we have a satisfying answer yet—and, from the work I’ve seen, we definitely don’t have an answer to explain the prevalence of these models in practice. I don’t think that there’s a clear, unambiguous one, either: given no knowledge about your program semantics, any model other than serializable isolation can corrupt data. However, as an example of a paper I’d like to read, a PBS-style white-box probabilistic analysis of Postgres, MySQL, or another RDBMS could be enlightening.
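
    As a toy version of what such an analysis might estimate, here's a crude Monte Carlo sketch in Python (a deliberately simplistic model with made-up parameters, nothing like a real white-box analysis of an RDBMS) of how often transactions on a single hot key actually overlap:

    ```python
    # Crude Monte Carlo: how often does a transaction on a hot key arrive
    # while another is still running? Each overlap is a potential lost
    # update. Arrival rate and duration are made up for illustration.
    import random

    def overlap_fraction(rate, duration, n=100_000):
        t, busy_until, overlaps = 0.0, 0.0, 0
        for _ in range(n):
            t += random.expovariate(rate)        # next Poisson arrival
            if t < busy_until:                   # arrived mid-transaction
                overlaps += 1
            busy_until = max(busy_until, t + duration)
        return overlaps / n

    for rate in (0.1, 1, 10, 100):               # transactions per second
        print(rate, overlap_fraction(rate, duration=0.05))
    # At 0.1 txn/s essentially nothing overlaps; at 100 txn/s nearly
    # everything does -- consistent with guess 1 above: low-traffic
    # applications may simply never race.
    ```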

    We had a fun discussion about this topic in the session on transaction processing at VLDB this year. Again, no one was able to come up with a great answer to explain the prevalence of these models. However, there was a considerable amount of excitement in the room (perhaps also surprising given the number of papers on serializability and Snapshot Isolation)—much more than I’ve previously seen in the database community.

  2. To be clear, I think we should work on both distributed data consistency and weak isolation. In fact, the solutions may be similar. For example, Read Committed is, according to the usual definitions, not much stronger than what most eventually consistent stores provide (at the minimum, there are few—if any—guarantees about the orderings of concurrent transaction execution). So, many approaches to coordination-free or coordination-avoiding distributed implementations may indeed apply to weak isolation, even on a single server. The main difference I see (and what I’m agitating for in this post) is that, while there’s a considerable amount of interest in examining often even more esoteric read/write data consistency models in the distributed setting, there’s a stunning lack of work on what many more people today are already running. It’s possible we can kill two birds with one stone by, say, moving beyond the read-write abstraction, which causes all sorts of problems once we drop serializability.

    Also, it’s pretty funny to discuss non-serializable “weak” distributed data consistency models such as eventual consistency as if their inherent usability challenges are foreign to traditional data management systems (that, as I’ve discussed, often aren’t serializable either).

  3. A few examples:

  4. Read: a potential high-impact research goldmine.

    Related note: I’m a big fan of papers on new techniques (e.g., program analysis and synthesis, or theory) for increasing the scalability or concurrency of applications. However, I wish we saw more discussion of the applications themselves. Techniques are valuable in their own right, but the idea of analysis-as-design-tool can be powerful—this paper from MIT does a great job in this regard. For example, was a previously expensive syscall or transaction non-scalable because it was simply implemented in a silly way? Did the new tool you’re describing also teach you something about the actual intent of the application that might have been obfuscated by its implementation? Did it surprise you? I’m fascinated by techniques for determining if weak isolation is safe but, even more so, when (in terms of applications) and why (in terms of programmer intent) it’s safe (or not). As a shameless plug, we’ve some work in the pipeline looking at a slew of open-source applications in this vein. Open source is great for this stuff—thank goodness for GitHub!

  5. As a self-serving example, after our HAT research (on understanding which isolation levels are achievable without coordination/“AP”), we realized there was no existing isolation model that’d efficiently serve the indexing, materialized view, and multi-put requirements in Google’s Megastore, Facebook Tao, LinkedIn Espresso, Yahoo! PNUTS, and a number of open-source databases (you’d have to use Repeatable Read or Snapshot Isolation, both of which require coordination/“CP”). So we devised a new model, Read Atomic (RA) isolation, that directly addresses these use cases and is achievable via high-performance, coordination-free “AP” algorithms. Does RA further complicate this polluted space of isolation levels? You bet. But, when I talk to an end user, I can tell them “RAMP transactions ensure you don’t have dangling pointers in your global secondary index entries and also preserve foreign key relationships” rather than tell them “RAMP transactions prevent G0 (Write Cycles), G1a (Aborted Reads), G1b (Intermediate Reads), and G1c (Circular Information Flow), PMP (Predicate-Many-Preceders), and Fractured Reads but not G2 (Anti-dependency cycles), G-single, G-SIa (Interference) or G-SIb (Missed Effects).” The second explanation is meaningful (and rightly belongs in the paper!), but the first is immediately use-case driven.

    (A side benefit of that HAT work along these lines: an existing application running on a coordinated implementation of a HAT isolation model could hypothetically run faster in a distributed/coordination-free manner.)
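
    For a feel for the mechanism, here's a much-simplified, single-process Python sketch of the fractured-read repair idea behind RAMP (the names and data structures are my own illustration, not the paper's algorithms): writers attach their write set as metadata, and readers use that metadata to detect that they saw only part of some transaction's writes and fetch the missing versions.

    ```python
    # Simplified sketch of RAMP-style fractured-read repair. Writers install
    # all of a transaction's versions under one timestamp with write-set
    # metadata; readers do a second round when the metadata shows they
    # missed a newer sibling version.
    import itertools

    versions = {}  # (key, ts) -> (value, write_set)
    clock = itertools.count(1)

    def write_txn(kv):
        ts, ws = next(clock), frozenset(kv)
        for k, v in kv.items():
            versions[(k, ts)] = (v, ws)  # install all siblings together

    def repair_read(keys, first_round):
        # first_round maps each key to the version timestamp an initial read
        # returned -- possibly from different moments (that's the fracture).
        required = dict(first_round)
        for k, ts in first_round.items():
            _, ws = versions[(k, ts)]
            for sibling in ws & keys:    # metadata names the missed siblings
                required[sibling] = max(required[sibling], ts)
        # Second round: fetch any version the metadata proves we missed.
        return {k: versions[(k, required[k])][0] for k in keys}

    # A bidirectional relationship written atomically, then overwritten:
    write_txn({"x": "v1", "y": "v1"})  # ts 1
    write_txn({"x": "v2", "y": "v2"})  # ts 2
    # A racy first round read x before the second write and y after it:
    print(repair_read({"x", "y"}, {"x": 1, "y": 2}))
    # -> {'x': 'v2', 'y': 'v2'}: the fractured read is detected and repaired
    ```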

You can follow me on Twitter here.