Data Integrity and Problems of Scope

20 Oct 2014

Mutable state in distributed systems can cause all sorts of headaches, including data loss, corruption, and unavailability. Fortunately, there is a range of techniques—including exploiting commutativity and immutability—that can help reduce the incidence of these events without requiring much overhead. However, these techniques are only useful when applied correctly. When applied incorrectly, applications are still subject to data loss and corruption. In my experience, the (unfortunately common) incorrect application of these techniques is often due to problems of scope. What do I mean by scope? Let’s look at two examples:

1.) Commutativity Gone Wrong

For our purposes, commutativity informally means that the result of executing two operations is independent of the order in which the operations were executed. Consider a counter: if I increment by one and you also increment by one, the end result (counter += 2) is the same regardless of the order in which we execute our operations! We can use this fact to build more scalable counters that exploit this potential for reorderability.
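
To make this concrete, here is a minimal sketch (in Python, with hypothetical replica names) of an increment-only counter in the spirit of a grow-only counter: each replica tracks the increments it has accepted, and replicas end up at the same value no matter the order in which increments and merges are applied.

```python
# Toy increment-only counter: each replica slot records how many increments
# it has accepted; the counter's value is the sum across slots.
class GCounter:
    def __init__(self, replica_ids):
        self.counts = {r: 0 for r in replica_ids}

    def increment(self, replica_id, amount=1):
        self.counts[replica_id] += amount

    def merge(self, other):
        # Merging takes the per-slot maximum, so merges commute and
        # repeated merges are harmless.
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts.get(r, 0), c)

    def value(self):
        return sum(self.counts.values())

# Two replicas apply the same increments in different orders...
a = GCounter(["r1", "r2"])
b = GCounter(["r1", "r2"])
a.increment("r1"); a.increment("r2")   # my increment first, then yours
b.increment("r2"); b.increment("r1")   # yours first, then mine
a.merge(b); b.merge(a)
assert a.value() == b.value() == 2     # ...and converge to the same total
```

Nothing in the sketch depends on which increment arrived first; that reorderability is exactly what the scalable counters above exploit.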

However, we’re not off the hook yet. While our operations on the individual counter were commutative, does this mean that our programs that use this counter are “correct”? If we’re Twitter and we want to build retweet counters, then commutative counters are probably a fine choice. But what if the counter is storing a bank account balance? [1] Or if it’s monitoring the amount of available inventory for a given item in a warehouse? While two individual decrement operations are commutative, two account withdrawal operations (or two item order placement operations) are not. By looking at individual data items instead of how the data is used by our programs, we risk data corruption. When analyzing commutativity, myopia is dangerous.
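
Here is the distinction as a small sketch (hypothetical helper names): a plain decrement commutes, but a withdrawal guarded by a sufficient-funds check does not, because the order decides which request gets rejected and the two orders leave the account in different states.

```python
def decrement(balance, amount):
    # Plain decrement: applying two of these in either order gives the same result.
    return balance - amount

def withdraw(balance, amount):
    # Withdrawal: only applies if funds are sufficient; otherwise it is rejected.
    return balance - amount if balance >= amount else balance

# Decrements commute: either order ends at the same (possibly negative) balance.
assert decrement(decrement(100, 70), 50) == decrement(decrement(100, 50), 70) == -20

# Withdrawals do not: which request is rejected depends on the order,
# so the two orders leave different final balances (30 vs. 50).
assert withdraw(withdraw(100, 70), 50) == 30
assert withdraw(withdraw(100, 50), 70) == 50
```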

2.) Immutability Gone Wrong

As a second example, consider the recent trend of making all data immutable (e.g., the Lambda architecture). The idea here is that we can collect a grow-only set of facts and subsequently run queries over those facts (often via a mix of batch and streaming systems). There’s no way to corrupt individual data items as, once they’re created, they can’t change. If I can name a fact that’s been created, I can always get a “consistent” read of that item from the underlying store holding my facts. Sounds great, right?

Again, it’s easy to miss the forest for the trees. Simply because individual writes are immutable doesn’t mean that program outcomes are somehow “automatically correct.” Specifically, the outputs of the functions we’re computing over these facts are—in many cases—sensitive to order and are “mutable” with respect to the underlying facts. Returning to our bank account example, pretend we use immutable data to store a ledger of deposits and withdrawals. Simply making individual deposit and withdrawal entries immutable doesn’t solve the problems above: we can insert two “immutable” withdrawal requests and end up with a negative final balance! [2]
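
As a rough sketch of how that happens (hypothetical names again), treat the facts as append-only ledger entries and the balance as a function computed over them. Two clients can each validate a withdrawal against the facts they happen to see, append their perfectly immutable entries, and still leave the computed balance negative:

```python
# Immutable facts: append-only (kind, amount) ledger entries.
def balance(facts):
    return sum(amount if kind == "deposit" else -amount for kind, amount in facts)

shared_ledger = [("deposit", 100)]        # facts everyone has seen so far

# Two clients independently check the balance visible to them (100 each)...
assert balance(shared_ledger) >= 70       # client A's check
assert balance(shared_ledger) >= 80       # client B's check

# ...and each appends an "immutable" withdrawal fact.
shared_ledger.append(("withdrawal", 70))  # client A
shared_ledger.append(("withdrawal", 80))  # client B

# No fact was ever mutated or overwritten, yet the function over the facts
# now violates the invariant we actually cared about.
assert balance(shared_ledger) == -50
```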

The Real Problem

The crux of the problem is that, if used to provide data integrity, these properties must be applied at the appropriate scope [3]. Simply making individual operations commutative or immutable doesn’t guarantee that their composition with other operations in your application will be, nor will it somehow automatically lead to correct behavior. Analyzing operations (or even groups of operations) in isolation—divorced from the context in which they are being used—is often unsafe. If you’re building an application and want to use these properties to guarantee application correctness, you likely need to analyze your whole-program behavior.

To Be Fair…

To be fair, commutativity and immutability—even applied at too fine a scope—do have merits. They may make data loss and overwrites less likely than simply using mutable state without coordination. Many of the issues involved in “Lost Updates” go away. It’s often easier to detect integrity violations and issue compensating actions in a no-overwrite system. And remember, some programs are actually safe under arbitrary composition of commutative and/or immutable operations with program logic. Just not all programs.
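
For instance, picking the ledger sketch back up (same hypothetical names), a no-overwrite history makes the violation easy to detect after the fact and to repair with another appended entry rather than by rewriting the past:

```python
# Deposits and reversals add to the balance; withdrawals subtract.
def balance(facts):
    return sum(-amount if kind == "withdrawal" else amount for kind, amount in facts)

ledger = [("deposit", 100), ("withdrawal", 70), ("withdrawal", 80)]

# Detect the integrity violation by replaying the untouched history...
if balance(ledger) < 0:
    # ...and compensate by appending a new fact (here, reversing the most
    # recent withdrawal) instead of editing or deleting old ones.
    _, amount = next(f for f in reversed(ledger) if f[0] == "withdrawal")
    ledger.append(("reversal", amount))

assert balance(ledger) == 30
```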

These are good design patterns that help simplify many of the challenges we started with. But naïvely applying design patterns isn’t a substitute for understanding your program behavior.

In a sense, these problems of scope are a natural side-effect of good software engineering. As systems builders, we design and implement abstractions that can be re-used across multiple programs. We have to make decisions about what information we need from our applications and how to take advantage of that information [4]. In many cases, it’s unrealistic for us to require our programmers to submit their entire program for commutativity analysis on every request (e.g., increment(key: String, program: JARFILE)!). We have to make compromises, and exposing operation-commutative counters is a reasonable compromise—as long as users understand the implications and are wise enough to guard against any undesirable behavior.

These problems of scope crop up in traditional database systems as well. Consider a serializable database—say, the best that money can(’t) buy, with no bugs, zero latency, infinite throughput, and guaranteed availability. If your bank application fails to wrap its multi-statement increment and decrement operations in a transaction, the database will still corrupt your data. Implicit in the serializable transaction API is the assumption that your transactions will preserve data integrity on their own. If you scope your transactions incorrectly, your serializable database won’t help you.
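
A sketch of that failure mode, simulated in Python with each helper standing in for a single auto-committed statement: every individual read and update is atomic and serializable on its own, but because the funds check and the decrement aren’t scoped into one transaction, two clients can interleave and drive the balance negative anyway.

```python
# Stand-in for a database where each statement runs as its own atomic,
# serializable transaction; nothing groups statements together.
db = {"balance": 100}

def select_balance():
    return db["balance"]        # one statement, one transaction

def update_decrement(amount):
    db["balance"] -= amount     # one statement, one transaction

def sufficient_funds(amount):
    # BUG: this check runs in a different transaction than the later
    # decrement, so another client can slip in between the two.
    return select_balance() >= amount

# Two clients both pass the funds check against the same balance...
ok_a = sufficient_funds(70)
ok_b = sufficient_funds(80)

# ...and then each issues its (individually serializable) decrement.
if ok_a:
    update_decrement(70)
if ok_b:
    update_decrement(80)

assert db["balance"] == -50     # every statement was atomic; the account is still corrupt
```

Wrapping the check and the decrement in a single transaction is exactly the scoping decision the database cannot make for you.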

Conclusions and Some References

Where’s the silver lining? If you reason about your application’s correctness at the correct scope (often, the application level), you’ll be fine. There are a number of powerful ways to achieve coordination-free execution without compromising correctness—if used correctly. Figure out what “correctness” means to you (e.g., state an invariant), and analyze your applications and operations at the appropriate granularity.

If you’re interested, many people are thinking about these issues. My colleagues and I wrote a short paper on reasoning about data integrity at different layers of abstraction, from read/write to object and whole program. Some smart people from MIT wrote a paper last year on how to reason about commutativity of interfaces, and my collaborators have developed ways to guarantee determinism of outcomes via whole-program monotonicity analysis in the Bloom language (see also note two). Finally, some of my own recent work examines when any coordination is strictly required to guarantee arbitrary application integrity (that is, when we can preserve application “consistency” without coordination and when we must coordinate).

With a better understanding of what is and is not possible—along with tools that embody this understanding and systems that exploit it—we can build much more robust and scalable systems without compromising safety or usability for end users.

Notes

  1. Yes, I used the bank example, and, indeed, banks deal with this problem via compensating actions and other techniques. But the point remains: otherwise commutative decrements on bank account balances aren’t enough to prevent—on their own, without other program logic—data integrity errors like negative balances.

  2. In short, I can’t safely make decisions based on the output of the functions I’m computing without making additional guarantees on the underlying set of facts. This is actually a pretty well-studied (albeit challenging) problem in database systems. We’d call the “Lambda Architecture” an instance of “incremental materialized view maintenance” (or “continuous query processing”), implemented via a combination of stream processing and batch processing. This has been, at various times, a hot topic in the research community and even in practice. For example, Jennifer Widom’s group at Stanford spent a considerable amount of time in the early 2000s understanding the relationship between streams and relations. Dave Maier and collaborators developed punctuated stream semantics that’d allow you to “seal” time and actually make definitive statements about a streaming output that, by definition, won’t change in the future. And Mike Franklin and friends built a company to commercialize earlier stream processing research from Berkeley’s TelegraphCQ project. Not surprisingly, the semantics of continuous queries became a serious issue in practice.

  3. Neil Conway and the Bloom team’s work on Bloom^L contains one of the first—and perhaps most lucid—discussions of what they call the “scope dilemma.” As part of the paper, they show that the composition of individual commutative replicated data types can yield non-commutative and, generally, incorrect behavior. Their later work (driven by Peter Alvaro) on Blazes examines problems of composition of monotonic and non-monotonic program modules across dataflows.

  4. These same issues apply to research, too. For example, the distributed computing community has standard abstractions like registers and consensus objects in part because these abstractions allow researchers to talk about, compare, and work on a common set of concepts.

    Moreover, research papers are not immune to making these sorts of errors. I actually started writing this post in response to a paper (on a particular distributed computing application) that we read in a campus reading group last week; it abused a notion of “commutativity and associativity” and sparked a discussion along these lines. (I’m sure I’ve made these mistakes in some of my own papers.) To be clear, I’m not necessarily advocating for greater formalization of these concepts—you can formalize anything (and sometimes it’s even useful to do so)—but am instead advocating for greater care in thinking about when these concepts can be successfully applied.
