Avoiding Coordination Cost: Three Patterns for Building Efficient Distributed Systems
In this post, I describe a common situation in many distributed systems (the need for coordination protocols to achieve program consistency), why it can be a problem (it can hurt scalability, performance, and availability), which kinds of systems do not need it, the simple and intuitive property behind it (monotonicity), and three practical patterns you can use to build more efficient distributed systems. This is based on my takeaways from the paper "Keeping CALM: When Distributed Consistency is Easy" by Joseph M. Hellerstein and Peter Alvaro.
Ready to go on this journey? Before we get to the above paper, let’s start with a real-world analogy.
A real-world analogy: coordination during a trip to reach "consistency" on which hike to take
Once, I booked a site in a campground for a camping trip with my family. Just a day before the trip, two of my friends and their families decided to join me. My campground was fully booked, but they were able to book campsites in two nearby campgrounds. So, on the day of the trip, we arrived at our respective campgrounds (each a few miles apart).
So far so good. The next morning, though, we realized we had a problem: we hadn't yet discussed or agreed on a plan for the day.
We needed to decide which nearby hike we were going to take that day.
To coordinate, you first need a coordination protocol, and the reason distributed systems use such protocols is to achieve program consistency. In the situation above, the coordination protocol was proposing and responding to hike options over a joint phone call. The program consistency goal was a joint agreement on which hike we were taking that day.
Sounds simple enough: get on a phone call with each other and decide the place, right? Luckily, I had decent connectivity at my campground, and one of my friends (Friend #1) seemed to have decent connectivity at his campground as well, so I called him up. I floated the idea of "Hike A", while he convinced me that we should do "Hike B". But neither of us could reach Friend #2, and we hadn't heard from him either: there was no connectivity at his campground, effectively causing a "network partition". Hence, we couldn't yet commit to "Hike B" as our final plan, since input from Friend #2 might change it.
Welcome to “non-monotonic” systems.
The problems caused by requiring coordination
The above is an example of a situation that requires coordination, and coordination in the critical path at that! It wouldn't necessarily have been a problem if we had agreed ahead of time to meet at the Hike B trailhead at 9 AM.
As the paper describes, coordination protocols enable autonomous, loosely coupled machines to jointly decide how to control basic behaviors. Many distributed systems use such coordination protocols to achieve desired program outcomes; popular examples include Paxos and two-phase commit.
However, coordination can come at a high cost: it can hurt the scalability of the system (imagine a group of 10 friends instead of 3), its availability (we were stalled while waiting for Friend #2 to become reachable), and its performance (we could have spent that time exploring the scenery).
But when is coordination really needed?
The CALM paper asks this fundamental question:
What is the family of problems that can be consistently computed in a distributed fashion without coordination, and what problems lie outside that family?
There is a difference between “need” and “want” — does your system really need to use a coordination protocol, or can it be designed without one?
The paper describes a very simple and intuitive property of a system — monotonicity — that determines whether coordination is needed or not. Let’s see what that means.
The key idea: Logical Monotonicity
To understand monotonicity, let’s start with the inverse: when is a system not monotonic?
We have already seen one in my example above — I mentioned that we couldn’t proceed with a choice of a hike even though Friend #1 and I had agreed on it. This was because inputs from Friend #2 could change the final outcome. This is an example of a non-monotonic system where the system cannot make a final decision based on partial or incomplete information.
Distributed Garbage Collection: Another example of a non-monotonic system
The paper describes distributed garbage collection as an example of a non-monotonic system. Each node holds a set of objects; one node holds the root object, which has references to other objects, and an object can refer to objects on other nodes. In such a system, how can multiple "local garbage collectors" work together to determine which objects are unreachable?
Nodes need to exchange information about their "edges" (references), and a local garbage collector cannot autonomously make a decision: coordination is required! Machine M3 may initially determine that object O4 is unreachable, but later receive new information from machine M1 that demonstrates reachability via a path to O4 (Root -> O1 -> O3 -> O4). As in our camping example above, you need to hear from everyone before making a decision; this is an example of needing coordination.
The paper calls out another implication of non-monotonic systems: since such systems "change their mind" based on new information, they are order-sensitive, i.e., the order in which they receive the information matters.
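To make this concrete, here is a minimal Python sketch (my own illustration, not code from the paper; the object names mirror the hypothetical path above). A conclusion drawn from partial edge information ("O4 is garbage") gets invalidated when another edge arrives later:

```python
# Why distributed GC is non-monotonic: a conclusion drawn from partial
# edge information can be invalidated by edges that arrive later.
# The object names (Root, O1, O3, O4) are hypothetical.

def reachable(edges, root="Root"):
    """Compute the set of objects reachable from the root via the known edges."""
    seen, frontier = {root}, [root]
    while frontier:
        node = frontier.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen

# M3 only knows about a subset of the edges so far.
known_edges = {("Root", "O1"), ("O1", "O3")}
print("O4" in reachable(known_edges))   # False: O4 looks like garbage...

# ...until M1's edge information arrives and flips the conclusion.
known_edges.add(("O3", "O4"))
print("O4" in reachable(known_edges))   # True: O4 was reachable all along
```

Notice how the earlier conclusion had to be retracted when new input arrived; that retraction is exactly what makes the system order-sensitive and in need of coordination.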
Distributed Deadlock Detection: An example of a monotonic system
Now let’s see what monotonicity means. It boils down to this:
Are decisions made on partial or incomplete information still stable? Do conclusions drawn from partial information still hold once all the information has arrived?
The paper describes distributed deadlock detection as an example of a monotonic system. Say you have transactions running on different nodes of a distributed database, and you want to detect deadlocks: e.g., a cycle where transaction T1 holds lock L1 and is waiting for a lock held by transaction T2 on a different node, which is in turn waiting for L1 to be released.
Similar to our garbage collection example, the nodes here have to keep exchanging information with each other. But once a machine receives information that lets it determine there is a cycle, it can confidently report a deadlock, even if it hasn't yet heard from other nodes. No coordination is required here! New information from more nodes may reveal more cycles, but it cannot change the fact that there was a cycle. In other words, the output grows monotonically with the input.
Another important bonus of such monotonic systems is that ordering doesn’t matter — each participant can receive information in any order and the output only depends on the content of the input.
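Here is the same idea as a minimal Python sketch (again my own illustration; real detectors exchange pieces of a distributed "waits-for" graph, and the transaction names are hypothetical). Once the edges seen so far contain a cycle, later edges can only add cycles, never remove one:

```python
# Why deadlock detection is monotonic: once the waits-for edges seen so
# far contain a cycle, new edges can only add cycles, never remove one.
# The transaction names (T1, T2, T3) are hypothetical.

def has_cycle(edges):
    """Detect a cycle in a directed waits-for graph via depth-first search."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, []))

    return any(visit(node, set()) for node in graph)

edges = {("T1", "T2"), ("T2", "T1")}   # T1 waits on T2; T2 waits on T1
print(has_cycle(edges))                # True: deadlock detected

edges.add(("T3", "T1"))                # new information arrives later...
print(has_cycle(edges))                # ...still True: the conclusion is stable
```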
Monotonicity gives us a way to achieve consistency without the need for coordination
The key intuition of the CALM paper is that if a system is logically monotonic, we can achieve consistency in a coordination-free manner. The paper describes this as:
Theorem 1. Consistency As Logical Monotonicity (CALM). A program has a consistent, coordination-free distributed implementation if and only if it is monotonic.
The paper describes how monotonic programs are "safe" in the face of missing information and can proceed without any coordination. In non-monotonic programs, however, the outcome can change when new information arrives. This forces them to coordinate, and a participant must "stall" while it waits for that information.
But what does “consistency” really mean here? Let’s look at how the CALM paper defines consistency.
“Program Consistency” = Are we getting the desired system outcomes?
Rather than the narrower notion of storage-level consistency, the CALM paper zooms out and focuses on program consistency: does the program produce the desired outcomes? It asks this practical question:
Does my program produce deterministic outcomes despite non-determinism in the runtime system?
It calls this notion of program consistency "confluence". In my trip example above, it would mean that all friends come to the same conclusion about which hike to take that day, irrespective of the non-deterministic ordering and timing of the individual messages we exchanged that morning over the unreliable phone network.
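As a toy illustration of confluence (mine, not from the paper): a program whose state is merged with set union produces the same final state under every message delivery order, while a program that keeps only the last message seen does not:

```python
# A toy illustration of confluence: merging messages with set union is
# order-insensitive, so every delivery order yields the same final state.
# Keeping only "the last message seen" is order-sensitive, so it is not
# confluent.
from itertools import permutations

messages = ["Hike A", "Hike B", "Hike C"]

# Confluent: the union over any delivery order gives the same set.
union_outcomes = {frozenset(order) for order in permutations(messages)}
print(len(union_outcomes))      # 1: all six delivery orders converge

# Not confluent: the final state depends on which message arrived last.
overwrite_outcomes = {order[-1] for order in permutations(messages)}
print(len(overwrite_outcomes))  # 3: different orders, different outcomes
```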
Takeaways: How can you use the above to build efficient distributed systems?
What does the above mean in terms of practical patterns for building systems?
1. Use monotonic programming patterns when possible
Monotonic patterns are not the only way to build efficient distributed systems. But if you can model your problem as a monotonic program, you get program consistency without needing coordination.
A simple example called out in the paper is the use of tombstones for deleting data items in a storage system: direct deletion would be a non-monotonic construct. Instead, marking an item as deleted (tombstoning it) makes it monotonic: a data item transitions monotonically from undefined to a defined value and ultimately to tombstoned.
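Here is a minimal sketch of that idea (my own, assuming a simple three-state lattice per key rather than a full storage system): a key's state only ever moves up the lattice, so replicas can merge updates in any order and still converge:

```python
# Tombstone-based deletion as a monotonic construct: each key moves up a
# three-state lattice (UNDEFINED < DEFINED < TOMBSTONED) and never moves
# back down, so merging replica states is order-insensitive.
UNDEFINED, DEFINED, TOMBSTONED = 0, 1, 2

def merge(state_a, state_b):
    """Merge two replicas' key states by taking the higher lattice state."""
    merged = dict(state_a)
    for key, rank in state_b.items():
        merged[key] = max(merged.get(key, UNDEFINED), rank)
    return merged

replica_1 = {"item-42": DEFINED}
replica_2 = {"item-42": TOMBSTONED}   # item-42 was deleted on this replica

# Either merge order yields the same answer: the deletion wins.
print(merge(replica_1, replica_2))    # {'item-42': 2}
print(merge(replica_2, replica_1))    # {'item-42': 2}
```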
Another example described in the paper is the Amazon shopping cart from the Dynamo paper: to get logical monotonicity, it models the state of the cart as two sets, one holding the items added to the cart and the other holding the items deleted from the cart. This makes the "shopping" part monotonic, while the "checkout" part remains non-monotonic.
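A minimal sketch of the two-set cart (my own simplification: Dynamo's real cart also tracks quantities and reconciles versions with vector clocks). Both sets only grow, so merging replicas is just set union, which is monotonic and order-insensitive:

```python
# The two-set shopping cart: the "added" and "removed" sets only grow,
# so merging replicas is set union, a monotonic, order-insensitive
# operation. This is my simplification, not Dynamo's actual data model.

def merge_carts(cart_a, cart_b):
    """Merge two cart replicas by unioning their grow-only sets."""
    return {"added": cart_a["added"] | cart_b["added"],
            "removed": cart_a["removed"] | cart_b["removed"]}

def current_items(cart):
    """The visible cart contents: everything added and not yet removed."""
    return cart["added"] - cart["removed"]

replica_1 = {"added": {"book", "mug"}, "removed": set()}
replica_2 = {"added": {"book"}, "removed": {"book"}}  # "book" removed here

merged = merge_carts(replica_1, replica_2)
print(current_items(merged))   # {'mug'}: the merge order doesn't matter
```

Computing the visible contents ("added minus removed") is itself a non-monotonic query over these sets, which is why the "checkout" step still needs coordination.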
2. See if you can keep coordination off the critical path
If your application cannot be built in a monotonic manner, coordination will be needed. But evaluate whether you can keep it off the critical path.
Examples: In the hike scenario above, we could have taken the coordination off the critical path by agreeing on a hike the previous night. The distributed garbage collector can run in the background, so its required coordination is not in the critical path. And in the shopping cart scenario from the Amazon Dynamo paper, coordination was limited to "checkout" (which has different user expectations than adding items to and removing items from the cart).
3. See if you can compensate ("apologize") for inconsistency
Can you tolerate your system being inconsistent? If so, you can avoid the coordination cost.
An example from the paper is about ordering items on an e-commerce website: when a purchase is made, the inventory count should be decremented immediately, ensuring that an item shown as available for purchase is indeed available. But this requires coordination and integration between systems such as inventory, supply chain, and shopping. Without it, a purchase might not be honorable; the compensating action is an apology e-mail combined with a coupon for a future purchase. This advice also reminded me of Gregor Hohpe's famous article "Starbucks Does Not Use Two-Phase Commit".
Wrapping up
The CALM ("Consistency As Logical Monotonicity") theorem points to several ways of building efficient distributed systems. By leveraging the monotonicity property, you can achieve program consistency in a coordination-free manner. It also surfaces good questions about whether consistency is a must-have goal in all situations (e.g., when compensating actions can be used), and whether coordination can be taken off the critical path.
So, the next time I go on a camping trip with my friends, you can be sure that I will be exploring all of the above options :)
Thanks for reading! If you liked this post, please follow me here on Medium or on Twitter to be notified about my future posts. If you have feedback or comments, please let me know in the comments!
To learn more, check out the original paper by Joseph M. Hellerstein and Peter Alvaro: "Keeping CALM: When Distributed Consistency is Easy".