AI-generated art of the crowd at Angels Landing. Credit: NightCafe Studio

Metastable Failures in Distributed Systems: What Causes Them and 3 Things You Can Do to Tame Them

Jun 12, 2022

In this post, I cover my main takeaways from a paper called “Metastable Failures in Distributed Systems”¹ by Nathan Bronson, Aleksey Charapko, Abutalib Aghayev, and Timothy Zhu. The paper describes a common failure pattern called the metastable failure, presents a framework for thinking about it, and explains why these failures are overrepresented in outages of hyperscale distributed systems. The paper also raises interesting questions on the hard problems of how to identify and recover from metastable failures, or even avoid them in the first place.

Ready to go on this journey? Before we get to the paper, let’s look at a real-world analogy (with some terminology from the paper).

A metastable failure in the real world

Angels Landing is a popular hike in Zion National Park in the US. In the final part of the hike, you climb around rocks and walk along a narrow ridge (the “spine”) with 1,000-foot drops on both sides. As a safety measure, hikers can hold onto metal chains installed in multiple sections. If you squint a bit :), you can think of this set of chain sections as a distributed system. An example of a chain section:

A picture from my Angels Landing hike in April 2022: a chain section before heading to the “spine” portion

Before April 2022, there were no restrictions on the number of hikers. The resulting overcrowding led to waits of roughly 4 hours at the trailhead as access was metered to reduce the load⁸. You can think of the “system” going through the following state transitions:

#1. Stable state

This is the state when the crowd is below some safe threshold. A trigger (e.g., a slow hiker in a chain section, or a temporary increase in hikers) can still cause slowdowns, but the system will self-heal and recover automatically.

#2. Vulnerable state

This is the state when the number of hikers is above that safe threshold: the system is operating more “efficiently” but is less stable. Compared to the stable state, it has much less headroom to handle an overload caused by an increase in the number of hikers, or a drop in “system capacity” caused by slow hikers.

#3. Metastable Failure State

When the system is in the vulnerable state, certain triggers can make things worse. Imagine a group of hikers who need extra time to navigate a particular chain section. This slows down other hikers: descending hikers have to wait for the ascending hikers, or risk going around them without using the chains. Similarly, the ascending hikers have to wait for the descending hikers using the chain.

More people waiting leaves less room in the chain section and on either side of it, which means people need even more time to navigate, which leads to more people waiting, which leads to… you get the idea. This causes a downward spiral in the goodput (the amount of useful work done by a system).

Welcome to metastable failure state.

Positive feedback loop that keeps the system in metastable failure state

Because of this self-sustaining feedback loop, the system remains in this state even after the removal of the initial trigger. To recover, you need to take other actions, such as reducing the load below a threshold. In the case of this hike, this used to happen naturally at around 3:30 pm⁹. But starting April 2022, the NPS introduced a lottery-based permit system¹⁰ (I was happy to get a permit!), which addresses this problem and makes the hike a lot safer.

Metastable failure: the definition

The paper describes it as:

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed. In this state the goodput (i.e., throughput of useful work) is unusably low, and there is a sustaining effect — often involving work amplification or decreased overall efficiency — that prevents the system from leaving the bad state.

States and transitions of a system experiencing a metastable failure. Credit: Nathan Bronson’s talk

If this is the case, then why not always run in the Stable state? Well, efficient resource utilization is important in many systems, and the paper calls out that many production systems choose to run in the vulnerable state since it has much higher efficiency than the stable state.

When in a metastable failure state, the system doesn’t self-heal; recovering from it requires a major action such as reducing the load.

The Crux of the Issue: A Positive Feedback Loop

A positive feedback loop is the root cause of metastable failures. We saw one example of it above. As Wikipedia puts it, it happens when:

A produces more of B which in turn produces more of A

Another example is how a stampede starts among wild animals: the number of animals running at any time (A) increases the overall level of panic (B), which causes more animals to run (A), which further increases the panic level (B)… eventually resulting in a stampede.

Many different triggers can push a system into a metastable failure state. But what keeps it there, even after the trigger is removed, is a sufficiently strong self-sustaining feedback loop.

What are some examples of metastable failures?

Let’s look at a couple of case studies from the paper.

1. Retries causing Work Amplification

You have a web server that accepts client requests and talks to a database server. Assume the database server can handle 300 queries per second with a predictable latency of 100 milliseconds, and that the web server has a naive retry pattern: it retries once for every failed request.
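A minimal sketch of that naive pattern, where `query_database` is a hypothetical stand-in for the real database call:

```python
import random

def query_database(query):
    """Hypothetical stand-in for the real database call; fails randomly
    here purely for illustration."""
    if random.random() < 0.1:
        raise TimeoutError("database request timed out")
    return f"result for {query!r}"

def query_with_retry(query):
    """The naive pattern from the example: retry exactly once on any failure."""
    try:
        return query_database(query)
    except Exception:
        # The second attempt is the work amplification: every failure turns
        # into two database requests, exactly when the database has the
        # least headroom to spare.
        return query_database(query)

print(query_with_retry("SELECT 1"))
```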

As long as the incoming request load to the web server is 150 qps or less, the system can self-heal even when a trigger (say, a brief network glitch) causes a backlog of requests, because even with every request retried the database sees at most 300 qps. This is the stable (“healthy”) state.

However, once the load crosses 150 qps, the system enters the “vulnerable” state. Say it is running at 280 qps and is still functioning well. Now, if there is a 10-second outage in the network between the web server and the database server, the database server sees a spike (the original workload plus retries), which increases query latency, which leads to more timeouts and retries, which further increases latency, and so on, resulting in a self-sustaining loop.
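To see how the loop sustains itself, here is a toy one-second-step simulation I put together with the same numbers (300 qps of database capacity, 280 qps of offered load, one retry per failed request). The 1-second client timeout and the assumption that serving an already-timed-out request is wasted work are my own simplifications, not details from the paper:

```python
CAPACITY = 300   # qps the database can serve
OFFERED = 280    # first-attempt qps arriving at the web server
TIMEOUT = 1.0    # seconds a client waits before giving up and retrying

queue = 0.0      # backlog, measured in seconds of database work
for t in range(40):
    trigger = 10 <= t < 20                         # the 10-second network outage
    overloaded = trigger or queue > TIMEOUT        # waits now exceed the client timeout
    arrivals = OFFERED * (2 if overloaded else 1)  # retries double the offered load
    drained = 0.0 if trigger else 1.0              # seconds of work served per second
    queue = max(0.0, queue + arrivals / CAPACITY - drained)
    goodput = 0 if overloaded else OFFERED         # served-but-timed-out work is useless
    if t % 5 == 0:
        print(f"t={t:2d}s trigger={trigger} queue={queue:5.1f}s goodput={goodput} qps")
```

In this toy model, goodput never recovers on its own once the outage ends: the retried traffic alone (560 qps) exceeds the database capacity, so the backlog keeps growing until the offered load is reduced.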

2. Work done to optimize the happy path

Any work done to optimize the happy path makes it more likely to get into a metastable failure state. Consider the same system as above but with a cache server that reduces the number of calls to the database server. If the cache hit ratio is 90%, the system can handle 3000 qps in the vulnerable state (with only 300 qps reaching the database server). But imagine that the cache server needs to restart for some reason, and we have a problem on our hands: it is a perfect recipe for entering a metastable failure state. The paper calls the 3000 qps here the advertised capacity (the limit up to which the system can operate, in the vulnerable state), while the 300 qps is the hidden capacity (the limit below which the system will self-heal).
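As a back-of-the-envelope check on those numbers, here is a small helper (my own sketch, not from the paper) that derives the two capacities from the database limit and the cache hit ratio:

```python
def capacities(db_qps: float, cache_hit_ratio: float) -> tuple[float, float]:
    hidden = db_qps                                # cold cache: every request hits the DB
    advertised = db_qps / (1.0 - cache_hit_ratio)  # warm cache: only misses hit the DB
    return advertised, hidden

advertised, hidden = capacities(db_qps=300, cache_hit_ratio=0.90)
print(f"advertised ~{advertised:.0f} qps, hidden ~{hidden:.0f} qps")
# advertised ~3000 qps, hidden ~300 qps: everything in between is the
# vulnerable region, where a cache restart can tip the system over.
```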

The above shows the work amplification that is common in many large-scale outages¹¹. But the case study I found the most interesting was “link imbalance”¹², where the slowest link among a set of links between two systems kept getting chosen because of an MRU (most recently used) connection-pool policy.

How to handle known metastable failures?

The paper describes a few techniques for preventing known metastable failures that have caused outages. Here are my takeaways from it:

#1. Focus on the sustaining feedback loop over the triggers

To address a metastable failure pattern, focus on weakening the positive feedback loop that drives the work amplification, instead of playing whack-a-mole with the triggers.

#2. Identify the strongest feedback loops

Understand where the largest instances of work amplification occur and define an upper bound for them. In essence, identify the strongest feedback loops instead of going after every feedback loop.

How do you identify them? Stress tests are one way. But the challenge is that the higher the scale, the stronger the feedback loop, so many issues are discovered for the first time only at large scale.

Hence, one thing you can do is to identify the characteristic metrics affected by the trigger: e.g., the retry rate in the example above. By tracking these, you can map the boundary between the stable, vulnerable, and metastable failure states.
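As an illustration (my own sketch, not from the paper), here is one way to track the retry rate over a sliding window; running load tests at different request rates and watching how this metric behaves is one way to map those boundaries:

```python
from collections import deque
import time

class RetryRateTracker:
    """Tracks the fraction of recent requests that were retries."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.events = deque()   # (timestamp, is_retry) pairs

    def record(self, is_retry: bool) -> None:
        now = time.monotonic()
        self.events.append((now, is_retry))
        # Drop events that have fallen out of the sliding window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def retry_rate(self) -> float:
        """Fraction of requests in the window that were retries (0.0 if idle)."""
        if not self.events:
            return 0.0
        retries = sum(1 for _, is_retry in self.events if is_retry)
        return retries / len(self.events)

# Record every outbound request and alert when the rate stays above the
# threshold that load tests showed to be the edge of the vulnerable region.
tracker = RetryRateTracker()
tracker.record(is_retry=False)
tracker.record(is_retry=True)
print(f"retry rate: {tracker.retry_rate():.0%}")   # -> retry rate: 50%
```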

#3. Weaken the feedback loops

The paper describes a few approaches for weakening feedback loops, such as load shedding and patterns like Circuit Breaker. The main challenge is that retry and failover decisions are made by clients, which lack a global view of the system (and trying to provide that view can itself introduce a failure mode). It is also important to differentiate between transient load spikes and persistent overload, so you want metrics that help you accurately tell the two apart.
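As one illustration of weakening a feedback loop, here is a minimal circuit-breaker sketch (my own, not from the paper). It uses a crude consecutive-failure threshold to separate a transient blip from persistent overload, where production systems typically use windowed error rates:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after a run of failures, stop calling the
    overloaded dependency (shed load) for a cool-down period instead of
    letting every caller keep retrying into it."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Persistent overload: fail fast instead of adding more load.
                raise RuntimeError("circuit open: shedding load")
            # Cool-down elapsed: close the circuit and probe again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0   # a transient blip recovers quietly
        return result

# Usage (with a hypothetical database call):
#   breaker = CircuitBreaker()
#   result = breaker.call(query_database, "SELECT 1")
```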

The paper makes a good case for organizational incentives that reward the right kind of optimizations (e.g., ones that reduce resource usage in the first place) rather than happy-path optimizations that make metastable failures more likely.

Wrapping up

Hope you found the above analogy and takeaways useful. Please feel free to let me know your thoughts in the comments.

So, the next time you see “storm of failing requests”, “death spiral”, “thrashing” etc. in an outage, you can use this as a framework / common language to understand it. The paper¹ has many more examples and research considerations, so I recommend checking it out!

To stay in touch for my future stories, follow me here or on Twitter.

References

  1. Metastable Failures in Distributed Systems (sigops.org)
  2. HotOS 2021: Metastable Failures in Distributed Systems | YouTube
  3. Metastable Failures in Distributed Systems | Aleksey Charapko
  4. Quinn Wilton on Twitter, observing that a GitHub incident report reads like a metastable failure
  5. Debugging Under Fire: Keep your Head when Systems have Lost their Mind | Bryan Cantrill, GOTO 2017
  6. Avoiding overload in distributed systems by putting the smaller service in control | Amazon Builders’ Library
  7. An update on recent service disruptions | The GitHub Blog
  8. https://twitter.com/ZionNPS/status/1399012666140770318
  9. The Angels Landing hike: the one hack no one tells you to beat the crowds | Walk My World
  10. Angels Landing Permits & Hiking | Zion National Park (U.S. National Park Service) (nps.gov)
  11. Summary of the Amazon SimpleDB Service Disruption
  12. Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale | Engineering at Meta (fb.com)


Written by J. Kalyana Sundaram

Software Architect in Azure @ Microsoft.
