In Defense of Systems that Fail

by Chloé VULQUIN

In recent times, as an industry, we've been building systems that are progressively less likely to fail. From the erlang's internal retries, to formally verified languages, to rust's borrow checker. These all have a place. However, it's a different thing to say that those projects and languages that are not in the same category do not have a place.

Making systems that are resistant (making a fail-free system is physically impossible, everything can and will go wrong) to failure is not free. The costs are many and come from various places. This type of software is harder and takes longer to write. The languages intended to make this easier require particular styles and restrictions to make it possible. They require additional resources to run, and are much more complex to tune, as making a perfect system is impossible, and therefore specific tuning is left to configuration time. Such systems demand more of the writer, the user, and the environment.

Furthermore, attempts to make systems fail less can actually make them fail more. The loss of a quorum is a common example in early HA (High Availability) SQL setups. There's nothing wrong anywhere, all that happened was a minor ping latency hiccup. A packet arrived late, or retried a few times, and by the time it made it over, the nodes decided that quorum is lost. What you end up with is a soft failure, where everything is running but needs manual intervention to run as intended. Were it a smaller, simpler, system, this issue would not be noticed in any way. The attempt to make something more failure-resistant has made it more sensitive to other types of environmental issues. This is another cost to these systems.

As with any tradeoff, then, it's important to make a cost-benefit analysis to determine if this particular tradeoff is worthwhile in the circumstances at hand. Let's start with the benefits.

Why would you want a system that doesn't fail, or at least fails less often? Well, for one, it's annoying when your system is down! Perhaps that's enough to justify the above costs if your system fails all the time. Then again, if your system fails all the time, you might have other problems. No, it would take something bigger, a real cost to failure.

These fail-resistant systems are obviously critical in applications like medicine, aerospace, and more, where failure is not an acceptable option. You definitely don't want a program ensuring your safety and ability to operate a rocket to fail due to unexpected user input (being chromium and javascript), or perhaps fail to deploy an automated parachute. The cost of failure in these cases is a human cost. As such, it's perfectly acceptable to sink a lot of effort into avoiding it.

Moving to somewhat less dramatic pastures, sometimes the cost is not human, but monetary. Software businesses and businesses relying on software for mission-critical operations can attribute a real and direct cost to every second of downtime. The Amazons and the Facebooks and the Fords of the world can perform an elementary analysis of dollar per second vs dollar per increased safety feature. It is then no wonder that it is often these large companies that are searching for ever more developers for these languages, those experienced in building failure-resistant systems. A rust developer likely won't even ask for more money than a C++ one, even if they probably should 🦀.

Finally, sometimes, the cost is legal. Sure, no one's going to die if the service dies randomly once every year. Sure, you're not losing tangible money out of it, or even intangible. But gosh darn, you signed that paper that promised a given SLA, so now you have to deliver it. Your hands are tied, and it hardly matters how you increase availability, but you've got to.

You'll notice that conspicuously missing from all of these scenarios is the small scale with low stakes. If you're hosting an RSS reader for your friends, whether it goes down once a year or three times a year doesn't matter – you're probably rebooting the single machine it's on more often than that. Oh, sure, it could go down even less if you set up HA, but why bother? No one will die, you won't lose any money, hell chances are no one will even notice, including you!

The cost-benefit analysis simply doesn't pan out. Most people already don't have HA set up for their personal or homelab services, the cost to the environment and to their maintenance of the software is higher than they're willing to accept for the little benefit gained. At those scales, performance, too, hardly matters. Or, more accurately, throughput is irrelevant, while latency is crucial. Have I mentioned making systems failure-resistant incurs a latency penalty?

So why then are the people running their own services often running “production-ready” “professional” “HA” systems? There are a few answers. Firstly, the cost is not visible to the user. They can just not enable the HA features (well, sometimes, anyway). And since they're not contributing to the software, nor are they the author, they do not see the increased maintenance cost. Additionally, oftentimes they won't be aware of any alternatives, if they even exist at all. The personal cost of writing something new is much higher than the cost of dealing with whatever is already out there, while smaller already-written systems will often be personalized or too small to be easily discovered. This does not, however, mean that it's not worth doing.

In the meanwhile, what pushes people to write software that is failure-resistant? Here too, the answers are many, but are not difficult either. The most obvious one is that failure conditions are bad. Nobody likes them! Humans are notoriously bad at estimating costs, time required and similar. It's a common joke that engineers will simply answer “it will be done once it is done”, and have no input to provide beyond this. An author is disproportionately likely to underestimate the cost that they are making themselves pay in advance, and therefore highly likely to not perform a cost-benefit analysis for their use case.

Sometimes, though, there is an analysis going on, though it is of a different nature. Sometimes, people write software not because they want to use it, but because they hope to get a job, or build a portfolio, or for someone else's needs. The biggest demand for software comes from corporations, and for the reasons mentioned above, corporations like HA failure-resistant systems, and are the entities most likely to hire you. As such, if you're writing the software for any of the above reasons, you're also much more likely to make them failure-resistant.

Because of all of this, most pieces of infrastructural software, where the stakes are low for most, medium for some, but never of a human cost, tend to be failure-resistant, and absolutely a pain to run. Authoritative DNS servers, email servers, collaboration software, and more. All of these suffer from this effect. So consequently, many people will never host their own authoritative DNS, or their own email, and so on. This in turn creates space for corporations to provide these as services for others. Since the name of the game is convenience, those aren't always performant, or even configured in a failure-resistant way, putting us back in square one. The price of this societal level issue ends up being paid by those that do not have the time to maintain software that is as a category needlessly obtuse, either monetarily, or by not having access to such functionality at all.

So in conclusion, and as advice – don't blindly make things failure-resistant. Perform cost-benefit analysis, avoid underestimating the costs to yourself, and if it is so justified, optimize, but not prematurely. It's ok for your software to fail sometimes; it's ok to focus on the happy code path; sometimes it might just make the world better to fail.

One of the reasons bunker labs posts are the way there are is precisely for this reason. In practice, failure-resistance is bolted on after the fact. When it isn't, you had better be getting paid well for it (again, rust devs, ask for higher salaries, I am not kidding). The final advantage to building systems that fail is that you still built a system; learned more about the subject and improved in your craft. That's what I hope to help achieve with what I write, at least on here. You can potentially use what I write in the real world, but maybe write your own instead. :)