Note: I originally wrote this on an internal Amazon blog in 2006. This is the original version with a few edits. A newer paper covering the same content can be found on the AWS Builder's Library, Avoiding fallback in distributed systems.
More complicated software is more buggy, so programmers try to apply Occam's Razor and code as simply as possible (but no simpler). But how does one define "complexity"? Lines of code? Number of modules? In this essay I'm going to focus on one of the most important measures of complexity of a distributed system: modes.
To illustrate the concept, consider a common dilemma: Your service needs to send a message to another service, and that other service might be down. You want to make sure the message eventually makes it to that service.
In this situation, and others like it, I usually hear the following proposal:
Try to send the message to the other service.
If that doesn't work (times out, ACK never received, etc), write the message to disk and try again later.
Now, there are other problems here that are out of scope of this entry, (such as losing the message if the originating machine has a disk failure, or sending the message twice), but the only problem I'd like to focus on is the modal nature of this behavior. It's modal because it has a happy mode where it does one thing, and an unhappy mode where it does something very different (i.e., write to disk).
Modal designs such as this one suffer from code rot, a disease that afflicts any piece of code that hasn't been run for a while. Much like an unused limb, unused code will eventually atrophy and stop working. Let's consider a simple example and then go back to the above case.
Say you work for an insurance company. You're maintaining some big ol' COBOL program on a water-cooled IBM 360 mainframe that, by itself, consumes more energy than some small nation-states. Your boss wants you to run a job that's going to calculate some critically important information. You have 30 minutes to do it, and your boss has made it clear you'll be fired if it isn't ready on time. How likely is it that you'll start working on your resume; if the code for the job was last run:
3 minutes ago?
1 week ago?
6 months ago?
1 year ago?
10 years ago?
I'm guessing that starting at the 6 month mark you'd be looking around the copier room for some heavy stock paper.
In the more complicated case of our client and server, chances are, the writing-to-disk mode is rarely executed. If the service is reliable enough, in fact, the client may not run that code for months at a time. So, even assuming you tested it when you first wrote that code (you did, didn't you? You forced the service to be down to make sure that this code was working correctly, right?), how do you know that it will still work? Six months later, the disk might be close to full. Or some config file might now be missing that told you which file to open, or some minor change in another part of the codebase accidentally messed up the writing-to-disk code.
Furthermore, keep in mind that this code is going to be running under duress. It's the unhappy case, right? If the remote service was down, well, then something nasty is probably wrong. Maybe it's affecting you too. This code could easily have negative side-effects that make recovery from this situation more difficult, such as filling up the local disk with messages and making all subsequent disk writes fail, etc. During an outage, this can make it much more difficult for the people diagnosing the problem to figure out what's wrong. They may start by assuming the problem is that the disk on your client machine is full and waste precious time that should be spent focusing on the server.
Lastly, let's call into question a really basic assumption that underlay the design: writing to disk is too slow, so we don't normally want to do it. Problem is, if it was too slow in the happy case, guess what, it's going to be too slow in the unhappy case. This kind of software engineering logical fallacy comes up a lot. I remember one project on my team had the fallback plan (when the in-memory cache was broken) of hitting the backend relational database directly. We even deployed production code that did this. Sure enough, the first serious failure took down the database (which happened to have other highly important stuff on it, unfortunately). Didn't do us any good. We could have avoided the whole problem if we'd just asked ourselves, if it's too slow to access the database, why would it be ok to access it when things start going wrong?
So what woiuld be the right thing to do in the writing-to-disk case? Pretty simple (ignoring the other issues with the design, of course). Always write to disk, then try to send the message. Yep, it is slower, on average. But if you can't afford the slowness now, why will you ever be able to afford it?
Let's now summarize all the problems with modes:
There are more cases to test, some of which can be hard because some service you talk to has to be down (or appear to be) in order to test.
The unhappy mode will be afflicted by code rot.
There's a logical fallacy that the undesirable mode is somehow "okay" when things are wrong; more likely it's the worst thing you could do.
If you've been a follower of Armando Fox's research into recovery-oriented computing, this may seem familiar. For example, check out his teams' research paper that describes a relatively mode-free service, Session State: Beyond Soft State.