Extreme Reliability
Yes, failure is an option
It’s simple: if we want to write software that doesn’t go wrong, we have to stop writing software that’s designed to go wrong.
Software has a reputation for being tricky. It feels brittle when we’re building it, and it fails incomprehensibly in use. We sadly shake our heads and say “We can never make it perfect”. But there’s a good reason why software is like this: for as long as we’ve been creating it, our programs have been stuffed with designed-in failure modes, every one placed there by a conscientious programmer, and every one amplified through larger and larger contexts until it becomes an irreversible collapse. This cascade of failures is what gives software its delinquent character.
Here’s an idea: let’s not do that!
When a computer runs your program, very little of what it executes is what you have written. Every program you write depends on a blizzard of third-party software assets: library functions used by the compiler, virtualisations provided by the operating system, and of course, a constantly-ballooning mass of dependencies.
Each of those assets is intended to do some specific job — read a file, schedule a thread, respond to an endpoint request. Each one makes assumptions about the state of the system — the file is accessible, the thread is runnable, the endpoint is listening on a port. But these assumptions are almost always beyond the control of either your program or the asset itself. Every time you use an asset it needs to check that its preconditions are satisfied, and report back to the caller if not.
In the old days, that report would be by means of an error code; in modern software, the report is carried by an exception. Either way, the effect is that the asset reneges on the task it has advertised, and instead makes you responsible for fixing whatever is upsetting it. Trouble is: you’re not the expert here (otherwise, why would you be using a third-party asset?). All you can do is pass the error back through your program, obliterating larger and more abstract contexts, like falling dominoes, until finally someone is prepared to catch the error and attempt some kind of restart, or the program (or worse, one of its threads) is killed off.
Suppose we want to open and read a text file. The current best-practice way to do that would be something like this:
void readFile (string filename) {
    reader = FileReader();             // might throw: out of memory
    reader.open (filename);            // might throw: not found, permissions error, etc
    try {
        while (!reader.atEof()) {      // might throw: file disappeared
            line = reader.readLine (); // might throw:
                                       //   file has been altered by someone else,
                                       //   or the file is binary and we can't find eol markers
            processLine (line);
        }
    }
    catch (e) {
        rethrow e;                     // tell the caller about the failure
    }
    finally {
        reader.close();                // let's hope this doesn't throw!
    }
}
We’ve unpacked the file reading process into each of its constituent steps. At each step something can go wrong, and each error is reported by throwing an exception. The reasoning goes: if (for example) the file-opening doesn’t work properly, that’s a failure, and the file-reading task that depends on it must also fail. In this view, an exception is entirely the correct way to signal that failure.
But there’s a problem with this approach. All those things that can go wrong with the file: why should they be the caller’s job to fix? Even if the FileReader’s errors were fixable, by the time the exception has bubbled up to readFile (or to whatever has called it), the chance to fix it is long gone: the FileReader (and all the intervening state) has been obliterated by the stack’s tear-down. The only option available to the exception handler is to abandon the task, perhaps attempt to restart it from scratch, and send some cryptic, system-level message to the user.
This, right here, is how failures domino into brittleness — by turning perfectly anticipatable states into errors, treating those errors as failures, and then propagating the failures outwards, beyond any possibility of remediating the original state.
But it doesn’t have to be that way! To some extent, it doesn’t really matter what the error was: the occurrence of any error prevents the processing of any further lines. So, what if we had a file reader that reads as far as it can, no matter the state of the file, and then terminates normally?
Take a look at this:
void readFile (string filename) {
    reader = TextReader (filename);
    foreach line in reader.lines {
        processLine (line);
    }
}
The first thing you’ll notice is that this is much shorter and simpler than the previous example. There’s no complex state management (is the file open, closed, not yet assigned, etc) and there’s no complex flow or “babysitting” (such as the rethrow) which doesn’t really have anything to do with the task of processing lines from a file. It’s just: TextReader internalises a (notionally) readable file, and represents its contents as an iterable of strings.
In this formulation, there are no exceptions, nor even any errors. If the file turns out to be blank, the iterator would be empty. If the file becomes damaged or stale during the iteration, the iteration would terminate there. But you’d still get a perfectly valid TextReader even if the indicated file fails to open at all; the iterator would, again, be empty. And if the TextReader can’t even be constructed, what you’d get is a nullary TextReader whose iterator also appears empty.
Here there is no sense of a failure, nor even of an error: every different eventuality is handled on the same, happy path. There is no failure to start a crack in the program — no thrown exception, no error code, no abnormal exit at all. With no failure, there is nothing to propagate, so there’s no possibility of the brittleness that the previous code sample exemplified. It just works, normally, every time. This is inherently stable code.
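To make the contract concrete in a real language, here’s a minimal Python sketch of the same idea. (This TextReader is a hypothetical stand-in, not a library class.) Any problem opening or reading the file simply ends the iteration early, so the caller only ever sees a possibly-empty sequence of lines.

```python
from typing import Iterator

class TextReader:
    """Never raises: a file that can't be read behaves like an empty file."""

    def __init__(self, filename: str):
        self.filename = filename

    def lines(self) -> Iterator[str]:
        try:
            with open(self.filename, encoding="utf-8") as f:
                for line in f:
                    yield line.rstrip("\n")
        except (OSError, UnicodeDecodeError):
            return  # missing, unreadable, or damaged: the iteration just ends

# A missing file is indistinguishable from an empty one:
assert list(TextReader("no-such-file-8f3a.txt").lines()) == []
```

Because the `try` wraps the whole loop, a file that becomes unreadable partway through still yields every line read up to that point, and then terminates normally.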
“Ah, but!” you complain, “You can’t just pretend that everything is proceeding normally. How do you deal with a file-opening error?”
There are several answers to that. The first is: we don’t deal with the error at all because there never was an error. There was a perfectly normal exigency. That’s already been dealt with: it appears to be an empty file.
The second answer is: why has this happened? Suppose we’ve asked for a file for which we’re not yet authenticated. Rather than treating that as a failure (and handing the failure back to a caller who is manifestly incapable of fixing it), wouldn’t it be better to pause the file-opening, attempt to authenticate then and there, and resume where we left off? Like this:
void readFile (string filename) {
    reader = TextReader (filename)
        .onNeedsAuthentication ((filename) => {
            switch await signIn (filename) {
                case success: return Retry;
                case userUnknown: return Default;
                case userRejected: return Default;
            }
        });
    foreach line in reader.lines {
        processLine (line);
    }
}
Here, we’re equipping reader with a callback function which is used in the event of a missing authentication. The program gets an opportunity to secure a suitable authentication, and depending on the result of that it either re-attempts the file-opening (Retry) or else it proceeds as if there had been no intervention (Default) — which, as we’ve already seen, shows just an empty file.
I want to make it absolutely clear: a missing authentication is not an authentication error, much less a failure. In order to access the file, we need authentication, but the fact that we’re not currently authenticated is not (yet) a matter of historical record. We call the remediator function like we would call an interrupt, to try to supply the authentication when we discover we need it. If we authenticate correctly, everything proceeds normally. If we don’t authenticate successfully, then we can’t read the file, but still everything proceeds normally — the file just appears empty.
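The shape of this remediator can be sketched in Python. (The class and method names here are hypothetical, and I’m using PermissionError as a stand-in for the “missing authentication” condition.) The key point is that the remediator is consulted at the moment of the failed open, while a retry is still possible:

```python
from enum import Enum, auto
from typing import Callable, Iterator

class Remedy(Enum):
    RETRY = auto()
    DEFAULT = auto()

class TextReader:
    """Never raises: open problems are remediated on the spot, or yield nothing."""

    def __init__(self, filename: str):
        self.filename = filename
        self._on_needs_auth: Callable[[str], Remedy] = lambda _f: Remedy.DEFAULT

    def on_needs_authentication(self, fn: Callable[[str], Remedy]) -> "TextReader":
        self._on_needs_auth = fn
        return self  # chainable, as in the pseudocode above

    def lines(self) -> Iterator[str]:
        while True:
            try:
                f = open(self.filename, encoding="utf-8")
                break
            except PermissionError:
                # not yet an error: give the remediator a chance, right now
                if self._on_needs_auth(self.filename) is Remedy.RETRY:
                    continue  # re-attempt the open
                return        # Default: behave like an empty file
            except OSError:
                return        # any other open problem: an empty file
        with f:
            for line in f:
                yield line.rstrip("\n")
```

A caller that wants to sign in on demand supplies a callback returning Remedy.RETRY on success; a caller that doesn’t care supplies nothing and gets the empty-file behaviour.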
These remediators can handle most eventualities, they can send messages to logging services, and they can even convey information back to the caller so the caller can understand what has happened during the file reading (for example, onNeedsAuthentication can tell readFile whether the authentication was successful or not). But we can, if we want, ignore all that: there’s a third way for the client to understand what has happened:
void readFile (string filename) {
    reader = TextReader (filename);
    result = foreach line in reader.lines {
        processLine (line);
    }
    switch (result) {
        case completedNormally: return;
        case noAuthentication:  // redirect to login process
        case networkError:      // do something else...
    }
}
In this case, every time the iteration finishes, it returns a code representing the manner of its completion. If the completion is such that the program really can’t continue on its happy path, you can use the completion to adapt the program’s behaviour: redirect the user to a login screen, or switch a car into limp-home mode. (You could even try to remediate the context somehow and have another shot — in the manner of longjmp, or Eiffel’s exception model — but really, why would you bother?)
Again, to be absolutely clear: the completions aren’t failures. They look a lot like conventional error codes, but really that’s not what they are, because you can safely ignore them. By the time the completion becomes available, all the remediation that’s going to happen has already happened. They’re not exceptions either, so they don’t add any brittleness. When you use these codes to mode-switch the program, you’re doing it explicitly, in only one possible location: all the hidden complexity of exception handling is completely obviated.
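Here’s one way the completion-code idea can be sketched in Python. (LineIteration and the Completion values are hypothetical names, not a real API.) The completion is just an attribute, set during iteration, that the caller is free to inspect afterwards or ignore entirely:

```python
from enum import Enum, auto
from typing import Iterator

class Completion(Enum):
    COMPLETED_NORMALLY = auto()
    NO_AUTHENTICATION = auto()
    NOT_FOUND = auto()

class LineIteration:
    """Iterates a file's lines and records how the iteration ended; never raises."""

    def __init__(self, filename: str):
        self.filename = filename
        self.completion = Completion.COMPLETED_NORMALLY

    def __iter__(self) -> Iterator[str]:
        try:
            f = open(self.filename, encoding="utf-8")
        except PermissionError:
            self.completion = Completion.NO_AUTHENTICATION
            return
        except OSError:
            self.completion = Completion.NOT_FOUND
            return
        with f:
            for line in f:
                yield line.rstrip("\n")

def read_file(filename: str) -> None:
    lines = LineIteration(filename)
    for line in lines:
        pass  # processLine(line)
    # the completion is available only here, after the loop finishes
    if lines.completion is Completion.NO_AUTHENTICATION:
        pass  # redirect to a login process
```

Note that ignoring `lines.completion` is perfectly safe: the loop simply runs over however many lines were readable.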
I know what you’re thinking: “That looks all very well, but I bet there’s some complicated jiggery-pokery going on inside TextReader. Surely that’s got all kinds of complex failure modes, just hidden away where we can’t see them.”
Let’s take a look at a simple first cut of TextReader:
class TextReader (string filename) {
    public enum Completion = inStream.Completion;
    init string => RetryOrDefault onNeedsAuthentication = {return Default;}
    // ...etc for other callbacks.

    [Iterable <string>]
    Completion lines () {
        reader = inStream (filename)    // open a new stream each time, so re-entrant
            .authenticate (onNeedsAuthentication);
            // ... etc, for any other remediations
        completion = foreach substream in reader.split ('\n') { // split into an iterable of (sub)streams
            yield substream.toString();
        }
        return completion;
    }
}
This is only a sketch, of course, but the principle should be obvious. inStream is a readable stream of bytes, and all the file reading is deferred to it. It makes a very similar promise to the original TextReader: no failures, no exceptions. We’ve outsourced all of TextReader’s remediations to it.
The input file is split into separate lines by inStream’s split method (you really don’t need me to unpack that, do you?), and then each substream is converted into a string and sent back to the caller.
The only unfamiliar syntax is the [Iterable<>] decoration. lines is a perfectly normal function, returning a value of type Completion. But, in addition, it also provides Iterable semantics: it can be the target of foreach, and it can yield values before it returns.
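If that dual shape seems exotic, note that Python’s generators already have it: a generator can yield values and also return a final value, which arrives packaged in the StopIteration that ends the iteration. A minimal illustration (the names here are mine, not part of any library):

```python
from typing import Generator

def lines() -> Generator[str, None, str]:
    yield "first"
    yield "second"
    return "completedNormally"   # the generator's completion value

def drain(gen):
    """Collect yielded values and capture the generator's return value."""
    out = []
    while True:
        try:
            out.append(next(gen))
        except StopIteration as stop:
            return out, stop.value

values, completion = drain(lines())
# values == ["first", "second"], completion == "completedNormally"
```

So a function really can behave as an iterable of strings while still delivering a Completion to whoever cares to collect it.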
Everything here looks straightforward: so long as inStream doesn’t fault or throw, neither will TextReader. But there is a gotcha lurking in here: toString might not work properly. Imagine if the file is not really a text file, but is actually a binary file without any end-of-line markers. toString might try to interpret the stream as UTF8, but could encounter illegal byte combinations, or it may be that the whole file is too large to fit in the available memory and the string can’t be constructed.
Normally, these would trigger errors inside toString, but this toString, like inStream, has been constructed to be reliable. If it can’t perform the operation asked of it then it will return an un-normal object: something that behaves like "", but also identifies the nature of the error. Like this:
source = inStream (malformedFileName); // Suppose it's malformed UTF8
s = source.toString ();
if (s == inStream.Utf8Error) print ("I found a UTF8 error: stream is not readable");
// continue, assuming there's nothing wrong with s
print (s); // prints the equivalent of ""
You can see that TextReader.lines will work fine even in the presence of such “errors” — each malformed line, and each non-constructible string, will be yielded as a simple blank line. The client can check the string it receives if it cares why the string is empty (and, if it doesn’t care, can treat it just the same as any other empty string).
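An un-normal string is easy to model in Python: subclass str so the object really is the empty string for every ordinary purpose, but carries its own explanation. (UnNormalStr, to_string and the reason codes are hypothetical names for this sketch.)

```python
class UnNormalStr(str):
    """Behaves exactly like "" in ordinary use, but remembers why it's empty."""

    def __new__(cls, reason: str):
        self = super().__new__(cls, "")   # always the empty string
        self.reason = reason
        return self

UTF8_ERROR = UnNormalStr("utf8-error")

def to_string(raw: bytes) -> str:
    """A reliable toString: never raises, returns an un-normal on bad input."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return UTF8_ERROR   # an empty string that identifies the problem

s = to_string(b"\xff\xfe not valid utf-8")
# Clients that care can ask; clients that don't can treat it as "":
if isinstance(s, UnNormalStr):
    print("I found a UTF8 error:", s.reason)
print(len(s))  # 0 -- it really is empty
```

Because UnNormalStr compares equal to "", has length zero and concatenates like "", code that ignores the distinction keeps working unchanged.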
But we can do better: we can control the iteration based on those un-normals:
class TextReader (string filename) {
    public enum Completion = inStream.Completion + { OutOfMemory };
    init string => RetryOrDefault onNeedsAuthentication = {return Default;}
    // ...etc for other callbacks.

    [Iterable <string>]
    Completion lines () {
        reader = inStream (filename)
            .authenticate (onNeedsAuthentication);
            // ... etc, for any other remediations
        completion = foreach substream in reader.split ('\n') {
            s = substream.toString();
            if (s == string.null) return OutOfMemory;
            yield s;
        }
        return completion;
    }
}
If toString can’t acquire enough memory to construct s, it returns string.null (which is guaranteed to be available). When we see that, we break out of the iteration, and report the problem to the caller (using OutOfMemory, which we’ve added to Completion).
But, please note: just as you saw in readFile, everything is still running exactly how it was designed to, on the happy path! There are no special “error-handling” branches, no exceptions flying around. We simply break out of the iteration, unravelling the contexts exactly as they’re supposed to be unravelled. If the client cares about why the iteration has finished, we’ve provided an explanation in the return value, but it’s perfectly safe for the client to ignore that.
In this example, we’ve made the out-of-memory condition a one-time deal: if it happens, we just stop what we were doing. But perhaps there’s something the client can do to remediate this? Perhaps the client could release some un-needed state, free up some memory, and have another go? I can feel another remediation coming on!
class TextReader (string filename) {
    public enum Completion = inStream.Completion + { OutOfMemory };
    init string => RetryOrDefault onNeedsAuthentication = {return Default;}
    // ...etc for other callbacks.
    init inStream => RetryOrDefault onMemoryOverflow = {return Default;}

    [Iterable <string>]
    Completion lines () {
        reader = inStream (filename)
            .authenticate (onNeedsAuthentication);
            // ... etc, for any other remediations
        completion = foreach substream in reader.split ('\n') {
            repeat {
                s = substream.toString();
                if (s == string.null) {
                    switch (onMemoryOverflow (substream)) {
                        case Retry: continue;         // repeat
                        case Default: return OutOfMemory;
                    }
                }
                yield s;
                break; // out of the repeat
            }
        }
        return completion;
    }
}
Here, if toString can’t construct s, instead of just breaking out of the iteration, it invokes a remediation called onMemoryOverflow. The caller can provide a remediator for it which can attempt to free up some memory, and then instruct the iterator to Retry whatever it was trying to do. (The callback provides the inStream parameter so that the remediator can work out how much memory it needs to find.) If the remediator can’t free up enough memory (and, in any case, if the default remediator is not overridden) the Default processing closes down the iteration just as before.
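The retry loop itself is worth seeing in runnable form. Here’s a Python sketch of just that inner mechanism (the function and parameter names are hypothetical; a substream is stood in by any object, and None stands in for string.null):

```python
from enum import Enum, auto
from typing import Callable, Iterator, Optional

class Remedy(Enum):
    RETRY = auto()
    DEFAULT = auto()

def lines_with_retry(
    substreams: list,
    to_string: Callable[[object], Optional[str]],
    on_memory_overflow: Callable[[object], Remedy] = lambda _s: Remedy.DEFAULT,
) -> Iterator[str]:
    """Yield each substream as a string; on a failed conversion, consult the remediator."""
    for substream in substreams:
        while True:                          # the pseudocode's `repeat`
            s = to_string(substream)
            if s is None:                    # stand-in for string.null
                if on_memory_overflow(substream) is Remedy.RETRY:
                    continue                 # remediator freed memory: try again
                return                       # Default: close down the iteration
            yield s
            break                            # out of the repeat

# A to_string that fails on its first call, then succeeds:
attempts = {"n": 0}
def flaky_to_string(substream):
    attempts["n"] += 1
    return None if attempts["n"] == 1 else str(substream)

result = list(lines_with_retry([1, 2], flaky_to_string,
                               on_memory_overflow=lambda _s: Remedy.RETRY))
# result == ["1", "2"]: the first failure was remediated and retried
```

With the default remediator the same failure would instead end the iteration quietly, exactly as in the earlier version.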
We’ve just seen three major principles, which between them represent a way to restore stability to our pathologically unstable programs:
- First is the idea that there are no errors, no failures. There are only states. If something about the state prevents us from doing what we wanted to do, we fall back into a trivial, null-like behaviour: a file that won’t open behaves identically to a file that opens but is empty; a stream that can’t be stringified returns an empty string.
- Second is the idea that the problematic state is accessible and remediable while it is still a solvable problem. If the file won’t open because it doesn’t exist, then we can give the user the opportunity to create it; if it won’t open because we don’t have the right permissions, we can try to get those permissions. The problematic state does not become a historical fact until after the remediation functions have had a chance to put things right.
- Finally, if we want to know why a null-like state has appeared (such as a file appearing blank, or a string appearing empty), we can ask it. We can ask a TextReader (or an inStream) whether it represents a real, existing file; we can ask an un-normal string what kind of un-normal it is. More particularly, we can ask a file’s enumerator for its status only after it has finished enumerating: that’s the only location where that knowledge is provided — it certainly does not turn into an exception going who-knows-where.
These principles are important because they have two crucial properties that exceptions don’t:
- The first is: they’re local. Unlike an exception, which may be caught (if it’s caught at all) at any of the contexts leading up to the point of failure, a remediation exists in only one location, at the point of use. If you want to provide a remediation for an unreadable file, or you want to ask whether the file was unreadable, you do it where you read the file, and nowhere else.
- The second is: they’re optional. When an exception is thrown, somebody, somewhere has to catch it, otherwise the program (or — what’s worse — just one thread of it) will crash. But in this approach, if you ignore every “error”, the program will continue to work just as it was designed to.
This second point is particularly important during maintenance. If a dependency of yours adds a new exception to its welter of other failure modes, you have to know about it and be prepared to catch it, otherwise your program has just acquired a new brittle failure of its own. Without checked exceptions (which we all hate, don’t we?) there’s no way to keep track of this. But, using this approach, if your dependency creates a new remediation and you ignore it, your program will still continue to work just as it was designed to. You can create a remediator for the new remediation when — and if — you ever need to care about it.
I started by claiming that software is unstable because our current best practice amplifies errors into catastrophic failures. In code which contains no errors to amplify, that process can’t even get started; and by building systems that are self-stabilising and resilient even in the face of invalid preconditions, we can damp down any wobbles in the system before they even become errors. If we create software that definitively doesn’t have any failure modes then, obviously, it can’t ever fail!
Thanks for reading! If you liked the ideas in this article, there’s a more thorough exploration of this topic in my book, Extreme Reliability. You might also want to clap the article, and consider following me.