EXTREME RELIABILITY

Software Wormholes

Jules May
9 min read · Mar 21, 2025


We think that bugs strike our programs unpredictably, randomly. But it turns out that real bugs usually follow a very small number of predictable anti-patterns. One in particular — the Wormhole — is responsible for more than 90% of production malfunctions. Here’s how to recognise it, and what to do about it.

A very dramatic wormhole: AI-generated by the author

Let me tell you about the worst codebase I ever worked on.

It was a modest desktop application. It had started as the personal project of a domain specialist — a few thousand lines of FORTRAN — but over five years of work by a team which eventually rose to about 25 developers, it had reached some 2½ million lines in a mixture of languages: Delphi, Delphi.net, and C#, with some bits of C++ thrown in for good measure. The longest method was some 32,000 lines of Delphi.net which had been translated from the original FORTRAN.

At the time I got involved, the bug list was up to 4500 unique items. Velocity had slowed to a crawl — two-week feature builds would typically take six months, if they delivered at all. We actually measured our progress on some of these features, and found that for every line that spoke to the feature, another fifty-odd were about whacking bugs that were popping up in other parts of the program and preventing the feature’s delivery.

A quick calculation suggested that, if we were to stop all feature development and put the entire team onto fixing bugs and clearing out the technical debt, we’d get through most of the major bugs in about three hundred years. I had never (and to this day have never) seen a codebase so obviously and irretrievably bankrupt. Nevertheless, killing it off wasn’t an option, and the client was very sceptical about the wisdom of a rewrite (not least because they had no reason to believe the same problem wouldn’t surface again).

I should have had more sense. I should have quit. But instead, I wondered: since, clearly, we weren’t able to take bugs out of our code, perhaps we could avoid putting them in in the first place. If we can’t rescue the existing code, perhaps we can strangle it out of existence with something better.

We combed through the source control (no easy task: these were pre-Git days) trying to understand the causes of the bugs we were introducing so furiously. The idea was: we’d look at each checked-in bug-fix, and try to work out where and how that bug had originally been introduced.

We made a lot of interesting discoveries in this process. Here are the top three:

  • Most of the bugs that needed fixing were coding errors. (Logical errors and misunderstandings of the task seemed to be obvious as the code was being written, and never made it into production.) Not only that, nearly all followed a small handful of recognisable anti-patterns. One pattern — what we came to describe as a “Wormhole” — significantly outnumbered all the others.
  • We partitioned the causes into first- and second-generation bugs. First generation bugs were those introduced by a feature development, and second-generation bugs were those introduced during an attempt to fix some other previously-identified bugs. In this codebase, the second-generation bugs outnumbered the first generation about 3 to 1. It turned out: fixing bugs created more technical debt than all other causes combined.
  • Virtually every second-generation bug was a wormhole.

When we first saw these results, we didn’t believe them. But the more we looked, the more the figures told the same story. Since then, I have looked at innumerable codebases, both proprietary and open-source, and I see the same pattern over and over again. About 75% of first-generation bugs, and about 99.9999% of second-generation bugs, are caused by one single anti-pattern: the wormhole. Invariably, second-generation bugs outnumber the first generation.

So what, exactly, is a wormhole? It comes in lots of forms, from the trivial to the highly cryptic, but what they all have in common is: there’s a piece of code over here, which changes its behaviour based on some condition. There’s another piece of code over there, which changes its behaviour based on ostensibly the same condition. It’s called a wormhole because there are two (or more) locations in the code, potentially widely separated and with no obvious connection between them, which nevertheless are supposed to move as if they’re bound together in intimate contact.

Here’s a very simple example. Suppose we’re building a currency class. It’s basically a fixed-point decimal, but we want to modify its string representation a little:

class Currency<n> : fixed<n> {
    override string toString () {
        if (this < 0.0) return (-this).toString() + "DR";
        return base.toString() + "CR";
    }
}

But we’re not really satisfied with that. We’re producing a report, and we want to highlight the overdrawn balances, so (elsewhere):

string drawHighlighted (Currency<_> c) {
    if (c < 0.0) return "<span class='red-text'>" + c.toString() + "</span>";
    return c.toString();
}

That repeated if (c < 0.0): that’s the wormhole. These two fragments of code should move in synchrony: either it’s got DR appended and it’s shown in red-text, or it hasn’t and it’s not.

The reason why this pattern is so dangerous is: there’s no obvious connection between these two fragments of code. They may be separated in a file, or they may be in different files. You can even imagine variations of this in which they’re in different assemblies altogether. But there’s nothing in either one to tell you of the existence of the other.

Let’s try a little maintenance. We want to add some more special-casing to Currency, this time for a zero value:

class Currency<n> : fixed<n> {
    override string toString () {
        if (this < 0.0) return (-this).toString() + "DR";
        if (this == 0.0) return "nil"; // this is the addition
        return base.toString() + "CR";
    }
}

There’s at least one other locus where the value of the currency matters: in drawHighlighted(). Perhaps it’s OK to leave that alone, or perhaps we wanted something more like:

string drawHighlighted (Currency<_> c) {
    if (c < 0.0) return "<span class='red-text'>" + c.toString() + "</span>";
    if (c == 0.0) return "";
    return c.toString();
}

The point is: somebody should have been able to check what changes were needed at this location (and all the other locations where the value of the currency is significant). There’s no way to find them from the code itself, and clearly the existing unit tests are no help at all. Unless the developer is so familiar with the entire codebase that he doesn’t even need to look, these two loci have now become detached from each other. In all probability, that’s a bug.

It gets worse! In this illustration, it was clear what the repeated condition was: something to do with the value of the Currency. But often, the shared state is not at all clear.

Here’s an example. Suppose you have a service which periodically logs some state:

// do some work...
msg = prepareLogMessage (...);
Logger.log (msg); // using some globally-accessible Logger object
// ... etc

Easy enough. But for this to work, the logger object must have been created, initialised, and (potentially) connected to its sink. So, elsewhere, there will be code like this:

init() {
    // ...
    if (configuration["logger"] != null) {
        Logger = new systemLogger (...);
        Logger.connect (configuration["logger"]);
    }
    // ... etc
}

But now, the service code above is incomplete. It should only log the message if logging is intended. Let’s fix that:

// do some work...
if ( /* logging_is_intended */ ) {
    msg = prepareLogMessage (...);
    Logger.log (msg); // using some globally-accessible Logger object
}
// ... etc

What should be in the condition? Should it be if (configuration["logger"] != null)? Maybe if (Logger is not null)? Perhaps if (Logger.isConnected())? Ideally we want to log a message only if we have a logging service which is working as intended, but that idea isn’t surfaced anywhere in the code. And if the logger is intended but is not yet connected, what should the system do then? Attempt to reconnect? Fall back to some local emergency text file?

It turns out there’s actually a lot of logic, complexity, and hidden state behind that simple decision: to log, or not to log. That hidden state represents a wormhole, but now it’s invisible, and each locus has to guess how to behave from whatever clues it can find. That makes it much harder to get right in the first place, and almost impossible to maintain.
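
To make that guessing concrete, here is a minimal sketch in C#. The names (App, SystemLogger, Init, ServiceA, ServiceB) are hypothetical, invented for illustration: two call sites each infer the hidden "is logging intended and working?" decision from different clues, and their guesses diverge as soon as a logger is configured but its sink never connects.

using System.Collections.Generic;

public class SystemLogger {
    public bool IsConnected { get; private set; }

    public void Connect(string sink) {
        // A real logger might fail here; in this sketch an empty sink means "not connected".
        IsConnected = !string.IsNullOrEmpty(sink);
    }

    public void Log(string message) { /* write to the sink, if any */ }
}

public static class App {
    public static Dictionary<string, string> Configuration = new Dictionary<string, string>();
    public static SystemLogger Logger; // created (or not) by Init()

    public static void Init() {
        if (Configuration.ContainsKey("logger")) {
            Logger = new SystemLogger();
            Logger.Connect(Configuration["logger"]);
        }
    }

    public static void ServiceA() {
        // Guess #1: "logging is intended" means a logger appears in the configuration.
        // Crashes if Init() hasn't run yet; logs into a dead logger if the sink never connected.
        if (Configuration.ContainsKey("logger"))
            Logger.Log("ServiceA did some work");
    }

    public static void ServiceB() {
        // Guess #2: "logging is intended" means the Logger exists and is connected.
        // Silently drops messages that Guess #1 would have tried to log.
        if (Logger != null && Logger.IsConnected)
            Logger.Log("ServiceB did some work");
    }
}

The wormhole is still there: both services are bound to the same hidden decision, but nothing in either of them points at the other, or at Init().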

Debugging inevitably breeds wormholes. The evidence is unequivocal, but why?

Have you ever watched somebody debugging? Carefully? Generally what they’ll do is run the program (typically in a debugger) until they elicit the malfunction behind the bug report, then back up a step to find the state which triggered the malfunction. What they’ll then do is put in some guard condition — right there — to protect the malfunctioning unit from the delinquent state: either fake the state into something more acceptable to the unit, shim the unit to provide some default result, or else add some special-casing to the malfunctioning unit itself. This is how the wormhole starts to form.
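
To make those three moves concrete, here is a minimal sketch in C#; the routine and its sentinel values are hypothetical, chosen only to show the shape each guard takes.

using System;

public static class ReportRenderer {
    public static string DrawHighlighted(decimal balance) {
        // Debugger fix, option 1: fake the delinquent state into something
        // the unit accepts ("a balance that large can't be right").
        if (Math.Abs(balance) > 1000000m) balance = 0m;

        // Debugger fix, option 2: shim the unit to return a default result
        // for the sentinel value that was spotted in the debugger.
        if (balance == decimal.MinValue) return "";

        // Debugger fix, option 3: special-case the unit itself for the one
        // input that was misbehaving in the bug report.
        if (balance == -0.01m) return "0.00";

        // Original behaviour of the unit.
        if (balance < 0m) return "<span class='red-text'>" + balance + "</span>";
        return balance.ToString();
    }
}

Each of those guards quietly encodes an assumption about state that is produced somewhere else entirely, so each one is a brand-new wormhole locus.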

In all but the most trivial cases, the delinquent state was created by a defect somewhere else in the code, many steps before where the developer has been working. That defect is still there, still generating the delinquent state and feeding it to other consumers as yet undiscovered. Sooner or later, another malfunction will pop up, another maintainer will add another condition to suppress it. Our wormhole is now metastasising.

So now let’s get a senior developer involved. They will elicit the malfunction as their predecessors have, and will trace the delinquent state right back to its defect. If they’re lucky, the state is genuinely erroneous and it can be fixed once and for all. But more often, the state isn’t erroneous at all: it’s unanticipated. It may be easy to modify both the source and the malfunctioning unit to handle the new state gracefully, but in general it’s impossible to track down all the other locations where that state has an effect. The original defect has gone, but now more bugs are exploding all over the program (and, I’d bet, there are yet more still lying in wait). This is how the bug-splatting game begins.

The if is one of the smallest and most common words in any programming language, and there are few languages that don’t have a version of it — conditions are the very essence of computer programming! But it turns out that undisciplined ifs, and the wormholes they create, are responsible for more wasted debugging time, and more production malfunctions, than all other causes combined.

Wormholes are not inevitable, and it’s possible — even easy — to rearrange wormholes into safe code. Take the two examples we saw earlier:

  • Currency<> can expose an enum characterising whether it’s positive, negative, or zero (or even undefined, if you wish). It’s easy to find every occurrence of Currency<>.sign, and if you check them with switch, most decent compilers will make sure you’ve tested every case. (Both remedies are sketched after this list.)
  • Logger can represent itself as an agent which autonomously provides its own caching and reconnection, a shim around the concrete logging services and their configuration. Now the service doesn’t need to check the existence or state of Logger: it’s always there and always prepared to accept messages.
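
Here are minimal sketches of both remedies in C#. The names and details (CurrencySign, the Connect signature, the in-memory buffering) are mine, assumptions for illustration rather than a definitive implementation.

using System;
using System.Collections.Generic;

// Remedy 1: surface the hidden condition once, as a single searchable member.
public enum CurrencySign { Negative, Zero, Positive }

public readonly struct Currency {
    private readonly decimal value;
    public Currency(decimal value) { this.value = value; }

    public CurrencySign Sign =>
        value < 0m ? CurrencySign.Negative :
        value == 0m ? CurrencySign.Zero :
        CurrencySign.Positive;

    public override string ToString() => Sign switch {
        CurrencySign.Negative => (-value) + "DR",
        CurrencySign.Zero => "nil",
        CurrencySign.Positive => value + "CR",
        _ => throw new InvalidOperationException()
    };
}

public static class Report {
    // Every locus that cares about the sign now names Currency.Sign explicitly,
    // so a text search finds them all and the switch can be checked for missing cases.
    public static string DrawHighlighted(Currency c) => c.Sign switch {
        CurrencySign.Negative => "<span class='red-text'>" + c + "</span>",
        CurrencySign.Zero => "",
        CurrencySign.Positive => c.ToString(),
        _ => throw new InvalidOperationException()
    };
}

// Remedy 2: the Logger as an agent that is always safe to call. Callers never
// test its existence or connection state; it buffers while disconnected and
// flushes once a sink appears.
public static class Logger {
    private static readonly object gate = new object();
    private static readonly Queue<string> pending = new Queue<string>();
    private static Action<string> sink; // null until Connect() is called

    public static void Connect(Action<string> writeToSink) {
        lock (gate) {
            sink = writeToSink;
            while (pending.Count > 0) sink(pending.Dequeue());
        }
    }

    public static void Log(string message) {
        lock (gate) {
            if (sink != null) sink(message);
            else pending.Enqueue(message); // cache until a sink is connected
        }
    }
}

With the agent in place, the service code collapses back to a bare Logger.Log(msg): the decision about whether, and where, that message ends up now lives in exactly one place.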

Wormholes are a kind of coupling (albeit one that isn’t visible to code-health metrics). We all know that coupling is bad, whatever form it takes. Eliminating wormholes is only ever a good thing.

We need to get our instincts tuned — our smell-o-vision dialled up — so we can spot wormholes just from their shadows, and we need to apply some “wormhole-hygiene” to factor them out (and, of course, to avoid putting them in in the first place). That way, we’ll be able to eliminate 95% of all our production bugs in one, easy, almost mechanical step.

Postscript: Since writing this, I have become aware of Thoughts on.. Wormhole anti-pattern by William Caputo, from way back in 2004. I think he’s describing pretty much the same idea as mine, but from a different standpoint. So, in the interests of fairness, disclosure, and perspective, I’m referencing his post.

Thanks for reading! If you liked the ideas in this article, there’s a more thorough exploration of this topic in my book, Extreme Reliability. You might also want to clap the article, and consider following me.

Written by Jules May

A software developer and consultant specialising in reliability, compilers, and mathematical software.
