There’s a poem that (literally) dates back centuries:
“For want of a nail, the shoe was lost.
For want of a shoe, the horse was lost.
For want of a horse, the rider was lost.
For want of a rider, the message was lost.
For want of a message, the battle was lost.
For want of a battle, the kingdom was lost.
And all for the want of a horseshoe nail.”
We can picture the limping horse and its desperate rider. But the message is simple: small things can break big things. And that’s true in software development.
Little errors can bite back
If you’re a programmer, you already know this. Big problems often start with small mistakes. Like the off‑by‑one error, the assumption everyone made, or the “this will never happen” part of the code that does happen. At 02:00 on a public holiday.
Here are some famous examples of small software mistakes that caused outsized problems.
The number that broke a rocket
On 4 June 1996, the first Ariane 5 rocket broke up and exploded about 40 seconds into its maiden flight. Why? A 64‑bit floating‑point value was converted into a 16‑bit signed integer. In other words, the number was too big to fit into the space allocated for it. It was like trying to pour all the tea I drink in one day into a single mug. Like the tea, the value overflowed. The error was never handled, so the guidance system shut down, and the rocket was lost.
The code was reused from Ariane 4, where the value never got that large. It was a safe assumption, until it wasn’t.
That small decision was enough. Data types matter.
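To make the idea concrete, here is a minimal sketch in Python. It is not the Ariane flight code (that was written in Ada), and the function names and sample values are made up for illustration. It shows three ways to squeeze a floating‑point number into a 16‑bit signed integer: check the range and fail loudly, clamp to the nearest limit, or wrap around silently.

```python
# Illustrative only: names and values are hypothetical, not from the Ariane software.
INT16_MIN = -2**15       # -32768
INT16_MAX = 2**15 - 1    #  32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    result = int(value)  # int() truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def to_int16_wrapping(value: float) -> int:
    """Convert the way a silent two's-complement wraparound would."""
    return (int(value) + 2**15) % 2**16 - 2**15


if __name__ == "__main__":
    small, large = 1234.9, 40000.0  # hypothetical sensor readings
    print(to_int16_checked(small))
    print(to_int16_saturating(large))
    print(to_int16_wrapping(large))
    try:
        to_int16_checked(large)
    except OverflowError as error:
        print("rejected:", error)
```

None of the three is right everywhere. The point is that someone has to choose, and the choice depends on how big the value can really get.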
The race that killed
A race condition happens when two parts of a system work on the same data at the same time without coordinating. The result depends on which part gets there first, so the outcome is unpredictable. That makes the bug hard to reproduce, because the timing has to be just so. And it is very difficult to debug.
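Here is a small sketch of a race in Python, assuming a shared counter that several threads update. The Counter class and the run helper are made up for this example, and the time.sleep(0) call is only there to widen the window in which another thread can cut in. Real races are usually much narrower, which is exactly why they are hard to catch.

```python
import threading
import time


class Counter:
    """A shared counter with one racy and one lock-protected increment."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        current = self.value       # read
        time.sleep(0)              # yield, so another thread may run here
        self.value = current + 1   # write back, possibly erasing another update

    def safe_increment(self):
        with self._lock:           # only one thread at a time past this point
            current = self.value
            time.sleep(0)
            self.value = current + 1


def run(method_name, threads=4, per_thread=1000):
    """Call the named increment method from several threads; return the total."""
    counter = Counter()
    increment = getattr(counter, method_name)

    def worker():
        for _ in range(per_thread):
            increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print("expected:", 4 * 1000)
    print("with lock:", run("safe_increment"))
    print("without lock:", run("unsafe_increment"))
```

The locked version always reaches the expected total. The unlocked one will usually fall short, by an amount that changes from run to run.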
The Therac-25 was a computer-controlled radiation therapy machine used to treat cancer patients. Between 1985 and 1987 it gave massive radiation overdoses to at least six patients, and several of them died. A race condition in its software was one of the causes.
There were other problems too, like "black box" software that the people operating the machine could not see inside. Today we have medical AI systems that are the ultimate in "black box" software. We should expect problems.
One line to break them all
A regular expression (aka regex) is a pattern that helps you find text. For example, you can look for an "@" sign and a "." to check if some text looks like an email address.
Regular expressions can solve very complex problems in a single line. But they often look like a cat tap-danced across the keyboard. And they are notoriously hard to read and debug.
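As a rough sketch, the email check described above could look like this in Python. The looks_like_email name and the pattern are my own illustration. It only asks whether the text has something, then an "@", then something with a "." in it. It does not validate real email addresses.

```python
import re

# Loose shape check: text, "@", text, ".", text, with no spaces or extra "@" signs.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(text: str) -> bool:
    """Return True if the whole string has the rough shape of an email address."""
    return EMAIL_SHAPE.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("ada@example.com", "ada@localhost", "not an email"):
        print(sample, "->", looks_like_email(sample))
```

Even this tiny pattern takes a moment to read, which gives a taste of why longer ones are so hard to debug.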
Cloudflare is one of the world’s leading internet security companies. On 2 July 2019, Cloudflare went down for about 30 minutes. And a large part of the internet went down with it.
The cause? A single regular expression in a new firewall rule. It backtracked so heavily that it used up the CPU on Cloudflare’s servers. One line. Global outage.
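Cloudflare's own write-up blamed excessive backtracking: the regex engine kept retrying different ways to match the same text. The sketch below shows that effect in Python with a textbook pattern. It is not Cloudflare's actual rule, and the timed_match helper is made up for the example. The input sizes are kept small on purpose, because each extra character roughly doubles the work for the nested pattern.

```python
import re
import time

# Textbook pathological pattern, NOT Cloudflare's actual rule.
# The nested quantifiers let the engine split a run of "a"s in very many ways.
SLOW = re.compile(r"(a+)+$")
# Matches the same strings, but without the nested repetition.
FAST = re.compile(r"a+$")


def timed_match(pattern, text):
    """Return (matched, seconds) for one match attempt from the start of text."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    for n in (12, 14, 16, 18):
        text = "a" * n + "b"  # the trailing "b" forces the match to fail
        _, slow_seconds = timed_match(SLOW, text)
        _, fast_seconds = timed_match(FAST, text)
        print(f"n={n}: nested {slow_seconds:.4f}s, flat {fast_seconds:.6f}s")
```

Both patterns give the same answer. Only the nested one gets dramatically slower as the input grows, and that is the kind of behaviour that can pin a CPU.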
Do you have a scary software story to share? Or would you like to hear more of these? I’d love to hear from you.