Systems · Feb 16, 2026 · 6 min read

Backups you have never restored are a rumor

You don't have backups; you have restores you've successfully performed. Everything else is wishful thinking.

Most teams say "we have backups" the way they say "we have a fire extinguisher" — they bought one, mounted it on the wall, and never pulled the pin. The reassuring part isn't the device. It's the belief that the device works. And belief is not a property of your storage system.

A backup is a hypothesis: that some bytes, sitting somewhere, can be turned back into a working system at a moment you don't get to choose. You don't have backups. You have restores you have successfully performed. Everything in between is a story you tell yourself to sleep at night.

The gap between a copy and a recovery

A copy is easy. Cron a pg_dump, push it to a bucket, watch the green checkmark, move on. What that checkmark proves is that a file was written. It says nothing about whether that file is complete, whether it's the schema you think it is, whether the encryption key still exists, or whether anyone alive knows the command to bring it back.

The failures I've actually watched happen were never "the backup didn't run." They were quieter and worse:

→ The dump ran nightly for two years and silently truncated at 4 GB because of an old filesystem limit nobody remembered.
→ The S3 bucket had a lifecycle rule that expired objects after 30 days, so the "daily backups" were a rolling 30-day window, and the incident needed data older than that.
→ The restore worked, but took eleven hours, and the business had assumed minutes.
→ The data came back fine. The encryption key to read it had been rotated out of the vault.

Every one of those teams "had backups." Not one of them had a restore.
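The first failure in that list, the silently truncated dump, is cheap to catch before you ever need the file. A minimal sketch, assuming the dump is plain-format SQL from pg_dump, which writes a "PostgreSQL database dump complete" comment near the end of its output. The function name is mine, and custom-format (-Fc) archives would need a different check:

```shell
#!/usr/bin/env bash
# Sketch: detect a truncated plain-format pg_dump file.
# Assumption: plain SQL output, where pg_dump writes a
# "PostgreSQL database dump complete" comment near the end.

# dump_is_complete FILE
# Returns 0 if the completion marker is in the last lines of FILE.
dump_is_complete() {
  local file=$1
  # A truncated file loses its tail, so only the last lines matter.
  tail -n 10 "$file" | grep -q 'PostgreSQL database dump complete'
}
```

Something like `dump_is_complete ./restore.sql || exit 1` right after the copy would turn a quiet truncation into a loud failure. It says nothing about whether the contents are right, only that the file has an ending.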

A backup that has never been restored is not a safety net. It is a screenshot of one.

Restore is the product; backup is the byproduct

Flip the whole thing around. The deliverable was never the .tar.gz in cold storage. The deliverable is a running system, on a known date, within a known amount of time. Once you frame it that way, the questions change. You stop asking "did the backup succeed" and start asking "how long does a full restore take, who can do it, and when did we last prove it."

That last word — prove — is the entire discipline. A restore you reasoned about is a guess. A restore you ran last Tuesday is a fact. The only honest way to know your recovery time is to hold a stopwatch while someone rebuilds the system from nothing but the artifacts you actually keep.

So write down the two numbers that matter and then go earn them. How much data can you afford to lose, measured in time? How long can you afford to be down? Then restore into a scratch environment and see if reality agrees with the numbers. It almost never does the first time, and the gap is the most useful thing you'll learn all quarter.
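One way to keep those two numbers honest is to store them next to the drill and compare them with what you measured. A minimal sketch, using the usual names for them (RPO for tolerable data loss, RTO for tolerable downtime). The thresholds and the function name are hypothetical placeholders, not recommendations:

```shell
#!/usr/bin/env bash
# Hypothetical targets; replace with the numbers the business agreed to.
RPO_SECONDS=$((60 * 60))   # tolerate losing at most 1 hour of data
RTO_SECONDS=$((30 * 60))   # tolerate at most 30 minutes of downtime

# meets_objectives BACKUP_AGE_SECONDS RESTORE_SECONDS
# Prints one line per missed objective; returns non-zero if any is missed.
meets_objectives() {
  local backup_age=$1 restore_time=$2 status=0
  if (( backup_age > RPO_SECONDS )); then
    echo "RPO missed: newest backup is ${backup_age}s old, limit ${RPO_SECONDS}s"
    status=1
  fi
  if (( restore_time > RTO_SECONDS )); then
    echo "RTO missed: restore took ${restore_time}s, limit ${RTO_SECONDS}s"
    status=1
  fi
  return "$status"
}
```

Fed a ten-minute-old backup and an eleven-hour restore (`meets_objectives 600 39600`), it reports the RTO miss and returns non-zero, which is the gap written down as a number.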

restore-drill.sh


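#!/usr/bin/env bash
set -euo pipefail
# Newest object by key; assumes keys sort by date (e.g. 2026-02-16.sql).
latest=$(aws s3 ls s3://backups/ | tail -n 1 | awk '{print $4}')
# aws s3 cp needs the s3:// prefix, or it treats the source as a local path.
aws s3 cp "s3://backups/$latest" ./restore.sql
# Start from an empty database so an earlier drill can't mask a bad restore.
dropdb --if-exists restore_drill
createdb restore_drill
# ON_ERROR_STOP makes psql exit non-zero on the first SQL error.
psql -v ON_ERROR_STOP=1 restore_drill < ./restore.sql
# Assert on the data, not just the exit code.
rows=$(psql -tA restore_drill -c 'select count(*) from orders')
test "$rows" -gt 0 || { echo "RESTORE FAILED: empty orders"; exit 1; }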

A drill is a script that asserts the system came back, not just the file.

The point of that script isn't the psql. It's the last line. A restore that finishes without checking that the orders table has rows is a restore that will cheerfully hand you an empty database and a green light.

Make the drill boring and frequent

People treat restore tests as a heroic annual event — block a day, gather the team, pray. That cadence guarantees the muscle is cold exactly when you need it warm. The fix is to make restoring so routine that it stops being scary.

The cleanest trick I know: build your staging environment by restoring production's backup every night. Now the restore path runs on a schedule whether anyone is watching or not. If the backup is broken, staging is broken by morning, and you find out on a Tuesday with coffee in hand instead of at 3 a.m. during the actual fire. Your recovery procedure stops being a document and becomes a thing that ran eight hours ago.
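To make "a thing that ran eight hours ago" something you can check rather than assume, the nightly job can leave a timestamp behind on success. A rough sketch of such a wrapper. The stamp file and the command you pass in are hypothetical, and how you schedule it (cron, CI, a systemd timer) is up to you:

```shell
#!/usr/bin/env bash
# Sketch of a wrapper for a scheduled restore. The stamp file path and
# the restore command are hypothetical; pass in your real restore job.

# run_nightly STAMP_FILE COMMAND [ARGS...]
# Runs the restore command and records the time of the last success.
run_nightly() {
  local stamp_file=$1
  shift
  if "$@"; then
    date +%s > "$stamp_file"
  else
    echo "nightly restore failed: $*" >&2
    return 1
  fi
}

# hours_since_success STAMP_FILE
# Prints whole hours since the last successful restore.
hours_since_success() {
  local last now
  last=$(cat "$1") || return 1
  now=$(date +%s)
  echo $(( (now - last) / 3600 ))
}
```

A call might look like `run_nightly /var/tmp/last-restore ./restore-drill.sh`. If `hours_since_success` ever prints a number much larger than 24, the restore path has stopped running and nobody noticed.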

A few habits that turn rumors into restores:

→ Time every drill. An untimed restore tells you it's possible, not whether it's fast enough.
→ Restore to a clean machine, not the one with all your tooling already installed. The disaster won't have your laptop.
→ Have someone who didn't write the runbook follow it. Their confusion is the bug report.
→ Keep the decryption keys in the drill. A backup you can't decrypt is ciphertext with extra steps.
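The first habit costs about five lines. A minimal sketch of a timing wrapper, relying on bash's built-in SECONDS counter; the function name is illustrative:

```shell
#!/usr/bin/env bash
# timed COMMAND [ARGS...]
# Runs the command, reports wall-clock seconds on stderr, and returns
# the command's own exit status so a failed drill still fails.
timed() {
  local start=$SECONDS status=0
  "$@" || status=$?
  echo "elapsed: $(( SECONDS - start ))s" >&2
  return "$status"
}
```

Run as `timed ./restore-drill.sh`, it gives you the stopwatch reading without changing what the drill itself asserts.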

The only test that counts

There's a comfortable middle state where the dashboards are green, the bucket is filling, and nobody has restored anything in a year. That state feels like safety and is actually its opposite — you've spent real money to manufacture confidence without buying any of the thing the confidence was supposed to be about.

Restore something today. Pick the database that would end the company if it vanished, pull last night's artifact, and bring it back to life on a machine that has never seen it. Time it. You will either earn the right to say you have a backup, or you'll discover, cheaply and on your own schedule, that you never did.

#Systems #Backups

→ / AUTHOR

Ionut Dumitru

Full-stack engineer and product designer. Writes about building products where the engineering and the design are the same job.
