The migration you run a thousand times
Database migrations are the one piece of code you can't easily roll back, which is exactly why people write them last and least.
Database migrations look like the easiest code you'll write all week. They're short. They're declarative. A column here, an index there, a DROP TABLE you'll regret. And because they look easy, they get the least design attention of anything that touches production — squeezed in at the end of the pull request, after the interesting work is done, reviewed by someone who skims past the SQL to get to the application logic. That's backwards. The migration is the only artifact in your change that runs once against irreplaceable state and cannot be casually undone.
Everything else you ship is forgiving. A bad deploy rolls back to the previous container. A broken feature flag flips off. But a migration that drops a column has already thrown the data on the floor by the time you notice the mistake. You don't roll a migration back; you write a second migration forward and hope the first one didn't take anything with it.
Reversible by construction, not by intention
The trap is treating "rollback" as a button. Most frameworks give you a down method, and most teams fill it in out of habit and never run it. In production, the down is a fiction — by the time you'd want it, new rows have arrived that the down doesn't know how to handle, and reversing a schema change while traffic flows through it is its own outage.
Reversibility has to be built into the shape of the change, not bolted on as an inverse function. The discipline that actually works is making every migration additive and every destructive step a separate, later deploy:
- Add the new column nullable. Never add it NOT NULL with a default in one shot on a large table.
- Backfill in batches, out of band, idempotent enough to rerun after a crash.
- Switch the application to read the new column only after the backfill is verified.
- Drop the old column in a migration that ships days later, once you're sure nothing reads it.
Each of those is independently safe to deploy and safe to stop between. That's what reversible actually means in a live system: not that you can run the change backwards, but that you can stop at any step and the system is still correct.
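As a rough sketch of the backfill step, here is one way a batched, rerunnable backfill might look. It uses Python's built-in sqlite3 as a stand-in for a real database, and the `users` table and the `full_name` and `display_name` columns are hypothetical names chosen for illustration. The point is the shape: each batch is its own short transaction, and the WHERE clause only picks up rows that still need work.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Copy full_name into display_name for rows not yet backfilled.

    Safe to rerun: only rows where display_name IS NULL are touched,
    so a crash mid-run loses at most one uncommitted batch.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users
               SET display_name = full_name
             WHERE id IN (
                   SELECT id FROM users
                    WHERE display_name IS NULL
                      AND full_name IS NOT NULL  -- otherwise the loop never ends
                    ORDER BY id
                    LIMIT ?
             )
            """,
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.executemany(
        "INSERT INTO users (full_name) VALUES (?)",
        [(f"user {i}",) for i in range(25)],
    )
    # Step 1 of the list: the additive, nullable column.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    first_run = backfill_in_batches(conn, batch_size=10)
    second_run = backfill_in_batches(conn, batch_size=10)  # rerun finds nothing to do
    remaining = conn.execute(
        "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
    ).fetchone()[0]
    return first_run, second_run, remaining
```

Calling `demo()` runs the backfill twice; the second run should update nothing, which is the property that makes it safe to restart after a crash.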
The lock you didn't see coming
The second thing people write last and understand least is what the migration does to the table while it runs. The SQL is correct; the behavior under load is not. On Postgres versions before 11, an innocent ALTER TABLE ... ADD COLUMN ... DEFAULT rewrites the whole table and holds a lock the entire time. A migration that passed in nineteen milliseconds against your laptop's thousand rows takes an ACCESS EXCLUSIVE lock for four minutes against forty million, and every query that touches the table queues behind it.
That's not a slow migration. That's an outage with a green checkmark in CI.
The defenses are knowable and unglamorous. Set a lock_timeout so a migration that can't get its lock fails fast instead of forming a queue behind it. Create indexes concurrently. Add the constraint as NOT VALID first, then VALIDATE it in a separate statement that takes a weaker lock. None of this is clever. It's just the part of the job that doesn't show up when you run the migration once against an empty schema and watch it succeed.
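To make those three defenses concrete, here is a small sketch that assembles the Postgres statements in order. The helper name, the table and column names, and the five-second timeout are all assumptions for illustration; the SQL keywords themselves (lock_timeout, CREATE INDEX CONCURRENTLY, NOT VALID, VALIDATE CONSTRAINT) are the ones named above.

```python
def lock_aware_statements(table, column, lock_timeout="5s"):
    """Return Postgres statements for a lock-aware index and constraint.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    index = f"idx_{table}_{column}"
    constraint = f"{table}_{column}_not_null"
    return [
        # Fail fast if the lock isn't available instead of queueing traffic.
        f"SET lock_timeout = '{lock_timeout}';",
        # CONCURRENTLY cannot run inside a transaction block, so this
        # statement has to be executed outside the migration's transaction.
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index} ON {table} ({column});",
        # NOT VALID skips the scan of existing rows when the constraint is added.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"CHECK ({column} IS NOT NULL) NOT VALID;",
        # Validation scans the table in a separate statement with a weaker lock.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint};",
    ]


if __name__ == "__main__":
    for statement in lock_aware_statements("orders", "customer_id"):
        print(statement)
```

Many migration frameworks wrap each migration in a transaction by default, so the concurrent index step usually needs that wrapping turned off for the one migration that contains it.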
Write it for the thousandth run
Here's the reframe that changes how you write them. A migration feels like a one-time event — you author it, you run it, you move on. But across every environment, every developer's laptop, every CI run, every staging reset, every region you deploy to, that "one-time" migration runs hundreds or thousands of times. It is not a script. It is a function the whole organization calls, and it has to be correct on the second call, the interrupted call, and the call that races another deploy.
So write it like one. Make it idempotent: guard creates with IF NOT EXISTS, make backfills safe to rerun from where they died. Assume it will be interrupted halfway and check what that leaves behind. Test it against a snapshot with production's data shape, not a fresh schema — the bugs live in the data you already have, the null you didn't expect, the duplicate that violates the unique constraint you're about to add.
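A minimal sketch of what "correct on the second call" can look like, again using sqlite3 as a stand-in. The `audit_log` table and its columns are hypothetical. Creates are guarded with IF NOT EXISTS, and the one step SQLite cannot guard that way is guarded by checking the catalog first (Postgres offers ADD COLUMN IF NOT EXISTS for the same purpose).

```python
import sqlite3


def migrate(conn):
    """Apply the migration; calling it again should be a no-op."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_log ("
        "id INTEGER PRIMARY KEY, event TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_audit_log_event ON audit_log (event)"
    )
    # SQLite has no ADD COLUMN IF NOT EXISTS, so inspect the table first.
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(audit_log)")}
    if "actor" not in columns:
        conn.execute("ALTER TABLE audit_log ADD COLUMN actor TEXT")
    conn.commit()


def column_names(conn):
    return {row[1] for row in conn.execute("PRAGMA table_info(audit_log)")}
```

Running `migrate(conn)` twice against the same connection is the cheapest test there is: if the second call raises, the migration is not ready for a world where it gets retried.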
Additive, idempotent, lock-aware — safe on the first run and the thousandth.
The unique constraint is the classic teacher here. You add it, CI is green, you deploy, and it fails on the one pair of duplicate rows that has existed in production for two years. The migration was correct. Your model of the data was wrong. The only way to find that out before your users do is to run the thing against the data you actually have.
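One way to catch that pair of rows before the deploy does is a preflight query run against a production-shaped snapshot. The sketch below uses sqlite3 and hypothetical helper, table, and column names; it looks for duplicates first and only creates the unique index when none are found.

```python
import sqlite3


def find_duplicates(conn, table, column):
    """Return (value, count) pairs that would violate a unique constraint.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    query = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1 ORDER BY {column}"
    )
    return conn.execute(query).fetchall()


def add_unique_index_if_clean(conn, table, column):
    """Create the unique index only if the existing data allows it.

    Returns the offending duplicates, or an empty list on success.
    """
    duplicates = find_duplicates(conn, table, column)
    if duplicates:
        return duplicates
    conn.execute(
        f"CREATE UNIQUE INDEX IF NOT EXISTS uq_{table}_{column} "
        f"ON {table} ({column})"
    )
    conn.commit()
    return []
```

The returned list is the useful part: it names the rows someone has to decide what to do with, which is a data question the migration cannot answer on its own.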
The review you should actually be doing
When a migration shows up in a pull request, the application diff is the part that's easy to reason about and the migration is the part that can take the site down, so the review attention should be inverted. Ask the boring questions out loud. What lock does this take, and for how long? What happens to in-flight writes while it runs? Is there a backfill, and is it batched? If this is interrupted at the worst possible moment, what's left behind? Can the previous version of the application keep running against the new schema? For the length of a deploy, it will have to.
That last one is the whole game in a system that's never fully off. The old code and the new schema coexist, in both directions, for minutes at least. A migration that's only correct after the deploy finishes has a window where it's wrong, and that window is when your users are watching.
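That coexistence can be checked mechanically. As a sketch, assuming a hypothetical `users` table and a captured example of the write the currently deployed code issues, the helper below builds the proposed schema in an in-memory SQLite database and replays the old write against it.

```python
import sqlite3

# A write the previous application version still issues (hypothetical).
OLD_APP_INSERT = "INSERT INTO users (id, full_name) VALUES (1, 'Ada')"


def old_code_survives(new_schema_sql):
    """Return True if the old application's write works on the new schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(new_schema_sql)
        conn.execute(OLD_APP_INSERT)
        return True
    except sqlite3.IntegrityError:
        # For example, a new NOT NULL column the old code never populates.
        return False
    finally:
        conn.close()


ADDITIVE_SCHEMA = (
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, full_name TEXT, display_name TEXT)"
)
BREAKING_SCHEMA = (
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, full_name TEXT, display_name TEXT NOT NULL)"
)
```

With these two schemas, `old_code_survives(ADDITIVE_SCHEMA)` returns True and `old_code_survives(BREAKING_SCHEMA)` returns False: the nullable column tolerates the old writer, while the NOT NULL one rejects every insert until the new code is fully rolled out.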
Treat the migration as the most dangerous line in the change, because it is. Give it the design attention it deserves, not the leftover minutes at the end. The code you can't easily roll back is exactly the code you should have thought hardest about before it ran.
