The real reason migrations fail
It’s rarely the technical work itself. Teams that have rehearsed their upgrade a dozen times in a test environment still get burned when they do it for real. The difference isn’t skill — it’s that real systems are messier than test environments.
A live business system has:
- Activity happening around the clock that creates conflicts your tests never encounter
- Data volumes that make certain operations take 10x longer than expected
- Connected services that break in unexpected ways when something changes
- Hidden dependencies that nobody documented and nobody remembers creating
The failure mode isn’t “we didn’t know how to do the upgrade.” It’s “we didn’t map everything that would be affected before we started.”
The three-phase approach
After running these kinds of upgrades for systems handling millions of daily transactions, we’ve settled on a three-phase approach that avoids the most common failure modes.
Phase 1: Shadow system
Before touching the live system, set up the new one alongside it. New structure, new capacity, but nothing depends on it yet. This phase is zero-risk — you’re adding something new, not changing what exists.
The key insight: get your new system into the live environment before anything relies on it. This separates “set up the new thing” from “move the data over” — two steps most teams mistakenly try to do at once.
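As a minimal sketch of what “nothing depends on it yet” means, here is the shadow setup using SQLite. The table and column names (`orders`, `orders_v2`) are illustrative, not from the article; the point is that creating the new structure touches nothing the live system uses.

```python
import sqlite3

# Illustrative sketch: stand up the new "shadow" table alongside the old one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")  # the live table

# Phase 1: add the new structure. No reads or writes target it yet,
# so this step cannot break anything that already exists.
conn.execute("""
    CREATE TABLE IF NOT EXISTS orders_v2 (
        id INTEGER PRIMARY KEY,
        amount_cents INTEGER NOT NULL,          -- new structure: integer cents
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.commit()

# The live system is untouched: all traffic still goes to "orders".
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # → ['orders', 'orders_v2']
```

Because the new table only ever gains rows in later phases, this step is trivially reversible: drop `orders_v2` and you are back where you started.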
Phase 2: Run both systems together
Once the new system is live, start sending data to both the old and new systems simultaneously. Your business keeps running on the old system while the new one proves itself. This is the hardest phase, but it gives you something invaluable: if anything goes wrong, you can switch back instantly without losing any data.
Critical rules for this phase:
- The old system stays in charge until Phase 3
- Writes to the new system must be idempotent — duplicates and retries will happen, and they must not corrupt the data
- Watch the new system’s speed closely — if it’s slow now, you’ve found the problem before it becomes an outage
- Set a time limit. If this phase runs longer than 2 weeks, something needs to be redesigned
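The first two rules can be sketched as a dual-write function. This is a hypothetical illustration (the `dual_write` helper and table names are assumptions, not the article’s code): the old store stays authoritative, so its write must succeed, while the new store’s write is idempotent and best-effort.

```python
import sqlite3

old = sqlite3.connect(":memory:")
new = sqlite3.connect(":memory:")
old.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
new.execute("CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY, amount_cents INTEGER)")

def dual_write(order_id, amount):
    # Rule 1: the old system stays in charge. If this write fails, the whole
    # request fails, exactly as it would have before the migration began.
    old.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
    old.commit()
    # Rule 2: the new write tolerates duplicates. INSERT OR IGNORE makes a
    # retried/replayed write a no-op; a failure here is logged, not fatal.
    try:
        new.execute("INSERT OR IGNORE INTO orders_v2 VALUES (?, ?)",
                    (order_id, round(amount * 100)))
        new.commit()
    except sqlite3.Error as exc:
        print(f"new-system write failed for order {order_id}: {exc}")

dual_write(1, 19.99)
# A replayed write to the new store is harmless — no duplicate row appears:
new.execute("INSERT OR IGNORE INTO orders_v2 VALUES (?, ?)", (1, 1999))
new.commit()
count = new.execute("SELECT COUNT(*) FROM orders_v2").fetchone()[0]
print(count)  # → 1
```

Because the old system never depends on the new write succeeding, switching back during this phase means simply ignoring the new store — no data has been lost.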
Phase 3: Switch over and clean up
Move everything to the new system. The old one becomes the backup. Watch closely for 48-72 hours, then retire the old system.
The entire process should be uneventful. If your migration plan has a “cross your fingers” moment, you haven’t broken it into small enough steps.
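One way to make the switchover boring is to route all traffic through a single switch, so reversing it at 3am is one flag flip rather than a redeploy. A minimal sketch, with assumed names (`PRIMARY_STORE`, `read_order`) and in-memory dicts standing in for the two databases:

```python
# Phase 3 cutover sketch: one switch decides which store serves traffic.
PRIMARY_STORE = "new"   # flip to "new" at cutover; flip back to "old" to roll back

STORES = {
    "old": {1: {"amount": 19.99}},        # stand-in for the old database
    "new": {1: {"amount_cents": 1999}},   # stand-in for the new database
}

def read_order(order_id):
    # All reads go through the switch. The old store stays intact as the
    # backup throughout the 48-72 hour watch period.
    return STORES[PRIMARY_STORE].get(order_id)

print(read_order(1))  # → {'amount_cents': 1999} (served by the new store)

# Reversing the migration is a one-line change, not a rebuild:
PRIMARY_STORE = "old"
print(read_order(1))  # → {'amount': 19.99} (served by the old store again)
```

Only after the watch period passes cleanly does the flag become permanent and the old store get retired.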
What this looks like in practice
We recently used this approach for a fintech company running a 12-year-old database system. The system served over 40 different business processes, had no comprehensive documentation, and the team had tried (and failed) to upgrade it twice before.
The result: zero downtime, 94% faster processing, and 3.2x more capacity. The team made their first system improvement without a maintenance window within two weeks of the switchover.
The meta-lesson
Database upgrades are a microcosm of all technology change: the technical execution is the easy part. The hard part is understanding what depends on what, managing the transition carefully, and having a way to undo every step.
If your upgrade plan doesn’t have a “how we reverse this at 3am” section for every phase, it’s not ready.