PostgreSQL Failover and Disaster Recovery Guide
Level: intermediate · ~15 min read · Intent: informational
Audience: backend developers, database engineers, technical teams
Prerequisites
- basic familiarity with PostgreSQL
Key takeaways
- PostgreSQL failover and disaster recovery are related but not identical. Failover is about restoring service quickly, while disaster recovery is about recovering safely after data loss, corruption, or infrastructure failure.
- A strong PostgreSQL recovery strategy usually combines streaming replication for failover, backups plus WAL archiving for point-in-time recovery, and a clear plan for what happens to the old primary after promotion.
FAQ
- What is the difference between PostgreSQL failover and disaster recovery?
- Failover is the process of promoting a standby so service can continue after a primary failure. Disaster recovery is the broader process of restoring data and service after serious failures such as corruption, accidental deletion, or total infrastructure loss.
- Is a PostgreSQL replica enough for disaster recovery?
- No. A standby helps with failover and availability, but it does not replace backups and WAL archiving. Destructive changes can still replicate to the standby.
- How is PostgreSQL failover triggered?
- A standby can be promoted using pg_ctl promote or the pg_promote() function, either manually or through orchestration tooling.
- What should I do with the old primary after failover?
- In many setups, the old primary is rewound with pg_rewind and then rejoined as a standby following the new primary, assuming the prerequisites for pg_rewind were in place.
PostgreSQL failover and disaster recovery are often discussed together because both are about surviving failure.
But they are not the same thing.
That distinction matters, because many teams build one and assume they have built both.
They have not.
A PostgreSQL failover design is about:
- keeping the service available
- promoting a standby quickly
- rerouting traffic
- and getting the application back online
A PostgreSQL disaster recovery plan is broader. It must answer:
- what data can be lost
- how the system is restored after corruption or operator error
- how backups and WAL are used
- how long recovery takes
- and what happens when the whole primary environment is gone or unusable
This guide covers both sides together, because a serious PostgreSQL production system usually needs both.
1. Start With RPO and RTO
Before talking about standbys, promotion, or tools, define the recovery targets.
RPO: Recovery Point Objective
This is how much data loss you can tolerate.
Examples:
- near-zero
- 1 minute
- 15 minutes
- 1 hour
- 24 hours
RTO: Recovery Time Objective
This is how long the service can be unavailable.
Examples:
- 30 seconds
- 5 minutes
- 1 hour
- half a day
These two numbers drive the architecture.
Why this matters
If your business requires:
- very low data loss
- and very fast recovery
then you usually need:
- streaming replication
- standby promotion
- careful replication and WAL retention design
- and tested orchestration
If your business can tolerate:
- slower recovery
- and larger data-loss windows
then backups and manual recovery may be enough for some systems.
Do not start with:
- “Should we use a replica?”
Start with:
- “What recovery outcome do we actually need?”
2. Understand the Core PostgreSQL Recovery Building Blocks
A practical PostgreSQL failover and recovery strategy usually combines a few core pieces:
- primary server
- one or more standby servers
- physical streaming replication
- optional synchronous replication
- WAL archiving
- base backups
- failover decision logic
- and a plan for reintegrating the old primary
Each solves a different part of the problem.
3. Standby Types Matter
PostgreSQL’s high-availability docs distinguish between warm standby and hot standby.
Warm standby
A standby that cannot accept user connections until it is promoted.
Hot standby
A standby that can accept read-only queries while still following the primary.
For most modern operational setups, hot standby is what people usually mean when they say:
- replica
- standby
- follower
Why this matters
A standby used only for failover has different operational goals from a standby also serving read traffic.
If the standby is used for:
- reporting
- dashboards
- read scaling
then you also need to think about:
- query load
- replay lag
- recovery conflicts
- and whether the standby is still a reliable failover target under that extra workload
4. Streaming Replication Is Usually the Core of Failover
For most PostgreSQL failover designs, physical streaming replication is the main building block.
It allows the standby to receive WAL changes from the primary and stay close enough to take over when needed.
Why it is so useful
It gives you:
- relatively fast failover potential
- near-real-time data movement
- a practical base for hot standby
- and a cleaner path to promotion than purely archive-only recovery workflows
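As a sketch of the setup (hostnames, user, and paths here are placeholders, not part of the original text), a standby is commonly created by cloning the primary with pg_basebackup and letting it write the standby configuration:

```shell
# Illustrative commands; adjust hosts, user, and data directory for your environment.
# 1. Clone the primary into a fresh data directory. The -R flag writes
#    standby.signal and a primary_conninfo setting so the node starts as a standby.
pg_basebackup -h primary.example.com -U replicator \
  -D /var/lib/postgresql/data -R -X stream -P

# 2. Start the standby; it connects to the primary and begins replaying WAL.
pg_ctl start -D /var/lib/postgresql/data
```

This is only the replication layer; the triggering, routing, and split-brain questions below still need answers.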
What it does not do by itself
Streaming replication alone does not answer:
- how failover is triggered
- which standby should be promoted
- how clients reconnect
- how split-brain is prevented
- how the old primary is handled afterward
- or how point-in-time recovery works after logical or operator mistakes
That is why replication is necessary, but not sufficient.
5. Manual Failover vs Orchestrated Failover
You can fail over PostgreSQL manually. You can also orchestrate it.
Manual failover
Manual failover usually means:
- detect primary failure
- confirm the standby is viable
- promote the chosen standby
- reroute application traffic
- stabilize the new topology
This is simpler conceptually, but slower operationally.
Orchestrated failover
This means an external system or operational process decides:
- when the primary is really down
- which standby is most suitable
- how to avoid split-brain
- how to update service discovery or routing
- and what actions are taken next
Practical rule
Manual failover is often fine for:
- smaller internal systems
- lower-urgency workloads
- teams with strong runbooks and low availability pressure
Orchestrated failover is often better for:
- high-availability production services
- customer-facing systems
- low-RTO environments
- multi-node clusters where a human bottleneck is too slow
6. Promotion Is the Failover Step, Not the Whole Plan
At the PostgreSQL level, failover is triggered by promoting a standby.
PostgreSQL documents that you can do this with:
- pg_ctl promote
- pg_promote()
That is an important point because PostgreSQL itself knows how to promote a standby, but not how to decide whether promotion is safe in your broader system.
Example
SELECT pg_promote();
or from the shell:
pg_ctl promote -D /var/lib/postgresql/data
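After promotion, it is worth confirming the node has actually left recovery before moving traffic to it. pg_is_in_recovery() returns false on a writable primary:

```sql
-- Returns true while the node is still a standby,
-- false once promotion has completed and it accepts writes.
SELECT pg_is_in_recovery();
```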
Why this distinction matters
Promotion answers:
- “make this standby writable now”
It does not answer:
- “was the primary definitely dead?”
- “did another standby also get promoted?”
- “did application traffic already move?”
- “do we still have the latest WAL?”
- “how do we prevent the old primary from coming back and writing independently?”
That is why failover needs process, not only command knowledge.
7. Read Replicas Are Not Enough for Disaster Recovery
This is one of the most important operational truths.
A standby helps with:
- availability
- fast recovery from primary host failure
- read scaling
- maintenance flexibility
It does not replace:
- backups
- WAL archiving
- point-in-time recovery
- or protection against destructive replicated changes
Why not?
Because many bad events replicate too:
- accidental deletes
- destructive migrations
- bad application writes
- some classes of corruption
- operator mistakes
If the primary receives a bad transaction and replicates it, the standby will usually reflect that bad state too.
That is why disaster recovery needs more than replication.
8. Backups and WAL Archiving Are the Disaster Recovery Layer
If failover is about surviving primary failure, disaster recovery is about surviving bad state.
That is where:
- base backups
- and archived WAL
become essential.
Base backup
A base backup gives you a physical starting point for recovery.
WAL archiving
Archived WAL gives you the change history needed to recover forward from the base backup.
Together, these enable:
- point-in-time recovery
- restoration to a moment before a destructive event
- rebuild of a cluster after total host loss
- recovery after corruption or operator error
Practical point
A standby without backup and WAL archiving is an availability mechanism. It is not a complete disaster recovery strategy.
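A minimal sketch of the archiving side could look like this in postgresql.conf. The archive path is a placeholder, and the plain-cp archive_command (taken from the pattern in the PostgreSQL docs) is illustrative; production setups usually use a dedicated backup tool instead:

```
# postgresql.conf (illustrative values)
wal_level = replica
archive_mode = on
# Copy each completed WAL segment to the archive location.
# %p and %f are expanded by PostgreSQL to the segment's path and file name.
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
```

A base backup is then taken periodically, for example with pg_basebackup, and stored independently of the primary host.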
9. Point-in-Time Recovery Is Often the Real DR Superpower
A lot of teams initially think disaster recovery means:
- restore last night’s backup
That is often too crude.
Point-in-time recovery lets you recover to:
- a specific timestamp
- a transaction boundary
- or another recovery target
This is extremely valuable when the failure was:
- human error
- bad code deploy
- mass delete
- migration mistake
- or corrupt write pattern
Example recovery question
Instead of:
- “Can we restore yesterday’s backup?”
you want:
- “Can we restore to 09:12:34, just before the destructive migration ran?”
That is a much stronger disaster recovery capability.
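As a hedged sketch, a point-in-time restore starts from a restored base backup plus recovery settings like these (the archive path, date, and timestamp are placeholders):

```
# postgresql.conf on the restored cluster (illustrative)
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-05-01 09:12:34+00'
recovery_target_action = 'pause'   # pause at the target so you can inspect before promoting
```

With recovery.signal present in the data directory, the server replays archived WAL up to the target and then pauses for inspection.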
10. Replication Slots Help Prevent WAL From Disappearing Too Early
Replication slots are a very important part of reliable standby behavior.
PostgreSQL’s docs explain that replication slots can prevent the primary from removing WAL segments that a standby still needs.
Why this matters
Without a slot, a standby that falls behind may need WAL that the primary has already recycled or removed.
That can cause the standby to fall out of sync badly enough that it needs to be rebuilt.
What slots help with
- protecting lagging standbys from losing required WAL too early
- making replication retention more precise than simple WAL retention guesses
- improving reliability of standby catch-up behavior
Important warning
Replication slots are not free.
If a standby stops consuming WAL and the slot remains active, WAL can accumulate and fill storage on the primary.
So replication slots improve safety, but they also require monitoring.
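A physical slot is created on the primary and referenced by the standby (via primary_slot_name in its configuration); monitoring should then watch how much WAL each slot retains. The slot name here is an example:

```sql
-- On the primary: create a slot for the standby to consume
SELECT pg_create_physical_replication_slot('standby1');

-- Check slot state and how much WAL is being retained behind each slot;
-- a large value on an inactive slot means WAL is piling up on the primary.
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```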
11. Synchronous vs Asynchronous Replication Changes the Recovery Trade-Off
This is one of the most important design choices.
Asynchronous replication
With asynchronous replication:
- the primary can commit before a standby confirms receipt or durability
- performance is often better
- but failover can lose some very recent committed transactions if the primary fails before the standby receives them
This is the most common setup.
Synchronous replication
With synchronous replication:
- commits can wait for confirmation from synchronous standby targets
- data-loss risk during failover can be reduced
- but latency and availability trade-offs become more serious
Why this matters
This is fundamentally an RPO choice.
If near-zero data loss matters more than some latency and write-availability trade-offs, synchronous replication becomes more attractive.
If low write latency and operational simplicity matter more, asynchronous replication may still be the better fit.
There is no universal winner. Only a business trade-off.
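On the PostgreSQL side, this choice is expressed mainly through synchronous_standby_names on the primary (the standby names below are placeholders):

```
# postgresql.conf on the primary (illustrative)
# Wait for one of the listed standbys to confirm each commit:
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
# Durability can also be tuned per transaction:
#   SET synchronous_commit = remote_apply;  -- or on, remote_write, local, off
```

This lets you apply the stricter guarantee only where the business actually needs it.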
12. Avoid Split-Brain at All Costs
One of the worst failover outcomes is split-brain:
- two servers both behaving as writable primaries
- diverging timelines
- inconsistent data
- confused clients
- and a recovery process that becomes much messier than the original failure
How split-brain happens
Usually through:
- bad failover decisions
- network partitions
- old primaries coming back without being fenced off
- orchestration mistakes
- manual failover without clear ownership of the cluster state
Practical lesson
A good failover system is not only about promoting the standby quickly. It is also about ensuring the old primary cannot continue accepting writes as if nothing happened.
This is why fencing, traffic control, and clean topology change matter so much.
13. Monitor the Replication Path Continuously
Failover only works well when you already know the standby is healthy.
Good monitoring should include:
- replication lag
- WAL sender status
- WAL receiver status
- slot health
- archive success/failure
- standby replay state
- connection status
- and whether the standby is still actually a valid failover target
Useful PostgreSQL views
PostgreSQL’s monitoring system provides views such as:
- pg_stat_replication
- pg_stat_wal_receiver
- pg_stat_archiver
- pg_stat_replication_slots
These are important because failover quality depends on the health of the replication chain before the failure happens.
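For example, per-standby lag can be read from pg_stat_replication on the primary. This is a sketch; exact columns vary slightly by version:

```sql
-- One row per connected standby, showing state and how far behind it is,
-- both in bytes of WAL and as replay delay.
SELECT application_name, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
```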
Practical rule
Do not wait until failover to discover:
- the standby is lagging badly
- WAL archiving was broken
- the slot was misconfigured
- or the standby stopped replaying hours ago
14. Test the Whole Failover Workflow, Not Just Replication
A lot of teams validate replication once and then assume failover will work.
That is too optimistic.
You need to test:
- standby promotion
- traffic rerouting
- application reconnect behavior
- health checks
- monitoring alerts
- and post-failover cleanup steps
Good failover test questions
- How long does promotion actually take?
- What does the app do during promotion?
- How do clients find the new primary?
- Are writes retried correctly?
- Does any code still point to the old primary directly?
- Do background jobs recover cleanly?
- Does observability show the new topology correctly?
Replication that works in steady state is only part of the story.
15. Disaster Recovery Needs Restore Drills, Not Only Backup Jobs
Just as failover needs drills, disaster recovery needs restore drills.
You should be able to prove:
- base backups restore successfully
- WAL archives are usable
- recovery targets work
- the recovered cluster boots cleanly
- roles and configuration are correct
- the application can connect afterward
- the team knows the process under pressure
Good practical question
Can your team restore:
- the whole cluster
- to a specific point in time
- within the target RTO
- with acceptable data loss
If the answer is unknown, the DR design is not finished.
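A restore drill can be partly scripted. As an illustrative outline (the backup path is a placeholder; pg_verifybackup requires a backup taken with a manifest, available since PostgreSQL 13):

```shell
# Illustrative drill outline; paths are placeholders.
# 1. Verify the base backup against its manifest before trusting it.
pg_verifybackup /backups/base

# 2. Restore it into a scratch data directory, configure restore_command
#    and a recovery target, create recovery.signal, and start the server.
# 3. Confirm the application schema and recent data are present, and
#    record how long the whole drill took against the RTO.
```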
16. Plan What Happens to the Old Primary After Failover
This is one of the most overlooked steps.
After failover, the old primary is no longer part of the authoritative write path. You need a clean reintegration strategy.
In many environments, the standard answer is:
- use pg_rewind
- then turn the old primary into a standby following the new primary
PostgreSQL’s docs describe pg_rewind exactly for this scenario: bringing an old primary back online after the clusters have diverged due to failover.
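A sketch of the rewind-and-rejoin flow, with placeholder host and path. Note that pg_rewind requires wal_log_hints = on or data checksums to have been enabled before the failover, and the old primary must have been shut down cleanly:

```shell
# On the old primary, after it has been stopped cleanly and fenced off:
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server='host=new-primary.example.com user=postgres dbname=postgres'

# Then add standby.signal and primary_conninfo pointing at the new primary,
# and start the node as a standby.
```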
Why this matters
Without a reintegration plan, teams often:
- leave the old primary in a dangerous ambiguous state
- rebuild from scratch more often than necessary
- or accidentally create topology confusion during recovery
Practical warning
pg_rewind has prerequisites and works only when divergence and WAL history support it.
So you should plan for it in advance, not discover the requirements during an outage.
17. Replication Slots and Failover Slots Matter More in Complex Topologies
For more advanced setups, especially when logical replication must survive a failover, recent PostgreSQL releases add slot-related capabilities, such as failover-enabled logical slots that can be synchronized to a standby, and these need careful design.
Practical lesson
If your architecture depends on:
- logical replication continuity after failover
- synchronized standby slot behavior
- more advanced downstream consumers
then slot design becomes part of your failover design, not just a background replication detail.
This is especially true in complex or layered HA environments.
18. A Practical Failover Topology Pattern
A very common and sensible starting pattern looks like this:
Primary
- handles writes
- may also handle reads depending on load
Hot standby
- streams from primary
- can serve read-only queries if desired
- is the first failover target
Backups + WAL archiving
- stored independently
- support PITR and cluster rebuild
Monitoring + orchestration
- tracks lag, receiver state, slots, archive success, and server health
- controls or supports failover decisions
Old primary recovery plan
- fenced off after failover
- rewound or rebuilt
- rejoined as standby
This is a much more complete picture than:
- “we have a replica, so we are covered”
19. Common PostgreSQL Failover and DR Mistakes
Mistake 1: Treating a standby as a full DR plan
It is not.
Mistake 2: No tested promotion procedure
Knowing that pg_promote() exists is not the same as having a failover process.
Mistake 3: No plan for the old primary
This creates confusion and recovery delay.
Mistake 4: Ignoring split-brain risk
Fast failover is not useful if it creates two primaries.
Mistake 5: No WAL archiving or untested WAL archives
That weakens or removes PITR.
Mistake 6: No restore drills
Backups without restore testing are assumptions, not proof.
Mistake 7: Replication slots without monitoring
This can cause WAL retention problems on the primary.
Mistake 8: Assuming replicas are current enough without measuring lag
Failover quality depends on actual replication health, not hope.
20. A Practical PostgreSQL Failover and DR Checklist
Use this checklist for a serious system:
- Define RPO and RTO
- Set up streaming replication
- Decide whether async or sync replication fits the business
- Monitor lag, receiver state, archive status, and slots
- Prevent split-brain through fencing and clear traffic control
- Test standby promotion
- Test app reconnect behavior
- Take regular base backups
- Archive WAL continuously
- Test point-in-time recovery
- Write a runbook for failover and DR
- Define how the old primary is rewound or rebuilt after failover
That is the difference between having components and having a recovery strategy.
Conclusion
A good PostgreSQL failover and disaster recovery design is not one feature.
It is a combination of:
- healthy standbys
- promotion logic
- replication monitoring
- backups
- WAL archiving
- restore drills
- split-brain prevention
- and a clean path for handling the old primary after failover
That is why the most important question is not:
- “Do we have a replica?”
It is:
- “Can we recover service and data predictably when things go wrong?”
That is the real standard for PostgreSQL failover and disaster recovery.