PostgreSQL Failover and Disaster Recovery Guide
Level: intermediate · ~15 min read · Intent: informational
Audience: backend developers, database engineers, technical teams
Prerequisites
- basic familiarity with PostgreSQL
Key takeaways
- PostgreSQL failover and disaster recovery are related but not identical. Failover is about restoring service quickly, while disaster recovery is about recovering safely after data loss, corruption, or infrastructure failure.
- A strong PostgreSQL recovery strategy usually combines streaming replication for failover, backups plus WAL archiving for point-in-time recovery, and a clear plan for what happens to the old primary after promotion.
FAQ
- What is the difference between PostgreSQL failover and disaster recovery?
- Failover is the process of promoting a standby so service can continue after a primary failure. Disaster recovery is the broader process of restoring data and service after serious failures such as corruption, accidental deletion, or total infrastructure loss.
- Is a PostgreSQL replica enough for disaster recovery?
- No. A standby helps with failover and availability, but it does not replace backups and WAL archiving. Destructive changes can still replicate to the standby.
- How is PostgreSQL failover triggered?
- A standby can be promoted using pg_ctl promote or the pg_promote() function, either manually or through orchestration tooling.
- What should I do with the old primary after failover?
- In many setups, the old primary is rewound with pg_rewind and then rejoined as a standby following the new primary, assuming the prerequisites for pg_rewind were in place.
PostgreSQL failover and disaster recovery are often discussed together because both are about surviving failure.
But they are not the same thing.
That distinction matters, because many teams build one and assume they have built both.
They have not.
A PostgreSQL failover design is about:
- keeping the service available
- promoting a standby quickly
- rerouting traffic
- and getting the application back online
A PostgreSQL disaster recovery plan is broader. It must answer:
- what data can be lost
- how the system is restored after corruption or operator error
- how backups and WAL are used
- how long recovery takes
- and what happens when the whole primary environment is gone or unusable
This guide covers both sides together, because a serious PostgreSQL production system usually needs both.
1. Start With RPO and RTO
Before talking about standbys, promotion, or tools, define the recovery targets.
RPO: Recovery Point Objective
This is how much data loss you can tolerate.
Examples:
- near-zero
- 1 minute
- 15 minutes
- 1 hour
- 24 hours
RTO: Recovery Time Objective
This is how long the service can be unavailable.
Examples:
- 30 seconds
- 5 minutes
- 1 hour
- half a day
These two numbers drive the architecture.
Why this matters
If your business requires:
- very low data loss
- and very fast recovery
then you usually need:
- streaming replication
- standby promotion
- careful replication and WAL retention design
- and tested orchestration
If your business can tolerate:
- slower recovery
- and larger data-loss windows
then backups and manual recovery may be enough for some systems.
Do not start with:
- “Should we use a replica?”
Start with:
- “What recovery outcome do we actually need?”
2. Understand the Core PostgreSQL Recovery Building Blocks
A practical PostgreSQL failover and recovery strategy usually combines a few core pieces:
- primary server
- one or more standby servers
- physical streaming replication
- optional synchronous replication
- WAL archiving
- base backups
- failover decision logic
- and a plan for reintegrating the old primary
Each solves a different part of the problem.
3. Standby Types Matter
PostgreSQL’s high-availability docs distinguish between warm standby and hot standby.
Warm standby
A standby that cannot accept user connections until it is promoted.
Hot standby
A standby that can accept read-only queries while still following the primary.
For most modern operational setups, hot standby is what people usually mean when they say:
- replica
- standby
- follower
Why this matters
A standby used only for failover has different operational goals from a standby also serving read traffic.
If the standby is used for:
- reporting
- dashboards
- read scaling
then you also need to think about:
- query load
- replay lag
- recovery conflicts
- and whether the standby is still a reliable failover target under that extra workload
4. Streaming Replication Is Usually the Core of Failover
For most PostgreSQL failover designs, physical streaming replication is the main building block.
It allows the standby to receive WAL changes from the primary and stay close enough to take over when needed.
Why it is so useful
It gives you:
- relatively fast failover potential
- near-real-time data movement
- a practical base for hot standby
- and a cleaner path to promotion than purely archive-only recovery workflows
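As a sketch of the setup (hostnames, user, and paths here are placeholders, not part of the original text), a standby is commonly created by cloning the primary with pg_basebackup and letting it write the standby configuration:

```shell
# Illustrative commands; adjust hosts, user, and data directory for your environment.
# 1. Clone the primary into a fresh data directory. The -R flag writes
#    standby.signal and a primary_conninfo setting so the node starts as a standby.
pg_basebackup -h primary.example.com -U replicator \
  -D /var/lib/postgresql/data -R -X stream -P

# 2. Start the standby; it connects to the primary and begins replaying WAL.
pg_ctl start -D /var/lib/postgresql/data
```

This is only the replication layer; the triggering, routing, and split-brain questions below still need answers.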
What it does not do by itself
Streaming replication alone does not answer:
- how failover is triggered
- which standby should be promoted
- how clients reconnect
- how split-brain is prevented
- how the old primary is handled afterward
- or how point-in-time recovery works after logical or operator mistakes
That is why replication is necessary, but not sufficient.
5. Manual Failover vs Orchestrated Failover
You can fail over PostgreSQL manually. You can also orchestrate it.
Manual failover
Manual failover usually means:
- detect primary failure
- confirm the standby is viable
- promote the chosen standby
- reroute application traffic
- stabilize the new topology
This is simpler conceptually, but slower operationally.
Orchestrated failover
This means an external system or operational process decides:
- when the primary is really down
- which standby is most suitable
- how to avoid split-brain
- how to update service discovery or routing
- and what actions are taken next
Practical rule
Manual failover is often fine for:
- smaller internal systems
- lower-urgency workloads
- teams with strong runbooks and low availability pressure
Orchestrated failover is often better for:
- high-availability production services
- customer-facing systems
- low-RTO environments
- multi-node clusters where a human bottleneck is too slow
6. Promotion Is the Failover Step, Not the Whole Plan
At the PostgreSQL level, failover is triggered by promoting a standby.
PostgreSQL documents that you can do this with:
- pg_ctl promote
- pg_promote()
That is an important point because PostgreSQL itself knows how to promote a standby, but not how to decide whether promotion is safe in your broader system.
Example
SELECT pg_promote();
or from the shell:
pg_ctl promote -D /var/lib/postgresql/data
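After promotion, it is worth confirming the node has actually left recovery before moving traffic to it. pg_is_in_recovery() returns false on a writable primary:

```sql
-- Returns true while the node is still a standby,
-- false once promotion has completed and it accepts writes.
SELECT pg_is_in_recovery();
```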
Why this distinction matters
Promotion answers:
- “make this standby writable now”
It does not answer:
- “was the primary definitely dead?”
- “did another standby also get promoted?”
- “did application traffic already move?”
- “do we still have the latest WAL?”
- “how do we prevent the old primary from coming back and writing independently?”
That is why failover needs process, not only command knowledge.
7. Read Replicas Are Not Enough for Disaster Recovery
This is one of the most important operational truths.
A standby helps with:
- availability
- fast recovery from primary host failure
- read scaling
- maintenance flexibility
It does not replace:
- backups
- WAL archiving
- point-in-time recovery
- or protection against destructive replicated changes
Why not?
Because many bad events replicate too:
- accidental deletes
- destructive migrations
- bad application writes
- some classes of corruption
- operator mistakes
If the primary receives a bad transaction and replicates it, the standby will usually reflect that bad state too.
That is why disaster recovery needs more than replication.
8. Backups and WAL Archiving Are the Disaster Recovery Layer
If failover is about surviving primary failure, disaster recovery is about surviving bad state.
That is where:
- base backups
- and archived WAL
become essential.
Base backup
A base backup gives you a physical starting point for recovery.
WAL archiving
Archived WAL gives you the change history needed to recover forward from the base backup.
Together, these enable:
- point-in-time recovery
- restoration to a moment before a destructive event
- rebuild of a cluster after total host loss
- recovery after corruption or operator error
Practical point
A standby without backup and WAL archiving is an availability mechanism. It is not a complete disaster recovery strategy.
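A minimal sketch of the archiving side could look like this in postgresql.conf. The archive path is a placeholder, and the plain-cp archive_command (taken from the pattern in the PostgreSQL docs) is illustrative; production setups usually use a dedicated backup tool instead:

```
# postgresql.conf (illustrative values)
wal_level = replica
archive_mode = on
# Copy each completed WAL segment to the archive location.
# %p and %f are expanded by PostgreSQL to the segment's path and file name.
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
```

A base backup is then taken periodically, for example with pg_basebackup, and stored independently of the primary host.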
9. Point-in-Time Recovery Is Often the Real DR Superpower
A lot of teams initially think disaster recovery means:
- restore last night’s backup
That is often too crude.
Point-in-time recovery lets you recover to:
- a specific timestamp
- a transaction boundary
- or another recovery target
This is extremely valuable when the failure was:
- human error
- bad code deploy
- mass delete
- migration mistake
- or corrupt write pattern
Example recovery question
Instead of:
- “Can we restore yesterday’s backup?”
you want:
- “Can we restore to 09:12:34, just before the destructive migration ran?”
That is a much stronger disaster recovery capability.
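As a hedged sketch, a point-in-time restore starts from a restored base backup plus recovery settings like these (the archive path, date, and timestamp are placeholders):

```
# postgresql.conf on the restored cluster (illustrative)
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-05-01 09:12:34+00'
recovery_target_action = 'pause'   # pause at the target so you can inspect before promoting
```

With recovery.signal present in the data directory, the server replays archived WAL up to the target and then pauses for inspection.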
10. Replication Slots Help Prevent WAL From Disappearing Too Early
Replication slots are a very important part of reliable standby behavior.
PostgreSQL’s docs explain that replication slots can prevent the primary from removing WAL segments that a standby still needs.
Why this matters
Without a slot, a standby that falls behind may need WAL that the primary has already recycled or removed.
That can cause the standby to fall out of sync badly enough that it needs to be rebuilt.
What slots help with
- protecting lagging standbys from losing required WAL too early
- making replication retention more precise than simple WAL retention guesses
- improving reliability of standby catch-up behavior
Important warning
Replication slots are not free.
If a standby stops consuming WAL and the slot remains active, WAL can accumulate and fill storage on the primary.
So replication slots improve safety, but they also require monitoring.
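A physical slot is created on the primary and referenced by the standby (via primary_slot_name in its configuration); monitoring should then watch how much WAL each slot retains. The slot name here is an example:

```sql
-- On the primary: create a slot for the standby to consume
SELECT pg_create_physical_replication_slot('standby1');

-- Check slot state and how much WAL is being retained behind each slot;
-- a large value on an inactive slot means WAL is piling up on the primary.
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```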
11. Synchronous vs Asynchronous Replication Changes the Recovery Trade-Off
This is one of the most important design choices.
Asynchronous replication
With asynchronous replication:
- the primary can commit before a standby confirms receipt or durability
- performance is often better
- but failover can lose some very recent committed transactions if the primary fails before the standby receives them
This is the most common setup.
Synchronous replication
With synchronous replication:
- commits can wait for confirmation from synchronous standby targets
- data-loss risk during failover can be reduced
- but latency and availability trade-offs become more serious
Why this matters
This is fundamentally an RPO choice.
If near-zero data loss matters more than some latency and write-availability trade-offs, synchronous replication becomes more attractive.
If low write latency and operational simplicity matter more, asynchronous replication may still be the better fit.
There is no universal winner. Only a business trade-off.
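On the PostgreSQL side, this choice is expressed mainly through synchronous_standby_names on the primary (the standby names below are placeholders):

```
# postgresql.conf on the primary (illustrative)
# Wait for one of the listed standbys to confirm each commit:
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
# Durability can also be tuned per transaction:
#   SET synchronous_commit = remote_apply;  -- or on, remote_write, local, off
```

This lets you apply the stricter guarantee only where the business actually needs it.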
12. Avoid Split-Brain at All Costs
One of the worst failover outcomes is split-brain:
- two servers both behaving as writable primaries
- diverging timelines
- inconsistent data
- confused clients
- and a recovery process that becomes much messier than the original failure
How split-brain happens
Usually through:
- bad failover decisions
- network partitions
- old primaries coming back without being fenced off
- orchestration mistakes
- manual failover without clear ownership of the cluster state
Practical lesson
A good failover system is not only about promoting the standby quickly. It is also about ensuring the old primary cannot continue accepting writes as if nothing happened.
This is why fencing, traffic control, and clean topology change matter so much.
13. Monitor the Replication Path Continuously
Failover only works well when you already know the standby is healthy.
Good monitoring should include:
- replication lag
- WAL sender status
- WAL receiver status
- slot health
- archive success/failure
- standby replay state
- connection status
- and whether the standby is still actually a valid failover target
Useful PostgreSQL views
PostgreSQL’s monitoring system provides views such as:
- pg_stat_replication
- pg_stat_wal_receiver
- pg_stat_archiver
- pg_stat_replication_slots
These are important because failover quality depends on the health of the replication chain before the failure happens.
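For example, per-standby lag can be read from pg_stat_replication on the primary. This is a sketch; exact columns vary slightly by version:

```sql
-- One row per connected standby, showing state and how far behind it is,
-- both in bytes of WAL and as replay delay.
SELECT application_name, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
```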
Practical rule
Do not wait until failover to discover:
- the standby is lagging badly
- WAL archiving was broken
- the slot was misconfigured
- or the standby stopped replaying hours ago
14. Test the Whole Failover Workflow, Not Just Replication
A lot of teams validate replication once and then assume failover will work.
That is too optimistic.
You need to test:
- standby promotion
- traffic rerouting
- application reconnect behavior
- health checks
- monitoring alerts
- and post-failover cleanup steps
Good failover test questions
- How long does promotion actually take?
- What does the app do during promotion?
- How do clients find the new primary?
- Are writes retried correctly?
- Does any code still point to the old primary directly?
- Do background jobs recover cleanly?
- Does observability show the new topology correctly?
Replication that works in steady state is only part of the story.
15. Disaster Recovery Needs Restore Drills, Not Only Backup Jobs
Just as failover needs drills, disaster recovery needs restore drills.
You should be able to prove:
- base backups restore successfully
- WAL archives are usable
- recovery targets work
- the recovered cluster boots cleanly
- roles and configuration are correct
- the application can connect afterward
- the team knows the process under pressure
Good practical question
Can your team restore:
- the whole cluster
- to a specific point in time
- within the target RTO
- with acceptable data loss
If the answer is unknown, the DR design is not finished.
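A restore drill can be partly scripted. As an illustrative outline (the backup path is a placeholder; pg_verifybackup requires a backup taken with a manifest, available since PostgreSQL 13):

```shell
# Illustrative drill outline; paths are placeholders.
# 1. Verify the base backup against its manifest before trusting it.
pg_verifybackup /backups/base

# 2. Restore it into a scratch data directory, configure restore_command
#    and a recovery target, create recovery.signal, and start the server.
# 3. Confirm the application schema and recent data are present, and
#    record how long the whole drill took against the RTO.
```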
16. Plan What Happens to the Old Primary After Failover
This is one of the most overlooked steps.
After failover, the old primary is no longer part of the authoritative write path. You need a clean reintegration strategy.
In many environments, the standard answer is:
- use pg_rewind
- then turn the old primary into a standby following the new primary
PostgreSQL’s docs describe pg_rewind exactly for this scenario: bringing an old primary back online after the clusters have diverged due to failover.
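A sketch of the rewind-and-rejoin flow, with placeholder host and path. Note that pg_rewind requires wal_log_hints = on or data checksums to have been enabled before the failover, and the old primary must have been shut down cleanly:

```shell
# On the old primary, after it has been stopped cleanly and fenced off:
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server='host=new-primary.example.com user=postgres dbname=postgres'

# Then add standby.signal and primary_conninfo pointing at the new primary,
# and start the node as a standby.
```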
Why this matters
Without a reintegration plan, teams often:
- leave the old primary in a dangerous ambiguous state
- rebuild from scratch more often than necessary
- or accidentally create topology confusion during recovery
Practical warning
pg_rewind has prerequisites and works only when divergence and WAL history support it.
So you should plan for it in advance, not discover the requirements during an outage.
17. Replication Slots and Failover Slots Matter More in Complex Topologies
For more advanced setups, especially when logical replication must survive a failover, recent PostgreSQL releases add slot-related capabilities, such as failover-enabled logical slots that can be synchronized to a standby, and these need careful design.
Practical lesson
If your architecture depends on:
- logical replication continuity after failover
- synchronized standby slot behavior
- more advanced downstream consumers
then slot design becomes part of your failover design, not just a background replication detail.
This is especially true in complex or layered HA environments.
18. A Practical Failover Topology Pattern
A very common and sensible starting pattern looks like this:
Primary
- handles writes
- may also handle reads depending on load
Hot standby
- streams from primary
- can serve read-only queries if desired
- is the first failover target
Backups + WAL archiving
- stored independently
- support PITR and cluster rebuild
Monitoring + orchestration
- tracks lag, receiver state, slots, archive success, and server health
- controls or supports failover decisions
Old primary recovery plan
- fenced off after failover
- rewound or rebuilt
- rejoined as standby
This is a much more complete picture than:
- “we have a replica, so we are covered”
19. Common PostgreSQL Failover and DR Mistakes
Mistake 1: Treating a standby as a full DR plan
It is not.
Mistake 2: No tested promotion procedure
Knowing that pg_promote() exists is not the same as having a failover process.
Mistake 3: No plan for the old primary
This creates confusion and recovery delay.
Mistake 4: Ignoring split-brain risk
Fast failover is not useful if it creates two primaries.
Mistake 5: No WAL archiving or untested WAL archives
That weakens or removes PITR.
Mistake 6: No restore drills
Backups without restore testing are assumptions, not proof.
Mistake 7: Replication slots without monitoring
This can cause WAL retention problems on the primary.
Mistake 8: Assuming replicas are current enough without measuring lag
Failover quality depends on actual replication health, not hope.
20. A Practical PostgreSQL Failover and DR Checklist
Use this checklist for a serious system:
- Define RPO and RTO
- Set up streaming replication
- Decide whether async or sync replication fits the business
- Monitor lag, receiver state, archive status, and slots
- Prevent split-brain through fencing and clear traffic control
- Test standby promotion
- Test app reconnect behavior
- Take regular base backups
- Archive WAL continuously
- Test point-in-time recovery
- Write a runbook for failover and DR
- Define how the old primary is rewound or rebuilt after failover
That is the difference between having components and having a recovery strategy.
Conclusion
A good PostgreSQL failover and disaster recovery design is not one feature.
It is a combination of:
- healthy standbys
- promotion logic
- replication monitoring
- backups
- WAL archiving
- restore drills
- split-brain prevention
- and a clean path for handling the old primary after failover
That is why the most important question is not:
- “Do we have a replica?”
It is:
- “Can we recover service and data predictably when things go wrong?”
That is the real standard for PostgreSQL failover and disaster recovery.