PostgreSQL High Availability Architecture Guide

Updated Apr 3, 2026

Tags: postgresql, database, sql, high-availability, replication, failover

Level: intermediate · ~15 min read · Intent: informational

Audience: backend developers, database engineers, technical teams

Prerequisites

  • basic familiarity with PostgreSQL

Key takeaways

  • A good PostgreSQL HA architecture is not just a replica. It is a coordinated design that combines standbys, failover decisions, connection routing, monitoring, backups, and a clear plan for recovering the old primary.
  • The biggest architectural trade-off in PostgreSQL HA is between lower write latency and lower failover data loss. Asynchronous replication is simpler and faster for writes, while synchronous replication reduces data-loss risk at the cost of commit latency and availability trade-offs.

FAQ

What is the best PostgreSQL high availability architecture?
For many production systems, a strong starting point is one primary, at least one hot standby, connection routing through a stable endpoint or proxy, continuous monitoring, backups with WAL archiving, and a tested failover procedure.
Do read replicas make PostgreSQL highly available?
Not by themselves. A standby is only part of the architecture. You also need failover logic, application reconnect behavior, monitoring, and a recovery plan for the old primary.
Should I use synchronous or asynchronous replication in PostgreSQL HA?
It depends on your recovery point objective. Asynchronous replication is more common and keeps write latency lower, while synchronous replication reduces failover data loss risk but adds commit latency and stronger dependency on standby health.
Is PostgreSQL HA the same as disaster recovery?
No. High availability focuses on keeping the service online through failover. Disaster recovery is broader and includes backup, restore, and point-in-time recovery after corruption, operator error, or full-environment loss.

A PostgreSQL high availability architecture is not a single feature.

It is not:

  • “we have a replica”
  • “we enabled streaming replication”
  • or “we can promote a standby if something goes wrong”

Those are pieces. They are not the architecture.

A real PostgreSQL HA design has to answer a much bigger set of questions:

  • what happens when the primary dies?
  • what happens when the standby dies?
  • how quickly does traffic move?
  • how much committed data can be lost?
  • how do you avoid split-brain?
  • how do clients reconnect?
  • what happens to the old primary after failover?
  • and how do backups fit into the design when the failure is not just host loss, but bad data?

That is why high availability in PostgreSQL is really an architecture problem, not only a replication problem.

This guide explains how to think about that architecture clearly.

1. Start With the Outcome You Need

Before deciding on any topology, define the availability outcome.

Two numbers matter most:

RPO: Recovery Point Objective

How much data loss can you tolerate?

Examples:

  • near-zero
  • 30 seconds
  • 5 minutes
  • 1 hour

RTO: Recovery Time Objective

How long can the database be unavailable?

Examples:

  • 30 seconds
  • 2 minutes
  • 15 minutes
  • 1 hour

These drive almost every HA decision.

Why this matters

If your product can tolerate:

  • a few minutes of downtime
  • and some recent writes potentially not being on the standby yet

then asynchronous replication with a simple failover design may be enough.

If your product needs:

  • lower failover data-loss risk
  • and very fast promotion with minimal manual work

then you will likely need:

  • better standby health guarantees
  • stronger failover automation
  • and possibly synchronous replication or carefully designed semi-strict workflows

Do not start with:

  • “should we add a replica?”

Start with:

  • “what service outcome do we need when something fails?”

2. Know the Difference Between HA and Disaster Recovery

This distinction prevents a lot of bad architecture decisions.

High availability

High availability is about:

  • keeping service online
  • promoting a standby
  • minimizing downtime
  • continuing normal writes quickly

Disaster recovery

Disaster recovery is about:

  • recovering after bad state
  • restoring from backup
  • replaying WAL
  • recovering to a safe point in time
  • surviving corruption, operator error, or full-environment loss

A standby helps with HA. It does not replace DR.

If a destructive migration or bad delete reaches the primary, it can also reach the standby. That is why a serious PostgreSQL architecture usually needs:

  • HA design
  • and DR design

not one pretending to be the other.

3. The Smallest Serious PostgreSQL HA Architecture

For many production systems, the smallest architecture that deserves to be called serious looks like this:

Primary

  • accepts writes
  • may serve reads too depending on workload

Hot standby

  • streams from primary
  • stays ready to be promoted
  • may optionally serve read-only traffic

Stable connection endpoint

  • DNS, proxy, or service endpoint for writers
  • so applications do not hard-code one host forever

Monitoring and failover process

  • tracks primary health
  • tracks standby health and lag
  • decides when promotion should happen
  • alerts or triggers action

Backups plus WAL archiving

  • because failover is not enough for full recovery safety

That is the baseline shape many teams should aim for before inventing something more exotic.
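As a minimal sketch of the replication link in this baseline, the two nodes are typically wired up like this (hostnames, user names, and the application name are placeholders, and exact settings vary by PostgreSQL version):

```ini
# postgresql.conf on the primary
wal_level = replica        # required for streaming replication
max_wal_senders = 10       # allow standby and base-backup connections

# postgresql.auto.conf on the standby,
# plus an empty standby.signal file in the data directory (PostgreSQL 12+)
primary_conninfo = 'host=db-primary port=5432 user=replicator application_name=standby1'
```

Connection routing, monitoring, and backups sit around this link; they are not expressed in these two files.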

4. Primary and Standby Roles Need to Be Operationally Clear

A common mistake is building a topology where everyone conceptually knows:

  • “this is the primary”
  • “that is the standby”

but the operational system does not make those roles explicit enough.

Your architecture should make it easy to answer:

  • which node currently accepts writes
  • which node is replaying WAL
  • which node is read-only
  • which node would be promoted first
  • how applications discover the current writer
  • and who or what is allowed to promote a standby

If the answer is:

  • “someone will SSH in and figure it out”

then the HA design is still too vague.
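A quick way to make roles operationally explicit is to ask PostgreSQL directly rather than relying on tribal knowledge:

```sql
-- On any node: returns false on the primary, true on a standby replaying WAL
SELECT pg_is_in_recovery();

-- On the primary: which standbys are connected, and whether they are sync or async
SELECT application_name, state, sync_state
FROM pg_stat_replication;
```

Wiring checks like these into monitoring and runbooks is what turns conceptual roles into operational ones.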

5. Use Hot Standby When Read Access to the Replica Matters

PostgreSQL hot standby allows read-only queries on a standby while it continues replaying WAL.

This is useful when the standby is doing double duty:

  • failover target
  • read endpoint

Good reasons to use hot standby reads

  • read-heavy APIs
  • dashboards
  • analytics-lite workloads
  • support tooling
  • reporting that can tolerate some lag

Important trade-off

A standby serving read traffic is still a recovery machine first.

That means you need to think about:

  • replay lag
  • conflicting long-running queries
  • whether read load weakens failover readiness
  • and whether the replica should stay clean and lightly loaded instead

For some systems, the best failover target is a mostly idle standby. For others, it is acceptable for the standby to serve reads too.

The architecture should choose this deliberately.
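If the standby does serve reads, a few settings control how replay and queries interact. The values below are illustrative, not recommendations:

```ini
# postgresql.conf on a read-serving standby
hot_standby = on                    # allow read-only queries during recovery
max_standby_streaming_delay = 30s   # how long replay may pause for conflicting queries
hot_standby_feedback = on           # fewer query cancellations, at some bloat cost on the primary
```

Each of these trades read-query stability against replay freshness, which is exactly the failover-readiness trade-off described above.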

6. Replication Is the Core HA Transport, Not the Whole HA Story

Streaming replication is the core transport layer for many PostgreSQL HA systems.

It moves WAL changes from primary to standby and keeps the standby close enough to take over.

That is essential.

But replication alone does not answer:

  • when failover happens
  • who decides
  • how split-brain is prevented
  • where traffic goes after promotion
  • how the old primary is handled
  • and what happens if the failure is data corruption rather than host loss

This is why “we set up streaming replication” is not the same as “we designed HA.”

Replication is the transport. Architecture is the whole system around it.

7. Choose Between Asynchronous and Synchronous Replication on Purpose

This is one of the biggest HA trade-offs in PostgreSQL.

Asynchronous replication

With asynchronous replication:

  • the primary can commit before the standby confirms receipt or replay
  • write latency is usually lower
  • failover is operationally simpler

Trade-off

The standby can lag slightly behind, so failover may lose some of the most recent committed transactions.

This is the most common design because it is usually a good trade for many applications.

Synchronous replication

With synchronous replication:

  • commits wait for one or more standbys depending on configuration
  • failover data loss risk is reduced
  • write durability across nodes is stronger

Trade-off

You pay for this with:

  • higher commit latency
  • stronger dependency on standby health
  • more complicated operational trade-offs when standbys degrade

Practical architecture lesson

Choose asynchronous replication when:

  • you want lower latency and simpler behavior
  • small recent-write loss is acceptable in a failover event

Choose synchronous replication when:

  • lower failover data loss matters more than the extra latency and operational dependency

Neither is universally “better.” They serve different recovery goals.
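In configuration terms, the choice comes down to a couple of primary-side settings. The standby name below is a placeholder:

```ini
# postgresql.conf on the primary

# Asynchronous replication is the default:
# leave synchronous_standby_names empty.

# Synchronous: commits wait for the named standby to acknowledge
synchronous_standby_names = 'FIRST 1 (standby1)'
synchronous_commit = on   # 'remote_apply' waits for replay, not just receipt
```

Note that `synchronous_commit` has several levels (`remote_write`, `on`, `remote_apply`) that let you tune how much durability you buy per unit of commit latency.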

8. Quorum and Priority Behavior Matter in Multi-Standby Designs

Once you have more than one standby, your HA design becomes more interesting.

Now you must think about:

  • which standby should be preferred
  • whether one or many standbys must acknowledge synchronous commits
  • whether one standby is local and another remote
  • and how promotion preference works

A mature HA architecture often needs to distinguish between:

  • preferred failover target
  • synchronous acknowledgment targets
  • remote disaster standby
  • local fast-failover standby

This is where topology matters more than raw node count.

Two standbys add little value if one is always stale or cannot actually be promoted safely, even when both look identical in configuration.
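PostgreSQL expresses these preferences through the `synchronous_standby_names` syntax. The standby names below are placeholders:

```ini
# Quorum commit: any 1 of the two listed standbys must acknowledge
synchronous_standby_names = 'ANY 1 (standby_local, standby_remote)'

# Priority commit: prefer standby_local;
# standby_remote acknowledges only if standby_local is unavailable
synchronous_standby_names = 'FIRST 1 (standby_local, standby_remote)'
```

Quorum (`ANY`) favors availability across equivalent standbys; priority (`FIRST`) favors a deliberate ordering, such as local-first with a remote fallback.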

9. Never Treat Promotion as the Architecture

PostgreSQL supports promotion. That is important.

But promotion is only one step.

Your architecture must also define:

  • who is allowed to promote
  • how stale standbys are disqualified
  • how clients move to the new writer
  • how the old primary is fenced off
  • how monitoring state changes afterward
  • and how the new topology is stabilized

If failover is just:

  • “run pg_promote()”

then you have a promotion command, not a complete HA architecture.
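For reference, the promotion command itself is a single call (PostgreSQL 12+); everything listed above still has to happen around it:

```sql
-- On the standby being promoted; waits up to 60 seconds for promotion to complete.
-- This is only the promotion step, not the fencing, routing, or monitoring work.
SELECT pg_promote(wait => true, wait_seconds => 60);
```

The gap between this one-liner and a safe failover is precisely the architecture this section describes.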

10. Connection Routing Is Part of High Availability

Applications should not need to know every topology change directly.

A good HA architecture gives clients a stable way to find:

  • the current writer
  • and sometimes a reader endpoint too

Common patterns include:

  • DNS indirection
  • proxy-based routing
  • a service endpoint
  • load balancer or virtual IP approaches
  • cluster managers that update service discovery

Why this matters

Without connection routing, failover becomes a larger application incident, because every client must discover the new primary on its own.

The better the routing layer, the easier it is for the application to survive the role switch.
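One client-side building block worth knowing: libpq itself supports multi-host connection strings that find the current writer. The hostnames and database name below are placeholders:

```
# libpq URI: try each host in order, but only accept the one accepting writes
postgresql://db-a.internal,db-b.internal/appdb?target_session_attrs=read-write
```

This does not replace a proper routing layer, but it gives applications a degree of failover tolerance even before DNS or proxies are updated.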

11. Split-Brain Prevention Is a First-Class Design Requirement

A PostgreSQL HA architecture that can produce two primaries is dangerous, even if it fails over quickly.

Split-brain can create:

  • conflicting writes
  • diverged timelines
  • hard reconciliation work
  • and very confusing recovery paths

How it happens

Common causes include:

  • automatic failover without good failure detection
  • network partitions
  • old primary coming back without being fenced off
  • humans promoting a standby before confirming the old primary is really gone
  • cluster tooling that lacks strong coordination guarantees

Practical lesson

Fast failover is not enough. Safe failover is what matters.

Any HA architecture discussion that ignores split-brain is incomplete.

12. Replication Slots Usually Belong in Serious HA Designs

Replication slots can make a PostgreSQL HA architecture more reliable by preventing WAL needed by a standby from disappearing too early.

This matters because if a standby falls behind and the primary recycles WAL too aggressively, the standby may need to be rebuilt.

Why slots help

They help ensure:

  • the primary retains WAL needed by connected or lagging standbys
  • recovery chains are more predictable
  • catch-up behavior is more reliable

The operational trade-off

If a standby stops consuming WAL and the slot stays active, WAL can accumulate on the primary.

That means slots improve safety, but they must be monitored.

A good HA architecture does not only enable replication slots. It also monitors:

  • slot lag
  • retained WAL growth
  • stale slots
  • and broken replica consumption
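Both sides of this trade-off are visible from SQL. The slot name below is a placeholder; a standby uses it by setting `primary_slot_name`:

```sql
-- On the primary: create a physical slot for a standby to consume
SELECT pg_create_physical_replication_slot('standby1_slot');

-- Monitor slot health: inactive slots and growing retained WAL need alerts
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

A slot that is inactive while `retained_wal` keeps growing is exactly the failure mode described above, and it deserves a page, not just a dashboard line.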

13. Backup and WAL Archiving Still Belong in the HA Diagram

A lot of teams mentally place backups in a separate DR box outside the HA design.

That is often too simplistic.

A strong PostgreSQL architecture usually shows backups and WAL archiving as part of the full resilience picture because:

  • failover does not protect against bad replicated changes
  • HA without recoverability is still fragile
  • the same team often owns both availability and recovery

Why WAL archiving matters architecturally

If you lose:

  • the primary host
  • and the standby is unavailable or stale
  • or you need to recover to a specific point before corruption

then the architecture falls back to:

  • base backups
  • archived WAL
  • and point-in-time recovery

That is not a separate afterthought. It is part of the real resilience model.
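At the configuration level, the WAL-archiving leg of that model is a small amount of setup on the primary. The archive path is a placeholder; production systems usually use a dedicated tool (pgBackRest, Barman, WAL-G) rather than a bare copy:

```ini
# postgresql.conf on the primary
archive_mode = on
archive_command = 'test ! -f /wal-archive/%f && cp %p /wal-archive/%f'
```

Base backups (for example via pg_basebackup) plus this archive stream are what make point-in-time recovery possible when failover alone cannot help.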

14. Decide Whether the Standby Is Only for Failover or Also for Reads

This is one of the biggest architecture questions.

Failover-only standby

Benefits:

  • cleaner failover target
  • less read contention
  • simpler operational expectations

Read-serving standby

Benefits:

  • better read scaling
  • offload from primary
  • more hardware efficiency

Trade-offs

A read-serving standby may also see:

  • replay lag
  • query conflicts
  • slower recovery under certain workloads
  • more operational complexity

There is no universal answer. But the architecture should make the decision explicit.

15. Design for the Old Primary After Failover

This is one of the most overlooked HA architecture topics.

After failover:

  • the standby becomes the new primary
  • the old primary is no longer authoritative
  • the cluster is in a temporarily degraded state until redundancy is restored

Now what?

A strong architecture includes a defined process for:

  • fencing off the old primary
  • deciding whether it can be rewound
  • using pg_rewind where appropriate
  • or rebuilding it cleanly as a new standby
  • then restoring redundancy

If the answer is:

  • “we will think about that after the outage”

then the architecture is incomplete.
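Where the old primary's timeline allows it, pg_rewind can resynchronize it instead of a full rebuild. Paths and connection details below are placeholders, and pg_rewind requires `wal_log_hints = on` or data checksums to have been enabled beforehand:

```shell
# On the old primary, after it has been stopped and fenced off
pg_rewind \
  --target-pgdata=/var/lib/postgresql/data \
  --source-server='host=new-primary user=postgres dbname=postgres'
# Then set primary_conninfo, create standby.signal, and start it as a standby
```

If the prerequisites are not met or the rewind fails, the fallback is a clean rebuild from a fresh base backup.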

16. Monitoring Has to Be HA-Aware, Not Just Database-Aware

Normal PostgreSQL monitoring is not enough for HA.

Your monitoring needs to know:

  • current primary
  • standby replay lag
  • WAL sender/receiver health
  • replication slot health
  • archive success/failure
  • standby readiness for failover
  • whether the old primary is still reachable
  • whether the system is in a degraded single-writer/no-standby state

Good HA-specific alerts

  • standby disconnected
  • lag above threshold
  • archive failures
  • slot retention growing too far
  • no healthy failover target
  • cluster in a degraded state after failover
  • split-brain suspicion conditions

The more serious the uptime promise, the more HA-specific your observability needs to become.
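Much of that HA-specific state is queryable directly, which makes it straightforward to feed into alerting:

```sql
-- On the primary: per-standby replication lag (PostgreSQL 10+)
SELECT application_name, state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On a standby: how far replay is behind the last WAL received
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_gap_bytes;
```

Thresholds on `replay_lag` and the replay gap are natural inputs for the "lag above threshold" and "standby readiness" alerts listed above.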

17. Automatic Failover Is Powerful, but Wrong Automation Is Dangerous

A lot of teams assume automatic failover is always better.

That is not necessarily true.

Automatic failover is excellent when:

  • the environment is well understood
  • failure detection is strong
  • fencing is reliable
  • routing changes cleanly
  • and the team has tested the behavior repeatedly

Automatic failover is risky when:

  • failure signals are noisy
  • network partitions are possible
  • the routing layer is weak
  • or the cluster state model is not well controlled

Practical rule

If you automate failover, automate it well. Bad automatic failover is often worse than slower manual failover.

18. Cross-Zone or Cross-Region Placement Changes the Trade-Offs

A local standby and a remote standby serve different goals.

Same-zone or same-region standby

Usually better for:

  • lower latency
  • faster local failover
  • synchronous replication feasibility
  • tighter operational control

Cross-region or remote standby

Usually better for:

  • site-level disaster tolerance
  • regional outage resilience
  • remote recovery options

Architecture lesson

Many mature PostgreSQL HA designs eventually use both:

  • a nearer standby for fast failover
  • a farther standby or backup path for broader disaster resilience

But the farther away the node is, the more you need to think about:

  • lag
  • commit latency if synchronous
  • routing complexity
  • and realistic recovery behavior

19. Keep the Architecture Boring Until It Earns Complexity

This is an underrated HA principle.

A boring HA architecture is often the best one.

That usually means:

  • one primary
  • one hot standby
  • clear promotion process
  • one stable write endpoint
  • backups plus WAL archiving
  • tested failover runbooks
  • tested restore runbooks
  • and good monitoring

A lot of teams add complexity too early:

  • too many standbys
  • too many routing layers
  • failover logic they do not fully understand
  • region-spanning topologies before they have local failover working well

Complexity is sometimes necessary. But it should usually be earned by real requirements, not assumed from the start.

20. A Strong PostgreSQL HA Architecture Checklist

Use this as a practical architecture checklist:

  1. Define RPO and RTO
  2. Decide whether async or sync replication fits the business
  3. Build at least one healthy standby
  4. Define a stable write endpoint for clients
  5. Define how failover is triggered
  6. Define how split-brain is prevented
  7. Monitor lag, slots, and archive health
  8. Include backups and WAL archiving in the resilience model
  9. Define how the old primary is rewound or rebuilt
  10. Test failover and restore regularly

If any of those are still vague, the architecture is still incomplete.


Conclusion

A PostgreSQL high availability architecture is really a coordination design.

It combines:

  • primary and standby roles
  • replication mode
  • failover rules
  • routing
  • monitoring
  • backup and WAL strategy
  • and a recovery plan for the old primary after failover

That is why the best HA question is not:

  • “Do we have a replica?”

It is:

  • “When the primary fails, can the system keep serving safely, predictably, and with acceptable data loss?”

That is the standard a real PostgreSQL HA architecture should meet.
