Upserts from CSV: ON CONFLICT patterns that scale

By Elysiate · Updated Apr 11, 2026

Tags: csv · postgresql · upsert · on-conflict · merge · data-pipelines

Level: intermediate · ~14 min read · Intent: informational

Audience: developers, data analysts, ops engineers, technical teams

Prerequisites

  • basic familiarity with CSV files
  • basic familiarity with SQL inserts and updates
  • optional understanding of staging tables or ETL workflows

Key takeaways

  • For large CSV imports, the safest pattern is usually COPY into a staging table, dedupe or validate there, then upsert into the target table rather than inserting row by row.
  • PostgreSQL ON CONFLICT scales well when the conflict target is backed by the right unique index and the source batch has already been reduced to one row per business key.
  • MERGE is more flexible than ON CONFLICT for multi-branch logic and custom match conditions, but ON CONFLICT has a stronger insert-or-update guarantee in PostgreSQL under concurrency.
  • Warehouse upserts usually use MERGE rather than ON CONFLICT, so the operational pattern stays the same—stage, dedupe, merge—even when the SQL syntax changes.


A lot of CSV upsert designs look fine at ten thousand rows and fall apart at ten million.

Not because ON CONFLICT is bad, but because the surrounding pattern is wrong.

The most common scaling mistake is treating upsert as the first operation instead of one of the last ones.

Teams often do this:

  • parse the CSV
  • map rows directly to the target table
  • call INSERT ... ON CONFLICT DO UPDATE
  • hope the unique index sorts everything out

That works for small or clean batches. It gets expensive and fragile when the batch is:

  • large
  • messy
  • duplicated
  • partially invalid
  • or shared across several downstream systems

The stronger mental model is:

upsert is a target-table write strategy, not a raw-file ingestion strategy.

That means the real design starts before ON CONFLICT.

Why this topic matters

Teams usually search for this after one of these failures:

  • the upsert works, but throughput is terrible
  • duplicate source keys cause statement failures or surprising updates
  • conflict targets are too broad or too vague
  • updates touch rows unnecessarily and amplify write cost
  • staging was skipped, so bad rows are mixed with valid ones
  • a warehouse workflow needs an equivalent pattern but does not support ON CONFLICT
  • or support teams cannot reproduce what actually landed because the raw batch was never staged

The core problem is not:

  • “how do I write an upsert statement?”

It is:

  • how do I turn a CSV batch into one correct source row per target key before I ask the database to arbitrate conflicts?

That is what scales.

Start with the strongest baseline: COPY first, then upsert

For PostgreSQL, the docs are direct:

  • COPY is optimized for loading large numbers of rows
  • it incurs significantly less overhead than a series of INSERT commands
  • and it is the recommended path when populating tables at scale.

That means the scalable baseline is usually:

  1. load the CSV into a staging table with COPY
  2. validate and normalize the batch there
  3. reduce it to one source row per business key
  4. then insert into the target with ON CONFLICT
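As a rough sketch of that sequence in PostgreSQL, where csv_staging, csv_staging_clean, and customers are hypothetical names and the validate/dedupe step is shown only as a comment:

```sql
-- 1. Load the raw CSV into a staging table with COPY.
--    (Server-side path; from psql you would use \copy instead.)
COPY csv_staging (customer_id, email, updated_at)
FROM '/data/batch.csv' WITH (FORMAT csv, HEADER true);

-- 2-3. Validate, normalize, and reduce to one row per business key
--      into csv_staging_clean (assumed to exist for this sketch).

-- 4. Upsert the cleaned batch into the target.
INSERT INTO customers (customer_id, email, updated_at)
SELECT customer_id, email, updated_at
FROM csv_staging_clean
ON CONFLICT (customer_id)
DO UPDATE SET email      = EXCLUDED.email,
              updated_at = EXCLUDED.updated_at;
```

The point is the shape, not the names: the raw file lands once, fast, and only a cleaned set ever reaches the conflict-arbitrated write.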

This pattern buys you:

  • better ingest speed
  • clearer validation boundaries
  • easier replay
  • better observability
  • safer dedupe
  • and cleaner error handling

It also keeps your target table logic focused on target-table decisions instead of raw-file cleanup.

Why direct row-by-row upserts usually age badly

A row-by-row upsert path tends to accumulate hidden costs:

  • every row pays insert conflict arbitration directly against the target
  • validation gets mixed with mutation
  • duplicates inside the source batch surface too late
  • support cannot easily inspect what was staged
  • and retries can become messy because there is no durable batch boundary

This is especially painful when the CSV comes from:

  • vendors
  • spreadsheets
  • ad hoc analyst exports
  • or any source that can contain duplicate or malformed business keys

The database can resolve target conflicts. It is not the best place to discover that the input batch itself was contradictory.

ON CONFLICT is powerful, but it has a contract

PostgreSQL’s INSERT docs say ON CONFLICT DO UPDATE uses a specified arbiter constraint or unique index, and that the conflict_target can perform unique index inference. The docs also say ON CONFLICT DO UPDATE guarantees an atomic insert-or-update outcome, provided there is no independent error.

That guarantee is one of the best reasons to use it. But it comes with requirements:

  • there must be a usable unique or exclusion constraint to arbitrate the conflict
  • the conflict target must be explicit for DO UPDATE
  • the statement is only as correct as the uniqueness design underneath it

So the first scaling rule is: design the arbiter index as deliberately as the upsert statement.

A vague or overloaded uniqueness design makes upsert behavior harder to reason about.

Pick the narrowest realistic conflict target

If the business key is a single column such as customer_id, the arbiter should normally be exactly that key.

If the business key is composite, such as source_system plus external_id, the arbiter should reflect that composite uniqueness directly.

Do not make the conflict target broader than the business identity requires. And do not rely on “probably unique enough” columns.

Why? Because every extra ambiguity in the key:

  • increases conflict volume
  • complicates reasoning about updates
  • and makes batch dedupe harder upstream

The arbiter should answer one question cleanly:

  • what makes two rows the same real-world entity for this load?
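One way to make that identity explicit, using illustrative table and column names:

```sql
-- Composite business identity: one row per (source_system, external_id).
CREATE UNIQUE INDEX uq_customers_source_external
    ON customers (source_system, external_id);

-- The conflict target then names exactly that identity,
-- so unique index inference has one obvious arbiter.
INSERT INTO customers (source_system, external_id, email)
VALUES ('crm', 'A-1001', 'a@example.com')
ON CONFLICT (source_system, external_id)
DO UPDATE SET email = EXCLUDED.email;
```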

Reduce the source batch to one row per key before the upsert

This is the most important scaling practice in the article.

PostgreSQL’s docs say INSERT ... ON CONFLICT DO UPDATE is a deterministic statement and will not be allowed to affect a single existing row more than once in one statement. They say rows proposed for insertion should not duplicate each other in terms of the arbiter constraint or index.

That means a source batch with duplicate keys is not just a data-quality issue. It is an upsert-design issue.

The right pattern is:

  • dedupe in staging first
  • choose the winning row per key deterministically
  • then upsert the reduced set

A common way to do this is to:

  • assign row precedence by source timestamp, batch sequence, or explicit priority
  • use row_number() partitioned by the business key
  • keep only the winner before the final insert

The principle matters more than the exact SQL: the target table should see one proposed row per target key per statement.
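A hedged sketch of that reduction in PostgreSQL, assuming hypothetical csv_staging and customers tables with a source_ts timestamp and batch_seq sequence column:

```sql
-- Keep one winner per business key: latest source timestamp wins,
-- highest batch sequence breaks ties.
WITH ranked AS (
    SELECT s.*,
           row_number() OVER (
               PARTITION BY s.customer_id
               ORDER BY s.source_ts DESC, s.batch_seq DESC
           ) AS rn
    FROM csv_staging s
)
INSERT INTO customers (customer_id, email, updated_at)
SELECT customer_id, email, source_ts
FROM ranked
WHERE rn = 1                      -- only the winning row per key
ON CONFLICT (customer_id)
DO UPDATE SET email      = EXCLUDED.email,
              updated_at = EXCLUDED.updated_at;
```

Because the window reduction happens before the INSERT, the statement never proposes two rows for the same arbiter key.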

Decide your winner rule before the batch arrives

A lot of duplicate-key incidents happen because teams discover duplicates and only then ask:

  • which row should win?

That decision should not be made ad hoc during an incident.

A staging design should already know:

  • latest timestamp wins
  • highest sequence wins
  • source-of-truth system wins
  • non-null values preferred
  • or duplicate rows are rejected into quarantine

Without a winner rule, dedupe becomes political instead of operational.

This is especially important when CSV batches are assembled from:

  • multiple source systems
  • repeated exports
  • or human-edited files

Conditional updates reduce write amplification

PostgreSQL’s INSERT docs note that ON CONFLICT DO UPDATE ... WHERE can prevent the update branch from firing for rows that do not meet the condition, and such rows are not returned in RETURNING.

That is useful because a lot of CSV upserts are really:

  • insert new rows
  • update changed rows
  • ignore no-op rows

If you update every conflicting row whether anything changed or not, you may create avoidable:

  • write amplification
  • index churn
  • vacuum pressure
  • replication volume
  • trigger activity
  • and audit noise

A strong pattern is to update only when a relevant value is actually different. In PostgreSQL this is often expressed with a WHERE clause in the DO UPDATE branch.

The point is not syntax. The point is to avoid turning every conflict into a write.
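A minimal sketch of that conditional branch, with illustrative names:

```sql
-- Update only when a tracked value actually changed;
-- unchanged conflict rows become no-ops instead of writes.
INSERT INTO customers (customer_id, email, updated_at)
SELECT customer_id, email, source_ts
FROM csv_staging_clean
ON CONFLICT (customer_id)
DO UPDATE SET email      = EXCLUDED.email,
              updated_at = EXCLUDED.updated_at
WHERE customers.email IS DISTINCT FROM EXCLUDED.email;
```

IS DISTINCT FROM is the usual comparison here because it treats NULLs sanely, which plain inequality does not.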

RETURNING is useful for observability

PostgreSQL also says RETURNING can return rows actually inserted or updated by ON CONFLICT DO UPDATE.

That makes it useful for:

  • row-count reconciliation
  • debugging
  • auditing whether the batch inserted or updated
  • and feeding downstream metrics or logs
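For example, combining the conditional update above with RETURNING (names illustrative):

```sql
-- Returns only rows actually inserted or updated; conflict rows
-- skipped by the DO UPDATE ... WHERE condition are not returned.
INSERT INTO customers (customer_id, email)
SELECT customer_id, email
FROM csv_staging_clean
ON CONFLICT (customer_id)
DO UPDATE SET email = EXCLUDED.email
WHERE customers.email IS DISTINCT FROM EXCLUDED.email
RETURNING customer_id;
```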

For scalable operations, you usually want counts such as:

  • rows staged
  • rows invalid
  • rows deduped away
  • rows inserted
  • rows updated
  • rows skipped by conditional no-op logic

That is much better than “job succeeded.”

ON CONFLICT versus MERGE in PostgreSQL

PostgreSQL now supports both INSERT ... ON CONFLICT and MERGE. They are not interchangeable.

The transaction-isolation docs say:

  • INSERT ... ON CONFLICT DO UPDATE in Read Committed mode guarantees that each row proposed for insertion will either insert or update, unless there is an unrelated error
  • MERGE can look similar but does not guarantee that either INSERT or UPDATE will occur, because it can have multiple conditional branches that are re-evaluated under concurrency.

That is a subtle but important difference.

Use ON CONFLICT when:

  • the conflict is driven by a real unique key
  • the desired action is basically insert-or-update
  • you want PostgreSQL’s strong atomic insert-or-update behavior

Use MERGE when:

  • you need multiple conditional branches
  • you may update, delete, or insert from one statement
  • the match logic is more complex than a single arbiter constraint
  • or you want syntax closer to warehouse MERGE patterns

For classic CSV upserts into a keyed OLTP-style table, ON CONFLICT is often the better default.
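For comparison, the same keyed upsert expressed as a PostgreSQL 15+ MERGE might look like this sketch (names illustrative):

```sql
-- Multiple conditional branches in one statement, but without
-- ON CONFLICT's insert-or-update guarantee under concurrency.
MERGE INTO customers AS t
USING csv_staging_clean AS s
    ON t.customer_id = s.customer_id
WHEN MATCHED AND t.email IS DISTINCT FROM s.email THEN
    UPDATE SET email = s.email
WHEN MATCHED THEN
    DO NOTHING                      -- unchanged rows: skip the write
WHEN NOT MATCHED THEN
    INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```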

Why MERGE still matters

Even if PostgreSQL ON CONFLICT is your best local tool, MERGE matters for two reasons:

  1. you may need richer conditional logic
  2. warehouses and some analytics engines use MERGE as the standard upsert shape

So teams that want portability should understand both patterns.

The principle stays the same:

  • stage the data
  • reduce to one source row per key
  • then apply the mutation statement the engine is best at

BigQuery: MERGE is the upsert pattern

BigQuery’s DML docs say MERGE combines INSERT, UPDATE, and DELETE operations into a single statement and performs them atomically. They also note that if a target row matches more than one source row during update-style logic, the runtime error is: UPDATE/MERGE must match at most one source row for each target row.

That is very similar operationally to the PostgreSQL advice:

  • one source row per target key

BigQuery also documents DML cost behavior:

  • MERGE with only INSERT clauses processes query bytes
  • MERGE with UPDATE or DELETE clauses also involves the target table bytes scanned
  • on-demand billing depends on bytes processed.

So in BigQuery, staging and dedupe matter not only for correctness but also for cost and scan efficiency.

A strong BigQuery pattern is:

  • load raw CSV to a staging table
  • dedupe and normalize in a subquery or intermediate table
  • MERGE into the target only the reduced set

That improves both determinism and economics.
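A hedged BigQuery sketch of that pattern, with hypothetical dataset and table names:

```sql
-- Dedupe inside the USING subquery so each target row matches
-- at most one source row; QUALIFY keeps only the winner per key.
MERGE dataset.customers t
USING (
  SELECT customer_id, email, source_ts
  FROM dataset.csv_staging
  WHERE customer_id IS NOT NULL
  QUALIFY row_number() OVER (
      PARTITION BY customer_id ORDER BY source_ts DESC) = 1
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET email = s.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```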

Snowflake: MERGE plus deterministic source shaping

Snowflake’s MERGE docs say the command inserts, updates, and deletes values in a target based on a source table or subquery. They also document duplicate join behavior and warn that when multiple source rows match a single target row, update or delete behavior can be nondeterministic unless the source is reduced so each target joins against at most one source row. Snowflake explicitly recommends using GROUP BY in the source clause to ensure that.

That is the same scaling lesson again: dedupe the source before the target mutation step.

Snowflake also offers ALL BY NAME variants for insert and update when source and target column names match, even if column order differs.

That can be convenient, but the operational rule is still the same:

  • source duplicates must be handled intentionally
  • do not rely on target-side ambiguity resolution
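A minimal Snowflake sketch following that recommendation; the MAX() aggregate is only a placeholder winner rule, and all names are illustrative:

```sql
-- Reduce the source with GROUP BY, as Snowflake's docs recommend,
-- so each target row joins against at most one source row.
MERGE INTO customers t
USING (
    SELECT customer_id, MAX(email) AS email   -- placeholder winner rule
    FROM csv_staging
    GROUP BY customer_id
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email)
    VALUES (s.customer_id, s.email);
```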

DuckDB: both ON CONFLICT and MERGE exist

DuckDB supports INSERT ... ON CONFLICT with conflict targets tied to primary key or unique constraints, and also supports MERGE INTO as an alternative that does not require a primary key because it allows a custom match condition.

That makes DuckDB interesting for local profiling and ETL design because you can test both upsert styles on staged CSV data.

A useful rule here is:

Use DuckDB ON CONFLICT when:

  • your destination table really has a key constraint
  • you want keyed insert-or-update behavior similar to PostgreSQL

Use DuckDB MERGE when:

  • the matching rule is custom
  • you are prototyping warehouse-style logic
  • or the destination model does not have a formal primary key

The same staging guidance still applies.
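Both styles can be sketched side by side in recent DuckDB versions (MERGE INTO requires a newer release; names illustrative):

```sql
-- Keyed upsert, PostgreSQL-style; requires a key constraint.
INSERT INTO customers (customer_id, email)
SELECT customer_id, email FROM csv_staging_clean
ON CONFLICT (customer_id) DO UPDATE SET email = EXCLUDED.email;

-- Custom-match alternative; no key constraint required.
MERGE INTO customers t
USING csv_staging_clean s
    ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email)
    VALUES (s.customer_id, s.email);
```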

Concurrency and scale are not just syntax issues

PostgreSQL’s docs say tables with unique indexes might block if concurrent sessions lock or modify rows matching the unique index values being inserted, and that ON CONFLICT DO UPDATE still provides the insert-or-update guarantee under concurrency unless there is an unrelated error.

That means scalable upserts need more than correct SQL. They need:

  • sensible batching
  • conflict-target design
  • staging-table isolation
  • and awareness of how many concurrent writers are competing on the same key space

If every worker upserts random duplicates into the same hot keys, the unique index will become a coordination point whether you planned for that or not.

So scaling often means:

  • fewer, cleaner batches
  • pre-deduped source sets
  • and partitioning or sequencing by key range where practical

A practical staging-table pattern

This pattern works well in many systems.

Step 1. Land raw CSV into staging

Use the fastest load primitive available:

  • COPY in PostgreSQL
  • load job in BigQuery
  • staged load in Snowflake
  • local read into DuckDB staging

Keep the raw batch intact.

Step 2. Validate structure and types

Before upsert:

  • delimiter
  • encoding
  • row width
  • headers
  • required fields
  • key nullability

Do not let malformed data mix with target conflict logic.
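Several of these checks are just plain staging queries. A sketch, assuming an illustrative csv_staging table keyed on customer_id:

```sql
-- Required-field and key-nullability check.
SELECT count(*) AS null_keys
FROM csv_staging
WHERE customer_id IS NULL;

-- Duplicate business keys that must be resolved before the upsert.
SELECT customer_id, count(*) AS dupes
FROM csv_staging
GROUP BY customer_id
HAVING count(*) > 1;
```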

Step 3. Normalize and dedupe

Produce one winning row per business key. Add derived fields as needed.

Step 4. Upsert into the target

Use:

  • PostgreSQL INSERT ... ON CONFLICT
  • BigQuery MERGE
  • Snowflake MERGE
  • DuckDB ON CONFLICT or MERGE

Step 5. Record outcomes

Capture inserted, updated, skipped, rejected, and deduped counts.

That sequence is much more robust than direct target writes from the parser.

When ON CONFLICT is the right answer

Use PostgreSQL ON CONFLICT as the default when:

  • the target has a clear unique key
  • the desired operation is mostly insert-or-update
  • you want PostgreSQL’s strong deterministic semantics
  • you can dedupe the source batch first
  • and you do not need delete branches or complex conditional match logic

This is the classic “CSV into keyed operational table” case.

When MERGE is the right answer

Use MERGE when:

  • you need inserts, updates, and deletes in one statement
  • the match logic is more complex than one unique-key conflict
  • you are working in BigQuery or Snowflake
  • you want one statement with multiple branch conditions
  • or the destination key is logical rather than enforced by a unique index

MERGE is more expressive. That does not make it simpler under concurrency.

Common anti-patterns

Anti-pattern 1. Upserting directly from raw CSV into the final table

You lose staging, validation, replay, and deterministic dedupe.

Anti-pattern 2. Letting duplicate source keys hit ON CONFLICT directly

PostgreSQL rejects any single statement that tries to affect the same target row more than once.

Anti-pattern 3. Using a fuzzy or overloaded conflict target

The arbiter should reflect real business identity, not convenience.

Anti-pattern 4. Updating every conflict row even when nothing changed

That adds avoidable write cost and noise.

Anti-pattern 5. Copying PostgreSQL patterns straight into warehouses

BigQuery and Snowflake use MERGE semantics and have their own determinism and cost tradeoffs.



FAQ

What is the safest way to upsert a large CSV into PostgreSQL?

Usually by loading into staging with COPY, deduplicating there, then inserting into the target with INSERT ... ON CONFLICT DO UPDATE.

Why not just upsert row by row?

Because it increases overhead, hides batch-level validation issues, and makes replay and debugging weaker.

Why must the source batch be deduped first?

Because PostgreSQL treats ON CONFLICT DO UPDATE as deterministic and will not let one statement affect the same existing row more than once. BigQuery and Snowflake also behave badly or nondeterministically when one target row matches multiple source rows.

When is MERGE better than ON CONFLICT?

When you need multiple conditional branches, deletes, or non-unique-key match logic, or when you are in a warehouse engine where MERGE is the standard upsert construct.

Does ON CONFLICT guarantee insert or update under concurrency?

In PostgreSQL Read Committed mode, yes: absent unrelated errors, each proposed row will either insert or update. PostgreSQL’s docs call out that MERGE does not make the same guarantee.

What is the safest default mindset?

Treat upsert as the last step of a staged, deduped, validated batch—not as the first place raw CSV rows land.

Final takeaway

Upserts from CSV scale when you stop asking the target table to solve every ingestion problem.

The safest baseline is:

  • load fast into staging
  • validate and normalize there
  • reduce to one row per key
  • use the narrowest correct arbiter or match condition
  • update only when values actually changed
  • and choose ON CONFLICT or MERGE based on engine semantics, not habit

That is how CSV upserts become predictable at scale instead of merely surviving small test batches.

About the author

Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.
