Upserts from CSV: ON CONFLICT patterns that scale

By Elysiate · Updated Apr 11, 2026

Tags: csv · postgresql · upsert · on-conflict · merge · data-pipelines

Level: intermediate · ~14 min read · Intent: informational

Audience: developers, data analysts, ops engineers, technical teams

Prerequisites

  • basic familiarity with CSV files
  • basic familiarity with SQL inserts and updates
  • optional understanding of staging tables or ETL workflows

Key takeaways

  • For large CSV imports, the safest pattern is usually COPY into a staging table, dedupe or validate there, then upsert into the target table rather than inserting row by row.
  • PostgreSQL ON CONFLICT scales well when the conflict target is backed by the right unique index and the source batch has already been reduced to one row per business key.
  • MERGE is more flexible than ON CONFLICT for multi-branch logic and custom match conditions, but ON CONFLICT has a stronger insert-or-update guarantee in PostgreSQL under concurrency.
  • Warehouse upserts usually use MERGE rather than ON CONFLICT, so the operational pattern stays the same—stage, dedupe, merge—even when the SQL syntax changes.


A lot of CSV upsert designs look fine at ten thousand rows and fall apart at ten million.

Not because ON CONFLICT is bad, but because the surrounding pattern is wrong.

The most common scaling mistake is treating upsert as the first operation instead of one of the last ones.

Teams often do this:

  • parse the CSV
  • map rows directly to the target table
  • call INSERT ... ON CONFLICT DO UPDATE
  • hope the unique index sorts everything out

That works for small or clean batches. It gets expensive and fragile when the batch is:

  • large
  • messy
  • duplicated
  • partially invalid
  • or shared across several downstream systems

The stronger mental model is:

upsert is a target-table write strategy, not a raw-file ingestion strategy.

That means the real design starts before ON CONFLICT.

Why this topic matters

Teams usually search for this after one of these failures:

  • the upsert works, but throughput is terrible
  • duplicate source keys cause statement failures or surprising updates
  • conflict targets are too broad or too vague
  • updates touch rows unnecessarily and amplify write cost
  • staging was skipped, so bad rows are mixed with valid ones
  • a warehouse workflow needs an equivalent pattern but does not support ON CONFLICT
  • or support teams cannot reproduce what actually landed because the raw batch was never staged

The core problem is not:

  • “how do I write an upsert statement?”

It is:

  • how do I turn a CSV batch into one correct source row per target key before I ask the database to arbitrate conflicts?

That is what scales.

Start with the strongest baseline: COPY first, then upsert

For PostgreSQL, the docs are direct:

  • COPY is optimized for loading large numbers of rows
  • it incurs significantly less overhead than a series of INSERT commands
  • and it is the recommended path when populating tables at scale.

That means the scalable baseline is usually:

  1. load the CSV into a staging table with COPY
  2. validate and normalize the batch there
  3. reduce it to one source row per business key
  4. then insert into the target with ON CONFLICT
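As a rough sketch of that sequence in PostgreSQL, where csv_staging, csv_staging_clean, and customers are hypothetical names and the validate/dedupe step is shown only as a comment:

```sql
-- 1. Load the raw CSV into a staging table with COPY.
--    (Server-side path; from psql you would use \copy instead.)
COPY csv_staging (customer_id, email, updated_at)
FROM '/data/batch.csv' WITH (FORMAT csv, HEADER true);

-- 2-3. Validate, normalize, and reduce to one row per business key
--      into csv_staging_clean (assumed to exist for this sketch).

-- 4. Upsert the cleaned batch into the target.
INSERT INTO customers (customer_id, email, updated_at)
SELECT customer_id, email, updated_at
FROM csv_staging_clean
ON CONFLICT (customer_id)
DO UPDATE SET email      = EXCLUDED.email,
              updated_at = EXCLUDED.updated_at;
```

The point is the shape, not the names: the raw file lands once, fast, and only a cleaned set ever reaches the conflict-arbitrated write.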

This pattern buys you:

  • better ingest speed
  • clearer validation boundaries
  • easier replay
  • better observability
  • safer dedupe
  • and cleaner error handling

It also keeps your target table logic focused on target-table decisions instead of raw-file cleanup.

Why direct row-by-row upserts usually age badly

A row-by-row upsert path tends to accumulate hidden costs:

  • every row pays insert conflict arbitration directly against the target
  • validation gets mixed with mutation
  • duplicates inside the source batch surface too late
  • support cannot easily inspect what was staged
  • and retries can become messy because there is no durable batch boundary

This is especially painful when the CSV comes from:

  • vendors
  • spreadsheets
  • ad hoc analyst exports
  • or any source that can contain duplicate or malformed business keys

The database can resolve target conflicts. It is not the best place to discover that the input batch itself was contradictory.

ON CONFLICT is powerful, but it has a contract

PostgreSQL’s INSERT docs say ON CONFLICT DO UPDATE uses a specified arbiter constraint or unique index, and that the conflict_target can perform unique index inference. The docs also say ON CONFLICT DO UPDATE guarantees an atomic insert-or-update outcome, provided there is no independent error.

That guarantee is one of the best reasons to use it. But it comes with requirements:

  • there must be a usable unique or exclusion constraint to arbitrate the conflict
  • the conflict target must be explicit for DO UPDATE
  • the statement is only as correct as the uniqueness design underneath it

So the first scaling rule is: design the arbiter index as deliberately as the upsert statement.

A vague or overloaded uniqueness design makes upsert behavior harder to reason about.

Pick the narrowest realistic conflict target

If the business key is a single column such as customer_id, the arbiter should normally be exactly that key.

If the business key is composite, such as source_system plus external_id, the arbiter should reflect that composite uniqueness directly.

Do not make the conflict target broader than the business identity requires. And do not rely on “probably unique enough” columns.

Why? Because every extra ambiguity in the key:

  • increases conflict volume
  • complicates reasoning about updates
  • and makes batch dedupe harder upstream

The arbiter should answer one question cleanly:

  • what makes two rows the same real-world entity for this load?
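One way to make that identity explicit, using illustrative table and column names:

```sql
-- Composite business identity: one row per (source_system, external_id).
CREATE UNIQUE INDEX uq_customers_source_external
    ON customers (source_system, external_id);

-- The conflict target then names exactly that identity,
-- so unique index inference has one obvious arbiter.
INSERT INTO customers (source_system, external_id, email)
VALUES ('crm', 'A-1001', 'a@example.com')
ON CONFLICT (source_system, external_id)
DO UPDATE SET email = EXCLUDED.email;
```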

Reduce the source batch to one row per key before the upsert

This is the most important scaling practice in the article.

PostgreSQL’s docs say INSERT ... ON CONFLICT DO UPDATE is a deterministic statement and will not be allowed to affect a single existing row more than once in one statement. They say rows proposed for insertion should not duplicate each other in terms of the arbiter constraint or index.

That means a source batch with duplicate keys is not just a data-quality issue. It is an upsert-design issue.

The right pattern is:

  • dedupe in staging first
  • choose the winning row per key deterministically
  • then upsert the reduced set

A common way to do this is to:

  • assign row precedence by source timestamp, batch sequence, or explicit priority
  • use row_number() partitioned by the business key
  • keep only the winner before the final insert

The principle matters more than the exact SQL: the target table should see one proposed row per target key per statement.
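A hedged sketch of that reduction in PostgreSQL, assuming hypothetical csv_staging and customers tables with a source_ts timestamp and batch_seq sequence column:

```sql
-- Keep one winner per business key: latest source timestamp wins,
-- highest batch sequence breaks ties.
WITH ranked AS (
    SELECT s.*,
           row_number() OVER (
               PARTITION BY s.customer_id
               ORDER BY s.source_ts DESC, s.batch_seq DESC
           ) AS rn
    FROM csv_staging s
)
INSERT INTO customers (customer_id, email, updated_at)
SELECT customer_id, email, source_ts
FROM ranked
WHERE rn = 1                      -- only the winning row per key
ON CONFLICT (customer_id)
DO UPDATE SET email      = EXCLUDED.email,
              updated_at = EXCLUDED.updated_at;
```

Because the window reduction happens before the INSERT, the statement never proposes two rows for the same arbiter key.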

Decide your winner rule before the batch arrives

A lot of duplicate-key incidents happen because teams discover duplicates and only then ask:

  • which row should win?

That decision should not be made ad hoc during an incident.

A staging design should already know:

  • latest timestamp wins
  • highest sequence wins
  • source-of-truth system wins
  • non-null values preferred
  • or duplicate rows are rejected into quarantine

Without a winner rule, dedupe becomes political instead of operational.

This is especially important when CSV batches are assembled from:

  • multiple source systems
  • repeated exports
  • or human-edited files

Conditional updates reduce write amplification

PostgreSQL’s INSERT docs note that ON CONFLICT DO UPDATE ... WHERE can prevent the update branch from firing for rows that do not meet the condition, and such rows are not returned in RETURNING.

That is useful because a lot of CSV upserts are really:

  • insert new rows
  • update changed rows
  • ignore no-op rows

If you update every conflicting row whether anything changed or not, you may create avoidable:

  • write amplification
  • index churn
  • vacuum pressure
  • replication volume
  • trigger activity
  • and audit noise

A strong pattern is to update only when a relevant value is actually different. In PostgreSQL this is often expressed with a WHERE clause in the DO UPDATE branch.

The point is not syntax. The point is to avoid turning every conflict into a write.
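A minimal sketch of that conditional branch, with illustrative names:

```sql
-- Update only when a tracked value actually changed;
-- unchanged conflict rows become no-ops instead of writes.
INSERT INTO customers (customer_id, email, updated_at)
SELECT customer_id, email, source_ts
FROM csv_staging_clean
ON CONFLICT (customer_id)
DO UPDATE SET email      = EXCLUDED.email,
              updated_at = EXCLUDED.updated_at
WHERE customers.email IS DISTINCT FROM EXCLUDED.email;
```

IS DISTINCT FROM is the usual comparison here because it treats NULLs sanely, which plain inequality does not.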

RETURNING is useful for observability

PostgreSQL also says RETURNING can return rows actually inserted or updated by ON CONFLICT DO UPDATE.

That makes it useful for:

  • row-count reconciliation
  • debugging
  • auditing whether the batch inserted or updated
  • and feeding downstream metrics or logs
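For example, combining the conditional update above with RETURNING (names illustrative):

```sql
-- Returns only rows actually inserted or updated; conflict rows
-- skipped by the DO UPDATE ... WHERE condition are not returned.
INSERT INTO customers (customer_id, email)
SELECT customer_id, email
FROM csv_staging_clean
ON CONFLICT (customer_id)
DO UPDATE SET email = EXCLUDED.email
WHERE customers.email IS DISTINCT FROM EXCLUDED.email
RETURNING customer_id;
```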

For scalable operations, you usually want counts such as:

  • rows staged
  • rows invalid
  • rows deduped away
  • rows inserted
  • rows updated
  • rows skipped by conditional no-op logic

That is much better than “job succeeded.”

ON CONFLICT versus MERGE in PostgreSQL

PostgreSQL now supports both INSERT ... ON CONFLICT and MERGE. They are not interchangeable.

The transaction-isolation docs say:

  • INSERT ... ON CONFLICT DO UPDATE in Read Committed mode guarantees that each row proposed for insertion will either insert or update, unless there is an unrelated error
  • MERGE can look similar but does not guarantee that either INSERT or UPDATE will occur, because it can have multiple conditional branches that are re-evaluated under concurrency.

That is a subtle but important difference.

Use ON CONFLICT when:

  • the conflict is driven by a real unique key
  • the desired action is basically insert-or-update
  • you want PostgreSQL’s strong atomic insert-or-update behavior

Use MERGE when:

  • you need multiple conditional branches
  • you may update, delete, or insert from one statement
  • the match logic is more complex than a single arbiter constraint
  • or you want syntax closer to warehouse MERGE patterns

For classic CSV upserts into a keyed OLTP-style table, ON CONFLICT is often the better default.
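For comparison, the same keyed upsert expressed as a PostgreSQL 15+ MERGE might look like this sketch (names illustrative):

```sql
-- Multiple conditional branches in one statement, but without
-- ON CONFLICT's insert-or-update guarantee under concurrency.
MERGE INTO customers AS t
USING csv_staging_clean AS s
    ON t.customer_id = s.customer_id
WHEN MATCHED AND t.email IS DISTINCT FROM s.email THEN
    UPDATE SET email = s.email
WHEN MATCHED THEN
    DO NOTHING                      -- unchanged rows: skip the write
WHEN NOT MATCHED THEN
    INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```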

Why MERGE still matters

Even if PostgreSQL ON CONFLICT is your best local tool, MERGE matters for two reasons:

  1. you may need richer conditional logic
  2. warehouses and some analytics engines use MERGE as the standard upsert shape

So teams that want portability should understand both patterns.

The principle stays the same:

  • stage the data
  • reduce to one source row per key
  • then apply the mutation statement the engine is best at

BigQuery: MERGE is the upsert pattern

BigQuery’s DML docs say MERGE combines INSERT, UPDATE, and DELETE operations into a single statement and performs them atomically. They also note that if a target row matches more than one source row during update-style logic, the runtime error is: UPDATE/MERGE must match at most one source row for each target row.

That is very similar operationally to the PostgreSQL advice:

  • one source row per target key

BigQuery also documents DML cost behavior:

  • MERGE with only INSERT clauses processes query bytes
  • MERGE with UPDATE or DELETE clauses also involves the target table bytes scanned
  • on-demand billing depends on bytes processed.

So in BigQuery, staging and dedupe matter not only for correctness but also for cost and scan efficiency.

A strong BigQuery pattern is:

  • load raw CSV to a staging table
  • dedupe and normalize in a subquery or intermediate table
  • MERGE into the target only the reduced set

That improves both determinism and economics.
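A hedged BigQuery sketch of that pattern, with hypothetical dataset and table names:

```sql
-- Dedupe inside the USING subquery so each target row matches
-- at most one source row; QUALIFY keeps only the winner per key.
MERGE dataset.customers t
USING (
  SELECT customer_id, email, source_ts
  FROM dataset.csv_staging
  WHERE customer_id IS NOT NULL
  QUALIFY row_number() OVER (
      PARTITION BY customer_id ORDER BY source_ts DESC) = 1
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET email = s.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```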

Snowflake: MERGE plus deterministic source shaping

Snowflake’s MERGE docs say the command inserts, updates, and deletes values in a target based on a source table or subquery. They also document duplicate join behavior and warn that when multiple source rows match a single target row, update or delete behavior can be nondeterministic unless the source is reduced so each target joins against at most one source row. Snowflake explicitly recommends using GROUP BY in the source clause to ensure that.

That is the same scaling lesson again: dedupe the source before the target mutation step.

Snowflake also offers ALL BY NAME variants for insert and update when source and target column names match, even if column order differs.

That can be convenient, but the operational rule is still the same:

  • source duplicates must be handled intentionally
  • do not rely on target-side ambiguity resolution
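A minimal Snowflake sketch following that recommendation; the MAX() aggregate is only a placeholder winner rule, and all names are illustrative:

```sql
-- Reduce the source with GROUP BY, as Snowflake's docs recommend,
-- so each target row joins against at most one source row.
MERGE INTO customers t
USING (
    SELECT customer_id, MAX(email) AS email   -- placeholder winner rule
    FROM csv_staging
    GROUP BY customer_id
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email)
    VALUES (s.customer_id, s.email);
```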

DuckDB: both ON CONFLICT and MERGE exist

DuckDB supports INSERT ... ON CONFLICT with conflict targets tied to primary key or unique constraints, and also supports MERGE INTO as an alternative that does not require a primary key because it allows a custom match condition.

That makes DuckDB interesting for local profiling and ETL design because you can test both upsert styles on staged CSV data.

A useful rule here is:

Use DuckDB ON CONFLICT when:

  • your destination table really has a key constraint
  • you want keyed insert-or-update behavior similar to PostgreSQL

Use DuckDB MERGE when:

  • the matching rule is custom
  • you are prototyping warehouse-style logic
  • or the destination model does not have a formal primary key

The same staging guidance still applies.
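Both styles can be sketched side by side in recent DuckDB versions (MERGE INTO requires a newer release; names illustrative):

```sql
-- Keyed upsert, PostgreSQL-style; requires a key constraint.
INSERT INTO customers (customer_id, email)
SELECT customer_id, email FROM csv_staging_clean
ON CONFLICT (customer_id) DO UPDATE SET email = EXCLUDED.email;

-- Custom-match alternative; no key constraint required.
MERGE INTO customers t
USING csv_staging_clean s
    ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email)
    VALUES (s.customer_id, s.email);
```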

Concurrency and scale are not just syntax issues

PostgreSQL’s docs say tables with unique indexes might block if concurrent sessions lock or modify rows matching the unique index values being inserted, and that ON CONFLICT DO UPDATE still provides the insert-or-update guarantee under concurrency unless there is an unrelated error.

That means scalable upserts need more than correct SQL. They need:

  • sensible batching
  • conflict-target design
  • staging-table isolation
  • and awareness of how many concurrent writers are competing on the same key space

If every worker upserts random duplicates into the same hot keys, the unique index will become a coordination point whether you planned for that or not.

So scaling often means:

  • fewer, cleaner batches
  • pre-deduped source sets
  • and partitioning or sequencing by key range where practical

A practical staging-table pattern

This pattern works well in many systems.

Step 1. Land raw CSV into staging

Use the fastest load primitive available:

  • COPY in PostgreSQL
  • load job in BigQuery
  • staged load in Snowflake
  • local read into DuckDB staging

Keep the raw batch intact.

Step 2. Validate structure and types

Before upsert:

  • delimiter
  • encoding
  • row width
  • headers
  • required fields
  • key nullability

Do not let malformed data mix with target conflict logic.
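Several of these checks are just plain staging queries. A sketch, assuming an illustrative csv_staging table keyed on customer_id:

```sql
-- Required-field and key-nullability check.
SELECT count(*) AS null_keys
FROM csv_staging
WHERE customer_id IS NULL;

-- Duplicate business keys that must be resolved before the upsert.
SELECT customer_id, count(*) AS dupes
FROM csv_staging
GROUP BY customer_id
HAVING count(*) > 1;
```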

Step 3. Normalize and dedupe

Produce one winning row per business key. Add derived fields as needed.

Step 4. Upsert into the target

Use:

  • PostgreSQL INSERT ... ON CONFLICT
  • BigQuery MERGE
  • Snowflake MERGE
  • DuckDB ON CONFLICT or MERGE

Step 5. Record outcomes

Capture inserted, updated, skipped, rejected, and deduped counts.

That sequence is much more robust than direct target writes from the parser.

When ON CONFLICT is the right answer

Use PostgreSQL ON CONFLICT as the default when:

  • the target has a clear unique key
  • the desired operation is mostly insert-or-update
  • you want PostgreSQL’s strong deterministic semantics
  • you can dedupe the source batch first
  • and you do not need delete branches or complex conditional match logic

This is the classic “CSV into keyed operational table” case.

When MERGE is the right answer

Use MERGE when:

  • you need inserts, updates, and deletes in one statement
  • the match logic is more complex than one unique-key conflict
  • you are working in BigQuery or Snowflake
  • you want one statement with multiple branch conditions
  • or the destination key is logical rather than enforced by a unique index

MERGE is more expressive. That does not make it simpler under concurrency.

Common anti-patterns

Anti-pattern 1. Upserting directly from raw CSV into the final table

You lose staging, validation, replay, and deterministic dedupe.

Anti-pattern 2. Letting duplicate source keys hit ON CONFLICT directly

PostgreSQL rejects any single statement that tries to affect the same target row more than once.

Anti-pattern 3. Using a fuzzy or overloaded conflict target

The arbiter should reflect real business identity, not convenience.

Anti-pattern 4. Updating every conflict row even when nothing changed

That adds avoidable write cost and noise.

Anti-pattern 5. Copying PostgreSQL patterns straight into warehouses

BigQuery and Snowflake use MERGE semantics and have their own determinism and cost tradeoffs.



FAQ

What is the safest way to upsert a large CSV into PostgreSQL?

Usually by loading into staging with COPY, deduplicating there, then inserting into the target with INSERT ... ON CONFLICT DO UPDATE.

Why not just upsert row by row?

Because it increases overhead, hides batch-level validation issues, and makes replay and debugging weaker.

Why must the source batch be deduped first?

Because PostgreSQL treats ON CONFLICT DO UPDATE as deterministic and will not let one statement affect the same existing row more than once. BigQuery and Snowflake also behave badly or nondeterministically when one target row matches multiple source rows.

When is MERGE better than ON CONFLICT?

When you need multiple conditional branches, deletes, or non-unique-key match logic, or when you are in a warehouse engine where MERGE is the standard upsert construct.

Does ON CONFLICT guarantee insert or update under concurrency?

In PostgreSQL Read Committed mode, yes: absent unrelated errors, each proposed row will either insert or update. PostgreSQL’s docs call out that MERGE does not make the same guarantee.

What is the safest default mindset?

Treat upsert as the last step of a staged, deduped, validated batch—not as the first place raw CSV rows land.

Final takeaway

Upserts from CSV scale when you stop asking the target table to solve every ingestion problem.

The safest baseline is:

  • load fast into staging
  • validate and normalize there
  • reduce to one row per key
  • use the narrowest correct arbiter or match condition
  • update only when values actually changed
  • and choose ON CONFLICT or MERGE based on engine semantics, not habit

That is how CSV upserts become predictable at scale instead of merely surviving small test batches.

About the author

Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.
