Idempotent CSV Loads into PostgreSQL: Patterns and Pitfalls
Level: intermediate · ~15 min read · Intent: informational
Audience: developers, data engineers, ops engineers, backend teams, analytics engineers
Prerequisites
- basic familiarity with PostgreSQL
- basic understanding of CSV imports or ETL jobs
Key takeaways
- A CSV load is only idempotent when rerunning the same file produces the same durable table state, not merely when the SQL finishes successfully.
- The most reliable PostgreSQL pattern is usually raw-file retention plus a staging table plus a deterministic merge or upsert into the final table using stable business keys.
- COPY is fast, but COPY alone does not make a load replay-safe. Idempotency comes from keys, constraints, batch bookkeeping, and carefully chosen insert or merge logic.
FAQ
- What does idempotent mean for a PostgreSQL CSV load?
- It means rerunning the same input file does not create duplicate facts, incorrect updates, or a different final table state than the original successful run.
- Is COPY by itself idempotent?
- No. COPY is a loading mechanism, not an idempotency strategy. Without keys, constraints, and replay logic, rerunning the same file can duplicate data.
- Should I use ON CONFLICT or MERGE?
- For many replay-safe insert-or-update patterns, INSERT ... ON CONFLICT is simpler and stronger. MERGE is useful for broader match logic, but it is not interchangeable with ON CONFLICT.
- Why are staging tables useful for idempotent loads?
- Because they let you land the raw rows safely, validate them, deduplicate them, and then apply deterministic business rules before touching the final tables.
Idempotent CSV Loads into PostgreSQL: Patterns and Pitfalls
A lot of teams say they want idempotent CSV loads when what they really mean is:
“Please let us rerun the job without making a mess.”
That is the right instinct, because CSV ingestion into PostgreSQL fails in very ordinary ways:
- the job crashes halfway through
- a vendor resends yesterday’s file
- an operator reruns the pipeline manually
- a bad row causes partial handling
- the same data arrives under a different filename
- a merge step updates records it should have left alone
If your ingestion path is not idempotent, those moments turn into:
- duplicate rows
- overwritten history
- non-repeatable downstream metrics
- replay fear
- long incident cleanups
That is why idempotency is not a bonus feature. It is the thing that makes retries safe.
If you want to validate the file before the database ever sees it, start with the CSV Validator, CSV Format Checker, and CSV Header Checker. If your feed comes from multiple files, the CSV Merge and Converter are useful upstream helpers.
This guide explains what idempotent CSV loading really means in PostgreSQL, which patterns actually work, and which shortcuts look fine in demos but break in production.
What idempotent means in practice
For a PostgreSQL CSV load, idempotent usually means:
Running the same logical input again should leave the target tables in the same correct end state.
That does not mean:
- the SQL command is repeatable in a vacuum
- the loader never sees duplicates
- the file load never errors
It means:
- reruns are safe
- retries do not corrupt data
- partial failures can be recovered without guessing
- final tables converge on one intended state
This is a state-management problem, not only a SQL syntax problem.
The first principle: COPY is fast, not magical
PostgreSQL’s COPY documentation makes clear that COPY FROM is about moving file data into a table quickly, with options for CSV format, headers, null markers, encoding, ON_ERROR, REJECT_LIMIT, and more. If ENCODING is omitted, PostgreSQL uses the current client encoding. With ON_ERROR ignore, PostgreSQL skips rows that fail data-type conversion, and REJECT_LIMIT controls how many such errors are tolerated before the command fails.
That is great for ingestion mechanics. It is not an idempotency strategy.
If you run:
COPY orders FROM '/data/orders.csv' WITH (FORMAT csv, HEADER true);
twice against the same plain target table, PostgreSQL will happily attempt to load the same rows twice unless your schema and logic stop it.
So the first production rule is simple:
Do not confuse fast loading with replay-safe loading.
The second principle: idempotency needs a stable identity
No database can protect you from replays if it cannot tell whether a row is “the same” row.
That means you need one of these:
- a stable business key from the source
- a deterministic composite key
- a trustworthy natural key
- a staging-time deduplication fingerprint that maps back to a real business identity
Examples:
- order_id
- invoice_number
- (customer_id, invoice_date, line_number)
- (external_system_id, record_type)
Weak keys create weak idempotency.
If your feed does not contain stable identity, your job becomes much harder because you are trying to deduplicate semantics after the fact.
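When a composite business key is the best identity available, it is worth enforcing it in the schema rather than only in loader logic. A minimal sketch, assuming a hypothetical invoice_lines table keyed by (customer_id, invoice_date, line_number):

```sql
-- Enforce the business identity so replays cannot silently duplicate rows.
ALTER TABLE invoice_lines
  ADD CONSTRAINT invoice_lines_business_key
  UNIQUE (customer_id, invoice_date, line_number);
```

With this constraint in place, a later INSERT ... ON CONFLICT can target the same column list, and any replayed row is recognized instead of duplicated.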
The safest default pattern: raw file + staging table + deterministic write
The most reliable production pattern usually looks like this:
1. Preserve the raw file
Keep:
- original bytes
- file name
- checksum
- received timestamp
- source metadata
2. Load into a staging table
Use COPY into a raw or semi-raw table first.
3. Validate and normalize
Check:
- encoding
- delimiter
- header shape
- required columns
- duplicates inside the batch
- type coercion
- null semantics
4. Write into final tables deterministically
Use:
- INSERT ... ON CONFLICT, or
- MERGE, or
- an explicit insert/update/delete workflow
5. Record batch outcome
Track:
- file checksum
- row counts
- reject counts
- final write counts
- whether the batch was already processed
This is much safer than copying directly into production tables and hoping uniqueness errors tell the full story.
Why staging tables matter so much
Staging tables are not just an ETL cliché. They solve several real idempotency problems.
They let you:
- land data before business logic touches it
- profile what arrived
- deduplicate within the batch
- quarantine bad rows
- compare file content to already-processed batches
- separate file-acceptance from business-state mutation
Without staging, it becomes much harder to answer:
- did we already get this file?
- which rows were invalid?
- did we partially update the target?
- can we replay safely from the same raw input?
That is why staging is often the first real step toward idempotency.
A good batch registry is worth more than people think
One of the cleanest patterns is to maintain a batch or file registry table.
Typical fields:
- batch_id
- source_filename
- source_checksum
- received_at
- loaded_at
- status
- raw_row_count
- accepted_row_count
- rejected_row_count
Why this matters:
If the same file arrives twice with the same checksum, you may be able to:
- skip it
- flag it
- reprocess intentionally
- compare the new run to the old run safely
A checksum is not the same thing as a business key, but it is a very useful file-level idempotency signal.
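One way to sketch such a registry, using hypothetical table and column names based on the fields above:

```sql
CREATE TABLE load_batches (
    batch_id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_filename    text NOT NULL,
    source_checksum    text NOT NULL UNIQUE,  -- file-level idempotency signal
    received_at        timestamptz NOT NULL DEFAULT now(),
    loaded_at          timestamptz,
    status             text NOT NULL DEFAULT 'received',
    raw_row_count      integer,
    accepted_row_count integer,
    rejected_row_count integer
);
```

The UNIQUE constraint on source_checksum turns "did we already get this file?" into a plain conflict check; if intentional reprocessing is part of your workflow, you would relax that constraint or key on checksum plus attempt number instead.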
When INSERT ... ON CONFLICT is the right tool
PostgreSQL’s INSERT documentation says ON CONFLICT provides an alternative to raising a unique or exclusion constraint violation. ON CONFLICT DO NOTHING skips conflicting inserts, while ON CONFLICT DO UPDATE updates the existing row. PostgreSQL also documents that conflict_target can infer a suitable unique index and that DO UPDATE requires a conflict target.
That makes INSERT ... ON CONFLICT a great fit when:
- the target table has a real unique key
- the main decision is insert vs update
- you want simple replay-safe semantics
- the source and target match one logical row identity
Example:
INSERT INTO customers (customer_id, email, full_name, updated_at)
SELECT customer_id, email, full_name, updated_at
FROM staging_customers
ON CONFLICT (customer_id) DO UPDATE
SET email = EXCLUDED.email,
full_name = EXCLUDED.full_name,
updated_at = EXCLUDED.updated_at;
This is often the simplest strong default for dimension-like upserts.
Why ON CONFLICT still has pitfalls
ON CONFLICT is powerful, but it can still surprise teams.
1. No unique index means no real protection
Without the right unique constraint or unique index, your replay logic is weak.
2. Bad conflict target means wrong merges
If the chosen key is not truly stable, different logical rows can collapse together.
3. Triggers may fire for both insert and update paths
PostgreSQL’s trigger docs explicitly note that INSERT ... ON CONFLICT DO UPDATE may cause both insert and update operations, so both kinds of triggers can fire as needed.
If your trigger logic is not written with that in mind, “idempotent” loads can still produce surprising side effects.
4. Read Committed semantics matter
PostgreSQL’s transaction-isolation docs say that in Read Committed mode, INSERT ... ON CONFLICT DO UPDATE guarantees that each row proposed for insertion will either insert or update, absent unrelated errors. That is one reason it is such a strong fit for many replay-safe workflows.
That guarantee is useful, but only when your conflict key is correct.
When MERGE is the right tool
PostgreSQL’s MERGE documentation shows that it can combine INSERT, UPDATE, DELETE, and DO NOTHING actions based on source-target matching conditions. That makes it useful when your idempotent write path is more complex than plain insert-or-update.
Examples where MERGE helps:
- insert new rows
- update matched rows
- delete target rows absent from the new source snapshot
- handle source-vs-target reconciliation in one statement
This can be excellent for full-snapshot loads.
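A sketch of snapshot reconciliation, assuming hypothetical products and staging_products tables. The WHEN NOT MATCHED BY SOURCE branch requires PostgreSQL 17 or later; on older versions the delete of target-only rows needs a separate statement:

```sql
MERGE INTO products AS t
USING staging_products AS s
    ON t.product_id = s.product_id
WHEN MATCHED AND (t.name, t.price) IS DISTINCT FROM (s.name, s.price) THEN
    UPDATE SET name = s.name, price = s.price
WHEN NOT MATCHED THEN
    INSERT (product_id, name, price)
    VALUES (s.product_id, s.name, s.price)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;  -- remove target rows absent from the new snapshot
```

Rerunning this against the same staging content changes nothing, which is exactly the convergence that idempotency asks for.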
Why MERGE is not interchangeable with ON CONFLICT
PostgreSQL’s docs explicitly warn that MERGE and INSERT ... ON CONFLICT DO UPDATE are not interchangeable. The transaction-isolation docs also note that, unlike ON CONFLICT, MERGE does not guarantee that each source row will result in either an insert or an update under concurrent conditions.
That means a good rule of thumb is:
Prefer ON CONFLICT when
- your operation is basically upsert
- you have a strong unique key
- you want simpler concurrency semantics
Prefer MERGE when
- you need broader match logic
- you need delete or conditional branches too
- you are reconciling a snapshot, not just upserting rows
This distinction matters a lot in production.
A practical staging-table pattern
One solid pattern is:
Raw landing table
All columns stored as text plus batch metadata.
Typed staging table
Normalized, cast, and validated rows ready for business rules.
Final table
The real durable target with unique constraints and indexes.
That gives you three layers of protection:
- file acceptance
- schema/type acceptance
- business-state mutation
Example workflow:
- COPY the raw CSV into staging_raw_orders
- insert validated rows into staging_orders_typed
- deduplicate inside the batch if needed
- INSERT ... ON CONFLICT into orders
This keeps each concern much easier to debug.
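Put together, that workflow might look like the following sketch, with hypothetical column names and a single transaction around the typed-staging and final writes:

```sql
BEGIN;

-- Layer 1: file acceptance (raw text columns, no casts yet).
COPY staging_raw_orders FROM '/data/orders.csv' WITH (FORMAT csv, HEADER true);

-- Layer 2: type acceptance, deduplicating inside the batch.
-- ORDER BY makes the within-batch winner deterministic (latest row per key).
INSERT INTO staging_orders_typed (order_id, customer_id, amount, ordered_at)
SELECT DISTINCT ON (order_id)
       order_id::bigint,
       customer_id::bigint,
       amount::numeric,
       ordered_at::timestamptz
FROM staging_raw_orders
ORDER BY order_id, ordered_at::timestamptz DESC;

-- Layer 3: deterministic business-state mutation.
INSERT INTO orders (order_id, customer_id, amount, ordered_at)
SELECT order_id, customer_id, amount, ordered_at
FROM staging_orders_typed
ON CONFLICT (order_id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
    amount      = EXCLUDED.amount,
    ordered_at  = EXCLUDED.ordered_at;

COMMIT;
```

Each layer can fail independently and be inspected independently, which is most of the debugging benefit.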
File-level idempotency vs row-level idempotency
These are related but not identical.
File-level idempotency
Asks:
- did we already process this exact file?
Useful tools:
- checksum
- filename
- manifest
- batch registry
Row-level idempotency
Asks:
- did we already apply this logical row state?
Useful tools:
- business key
- unique constraint
- upsert or merge logic
- effective-date or history model
Strong pipelines usually need both.
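A file-level guard can be as simple as one query against the batch registry. A sketch, assuming a hypothetical load_batches table with source_checksum and status columns:

```sql
-- Returns true if this exact file content already loaded successfully.
SELECT EXISTS (
    SELECT 1
    FROM load_batches
    WHERE source_checksum = '3a7bd3e2f1...'  -- checksum of the incoming file
      AND status = 'loaded'
) AS already_processed;
```

Row-level protection then comes from unique keys and upsert logic on the final tables; neither check replaces the other.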
Unlogged tables: fast, but know the tradeoff
PostgreSQL’s CREATE TABLE docs say unlogged tables are considerably faster because they are not written to WAL, but they are not crash-safe: they are automatically truncated after a crash or unclean shutdown, and they are not replicated to standby servers.
That means unlogged staging tables can be reasonable for:
- transient landing
- local performance optimization
- pipelines where raw files are safely retained elsewhere
But they are a bad choice if your team assumes the stage itself is durable after a crash.
A good rule:
- use unlogged tables only when you understand the replay story
- never let unlogged speed tricks become your only copy of in-flight truth
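A sketch of the tradeoff in schema form, assuming the raw file is retained elsewhere so the stage can be rebuilt after a crash:

```sql
-- Fast landing zone: skips WAL, but is truncated after a crash
-- and is not replicated to standbys.
CREATE UNLOGGED TABLE staging_raw_orders (
    batch_id  bigint NOT NULL,
    line_data text
);

-- If a staging table later becomes load-bearing, make it durable.
ALTER TABLE staging_raw_orders SET LOGGED;
```

The SET LOGGED escape hatch is useful precisely because staging tables have a habit of quietly becoming the only copy of in-flight truth.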
Bad rows and partial acceptance need policy
Recent PostgreSQL releases add ON_ERROR, REJECT_LIMIT, and LOG_VERBOSITY options to COPY for handling conversion failures. The documentation notes that the number of ignored rows is reported when the command completes, and that verbose logging can include the input line and column name for each discarded row.
That is useful, but it creates a policy decision:
Do you want:
- all-or-nothing batches
- partial acceptance with quarantined bad rows
- unlimited ignored conversion errors
- a strict reject ceiling
Unlimited row skipping is rarely what teams really want in financial or customer-facing pipelines. REJECT_LIMIT gives you a more controlled compromise.
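A hedged example of a strict reject ceiling, assuming a PostgreSQL version recent enough to support all three options (ON_ERROR and LOG_VERBOSITY arrived in PostgreSQL 17; REJECT_LIMIT is newer still):

```sql
-- Tolerate at most 100 conversion failures, then fail the whole batch.
COPY staging_raw_orders
FROM '/data/orders.csv'
WITH (FORMAT csv,
      HEADER true,
      ON_ERROR ignore,
      REJECT_LIMIT 100,
      LOG_VERBOSITY verbose);  -- log input line and column per discarded row
```

This expresses the "partial acceptance with a ceiling" policy directly in the load command instead of leaving it implicit.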
Common idempotent patterns that work well
Pattern 1: insert-only facts with DO NOTHING
Best when:
- a fact row should never be updated once accepted
- duplicates should be ignored
- unique key is strong
Example:
INSERT INTO invoice_lines (...)
SELECT ...
FROM staging_invoice_lines
ON CONFLICT (invoice_line_id) DO NOTHING;
Pattern 2: current-state dimensions with DO UPDATE
Best when:
- latest source state should win
- conflict key is trustworthy
- updates are intentional
Pattern 3: snapshot reconciliation with MERGE
Best when:
- source file represents the whole current state
- deletes or target-only removals matter
- branch logic is more complex
Pattern 4: history-preserving loads
Best when:
- changes should be tracked historically
- idempotency must preserve history instead of overwriting it
This usually means effective dates, versioning, or append-plus-closeout logic instead of blind overwrite.
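One hedged sketch of append-plus-closeout logic, assuming hypothetical customer_history and staging_customers tables where the current version of a row has valid_to IS NULL:

```sql
-- Close out the current version when the incoming state differs.
UPDATE customer_history h
SET valid_to = now()
FROM staging_customers s
WHERE h.customer_id = s.customer_id
  AND h.valid_to IS NULL
  AND (h.email, h.full_name) IS DISTINCT FROM (s.email, s.full_name);

-- Append a new open version for changed or brand-new customers.
INSERT INTO customer_history (customer_id, email, full_name, valid_from, valid_to)
SELECT s.customer_id, s.email, s.full_name, now(), NULL
FROM staging_customers s
LEFT JOIN customer_history h
  ON h.customer_id = s.customer_id AND h.valid_to IS NULL
WHERE h.customer_id IS NULL;
```

Rerunning the same batch is a no-op: every staged customer already has an open version matching staging, so nothing closes and the anti-join inserts nothing, which preserves history instead of overwriting it.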
Common anti-patterns
Copying directly into the final table
Fast in demos, fragile in production.
No unique key on the target
Then reruns cannot be trusted.
Using a surrogate key as the only conflict target
If the surrogate is assigned at insert time, it does not help recognize replayed source rows.
Treating MERGE as always equivalent to ON CONFLICT
PostgreSQL explicitly warns they are not interchangeable.
Letting “ignore bad rows” hide real feed failures
Skipping conversion errors without a policy can turn silent data loss into a habit.
Forgetting trigger side effects
ON CONFLICT may fire both insert and update triggers.
A practical decision framework
Ask these questions in order.
1. Do you have a real stable business key?
If no, solve identity first.
2. Is the load append-only, current-state, or snapshot-reconciling?
This tells you whether to prefer:
- DO NOTHING
- DO UPDATE
- MERGE
- history logic
3. Can you preserve the raw file and batch metadata?
If yes, recovery and replay get much easier.
4. Are partial bad-row skips acceptable?
If yes, define reject policy clearly. If no, fail the batch loudly.
5. Is concurrency part of the story?
If yes, simplicity matters even more, and ON CONFLICT may be preferable for straightforward upserts.
Which Elysiate tools fit this article best?
For this topic, the most natural supporting tools are:
- CSV Validator
- CSV Format Checker
- CSV Header Checker
- CSV Delimiter Checker
- CSV Merge
- Converter
- CSV tools hub
These fit naturally because replay-safe PostgreSQL loads begin with structurally trustworthy CSV files and deterministic transformation steps.
FAQ
What does idempotent mean for a PostgreSQL CSV load?
It means rerunning the same input file does not create duplicate facts, incorrect updates, or a different final table state than the original successful run.
Is COPY by itself idempotent?
No. COPY is a loading mechanism, not an idempotency strategy. Without keys, constraints, and replay logic, rerunning the same file can duplicate data. PostgreSQL’s COPY docs describe ingestion behavior and error-handling options, not replay-safe semantics.
Should I use ON CONFLICT or MERGE?
For many replay-safe insert-or-update patterns, INSERT ... ON CONFLICT is simpler and stronger. MERGE is useful for broader match logic, but PostgreSQL explicitly notes the statements are not interchangeable.
Why are staging tables useful for idempotent loads?
Because they let you land the raw rows safely, validate them, deduplicate them, and then apply deterministic business rules before touching the final tables.
Are unlogged staging tables safe?
They can be useful for speed, but PostgreSQL documents that unlogged tables are not crash-safe and are truncated after a crash or unclean shutdown. They are also not replicated to standby servers.
What is the safest default?
Preserve raw files, load into staging, validate and deduplicate there, then write into the final tables with stable business keys and explicit upsert or merge behavior.
Final takeaway
Idempotent CSV loading into PostgreSQL is not about finding one clever SQL statement.
It is about combining the right layers:
- stable row identity
- raw-file retention
- batch tracking
- staging tables
- deterministic final writes
- explicit duplicate and bad-row policy
COPY gets the bytes in fast.
ON CONFLICT and MERGE help shape the final write.
But idempotency comes from the design around them.
That is what makes reruns boring instead of dangerous.
About the author
Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.