Idempotent CSV Loads into PostgreSQL: Patterns and Pitfalls

By Elysiate · Updated Apr 8, 2026
Tags: csv · postgresql · database · idempotency · etl · data-pipelines

Level: intermediate · ~15 min read · Intent: informational

Audience: developers, data engineers, ops engineers, backend teams, analytics engineers

Prerequisites

  • basic familiarity with PostgreSQL
  • basic understanding of CSV imports or ETL jobs

Key takeaways

  • A CSV load is only idempotent when rerunning the same file produces the same durable table state, not merely when the SQL finishes successfully.
  • The most reliable PostgreSQL pattern is usually raw-file retention plus a staging table plus a deterministic merge or upsert into the final table using stable business keys.
  • COPY is fast, but COPY alone does not make a load replay-safe. Idempotency comes from keys, constraints, batch bookkeeping, and carefully chosen insert or merge logic.


A lot of teams say they want idempotent CSV loads when what they really mean is:

“Please let us rerun the job without making a mess.”

That is the right instinct, because CSV ingestion into PostgreSQL fails in very ordinary ways:

  • the job crashes halfway through
  • a vendor resends yesterday’s file
  • an operator reruns the pipeline manually
  • a bad row causes partial handling
  • the same data arrives under a different filename
  • a merge step updates records it should have left alone

If your ingestion path is not idempotent, those moments turn into:

  • duplicate rows
  • overwritten history
  • non-repeatable downstream metrics
  • replay fear
  • long incident cleanups

That is why idempotency is not a bonus feature. It is the thing that makes retries safe.

If you want to validate the file before the database ever sees it, start with the CSV Validator, CSV Format Checker, and CSV Header Checker. If your feed comes from multiple files, the CSV Merge and Converter are useful upstream helpers.

This guide explains what idempotent CSV loading really means in PostgreSQL, which patterns actually work, and which shortcuts look fine in demos but break in production.

What idempotent means in practice

For a PostgreSQL CSV load, idempotent usually means:

Running the same logical input again should leave the target tables in the same correct end state.

That does not mean:

  • the SQL command is repeatable in a vacuum
  • the loader never sees duplicates
  • the file load never errors

It means:

  • reruns are safe
  • retries do not corrupt data
  • partial failures can be recovered without guessing
  • final tables converge on one intended state

This is a state-management problem, not only a SQL syntax problem.

The first principle: COPY is fast, not magical

PostgreSQL’s COPY documentation makes clear that COPY FROM is about moving file data into a table quickly, with options for CSV format, headers, null markers, encoding, ON_ERROR, REJECT_LIMIT, and more. If ENCODING is omitted, PostgreSQL uses the current client encoding. If ON_ERROR='ignore' is used, PostgreSQL can skip erroneous rows, and REJECT_LIMIT controls how many conversion errors are tolerated before the command fails.

That is great for ingestion mechanics. It is not an idempotency strategy.

If you run:

COPY orders FROM '/data/orders.csv' WITH (FORMAT csv, HEADER true);

twice against the same plain target table, PostgreSQL will happily attempt to load the same rows twice unless your schema and logic stop it.

So the first production rule is simple:

Do not confuse fast loading with replay-safe loading.

The second principle: idempotency needs a stable identity

No database can protect you from replays if it cannot tell whether a row is “the same” row.

That means you need one of these:

  • a stable business key from the source
  • a deterministic composite key
  • a trustworthy natural key
  • a staging-time deduplication fingerprint that maps back to a real business identity

Examples:

  • order_id
  • invoice_number
  • (customer_id, invoice_date, line_number)
  • (external_system_id, record_type)

Weak keys create weak idempotency.

If your feed does not contain stable identity, your job becomes much harder because you are trying to deduplicate semantics after the fact.
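To make a composite identity enforceable, it has to exist as a unique constraint on the target. A minimal sketch, using the hypothetical invoice-line key above:

ALTER TABLE invoice_lines
    ADD CONSTRAINT invoice_lines_business_key
    UNIQUE (customer_id, invoice_date, line_number);

Without a constraint like this, neither ON CONFLICT nor replay detection has anything to anchor on.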

The safest default pattern: raw file + staging table + deterministic write

The most reliable production pattern usually looks like this:

1. Preserve the raw file

Keep:

  • original bytes
  • file name
  • checksum
  • received timestamp
  • source metadata

2. Load into a staging table

Use COPY into a raw or semi-raw table first.

3. Validate and normalize

Check:

  • encoding
  • delimiter
  • header shape
  • required columns
  • duplicates inside the batch
  • type coercion
  • null semantics

4. Write into final tables deterministically

Use:

  • INSERT ... ON CONFLICT
  • or MERGE
  • or an explicit insert/update/delete workflow

5. Record batch outcome

Track:

  • file checksum
  • row counts
  • reject counts
  • final write counts
  • whether the batch was already processed

This is much safer than copying directly into production tables and hoping uniqueness errors tell the full story.
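Steps 2 and 4 above might look like this, assuming hypothetical staging_orders and orders tables keyed on order_id:

-- Step 2: land the batch in staging (cleared per run)
TRUNCATE staging_orders;
COPY staging_orders FROM '/data/orders.csv' WITH (FORMAT csv, HEADER true);

-- Step 4: deterministic write into the final table
INSERT INTO orders (order_id, customer_id, amount, updated_at)
SELECT order_id, customer_id, amount, updated_at
FROM staging_orders
ON CONFLICT (order_id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
    amount     = EXCLUDED.amount,
    updated_at = EXCLUDED.updated_at;

Because the final write is keyed on order_id, running both steps a second time with the same file converges on the same table state.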

Why staging tables matter so much

Staging tables are not just an ETL cliché. They solve several real idempotency problems.

They let you:

  • land data before business logic touches it
  • profile what arrived
  • deduplicate within the batch
  • quarantine bad rows
  • compare file content to already-processed batches
  • separate file-acceptance from business-state mutation

Without staging, it becomes much harder to answer:

  • did we already get this file?
  • which rows were invalid?
  • did we partially update the target?
  • can we replay safely from the same raw input?

That is why staging is often the first real step toward idempotency.

A good batch registry is worth more than people think

One of the cleanest patterns is to maintain a batch or file registry table.

Typical fields:

  • batch_id
  • source_filename
  • source_checksum
  • received_at
  • loaded_at
  • status
  • raw_row_count
  • accepted_row_count
  • rejected_row_count

Why this matters:

If the same file arrives twice with the same checksum, you may be able to:

  • skip it
  • flag it
  • reprocess intentionally
  • compare the new run to the old run safely

A checksum is not the same thing as a business key, but it is a very useful file-level idempotency signal.
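A sketch of such a registry; the table and column names here are illustrative, not prescriptive:

CREATE TABLE ingest_batches (
    batch_id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_filename    text NOT NULL,
    source_checksum    text NOT NULL UNIQUE,
    received_at        timestamptz NOT NULL DEFAULT now(),
    loaded_at          timestamptz,
    status             text NOT NULL DEFAULT 'received',
    raw_row_count      integer,
    accepted_row_count integer,
    rejected_row_count integer
);

The UNIQUE constraint on source_checksum is what turns the registry into a file-level idempotency gate: a second arrival of the same bytes fails the insert, and the loader can decide to skip, flag, or reprocess deliberately.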

When INSERT ... ON CONFLICT is the right tool

PostgreSQL’s INSERT documentation says ON CONFLICT provides an alternative to raising a unique or exclusion constraint violation. ON CONFLICT DO NOTHING skips conflicting inserts, while ON CONFLICT DO UPDATE updates the existing row. PostgreSQL also documents that conflict_target can infer a suitable unique index and that DO UPDATE requires a conflict target.

That makes INSERT ... ON CONFLICT a great fit when:

  • the target table has a real unique key
  • the main decision is insert vs update
  • you want simple replay-safe semantics
  • the source and target match one logical row identity

Example:

INSERT INTO customers (customer_id, email, full_name, updated_at)
SELECT customer_id, email, full_name, updated_at
FROM staging_customers
ON CONFLICT (customer_id) DO UPDATE
SET email = EXCLUDED.email,
    full_name = EXCLUDED.full_name,
    updated_at = EXCLUDED.updated_at;

This is often the simplest strong default for dimension-like upserts.

Why ON CONFLICT still has pitfalls

ON CONFLICT is powerful, but it can still surprise teams.

1. No unique index means no real protection

Without the right unique constraint or unique index, your replay logic is weak.

2. Bad conflict target means wrong merges

If the chosen key is not truly stable, different logical rows can collapse together.

3. Triggers may fire for both insert and update paths

PostgreSQL’s trigger docs explicitly note that INSERT ... ON CONFLICT DO UPDATE may cause both insert and update operations, so both kinds of triggers can fire as needed.

If your trigger logic is not written with that in mind, “idempotent” loads can still produce surprising side effects.

4. Read Committed semantics matter

PostgreSQL’s transaction-isolation docs say that in Read Committed mode, INSERT ... ON CONFLICT DO UPDATE guarantees that each row proposed for insertion will either insert or update, absent unrelated errors. That is one reason it is such a strong fit for many replay-safe workflows.

That guarantee is useful, but only when your conflict key is correct.

When MERGE is the right tool

PostgreSQL’s MERGE documentation shows that it can combine INSERT, UPDATE, DELETE, and DO NOTHING actions based on source-target matching conditions. That makes it useful when your idempotent write path is more complex than plain insert-or-update.

Examples where MERGE helps:

  • insert new rows
  • update matched rows
  • delete target rows absent from the new source snapshot
  • handle source-vs-target reconciliation in one statement

This can be excellent for full-snapshot loads.
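A snapshot-reconciliation sketch with hypothetical products tables; note that MERGE requires PostgreSQL 15 or later, and the WHEN NOT MATCHED BY SOURCE branch requires PostgreSQL 17 or later:

MERGE INTO products AS t
USING staging_products AS s
ON t.product_id = s.product_id
WHEN MATCHED THEN
    UPDATE SET name = s.name, price = s.price
WHEN NOT MATCHED THEN
    INSERT (product_id, name, price)
    VALUES (s.product_id, s.name, s.price)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

The DELETE branch is what makes this a true snapshot load: target rows absent from the new source file are removed rather than left behind.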

Why MERGE is not interchangeable with ON CONFLICT

PostgreSQL’s docs explicitly warn that MERGE and INSERT ... ON CONFLICT DO UPDATE are not interchangeable. The transaction-isolation docs also note that unlike ON CONFLICT, MERGE does not guarantee that either insert or update will occur in the same way under concurrent conditions.

That means a good rule of thumb is:

Prefer ON CONFLICT when

  • your operation is basically upsert
  • you have a strong unique key
  • you want simpler concurrency semantics

Prefer MERGE when

  • you need broader match logic
  • you need delete or conditional branches too
  • you are reconciling a snapshot, not just upserting rows

This distinction matters a lot in production.

A practical staging-table pattern

One solid pattern is:

Raw landing table

All columns stored as text plus batch metadata.

Typed staging table

Normalized, cast, and validated rows ready for business rules.

Final table

The real durable target with unique constraints and indexes.

That gives you three layers of protection:

  • file acceptance
  • schema/type acceptance
  • business-state mutation

Example workflow:

  1. COPY raw CSV into staging_raw_orders
  2. insert validated rows into staging_orders_typed
  3. deduplicate inside the batch if needed
  4. INSERT ... ON CONFLICT into orders

This keeps each concern much easier to debug.
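A sketch of the first two layers, with hypothetical order tables; the raw layer stores everything as text so COPY never fails on bad types, and the typed layer is where casts can reject rows:

CREATE TABLE staging_raw_orders (
    batch_id    bigint NOT NULL,
    order_id    text,
    customer_id text,
    amount      text
);

INSERT INTO staging_orders_typed (batch_id, order_id, customer_id, amount)
SELECT batch_id, order_id, customer_id::bigint, amount::numeric(12, 2)
FROM staging_raw_orders
WHERE order_id IS NOT NULL;

A failed cast here stops the typed insert before any business table is touched, which is exactly the containment the three-layer pattern is for.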

File-level idempotency vs row-level idempotency

These are related but not identical.

File-level idempotency

Asks:

  • did we already process this exact file?

Useful tools:

  • checksum
  • filename
  • manifest
  • batch registry

Row-level idempotency

Asks:

  • did we already apply this logical row state?

Useful tools:

  • business key
  • unique constraint
  • upsert or merge logic
  • effective-date or history model

Strong pipelines usually need both.

Unlogged tables: fast, but know the tradeoff

PostgreSQL’s CREATE TABLE docs say unlogged tables are considerably faster because they are not written to WAL, but they are not crash-safe: they are automatically truncated after a crash or unclean shutdown, and they are not replicated to standby servers.

That means unlogged staging tables can be reasonable for:

  • transient landing
  • local performance optimization
  • pipelines where raw files are safely retained elsewhere

But they are a bad choice if your team assumes the stage itself is durable after a crash.

A good rule:

  • use unlogged tables only when you understand the replay story
  • never let unlogged speed tricks become your only copy of in-flight truth

Bad rows and partial acceptance need policy

PostgreSQL COPY now supports ON_ERROR='ignore', REJECT_LIMIT, and LOG_VERBOSITY around conversion failures. PostgreSQL documents that ignored-row counts are emitted at the end, and verbose logging can include the input line and column name for each discarded row.

That is useful, but it creates a policy decision:

Do you want:

  • all-or-nothing batches
  • partial acceptance with quarantined bad rows
  • unlimited ignored conversion errors
  • a strict reject ceiling

Unlimited row skipping is rarely what teams really want in financial or customer-facing pipelines. REJECT_LIMIT gives you a more controlled compromise.
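A sketch of a controlled-skip load against a hypothetical staging table; ON_ERROR and LOG_VERBOSITY arrived in PostgreSQL 17 and REJECT_LIMIT in a later release, so check your server version before relying on them:

COPY staging_raw_orders FROM '/data/orders.csv'
WITH (FORMAT csv, HEADER true,
      ON_ERROR ignore, REJECT_LIMIT 100, LOG_VERBOSITY verbose);

This tolerates up to 100 conversion failures, logs each discarded row, and fails the whole command once the limit is exceeded, which is usually the compromise teams actually want.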

Common idempotent patterns that work well

Pattern 1: insert-only facts with DO NOTHING

Best when:

  • a fact row should never be updated once accepted
  • duplicates should be ignored
  • unique key is strong

Example:

INSERT INTO invoice_lines (...)
SELECT ...
FROM staging_invoice_lines
ON CONFLICT (invoice_line_id) DO NOTHING;

Pattern 2: current-state dimensions with DO UPDATE

Best when:

  • latest source state should win
  • conflict key is trustworthy
  • updates are intentional

Pattern 3: snapshot reconciliation with MERGE

Best when:

  • source file represents the whole current state
  • deletes or target-only removals matter
  • branch logic is more complex

Pattern 4: history-preserving loads

Best when:

  • changes should be tracked historically
  • idempotency must preserve history instead of overwriting it

This usually means effective dates, versioning, or append-plus-closeout logic instead of blind overwrite.
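One replay-safe sketch of append-plus-closeout, assuming hypothetical customer_versions and staging_customers tables where the open version has valid_to IS NULL:

-- Close open versions whose staged state differs from the current one
UPDATE customer_versions v
SET valid_to = now()
FROM staging_customers s
WHERE v.customer_id = s.customer_id
  AND v.valid_to IS NULL
  AND v.email IS DISTINCT FROM s.email;

-- Open a new version only where no identical open version already exists
INSERT INTO customer_versions (customer_id, email, valid_from)
SELECT s.customer_id, s.email, now()
FROM staging_customers s
WHERE NOT EXISTS (
    SELECT 1
    FROM customer_versions v
    WHERE v.customer_id = s.customer_id
      AND v.valid_to IS NULL
      AND v.email IS NOT DISTINCT FROM s.email
);

Rerunning the same file is then a no-op: the differing-state predicate finds nothing to close, and the NOT EXISTS guard finds nothing new to open.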

Common anti-patterns

Copying directly into the final table

Fast in demos, fragile in production.

No unique key on the target

Then reruns cannot be trusted.

Using a surrogate key as the only conflict target

If the surrogate is assigned at insert time, it does not help recognize replayed source rows.

Treating MERGE as always equivalent to ON CONFLICT

PostgreSQL explicitly warns they are not interchangeable.

Letting “ignore bad rows” hide real feed failures

Skipping conversion errors without a policy can turn silent data loss into a habit.

Forgetting trigger side effects

ON CONFLICT may fire both insert and update triggers.

A practical decision framework

Ask these questions in order.

1. Do you have a real stable business key?

If no, solve identity first.

2. Is the load append-only, current-state, or snapshot-reconciling?

This tells you whether to prefer:

  • DO NOTHING
  • DO UPDATE
  • MERGE
  • history logic

3. Can you preserve the raw file and batch metadata?

If yes, recovery and replay get much easier.

4. Are partial bad-row skips acceptable?

If yes, define reject policy clearly. If no, fail the batch loudly.

5. Is concurrency part of the story?

If yes, simplicity matters even more, and ON CONFLICT may be preferable for straightforward upserts.

Which Elysiate tools fit this article best?

For this topic, the most natural supporting tools are the CSV Validator, CSV Format Checker, and CSV Header Checker, with the CSV Merge and Converter as upstream helpers for multi-file feeds.

These fit naturally because replay-safe PostgreSQL loads begin with structurally trustworthy CSV files and deterministic transformation steps.

FAQ

What does idempotent mean for a PostgreSQL CSV load?

It means rerunning the same input file does not create duplicate facts, incorrect updates, or a different final table state than the original successful run.

Is COPY by itself idempotent?

No. COPY is a loading mechanism, not an idempotency strategy. Without keys, constraints, and replay logic, rerunning the same file can duplicate data. PostgreSQL’s COPY docs describe ingestion behavior and error-handling options, not replay-safe semantics.

Should I use ON CONFLICT or MERGE?

For many replay-safe insert-or-update patterns, INSERT ... ON CONFLICT is simpler and stronger. MERGE is useful for broader match logic, but PostgreSQL explicitly notes the statements are not interchangeable.

Why are staging tables useful for idempotent loads?

Because they let you land the raw rows safely, validate them, deduplicate them, and then apply deterministic business rules before touching the final tables.

Are unlogged staging tables safe?

They can be useful for speed, but PostgreSQL documents that unlogged tables are not crash-safe and are truncated after a crash or unclean shutdown. They are also not replicated to standby servers.

What is the safest default?

Preserve raw files, load into staging, validate and deduplicate there, then write into the final tables with stable business keys and explicit upsert or merge behavior.

Final takeaway

Idempotent CSV loading into PostgreSQL is not about finding one clever SQL statement.

It is about combining the right layers:

  • stable row identity
  • raw-file retention
  • batch tracking
  • staging tables
  • deterministic final writes
  • explicit duplicate and bad-row policy

COPY gets the bytes in fast. ON CONFLICT and MERGE help shape the final write. But idempotency comes from the design around them.

That is what makes reruns boring instead of dangerous.

About the author

Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.
