Idempotent CSV Loads into PostgreSQL: Patterns and Pitfalls
Level: intermediate · ~15 min read · Intent: informational
Audience: developers, data engineers, ops engineers, backend teams, analytics engineers
Prerequisites
- basic familiarity with PostgreSQL
- basic understanding of CSV imports or ETL jobs
Key takeaways
- A CSV load is only idempotent when rerunning the same file produces the same durable table state, not merely when the SQL finishes successfully.
- The most reliable PostgreSQL pattern is usually raw-file retention plus a staging table plus a deterministic merge or upsert into the final table using stable business keys.
- COPY is fast, but COPY alone does not make a load replay-safe. Idempotency comes from keys, constraints, batch bookkeeping, and carefully chosen insert or merge logic.
FAQ
- What does idempotent mean for a PostgreSQL CSV load?
- It means rerunning the same input file does not create duplicate facts, incorrect updates, or a different final table state than the original successful run.
- Is COPY by itself idempotent?
- No. COPY is a loading mechanism, not an idempotency strategy. Without keys, constraints, and replay logic, rerunning the same file can duplicate data.
- Should I use ON CONFLICT or MERGE?
- For many replay-safe insert-or-update patterns, INSERT ... ON CONFLICT is simpler and stronger. MERGE is useful for broader match logic, but it is not interchangeable with ON CONFLICT.
- Why are staging tables useful for idempotent loads?
- Because they let you land the raw rows safely, validate them, deduplicate them, and then apply deterministic business rules before touching the final tables.
Idempotent CSV Loads into PostgreSQL: Patterns and Pitfalls
A lot of teams say they want idempotent CSV loads when what they really mean is:
“Please let us rerun the job without making a mess.”
That is the right instinct, because CSV ingestion into PostgreSQL fails in very ordinary ways:
- the job crashes halfway through
- a vendor resends yesterday’s file
- an operator reruns the pipeline manually
- a bad row causes partial handling
- the same data arrives under a different filename
- a merge step updates records it should have left alone
If your ingestion path is not idempotent, those moments turn into:
- duplicate rows
- overwritten history
- non-repeatable downstream metrics
- replay fear
- long incident cleanups
That is why idempotency is not a bonus feature. It is the thing that makes retries safe.
If you want to validate the file before the database ever sees it, start with the CSV Validator, CSV Format Checker, and CSV Header Checker. If your feed comes from multiple files, the CSV Merge and Converter are useful upstream helpers.
This guide explains what idempotent CSV loading really means in PostgreSQL, which patterns actually work, and which shortcuts look fine in demos but break in production.
What idempotent means in practice
For a PostgreSQL CSV load, idempotent usually means:
Running the same logical input again should leave the target tables in the same correct end state.
That does not mean:
- the SQL command is repeatable in a vacuum
- the loader never sees duplicates
- the file load never errors
It means:
- reruns are safe
- retries do not corrupt data
- partial failures can be recovered without guessing
- final tables converge on one intended state
This is a state-management problem, not only a SQL syntax problem.
The first principle: COPY is fast, not magical
PostgreSQL’s COPY documentation makes clear that COPY FROM is about moving file data into a table quickly, with options for CSV format, headers, null markers, encoding, ON_ERROR, REJECT_LIMIT, and more. If ENCODING is omitted, PostgreSQL uses the current client encoding. With ON_ERROR ignore, PostgreSQL skips rows that fail data-type conversion, and REJECT_LIMIT controls how many such errors are tolerated before the command fails.
That is great for ingestion mechanics. It is not an idempotency strategy.
If you run:
COPY orders FROM '/data/orders.csv' WITH (FORMAT csv, HEADER true);
twice against the same plain target table, PostgreSQL will happily attempt to load the same rows twice unless your schema and logic stop it.
So the first production rule is simple:
Do not confuse fast loading with replay-safe loading.
The second principle: idempotency needs a stable identity
No database can protect you from replays if it cannot tell whether a row is “the same” row.
That means you need one of these:
- a stable business key from the source
- a deterministic composite key
- a trustworthy natural key
- a staging-time deduplication fingerprint that maps back to a real business identity
Examples:
- order_id
- invoice_number
- (customer_id, invoice_date, line_number)
- (external_system_id, record_type)
Weak keys create weak idempotency.
If your feed does not contain stable identity, your job becomes much harder because you are trying to deduplicate semantics after the fact.
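When a composite business key is the best identity available, it is worth enforcing it in the schema rather than only in loader logic. A minimal sketch, assuming a hypothetical invoice_lines table keyed by (customer_id, invoice_date, line_number):

```sql
-- Enforce the business identity so replays cannot silently duplicate rows.
ALTER TABLE invoice_lines
  ADD CONSTRAINT invoice_lines_business_key
  UNIQUE (customer_id, invoice_date, line_number);
```

With this constraint in place, a later INSERT ... ON CONFLICT can target the same column list, and any replayed row is recognized instead of duplicated.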
The safest default pattern: raw file + staging table + deterministic write
The most reliable production pattern usually looks like this:
1. Preserve the raw file
Keep:
- original bytes
- file name
- checksum
- received timestamp
- source metadata
2. Load into a staging table
Use COPY into a raw or semi-raw table first.
3. Validate and normalize
Check:
- encoding
- delimiter
- header shape
- required columns
- duplicates inside the batch
- type coercion
- null semantics
4. Write into final tables deterministically
Use:
- INSERT ... ON CONFLICT, or
- MERGE, or
- an explicit insert/update/delete workflow
5. Record batch outcome
Track:
- file checksum
- row counts
- reject counts
- final write counts
- whether the batch was already processed
This is much safer than copying directly into production tables and hoping uniqueness errors tell the full story.
Why staging tables matter so much
Staging tables are not just an ETL cliché. They solve several real idempotency problems.
They let you:
- land data before business logic touches it
- profile what arrived
- deduplicate within the batch
- quarantine bad rows
- compare file content to already-processed batches
- separate file-acceptance from business-state mutation
Without staging, it becomes much harder to answer:
- did we already get this file?
- which rows were invalid?
- did we partially update the target?
- can we replay safely from the same raw input?
That is why staging is often the first real step toward idempotency.
A good batch registry is worth more than people think
One of the cleanest patterns is to maintain a batch or file registry table.
Typical fields:
- batch_id
- source_filename
- source_checksum
- received_at
- loaded_at
- status
- raw_row_count
- accepted_row_count
- rejected_row_count
Why this matters:
If the same file arrives twice with the same checksum, you may be able to:
- skip it
- flag it
- reprocess intentionally
- compare the new run to the old run safely
A checksum is not the same thing as a business key, but it is a very useful file-level idempotency signal.
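One way to sketch such a registry, using hypothetical table and column names based on the fields above:

```sql
CREATE TABLE load_batches (
    batch_id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_filename    text NOT NULL,
    source_checksum    text NOT NULL UNIQUE,  -- file-level idempotency signal
    received_at        timestamptz NOT NULL DEFAULT now(),
    loaded_at          timestamptz,
    status             text NOT NULL DEFAULT 'received',
    raw_row_count      integer,
    accepted_row_count integer,
    rejected_row_count integer
);
```

The UNIQUE constraint on source_checksum turns "did we already get this file?" into a plain conflict check; if intentional reprocessing is part of your workflow, you would relax that constraint or key on checksum plus attempt number instead.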
When INSERT ... ON CONFLICT is the right tool
PostgreSQL’s INSERT documentation says ON CONFLICT provides an alternative to raising a unique or exclusion constraint violation. ON CONFLICT DO NOTHING skips conflicting inserts, while ON CONFLICT DO UPDATE updates the existing row. PostgreSQL also documents that conflict_target can infer a suitable unique index and that DO UPDATE requires a conflict target.
That makes INSERT ... ON CONFLICT a great fit when:
- the target table has a real unique key
- the main decision is insert vs update
- you want simple replay-safe semantics
- the source and target match one logical row identity
Example:
INSERT INTO customers (customer_id, email, full_name, updated_at)
SELECT customer_id, email, full_name, updated_at
FROM staging_customers
ON CONFLICT (customer_id) DO UPDATE
SET email = EXCLUDED.email,
full_name = EXCLUDED.full_name,
updated_at = EXCLUDED.updated_at;
This is often the simplest strong default for dimension-like upserts.
Why ON CONFLICT still has pitfalls
ON CONFLICT is powerful, but it can still surprise teams.
1. No unique index means no real protection
Without the right unique constraint or unique index, your replay logic is weak.
2. Bad conflict target means wrong merges
If the chosen key is not truly stable, different logical rows can collapse together.
3. Triggers may fire for both insert and update paths
PostgreSQL’s trigger docs explicitly note that INSERT ... ON CONFLICT DO UPDATE may cause both insert and update operations, so both kinds of triggers can fire as needed.
If your trigger logic is not written with that in mind, “idempotent” loads can still produce surprising side effects.
4. Read Committed semantics matter
PostgreSQL’s transaction-isolation docs say that in Read Committed mode, INSERT ... ON CONFLICT DO UPDATE guarantees that each row proposed for insertion will either insert or update, absent unrelated errors. That is one reason it is such a strong fit for many replay-safe workflows.
That guarantee is useful, but only when your conflict key is correct.
When MERGE is the right tool
PostgreSQL’s MERGE documentation shows that it can combine INSERT, UPDATE, DELETE, and DO NOTHING actions based on source-target matching conditions. That makes it useful when your idempotent write path is more complex than plain insert-or-update.
Examples where MERGE helps:
- insert new rows
- update matched rows
- delete target rows absent from the new source snapshot
- handle source-vs-target reconciliation in one statement
This can be excellent for full-snapshot loads.
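A sketch of snapshot reconciliation, assuming hypothetical products and staging_products tables. The WHEN NOT MATCHED BY SOURCE branch requires PostgreSQL 17 or later; on older versions the delete of target-only rows needs a separate statement:

```sql
MERGE INTO products AS t
USING staging_products AS s
    ON t.product_id = s.product_id
WHEN MATCHED AND (t.name, t.price) IS DISTINCT FROM (s.name, s.price) THEN
    UPDATE SET name = s.name, price = s.price
WHEN NOT MATCHED THEN
    INSERT (product_id, name, price)
    VALUES (s.product_id, s.name, s.price)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;  -- remove target rows absent from the new snapshot
```

Rerunning this against the same staging content changes nothing, which is exactly the convergence that idempotency asks for.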
Why MERGE is not interchangeable with ON CONFLICT
PostgreSQL’s docs explicitly warn that MERGE and INSERT ... ON CONFLICT DO UPDATE are not interchangeable. The transaction-isolation docs also note that, unlike ON CONFLICT, MERGE does not guarantee that each source row will result in either an insert or an update under concurrent conditions.
That means a good rule of thumb is:
Prefer ON CONFLICT when
- your operation is basically upsert
- you have a strong unique key
- you want simpler concurrency semantics
Prefer MERGE when
- you need broader match logic
- you need delete or conditional branches too
- you are reconciling a snapshot, not just upserting rows
This distinction matters a lot in production.
A practical staging-table pattern
One solid pattern is:
Raw landing table
All columns stored as text plus batch metadata.
Typed staging table
Normalized, cast, and validated rows ready for business rules.
Final table
The real durable target with unique constraints and indexes.
That gives you three layers of protection:
- file acceptance
- schema/type acceptance
- business-state mutation
Example workflow:
- COPY the raw CSV into staging_raw_orders
- insert validated rows into staging_orders_typed
- deduplicate inside the batch if needed
- INSERT ... ON CONFLICT into orders
This keeps each concern much easier to debug.
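Put together, that workflow might look like the following sketch, with hypothetical column names and a single transaction around the typed-staging and final writes:

```sql
BEGIN;

-- Layer 1: file acceptance (raw text columns, no casts yet).
COPY staging_raw_orders FROM '/data/orders.csv' WITH (FORMAT csv, HEADER true);

-- Layer 2: type acceptance, deduplicating inside the batch.
-- ORDER BY makes the within-batch winner deterministic (latest row per key).
INSERT INTO staging_orders_typed (order_id, customer_id, amount, ordered_at)
SELECT DISTINCT ON (order_id)
       order_id::bigint,
       customer_id::bigint,
       amount::numeric,
       ordered_at::timestamptz
FROM staging_raw_orders
ORDER BY order_id, ordered_at::timestamptz DESC;

-- Layer 3: deterministic business-state mutation.
INSERT INTO orders (order_id, customer_id, amount, ordered_at)
SELECT order_id, customer_id, amount, ordered_at
FROM staging_orders_typed
ON CONFLICT (order_id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
    amount      = EXCLUDED.amount,
    ordered_at  = EXCLUDED.ordered_at;

COMMIT;
```

Each layer can fail independently and be inspected independently, which is most of the debugging benefit.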
File-level idempotency vs row-level idempotency
These are related but not identical.
File-level idempotency
Asks:
- did we already process this exact file?
Useful tools:
- checksum
- filename
- manifest
- batch registry
Row-level idempotency
Asks:
- did we already apply this logical row state?
Useful tools:
- business key
- unique constraint
- upsert or merge logic
- effective-date or history model
Strong pipelines usually need both.
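A file-level guard can be as simple as one query against the batch registry. A sketch, assuming a hypothetical load_batches table with source_checksum and status columns:

```sql
-- Returns true if this exact file content already loaded successfully.
SELECT EXISTS (
    SELECT 1
    FROM load_batches
    WHERE source_checksum = '3a7bd3e2f1...'  -- checksum of the incoming file
      AND status = 'loaded'
) AS already_processed;
```

Row-level protection then comes from unique keys and upsert logic on the final tables; neither check replaces the other.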
Unlogged tables: fast, but know the tradeoff
PostgreSQL’s CREATE TABLE docs say unlogged tables are considerably faster because they are not written to WAL, but they are not crash-safe: they are automatically truncated after a crash or unclean shutdown, and they are not replicated to standby servers.
That means unlogged staging tables can be reasonable for:
- transient landing
- local performance optimization
- pipelines where raw files are safely retained elsewhere
But they are a bad choice if your team assumes the stage itself is durable after a crash.
A good rule:
- use unlogged tables only when you understand the replay story
- never let unlogged speed tricks become your only copy of in-flight truth
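A sketch of the tradeoff in schema form, assuming the raw file is retained elsewhere so the stage can be rebuilt after a crash:

```sql
-- Fast landing zone: skips WAL, but is truncated after a crash
-- and is not replicated to standbys.
CREATE UNLOGGED TABLE staging_raw_orders (
    batch_id  bigint NOT NULL,
    line_data text
);

-- If a staging table later becomes load-bearing, make it durable.
ALTER TABLE staging_raw_orders SET LOGGED;
```

The SET LOGGED escape hatch is useful precisely because staging tables have a habit of quietly becoming the only copy of in-flight truth.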
Bad rows and partial acceptance need policy
Recent PostgreSQL releases add ON_ERROR, REJECT_LIMIT, and LOG_VERBOSITY options to COPY for handling conversion failures. The documentation notes that the number of ignored rows is reported when the command completes, and that verbose logging can include the input line and column name for each discarded row.
That is useful, but it creates a policy decision:
Do you want:
- all-or-nothing batches
- partial acceptance with quarantined bad rows
- unlimited ignored conversion errors
- a strict reject ceiling
Unlimited row skipping is rarely what teams really want in financial or customer-facing pipelines. REJECT_LIMIT gives you a more controlled compromise.
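A hedged example of a strict reject ceiling, assuming a PostgreSQL version recent enough to support all three options (ON_ERROR and LOG_VERBOSITY arrived in PostgreSQL 17; REJECT_LIMIT is newer still):

```sql
-- Tolerate at most 100 conversion failures, then fail the whole batch.
COPY staging_raw_orders
FROM '/data/orders.csv'
WITH (FORMAT csv,
      HEADER true,
      ON_ERROR ignore,
      REJECT_LIMIT 100,
      LOG_VERBOSITY verbose);  -- log input line and column per discarded row
```

This expresses the "partial acceptance with a ceiling" policy directly in the load command instead of leaving it implicit.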
Common idempotent patterns that work well
Pattern 1: insert-only facts with DO NOTHING
Best when:
- a fact row should never be updated once accepted
- duplicates should be ignored
- unique key is strong
Example:
INSERT INTO invoice_lines (...)
SELECT ...
FROM staging_invoice_lines
ON CONFLICT (invoice_line_id) DO NOTHING;
Pattern 2: current-state dimensions with DO UPDATE
Best when:
- latest source state should win
- conflict key is trustworthy
- updates are intentional
Pattern 3: snapshot reconciliation with MERGE
Best when:
- source file represents the whole current state
- deletes or target-only removals matter
- branch logic is more complex
Pattern 4: history-preserving loads
Best when:
- changes should be tracked historically
- idempotency must preserve history instead of overwriting it
This usually means effective dates, versioning, or append-plus-closeout logic instead of blind overwrite.
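One hedged sketch of append-plus-closeout logic, assuming hypothetical customer_history and staging_customers tables where the current version of a row has valid_to IS NULL:

```sql
-- Close out the current version when the incoming state differs.
UPDATE customer_history h
SET valid_to = now()
FROM staging_customers s
WHERE h.customer_id = s.customer_id
  AND h.valid_to IS NULL
  AND (h.email, h.full_name) IS DISTINCT FROM (s.email, s.full_name);

-- Append a new open version for changed or brand-new customers.
INSERT INTO customer_history (customer_id, email, full_name, valid_from, valid_to)
SELECT s.customer_id, s.email, s.full_name, now(), NULL
FROM staging_customers s
LEFT JOIN customer_history h
  ON h.customer_id = s.customer_id AND h.valid_to IS NULL
WHERE h.customer_id IS NULL;
```

Rerunning the same batch is a no-op: every staged customer already has an open version matching staging, so nothing closes and the anti-join inserts nothing, which preserves history instead of overwriting it.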
Common anti-patterns
Copying directly into the final table
Fast in demos, fragile in production.
No unique key on the target
Then reruns cannot be trusted.
Using a surrogate key as the only conflict target
If the surrogate is assigned at insert time, it does not help recognize replayed source rows.
Treating MERGE as always equivalent to ON CONFLICT
PostgreSQL explicitly warns they are not interchangeable.
Letting “ignore bad rows” hide real feed failures
Skipping conversion errors without a policy can turn silent data loss into a habit.
Forgetting trigger side effects
ON CONFLICT may fire both insert and update triggers.
A practical decision framework
Ask these questions in order.
1. Do you have a real stable business key?
If no, solve identity first.
2. Is the load append-only, current-state, or snapshot-reconciling?
This tells you whether to prefer:
- DO NOTHING
- DO UPDATE
- MERGE
- history logic
3. Can you preserve the raw file and batch metadata?
If yes, recovery and replay get much easier.
4. Are partial bad-row skips acceptable?
If yes, define reject policy clearly. If no, fail the batch loudly.
5. Is concurrency part of the story?
If yes, simplicity matters even more, and ON CONFLICT may be preferable for straightforward upserts.
Which Elysiate tools fit this article best?
For this topic, the most natural supporting tools are:
- CSV Validator
- CSV Format Checker
- CSV Header Checker
- CSV Delimiter Checker
- CSV Merge
- Converter
- CSV tools hub
These fit naturally because replay-safe PostgreSQL loads begin with structurally trustworthy CSV files and deterministic transformation steps.
FAQ
What does idempotent mean for a PostgreSQL CSV load?
It means rerunning the same input file does not create duplicate facts, incorrect updates, or a different final table state than the original successful run.
Is COPY by itself idempotent?
No. COPY is a loading mechanism, not an idempotency strategy. Without keys, constraints, and replay logic, rerunning the same file can duplicate data. PostgreSQL’s COPY docs describe ingestion behavior and error-handling options, not replay-safe semantics.
Should I use ON CONFLICT or MERGE?
For many replay-safe insert-or-update patterns, INSERT ... ON CONFLICT is simpler and stronger. MERGE is useful for broader match logic, but PostgreSQL explicitly notes the statements are not interchangeable.
Why are staging tables useful for idempotent loads?
Because they let you land the raw rows safely, validate them, deduplicate them, and then apply deterministic business rules before touching the final tables.
Are unlogged staging tables safe?
They can be useful for speed, but PostgreSQL documents that unlogged tables are not crash-safe and are truncated after a crash or unclean shutdown. They are also not replicated to standby servers.
What is the safest default?
Preserve raw files, load into staging, validate and deduplicate there, then write into the final tables with stable business keys and explicit upsert or merge behavior.
Final takeaway
Idempotent CSV loading into PostgreSQL is not about finding one clever SQL statement.
It is about combining the right layers:
- stable row identity
- raw-file retention
- batch tracking
- staging tables
- deterministic final writes
- explicit duplicate and bad-row policy
COPY gets the bytes in fast.
ON CONFLICT and MERGE help shape the final write.
But idempotency comes from the design around them.
That is what makes reruns boring instead of dangerous.
About the author
Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.