Quarantine tables: isolating bad CSV rows without losing audits
Level: intermediate · ~15 min read · Intent: informational
Audience: developers, data engineers, data analysts, ops engineers, platform teams
Prerequisites
- basic familiarity with CSV files
- basic understanding of databases or ETL jobs
- optional familiarity with batch ingestion workflows
Key takeaways
- A good quarantine table isolates bad rows without destroying the raw evidence needed for audits, debugging, and replay.
- The most valuable quarantine columns are usually operational metadata, not business columns: source file, batch id, line number, raw payload, parser error, validation stage, and replay status.
- Quarantine is not the same as silent skipping. The point is to preserve traceability while keeping good data moving.
- Replay-safe ingestion needs idempotency, retention rules, and a clear distinction between parser failures, schema mismatches, and domain-rule violations.
References
- RFC 4180 — Common Format and MIME Type for CSV Files
- PostgreSQL COPY
- Snowflake COPY INTO <table>
- Snowflake bulk load troubleshooting
- BigQuery loading CSV data from Cloud Storage
- Databricks Auto Loader options
- Databricks Auto Loader schema inference and evolution
- Databricks read CSV files
- Azure Event Sourcing pattern
- Azure Functions error handling and retries
FAQ
- What is a quarantine table in a CSV pipeline?
- A quarantine table stores rows or files that failed parsing, schema checks, or business validation, while preserving enough metadata to debug, audit, and replay them later.
- What should a quarantine table store?
- At minimum, store source file identity, batch or run id, row or line reference when available, raw row content, error code, validation stage, timestamps, and replay status.
- Should I quarantine whole files or individual rows?
- It depends on the failure class. Structural parser failures often justify file-level quarantine, while row-level schema or business-rule failures usually benefit from row-level quarantine so valid records can keep flowing.
- How is quarantine different from silently dropping bad rows?
- Silent dropping destroys observability and weakens audits. Quarantine preserves evidence, supports remediation, and lets you prove what was accepted, rejected, fixed, and replayed.
Quarantine tables: isolating bad CSV rows without losing audits
Most CSV ingestion teams start with two bad defaults.
The first is to fail the whole load when one row is bad.
The second is to quietly skip bad rows and hope nobody notices.
Both are expensive.
Failing the whole load blocks good data behind a handful of broken records. Quietly skipping rows makes your pipeline look healthy while weakening trust, supportability, and auditability.
That is why mature CSV pipelines usually need a third path:
quarantine the bad data, preserve the evidence, and keep the rest of the pipeline moving.
This article explains how to design quarantine tables for CSV workflows so you can:
- isolate bad rows without losing the source evidence
- preserve audit trails across retries and fixes
- replay corrected rows safely
- separate parser failures from schema mismatches and business-rule violations
- support warehouse loads without turning every ingestion issue into a full incident
It also addresses the practical questions practitioners run into:
- quarantine table for bad CSV rows
- dead letter table for ETL
- rejected rows table design
- how to store invalid records in a data pipeline
- audit trail for failed CSV imports
- replay quarantined records safely
- CSV parser errors with line numbers
- source lineage for rejected records
- warehouse bad records path vs quarantine table
- keep good rows while isolating bad rows
Why quarantine matters
CSV is still one of the messiest interchange formats in production systems.
RFC 4180 defines the familiar shape: records, delimiters, optional header row, quotes, and escaped quotes. Real exports often drift from that shape through:
- inconsistent delimiters
- quoted newlines
- duplicate headers
- locale-specific number formatting
- manual spreadsheet edits
- trailing junk rows
- schema changes that arrive without warning
When you load those files into databases or warehouses, different layers can fail in different ways:
- the parser may reject the file
- the schema matcher may reject certain rows
- the warehouse may reject type conversions
- downstream validation may reject business rules such as uniqueness, foreign keys, or date windows
If you do not capture those failures cleanly, you lose three things that matter a lot in operations:
- proof of what failed
- context for why it failed
- control over how to replay it later
A quarantine table is the missing middle layer between “accept everything” and “drop everything.”
What a quarantine table actually is
A quarantine table is a durable store for data that failed some part of ingestion, together with the metadata needed to understand and reprocess it.
That sounds simple, but good quarantine design depends on one core idea:
the table is not only for bad data. It is for bad data plus operational evidence.
In practice, the evidence is often more important than the row itself.
A useful quarantine record might include:
- the source system
- the source file path or URI
- batch id or run id
- line number or row position when available
- raw row payload
- parsed column snapshot if partial parsing succeeded
- error class
- detailed error message
- validation stage
- first seen timestamp
- replay status
- replay attempt count
- remediation notes
Without that metadata, the quarantine table becomes a graveyard. With that metadata, it becomes an operational tool.
The core design principle: preserve the raw evidence
One of the biggest mistakes in rejected-row handling is storing only the “cleaned” interpretation of a bad row.
That is risky because the cleaned interpretation is already an opinion.
If your parser trimmed spaces, normalized encoding, coerced nulls, or partially mapped fields, you might lose the exact source evidence that explains the failure.
That is why strong quarantine design usually keeps at least one of these:
- the raw row text
- the raw file bytes reference
- or both
For CSV workflows, storing the raw row payload is often the fastest route to debugging. When the issue is a malformed quoted field, an extra delimiter, or a line ending problem, raw evidence matters more than downstream typed values.
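As a sketch of that principle, the Python loop below keeps the raw text of every line alongside the parse attempt, so a failed row is stored exactly as received rather than as a half-parsed interpretation. The expected column count and the per-line approach are assumptions for illustration; feeds that allow quoted newlines need a streaming parser over the whole file instead.

```python
import csv
import io

EXPECTED_COLUMNS = 3  # assumption: this feed promises three columns per row

def split_good_and_quarantined(file_text):
    """Parse CSV line by line, preserving the raw text of every failed row.

    Assumes rows do not contain quoted newlines; feeds that allow them
    need a streaming csv.reader over the whole file instead.
    """
    good_rows, quarantined = [], []
    for line_number, raw_line in enumerate(io.StringIO(file_text), start=1):
        raw_row_text = raw_line.rstrip("\r\n")
        if line_number == 1:
            continue  # skip the header row in this sketch
        try:
            # strict=True makes the reader raise csv.Error on bad quoting
            fields = next(csv.reader([raw_row_text], strict=True))
        except csv.Error as exc:
            quarantined.append({
                "source_line_number": line_number,
                "raw_row_text": raw_row_text,  # the evidence, untouched
                "error_stage": "parse",
                "error_message": str(exc),
            })
            continue
        if len(fields) != EXPECTED_COLUMNS:
            quarantined.append({
                "source_line_number": line_number,
                "raw_row_text": raw_row_text,
                "error_stage": "schema",
                "error_message": f"expected {EXPECTED_COLUMNS} fields, got {len(fields)}",
            })
            continue
        good_rows.append(fields)
    return good_rows, quarantined
```

The point of the shape is that `raw_row_text` goes into quarantine even when parsing partially succeeded, so the original evidence survives any later normalization.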
Quarantine is not a synonym for dead-letter queue
People often mix up these terms:
- quarantine table
- rejected rows table
- error table
- dead-letter queue
- rescued data column
- bad records path
They overlap, but they are not always identical.
Dead-letter queue
Usually event-oriented and message-oriented. More common in stream or messaging systems.
Bad records path
Often a file-system sidecar that stores parser failures separately from the main load.
Rescued data column
Usually stores mismatched or unparsed fields inside an otherwise loaded row.
Quarantine table
A structured, queryable store designed for investigation, remediation, SLA ownership, and replay.
That last point matters in practice. Many teams already have parser-level rejection support, but still need a queryable quarantine model for ops and audit workflows.
The three classes of failures you should separate
A lot of bad ingestion design comes from treating every invalid row the same.
They are not the same.
1. Structural or parser failures
These are rows or files that fail basic CSV handling:
- broken quoting
- wrong delimiter
- corrupted encoding
- row shape impossible to parse safely
- ragged records where parser rules cannot recover
In many systems, these failures happen before row-level business validation even starts.
Examples from warehouse tooling reinforce this distinction:
- PostgreSQL COPY surfaces line and context information during CSV load failures.
- BigQuery lets you configure behavior around bad records, jagged rows, quoted newlines, and unknown values.
- Databricks exposes badRecordsPath and rescued-data features for malformed or partially mismatched data.
- Snowflake supports validation-only COPY runs and post-load validation inspection.
Structural failures often justify either:
- file-level quarantine
- or row-level quarantine with raw payload only, depending on what the parser can still isolate
2. Schema and type failures
These happen when parsing succeeds, but the row no longer fits expected structure:
- missing required column
- extra unexpected field
- wrong header mapping
- numeric field contains text
- invalid date
- decimal formatting mismatch
- type drift across versions
These are often the sweet spot for row-level quarantine because good rows can usually proceed.
3. Domain or business-rule failures
These happen after parsing and typing:
- duplicate key
- foreign key missing
- invalid product state
- negative quantity not allowed
- event date outside expected window
- currency mismatch
- policy or compliance rule violation
These rows are structurally usable, which makes them strong candidates for quarantine plus remediation workflow.
Treating all three failure classes the same creates messy operations. Separating them makes SLAs, alerts, and ownership much clearer.
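To make the separation concrete, here is a minimal Python sketch that tags each row with one of the three stages so quarantine records can be grouped, alerted on, and owned per class. The field layout and rules (two columns, integer quantity, no negatives) are illustrative assumptions, not a prescribed schema:

```python
def classify_row(raw_fields):
    """Return (error_stage, error_message), or None if the row is valid.

    Stages mirror the three failure classes above:
    parse -> structural, schema -> typing/shape, domain -> business rules.
    Field names and rules here are illustrative assumptions.
    """
    # Structural check: the parser hands us a list; None means it gave up.
    if raw_fields is None:
        return ("parse", "row could not be split into fields")

    # Schema check: expect exactly [order_id, quantity]; quantity must be an int.
    if len(raw_fields) != 2:
        return ("schema", f"expected 2 columns, got {len(raw_fields)}")
    order_id, quantity_text = raw_fields
    try:
        quantity = int(quantity_text)
    except ValueError:
        return ("schema", f"quantity is not an integer: {quantity_text!r}")

    # Domain check: a business rule, not a typing problem.
    if quantity < 0:
        return ("domain", "negative quantity not allowed")
    return None
```

Because each rejection carries a stage, "alert on any parse failure, page nobody for domain failures" becomes a one-line query instead of a string match.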
Row-level quarantine vs file-level quarantine
This is one of the most practical architectural decisions.
Quarantine the entire file when:
- the delimiter is wrong for the whole file
- the header row is invalid or ambiguous
- encoding is broken across the file
- the file is truncated or corrupted
- parser failures make row boundaries untrustworthy
If the file boundary itself is compromised, trying to salvage rows can create false confidence.
Quarantine individual rows when:
- parsing succeeded and row boundaries are trustworthy
- only some rows violate schema or business rules
- you want valid rows to continue to bronze, staging, or target tables
- replay can happen record by record
This choice matters for search intent too, because practitioners often ask:
- should I reject the whole CSV file or only the bad rows?
- how do I keep good records while isolating failures?
- how do I quarantine invalid records without breaking SLAs?
The best answer is usually: file-level quarantine for broken structure, row-level quarantine for broken content.
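That rule of thumb can be written down as a tiny dispatch function. The stage names and the `row_boundaries_trusted` flag are assumptions carried over from the failure-class discussion above:

```python
def quarantine_granularity(error_stage, row_boundaries_trusted=True):
    """Decide file-level vs row-level quarantine from the failure class.

    Broken structure -> file-level; broken content -> row-level.
    A parse failure can still be row-level when the parser isolated the
    offending row and the surrounding row boundaries are trusted.
    """
    if error_stage == "parse" and not row_boundaries_trusted:
        return "file"
    if error_stage in ("parse", "schema", "domain"):
        return "row"
    raise ValueError(f"unknown error stage: {error_stage}")
```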
What columns a quarantine table should have
There is no universal schema, but strong quarantine tables usually need fields from five groups.
1. Source lineage
These columns answer: where did this record come from?
- source_system
- source_feed
- source_uri
- source_filename
- source_checksum
- ingest_batch_id
- ingest_run_id
If an auditor asks where a rejected row originated, lineage is the first thing you need.
2. Row identity
These columns answer: which record was it?
- source_line_number
- source_row_number
- header_version
- business_key (when available)
- record_hash
For CSV, line number is especially valuable when the upstream team wants a concrete example.
3. Raw and parsed payload
These columns answer: what did we actually receive?
- raw_row_text
- raw_payload_json for normalized row snapshots
- parsed_columns_json if partial parsing worked
Storing only split columns is not enough. Store something close to the original form too.
4. Error metadata
These columns answer: why did it fail?
- error_stage, such as parse, schema, domain, or load
- error_code
- error_message
- error_detail
- validator_name
- rule_name
- severity
This lets you group recurring issues rather than reading one row at a time.
5. Operational lifecycle
These columns answer: what happened after failure?
- quarantined_at
- status
- owner
- replay_attempts
- last_replay_at
- resolved_at
- resolution_type
- resolution_note
Without lifecycle fields, quarantine turns into a dump. With them, it becomes a queue.
A practical quarantine table example
Here is a practical relational shape:
create table ingestion_quarantine_rows (
quarantine_id bigserial primary key,
source_system text not null,
source_feed text not null,
source_uri text,
source_filename text,
source_checksum text,
ingest_batch_id text not null,
ingest_run_id text not null,
source_line_number bigint,
source_row_number bigint,
business_key text,
record_hash text,
raw_row_text text,
parsed_columns_json jsonb,
error_stage text not null,
error_code text not null,
error_message text not null,
error_detail jsonb,
validator_name text,
rule_name text,
status text not null default 'open',
replay_attempts integer not null default 0,
quarantined_at timestamptz not null default now(),
last_replay_at timestamptz,
resolved_at timestamptz,
resolution_type text,
resolution_note text
);
This is not the only valid model, but it covers the operational basics.
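From the application side, inserting into that table is mostly a matter of assembling lineage, payload, and error fields in one place. A hedged Python sketch, where the input dict shapes are assumptions and the output keys mirror the DDL column names above:

```python
from datetime import datetime, timezone

def build_quarantine_record(source, batch, row, error):
    """Assemble one row for ingestion_quarantine_rows.

    `source`, `batch`, `row`, and `error` are plain dicts produced by the
    ingestion code; the keys used here are assumptions that mirror the
    DDL column names in the example table.
    """
    return {
        "source_system": source["system"],
        "source_feed": source["feed"],
        "source_uri": source.get("uri"),
        "source_filename": source.get("filename"),
        "source_checksum": source.get("checksum"),
        "ingest_batch_id": batch["batch_id"],
        "ingest_run_id": batch["run_id"],
        "source_line_number": row.get("line_number"),
        "raw_row_text": row.get("raw_text"),
        "error_stage": error["stage"],
        "error_code": error["code"],
        "error_message": error["message"],
        "status": "open",          # lifecycle always starts open
        "replay_attempts": 0,
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
```

Centralizing this in one function keeps every code path that rejects a row writing the same evidence, which is what makes the table queryable later.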
Why audits fail without quarantine evidence
An audit trail is not just a list of successful loads.
In ingestion systems, the audit story needs to answer:
- what file arrived
- what version of the schema was expected
- what rows were accepted
- what rows were rejected
- why each rejected row failed
- whether rejected rows were fixed or replayed later
- whether replay changed the target result
If you silently drop rows, that story collapses.
Append-only design patterns help here. Microsoft’s event sourcing guidance is useful conceptually: preserve the sequence of meaningful events rather than only the latest final state. For quarantine workflows, that means you often want an append-only history of:
- quarantined
- acknowledged
- remediated
- replayed
- accepted
- permanently rejected
That history is often more defensible than overwriting one status field repeatedly.
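A minimal sketch of the append-only idea: store events and derive the current state from the latest one, instead of mutating a status column. The tuple shape is an assumption; the event names follow the lifecycle list above:

```python
def current_status(events):
    """Derive the present state of a quarantined row from its event history.

    `events` is an append-only list of (timestamp, event_name) tuples.
    The full history stays queryable; nothing is ever overwritten.
    """
    if not events:
        return "unknown"
    # Events are appended in order, so the last one is the current state.
    return events[-1][1]

history = [
    ("2024-01-02T09:00:00Z", "quarantined"),
    ("2024-01-02T10:15:00Z", "acknowledged"),
    ("2024-01-03T08:00:00Z", "remediated"),
    ("2024-01-03T08:30:00Z", "replayed"),
    ("2024-01-03T08:31:00Z", "accepted"),
]
```

The same history answers both "what is the status now?" and "who touched this row, when, and in what order?", which a single overwritten column cannot.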
Do not mix quarantine with your final target model
Another common mistake is writing invalid rows into the same table shape as valid rows and marking them with an is_valid = false flag.
That can work in narrow cases, but it often creates downstream confusion:
- analysts forget to filter
- metrics accidentally include rejected rows
- replay logic becomes messy
- security controls blur between production and error data
A dedicated quarantine table or schema is usually cleaner.
Good common patterns include:
- raw → quarantine → staging → core
- bronze → quarantine → silver
- landing → error + accepted
The exact naming matters less than clear separation.
How quarantine fits warehouse-specific loaders
Different warehouse loaders handle bad records differently, and it is worth knowing how a quarantine table fits alongside each platform's built-in features.
PostgreSQL
PostgreSQL COPY is powerful, but strict enough that malformed CSV details matter. Its docs treat quoting and delimiter characters as significant, and COPY failures report context including the line where the error occurred. This makes PostgreSQL a good example of why parser-level errors need clear row references.
Snowflake
Snowflake’s COPY INTO supports VALIDATION_MODE, which lets teams validate files and return errors without actually loading data. Snowflake also provides a VALIDATE function for inspecting errors from previous loads. That is helpful, but it does not replace your own quarantine model when you need business workflow, assignment, and replay tracking.
BigQuery
BigQuery exposes knobs such as max_bad_records, allow_jagged_rows, allow_quoted_newlines, and ignore_unknown_values. Those controls help shape load tolerance, but they also create architectural choices: is the platform merely skipping tolerated issues, or are you capturing them somewhere useful?
Databricks
Databricks Auto Loader supports badRecordsPath, rescued data behavior, and CSV parsing modes such as PERMISSIVE, DROPMALFORMED, and FAILFAST. Those features are helpful building blocks. They are not a substitute for a durable quarantine table if your operators need row-level ownership and replay.
That distinction is important: platform rejection features help detect bad data, but your quarantine table is where operational accountability lives.
Replay design: the part teams underbuild
Quarantine is only half the story. Replay is the other half.
A replay workflow answers:
- how do fixed rows get retried?
- how do we avoid duplicates?
- how do we prove what changed?
- how do we stop one row from bouncing forever?
Good replay design usually includes:
Idempotency
Retries should not create duplicates or double side effects. This matters because many cloud systems can retry beyond a simple mental model. Microsoft’s retry guidance explicitly warns that retries can be best effort and duplicate processing is possible, which is why idempotent design matters.
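One common way to get idempotency is to key target writes on a hash of the payload, so replaying the same row twice cannot create a duplicate. A sketch with a dict standing in for the target store; a real system would use a unique constraint or a warehouse MERGE instead:

```python
import hashlib

def record_hash(raw_row_text):
    """Stable hash of the raw payload, usable as an idempotency key."""
    return hashlib.sha256(raw_row_text.encode("utf-8")).hexdigest()

def replay_rows(target, rows):
    """Write rows into `target` keyed by record hash.

    Because the key is derived from the payload itself, retrying a
    replay batch is a no-op rather than a duplicate insert. `target`
    is any dict-like store; in production this would be a unique
    constraint or MERGE on the warehouse side.
    """
    applied = 0
    for raw in rows:
        key = record_hash(raw)
        if key not in target:
            target[key] = raw
            applied += 1
    return applied
```

Note the caveat baked into this choice: if the same payload can legitimately arrive twice as two distinct business events, the hash needs to include a batch or sequence component as well.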
Versioned remediation
If a row is corrected manually or upstream, store:
- what changed
- who changed it
- when it changed
- whether replay used the original or remediated payload
Replay status
Use a lifecycle such as:
- open
- investigating
- ready_for_replay
- replayed_success
- replayed_failed
- waived
Replay isolation
Do not blindly dump replayed rows back into the normal path without marking them. Keep a replay batch id or correlation id.
When to keep rows, when to keep files, and when to keep both
Retention is not only a storage question. It is an audit and incident-response question.
Keep the raw file when:
- file-level corruption is possible
- source disputes happen often
- you need strong non-repudiation
- the upstream team may ask for byte-exact evidence
Keep the raw row when:
- row-level remediation is common
- replay happens record by record
- the file is large but the row is enough for evidence
Keep both when:
- the feed is high-risk
- compliance requirements are strict
- debugging historically requires exact file context
This is where checksums help. A row without a file reference is weaker evidence than a row tied to a source checksum and ingest batch.
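A sketch of what that binding looks like in code, using SHA-256 over the exact bytes received. The function name and dict shape are illustrative:

```python
import hashlib

def evidence_for_row(file_bytes, line_number, raw_row_text):
    """Bind a quarantined row to byte-exact file evidence.

    The checksum is computed over the exact bytes received, so the row
    can later be matched against an archived copy of the source file
    during an audit or an upstream dispute.
    """
    return {
        "source_checksum": hashlib.sha256(file_bytes).hexdigest(),
        "source_line_number": line_number,
        "raw_row_text": raw_row_text,
    }
```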
The best alerting metric is not just “bad row count”
Teams often alert only on the total number of rejected rows. That is too shallow.
Better metrics include:
- bad row rate by feed
- bad row rate by validation stage
- unique error codes per day
- open quarantine age
- time to first acknowledgement
- time to replay
- repeated rejections of the same business key
- files with any structural parser failures
- ratio of accepted rows to quarantined rows
These metrics make the quarantine table a genuine operational surface instead of a passive archive.
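As an example of how cheap these metrics are once the quarantine table exists, here is a sketch of bad-row rate by feed. The input shapes (a list of quarantine dicts and a map of accepted counts per feed) are assumptions:

```python
from collections import Counter

def bad_row_rate_by_feed(quarantine_rows, accepted_counts):
    """Bad-row rate per feed: quarantined / (quarantined + accepted).

    `quarantine_rows` is a list of dicts with a "source_feed" key;
    `accepted_counts` maps feed name -> accepted row count. Both
    shapes are assumptions for this sketch.
    """
    quarantined = Counter(r["source_feed"] for r in quarantine_rows)
    rates = {}
    for feed, accepted in accepted_counts.items():
        bad = quarantined.get(feed, 0)
        total = bad + accepted
        rates[feed] = bad / total if total else 0.0
    return rates
```

The same pattern (group by a quarantine column, join against accepted counts) covers most of the metrics above: swap `source_feed` for `error_stage`, `error_code`, or `business_key`.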
Anti-patterns that make quarantine useless
1. Silent skipping
You already know this one. It makes dashboards look cleaner and systems harder to trust.
2. Storing only error messages, not payload
An error message without the row is often useless.
3. Storing payload without lineage
A bad row without source file and batch context is hard to trace.
4. No replay lifecycle
If everything stays open forever, quarantine becomes dead storage.
5. One generic error text field
Use structured codes and stages. Free-text only is hard to aggregate.
6. Mixing quarantined data with production facts
This invites downstream mistakes.
7. No retention policy
Eventually somebody asks whether old quarantined PII should still be there. Decide early.
A practical decision framework
Use this when you are deciding how much quarantine structure you really need.
Use lightweight quarantine when:
- feed volume is small
- failures are rare
- replay is manual and infrequent
- audit pressure is low
Use full operational quarantine when:
- multiple vendors send files
- SLAs matter
- replay must be safe
- auditors or customers care about rejected records
- you need ownership and remediation tracking
- accepted rows should keep flowing when some rows fail
Use append-only quarantine history when:
- regulatory or forensic needs are high
- fixes can change over time
- you must prove not only what failed, but how the failure was handled
How this fits Elysiate tools
For this topic, the most natural companion tools are:
- CSV Validator
- Malformed CSV Checker
- CSV Header Checker
- CSV Row Checker
- CSV Delimiter Checker
- CSV to JSON
These are especially useful before data ever reaches the warehouse, because the cheapest quarantine issue is the one you catch before the load job starts.
FAQ
What is a quarantine table in a CSV pipeline?
A quarantine table stores rows or files that failed parsing, schema checks, or business validation, while preserving enough metadata to debug, audit, and replay them later.
What should a quarantine table store?
At minimum, store source file identity, batch or run id, row or line reference when available, raw row content, error code, validation stage, timestamps, and replay status. In practice, raw payload, lineage, and lifecycle fields do most of the operational work.
Should I quarantine whole files or individual rows?
Use file-level quarantine when row boundaries or encoding are untrustworthy, and row-level quarantine when parsing succeeded but only some rows failed schema or domain validation. The failure class should drive the granularity.
How is quarantine different from silently dropping bad rows?
Silent dropping destroys observability and weakens audits. Quarantine preserves evidence, supports remediation, and lets you prove what was accepted, rejected, fixed, and replayed.
Is a badRecordsPath or rescued data column enough?
Usually not by itself. Those features are useful detection and capture mechanisms, but they do not automatically provide the operational workflow most teams need for ownership, replay, retention, and audit history.
How do I replay quarantined rows safely?
Use idempotent target writes, track replay batch ids, keep lifecycle status fields, and preserve the original failed payload alongside any corrected version. Replay should be a controlled workflow, not an ad hoc copy-paste.
Final takeaway
A quarantine table is not a punishment box for ugly data.
It is a control surface.
When designed well, it lets you keep good rows moving, isolate bad rows cleanly, preserve the source evidence, and answer the questions that matter later:
- what failed
- why it failed
- where it came from
- whether it was fixed
- and whether replay changed the final result
That is how you isolate bad CSV rows without losing audits.
About the author
Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.