Slowly changing dimensions from daily CSV snapshots
Level: intermediate · ~15 min read · Intent: informational
Audience: developers, data engineers, data analysts, ops engineers, technical teams
Prerequisites
- basic familiarity with CSV files
- basic understanding of dimension tables
- optional familiarity with warehouses, dbt, or lakehouse tooling
Key takeaways
- Daily CSV snapshots are a practical raw input for slowly changing dimensions when CDC is unavailable, but you need explicit rules for detecting changes, deletions, and effective dates.
- Type 1 and Type 2 solve different problems. Type 1 overwrites to keep only current state, while Type 2 preserves history by closing old rows and inserting new versions.
- The most reliable implementation pattern is usually raw snapshot landing first, then a deterministic diff step using business keys plus a hash or column comparison to detect changed attributes.
- Late-arriving files, reruns, missing records, and duplicate source keys matter as much as SQL syntax. An SCD design without clear rules for these cases becomes unreliable quickly.
References
- dbt Developer Hub — Add snapshots to your DAG
- dbt Developer Hub — Dimensions
- Microsoft Fabric — Slowly changing dimension type 2
- Microsoft Power BI guidance — Star schema
- Databricks — Change data capture and snapshots
- Databricks SQL — MERGE INTO
- Kimball Group — Dimensional Modeling Techniques: Type 2
- Oracle — Using Slowly Changing Dimensions and Daily Snapshot Data
- RFC 4180
FAQ
- Can you build an SCD Type 2 dimension from daily CSV snapshots?
- Yes. Daily full snapshots are a common fallback when source systems do not provide CDC. You land each snapshot, compare it to the prior state using business keys and change-detection logic, then close old rows and insert new versions when tracked attributes change.
- What is the difference between using daily snapshots and true CDC for SCDs?
- Daily snapshots capture state at intervals, not every change event. That is usually simpler operationally, but you lose intraday history and must infer inserts, updates, and deletes by comparing snapshots.
- How do you detect changes between daily snapshots?
- The most common pattern is to compare rows by business key and then detect change using a hash-diff or explicit comparison of tracked attributes. Only meaningful attribute changes should create new Type 2 versions.
- How should deletes be handled in daily snapshot SCD pipelines?
- You need an explicit rule. Some teams mark a record inactive when it disappears from a snapshot, some keep the last version current until deletion is confirmed across multiple runs, and some model soft deletes separately.
- Do dbt snapshots automatically create a full dimensional model?
- dbt snapshots implement Type 2 history over mutable source tables, which is very useful, but you still need to decide how that history maps into your dimensional model, keys, and downstream facts.
Slowly changing dimensions from daily CSV snapshots
Daily CSV snapshots are one of the most common ways teams back into historical dimension tracking.
Not because they are elegant. Because they are available.
A source system may not give you:
- change data capture
- event streams
- audit tables
- or clean “updated since” queries
What it may give you is:
- a full daily extract
- a nightly CSV in object storage
- or a scheduled export sent by a vendor or internal team
That is enough to build slowly changing dimensions. But only if you make the design choices explicit.
This guide is about those choices.
Why this topic matters
People usually search for this problem using one of these frames:
- build SCD Type 2 from daily snapshots
- full snapshot to slowly changing dimension
- dbt snapshots for dimensions
- daily CSV exports with historical tracking
- detect changes between daily files
- Type 1 vs Type 2 from snapshots
- dimension history from full extracts
- how to handle deletes in snapshot-based SCDs
Those are all versions of the same question:
how do I turn repeated state snapshots into usable history?
That question matters because daily CSV snapshots are easy to collect and deceptively hard to model well.
Start with the core idea: a snapshot is state, not change events
This is the first conceptual distinction that prevents a lot of bad designs.
A daily CSV snapshot tells you:
- what the dimension looked like when the file was produced
It does not tell you directly:
- when each individual change happened during the day
- whether a missing row is a deletion or a source failure
- whether a changed value is a correction or a new fact of the business
- what happened between one daily file and the next
That means snapshot-driven SCD design is about inferring change from state comparisons.
This is different from true CDC.
Databricks’ CDC and snapshots documentation is useful here because it explicitly distinguishes:
- current-state handling
- and full-history-of-changes handling through SCD approaches
That distinction matters because daily snapshots can support SCDs well, but they still have a lower temporal resolution than event-level change capture.
What SCD means in this context
Microsoft’s Power BI star schema guidance defines a slowly changing dimension as one that appropriately manages change of dimension members over time, especially when values change slowly and historical context matters.
That is exactly what daily snapshots are usually trying to solve.
Typical examples:
- a customer changed address
- a product moved categories
- an employee changed department
- a supplier changed region
- a pricing tier changed
- an account status changed
If you only overwrite the current row, you lose historical context.
If you preserve versions, you can answer:
- what did we believe on a given date?
- what attributes were in force when a fact occurred?
- how long did a record remain in a certain state?
That is where SCD modeling matters.
Type 1 vs Type 2 from daily snapshots
This is the first real design choice.
Type 1
A Type 1 dimension overwrites old values and keeps only the current state.
Use it when:
- history is not important
- you only care about the latest truth
- old attribute values should not be preserved
- downstream consumers want simple current-state lookup tables
From daily CSV snapshots, Type 1 is easy:
- match by business key
- update changed attributes
- insert new rows
- optionally mark missing rows inactive depending on your rules
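The Type 1 steps above can be sketched in a few lines of Python. This is a minimal sketch, not a production implementation: the dimension is held as an in-memory dict keyed by business key, and names like `customer_id` and `is_active` are illustrative assumptions, not fields your source will necessarily have.

```python
# Minimal Type 1 sketch: the dimension keeps only current state.
# "customer_id" is a hypothetical business key; all field names are illustrative.

def apply_type1(dimension: dict, snapshot: list[dict],
                mark_missing_inactive: bool = False) -> dict:
    """Overwrite the dimension in place with today's snapshot state."""
    seen = set()
    for row in snapshot:
        key = row["customer_id"]
        seen.add(key)
        # Insert new keys and overwrite changed attributes alike -- no history kept.
        dimension[key] = {**row, "is_active": True}
    if mark_missing_inactive:
        # Policy choice, not a hard delete: flag keys absent from today's file.
        for key in dimension.keys() - seen:
            dimension[key]["is_active"] = False
    return dimension
```

Note that even this trivial version forces the missing-row policy (`mark_missing_inactive`) to be an explicit argument rather than an accident of the code.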
Type 2
A Type 2 dimension preserves history by:
- closing the current version when tracked attributes change
- inserting a new row for the new version
- keeping effective start and end boundaries
- often marking one row as current
Kimball’s Type 2 technique is the classic reference here: when a tracked attribute changes, you add a new row and generalize the primary key beyond the natural key because multiple rows now describe the same business entity over time.
From daily CSV snapshots, Type 2 is where the interesting work begins.
Daily snapshots are often a practical substitute for CDC
dbt’s snapshot docs are especially useful because they state plainly that snapshots implement Type 2 slowly changing dimensions over mutable source tables.
That is a strong mental model for daily CSV snapshot workflows:
- the source keeps only current mutable state
- you preserve history by re-snapshotting and comparing over time
This does not mean daily snapshots are as detailed as CDC. It means they are often the most practical path when CDC is unavailable.
Oracle’s documentation on using slowly changing dimensions with daily snapshot data makes a similar point in practice: daily snapshot data can be used to preserve historical context by associating facts with the correct version of dimensional data.
So daily snapshots are not a hack. They are a legitimate modeling input, as long as you understand the tradeoffs.
The most important design choice: what counts as a change?
This is where strong and weak SCD implementations diverge.
Not every column change should create a new version.
For each dimension, define:
- business key
- tracked attributes
- non-tracked attributes
- technical metadata fields
- derived fields that should not trigger versioning unless intentional
Example: A customer dimension might track:
- name
- address
- status
- segment
But maybe not:
- last ingestion timestamp
- batch ID
- file name
If you version on everything blindly, you create noisy history. If you version on too little, you miss business change.
Microsoft Fabric’s SCD Type 2 guidance explicitly recommends deciding which fields in the source you will use to detect new or changed records. That is exactly the right discipline here.
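One way to make that discipline concrete is to write the column contract down as data rather than leaving it implicit in SQL. The sketch below assumes a hypothetical customer dimension; every column name is illustrative.

```python
# A hypothetical per-dimension contract (all names illustrative): which columns
# identify a row, which should trigger a new Type 2 version, and which must
# never trigger versioning.
CUSTOMER_DIM_CONTRACT = {
    "business_key": ["customer_id"],
    "tracked": ["name", "address", "status", "segment"],
    "untracked": ["marketing_notes"],
    "metadata": ["last_ingestion_ts", "batch_id", "file_name"],
}

def validate_contract(contract: dict) -> set[str]:
    """Every column must live in exactly one group, so change detection
    is unambiguous about what it compares."""
    groups = [set(contract[g])
              for g in ("business_key", "tracked", "untracked", "metadata")]
    union = set().union(*groups)
    if sum(len(g) for g in groups) != len(union):
        raise ValueError("a column appears in more than one group")
    return union
```

A contract like this can then drive both the hash-diff and the review conversation about what should create history.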
Hash-diff vs column-by-column comparison
Once you know which attributes matter, you need a change-detection method.
Two common patterns:
Pattern 1: explicit column comparison
Compare each tracked attribute directly.
Pros:
- easy to explain
- easy to debug
- straightforward when the number of tracked columns is small
Cons:
- verbose SQL
- harder to maintain when tracked columns increase
Pattern 2: hash-diff comparison
Compute a deterministic hash of the tracked attributes and compare the hash between snapshots.
Pros:
- compact
- maintainable for many attributes
- useful in merge workflows
Cons:
- you still need consistent normalization before hashing
- debugging requires looking past the hash to the actual changed columns
For daily CSV snapshots, hash-diff is often the most practical pattern for production, but explicit comparisons are often easier to teach and debug first.
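A minimal hash-diff sketch looks like this. The normalization rules shown (strip, casefold, `None` treated as empty) are assumptions you should align with your own semantics, not a standard.

```python
import hashlib

def hash_diff(row: dict, tracked: list[str]) -> str:
    """Deterministic hash over tracked attributes only.

    Normalization happens before hashing, and a separator that cannot
    appear in the data prevents accidental collisions like
    ("ab", "c") vs ("a", "bc").
    """
    parts = []
    for col in tracked:  # fixed column order matters for determinism
        value = row.get(col)
        text = "" if value is None else str(value).strip().casefold()
        parts.append(text)
    payload = "\x1f".join(parts)  # ASCII unit separator as delimiter
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two snapshots of the same entity produce the same hash only when every tracked attribute is unchanged after normalization, which is exactly the trigger condition for a new Type 2 version.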
The second big design choice: what date becomes the effective start date?
Many teams gloss over this and regret it later.
Possible choices:
- file delivery date
- source extract timestamp
- business effective timestamp from the source
- load timestamp
- end-of-day snapshot date
These are not equivalent.
If a daily snapshot lands at 02:00 and represents the system as of midnight, the “effective start” should probably not be 02:00 unless your consumers truly interpret it that way.
A strong design defines:
- valid_from
- valid_to
- is_current
- and the meaning of each
Otherwise your Type 2 history becomes technically precise and analytically confusing.
The third big design choice: how to handle missing rows
Daily snapshot SCD designs often fail here.
If a row existed yesterday but is missing today, what does that mean?
Possibilities:
- the entity was deleted
- the source temporarily omitted it
- the export failed partially
- the business no longer considers it active
- the row moved to another extract scope
You need an explicit rule.
Common patterns:
Soft-delete on first disappearance
If the row is missing today, close the current version and mark it inactive.
Pros:
- simple
- fast to reason about
Cons:
- sensitive to incomplete or bad snapshots
Confirm delete after N missing runs
Only treat disappearance as deletion after it is absent for multiple consecutive snapshots.
Pros:
- more resilient to flaky source exports
Cons:
- slower to reflect true deletions
Separate deletion logic
Use source-provided deletion markers or a separate feed when available.
Pros:
- most accurate if available
Cons:
- not always possible
This is one of the most important practical rules in the whole guide: snapshot absence is not automatically the same as deletion.
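The "confirm delete after N missing runs" pattern can be sketched as a small state machine over consecutive-miss counters. This is an illustrative sketch; the key names and the three-way `active`/`suspect`/`deleted` labels are assumptions, not a standard vocabulary.

```python
def update_delete_state(state: dict, present_keys: set,
                        threshold: int = 3) -> dict:
    """Confirm a delete only after a key has been absent from `threshold`
    consecutive snapshots. `state` maps business key -> consecutive misses
    and is mutated in place; the return value is today's decision per key."""
    decisions = {}
    for key in list(state):
        if key in present_keys:
            state[key] = 0          # reappearance resets the counter
            decisions[key] = "active"
        else:
            state[key] += 1
            decisions[key] = "deleted" if state[key] >= threshold else "suspect"
    for key in present_keys - state.keys():
        state[key] = 0              # brand-new keys start clean
        decisions[key] = "active"
    return decisions
```

Only keys that reach the `deleted` decision should drive version-closing logic; `suspect` keys stay current, which is what makes this pattern resilient to one bad export.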
The fourth big design choice: late-arriving corrections
Daily snapshots also create late-arriving-change problems.
Examples:
- a source system corrected yesterday’s customer segment today
- a vendor resent a corrected full file for a prior business date
- an older extract is delivered late after a network outage
- your warehouse reprocesses a past file after a bug fix
You need to decide:
- does the pipeline trust file arrival order?
- or business effective dates inside the file?
- can past snapshots be replayed?
- do reruns replace prior state or append corrected history?
Without an answer, historical dimensions become inconsistent whenever operations get messy.
Daily full snapshot vs true snapshot history table
There are two closely related patterns here.
Pattern 1: keep every raw daily file
Pros:
- best auditability
- easiest replay
- source fidelity preserved
Cons:
- more storage
- extra diffing work later
Pattern 2: append daily snapshot state into a raw history table
Pros:
- easier to query
- easier for dbt-style snapshot logic
- simpler lineage at row level
Cons:
- less file-native raw fidelity
- still requires clear dedupe and rerun rules
dbt snapshots fit especially well with mutable source tables or staged snapshot tables because they turn repeated source states into Type 2-style history.
But it is worth saying clearly: a dbt snapshot is not the whole dimensional model by itself. It is often one implementation mechanism for the history layer.
Surrogate keys still matter
Kimball’s guidance on Type 2 is still highly relevant: when multiple rows represent one business entity over time, you need a key strategy beyond the natural key.
Typical SCD Type 2 table design includes:
- surrogate dimension key
- natural or business key
- valid_from
- valid_to
- is_current
- change hash or tracked attributes
- load metadata
Why this matters:
- facts may need to join to the correct historical version
- multiple rows can share the same natural key
- current and prior versions must remain distinguishable
A daily snapshot pipeline that preserves history without thinking about surrogate keys is usually only halfway finished.
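The column list above can be written down as an explicit row type. This is a sketch of one possible shape: the field names are illustrative, and the in-process counter stands in for what a real warehouse would do with a sequence or identity column.

```python
import itertools
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DimCustomerRow:
    customer_sk: int          # surrogate key: unique per version
    customer_id: str          # natural/business key: shared across versions
    name: str
    status: str
    valid_from: date
    valid_to: Optional[date]  # None while the version is current
    is_current: bool
    change_hash: str          # hash of tracked attributes
    loaded_from_file: str     # load metadata for lineage

_sk_counter = itertools.count(1)

def next_surrogate_key() -> int:
    """Monotonic surrogate key; a warehouse would use a sequence instead."""
    return next(_sk_counter)
```

The point of the two-key design is visible immediately: two `DimCustomerRow` instances can share one `customer_id` while each keeps its own `customer_sk`, so facts can join to the exact version in force.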
A practical implementation pattern
This is the pattern that works well in many CSV-based warehouse pipelines.
Step 1. Land the raw daily CSV
Preserve:
- original file
- checksum
- file date
- extract timestamp if available
- row count
- source metadata
Step 2. Validate structure
Before any SCD logic, confirm:
- delimiter
- encoding
- headers
- quote balance
- row width
- duplicate business keys, if the source is supposed to deliver one row per key
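Step 2's structural checks can be sketched with the standard library alone. The expected header and key column are passed in as assumptions; a real pipeline would likely add encoding and quote-balance checks on the raw bytes as well.

```python
import csv
import io
from collections import Counter

def validate_snapshot(text: str, expected_header: list[str],
                      key_col: str) -> list[str]:
    """Structural checks before any SCD logic runs: header, row width,
    duplicate business keys. Returns a list of problems (empty = pass)."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != expected_header:
        problems.append(f"unexpected header: {rows[0] if rows else 'empty file'}")
        return problems
    width = len(expected_header)
    key_idx = expected_header.index(key_col)
    keys = []
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"line {lineno}: expected {width} fields, got {len(row)}")
        else:
            keys.append(row[key_idx])
    for key, count in Counter(keys).items():
        if count > 1:
            problems.append(f"duplicate business key {key!r} ({count} rows)")
    return problems
```

Failing a file here, before any history logic runs, is much cheaper than unwinding a bad Type 2 merge afterwards.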
Step 3. Normalize a current-state staging table
Create a cleaned staging layer with:
- standardized types
- normalized whitespace and case where appropriate
- one row per business key for the snapshot date
- deterministic business-key handling
Step 4. Detect change
Compare today’s staged snapshot to the prior known current version using:
- explicit comparisons or
- a hash-diff of tracked attributes
Step 5. Close changed current rows
For Type 2, set:
- valid_to
- is_current = false for rows whose tracked attributes changed
Step 6. Insert new current rows
Insert:
- new business keys
- changed business keys as new versions
Step 7. Handle missing rows according to your delete policy
Do not leave this undefined.
Step 8. Merge deterministically
Databricks’ MERGE INTO docs are a good example of the primitive many lakehouse and warehouse systems use to implement upsert and SCD-style workflows.
Whatever tool you use, deterministic merge behavior matters.
Type 1 and Type 2 can coexist
Not every dimension attribute needs the same treatment.
Some teams use:
- Type 1 for corrections or non-historical cleanup fields
- Type 2 for business-critical attributes where history matters
That is often a better design than trying to force all fields into one historical rule.
For example:
- fix spelling corrections as Type 1
- track account status changes as Type 2
- track product category changes as Type 2
- keep ingestion metadata as non-versioning
Daily snapshots can support hybrid patterns well if the contract is explicit.
Common anti-patterns
Anti-pattern 1: version every column
This creates noisy history and low trust.
Anti-pattern 2: use file arrival time blindly as business effective time
This can distort historical interpretation.
Anti-pattern 3: treat every disappearance as a hard delete
This is risky when snapshots are incomplete.
Anti-pattern 4: skip raw snapshot retention
You lose replay and audit power.
Anti-pattern 5: rely on one tool feature without defining semantics
A snapshot feature helps, but you still need your business rules.
Anti-pattern 6: ignore duplicate business keys inside a daily file
This makes change detection unreliable before history logic even begins.
Which Elysiate tools fit this page naturally?
The most natural companion tools are the structural CSV validators, because SCD logic is only trustworthy if the daily snapshot file itself is structurally trustworthy first:
- CSV Delimiter Checker
- CSV Header Checker
- CSV Row Checker
- Malformed CSV Checker
- CSV Validator
- CSV Splitter
This fits the workflow well:
- validate the snapshot
- then model the history
Why this page can rank broadly
To support broader search coverage, this page is intentionally shaped around several connected search clusters:
Core SCD intent
- slowly changing dimensions from daily csv snapshots
- scd type 2 from daily snapshots
- full snapshot to dimension history
dbt and warehouse intent
- dbt snapshots scd type 2
- dimension history from snapshots
- merge into scd type 2
Operational intent
- detect deletes in snapshot-based scd
- hash diff daily snapshot changes
- effective dates from full extracts
- late arriving snapshot corrections
That breadth helps one page rank for much more than the literal title.
FAQ
Can you build an SCD Type 2 dimension from daily CSV snapshots?
Yes. Daily full snapshots are a common substitute when CDC is unavailable. You land each snapshot, compare rows by business key, detect tracked-attribute changes, close old versions, and insert new current versions.
What is the difference between daily snapshots and true CDC?
Daily snapshots capture state at intervals. CDC captures individual changes as they happen. Snapshots are often simpler operationally, but they lose intraday detail and require inference for updates and deletes.
How do you detect changes between daily snapshots?
The most common production pattern is business-key matching plus a hash-diff or explicit comparison across tracked attributes.
How should deletes be handled?
You need an explicit policy. Some teams close records on first disappearance, some require multiple missing snapshots, and some rely on a separate delete signal.
Do dbt snapshots automatically solve the whole problem?
They solve an important part of it by implementing Type 2 history over mutable sources, but you still need to decide key strategy, effective dates, tracked attributes, delete handling, and how facts join to the dimension.
What is the safest default mindset?
Treat daily snapshots as reliable state captures, not perfect change logs. Preserve the raw files, validate structure first, define change rules explicitly, and make delete and effective-date semantics part of the model instead of afterthoughts.
Final takeaway
Daily CSV snapshots are a very practical way to build slowly changing dimensions.
But they only work well when you stop treating them like simple files and start treating them like repeated state observations.
The safest baseline is:
- preserve the raw snapshot files
- validate structure before history logic
- define business keys and tracked attributes explicitly
- detect changes deterministically
- handle deletes with an explicit policy
- separate Type 1 from Type 2 decisions
- and make effective dates, surrogate keys, and rerun behavior part of the design from the start
That is how daily snapshots become trustworthy historical dimensions instead of noisy file diffs.
About the author
Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.