Unicode normalization (NFC/NFD) and duplicate keys

By Elysiate · Updated Apr 11, 2026

Tags: csv · unicode-normalization · data-pipelines · nfc · nfd

Level: intermediate · ~14 min read · Intent: informational

Audience: developers, data analysts, ops engineers, technical teams

Prerequisites

  • basic familiarity with CSV files
  • basic familiarity with string comparison
  • optional understanding of SQL or ETL concepts

Key takeaways

  • Unicode normalization problems usually appear when two values render the same to humans but use different underlying code-point sequences, so equality checks, dedupe logic, or unique indexes behave unexpectedly.
  • NFC and NFD are not competing quality levels. They are different normalization forms for canonically equivalent text, and the safest choice is to standardize on one comparison form at clear system boundaries.
  • The safest operational pattern is usually preserve original text for audit, normalize a comparison key for matching and uniqueness, and make the chosen normalization form explicit in code and contracts.
  • Browser and database runtimes can help: JavaScript has String.prototype.normalize(), PostgreSQL has normalize() and normalization checks in UTF-8 databases, and BigQuery supports NORMALIZE and NORMALIZE_AND_CASEFOLD for query-time consistency.

FAQ

Why do visually identical keys sometimes fail duplicate checks?
Because they may use different Unicode code-point sequences that render the same but are not byte-for-byte equal until normalized.
Should I normalize to NFC or NFD?
Most application pipelines standardize on NFC for comparison and storage simplicity, but the important part is consistency. Pick one deliberately and document it.
Should I overwrite the original source text with the normalized version?
Usually no for audit-sensitive workflows. Preserve the original value, and derive a normalized comparison key for matching, dedupe, or uniqueness rules.
Can JavaScript help detect this problem?
Yes. String.prototype.normalize() can convert strings into a chosen normalization form so canonically equivalent values compare consistently.
Can databases help with Unicode normalization?
Yes. PostgreSQL and BigQuery both provide normalization functions that can be used in validation, dedupe, and query logic.

Unicode normalization (NFC/NFD) and duplicate keys

A lot of duplicate-key incidents are not really duplicate-key incidents.

They are string-identity incidents.

A row arrives with a key that looks identical to an existing value:

  • same visible letters
  • same accent marks
  • same apparent customer name or SKU suffix
  • same product label in the spreadsheet

But the equality check says they are different. The unique index allows both rows. A dedupe step misses one. A join drops matches. A support team insists the values are “the same.”

This is where Unicode normalization enters the pipeline.

The problem is not that CSV cannot carry the text. The problem is that visually identical text can be encoded in more than one valid way.

That is why Unicode normalization belongs in your data contract whenever keys or matching logic can include accented or non-ASCII text.

Why this topic matters

Teams usually reach this problem after one of these symptoms:

  • a unique constraint allows what humans think is a duplicate
  • a join misses records that look identical on screen
  • search works in one system but not another
  • a browser app and a database disagree about string equality
  • imported spreadsheet values look fine but break dedupe logic
  • or a “fix” in Excel changes the key behavior without anyone understanding why

Those failures all point to the same deeper issue:

rendering equality is not the same thing as binary equality.

If your pipeline relies on exact equality for keys, that difference matters.

Start with the core concept: canonical equivalence

Unicode normalization is built around the idea that some strings are canonically equivalent even though they are made from different code-point sequences.

Unicode Standard Annex #15 defines canonical and compatibility equivalence and the four normalization forms NFC, NFD, NFKC, and NFKD. Normalization exists because more than one sequence of code points can represent the same abstract text.

The most common example is an accented character:

  • one form may use a single precomposed character
  • another may use a base letter plus a combining accent

Both can render the same on screen. They are not necessarily equal until normalized.

That is the root of the “duplicate key” surprise.

The practical example most teams hit first

Suppose one system exports a customer key that includes an accented character in precomposed form. Another system emits the same visible character as base letter plus combining mark.

To users:

  • they look the same

To naive equality:

  • they are different byte sequences

To your pipeline:

  • they may become separate keys

MDN’s String.prototype.normalize() docs make this exact point. They show that visually identical strings can have different lengths and compare unequal until normalized to the same Unicode normalization form.

That is why this problem often appears in:

  • customer names
  • supplier names
  • city names
  • product labels
  • IDs with accented prefixes
  • or any business key that was never supposed to depend on byte-level composition details

NFC and NFD are not “better” and “worse”

A lot of teams ask:

  • should we use NFC or NFD?

That question is too vague on its own.

Unicode UAX #15 defines:

  • NFD as canonical decomposition
  • NFC as canonical decomposition followed by canonical composition
  • and similarly NFKD/NFKC for compatibility forms

That means NFC and NFD are both legitimate normalized forms. They are not rival quality levels.

The more useful question is:

  • which form will we standardize on for comparison and key logic?

For most application pipelines, NFC is the common practical choice because:

  • it often looks closer to what developers expect in everyday text handling
  • it produces a composed form where possible
  • and many common tools and systems already tend to work comfortably with it

But consistency matters more than ideology. A pipeline that consistently uses NFD can still be correct. A pipeline that mixes both unpredictably will not be.

Why this breaks keys specifically

Unicode normalization problems are more serious for keys than for display fields because keys participate in:

  • uniqueness constraints
  • joins
  • dedupe
  • record matching
  • upserts
  • cache lookups
  • and identity decisions

That means one visually identical mismatch can produce:

  • duplicate records
  • unmatched foreign keys
  • double inserts
  • incorrect “new customer” detection
  • or silent downstream divergence

A comment field can survive normalization confusion longer. A key field usually cannot.

This is why key columns deserve explicit normalization policy, not casual string comparison.

CSV makes the problem easier to transport, not easier to see

CSV is not causing the normalization issue. It is just carrying the text across system boundaries.

RFC 4180 is about row and field structure, delimiters, quoting, and headers. It does not solve higher-level identity semantics for Unicode strings inside fields.

That is why a CSV file can be:

  • perfectly valid structurally
  • correctly UTF-8 encoded
  • and still contain key values that break matching because one producer emitted NFC and another emitted NFD

This is one reason normalization bugs feel so confusing: the file “looks fine” and “loads fine,” but equality still fails later.

The safest mental model: preserve original text, compare on a normalized key

This is the most useful production pattern for many teams.

Do not ask one column to do both jobs:

  • preserve original source text exactly as received
  • and serve as the canonical comparison key

Those are different responsibilities.

A safer model is:

Original field

Preserve for:

  • audit
  • display
  • reconciliation
  • source fidelity

Normalized comparison field

Use for:

  • dedupe
  • matching
  • uniqueness checks
  • join logic
  • search normalization

This pattern gives you:

  • traceability
  • reversible diagnostics
  • and consistent key logic

It also avoids the operational argument about whether the pipeline “changed user data,” because the original value is still available.

JavaScript can help, but only if you use it intentionally

A lot of CSV tools and upload validators run in the browser or in Node.js. That makes JavaScript normalization behavior important.

MDN documents String.prototype.normalize() as the built-in way to convert a string into a chosen Unicode normalization form. It supports all four forms and defaults to NFC when no form is supplied.

That gives you a direct practical tool:

  • normalize before equality checks
  • normalize before building derived keys
  • normalize before hashing if the hash is intended to reflect semantic text equality rather than raw byte identity

But do this carefully:

  • normalize for comparison logic
  • do not silently overwrite original user-facing text unless that is an explicit product rule

That distinction prevents a lot of accidental data mutation.

PostgreSQL can enforce normalization-aware comparisons

PostgreSQL’s current string function docs include both:

  • normalize(text [, form])
  • and normalization checks such as text IS [form] NORMALIZED

It also notes that the normalization functions are available when the server encoding is UTF8, and that they support the NFC, NFD, NFKC, and NFKD forms.

That gives you several strong options:

  • validate inbound keys and reject non-normalized input
  • normalize into a derived comparison column
  • build unique indexes on normalized expressions
  • audit where source data mixes forms

This is very useful for staging layers and dedupe jobs because it moves the comparison policy into the database instead of hiding it in application code.

A pattern many teams can use:

  • keep raw_key
  • derive normalized_key = normalize(raw_key, NFC)
  • enforce uniqueness or join logic on normalized_key

That is much safer than hoping every producer emitted the same Unicode form already.

BigQuery can help with query-time normalization too

BigQuery’s official string functions docs include:

  • NORMALIZE(value[, normalization_mode])
  • NORMALIZE_AND_CASEFOLD(value[, normalization_mode])

The docs explicitly say normalization is used when two strings render the same on screen but have different Unicode code points; both functions support NFC, NFKC, NFD, and NFKD.

That is useful for:

  • query-time joins
  • dedupe checks
  • quality audits
  • and case-insensitive normalization scenarios when you truly want both normalization and case folding

This matters because warehouse teams often discover normalization issues only after the data has already landed. BigQuery lets you diagnose and repair comparison logic without pretending the issue was only an application problem.

Normalization is not case folding

Normalization and case-insensitive matching often get mixed together. They are not the same operation.

Normalization answers:

  • are these canonically equivalent Unicode strings represented differently?

Case folding answers:

  • should these strings be treated the same regardless of case?

Sometimes you need both. Sometimes you do not.

That is why BigQuery’s NORMALIZE_AND_CASEFOLD is useful, but also potentially dangerous if you use it casually for key logic without a clear product rule.

A good rule is:

  • normalize for Unicode equivalence first
  • decide separately whether case should matter

Do not bundle both decisions together accidentally just because a function exists.

Where normalization belongs in the pipeline

A practical sequence usually looks like this:

1. Preserve the raw file

Keep the original bytes and values for audit and replay.

2. Validate structure first

Delimiter, quoting, row width, encoding, headers. Do not let Unicode-key debugging distract from malformed CSV basics.

3. Identify key columns

Not every text field needs the same normalization policy. Keys do.

4. Apply normalization at a clear boundary

Usually:

  • in the browser validator
  • in the ingest service
  • or in the staging SQL layer

5. Compare and enforce on normalized keys

Do not leave key equality to whichever runtime happens to touch the row first.

6. Keep the original field

Especially if users need to see what they originally supplied.

This sequence keeps the logic explicit and debuggable.

Good examples

Example 1: duplicate customer key after import

Two rows look identical in the spreadsheet:

  • José
  • José

One uses a precomposed character. The other uses base e plus combining accent.

A naive unique check misses the duplicate. A normalized comparison catches it.

Example 2: browser preview and backend disagree

The frontend dedupes on value.normalize("NFC"). The backend compares raw strings. Users see one row in preview and two rows after import.

This is not a parser bug. It is a missing normalization contract.

Example 3: warehouse join misses rows

An analytics join on customer names or external keys fails for a subset of international records. Using BigQuery NORMALIZE(..., NFC) on both sides reveals the mismatch.

These are common production patterns, not academic Unicode trivia.

What to avoid

Anti-pattern 1: pretending visually identical means equal

Screens do not define key identity.

Anti-pattern 2: normalizing only in one layer

If the browser normalizes and the database does not, or vice versa, the pipeline behavior becomes inconsistent.

Anti-pattern 3: overwriting source text blindly

Preserve originals when auditability or user-visible fidelity matters.

Anti-pattern 4: using compatibility forms casually for strict identity keys

NFKC/NFKD can be useful, but they solve a different class of equivalence and may be too aggressive if your intent is only canonical equivalence.

Anti-pattern 5: debugging “duplicates” without checking normalization first

Teams often waste time investigating encoding or parser issues before testing canonical equivalence.

A practical decision framework

Use these questions before choosing a normalization strategy.

1. Is this field a key or just display text?

If it is a key, normalization policy should be explicit.

2. Do you need canonical equivalence or broader compatibility equivalence?

Most duplicate-key problems are canonical-equivalence problems, which usually points toward NFC or NFD rather than NFKC/NFKD.

3. Do you need to preserve the original text exactly?

If yes, store both the original and a normalized comparison key.

4. Will comparisons happen in multiple runtimes?

If yes, standardize normalization at boundaries and document it.

5. Do case differences matter too?

Decide separately whether case folding belongs in the key policy.

These questions usually make the right implementation pattern obvious.

Which Elysiate tools fit this topic naturally?

The most natural companions are CSV validation and conversion tools. They fit because Unicode normalization issues often show up only after the file has already crossed one or more format boundaries. The safer workflow is:

  • validate structure
  • identify key columns
  • normalize comparison logic deliberately
  • then transform or load

Why this page can rank broadly

To support broader search coverage, this page is intentionally shaped around several connected query families:

Core Unicode intent

  • unicode normalization nfc nfd duplicate keys
  • visually identical strings not equal
  • canonical equivalence duplicate rows

Runtime and database intent

  • javascript normalize duplicate strings
  • postgres normalize unicode
  • bigquery normalize string equality

CSV and pipeline intent

  • csv unicode duplicate keys
  • accented characters break dedupe
  • normalize keys in data pipeline

That breadth helps one page rank for much more than the literal title.

FAQ

Why do visually identical keys sometimes fail duplicate checks?

Because they can use different Unicode code-point sequences that render the same but are not equal until normalized.

Should I normalize to NFC or NFD?

Most teams standardize on NFC for comparison and storage simplicity, but the critical part is consistency and documentation.

Should I overwrite the source text with the normalized version?

Usually no for audit-sensitive workflows. Preserve the original and derive a normalized comparison key.

Can JavaScript help with this?

Yes. String.prototype.normalize() is built specifically to convert strings into a common normalization form for consistent comparison.

Can databases help too?

Yes. PostgreSQL and BigQuery both provide normalization functions that are useful for dedupe, joins, and validation.

What is the safest default mindset?

Preserve original text, normalize comparison logic, and make the chosen form part of the contract.

Final takeaway

Unicode normalization bugs are really identity-contract bugs.

The safest baseline is:

  • preserve the original value
  • choose a comparison normalization form deliberately
  • normalize key logic consistently across layers
  • keep case-folding as a separate decision
  • and test visually identical edge cases in the same way you test duplicate IDs and header mismatches

That is how you stop “these two values look the same” from becoming a recurring pipeline incident.

About the author

Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.
