How to Find and Remove Duplicates in SQL

Updated Apr 4, 2026

Tags: sql · database · query-language · data-cleaning · sql-tutorial · duplicate-data

Level: intermediate · ~18 min read · Intent: informational

Audience: backend developers, data analysts, data engineers, technical teams, students, operations teams

Prerequisites

  • basic familiarity with databases
  • basic understanding of SELECT, WHERE, GROUP BY, and DELETE

Key takeaways

  • The safest way to remove duplicates in SQL is to identify exactly what counts as a duplicate first, inspect the affected rows, and only then delete with a ranking pattern such as ROW_NUMBER.
  • Most duplicate problems should be fixed in two stages: clean the existing bad data, then prevent future duplicates with proper unique constraints, validation, and import logic.

FAQ

What is the easiest way to find duplicates in SQL?

The easiest way is usually to group by the columns that should be unique and use HAVING COUNT(*) > 1. That quickly shows which values appear more than once.

What is the safest way to remove duplicates in SQL?

The safest approach is to use ROW_NUMBER in a CTE or subquery to mark one row to keep and the extra rows to remove, then inspect the result before running DELETE.

How do you decide which duplicate row to keep?

You should choose based on business rules, such as keeping the newest row, the oldest row, the row with the most complete data, or the row with the trusted source flag.

How do you stop duplicates from coming back?

After cleanup, add the right unique constraint or index, improve import validation, and make sure inserts or upserts follow the same uniqueness rules.

Duplicate data is one of the most common database problems in real systems.

It shows up everywhere:

  • imported spreadsheets
  • user tables with repeated emails
  • CRM records with slightly different names
  • product catalogs with repeated SKUs
  • event logs accidentally inserted twice
  • analytics tables with double-counted rows
  • and operational systems where uniqueness was assumed but never enforced

That is why learning how to find and remove duplicates in SQL is one of the most useful practical database skills.

But there is an important catch:

Removing duplicates is easy to do badly.

A careless delete can remove the wrong rows, break downstream reporting, or destroy data you later realize mattered. That is why the right approach is not just:

  • find duplicate rows
  • delete them

The right approach is:

  1. define what “duplicate” actually means
  2. inspect the duplicate groups
  3. decide which row should be kept
  4. remove only the unwanted copies
  5. prevent the duplicates from returning

This guide walks through that full process in a way that is safe, practical, and reusable.

Why duplicate data happens in the first place

Before writing any SQL, it helps to understand why duplicates happen.

Common causes include:

  • missing unique constraints
  • repeated imports
  • manual data entry
  • race conditions in application code
  • weak matching rules during merges
  • inconsistent formatting like upper/lowercase or extra spaces
  • sync jobs replaying the same records
  • and event ingestion pipelines that are not idempotent

That matters because duplicate cleanup is only half the job.

If you delete duplicates without fixing the source of the problem, they usually come back.

What counts as a duplicate in SQL?

This is the first decision you must get right.

A duplicate is not always:

  • two rows that are completely identical in every column

Sometimes duplicates mean:

  • same email address
  • same customer name plus phone number
  • same order number
  • same SKU
  • same user ID plus event timestamp
  • same business key even if other columns differ

For example, these two rows could be duplicates in a customer table even though they are not fully identical:

id | email          | name        | created_at
---|----------------|-------------|--------------------
1  | alice@test.com | Alice Smith | 2026-04-01 09:00:00
2  | alice@test.com | Alice Smith | 2026-04-03 11:20:00

The rows are not identical because:

  • the IDs differ
  • the timestamps differ

But if email is supposed to be unique, then they are duplicates for business purposes.

That is why the most important question is:

Which columns define uniqueness in this table?

Once you know that, your duplicate logic becomes much more reliable.

Step 1: Find duplicate groups with GROUP BY and HAVING

The simplest way to find duplicates is to group by the columns that should be unique and then look for groups that appear more than once.

Suppose you have a customers table and email should be unique.

SELECT
    email,
    COUNT(*) AS duplicate_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

This query tells you:

  • which email values appear more than once
  • how many times they appear

That is the easiest starting point for duplicate detection.

Example: duplicates by two columns

If uniqueness should be based on both first name and phone number, use both in the grouping:

SELECT
    first_name,
    phone_number,
    COUNT(*) AS duplicate_count
FROM customers
GROUP BY first_name, phone_number
HAVING COUNT(*) > 1;

This pattern is useful because it gives you a quick map of where duplicates exist before you touch the actual rows.

Step 2: Inspect the actual duplicate rows

After you find the duplicate values, the next step is to inspect the real rows. Do not skip this.

Suppose you already know duplicate emails exist. You can inspect them like this:

SELECT *
FROM customers
WHERE email IN (
    SELECT email
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
)
ORDER BY email, created_at;

This shows the actual rows inside each duplicate group.

That inspection matters because it helps you answer questions like:

  • Are these true duplicates or valid repeated entries?
  • Which row should be kept?
  • Is one row newer?
  • Is one row more complete?
  • Is one row already referenced elsewhere?
  • Are there formatting issues like trailing spaces or case differences?

A lot of duplicate-cleanup mistakes happen because teams delete before they inspect.

Step 3: Decide which row to keep

This is the most important business decision in duplicate cleanup.

You need a rule for which row stays and which rows go.

Common rules include:

  • keep the oldest row
  • keep the newest row
  • keep the row with the smallest ID
  • keep the row with the largest ID
  • keep the row with the most complete data
  • keep the row marked as active or trusted
  • keep the row that has downstream references

For example:

  • in user accounts, you might keep the earliest created record
  • in event ingestion, you might keep the first inserted row
  • in CRM cleanup, you might keep the row with the most non-null fields
  • in imports, you might keep the newest corrected version

There is no single universal rule. The correct choice depends on the table and the business logic.
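As a sketch of the "most complete data" rule, assuming a customers table with optional name and phone columns, you can rank rows by a count of non-null fields and break ties by age:

```sql
-- Sketch: keep the most complete row per email.
-- "Completeness" here is just the number of non-null optional fields;
-- the name and phone columns are assumptions for illustration.
WITH ranked_rows AS (
    SELECT
        id,
        email,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY
                (CASE WHEN name  IS NOT NULL THEN 1 ELSE 0 END
               + CASE WHEN phone IS NOT NULL THEN 1 ELSE 0 END) DESC,
                created_at ASC   -- tie-breaker: prefer the oldest row
        ) AS rn
    FROM customers
)
SELECT * FROM ranked_rows
WHERE rn = 1;
```

The same ORDER BY trick works for any scoring rule you can express in SQL, such as preferring rows with a trusted source flag.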

Step 4: Use ROW_NUMBER() to mark duplicates safely

Once you know how to define duplicates and which row to keep, the cleanest SQL pattern is usually ROW_NUMBER().

This lets you assign a ranking inside each duplicate group.

Example: keep the oldest row per email and mark the rest as duplicates.

SELECT
    id,
    email,
    created_at,
    ROW_NUMBER() OVER (
        PARTITION BY email
        ORDER BY created_at ASC
    ) AS rn
FROM customers;

What this does:

  • partitions rows by email
  • orders rows within each email group by oldest first
  • assigns rn = 1 to the row you want to keep
  • assigns rn > 1 to the rows you want to remove

That is the core duplicate-removal pattern in modern SQL.

Step 5: Preview the rows that would be deleted

Before running a delete, preview the rows. Always.

Using a CTE:

WITH ranked_rows AS (
    SELECT
        id,
        email,
        created_at,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at ASC
        ) AS rn
    FROM customers
)
SELECT *
FROM ranked_rows
WHERE rn > 1
ORDER BY email, created_at;

This gives you the exact rows that would be treated as duplicates.

That preview step is one of the safest habits you can build. It reduces the chance of deleting the wrong rows dramatically.

Step 6: Delete duplicates safely with a CTE

Once you have previewed the rows and confirmed the ranking logic is correct, you can delete the extras.

A common pattern looks like this:

WITH ranked_rows AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at ASC
        ) AS rn
    FROM customers
)
DELETE FROM customers
WHERE id IN (
    SELECT id
    FROM ranked_rows
    WHERE rn > 1
);

This keeps:

  • the first row in each email group

And deletes:

  • every extra row in the same duplicate group

This is one of the safest and most reusable SQL duplicate-removal patterns.

Example: keep the newest row instead of the oldest

Sometimes the correct rule is the opposite. If you want to keep the newest row instead, change the ordering:

WITH ranked_rows AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at DESC
        ) AS rn
    FROM customers
)
DELETE FROM customers
WHERE id IN (
    SELECT id
    FROM ranked_rows
    WHERE rn > 1
);

Now:

  • rn = 1 is the newest row
  • older duplicates are removed

That small change is very powerful. It lets the same pattern fit many different business rules.

Example: remove fully identical duplicate rows

Sometimes duplicates really are fully duplicated rows except for an identity column.

Suppose you have a table like this:

id | product_name | sku   | price
---|--------------|-------|------
1  | Keyboard     | KB100 | 49.99
2  | Keyboard     | KB100 | 49.99
3  | Mouse        | MS200 | 19.99

If the duplicate rule is:

  • same product_name
  • same sku
  • same price

you can partition by all three:

WITH ranked_rows AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY product_name, sku, price
            ORDER BY id ASC
        ) AS rn
    FROM products
)
DELETE FROM products
WHERE id IN (
    SELECT id
    FROM ranked_rows
    WHERE rn > 1
);

This keeps one copy of each exact business-identical row.

Example: find duplicates without deleting them

Sometimes your job is only to report duplicate rows, not remove them. That often happens in:

  • audit work
  • QA processes
  • dashboard checks
  • data quality reviews

In that case, you can use a query like:

WITH ranked_rows AS (
    SELECT
        id,
        email,
        created_at,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at ASC
        ) AS rn
    FROM customers
)
SELECT *
FROM ranked_rows
WHERE rn > 1;

This is useful for:

  • data quality dashboards
  • duplicate exception reports
  • cleanup backlogs
  • analyst workflows

Using COUNT(*) versus ROW_NUMBER()

These two approaches solve different parts of the problem.

GROUP BY + HAVING COUNT(*) > 1

Best for:

  • finding duplicate keys
  • summarizing duplicate groups
  • identifying where the problem exists

ROW_NUMBER()

Best for:

  • ranking rows inside each duplicate group
  • deciding which row stays
  • building the actual delete logic

In real workflows, you often use both:

  1. GROUP BY to discover duplicates
  2. ROW_NUMBER() to remove them safely

That combination is usually the most practical approach.

How to handle duplicates when rows are not exactly equal

This is where duplicate cleanup gets more realistic.

Imagine this data:

id | email          | name           | phone
---|----------------|----------------|-----------
1  | alice@test.com | Alice Smith    | 1234567890
2  | alice@test.com | Alice S. Smith | NULL

These are likely duplicates by email, but the rows are not equal.

In that case, you may not want to simply delete one row. You may want to:

  • merge useful values first
  • update the keeper row with more complete data
  • then delete the extra row

For example:

  1. choose the row to keep
  2. fill missing fields from the duplicate row if needed
  3. delete the extra row

That workflow is safer than blind deletion when duplicates contain partially useful data.
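A minimal sketch of that merge-then-delete workflow, using the two example rows above with row 1 as the keeper (subquery support in UPDATE varies by engine, so treat this as a pattern rather than portable syntax):

```sql
-- Sketch: merge row 2 into row 1, then delete row 2.
-- Step 1: fill any NULL fields on the keeper from the duplicate.
UPDATE customers
SET phone = COALESCE(phone, (SELECT phone FROM customers WHERE id = 2)),
    name  = COALESCE(name,  (SELECT name  FROM customers WHERE id = 2))
WHERE id = 1;

-- Step 2: remove the duplicate only after the merge succeeded.
DELETE FROM customers
WHERE id = 2;
```

Running both statements inside one transaction keeps the merge and the delete atomic.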

How to find duplicate rows with nulls or formatting differences

A big real-world challenge is that duplicates are often hidden by inconsistent formatting.

Examples:

  • Alice@test.com versus alice@test.com
  • trailing spaces
  • inconsistent phone number formatting
  • upper/lowercase product codes
  • extra punctuation in names

If the cleanup logic should treat those as duplicates, normalize the data in the duplicate-detection query.

Example using lowercase and trimming:

SELECT
    LOWER(TRIM(email)) AS normalized_email,
    COUNT(*) AS duplicate_count
FROM customers
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1;

This helps reveal duplicates that plain grouping would miss.

You can use the same normalization inside ROW_NUMBER() logic too:

WITH ranked_rows AS (
    SELECT
        id,
        email,
        ROW_NUMBER() OVER (
            PARTITION BY LOWER(TRIM(email))
            ORDER BY created_at ASC
        ) AS rn
    FROM customers
)
SELECT *
FROM ranked_rows
WHERE rn > 1;

This is extremely useful in messy imported data.

How to delete duplicates in a transaction

If you are doing duplicate cleanup in production, use a transaction whenever possible.

Basic pattern:

BEGIN;

WITH ranked_rows AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at ASC
        ) AS rn
    FROM customers
)
DELETE FROM customers
WHERE id IN (
    SELECT id
    FROM ranked_rows
    WHERE rn > 1
);

-- Inspect results here if your workflow allows it

COMMIT;

If something looks wrong before the commit, run ROLLBACK instead of COMMIT and the delete is undone.

That extra safety matters a lot in live systems.

Backup and safety checklist before deleting duplicates

Before you run any duplicate delete in a serious system, check these:

  • Do you know exactly which columns define a duplicate?
  • Have you previewed the rows to be deleted?
  • Have you decided which row should be kept?
  • Do downstream foreign keys reference these rows?
  • Do you need to merge data before deletion?
  • Are you running this in a transaction?
  • Do you have a backup or restore path if needed?
  • Have you tested the query on staging or a copy first?

That checklist sounds obvious, but it prevents many painful mistakes.

How to prevent duplicates from coming back

Cleaning duplicates once is not enough. You need to stop them from reappearing.

The best prevention strategies usually include these.

1. Add a unique constraint or unique index

If email should be unique, enforce it.

Example:

ALTER TABLE customers
ADD CONSTRAINT uq_customers_email UNIQUE (email);

Or with a unique index:

CREATE UNIQUE INDEX uq_customers_email
ON customers (email);

This is the strongest fix because it moves the rule into the database itself.

2. Normalize input values before inserting

If the system treats Alice@test.com and alice@test.com as the same user, normalize at insert time.

That may include:

  • lowercasing emails
  • trimming spaces
  • standardizing phone formats
  • cleaning SKUs

A lot of “duplicate” problems are really input normalization problems.
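As a sketch, normalization can happen both at insert time and as a one-time cleanup of existing rows:

```sql
-- Sketch: store the normalized form, not the raw input.
INSERT INTO customers (email, name)
VALUES (LOWER(TRIM('  Alice@Test.com ')), 'Alice Smith');

-- One-time cleanup of values that were stored unnormalized:
UPDATE customers
SET email = LOWER(TRIM(email));
```

Doing this in the application layer as well is even better, but normalizing in SQL guarantees every write path follows the same rule.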

3. Use upsert logic instead of blind inserts

In many systems, duplicates happen because the app always inserts and never checks whether the row already exists.

A better pattern is:

  • insert if new
  • update if existing

The exact syntax varies by database engine, but the principle is the same.
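For example, PostgreSQL and SQLite support an ON CONFLICT clause for this (MySQL uses INSERT ... ON DUPLICATE KEY UPDATE, and SQL Server uses MERGE), assuming a unique constraint already exists on email:

```sql
-- PostgreSQL-style upsert: insert if new, update if the email exists.
-- Requires a unique constraint or index on customers(email).
INSERT INTO customers (email, name)
VALUES ('alice@test.com', 'Alice Smith')
ON CONFLICT (email)
DO UPDATE SET name = EXCLUDED.name;
```

Note that the upsert only works because the database enforces uniqueness; this is another reason the constraint from strategy 1 comes first.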

4. Fix import processes

CSV imports, ETL jobs, and sync jobs are common duplicate sources.

Improve them by:

  • validating keys before insert
  • staging data first
  • deduplicating in a temp table
  • rejecting bad rows
  • or matching against existing business keys before loading

5. Add data quality checks

For important tables, consider recurring checks that report duplicate keys before they become a larger problem.

Example:

SELECT email, COUNT(*)
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

That query can become:

  • a dashboard check
  • a scheduled audit
  • or a pipeline validation step

Common duplicate-removal mistakes

Here are the mistakes that cause the most trouble.

Deleting before inspecting

This is the biggest one. Always preview the rows first.

Using the wrong duplicate definition

Rows may look duplicated by eye but not be true duplicates by business logic.

Keeping the wrong row

You need a clear rule for which record survives.

Ignoring foreign key relationships

Deleting a duplicate row that other tables reference can break data integrity or require merge logic first.
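A sketch of the merge step, assuming a hypothetical orders table whose customer_id references customers: re-point the child rows at the keeper before deleting the duplicate.

```sql
-- Sketch: move foreign key references to the surviving row first.
UPDATE orders
SET customer_id = 1      -- the keeper
WHERE customer_id = 2;   -- the duplicate about to be removed

DELETE FROM customers
WHERE id = 2;
```

Without the UPDATE, the DELETE would either fail on the foreign key constraint or silently orphan the order rows, depending on how the constraint is configured.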

Forgetting normalization issues

Uppercase/lowercase or trailing spaces often hide the true duplicate pattern.

Cleaning the data but not fixing the cause

If you do not add prevention, the same bad data often comes back.

Practical duplicate-cleanup workflow

If you want one reliable workflow to follow every time, use this:

Step 1

Decide which columns define a duplicate.

Step 2

Run a GROUP BY and HAVING COUNT(*) > 1 query to find duplicate groups.

Step 3

Inspect the real rows inside those groups.

Step 4

Choose the rule for which row stays.

Step 5

Use ROW_NUMBER() to rank rows inside each duplicate group.

Step 6

Preview rows where rn > 1.

Step 7

Delete only those extra rows, preferably inside a transaction.

Step 8

Add the right unique constraint or prevention logic.

That workflow is simple, but it is also how a lot of real production cleanup gets done safely.


Final thoughts

Finding and removing duplicates in SQL is not just about writing one clever delete statement.

It is really a data quality workflow.

The safest and most effective approach is to:

  • define duplicates clearly
  • inspect them carefully
  • rank rows intentionally
  • delete only the extras
  • and then enforce the right uniqueness rule afterward

If you remember only one pattern from this article, remember this one:

  • use GROUP BY and HAVING to find duplicate groups
  • use ROW_NUMBER() to decide which rows to keep
  • and use constraints to make sure the problem does not return

That pattern will solve a very large percentage of real duplicate-cleanup problems in SQL.
