How to Find and Remove Duplicates in SQL
Level: intermediate · ~18 min read · Intent: informational
Audience: backend developers, data analysts, data engineers, technical teams, students, operations teams
Prerequisites
- basic familiarity with databases
- basic understanding of SELECT, WHERE, GROUP BY, and DELETE
Key takeaways
- The safest way to remove duplicates in SQL is to identify exactly what counts as a duplicate first, inspect the affected rows, and only then delete with a ranking pattern such as ROW_NUMBER.
- Most duplicate problems should be fixed in two stages: clean the existing bad data, then prevent future duplicates with proper unique constraints, validation, and import logic.
Duplicate data is one of the most common database problems in real systems.
It shows up everywhere:
- imported spreadsheets
- user tables with repeated emails
- CRM records with slightly different names
- product catalogs with repeated SKUs
- event logs accidentally inserted twice
- analytics tables with double-counted rows
- and operational systems where uniqueness was assumed but never enforced
That is why learning how to find and remove duplicates in SQL is one of the most useful practical database skills.
But there is an important catch:
Removing duplicates is easy to do badly.
A careless delete can remove the wrong rows, break downstream reporting, or destroy data you later realize mattered. That is why the right approach is not just:
- find duplicate rows
- delete them
The right approach is:
- define what “duplicate” actually means
- inspect the duplicate groups
- decide which row should be kept
- remove only the unwanted copies
- prevent the duplicates from returning
This guide walks through that full process in a way that is safe, practical, and reusable.
Why duplicate data happens in the first place
Before writing any SQL, it helps to understand why duplicates happen.
Common causes include:
- missing unique constraints
- repeated imports
- manual data entry
- race conditions in application code
- weak matching rules during merges
- inconsistent formatting like upper/lowercase or extra spaces
- sync jobs replaying the same records
- and event ingestion pipelines that are not idempotent
That matters because duplicate cleanup is only half the job.
If you delete duplicates without fixing the source of the problem, they usually come back.
What counts as a duplicate in SQL?
This is the first decision you must get right.
A duplicate is not always two rows that are completely identical in every column.
Sometimes duplicates mean:
- same email address
- same customer name plus phone number
- same order number
- same SKU
- same user ID plus event timestamp
- same business key even if other columns differ
For example, these could all be duplicates in a customer table even though the rows are not fully identical:
| id | email | name | created_at |
|---|---|---|---|
| 1 | alice@test.com | Alice Smith | 2026-04-01 09:00:00 |
| 2 | alice@test.com | Alice Smith | 2026-04-03 11:20:00 |
The rows are not identical because:
- the IDs differ
- the timestamps differ
But if email is supposed to be unique, then they are duplicates for business purposes.
That is why the most important question is:
Which columns define uniqueness in this table?
Once you know that, your duplicate logic becomes much more reliable.
Step 1: Find duplicate groups with GROUP BY and HAVING
The simplest way to find duplicates is to group by the columns that should be unique and then look for groups that appear more than once.
Suppose you have a customers table and email should be unique.
SELECT
email,
COUNT(*) AS duplicate_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
This query tells you:
- which email values appear more than once
- how many times they appear
That is the easiest starting point for duplicate detection.
Example: duplicates by two columns
If uniqueness should be based on both first name and phone number, use both in the grouping:
SELECT
first_name,
phone_number,
COUNT(*) AS duplicate_count
FROM customers
GROUP BY first_name, phone_number
HAVING COUNT(*) > 1;
This pattern is useful because it gives you a quick map of where duplicates exist before you touch the actual rows.
Step 2: Inspect the actual duplicate rows
After you find the duplicate values, the next step is to inspect the real rows. Do not skip this.
Suppose you already know duplicate emails exist. You can inspect them like this:
SELECT *
FROM customers
WHERE email IN (
SELECT email
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
)
ORDER BY email, created_at;
This shows the actual rows inside each duplicate group.
That inspection matters because it helps you answer questions like:
- Are these true duplicates or valid repeated entries?
- Which row should be kept?
- Is one row newer?
- Is one row more complete?
- Is one row already referenced elsewhere?
- Are there formatting issues like trailing spaces or case differences?
A lot of duplicate-cleanup mistakes happen because teams delete before they inspect.
Step 3: Decide which row to keep
This is the most important business decision in duplicate cleanup.
You need a rule for which row stays and which rows go.
Common rules include:
- keep the oldest row
- keep the newest row
- keep the row with the smallest ID
- keep the row with the largest ID
- keep the row with the most complete data
- keep the row marked as active or trusted
- keep the row that has downstream references
For example:
- in user accounts, you might keep the earliest created record
- in event ingestion, you might keep the first inserted row
- in CRM cleanup, you might keep the row with the most non-null fields
- in imports, you might keep the newest corrected version
There is no single universal rule. The correct choice depends on the table and the business logic.
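The "most complete data" rule can be expressed directly in the ranking itself. Below is a minimal sketch using Python's built-in sqlite3 module as a sandbox, with an illustrative customers table: each row is scored by how many of its optional fields are non-null, and the best-scoring row ranks first.

```python
import sqlite3

# Sketch: rank duplicates so the row with the fewest NULLs wins (rn = 1).
# Table, columns, and data are illustrative, not from a real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT, phone TEXT);
INSERT INTO customers VALUES
  (1, 'alice@test.com', 'Alice Smith', NULL),
  (2, 'alice@test.com', 'Alice Smith', '1234567890'),
  (3, 'bob@test.com',   'Bob Jones',   NULL);
""")

# In SQLite, (expr IS NOT NULL) evaluates to 1 or 0, so summing the checks
# gives a simple completeness score; id ASC breaks ties deterministically.
rows = conn.execute("""
SELECT id, email,
       ROW_NUMBER() OVER (
           PARTITION BY email
           ORDER BY (name IS NOT NULL) + (phone IS NOT NULL) DESC, id ASC
       ) AS rn
FROM customers
""").fetchall()

keepers = [r[0] for r in rows if r[2] == 1]
print(keepers)
```

Here row 2 outranks row 1 inside the alice@test.com group because it has a phone number. The 0/1 boolean trick is SQLite-flavored; in other engines you would use a CASE expression for the same score.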
Step 4: Use ROW_NUMBER() to mark duplicates safely
Once you know how to define duplicates and which row to keep, the cleanest SQL pattern is usually ROW_NUMBER().
This lets you assign a ranking inside each duplicate group.
Example: keep the oldest row per email and mark the rest as duplicates.
SELECT
id,
email,
created_at,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at ASC
) AS rn
FROM customers;
What this does:
- partitions rows by email
- orders rows within each email group by oldest first
- assigns rn = 1 to the row you want to keep
- assigns rn > 1 to the rows you want to remove
That is the core duplicate-removal pattern in modern SQL.
Step 5: Preview the rows that would be deleted
Before running a delete, preview the rows. Always.
Using a CTE:
WITH ranked_rows AS (
SELECT
id,
email,
created_at,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at ASC
) AS rn
FROM customers
)
SELECT *
FROM ranked_rows
WHERE rn > 1
ORDER BY email, created_at;
This gives you the exact rows that would be treated as duplicates.
That preview step is one of the safest habits you can build. It reduces the chance of deleting the wrong rows dramatically.
Step 6: Delete duplicates safely with a CTE
Once you have previewed the rows and confirmed the ranking logic is correct, you can delete the extras.
A common pattern looks like this:
WITH ranked_rows AS (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at ASC
) AS rn
FROM customers
)
DELETE FROM customers
WHERE id IN (
SELECT id
FROM ranked_rows
WHERE rn > 1
);
This keeps:
- the first row in each email group
And deletes:
- every extra row in the same duplicate group
This is one of the safest and most reusable SQL duplicate-removal patterns.
Example: keep the newest row instead of the oldest
Sometimes the correct rule is the opposite. If you want to keep the newest row instead, change the ordering:
WITH ranked_rows AS (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at DESC
) AS rn
FROM customers
)
DELETE FROM customers
WHERE id IN (
SELECT id
FROM ranked_rows
WHERE rn > 1
);
Now:
- rn = 1 is the newest row
- older duplicates are removed
That small change is very powerful. It lets the same pattern fit many different business rules.
Example: remove fully identical duplicate rows
Sometimes duplicates really are fully duplicated rows except for an identity column.
Suppose you have a table like this:
| id | product_name | sku | price |
|---|---|---|---|
| 1 | Keyboard | KB100 | 49.99 |
| 2 | Keyboard | KB100 | 49.99 |
| 3 | Mouse | MS200 | 19.99 |
If the duplicate rule is:
- same product_name
- same sku
- same price
you can partition by all three:
WITH ranked_rows AS (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY product_name, sku, price
ORDER BY id ASC
) AS rn
FROM products
)
DELETE FROM products
WHERE id IN (
SELECT id
FROM ranked_rows
WHERE rn > 1
);
This keeps one copy of each exact business-identical row.
Example: find duplicates without deleting them
Sometimes your job is only to report duplicate rows, not remove them. That often happens in:
- audit work
- QA processes
- dashboard checks
- data quality reviews
In that case, you can use a query like:
WITH ranked_rows AS (
SELECT
id,
email,
created_at,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at ASC
) AS rn
FROM customers
)
SELECT *
FROM ranked_rows
WHERE rn > 1;
This is useful for:
- data quality dashboards
- duplicate exception reports
- cleanup backlogs
- analyst workflows
Using COUNT(*) versus ROW_NUMBER()
These two approaches solve different parts of the problem.
GROUP BY + HAVING COUNT(*) > 1
Best for:
- finding duplicate keys
- summarizing duplicate groups
- identifying where the problem exists
ROW_NUMBER()
Best for:
- ranking rows inside each duplicate group
- deciding which row stays
- building the actual delete logic
In real workflows, you often use both:
- GROUP BY to discover duplicates
- ROW_NUMBER() to remove them safely
That combination is usually the most practical approach.
How to handle duplicates when rows are not exactly equal
This is where duplicate cleanup gets more realistic.
Imagine this data:
| id | email | name | phone |
|---|---|---|---|
| 1 | alice@test.com | Alice Smith | 1234567890 |
| 2 | alice@test.com | Alice S. Smith | NULL |
These are likely duplicates by email, but the rows are not equal.
In that case, you may not want to simply delete one row. You may want to:
- merge useful values first
- update the keeper row with more complete data
- then delete the extra row
For example:
- choose the row to keep
- fill missing fields from the duplicate row if needed
- delete the extra row
That workflow is safer than blind deletion when duplicates contain partially useful data.
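A minimal sketch of that merge-then-delete flow, again using sqlite3 as a sandbox. The keeper and duplicate ids are hard-coded here for illustration; in a real cleanup they would come from your ranking query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT, phone TEXT);
INSERT INTO customers VALUES
  (1, 'alice@test.com', 'Alice Smith', NULL),
  (2, 'alice@test.com', 'Alice S. Smith', '1234567890');
""")

keeper_id, duplicate_id = 1, 2  # normally chosen by your keep/remove rule

# COALESCE keeps the keeper's existing non-NULL values and only fills in
# its missing fields from the duplicate row; then the duplicate is removed.
conn.execute("""
UPDATE customers
SET phone = COALESCE(phone, (SELECT phone FROM customers WHERE id = ?)),
    name  = COALESCE(name,  (SELECT name  FROM customers WHERE id = ?))
WHERE id = ?
""", (duplicate_id, duplicate_id, keeper_id))
conn.execute("DELETE FROM customers WHERE id = ?", (duplicate_id,))

remaining = conn.execute("SELECT id, phone FROM customers").fetchall()
print(remaining)
```

After the merge, the surviving row has the phone number that only the duplicate carried, and nothing useful was lost in the delete.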
How to find duplicate rows with nulls or formatting differences
A big real-world challenge is that duplicates are often hidden by inconsistent formatting.
Examples:
- Alice@test.com versus alice@test.com
- trailing spaces
- inconsistent phone number formatting
- upper/lowercase product codes
- extra punctuation in names
If the cleanup logic should treat those as duplicates, normalize the data in the duplicate-detection query.
Example using lowercase and trimming:
SELECT
LOWER(TRIM(email)) AS normalized_email,
COUNT(*) AS duplicate_count
FROM customers
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1;
This helps reveal duplicates that plain grouping would miss.
You can use the same normalization inside ROW_NUMBER() logic too:
WITH ranked_rows AS (
SELECT
id,
email,
ROW_NUMBER() OVER (
PARTITION BY LOWER(TRIM(email))
ORDER BY created_at ASC
) AS rn
FROM customers
)
SELECT *
FROM ranked_rows
WHERE rn > 1;
This is extremely useful in messy imported data.
How to delete duplicates in a transaction
If you are doing duplicate cleanup in production, use a transaction whenever possible.
Basic pattern:
BEGIN;
WITH ranked_rows AS (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at ASC
) AS rn
FROM customers
)
DELETE FROM customers
WHERE id IN (
SELECT id
FROM ranked_rows
WHERE rn > 1
);
-- Inspect results here if your workflow allows it
COMMIT;
If something looks wrong before commit, run ROLLBACK instead of COMMIT and nothing is lost.
That extra safety matters a lot in live systems.
Backup and safety checklist before deleting duplicates
Before you run any duplicate delete in a serious system, check these:
- Do you know exactly which columns define a duplicate?
- Have you previewed the rows to be deleted?
- Have you decided which row should be kept?
- Do downstream foreign keys reference these rows?
- Do you need to merge data before deletion?
- Are you running this in a transaction?
- Do you have a backup or restore path if needed?
- Have you tested the query on staging or a copy first?
That checklist sounds obvious, but it prevents many painful mistakes.
How to prevent duplicates from coming back
Cleaning duplicates once is not enough. You need to stop them from reappearing.
The best prevention strategies usually include these.
1. Add a unique constraint or unique index
If email should be unique, enforce it.
Example:
ALTER TABLE customers
ADD CONSTRAINT uq_customers_email UNIQUE (email);
Or with a unique index:
CREATE UNIQUE INDEX uq_customers_email
ON customers (email);
This is the strongest fix because it moves the rule into the database itself.
2. Normalize input values before inserting
If the system treats Alice@test.com and alice@test.com as the same user, normalize at insert time.
That may include:
- lowercasing emails
- trimming spaces
- standardizing phone formats
- cleaning SKUs
A lot of “duplicate” problems are really input normalization problems.
3. Use upsert logic instead of blind inserts
In many systems, duplicates happen because the app always inserts and never checks whether the row already exists.
A better pattern is:
- insert if new
- update if existing
The exact syntax varies by database engine, but the principle is the same.
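As one concrete variant: SQLite and PostgreSQL support an INSERT ... ON CONFLICT clause (MySQL uses ON DUPLICATE KEY UPDATE, SQL Server uses MERGE). A sketch with sqlite3 and an illustrative customers table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT PRIMARY KEY, name TEXT)")

# Because email is the unique key, a repeat insert becomes an update
# of the existing row instead of a second row.
upsert = """
INSERT INTO customers (email, name) VALUES (?, ?)
ON CONFLICT (email) DO UPDATE SET name = excluded.name
"""
conn.execute(upsert, ("alice@test.com", "Alice"))
conn.execute(upsert, ("alice@test.com", "Alice Smith"))  # same key: updates

rows = conn.execute("SELECT email, name FROM customers").fetchall()
print(rows)
```

Note that the upsert only works because the unique constraint exists; the constraint and the insert logic are two halves of the same fix.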
4. Fix import processes
CSV imports, ETL jobs, and sync jobs are common duplicate sources.
Improve them by:
- validating keys before insert
- staging data first
- deduplicating in a temp table
- rejecting bad rows
- or matching against existing business keys before loading
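The staging approach above can be sketched like this, with sqlite3 and illustrative tables: raw rows land in an unconstrained staging table, get deduplicated by business key, and only keys not already present in the target are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (email TEXT PRIMARY KEY, name TEXT);
CREATE TABLE staging   (email TEXT, name TEXT);  -- no constraints: raw import lands here
INSERT INTO customers VALUES ('alice@test.com', 'Alice');
INSERT INTO staging VALUES
  ('alice@test.com', 'Alice'),  -- already in the target table
  ('bob@test.com',   'Bob'),
  ('bob@test.com',   'Bob');    -- duplicated inside the import file
""")

# Deduplicate within staging (GROUP BY email) and skip any row whose
# business key already exists in the target table.
conn.execute("""
INSERT INTO customers (email, name)
SELECT email, MIN(name)
FROM staging
WHERE email NOT IN (SELECT email FROM customers)
GROUP BY email
""")

loaded = conn.execute("SELECT email FROM customers ORDER BY email").fetchall()
print(loaded)
```

Only one bob row makes it through, and the existing alice row is left untouched, so the load cannot introduce duplicates no matter how messy the file is.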
5. Add data quality checks
For important tables, consider recurring checks that report duplicate keys before they become a larger problem.
Example:
SELECT email, COUNT(*)
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
That query can become:
- a dashboard check
- a scheduled audit
- or a pipeline validation step
Common duplicate-removal mistakes
Here are the mistakes that cause the most trouble.
Deleting before inspecting
This is the biggest one. Always preview the rows first.
Using the wrong duplicate definition
Rows may look duplicated by eye but not be true duplicates by business logic.
Keeping the wrong row
You need a clear rule for which record survives.
Ignoring foreign key relationships
Deleting a duplicate row that other tables reference can break data integrity or require merge logic first.
Forgetting normalization issues
Uppercase/lowercase or trailing spaces often hide the true duplicate pattern.
Cleaning the data but not fixing the cause
If you do not add prevention, the same bad data often comes back.
Practical duplicate-cleanup workflow
If you want one reliable workflow to follow every time, use this:
Step 1
Decide which columns define a duplicate.
Step 2
Run a GROUP BY and HAVING COUNT(*) > 1 query to find duplicate groups.
Step 3
Inspect the real rows inside those groups.
Step 4
Choose the rule for which row stays.
Step 5
Use ROW_NUMBER() to rank rows inside each duplicate group.
Step 6
Preview rows where rn > 1.
Step 7
Delete only those extra rows, preferably inside a transaction.
Step 8
Add the right unique constraint or prevention logic.
That workflow is simple, but it is also how a lot of real production cleanup gets done safely.
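The workflow, from finding duplicate groups through enforcing the constraint, can be sketched end to end. This uses Python's sqlite3 purely as a convenient sandbox; the table and data are the illustrative ones from earlier sections.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT);
INSERT INTO customers VALUES
  (1, 'alice@test.com', '2026-04-01 09:00:00'),
  (2, 'alice@test.com', '2026-04-03 11:20:00'),
  (3, 'bob@test.com',   '2026-04-02 10:00:00');
""")

# Step 2: find duplicate groups before touching anything.
dupes = conn.execute("""
SELECT email, COUNT(*) FROM customers
GROUP BY email HAVING COUNT(*) > 1
""").fetchall()

# Steps 5-7: rank within each group, keep the oldest row, delete the rest.
conn.execute("""
WITH ranked_rows AS (
    SELECT id, ROW_NUMBER() OVER (
        PARTITION BY email ORDER BY created_at ASC
    ) AS rn
    FROM customers
)
DELETE FROM customers
WHERE id IN (SELECT id FROM ranked_rows WHERE rn > 1)
""")

# Step 8: enforce uniqueness so the duplicates cannot come back.
conn.execute("CREATE UNIQUE INDEX uq_customers_email ON customers (email)")

survivors = [r[0] for r in conn.execute("SELECT id FROM customers ORDER BY id")]
print(dupes, survivors)
```

The newer alice row is deleted, the oldest one and the bob row survive, and any future attempt to insert a second alice@test.com row now fails at the database level.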
FAQ
What is the easiest way to find duplicates in SQL?
The easiest way is usually to group by the columns that should be unique and use HAVING COUNT(*) > 1. That quickly shows which values appear more than once.
What is the safest way to remove duplicates in SQL?
The safest approach is to use ROW_NUMBER in a CTE or subquery to mark one row to keep and the extra rows to remove, then inspect the result before running DELETE.
How do you decide which duplicate row to keep?
You should choose based on business rules, such as keeping the newest row, the oldest row, the row with the most complete data, or the row with the trusted source flag.
How do you stop duplicates from coming back?
After cleanup, add the right unique constraint or index, improve import validation, and make sure inserts or upserts follow the same uniqueness rules.
Final thoughts
Finding and removing duplicates in SQL is not just about writing one clever delete statement.
It is really a data quality workflow.
The safest and most effective approach is to:
- define duplicates clearly
- inspect them carefully
- rank rows intentionally
- delete only the extras
- and then enforce the right uniqueness rule afterward
If you remember only one pattern from this article, remember this one:
- use GROUP BY and HAVING to find duplicate groups
- use ROW_NUMBER() to decide which rows to keep
- use constraints to make sure the problem does not return
That pattern will solve a very large percentage of real duplicate-cleanup problems in SQL.