How to Handle DUPLICATE VALUES in a Dataset? 🤔


When working with datasets, one common issue data analysts and machine learning practitioners face is duplicate values. If not handled properly, duplicates can lead to misleading analysis, inaccurate predictions, and inefficient storage.

What Are Duplicate Values in a Dataset?

Duplicate values occur when one or more rows in a dataset have identical values across all or some columns. These duplicates may arise due to:
✅ Data collection errors
✅ Merging multiple datasets (see the sketch below)
✅ Web scraping issues
✅ User input mistakes (the most common cause)
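
For instance, here is a minimal sketch of how merging introduces duplicates. The two DataFrames and their contents are made up purely for illustration:

import pandas as pd

# Two overlapping extracts of the same customer table (hypothetical data)
jan = pd.DataFrame({'ID': [101, 102], 'Name': ['Alice', 'Bob']})
feb = pd.DataFrame({'ID': [102, 103], 'Name': ['Bob', 'Charlie']})

# Concatenating them re-introduces the overlapping row
merged = pd.concat([jan, feb], ignore_index=True)
print(merged.duplicated().sum())  # 1 -- Bob now appears twice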

What Do Duplicate Values Look Like?

Consider the following dataset of customer purchases:

ID    Name     Product   Price
101   Alice    Laptop    1200
102   Bob      Phone     800
101   Alice    Laptop    1200
103   Charlie  Tablet    500
104   David    Laptop    1200

Here, the third row is a duplicate of the first row.

What Happens If Duplicates Are Not Handled?

Ignoring duplicate values can lead to:
❌ Incorrect statistical analysis: skewed mean, median, and mode
❌ Biased machine learning models: overfitting and inaccurate predictions
❌ Redundant storage usage: wasted memory and computation power

Example: Skewed Sales Data

Suppose we want to calculate total sales revenue:

import pandas as pd

# The third entry (index 2) repeats the first row exactly
data = {'ID': [101, 102, 101, 103, 104],
        'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'David'],
        'Product': ['Laptop', 'Phone', 'Laptop', 'Tablet', 'Laptop'],
        'Price': [1200, 800, 1200, 500, 1200]}
df = pd.DataFrame(data)

print("Total Sales Revenue:", df['Price'].sum())

Output:

Total Sales Revenue: 4900

🔴 Incorrect! The actual revenue should be 3700, but due to the duplicate row, we got an inflated result.

How to Handle Duplicate Values?

1. Identifying Duplicates

Use duplicated() to check duplicate rows:

print(df.duplicated())

Output:

0    False
1    False
2     True   # duplicate row
3    False
4    False
dtype: bool
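
Two follow-ups are often handy with the same df: summing the boolean mask counts the duplicates, and using it as a filter shows them.

# Count duplicate rows (True is treated as 1 when summing)
print(df.duplicated().sum())   # 1

# Inspect the duplicated rows themselves
print(df[df.duplicated()])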

2. Removing Duplicate Rows

To remove duplicates and keep only the first occurrence:

df_cleaned = df.drop_duplicates()  
print(df_cleaned)
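
With the duplicate gone, the revenue calculation from the earlier example comes out right:

print("Total Sales Revenue:", df_cleaned['Price'].sum())  # 3700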

To remove duplicates based on specific columns:

df_cleaned = df.drop_duplicates(subset=['Name', 'Product'])
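
The keep parameter controls which occurrence survives:

# keep='first' (default) keeps the first copy,
# keep='last' keeps the last copy,
# keep=False drops every copy of a duplicated row
df_strict = df.drop_duplicates(keep=False)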

3. Handling Near-Duplicates

Sometimes, duplicates are not exactly identical but still refer to the same record. For example, the same customer may appear once as "John" and again as "Jon".

We can detect and merge such near-duplicates using fuzzy matching techniques, e.g. with the fuzzywuzzy library:

from fuzzywuzzy import fuzz

fuzz.ratio("John", "Jon")  # Output: 86 (high similarity, out of 100)
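
As a minimal sketch (the names and the threshold of 85 are arbitrary choices for illustration), you could compare every pair of names and flag the suspiciously similar ones:

from itertools import combinations
from fuzzywuzzy import fuzz

names = ["John", "Jon", "Alice", "Charlie"]

# Flag every pair whose similarity score clears the threshold
for a, b in combinations(names, 2):
    score = fuzz.ratio(a, b)
    if score >= 85:
        print(f"Possible duplicate: {a!r} ~ {b!r} (score {score})")

(Note: fuzzywuzzy is now maintained under the name thefuzz, and rapidfuzz offers a faster, largely compatible fuzz.ratio.)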

Final Thoughts

✅ Always check for duplicates before performing analysis
✅ Choose the right approach (drop, merge, or fix near-duplicates)
✅ Use domain knowledge to decide when a duplicate is meaningful

By handling duplicates efficiently, you ensure that your data remains accurate, reliable, and optimized for analysis and machine learning models. 🚀

