How to Handle DUPLICATE VALUES in a Dataset? 🤔


When working with datasets, one common issue data analysts and machine learning practitioners face is duplicate values. If not handled properly, duplicates can lead to misleading analysis, inaccurate predictions, and inefficient storage.

What Are Duplicate Values in a Dataset?

Duplicate values occur when one or more rows in a dataset have identical values across all or some columns. These duplicates may arise due to:
✅ Data collection errors
✅ Merging multiple datasets (see the sketch below)
✅ Web scraping issues
✅ User input mistakes (the most common cause)
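
For instance, here is a minimal sketch of how merging introduces duplicates. The two DataFrames and their contents are made up purely for illustration:

import pandas as pd

# Two overlapping extracts of the same customer table (hypothetical data)
jan = pd.DataFrame({'ID': [101, 102], 'Name': ['Alice', 'Bob']})
feb = pd.DataFrame({'ID': [102, 103], 'Name': ['Bob', 'Charlie']})

# Concatenating them re-introduces the overlapping row
merged = pd.concat([jan, feb], ignore_index=True)
print(merged.duplicated().sum())  # 1 -- Bob now appears twice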

What Do Duplicate Values Look Like?

Consider the following dataset of customer purchases:

ID    Name     Product   Price
101   Alice    Laptop    1200
102   Bob      Phone     800
101   Alice    Laptop    1200
103   Charlie  Tablet    500
104   David    Laptop    1200

Here, the third row is a duplicate of the first row.

What Happens If Duplicates Are Not Handled?

Ignoring duplicate values can lead to:
❌ Incorrect statistical analysis: skewed mean, median, and mode
❌ Biased machine learning models: overfitting and inaccurate predictions
❌ Redundant storage usage: wasted memory and computation power

Example: Skewed Sales Data

Suppose we want to calculate total sales revenue:

import pandas as pd

# The third entry (index 2) repeats the first row exactly
data = {'ID': [101, 102, 101, 103, 104],
        'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'David'],
        'Product': ['Laptop', 'Phone', 'Laptop', 'Tablet', 'Laptop'],
        'Price': [1200, 800, 1200, 500, 1200]}
df = pd.DataFrame(data)

print("Total Sales Revenue:", df['Price'].sum())

Output:

Total Sales Revenue: 4900

🔴 Incorrect! The actual revenue should be 3700, but due to the duplicate row, we got an inflated result.

How to Handle Duplicate Values?

1. Identifying Duplicates

Use duplicated() to check duplicate rows:

print(df.duplicated())

Output:

0    False
1    False
2     True   # duplicate row
3    False
4    False
dtype: bool
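
Two follow-ups are often handy with the same df: summing the boolean mask counts the duplicates, and using it as a filter shows them.

# Count duplicate rows (True is treated as 1 when summing)
print(df.duplicated().sum())   # 1

# Inspect the duplicated rows themselves
print(df[df.duplicated()])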

2. Removing Duplicate Rows

To remove duplicates and keep only the first occurrence:

df_cleaned = df.drop_duplicates()  
print(df_cleaned)
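
With the duplicate gone, the revenue calculation from the earlier example comes out right:

print("Total Sales Revenue:", df_cleaned['Price'].sum())  # 3700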

To remove duplicates based on specific columns:

df_cleaned = df.drop_duplicates(subset=['Name', 'Product'])
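
The keep parameter controls which occurrence survives:

# keep='first' (default) keeps the first copy,
# keep='last' keeps the last copy,
# keep=False drops every copy of a duplicated row
df_strict = df.drop_duplicates(keep=False)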

3. Handling Near-Duplicates

Sometimes, duplicates are not exactly identical but still refer to the same record. For example, the same customer may appear once as "John" and again as "Jon".

We can detect and merge such near-duplicates using fuzzy matching techniques, e.g. with the fuzzywuzzy library:

from fuzzywuzzy import fuzz

fuzz.ratio("John", "Jon")  # Output: 86 (high similarity, out of 100)
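
As a minimal sketch (the names and the threshold of 85 are arbitrary choices for illustration), you could compare every pair of names and flag the suspiciously similar ones:

from itertools import combinations
from fuzzywuzzy import fuzz

names = ["John", "Jon", "Alice", "Charlie"]

# Flag every pair whose similarity score clears the threshold
for a, b in combinations(names, 2):
    score = fuzz.ratio(a, b)
    if score >= 85:
        print(f"Possible duplicate: {a!r} ~ {b!r} (score {score})")

(Note: fuzzywuzzy is now maintained under the name thefuzz, and rapidfuzz offers a faster, largely compatible fuzz.ratio.)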

Final Thoughts

✅ Always check for duplicates before performing analysis
✅ Choose the right approach (drop, merge, or fix near-duplicates)
✅ Use domain knowledge to decide when a duplicate is meaningful

By handling duplicates efficiently, you ensure that your data remains accurate, reliable, and optimized for analysis and machine learning models. 🚀

