How to Handle Outliers 📊 in a Dataset (Machine Learning) 🤖

4 min read · Feb 11, 2025

What is an Outlier?

An outlier is a data point that significantly deviates from other observations in a dataset. It can be much higher or lower than the majority of data points and can distort statistical analyses and machine learning models.

What Do Outliers Look Like?

Outliers can be visualized in different ways:

In a histogram: A bar far from the rest of the distribution.

In a scatter plot: A point lying far from the cluster of data.

In a box plot: A point outside the whiskers.

Why Do Outliers Occur?

Outliers can occur due to several reasons:

Measurement errors: Incorrect data entry, faulty sensors, or incorrect recording.
Natural variation: Some real-world data naturally have extreme values.
Data processing errors: Issues like merging datasets incorrectly or incorrect scaling.
Intentional anomalies: Fraudulent transactions, cyber attacks, etc.

Why Do We Need to Handle Outliers?

Handling outliers is crucial because they can:

Distort statistical measures: Mean and standard deviation are highly sensitive to outliers.
Impact machine learning models: Models that minimize squared error, such as linear regression, can be pulled toward extreme values, which reduces accuracy on typical data.
Affect decision-making: Biased results lead to incorrect conclusions.
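
To make the first point concrete, here is a minimal sketch that compares the mean, standard deviation, and median of a small sample with and without two extreme values. The numbers are illustrative only.

```python
import numpy as np

# Eight typical values, then the same values plus two extremes (120 and 125).
typical = np.array([10, 12, 15, 18, 14, 13, 17, 16])
with_outliers = np.append(typical, [120, 125])

for name, data in [("typical", typical), ("with outliers", with_outliers)]:
    print(
        f"{name:>14}: mean={data.mean():.2f}  "
        f"std={data.std():.2f}  median={np.median(data):.2f}"
    )
```

The mean jumps from 14.38 to 36.00, while the median only moves from 14.50 to 15.50. This is why the median is usually described as a robust statistic.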

How to Detect Outliers?

Several methods help in detecting outliers:

1. Visual Methods

Box plots: Outliers appear outside the whiskers.
Scatter plots: Isolated points indicate potential outliers.
Histograms: Isolated bars far from the main body of the distribution.
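
The three plots can be sketched with matplotlib as follows. This is one possible layout, assuming matplotlib is installed. The sample values and variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

values = np.array([10, 12, 15, 18, 120, 14, 13, 17, 16, 125])

fig, (ax_box, ax_hist, ax_scatter) = plt.subplots(1, 3, figsize=(12, 3))

box = ax_box.boxplot(values)
ax_box.set_title("Box plot")

ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

ax_scatter.scatter(np.arange(len(values)), values)
ax_scatter.set_title("Scatter plot (index vs. value)")

# Points drawn beyond the whiskers ("fliers") are the box plot's outlier candidates.
fliers = sorted(box["fliers"][0].get_ydata().tolist())
print(fliers)
```

To view the figure, call fig.savefig with a file name, or drop the Agg line and call plt.show() in an interactive session.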

2. Statistical Methods

Z-Score

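Measures how many standard deviations a data point lies from the mean.
Formula: z = (X - mean) / standard deviation
A threshold of |z| > 3 is often used to flag outliers. Because the mean and standard deviation are themselves pulled by outliers, this rule is most reliable on large, roughly normal samples.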
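
A minimal sketch of the z-score rule with NumPy. The helper name zscore_outliers and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking values whose |z| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Large sample: 200 roughly normal values plus two injected extremes at the end.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=50, scale=5, size=200), [120.0, 125.0]])
print(np.flatnonzero(zscore_outliers(data)))  # indices of flagged values

# Small sample: the extremes inflate the standard deviation so much
# that no value reaches |z| > 3.
small = [10, 12, 15, 18, 120, 14, 13, 17, 16, 125]
print(zscore_outliers(small).any())
```

On the large sample the two injected values are flagged. On the 10-value sample nothing is flagged, even though 120 and 125 are obvious extremes. In a case like that, the IQR rule is the safer choice.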

Interquartile Range (IQR)

Measures the spread of the middle 50% of data.
Formula: IQR = Q3 - Q1
Outliers are points that fall outside these bounds:
Lower bound: Q1 - 1.5 × IQR
Upper bound: Q3 + 1.5 × IQR
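
The bounds above can be computed in a few lines. This is a sketch using NumPy's default (linear interpolation) percentiles. The helper name iqr_bounds is hypothetical.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = np.array([10, 12, 15, 18, 120, 14, 13, 17, 16, 125])
lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

For this sample, Q1 = 13.25 and Q3 = 17.75, so the bounds are 6.5 and 24.5, and the values 120 and 125 fall outside them.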

3. Machine Learning Methods

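Isolation Forest: Builds random decision trees and flags points that can be isolated with only a few splits.
Local Outlier Factor (LOF): Compares the local density around each point with the density around its neighbors; points in much sparser regions are flagged.
One-Class SVM: Learns a boundary around the normal data and flags points that fall outside it.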
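
All three detectors are available in scikit-learn. The sketch below runs them on synthetic one-dimensional data. The contamination, n_neighbors, and nu values are illustrative assumptions, not recommended settings, and would need tuning on real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# 200 roughly normal values plus two injected extremes (indices 200 and 201).
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc=50, scale=5, size=200), [120.0, 125.0]])
X = values.reshape(-1, 1)  # scikit-learn expects a 2-D array

detectors = {
    "Isolation Forest": IsolationForest(contamination=0.01, random_state=0),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.01),
    "One-Class SVM": OneClassSVM(nu=0.05, gamma="scale"),
}

flagged = {}
for name, detector in detectors.items():
    labels = detector.fit_predict(X)  # -1 = outlier, 1 = inlier
    flagged[name] = np.flatnonzero(labels == -1)
    print(name, flagged[name])
```

Each detector may also flag a few borderline points from the normal cluster, since contamination and nu set roughly what share of the data is treated as anomalous.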

How to Handle Outliers?

Once outliers are detected, they can be handled in different ways depending on the dataset and problem.

1. Removing Outliers

If outliers result from data entry errors, they can be removed.

import pandas as pd

# Build a small example dataset
df = pd.DataFrame({'Value': [10, 12, 15, 18, 120, 14, 13, 17, 16, 125]})

# Calculate the IQR
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows that fall within the IQR bounds
df_cleaned = df[(df['Value'] >= (Q1 - 1.5 * IQR)) & (df['Value'] <= (Q3 + 1.5 * IQR))]

2. Transforming Data

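Log Transformation: Compresses large values and reduces right skew. Requires strictly positive values.
Box-Cox Transformation: A power transformation that stabilizes variance and brings the data closer to a normal distribution. Also requires strictly positive values.

import numpy as np
# Log transform; values must be strictly positive (np.log1p is a common choice when zeros are present)
df['Value'] = np.log(df['Value'])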
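
The Box-Cox transformation can be applied with scipy.stats.boxcox, which also estimates the power parameter lambda from the data. The sketch below uses synthetic log-normal data as a stand-in for a skewed, strictly positive feature.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=500)  # strictly positive, right-skewed

# With no lambda given, boxcox estimates it by maximum likelihood
# and returns the transformed data together with the fitted lambda.
transformed, fitted_lambda = stats.boxcox(skewed)

print("skewness before:", stats.skew(skewed))
print("skewness after: ", stats.skew(transformed))
print("fitted lambda:  ", fitted_lambda)
```

A fitted lambda close to 0 means Box-Cox is behaving almost like a plain log transformation, which is expected for log-normal data.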

3. Winsorization (Capping)

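Replaces extreme values with the value at a chosen percentile instead of removing them.

from scipy.stats.mstats import winsorize
# Cap the lowest and highest 5% of values (on the 10-row example above, 5% rounds down to zero rows, so nothing changes)
df['Value'] = winsorize(df['Value'], limits=[0.05, 0.05])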
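
On a sample as small as the 10-value example, the capped fraction has to cover at least one row to have any effect. The sketch below leaves the low end alone and caps the top 20% (2 of 10 values). These limits were picked for illustration only.

```python
import numpy as np
from scipy.stats.mstats import winsorize

values = np.array([10, 12, 15, 18, 120, 14, 13, 17, 16, 125])

# limits=(lower_fraction, upper_fraction): cap nothing at the low end and
# replace the top 20% of values with the largest remaining value.
# winsorize returns a masked array, so convert it back to a plain array.
capped = np.asarray(winsorize(values, limits=(0.0, 0.2)))
print(capped)
```

The two largest values (120 and 125) are replaced by 18, the largest value below them. The row count stays the same.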

4. Using Robust Machine Learning Models

Tree-based models are less sensitive to outliers in the input features, because they split on thresholds rather than on the magnitude of the values:

Decision Trees
Random Forests
Gradient Boosting Machines (GBM)
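
As a rough illustration, the sketch below replaces one feature value with an extreme one and compares a shallow decision tree with a linear regression. The data is synthetic. It covers only outliers in the input feature; outliers in the target can still affect tree-based models trained with squared error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(scale=1.0, size=x.size)

x_outlier = x.copy()
x_outlier[-1] = 1000.0  # the largest feature value becomes an extreme outlier

def fit_predict(model, features):
    """Fit the model on one feature and return its predictions on the same rows."""
    X = features.reshape(-1, 1)
    return model.fit(X, y).predict(X)

tree_clean = fit_predict(DecisionTreeRegressor(max_depth=2, random_state=0), x)
tree_outlier = fit_predict(DecisionTreeRegressor(max_depth=2, random_state=0), x_outlier)

slope_clean = LinearRegression().fit(x.reshape(-1, 1), y).coef_[0]
slope_outlier = LinearRegression().fit(x_outlier.reshape(-1, 1), y).coef_[0]

print("tree predictions unchanged:", np.allclose(tree_clean, tree_outlier))
print("linear slope:", slope_clean, "->", slope_outlier)
```

The tree groups the rows the same way in both cases, because the extreme value does not change the order of the feature values. The linear model's slope collapses from about 2 to nearly 0.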

Choosing the Right Outlier Handling Technique

Selecting the right technique depends on the dataset and the nature of outliers:

Remove Outliers: When they are caused by data entry errors or measurement mistakes (e.g., typos in numerical data).
Transform Data: When outliers occur naturally in a skewed distribution (e.g., salaries, income data).
Use Winsorization: When extreme values are valid but should not overly influence the model (e.g., financial or stock market data).
Use Robust Models: When the dataset inherently contains outliers that carry important information (e.g., fraud detection, medical diagnosis).

Datasets Where Outliers Are Common

Outliers frequently occur in datasets such as:

Financial Data: Fraudulent or unusually large transactions.
Medical Data: Abnormal lab test values.
Sensor Data: Faulty sensor readings.
E-commerce: Unusual purchase behavior.

Conclusion

Handling outliers is crucial for building robust machine learning models. Choosing the right method depends on whether the outliers are due to errors or natural variations. By detecting and managing outliers properly, we can improve model accuracy and make better data-driven decisions.