Why Handling Missing Values In Dataset Is Important 🎯.

5 min readFeb 6, 2025

i. What are missing values in dataset?
ii. How they look like in dataset?
iii. What they do if they aren’t handled?
iv. How we handle them?

These will be the 4 Question’s which we will cover in this blog :)

📢 · Let’s start with “What are missing values in the dataset?”.

As you can see here, that in some specific columns there is being written ‘NaN’ which means ‘Not a Number’.

Missing Values in a dataset refer to data points that are not recorded or are unavailable for certain observations. These gaps can occur due to various reasons such as human error, system failures, or data corruption. Handling missing values properly is crucial for maintaining data integrity and ensuring accurate analysis.

Common Causes of Missing Values

Human Error: Mistakes in data entry or omission during data collection.
Data Corruption: Loss of data due to system crashes or file corruption.
Survey Non-Responses: Participants may choose not to answer certain questions.
System Constraints: Some databases reject null values, leading to missing data.
Conditional Skipping: Certain questions in surveys are skipped based on previous answers.
Privacy Concerns: Respondents might avoid sensitive questions like income or health details.

Examples of Missing Values in Different Datasets

1. Healthcare Data

Missing patient records due to incorrect entries.

Incomplete test results when patients miss follow-up appointments.

2. E-Commerce Data

Missing delivery addresses in customer profiles.

Unrecorded user activity due to tracking failures.

3. Financial Data

Missing stock prices for certain days due to market closures.

Unavailable income data due to tax privacy concerns.

📢 · Let’s start with “How they look like in dataset?”.

Missing values in a dataset can appear in different formats, depending on the data source, file type, or how they were recorded. Below are various ways missing values are represented in datasets:

1. Represented as NaN (Not a Number)

This is common in programming languages like Python (using pandas) and R.

Example (CSV or DataFrame Format)

NaN (Not a Number) represents missing values.
In Python, pandas will automatically recognize missing values as NaN.

2. Represented as -999, -1, or Other Placeholder Values

Some datasets use specific numbers (like -999, -1, or 9999) to indicate missing data.

Example (Survey Data)

-999 or -1 is used to indicate missing data.
This can cause errors in analysis if not properly handled.

3. Represented as ? or N/A

Some datasets use "?", "N/A", "None", or "Unknown" to indicate missing values.

Example

"?", "N/A", and "None" must be converted to NaN for proper handling in Python.
Use replace() in pandas to convert:

df.replace(["?", "N/A", "None"], np.nan, inplace=True)

4. Missing Rows (Structural Missingness)

Sometimes entire rows or sections of a dataset are missing, leading to incomplete data.

Example (Customer Data)

Email and phone numbers are missing for certain customers.

df.replace(["(Missing)"], np.nan, inplace=True)

📢 · Let’s start with “What they do if they aren’t handled in dataset?”.

Data Bias & Inaccuracy

Leads to incorrect insights and poor decision-making.

Example: Missing patient data in healthcare skews treatment outcomes.

2. Errors in Statistical Analysis

Mean, median, and standard deviation calculations become inaccurate.

Correlation and regression results may be misleading.

3. Machine Learning Model Issues

Many ML algorithms (e.g., Linear Regression, SVM) cannot handle NaN values.

Can cause models to fail, underperform, or produce biased predictions.

4. Data Visualization Problems

Charts and graphs may show incomplete or misleading trends.

Outlier detection becomes difficult.

5. Operational Failures

Systems relying on complete data may crash or produce errors.

Example: Banking fraud detection failing due to missing transaction data.

6. Loss of Valuable Information

Simply removing rows/columns may lead to loss of critical data.

Important insights might be overlooked.

📢 · Let’s start with “How we handle them?”.

1. Identifying Missing Values

Before handling missing values, we need to detect them.

import pandas as pd

# Sample dataset with missing values
data = {'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
        'Age': [25, 30, None, 40],
        'Salary': [50000, 60000, None, 70000]}df = pd.DataFrame(data)# Check for missing values
print(df.isnull())  # True indicates a missing value
print(df.isnull().sum())  # Count of missing values in each column

2. Removing Missing Values

a) Removing Rows with Missing Values

df_cleaned = df.dropna()  # Removes any row with at least one missing value
print(df_cleaned)

b) Removing Columns with Missing Values

df_cleaned = df.dropna(axis=1)  # Removes columns with missing values
print(df_cleaned)

⚠ Drawback: This can cause data loss if too many rows or columns are removed.

3. Filling Missing Values (Imputation)

a) Filling with a Specific Value

df_filled = df.fillna(0)  # Replace missing values with 0
print(df_filled)

b) Filling with Mean, Median, or Mode

df['Age'].fillna(df['Age'].mean(), inplace=True)  # Fill with mean
df['Salary'].fillna(df['Salary'].median(), inplace=True)  # Fill with median
print(df)

c) Filling with the Previous or Next Value

df.fillna(method='ffill', inplace=True)  # Forward fill (use previous value)
df.fillna(method='bfill', inplace=True)  # Backward fill (use next value)

4. Interpolating Missing Values

Interpolation estimates missing values based on other values in the column.

df['Age'] = df['Age'].interpolate()
df['Salary'] = df['Salary'].interpolate()
print(df)

5. Handling Missing Data in Machine Learning

Some ML models cannot handle missing values directly. We can:

Fill missing values before training.
Use models like XGBoost that handle missing data automatically.

Example: Filling Missing Values Before Training

from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(strategy='mean')  # Choose 'mean', 'median', or 'most_frequent'
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print(df)