Why Handling Missing Values In Dataset Is Important šÆ.
i. What are missing values in dataset?
ii. How they look like in dataset?
iii. What they do if they arenāt handled?
iv. How we handle them?
These will be the 4 Questionās which we will cover in this blog :)
š¢ Ā· Letās start with āWhat are missing values in the dataset?ā.
As you can see here, that in some specific columns there is being written āNaNā which means āNot a Numberā.
Missing Values in a dataset refer to data points that are not recorded or are unavailable for certain observations. These gaps can occur due to various reasons such as human error, system failures, or data corruption. Handling missing values properly is crucial for maintaining data integrity and ensuring accurate analysis.
Common Causes of Missing Values
- Human Error: Mistakes in data entry or omission during data collection.
- Data Corruption: Loss of data due to system crashes or file corruption.
- Survey Non-Responses: Participants may choose not to answer certain questions.
- System Constraints: Some databases reject null values, leading to missing data.
- Conditional Skipping: Certain questions in surveys are skipped based on previous answers.
- Privacy Concerns: Respondents might avoid sensitive questions like income or health details.
Examples of Missing Values in Different Datasets
1. Healthcare Data
Missing patient records due to incorrect entries.
Incomplete test results when patients miss follow-up appointments.
2. E-Commerce Data
Missing delivery addresses in customer profiles.
Unrecorded user activity due to tracking failures.
3. Financial Data
Missing stock prices for certain days due to market closures.
Unavailable income data due to tax privacy concerns.
š¢ Ā· Letās start with āHow they look like in dataset?ā.
Missing values in a dataset can appear in different formats, depending on the data source, file type, or how they were recorded. Below are various ways missing values are represented in datasets:
1. Represented as NaN
(Not a Number)
This is common in programming languages like Python (using pandas) and R.
Example (CSV or DataFrame Format)
NaN
(Not a Number) represents missing values.- In Python,
pandas
will automatically recognize missing values asNaN
.
2. Represented as -999
, -1
, or Other Placeholder Values
Some datasets use specific numbers (like -999
, -1
, or 9999
) to indicate missing data.
Example (Survey Data)
-999
or-1
is used to indicate missing data.- This can cause errors in analysis if not properly handled.
3. Represented as ?
or N/A
Some datasets use "?"
, "N/A"
, "None"
, or "Unknown"
to indicate missing values.
Example
"?"
,"N/A"
, and"None"
must be converted toNaN
for proper handling in Python.- Use
replace()
in pandas to convert:
df.replace(["?", "N/A", "None"], np.nan, inplace=True)
4. Missing Rows (Structural Missingness)
Sometimes entire rows or sections of a dataset are missing, leading to incomplete data.
Example (Customer Data)
- Email and phone numbers are missing for certain customers.
df.replace(["(Missing)"], np.nan, inplace=True)
š¢ Ā· Letās start with āWhat they do if they arenāt handled in dataset?ā.
- Data Bias & Inaccuracy
Leads to incorrect insights and poor decision-making.
Example: Missing patient data in healthcare skews treatment outcomes.
2. Errors in Statistical Analysis
Mean, median, and standard deviation calculations become inaccurate.
Correlation and regression results may be misleading.
3. Machine Learning Model Issues
Many ML algorithms (e.g., Linear Regression, SVM) cannot handle NaN values.
Can cause models to fail, underperform, or produce biased predictions.
4. Data Visualization Problems
Charts and graphs may show incomplete or misleading trends.
Outlier detection becomes difficult.
5. Operational Failures
Systems relying on complete data may crash or produce errors.
Example: Banking fraud detection failing due to missing transaction data.
6. Loss of Valuable Information
Simply removing rows/columns may lead to loss of critical data.
Important insights might be overlooked.
š¢ Ā· Letās start with āHow we handle them?ā.
1. Identifying Missing Values
Before handling missing values, we need to detect them.
import pandas as pd
# Sample dataset with missing values
data = {'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
'Age': [25, 30, None, 40],
'Salary': [50000, 60000, None, 70000]}df = pd.DataFrame(data)# Check for missing values
print(df.isnull()) # True indicates a missing value
print(df.isnull().sum()) # Count of missing values in each column
2. Removing Missing Values
a) Removing Rows with Missing Values
df_cleaned = df.dropna() # Removes any row with at least one missing value
print(df_cleaned)
b) Removing Columns with Missing Values
df_cleaned = df.dropna(axis=1) # Removes columns with missing values
print(df_cleaned)
ā Drawback: This can cause data loss if too many rows or columns are removed.
3. Filling Missing Values (Imputation)
a) Filling with a Specific Value
df_filled = df.fillna(0) # Replace missing values with 0
print(df_filled)
b) Filling with Mean, Median, or Mode
df['Age'].fillna(df['Age'].mean(), inplace=True) # Fill with mean
df['Salary'].fillna(df['Salary'].median(), inplace=True) # Fill with median
print(df)
c) Filling with the Previous or Next Value
df.fillna(method='ffill', inplace=True) # Forward fill (use previous value)
df.fillna(method='bfill', inplace=True) # Backward fill (use next value)
4. Interpolating Missing Values
Interpolation estimates missing values based on other values in the column.
df['Age'] = df['Age'].interpolate()
df['Salary'] = df['Salary'].interpolate()
print(df)
5. Handling Missing Data in Machine Learning
Some ML models cannot handle missing values directly. We can:
- Fill missing values before training.
- Use models like
XGBoost
that handle missing data automatically.
Example: Filling Missing Values Before Training
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(strategy='mean') # Choose 'mean', 'median', or 'most_frequent'
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print(df)