How to do data transformation in machine learning?💹

Muhammad Taha
4 min read · Feb 16, 2025


Data transformation is a crucial step in the machine learning pipeline that converts raw data into a format suitable for modeling. Done well, it can improve model accuracy, speed up training, and make data more consistent. This article covers what data transformation is, its main types, when to use it, and how to apply it, with code examples.

What is Data Transformation in Machine Learning?

Data transformation refers to converting data from one format or structure to another to make it more suitable for analysis and modeling. This includes scaling, normalization, encoding, feature extraction, and handling missing values.

Why Do We Need Data Transformation?

  • Ensures Data Consistency: Raw data often contains inconsistencies, missing values, and noise.
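  • Improves Model Performance: Scaled features often help gradient-based and distance-based models converge faster and reach better accuracy.
  • Handles Different Data Distributions: Some algorithms (for example, linear discriminant analysis and Gaussian Naive Bayes) assume approximately normal features, and many others work better when features are not heavily skewed.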
  • Reduces Computational Complexity: Smaller and well-structured data improve processing speed.
  • Prepares Data for Specific Algorithms: Certain ML algorithms require categorical data to be encoded as numerical values.

Issues in Dataset Requiring Data Transformation

  • Different Data Scales: Features in different units (e.g., height in cm and weight in kg) need to be standardized.
  • Skewed Distribution: Some algorithms perform poorly with skewed or non-normal data.
  • Categorical Variables: Most machine learning models require numerical inputs, so categories must be encoded.
  • Missing Values: Can cause errors or biases in model predictions.
  • High Dimensionality: Some datasets have too many features, requiring dimensionality reduction.
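
The issues listed above can usually be spotted with a few quick checks before choosing a transformation. The following is a minimal sketch using pandas; the DataFrame and its column names are hypothetical and invented only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented purely for illustration.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, np.nan],
    "income": [20_000.0, 25_000.0, 30_000.0, 35_000.0, 500_000.0],
    "city": ["A", "B", "A", "C", "B"],
})

numeric = df.select_dtypes(include="number")

# Different scales: compare the range and spread of each numeric column.
print(numeric.describe().loc[["min", "max", "std"]])

# Skewed distribution: values far from 0 suggest a long tail.
skewness = numeric.skew()
print(skewness)

# Categorical variables: non-numeric columns that will need encoding.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# Missing values: number of NaNs per column.
missing_counts = df.isna().sum()
print(missing_counts)

# High dimensionality: number of features relative to number of rows.
n_rows, n_features = df.shape
print(n_rows, n_features)
```

Here the `income` column has one extreme value, so its skewness is strongly positive, which hints that a log or robust transformation may be worth trying.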

When and On Which Type of Datasets is Data Transformation Needed?

Data transformation is essential when:

  • Data has different scales (e.g., height in cm and weight in kg).
  • Data contains categorical variables that need encoding.
  • Data is highly skewed or contains outliers.
  • Features need extraction, selection, or dimensionality reduction.
  • Data contains missing or duplicate values.
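
Duplicate rows are mentioned above but are easy to overlook. A minimal sketch of detecting and removing exact duplicates with pandas follows; the values are hypothetical.

```python
import pandas as pd

# Hypothetical records; the fourth row is an exact copy of the second.
df = pd.DataFrame({
    "height_cm": [170, 165, 180, 165],
    "weight_kg": [70, 60, 85, 60],
})

# duplicated() marks rows that repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print(deduped)
```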

Types of Data Transformation

1. Scaling

  • Standardization (Z-score normalization)
  • Min-Max Scaling
  • Robust Scaling
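
Robust scaling is the option most suited to data with outliers. As a rough sketch, assuming scikit-learn's `RobustScaler` with its default settings (median and interquartile range) and a made-up feature containing one extreme value:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical single feature with one extreme outlier (1000).
data = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# StandardScaler uses the mean and standard deviation, both distorted by the outlier.
standardized = StandardScaler().fit_transform(data)

# RobustScaler subtracts the median and divides by the interquartile range.
robust = RobustScaler().fit_transform(data)

print("Standardized:\n", standardized.ravel())
print("Robust-scaled:\n", robust.ravel())
```

With standardization, the four ordinary values end up squeezed close together because the outlier inflates the standard deviation. With robust scaling, they keep a sensible spread and only the outlier stays far away.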

2. Normalization

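  • Rescales values to a common range such as [0, 1], or rescales each sample to unit norm. It is a common choice when the data is not normally distributed, since it does not assume a particular distribution.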
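
The term normalization is also used for rescaling each sample (row) to unit length. A minimal sketch of that reading, assuming scikit-learn's `Normalizer` with the L2 norm and made-up sample values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples; each row is treated as one vector.
data = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# Divide each row by its Euclidean (L2) length so every row has norm 1.
normalizer = Normalizer(norm="l2")
unit_rows = normalizer.fit_transform(data)

print(unit_rows)
print(np.linalg.norm(unit_rows, axis=1))
```

Note that this works row by row, unlike Min-Max scaling, which works column by column.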

3. Encoding Categorical Data

  • Label Encoding
  • One-Hot Encoding

4. Feature Engineering

  • Polynomial Features
  • Log Transform
  • Box-Cox Transform
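
The three techniques above can be sketched as follows. This assumes scikit-learn's `PolynomialFeatures` and `PowerTransformer` (used here for Box-Cox) plus NumPy's `log1p`; the input values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer

# Polynomial features: expand [a, b] into [1, a, b, a^2, a*b, b^2].
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print("Polynomial features:\n", X_poly)

# Hypothetical right-skewed, strictly positive feature.
skewed = np.array([[1.0], [2.0], [3.0], [10.0], [100.0], [1000.0]])

# Log transform: log1p computes log(1 + x), which is also safe at x = 0.
log_data = np.log1p(skewed)
print("Log-transformed:\n", log_data.ravel())

# Box-Cox transform: requires strictly positive input.
# The power parameter lambda is estimated from the data, and the
# result is standardized to zero mean and unit variance by default.
boxcox = PowerTransformer(method="box-cox")
boxcox_data = boxcox.fit_transform(skewed)
print("Estimated lambda:", boxcox.lambdas_)
print("Box-Cox transformed:\n", boxcox_data.ravel())
```

If a feature contains zeros or negative values, Box-Cox does not apply; `PowerTransformer(method="yeo-johnson")` is one alternative in that case.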

5. Handling Missing Values

  • Imputation (Mean, Median, Mode)
  • Removing Null Values
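
Both options can be tried quickly in pandas. The sketch below uses a hypothetical DataFrame, with the median for a numeric column (less sensitive to outliers than the mean) and the mode for a categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 120.0],
    "city": ["A", "B", None, "B"],
})

# Option 1: remove every row that has at least one missing value.
dropped = df.dropna()

# Option 2: impute instead of dropping.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())      # median of 25, 35, 120
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most frequent value

print(dropped)
print(filled)
```

Dropping rows is simplest but discards data; here it removes half of the rows, which is why imputation is often preferred on small datasets.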

6. Dimensionality Reduction

  • PCA (Principal Component Analysis)
  • t-SNE (t-distributed Stochastic Neighbor Embedding)
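
t-SNE is mostly used to visualize high-dimensional data in two or three dimensions. A minimal sketch with scikit-learn's `TSNE` follows; the synthetic clusters and the `perplexity` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical data: two clusters of 20 points each in 10 dimensions.
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(20, 10))
cluster_b = rng.normal(loc=8.0, scale=1.0, size=(20, 10))
data = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedded = tsne.fit_transform(data)

print(embedded.shape)
```

Unlike PCA, scikit-learn's `TSNE` has no separate `transform` method for new data, so it suits exploration better than use as a preprocessing step in a prediction pipeline.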

How to Perform Data Transformation (Code Examples)

1. Scaling and Normalization

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Sample Data
data = np.array([[100, 200], [300, 400], [500, 600]])

# Standardization (Mean = 0, Std = 1)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Standardized Data:\n", scaled_data)

# Min-Max Scaling (Range [0,1])
minmax_scaler = MinMaxScaler()
normalized_data = minmax_scaler.fit_transform(data)
print("Normalized Data:\n", normalized_data)

Output:

Standardized Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
Normalized Data:
 [[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]
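
In a real project, the scaler is normally fitted on the training split only and then reused on the test split, so that no information from the test data leaks into preprocessing. A minimal sketch, with made-up data and an arbitrary split ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features and labels.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on test data

print("Training mean used for scaling:", scaler.mean_)
print("Scaled test data:\n", X_test_scaled)
```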

2. Encoding Categorical Variables

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

# Sample Categorical Data
data = pd.DataFrame({'Category': ['Apple', 'Orange', 'Banana', 'Apple']})

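# Label Encoding (LabelEncoder is designed for target labels; for input features, OrdinalEncoder is the usual choice)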
label_encoder = LabelEncoder()
data['Category_Label'] = label_encoder.fit_transform(data['Category'])
print(data)

# One-Hot Encoding
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(data[['Category']]).toarray()
print("One-Hot Encoded:\n", onehot_encoded)

Output:

  Category  Category_Label
0    Apple               0
1   Orange               2
2   Banana               1
3    Apple               0
One-Hot Encoded:
 [[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
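
Real datasets usually mix categorical and numeric columns, and new data may contain categories that were not seen during fitting. One possible way to handle both, sketched with scikit-learn's `ColumnTransformer` and `OneHotEncoder(handle_unknown="ignore")`; the `Price` column and the `Mango` row are hypothetical additions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with one categorical and one numeric column.
train = pd.DataFrame({
    "Category": ["Apple", "Orange", "Banana", "Apple"],
    "Price": [1.0, 2.0, 0.5, 1.2],
})

# Hypothetical new row with a category that was never seen during fitting.
new = pd.DataFrame({"Category": ["Mango"], "Price": [3.0]})

preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
        ("num", StandardScaler(), ["Price"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_features = preprocess.fit_transform(train)
new_features = preprocess.transform(new)

print("Training features:\n", train_features)
print("New row features:\n", new_features)
```

The unseen category is encoded as all zeros in the one-hot columns instead of raising an error.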

3. Handling Missing Values

from sklearn.impute import SimpleImputer
import numpy as np

# Sample Data with Missing Values
data = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Impute Missing Values with Mean
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(data)
print("Imputed Data:\n", imputed_data)

Output:

Imputed Data:
 [[1.  2.  7.5]
 [4.  5.  6. ]
 [7.  8.  9. ]]

4. Dimensionality Reduction using PCA

from sklearn.decomposition import PCA
import numpy as np

# Sample Data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Apply PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(data)
print("Reduced Data:\n", pca_data)

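Output (approximate; the sign of the first column and the exact near-zero values in the second column can vary with the scikit-learn version):

Reduced Data:
 [[-5.196  0.   ]
 [ 0.     0.   ]
 [ 5.196  0.   ]]

The three rows lie on a single line, so the first component captures all of the variance and the second is zero up to floating-point error.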
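
PCA is sensitive to feature scale, so features are commonly standardized first, and `explained_variance_ratio_` is a convenient way to see how much information the kept components retain. A minimal sketch on synthetic data, where the third feature is deliberately built as a near copy of the first:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: feature 2 has a much larger scale,
# and feature 3 is almost a copy of feature 1.
base = rng.normal(size=(100, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 1] * 100,
    base[:, 0] + 0.01 * rng.normal(size=100),
])

# Standardize so that no feature dominates only because of its units.
scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print("Reduced shape:", reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Because one feature is redundant, two components keep nearly all of the variance of the three standardized features.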

Advantages of Data Transformation

  • Improves Model Performance: Helps algorithms process data more effectively.
  • Better Data Interpretability: Makes patterns and trends more visible.
  • Reduces Overfitting: Dimensionality reduction and feature selection remove redundant or noisy features, which limits unnecessary model complexity.
  • Ensures Algorithm Compatibility: Converts raw data into suitable formats.
  • Enhances Convergence Speed: Well-scaled data helps models learn faster.

Machine Learning Algorithms That Handle Data Transformation Automatically

Some machine learning algorithms inherently manage data transformations:

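  • Decision Trees and Random Forests: Insensitive to feature scaling because they split on thresholds. Some implementations also handle categorical data without encoding, although scikit-learn's tree estimators expect numerical inputs.
  • XGBoost: Can handle missing values by learning a default split direction for them.
  • Neural Networks: Can learn useful feature representations internally, and layers such as batch normalization rescale intermediate activations. Scaling the inputs still usually helps training.
  • Gradient Boosting Models: Tree-based boosting is largely unaffected by skewed feature distributions, since splits depend only on the ordering of values.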

Conclusion

Data transformation is an essential part of the machine learning workflow. It ensures that data is properly formatted, normalized, and optimized for training models. Different types of transformation techniques should be applied based on the dataset characteristics. By understanding when and how to apply these techniques, you can significantly improve the performance of your machine learning models.
