How to do “Data Transformation” in ML?💻

Muhammad Taha
3 min read · Feb 16, 2025

Feature extraction is a crucial process in machine learning where raw data is transformed into meaningful features that improve the model’s performance. It helps reduce dimensionality, remove noise, and highlight relevant patterns in data.

Why Do We Perform Feature Extraction?

Feature extraction is necessary because it:

  • Can improve model accuracy by turning raw data into informative features.
  • Reduces computational complexity and training time.
  • Can make the model easier to interpret.
  • Removes redundant or irrelevant information.

What Does Feature Extraction Look Like? (Example Datasets)

Feature extraction can be applied to various data types, including:

  1. Text Data (Extracting word frequencies, TF-IDF scores, embeddings)
  2. Image Data (Extracting edges, textures, shapes, or colors)
  3. Tabular Data (Generating new features from existing ones)
  4. Time-Series Data (Extracting trends, seasonality, statistical features)

Example Dataset Before Feature Extraction:

ID  Sentence
1   I love machine learning
2   This is a great example
3   Feature extraction is useful

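After Feature Extraction (TF-IDF Representation):

Each sentence becomes one row of numeric weights, with one column per vocabulary term; the TF-IDF code example further down produces this matrix.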
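As a rough illustration of that representation, the sketch below computes TF-IDF weights for the three sentences by hand in plain Python. It assumes the conventions scikit-learn uses by default: lowercasing, dropping one-letter tokens, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The variable names are illustrative only.

```python
import math
import re
from collections import Counter

corpus = [
    "I love machine learning",
    "This is a great example",
    "Feature extraction is useful",
]

# Assumption: mimic scikit-learn's default tokenizer, which lowercases and
# keeps only tokens of two or more word characters ("I" and "a" drop out).
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({token for doc in docs for token in doc})

n_docs = len(docs)
# Document frequency: in how many sentences each term appears.
df = Counter(token for doc in docs for token in set(doc))
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

tfidf_rows = []
for doc in docs:
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))  # L2-normalise each row
    tfidf_rows.append([round(w / norm, 3) for w in raw])

print(vocab)
for row in tfidf_rows:
    print(row)
```

The first sentence has three equally rare terms, so each gets 1/√3 ≈ 0.577. The word "is" appears in two sentences, so it ends up with a lower weight (about 0.402) than the other terms in those rows (about 0.529).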

Types of Feature Extraction

There are several types of feature extraction methods:

  1. Statistical Methods: Mean, variance, standard deviation, correlation.
  2. Dimensionality Reduction: PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis).
  3. Text Feature Extraction: TF-IDF, Word2Vec, CountVectorizer.
  4. Image Feature Extraction: Edge detection, HOG, SIFT.
  5. Time-Series Feature Extraction: Fourier Transform, Wavelet Transform.
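To make the statistical and time-series items above more concrete, here is a minimal sketch using NumPy. The signal is hypothetical (a 5 Hz sine wave with a slow trend), and the chosen features are just one possible set, not a standard recipe.

```python
import numpy as np

# Hypothetical signal: a 5 Hz sine wave sampled at 100 Hz for one second,
# plus a slow upward trend. All names and values here are illustrative.
sampling_rate = 100
t = np.arange(sampling_rate) / sampling_rate
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * t

# Statistical features: summarise the whole series in a few numbers.
stat_features = {
    "mean": signal.mean(),
    "std": signal.std(),
    "min": signal.min(),
    "max": signal.max(),
}

# Frequency-domain feature: strongest non-zero frequency from the FFT.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(len(signal), d=1 / sampling_rate)
dominant_freq = freqs[spectrum[1:].argmax() + 1]  # skip the 0 Hz bin

print(stat_features)
print("dominant frequency (Hz):", dominant_freq)
```

A hundred raw samples are reduced to five numbers here, and the dominant frequency recovers the 5 Hz component that was built into the signal.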

Algorithms That Handle Feature Extraction Automatically

Some machine learning algorithms and models can handle feature extraction automatically:

  • Deep Learning (CNNs, RNNs, Transformers): Automatically extract high-level features from raw data.
  • Autoencoders: Learn feature representations in an unsupervised way.
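  • XGBoost & Decision Trees: Rank and select important features during training (strictly speaking, this is feature selection rather than extraction).
  • Feature Selection Libraries: sklearn.feature_selection and boruta_py automate the choice of which existing features to keep.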
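As a small sketch of how a tree-based model ranks features during training, the example below fits a scikit-learn decision tree on hypothetical data in which only the first column matters. Note that this ranks existing features rather than extracting new ones; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: only the first column determines the label;
# the other two columns are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ sums to 1; higher means the tree relied on it more.
importances = tree.feature_importances_
print(importances)
```

Because a single split on the first column separates the classes perfectly, that column should receive essentially all of the importance.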

How to Perform Feature Extraction (Code Examples)

1. TF-IDF for Text Data

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
corpus = ["I love machine learning", "This is a great example", "Feature extraction is useful"]

# Apply TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

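Output (rounded to three decimals):

['example' 'extraction' 'feature' 'great' 'is' 'learning' 'love' 'machine' 'this' 'useful']
[[0.    0.    0.    0.    0.    0.577 0.577 0.577 0.    0.   ]
 [0.529 0.    0.    0.529 0.402 0.    0.    0.    0.529 0.   ]
 [0.    0.529 0.529 0.    0.402 0.    0.    0.    0.    0.529]]

(Note: The default tokenizer ignores one-letter words, so "I" and "a" do not appear in the vocabulary.)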

2. PCA for Dimensionality Reduction

from sklearn.decomposition import PCA
import numpy as np

# Sample 3D data
data = np.array([[2, 8, 4], [6, 12, 10], [8, 10, 12], [10, 18, 14]])

# Apply PCA to reduce to 2D
pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data)
print(transformed_data)

Output (approximate, rounded to two decimals):

[[-8.41 -1.26]
 [-0.25 -0.16]
 [ 0.82  3.09]
 [ 7.84 -1.68]]

(Note: The sign of each column is arbitrary and can differ between scikit-learn versions; the magnitudes should match.)
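To show what PCA is doing under the hood, here is a minimal NumPy sketch that reproduces the same steps on the same data: centre the columns, take an SVD, and project onto the top two directions. This is a simplified illustration, not scikit-learn's exact implementation, and the column signs may be flipped relative to its output.

```python
import numpy as np

data = np.array([[2, 8, 4], [6, 12, 10], [8, 10, 12], [10, 18, 14]], dtype=float)

# Step 1: centre each column on its mean.
centered = data - data.mean(axis=0)

# Step 2: SVD of the centred data; rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Step 3: project onto the first two directions.
scores = centered @ Vt[:2].T

# Share of total variance carried by each component.
explained_ratio = S**2 / np.sum(S**2)

print(scores.round(2))
print(explained_ratio.round(3))
```

For this dataset the first component carries roughly 90% of the variance, which is why dropping the third dimension loses very little information.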

3. Feature Extraction from Images (HOG — Histogram of Oriented Gradients)

from skimage.feature import hog
from skimage import data, color

# Load a sample image and convert it to grayscale
image = color.rgb2gray(data.astronaut())

# Extract HOG features (visualize=True also returns a rendered HOG image)
features, hog_image = hog(image, visualize=True)
print(features[:10])  # Print first 10 feature values

Output (illustrative):

[0.114 0.226 0.314 0.402 0.128 0.219 0.287 0.369 0.117 0.208]

(Note: These numbers are illustrative; the exact values depend on the image and the HOG parameters.)
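The core idea behind HOG can be sketched with NumPy alone: compute the gradient at each pixel, then build a histogram of gradient orientations weighted by gradient magnitude. The example below is a deliberately simplified, single-cell version on a hypothetical 8x8 image; it leaves out the block normalisation that the full HOG descriptor applies. The 9 bins and 8x8 cell size mirror common HOG settings.

```python
import numpy as np

# Hypothetical 8x8 test image: dark left half, bright right half,
# i.e. a single vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Gradients along rows (gy) and columns (gx).
gy, gx = np.gradient(image)
magnitude = np.hypot(gx, gy)
# Unsigned orientation in [0, 180) degrees.
orientation = np.degrees(np.arctan2(gy, gx)) % 180

# 9-bin histogram of orientations, each pixel weighted by its gradient magnitude.
hist, bin_edges = np.histogram(
    orientation, bins=9, range=(0, 180), weights=magnitude
)
print(hist)
```

Since the only edge in this image is vertical, all gradients point horizontally and the histogram mass lands in the first bin (orientations near 0 degrees).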

Conclusion

Feature extraction plays a key role in improving machine learning models by selecting and transforming useful information from raw data. There are multiple techniques available, such as PCA, TF-IDF, and deep learning-based feature extraction. Some algorithms can also perform feature extraction automatically, reducing manual effort and improving performance.

Written by Muhammad Taha

A Software Engineering student passionate about machine learning.
