Random Forest In Machine Learning + Real Solved Examples 🌲🌲🌲

Muhammad Taha
6 min read · Feb 24, 2025


Real code examples and snippets, theory explained alongside code for better understanding, and a look at why we use ensemble learning.

Random Forest is an ensemble learning algorithm that combines multiple decision trees to improve accuracy and reduce overfitting. It is used for classification and regression tasks.

🔹 It builds many decision trees, each trained on a different random subset of the rows and features.
🔹 The final prediction is the majority vote (for classification) or the average of the trees' outputs (for regression), as the sketch below illustrates.
🔹 Averaging many de-correlated trees stabilizes predictions and reduces overfitting compared to a single decision tree.
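
To make the voting idea concrete, here is a minimal sketch (using scikit-learn and its built-in iris data, which are not part of this article's examples) that inspects the individual trees inside a fitted forest and compares their votes to the forest's aggregated prediction:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Toy data purely to illustrate the voting mechanism
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

sample = X[:1]
# Each fitted tree casts its own vote...
tree_votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("Individual tree votes:", tree_votes)
# ...and the forest reports the aggregated prediction
print("Forest prediction:", forest.predict(sample)[0])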

Why Use Random Forest in ML?

✔ High Accuracy: Averaging many trees reduces variance and increases stability.
✔ Robust to Messy Data: Tolerates noisy features well; some implementations can also work with missing values.
✔ Works with Large Datasets: Handles high-dimensional data well.
✔ Resistant to Overfitting: Overfits far less than a single deep decision tree.
✔ Feature Importance: Ranks the most influential features (see the sketch below).
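
As a quick illustration of the feature-importance point, here is a minimal sketch (assuming scikit-learn and its built-in iris dataset, not one of this article's datasets):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(iris.data, iris.target)

# feature_importances_ sums to 1.0; higher values mean more influential features
for name, score in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")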

When to Use Random Forest?

Use Random Forest when:
✅ Your data has complex patterns and non-linear relationships.
✅ You want a robust model that avoids overfitting.
✅ You need feature selection to find important attributes.
✅ Your dataset has noise or missing values.
✅ You want to combine multiple weak models for a strong prediction.

Real-World Examples & Code Implementations

Example 1: Predicting Customer Churn (Classification)

Let’s predict whether a customer will churn (1) or not (0) based on features like tenure and monthly charges.

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Sample Data (Tenure, Monthly Charges)
X = [[1, 50], [2, 55], [3, 60], [4, 70], [5, 75], [6, 80], [7, 85], [8, 90]]
y = [1, 1, 1, 0, 0, 0, 0, 0] # 1 = Churn, 0 = No Churn

# Train Random Forest Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Predict for a new customer
new_customer = [[3, 65]]
prediction = model.predict(new_customer)
print("Prediction:", "Churn" if prediction[0] == 1 else "No Churn")

Output:
Prediction: Churn
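
If you want a churn probability rather than a hard label, the same fitted model from Example 1 also exposes predict_proba (a small continuation sketch, reusing model and new_customer from above):

# Class order follows model.classes_, i.e. [0, 1] here, so index 1 is the churn probability
proba = model.predict_proba(new_customer)[0]
print(f"Churn probability: {proba[1]:.2f}")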

Example 2: House Price Prediction (Regression)

Let’s predict the house price based on square footage and number of bedrooms.

from sklearn.ensemble import RandomForestRegressor

# Sample Data (Square Feet, Bedrooms) -> House Price
X = [[1500, 3], [2000, 4], [2500, 4], [3000, 5], [3500, 5]]
y = [300000, 400000, 500000, 600000, 700000] # Prices

# Train Random Forest Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Predict for a new house
new_house = [[2800, 4]]
predicted_price = model.predict(new_house)
print("Predicted House Price: $", predicted_price[0])

Output:
Predicted House Price: ~$550,000

Example 3: Diagnosing Diabetes (Classification)

Using scikit-learn's built-in diabetes dataset, we binarize the continuous disease-progression target and classify whether a patient's progression score is high (1) or not (0).

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
diabetes = load_diabetes()
X, y = diabetes.data, (diabetes.target > 140).astype(int) # Convert to binary classification

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Diabetes Prediction Accuracy:", accuracy)

Output:
Diabetes Prediction Accuracy: a printed accuracy score (the exact value depends on the train/test split)

Advantages & Disadvantages of Random Forest

Advantages:

✔ High Accuracy: Reduces overfitting compared to a single decision tree.
✔ Robust to Noisy Data: Outliers and irrelevant features hurt it less than many simpler models.
✔ Feature Importance Analysis: Helps in selecting key features.
✔ Scalable: Works well with large datasets.
✔ Works for both Classification & Regression tasks.

Disadvantages:

❌ Slower Training: More trees mean longer training times.
❌ Less Interpretability: Unlike a single decision tree, Random Forests are harder to interpret.
❌ Memory-Intensive: Requires more computational power.
❌ Bias in Unbalanced Data: If one class dominates, Random Forest tends to predict it more often (one mitigation is sketched below).
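
For the class-imbalance point, one common mitigation in scikit-learn is the class_weight parameter; below is a minimal sketch with made-up imbalanced data (not one of the article's datasets):

from sklearn.ensemble import RandomForestClassifier

# Made-up data: 8 majority-class samples vs. 2 minority-class samples
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# class_weight='balanced' re-weights classes inversely to their frequency,
# so the minority class is not simply drowned out by the majority
model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
model.fit(X, y)
print("Prediction for a minority-region sample:", model.predict([[9]])[0])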

Where is Random Forest Used in ML?

✔ Healthcare: Disease prediction (diabetes, cancer detection).
✔ Finance: Credit scoring, fraud detection.
✔ Marketing: Customer segmentation, churn prediction.
✔ E-commerce: Product recommendations, demand forecasting.
✔ Agriculture: Crop yield prediction.
✔ Cybersecurity: Detecting anomalies and fraudulent activities.
✔ Self-Driving Cars: Detecting obstacles and lane changes.

More Example Code Snippets

Employee Attrition Prediction

from sklearn.ensemble import RandomForestClassifier

X = [[1, 3000], [2, 3500], [3, 4000], [4, 4500], [5, 5000]]  # [Years of Experience, Salary]
y = [0, 0, 1, 1, 1]  # 0 = Stays, 1 = Leaves

# Train a classifier for this snippet
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
print("Attrition Prediction for employee (3, 4000):", model.predict([[3, 4000]])[0])

Fraud Detection in Banking

X = [[100, 1], [200, 0], [500, 1], [1000, 0], [2000, 1]]  # [Transaction Amount, Previous Fraud]
y = [0, 0, 1, 0, 1]  # 0 = Legit, 1 = Fraud

# Train a fresh classifier on the fraud data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
print("Fraud Prediction for $1200 transaction:", model.predict([[1200, 1]])[0])

Why Ensemble Learning?

Ensemble learning combines multiple models to improve accuracy, reduce variance, and prevent overfitting. The main techniques include:

1️⃣ Bagging (Bootstrap Aggregating)
2️⃣ Boosting
3️⃣ Stacking (Stacked Generalization)
4️⃣ Voting (Hard & Soft Voting)

1️⃣ Bagging (Bootstrap Aggregating)

📌 Concept:

  • Creates multiple independent models by training them on different random subsets of the data.
  • The final prediction is based on the majority vote (classification) or average (regression).
  • Reduces variance and prevents overfitting.

📌 Best Used When?

  • When a model overfits the training data (e.g., Decision Trees).
  • When high variance is an issue.

📌 Example: Random Forest (Uses Bagging)

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Random Forest (Bagging technique)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
print("Predicted labels:", model.predict(X_test))

🔹 Why Use It?

  • Great for reducing variance.
  • Works well with high-dimensional datasets.
  • Less sensitive to noisy data.
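
Random Forest adds one extra twist on top of bagging: each split also considers only a random subset of features. To see plain bagging in isolation, scikit-learn's BaggingClassifier can wrap an ordinary decision tree; here is a minimal sketch (note: recent scikit-learn versions name the parameter estimator, older ones used base_estimator):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2, random_state=42
)

# Plain bagging: each tree is trained on a bootstrap sample of the rows,
# but (unlike Random Forest) considers every feature at every split
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))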

2️⃣ Boosting

📌 Concept:

  • Builds models sequentially, where each new model corrects the errors of the previous one.
  • Uses weighted voting, meaning misclassified samples get more attention.
  • Reduces bias and improves accuracy.

📌 Best Used When?

  • When a model is underfitting (low complexity).
  • When you need high accuracy.

📌 Example: AdaBoost (Adaptive Boosting)

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Load dataset
X, y = load_iris(return_X_y=True)

# Weak learner (Decision Tree with depth=1)
base_model = DecisionTreeClassifier(max_depth=1)

# Boosting model ('estimator' is the newer scikit-learn parameter name; older versions used 'base_estimator')
boost_model = AdaBoostClassifier(estimator=base_model, n_estimators=50, random_state=42)
boost_model.fit(X, y)

# Predictions
print("Predicted labels:", boost_model.predict(X[:5]))

🔹 Why Use It?

  • Improves weak models (like small decision trees).
  • Performs well on structured/tabular data.
  • Works well for imbalanced datasets.

🚀 Other Boosting Variants:

  • Gradient Boosting (GBM): generally stronger than AdaBoost (a short sketch follows this list).
  • XGBoost: an optimized implementation of GBM.
  • LightGBM: typically faster than XGBoost.
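
As a rough sketch of the gradient-boosting idea mentioned above, scikit-learn's built-in GradientBoostingClassifier can be used without installing XGBoost or LightGBM (iris data assumed purely for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2, random_state=42
)

# Each new tree is fit to the errors of the ensemble built so far;
# learning_rate shrinks every tree's contribution to avoid overshooting
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gbm.fit(X_train, y_train)
print("Gradient Boosting accuracy:", gbm.score(X_test, y_test))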

3️⃣ Stacking (Stacked Generalization)

📌 Concept:

  • Combines predictions of multiple models (e.g., Decision Trees, SVM, and Neural Networks).
  • Uses a meta-model to learn the best combination of these predictions.

📌 Best Used When?

  • When you want to combine different models to get the best performance.
  • When no single model performs well.

📌 Example: Stacking with Logistic Regression as Meta-Model

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Base models
base_models = [
    ('knn', KNeighborsClassifier(n_neighbors=3)),
    ('svm', SVC(kernel='linear', probability=True))
]

# Meta model
stacking_model = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train and predict
stacking_model.fit(X_train, y_train)
print("Stacking Predictions:", stacking_model.predict(X_test))

🔹 Why Use It?

  • More accurate than individual models.
  • Good when base models have diverse strengths.
  • Handles complex relationships well.

4️⃣ Voting (Hard & Soft Voting)

📌 Concept:

  • Combines multiple different models into one prediction.
  • Hard Voting: the majority class wins.
  • Soft Voting: a probability-based weighted average decides.

📌 Best Used When?

  • When different models give slightly different predictions, and you want consensus.
  • When all models perform reasonably well but have different strengths.

📌 Example: Voting Classifier (Using KNN, SVM, and Logistic Regression)

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Base models
model1 = LogisticRegression()
model2 = SVC(probability=True)
model3 = KNeighborsClassifier(n_neighbors=3)

# Voting Classifier (Soft Voting)
voting_model = VotingClassifier(estimators=[
    ('lr', model1), ('svm', model2), ('knn', model3)
], voting='soft')

# Train model (reusing the iris train/test split from the stacking example above)
voting_model.fit(X_train, y_train)

# Predictions
print("Voting Classifier Predictions:", voting_model.predict(X_test))

🔹 Why Use It?

  • Works well when models disagree.
  • Reduces the risk of choosing a single bad model.
  • Simple to implement.

💡 When to Use Which Ensemble Technique?

| Technique | Best For | When to Use? | Example Algorithm |
| --- | --- | --- | --- |
| Bagging | Reducing variance, improving stability | When a model overfits | Random Forest |
| Boosting | Reducing bias, improving weak models | When a model underfits | AdaBoost, XGBoost |
| Stacking | Combining multiple strong models | When no single model is best | Stacked Classifier |
| Voting | Combining different models for consensus | When all models perform well but have different strengths | Hard & Soft Voting |

🚀 Summary

✅ Bagging (Random Forest): Best when reducing variance & overfitting.
✅ Boosting (XGBoost, AdaBoost): Best when reducing bias & improving weak learners.
✅ Stacking: Best when no single model performs well.
✅ Voting: Best when different models work well independently.

📌 Which One to Use?

  • If overfitting → Bagging (Random Forest)
  • If a weak model needs a boost → Boosting (XGBoost, LightGBM)
  • If combining different models → Stacking or Voting

Final Thoughts

Random Forest is one of the most powerful ML algorithms because of its accuracy, robustness, and ability to handle complex datasets. However, if speed and interpretability are important, simpler models like Decision Trees or Logistic Regression might be preferable. 🚀
