A comprehensive implementation of Random Forests from scratch, including theoretical foundations, custom implementations, and extensive experiments on real-world datasets.
This project addresses Question 2 from the Machine Learning Lab Mid-term examination, which requires:
- Part A: Theoretical explanation of Random Forest key ingredients
- Part B: Implementation and experiments on two datasets
- Part C: Comprehensive report with analysis and insights
- Custom Implementation: Decision Tree and Random Forest built from scratch with full mathematical formulations
- Theoretical Foundation: Detailed explanations based on Leo Breiman's 2001 paper
- Dual Experiments: Heart Disease UCI (tabular) and Intel Image Classification (images)
- Comprehensive Analysis: Accuracy plots, confusion matrices, feature importance
- Production-Ready Code: Clean, documented, and reproducible
- Understand and explain the two key ingredients of Random Forests:
  - Strength of individual trees
  - Low correlation between trees
- Conduct experiments to explore:
  - Effect of varying the number of trees (n_estimators)
  - Comparison between Decision Tree and Random Forest
  - Impact on accuracy, stability, and training time
- Provide insights on:
  - How randomness affects model performance
  - Optimal ensemble size
  - Practical trade-offs
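The tree-count experiment described above can be sketched with scikit-learn (already a project dependency). This is an illustrative sketch on a synthetic dataset, not the project's actual experiment code; all variable names here are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: a single decision tree.
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
tree_acc = accuracy_score(y_te, tree.predict(X_te))

# Sweep the ensemble size and record test accuracy for each setting.
results = {}
for n in [1, 10, 50, 100, 300]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42).fit(X_tr, y_tr)
    results[n] = accuracy_score(y_te, rf.predict(X_te))
```

Plotting `results` against the ensemble size typically shows the rapid-improvement-then-plateau behavior the experiments aim to measure.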
```
Build-Random-Forests-From-scratch/
│
├── README.md                            # This file
├── REPORT.md                            # Comprehensive report (Part C)
├── THEORY.md                            # Theoretical foundation (Part A)
├── BREIMAN_SUMMARY.md                   # Summary of Breiman's 2001 paper
│
├── requirements.txt                     # Python dependencies
├── config.py                            # Configuration and parameters
│
├── decision_tree_scratch.py             # Decision Tree from scratch
├── random_forest_scratch.py             # Random Forest from scratch
│
├── utils.py                             # Data loading and preprocessing
├── visualization.py                     # Plotting utilities
│
├── experiments_heart_disease.py         # Experiments on Heart Disease dataset
├── experiments_image_classification.py  # Experiments on Image dataset
├── test_custom_implementation.py        # Testing script
│
├── JSON_RESULTS_STRUCTURE.md            # JSON results documentation
│
├── data/                                # Datasets (downloaded automatically)
│   ├── heart_disease/
│   └── intel_images/
│
└── outputs/                             # Results and plots
    ├── plots/                           # Generated visualizations
    └── results/                         # Experiment results
        ├── *.pkl                        # Pickle format (Python)
        └── *.json                       # JSON format (human-readable)
```
- Python 3.8 or higher
- pip package manager
- Clone the repository:

  ```bash
  git clone https://github.com/codewithdark-git/Build-Random-Forests-From-scratch.git
  cd Build-Random-Forests-From-scratch
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- THEORY.md: Theoretical foundation (Part A)
  - Strength of individual trees
  - Low correlation between trees
  - Mathematical formulations
  - Intuitive explanations
- BREIMAN_SUMMARY.md: Summary of Breiman's 2001 paper
  - Main contributions
  - Theoretical guarantees
  - Key insights
- REPORT.md: Comprehensive report (Part C)
  - Experimental setup
  - Results and analysis
  - Discussion and insights
  - Conclusions
All Python files include:
- Docstrings for classes and functions
- Mathematical formulations in comments
- Type hints for clarity
- Usage examples
Features:
- Gini impurity and entropy splitting criteria
- Configurable max depth, min samples split/leaf
- Feature randomization support
- Prediction and probability estimation
Mathematical Foundations:
- Gini impurity: $Gini(D) = 1 - \sum_{k=1}^{K} p_k^2$
- Entropy: $H(D) = -\sum_{k=1}^{K} p_k \log_2(p_k)$
- Information Gain: $IG = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v)$
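These criteria are simple enough to verify numerically. A minimal NumPy sketch (illustrative, not the project's actual implementation; function names are assumptions):

```python
import numpy as np

def gini(y):
    # Gini(D) = 1 - sum_k p_k^2, where p_k is the class-k frequency.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    # H(D) = -sum_k p_k log2(p_k); np.unique drops absent classes, so p_k > 0.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # IG = H(D) - sum_v |D_v|/|D| * H(D_v) for a binary split.
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

For a balanced binary node `[0, 0, 1, 1]`, Gini is 0.5 and entropy is 1.0; a split that separates the classes perfectly yields an information gain of 1.0.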
Features:
- Bootstrap sampling (bagging)
- Random feature selection at each split
- Parallel tree building (optional)
- Out-of-bag error estimation
- Feature importance calculation
Mathematical Foundations:
- Generalization error bound: $PE^* \leq \bar{\rho} \frac{(1-s^2)}{s^2}$
- Ensemble prediction: $\hat{y} = \text{mode}\{h_1(x), \ldots, h_T(x)\}$
- OOB error: unbiased estimate using the ~36.8% out-of-bag samples
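Two of these facts are easy to check numerically: a bootstrap sample of size $n$ leaves out $(1 - 1/n)^n \approx e^{-1} \approx 36.8\%$ of the rows, and the ensemble output is the mode of the per-tree votes. An illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Bootstrap: draw n row indices with replacement; the rows never drawn
# are the out-of-bag (OOB) set, roughly 1/e of the data.
sample = rng.integers(0, n, size=n)
oob_fraction = 1 - len(np.unique(sample)) / n

# Majority vote over T per-tree predictions for a single sample x.
tree_preds = np.array([1, 0, 1, 1, 0])
vals, counts = np.unique(tree_preds, return_counts=True)
y_hat = vals[np.argmax(counts)]  # mode of {h_1(x), ..., h_T(x)}
```

Here `oob_fraction` lands close to 0.368, matching the OOB estimate's sample size, and the vote `[1, 0, 1, 1, 0]` resolves to class 1.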
- Accuracy: improves rapidly at first, then plateaus
- Stability: variance decreases roughly as $1/T$
- Optimal Range: 100-300 trees for most applications
- No Overfitting: trees can be added without hurting generalization
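The $1/T$ stability claim can be illustrated with a toy simulation in which the "trees" are independent noisy predictors. This is an idealization (real trees are positively correlated, which is exactly why feature randomization matters); the setup below is an assumption for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 5_000

def ensemble_variance(T):
    # Each of T "trees" emits an independent N(0, 1) score for a trial;
    # the ensemble averages them, so Var drops from 1 to roughly 1/T.
    preds = rng.normal(loc=0.0, scale=1.0, size=(n_trials, T))
    return preds.mean(axis=1).var()

v1 = ensemble_variance(1)     # close to 1.0
v100 = ensemble_variance(100)  # close to 0.01
```

With correlated trees the variance floors out at $\bar{\rho}\sigma^2$ instead of shrinking to zero, which is the quantity Breiman's bound controls.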
- Accuracy: RF consistently outperforms the single tree (10-15% improvement)
- Generalization: RF reduces overfitting significantly
- Robustness: RF is more resistant to noise and outliers
- Trade-off: slightly longer training time and less interpretability
- Bootstrap Sampling: Creates diversity in training data
- Feature Randomization: Decorrelates trees effectively
- Optimal $m$: $m = \sqrt{p}$ for classification balances strength and correlation
- Ensemble Size: accuracy improves roughly logarithmically with tree count
Edit config.py to customize:

```python
# Random seed for reproducibility
RANDOM_STATE = 42

# Experiment parameters
N_ESTIMATORS_RANGE = [1, 10, 50, 100, 300]
TEST_SIZE = 0.2

# Random Forest parameters
RF_MAX_FEATURES = 'sqrt'
RF_BOOTSTRAP = True
RF_OOB_SCORE = True

# Image processing
IMAGE_SIZE = (64, 64)
MAX_IMAGES_PER_CLASS = 1000
```

All plots are automatically generated and saved to outputs/plots/:
- Accuracy vs n_estimators: Shows convergence behavior
- Training time comparison: Linear scaling verification
- Tree vs Forest comparison: Bar charts of metrics
- Confusion matrices: Error analysis
- Feature importance: Top contributing features
- Summary plots: Comprehensive 4-panel overview
- numpy: Numerical computations
- pandas: Data manipulation
- scikit-learn: ML algorithms and metrics
- matplotlib: Plotting
- seaborn: Statistical visualizations
- Pillow: Image processing
- joblib: Parallel processing
See requirements.txt for specific versions.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see LICENSE file for details.
- GitHub: @codewithdark-git
- Email: codewithdark@gmail.com
- Leo Breiman for the Random Forests algorithm
- UCI Machine Learning Repository for the Heart Disease dataset
- Kaggle for the Intel Image Classification dataset
- Scikit-learn for reference implementations
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.
- Louppe, G. (2014). Understanding Random Forests: From Theory to Practice.
Note: This project is created for educational purposes as part of a Machine Learning course assignment (Question 2 - Random Forests).
Last Updated: December 2025