Feature Selection Benchmark
A comprehensive benchmark study of feature selection techniques for supervised machine learning predictive models on tabular data.
Overview
This repository implements and evaluates 20 feature selection methods across five categories (filter, wrapper, embedded, hybrid, and advanced) on both synthetic and real-world datasets. The study offers practical insight into how the techniques perform across these scenarios. A pipeline for generating synthetic datasets with varied characteristics and complex relationships is also provided. The data itself is not included in the repository: synthetic datasets can be generated by running synthetic_data_generation/main.py, and the real-world datasets used can be downloaded via Appendix 8.2 of miguel_moral_tfg.pdf.
Key Features
- Implementation of 20 feature selection methods
- Synthetic dataset generation with controlled relationships
- Real-world dataset preprocessing
- Different evaluation frameworks for synthetic and real-world datasets
- Comprehensive benchmarking framework
- Visualization tools for results analysis
Installation
git clone https://github.com/miguelmoralh/feature_selection_benchmark.git
cd feature_selection_benchmark
pip install -r requirements.txt
Usage
- Generate synthetic datasets:
python synthetic_data_generation/main.py
- Process real-world datasets:
python generate_real_world_metadata.py
- Run benchmarks:
python main.py
- Generate results and visualizations:
python generate_results.py
python generate_plots.py
Repository Structure
feature_selection_methods/
Implementation of various feature selection techniques:
Filter Methods
- Bivariate
information_value.py: Implements Weight of Evidence and Information Value based selection
correlation.py: Uses correlation coefficients for feature selection
norm_mutual_info.py: Implements Normalized Mutual Information selection
chi_squared.py: Chi-squared statistical test based selection
- Multivariate
fcbf.py: Fast Correlation-Based Filter selection
mrmr.py: Minimum Redundancy Maximum Relevance algorithm selection
relief_algorithms.py: Relief family algorithms selection
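As an illustration of what a bivariate filter does, the following is a minimal sketch of correlation-based selection in plain Python. It is not the code in correlation.py: the function names, the dict-of-columns layout, and the 0.5 threshold are assumptions chosen for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(features, target, threshold=0.5):
    """Keep features whose |correlation| with the target reaches the threshold.

    `features` maps a column name to its list of values (hypothetical layout).
    """
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return [name for name, score in scores.items() if score >= threshold]


# Toy data: two columns are linearly tied to the target, two are not.
features = {
    "signal": [1, 2, 3, 4, 5, 6],
    "inverse": [6, 5, 4, 3, 2, 1],
    "constant": [7, 7, 7, 7, 7, 7],
    "weak": [1, -1, 1, -1, 1, -1],
}
target = [2, 4, 6, 8, 10, 12]
selected = select_by_correlation(features, target, threshold=0.5)
print(selected)
```

Each feature is scored against the target on its own, which is what makes the filter bivariate; redundancy between features is ignored, and that gap is what the multivariate filters listed above address.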
Embedded Methods
- Importance
rf_feature_importances.py: Random Forest importance-based selection
cb_feature_importances.py: CatBoost importance-based selection
permutation_feature_importance.py: Permutation importance selection implementation
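The idea behind permutation importance can be sketched without any modelling library: shuffle one column at a time and measure how much the score drops. The snippet below is an assumption-laden illustration, not the contents of permutation_feature_importance.py; the `predict` callable stands in for a fitted model, and negative mean squared error is an arbitrary choice of score.

```python
import random


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that higher is better."""
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=5, seed=0):
    """Mean score drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = neg_mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild every row with only column j replaced by shuffled values.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            score = neg_mse(y, [predict(r) for r in shuffled])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the target equals feature 0; feature 1 is unrelated.
rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [r[0] for r in rows]


def predict(row):
    # Stand-in for a fitted model that only uses feature 0 (hypothetical).
    return row[0]


importances = permutation_importance(predict, rows, y)
keep = [j for j, imp in enumerate(importances) if imp > 0]
print(importances, keep)
```

A feature the model never uses gets an importance of exactly zero here, so a simple "importance greater than zero" rule is enough for the toy case; a real selection rule would more plausibly use a tuned threshold.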
Wrapper Methods
- Backward Elimination
sequential_backward_selection.py: Sequential Backward Selection algorithm
- Forward Selection
sequential_forward_selection.py: Sequential Forward Selection algorithm
- Bidirectional
sequential_forward_floating_selection.py: Sequential Forward Floating Selection implementation
sequential_backward_floating_selection.py: Sequential Backward Floating Selection implementation
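The wrapper methods above share a greedy search loop around a model score. A minimal sketch of plain Sequential Forward Selection follows; it is not the repository's implementation. The `score_fn` callable, the `min_gain` stopping rule, and the toy scorer are hypothetical, and in practice the scorer would typically be a cross-validated model metric.

```python
def sequential_forward_selection(candidates, score_fn, min_gain=1e-9):
    """Greedy forward selection.

    Starting from an empty set, repeatedly add the candidate that raises
    `score_fn(selected)` the most; stop when no candidate improves the
    score by more than `min_gain`.
    """
    selected = []
    best_score = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        trial_scores = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trial_scores, key=lambda t: t[0])
        if top_score - best_score <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical per-feature gains used only to drive the toy scorer.
USEFUL = {"a": 0.5, "b": 0.3, "c": 0.0, "d": 0.1}


def toy_score(subset):
    # Additive gains minus a small penalty per selected feature.
    return sum(USEFUL[f] for f in subset) - 0.05 * len(subset)


selected, score = sequential_forward_selection(["a", "b", "c", "d"], toy_score)
print(selected, score)
```

Backward selection runs the same loop in reverse (start with all features, remove the least useful one), and the floating variants add a conditional step in the opposite direction after each move.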
Advanced Methods
boruta.py: Boruta algorithm implementation
shap.py: SHAP-based feature selection
Hybrid Methods
- Advanced-Wrapper
shap_sfs.py: SHAP combined with Sequential Forward Selection
- Embedded-Wrapper
recursive_feature_elimination.py: Recursive Feature Elimination selection implementation
- Filter-Wrapper
nmi_sfs.py: Mutual Information with Sequential Forward Selection
fcbf_sfs.py: FCBF with Sequential Forward Selection
synthetic_data_generator/
- Config
dataset_config.py: Configuration for synthetic dataset generation
interactions.py: Defines feature interaction types
transforms.py: Implements feature transformations
- base_random_generator.py: Base feature generation functionality
- feature_importances.py: Feature importance calculation
- fs_configs.py: Feature selection configurations
- main.py: Main synthetic data generation script
- utils.py: Utility functions for data generation
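To make the idea of "controlled relationships" concrete, here is a small sketch of a synthetic dataset whose target depends on a known subset of features. It does not reproduce the repository's generator: the function name, the sine transform, the product interaction, and the noise level are all assumptions picked for illustration.

```python
import math
import random


def make_synthetic_dataset(n_rows=200, n_noise=3, seed=42):
    """Return (rows, target, informative_names) for a toy regression dataset.

    The target depends on x0 through a nonlinear transform and on the
    interaction x1 * x2; the noise_* columns are independent of the target.
    """
    rng = random.Random(seed)
    names = ["x0", "x1", "x2"] + [f"noise_{i}" for i in range(n_noise)]
    rows, target = [], []
    for _ in range(n_rows):
        row = {name: rng.gauss(0.0, 1.0) for name in names}
        y = math.sin(row["x0"]) + row["x1"] * row["x2"] + rng.gauss(0.0, 0.1)
        rows.append(row)
        target.append(y)
    return rows, target, ["x0", "x1", "x2"]


rows, target, informative = make_synthetic_dataset()
print(len(rows), sorted(rows[0]), informative)
```

Because the informative features are known by construction, a selection method can be judged on whether it recovers them, which is not possible with real-world data.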
Core Scripts
benchmark_loop.py: Main benchmarking implementation
constants.py: Project-wide constants
execution_functions.py: Feature selection execution functions used in benchmark_loop.py
generate_plots.py: Results visualization
generate_real_world_metadata.py: Real-world dataset preprocessing
generate_results.py: Results compilation and analysis
main.py: Main execution script
params_config.py: Model parameters configuration
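The scripts can be chained in the order given under Usage. The helper below is a hypothetical convenience wrapper, not a file in the repository; it assumes it is run from the repository root and defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Script order taken from the Usage section; paths are copied from there
# and are relative to the repository root.
PIPELINE = [
    "synthetic_data_generation/main.py",
    "generate_real_world_metadata.py",
    "main.py",
    "generate_results.py",
    "generate_plots.py",
]


def build_commands(python=sys.executable):
    """Return one [interpreter, script] command per pipeline step."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    # Pass --run to actually execute the steps instead of printing them.
    run_pipeline(dry_run="--run" not in sys.argv)
```

Using `check=True` makes a failed step abort the chain, so later steps do not run against missing or partial outputs.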
utils/
utils_datasets.py: Dataset loading and processing utilities
utils_methods.py: Common method utilities
utils_preprocessing.py: Data preprocessing functions
utils_results_and_plots.py: Results processing and visualization utilities
Results
The benchmark results are stored in the logs directory:
logs/benchmark/: Raw benchmark results
logs/results/: Processed results and analysis
logs/plots/: Generated visualizations
Paper
- 'feature_selection_benchmark.pdf': The written paper of the study
Citation
If you use this work in your research, please cite:
@article{moral2025benchmark,
title={Benchmark of feature selection techniques for tabular data},
author={Moral, Miguel},
journal={Universitat Autònoma de Barcelona},
year={2025}
}
Contact
Miguel Moral - miguelmoralhernandez@gmail.com
Project Link: https://github.com/miguelmoralh/feature_selection_benchmark