Feature Selection Benchmark

A comprehensive benchmark study of feature selection techniques for supervised machine learning predictive models on tabular data.

Overview

This repository implements and evaluates 20 feature selection methods across five categories (filter, wrapper, embedded, hybrid, and advanced) using both synthetic and real-world datasets. The study provides practical insights into how effective each technique is in different scenarios. A pipeline for generating synthetic datasets with varied characteristics and complex relationships is also provided. The datasets themselves are not included in the repository. Synthetic datasets can be generated by running synthetic_data_generation/main.py, and the real-world datasets used in the study can be downloaded via Appendix 8.2 of the miguel_moral_tfg.pdf file.

Key Features

Implementation of 20 feature selection methods
Synthetic dataset generation with controlled relationships
Real-world dataset preprocessing
Different evaluation frameworks for synthetic and real-world datasets
Comprehensive benchmarking framework
Visualization tools for results analysis

Installation

git clone https://github.com/miguelmoralh/feature_selection_benchmark.git
cd feature_selection_benchmark
pip install -r requirements.txt

Usage

Generate synthetic datasets:

python synthetic_data_generation/main.py

Process real-world datasets:

python generate_real_world_metadata.py

Run benchmarks:

python main.py

Generate results and visualizations:

python generate_results.py
python generate_plots.py

Repository Structure

feature_selection_methods/

Implementation of various feature selection techniques:

Filter Methods

Bivariate
- information_value.py: Implements Weight of Evidence and Information Value based selection
- correlation.py: Uses correlation coefficients for feature selection
- norm_mutual_info.py: Implements Normalized Mutual Information selection
- chi_squared.py: Chi-squared statistical test based selection
Multivariate
- fcbf.py: Fast Correlation-Based Filter selection
- mrmr.py: Minimum Redundancy Maximum Relevance algorithm selection
- relief_algorithms.py: Relief family algorithms selection
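
To illustrate what a bivariate filter does, here is a minimal sketch of Normalized Mutual Information scoring for discrete features, using only the standard library. It is not the code in norm_mutual_info.py: the function names, the geometric-mean normalization, and the 0.1 threshold are assumptions made for this example, and the repository's version may normalize differently or handle continuous features.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def normalized_mutual_info(x, y):
    """NMI = I(X;Y) / sqrt(H(X) * H(Y)); 0.0 if either variable is constant."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        return 0.0
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return mutual_info / math.sqrt(hx * hy)

def select_by_nmi(features, target, threshold=0.1):
    """Keep features whose NMI with the target exceeds the (assumed) threshold."""
    scores = {name: normalized_mutual_info(col, target)
              for name, col in features.items()}
    selected = [name for name, score in scores.items() if score > threshold]
    return selected, scores

# Toy data: "copy" duplicates the target, "noise" is independent of it.
target = [0, 1, 0, 1, 0, 1, 0, 1]
features = {
    "copy": [0, 1, 0, 1, 0, 1, 0, 1],
    "noise": [0, 0, 1, 1, 0, 0, 1, 1],
}
selected, scores = select_by_nmi(features, target)
print(selected, scores)
```

Because each feature is scored against the target on its own, a filter like this is cheap but cannot detect redundancy between features, which is what the multivariate methods above address.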

Embedded Methods

Importance
- rf_feature_importances.py: Random Forest importance-based selection
- cb_feature_importances.py: CatBoost importance-based selection
- permutation_feature_importance.py: Permutation importance selection implementation
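
As a rough illustration of permutation importance, the sketch below shuffles one column at a time and measures the drop in accuracy of an already fitted model. The `model` callable, the accuracy metric, and the toy data are hypothetical stand-ins, not taken from permutation_feature_importance.py.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical fitted model: a threshold rule that only looks at feature 0.
def model(row):
    return 1 if row[0] > 0.5 else 0

X = [[i / 100, (i * 7) % 10] for i in range(100)]
y = [model(row) for row in X]

importances = permutation_importance(model, X, y)
print(importances)
```

A feature the model never uses gets an importance of exactly zero, so a selection rule can keep only the features whose importance is clearly positive.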

Wrapper Methods

Backward Elimination
- sequential_backward_selection.py: Sequential Backward Selection algorithm
Forward Selection
- sequential_forward_selection.py: Sequential Forward Selection algorithm
Bidirectional
- sequential_forward_floating_selection.py: Sequential Forward Floating Selection implementation
- sequential_backward_floating_selection.py: Sequential Backward Floating Selection implementation
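
The greedy search behind Sequential Forward Selection can be sketched as follows. This is a simplified illustration, not the repository's implementation: `score_fn` is a hypothetical callable that would typically return a cross-validated model score for a feature subset, and the `min_gain` stopping rule is an assumption.

```python
def sequential_forward_selection(candidates, score_fn, min_gain=1e-9):
    """Greedily add the feature that improves the score most; stop when none helps."""
    selected = []
    best_score = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score - best_score <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Toy scoring function standing in for model performance:
# "a" and "b" are useful, "c" is not, and every feature carries a small cost.
USEFUL = {"a": 0.30, "b": 0.20}

def toy_score(subset):
    return 0.5 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, best_score = sequential_forward_selection(["a", "b", "c"], toy_score)
print(selected, round(best_score, 2))
```

Backward selection runs the same loop in reverse, starting from all features and removing one per step. The floating variants additionally allow a step in the opposite direction after each move.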

Advanced Methods

boruta.py: Boruta algorithm implementation
shap.py: SHAP-based feature selection

Hybrid Methods

Advanced-Wrapper
- shap_sfs.py: SHAP combined with Sequential Forward Selection
Embedded-Wrapper
- recursive_feature_elimination.py: Recursive Feature Elimination selection implementation
Filter-Wrapper
- nmi_sfs.py: Normalized Mutual Information with Sequential Forward Selection
- fcbf_sfs.py: FCBF with Sequential Forward Selection
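
One common way to combine a filter with a wrapper is to let the cheap filter shortlist the features and run the expensive wrapper search only on that shortlist. The sketch below shows this two-stage idea under stated assumptions: the `top_k` cut-off, the precomputed `filter_scores`, and the `score_fn` callable are all hypothetical, and the hybrid methods in this repository may combine the stages differently.

```python
def filter_then_forward_select(filter_scores, score_fn, top_k=3):
    """Shortlist features by filter score, then run a greedy forward search on them."""
    # Stage 1 (filter): keep the top_k features by a cheap relevance score.
    shortlist = sorted(filter_scores, key=filter_scores.get, reverse=True)[:top_k]

    # Stage 2 (wrapper): greedy forward search restricted to the shortlist.
    selected, best = [], score_fn([])
    while True:
        candidates = [f for f in shortlist if f not in selected]
        if not candidates:
            break
        top = max(candidates, key=lambda f: score_fn(selected + [f]))
        top_score = score_fn(selected + [top])
        if top_score <= best:
            break
        selected.append(top)
        best = top_score
    return shortlist, selected

# Hypothetical filter scores (e.g. NMI with the target) and a toy model score.
filter_scores = {"a": 0.9, "b": 0.6, "c": 0.1, "d": 0.05}
USEFUL = {"a": 0.30, "b": 0.20}

def toy_score(subset):
    return 0.5 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)

shortlist, selected = filter_then_forward_select(filter_scores, toy_score)
print(shortlist, selected)
```

The wrapper stage never evaluates "d", which is where the saving in model fits comes from when the original feature set is large.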

synthetic_data_generator/

Config
- dataset_config.py: Configuration for synthetic dataset generation
- interactions.py: Defines feature interaction types
- transforms.py: Implements feature transformations
base_random_generator.py: Base feature generation functionality
feature_importances.py: Feature importance calculation
fs_configs.py: Feature selection configurations
main.py: Main synthetic data generation script
utils.py: Utility functions for data generation
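
To give an idea of what "controlled relationships" means, the following sketch generates a small dataset in which the relevant features are known by construction. The function name, the specific interaction (a product), the transform (a sine), and the noise level are assumptions chosen for illustration and do not reproduce the repository's generator or its configuration.

```python
import math
import random

def generate_synthetic_dataset(n_samples=200, n_noise=3, seed=0):
    """Binary target built from an interaction and a non-linear transform, plus noise features."""
    rng = random.Random(seed)
    rows, target = [], []
    for _ in range(n_samples):
        x1 = rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        x3 = rng.uniform(-1, 1)
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]
        # Signal: x1 * x2 is an interaction, sin(pi * x3) a non-linear transform.
        signal = x1 * x2 + math.sin(math.pi * x3)
        label = 1 if signal + rng.gauss(0, 0.1) > 0 else 0
        rows.append([x1, x2, x3] + noise)
        target.append(label)
    names = ["x1", "x2", "x3"] + [f"noise_{i}" for i in range(n_noise)]
    relevant = ["x1", "x2", "x3"]  # ground truth for evaluating selectors
    return names, rows, target, relevant

names, rows, target, relevant = generate_synthetic_dataset()
print(names, len(rows), sum(target))
```

Because the ground-truth relevant features are known, a selector's output can be compared directly against `relevant`, which is not possible on real-world data.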

Core Scripts

benchmark_loop.py: Main benchmarking implementation
constants.py: Project-wide constants
execution_functions.py: Feature selection execution functions used in benchmark_loop.py
generate_plots.py: Results visualization
generate_real_world_metadata.py: Real-world dataset preprocessing
generate_results.py: Results compilation and analysis
main.py: Main execution script
params_config.py: Model parameters configuration

utils/

utils_datasets.py: Dataset loading and processing utilities
utils_methods.py: Common method utilities
utils_preprocessing.py: Data preprocessing functions
utils_results_and_plots.py: Results processing and visualization utilities

Results

The benchmark results are stored in the logs directory:

logs/benchmark/: Raw benchmark results
logs/results/: Processed results and analysis
logs/plots/: Generated visualizations

Paper

feature_selection_benchmark.pdf: The written paper of the study

Citation

If you use this work in your research, please cite:

@article{moral2025benchmark,
 title={Benchmark of feature selection techniques for tabular data},
 author={Moral, Miguel},
 journal={Universitat Autònoma de Barcelona},
 year={2025}
}

Contact

Miguel Moral - miguelmoralhernandez@gmail.com
Project Link: https://github.com/miguelmoralh/feature_selection_benchmark
