apanichella/Reproducible-ML4SE

Reproducible-ML4SE

This is the replication package for the paper entitled "On Randomization and Reproducibility in Machine Learning for Software Engineering" by Cynthia C.S. Liem and Annibale Panichella.

We provide:

  • complete performance results in tabular form in the file Full_tabular_results.pdf;
  • exact input data files as used in our ML training procedures, structured per predictive SE task, in the folder Data;
  • random seeds, 10-fold cross-validation partition specification, and results for all models, libraries and research questions, in the folder Results.

Structure of the Results folder

Results
├── [task]
│   ├── [dataset]
│   │   ├── folds.csv (used for RQ1)
│   │   ├── random_seeds.csv (used for RQ2)
│   │   ├── [RQ]
│   │   │   ├── [ML model]
│   │   │   │   ├── [library]
│   │   │   │   │   ├── performance_metrics.csv

Each performance_metrics.csv file gives performance results per run for the three metrics of interest (F-measure, AUC, Matthews Correlation Coefficient).
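To get an overview of which results are available, the folder layout above can be walked programmatically. The following is a minimal sketch, not part of the replication package: the function name index_results is hypothetical, and it assumes that each performance_metrics.csv sits exactly five directory levels below the Results folder (task, dataset, RQ, ML model, library). Files at any other depth are skipped rather than guessed at.

```python
from pathlib import Path


def index_results(results_root):
    """List every performance_metrics.csv under results_root with its path labels.

    Assumed layout (hypothetical reading of the tree in this README):
    task/dataset/RQ/model/library/performance_metrics.csv
    """
    root = Path(results_root)
    entries = []
    for csv_path in sorted(root.rglob("performance_metrics.csv")):
        parts = csv_path.relative_to(root).parts
        if len(parts) != 6:
            # Does not match the assumed depth, so do not try to label it.
            continue
        task, dataset, rq, model, library, _ = parts
        entries.append(
            {
                "task": task,
                "dataset": dataset,
                "rq": rq,
                "model": model,
                "library": library,
                "path": csv_path,
            }
        )
    return entries
```

Calling index_results("Results") from the repository root would return one dictionary per results file, which can then be filtered by task, model or library before the file itself is opened.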

How to read folds.csv files

Column fold_j specifies the 10-fold cross-validation partitioning for the j-th experimental run.
Column data_row_index gives the row index of each data point in the dataset.

For example, the row where data_row_index=1 refers to the 1st data point in the csv file in the Data folder for the dataset of interest.
In that row, each column fold_j gives the index of the fold (for 10-fold cross-validation, an integer in the range [1,10]) to which this data point was allocated in the j-th 10-fold cross-validation run.
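The description above can be turned into a train/test split for a given run and fold. The following is a minimal sketch under stated assumptions, not the authors' own code: the function name split_for_run is hypothetical, the file is assumed to be comma-separated with a header row, and the columns are assumed to be named literally data_row_index and fold_1, fold_2, and so on.

```python
import csv


def split_for_run(folds_csv_path, run, fold):
    """Return (train_rows, test_rows) for one fold of one cross-validation run.

    Both lists hold data_row_index values, which are 1-based according to
    the README. Column names are an assumption: 'data_row_index', 'fold_<run>'.
    """
    fold_column = f"fold_{run}"
    train_rows, test_rows = [], []
    with open(folds_csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            row_index = int(row["data_row_index"].strip())
            assigned_fold = int(row[fold_column].strip())
            # The data points allocated to the requested fold form the test
            # set; all remaining data points form the training set.
            if assigned_fold == fold:
                test_rows.append(row_index)
            else:
                train_rows.append(row_index)
    return train_rows, test_rows
```

Because data_row_index starts at 1, subtract 1 from each returned value before using it as a position in a 0-based structure such as a Python list or a pandas DataFrame accessed with iloc.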

Created January 2, 2021
Updated January 2, 2021