m-zakeri/CodART
Source Code Automated Refactoring Toolkit
CodART - Source Code Automated Refactoring Toolkit
Source Code Automated Refactoring Toolkit (CodART) is a refactoring engine with the ability to perform many-objective program transformation and optimization. We have currently focused on automating the various refactoring operations for Java source codes. A complete list of refactoring supported by CodART can be found at CodART refactorings list.
The CodART project is under active development. The current version of CodART works fine on our benchmark projects. To understand how CodART works, read the CodART white-paper.
Your contributions to the project and your comments in the discussion section would be welcomed.
Also, feel free to email and ask any question:
m-zakeri[at]live[dot]com.
System Architecture Overview
Overall system architecture showing containerized services
Reinforcement learning components and data flow
Machine learning training pipeline workflow
Overview
CodART (Source Code Automated Refactoring Toolkit) is a multi-objective program transformation and optimization engine that combines search-based software engineering (SBSE) with automated refactoring operations to improve Java source code quality. The system includes a modern web-based interface, containerized deployment with Docker, reinforcement learning capabilities using PPO algorithms, and advanced machine learning models for testability prediction and intelligent code refactoring.
Key Innovation: CodART integrates traditional search-based refactoring with modern reinforcement learning to create an intelligent system that learns optimal refactoring sequences, making it unique in the automated software refactoring domain.
Key Features
- Automated Java Refactoring: Supports 40+ refactoring operations including Extract Class, Move Method, Extract Interface, and more
- Multi-Objective Optimization: Uses NSGA-II and NSGA-III algorithms to optimize 8+ QMOOD quality metrics simultaneously
- Reinforcement Learning: PPO (Proximal Policy Optimization) algorithm for intelligent refactoring sequence generation
- Testability Prediction: Advanced ML models (RandomForest, GradientBoosting, MLP, VotingRegressor) predict code testability using 262+ source code metrics
- Code Smell Detection: PMD 7.11.0 integration with custom rulesets for automated quality analysis
- SciTools Understand Integration: Professional code analysis engine for parsing and metrics computation
- Web-based UI: Modern React interface with real-time progress tracking and project management
- Containerized Architecture: Docker-based deployment with microservices including API, UI, MinIO, Redis, RabbitMQ
- ANTLR4-based Parsing: Three grammar variants optimized for different parsing performance needs
- Benchmark Integration: 14 benchmark projects for testing and validation including JSON, JFreeChart, Weka
Quick Start with Docker
Prerequisites
- Docker and Docker Compose (Latest versions)
- System Requirements: At least 8GB RAM and 4 CPU cores (12GB+ recommended for large projects)
- SciTools Understand License: Professional license required for code analysis
- Academic licenses available for research purposes
- License activation requires internet connectivity
- Storage: Minimum 20GB free disk space for containers and project data
1. Clone and Setup
git clone https://github.com/m-zakeri/CodART.git
cd CodART2. Environment Configuration
Create a .env file in the project root:
# Project Configuration
PROJECT_ROOT_DIR="/opt/projects"
UDB_ROOT_DIR="/opt/understand_dbs"
BENCHMARK_INDEX=2 # Index from codart/config.py (0-13)
# Search Algorithm Settings
POPULATION_SIZE=15
MAX_ITERATIONS=15
PROBLEM=2 # 0: Simple Genetic, 1: NSGA-II, 2: NSGA-III
NUMBER_OBJECTIVES=8 # QMOOD metrics count
MUTATION_PROBABILITY=0.2
CROSSOVER_PROBABILITY=0.8
# Warm Start Options (Optional)
WARM_START=1 # Enable warm start from previous results
INIT_POP_FILE="/path/to/initial_population.csv" # Optional
CSV_ROOT_DIR="/path/to/jdeodorant_csv" # Optional
# MinIO Credentials (Change for production)
MINIO_ACCESS_KEY=00jFBl7n9Jn0ex0XL7m1
MINIO_SECRET_KEY=kYfujzkdSGjXKLN9oQhPDIVgRUaZRijvj1yaXmIZ
# Experimental Settings (Optional)
USE_CPP_BACKEND=0 # Enable C++ parser backend for performance
EXPERIMENTER="Your Name" # For research tracking
DESCRIPTION="Experiment description" # For result documentation3. Build and Run
# Build and start all services
docker-compose up --build
# Or run in background
docker-compose up -d --build4. Access the Application
- Web Interface: http://localhost:3000 (React UI)
- API Documentation: http://localhost:8000/docs (FastAPI Swagger)
- MinIO Console: http://localhost:9001 (minioadmin/minioadmin)
- RabbitMQ Management: http://localhost:15672 (guest/guest)
- Redis CLI:
docker exec -it codart_redis_1 redis-cli(Direct access)
5. Initial Setup
# Verify all services are running
docker-compose ps
# Check SciTools Understand license
docker exec -it codart_api_1 und license
# Upload a test project via web interface or API
curl -X POST "http://localhost:8000/projects/upload" \
-F "file=@your_project.zip" \
-F "project_name=TestProject"Architecture Components
Core Services
API Container (api)
- FastAPI Backend: RESTful API for all operations
- Celery Worker: Handles ML training and analysis tasks
- SciTools Understand: Code parsing and analysis engine
- PMD Integration: Code smell detection with custom rulesets
- Combined Architecture: API and worker run in same container for license sharing
User Interface (ui)
- React Frontend: Modern web interface with real-time updates
- Project Management: Upload and manage Java projects
- ML Training Interface: Configure and monitor training sessions
- Task Monitoring: Real-time progress tracking with localStorage persistence
Storage Layer
- MinIO: Object storage for models, reports, and temporary files
- Redis: Task result backend and caching
- Docker Volumes: Persistent data storage
Message Queue
- RabbitMQ: Async task processing with queues:
ml_training: Machine learning training tasksml_evaluation: Model evaluation taskscelery: General background tasks
Quality Analysis Components
Code Smell Detection (PMD)
- PMD 7.11.0: Integrated static analysis tool
- Custom Rulesets: Tailored rules for design patterns, complexity, and best practices
- Automated Detection: GodClass, LawOfDemeter, CyclomaticComplexity, etc.
- CSV Reporting: Structured output for refactoring candidate selection
- MinIO Storage: Cloud-based report archival and retrieval
Testability Prediction Engine
- ML Models: 7 different model types (RandomForest, GradientBoosting, MLP, etc.)
- Metric Analysis: 262 comprehensive source code metrics
- Real-time Prediction: Integration with refactoring operations
- Model Variants: Lightweight (68 metrics), Ultra-light (10 metrics), Design-based
- Distributed Training: Celery-based ML pipeline with model versioning
Project Structure
CodART/
├── application/ # FastAPI web service and APIs
│ ├── controllers/ # REST API endpoints
│ │ ├── learning_controller_testability.py
│ │ ├── project_management_controller.py
│ │ ├── rl/ # Reinforcement learning endpoints
│ │ └── reporter/ # Export and download controllers
│ ├── services/ # Business logic services
│ │ ├── minio_training_controller.py
│ │ └── config_integration.py
│ ├── celery_workers/ # Background task processors
│ │ ├── ml_training_task.py
│ │ └── model_prediction_task.py
│ └── main.py # FastAPI application entry point
├── codart/ # Core refactoring engine
│ ├── gen/ # ANTLR4-generated parsers
│ │ ├── JavaParserLabeled.py # Labeled grammar (preferred)
│ │ ├── JavaParserLabeledVisitor.py
│ │ └── JavaParserLabeledListener.py
│ ├── refactorings/ # 40+ refactoring implementations
│ │ ├── extract_class.py, extract_method.py
│ │ ├── move_method.py, move_field.py
│ │ ├── pullup_method.py, pushdown_method.py
│ │ └── handler.py # Refactoring registry
│ ├── metrics/ # Quality metrics computation
│ │ ├── qmood.py # QMOOD metrics (8 objectives)
│ │ ├── testability_prediction.py
│ │ └── learner_testability/ # ML models for testability
│ ├── smells/ # Code smell detection
│ │ ├── long_method.py
│ │ └── map_smell_refactoring.py
│ ├── sbse/ # Search-based software engineering
│ │ ├── search_based_refactoring2.py # Main SBSE engine
│ │ └── simple_genetics.py
│ ├── learner/ # Machine learning components
│ │ ├── genetic.py # Genetic algorithms
│ │ ├── alpha_zero_MCTS.py # Monte Carlo Tree Search
│ │ └── sbr_initializer/ # RL environment setup
│ └── utility/ # Common utilities
│ └── setup_understand.py
├── ui/ # React frontend application
│ ├── src/ # React source code
│ ├── public/ # Static assets
│ ├── Dockerfile # UI container build
│ └── nginx.conf # Production web server config
├── benchmark_projects/ # Test projects (14 Java projects)
├── tests/ # Individual refactoring test cases
├── pmd/ # PMD 7.11.0 code analysis tool
│ ├── bin/pmd # PMD executable
│ ├── rules/custom.xml # Custom rulesets
│ └── lib/ # PMD dependencies
├── scitools/ # SciTools Understand installation
│ ├── bin/ # Understand binaries
│ └── plugins/ # Analysis plugins
├── grammars/ # ANTLR4 grammar files
│ ├── JavaParserLabeled.g4 # Preferred fast grammar
│ ├── JavaParser.g4 # Original fast grammar
│ └── Java9_v2.g4 # Legacy slow grammar
├── docker-compose.yml # Multi-service orchestration
├── Dockerfile.api # API container build
├── Dockerfile.base # Base image with dependencies
└── requirements.txt # Python dependencies
Usage Workflows
1. Web Interface Workflow
- Project Upload: Upload Java projects (ZIP format) via web interface
- Understand Database Creation: Automatic
.unddatabase generation for code analysis - PMD Code Smell Detection: Automated smell detection using custom rulesets
- ML Training Configuration: Configure RL training parameters and objectives
- Training Execution: Monitor real-time progress with Celery task tracking
- Model Evaluation: View training metrics and model performance
- Results Download: Export trained models, reports, and refactored code
2. CLI Workflow
# Direct refactoring execution with SciTools Understand
python codart/refactoring_cli.py \
--udb_path "/path/to/project.und" \
--file_path "/path/to/SourceClass.java" \
--source_class "ClassName" \
--moved_methods "method1,method2" \
--core 0 # 0=Understand, 1=OpenUnderstand
# Search-based multi-objective optimization
python codart/sbse/search_based_refactoring2.py
# Individual refactoring testing
python tests/extract_method/test_1.py
# Testability prediction
python codart/metrics/testability_prediction.py --project-path /path/to/project
# Code smell detection with PMD
./pmd/bin/pmd check -d /path/to/source -R pmd/rules/custom.xml -f csv3. API Integration
# Upload and analyze project
curl -X POST "http://localhost:8000/projects/upload" \
-F "file=@project.zip" \
-F "project_name=MyProject"
# Start ML training with full configuration
curl -X POST "http://localhost:8000/ml-training/train" \
-H "Content-Type: application/json" \
-d '{
"project_id": "123",
"config": {
"population_size": 15,
"max_iterations": 20,
"problem_type": 2,
"objectives": ["ANA", "CAMC", "CIS", "DAM", "DCC", "DSC", "MFA", "MOA"]
}
}'
# Monitor training progress
curl "http://localhost:8000/tasks/{task_id}/status"
# Download results
curl "http://localhost:8000/projects/{project_id}/download/models" -o models.zip
# Get testability prediction
curl -X POST "http://localhost:8000/testability/predict" \
-H "Content-Type: application/json" \
-d '{"project_path": "/opt/projects/MyProject", "model_type": "voting_regressor"}'Machine Learning Features
Testability Prediction Models
CodART implements comprehensive testability prediction using multiple ML approaches:
Model Architecture
- RandomForestRegressor: Primary ensemble model for robust predictions
- GradientBoostingRegressor: High-accuracy gradient-based learning
- MLPRegressor: Neural network for complex pattern recognition
- VotingRegressor: Ensemble combining top 3 models for optimal accuracy
Metric Categories
- Package Metrics (59): Module-level design quality indicators
- Class Lexical Metrics (17): Code complexity and readability measures
- Class Ordinary Metrics (186): Comprehensive structural analysis
- Total: 262 source code metrics for comprehensive analysis
Model Variants
- Full Model: 262 metrics for maximum accuracy
- Lightweight: 68 metrics for fast real-time prediction
- Ultra-light: 10 most important metrics for instant feedback
- Design-based: Graph network analysis using NetworkX
PMD Code Smell Detection
Integrated PMD 7.11.0 provides automated code quality analysis:
Detection Categories
- Design Issues: GodClass, LawOfDemeter, CyclomaticComplexity
- Best Practices: LooseCoupling, UnusedPrivateMethod
- Code Style: UnnecessaryModifier, ProperLogger
- Complexity: NPathComplexity, CognitiveComplexity
Integration Points
- Refactoring Guidance: PMD results guide candidate selection
- Real-time Analysis: Automated execution on project upload
- Cloud Storage: Results archived in MinIO for persistent access
- Custom Rules: Tailored ruleset for refactoring-specific analysis
Reinforcement Learning Training
The system uses Proximal Policy Optimization (PPO) to learn optimal refactoring sequences:
- Environment:
RefactoringSequenceEnvironmentsimulates code transformation - State: Current code metrics and smell indicators
- Actions: Available refactoring operations
- Rewards: Multi-objective improvement in quality metrics
- Training: Experience replay with policy and value networks
Quality Objectives
The system optimizes for 8 design quality objectives:
- ANA (Average Number of Ancestors)
- CAMC (Cohesion Among Methods in Class)
- CIS (Class Interface Size)
- DAM (Data Access Metric)
- DCC (Direct Class Coupling)
- DSC (Design Size in Classes)
- MFA (Measure of Functional Abstraction)
- MOA (Measure of Aggregation)
Supported Refactorings
Structural Refactorings:
- Extract Class, Extract Method, Extract Interface
- Move Method, Move Field, Move Class
- Inline Class, Collapse Hierarchy
Access Control:
- Increase/Decrease Field/Method Visibility
- Encapsulate Field
Inheritance Operations:
- Pull Up Method/Field/Constructor
- Push Down Method/Field
- Make Class Abstract/Concrete/Final
Code Quality:
- Rename Class/Method/Field/Package
- Remove Dead Code
- Replace Conditional with Polymorphism
Configuration
Environment Variables
# Core Paths (Container)
PROJECT_ROOT_DIR="/opt/projects" # Java projects storage
UDB_ROOT_DIR="/opt/understand_dbs" # Understand database files
CSV_ROOT_DIR="/opt/csv_reports" # PMD analysis reports
# SciTools Understand Configuration
STILICENSE="/root/.config/SciTools/License.conf"
STIHOME="/app/scitools" # Understand installation
STIDOSUTILDIR="/root/.config/SciTools" # License directory
UNDERSTAND_API_LICENSE="/root/.local/share/SciTools/Understand/python_api.cfg"
# PMD Configuration
PMD_PATH="/app/pmd/bin/pmd" # PMD executable
PMD_RULESET="/app/pmd/rules/custom.xml" # Custom analysis rules
PMD_CACHE_DIR="/app/pmd/cache" # PMD cache directory
# SBSE Algorithm Configuration
POPULATION_SIZE=15 # GA population size
MAX_ITERATIONS=15 # Maximum generations
NGEN=10 # Alternative iteration setting
PROBLEM=2 # 0=GA, 1=NSGA-II, 2=NSGA-III
NUMBER_OBJECTIVES=8 # QMOOD metrics count
MUTATION_PROBABILITY=0.2 # Mutation rate
CROSSOVER_PROBABILITY=0.8 # Crossover rate
LOWER_BAND=15 # Lower bound for metrics
UPPER_BAND=50 # Upper bound for metrics
# Warm Start Configuration
WARM_START=1 # Enable warm start
INIT_POP_FILE="" # Initial population file
RESUME_EXECUTION="" # Resume from checkpoint
# Service URLs (Container Network)
CELERY_BROKER_URL="amqp://guest:guest@rabbitmq:5672//"
CELERY_RESULT_BACKEND="redis://redis:6379/0"
MINIO_ENDPOINT="minio:9000"
MINIO_ACCESS_KEY="00jFBl7n9Jn0ex0XL7m1"
MINIO_SECRET_KEY="kYfujzkdSGjXKLN9oQhPDIVgRUaZRijvj1yaXmIZ"
# Performance Options
USE_CPP_BACKEND=0 # Enable C++ parser (faster)
QT_QPA_PLATFORM="offscreen" # Headless Qt for Understand
# Research Tracking (Optional)
EXPERIMENTER="Researcher Name"
SCRIPT="search_based_refactoring2.py"
DESCRIPTION="Experiment description"Benchmark Projects
The system includes 14 benchmark projects (defined in codart/config.py):
| Index | Project | Description | Size |
|---|---|---|---|
| 0 | JSON20201115 | JSON parsing library | Small |
| 1 | JFreeChart | Chart generation library | Large |
| 2 | Weka | Machine learning toolkit | Large |
| 3 | FreeMind | Mind mapping software | Medium |
| 4 | Commons-codec | Apache commons codec | Small |
| 5 | JRDF | RDF framework | Medium |
| 6 | JMetal | Multi-objective optimization | Medium |
| 7 | AntApache | Build automation tool | Large |
| 8-13 | Additional projects | Various Java applications | Varies |
Configuration: Set BENCHMARK_INDEX (0-13) in .env file or codart/config.py.
Project Structure: Each benchmark includes:
- Source code in standard Maven/Gradle structure
- Pre-generated
.unddatabase file - PMD analysis reports (CSV format)
- Initial metrics baseline
- Code smell detection results
Development
Local Development Setup
# Install Python dependencies
pip install -r requirements.txt
# Setup SciTools Understand (Local Installation)
export PYTHONPATH="/opt/scitools/bin/linux64/Python:$PYTHONPATH"
export PATH="/opt/scitools/bin/linux64:$PATH"
export LD_LIBRARY_PATH="/opt/scitools/bin/linux64:$LD_LIBRARY_PATH"
# Activate Understand license
und -setofflinereplycode YOUR_LICENSE_CODE
# Install PMD (if not using Docker)
wget https://github.com/pmd/pmd/releases/download/pmd_releases%2F7.11.0/pmd-bin-7.11.0.zip
unzip pmd-bin-7.11.0.zip -d /opt/pmd
# Start development services
# Terminal 1: API server
uvicorn application.main:app --reload --host 0.0.0.0 --port 8000
# Terminal 2: Celery worker
celery -A application.celery_workers.ml_training_task worker --loglevel=info
# Terminal 3: UI development server
cd ui && npm install && npm start
# Terminal 4: Redis (if not using Docker)
redis-server
# Terminal 5: RabbitMQ (if not using Docker)
rabbitmq-serverGrammar Development
# Generate parser from grammar (requires ANTLR4)
cd grammars
antlr4 -Dlanguage=Python3 JavaParserLabeled.g4 -visitor -listener
mv *.py ../codart/gen/
# Test grammar parsing speed
python tests/grammar_speed_tests/test_performance.pyAdding New Refactorings
-
Create refactoring module in
codart/refactorings/:from codart.gen.JavaParserLabeledListener import JavaParserLabeledListener class MyRefactoring(JavaParserLabeledListener): def __init__(self, source_class, target_info): self.source_class = source_class # Implementation details
-
Inherit from appropriate base class:
JavaParserLabeledListener(recommended)JavaParserLabeledVisitor(for complex traversals)- Import from
codart.gen.JavaLabledpackage
-
Add comprehensive tests in
tests/directory:tests/my_refactoring/ ├── test_1.py # Main test script ├── input.java # Test input code ├── expected.java # Expected output └── README.md # Test documentation
-
Register refactoring in
codart/refactorings/handler.py:from .my_refactoring import MyRefactoring REFACTORING_REGISTRY = { 'my_refactoring': MyRefactoring, # ... other refactorings }
-
Test on benchmark projects:
python codart/sbse/search_based_refactoring2.py
-
Update documentation with refactoring details and examples
Testing
# Run individual refactoring tests
python tests/extract_method/test_1.py
python tests/move_method/test_move_method.py
python tests/pullup_method/test_pullup.py
# Test specific refactoring with custom input
python -c "from codart.refactorings.extract_class import ExtractClass; \
ec = ExtractClass('input.java', 'SourceClass', ['field1', 'method1']); \
ec.do_refactor()"
# Run all tests in a category
find tests/ -name "test_*.py" -exec python {} \;
# Test on benchmark projects (full SBSE)
python codart/sbse/search_based_refactoring2.py
# Test PMD integration
./pmd/bin/pmd check -d benchmark_projects/JSON20201115/src \
-R pmd/rules/custom.xml -f csv -r results.csv
# Test Understand integration
python codart/utility/understand_install_test.py
# Test ML components
python codart/metrics/testability_prediction.py --test
# Performance testing
python tests/grammar_speed_tests/benchmark_parsing.pyTroubleshooting
Common Issues
SciTools License Error:
# Check license status
docker exec -it codart_api_1 und license
# Reactivate license
docker exec -it codart_api_1 /app/activate_license.shMemory Issues:
- Increase container memory limit in docker-compose.yml
- Reduce population size in configuration
- Use smaller benchmark projects for testing
Build Failures:
# Clean rebuild
docker-compose down -v
docker-compose build --no-cache
docker-compose upPerformance Optimization
- Use fast grammar
JavaParserLabeled.g4for new development - Enable C++ backend for faster parsing (optional)
- Configure appropriate population size based on available resources
- Use SSD storage for Docker volumes
Contributing
We welcome contributions! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Development Guidelines
- Follow existing code patterns and naming conventions
- Use
JavaParserLabeled.g4for new refactoring implementations - Test on individual files before benchmark projects
- Document new refactoring operations
- Follow security best practices
Citation
If you use CodART in your research, please cite:
@misc{codart2024,
title={CodART: Source Code Automated Refactoring Toolkit},
author={Zakeri, Morteza and contributors},
year={2024},
url={https://github.com/m-zakeri/CodART}
}License
This project is licensed under the MIT License - see the LICENSE file for details.
Links
- Official Documentation
- Refactoring Catalog
- Code Smells Reference
- Benchmark Projects
- API Documentation (when running locally)
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: m-zakeri[at]live[dot]com
CodART is actively developed at IUST Reverse Engineering Laboratory



