# MALTO — 2nd Place Solution
2nd place on the MALTO Recruitment Hackathon hosted by MALTO and Politecnico di Torino.
| Metric | Score |
|---|---|
| OOF F1 — transformer only (5-fold CV) | 0.9575 ± 0.0044 |
| OOF F1 — after ensemble & threshold tuning | 0.9605 |
| Public LB (Macro F1) | 0.95919 |
## Task
Classify each text as human-written or attribute it to the AI model that generated it, across 6 classes:
| Class | Train Samples | Share |
|---|---|---|
| Human | 1,520 | 63.3% |
| ChatGPT | 320 | 13.3% |
| Gemini | 240 | 10.0% |
| Grok | 160 | 6.7% |
| DeepSeek | 80 | 3.3% |
| Claude | 80 | 3.3% |
The main challenge is severe class imbalance (a 19:1 ratio between the largest and smallest classes), with DeepSeek and Grok as the hardest minority classes.
## Solution
The solution ensembles a fine-tuned transformer with a classical n-gram model, optimised via Nelder-Mead on out-of-fold predictions.
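The classical branch can be sketched as follows. This is a toy reconstruction from the Key Techniques table, not the repository's actual code: the corpus and labels are placeholders, and the 50k-feature caps are nominal on data this small.

```python
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Toy corpus standing in for the real training texts and labels.
texts = ["the cat sat", "stochastic gradient descent", "hello world", "deep nets"] * 5
labels = [0, 1, 0, 1] * 5

# Char 3-5 grams plus word 1-2 grams, mirroring the 50k + 50k setup.
features = make_union(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=50_000),
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=50_000),
)
# LinearSVC has no predict_proba; CalibratedClassifierCV wraps it to emit probabilities.
clf = make_pipeline(features, CalibratedClassifierCV(LinearSVC(C=5.0), cv=3))
clf.fit(texts, labels)
proba = clf.predict_proba(["gradient descent"])
```

Calibration matters here because the ensemble blends probabilities, not hard votes; raw SVM decision values would be on a different scale than the transformer's softmax.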
### Pipeline

```
ModernBERT-base (5-fold CV) ─┬─ Temperature ─┬─ Nelder-Mead ────── Threshold ── Submission
                             │  scaling      │  per-class blend    nudge
Full-data ModernBERT (7 ep) ─┘               │
                                             │
TF-IDF + Calibrated SVC (5-fold CV) ─────────┘
```
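The per-class Nelder-Mead blend can be sketched like this. The OOF matrices below are synthetic stand-ins; the assumed parameterisation (one blend weight per class, clipped to [0, 1], macro-F1 objective, 12 random restarts) follows the Key Techniques table but the exact scheme in the repository may differ.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n, k = 300, 6
y = rng.integers(0, k, n)                       # toy OOF labels

def toy_probs(noise):
    """Softmax of noisy one-hot logits, standing in for real OOF predictions."""
    logits = np.eye(k)[y] * 3 + rng.normal(0, noise, (n, k))
    e = np.exp(logits - logits.max(1, keepdims=True))
    return e / e.sum(1, keepdims=True)

p_bert, p_svc = toy_probs(1.0), toy_probs(2.0)  # transformer is the stronger model

def neg_macro_f1(w):
    w = np.clip(w, 0, 1)                        # one blend weight per class
    blend = w * p_bert + (1 - w) * p_svc
    return -f1_score(y, blend.argmax(1), average="macro")

# Restart Nelder-Mead from 12 random initialisations and keep the best result.
best = min(
    (minimize(neg_macro_f1, rng.uniform(0, 1, k), method="Nelder-Mead")
     for _ in range(12)),
    key=lambda r: r.fun,
)
weights = np.clip(best.x, 0, 1)
```

Restarting matters because macro F1 of an argmax is piecewise constant in the weights, so a single simplex run can stall on a flat region.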
### Key Techniques
| Component | Details |
|---|---|
| Transformer | ModernBERT-base fine-tuned with LDAM loss, gradual DRW (20× cap), label smoothing (ε=0.1) |
| Optimizer | AdamW with layer-wise learning rate decay (LLRD=0.9), cosine schedule, 10% warmup |
| Classical Model | TF-IDF (50k char 3-5 grams + 50k word 1-2 grams) → Calibrated LinearSVC (C=5.0) |
| Ensemble | Per-class Nelder-Mead optimisation over 12 random initialisations on OOF predictions |
| Full-data Model | Trained on all 2,400 samples (7 epochs, LR×0.8), blended with fold-average at α=0.6 |
| Post-processing | Temperature scaling (T=0.30) + conservative per-class threshold nudge [0.85, 1.20] |
| Training | Kaggle T4×2 GPUs via DataParallel, ~155 min total |
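The LDAM + DRW row above can be illustrated numerically. This is a NumPy sketch of the margin adjustment only (the real loss is a PyTorch module); the DRW weighting shown is the standard effective-number scheme with the 20× cap from the table, and the β value is an assumption.

```python
import numpy as np

def ldam_adjust(logits, y, class_counts, max_m=0.5, s=30.0):
    """Subtract a per-class margin m_j ∝ n_j^(-1/4) from the true-class logit (LDAM)."""
    m = class_counts.astype(float) ** -0.25
    m = max_m * m / m.max()                 # rarest class gets the largest margin
    adj = logits.copy()
    adj[np.arange(len(y)), y] -= m[y]
    return s * adj                          # scaled logits fed to cross-entropy

counts = np.array([1520, 320, 240, 160, 80, 80])  # class sizes from the Task table
logits = np.zeros((1, 6))
y = np.array([4])                                 # one DeepSeek sample
adj = ldam_adjust(logits, y, counts)

# DRW: after warm-up, re-weight the loss by effective numbers, capped at 20x.
beta = 0.999                                      # assumed; not stated in the README
w = (1 - beta) / (1 - beta ** counts)
w = np.minimum(w / w.min(), 20.0)
```

The effect is that a DeepSeek sample must clear a larger logit margin than a Human sample before the loss is satisfied, which pushes the decision boundary away from the minority class.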
## Per-Class OOF Performance
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Human | 1.00 | 1.00 | 1.00 |
| DeepSeek | 0.85 | 0.82 | 0.84 |
| Grok | 0.92 | 0.92 | 0.92 |
| Claude | 1.00 | 1.00 | 1.00 |
| Gemini | 0.99 | 1.00 | 0.99 |
| ChatGPT | 1.00 | 1.00 | 1.00 |
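The post-processing step from the Key Techniques table (temperature scaling at T=0.30 plus a per-class threshold nudge) can be sketched as below. The exact nudge mechanism is not spelled out in the table; here it is assumed to be a per-class multiplier on probabilities before the argmax, with illustrative factors.

```python
import numpy as np

def postprocess(logits, T=0.30, nudge=None):
    """Temperature-scale logits, then apply per-class multipliers before argmax."""
    z = logits / T                               # T < 1 sharpens the distribution
    e = np.exp(z - z.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    if nudge is not None:                        # per-class factors in [0.85, 1.20]
        p = p * nudge
    return p.argmax(axis=1)

logits = np.array([[2.0, 1.8, 0.1]])
nudge = np.array([0.85, 1.20, 1.0])              # hypothetical factors
pred = postprocess(logits, nudge=nudge)
```

Because only the argmax is submitted, multiplying probabilities by bounded per-class factors is equivalent to shifting per-class decision thresholds, which is why a conservative range like [0.85, 1.20] can recover a few minority-class samples without destabilising the majority classes.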
## Score Progression
| Submission | Method | Public LB |
|---|---|---|
| TF-IDF + LinearSVC baseline | Classical only | 0.84123 |
| DeBERTa 5-fold | Transformer only | 0.91648 |
| Weighted vote (DeBERTa + SVC + LR) | Multi-model ensemble | 0.92170 |
| ModernBERT + LDAM + DRW (3-fold) | Single transformer | 0.94120 |
| ModernBERT + SVC ensemble (5-fold) | Per-class Nelder-Mead | 0.95341 |
| Final submission | Content-informed correction | 0.95919 |
## Prediction Analysis
Beyond training metrics, several diagnostic checks helped validate and understand the model's predictions on the test set.
### Expected Class Distribution (Sanity Check)
Assuming the test set follows the same class ratios as training, the expected counts in 600 test samples are:
| Class | Train share | Expected (600) | Predicted |
|---|---|---|---|
| Human | 63.3% | ~380 | 381 |
| ChatGPT | 13.3% | ~80 | 81 |
| Gemini | 10.0% | ~60 | 60 |
| Grok | 6.7% | ~40 | 39 |
| DeepSeek | 3.3% | ~20 | 20 |
| Claude | 3.3% | ~20 | 19 |
Distribution alignment is a strong signal that the model is well-calibrated. Large deviations (e.g. predicting 50 Grok and only 8 DeepSeek) indicate systematic classifier bias.
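This sanity check is a few lines of Python. The drift threshold of 3 samples is illustrative, not a value from the solution:

```python
shares = {"Human": 0.633, "ChatGPT": 0.133, "Gemini": 0.100,
          "Grok": 0.067, "DeepSeek": 0.033, "Claude": 0.033}
predicted = {"Human": 381, "ChatGPT": 81, "Gemini": 60,
             "Grok": 39, "DeepSeek": 20, "Claude": 19}

# Flag any class whose predicted count deviates far from the train-share expectation.
for cls, share in shares.items():
    expected = share * 600
    drift = abs(predicted[cls] - expected)
    assert drift <= 3, f"{cls} drifts by {drift:.0f} — possible classifier bias"
```

Running this on the table above passes for every class; a hypothetical output of 50 Grok predictions would trip the Grok assertion by roughly 10 samples.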
### Classifier Agreement Analysis
Comparing transformer ensemble vs calibrated SVC across 600 test samples revealed 20 disagreements (96.7% agreement). All disagreements were DeepSeek ↔ Grok confusions — no Human ↔ AI errors were found.
| Signal | Transformer | SVC |
|---|---|---|
| DeepSeek predicted | 20 | 8 |
| Grok predicted | 39 | 50 |
The SVC systematically over-predicts Grok and under-predicts DeepSeek: TF-IDF n-gram models lack the semantic depth to distinguish these two models on short, fact-dense texts. When the transformer and SVC disagreed on a DeepSeek/Grok call, the transformer's prediction was taken as the more reliable one.
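The agreement analysis reduces to comparing two label arrays. The arrays below are toy stand-ins for the 600 test predictions, not the real outputs:

```python
import numpy as np
from collections import Counter

# Toy prediction arrays standing in for the two models' 600 test-set labels.
bert = np.array(["DeepSeek"] * 20 + ["Grok"] * 39 + ["Human"] * 541)
svc  = np.array(["DeepSeek"] * 8  + ["Grok"] * 51 + ["Human"] * 541)

mask = bert != svc
pairs = Counter(zip(bert[mask], svc[mask]))   # which (transformer, SVC) pairs disagree
agreement = 1 - mask.mean()
```

On the real predictions, `pairs` contained only DeepSeek/Grok combinations, which is what localised the SVC's bias to that single class boundary.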
### Hard Class Characteristics
| Class | OOF F1 | Why it's hard |
|---|---|---|
| DeepSeek | 0.84 | Only 80 training samples; style overlaps with Grok on short technical texts |
| Grok | 0.92 | 160 samples; shares register with ChatGPT on opinion topics |
| Others | ≥0.99 | Large sample counts; highly distinctive style signatures |
### Features Used to Evaluate Disputed Samples
For the 20 transformer–SVC disagreements, each sample was evaluated along four axes:
- Text length (word count) — very short texts (< 80 words) carry less signal; classification is less reliable
- Topic / domain — certain topics are associated with specific AI writing styles
- SVC calibrated confidence — `predict_proba` from `CalibratedClassifierCV`; scores below 0.70 indicate low certainty
- Transformer softmax gap — margin between the top-1 and top-2 logits; a narrow gap flags genuinely ambiguous samples
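Three of the four axes are mechanical to compute; a sketch with made-up inputs is below. The gap here is taken between the top two softmax probabilities rather than raw logits, and the function name is illustrative:

```python
import numpy as np

def ambiguity_features(logits, svc_proba, text):
    """Per-sample signals used to triage transformer-SVC disagreements."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                                # softmax over the 6 classes
    top2 = np.sort(p)[-2:]
    return {
        "word_count": len(text.split()),        # < 80 words: low signal
        "svc_confidence": float(svc_proba.max()),  # < 0.70: low certainty
        "softmax_gap": float(top2[1] - top2[0]),   # narrow gap: genuinely ambiguous
    }

feats = ambiguity_features(
    np.array([2.0, 1.9, 0.1, 0.0, -1.0, -1.0]),       # toy transformer logits
    np.array([0.05, 0.05, 0.05, 0.55, 0.2, 0.1]),     # toy SVC probabilities
    "a short fact-dense paragraph",
)
```

The topic/domain axis is the only one that required manual inspection rather than a computed score.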
## Repository Structure

```
MALTO/
├── notebooks/
│ └── solution.ipynb # Full pipeline notebook
├── src/
│ ├── features.py # 46-feature stylometric extractor
│ ├── models.py # LDAM loss, temperature scaling, ensemble utils
│ └── utils.py # Data I/O and submission helpers
├── scripts/
│ └── generate_figures.py # Competition result visualizations
├── malto_model/
│ ├── ensemble_config.json # Saved ensemble parameters and label map
│ ├── char_tfidf.pkl # TF-IDF character n-gram vectorizer
│ ├── word_tfidf.pkl # TF-IDF word n-gram vectorizer
│ └── svc_model.pkl # Calibrated LinearSVC
├── docs/
│ └── writeup.md # Detailed technical write-up
├── figures/
│ └── competition_results.png # Score progression + leaderboard chart
├── archive/ # Previous experiment notebooks and submissions
├── environment.yml # Conda environment spec
├── requirements.txt
├── CONTRIBUTING.md
├── LICENSE
└── README.md
```
## Reproducing the Results
### On Kaggle (recommended)

1. Upload `notebooks/solution.ipynb` to a Kaggle notebook
2. Enable GPU T4×2 in Settings → Accelerator
3. Attach the competition dataset
4. Run All Cells (~155 min)

The notebook auto-detects `/kaggle/input/` vs local paths.
### Load the transformer locally

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("malto_model")
tokenizer = AutoTokenizer.from_pretrained("malto_model")
```

### Requirements

```
torch>=2.0
transformers>=4.40
scikit-learn>=1.4
scipy>=1.12
numpy>=1.24
pandas>=2.0
joblib>=1.3
tqdm>=4.65
matplotlib>=3.8
```

See `environment.yml` for the full reproducible conda environment.
## License
MIT — see LICENSE.
