Poorman's AR-DiT TTS 📢
Keywords: ARDiT, AR-DiT, Autoregressive Diffusion Transformer, TTS, Text-to-Speech, Mel-Spectrogram
A resource-friendly Text-to-Speech system inspired by ARDiT (Autoregressive Diffusion Transformer), combining an autoregressive Transformer (a Qwen3 LLM) with a diffusion model. It generates Mel spectrograms through a diffusion process, then converts them to audio with a vocoder.
✨ A minimal AR-DiT TTS training and inference pipeline that can train on an 8000-hour dataset with a single RTX 5090 (32 GB) and produce intelligible speech synthesis within two days.
PS: The diffusion backbone uses RFWave's ConvNeXt architecture, not DiT.
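For intuition, the Mel decoder's sampling loop in an RFWave-style rectified-flow model is conceptually just a few Euler steps along a learned velocity field. The sketch below is illustrative only; `velocity_model`, its signature, and the step count are hypothetical, not this repo's API:

```python
import torch

@torch.no_grad()
def sample_mel(velocity_model, text_cond, shape, num_steps=4):
    # Rectified-flow sampling: integrate dx/dt = v(x, t, cond) from
    # t=0 (pure noise) to t=1 (data) with fixed-step Euler.
    x = torch.randn(shape)  # (batch, n_mels, frames), start from noise
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v = velocity_model(x, t0.expand(shape[0]), text_cond)  # predicted velocity
        x = x + (t1 - t0) * v  # Euler step toward the Mel spectrogram
    return x
```

In the full pipeline, the autoregressive Qwen3 Transformer supplies the conditioning for each Mel segment, and the vocoder converts the sampled spectrogram to audio.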
🌟 Why Choose This Project?
- 🚀 Resource-Friendly: A single RTX 5090 (32 GB) can handle training on an 8000-hour dataset
- 📦 Minimal Implementation: Clean, concise code that is easy to understand and modify, suitable for learning and development
- 🇨🇳 Chinese-Friendly: Complete Chinese documentation and a Chinese data processing pipeline
- 🤗 Ready to Use: Pre-trained models and processed datasets are provided for a quick start
- 💡 Practical-Oriented: Intelligible results within two days, better quality with longer training; practical rather than perfect
🎵 Generation Examples
Audio samples generated by the trained model.
📦 Installation
```bash
pip install -r requirements.txt
```

🚀 Quick Start
🤗 Inference with Pre-trained Model
We provide pre-trained models on Hugging Face that you can use directly:
Download Pre-trained Model:
```bash
huggingface-cli download laupeng1989/armel-checkpoint --local-dir ./models/armel-checkpoint
```

Run Inference:
```bash
python3 scripts/mel_inference.py \
--model_path ./models/armel-checkpoint/ \
--text example_data/transcript/fanren_short.txt \
--ref_audio fanren08 \
--output_path output/generated \
--dtype bfloat16
```

Output Files:
- output/generated.wav: Generated audio
- output/generated.png: Mel spectrogram visualization
- output/generated.npy: Mel spectrogram array
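To sanity-check a run, the saved .npy array can be inspected directly. A quick look with NumPy/Matplotlib (the (n_mels, frames) orientation is an assumption about the array layout):

```python
import numpy as np
import matplotlib.pyplot as plt

mel = np.load("output/generated.npy")   # saved Mel spectrogram array
print(mel.shape, mel.dtype)             # expected something like (n_mels, frames)

plt.imshow(mel, aspect="auto", origin="lower")
plt.xlabel("frames")
plt.ylabel("mel bins")
plt.savefig("output/generated_check.png")
```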
🎧 Reference Audio Instructions
The --ref_audio parameter specifies the reference audio name (without extension). The script will read the corresponding .wav and .txt files from the example_data/voice_prompts/ directory:
```
example_data/voice_prompts/
├── fanren08.wav   # Reference audio
├── fanren08.txt   # Text corresponding to the reference audio
├── fanren09.wav
└── fanren09.txt
```
You can add your own reference audio by placing the audio file and corresponding text file in this directory.
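For example, a small helper like this (file names are illustrative) registers a new prompt:

```python
from pathlib import Path
import shutil

prompt_dir = Path("example_data/voice_prompts")
name = "myvoice"  # becomes the value passed to --ref_audio

# Copy the recording and write its transcript next to it.
shutil.copy("my_recording.wav", prompt_dir / f"{name}.wav")
(prompt_dir / f"{name}.txt").write_text("Transcript of the recording.", encoding="utf-8")
```

After that, pass --ref_audio myvoice to scripts/mel_inference.py.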
🔥 Training from Scratch
If you want to train your own model from scratch, follow these steps.
🤗 Training Dataset
We provide a processed training dataset on Hugging Face:
- Training Dataset: laupeng1989/armel-dataset
Download Dataset:
```bash
huggingface-cli download laupeng1989/armel-dataset --repo-type dataset --local-dir ./data/armel-dataset
```

💡 Tip: If you use the Hugging Face dataset, you can skip the "Data Preparation" section below and proceed directly to training.
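If you want to peek at the downloaded data, something like the following may work, assuming the repo is in a layout the `datasets` library can read (an assumption, not a documented format):

```python
from datasets import load_dataset

ds = load_dataset("laupeng1989/armel-dataset", split="train")
print(ds)            # column names and number of rows
print(ds[0].keys())  # fields of one training example
```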
📝 Data Preparation
1️⃣ Prepare Raw Data
This project uses the Amphion Emilia preprocessor to process raw audio data.
Processed data format:
```
example_data/
├── 仙逆 第87集 身...（下） [638031163].json
├── 仙逆 第87集 身...（下） [638031163]_000000.m4a
├── 仙逆 第87集 身...（下） [638031163]_000001.m4a
├── 仙逆 第87集 身...（下） [638031163]_000002.m4a
└── ...
```
JSON file format (contains segmentation info and text):
```json
[
  {
    "duration": 10.94,
    "text": "[SPEAKER_00] 欢迎收听...",
    "speaker": 0,
    "parts": [
      {
        "text": "[SPEAKER_00] 欢迎收听...",
        "start": 4.5125,
        "end": 10.1525,
        "speaker": 0,
        "language": "zh"
      }
    ]
  }
]
```
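Each JSON entry carries the transcript and timing for one clip. A hedged sketch of walking this format (the assumption that the i-th entry maps to the _%06d-numbered .m4a is mine, inferred from the listing above):

```python
import json
from pathlib import Path

data_dir = Path("example_data")
for meta_path in data_dir.glob("*.json"):
    segments = json.loads(meta_path.read_text(encoding="utf-8"))
    for i, seg in enumerate(segments):
        # Assumed mapping: i-th JSON entry <-> clip suffix _%06d.
        clip = meta_path.with_name(f"{meta_path.stem}_{i:06d}.m4a")
        for part in seg["parts"]:
            print(clip.name, part["start"], part["end"], part["text"])
```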
2️⃣ Build Training Dataset
Use build_dataset.py to convert raw data to the training format:
```bash
python scripts/build_dataset.py \
--data_dir <your_raw_data_dir> \
--output_dir <your_output_dir> \
--num_proc 8 \
--test_samples 100 \
--random_seed 42
```

🔥 Training
💻 Training Hardware
This project was trained on NVIDIA RTX 5090 (32GB).
⚡ Training Command
Prepare Qwen3 Model:
model.llm_model_path can be:
- Local path: e.g., ./Qwen3-0.6B (requires downloading in advance)
- Hugging Face model name: e.g., Qwen/Qwen3-0.6B (auto-downloads, but the first training run will be slower)

We recommend downloading the model locally first:
```bash
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B
```

Training Command:
```bash
python3 scripts/mel_train.py \
dataset.train_dataset_path=<your_train_data_path> \
dataset.valid_dataset_path=<your_valid_data_path> \
model.llm_model_path=./Qwen3-0.6B \
model.rfmel.batch_mul=2 \
training.batch_size=4 \
dataset.max_tokens=1024 \
training.num_workers=16 \
training.learning_rate=0.0001 \
training.log_dir=<your_log_dir> \
training.diffusion_extra_steps=4 \
training.check_val_every_n_epoch=1 \
model.use_skip_connection=true \
model.estimator.hidden_dim=512 \
model.estimator.intermediate_dim=1536 \
model.estimator.num_layers=8
```

Note: Lightning automatically detects and uses all available GPUs with the DDP strategy. You may need to adjust batch_size, batch_mul, and max_tokens based on your hardware configuration.
📤 Export Model
After training, export the model for inference:
```bash
python scripts/mel_export_checkpoint.py \
--ckpt_path <your_checkpoint_path>/last.ckpt \
--output_path ./exported_model/
```

Or specify the checkpoints directory directly (the script automatically selects the latest checkpoint):
```bash
python scripts/mel_export_checkpoint.py \
--ckpt_path <your_checkpoint_dir>/ \
--output_path ./exported_model/
```
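The "selects the latest" behavior can be pictured as a modification-time scan; this is only a sketch of the idea, and the actual script may use a different rule (for example, preferring last.ckpt):

```python
from pathlib import Path

ckpt_dir = Path("<your_checkpoint_dir>")  # same directory as above
# Newest checkpoint by file modification time.
latest = max(ckpt_dir.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
print(latest)
```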
The export produces:

- model.ckpt: Model weights
- model.yaml: Inference configuration
After exporting, you can use the inference commands from the "Inference with Pre-trained Model" section above.
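To verify an export, both files can be opened with standard tools; a minimal check, assuming model.ckpt is a plain PyTorch checkpoint dict and model.yaml is ordinary YAML:

```python
import torch
import yaml

with open("exported_model/model.yaml") as f:
    cfg = yaml.safe_load(f)
# Depending on how the checkpoint was saved, newer PyTorch versions may
# require torch.load(..., weights_only=False).
state = torch.load("exported_model/model.ckpt", map_location="cpu")

print(sorted(cfg))        # top-level config keys
print(next(iter(state)))  # first key in the checkpoint dict
```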
📁 Project Structure
```
ar-dit-mel/
├── ar/                          # Autoregressive model
│   ├── armel.py                 # ARMel main model
│   ├── qwen.py                  # Qwen3 LLM
│   └── mel_generate.py          # Mel generation
├── rfwave/                      # Diffusion model
│   ├── mel_model.py             # RFMel model
│   ├── mel_processor.py         # Mel processor
│   └── estimator.py             # Diffusion estimator
├── dataset/                     # Dataset
├── scripts/                     # Training and inference scripts
│   ├── build_dataset.py         # Build dataset
│   ├── mel_train.py             # Training script
│   ├── mel_export_checkpoint.py # Export model
│   └── mel_inference.py         # Inference script
└── configs/                     # Configuration files
```
📄 License
MIT License
📚 Related Papers
- Autoregressive Diffusion Transformer for Text-to-Speech Synthesis. Zhijun Liu, et al. arXiv:2406.05551
- VibeVoice Technical Report. Zhiliang Peng, et al. arXiv:2508.19205
- VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning. Yixuan Zhou, et al. arXiv:2509.24650
🙏 Acknowledgments
This project is based on the following open-source projects:

- RFWave (ConvNeXt-based diffusion backbone)
- Qwen3 (autoregressive language model)
- Amphion Emilia (raw data preprocessing)