Poorman's AR-DiT TTS 📢

Keywords: ARDiT, AR-DiT, Autoregressive Diffusion Transformer, TTS, Text-to-Speech, Mel-Spectrogram

A resource-friendly text-to-speech system inspired by AR-DiT (ARDiT). It combines an autoregressive Transformer (the Qwen3 LLM) with a diffusion model: the diffusion process generates Mel spectrograms, and a vocoder converts them to audio.

✨ A minimal AR-DiT TTS training and inference pipeline. It can train on an 8000-hour dataset using a single RTX 5090 (32GB) and produce intelligible speech within two days.

PS: The diffusion backbone uses RFWave's ConvNeXt architecture, not DiT.

🌟 Why Choose This Project?

🚀 Resource-Friendly: A single RTX 5090 (32GB) can train on an 8000-hour dataset
📦 Minimal Implementation: Clean and concise code, easy to understand and modify, suitable for learning and development
🇨🇳 Chinese-Friendly: Complete Chinese documentation and Chinese data processing pipeline
🤗 Ready to Use: Provides pre-trained models and processed datasets for quick start
💡 Practice-Oriented: Intelligible results in two days and better quality with longer training; the goal is practical rather than perfect

🎵 Generation Examples

Audio samples generated by the trained model:

[Embedded audio sample]

📦 Installation

pip install -r requirements.txt

🚀 Quick Start

🎤 Inference with Pre-trained Model

We provide pre-trained models on Hugging Face that you can use directly:

Download Pre-trained Model:

huggingface-cli download laupeng1989/armel-checkpoint --local-dir ./models/armel-checkpoint

Run Inference:

python3 scripts/mel_inference.py \
  --model_path ./models/armel-checkpoint/ \
  --text example_data/transcript/fanren_short.txt \
  --ref_audio fanren08 \
  --output_path output/generated \
  --dtype bfloat16

Output Files:

output/generated.wav: Generated audio
output/generated.png: Mel spectrogram visualization
output/generated.npy: Mel spectrogram array
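The three files share the prefix passed to --output_path, each with its own extension. The following is a minimal sketch for checking that a run produced all of them. The helper names expected_outputs and missing_outputs are hypothetical and not part of the repository.

```python
from pathlib import Path

# Extensions listed under "Output Files" above.
OUTPUT_EXTENSIONS = (".wav", ".png", ".npy")


def expected_outputs(output_path: str) -> dict[str, Path]:
    """Map each extension to the file expected for a given --output_path prefix."""
    prefix = Path(output_path)
    # Append the extension instead of using with_suffix(), so a prefix
    # that already contains dots is kept intact.
    return {ext: prefix.parent / (prefix.name + ext) for ext in OUTPUT_EXTENSIONS}


def missing_outputs(output_path: str) -> list[Path]:
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_outputs(output_path).values() if not p.is_file()]


if __name__ == "__main__":
    for path in missing_outputs("output/generated"):
        print(f"missing: {path}")
```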

🎧 Reference Audio Instructions

The --ref_audio parameter specifies the reference audio name (without extension). The script will read the corresponding .wav and .txt files from the example_data/voice_prompts/ directory:

example_data/voice_prompts/
├── fanren08.wav          # Reference audio
├── fanren08.txt          # Text corresponding to reference audio
├── fanren09.wav
└── fanren09.txt

You can add your own reference audio by placing the audio file and corresponding text file in this directory.
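Before running inference with a new voice, it can help to confirm that both files exist for the chosen name. The sketch below mirrors the lookup described above. resolve_ref_audio is a hypothetical helper, not the logic in scripts/mel_inference.py, and the UTF-8 encoding of the transcript is an assumption.

```python
from pathlib import Path


def resolve_ref_audio(
    name: str, prompts_dir: str = "example_data/voice_prompts"
) -> tuple[Path, str]:
    """Return the reference .wav path and its transcript for a --ref_audio name."""
    base = Path(prompts_dir)
    wav_path = base / f"{name}.wav"
    txt_path = base / f"{name}.txt"

    # Report every missing file at once so the user can fix both in one pass.
    missing = [p.name for p in (wav_path, txt_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {base}: {', '.join(missing)}")

    # Assumption: transcripts are stored as UTF-8 text.
    transcript = txt_path.read_text(encoding="utf-8").strip()
    if not transcript:
        raise ValueError(f"{txt_path} is empty")
    return wav_path, transcript
```

For example, resolve_ref_audio("fanren08") returns the path to fanren08.wav together with the text stored in fanren08.txt.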

🔥 Training from Scratch

If you want to train your own model from scratch, follow these steps.

🤗 Training Dataset

We provide a processed training dataset on Hugging Face:

Training Dataset: laupeng1989/armel-dataset

Download Dataset:

huggingface-cli download laupeng1989/armel-dataset --repo-type dataset --local-dir ./data/armel-dataset

💡 Tip: If using the Hugging Face dataset, you can skip the "Data Preparation" section below and proceed directly to training.

📊 Data Preparation

1️⃣ Prepare Raw Data

This project uses the Amphion Emilia preprocessor to process raw audio data.

Processed data format:

example_data/
├── 仙逆 第87集 身世苏醒（下） [638031163].json
├── 仙逆 第87集 身世苏醒（下） [638031163]_000000.m4a
├── 仙逆 第87集 身世苏醒（下） [638031163]_000001.m4a
├── 仙逆 第87集 身世苏醒（下） [638031163]_000002.m4a
└── ...

JSON file format (contains segmentation info and text):

[
  {
    "duration": 10.94,
    "text": "[SPEAKER_00] 欢迎收听...",
    "speaker": 0,
    "parts": [
      {
        "text": "[SPEAKER_00] 欢迎收听...",
        "start": 4.5125,
        "end": 10.1525,
        "speaker": 0,
        "language": "zh"
      }
    ]
  }
]
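To inspect preprocessed data before building the dataset, the JSON file can be read with the standard library. This is a sketch only: it assumes that entry i in the JSON corresponds to the clip whose name ends in _i (zero-padded to six digits), which matches the listing above but is not confirmed by the source. load_segments and total_hours are hypothetical helpers.

```python
import json
from pathlib import Path


def load_segments(json_path: str) -> list[dict]:
    """Read one preprocessed JSON file and pair each entry with its audio clip."""
    path = Path(json_path)
    entries = json.loads(path.read_text(encoding="utf-8"))
    segments = []
    for index, entry in enumerate(entries):
        # Assumption: entry i maps to "<json stem>_<i as 6 digits>.m4a".
        audio = path.with_name(f"{path.stem}_{index:06d}.m4a")
        segments.append(
            {
                "audio": audio,
                "text": entry["text"],
                "duration": float(entry["duration"]),  # seconds
                "speaker": entry["speaker"],
            }
        )
    return segments


def total_hours(segments: list[dict]) -> float:
    """Sum segment durations (seconds) and convert to hours."""
    return sum(s["duration"] for s in segments) / 3600.0
```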

2️⃣ Build Training Dataset

Use build_dataset.py to convert raw data to training format:

python scripts/build_dataset.py \
  --data_dir <your_raw_data_dir> \
  --output_dir <your_output_dir> \
  --num_proc 8 \
  --test_samples 100 \
  --random_seed 42

🔥 Training

💻 Training Hardware

This project was trained on a single NVIDIA RTX 5090 (32GB).

⚡ Training Command

Prepare Qwen3 Model:

model.llm_model_path can be either of the following:

Local path: e.g., ./Qwen3-0.6B (must be downloaded beforehand)
Hugging Face model name: e.g., Qwen/Qwen3-0.6B (downloaded automatically, so the first training run takes longer to start)

We recommend downloading the model locally first:

huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B

Training Command:

python3 scripts/mel_train.py \
  dataset.train_dataset_path=<your_train_data_path> \
  dataset.valid_dataset_path=<your_valid_data_path> \
  model.llm_model_path=./Qwen3-0.6B \
  model.rfmel.batch_mul=2 \
  training.batch_size=4 \
  dataset.max_tokens=1024 \
  training.num_workers=16 \
  training.learning_rate=0.0001 \
  training.log_dir=<your_log_dir> \
  training.diffusion_extra_steps=4 \
  training.check_val_every_n_epoch=1 \
  model.use_skip_connection=true \
  model.estimator.hidden_dim=512 \
  model.estimator.intermediate_dim=1536 \
  model.estimator.num_layers=8

Note: Lightning automatically detects and uses all available GPUs with the DDP strategy. You may need to adjust batch_size, batch_mul, and max_tokens to fit your hardware configuration.
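When trying several hardware-dependent settings, it can be convenient to keep the overrides in a dictionary and generate the command from it. The sketch below only reproduces the key=value argument style of the command above. format_overrides is a hypothetical helper, and the numbers are illustrative, not recommended values.

```python
import shlex


def format_overrides(overrides: dict) -> list[str]:
    """Turn a dict into key=value arguments in the style of the training command."""
    args = []
    for key, value in overrides.items():
        if isinstance(value, bool):
            value = str(value).lower()  # True -> "true", as written in the command
        args.append(f"{key}={value}")
    return args


# Illustrative values only, e.g. for a GPU with less memory.
overrides = {
    "training.batch_size": 2,
    "model.rfmel.batch_mul": 2,
    "dataset.max_tokens": 512,
    "model.use_skip_connection": True,
}
command = ["python3", "scripts/mel_train.py", *format_overrides(overrides)]

# Print a shell-safe command line that can be copied into a terminal.
print(shlex.join(command))
```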

📤 Export Model

After training, export the model for inference:

python scripts/mel_export_checkpoint.py \
  --ckpt_path <your_checkpoint_path>/last.ckpt \
  --output_path ./exported_model/

Alternatively, pass the checkpoint directory; the script then selects the latest checkpoint automatically:

python scripts/mel_export_checkpoint.py \
  --ckpt_path <your_checkpoint_dir>/ \
  --output_path ./exported_model/

This will generate:

model.ckpt: Model weights
model.yaml: Inference configuration

After exporting, you can use the inference commands from the "Inference with Pre-trained Model" section above.
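When --ckpt_path points to a directory, the export script picks the latest checkpoint. The source does not say how "latest" is determined, so the sketch below assumes the most recently modified *.ckpt file. resolve_checkpoint is a hypothetical helper for checking beforehand which file would be chosen under that assumption.

```python
from pathlib import Path


def resolve_checkpoint(ckpt_path: str) -> Path:
    """Return ckpt_path itself if it is a file, else the newest *.ckpt inside it."""
    path = Path(ckpt_path)
    if path.is_file():
        return path

    # Assumption: "latest" means the most recent modification time.
    candidates = sorted(path.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"no .ckpt files found in {path}")
    return candidates[-1]
```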

📁 Project Structure

ar-dit-mel/
├── ar/                      # Autoregressive model
│   ├── armel.py            # ARMel main model
│   ├── qwen.py             # Qwen3 LLM
│   └── mel_generate.py     # Mel generation
├── rfwave/                  # Diffusion model
│   ├── mel_model.py        # RFMel model
│   ├── mel_processor.py    # Mel processor
│   └── estimator.py        # Diffusion estimator
├── dataset/                 # Dataset
├── scripts/                 # Training and inference scripts
│   ├── build_dataset.py    # Build dataset
│   ├── mel_train.py        # Training script
│   ├── mel_export_checkpoint.py  # Export model
│   └── mel_inference.py    # Inference script
└── configs/                 # Configuration files

📜 License

MIT License

📚 Related Papers

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis. Zhijun Liu, et al. arXiv:2406.05551
VibeVoice Technical Report. Zhiliang Peng, et al. arXiv:2508.19205
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning. Yixuan Zhou, et al. arXiv:2509.24650

🙏 Acknowledgments

This project is based on the following open-source projects:

Qwen3 - Language Model 🤖
Amphion - Data Preprocessing 🎵
Vocos - Vocoder 🔊
RFWave - Diffusion Backbone 🌊
VoxCPM - Architecture Reference 💡
Higgs-Audio - Data Template 📋
