Poorman's AR-DiT TTS 📢

Keywords: ARDiT, AR-DiT, Autoregressive Diffusion Transformer, TTS, Text-to-Speech, Mel-Spectrogram

A resource-friendly Text-to-Speech system inspired by AR-DiT (ARDiT) that combines an autoregressive Transformer (the Qwen3 LLM) with a diffusion model. It generates Mel spectrograms through a diffusion process, then converts them to audio with a vocoder.

✨ A minimal AR-DiT TTS training and inference pipeline: it can train on an 8000-hour dataset with a single RTX 5090 (32GB) and produce intelligible speech within two days.

PS: The diffusion backbone uses RFWave's ConvNeXt architecture, not DiT.
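RFWave is built on rectified flow, so the diffusion step can be pictured as integrating a learned velocity field from noise to a Mel frame. The following is a toy sketch of that idea only, not this project's code: `euler_sample` and `make_toy_velocity` are hypothetical names, and the toy velocity simply points at a known target, whereas the real estimator would be a trained ConvNeXt network, presumably conditioned on the autoregressive model's output.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a trained estimator: always aims at `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        # On a straight path, the remaining displacement divided by the
        # remaining time gives the constant velocity toward the target.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.5, -1.2, 0.3]          # starting point (stands in for Gaussian noise)
target_frame = [1.0, 2.0, 3.0]    # stands in for one Mel frame
frame = euler_sample(make_toy_velocity(target_frame), noise, num_steps=4)
```

With this idealized velocity field, the sampler reaches the target after only a few steps, which is the property that makes straight-path flows attractive for low step counts.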

🌟 Why Choose This Project?

  • 🚀 Resource-Friendly: A single RTX 5090 (32GB) can handle training on an 8000-hour dataset
  • 📦 Minimal Implementation: Clean and concise code that is easy to understand and modify, suitable for learning and development
  • 🇨🇳 Chinese-Friendly: Complete Chinese documentation and a Chinese data processing pipeline
  • 🤗 Ready to Use: Provides pre-trained models and processed datasets for a quick start
  • 💡 Practical-Oriented: Achieves intelligible results in two days and better quality with longer training; practical rather than perfect

🎵 Generation Examples

Audio samples generated by the trained model:

[Embedded audio sample]

📦 Installation

pip install -r requirements.txt

🚀 Quick Start

🎤 Inference with Pre-trained Model

We provide pre-trained models on Hugging Face that you can use directly:

Download Pre-trained Model:

huggingface-cli download laupeng1989/armel-checkpoint --local-dir ./models/armel-checkpoint

Run Inference:

python3 scripts/mel_inference.py \
  --model_path ./models/armel-checkpoint/ \
  --text example_data/transcript/fanren_short.txt \
  --ref_audio fanren08 \
  --output_path output/generated \
  --dtype bfloat16

Output Files:

  • output/generated.wav: Generated audio
  • output/generated.png: Mel spectrogram visualization
  • output/generated.npy: Mel spectrogram array
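A quick way to sanity-check the generated audio is to read its header. The sketch below uses only the Python standard library and assumes the output is a PCM-encoded WAV file, which the README does not state explicitly; `describe_wav` is a hypothetical helper name.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }
```

For example, `describe_wav("output/generated.wav")` would report the duration of the file produced by the inference command above.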

🎧 Reference Audio Instructions

The --ref_audio parameter specifies the reference audio name (without extension). The script will read the corresponding .wav and .txt files from the example_data/voice_prompts/ directory:

example_data/voice_prompts/
├── fanren08.wav          # Reference audio
├── fanren08.txt          # Text corresponding to reference audio
├── fanren09.wav
└── fanren09.txt

You can add your own reference audio by placing the audio file and corresponding text file in this directory.
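Before running inference with a new voice, it can help to confirm that both files of the pair are present. This is a minimal sketch of the naming convention described above, not the lookup code used by `mel_inference.py`; `find_voice_prompt` is a hypothetical helper.

```python
from pathlib import Path
from typing import Tuple


def find_voice_prompt(name: str,
                      prompt_dir: str = "example_data/voice_prompts") -> Tuple[Path, str]:
    """Return the .wav path and transcript text for a reference audio name."""
    base = Path(prompt_dir)
    wav_path = base / f"{name}.wav"
    txt_path = base / f"{name}.txt"
    # Both files must exist: the audio and its matching transcript.
    missing = [p.name for p in (wav_path, txt_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {base}: {', '.join(missing)}")
    return wav_path, txt_path.read_text(encoding="utf-8").strip()
```

Calling `find_voice_prompt("fanren08")` from the repository root would return the audio path and its transcript, or raise an error naming whichever file is absent.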


🔥 Training from Scratch

If you want to train your own model from scratch, follow these steps.

🤗 Training Dataset

We provide a processed training dataset on Hugging Face:

Download Dataset:

huggingface-cli download laupeng1989/armel-dataset --repo-type dataset --local-dir ./data/armel-dataset

💡 Tip: If you use the Hugging Face dataset, you can skip the "Data Preparation" section below and proceed directly to training.

📊 Data Preparation

1️⃣ Prepare Raw Data

This project uses the Amphion Emilia preprocessor to process raw audio data.

Processed data format:

example_data/
├── 仙逆 第87集 身世苏醒(下) [638031163].json
├── 仙逆 第87集 身世苏醒(下) [638031163]_000000.m4a
├── 仙逆 第87集 身世苏醒(下) [638031163]_000001.m4a
├── 仙逆 第87集 身世苏醒(下) [638031163]_000002.m4a
└── ...

JSON file format (contains segmentation info and text):

[
  {
    "duration": 10.94,
    "text": "[SPEAKER_00] 欢迎收听...",
    "speaker": 0,
    "parts": [
      {
        "text": "[SPEAKER_00] 欢迎收听...",
        "start": 4.5125,
        "end": 10.1525,
        "speaker": 0,
        "language": "zh"
      }
    ]
  }
]
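To check a preprocessed file before building the dataset, the JSON above can be read with the standard library. The sketch below relies only on the fields shown in the example (`duration`, `parts`, `start`, `end`, `language`); `summarize_segments` is a hypothetical helper and not part of the repository.

```python
import json


def summarize_segments(json_text: str) -> dict:
    """Summarize a segment list in the format shown above."""
    segments = json.loads(json_text)
    for seg in segments:
        for part in seg["parts"]:
            # Each part should cover a positive time span.
            if part["end"] <= part["start"]:
                raise ValueError(f"invalid part timing: {part}")
    return {
        "num_segments": len(segments),
        "total_duration_s": sum(seg["duration"] for seg in segments),
        "languages": sorted({p["language"] for s in segments for p in s["parts"]}),
    }
```

Summing `total_duration_s` over all JSON files gives a rough estimate of how many hours of audio a raw data directory contains.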

2️⃣ Build Training Dataset

Use build_dataset.py to convert raw data to training format:

python scripts/build_dataset.py \
  --data_dir <your_raw_data_dir> \
  --output_dir <your_output_dir> \
  --num_proc 8 \
  --test_samples 100 \
  --random_seed 42
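The README does not describe what `--test_samples` and `--random_seed` do internally. One plausible reading is a seeded shuffle that holds out a fixed number of samples for validation; the sketch below illustrates that reading only, and `split_train_test` is a hypothetical function rather than the script's implementation.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(items: Sequence[str], test_samples: int,
                     random_seed: int) -> Tuple[List[str], List[str]]:
    """Hold out `test_samples` items using a reproducible shuffle."""
    if test_samples > len(items):
        raise ValueError("test_samples exceeds the number of items")
    rng = random.Random(random_seed)   # local RNG, so global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[test_samples:], shuffled[:test_samples]
```

Using a fixed seed means that rebuilding the dataset from the same raw data yields the same held-out set, which keeps validation numbers comparable across runs.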

🔥 Training

💻 Training Hardware

This project was trained on a single NVIDIA RTX 5090 (32GB).

⚡ Training Command

Prepare Qwen3 Model:

model.llm_model_path can be:

  • Local path: e.g., ./Qwen3-0.6B (requires prior download)
  • Hugging Face model name: e.g., Qwen/Qwen3-0.6B (downloads automatically, but the first training run will start more slowly)

Downloading the model locally first is recommended:

huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B

Training Command:

python3 scripts/mel_train.py \
  dataset.train_dataset_path=<your_train_data_path> \
  dataset.valid_dataset_path=<your_valid_data_path> \
  model.llm_model_path=./Qwen3-0.6B \
  model.rfmel.batch_mul=2 \
  training.batch_size=4 \
  dataset.max_tokens=1024 \
  training.num_workers=16 \
  training.learning_rate=0.0001 \
  training.log_dir=<your_log_dir> \
  training.diffusion_extra_steps=4 \
  training.check_val_every_n_epoch=1 \
  model.use_skip_connection=true \
  model.estimator.hidden_dim=512 \
  model.estimator.intermediate_dim=1536 \
  model.estimator.num_layers=8

Note: Lightning automatically detects and uses all available GPUs with the DDP strategy. You may need to adjust batch_size, batch_mul, and max_tokens to fit your hardware configuration.
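The training command passes settings as dotted `key=value` overrides. The README does not name the configuration library, so the sketch below is a standard-library illustration of how such overrides can map onto a nested configuration; `parse_value` and `apply_overrides` are hypothetical helpers, and the real script may parse values differently.

```python
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret a raw override value as bool, int, float, or string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply dotted key=value overrides to a nested dict in place."""
    for item in overrides:
        dotted_key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {item}")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})   # create intermediate levels
        node[leaf] = parse_value(raw)
    return config
```

Under this reading, `model.estimator.hidden_dim=512` sets `config["model"]["estimator"]["hidden_dim"]` to the integer 512, and `model.llm_model_path=./Qwen3-0.6B` stays a string.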

📤 Export Model

After training, export the model for inference:

python scripts/mel_export_checkpoint.py \
  --ckpt_path <your_checkpoint_path>/last.ckpt \
  --output_path ./exported_model/

Or specify the checkpoints directory directly (automatically selects the latest):

python scripts/mel_export_checkpoint.py \
  --ckpt_path <your_checkpoint_dir>/ \
  --output_path ./exported_model/

This will generate:

  • model.ckpt: Model weights
  • model.yaml: Inference configuration

After exporting, you can use the inference commands from the "Inference with Pre-trained Model" section above.
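When `--ckpt_path` points at a directory, the export script is described as selecting the latest checkpoint automatically. How it decides which one is latest is not documented here; the sketch below assumes selection by file modification time, and `latest_checkpoint` is a hypothetical helper rather than the script's own logic.

```python
from pathlib import Path


def latest_checkpoint(ckpt_path: str) -> Path:
    """Return ckpt_path if it is a file, else the newest *.ckpt in the directory."""
    path = Path(ckpt_path)
    if path.is_file():
        return path
    candidates = list(path.glob("*.ckpt"))
    if not candidates:
        raise FileNotFoundError(f"no .ckpt files found in {path}")
    # Assumption: "latest" means most recently modified.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

If the actual rule differs (for example, preferring `last.ckpt` by name), passing the checkpoint file explicitly avoids any ambiguity.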


📁 Project Structure

ar-dit-mel/
├── ar/                      # Autoregressive model
│   ├── armel.py            # ARMel main model
│   ├── qwen.py             # Qwen3 LLM
│   └── mel_generate.py     # Mel generation
├── rfwave/                  # Diffusion model
│   ├── mel_model.py        # RFMel model
│   ├── mel_processor.py    # Mel processor
│   └── estimator.py        # Diffusion estimator
├── dataset/                 # Dataset
├── scripts/                 # Training and inference scripts
│   ├── build_dataset.py    # Build dataset
│   ├── mel_train.py        # Training script
│   ├── mel_export_checkpoint.py  # Export model
│   └── mel_inference.py    # Inference script
└── configs/                 # Configuration files

📜 License

MIT License

📚 References

  • Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
    Zhijun Liu, et al.
    arXiv:2406.05551

  • VibeVoice Technical Report
    Zhiliang Peng, et al.
    arXiv:2508.19205

  • VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
    Yixuan Zhou, et al.
    arXiv:2509.24650

🙏 Acknowledgments

This project is based on the following open-source projects:

  • Qwen3 - Language Model 🤖
  • Amphion - Data Preprocessing 🎵
  • Vocos - Vocoder 🔊
  • RFWave - Diffusion Backbone 🌊
  • VoxCPM - Architecture Reference 💡
  • Higgs-Audio - Data Template 📋