MambaVoiceCloning (MVC) - Efficient and Expressive TTS with State-Space Modeling
This paper presents MambaVoiceCloning (MVC), a scalable and expressive text-to-speech (TTS) framework that unifies state-space sequence modeling with diffusion-driven style control. Distinct from prior diffusion-based models, MVC replaces all self-attention and recurrent components in the TTS pipeline with novel Mamba-based modules: a Bi-Mamba Text Encoder, a Temporal Bi-Mamba Encoder, and an Expressive Mamba Predictor. These modules enable linear-time modeling of long-range phonetic and prosodic dependencies, improving efficiency and expressiveness without relying on external reference encoders. While MVC uses a diffusion-based decoder for waveform generation, our contribution is architectural: introducing the first end-to-end Mamba-integrated TTS backbone. Extensive experiments on LJSpeech and LibriTTS demonstrate that MVC significantly improves naturalness, prosody, intelligibility, and latency over state-of-the-art methods. MVC maintains a lightweight footprint of 21M parameters and achieves 1.6× faster training than comparable Transformer-based baselines.
🎧 Audio Demos
Explore MVC's expressive and high-quality speech synthesis through our audio samples: MVC Audio Demos
🚀 Key Features
- Efficient State-Space Modeling: Utilizes Mamba blocks for linear-time sequence modeling, significantly reducing computation time and memory overhead compared to traditional self-attention mechanisms.
- Lightweight Temporal and Spectrogram Encoders: Includes optimized BiMambaTextEncoder, TemporalBiMambaEncoder, and ExpressiveMambaEncoder with depthwise separable convolutions for a reduced parameter count.
- Dynamic Style Conditioning: Integrates AdaLayerNorm for style modulation, enabling flexible control over prosody and speaker style during synthesis (a minimal sketch follows this list).
- Advanced Gating Mechanisms: Employs grouped convolutional gating for efficient residual connections, minimizing parameter overhead while maintaining expressiveness.
- Optimized Inference Path: Supports gradient checkpointing and efficient feature aggregation, reducing memory usage during both training and inference.
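To make the style-modulation idea concrete, here is a minimal PyTorch sketch of adaptive layer normalization, in which a style embedding predicts per-channel scale and shift. It illustrates the general mechanism only; the class name, dimensions, and the (1 + gamma) parameterization are assumptions, not MVC's actual AdaLayerNorm implementation.

```python
import torch
import torch.nn as nn

class AdaLayerNormSketch(nn.Module):
    """Adaptive LayerNorm: a style vector predicts per-channel scale and shift."""
    def __init__(self, d_model: int, d_style: int):
        super().__init__()
        # Plain LayerNorm without learned affine params; style supplies them.
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_style, 2 * d_model)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); style: (batch, d_style)
        gamma, beta = self.to_scale_shift(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

x = torch.randn(2, 100, 256)  # dummy hidden states
s = torch.randn(2, 128)       # dummy style embedding
print(AdaLayerNormSketch(256, 128)(x, s).shape)  # torch.Size([2, 100, 256])
```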
📦 Installation
Prerequisites
- Python >= 3.8
- PyTorch >= 1.12.0
- CUDA-enabled GPU (recommended for training)
- Mamba SSM (Required for Mamba-based encoders)
Setup
Clone the repository and install dependencies:
```bash
git clone https://github.com/aiai-9/MVC.git
cd MVC
pip install -r requirements.txt
```

Install Mamba SSM
To install the Mamba SSM module, use the following command:
```bash
pip install git+https://github.com/state-spaces/mamba.git
```
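To verify the installation, a quick import-and-forward smoke test can help (note that mamba_ssm's fused kernels require a CUDA GPU; the toy dimensions below are arbitrary):

```python
# Smoke test: confirm mamba_ssm imports and a forward pass runs.
import torch
from mamba_ssm import Mamba

block = Mamba(d_model=64).cuda()            # toy 64-dim model
x = torch.randn(2, 128, 64, device="cuda")  # (batch, seq_len, d_model)
print(block(x).shape)                       # torch.Size([2, 128, 64])
```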
Training
First stage training (Text Encoder, Duration Encoder, Prosody Predictor):
```bash
accelerate launch train_first.py --config_path ./configs/config.yml
```

Second stage training (Diffusion-based decoder and adversarial refinement):
```bash
python train_second.py --config_path ./configs/config.yml
```

Inference
Generate high-quality speech with pre-trained models:
```bash
python inference.py --config_path ./configs/config.yml --input_text "Hello, this is MambaVoiceCloning."
```

🧠 Model Architecture
MVC consists of three core components:
- Bi-Mamba Text Encoder: Efficiently captures phoneme-level context using bidirectional state-space models (SSMs); the bidirectional pattern is sketched after this list.
- Expressive Mamba Encoder: Enhances prosodic variation and speaker expressiveness.
- Temporal Bi-Mamba Encoder: Models rhythmic structures and duration alignment for natural speech generation.
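To make the bidirectional pattern concrete, here is a minimal sketch of how a Bi-Mamba layer can be assembled from two mamba_ssm blocks, one scanning the sequence forward and one scanning it in reverse. This illustrates the general bidirectional-SSM idea, not the repository's exact module; the class name and the merge-by-summation choice are assumptions.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install git+https://github.com/state-spaces/mamba.git

class BiMambaSketch(nn.Module):
    """Bidirectional Mamba: forward and time-reversed scans, merged by summation."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)  # left-to-right scan
        self.bwd = Mamba(d_model=d_model)  # right-to-left scan (via flipping)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        out_fwd = self.fwd(x)
        out_bwd = self.bwd(x.flip(dims=[1])).flip(dims=[1])  # undo the reversal
        return out_fwd + out_bwd  # sum keeps d_model unchanged; concat also works

# Requires a CUDA GPU for mamba_ssm's fused kernels.
layer = BiMambaSketch(d_model=256).cuda()
x = torch.randn(2, 120, 256, device="cuda")
print(layer(x).shape)  # torch.Size([2, 120, 256])
```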
📊 Evaluation
Run objective and subjective evaluations using provided scripts:
```bash
python evaluate.py --config_path ./configs/config.yml
```
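For reference, the WER metric reported in Table 3 below is typically computed by transcribing the synthesized audio with an ASR model and scoring the transcript against the input text. Here is a minimal illustration using openai-whisper and jiwer; both packages and the file name are assumptions, not necessarily what evaluate.py uses internally.

```python
# Illustrative WER computation: transcribe synthesized audio with an ASR
# model, then score it against the reference text. openai-whisper and
# jiwer are assumed third-party tools, not declared MVC dependencies.
import whisper          # pip install openai-whisper
from jiwer import wer   # pip install jiwer

asr = whisper.load_model("base")
hypothesis = asr.transcribe("synthesized.wav")["text"]  # hypothetical output file
reference = "Hello, this is MambaVoiceCloning."

print(f"WER: {wer(reference.lower(), hypothesis.lower()):.3f}")
```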
🏆 Results
Table 1: Subjective Evaluation on LibriTTS (Zero-Shot)
| Model | MOS-N (naturalness, ↑) | MOS-S (similarity, ↑) |
|---|---|---|
| Ground Truth | 4.60 ± 0.09 | 4.35 ± 0.10 |
| VITS | 3.69 ± 0.12 | 3.54 ± 0.13 |
| StyleTTS2 | 4.15 ± 0.11 | 4.03 ± 0.11 |
| MVC (Ours) | 4.22 ± 0.10 | 4.07 ± 0.10 |
Table 2: MOS Comparison on LJSpeech (ID vs OOD)
| Model | MOS-ID (in-distribution, ↑) | MOS-OOD (out-of-distribution, ↑) |
|---|---|---|
| Ground Truth | 3.81 ± 0.09 | 3.70 ± 0.11 |
| StyleTTS2 | 3.83 ± 0.08 | 3.87 ± 0.08 |
| JETS | 3.57 ± 0.10 | 3.21 ± 0.12 |
| VITS | 3.44 ± 0.10 | 3.21 ± 0.11 |
| MVC (Ours) | 3.87 ± 0.07 | 3.88 ± 0.09 |
Table 3: Objective Metrics on LJSpeech
| Model | F0 RMSE (↓) | MCD (dB, ↓) | WER (↓) | RTF (↓) |
|---|---|---|---|---|
| VITS | 0.667 ± 0.011 | 4.97 ± 0.09 | 7.23% | 0.0211 |
| StyleTTS2 | 0.651 ± 0.013 | 4.93 ± 0.06 | 6.50% | 0.0185 |
| MVC (Ours) | 0.653 ± 0.014 | 4.91 ± 0.07 | 6.52% | 0.0177 |
🛠️ Troubleshooting
- NaN Loss: Ensure the batch size is properly set (e.g., 16 for stable training).
- Out of Memory: Reduce the batch size or sequence length if OOM errors occur; gradient checkpointing can also help (see the sketch after this list).
- Audio Quality Issues: Fine-tune model hyperparameters for specific datasets.
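As a generic illustration of the gradient-checkpointing option mentioned under Key Features, here is a sketch of the standard PyTorch mechanism; this shows the technique in general, not MVC's specific integration, and the module shape is an arbitrary assumption.

```python
# Gradient checkpointing trades compute for memory: activations inside the
# checkpointed block are recomputed during the backward pass instead of stored.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

encoder = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

x = torch.randn(8, 100, 256, requires_grad=True)
y = checkpoint(encoder, x, use_reentrant=False)  # activations not kept
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 100, 256])
```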
📄 License
This project is released under the MIT License. See the LICENSE file for more details.
🙌 Contributing
We welcome contributions! Please read the CONTRIBUTING.md file for guidelines on code style, pull requests, and community support.
🤝 Acknowledgements
MVC builds on prior work from the Mamba, StyleTTS2, and VITS communities. We thank the authors for their foundational contributions to the field of TTS.
