MambaVoiceCloning (MVC) - Efficient and Expressive TTS with State-Space Modeling
This paper presents MambaVoiceCloning (MVC), a scalable and expressive text-to-speech (TTS) framework that unifies state-space sequence modeling with diffusion-driven style control. Distinct from prior diffusion-based models, MVC replaces all self-attention and recurrent components in the TTS pipeline with novel Mamba-based modules: a Bi-Mamba Text Encoder, a Temporal Bi-Mamba Encoder, and an Expressive Mamba Predictor. These modules enable linear-time modeling of long-range phonetic and prosodic dependencies, improving efficiency and expressiveness without relying on external reference encoders. While MVC uses a diffusion-based decoder for waveform generation, our contribution is architectural: introducing the first end-to-end Mamba-integrated TTS backbone. Extensive experiments on LJSpeech and LibriTTS demonstrate that MVC significantly improves naturalness, prosody, intelligibility, and latency over state-of-the-art methods. MVC maintains a lightweight footprint of 21M parameters and achieves 1.6× faster training than comparable Transformer-based baselines.
🎧 Audio Demos
Explore MVC's expressive and high-quality speech synthesis through our audio samples: MVC Audio Demos
🚀 Key Features
- Efficient State-Space Modeling: Utilizes Mamba blocks for linear-time sequence modeling, significantly reducing computation time and memory overhead compared to traditional self-attention mechanisms.
- Lightweight Temporal and Spectrogram Encoders: Includes optimized BiMambaTextEncoder, TemporalBiMambaEncoder, and ExpressiveMambaEncoder with depthwise separable convolutions for a reduced parameter count.
- Dynamic Style Conditioning: Integrates AdaLayerNorm for style modulation, enabling flexible control over prosody and speaker style during synthesis (a minimal sketch follows this list).
- Advanced Gating Mechanisms: Employs grouped convolutional gating for efficient residual connections, minimizing parameter overhead while maintaining expressiveness.
- Optimized Inference Path: Supports gradient checkpointing and efficient feature aggregation, reducing memory usage during both training and inference.
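To make the style-modulation idea concrete, here is a minimal PyTorch sketch of adaptive layer normalization, in which a style embedding predicts per-channel scale and shift. It illustrates the general mechanism only; the class name, dimensions, and the (1 + gamma) parameterization are assumptions, not MVC's actual AdaLayerNorm implementation.

```python
import torch
import torch.nn as nn

class AdaLayerNormSketch(nn.Module):
    """Adaptive LayerNorm: a style vector predicts per-channel scale and shift."""
    def __init__(self, d_model: int, d_style: int):
        super().__init__()
        # Plain LayerNorm without learned affine params; style supplies them.
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_style, 2 * d_model)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); style: (batch, d_style)
        gamma, beta = self.to_scale_shift(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

x = torch.randn(2, 100, 256)  # dummy hidden states
s = torch.randn(2, 128)       # dummy style embedding
print(AdaLayerNormSketch(256, 128)(x, s).shape)  # torch.Size([2, 100, 256])
```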
📦 Installation
Prerequisites
- Python >= 3.8
- PyTorch >= 1.12.0
- CUDA-enabled GPU (recommended for training)
- Mamba SSM (Required for Mamba-based encoders)
Setup
Clone the repository and install dependencies:
```bash
git clone https://github.com/aiai-9/MVC.git
cd MVC
pip install -r requirements.txt
```

Install Mamba SSM
To install the Mamba SSM module, use the following command:
```bash
pip install git+https://github.com/state-spaces/mamba.git
```
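To verify the installation, a quick import-and-forward smoke test can help (note that mamba_ssm's fused kernels require a CUDA GPU; the toy dimensions below are arbitrary):

```python
# Smoke test: confirm mamba_ssm imports and a forward pass runs.
import torch
from mamba_ssm import Mamba

block = Mamba(d_model=64).cuda()            # toy 64-dim model
x = torch.randn(2, 128, 64, device="cuda")  # (batch, seq_len, d_model)
print(block(x).shape)                       # torch.Size([2, 128, 64])
```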
Training
First stage training (Text Encoder, Duration Encoder, Prosody Predictor):
```bash
accelerate launch train_first.py --config_path ./configs/config.yml
```

Second stage training (Diffusion-based decoder and adversarial refinement):
```bash
python train_second.py --config_path ./configs/config.yml
```

Inference
Generate high-quality speech with pre-trained models:
```bash
python inference.py --config_path ./configs/config.yml --input_text "Hello, this is MambaVoiceCloning."
```

🧠 Model Architecture
MVC consists of three core components:
- Bi-Mamba Text Encoder: Efficiently captures phoneme-level context using bidirectional state-space models (SSMs); the bidirectional pattern is sketched after this list.
- Expressive Mamba Encoder: Enhances prosodic variation and speaker expressiveness.
- Temporal Bi-Mamba Encoder: Models rhythmic structures and duration alignment for natural speech generation.
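To make the bidirectional pattern concrete, here is a minimal sketch of how a Bi-Mamba layer can be assembled from two mamba_ssm blocks, one scanning the sequence forward and one scanning it in reverse. This illustrates the general bidirectional-SSM idea, not the repository's exact module; the class name and the merge-by-summation choice are assumptions.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install git+https://github.com/state-spaces/mamba.git

class BiMambaSketch(nn.Module):
    """Bidirectional Mamba: forward and time-reversed scans, merged by summation."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)  # left-to-right scan
        self.bwd = Mamba(d_model=d_model)  # right-to-left scan (via flipping)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        out_fwd = self.fwd(x)
        out_bwd = self.bwd(x.flip(dims=[1])).flip(dims=[1])  # undo the reversal
        return out_fwd + out_bwd  # sum keeps d_model unchanged; concat also works

# Requires a CUDA GPU for mamba_ssm's fused kernels.
layer = BiMambaSketch(d_model=256).cuda()
x = torch.randn(2, 120, 256, device="cuda")
print(layer(x).shape)  # torch.Size([2, 120, 256])
```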
📊 Evaluation
Run objective and subjective evaluations using provided scripts:
```bash
python evaluate.py --config_path ./configs/config.yml
```
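For reference, the WER metric reported in Table 3 below is typically computed by transcribing the synthesized audio with an ASR model and scoring the transcript against the input text. Here is a minimal illustration using openai-whisper and jiwer; both packages and the file name are assumptions, not necessarily what evaluate.py uses internally.

```python
# Illustrative WER computation: transcribe synthesized audio with an ASR
# model, then score it against the reference text. openai-whisper and
# jiwer are assumed third-party tools, not declared MVC dependencies.
import whisper          # pip install openai-whisper
from jiwer import wer   # pip install jiwer

asr = whisper.load_model("base")
hypothesis = asr.transcribe("synthesized.wav")["text"]  # hypothetical output file
reference = "Hello, this is MambaVoiceCloning."

print(f"WER: {wer(reference.lower(), hypothesis.lower()):.3f}")
```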
🏆 Results
Table 1: Subjective Evaluation on LibriTTS (Zero-Shot)
| Model | MOS-N (naturalness, ↑) | MOS-S (similarity, ↑) |
|---|---|---|
| Ground Truth | 4.60 ± 0.09 | 4.35 ± 0.10 |
| VITS | 3.69 ± 0.12 | 3.54 ± 0.13 |
| StyleTTS2 | 4.15 ± 0.11 | 4.03 ± 0.11 |
| MVC (Ours) | 4.22 ± 0.10 | 4.07 ± 0.10 |
Table 2: MOS Comparison on LJSpeech (ID vs OOD)
| Model | MOS-ID (in-distribution, ↑) | MOS-OOD (out-of-distribution, ↑) |
|---|---|---|
| Ground Truth | 3.81 ± 0.09 | 3.70 ± 0.11 |
| StyleTTS2 | 3.83 ± 0.08 | 3.87 ± 0.08 |
| JETS | 3.57 ± 0.10 | 3.21 ± 0.12 |
| VITS | 3.44 ± 0.10 | 3.21 ± 0.11 |
| MVC (Ours) | 3.87 ± 0.07 | 3.88 ± 0.09 |
Table 3: Objective Metrics on LJSpeech
| Model | F0 RMSE (↓) | MCD (dB, ↓) | WER (↓) | RTF (↓) |
|---|---|---|---|---|
| VITS | 0.667 ± 0.011 | 4.97 ± 0.09 | 7.23% | 0.0211 |
| StyleTTS2 | 0.651 ± 0.013 | 4.93 ± 0.06 | 6.50% | 0.0185 |
| MVC (Ours) | 0.653 ± 0.014 | 4.91 ± 0.07 | 6.52% | 0.0177 |
🛠️ Troubleshooting
- NaN Loss: Ensure the batch size is properly set (e.g., 16 for stable training).
- Out of Memory: Reduce the batch size or sequence length if OOM errors occur; gradient checkpointing can also help (see the sketch after this list).
- Audio Quality Issues: Fine-tune model hyperparameters for specific datasets.
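As a generic illustration of the gradient-checkpointing option mentioned under Key Features, here is a sketch of the standard PyTorch mechanism; this shows the technique in general, not MVC's specific integration, and the module shape is an arbitrary assumption.

```python
# Gradient checkpointing trades compute for memory: activations inside the
# checkpointed block are recomputed during the backward pass instead of stored.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

encoder = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

x = torch.randn(8, 100, 256, requires_grad=True)
y = checkpoint(encoder, x, use_reentrant=False)  # activations not kept
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 100, 256])
```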
📄 License
This project is released under the MIT License. See the LICENSE file for more details.
🙌 Contributing
We welcome contributions! Please read the CONTRIBUTING.md file for guidelines on code style, pull requests, and community support.
🤝 Acknowledgements
MVC builds on prior work from the Mamba, StyleTTS2, and VITS communities. We thank the authors for their foundational contributions to the field of TTS.
