pnnbao97/VieNeu-TTS
Vietnamese TTS with instant voice cloning • On-device • Real-time CPU inference • 24 kHz audio quality
🦜 VieNeu-TTS
VieNeu-TTS is an advanced on-device Vietnamese Text-to-Speech (TTS) model with instant voice cloning.
Tip
Voice Cloning: All model variants (including GGUF) support instant voice cloning with just 3-5 seconds of reference audio.
This project features two core architectures trained on the VieNeu-TTS-1000h dataset:
- VieNeu-TTS (0.5B): An enhanced model optimized for maximum stability.
- VieNeu-TTS-0.3B: A specialized model trained from scratch, delivering 2x faster inference and ultra-low latency.
These represent a significant upgrade with the following improvements:
- Enhanced pronunciation: More accurate and stable Vietnamese pronunciation powered by the sea-g2p library
- Code-switching support: Seamless transitions between Vietnamese and English powered by the sea-g2p library
- Better voice cloning: Higher fidelity and speaker consistency
- Real-time synthesis: 24 kHz waveform generation on CPU or GPU
- Multiple model formats: Support for PyTorch, GGUF Q4/Q8 (CPU optimized), and ONNX codec
VieNeu-TTS delivers production-ready speech synthesis fully offline.
Author: Phạm Nguyễn Ngọc Bảo
Demo video: Demo-VieNeu-TTS.mp4
📌 Table of Contents
- 🦜 Installation & Web UI
- 📦 Using the Python SDK
- 🐳 Docker & Remote Server
- 🎯 Custom Models
- 🛠️ Fine-tuning Guide
- 🔬 Model Overview
- 🐋 Deployment with Docker (Compose)
- 🚀 Roadmap
- 🤝 Support & Contact
🦜 1. Installation & Web UI
Intel Arc GPU installation (optional): uses PyTorch 2.11 with XPU support. Intel Arc GPU users should read the Intel Arc GPU section below. Tested on Arc B580 and A770 on Windows.
Recommended settings: Intel Arc GPUs have high memory bandwidth, so keep the batch size as high as possible and minimize the number of characters per chunk.
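The "characters per chunk" setting bounds how much text each batch item synthesizes. As a rough, hypothetical illustration of sentence-aware chunking (this is not the project's actual splitter, just a sketch of the idea):

```python
import re

def chunk_text(text: str, max_chars: int = 120) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    Illustrative only -- VieNeu-TTS has its own internal chunking logic.
    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when appending would exceed the budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text("Câu một. Câu hai dài hơn một chút. Câu ba.", max_chars=25)
```

Smaller chunks mean each batch item finishes sooner, which is why the Intel Arc recommendation pairs a large batch size with few characters per chunk.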
⚡ Quick Start
ℹ️ This is the fastest way to get started.
For streaming inference, SDK integration, Docker deployment, and advanced setups, see the sections below.

```shell
git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS
uv sync
uv run vieneu-web
```

Open http://127.0.0.1:7860 and start generating speech.
System Requirements
- NVIDIA GPU (Optional): For maximum speed via LMDeploy or GGUF GPU acceleration.
- Requires NVIDIA Driver >= 570.65 (CUDA 12.8+).
- For LMDeploy, it is recommended to have the NVIDIA GPU Computing Toolkit installed.
Installation Steps
1. **Clone the repo:**

   ```shell
   git clone https://github.com/pnnbao97/VieNeu-TTS.git
   cd VieNeu-TTS
   ```

2. **Environment setup with `uv` (recommended):**

   **Step A: Install uv (if you haven't)**

   ```shell
   # Windows:
   powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

   # Linux/macOS:
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```
   **Step B: Install dependencies**

   Option 1: GPU support (default)

   ```shell
   uv sync
   ```

   Option 2: CPU-only (lightweight, no CUDA)

   ```shell
   # Linux/macOS:
   cp pyproject.toml pyproject.toml.gpu
   cp pyproject.toml.cpu pyproject.toml
   uv sync

   # Windows (PowerShell/CMD):
   copy pyproject.toml pyproject.toml.gpu
   copy pyproject.toml.cpu pyproject.toml
   uv sync
   ```
3. **Start the Web UI:**

   ```shell
   uv run vieneu-web
   ```

   Access the UI at http://127.0.0.1:7860.
⚡ Real-time Streaming (CPU Optimized)
VieNeu-TTS supports ultra-low latency streaming, allowing audio playback to start before the entire sentence is finished. This is specifically optimized for CPU-only devices using the GGUF backend.
- Latency: <300ms for the first chunk on modern i3/i5 CPUs.
- Efficiency: Uses Q4/Q8 quantization and ONNX-based lightweight codecs.
- Usage: Perfect for real-time interactive AI assistants.
Start the dedicated CPU streaming demo:
```shell
uv run vieneu-stream
```

Then open http://localhost:8001 in your browser.
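Streaming means the client can begin playback before synthesis finishes. The SDK's actual streaming API may differ; as a rough illustration with a stand-in generator, a client could append 16-bit PCM chunks to a WAV file as they arrive:

```python
import wave

def fake_stream(n_chunks: int = 5, chunk_samples: int = 4800):
    """Stand-in for a streaming TTS generator: ~0.2 s of silence per chunk at 24 kHz.

    Hypothetical -- replace with the real SDK's streaming output.
    """
    for _ in range(n_chunks):
        yield b"\x00\x00" * chunk_samples  # 16-bit PCM silence

with wave.open("streamed_output.wav", "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(24000)  # VieNeu-TTS outputs 24 kHz audio
    for chunk in fake_stream():
        # Each chunk is written as soon as it is produced;
        # a real client could start playback after the first one.
        wav.writeframes(chunk)
```

This is why first-chunk latency (rather than total synthesis time) is the figure that matters for interactive assistants.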
Intel Arc GPU Users - Installation Guide:

1. **Clone the repo:**

   ```shell
   git clone https://github.com/pnnbao97/VieNeu-TTS.git
   cd VieNeu-TTS
   ```

2. **Set up the environment and dependencies with `uv` (recommended):**

   ```shell
   setup_xpu_uv.bat
   ```

3. **Start the Web UI:**

   ```shell
   run_xpu.bat
   ```

   Access the UI at http://127.0.0.1:7860.
📦 2. Using the Python SDK (vieneu)
Integrate VieNeu-TTS into your own software projects.
Quick Install
```shell
# Windows (CPU optimized)
pip install vieneu --extra-index-url https://pnnbao97.github.io/llama-cpp-python-v0.3.16/cpu/

# macOS (Metal GPU accelerated)
pip install vieneu --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal/

# Linux / Generic
pip install vieneu
```

Quick Start (main.py)
```python
from vieneu import Vieneu

# Initialization
tts = Vieneu()

# Standard synthesis (uses the default voice)
text = "Xin chào, tôi là VieNeu. Tôi có thể giúp bạn đọc sách, làm chatbot thời gian thực, hoặc thậm chí clone giọng nói của bạn."
audio = tts.infer(text=text)
tts.save(audio, "standard_output.wav")
print("💾 Saved synthesis to: standard_output.wav")
```

For full implementation details, see examples/main.py.
🐳 3. Docker & Remote Server
Deploy VieNeu-TTS as a high-performance API Server (powered by LMDeploy) with a single command.
1. Run with Docker (Recommended)
Requirement: NVIDIA Container Toolkit is required for GPU support.
Start the server with a public tunnel (no port forwarding needed):

```shell
docker run --gpus all -p 23333:23333 pnnbao/vieneu-tts:serve --tunnel
```

- Default: The server loads the `VieNeu-TTS` model for maximum quality.
- Tunneling: The Docker image includes a built-in `bore` tunnel. Check the container logs to find your public address (e.g., `bore.pub:31631`).
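LMDeploy-based servers conventionally expose an OpenAI-compatible REST API. Assuming that convention holds for this image (the endpoint path is an assumption, so verify against the container logs), a quick way to check which models are being served is to query the model-listing endpoint:

```python
import urllib.request

def build_models_request(api_base: str) -> urllib.request.Request:
    """Build a GET request for the OpenAI-compatible model-listing endpoint.

    Assumes the server follows the standard /v1/models convention.
    """
    url = f"{api_base.rstrip('/')}/models"
    return urllib.request.Request(url, headers={"Accept": "application/json"})

req = build_models_request("http://your-server-ip:23333/v1")
# urllib.request.urlopen(req) would return the served model list as JSON;
# it is not executed here because it requires a running server.
```

If the request succeeds, the same `api_base` value can be passed to the SDK's remote mode described below.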
2. Using the SDK (Remote Mode)
Once the server is running, you can connect from anywhere (Colab, Web Apps, etc.) without loading heavy models locally:
```python
from vieneu import Vieneu
import os

# Configuration
REMOTE_API_BASE = 'http://your-server-ip:23333/v1'  # Or bore tunnel URL
REMOTE_MODEL_ID = "pnnbao-ump/VieNeu-TTS"

# Initialization (lightweight - only loads the small codec locally)
tts = Vieneu(mode='remote', api_base=REMOTE_API_BASE, model_name=REMOTE_MODEL_ID)
os.makedirs("outputs", exist_ok=True)

# List remote voices
available_voices = tts.list_preset_voices()
for desc, name in available_voices:
    print(f" - {desc} (ID: {name})")

# Use a specific voice (dynamically select the second voice)
if len(available_voices) > 1:
    _, my_voice_id = available_voices[1]
    voice_data = tts.get_preset_voice(my_voice_id)
    audio_spec = tts.infer(text="Chào bạn, tôi đang nói bằng giọng của bác sĩ Tuyên.", voice=voice_data)
    tts.save(audio_spec, f"outputs/remote_{my_voice_id}.wav")
    print(f"💾 Saved synthesis to: outputs/remote_{my_voice_id}.wav")

# Standard synthesis (uses the default voice)
text_input = "Chế độ remote giúp tích hợp VieNeu vào ứng dụng Web hoặc App cực nhanh mà không cần GPU tại máy khách."
audio = tts.infer(text=text_input)
tts.save(audio, "outputs/remote_output.wav")
print("💾 Saved remote synthesis to: outputs/remote_output.wav")

# Zero-shot voice cloning (encodes audio locally, sends codes to the server)
if os.path.exists("examples/audio_ref/example_ngoc_huyen.wav"):
    cloned_audio = tts.infer(
        text="Đây là giọng nói được clone và xử lý thông qua VieNeu Server.",
        ref_audio="examples/audio_ref/example_ngoc_huyen.wav",
        ref_text="Tác phẩm dự thi bảo đảm tính khoa học, tính đảng, tính chiến đấu, tính định hướng.",
    )
    tts.save(cloned_audio, "outputs/remote_cloned_output.wav")
    print("💾 Saved remote cloned voice to: outputs/remote_cloned_output.wav")
```

For full implementation details, see examples/main_remote.py.
Voice Preset Specification (v1.0)
VieNeu-TTS uses the official vieneu.voice.presets specification to define reusable voice assets.
Only voices.json files following this spec are guaranteed to be compatible with VieNeu-TTS SDK ≥ v1.x.
3. Advanced Configuration
Customize the server to run specific versions or your own fine-tuned models.
Run the 0.3B model (faster):

```shell
docker run --gpus all pnnbao/vieneu-tts:serve --model pnnbao-ump/VieNeu-TTS-0.3B --tunnel
```

Serve a local fine-tuned model:

If you have merged a LoRA adapter, mount your output directory into the container:

```shell
# Linux / macOS
docker run --gpus all \
  -v $(pwd)/finetune/output:/workspace/models \
  pnnbao/vieneu-tts:serve \
  --model /workspace/models/merged_model --tunnel
```
🎯 4. Custom Models (LoRA, GGUF, Finetune)
VieNeu-TTS allows you to load custom models directly from HuggingFace or local paths via the Web UI.
👉 See the detailed guide at: docs/CUSTOM_MODEL_USAGE.md
🛠️ 5. Fine-tuning Guide
Train VieNeu-TTS on your own voice or custom datasets.
- Simple workflow: Use the `train.py` script with optimized LoRA configurations.
- Documentation: Follow the step-by-step guide in finetune/README.md.
- Notebook: Run it directly on Google Colab via `finetune/finetune_VieNeu-TTS.ipynb`.
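The workflow above relies on LoRA, which trains a pair of small low-rank matrices instead of each full weight matrix. A back-of-envelope comparison shows why this fits on modest hardware (the hidden size and rank below are illustrative values, not the project's actual configuration):

```python
def lora_param_count(d: int, k: int, r: int) -> tuple[int, int]:
    """Compare trainable parameters for one d x k weight matrix.

    Full fine-tuning trains d*k parameters; LoRA trains A (d x r)
    plus B (r x k), i.e. r*(d + k) parameters.
    """
    full = d * k
    lora = r * (d + k)
    return full, lora

# Hypothetical 0.5B-class hidden size and a common LoRA rank
full, lora = lora_param_count(d=896, k=896, r=16)
print(f"full: {full:,}  lora: {lora:,}  ({lora / full:.1%} of full)")
```

With these example numbers LoRA trains only a few percent of the per-matrix parameters, which is what makes the Colab notebook workflow practical.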
🔬 6. Model Overview (Backbones)
| Model | Format | Device | Quality | Speed |
|---|---|---|---|---|
| VieNeu-TTS | PyTorch | GPU/CPU | ⭐⭐⭐⭐⭐ | Very Fast with lmdeploy |
| VieNeu-TTS-0.3B | PyTorch | GPU/CPU | ⭐⭐⭐⭐ | Ultra Fast (2x) |
| VieNeu-TTS-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐ | Fast |
| VieNeu-TTS-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐ | Very Fast |
| VieNeu-TTS-0.3B-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐ | Ultra Fast (1.5x) |
| VieNeu-TTS-0.3B-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐ | Extreme Speed (2x) |
🔬 Model Details
- Training Data: VieNeu-TTS-1000h — 443,641 curated Vietnamese samples (Used for all versions).
- Audio Codec: NeuCodec (Torch implementation; ONNX & quantized variants supported).
- Context Window: 2,048 tokens shared by prompt text and speech tokens.
- Output Watermark: Enabled by default.
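The 2,048-token budget above bounds how much audio a single generation pass can emit, since prompt text and speech tokens share it. Assuming the codec produces on the order of 50 speech tokens per second (an assumption to verify against NeuCodec's actual frame rate), a rough budget calculation:

```python
CONTEXT_WINDOW = 2048    # tokens shared by the text prompt and speech tokens
TOKENS_PER_SECOND = 50   # assumed codec frame rate -- verify for your codec
PROMPT_TOKENS = 128      # example text-prompt length, illustrative only

# Tokens left for speech after the prompt is accounted for
speech_budget = CONTEXT_WINDOW - PROMPT_TOKENS
max_seconds = speech_budget / TOKENS_PER_SECOND
print(f"~{max_seconds:.1f} s of audio per pass")
```

Longer texts therefore have to be split into chunks and synthesized in multiple passes.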
🐋 7. Deployment with Docker (Compose)
Deploy quickly without manual environment setup.
Note: Docker deployment currently supports GPU only. For CPU usage, please follow the Installation & Web UI section to install from source.
```shell
# Run with GPU (requires NVIDIA Container Toolkit)
docker compose -f docker/docker-compose.yml --profile gpu up
```

Check docs/Deploy.md for more details.
📚 References
- Dataset: VieNeu-TTS-1000h (Hugging Face)
- Model 0.5B: pnnbao-ump/VieNeu-TTS
- Model 0.3B: pnnbao-ump/VieNeu-TTS-0.3B
- LoRA Guide: docs/CUSTOM_MODEL_USAGE.md
🚀 Roadmap
We are constantly working to improve VieNeu-TTS. Here is what we have planned:
- 🦜 VieNeu-TTS 2.0: Upcoming version featuring superior voice cloning fidelity and improved handling of long-context text synthesis.
- 🔊 VieNeu-Codec: Development of a custom neural audio codec specifically optimized for the nuances of the Vietnamese language.
- 📦 Multi-format Support: Beyond GGUF, we plan to add official support for ONNX to enable even wider deployment flexibility (Web, Mobile, etc.).
- 🩺 VieNeu-TTS Medical: A specialized model variant fine-tuned for medical terminology and healthcare domain applications.
🤝 Support & Contact
- Hugging Face: pnnbao-ump
- Discord: Join our community
- Facebook: Pham Nguyen Ngoc Bao
- Licensing:
  - VieNeu-TTS (0.5B): Apache 2.0 (free to use).
  - VieNeu-TTS-0.3B: CC BY-NC 4.0 (non-commercial).
    - ✅ Free for students, researchers, and non-profit purposes.
    - ⚠️ Commercial/Enterprise: contact the author for licensing.
📑 Citation
```bibtex
@misc{vieneutts2026,
  title = {VieNeu-TTS: Vietnamese Text-to-Speech with Instant Voice Cloning},
  author = {Pham Nguyen Ngoc Bao},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS}}
}
```

🌟 Star History
🤝 Contributors
Thanks to all the amazing people who have contributed to this project!
🙏 Acknowledgements
This project uses neucodec for audio decoding and sea-g2p for text normalization and phonemization.
Made with ❤️ for the Vietnamese TTS community