
🦜 VieNeu-TTS

Awesome
Discord

Open In Colab
Hugging Face 0.5B
Hugging Face 0.3B

VieNeu-TTS UI

VieNeu-TTS is an advanced on-device Vietnamese Text-to-Speech (TTS) model with instant voice cloning.

Tip

Voice Cloning: All model variants (including GGUF) support instant voice cloning with just 3-5 seconds of reference audio.

This project features two core architectures trained on the VieNeu-TTS-1000h dataset:

  • VieNeu-TTS (0.5B): An enhanced model optimized for maximum stability.
  • VieNeu-TTS-0.3B: A lighter model trained from scratch, delivering 2x faster inference and ultra-low latency.

These represent a significant upgrade with the following improvements:

  • Enhanced pronunciation: More accurate and stable Vietnamese pronunciation powered by the sea-g2p library
  • Code-switching support: Seamless transitions between Vietnamese and English powered by the sea-g2p library
  • Better voice cloning: Higher fidelity and speaker consistency
  • Real-time synthesis: 24 kHz waveform generation on CPU or GPU
  • Multiple model formats: Support for PyTorch, GGUF Q4/Q8 (CPU optimized), and ONNX codec

VieNeu-TTS delivers production-ready speech synthesis fully offline.

Author: Phạm Nguyễn Ngọc Bảo


Demo-VieNeu-TTS.mp4

📌 Table of Contents

  1. 🦜 Installation & Web UI
  2. 📦 Using the Python SDK
  3. 🐳 Docker & Remote Server
  4. 🎯 Custom Models
  5. 🛠️ Fine-tuning Guide
  6. 🔬 Model Overview
  7. 🐋 Deployment with Docker (Compose)
  8. 🚀 Roadmap
  9. 🤝 Support & Contact

🦜 1. Installation & Web UI

Intel Arc GPU installation (optional): Uses PyTorch 2.11 with XPU support. Intel Arc GPU users should read the Intel Arc GPU section below. Tested on the Arc B580 and A770 on Windows.
Recommended settings: Intel Arc GPUs have high memory bandwidth, so keep the batch size as high as possible and minimize the number of characters per chunk.

Quick Start
ℹ️ This is the fastest way to get started.
For streaming inference, SDK integration, Docker deployment, and advanced setups, see the sections below.

git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS
uv sync
uv run vieneu-web

Open http://127.0.0.1:7860 and start generating speech.

System Requirements

  • NVIDIA GPU (Optional): For maximum speed via LMDeploy or GGUF GPU acceleration.
    • Requires NVIDIA Driver >= 570.65 (CUDA 12.8+).
    • For LMDeploy, it is recommended to have the NVIDIA GPU Computing Toolkit installed.

Installation Steps

  1. Clone the Repo:

    git clone https://github.com/pnnbao97/VieNeu-TTS.git
    cd VieNeu-TTS
  2. Environment Setup with uv (Recommended):

  • Step A: Install uv (if you haven't)

    # Windows:
    powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
    
    # Linux/macOS:
    curl -LsSf https://astral.sh/uv/install.sh | sh
  • Step B: Install dependencies

    Option 1: GPU Support (Default)

    uv sync

    Option 2: CPU-ONLY (Lightweight, no CUDA)

    # Linux/macOS:
    cp pyproject.toml pyproject.toml.gpu
    cp pyproject.toml.cpu pyproject.toml
    uv sync
    
    # Windows (PowerShell/CMD):
    copy pyproject.toml pyproject.toml.gpu
    copy pyproject.toml.cpu pyproject.toml
    uv sync
  3. Start the Web UI:
   uv run vieneu-web

Access the UI at http://127.0.0.1:7860.

⚡ Real-time Streaming (CPU Optimized)

VieNeu-TTS supports ultra-low latency streaming, allowing audio playback to start before the entire sentence is finished. This is specifically optimized for CPU-only devices using the GGUF backend.

  • Latency: <300ms for the first chunk on modern i3/i5 CPUs.
  • Efficiency: Uses Q4/Q8 quantization and ONNX-based lightweight codecs.
  • Usage: Perfect for real-time interactive AI assistants.

Start the dedicated CPU streaming demo:

uv run vieneu-stream

Then open http://localhost:8001 in your browser.
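Low first-chunk latency comes from synthesizing text in small pieces rather than whole paragraphs (this is also why the Intel Arc notes above recommend minimizing characters per chunk). As a minimal sketch — not the SDK's actual splitter, whose name and behavior may differ — here is one way to break text into sentence-bounded chunks under a character budget:

```python
import re

def chunk_text(text: str, max_chars: int = 120) -> list[str]:
    """Split text into sentence-bounded chunks of at most max_chars characters.

    A single sentence longer than max_chars is emitted as its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # flush the filled chunk, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("Xin chào, tôi là VieNeu. Tôi có thể giúp bạn đọc sách. "
        "Tôi cũng có thể làm chatbot thời gian thực.")
for i, chunk in enumerate(chunk_text(text, max_chars=60)):
    print(i, chunk)
```

Each chunk can then be synthesized and played back while the next one is still being generated.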

Intel Arc GPU Users - Installation Guide:

  1. Clone the repo:
    git clone https://github.com/pnnbao97/VieNeu-TTS.git
    cd VieNeu-TTS
  2. Environment and dependency setup with uv (recommended): run setup_xpu_uv.bat
  3. Start the Web UI: run run_xpu.bat

Access the UI at http://127.0.0.1:7860.

📦 2. Using the Python SDK (vieneu)

Integrate VieNeu-TTS into your own software projects.

Quick Install

# Windows (CPU optimized)
pip install vieneu --extra-index-url https://pnnbao97.github.io/llama-cpp-python-v0.3.16/cpu/

# macOS (Metal GPU accelerated)
pip install vieneu --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal/

# Linux / Generic
pip install vieneu

Quick Start (main.py)

from vieneu import Vieneu
import os

# Initialization
tts = Vieneu()

# Standard synthesis (uses default voice)
text = "Xin chào, tôi là VieNeu. Tôi có thể giúp bạn đọc sách, làm chatbot thời gian thực, hoặc thậm chí clone giọng nói của bạn."
audio = tts.infer(text=text)
tts.save(audio, "standard_output.wav")
print("💾 Saved synthesis to: standard_output.wav")

For full implementation details, see examples/main.py.
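tts.save handles file output, but if you want to post-process or write the 24 kHz waveform yourself, the standard library suffices. A sketch assuming the audio is a sequence of mono float samples in [-1, 1] (the exact return type of tts.infer may differ):

```python
import struct
import wave

def save_wav(samples, path: str, sample_rate: int = 24000) -> None:
    """Write mono float samples in [-1, 1] as 16-bit PCM WAV."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)          # mono
        wf.setsampwidth(2)          # 16-bit samples
        wf.setframerate(sample_rate)
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        wf.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))

# 0.1 s of silence as a smoke test
save_wav([0.0] * 2400, "silence.wav")
```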


🐳 3. Docker & Remote Server

Deploy VieNeu-TTS as a high-performance API Server (powered by LMDeploy) with a single command.

Requirement: NVIDIA Container Toolkit is required for GPU support.

1. Start the Server with a Public Tunnel (no port forwarding needed):

docker run --gpus all -p 23333:23333 pnnbao/vieneu-tts:serve --tunnel
  • Default: The server loads the VieNeu-TTS model for maximum quality.
  • Tunneling: The Docker image includes a built-in bore tunnel. Check the container logs to find your public address (e.g., bore.pub:31631).
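The host:port address printed in the container logs can be turned into the api_base URL that remote mode expects (the /v1 suffix matches the remote SDK example). A small hypothetical helper:

```python
def make_api_base(address: str, scheme: str = "http") -> str:
    """Turn a host:port address (e.g. from the bore tunnel logs) into an
    OpenAI-style api_base URL as used by Vieneu(mode='remote', ...)."""
    address = address.strip().rstrip("/")
    if "://" in address:  # already a full URL: keep its scheme
        scheme, address = address.split("://", 1)
    return f"{scheme}://{address}/v1"

print(make_api_base("bore.pub:31631"))  # http://bore.pub:31631/v1
```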

2. Using the SDK (Remote Mode)

Once the server is running, you can connect from anywhere (Colab, Web Apps, etc.) without loading heavy models locally:

from vieneu import Vieneu
import os

# Configuration
REMOTE_API_BASE = 'http://your-server-ip:23333/v1'  # Or bore tunnel URL
REMOTE_MODEL_ID = "pnnbao-ump/VieNeu-TTS"

# Initialization (LIGHTWEIGHT - only loads small codec locally)
tts = Vieneu(mode='remote', api_base=REMOTE_API_BASE, model_name=REMOTE_MODEL_ID)
os.makedirs("outputs", exist_ok=True)

# List remote voices
available_voices = tts.list_preset_voices()
for desc, name in available_voices:
    print(f"   - {desc} (ID: {name})")

# Use specific voice (dynamically select second voice)
if available_voices:
    _, my_voice_id = available_voices[1]
    voice_data = tts.get_preset_voice(my_voice_id)
    audio_spec = tts.infer(text="Chào bạn, tôi đang nói bằng giọng của bác sĩ Tuyên.", voice=voice_data)
    tts.save(audio_spec, f"outputs/remote_{my_voice_id}.wav")
    print(f"💾 Saved synthesis to: outputs/remote_{my_voice_id}.wav")

# Standard synthesis (uses default voice)
text_input = "Chế độ remote giúp tích hợp VieNeu vào ứng dụng Web hoặc App cực nhanh mà không cần GPU tại máy khách."
audio = tts.infer(text=text_input)
tts.save(audio, "outputs/remote_output.wav")
print("💾 Saved remote synthesis to: outputs/remote_output.wav")

# Zero-shot voice cloning (encodes audio locally, sends codes to server)
if os.path.exists("examples/audio_ref/example_ngoc_huyen.wav"):
    cloned_audio = tts.infer(
        text="Đây là giọng nói được clone và xử lý thông qua VieNeu Server.",
        ref_audio="examples/audio_ref/example_ngoc_huyen.wav",
        ref_text="Tác phẩm dự thi bảo đảm tính khoa học, tính đảng, tính chiến đấu, tính định hướng."
    )
    tts.save(cloned_audio, "outputs/remote_cloned_output.wav")
    print("💾 Saved remote cloned voice to: outputs/remote_cloned_output.wav")

For full implementation details, see: examples/main_remote.py

Voice Preset Specification (v1.0)

VieNeu-TTS uses the official vieneu.voice.presets specification to define reusable voice assets.
Only voices.json files following this spec are guaranteed to be compatible with VieNeu-TTS SDK ≥ v1.x.

3. Advanced Configuration

Customize the server to run specific versions or your own fine-tuned models.

Run the 0.3B Model (Faster):

docker run --gpus all pnnbao/vieneu-tts:serve --model pnnbao-ump/VieNeu-TTS-0.3B --tunnel

Serve a Local Fine-tuned Model:
If you have merged a LoRA adapter, mount your output directory to the container:

# Linux / macOS
docker run --gpus all \
  -v $(pwd)/finetune/output:/workspace/models \
  pnnbao/vieneu-tts:serve \
  --model /workspace/models/merged_model --tunnel



🎯 4. Custom Models (LoRA, GGUF, Finetune)

VieNeu-TTS allows you to load custom models directly from HuggingFace or local paths via the Web UI.

👉 See the detailed guide at: docs/CUSTOM_MODEL_USAGE.md


🛠️ 5. Fine-tuning Guide

Train VieNeu-TTS on your own voice or custom datasets.

  • Simple Workflow: Use the train.py script with optimized LoRA configurations.
  • Documentation: Follow the step-by-step guide in finetune/README.md.
  • Notebook: Experience it directly on Google Colab via finetune/finetune_VieNeu-TTS.ipynb.

🔬 6. Model Overview (Backbones)

| Model | Format | Device | Quality | Speed |
|---|---|---|---|---|
| VieNeu-TTS | PyTorch | GPU/CPU | ⭐⭐⭐⭐⭐ | Very fast (with LMDeploy) |
| VieNeu-TTS-0.3B | PyTorch | GPU/CPU | ⭐⭐⭐⭐ | Ultra fast (2x) |
| VieNeu-TTS-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐ | Fast |
| VieNeu-TTS-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐ | Very fast |
| VieNeu-TTS-0.3B-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐ | Ultra fast (1.5x) |
| VieNeu-TTS-0.3B-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐ | Extreme speed (2x) |
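If you pick a backbone programmatically, the table above can be encoded as a small lookup. A hypothetical helper — the names mirror the table, not verified Hugging Face repo IDs:

```python
def pick_model(cpu_only: bool, prefer_speed: bool) -> str:
    """Choose a model variant from the comparison table above.

    cpu_only     -> prefer the GGUF builds, which are CPU-optimized
    prefer_speed -> trade some quality stars for the faster 0.3B / Q4 variants
    """
    if cpu_only:
        return "VieNeu-TTS-0.3B-q4-gguf" if prefer_speed else "VieNeu-TTS-q8-gguf"
    return "VieNeu-TTS-0.3B" if prefer_speed else "VieNeu-TTS"

print(pick_model(cpu_only=True, prefer_speed=True))   # VieNeu-TTS-0.3B-q4-gguf
print(pick_model(cpu_only=False, prefer_speed=False)) # VieNeu-TTS
```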

🔬 Model Details

  • Training Data: VieNeu-TTS-1000h — 443,641 curated Vietnamese samples (Used for all versions).
  • Audio Codec: NeuCodec (Torch implementation; ONNX & quantized variants supported).
  • Context Window: 2,048 tokens shared by prompt text and speech tokens.
  • Output Watermark: Enabled by default.

🐋 7. Deployment with Docker (Compose)

Deploy quickly without manual environment setup.

Note: Docker deployment currently supports GPU only. For CPU usage, please follow the Installation & Web UI section to install from source.

# Run with GPU (Requires NVIDIA Container Toolkit)
docker compose -f docker/docker-compose.yml --profile gpu up

Check docs/Deploy.md for more details.


🚀 8. Roadmap

We are constantly working to improve VieNeu-TTS. Here is what we have planned:

  • 🦜 VieNeu-TTS 2.0: Upcoming version featuring superior voice cloning fidelity and improved handling of long-context text synthesis.
  • 🔊 VieNeu-Codec: Development of a custom neural audio codec specifically optimized for the nuances of the Vietnamese language.
  • 📦 Multi-format Support: Beyond GGUF, we plan to add official support for ONNX to enable even wider deployment flexibility (Web, Mobile, etc.).
  • 🩺 VieNeu-TTS Medical: A specialized model variant fine-tuned for medical terminology and healthcare domain applications.

🤝 9. Support & Contact

  • Hugging Face: pnnbao-ump
  • Discord: Join our community
  • Facebook: Pham Nguyen Ngoc Bao
  • Licensing:
    • VieNeu-TTS (0.5B): Apache 2.0 (Free to use).
    • VieNeu-TTS-0.3B: CC BY-NC 4.0 (Non-commercial).
      • Free: For students, researchers, and non-profit purposes.
      • ⚠️ Commercial/Enterprise: Contact the author for licensing.

📑 Citation

@misc{vieneutts2026,
  title        = {VieNeu-TTS: Vietnamese Text-to-Speech with Instant Voice Cloning},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS}}
}

🌟 Star History

Star History Chart


🤝 Contributors

Thanks to all the amazing people who have contributed to this project!


🙏 Acknowledgements

This project uses neucodec for audio decoding and sea-g2p for text normalization and phonemization.


Made with ❤️ for the Vietnamese TTS community
