
LipGANs is a text-to-viseme GAN framework that generates realistic mouth movements directly from text, without requiring audio. It maps phonemes → visemes, predicts phoneme durations, and uses per-viseme 3D GANs to synthesize photorealistic frames that can be exported as PNG sequences, GIFs, or MP4 videos.

LipGANs: Text-to-Viseme GAN Framework

Python
TensorFlow
Keras
License: MIT
Dataset: TCD-TIMIT

Traditional lip-syncing methods rely heavily on audio to guide mouth movements. But what happens when audio is missing, corrupted, or unavailable, such as in dubbing, translation, or accessibility scenarios?

LipGANs is my attempt to solve this problem by generating realistic lip movements without using audio at all. Instead, it leverages GANs (Generative Adversarial Networks) to map text or phoneme sequences directly into lip image frames. This makes the project unique and versatile, since no fixed timestamps or speech waveforms are required.


Why I Built It

🔊 For Accessibility

Deaf and hard-of-hearing users can type in words and visually learn how lip shapes look when spoken.

๐ŸŒ For Dubbing & Translation

When creating dubbed movies or multilingual content, we often only have translated text and not clean audio. This system enables generating lip-synced visuals directly from text.

📼 For Corrupted/Missing Audio

In cases where recordings are damaged, this approach still allows realistic lip generation without needing the original sound.


Innovation & Impact

Unlike traditional lip-sync models that are audio-first, this project explores a text-to-visual pipeline. It introduces a way to generate synchronized mouth movements even in the absence of audio, bridging accessibility and entertainment needs in a novel way.

This project showcases the potential of generative AI for:

  • Inclusive communication
  • Cross-language dubbing
  • Accessible education

LipGANs aims to reshape how we think about speech visualization, making it more inclusive, adaptable, and resilient.


🔄 Pipeline

Text → Phonemes → Predicted Durations → Visemes → GANs → Frames → Video
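The stages above can be sketched end-to-end in a few lines of Python. Everything here is illustrative: the tiny lexicon, viseme ids, and duration table are toy stand-ins for the CMU Pronouncing Dictionary, the viseme mapping, and the learned duration predictor, and none of the function names are the repo's actual API.

```python
# Illustrative end-to-end plan of the pipeline. The tiny lexicon, viseme
# ids, and duration table below are toy stand-ins for the CMU Pronouncing
# Dictionary, the viseme mapping, and the learned duration predictor.

TOY_LEXICON = {"cat": ["K", "AE", "T"]}       # word -> phonemes
TOY_VISEME_MAP = {"K": 7, "AE": 3, "T": 2}    # phoneme -> viseme class id
TOY_DURATIONS = {"K": 2, "AE": 4, "T": 2}     # phoneme -> number of frames

def text_to_phonemes(word):
    return TOY_LEXICON[word.lower()]

def plan_frames(word):
    """Return one (viseme_id, n_frames) entry per phoneme of the word;
    the per-viseme GANs then render frames for each entry."""
    phonemes = text_to_phonemes(word)
    visemes = [TOY_VISEME_MAP[p] for p in phonemes]
    durations = [TOY_DURATIONS[p] for p in phonemes]
    return list(zip(visemes, durations))

print(plan_frames("cat"))  # [(7, 2), (3, 4), (2, 2)]
```

Each (viseme, duration) pair then drives the corresponding per-viseme GAN in the generation stage.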


🚀 Features

  • Audio-free lip generation → Converts raw text directly into viseme-based animations.
  • Phoneme-to-Viseme Mapping → Maps linguistic units to 10 distinct mouth shapes.
  • Per-Viseme GAN Training → A separate 3D convolutional GAN is trained for each viseme class.
  • Automatic Dataset Preprocessing → Segmentation, lip ROI extraction, normalization.
  • Built on the TCD-TIMIT dataset → Aligned audiovisual dataset for speech-driven lip synthesis.

📂 Repository Structure

lipgans/
├─ README.md                # Project documentation
├─ requirements.txt         # Python dependencies
├─ .gitignore               # Git ignore rules
├─ config/
│   └─ paths.example.yaml   # Example YAML for setting dataset and model paths
├─ src/
│   └─ lipgans/
│       ├─ __init__.py
│       ├─ config.py            # Config options: paths, latent dims, FPS, frame size
│       ├─ phonemes.py          # Functions to convert word → phonemes → visemes
│       ├─ data/                # Dataset preprocessing utilities
│       │   ├─ mlf_parser.py           # Parses TCD-TIMIT phoneme MLF files
│       │   ├─ extract_viseme_clips.py # Segments video/audio into per-viseme clips
│       │   ├─ crop_mouth.py           # Crops mouth ROI from frames
│       │   └─ dataset.py              # Dataset helper: load & organize clips for GAN training
│       ├─ models/
│       │   └─ gan3d.py             # 3D convolutional GAN architecture per viseme
│       ├─ train/
│       │   └─ train_viseme.py      # Script to train a single viseme GAN
│       ├─ generate/
│       │   ├─ merge_gans.py        # Load per-viseme GANs, generate frames, save PNG/GIF/MP4
│       │   └─ frontend.py          # Optional GUI / interface to generate words interactively
│       └─ utils/
│           ├─ io.py                # File I/O helpers
│           ├─ video.py             # Video assembling & frame handling helpers
│           └─ seed.py              # Random seed initialization for reproducibility
├─ scripts/                     # High-level scripts for batch processing or experiments
│   ├─ extract_all.py           # Slice all videos into per-viseme clips
│   ├─ crop_all.py              # Crop mouth regions for all dataset videos
│   ├─ train_all.py             # Train GANs for all viseme classes
│   ├─ generate_word.py         # Generate lip animation for a single word
│   └─ preview_crops.py         # Quick preview of cropped mouth ROIs
└─ examples/                    # Example outputs
    └─ demo_words.txt           # List of example words for demo generation

โš™๏ธ Installation

1. Clone the repository

git clone https://github.com/your-username/lipgans.git
cd lipgans

2. Create and activate a virtual environment

python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows

3. Install dependencies

pip install -r requirements.txt

Dependencies include:

  • TensorFlow / Keras
  • NumPy, OpenCV, Imageio
  • MediaPipe (for lip landmark detection)
  • ffmpeg (for slicing & assembling clips)
  • NLTK (for CMU Pronouncing Dictionary)

📊 Dataset Setup (TCD-TIMIT)

  1. Download the TCD-TIMIT dataset manually:
     TCD-TIMIT Dataset

  2. Place it under:

     data/raw/

  3. Run preprocessing scripts:

     python src/lipgans/data/extract_viseme_clips.py
     python src/lipgans/data/crop_mouth.py

This will:

  • Segment videos into phoneme-aligned clips.
  • Extract mouth regions using MediaPipe FaceMesh.
  • Map phonemes → visemes (10 classes).
  • Save normalized 3-frame 64×64 sequences into data/viseme_xx/.
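The cropping step can be illustrated with plain NumPy. This is a hypothetical sketch: the repo's crop_mouth.py obtains lip landmarks from MediaPipe FaceMesh, while here they are passed in directly so only the cropping geometry is shown, and the padding factor and [-1, 1] normalization range are assumptions.

```python
import numpy as np

def crop_mouth_roi(frame, lip_landmarks, pad=0.25):
    """Crop a square, padded box around lip landmarks and scale to [-1, 1].

    frame: HxWx3 uint8 image; lip_landmarks: (N, 2) array of (x, y) pixels.
    """
    xs, ys = lip_landmarks[:, 0], lip_landmarks[:, 1]
    cx, cy = xs.mean(), ys.mean()
    # Half-size of the crop: the larger landmark spread, plus padding.
    half = max(xs.max() - xs.min(), ys.max() - ys.min()) * (0.5 + pad)
    x0, x1 = int(cx - half), int(cx + half)
    y0, y1 = int(cy - half), int(cy + half)
    h, w = frame.shape[:2]
    roi = frame[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)]
    # Scale uint8 pixels to [-1, 1], a common range for GAN training targets.
    return roi.astype(np.float32) / 127.5 - 1.0

frame = np.zeros((480, 640, 3), dtype=np.uint8)                    # dummy frame
landmarks = np.array([[300, 250], [340, 250], [320, 270], [320, 240]])
roi = crop_mouth_roi(frame, landmarks)
print(roi.shape)  # (60, 60, 3)
```

In the actual pipeline the crop would also be resized to the fixed 64×64 training resolution before being stacked into 3-frame sequences.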

🗣 What are Visemes?

A viseme is the visual counterpart of a phoneme: a single mouth shape shared by several speech sounds that look the same on the lips, for example when lip reading.
Unlike phonemes (the smallest units of sound in a language), visemes represent groups of phonemes that appear visually identical on the face when spoken.

👉 Example:

  • The phonemes /p/, /b/, and /m/ all map to the same viseme (closed lips).

This is why phoneme-to-viseme mapping is essential for lip animation:

  • It reduces complexity.
  • It ensures natural-looking articulation.

📌 Example mapping (simplified):

| Viseme Class      | Example Phonemes | Lip Shape Description       |
|-------------------|------------------|-----------------------------|
| Closed Lips       | /p/, /b/, /m/    | Lips fully closed           |
| Teeth Touching    | /t/, /d/         | Tongue touches teeth        |
| Open Mouth (wide) | /a/, /aa/        | Jaw dropped, lips open wide |
| Rounded Lips      | /oo/, /uw/, /w/  | Lips rounded forward        |
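In code, a mapping like the table above is just an inverted lookup. The grouping below is a hypothetical simplification using ARPAbet-style symbols, not the repo's actual viseme_mapping.json (which covers every phoneme across the 10 classes).

```python
# Hypothetical phoneme groupings mirroring the simplified table above
# (the real viseme_mapping.json covers all phonemes across 10 classes).
VISEME_GROUPS = {
    "closed_lips": ["P", "B", "M"],
    "teeth_touch": ["T", "D"],
    "open_wide":   ["AA", "AE"],
    "rounded":     ["UW", "OW", "W"],
}

# Invert to the phoneme -> viseme lookup used at generation time.
PHONEME_TO_VISEME = {
    ph: viseme for viseme, phones in VISEME_GROUPS.items() for ph in phones
}

def map_word(phonemes):
    return [PHONEME_TO_VISEME[p] for p in phonemes]

print(map_word(["B", "AE", "T"]))  # ['closed_lips', 'open_wide', 'teeth_touch']
```

Because many phonemes collapse onto one viseme, only one GAN per viseme class (rather than one per phoneme) is needed.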

๐Ÿ‹๏ธ Training

Train a GAN for a specific viseme class:

python src/lipgans/train/train_viseme.py --viseme_id 03 --epochs 200

  • --viseme_id: Viseme class (01–10).
  • --epochs: Number of training epochs (default = 200).

Trained models will be stored in:

models/viseme_xx/
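For orientation, a per-viseme generator along these lines maps a latent vector to a 3-frame 64×64 RGB clip via Conv3DTranspose upsampling. The layer widths, kernel sizes, and latent dimension here are assumptions for the sketch, not the architecture actually defined in gan3d.py.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(latent_dim=100):
    """Sketch of a 3D-convolutional generator: latent vector -> 3x64x64x3 clip."""
    z = tf.keras.Input(shape=(latent_dim,))
    x = layers.Dense(3 * 4 * 4 * 256, activation="relu")(z)
    x = layers.Reshape((3, 4, 4, 256))(x)  # (time, height, width, channels)
    for filters in (128, 64, 32):
        # Upsample spatially (stride 2) while keeping the 3-frame time axis.
        x = layers.Conv3DTranspose(filters, (3, 4, 4), strides=(1, 2, 2),
                                   padding="same", activation="relu")(x)
    clip = layers.Conv3DTranspose(3, (3, 4, 4), strides=(1, 2, 2),
                                  padding="same", activation="tanh")(x)
    return tf.keras.Model(z, clip)

gen = build_generator()
clip = gen(tf.random.normal([2, 100]))
print(clip.shape)  # (2, 3, 64, 64, 3)
```

The tanh output keeps frames in [-1, 1], matching the normalization applied during preprocessing.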

🎬 Inference (Text → Animation)

Generate a lip animation for a word:

python scripts/generate_word.py

The output is a sequence of generated frames (PNG), which can also be saved as GIF or MP4.

Steps performed:

  1. Text → Phonemes (using the CMU Pronouncing Dictionary).
  2. Phonemes → Visemes (via viseme_mapping.json).
  3. GAN Generation: Loads each viseme GAN and generates 3-frame clips.
  4. Chaining & Smoothing: Concatenates clips with temporal blending.
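Step 4 can be illustrated with a NumPy crossfade. This is a sketch under assumptions: clips are float arrays in [-1, 1], and neighbouring clips are blended over a one-frame overlap, which may not match the exact smoothing window the repo applies.

```python
import numpy as np

def chain_clips(clips, overlap=1):
    """Concatenate (T, H, W, C) clips, crossfading `overlap` boundary frames."""
    out = clips[0]
    for nxt in clips[1:]:
        # Linear blend weights for the overlapping frames (endpoints excluded).
        alphas = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        blended = [(1 - a) * out[-overlap + i] + a * nxt[i]
                   for i, a in enumerate(alphas)]
        out = np.concatenate([out[:-overlap], np.stack(blended), nxt[overlap:]])
    return out

a = np.zeros((3, 64, 64, 3), dtype=np.float32)  # stand-in viseme clip 1
b = np.ones((3, 64, 64, 3), dtype=np.float32)   # stand-in viseme clip 2
video = chain_clips([a, b])
print(video.shape)  # (5, 64, 64, 3)
```

With a one-frame overlap, two 3-frame clips merge into 5 frames, with the boundary frame averaging its neighbours to avoid a visible jump between mouth shapes.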

Output saved in:

examples/cat/
 โ”œโ”€ cat_01.png
 โ”œโ”€ cat_02.png
 โ”œโ”€ cat_03.png
 โ”œโ”€ ...
 โ”œโ”€ cat.gif
 โ””โ”€ cat.mp4
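The export step amounts to de-normalizing the generated floats back to uint8 and writing them out. The sketch below assumes [-1, 1] inputs and uses imageio (already in the dependency list) for the PNG and GIF writes; the file names are illustrative, and MP4 export would go through imageio's ffmpeg backend the same way.

```python
import numpy as np

def to_uint8(frames):
    """Map generator output in [-1, 1] back to displayable uint8 pixels."""
    return np.clip((frames + 1.0) * 127.5, 0, 255).astype(np.uint8)

def export(frames, stem="cat"):
    """Write per-frame PNGs plus an animated GIF (imageio is a listed dependency)."""
    import imageio.v2 as imageio
    imgs = to_uint8(frames)
    for i, img in enumerate(imgs, start=1):
        imageio.imwrite(f"{stem}_{i:02d}.png", img)
    imageio.mimsave(f"{stem}.gif", list(imgs))

frames = np.zeros((3, 64, 64, 3), dtype=np.float32)  # stand-in generator output
print(to_uint8(frames)[0, 0, 0])  # [127 127 127]
```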


📈 Results

| Approach               | Output Quality                       |
|------------------------|--------------------------------------|
| Single multi-class GAN | Blurry, frequent mode collapse       |
| Per-viseme GANs (ours) | Sharper details, stable articulation |

✅ Generated clips show accurate viseme realization and plausible articulation across unseen speakers.


๐ŸŒ Applications

  • ๐ŸŽญ Virtual Avatars & Chatbots โ†’ Realistic mouth articulation in animated characters.
  • ๐Ÿ—ฃ Speech Therapy Tools โ†’ Helping learners visualize correct articulation.
  • ๐Ÿฆป Assistive Technology for the Deaf/Hard of Hearing โ†’
    Deaf children (or learners with hearing difficulties) can simply type a word/sentence into the UI and see a sequence of lip movements (frames or animation) showing how it would be spoken. This bridges the gap between written text and spoken articulation.
  • ๐ŸŽฎ Gaming & AR/VR โ†’ Lifelike lip-syncing for immersive experiences. Can be used by animated characters
  • ๐ŸŽฌ Audio Dubbing & Localization โ†’ Generate realistic lip movements that match translated text for films, shows, and animations.

🔮 Roadmap

  • 🔹 Speaker-conditioned GANs (identity preservation).
  • 🔹 Variable-length viseme clips for realistic timing.
  • 🔹 Quantitative evaluation using FVD and lip-reading accuracy.
  • 🔹 Multilingual support (phoneme mappings for other languages).
  • 🔹 Real-time integration for virtual avatars and chatbots.
  • 🔹 Integration with dubbing & localization pipelines for film and media industries.

๐Ÿค Contributing

Contributions are welcome!

  • Fork the repo
  • Create a new branch (feature-xyz)
  • Commit your changes
  • Open a Pull Request ๐Ÿš€

📜 License

This project is licensed under the MIT License – see LICENSE for details.


🔗 Citation

If you use this project in your research, please cite:

@misc{lipgans2025,
  author = {Nandita Singh},
  title = {LipGANs: Text-to-Viseme GAN Framework for Audio-Free Lip Animation Generation},
  year = {2025},
  url = {https://github.com/madebynanditaaa/lipgans}
}

✨ With LipGANs, we take the first step towards speech-free, text-driven lip animation for next-gen human–computer interaction and accessibility!
