LipGANs is a text-to-viseme GAN framework that generates realistic mouth movements directly from text, without requiring audio. It maps phonemes → visemes, predicts phoneme durations, and uses per-viseme 3D GANs to synthesize photorealistic frames that can be exported as PNG sequences, GIFs, or MP4 videos.
# LipGANs: Text-to-Viseme GAN Framework
Traditional lip-syncing methods rely heavily on audio to guide mouth movements. But what happens when audio is missing, corrupted, or unavailable, such as in dubbing, translation, or accessibility scenarios?
LipGANs is my attempt to solve this problem by generating realistic lip movements without using audio at all. Instead, it leverages GANs (Generative Adversarial Networks) to map text or phoneme sequences directly into lip image frames. This makes the project unique and versatile, since no fixed timestamps or speech waveforms are required.
## Why I Built It

### For Accessibility
Deaf and hard-of-hearing users can type in words and visually learn how lip shapes look when spoken.

### For Dubbing & Translation
When creating dubbed movies or multilingual content, we often only have translated text and not clean audio. This system enables generating lip-synced visuals directly from text.

### For Corrupted or Missing Audio
In cases where recordings are damaged, this approach still allows realistic lip generation without needing the original sound.
## Innovation & Impact
Unlike traditional lip-sync models that are audio-first, this project explores a text-to-visual pipeline. It introduces a way to generate synchronized mouth movements even in the absence of audio, bridging accessibility and entertainment needs in a novel way.
This project showcases the potential of generative AI for:
- Inclusive communication
- Cross-language dubbing
- Accessible education
LipGANs aims to reshape how we think about speech visualization, making it more inclusive, adaptable, and resilient.
## Pipeline

Text → Phonemes → Predicted Durations → Visemes → GANs → Frames → Video
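To make the pipeline concrete, here is a minimal, hypothetical sketch of the timing bookkeeping: predicted phoneme durations are turned into per-viseme frame counts at a fixed frame rate. The helper names, the toy phoneme dictionary, and the flat 100 ms duration are illustrative assumptions, not the repository's actual API:

```python
# Hypothetical sketch: text -> phonemes -> durations -> per-viseme frame counts.
FPS = 25  # assumed frame rate (the real value lives in config.py)

# Toy phoneme dictionary and phoneme -> viseme map (illustrative only).
PHONEMES = {"cat": ["K", "AE", "T"]}
VISEME_OF = {"K": "viseme_05", "AE": "viseme_03", "T": "viseme_02"}

def predicted_durations(phonemes):
    """Stand-in for the duration predictor: a flat 100 ms per phoneme."""
    return [0.1 for _ in phonemes]

def plan_frames(word, fps=FPS):
    """Return (viseme, n_frames) pairs describing the animation plan."""
    phonemes = PHONEMES[word]
    plan = []
    for ph, dur in zip(phonemes, predicted_durations(phonemes)):
        n_frames = max(1, round(dur * fps))  # at least one frame per phoneme
        plan.append((VISEME_OF[ph], n_frames))
    return plan

print(plan_frames("cat"))  # [('viseme_05', 2), ('viseme_03', 2), ('viseme_02', 2)]
```

Each planned (viseme, frame-count) pair would then be handed to the corresponding per-viseme GAN.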
## Features

- Audio-free lip generation: converts raw text directly into viseme-based animations.
- Phoneme-to-viseme mapping: maps linguistic units to 10 distinct mouth shapes.
- Per-viseme GAN training: a separate 3D convolutional GAN is trained for each viseme class.
- Automatic dataset preprocessing: segmentation, lip ROI extraction, normalization.
- Built on the TCD-TIMIT dataset: an aligned audiovisual dataset for speech-driven lip synthesis.
## Repository Structure

```
lipgans/
├── README.md                     # Project documentation
├── requirements.txt              # Python dependencies
├── .gitignore                    # Git ignore rules
├── config/
│   └── paths.example.yaml        # Example YAML for setting dataset and model paths
├── src/
│   └── lipgans/
│       ├── __init__.py
│       ├── config.py             # Config options: paths, latent dims, FPS, frame size
│       ├── phonemes.py           # Functions to convert word → phonemes → visemes
│       ├── data/                 # Dataset preprocessing utilities
│       │   ├── mlf_parser.py             # Parses TCD-TIMIT phoneme MLF files
│       │   ├── extract_viseme_clips.py   # Segments video/audio into per-viseme clips
│       │   ├── crop_mouth.py             # Crops mouth ROI from frames
│       │   └── dataset.py                # Dataset helper: load & organize clips for GAN training
│       ├── models/
│       │   └── gan3d.py                  # 3D convolutional GAN architecture per viseme
│       ├── train/
│       │   └── train_viseme.py           # Script to train a single viseme GAN
│       ├── generate/
│       │   ├── merge_gans.py             # Load per-viseme GANs, generate frames, save PNG/GIF/MP4
│       │   └── frontend.py               # Optional GUI / interface to generate words interactively
│       └── utils/
│           ├── io.py                     # File I/O helpers
│           ├── video.py                  # Video assembling & frame handling helpers
│           └── seed.py                   # Random seed initialization for reproducibility
├── scripts/                      # High-level scripts for batch processing or experiments
│   ├── extract_all.py            # Slice all videos into per-viseme clips
│   ├── crop_all.py               # Crop mouth regions for all dataset videos
│   ├── train_all.py              # Train GANs for all viseme classes
│   ├── generate_word.py          # Generate lip animation for a single word
│   └── preview_crops.py          # Quick preview of cropped mouth ROIs
└── examples/                     # Example outputs
    └── demo_words.txt            # List of example words for demo generation
```
## Installation

1. Clone the repository

```bash
git clone https://github.com/your-username/lipgans.git
cd lipgans
```

2. Create a virtual environment (recommended)

```bash
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

Dependencies include:
- TensorFlow / Keras
- NumPy, OpenCV, Imageio
- MediaPipe (for lip landmark detection)
- ffmpeg (for slicing & assembling clips)
- NLTK (for CMU Pronouncing Dictionary)
## Dataset Setup (TCD-TIMIT)

1. Download the TCD-TIMIT dataset manually.
2. Place it under:

```
data/raw/
```

3. Run the preprocessing scripts:

```bash
python src/lipgans/data/extract_viseme_clips.py
python src/lipgans/data/crop_mouth.py
```
This will:
- Segment videos into phoneme-aligned clips.
- Extract mouth regions using MediaPipe FaceMesh.
- Map phonemes → visemes (10 classes).
- Save normalized 3-frame 64×64 sequences into `data/viseme_xx/`.
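The lip-ROI cropping step can be sketched without the full MediaPipe pipeline. Assuming lip landmarks have already been extracted as normalized (x, y) coordinates (as FaceMesh provides), a minimal NumPy-only crop might look like this; the padding margin and the nearest-neighbour resize are assumptions for illustration, not the repository's exact implementation:

```python
import numpy as np

def crop_mouth_roi(frame, landmarks, out_size=64, margin=0.25):
    """Crop a square mouth ROI and resize it to out_size x out_size.

    frame:     H x W x 3 uint8 image.
    landmarks: (N, 2) array of normalized (x, y) lip-landmark coordinates
               in [0, 1], e.g. the lip subset of MediaPipe FaceMesh output.
    """
    h, w = frame.shape[:2]
    xs, ys = landmarks[:, 0] * w, landmarks[:, 1] * h
    cx, cy = xs.mean(), ys.mean()
    # Square box around the lip centre, padded by `margin` (an assumption).
    half = max(np.ptp(xs), np.ptp(ys)) * (0.5 + margin)
    x0, x1 = int(max(cx - half, 0)), int(min(cx + half, w))
    y0, y1 = int(max(cy - half, 0)), int(min(cy + half, h))
    roi = frame[y0:y1, x0:x1]
    # Nearest-neighbour resize, to keep the sketch NumPy-only.
    yi = np.linspace(0, roi.shape[0] - 1, out_size).astype(int)
    xi = np.linspace(0, roi.shape[1] - 1, out_size).astype(int)
    return roi[yi][:, xi]
```

In the actual pipeline this would run per frame, with the crops stacked into the 3-frame sequences the GANs train on.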
## What Are Visemes?
A viseme is the visual counterpart of a phoneme: a mouth shape shared by several speech sounds that look the same on the lips, for example when lip-reading.
Unlike phonemes (the smallest units of sound in language), visemes represent groups of phonemes that appear visually identical on the face when spoken.
Example: the phonemes /p/, /b/, and /m/ all map to the same viseme (closed lips).
This is why phoneme-to-viseme mapping is essential for lip animation:
- It reduces complexity.
- It ensures natural-looking articulation.
Example mapping (simplified):
| Viseme Class | Example Phonemes | Lip Shape Description |
|---|---|---|
| Closed Lips | /p/, /b/, /m/ | Lips fully closed |
| Teeth Touching | /t/, /d/ | Tongue touches teeth |
| Open Mouth (wide) | /a/, /aa/ | Jaw dropped, lips open wide |
| Rounded Lips | /oo/, /uw/, /w/ | Lips rounded forward |
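A mapping like the table above reduces to a plain lookup. The class names and phoneme subset below are illustrative only (the repository keeps its real 10-class map in `viseme_mapping.json`):

```python
# Illustrative phoneme -> viseme lookup (subset; the real map lives in viseme_mapping.json).
PHONEME_TO_VISEME = {
    "P": "closed_lips",    "B": "closed_lips",   "M": "closed_lips",
    "T": "teeth_touching", "D": "teeth_touching",
    "AA": "open_wide",     "AH": "open_wide",
    "UW": "rounded",       "W": "rounded",       "OW": "rounded",
}

def visemes_for(phonemes, default="neutral"):
    """Collapse a phoneme sequence into visemes, merging adjacent duplicates."""
    out = []
    for ph in phonemes:
        v = PHONEME_TO_VISEME.get(ph, default)
        if not out or out[-1] != v:  # skip repeats so /p b/ yields one closed-lips pose
            out.append(v)
    return out

print(visemes_for(["M", "AA", "M", "AH"]))
# ['closed_lips', 'open_wide', 'closed_lips', 'open_wide']
```

Merging adjacent duplicates is what gives the complexity reduction mentioned above: a run of visually identical phonemes becomes a single mouth pose.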
## Training

Train a GAN for a specific viseme class:

```bash
python src/lipgans/train/train_viseme.py --viseme_id 03 --epochs 200
```

- `--viseme_id`: viseme class (01–10).
- `--epochs`: number of training epochs (default = 200).

Trained models are stored in `models/viseme_xx/`.
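For illustration, the per-viseme checkpoint layout can be resolved with a small helper. The directory scheme follows the `models/viseme_xx/` convention shown above; the function itself is hypothetical, not part of the repository:

```python
from pathlib import Path

def viseme_model_dir(viseme_id, root="models"):
    """Resolve the checkpoint directory for a viseme class (1-10),
    following the models/viseme_xx/ naming convention."""
    if not 1 <= viseme_id <= 10:
        raise ValueError(f"viseme_id must be in 1..10, got {viseme_id}")
    return Path(root) / f"viseme_{viseme_id:02d}"

print(viseme_model_dir(3))  # e.g. models/viseme_03
```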
## Inference (Text → Animation)

```bash
python scripts/generate_word.py
```

The output is a sequence of generated frames (PNG), which can also be saved as a GIF or MP4.

Steps performed:
1. Text → Phonemes (using the CMU Pronouncing Dictionary).
2. Phonemes → Visemes (via `viseme_mapping.json`).
3. GAN generation: loads each per-viseme GAN and generates 3-frame clips.
4. Chaining & smoothing: concatenates clips with temporal blending.
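The chaining-and-smoothing step can be sketched with NumPy as a linear cross-fade over a small overlap between consecutive clips. The overlap length and the linear weighting are assumptions for illustration, not necessarily what `merge_gans.py` does:

```python
import numpy as np

def chain_clips(clips, overlap=1):
    """Concatenate clips (each T x H x W x C, values in [0, 1]),
    linearly cross-fading `overlap` frames at every boundary."""
    out = clips[0].astype(np.float32)
    for clip in clips[1:]:
        clip = clip.astype(np.float32)
        if overlap > 0:
            # Interior blend weights, e.g. [0.5] for overlap=1.
            w = np.linspace(0, 1, overlap + 2)[1:-1].reshape(-1, 1, 1, 1)
            blended = (1 - w) * out[-overlap:] + w * clip[:overlap]
            out = np.concatenate([out[:-overlap], blended, clip[overlap:]])
        else:
            out = np.concatenate([out, clip])
    return out
```

Because each boundary consumes `overlap` frames, chaining two 3-frame clips with `overlap=1` yields a 5-frame sequence whose middle frame is the average of the two neighbours.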
Output saved in:

```
example/cat/
├── cat_01.png
├── cat_02.png
├── cat_03.png
├── ...
├── cat.gif
└── cat.mp4
```
## Results
| Approach | Output Quality |
|---|---|
| Single Multi-Class GAN | Blurry, frequent mode collapse |
| Per-Viseme GANs (ours) | Sharper details, stable articulation |
Generated clips show accurate viseme realization and plausible articulation across unseen speakers.
## Applications

- Virtual Avatars & Chatbots: realistic mouth articulation in animated characters.
- Speech Therapy Tools: helping learners visualize correct articulation.
- Assistive Technology for the Deaf/Hard of Hearing: deaf children (or learners with hearing difficulties) can type a word or sentence into the UI and see a sequence of lip movements (frames or animation) showing how it would be spoken, bridging the gap between written text and spoken articulation.
- Gaming & AR/VR: lifelike lip-syncing for immersive experiences and animated characters.
- Audio Dubbing & Localization: generate realistic lip movements that match translated text for films, shows, and animations.
## Roadmap

- Speaker-conditioned GANs (identity preservation).
- Variable-length viseme clips for realistic timing.
- Quantitative evaluation using FVD and lip-reading accuracy.
- Multilingual support (phoneme mappings for other languages).
- Real-time integration for virtual avatars and chatbots.
- Integration with dubbing & localization pipelines for film and media.
## Contributing

Contributions are welcome!

1. Fork the repo
2. Create a new branch (`feature-xyz`)
3. Commit your changes
4. Open a Pull Request
## License

This project is licensed under the MIT License; see LICENSE for details.
## Citation
If you use this project in your research, please cite:
```bibtex
@misc{lipgans2025,
  author = {Nandita Singh},
  title  = {LipGANs: Text-to-Viseme GAN Framework for Audio-Free Lip Animation Generation},
  year   = {2025},
  url    = {https://github.com/madebynanditaaa/lipgans}
}
```

With LipGANs, we take the first step towards speech-free, text-driven lip animation for next-gen human–computer interaction and accessibility!