Perception Encoder Audio-Visual (PE-AV) SyncNet: Audio-Visual Synchronization
SyncNet based on Meta's Perception Encoder Audio-Visual (PE-AV)
A PyTorch Lightning implementation of SyncNet for detecting audio-visual synchronization in videos. This project trains deep learning models to determine whether audio and video streams are temporally aligned.
Overview
SyncNet learns to measure the synchronization between audio and video by computing similarity scores between learned embeddings. The model uses a pretrained audio-visual encoder (PeAudioVideo) and is trained using contrastive learning with both synchronized (positive) and out-of-sync (negative) samples.
Key Applications
- Lip-sync detection: Verify if speech audio matches visible lip movements
- Video quality assessment: Detect audio-visual synchronization issues
- Deepfake detection: Identify manipulated videos with mismatched audio
- Video post-production: Automated sync checking for edited content
Features
- Pretrained Encoder: Built on HuggingFace's PeAudioVideo model
- PyTorch Lightning: Clean, scalable training framework with minimal boilerplate
- Multi-GPU Support: Distributed training with DeepSpeed Stage 2
- Mixed Precision Training: Automatic BF16 mixed precision for faster training
- Comprehensive Logging: Weights & Biases integration with metric tracking
- Data Augmentation: Automatic negative sample generation via temporal shifts
- Gradient Checkpointing: Memory-efficient training for large models
- Type Safety: Full type annotations with mypy validation
- Modern Tooling: Fast dependency management with `uv`
Installation
Prerequisites
- Python 3.12+
- CUDA-capable GPU (recommended)
- Git
1. Install uv (Fast Python Package Manager)
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. Clone the Repository
```
git clone https://github.com/yourusername/pe-av-syncnet.git
cd pe-av-syncnet
```
3. Install Dependencies
```
uv sync
```
This will install all required dependencies, including:
- PyTorch with CUDA support
- PyTorch Lightning
- Transformers (for PeAudioVideo model)
- TorchAudio and TorchVision
- Weights & Biases
- And more...
4. Set Up Environment Variables
```
cp .env.example .env
```
Edit `.env` and add your credentials:
```
WANDB_PROJECT=your-project-name
WANDB_ENTITY=your-wandb-username
```
5. Install Pre-commit Hooks (Optional)
```
uv run pre-commit install
```
Project Structure
```
pe-av-syncnet/
├── src/syncnet/
│   ├── __init__.py             # Package initialization
│   ├── config.py               # Pydantic configuration with hyperparameters
│   ├── lightning_module.py     # Lightning training module
│   ├── datamodule.py           # Data loading and preprocessing
│   ├── datasets/
│   │   ├── __init__.py         # Batch data structure
│   │   └── dataset.py          # Video dataset loader
│   ├── modeling/
│   │   ├── __init__.py         # Model package
│   │   └── model.py            # SyncNet architecture
│   └── scripts/
│       ├── __init__.py         # Scripts package
│       └── train.py            # Training script
├── tests/
│   └── test_sample.py          # Test suite
├── pyproject.toml              # Project configuration and dependencies
├── .pre-commit-config.yaml     # Code quality hooks
├── .env.example                # Environment variables template
└── README.md                   # This file
```
Dataset Preparation
SyncNet expects a directory containing MP4 video files with both audio and video streams.
Dataset Structure
```
data/
├── video1.mp4
├── video2.mp4
├── video3.mp4
├── subfolder/
│   ├── video4.mp4
│   └── video5.mp4
└── ...
```
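As the layout above suggests, videos may sit at any depth below the data root. A minimal sketch of recursive discovery with `pathlib` (illustrative only; the project's actual dataset loader may collect files differently):

```python
from pathlib import Path


def find_videos(data_root: str) -> list[Path]:
    """Recursively collect all MP4 files under the data root, in sorted order."""
    return sorted(Path(data_root).rglob("*.mp4"))
```

`rglob` matches files in the root and in all subfolders, so nested datasets need no extra configuration.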
Requirements
- Format: MP4 files with H.264 video and AAC audio
- Audio: Mono or stereo; stereo is automatically downmixed to mono
- Video: Any resolution (will be resized to 224x224)
- Frame Rate: 25 fps recommended
- Audio Sample Rate: 16kHz or 48kHz
- Duration: At least 0.2 seconds (5 frames at 25fps)
Dataset Recommendations
- Minimum size: 1000+ videos for meaningful training
- Diversity: Include various speakers, environments, and scenarios
- Quality: Clear audio with visible speakers for best results
Usage
Basic Training
Train a model on your video dataset:
```
uv run train /path/to/videos --num_devices 1 --num_workers 8
```
Multi-GPU Training
Train with multiple GPUs using DeepSpeed:
```
uv run train /path/to/videos --num_devices 4 --num_workers 16
```
Resume Training from Checkpoint
```
uv run train /path/to/videos --checkpoint_path logs/pe-av-small-abc1234/last.ckpt
```
Load Pretrained Weights
Initialize model with custom weights:
```
uv run train /path/to/videos --weights_path /path/to/weights.pth
```
Debug Mode
Run training offline without uploading to Weights & Biases:
```
uv run train /path/to/videos --debug
```
Fast Development Run
Test your pipeline with a single batch:
```
uv run train /path/to/videos --fast_dev_run
```
Command-Line Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `data_root` | Path | Required | Directory containing video files |
| `--project` | str | `"template"` | Project name for logging |
| `--num_devices` | int | `1` | Number of GPUs to use |
| `--num_workers` | int | `12` | Data loading workers |
| `--log_root` | Path | `"logs"` | Directory for checkpoints and logs |
| `--checkpoint_path` | Path | `None` | Path to checkpoint for resuming |
| `--weights_path` | Path | `None` | Path to pretrained weights |
| `--debug` | flag | `False` | Enable debug mode (offline logging) |
| `--fast_dev_run` | flag | `False` | Run a single batch for testing |
Model Architecture
SyncNet Model
The SyncNet model consists of:
1. Pretrained Encoder: Meta's PeAudioVideoModel on HuggingFace
   - Processes audio and video separately
   - Extracts rich multimodal embeddings
   - Gradient checkpointing enabled for memory efficiency
2. Embedding Processing:
   - Flatten temporal/spatial dimensions
   - L2 normalization
   - ReLU activation (ensures positive similarity)
3. Similarity Computation:
   - Cosine similarity between audio and video embeddings
   - Output: score from 0 to 1 (higher = better sync)
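The embedding-processing and similarity steps can be sketched in NumPy (a simplified stand-in for the actual tensor code; shapes and names are illustrative):

```python
import numpy as np


def sync_score(audio_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Flatten -> ReLU -> L2 normalize each embedding, then take cosine similarity."""

    def process(x: np.ndarray) -> np.ndarray:
        x = x.reshape(-1)                       # flatten temporal/spatial dims
        x = np.maximum(x, 0.0)                  # ReLU keeps components non-negative
        return x / (np.linalg.norm(x) + 1e-8)   # L2 normalize to a unit vector

    a, v = process(audio_emb), process(video_emb)
    # Dot product of unit vectors with non-negative entries lies in [0, 1].
    return float(np.dot(a, v))
```

Because both vectors are non-negative after the ReLU, the cosine similarity cannot drop below zero, which is what lets the score be treated as a probability for the binary cross-entropy loss.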
Training Process
```
Input Video → Random Segment Sampling
        ↓
Audio + Video Preprocessing
        ↓
[50% chance] Temporal Shift (negative sample)
        ↓
PeAudioVideo Encoder
        ↓
Audio Embedding + Video Embedding
        ↓
Cosine Similarity
        ↓
Binary Cross-Entropy Loss
```
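The temporal-shift step can be illustrated as follows (a hedged sketch; the project's actual augmentation lives in the dataset code and may differ in detail). Shifting the audio by one video frame corresponds to `sample_rate / fps` audio samples:

```python
import numpy as np


def shift_audio(audio: np.ndarray, frames: int, fps: int = 25, sample_rate: int = 48000) -> np.ndarray:
    """Shift audio by a whole number of video frames to create an out-of-sync sample."""
    offset = frames * sample_rate // fps  # samples per video frame (1920 at 25 fps / 48 kHz)
    return np.roll(audio, offset)


def make_sample(audio: np.ndarray, rng: np.random.Generator, negative_fraction: float = 0.5):
    """With probability negative_fraction, shift audio by ±1 frame and label it 0 (out of sync)."""
    if rng.random() < negative_fraction:
        shift = int(rng.choice([-1, 1]))
        return shift_audio(audio, shift), 0.0
    return audio, 1.0
```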
Configuration
All hyperparameters are defined in src/syncnet/config.py:
```python
class Config(BaseModel):
    # Reproducibility
    seed: int = 42

    # Data
    test_split: float = 0.05
    batch_size: int = 4

    # Training
    max_epochs: int = 200
    early_stopping_patience: int = 10
    learning_rate: float = 1e-4
    min_learning_rate: float = 1e-6
    weight_decay: float = 1e-2
    accumulate_grad_batches: int = 1
    gradient_clip_val: float = 1.0

    # Model
    base_model: str = "facebook/pe-av-small"
    num_frames: int = 5
    negative_fraction: float = 0.5
    frame_height: int = 224
    frame_width: int = 224
```
Key Parameters
- base_model: HuggingFace model ID for the pretrained encoder
- num_frames: Number of video frames per sample (5 frames = 0.2s at 25fps)
- negative_fraction: Proportion of negative samples (0.5 = 50% out-of-sync)
- batch_size: Adjust based on GPU memory (4 works well for most GPUs)
- learning_rate: Initial learning rate with OneCycleLR scheduler
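As a sanity check on these numbers (assuming 25 fps video and 48 kHz audio, per the dataset requirements), a 5-frame clip spans 0.2 s and about 9600 audio samples:

```python
def clip_stats(num_frames: int = 5, fps: int = 25, sample_rate: int = 48000):
    """Duration in seconds and audio sample count for one training clip."""
    duration_s = num_frames / fps
    audio_samples = int(duration_s * sample_rate)
    return duration_s, audio_samples
```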
Training Details
Data Augmentation
- Random temporal cropping: Samples random 5-frame segments from videos
- Negative sample generation: 50% of samples get audio shifted by ±1 frame
- Stereo to mono conversion: Automatically handles stereo audio
- Resampling: Audio resampled from 16kHz to 48kHz
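Stereo-to-mono conversion is typically just an average over channels; a NumPy sketch for illustration (the actual pipeline may use torchaudio):

```python
import numpy as np


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average (channels, samples) audio down to mono; pass 1-D audio through unchanged."""
    if audio.ndim == 2:
        return audio.mean(axis=0)
    return audio
```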
Optimization
- Optimizer: AdamW with weight decay
- Scheduler: OneCycleLR with cosine annealing
  - 10% warmup period
  - Peak learning rate: `config.learning_rate`
  - Final learning rate: `config.min_learning_rate`
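The schedule roughly follows this shape (an illustrative pure-Python sketch of a 10% linear warmup followed by cosine decay; the actual training uses `torch.optim.lr_scheduler.OneCycleLR`):

```python
import math


def one_cycle_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-4,
    min_lr: float = 1e-6,
    warmup_frac: float = 0.1,
) -> float:
    """Linear warmup to peak_lr over the first warmup_frac of steps, then cosine decay to min_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```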
Metrics
- Training: Binary Cross-Entropy loss
- Validation: BCE loss + binary accuracy
- Logging: Real-time metrics to Weights & Biases
Automatic Model Saving
- Saves best model based on validation loss
- Optionally pushes to HuggingFace Hub (private repos)
- Local checkpointing with automatic resumption
Development
Running Tests
```
uv run pytest
```
Type Checking
```
uv run mypy src/
```
Linting and Formatting
```
# Check code style
uv run ruff check src/

# Auto-format code
uv run ruff format src/
```
Pre-commit Hooks
Automatically run linters and formatters before each commit:
```
uv run pre-commit run --all-files
```
Model Inference
After training, use the model for inference:
```python
import torch
from syncnet.modeling.model import SyncNet, SyncNetConfig
from transformers.models.pe_audio_video import PeAudioVideoProcessor

# Load model
config = SyncNetConfig(base_model="facebook/pe-av-small")
model = SyncNet.from_pretrained("your-username/your-model-name")
model.eval()

# Load processor
processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-small")

# Process inputs
inputs = processor(
    videos=video_frames,   # Shape: (num_frames, H, W, C)
    audio=audio_samples,   # Shape: (num_samples,)
    return_tensors="pt",
    sampling_rate=48000,
)

# Inference
with torch.no_grad():
    similarity = model(
        inputs["input_values"],
        inputs["pixel_values_videos"],
    )

print(f"Synchronization score: {similarity.item():.4f}")
# Higher score = better synchronization
```
Troubleshooting
Common Issues
Out of Memory (OOM)
- Reduce `batch_size` in config.py
- Reduce `num_workers` to decrease memory overhead
- Enable gradient accumulation: `accumulate_grad_batches=2`
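Note that gradient accumulation lets you halve `batch_size` without shrinking the effective batch each optimizer step; with these assumed definitions:

```python
def effective_batch_size(batch_size: int, accumulate_grad_batches: int, num_devices: int) -> int:
    """Samples contributing to each optimizer step, across all GPUs."""
    return batch_size * accumulate_grad_batches * num_devices
```

For example, `batch_size=2` with `accumulate_grad_batches=2` matches the default `batch_size=4` on one GPU while using roughly half the activation memory.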
Slow Data Loading
- Increase `num_workers` (recommended: 2-4x the number of GPUs)
- Ensure videos are on fast storage (SSD preferred)
- Enable `persistent_workers=True` (already enabled)
Low Accuracy
- Ensure dataset has sufficient diversity
- Increase training epochs
- Adjust `negative_fraction` (try 0.3-0.7)
- Verify audio and video are actually synchronized in source data
WANDB Authentication Error
- Set `WANDB_PROJECT` and `WANDB_ENTITY` in `.env`
- Run `wandb login` to authenticate
- Use the `--debug` flag to train offline
Related Work
- SyncNet: Out of time: automated lip sync in the wild
- PeAudioVideo: HuggingFace Transformers multimodal encoder
License
See LICENSE file for details.
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes with tests
- Ensure all tests and pre-commit hooks pass
- Submit a pull request
Acknowledgments
- Built with PyTorch Lightning
- Uses HuggingFace Transformers
- Dependency management by uv
- Experiment tracking with Weights & Biases
Support
For questions or issues:
- Open an issue on GitHub
- Check existing issues for solutions
- Refer to documentation in docstrings
Note: This is a research/educational implementation. For production use, additional validation and optimization may be required.