
KsanaDiT

High-Performance DiT (Diffusion Transformer) Inference Framework for Video & Image Generation

English | ็ฎ€ไฝ“ไธญๆ–‡

๐Ÿ“– Introduction

KsanaDiT is a high-performance inference framework designed specifically for Diffusion Transformers (DiT), supporting video generation (T2V/I2V) and image generation (T2I) tasks. The framework provides a rich set of optimization techniques and flexible configuration options, enabling efficient execution of large-scale DiT models in single-GPU or multi-GPU environments.

โœจ Key Features

  • ๐Ÿš€ High-Performance Inference: FP8 quantization, QKV Fuse, Torch Compile, and various attention optimizations
  • ๐ŸŽฏ Multiple Attention Backends: SLA Attention, Flash Attention, Sage Attention, Radial Sage Attention, Torch SDPA
  • ๐ŸŽฌ Multi-Modal Generation: Text-to-Video (T2V), Image-to-Video (I2V), Video Controllable Editing (Vace), Text-to-Image (T2I)
  • ๐Ÿ’พ Smart Caching: Built-in caching strategies (DBCache, EasyCache, MagCache, TeaCache, CustomStepCache, HybridCache)
  • ๐Ÿ”ง Flexible Configuration: LoRA support, multiple samplers (Euler, UniPC, DPM++), custom sigma scheduling
  • ๐ŸŒ Distributed Support: Single-GPU, multi-GPU (torchrun), Ray distributed inference, Model Pool management
  • ๐Ÿ”Œ ComfyUI Integration: ComfyUI node support (standalone submodule) for visual workflow design
  • ๐Ÿ–ฅ๏ธ Multi-Platform Support: GPU, NPU, XPU (WIP)

๐Ÿ“ฆ Supported Models

Video Generation Models

| Model | Type | Parameters | Tasks | Status |
|---|---|---|---|---|
| Turbo Diffusion | Image-to-Video | 14B | I2V | โœ… |
| Wan2.2-T2V | Text-to-Video | 5B/14B | T2V | โœ… |
| Wan2.2-I2V | Image-to-Video | 14B | I2V | โœ… |
| Wan2.1-Vace | Video Controllable Editing | 14B | Vace | โœ… |

Image Generation Models

| Model | Type | Parameters | Tasks | Status |
|---|---|---|---|---|
| Qwen-Image | Text-to-Image | 20B | T2I | โœ… |
| Qwen-Image Edit | Image Editing | 20B | Image Edit | โœ… |

๐Ÿ› ๏ธ Installation

Docker

We are actively working on Dockerfiles. Stay tuned!

Requirements

  • Python: >= 3.10, < 4.0
  • PyTorch: >= 2.0
  • GPU Environment:
    • CUDA >= 12.8
    • Recommended: NVIDIA GPUs
  • NPU Environment:
    • CANN >= 8.0
    • torch_npu adapter
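
A quick sanity check that the environment above is in place (a minimal sketch; it only assumes PyTorch is installed):

import torch

print(torch.__version__)          # expect >= 2.0
print(torch.version.cuda)         # CUDA version PyTorch was built against (expect >= 12.8)
print(torch.cuda.is_available())  # True if an NVIDIA GPU is usable from this process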

Basic Installation

# Clone the repository
git clone https://github.com/Tencent/KsanaDiT.git
cd KsanaDiT

# Install base dependencies (GPU version by default)
pip install -e .

GPU Accelerated Installation

# Install GPU optimization dependencies (recommended)
pip install -e ".[gpu]"

# Or install manually
pip install "xformers>=0.0.29" "flash-attn>=2.6.0" "triton>=3.2.0"

NPU Environment Installation

# 1. Install CANN toolkit (refer to official documentation)
# https://www.hiascend.com/software/cann

# 2. Install torch_npu
pip install torch-npu

# 3. Install KsanaDiT (NPU version)
pip install -e ".[npu]"

# 4. Verify NPU environment
python -c "import torch_npu; print(torch_npu.npu.is_available())"

Release Installation

Direct installation via wheel packages is coming soon.

๐Ÿ”Œ Interface Support

KsanaDiT can be used in several ways, depending on your scenario:

Local Pipeline Mode

Run locally through the Python Pipeline API, suitable for scripted batch generation or integration into your own systems:

from ksana import KsanaPipeline

# Create inference pipeline
pipeline = KsanaPipeline.from_models("path/to/model")

# Generate video/image
result = pipeline.generate(prompt, ...)

For detailed usage, refer to Quick Start and the examples directory.

ComfyUI Integration

KsanaDiT supports usage as ComfyUI custom nodes, providing a visual workflow experience:

# 1. Navigate to ComfyUI's custom_nodes directory
cd /path/to/ComfyUI/custom_nodes

# 2. Clone the KsanaDiT repository
git clone https://github.com/Tencent/KsanaDiT.git

# 3. Enter the KsanaDiT directory and install dependencies
cd KsanaDiT
./scripts/install_dev.sh

After installation, restart ComfyUI and you will see KsanaDiT-related nodes in the node list. For more ComfyUI usage instructions, refer to comfyui/README.md.

๐Ÿš€ Quick Start

For detailed code examples, refer to examples.

Text-to-Video (T2V)

import torch
from ksana import KsanaPipeline
from ksana.config import (
    KsanaDistributedConfig,
    KsanaRuntimeConfig,
    KsanaSampleConfig,
)

# Create inference pipeline
pipeline = KsanaPipeline.from_models(
    "path/to/Wan2.2-T2V-A14B",
    dist_config=KsanaDistributedConfig(num_gpus=1)
)

# Generate video
video = pipeline.generate(
    "Street photography, cool girl with headphones skateboarding, New York streets, graffiti wall background",
    sample_config=KsanaSampleConfig(steps=40),
    runtime_config=KsanaRuntimeConfig(
        seed=1234,
        size=(720, 480),
        frame_num=17,
        return_frames=True,
    ),
)

print(f"Generated video shape: {video.shape}")

Image-to-Video (I2V)

from ksana import KsanaPipeline
from ksana.config import KsanaRuntimeConfig, KsanaSampleConfig

pipeline = KsanaPipeline.from_models("path/to/Wan2.2-I2V-A14B")

video = pipeline.generate(
    "Girl gently waves her fan, blows a breath of fairy air, lightning flies from her hand into the sky and thunder begins",
    start_img_path="input.png",
    sample_config=KsanaSampleConfig(steps=40),
    runtime_config=KsanaRuntimeConfig(
        seed=1234,
        size=(512, 512),
        frame_num=17,
    ),
)

Turbo Diffusion

See the run_turbo_diffusion example.

Text-to-Image (T2I)

import torch
from ksana import KsanaPipeline
from ksana.config import (
    KsanaModelConfig,
    KsanaRuntimeConfig,
    KsanaSampleConfig,
    KsanaSolverType,
)

pipeline = KsanaPipeline.from_models(
    "path/to/Qwen-Image",
    model_config=KsanaModelConfig(run_dtype=torch.bfloat16),
)

image = pipeline.generate(
    "A cute orange cat sitting on a windowsill, sunlight streaming through the window onto its fur",
    sample_config=KsanaSampleConfig(
        steps=20,
        cfg_scale=4.0,
        solver=KsanaSolverType.FLOWMATCH_EULER,
    ),
    runtime_config=KsanaRuntimeConfig(
        seed=42,
        size=(1024, 1024),
    ),
)
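
How to save the result depends on generate's return type, which this snippet does not pin down; assuming it is either a PIL.Image or an (H, W, C) tensor in [0, 1] (both assumptions), saving looks like this:

from PIL import Image
import numpy as np

if not isinstance(image, Image.Image):
    # Assumed layout: (height, width, channels), values in [0, 1].
    image = Image.fromarray((image.clamp(0, 1).cpu().numpy() * 255).astype(np.uint8))
image.save("orange_cat.png")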

๐ŸŽฏ Advanced Features

FP8 Quantized Inference

import torch
from ksana import KsanaPipeline
from ksana.config import (
    KsanaModelConfig,
    KsanaAttentionConfig,
    KsanaAttentionBackend,
    KsanaLinearBackend,
)

model_config = KsanaModelConfig(
    run_dtype=torch.float16,
    attention_config=KsanaAttentionConfig(backend=KsanaAttentionBackend.SAGE_ATTN),
    linear_backend=KsanaLinearBackend.FP8_GEMM,
)

pipeline = KsanaPipeline.from_models(
    ("high_noise_fp8.safetensors", "low_noise_fp8.safetensors"),
    model_config=model_config,
)

LoRA Accelerated Inference

from ksana import KsanaPipeline
from ksana.config import KsanaLoraConfig, KsanaSampleConfig

pipeline = KsanaPipeline.from_models(
    "path/to/Wan2.2-T2V-A14B",
    lora_config=KsanaLoraConfig("path/to/Wan2.2-Lightning-4steps-lora"),
)

# Fast generation with 4 steps
video = pipeline.generate(
    prompt,
    sample_config=KsanaSampleConfig(
        steps=4,
        cfg_scale=1.0,
        sigmas=[1.0, 0.9375, 0.6333, 0.225, 0.0],
    ),
)
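
Note that the custom sigmas schedule carries steps + 1 boundary values (five values for four steps here), descending from 1.0 to a terminal 0.0.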

Smart Cache Optimization - Under Active Development

from ksana.config.cache_config import (
    DCacheConfig,
    DBCacheConfig,
    KsanaHybridCacheConfig,
)

# Use hybrid caching strategy
cache_config = KsanaHybridCacheConfig(
    step_cache=DCacheConfig(fast_degree=50),
    block_cache=DBCacheConfig(),
)

video = pipeline.generate(
    prompt,
    cache_config=cache_config,
)

Multi-GPU Distributed Inference

# Method 1: Using CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1,2,3 python your_script.py

# Method 2: Using torchrun
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 your_script.py

Inside your_script.py, request a matching number of GPUs:

from ksana import KsanaPipeline
from ksana.config import KsanaDistributedConfig

pipeline = KsanaPipeline.from_models(
    model_path,
    dist_config=KsanaDistributedConfig(num_gpus=4),
)

๐Ÿ“Š Performance Optimization Techniques

Quantization & Compute Optimization

| Technique | Description | Effect |
|---|---|---|
| FP8 GEMM | FP8 quantized matrix multiplication | Reduced memory, improved speed |
| Torchao FP8 Dynamic | Dynamic FP8 quantization | Adaptive precision, balanced quality and performance |
| QKV Fuse | QKV projection fusion | Reduced memory access, improved throughput |
| torch.compile | Graph compilation optimization | 10-30% end-to-end speedup |
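
Of these, torch.compile is plain PyTorch rather than anything KsanaDiT-specific. As a standalone illustration, compiling a module pays a one-time graph-capture cost on the first call and reuses the optimized kernels afterwards:

import torch

layer = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
compiled = torch.compile(layer)  # compilation is lazy; nothing happens yet

x = torch.randn(8, 1024)
out = compiled(x)  # first call: graph capture + code generation (slow)
out = compiled(x)  # later calls: reuse the optimized graph (fast)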

Attention Backends

| Backend | Characteristics | Use Case |
|---|---|---|
| Flash Attention | High performance, memory efficient | General recommendation |
| Sage Attention | Optimized attention computation | Long sequences |
| Radial Sage Attention | Radial sparse attention | Very long sequences |
| Torch SDPA | PyTorch native implementation | Compatibility priority |
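
Torch SDPA, the compatibility fallback above, is PyTorch's built-in scaled_dot_product_attention, which dispatches to the fastest kernel available (flash, memory-efficient, or the math reference). A minimal standalone example:

import torch
import torch.nn.functional as F

# Shapes: (batch, heads, sequence, head_dim)
q = torch.randn(2, 8, 256, 64)
k = torch.randn(2, 8, 256, 64)
v = torch.randn(2, 8, 256, 64)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 256, 64])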

Caching Strategies

| Strategy | Description | Use Case |
|---|---|---|
| TeaCache | Temporal-aware step-level caching | Video generation optimization |
| MagCache | Adaptive step-level caching | Balanced quality and speed |
| EasyCache | Lightweight step-level caching without pre-prepared parameters | Fast inference with minimal overhead |
| DBCache | Block-level caching | Image generation |
| HybridCache | Step-level + block-level hybrid caching | Maximum acceleration |
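
The step-level strategies (TeaCache, MagCache, EasyCache) share one idea: when the latent changes little between consecutive denoising steps, reuse the previous transformer output instead of recomputing it. A deliberately simplified sketch of that control flow (illustrative only, not KsanaDiT's implementation):

def cached_denoise_step(model, x, t, cache, threshold=0.05):
    """Reuse the last output when the input barely changed (illustrative)."""
    if "x" in cache:
        rel_change = (x - cache["x"]).norm() / (cache["x"].norm() + 1e-8)
        if rel_change < threshold:
            return cache["out"]  # cache hit: skip the transformer forward pass
    out = model(x, t)            # cache miss: run the full forward pass
    cache["x"], cache["out"] = x, out
    return out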

Samplers

| Sampler | Description | Use Case |
|---|---|---|
| Euler | Fast sampling | 4-8 step inference |
| UniPC | High-quality sampling | 20-40 step inference |
| DPM++ | Efficient multi-step sampling | General purpose |
| Turbo Diffusion | Ultra-fast sampling | 4-step inference |
| FlowMatch Euler | Flow matching sampling | Image generation |
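
Euler and FlowMatch Euler both reduce to a first-order update along the model's predicted velocity over a descending sigma schedule. A minimal sketch of that integration loop (illustrative math, not the framework's solver code):

def flowmatch_euler_sample(model, x, sigmas):
    """First-order Euler integration over a descending sigma schedule (illustrative)."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = model(x, sigma)                # model predicts velocity at this noise level
        x = x + (sigma_next - sigma) * v   # step toward the next (lower) sigma
    return x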

๐Ÿ”ง Configuration

Environment Variables

# Log level: debug/info/warn/error
export KSANA_LOGGER_LEVEL=info

Model Configuration

The framework supports configuring model parameters via YAML files located in the ksana/settings/ directory.

๐Ÿ“š Code Examples

Complete example code is available in the examples/ directory.

๐Ÿงช Testing

We maintain comprehensive test coverage. The tests are currently time-consuming, and we are working to streamline them; they are intended for developers.

# Run all tests
pytest tests/

# Run specific tests
pytest tests/ksana/pipelines/wan2_2_t2v_test.py

# Run GPU tests
bash scripts/ci_tests/ci_ksana_gpus.sh

๐Ÿค Contributing

We welcome community contributions! Before submitting a PR, please ensure that:

  1. Your code passes all tests
  2. Your code follows the project code style (enforced with black and ruff)
  3. You include the necessary documentation and comments
  4. You update the relevant README and examples

# Install development dependencies
pip install -e ".[dev]"

# Run code style checks
pre-commit run --all-files

# Run tests
pytest tests/

๐Ÿ“‹ Changelog

For a detailed list of changes in each version, see the CHANGELOG.

๐Ÿ“„ License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

This project benefits from a number of excellent open-source projects.

๐Ÿ“ฎ Contact

๐Ÿ—บ๏ธ Roadmap

Completed โœ…

  • Multi-Platform Support: GPU, NPU, XPU backend support
  • Batch Inference: Support for batch size > 1, merged cond/uncond
  • Video Editing: Wan2.1 Vace video controllable editing
  • Advanced Samplers: DPM++, Turbo Diffusion support
  • Performance Optimization: QKV Fuse + Dynamic FP8 optimization
  • Memory Optimization: Pin Manager to resolve OOM issues
  • Smart Caching: MagCache, TeaCache, EasyCache strategies
  • Image Editing: Qwen Image Edit model support
  • VAE Parallelism: Multi-GPU VAE decoding
  • Monitoring: Inference metrics reporting

In Progress ๐Ÿšง

  • Support for more generation models (Z-Image, Hunyuan, etc.)
  • Memory optimization for longer video generation
  • Cache strategy performance tuning
  • Model quantization toolchain
  • XPU full feature support optimization

If this project helps you, please give us a โญ๏ธ Star!

Made with โค๏ธ by the KsanaDiT Team
