# KsanaDiT
High-Performance DiT (Diffusion Transformer) Inference Framework for Video & Image Generation
## Introduction

KsanaDiT is a high-performance inference framework designed specifically for Diffusion Transformers (DiT), supporting video generation (T2V/I2V) and image generation (T2I) tasks. The framework provides a rich set of optimization techniques and flexible configuration options, enabling efficient execution of large-scale DiT models in single- or multi-GPU environments.
## Key Features

- High-Performance Inference: FP8 quantization, QKV Fuse, Torch Compile, and various attention optimizations
- Multiple Attention Backends: SLA Attention, Flash Attention, Sage Attention, Radial Sage Attention, Torch SDPA
- Multi-Modal Generation: Text-to-Video (T2V), Image-to-Video (I2V), Video Controllable Editing (Vace), Text-to-Image (T2I)
- Smart Caching: Built-in caching strategies (DBCache, EasyCache, MagCache, TeaCache, CustomStepCache, HybridCache)
- Flexible Configuration: LoRA support, multiple samplers (Euler, UniPC, DPM++), custom sigma scheduling
- Distributed Support: Single-GPU, multi-GPU (torchrun), Ray distributed inference, Model Pool management
- ComfyUI Integration: ComfyUI node support (standalone submodule) for visual workflow design
- Multi-Platform Support: GPU, NPU, XPU (WIP)
## Supported Models

### Video Generation Models

| Model | Type | Parameters | Tasks | Status |
|---|---|---|---|---|
| Turbo Diffusion | Image-to-Video | 14B | I2V | ✅ |
| Wan2.2-T2V | Text-to-Video | 5B/14B | T2V | ✅ |
| Wan2.2-I2V | Image-to-Video | 14B | I2V | ✅ |
| Wan2.1-Vace | Video Controllable Editing | 14B | Vace | ✅ |
### Image Generation Models

| Model | Type | Parameters | Tasks | Status |
|---|---|---|---|---|
| Qwen-Image | Text-to-Image | 20B | T2I | ✅ |
| Qwen-Image Edit | Image Editing | 20B | Image Edit | ✅ |
## Installation

### Docker

We are actively working on Dockerfiles. Stay tuned!

### Requirements

- Python: >= 3.10, < 4.0
- PyTorch: >= 2.0
- GPU Environment:
  - CUDA >= 12.8
  - Recommended: NVIDIA GPUs
- NPU Environment:
  - CANN >= 8.0
  - torch_npu adapter
### Basic Installation

```shell
# Clone the repository
git clone https://github.com/Tencent/KsanaDiT.git
cd KsanaDiT

# Install base dependencies (GPU version by default)
pip install -e .
```

### GPU Accelerated Installation

```shell
# Install GPU optimization dependencies (recommended)
pip install -e ".[gpu]"

# Or install manually (quote the specifiers so the shell does not treat ">" as a redirect)
pip install "xformers>=0.0.29" "flash-attn>=2.6.0" "triton>=3.2.0"
```

### NPU Environment Installation

```shell
# 1. Install the CANN toolkit (refer to the official documentation)
#    https://www.hiascend.com/software/cann

# 2. Install torch_npu
pip install torch-npu

# 3. Install KsanaDiT (NPU version)
pip install -e ".[npu]"

# 4. Verify the NPU environment
python -c "import torch_npu; print(torch_npu.npu.is_available())"
```

### Release Installation
Direct installation via wheel packages coming soon.
## Interface Support

KsanaDiT provides multiple usage modes to suit different scenarios:
### Local Pipeline Mode

Run locally through the Python Pipeline API, suitable for scripted batch generation or integration into your own systems:

```python
from ksana import KsanaPipeline

# Create an inference pipeline
pipeline = KsanaPipeline.from_models("path/to/model")

# Generate a video/image
result = pipeline.generate(prompt, ...)
```

For detailed usage, refer to Quick Start and the examples directory.
### ComfyUI Integration

KsanaDiT can be used as a set of ComfyUI custom nodes, providing a visual workflow experience:

```shell
# 1. Navigate to ComfyUI's custom_nodes directory
cd /path/to/ComfyUI/custom_nodes

# 2. Clone the KsanaDiT repository
git clone https://github.com/Tencent/KsanaDiT.git

# 3. Enter the KsanaDiT directory and install dependencies
cd KsanaDiT
./scripts/install_dev.sh
```

After installation, restart ComfyUI and you will see the KsanaDiT-related nodes in the node list. For more ComfyUI usage instructions, refer to comfyui/README.md.
## Quick Start
For detailed code examples, refer to examples.
### Text-to-Video (T2V)

```python
import torch

from ksana import KsanaPipeline
from ksana.config import (
    KsanaDistributedConfig,
    KsanaRuntimeConfig,
    KsanaSampleConfig,
)

# Create the inference pipeline
pipeline = KsanaPipeline.from_models(
    "path/to/Wan2.2-T2V-A14B",
    dist_config=KsanaDistributedConfig(num_gpus=1),
)

# Generate a video
video = pipeline.generate(
    "Street photography, cool girl with headphones skateboarding, New York streets, graffiti wall background",
    sample_config=KsanaSampleConfig(steps=40),
    runtime_config=KsanaRuntimeConfig(
        seed=1234,
        size=(720, 480),
        frame_num=17,
        return_frames=True,
    ),
)
print(f"Generated video shape: {video.shape}")
```

### Image-to-Video (I2V)
```python
from ksana import KsanaPipeline
from ksana.config import KsanaRuntimeConfig, KsanaSampleConfig

pipeline = KsanaPipeline.from_models("path/to/Wan2.2-I2V-A14B")

video = pipeline.generate(
    "Girl gently waves her fan, blows a breath of fairy air, lightning flies from her hand into the sky and thunder begins",
    start_img_path="input.png",
    sample_config=KsanaSampleConfig(steps=40),
    runtime_config=KsanaRuntimeConfig(
        seed=1234,
        size=(512, 512),
        frame_num=17,
    ),
)
```

### Turbo Diffusion
### Text-to-Image (T2I)
```python
import torch

from ksana import KsanaPipeline
from ksana.config import (
    KsanaModelConfig,
    KsanaRuntimeConfig,
    KsanaSampleConfig,
    KsanaSolverType,
)

pipeline = KsanaPipeline.from_models(
    "path/to/Qwen-Image",
    model_config=KsanaModelConfig(run_dtype=torch.bfloat16),
)

image = pipeline.generate(
    "A cute orange cat sitting on a windowsill, sunlight streaming through the window onto its fur",
    sample_config=KsanaSampleConfig(
        steps=20,
        cfg_scale=4.0,
        solver=KsanaSolverType.FLOWMATCH_EULER,
    ),
    runtime_config=KsanaRuntimeConfig(
        seed=42,
        size=(1024, 1024),
    ),
)
```

## Advanced Features
### FP8 Quantized Inference

```python
import torch

from ksana import KsanaPipeline
from ksana.config import (
    KsanaModelConfig,
    KsanaAttentionConfig,
    KsanaAttentionBackend,
    KsanaLinearBackend,
)

model_config = KsanaModelConfig(
    run_dtype=torch.float16,
    attention_config=KsanaAttentionConfig(backend=KsanaAttentionBackend.SAGE_ATTN),
    linear_backend=KsanaLinearBackend.FP8_GEMM,
)

pipeline = KsanaPipeline.from_models(
    ("high_noise_fp8.safetensors", "low_noise_fp8.safetensors"),
    model_config=model_config,
)
```

### LoRA Accelerated Inference
```python
from ksana import KsanaPipeline
from ksana.config import KsanaLoraConfig, KsanaSampleConfig

pipeline = KsanaPipeline.from_models(
    "path/to/Wan2.2-T2V-A14B",
    lora_config=KsanaLoraConfig("path/to/Wan2.2-Lightning-4steps-lora"),
)

# Fast generation in 4 steps
video = pipeline.generate(
    prompt,
    sample_config=KsanaSampleConfig(
        steps=4,
        cfg_scale=1.0,
        sigmas=[1.0, 0.9375, 0.6333, 0.225, 0.0],
    ),
)
```

### Smart Cache Optimization (Under Active Development)
```python
from ksana.config.cache_config import (
    DCacheConfig,
    DBCacheConfig,
    KsanaHybridCacheConfig,
)

# Use a hybrid caching strategy
cache_config = KsanaHybridCacheConfig(
    step_cache=DCacheConfig(fast_degree=50),
    block_cache=DBCacheConfig(),
)

video = pipeline.generate(
    prompt,
    cache_config=cache_config,
)
```

### Multi-GPU Distributed Inference
```shell
# Method 1: Using CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1,2,3 python your_script.py

# Method 2: Using torchrun
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 your_script.py
```

```python
from ksana import KsanaPipeline
from ksana.config import KsanaDistributedConfig

pipeline = KsanaPipeline.from_models(
    model_path,
    dist_config=KsanaDistributedConfig(num_gpus=4),
)
```

## Performance Optimization Techniques
### Quantization & Compute Optimization
| Technique | Description | Effect |
|---|---|---|
| FP8 GEMM | FP8 quantized matrix multiplication | Reduced memory, improved speed |
| Torchao FP8 Dynamic | Dynamic FP8 quantization | Adaptive precision, balanced quality and performance |
| QKV Fuse | QKV projection fusion | Reduced memory access, improved throughput |
| torch.compile | Graph compilation optimization | 10-30% end-to-end speedup |
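To build intuition for the dynamic FP8 row above, here is a minimal, framework-independent sketch of per-tensor dynamic scaling: pick a scale so the largest magnitude maps to the top of the FP8 E4M3 range (448), quantize, and dequantize. The helper names are illustrative only, not KsanaDiT API; real FP8 GEMM kernels store genuine 8-bit E4M3 values rather than emulating them with rounding.

```python
# Illustrative sketch of per-tensor dynamic quantization (hypothetical names,
# not KsanaDiT API). We only emulate the scale/quantize/dequantize round trip.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def dynamic_scale(values):
    """Pick a per-tensor scale so the largest magnitude maps to E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Map floats into the FP8 dynamic range (emulated with integer rounding)."""
    return [round(v / scale) for v in values]


def dequantize(qvalues, scale):
    return [q * scale for q in qvalues]


weights = [0.5, -1.25, 3.0, -0.75]
scale = dynamic_scale(weights)
qweights = quantize(weights, scale)
restored = dequantize(qweights, scale)

# The round trip preserves each value up to half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err <= scale / 2)  # True
```

Because the scale is recomputed from each tensor's actual range at runtime ("dynamic"), no calibration pass is needed, at the cost of one extra amax reduction per tensor.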
### Attention Backends
| Backend | Characteristics | Use Case |
|---|---|---|
| Flash Attention | High performance, memory efficient | General recommendation |
| Sage Attention | Optimized attention computation | Long sequences |
| Radial Sage Attention | Radial sparse attention | Very long sequences |
| Torch SDPA | PyTorch native implementation | Compatibility priority |
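The "Compatibility priority" row hints at a fallback order. As a hypothetical sketch (not KsanaDiT's actual dispatch logic, and with made-up availability names), a runtime might pick the most preferred installed backend and fall back to PyTorch's native SDPA, which always ships with PyTorch:

```python
# Hypothetical backend-selection sketch; KsanaDiT's real internals may differ.

PREFERENCE = ["flash_attn", "sage_attn", "radial_sage_attn"]  # most preferred first


def select_backend(available):
    """Return the first preferred backend present in `available`,
    otherwise fall back to PyTorch's native SDPA."""
    for name in PREFERENCE:
        if name in available:
            return name
    return "torch_sdpa"  # compatibility fallback


print(select_backend({"sage_attn"}))  # -> sage_attn
print(select_backend(set()))          # -> torch_sdpa
```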
### Caching Strategies
| Strategy | Description | Use Case |
|---|---|---|
| TeaCache | Temporal-aware step-level caching | Video generation optimization |
| MagCache | Adaptive step-level caching | Balanced quality and speed |
| EasyCache | Lightweight step-level caching without pre-prepared parameters | Fast inference with minimal overhead |
| DBCache | Block-level caching | Image generation |
| HybridCache | Step-level + block-level hybrid caching | Maximum acceleration |
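The step-level strategies above share one idea: when a denoising step's input barely differs from the previous step's, the cached model output is reused instead of rerunning the transformer. A toy, framework-free sketch of that skip test (the class name, threshold, and 1-D "latent" are invented for illustration and are not KsanaDiT code):

```python
# Toy illustration of step-level caching: reuse the previous model output when
# the input barely changed. Names and thresholds are illustrative only.

class StepCache:
    def __init__(self, threshold=0.15):
        self.threshold = threshold
        self.prev_input = None
        self.prev_output = None
        self.hits = 0

    def run(self, model, x):
        if self.prev_input is not None:
            # Relative L1 change between this step's input and the last one.
            change = sum(abs(a - b) for a, b in zip(x, self.prev_input))
            change /= max(sum(abs(a) for a in self.prev_input), 1e-8)
            if change < self.threshold:
                self.hits += 1
                return self.prev_output  # cache hit: skip the expensive model
        out = model(x)
        self.prev_input, self.prev_output = list(x), out
        return out


calls = 0

def fake_dit(x):
    # Stand-in for a DiT forward pass: shrinks the latent by 10%, so the
    # relative input change per step (0.1) stays below the 0.15 threshold.
    global calls
    calls += 1
    return [v * 0.9 for v in x]


cache = StepCache(threshold=0.15)
x = [1.0, 2.0, 3.0]
for _ in range(5):
    x = cache.run(fake_dit, x)

print(calls, cache.hits)  # 1 4: one real forward pass, four cache hits
```

Real strategies differ in how they estimate "change" (TeaCache uses timestep-aware heuristics, MagCache adapts the threshold), but the skip decision has this shape.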
### Samplers
| Sampler | Description | Use Case |
|---|---|---|
| Euler | Fast sampling | 4-8 step inference |
| UniPC | High-quality sampling | 20-40 step inference |
| DPM++ | Efficient multi-step sampling | General purpose |
| Turbo Diffusion | Ultra-fast sampling | 4-step inference |
| FlowMatch Euler | Flow matching sampling | Image generation |
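For reference, a flow-matching Euler step simply moves the latent along the predicted velocity across the gap between adjacent sigmas: x ← x + (σ_next − σ_cur) · v(x, σ_cur). A self-contained toy using the 4-step sigma schedule from the LoRA example above; the "velocity model" is a stand-in that returns the exact constant velocity of a linear noise-to-data path, not a real DiT:

```python
# Toy flow-matching Euler sampler: x <- x + (sigma_next - sigma_cur) * v(x, sigma_cur).
# The constant-velocity "model" below is a stand-in for a real DiT prediction.

SIGMAS = [1.0, 0.9375, 0.6333, 0.225, 0.0]  # 4-step schedule from the LoRA example

DATA, NOISE = 0.25, 1.0  # a 1-D "latent" for illustration


def velocity_model(x, sigma):
    # On the path x_sigma = (1 - sigma) * DATA + sigma * NOISE, the true
    # velocity dx/dsigma is the constant NOISE - DATA.
    return NOISE - DATA


x = NOISE  # start from pure noise at sigma = 1.0
for sigma_cur, sigma_next in zip(SIGMAS, SIGMAS[1:]):
    x = x + (sigma_next - sigma_cur) * velocity_model(x, sigma_cur)

print(x)  # recovers DATA (0.25), since the toy velocity field is constant
```

With a real model the velocity varies with x and sigma, so fewer, well-placed steps (like the Lightning schedule above) trade accuracy for speed.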
## Configuration

### Environment Variables

```shell
# Log level: debug/info/warn/error
export KSANA_LOGGER_LEVEL=info
```

### Model Configuration

The framework supports model parameter configuration via YAML files located in the ksana/settings/ directory:

- `qwen/t2i_20b.yaml` - Qwen image generation model config
- `qwen/edit_20b.yaml` - Qwen image editing model config
- `wan/t2v_14b.yaml` - Wan2.2 T2V model config
- `wan/i2v_14b.yaml` - Wan2.2 I2V model config
- `wan/vace_14b.yaml` - Wan2.1 Vace model config
## Code Examples

Complete example code is available in the examples/ directory:

- `examples/wan/wan2_2_t2v.py` - Text-to-Video example
- `examples/wan/wan2_2_i2v.py` - Image-to-Video example
- `examples/wan/wan2_1_vace.py` - Video controllable editing example
- `examples/qwen/qwen_image_t2i.py` - Text-to-Image example
- `examples/qwen/qwen_image_edit.py` - Image editing example
## Testing

The test suite provides comprehensive coverage and is intended for developers. The tests are currently time-consuming to run; we will continue to streamline them.

```shell
# Run all tests
pytest tests/

# Run a specific test
pytest tests/ksana/pipelines/wan2_2_t2v_test.py

# Run GPU tests
bash scripts/ci_tests/ci_ksana_gpus.sh
```

## Contributing
We welcome community contributions! Before submitting a PR, please ensure that:

- The code passes all tests
- The code follows the project style (checked with `black` and `ruff`)
- Necessary documentation and comments are included
- The relevant README sections and examples are updated

```shell
# Install development dependencies
pip install -e ".[dev]"

# Run code style checks
pre-commit run --all-files

# Run tests
pytest tests/
```

## Changelog
For a detailed list of changes in each version, see the CHANGELOG.
## License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
## Acknowledgments
This project benefits from the following excellent open-source projects:
- Wan-Video - Wan2.2 video generation model
- ComfyUI-WanVideoWrapper - ComfyUI integration reference
- FastVideo - Video generation optimization techniques
- Nunchaku - Quantization optimization solutions
- TurboDiffusion - Inference acceleration solutions
## Contact
- Bug Reports: GitHub Issues
- Feature Requests: GitHub Discussions
## Roadmap
### Completed
- Multi-Platform Support: GPU, NPU, XPU backend support
- Batch Inference: Support for batch size > 1, merged cond/uncond
- Video Editing: Wan2.1 Vace video controllable editing
- Advanced Samplers: DPM++, Turbo Diffusion support
- Performance Optimization: QKV Fuse + Dynamic FP8 optimization
- Memory Optimization: Pin Manager to resolve OOM issues
- Smart Caching: MagCache, TeaCache, EasyCache strategies
- Image Editing: Qwen Image Edit model support
- VAE Parallelism: Multi-GPU VAE decoding
- Monitoring: Inference metrics reporting
### In Progress
- Support for more generation models (Z-Image, Hunyuan, etc.)
- Memory optimization for longer video generation
- Cache strategy performance tuning
- Model quantization toolchain
- XPU full feature support optimization
If this project helps you, please give us a ⭐️ Star!

Made with ❤️ by the KsanaDiT Team