Repositories (30)
streaming-vlm
(Public) StreamingVLM: Real-Time Understanding for Infinite Video Streams
llm-awq
(Public) [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
mcunet
(Public) [NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
torchsparse
(Public) [MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
streaming-llm
(Public) [ICLR 2024] Efficient Streaming Language Models with Attention Sinks
TinyChatEngine
(Public) TinyChatEngine: On-Device LLM Inference Library
omniserve
(Public) [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
parallel-computing-tutorial
(Public)
smoothquant
(Public) [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
bevfusion
(Public, Archived) [ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
temporal-shift-module
(Public) [ICCV 2019] TSM: Temporal Shift Module for Efficient Video Understanding
torchquantum
(Public) A PyTorch-based framework for Quantum Classical Simulation, Quantum Machine Learning, Quantum Neural Networks, Parameterized Quantum Circuits with support for easy deployments on real quantum computers.
foreact
(Public) [CVPR 2026] ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
flash-moba
(Public)
fastrl
(Public) [ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
efficientvit
(Public) Efficient vision foundation models for high-resolution generation and perception.
vlash
(Public) Real-Time VLAs via Future-state-aware Asynchronous Inference.
duo-attention
(Public) [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
data-efficient-gans
(Public) [NeurIPS 2020] Differentiable Augmentation for Data-Efficient GAN Training
tinyengine
(Public) [NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 256KB Memory
tinyml
(Public)
proxylessnas
(Public) [ICLR 2019] ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
once-for-all
(Public) [ICLR 2020] Once for All: Train One Network and Specialize it for Efficient Deployment
lpd
(Public) [ICLR 2026 Oral] Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
vcpo
(Public) Code for the paper “Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs”
fouroversix
(Public) Code for the paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling”
sparsevit
(Public) [CVPR'23] SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
amc-models
(Public) [ECCV 2018] AMC: AutoML for Model Compression and Acceleration on Mobile Devices
Quest
(Public) [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
spvnas
(Public, Archived) [ECCV 2020] Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution