Repos: 7 · Stars: 0 · Forks: 0 · Top Language: Python
Repositories
FlashInfer: Kernel Library for LLM Serving
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
CUDA Core Compute Libraries
A high-throughput and memory-efficient inference and serving engine for LLMs
SGLang is a fast serving framework for large language models and vision language models.
A small repo to demonstrate a bug in the vectorized loading path of BlockLoad in CUB
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
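The BlockLoad repro above concerns CUB's vectorized loading path. As a point of reference, this is roughly how `cub::BlockLoad` with the `BLOCK_LOAD_VECTORIZE` policy is typically invoked; the kernel, block size, and items-per-thread here are illustrative assumptions, not the contents of the repro repo:

```cuda
#include <cub/block/block_load.cuh>

// Illustrative parameters; the actual repro repo may use different ones.
constexpr int BLOCK_THREADS   = 128;
constexpr int ITEMS_PER_THREAD = 4;

__global__ void load_kernel(const int *d_in, int *d_out)
{
    // BlockLoad configured for the vectorized path: a contiguous, suitably
    // aligned input lets CUB issue wide (e.g. int4) loads per thread.
    using BlockLoad = cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                                     cub::BLOCK_LOAD_VECTORIZE>;
    __shared__ typename BlockLoad::TempStorage temp_storage;

    int thread_data[ITEMS_PER_THREAD];
    BlockLoad(temp_storage).Load(d_in, thread_data);

    // Write the items back out so the load is observable.
    int base = threadIdx.x * ITEMS_PER_THREAD;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        d_out[base + i] = thread_data[i];
}
```

`BLOCK_LOAD_VECTORIZE` silently falls back to a non-vectorized path when the input pointer is not aligned to the vector width, which is the kind of edge where loading bugs tend to surface.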