34 results for “topic:blackwell”
A high-throughput and memory-efficient inference and serving engine for LLMs
SGLang is a high-performance serving framework for large language models and multimodal models.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
Parallax is a distributed model serving framework that lets you build your own AI cluster anywhere
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
One-command vLLM installation for NVIDIA DGX Spark with Blackwell GB10 GPUs (sm_121 architecture)
Prebuilt DeepSpeed wheels for Windows with NVIDIA GPU support. Supports GTX 10 through RTX 50 series. Compiled with PyTorch 2.7/2.8 and CUDA 12.8
Pre-built wheels for llama-cpp-python across platforms and CUDA versions
Cross-platform FlashAttention-2 Triton implementation for Turing+ GPUs with custom configuration mode
Code for the paper "ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs"
LLM fine-tuning with LoRA + NVFP4/MXFP8 on NVIDIA DGX Spark (Blackwell GB10)
High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GPUs (RTX 5090)
GPU-accelerated WhisperX on NVIDIA Blackwell (SM_121) - DGX Spark compatible
Multi-model LLM serving for NVIDIA DGX Spark with vLLM, web UI, and tool calling
Optimized vLLM deployment for NVIDIA Blackwell (RTX 5090) on Linux Kernel 6.14. Resolves SM_120 kernel incompatibilities, P2P deadlocks, and memory fragmentation for high-performance LLM inference.
An empirical study of benchmarking LLM inference with KV cache offloading using vLLM and LMCache on NVIDIA GB200 with high-bandwidth NVLink-C2C.
📦 A fully automated method for installing Nvidia drivers on Arch Linux
Minimal GPU runtime for Python - high-performance CUDA kernels, memory management, and LLM inference without heavy dependencies
PyTorch operation for distributed GEMM on NVIDIA Blackwell GPUs
Blackwell-ready pure Zig (0.15.2) bindings to the NVIDIA CUDA Driver API – dynamic loading, clean wrappers, no toolkit required at runtime.
🔧 Fine-tune large language models efficiently on NVIDIA DGX Spark with LoRA adapters and optimized quantization for high performance.
Run Qwen3.5-35B-A3B on NVIDIA DGX Spark (GB10) with SGLang - Ready-to-use Docker image + complete guide
🧭 Enhance navigation with VLN-YuanNav, a visual-language model using advanced memory and decision-making for effective exploration.
🚀 Build and explore OpenAI's GPT-OSS model from scratch in Python, unlocking the mechanics of large language models.
A fast API booty-licious back-end for running GGUF models with Llama.cpp
Pre-built onnxruntime-gpu 1.24.1 with Blackwell sm_120 CUDA kernels (RTX 5090/5080/5070)
Enterprise-grade Sovereign AI Stack optimized for NVIDIA Blackwell (sm_120) & vLLM. Features 256K context window, 5.8k tok/s prefill, and integrated observability via Langfuse.
LLM inference setup for NVIDIA Blackwell GPUs with FP4 quantization
Production LLM deployment specs for NVIDIA Blackwell GPUs (RTX Pro 6000, DGX Spark). Includes vLLM configurations, benchmarks, load balancer, and throughput calculators for NVFP4/FP8/MoE models.