298 results for “topic:cuda-kernels”
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Samples for CUDA developers that demonstrate features of the CUDA Toolkit
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
CUDA Core Compute Libraries
Deep learning in Rust, with shape checked tensors and neural networks
Safe Rust wrapper around the CUDA toolkit
CUDA Kernel Benchmarking Library
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Kernel Tuner
Simple utilities to enable code reuse and portability between CUDA C/C++ and standard C/C++.
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
Comprehensive CUDA tutorials for Maths & ML with examples.
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPU/GPU, NVIDIA, AMD without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.
Triton implementation of FlashAttention2 that adds Custom Masks.
Some CUDA design patterns and a bit of template magic for CUDA
Spiking Neural Networks in C++ with strong GPU acceleration through CUDA
Attention Kernels for Symmetric Power Transformers
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
CUDA kernel author's tools
Open source cross-platform compiler for compute-intensive loops used in AI algorithms, from Microsoft Research
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× vs cuBLAS
A tool for examining GPU scheduling behavior.
Speed up image preprocessing with CUDA for image handling or TensorRT inference
CUDA Guide
Radio-Frequency Engineering Modeling Toolkit (RF-EMT)
Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
Implementation of the conjugate gradient method using C and NVIDIA CUDA
(REOS) Radar and ElectroOptical Simulation Framework written in Fortran.