30 results for “topic:matmul”
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
Matrix multiplication on the NPU inside RK3588
Floating-point matrix multiplication implementation (arbitrary precision)
ForMatmul - A Fortran library that overloads the matmul function to enable efficient matrix multiplication, with or without coarrays.
This project integrates a custom CUDA-based matrix multiplication kernel into a PyTorch deep learning model, leveraging GPU acceleration for matrix operations. The goal is to compare the performance of this custom kernel with PyTorch's built-in matrix multiplication and demonstrate how custom CUDA kernels can optimize compute-intensive operations.
This repository provides a comprehensive guide to optimizing GPU kernels for performance, with a focus on NVIDIA GPUs. It covers key tools and techniques such as CUDA, PyTorch, and Triton, aimed at improving computational efficiency for deep learning and scientific computing tasks.
Generate optimized MatMul cuda kernel automatically using tvm auto schedule.
Raspberry Pi Pico (RP2040) and Adafruit Metro M7 (NXP IMXRT10XX) benchmark
In this project, instruction counts for a C program are measured with Pin and C++.
OpenMP Matrix Multiplication Offloading Playground
Optimised matrix multiplication kernels for acceleration on the Tenstorrent Tensix architecture
Implementations of Linear algebra algorithms in CPU and GPU
A Python script that uses the CuPy library to perform optimized matrix multiplication on the GPU. It includes a custom CUDA kernel tuned for performance and energy consumption, using half-precision floating-point numbers (float16) for improved throughput and warp utilization.
Check out the power of NumPy
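Demos like the one above typically contrast NumPy's vectorized `@` operator with a pure-Python loop; a minimal, hedged sketch (illustrative only, not this repository's code):

```python
import numpy as np

def matmul_loops(A, B):
    """Pure-Python triple-loop matmul, for comparison with NumPy's `@`."""
    n, k = len(A), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(len(B))) for j in range(k)]
            for i in range(n)]

A = np.arange(4).reshape(2, 2)
B = np.arange(4, 8).reshape(2, 2)
# The vectorized product agrees with the explicit loops.
assert np.array_equal(A @ B, np.array(matmul_loops(A.tolist(), B.tolist())))
```

On large matrices the vectorized form is orders of magnitude faster, since NumPy dispatches to an optimized BLAS under the hood.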
No description provided.
This repo contains Python scripts for testing all of MatMul's modules.
Perform the matrix-matrix operation `C = α*op(A)*op(B) + β*C` where `op(X)` is either `op(X) = X` or `op(X) = X^T`.
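The operation above is the standard BLAS GEMM. As a hedged NumPy sketch of the semantics (not any of these repositories' actual code), the `gemm` helper name and its keyword arguments are illustrative assumptions:

```python
import numpy as np

def gemm(alpha, A, B, C, beta, trans_a=False, trans_b=False):
    """Compute C_new = alpha*op(A) @ op(B) + beta*C, where op(X) is X or X^T."""
    op_a = A.T if trans_a else A
    op_b = B.T if trans_b else B
    return alpha * (op_a @ op_b) + beta * C
```

For example, `gemm(1.0, A, B, C, 0.0)` reduces to a plain matrix product, while `beta=1.0` accumulates into the existing `C`.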
Repo for AMX + FAST
Matrix-matrix multiplication implementations benchmarking
Matmul faster than NumPy on Intel CPUs (Xeon).
Rust Basic Matrix Multiplication
CPU, GPU, and FPGA matrix multiplication examples via SYCL
A simple script that generates 4×4 matrices for testing Matmul.
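A generator of small test matrices like the one described above can be sketched in a few lines; the `make_test_matrices` name, seed, and use of NumPy are assumptions for illustration:

```python
import numpy as np

def make_test_matrices(n=4, seed=0):
    """Generate a pair of random n-by-n matrices for exercising a matmul kernel."""
    rng = np.random.default_rng(seed)
    return rng.random((n, n)), rng.random((n, n))

# A kernel under test can then be checked against NumPy's reference product:
A, B = make_test_matrices()
reference = A @ B
```

Fixing the seed keeps the test inputs reproducible across runs.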
Exploring matrix multiplication techniques on RISC-V, focusing on single-core optimization via RVV vectorization.
Lightweight tensor library for deep learning model low-latency inference (cpu only)
Perform the matrix-matrix operation `C = α*op(A)*op(B) + β*C` where `op(X)` is either `op(X) = X` or `op(X) = X^T`.
High-performance RISC-V prototyping for Tenstorrent Blackhole p150a. MatMul optimized for Tensix cores.
cuBLAS GEMM Example for FP32 MatMul
Perform the matrix-matrix operation `C = α*op(A)*op(B) + β*C` where `op(X)` is either `op(X) = X` or `op(X) = X^T`.