30 results for “topic:matmul”
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
Matrix multiplication on the NPU inside RK3588
Floating-point matrix multiplication implementation (arbitrary precision)
ForMatmul - A Fortran library that overloads the matmul function to enable efficient matrix multiplication, with or without coarrays.
This project integrates a custom CUDA-based matrix multiplication kernel into a PyTorch deep learning model, leveraging GPU acceleration for matrix operations. The goal is to compare the performance of this custom kernel with PyTorch's built-in matrix multiplication and demonstrate how custom CUDA kernels can optimize compute-intensive operations.
This repository provides a comprehensive guide to optimizing GPU kernels for performance, with a focus on NVIDIA GPUs. It covers key tools and techniques such as CUDA, PyTorch, and Triton, aimed at improving computational efficiency for deep learning and scientific computing tasks.
Generate optimized MatMul cuda kernel automatically using tvm auto schedule.
Raspberry Pi Pico (RP2040) and Adafruit Metro M7 (NXP IMXRT10XX) benchmark
In this project, instruction counts for a C program are measured with Pin and C++.
OpenMP Matrix Multiplication Offloading Playground
Optimised matrix multiplication kernels for acceleration on the Tenstorrent Tensix architecture
Implementations of Linear algebra algorithms in CPU and GPU
A Python script that uses the CuPy library to perform optimized matrix multiplication on the GPU. It includes a custom CUDA kernel tuned for performance and energy consumption, using half-precision floating-point numbers (float16) for improved throughput and warp utilization.
Check out the power of NumPy
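Demos like the one above typically contrast NumPy's vectorized `@` operator with a pure-Python loop; a minimal, hedged sketch (illustrative only, not this repository's code):

```python
import numpy as np

def matmul_loops(A, B):
    """Pure-Python triple-loop matmul, for comparison with NumPy's `@`."""
    n, k = len(A), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(len(B))) for j in range(k)]
            for i in range(n)]

A = np.arange(4).reshape(2, 2)
B = np.arange(4, 8).reshape(2, 2)
# The vectorized product agrees with the explicit loops.
assert np.array_equal(A @ B, np.array(matmul_loops(A.tolist(), B.tolist())))
```

On large matrices the vectorized form is orders of magnitude faster, since NumPy dispatches to an optimized BLAS under the hood.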
No description provided.
This repo contains Python scripts for testing all of MatMul's modules.
Perform the matrix-matrix operation `C = α*op(A)*op(B) + β*C` where `op(X)` is either `op(X) = X` or `op(X) = X^T`.
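The operation above is the standard BLAS GEMM. As a hedged NumPy sketch of the semantics (not any of these repositories' actual code), the `gemm` helper name and its keyword arguments are illustrative assumptions:

```python
import numpy as np

def gemm(alpha, A, B, C, beta, trans_a=False, trans_b=False):
    """Compute C_new = alpha*op(A) @ op(B) + beta*C, where op(X) is X or X^T."""
    op_a = A.T if trans_a else A
    op_b = B.T if trans_b else B
    return alpha * (op_a @ op_b) + beta * C
```

For example, `gemm(1.0, A, B, C, 0.0)` reduces to a plain matrix product, while `beta=1.0` accumulates into the existing `C`.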
Repo for AMX + FAST
Matrix-matrix multiplication implementations benchmarking
Matmul faster than NumPy on Intel CPUs (Xeon).
Rust Basic Matrix Multiplication
CPU, GPU, and FPGA matrix multiplication examples via SYCL
A simple script that generates 4×4 matrices for testing Matmul.
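A generator of small test matrices like the one described above can be sketched in a few lines; the `make_test_matrices` name, seed, and use of NumPy are assumptions for illustration:

```python
import numpy as np

def make_test_matrices(n=4, seed=0):
    """Generate a pair of random n-by-n matrices for exercising a matmul kernel."""
    rng = np.random.default_rng(seed)
    return rng.random((n, n)), rng.random((n, n))

# A kernel under test can then be checked against NumPy's reference product:
A, B = make_test_matrices()
reference = A @ B
```

Fixing the seed keeps the test inputs reproducible across runs.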
Exploring matrix multiplication techniques on RISC-V, focusing on single-core optimization via RVV vectorization.
Lightweight tensor library for deep learning model low-latency inference (cpu only)
Perform the matrix-matrix operation `C = α*op(A)*op(B) + β*C` where `op(X)` is either `op(X) = X` or `op(X) = X^T`.
High-performance RISC-V prototyping for Tenstorrent Blackhole p150a. MatMul optimized for Tensix cores.
cuBLAS GEMM Example for FP32 MatMul
Perform the matrix-matrix operation `C = α*op(A)*op(B) + β*C` where `op(X)` is either `op(X) = X` or `op(X) = X^T`.