Repos
10
Stars
28
Forks
5
Top Language
C
Top Repositories
SGEMM and DGEMM subroutines using AVX512F instructions.
Fast AVX2/FMA3 dgemm and sgemm subroutines for medium to large matrices (>2000*2000) on Haswell/Skylake/Zen processors, with performance comparable to MKL.
SGEMM kernel function for Nvidia Pascal GPUs, able to achieve 60% of theoretical peak performance.
sgemm and dgemm subroutines for large matrices that slightly outperform Intel MKL.
cgemm and zgemm subroutines for large matrices, using AVX2 and FMA3 instructions, with performance comparable to MKL 2018.
Top-K selection with K = 16 or 32, based on the bitonic sort algorithm, using Intel AVX instructions.
Repositories
10
SGEMM and DGEMM subroutines using AVX512F instructions.
Fast AVX2/FMA3 dgemm and sgemm subroutines for medium to large matrices (>2000*2000) on Haswell/Skylake/Zen processors, with performance comparable to MKL.
cgemm and zgemm subroutines for large matrices, using AVX2 and FMA3 instructions, with performance comparable to MKL 2018.
SGEMM kernel function for Nvidia Pascal GPUs, able to achieve 60% of theoretical peak performance.
sgemm and dgemm subroutines for large matrices that slightly outperform Intel MKL.
Top-K selection with K = 16 or 32, based on the bitonic sort algorithm, using Intel AVX instructions.
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
No description provided.
cgemm3m and zgemm3m subroutines for large matrices, using AVX2 and FMA3 instructions.
How to design a CPU GEMM on x86 with 256-bit AVX that can beat OpenBLAS.