wjc404 | GitHunt

wjc404

Beijing, China

Languages

C 56%, C++ 22%, Cuda 11%, Fortran 11%

Top Repositories

GEMM_AVX512F

SGEMM and DGEMM subroutines using AVX512F instructions.

15 stars, C

GEMM_AVX2

Fast AVX2/FMA3 DGEMM and SGEMM subroutines for medium to large matrices (> 2000 × 2000) on Haswell/Skylake/Zen processors, with performance comparable to MKL.

Simple_CUDA_GEMM

SGEMM kernel function for NVIDIA Pascal GPUs, able to achieve 60% of theoretical peak performance.

5 stars, Cuda

GEMM_AVX2_FMA3

SGEMM and DGEMM subroutines for large matrices that slightly outperform Intel MKL.

COMPLEX_GEMM_AVX2_FMA3

CGEMM and ZGEMM subroutines for large matrices, using AVX2 and FMA3 instructions, with performance comparable to MKL 2018.

bitonic_fp32_avx_top16

Top-K selection with K = 16 or 32, based on the bitonic sort algorithm, using Intel AVX instructions.

0 stars, C++

Repositories

wjc404/GEMM_AVX512F

SGEMM and DGEMM subroutines using AVX512F instructions.

C, 15 stars, 1 fork, updated 3 years ago

wjc404/GEMM_AVX2

Fast AVX2/FMA3 DGEMM and SGEMM subroutines for medium to large matrices (> 2000 × 2000) on Haswell/Skylake/Zen processors, with performance comparable to MKL.

C, 7 stars, 1 fork, updated 5 years ago

Topics: avx2, fma, matrixmultiply

wjc404/COMPLEX_GEMM_AVX2_FMA3

CGEMM and ZGEMM subroutines for large matrices, using AVX2 and FMA3 instructions, with performance comparable to MKL 2018.

C, 0 stars, 0 forks, updated 6 years ago

wjc404/Simple_CUDA_GEMM

SGEMM kernel function for NVIDIA Pascal GPUs, able to achieve 60% of theoretical peak performance.

Cuda, 5 stars, 1 fork, updated 5 years ago

wjc404/GEMM_AVX2_FMA3 (Archived)

SGEMM and DGEMM subroutines for large matrices that slightly outperform Intel MKL.

C, 1 star, 1 fork, updated 6 years ago

Topics: matrixmultiplication, simd