Generating code to outperform native MatMul libraries (Accelerate, BLIS, MKL) and measuring it with MMperf

GEMM operations dominate the computation in modern machine learning models. Silicon vendors typically provide hand-optimized GEMM libraries such as Apple’s Accelerate framework [1], AMD’s BLIS [2], and Intel’s MKL [3]. There are also open-source implementations such as OpenBLAS [4], BLIS [5], and RUY [6]. We will demonstrate compiler-generated code outperforming […]
