GEMM operations dominate the computation in modern machine learning models. Silicon vendors typically provide hand-optimized GEMM libraries such as Apple’s Accelerate Framework [1], AMD’s BLIS [2] and Intel’s MKL [3]. There are also open source implementations such as OpenBLAS [4], BLIS [5] and RUY [6]. We will demonstrate Nod’s compiler-generated code outperforming these libraries on their respective hardware.

Introducing MMPerf

We introduce MMPerf, a MatMul benchmark that lets you consistently measure the single-threaded, single-core performance of MatMul codegen and hand-engineered BLAS libraries. We focus on single-threaded, single-core performance to remove the variance introduced by multi-threading and inter-CPU/core communication. Codegen can also fuse several of these MatMuls into a highly optimized sequence of execution. In addition, technologies such as Nod’s Runtime provide a holistic way to overlap compute and communication, from within an SoC up to a distributed cluster, so the performance of an individual Processing Element (PE) is key to benchmarking the overall impact on the system.
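To make the measured quantity concrete, here is a minimal sketch of how single-threaded MatMul throughput is commonly derived: an (M x K) by (K x N) product costs 2·M·N·K floating-point operations, and dividing that by the best observed wall-clock time gives GFLOPS. This is not MMPerf's actual harness; the function names (`matmul_flops`, `naive_matmul`, `measure_gflops`) are hypothetical, and the pure-Python kernel is only a stand-in for a real BLAS or codegen kernel.

```python
import time


def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per inner step."""
    return 2 * m * n * k


def naive_matmul(a, b):
    """Reference single-threaded kernel (stand-in for a BLAS or codegen kernel)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_a, row_c = a[i], c[i]
        for p in range(k):
            a_ip, row_b = row_a[p], b[p]
            for j in range(n):
                row_c[j] += a_ip * row_b[j]
    return c


def measure_gflops(kernel, m, n, k, repeats=5):
    """Time `kernel` on an m x k by k x n problem and report GFLOPS from the best run."""
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    kernel(a, b)  # warm-up run, excluded from timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(a, b)
        best = min(best, time.perf_counter() - start)
    return matmul_flops(m, n, k) / best / 1e9


if __name__ == "__main__":
    print(f"{measure_gflops(naive_matmul, 64, 64, 64):.4f} GFLOPS")
```

Taking the best of several runs after a warm-up is one common way to reduce timing noise; it is an assumption of this sketch rather than a description of what MMPerf does.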

MMPerf has an extensible architecture. Right now it supports the Accelerate, MKL, BLIS, OpenBLAS and RUY libraries, and for codegen it supports MLIR [8], Halide [10] and TVM [9]. For this benchmark we focus on a diverse set of MatMul sizes based on real neural network workloads taken from ResNet50, MobileNet and others. Traditionally, BLAS libraries focused on very large square matrices, which you almost never encounter in neural networks; those tend to be medium-sized but long and skinny. We have open sourced the MMPerf benchmark [7], so feel free to check it out, try it on your system, and submit feedback or pull requests.
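To illustrate why neural network MatMuls tend to be long and skinny, here is a small sketch that maps a convolution layer to GEMM dimensions using the standard im2col lowering (M = output pixels, N = output channels, K = input channels times kernel area). The helper names are hypothetical, and the layer used, the first convolution of ResNet-50, is chosen only as a well-known example; it is not necessarily one of the sizes in the MMPerf suite.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_to_gemm_shape(in_h, in_w, in_c, out_c, kernel, stride, padding):
    """GEMM dimensions (M, N, K) of a square-kernel conv lowered via im2col."""
    out_h = conv_output_size(in_h, kernel, stride, padding)
    out_w = conv_output_size(in_w, kernel, stride, padding)
    m = out_h * out_w            # one row per output pixel
    n = out_c                    # one column per output channel
    k = in_c * kernel * kernel   # reduction over the receptive field
    return m, n, k


# First conv of ResNet-50: 224x224x3 input, 64 filters of 7x7, stride 2, padding 3.
print(conv_to_gemm_shape(224, 224, 3, 64, kernel=7, stride=2, padding=3))
# (12544, 64, 147)
```

The result, 12544 x 64 x 147, is far from the large square problems that BLAS libraries were traditionally tuned for.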

Let’s get to the results.

Apple Accelerate on iMac Pro (Xeon W-2191B) 


AMD BLIS – AMD Ryzen 5950x



Intel MKL – Intel Xeon Cascade Lake (GCP c2-instance)


As you can see, Nod’s codegen outperforms each of the native libraries on their respective silicon. This is just the start of what codegen can achieve. Once you start stringing these kernels together you will see much more dramatic performance improvements, because operator fusion in codegen kernels allows us to optimize deep into the mathematical core of the computation. These fused ops will substantially improve the utilization of each individual PE and, when combined with Nod’s Runtime, will unlock huge system-level efficiencies.
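As a rough illustration of what operator fusion means here, the sketch below compares an unfused MatMul followed by a bias add and a ReLU (three passes, two intermediate matrices) with a fused version that applies the bias and ReLU to each output element while it is still in the accumulator. This is a conceptual pure-Python example with hypothetical function names, not Nod's codegen or the output of any compiler.

```python
def matmul(a, b):
    """Plain (m x k) @ (k x n) product."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def unfused(a, b, bias):
    """Three separate ops: each pass reads and writes a full matrix."""
    c = matmul(a, b)                                             # pass 1
    c = [[v + bias[j] for j, v in enumerate(row)] for row in c]  # pass 2
    return [[max(v, 0.0) for v in row] for row in c]             # pass 3


def fused(a, b, bias):
    """One op: bias and ReLU are applied before the result is ever stored."""
    k, n = len(b), len(b[0])
    out = []
    for row_a in a:
        out_row = []
        for j in range(n):
            acc = bias[j]                      # start the accumulator from the bias
            for p in range(k):
                acc += row_a[p] * b[p][j]
            out_row.append(max(acc, 0.0))      # ReLU applied in the epilogue
        out.append(out_row)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]
print(fused(a, b, bias))  # [[1.5, 0.0], [3.5, 0.0]]
```

In a compiled kernel the benefit comes from avoiding the round trips through memory for the intermediate matrices; the Python version only shows the structural difference.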

A special note about the Apple M1: Apple has introduced a dedicated MatMul block in the M1 called AMX2, as discussed here [11]. The performance of this MatMul block is outstanding, pushing a teraflop of single-precision MatMuls as shown below. Currently there is no way to target these AMX2 instructions with LLVM, which our codegen depends on. Though there have been some efforts to study these instructions, they are not yet ready to integrate into a toolchain. RISC-V does have a similar MatMul block called Gemmini [12] that we plan to integrate into our toolchain.


Apple M1 (8GB Mini)

A huge thank you to the broader MLIR community for their support on Discord (LLVM’s #mlir, #mlir-npcomp, IREE’s #matmul). Special thanks to Nicolas Vasilache for codegen discussions, and to Benoit Jacob and Marat Dukhan for discussions on efficient MatMuls in general.


In the future we will demonstrate Nod’s codegen for GPUs (Nvidia and AMD) and how it compares to the native GPU libraries that ship with CUDA and ROCm, and then the fun part: running an entire end-to-end NN training flow with only codegen’d kernels.


We are only getting started, and we have interesting and very challenging work ahead. If you have read this far and want to improve the efficiency of AI deployments at scale, come join us.

AI Compiler Engineer:
ML Systems Engineer:


[1] Apple Accelerate

[2] AMD BLAS Library

[3] Intel oneAPI MKL

[4] OpenBLAS

[5] BLIS

[6] RUY

[7] MMPerf

[8] MLIR

[9] TVM

[10] Halide

[11] Comparing AMX2 and Neon

[12] Gemmini Matmul Accelerator
