In this blog post we evaluate the codegen performance of OctoML/Apache TVM and MLIR with Nod.ai's learning-based codegen search. We also compare against Intel MKL, since it is the vendor-supplied library and, being hand-written and hand-optimized, is usually the fastest. As discussed in an earlier post here, Matrix Multiply (GEMM) operations dominate the computation in modern Machine Learning models such as BERT, Vision Transformer (ViT), etc.
Hardware
For this test we use the latest Intel Alderlake CPU, a `12th Gen Intel(R) Core(TM) i9-12900K` with 32GB of DDR5-6000 CL-40 memory. It is the fastest Intel workstation CPU available. It does, however, come with the controversial mix of Performance and Efficiency cores, which relies on Intel's Thread Director. Thread Director is currently only available on Windows, and efforts to merge support into the Linux kernel have been running into headwinds. So for this test we disable the E-Cores and run all the P-Cores with the performance governor so they run as fast as they can.
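For reference, on a typical Linux sysfs layout this can be done by taking the E-Cores offline and writing the performance governor for the remaining cores. The core numbering below (E-Cores on CPUs 16-23 of the i9-12900K) is an assumption about this particular machine, not part of the benchmark setup described above; check `lscpu --extended` on your own system first.

```python
# Sketch: take the E-Cores offline and set the performance governor via sysfs (run as root).
# Assumes the i9-12900K layout where CPUs 0-15 are P-Core hardware threads and 16-23 are
# E-Cores; verify with `lscpu --extended` before running anything like this.
from pathlib import Path

E_CORES = range(16, 24)        # assumed E-Core CPU ids on this machine
P_CORE_THREADS = range(0, 16)  # assumed P-Core hardware threads

for cpu in E_CORES:
    # Writing "0" to the online file takes the core offline.
    Path(f"/sys/devices/system/cpu/cpu{cpu}/online").write_text("0")

for cpu in P_CORE_THREADS:
    # Pin the remaining cores to the performance governor.
    gov = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor")
    gov.write_text("performance")
```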
Credit for those quotes, and more information about the P-Cores, can be found here. So let's see if we can use those awesome P-Cores to their fullest extent without any custom libraries like MKL.
Software
The MatMul benchmark suite is available at MMPerf.org. MMPerf focuses on single-threaded, single-precision Matrix Multiply across various libraries and codegen compilers such as Halide, TVM and MLIR variations.
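For context, matmul performance is usually reported in GFLOP/s, counting 2*M*N*K floating point operations per multiply. The snippet below is a minimal single-threaded NumPy baseline using that metric; it is an illustration of how such numbers are computed, not mmperf's own harness.

```python
# Minimal sketch of the standard matmul metric: GFLOP/s = 2*M*N*K / time.
# This is a NumPy illustration, not the mmperf harness. To match a single-threaded
# setup, pin the underlying BLAS to one thread (e.g. OMP_NUM_THREADS=1).
import time
import numpy as np

def benchmark_matmul(M, N, K, iters=50):
    a = np.random.rand(M, K).astype(np.float32)
    b = np.random.rand(K, N).astype(np.float32)
    np.matmul(a, b)                           # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        np.matmul(a, b)
    elapsed = (time.perf_counter() - start) / iters
    return 2 * M * N * K / elapsed / 1e9      # GFLOP/s

print(f"{benchmark_matmul(384, 384, 384):.1f} GFLOP/s")  # MobileBERT-sized GEMM
```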
We use the latest version of Intel oneAPI MKL (intel-oneapi-mkl-common-2022.0.2), and Apache TVM is at the top of main as of Jan 27th 2022 (SHA 1b9b05e61d274b583d0ec7fa17728d30a60050a6). TVM is tuned using its Ansor autotuner for 1000 generations. The code is available here.
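For readers unfamiliar with Ansor, the tuning flow looks roughly like the sketch below, using TVM's `auto_scheduler` API. The matmul definition, trial count placement and target flags here are illustrative assumptions rather than the exact mmperf configuration.

```python
# Sketch of Ansor (tvm.auto_scheduler) tuning for a single-precision matmul.
# Shapes and target are illustrative, not the mmperf configuration.
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def matmul(M, N, K, dtype):
    A = te.placeholder((M, K), name="A", dtype=dtype)
    B = te.placeholder((K, N), name="B", dtype=dtype)
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

# "-mcpu=alderlake" assumes an LLVM new enough to know the target; otherwise use plain "llvm".
target = tvm.target.Target("llvm -mcpu=alderlake")
task = auto_scheduler.SearchTask(func=matmul, args=(384, 384, 384, "float32"), target=target)

log_file = "matmul_384.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=1000,  # search budget
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task.tune(tune_option)

# Apply the best schedule found and compile it.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)
```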
The LLVM MLIR Sandbox is also up to date with the top of main. The LLVM Sandbox is a staging area for advanced / experimental MLIR patches that the Google IREE team hosts. All the code to recreate the results is available here, and the results of Nod.ai's search are checked in so the numbers can be reproduced.
Results
Using the LLVM Sandbox MLIR codegen driver together with Nod.ai's codegen tuning, MLIR is able to outperform Intel MKL, OctoML Apache TVM, "stock" MLIR and "stock" MLIR Sandbox.
Inception V3 (22500x32x27, 5329x192x720), ResNet50 (3136x64x64), BERT-L (512x1024x1024) and MobileBERT (384x384x384)
Benchmark All Sizes
To cover a larger number of MatMul sizes we run "benchmark_all_sizes.txt" in mmperf, with the results below clearly showing superior performance across the board.
There is still room for improvement for sizes such as 9x1001x2048 in Inception V3.
Future Work
In the next blog post we will roll these codegen performance numbers up into whole-model performance and demonstrate the end-to-end latency of running BERT inference with Nod.ai's codegen search. Stay tuned.
And if you have read this far: we are hiring!
Acknowledgements
Nod.ai's Codegen Search builds on the work of the LLVM/MLIR community and the Google IREE team. Special thanks to Nicolas Vasilache, Tobias Gysi, Mahesh Ravishankar, Thomas Raoux and Hanhan Wang for their help and guidance with MLIR core codegen questions.