Matrix multiply forms the foundation of machine learning computations. We show that Apple's custom AMX2 matrix multiply unit in the M1 can outperform ARMv8.6's standard NEON instructions by about 2x.
Nod's AI Compiler team focuses on state-of-the-art code generation, async partitioning, optimization, and scheduling to overlap communication and compute on various AI hardware, from large datacenter clusters to edge AI silicon. The basic computational building block for all of that is the venerable matmul. In this post we focus on matmul performance on the just-released Apple M1 chip, since that translates directly to how much you can squeeze out of any AI hardware.
Typically silicon teams work closely with optimization teams to create highly optimized SGEMM (Single-precision GEneral Matrix Multiply) and DGEMM (Double-precision GEneral Matrix Multiply) kernels for each silicon platform. Intel's MKL provides these kernels for Intel chipsets, and Apple's Accelerate framework provides highly optimized kernels for Apple machines (both Intel and Apple Silicon).
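For reference, SGEMM computes C = alpha*A*B + beta*C. A minimal, unoptimized sketch of those semantics (assuming row-major storage; the function name is ours, not any library's API) looks like this; optimized kernels such as MKL's or Accelerate's compute exactly the same thing, just orders of magnitude faster:

```cpp
#include <cstddef>
#include <vector>

// Reference semantics of SGEMM: C = alpha * A * B + beta * C,
// with row-major MxK A, KxN B, and MxN C. Purely illustrative;
// optimized kernels compute the same result much faster.
void sgemm_ref(std::size_t M, std::size_t N, std::size_t K,
               float alpha, const std::vector<float>& A,
               const std::vector<float>& B, float beta,
               std::vector<float>& C) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
}
```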
Eigen provides a reasonably easy-to-use, high-level template library of these linear algebra algorithms while also exposing building blocks like GEBP (GeneralBlockPanelKernel) "traits" for each SoC. These GEBP traits effectively allow you to use compiler intrinsics and custom instructions to target a particular SoC while still wrapping it all in higher-level C++ for ease of use. BLIS (a BLAS-like library) follows a similar paradigm, where the innermost microkernel is highly hand-optimized assembly for a particular architecture and forms the foundation of the higher-level computations, which can be written in more portable code. There is a good read on the concepts used by BLIS here. However, BLIS's microkernel on ARM/NEON is woefully inadequate (see this bug report when building with clang). There have been other attempts to write the GEBP kernel in portable code (see this), but Eigen is probably the most successful, with the backing of the TensorFlow and Android efforts.
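To make the GEBP/microkernel idea concrete, here is a scalar sketch of the kind of register-tiled inner kernel Eigen and BLIS build on: a small tile of C (4x4 here) is accumulated in registers while packed panels of A and B are streamed through. The packing layout and function name below are illustrative assumptions, not Eigen's or BLIS's actual code; a hand-tuned kernel replaces the scalar FMAs with NEON fmla instructions and adds prefetching:

```cpp
#include <cstddef>

// Illustrative GEBP-style 4x4 microkernel (a sketch, not real library
// code). Apanel holds 4 values of A per k step, Bpanel holds 4 values
// of B per k step, and a 4x4 tile of C stays in registers for the
// whole k loop, which is what makes the scheme cache- and
// register-friendly.
void micro_kernel_4x4(std::size_t K, const float* Apanel,
                      const float* Bpanel, float* C, std::size_t ldc) {
  float acc[4][4] = {};               // C tile held in "registers"
  for (std::size_t k = 0; k < K; ++k) {
    const float* a = Apanel + 4 * k;  // 4 values of A for this k
    const float* b = Bpanel + 4 * k;  // 4 values of B for this k
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        acc[i][j] += a[i] * b[j];     // the FMA a NEON kernel vectorizes
  }
  for (int i = 0; i < 4; ++i)         // write the tile back to C
    for (int j = 0; j < 4; ++j)
      C[i * ldc + j] += acc[i][j];
}
```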
Apple's M1 SoC has received rave reviews for its performance. It includes an extension to the ARM architectural specification in the form of a matrix co-processor, commonly referred to as AMX (Apple Matrix co-processor). The version in the M1 SoC is supposedly a "Version 2", so let's refer to it as AMX2. The AMX2 is supposedly more tightly coupled with the ARM core (it has custom instructions to access it) than the ANE (Apple Neural Engine), a separate neural processing unit on the SoC that behaves more like an integrated GPU, with higher latencies and higher throughput compared to the inline AMX2.
Apple has not released the instructions to access the AMX2; this way there is no need to maintain backwards compatibility with compiled software. The only way you should (though not the only way you can) currently access the AMX2 on the M1 SoC is via the Accelerate framework. ARM has just started adding support for the ARMv8.7-a architecture in LLVM, and specifically support for accelerators such as the AMX here. It includes the ability to add accelerators such as the AMX, but it is unclear if AMX will adhere to that specification. You can find out more about ARMv8.7-a here.
In this post we evaluate a simple SGEMM of size 1000x1000 using AMX2 (via Accelerate) and Eigen's NEON-optimized version on the Apple M1. We have done some tests with other ARMv8 SoCs, but the comparison is not apples to apples since they were older-generation parts or used different toolchain options. All tests were done with top-of-master Clang (12.0.0+) built with OpenMP support (for Eigen's parallelism) and top-of-master Eigen. We run 10 iterations of the matrix multiply as warmup (to initialize any lazily loaded libraries and fill the instruction and data caches), then run the test 20 times and average the run times. We use Eigen's noalias() to make sure there are no unnecessary copies.
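The timing methodology above can be sketched as follows. `avg_time_ms` is our own illustrative helper, not part of any framework; in the actual test the work() callable is Eigen's `c.noalias() = a * b;`:

```cpp
#include <chrono>
#include <functional>

// Sketch of the benchmark loop: warmup runs to page in lazily loaded
// libraries and warm the caches, then timed runs whose wall-clock
// times are averaged. Defaults match the methodology described above.
double avg_time_ms(const std::function<void()>& work,
                   int warmup = 10, int iters = 20) {
  for (int i = 0; i < warmup; ++i) work();  // untimed warmup
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) work();   // timed region
  auto t1 = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::milli> dt = t1 - t0;
  return dt.count() / iters;                // average ms per run
}
```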
We compile the code with -O3 and validate with "otool -L" that we link against Accelerate when using it and against just the standard libraries otherwise. "otool -tv a.out | grep fmla" should also show the NEON FMLA instructions being used, as shown below:
nodai@macbook-pro-2 pytorch % otool -tv a.out | grep fmla
00000001000025a8  fmla.2d v0, v2, v1 ; Latency: 10
0000000100002700  fmla.2d v2, v4, v0 ; Latency: 10
0000000100002704  fmla.2d v1, v5, v3 ; Latency: 10
0000000100002a78  fmla.2d v3, v5, v1 ; Latency: 10
Apple M1 with Accelerate (AMX2)
(nnc_venv) nodai@macbook-pro pytorch % LD_LIBRARY_PATH=/Users/nodai/lokal/lib/ ./a.out
Eigen is using 8 threads
Starting matrix multiplication test with 1000 matrices
Eigen avg execution time (ms) = 8
Apple M1 with NEON (Eigen)
(nnc_venv) nodai@macbook-pro pytorch % LD_LIBRARY_PATH=/Users/nodai/lokal/lib/ ./a.out
Eigen is using 8 threads
Starting matrix multiplication test with 1000 matrices
Eigen avg execution time (ms) = 20
For the test matrix size of 1000 there is negligible CPU utilization, so we increased the matrix size until we could see some impact. With AMX2 the CPU utilization delta remains negligible, though there is likely memory pressure from DMAing operands into the AMX2; with NEON we can saturate the CPU cores.
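As a back-of-envelope check on the numbers above: an NxN matmul performs 2*N^3 floating-point operations, so at N = 1000 the 8 ms Accelerate/AMX2 average corresponds to roughly 250 Gop/s and the 20 ms NEON average to roughly 100 Gop/s. The helper below is just this arithmetic:

```cpp
// Back-of-envelope throughput: an NxN * NxN matmul does 2*N^3
// floating-point ops, so Gop/s = 2*N^3 / (time in seconds) / 1e9.
double gops_per_sec(double n, double ms) {
  return 2.0 * n * n * n / (ms * 1e-3) / 1e9;
}
```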
Matrix Multiplication for Various Matrix Sizes
- Apple Silicon M1
- compiler: clang version 12.0.0 (/Users/nodai/llvm-project/clang e6ae623314bab3ddd983ed941bf63a6d4c63a1f4)
- eigen3: fdf2ee62c5174441076fb64c9737d89bbe102759
Single Threaded FP32 Matmul NEON
But is this the best NEON optimization possible? Based on this work from the gemmlowp project there should be more room, so we ran the "standalone NEON tests" on the M1. This should give us a good idea of the delta between Eigen's GEBP and a fully hand-optimized NEON kernel written by ARM themselves (though for a Cortex-A57-class core). The results are below:
kernel,Gop/s
NEON_64bit_GEMM_Int425Operands_intrinsics,145.04
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits,105.541
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits_intrinsics,39.3034
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits,81.8715
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics,16.8501
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators,51.2957
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics,54.485
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57,58.1817
NEON_64bit_GEMM_Int32_WithScalar,69.2449
NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar,48.9629
NEON_64bit_GEMM_Float32_WithScalar,69.229
NEON_64bit_GEMM_Float32_WithScalar_intrinsics,32.3552
NEON_64bit_GEMM_Float32_WithScalar_A57,68.6708
NEON_64bit_GEMM_Float32_WithScalar_A55r1,53.2717
Based on the above we estimate there is another 20% or so we can squeeze out of the NEON implementation. An important point to consider is the overhead of using intrinsics vs. full inline assembly: you still have to make a choice between portability and performance.
Single Threaded FP32 Matmul AMX2
After this post was initially written we found that the environment variable VECLIB_MAXIMUM_THREADS can restrict Accelerate to single-threaded operation. Stay tuned for more controlled studies.
This is a first-pass performance test of AMX2 vs NEON on the Apple M1 that shows AMX2 roughly twice as fast as the NEON implementation. There are probably many corner cases to consider and tweaks to be made to the NEON code, especially given the varying L2 cache sizes between the different core types. Based on the gemmlowp work we estimate there is at least another 20% or so left on the table with NEON. Let us know if we missed anything. Right now the system (core clocks, core types, scheduler priorities, etc.) is not in a controlled environment for thorough testing, but hopefully this gives a first-order approximation of the performance to expect from the fundamental building block, a matmul, on the Apple M1.
In our next blog post we will build on the matmul and share some numbers on the Nod Compiler's codegen capabilities to automatically generate these GEMM kernels and other common kernels used in machine learning, and compare the performance to native frameworks like MKL/MKL-DNN, Accelerate/MLCompute, and cuDNN/cuBLAS on the GPU.
Update 1 (12/30): Add first pass ruy numbers and push source code
Thanks to Benoit Jacob from Google, who has worked on Eigen, gemmlowp, TFLite, IREE, etc., we have first-pass ruy numbers below on the M1. Stay tuned for more detailed comparisons and numbers.
macbook-pro-2 iree-build % THREADS=8 RUY_BENCHMARK_CUBIC=1 NOEXT=1 PATHS=f0 ./build_tools/third_party/ruy/benchmark_f32_f32_f32_f32
size,kNeon:Gop/s,kNeonDotprod:Gop/s
16,32.4,32
24,53.27,52.75
32,65.25,65.17
48,125.5,124
64,45.57,47.24
96,75.85,78.76
128,134.2,134.2
192,202.6,211.1
256,253.5,249.9
384,286.5,287.1
512,359.9,364.6
768,356,354
1024,430.4,424.6
1536,470.1,463.5
2048,470.7,469.2
3072,459,456.4
4096,454.1,454.3
Covered paths: kNeon, kNeonDotprod
Source code is now available at: here