Matrix multiply forms the foundation of machine learning computations. We show that the custom AMX2 matrix multiply unit in Apple’s M1 can outperform ARMv8.6’s standard NEON instructions by about 2X.

Nod’s AI Compiler team focuses on state-of-the-art code generation, async partitioning, optimizations and scheduling to overlap communication and compute on various AI hardware, from large datacenter clusters to edge AI silicon. The basic computation building block for all of that is the venerable Matmul. In this post we focus on Matmul performance on the just-released Apple M1 chip, since Matmul performance translates directly to how much you can squeeze out of any AI hardware.

Typically silicon teams work closely with optimization teams to create highly optimized SGEMM (Single precision GEneral Matrix Multiply) and DGEMM (Double precision GEneral Matrix Multiply) kernels for each silicon platform. Intel’s MKL provides these kernels for Intel chipsets, and Apple’s Accelerate Framework provides highly optimized kernels for Apple machines (both Intel and Apple Silicon).

Eigen provides a reasonably easy-to-use, high-level template library of these linear algebra algorithms while also exposing building blocks like GEBP (GeneralBlockPanelKernel) “traits” for each SoC. These GEBP traits effectively allow you to use compiler intrinsics and custom instructions to target a particular SoC while still wrapping it up in higher-level C++ for ease of use. BLIS (a BLAS-like library) follows a similar paradigm: the innermost microkernel is hand-optimized assembly for a particular architecture and forms the foundation of the higher-level computations, which can be written in more portable code. There is a good read on the concepts used by BLIS here. However, BLIS’s microkernel on ARM/NEON is woefully inadequate (see this bug report when building with clang). There have been other attempts to write the GEBP kernel in portable code (see this), but I think Eigen is probably the most successful, with the backing of the TensorFlow and Android efforts.
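The layering that Eigen’s GEBP and BLIS share can be sketched in a few lines. What follows is a pure-Python illustration of the loop structure only, not code from either library: the names `micro_kernel` and `blocked_matmul` and the tile sizes `MR` and `NR` are hypothetical. In a real library the micro-kernel would be hand-written intrinsics or assembly, and the outer loops would also pack panels of A and B for cache locality.

```python
# Illustrative sketch only: names and tile sizes are assumptions, not Eigen or BLIS APIs.
MR, NR = 4, 4  # register-tile size handled by the micro-kernel (hypothetical values)


def micro_kernel(A, B, C, i0, j0, mr, nr, K):
    """Update the mr x nr tile of C at (i0, j0) as a sum of K rank-1 updates.

    This is the part a real library writes in assembly or intrinsics.
    """
    # The accumulators stand in for vector registers.
    acc = [[0.0] * nr for _ in range(mr)]
    for k in range(K):
        for i in range(mr):
            a = A[i0 + i][k]
            for j in range(nr):
                acc[i][j] += a * B[k][j0 + j]
    # Write the finished tile back to C once, after the whole K loop.
    for i in range(mr):
        for j in range(nr):
            C[i0 + i][j0 + j] += acc[i][j]


def blocked_matmul(A, B):
    """Portable outer loops that walk C tile by tile and call the micro-kernel."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, MR):
        for j0 in range(0, N, NR):
            # Edge tiles may be smaller than MR x NR.
            micro_kernel(A, B, C, i0, j0, min(MR, M - i0), min(NR, N - j0), K)
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    print(blocked_matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

Only `micro_kernel` would need rewriting for a new architecture; the outer loops stay portable, which is the trade-off both libraries make.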

Apple’s M1 SoC has received rave reviews for its performance. It extends the ARM architecture with a matrix co-processor, commonly referred to as AMX (Apple Matrix Co-processor). The version in the M1 SoC is supposedly a “Version 2”, so let’s refer to it as AMX2. The AMX2 is supposedly more tightly coupled with the ARM core (which has custom instructions to access it) than the ANE (Apple Neural Engine), a separate Neural Processing Unit on the SoC that would behave more like an integrated GPU, with higher latencies and higher throughput than the inline AMX2.

Apple has not released the instructions to access the AMX2, which means it does not have to maintain backwards compatibility with compiled software. The only way you should (though not the only way you can) currently access AMX2 on the M1 SoC is via the Accelerate Framework. ARM has just started adding support for the ARMv8.7-a architecture in LLVM here. It includes the ability to add accelerators such as the AMX, but it is unclear if AMX will adhere to that specification. You can find out more about ARMv8.7-a here.

In this post we evaluate a simple SGEMM with a size of 1000 using AMX2 (via Accelerate) and Eigen’s NEON-optimized version on the Apple M1. We have done some tests with other ARMv8 SoCs, but the comparison is not apples to apples since they were older-generation parts or had different toolchain options. All tests were done with top-of-master Clang (12.0.0+) built with OpenMP support (for Eigen’s parallelism) and top-of-master Eigen. We run 10 iterations of the matrix multiply as warmup (to initialize any lazily loaded libraries and fill the instruction and data caches), then run the test 20 times and average the run times. We have to use Eigen’s noalias() to make sure there are no unnecessary copies.
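The warmup-then-average procedure can be sketched as follows. This is a minimal illustration, not the benchmark used for the numbers in this post: `benchmark` and `naive_matmul` are hypothetical names, and the pure-Python multiply is only a stand-in workload. In the real test the timed expression would be an Eigen product, written along the lines of `C.noalias() = A * B;`.

```python
import time


def benchmark(fn, warmup=10, iters=20):
    """Run fn() `warmup` times untimed, then return the mean of `iters` timed runs in ms."""
    for _ in range(warmup):
        fn()  # warm the caches and trigger any lazy library loading
    total = 0.0
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return 1000.0 * total / iters


def naive_matmul(A, B):
    """Stand-in workload; the post times an Eigen SGEMM instead."""
    cols_B = list(zip(*B))  # transpose B once so each column is contiguous
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


if __name__ == "__main__":
    n = 32  # kept small because pure Python is slow; the post uses 1000
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    avg_ms = benchmark(lambda: naive_matmul(A, B))
    print(f"avg execution time (ms) = {avg_ms:.3f}")
```

Averaging only the timed runs keeps one-off costs such as library loading out of the reported number.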

We compile the code with -O3 and validate with “otool -L” that we link against Accelerate when using it and just standard libs otherwise. Also “otool -tv a.out | grep fmla” should show you the NEON FMLA instructions being used as shown below:

nodai@macbook-pro-2 pytorch % otool -tv a.out| grep fmla
00000001000025a8 fmla.2d v0, v2, v1              ; Latency: 10
0000000100002700 fmla.2d v2, v4, v0              ; Latency: 10
0000000100002704 fmla.2d v1, v5, v3              ; Latency: 10
0000000100002a78 fmla.2d v3, v5, v1              ; Latency: 10

 

Results

Apple M1 with Accelerate (AMX2) 

(nnc_venv) nodai@macbook-pro pytorch % LD_LIBRARY_PATH=/Users/nodai/lokal/lib/ ./a.out
Eigen is using 8 threads
Starting matrix multiplication test with 1000 matrices
Eigen avg execution time (ms) = 8

Apple M1 with Eigen (NEON)

(nnc_venv) nodai@macbook-pro pytorch % LD_LIBRARY_PATH=/Users/nodai/lokal/lib/ ./a.out
Eigen is using 8 threads
Starting matrix multiplication test with 1000 matrices
Eigen avg execution time (ms) = 20
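To put the two timings above in throughput terms, here is a rough conversion. It assumes the benchmark multiplies two square 1000×1000 matrices and counts 2N³ floating-point operations per multiply; both are assumptions, and `matmul_gflops` is an illustrative name.

```python
def matmul_gflops(n, time_ms):
    """Throughput of an n x n x n matmul, counting 2 * n^3 floating-point operations."""
    flops = 2 * n**3
    return flops / (time_ms * 1e-3) / 1e9


if __name__ == "__main__":
    n = 1000
    accelerate_ms, neon_ms = 8, 20  # averages reported above, in whole milliseconds
    print(matmul_gflops(n, accelerate_ms))  # Accelerate (AMX2)
    print(matmul_gflops(n, neon_ms))        # Eigen (NEON)
    print(neon_ms / accelerate_ms)          # speedup of Accelerate over NEON
```

Under those assumptions the timings correspond to roughly 250 GFLOP/s with Accelerate and 100 GFLOP/s with Eigen’s NEON path. The averages are reported in whole milliseconds, so these figures are coarse.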

CPU utilization: 

For the test matrix size of 1000 there is negligible CPU utilization, so we increased the matrix size until we could see some impact. With AMX2 the CPU utilization delta is negligible, though there is likely memory pressure from DMA into the AMX2. With NEON, however, we can saturate the CPU cores.

 

Matrix Multiplication for Various Matrix Sizes

Configuration

  • Apple Silicon M1
  • compiler: clang version 12.0.0 (/Users/nodai/llvm-project/clang e6ae623314bab3ddd983ed941bf63a6d4c63a1f4)
  • eigen3: fdf2ee62c5174441076fb64c9737d89bbe102759

Single Threaded FP32 Matmul NEON

 

But is this the best NEON optimization possible? Based on this work, done as part of the gemmlowp project, there should be more room. So we ran the “standalone NEON tests” on the M1. This should give us a good idea of the delta between Eigen’s GEBP and a fully hand-optimized NEON kernel by ARM themselves (though tuned for a Cortex-A57 class core). The results are below:

                                         

kernel,Gop/s
NEON_64bit_GEMM_Int425Operands_intrinsics,145.04
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits,105.541
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits_intrinsics,39.3034
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits,81.8715
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics,16.8501
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators,51.2957
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics,54.485
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57,58.1817
NEON_64bit_GEMM_Int32_WithScalar,69.2449
NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar,48.9629
NEON_64bit_GEMM_Float32_WithScalar,69.229
NEON_64bit_GEMM_Float32_WithScalar_intrinsics,32.3552
NEON_64bit_GEMM_Float32_WithScalar_A57,68.6708
NEON_64bit_GEMM_Float32_WithScalar_A55r1,53.2717

 

Based on the above we estimate there is another 20% or so we can squeeze out of the NEON implementation.  An important point to consider is the overhead of using intrinsics vs full inline assembly – you still have to make a choice between portability and performance.
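The intrinsics overhead can be read straight off the table above. The sketch below pairs each kernel with its `_intrinsics` counterpart and prints the throughput ratio. It assumes, as the kernel names suggest, that the variants without the `_intrinsics` suffix are the inline-assembly versions; the helper name `asm_over_intrinsics` is illustrative.

```python
import csv
import io

# Subset of the standalone gemmlowp results listed above (kernel, Gop/s).
RESULTS_CSV = """\
kernel,Gop/s
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits,105.541
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits_intrinsics,39.3034
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits,81.8715
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics,16.8501
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators,51.2957
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics,54.485
NEON_64bit_GEMM_Float32_WithScalar,69.229
NEON_64bit_GEMM_Float32_WithScalar_intrinsics,32.3552
"""

SUFFIX = "_intrinsics"


def asm_over_intrinsics(csv_text):
    """Return {kernel: asm Gop/s / intrinsics Gop/s} for kernels that have both variants."""
    rows = {r["kernel"]: float(r["Gop/s"]) for r in csv.DictReader(io.StringIO(csv_text))}
    return {
        name: rows[name] / rows[name + SUFFIX]
        for name in rows
        if not name.endswith(SUFFIX) and name + SUFFIX in rows
    }


if __name__ == "__main__":
    for kernel, ratio in asm_over_intrinsics(RESULTS_CSV).items():
        print(f"{kernel}: {ratio:.2f}x")
```

For the Float32 kernel the assembly variant comes out roughly 2x ahead of the intrinsics variant on this run, while for the Uint8/Uint32 kernel the intrinsics variant is slightly faster, so the gap depends heavily on the kernel.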

Single Threaded FP32 Matmul AMX2

After the post was initially written we found that there is an environment variable, VECLIB_MAXIMUM_THREADS, that can restrict Accelerate to single-threaded operation. Stay tuned for more controlled studies.

Summary:

This is a first-pass performance test of AMX2 vs NEON on the Apple M1, and it shows AMX2 to be roughly twice as fast as the NEON implementation. There are probably a lot of corner cases to consider and tweaks to be made to the NEON code, especially given the varying L2 cache sizes between the different cores. Based on the gemmlowp work we estimate there is at least another 20% or so left on the table with NEON. Let us know if we missed anything. Right now the system (core clocks, core types, scheduler priorities, etc.) is not in a controlled enough environment to test this thoroughly, but hopefully this gives a first-order approximation of what performance to expect from the fundamental building block, a Matmul, on the Apple M1.

Future Work:

In our next blog post we will build on the MatMul and share some numbers on the Nod Compiler’s codegen capabilities, which automatically generate these GEMM kernels and other common kernels used in machine learning, and compare the performance to native libraries like MKL/MKL-DNN, Accelerate/MLCompute, and cuDNN/cuBLAS on the GPU.

 

Update 1 (12/30): Add first pass ruy numbers and push source code

Thanks to Benoit Jacob from Google, who has worked on Eigen, gemmlowp, TFLite, IREE, etc., we have first-pass ruy numbers on the M1 below. Stay tuned for more detailed comparisons and numbers.

macbook-pro-2 iree-build % THREADS=8 RUY_BENCHMARK_CUBIC=1 NOEXT=1 PATHS=f0 ./build_tools/third_party/ruy/benchmark_f32_f32_f32_f32
size,kNeon:Gop/s,kNeonDotprod:Gop/s
16,32.4,32
24,53.27,52.75
32,65.25,65.17
48,125.5,124
64,45.57,47.24
96,75.85,78.76
128,134.2,134.2
192,202.6,211.1
256,253.5,249.9
384,286.5,287.1
512,359.9,364.6
768,356,354
1024,430.4,424.6
1536,470.1,463.5
2048,470.7,469.2
3072,459,456.4
4096,454.1,454.3

Covered paths: kNeon, kNeonDotprod

 

Source code is now available at: here

11 Comments

  1. Benoit Jacob

    (I’m a contributor to all 3 matrix multiplication libraries mentioned here, and to TFLite). The Eigen and gemmlowp matrix multiplication libraries are not the fastest performing currently available for NN-centric matmul use cases. https://github.com/google/ruy is a newer matrix multiplication library that we have been using by default in TFLite for 1.5 years now, it’s (as far as NN-centric matmul use cases are concerned) a faster successor to Eigen (for float) and gemmlowp (for 8bit). I believe that rerunning these benchmarks with ruy instead of Eigen and gemmlowp would be very useful!

    ruy supports iOS (and is used by default by TFLite there too) and auto-detects ARMv8.2-dotprod instructions (available in iPhone 11 / SE). ruy’s 8-bit ARMv8.2-dotprod matmul kernel achieves 250 Gop/s on a single core (single-threaded) on iPhone 11.

    I’m available to help over email running ruy benchmarks (reproducing the above spreadsheet results), or using ruy in your own benchmarks.
