This is a follow-up to our earlier matmul performance benchmarks here and here from last year. We put the newly released Apple M1 Max 16″ MacBook Pro (32 GB, 10-core) through the mmperf benchmarks. mmperf (https://mmperf.org) is an open-source benchmark that lets you compare various hardware and software backends in a controlled setting (single-threaded, across matrix sizes commonly used in deep learning and other workloads).
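The controlled setting above can be sketched as a minimal single-threaded matmul timing loop. This is not mmperf's actual harness (mmperf is C++ and benchmarks each backend's GEMM directly); it is an illustrative NumPy sketch, and the (M, N, K) sizes are example picks, not mmperf's exact list.

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # pin the BLAS backend to one thread, as mmperf does

import time
import numpy as np

def bench_matmul(m, n, k, iters=50):
    """Return single-threaded FP32 matmul throughput in GFLOPS."""
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    a @ b  # warm-up so allocation/first-touch costs are excluded
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = time.perf_counter() - start
    flops = 2.0 * m * n * k * iters  # each output element costs k multiplies + k adds
    return flops / elapsed / 1e9

# Example sizes in the range deep learning workloads use
for m, n, k in [(384, 384, 512), (1024, 1024, 1024)]:
    print(f"{m}x{n}x{k}: {bench_matmul(m, n, k):.1f} GFLOPS")
```

On macOS, NumPy built against Accelerate will route this through the same AMX-backed GEMM discussed below; on other platforms it exercises whatever BLAS NumPy links.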
tl;dr: ~1.4 TFLOPS of FP32 performance at ~8.1 W on the Apple M1 Max
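The headline numbers work out to a striking perf-per-watt figure; the arithmetic is just the two values above divided:

```python
# Back-of-the-envelope efficiency from the headline numbers.
tflops = 1.4   # measured FP32 matmul throughput
watts = 8.1    # approximate package power during the run
gflops_per_watt = tflops * 1e3 / watts
print(f"{gflops_per_watt:.0f} GFLOPS/W")  # prints "173 GFLOPS/W"
```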
In the following graph we plot Google’s RUY matrix multiplication library (https://github.com/google/ruy), which has highly optimized ARM Neon code, against Apple’s Accelerate, which uses custom instructions not available to RUY. In summary, the AMX pipeline outperforms the Neon instructions by roughly 10x–13x. ARMv8.6 adds matmul instructions that should land in next-generation devices, which will let us compare apples to apples against the AMX; for now this just shows the relative performance between the two. As workloads shift from convolutions toward BERT/Transformers for NLP and even ViT/Vision Transformers, fully connected layers and matmuls will start to dominate, and Apple, with its matmul unit, is well suited for them. In a follow-on post we will talk about Nod.ai’s codegen and auto-scheduling to unlock this performance. Stay tuned.
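To see why Transformer workloads are matmul-bound, it helps to count the FLOPs in one encoder layer. The sizes below are illustrative BERT-base-like dimensions (sequence length 128, hidden 768, FFN 3072), chosen for this sketch and not taken from the benchmark runs above:

```python
# Rough FLOP count for one BERT-base-like encoder layer. Every term here
# is a matmul, which is why GEMM throughput dominates Transformer inference.
seq, hidden, ffn = 128, 768, 3072  # illustrative BERT-base-like sizes

def matmul_flops(m, n, k):
    return 2 * m * n * k  # one multiply + one add per output element, k times

qkv      = 3 * matmul_flops(seq, hidden, hidden)  # Q, K, V projections
attn     = 2 * matmul_flops(seq, seq, hidden)     # QK^T and (attention @ V)
proj     = matmul_flops(seq, hidden, hidden)      # attention output projection
ffn_both = 2 * matmul_flops(seq, ffn, hidden)     # FFN up- and down-projection
total = qkv + attn + proj + ffn_both
print(f"~{total / 1e9:.2f} GFLOPs per layer, all of it matmul")  # ~1.86 GFLOPs
```

(LayerNorm, softmax, and bias adds are omitted; they are a small fraction of the total.)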
Compared to Apple M1
Here is a screenshot of the benchmark running; you can see the M1 Max is mostly using just one core.
These benchmarks are all open source and can be reproduced with https://mmperf.org. The command-line power meter is https://github.com/tlkh/asitop. The power numbers are hard to control for in this experiment given it is a laptop, but it was plugged in and not throttled, as you can see above.