GEMM operations dominate the computation in modern Machine Learning models. Silicon vendors typically provide hand-optimized GEMM libraries such as Apple’s Accelerate framework [1], AMD’s BLIS [2], and Intel’s MKL [3]. There are also open-source implementations such as OpenBLAS [4], BLIS [5], and RUY [6]. We will demonstrate Nod.ai’s compiler-generated code outperforming […]
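The baseline these hand-optimized libraries are measured against is the textbook triple loop. A minimal pure-Python sketch (illustrative only, not from the post):

```python
def naive_gemm(A, B):
    # Textbook O(n^3) GEMM: C[i][j] = sum over p of A[i][p] * B[p][j].
    # Optimized libraries beat this via tiling, SIMD, and cache blocking.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

# naive_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# -> [[19.0, 22.0], [43.0, 50.0]]
```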
A.I Compiler Technologies
Interesting information about A.I Compilers
Survey of Bilinear Algorithms for Fast Matrix Multiplication – Part 1
Matrix multiplication forms the foundation of Machine Learning. In this write-up we survey bilinear matrix multiplication algorithms, the most common family of algorithms that outperform the naive O(n^3) implementation. Short Form / Easy Read: Long Form: Though most of the time the […]
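The best-known member of the bilinear family is Strassen’s algorithm, which replaces the 8 block multiplies of a 2x2 partition with 7, giving O(n^log2(7)) ≈ O(n^2.81). A minimal pure-Python sketch (assumes square matrices with power-of-two size; illustrative, not from the survey):

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Partition each matrix into four h x h blocks.
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # Strassen's 7 bilinear products (instead of 8 block multiplies).
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the products into the four quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

In practice Strassen-style algorithms pay for the fewer multiplies with extra additions and worse numerical behavior, which is why tuned libraries apply them only above a crossover size, if at all.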
Comparing Apple’s M1 matmul performance – AMX2 vs NEON
Matrix multiply forms the foundation of Machine Learning computations. We show that Apple’s M1 custom AMX2 matrix-multiply unit can outperform ARMv8.6’s standard NEON instructions by about 2X. Nod’s AI Compiler team focuses on state-of-the-art code generation, async partitioning, optimizations, and scheduling to overlap communication and compute on […]
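Throughput figures like the 2X claim above are typically derived by timing a GEMM and converting the runtime into GFLOP/s. A small helper (hypothetical, not from the post) shows the standard flop-count arithmetic:

```python
def gemm_gflops(m, n, k, seconds):
    # A GEMM of shapes (m x k) times (k x n) performs 2*m*n*k
    # floating-point operations (one multiply and one add per term).
    return 2.0 * m * n * k / seconds / 1e9

# e.g. a 1024x1024x1024 GEMM finishing in 2 ms sustains
# gemm_gflops(1024, 1024, 1024, 0.002) ≈ 1073.7 GFLOP/s.
```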