My name is Maksim Levental and I’m a PhD student in CS at UChicago. This Spring I worked as a compiler intern at Nod Labs. The project was in collaboration with Google’s IREE and torch-mlir projects and involved implementing a way to use PyTorch as an eager frontend to the […]
PyTorch on Apple M1 Max GPUs with SHARK – 2X faster than TensorFlow-Metal
SHARK is a portable, high-performance machine learning runtime for PyTorch. In this blog we demonstrate PyTorch training and inference on the Apple M1 Max GPU with SHARK, with only a few lines of additional code, outperforming Apple's TensorFlow-Metal plugin. Though Apple has released GPU support for TensorFlow via the […]
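As a rough illustration of what those few lines can look like, here is a sketch that follows the usage pattern from SHARK's public README; the import path, the `SharkInference` constructor arguments, and the `"metal"` device string are assumptions rather than a verified API:

```python
# Sketch only: the shark import path, the SharkInference signature, and
# the "metal" device string are assumptions based on SHARK's README,
# not a pinned, verified API.
import torch
import torchvision

from shark.shark_inference import SharkInference  # assumed import

model = torchvision.models.resnet50().eval()
example_input = torch.randn(1, 3, 224, 224)

# Compile the model for the M1 Max GPU and run one inference through SHARK.
shark_module = SharkInference(model, (example_input,), device="metal")
shark_module.compile()
print(shark_module.forward((example_input,)))
```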
SHARK: The fastest PyTorch runtime – 3x over TorchScript, 1.6x over TF/XLA, 43% faster than ONNXRuntime
Introducing SHARK – a high-performance PyTorch runtime that is 3X faster than PyTorch/TorchScript, 1.6X faster than TensorFlow+XLA, and 43% faster than ONNXRuntime on the Nvidia A100. All of this is available to deploy seamlessly in minutes, whether you are using Docker, Kubernetes, or plain old `pip […]
Outperforming Intel's MKL and OctoML/Apache TVM with MLIR and Nod.ai's Codegen Search
In this blog post we evaluate the codegen performance of OctoML/Apache TVM and MLIR with Nod.ai's learning-based codegen search. We also compare against Intel's MKL since, as the vendor-supplied, hand-written, and hand-optimized library, it is usually the fastest. As discussed […]
Apple M1 Max Matmul Performance ~1.4 TFLOPS FP32 (vs 0.7 TFLOPS on Apple M1)
This is a follow-up to our earlier matmul performance benchmarks here and here from last year. We put the newly released Apple M1 Max 16″ MacBook Pro (32GB, 10-core) through the mmperf benchmarks. mmperf is an open-source benchmark suite (https://mmperf.org) that allows you to benchmark various hardware […]
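For context on what mmperf reports: a matmul of an m×k matrix by a k×n matrix performs 2·m·n·k floating-point operations, and dividing that count by wall-clock time gives FLOP/s. A minimal standalone sketch of that arithmetic (using NumPy, not mmperf's actual harness):

```python
import time

import numpy as np


def matmul_tflops(m=2048, n=2048, k=2048, iters=10):
    """Time an (m x k) @ (k x n) single-precision matmul and report TFLOP/s.

    A GEMM performs 2*m*n*k floating-point operations: one multiply and
    one add per inner-product term.
    """
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    a @ b  # warm-up so one-time setup cost is excluded from the timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = (time.perf_counter() - start) / iters
    return 2 * m * n * k / elapsed / 1e12


print(f"{matmul_tflops():.2f} TFLOP/s")
```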
torch-mlir: Bridging PyTorch and LLVM/MLIR ecosystems
We presented the torch-mlir project today at the LLVM/MLIR Open Design Meeting, with more than 125 attendees from industry. This is an important piece of the next-generation AI software stack, bridging the ubiquity of the PyTorch ecosystem to the LLVM/MLIR ecosystem to unlock building performant, reusable and […]
Analysis of the Huggingface Infinity Inference Engine
We love Huggingface and use it a lot; it really has made NLP models so much easier to use. They recently released an enterprise product: an inference solution that packages all the magic software for a hardware deployment in a Docker container (https://huggingface.co/infinity). Performance of ML systems is close […]
Generating code to outperform native MatMul libraries (Accelerate, BLIS, MKL) and measuring it with MMperf
GEMM operations dominate the computation in modern machine learning models. Silicon vendors typically provide hand-optimized GEMM libraries such as Apple's Accelerate framework [1], AMD's BLIS [2], and Intel's MKL [3]; there are also open-source implementations like OpenBLAS [4], BLIS [5], and RUY [6]. We will demonstrate the performance of Nod.ai's compiler-generated code outperforming […]
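To see why vendors invest in hand-optimized GEMMs at all, it helps to compare a textbook triple-loop matmul against the BLAS-backed kernel NumPy dispatches to. This toy comparison is our illustration, not a result from the post:

```python
import time

import numpy as np


def naive_matmul(a, b):
    """Textbook triple-loop GEMM: no tiling, blocking, or vectorization."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += a[i, p] * b[p, j]
            c[i, j] = s
    return c


n = 128
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter(); c_naive = naive_matmul(a, b); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); c_blas = a @ b; t_blas = time.perf_counter() - t0

# Both paths compute the same product, but the tuned kernel is orders
# of magnitude faster even at this small size.
assert np.allclose(c_naive, c_blas, atol=1e-3)
print(f"naive: {t_naive:.4f}s  BLAS-backed: {t_blas:.6f}s  speedup: {t_naive / t_blas:.0f}x")
```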
Survey of Bilinear Algorithms for Fast Matrix Multiplication – Part 1
Matrix multiplication forms the foundation of machine learning. In this write-up we survey bilinear matrix multiplication algorithms, the most common family of algorithms that perform better than naive O(n^3) implementations. Though most of the time the […]
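As a concrete example of the bilinear family (our illustration, not necessarily one the survey covers in this part), Strassen's algorithm replaces the eight block multiplications of the naive blocked recursion with seven, giving O(n^log2(7)) ≈ O(n^2.81):

```python
import numpy as np


def strassen(a, b, leaf=64):
    """Strassen's 7-multiplication bilinear recursion.

    Assumes square matrices whose size is a power of two; falls back to
    the library kernel below the leaf size, where the O(n^2) additions
    would otherwise dominate.
    """
    n = a.shape[0]
    if n <= leaf:
        return a @ b
    h = n // 2
    a11, a12, a21, a22 = a[:h, :h], a[:h, h:], a[h:, :h], a[h:, h:]
    b11, b12, b21, b22 = b[:h, :h], b[:h, h:], b[h:, :h], b[h:, h:]
    m1 = strassen(a11 + a22, b11 + b22, leaf)
    m2 = strassen(a21 + a22, b11, leaf)
    m3 = strassen(a11, b12 - b22, leaf)
    m4 = strassen(a22, b21 - b11, leaf)
    m5 = strassen(a11 + a12, b22, leaf)
    m6 = strassen(a21 - a11, b11 + b12, leaf)
    m7 = strassen(a12 - a22, b21 + b22, leaf)
    top = np.hstack([m1 + m4 - m5 + m7, m3 + m5])
    bottom = np.hstack([m2 + m4, m1 - m2 + m3 + m6])
    return np.vstack([top, bottom])


x = np.random.rand(256, 256)
y = np.random.rand(256, 256)
assert np.allclose(strassen(x, y), x @ y)
```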
Comparing Apple's M1 matmul performance – AMX2 vs NEON
Matrix multiply forms the foundation of machine learning computations. We show that Apple's M1 custom AMX2 matrix multiply unit can outperform ARMv8.6's standard NEON instructions by about 2X. Nod's AI compiler team focuses on state-of-the-art code generation, async partitioning, optimizations, and scheduling to overlap communication and compute on […]