PyTorch 2.0 brings exciting new technologies such as TorchDynamo, focused on machine learning model capture in Python. The nod.ai team, along with other torch-mlir community members, has been adding support for TorchDynamo in Torch-MLIR over the past few months. We are now proud to have started shipping Torch-MLIR […]
Experience Stable Diffusion on AMD RDNA™ 3 Architecture
At CES 2023, we are showing our Stable Diffusion demonstration on the Radeon™ RX 7900 XTX in the AMD booth. Come check it out at The Venetian – Titian 2304. Generative AI has taken the world by storm, but until now it took a while to generate an image from a […]
High Performance Codegen for CPUs, GPUs and Accelerators with MLIR
Our CTO Harsh Menon presented an in-depth walkthrough of MLIR codegen at Hot Chips 2022.
Unleashing the power of 3rd Gen Intel Xeon Scalable Processors (8375C) with SHARK – the fastest ML runtime
25% faster than ONNXRuntime, 35% faster than PyTorch for BERT. The vast majority of AI inference is done on CPUs, and the Intel Xeon Scalable Processor family is by far the most widely deployed CPU in data centers today. In this post we will demonstrate nod.ai SHARK running on 3rd […]
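The excerpt reports relative speedups such as "25% faster than ONNXRuntime". As a rough, generic illustration of how such latency comparisons are typically measured (this is a standard-library sketch with stand-in workloads, not the actual SHARK/ONNXRuntime harness):

```python
import statistics
import timeit

def median_latency_ms(fn, warmup=5, iters=50):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches / JITs before timing
    samples = timeit.repeat(fn, number=1, repeat=iters)
    return statistics.median(samples) * 1e3

# Stand-in workloads; a real comparison would invoke each runtime's
# BERT inference call here.
ours = median_latency_ms(lambda: sum(i * i for i in range(5_000)))
baseline = median_latency_ms(lambda: sum(i * i for i in range(10_000)))

# "X% faster" as used in headlines like this typically means the
# baseline's latency exceeds ours by X%.
pct_faster = (baseline - ours) / ours * 100
```

Taking the median over many repeats, after a warm-up phase, reduces the impact of scheduler noise and one-off initialization costs on the reported number.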
SHARK “Summer” Release
My spring internship – torch-mlir eager mode, OPT and blowing away the main git repo
My name is Maksim Levental and I’m a PhD student in CS at UChicago. This spring I worked as a compiler intern at Nod Labs. The project was a collaboration with Google’s IREE and torch-mlir projects and involved implementing a way to use PyTorch as an eager frontend to the […]
PyTorch on Apple M1 Max GPUs with SHARK – 2X faster than TensorFlow-Metal
SHARK is a portable, high-performance machine learning runtime for PyTorch. In this blog we demonstrate PyTorch training and inference on the Apple M1 Max GPU with SHARK, with only a few lines of additional code, outperforming Apple’s tensorflow-metal plugin. Though Apple has released GPU support for TensorFlow via the […]
SHARK: The fastest PyTorch runtime – 3x over TorchScript, 1.6x over TF/XLA, 76% faster than ONNXRuntime
Introducing SHARK – a high-performance PyTorch runtime that is 3X faster than PyTorch/TorchScript, 1.6X faster than TensorFlow+XLA, and 76% faster than ONNXRuntime on the NVIDIA A100. All of this is available to deploy seamlessly in minutes, whether you are using Docker, Kubernetes or plain old `pip […]
Outperforming Intel’s MKL and OctoML/Apache TVM with MLIR and Nod.ai’s Codegen Search
In this blog post we will evaluate the codegen performance of OctoML/Apache TVM and MLIR with Nod.ai’s learning-based codegen search. We will also compare against Intel MKL, which, being hand-written and vendor-optimized, is usually the fastest. As discussed […]
Apple M1 Max Matmul Performance ~1.4 TFLOPS FP32 (vs 0.7 TFLOPS on Apple M1)
This is a follow-up to our earlier matmul performance benchmarks here and here from last year. We put the newly released Apple M1 Max 16″ MacBook Pro (32GB, 10-core) laptop through the mmperf benchmarks. mmperf is an open-source benchmark suite (https://mmperf.org) that allows you to benchmark various hardware […]
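The excerpt cites achieved FLOPS figures such as ~1.4 TFLOPS FP32. As a rough illustration of how such a number is derived (a generic NumPy sketch, not the mmperf harness itself; a dense matmul costs 2·m·n·k floating-point operations):

```python
import time
import numpy as np

def matmul_tflops(m, n, k, iters=10, dtype=np.float32):
    """Time an (m, k) @ (k, n) matmul and report achieved TFLOPS."""
    a = np.random.rand(m, k).astype(dtype)
    b = np.random.rand(k, n).astype(dtype)
    a @ b  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    seconds = (time.perf_counter() - start) / iters
    # A dense m×n×k matmul performs 2*m*n*k FLOPs (one multiply and
    # one add per inner-product term).
    return (2 * m * n * k) / seconds / 1e12

print(f"{matmul_tflops(1024, 1024, 1024):.3f} TFLOPS")
```

Benchmarks like mmperf sweep many matrix shapes this way, since achieved FLOPS varies widely with problem size and how well a given shape maps onto the hardware.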