We love Hugging Face and use it a lot. It has made NLP models much easier to use. They recently released an enterprise product, an inference solution that packages all the software magic for a hardware deployment into a Docker container: https://huggingface.co/infinity. Performance of ML systems is close […]
Generating code to outperform native MatMul libraries (Accelerate, BLIS, MKL) and measuring it with MMperf
GEMM operations dominate the computation in modern Machine Learning models. Silicon vendors typically provide hand-optimized GEMM libraries such as Apple's Accelerate Framework [1], AMD's BLIS [2], and Intel's MKL [3]. There are also open-source implementations such as OpenBLAS [4], BLIS [5], and RUY [6]. We will demonstrate the performance of Nod.ai's compiler-generated code outperforming […]
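For readers who want to see what these libraries expose, here is a minimal sketch (our illustration, not code from the post) of calling single-precision GEMM through the standard CBLAS interface that Accelerate, BLIS, MKL, and OpenBLAS all implement; whichever backend you link supplies the hand-optimized kernels:

#include <cblas.h>   /* on macOS, <Accelerate/Accelerate.h> also provides this */
#include <stdio.h>

int main(void) {
    enum { M = 2, N = 2, K = 2 };
    float A[M * K] = {1, 2, 3, 4};   /* row-major 2x2 */
    float B[K * N] = {5, 6, 7, 8};
    float C[M * N] = {0};

    /* C = 1.0 * A * B + 0.0 * C; the library chosen at link time
       (Accelerate, BLIS, MKL, OpenBLAS, ...) supplies the kernel. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);   /* 19 22 / 43 50 */
    return 0;
}

Linking with -framework Accelerate on macOS, or -lopenblas / -lblis / -lmkl_rt elsewhere, switches backends without changing the code, which is what makes an apples-to-apples comparison like MMperf possible.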
Survey of Bilinear Algorithms for Fast Matrix Multiplication – Part 1
Matrix Multiplication forms the foundation of Machine Learning. In this write-up we survey bilinear matrix multiplication algorithms, the most common class of algorithms that outperform the naive O(n^3) implementation.
Short Form / Easy Read:
Long Form:
Though most of the time the […]
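As a concrete example of the bilinear family (a standard derivation, not taken from the survey itself): Strassen's algorithm multiplies 2x2 block matrices using 7 block multiplications instead of the naive 8, and the resulting recurrence gives a sub-cubic exponent:

\[
T(n) = 7\,T(n/2) + O(n^2) \;\Longrightarrow\; T(n) = O\!\left(n^{\log_2 7}\right) \approx O\!\left(n^{2.807}\right),
\]

versus the naive recurrence \(T(n) = 8\,T(n/2) + O(n^2) = O(n^3)\).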
Comparing Apple's M1 matmul performance – AMX2 vs NEON
Matrix Multiply forms the foundation of Machine Learning computations. We show Apple's M1 custom AMX2 Matrix Multiply unit can outperform ARMv8.6's standard NEON instructions by about 2X. Nod's AI Compiler team focuses on state-of-the-art code generation, async partitioning, optimizations, and scheduling to overlap communication and compute on […]
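AMX2 has no public instruction set and is reached through the Accelerate framework, so the NEON side of a comparison like this is typically a hand-written FMA microkernel. Below is a minimal 4x4 NEON microkernel sketch in the usual GEMM style (our illustration, assuming pre-packed A and B panels; not Nod's actual generated code):

#include <arm_neon.h>

/* Accumulate C(0..3, 0..3) += A(0..3, :) * B(:, 0..3) over k steps.
   a: packed A panel, column p stored as 4 consecutive floats
   b: packed B panel, row p stored as 4 consecutive floats
   c: row-major 4x4 output tile with leading dimension ldc        */
void matmul_4x4_neon(const float *a, const float *b,
                     float *c, int ldc, int k) {
    float32x4_t c0 = vld1q_f32(c + 0 * ldc);
    float32x4_t c1 = vld1q_f32(c + 1 * ldc);
    float32x4_t c2 = vld1q_f32(c + 2 * ldc);
    float32x4_t c3 = vld1q_f32(c + 3 * ldc);
    for (int p = 0; p < k; ++p) {
        float32x4_t av = vld1q_f32(a + 4 * p);   /* A(0..3, p) */
        float32x4_t bv = vld1q_f32(b + 4 * p);   /* B(p, 0..3) */
        /* row i of C += A(i, p) * B(p, :), one fused multiply-add per row */
        c0 = vfmaq_laneq_f32(c0, bv, av, 0);
        c1 = vfmaq_laneq_f32(c1, bv, av, 1);
        c2 = vfmaq_laneq_f32(c2, bv, av, 2);
        c3 = vfmaq_laneq_f32(c3, bv, av, 3);
    }
    vst1q_f32(c + 0 * ldc, c0);
    vst1q_f32(c + 1 * ldc, c1);
    vst1q_f32(c + 2 * ldc, c2);
    vst1q_f32(c + 3 * ldc, c3);
}

The AMX2 side of such a measurement goes through Accelerate's BLAS entry points instead, since the unit is not directly programmable with documented instructions.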
Save 35% to 3X on your ML model training costs with Nod Runtime
A.I. training doesn't have to be expensive. Nod is opening up a limited early-access program for select customers to deploy the industry's most efficient and cost-effective ML distributed training runtime. There is no cost to you if we don't show a cost and/or efficiency improvement. Nod Runtime uses […]
Optimized state-of-the-art monocular depth estimation on Xilinx DPUs
Nod's Runtime optimizations extend all the way to devices that consume a milliwatt for their CNN inferencing. Nod recently showed off its state-of-the-art Monocular Depth Estimation on the Xilinx DPU using Vitis-AI. Nod was able to optimize the model before targeting the Xilinx DPU to be able […]
Nod.AI's Neural Perception stack
Computer Vision and Neural Perception have been disrupted by Machine Learning. Nod has optimized state-of-the-art Computer Vision models deployed on very low-power devices. Check out our post on LinkedIn.