SHARK

Introducing SHARK: a high-performance PyTorch runtime that is 3X faster than PyTorch/Torchscript, 1.6X faster than TensorFlow+XLA and 23% faster than ONNXRuntime on the NVIDIA A100. All of this is available to deploy in minutes. Whether you are using Docker, Kubernetes or plain old `pip install`, we have an easy-to-deploy SHARK solution for you, on-premise or in the cloud.

SHARK is seamlessly integrated with PyTorch via torch-mlir, though it can work across other machine learning frameworks if required. There is no need to modify your training or inference code, and SHARK can be integrated as part of your larger MLOps workflow. SHARK extends its performance across CPUs, GPUs and accelerators. Sign up here today for early access: it is free to try, and there is nothing to pay if you don't see a performance improvement in your ML deployment.

RESULTS

Benchmarks

We selected the BERT model microsoft/MiniLM-L12-H384-uncased from Hugging Face for our benchmarks, primarily so that we can compare against Hugging Face Infinity, which used a similar model. In this benchmark we lower to SHARK via the mhlo exporter. We chose a batch size of 1 and a sequence length of 128, which is more representative of actual inference workloads. We covered Infinity in a previous blog post here. ONNXRuntime has a set of numerical approximations that bring its latency to 1.9ms for the same workload. We disable these approximations, such as the Fast_GELU approximation, in this benchmark. In a follow-on post we will compare with those fusions enabled, but we will need to measure accuracy too, since the numerical approximations affect the quality of predictions. All the experiments were run on an A2-HIGHGPU-1G Google Cloud VM.

To back up these performance claims, we have open sourced all the benchmarks here. Running the ./run_benchmark.sh script will generate the baseline numbers with the latest nightly builds of PyTorch/Torchscript, ONNXRuntime, TensorFlow/XLA and Google IREE. To generate the numbers for SHARK, you can uninstall the IREE pip packages and build SHARK from here. (Updated March 21, 2022: SHARK should be at 2.09ms avg)
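For readers who want to sanity-check latency numbers of this kind on their own workload, here is a minimal sketch of a timing harness. It is not the code behind ./run_benchmark.sh; the function names, the warmup and iteration counts, and the `dummy_inference` stand-in are all assumptions made for illustration. It also assumes that "N% faster" means the ratio of baseline latency to new latency, minus one.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=10, iters=100):
    """Call fn() repeatedly and return the per-call latencies in milliseconds."""
    # Warmup calls are not timed, so one-time costs (compilation, caches,
    # lazy initialization) do not inflate the reported latency.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        # For GPU workloads, fn() must block until the device has finished
        # (for example by synchronizing); otherwise only launch time is measured.
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def speedup(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def percent_faster(baseline_ms, candidate_ms):
    """Express the same ratio as a percentage (assumed definition)."""
    return (speedup(baseline_ms, candidate_ms) - 1.0) * 100.0


def dummy_inference():
    # Hypothetical stand-in for a real model call at batch size 1,
    # sequence length 128. Replace with your own inference function.
    sum(i * i for i in range(1000))


if __name__ == "__main__":
    samples = measure_latency_ms(dummy_inference)
    print(f"avg {statistics.mean(samples):.3f} ms over {len(samples)} runs")
```

To compare two runtimes, measure each with the same inputs and pass the two average latencies to `speedup` or `percent_faster`.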

Acknowledgements

The Nod.ai team works very closely with the broader LLVM/MLIR community, the Google IREE community and the torch-mlir community, without whom this would not be possible. We would specifically like to acknowledge the IREE team (Ben Vanik, Thomas Raoux, Mahesh Ravishankar, Hanhan Wang, Nicolas Vasilache, Tobias Gysi, Sean Silva, Stella Laurenzo, Yi Zhang) for their continued support and for building a world-class compiler and runtime. SHARK is built on pre-release IREE, downstream Nod enhancements and Nod.ai Codegen Search. The Google IREE team is focused on community-driven development of the core technology and is happy to enable industry partners like Nod.ai to pursue specific, high-value integrations on behalf of their customers.
