We love Hugging Face and use it a lot. It has made NLP models so much easier to use. They recently released an enterprise product, Infinity: an inference solution that ships all the magic software for a hardware deployment in a Docker container. https://huggingface.co/infinity
Performance of ML systems is close to our heart, so our performance team set out to recreate the numbers shown in their launch video (https://www.youtube.com/watch?v=jiftCAhOYQA) and see how far we could get with off-the-shelf tools, without any magic, and how much magic there really is at the time of writing.
First off, the claim of 1ms is disingenuous: the demo video shows 1.7ms on a T4 while the presenter says “see there you go 1ms” 😀 However, you can run the same model on an A100 and get below 1ms, and doing so takes nothing special beyond running stock PyTorch or ONNX.
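Since the difference between 1ms and 1.7ms is the whole argument, it helps to be explicit about how a latency number gets measured. The following is a minimal sketch of a timing harness, not the setup used in the video or in our benchmarks: `measure_latency_ms` and `fake_model_call` are hypothetical names, and `fake_model_call` is a stand-in you would replace with a real PyTorch or ONNX inference call.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(
    fn: Callable[[], object], warmup: int = 10, iters: int = 100
) -> Dict[str, float]:
    """Time repeated calls to fn and return latency statistics in milliseconds."""
    # Warm-up calls are discarded so one-time costs (lazy init, cold caches)
    # do not skew the reported numbers.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the last sample.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": samples[p95_index],
        "min": samples[0],
        "max": samples[-1],
    }


def fake_model_call() -> int:
    # Hypothetical stand-in for a real inference call (assumption, not a real model).
    return sum(i * i for i in range(1000))


if __name__ == "__main__":
    stats = measure_latency_ms(fake_model_call)
    print({name: round(value, 3) for name, value in stats.items()})
```

Reporting the median and p95 alongside the mean makes it harder to round 1.7ms down to 1ms by accident. One caveat if you adapt this for a GPU: CUDA kernels launch asynchronously, so the callable should synchronize with the device (for example with `torch.cuda.synchronize()`) before returning, otherwise the timer stops before the work has finished.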
CPU Benchmark Results
| Seq. Len | TorchScript (FP32) | TorchScript (INT8) | Intel TorchScript (FP32) | Intel TorchScript (INT8) | ONNX (FP32) | ONNX (INT8) |
In summary, we were able to match or beat Infinity's performance on the same hardware (2 vCPU, Intel Cascade Lake). So if you want to save yourself $20k/cpu/yr for a Docker packaging solution, go here for a detailed deep dive on how to get that performance. But much like hf.co itself, the value is in ease of use along with the support, so Infinity may make sense for enterprise customers. Everyone else who can pip install a few packages may not need Infinity.
If you like working on ML systems problems like this, come join us: https://nod.ai/careers/. And if you are a customer looking for efficient deployments of ML models on any hardware, from CPUs and GPUs to accelerators, reach out to us at email@example.com