Tag: Model Optimization
Posts tagged with Model Optimization.
NER algo benchmark: spaCy, Flair, m-BERT and camemBERT on anonymizing French commercial legal cases
Posted on: December 10, 2019
Does (model) size matter?
Why we switched from spaCy to Flair to anonymize French case law
Posted on: September 26, 2019
… and why you should always review your options
4.5 times faster Hugging Face transformer inference by modifying some Python AST
Posted on: December 29, 2021
Recently, the 🤗 Hugging Face team released a commercial product called Infinity that delivers very high inference performance (i.e., very fast compared to a PyTorch + FastAPI deployment). Unfortunately, it's a paid p…
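The excerpt stops before explaining the trick itself, so here is only a toy illustration of the general idea of rewriting Python at the AST level; `slow_add` and `fused_add` are made-up names for this sketch, not the post's code.

```python
import ast
import inspect

def slow_add(a, b):
    return a + b

def fused_add(a, b):
    # stand-in for an optimized kernel the rewrite would dispatch to
    print("fused_add called")
    return a + b

class SwapAdd(ast.NodeTransformer):
    """Rewrite `x + y` into `fused_add(x, y)` at the AST level."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            return ast.Call(func=ast.Name(id="fused_add", ctx=ast.Load()),
                            args=[node.left, node.right], keywords=[])
        return node

# parse the function source, transform it, recompile and execute it in a fresh namespace
tree = ast.fix_missing_locations(SwapAdd().visit(ast.parse(inspect.getsource(slow_add))))
namespace = {"fused_add": fused_add}
exec(compile(tree, filename="<ast-rewrite>", mode="exec"), namespace)
print(namespace["slow_add"](1, 2))  # prints "fused_add called" then 3
```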
Optimization of Hugging Face Transformer models to get Inference < 1 Millisecond Latency + deployment on a production-ready inference server
Posted on: November 5, 2021
Hi, I just released a project showing how to optimize big NLP models and deploy them on the Nvidia Triton inference server.
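As context for what "deploy on Triton" means in practice, here is a minimal sketch of querying a model already served by Triton with the official Python HTTP client; the model name, tensor names and shapes are assumptions, not the project's actual configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton inference server.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy tokenized input; name, dtype and shape must match the deployed model config.
input_ids = np.ones((1, 16), dtype=np.int64)
tensor = httpclient.InferInput("input_ids", list(input_ids.shape), "INT64")
tensor.set_data_from_numpy(input_ids)

# "transformer_onnx" and "output" are placeholder names for this sketch.
result = client.infer(model_name="transformer_onnx", inputs=[tensor])
print(result.as_numpy("output"))
```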
1st ever method to perform *GPU* quantization on most 🤗 HF transformer models: > 2X faster inference!
Posted on: December 10, 2021
Quantization is a technique to significantly accelerate inference by replacing high-precision tensors with lower-precision representations in a way that keeps accuracy intact (or close to it). It's quite common in CPU in…
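As a minimal illustration of that idea (symmetric per-tensor int8, not the GPU method the post describes), quantizing a tensor and dequantizing it back shows how little precision is lost.

```python
import torch

x = torch.randn(4, 8)                                  # high-precision values (weights/activations)
scale = x.abs().max() / 127                            # symmetric per-tensor scale
x_int8 = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)   # 8-bit representation
x_deq = x_int8.float() * scale                         # dequantized approximation
print((x - x_deq).abs().max())                         # worst-case error stays around scale/2
```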
Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec
Posted on: November 24, 2021
We just launched a new open source Python library to help optimize Transformer model inference and prepare deployment in production. It's a follow-up to a proof of concept shared earlier. Scripts have been conve…
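To make the excerpt concrete, here is a rough sketch of the kind of pipeline such a library automates: exporting a Hugging Face encoder to ONNX and running it with ONNX Runtime. It illustrates the approach, not the library's actual API; the model name and tensor names are just examples.

```python
import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

inputs = tokenizer("hello world", return_tensors="pt")

# Export the PyTorch model to ONNX with dynamic batch/sequence axes.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {k: v.numpy() for k, v in inputs.items()})
print(outputs[0].shape)
```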
FlashAttention: paper vs. Triton
Posted on: September 6, 2022
A quick note on the loop-order mismatch between the FlashAttention paper and common Triton-style kernels, and why making ownership explicit avoids races on O.
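For readers who want the shape of the argument, here is a NumPy sketch of the Triton-style loop order: each outer iteration owns one block of Q rows and the matching rows of O, so running the outer loop in parallel (one program per Q block) cannot race on O. Block sizes and the code are illustrative, not the post's kernel.

```python
import numpy as np

def flash_attention_blocked(Q, K, V, block_q=16, block_k=16):
    """Outer loop owns a Q block (and its O rows); inner loop streams K/V blocks
    with an online softmax, so each O block is written by exactly one iteration."""
    n, d = Q.shape
    O = np.zeros_like(Q)
    for qs in range(0, n, block_q):                  # one "program" per Q block
        q = Q[qs:qs + block_q]
        m = np.full(q.shape[0], -np.inf)             # running row max
        l = np.zeros(q.shape[0])                     # running softmax denominator
        acc = np.zeros_like(q)                       # unnormalized output accumulator
        for ks in range(0, n, block_k):              # stream over K/V blocks
            s = q @ K[ks:ks + block_k].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            alpha = np.exp(m - m_new)                # rescale previous partial results
            p = np.exp(s - m_new[:, None])
            l = l * alpha + p.sum(axis=1)
            acc = acc * alpha[:, None] + p @ V[ks:ks + block_k]
            m = m_new
        O[qs:qs + block_q] = acc / l[:, None]        # this O block written exactly once
    return O

# Sanity check against the naive softmax(QK^T/sqrt(d)) @ V formula.
Q, K, V = (np.random.randn(64, 32) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_blocked(Q, K, V), ref, atol=1e-6)
```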
Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
Posted on: October 26, 2022
We are releasing, under the Apache 2 license, a library to make PyTorch model inference significantly faster. With 1 line of code we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline…
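For a taste of what an OpenAI Triton kernel looks like, here is a toy vector add in the style of the Triton tutorial; it is not one of the library's fused attention or layer-norm kernels, and it needs a CUDA GPU to run.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)               # number of program instances
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```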
Hugging Face Transformer Inference Under 1 Millisecond Latency
Posted on: November 5, 2021
Go to production with Microsoft and Nvidia open source tooling
What we learned by accelerating Hugging Face generative language models by 5X
Posted on: February 9, 2022
Two trends are ongoing in the NLP ecosystem: bigger language models and better text generation. Both are NLP game changers (zero-shot, etc.), but they bring their own challenges: how do you perform inference with them? At what co…
What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)
Posted on: August 3, 2022
TL;DR: TorchDynamo (a prototype from the PyTorch team) plus an Nvidia backend makes Bert (the tool is model agnostic) inference on PyTorch > 3X faster most of the time (it depends on input shape) by just adding a single lin…
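The post used the standalone TorchDynamo prototype; the same "single line" pattern survives today as torch.compile. The backend name below assumes onnxruntime is installed, so treat this as a sketch of the idea rather than the post's exact code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased").eval()
model = torch.compile(model, backend="onnxrt")   # the single added line

inputs = AutoTokenizer.from_pretrained("bert-base-uncased")("some text", return_tensors="pt")
with torch.inference_mode():
    out = model(**inputs)   # first call triggers graph capture + backend compilation
print(out.last_hidden_state.shape)
```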
What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
Posted on: May 24, 2022
We made autoregressive models like T5 2X faster than 🤗 Hugging Face PyTorch with 3 simple tricks:
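The list of tricks is cut off in this excerpt; purely as context, one standard lever for autoregressive decoding is key/value caching, sketched below. It is not necessarily one of the post's three tricks.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval()

inputs = tokenizer("translate English to French: The cat sleeps.", return_tensors="pt")
with torch.inference_mode():
    # use_cache=True reuses past key/value tensors instead of recomputing them at every step
    out = model.generate(**inputs, max_new_tokens=32, use_cache=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```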
Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
Posted on: February 9, 2023
We are happy to announce support for the OpenAI Whisper model (ASR task) on Kernl. We focused on high-quality transcription in a latency-sensitive scenario, meaning whisper-large-v2 weights and beam search 5 (as recomm…
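A minimal sketch of that setup with plain 🤗 Transformers, using whisper-small and a tiny sample clip so it stays cheap to run; this is the baseline pipeline, not Kernl's optimized path.

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").eval()

# Small public test clip used in many Transformers examples.
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]
features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

with torch.inference_mode():
    ids = model.generate(features, num_beams=5)   # beam search 5, as in the excerpt
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```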