Tag: Model Optimization
Posts tagged with Model Optimization.
NER algo benchmark: spaCy, Flair, m-BERT and camemBERT on anonymizing French commercial legal cases
Posted on: December 10, 2019
Does (model) size matter?
Why we switched from spaCy to Flair to anonymize French case law
Posted on: September 26, 2019
… and why you should always review your options
4.5 times faster Hugging Face transformer inference by modifying some Python AST
Posted on: December 29, 2021
Recently, the 🤗 Hugging Face team released a commercial product called Infinity that delivers very high inference performance (i.e., very fast compared to a PyTorch + FastAPI deployment). Unfortunately, it's a paid p…
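The excerpt stops before explaining the trick itself, so here is only a toy illustration of the general idea of rewriting Python at the AST level; `slow_add` and `fused_add` are made-up names for this sketch, not the post's code.

```python
import ast
import inspect

def slow_add(a, b):
    return a + b

def fused_add(a, b):
    # stand-in for an optimized kernel the rewrite would dispatch to
    print("fused_add called")
    return a + b

class SwapAdd(ast.NodeTransformer):
    """Rewrite `x + y` into `fused_add(x, y)` at the AST level."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            return ast.Call(func=ast.Name(id="fused_add", ctx=ast.Load()),
                            args=[node.left, node.right], keywords=[])
        return node

# parse the function source, transform it, recompile and execute it in a fresh namespace
tree = ast.fix_missing_locations(SwapAdd().visit(ast.parse(inspect.getsource(slow_add))))
namespace = {"fused_add": fused_add}
exec(compile(tree, filename="<ast-rewrite>", mode="exec"), namespace)
print(namespace["slow_add"](1, 2))  # prints "fused_add called" then 3
```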
Optimization of Hugging Face Transformer models to get Inference < 1 Millisecond Latency + deployment on a production-ready inference server
Posted on: November 5, 2021
Hi, I just released a project showing how to optimize big NLP models and deploy them on the Nvidia Triton inference server.
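As context for what "deploy on Triton" means in practice, here is a minimal sketch of querying a model already served by Triton with the official Python HTTP client; the model name, tensor names and shapes are assumptions, not the project's actual configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton inference server.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy tokenized input; name, dtype and shape must match the deployed model config.
input_ids = np.ones((1, 16), dtype=np.int64)
tensor = httpclient.InferInput("input_ids", list(input_ids.shape), "INT64")
tensor.set_data_from_numpy(input_ids)

# "transformer_onnx" and "output" are placeholder names for this sketch.
result = client.infer(model_name="transformer_onnx", inputs=[tensor])
print(result.as_numpy("output"))
```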
1st ever method to perform *GPU* quantization on most 🤗 HF transformer models: > 2X faster inference!
Posted on: December 10, 2021
Quantization is a technique to significantly accelerate inference by replacing high-precision tensors with lower-precision representations in a way that keeps accuracy intact (or close to it). It's quite common in CPU in…
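As a minimal illustration of that idea (symmetric per-tensor int8, not the GPU method the post describes), quantizing a tensor and dequantizing it back shows how little precision is lost.

```python
import torch

x = torch.randn(4, 8)                                  # high-precision values (weights/activations)
scale = x.abs().max() / 127                            # symmetric per-tensor scale
x_int8 = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)   # 8-bit representation
x_deq = x_int8.float() * scale                         # dequantized approximation
print((x - x_deq).abs().max())                         # worst-case error stays around scale/2
```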
Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec
Posted on: November 24, 2021
We just launched a new open source Python library to help optimize Transformer model inference and prepare deployment in production. It's a follow-up to a proof of concept shared earlier. Scripts have been conve…
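To make the excerpt concrete, here is a rough sketch of the kind of pipeline such a library automates: exporting a Hugging Face encoder to ONNX and running it with ONNX Runtime. It illustrates the approach, not the library's actual API; the model name and tensor names are just examples.

```python
import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

inputs = tokenizer("hello world", return_tensors="pt")

# Export the PyTorch model to ONNX with dynamic batch/sequence axes.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {k: v.numpy() for k, v in inputs.items()})
print(outputs[0].shape)
```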
FlashAttention: paper vs. Triton
Posted on: September 6, 2022
A quick note on the loop-order mismatch between the FlashAttention paper and common Triton-style kernels, and why making ownership explicit avoids races on O.
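For readers who want the shape of the argument, here is a NumPy sketch of the Triton-style loop order: each outer iteration owns one block of Q rows and the matching rows of O, so running the outer loop in parallel (one program per Q block) cannot race on O. Block sizes and the code are illustrative, not the post's kernel.

```python
import numpy as np

def flash_attention_blocked(Q, K, V, block_q=16, block_k=16):
    """Outer loop owns a Q block (and its O rows); inner loop streams K/V blocks
    with an online softmax, so each O block is written by exactly one iteration."""
    n, d = Q.shape
    O = np.zeros_like(Q)
    for qs in range(0, n, block_q):                  # one "program" per Q block
        q = Q[qs:qs + block_q]
        m = np.full(q.shape[0], -np.inf)             # running row max
        l = np.zeros(q.shape[0])                     # running softmax denominator
        acc = np.zeros_like(q)                       # unnormalized output accumulator
        for ks in range(0, n, block_k):              # stream over K/V blocks
            s = q @ K[ks:ks + block_k].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            alpha = np.exp(m - m_new)                # rescale previous partial results
            p = np.exp(s - m_new[:, None])
            l = l * alpha + p.sum(axis=1)
            acc = acc * alpha[:, None] + p @ V[ks:ks + block_k]
            m = m_new
        O[qs:qs + block_q] = acc / l[:, None]        # this O block written exactly once
    return O

# Sanity check against the naive softmax(QK^T/sqrt(d)) @ V formula.
Q, K, V = (np.random.randn(64, 32) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_blocked(Q, K, V), ref, atol=1e-6)
```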
Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
Posted on: October 26, 2022
We are releasing, under the Apache 2 license, a library to make PyTorch model inference significantly faster. With 1 line of code we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline…
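For a taste of what an OpenAI Triton kernel looks like, here is a toy vector add in the style of the Triton tutorial; it is not one of the library's fused attention or layer-norm kernels, and it needs a CUDA GPU to run.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)               # number of program instances
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```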
Hugging Face Transformer Inference Under 1 Millisecond Latency
Posted on: November 5, 2021
Go to production with Microsoft and Nvidia open source tooling
What we learned by accelerating Hugging Face generative language models by 5X
Posted on: February 9, 2022
Two trends are ongoing in the NLP ecosystem: bigger language models and better text generation. Both are NLP game changers (zero-shot, etc.), but they bring their own challenges: how do you perform inference with them? At what co…
What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)
Posted on: August 3, 2022
TL;DR: TorchDynamo (a prototype from the PyTorch team) plus an Nvidia backend makes Bert (the tool is model agnostic) inference on PyTorch > 3X faster most of the time (it depends on input shape) by just adding a single lin…
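The post used the standalone TorchDynamo prototype; the same "single line" pattern survives today as torch.compile. The backend name below assumes onnxruntime is installed, so treat this as a sketch of the idea rather than the post's exact code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased").eval()
model = torch.compile(model, backend="onnxrt")   # the single added line

inputs = AutoTokenizer.from_pretrained("bert-base-uncased")("some text", return_tensors="pt")
with torch.inference_mode():
    out = model(**inputs)   # first call triggers graph capture + backend compilation
print(out.last_hidden_state.shape)
```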
What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
Posted on: May 24, 2022
We made autoregressive models like T5 2X faster than 🤗 Hugging Face PyTorch with 3 simple tricks:
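The list of tricks is cut off in this excerpt; purely as context, one standard lever for autoregressive decoding is key/value caching, sketched below. It is not necessarily one of the post's three tricks.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval()

inputs = tokenizer("translate English to French: The cat sleeps.", return_tensors="pt")
with torch.inference_mode():
    # use_cache=True reuses past key/value tensors instead of recomputing them at every step
    out = model.generate(**inputs, max_new_tokens=32, use_cache=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```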
Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
Posted on: February 9, 2023
We are happy to announce support for the OpenAI Whisper model (ASR task) on Kernl. We focused on high-quality transcription in a latency-sensitive scenario, meaning whisper-large-v2 weights and beam search 5 (as recomm…
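A minimal sketch of that setup with plain 🤗 Transformers, using whisper-small and a tiny sample clip so it stays cheap to run; this is the baseline pipeline, not Kernl's optimized path.

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").eval()

# Small public test clip used in many Transformers examples.
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]
features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

with torch.inference_mode():
    ids = model.generate(features, num_beams=5)   # beam search 5, as in the excerpt
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```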