Posts
All the articles I've posted.
What we learned by accelerating by 5X Hugging Face generative language models
Posted on: February 9, 2022
Two trends are ongoing in the NLP ecosystem: bigger language models and better text generation. Both are NLP game changers (zero-shot, etc.), but they bring their own challenges: how do you perform inference with them? At what co…
4.5 times faster Hugging Face transformer inference by modifying some Python AST
Posted on: December 29, 2021
Recently, the 🤗 Hugging Face team released a commercial product called Infinity that performs inference with very high performance (i.e., very fast compared to a PyTorch + FastAPI deployment). Unfortunately it’s a paid p…
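The speedup in this post comes from rewriting model code at the Python AST level. As a minimal sketch of that general mechanism only (the `slow_op`/`fast_op` names are hypothetical placeholders, not the post's actual transformation):

```python
import ast
import textwrap

# Toy illustration of AST rewriting: parse a function's source,
# swap a call target, recompile, and execute the patched code.
source = textwrap.dedent("""
def forward(x):
    return slow_op(x) + 1
""")

class SwapCall(ast.NodeTransformer):
    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Rename references to `slow_op` into `fast_op`.
        if node.id == "slow_op":
            node.id = "fast_op"
        return node

tree = SwapCall().visit(ast.parse(source))
ast.fix_missing_locations(tree)

namespace = {"fast_op": lambda x: x * 2}  # stand-in "optimized" kernel
exec(compile(tree, filename="<ast>", mode="exec"), namespace)
print(namespace["forward"](3))  # 7
```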
1st ever method to perform *GPU* quantization on most 🤗 HF transformer models: > 2X faster inference!
Posted on: December 10, 2021
Quantization is a technique that significantly accelerates inference by replacing high-precision tensors with a lower-precision representation, in a way that keeps accuracy intact (or close to it). It’s quite common in CPU in…
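For a feel of the underlying idea (a generic illustration, not the post's GPU quantization method), symmetric int8 quantization of a tensor looks roughly like this in PyTorch:

```python
import torch

# Symmetric int8 quantization sketch: map the fp32 value range onto
# [-127, 127], store int8, and dequantize back with the same scale.
x = torch.randn(4, 4)

scale = x.abs().max() / 127
x_int8 = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
x_deq = x_int8.float() * scale  # dequantized fp32 approximation

print("max abs error:", (x - x_deq).abs().max().item())
```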
Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec
Posted on: November 24, 2021
We just launched a new open source Python library that helps optimize Transformer model inference and prepare deployment in production. It’s a follow-up to a proof of concept shared earlier. Scripts have been conve…
Optimization of Hugging Face Transformer models to get Inference < 1 Millisecond Latency + deployment on production ready inference server
Posted on: November 5, 2021
Hi, I just released a project showing how to optimize big NLP models and deploy them on the Nvidia Triton inference server.
Hugging Face Transformer Inference Under 1 Millisecond Latency
Posted on: November 5, 2021
Go to production with Microsoft and Nvidia open source tooling.
Divide Hugging Face Transformers training time by 2 or more with dynamic padding and uniform length batching
Posted on: May 20, 2020
Reducing training time lets you iterate more within a fixed time budget and thus achieve better results.
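Dynamic padding means padding each batch only to its own longest sequence rather than to a fixed model-wide maximum, and uniform length batching groups sequences of similar length together so little padding is needed. A minimal sketch under those assumptions (function and variable names are illustrative, not the post's code):

```python
import torch

def collate_dynamic_padding(batch, pad_token_id=0):
    # Pad each batch only to its own longest sequence.
    max_len = max(len(seq) for seq in batch)
    input_ids = torch.full((len(batch), max_len), pad_token_id, dtype=torch.long)
    attention_mask = torch.zeros(len(batch), max_len, dtype=torch.long)
    for i, seq in enumerate(batch):
        input_ids[i, : len(seq)] = torch.tensor(seq, dtype=torch.long)
        attention_mask[i, : len(seq)] = 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Uniform length batching: sort examples by length first so sequences
# of similar size land in the same batch and padding stays minimal.
examples = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]
batch = collate_dynamic_padding(sorted(examples, key=len)[:2])
print(batch["input_ids"].shape)  # torch.Size([2, 4])
```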
fastrtext — fastText for R, without the papercuts
Posted on: February 15, 2020
An R wrapper around Facebook's fastText library for swift text classification and word vectors.
Pushing open data from inside a legal publisher (2019): two pro bono partnerships in France & Luxembourg
Posted on: January 15, 2020
In 2019 we ran two pro bono partnerships to open up court decisions: one with Etalab (the French government’s open data unit, within DINUM) and the Cour de cassation (France’s supreme court), the other with Luxembourg’s Prosecutor General, focusing on engineering speedups for anonymization and an end-to-end PoC.
NER algo benchmark: spaCy, Flair, m-BERT and camemBERT on anonymizing French commercial legal cases
Posted on: December 10, 2019
Does (model) size matter?