Tag: LLMs
Posts tagged with LLMs.
-
6 ways to use predictive algorithms to improve your case law research
Posted on: May 4, 2016
Michael BENESTY, Head of Research and Development at Lefebvre Sarrut (Dalloz, Francis L
-
The impartiality of some judges undermined by artificial intelligence
Posted on: March 24, 2016
Michael BENESTY, Head of Research and Development at Lefebvre Sarrut (Dalloz, Francis Lefebvre, Éditions
-
NER algo benchmark: spaCy, Flair, m-BERT and camemBERT on anonymizing French commercial legal cases
Posted on: December 10, 2019
Does (model) size matter?
-
Why we switched from spaCy to Flair to anonymize French case law
Posted on: September 26, 2019
… and why you should always review your options
-
Divide Hugging Face Transformers training time by 2 or more with dynamic padding and uniform length batching
Posted on: May 20, 2020
Reducing training time lets you iterate more within a fixed time budget and thus achieve better results.
-
1st ever method to perform *GPU* quantization on most 🤗 HF transformer models: > 2X faster inference!
Posted on: December 10, 2021
Quantization is a technique that significantly accelerates inference by replacing high-precision tensors with lower-precision representations while keeping accuracy intact (or close to it). It’s quite common in CPU in…
-
4.5 times faster Hugging Face transformer inference by modifying some Python AST
Posted on: December 29, 2021
Recently, the 🤗 Hugging Face team released a commercial product called Infinity to perform inference with very high performance (i.e. very fast compared to a PyTorch + FastAPI deployment). Unfortunately it’s a paid p…
-
Hugging Face Transformer Inference Under 1 Millisecond Latency
Posted on: November 5, 2021
Go to production with Microsoft and Nvidia open source tooling
-
Optimization of Hugging Face Transformer models to get Inference < 1 Millisecond Latency + deployment on production ready inference server
Posted on: November 5, 2021
Hi, I just released a project showing how to optimize big NLP models and deploy them on the Nvidia Triton inference server.
-
Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec
Posted on: November 24, 2021
We just launched a new open source Python library to help optimize Transformer model inference and prepare deployment in production. It’s a follow-up to a proof of concept shared earlier. Scripts have been conve…
-
FlashAttention: paper vs. Triton
Posted on: September 6, 2022
A quick note on the loop-order mismatch between the FlashAttention paper and common Triton-style kernels, and why making ownership explicit avoids races on O.
-
Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
Posted on: October 26, 2022
We are releasing, under the Apache 2 license, a library to make PyTorch model inference significantly faster. With 1 line of code we applied the optimizations and made Bert up to 12X faster than Hugging Face baseline…
-
What we learned by accelerating by 5X Hugging Face generative language models
Posted on: February 9, 2022
Two trends are ongoing in the NLP ecosystem: bigger language models and better text generation. Both are NLP game changers (zero shot, etc.) but they bring their own challenges: how to perform inference with them? At what co…
-
What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)
Posted on: August 3, 2022
TL;DR: TorchDynamo (prototype from the PyTorch team) plus the TensorRT (from Nvidia) backend makes Bert (the tool is model agnostic) inference on PyTorch > 3X faster most of the time (it depends on input shape) by just adding a single lin…
-
What we learned by making T5-large 2X faster than PyTorch (and any autoregressive transformer)
Posted on: May 24, 2022
We made autoregressive models like T5-large 2X faster than 🤗 Hugging Face PyTorch with 3 simple tricks:
-
Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2
Posted on: July 20, 2023
The code is available at . Llama, the most widely discussed machine learning model in 2023, has recently received an upgrade with the release of Llama V2. Its new licensing terms have sparked significant excitement…
-
Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
Posted on: February 9, 2023
We are happy to announce support for the OpenAI Whisper model (ASR task) on Kernl. We focused on high quality transcription in a latency sensitive scenario, meaning: whisper-large-v2 weights, beam search 5 (as recomm…