Posts
All the articles I've posted.
How a Side Project Became a Crime (in France)
Posted on: August 27, 2024. How open-data work became criminally risky in France under Article 33; context, what is banned, and lessons learned.
Upstreamed: Kernl’s Triton “debugger” lands in OpenAI Triton
Posted on: October 12, 2023. In May 2023 we upstreamed our Python-level interpreter/debugger for Triton kernels to the OpenAI Triton project; here’s what it is, how to use it, and where it helps.
Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2
Posted on: July 20, 2023. Llama, the most widely discussed machine learning model of 2023, recently received an upgrade with the release of Llama V2. Its new licensing terms have sparked significant excitement…
Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
Posted on: February 9, 2023. We are happy to announce support for the OpenAI Whisper model (ASR task) on Kernl. We focused on high-quality transcription in a latency-sensitive scenario, meaning whisper-large-v2 weights, beam search 5 (as recomm…
OpenAI cites Kernl in Triton slides
Posted on: December 12, 2022. A short note to say how pleasant it is to see our work on Kernl cited in an OpenAI Research Acceleration Team slide deck about Triton. Thank you to the team for the nod and for building such an empowering tool.
Meeting Michael Lightstone, VP of AI Computing at NVIDIA
Posted on: November 16, 2022. A short, amused note on Kernl’s unexpected visibility and a chat with NVIDIA’s Michael Lightstone (yes, from a legal publisher).
Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
Posted on: October 26, 2022. We are releasing **Kernl** under the Apache 2 license, a library that makes PyTorch model inference significantly faster. With 1 line of code we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline…
FlashAttention: paper vs. Triton
Posted on: September 6, 2022. A quick note on the loop-order mismatch between the FlashAttention paper and common Triton-style kernels, and why making ownership explicit avoids races on O.
What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)
Posted on: August 3, 2022. TL;DR: TorchDynamo (a prototype from the PyTorch team) plus the TensorRT backend (from Nvidia) makes Bert inference (the tool is model agnostic) on PyTorch more than 3X faster most of the time (it depends on input shape) by just adding a single lin…
What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
Posted on: May 24, 2022. We made autoregressive models like T5 2X faster than 🤗 Hugging Face PyTorch with 3 simple tricks…