FlashAttention: paper vs. Triton

Posted on: September 6, 2022

While reimplementing FlashAttention in OpenAI Triton, a mismatch surfaced between the paper’s loop order and the Triton-style kernels many people used at the time. The short version: Algorithm 1 in the paper iterates over K/V blocks in the outer loop and Q blocks in the inner loop, while the common Triton kernel inverts this, parallelizing over Q blocks and streaming K/V blocks in the inner loop.

This is the source of a “paper‑vs‑code” disagreement: the Triton pattern is not a line‑for‑line translation of Algorithm 1, but an equivalent schedule with a different concurrency story. (GitHub)

Takeaways. Pick a schedule and make ownership explicit: either follow the paper’s loop order and don’t parallelize over K/V without care, or flip the loops and keep the Q tile’s state on‑chip with a single O store. Same math, fewer hazards. (arXiv, Triton Language)
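To make the "flipped" schedule concrete, here is a minimal NumPy sketch (not the actual Triton kernel) of the Q-outer / K/V-inner loop order with online softmax. Tile sizes `Bq` and `Bk` and the helper names are illustrative assumptions; the point is that each Q tile owns its running max, running sum, and output accumulator, and `O` is stored exactly once per tile.

```python
import numpy as np

def attention_reference(Q, K, V):
    # Plain softmax attention, used only to check the tiled version.
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attention_tiled(Q, K, V, Bq=16, Bk=16):
    # Outer loop over Q tiles: each tile keeps m (running row max),
    # l (running softmax denominator), and acc (unnormalized output)
    # "on-chip" for the whole K/V sweep, then writes O once.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.empty((N, d), dtype=np.float64)
    for qs in range(0, N, Bq):
        q = Q[qs:qs + Bq] * scale
        m = np.full(q.shape[0], -np.inf)    # running row max
        l = np.zeros(q.shape[0])            # running sum of exp
        acc = np.zeros((q.shape[0], d))     # unnormalized O accumulator
        for ks in range(0, N, Bk):          # inner loop streams K/V tiles
            s = q @ K[ks:ks + Bk].T
            m_new = np.maximum(m, s.max(axis=-1))
            alpha = np.exp(m - m_new)       # rescale factor for old state
            p = np.exp(s - m_new[:, None])
            l = l * alpha + p.sum(axis=-1)
            acc = acc * alpha[:, None] + p @ V[ks:ks + Bk]
            m = m_new
        O[qs:qs + Bq] = acc / l[:, None]    # single store per Q tile
    return O
```

Because no two Q tiles ever touch the same rows of `O`, ownership is explicit and the outer loop can be parallelized freely, which is exactly the concurrency story that distinguishes this schedule from a K/V-outer translation of Algorithm 1.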