Reimplementing FlashAttention in OpenAI Triton surfaced a mismatch between the paper’s loop order and the Triton-style kernels many people used at the time. The short version:
- Paper (Algorithm 1): stream over K/V tiles; for each tile, load `O_i`, update it, write it back, then move on. If you parallelize that outer K/V loop across blocks, multiple blocks can touch the same `O_i` → potential races unless you serialize or add synchronization. (arXiv)
- Triton-style kernels: assign each program instance a `Q` tile, keep `acc`, `m_i`, `l_i` on-chip while streaming K/V, and write `O_i` once at the end (see the kernel sketch after this list). Different schedule; same streaming-softmax math. No races on `O` because ownership doesn’t split across K/V tiles. (Triton Language)
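To make the ownership story concrete, here is a minimal sketch of the Triton-style schedule: one program instance per `Q` tile, with `acc`, `m_i`, `l_i` held in registers across the whole K/V sweep and a single `O` store at the end. It assumes single-head, contiguous fp32 CUDA tensors of shape `(seqlen, head_dim)`, `seqlen` divisible by the block sizes, `head_dim` a power of two ≥ 16 (the minimum for `tl.dot`), and no causal mask; the names `_fwd_kernel` and `attention` are illustrative, not taken from any released kernel.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _fwd_kernel(Q, K, V, O, seqlen, sm_scale,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                HEAD_DIM: tl.constexpr):
    # Each program instance owns exactly one Q tile, hence one O tile.
    start_m = tl.program_id(0)
    offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_d = tl.arange(0, HEAD_DIM)

    # Load the Q tile once; it stays on-chip for the whole K/V sweep.
    q = tl.load(Q + offs_m[:, None] * HEAD_DIM + offs_d[None, :])

    # Streaming-softmax state lives in registers; it never round-trips to HBM.
    m_i = tl.zeros((BLOCK_M,), dtype=tl.float32) - float("inf")  # running row max
    l_i = tl.zeros((BLOCK_M,), dtype=tl.float32)                 # running denominator
    acc = tl.zeros((BLOCK_M, HEAD_DIM), dtype=tl.float32)        # unnormalized output

    for start_n in range(0, seqlen, BLOCK_N):
        offs_n = start_n + tl.arange(0, BLOCK_N)
        k = tl.load(K + offs_n[:, None] * HEAD_DIM + offs_d[None, :])
        v = tl.load(V + offs_n[:, None] * HEAD_DIM + offs_d[None, :])

        s = tl.dot(q, tl.trans(k)) * sm_scale       # (BLOCK_M, BLOCK_N) scores
        m_new = tl.maximum(m_i, tl.max(s, axis=1))  # updated row max
        alpha = tl.exp(m_i - m_new)                 # rescale factor for old state
        p = tl.exp(s - m_new[:, None])              # unnormalized probabilities

        acc = acc * alpha[:, None] + tl.dot(p, v)
        l_i = l_i * alpha + tl.sum(p, axis=1)
        m_i = m_new

    # Single store: this program is the only writer of its O rows, so no races.
    tl.store(O + offs_m[:, None] * HEAD_DIM + offs_d[None, :], acc / l_i[:, None])

def attention(q, k, v, BLOCK_M=64, BLOCK_N=64):
    seqlen, head_dim = q.shape  # assumed divisible by BLOCK_M / BLOCK_N
    o = torch.empty_like(q)
    grid = (seqlen // BLOCK_M,)
    _fwd_kernel[grid](q, k, v, o, seqlen, head_dim ** -0.5,
                      BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, HEAD_DIM=head_dim)
    return o
```

Note the concurrency story: the grid parallelizes over `Q` tiles only, the K/V loop is sequential inside each program, and the one `tl.store` at the end needs no atomics or cross-block synchronization.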
This is the source of a “paper-vs-code” disagreement: the Triton pattern is not a line-for-line translation of Algorithm 1, but an equivalent schedule with a different concurrency story. (GitHub)
Takeaways.
Pick a schedule and make ownership explicit: either follow the paper’s loop order and don’t parallelize over K/V without care, or flip the loops and keep the `Q` tile’s state on-chip with a single `O` store. Same math, fewer hazards. (arXiv, Triton Language) For comparison, a reference sketch of the paper’s loop order follows.