Reimplementing FlashAttention in OpenAI Triton surfaced a mismatch between the paper’s loop order and the Triton-style kernels many people used at the time. The short version:
- Paper (Algorithm 1): stream over K/V tiles; for each tile, load `O_i`, update it, write it back, then move on. If you parallelize that outer K/V loop across blocks, multiple blocks can touch the same `O_i` → potential races unless you serialize or add synchronization. (arXiv)
- Triton-style kernels: assign each program instance a `Q` tile, keep `acc`, `m_i`, `l_i` on-chip while streaming K/V, and write `O_i` once at the end (see the kernel sketch after this list). Different schedule; same streaming-softmax math. No races on `O` because ownership doesn’t split across K/V tiles. (Triton Language)
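To make the ownership story concrete, here is a minimal sketch of the Triton-style schedule: one program instance per `Q` tile, with `acc`, `m_i`, `l_i` held in registers across the whole K/V sweep and a single `O` store at the end. It assumes single-head, contiguous fp32 CUDA tensors of shape `(seqlen, head_dim)`, `seqlen` divisible by the block sizes, `head_dim` a power of two ≥ 16 (the minimum for `tl.dot`), and no causal mask; the names `_fwd_kernel` and `attention` are illustrative, not taken from any released kernel.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _fwd_kernel(Q, K, V, O, seqlen, sm_scale,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                HEAD_DIM: tl.constexpr):
    # Each program instance owns exactly one Q tile, hence one O tile.
    start_m = tl.program_id(0)
    offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_d = tl.arange(0, HEAD_DIM)

    # Load the Q tile once; it stays on-chip for the whole K/V sweep.
    q = tl.load(Q + offs_m[:, None] * HEAD_DIM + offs_d[None, :])

    # Streaming-softmax state lives in registers; it never round-trips to HBM.
    m_i = tl.zeros((BLOCK_M,), dtype=tl.float32) - float("inf")  # running row max
    l_i = tl.zeros((BLOCK_M,), dtype=tl.float32)                 # running denominator
    acc = tl.zeros((BLOCK_M, HEAD_DIM), dtype=tl.float32)        # unnormalized output

    for start_n in range(0, seqlen, BLOCK_N):
        offs_n = start_n + tl.arange(0, BLOCK_N)
        k = tl.load(K + offs_n[:, None] * HEAD_DIM + offs_d[None, :])
        v = tl.load(V + offs_n[:, None] * HEAD_DIM + offs_d[None, :])

        s = tl.dot(q, tl.trans(k)) * sm_scale       # (BLOCK_M, BLOCK_N) scores
        m_new = tl.maximum(m_i, tl.max(s, axis=1))  # updated row max
        alpha = tl.exp(m_i - m_new)                 # rescale factor for old state
        p = tl.exp(s - m_new[:, None])              # unnormalized probabilities

        acc = acc * alpha[:, None] + tl.dot(p, v)
        l_i = l_i * alpha + tl.sum(p, axis=1)
        m_i = m_new

    # Single store: this program is the only writer of its O rows, so no races.
    tl.store(O + offs_m[:, None] * HEAD_DIM + offs_d[None, :], acc / l_i[:, None])

def attention(q, k, v, BLOCK_M=64, BLOCK_N=64):
    seqlen, head_dim = q.shape  # assumed divisible by BLOCK_M / BLOCK_N
    o = torch.empty_like(q)
    grid = (seqlen // BLOCK_M,)
    _fwd_kernel[grid](q, k, v, o, seqlen, head_dim ** -0.5,
                      BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, HEAD_DIM=head_dim)
    return o
```

Note the concurrency story: the grid parallelizes over `Q` tiles only, the K/V loop is sequential inside each program, and the one `tl.store` at the end needs no atomics or cross-block synchronization.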
This is the source of a “paper-vs-code” disagreement: the Triton pattern is not a line-for-line translation of Algorithm 1, but an equivalent schedule with a different concurrency story. (GitHub)
Takeaways.
Pick a schedule and make ownership explicit: either follow the paper’s loop order and don’t parallelize over K/V without care, or flip the loops and keep the `Q` tile’s state on-chip with a single `O` store. Same math, fewer hazards. (arXiv, Triton Language) For comparison, a reference sketch of the paper’s loop order follows.