diff --git a/docs/advanced_features/attention_backend.md b/docs/advanced_features/attention_backend.md
index b0db60674..0024aece6 100644
--- a/docs/advanced_features/attention_backend.md
+++ b/docs/advanced_features/attention_backend.md
@@ -17,7 +17,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 |---------------------------------|-----------------------------|------------------|-----------------|-----------------|--------------------|----------------|
 | **FlashInfer**                  | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
 | **FA3 (FlashAttention 3)**      | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| **FA4 (FlashAttention 4)**      | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| **FA4 (FlashAttention 4)**      | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **Triton**                      | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
 | **Torch Native (SDPA)**         | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **FlexAttention (PyTorch)**     | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
@@ -37,7 +37,7 @@
 | **TRTLLM MLA (Blackwell)**      | 32 or 64 | ✅ | ✅ | ✅ | ❌ |
 | **FA3 (FlashAttention 3)**      | n/a | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) |
 | **Triton**                      | n/a | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) |
-| **FA4**                         | n/a | ❌ | ❌ | ❌ | ❌ |
+| **FA4**                         | 128 | ❌ | ❌ | ❌ | ❌ |
 | **Ascend MLA (NPU)**            | 128 | ❌ | ❌ | ❌ | ❌ |
 
 ```{warning}
@@ -53,13 +53,14 @@ FlashMLA FP8 KV cache is currently not working. See upstream issue [#8856](https
 Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths.
 ```
 
-Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).
+Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), FA4 (128), Ascend (128).
 
 MLA page-size constraints:
 - FlashInfer MLA: page_size = 1.
 - FlashMLA: page_size = 64.
 - Cutlass MLA: page_size = 128.
 - TRTLLM MLA: page_size ∈ {32, 64}.
+- FA4: page_size = 128.
 
 ### Hybrid attention (different backends for prefill vs decode) (Experimental)
 
@@ -224,7 +225,7 @@ python3 -m sglang.launch_server \
 python3 -m sglang.launch_server \
   --tp 8 \
   --model deepseek-ai/DeepSeek-R1 \
-  --attention-backend fa4 \
+  --prefill-attention-backend fa4 \
   --trust-remote-code
 ```
 
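
For reviewers trying the patch: a minimal sketch of how the updated example composes with the hybrid prefill/decode flags this doc already describes. Since the patch moves `fa4` to the prefill-only flag, decode presumably needs its own backend; pairing with `fa3` and passing `--page-size 128` explicitly (to match the FA4 native page size added above) are assumptions, not something the patch prescribes.

```bash
# Sketch only: FA4 is wired up for prefill, so decode runs on a separate backend.
# The fa3 pairing and the explicit --page-size 128 are assumptions derived from
# the FA4 page_size = 128 constraint introduced in this patch.
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-R1 \
  --prefill-attention-backend fa4 \
  --decode-attention-backend fa3 \
  --page-size 128 \
  --trust-remote-code
```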