[Doc] Update documents for FA4 (#11778)
@@ -17,7 +17,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 |---------------------------------|-----------------------------|------------------|-----------------|-----------------|--------------------|----------------|
 | **FlashInfer** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
 | **FA3 (FlashAttention 3)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| **FA4 (FlashAttention 4)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| **FA4 (FlashAttention 4)** | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **Triton** | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
 | **Torch Native (SDPA)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **FlexAttention (PyTorch)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
@@ -37,7 +37,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 | **TRTLLM MLA (Blackwell)** | 32 or 64 | ✅ | ✅ | ✅ | ❌ |
 | **FA3 (FlashAttention 3)** | n/a | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) |
 | **Triton** | n/a | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) |
-| **FA4** | n/a | ❌ | ❌ | ❌ | ❌ |
+| **FA4** | 128 | ❌ | ❌ | ❌ | ❌ |
 | **Ascend MLA (NPU)** | 128 | ❌ | ❌ | ❌ | ❌ |

 ```{warning}
@@ -53,13 +53,14 @@ FlashMLA FP8 KV cache is currently not working. See upstream issue [#8856](https
 Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths.
 ```

-Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).
+Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), FA4 (128), Ascend (128).

 MLA page-size constraints:
 - FlashInfer MLA: page_size = 1.
 - FlashMLA: page_size = 64.
 - Cutlass MLA: page_size = 128.
 - TRTLLM MLA: page_size ∈ {32, 64}.
+- FA4: page_size = 128.

 ### Hybrid attention (different backends for prefill vs decode) (Experimental)

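The note in the hunk above describes emulating `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. A minimal sketch of that expansion, assuming a flat KV cache where page `p` owns slots `[p * page_size, (p + 1) * page_size)`; the function name and layout are illustrative, not SGLang's actual implementation:

```python
def expand_page_table(page_table, page_size, seq_len):
    """Expand page indices into per-token KV-cache slot indices.

    page_table: page ids allocated to one sequence, in order.
    page_size:  tokens per page in the paged KV cache (assumed layout).
    seq_len:    tokens actually stored; the last page may be partially filled.
    """
    token_indices = []
    for page_id in page_table:
        base = page_id * page_size  # first slot owned by this page
        token_indices.extend(range(base, base + page_size))
    # Trim slots beyond the sequence length (partially filled last page).
    return token_indices[:seq_len]
```

A backend that only understands per-token indices (effectively `page_size = 1`) can then consume the expanded table, which is why the "native" column matters only for true in-kernel paging.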
@@ -224,7 +225,7 @@ python3 -m sglang.launch_server \
 python3 -m sglang.launch_server \
 --tp 8 \
 --model deepseek-ai/DeepSeek-R1 \
 --attention-backend fa4 \
 --prefill-attention-backend fa4 \
 --trust-remote-code
 ```
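The fixed native page sizes called out in the documentation note can be captured as a small validity check. A sketch for illustration only; the backend keys are shorthand labels, not necessarily the exact `--attention-backend` values:

```python
# Fixed native page sizes per backend, as listed in the note:
# TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64),
# Cutlass MLA (128), FA4 (128), Ascend (128).
NATIVE_PAGE_SIZES = {
    "trtllm_mha": {16, 32, 64},
    "trtllm_mla": {32, 64},
    "flashmla": {64},
    "cutlass_mla": {128},
    "fa4": {128},
    "ascend": {128},
}

def page_size_ok(backend, page_size):
    """True if page_size is compatible with the backend's fixed native size.

    Backends absent from the table carry no fixed constraint here
    (they run with page_size = 1 or emulate larger pages).
    """
    allowed = NATIVE_PAGE_SIZES.get(backend)
    return allowed is None or page_size in allowed
```

For example, `page_size_ok("fa4", 128)` holds while `page_size_ok("fa4", 64)` does not, matching the `FA4: page_size = 128` constraint this commit documents.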