[Doc] Update documents for FA4 (#11778)
@@ -17,7 +17,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 |---------------------------------|-----------------------------|------------------|-----------------|-----------------|--------------------|----------------|
 | **FlashInfer** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
 | **FA3 (FlashAttention 3)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| **FA4 (FlashAttention 4)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| **FA4 (FlashAttention 4)** | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **Triton** | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
 | **Torch Native (SDPA)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **FlexAttention (PyTorch)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
@@ -37,7 +37,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 | **TRTLLM MLA (Blackwell)** | 32 or 64 | ✅ | ✅ | ✅ | ❌ |
 | **FA3 (FlashAttention 3)** | n/a | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) |
 | **Triton** | n/a | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) |
-| **FA4** | n/a | ❌ | ❌ | ❌ | ❌ |
+| **FA4** | 128 | ❌ | ❌ | ❌ | ❌ |
 | **Ascend MLA (NPU)** | 128 | ❌ | ❌ | ❌ | ❌ |

 ```{warning}
@@ -53,13 +53,14 @@ FlashMLA FP8 KV cache is currently not working. See upstream issue [#8856](https
 Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths.
 ```

-Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).
+Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), FA4 (128), Ascend (128).

 MLA page-size constraints:
 - FlashInfer MLA: page_size = 1.
 - FlashMLA: page_size = 64.
 - Cutlass MLA: page_size = 128.
 - TRTLLM MLA: page_size ∈ {32, 64}.
+- FA4: page_size = 128.

 ### Hybrid attention (different backends for prefill vs decode) (Experimental)

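Reviewer sketch for the hunk above: the note describes two behaviours, fixed native page sizes and wrapper-level emulation of `page_size > 1` by expanding page tables to per-token indices. The Python below is a minimal illustration of both, not SGLang code. The names `FIXED_PAGE_SIZES`, `check_page_size`, and `expand_page_table` are hypothetical, and the slot layout (page `p` covers slots `p * page_size` through `(p + 1) * page_size - 1`) is an assumption made only for this example. The page-size values are copied from the note and the MLA list in the diff.

```python
# Hypothetical helper names; this is an illustration, not SGLang's API.

# Fixed page sizes as listed in the note and the MLA constraint list above.
FIXED_PAGE_SIZES = {
    "TRTLLM MHA": {16, 32, 64},
    "TRTLLM MLA": {32, 64},
    "FlashInfer MLA": {1},
    "FlashMLA": {64},
    "Cutlass MLA": {128},
    "FA4": {128},
    "Ascend": {128},
}


def check_page_size(backend: str, page_size: int) -> bool:
    """Return True if page_size is allowed for a backend listed above.

    Backends missing from the table are not modelled by this sketch,
    so any positive page size is accepted for them.
    """
    allowed = FIXED_PAGE_SIZES.get(backend)
    if allowed is None:
        return page_size >= 1
    return page_size in allowed


def expand_page_table(page_table: list[int], page_size: int, seq_len: int) -> list[int]:
    """Expand page ids into per-token slot indices for the first seq_len tokens.

    Assumed layout: page p owns slots [p * page_size, (p + 1) * page_size).
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    if seq_len > len(page_table) * page_size:
        raise ValueError("page_table is too short for seq_len tokens")
    return [
        page_table[t // page_size] * page_size + t % page_size
        for t in range(seq_len)
    ]


if __name__ == "__main__":
    print(check_page_size("FA4", 128))       # True
    print(check_page_size("FA4", 64))        # False
    print(expand_page_table([3, 0], 4, 6))   # [12, 13, 14, 15, 0, 1]
```

With per-token indices like these, a kernel that only understands `page_size = 1` can read a paged KV cache, which is the emulation the note refers to. Backends in the fixed-size table cannot take this route and need the listed page size.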
@@ -224,7 +225,7 @@ python3 -m sglang.launch_server \
 python3 -m sglang.launch_server \
 --tp 8 \
 --model deepseek-ai/DeepSeek-R1 \
---attention-backend fa4 \
+--prefill-attention-backend fa4 \
 --trust-remote-code
 ```
