
# Attention Backend

SGLang supports a wide variety of attention backends, each with different trade-offs.
Benchmark them against your own workload to find the best fit.

```{important}
Selecting an optimal attention backend is crucial for maximizing your performance. Different backends excel in various scenarios, so choose based on your model, hardware, and use case. Not all backends are supported on all platforms and model architectures.
```

## Support Matrix

The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention). For an explanation of the key differences between MHA and MLA, please see the [SGLang documentation on DeepSeek MLA](https://github.com/sgl-project/sglang/blob/main/docs/basic_usage/deepseek.md#multi-head-latent-attention-mla) and the original [DeepSeek MLA paper](https://arxiv.org/pdf/2405.04434).

### MHA Backends

| **Backend**                     | **Page Size > 1 (native)** | **FP8 KV Cache** | **Spec topk=1** | **Spec topk>1** | **Sliding Window** | **MultiModal** |
|---------------------------------|-----------------------------|------------------|-----------------|-----------------|--------------------|----------------|
| **FlashInfer**                  | ✅                          | ✅               | ✅              | ✅              | ✅                 | ❌             |
| **FA3 (FlashAttention 3)**      | ✅                          | ✅               | ✅              | ✅              | ✅                 | ✅             |
| **FA4 (FlashAttention 4)**      | 128                         | ❌               | ❌              | ❌              | ❌                 | ❌             |
| **Triton**                      | ❌                          | ❌               | ✅              | ✅              | ✅                 | ✅             |
| **Torch Native (SDPA)**         | ❌                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
| **FlexAttention (PyTorch)**     | ❌                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
| **TRTLLM MHA**                  | 16, 32 or 64                | ❌               | ✅              | ❌              | ❌                 | ❌             |
| **Dual Chunk FlashAttention**   | ✅                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
| **AITER (ROCm)**                | ✅                          | ❌               | ✅              | ✅              | ❌                 | ❌             |
| **Wave (ROCm)**                 | ✅                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
| **Ascend (NPU)**                | ✅                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
| **Intel XPU**                   | ✅                          | ❌               | ❌              | ❌              | ✅                 | ❌             |

### MLA Backends

| **Backend**                | **Native Page Sizes**     | **FP8 KV Cache** | **Chunked Prefix Cache** | **Spec topk=1** | **Spec topk>1** |
|----------------------------|---------------------------|------------------|--------------------------|-----------------|-----------------|
| **FlashInfer MLA**         | 1                         | ❌               | ✅                       | ✅              | ❌              |
| **FlashMLA**               | 64                        | ❌               | ✅                       | ✅              | ❌              |
| **Cutlass MLA**            | 128                       | ✅               | ✅                       | ✅              | ❌              |
| **TRTLLM MLA (Blackwell)** | 32 or 64                  | ✅               | ✅                       | ✅              | ❌              |
| **FA3 (FlashAttention 3)** | n/a                       | ❌               | ✅                       | ✅              | ⚠️ (page_size=1 only) |
| **Triton**                 | n/a                       | ❌               | ❌                       | ✅              | ⚠️ (page_size=1 only) |
| **FA4**                    | 128                       | ❌               | ❌                       | ❌              | ❌              |
| **Ascend MLA (NPU)**       | 128                       | ❌               | ❌                       | ❌              | ❌              |

```{warning}
FlashMLA FP8 KV cache is currently not working. See [#8856](https://github.com/sgl-project/sglang/pull/8856) for details. Use a non-FP8 KV cache or another backend when FP8 KV cache is required.
```

```{note}
- FlashAttention 4 is prefill-only for now.
- NSA is specifically designed for [DeepSeek V3.2 DSA](https://lmsys.org/blog/2025-09-29-deepseek-V32/).
```

```{tip}
Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths.
```

Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require a fixed native page size, which cannot be changed or emulated with another value: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), FA4 (128), Ascend (128).
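
The wrapper-layer emulation described above can be illustrated with a minimal sketch. It assumes a flat KV pool in which page `p` owns slots `p * page_size` through `(p + 1) * page_size - 1`; the function name `expand_page_table` and that layout are illustrative assumptions, not SGLang's actual implementation.

```python
from typing import List


def expand_page_table(page_table: List[int], page_size: int, seq_len: int) -> List[int]:
    """Expand one sequence's page ids into per-token KV slot indices.

    Hypothetical helper: assumes page p owns slots
    [p * page_size, (p + 1) * page_size) in a flat KV pool.
    """
    if page_size < 1:
        raise ValueError("page_size must be >= 1")
    needed_pages = (seq_len + page_size - 1) // page_size  # ceil division
    if needed_pages > len(page_table):
        raise ValueError("page table is too short for seq_len")

    token_indices = []
    for pos in range(seq_len):
        page_id = page_table[pos // page_size]  # which page holds this token
        offset = pos % page_size                # position inside that page
        token_indices.append(page_id * page_size + offset)
    return token_indices


# A 6-token sequence stored in pages 7 and 2 with page_size=4.
print(expand_page_table([7, 2], page_size=4, seq_len=6))
# -> [28, 29, 30, 31, 8, 9]
```

A kernel that only understands per-token indices can then consume the expanded list as if `page_size` were 1, at the cost of a larger index buffer.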

MLA page-size constraints:
- FlashInfer MLA: page_size = 1.
- FlashMLA: page_size = 64.
- Cutlass MLA: page_size = 128.
- TRTLLM MLA: page_size ∈ {32, 64}.
- FA4: page_size = 128.
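
As a rough sketch, the constraints above can be captured in a lookup table and checked before launching. The dictionary and the helper `is_supported_page_size` are hypothetical and only mirror the list above; the backend keys are illustrative labels, and SGLang's own validation may differ.

```python
from typing import Dict, Set

# Native MLA page sizes, copied from the list above (illustrative keys).
MLA_NATIVE_PAGE_SIZES: Dict[str, Set[int]] = {
    "flashinfer_mla": {1},
    "flashmla": {64},
    "cutlass_mla": {128},
    "trtllm_mla": {32, 64},
    "fa4": {128},
}


def is_supported_page_size(backend: str, page_size: int) -> bool:
    """Return True if page_size matches the backend's listed native sizes.

    Backends without an entry are treated as unconstrained in this sketch.
    """
    allowed = MLA_NATIVE_PAGE_SIZES.get(backend)
    return allowed is None or page_size in allowed


print(is_supported_page_size("trtllm_mla", 64))   # -> True
print(is_supported_page_size("cutlass_mla", 64))  # -> False
```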

### Hybrid attention (different backends for prefill vs decode) (Experimental)

```{warning}
Hybrid attention is an experimental feature.
```

You can mix-and-match attention backends for prefill and decode. This is useful when one backend excels at prefill and another excels at decode. For the implementation details, please see `python/sglang/srt/layers/attention/hybrid_attn_backend.py`.

```bash
# Example: Prefill with FA4, Decode with TRTLLM MLA (Blackwell)
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-FP4 \
  --tp 8 \
  --attention-backend trtllm_mla \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --prefill-attention-backend fa4
```

#### Speculative decoding with hybrid attention

Hybrid attention also works with speculative decoding. The backend used for draft decoding and target verification depends on `--speculative-attention-mode`:

- `--speculative-attention-mode decode` (recommended): draft/verify use the decode backend.
- `--speculative-attention-mode prefill` (default): draft/verify use the prefill backend.

Constraints when combining hybrid attention with speculative decoding:

- If any attention backend is `trtllm_mha`, speculative decoding supports only `--speculative-eagle-topk 1`.
- For paged MHA backends with `--page-size > 1` and `--speculative-eagle-topk > 1`, only `flashinfer` is supported.
- `flex_attention` is not supported with speculative decoding.
- For MLA backends, `trtllm_mla` supports `topk > 1`; `flashmla` and `flashinfer_mla` support only `topk = 1`.
- CUDA Graph: the decode backend is always captured; the prefill backend is captured only when `--speculative-attention-mode prefill`.
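
The mode selection and two of the constraints above can be sketched as follows. The function names are hypothetical, and only the `trtllm_mha` and `flex_attention` rules are encoded; this is an illustration of the stated rules, not SGLang's argument validation.

```python
from typing import List


def spec_backend(prefill_backend: str, decode_backend: str, mode: str = "prefill") -> str:
    """Pick the backend used for draft decoding and target verification."""
    if mode == "prefill":  # the default, per the list above
        return prefill_backend
    if mode == "decode":
        return decode_backend
    raise ValueError(f"unknown speculative attention mode: {mode}")


def check_spec_constraints(prefill_backend: str, decode_backend: str, topk: int) -> List[str]:
    """Return human-readable violations for a hybrid + speculative setup."""
    backends = {prefill_backend, decode_backend}
    errors = []
    if "flex_attention" in backends:
        errors.append("flex_attention is not supported with speculative decoding")
    if "trtllm_mha" in backends and topk != 1:
        errors.append("trtllm_mha supports only --speculative-eagle-topk 1")
    return errors


print(spec_backend("fa4", "trtllm_mla", mode="decode"))    # -> trtllm_mla
print(check_spec_constraints("fa3", "trtllm_mha", topk=4))
```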


```{tip}
If you set only one of `--prefill-attention-backend` or `--decode-attention-backend`, the unspecified phase inherits `--attention-backend`.
If both are specified and differ, SGLang automatically enables a hybrid wrapper to dispatch to the chosen backend per phase.
```
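
The inheritance rule in the tip above can be sketched in a few lines. `resolve_backends` is a hypothetical helper that takes the three flag values as plain strings (or `None` when a flag is unset); it is not part of SGLang's API.

```python
from typing import Optional, Tuple


def resolve_backends(
    attention_backend: Optional[str],
    prefill_backend: Optional[str] = None,
    decode_backend: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str], bool]:
    """Return (prefill, decode, is_hybrid) after applying inheritance."""
    # An unspecified phase falls back to --attention-backend.
    prefill = prefill_backend if prefill_backend is not None else attention_backend
    decode = decode_backend if decode_backend is not None else attention_backend
    # The hybrid wrapper is only needed when the two phases differ.
    return prefill, decode, prefill != decode


# Mirrors the FA4 prefill + TRTLLM MLA decode example above.
print(resolve_backends("trtllm_mla", prefill_backend="fa4"))
# -> ('fa4', 'trtllm_mla', True)
```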

## User guide

### Launch commands for different attention backends

- FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40)
```bash
# Llama 3.1 8B on a single GPU
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --attention-backend flashinfer

# DeepSeek-V3 (MLA) with 8-way tensor parallelism
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-V3 \
  --attention-backend flashinfer \
  --trust-remote-code
```

- FlashAttention 3 (Default for Hopper Machines, e.g., H100, H200, H20)
```bash
# Llama 3.1 8B on a single GPU
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --attention-backend fa3

# DeepSeek-V3 (MLA) with 8-way tensor parallelism
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-V3 \
  --trust-remote-code \
  --attention-backend fa3
```

- Triton
```bash
# Llama 3.1 8B on a single GPU
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --attention-backend triton

# DeepSeek-V3 (MLA) with 8-way tensor parallelism
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-V3 \
  --attention-backend triton \
  --trust-remote-code
```

- Torch Native
```bash
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --attention-backend torch_native
```

- FlashMLA
```bash
# Default KV cache dtype
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-R1 \
  --attention-backend flashmla \
  --trust-remote-code

# FP8 KV cache (see the FlashMLA FP8 warning in the support matrix)
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-R1 \
  --attention-backend flashmla \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code
```

- TRTLLM MLA (Optimized for Blackwell Architecture, e.g., B200)
```bash
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-R1 \
  --attention-backend trtllm_mla \
  --trust-remote-code
```

- TRTLLM MLA with FP8 KV Cache (Higher concurrency, lower memory footprint)
```bash
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-R1 \
  --attention-backend trtllm_mla \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code
```

- Ascend
```bash
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --attention-backend ascend
```

- Intel XPU
```bash
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --attention-backend intel_xpu
```

- Wave
```bash
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --attention-backend wave
```

- FlexAttention
```bash
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --attention-backend flex_attention
```

- Dual Chunk FlashAttention (MHA-only)
```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen2.5-14B-Instruct-1M \
  --attention-backend dual_chunk_flash_attn
```

- Cutlass MLA
```bash
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-R1 \
  --attention-backend cutlass_mla \
  --trust-remote-code
```

- FlashAttention 4 (MHA & MLA)
```bash
# FA4 is currently prefill-only, so select it with --prefill-attention-backend
python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-R1 \
  --prefill-attention-backend fa4 \
  --trust-remote-code
```

## Steps to add a new attention backend
To add a new attention backend, you can learn from the existing backends
(`python/sglang/srt/layers/attention/triton_backend.py`, `python/sglang/srt/layers/attention/flashattention_backend.py`)
and follow the steps below.

1. Run without CUDA graph. Implement the two forward functions and the metadata initializer:
    - forward_extend
        - Will be used for prefill, prefill with KV cache, and target verification
        - It will be called once per layer
    - forward_decode
        - Will be used for normal decode, and draft decode
        - It will be called once per layer
    - init_forward_metadata
        - Initialize the class and common metadata shared by all layers
        - Call the plan function for optimizations like split_kv
        - It will be called once per forward
2. Run with CUDA graph. This has two phases (capture and replay), and you need to implement three functions:
    - init_cuda_graph_state
        - It will be called once during the backend's lifetime
        - Create all common shared buffers
    - init_forward_metadata_capture_cuda_graph
        - It will be called before capturing a CUDA graph
        - It is similar to init_forward_metadata but writes the metadata to pre-defined buffers
    - init_forward_metadata_replay_cuda_graph
        - It will be called before replaying a CUDA graph
        - This function is in the critical path and needs to be fast
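
The two steps above can be sketched as a plain-Python skeleton. Only the six method names come from the steps; the class name `ToyAttentionBackend`, the argument lists, and the list-based "buffers" are simplifying assumptions, and the real backends in `python/sglang/srt/layers/attention/` use different signatures and tensors.

```python
from typing import Any, Dict, List, Optional


class ToyAttentionBackend:
    """Hypothetical skeleton; signatures are simplified, not SGLang's real ones."""

    def __init__(self) -> None:
        self.calls: List[str] = []  # records call order for inspection
        self.metadata: Optional[Dict[str, Any]] = None
        self.graph_buffers: Optional[Dict[str, List[int]]] = None

    # Step 1: eager path (no CUDA graph)
    def init_forward_metadata(self, seq_lens: List[int]) -> None:
        # Called once per forward; shared by all layers.
        self.calls.append("init_forward_metadata")
        self.metadata = {"seq_lens": list(seq_lens)}

    def forward_extend(self, q: Any, layer_id: int) -> Any:
        # Called once per layer for prefill and target verification.
        self.calls.append(f"forward_extend[{layer_id}]")
        return q  # placeholder: a real backend runs the attention kernel here

    def forward_decode(self, q: Any, layer_id: int) -> Any:
        # Called once per layer for normal decode and draft decode.
        self.calls.append(f"forward_decode[{layer_id}]")
        return q  # placeholder

    # Step 2: CUDA graph path (capture and replay)
    def init_cuda_graph_state(self, max_bs: int) -> None:
        # Called once: allocate buffers that every captured graph shares.
        self.calls.append("init_cuda_graph_state")
        self.graph_buffers = {"seq_lens": [0] * max_bs}

    def init_forward_metadata_capture_cuda_graph(self, bs: int) -> None:
        # Point the metadata at the pre-allocated buffers instead of new ones.
        assert self.graph_buffers is not None
        self.calls.append("capture")
        self.metadata = {"bs": bs, "seq_lens": self.graph_buffers["seq_lens"]}

    def init_forward_metadata_replay_cuda_graph(self, seq_lens: List[int]) -> None:
        # Critical path: only overwrite buffer contents in place.
        assert self.graph_buffers is not None
        self.calls.append("replay")
        self.graph_buffers["seq_lens"][: len(seq_lens)] = seq_lens


def run_eager_step(backend: ToyAttentionBackend, seq_lens: List[int],
                   num_layers: int, is_decode: bool) -> Any:
    """One forward pass: metadata once, then one attention call per layer."""
    backend.init_forward_metadata(seq_lens)
    x: Any = [0.0]
    for layer_id in range(num_layers):
        if is_decode:
            x = backend.forward_decode(x, layer_id)
        else:
            x = backend.forward_extend(x, layer_id)
    return x


backend = ToyAttentionBackend()
run_eager_step(backend, seq_lens=[3, 5], num_layers=2, is_decode=False)
print(backend.calls)
# -> ['init_forward_metadata', 'forward_extend[0]', 'forward_extend[1]']
```

The replay method writes into the buffers allocated by `init_cuda_graph_state` rather than creating new objects, which is the property that lets a captured graph be replayed with fresh inputs.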