sglang/docs/basic_usage/qwen3.md

# Qwen3-Next Usage

SGLang has supported Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking since [this PR](https://github.com/sgl-project/sglang/pull/10233).

## Launch Qwen3-Next with SGLang

To serve Qwen3-Next models on 4xH100/H200 GPUs:

```bash
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4
```

### Configuration Tips
- `--max-mamba-cache-size`: Adjust `--max-mamba-cache-size` to increase mamba cache space and max running requests capability. It will decrease KV cache space as a trade-off. You can adjust it according to workload.
- `--mamba-ssm-dtype`: `bfloat16` or `float32`, use `bfloat16` to save mamba cache size and `float32` to get more accurate results. The default setting is `float32`.

### EAGLE Speculative Decoding
**Description**: SGLang has supported Qwen3-Next models with [EAGLE speculative decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-Decoding).

**Usage**:
Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:

``` bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tp 4 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-algo NEXTN
```

Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/10233).
add qwen3-next doc (#10327) 2025-09-12 05:29:11 +08:00			`# Qwen3-Next Usage`

			`SGLang has supported Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking since [this PR](https://github.com/sgl-project/sglang/pull/10233).`

			`## Launch Qwen3-Next with SGLang`

			`To serve Qwen3-Next models on 4xH100/H200 GPUs:`

			```bash
			`python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4`
			```

			`### Configuration Tips`
			- `--max-mamba-cache-size`: Adjust `--max-mamba-cache-size` to increase mamba cache space and max running requests capability. It will decrease KV cache space as a trade-off. You can adjust it according to workload.
			- `--mamba-ssm-dtype`: `bfloat16` or `float32`, use `bfloat16` to save mamba cache size and `float32` to get more accurate results. The default setting is `float32`.

			`### EAGLE Speculative Decoding`
			`Description: SGLang has supported Qwen3-Next models with [EAGLE speculative decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-Decoding).`

			`Usage:`
			Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:

			``` bash
Fix formatting in long code blocks (#10528) 2025-09-16 12:02:05 -07:00			`python3 -m sglang.launch_server \`
			`--model Qwen/Qwen3-Next-80B-A3B-Instruct \`
			`--tp 4 \`
			`--speculative-num-steps 3 \`
			`--speculative-eagle-topk 1 \`
			`--speculative-num-draft-tokens 4 \`
			`--speculative-algo NEXTN`
add qwen3-next doc (#10327) 2025-09-12 05:29:11 +08:00			```

			`Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/10233).`