sglang/docs/basic_usage/llama4.md

# Llama4 Usage

[Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) is Meta's latest generation of open-source LLM model with industry-leading performance.

SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5).

Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118).

## Launch Llama 4 with SGLang

To serve Llama 4 models on 8xH100/H200 GPUs:

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 1000000
```

### Configuration Tips

- **OOM Mitigation**: Adjust `--context-length` to avoid a GPU out-of-memory issue. For the Scout model, we recommend setting this value up to 1M on 8\*H100 and up to 2.5M on 8\*H200. For the Maverick model, we don't need to set context length on 8\*H200. When hybrid kv cache is enabled, `--context-length` can be set up to 5M on 8\*H100 and up to 10M on 8\*H200 for the Scout model.

- **Chat Template**: Add `--chat-template llama-4` for chat completion tasks.
- **Enable Multi-Modal**: Add `--enable-multimodal` for multi-modal capabilities.
- **Enable Hybrid-KVCache**: Add `--hybrid-kvcache-ratio` for hybrid kv cache. Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/6563)


### EAGLE Speculative Decoding
**Description**: SGLang has supported Llama 4 Maverick (400B) with [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding).

**Usage**:
Add arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
```
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --trust-remote-code \
  --tp 8 \
  --context-length 1000000
```

- **Note** The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* can only recognize conversations in chat mode.

## Benchmarking Results

### Accuracy Test with `lm_eval`

The accuracy on SGLang for both Llama4 Scout and Llama4 Maverick can match the [official benchmark numbers](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).

Benchmark results on MMLU Pro dataset with 8*H100:
|                    | Llama-4-Scout-17B-16E-Instruct | Llama-4-Maverick-17B-128E-Instruct  |
|--------------------|--------------------------------|-------------------------------------|
| Official Benchmark | 74.3                           | 80.5                                |
| SGLang             | 75.2                           | 80.7                                |

Commands:

```bash
# Llama-4-Scout-17B-16E-Instruct model
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536
lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0

# Llama-4-Maverick-17B-128E-Instruct
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536
lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
```

Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/5092).
Add Llama4 user guide (#5133) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> 2025-04-08 10:09:34 +08:00			`# Llama4 Usage`

			`[Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) is Meta's latest generation of open-source LLM model with industry-leading performance.`

			`SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5).`

			`Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118).`

			`## Launch Llama 4 with SGLang`

			`To serve Llama 4 models on 8xH100/H200 GPUs:`

			```bash
Fix formatting in long code blocks (#10528) 2025-09-16 12:02:05 -07:00			`python3 -m sglang.launch_server \`
			`--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \`
			`--tp 8 \`
			`--context-length 1000000`
Add Llama4 user guide (#5133) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> 2025-04-08 10:09:34 +08:00			```

			`### Configuration Tips`

Hybrid kv cache for LLaMA4 (#6563) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: tarinkk <rt572@physics.rutger.edu> Co-authored-by: tarinkk <rt572@rutgers.physics.edu> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> 2025-06-27 21:58:55 -04:00			- OOM Mitigation: Adjust `--context-length` to avoid a GPU out-of-memory issue. For the Scout model, we recommend setting this value up to 1M on 8\H100 and up to 2.5M on 8\H200. For the Maverick model, we don't need to set context length on 8\H200. When hybrid kv cache is enabled, `--context-length` can be set up to 5M on 8\H100 and up to 10M on 8\*H200 for the Scout model.
Add Llama4 user guide (#5133) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> 2025-04-08 10:09:34 +08:00
			- Chat Template: Add `--chat-template llama-4` for chat completion tasks.
[Llama4] Add docs note about enable multimodal (#6235) 2025-05-12 22:05:47 -04:00			- Enable Multi-Modal: Add `--enable-multimodal` for multi-modal capabilities.
Hybrid kv cache for LLaMA4 (#6563) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: tarinkk <rt572@physics.rutger.edu> Co-authored-by: tarinkk <rt572@rutgers.physics.edu> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> 2025-06-27 21:58:55 -04:00			- Enable Hybrid-KVCache: Add `--hybrid-kvcache-ratio` for hybrid kv cache. Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/6563)
Add Llama4 user guide (#5133) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> 2025-04-08 10:09:34 +08:00
add description for llama4 eagle3 (#7688) 2025-07-01 16:19:19 +08:00
			`### EAGLE Speculative Decoding`
			`Description: SGLang has supported Llama 4 Maverick (400B) with [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding).`

			`Usage:`
			Add arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
			```
Fix formatting in long code blocks (#10528) 2025-09-16 12:02:05 -07:00			`python3 -m sglang.launch_server \`
			`--model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \`
			`--speculative-algorithm EAGLE3 \`
			`--speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \`
			`--speculative-num-steps 3 \`
			`--speculative-eagle-topk 1 \`
			`--speculative-num-draft-tokens 4 \`
			`--trust-remote-code \`
			`--tp 8 \`
			`--context-length 1000000`
add description for llama4 eagle3 (#7688) 2025-07-01 16:19:19 +08:00			```

			`- Note The Llama 4 draft model nvidia/Llama-4-Maverick-17B-128E-Eagle3 can only recognize conversations in chat mode.`

Add Llama4 user guide (#5133) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> 2025-04-08 10:09:34 +08:00			`## Benchmarking Results`

			### Accuracy Test with `lm_eval`

			`The accuracy on SGLang for both Llama4 Scout and Llama4 Maverick can match the [official benchmark numbers](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).`

			`Benchmark results on MMLU Pro dataset with 8*H100:`
			`\| \| Llama-4-Scout-17B-16E-Instruct \| Llama-4-Maverick-17B-128E-Instruct \|`
			`\|--------------------\|--------------------------------\|-------------------------------------\|`
			`\| Official Benchmark \| 74.3 \| 80.5 \|`
			`\| SGLang \| 75.2 \| 80.7 \|`

			`Commands:`

			```bash
			`# Llama-4-Scout-17B-16E-Instruct model`
Fix formatting in long code blocks (#10528) 2025-09-16 12:02:05 -07:00			`python -m sglang.launch_server \`
			`--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \`
			`--port 30000 \`
			`--tp 8 \`
			`--mem-fraction-static 0.8 \`
			`--context-length 65536`
Add Llama4 user guide (#5133) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> 2025-04-08 10:09:34 +08:00			`lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0`

			`# Llama-4-Maverick-17B-128E-Instruct`
Fix formatting in long code blocks (#10528) 2025-09-16 12:02:05 -07:00			`python -m sglang.launch_server \`
			`--model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \`
			`--port 30000 \`
			`--tp 8 \`
			`--mem-fraction-static 0.8 \`
			`--context-length 65536`
Add Llama4 user guide (#5133) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> 2025-04-08 10:09:34 +08:00			`lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0`
			```

			`Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/5092).`