From ade714a67f570e776b93ee534ea08b79b6b01b29 Mon Sep 17 00:00:00 2001
From: Ke Bao <ISPObaoke@163.com>
Date: Tue, 8 Apr 2025 10:09:34 +0800
Subject: [PATCH] Add Llama4 user guide (#5133)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
---
 docs/index.rst            |  1 +
 docs/references/llama4.md | 47 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)
 create mode 100644 docs/references/llama4.md

diff --git a/docs/index.rst b/docs/index.rst
index 085e72261..99fea71ea 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -20,6 +20,7 @@ The core features include:
    :maxdepth: 1
    :caption: Backend Tutorial
 
+   references/llama4
    references/deepseek
    backend/send_request.ipynb
    backend/openai_api_completions.ipynb
diff --git a/docs/references/llama4.md b/docs/references/llama4.md
new file mode 100644
index 000000000..1ee058090
--- /dev/null
+++ b/docs/references/llama4.md
@@ -0,0 +1,47 @@
+# Llama 4 Usage
+
+[Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) is Meta's latest generation of open-source LLMs, with industry-leading performance.
+
+SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5).
+
+Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118).
+
+## Launch Llama 4 with SGLang
+
+To serve Llama 4 models on 8xH100/H200 GPUs (Scout shown here):
+
+```bash
+python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --tp 8 --context-length 1000000
+```
+
+### Configuration Tips
+
+- **OOM Mitigation**: Adjust `--context-length` to avoid GPU out-of-memory errors. For the Scout model, we recommend a value of up to 1M on 8xH100 and up to 2.5M on 8xH200. For the Maverick model, `--context-length` does not need to be set on 8xH200.
+
+- **Chat Template**: Add `--chat-template llama-4` for chat completion tasks.
+
+## Benchmarking Results
+
+### Accuracy Test with `lm_eval`
+
+SGLang's accuracy for both Llama 4 Scout and Llama 4 Maverick matches the [official benchmark numbers](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).
+
+Benchmark results on the MMLU-Pro dataset with 8xH100:
+|                    | Llama-4-Scout-17B-16E-Instruct | Llama-4-Maverick-17B-128E-Instruct  |
+|--------------------|--------------------------------|-------------------------------------|
+| Official Benchmark | 74.3                           | 80.5                                |
+| SGLang             | 75.2                           | 80.7                                |
+
+Commands:
+
+```bash
+# Llama-4-Scout-17B-16E-Instruct
+python -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30000 --tp 8 --mem-fraction-static 0.8 --context-length 65536
+lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
+
+# Llama-4-Maverick-17B-128E-Instruct
+python -m sglang.launch_server --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct --port 30000 --tp 8 --mem-fraction-static 0.8 --context-length 65536
+lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
+```
+
+See [this PR](https://github.com/sgl-project/sglang/pull/5092) for details.