From 93b6785d78d1225b65672979795566a9d73b6e47 Mon Sep 17 00:00:00 2001
From: Yi Zhang <1109276519@qq.com>
Date: Tue, 1 Jul 2025 16:19:19 +0800
Subject: [PATCH] add description for llama4 eagle3 (#7688)

---
 docs/references/llama4.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/docs/references/llama4.md b/docs/references/llama4.md
index b09a6e240..07cc2b737 100644
--- a/docs/references/llama4.md
+++ b/docs/references/llama4.md
@@ -22,6 +22,18 @@ python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-In
 - **Enable Multi-Modal**: Add `--enable-multimodal` for multi-modal capabilities.
 - **Enable Hybrid-KVCache**: Add `--hybrid-kvcache-ratio` for hybrid kv cache. Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/6563)
+
+### EAGLE Speculative Decoding
+**Description**: SGLang supports Llama 4 Maverick (400B) with [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding).
+
+**Usage**:
+Add the arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to enable this feature. For example:
+```
+python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct --speculative-algorithm EAGLE3 --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --trust-remote-code --tp 8 --context-length 1000000
+```
+
+- **Note**: The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* only recognizes conversations in chat mode.
+
 ## Benchmarking Results
 ### Accuracy Test with `lm_eval`
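The note in the patch says the draft model only recognizes conversations in chat mode; in practice this means clients should send requests through the server's OpenAI-compatible chat completions endpoint rather than a raw completion call. A minimal sketch of such a chat-format request payload (the `/v1/chat/completions` path and port 30000 follow SGLang's defaults; the prompt text is purely illustrative):

```python
import json

# Chat-format payload for SGLang's OpenAI-compatible endpoint.
# POST it to http://localhost:30000/v1/chat/completions once the server
# from the launch command above is running (port 30000 is the default).
payload = {
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [
        # The "messages" list is what puts the request in chat mode,
        # which is the format the EAGLE3 draft model was trained on.
        {"role": "user", "content": "Summarize speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}
print(json.dumps(payload, indent=2))
```

Sending an equivalent payload to the plain `/v1/completions` endpoint would bypass the chat template and can degrade the draft model's acceptance rate.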