- **Enable Multi-Modal**: Add `--enable-multimodal` for multi-modal capabilities.
- **Enable Hybrid-KVCache**: Add `--hybrid-kvcache-ratio` to enable the hybrid KV cache. Details can be found in [this PR](https://github.com/sgl-project/sglang/pull/6563). Both flags are combined in the example below.
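
For reference, here is a minimal launch sketch combining both flags. It assumes the Llama 4 Scout model used earlier on this page; the ratio value `0.5` and `--tp 8` are illustrative assumptions, not tuned recommendations:

```
# Launch Llama 4 Scout with multi-modal input and the hybrid KV cache enabled.
# NOTE: --hybrid-kvcache-ratio 0.5 and --tp 8 are illustrative values.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --enable-multimodal \
  --hybrid-kvcache-ratio 0.5 \
  --tp 8
```
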
### EAGLE Speculative Decoding
**Description**: SGLang supports Llama 4 Maverick (400B) with [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding).

**Usage**:

Add the arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to enable this feature. For example:
```
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --trust-remote-code --tp 8 --context-length 1000000
```
- **Note**: The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* can only recognize conversations in chat mode.
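
Since the draft model only understands chat-formatted inputs, query the server through its OpenAI-compatible chat endpoint. A minimal sketch, assuming the server from the command above is running on the default port 30000:

```
# Send a chat-formatted request; the draft model only handles chat-mode conversations.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [{"role": "user", "content": "Briefly explain speculative decoding."}]
  }'
```
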
## Benchmarking Results
### Accuracy Test with `lm_eval`
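
A hedged sketch of one way to run the accuracy test against the running server via the OpenAI-compatible completions endpoint; the task (`gsm8k`), port, and concurrency below are assumptions, not the exact settings behind any reported numbers:

```
# Evaluate the running SGLang server with lm-evaluation-harness (lm_eval).
# Assumes the default SGLang port 30000; task and num_concurrent are illustrative.
lm_eval --model local-completions \
  --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/completions,num_concurrent=8 \
  --tasks gsm8k
```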