Song Zhixin
2b6dc100b5
Eagle3 mm support, enablement on qwen3vl (#4848)
### What this PR does / why we need it?
Follow-up to [https://github.com/vllm-project/vllm/pull/20788](https://github.com/vllm-project/vllm/pull/20788): Eagle3 multimodal (mm) support, enabled on Qwen3-VL.
Target model:
[Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
Eagle3 draft model:
[MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```bash
pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv
```
vLLM with Eagle3:
```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{
"method": "eagle3",
"model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3",
"num_speculative_tokens": 3
}'
```
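The `--speculative-config` flag takes a single JSON string. A minimal sketch (paths are the placeholders from the serve command above) that builds and validates the config before launching, so a malformed JSON is caught early:

```python
import json

# Build the speculative-decoding config passed via --speculative-config.
# Model paths mirror the serve command above and are placeholders.
spec_config = {
    "method": "eagle3",
    "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3",
    "num_speculative_tokens": 3,
}

# Serialize exactly as the CLI expects: one JSON string argument.
spec_json = json.dumps(spec_config)
print(spec_json)
```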
vLLM without Eagle3:
```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images
```
bench:
```bash
vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123
```
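To interpret the benchmark numbers, a rough sketch of the standard speculative-decoding approximation: with `num_speculative_tokens=3` drafted per step and a per-token acceptance rate `a` (an assumed independence model, not a measured figure for this model pair), the expected tokens emitted per target-model forward pass is `(1 - a^(k+1)) / (1 - a)`:

```python
def expected_tokens_per_step(accept_rate: float, k: int = 3) -> float:
    """Expected tokens per target forward pass when the draft proposes k
    tokens and each is accepted independently with probability accept_rate.
    This is the textbook speculative-decoding approximation, not a
    measurement for Qwen3-VL + Eagle3."""
    if accept_rate == 1.0:
        return float(k + 1)
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)

# Hypothetical 70% per-token acceptance with k=3 drafted tokens:
print(round(expected_tokens_per_step(0.7, k=3), 3))  # -> 2.533
```

So an acceptance rate around 70% would translate to roughly 2.5x fewer target-model passes per generated token, which is the speedup the bench run above is meant to expose.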
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: jesse <szxfml@gmail.com>
2026-01-19 08:58:07 +08:00