Eagle3 mm support, enablement on qwen3vl (#4848)
### What this PR does / why we need it?
Follows PR [vllm-project/vllm#20788](https://github.com/vllm-project/vllm/pull/20788): adds Eagle3 multimodal support, enabled on Qwen3-VL.

Target model: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
Eagle3 draft model: [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```bash
pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv
```
vLLM with eagle3:
```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{
"method": "eagle3",
"model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3",
"num_speculative_tokens": 3
}'
```
vLLM without eagle3:
```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images
```
Benchmark:
```bash
vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123
```
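To spot-check the multimodal path end to end, a request with one image and one text part can be sent to the served endpoint. The sketch below only assembles an OpenAI-compatible `/v1/chat/completions` payload (it does not send it); the image path under `--allowed-local-media-path` and the prompt are illustrative placeholders.

```python
import json


def build_chat_request(model: str, prompt: str, image_url: str) -> dict:
    """Assemble a chat-completions payload with one image part and one text part."""
    return {
        "model": model,
        "temperature": 0,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image parts use the "image_url" content type.
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


payload = build_chat_request(
    "/model/Qwen3-VL-8B-Instruct",
    "Describe this image.",
    "file:///model/gx/images/example.jpg",  # placeholder local media path
)
print(json.dumps(payload, indent=2))
```

The same payload works with and without the `--speculative-config` flag, so the two server configurations above can be compared on identical requests.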
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: jesse <szxfml@gmail.com>
```diff
@@ -1354,14 +1354,14 @@ class NPUModelRunner(GPUModelRunner):
             query_start_loc_pcp_full[1:num_reqs + 1] - 1
             target_token_ids = input_ids_pcp_full[:num_scheduled_tokens]
-            target_positions = positions[:num_scheduled_tokens]
+            target_positions = self._get_positions(num_scheduled_tokens)
             target_hidden_states = hidden_states
         else:
             token_indices_to_sample = None
             # input_ids can be None for multimodal models.
             target_token_ids = self.input_ids.gpu[:num_scheduled_tokens]
-            target_positions = positions[:num_scheduled_tokens]
+            target_positions = self._get_positions(num_scheduled_tokens)
             if self.use_aux_hidden_state_outputs:
                 target_hidden_states = torch.cat([
                     h[:num_scheduled_tokens]
@@ -1402,7 +1402,7 @@ class NPUModelRunner(GPUModelRunner):
             target_hidden_states = hidden_states
         else:
             target_token_ids = self.input_ids.gpu[token_indices]
-            target_positions = positions[token_indices]
+            target_positions = self._get_positions(token_indices)
             if self.use_aux_hidden_state_outputs:
                 target_hidden_states = torch.cat(
                     [h[token_indices] for h in aux_hidden_states],
@@ -3006,7 +3006,7 @@ class NPUModelRunner(GPUModelRunner):
     def _prepare_multimodal_fields(self):
         """
         Ensures specific multimodal tensors are on CPU.
         This is necessary for fields like 'grid_thw' which are converted to numpy
         inside the model's forward pass.
         """
         if not self.multimodal_cpu_fields:
```
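The diff swaps direct slicing of `positions` for a `self._get_positions(...)` helper. One plausible motivation (an assumption, not stated in the diff) is that M-RoPE models such as Qwen3-VL keep positions as a 2D `(3, num_tokens)` tensor, so token selection must act on the last axis rather than the first. A minimal numpy sketch of such a helper (standalone, hypothetical; the real method lives on `NPUModelRunner`):

```python
import numpy as np


def get_positions(positions: np.ndarray, idx) -> np.ndarray:
    """Select tokens from a positions array that may be 1D (plain RoPE)
    or 2D with shape (3, num_tokens) (M-RoPE).

    `idx` is either a token count or an index array; in both cases it
    must apply to the token (last) axis, not the leading axis.
    """
    if isinstance(idx, int):
        return positions[..., :idx]  # slice the last (token) axis
    return positions[..., idx]       # gather along the token axis


rope = np.arange(8)                   # shape (8,): plain RoPE positions
mrope = np.stack([np.arange(8)] * 3)  # shape (3, 8): M-RoPE positions
```

Naive slicing like `mrope[:num_scheduled_tokens]` would trim the 3 M-RoPE sections instead of the tokens, which is exactly the bug class the helper avoids.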