Eagle3 mm support, enablement on qwen3vl (#4848)
### What this PR does / why we need it?
Follows PR [vllm-project/vllm#20788](https://github.com/vllm-project/vllm/pull/20788): adds Eagle3 multimodal support, enabled on Qwen3-VL.

Target model: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
Eagle3 draft model: [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```bash
pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv
```
vLLM with eagle3:
```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{
"method": "eagle3",
"model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3",
"num_speculative_tokens": 3
}'
```
vLLM without eagle3:
```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images
```
Benchmark:
```bash
vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123
```
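To spot-check the multimodal path end to end, a request with one image and one text part can be sent to the served endpoint. The sketch below only assembles an OpenAI-compatible `/v1/chat/completions` payload (it does not send it); the image path under `--allowed-local-media-path` and the prompt are illustrative placeholders.

```python
import json


def build_chat_request(model: str, prompt: str, image_url: str) -> dict:
    """Assemble a chat-completions payload with one image part and one text part."""
    return {
        "model": model,
        "temperature": 0,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image parts use the "image_url" content type.
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


payload = build_chat_request(
    "/model/Qwen3-VL-8B-Instruct",
    "Describe this image.",
    "file:///model/gx/images/example.jpg",  # placeholder local media path
)
print(json.dumps(payload, indent=2))
```

The same payload works with and without the `--speculative-config` flag, so the two server configurations above can be compared on identical requests.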
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: jesse <szxfml@gmail.com>
```diff
@@ -1354,14 +1354,14 @@ class NPUModelRunner(GPUModelRunner):
             query_start_loc_pcp_full[1:num_reqs + 1] - 1
             target_token_ids = input_ids_pcp_full[:num_scheduled_tokens]
-            target_positions = positions[:num_scheduled_tokens]
+            target_positions = self._get_positions(num_scheduled_tokens)
             target_hidden_states = hidden_states
         else:
             token_indices_to_sample = None
             # input_ids can be None for multimodal models.
             target_token_ids = self.input_ids.gpu[:num_scheduled_tokens]
-            target_positions = positions[:num_scheduled_tokens]
+            target_positions = self._get_positions(num_scheduled_tokens)
             if self.use_aux_hidden_state_outputs:
                 target_hidden_states = torch.cat([
                     h[:num_scheduled_tokens]
@@ -1402,7 +1402,7 @@ class NPUModelRunner(GPUModelRunner):
             target_hidden_states = hidden_states
         else:
             target_token_ids = self.input_ids.gpu[token_indices]
-            target_positions = positions[token_indices]
+            target_positions = self._get_positions(token_indices)
             if self.use_aux_hidden_state_outputs:
                 target_hidden_states = torch.cat(
                     [h[token_indices] for h in aux_hidden_states],
@@ -3006,7 +3006,7 @@ class NPUModelRunner(GPUModelRunner):
     def _prepare_multimodal_fields(self):
         """
         Ensures specific multimodal tensors are on CPU.
         This is necessary for fields like 'grid_thw' which are converted to numpy
         inside the model's forward pass.
         """
         if not self.multimodal_cpu_fields:
```
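The diff swaps direct slicing of `positions` for a `self._get_positions(...)` helper. One plausible motivation (an assumption, not stated in the diff) is that M-RoPE models such as Qwen3-VL keep positions as a 2D `(3, num_tokens)` tensor, so token selection must act on the last axis rather than the first. A minimal numpy sketch of such a helper (standalone, hypothetical; the real method lives on `NPUModelRunner`):

```python
import numpy as np


def get_positions(positions: np.ndarray, idx) -> np.ndarray:
    """Select tokens from a positions array that may be 1D (plain RoPE)
    or 2D with shape (3, num_tokens) (M-RoPE).

    `idx` is either a token count or an index array; in both cases it
    must apply to the token (last) axis, not the leading axis.
    """
    if isinstance(idx, int):
        return positions[..., :idx]  # slice the last (token) axis
    return positions[..., idx]       # gather along the token axis


rope = np.arange(8)                   # shape (8,): plain RoPE positions
mrope = np.stack([np.arange(8)] * 3)  # shape (3, 8): M-RoPE positions
```

Naive slicing like `mrope[:num_scheduled_tokens]` would trim the 3 M-RoPE sections instead of the tokens, which is exactly the bug class the helper avoids.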