[V1][ModelRunner] Support pooling model for v1 engine (#1359)

### What this PR does / why we need it? Change as little existing code as possible to add v1 pooling task's support, notice that i move down the `vllm.v1.worker.gpu_input_batch` to vllm-ascend, Considering the frequent changes in upstream interfaces, in order to decouple, so i move it here ### How was this patch tested? CI passed with new added/existing test, and I have a simple test was first conducted locally which is adapted from https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, just like bellow： ```python import os import torch from vllm import LLM os.environ["VLLM_USE_MODELSCOPE"]="True" def get_detailed_instruct(task_description: str, query: str) -> str: return f'Instruct: {task_description}\nQuery:{query}' # Each query must come with a one-sentence instruction that describes the task task = 'Given a web search query, retrieve relevant passages that answer the query' queries = [ get_detailed_instruct(task, 'What is the capital of China?'), get_detailed_instruct(task, 'Explain gravity') ] # No need to add instruction for retrieval documents documents = [ "The capital of China is Beijing.", "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun." ] input_texts = queries + documents model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed") outputs = model.embed(input_texts) embeddings = torch.tensor([o.outputs.embedding for o in outputs]) scores = (embeddings[:2] @ embeddings[2:].T) print(scores.tolist()) # [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]] ``` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wangli <858794774@qq.com> Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
parent 790c810bf7
commit 5f8241c25c
10 changed files with 1312 additions and 43 deletions
--- a/.github/workflows/vllm_ascend_test.yaml
+++ b/.github/workflows/vllm_ascend_test.yaml
@@ -86,7 +86,7 @@ jobs:
      - name: Run codespell check
        run: |
          CODESPELL_EXCLUDES=('--skip' 'tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**')
-          CODESPELL_IGNORE_WORDS=('-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn')
+          CODESPELL_IGNORE_WORDS=('-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,assertIn')

          codespell --toml pyproject.toml "${CODESPELL_EXCLUDES[@]}" "${CODESPELL_IGNORE_WORDS[@]}"
      - name: Analysing the code with ruff
@@ -262,11 +262,13 @@ jobs:
            pytest -sv tests/e2e/singlecard/test_ilama_lora.py
          pytest -sv tests/e2e/singlecard/test_guided_decoding.py
          pytest -sv tests/e2e/singlecard/test_camem.py
+          pytest -sv tests/e2e/singlecard/test_embedding.py
          pytest -sv tests/e2e/singlecard/ \
          --ignore=tests/e2e/singlecard/test_offline_inference.py \
          --ignore=tests/e2e/singlecard/test_ilama_lora.py \
          --ignore=tests/e2e/singlecard/test_guided_decoding.py \
-          --ignore=tests/e2e/singlecard/test_camem.py
+          --ignore=tests/e2e/singlecard/test_camem.py \
+          --ignore=tests/e2e/singlecard/test_embedding.py

      - name: Run e2e test on V0 engine
        if: ${{ github.event_name == 'schedule' }}
@@ -281,6 +283,7 @@ jobs:
          pytest -sv tests/e2e/singlecard/test_guided_decoding.py
          pytest -sv tests/e2e/singlecard/test_camem.py
          pytest -sv tests/e2e/singlecard/test_prompt_embedding.py
+          pytest -sv tests/e2e/singlecard/test_embedding.py
          pytest -sv tests/e2e/singlecard/ \
            --ignore=tests/e2e/singlecard/test_offline_inference.py \
            --ignore=tests/e2e/singlecard/test_ilama_lora.py \
@@ -288,7 +291,8 @@ jobs:
            --ignore=tests/e2e/singlecard/test_camem.py \
            --ignore=tests/e2e/singlecard/test_prompt_embedding.py \
            --ignore=tests/e2e/singlecard/core/test_ascend_scheduler.py \
-            --ignore=tests/e2e/singlecard/core/test_ascend_scheduler_e2e.py
+            --ignore=tests/e2e/singlecard/core/test_ascend_scheduler_e2e.py \
+            --ignore=tests/e2e/singlecard/test_embedding.py

  e2e-4-cards:
    needs: [e2e]