[Misc] Update pooling example (#5002)
### What this PR does / why we need it?
Since the param `task` has been deprecated, we should use the latest
unified standard parameter (`runner`) for pooling models; this makes the
examples clearer.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
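
For reference, a minimal before/after sketch of the parameter change this PR applies (model name and `runner` value taken from the updated examples below; not an exhaustive migration guide):

```python
from vllm import LLM

# Before (deprecated): the pooling task was selected via `task`.
# model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

# After (unified): pooling models are selected via `runner`.
model = LLM(model="Qwen/Qwen3-Embedding-0.6B", runner="pooling")
```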
@@ -40,7 +40,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
 ### Online Inference
 
 ```bash
-vllm serve Qwen/Qwen3-Embedding-8B --task embed
+vllm serve Qwen/Qwen3-Embedding-8B --runner pooling
 ```
 
 Once your server is started, you can query the model with input prompts.
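
As a hedged sketch of such a query (assuming the server listens on vLLM's default port 8000 and exposes the OpenAI-compatible `/v1/embeddings` endpoint; this snippet is illustrative and not part of the diff):

```python
# Query the running embedding server over its OpenAI-compatible HTTP API.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "Qwen/Qwen3-Embedding-8B",
        "input": ["What is the capital of China?"],
    },
)
print(resp.json()["data"][0]["embedding"][:8])  # first few vector dimensions
```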
@@ -81,7 +81,7 @@ if __name__=="__main__":
     input_texts = queries + documents
 
     model = LLM(model="Qwen/Qwen3-Embedding-8B",
-                task="embed",
+                runner="pooling",
                 distributed_executor_backend="mp")
 
     outputs = model.embed(input_texts)
@@ -44,7 +44,7 @@ def main():
     ]
     input_texts = queries + documents
 
-    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
+    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", runner="pooling")
 
     outputs = model.embed(input_texts)
     embeddings = torch.tensor([o.outputs.embedding for o in outputs])
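
As a hypothetical continuation of this example (assuming `input_texts` holds two queries followed by two documents), the resulting tensor could be scored with a normalized dot product:

```python
import torch.nn.functional as F

# Normalize so the dot product equals cosine similarity, then score each
# query embedding against each document embedding.
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = embeddings[:2] @ embeddings[2:].T  # shape: (num_queries, num_documents)
print(scores)
```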