[Misc] Update pooling example (#5002)
### What this PR does / why we need it?
Since the param `task` has been deprecated, we should use the latest
unified standard parameter (`runner`) for pooling models; this makes the
examples clearer.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
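
For reference, a minimal before/after sketch of the parameter change this PR applies (model name and `runner` value taken from the updated examples below; not an exhaustive migration guide):

```python
from vllm import LLM

# Before (deprecated): the pooling task was selected via `task`.
# model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

# After (unified): pooling models are selected via `runner`.
model = LLM(model="Qwen/Qwen3-Embedding-0.6B", runner="pooling")
```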
@@ -40,7 +40,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
 ### Online Inference
 
 ```bash
-vllm serve Qwen/Qwen3-Embedding-8B --task embed
+vllm serve Qwen/Qwen3-Embedding-8B --runner pooling
 ```
 
 Once your server is started, you can query the model with input prompts.
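
As a hedged sketch of such a query (assuming the server listens on vLLM's default port 8000 and exposes the OpenAI-compatible `/v1/embeddings` endpoint; this snippet is illustrative and not part of the diff):

```python
# Query the running embedding server over its OpenAI-compatible HTTP API.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "Qwen/Qwen3-Embedding-8B",
        "input": ["What is the capital of China?"],
    },
)
print(resp.json()["data"][0]["embedding"][:8])  # first few vector dimensions
```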
@@ -81,7 +81,7 @@ if __name__=="__main__":
     input_texts = queries + documents
 
     model = LLM(model="Qwen/Qwen3-Embedding-8B",
-                task="embed",
+                runner="pooling",
                 distributed_executor_backend="mp")
 
     outputs = model.embed(input_texts)
@@ -44,7 +44,7 @@ def main():
     ]
     input_texts = queries + documents
 
-    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
+    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", runner="pooling")
 
     outputs = model.embed(input_texts)
     embeddings = torch.tensor([o.outputs.embedding for o in outputs])
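
As a hypothetical continuation of this example (assuming `input_texts` holds two queries followed by two documents), the resulting tensor could be scored with a normalized dot product:

```python
import torch.nn.functional as F

# Normalize so the dot product equals cosine similarity, then score each
# query embedding against each document embedding.
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = embeddings[:2] @ embeddings[2:].T  # shape: (num_queries, num_documents)
print(scores)
```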