xc-llm-ascend/docs/source/user_guide/suppoted_features.md
wangxiyuan ae49bfd13a [Core] Support pooling (#229)
This PR adds pooling support to vllm-ascend.

Tested `encode` with `bge-base-en-v1.5`:
```python
from vllm import LLM

# Sample prompts.
prompts = [
  "Hello, my name is",
  "The president of the United States is",
  "The capital of France is",
  "The future of AI is",
]
# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 768 floats
```

Tested `embed`:
```python
from vllm import LLM

llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
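The embeddings come back as plain Python lists of floats, so a downstream similarity check needs no extra dependencies. A minimal sketch (the `cosine_similarity` helper below is illustrative, not part of vLLM; the toy vectors stand in for real `model.encode()` outputs):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embedding outputs.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```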

Related: https://github.com/vllm-project/vllm-ascend/issues/200

## Known issue
The accuracy is not correct yet, since this feature relies on `enc-dec`
support. That will be addressed in a follow-up PR by @MengqingCao.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-04 15:59:34 +08:00


## Feature Support

| Feature | Supported | Note |
|---------|-----------|------|
| Chunked Prefill | ✗ | Plan in 2025 Q1 |
| Automatic Prefix Caching | ✅ | Improve performance in 2025 Q2 |
| LoRA | ✗ | Plan in 2025 Q1 |
| Prompt adapter | ✗ | Plan in 2025 Q1 |
| Speculative decoding | ✗ | Plan in 2025 Q1 |
| Pooling | ✅ | |
| Enc-dec | ✗ | Plan in 2025 Q2 |
| Multi Modality (LLaVA/Qwen2-vl/Qwen2-audio/internVL) | ✅ | Add more model support in 2025 Q1 |
| LogProbs | ✅ | |
| Prompt logProbs | ✅ | |
| Async output | ✅ | |
| Multi step scheduler | ✗ | Plan in 2025 Q1 |
| Best of | ✅ | |
| Beam search | ✅ | |
| Guided Decoding | ✗ | Find more details at the issue |
| Tensor Parallel | ✅ | Only "mp" supported now |
| Pipeline Parallel | ✅ | Only "mp" supported now |