This PR adds pooling support for vllm-ascend.
Tested with `bge-base-en-v1.5` via `encode`:
```python
from vllm import LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)

# Generate embeddings. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)

# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 768 floats
```
Tested with the `embed` task:
```python
from vllm import LLM

llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
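Embedding vectors like the ones returned above are usually compared with cosine similarity. A minimal, dependency-free sketch (the toy vectors below stand in for real `encode`/`embed` outputs and are not produced by this PR's code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model outputs.
v1 = [0.1, 0.2, 0.3]
v2 = [0.3, 0.2, 0.1]
print(cosine_similarity(v1, v1))  # identical vectors -> 1.0
print(cosine_similarity(v1, v2))
```

In practice you would pass `output.outputs.embedding` values straight into such a helper to rank prompt similarity.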
Related: https://github.com/vllm-project/vllm-ascend/issues/200
## Known issue
Accuracy is not yet correct because this feature relies on `enc-dec`
support, which will be added in a follow-up PR by @MengqingCao.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
# Feature Support
| Feature | Supported | Note |
|---------|-----------|------|
| Chunked Prefill | ✗ | Plan in 2025 Q1 |
| Automatic Prefix Caching | ✅ | Improve performance in 2025 Q2 |
| LoRA | ✗ | Plan in 2025 Q1 |
| Prompt adapter | ✗ | Plan in 2025 Q1 |
| Speculative decoding | ✗ | Plan in 2025 Q1 |
| Pooling | ✅ | |
| Enc-dec | ✗ | Plan in 2025 Q2 |
| Multi Modality | ✅ (LLaVA/Qwen2-vl/Qwen2-audio/internVL) | Add more model support in 2025 Q1 |
| LogProbs | ✅ | |
| Prompt logProbs | ✅ | |
| Async output | ✅ | |
| Multi step scheduler | ✗ | Plan in 2025 Q1 |
| Best of | ✅ | |
| Beam search | ✅ | |
| Guided Decoding | ✅ | Find more details at the [<u>issue</u>](https://github.com/vllm-project/vllm-ascend/issues/177) |
| Tensor Parallel | ✅ | Only "mp" supported now |
| Pipeline Parallel | ✅ | Only "mp" supported now |