xc-llm-ascend/docs/source/user_guide/suppoted_features.md
wangxiyuan ae49bfd13a [Core] Support pooling (#229)
This PR adds pooling support to vllm-ascend.

Tested `encode` with `bge-base-en-v1.5`:
```python
from vllm import LLM

# Sample prompts.
prompts = [
  "Hello, my name is",
  "The president of the United States is",
  "The capital of France is",
  "The future of AI is",
]
# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 768 floats
```

Tested `embed`:
```python
from vllm import LLM

llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
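The embeddings come back as plain Python lists of floats, so a downstream similarity check needs no extra dependencies. A minimal sketch (the `cosine_similarity` helper below is illustrative, not part of vLLM; the toy vectors stand in for real `model.encode()` outputs):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embedding outputs.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```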

Related: https://github.com/vllm-project/vllm-ascend/issues/200

## Known issue
The accuracy is not correct yet, since this feature relies on `enc-dec`
support. That will be addressed in a follow-up PR by @MengqingCao.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-04 15:59:34 +08:00


## Feature Support

| Feature | Supported | Note |
|---------|-----------|------|
| Chunked Prefill | ✗ | Plan in 2025 Q1 |
| Automatic Prefix Caching | ✅ | Improve performance in 2025 Q2 |
| LoRA | ✗ | Plan in 2025 Q1 |
| Prompt adapter | ✗ | Plan in 2025 Q1 |
| Speculative decoding | ✗ | Plan in 2025 Q1 |
| Pooling | ✅ | |
| Enc-dec | ✗ | Plan in 2025 Q2 |
| Multi Modality (LLaVA/Qwen2-vl/Qwen2-audio/internVL) | ✅ | Add more model support in 2025 Q1 |
| LogProbs | ✅ | |
| Prompt logProbs | ✅ | |
| Async output | ✅ | |
| Multi step scheduler | ✗ | Plan in 2025 Q1 |
| Best of | ✅ | |
| Beam search | ✅ | |
| Guided Decoding | ✗ | Find more details at the issue |
| Tensor Parallel | ✅ | Only "mp" supported now |
| Pipeline Parallel | ✅ | Only "mp" supported now |