[Model] Support pooling models (#3122)

### What this PR does / why we need it?

Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend, this
pr covered the three model types of embed (cls_token, mean_token,
lasttoken).

After this
[commit](17373dcd93),
vllm has provided support for adapting pooling models on the v1 engine.
This PR includes corresponding adaptations on the vllm-ascend side.

Fixes #1960

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
This commit is contained in:
lianyibo
2025-12-10 11:37:57 +08:00
committed by GitHub
parent 1a7a34c5ec
commit e32014ac1d
17 changed files with 577 additions and 338 deletions

View File

@@ -106,16 +106,7 @@
#
# ** File: worker/patch_roberta.py **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.roberta.RobertaEmbedding.forward`
# Why:
# shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in ascend aclgraph mode
# How
# Replace shift operation with multiplication and division.
# Related PR (if no, explain why):
# No, this need CANN add an aclnn shift operation
# Future Plan:
# Revert this when CANN support shift aclnn operation
# 2. `vllm.model_executor.models.roberta.RobertaForSequenceClassification.forward `
# 1. `vllm.model_executor.models.bert `
# Why:
# shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in ascend aclgraph mode
# How