[BugFix][main] Adapted Qwen3-Next-MTP to chunked prefill (#4770)
### What this PR does / why we need it?
The `-1` padding change is taken from
https://github.com/vllm-project/vllm/pull/25743.
That change still has bugs with batched chunked prefill.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
Signed-off-by: drslark <slarksblood@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
@@ -129,3 +129,28 @@
# Future Plan:
# Remove this patch when adapted vllm version contains the above PR.
#
# ** File: worker/patch_qwen3_next_mtp.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.utils.bind_kv_cache`
# Why:
# 'bind_kv_cache' func will raise an exception when current_platform is npu.
# How:
# Replace with a new bind_kv_cache.
# Skip the raise.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/4770
# Future Plan:
# Remove this patch after discussing with vllm community and adapting bind_kv_cache to npu.
#
# ** File: worker/patch_module.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.attention.backends.gdn_attn.torch.argsort`
# Why:
# 'torch.argsort' func of npu does not support bool.
# How:
# Replace with a new torch.argsort that will cast the input to torch.int32.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/4770
# Future Plan:
# Remove this patch when bool is supported in 'torch.argsort' func of npu.
#
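The `torch.argsort` entry above describes a wrap-and-replace patch: keep a handle to the original function, install a wrapper that casts bool input to an integer type, then delegate. A minimal sketch of that pattern follows. It is not the code in `worker/patch_module.py`. To stay runnable without `torch` or an NPU it uses a hypothetical stand-in namespace `fake_torch` whose `argsort` rejects bools, and the names `_backend_argsort`, `patch_argsort`, and `argsort_with_cast` are assumptions made for illustration.

```python
import functools
import types


def _backend_argsort(values, descending=False):
    """Hypothetical stand-in for a backend argsort that rejects bool input."""
    if any(isinstance(v, bool) for v in values):
        raise TypeError("argsort does not support bool input")
    # sorted() is stable, so equal keys keep their original relative order.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)


# Stand-in for the module object whose attribute gets patched (assumption).
fake_torch = types.SimpleNamespace(argsort=_backend_argsort)


def patch_argsort(module):
    """Replace module.argsort with a wrapper that casts bools to ints."""
    original = module.argsort

    @functools.wraps(original)
    def argsort_with_cast(values, *args, **kwargs):
        # Cast only bool elements; other values pass through unchanged.
        casted = [int(v) if isinstance(v, bool) else v for v in values]
        return original(casted, *args, **kwargs)

    module.argsort = argsort_with_cast
    return original  # returned so the caller can restore it later


mask = [True, False, True, False]
original_argsort = patch_argsort(fake_torch)
order = fake_torch.argsort(mask)
```

Returning the original function keeps the patch reversible, which fits the stated plan to remove it once the backend supports bool directly.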