[BugFix][v0.11.0] Fix quantization related mtp bug with patch (#3619)

vLLM 0.11.0 did not include PR
(https://github.com/vllm-project/vllm/pull/25805), so the `prefix` is
missing from MTP's `SharedHead`. This PR fixes the bug with a patch to
vLLM's `deepseek_mtp` module.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Author: whx
Date: 2025-10-22 23:06:09 +08:00
Committed by: GitHub
Parent: 6e72bfdc50
Commit: 6464c97ff9
3 changed files with 70 additions and 0 deletions


@@ -146,3 +146,17 @@
# No, this need CANN add an aclnn shift operation
# Future Plan:
# Revert this when CANN support shift aclnn operation
#
# **File: worker/patch_deepseek_mtp.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.deepseek_mtp.DeepSeekMultiTokenPredictorLayer.__init__`
# Why:
#   The `__init__` method of `DeepSeekMultiTokenPredictorLayer` did not pass `prefix` to `SharedHead`.
# How:
#   Replace it with a new `__init__` that uses a new `SharedHead`, which passes `prefix` through to `ParallelLMHead`.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/25805
# Future Plan:
#   Remove this patch once the adapted vLLM version contains the above PR.
#
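The patching approach described above can be sketched as follows. This is a minimal, self-contained illustration of the monkey-patching technique, not the real vLLM code: the class names mirror vLLM's (`SharedHead`, `ParallelLMHead`) but their bodies are simplified stand-ins, and the example prefix string is hypothetical.

```python
# Minimal sketch of the monkey-patch technique, assuming simplified
# stand-in classes. In the real patch, quantization configs match
# weights by their full dotted name, so a dropped prefix breaks lookup.

class ParallelLMHead:
    def __init__(self, prefix: str = ""):
        self.prefix = prefix


class SharedHead:
    # Original (buggy) behavior: `prefix` is accepted but never
    # forwarded, so the inner head ends up with an empty prefix.
    def __init__(self, prefix: str = ""):
        self.head = ParallelLMHead()


def patched_shared_head_init(self, prefix: str = ""):
    # Patched behavior: forward the prefix so the inner head's full
    # name (e.g. "<layer_prefix>.head") is visible to quantization.
    self.head = ParallelLMHead(prefix=f"{prefix}.head")


# Apply the patch by replacing __init__ on the existing class; every
# construction site then picks up the fixed behavior without any
# caller changes.
SharedHead.__init__ = patched_shared_head_init

head = SharedHead(prefix="model.layers.61.shared_head")
print(head.head.prefix)  # model.layers.61.shared_head.head
```

Patching `__init__` on the class (rather than subclassing) is what lets the fix take effect inside `DeepSeekMultiTokenPredictorLayer` without modifying vLLM's source.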