[Feature] Enable inference support for Deepseekr1-w8a8-MTP (#1994)

Support the inference of the Deepseekr1-w8a8-mtp model with statically-quantized shared_head in MTP layers. - vLLM version: v0.9.2 - vLLM main: 6eca337ce0 Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
2025-07-29 18:51:57 +08:00
parent 98cadc2146
commit ca8007f584
3 changed files with 46 additions and 4 deletions
--- a/vllm_ascend/models/deepseek_v2.py
+++ b/vllm_ascend/models/deepseek_v2.py
@@ -868,7 +868,9 @@ class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
        if get_pp_group().is_last_rank:
            self.lm_head = ParallelLMHead(config.vocab_size,
                                          config.hidden_size,
-                                          quant_config=quant_config)
+                                          quant_config=quant_config,
+                                          prefix=maybe_prefix(
+                                              prefix, "lm_head"))
        else:
            self.lm_head = PPMissingLayer()
        self.logits_processor = LogitsProcessor(config.vocab_size)