[Feature] Enable inference support for Deepseekr1-w8a8-MTP (#1994)

Support inference for the Deepseekr1-w8a8-mtp model, whose MTP layers
use a statically quantized shared_head.

- vLLM version: v0.9.2
- vLLM main: 6eca337ce0

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
curryliu
2025-07-29 18:51:57 +08:00
committed by GitHub
parent 98cadc2146
commit ca8007f584
3 changed files with 46 additions and 4 deletions

@@ -868,7 +868,9 @@ class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
         if get_pp_group().is_last_rank:
             self.lm_head = ParallelLMHead(config.vocab_size,
                                           config.hidden_size,
-                                          quant_config=quant_config)
+                                          quant_config=quant_config,
+                                          prefix=maybe_prefix(
+                                              prefix, "lm_head"))
         else:
             self.lm_head = PPMissingLayer()
         self.logits_processor = LogitsProcessor(config.vocab_size)
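
The hunk above passes a qualified name to the LM head, presumably so that a quantization config can identify the layer by its full module path. The sketch below illustrates how such a prefix could be resolved. It assumes `maybe_prefix` joins its two arguments with a dot unless the prefix is empty; the function here is a local stand-in written for illustration, not an import from vLLM, and the parent prefix `"model"` is a hypothetical example.

```python
def maybe_prefix(prefix: str, name: str) -> str:
    """Local stand-in (assumption): join prefix and name with '.',
    or return the bare name when the prefix is empty."""
    return name if not prefix else f"{prefix}.{name}"


# With an empty parent prefix the layer keeps its bare name.
top_level_name = maybe_prefix("", "lm_head")

# With a hypothetical parent prefix the name becomes fully qualified.
nested_name = maybe_prefix("model", "lm_head")

print(top_level_name)
print(nested_name)
```

A per-layer lookup keyed on these names can then tell a quantized head apart from an unquantized one, which may be why the change adds the prefix instead of relying on the default.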