[BugFix] A2 MoE method && layerwise MTP bugfix && Mamba gdn_metadata bugfix (#7364)
### What this PR does / why we need it?

Some bug fixes, mainly including:

1. On A2, the number of experts per card cannot exceed 16 when using MC2. This PR fixes an error in the A2 MoE communication-method selection that chose an incorrect communication method when the model has more than 256 experts. For example, loading a Qwen3.5-series model on the D node of a 16-card A2 PD-disaggregation deployment would incorrectly select MC2. A sketch of the corrected selection logic appears below.
2. Fixed the issue where the layerwise connector sends the kv-cache of the MTP layer multiple times when `num_spec_tokens` > 1. The kv-cache is now sent only on the MTP layer's first forward pass (see the diff below).
3. Fixed an accuracy issue with Qwen3.5 when using MTP under PD disaggregation. The cause is that `num_decode_draft_tokens` does not account for the fact that `spec_tokens` do not yet exist during the first inference under PD disaggregation (they are generated by that first inference), while `spec_tokens_padding` is still added by `recomputed_scheduler`. As a result, `gdn_metadata` incorrectly treats the step as a prefill of length 2; a sketch of this misclassification appears after the diff.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
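To make fix 1 concrete, here is a minimal sketch of the corrected selection guard. All names (`select_moe_comm_method`, `A2_MC2_MAX_EXPERTS_PER_CARD`, the fallback method string) are illustrative assumptions, not the actual vllm-ascend identifiers:

```python
# Minimal sketch of the corrected A2 MoE communication-method selection.
# All identifiers here are hypothetical; the real vllm-ascend names differ.

A2_MC2_MAX_EXPERTS_PER_CARD = 16  # A2 limit for the MC2 kernel (per this fix)

def select_moe_comm_method(global_num_experts: int, world_size: int) -> str:
    """Choose the MoE communication method for an A2 deployment."""
    experts_per_card = global_num_experts // world_size
    if experts_per_card <= A2_MC2_MAX_EXPERTS_PER_CARD:
        return "mc2"
    # A Qwen3.5-class model with more than 256 experts on a 16-card D node
    # has more than 16 experts per card, so MC2 must be ruled out here; the
    # buggy path could still pick MC2 in this case.
    return "all_gather"  # illustrative fallback method name
```

For example, 512 global experts on 16 cards give 32 experts per card, so the guard falls back to the non-MC2 path instead of selecting MC2.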
```diff
@@ -1493,6 +1493,9 @@ class MooncakeLayerwiseConnectorWorker:
     ) -> None:
         """MooncakeLayerwiseConnector does not save explicitly."""
         if self.vllm_config.kv_transfer_config.is_kv_producer and connector_metadata.requests.keys():
+            if self.current_layer >= self.total_layers:
+                self.current_layer += 1
+                return
             # get reshape and cache event
             if layer_name == "":
                 layer_name = self.index_to_name[self.current_layer][0]
```
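The added guard implements fix 2: once `self.current_layer` has advanced past `self.total_layers`, any extra MTP forward pass (which occurs when `num_spec_tokens` > 1) only bumps the counter and returns, so the producer sends the MTP layer's kv-cache exactly once.

For fix 3, here is a minimal sketch of the classification problem. The function name, its arguments, and the threshold logic are illustrative assumptions, not the actual `gdn_metadata` code:

```python
# Minimal sketch of the gdn_metadata misclassification; identifiers are
# hypothetical, not the actual vllm-ascend code.

def is_decode_step(query_len: int, num_decode_draft_tokens: int) -> bool:
    """Treat a request as decode if its query fits one token plus drafts."""
    return query_len <= 1 + num_decode_draft_tokens

# On the first D-node inference under PD disaggregation there are no real
# spec_tokens yet (they are produced by that very step), but the recomputed
# scheduler still pads the query with spec_tokens_padding. If
# num_decode_draft_tokens is derived only from the spec tokens that exist,
# a padded query of length 2 fails the check above and is mislabeled as a
# length-2 prefill; counting the padded positions restores the decode label.
assert is_decode_step(query_len=2, num_decode_draft_tokens=0) is False  # bug
assert is_decode_step(query_len=2, num_decode_draft_tokens=1) is True   # fix
```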