[bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344)

### What this PR does / why we need it?
PR #5672 attempted to remove the -1 padding for duplicate tokens in the
decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler
slicing approach. However, in the single-operator path and in mixed P/D
batches, the decode slot_mapping still contained -1 entries yet shared
the same slicing method, producing an incorrect slot_mapping. This PR
fixes that issue; the logic will be consolidated further in follow-up
refactoring PRs.
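
The core of the bug can be illustrated with a minimal sketch. The values and variable names below are invented for illustration and are not taken from the vLLM codebase; the point is only that slicing a slot_mapping that still carries -1 padding keeps the padding entries, whereas filtering them out recovers the real slots:

```python
# Hypothetical decode slot_mapping where -1 marks padding for
# duplicate tokens (as in the single-operator / mixed P/D path).
slot_mapping = [7, -1, 12, -1, 31, 44]
num_decode_tokens = 4  # number of real (non-padded) decode tokens

# Naive slicing keeps the -1 padding entries -> wrong slots.
sliced = slot_mapping[:num_decode_tokens]

# Filtering out the -1 padding first recovers the real slots.
filtered = [s for s in slot_mapping if s != -1]

print(sliced)    # [7, -1, 12, -1]
print(filtered)  # [7, 12, 31, 44]
```

Slicing is only safe once the padding has been eliminated, which is why sharing the slicing method across paths that still carry -1 entries produced wrong results.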

- vLLM version: v0.14.1
- vLLM main: dc917cceb8

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Author: Qiu
Date: 2026-01-29 16:48:37 +08:00
Committed by: GitHub
Parent: 6a7b3bc29c
Commit: 50e0e87646
2 changed files with 3 additions and 1 deletion

```diff
@@ -29,6 +29,7 @@ from unittest.mock import patch
 import pytest
 import torch_npu
 from modelscope import snapshot_download  # type: ignore
+from tests.e2e.conftest import wait_until_npu_memory_free

 MODELS = ["Qwen/Qwen3-0.6B"]
 MOE_MODELS = ["Qwen/Qwen3-30B-A3B"]
@@ -110,6 +111,7 @@ def test_qwen3_moe_external_launcher_ep_tp2(model):
 @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_NZ": "0"})
+@wait_until_npu_memory_free()
 def test_qwen3_external_launcher_with_sleepmode():
     script = Path(
         __file__
```