[bugfix] Fix accuracy issue in PCP/DCP with speculative decoding (#6491)

### What this PR does / why we need it?

This PR fixes an accuracy issue that occurs when using Prefill/Decode
Context Parallelism (PCP/DCP) in conjunction with speculative decoding
(MTP). The issue is caused by an irregular attention mask shape when
both features are enabled.

The fix involves flattening the `block_table` for speculative decoding
requests under PCP/DCP to ensure a regular attention mask. This PR also
introduces a `use_cp` property for cleaner code and updates dummy runs
to handle this scenario correctly.
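The core idea is that flattening the per-request `block_table` gives downstream mask construction a single regular token axis instead of a ragged per-request shape. A minimal illustrative sketch (hypothetical helper, not the actual vllm-ascend code; names like `flatten_block_table` are assumptions for illustration):

```python
import torch

def flatten_block_table(block_table: torch.Tensor) -> torch.Tensor:
    """Collapse a [num_reqs, max_blocks_per_req] table into a 1-D view,
    so mask construction sees one regular axis even when speculative
    decoding adds extra draft tokens per request under PCP/DCP."""
    return block_table.reshape(-1)

# Two requests, each with up to three KV-cache blocks (0 = padding).
table = torch.tensor([[3, 7, 0],
                      [5, 2, 9]])
flat = flatten_block_table(table)
print(flat.tolist())  # [3, 7, 0, 5, 2, 9]
```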

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix that improves accuracy and should not have
user-facing API changes.

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Commit: 13c4a9c78b (parent: 0ead5e8681)
Author: Wang Kunpeng, committed by GitHub
Date: 2026-02-05 10:06:14 +08:00
3 changed files with 66 additions and 13 deletions


```diff
@@ -73,9 +73,12 @@ def test_generate_pcp_metadata_basic(pcp_size, dcp_size, num_reqs, query_lens,
                       query_lens) - input_batch.num_computed_tokens_cpu
     query_lens = torch.tensor(query_lens)
-    result = pcp_manager.generate_pcp_metadata(total_tokens, query_lens,
-                                               input_batch,
-                                               num_scheduled_tokens)
+    result, _ = pcp_manager.generate_pcp_metadata(total_tokens, query_lens,
+                                                  input_batch,
+                                                  num_scheduled_tokens,
+                                                  torch.tensor([]),
+                                                  num_reqs_padded=num_reqs,
+                                                  num_reqs=num_reqs)
     if not expect_not_none:
         assert result is None, f"Expected to return None, but got {type(result)}"
```