xc-llm-ascend

MengLong Chen 2d49f9079a [BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (#5472)

### What this PR does / why we need it?
**BUG**
When prefill-decode disaggregation, MTP, full graph, and asynchronous
scheduling are used together, the KV cache that decode nodes pull from
prefill nodes does not include spec tokens. As a result, the
`total_num_scheduled_tokens` that decode nodes obtain from the scheduler
lacks spec tokens. When a decode node decides whether to enqueue the
batch into the full graph, the uniform_decode condition
`scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs * max_query_len`
is not met, so the current instance is not enqueued into the full graph.
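
A minimal sketch of why the check fails, assuming every request schedules one regular token plus its spec tokens. The helper names `count_scheduled_tokens` and `is_uniform_decode` are hypothetical and are not the functions used in the repository; only the comparison itself comes from the description above.

```python
def count_scheduled_tokens(num_spec_tokens_per_req: list[int]) -> tuple[int, int]:
    """Return (total_num_scheduled_tokens, max_query_len) for a decode batch.

    Assumption: each request schedules 1 regular token plus its spec tokens.
    """
    query_lens = [1 + n for n in num_spec_tokens_per_req]
    return sum(query_lens), max(query_lens)


def is_uniform_decode(total_num_scheduled_tokens: int,
                      num_reqs: int,
                      max_query_len: int) -> bool:
    """Mirror of the uniform_decode comparison quoted in the description."""
    return total_num_scheduled_tokens == num_reqs * max_query_len


# Four requests with one MTP spec token each: the batch is uniform.
total, max_len = count_scheduled_tokens([1, 1, 1, 1])
uniform_batch = is_uniform_decode(total, 4, max_len)

# Same batch, but one request was just pulled from a prefill node and
# carries no spec tokens: the totals no longer match.
total, max_len = count_scheduled_tokens([1, 1, 1, 0])
mixed_batch = is_uniform_decode(total, 4, max_len)
```

In the second case the total is one token short of `num_reqs * max_query_len`, which is the mismatch that keeps the instance out of the full graph.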

As a result, full graph and eager mode instances coexist among the
decode instances. Because MoeDispatch waits synchronously, the decode
instances running in full graph are significantly slowed down by the
instance in eager mode.

**Solution**
The scenario is prefill-decode disaggregation + MTP + full graph +
asynchronous scheduling.
On the decode nodes, the spec tokens of each request whose KV cache
comes from the prefill node (P) need to be padded. The padded spec
tokens are then rejected during sampling. This ensures that the
uniform_decode condition is satisfied when a decode node decides
whether to enter the full graph, so all decode instances run in the
full graph and no instance waits synchronously on MoeDispatch.
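
The padding step can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `pad_spec_tokens` and `PLACEHOLDER_TOKEN_ID` are hypothetical names, and the placeholder value is arbitrary. Rejection of the padded tokens during sampling is not modeled here.

```python
PLACEHOLDER_TOKEN_ID = -1  # hypothetical padding value, chosen for illustration


def pad_spec_tokens(spec_token_ids: list[list[int]],
                    num_spec_tokens: int,
                    pad_id: int = PLACEHOLDER_TOKEN_ID) -> list[list[int]]:
    """Pad each request's spec tokens up to num_spec_tokens.

    Requests that already carry num_spec_tokens entries are left unchanged.
    """
    return [ids + [pad_id] * (num_spec_tokens - len(ids))
            for ids in spec_token_ids]


# Three decode requests with one spec token expected per request; the
# last one just arrived from a prefill node and has no spec tokens.
padded = pad_spec_tokens([[11], [12], []], num_spec_tokens=1)

# After padding, every request schedules 1 regular token + 1 spec token.
query_lens = [1 + len(ids) for ids in padded]
total_num_scheduled_tokens = sum(query_lens)
uniform = total_num_scheduled_tokens == len(padded) * max(query_lens)
```

With the padding in place, the token total matches `num_reqs * max_query_len` again, so the batch satisfies the uniform_decode comparison.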

- vLLM version: v0.15.0
- vLLM main: 5326c89803

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

2026-02-26 19:09:05 +08:00

__init__.py

[Scheduler] Add AscendScheduler. (#543)

2025-04-17 19:31:50 +08:00

recompute_scheduler.py

[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (#5472)

2026-02-26 19:09:05 +08:00

scheduler_dynamic_batch.py

[CI] Fixed the spell check function in typos.toml (#6753)

2026-02-14 11:57:26 +08:00