xc-llm-ascend

Files

lidenghui1110 1c8c23de58 [Bugfix] fix pipeline parallelism bug introduced by async-scheduling refactor work (#4973)

### What this PR does / why we need it?
Currently, when pipeline parallelism (PP) is combined with prefill-decode
(PD) disaggregation, model_runner returns None from `sample_tokens` on
every stage except the last PP rank. This triggers an assertion error in
vLLM's KVOutputAggregator on [this
line](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_connector/utils.py#L84).

In fact, every PP worker should return a model_runner_output that
contains kv_connector_output, so that the EngineCore scheduler process
can aggregate them and confirm that all KV transfers have finished
before the KV cache is released.
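
As a rough illustration of why a missing output breaks aggregation, the sketch below uses a hypothetical `aggregate_finished` helper. It is not vLLM's KVOutputAggregator, and it simplifies the idea to a single step: a request counts as finished only if every worker reports it, and a `None` report cannot be aggregated at all.

```python
from typing import Optional


def aggregate_finished(per_worker_finished: list[Optional[set[str]]]) -> set[str]:
    """Return request ids that every worker reports as finished.

    Hypothetical helper for illustration only; the real aggregator in
    vLLM is not reproduced here.
    """
    # A worker that returned nothing (None) cannot take part in the
    # aggregation, which is the failure mode described above.
    assert all(f is not None for f in per_worker_finished), (
        "every PP worker must report a kv_connector_output"
    )
    if not per_worker_finished:
        return set()
    # A request is safe to release only when all workers are done with it.
    return set.intersection(*per_worker_finished)
```

With two workers reporting `{"a", "b"}` and `{"a"}`, only `"a"` is considered finished; if one worker reports `None`, the assertion fails.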

To fix this issue, this PR follows gpu_model_runner in vLLM and passes
kv_connector_output in `sample_tokens` so that all ranks return a
ModelRunnerOutput. Workers that are not on the last PP rank return
EMPTY_MODEL_RUNNER_OUTPUT with kv_connector_output attached.
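
The return behaviour described above can be sketched as follows. This is a minimal, self-contained approximation and not the code from this PR: `KVConnectorOutput`, `ModelRunnerOutput`, and the `sample_tokens` signature are simplified stand-ins, and the `is_last_pp_rank` flag is an assumed parameter replacing the real PP group lookup.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KVConnectorOutput:
    # Stand-in type: request ids whose KV transfer has completed.
    finished_sending: set = field(default_factory=set)
    finished_recving: set = field(default_factory=set)


@dataclass
class ModelRunnerOutput:
    # Stand-in type with only the fields needed for this sketch.
    req_ids: list = field(default_factory=list)
    sampled_token_ids: list = field(default_factory=list)
    kv_connector_output: Optional[KVConnectorOutput] = None


EMPTY_MODEL_RUNNER_OUTPUT = ModelRunnerOutput()


def sample_tokens(is_last_pp_rank, kv_connector_output, req_ids, sampled_token_ids):
    """Hypothetical simplification of the per-rank return logic."""
    if not is_last_pp_rank:
        # Non-last ranks have no sampled tokens, but they still report
        # their KV transfer status instead of returning None.
        # Shallow-copy so the shared empty constant is never mutated.
        output = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT)
        output.kv_connector_output = kv_connector_output
        return output
    return ModelRunnerOutput(
        req_ids=req_ids,
        sampled_token_ids=sampled_token_ids,
        kv_connector_output=kv_connector_output,
    )
```

Copying the empty output before attaching kv_connector_output keeps the module-level constant unchanged across steps, which matters because it is shared by every call.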

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>

2025-12-18 15:27:55 +08:00

__init__.py

[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)

2025-07-19 09:42:32 +08:00

block_table.py

[Feature] model_runner refactor (#4764)

2025-12-12 17:27:09 +08:00

model_runner_v1.py

[Bugfix] fix pipeline parallelism bug introduced by async-scheduling refactor work (#4973)

2025-12-18 15:27:55 +08:00

npu_input_batch.py

[Misc] Upgrade vllm hash to 12_14 (#5000)

2025-12-15 19:54:23 +08:00

worker_v1.py

fix profile run for vl model (#5136)

2025-12-17 23:51:31 +08:00