xc-llm-ascend/vllm_ascend/worker
无脸男 69509bcdd6 [bugfix] fix oom in aclgraph (#3158)
### What this PR does / why we need it?
Fix OOM in aclgraph.

1. In the current token dispatch implementation, tensors are mounted on
class instances to ease parameter passing between methods. This prevents
the tensors from being recycled automatically and can, in some cases,
lead to out-of-memory errors. To address this, we manually set these
tensors to `None` to release the corresponding memory.
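The release pattern in item 1 can be sketched in plain Python. `TokenDispatcher`, `dispatch`, and `release` below are hypothetical stand-ins for the real vllm-ascend classes; the point is only that a buffer mounted on `self` stays alive until the attribute is cleared:

```python
import weakref

class Buffer:
    """Hypothetical stand-in for a large tensor."""
    def __init__(self, n):
        self.data = [0.0] * n

class TokenDispatcher:
    """Hypothetical stand-in for the token dispatcher described above."""
    def __init__(self):
        self.hidden_states = None

    def dispatch(self, buf):
        # Mounting the buffer on the instance makes it easy to share
        # between methods, but also keeps it alive indefinitely.
        self.hidden_states = buf
        return self.hidden_states

    def release(self):
        # Dropping the reference explicitly lets the memory be reclaimed,
        # mirroring the manual `= None` fix in this PR.
        self.hidden_states = None

d = TokenDispatcher()
buf = d.dispatch(Buffer(1024))
ref = weakref.ref(buf)
del buf
# The dispatcher still holds the only reference, so the buffer survives.
assert ref() is not None
d.release()
# After the release, nothing references the buffer and CPython frees it.
assert ref() is None
```

The same reasoning applies to NPU tensors: as long as an attribute references them, the allocator cannot hand that memory back.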

2. The `profile_run` method is designed to accurately estimate the
maximum NPU memory usage during vLLM inference. However, in certain
scenarios, MoE models perform inference via MC2, whose communication
consumes additional NPU memory that the profile run does not see,
leading to an underestimated peak. We address this by actively
triggering MC2 during the profile run for initialization.
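The idea in item 2 can be illustrated with a toy profiler. `mc2_dispatch` and `profile_run` here are hypothetical stand-ins (using `tracemalloc` on host memory rather than NPU memory): if an op that allocates extra buffers is never exercised during profiling, its cost is missing from the measured peak.

```python
import tracemalloc

def mc2_dispatch(tokens):
    """Hypothetical stand-in for MC2: allocates extra communication buffers."""
    comm_buffer = [0.0] * (len(tokens) * 4)
    return comm_buffer

def profile_run(num_tokens, warm_up_mc2):
    """Estimate peak memory of a dummy forward pass."""
    tracemalloc.start()
    activations = [0.0] * num_tokens
    if warm_up_mc2:
        # Actively trigger the communication path so that its buffers
        # are counted toward the measured peak, as this PR does for MC2.
        mc2_dispatch(activations)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# Without the warm-up, the estimate misses the communication buffers.
assert profile_run(10_000, warm_up_mc2=True) > profile_run(10_000, warm_up_mc2=False)
```

An underestimated peak is what allows the later real MC2 call to push the process past the memory budget and OOM.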

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main: 52d0cb8458

Signed-off-by: WithHades <244036962@qq.com>
2025-09-26 08:57:47 +08:00