xc-llm-ascend

Author	SHA1	Message	Date
Li Wang	83a4065b4b	[CI] Add pre-commit check for patch logger (#7446 ) ### What this PR does / why we need it? See https://github.com/vllm-project/vllm-ascend/pull/7402, pre-commit hook will forbid init_logger(__name__) in vllm_ascend patch modules - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-19 16:53:20 +08:00
wangxiaoteng888	c7157af8f7	[P/D] LayerwiseConnector supports the virtual push functionality on node D. (#7361 ) ### What this PR does / why we need it? LayerwiseConnector supports the virtual push functionality on node D. By adding a do_virtual flag to request metadata, the system can now identify and process certain requests virtually, bypassing the actual KV cache transfer process. This allows for immediate completion of these requests from the consumer's perspective, potentially enabling optimizations or specific testing scenarios where physical data transfer is not required. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By CI - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-03-18 10:50:02 +08:00
Chao Lei	d9ac7e8539	[Bugfix] Assertion error when decode prefix cache fully hits (#7236 ) ### What this PR does / why we need it? #### Problem When decode node enables prefix cache and the local prefix cache fully hits, the following assertion error occurs: ``` (EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 520, in step_with_batch_queue (EngineCore_DP3 pid=34912) engine_core_outputs = self.scheduler.update_from_output( (EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 1520, in update_from_output (EngineCore_DP3 pid=34912) self._update_from_kv_xfer_finished(kv_connector_output) (EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 2120, in _update_from_kv_xfer_finished (EngineCore_DP3 pid=34912) assert RequestStatus.is_finished(req.status) (EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP3 pid=34912) AssertionError ``` The error is triggered in scheduler.py at _update_from_kv_xfer_finished: ``` if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS: self.finished_recving_kv_req_ids.add(req_id) else: assert RequestStatus.is_finished(req.status) ``` #### Root Cause When decode node has prefix cache enabled and local prefix cache fully hits: 1. get_num_new_matched_tokens returns ext_tokens=0, load_kv_async=False when decode prefix cache fully hits 2. Request status becomes RUNNING (not WAITING_FOR_REMOTE_KVS) 3. However, update_state_after_alloc still adds the request to _reqs_need_recv because remote_block_ids exists in kv_transfer_params 4. Worker processes the request in _handle_request: - _transfer_kv_cache returns immediately (no actual transfer, local_block_ids is empty) - finally block still calls update_done_task_count(request_id) 5. 
finished_recving contains this request 6. When _update_from_kv_xfer_finished processes finished_recving, request status is RUNNING 7. Assertion fails #### Solution In _handle_request, only notify scheduler (update_done_task_count) when actual KV transfer happened (local_block_ids is not empty). The signals to notify Prefill to release KVCache (_send_done_signal_to_free_remote_port and _send_done_recv_signal) are still sent regardless. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: LCAIZJ <leichao139636@163.com>	2026-03-17 15:17:45 +00:00
zxr2333	5645ca8392	[BugFix]A2 MOE method&& layerwise MTP bugfix && Mamba gdn_metadata bugfix (#7364 ) ### What this PR does / why we need it? Some bug fixes, mainly including: 1. For A2, the number of experts on a single card cannot exceed 16 when using MC2. This PR fixes an error in the A2 MoE communication method selection that caused an incorrect communication method to be selected when the number of model experts exceeds 256. For example, when using A2 with 16 cards to load the PD-disaggregation D node with Qwen3.5 series models, the incorrect MC2 method would be chosen. 2. Fixed the issue where the layerwise connector sends the kv-cache of the MTP layer multiple times when `num_spec_tokens` > 1. Now, the kv-cache is sent only on the first forward pass of the MTP layer. 3. Fixed the accuracy issue of Qwen3.5 when using MTP for PD disaggregation. The cause is that `num_decode_draft_tokens` does not account for `spec_tokens` not yet existing during the first inference under PD disaggregation (`spec_tokens` are generated during the first inference), while `spec_tokens_padding` is still added by `recomputed_scheduler`. As a result, `gdn_metadata` incorrectly treats the step as a prefill of length 2. --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-17 23:03:45 +08:00
pichangping	3f39ac9c8d	[Feature]Supports DSv3.1 PD separation and C8 quantization (#7222 ) Co-authored-by: kunpengW-code <1289706727@qq.com> Co-authored-by: linsheng1 <1950916997@qq.com> ### What this PR does / why we need it? Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD separation scenario. C8 refers to quantizing the KV cache to int8, which aims to reduce the GPU memory usage of the KV cache and improve the inference throughput. Constraints: 1. Only the PD separation mode can be used and MooncakeLayerwiseConnector can be used to run the model. 2. Currently, only the activation value supports dynamic quantization, and the KV cache supports static quantization. C8 quantization with MTP is not supported. You can use ModelSlim for quantization. The quantization procedure is as follows: pip install transformers==4.48.2 git clone https://gitcode.com/Ascend/msmodelslim.git cd msmodelslim bash install.sh cd example/DeepSeek/ python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pichangping <1337510399@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-03-16 22:49:05 +08:00
DreamerLeader	199df03524	[BugFix]Fix CI errors “ascend_transport.so: cannot open shared object file: No such file or directory” (#7242 ) ### What this PR does / why we need it? Conditional Import for Mooncake: The import of mooncake.engine.TransferEngine was moved into a try-except block within the GlobalTE class's constructor. This ensures that mooncake is only imported when needed and provides a clear error message with installation instructions if it's missing. ### Does this PR introduce _any_ user-facing change? The error message "ascend_transport.so: cannot open shared object file: No such file or directory" in the CI is fixed to ensure the normal running of the CI. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>	2026-03-14 21:23:05 +08:00
Junyuan	6852a2e267	[feat] add LMCacheAscendConnector (#6882 ) ### What this PR does / why we need it? LMCache-Ascend is LMCache's solution on the Ascend platform and one of the KVCache pooling solutions for Ascend. We hope to integrate LMCache-Ascend into the vLLM-Ascend community as one of the official KVCache pooling solutions for vLLM-Ascend. We added a new LMCacheAscendConnector in vLLM-Ascend and registered it. ### Does this PR introduce _any_ user-facing change? Users can specify the kvconnector using `--kv-transfer-config`, allowing them to freely choose which kvconnector to use, without any user-facing change. ### How was this patch tested? Test by specifying `--kv-transfer-config '{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: chloroethylene <jjysama@gmail.com>	2026-03-13 17:41:35 +08:00
xleoken	77b43492ae	improve the ttft when use mooncake (#6125 ) ### What this PR does / why we need it? improve performance of mooncake by change the log level from info to debug ### ENV 2P + 4D, EP 1. benchmark script ``` evalscope perf \ --parallel 512 \ --number 1024 \ --model deepseek \ --url http://localhost:9000/v1/chat/completions \ --api openai \ --dataset random \ --max-tokens 2 \ --min-tokens 2 \ --prefix-length 0 \ --min-prompt-length 512 \ --max-prompt-length 512 \ --tokenizer-path /tmp/DeepSeek-v3-0324-w8a8-0814 \ --extra-args '{"ignore_eos": true}' \ --rate 2 ``` 2. before patch ``` +-----------------------------------+-----------+ \| Key \| Value \| +===================================+===========+ \| Time taken for tests (s) \| 209.484 \| +-----------------------------------+-----------+ \| Number of concurrency \| 512 \| +-----------------------------------+-----------+ \| Request rate (req/s) \| 6 \| +-----------------------------------+-----------+ \| Total requests \| 1024 \| +-----------------------------------+-----------+ \| Succeed requests \| 1022 \| +-----------------------------------+-----------+ \| Failed requests \| 2 \| +-----------------------------------+-----------+ \| Output token throughput (tok/s) \| 9.7573 \| +-----------------------------------+-----------+ \| Total token throughput (tok/s) \| 2507.62 \| +-----------------------------------+-----------+ \| Request throughput (req/s) \| 4.8786 \| +-----------------------------------+-----------+ \| Average latency (s) \| 7.0561 \| +-----------------------------------+-----------+ \| Average time to first token (s) \| 5.7444 \| +-----------------------------------+-----------+ \| Average time per output token (s) \| 1.3117 \| +-----------------------------------+-----------+ \| Average inter-token latency (s) \| 1.3117 \| +-----------------------------------+-----------+ \| Average input tokens per request \| 512 \| +-----------------------------------+-----------+ \| Average 
output tokens per request \| 2 \| +-----------------------------------+-----------+ 2026-01-22 14:56:32 - evalscope - INFO: Percentile results: +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| Percentiles \| TTFT (s) \| ITL (s) \| TPOT (s) \| Latency (s) \| Input tokens \| Output tokens \| Output (tok/s) \| Total (tok/s) \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| 10% \| 0.6062 \| 0.5113 \| 0.5113 \| 1.234 \| 512 \| 2 \| 0.0888 \| 22.8338 \| \| 25% \| 0.7248 \| 0.5639 \| 0.5639 \| 1.4114 \| 512 \| 2 \| 0.2 \| 51.3919 \| \| 50% \| 0.9092 \| 0.7748 \| 0.7748 \| 1.6767 \| 512 \| 2 \| 1.1935 \| 306.7171 \| \| 66% \| 1.0745 \| 1.0345 \| 1.0345 \| 3.1308 \| 512 \| 2 \| 1.3395 \| 344.2495 \| \| 75% \| 7.0812 \| 1.5389 \| 1.5389 \| 10.0016 \| 512 \| 2 \| 1.417 \| 364.1808 \| \| 80% \| 10.6944 \| 1.8552 \| 1.8552 \| 13.3717 \| 512 \| 2 \| 1.4778 \| 379.7911 \| \| 90% \| 19.2342 \| 2.4325 \| 2.4326 \| 22.5105 \| 512 \| 2 \| 1.6208 \| 416.5381 \| \| 95% \| 24.4399 \| 2.8289 \| 2.8289 \| 26.0329 \| 512 \| 2 \| 1.7548 \| 450.9942 \| \| 98% \| 45.0941 \| 3.4098 \| 3.4098 \| 45.6287 \| 512 \| 2 \| 1.8193 \| 467.5476 \| \| 99% \| 46.2786 \| 3.8492 \| 3.8492 \| 46.9282 \| 512 \| 2 \| 1.8576 \| 477.4157 \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ ``` 3. 
after patch ``` Benchmarking summary: +-----------------------------------+-----------+ \| Key \| Value \| +===================================+===========+ \| Time taken for tests (s) \| 191.613 \| +-----------------------------------+-----------+ \| Number of concurrency \| 512 \| +-----------------------------------+-----------+ \| Request rate (req/s) \| 6 \| +-----------------------------------+-----------+ \| Total requests \| 1024 \| +-----------------------------------+-----------+ \| Succeed requests \| 1024 \| +-----------------------------------+-----------+ \| Failed requests \| 0 \| +-----------------------------------+-----------+ \| Output token throughput (tok/s) \| 10.6882 \| +-----------------------------------+-----------+ \| Total token throughput (tok/s) \| 2746.87 \| +-----------------------------------+-----------+ \| Request throughput (req/s) \| 5.3441 \| +-----------------------------------+-----------+ \| Average latency (s) \| 2.0407 \| +-----------------------------------+-----------+ \| Average time to first token (s) \| 0.7989 \| +-----------------------------------+-----------+ \| Average time per output token (s) \| 1.2419 \| +-----------------------------------+-----------+ \| Average inter-token latency (s) \| 1.2419 \| +-----------------------------------+-----------+ \| Average input tokens per request \| 512 \| +-----------------------------------+-----------+ \| Average output tokens per request \| 2 \| +-----------------------------------+-----------+ 2026-01-22 15:10:31 - evalscope - INFO: Percentile results: +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| Percentiles \| TTFT (s) \| ITL (s) \| TPOT (s) \| Latency (s) \| Input tokens \| Output tokens \| Output (tok/s) \| Total (tok/s) \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| 10% \| 0.5727 \| 0.5051 \| 
0.5051 \| 1.1761 \| 512 \| 2 \| 1.0368 \| 266.4696 \| \| 25% \| 0.6497 \| 0.5324 \| 0.5324 \| 1.3159 \| 512 \| 2 \| 1.1763 \| 302.3184 \| \| 50% \| 0.7767 \| 0.6908 \| 0.6908 \| 1.4793 \| 512 \| 2 \| 1.3521 \| 347.4944 \| \| 66% \| 0.8711 \| 0.7912 \| 0.7912 \| 1.5916 \| 512 \| 2 \| 1.4518 \| 373.1092 \| \| 75% \| 0.9125 \| 0.8797 \| 0.8797 \| 1.7008 \| 512 \| 2 \| 1.521 \| 390.9018 \| \| 80% \| 0.9381 \| 0.9442 \| 0.9442 \| 1.7657 \| 512 \| 2 \| 1.5749 \| 404.7606 \| \| 90% \| 0.994 \| 1.0818 \| 1.0818 \| 1.9289 \| 512 \| 2 \| 1.7006 \| 437.0518 \| \| 95% \| 1.0369 \| 1.2454 \| 1.2454 \| 2.2154 \| 512 \| 2 \| 1.7937 \| 460.9731 \| \| 98% \| 1.1237 \| 18.8814 \| 18.8814 \| 19.4607 \| 512 \| 2 \| 1.8755 \| 482.0097 \| \| 99% \| 1.6752 \| 24.4406 \| 24.4406 \| 25.4734 \| 512 \| 2 \| 1.907 \| 490.0993 \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ ``` --------- Signed-off-by: xleoken <xleoken@163.com>	2026-03-12 16:13:48 +08:00
pz1116	a7f91fce71	[KV Pool]get_num_new_matched_tokens return 0 if token length < block_size (#7146 ) ### What this PR does / why we need it? Currently, we call lookup_client to look up token hits in the KV Pool. However, when the token length < block size, the key is empty, so there is no point in querying the KV Pool backend because there can never be a hit. Hence, add an early return in `get_num_new_matched_tokens` when `token_len` < `block_size` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: fems14 <1804143737@qq.com>	2026-03-11 15:05:34 +08:00
zxr2333	239683c7a6	[P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (#7022 ) ### What this PR does / why we need it? Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-10 23:59:20 +08:00
xleoken	146b9d2a83	[BugFix] fix metadata execute error: integer modulo by zero (#6521 ) ### What this PR does / why we need it? fix metadata execute error: integer modulo by zero - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: xleoken <xleoken@163.com>	2026-03-10 09:58:06 +08:00
zxr2333	675387f1fd	[P/D][KVPool]Mooncake Layerwise Connector supports kv_pool (#7032 ) ### What this PR does / why we need it? This PR creates and registers `ascend_multi_connector`, which allows the `mooncake_layerwise_connector` to use the kv_pooling feature. We unregister vLLM's original `MultiConnector` and replace it with `AscendMultiConnector` when registering the connectors. ### Does this PR introduce _any_ user-facing change? No. Users can use `MultiConnector` to initialize `AscendMultiConnector`. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-09 10:49:04 +08:00
fems14	ae394767d4	【main】ADXL/HIXL supports FabricMem Mode (#6806 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-03-05 21:04:11 +08:00
Yuzhou Tong	9180dd6c51	[BugFix][PCP] Fix precision bugs for pcp/dcp in PD disaggregate (#6876 ) ### What this PR does / why we need it? Fix a bug in PD disaggregation with PCP/DCP: some conditions only consider MLA and ignore DSA. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `15d76f74e2` - vLLM Ascend main: `81fb7d5779` Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com> Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>	2026-03-02 16:11:00 +08:00
Canlin Guo	e4458b2d2b	[Main2Main] Upgrade vLLM to 0226 (#6813 ) ### What this PR does / why we need it? Breaking: 1. https://github.com/vllm-project/vllm/pull/33452 2. https://github.com/vllm-project/vllm/pull/33451 3. https://github.com/vllm-project/vllm/pull/32567 4. https://github.com/vllm-project/vllm/pull/32344 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-02-27 16:05:21 +08:00
DreamerLeader	812c722cfb	[KVPool][BugFix] Correctly initialize head_or_tp_rank for mooncake backend (#6498 ) ### What this PR does / why we need it? The problem that the local priority is not used in the A2 environment on the Mooncake node is resolved. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>	2026-02-25 14:22:00 +08:00
xleoken	747484cb64	[Bugfix] Fix wrong computed_tokens when meet exception. (#6522 ) ### What this PR does / why we need it? Fix wrong computed_tokens when an exception occurs. This pull request addresses a bug in the KV transfer mechanism where an exception during token lookup operations could lead to an incorrect count of computed_tokens. By modifying the exception handling in both the lookup and lookup_scheduler functions to return 0 instead of the start index, the system now correctly indicates that no tokens were successfully processed when a remote connection failure occurs. This enhancement improves the robustness and accuracy of token management within the vllm_ascend distributed KV pool. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Signed-off-by: xleoken <xleoken@163.com>	2026-02-24 15:29:30 +08:00
SILONG ZENG	e2237819a9	[CI]Fixed the spell check function in `typos.toml` (#6753 ) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-14 11:57:26 +08:00
wangxiaoteng888	b881fab416	[P/D][PCP] mooncake layerwise support pcp function (#6627 ) ### What this PR does / why we need it? mooncake layerwise support pcp function PCP (Prefill Context Parallelism) Support: Introduced explicit support for Prefill Context Parallelism (PCP) and Decode Context Parallelism (DCP) in the Mooncake layerwise KV cache transfer mechanism, allowing for more granular control and awareness of parallel configurations during data transfer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-02-12 11:02:25 +08:00
yejj	8b23554741	[Misc] gen kv events in ascendconnector (#6593 ) ### What this PR does / why we need it? See https://github.com/vllm-project/vllm-ascend/issues/6391. This PR adapts the complete event publishing process in vLLM: * `kv_connector_model_runner_mixin` invokes the kv-connector's `get_kv_connector_kv_cache_events` function to collect KV events * in `scheduler.py`, the `update_from_output` function invokes `_update_from_kv_xfer_finished`, which invokes `connector.update_connector_output` to collect KV events from all KV workers; the scheduler then invokes the `connector.take_events` API to collect all KV events and adds them to the events from `kv_cache_manager` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? You can add the `--kv-events-config` parameter to the `vllm serve` command to enable this feature. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: yejj710 <abyss1999@163.com> Co-authored-by: fems14 <1804143737@qq.com>	2026-02-12 11:01:09 +08:00
lty	c3db1aca2f	[Refactor]refactor p2p connector (#6551 ) ### What this PR does / why we need it? Redundant code is removed, and repeated logic is combined through the p2p connector refactor, making the code easy to extend. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? P node: ``` vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host 0.0.0.0 \ --port 8002 \ --data-parallel-size 2 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --seed 1024 \ --served-model-name model \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --max-num-seqs 16 \ --enforce-eager \ --trust-remote-code \ --gpu-memory-utilization 0.92 \ --quantization ascend \ --async-scheduling \ --additional-config '{"ascend_scheduler_config":{"enabled":true}}' \ --kv-transfer-config \ '{ "kv_connector": "MultiConnector", "kv_role": "kv_producer", "kv_connector_extra_config": { "use_layerwise": false, "connectors": [ { "kv_connector": "MooncakeConnectorV1", "kv_role": "kv_producer", "kv_port": "30000", "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 2, "tp_size": 8 }, "decode": { "dp_size": 4, "tp_size": 4 } } }, { "kv_connector": "AscendStoreConnector", "kv_role": "kv_producer", "kv_connector_extra_config": { "backend": "mooncake", "mooncake_rpc_port":"0" } } ] } }' ``` D node: ``` vllm serve /mnt/share/DeepSeek-V3.2-Exp-W8A8 \ --host 0.0.0.0 \ --port 8003 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --enable-expert-parallel \ --seed 1024 \ --served-model-name model \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --max-num-seqs 16 \ --enforce-eager \ --trust-remote-code \ --gpu-memory-utilization 0.92 \ --quantization ascend \ --async-scheduling \ --additional-config '{"ascend_scheduler_config":{"enabled":true}}' \ --kv-transfer-config \ '{ "kv_connector": "MultiConnector", "kv_role": "kv_consumer", "kv_connector_extra_config": { "use_layerwise": false, "connectors": [ { "kv_connector": "MooncakeConnectorV1", 
"kv_role": "kv_consumer", "kv_port": "30100", "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 2, "tp_size": 8 }, "decode": { "dp_size": 4, "tp_size": 4 } } },{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_consumer", "kv_connector_extra_config": { "backend": "mooncake", "mooncake_rpc_port":"1" } } ] } }' ``` - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-07 09:27:15 +08:00
meihanc	922e5c163b	[main2main] upgrade vllm main 0202 (#6560 ) ### What this PR does / why we need it? 1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to https://github.com/vllm-project/vllm/pull/32567 2. Fix ` TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to https://github.com/vllm-project/vllm/pull/33035 3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and AttributeError: 'bool' object has no attribute 'process_weights_after_loading' due to https://github.com/vllm-project/vllm/pull/33284 4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'`due to https://github.com/vllm-project/vllm/pull/32790 5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to https://github.com/vllm-project/vllm/pull/32005 6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'` due to https://github.com/vllm-project/vllm/pull/27492 7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward, vllm.moe_forward_shared due to https://github.com/vllm-project/vllm/pull/33184 8. Add patch to fix the problem "got multiple values for keyword argument 'add_special_tokens'" due to https://github.com/vllm-project/vllm/pull/32863 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 19:31:17 +08:00
lty	33b8ca4e96	[Feature]KV pool supports sparse attention (#6339 ) ### What this PR does / why we need it? The kv pooling feature is adapted to Sparse Attention to support models such as Deepseek V3.2. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? ``` vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host $local_ip \ --port 8002 \ --served-model-name model \ --data-parallel-size 1 \ --tensor-parallel-size 8 \ --prefill-context-parallel-size 2 \ --decode-context-parallel-size 1 \ --cp-kv-cache-interleave-size 128 \ --block-size 128 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --no-enable-chunked-prefill \ --max-num-seqs 4 \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.95 \ --trust-remote-code \ --enforce-eager \ --quantization ascend \ --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \ --kv-transfer-config \ '{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "backend": "mooncake", "lookup_rpc_port":"0", "use_layerwise": false } }' ``` - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-05 10:36:52 +08:00
DreamerLeader	2dac18afea	[Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (#6126 ) ### What this PR does / why we need it? Fix of Pooling Code and Update of Pooling Usage Guide ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? PR: [[Bugfix]Fixed precision issues caused by pooled request pooling](https://github.com/vllm-project/vllm-ascend/pull/6049), ready for review - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Signed-off-by: fangjianwei <f30058701@china.huawei.com> Signed-off-by: DreamerLeader <88812830+DreamerLeader@users.noreply.github.com> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: fangjianwei <f30058701@china.huawei.com>	2026-02-04 16:35:41 +08:00
lidenghui1110	79803932e2	[Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366 ) ### What this PR does / why we need it? As #2947 describes, we need to transpose the kv cache layout after GQA kv transfer when the prefill and decode tensor parallel sizes are heterogeneous. In the previous implementation, we used `npu_paged_cache_load` + `transpose` + `_npu_reshape_and_cache` to do this work. This is not an efficient plan: the ops above need to be called for each layer, which introduces 3 * layer_num kernel launches and 6 * layer_num data movements between L1 Cache and HBM for one request on the decode node. The decode node usually uses graph mode, so these op kernels, launched by an async thread in the mooncake connector, run between decode forward passes; they may last for several decode forward passes, and TTFT increases by 3~4 decode forward times. In this PR, we implement an AscendC fused op `transpose_kv_cache_by_block` that does this with a single kernel launch and moves data between L1 Cache and HBM only once. With this fused op, the time spent transposing the kv cache layout decreases from 7ms to 0.24ms in UT on 910C, and in the PD disaggregation scenario, TTFT decreases by about 90 ~ 110 ms on qwen3-235B. \| request_num \| original \| fused_op\| \|:----------------------:\|:---------------:\|:-------------------:\| \| 1 \| 643 ms \| 578 ms \| \| 128 \| 1480 ms \| 1368 ms \| ### Does this PR introduce _any_ user-facing change? The fused op is used by default. In case the op has a bug in some scenario, an environment variable provides a fallback that disables it. Disable the fused op by adding the following env `export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0` ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-02-03 14:10:01 +08:00
lty	082aa2e5b7	[Bugfix]The service fails to be started when the memcache pool is enabled (#6229 ) ### What this PR does / why we need it? The service fails to be started when the memcache pool is enabled without configuring the mooncake path. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? ``` #memcache echo 200000 > /proc/sys/vm/nr_hugepages source /usr/local/memfabric_hybrid/set_env.sh source /usr/local/memcache_hybrid/set_env.sh source /usr/local/Ascend/ascend-toolkit/set_env.sh source /usr/local/Ascend/nnal/atb/set_env.sh export MMC_LOCAL_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-local.conf vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host $local_ip \ --port 8002 \ --served-model-name model \ --data-parallel-size 2 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --no-enable-chunked-prefill \ --max-num-seqs 4 \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.9 \ --trust-remote-code \ --enforce-eager \ --quantization ascend \ --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \ --kv-transfer-config \ '{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "backend": "memcache", "lookup_rpc_port":"0" } }' ``` - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-02 16:26:18 +08:00
liziyu	d252e4f5ec	[P/D] Using the cache load operator to replace the index select operator. (#6295 ) ### What this PR does / why we need it? Using the cache load operator to replace the index select operator. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-30 14:27:53 +08:00
zxr2333	14bd55f30c	[P/D][BugFix] Fix layerwise P/D request_id error (#6360 ) ### What this PR does / why we need it? Fix layerwise Connector P/D request_id error, due to vllm pr: https://github.com/vllm-project/vllm/pull/27987, which will add a random suffix to request_id in EngineCore. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-01-29 20:19:05 +08:00
JiangWeixiang	41a52beb26	[bugfix] resolve kv cache leak on P-side due to incorrect req_id (#6325 ) ### What this PR does / why we need it? This PR fixes a critical bug in the PD-separated inference pipeline where KV cache on the Prefill (P) side was not being properly released. The issue arises when multiple clients use the same x-request-id: to avoid request ID collisions, both Prefill and Decode nodes append a random suffix to the incoming x-request-id. A previous PR ensured consistency by having the P-side pass its final request_id as remote_request_id to the D-side via kv_transfer_param. However, during KV cache cleanup, the D-side incorrectly used the local req_id (instead of remote_request_id) to select the target P-side rank. This mismatch caused the P-side KV cache to remain unreleased on certain ranks, leading to memory leaks. This PR corrects the logic to use remote_request_id consistently when determining the P-side rank. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The fix was validated by running multiple concurrent benchmark instances - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: ghphotoframe <854746559@qq.com>	2026-01-29 16:05:56 +08:00
yuxinshan	0bb1f91c2c	[Feature] Mooncake connector get remote ptp size (#5822 ) ### What this PR does / why we need it? To support elastic scaling when using mooncake connector, we should support to configure different tp sizes for different nodes. As a result, we transfer the prefill node information, such as tp size, through the request's kv_transfer_params. The decode nodes get the prefill tp size through the request's kv_transfer_params, instead of getting it from the configuration of the mooncake connector . - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com>	2026-01-26 14:28:33 +08:00
wangxiyuan	4e3919e965	Reapply "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) (#6231 ) This reverts commit `95649344aa`. The CI failure isn't related to this change. Let's reapply it. - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-26 09:04:54 +08:00
wangxiyuan	95649344aa	Revert "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) This reverts commit `8966a99710`. It breaks the test `tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-25 15:25:38 +08:00
SILONG ZENG	6ccccad102	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #5 ) (#5996 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `.../distributed/kv_transfer/kv_pool/ascend_store/ascend_store_connector.py` \| \| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/backend/backend.py` \| \| ` .../distributed/kv_transfer/kv_pool/ascend_store/backend/memcache_backend.py` \| \| ` .../distributed/kv_transfer/kv_pool/ascend_store/backend/mooncake_backend.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/config_data.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/kv_transfer.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py` \| \| ` .../distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py` \| \| ` .../distributed/kv_transfer/kv_pool/cpu_offload/cpu_offload_connector.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/metadata.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ucm_connector.py` \| \| ` vllm_ascend/distributed/kv_transfer/utils/mooncake_transfer_engine.py` \| \| ` vllm_ascend/distributed/kv_transfer/utils/utils.py` \| \| ` vllm_ascend/kv_offload/cpu_npu.py` \| \| ` vllm_ascend/kv_offload/npu.py` \| \| ` vllm_ascend/lora/lora_ops.py` \| \| ` vllm_ascend/lora/punica_npu.py` \| \| ` vllm_ascend/lora/utils.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-01-24 22:45:38 +08:00
SILONG ZENG	153da1a669	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #4 ) (#6200 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/distributed/kv_transfer/__init__.py` \| \| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py` \| \| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-24 20:40:48 +08:00
LICO67373	8966a99710	[Refactor] Unify full-graph parameter update logic (#6041 ) ### What this PR does / why we need it? Refactor: Unify full-graph parameter update logic This PR consolidates the scattered full-graph parameter update logic into a unified approach, improving code architecture and eliminating duplication. Key improvements: 1. Unified interface - Create `update_full_graph_params` as the single entry point for all full-graph updates - Replace multiple scattered update calls with one unified function - Remove ~50 lines of duplicated if-else logic across `model_runner_v1.py` and `eagle_proposer.py` 2. Better architecture - Move update logic to respective Backend classes (`AscendAttentionBackend`, `AscendMLABackend`) - Each Backend manages its own parameter update logic internally - Simplify caller code to just dispatch to the appropriate Backend 3. Cleaner parameter handling - Remove unnecessary `pcp_size` and `dcp_size` parameter passing - Get parallel configuration directly from distributed groups - Consistent with how other parts of the codebase obtain these values Why we need it: - Maintainability: Future changes only need to be made in one place per Backend - Code quality: Follows DRY principle and Single Responsibility Principle - Readability: Cleaner, more intuitive code structure ### Does this PR introduce _any_ user-facing change? No. This is a pure refactoring with no functional changes - same behavior, cleaner code. ### How was this patch tested? - All existing unit tests pass with updated mocks - No new tests needed (pure refactoring, no behavior changes) - CI validates correctness --- - vLLM version: v0.13.0 Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-24 20:12:57 +08:00
liziyu	f66bcdfb29	[P/D] Mooncake connector add zmq socket fail log (#6155 ) Mooncake connector add zmq socket fail log - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-24 12:06:42 +08:00
liziyu	14bef9af6f	[P/D] Remove restrictions on mooncake for IPv6 (#5946 ) ### What this PR does / why we need it? Remove restrictions on mooncake for IPv6 Dependencies: cann8.5、mooncake v0.3.8.post1 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-24 11:30:22 +08:00
UnifiedCacheManager	a2f022f9b6	[UCMConnector]Add has_connector_metadata (#6172 ) ### What this PR does / why we need it? ucm_connector add has `has_connector_metadata` interface to adapt to the latest KV connector in vLLM. ### Does this PR introduce _any_ user-facing change? this PR doesn't introduce _any_ user-facing change. ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2026-01-23 21:16:48 +08:00
baxingpiaochong	8786412f5c	[Bugfix]KV pool rank 0 consumes more HBM (#6113 ) ### What this PR does / why we need it? before add_set_deivce <img width="2354" height="674" alt="image" src="https://github.com/user-attachments/assets/8b81ab5f-b9ba-4fd2-8546-8f36ac15d32b" /> after <img width="1044" height="156" alt="image" src="https://github.com/user-attachments/assets/996d845a-8abd-4aae-b894-4a9832b1f742" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2026-01-23 19:47:33 +08:00
weiguihua2	4173255c0c	[main][Bugfix] fix kv pcp+pooling+pd separation bug (#6153 ) ### What this PR does / why we need it? Fix a bug in the scenario that combines pcp, pd separation, and kv pooling. In the pooling scenario, multi_nodes_meta_mapping is empty. As a result, an error is reported when the remote_host information is obtained through the get_remote_port_send_num method. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-23 16:15:04 +08:00
wangxiaoteng888	82a2b3bcc7	[P/D]Add ssl cert for metaserver proxy (#5875 ) ### What this PR does / why we need it? When the P node accesses the proxy metaserver, add the SSL certificate and the CA certificate path to improve security. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-23 11:11:44 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
wangxiaoteng888	f2c0ced06d	[P/D][PCP]bugfix pcp force free twice caused logger error (#6124 ) ### What this PR does / why we need it? The issue of the D node mistakenly sending the pull-end signal twice, leading to the P node printing logger errors abnormally, has been resolved. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-22 16:24:33 +08:00
Li Wang	484e7c59dc	[CI] optimize lint term (#5986 ) ### What this PR does / why we need it? This patch aims to optimize the lint check term. The main idea is to reduce unnecessary installation time. 1. Installing vllm is not required; appending the path of the vllm src to `PYTHONPATH` is sufficient 2. Installing `requirements-dev.txt` is not required either; we have a pre-built image `quay.io/ascend-ci/vllm-ascend:lint` with all the requirements installed in advance. NOTE: the conditions for triggering image builds are: 1) Daily scheduled build; 2) Build when requirements are modified; 3) Manual build. This ensures that the dependencies in our image are up-to-date to the greatest extent possible. 3. The `mypy` was separated from the `pre-commit` hook for performance reasons; we found that integrating `mypy` into the `pre-commit` hook resulted in poor performance. 4. Reduce the CPU core consumption from 16 -> 8 ### Does this PR introduce _any_ user-facing change? The end-to-end lint time was optimized from 20min/per PR to 8min/per PR ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 15:46:59 +08:00
JiangWeixiang	cef04b3555	[bugfix] adapt_remote_request_id (#6051 ) This PR addresses a request ID mismatch issue in the PD (Prefill-Decoding) separation deployment scenario for vllm-ascend. Upstream vLLM recently mitigated request ID collisions by appending a random suffix to each request_id (e.g., req-123 → req-123-abc), refer to [PR-27987](https://github.com/vllm-project/vllm/pull/27987 ) & [PR-29665](https://github.com/vllm-project/vllm/pull/29665). While this works in single-node deployments, it breaks compatibility in PD-separated setups: the Producer (Prefill node) and Consumer (Decoding node) end up with different request_id values, preventing the Consumer from correctly retrieving the KV cache generated by the Producer. To resolve this, this PR introduces a new field remote_request_id in the metadata passed via mooncake_connector. The Producer preserves and forwards the original (unmodified) request_id as remote_request_id. The Consumer then uses this remote_request_id—instead of its locally generated suffixed ID—to fetch the correct KV cache from the Prefill node. This ensures consistent request identification across PD nodes while maintaining compatibility with upstream vLLM’s request ID deduplication mechanism. <img width="1279" height="781" alt="image" src="https://github.com/user-attachments/assets/274238c1-dab6-4d3a-9ee4-6e578679b762" /> - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: ghphotoframe <854746559@qq.com> Co-authored-by: jiangweixiang <jwx02384838@antgroup.com>	2026-01-22 10:48:40 +08:00
DreamerLeader	b6d55fc48e	[Bugfix]Fixed precision issues caused by pooled request pooling (#6049 ) ### What this PR does / why we need it? Fixed precision issues caused by pooled request pooling ### Does this PR introduce _any_ user-facing change? pr6045 ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>	2026-01-20 23:51:31 +08:00
fems14	8b98d7a4e8	【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (#6045 ) ### What this PR does / why we need it? Resolved a double-free memory vulnerability in the pooling layer under re-computation scenarios. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: fems14 <1804143737@qq.com>	2026-01-20 22:56:04 +08:00
wangxiaochao6	bc486d9530	[main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (#5960 ) ### What this PR does / why we need it? In PD disaggregation case, when P has multi nodes, mooncake fails to send data. Fix the issue in this PR. The details: If a P rank does not need to transfer kv cache to any one D rank, D node should send a message to P node to release the kv cache in P node. If P has multi nodes, D node should know the corresponding IP in each P node, then D node can send message to the right P node. Otherwise, send data error will happen. This PR fix this issue by providing P nodes IP to D node through Parameter `remote_port_send_num`. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2026-01-19 16:35:13 +08:00
LICO67373	687df88151	[Refactor] Move AttentionSpec initialization to Attention module (#5834 ) ### What this PR does / why we need it? This PR refactors `get_kv_cache_spec` method to delegate AttentionSpec creation to each attention module's own `get_kv_cache_spec()` method, aligning with the vllm source code structure. Changes: - Simplify `get_kv_cache_spec` in `model_runner_v1.py` and `cpu_offload_connector.py` - Remove manual `AttentionType` checks for `Attention` modules - Delegate spec creation to each attention module's `get_kv_cache_spec` method directly - Let `MambaBase` layers use their own `get_kv_cache_spec` method - Keep `use_sparse` hack for `MLAAttention` (DeepSeek DSA mode) as Ascend-specific handling This change follows RFC #5463 item 12: move AttentionSpec to Attention module. - Fixes #5463 (item 12) ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring that simplifies code structure without changing any external behavior. ### How was this patch tested? - Syntax validation passed via `python -m py_compile` - CI tests will verify the changes work correctly with existing test cases - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-19 14:22:18 +08:00
wangxiaoteng888	fff5df3efe	[P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (#5968 ) ### What this PR does / why we need it? The force-free secondary release request causes the node to crash. When requests are pulled too quickly, they should not be added to the delay-free queue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-17 18:49:27 +08:00

60 Commits