xc-llm-ascend

Author	SHA1	Message	Date
pz1116	0b48ddbc8b	[Bugfix][0.18.0][KV Pool]Fix KV transfer put logic (#7718 ) ### What this PR does / why we need it? Previously, when putting blocks into the KV Pool, we found the first non-existing key and put all the blocks starting from that index. However, if the prefix cache blocks came from another request and some of them had been evicted due to LRU, we would put blocks that still exist in the pool, causing MooncakeStore to print unnecessary logs in the master service. What this PR does: Now we look up all the keys and only put the ones that are missing. Fix lookup_scheduler in pool_worker so it handles GQA correctly. Fix a few existing typos. Add UT, written by codex. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: DreamerLeader <2270923832@qq.com> Co-authored-by: fems14 <1804143737@qq.com>	2026-03-31 20:21:23 +08:00
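The difference between the old and new put selection described in #7718 can be sketched as follows. This is a minimal illustration, not the vllm-ascend implementation: the function names and the `exists` list (one boolean per block key, as a pool lookup might return) are hypothetical.

```python
from typing import Sequence


def blocks_from_first_missing(exists: Sequence[bool]) -> list[int]:
    """Old behaviour (sketch): put every block from the first missing key onward."""
    for i, present in enumerate(exists):
        if not present:
            return list(range(i, len(exists)))
    return []


def blocks_missing_only(exists: Sequence[bool]) -> list[int]:
    """New behaviour (sketch): put only the blocks whose keys are missing."""
    return [i for i, present in enumerate(exists) if not present]


# Hypothetical lookup result: blocks 1 and 3 were evicted, blocks 0 and 2 remain.
exists = [True, False, True, False]
old_selection = blocks_from_first_missing(exists)  # also re-puts block 2
new_selection = blocks_missing_only(exists)        # skips block 2
```

With an eviction pattern like the one above, the old selection re-puts a block that is still in the pool, which is the redundant put the commit message describes.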
SILONG ZENG	1e3c1e76bf	[Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511 ) ### What this PR does / why we need it? This PR introduces several upstream `vllm`-aligned lint hooks into `vllm-ascend` and makes them part of the actual `pre-commit` flow. Main changes in this PR: - add `check-boolean-context-manager` to catch boolean expressions in `with` statements - add `check-forbidden-imports` to forbid direct `re` imports and disallowed direct `triton` imports - enable shell script linting through `tools/shellcheck.sh` - add root `.clang-format` aligned with upstream `vllm`, enable `clang-format` in `pre-commit`, temporarily exclude all `csrc/` from `clang-format` to avoid bringing a large native code reformat into this PR This PR focuses on landing the smaller and immediately useful lint alignment first, without mixing in the larger requirements-management migration. ### Does this PR introduce _any_ user-facing change? No. This PR only updates repository lint configuration, static checks, and internal import/style enforcement. It does not change runtime behavior or public interfaces. ### How was this patch tested? Tested locally in the project virtual environment. 
Commands used: ```bash bash format.sh ``` Verified checks passed: ``` bash ruff check...............................................................Passed ruff format..............................................................Passed codespell................................................................Passed typos....................................................................Passed clang-format.............................................................Passed Lint GitHub Actions workflow files.......................................Passed Lint shell scripts.......................................................Passed Lint PNG exports from excalidraw.........................................Passed Check for spaces in all filenames........................................Passed Enforce __init__.py in Python packages...................................Passed Check for forbidden imports..............................................Passed Check for boolean ops in with-statements.................................Passed Suggestion...............................................................Passed - hook id: suggestion - duration: 0s To bypass pre-commit hooks, add --no-verify to git commit. ``` note: clang-format is enabled but currently excludes all csrc/ - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-24 20:03:01 +08:00
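As a rough idea of what a check for boolean expressions in `with` statements could look like, here is a minimal sketch built on the standard `ast` module. It is an assumption-based illustration, not the actual `check-boolean-context-manager` hook; the function name is hypothetical.

```python
import ast


def find_boolean_with_items(source: str) -> list[int]:
    """Return line numbers of `with` statements whose context expression
    is a boolean operation such as `a() and b()` or `a() or b()`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.With, ast.AsyncWith)):
            for item in node.items:
                # `with a() and b():` parses as a single item wrapping a BoolOp,
                # whereas `with a(), b():` parses as two separate items.
                if isinstance(item.context_expr, ast.BoolOp):
                    hits.append(node.lineno)
    return hits
```

The pattern is worth flagging because `with a() and b():` enters only the object the boolean expression evaluates to, whereas `with a(), b():` enters both context managers.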
zouyida2052	0210cc0b07	lower log level in PD Disaggregation (#7589 ) ### What this PR does / why we need it? This log is printed too frequently and is unnecessary, so its level is lowered from INFO to DEBUG. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2026-03-24 18:03:17 +08:00
Li Wang	83a4065b4b	[CI] Add pre-commit check for patch logger (#7446 ) ### What this PR does / why we need it? See https://github.com/vllm-project/vllm-ascend/pull/7402, pre-commit hook will forbid init_logger(__name__) in vllm_ascend patch modules - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-19 16:53:20 +08:00
Junyuan	6852a2e267	[feat] add LMCacheAscendConnector (#6882 ) ### What this PR does / why we need it? LMCache-Ascend is LMCache's solution on the Ascend platform and one of the KVCache pooling solutions for Ascend. We hope to integrate LMCache-Ascend into the vLLM-Ascend community as one of the official KVCache pooling solutions for vLLM-Ascend. We added a new LMCacheAscendConnector in vLLM-Ascend and registered it. ### Does this PR introduce _any_ user-facing change? Users can specify the kvconnector using `--kv-transfer-config`, allowing them to freely choose which kvconnector to use, without any user-facing change. ### How was this patch tested? Test by specifying `--kv-transfer-config '{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: chloroethylene <jjysama@gmail.com>	2026-03-13 17:41:35 +08:00
xleoken	77b43492ae	improve the ttft when use mooncake (#6125 ) ### What this PR does / why we need it? improve performance of mooncake by change the log level from info to debug ### ENV 2P + 4D, EP 1. benchmark script ``` evalscope perf \ --parallel 512 \ --number 1024 \ --model deepseek \ --url http://localhost:9000/v1/chat/completions \ --api openai \ --dataset random \ --max-tokens 2 \ --min-tokens 2 \ --prefix-length 0 \ --min-prompt-length 512 \ --max-prompt-length 512 \ --tokenizer-path /tmp/DeepSeek-v3-0324-w8a8-0814 \ --extra-args '{"ignore_eos": true}' \ --rate 2 ``` 2. before patch ``` +-----------------------------------+-----------+ \| Key \| Value \| +===================================+===========+ \| Time taken for tests (s) \| 209.484 \| +-----------------------------------+-----------+ \| Number of concurrency \| 512 \| +-----------------------------------+-----------+ \| Request rate (req/s) \| 6 \| +-----------------------------------+-----------+ \| Total requests \| 1024 \| +-----------------------------------+-----------+ \| Succeed requests \| 1022 \| +-----------------------------------+-----------+ \| Failed requests \| 2 \| +-----------------------------------+-----------+ \| Output token throughput (tok/s) \| 9.7573 \| +-----------------------------------+-----------+ \| Total token throughput (tok/s) \| 2507.62 \| +-----------------------------------+-----------+ \| Request throughput (req/s) \| 4.8786 \| +-----------------------------------+-----------+ \| Average latency (s) \| 7.0561 \| +-----------------------------------+-----------+ \| Average time to first token (s) \| 5.7444 \| +-----------------------------------+-----------+ \| Average time per output token (s) \| 1.3117 \| +-----------------------------------+-----------+ \| Average inter-token latency (s) \| 1.3117 \| +-----------------------------------+-----------+ \| Average input tokens per request \| 512 \| +-----------------------------------+-----------+ \| Average 
output tokens per request \| 2 \| +-----------------------------------+-----------+ 2026-01-22 14:56:32 - evalscope - INFO: Percentile results: +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| Percentiles \| TTFT (s) \| ITL (s) \| TPOT (s) \| Latency (s) \| Input tokens \| Output tokens \| Output (tok/s) \| Total (tok/s) \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| 10% \| 0.6062 \| 0.5113 \| 0.5113 \| 1.234 \| 512 \| 2 \| 0.0888 \| 22.8338 \| \| 25% \| 0.7248 \| 0.5639 \| 0.5639 \| 1.4114 \| 512 \| 2 \| 0.2 \| 51.3919 \| \| 50% \| 0.9092 \| 0.7748 \| 0.7748 \| 1.6767 \| 512 \| 2 \| 1.1935 \| 306.7171 \| \| 66% \| 1.0745 \| 1.0345 \| 1.0345 \| 3.1308 \| 512 \| 2 \| 1.3395 \| 344.2495 \| \| 75% \| 7.0812 \| 1.5389 \| 1.5389 \| 10.0016 \| 512 \| 2 \| 1.417 \| 364.1808 \| \| 80% \| 10.6944 \| 1.8552 \| 1.8552 \| 13.3717 \| 512 \| 2 \| 1.4778 \| 379.7911 \| \| 90% \| 19.2342 \| 2.4325 \| 2.4326 \| 22.5105 \| 512 \| 2 \| 1.6208 \| 416.5381 \| \| 95% \| 24.4399 \| 2.8289 \| 2.8289 \| 26.0329 \| 512 \| 2 \| 1.7548 \| 450.9942 \| \| 98% \| 45.0941 \| 3.4098 \| 3.4098 \| 45.6287 \| 512 \| 2 \| 1.8193 \| 467.5476 \| \| 99% \| 46.2786 \| 3.8492 \| 3.8492 \| 46.9282 \| 512 \| 2 \| 1.8576 \| 477.4157 \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ ``` 3. 
after patch ``` Benchmarking summary: +-----------------------------------+-----------+ \| Key \| Value \| +===================================+===========+ \| Time taken for tests (s) \| 191.613 \| +-----------------------------------+-----------+ \| Number of concurrency \| 512 \| +-----------------------------------+-----------+ \| Request rate (req/s) \| 6 \| +-----------------------------------+-----------+ \| Total requests \| 1024 \| +-----------------------------------+-----------+ \| Succeed requests \| 1024 \| +-----------------------------------+-----------+ \| Failed requests \| 0 \| +-----------------------------------+-----------+ \| Output token throughput (tok/s) \| 10.6882 \| +-----------------------------------+-----------+ \| Total token throughput (tok/s) \| 2746.87 \| +-----------------------------------+-----------+ \| Request throughput (req/s) \| 5.3441 \| +-----------------------------------+-----------+ \| Average latency (s) \| 2.0407 \| +-----------------------------------+-----------+ \| Average time to first token (s) \| 0.7989 \| +-----------------------------------+-----------+ \| Average time per output token (s) \| 1.2419 \| +-----------------------------------+-----------+ \| Average inter-token latency (s) \| 1.2419 \| +-----------------------------------+-----------+ \| Average input tokens per request \| 512 \| +-----------------------------------+-----------+ \| Average output tokens per request \| 2 \| +-----------------------------------+-----------+ 2026-01-22 15:10:31 - evalscope - INFO: Percentile results: +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| Percentiles \| TTFT (s) \| ITL (s) \| TPOT (s) \| Latency (s) \| Input tokens \| Output tokens \| Output (tok/s) \| Total (tok/s) \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| 10% \| 0.5727 \| 0.5051 \| 
0.5051 \| 1.1761 \| 512 \| 2 \| 1.0368 \| 266.4696 \| \| 25% \| 0.6497 \| 0.5324 \| 0.5324 \| 1.3159 \| 512 \| 2 \| 1.1763 \| 302.3184 \| \| 50% \| 0.7767 \| 0.6908 \| 0.6908 \| 1.4793 \| 512 \| 2 \| 1.3521 \| 347.4944 \| \| 66% \| 0.8711 \| 0.7912 \| 0.7912 \| 1.5916 \| 512 \| 2 \| 1.4518 \| 373.1092 \| \| 75% \| 0.9125 \| 0.8797 \| 0.8797 \| 1.7008 \| 512 \| 2 \| 1.521 \| 390.9018 \| \| 80% \| 0.9381 \| 0.9442 \| 0.9442 \| 1.7657 \| 512 \| 2 \| 1.5749 \| 404.7606 \| \| 90% \| 0.994 \| 1.0818 \| 1.0818 \| 1.9289 \| 512 \| 2 \| 1.7006 \| 437.0518 \| \| 95% \| 1.0369 \| 1.2454 \| 1.2454 \| 2.2154 \| 512 \| 2 \| 1.7937 \| 460.9731 \| \| 98% \| 1.1237 \| 18.8814 \| 18.8814 \| 19.4607 \| 512 \| 2 \| 1.8755 \| 482.0097 \| \| 99% \| 1.6752 \| 24.4406 \| 24.4406 \| 25.4734 \| 512 \| 2 \| 1.907 \| 490.0993 \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ ``` --------- Signed-off-by: xleoken <xleoken@163.com>	2026-03-12 16:13:48 +08:00
pz1116	a7f91fce71	[KV Pool]get_num_new_matched_tokens return 0 if token length < block_size (#7146 ) ### What this PR does / why we need it? Currently, we call lookup_client for looking up token hit in KV Pool, however, when token length < block size, the key will be empty and there is no point to lookup in KV Pool backend since there will never be a hit. Hence, add early return in `get_num_new_matched_tokens` when `token_len` < `block_size` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: fems14 <1804143737@qq.com>	2026-03-11 15:05:34 +08:00
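The early return described in #7146 can be sketched like this. It is a simplified illustration with a hypothetical signature; the real `get_num_new_matched_tokens` takes different arguments, and `lookup` stands in for the call to the KV Pool backend.

```python
from typing import Callable


def matched_tokens_sketch(
    token_len: int, block_size: int, lookup: Callable[[int], int]
) -> int:
    """Skip the pool lookup when the prompt cannot fill a single block."""
    if token_len < block_size:
        # No complete block means no key to look up, so a hit is impossible.
        return 0
    return lookup(token_len)
```

The point of the guard is to avoid a round trip to the backend that can never produce a hit.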
xleoken	146b9d2a83	[BugFix] fix metadata execute error: integer modulo by zero (#6521 ) ### What this PR does / why we need it? fix metadata execute error: integer modulo by zero - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: xleoken <xleoken@163.com>	2026-03-10 09:58:06 +08:00
fems14	ae394767d4	【main】ADXL/HIXL supports FabricMem Mode (#6806 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-03-05 21:04:11 +08:00
Canlin Guo	e4458b2d2b	[Main2Main] Upgrade vLLM to 0226 (#6813 ) ### What this PR does / why we need it? Breaking: 1. https://github.com/vllm-project/vllm/pull/33452 2. https://github.com/vllm-project/vllm/pull/33451 3. https://github.com/vllm-project/vllm/pull/32567 4. https://github.com/vllm-project/vllm/pull/32344 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-02-27 16:05:21 +08:00
DreamerLeader	812c722cfb	[KVPool][BugFix] Correctly initialize head_or_tp_rank for mooncake backend (#6498 ) ### What this PR does / why we need it? The problem that the local priority is not used in the A2 environment on the Mooncake node is resolved. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>	2026-02-25 14:22:00 +08:00
xleoken	747484cb64	[Bugfix] Fix wrong computed_tokens when meet exception. (#6522 ) ### What this PR does / why we need it? Fix wrong computed_tokens when an exception occurs. This pull request addresses a bug in the KV transfer mechanism where an exception during token lookup operations could lead to an incorrect count of computed_tokens. By modifying the exception handling in both the lookup and lookup_scheduler functions to return 0 instead of the start index, the system now correctly indicates that no tokens were successfully processed when a remote connection failure occurs. This enhancement improves the robustness and accuracy of token management within the vllm_ascend distributed KV pool. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Signed-off-by: xleoken <xleoken@163.com>	2026-02-24 15:29:30 +08:00
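The exception handling described in #6522 amounts to reporting zero matched tokens when the lookup fails. A minimal sketch, with a hypothetical function name and a zero-argument `lookup` callable standing in for the remote call:

```python
from typing import Callable


def safe_lookup(lookup: Callable[[], int]) -> int:
    """Return the number of matched tokens, or 0 if the lookup raises."""
    try:
        return lookup()
    except Exception:
        # A failed remote lookup means no tokens were confirmed in the pool,
        # so report 0 rather than a partial index.
        return 0
```

Returning 0 keeps the caller from counting tokens as computed when nothing was actually confirmed by the pool.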
yejj	8b23554741	[Misc] gen kv events in ascendconnector (#6593 ) ### What this PR does / why we need it? Refer to https://github.com/vllm-project/vllm-ascend/issues/6391. This PR adapts the complete event-publishing process in vllm: * `kv_connector_model_runner_mixin` invokes the kv-connector's `get_kv_connector_kv_cache_events` function to collect KV events * in `scheduler.py`, the `update_from_output` function invokes `_update_from_kv_xfer_finished`, which invokes `connector.update_connector_output` to collect KV events from all KV workers; the scheduler then invokes the `connector.take_events` API to collect all KV events and adds them to the events from `kv_cache_manager` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? You can add `--kv-events-config` parameter to the `vllm server` command to enable this feature. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: yejj710 <abyss1999@163.com> Co-authored-by: fems14 <1804143737@qq.com>	2026-02-12 11:01:09 +08:00
meihanc	922e5c163b	[main2main] upgrade vllm main 0202 (#6560 ) ### What this PR does / why we need it? 1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to https://github.com/vllm-project/vllm/pull/32567 2. Fix ` TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to https://github.com/vllm-project/vllm/pull/33035 3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and AttributeError: 'bool' object has no attribute 'process_weights_after_loading' due to https://github.com/vllm-project/vllm/pull/33284 4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'`due to https://github.com/vllm-project/vllm/pull/32790 5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to https://github.com/vllm-project/vllm/pull/32005 6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'` due to https://github.com/vllm-project/vllm/pull/27492 7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward, vllm.moe_forward_shared due to https://github.com/vllm-project/vllm/pull/33184 8. Add patch to fix the problem "got multiple values for keyword argument 'add_special_tokens'" due to https://github.com/vllm-project/vllm/pull/32863 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 19:31:17 +08:00
lty	33b8ca4e96	[Feature]KV pool supports sparse attention (#6339 ) ### What this PR does / why we need it? The kv pooling feature is adapted to Sparse Attention to support models such as Deepseek V3.2. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? ``` vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host $local_ip \ --port 8002 \ --served-model-name model \ --data-parallel-size 1 \ --tensor-parallel-size 8 \ --prefill-context-parallel-size 2 \ --decode-context-parallel-size 1 \ --cp-kv-cache-interleave-size 128 \ --block-size 128 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --no-enable-chunked-prefill \ --max-num-seqs 4 \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.95 \ --trust-remote-code \ --enforce-eager \ --quantization ascend \ --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \ --kv-transfer-config \ '{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "backend": "mooncake", "lookup_rpc_port":"0", "use_layerwise": false } }' ``` - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-05 10:36:52 +08:00
DreamerLeader	2dac18afea	[Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (#6126 ) ### What this PR does / why we need it? Fix of Pooling Code and Update of Pooling Usage Guide ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? PR [[Bugfix]Fixed precision issues caused by pooled request pooling](https://github.com/vllm-project/vllm-ascend/pull/6049) is ready for review - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Signed-off-by: fangjianwei <f30058701@china.huawei.com> Signed-off-by: DreamerLeader <88812830+DreamerLeader@users.noreply.github.com> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: fangjianwei <f30058701@china.huawei.com>	2026-02-04 16:35:41 +08:00
lty	082aa2e5b7	[Bugfix]The service fails to be started when the memcache pool is enabled (#6229 ) ### What this PR does / why we need it? The service fails to be started when the memcache pool is enabled without configuring the mooncake path. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? ``` #memcache echo 200000 > /proc/sys/vm/nr_hugepages source /usr/local/memfabric_hybrid/set_env.sh source /usr/local/memcache_hybrid/set_env.sh source /usr/local/Ascend/ascend-toolkit/set_env.sh source /usr/local/Ascend/nnal/atb/set_env.sh export MMC_LOCAL_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-local.conf vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host $local_ip \ --port 8002 \ --served-model-name model \ --data-parallel-size 2 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --no-enable-chunked-prefill \ --max-num-seqs 4 \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.9 \ --trust-remote-code \ --enforce-eager \ --quantization ascend \ --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \ --kv-transfer-config \ '{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "backend": "memcache", "lookup_rpc_port":"0" } }' ``` - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-02 16:26:18 +08:00
wangxiyuan	4e3919e965	Reapply "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) (#6231 ) This reverts commit `95649344aa`. The CI failure isn't related to this change. Let's reapply it. - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-26 09:04:54 +08:00
wangxiyuan	95649344aa	Revert "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) This reverts commit `8966a99710`. It breaks the test `tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-25 15:25:38 +08:00
SILONG ZENG	6ccccad102	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #5 ) (#5996 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `.../distributed/kv_transfer/kv_pool/ascend_store/ascend_store_connector.py` \| \| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/backend/backend.py` \| \| ` .../distributed/kv_transfer/kv_pool/ascend_store/backend/memcache_backend.py` \| \| ` .../distributed/kv_transfer/kv_pool/ascend_store/backend/mooncake_backend.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/config_data.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/kv_transfer.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py` \| \| ` .../distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py` \| \| ` .../distributed/kv_transfer/kv_pool/cpu_offload/cpu_offload_connector.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/metadata.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ucm_connector.py` \| \| ` vllm_ascend/distributed/kv_transfer/utils/mooncake_transfer_engine.py` \| \| ` vllm_ascend/distributed/kv_transfer/utils/utils.py` \| \| ` vllm_ascend/kv_offload/cpu_npu.py` \| \| ` vllm_ascend/kv_offload/npu.py` \| \| ` vllm_ascend/lora/lora_ops.py` \| \| ` vllm_ascend/lora/punica_npu.py` \| \| ` vllm_ascend/lora/utils.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-01-24 22:45:38 +08:00
LICO67373	8966a99710	[Refactor] Unify full-graph parameter update logic (#6041 ) ### What this PR does / why we need it? Refactor: Unify full-graph parameter update logic This PR consolidates the scattered full-graph parameter update logic into a unified approach, improving code architecture and eliminating duplication. Key improvements: 1. Unified interface - Create `update_full_graph_params` as the single entry point for all full-graph updates - Replace multiple scattered update calls with one unified function - Remove ~50 lines of duplicated if-else logic across `model_runner_v1.py` and `eagle_proposer.py` 2. Better architecture - Move update logic to respective Backend classes (`AscendAttentionBackend`, `AscendMLABackend`) - Each Backend manages its own parameter update logic internally - Simplify caller code to just dispatch to the appropriate Backend 3. Cleaner parameter handling - Remove unnecessary `pcp_size` and `dcp_size` parameter passing - Get parallel configuration directly from distributed groups - Consistent with how other parts of the codebase obtain these values Why we need it: - Maintainability: Future changes only need to be made in one place per Backend - Code quality: Follows DRY principle and Single Responsibility Principle - Readability: Cleaner, more intuitive code structure ### Does this PR introduce _any_ user-facing change? No. This is a pure refactoring with no functional changes - same behavior, cleaner code. ### How was this patch tested? - All existing unit tests pass with updated mocks - No new tests needed (pure refactoring, no behavior changes) - CI validates correctness --- - vLLM version: v0.13.0 Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-24 20:12:57 +08:00
UnifiedCacheManager	a2f022f9b6	[UCMConnector]Add has_connector_metadata (#6172 ) ### What this PR does / why we need it? ucm_connector add has `has_connector_metadata` interface to adapt to the latest KV connector in vLLM. ### Does this PR introduce _any_ user-facing change? this PR doesn't introduce _any_ user-facing change. ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2026-01-23 21:16:48 +08:00
baxingpiaochong	8786412f5c	[Bugfix]KV pool rank 0 consumes more HBM (#6113 ) ### What this PR does / why we need it? before add_set_deivce <img width="2354" height="674" alt="image" src="https://github.com/user-attachments/assets/8b81ab5f-b9ba-4fd2-8546-8f36ac15d32b" /> after <img width="1044" height="156" alt="image" src="https://github.com/user-attachments/assets/996d845a-8abd-4aae-b894-4a9832b1f742" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2026-01-23 19:47:33 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
DreamerLeader	b6d55fc48e	[Bugfix]Fixed precision issues caused by pooled request pooling (#6049 ) ### What this PR does / why we need it? Fixed precision issues caused by pooled request pooling ### Does this PR introduce _any_ user-facing change? pr6045 ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>	2026-01-20 23:51:31 +08:00
fems14	8b98d7a4e8	【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (#6045 ) ### What this PR does / why we need it? Resolved a double-free memory vulnerability in the pooling layer under re-computation scenarios. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: fems14 <1804143737@qq.com>	2026-01-20 22:56:04 +08:00
LICO67373	687df88151	[Refactor] Move AttentionSpec initialization to Attention module (#5834 ) ### What this PR does / why we need it? This PR refactors `get_kv_cache_spec` method to delegate AttentionSpec creation to each attention module's own `get_kv_cache_spec()` method, aligning with the vllm source code structure. Changes: - Simplify `get_kv_cache_spec` in `model_runner_v1.py` and `cpu_offload_connector.py` - Remove manual `AttentionType` checks for `Attention` modules - Delegate spec creation to each attention module's `get_kv_cache_spec` method directly - Let `MambaBase` layers use their own `get_kv_cache_spec` method - Keep `use_sparse` hack for `MLAAttention` (DeepSeek DSA mode) as Ascend-specific handling This change follows RFC #5463 item 12: move AttentionSpec to Attention module. - Fixes #5463 (item 12) ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring that simplifies code structure without changing any external behavior. ### How was this patch tested? - Syntax validation passed via `python -m py_compile` - CI tests will verify the changes work correctly with existing test cases - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-19 14:22:18 +08:00
lidenghui1110	48e10de8c9	[Bugfix] fix cpu offload hang with tp=1 (#5963 ) ### What this PR does / why we need it? As issue #5948 reported, when using cpu_offload_connector with TP=1, the server hangs on startup. We found several bugs to fix: 1. Some crashes were caused by code changes that came with vllm version updates; some of them can be fixed as in #5948, and this PR fixes all of them. 2. The hang described in #5948: the direct cause is that, in cpu_offload_connector, the RPC client uses the same client id in the scheduler and the worker when tensor_parallel_size is 1. This PR forces the client ids to be different, which fixes the hang. - Why didn't we find this hang before? Because our tests used --distributed-executor-backend mp or tensor_parallel_size > 1. In those test cases the scheduler and workers are different processes, so the client ids built by `worker-{os.getpid()}` are not the same. But with tensor_parallel_size=1, vllm uses uniproc as the distributed-executor-backend by default, so the scheduler and worker are in the same process, the client ids are identical, and the server hangs. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-17 11:50:13 +08:00
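The client id collision described in #5963 can be illustrated with a small sketch. The commit message does not say how the PR makes the ids differ, so prefixing the id with a role name is an assumption made here for illustration; `make_client_id` is a hypothetical helper.

```python
import os


def make_client_id(role: str) -> str:
    """Build an RPC client id that stays unique even when the scheduler
    and the worker share one process (assumed approach, not the PR's)."""
    return f"{role}-{os.getpid()}"


# In a single process, an id derived from the pid alone would collide:
pid_only_scheduler = f"worker-{os.getpid()}"
pid_only_worker = f"worker-{os.getpid()}"

# Including the role keeps the two ids distinct.
scheduler_id = make_client_id("scheduler")
worker_id = make_client_id("worker")
```

Any scheme that separates the two ids within one process would serve the same purpose; the role prefix is simply the shortest one to show.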
lty	3cb0af0bcf	[Refactor]Refactor of vllm_ascend/distributed module (#5910 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 16:26:53 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are broken by https://github.com/vllm-project/vllm/pull/32148, where vllm also fails https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed, moving all kv_transfer-related code into a dedicated folder, as has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00

31 Commits