xc-llm-ascend

Author	SHA1	Message	Date
wangx700	55e37f5041	[v0.11.0][Bugfix] fix sleepmode level2 e2e test (#4023 ) ### What this PR does / why we need it? fix sleepmode level2 e2e test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? use e2e tests Signed-off-by: wangx700 <wangxin700@huawei.com>	2025-11-08 14:11:15 +08:00
tingfu	f9842560cb	[0.11.0][Perf] Add padding vision tower for Qwen2_5_Omni (#4041 ) ### What this PR does / why we need it? This PR replaces the vision tower in the Qwen2.5-Omni-Thinker model, Qwen2_5_VisionTransformer, with AscendQwen2_5_VisionTransformer, which uses QKV padding for better performance. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Ting FU <futing10@huawei.com>	2025-11-08 13:56:05 +08:00
zxr2333	d4e2a44307	[Cherry Pick from pr#3981][0.11.0][P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3983 ) ### What this PR does / why we need it? Make kv-transfer env variable take effect & Fix load-balance proxy. Cherry Pick from #3981 --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-11-08 13:52:33 +08:00
offline893	8e72758645	[BugFix]Fix grouplist type of mc2. (#4049 ) ### What this PR does / why we need it? Fix an EPLB accuracy problem caused by the PTA upgrade. This is a backport of #4047 ### How was this patch tested? Main: baseline: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 87.50 \| EPLB: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 87.50 \| - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-11-07 17:43:23 +08:00
lilinsiman	016337eaec	[v0.11.0][UT] Add new ut case for aclgraph enable (#4038 ) ### What this PR does / why we need it? add new ut case for aclgraph enable ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-11-07 11:35:24 +08:00
Angazenn	f9494d978a	[cherry-pick][v0.11.0-dev][bugfix] Fix a rare bug triggered by _npu_paged_attention in FULL_DECODE_ONLY mode (#3987 ) ### What this PR does / why we need it? This is a cherry-pick from #3986 . This PR fixes a bug where the workspace of `_npu_paged_attention` at setup is smaller than at execution. In the current implementation of FULL_DECODE_ONLY with `_npu_paged_attention`, we call `_npu_paged_attention_get_workspace` when capturing, with `max_model_len` as `seq_lens`. This assumes that PA with larger `seq_lens` inputs needs a larger workspace than PA with smaller `seq_lens`. However, there are rare cases where PA with smaller `seq_lens` needs more space. So I add `get_workspace` directly into `update_attn_params`. This change might introduce a slight (≈1%) performance degradation for small num_tokens (such as 1) in the decode phase, and there are no other known memory issues, so I think this change is acceptable. We can remove this if a new attention op (such as `npu_fused_infer_attention_score`) does not have such problems. Signed-off-by: Angazenn <supperccell@163.com>	2025-11-06 23:08:57 +08:00
Shanshan Shen	27547a10e6	[MM][Bugfix] Add MoE verification for multi-modal models (#3897 ) (#4027 ) ### What this PR does / why we need it? Fix #3891. The empty `moe_comm_method` in the above issue is caused by an incorrect check for MoE models. Specifically, the method `is_moe_model` only checks whether a text-only model is a MoE model, without considering multi-modal models, e.g., `VL` and `Omni`. This PR checks the config dict recursively for a key that contains "expert", without checking the model architecture. Note that we can't verify a model by checking whether it contains a `FusedMoE` module, because `is_moe_model` is called before model loading in some places, e.g., when updating the ACLGraph config in platform initialization. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-06 20:30:40 +08:00
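
The recursive config check described in the commit above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `config_mentions_experts` and the example config dicts are hypothetical, and the real `is_moe_model` may differ in how it reads the model config.

```python
from typing import Any


def config_mentions_experts(config: Any) -> bool:
    """Return True if any key, at any nesting depth, contains 'expert'."""
    if isinstance(config, dict):
        for key, value in config.items():
            if "expert" in str(key).lower():
                return True
            # Recurse into nested sub-configs (e.g. a text or vision config).
            if config_mentions_experts(value):
                return True
    elif isinstance(config, (list, tuple)):
        return any(config_mentions_experts(item) for item in config)
    return False


# Hypothetical config shapes, for illustration only.
text_only_dense = {"hidden_size": 1024, "num_hidden_layers": 24}
multimodal_moe = {
    "vision_config": {"depth": 32},
    "text_config": {"hidden_size": 2048, "num_experts": 128},
}

print(config_mentions_experts(text_only_dense))
print(config_mentions_experts(multimodal_moe))
```

Because the check only looks at config keys, it can run before the model is loaded, which is the constraint the commit message points out.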
zzzzwwjj	3db53d117e	[0.11.0][doc] add aclgraph developer guide (#3947 ) ### What this PR does / why we need it? Add aclgraph developer guide. Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-11-06 09:54:38 +08:00
wangxiyuan	7ee0b0b5d8	[cherry-pick]Upgrade CANN to 8.3.rc1 (#3945 ) (#3962 ) This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version check logic. TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base image for UT is still 8.2. We'll fix it later. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-06 09:05:08 +08:00
Zetong Li	66b67f9cf2	[Bugfix][SHM] Fix weak memory ordering problem in share memory (#3988 ) ### What this PR does / why we need it? This PR fixes a weak memory ordering problem in shared memory by patching the message queue with an additional lock. The detailed issue can be found here: https://github.com/vllm-project/vllm/issues/27858. The key point is to use the writer lock to enforce a memory fence before the ready flag `metadata_buffer[0] = 1` is set. This is a temporary solution, and you can use it by setting env `SHM_BARRIER=true`. By default, this modification is disabled. ### Does this PR introduce _any_ user-facing change? `SHM_BARRIER=true` enables this change while `SHM_BARRIER=false` disables it. The latter is the default. ### How was this patch tested? by ci --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-11-04 23:07:23 +08:00
zxr2333	954dab64fb	[v0.11.0][P/D]Set adxl as default backend and update readme (#3771 ) ### What this PR does / why we need it? Set adxl engine as the default Mooncake backend, because Ascend Transport is no longer maintained. Update README to include instructions for installing the adxl backend Mooncake. ### Does this PR introduce _any_ user-facing change? Users need to compile and install the mooncake backend for adxl according to the revised README instructions. ### How was this patch tested? By CI. --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-11-04 16:06:58 +08:00
leo-pony	0cead5c1ee	Quality enhancement: Immediately interrupt execution when allocate NPU memory OOM (#3944 ) ### What this PR does / why we need it? Preserve the site of the first failure. Execution should be interrupted as soon as a device memory allocation fails, rather than continuing until an illegal address is accessed. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-11-04 08:55:22 +08:00
Mengqing Cao	7cc6208029	[0.11.0][MTP][Aclgraph] Fix the support aclgraph with MTP (#3912 ) ### What this PR does / why we need it? Fix 2 breaks of aclgraph with MTP: 1. deepseekmtp in vllm 0.11.0 does not support aclgraph and lacks the `support_torch_compile` decorator 2. There is a d2h synchronization in the original forward of the mtp predictor. The fix PR in vllm: https://github.com/vllm-project/vllm/pull/27643 As we'll fix it in vllm main, this fix PR is only needed in branch v0.11.0-dev. The profiling shows that MTP replays in aclgraph now: <img width="1612" height="1866" alt="a7d7f04155df4ed454b7eb20a92b2e2a" src="https://github.com/user-attachments/assets/eaa4b9ff-aeb0-416d-964f-5a06e497f155" /> --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-11-03 14:25:37 +08:00
wangxiyuan	8a7154001e	[0.11.0]Chery pick pta upgrade change (#3940 ) This PR cherry-picks two commits from main to upgrade torch-npu to the 2.7.1 official release --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-31 22:14:26 +08:00
rjg-lyh	3d81ea03ed	[v0.11.0-dev][bugfix] fix valueError in static_forward_context when prefix is empty (#3929 ) ### What this PR does / why we need it? This PR temporarily bypasses the scenario where some models in vLLM trigger a `ValueError` during the process of storing values in `static_forward_context` when no `prefix` is specified for the linear layers, which is a bug in some models in vLLM. The official fix will be addressed by submitting a PR to the vLLM community that specifies a prefix for the linear layers in each model. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-10-31 15:45:06 +08:00
Nagisa125	9f7de45b75	[Bugfix] fix MTP support for lmhead_tensor_parallel_size (#3921 ) ### What this PR does / why we need it? Fix the issue where enabling MTP and setting lmhead_tensor_parallel_size=16 causes inference to hang. Signed-off-by: wyh145 <1987244901@qq.com>	2025-10-31 14:34:28 +08:00
lilinsiman	ee2e55e602	[v0.11.0][Test] Add new test model for aclgraph single_request v0.11.0 (#3889 ) ### What this PR does / why we need it? add new test model for aclgraph single_request v0.11.0 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-31 11:23:55 +08:00
zouyida2052	90aca84e60	fix bug when max_seqs=14 in mtp=2 scenario and raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len (#3909 ) ### What this PR does / why we need it? 1. Revert [bugfix for mtp in fullgraph](`0948483642`) and support it when vllm supports 2. raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len 3. bugfix when max_num_seqs=14 in mtp=2 scenario --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-31 09:25:06 +08:00
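
The capture-size check mentioned in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `check_capture_sizes` and the error wording are hypothetical, and only the rule itself (every capture size must be an integer multiple of `uniform_decode_query_len`) comes from the commit message.

```python
from typing import Iterable


def check_capture_sizes(capture_sizes: Iterable[int],
                        uniform_decode_query_len: int) -> None:
    """Raise ValueError if any capture size is not a multiple of the query length."""
    if uniform_decode_query_len <= 0:
        raise ValueError("uniform_decode_query_len must be positive")
    bad = [size for size in capture_sizes
           if size % uniform_decode_query_len != 0]
    if bad:
        raise ValueError(
            f"capture sizes {bad} are not integer multiples of "
            f"uniform_decode_query_len={uniform_decode_query_len}")


# Passes silently: every size is a multiple of 3.
check_capture_sizes([3, 6, 12], uniform_decode_query_len=3)

# 14 is not a multiple of 3, so this configuration is rejected.
try:
    check_capture_sizes([3, 14], uniform_decode_query_len=3)
except ValueError as err:
    print(err)
```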
lilinsiman	387ce1cc5b	add new e2e tests case for aclgraph memory to v0.11.0 (#3880 ) ### What this PR does / why we need it? add new e2e tests case for aclgraph memory to v0.11.0 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-31 09:17:09 +08:00
wangxiaoteng888	38afd2c9cb	[bugfix_v0.11.0]cancel tokenize for layerwise_proxy (#3913 ) ### What this PR does / why we need it? cancel tokenize for layerwise_proxy ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? by ci Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-10-30 23:55:04 +08:00
wangxiaoteng888	af7a56550b	[bugfix_v0.11.0-dev] layerwise D first plan (#3907 ) ### What this PR does / why we need it? Refactored the layerwise code to send to the D node first, preventing P-node hangs due to communication timeouts when DP > 1. --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-10-30 22:21:11 +08:00
offline893	d5a9aba03f	[BugFix]Fix group list type of mc2. (#3890 ) ### What this PR does / why we need it? Fix the precision issue caused by the inconsistency between the group list type used by mc2 and that of eplb. --------- Signed-off-by: offline0806 <3337230449@qq.com>	2025-10-30 21:44:14 +08:00
weichen	c506ba60fb	[v0.11.0] [Bugfix] [MoE]fix error in deepseek when using allgather (#3827 ) ### What this PR does / why we need it? After refactoring vllm_ascend/models and FusedMoE, we are unable to pass `gate` from deepseekv2.py to `AscendFusedMoE.forward`, which will result in error when running deepseek v3/r1 with allgather. Hence, this pr removes `gate` related computations from FusedMoE module in eager/aclgraph mode. ### Does this PR introduce _any_ user-facing change? `rm_router_logits` is deprecated in eager/aclgraph. ### How was this patch tested? e2e & ut Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-30 14:59:46 +08:00
whx	211d4b9da4	[BugFix] Fix mlapo accuracy problem related with weight processing. (#3857 ) This PR fixes a mlapo accuracy problem related with weight processing. Furthermore, modify mlapo related e2e test with quantized deepseek model to make it effective. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-30 00:35:50 +08:00
zouyida2052	d9249c968e	bugfix for mtp in fullgraph (#3878 ) ### What this PR does / why we need it? bugfix for mtp in fullgraph ### Does this PR introduce _any_ user-facing change? no --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-29 23:52:20 +08:00
fems14	19f49ecb5f	[0.11.0][Bugfix]fix_mulit_connector_bug (#3332 ) (#3882 ) ### What this PR does / why we need it? When using multi connector, the multi connector does not define get_finished_count, which will cause the kv cache to be released ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` Signed-off-by: baxingpiaochong <771405853@qq.com> Co-authored-by: baxingpiaochong <771405853@qq.com>	2025-10-29 23:44:52 +08:00
liziyu	e5b938c5fe	[v0.11.0] [P/D] force with_prefill true after allreduce in kv producer (#3835 ) ### What this PR does / why we need it? force with_prefill true after allreduce in kv producer. This is a backport of #3768 and #3849 --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-10-29 23:14:00 +08:00
Wang Yixuan	b323be9fe4	deepseek torchair adapt for torch_npu version (#3876 ) ### What this PR does / why we need it? Adapt to the torch_npu version to avoid a precision problem in torchair deepseek. Depending on the torch_npu version, op registration may take different branches: the rms_norm op has two branches according to the version check. This PR unifies rms_norm in torchair by patching. #3862 Signed-off-by: hust17yixuan <303660421@qq.com>	2025-10-29 22:44:44 +08:00
realliujiaxu	29bd9235ed	[v0.11.0][Perf] Delete redundant operations in model_runner and forward_context (#3775 ) cherry pick https://github.com/vllm-project/vllm-ascend/pull/3677 Remove redundant operations from `model_runner` and `forward_context`. This optimization can significantly reduce the idle time (bubble) before decoding when running models with small parameter counts (e.g., Qwen/Qwen2.5-0.5B). Testing on 800I A2, bubble is reduced from 3.8ms to 2.8ms : Before <img width="1655" height="696" alt="image" src="https://github.com/user-attachments/assets/d7608e52-2438-46dd-8fc9-391fd6274495" /> After <img width="1607" height="774" alt="image" src="https://github.com/user-attachments/assets/56daf081-2dba-4d2e-99d4-e055187d9806" /> ### Does this PR introduce _any_ user-facing change? No --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-10-29 15:58:53 +08:00
zhangxinyuehfad	75de3fa172	[v0.11.0][Doc] Update doc (#3852 ) ### What this PR does / why we need it? Update doc Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-10-29 11:32:12 +08:00
ZYang6263	6188450269	[v0.11.0][Bugfix]Avoid using the fusion operator in the MOE model (#3837 ) ### What this PR does / why we need it? The current MatmulReduceScatter operator experiences performance degradation in small-shape scenarios, so this PR decides whether to use the operator based on the size of the shape. --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-28 23:31:19 +08:00
Shirley125	e48ca0b6ec	[bugfix][0.11]fix proxy decode bug (#3751 ) ### What this PR does / why we need it? fix proxy decode bug while parsing non-UTF-8 characters. --------- Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>	2025-10-27 16:56:50 +08:00
Yizhou	43276fd822	[v0.11.0][Fix] Prevent memory leak in MLA decode graph (#3743 ) (#3774 ) ### What this PR does / why we need it? The cache for MLA decode graph parameters was holding strong references to tensors, preventing them from being garbage collected and leading to increased memory usage. This change wraps the cached tensors in weak references, allowing them to be deallocated when no longer in use and reducing overall memory pressure. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-27 16:00:20 +08:00
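
The weak-reference approach described in the commit above can be sketched as follows. This is a minimal illustration, not the PR's code: `FakeTensor` stands in for a real tensor and `WeakParamCache` is a hypothetical cache class; the only point taken from the commit is that cached entries are held through `weakref` so the cache alone does not keep them alive.

```python
import gc
import weakref


class FakeTensor:
    """Stand-in for a tensor; plain Python objects support weak references."""

    def __init__(self, name: str) -> None:
        self.name = name


class WeakParamCache:
    """Cache that stores weak references instead of strong ones."""

    def __init__(self) -> None:
        self._refs = {}

    def put(self, key: str, tensor: FakeTensor) -> None:
        self._refs[key] = weakref.ref(tensor)

    def get(self, key: str):
        ref = self._refs.get(key)
        # A dead reference returns None once the object has been collected.
        return ref() if ref is not None else None


cache = WeakParamCache()
tensor = FakeTensor("q_proj")
cache.put("q", tensor)
print(cache.get("q") is tensor)

# Dropping the last strong reference lets the object be reclaimed.
del tensor
gc.collect()
print(cache.get("q"))
```

The trade-off is that callers must handle a `None` result, since a cached entry can disappear as soon as nothing else references it.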
Ruri	825fdfb197	[v0.11.0][Feat] Prefetching Attention QKV Linear Weight With `AddRmsNormQuant` Custom Op (#3649 ) ### What this PR does / why we need it? - `qkv_proj.weight` prefetching has been implemented with `Quant` op, when `AddRmsNormQuant` is enabled (#3465) `qkv_proj.weight` prefetching won't work - Implement `qkv_proj.weight` prefetching with `AddRmsNormQuant`, which has been merged on `main` branch (#3517) ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Tested on `Qwen3-235B-A22B-W8A8` <img width="1868" height="109" alt="image" src="https://github.com/user-attachments/assets/0bc28082-0287-4d5c-b8f6-f907c3134d36" /> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>	2025-10-27 09:42:09 +08:00
Mengqing Cao	1b16c01afd	[v0.11.0-dev][Installation] limit opencv-python-headless version to resolve numpy version conflict (#3767 ) ### What this PR does / why we need it? vllm requires opencv-python-headless >= 4.11.0, which requires numpy<2.3.0,>=2, but the vllm-ascend numpy version must be less than 2.0.0, so limiting opencv-python-headless to less than 4.11.0.86 fixes this conflict. Backport of `afc58184ec` Signed-off-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com>	2025-10-25 18:18:28 +08:00
whx	a58ff9e92f	[Cherry-pick] Port MoE multi-stream fix to v0.11.0-dev (#3753 ) This PR moves the communication operation of shared experts out of extra stream because I found that this might cause rtMemcpy related errors when running shared experts multistream with aclgraph. Furthermore, I utilize a global variable as extra stream object to avoid allocating streams for each layer in full-graph mode. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-25 15:51:43 +08:00
Yizhou	1bc61031e5	[v0.11.0][Fix] Cap max tokens to prevent potential OOM (#3720 ) (#3744 ) ### What this PR does / why we need it? Caps the calculated maximum number of tokens at 512. This prevents allocating an excessively large buffer when a cudagraph capture size is not specified, mitigating the risk of out-of-memory errors. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-25 15:46:56 +08:00
fems14	99e154dc84	[0.11.0] cherry-pick from #3747 (#3746 ) Cherry-pick from #3747: correct the placement of the _register function for mooncake. Signed-off-by: fems14 <1804143737@qq.com>	2025-10-25 14:21:30 +08:00
shaopeng-666	fed8145aea	[cherry-pick][Feat] Add mrope fusion op#3708 (#3735 ) ### What this PR does / why we need it? Add mrope fusion op for qwen2.5-vl. This mrope operator doesn't currently support Qwen3-VL, so it only takes effect in qwen2.5-vl. Cherry-picked from `39b994a987`. CI passed with existing tests. Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com> Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>	2025-10-25 11:41:23 +08:00
whx	0644113c35	[BugFix] cherry-pick PR 3736 to v0.11.0-dev (#3737 ) This PR comments out the newly added vlm e2e test of the ascend scheduler scenario because I found that it gets stuck when running in multi-batch. It needs to be added back after this issue is resolved. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-25 10:35:14 +08:00
whx	5a2c5be229	[BugFix][Cherry-pick] Cherry-pick PR 3675 to v0.11.0-dev (#3732 ) This PR cherry-picks the bugfix related with running multi-modal models with AscendScheduler to v0.11.0-dev Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-10-25 09:41:51 +08:00
hucong	12bc78d252	[v0.11.0][BugFix][P/D] Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache (#3686 ) ### What this PR does / why we need it? Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache Signed-off-by: underfituu <hzhucong@163.com>	2025-10-25 09:15:42 +08:00
ZYang6263	5c0a23f98b	[0.11.0][Perf] Add fused matmul/reduce-scatter kernel for performance optimization. (#3725 ) ### What this PR does / why we need it? This PR boosts performance by introducing a fused kernel for the matmul and reduce-scatter operations. It supports both unquantized (e.g., BFloat16) and W8A8 quantized models. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-25 08:20:43 +08:00
fems14	17dd9ae42c	[0.11.0][bugfix]look up multi_tp key (#3699 ) (#3723 ) ### What this PR does / why we need it? In multi-Tensor Parallel (TP) scenarios, the KV pool only queries the first GPU card. When keys on other cards are released, the query result still returns as successful, introducing accuracy issues. This PR modifies the KV pool's query logic to check all cards, resolving this problem. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: fems14 <1804143737@qq.com>	2025-10-24 18:22:45 +08:00
fems14	f0eb3e1d97	[v0.11.0][bugfix]kvpool sync load (#3698 ) (#3722 ) ### What this PR does / why we need it? In certain scenarios, the performance of synchronously loading data from the pool is better than that of asynchronously loading data. Therefore, a control logic (or switch) for asynchronous loading from the pool has been added. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-10-24 18:21:46 +08:00
何必问	33514a4cc2	[Bugfix] The server fails to locate the request, leading to the server hanging. (#3721 ) ### What this PR does / why we need it? Fix bug: in the mooncake pooling scenario, when the client closes the request, the server fails to locate the request, leading to the server hanging. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Bring up the PD-separated pooling service, send requests using aisbench, press CTRL+C twice, and check whether the vllm_ascend service exits. --------- Signed-off-by: linhebiwen <linhebiwen@gmail.com>	2025-10-24 17:41:29 +08:00
offline893	4e21b1537e	[BugFix] Check all expert maps when using muilty instance. (#3662 ) ### What this PR does / why we need it? Check all expert maps when using multiple instances. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Qwen 235B on two A3 nodes. case1: master has an expert map, slave has no expert map. case2: master has an expert map, slave has an incorrect expert map. case3: master has an expert map, slave has a correct expert map. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-24 17:10:31 +08:00
wangxiyuan	b321e3846a	[cherry-pick]【main】patch sched_yield (#3648 ) (#3687 ) ### What this PR does / why we need it? On Arm systems, os.sched_yield() does not take effect, causing the GIL (Global Interpreter Lock) to remain unrelinquished and resulting in CPU bound issues. This PR applies a patch to sched_yield in vLLM, making the process execute time.sleep(0) instead to release the GIL. ### Does this PR introduce _any_ user-facing change? Signed-off-by: fems14 <1804143737@qq.com> Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>	2025-10-24 00:24:58 +08:00
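
The replacement described in the commit above can be sketched as follows. This is a minimal illustration and not the actual patch: `fake_utils` is a hypothetical stand-in for whichever vLLM module exposes `sched_yield`, and `apply_patch` is an invented helper name. Only the idea of swapping the yield for `time.sleep(0)` comes from the commit message.

```python
import time
from types import SimpleNamespace


def sleep_based_yield() -> None:
    """Yield by sleeping for zero seconds, which releases the GIL."""
    time.sleep(0)


def apply_patch(target) -> None:
    """Replace the target's sched_yield attribute with the sleep-based version."""
    target.sched_yield = sleep_based_yield


# Hypothetical stand-in for the module being patched.
fake_utils = SimpleNamespace(sched_yield=lambda: None)
apply_patch(fake_utils)

# Callers keep using the same attribute name and now get the patched behaviour.
fake_utils.sched_yield()
```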
Wang Yixuan	d0086d432a	fix deepseek torchair recompile (#3679 ) ### What this PR does / why we need it? PR #3624 fixed the precision of deepseek torchair but did not account for a torch compile limitation, which results in recompilation. This PR fixes this problem. PR to main: #3678 ### Does this PR introduce _any_ user-facing change? No Signed-off-by: hust17yixuan <303660421@qq.com>	2025-10-23 22:53:13 +08:00
Slightwind	d2d19a4c3c	[v0.11.0][bugfix] Add 'layer_type' param to get_pergroup_param() for compatibility (#3684 ) Resolves a `TypeError: got an unexpected keyword argument 'layer_type'`. A recent change (PR #3311) started passing the `layer_type` argument when calling `get_pergroup_param()`. This specific implementation does not use this parameter, causing the error. This patch adds `layer_type=None` to the method signature to maintain API compatibility and ignore the unused argument. Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-10-23 21:26:50 +08:00
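
The compatibility fix described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `PerGroupQuantMethod`, the other parameters, and the return value are hypothetical, since the commit message only states that `layer_type=None` was added to the `get_pergroup_param()` signature.

```python
from typing import Any, Dict, Optional


class PerGroupQuantMethod:
    """Hypothetical quantization method whose per-group params ignore layer_type."""

    def get_pergroup_param(self,
                           input_size: int,
                           output_size: int,
                           layer_type: Optional[str] = None) -> Dict[str, Any]:
        # layer_type is accepted only so that newer callers do not raise
        # TypeError; this implementation does not use it.
        return {}


method = PerGroupQuantMethod()

# Older call sites that omit the argument keep working.
print(method.get_pergroup_param(128, 256))

# Newer call sites that pass layer_type no longer fail.
print(method.get_pergroup_param(128, 256, layer_type="row"))
```

Defaulting the new parameter to `None` keeps both call styles valid without changing the behaviour of this implementation.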

1200 Commits