xc-llm-ascend

Author	SHA1	Message	Date
22dimensions	f5a97e8fa5	[Quantization] register AscendQuantRMSNorm for quantization (#2856 ) ### What this PR does / why we need it? modelslim generates self.bias for RMS norm during quantization. Since RMSNorm in vLLM has no such parameter, it is necessary to create an AscendQuantRmsNorm. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested with deepseek-v3.1-w8a8 <img width="2496" height="592" alt="image" src="https://github.com/user-attachments/assets/004c6e76-3d7a-4a1f-b59f-a14304012663" /> - vLLM version: main - vLLM main: `d6249d0699` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-09-11 23:14:02 +08:00
rjg-lyh	0005479b9c	[main] mlp weight prefetch in Qwen Dense Models (#2816 ) ### What this PR does / why we need it? This PR prefetches the weights of the MLP layers in Qwen Dense Models, mainly to improve performance in the decode phase. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with newly added/existing tests. - vLLM version: main - vLLM main: `a1213fae5f` Signed-off-by: rjg-lyh <1318825571@qq.com> Co-authored-by: Shuming19 <313093131@qq.com>	2025-09-11 21:20:09 +08:00
jiangpeng	2b9269b581	[Perf][V1] Fully overlap model execution (#2783 ) This PR is based on top of [#23569](https://github.com/vllm-project/vllm/pull/23569) and [#24219](https://github.com/vllm-project/vllm/pull/24219). ### What this PR does / why we need it? This PR allows the model runner to function asynchronously when using async scheduling. This allows full overlap of the cpu operations (including prepare_inputs) and the model forward pass. This diff is functional and does not support speculative decoding, PP, or guided decoding. Expected speedup is 5-10% over the current async scheduling. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? server ``` python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B\ --trust-remote-code --enforce-eager \ --distributed-executor-backend=mp \ -tp=4 \ --port 8006 \ --max-model-len 32000 \ --block-size 128 \ --gpu-memory-utilization 0.99 ``` client ``` python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \ --dataset-name random --random-input-len 2048 --random-output-len 2048 \ --ignore-eos\ --num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \ --metric-percentiles 90 --base-url http://localhost:8006 --save-result \ --result-dir $PROFILER_DIR ``` benchmark test based on Qwen3-32B TPOT result: \|\|forward async\| scheduler async \|sync\| \|-\|-\|-\|-\| \|avg\|41.73\|41.86\|44.20\| \|improve0\|0.3%\|0\|0\| \|improve1\|5.58%\|0\|0\| benchmark test based on Qwen2___5-VL-7B-Instruct TPOT result: \|\|forward async\|sync\| \|-\|-\|-\| \|avg\|23.22\|29.16\| \|improve\|20.3%\|0\| - vLLM version: main - vLLM main: `e93f4cc9e3` Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com> Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com> Co-authored-by: Ronald1995 <ronaldautomobile@163.com>	2025-09-11 16:35:36 +08:00
huangxialu	88d7af62be	[main] adjust the position of warm_up_atb (#2823 ) ### What this PR does / why we need it? Adjust the position of warm_up_atb. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? CI passed with existing test. - vLLM version: main - vLLM main: `b23fb78623` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-09-10 14:06:38 +08:00
CaranLic	168ad600b5	[main] add pd transfer for ascend scheduler (#2753 ) ### What this PR does / why we need it? For offline scenarios, adjust the scheduling process to prioritize the prefill phase of all requests, then process the decode phase of all requests. ### How was this patch tested? ``` max_num_seqs=24, additional_config={ "ascend_scheduler_config":{ "enabled": True, "enable_pd_transfer": True, "decode_max_num_seqs": 24, "enable_chunked_prefill": False } }, ``` \| input \| output \| num prompts \| max_num_seqs \| dp \| tp \| scheduler \| tps \| \| ------ \| ------ \| ---------- \| ---------------- \| ---- \| ---- \| ---------------- \| --------------- \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| v1 \| 234.06 \| \| dapo-math-17K \| 2K \| 384 \| 24 \| 2 \| 1 \| pd transfer \| 239.59(+2.4%) \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| v1 \| 222.85 \| \| dapo-math-17K\| 2K \| 384 \| 24 \| 4 \| 1 \| pd transfer \| 225.81(+1.3%) \| - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-09-10 08:46:39 +08:00
Mengqing Cao	edf1f600ad	[CI] Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 (#2840 ) ### What this PR does / why we need it? Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 ### Does this PR introduce _any_ user-facing change? branch main of vllm-ascend will not be compatible with vllm v0.10.1 and v0.10.1.1 ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `6fb2788163` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-10 08:43:10 +08:00
rjg-lyh	1bbb20ea13	[main] flashcomm_v1 optim in Qwen Dense Models (#2802 ) ### What this PR does / why we need it? Flashcomm_v1 optim in Qwen Dense Models. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `5e537f45b4` Co-authored-by: 1024daniel <xxltju324@gmail.com>	2025-09-08 22:52:24 +08:00
realliujiaxu	d3c3538ddc	[Bugfix]fix bug when graph_size is not divisible by tp_size (#2719 ) ### What this PR does / why we need it? fix https://github.com/vllm-project/vllm-ascend/issues/2702 - A2: skip graph_size update that makes it to tp_size because dispatch/combine op support different batch size across EP ranks - A3: add `max_num_reqs = max(new_graph_batch_sizes)` to fix graph_size and max_num_reqs mismatch ### Does this PR introduce _any_ user-facing change? Nope ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-09-08 14:52:33 +08:00
TaoYu Chen	dd087effcc	Refactor prepare_inputs in model_runner_v1.py (#2750 ) ### What this PR does / why we need it? Refactor prepare_inputs in model_runner_v1.py to make it easier to read. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? PASS CI - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` --------- Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>	2025-09-08 10:45:23 +08:00
yiz-liu	c735bb0941	[Fix] Ensure metadata sync across DP ranks in eager mode (#2766 ) ### What this PR does / why we need it? Removes the condition that skips metadata synchronization when `enforce_eager` is enabled. This change is necessary to correctly sync the `with_prefill` and `enable_dbo` flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Add an E2E online test case? - vLLM version: v0.10.1.1 - vLLM main: `e599e2c65e` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-08 09:55:16 +08:00
yiz-liu	83eb40a51c	[Fix][MoE] Refine MoE communication strategy (#2734 ) ### What this PR does / why we need it? Refactors the Mixture-of-Experts (MoE) communication method selection logic. The choice between all-gather, all-to-all, and mc2 is now determined by expert parallel configuration, SoC version (A2/A3), and token count for better performance. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Added. - vLLM version: v0.10.1.1 - vLLM main: `eafa8dcde6` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-05 09:04:04 +08:00
Icey	d4370ebc42	[Refactor] Refactor Spec Decode (#2668 ) ### What this PR does / why we need it? Refactor spec decode ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-04 11:34:47 +08:00
linfeng-yuan	90a75a90a9	[bugfix] fix torchair runtime error caused by configuration mismatches and missing files (#2532 ) ### What this PR does / why we need it? This PR ports #2312 #2506 #2531 to main branch. The original implementation of torchair caching forces users to prepare everything in advance, fix all the configuration, and enable `use_cached_npu_graph`, which can cause problems that are confusing for users to understand and tackle. It is better to compile the graph twice instead of reusing the old kvcaches and cached torchair graph, and the extra time is acceptable. Additionally, this PR fixes a recompilation problem of torchair graph mode caused by the `running_in_graph` variable in `AscendMLATorchairImpl`. ### Does this PR introduce _any_ user-facing change? If users want to enable torchair.cache_compile with high compilation speed, it is recommended to enable both `use_cached_kv_cache_bytes` and `use_cached_graph` in `torchair_graph_config`. Without `use_cached_kv_cache_bytes`, we'll compile the torchair computation graph twice to avoid runtime errors caused by configuration mismatches (the second compilation will be much faster). Additionally, we've changed how the TORCHAIR_CACHE_HOME environment variable is used, adding a suffix directory to enhance safety and prevent accidental file deletion. ### How was this patch tested? CI and e2e vllm serving pass. - vLLM version: v0.10.1.1 - vLLM main: `70549c1245` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-03 17:56:12 +08:00
zhanghw0354	eaeb2efb20	[Main][Feat]Set the Profiler parameters through environment variables consistent with vLLM (#2608 ) ### What this PR does / why we need it? Currently, when performing profiling in vLLM-Ascend, if you need to obtain the Python call stack, you have to manually modify the code. The code location is: [worker_v1.py#L337](`6c973361fc/vllm_ascend/worker/worker_v1.py (L337)`) where you set with_stack to true. Now, in vLLM, you can set whether to obtain the Python call stack through an environment variable. The relevant PR is: [#21803](https://github.com/vllm-project/vllm/pull/21803) and the documentation is: [profiling](https://docs.vllm.ai/en/latest/contributing/profiling.html?h=vllm_torch_profiler_with_stack#profile-with-pytorch-profiler) This PR sets the profiler initialization parameters by using the same environment variable as vLLM, eliminating the need for manual code modification. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `0235103cbb` --------- Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>	2025-09-03 10:58:08 +08:00
xuyexiong	214b32a346	[V1][BUGFIX][0.10.1] FIX mtp on main branch (#2632 ) ### What this PR does / why we need it? Fix MTP torchair bug caused by torchair refactor and moe refactor Depends on PRs: fused moe fix: https://github.com/vllm-project/vllm-ascend/pull/2627 torchair multi DP fix: https://github.com/vllm-project/vllm-ascend/pull/2626 ### Does this PR introduce _any_ user-facing change? When DP is enabled, running the MTP online server requires disabling server log stats with `--disable-log-stats`, because the current metrics do not support multi-DP. ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7c8271cd1e` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-02 11:12:41 +08:00
yiz-liu	d3c93fba5c	[3/N][Feat][Graph] Support `all-to-all` and quantized models with ACL Graph (#2614 ) ### What this PR does / why we need it? * Unify execution paths: Consolidates the quantized and non-quantized execution paths into a single `fused_experts` function, removing duplicated logic and making the control flow clearer and easier to maintain. * W8A8 dynamic quantization: Adds support for W8A8 dynamic quantization inside the unified MoE kernel. Communication routines are updated to correctly handle dynamic quantization scales for activations. * Weight pre-processing: Pre-transpose the `w13` and `w2` weight matrices (as implemented in PR #2025) so that quantized and non-quantized models follow the same code path for the MoE gating, up-projection, and down-projection operations. * All-to-all communication: Adds an `all-to-all` collective communication pattern. For large token counts on modern hardware, `all-to-all` is more efficient than the previous `all-gather` strategy. However, `all-to-all` is not actually captured and replayed, because its multiple D2H operations trigger synchronization and thus raise errors when capturing graphs. We only use `all-to-all` when falling back to `compiled_graph_for_general_shape`. * Dynamic communication selection: The model runner now selects the optimal MoE communication method (`mc2`, `allgather`, or `alltoall`) at runtime based on token count and the Ascend SoC version. * Limitation: `all-gather` is not yet supported for quantized models, which means there is still something left to do on A2. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? No further test cases needed. - vLLM version: v0.10.1.1 - vLLM main: `d660c98c1b` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-30 11:00:35 +08:00
yiz-liu	aadc75c247	[Fix] Resolve data-parallel (DP) assertion errors in TorchAir (#2626 ) ### What this PR does / why we need it? It is confirmed that `num_input_tokens` must be assigned the value of `maybe_padded_num_tokens` under all circumstances. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Waiting for daily test for TorchAir. - vLLM version: v0.10.1.1 - vLLM main: `006477e60b` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-29 16:06:49 +08:00
lidenghui1110	600b08f754	[Feat]: Add custom lmhead tensor model parallel (#2309 ) ### What this PR does / why we need it? This PR introduces LMhead tensor model parallel to reduce memory consumption and improve TPOT performance. It supports both eager mode and graph mode. In a deepseek r1 w8a8 PD disaggregated Decode instance using pure DP with lmhead_tensor_parallel_size = 8, we observe a 1 ms TPOT improvement and save 1.48 GB of NPU memory per rank. performance data: <img width="1444" height="438" alt="image" src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| lmhead_tensor_parallel_size \| Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces \| No \| int \| default value is None, once this value is set, the feature will be enabled, vocab_size must be divisible by this value. \| example `--additional_config={"lmhead_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de533ab2a1` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zhangzihang <zzh_201018@outlook.com>	2025-08-29 11:41:21 +08:00
yiz-liu	dfc7eb39ad	[Fix] Fix DP-related padding logic (#2582 ) ### What this PR does / why we need it? The determination of attention state, padding, and other forward metadata has been moved to an earlier stage within the input preparation process. This change enables us to utilize a single all-reduce operation, maximizing synchronization efficiency as early as possible. The logic for synchronizing metadata—such as the number of tokens, prefill status, and DBO status—across data parallel (DP) ranks has now been unified and simplified. For performance improvements, the all-reduce operation has been switched from the `gloo` backend to the `npu` backend, which results in a reduction of several milliseconds per step (approximately 10% performance gain for TPOT!). Additionally, the multi-DP server hang issue has been resolved, ensuring no more hangs occur when `num_requests < dp_size`. Alas, a relief. Finally, the miscalculated memory usage issue has been addressed by removing the unnecessary `DummyCommImpl`, allowing the system to use the real communication method when determining available memory. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Maybe we should add a test case for multi-DP online server? @MengqingCao - vLLM version: v0.10.1.1 - vLLM main: `c5d004aaaf` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-28 19:39:58 +08:00
Yikun Jiang	175f6bc445	Support v0.10.1 (#2584 ) ### What this PR does / why we need it? This patch also supports v0.10.1 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - CI passed - test 0.10.1: https://github.com/vllm-project/vllm-ascend/pull/2583 - vLLM version: v0.10.1.1 - vLLM main: `321938e9ac` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-28 18:47:53 +08:00
rjg-lyh	2bfbf9b9b3	[main][bugfix] Fix bugs and refactor cached mask generation logic (#2442 ) ### What this PR does / why we need it? This PR fixes bugs and refactors the cached mask generation logic. The cached mask is now pre-constructed and used on the CPU instead of on the NPU device. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with newly added/existing tests. - vLLM version: v0.10.1.1 - vLLM main: `9b5f64238f` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-08-27 12:07:29 +08:00
Mengqing Cao	a9e78a3299	[Aclgraph] Update compilation config in `check_and_update_config` (#2540 ) ### What this PR does / why we need it? This PR updates the compilation config in `check_and_update_config`: we use `compilation_config.level` to update `compilation_config.cudagraph_mode` to ensure the config is correct. It adds `compilation_config.cudagraph_num_of_warmups = 1` when V1 is enabled, because this is also used in torchair graph mode. This fixes https://github.com/vllm-project/vllm-ascend/issues/2523 and fixes the bug where `aclgraphmode` is always `NONE` while running forward in aclgraph mode. ### How was this patch tested? CI passed with newly added/existing tests. - vLLM version: v0.10.1.1 - vLLM main: `f58675bfb3` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-27 09:30:25 +08:00
wangxiyuan	f22077daa6	[Embedding] Recover embedding function (#2483 ) Fix broken embedding function. It's broken by http://github.com/vllm-project/vllm/pull/23162 - vLLM version: v0.10.1.1 - vLLM main: `efc88cf64a` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-27 09:22:01 +08:00
rjg-lyh	358ba68994	[main][bugfix] Fix MatmulNZ format bug on some machines (#2549 ) ### What this PR does / why we need it? This PR fixes the bug on some machines where quantmatmul failed to run with the NZ format. The change ensures proper execution under the expected data layout. ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `b5d34af328` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-08-27 09:08:17 +08:00
yiz-liu	a6bb502e70	[2/N][Feat] Add MC2 communication method for MoE layers (#2469 ) ### What this PR does / why we need it? This method replaces the previous all-gather approach for small numbers of tokens. The key changes include: - A new `AscendFusedMoE` layer that handles token splitting, local computation, and final aggregation via all-gather. - Logic in the model runner to dynamically select between the new MC2 method and the existing all-gather method based on the number of input tokens. - Sharding the MoE communication mask across tensor-parallel ranks. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test case fixed. - vLLM version: v0.10.1.1 - vLLM main: `b00e69f8ca` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-26 19:05:23 +08:00
wangxiyuan	7e494e94a9	[CI] Fix broken ci (#2530 ) vLLM commit https://github.com/vllm-project/vllm/pull/22711 changed the encode cache entries logic; this PR adapts the same change for vllm ascend to make CI happy. Co-Authored-By: zhoux77899 <zhouxiang100@huawei.com> - vLLM version: v0.10.1.1 - vLLM main: `0ff902f3b4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-26 07:42:24 +08:00
yiz-liu	99bf25af76	[Fix] Add operations in `_dummy_run` to maintain synchronization with `_process_reqs`, resolving a service hang (#2454 ) ### What this PR does / why we need it? Fixes hang when batch size < DP size. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? After this change, the function in DP case will work now. - vLLM version: v0.10.1.1 - vLLM main: `d9a55204ba` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-25 19:56:02 +08:00
wangxiyuan	de7649492d	[Refactor] cleanup converting_weight_acl_format_format (#2482 ) move maybe_converting_weight_acl_format_format to torchair module, it's only used with 310p+torchair - vLLM version: v0.10.1.1 - vLLM main: `49ab23b3cc` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-25 19:48:55 +08:00
ZhaoJiangJiang	3629bc4431	feat: add mtp ut and fix some bugs (#2453 ) ### What this PR does / why we need it? Fix mtp mode ut ### Does this PR introduce _any_ user-facing change? Nothing ### How was this patch tested? This can be tested in the same way as a unit test. - vLLM version: v0.10.0 - vLLM main: `53415653ff` Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com> Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>	2025-08-22 17:09:08 +08:00
Mengqing Cao	b0403f8d8a	[CI] fix ci (#2464 ) ### What this PR does / why we need it? 1. use action/checkout@v5 instead of v4 2. remove dbo test case because there is an issue with it and it will be refactored later 3. make vllm-ascend compatible with vllm v0.10.1.1 and add CI for it 4. fix sampler api changes introduced by https://github.com/vllm-project/vllm/pull/22387 6. fix qwen3 moe config changes introduced by https://github.com/vllm-project/vllm/pull/20562 7. fix kvcache block changes introduced by https://github.com/vllm-project/vllm/pull/23262 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `0c6e40bbaa` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-22 07:30:48 +08:00
linfeng-yuan	0ca3f48c90	[2/N][refactor] torchair deepseek mla backend refactor (#2459 ) ### What this PR does / why we need it? This PR moves the current unified mla backend to the torchair folder and removes torchair-related code in attention/mla_v1.py (1.3k -> 0.9k). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Running eager mode with mla backend, and torchair mode with code before [2445](https://github.com/vllm-project/vllm-ascend/pull/2445) - vLLM version: v0.10.0 - vLLM main: `f571ff8eb6` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-21 14:02:30 +08:00
weiguihua2	0dca4c6dbd	refactor model runner v1 (#2461 ) Refactor model runner v1 ### What this PR does / why we need it? 1. Separate the execute model logic from the prepare input logic 2. Split the torchair logic out of model runner v1 - vLLM version: v0.10.0 - vLLM main: `68fcd3fa73` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-08-21 08:54:57 +08:00
Mengqing Cao	1327f9be1c	Fix some ci issue and refactor modelrunner (#2445 ) ### What this PR does / why we need it? Fix some ci issue and refactor modelrunner ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `4d9c61993a` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com>	2025-08-20 09:01:04 +08:00
Shanshan Shen	83e0f41408	[3/N][Refactor] Move `torchair_attention` to `torchair` dir (#2017 ) ### What this PR does / why we need it? 1. Move `torchair_attention` to `torchair` dir. 2. Make `AscendAttentionTorchairBackend` extend `AscendAttentionBackend` to reduce duplicate methods. 3. Make `AscendTorchairMetadata` extend `AscendMetadata` to reduce duplicate properties. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `0933f9d518` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-19 10:25:22 +08:00
Pleaplusone	3f4a358b14	[Bugfix] Fix custom op register issue (#2409 ) ### What this PR does / why we need it? Our current code registers the custom ops during the platform initialization phase. However, when a new process is started by creating a worker, the earlier patch loses its effect on the custom ops, which leads to a fallback to the native pass written in vllm. This PR moves the patch code to the worker to make sure the custom op patch works as expected. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `8ea0c2753a` Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-19 09:09:43 +08:00
CaveNightingale	2bb7e55022	[Bugfix][PD]fix non-working disaggregated prefill (#2374 ) ### What this PR does / why we need it? Mainline vLLM fixes its disaggregated prefill in https://github.com/vllm-project/vllm/pull/22598 . But it is still not working in vllm-ascend. To be concrete, decoder instances crash before vllm's fix and hang after vllm's fix in ascend devices. This patch allows disaggregated prefill to work. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Qwen3-0.6B 1P1D tp=1 dp=1 - vLLM version: v0.10.0 - vLLM main: `0fe85087a9` --------- Signed-off-by: CaveNightingale <cavenightingale@foxmail.com>	2025-08-15 16:59:52 +08:00
Mengqing Cao	61866b8ac6	[Quickfix] update CachedRequestState as NewRequestData changed (#2367 ) ### What this PR does / why we need it? 1. update `CachedRequestState` as `NewRequestData` changed in https://github.com/vllm-project/vllm/pull/22570 2. drop maintenance of vllm v0.10.0 in the branch main ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `92ff41abea` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-15 07:35:27 +08:00
Shanshan Shen	103654ccd6	[Misc] Remove redundant imported `envs`, using `envs_ascend` instead (#2193 ) ### What this PR does / why we need it? Remove redundant imported `envs`, using `envs_ascend` instead. ```python import vllm.envs as envs_vllm import vllm_ascend.envs as envs_ascend ``` - vLLM version: v0.10.0 - vLLM main: `71683ca6f6` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-14 09:33:39 +08:00
zhenghaojiang	0f7492d18e	[Bugfix] fix the oom when chunkprefill with long context like 64k (#2319 ) The attn mask was declared in mla.py. We don't need the splitfuse mask for MLA chunked prefill, and this mask causes memory problems with long contexts such as 64k or 128k. - vLLM version: v0.10.0 - vLLM main: `14a5d903ab` --------- Signed-off-by: haojiangzheng <justineric096@gmail.com>	2025-08-13 17:15:59 +08:00
yiz-liu	992271b027	[1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic (#2125 ) ### What this PR does / why we need it? This PR refactors the MoE (Mixture of Experts) communication logic by introducing a strategy pattern. It defines an abstract base class, `MoECommMethod`, which encapsulates different communication strategies for MoE layers. By decoupling the MoE implementation from any single communication method, this change makes it simpler to add, replace, or optimize communication strategies in the future. Plan / Roadmap 1. Introduce `MoECommMethod`, implement `AllGatherImpl`, and adapt ACL Graph handling to cover all scenarios (this PR). 2. Implement `MC2CommImpl` and `AllToAllCommImpl` to optimize performance in specific scenarios. 3. Enable W8A8 / Int8 models to use `unified_fused_experts`. Other notes * Data-parallel (DP) communication currently does not work with vLLM's dispatch/combine mechanisms; an alternative approach is required to resolve this incompatibility. - vLLM version: v0.10.0 - vLLM main: `f7ad6a1eb3` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-12 21:10:20 +08:00
wangxiyuan	1a70564e7c	[5/N][Refactor] torchair model runner refactor (#2216 ) There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out torchair-related logic. Following the workflow #2203 What this PR does: create common function `_capture_model` for capture_model - vLLM version: v0.10.0 - vLLM main: `1891a265d3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-12 14:24:50 +08:00
wangxiyuan	c8b0f5f799	[4/N][Refactor] torchair model runner refactor (#2208 ) There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out torchair-related logic. Following the workflow #2203, this is the first PR. What this PR does: create common function `_convert_torch_foramt` for initialize_kv_cache - vLLM version: v0.10.0 - vLLM main: `14a5d903ab` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 21:39:24 +08:00
wangxiyuan	881e36d6a9	[3/N][Refactor] torchair model runner refactor (#2207 ) There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out torchair-related logic. Following the workflow #2203, this is the first PR. What this PR does: create common function `_build_attention_metadata` and `_generate_dummy_run_hidden_states` for dummy_run - vLLM version: v0.10.0 - vLLM main: `ebf7605b0d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 18:03:19 +08:00
wangxiyuan	1ab15414bb	[2/N][Refactor] torchair model runner refactor (#2204 ) There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out torchair-related logic. Following the workflow #2203 What this PR does: move `torchair` related logic into `_get_forward_metadata_across_dp` and override it in torchair model runner - vLLM version: v0.10.0 - vLLM main: `1b99028069` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 14:06:49 +08:00
wangxiyuan	9260910c8d	[CI] Fix broken CI (#2302 ) 1. disable test_eagle_ccorrectness test, we'll reopen it once the OOM error is fixed. 2. drop transformers version limit for main, since vLLM relies on >=4.55.0, see: `65552b476b` 3. fix kv_connector_output bug, see: `796bae07c5` - vLLM version: v0.10.0 - vLLM main: `d1af8b7be9` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 11:22:32 +08:00
lbk-sys	c611291661	【main】SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
Li Wang	57b9f02185	[Bugfix] Fix disaggregated pd error (#2242 ) ### What this PR does / why we need it? Fix `ascend_env has no attr VLLM_ASCEND_ENABLE_CHUNK_MC2`, remove useless lines - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:48:10 +08:00
xuyexiong	26fc36b0e0	[V1] MTP supports torchair (#2145 ) ### What this PR does / why we need it? Support MTP with: - [x] V0 Scheduler - [x] TorchAir - [x] Single DP - [x] Multi DP - [x] Disaggregate PD Known issues: - [ ] V1 Scheduler (chunked prefill) is not supported yet; support is planned in a few weeks - [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now, so you need to comment out lines 171-175 in file `vllm/vllm/v1/metrics/loggers.py` ``` if (len(self.engine_indexes) > 1 and vllm_config.speculative_config is not None): raise NotImplementedError("Prometheus metrics with Spec Decoding " "with >1 EngineCore per AsyncLLM is not " "supported yet.") ``` To start an online server with torchair enabled, here is an example: ``` python -m vllm.entrypoints.openai.api_server \ --model="/weights/DeepSeek-R1_w8a8/" \ --trust-remote-code \ --max-model-len 40000 \ --tensor-parallel-size 4 \ --data_parallel_size 4 \ --max-num-seqs 16 \ --no-enable-prefix-caching \ --enable_expert_parallel \ --served-model-name deepseekr1 \ --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \ --quantization ascend \ --host 0.0.0.0 \ --port 1234 \ --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \ --gpu_memory_utilization 0.9 ``` offline example with torchair enabled ``` from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=16, temperature=0) # Create an LLM. 
llm = LLM( model="/home/data/DeepSeek-R1_w8a8/", tensor_parallel_size=16, max_num_seqs=16, gpu_memory_utilization=0.9, distributed_executor_backend="mp", enable_expert_parallel=True, speculative_config={ "method": "deepseek_mtp", "num_speculative_tokens": 1, }, trust_remote_code=True, enforce_eager=False, max_model_len=2000, additional_config = { 'torchair_graph_config': { 'enabled': True, "graph_batch_sizes": [16], 'enable_multistream_shared_expert': False, }, "ascend_scheduler_config": { "enabled": True }, # 'expert_tensor_parallel_size': 16, } ) # Generate texts from the prompts. # llm.start_profile() outputs = llm.generate(prompts, sampling_params) # llm.stop_profile() for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.10.0 - vLLM main: `302962e806` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-08-06 19:37:43 +08:00
wangxiyuan	292fb8f696	[1/N][Refactor] torchair model runner refactor (#2205 ) There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out the torchair-related logic. Following the workflow #2203, this is the first PR. What this PR does: create the new torchair model runner; more functions will be added later - vLLM version: v0.10.0 - vLLM main: `586f286789` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-05 18:43:04 +08:00
jinyuxin	583ad8f347	[main][refactor] Refactor forward metadata retrieval across DP nodes to reduce redundant padding. (#2062 ) Before refactoring cross-DP decoding metadata aggregation, clean up the token‐padding logic. ### What this PR does: 1. First checks whether any DP instance is in the prefill phase. 2. If in the `decode` phase and `torchair_graph_enabled` is true, pads each DP instance’s token count up to the global maximum. 3. If in the `prefill` phase, or in the decode phase with graph mode disabled, returns each DP instance’s original token count without padding. This reordering removes the previous two‐step padding/unpadding flow and ensures padding only occurs when strictly necessary. - vLLM version: v0.10.0 - vLLM main: `bd3db7f469` Signed-off-by: yx0716 <jinyx1007@foxmail.com> Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-05 17:03:36 +08:00
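
The three-step padding rule described in #2062 can be illustrated with a minimal sketch. This is not the code from that PR: the function name `resolve_token_counts_across_dp` and its list-based inputs are hypothetical, chosen only to show the decision order (check for prefill first, pad only in decode with graph mode enabled, otherwise return the counts unchanged).

```python
from typing import List, Tuple


def resolve_token_counts_across_dp(
    num_tokens_per_dp: List[int],
    with_prefill_per_dp: List[bool],
    torchair_graph_enabled: bool,
) -> Tuple[List[int], bool]:
    """Hypothetical illustration of the padding rule, not the real implementation.

    Returns the (possibly padded) token count for each DP instance and a flag
    telling whether any DP instance is in the prefill phase.
    """
    if not num_tokens_per_dp:
        return [], False

    # Step 1: check whether any DP instance is in the prefill phase.
    any_prefill = any(with_prefill_per_dp)

    # Step 2: pure decode with graph mode enabled, so pad every instance
    # up to the global maximum token count.
    if not any_prefill and torchair_graph_enabled:
        global_max = max(num_tokens_per_dp)
        return [global_max] * len(num_tokens_per_dp), any_prefill

    # Step 3: prefill, or decode without graph mode, so keep the
    # original counts without padding.
    return list(num_tokens_per_dp), any_prefill
```

Deciding once, up front, whether padding is needed is what removes the separate pad-then-unpad pass mentioned in the commit message.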

188 Commits