xc-llm-ascend

Author	SHA1	Message	Date
yupeng	29f195a91c	[Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156 ) ### What this PR does / why we need it? Fix the error that reports while initializing qwen3-reranker-0.6b model with `--enable-lora`. And add a testcase to verify the fix. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-03-15 17:55:42 +08:00
realliujiaxu	5e24b26a54	[Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883 ) ### What this PR does / why we need it? PR #5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? #### 1. start server ``` bash vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}' ``` #### 2. curl ```bash curl -s http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "auto", "messages": [ {"role": "user", "content": "Hello. I have a question. Who are you?"} ], "max_tokens": 10, "temperature": 0.0, "ignore_eos_token": true }' ``` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-01 20:22:50 +08:00
realliujiaxu	5def28dcd3	[Feat]support sequence parallelism by pass for VL models (#5632 )	2026-02-27 08:27:41 +08:00
SILONG ZENG	19b5d44ea8	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #10 ) (#6173 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|`vllm_ascend/ops/layer_shard_linear.py`\| \|`vllm_ascend/ops/linear.py`\| \|`vllm_ascend/ops/linear_op.py`\| \|`vllm_ascend/worker/worker.py`\| \| ` vllm_ascend/patch/worker/patch_bert.py` \| \| ` vllm_ascend/patch/worker/patch_deepseek.py` \| \| ` vllm_ascend/patch/worker/patch_distributed.py` \| \| ` vllm_ascend/patch/worker/patch_module.py` \| \| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` \| \| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` \| \| ` vllm_ascend/patch/worker/patch_rope.py` \| \| ` vllm_ascend/patch/worker/patch_triton.py` \| \| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` \| \| ` vllm_ascend/patch/worker/patch_v2_egale.py` \| \|` vllm_ascend/worker/npu_input_batch.py`\| \|` vllm_ascend/worker/v2/aclgraph_utils.py`\| \|` vllm_ascend/worker/v2/attn_utils.py`\| \|` vllm_ascend/worker/v2/model_runner.py`\| \|` vllm_ascend/worker/v2/sample/gumbel.py`\| \|` vllm_ascend/worker/v2/sample/penalties.py`\| \|` vllm_ascend/worker/v2/sample/sampler.py`\| \|` vllm_ascend/worker/v2/spec_decode/__init__.py`\| \|` vllm_ascend/worker/v2/spec_decode/eagle.py`\| \|` vllm_ascend/worker/v2/states.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 15:35:06 +08:00
Wang Kunpeng	13cd6362c6	[bugfix] fix Error 'ValueError: Duplicate layer name' (#5280 ) ### What this PR does / why we need it? When matmul_and_reduce is enabled, the prefix attribute is required. However, in some models, the prefix is not passed correctly, causing errors when starting the service. The issue of incorrect prefix passing will be fixed in vLLM in the future. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-25 10:43:24 +08:00
Nengjun Ma	42c989a437	Update vllm pin to 12.24 (#5307 ) ### What this PR does / why we need it? Fix vllm break in the pr: 1. [Add MiMo-V2-Flash support] (https://github.com/vllm-project/vllm/pull/30836) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Co-authored-by: zxwang [1476209578@qq.com](mailto:1476209578@qq.com) - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-24 17:24:31 +08:00
zzzzwwjj	cc23067f1e	[refactor] refactor weight trans nz and transpose (#4878 ) ### What this PR does / why we need it? Now `VLLM_ASCEND_ENABLE_NZ` will have three options: 0: disable nz; 1: only quant case enable nz; 2: enable nz as long as possible; And `VLLM_ASCEND_ENABLE_NZ`=1 by default. All cases are shown in the table below: \| \| W4A4 \| W4A8 \| W8A8 \| fp16/bf16 \| fp32 \| \|---\|---\|---\|---\|---\|---\| \| trans nz \| can't support nz \| trans nz by default \| trans nz by default \| trans nz when VLLM_ASCEND_ENABLE_NZ is 2 \| can't support nz \| \| transpose \| only support not transpose case \| only support transpose case \| only support transpose case \| linear: only support not transpose case<br>gmm: only support transpose case \| same to fp16/bf16 \| Some exceptional cases: 1. MLAPO op need to do some additional processing on the weights, including trans nz. If use MLAPO op, some weight will be transformed to nz forcely; 2. MLA/SFA's weight `W_UV` will be used by op `torch.ops._C_ascend.batch_matmul_transpose`, and this op can't support nz currently; ### Does this PR introduce _any_ user-facing change? Now fp16/bf16 weight will not trans nz by default. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-19 14:27:24 +08:00
Slightwind	12ca99c94e	[Bugfix] Remove ModelSlim-"M4 Quantization". (#4589 ) The M4 quantization method in ModelSlim adds bias to model weights that originally do not have a linear bias. PR #4235 supported PD-MIX quantization and M4 quantization, adding bias to `w8a8.py` and `w8a8_dynamic.py`, and implementing adaptations in `ops/linear.py` to prevent it from being reset to `None` by `self.register_parameter("bias", None)`. However, this modification introduced an issue where the bias was still being reset to `None` in certain scenarios, causing errors during service startup. Therefore, support for M4 quantization is temporarily being reverted in this PR. ___ - vLLM version: v0.11.2 Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-12-01 23:45:02 +08:00
zzhxxx	203b4e6777	[Bug_fix] fix torchair o_proj forward parameter (#4166 ) ### What this PR does / why we need it? In `torchair_mla.py`, the `self.oproj` function includes an additional parameter `is_force_scatter`, while the `AscendRowParallelLinear` function in `linear.py` does not add this parameter. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2025-12-01 19:57:01 +08:00
Slightwind	18eefc23c3	[feature] Support W8A8 PD-Mix Quantization (#4235 ) In PD-separated deployment scenarios: * MoE layers use dynamic quantization exclusively. * For the Attention module, Prefill (P) nodes use dynamic quantization, while Decode (D) nodes use static quantization. In PD-mixed deployment scenarios: * All components fall back to dynamic quantization, as it is difficult to distinguish between Prefill and Decode tokens. ___ - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: Slightwind <slightwindsec@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-11-30 11:57:26 +08:00
InSec	5a4e8cdeba	[Feat][BugFix]Support the Qwen3-Next-80B-A3B-Instruct quantization model&Fix the NZ issue (#4245 ) ### What this PR does / why we need it? Support the Qwen3-Next-80B-A3B-Instruct quantization model and Fix the NZ issue. Triton kernel doesn't support data format nz, thus we skip converting weight to nz on layer `conv1d` - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: IncSec <1790766300@qq.com>	2025-11-21 10:42:56 +08:00
wangxiyuan	cc2cd42ad3	Upgrade CANN to 8.3.rc1 (#3945 ) ### What this PR does / why we need it? This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version check logic. TODO: we notice that UT runs failed with CANN 8.3 image. So the base image for UT is still 8.2. We'll fix it later. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-03 20:21:07 +08:00
rjg-lyh	c1a6aeab46	[main][bugfix] fix valueError in static_forward_context when prefix is empty (#3924 ) ### What this PR does / why we need it? This PR temporarily bypasses the scenario where some models in vLLM trigger a `ValueError` during the process of storing values in `static_forward_context` when no `prefix` is specified for the linear layers, which is a bug in some models in vLLM. The official fix will be addressed by submitting a PR to the vLLM community that specifies a prefix for the linear layers in each model. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-10-31 14:55:58 +08:00
rjg-lyh	292e213dd2	[main][refactor] refactor SequenceRowParallelOp forward (#3616 ) ### What this PR does / why we need it? This PR refactors SequenceRowParallelOp forward. In order to further expand the operator inclusion scope in dynamic judgment scenarios, this PR customizes the entire matmul computation and communication as a custom operator masking. With this refactor, it will support directly writing code such as common operation fusion into the `SequenceRowParallelOp` class's member function `matmul_and_reduce`, without the need to register more redundant custom masking operators. ### How was this patch tested? CI passed with existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-10-23 14:41:15 +08:00
linfeng-yuan	068ed706c8	[feat][torchair] support super kernel feat for quantized dsr1 (#3485 ) ### What this PR does / why we need it? Port #1916 and #2157 to master branch to fuse operators in deepseek moe layers, which can reduce scheduling overhead on devices. Note that this feature is valid only when `tp_size = 1` and `multistream_overlap_shared_expert` is enabled with torchair graph mode. ### Does this PR introduce _any_ user-facing change? Users can enable this feature with `--additional-config '{"torchair_graph_config":{"enabled":true, "enable_super_kernel":true}, "multistream_overlap_shared_expert":true}'`. ### How was this patch tested? E2E deepseek serving with 2P1D disaggregated prefill scenarios. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-20 20:04:37 +08:00
anon189Ty	07e39620ea	[Feat] Unquantized Linear to nz and control all nz-cast (#3356 ) ### What this PR does / why we need it? Currently, when executing to the Linear layer of models in vLLM-Ascend, the weights format is ND in unquantized case and skipped ascend case. This PR supplements the execution logic for Linear layer. We use a new global variable: VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and CANN version is 8.3, the weights of the Linear layer will be converted to FRACTAL_NZ, in both unquantized case and skipped ascend case. We also use VLLM_ASCEND_ENABLE_NZ to control the existing NZ conversion, such as w8a8-quantized case. ### Does this PR introduce _any_ user-facing change? Add a new global variable VLLM_ASCEND_ENABLE_NZ. If you want to use NZ format, you should set VLLM_ASCEND_ENABLE_NZ=1. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-14 17:39:26 +08:00
weijinqian0	6972df5951	[Feature] optimize sp & qwen3 next support sp. (#3225 ) This PR will accomplish the following tasks: optimize SP In the old version implementation, the first layer was all_reduce, which used rms to split chunks. We changed it to perform reduce_scatter on the embedding side, replace one all_reduce operation and one chunk with one reduce_scatter operation. Support qwen3 next Since Qwen3 Next includes a linear attention module, the prefix name of this module cannot take effect directly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-10-13 23:02:12 +08:00
realliujiaxu	af2a886814	refactor linear (#2867 ) ### What this PR does / why we need it? The current linear.py has the following issues: - There is redundant conditional logic in the `comm_group` and `forward` selection for classes such as `AscendMergedColumnParallelLinear`. - Inconsistent comm_group selection logic exists among `AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and `AscendQKVParallelLinear`. To address these two issues, this PR encapsulates `comm_group` and `forward` into classes and extracts the classes selection logic into common functions. For future additions of custom communication groups or forward methods, it will only be necessary to extend `CustomColumnParallelOp` or `CustomRowParallelOp` and add new selection logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `dd39baf717` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: weijinqian0 <weijinqian@huawei.com>	2025-09-18 14:09:19 +08:00
Yikun Jiang	756b8a1946	Revert "[Feat] Unquantized linear nz support (#2619 )" (#2896 ) ### What this PR does / why we need it? This reverts commit `7b2ecc1e9a`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: main - vLLM main: `64d90c3e4f` Closes: https://github.com/vllm-project/vllm-ascend/issues/2890 Closes: https://github.com/vllm-project/vllm-ascend/issues/2887 Closes: https://github.com/vllm-project/vllm-ascend/issues/2885 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-12 20:51:12 +08:00
rjg-lyh	0005479b9c	[main] mlp weight prefetch in Qwen Dense Models (#2816 ) ### What this PR does / why we need it? This PR prefetchs the weight of mlp layers in Qwen Dense Models to optimize the performance in Decode phase mainly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: main - vLLM main: `a1213fae5f` Signed-off-by: rjg-lyh <1318825571@qq.com> Co-authored-by: Shuming19 <313093131@qq.com>	2025-09-11 21:20:09 +08:00
anon189Ty	7b2ecc1e9a	[Feat] Unquantized linear nz support (#2619 ) ### What this PR does / why we need it? Currently, when executing to the Linear layer of the model in vLLM-Ascend, the weights input format is ND in unquantized case and skipped ascend case, which is slower than FRACTAL_NZ. This PR supplements the execution logic for Linear layer. When VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 and CANN version is 8.3, the weights of the Linear layer will be converted to FRACTAL_NZ, in both unquantized case and skipped ascend case. - vLLM version: main - vLLM main: `267c80d31f` Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-09-11 11:40:00 +08:00
Li Wang	22b425765a	[Bugfix] Fix broken CI (#2825 ) ### What this PR does / why we need it? 1. Initial support disable tp for integrating with [vllm-commit](https://github.com/vllm-project/vllm/pull/23024) 2. [vllm@commit](https://github.com/vllm-project/vllm/pull/23673) now use `bytes` to save the `BlockHash` to reduce GC overhead, this pr add the integration - vLLM version: main - vLLM main: `e40827280b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-10 13:29:29 +08:00
rjg-lyh	1bbb20ea13	[main] flashcomm_v1 optim in Qwen Dense Models (#2802 ) ### What this PR does / why we need it? Flashcomm_v1 optim in Qwen Dense Models. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `5e537f45b4` Co-authored-by: 1024daniel <xxltju324@gmail.com>	2025-09-08 22:52:24 +08:00
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces Oproj matrix tensor model parallel to achieve decreasing of memory consumption. It only support graph mode in pure DP scenario. In deepseek r1 w8a8 PD disagregated Decode instance, using pure DP, with oproj_tensor_parallel_size = 8, we have 1 ms TPOT increasing, saved 5.8 GB NPU memory per RANK. We got best performance when oproj_tensor_parallel_size=4 without TPOT increasing. performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| default value is None, once this value is set, the feature will be enabled, head num * head dim must be divisible by this value. \| example `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
sherie	3fb80ee356	add mlp tp optimze (#2120 ) ### What this PR does / why we need it? For dense models, by not applying tensor parallelism (TP) to the attention module and applying TP to the MLP module, the allreduce operations in the attention module can be eliminated, thereby reducing computational overhead. However, this approach increases memory usage, so the environment variable VLLM_ASCEND_ENABLE_MLP_OPTIMZE is used to control this optimization. - vLLM main: `b17109beea` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-08-21 09:22:07 +08:00

25 Commits