xc-llm-ascend

Author	SHA1	Message	Date
SILONG ZENG	19b5d44ea8	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #10 ) (#6173 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|`vllm_ascend/ops/layer_shard_linear.py`\| \|`vllm_ascend/ops/linear.py`\| \|`vllm_ascend/ops/linear_op.py`\| \|`vllm_ascend/worker/worker.py`\| \| ` vllm_ascend/patch/worker/patch_bert.py` \| \| ` vllm_ascend/patch/worker/patch_deepseek.py` \| \| ` vllm_ascend/patch/worker/patch_distributed.py` \| \| ` vllm_ascend/patch/worker/patch_module.py` \| \| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` \| \| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` \| \| ` vllm_ascend/patch/worker/patch_rope.py` \| \| ` vllm_ascend/patch/worker/patch_triton.py` \| \| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` \| \| ` vllm_ascend/patch/worker/patch_v2_egale.py` \| \|` vllm_ascend/worker/npu_input_batch.py`\| \|` vllm_ascend/worker/v2/aclgraph_utils.py`\| \|` vllm_ascend/worker/v2/attn_utils.py`\| \|` vllm_ascend/worker/v2/model_runner.py`\| \|` vllm_ascend/worker/v2/sample/gumbel.py`\| \|` vllm_ascend/worker/v2/sample/penalties.py`\| \|` vllm_ascend/worker/v2/sample/sampler.py`\| \|` vllm_ascend/worker/v2/spec_decode/__init__.py`\| \|` vllm_ascend/worker/v2/spec_decode/eagle.py`\| \|` vllm_ascend/worker/v2/states.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 15:35:06 +08:00
Li Wang	484e7c59dc	[CI] optimize lint term (#5986 ) ### What this PR does / why we need it? This patch aims to speed up the lint check, mainly by cutting unnecessary installation time. 1. Installing vllm is not required; appending the vllm source path to `PYTHONPATH` is sufficient. 2. Installing `requirements-dev.txt` is not required either, because the pre-built image `quay.io/ascend-ci/vllm-ascend:lint` has all the requirements installed in advance. NOTE: an image build is triggered by 1) the daily scheduled build; 2) a change to the requirements; 3) a manual build. This keeps the dependencies in the image as up to date as possible. 3. `mypy` was separated from the `pre-commit` hook for performance reasons; we found that integrating `mypy` into the `pre-commit` hook resulted in poor performance. 4. Reduce the CPU core consumption from 16 -> 8 ### Does this PR introduce _any_ user-facing change? The end-to-end lint time was reduced from 20 min per PR to 8 min per PR ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 15:46:59 +08:00
Magnus	5b129cf0a1	[1/N][Feat] Xlite Qwen3 MoE Support (#5951 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE model in Xlite. For more details about Xlite, please refer to the following link: https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. Qwen3-MoE TODO List: - [ ] Qwen3-235B-A22B support - [ ] Qwen3-MoE weights NZ support - [ ] Qwen3-MoE data parallel support ## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`69b170b8b5`) - xlite-full: main + xlite-full - xlite-decode-only: main + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph \| maxconcurrency \| item \| TTFT(ms) \| \| TPOT(ms) \| \| QPS (req/s) \| OutputSpeed (token/s) \| \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| \| \| \| Avg \| P99 \| Avg \| P99 \| \| \| \| 1 \| baseline-aclgraph \| 205.07 \| 287.29 \| 12.34 \| 12.65 \| 0.14 \| 78.81 \| \| 1 \| xlite-full \| 66.40 \| 113.69 \| 11.71 \| 12.40 \| 0.15 \| 84.73 \| \| 1 \| xlite-decode-only \| 221.15 \| 316.40 \| 12.16 \| 12.91 \| 0.14 \| 79.70 \| \| 1 \| diff1 \| -67.62% \| -60.43% \| -5.11% \| -1.98% \| 7.14% \| 7.51% \| \| 1 \| diff2 \| 7.84% \| 10.13% \| -1.46% \| 2.06% \| 0.00% \| 1.13% \| \| \| \| \| \| \| \| \| \| \| 16 \| baseline-aclgraph \| 1892.16 \| 13916.86 \| 22.78 \| 39.28 \| 1.15 \| 589.89 \| \| 16 \| xlite-full \| 1355.40 \| 8907.45 \| 15.96 \| 25.15 \| 1.65 \| 850.21 \| \| 16 \| xlite-decode-only \| 1519.42 \| 8711.64 \| 19.23 \| 29.73 \| 1.38 \| 711.60 \| \| 16 \| diff1 \| -28.37% \| -36.00% \| -29.94% \| -35.97% \| 43.48% \| 44.13% \| \| 16 \| diff2 \| -19.70% \| -37.40% \| -15.58% \| -24.31% \| 20.00% \| 20.63% \| \| \| \| \| \| \| \| \| \| \| 32 \| baseline-aclgraph \| 673.80 \| 3914.90 \| 32.20 \| 37.95 \| 1.80 \| 928.54 \| \| 32 \| xlite-full \| 481.65 \| 2710.50 \| 19.95 \| 25.35 \| 2.91 \| 1506.67 \| \| 32 \| xlite-decode-only \| 372.22 \| 1095.25 \| 25.19 \| 28.47 \| 2.33 \| 1202.82 \| \| 32 \| diff1 \| -28.52% \| -30.76% \| -38.04% \| -33.20% \| 61.67% \| 62.26% \| \| 32 \| diff2 \| -44.76% \| -72.02% \| -21.77% \| -24.98% \| 29.44% \| 29.54% \| \| \| \| \| \| \| \| \| \| \| 48 \| baseline-aclgraph \| 583.18 \| 3277.65 \| 41.02 \| 46.05 \| 2.17 \| 1115.08 \| \| 48 \| xlite-full \| 973.42 \| 8237.33 \| 23.29 \| 30.50 \| 3.71 \| 1908.09 \| \| 48 \| xlite-decode-only \| 480.79 \| 2026.98 \| 31.48 \| 35.41 \| 2.83 \| 1453.75 \| \| 48 \| diff1 \| 66.92% \| 151.32% \| -43.22% \| -33.77% \| 70.97% \| 71.12% \| \| 48 \| diff2 \| -17.56% \| -38.16% \| -23.26% \| -23.11% \| 30.41% \| 30.37% \| \| \| \| \| \| \| \| \| \| \| 64 \| baseline-aclgraph \| 742.74 \| 5953.39 \| 47.79 \| 53.15 \| 2.48 \| 1272.37 \| \| 64 \| xlite-full \| 545.22 \| 3941.34 \| 25.09 \| 30.41 \| 4.64 \| 2376.44 \| \| 64 \| xlite-decode-only \| 752.40 \| 4534.29 \| 38.67 \| 43.28 \| 3.06 \| 1567.94 \| \| 64 \| diff1 \| -26.59% \| -33.80% \| -47.50% \| -42.78% \| 87.10% \| 86.77% \| \| 64 \| diff2 \| 1.30% \| -23.84% \| -19.08% \| -18.57% \| 23.39% \| 23.23% \| \| \| \| \| \| \| \| \| \| \| 100 \| baseline-aclgraph \| 565.52 \| 1716.81 \| 60.89 \| 68.69 \| 3.08 \| 1580.64 \| \| 100 \| xlite-full \| 398.14 \| 2328.88 \| 30.70 \| 32.45 \| 6.01 \| 3086.42 \| \| 100 \| xlite-decode-only \| 712.53 \| 4875.94 \| 52.71 \| 60.78 \| 3.53 \| 1813.58 \| \| 100 \| diff1 \| -29.60% \| 35.65% \| -49.58% \| -52.76% \| 95.13% \| 95.26% \| \| 100 \| diff2 \| 26.00% \| 184.01% \| -13.43% \| -11.52% \| 14.61% \| 14.74% \| \| \| \| \| \| \| \| \| \| \| 150 \| baseline-aclgraph \| 842.42 \| 5175.01 \| 73.60 \| 88.18 \| 3.80 \| 1952.26 \| \| 150 \| xlite-full \| 568.52 \| 4204.33 \| 37.90 \| 40.01 \| 7.27 \| 3734.72 \| \| 150 \| xlite-decode-only \| 654.43 \| 2504.06 \| 67.40 \| 77.00 \| 4.18 \| 2145.11 \| \| 150 \| diff1 \| -32.51% \| -18.76% \| -48.51% \| -54.63% \| 91.32% \| 91.30% \| \| 150 \| diff2 \| -22.32% \| -51.61% \| -8.42% \| -12.68% \| 10.00% \| 9.88% \| \| \| \| \| \| \| \| \| \| \| 200 \| baseline-aclgraph \| 750.63 \| 3049.91 \| 88.26 \| 101.95 \| 4.28 \| 2189.72 \| \| 200 \| xlite-full \| 558.48 \| 3791.98 \| 45.54 \| 49.04 \| 8.17 \| 4175.52 \| \| 200 \| xlite-decode-only \| 807.09 \| 4254.95 \| 85.18 \| 101.79 \| 4.44 \| 2271.52 \| \| 200 \| diff1 \| -25.60% \| 24.33% \| -48.40% \| -51.90% \| 90.89% \| 90.69% \| \| 200 \| diff2 \| 7.52% \| 39.51% \| -3.49% \| -0.16% \| 3.74% \| 3.74% \| \| \| \| \| \| \| \| \| \| ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: changdawei1 <changdawei3@huawei.com> Co-authored-by: LVYANGGUO <275926687@qq.com> Co-authored-by: lulina <lina.lulina@huawei.com>	2026-01-21 09:26:03 +08:00
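The `diff1` and `diff2` rows in the Qwen3-MoE comparison above appear to be the relative change of each metric against the `baseline-aclgraph` row. The sketch below is a reading of the published numbers, not code from the PR, and the helper name `relative_change_pct` is hypothetical. It recomputes two `diff1` cells at maxconcurrency 1.

```python
def relative_change_pct(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline` (hypothetical helper)."""
    return (value - baseline) / baseline * 100.0


# Values copied from the maxconcurrency=1 rows of the table.
baseline_ttft_avg = 205.07     # baseline-aclgraph, TTFT Avg (ms)
xlite_full_ttft_avg = 66.40    # xlite-full, TTFT Avg (ms)
baseline_out_speed = 78.81     # baseline-aclgraph, OutputSpeed (token/s)
xlite_full_out_speed = 84.73   # xlite-full, OutputSpeed (token/s)

ttft_diff1 = relative_change_pct(xlite_full_ttft_avg, baseline_ttft_avg)
speed_diff1 = relative_change_pct(xlite_full_out_speed, baseline_out_speed)

print(f"TTFT Avg diff1:    {ttft_diff1:.2f}%")   # table reports -67.62%
print(f"OutputSpeed diff1: {speed_diff1:.2f}%")  # table reports 7.51%
```

For latency columns (TTFT, TPOT) a negative diff is an improvement; for QPS and OutputSpeed a positive diff is.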
Wang Xiaoran	3ce5a34468	[BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711 ) ### What this PR does / why we need it? This PR fixes a bug in the Xlite backend (https://atomgit.com/openeuler/GVirt/issues/1). The direct cause of the problem is that the XModel::PrepareAttn function obtained an illegal number of tokens to be inferred, -540. This illegal value results from the padding feature of graph-mode inference combined with residual state carried across steps. The issue is triggered when a new prefill request is added in the same step in which a decode request finishes. As a first fix, num_decode_tokens is used instead of attn_metadata.num_decodes. 1. In graph mode, vllm_ascend pads its inputs. In the _prepare_inputs function, if the number of tokens to be inferred is less than the set threshold (8 in this case), the attn_metadata.num_decode array is expanded to 8. 2. Meanwhile, vllm_ascend uses the class variable self.query_start_loc of NPUModelRunner to record the tokens to be inferred. Because this state is poorly coordinated with the graph-mode padding mechanism across steps, negative values may be computed for attn_metadata.query_lens in some cases (such as when a decode request completes in a step and a new prefill request is added at the same time). 3. After type conversion, the negative values in query_lens cause an overflow. Xlite detects that the number of tokens to be inferred for the decode request is too large and triggers a "decode len too long" alert. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Same as https://atomgit.com/openeuler/GVirt/issues/1 - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wwwumr <1127858301@qq.com>	2026-01-09 15:55:30 +08:00
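The failure chain described in this commit (stale cumulative offsets, a negative query length, then overflow on type conversion) can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the offsets are invented so that a length of -540 appears, the helper `query_lens_from_start_loc` is hypothetical, and a 32-bit unsigned target type is assumed because the commit does not state the actual width.

```python
import ctypes


def query_lens_from_start_loc(query_start_loc):
    """Per-request lengths as differences of cumulative offsets (hypothetical helper)."""
    return [end - start for start, end in zip(query_start_loc, query_start_loc[1:])]


# Invented offsets: the last entry is stale and smaller than the one before it,
# which mimics state left over from an earlier step.
stale_query_start_loc = [0, 1, 541, 1]
query_lens = query_lens_from_start_loc(stale_query_start_loc)
print(query_lens)  # [1, 540, -540]

# Assumed conversion to an unsigned 32-bit integer: the negative length wraps
# around to a very large value, which a length check would then reject.
wrapped = ctypes.c_uint32(query_lens[-1]).value
print(wrapped)  # 4294966756
```

Under these assumptions, a decode length of -540 becomes 4294966756 after conversion, which is consistent with the "decode len too long" alert the commit mentions.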
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
Magnus	a9fccbeb30	[CI] add xlite e2e test (#5305 ) ### What this PR does / why we need it? add xlite e2e test - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: DaweiChang <405739598@qq.com>	2025-12-25 09:17:06 +08:00
lvjunqi	55beac9c91	[Feat]Xlite Qwen3-vl Support (#5228 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-VL model in Xlite. For more details about Xlite, please refer to the following link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. The latest performance comparison data between xlite and the default aclgraph mode is as follows: ### Does this PR introduce _any_ user-facing change? XLite graph mode supports the Qwen3-VL model. ### How was this patch tested? vLLM version: v0.12.0 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: lvjunqi <lvjunqi1@huawei.com> Co-authored-by: lvjunqi <lvjunqi1@huawei.com>	2025-12-22 16:30:52 +08:00
weijinqian0	35ad11b637	[Refactor] remove some metadata variables in attention_v1. (#5160 ) RFC: https://github.com/vllm-project/vllm-ascend/issues/4629 Reason: The metadata data class contains an excessive number of variables. We will inherit the metadata of the community and simultaneously remove some variables that are no longer needed at present. Todo: 1. remove attn_state partly. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-12-19 14:57:09 +08:00
zzzzwwjj	cc23067f1e	[refactor] refactor weight trans nz and transpose (#4878 ) ### What this PR does / why we need it? `VLLM_ASCEND_ENABLE_NZ` now has three options: 0: disable nz; 1: enable nz only for quantized cases; 2: enable nz whenever possible. The default is `VLLM_ASCEND_ENABLE_NZ`=1. All cases are shown in the table below: \| \| W4A4 \| W4A8 \| W8A8 \| fp16/bf16 \| fp32 \| \|---\|---\|---\|---\|---\|---\| \| trans nz \| can't support nz \| trans nz by default \| trans nz by default \| trans nz when VLLM_ASCEND_ENABLE_NZ is 2 \| can't support nz \| \| transpose \| only support not transpose case \| only support transpose case \| only support transpose case \| linear: only support not transpose case<br>gmm: only support transpose case \| same as fp16/bf16 \| Some exceptional cases: 1. The MLAPO op needs additional processing on the weights, including trans nz. If the MLAPO op is used, some weights are forcibly transformed to nz; 2. MLA/SFA's weight `W_UV` is used by the op `torch.ops._C_ascend.batch_matmul_transpose`, which currently can't support nz; ### Does this PR introduce _any_ user-facing change? fp16/bf16 weights are no longer transformed to nz by default. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-19 14:27:24 +08:00
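The "trans nz" row of the table in this commit can be read as a small decision rule. The sketch below encodes that reading only. The function `should_trans_nz` and the weight-kind labels are hypothetical names, not the vllm_ascend implementation, and the exceptional cases listed in the commit (MLAPO, `W_UV`) are deliberately left out.

```python
import os

# Weight kinds taken from the table columns; the set names are hypothetical.
NEVER_NZ = {"W4A4", "fp32"}    # "can't support nz"
QUANT_NZ = {"W4A8", "W8A8"}    # "trans nz by default"
FLOAT_NZ = {"fp16", "bf16"}    # "trans nz when VLLM_ASCEND_ENABLE_NZ is 2"


def should_trans_nz(weight_kind: str, enable_nz: int = 1) -> bool:
    """Return whether a weight would be transformed to nz, per the table (sketch)."""
    if enable_nz not in (0, 1, 2):
        raise ValueError(f"unsupported VLLM_ASCEND_ENABLE_NZ value: {enable_nz}")
    if enable_nz == 0 or weight_kind in NEVER_NZ:
        return False
    if weight_kind in QUANT_NZ:
        return True
    if weight_kind in FLOAT_NZ:
        return enable_nz == 2
    raise ValueError(f"unknown weight kind: {weight_kind}")


# Read the setting the way an environment-variable switch is commonly read;
# the default of 1 matches the commit message.
enable_nz = int(os.environ.get("VLLM_ASCEND_ENABLE_NZ", "1"))
for kind in ("W4A4", "W8A8", "bf16", "fp32"):
    print(kind, should_trans_nz(kind, enable_nz))
```

With the default value of 1, only the W4A8 and W8A8 cases return `True`, which matches the user-facing note that fp16/bf16 weights are not transformed to nz by default.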
LuLina	2be0fe2691	[Feat] Add Euler xlite graph wrapper support (#4526 ) ### What this PR does / why we need it? This patch adds support for the xlite graph wrapper to vllm_ascend. Xlite provides operator implementations of the transformer network on Ascend hardware. For details about xlite, please refer to the following link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md The latest performance comparison data between xlite and the default aclgraph mode is as follows: ## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`c4a71fc6`) - xlite-full: main(`c4a71fc6`) + xlite-full - xlite-decode-only: main(`c4a71fc6`) + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph ### Does this PR introduce _any_ user-facing change? Enable the xlite graph mode by setting xlite_graph_config: --additional-config='{"xlite_graph_config": {"enabled": true}}' # Enabled for decode only --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' # Enabled for prefill and decode - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lulina <lina.lulina@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 08:27:46 +08:00
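The two `--additional-config` values quoted in this commit are plain JSON, so they can be generated rather than typed by hand. The following is a minimal sketch that assumes only what the commit message shows: the `xlite_graph_config` key with its `enabled` and `full_mode` fields. The helper names `xlite_additional_config` and `as_cli_flag` are hypothetical.

```python
import json
import shlex


def xlite_additional_config(full_mode: bool = False) -> str:
    """Build the JSON value for --additional-config (hypothetical helper)."""
    xlite_cfg = {"enabled": True}
    if full_mode:
        # Per the commit message, full_mode enables xlite for prefill and decode.
        xlite_cfg["full_mode"] = True
    return json.dumps({"xlite_graph_config": xlite_cfg})


def as_cli_flag(config_json: str) -> str:
    """Quote the JSON so it survives shell parsing as a single argument."""
    return f"--additional-config={shlex.quote(config_json)}"


decode_only_flag = as_cli_flag(xlite_additional_config())
full_flag = as_cli_flag(xlite_additional_config(full_mode=True))
print(decode_only_flag)
print(full_flag)
```

Generating the value with `json.dumps` avoids the common mistake of writing Python-style `True` where JSON requires `true`.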

10 Commits