xc-llm-ascend

Author	SHA1	Message	Date
Hexiang Wang	48b624e4cc	[BugFix] Fix implementation bug of triton rope_siso (#7082 ) ### What this PR does / why we need it? Previously implemention of triton rope_siso missing the storage of second half of rope results, which will result in: 1. accuracy problem in neox-style scenario 2. ub overflow in non neox-style scenario This PR fixes it and supplement nightly test case for it. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-09 23:08:43 +08:00
rjg-lyh	2bd9c35788	[perf][refactor] Refactor and optimize sfa_v1.py for dsv3.2/glm5 (#6874 ) ### What this PR does / why we need it? This PR refactors sfa_v1.py to improve code readability and usability, fixes a code bug, and enhances performance through the replacement of certain operators. ### changes - improve code readability: Optimizes parts of the code structure in sfa_v1.py, supplementary comments for key code blocks, removes some unused variables, and improves the naming of certain functions and variables. - resolved a duplicated double write to k_cache: Fixed redundant double writes of k_cache in the indexer_select module (in both the `forward` function and `indexer_select_post_process`), improving performance to some extent. - replace `scatter` ops with `reshape_and_cache`: This optimization replaces two separate cache storage operations on `k_nope` and `k_pe` with a single call to the `reshape_and_cache` operator, improving performance. The original `scatter` operator involves reordering slot_mapping for generality, introducing significant scalar computations. In contrast, the `reshape_and_cache` operator eliminates this redundant reordering step, thus reducing unnecessary computation time and enhancing the operator's performance. ### performance comparison 4A3, 1P1D, P dp2tp16, D dp8tp4, input/output: 64K/3K origin: TTFT: 28s, TPOT: 26ms, TPS: 820 token/s* fixed redundant double writes of k_cache: TTFT: 24s, TPOT: 26ms, TPS: 840 token/s replace scatter ops with reshape_and_cache: TTFT: 24s, TPOT: 26ms, TPS: 850 token/s ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-05 14:27:11 +08:00
Angazenn	c0c2eb614e	[Main][Ops] Make triton rope support index_selecting from cos_sin_cache (#5450 ) ### What this PR does / why we need it? This PR extends original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as inputs. This fully aligns to vLLM RoPE api interface. Compared with earlier implementation for RoPE, the benefits are: 1. avoiding pre-computation of `cos` `sin` before model execution, which helps to remove redundant codes. 2. allowing eagle3 draft model to have different rope parameters with main model (see #6612 ). This help to recover accept rate && accuracy in that case. In addition, this kernel change only introduces very small performance degradation. Those `index_select` or `chunk` operations are now changed into simple memory access in triton kernel (For example, https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81). Highlights - RoPE Cache Unification: Replaced separate _sin and _cos global tensors with a unified cos_sin_cache and explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling. - Triton Kernel Integration: Updated Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations. - Custom Operation Registration: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation. - Refactored RoPE Forward Pass: Modified the rope_forward_oot function to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `5326c89803` Additional test on Qwen3-235b accuracy: \| Aime2024 \| GSM8K \| Livecodebench \| \| -------- \| -------- \| -------- \| \| 83.33 \| 96.26 \| 70.23 \| --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-02-11 21:20:53 +08:00
SILONG ZENG	78af0c30a3	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #12 ) (#6177 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/ops/triton/activation/swiglu_quant.py` \| \| `vllm_ascend/ops/triton/batch_invariant/matmul.py` \| \| `vllm_ascend/ops/triton/batch_invariant/mean.py` \| \| `vllm_ascend/ops/triton/batch_invariant/rmsnorm.py` \| \| `vllm_ascend/ops/triton/fla/chunk.py` \| \| `vllm_ascend/ops/triton/fla/chunk_delta_h.py` \| \| `vllm_ascend/ops/triton/fla/chunk_o.py` \| \| `vllm_ascend/ops/triton/fla/chunk_scaled_dot_kkt.py` \| \| `vllm_ascend/ops/triton/fla/cumsum.py` \| \| `vllm_ascend/ops/triton/fla/fused_qkvzba_split_reshape.py` \| \| `vllm_ascend/ops/triton/fla/l2norm.py` \| \| `vllm_ascend/ops/triton/fla/layernorm_guard.py` \| \| `vllm_ascend/ops/triton/fla/sigmoid_gating.py` \| \| `vllm_ascend/ops/triton/fla/solve_tril.py` \| \| `vllm_ascend/ops/triton/fla/utils.py` \| \| `vllm_ascend/ops/triton/fla/wy_fast.py` \| \| `vllm_ascend/ops/triton/fused_gdn_gating.py` \| \| `vllm_ascend/ops/triton/layernorm_gated.py` \| \| `vllm_ascend/ops/triton/linearnorm/split_qkv_rmsnorm_rope.py` \| \| `vllm_ascend/ops/triton/mamba/causal_conv1d.py` \| \| `vllm_ascend/ops/triton/reject_sample.py` \| \| `vllm_ascend/ops/triton/rope.py` \| \| `vllm_ascend/ops/triton/spec_decode/utils.py` \| \| `vllm_ascend/ops/triton/triton_utils.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-23 14:59:19 +08:00
ZCG12345	8900e3398b	[Ascend] perf: optimize rope embedding with triton kernel for huge performance gain (#5918 ) ### What this PR does / why we need it? 1. Implement a high-performance Triton custom kernel for the rotary position embedding (RoPE) operator on Ascend NPU platform 2. Fix critical bugs in the Triton RoPE kernel registration and invocation process: including incorrect fake impl function name matching, wrong torch ops namespace for kernel call, missing self parameter in cos/sin slice fetching, and syntax errors in function type annotations. 3. Achieve extreme performance optimization for the core RoPE operator: the single inference latency is reduced from 57.1 μs to 9 μs, with 6.34x performance improvement and 84.24% latency reduction. 4. The RoPE operator is a hot path that is executed in every transformer layer during LLM inference, the optimization will directly reduce the overall inference latency and improve the throughput of LLM serving on Ascend NPU. 5. Keep full backward compatibility: the Triton kernel is enabled only when `HAS_TRITON=True`, and automatically fall back to the original Ascend NPU native implementation if Triton is not available, no functional regression. ### Does this PR introduce _any_ user-facing change? NO - No changes to any public APIs, interfaces or inference behaviors of vLLM. - No impact on the text generation quality and correctness of the large model. - The optimization is transparent to end users, only the inference speed (latency/throughput) is improved without any functional change. ### How was this patch tested? 1. Environment Validation: Tested on Ascend NPU platform with vLLM-Ascend framework, Triton library installed and enabled (`HAS_TRITON=True`). 2. Kernel Registration Test: Verified the Triton RoPE kernel (`rope_forward_triton`) is successfully registered to `torch.ops._C_ascend` namespace without any `ValueError/NameError/SyntaxError`. 3. Functional Correctness Test: Run large model (GLM4/MoE) inference on the Ascend NPU platform, the generated text content is completely correct (no garbled text, no logical errors), consistent with the original implementation. 4. Performance Benchmark Test: Measure the single execution latency of the RoPE operator before/after optimization, confirm the latency is stably reduced from 57.1 μs to 9 μs, the performance gain is valid and stable. 5. Fallback Mechanism Test: Manually disable Triton (`HAS_TRITON=False`), verify the code correctly falls back to the original Ascend NPU native RoPE implementation, no service crash and normal inference. 6. Compatibility Test: Test with different tensor shapes/sizes of query/key, all cases work correctly with the Triton kernel, no shape mismatch error. - operator supply by Hexiang Wang - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: ZCG12345 <2097562023@qq.com>	2026-01-21 22:01:22 +08:00
meihanc	16b1bee804	[bugfix] fix test_camem failed with triton-ascend (#5492 ) ### What this PR does / why we need it? This fixes a bug that occurred when running `test_camem.py` in the triton-ascend environment `NPU function error: aclrtGetMemInfo(ACL_HBM_MEM, &device_free, &device_total)` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-05 20:10:54 +08:00
whx	96b2cdf6d8	[Ops][Triton] Add a triton kernel supporting partial rope. (#4413 ) ### What this PR does / why we need it? This PR adds a triton rope kernel witch supports scenarios of `rope_dim != head_dim`. This can save the split op before rope and the concat op after rope. Profiling shows improvement. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? I will add related ut after ci integrated with triton. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-12-02 17:10:19 +08:00

7 Commits