### What this PR does / why we need it?
The previous Triton `rope_siso` implementation did not store the
second half of the RoPE results, which caused:
1. accuracy problems in the neox-style scenario
2. UB overflow in the non-neox-style scenario
This PR fixes the issue and adds a nightly test case for it.
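
To make "the second half of the RoPE results" concrete, here is a minimal pure-Python reference for neox-style RoPE on a single head vector. It is a sketch for illustration only, not the Triton kernel from this PR; the function names `rope_neox` and `rope_neox_missing_second_half` are hypothetical.

```python
def rope_neox(x, cos, sin):
    """Reference neox-style RoPE for one head vector.

    x has length 2*h; cos and sin have length h. The result has two
    halves, and both of them must be written to the output.
    """
    h = len(x) // 2
    x1, x2 = x[:h], x[h:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def rope_neox_missing_second_half(x, cos, sin):
    """Illustrates the bug class: only the first half is stored,
    so the second half still holds the un-rotated input values."""
    h = len(x) // 2
    out = list(x)
    out[:h] = rope_neox(x, cos, sin)[:h]
    return out
```

With a quarter-turn rotation (`cos = [0.0, 0.0]`, `sin = [1.0, 1.0]`) and `x = [1.0, 0.0, 0.0, 1.0]`, the two functions agree on the first half and differ on the second, which is the kind of mismatch an accuracy test against a reference would catch.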
- vLLM version: v0.16.0
- vLLM main: 4034c3d32e
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
This PR refactors sfa_v1.py to improve code readability and usability,
fixes a code bug, and enhances performance through the replacement of
certain operators.
### changes
- **improve code readability**: Restructures parts of the code in
sfa_v1.py, adds comments to key code blocks, removes some unused
variables, and improves the naming of certain functions and
variables.
- **remove a redundant double write to k_cache**: k_cache was written
twice in the indexer_select module (in both the `forward` function and
`indexer_select_post_process`). Removing the redundant write improves
performance to some extent.
- **replace `scatter` ops with `reshape_and_cache`**: This optimization
replaces two separate cache storage operations on `k_nope` and `k_pe`
with a single call to the `reshape_and_cache` operator, improving
performance. The original `scatter` operator involves reordering
slot_mapping for generality, introducing significant scalar
computations. In contrast, the `reshape_and_cache` operator eliminates
this redundant reordering step, thus reducing unnecessary computation
time and enhancing the operator's performance.
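
The difference between two separate cache stores and one combined store can be pictured with a toy model. This is only a conceptual sketch in plain Python: `store_separately` and `store_fused` are hypothetical names, the caches are modeled as lists of rows, and neither function reflects the real signature or internals of `scatter` or `reshape_and_cache` (in particular, the slot_mapping reordering mentioned above is not modeled).

```python
def store_separately(k_nope, k_pe, nope_cache, pe_cache, slot_mapping):
    """Two independent passes over slot_mapping, one per tensor."""
    for token_idx, slot in enumerate(slot_mapping):
        nope_cache[slot] = list(k_nope[token_idx])
    for token_idx, slot in enumerate(slot_mapping):
        pe_cache[slot] = list(k_pe[token_idx])


def store_fused(k_nope, k_pe, nope_cache, pe_cache, slot_mapping):
    """One pass: each slot index is read once and used for both writes."""
    for token_idx, slot in enumerate(slot_mapping):
        nope_cache[slot] = list(k_nope[token_idx])
        pe_cache[slot] = list(k_pe[token_idx])
```

Both functions leave the caches in the same state; the fused form simply walks the slot indices once instead of twice.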
### performance comparison
Setup: 4*A3, 1P1D, P dp2tp16, D dp8tp4, input/output: 64K/3K

| Configuration | TTFT | TPOT | TPS |
| -------- | -------- | -------- | -------- |
| origin | **28s** | 26ms | **820 token/s** |
| fixed redundant double writes of k_cache | **24s** | 26ms | **840 token/s** |
| replace scatter ops with reshape_and_cache | **24s** | 26ms | **850 token/s** |
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.16.0
- vLLM main: 15d76f74e2
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR extends the original `rope_triton_forward` and
`split_qkv_rmsnorm_rope` to accept `cos_sin_cache` and `positions` as
inputs, which fully aligns them with the vLLM RoPE API. Compared with
the earlier RoPE implementation, the benefits are:
1. It avoids pre-computing `cos` and `sin` before model execution, which
removes redundant code.
2. It allows the eagle3 draft model to use RoPE parameters that differ
from the main model's (see #6612). This helps recover the acceptance
rate and accuracy in that case.
In addition, this kernel change introduces only a very small performance
degradation. The `index_select` and `chunk` operations are now simple
memory accesses in the Triton kernel (for example,
https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81).
**Highlights**
- **RoPE Cache Unification**: Replaced separate _sin and _cos global
tensors with a unified cos_sin_cache and explicit positions tensor for
Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated Triton kernels
(split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the
cos_sin_cache and positions for more efficient and integrated RoPE
calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a
new custom operation, allowing its use in fused compilation passes and
providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified the rope_forward_oot
function to accept the new cos_sin_cache and positions arguments,
enabling a more flexible and integrated RoPE application within the
system.
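
As a rough illustration of what "consuming `cos_sin_cache` and `positions`" means, the sketch below builds a small cache and looks up per-token `cos`/`sin` rows by position. It assumes the cache row for each position is the cos values followed by the sin values; `build_cos_sin_cache`, `lookup_cos_sin`, and the `base` default are hypothetical names and values chosen for this example, not code from the PR.

```python
import math


def build_cos_sin_cache(max_position, rotary_dim, base=10000.0):
    """Row p holds [cos(p * f_0..f_{h-1}), sin(p * f_0..f_{h-1})]."""
    half = rotary_dim // 2
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(half)]
    cache = []
    for pos in range(max_position):
        angles = [pos * f for f in inv_freq]
        cache.append([math.cos(a) for a in angles] +
                     [math.sin(a) for a in angles])
    return cache


def lookup_cos_sin(cos_sin_cache, positions):
    """Gather one cache row per token, then split it into cos and sin.

    This mirrors an index_select followed by a chunk; inside a kernel
    the same thing becomes two offset loads from the cache row.
    """
    half = len(cos_sin_cache[0]) // 2
    rows = [cos_sin_cache[p] for p in positions]
    cos = [row[:half] for row in rows]
    sin = [row[half:] for row in rows]
    return cos, sin
```

Because the lookup is driven by `positions` at call time, nothing has to be precomputed per model run, and two models with different RoPE parameters can each carry their own cache.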
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: 5326c89803
Additional test on Qwen3-235b accuracy:
| Aime2024 | GSM8K | Livecodebench |
| -------- | -------- | -------- |
| 83.33 | 96.26 | 70.23 |
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
1. Implement a **high-performance Triton custom kernel** for the rotary
position embedding (RoPE) operator on the **Ascend NPU** platform.
2. Fix critical bugs in the Triton RoPE kernel registration and
invocation process: incorrect fake impl function name matching, wrong
torch ops namespace for the kernel call, a missing self parameter in
cos/sin slice fetching, and syntax errors in function type annotations.
3. Optimize the core RoPE operator: single-execution latency drops from
**57.1 μs** to **9 μs**, a **6.34x speedup** and an **84.24% latency
reduction**.
4. The RoPE operator is on a **hot path** executed in every transformer
layer during LLM inference, so the optimization directly reduces overall
inference latency and improves LLM serving throughput on Ascend NPU.
5. Keep full backward compatibility: the Triton kernel is enabled only
when `HAS_TRITON=True`; if Triton is not available, the code
automatically falls back to the original Ascend NPU native
implementation, with no functional regression.
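
The fallback behavior in item 5 follows a common dispatch pattern, sketched below. This is a hedged illustration, not the PR's code: `make_rope_forward` is a hypothetical helper, and deriving the flag from whether the `triton` package is importable is an assumption made here so the example is self-contained.

```python
import importlib.util

# Assumption for this sketch: the flag reflects whether `triton` is importable.
HAS_TRITON = importlib.util.find_spec("triton") is not None


def make_rope_forward(triton_impl, native_impl, has_triton=HAS_TRITON):
    """Return a RoPE entry point that picks the implementation once.

    triton_impl and native_impl are callables with the same signature;
    the caller never needs to know which one is active.
    """
    impl = triton_impl if has_triton else native_impl

    def rope_forward(*args, **kwargs):
        return impl(*args, **kwargs)

    return rope_forward
```

Passing `has_triton=False` explicitly is a simple way to exercise the native path in a test without uninstalling Triton.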
### Does this PR introduce _any_ user-facing change?
**NO**
- No changes to any public APIs, interfaces or inference behaviors of
vLLM.
- No impact on the text generation quality and correctness of the large
model.
- The optimization is transparent to end users, only the inference speed
(latency/throughput) is improved without any functional change.
### How was this patch tested?
1. **Environment Validation**: Tested on Ascend NPU platform with
vLLM-Ascend framework, Triton library installed and enabled
(`HAS_TRITON=True`).
2. **Kernel Registration Test**: Verified the Triton RoPE kernel
(`rope_forward_triton`) is successfully registered to
`torch.ops._C_ascend` namespace without any
`ValueError/NameError/SyntaxError`.
3. **Functional Correctness Test**: Ran large model (GLM4/MoE) inference
on the Ascend NPU platform. The generated text is **completely
correct** (no garbled text, no logical errors) and consistent with the
original implementation.
4. **Performance Benchmark Test**: Measured the single-execution latency
of the RoPE operator before and after optimization. The latency is
stably reduced from 57.1 μs to 9 μs, so the performance gain is valid
and stable.
5. **Fallback Mechanism Test**: Manually disabled Triton
(`HAS_TRITON=False`) and verified that the code correctly falls back to
the original Ascend NPU native RoPE implementation, with no service
crash and normal inference.
6. **Compatibility Test**: Tested with different query/key tensor
shapes and sizes. All cases work correctly with the Triton kernel, with
no shape mismatch errors.
- Operator supplied by Hexiang Wang
- vLLM version: v0.13.0
- vLLM main: 11b6af5280
---------
Signed-off-by: ZCG12345 <2097562023@qq.com>
### What this PR does / why we need it?
This fixes a bug that occurred when running `test_camem.py` in the
triton-ascend environment, which reported: `NPU function error:
aclrtGetMemInfo(ACL_HBM_MEM, &device_free, &device_total)`
- vLLM version: v0.13.0
- vLLM main: 5326c89803
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This PR adds a Triton RoPE kernel which supports cases where `rope_dim
!= head_dim`. This eliminates the split op before RoPE and the concat op
after RoPE. Profiling shows an improvement.
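
The idea of skipping the split and concat can be sketched in plain Python: rotate only part of the head vector in place and leave the remaining elements untouched. This is an illustrative reference, not the kernel itself; `rope_partial_inplace` is a hypothetical name, and the sketch assumes a neox-style rotation applied to the leading `rope_dim` elements.

```python
def rope_partial_inplace(x, cos, sin, rope_dim):
    """Apply neox-style RoPE to x[:rope_dim] in place.

    x has length head_dim >= rope_dim; cos and sin have length
    rope_dim // 2. Elements x[rope_dim:] are never read or written,
    so no split before or concat after the rotation is needed.
    """
    h = rope_dim // 2
    for i in range(h):
        a, b = x[i], x[i + h]
        x[i] = a * cos[i] - b * sin[i]
        x[i + h] = b * cos[i] + a * sin[i]
    return x
```

For example, with `head_dim = 4` and `rope_dim = 2`, only the first two elements change and the last two pass through unchanged.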
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
Related unit tests will be added once CI is integrated with Triton.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>