xc-llm-ascend

Files

songjianquan 43c8da3574 [Feat]fused_qkvzba_split_reshape supports token number greater than 65536 (#6740 )

### What this PR does / why we need it?

This pull request optimizes the fused_qkvzba_split_reshape_cat Triton
kernel for Qwen3-Next GatedDeltaNet model and removes the previous
conditional restrictions in the forward pass.
Key changes:
1. Refactored Triton kernel implementation: The
fused_qkvzba_split_reshape_cat_kernel has been optimized with a new
loop-based approach that supports arbitrary num_v_heads / num_k_heads
ratios and batch sizes. The kernel now uses configurable ROWS_PER_ITER
for better memory utilization .
2. The optimized kernel now handles all scenarios directly without
requiring a fallback path using fix_query_key_value_ordering and
torch.cat.

### Does this PR introduce _any_ user-facing change?
No. This is an internal optimization of the Triton kernel implementation
and does not introduce any user-facing changes.

### How was this patch tested?
CI is expected to pass with existing tests.

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: songjianquan <songjianquan1@huawei.com>
Co-authored-by: songjianquan <songjianquan1@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

2026-03-05 14:41:38 +08:00

__init__.py

[Bugfix] Fix the acceptance rates dorp issue when applying eagle3 to QuaRot model (#6914 )

2026-03-04 11:29:49 +08:00

patch_bert.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )

2026-02-06 15:35:06 +08:00

patch_distributed.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )

2026-02-06 15:35:06 +08:00

patch_huanyuan_vl.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )