xc-llm-ascend/vllm_ascend at 79803932e23ef2a09fbc0bc3dfee9a29eb302044

EngineX/xc-llm-ascend


lidenghui1110 79803932e2 [Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366 )

### What this PR does / why we need it?
As described in #2947, the KV cache layout must be transposed after a GQA
KV transfer when the prefill and decode tensor-parallel sizes are
heterogeneous. The previous implementation does this with
`npu_paged_cache_load` + `transpose` + `_npu_reshape_and_cache`.
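
To make the layout change concrete, here is a toy pure-Python sketch. It assumes, purely for illustration, that each received block stores its entries group-major (all tokens of head group 0, then all tokens of head group 1, and so on) and must become token-major. The real tensor layout, shapes, and dtypes are not stated here, and `transpose_block` and `transpose_all_layers` are hypothetical names, not the NPU ops listed above.

```python
def transpose_block(block, num_groups, block_size):
    """Reorder one flat block from group-major to token-major order.

    Assumed (hypothetical) input order:  [g0t0, g0t1, ..., g1t0, g1t1, ...]
    Output order:                        [g0t0, g1t0, ..., g0t1, g1t1, ...]
    """
    if len(block) != num_groups * block_size:
        raise ValueError("block length must equal num_groups * block_size")
    return [
        block[group * block_size + token]
        for token in range(block_size)
        for group in range(num_groups)
    ]


def transpose_all_layers(layers, num_groups, block_size):
    """Apply the per-block reorder to every block of every layer."""
    return [
        [transpose_block(block, num_groups, block_size) for block in layer]
        for layer in layers
    ]


if __name__ == "__main__":
    block = ["g0t0", "g0t1", "g0t2", "g1t0", "g1t1", "g1t2"]
    print(transpose_block(block, num_groups=2, block_size=3))
    # ['g0t0', 'g1t0', 'g0t1', 'g1t1', 'g0t2', 'g1t2']
```

The sketch only shows the index mapping; on the device the same reorder is done on tensors per layer and per block.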

This is inefficient. The ops above must be called for each layer, which
introduces 3 * layer_num kernel launches and 6 * layer_num data movements
between L1 Cache and HBM for one request on the decode node. The decode
node usually runs in graph mode, so these op kernels, launched by an async
thread in the Mooncake connector, run in between decode forward passes.
They may span several decode forward passes, and TTFT increases by the
time of 3~4 decode forward passes.

This PR implements an AscendC fused op, `transpose_kv_cache_by_block`,
that does the same work with a single kernel launch and moves the data
between L1 Cache and HBM only once.
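
The per-request overhead described above can be written out as simple arithmetic. The helper name `per_request_cost` and the layer count of 48 are hypothetical and chosen only for illustration. The 3x and 6x factors and the single launch and single data movement come from the description above.

```python
def per_request_cost(layer_num, fused):
    """Return (kernel_launches, data_movements) for one request on the decode node."""
    if fused:
        # Fused op: one launch and one L1 Cache <-> HBM movement in total.
        return 1, 1
    # Previous path: three ops per layer, six data movements per layer.
    return 3 * layer_num, 6 * layer_num


if __name__ == "__main__":
    layer_num = 48  # hypothetical layer count, for illustration only
    print(per_request_cost(layer_num, fused=False))  # (144, 288)
    print(per_request_cost(layer_num, fused=True))   # (1, 1)
```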

With the fused op, the time spent transposing the KV cache layout drops
from 7 ms to 0.24 ms in UT on 910C, and in the PD disaggregation
scenario TTFT decreases by about 90 ~ 110 ms on qwen3-235B.

| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
|           1            |      643 ms      |        578 ms        |
|          128           |     1480 ms      |       1368 ms        |

### Does this PR introduce _any_ user-facing change?
The fused op is used by default. In case the op has a bug in some
scenario, an environment variable provides a fallback to disable it.

**Disable the fused op by setting the following environment variable:**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`
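
A minimal sketch of how such a switch could be read in Python. The helper `fused_transpose_enabled` and its parsing rule (anything other than `0` keeps the fused op enabled) are assumptions for illustration. The actual handling in `envs.py` may differ.

```python
import os

ENV_NAME = "VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK"


def fused_transpose_enabled(environ=None):
    """Return True unless the variable is explicitly set to "0".

    Hypothetical helper: it defaults to enabled, matching the
    "used by default" behavior described above.
    """
    env = os.environ if environ is None else environ
    return env.get(ENV_NAME, "1").strip() != "0"


if __name__ == "__main__":
    print(fused_transpose_enabled({}))               # True
    print(fused_transpose_enabled({ENV_NAME: "0"}))  # False
```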

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main: dc917cceb8

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>

2026-02-03 14:10:01 +08:00

..

[Bugfix]Fix the compatibility issue of may_reinitialize_input_batch (#6290 )

2026-02-02 19:16:26 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 )

2025-11-28 18:06:39 +08:00

[Bugfix]Fix the compatibility issue of may_reinitialize_input_batch (#6290 )

2026-02-02 19:16:26 +08:00

[e2e Test][npugraph_ex]add static kernel e2e test case (#6320 )

2026-01-30 16:24:48 +08:00

[0.14.1][bugfix][sched] fix incompatibility of RecomputeScheduler with vllm v0.14.1 (#6286 )

2026-01-28 20:16:58 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

device_allocator

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

[Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366 )

2026-02-03 14:10:01 +08:00

[EPLB][Bugfix] EPLB support fp/bf16 (#5531 )

2026-01-26 14:28:16 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #5 ) (#5996 )

2026-01-24 22:45:38 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #5 ) (#5996 )

2026-01-24 22:45:38 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #6 ) (#6001 )

2026-01-24 22:08:33 +08:00

[Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402 )

2026-02-03 10:41:06 +08:00

[Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470 )

2026-02-02 15:57:55 +08:00

[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889 )

2026-02-02 16:39:32 +08:00

[ops] support advanced apply_top_k_top_p without top_k constraint (#6098 )

2026-01-26 09:08:42 +08:00

[Refactor][EAGLE] 6/N route mtp to eagle except pcp/dcp+mtp (#6349 )

2026-02-02 19:15:31 +08:00

[Bugfix]Fix the compatibility issue of may_reinitialize_input_batch (#6290 )

2026-02-02 19:16:26 +08:00

[CI] optimize lint term (#5986 )

2026-01-22 15:46:59 +08:00

__init__.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

ascend_config.py

[BugFix] Disable enable_shared_expert_dp by default if tensor_parallel_size=1 (#6361 )

2026-01-28 22:01:01 +08:00

ascend_forward_context.py

[Refactor][EAGLE] 6/N route mtp to eagle except pcp/dcp+mtp (#6349 )

2026-02-02 19:15:31 +08:00

batch_invariant.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

cpu_binding.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

envs.py

[Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366 )

2026-02-03 14:10:01 +08:00

flash_common3_context.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

meta_registration.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

platform.py

[Bugfix] fix hash conflict due to reset incompatible configuations (#6368 )

2026-02-03 10:32:02 +08:00

profiling_config.py

[Core][Misc] Clean up ProfileExecuteDuration (#6461 )

2026-02-01 20:06:01 +08:00

utils.py

[Refact.]: Refactor some leftover implementations of 300I DUO in the main branch. (#6425 )

2026-02-02 16:12:04 +08:00