[kernel] add AscendC op: lightning_indexer and sparse_flash_attention (#4625)

### What this PR does / why we need it?
Provide high-performance AscendC operators `lightning_indexer` and
`sparse_flash_attention` to boost the execution performance of the
DeepSeek v3.2 model, and adapt the two AscendC operators to the
vllm-ascend framework.

### Does this PR introduce _any_ user-facing change?
No (only underlying operator optimizations, with no user-facing changes)

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MingYang119 <songmingyang@huawei.com>
Author: Song Mingyang
Date: 2025-12-03 09:53:10 +08:00 (committed by GitHub)
Commit: 18b90b501d (parent 7f2673ea2d)
28 changed files with 9772 additions and 19 deletions


@@ -459,7 +459,7 @@ class AscendSFAImpl(MLAAttentionImpl):
     kv_cache=kv_cache,
     attn_metadata=attn_metadata,
     need_gather_q_kv=need_gather_q_kv)
-attn_output = torch.ops.custom.npu_sparse_flash_attention(
+attn_output = torch.ops._C_ascend.npu_sparse_flash_attention(
     query=ql_nope,
     key=k_nope,
     value=k_nope,
@@ -554,7 +554,7 @@ class AscendSFAImpl(MLAAttentionImpl):
 seq_lens = attn_metadata.seq_lens
 cum_query_lens = attn_metadata.cum_query_lens
-topk_indices = torch.ops.custom.npu_lightning_indexer(
+topk_indices = torch.ops._C_ascend.npu_lightning_indexer(
     query=q,
     key=kv_cache[2],
     weights=weights,
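The substance of both hunks is a namespace move: the custom kernels are looked up under a plugin-owned `torch.ops._C_ascend` namespace instead of the generic `torch.ops.custom` one, so vllm-ascend's operators cannot collide with other extensions. As a rough, framework-free sketch of why that matters (no real torch or AscendC code here; `OpRegistry`, `OpNamespace`, and the placeholder lambda are all illustrative stand-ins, not the actual kernels), a namespaced op registry behaves like this:

```python
# Minimal, self-contained sketch of a torch.ops-style namespaced operator
# registry. Illustration only: real dispatch lives inside PyTorch and
# vllm-ascend, and the lambda below is a placeholder, not the actual
# npu_sparse_flash_attention kernel.

class OpNamespace:
    """Stand-in for one namespace object, e.g. torch.ops._C_ascend."""

    def __init__(self, name):
        self.name = name
        self._ops = {}

    def register(self, op_name, fn):
        # Bind an operator implementation under this namespace only.
        self._ops[op_name] = fn

    def __getattr__(self, op_name):
        # Invoked only when normal attribute lookup fails, i.e. for ops.
        try:
            return self._ops[op_name]
        except KeyError:
            raise AttributeError(
                f"namespace '{self.name}' has no operator '{op_name}'")


class OpRegistry:
    """Stand-in for the top-level torch.ops accessor."""

    def __init__(self):
        self._namespaces = {}

    def __getattr__(self, ns_name):
        # Each attribute access lazily creates or returns a namespace, so
        # ops._C_ascend and ops.custom hold disjoint operator tables.
        return self._namespaces.setdefault(ns_name, OpNamespace(ns_name))


# Register a placeholder op in the plugin namespace and call it with
# keyword arguments, mirroring the call sites in the diff.
ops = OpRegistry()
ops._C_ascend.register(
    "npu_sparse_flash_attention",
    lambda **kwargs: sorted(kwargs),  # placeholder: echo argument names
)
out = ops._C_ascend.npu_sparse_flash_attention(query=1, key=2, value=3)
# out == ['key', 'query', 'value']
```

Because the operator is registered only under `_C_ascend`, the same lookup through `ops.custom` raises `AttributeError`, which is exactly the isolation the rename in this PR buys.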