[kernel] add AscendC op: lightning_indexer and sparse_flash_attention (#4625)
### What this PR does / why we need it?
Provide high-performance AscendC operators `lightning_indexer` and `sparse_flash_attention` to boost the execution performance of the DeepSeek V3.2 model, and adapt the two operators to the vllm-ascend framework.

### Does this PR introduce _any_ user-facing change?
No (only underlying operator optimizations, with no user-facing changes).

### How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MingYang119 <songmingyang@huawei.com>
```diff
@@ -459,7 +459,7 @@ class AscendSFAImpl(MLAAttentionImpl):
             kv_cache=kv_cache,
             attn_metadata=attn_metadata,
             need_gather_q_kv=need_gather_q_kv)
-        attn_output = torch.ops.custom.npu_sparse_flash_attention(
+        attn_output = torch.ops._C_ascend.npu_sparse_flash_attention(
             query=ql_nope,
             key=k_nope,
             value=k_nope,
```
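The `npu_sparse_flash_attention` call above attends only over the key/value positions selected by the indexer, rather than the full sequence. As a rough illustration of the semantics (not the AscendC kernel itself), here is a minimal NumPy sketch for a single query vector; the function name and shapes are hypothetical:

```python
import numpy as np

def sparse_attention_reference(query, key, value, topk_indices):
    """Illustrative reference: gather only the selected key/value rows,
    then run ordinary softmax attention over that small subset.

    query: (d,)  key/value: (seq_len, d)  topk_indices: (k,) int array
    """
    d = query.shape[-1]
    k_sel = key[topk_indices]                 # (k, d) gathered keys
    v_sel = value[topk_indices]               # (k, d) gathered values
    scores = (k_sel @ query) / np.sqrt(d)     # (k,) scaled dot products
    scores -= scores.max()                    # numerically stable softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs @ v_sel                      # (d,) weighted value sum
```

With a single selected position the softmax weight is 1, so the output is exactly that value row; with all positions selected it reduces to dense attention.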
```diff
@@ -554,7 +554,7 @@ class AscendSFAImpl(MLAAttentionImpl):
         seq_lens = attn_metadata.seq_lens
         cum_query_lens = attn_metadata.cum_query_lens

-        topk_indices = torch.ops.custom.npu_lightning_indexer(
+        topk_indices = torch.ops._C_ascend.npu_lightning_indexer(
             query=q,
             key=kv_cache[2],
             weights=weights,
```
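The `npu_lightning_indexer` op produces the `topk_indices` consumed by the sparse attention kernel. Assuming the index-score formulation described in the DeepSeek-V3.2 report (score of key position s is a weighted sum of ReLU'd indexer-query/key dot products), a pure-NumPy sketch of the selection logic might look like this; all names and shapes here are illustrative, not the operator's actual signature:

```python
import numpy as np

def lightning_indexer_reference(q_idx, k_idx, weights, topk):
    """Illustrative top-k token selection.

    q_idx:   (H, D) indexer query heads for one token
    k_idx:   (S, D) indexer keys for S cached positions
    weights: (H,)   per-head mixing weights
    Returns the indices of the topk highest-scoring positions.
    """
    # Score I_s = sum_j weights[j] * relu(q_idx[j] . k_idx[s])
    logits = np.maximum(q_idx @ k_idx.T, 0.0)  # (H, S) ReLU'd dot products
    scores = weights @ logits                  # (S,) combined index scores
    topk = min(topk, scores.shape[0])
    return np.argsort(-scores)[:topk]          # highest scores first
```

The real kernel fuses this scoring and selection on-device; the sketch only shows why the op needs `query`, `key`, and `weights` inputs and returns integer indices.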
||||