[kernel] add AscendC op: lightning_indexer and sparse_flash_attention (#4625)
### What this PR does / why we need it?
Provide high-performance AscendC operators `lightning_indexer` and `sparse_flash_attention` to boost the execution performance of the DeepSeek V3.2 model, and adapt the two operators to the vllm-ascend framework.

### Does this PR introduce _any_ user-facing change?
No (only underlying operator optimizations, with no user-facing changes).

### How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MingYang119 <songmingyang@huawei.com>
```diff
@@ -459,7 +459,7 @@ class AscendSFAImpl(MLAAttentionImpl):
             kv_cache=kv_cache,
             attn_metadata=attn_metadata,
             need_gather_q_kv=need_gather_q_kv)
-        attn_output = torch.ops.custom.npu_sparse_flash_attention(
+        attn_output = torch.ops._C_ascend.npu_sparse_flash_attention(
             query=ql_nope,
             key=k_nope,
             value=k_nope,
```
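The `npu_sparse_flash_attention` call above attends only over the key/value positions selected by the indexer, rather than the full sequence. As a rough illustration of the semantics (not the AscendC kernel itself), here is a minimal NumPy sketch for a single query vector; the function name and shapes are hypothetical:

```python
import numpy as np

def sparse_attention_reference(query, key, value, topk_indices):
    """Illustrative reference: gather only the selected key/value rows,
    then run ordinary softmax attention over that small subset.

    query: (d,)  key/value: (seq_len, d)  topk_indices: (k,) int array
    """
    d = query.shape[-1]
    k_sel = key[topk_indices]                 # (k, d) gathered keys
    v_sel = value[topk_indices]               # (k, d) gathered values
    scores = (k_sel @ query) / np.sqrt(d)     # (k,) scaled dot products
    scores -= scores.max()                    # numerically stable softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs @ v_sel                      # (d,) weighted value sum
```

With a single selected position the softmax weight is 1, so the output is exactly that value row; with all positions selected it reduces to dense attention.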
```diff
@@ -554,7 +554,7 @@ class AscendSFAImpl(MLAAttentionImpl):
         seq_lens = attn_metadata.seq_lens
         cum_query_lens = attn_metadata.cum_query_lens

-        topk_indices = torch.ops.custom.npu_lightning_indexer(
+        topk_indices = torch.ops._C_ascend.npu_lightning_indexer(
             query=q,
             key=kv_cache[2],
             weights=weights,
```
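The `npu_lightning_indexer` op produces the `topk_indices` consumed by the sparse attention kernel. Assuming the index-score formulation described in the DeepSeek-V3.2 report (score of key position s is a weighted sum of ReLU'd indexer-query/key dot products), a pure-NumPy sketch of the selection logic might look like this; all names and shapes here are illustrative, not the operator's actual signature:

```python
import numpy as np

def lightning_indexer_reference(q_idx, k_idx, weights, topk):
    """Illustrative top-k token selection.

    q_idx:   (H, D) indexer query heads for one token
    k_idx:   (S, D) indexer keys for S cached positions
    weights: (H,)   per-head mixing weights
    Returns the indices of the topk highest-scoring positions.
    """
    # Score I_s = sum_j weights[j] * relu(q_idx[j] . k_idx[s])
    logits = np.maximum(q_idx @ k_idx.T, 0.0)  # (H, S) ReLU'd dot products
    scores = weights @ logits                  # (S,) combined index scores
    topk = min(topk, scores.shape[0])
    return np.argsort(-scores)[:topk]          # highest scores first
```

The real kernel fuses this scoring and selection on-device; the sketch only shows why the op needs `query`, `key`, and `weights` inputs and returns integer indices.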
||||