xc-llm-ascend

Files

chenwaner e46dc142bf Enable kvcache_nz for the decode process in torchair graph mode (#1098)

What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode. This
reduces the time spent in FA on long sequences.

Does this PR introduce any user-facing change?
Yes. To enable kvcache_nz, set
additional_config.torchair_graph_config.enable_kv_nz=True
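
As a minimal sketch of what that dotted path looks like as a nested config, the snippet below builds the dictionary and its JSON form. Only the key names `torchair_graph_config` and `enable_kv_nz` come from this PR; the variable names are illustrative, and how the config is handed to the engine (a keyword argument or a JSON command-line option) is an assumption that depends on your vLLM / vllm-ascend version. Other torchair graph mode settings are not shown.

```python
import json

# Nested layout mirroring the dotted path from the PR description:
# additional_config.torchair_graph_config.enable_kv_nz
additional_config = {
    "torchair_graph_config": {
        "enable_kv_nz": True,
    },
}

# JSON form, for interfaces that accept the config as a JSON string
# (assumption: check your version's docs for the exact option name).
additional_config_json = json.dumps(additional_config)
print(additional_config_json)
# prints {"torchair_graph_config": {"enable_kv_nz": true}}
```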

How was this patch tested?
1. Tested on a DeepSeek model:
with batch size 64 and seq_len 1k+3k, the total FA time over 61 layers
improves from 20.80 ms to 19.76 ms
2. Operator precision test:

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; the result of a single curl request is also normal

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588
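
To put the FA timing from item 1 in relative terms, here is a small arithmetic sketch using only the two totals and the layer count quoted above. The per-layer figure is a plain average and assumes the saving is spread evenly across the 61 layers, which the PR does not state.

```python
# Totals quoted in the PR description (61 layers, batch size 64).
before_ms = 20.80
after_ms = 19.76
num_layers = 61

saved_ms = before_ms - after_ms          # absolute saving in milliseconds
relative = saved_ms / before_ms          # fraction of the original FA time
# Average saving per layer in microseconds (assumes an even split).
per_layer_saved_us = saved_ms / num_layers * 1000

print(f"saved {saved_ms:.2f} ms ({relative:.1%}), "
      f"about {per_layer_saved_us:.1f} us per layer on average")
```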

---------

Signed-off-by: chenwaner <861645847@qq.com>

2025-06-11 14:09:28 +08:00

additional_config.md

Enable kvcache_nz for the decode process in torchair graph mode (#1098)

2025-06-11 14:09:28 +08:00

env_vars.md

[Doc] Add environment variables doc (#519)

2025-04-15 16:09:36 +08:00

graph_mode.md

[Doc] Fix the config parameter name "enable" in graph_mode.md (#1159)

2025-06-11 11:03:37 +08:00

release_notes.md

[Bugfix] add compilation/__init__.py to fix import error (#1152)