Enable kvcache_nz for the decode process in torchair graph mode (#1098)
What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which reduces the time spent in FA on long sequences.

Does this PR introduce any user-facing change?
To enable kvcache_nz, set additional_config.torchair_graph_config.enable_kv_nz=True.

How was this patch tested?
1. Tested on the DeepSeek model: with batch size 64 and seq_len 1k+3k, the total FA time across 61 layers improves from 20.80 ms to 19.76 ms.
2. Operator precision test: [aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; the result of a single curl request is also normal:
https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159
https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588

---------

Signed-off-by: chenwaner <861645847@qq.com>
@@ -44,6 +44,7 @@ The details of each config option are as follows:
 | `use_cached_graph` | bool | `False` | Whether to use cached graph |
 | `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
 | `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
+| `enable_kv_nz`| bool | `False` | Whether to enable kvcache NZ layout |

 **ascend_scheduler_config**
@@ -64,7 +65,8 @@ A full example of additional configuration is as follows:
 "use_cached_graph": true,
 "graph_batch_sizes": [1, 2, 4, 8],
 "graph_batch_sizes_init": false,
-"enable_multistream_moe": false
+"enable_multistream_moe": false,
+"enable_kv_nz": false
 },
 "ascend_scheduler_config": {
 "enabled": true,
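For illustration, the option added in this diff could be switched on roughly as follows. This is a minimal sketch, not the project's own example: `build_additional_config` is a hypothetical helper, and the sketch assumes the resulting dictionary (or its JSON form) is handed to vLLM as the additional configuration, with the other settings needed for torchair graph mode supplied separately.

```python
import json


def build_additional_config(enable_kv_nz: bool = True) -> dict:
    """Hypothetical helper: build an additional_config dict.

    Only the key introduced by this commit is set here; other
    torchair_graph_config options are assumed to be set elsewhere.
    """
    return {
        "torchair_graph_config": {
            # Option from this commit: use the NZ layout for the KV cache.
            "enable_kv_nz": enable_kv_nz,
        },
    }


additional_config = build_additional_config()

# JSON form, for cases where the configuration is passed as a string.
additional_config_json = json.dumps(additional_config)
print(additional_config_json)
```

The default for `enable_kv_nz` in the documented table is `False`, so the layout stays off unless it is set explicitly as above.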