Support multistream of MLA vector operations (#1135)
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
| q_rmsnorm | | kv_norm_rope_cache | | q_rope |
| matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` cannot be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. We therefore keep them
on the main stream, only requiring that they be computed before
`matmul W_UQ` so that they do not hinder the later overlapping. This may
be resolved by a later optimization (#993), which hoists the computation
of `cos` and `sin` up to the first layer.
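
For illustration, the intended split can be sketched roughly as follows (pseudocode only; `npu_stream_switch` is a hypothetical stand-in for whatever stream-switching helper the implementation uses, and the operator names mirror the diagram above):

```
# Main stream: heavy matmuls stay here.
dq = matmul(x, W_DQ)
kv = matmul(x, W_DKV)
# cos/sin indexing also stays on the main stream (graph fusion bug),
# but must finish before matmul W_UQ below.
cos, sin = index(cos_cache, pos), index(sin_cache, pos)

# Secondary stream: lightweight vector ops, overlapping the main-stream matmuls.
with npu_stream_switch("secondary"):
    q = q_rmsnorm(dq)
    k = kv_norm_rope_cache(kv, cos, sin)
    q = q_rope(q, cos, sin)

# Back on the main stream, after cos/sin are ready.
q_up = matmul(q, W_UQ)
```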
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which
defaults to False.
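
For example, the flag would be switched on through the `additional_config` dict passed to vLLM (a sketch; the model name is a placeholder, and the exact plumbing of `additional_config` should be checked against the vllm-ascend documentation):

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model name
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            # Opt in to the secondary-stream MLA vector ops (off by default).
            "enable_multistream_mla": True,
        },
    },
)
```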
### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
```diff
@@ -59,6 +59,7 @@ def test_run_with_ascend_config():
             "graph_batch_sizes": [1, 2, 4, 8],
             "graph_batch_sizes_init": False,
             "enable_multistream_moe": True,
+            "enable_multistream_mla": True,
         },
         "ascend_scheduler_config": {
             "enabled": True,
@@ -79,6 +80,7 @@ def test_run_with_ascend_config():
         1, 2, 4, 8
     ]
     assert not ascend_config.torchair_graph_config.graph_batch_sizes_init
+    assert ascend_config.torchair_graph_config.enable_multistream_mla
     assert ascend_config.torchair_graph_config.enable_multistream_moe
     assert ascend_config.ascend_scheduler_config.enabled
     assert ascend_config.ascend_scheduler_config.enable_chunked_prefill
```
||||