Support multistream of MLA vector operations (#1135)
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
| q_rmsnorm | | kv_norm_rope_cache | | q_rope |
| matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` cannot be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. We therefore keep them
on the main stream, only requiring that they be computed before
`matmul W_UQ` so that they do not hinder the later overlapping. This may
be resolved by a later optimization (#993), which hoists the computation
of `cos` and `sin` up to the first layer.
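
For illustration, the intended split can be sketched roughly as follows (pseudocode only; `npu_stream_switch` is a hypothetical stand-in for whatever stream-switching helper the implementation uses, and the operator names mirror the diagram above):

```
# Main stream: heavy matmuls stay here.
dq = matmul(x, W_DQ)
kv = matmul(x, W_DKV)
# cos/sin indexing also stays on the main stream (graph fusion bug),
# but must finish before matmul W_UQ below.
cos, sin = index(cos_cache, pos), index(sin_cache, pos)

# Secondary stream: lightweight vector ops, overlapping the main-stream matmuls.
with npu_stream_switch("secondary"):
    q = q_rmsnorm(dq)
    k = kv_norm_rope_cache(kv, cos, sin)
    q = q_rope(q, cos, sin)

# Back on the main stream, after cos/sin are ready.
q_up = matmul(q, W_UQ)
```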
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which
defaults to False.
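
For example, the flag would be switched on through the `additional_config` dict passed to vLLM (a sketch; the model name is a placeholder, and the exact plumbing of `additional_config` should be checked against the vllm-ascend documentation):

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model name
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            # Opt in to the secondary-stream MLA vector ops (off by default).
            "enable_multistream_mla": True,
        },
    },
)
```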
### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
```diff
@@ -59,6 +59,7 @@ def test_run_with_ascend_config():
             "graph_batch_sizes": [1, 2, 4, 8],
             "graph_batch_sizes_init": False,
             "enable_multistream_moe": True,
+            "enable_multistream_mla": True,
         },
         "ascend_scheduler_config": {
             "enabled": True,
@@ -79,6 +80,7 @@ def test_run_with_ascend_config():
         1, 2, 4, 8
     ]
     assert not ascend_config.torchair_graph_config.graph_batch_sizes_init
+    assert ascend_config.torchair_graph_config.enable_multistream_mla
     assert ascend_config.torchair_graph_config.enable_multistream_moe
     assert ascend_config.ascend_scheduler_config.enabled
     assert ascend_config.ascend_scheduler_config.enable_chunked_prefill
```
||||