[CORE] Initial support for torchair with non-MLA backend (#1506)

### What this PR does / why we need it?
This PR supports torchair graph mode with the non-MLA backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py`, which implements the attention-related operations that torchair requires.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM-Ascend only allowed DeepSeek to use torchair. Now it can also be used with Pangu. In addition, we add a supported-model list to control which model types can use torchair.

### How was this patch tested?
We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
parent 9fbd8017c0
commit a5f33590d3
19 changed files with 1130 additions and 84 deletions
--- a/vllm_ascend/patch/platform/patch_common/patch_distributed.py
+++ b/vllm_ascend/patch/platform/patch_common/patch_distributed.py
@@ -132,19 +132,6 @@ def communication_adaptation_310p():
    torch.distributed.distributed_c10d.all_reduce = all_reduce_wrapper_310p(
        torch.distributed.distributed_c10d.all_reduce)

-    def reduce_scatter_310p(output_tensor, input_tensor, group=None):
-        rank = torch.distributed.get_rank(group)
-        world_size = torch.distributed.get_world_size(group)
-        torch.distributed.all_reduce(input_tensor,
-                                     torch.distributed.ReduceOp.SUM,
-                                     group,
-                                     async_op=False)
-        interval = input_tensor.shape[0] // world_size
-        output_tensor[:] = input_tensor[rank * interval:(rank + 1) * interval]
-
-    torch.distributed._reduce_scatter_base = reduce_scatter_310p
-    torch.distributed.distributed_c10d._reduce_scatter_base = reduce_scatter_310p
-

 if is_310p():
    communication_adaptation_310p()
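
The hunk above removes the 310P-specific `reduce_scatter_310p` fallback, which emulated a reduce-scatter by running an all-reduce and then copying out the slice belonging to the local rank. The following is a minimal sketch of that equivalence, using plain Python lists in place of tensors and a list of per-rank inputs in place of a process group. The helper names (`all_reduce_sum`, `emulated_reduce_scatter`, `reference_reduce_scatter`) are hypothetical and are not part of vLLM-Ascend or `torch.distributed`.

```python
from typing import List


def all_reduce_sum(inputs: List[List[float]]) -> List[float]:
    """Element-wise sum across ranks; every rank would end up with this result."""
    return [sum(values) for values in zip(*inputs)]


def emulated_reduce_scatter(inputs: List[List[float]], rank: int) -> List[float]:
    """Mirror the removed fallback: all-reduce first, then slice by rank."""
    world_size = len(inputs)
    reduced = all_reduce_sum(inputs)
    # Same arithmetic as the removed code: shape[0] // world_size
    interval = len(reduced) // world_size
    return reduced[rank * interval:(rank + 1) * interval]


def reference_reduce_scatter(inputs: List[List[float]], rank: int) -> List[float]:
    """Direct definition: reduce only the chunk that this rank owns."""
    world_size = len(inputs)
    interval = len(inputs[0]) // world_size
    lo, hi = rank * interval, (rank + 1) * interval
    return [sum(per_rank[i] for per_rank in inputs) for i in range(lo, hi)]


if __name__ == "__main__":
    # Two simulated ranks, each contributing a 4-element input.
    per_rank_inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]
    for r in range(len(per_rank_inputs)):
        assert emulated_reduce_scatter(per_rank_inputs, r) == \
            reference_reduce_scatter(per_rank_inputs, r)
        print(r, emulated_reduce_scatter(per_rank_inputs, r))
```

Both paths give each rank the same chunk, but the emulated path reduces the full input on every rank before discarding most of it, so it does more work than a native reduce-scatter would. The commit message does not state why the fallback was dropped, so this sketch only illustrates what the removed code computed.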