add mlp tp optimize (#2120)

### What this PR does / why we need it?

For dense models, applying tensor parallelism (TP) only to the MLP module and not to the attention module eliminates the allreduce operations in the attention module, which reduces communication overhead. This approach increases memory usage, so the optimization is controlled by the environment variable VLLM_ASCEND_ENABLE_MLP_OPTIMIZE and is disabled by default.

- vLLM main: b17109beea

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-08-21 09:22:07 +08:00
parent 973a7cfdf0
commit 3fb80ee356
6 changed files with 729 additions and 2 deletions
--- a/vllm_ascend/envs.py
+++ b/vllm_ascend/envs.py
@@ -141,6 +141,10 @@ env_variables: Dict[str, Callable[[], Any]] = {
    #   1: enable moe all2all seq.
    "VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ":
    lambda: bool(int(os.getenv('VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ', '0'))),
+    # Whether to enable the MLP optimization when tensor parallelism is enabled.
+    # This feature gives better performance in eager mode.
+    "VLLM_ASCEND_ENABLE_MLP_OPTIMIZE":
+    lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_MLP_OPTIMIZE", '0'))),
 }

 # end-env-vars-definition
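The new entry follows the same lazy-lookup pattern as the other flags in this file: the value is read from the environment each time the lambda is called and parsed as an integer before being converted to a bool. The sketch below illustrates that parsing behavior in isolation. The one-entry `env_variables` dict is a stand-in for the real registry and `read_flag` is a hypothetical helper, not part of vllm_ascend.

```python
import os
from typing import Any, Callable, Dict

# Stand-in for the registry in vllm_ascend/envs.py; only the entry added by
# this commit is reproduced here.
env_variables: Dict[str, Callable[[], Any]] = {
    "VLLM_ASCEND_ENABLE_MLP_OPTIMIZE":
    lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_MLP_OPTIMIZE", "0"))),
}


def read_flag(name: str) -> Any:
    """Hypothetical helper: evaluate the lambda at call time."""
    return env_variables[name]()


if __name__ == "__main__":
    key = "VLLM_ASCEND_ENABLE_MLP_OPTIMIZE"

    os.environ.pop(key, None)
    print(read_flag(key))  # unset falls back to '0', i.e. disabled

    os.environ[key] = "1"
    print(read_flag(key))  # any non-zero integer string enables the flag

    os.environ[key] = "true"
    try:
        read_flag(key)
    except ValueError as exc:
        # int() rejects non-numeric strings, so "true"/"false" are not accepted
        print(f"invalid value: {exc}")
```

Because the value goes through `int()`, the flag should be set with an integer string such as `0` or `1`. Values like `true` raise a `ValueError` when the flag is read.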