[CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115)

### What this PR does / why we need it?

This PR enables the FLASHCOMM1 communication optimization together with layer sharding in the DeepSeek-V3.2 W8A8 model test, in order to validate PR #5702. The changes include:

1. Enable FLASHCOMM1: Set VLLM_ASCEND_ENABLE_FLASHCOMM1=1 to improve performance for distributed inference
2. Add layer sharding: Configure layer_sharding: ["q_b_proj", "o_proj"]
4. Update baselines: Adjust performance baselines to reflect the improvements from FLASHCOMM1 and layer sharding

### Does this PR introduce _any_ user-facing change?

No. This is a CI/test-only change that enables new communication optimization features for testing purposes.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: d68209402d

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-23 19:48:37 +08:00
parent 8786412f5c
commit 6c73b88dd6
3 changed files with 13 additions and 5 deletions
--- a/tests/e2e/multicard/2-cards/test_offline_inference_distributed.py
+++ b/tests/e2e/multicard/2-cards/test_offline_inference_distributed.py
@@ -242,7 +242,7 @@ def test_qwen3_dense_prefetch_mlp_weight_tp2(model):


@patch.dict(os.environ, {"HCCL_OP_EXPANSION_MODE": "AIV"})
-@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "0"})
+@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
@patch.dict(os.environ, {"ASCEND_AGGREGATE_ENABLE": "1"})
@patch.dict(os.environ, {"HCCL_BUFFSIZE": "1024"})
 def test_deepseek3_2_w8a8_pruning_mtp_tp2_ep():
@@ -262,6 +262,9 @@ def test_deepseek3_2_w8a8_pruning_mtp_tp2_ep():
                        "num_speculative_tokens": 2,
                        "method": "deepseek_mtp"
                    },
+                    additional_config={
+                        "layer_sharding":["q_b_proj", "o_proj"]
+                    },
                    reasoning_parser="deepseek_v3",
                    tokenizer_mode="deepseek_v32") as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
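
Note on the mechanism used in the hunk above: the test flips FLASHCOMM1 through a stacked `patch.dict(os.environ, ...)` decorator, so the variable is only set while the test function runs and is restored afterwards. The following is a minimal sketch of that behaviour using only the standard library. The function `read_env_inside_test` and the `engine_kwargs` dict are hypothetical stand-ins for illustration; they are not part of the test file, and no vLLM code is invoked.

```python
import os
from unittest.mock import patch


# Stacked patch.dict decorators, mirroring the style of the test above.
# Each one overlays its keys on os.environ for the duration of the call.
@patch.dict(os.environ, {"HCCL_OP_EXPANSION_MODE": "AIV"})
@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
def read_env_inside_test():
    # Hypothetical stand-in for the real test body: it only reports
    # what code running inside the test would observe.
    return {
        "flashcomm1": os.environ.get("VLLM_ASCEND_ENABLE_FLASHCOMM1"),
        "hccl_mode": os.environ.get("HCCL_OP_EXPANSION_MODE"),
    }


# Plain dict with the same shape as the keyword argument added in the diff.
# How the runner interprets it is outside the scope of this sketch.
engine_kwargs = {
    "additional_config": {"layer_sharding": ["q_b_proj", "o_proj"]},
}

if __name__ == "__main__":
    before = os.environ.get("VLLM_ASCEND_ENABLE_FLASHCOMM1")
    seen = read_env_inside_test()
    after = os.environ.get("VLLM_ASCEND_ENABLE_FLASHCOMM1")
    print("inside test:", seen)
    print("restored after test:", before == after)
```

Because the override is scoped to the decorated function, other tests in the same file keep whatever value of `VLLM_ASCEND_ENABLE_FLASHCOMM1` they set for themselves.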