[CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115)
### What this PR does / why we need it?
This PR enables the FLASHCOMM1 communication optimization together with
layer sharding in the DeepSeek-V3.2 W8A8 model test in order to
validate PR #5702. The changes include:
1. Enable FLASHCOMM1: Set VLLM_ASCEND_ENABLE_FLASHCOMM1=1 to
improve performance for distributed inference
2. Add layer sharding: Configure layer_sharding: ["q_b_proj", "o_proj"]
3. Update baselines: Adjust performance baselines to reflect the
improvements from FLASHCOMM1 and layer sharding
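
The two settings above can be sketched in isolation. This is a minimal sketch that runs without vLLM: `build_additional_config` and `flashcomm1_test_settings` are hypothetical names used only for illustration, and only the environment variable and the `layer_sharding` value are taken from this PR.

```python
import os
from unittest.mock import patch


# Hypothetical helper, not part of vllm-ascend: returns the extra
# configuration that the updated test passes as additional_config.
def build_additional_config():
    return {"layer_sharding": ["q_b_proj", "o_proj"]}


# patch.dict sets the variable only while the decorated function runs
# and restores the previous environment when it returns.
@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
def flashcomm1_test_settings():
    # Hypothetical stand-in for the test body: it only reports the
    # settings that the real test would see.
    return {
        "flashcomm1": os.environ["VLLM_ASCEND_ENABLE_FLASHCOMM1"],
        "additional_config": build_additional_config(),
    }


settings = flashcomm1_test_settings()
```

Scoping the variable with `patch.dict` keeps the setting from leaking into other tests in the same process.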
### Does this PR introduce _any_ user-facing change?
No. This is a CI/test-only change that enables new communication
optimization features for testing purposes.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: d68209402d
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
@@ -242,7 +242,7 @@ def test_qwen3_dense_prefetch_mlp_weight_tp2(model):
 
 
 @patch.dict(os.environ, {"HCCL_OP_EXPANSION_MODE": "AIV"})
-@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "0"})
+@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
 @patch.dict(os.environ, {"ASCEND_AGGREGATE_ENABLE": "1"})
 @patch.dict(os.environ, {"HCCL_BUFFSIZE": "1024"})
 def test_deepseek3_2_w8a8_pruning_mtp_tp2_ep():
@@ -262,6 +262,9 @@ def test_deepseek3_2_w8a8_pruning_mtp_tp2_ep():
                         "num_speculative_tokens": 2,
                         "method": "deepseek_mtp"
                     },
+                    additional_config={
+                        "layer_sharding":["q_b_proj", "o_proj"]
+                    },
                     reasoning_parser="deepseek_v3",
                     tokenizer_mode="deepseek_v32") as vllm_model:
         vllm_model.generate_greedy(example_prompts, max_tokens)