[Feat][SP] Support SP for VL MoE models (#7044)

### What this PR does / why we need it?

Second PR for https://github.com/vllm-project/vllm-ascend/issues/5712:
extends SP to VL MoE models.


### Does this PR introduce _any_ user-facing change?
Yes: `sp_threshold` is removed from `additional_config`; use `sp_min_token_num`
from vLLM's `pass_config` instead.


### How was this patch tested?
- Model: Qwen3-VL-30B-A3B
- Parallelism: TP4 DP2
- 100 requests
- Max concurrency: 1

| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k         | 429.40               | 323.3                  |
| 16k        | 1297.01              | 911.74                |

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>

(New binary image file added, 576 KiB; not shown.)


@@ -44,7 +44,6 @@ The following table lists additional configuration options available in vLLM Asc
| `pa_shape_list` | list | `[]` | The custom shape list of page attention ops. |
| `enable_kv_nz` | bool | `False` | Whether to enable KV cache NZ layout. This option only takes effects on models using MLA (e.g., DeepSeek). |
| `layer_sharding` | dict | `{}` | Configuration options for Layer Sharding Linear |
| `sp_threshold` | int | `1000` | For dense models, only num_tokens > threshold will enable sequence parallelism. |
The details of each configuration option are as follows:


@@ -14,14 +14,13 @@ Currently, vllm-ascend has implemented Sequence Parallelism for VL-class models
```bash
vllm serve Qwen/Qwen3-VL-2B-Instruct \
--tensor-parallel-size 2 \
--compilation-config '{"pass_config": {"enable_sp": true}}' \
--additional_config={"sp_threshold": 1000}
--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 1000}}'
```
- `"pass_config": {"enable_sp": true}`: This is the switch for SP. Since SP relies on graph mode, it must be enabled and is not supported in eager mode.
- `--additional_config={"sp_threshold": 1000}`: Based on our experiments, when the number of tokens is small (empirical value is less than 1000), SP can actually bring negative benefits. This is because when the communication volume is small, the fixed overhead of the communication operator becomes the dominant factor. Therefore, when one communication operator (Allreduce) is split into two communication operators (ReduceScatter+Allgather), the end-to-end latency often becomes longer. Thus, we have reserved the `sp_threshold`parameter; SP will only take effect when `num_tokens >= sp_threshold`. **The default value is 1000, which generally does not need to be modified.** `sp_threshold` will be appended into `compile_ranges_split_points`, which is a parameter provided by vllm that splits the graph compilation range `[1, max_num_batched_tokens]` into `{[1, split_points[0]], [split_points[0] + 1, split_points[1]], ..., [split_points[-1] + 1, max_num_batched_tokens]}`, and sequentially checks whether the `is_applicable_for_range` of the pass returns `True`.
- `"enable_sp"`: This is the switch for SP. Since SP relies on graph mode, it is not supported in eager mode.
- `sp_min_token_num` (from upstream vllm's `pass_config`): Based on our experiments, when the number of tokens is small (empirical value is less than 1000), SP can actually bring negative benefits. This is because when the communication volume is small, the fixed overhead of the communication operator becomes the dominant factor. SP will only take effect when `num_tokens >= sp_min_token_num`. **The default value is 1000 on Ascend, which generally does not need to be modified.** To customize, use `--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 512}}'`. The value will be appended into `compile_ranges_split_points`, which splits the graph compilation range and checks whether the pass is applicable per range.
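To make the range splitting concrete, here is an illustrative sketch; the numbers and the splitting loop below are examples only, the actual splitting is performed inside vLLM:

```python
# Illustrative only: how appending sp_min_token_num to
# compile_ranges_split_points divides the compilation range.
sp_min_token_num = 1000
max_num_batched_tokens = 8192      # example value
split_points = [sp_min_token_num]  # sp_min_token_num is appended here

ranges, lo = [], 1
for point in split_points + [max_num_batched_tokens]:
    ranges.append((lo, point))
    lo = point + 1
print(ranges)  # [(1, 1000), (1001, 8192)]
# The SP pass's is_applicable_for_range is expected to return True
# only for the range(s) at or above sp_min_token_num.
```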
Without modifying `sp_threshold`, the simplest way and recommended way to enable SP is:
Without modifying `sp_min_token_num`, the simplest and recommended way to enable SP is:
```bash
vllm serve Qwen/Qwen3-VL-2B-Instruct \
@@ -44,7 +43,7 @@ FC1 is a unique optimization in vllm-ascend, currently implemented based on Cust
| | VL + Dense | VL + MoE | non-VL + Dense | non-VL + MoE |
| -------------------- | ---------- | -------- | -------------- | ------------ |
| Sequence Parallelism | graph | x | x | x |
| Sequence Parallelism | graph | graph | x | x |
| Flash Comm V1 | x | x | eager/graph | eager/graph |
### With Quantization
@@ -55,3 +54,56 @@ SP currently does not support quantization and is under adaptation.
| -------------------- | ---------- | -------- | -------------- | ------------ |
| Sequence Parallelism | x | x | x | x |
| Flash Comm V1 | x | x | eager/graph | eager/graph |
## Pass Design
When SP is enabled, the following passes run in order: `SequenceParallelismPass` then `SequenceParallelismMoePass`.
### SequenceParallelismPass
Runs `NoOpEliminationPass` first to eliminate redundant view-like operations, then applies AllReduce-based patterns:
| Pattern | Match | Replacement |
| -------------------------------------- | -------------------------------- | ------------------------------------------------------------------------------------- |
| `MiddleAllReduceRMSNormPattern` | `all_reduce` + `layernorm` | `reduce_scatter` + `layernorm` + `all_gather` |
| `LastAllReduceRMSNormPattern` | Same (last layer, no residual) | Same |
| `Qwen3VLMiddleAllReduceRMSNormPattern` | `all_reduce` + add + `layernorm` | `reduce_scatter` + chunk(`deepstack_input_embeds`) + add + `layernorm` + `all_gather` |
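For intuition, the core AllReduce-to-SP rewrite can be sketched with plain `torch.distributed` collectives (illustrative only; the real patterns match and rewrite fx graph nodes, not eager code):

```python
# Illustrative sketch of the rewrite performed by MiddleAllReduceRMSNormPattern.
import torch
import torch.distributed as dist

def before_sp(hidden, rmsnorm):
    # Original graph: all-reduce the full sequence, then normalize on every rank.
    dist.all_reduce(hidden)                     # [seq_len, hidden]
    return rmsnorm(hidden)

def after_sp(hidden, rmsnorm, tp_size):
    # SP graph: each rank normalizes only seq_len/tp tokens,
    # then all-gather restores the full sequence.
    local = torch.empty_like(hidden.chunk(tp_size, dim=0)[0])
    dist.reduce_scatter_tensor(local, hidden)   # [seq_len/tp, hidden]
    normed = rmsnorm(local)
    out = torch.empty_like(hidden)
    dist.all_gather_into_tensor(out, normed)    # back to [seq_len, hidden]
    return out
```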
**Why Qwen3 VL needs special handling by Qwen3VLMiddleAllReduceRMSNormPattern**
Qwen3-VL middle layers insert an extra add between `all_reduce` and `layernorm`: `hidden_states = hidden_states + deepstack_input_embeds`. Under SP, `hidden_states` (i.e., `input`) is reduce-scattered to shape `[seq_len/tp, hidden]` per rank, while `deepstack_input_embeds` comes from the vision/deepstack path and stays full-sequence `[seq_len, hidden]` (typically replicated across TP ranks). Simply doing `reduce_scatter(input) + deepstack_input_embeds` would cause a shape mismatch.
The fix is to chunk `deepstack_input_embeds` by `tp_size` so each rank uses `add(reduce_scatter, chunk(deepstack_input_embeds)[tp_rank])`, keeping shapes consistent before `layernorm` and `all_gather`.
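A tiny, self-contained shape check (hypothetical sizes) makes the mismatch and the fix visible:

```python
# Illustrative only: the shape mismatch that Qwen3VLMiddleAllReduceRMSNormPattern avoids.
import torch

seq_len, hidden_size, tp_size, tp_rank = 8, 4, 2, 0

# After reduce_scatter, each rank holds only seq_len/tp tokens.
hidden_states = torch.randn(seq_len // tp_size, hidden_size)   # [4, 4]
# deepstack_input_embeds stays full-sequence on every rank.
deepstack_input_embeds = torch.randn(seq_len, hidden_size)     # [8, 4]

# hidden_states + deepstack_input_embeds would fail: [4, 4] vs [8, 4].
fixed = hidden_states + deepstack_input_embeds.chunk(tp_size, dim=0)[tp_rank]
print(fixed.shape)  # torch.Size([4, 4]): consistent before layernorm + all_gather
```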
### SequenceParallelismMoePass
After `SequenceParallelismPass` is applied, the MoE model's computation graph looks like this:
![AllGather EP computation graph](../../assets/sp_moe.png)
**Overview**
1. **Postponing allgather**: Under SP, `residual` is chunked by tensor parallelism. This causes a shape mismatch between hidden states and residual in the next layer's layernorm: hidden states are gathered (full sequence) while residual remains chunked. The fix is to move `all_gather` to *after* layernorm so that layernorm operates on consistent shapes per rank. `MiddleLayerAllgatherAddRMSNormPattern`, `LastLayerAllgatherRMSNormPattern`, and `Qwen3VLMiddleLayerAllgatherAddRMSNormPattern` are designed for this purpose, each handling different layer and structure variants (see the table below).
2. **AllGatherChunkNoOp cleanup**: When MoE SP is enabled, vllm introduces a `sequence_parallel_chunk` op (corresponding to `sp_chunk` in the diagram). Together with the preceding `all_gather`, the pair forms a redundant no-op (all_gather gathers, then chunk re-splits). `AllGatherChunkNoOpPattern` replaces this pair with identity to eliminate the redundant communication and computation.
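The no-op elimination in point 2 can be sketched as follows (illustrative only; `sequence_parallel_chunk` is approximated here by a plain `chunk`):

```python
# Illustrative only: why all_gather followed by sequence_parallel_chunk is redundant.
import torch
import torch.distributed as dist

def with_redundant_pair(local_hidden, tp_size, tp_rank):
    full = torch.empty(local_hidden.shape[0] * tp_size, local_hidden.shape[1],
                       dtype=local_hidden.dtype, device=local_hidden.device)
    dist.all_gather_into_tensor(full, local_hidden)   # gather the full sequence...
    return full.chunk(tp_size, dim=0)[tp_rank]        # ...only to re-split it

def after_noop_elimination(local_hidden, tp_size, tp_rank):
    # AllGatherChunkNoOpPattern replaces the pair with identity.
    return local_hidden
```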
**Pattern details:**
| Pattern | Match | Replacement |
| ---------------------------------- | ---------------------------------------- | --------------------------------------- |
| `MiddleLayerAllgatherAddRMSNormPattern` | `all_gather` + slice + `layernorm` | `layernorm` + `all_gather` |
| `LastLayerAllgatherRMSNormPattern` | Same (last layer, no residual) | Same |
| `Qwen3VLMiddleLayerAllgatherAddRMSNormPattern` | `all_gather` + slice + add + `layernorm` | add(chunk) + `layernorm` + `all_gather` |
| `AllGatherChunkNoOpPattern` | `all_gather` + `sequence_parallel_chunk_impl` | identity (no-op) |
### FAQ
#### Q1: Is SP enabled by default?
No, SP is not enabled by default; it is currently experimental and will be enabled by default in the future.
The processing flow of `enable_sp` in the code is:
- In `pass_config`, `enable_sp` and `sp_min_token_num` default to `None`.
- `NPUPlatform.apply_config_platform_defaults`: if `enable_sp` is `True` and `sp_min_token_num` is `None`, set the default `sp_min_token_num` (1000 for dense models, 1 for MoE models).
- `VllmConfig._apply_optimization_level_defaults`: `enable_sp` is set to `True` for dense models.
- `VllmConfig.__post_init__`: if `sp_min_token_num` is still `None`, `enable_sp` is set to `False`.
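A simplified sketch of this flow (names and signatures are illustrative, not the actual vLLM/vllm-ascend code):

```python
# Illustrative only: the enable_sp / sp_min_token_num defaulting described above.
def apply_sp_defaults(pass_config: dict, is_moe_model: bool, is_dense_model: bool) -> dict:
    # 1. NPUPlatform.apply_config_platform_defaults
    if pass_config.get("enable_sp") and pass_config.get("sp_min_token_num") is None:
        pass_config["sp_min_token_num"] = 1 if is_moe_model else 1000

    # 2. VllmConfig._apply_optimization_level_defaults
    if is_dense_model and pass_config.get("enable_sp") is None:
        pass_config["enable_sp"] = True

    # 3. VllmConfig.__post_init__
    if pass_config.get("sp_min_token_num") is None:
        pass_config["enable_sp"] = False
    return pass_config
```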