[3/N][Feat][Graph] Support `all-to-all` and quantized models with ACL Graph (#2614)

### What this PR does / why we need it?
* **Unify execution paths:** Consolidates the quantized and
non-quantized execution paths into a single `fused_experts` function,
removing duplicated logic and making the control flow clearer and easier
to maintain.
* **W8A8 dynamic quantization:** Adds support for W8A8 dynamic
quantization inside the unified MoE kernel. Communication routines are
updated to correctly handle dynamic quantization scales for activations.
* **Weight pre-processing:** Pre-transposes the `w13` and `w2` weight
matrices (as implemented in PR #2025) so that quantized and
non-quantized models follow the same code path for the MoE gating,
up-projection, and down-projection operations.
* **All-to-all communication:** Adds an `all-to-all` collective
communication pattern. For large token counts on modern hardware,
`all-to-all` is more efficient than the previous `all-gather` strategy.
However, `all-to-all` cannot be captured and replayed in a graph: it
performs multiple D2H operations that trigger synchronization, which
raises an error during graph capture. `all-to-all` is therefore used only
when falling back to `compiled_graph_for_general_shape`.
* **Dynamic communication selection:** The model runner now selects the
optimal MoE communication method (`mc2`, `allgather`, or `alltoall`) at
runtime based on token count and the Ascend SoC version.
* **Limitation:** `all-gather` is not yet supported for quantized
models, so further work remains on A2.
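
The runtime selection described above can be pictured with a minimal sketch. This is not the code from this PR: the function name `select_moe_comm_method`, the `mc2_token_capacity` threshold and its default of 512, the SoC labels `"A2"` and `"A3"`, and the exact branching are all assumptions made for illustration. The sketch only shows the general shape of choosing among `mc2`, `allgather`, and `alltoall` from the token count and the SoC version.

```python
from enum import Enum


class MoECommMethod(Enum):
    MC2 = "mc2"
    ALLGATHER = "allgather"
    ALLTOALL = "alltoall"


def select_moe_comm_method(num_tokens: int,
                           soc_version: str,
                           mc2_token_capacity: int = 512) -> MoECommMethod:
    """Pick a MoE communication method (hypothetical rule, for illustration).

    Assumptions, none of which come from the PR itself:
      * SoC versions are passed as the strings "A2" or "A3".
      * Small batches (at most mc2_token_capacity tokens) use mc2.
      * Larger batches use alltoall on "A3" and allgather on "A2".
    """
    if soc_version not in ("A2", "A3"):
        raise ValueError(f"unsupported SoC version: {soc_version!r}")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")

    if num_tokens <= mc2_token_capacity:
        return MoECommMethod.MC2
    if soc_version == "A3":
        return MoECommMethod.ALLTOALL
    return MoECommMethod.ALLGATHER


# Example: a large batch on the assumed "A3" label selects all-to-all.
method = select_moe_comm_method(4096, "A3")
print(method.value)
```

Under this assumed rule, the large-batch `alltoall` branch would only be reached outside graph capture, which matches the note above that `all-to-all` runs only in the `compiled_graph_for_general_shape` fallback.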

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No further test cases needed.

- vLLM version: v0.10.1.1
- vLLM main: d660c98c1b

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

yiz-liu 2025-08-30 11:00:35 +08:00, committed by GitHub

parent 91c35d765a
commit d3c93fba5c

7 changed files with 248 additions and 41 deletions

									
2 tests/e2e/multicard/test_qwen3_moe.py
@@ -107,4 +107,4 @@ def test_models_distributed_Qwen3_MOE_TP2_WITH_ACLGRAPH():
            tensor_parallel_size=2,
            enforce_eager=False,
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
        vllm_model.generate_greedy(example_prompts, max_tokens)
