xc-llm-ascend


Slightwind f3b50c54e8 [main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication (#2195)

This PR significantly optimizes performance for quantized Mixture of
Experts (MoE) layers by changing the order of quantization and
communication operations.

In the previous implementation, the `all2all` operation was performed on
unquantized `hidden_states` (in FP16/BF16) *before* quantization,
resulting in substantial communication overhead. By performing
quantization on each EP rank **first** and then sending the much smaller
quantized data, we reduce the communication volume by nearly 50%.
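As a rough check on the "nearly 50%" figure, the per-token payload of a single `all2all` dispatch can be estimated by hand. The sketch below is illustrative only and is not taken from the PR: it assumes BF16/FP16 activations (2 bytes per element), per-token dynamic INT8 quantization (1 byte per element), and one float32 scale shipped alongside each token. The function names and the hidden sizes are hypothetical.

```python
# Back-of-the-envelope estimate of the all2all payload per token.
# Assumptions (not from the PR): 2-byte BF16/FP16 activations,
# per-token dynamic INT8 quantization, one float32 scale per token.


def all2all_bytes_per_token(hidden_size: int, quantized: bool) -> int:
    """Bytes sent per token for one dispatch of hidden_states."""
    if quantized:
        int8_payload = hidden_size * 1  # 1 byte per INT8 element
        scale_bytes = 4                 # one float32 scale per token (assumed)
        return int8_payload + scale_bytes
    return hidden_size * 2              # 2 bytes per BF16/FP16 element


def reduction(hidden_size: int) -> float:
    """Fraction of communication volume saved by quantizing first."""
    before = all2all_bytes_per_token(hidden_size, quantized=False)
    after = all2all_bytes_per_token(hidden_size, quantized=True)
    return 1.0 - after / before


if __name__ == "__main__":
    for hidden_size in (2048, 4096, 7168):  # hypothetical hidden sizes
        print(hidden_size, f"{reduction(hidden_size):.2%}")
```

Under these assumptions the saving falls just short of 50% because the scales travel with the quantized data; the larger the hidden size, the smaller that overhead becomes.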

Additionally, this PR includes a minor optimization to cast `int` inputs
to `float` for the `argsort` operation, forcing it to run on a faster
NPU core instead of the AICPU.
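The cast only helps if it leaves the sort order unchanged. A minimal pure-Python sketch of why that is plausible, assuming the cast targets float32 and the inputs are small integers such as expert indices (both are assumptions, as are the helper names and sample values): float32 represents every integer up to 2^24 exactly, so sorting the cast values yields the same permutation as sorting the integers.

```python
import struct


def to_float32(value: int) -> float:
    """Round-trip an integer through IEEE-754 float32."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


def argsort(values):
    """Stable argsort: indices that would sort `values` ascending."""
    return sorted(range(len(values)), key=values.__getitem__)


# Hypothetical expert indices, one per routed token.
expert_ids = [3, 0, 7, 1, 3, 5, 0, 2]
as_float = [to_float32(v) for v in expert_ids]

# Same permutation with or without the cast.
same_order = argsort(expert_ids) == argsort(as_float)

# float32 is exact for integers up to 2**24; beyond that, distinct
# integers can collapse to the same float and the order may change.
exact_limit = 2 ** 24
lossless_at_limit = to_float32(exact_limit) == exact_limit
lossy_past_limit = to_float32(exact_limit + 1) != exact_limit + 1

if __name__ == "__main__":
    print(same_order, lossless_at_limit, lossy_past_limit)
```

The helper here is a stable sort, so ties keep their original order. An on-device `argsort` may not be stable, in which case the relative order of equal keys can differ regardless of the cast.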

These changes lead to a clear and significant performance gain in MoE
quantization scenarios.

- vLLM version: v0.10.0
- vLLM main: 7175817637

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>

2025-08-05 18:47:13 +08:00

test_func_wrapper.py

[FOLLOWUP] Use base test to avoid patch everywhere (#1634)

2025-07-22 09:03:40 +08:00

test_quant_config.py

Add unit test local cpu guide and enable base testcase (#1566)

2025-07-06 10:42:27 +08:00

test_quantizer.py

Add unit test local cpu guide and enable base testcase (#1566)

2025-07-06 10:42:27 +08:00

test_w4a8_dynamic.py

[main][Feature] Support Qwen3 W4A8 quantization (#2060)