xc-llm-ascend

Angazenn b84465c525 [Perf] Enable npu_moe_gating_top_k_softmax on quantized scenarios (#2633)

### What this PR does / why we need it?
This PR enables `npu_moe_gating_top_k_softmax` when running quantized
MoE (such as W8A8). The op makes no distinction between quantized and
non-quantized scenarios, so it can be used in both. Introducing it
reduces TPOT by 3~4 ms.
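
For reference, the routing step that a fused softmax + top-k gating op covers can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `softmax_top_k_gating` is hypothetical, the sketch does not renormalize the selected weights, and whether the NPU kernel matches it term for term is not confirmed by this PR. It does show why quantization is not involved: the input is the router logits per token, not the (possibly W8A8) expert weights.

```python
import math
from typing import List, Tuple


def softmax_top_k_gating(
    router_logits: List[List[float]], k: int
) -> Tuple[List[List[float]], List[List[int]]]:
    """Hypothetical reference: softmax over experts, then keep the top-k.

    router_logits holds one row per token and one column per expert.
    Returns (top-k weights, top-k expert ids) per token.
    """
    all_weights: List[List[float]] = []
    all_ids: List[List[int]] = []
    for row in router_logits:
        # Numerically stable softmax over the expert dimension.
        row_max = max(row)
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Indices of the k largest probabilities, highest first.
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        all_ids.append(top)
        all_weights.append([probs[i] for i in top])
    return all_weights, all_ids


if __name__ == "__main__":
    # One token, four experts, route to the two most likely experts.
    weights, ids = softmax_top_k_gating([[2.0, 0.5, 1.0, -1.0]], k=2)
    print(ids, weights)
```

The sketch only touches floating-point logits, which is consistent with the statement that the op behaves the same in quantized and non-quantized scenarios.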

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main: ce30dca5c4

Signed-off-by: Angazenn <supperccell@163.com>

2025-09-03 09:14:17 +08:00

test_func_wrapper.py

[FOLLOWUP] Use base test to avoid patch everywhere (#1634)

2025-07-22 09:03:40 +08:00

test_quant_config.py

[Main][Refactor] Change ASCEND_QUATIZATION_METHOD to ASCEND_QUANTIZATION_METHOD (#2517)

2025-08-26 09:06:16 +08:00

test_quantizer.py

[main][Feature] Support deepseek w4a8 quantization (#2172)

2025-08-06 10:17:44 +08:00

test_w4a8_dynamic.py

[4/N][refactor] delete torchair from quantization (#2535)

2025-08-28 09:10:03 +08:00

test_w8a8.py

[Perf] Enable npu_moe_gating_top_k_softmax on quantized scenarios (#2633)

2025-09-03 09:14:17 +08:00