Siyuan Kong a16c99141b Adapt w8a8mxfp8 quantization for Qwen VL models (#7417)
### What this PR does / why we need it?

This PR adapts the `w8a8_mxfp8` quantization method to support Qwen
Vision-Language (VL) models. Key changes include:
- Reshaping multi-dimensional input tensors to 2D before the quantized
matrix multiplication.
- Reshaping the 2D output back to its original multi-dimensional format.
- Adding specific output reshaping for the visual components of Qwen VL
models.
- Casting the bias tensor to `float32` to comply with the
`npu_quant_matmul` kernel requirements.

These changes are necessary to enable `w8a8_mxfp8` quantization for
models with multi-modal inputs like Qwen VL.
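The reshape-around-matmul pattern described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the real code calls the Ascend `npu_quant_matmul` kernel on quantized tensors, whereas here a plain `@` stands in for it, and all shapes and names (`quant_matmul_nd`, the example dimensions) are hypothetical.

```python
import numpy as np

def quant_matmul_nd(x, weight, bias):
    """Flatten an N-D input to 2D, matmul, then restore leading dims.

    Stand-in for the quantized path: the NPU kernel only accepts 2D
    activations, so VL inputs like [batch, seq, hidden] must be
    flattened first and the output reshaped back afterwards.
    """
    orig_shape = x.shape
    # Collapse all leading dims into one row dimension.
    x_2d = x.reshape(-1, orig_shape[-1])
    # The bias cast mirrors the float32 requirement noted in the PR.
    out_2d = x_2d @ weight + bias.astype(np.float32)
    # Restore the original leading dims on the output.
    return out_2d.reshape(*orig_shape[:-1], weight.shape[-1])
```

For example, a `[2, 3, 4]` activation multiplied by a `[4, 5]` weight yields a `[2, 3, 5]` output, with the intermediate 2D matmul operating on a `[6, 4]` view.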

### Does this PR introduce _any_ user-facing change?

No, this is a backend enhancement to extend quantization support to new
model architectures. There are no user-facing API or behavior changes.

### How was this patch tested?

CI is expected to pass. Manual testing should be performed with a Qwen
VL model using `w8a8_mxfp8` quantization to verify correctness and
performance.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: ksiyuan <ksiyuan@umich.edu>
2026-03-20 16:18:58 +08:00