LHXuuu bdc66972db [Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036)

### What this PR does / why we need it?

When quantized weights are generated with the LLM Compressor tool from
the vLLM community, the vLLM Ascend engine must be adapted to support
the compressed tensors quantization format.

1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
   in vLLM.
2. Support CompressedTensorsW8A8 static weight.
   - weight: per-channel, int8, symmetric; activation: per-tensor, int8,
     symmetric.
3. Support CompressedTensorsW8A8Dynamic weight.
   - weight: per-channel, int8, symmetric; activation: per-token, int8,
     symmetric, dynamic.
4. Modify the override_quantization_method in AscendQuantConfig.
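
The two schemes listed above differ only in how the activation scale is obtained. The following is a minimal pure-Python sketch of symmetric int8 quantization, not the vLLM Ascend implementation: the function names, the clamp range of [-127, 127], and the convention that each row of the weight matrix is one output channel are all assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]

INT8_MAX = 127  # assumed symmetric clamp range [-127, 127]


def _to_int8(value: float, scale: float) -> int:
    """Scale, round, and clamp one value into the symmetric int8 range."""
    return max(-INT8_MAX, min(INT8_MAX, round(value / scale)))


def quantize_rows_symmetric(rows: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Give every row its own scale, computed from that row's max magnitude.

    Applied to weights (rows = output channels, an assumption) this is
    per-channel quantization; applied to activations at run time
    (rows = tokens) it is per-token dynamic quantization.
    """
    q_rows, scales = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / INT8_MAX if max_abs > 0 else 1.0  # avoid divide-by-zero
        q_rows.append([_to_int8(v, scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def quantize_tensor_symmetric(rows: Matrix, scale: float) -> List[List[int]]:
    """Use one precomputed scale for the whole tensor (per-tensor static)."""
    return [[_to_int8(v, scale) for v in row] for row in rows]


# Hypothetical data, for illustration only.
weights = [[0.5, -1.0], [2.0, 0.25]]
q_weights, weight_scales = quantize_rows_symmetric(weights)  # per-channel

activations = [[1.0, -20.0]]
q_static = quantize_tensor_symmetric(activations, scale=0.1)  # static, per-tensor
q_dynamic, token_scales = quantize_rows_symmetric(activations)  # dynamic, per-token
```

In the static case the activation scale is fixed ahead of time (typically from calibration), so values outside the calibrated range are clamped. In the dynamic case the scale is recomputed for each token, which avoids clamping at the cost of extra work at inference time.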

Co-authored-by: taoqun110 <taoqun@huawei.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>

- vLLM version: v0.11.2

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>

2025-11-28 14:09:39 +08:00

Feature Guide

This section provides a detailed usage guide for vLLM Ascend features.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
quantization-llm-compressor
sleep_mode
structured_output
lora
eplb_swift_balancer
netloader
dynamic_batch
kv_pool_mooncake
external_dp
:::
