[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718)

### What this PR does / why we need it?
When quantized weights are generated with the vLLM community's LLM Compressor tool, the vLLM Ascend engine needs additional support to load the resulting compressed-tensors quantization format. This PR:

1. Supports W8A8 int8 dynamic weights for MoE models (see the sketch below).
2. Allows a W4A16 quantization configuration to be specified.
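
For reference, producing such a checkpoint with LLM Compressor looks roughly like the following. This is a minimal sketch based on public LLM Compressor examples; the `oneshot`/`QuantizationModifier` entry points and the `"W8A8"` scheme name are assumptions and should be checked against the current LLM Compressor documentation.

```python
# Sketch: quantize a MoE model to the compressed-tensors W8A8 format
# (int8 weights, dynamic per-token int8 activations) with LLM Compressor.
# Import paths and the scheme string are assumptions, not taken from this PR.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",    # quantize all Linear layers, incl. MoE expert weights
    scheme="W8A8",       # int8 weights + dynamic per-token int8 activations
    ignore=["lm_head"],  # keep the output head unquantized
)

oneshot(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",  # example base model
    recipe=recipe,
    output_dir="./Qwen3-30B-A3B-w8a8",  # saved in compressed-tensors format
)
```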

Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
LHXuuu authored this commit on 2026-01-14 09:17:26 +08:00 (committed by GitHub).
Commit: 0415e694cd, parent: ecf2fa482e.
5 changed files with 192 additions and 43 deletions.


@@ -42,3 +42,26 @@ def test_qwen2_5_w8a8_external_quantized_tp2():
    for i in range(len(vllm_output)):
        assert golden_results[i] == vllm_output[i][1]
        print(f"Generated text: {vllm_output[i][1]!r}")


def test_qwen3_moe_w8a8_dynamic_llm_compressor():
    example_prompts = [
        "The president of the United States is",
    ]
    max_tokens = 5
    with VllmRunner(
            snapshot_download(
                "vllm-ascend/Qwen3-30B-A3B-Instruct-2507-quantized.w8a8"),
            tensor_parallel_size=2,
            max_model_len=4096,
            gpu_memory_utilization=0.8,
    ) as vllm_model:
        vllm_output = vllm_model.generate_greedy(example_prompts, max_tokens)
    golden_results = [
        'The president of the United States is the head of state and',
    ]
    for i in range(len(vllm_output)):
        assert golden_results[i] == vllm_output[i][1]
        print(f"Generated text: {vllm_output[i][1]!r}")