# Quantize Qwen3-30B-A3B-Instruct-2507 to INT8 with llmcompressor (one-shot PTQ).
import torch
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer
# Hugging Face hub id of the checkpoint to quantize.
MODEL_ID = "Qwen/Qwen3-30B-A3B-Instruct-2507"
# Load the full-precision checkpoint in bfloat16; oneshot() below quantizes
# this model object before it is saved.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Tokenizer is loaded unmodified and re-saved alongside the quantized weights.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# One-shot quantization recipe: apply the INT8 scheme to every Linear module.
# Entries prefixed with "re:" are regexes over module paths; the LM head and
# the MoE router gates are excluded and stay in high precision.
# NOTE(review): llmcompressor int8 examples typically spell the preset scheme
# "W8A8" — confirm "INT8" resolves to the intended preset in this version.
skip_modules = ["lm_head", "re:.*mlp.gate$"]
recipe = QuantizationModifier(
    scheme="INT8",
    targets="Linear",
    ignore=skip_modules,
)
# Run the one-shot pass: quantizes `model` according to `recipe`.
oneshot(model=model, recipe=recipe, trust_remote_code_model=True)
# Output directory in compressed-tensors format: the bare model name with a
# scheme suffix, e.g. "Qwen3-30B-A3B-Instruct-2507-INT8_W8A8".
_model_name = MODEL_ID.rstrip("/").rpartition("/")[2]
SAVE_DIR = _model_name + "-INT8_W8A8"
# Persist tokenizer files and the compressed model weights side by side.
tokenizer.save_pretrained(SAVE_DIR)
model.save_pretrained(SAVE_DIR, save_compressed=True)