# llm-compressor Quantization Guide

Model quantization is a technique that reduces a model's size and computational requirements by lowering the precision of its weights and activation values, thereby saving memory and improving inference speed.
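
To make "lowering the precision" concrete, here is a toy sketch of symmetric int8 quantization (an illustration only, not llm-compressor's algorithm):

```python
import torch

# Map a float tensor onto the int8 range [-127, 127] with a single scale,
# then dequantize to see the approximation error (illustration only).
w = torch.randn(4, 8)                    # a float32 weight tile
scale = w.abs().max() / 127.0            # per-tensor symmetric scale
q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_hat = q.to(torch.float32) * scale      # dequantized approximation of w
print(f"max abs error: {(w - w_hat).abs().max().item():.6f}")  # on the order of scale / 2
```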

## Supported llm-compressor Quantization Types

Two compressed-tensors quantization types are supported:

- `CompressedTensorsW8A8` (static): weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.
- `CompressedTensorsW8A8Dynamic`: weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.
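
The two types differ only in where the activation scale comes from: the static scheme fixes one per-tensor scale offline from calibration data, while the dynamic scheme computes one scale per token at runtime. The three scale granularities, as a toy sketch (not the compressed-tensors implementation):

```python
import torch

# Scale granularities used by the schemes above (toy illustration).
w = torch.randn(16, 64)  # weight matrix: [out_channels, in_features]
x = torch.randn(4, 64)   # activations:   [num_tokens, in_features]

w_scale = w.abs().amax(dim=1) / 127.0    # per-channel: one scale per output channel
a_static = x.abs().max() / 127.0         # per-tensor: one scale, fixed offline from calibration
a_dynamic = x.abs().amax(dim=1) / 127.0  # per-token: one scale per token, computed at runtime
print(w_scale.shape, a_static.shape, a_dynamic.shape)  # [16], [], [4]
```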

## Install llm-compressor

To quantize a model, install [llm-compressor](https://github.com/vllm-project/llm-compressor/blob/main/README.md), a unified library for creating compressed models for faster inference with vLLM:

```bash
pip install llmcompressor
```

### Generate the W8A8 weights

```bash
cd examples/quantization/llm-compressor
python3 w8a8_int8_dynamic.py
```

For more details, see the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).
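
For orientation, the upstream int8 W8A8 quickstart looks roughly like the sketch below; the model name, calibration dataset, and output directory are placeholders, and the `oneshot` import path may differ across llm-compressor versions:

```python
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant migrates activation outliers into the weights, then GPTQ
# applies the int8 W8A8 scheme; lm_head is left in full precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # calibration set used in upstream examples
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",  # pass this path to vLLM below
    max_seq_length=2048,
    num_calibration_samples=512,
)
```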

## Run the model

Now you can run the quantized model with vLLM Ascend. Examples for offline and online inference are provided below.

### Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

# Point `model` at the directory where the quantized weights were saved.
llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

Serve the quantized model with vLLM Ascend as usual; no modifications to the startup command are required.
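
For example, with vLLM's OpenAI-compatible server (the model path is the same placeholder used above):

```bash
vllm serve {quantized_model_save_path} --max-model-len 2048
```

Once the server is up, query it as usual:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "{quantized_model_save_path}", "prompt": "Hello, my name is", "max_tokens": 32}'
```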