# llm-compressor Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the numerical precision of its weights and activations, which saves memory and improves inference speed.

## Supported llm-compressor Quantization Types

- `CompressedTensorsW8A8` (static): weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.
- `CompressedTensorsW8A8Dynamic`: weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.

## Install llm-compressor

To quantize a model, install [llm-compressor](https://github.com/vllm-project/llm-compressor/blob/main/README.md), a unified library for creating compressed models for faster inference with vLLM:

```bash
pip install llmcompressor
```

### Generate the W8A8 weights

```bash
cd examples/quantization/llm-compressor
python3 w8a8_int8_dynamic.py
```

For more details, see the [official examples](https://github.com/vllm-project/llm-compressor/tree/main/examples).

## Run the model

Now you can run the quantized model with vLLM Ascend. Examples of offline and online inference follow.

### Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

# Load the quantized checkpoint produced by the quantization step above.
llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

Start the quantized model using vLLM Ascend; no modifications to the startup command are required.
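As a minimal sketch (assuming the quantized checkpoint was saved to `{quantized_model_save_path}`, as in the offline example), the server can be started with the standard vLLM CLI:

```bash
# Serve the quantized model; vLLM reads the quantization scheme from the
# checkpoint itself, so no extra quantization flags are needed here.
vllm serve {quantized_model_save_path} --max-model-len 2048 --trust-remote-code
```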
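The server exposes an OpenAI-compatible API (listening on port 8000 by default), so a completion request can be issued with, for example:

```bash
# Query the OpenAI-compatible completions endpoint; the "model" field
# matches the path passed to `vllm serve`.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "{quantized_model_save_path}",
        "prompt": "The future of AI is",
        "max_tokens": 64,
        "temperature": 0.6
    }'
```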