# Quantization Adaptation Guide

This document provides guidance on adapting quantization algorithms and ModelSlim-quantized models in vLLM Ascend.

## Quantization Feature Introduction

### Quantization Inference Process

The current process for registering and obtaining quantization methods in vLLM Ascend is as follows:

*(Figure: get_quant_method)*

vLLM Ascend registers a custom `ascend` quantization method. Passing the `--quantization ascend` parameter (or `quantization="ascend"` for offline inference) enables the quantization feature. When the `quant_config` is constructed, the registered `AscendQuantConfig` is initialized and its `get_quant_method` is called to obtain the quantization method for each weight part, which is stored in the layer's `quant_method` attribute.
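For illustration, a minimal offline-inference sketch of enabling the `ascend` quantization method (the model path is a placeholder for a quantized checkpoint):

```python
from vllm import LLM, SamplingParams

# Load a quantized checkpoint with the ascend quantization method enabled.
# "/path/to/quantized-model" is a placeholder.
llm = LLM(model="/path/to/quantized-model", quantization="ascend")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```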

Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:

*(Figure: quant_methods_overview)*

The quantization method base class defined by vLLM and the overall call flow of quantization methods are as follows:

*(Figure: quant_method_call_flow)*

The embedding method is generally not quantized, so the focus here is on the other method types.

The `create_weights` method initializes the weights; the `process_weights_after_loading` method post-processes them (e.g., transposition, format conversion, data type conversion); the `apply` method performs activation quantization and the quantized matrix multiplication during the forward pass.

We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for the different layer types (attention, mlp, moe).
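As a rough illustration of this interface, here is a minimal sketch of a linear quantization method; the class name `MyQuantLinearMethod`, the argument list, and the per-tensor int8 math are simplified assumptions, while the actual vLLM Ascend methods dispatch to fused NPU operators:

```python
import torch


class MyQuantLinearMethod:  # hypothetical name, simplified sketch
    """Illustrates the three hooks a quantization method implements."""

    def create_weights(self, layer, input_size, output_size, params_dtype):
        # Allocate the quantized weight and its per-channel scale on the layer.
        layer.weight = torch.nn.Parameter(
            torch.empty(output_size, input_size, dtype=torch.int8),
            requires_grad=False)
        layer.weight_scale = torch.nn.Parameter(
            torch.empty(output_size, 1, dtype=params_dtype),
            requires_grad=False)

    def process_weights_after_loading(self, layer):
        # Post-process loaded weights, e.g. transpose them into the layout
        # expected by the matmul kernel.
        layer.weight.data = layer.weight.data.t().contiguous()

    def apply(self, layer, x, bias=None):
        # Quantize activations to the int8 range (per-tensor symmetric); this
        # reference sketch keeps float tensors and rescales after the matmul.
        x_scale = x.abs().amax() / 127.0
        x_q = torch.clamp(torch.round(x / x_scale), -128, 127)
        out = x_q @ layer.weight.to(x_q.dtype)
        out = out * x_scale * layer.weight_scale.t()
        if bias is not None:
            out = out + bias
        return out.to(x.dtype)
```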

Supplement: when loading the model, the quantized model's description file `quant_model_description.json` is read. This file records the quantization configuration and parameters for each part of the model weights, for example:

```json
{
    "model.layers.0.linear_attn.dt_bias": "FLOAT",
    "model.layers.0.linear_attn.A_log": "FLOAT",
    "model.layers.0.linear_attn.conv1d.weight": "FLOAT",
    "model.layers.0.linear_attn.in_proj_qkvz.weight": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_qkvz.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_qkvz.weight_offset": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_ba.weight": "FLOAT",
    "model.layers.0.linear_attn.norm.weight": "FLOAT",
    "model.layers.0.linear_attn.out_proj.weight": "FLOAT",
    "model.layers.0.mlp.gate.weight": "FLOAT",
    "model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
}
```
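For example, a small illustrative helper (not part of vLLM Ascend) that reads this description file and summarizes which quantization types a checkpoint uses; the model path is a placeholder:

```python
import json
from collections import Counter
from pathlib import Path


def summarize_quant_description(model_dir: str) -> Counter:
    """Count how many weight entries use each quantization type."""
    with open(Path(model_dir) / "quant_model_description.json") as f:
        description = json.load(f)
    return Counter(description.values())


# Example (placeholder path):
# print(summarize_quant_description("/path/to/quantized-model"))
# -> Counter({'W8A8_DYNAMIC': ..., 'FLOAT': ...})
```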

Based on the above content, we present a brief description of the adaptation process for quantization algorithms and quantized models.

## Quantization Algorithm Adaptation

- Step 1: Algorithm Design. Define the algorithm ID (e.g., `W4A8_DYNAMIC`), determine the supported layers (linear, moe, attention), and design the quantization scheme (static/dynamic, per-tensor/per-channel/per-group).
- Step 2: Registration. Add the algorithm ID to `ASCEND_QUANTIZATION_METHOD_MAP` in `vllm_ascend/quantization/utils.py` and associate it with the corresponding method classes:

  ```python
  # vllm_ascend/quantization/utils.py
  from typing import Any, Dict, Type

  ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
      "W4A8_DYNAMIC": {
          "linear": AscendW4A8DynamicLinearMethod,
          "moe": AscendW4A8DynamicFusedMoEMethod,
      },
  }
  ```

- Step 3: Implementation. Create an algorithm implementation file, such as `vllm_ascend/quantization/w4a8_dynamic.py`, and implement the method class and its logic (a simplified sketch of the kind of logic involved follows this list).
- Step 4: Testing. Use your algorithm to generate quantization configurations and verify correctness and performance on the target models and hardware.
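For intuition about Step 3, here is a hedged sketch of the kind of activation-quantization logic a dynamic method's `apply` relies on: a plain PyTorch per-token int8 quantizer. The helper name is hypothetical, and the real implementation calls fused NPU operators instead:

```python
import torch


def dynamic_per_token_quant(x: torch.Tensor):
    """Quantize activations to int8 with one scale per token (row).

    x: (num_tokens, hidden_size) float16/bfloat16 activations.
    Returns the int8 tensor and per-token scales of shape (num_tokens, 1).
    """
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale
```

Dynamic methods compute such scales on every forward pass, while static methods reuse the scales stored in the quantized checkpoint.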

## Quantized Model Adaptation

Adapting a new quantized model requires ensuring the following three points:

- The original (non-quantized) model has already been adapted and runs successfully in vLLM Ascend.
- Fused Module Mapping: add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/quant_config.py` and list its fused modules (e.g., `qkv_proj`, `gate_up_proj`, `experts`) to ensure sharding consistency and correct weight loading:

  ```python
  packed_modules_model_mapping = {
      "qwen3_moe": {
          "qkv_proj": [
              "q_proj",
              "k_proj",
              "v_proj",
          ],
          "gate_up_proj": [
              "gate_proj",
              "up_proj",
          ],
          "experts": [
              "experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"
          ],
      },
  }
  ```

- All quantization algorithms used by the quantized model have been integrated into the quantization module.
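To see how these pieces fit together, here is a simplified, illustrative sketch (the stub classes and the helper are hypothetical) of how a weight's recorded quantization type maps to a method class through the registration table; in vLLM Ascend this lookup happens inside `AscendQuantConfig.get_quant_method`:

```python
from typing import Any, Dict, Type


# Hypothetical stand-ins for registered method classes.
class AscendW8A8DynamicLinearMethod: ...
class AscendUnquantizedLinearMethod: ...


ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
    "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod},
}


def pick_linear_method(description: Dict[str, str], prefix: str) -> Any:
    """Map a weight prefix's recorded quantization type to a linear method."""
    quant_type = description.get(f"{prefix}.weight", "FLOAT")
    if quant_type == "FLOAT":
        return AscendUnquantizedLinearMethod()
    return ASCEND_QUANTIZATION_METHOD_MAP[quant_type]["linear"]()
```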

## Currently Supported Quantization Algorithms

vLLM Ascend supports multiple quantization algorithms. The following table gives an overview of each algorithm, based on the implementation in the `vllm_ascend.quantization` module:

| Algorithm | Weight | Activation | Weight Granularity | Activation Granularity | Type | Description |
|---|---|---|---|---|---|---|
| W4A16 | INT4 | FP16/BF16 | Per-Group | Per-Tensor | Static | 4-bit weight quantization with 16-bit activation precision, specifically designed for MoE model expert layers, supporting int32-format weight packing |
| W8A16 | INT8 | FP16/BF16 | Per-Channel | Per-Tensor | Static | 8-bit weight quantization with 16-bit activation precision, balancing accuracy and performance, suitable for linear layers |
| W8A8 | INT8 | INT8 | Per-Channel | Per-Tensor | Static | Static activation quantization, suitable for scenarios requiring high precision |
| W8A8_DYNAMIC | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
| W4A8_DYNAMIC | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit, then per-group to 4-bit) |
| W4A4_FLATQUANT_DYNAMIC | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant to smooth activation distributions before 4-bit dynamic quantization, with additional matrix multiplications to preserve precision |
| W8A8_MIX | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | PD colocation scenario: dynamic quantization for both the P node and the D node; PD disaggregation scenario: dynamic quantization for the P node and static for the D node |

**Static vs Dynamic**: static quantization uses pre-computed scaling factors, which gives better performance; dynamic quantization computes scaling factors on the fly for each token/activation tensor, which gives higher precision.

**Granularity**: refers to the scope over which a scaling factor is computed (e.g., per-tensor, per-channel, per-group).
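As a small illustration of these granularities (the group size, bit ranges, and tensor shapes below are arbitrary examples), the main difference is the shape over which a scale is computed:

```python
import torch

x = torch.randn(4, 8)    # 4 tokens, hidden size 8 (activations)
w = torch.randn(16, 8)   # a linear weight: 16 output channels, 8 inputs

# Per-tensor: a single scale for the whole activation tensor (static W8A8).
per_tensor_scale = x.abs().amax() / 127.0

# Per-token: one scale per activation row (the *_DYNAMIC schemes).
per_token_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0      # (4, 1)

# Per-channel: one scale per weight output channel (int4 range here).
per_channel_scale = w.abs().amax(dim=-1, keepdim=True) / 7.0      # (16, 1)

# Per-group: split each weight row into groups (group size 4 here) and
# compute one scale per group, as in W4A16 / W4A8_DYNAMIC.
group_size = 4
per_group_scale = w.reshape(16, -1, group_size).abs().amax(dim=-1) / 7.0  # (16, 2)
```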