### Summary
This PR refactors the `vllm_ascend/quantization` module to improve code
organization, maintainability, and extensibility. The refactoring
introduces a clear separation of concerns with a registry-based scheme
discovery pattern, abstract base classes for quantization schemes, and
dedicated wrapper classes.
### Key Changes
#### 1. **Modular Directory Structure**
| Before | After |
|--------|-------|
| Flat file structure with mixed responsibilities | Organized into `methods/` subpackage for schemes |
| Single `quant_config.py` (600+ lines) | Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` |
| `utils.py` with scheme lookup logic | `methods/registry.py` with decorator-based registration |
#### 2. **Registry-Based Scheme Discovery**
Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a
decorator-based registry pattern:
```python
# Before: Manual dictionary mapping
ASCEND_QUANTIZATION_METHOD_MAP = {
    "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...},
    ...
}

# After: Decorator-based registration
@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    ...
```
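A minimal sketch of what such a decorator-based registry could look like; the names `_SCHEME_REGISTRY` and `get_scheme_class` come from the diagrams below, but the exact signatures here are illustrative, not the PR's actual code:

```python
# Illustrative sketch: the registry maps (quant_type, layer_type) -> scheme class.
_SCHEME_REGISTRY: dict[tuple[str, str], type] = {}


def register_scheme(quant_type: str, layer_type: str):
    """Class decorator that records a scheme class under (quant_type, layer_type)."""
    def decorator(cls: type) -> type:
        _SCHEME_REGISTRY[(quant_type, layer_type)] = cls
        return cls
    return decorator


def get_scheme_class(quant_type: str, layer_type: str) -> type:
    """Look up a registered scheme class, failing loudly if it is missing."""
    try:
        return _SCHEME_REGISTRY[(quant_type, layer_type)]
    except KeyError:
        raise ValueError(
            f"No scheme registered for ({quant_type!r}, {layer_type!r})") from None
```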
#### 3. **Abstract Base Classes**
Introduced three abstract base classes in `methods/base.py`:
- `AscendLinearScheme` - Base for linear layer quantization
- `AscendMoEScheme` - Base for MoE layer quantization
- `AscendAttentionScheme` - Base for attention layer quantization
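As a hedged sketch, the linear base class could look roughly like the following, using the method names from the class diagram below; the actual signatures in the PR may differ:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional

import torch


class AscendLinearScheme(ABC):
    """Illustrative base class for linear-layer quantization schemes."""

    @abstractmethod
    def get_weight(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, Any]:
        """Describe the quantized weight tensors the layer must allocate."""

    @abstractmethod
    def apply(self, layer: torch.nn.Module, x: torch.Tensor,
              bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Run the quantized matmul in the forward pass."""

    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
        """Optional per-tensor parameters (e.g. input scales/offsets)."""
        return {}

    def get_perchannel_param(self, output_size: int,
                             params_dtype: torch.dtype) -> Dict[str, Any]:
        """Optional per-channel parameters (e.g. weight scales/offsets)."""
        return {}
```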
#### 4. **Separated Config and Wrapper Classes**
- **Config classes** (`AscendModelSlimConfig`,
`AscendCompressedTensorsConfig`): Handle config parsing and scheme
selection
- **Wrapper classes** (`AscendLinearMethod`, `AscendFusedMoEMethod`,
etc.): Implement vLLM interfaces and delegate to schemes
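A rough sketch of this delegation pattern, simplified relative to the real vLLM layer interfaces (method signatures are illustrative):

```python
from typing import Optional

import torch


class AscendLinearMethod:
    """Illustrative vLLM-facing wrapper that delegates to a scheme instance."""

    def __init__(self, quant_method) -> None:
        # quant_method is an AscendLinearScheme implementation.
        self.quant_method = quant_method

    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        # Ask the scheme which tensors it needs and register them on the layer.
        for name, tensor in self.quant_method.get_weight(
                input_size, output_size, params_dtype).items():
            layer.register_parameter(
                name, torch.nn.Parameter(tensor, requires_grad=False))

    def apply(self, layer: torch.nn.Module, x: torch.Tensor,
              bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        # The forward computation is delegated entirely to the scheme.
        return self.quant_method.apply(layer, x, bias)
```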
#### 5. **Cleaner Public API**
```python
# New clean module interface
from vllm_ascend.quantization import (
AscendModelSlimConfig,
AscendCompressedTensorsConfig,
)
from vllm_ascend.quantization.methods import get_scheme_class
```
### Architecture Diagram
```mermaid
classDiagram
direction TB
class QuantizationConfig {
<<vLLM Interface>>
+get_quant_method()
}
class AscendModelSlimConfig {
+quant_description
+get_quant_method()
-create_scheme_for_layer()
}
class AscendCompressedTensorsConfig {
+target_scheme_map
+get_quant_method()
-_get_scheme_from_parts()
}
class AscendLinearMethod {
<<Wrapper>>
+quant_method: AscendLinearScheme
+create_weights()
+apply()
}
class AscendFusedMoEMethod {
<<Wrapper>>
+quant_method: AscendMoEScheme
+create_weights()
+apply()
}
class AscendLinearScheme {
<<Abstract>>
+get_weight()*
+apply()*
+get_pertensor_param()
+get_perchannel_param()
}
class AscendMoEScheme {
<<Abstract>>
+get_weight()*
+get_dynamic_quant_param()*
+apply()*
}
class W8A8DynamicLinear {
+get_weight()
+apply()
}
class W8A8DynamicMoE {
+get_weight()
+apply()
}
QuantizationConfig <|-- AscendModelSlimConfig
QuantizationConfig <|-- AscendCompressedTensorsConfig
AscendModelSlimConfig ..> AscendLinearMethod : creates
AscendModelSlimConfig ..> AscendFusedMoEMethod : creates
AscendCompressedTensorsConfig ..> AscendLinearMethod : creates
AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates
AscendLinearMethod o-- AscendLinearScheme : delegates to
AscendFusedMoEMethod o-- AscendMoEScheme : delegates to
AscendLinearScheme <|-- W8A8DynamicLinear
AscendMoEScheme <|-- W8A8DynamicMoE
```
### Scheme Registration Flow
```mermaid
sequenceDiagram
participant Module as Scheme Module
participant Registry as _SCHEME_REGISTRY
participant Config as QuantConfig
participant Wrapper as Wrapper Class
Note over Module: At import time
Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear")
Registry->>Registry: Store (quant_type, layer_type) -> Class
Note over Config: At runtime
Config->>Config: Determine quant_type from description
Config->>Registry: get_scheme_class(quant_type, layer_type)
Registry-->>Config: Return scheme class
Config->>Config: scheme = scheme_cls()
Config->>Wrapper: Create wrapper with scheme
Wrapper-->>Config: Return wrapper instance
```
### File Changes Summary
| Original Files | Refactored Files |
|----------------|------------------|
| `__init__.py` (empty) | `__init__.py` (exports public API) |
| `quant_config.py` | `modelslim_config.py` + `wrappers.py` |
| `compressed_tensors/` | `compressed_tensors_config.py` |
| `utils.py` | `methods/registry.py` |
| `w8a8_dynamic.py` | `methods/w8a8_dynamic.py` |
| `w8a8.py` | `methods/w8a8_static.py` |
| `w4a4_flatquant_dynamic.py` | `methods/w4a4_flatquant.py` |
| ... | `methods/base.py` (new) |
### Benefits
1. **Extensibility**: Adding a new quantization scheme only requires
implementing the base class and adding the `@register_scheme` decorator
2. **Maintainability**: Clear separation between config parsing, wrapper
logic, and scheme implementation
3. **Testability**: Abstract base classes enable easier unit testing and
mocking
4. **Discoverability**: Registry pattern makes it easy to list all
supported schemes
5. **Reduced Coupling**: Config classes no longer need to know about all
scheme implementations
___
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
# Quantization Adaptation Guide
This document provides guidance for adapting quantization algorithms and models related to **ModelSlim**.
## Quantization Feature Introduction
### Quantization Inference Process
The current process for registering and obtaining quantization methods in vLLM Ascend is as follows:

vLLM Ascend registers a custom Ascend quantization method. The quantization feature is enabled by passing the `--quantization ascend` parameter (or `quantization="ascend"` for offline inference). When the `quant_config` is constructed, the registered `AscendModelSlimConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, which is stored in the `quant_method` attribute.
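For example, offline inference can be enabled as below (the model path is a placeholder; the online serving equivalent is `vllm serve <model> --quantization ascend`):

```python
from vllm import LLM

# Enable the Ascend quantization method for a quantized checkpoint.
# The model path below is a placeholder.
llm = LLM(model="/path/to/quantized-model", quantization="ascend")
outputs = llm.generate("Hello, my name is")
```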
Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:

The quantization method base class defined by vLLM and the overall call flow of quantization methods are as follows:

Quantization is generally not implemented for the `embedding` method, so we focus only on the other three methods.
The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, data type conversion, etc.; the `apply` method is used to perform activation quantization and quantized matrix multiplication calculations during the forward process.
We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**).
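As a rough, hedged sketch (names and signatures below are illustrative and simplified relative to the real vLLM/vLLM Ascend interfaces), a linear-layer scheme provides these three hooks:

```python
from typing import Optional

import torch


class MyQuantLinearScheme:
    """Skeleton of the three hooks a linear quantization scheme implements."""

    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        # Allocate the quantized weight plus its scale/offset parameters.
        layer.register_parameter(
            "weight",
            torch.nn.Parameter(torch.empty(output_size, input_size,
                                           dtype=torch.int8),
                               requires_grad=False))
        layer.register_parameter(
            "weight_scale",
            torch.nn.Parameter(torch.empty(output_size, 1, dtype=params_dtype),
                               requires_grad=False))

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Post-process loaded weights, e.g. transpose or convert to an NPU format.
        layer.weight.data = layer.weight.data.t().contiguous()

    def apply(self, layer: torch.nn.Module, x: torch.Tensor,
              bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Quantize the activation, run the quantized matmul, then dequantize.
        ...
```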
**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:
```json
{
    "model.layers.0.linear_attn.dt_bias": "FLOAT",
    "model.layers.0.linear_attn.A_log": "FLOAT",
    "model.layers.0.linear_attn.conv1d.weight": "FLOAT",
    "model.layers.0.linear_attn.in_proj_qkvz.weight": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_qkvz.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_qkvz.weight_offset": "W8A8_DYNAMIC",
    "model.layers.0.linear_attn.in_proj_ba.weight": "FLOAT",
    "model.layers.0.linear_attn.norm.weight": "FLOAT",
    "model.layers.0.linear_attn.out_proj.weight": "FLOAT",
    "model.layers.0.mlp.gate.weight": "FLOAT",
    "model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
}
```
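As a hedged sketch of how such a description file can be consumed when deciding which scheme to build for a given layer (the helpers below are hypothetical, not the actual `AscendModelSlimConfig` code), the quant type is looked up by weight name:

```python
import json


def load_quant_description(path: str) -> dict:
    """Read quant_model_description.json: weight name -> quant type string."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def quant_type_for(prefix: str, description: dict) -> str:
    """Return the quant type recorded for a module's weight, e.g. 'W8A8_DYNAMIC'
    for 'model.layers.0.linear_attn.in_proj_qkvz'; default to 'FLOAT'."""
    return description.get(f"{prefix}.weight", "FLOAT")
```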
Based on the above content, we present a brief description of the adaptation process for quantization algorithms and quantized models.
### Quantization Algorithm Adaptation
- **Step 1: Algorithm Design**. Define the algorithm ID (e.g., `W4A8_DYNAMIC`), determine supported layers (linear, moe, attention), and design the quantization scheme (static/dynamic, pertensor/perchannel/pergroup).
- **Step 2: Registration**. Use the `@register_scheme` decorator from `vllm_ascend/quantization/methods/registry.py` to register your quantization scheme class.
```python
from vllm_ascend.quantization.methods import (AscendLinearScheme,
                                               AscendMoEScheme,
                                               register_scheme)


@register_scheme("W4A8_DYNAMIC", "linear")
class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
    ...


@register_scheme("W4A8_DYNAMIC", "moe")
class AscendW4A8DynamicFusedMoEMethod(AscendMoEScheme):
    ...
```
- **Step 3: Implementation**. Create an algorithm implementation file, such as `vllm_ascend/quantization/methods/w4a8.py`, and implement the method class and logic.
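A minimal, hedged skeleton of such a file, reusing the `W4A8_DYNAMIC` registration from Step 2 (method names follow the base-class interfaces; bodies are omitted, and the MoE counterpart follows the same pattern):

```python
# vllm_ascend/quantization/methods/w4a8.py -- illustrative skeleton only
from vllm_ascend.quantization.methods import AscendLinearScheme, register_scheme


@register_scheme("W4A8_DYNAMIC", "linear")
class AscendW4A8DynamicLinearMethod(AscendLinearScheme):

    def get_weight(self, input_size, output_size, params_dtype):
        # Describe the int4 weight layout and its per-group scales.
        ...

    def apply(self, layer, x, bias=None):
        # Per-token activation quantization followed by the quantized matmul.
        ...
```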
- **Step 4: Testing**. Use your algorithm to generate quantization configurations and verify correctness and performance on target models and hardware.
### Quantized Model Adaptation
Adapting a new quantized model requires ensuring the following three points:
- The original model has been successfully adapted in `vLLM Ascend`.
- **Fused Module Mapping**: Add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/modelslim_config.py` (e.g., `qkv_proj`, `gate_up_proj`, `experts`) to ensure sharding consistency and correct loading.
```python
packed_modules_model_mapping = {
    "qwen3_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    },
}
```
- All quantization algorithms used by the quantized model have been integrated into the `quantization` module.
## Currently Supported Quantization Algorithms
vLLM Ascend supports multiple quantization algorithms. The following table provides an overview of each quantization algorithm based on the implementation in the `vllm_ascend.quantization` module:
| Algorithm | Weight | Activation | Weight Granularity | Activation Granularity | Type | Description |
| --- | --- | --- | --- | --- | --- | --- |
| `W4A16` | INT4 | FP16/BF16 | Per-Group | Per-Tensor | Static | 4-bit weight quantization with 16-bit activation precision, specifically designed for MoE model expert layers, supporting int32 format weight packing |
| `W8A16` | INT8 | FP16/BF16 | Per-Channel | Per-Tensor | Static | 8-bit weight quantization with 16-bit activation precision, balancing accuracy and performance, suitable for linear layers |
| `W8A8` | INT8 | INT8 | Per-Channel | Per-Tensor | Static | Static activation quantization, suitable for scenarios requiring high precision |
| `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
| `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit, then per-group to 4-bit) |
| `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant for activation distribution smoothing before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation |
| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | In the PD colocation scenario, dynamic quantization is used on both the P and D nodes; in the PD disaggregation scenario, the P node uses dynamic quantization and the D node uses static quantization |
**Static vs Dynamic:** Static quantization uses pre-computed scaling factors and gives better performance, while dynamic quantization computes scaling factors on the fly for each token/activation tensor and gives higher precision.
**Granularity:** Refers to the scope of scaling factor computation (e.g., per-tensor, per-channel, per-group).
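As a small hedged illustration of these concepts in plain PyTorch (not the actual kernels used on Ascend NPUs), the scale computations differ as follows:

```python
import torch

x = torch.randn(4, 8)   # activations for 4 tokens, hidden size 8
w = torch.randn(16, 8)  # weight with 16 output channels

# Per-tensor (static): a single scale for the whole activation tensor,
# normally pre-computed offline from calibration data.
per_tensor_scale = x.abs().max() / 127.0

# Per-token (dynamic): one scale per row/token, computed on the fly.
per_token_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0

# Per-channel (weights): one scale per output channel.
per_channel_scale = w.abs().amax(dim=-1, keepdim=True) / 127.0

# Example: dynamic per-token INT8 quantization of the activations.
x_int8 = torch.clamp((x / per_token_scale).round(), -128, 127).to(torch.int8)
```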