### Summary
This PR refactors the `vllm_ascend/quantization` module to improve code
organization, maintainability, and extensibility. The refactoring
introduces a clear separation of concerns with a registry-based scheme
discovery pattern, abstract base classes for quantization schemes, and
dedicated wrapper classes.
### Key Changes
#### 1. **Modular Directory Structure**
| Before | After |
|--------|-------|
| Flat file structure with mixed responsibilities | Organized into a `methods/` subpackage for schemes |
| Single `quant_config.py` (600+ lines) | Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` |
| `utils.py` with scheme lookup logic | `methods/registry.py` with decorator-based registration |
#### 2. **Registry-Based Scheme Discovery**
Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a
decorator-based registry pattern:
```python
# Before: Manual dictionary mapping
ASCEND_QUANTIZATION_METHOD_MAP = {
    "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...},
    ...
}

# After: Decorator-based registration
@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    ...
```
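For reference, a minimal sketch of what the registry in `methods/registry.py` could look like. The `register_scheme` / `get_scheme_class` names come from this PR; the internals shown here are illustrative:

```python
# Illustrative sketch only; the real registry lives in methods/registry.py.
from typing import Callable, Optional

# Maps (quant_type, layer_type) -> scheme class.
_SCHEME_REGISTRY: dict[tuple[str, str], type] = {}


def register_scheme(quant_type: str,
                    layer_type: str) -> Callable[[type], type]:
    """Class decorator that records a scheme at import time."""

    def decorator(cls: type) -> type:
        _SCHEME_REGISTRY[(quant_type, layer_type)] = cls
        return cls

    return decorator


def get_scheme_class(quant_type: str, layer_type: str) -> Optional[type]:
    """Look up a registered scheme class; None when nothing matches."""
    return _SCHEME_REGISTRY.get((quant_type, layer_type))


@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod:
    """Placeholder scheme body for the sketch."""
```

Because registration happens as a side effect of the decorator, importing a scheme module is all it takes to make the scheme discoverable.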
#### 3. **Abstract Base Classes**
Introduced three abstract base classes in `methods/base.py`:
- `AscendLinearScheme` - Base for linear layer quantization
- `AscendMoEScheme` - Base for MoE layer quantization
- `AscendAttentionScheme` - Base for attention layer quantization
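A condensed sketch of the abstract bases (class and method names follow this PR's architecture diagram; the signatures here are illustrative, and `AscendAttentionScheme` is omitted for brevity):

```python
# Illustrative sketch of methods/base.py; signatures are assumptions.
from abc import ABC, abstractmethod
from typing import Any, Optional


class AscendLinearScheme(ABC):
    """Base class for linear-layer quantization schemes."""

    @abstractmethod
    def get_weight(self, *args: Any, **kwargs: Any) -> dict:
        """Describe the quantized weight parameters to allocate."""

    @abstractmethod
    def apply(self, layer: Any, x: Any, bias: Optional[Any] = None) -> Any:
        """Run the quantized forward pass."""


class AscendMoEScheme(ABC):
    """Base class for MoE-layer quantization schemes."""

    @abstractmethod
    def get_weight(self, *args: Any, **kwargs: Any) -> dict:
        ...

    @abstractmethod
    def get_dynamic_quant_param(self, *args: Any, **kwargs: Any) -> dict:
        ...

    @abstractmethod
    def apply(self, *args: Any, **kwargs: Any) -> Any:
        ...
```

Concrete schemes such as `W8A8DynamicLinear` subclass one of these bases and cannot be instantiated until every abstract method is implemented.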
#### 4. **Separated Config and Wrapper Classes**
- **Config classes** (`AscendModelSlimConfig`,
`AscendCompressedTensorsConfig`): Handle config parsing and scheme
selection
- **Wrapper classes** (`AscendLinearMethod`, `AscendFusedMoEMethod`,
etc.): Implement vLLM interfaces and delegate to schemes
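The wrapper/scheme split can be shown with a short sketch. The real `AscendLinearMethod` implements vLLM's linear-method interface; the stripped-down version below only demonstrates the delegation pattern:

```python
# Illustrative only: a wrapper holding a scheme and forwarding calls to it.
from typing import Any, Optional


class AscendLinearMethod:
    """Implements the vLLM-facing interface, delegating to a scheme."""

    def __init__(self, quant_method: Any) -> None:
        self.quant_method = quant_method  # an AscendLinearScheme instance

    def create_weights(self, layer: Any, *args: Any, **kwargs: Any) -> None:
        # Ask the scheme which quantized parameters the layer needs.
        self.quant_method.get_weight(*args, **kwargs)

    def apply(self, layer: Any, x: Any, bias: Optional[Any] = None) -> Any:
        # The forward pass is fully delegated to the scheme.
        return self.quant_method.apply(layer, x, bias)
```

This keeps the vLLM interface surface in one place while each scheme only implements the Ascend-specific math.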
#### 5. **Cleaner Public API**
```python
# New clean module interface
from vllm_ascend.quantization import (
AscendModelSlimConfig,
AscendCompressedTensorsConfig,
)
from vllm_ascend.quantization.methods import get_scheme_class
```
### Architecture Diagram
```mermaid
classDiagram
direction TB
class QuantizationConfig {
<<vLLM Interface>>
+get_quant_method()
}
class AscendModelSlimConfig {
+quant_description
+get_quant_method()
-create_scheme_for_layer()
}
class AscendCompressedTensorsConfig {
+target_scheme_map
+get_quant_method()
-_get_scheme_from_parts()
}
class AscendLinearMethod {
<<Wrapper>>
+quant_method: AscendLinearScheme
+create_weights()
+apply()
}
class AscendFusedMoEMethod {
<<Wrapper>>
+quant_method: AscendMoEScheme
+create_weights()
+apply()
}
class AscendLinearScheme {
<<Abstract>>
+get_weight()*
+apply()*
+get_pertensor_param()
+get_perchannel_param()
}
class AscendMoEScheme {
<<Abstract>>
+get_weight()*
+get_dynamic_quant_param()*
+apply()*
}
class W8A8DynamicLinear {
+get_weight()
+apply()
}
class W8A8DynamicMoE {
+get_weight()
+apply()
}
QuantizationConfig <|-- AscendModelSlimConfig
QuantizationConfig <|-- AscendCompressedTensorsConfig
AscendModelSlimConfig ..> AscendLinearMethod : creates
AscendModelSlimConfig ..> AscendFusedMoEMethod : creates
AscendCompressedTensorsConfig ..> AscendLinearMethod : creates
AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates
AscendLinearMethod o-- AscendLinearScheme : delegates to
AscendFusedMoEMethod o-- AscendMoEScheme : delegates to
AscendLinearScheme <|-- W8A8DynamicLinear
AscendMoEScheme <|-- W8A8DynamicMoE
```
### Scheme Registration Flow
```mermaid
sequenceDiagram
participant Module as Scheme Module
participant Registry as _SCHEME_REGISTRY
participant Config as QuantConfig
participant Wrapper as Wrapper Class
Note over Module: At import time
Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear")
Registry->>Registry: Store (quant_type, layer_type) -> Class
Note over Config: At runtime
Config->>Config: Determine quant_type from description
Config->>Registry: get_scheme_class(quant_type, layer_type)
Registry-->>Config: Return scheme class
Config->>Config: scheme = scheme_cls()
Config->>Wrapper: Create wrapper with scheme
Wrapper-->>Config: Return wrapper instance
```
### File Changes Summary
| Original Files | Refactored Files |
|----------------|------------------|
| `__init__.py` (empty) | `__init__.py` (exports public API) |
| `quant_config.py` | `modelslim_config.py` + `wrappers.py` |
| `compressed_tensors/` | `compressed_tensors_config.py` |
| `utils.py` | `methods/registry.py` |
| `w8a8_dynamic.py` | `methods/w8a8_dynamic.py` |
| `w8a8.py` | `methods/w8a8_static.py` |
| `w4a4_flatquant_dynamic.py` | `methods/w4a4_flatquant.py` |
| ... | `methods/base.py` (new) |
### Benefits
1. **Extensibility**: Adding a new quantization scheme only requires
implementing a base class and applying the `@register_scheme` decorator
2. **Maintainability**: Clear separation between config parsing, wrapper
logic, and scheme implementation
3. **Testability**: Abstract base classes enable easier unit testing and
mocking
4. **Discoverability**: Registry pattern makes it easy to list all
supported schemes
5. **Reduced Coupling**: Config classes no longer need to know about all
scheme implementations
___
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
```python
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
"""LLM-Compressor (compressed_tensors) quantization configuration for Ascend."""

from typing import Any, Optional, Union, cast

import torch
from compressed_tensors.quantization import (QuantizationArgs,
                                             QuantizationStrategy,
                                             QuantizationType)
from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.linear import (LinearBase,
                                               UnquantizedLinearMethod)
from vllm.model_executor.layers.quantization import (
    QUANTIZATION_METHODS, register_quantization_config)
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)
from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
    find_matched_target, is_activation_quantization_format,
    should_ignore_layer)
from vllm.model_executor.models.utils import WeightsMapper

from vllm_ascend.utils import COMPRESSED_TENSORS_METHOD

from .methods import AscendLinearScheme, AscendMoEScheme

logger = init_logger(__name__)


# Remove the original compressed_tensors method to replace it with our
# implementation.
def _remove_quantization_method():
    if COMPRESSED_TENSORS_METHOD in QUANTIZATION_METHODS:
        QUANTIZATION_METHODS.remove(COMPRESSED_TENSORS_METHOD)


_remove_quantization_method()

QUANTIZATION_SCHEME_MAP_TYPE = dict[str, Optional[dict[str,
                                                       "QuantizationArgs"]]]


@register_quantization_config(COMPRESSED_TENSORS_METHOD)
class AscendCompressedTensorsConfig(QuantizationConfig):
    """Config class for LLM-Compressor (compressed_tensors) quantization on Ascend.

    This class adapts the compressed_tensors format to work with Ascend's
    quantization implementations.
    """

    def __init__(
        self,
        target_scheme_map: dict[str, Any],
        ignore: list[str],
        quant_format: str,
        config: Optional[dict[str, Any]] = None,
    ):
        super().__init__()
        self.ignore = ignore
        self.quant_format = quant_format
        # Map from [target -> scheme]
        self.target_scheme_map = target_scheme_map
        self.quant_description = config

    def get_name(self) -> str:
        return "compressed-tensors"

    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.int8, torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        raise NotImplementedError(
            "Ascend hardware does not support \"get_min_capability\" feature.")

    @classmethod
    def get_config_filenames(cls) -> list[str]:
        return []

    def _add_fused_moe_to_target_scheme_map(self):
        """Update target_scheme_map: since linear layers get fused into
        FusedMoE modules, a scheme targeting 'Linear' needs to also match
        FusedMoE.
        """
        if ("Linear" not in self.target_scheme_map
                or "FusedMoE" in self.target_scheme_map):
            return
        self.target_scheme_map["FusedMoE"] = self.target_scheme_map["Linear"]

    @classmethod
    def from_config(cls, config: dict[str,
                                      Any]) -> "AscendCompressedTensorsConfig":
        ignore: list[str] = cast(list[str], config.get("ignore", []))
        quant_format = cast(str, config.get("format"))
        target_scheme_map = cls._quantization_scheme_map_from_config(
            config=config)

        return cls(
            target_scheme_map=target_scheme_map,
            ignore=ignore,
            quant_format=quant_format,
            config=config,
        )

    @classmethod
    def _quantization_scheme_map_from_config(
            cls, config: dict[str, Any]) -> QUANTIZATION_SCHEME_MAP_TYPE:
        """Build the target scheme map from the config.

        :param config: The `quantization_config` dictionary from config.json
        :return: A dictionary mapping target layer names to their
            corresponding quantization args for weights and input activations
        """
        target_scheme_map: dict[str, Any] = dict()
        quant_format = cast(str, config.get("format"))

        config_groups = config.get("config_groups", dict())
        for _, quant_config in config_groups.items():
            targets = quant_config.get("targets")
            for target in targets:
                target_scheme_map[target] = {}
                target_scheme_map[target][
                    "weights"] = QuantizationArgs.model_validate(
                        quant_config.get("weights"))

                target_scheme_map[target]["input_activations"] = None
                target_scheme_map[target]["format"] = quant_config.get(
                    "format")
                format = target_scheme_map[target].get("format")
                # If no per-group format is defined, use the global format
                # from the config.
                act_quant_format = (
                    is_activation_quantization_format(format)
                    if format is not None else
                    is_activation_quantization_format(quant_format))
                input_activations = quant_config.get("input_activations")
                if act_quant_format and input_activations is not None:
                    target_scheme_map[target]["input_activations"] = (
                        QuantizationArgs.model_validate(
                            quant_config.get("input_activations")))
        return target_scheme_map

    def get_quant_method(
        self,
        layer: torch.nn.Module,
        prefix: str,
    ) -> Optional["QuantizeMethodBase"]:
        from .method_adapters import AscendFusedMoEMethod, AscendLinearMethod

        if isinstance(layer, LinearBase):
            layer.ascend_quant_method = COMPRESSED_TENSORS_METHOD
            # Get the scheme for this layer
            linear_scheme = self._get_linear_scheme(layer=layer,
                                                    layer_name=prefix)

            # Return the unquantized method if no scheme was found
            if linear_scheme is None:
                return UnquantizedLinearMethod()

            # Store the scheme on the layer for reference (optional, for
            # debugging)
            layer.scheme = linear_scheme
            logger.info_once(
                "Using the vLLM Ascend llmcompressor Quantization now!")
            return AscendLinearMethod(linear_scheme)

        if isinstance(layer, FusedMoE):
            # Delayed import to avoid circular import
            from vllm_ascend.ops.fused_moe.fused_moe import \
                AscendUnquantizedFusedMoEMethod

            layer.ascend_quant_method = COMPRESSED_TENSORS_METHOD
            # Get the scheme for this layer
            moe_scheme = self._get_moe_scheme(layer=layer, layer_name=prefix)

            # Return the unquantized method if no scheme was found
            if moe_scheme is None:
                return AscendUnquantizedFusedMoEMethod(layer.moe_config)

            # Store the scheme on the layer for reference (optional, for
            # debugging)
            layer.scheme = moe_scheme
            logger.info_once(
                "Using the vLLM Ascend llmcompressor Quantization now!")
            return AscendFusedMoEMethod(moe_scheme, layer.moe_config)

        return None

    def _get_linear_scheme(
            self,
            layer: torch.nn.Module,
            layer_name: Optional[str] = None) -> Optional[AscendLinearScheme]:
        """Get the linear quantization scheme for a layer.

        Returns:
            An AscendLinearScheme instance, or None if the layer
            should use the unquantized method.
        """
        weight_quant, input_quant, format = self._get_quant_args(
            layer, layer_name)
        if weight_quant is None:
            return None

        scheme = self._create_scheme_for_layer_type(
            weight_quant=weight_quant,
            input_quant=input_quant,
            format=format,
            layer_type="linear",
        )
        return cast(AscendLinearScheme, scheme)

    def _get_moe_scheme(
            self,
            layer: torch.nn.Module,
            layer_name: Optional[str] = None) -> Optional[AscendMoEScheme]:
        """Get the MoE quantization scheme for a layer.

        Returns:
            An AscendMoEScheme instance, or None if the layer
            should use the unquantized method.
        """
        # Add FusedMoE to the target scheme map if needed
        self._add_fused_moe_to_target_scheme_map()

        weight_quant, input_quant, format = self._get_quant_args(
            layer, layer_name)
        if weight_quant is None:
            return None

        scheme = self._create_scheme_for_layer_type(
            weight_quant=weight_quant,
            input_quant=input_quant,
            format=format,
            layer_type="moe",
        )
        return cast(AscendMoEScheme, scheme)

    def _get_quant_args(
        self,
        layer: torch.nn.Module,
        layer_name: Optional[str] = None
    ) -> tuple[Optional["QuantizationArgs"], Optional["QuantizationArgs"],
               Optional[str]]:
        """Extract quantization arguments for a layer.

        compressed-tensors supports non-uniform quantization in the
        following way:

        targets of config_groups: There can be N config_groups, each with
        its own quantization scheme. Each config_group has a list of
        targets, which can be a full layer_name, a regex for a layer_name,
        or an nn.Module name.

        Detect whether a layer_name is found in any target and
        use the quantization scheme corresponding to the matched target.

        Returns:
            A tuple of (weight_quant, input_quant, format). weight_quant is
            None if the layer should use the unquantized method.
        """
        scheme_dict = self.get_scheme_dict(layer, layer_name)
        weight_quant = None
        input_quant = None
        format = None
        if scheme_dict:
            weight_quant = scheme_dict.get("weights")
            input_quant = scheme_dict.get("input_activations")
            format = scheme_dict.get("format")

        if weight_quant is None:
            logger.warning_once("Acceleration for non-quantized schemes is "
                                "not supported by Compressed Tensors. "
                                "Falling back to UnquantizedLinearMethod")

        return weight_quant, input_quant, format

    def get_scheme_dict(
        self,
        layer: torch.nn.Module,
        layer_name: str | None = None
    ) -> dict[str, QuantizationArgs | str | None] | None:
        """Extract the QuantizationArgs for a given layer.

        Returns:
            dict with {
                "weights": QuantizationArgs,
                "input_activations": QuantizationArgs | None,
                "format": str | None
            } | None
        """
        if should_ignore_layer(layer_name,
                               ignore=self.ignore,
                               fused_mapping=self.packed_modules_mapping):
            return None

        if self.target_scheme_map:
            matched_target = find_matched_target(
                layer_name=layer_name,
                module=layer,
                targets=self.target_scheme_map.keys(),
                fused_mapping=self.packed_modules_mapping,
            )
            scheme_dict = self.target_scheme_map[matched_target]
            if scheme_dict.get("format") is None:
                scheme_dict["format"] = self.quant_format
            return scheme_dict

        return None

    def _create_scheme_for_layer_type(
        self,
        weight_quant: "QuantizationArgs",
        input_quant: Optional["QuantizationArgs"],
        format: Optional[str],
        layer_type: str,
    ) -> Union[AscendLinearScheme, AscendMoEScheme]:
        """Create the appropriate Ascend scheme based on quantization args
        and layer type.

        Args:
            weight_quant: Weight quantization arguments.
            input_quant: Input activation quantization arguments.
            format: Per-layer format, if defined.
            layer_type: Type of layer ("linear" or "moe").

        Returns:
            An instance of the appropriate Ascend quantization scheme.
        """
        from .methods import get_scheme_class

        # Determine the quantization type
        quant_type = self._detect_quant_type(weight_quant, input_quant,
                                             format)

        # Get the scheme class from the registry
        scheme_cls = get_scheme_class(quant_type, layer_type)
        if scheme_cls is None:
            raise NotImplementedError(
                f"No compressed-tensors compatible scheme was found for "
                f"quant_type={quant_type}, layer_type={layer_type}.")

        return scheme_cls()

    def _detect_quant_type(
        self,
        weight_quant: "QuantizationArgs",
        input_quant: Optional["QuantizationArgs"],
        format: Optional[str],
    ) -> str:
        """Detect the quantization type from quantization arguments.

        Args:
            weight_quant: Weight quantization arguments.
            input_quant: Input activation quantization arguments.
            format: Per-layer format, if defined.

        Returns:
            A string representing the quantization type (e.g., "W8A8",
            "W8A8_DYNAMIC").
        """
        # Use the per-layer format if defined; otherwise, use the global
        # format.
        format = format if format is not None else self.quant_format
        act_quant_format = is_activation_quantization_format(format)

        if act_quant_format and input_quant is not None:
            if self._is_static_tensor_w8a8(weight_quant, input_quant):
                return "W8A8"

            if self._is_dynamic_token_w8a8(weight_quant, input_quant):
                return "W8A8_DYNAMIC"

        if self._is_w4a16(weight_quant, input_quant):
            return "W4A16"

        raise NotImplementedError(
            "No compressed-tensors compatible quantization type was found.")

    def _is_static_tensor_w8a8(self, weight_quant: "QuantizationArgs",
                               input_quant: "QuantizationArgs") -> bool:
        is_8_bits = weight_quant.num_bits == input_quant.num_bits == 8
        weight_strategy = (
            weight_quant.strategy == QuantizationStrategy.CHANNEL.value)
        is_tensor = (weight_strategy and input_quant.strategy
                     == QuantizationStrategy.TENSOR.value)
        is_static = not weight_quant.dynamic and not input_quant.dynamic
        is_symmetric = weight_quant.symmetric and input_quant.symmetric

        # Only symmetric weight and input quantization are supported.
        return is_8_bits and is_tensor and is_symmetric and is_static

    def _is_dynamic_token_w8a8(self, weight_quant: "QuantizationArgs",
                               input_quant: "QuantizationArgs") -> bool:
        is_8_bits = weight_quant.num_bits == input_quant.num_bits == 8
        weight_strategy = (
            weight_quant.strategy == QuantizationStrategy.CHANNEL.value)
        is_token = (weight_strategy and input_quant.strategy
                    == QuantizationStrategy.TOKEN.value)
        is_dynamic = not weight_quant.dynamic and input_quant.dynamic
        is_symmetric = weight_quant.symmetric and input_quant.symmetric

        # Only symmetric weight and input quantization are supported.
        return is_8_bits and is_token and is_symmetric and is_dynamic

    def _is_w4a16(self, weight_quant: "QuantizationArgs",
                  input_quant: Optional["QuantizationArgs"]) -> bool:
        # Confirm weights are quantized.
        if weight_quant is None:
            return False

        # Confirm we have an integer type.
        if weight_quant.type != QuantizationType.INT:
            return False

        input_quant_none = input_quant is None
        is_4_bits = weight_quant.num_bits == 4
        is_group = (weight_quant.strategy == QuantizationStrategy.GROUP.value)
        is_static = not weight_quant.dynamic

        return input_quant_none and is_4_bits and is_group and is_static

    def apply_vllm_mapper(self, hf_to_vllm_mapper: "WeightsMapper"):
        self.target_scheme_map = hf_to_vllm_mapper.apply_dict(
            self.target_scheme_map)
        self.ignore = hf_to_vllm_mapper.apply_list(self.ignore)
```