[Refactor] Quantization Module Refactor (#5738)

### Summary

This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes.

### Key Changes

#### 1. **Modular Directory Structure**

| Before | After |
|--------|-------|
| Flat file structure with mixed responsibilities | Organized into `methods/` subpackage for schemes |
| Single `quant_config.py` (600+ lines) | Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` |
| `utils.py` with scheme lookup logic | `methods/registry.py` with decorator-based registration |

#### 2. **Registry-Based Scheme Discovery**

Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern:

```python
# Before: Manual dictionary mapping
ASCEND_QUANTIZATION_METHOD_MAP = {
    "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...},
    ...
}

# After: Decorator-based registration
@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    ...
```

#### 3. **Abstract Base Classes**

Introduced three abstract base classes in `methods/base.py`:

- `AscendLinearScheme` - Base for linear layer quantization
- `AscendMoEScheme` - Base for MoE layer quantization
- `AscendAttentionScheme` - Base for attention layer quantization

#### 4. **Separated Config and Wrapper Classes**

- **Config classes** (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection
- **Wrapper classes** (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes

#### 5. **Cleaner Public API**

```python
# New clean module interface
from vllm_ascend.quantization import (
    AscendModelSlimConfig,
    AscendCompressedTensorsConfig,
)
from vllm_ascend.quantization.methods import get_scheme_class
```

### Architecture Diagram

```mermaid
classDiagram
    direction TB

    class QuantizationConfig {
        <<vLLM Interface>>
        +get_quant_method()
    }
    class AscendModelSlimConfig {
        +quant_description
        +get_quant_method()
        -create_scheme_for_layer()
    }
    class AscendCompressedTensorsConfig {
        +target_scheme_map
        +get_quant_method()
        -_get_scheme_from_parts()
    }
    class AscendLinearMethod {
        <<Wrapper>>
        +quant_method: AscendLinearScheme
        +create_weights()
        +apply()
    }
    class AscendFusedMoEMethod {
        <<Wrapper>>
        +quant_method: AscendMoEScheme
        +create_weights()
        +apply()
    }
    class AscendLinearScheme {
        <<Abstract>>
        +get_weight()*
        +apply()*
        +get_pertensor_param()
        +get_perchannel_param()
    }
    class AscendMoEScheme {
        <<Abstract>>
        +get_weight()*
        +get_dynamic_quant_param()*
        +apply()*
    }
    class W8A8DynamicLinear {
        +get_weight()
        +apply()
    }
    class W8A8DynamicMoE {
        +get_weight()
        +apply()
    }

    QuantizationConfig <|-- AscendModelSlimConfig
    QuantizationConfig <|-- AscendCompressedTensorsConfig
    AscendModelSlimConfig ..> AscendLinearMethod : creates
    AscendModelSlimConfig ..> AscendFusedMoEMethod : creates
    AscendCompressedTensorsConfig ..> AscendLinearMethod : creates
    AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates
    AscendLinearMethod o-- AscendLinearScheme : delegates to
    AscendFusedMoEMethod o-- AscendMoEScheme : delegates to
    AscendLinearScheme <|-- W8A8DynamicLinear
    AscendMoEScheme <|-- W8A8DynamicMoE
```

### Scheme Registration Flow

```mermaid
sequenceDiagram
    participant Module as Scheme Module
    participant Registry as _SCHEME_REGISTRY
    participant Config as QuantConfig
    participant Wrapper as Wrapper Class

    Note over Module: At import time
    Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear")
    Registry->>Registry: Store (quant_type, layer_type) -> Class

    Note over Config: At runtime
    Config->>Config: Determine quant_type from description
    Config->>Registry: get_scheme_class(quant_type, layer_type)
    Registry-->>Config: Return scheme class
    Config->>Config: scheme = scheme_cls()
    Config->>Wrapper: Create wrapper with scheme
    Wrapper-->>Config: Return wrapper instance
```

### File Changes Summary

| Original Files | Refactored Files |
|----------------|------------------|
| `__init__.py` (empty) | `__init__.py` (exports public API) |
| `quant_config.py` | `modelslim_config.py` + `wrappers.py` |
| `compressed_tensors/` | `compressed_tensors_config.py` |
| `utils.py` | `methods/registry.py` |
| `w8a8_dynamic.py` | `methods/w8a8_dynamic.py` |
| `w8a8.py` | `methods/w8a8_static.py` |
| `w4a4_flatquant_dynamic.py` | `methods/w4a4_flatquant.py` |
| ... | `methods/base.py` (new) |

### Benefits

1. **Extensibility**: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator
2. **Maintainability**: Clear separation between config parsing, wrapper logic, and scheme implementation
3. **Testability**: Abstract base classes enable easier unit testing and mocking
4. **Discoverability**: Registry pattern makes it easy to list all supported schemes
5. **Reduced Coupling**: Config classes no longer need to know about all scheme implementations

___

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
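
For readers unfamiliar with the pattern, the registration flow in this commit message (register at import time, look up by `(quant_type, layer_type)` at runtime, hand the scheme instance to a wrapper) can be sketched in plain Python. This is a minimal illustration only, not the code in `methods/registry.py`: the names `DemoLinearScheme`, `LinearSchemeBase`, `LinearWrapper`, and the `"DEMO_W8A8"` ID are hypothetical, and raising `NotImplementedError` on an unknown key is an assumption modelled on the removed `test_utils.py` tests.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Tuple, Type

# Hypothetical stand-in for the registry; maps (quant_type, layer_type) -> class.
_SCHEME_REGISTRY: Dict[Tuple[str, str], Type] = {}


def register_scheme(quant_type: str, layer_type: str) -> Callable[[Type], Type]:
    """Class decorator that stores the decorated class under its key."""

    def decorator(cls: Type) -> Type:
        key = (quant_type, layer_type)
        if key in _SCHEME_REGISTRY:
            raise ValueError(f"scheme already registered for {key}")
        _SCHEME_REGISTRY[key] = cls
        return cls  # return the class unchanged so it stays importable

    return decorator


def get_scheme_class(quant_type: str, layer_type: str) -> Type:
    """Look up a scheme class; the error type here is an assumption."""
    try:
        return _SCHEME_REGISTRY[(quant_type, layer_type)]
    except KeyError:
        raise NotImplementedError(
            f"no scheme for quant_type={quant_type!r}, layer_type={layer_type!r}"
        ) from None


class LinearSchemeBase(ABC):
    """Hypothetical abstract base, loosely mirroring the role of AscendLinearScheme."""

    @abstractmethod
    def get_weight(self, input_size: int, output_size: int) -> dict:
        ...

    @abstractmethod
    def apply(self, x):
        ...


@register_scheme("DEMO_W8A8", "linear")  # runs when this module is imported
class DemoLinearScheme(LinearSchemeBase):
    def get_weight(self, input_size: int, output_size: int) -> dict:
        # Only describes the weight shape; a real scheme would allocate tensors.
        return {"weight": (output_size, input_size)}

    def apply(self, x):
        # Identity placeholder for the quantized matmul.
        return x


class LinearWrapper:
    """Hypothetical wrapper that delegates to the scheme it was given."""

    def __init__(self, scheme: LinearSchemeBase):
        self.quant_method = scheme

    def apply(self, x):
        return self.quant_method.apply(x)


# Runtime side: resolve the class, instantiate it, wrap it.
scheme_cls = get_scheme_class("DEMO_W8A8", "linear")
wrapper = LinearWrapper(scheme_cls())
result = wrapper.apply([1, 2, 3])
```

Because the config side only ever calls `get_scheme_class`, a new scheme module can be added without editing the config classes, which is the reduced-coupling benefit the summary lists.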
2026-01-23 14:13:47 +08:00
parent 8378bc28b0
commit a69ef10c3a
36 changed files with 2044 additions and 1524 deletions
--- a/docs/source/developer_guide/feature_guide/quantization.md
+++ b/docs/source/developer_guide/feature_guide/quantization.md
@@ -10,7 +10,7 @@ The current process for registering and obtaining quantization methods in vLLM A
 ![get_quant_method](../../assets/quantization/get_quant_method.png)
-vLLM Ascend registers a custom ascend quantization method. By configuring the `--quantization ascend` parameter (or `quantization="ascend"` for offline), the quantization feature is enabled. When constructing the `quant_config`, the registered `AscendQuantConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, stored in the `quant_method` attribute.
+vLLM Ascend registers a custom ascend quantization method. By configuring the `--quantization ascend` parameter (or `quantization="ascend"` for offline), the quantization feature is enabled. When constructing the `quant_config`, the registered `AscendModelSlimConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, stored in the `quant_method` attribute.
 Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:
@@ -51,18 +51,21 @@ Based on the above content, we present a brief description of the adaptation pro
 ### Quantization Algorithm Adaptation
 - **Step 1: Algorithm Design**. Define the algorithm ID (e.g., `W4A8_DYNAMIC`), determine supported layers (linear, moe, attention), and design the quantization scheme (static/dynamic, pertensor/perchannel/pergroup).
- **Step 2: Registration**. Add the algorithm ID to `ASCEND_QUANTIZATION_METHOD_MAP` in `vllm_ascend/quantization/utils.py` and associate it with the corresponding method class.
+- **Step 2: Registration**. Use the `@register_scheme` decorator in `vllm_ascend/quantization/methods/registry.py` to register your quantization scheme class.
 ```python
-ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
+from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme
-    "W4A8_DYNAMIC": {
+
-        "linear": AscendW4A8DynamicLinearMethod,
+@register_scheme("W4A8_DYNAMIC", "linear")
-        "moe": AscendW4A8DynamicFusedMoEMethod,
+class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
-    },
+    ...
-}
+
+@register_scheme("W4A8_DYNAMIC", "moe")
+class AscendW4A8DynamicFusedMoEMethod(AscendMoEScheme):
+    ...
 ```
- **Step 3: Implementation**. Create an algorithm implementation file, such as `vllm_ascend/quantization/w4a8_dynamic.py`, and implement the method class and logic.
+- **Step 3: Implementation**. Create an algorithm implementation file, such as `vllm_ascend/quantization/methods/w4a8.py`, and implement the method class and logic.
 - **Step 4: Testing**. Use your algorithm to generate quantization configurations and verify correctness and performance on target models and hardware.
 ### Quantized Model Adaptation
@@ -70,7 +73,7 @@ ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
 Adapting a new quantized model requires ensuring the following three points:
 - The original model has been successfully adapted in `vLLM Ascend`.
- **Fused Module Mapping**: Add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/quant_config.py` (e.g., `qkv_proj`, `gate_up_proj`, `experts`) to ensure sharding consistency and correct loading.
+- **Fused Module Mapping**: Add the model's `model_type` to `packed_modules_model_mapping` in `vllm_ascend/quantization/modelslim_config.py` (e.g., `qkv_proj`, `gate_up_proj`, `experts`) to ensure sharding consistency and correct loading.
 ```python
 packed_modules_model_mapping = {
--- a/tests/ut/ops/test_fused_moe.py
+++ b/tests/ut/ops/test_fused_moe.py
@@ -20,8 +20,6 @@ import torch
 import torch.nn as nn
 import torch_npu
 from pytest_mock import MockerFixture
 from vllm.model_executor.layers.fused_moe import FusedMoEMethodBase
 from tests.ut.base import TestBase
 from vllm_ascend.ascend_forward_context import MoECommType
 from vllm_ascend.ops.fused_moe.experts_selector import select_experts
@@ -233,25 +231,6 @@ class MockQuantMethod(nn.Module):
            self.apply = MagicMock(return_value=(torch.randn(num_tokens, 32)))
 class MockFusedMoEMethod(FusedMoEMethodBase):
    moe = MagicMock()
    def __init__(self):
        super().__init__(self.moe)
    def create_weights(self, layer: torch.nn.Module, num_experts: int,
                       hidden_size: int, intermediate_size_per_partition: int,
                       params_dtype: torch.dtype, **extra_weight_attrs):
        pass
    def apply(self, hidden_states: torch.Tensor,
              expert_weights: torch.Tensor) -> torch.Tensor:
        pass
    def get_fused_moe_quant_config(self, layer: torch.nn.Module):
        pass
 class TestExpertsSelector:
    @pytest.mark.parametrize("global_num_experts", [256, 128])
--- a/tests/ut/quantization/test_modelslim_config.py
+++ b/tests/ut/quantization/test_modelslim_config.py
@@ -7,11 +7,11 @@ from vllm.model_executor.layers.linear import LinearBase
 from tests.ut.base import TestBase
 from vllm_ascend.ops.linear import AscendUnquantizedLinearMethod
-from vllm_ascend.quantization.quant_config import AscendQuantConfig
+from vllm_ascend.quantization.modelslim_config import AscendModelSlimConfig
 from vllm_ascend.utils import ASCEND_QUANTIZATION_METHOD
-class TestAscendQuantConfig(TestBase):
+class TestAscendModelSlimConfig(TestBase):
    def setUp(self):
        self.sample_config = {
@@ -25,7 +25,7 @@ class TestAscendQuantConfig(TestBase):
            "shard1.weight": "FLOAT",
            "shard2.weight": "FLOAT",
        }
-        self.ascend_config = AscendQuantConfig(self.sample_config)
+        self.ascend_config = AscendModelSlimConfig(self.sample_config)
        self.ascend_config.packed_modules_mapping = None
    def test_init(self):
@@ -34,55 +34,55 @@ class TestAscendQuantConfig(TestBase):
    def test_repr(self):
        repr_str = repr(self.ascend_config)
-        self.assertTrue(repr_str.startswith("AscendQuantConfig:\n"))
+        self.assertTrue(repr_str.startswith("AscendModelSlimConfig:\n"))
    def test_get_name(self):
-        self.assertEqual(AscendQuantConfig.get_name(),
+        self.assertEqual(AscendModelSlimConfig.get_name(),
                         ASCEND_QUANTIZATION_METHOD)
    def test_get_supported_act_dtypes(self):
-        supported_dtypes = AscendQuantConfig.get_supported_act_dtypes()
+        supported_dtypes = AscendModelSlimConfig.get_supported_act_dtypes()
        self.assertEqual(len(supported_dtypes), 3)
    def test_get_min_capability(self):
        with self.assertRaises(NotImplementedError):
-            AscendQuantConfig.get_min_capability()
+            AscendModelSlimConfig.get_min_capability()
    def test_get_config_filenames(self):
-        filenames = AscendQuantConfig.get_config_filenames()
+        filenames = AscendModelSlimConfig.get_config_filenames()
        self.assertEqual(filenames, ["quant_model_description.json"])
    def test_from_config(self):
-        config = AscendQuantConfig.from_config(self.sample_config)
+        config = AscendModelSlimConfig.from_config(self.sample_config)
-        self.assertIsInstance(config, AscendQuantConfig)
+        self.assertIsInstance(config, AscendModelSlimConfig)
        self.assertEqual(config.quant_description, self.sample_config)
    @patch('torch.npu.is_available')
    def test_override_quantization_method(self, mock_is_available):
        # Test when NPU is available
        mock_is_available.return_value = True
-        result = AscendQuantConfig.override_quantization_method(None, None)
+        result = AscendModelSlimConfig.override_quantization_method(None, None)
        self.assertIsNone(result)
        hf_quant_cfg = {"quant_method": ""}
-        result = AscendQuantConfig.override_quantization_method(
+        result = AscendModelSlimConfig.override_quantization_method(
            hf_quant_cfg, None)
        self.assertEqual(result, "ascend")
        # Test when NPU is not available
        mock_is_available.return_value = False
-        result = AscendQuantConfig.override_quantization_method(None, None)
+        result = AscendModelSlimConfig.override_quantization_method(None, None)
        self.assertIsNone(result)
        hf_quant_cfg = {"quant_method": ""}
-        result = AscendQuantConfig.override_quantization_method(
+        result = AscendModelSlimConfig.override_quantization_method(
            hf_quant_cfg, None)
        self.assertIsNone(result)
    def test_get_quant_method_for_linear(self):
        mock_config = MagicMock()
-        mock_config.model_config.hf_text_config.model_type = None
+        mock_config.model_config.hf_config.model_type = None
        linear_layer = MagicMock(spec=LinearBase)
        # Test skipped layer
-        with patch("vllm_ascend.quantization.quant_config.get_current_vllm_config", return_value=mock_config), \
+        with patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
            patch.object(self.ascend_config, \
                          'is_layer_skipped_ascend',
                          return_value=True):
@@ -90,22 +90,24 @@ class TestAscendQuantConfig(TestBase):
            self.assertIsInstance(method, AscendUnquantizedLinearMethod)
        # Test quantized layer
        mock_scheme = MagicMock()
        with patch.object(self.ascend_config, 'is_layer_skipped_ascend', return_value=False), \
-            patch("vllm_ascend.quantization.quant_config.get_current_vllm_config", return_value=mock_config), \
+            patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
-            patch('vllm_ascend.quantization.quant_config.AscendLinearMethod', return_value=MagicMock()) as mock_ascend_linear:
+            patch("vllm_ascend.quantization.modelslim_config.create_scheme_for_layer", return_value=mock_scheme), \
            patch('vllm_ascend.quantization.method_adapters.AscendLinearMethod', return_value=MagicMock()) as mock_ascend_linear:
            method = self.ascend_config.get_quant_method(linear_layer, ".attn")
            self.assertIs(method, mock_ascend_linear.return_value)
-            mock_ascend_linear.assert_called_once_with(
+            mock_ascend_linear.assert_called_once_with(mock_scheme)
                self.ascend_config, ".attn",
                self.ascend_config.packed_modules_mapping, linear_layer)
    def test_get_quant_method_for_attention(self):
        attention_layer = MagicMock(spec=Attention)
        mock_config = MagicMock()
-        mock_config.model_config.hf_text_config.model_type = None
+        mock_config.model_config.hf_config.model_type = None
-        with patch("vllm_ascend.quantization.quant_config.get_current_vllm_config", return_value=mock_config), \
+        mock_scheme = MagicMock()
-            patch('vllm_ascend.quantization.quant_config.AscendKVCacheMethod', \
+        with patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
            patch("vllm_ascend.quantization.modelslim_config.create_scheme_for_layer", return_value=mock_scheme), \
            patch('vllm_ascend.quantization.method_adapters.AscendKVCacheMethod', \
                   return_value=MagicMock()) as mock_ascend_kvcache:
            # Test with fa_quant_type
            method = self.ascend_config.get_quant_method(
@@ -117,20 +119,22 @@ class TestAscendQuantConfig(TestBase):
        fused_moe_layer.moe = MagicMock(spec=FusedMoEConfig)
        fused_moe_layer.moe_config = MagicMock(spec=FusedMoEConfig)
        mock_config = MagicMock()
-        mock_config.model_config.hf_text_config.model_type = None
+        mock_config.model_config.hf_config.model_type = None
        # Test skipped layer
        with patch.object(self.ascend_config, 'is_layer_skipped_ascend', return_value=True), \
-            patch("vllm_ascend.quantization.quant_config.get_current_vllm_config", return_value=mock_config), \
+            patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
-            patch('vllm_ascend.quantization.quant_config.AscendUnquantizedFusedMoEMethod', return_value=MagicMock()) as mock_ascend_moe:
+            patch('vllm_ascend.ops.fused_moe.fused_moe.AscendUnquantizedFusedMoEMethod', return_value=MagicMock()) as mock_ascend_moe:
            method = self.ascend_config.get_quant_method(
                fused_moe_layer, "moe_layer")
            self.assertIs(method, mock_ascend_moe.return_value)
        # Test quantized layer
        mock_scheme = MagicMock()
        with patch.object(self.ascend_config, 'is_layer_skipped_ascend', return_value=False), \
-            patch("vllm_ascend.quantization.quant_config.get_current_vllm_config", return_value=mock_config), \
+            patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
-            patch('vllm_ascend.quantization.quant_config.AscendFusedMoEMethod', return_value=MagicMock()) as mock_ascend_moe:
+            patch("vllm_ascend.quantization.modelslim_config.create_scheme_for_layer", return_value=mock_scheme), \
            patch('vllm_ascend.quantization.method_adapters.AscendFusedMoEMethod', return_value=MagicMock()) as mock_ascend_moe:
            method = self.ascend_config.get_quant_method(
                fused_moe_layer, "moe_layer")
            self.assertIs(method, mock_ascend_moe.return_value)
@@ -150,7 +154,7 @@ class TestAscendQuantConfig(TestBase):
        # Test inconsistent fused layer shards
        bad_config = {"shard1.weight": "FLOAT", "shard2.weight": "INT8"}
-        config = AscendQuantConfig(bad_config)
+        config = AscendModelSlimConfig(bad_config)
        with self.assertRaises(ValueError):
            config.is_layer_skipped_ascend("fused_layer", fused_mapping)
--- a/tests/ut/quantization/test_utils.py
+++ b/tests/ut/quantization/test_utils.py
@@ -1,50 +0,0 @@
 import types
 from tests.ut.base import TestBase
 from vllm_ascend.quantization.utils import (ASCEND_QUANTIZATION_METHOD_MAP,
                                            get_quant_method)
 class TestGetQuantMethod(TestBase):
    def setUp(self):
        self.original_quantization_method_map = ASCEND_QUANTIZATION_METHOD_MAP.copy(
        )
        for quant_type, layer_map in ASCEND_QUANTIZATION_METHOD_MAP.items():
            for layer_type in layer_map.keys():
                ASCEND_QUANTIZATION_METHOD_MAP[quant_type][
                    layer_type] = types.new_class(f"{quant_type}_{layer_type}")
    def tearDown(self):
        # Restore original map
        ASCEND_QUANTIZATION_METHOD_MAP.clear()
        ASCEND_QUANTIZATION_METHOD_MAP.update(
            self.original_quantization_method_map)
    def test_linear_quant_methods(self):
        for quant_type, layer_map in ASCEND_QUANTIZATION_METHOD_MAP.items():
            if "linear" in layer_map.keys():
                prefix = "linear_layer"
                cls = layer_map["linear"]
                method = get_quant_method({"linear_layer.weight": quant_type},
                                          prefix, "linear")
                self.assertIsInstance(method, cls)
    def test_moe_quant_methods(self):
        for quant_type, layer_map in ASCEND_QUANTIZATION_METHOD_MAP.items():
            if "moe" in layer_map.keys():
                prefix = "layer"
                cls = layer_map["moe"]
                method = get_quant_method({"layer.weight": quant_type}, prefix,
                                          "moe")
                self.assertIsInstance(method, cls)
    def test_invalid_layer_type(self):
        quant_description = {"linear_layer.weight": "W8A8"}
        with self.assertRaises(NotImplementedError):
            get_quant_method(quant_description, "linear_layer", "unsupported")
    def test_invalid_quant_type(self):
        quant_description = {"linear_layer.weight": "UNKNOWN"}
        with self.assertRaises(NotImplementedError):
            get_quant_method(quant_description, "linear_layer", "linear")
--- a/tests/ut/quantization/test_w4a16.py
+++ b/tests/ut/quantization/test_w4a16.py
@@ -3,8 +3,9 @@ from unittest.mock import Mock, patch
 import torch
 from tests.ut.base import TestBase
-from vllm_ascend.quantization.w4a16 import (AscendW4A16FusedMoEMethod,
+from vllm_ascend.quantization.methods.w4a16 import (AscendW4A16FusedMoEMethod,
-                                            pack_to_int32, unpack_from_int32)
+                                                    pack_to_int32,
                                                    unpack_from_int32)
 class TestUnpackFromInt32(TestBase):
@@ -42,7 +43,7 @@ class TestUnpackFromInt32(TestBase):
 class TestPackToInt32(TestBase):
    @patch(
-        "vllm_ascend.quantization.w4a16.torch_npu.npu_convert_weight_to_int4pack"
+        "vllm_ascend.quantization.methods.w4a16.torch_npu.npu_convert_weight_to_int4pack"
    )
    def test_pack_to_int32_int8(self, mock_npu_convert_weight_to_int4pack):
        mock_npu_convert_weight_to_int4pack.return_value = torch.zeros(
@@ -57,7 +58,7 @@ class TestPackToInt32(TestBase):
        self.assertEqual(result.shape, torch.Size([2, 8, 4]))
    @patch(
-        "vllm_ascend.quantization.w4a16.torch_npu.npu_convert_weight_to_int4pack"
+        "vllm_ascend.quantization.methods.w4a16.torch_npu.npu_convert_weight_to_int4pack"
    )
    def test_pack_to_int32_int32(self, mock_npu_convert_weight_to_int4pack):
@@ -97,8 +98,8 @@ class TestAscendW4A16FusedMoEMethod(TestBase):
    output_size = 128
    group_size = 32
-    @patch("vllm_ascend.quantization.w4a16.get_ascend_config")
+    @patch("vllm_ascend.quantization.methods.w4a16.get_ascend_config")
-    @patch("vllm_ascend.quantization.w4a16.get_current_vllm_config")
+    @patch("vllm_ascend.quantization.methods.w4a16.get_current_vllm_config")
    def setUp(self, mock_get_current_vllm_config, mock_get_ascend_config):
        mock_ascend_config = Mock()
        mock_ascend_config.eplb_config.dynamic_eplb = False
@@ -218,7 +219,7 @@ class TestAscendW4A16FusedMoEMethod(TestBase):
        return layer
    @patch(
-        "vllm_ascend.quantization.w4a16.torch_npu.npu_convert_weight_to_int4pack"
+        "vllm_ascend.quantization.methods.w4a16.torch_npu.npu_convert_weight_to_int4pack"
    )
    def test_process_weights_after_loading_with_transpose(
            self, mock_npu_convert_weight_to_int4pack):
--- a/tests/ut/quantization/test_w4a4_flatquant_dynamic.py
+++ b/tests/ut/quantization/test_w4a4_flatquant_dynamic.py
@@ -4,7 +4,7 @@ from unittest.mock import MagicMock, patch
 import torch
 import torch.nn as nn
-from vllm_ascend.quantization.w4a4_flatquant_dynamic import (
+from vllm_ascend.quantization.methods.w4a4_flatquant import (
    AscendW4A4FlatQuantDynamicLinearMethod, get_decompose_dim,
    pack_int4_weights)
@@ -33,7 +33,7 @@ class TestW4A4FlatQuantDynamic(unittest.TestCase):
        self.assertEqual(get_decompose_dim(100), (10, 10))
        self.assertEqual(get_decompose_dim(99), (9, 11))
-    @patch('vllm_ascend.quantization.w4a4_flatquant_dynamic.torch_npu')
+    @patch('vllm_ascend.quantization.methods.w4a4_flatquant.torch_npu')
    def test_pack_int4_weights_npu_success(self, mock_torch_npu):
        """
        Tests weight packing using the mocked NPU kernel.
@@ -119,7 +119,7 @@ class TestW4A4FlatQuantDynamic(unittest.TestCase):
        x = torch.randn(batch_size, self.input_size, dtype=self.params_dtype)
        return layer, x, m, n
-    @patch('vllm_ascend.quantization.w4a4_flatquant_dynamic.torch_npu')
+    @patch('vllm_ascend.quantization.methods.w4a4_flatquant.torch_npu')
    def test_apply_small_batch(self, mock_torch_npu):
        """Tests the apply method with a batch size smaller than MAX_BATCH_SIZE."""
        batch_size = 128
@@ -143,9 +143,9 @@ class TestW4A4FlatQuantDynamic(unittest.TestCase):
        self.assertEqual(output.shape, (batch_size, self.output_size))
    @patch(
-        'vllm_ascend.quantization.w4a4_flatquant_dynamic.KRONECKER_QUANT_MAX_BATCH_SIZE',
+        'vllm_ascend.quantization.methods.w4a4_flatquant.KRONECKER_QUANT_MAX_BATCH_SIZE',
        10)
-    @patch('vllm_ascend.quantization.w4a4_flatquant_dynamic.torch_npu')
+    @patch('vllm_ascend.quantization.methods.w4a4_flatquant.torch_npu')
    def test_apply_large_batch(self, mock_torch_npu):
        """Tests the apply method with a batch size larger than MAX_BATCH_SIZE."""
        batch_size = 25
@@ -178,7 +178,7 @@ class TestW4A4FlatQuantDynamic(unittest.TestCase):
                ValueError, "FlatQuant transform matrices dimension mismatch"):
            self.method.apply(layer, x)
-    @patch('vllm_ascend.quantization.w4a4_flatquant_dynamic.pack_int4_weights')
+    @patch('vllm_ascend.quantization.methods.w4a4_flatquant.pack_int4_weights')
    def test_process_weights_after_loading(self, mock_pack_weights):
        """Tests weight processing after loading, without transpose."""
        layer = nn.Module()
--- a/tests/ut/quantization/test_w4a8_dynamic.py
+++ b/tests/ut/quantization/test_w4a8_dynamic.py
@@ -3,14 +3,14 @@ from unittest.mock import Mock, patch
 import torch
 from tests.ut.base import TestBase
-from vllm_ascend.quantization.w4a8_dynamic import (
+from vllm_ascend.quantization.methods.w4a8 import (
    AscendW4A8DynamicFusedMoEMethod, AscendW4A8DynamicLinearMethod)
 class TestAscendW4A8DynamicLinearMethod(TestBase):
    @patch('vllm.distributed.get_tensor_model_parallel_world_size')
-    @patch('vllm_ascend.quantization.w4a8_dynamic.get_current_vllm_config')
+    @patch('vllm_ascend.quantization.methods.w4a8.get_current_vllm_config')
    def setUp(self, mock_get_current_vllm_config, mock_get_tp_world_size):
        mock_get_tp_world_size.return_value = 1
        mock_vllm_config = Mock()
@@ -127,10 +127,10 @@ class TestAscendW4A8DynamicFusedMoEMethod(TestBase):
    output_size = 56
    group_size = 2
-    @patch('vllm_ascend.quantization.w4a8_dynamic.get_ascend_config')
+    @patch('vllm_ascend.quantization.methods.w4a8.get_ascend_config')
-    @patch('vllm_ascend.quantization.w4a8_dynamic.get_current_vllm_config')
+    @patch('vllm_ascend.quantization.methods.w4a8.get_current_vllm_config')
-    @patch('vllm_ascend.quantization.w4a8_dynamic.get_ep_group')
+    @patch('vllm_ascend.quantization.methods.w4a8.get_ep_group')
-    @patch('vllm_ascend.quantization.w4a8_dynamic.get_mc2_group')
+    @patch('vllm_ascend.quantization.methods.w4a8.get_mc2_group')
    @patch('torch.distributed.get_rank', return_value=0)
    def setUp(self, mock_get_rank, mock_get_mc2_group, mock_get_ep_group,
              get_current_vllm_config, mock_get_ascend_config):
--- a/tests/ut/quantization/test_w8a16.py
+++ b/tests/ut/quantization/test_w8a16.py
@@ -4,7 +4,7 @@ from unittest.mock import MagicMock, patch
 import torch
 from tests.ut.base import TestBase
-from vllm_ascend.quantization.w8a16 import AscendW8A16LinearMethod
+from vllm_ascend.quantization.methods.w8a16 import AscendW8A16LinearMethod
 class TestAscendW8A16LinearMethod(TestBase):
--- a/tests/ut/quantization/test_w8a8.py
+++ b/tests/ut/quantization/test_w8a8.py
@@ -4,36 +4,10 @@ from unittest.mock import MagicMock, patch
 import torch
 from tests.ut.base import TestBase
-from vllm_ascend.quantization.w8a8 import (AscendW8A8LinearMethod,
+from vllm_ascend.quantization.methods.w8a8_static import AscendW8A8LinearMethod
                                           quant_per_tensor)
 from vllm_ascend.utils import AscendDeviceType
 class TestQuantPerTensor(TestBase):
    @patch("torch_npu.npu_quantize")
    def test_quant_per_tensor(self, mock_npu_quantize):
        in_tensor = torch.randn(32, 128)
        input_scale = torch.tensor(0.1)
        input_offset = torch.tensor(0)
        expected_output = torch.randint(-128, 127, (32, 128), dtype=torch.int8)
        mock_npu_quantize.return_value = expected_output
        output = quant_per_tensor(in_tensor, input_scale, input_offset)
        mock_npu_quantize.assert_called_once_with(
            in_tensor,
            input_scale,
            input_offset,
            torch.qint8,
            -1,
            False,
        )
        self.assertTrue(torch.equal(output, expected_output))
 class TestAscendW8A8LinearMethod(TestBase):
    def setUp(self):
@@ -63,7 +37,9 @@ class TestAscendW8A8LinearMethod(TestBase):
        self.assertEqual(params['weight_scale'].shape, (10, 1))
        self.assertEqual(params['weight_offset'].shape, (10, 1))
-    @patch("vllm_ascend.quantization.w8a8.get_weight_prefetch_method")
+    @patch(
        "vllm_ascend.quantization.methods.w8a8_static.get_weight_prefetch_method"
    )
    @patch("torch.ops.vllm.quantize")
    @patch("torch_npu.npu_quant_matmul")
    def test_apply_with_x_not_int8(self, mock_npu_quant_matmul, mock_quantize,
--- a/tests/ut/quantization/test_w8a8_dynamic.py
+++ b/tests/ut/quantization/test_w8a8_dynamic.py
@@ -3,7 +3,7 @@ from unittest.mock import Mock, patch
 import torch
 from tests.ut.base import TestBase
-from vllm_ascend.quantization.w8a8_dynamic import \
+from vllm_ascend.quantization.methods.w8a8_dynamic import \
    AscendW8A8DynamicFusedMoEMethod
@@ -13,13 +13,13 @@ class TestAscendW8A8FusedMoEMethod(TestBase):
    intermediate_size = 128
    @patch("torch.distributed.get_rank")
-    @patch("vllm_ascend.quantization.w8a8_dynamic.get_mc2_group")
+    @patch("vllm_ascend.quantization.methods.w8a8_dynamic.get_mc2_group")
-    @patch("vllm_ascend.quantization.w8a8_dynamic.get_ascend_config")
+    @patch("vllm_ascend.quantization.methods.w8a8_dynamic.get_ascend_config")
-    @patch("vllm_ascend.quantization.w8a8_dynamic.get_ep_group")
+    @patch("vllm_ascend.quantization.methods.w8a8_dynamic.get_ep_group")
    def setUp(self, mock_get_ep_group, mock_get_ascend_config,
              mock_get_mc2_group, mock_get_rank):
        with patch(
-                'vllm_ascend.quantization.w8a8_dynamic.get_current_vllm_config'
+                'vllm_ascend.quantization.methods.w8a8_dynamic.get_current_vllm_config'
        ) as mock_get_current_vllm_config:
            mock_vllm_config = Mock()
            mock_vllm_config.quant_config = Mock(
--- a/tests/ut/test_platform.py
+++ b/tests/ut/test_platform.py
@@ -51,8 +51,9 @@ class TestNPUPlatform(TestBase):
        self.assertTrue(self.platform.is_sleep_mode_available())
    @patch("vllm_ascend.utils.adapt_patch")
-    @patch("vllm_ascend.quantization.quant_config.AscendQuantConfig")
-    def test_pre_register_and_update_with_parser(self, mock_quant_config, mock_adapt_patch):
+    @patch("vllm_ascend.quantization.modelslim_config.AscendModelSlimConfig")
+    def test_pre_register_and_update_with_parser(self, mock_quant_config,
+                                                 mock_adapt_patch):
        mock_parser = MagicMock()
        mock_action = MagicMock()
        mock_action.choices = ["awq", "gptq"]
@@ -66,15 +67,17 @@ class TestNPUPlatform(TestBase):
        self.assertEqual(len(mock_action.choices), 3)  # original 2 + ascend
    @patch("vllm_ascend.utils.adapt_patch")
-    @patch("vllm_ascend.quantization.quant_config.AscendQuantConfig")
-    def test_pre_register_and_update_without_parser(self, mock_quant_config, mock_adapt_patch):
+    @patch("vllm_ascend.quantization.modelslim_config.AscendModelSlimConfig")
+    def test_pre_register_and_update_without_parser(self, mock_quant_config,
+                                                    mock_adapt_patch):
        self.platform.pre_register_and_update(None)
        mock_adapt_patch.assert_called_once_with(is_global_patch=True)
    @patch("vllm_ascend.utils.adapt_patch")
-    @patch("vllm_ascend.quantization.quant_config.AscendQuantConfig")
-    def test_pre_register_and_update_with_parser_no_quant_action(self, mock_quant_config, mock_adapt_patch):
+    @patch("vllm_ascend.quantization.modelslim_config.AscendModelSlimConfig")
+    def test_pre_register_and_update_with_parser_no_quant_action(
+            self, mock_quant_config, mock_adapt_patch):
        mock_parser = MagicMock()
        mock_parser._option_string_actions = {}
@@ -83,8 +86,9 @@ class TestNPUPlatform(TestBase):
        mock_adapt_patch.assert_called_once_with(is_global_patch=True)
    @patch("vllm_ascend.utils.adapt_patch")
-    @patch("vllm_ascend.quantization.quant_config.AscendQuantConfig")
-    def test_pre_register_and_update_with_existing_ascend_quant(self, mock_quant_config, mock_adapt_patch):
+    @patch("vllm_ascend.quantization.modelslim_config.AscendModelSlimConfig")
+    def test_pre_register_and_update_with_existing_ascend_quant(
+            self, mock_quant_config, mock_adapt_patch):
        mock_parser = MagicMock()
        mock_action = MagicMock()
        mock_action.choices = ["awq", ASCEND_QUANTIZATION_METHOD]
--- a/vllm_ascend/attention/mla_v1.py
+++ b/vllm_ascend/attention/mla_v1.py
@@ -36,7 +36,7 @@ from vllm_ascend.ops.layer_shard_linear import (
    register_all_layers_to_shard_weight_series)
 from vllm_ascend.ops.rotary_embedding import get_cos_and_sin_mla
 from vllm_ascend.ops.weight_prefetch import maybe_npu_prefetch
-from vllm_ascend.quantization.w8a8 import AscendW8A8LinearMethod
+from vllm_ascend.quantization.methods import AscendW8A8LinearMethod
 from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_ND, maybe_trans_nz,
                               weak_ref_tensors)
 from vllm_ascend.worker.npu_input_batch import NPUInputBatch
--- a/vllm_ascend/attention/sfa_v1.py
+++ b/vllm_ascend/attention/sfa_v1.py
@@ -35,7 +35,7 @@ from vllm_ascend.ops.layer_shard_linear import (
 from vllm_ascend.ops.rotary_embedding import get_cos_and_sin_mla
 from vllm_ascend.ops.triton.rope import rope_forward_triton
 from vllm_ascend.ops.weight_prefetch import maybe_npu_prefetch
-from vllm_ascend.quantization.w8a8 import AscendW8A8LinearMethod
+from vllm_ascend.quantization.methods import AscendW8A8LinearMethod
 from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_ND, _round_up, dispose_layer,
                               enable_dsa_cp, enable_dsa_cp_with_layer_shard, maybe_trans_nz)
 from vllm_ascend.worker.npu_input_batch import NPUInputBatch
--- a/vllm_ascend/ops/fused_moe/fused_moe.py
+++ b/vllm_ascend/ops/fused_moe/fused_moe.py
@@ -43,10 +43,6 @@ from vllm_ascend.ops.fused_moe.moe_comm_method import (AllGatherCommImpl,
                                                       FusedExpertsResult,
                                                       setup_moe_comm_method)
 from vllm_ascend.ops.fused_moe.prepare_finalize import QuantType
-from vllm_ascend.quantization.w4a8_dynamic import \
-    AscendW4A8DynamicFusedMoEMethod
-from vllm_ascend.quantization.w8a8_dynamic import \
-    AscendW8A8DynamicFusedMoEMethod
 from vllm_ascend.utils import (AscendDeviceType, enable_sp,
                               get_ascend_device_type, maybe_trans_nz,
                               npu_stream_switch, shared_expert_dp_enabled,
@@ -251,12 +247,16 @@ class AscendFusedMoE(FusedMoE):
        method = quant_method.quant_method
-        if isinstance(method, AscendW8A8DynamicFusedMoEMethod):
-            return QuantType.W8A8
-        elif isinstance(method, AscendW4A8DynamicFusedMoEMethod):
-            return QuantType.W4A8
-        else:
-            return QuantType.NONE
+        if hasattr(method, "quant_type"):
+            from vllm_ascend.quantization.methods.base import \
+                QuantType as SchemeQuantType
+            scheme_quant_type = method.quant_type
+            if scheme_quant_type == SchemeQuantType.W8A8:
+                return QuantType.W8A8
+            elif scheme_quant_type == SchemeQuantType.W4A8:
+                return QuantType.W4A8
+        return QuantType.NONE
    def update_expert_map(self, new_expert_map):
        self._expert_map = new_expert_map
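Note on the `fused_moe.py` hunk above: the new branch keys off a `quant_type` attribute on the scheme instead of `isinstance` checks against concrete scheme classes, so this file no longer needs to import those classes. A minimal, self-contained sketch of that dispatch follows. Both enums and the two `Fake*` classes are stand-ins invented for illustration; they are assumptions, not the real vllm-ascend definitions.

```python
from enum import Enum


class SchemeQuantType(Enum):
    """Stand-in for the scheme-side QuantType in methods/base.py (assumed shape)."""
    NONE = 0
    W8A8 = 1
    W4A8 = 2


class CommQuantType(Enum):
    """Stand-in for the QuantType imported from prepare_finalize (assumed shape)."""
    NONE = 0
    W8A8 = 1
    W4A8 = 2


class FakeW8A8Scheme:
    """Hypothetical scheme that advertises its quantization type as an attribute."""
    quant_type = SchemeQuantType.W8A8


class FakeUnquantizedMethod:
    """Hypothetical method object with no quant_type attribute."""


def get_quant_type(method) -> CommQuantType:
    # Attribute-based dispatch: no import of concrete scheme classes is needed.
    if hasattr(method, "quant_type"):
        if method.quant_type == SchemeQuantType.W8A8:
            return CommQuantType.W8A8
        elif method.quant_type == SchemeQuantType.W4A8:
            return CommQuantType.W4A8
    # Anything without a recognised quant_type falls through to NONE.
    return CommQuantType.NONE
```

Any scheme that sets a recognised `quant_type` is picked up without touching the dispatch code; objects lacking the attribute fall through to `NONE`.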
--- a/vllm_ascend/ops/linear_op.py
+++ b/vllm_ascend/ops/linear_op.py
@@ -368,7 +368,8 @@ class Flashcomm2OProjRowParallelOp(CustomRowParallelOp):
            "communication_fn"] = otp_maybe_quant_comm
        actual_quant_method = getattr(self.quant_method, 'quant_method',
                                      self.quant_method)
-        from vllm_ascend.quantization.w8a8 import AscendW8A8LinearMethod
+        from vllm_ascend.quantization.methods.w8a8_static import \
+            AscendW8A8LinearMethod
        if not isinstance(actual_quant_method, AscendW8A8LinearMethod):
            # Check if w8a8 quantization is enabled. If not, communicate immediately.
            input_parallel = otp_maybe_quant_comm(input_parallel)
@@ -586,8 +587,8 @@ class SequenceRowParallelOp(CustomRowParallelOp):
        from vllm.model_executor.layers.linear import UnquantizedLinearMethod
-        from vllm_ascend.quantization.quant_config import AscendLinearMethod
+        from vllm_ascend.quantization.methods import AscendW8A8LinearMethod
-        from vllm_ascend.quantization.w8a8 import AscendW8A8LinearMethod
+        from vllm_ascend.quantization.method_adapters import AscendLinearMethod
        # For unquant
        if mmrs_fusion and isinstance(self.layer.quant_method,
--- a/vllm_ascend/platform.py
+++ b/vllm_ascend/platform.py
@@ -151,8 +151,7 @@ class NPUPlatform(Platform):
                if ASCEND_QUANTIZATION_METHOD not in quant_action.choices:
                    quant_action.choices.append(ASCEND_QUANTIZATION_METHOD)
-        from vllm_ascend.quantization.compressed_tensors.compressed_tensors import AscendCompressedTensorsConfig  # noqa: F401
-        from vllm_ascend.quantization.quant_config import AscendQuantConfig  # noqa: F401
+        from vllm_ascend.quantization import AscendCompressedTensorsConfig, AscendModelSlimConfig  # noqa: F401
        config_deprecated_logging()
--- a/vllm_ascend/quantization/__init__.py
+++ b/vllm_ascend/quantization/__init__.py
@@ -0,0 +1,38 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # This file is a part of the vllm-ascend project.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 """Ascend quantization module.
 This module provides quantization support for Ascend NPU.
 Supported quantization tools:
 - ModelSlim: Use AscendModelSlimConfig
 - LLM-Compressor (compressed_tensors): Use AscendCompressedTensorsConfig
 Public API:
 - Config classes: AscendModelSlimConfig, AscendCompressedTensorsConfig
 - For scheme implementations, import from vllm_ascend.quantization.methods
 """
 # LLM-Compressor (compressed_tensors) quantization config
 from .compressed_tensors_config import AscendCompressedTensorsConfig
 # ModelSlim quantization config
 from .modelslim_config import AscendModelSlimConfig
 __all__ = [
    "AscendModelSlimConfig",
    "AscendCompressedTensorsConfig",
 ]
--- a/vllm_ascend/quantization/compressed_tensors/__init__.py
+++ b/vllm_ascend/quantization/compressed_tensors/__init__.py
--- a/vllm_ascend/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm_ascend/quantization/compressed_tensors/compressed_tensors.py
@@ -1,4 +1,23 @@
-from typing import TYPE_CHECKING, Any, Optional, cast
+#
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # Copyright 2023 The vLLM team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # This file is a part of the vllm-ascend project.
 #
 """LLM-Compressor (compressed_tensors) quantization configuration for Ascend."""
 from typing import Any, Optional, Union, cast
 import torch
 from compressed_tensors.quantization import (QuantizationArgs,
@@ -12,40 +31,37 @@ from vllm.model_executor.layers.quantization import (
    QUANTIZATION_METHODS, register_quantization_config)
 from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)
 from vllm.model_executor.layers.quantization.compressed_tensors.schemes import \
    CompressedTensorsScheme
 from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
    find_matched_target, is_activation_quantization_format,
    should_ignore_layer)
+from vllm.model_executor.models.utils import WeightsMapper
-from vllm_ascend.ops.fused_moe.fused_moe import AscendUnquantizedFusedMoEMethod
-from vllm_ascend.quantization.quant_config import (AscendFusedMoEMethod,
-                                                   AscendLinearMethod,
-                                                   AscendQuantConfig)
-from vllm_ascend.quantization.w4a16 import AscendW4A16FusedMoEMethod
-from vllm_ascend.quantization.w8a8 import AscendW8A8LinearMethod
-from vllm_ascend.quantization.w8a8_dynamic import (
-    AscendW8A8DynamicFusedMoEMethod, AscendW8A8DynamicLinearMethod)
 from vllm_ascend.utils import COMPRESSED_TENSORS_METHOD
-if TYPE_CHECKING:
-    from vllm.model_executor.models.utils import WeightsMapper
+from .methods import AscendLinearScheme, AscendMoEScheme
 logger = init_logger(__name__)
 QUANTIZATION_SCHEME_MAP_TYPE = dict[str, Optional[dict[str, QuantizationArgs]]]
-
+# Remove the original compressed_tensors method to replace with our implementation
-def remove_quantization_method():
+def _remove_quantization_method():
    if COMPRESSED_TENSORS_METHOD in QUANTIZATION_METHODS:
        QUANTIZATION_METHODS.remove(COMPRESSED_TENSORS_METHOD)
-remove_quantization_method()
+_remove_quantization_method()
 QUANTIZATION_SCHEME_MAP_TYPE = dict[str, Optional[dict[str,
                                                       "QuantizationArgs"]]]
@register_quantization_config(COMPRESSED_TENSORS_METHOD)
 class AscendCompressedTensorsConfig(QuantizationConfig):
    """Config class for LLM-Compressor (compressed_tensors) quantization on Ascend.
    This class adapts the compressed_tensors format to work with Ascend's
    quantization implementations.
    """
    def __init__(
        self,
@@ -107,23 +123,16 @@ class AscendCompressedTensorsConfig(QuantizationConfig):
    @classmethod
    def _quantization_scheme_map_from_config(
            cls, config: dict[str, Any]) -> QUANTIZATION_SCHEME_MAP_TYPE:
-        """
+        """Build target scheme map from config.
        :param config: The `quantization_config` dictionary from config.json
        :return: A dictionary mapping target layer names to their corresponding
            quantization_args for weights and input activations
        """
        target_scheme_map: dict[str, Any] = dict()
        quant_format = cast(str, config.get("format"))
        # The quant_config has multiple config_groups, each containing
        # an input_activations key with details about how the activations are
        # quantized, a weights key indicating how the weights are quantized,
        # and a list of targets under the `targets` key, dictating which
        # layers are impacted by the quantization details. The quantization
        # details follow the structure defined by the QuantizationArgs
        # pydantic model, which is used to verify the structure of the
        # quant_config and also store the details for later use.
        config_groups = config.get("config_groups", dict())
        for _, quant_config in config_groups.items():
            targets = quant_config.get("targets")
@@ -154,70 +163,102 @@ class AscendCompressedTensorsConfig(QuantizationConfig):
        layer: torch.nn.Module,
        prefix: str,
    ) -> Optional["QuantizeMethodBase"]:
+        from .method_adapters import AscendFusedMoEMethod, AscendLinearMethod
        if isinstance(layer, LinearBase):
            layer.ascend_quant_method = COMPRESSED_TENSORS_METHOD
-            # collect schemes
-            quant_scheme = self.get_scheme(layer=layer, layer_name=prefix)
-            # choose quantization method
-            quant_method = UnquantizedLinearMethod()
-            if quant_scheme is not None:
-                layer.scheme = quant_scheme
-                ascend_quant_config = AscendQuantConfig(self.quant_description
-                                                        or {})
-                quant_method = AscendLinearMethod(ascend_quant_config, prefix,
-                                                  None, layer)
-            return quant_method
+            # Get the scheme for this layer
+            linear_scheme = self._get_linear_scheme(layer=layer,
+                                                    layer_name=prefix)
+            # Return unquantized method if no scheme found
+            if linear_scheme is None:
+                return UnquantizedLinearMethod()
+            # Store scheme on layer for reference (optional, for debugging)
+            layer.scheme = linear_scheme
+            logger.info_once(
+                "Using the vLLM Ascend llmcompressor Quantization now!")
+            return AscendLinearMethod(linear_scheme)
        if isinstance(layer, FusedMoE):
-            self._add_fused_moe_to_target_scheme_map()
-            unfused_names = [
-                prefix + proj_name for proj_name in
-                [".0.gate_proj", ".0.up_proj", ".0.down_proj"]
-            ]
-            # TODO: refactor this to use expert_mapping and check all layer numbers
-            all_scheme_dicts = [
-                self.get_scheme_dict(layer, name) for name in unfused_names
-            ]
-            scheme_dict = all_scheme_dicts.pop()
-            # multiple schemes found
-            if not all(
-                [cur_dict == scheme_dict for cur_dict in all_scheme_dicts]):
-                raise ValueError("All MoE projections need to have same "
-                                 "quantization scheme but found multiple")
-            if scheme_dict is None:
+            # Delayed import to avoid circular import
+            from vllm_ascend.ops.fused_moe.fused_moe import \
+                AscendUnquantizedFusedMoEMethod
+            layer.ascend_quant_method = COMPRESSED_TENSORS_METHOD
+            # Get the scheme for this layer
+            moe_scheme = self._get_moe_scheme(layer=layer, layer_name=prefix)
+            # Return unquantized method if no scheme found
+            if moe_scheme is None:
                return AscendUnquantizedFusedMoEMethod(layer.moe_config)
-            weight_quant = scheme_dict.get("weights")
-            input_quant = scheme_dict.get("input_activations")
-            quant_scheme = None
-            act_quant_format = is_activation_quantization_format(self.quant_format)
-            if act_quant_format:
-                if self._is_dynamic_token_w8a8(weight_quant, input_quant):
-                    quant_scheme = AscendW8A8DynamicFusedMoEMethod()
-            else:
-                if self._is_w4a16(weight_quant, input_quant):
-                    quant_scheme = AscendW4A16FusedMoEMethod()
-            if quant_scheme is None:
-                raise RuntimeError(
-                    f"Unsupported FusedMoe scheme: {weight_quant}, {input_quant}"
-                )
-            layer.scheme = quant_scheme
-            layer.ascend_quant_method = COMPRESSED_TENSORS_METHOD
-            ascend_quant_config = AscendQuantConfig(self.quant_description
-                                                    or {})
-            return AscendFusedMoEMethod(ascend_quant_config, prefix,
-                                        self.packed_modules_mapping, layer)
+            # Store scheme on layer for reference (optional, for debugging)
+            layer.scheme = moe_scheme
+            logger.info_once(
+                "Using the vLLM Ascend llmcompressor Quantization now!")
+            return AscendFusedMoEMethod(moe_scheme, layer.moe_config)
        return None
-    def get_scheme(self,
-                   layer: torch.nn.Module,
-                   layer_name: Optional[str] = None
-                   ) -> Optional["CompressedTensorsScheme"]:
+    def _get_linear_scheme(
+            self,
+            layer: torch.nn.Module,
+            layer_name: Optional[str] = None) -> Optional[AscendLinearScheme]:
+        """Get the linear quantization scheme for a layer.
+        Returns:
+            An AscendLinearScheme instance, or None if the layer
+            should use unquantized method.
+        """
+        weight_quant, input_quant, format = self._get_quant_args(
+            layer, layer_name)
+        if weight_quant is None:
+            return None
+        scheme = self._create_scheme_for_layer_type(
+            weight_quant=weight_quant,
+            input_quant=input_quant,
+            format=format,
+            layer_type="linear",
+        )
+        return cast(AscendLinearScheme, scheme)
+    def _get_moe_scheme(
+            self,
+            layer: torch.nn.Module,
+            layer_name: Optional[str] = None) -> Optional[AscendMoEScheme]:
+        """Get the MoE quantization scheme for a layer.
+        Returns:
+            An AscendMoEScheme instance, or None if the layer
+            should use unquantized method.
+        """
+        # Add FusedMoE to target scheme map if needed
+        self._add_fused_moe_to_target_scheme_map()
+        weight_quant, input_quant, format = self._get_quant_args(
+            layer, layer_name)
+        if weight_quant is None:
+            return None
+        scheme = self._create_scheme_for_layer_type(
+            weight_quant=weight_quant,
+            input_quant=input_quant,
+            format=format,
+            layer_type="moe",
+        )
+        return cast(AscendMoEScheme, scheme)
+    def _get_quant_args(
+        self,
+        layer: torch.nn.Module,
+        layer_name: Optional[str] = None
+    ) -> tuple[Optional["QuantizationArgs"], Optional["QuantizationArgs"],
+               Optional[str]]:
        """Extract quantization arguments for a layer.
        compressed-tensors supports non uniform in the following way:
        targets of config_groups: There can be N config_groups which each
@@ -226,10 +267,12 @@ class AscendCompressedTensorsConfig(QuantizationConfig):
            an nn.Module name.
        Detect whether a layer_name is found in any target and
-        use the quantization scheme corresponding to the matched target
-        to select the CompressedTensorsScheme used for inference.
+        use the quantization scheme corresponding to the matched target.
+
+        Returns:
+            A tuple of (weight_quant, input_quant, format). weight_quant is
+            None if the layer should use unquantized method.
        """
        scheme_dict = self.get_scheme_dict(layer, layer_name)
        weight_quant = None
        input_quant = None
@@ -243,16 +286,8 @@ class AscendCompressedTensorsConfig(QuantizationConfig):
            logger.warning_once("Acceleration for non-quantized schemes is "
                                "not supported by Compressed Tensors. "
                                "Falling back to UnquantizedLinearMethod")
            return None
-        else:
-            # Find the quant_scheme
-            scheme = self._get_scheme_from_parts(
-                weight_quant=weight_quant,
-                input_quant=input_quant,
-                format=format,
-            )
-        return scheme
+        return weight_quant, input_quant, format
    def get_scheme_dict(
        self,
@@ -288,28 +323,73 @@ class AscendCompressedTensorsConfig(QuantizationConfig):
        return None
-    def _get_scheme_from_parts(
+    def _create_scheme_for_layer_type(
        self,
-        weight_quant: QuantizationArgs,
-        input_quant: QuantizationArgs,
-        format: str | None = None,
-    ) -> "CompressedTensorsScheme":
+        weight_quant: "QuantizationArgs",
+        input_quant: Optional["QuantizationArgs"],
+        format: Optional[str],
+        layer_type: str,
+    ) -> Union[AscendLinearScheme, AscendMoEScheme]:
+        """Create the appropriate Ascend scheme based on quantization args and layer type.
+        Args:
+            weight_quant: Weight quantization arguments.
+            input_quant: Input activation quantization arguments.
+            format: Per-layer format, if defined.
+            layer_type: Type of layer ("linear" or "moe").
+        Returns:
+            An instance of the appropriate Ascend quantization scheme.
+        """
+        from .methods import get_scheme_class
+        # Determine the quantization type
+        quant_type = self._detect_quant_type(weight_quant, input_quant, format)
+        # Get the scheme class from registry
+        scheme_cls = get_scheme_class(quant_type, layer_type)
+        if scheme_cls is None:
+            raise NotImplementedError(
+                f"No compressed-tensors compatible scheme was found for "
+                f"quant_type={quant_type}, layer_type={layer_type}.")
+        return scheme_cls()
+    def _detect_quant_type(
+        self,
+        weight_quant: "QuantizationArgs",
+        input_quant: Optional["QuantizationArgs"],
+        format: Optional[str],
+    ) -> str:
+        """Detect the quantization type from quantization arguments.
+        Args:
+            weight_quant: Weight quantization arguments.
+            input_quant: Input activation quantization arguments.
+            format: Per-layer format, if defined.
+        Returns:
+            A string representing the quantization type (e.g., "W8A8", "W8A8_DYNAMIC").
+        """
        # use the per-layer format if defined, otherwise, use global format
        format = format if format is not None else self.quant_format
        act_quant_format = is_activation_quantization_format(format)
        if act_quant_format and input_quant is not None:
            if self._is_static_tensor_w8a8(weight_quant, input_quant):
-                return AscendW8A8LinearMethod()
+                return "W8A8"
            if self._is_dynamic_token_w8a8(weight_quant, input_quant):
-                return AscendW8A8DynamicLinearMethod()
+                return "W8A8_DYNAMIC"
+        if self._is_w4a16(weight_quant, input_quant):
+            return "W4A16"
        raise NotImplementedError(
-            "No compressed-tensors compatible scheme was found.")
+            "No compressed-tensors compatible quantization type was found.")
-    def _is_static_tensor_w8a8(self, weight_quant: QuantizationArgs,
+    def _is_static_tensor_w8a8(self, weight_quant: "QuantizationArgs",
-                               input_quant: QuantizationArgs) -> bool:
+                               input_quant: "QuantizationArgs") -> bool:
        is_8_bits = weight_quant.num_bits == input_quant.num_bits == 8
        weight_strategy = (
            weight_quant.strategy == QuantizationStrategy.CHANNEL.value)
@@ -322,8 +402,8 @@ class AscendCompressedTensorsConfig(QuantizationConfig):
        # Only symmetric weight quantization supported.
        return is_8_bits and is_tensor and is_symmetric and is_static
-    def _is_dynamic_token_w8a8(self, weight_quant: QuantizationArgs,
+    def _is_dynamic_token_w8a8(self, weight_quant: "QuantizationArgs",
-                               input_quant: QuantizationArgs) -> bool:
+                               input_quant: "QuantizationArgs") -> bool:
        is_8_bits = weight_quant.num_bits == input_quant.num_bits == 8
        weight_strategy = (
            weight_quant.strategy == QuantizationStrategy.CHANNEL.value)
@@ -336,13 +416,13 @@ class AscendCompressedTensorsConfig(QuantizationConfig):
        # Only symmetric weight quantization supported.
        return is_8_bits and is_token and is_symmetric and is_dynamic
-    def _is_w4a16(self, weight_quant: QuantizationArgs,
+    def _is_w4a16(self, weight_quant: "QuantizationArgs",
-                  input_quant: QuantizationArgs) -> bool:
+                  input_quant: Optional["QuantizationArgs"]) -> bool:
        # Confirm weights quantized.
        if weight_quant is None:
            return False
-        # Confirm we have floating points.
+        # Confirm we have integer type.
        if weight_quant.type != QuantizationType.INT:
            return False
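Note on the `compressed_tensors` changes above: `_create_scheme_for_layer_type` first reduces the quantization args to a string such as `"W8A8_DYNAMIC"`, then asks the registry for the class registered under `(quant_type, layer_type)`. The sketch below shows one way such a registry could look. Only the names `register_scheme`, `get_scheme_class` and `_SCHEME_REGISTRY` come from the PR description; their bodies, the duplicate-key check, the `create_scheme` helper and the `DemoW8A8DynamicLinear` class are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumed shape of the module-level registry: (quant_type, layer_type) -> class.
_SCHEME_REGISTRY: Dict[Tuple[str, str], type] = {}


def register_scheme(quant_type: str, layer_type: str) -> Callable[[type], type]:
    """Class decorator that records a scheme class at import time."""

    def decorator(cls: type) -> type:
        key = (quant_type, layer_type)
        # Rejecting duplicates is an assumption of this sketch.
        if key in _SCHEME_REGISTRY:
            raise ValueError(f"Scheme already registered for {key}")
        _SCHEME_REGISTRY[key] = cls
        return cls

    return decorator


def get_scheme_class(quant_type: str, layer_type: str) -> Optional[type]:
    """Return the registered class, or None when nothing matches."""
    return _SCHEME_REGISTRY.get((quant_type, layer_type))


@register_scheme("W8A8_DYNAMIC", "linear")
class DemoW8A8DynamicLinear:
    """Hypothetical scheme class used only for this example."""


def create_scheme(quant_type: str, layer_type: str):
    """Hypothetical helper: look the class up, fail loudly, then instantiate."""
    scheme_cls = get_scheme_class(quant_type, layer_type)
    if scheme_cls is None:
        raise NotImplementedError(
            f"No scheme for quant_type={quant_type}, layer_type={layer_type}.")
    return scheme_cls()
```

With this layout, supporting a new quantization type means adding one decorated class; the config code that calls the lookup does not change.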
--- a/vllm_ascend/quantization/method_adapters.py
+++ b/vllm_ascend/quantization/method_adapters.py
@@ -0,0 +1,288 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # Copyright 2023 The vLLM team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # This file is a part of the vllm-ascend project.
 #
 from typing import Callable, List, Optional
 import torch
 from vllm.distributed import get_tensor_model_parallel_rank
 from vllm.model_executor.layers.fused_moe import (FusedMoEMethodBase,
                                                  FusedMoeWeightScaleSupported)
 from vllm.model_executor.layers.fused_moe.config import FusedMoEConfig
 from vllm.model_executor.layers.linear import (LinearMethodBase,
                                               RowParallelLinear)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.model_executor.parameter import PerTensorScaleParameter
 from vllm.model_executor.utils import set_weight_attrs
 from vllm_ascend.ascend_config import get_ascend_config
 from vllm_ascend.distributed.parallel_state import (get_flashcomm2_otp_group,
                                                    get_mlp_tp_group,
                                                    get_otp_group)
 from vllm_ascend.utils import flashcomm2_enable, mlp_tp_enable, oproj_tp_enable
 from .methods import (AscendAttentionScheme, AscendLinearScheme,
                      AscendMoEScheme, is_mx_quant_type)
 class AscendLinearMethod(LinearMethodBase):
    """Linear method for Ascend quantization.
    This wrapper class delegates to the actual quantization scheme implementation.
    The scheme is determined by the Config class and passed directly to this wrapper.
    Args:
        scheme: The quantization scheme instance (e.g., AscendW8A8DynamicLinearMethod).
    """
    def __init__(self, scheme: AscendLinearScheme) -> None:
        self.quant_method = scheme
    def create_weights(
        self,
        layer: torch.nn.Module,
        input_size_per_partition: int,
        output_partition_sizes: List[int],
        input_size: int,
        output_size: int,
        params_dtype: torch.dtype,
        **extra_weight_attrs,
    ) -> None:
        output_size_per_partition = sum(output_partition_sizes)
        weight_loader = extra_weight_attrs.get("weight_loader")
        weight_dict = self.quant_method.get_weight(input_size_per_partition,
                                                   output_size_per_partition,
                                                   params_dtype)
        # Extract packing information (if present)
        packed_dim = weight_dict.pop("_packed_dim", None)
        packed_factor = weight_dict.pop("_packed_factor", None)
        for weight_name, weight_param in weight_dict.items():
            param = torch.nn.Parameter(weight_param, requires_grad=False)
            set_weight_attrs(param, {"input_dim": 1, "output_dim": 0})
            # Set packing attributes if the weight is packed
            if packed_dim is not None and packed_factor is not None:
                set_weight_attrs(param, {
                    "packed_dim": packed_dim,
                    "packed_factor": packed_factor
                })
            layer.register_parameter(weight_name, param)
            set_weight_attrs(param, extra_weight_attrs)
        pertensor_dict = self.quant_method.get_pertensor_param(params_dtype)
        for pertensor_name, pertensor_param in pertensor_dict.items():
            param = PerTensorScaleParameter(data=pertensor_param,
                                            weight_loader=weight_loader)
            # disable warning
            param.ignore_warning = True
            layer.register_parameter(pertensor_name, param)
            param.weight_loader = extra_weight_attrs.get("weight_loader")
        perchannel_dict = self.quant_method.get_perchannel_param(
            output_size_per_partition, params_dtype)
        for perchannel_name, perchannel_param in perchannel_dict.items():
            param = torch.nn.Parameter(perchannel_param, requires_grad=False)
            set_weight_attrs(param, {"output_dim": 0})
            layer.register_parameter(perchannel_name, param)
            set_weight_attrs(param, extra_weight_attrs)
        # NOTE: In w4a8 quantization implementation,
        # for down_proj and o_proj scale_bias shape is [output_size, 16],
        # others are [output_size, 1]
        layer_type = "row" if isinstance(layer,
                                         RowParallelLinear) else "others"
        pergroup_dict = self.quant_method.get_pergroup_param(
            input_size_per_partition,
            output_size_per_partition,
            params_dtype,
            layer_type=layer_type)
        for pergroup_name, pergroup_param in pergroup_dict.items():
            param = torch.nn.Parameter(pergroup_param, requires_grad=False)
            set_weight_attrs(param, {"output_dim": 0})
            layer.register_parameter(pergroup_name, param)
            set_weight_attrs(param, extra_weight_attrs)
            if "weight_scale_second" in pergroup_name or "weight_offset_second" in pergroup_name \
                or is_mx_quant_type(self.quant_method):
                param.input_dim = 1
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        if hasattr(self.quant_method, "process_weights_after_loading"):
            self.quant_method.process_weights_after_loading(layer)
    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if isinstance(layer, RowParallelLinear):
            if layer.prefix.find("o_proj") != -1 and oproj_tp_enable():
                tp_rank = get_otp_group().rank_in_group
            elif layer.prefix.find("down_proj") != -1 and mlp_tp_enable():
                tp_rank = get_mlp_tp_group().rank_in_group
            elif (layer.prefix.find("o_proj") != -1 or
                  layer.prefix.find("out_proj") != -1) and flashcomm2_enable():
                if get_ascend_config(
                ).flashcomm2_oproj_tensor_parallel_size == 1:
                    tp_rank = 0
                else:
                    tp_rank = get_flashcomm2_otp_group().rank_in_group
            else:
                tp_rank = get_tensor_model_parallel_rank()
        else:
            tp_rank = 0
        return self.quant_method.apply(layer, x, bias, tp_rank)
 class AscendKVCacheMethod(BaseKVCacheMethod):
    """KVCache method for Ascend quantization.
    This wrapper class delegates to the actual attention quantization scheme.
    Args:
        scheme: The attention quantization scheme instance.
    """
    def __init__(self, scheme: AscendAttentionScheme) -> None:
        self.quant_method = scheme
    def create_weights(self, layer: torch.nn.Module) -> None:
        # Unlike the linear method, vLLM performs no weight processing or
        # slicing for attention, so weight creation is delegated entirely
        # to the specific quant method.
        self.quant_method.create_weights(layer)
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        self.quant_method.process_weights_after_loading(layer)
    def apply(self, layer: torch.nn.Module, query: torch.Tensor,
              key: torch.Tensor, value: torch.Tensor, kv_cache, attn_metadata,
              attn_type, scale, output) -> torch.Tensor:
        return self.quant_method.apply(layer, query, key, value, kv_cache,
                                       attn_metadata, attn_type, scale, output)
 class AscendFusedMoEMethod(FusedMoEMethodBase):
    """FusedMoE method for Ascend quantization.
    This wrapper class delegates to the actual MoE quantization scheme.
    Args:
        scheme: The MoE quantization scheme instance.
        moe_config: The FusedMoE configuration.
    """
    def __init__(self, scheme: AscendMoEScheme,
                 moe_config: FusedMoEConfig) -> None:
        super().__init__(moe_config)
        self.quant_method = scheme
    def create_weights(
        self,
        layer: torch.nn.Module,
        num_experts: int,
        hidden_size: int,
        intermediate_size_per_partition: int,
        params_dtype: torch.dtype,
        **extra_weight_attrs,
    ) -> None:
        weight_param = self.quant_method.get_weight(
            num_experts, intermediate_size_per_partition, hidden_size,
            params_dtype)
        for param_key, param_value in weight_param.items():
            param = torch.nn.Parameter(param_value, requires_grad=False)
            layer.register_parameter(param_key, param)
            set_weight_attrs(param, extra_weight_attrs)
        extra_weight_attrs.update(
            {"quant_method": FusedMoeWeightScaleSupported.CHANNEL.value})
        per_group_param = [
            "weight_scale_second", "weight_offset_second", "scale_bias"
        ] + ["weight_scale", "weight_offset"] if hasattr(
            self.quant_method,
            "group_size") and self.quant_method.group_size > 0 else []
        dynamic_quant_param = self.quant_method.get_dynamic_quant_param(
            num_experts, intermediate_size_per_partition, hidden_size,
            params_dtype)
        for param_key, param_value in dynamic_quant_param.items():
            param = torch.nn.Parameter(param_value, requires_grad=False)
            layer.register_parameter(param_key, param)
            set_weight_attrs(param, extra_weight_attrs)
            if any(fields in param_key for fields in per_group_param):
                setattr(param, "quant_method",
                        FusedMoeWeightScaleSupported.GROUP.value)
    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        router_logits: torch.Tensor,
        top_k: int,
        renormalize: bool,
        use_grouped_topk: bool = False,
        global_num_experts: int = -1,
        expert_map: Optional[torch.Tensor] = None,
        topk_group: Optional[int] = None,
        num_expert_group: Optional[int] = None,
        custom_routing_function: Optional[Callable] = None,
        scoring_func: str = "softmax",
        routed_scaling_factor: float = 1.0,
        e_score_correction_bias: Optional[torch.Tensor] = None,
        is_prefill: bool = True,
        enable_force_load_balance: bool = False,
        log2phy: Optional[torch.Tensor] = None,
        global_redundant_expert_num=0,
        **kwargs,
    ) -> torch.Tensor:
        return self.quant_method.apply(
            layer, x, router_logits, top_k, renormalize, use_grouped_topk,
            global_num_experts, expert_map, topk_group, num_expert_group,
            custom_routing_function, scoring_func, routed_scaling_factor,
            e_score_correction_bias, is_prefill, enable_force_load_balance,
            log2phy, global_redundant_expert_num, **kwargs)
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        if hasattr(self.quant_method, "process_weights_after_loading"):
            self.quant_method.process_weights_after_loading(layer)
    def get_fused_moe_quant_config(self, layer: torch.nn.Module):
        pass
    @property
    def supports_eplb(self):
        supports_eplb = getattr(self.quant_method, "supports_eplb", False)
        return supports_eplb
 class AscendEmbeddingMethod(AscendLinearMethod):
    """Embedding method for Ascend quantization.
    This is essentially the same as AscendLinearMethod, just with a different name
    for clarity when used with VocabParallelEmbedding layers.
    Args:
        scheme: The quantization scheme instance.
    """
    def __init__(self, scheme: AscendLinearScheme) -> None:
        self.quant_method = scheme
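A note on how `per_group_param` in `AscendFusedMoEMethod.create_weights` is evaluated. The sketch below reproduces only the shape of that expression in a torch-free form; `per_group_names`, `scale_kind`, the plain `group_size` argument, the `"group"`/`"channel"` strings, and the parameter key `"w13_weight_scale"` are illustrative assumptions, not part of this PR. It shows that Python parses `A + B if cond else []` as `(A + B) if cond else []`, so all five names are either present together or absent together, and that the later check is a substring match.

```python
from typing import List


def per_group_names(group_size: int) -> List[str]:
    # Same expression shape as in create_weights. The conditional applies
    # to the whole concatenation: ([...] + [...]) if cond else [].
    return [
        "weight_scale_second", "weight_offset_second", "scale_bias"
    ] + ["weight_scale", "weight_offset"] if group_size > 0 else []


def scale_kind(param_key: str, group_size: int) -> str:
    # Substring match, mirroring `any(fields in param_key ...)`.
    # The returned strings are stand-ins for the enum values.
    names = per_group_names(group_size)
    return "group" if any(name in param_key for name in names) else "channel"


grouped = per_group_names(128)   # all five names
ungrouped = per_group_names(0)   # empty list, not the first three names
```

With a positive group size, a key such as the hypothetical `"w13_weight_scale"` is tagged as per-group; with a group size of zero the same key keeps the per-channel tag that was set through `extra_weight_attrs`.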
--- a/vllm_ascend/quantization/methods/__init__.py
+++ b/vllm_ascend/quantization/methods/__init__.py
@@ -0,0 +1,82 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # This file is a part of the vllm-ascend project.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 """Ascend quantization scheme implementations.
 This module provides all quantization scheme implementations for Ascend NPU.
 Schemes are automatically registered via the @register_scheme decorator.
 Usage:
    from vllm_ascend.quantization.methods import get_scheme_class
    # Get a scheme class by quant_type and layer_type
    scheme_cls = get_scheme_class("W8A8_DYNAMIC", "linear")
    scheme = scheme_cls()
 """
 from typing import Any
 # Import base classes
 from .base import (AscendAttentionScheme, AscendLinearScheme, AscendMoEScheme,
                   QuantType)
 # Import registry functions
 from .registry import get_scheme_class, register_scheme
 # Import all scheme classes for external access
 from .w4a4_flatquant import AscendW4A4FlatQuantDynamicLinearMethod
 from .w4a4_laos_dynamic import AscendW4A4LaosDynamicLinearMethod
 from .w4a8 import (AscendW4A8DynamicFusedMoEMethod,
                   AscendW4A8DynamicLinearMethod)
 from .w4a16 import AscendW4A16FusedMoEMethod
 from .w8a8_dynamic import (AscendW8A8DynamicFusedMoEMethod,
                           AscendW8A8DynamicLinearMethod)
 from .w8a8_mxfp8 import AscendW8A8MXFP8DynamicLinearMethod
 from .w8a8_pdmix import (AscendW8A8PDMixFusedMoeMethod,
                         AscendW8A8PDMixLinearMethod)
 from .w8a8_static import AscendW8A8LinearMethod
 from .w8a16 import AscendW8A16LinearMethod
 def is_mx_quant_type(instance: Any) -> bool:
    """Checks if the quantization method is a microscaling (MX) type."""
    MX_QUANT_TYPES = (AscendW8A8MXFP8DynamicLinearMethod, )
    return isinstance(instance, MX_QUANT_TYPES)
 __all__ = [
    # Base classes
    "AscendAttentionScheme",
    "AscendLinearScheme",
    "AscendMoEScheme",
    "QuantType",
    # Registry functions
    "register_scheme",
    "get_scheme_class",
    # Utility functions
    "is_mx_quant_type",
    # Scheme classes
    "AscendW8A8LinearMethod",
    "AscendW8A8DynamicLinearMethod",
    "AscendW8A8DynamicFusedMoEMethod",
    "AscendW8A8MXFP8DynamicLinearMethod",
    "AscendW8A8PDMixLinearMethod",
    "AscendW8A8PDMixFusedMoeMethod",
    "AscendW8A16LinearMethod",
    "AscendW4A8DynamicLinearMethod",
    "AscendW4A8DynamicFusedMoEMethod",
    "AscendW4A16FusedMoEMethod",
    "AscendW4A4FlatQuantDynamicLinearMethod",
    "AscendW4A4LaosDynamicLinearMethod",
 ]
--- a/vllm_ascend/quantization/methods/base.py
+++ b/vllm_ascend/quantization/methods/base.py
@@ -0,0 +1,279 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # This file is a part of the vllm-ascend project.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 """Abstract base classes for Ascend quantization schemes."""
 from abc import ABC, abstractmethod
 from enum import Enum
 from typing import Any, Callable, Dict, Optional
 import torch
 class QuantType(Enum):
    """Quantization type enum for MoE schemes."""
    NONE = 0
    W8A8 = 1
    W4A8 = 2
 class AscendLinearScheme(ABC):
    """Base class for all linear quantization schemes.
    Subclasses must implement get_weight() and apply() methods.
    Other methods have default implementations that return empty dicts
    or do nothing.
    """
    @abstractmethod
    def get_weight(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, Any]:
        """Return weight tensor specifications.
        Args:
            input_size: Input dimension of the linear layer.
            output_size: Output dimension of the linear layer.
            params_dtype: Data type for parameters.
        Returns:
            Dictionary mapping parameter names to empty tensors with
            the correct shape and dtype.
        """
        ...
    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
        """Return per-tensor parameter specifications (e.g., input_scale).
        Args:
            params_dtype: Data type for parameters.
        Returns:
            Dictionary mapping parameter names to empty tensors.
        """
        return {}
    def get_perchannel_param(self, output_size: int,
                             params_dtype: torch.dtype) -> Dict[str, Any]:
        """Return per-channel parameter specifications (e.g., weight_scale).
        Args:
            output_size: Output dimension of the linear layer.
            params_dtype: Data type for parameters.
        Returns:
            Dictionary mapping parameter names to empty tensors.
        """
        return {}
    def get_pergroup_param(self,
                           input_size: int,
                           output_size: int,
                           params_dtype: torch.dtype,
                           layer_type: Optional[str] = None) -> Dict[str, Any]:
        """Return per-group parameter specifications.
        Args:
            input_size: Input dimension of the linear layer.
            output_size: Output dimension of the linear layer.
            params_dtype: Data type for parameters.
            layer_type: Type of layer (e.g., "row" for RowParallelLinear).
        Returns:
            Dictionary mapping parameter names to empty tensors.
        """
        return {}
    @abstractmethod
    def apply(self,
              layer: torch.nn.Module,
              x: torch.Tensor,
              bias: Optional[torch.Tensor] = None,
              tp_rank: Optional[int] = 0) -> torch.Tensor:
        """Forward computation.
        Args:
            layer: The linear layer module.
            x: Input tensor.
            bias: Optional bias tensor.
            tp_rank: Tensor parallel rank.
        Returns:
            Output tensor after quantized linear operation.
        """
        ...
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        """Post-loading weight processing (transpose, format conversion, etc.).
        Args:
            layer: The linear layer module.
        """
        pass
 class AscendAttentionScheme(ABC):
    """Base class for all attention quantization schemes.
    Subclasses must implement the apply() method.
    Other methods have default implementations.
    """
    def create_weights(self, layer: torch.nn.Module) -> None:
        """Create weights for attention quantization.
        Args:
            layer: The attention layer module.
        """
        pass
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        """Post-loading weight processing for attention layer.
        Args:
            layer: The attention layer module.
        """
        pass
    @abstractmethod
    def apply(self, layer: torch.nn.Module, query: torch.Tensor,
              key: torch.Tensor, value: torch.Tensor, kv_cache, attn_metadata,
              attn_type, scale, output) -> torch.Tensor:
        """Forward computation for attention layer.
        Args:
            layer: The attention layer module.
            query: Query tensor.
            key: Key tensor.
            value: Value tensor.
            kv_cache: KV cache.
            attn_metadata: Attention metadata.
            attn_type: Attention type.
            scale: Scale factor.
            output: Output tensor.
        Returns:
            Output tensor after attention computation.
        """
        ...
 class AscendMoEScheme(ABC):
    """Base class for all MoE quantization schemes.
    Subclasses must implement get_weight(), get_dynamic_quant_param(),
    and apply() methods.
    Attributes:
        quant_type: The quantization type for this scheme. Subclasses should
                   override this class attribute to declare their quant type.
    """
    # Default quant type - subclasses should override this
    quant_type: QuantType = QuantType.NONE
    @abstractmethod
    def get_weight(self, num_experts: int,
                   intermediate_size_per_partition: int, hidden_sizes: int,
                   params_dtype: torch.dtype) -> Dict[str, Any]:
        """Return weight tensor specifications for MoE layer.
        Args:
            num_experts: Number of experts.
            intermediate_size_per_partition: Intermediate size per partition.
            hidden_sizes: Hidden dimension size.
            params_dtype: Data type for parameters.
        Returns:
            Dictionary mapping parameter names to empty tensors.
        """
        ...
    @abstractmethod
    def get_dynamic_quant_param(self, num_experts: int,
                                intermediate_size_per_partition: int,
                                hidden_sizes: int,
                                params_dtype: torch.dtype) -> Dict[str, Any]:
        """Return dynamic quantization parameters for MoE layer.
        Args:
            num_experts: Number of experts.
            intermediate_size_per_partition: Intermediate size per partition.
            hidden_sizes: Hidden dimension size.
            params_dtype: Data type for parameters.
        Returns:
            Dictionary mapping parameter names to empty tensors.
        """
        ...
    @abstractmethod
    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        router_logits: torch.Tensor,
        top_k: int,
        renormalize: bool,
        use_grouped_topk: bool = False,
        global_num_experts: int = -1,
        expert_map: Optional[torch.Tensor] = None,
        topk_group: Optional[int] = None,
        num_expert_group: Optional[int] = None,
        custom_routing_function: Optional[Callable] = None,
        scoring_func: str = "softmax",
        routed_scaling_factor: float = 1.0,
        e_score_correction_bias: Optional[torch.Tensor] = None,
        is_prefill: bool = True,
        enable_force_load_balance: bool = False,
        log2phy: Optional[torch.Tensor] = None,
        global_redundant_expert_num: int = 0,
        **kwargs,
    ) -> torch.Tensor:
        """Forward computation for MoE layer.
        Args:
            layer: The MoE layer module.
            x: Input hidden states.
            router_logits: Router logits for expert selection.
            top_k: Number of experts to select per token.
            renormalize: Whether to renormalize expert weights.
            use_grouped_topk: Whether to use grouped top-k selection.
            global_num_experts: Total number of experts globally.
            expert_map: Mapping from global to local expert indices.
            topk_group: Number of expert groups selected per token.
            num_expert_group: Number of expert groups.
            custom_routing_function: Custom routing function.
            scoring_func: Scoring function name.
            routed_scaling_factor: Scaling factor for routed experts.
            e_score_correction_bias: Expert score correction bias.
            is_prefill: Whether in prefill phase.
            enable_force_load_balance: Whether to force load balancing.
            log2phy: Logical to physical expert mapping.
            global_redundant_expert_num: Number of redundant experts.
            **kwargs: Additional keyword arguments.
        Returns:
            Output tensor after MoE computation.
        """
        ...
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        """Post-loading weight processing for MoE layer.
        Args:
            layer: The MoE layer module.
        """
        pass
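The contract set up by the base classes above can be sketched without torch. This is a minimal illustration, assuming a simplified stand-in: `LinearSchemeSketch`, `IncompleteScheme`, and `IdentityScheme` are hypothetical names, and the signatures are reduced versions of those in `AscendLinearScheme`. It shows the two properties the docstrings describe: abstract methods must be implemented before a scheme can be instantiated, and the optional hooks default to empty dicts.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class LinearSchemeSketch(ABC):
    """Torch-free stand-in that mirrors the shape of AscendLinearScheme."""

    @abstractmethod
    def get_weight(self, input_size: int, output_size: int) -> Dict[str, Any]:
        ...

    def get_perchannel_param(self, output_size: int) -> Dict[str, Any]:
        # Optional hook: the default means "no extra parameters".
        return {}

    @abstractmethod
    def apply(self, x: List[float], tp_rank: Optional[int] = 0) -> List[float]:
        ...


class IncompleteScheme(LinearSchemeSketch):
    # apply() is not implemented, so this class cannot be instantiated.
    def get_weight(self, input_size, output_size):
        return {"weight": (output_size, input_size)}


class IdentityScheme(LinearSchemeSketch):
    def get_weight(self, input_size, output_size):
        return {"weight": (output_size, input_size)}

    def apply(self, x, tp_rank=0):
        return list(x)


try:
    IncompleteScheme()
    instantiation_error = None
except TypeError as exc:
    instantiation_error = exc

scheme = IdentityScheme()
```

A wrapper that iterates over `scheme.get_perchannel_param(...)` simply registers nothing when a scheme keeps the default, which is why the wrapper loops need no special-casing.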
--- a/vllm_ascend/quantization/methods/registry.py
+++ b/vllm_ascend/quantization/methods/registry.py
@@ -0,0 +1,62 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # This file is a part of the vllm-ascend project.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 from typing import Any, Dict, Optional, Tuple, Type
 # Registry: maps (quant_type, layer_type) -> SchemeClass
 _SCHEME_REGISTRY: Dict[Tuple[str, str], Type[Any]] = {}
 def register_scheme(quant_type: str, layer_type: str):
    """Decorator to register a quantization scheme.
    Args:
        quant_type: Quantization type (e.g., "W8A8", "W8A8_DYNAMIC").
        layer_type: Layer type (e.g., "linear", "moe").
    Returns:
        Decorator function that registers the class.
    Example:
        @register_scheme("W8A8_DYNAMIC", "linear")
        class W8A8DynamicLinearScheme(AscendLinearScheme):
            ...
    """
    def decorator(cls: Type[Any]) -> Type[Any]:
        key = (quant_type, layer_type)
        if key in _SCHEME_REGISTRY:
            raise ValueError(
                f"Scheme already registered for {quant_type}/{layer_type}: "
                f"{_SCHEME_REGISTRY[key].__name__}")
        _SCHEME_REGISTRY[key] = cls
        return cls
    return decorator
 def get_scheme_class(quant_type: str, layer_type: str) -> Optional[Type[Any]]:
    """Get scheme class for given quant_type and layer_type.
    Args:
        quant_type: Quantization type (e.g., "W8A8", "W8A8_DYNAMIC").
        layer_type: Layer type (e.g., "linear", "moe").
    Returns:
        The registered scheme class, or None if not found.
    """
    return _SCHEME_REGISTRY.get((quant_type, layer_type))
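The registry's behavior can be exercised in isolation. The sketch below copies `register_scheme` and `get_scheme_class` from the file above into a standalone snippet so it runs without `vllm_ascend`; `DemoLinearScheme`, the `"DEMO_QUANT"` key, and the `resolve_scheme` helper are hypothetical and exist only for illustration. It shows that a lookup miss returns `None` (so callers need their own error handling) and that registering the same key twice raises `ValueError`.

```python
from typing import Any, Dict, Optional, Tuple, Type

# Standalone copy of the registry so the example has no dependencies.
_SCHEME_REGISTRY: Dict[Tuple[str, str], Type[Any]] = {}


def register_scheme(quant_type: str, layer_type: str):
    def decorator(cls: Type[Any]) -> Type[Any]:
        key = (quant_type, layer_type)
        if key in _SCHEME_REGISTRY:
            raise ValueError(
                f"Scheme already registered for {quant_type}/{layer_type}: "
                f"{_SCHEME_REGISTRY[key].__name__}")
        _SCHEME_REGISTRY[key] = cls
        return cls
    return decorator


def get_scheme_class(quant_type: str, layer_type: str) -> Optional[Type[Any]]:
    return _SCHEME_REGISTRY.get((quant_type, layer_type))


# Hypothetical scheme, registered at class-definition (import) time.
@register_scheme("DEMO_QUANT", "linear")
class DemoLinearScheme:
    pass


def resolve_scheme(quant_type: str, layer_type: str) -> Type[Any]:
    """Hypothetical helper: turn the registry's None result into an error."""
    scheme_cls = get_scheme_class(quant_type, layer_type)
    if scheme_cls is None:
        raise NotImplementedError(
            f"No scheme registered for {quant_type}/{layer_type}")
    return scheme_cls


found = resolve_scheme("DEMO_QUANT", "linear")
missing = get_scheme_class("DEMO_QUANT", "moe")  # no "moe" entry registered

try:
    # Re-registering the same (quant_type, layer_type) key is rejected.
    register_scheme("DEMO_QUANT", "linear")(DemoLinearScheme)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)
```

Because registration happens as a side effect of class definition, a scheme is only discoverable once its module has been imported, which is what the imports in the `methods` package initializer take care of.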
--- a/vllm_ascend/quantization/methods/w4a16.py
+++ b/vllm_ascend/quantization/methods/w4a16.py
@@ -1,277 +1,278 @@
-#
+#
-# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
-# This file is a part of the vllm-ascend project.
+# This file is a part of the vllm-ascend project.
-#
+#
-# Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
+# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
+# You may obtain a copy of the License at
-#
+#
-#     http://www.apache.org/licenses/LICENSE-2.0
+#     http://www.apache.org/licenses/LICENSE-2.0
-#
+#
-# Unless required by applicable law or agreed to in writing, software
+# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
+# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
-# limitations under the License.
+# limitations under the License.
-#
+#
-
+
-from typing import Any, Callable, Dict, Optional
+from typing import Any, Callable, Dict, Optional
-
+
-import torch
+import torch
-import torch_npu
+import torch_npu
-from vllm.config import get_current_vllm_config
+from vllm.config import get_current_vllm_config
-from vllm.forward_context import get_forward_context
+from vllm.forward_context import get_forward_context
-
+
-from vllm_ascend.ascend_config import get_ascend_config
+from vllm_ascend.ascend_config import get_ascend_config
-from vllm_ascend.ops.fused_moe.experts_selector import select_experts
+from vllm_ascend.ops.fused_moe.experts_selector import select_experts
-
+
-
+from .base import AscendMoEScheme
-def unpack_from_int32(
+from .registry import register_scheme
-    weight: torch.Tensor,
+
-    shape: torch.Size,
+
-    num_bits: int,
+def unpack_from_int32(
-    packed_dim: int = 1,
+    weight: torch.Tensor,
-) -> torch.Tensor:
+    shape: torch.Size,
-    """
+    num_bits: int,
-    Unpacks quantized weights from int32 format back to original bits.
+    packed_dim: int = 1,
-
+) -> torch.Tensor:
-    :param weight: The packed int32 tensor containing quantized weights
+    """Unpacks quantized weights from int32 format back to original bits.
-    :param shape: Original shape to restore, defaults to None
+
-    :param num_bits: The number of bits used for quantization (<= 8)
+    :param weight: The packed int32 tensor containing quantized weights
-    :param packed_dim: Dimension along which weights are packed (0 or 1), defaults to 1
+    :param shape: Original shape to restore
-    :return: Unpacked tensor with int8 dtype after applying offset correction
+    :param num_bits: The number of bits used for quantization (<= 8)
-    """
+    :param packed_dim: Dimension along which weights are packed (0 or 1), defaults to 1
-    assert weight.dtype == torch.int32, f"Expecting `weight.dtype` is torch.int32 but got {weight.dtype}."
+    :return: Unpacked tensor with int8 dtype after applying offset correction
-    assert num_bits <= 8, f"Expecting `num_bits` should not be larger than 8 but got {num_bits}."
+    """
-
+    assert weight.dtype == torch.int32, f"Expecting `weight.dtype` is torch.int32 but got {weight.dtype}."
-    pack_factor = 32 // num_bits
+    assert num_bits <= 8, f"Expecting `num_bits` should not be larger than 8 but got {num_bits}."
-    mask = (1 << num_bits) - 1
+
-
+    pack_factor = 32 // num_bits
-    if packed_dim == 1:
+    mask = (1 << num_bits) - 1
-        unpacked_weight = torch.zeros(
+
-            (weight.shape[0], weight.shape[1] * pack_factor),
+    if packed_dim == 1:
-            device=weight.device,
+        unpacked_weight = torch.zeros(
-            dtype=torch.int32,
+            (weight.shape[0], weight.shape[1] * pack_factor),
-        )
+            device=weight.device,
-        for i in range(pack_factor):
+            dtype=torch.int32,
-            unpacked_weight[:, i::pack_factor] = (weight >>
+        )
-                                                  (num_bits * i)) & mask
+        for i in range(pack_factor):
-        original_row_size = int(shape[1])
+            unpacked_weight[:, i::pack_factor] = (weight >>
-        unpacked_weight = unpacked_weight[:, :original_row_size]
+                                                  (num_bits * i)) & mask
-    else:
+        original_row_size = int(shape[1])
-        unpacked_weight = torch.zeros(
+        unpacked_weight = unpacked_weight[:, :original_row_size]
-            (weight.shape[0] * pack_factor, weight.shape[1]),
+    else:
-            device=weight.device,
+        unpacked_weight = torch.zeros(
-            dtype=torch.int32,
+            (weight.shape[0] * pack_factor, weight.shape[1]),
-        )
+            device=weight.device,
-        for i in range(pack_factor):
+            dtype=torch.int32,
-            unpacked_weight[i::pack_factor, :] = (weight >>
+        )
-                                                  (num_bits * i)) & mask
+        for i in range(pack_factor):
-        original_row_size = int(shape[0])
+            unpacked_weight[i::pack_factor, :] = (weight >>
-        unpacked_weight = unpacked_weight[:original_row_size, :]
+                                                  (num_bits * i)) & mask
-
+        original_row_size = int(shape[0])
-    offset = pow(2, num_bits) // 2
+        unpacked_weight = unpacked_weight[:original_row_size, :]
-    unpacked_weight = (unpacked_weight - offset).to(torch.int8)
+
-
+    offset = pow(2, num_bits) // 2
-    return unpacked_weight
+    unpacked_weight = (unpacked_weight - offset).to(torch.int8)
-
+
-
+    return unpacked_weight
-def pack_to_int32(weight: torch.Tensor) -> torch.Tensor:
+
-    """
+
-    Packs quantized weights into int32 format for storage.
+def pack_to_int32(weight: torch.Tensor) -> torch.Tensor:
-
+    """Packs quantized weights into int32 format for storage.
-    :param weight: The 3D tensor to pack, must be int8 or int32 dtype
+
-    :return: Packed tensor with int32 dtype optimized for storage
+    :param weight: The 3D tensor to pack, must be int8 or int32 dtype
-    """
+    :return: Packed tensor with int32 dtype optimized for storage
-    assert weight.dim(
+    """
-    ) == 3, f"Expecting `weight.dim()` is 3 ([e, n, k] or [e, k, n]) but got {weight.dim()}."
+    assert weight.dim(
-    assert weight.dtype in [
+    ) == 3, f"Expecting `weight.dim()` is 3 ([e, n, k] or [e, k, n]) but got {weight.dim()}."
-        torch.int8, torch.int32
+    assert weight.dtype in [
-    ], f"Expecting `weight.dtype` is torch.int8 or torch.int32 bug got {weight.dtype}."
+        torch.int8, torch.int32
-
+    ], f"Expecting `weight.dtype` is torch.int8 or torch.int32 but got {weight.dtype}."
-    if weight.dtype == torch.int32:
+
-        assert weight.shape[
+    if weight.dtype == torch.int32:
-            -1] % 8 == 0, "the last dim of weight needs to be divided by 8."
+        assert weight.shape[
-        packed_weight = torch_npu.npu_convert_weight_to_int4pack(
+            -1] % 8 == 0, "the last dim of weight needs to be divided by 8."
-            weight.flatten(0, 1))
+        packed_weight = torch_npu.npu_convert_weight_to_int4pack(
-        packed_weight = packed_weight.view(weight.shape[0], weight.shape[1],
+            weight.flatten(0, 1))
-                                           -1)
+        packed_weight = packed_weight.view(weight.shape[0], weight.shape[1],
-    else:
+                                           -1)
-        assert weight.shape[
+    else:
-            -1] % 4 == 0, "the last dim of weight needs to be divided by 4."
+        assert weight.shape[
-        packed_weight = weight.view(torch.int32).contiguous()
+            -1] % 4 == 0, "the last dim of weight needs to be divided by 4."
-
+        packed_weight = weight.view(torch.int32).contiguous()
-    return packed_weight
+
-
+    return packed_weight
-
+
-class AscendW4A16FusedMoEMethod:
+
-    """FusedMoe method for Ascend W4A16.
+@register_scheme("W4A16", "moe")
-    """
+class AscendW4A16FusedMoEMethod(AscendMoEScheme):
-
+    """FusedMoE method for Ascend W4A16."""
+
     def __init__(self) -> None:
         self.transpose_weight = True
         self.num_bits = 4  # dtype = torch.int4
         self.pack_factor = 8  # pack 8 of torch.int4 tensors to torch.int32

         vllm_config = get_current_vllm_config()
         self.group_size = vllm_config.quant_config.quant_description.get(
             "group_size", 32)
         self.dynamic_eplb = get_ascend_config().eplb_config.dynamic_eplb

     def get_weight(
         self,
         num_experts: int,
         intermediate_size_per_partition: int,
         hidden_sizes: int,
         params_dtype: torch.dtype,
     ) -> Dict[str, Any]:
         assert intermediate_size_per_partition % self.pack_factor == 0, f"Expecting `intermediate_size_per_partition` {intermediate_size_per_partition} can be divided by `pack_factor` {self.pack_factor}"
         assert hidden_sizes % self.pack_factor == 0, f"Expecting `hidden_sizes` {hidden_sizes} can be divided by `pack_factor` {self.pack_factor}"

         param_dict = {}

         param_dict["w13_weight_packed"] = torch.empty(
             num_experts,
             2 * intermediate_size_per_partition,
             hidden_sizes // self.pack_factor,
             dtype=torch.int32)
         param_dict["w2_weight_packed"] = torch.empty(
             num_experts,
             hidden_sizes,
             intermediate_size_per_partition // self.pack_factor,
             dtype=torch.int32)

         return param_dict

     def get_dynamic_quant_param(
         self,
         num_experts: int,
         intermediate_size_per_partition: int,
         hidden_sizes: int,
         params_dtype: torch.dtype,
     ) -> Dict[str, Any]:
         assert intermediate_size_per_partition % self.group_size == 0, f"Expecting `intermediate_size_per_partition` {intermediate_size_per_partition} can be divided by `group_size` {self.group_size}"
         assert hidden_sizes % self.group_size == 0, f"Expecting `hidden_sizes` {hidden_sizes} can be divided by `group_size` {self.group_size}"

         param_dict = {}

         param_dict["w13_weight_scale"] = torch.empty(
             num_experts,
             2 * intermediate_size_per_partition,
             hidden_sizes // self.group_size,
             dtype=torch.bfloat16)
         param_dict["w2_weight_scale"] = torch.empty(
             num_experts,
             hidden_sizes,
             intermediate_size_per_partition // self.group_size,
             dtype=torch.bfloat16)
         param_dict["w13_weight_shape"] = torch.empty(num_experts,
                                                      2,
                                                      dtype=torch.int32)
         param_dict["w2_weight_shape"] = torch.empty(num_experts,
                                                     2,
                                                     dtype=torch.int32)
         param_dict["w13_weight_offset"] = torch.zeros(
             num_experts,
             2 * intermediate_size_per_partition,
             hidden_sizes // self.group_size,
             dtype=torch.bfloat16)
         param_dict["w2_weight_offset"] = torch.zeros(
             num_experts,
             hidden_sizes,
             intermediate_size_per_partition // self.group_size,
             dtype=torch.bfloat16)

         return param_dict

     def apply(
         self,
         layer: torch.nn.Module,
         x: torch.Tensor,
         router_logits: torch.Tensor,
         top_k: int,
         renormalize: bool,
         use_grouped_topk: bool = False,
         global_num_experts: int = -1,
         expert_map: Optional[torch.Tensor] = None,
         topk_group: Optional[int] = None,
         num_expert_group: Optional[int] = None,
         custom_routing_function: Optional[Callable] = None,
         scoring_func: str = "softmax",
         routed_scaling_factor: float = 1.0,
         e_score_correction_bias: Optional[torch.Tensor] = None,
         is_prefill: bool = True,
         enable_force_load_balance: bool = True,
-        log2phy: torch.Tensor = None,
+        log2phy: Optional[torch.Tensor] = None,
         global_redundant_expert_num: int = 0,
         **kwargs,
     ) -> torch.Tensor:
         assert router_logits.shape[
             1] == global_num_experts - global_redundant_expert_num, "Number of global experts mismatch (excluding redundancy)"

         topk_weights, topk_ids = select_experts(
             hidden_states=x,
             router_logits=router_logits,
             top_k=top_k,
             use_grouped_topk=use_grouped_topk,
             renormalize=renormalize,
             topk_group=topk_group,
             num_expert_group=num_expert_group,
             custom_routing_function=custom_routing_function,
             scoring_func=scoring_func,
             e_score_correction_bias=e_score_correction_bias,
             global_num_experts=global_num_experts)

         topk_ids = topk_ids.to(torch.int32)
         topk_weights = topk_weights.to(x.dtype)

         moe_comm_method = get_forward_context().moe_comm_method
-        return moe_comm_method.fused_experts(hidden_states=x,
-                                             w1=layer.w13_weight_packed,
-                                             w2=layer.w2_weight_packed,
-                                             w1_scale=layer.w13_weight_scale,
-                                             w2_scale=layer.w2_weight_scale,
-                                             w1_offset=layer.w13_weight_offset,
-                                             w2_offset=layer.w2_weight_offset,
-                                             topk_weights=topk_weights,
-                                             topk_ids=topk_ids,
-                                             use_int4_w4a16=True,
-                                             expert_map=expert_map,
-                                             log2phy=log2phy,
-                                             dynamic_eplb=self.dynamic_eplb,
-                                             mc2_mask=kwargs.get(
-                                                 "mc2_mask", None))
+        return moe_comm_method.fused_experts(
+            hidden_states=x,
+            w1=layer.w13_weight_packed,
+            w2=layer.w2_weight_packed,
+            w1_scale=layer.w13_weight_scale,
+            w2_scale=layer.w2_weight_scale,
+            w1_offset=layer.w13_weight_offset,
+            w2_offset=layer.w2_weight_offset,
+            topk_weights=topk_weights,
+            topk_ids=topk_ids,
+            use_int4_w4a16=True,
+            expert_map=expert_map,
+            log2phy=log2phy,
+            dynamic_eplb=self.dynamic_eplb,
+            mc2_mask=kwargs.get("mc2_mask", None))

     def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         if self.transpose_weight:
             w13_shape = layer.w13_weight_packed.data.shape
             w2_shape = layer.w2_weight_packed.data.shape
             unpacked_w13_weight = (unpack_from_int32(
                 layer.w13_weight_packed.data.flatten(0, 1),
                 torch.Size([
                     w13_shape[0] * w13_shape[1],
                     w13_shape[2] * self.pack_factor
                 ]),
                 self.num_bits,
             ).view(w13_shape[0], w13_shape[1],
                    -1).transpose(1, 2).contiguous().int())
             unpacked_w2_weight = (unpack_from_int32(
                 layer.w2_weight_packed.data.flatten(0, 1),
                 torch.Size([
                     w2_shape[0] * w2_shape[1], w2_shape[2] * self.pack_factor
                 ]),
                 self.num_bits,
             ).view(w2_shape[0], w2_shape[1],
                    -1).transpose(1, 2).contiguous().int())
             layer.w13_weight_packed.data = pack_to_int32(unpacked_w13_weight)
             layer.w2_weight_packed.data = pack_to_int32(unpacked_w2_weight)

             layer.w13_weight_scale.data = layer.w13_weight_scale.data.transpose(
                 1, 2).contiguous()
             layer.w2_weight_scale.data = layer.w2_weight_scale.data.transpose(
                 1, 2).contiguous()

             layer.w13_weight_offset.data = layer.w13_weight_offset.data.transpose(
                 1, 2).contiguous()
             layer.w2_weight_offset.data = layer.w2_weight_offset.data.transpose(
                 1, 2).contiguous()
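The W4A16 scheme above stores eight 4-bit weights per `torch.int32` element (`pack_factor = 8`), which is why the packed tensors use `hidden_sizes // self.pack_factor` as their last dimension. The sketch below illustrates that idea in plain Python. It is not the `pack_to_int32` / `unpack_from_int32` helpers the diff calls: the function names are hypothetical, and the lowest-nibble-first ordering and two's-complement encoding are assumptions made for illustration only.

```python
from typing import List

NUM_BITS = 4
PACK_FACTOR = 32 // NUM_BITS  # eight int4 values per 32-bit word


def pack_int4_row(values: List[int]) -> List[int]:
    """Pack signed int4 values (-8..7) into signed 32-bit words (hypothetical helper)."""
    if len(values) % PACK_FACTOR != 0:
        raise ValueError("length must be divisible by the pack factor")
    words = []
    for start in range(0, len(values), PACK_FACTOR):
        word = 0
        for i, v in enumerate(values[start:start + PACK_FACTOR]):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed int4")
            # Assumed layout: value i occupies bits [4*i, 4*i + 3].
            word |= (v & 0xF) << (i * NUM_BITS)
        # Reinterpret the unsigned 32-bit pattern as a signed int32.
        if word >= 1 << 31:
            word -= 1 << 32
        words.append(word)
    return words


def unpack_int4_row(words: List[int]) -> List[int]:
    """Inverse of pack_int4_row under the same assumed layout."""
    values = []
    for word in words:
        word &= 0xFFFFFFFF
        for i in range(PACK_FACTOR):
            nibble = (word >> (i * NUM_BITS)) & 0xF
            # Nibbles 8..15 encode -8..-1 in two's complement.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values
```

Sixteen int4 values pack into two words, mirroring the `// self.pack_factor` shape arithmetic in `get_weight`.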
--- a/vllm_ascend/quantization/methods/w4a4_flatquant.py
+++ b/vllm_ascend/quantization/methods/w4a4_flatquant.py
@@ -14,16 +14,21 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 import math
 from typing import Any, Dict, Optional, Tuple
 import torch
 import torch_npu
+from .base import AscendLinearScheme
+from .registry import register_scheme
 KRONECKER_QUANT_MAX_BATCH_SIZE = 32768
 def pack_int4_weights(weight_tensor: torch.Tensor) -> torch.Tensor:
+    """Pack int4 weights for NPU."""
    original_device = weight_tensor.device
    weight_tensor_npu = weight_tensor.npu()
    weight_int4_packed = torch_npu.npu_convert_weight_to_int4pack(
@@ -32,6 +37,7 @@ def pack_int4_weights(weight_tensor: torch.Tensor) -> torch.Tensor:
 def get_decompose_dim(n):
+    """Get decomposed dimensions for Kronecker quantization."""
    a = int(math.sqrt(n))
    if a * a < n:
        a += 1
@@ -53,6 +59,7 @@ def batched_kronecker_quant(
    right_trans: torch.Tensor,
    clip_ratio: float,
 ) -> Tuple[torch.Tensor, torch.Tensor]:
+    """Batched Kronecker quantization with batch size limit handling."""
    batch_tokens = x.shape[0]
    if batch_tokens <= KRONECKER_QUANT_MAX_BATCH_SIZE:
        return torch_npu.npu_kronecker_quant(x,
@@ -75,7 +82,8 @@ def batched_kronecker_quant(
    return x_quantized_int4, activation_scale
-class AscendW4A4FlatQuantDynamicLinearMethod:
+@register_scheme("W4A4_FLATQUANT_DYNAMIC", "linear")
+class AscendW4A4FlatQuantDynamicLinearMethod(AscendLinearScheme):
    """Linear method for Ascend W4A4_FLATQUANT_DYNAMIC.
    This class implements W4A4 quantization with FlatQuant approach and dynamic activation quantization.
@@ -88,8 +96,7 @@ class AscendW4A4FlatQuantDynamicLinearMethod:
    def __init__(self):
        self.sym = True
-    @staticmethod
-    def get_weight(input_size: int, output_size: int,
+    def get_weight(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, Any]:
        if input_size % 8 != 0:
            raise ValueError(
@@ -101,8 +108,7 @@ class AscendW4A4FlatQuantDynamicLinearMethod:
        }
        return params_dict
-    @staticmethod
-    def get_pertensor_param(params_dtype: torch.dtype) -> Dict[str, Any]:
+    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
        params_dict = {}
        left_trans_dim, right_trans_dim = get_decompose_dim(
            AscendW4A4FlatQuantDynamicLinearMethod.input_size)
@@ -115,8 +121,8 @@ class AscendW4A4FlatQuantDynamicLinearMethod:
        params_dict["clip_ratio"] = torch.empty(1, dtype=torch.float32)
        return params_dict
-    @staticmethod
     def get_perchannel_param(
+        self,
        output_size: int,
        params_dtype: torch.dtype,
    ) -> Dict[str, Any]:
@@ -129,15 +135,8 @@ class AscendW4A4FlatQuantDynamicLinearMethod:
                                                   dtype=torch.float32)
        return params_dict
-    def get_pergroup_param(self,
-                           input_size: int,
-                           output_size: int,
-                           params_dtype: torch.dtype,
-                           layer_type: Optional[str] = None) -> Dict[str, Any]:
-        return {}
-    @staticmethod
     def apply(
+        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
--- a/vllm_ascend/quantization/methods/w4a4_laos_dynamic.py
+++ b/vllm_ascend/quantization/methods/w4a4_laos_dynamic.py
@@ -14,22 +14,31 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-from typing import Any, Callable, Dict, Optional
+
+from typing import Any, Dict, Optional
 import torch
 import torch_npu
 import torch.nn.functional as F
+from .base import AscendLinearScheme
+from .registry import register_scheme
-class AscendW4A4LaosDynamicLinearMethod:
-    """Linear method for Ascend W4A4_LAOS_DYNAMIC.
+@register_scheme("W4A4_DYNAMIC", "linear")
+class AscendW4A4LaosDynamicLinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W4A4_DYNAMIC.
    This class implements W4A4 quantization with LAOS approach and dynamic activation quantization.
    - Weight: 4-bit quantization (per-channel) with scale and offset, stored as int8.
    - Activation: 4-bit dynamic quantization.
    """
    def __init__(self):
        self.transpose_weight = True
        self.rotation_type = None
-    def set_rotation_config(self, prefix, metadata):
+    def set_rotation_config(self, prefix: str, metadata: Dict) -> Optional[str]:
        """Set rotation config based on prefix and metadata."""
        layer_idx = prefix.split(".")[2]
        if prefix.endswith("o_proj"):
            layers = metadata["quarot"]["heads_rotation"]["layers"]
@@ -39,17 +48,17 @@ class AscendW4A4LaosDynamicLinearMethod:
            layers = metadata["quarot"]["kronecker_rotation"]["layers"]
            if layer_idx in layers:
                return "kronecker_rotation"
        return None
-    @staticmethod
-    def get_weight(input_size: int, output_size: int, params_dtype: torch.dtype) -> Dict[str, Any]:
-        params_dict = {"weight": torch.empty(output_size, input_size, dtype=torch.int8)}
+    def get_weight(self, input_size: int, output_size: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        params_dict = {
+            "weight": torch.empty(output_size, input_size, dtype=torch.int8)
+        }
         return params_dict
-    
+
-    @staticmethod
-    def get_pertensor_param(params_dtype: torch.dtype) -> Dict[str, Any]:
-        return {}
-    def get_perchannel_param(self, output_size: int, params_dtype: torch.dtype) -> Dict[str, Any]:
+    def get_perchannel_param(self, output_size: int,
+                             params_dtype: torch.dtype) -> Dict[str, Any]:
        params_dict = {}
        params_dict["weight_scale"] = torch.empty(output_size,
                                                  1,
@@ -58,20 +67,18 @@ class AscendW4A4LaosDynamicLinearMethod:
                                                   1,
                                                   dtype=torch.float32)
        if self.rotation_type == "heads_rotation":
-            params_dict["heads_rotation"] = torch.zeros((64, 64), dtype=torch.float32)
+            params_dict["heads_rotation"] = torch.zeros((64, 64),
+                                                        dtype=torch.float32)
         if self.rotation_type == "kronecker_rotation":
-            params_dict["kronecker_rotation_n"] = torch.zeros((160, 160), dtype=torch.float32)
-            params_dict["kronecker_rotation_m"] = torch.zeros((160, 160), dtype=torch.float32)
+            params_dict["kronecker_rotation_n"] = torch.zeros(
+                (160, 160), dtype=torch.float32)
+            params_dict["kronecker_rotation_m"] = torch.zeros(
+                (160, 160), dtype=torch.float32)
        return params_dict
-    def get_pergroup_param(self,
-                           input_size: int,
-                           output_size: int,
-                           params_dtype: torch.dtype, 
-                           layer_type: Optional[str] = None) -> Dict[str, Any]:
-        return {}
-    def apply_rotation(self, layer, x):
+    def apply_rotation(self, layer: torch.nn.Module,
+                       x: torch.Tensor) -> torch.Tensor:
+        """Apply rotation transformation to input tensor."""
        init_shape = x.shape
        dtype = x.dtype
        if self.rotation_type == "heads_rotation":
@@ -94,17 +101,26 @@ class AscendW4A4LaosDynamicLinearMethod:
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
-        tp_rank: Optional[int] = None,
+        tp_rank: Optional[int] = 0,
    ) -> torch.Tensor:
        dtype = x.dtype
        x, pertoken_scale = torch_npu.npu_dynamic_quant(x, dst_type=torch.quint4x2)
        pertoken_scale = pertoken_scale.reshape(-1, 1)
        pertoken_scale = pertoken_scale.squeeze(-1)
-        y2 = torch_npu.npu_quant_matmul(x, layer.weight.data, scale=layer.weight_scale.data.view(-1), pertoken_scale=pertoken_scale, bias=None, output_dtype=dtype)
-        return y2
+        output = torch_npu.npu_quant_matmul(
+            x,
+            layer.weight.data,
+            scale=layer.weight_scale.data.view(-1),
+            pertoken_scale=pertoken_scale,
+            bias=None,
+            output_dtype=dtype)
+        if bias is not None:
+            output = output + bias.to(dtype)
+        return output
-    def process_weights_after_loading(self, layer: torch.nn.Module):
+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        layer.weight_scale.data = layer.weight_scale.data.to(torch.float32)
-        layer.weight.data = torch_npu.npu_convert_weight_to_int4pack(layer.weight.data.to(torch.int32))
+        layer.weight.data = torch_npu.npu_convert_weight_to_int4pack(
+            layer.weight.data.to(torch.int32))
        if self.transpose_weight:
            layer.weight.data = layer.weight.data.transpose(-1, -2)
--- a/vllm_ascend/quantization/methods/w4a8.py
+++ b/vllm_ascend/quantization/methods/w4a8.py
@@ -29,10 +29,13 @@ from vllm_ascend.distributed.parallel_state import get_mc2_group
 from vllm_ascend.ops.fused_moe.experts_selector import select_experts
 from vllm_ascend.utils import maybe_trans_nz
+from .base import AscendLinearScheme, AscendMoEScheme, QuantType
+from .registry import register_scheme
+
-class AscendW4A8DynamicLinearMethod:
-    """Linear method for Ascend W4A8_DYNAMIC
-    """
+@register_scheme("W4A8_DYNAMIC", "linear")
+class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W4A8_DYNAMIC."""
    def __init__(self):
        vllm_config = get_current_vllm_config()
@@ -72,23 +75,12 @@ class AscendW4A8DynamicLinearMethod:
        return params_dict
-    @staticmethod
-    def get_pertensor_param(params_dtype: torch.dtype) -> Dict[str, Any]:
-        return {}
-    @staticmethod
-    def get_perchannel_param(output_size: int,
-                             params_dtype: torch.dtype) -> Dict[str, Any]:
-        return {}
    def get_pergroup_param(self,
                           input_size: int,
                           output_size: int,
                           params_dtype: torch.dtype,
                           layer_type: Optional[str] = None) -> Dict[str, Any]:
-        """
-        Create per-group quantization parameters.
-        """
+        """Create per-group quantization parameters."""
        params_dict = {}
        params_dict["weight_scale"] = torch.empty(output_size,
                                                  1,
@@ -121,8 +113,7 @@ class AscendW4A8DynamicLinearMethod:
                             scale: torch.Tensor,
                             per_group_scale: torch.Tensor,
                             is_new_quant: bool = False):
-        """
-        Process the scale for second-level quantization.
+        """Process the scale for second-level quantization.
        Args:
            weight: weight tensor [k, n] (in new version, n is already compressed to n/2)
@@ -207,9 +198,12 @@ class AscendW4A8DynamicLinearMethod:
                layer.weight.data.to(torch.int32))
-class AscendW4A8DynamicFusedMoEMethod:
-    """FusedMoe method for Ascend W4A8_DYNAMIC.
-    """
+@register_scheme("W4A8_DYNAMIC", "moe")
+class AscendW4A8DynamicFusedMoEMethod(AscendMoEScheme):
+    """FusedMoE method for Ascend W4A8_DYNAMIC."""
+    # Declare the quantization type for this scheme
+    quant_type: QuantType = QuantType.W4A8
    def __init__(self):
        self.ep_group = get_ep_group()
@@ -339,7 +333,7 @@ class AscendW4A8DynamicFusedMoEMethod:
        e_score_correction_bias: Optional[torch.Tensor] = None,
        is_prefill: bool = True,
        enable_force_load_balance: bool = False,
-        log2phy: torch.Tensor = None,
+        log2phy: Optional[torch.Tensor] = None,
        global_redundant_expert_num: int = 0,
        **kwargs,
    ) -> torch.Tensor:
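The `@register_scheme("W4A8_DYNAMIC", "linear")` and `@register_scheme("W4A8_DYNAMIC", "moe")` decorators above register two classes under the same quantization type but different layer types. A registry keyed on the `(quant_type, layer_type)` pair, as the PR description outlines, could look roughly like the sketch below. This is an illustrative assumption, not the contents of `methods/registry.py`: every name ending in `_sketch` or `Sketch` is hypothetical, and the duplicate-key check is a design choice added here.

```python
from typing import Callable, Dict, Optional, Tuple, Type

# Hypothetical stand-in for the real registry: (quant_type, layer_type) -> class.
_REGISTRY_SKETCH: Dict[Tuple[str, str], Type] = {}


def register_scheme_sketch(quant_type: str,
                           layer_type: str) -> Callable[[Type], Type]:
    """Return a class decorator that records the class under its key."""

    def decorator(cls: Type) -> Type:
        key = (quant_type, layer_type)
        if key in _REGISTRY_SKETCH:
            raise ValueError(f"duplicate scheme registration for {key}")
        _REGISTRY_SKETCH[key] = cls
        return cls  # the class itself is returned unchanged

    return decorator


def get_scheme_class_sketch(quant_type: str,
                            layer_type: str) -> Optional[Type]:
    """Look up a registered class, or None if nothing matches."""
    return _REGISTRY_SKETCH.get((quant_type, layer_type))


@register_scheme_sketch("W4A8_DYNAMIC", "linear")
class LinearSketch:
    pass


@register_scheme_sketch("W4A8_DYNAMIC", "moe")
class MoESketch:
    pass
```

Because registration happens when the class body is evaluated, a scheme module only has to be imported for its classes to become discoverable.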
--- a/vllm_ascend/quantization/methods/w8a16.py
+++ b/vllm_ascend/quantization/methods/w8a16.py
@@ -22,17 +22,22 @@ import torch_npu
 from vllm_ascend.utils import maybe_trans_nz
+from .base import AscendLinearScheme
+from .registry import register_scheme
+
-class AscendW8A16LinearMethod:
+@register_scheme("W8A16", "linear")
+class AscendW8A16LinearMethod(AscendLinearScheme):
    """Linear method for Ascend W8A16.
-
+    
    This scheme uses 8-bit quantized weights with 16-bit activations.
    """
    def __init__(self) -> None:
        pass
-    @staticmethod
     def get_weight(
+        self,
        input_size: int,
        output_size: int,
        params_dtype: torch.dtype = torch.bfloat16,
@@ -42,12 +47,8 @@ class AscendW8A16LinearMethod:
        }
        return params_dict
-    @staticmethod
-    def get_pertensor_param(params_dtype: torch.dtype) -> Dict[str, Any]:
-        return {}
-    @staticmethod
     def get_perchannel_param(
+        self,
        output_size: int,
        params_dtype: torch.dtype,
    ) -> Dict[str, Any]:
@@ -60,15 +61,8 @@ class AscendW8A16LinearMethod:
                                                   dtype=params_dtype)
        return params_dict
-    def get_pergroup_param(self,
-                           input_size: int,
-                           output_size: int,
-                           params_dtype: torch.dtype,
-                           layer_type: Optional[str] = None) -> Dict[str, Any]:
-        return {}
-    @staticmethod
     def apply(
+        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
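Several hunks in this diff delete `get_pertensor_param` / `get_pergroup_param` overrides whose body is only `return {}` and turn the remaining `@staticmethod`s into instance methods. That pattern is consistent with the base class supplying empty defaults. The sketch below shows the idea under that assumption. `LinearSchemeSketch` and `W8A16Sketch` are hypothetical names, the signatures are simplified, the parameter names are placeholders, and tuples stand in for tensors so the example runs without `torch`.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class LinearSchemeSketch(ABC):
    """Hypothetical base: only get_weight must be overridden."""

    @abstractmethod
    def get_weight(self, input_size: int, output_size: int) -> Dict[str, Any]:
        ...

    # Assumed defaults: schemes without such parameters inherit these.
    def get_pertensor_param(self) -> Dict[str, Any]:
        return {}

    def get_perchannel_param(self, output_size: int) -> Dict[str, Any]:
        return {}

    def get_pergroup_param(self, input_size: int,
                           output_size: int) -> Dict[str, Any]:
        return {}


class W8A16Sketch(LinearSchemeSketch):
    """Overrides only what differs from the defaults."""

    def get_weight(self, input_size: int, output_size: int) -> Dict[str, Any]:
        return {"weight": (output_size, input_size)}  # shape placeholder

    def get_perchannel_param(self, output_size: int) -> Dict[str, Any]:
        # Placeholder names and shapes, for illustration only.
        return {
            "weight_scale": (output_size, 1),
            "weight_offset": (output_size, 1),
        }
```

With defaults in the base class, a subclass that forgets `get_weight` fails at instantiation time rather than at first use.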
--- a/vllm_ascend/quantization/methods/w8a8_dynamic.py
+++ b/vllm_ascend/quantization/methods/w8a8_dynamic.py
@@ -32,28 +32,39 @@ from vllm_ascend.ops.fused_moe.experts_selector import (select_experts,
                                                        zero_experts_compute)
 from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, maybe_trans_nz
+from .base import AscendLinearScheme, AscendMoEScheme, QuantType
+from .registry import register_scheme
+
+def scale_from_float_to_int64(scale):
+    """Convert float32 scale to int64 representation."""
+    import numpy as np
+    scale = torch.from_numpy(
+        np.frombuffer(scale.cpu().to(torch.float32).numpy().tobytes(),
+                      dtype=np.int32).astype(np.int64)).to(scale.device)
+    return scale
-class AscendW8A8DynamicLinearMethod:
+@register_scheme("W8A8_DYNAMIC", "linear")
+class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    """Linear method for Ascend W8A8_DYNAMIC.
    This scheme uses dynamic per-token quantization for activations
    and per-channel quantization for weights.
    """
    def __init__(self):
        pass
-    @staticmethod
-    def get_weight(input_size: int, output_size: int,
+    def get_weight(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, Any]:
        params_dict = {
            "weight": torch.empty(output_size, input_size, dtype=torch.int8)
        }
        return params_dict
-    @staticmethod
-    def get_pertensor_param(params_dtype: torch.dtype) -> Dict[str, Any]:
-        return {}
-    @staticmethod
     def get_perchannel_param(
+        self,
        output_size: int,
        params_dtype: torch.dtype,
    ) -> Dict[str, Any]:
@@ -66,15 +77,8 @@ class AscendW8A8DynamicLinearMethod:
                                                   dtype=params_dtype)
        return params_dict
-    def get_pergroup_param(self,
-                           input_size: int,
-                           output_size: int,
-                           params_dtype: torch.dtype,
-                           layer_type: Optional[str] = None) -> Dict[str, Any]:
-        return {}
-    @staticmethod
     def apply(
+        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
@@ -100,9 +104,12 @@ class AscendW8A8DynamicLinearMethod:
        layer.weight_offset.data = layer.weight_offset.data.flatten()
-class AscendW8A8DynamicFusedMoEMethod:
-    """FusedMoe method for Ascend W8A8_DYNAMIC.
-    """
+@register_scheme("W8A8_DYNAMIC", "moe")
+class AscendW8A8DynamicFusedMoEMethod(AscendMoEScheme):
+    """FusedMoE method for Ascend W8A8_DYNAMIC."""
+    # Declare the quantization type for this scheme
+    quant_type: QuantType = QuantType.W8A8
    def __init__(self):
        self.ep_group = get_ep_group()
@@ -128,9 +135,8 @@ class AscendW8A8DynamicFusedMoEMethod:
        except AttributeError:
            self.moe_all_to_all_group_name = ""
-    @staticmethod
-    def get_weight(num_experts: int, intermediate_size_per_partition: int,
-                   hidden_sizes: int,
+    def get_weight(self, num_experts: int,
+                   intermediate_size_per_partition: int, hidden_sizes: int,
                   params_dtype: torch.dtype) -> Dict[str, Any]:
        param_dict = {}
        param_dict["w13_weight"] = torch.empty(num_experts,
@@ -144,8 +150,7 @@ class AscendW8A8DynamicFusedMoEMethod:
                                              dtype=torch.int8)
        return param_dict
-    @staticmethod
-    def get_dynamic_quant_param(num_experts: int,
+    def get_dynamic_quant_param(self, num_experts: int,
                                intermediate_size_per_partition: int,
                                hidden_sizes: int,
                                params_dtype: torch.dtype) -> Dict[str, Any]:
@@ -188,7 +193,7 @@ class AscendW8A8DynamicFusedMoEMethod:
        e_score_correction_bias: Optional[torch.Tensor] = None,
        is_prefill: bool = True,
        enable_force_load_balance: bool = False,
-        log2phy: torch.Tensor = None,
+        log2phy: Optional[torch.Tensor] = None,
        global_redundant_expert_num: int = 0,
        pertoken_scale: Optional[Any] = None,
        **kwargs,
@@ -324,11 +329,3 @@ class AscendW8A8DynamicFusedMoEMethod:
            del layer.w13_weight_scale_fp32
            del layer.w2_weight_scale
            torch.npu.empty_cache()
-def scale_from_float_to_int64(scale):
-    import numpy as np
-    scale = torch.from_numpy(
-        np.frombuffer(scale.cpu().to(torch.float32).numpy().tobytes(),
-                      dtype=np.int32).astype(np.int64)).to(scale.device)
-    return scale
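The `scale_from_float_to_int64` helper in this file does not round the scale to an integer. It reinterprets the raw float32 bit pattern as a signed 32-bit integer and then widens that value to int64. The standard-library sketch below shows the same reinterpretation for a single Python float. The function name is hypothetical, and the helper in the diff works on whole tensors through NumPy rather than on scalars.

```python
import struct


def float32_bits_as_int64(value: float) -> int:
    """Reinterpret the float32 encoding of `value` as a signed 32-bit integer.

    Python ints are unbounded, so the result is already "widened".
    """
    # Pack as little-endian float32, then read the same 4 bytes as int32.
    (bits,) = struct.unpack("<i", struct.pack("<f", value))
    return bits


one = float32_bits_as_int64(1.0)         # 0x3F800000
minus_one = float32_bits_as_int64(-1.0)  # sign bit set, so negative as int32
```

Using the same byte order for both `pack` and `unpack` keeps the result independent of the host's endianness.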
--- a/vllm_ascend/quantization/methods/w8a8_mxfp8.py
+++ b/vllm_ascend/quantization/methods/w8a8_mxfp8.py
@@ -21,9 +21,17 @@ import torch
 import torch_npu
 from vllm.config import get_current_vllm_config
+from .base import AscendLinearScheme
+from .registry import register_scheme
+
-class AscendW8A8MXFP8DynamicLinearMethod:
-    """Linear method for Ascend W8A8_DYNAMIC.
+@register_scheme("W8A8_MXFP8", "linear")
+class AscendW8A8MXFP8DynamicLinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W8A8_MXFP8 (Microscaling FP8) quantization.
+    This scheme uses microscaling FP8 quantization with per-group scales.
+    The activation is dynamically quantized to FP8 (E4M3FN format) with
+    microscaling, and weights are stored in FP8 format with per-group scales.
    """
    model_dtype = None
@@ -32,8 +40,7 @@ class AscendW8A8MXFP8DynamicLinearMethod:
        self.group_size = vllm_config.quant_config.quant_description.get(
            "group_size", 32)
-    @staticmethod
-    def get_weight(input_size: int, output_size: int,
+    def get_weight(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, Any]:
        params_dict = {
            "weight":
@@ -41,17 +48,6 @@ class AscendW8A8MXFP8DynamicLinearMethod:
        }
        return params_dict
-    @staticmethod
-    def get_pertensor_param(params_dtype: torch.dtype) -> Dict[str, Any]:
-        return {}
-    @staticmethod
-    def get_perchannel_param(
-        output_size: int,
-        params_dtype: torch.dtype,
-    ) -> Dict[str, Any]:
-        return {}
    def get_pergroup_param(self,
                           input_size: int,
                           output_size: int,
--- a/vllm_ascend/quantization/methods/w8a8_pdmix.py
+++ b/vllm_ascend/quantization/methods/w8a8_pdmix.py
@@ -0,0 +1,117 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # This file is a part of the vllm-ascend project.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 """W8A8 Prefill-Decode Mix quantization methods.
 This module provides quantization methods that use different strategies
 for prefill and decode phases:
 - Prefill: Uses dynamic W8A8 quantization
 - Decode (KV consumer): Uses static W8A8 quantization
 """
 from typing import Any, Dict, Optional
 import torch
 from vllm.config import get_current_vllm_config
 from .base import AscendLinearScheme
 from .registry import register_scheme
 from .w8a8_dynamic import (AscendW8A8DynamicFusedMoEMethod,
                           AscendW8A8DynamicLinearMethod)
 from .w8a8_static import AscendW8A8LinearMethod
@register_scheme("W8A8_MIX", "linear")
 class AscendW8A8PDMixLinearMethod(AscendLinearScheme):
    """Linear method for W8A8 prefill-decode mix quantization.
    This scheme uses composition to delegate to the appropriate quantization
    method based on the execution phase:
    - Static W8A8 for KV consumer (decode phase)
    - Dynamic W8A8 for prefill phase
    The static method is used for weight/parameter specifications since
    it requires more parameters (input_scale, deq_scale, etc.) that are
    needed for static quantization during decode.
    """
    def __init__(self):
        self._static_method = AscendW8A8LinearMethod()
        self._dynamic_method = AscendW8A8DynamicLinearMethod()
        kv_transfer_config = get_current_vllm_config().kv_transfer_config
        self._is_kv_consumer = (kv_transfer_config is not None
                                and kv_transfer_config.is_kv_consumer)
    def get_weight(self, input_size: int, output_size: int,
                   params_dtype: torch.dtype) -> Dict[str, Any]:
        return self._static_method.get_weight(input_size, output_size,
                                              params_dtype)
    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
        return self._static_method.get_pertensor_param(params_dtype)
    def get_perchannel_param(
        self,
        output_size: int,
        params_dtype: torch.dtype,
    ) -> Dict[str, Any]:
        return self._static_method.get_perchannel_param(
            output_size, params_dtype)
    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
        tp_rank: Optional[int] = 0,
    ) -> torch.Tensor:
        if layer.is_kv_consumer:
            return self._static_method.apply(layer, x, bias, tp_rank)
        else:
            return self._dynamic_method.apply(layer, x, bias, tp_rank)
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        self._static_method.process_weights_after_loading(layer)
        layer.weight_scale_fp32 = layer.weight_scale.data.to(torch.float32)
        layer.is_kv_consumer = self._is_kv_consumer
@register_scheme("W8A8_MIX", "moe")
 class AscendW8A8PDMixFusedMoeMethod(AscendW8A8DynamicFusedMoEMethod):
    def get_dynamic_quant_param(self, num_experts: int,
                                intermediate_size_per_partition: int,
                                hidden_sizes: int,
                                params_dtype: torch.dtype) -> Dict[str, Any]:
        param_dict = super().get_dynamic_quant_param(
            num_experts, intermediate_size_per_partition, hidden_sizes,
            params_dtype)
        param_dict["w2_deq_scale"] = torch.empty(num_experts,
                                                 hidden_sizes,
                                                 dtype=torch.float32)
        param_dict["w13_deq_scale"] = torch.empty(
            num_experts,
            2 * intermediate_size_per_partition,
            dtype=torch.float32)
        param_dict["w2_input_offset"] = torch.empty(num_experts,
                                                    1,
                                                    dtype=torch.int8)
        param_dict["w13_input_offset"] = torch.empty(num_experts,
                                                     1,
                                                     dtype=torch.int8)
        return param_dict
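Reviewer note: `AscendW8A8PDMixLinearMethod` holds one static and one dynamic scheme, takes its parameter specifications from the static one, and picks which `apply` to call from the KV-consumer flag. The sketch below is a torch-free illustration of that composition pattern. All class names and parameter names in it are placeholders, and the flag is passed to the constructor here, whereas the source reads it from the vLLM config and stores it on the layer.

```python
from typing import List


class StaticScheme:
    """Stand-in for the static W8A8 scheme (placeholder parameter names)."""

    def param_names(self) -> List[str]:
        return ["weight", "input_scale", "input_offset", "deq_scale"]

    def apply(self, x: str) -> str:
        return f"static({x})"


class DynamicScheme:
    """Stand-in for the dynamic W8A8 scheme (placeholder parameter names)."""

    def param_names(self) -> List[str]:
        return ["weight", "weight_scale"]

    def apply(self, x: str) -> str:
        return f"dynamic({x})"


class PDMixScheme:
    """Delegates to the static scheme on KV consumers, dynamic otherwise."""

    def __init__(self, is_kv_consumer: bool) -> None:
        self._static = StaticScheme()
        self._dynamic = DynamicScheme()
        self._is_kv_consumer = is_kv_consumer

    def param_names(self) -> List[str]:
        # The static scheme needs the larger parameter set, so it defines
        # what gets allocated regardless of the phase.
        return self._static.param_names()

    def apply(self, x: str) -> str:
        if self._is_kv_consumer:
            return self._static.apply(x)
        return self._dynamic.apply(x)


decode_node = PDMixScheme(is_kv_consumer=True)
prefill_node = PDMixScheme(is_kv_consumer=False)
print(decode_node.apply("x"), prefill_node.apply("x"))
```

Composition keeps both underlying schemes unchanged, at the cost of allocating the static scheme's parameters even on nodes that only run the dynamic path.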
--- a/vllm_ascend/quantization/methods/w8a8_static.py
+++ b/vllm_ascend/quantization/methods/w8a8_static.py
@@ -24,27 +24,23 @@ from vllm_ascend.utils import (COMPRESSED_TENSORS_METHOD, AscendDeviceType,
                               get_ascend_device_type,
                               get_weight_prefetch_method, maybe_trans_nz)
-
+from .base import AscendLinearScheme
-def quant_per_tensor(in_tensor: torch.Tensor,
+from .registry import register_scheme
                     input_scale: torch.Tensor,
                     input_offset: torch.Tensor,
                     function=False):
    return torch_npu.npu_quantize(in_tensor, input_scale, input_offset,
                                  torch.qint8, -1, function)
-class AscendW8A8LinearMethod:
-    """Linear method for Ascend W8A8.
+@register_scheme("W8A8", "linear")
+class AscendW8A8LinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W8A8 static quantization.
-    Args:
-        w_sym: whether the linear weight is symmetrically quantized.
+    This scheme uses static per-tensor quantization for activations
+    and per-channel quantization for weights.
    """
    def __init__(self) -> None:
        pass
-    @staticmethod
    def get_weight(
+        self,
        input_size: int,
        output_size: int,
        params_dtype: torch.dtype = torch.bfloat16,
@@ -54,15 +50,14 @@ class AscendW8A8LinearMethod:
        }
        return params_dict
-    @staticmethod
-    def get_pertensor_param(params_dtype: torch.dtype) -> Dict[str, Any]:
+    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
        params_dict = {}
        params_dict["input_scale"] = torch.empty(1, dtype=params_dtype)
        params_dict["input_offset"] = torch.empty(1, dtype=torch.int8)
        return params_dict
-    @staticmethod
    def get_perchannel_param(
+        self,
        output_size: int,
        params_dtype: torch.dtype,
    ) -> Dict[str, Any]:
@@ -82,15 +77,8 @@ class AscendW8A8LinearMethod:
                                                   dtype=params_dtype)
        return params_dict
-    def get_pergroup_param(self,
-                           input_size: int,
-                           output_size: int,
-                           params_dtype: torch.dtype,
-                           layer_type: Optional[str] = None) -> Dict[str, Any]:
-        return {}
-    @staticmethod
    def apply(
+        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
--- a/vllm_ascend/quantization/modelslim_config.py
+++ b/vllm_ascend/quantization/modelslim_config.py
@@ -0,0 +1,471 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # Copyright 2023 The vLLM team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # This file is a part of the vllm-ascend project.
 #
 """ModelSlim quantization configuration and model mappings for Ascend.
 This module provides the AscendModelSlimConfig class for parsing quantization
 configs generated by the ModelSlim tool, along with model-specific mappings.
 """
 from types import MappingProxyType
 from typing import Any, Dict, List, Mapping, Optional
 import torch
 from vllm.config import get_current_vllm_config
 from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import LinearBase
 from vllm.model_executor.layers.quantization import \
    register_quantization_config
 from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)
 from vllm.model_executor.layers.vocab_parallel_embedding import (
    UnquantizedEmbeddingMethod, VocabParallelEmbedding)
 from vllm.model_executor.models.utils import WeightsMapper
 from vllm_ascend.utils import ASCEND_QUANTIZATION_METHOD
 from .methods import get_scheme_class
 logger = init_logger(__name__)
 # key: model_type
 # value: orig_to_new_prefix
 QUANT_MODEL_PREFIX_MAPPINGS: Dict[str, Dict[str, str]] = {
    "qwen3_vl_moe": {
        "visual.": "model.visual.",
        "language_model.lm_head.": "lm_head.",
        "language_model.model.": "model.language_model.",
    },
    "qwen3_vl_text": {
        "visual.": "model.visual.",
        "language_model.lm_head.": "lm_head.",
        "language_model.model.": "model.language_model.",
    },
 }
 # key: model_type
 # value: dict of fused module name -> list of original module names
 packed_modules_model_mapping: Dict[str, Dict[str, List[str]]] = {
    "qwen3_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    },
    "deepseek_v2": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "deepseek_v3": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "pangu_ultra_moe": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "kimi_k2": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "deepseek_v32": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    # NOTE 1. The MTP layer of DeepSeek is not quantized on the NPU.
    # NOTE 2. The description file generated by the current msmodelslim tool does not
    # include MTP layer info. Please add it manually and set the value to FLOAT.
    "deepseek_mtp": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    },
    "pangu_ultra_moe_mtp": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "qwen3_next": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": ["gate_proj", "up_proj"],
        "in_proj": ["in_proj_qkvz", "in_proj_ba"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    },
    "qwen2_5_vl": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
    },
    "qwen3_vl_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    },
    "glm4_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    },
    "longcat_flash": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "minimax_m2": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "experts": ["experts.0.w1", "experts.0.w2", "experts.0.w3"]
    }
 }
 def get_packed_modules_mapping(model_type: str) -> Dict[str, List[str]]:
    """Get packed modules mapping for a model type.
    Args:
        model_type: The model type string (e.g., "deepseek_v3").
    Returns:
        Dictionary mapping fused module names to their component module names.
        Returns empty dict if model_type is not found.
    """
    return packed_modules_model_mapping.get(model_type, {})
 def get_prefix_mapping(model_type: str) -> Dict[str, str]:
    """Get prefix mapping for a model type.
    Args:
        model_type: The model type string (e.g., "qwen3_vl_moe").
    Returns:
        Dictionary mapping original prefixes to new prefixes.
        Returns empty dict if model_type is not found.
    """
    return QUANT_MODEL_PREFIX_MAPPINGS.get(model_type, {})
 def get_linear_quant_type(
        quant_description: Dict[str, Any], prefix: str,
        packed_modules_mapping: Dict[str, Any]) -> Optional[str]:
    """Determine the quantization type for a linear layer.
    Args:
        quant_description: The quantization description dictionary.
        prefix: The layer prefix.
        packed_modules_mapping: Mapping for packed/fused modules.
    Returns:
        The quantization type string (e.g., "W8A8_DYNAMIC").
    """
    proj_name = prefix.split(".")[-1]
    if proj_name in packed_modules_mapping:
        quant_type = None
        shard_prefixes = [
            prefix.replace(proj_name, shard_proj_name)
            for shard_proj_name in packed_modules_mapping[proj_name]
        ]
        for shard_prefix in shard_prefixes:
            shard_quant_type = quant_description[shard_prefix + '.weight']
            if quant_type is None:
                quant_type = shard_quant_type
            elif shard_quant_type != quant_type:
                raise ValueError(
                    f"Not all shards of {prefix} are quantized with the same quant type. "
                    f"Shard {shard_prefix} uses {shard_quant_type}, but another shard "
                    f"uses {quant_type}. Please check the quantization config.")
    else:
        quant_type = quant_description[prefix + '.weight']
    return quant_type
 def get_quant_type_for_layer(
        quant_description: Dict[str, Any],
        prefix: str,
        layer_type: str,
        packed_modules_mapping: Optional[Dict[str,
                                              Any]] = None) -> Optional[str]:
    """Determine the quantization type for a layer.
    Args:
        quant_description: The quantization description dictionary.
        prefix: The layer prefix.
        layer_type: The type of layer ("linear", "moe", "attention").
        packed_modules_mapping: Mapping for packed/fused modules.
    Returns:
        The quantization type string (e.g., "W8A8_DYNAMIC").
    """
    if packed_modules_mapping is None:
        packed_modules_mapping = dict()
    # Attention
    if layer_type == "attention" and 'fa_quant_type' in quant_description.keys(
    ):
        return quant_description['fa_quant_type']
    # Linear / MoE
    return get_linear_quant_type(quant_description, prefix,
                                 packed_modules_mapping)
 def create_scheme_for_layer(
        quant_description: Dict[str, Any],
        prefix: str,
        layer_type: str,
        packed_modules_mapping: Optional[Dict[str, Any]] = None):
    """Create a quantization scheme instance for a layer.
    Args:
        quant_description: The quantization description dictionary.
        prefix: The layer prefix.
        layer_type: The type of layer ("linear", "moe", "attention").
        packed_modules_mapping: Mapping for packed/fused modules.
    Returns:
        An instance of the appropriate quantization scheme class.
    """
    logger.info_once("Using the vLLM Ascend modelslim Quantization now!")
    quant_type = get_quant_type_for_layer(quant_description, prefix,
                                          layer_type, packed_modules_mapping)
    if quant_type is None:
        raise ValueError(
            f"Could not determine quantization type for layer {prefix}.")
    # Use registry to get scheme class
    scheme_cls = get_scheme_class(quant_type, layer_type)
    if scheme_cls is not None:
        return scheme_cls()
    raise NotImplementedError(
        f"Currently, vLLM Ascend doesn't support {quant_type} for {layer_type}."
    )
@register_quantization_config(ASCEND_QUANTIZATION_METHOD)
 class AscendModelSlimConfig(QuantizationConfig):
    """Config class for Ascend ModelSlim quantization.
    This class is a general class that parses quantization configs
    that are supported on Ascend hardware, specifically for models
    quantized using the ModelSlim tool.
    """
    def __init__(self, quant_config: Dict[str, Any]):
        super().__init__()
        self.quant_description = quant_config
        # TODO(whx): remove this adaptation after adding "shared_head"
        # to prefix of DeepSeekShareHead in vLLM.
        extra_quant_dict = {}
        for k in self.quant_description.keys():
            if "shared_head" in k:
                new_k = k.replace(".shared_head.", ".")
                extra_quant_dict[new_k] = self.quant_description[k]
            if "weight_packed" in k:
                new_k = k.replace("weight_packed", "weight")
                extra_quant_dict[new_k] = self.quant_description[k]
        self.quant_description.update(extra_quant_dict)
    def __repr__(self) -> str:
        return "AscendModelSlimConfig:\n" + super().__repr__()
    @classmethod
    def get_name(cls) -> str:
        return ASCEND_QUANTIZATION_METHOD
    @classmethod
    def get_supported_act_dtypes(cls) -> List[torch.dtype]:
        return [torch.int8, torch.float16, torch.bfloat16]
    @classmethod
    def get_min_capability(cls) -> int:
        raise NotImplementedError(
            "Ascend hardware does not support the \"get_min_capability\" feature.")
    @classmethod
    def get_config_filenames(cls) -> List[str]:
        return ["quant_model_description.json"]
    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "AscendModelSlimConfig":
        return cls(config)
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg,
                                     user_quant) -> Optional[str]:
        if hf_quant_cfg is not None:
            quant_method = hf_quant_cfg.get("quant_method", None)
            if not quant_method and torch.npu.is_available():
                return ASCEND_QUANTIZATION_METHOD
        return None
    def quant_prefix_mapper(self, model_type: str, prefix: str) -> str:
        # TODO (Levi-JQ): will be removed when QuantizationConfig.apply_vllm_mapper is implemented
        prefix_mapping = QUANT_MODEL_PREFIX_MAPPINGS.get(model_type)
        if prefix_mapping:
            hf_to_vllm_mapper = WeightsMapper(
                orig_to_new_prefix=prefix_mapping)
            return hf_to_vllm_mapper._map_name(prefix)
        return prefix
    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional["QuantizeMethodBase"]:
        from .method_adapters import (AscendEmbeddingMethod, AscendFusedMoEMethod,
                                      AscendKVCacheMethod, AscendLinearMethod)
        vllm_config = get_current_vllm_config()
        model_type = vllm_config.model_config.hf_config.model_type
        if model_type in ["minimax", "minimax_m2"]:
            # Adapt to Minimax architecture: update layer names to MoE convention
            prefix = prefix.replace("mlp", "block_sparse_moe")
            # Normalize the prefix by stripping specific expert indices (e.g., 'experts.0' -> 'experts')
            parts = prefix.split('.')
            if "experts" in parts and len(parts) > 2:
                exp_idx = parts.index("experts")
                if exp_idx + 1 < len(parts) and parts[exp_idx + 1].isdigit():
                    parts = parts[:exp_idx + 1]
                    prefix = ".".join(parts)
        if model_type in packed_modules_model_mapping:
            self.packed_modules_mapping = packed_modules_model_mapping[
                model_type]
        prefix = self.quant_prefix_mapper(model_type, prefix)
        from vllm.attention.layer import Attention
        if prefix.startswith("language_model"):
            prefix = prefix.split('.', 1)[-1]
        if isinstance(layer, LinearBase):
            if self.is_layer_skipped_ascend(prefix,
                                            self.packed_modules_mapping):
                # Delayed import to avoid circular import
                from vllm_ascend.ops.linear import \
                    AscendUnquantizedLinearMethod
                return AscendUnquantizedLinearMethod()
            scheme = create_scheme_for_layer(self.quant_description, prefix,
                                             "linear",
                                             self.packed_modules_mapping)
            return AscendLinearMethod(scheme)
        elif isinstance(layer, Attention) and \
            'fa_quant_type' in self.quant_description.keys() and \
            self.quant_description['fa_quant_type'] is not None:
            scheme = create_scheme_for_layer(self.quant_description, prefix,
                                             "attention",
                                             self.packed_modules_mapping)
            return AscendKVCacheMethod(scheme)
        elif isinstance(layer, FusedMoE):
            if self.is_layer_skipped_ascend(prefix,
                                            self.packed_modules_mapping):
                # Delayed import to avoid circular import
                from vllm_ascend.ops.fused_moe.fused_moe import \
                    AscendUnquantizedFusedMoEMethod
                return AscendUnquantizedFusedMoEMethod(layer.moe_config)
            scheme = create_scheme_for_layer(self.quant_description, prefix,
                                             "moe",
                                             self.packed_modules_mapping)
            return AscendFusedMoEMethod(scheme, layer.moe_config)
        elif isinstance(layer, VocabParallelEmbedding):
            if self.is_layer_skipped_ascend(prefix,
                                            self.packed_modules_mapping):
                return UnquantizedEmbeddingMethod()
            scheme = create_scheme_for_layer(self.quant_description, prefix,
                                             "linear",
                                             self.packed_modules_mapping)
            return AscendEmbeddingMethod(scheme)
        return None
    def is_layer_skipped_ascend(
        self,
        prefix: str,
        fused_mapping: Mapping[str, List[str]] = MappingProxyType({})):
        # adapted from vllm.model_executor.layers.quantization.utils.quant_utils.is_layer_skipped
        proj_name = prefix.split(".")[-1]
        if proj_name in fused_mapping:
            shard_prefixes = [
                prefix.replace(proj_name, shard_proj_name)
                for shard_proj_name in fused_mapping[proj_name]
            ]
            is_skipped = None
            for shard_prefix in shard_prefixes:
                is_shard_skipped = self.quant_description[shard_prefix +
                                                          '.weight'] == "FLOAT"
                if is_skipped is None:
                    is_skipped = is_shard_skipped
                elif is_shard_skipped != is_skipped:
                    raise ValueError(
                        f"Detected some but not all shards of {prefix} "
                        "are quantized. All shards of fused layers "
                        "must have the same precision.")
        else:
            is_skipped = self.quant_description[prefix + '.weight'] == "FLOAT"
        assert is_skipped is not None
        return is_skipped
    def get_scaled_act_names(self) -> List[str]:
        return []
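Reviewer note: `get_linear_quant_type` and `is_layer_skipped_ascend` share one idea. A fused layer such as `qkv_proj` has no entry of its own in the quant description, so the prefix is expanded into its shard prefixes and every shard must agree. The sketch below restates that lookup on a plain dictionary. `resolve_quant_type` and the sample description are hypothetical, and the `.weight` key suffix follows the source.

```python
from typing import Dict, List


def resolve_quant_type(quant_description: Dict[str, str], prefix: str,
                       packed_modules_mapping: Dict[str, List[str]]) -> str:
    """Return the quant type for a layer, expanding fused layers into shards."""
    proj_name = prefix.split(".")[-1]
    if proj_name not in packed_modules_mapping:
        # Plain layer: the description has a direct entry.
        return quant_description[prefix + ".weight"]

    quant_type = None
    for shard_name in packed_modules_mapping[proj_name]:
        shard_prefix = prefix.replace(proj_name, shard_name)
        shard_type = quant_description[shard_prefix + ".weight"]
        if quant_type is None:
            quant_type = shard_type
        elif shard_type != quant_type:
            raise ValueError(
                f"Shards of {prefix} disagree: {shard_prefix} uses "
                f"{shard_type}, another shard uses {quant_type}.")
    return quant_type


# Hypothetical description entries, shaped like quant_model_description.json.
description = {
    "model.layers.0.self_attn.q_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.self_attn.k_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.self_attn.v_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.down_proj.weight": "FLOAT",
}
packed = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}

fused_type = resolve_quant_type(description, "model.layers.0.self_attn.qkv_proj", packed)
plain_type = resolve_quant_type(description, "model.layers.0.mlp.down_proj", packed)
print(fused_type, plain_type)
```

A layer whose resolved type is `FLOAT` is the case `is_layer_skipped_ascend` treats as unquantized.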
--- a/vllm_ascend/quantization/quant_config.py
+++ b/vllm_ascend/quantization/quant_config.py
@@ -1,600 +0,0 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
 # Copyright 2023 The vLLM team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # This file is a part of the vllm-ascend project.
 #
 from types import MappingProxyType
 from typing import Any, Callable, Dict, List, Mapping, Optional
 import torch
 from vllm.config import get_current_vllm_config
 from vllm.distributed import get_tensor_model_parallel_rank
 from vllm.model_executor.layers.fused_moe import (FusedMoE, FusedMoEMethodBase,
                                                  FusedMoeWeightScaleSupported)
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                               RowParallelLinear)
 from vllm.model_executor.layers.quantization import \
    register_quantization_config
 from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.model_executor.layers.vocab_parallel_embedding import (
    UnquantizedEmbeddingMethod, VocabParallelEmbedding)
 from vllm.model_executor.models.utils import WeightsMapper
 from vllm.model_executor.parameter import PerTensorScaleParameter
 from vllm.model_executor.utils import set_weight_attrs
 from vllm_ascend.ascend_config import get_ascend_config
 from vllm_ascend.distributed.parallel_state import (get_flashcomm2_otp_group,
                                                    get_mlp_tp_group,
                                                    get_otp_group)
 from vllm_ascend.ops.fused_moe.fused_moe import AscendUnquantizedFusedMoEMethod
 from vllm_ascend.ops.linear import AscendUnquantizedLinearMethod
 from vllm_ascend.utils import (ASCEND_QUANTIZATION_METHOD, flashcomm2_enable,
                               mlp_tp_enable, oproj_tp_enable)
 from .utils import get_quant_method, is_mx_quant_type
@register_quantization_config(ASCEND_QUANTIZATION_METHOD)
 class AscendQuantConfig(QuantizationConfig):
    """Config class for Ascend
    This class is a general class that parse quantization configs
    that are supported on ascend hardware.
    """
    def __init__(self, quant_config: Dict[str, Any]):
        super().__init__()
        self.quant_description = quant_config
        # TODO(whx): remove this adaptation after adding "shared_head"
        # to prefix of DeepSeekShareHead in vLLM.
        extra_quant_dict = {}
        for k in self.quant_description.keys():
            if "shared_head" in k:
                new_k = k.replace(".shared_head.", ".")
                extra_quant_dict[new_k] = self.quant_description[k]
            if "weight_packed" in k:
                new_k = k.replace("weight_packed", "weight")
                extra_quant_dict[new_k] = self.quant_description[k]
        self.quant_description.update(extra_quant_dict)
    def __repr__(self) -> str:
        return "AscendQuantConfig:\n" + super().__repr__()
    @classmethod
    def get_name(cls) -> str:
        return ASCEND_QUANTIZATION_METHOD
    @classmethod
    def get_supported_act_dtypes(cls) -> List[torch.dtype]:
        return [torch.int8, torch.float16, torch.bfloat16]
    @classmethod
    def get_min_capability(cls) -> int:
        raise NotImplementedError(
            "Ascend hardware does not support the \"get_min_capability\" feature.")
    @classmethod
    def get_config_filenames(cls) -> List[str]:
        return ["quant_model_description.json"]
    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "AscendQuantConfig":
        return cls(config)
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg,
                                     user_quant) -> Optional[str]:
        if hf_quant_cfg is not None:
            quant_method = hf_quant_cfg.get("quant_method", None)
            if not quant_method and torch.npu.is_available():
                return ASCEND_QUANTIZATION_METHOD
        return None
    def quant_prefix_mapper(self, model_type: str, prefix: str) -> str:
        # TODO (Levi-JQ): will be removed when QuantizationConfig.apply_vllm_mapper is implemented
        prefix_mapping = QUANT_MODEL_PREFIX_MAPPINGS.get(model_type)
        if prefix_mapping:
            hf_to_vllm_mapper = WeightsMapper(
                orig_to_new_prefix=prefix_mapping)
            return hf_to_vllm_mapper._map_name(prefix)
        return prefix
    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional["QuantizeMethodBase"]:
        vllm_config = get_current_vllm_config()
        model_type = vllm_config.model_config.hf_text_config.model_type
        if model_type in ["minimax", "minimax_m2"]:
            prefix = prefix.replace("mlp", "block_sparse_moe")
            # Adapt to MiniMax by rewriting the prefix of the model layer name.
            parts = prefix.split('.')
            if "experts" in parts and len(parts) > 2:
                exp_idx = parts.index("experts")
                if exp_idx + 1 < len(parts) and parts[exp_idx + 1].isdigit():
                    parts = parts[:exp_idx + 1]
                    prefix = ".".join(parts)
        if model_type in packed_modules_model_mapping:
            self.packed_modules_mapping = packed_modules_model_mapping[
                model_type]
        prefix = self.quant_prefix_mapper(model_type, prefix)
        from vllm.attention.layer import Attention
        if prefix.startswith("language_model"):
            prefix = prefix.split('.', 1)[-1]
        if isinstance(layer, LinearBase):
            if self.is_layer_skipped_ascend(prefix,
                                            self.packed_modules_mapping):
                return AscendUnquantizedLinearMethod()
            return AscendLinearMethod(self, prefix,
                                      self.packed_modules_mapping, layer)
        elif isinstance(layer, Attention) and \
            'fa_quant_type' in self.quant_description.keys() and \
            self.quant_description['fa_quant_type'] is not None:
            return AscendKVCacheMethod(self, prefix)
        elif isinstance(layer, FusedMoE):
            if self.is_layer_skipped_ascend(prefix,
                                            self.packed_modules_mapping):
                return AscendUnquantizedFusedMoEMethod(layer.moe_config)
            return AscendFusedMoEMethod(self, prefix,
                                        self.packed_modules_mapping, layer)
        elif isinstance(layer, VocabParallelEmbedding):
            if self.is_layer_skipped_ascend(prefix,
                                            self.packed_modules_mapping):
                return UnquantizedEmbeddingMethod()
            return AscendEmbeddingMethod(self, prefix,
                                         self.packed_modules_mapping, layer)
        return None
    def is_layer_skipped_ascend(
        self,
        prefix: str,
        fused_mapping: Mapping[str, List[str]] = MappingProxyType({})):
        # adapted from vllm.model_executor.layers.quantization.utils.quant_utils.is_layer_skipped
        proj_name = prefix.split(".")[-1]
        if proj_name in fused_mapping:
            shard_prefixes = [
                prefix.replace(proj_name, shard_proj_name)
                for shard_proj_name in fused_mapping[proj_name]
            ]
            is_skipped = None
            for shard_prefix in shard_prefixes:
                is_shard_skipped = self.quant_description[shard_prefix +
                                                          '.weight'] == "FLOAT"
                if is_skipped is None:
                    is_skipped = is_shard_skipped
                elif is_shard_skipped != is_skipped:
                    raise ValueError(
                        f"Detected some but not all shards of {prefix} "
                        "are quantized. All shards of fused layers "
                        "must have the same precision.")
        else:
            # NOTE: In GLM4.6, the MTP draft model shares the same LM head weights
            # with the main model. Therefore, before `load_weights()` runs, some parameter
            # names may not include the expected prefix and may appear only with the
            # ".head" suffix. This can trigger a load-time error, so here we replace the
            # key with "lm_head.weight".
            key = prefix + '.weight'
            if key not in self.quant_description and ".head" in prefix:
                key = 'lm_head.weight'
            is_skipped = self.quant_description[key] == "FLOAT"
        assert is_skipped is not None
        return is_skipped
    def get_scaled_act_names(self) -> List[str]:
        return []
 # key: model_type
 # value: orig_to_new_prefix
 QUANT_MODEL_PREFIX_MAPPINGS = {
    "qwen3_vl_moe": {
        "visual.": "model.visual.",
        "language_model.lm_head.": "lm_head.",
        "language_model.model.": "model.language_model.",
    },
    "qwen3_vl_text": {
        "visual.": "model.visual.",
        "language_model.lm_head.": "lm_head.",
        "language_model.model.": "model.language_model.",
    },
 }
 packed_modules_model_mapping = {
    "qwen3_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    },
    "deepseek_v2": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "deepseek_v3": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "pangu_ultra_moe": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "kimi_k2": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "deepseek_v32": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    # NOTE 1: The MTP layer of DeepSeek is not quantized on the NPU, even in a quantized model.
    # NOTE 2: The description file generated by the current msmodelslim tool does not include
    # MTP layer info. Please add it manually and set the value to FLOAT.
    "deepseek_mtp": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    },
    "pangu_ultra_moe_mtp": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "qwen3_next": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": ["gate_proj", "up_proj"],
        "in_proj": ["in_proj_qkvz", "in_proj_ba"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    },
    "qwen2_5_vl": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
    },
    "qwen3_vl_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    },
    "glm4_moe": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    },
    "longcat_flash": {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts":
        ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
        "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"]
    },
    "minimax_m2": {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "experts": ["experts.0.w1", "experts.0.w2", "experts.0.w3"]
    }
 }
 class AscendLinearMethod(LinearMethodBase):
    """Linear method for Ascend quantization.
    Args:
        quant_config: The Ascend quantization config.
    """
    def __init__(self,
                 quant_config: AscendQuantConfig,
                 prefix: str,
                 packed_modules_mapping: Dict[str, Any] | None,
                 layer: torch.nn.Module = None) -> None:
        self.quant_method = get_quant_method(quant_config.quant_description,
                                             prefix,
                                             "linear",
                                             packed_modules_mapping,
                                             layer=layer)
    def create_weights(
        self,
        layer: torch.nn.Module,
        input_size_per_partition: int,
        output_partition_sizes: List[int],
        input_size: int,
        output_size: int,
        params_dtype: torch.dtype,
        **extra_weight_attrs,
    ) -> None:
        output_size_per_partition = sum(output_partition_sizes)
        weight_loader = extra_weight_attrs.get("weight_loader")
        weight_dict = self.quant_method.get_weight(input_size_per_partition,
                                                   output_size_per_partition,
                                                   params_dtype)
        # Extract packing information (if present)
        packed_dim = weight_dict.pop("_packed_dim", None)
        packed_factor = weight_dict.pop("_packed_factor", None)
        for weight_name, weight_param in weight_dict.items():
            param = torch.nn.Parameter(weight_param, requires_grad=False)
            set_weight_attrs(param, {"input_dim": 1, "output_dim": 0})
            # Set packing attributes if the weight is packed
            if packed_dim is not None and packed_factor is not None:
                set_weight_attrs(param, {
                    "packed_dim": packed_dim,
                    "packed_factor": packed_factor
                })
            layer.register_parameter(weight_name, param)
            set_weight_attrs(param, extra_weight_attrs)
        pertensor_dict = self.quant_method.get_pertensor_param(params_dtype)
        for pertensor_name, pertensor_param in pertensor_dict.items():
            param = PerTensorScaleParameter(data=pertensor_param,
                                            weight_loader=weight_loader)
            # disable warning
            param.ignore_warning = True
            layer.register_parameter(pertensor_name, param)
            param.weight_loader = extra_weight_attrs.get("weight_loader")
        perchannel_dict = self.quant_method.get_perchannel_param(
            output_size_per_partition, params_dtype)
        for perchannel_name, perchannel_param in perchannel_dict.items():
            param = torch.nn.Parameter(perchannel_param, requires_grad=False)
            set_weight_attrs(param, {"output_dim": 0})
            layer.register_parameter(perchannel_name, param)
            set_weight_attrs(param, extra_weight_attrs)
        # NOTE: In w4a8 quantization implementation,
        # for down_proj and o_proj scale_bias shape is [output_size, 16],
        # others are [output_size, 1]
        layer_type = "row" if isinstance(layer,
                                         RowParallelLinear) else "others"
        pergroup_dict = self.quant_method.get_pergroup_param(
            input_size_per_partition,
            output_size_per_partition,
            params_dtype,
            layer_type=layer_type)
        for pergroup_name, pergroup_param in pergroup_dict.items():
            param = torch.nn.Parameter(pergroup_param, requires_grad=False)
            set_weight_attrs(param, {"output_dim": 0})
            layer.register_parameter(pergroup_name, param)
            set_weight_attrs(param, extra_weight_attrs)
            if "weight_scale_second" in pergroup_name or "weight_offset_second" in pergroup_name \
                or is_mx_quant_type(self.quant_method):
                param.input_dim = 1
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        if hasattr(self.quant_method, "process_weights_after_loading"):
            self.quant_method.process_weights_after_loading(layer)
    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if isinstance(layer, RowParallelLinear):
            if layer.prefix.find("o_proj") != -1 and oproj_tp_enable():
                tp_rank = get_otp_group().rank_in_group
            elif layer.prefix.find("down_proj") != -1 and mlp_tp_enable():
                tp_rank = get_mlp_tp_group().rank_in_group
            elif (layer.prefix.find("o_proj") != -1 or
                  layer.prefix.find("out_proj") != -1) and flashcomm2_enable():
                if get_ascend_config(
                ).flashcomm2_oproj_tensor_parallel_size == 1:
                    tp_rank = 0
                else:
                    tp_rank = get_flashcomm2_otp_group().rank_in_group
            else:
                tp_rank = get_tensor_model_parallel_rank()
        else:
            tp_rank = 0
        return self.quant_method.apply(layer, x, bias, tp_rank)
 class AscendKVCacheMethod(BaseKVCacheMethod):
    """KVCache method for Ascend quantization.
    Args:
        quant_config: The Ascend quantization config.
    """
    def __init__(self, quant_config: AscendQuantConfig, prefix: str) -> None:
        self.quant_method = get_quant_method(quant_config.quant_description,
                                             prefix, "attention")
    def create_weights(self, layer: torch.nn.Module) -> None:
        # Different from linear method, there are no weight processing/slicing
        # steps for attention in vllm. So the whole process of create weights
        # is hidden into the specific quant method.
        self.quant_method.create_weights(layer)
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        if hasattr(self.quant_method, "process_weights_after_loading"):
            self.quant_method.process_weights_after_loading(layer)
    def apply(self, layer: torch.nn.Module, query: torch.Tensor,
              key: torch.Tensor, value: torch.Tensor, kv_cache, attn_metadata,
              attn_type, scale, output) -> torch.Tensor:
        return self.quant_method.apply(layer, query, key, value, kv_cache,
                                       attn_metadata, attn_type, scale, output)
 class AscendFusedMoEMethod(FusedMoEMethodBase):
    """FusedMoE method for Ascend quantization.
    Args:
        quant_config: The Ascend quantization config.
    """
    def __init__(self, quant_config: AscendQuantConfig, prefix: str,
                 packed_modules_mapping: Dict[str,
                                              Any], layer: torch.nn.Module):
        super().__init__(layer.moe_config)
        self.quant_method = get_quant_method(quant_config.quant_description,
                                             prefix,
                                             "moe",
                                             packed_modules_mapping,
                                             layer=layer)
    def create_weights(
        self,
        layer: torch.nn.Module,
        num_experts: int,
        hidden_size: int,
        intermediate_size_per_partition: int,
        params_dtype: torch.dtype,
        **extra_weight_attrs,
    ) -> None:
        weight_param = self.quant_method.get_weight(
            num_experts, intermediate_size_per_partition, hidden_size,
            params_dtype)
        for param_key, param_value in weight_param.items():
            param = torch.nn.Parameter(param_value, requires_grad=False)
            layer.register_parameter(param_key, param)
            set_weight_attrs(param, extra_weight_attrs)
        extra_weight_attrs.update(
            {"quant_method": FusedMoeWeightScaleSupported.CHANNEL.value})
        per_group_param = [
            "weight_scale_second", "weight_offset_second", "scale_bias"
        ] + ["weight_scale", "weight_offset"] if hasattr(
            self.quant_method,
            "group_size") and self.quant_method.group_size > 0 else []
        dynamic_quant_param = self.quant_method.get_dynamic_quant_param(
            num_experts, intermediate_size_per_partition, hidden_size,
            params_dtype)
        for param_key, param_value in dynamic_quant_param.items():
            param = torch.nn.Parameter(param_value, requires_grad=False)
            layer.register_parameter(param_key, param)
            set_weight_attrs(param, extra_weight_attrs)
            if any(fields in param_key for fields in per_group_param):
                setattr(param, "quant_method",
                        FusedMoeWeightScaleSupported.GROUP.value)
    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        router_logits: torch.Tensor,
        top_k: int,
        renormalize: bool,
        use_grouped_topk: bool = False,
        global_num_experts: int = -1,
        expert_map: Optional[torch.Tensor] = None,
        topk_group: Optional[int] = None,
        num_expert_group: Optional[int] = None,
        custom_routing_function: Optional[Callable] = None,
        scoring_func: str = "softmax",
        routed_scaling_factor: float = 1.0,
        e_score_correction_bias: Optional[torch.Tensor] = None,
        is_prefill: bool = True,
        enable_force_load_balance: bool = False,
        log2phy: torch.Tensor = None,
        global_redundant_expert_num=0,
        **kwargs,
    ) -> torch.Tensor:
        return self.quant_method.apply(
            layer, x, router_logits, top_k, renormalize, use_grouped_topk,
            global_num_experts, expert_map, topk_group, num_expert_group,
            custom_routing_function, scoring_func, routed_scaling_factor,
            e_score_correction_bias, is_prefill, enable_force_load_balance,
            log2phy, global_redundant_expert_num, **kwargs)
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        if hasattr(self.quant_method, "process_weights_after_loading"):
            self.quant_method.process_weights_after_loading(layer)
    def get_fused_moe_quant_config(self, layer: torch.nn.Module):
        # TODO: implement this function
        pass
    @property
    def supports_eplb(self):
        supports_eplb = getattr(self.quant_method, "supports_eplb", False)
        return supports_eplb
 class AscendEmbeddingMethod(AscendLinearMethod):
    """Embedding method for Ascend quantization.
    Args:
        quant_config: The Ascend quantization config.
    """
    def __init__(self, quant_config: AscendQuantConfig, prefix: str,
                 packed_modules_mapping: Dict[str, Any],
                 layer: torch.nn.Module) -> None:
        self.quant_method = get_quant_method(quant_config.quant_description,
                                             prefix,
                                             "linear",
                                             packed_modules_mapping,
                                             layer=layer)
--- a/vllm_ascend/quantization/utils.py
+++ b/vllm_ascend/quantization/utils.py
@@ -1,129 +0,0 @@
 from typing import Any, Dict, Optional, Type
 import torch
 from vllm.logger import logger
 from vllm_ascend.utils import COMPRESSED_TENSORS_METHOD
 from .w4a4_flatquant_dynamic import AscendW4A4FlatQuantDynamicLinearMethod
 from .w4a4_laos_dynamic import AscendW4A4LaosDynamicLinearMethod
 from .w4a8_dynamic import (AscendW4A8DynamicFusedMoEMethod,
                           AscendW4A8DynamicLinearMethod)
 from .w4a16 import AscendW4A16FusedMoEMethod
 from .w8a8 import AscendW8A8LinearMethod
 from .w8a8_dynamic import (AscendW8A8DynamicFusedMoEMethod,
                           AscendW8A8DynamicLinearMethod)
 from .w8a8_pdmix import (AscendW8A8PDMixFusedMoeMethod,
                         AscendW8A8PDMixLinearMethod)
 from .w8a8mxfp8 import AscendW8A8MXFP8DynamicLinearMethod
 from .w8a16 import AscendW8A16LinearMethod
 ASCEND_QUANTIZATION_METHOD_MAP: Dict[str, Dict[str, Type[Any]]] = {
    "W4A16": {
        "moe": AscendW4A16FusedMoEMethod,
    },
    "W4A8_DYNAMIC": {
        "linear": AscendW4A8DynamicLinearMethod,
        "moe": AscendW4A8DynamicFusedMoEMethod,
    },
    "W4A4_DYNAMIC": {
        "linear": AscendW4A4LaosDynamicLinearMethod,
    },
    "W4A4_FLATQUANT_DYNAMIC": {
        "linear": AscendW4A4FlatQuantDynamicLinearMethod,
    },
    "W8A8": {
        "linear": AscendW8A8LinearMethod,
    },
    "W8A8_DYNAMIC": {
        "linear": AscendW8A8DynamicLinearMethod,
        "moe": AscendW8A8DynamicFusedMoEMethod,
    },
    "W8A8_MIX": {
        "linear": AscendW8A8PDMixLinearMethod,
        "moe": AscendW8A8PDMixFusedMoeMethod,
    },
    "W8A16": {
        "linear": AscendW8A16LinearMethod,
    },
    "W8A8_MXFP8": {
        "linear": AscendW8A8MXFP8DynamicLinearMethod,
    },
 }
 def get_linear_quant_type(quant_description: Dict[str, Any], prefix: str,
                          packed_modules_mapping: Dict[str, Any]):
    proj_name = prefix.split(".")[-1]
    if proj_name in packed_modules_mapping:
        quant_type = None
        shard_prefixes = [
            prefix.replace(proj_name, shard_proj_name)
            for shard_proj_name in packed_modules_mapping[proj_name]
        ]
        for shard_prefix in shard_prefixes:
            shard_quant_type = quant_description[shard_prefix + '.weight']
            if quant_type is None:
                quant_type = shard_quant_type
            elif shard_quant_type != quant_type:
                raise ValueError(
                    f"Not all shards of {prefix} are quantized with the same quant type. "
                    f"Shard {shard_prefix} uses {shard_quant_type}, but another shard "
                    f"uses {quant_type}. Please check the quantization config.")
    else:
        quant_type = quant_description[prefix + '.weight']
    return quant_type
 def get_quant_method(quant_description: Dict[str, Any],
                     prefix: str,
                     layer_type: str,
                     packed_modules_mapping: Optional[Dict[str, Any]] = None,
                     layer: torch.nn.Module = None):
    if quant_description.get("quant_method") == COMPRESSED_TENSORS_METHOD:
        return get_quant_method_llmcompressor(layer)
    return get_quant_method_modelslim(quant_description, prefix, layer_type,
                                      packed_modules_mapping)
 def get_quant_method_llmcompressor(layer: torch.nn.Module):
    logger.info_once("Using the vLLM Ascend llmcompressor Quantization now!")
    if layer.scheme is None:
        raise ValueError("A scheme must be defined for each layer")
    return layer.scheme
 def get_quant_method_modelslim(
        quant_description: Dict[str, Any],
        prefix: str,
        layer_type: str,
        packed_modules_mapping: Optional[Dict[str, Any]] = None):
    logger.info_once("Using the vLLM Ascend modelslim Quantization now!")
    if packed_modules_mapping is None:
        packed_modules_mapping = dict()
    # Attention
    if '.attn' in prefix and 'fa_quant_type' in quant_description.keys():
        quant_type = quant_description['fa_quant_type']
    # Linear
    else:
        quant_type = get_linear_quant_type(quant_description, prefix,
                                           packed_modules_mapping)
    if quant_type in ASCEND_QUANTIZATION_METHOD_MAP.keys():
        method_map = ASCEND_QUANTIZATION_METHOD_MAP[quant_type]
        if layer_type in method_map.keys():
            method_cls = method_map[layer_type]
            return method_cls()
        else:
            raise NotImplementedError(
                f"Currently, vLLM Ascend doesn't support {quant_type} for {layer_type}."
            )
    raise NotImplementedError(
        "Currently, vLLM Ascend only supports the following quant types: "
        f"{list(ASCEND_QUANTIZATION_METHOD_MAP.keys())}")
 def is_mx_quant_type(instance: Any) -> bool:
    """Checks if the quantization method is an MX (microscaling) type."""
    MX_QUANT_TYPES = (AscendW8A8MXFP8DynamicLinearMethod, )
    return isinstance(instance, MX_QUANT_TYPES)
--- a/vllm_ascend/quantization/w8a8_pdmix.py
+++ b/vllm_ascend/quantization/w8a8_pdmix.py
@@ -1,70 +0,0 @@
 from typing import Any, Dict, cast
 import torch
 from vllm.config import get_current_vllm_config
 from .w8a8 import AscendW8A8LinearMethod
 from .w8a8_dynamic import (AscendW8A8DynamicFusedMoEMethod,
                           AscendW8A8DynamicLinearMethod)
 class AscendW8A8PDMixLinearMethod(AscendW8A8DynamicLinearMethod):
    def __init__(self):
        self.kv_transfer_config = get_current_vllm_config().kv_transfer_config
        super().__init__()
    @staticmethod
    def apply(layer, x, bias=None, tp_rank=0):
        if layer.is_kv_consumer:
            return AscendW8A8LinearMethod.apply(layer, x, bias, tp_rank)
        else:
            return AscendW8A8DynamicLinearMethod.apply(layer, x, bias, tp_rank)
    @staticmethod
    def get_pertensor_param(params_dtype: torch.dtype) -> Dict[str, Any]:
        return AscendW8A8LinearMethod.get_pertensor_param(params_dtype)
    @staticmethod
    def get_perchannel_param(
        output_size: int,
        params_dtype: torch.dtype,
    ) -> Dict[str, Any]:
        return AscendW8A8LinearMethod.get_perchannel_param(
            output_size, params_dtype)
    def process_weights_after_loading(self, layer):
        AscendW8A8LinearMethod.process_weights_after_loading(
            cast(AscendW8A8LinearMethod, self), layer)
        layer.weight_scale_fp32 = layer.weight_scale.data.to(torch.float32)
        layer.is_kv_consumer = self.kv_transfer_config is not None and self.kv_transfer_config.is_kv_consumer
 class AscendW8A8PDMixFusedMoeMethod(AscendW8A8DynamicFusedMoEMethod):
    def __init__(self):
        super().__init__()
    @staticmethod
    def get_dynamic_quant_param(num_experts: int,
                                intermediate_size_per_partition: int,
                                hidden_sizes: int,
                                params_dtype: torch.dtype) -> Dict[str, Any]:
        param_dict = AscendW8A8DynamicFusedMoEMethod.get_dynamic_quant_param(
            num_experts, intermediate_size_per_partition, hidden_sizes,
            params_dtype)
        param_dict["w2_deq_scale"] = torch.empty(num_experts,
                                                 hidden_sizes,
                                                 dtype=torch.float32)
        param_dict["w13_deq_scale"] = torch.empty(
            num_experts,
            2 * intermediate_size_per_partition,
            dtype=torch.float32)
        param_dict["w2_input_offset"] = torch.empty(num_experts,
                                                    1,
                                                    dtype=torch.int8)
        param_dict["w13_input_offset"] = torch.empty(num_experts,
                                                     1,
                                                     dtype=torch.int8)
        return param_dict