[Refactor] Quantization Module Refactor (#5738)

### Summary

This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes.

### Key Changes

#### 1. **Modular Directory Structure**

| Before | After |
|--------|-------|
| Flat file structure with mixed responsibilities | Organized into `methods/` subpackage for schemes |
| Single `quant_config.py` (600+ lines) | Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` |
| `utils.py` with scheme lookup logic | `methods/registry.py` with decorator-based registration |

#### 2. **Registry-Based Scheme Discovery**

Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern:

```python
# Before: Manual dictionary mapping
ASCEND_QUANTIZATION_METHOD_MAP = {
    "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...},
    ...
}

# After: Decorator-based registration
@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    ...
```

#### 3. **Abstract Base Classes**

Introduced three abstract base classes in `methods/base.py`:

- `AscendLinearScheme` - Base for linear layer quantization
- `AscendMoEScheme` - Base for MoE layer quantization
- `AscendAttentionScheme` - Base for attention layer quantization

#### 4. **Separated Config and Wrapper Classes**

- **Config classes** (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection
- **Wrapper classes** (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes

#### 5. **Cleaner Public API**

```python
# New clean module interface
from vllm_ascend.quantization import (
    AscendModelSlimConfig,
    AscendCompressedTensorsConfig,
)
from vllm_ascend.quantization.methods import get_scheme_class
```

### Architecture Diagram

```mermaid
classDiagram
    direction TB

    class QuantizationConfig {
        <<vLLM Interface>>
        +get_quant_method()
    }
    class AscendModelSlimConfig {
        +quant_description
        +get_quant_method()
        -create_scheme_for_layer()
    }
    class AscendCompressedTensorsConfig {
        +target_scheme_map
        +get_quant_method()
        -_get_scheme_from_parts()
    }
    class AscendLinearMethod {
        <<Wrapper>>
        +quant_method: AscendLinearScheme
        +create_weights()
        +apply()
    }
    class AscendFusedMoEMethod {
        <<Wrapper>>
        +quant_method: AscendMoEScheme
        +create_weights()
        +apply()
    }
    class AscendLinearScheme {
        <<Abstract>>
        +get_weight()*
        +apply()*
        +get_pertensor_param()
        +get_perchannel_param()
    }
    class AscendMoEScheme {
        <<Abstract>>
        +get_weight()*
        +get_dynamic_quant_param()*
        +apply()*
    }
    class W8A8DynamicLinear {
        +get_weight()
        +apply()
    }
    class W8A8DynamicMoE {
        +get_weight()
        +apply()
    }

    QuantizationConfig <|-- AscendModelSlimConfig
    QuantizationConfig <|-- AscendCompressedTensorsConfig
    AscendModelSlimConfig ..> AscendLinearMethod : creates
    AscendModelSlimConfig ..> AscendFusedMoEMethod : creates
    AscendCompressedTensorsConfig ..> AscendLinearMethod : creates
    AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates
    AscendLinearMethod o-- AscendLinearScheme : delegates to
    AscendFusedMoEMethod o-- AscendMoEScheme : delegates to
    AscendLinearScheme <|-- W8A8DynamicLinear
    AscendMoEScheme <|-- W8A8DynamicMoE
```

### Scheme Registration Flow

```mermaid
sequenceDiagram
    participant Module as Scheme Module
    participant Registry as _SCHEME_REGISTRY
    participant Config as QuantConfig
    participant Wrapper as Wrapper Class

    Note over Module: At import time
    Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear")
    Registry->>Registry: Store (quant_type, layer_type) -> Class

    Note over Config: At runtime
    Config->>Config: Determine quant_type from description
    Config->>Registry: get_scheme_class(quant_type, layer_type)
    Registry-->>Config: Return scheme class
    Config->>Config: scheme = scheme_cls()
    Config->>Wrapper: Create wrapper with scheme
    Wrapper-->>Config: Return wrapper instance
```

### File Changes Summary

| Original Files | Refactored Files |
|----------------|------------------|
| `__init__.py` (empty) | `__init__.py` (exports public API) |
| `quant_config.py` | `modelslim_config.py` + `wrappers.py` |
| `compressed_tensors/` | `compressed_tensors_config.py` |
| `utils.py` | `methods/registry.py` |
| `w8a8_dynamic.py` | `methods/w8a8_dynamic.py` |
| `w8a8.py` | `methods/w8a8_static.py` |
| `w4a4_flatquant_dynamic.py` | `methods/w4a4_flatquant.py` |
| ... | `methods/base.py` (new) |

### Benefits

1. **Extensibility**: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator
2. **Maintainability**: Clear separation between config parsing, wrapper logic, and scheme implementation
3. **Testability**: Abstract base classes enable easier unit testing and mocking
4. **Discoverability**: Registry pattern makes it easy to list all supported schemes
5. **Reduced Coupling**: Config classes no longer need to know about all scheme implementations

___

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-01-23 14:13:47 +08:00
parent 8378bc28b0
commit a69ef10c3a
36 changed files with 2044 additions and 1524 deletions
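
The config-to-wrapper flow described in the commit message can be sketched in plain Python as follows. This is only an illustration of the delegation pattern, not the code in `wrappers.py` or `modelslim_config.py` (neither is shown in this excerpt). `DemoScheme`, `DemoLinearWrapper`, `SCHEMES`, and this `get_quant_method` are hypothetical names; a dict stands in for the layer and lists stand in for tensors. Only the `quant_method` attribute name is taken from the architecture diagram.

```python
from typing import Any, Dict, List, Optional, Tuple, Type


class DemoScheme:
    """Hypothetical scheme: declares weight specs and does the computation."""

    def get_weight(self, input_size: int, output_size: int) -> Dict[str, Any]:
        # A shape tuple stands in for an empty tensor of that shape.
        return {"weight": (output_size, input_size)}

    def apply(self, layer: Dict[str, Any], x: List[float]) -> List[float]:
        # Placeholder computation; a real scheme would run a quantized matmul.
        return [2.0 * v for v in x]


class DemoLinearWrapper:
    """Hypothetical wrapper: exposes the framework-facing API, delegates to a scheme."""

    def __init__(self, scheme: DemoScheme) -> None:
        self.quant_method = scheme

    def create_weights(self, layer: Dict[str, Any], input_size: int,
                       output_size: int) -> None:
        # Register every parameter the scheme asks for on the layer.
        specs = self.quant_method.get_weight(input_size, output_size)
        for name, spec in specs.items():
            layer[name] = spec

    def apply(self, layer: Dict[str, Any], x: List[float]) -> List[float]:
        return self.quant_method.apply(layer, x)


# Hypothetical stand-in for the registry populated at import time.
SCHEMES: Dict[Tuple[str, str], Type[DemoScheme]] = {
    ("DEMO", "linear"): DemoScheme,
}


def get_quant_method(quant_type: str,
                     layer_type: str) -> Optional[DemoLinearWrapper]:
    """Look up the scheme class, instantiate it, and wrap it."""
    scheme_cls = SCHEMES.get((quant_type, layer_type))
    if scheme_cls is None:
        return None
    return DemoLinearWrapper(scheme_cls())


layer: Dict[str, Any] = {}
method = get_quant_method("DEMO", "linear")
if method is not None:
    method.create_weights(layer, input_size=4, output_size=2)
    result = method.apply(layer, [1.0, 2.0])
```

The point of the split is that the config side only needs a `(quant_type, layer_type)` key, while the wrapper only needs the scheme's small interface.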
--- a/vllm_ascend/quantization/methods/__init__.py
+++ b/vllm_ascend/quantization/methods/__init__.py
@@ -0,0 +1,82 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+"""Ascend quantization scheme implementations.
+
+This module provides all quantization scheme implementations for Ascend NPU.
+Schemes are automatically registered via the @register_scheme decorator.
+
+Usage:
+    from vllm_ascend.quantization.methods import get_scheme_class
+    
+    # Get a scheme class by quant_type and layer_type
+    scheme_cls = get_scheme_class("W8A8_DYNAMIC", "linear")
+    scheme = scheme_cls()
+"""
+
+from typing import Any
+
+# Import base classes
+from .base import (AscendAttentionScheme, AscendLinearScheme, AscendMoEScheme,
+                   QuantType)
+# Import registry functions
+from .registry import get_scheme_class, register_scheme
+# Import all scheme classes for external access
+from .w4a4_flatquant import AscendW4A4FlatQuantDynamicLinearMethod
+from .w4a4_laos_dynamic import AscendW4A4LaosDynamicLinearMethod
+from .w4a8 import (AscendW4A8DynamicFusedMoEMethod,
+                   AscendW4A8DynamicLinearMethod)
+from .w4a16 import AscendW4A16FusedMoEMethod
+from .w8a8_dynamic import (AscendW8A8DynamicFusedMoEMethod,
+                           AscendW8A8DynamicLinearMethod)
+from .w8a8_mxfp8 import AscendW8A8MXFP8DynamicLinearMethod
+from .w8a8_pdmix import (AscendW8A8PDMixFusedMoeMethod,
+                         AscendW8A8PDMixLinearMethod)
+from .w8a8_static import AscendW8A8LinearMethod
+from .w8a16 import AscendW8A16LinearMethod
+
+
+def is_mx_quant_type(instance: Any) -> bool:
+    """Checks if the quantization method is a microscaling (MX) type."""
+    MX_QUANT_TYPES = (AscendW8A8MXFP8DynamicLinearMethod, )
+    return isinstance(instance, MX_QUANT_TYPES)
+
+
+__all__ = [
+    # Base classes
+    "AscendAttentionScheme",
+    "AscendLinearScheme",
+    "AscendMoEScheme",
+    "QuantType",
+    # Registry functions
+    "register_scheme",
+    "get_scheme_class",
+    # Utility functions
+    "is_mx_quant_type",
+    # Scheme classes
+    "AscendW8A8LinearMethod",
+    "AscendW8A8DynamicLinearMethod",
+    "AscendW8A8DynamicFusedMoEMethod",
+    "AscendW8A8MXFP8DynamicLinearMethod",
+    "AscendW8A8PDMixLinearMethod",
+    "AscendW8A8PDMixFusedMoeMethod",
+    "AscendW8A16LinearMethod",
+    "AscendW4A8DynamicLinearMethod",
+    "AscendW4A8DynamicFusedMoEMethod",
+    "AscendW4A16FusedMoEMethod",
+    "AscendW4A4FlatQuantDynamicLinearMethod",
+    "AscendW4A4LaosDynamicLinearMethod",
+]
--- a/vllm_ascend/quantization/methods/base.py
+++ b/vllm_ascend/quantization/methods/base.py
@@ -0,0 +1,279 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+"""Abstract base classes for Ascend quantization schemes."""
+
+from abc import ABC, abstractmethod
+from enum import Enum
+from typing import Any, Callable, Dict, Optional
+
+import torch
+
+
+class QuantType(Enum):
+    """Quantization type enum for MoE schemes."""
+    NONE = 0
+    W8A8 = 1
+    W4A8 = 2
+
+
+class AscendLinearScheme(ABC):
+    """Base class for all linear quantization schemes.
+    
+    Subclasses must implement get_weight() and apply() methods.
+    Other methods have default implementations that return empty dicts
+    or do nothing.
+    """
+
+    @abstractmethod
+    def get_weight(self, input_size: int, output_size: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        """Return weight tensor specifications.
+        
+        Args:
+            input_size: Input dimension of the linear layer.
+            output_size: Output dimension of the linear layer.
+            params_dtype: Data type for parameters.
+            
+        Returns:
+            Dictionary mapping parameter names to empty tensors with
+            the correct shape and dtype.
+        """
+        ...
+
+    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
+        """Return per-tensor parameter specifications (e.g., input_scale).
+        
+        Args:
+            params_dtype: Data type for parameters.
+            
+        Returns:
+            Dictionary mapping parameter names to empty tensors.
+        """
+        return {}
+
+    def get_perchannel_param(self, output_size: int,
+                             params_dtype: torch.dtype) -> Dict[str, Any]:
+        """Return per-channel parameter specifications (e.g., weight_scale).
+        
+        Args:
+            output_size: Output dimension of the linear layer.
+            params_dtype: Data type for parameters.
+            
+        Returns:
+            Dictionary mapping parameter names to empty tensors.
+        """
+        return {}
+
+    def get_pergroup_param(self,
+                           input_size: int,
+                           output_size: int,
+                           params_dtype: torch.dtype,
+                           layer_type: Optional[str] = None) -> Dict[str, Any]:
+        """Return per-group parameter specifications.
+        
+        Args:
+            input_size: Input dimension of the linear layer.
+            output_size: Output dimension of the linear layer.
+            params_dtype: Data type for parameters.
+            layer_type: Type of layer (e.g., "row" for RowParallelLinear).
+            
+        Returns:
+            Dictionary mapping parameter names to empty tensors.
+        """
+        return {}
+
+    @abstractmethod
+    def apply(self,
+              layer: torch.nn.Module,
+              x: torch.Tensor,
+              bias: Optional[torch.Tensor] = None,
+              tp_rank: Optional[int] = 0) -> torch.Tensor:
+        """Forward computation.
+        
+        Args:
+            layer: The linear layer module.
+            x: Input tensor.
+            bias: Optional bias tensor.
+            tp_rank: Tensor parallel rank.
+            
+        Returns:
+            Output tensor after quantized linear operation.
+        """
+        ...
+
+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+        """Post-loading weight processing (transpose, format conversion, etc.).
+        
+        Args:
+            layer: The linear layer module.
+        """
+        pass
+
+
+class AscendAttentionScheme(ABC):
+    """Base class for all attention quantization schemes.
+    
+    Subclasses must implement apply() method.
+    Other methods have default implementations.
+    """
+
+    def create_weights(self, layer: torch.nn.Module) -> None:
+        """Create weights for attention quantization.
+        
+        Args:
+            layer: The attention layer module.
+        """
+        pass
+
+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+        """Post-loading weight processing for attention layer.
+        
+        Args:
+            layer: The attention layer module.
+        """
+        pass
+
+    @abstractmethod
+    def apply(self, layer: torch.nn.Module, query: torch.Tensor,
+              key: torch.Tensor, value: torch.Tensor, kv_cache, attn_metadata,
+              attn_type, scale, output) -> torch.Tensor:
+        """Forward computation for attention layer.
+        
+        Args:
+            layer: The attention layer module.
+            query: Query tensor.
+            key: Key tensor.
+            value: Value tensor.
+            kv_cache: KV cache.
+            attn_metadata: Attention metadata.
+            attn_type: Attention type.
+            scale: Scale factor.
+            output: Output tensor.
+            
+        Returns:
+            Output tensor after attention computation.
+        """
+        ...
+
+
+class AscendMoEScheme(ABC):
+    """Base class for all MoE quantization schemes.
+    
+    Subclasses must implement get_weight(), get_dynamic_quant_param(),
+    and apply() methods.
+    
+    Attributes:
+        quant_type: The quantization type for this scheme. Subclasses should
+                   override this class attribute to declare their quant type.
+    """
+
+    # Default quant type - subclasses should override this
+    quant_type: QuantType = QuantType.NONE
+
+    @abstractmethod
+    def get_weight(self, num_experts: int,
+                   intermediate_size_per_partition: int, hidden_sizes: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        """Return weight tensor specifications for MoE layer.
+        
+        Args:
+            num_experts: Number of experts.
+            intermediate_size_per_partition: Intermediate size per partition.
+            hidden_sizes: Hidden dimension size.
+            params_dtype: Data type for parameters.
+            
+        Returns:
+            Dictionary mapping parameter names to empty tensors.
+        """
+        ...
+
+    @abstractmethod
+    def get_dynamic_quant_param(self, num_experts: int,
+                                intermediate_size_per_partition: int,
+                                hidden_sizes: int,
+                                params_dtype: torch.dtype) -> Dict[str, Any]:
+        """Return dynamic quantization parameters for MoE layer.
+        
+        Args:
+            num_experts: Number of experts.
+            intermediate_size_per_partition: Intermediate size per partition.
+            hidden_sizes: Hidden dimension size.
+            params_dtype: Data type for parameters.
+            
+        Returns:
+            Dictionary mapping parameter names to empty tensors.
+        """
+        ...
+
+    @abstractmethod
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        router_logits: torch.Tensor,
+        top_k: int,
+        renormalize: bool,
+        use_grouped_topk: bool = False,
+        global_num_experts: int = -1,
+        expert_map: Optional[torch.Tensor] = None,
+        topk_group: Optional[int] = None,
+        num_expert_group: Optional[int] = None,
+        custom_routing_function: Optional[Callable] = None,
+        scoring_func: str = "softmax",
+        routed_scaling_factor: float = 1.0,
+        e_score_correction_bias: Optional[torch.Tensor] = None,
+        is_prefill: bool = True,
+        enable_force_load_balance: bool = False,
+        log2phy: Optional[torch.Tensor] = None,
+        global_redundant_expert_num: int = 0,
+        **kwargs,
+    ) -> torch.Tensor:
+        """Forward computation for MoE layer.
+        
+        Args:
+            layer: The MoE layer module.
+            x: Input hidden states.
+            router_logits: Router logits for expert selection.
+            top_k: Number of experts to select per token.
+            renormalize: Whether to renormalize expert weights.
+            use_grouped_topk: Whether to use grouped top-k selection.
+            global_num_experts: Total number of experts globally.
+            expert_map: Mapping from local to global expert indices.
+            topk_group: Group size for grouped top-k.
+            num_expert_group: Number of expert groups.
+            custom_routing_function: Custom routing function.
+            scoring_func: Scoring function name.
+            routed_scaling_factor: Scaling factor for routed experts.
+            e_score_correction_bias: Expert score correction bias.
+            is_prefill: Whether in prefill phase.
+            enable_force_load_balance: Whether to force load balancing.
+            log2phy: Logical to physical expert mapping.
+            global_redundant_expert_num: Number of redundant experts.
+            **kwargs: Additional keyword arguments.
+            
+        Returns:
+            Output tensor after MoE computation.
+        """
+        ...
+
+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+        """Post-loading weight processing for MoE layer.
+        
+        Args:
+            layer: The MoE layer module.
+        """
+        pass
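
To illustrate the contract that `base.py` sets up (abstract methods are mandatory, the optional hooks default to empty results), here is a minimal torch-free sketch. `LinearSchemeSketch`, `IncompleteScheme`, `PlainMatmulScheme`, and `try_instantiate` are hypothetical names. The signatures are simplified relative to `AscendLinearScheme`, with shape tuples and nested lists standing in for tensors.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class LinearSchemeSketch(ABC):
    """Simplified stand-in for AscendLinearScheme (assumption, not the real class)."""

    @abstractmethod
    def get_weight(self, input_size: int, output_size: int) -> Dict[str, Any]:
        ...

    def get_perchannel_param(self, output_size: int) -> Dict[str, Any]:
        # Optional hook: schemes without per-channel parameters inherit this.
        return {}

    @abstractmethod
    def apply(self, weight: List[List[float]], x: List[float]) -> List[float]:
        ...


class IncompleteScheme(LinearSchemeSketch):
    """Implements get_weight() only, so it stays abstract."""

    def get_weight(self, input_size: int, output_size: int) -> Dict[str, Any]:
        return {"weight": (output_size, input_size)}


class PlainMatmulScheme(LinearSchemeSketch):
    """Implements both abstract methods, so it can be instantiated."""

    def get_weight(self, input_size: int, output_size: int) -> Dict[str, Any]:
        return {"weight": (output_size, input_size)}

    def apply(self, weight: List[List[float]], x: List[float]) -> List[float]:
        # Unquantized matrix-vector product as a placeholder computation.
        return [sum(w * v for w, v in zip(row, x)) for row in weight]


def try_instantiate(cls: type) -> Optional[Any]:
    """Return an instance, or None if the class still has abstract methods."""
    try:
        return cls()
    except TypeError:
        return None


scheme = PlainMatmulScheme()
output = scheme.apply([[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0])
```

Because the failure happens at instantiation, a scheme that forgets `apply()` is caught when the config builds it rather than on the first forward pass.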
--- a/vllm_ascend/quantization/methods/registry.py
+++ b/vllm_ascend/quantization/methods/registry.py
@@ -0,0 +1,62 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Any, Dict, Optional, Tuple, Type
+
+# Registry: maps (quant_type, layer_type) -> SchemeClass
+_SCHEME_REGISTRY: Dict[Tuple[str, str], Type[Any]] = {}
+
+
+def register_scheme(quant_type: str, layer_type: str):
+    """Decorator to register a quantization scheme.
+    
+    Args:
+        quant_type: Quantization type (e.g., "W8A8", "W8A8_DYNAMIC").
+        layer_type: Layer type (e.g., "linear", "moe").
+        
+    Returns:
+        Decorator function that registers the class.
+        
+    Example:
+        @register_scheme("W8A8_DYNAMIC", "linear")
+        class W8A8DynamicLinearScheme(AscendLinearScheme):
+            ...
+    """
+
+    def decorator(cls: Type[Any]) -> Type[Any]:
+        key = (quant_type, layer_type)
+        if key in _SCHEME_REGISTRY:
+            raise ValueError(
+                f"Scheme already registered for {quant_type}/{layer_type}: "
+                f"{_SCHEME_REGISTRY[key].__name__}")
+        _SCHEME_REGISTRY[key] = cls
+        return cls
+
+    return decorator
+
+
+def get_scheme_class(quant_type: str, layer_type: str) -> Optional[Type[Any]]:
+    """Get scheme class for given quant_type and layer_type.
+    
+    Args:
+        quant_type: Quantization type (e.g., "W8A8", "W8A8_DYNAMIC").
+        layer_type: Layer type (e.g., "linear", "moe").
+        
+    Returns:
+        The registered scheme class, or None if not found.
+    """
+    return _SCHEME_REGISTRY.get((quant_type, layer_type))
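
The registry's two edge cases are worth spelling out: registering the same `(quant_type, layer_type)` key twice raises `ValueError`, while looking up an unknown key returns `None` rather than raising. The sketch below reproduces the two functions from the diff above in standalone form so it runs without the package. `require_scheme_class`, `DemoLinearScheme`, and the `"DEMO_QUANT"` key are hypothetical additions for illustration and are not part of this PR.

```python
from typing import Any, Dict, Optional, Tuple, Type

_SCHEME_REGISTRY: Dict[Tuple[str, str], Type[Any]] = {}


def register_scheme(quant_type: str, layer_type: str):
    """Standalone copy of the decorator from registry.py."""

    def decorator(cls: Type[Any]) -> Type[Any]:
        key = (quant_type, layer_type)
        if key in _SCHEME_REGISTRY:
            raise ValueError(
                f"Scheme already registered for {quant_type}/{layer_type}: "
                f"{_SCHEME_REGISTRY[key].__name__}")
        _SCHEME_REGISTRY[key] = cls
        return cls

    return decorator


def get_scheme_class(quant_type: str, layer_type: str) -> Optional[Type[Any]]:
    """Standalone copy of the lookup from registry.py."""
    return _SCHEME_REGISTRY.get((quant_type, layer_type))


def require_scheme_class(quant_type: str, layer_type: str) -> Type[Any]:
    """Hypothetical helper: turn a missing scheme into an explicit error."""
    scheme_cls = get_scheme_class(quant_type, layer_type)
    if scheme_cls is None:
        supported = sorted(_SCHEME_REGISTRY)
        raise NotImplementedError(
            f"No scheme for {quant_type}/{layer_type}; registered: {supported}")
    return scheme_cls


@register_scheme("DEMO_QUANT", "linear")
class DemoLinearScheme:
    """Hypothetical scheme used only to populate the registry."""


# A second registration under the same key is rejected.
try:
    register_scheme("DEMO_QUANT", "linear")(DemoLinearScheme)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Since `get_scheme_class` returns `None` for unknown keys, callers need an explicit check (or a helper like the one above) before calling `scheme_cls()`.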
--- a/vllm_ascend/quantization/methods/w4a16.py
+++ b/vllm_ascend/quantization/methods/w4a16.py
@@ -0,0 +1,278 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Any, Callable, Dict, Optional
+
+import torch
+import torch_npu
+from vllm.config import get_current_vllm_config
+from vllm.forward_context import get_forward_context
+
+from vllm_ascend.ascend_config import get_ascend_config
+from vllm_ascend.ops.fused_moe.experts_selector import select_experts
+
+from .base import AscendMoEScheme
+from .registry import register_scheme
+
+
+def unpack_from_int32(
+    weight: torch.Tensor,
+    shape: torch.Size,
+    num_bits: int,
+    packed_dim: int = 1,
+) -> torch.Tensor:
+    """Unpacks quantized weights from int32 format back to original bits.
+
+    :param weight: The packed int32 tensor containing quantized weights
+    :param shape: Original (unpacked) 2D shape, used to trim padding along the packed dimension
+    :param num_bits: The number of bits used for quantization (<= 8)
+    :param packed_dim: Dimension along which weights are packed (0 or 1), defaults to 1
+    :return: Unpacked tensor with int8 dtype after applying offset correction
+    """
+    assert weight.dtype == torch.int32, f"Expecting `weight.dtype` is torch.int32 but got {weight.dtype}."
+    assert num_bits <= 8, f"Expecting `num_bits` should not be larger than 8 but got {num_bits}."
+
+    pack_factor = 32 // num_bits
+    mask = (1 << num_bits) - 1
+
+    if packed_dim == 1:
+        unpacked_weight = torch.zeros(
+            (weight.shape[0], weight.shape[1] * pack_factor),
+            device=weight.device,
+            dtype=torch.int32,
+        )
+        for i in range(pack_factor):
+            unpacked_weight[:, i::pack_factor] = (weight >>
+                                                  (num_bits * i)) & mask
+        original_row_size = int(shape[1])
+        unpacked_weight = unpacked_weight[:, :original_row_size]
+    else:
+        unpacked_weight = torch.zeros(
+            (weight.shape[0] * pack_factor, weight.shape[1]),
+            device=weight.device,
+            dtype=torch.int32,
+        )
+        for i in range(pack_factor):
+            unpacked_weight[i::pack_factor, :] = (weight >>
+                                                  (num_bits * i)) & mask
+        original_row_size = int(shape[0])
+        unpacked_weight = unpacked_weight[:original_row_size, :]
+
+    offset = pow(2, num_bits) // 2
+    unpacked_weight = (unpacked_weight - offset).to(torch.int8)
+
+    return unpacked_weight
+
+
+def pack_to_int32(weight: torch.Tensor) -> torch.Tensor:
+    """Packs quantized weights into int32 format for storage.
+
+    :param weight: The 3D tensor to pack, must be int8 or int32 dtype
+    :return: Packed tensor with int32 dtype optimized for storage
+    """
+    assert weight.dim(
+    ) == 3, f"Expecting `weight.dim()` is 3 ([e, n, k] or [e, k, n]) but got {weight.dim()}."
+    assert weight.dtype in [
+        torch.int8, torch.int32
+    ], f"Expecting `weight.dtype` is torch.int8 or torch.int32 but got {weight.dtype}."
+
+    if weight.dtype == torch.int32:
+        assert weight.shape[
+            -1] % 8 == 0, "the last dim of weight must be divisible by 8."
+        packed_weight = torch_npu.npu_convert_weight_to_int4pack(
+            weight.flatten(0, 1))
+        packed_weight = packed_weight.view(weight.shape[0], weight.shape[1],
+                                           -1)
+    else:
+        assert weight.shape[
+            -1] % 4 == 0, "the last dim of weight must be divisible by 4."
+        packed_weight = weight.view(torch.int32).contiguous()
+
+    return packed_weight
+
+
+@register_scheme("W4A16", "moe")
+class AscendW4A16FusedMoEMethod(AscendMoEScheme):
+    """FusedMoE method for Ascend W4A16."""
+
+    def __init__(self) -> None:
+        self.transpose_weight = True
+        self.num_bits = 4  # dtype = torch.int4
+        self.pack_factor = 8  # pack 8 of torch.int4 tensors to torch.int32
+
+        vllm_config = get_current_vllm_config()
+        self.group_size = vllm_config.quant_config.quant_description.get(
+            "group_size", 32)
+        self.dynamic_eplb = get_ascend_config().eplb_config.dynamic_eplb
+
+    def get_weight(
+        self,
+        num_experts: int,
+        intermediate_size_per_partition: int,
+        hidden_sizes: int,
+        params_dtype: torch.dtype,
+    ) -> Dict[str, Any]:
+        assert intermediate_size_per_partition % self.pack_factor == 0, f"Expecting `intermediate_size_per_partition` {intermediate_size_per_partition} can be divided by `pack_factor` {self.pack_factor}"
+        assert hidden_sizes % self.pack_factor == 0, f"Expecting `hidden_sizes` {hidden_sizes} can be divided by `pack_factor` {self.pack_factor}"
+
+        param_dict = {}
+
+        param_dict["w13_weight_packed"] = torch.empty(
+            num_experts,
+            2 * intermediate_size_per_partition,
+            hidden_sizes // self.pack_factor,
+            dtype=torch.int32)
+        param_dict["w2_weight_packed"] = torch.empty(
+            num_experts,
+            hidden_sizes,
+            intermediate_size_per_partition // self.pack_factor,
+            dtype=torch.int32)
+
+        return param_dict
+
+    def get_dynamic_quant_param(
+        self,
+        num_experts: int,
+        intermediate_size_per_partition: int,
+        hidden_sizes: int,
+        params_dtype: torch.dtype,
+    ) -> Dict[str, Any]:
+        assert intermediate_size_per_partition % self.group_size == 0, f"Expecting `intermediate_size_per_partition` {intermediate_size_per_partition} can be divided by `group_size` {self.group_size}"
+        assert hidden_sizes % self.group_size == 0, f"Expecting `hidden_sizes` {hidden_sizes} can be divided by `group_size` {self.group_size}"
+
+        param_dict = {}
+
+        param_dict["w13_weight_scale"] = torch.empty(
+            num_experts,
+            2 * intermediate_size_per_partition,
+            hidden_sizes // self.group_size,
+            dtype=torch.bfloat16)
+        param_dict["w2_weight_scale"] = torch.empty(
+            num_experts,
+            hidden_sizes,
+            intermediate_size_per_partition // self.group_size,
+            dtype=torch.bfloat16)
+        param_dict["w13_weight_shape"] = torch.empty(num_experts,
+                                                     2,
+                                                     dtype=torch.int32)
+        param_dict["w2_weight_shape"] = torch.empty(num_experts,
+                                                    2,
+                                                    dtype=torch.int32)
+        param_dict["w13_weight_offset"] = torch.zeros(
+            num_experts,
+            2 * intermediate_size_per_partition,
+            hidden_sizes // self.group_size,
+            dtype=torch.bfloat16)
+        param_dict["w2_weight_offset"] = torch.zeros(
+            num_experts,
+            hidden_sizes,
+            intermediate_size_per_partition // self.group_size,
+            dtype=torch.bfloat16)
+
+        return param_dict
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        router_logits: torch.Tensor,
+        top_k: int,
+        renormalize: bool,
+        use_grouped_topk: bool = False,
+        global_num_experts: int = -1,
+        expert_map: Optional[torch.Tensor] = None,
+        topk_group: Optional[int] = None,
+        num_expert_group: Optional[int] = None,
+        custom_routing_function: Optional[Callable] = None,
+        scoring_func: str = "softmax",
+        routed_scaling_factor: float = 1.0,
+        e_score_correction_bias: Optional[torch.Tensor] = None,
+        is_prefill: bool = True,
+        enable_force_load_balance: bool = True,
+        log2phy: Optional[torch.Tensor] = None,
+        global_redundant_expert_num: int = 0,
+        **kwargs,
+    ) -> torch.Tensor:
+        assert router_logits.shape[
+            1] == global_num_experts - global_redundant_expert_num, "Number of global experts mismatch (excluding redundancy)"
+
+        topk_weights, topk_ids = select_experts(
+            hidden_states=x,
+            router_logits=router_logits,
+            top_k=top_k,
+            use_grouped_topk=use_grouped_topk,
+            renormalize=renormalize,
+            topk_group=topk_group,
+            num_expert_group=num_expert_group,
+            custom_routing_function=custom_routing_function,
+            scoring_func=scoring_func,
+            e_score_correction_bias=e_score_correction_bias,
+            global_num_experts=global_num_experts)
+
+        topk_ids = topk_ids.to(torch.int32)
+        topk_weights = topk_weights.to(x.dtype)
+
+        moe_comm_method = get_forward_context().moe_comm_method
+        return moe_comm_method.fused_experts(
+            hidden_states=x,
+            w1=layer.w13_weight_packed,
+            w2=layer.w2_weight_packed,
+            w1_scale=layer.w13_weight_scale,
+            w2_scale=layer.w2_weight_scale,
+            w1_offset=layer.w13_weight_offset,
+            w2_offset=layer.w2_weight_offset,
+            topk_weights=topk_weights,
+            topk_ids=topk_ids,
+            use_int4_w4a16=True,
+            expert_map=expert_map,
+            log2phy=log2phy,
+            dynamic_eplb=self.dynamic_eplb,
+            mc2_mask=kwargs.get("mc2_mask", None))
+
+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+        if self.transpose_weight:
+            w13_shape = layer.w13_weight_packed.data.shape
+            w2_shape = layer.w2_weight_packed.data.shape
+            unpacked_w13_weight = (unpack_from_int32(
+                layer.w13_weight_packed.data.flatten(0, 1),
+                torch.Size([
+                    w13_shape[0] * w13_shape[1],
+                    w13_shape[2] * self.pack_factor
+                ]),
+                self.num_bits,
+            ).view(w13_shape[0], w13_shape[1],
+                   -1).transpose(1, 2).contiguous().int())
+            unpacked_w2_weight = (unpack_from_int32(
+                layer.w2_weight_packed.data.flatten(0, 1),
+                torch.Size([
+                    w2_shape[0] * w2_shape[1], w2_shape[2] * self.pack_factor
+                ]),
+                self.num_bits,
+            ).view(w2_shape[0], w2_shape[1],
+                   -1).transpose(1, 2).contiguous().int())
+            layer.w13_weight_packed.data = pack_to_int32(unpacked_w13_weight)
+            layer.w2_weight_packed.data = pack_to_int32(unpacked_w2_weight)
+
+            layer.w13_weight_scale.data = layer.w13_weight_scale.data.transpose(
+                1, 2).contiguous()
+            layer.w2_weight_scale.data = layer.w2_weight_scale.data.transpose(
+                1, 2).contiguous()
+
+            layer.w13_weight_offset.data = layer.w13_weight_offset.data.transpose(
+                1, 2).contiguous()
+            layer.w2_weight_offset.data = layer.w2_weight_offset.data.transpose(
+                1, 2).contiguous()
--- a/vllm_ascend/quantization/methods/w4a4_flatquant.py
+++ b/vllm_ascend/quantization/methods/w4a4_flatquant.py
@@ -0,0 +1,190 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import math
+from typing import Any, Dict, Optional, Tuple
+
+import torch
+import torch_npu
+
+from .base import AscendLinearScheme
+from .registry import register_scheme
+
+KRONECKER_QUANT_MAX_BATCH_SIZE = 32768
+
+
+def pack_int4_weights(weight_tensor: torch.Tensor) -> torch.Tensor:
+    """Pack int4 weights for NPU."""
+    original_device = weight_tensor.device
+    weight_tensor_npu = weight_tensor.npu()
+    weight_int4_packed = torch_npu.npu_convert_weight_to_int4pack(
+        weight_tensor_npu.to(torch.int32), inner_k_tiles=1)
+    return weight_int4_packed.to(original_device)
+
+
+def get_decompose_dim(n: int) -> Tuple[int, int]:
+    """Factor n as (a - b) * (a + b), with a*a - b*b == n, for Kronecker quantization."""
+    a = int(math.sqrt(n))
+    if a * a < n:
+        a += 1
+    while True:
+        tmp = a * a - n
+        b = int(math.sqrt(tmp))
+        if b * b == tmp:
+            break
+        a += 1
+    return a - b, a + b
+
+
+# TODO: This function is a temporary workaround for the npu_kronecker_quant operator,
+# which has a limitation on the maximum batch size (dim0). This wrapper should be
+# removed once the operator supports larger inputs natively.
+def batched_kronecker_quant(
+    x: torch.Tensor,
+    left_trans: torch.Tensor,
+    right_trans: torch.Tensor,
+    clip_ratio: float,
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """Batched Kronecker quantization with batch size limit handling."""
+    batch_tokens = x.shape[0]
+    if batch_tokens <= KRONECKER_QUANT_MAX_BATCH_SIZE:
+        return torch_npu.npu_kronecker_quant(x,
+                                             left_trans,
+                                             right_trans,
+                                             clip_ratio=clip_ratio,
+                                             dst_dtype=torch.int32)
+    x_chunks = torch.split(x, KRONECKER_QUANT_MAX_BATCH_SIZE, dim=0)
+    processed_chunks = [
+        torch_npu.npu_kronecker_quant(chunk,
+                                      left_trans,
+                                      right_trans,
+                                      clip_ratio=clip_ratio,
+                                      dst_dtype=torch.int32)
+        for chunk in x_chunks
+    ]
+    quantized_list, scale_list = zip(*processed_chunks)
+    x_quantized_int4 = torch.cat(quantized_list, dim=0)
+    activation_scale = torch.cat(scale_list, dim=0)
+    return x_quantized_int4, activation_scale
+
+
+@register_scheme("W4A4_FLATQUANT_DYNAMIC", "linear")
+class AscendW4A4FlatQuantDynamicLinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W4A4_FLATQUANT_DYNAMIC.
+    
+    This class implements W4A4 quantization with FlatQuant approach and dynamic activation quantization.
+    - Weight: 4-bit quantization (per-channel) with scale and offset, stored as int8 and packed to int32 during loading
+    - Activation: 4-bit dynamic quantization with FlatQuant transform matrices (left_trans, right_trans) for distribution smoothing
+    - Parameters: clip_ratio for controlling quantization clipping, weight_offset for asymmetric quantization, loaded from external weights
+    """
+    input_size = 0
+
+    def __init__(self):
+        self.sym = True
+
+    def get_weight(self, input_size: int, output_size: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        if input_size % 8 != 0:
+            raise ValueError(
+                f"input_size ({input_size}) must be divisible by 8 for int4 packing"
+            )
+        AscendW4A4FlatQuantDynamicLinearMethod.input_size = input_size
+        params_dict = {
+            "weight": torch.empty(output_size, input_size, dtype=torch.int8)
+        }
+        return params_dict
+
+    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
+        params_dict = {}
+        left_trans_dim, right_trans_dim = get_decompose_dim(
+            AscendW4A4FlatQuantDynamicLinearMethod.input_size)
+        params_dict["left_trans"] = torch.empty(left_trans_dim,
+                                                left_trans_dim,
+                                                dtype=params_dtype)
+        params_dict["right_trans"] = torch.empty(right_trans_dim,
+                                                 right_trans_dim,
+                                                 dtype=params_dtype)
+        params_dict["clip_ratio"] = torch.empty(1, dtype=torch.float32)
+        return params_dict
+
+    def get_perchannel_param(
+        self,
+        output_size: int,
+        params_dtype: torch.dtype,
+    ) -> Dict[str, Any]:
+        params_dict = {}
+        params_dict["weight_scale"] = torch.empty(output_size,
+                                                  1,
+                                                  dtype=torch.float32)
+        params_dict["weight_offset"] = torch.empty(output_size,
+                                                   1,
+                                                   dtype=torch.float32)
+        return params_dict
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        bias: Optional[torch.Tensor] = None,
+        tp_rank: Optional[int] = 0,
+    ) -> torch.Tensor:
+        original_dtype = x.dtype
+        input_shape = x.shape
+        in_features = input_shape[-1]
+        left_dim = layer.left_trans.shape[0]
+        right_dim = layer.right_trans.shape[0]
+        if left_dim * right_dim != in_features:
+            raise ValueError(
+                f"FlatQuant transform matrices dimension mismatch: "
+                f"left_dim({left_dim}) * right_dim({right_dim}) != in_features({in_features})"
+            )
+        left_trans_matched = layer.left_trans.to(original_dtype)
+        right_trans_matched = layer.right_trans.to(original_dtype)
+        x_reshaped = x.view(-1, left_dim, right_dim)
+        x_quantized_int4, activation_scale = batched_kronecker_quant(
+            x_reshaped, left_trans_matched, right_trans_matched,
+            layer.aclnn_clip_ratio)
+        x_quantized_reshaped = x_quantized_int4.view(-1,
+                                                     left_dim * right_dim // 8)
+        pertoken_scale = activation_scale.view(-1).to(torch.float32)
+        output = torch_npu.npu_quant_matmul(x_quantized_reshaped,
+                                            layer.weight_packed.t(),
+                                            layer.weight_scale.view(-1).to(
+                                                torch.float32),
+                                            pertoken_scale=pertoken_scale,
+                                            bias=None,
+                                            output_dtype=original_dtype)
+        output = output.view(*input_shape[:-1], -1)
+        if bias is not None:
+            output = output + bias.to(original_dtype)
+        return output
+
+    def process_weights_after_loading(self, layer):
+        # NOTE: W4A4 does not currently support the NZ weight format.
+        weight_packed = pack_int4_weights(layer.weight.data)
+        layer.register_parameter(
+            'weight_packed',
+            torch.nn.Parameter(weight_packed, requires_grad=False))
+        del layer.weight
+        layer.weight_scale.data = layer.weight_scale.data.to(torch.float32)
+        layer.weight_offset.data = layer.weight_offset.data.to(torch.float32)
+        layer.left_trans = torch.nn.Parameter(
+            layer.left_trans.data.t().contiguous())
+        layer.right_trans = torch.nn.Parameter(layer.right_trans.data)
+        layer.clip_ratio = torch.nn.Parameter(
+            layer.clip_ratio.data.to(torch.float32))
+        layer.aclnn_clip_ratio = layer.clip_ratio.item()
--- a/vllm_ascend/quantization/methods/w4a4_laos_dynamic.py
+++ b/vllm_ascend/quantization/methods/w4a4_laos_dynamic.py
@@ -0,0 +1,126 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Any, Dict, Optional
+
+import torch
+import torch_npu
+
+from .base import AscendLinearScheme
+from .registry import register_scheme
+
+
+@register_scheme("W4A4_DYNAMIC", "linear")
+class AscendW4A4LaosDynamicLinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W4A4_DYNAMIC.
+    
+    This class implements W4A4 quantization with LAOS approach and dynamic activation quantization.
+    - Weight: 4-bit quantization (per-channel) with scale and offset, stored as int8.
+    - Activation: 4-bit dynamic quantization.
+    """
+
+    def __init__(self):
+        self.transpose_weight = True
+        self.rotation_type = None
+
+    def set_rotation_config(self, prefix: str, metadata: Dict) -> Optional[str]:
+        """Return the rotation type for the layer at `prefix`, or None if it has none."""
+        layer_idx = prefix.split(".")[2]
+        if prefix.endswith("o_proj"):
+            layers = metadata["quarot"]["heads_rotation"]["layers"]
+            if layer_idx in layers:
+                return "heads_rotation"
+        if prefix.endswith("down_proj"):
+            layers = metadata["quarot"]["kronecker_rotation"]["layers"]
+            if layer_idx in layers:
+                return "kronecker_rotation"
+        return None
+
+    def get_weight(self, input_size: int, output_size: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        params_dict = {
+            "weight": torch.empty(output_size, input_size, dtype=torch.int8)
+        }
+        return params_dict
+
+    def get_perchannel_param(self, output_size: int,
+                             params_dtype: torch.dtype) -> Dict[str, Any]:
+        params_dict = {}
+        params_dict["weight_scale"] = torch.empty(output_size,
+                                                  1,
+                                                  dtype=torch.float32)
+        params_dict["weight_offset"] = torch.empty(output_size,
+                                                   1,
+                                                   dtype=torch.float32)
+        if self.rotation_type == "heads_rotation":
+            params_dict["heads_rotation"] = torch.zeros((64, 64),
+                                                        dtype=torch.float32)
+        if self.rotation_type == "kronecker_rotation":
+            params_dict["kronecker_rotation_n"] = torch.zeros(
+                (160, 160), dtype=torch.float32)
+            params_dict["kronecker_rotation_m"] = torch.zeros(
+                (160, 160), dtype=torch.float32)
+        return params_dict
+
+    def apply_rotation(self, layer: torch.nn.Module,
+                       x: torch.Tensor) -> torch.Tensor:
+        """Apply rotation transformation to input tensor."""
+        init_shape = x.shape
+        dtype = x.dtype
+        if self.rotation_type == "heads_rotation":
+            Q1 = layer.heads_rotation
+            scaled_x = x.reshape(-1, Q1.shape[1], 128)
+            scaled_x = torch.matmul(Q1.T, scaled_x).reshape(init_shape)
+            return scaled_x.to(dtype)
+        if self.rotation_type == "kronecker_rotation":
+            Q1 = layer.kronecker_rotation_m
+            Q2 = layer.kronecker_rotation_n
+            scaled_x = x.reshape(-1, Q1.shape[0], Q2.shape[0])
+            scaled_x = torch.matmul(scaled_x, Q2)
+            scaled_x = torch.matmul(Q1.T, scaled_x)
+            scaled_x = scaled_x.reshape(init_shape)
+            return scaled_x.to(dtype)
+        return x
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        bias: Optional[torch.Tensor] = None,
+        tp_rank: Optional[int] = 0,
+    ) -> torch.Tensor:
+        dtype = x.dtype
+        x, pertoken_scale = torch_npu.npu_dynamic_quant(x, dst_type=torch.quint4x2)
+        pertoken_scale = pertoken_scale.reshape(-1, 1)
+        pertoken_scale = pertoken_scale.squeeze(-1)
+        output = torch_npu.npu_quant_matmul(
+            x,
+            layer.weight.data,
+            scale=layer.weight_scale.data.view(-1),
+            pertoken_scale=pertoken_scale,
+            bias=None,
+            output_dtype=dtype)
+        if bias is not None:
+            output = output + bias.to(dtype)
+        return output
+
+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+        layer.weight_scale.data = layer.weight_scale.data.to(torch.float32)
+        layer.weight.data = torch_npu.npu_convert_weight_to_int4pack(
+            layer.weight.data.to(torch.int32))
+        if self.transpose_weight:
+            layer.weight.data = layer.weight.data.transpose(-1, -2)
--- a/vllm_ascend/quantization/methods/w4a8.py
+++ b/vllm_ascend/quantization/methods/w4a8.py
@@ -0,0 +1,475 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Any, Callable, Dict, Optional
+
+import numpy as np
+import torch
+import torch_npu
+from vllm.config import get_current_vllm_config
+from vllm.distributed import get_ep_group
+from vllm.forward_context import get_forward_context
+
+from vllm_ascend.ascend_config import get_ascend_config
+from vllm_ascend.distributed.parallel_state import get_mc2_group
+from vllm_ascend.ops.fused_moe.experts_selector import select_experts
+from vllm_ascend.utils import maybe_trans_nz
+
+from .base import AscendLinearScheme, AscendMoEScheme, QuantType
+from .registry import register_scheme
+
+
+@register_scheme("W4A8_DYNAMIC", "linear")
+class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W4A8_DYNAMIC."""
+
+    def __init__(self):
+        vllm_config = get_current_vllm_config()
+        self.group_size = vllm_config.quant_config.quant_description.get(
+            "group_size", 256)
+        quant_version = vllm_config.quant_config.quant_description.get(
+            "version", "0")
+        self.new_quant_version = quant_version == "1.0.0"
+
+        from vllm.distributed import get_tensor_model_parallel_world_size
+        self.tp_size = get_tensor_model_parallel_world_size()
+
+    def get_weight(self, input_size: int, output_size: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        """Create weight parameters.
+        
+        For the new quantization version (two int4 values packed into one int8), the
+        output dimension is halved (e.g., [2048, 3072] -> [1024, 3072]). The returned
+        dict includes "_packed_dim" and "_packed_factor" for vLLM's weight loader.
+        """
+        params_dict = {}
+
+        if self.new_quant_version:
+            # Two int4 values are packed into one int8, so the output dimension is halved.
+            pack_factor = 2
+            actual_output_size = output_size // pack_factor
+            params_dict["weight"] = torch.empty(actual_output_size,
+                                                input_size,
+                                                dtype=torch.int8)
+            # Add packing information for vLLM's weight_loader
+            params_dict["_packed_dim"] = 0
+            params_dict["_packed_factor"] = pack_factor
+        else:
+            params_dict["weight"] = torch.empty(output_size,
+                                                input_size,
+                                                dtype=torch.int8)
+
+        return params_dict
+
+    def get_pergroup_param(self,
+                           input_size: int,
+                           output_size: int,
+                           params_dtype: torch.dtype,
+                           layer_type: Optional[str] = None) -> Dict[str, Any]:
+        """Create per-group quantization parameters."""
+        params_dict = {}
+        params_dict["weight_scale"] = torch.empty(output_size,
+                                                  1,
+                                                  dtype=params_dtype)
+        params_dict["weight_offset"] = torch.empty(output_size,
+                                                   1,
+                                                   dtype=params_dtype)
+        params_dict["weight_scale_second"] = torch.empty(output_size,
+                                                         input_size //
+                                                         self.group_size,
+                                                         dtype=params_dtype)
+        params_dict["weight_offset_second"] = torch.empty(output_size,
+                                                          input_size //
+                                                          self.group_size,
+                                                          dtype=params_dtype)
+
+        # NOTE: In the w4a8 quantization implementation, scale_bias has shape
+        #       [output_size, 16] for down_proj and o_proj (layer_type == "row")
+        #       and [output_size, 1] for all other layers.
+        if self.new_quant_version:
+            scale_bias_dim = 16 if layer_type == "row" else 1
+
+            params_dict["scale_bias"] = torch.empty(output_size,
+                                                    scale_bias_dim,
+                                                    dtype=torch.float32)
+        return params_dict
+
+    @staticmethod
+    def process_scale_second(weight: torch.Tensor,
+                             scale: torch.Tensor,
+                             per_group_scale: torch.Tensor,
+                             is_new_quant: bool = False):
+        """Process the scale for second-level quantization.
+        
+        Args:
+            weight: weight tensor [k, n] (in new version, n is already compressed to n/2)
+            scale: first-level quantization scale [output_size]
+            per_group_scale: second-level per-group quantization scale [group_num, n_scale]
+            is_new_quant: whether it's the new quantization version (weight already compressed)
+        
+        Returns:
+            (antiquant_scale, bias): dequantization scale and bias (bias=None for new version)
+        """
+        k, n = weight.shape
+        group_num, n_scale = per_group_scale.shape
+
+        if is_new_quant:
+            # Restore logical dimension for compressed weight
+            n = n * 2
+
+        bias = None
+        if not is_new_quant:
+            weight_high = weight.to(torch.float32).reshape(
+                group_num, -1, n) * per_group_scale.reshape(group_num, 1, n)
+            weight_high = weight_high.reshape(k, n)
+            bias = 8 * (weight_high.to(torch.float32) * scale).sum(dim=0)
+        # NOTE: scale_bias is not used currently
+        #       because in msmodelslim w4a8 uses symmetric quantization
+
+        # TODO: support potential future asymmetric quantization
+        antiquant_scale = (scale * per_group_scale).reshape(group_num, n)
+        return antiquant_scale.npu(), bias
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        bias: Optional[torch.Tensor] = None,
+        tp_rank: Optional[int] = None,
+    ) -> torch.Tensor:
+        return torch_npu.npu_weight_quant_batchmatmul(
+            x,
+            layer.weight,
+            antiquant_scale=layer.weight_scale_second.to(x.dtype),
+            antiquant_group_size=self.group_size,
+        )
+
+    def process_weights_after_loading(self, layer: torch.nn.Module):
+        layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()
+        layer.weight.data = maybe_trans_nz(layer.weight.data)
+        layer.weight_scale.data = layer.weight_scale.data.flatten().to(
+            torch.float32)
+        layer.weight_offset.data = layer.weight_offset.data.flatten()
+        layer.weight_scale_second.data, scale_bias = self.process_scale_second(
+            layer.weight.data,
+            layer.weight_scale.data,
+            layer.weight_scale_second.data.transpose(0, 1).contiguous(),
+            is_new_quant=self.new_quant_version,
+        )
+
+        if self.new_quant_version:
+            # Process the loaded data based on layer type
+            if hasattr(layer, "scale_bias"):
+                if layer.scale_bias.data.shape[1] == 1:
+                    layer.scale_bias.data = layer.scale_bias.data.flatten()
+                else:
+                    layer.scale_bias.data = layer.scale_bias.data.contiguous()
+        else:
+            if scale_bias is not None:
+                param = torch.nn.Parameter(scale_bias, requires_grad=False)
+                layer.register_parameter("weight_scale_bias", param)
+
+        # Convert to NPU-specific int4pack format
+        if self.new_quant_version:
+            # Weights on disk are already in packed int4 format (two int4 per int8).
+            # Reinterpret every 4 int8 values as one int32.
+            assert layer.weight.data.shape[-1] % 4 == 0, \
+                f"the last dim of weight must be divisible by 4, got shape {layer.weight.data.shape}"
+            layer.weight.data = layer.weight.data.view(
+                torch.int32).contiguous()
+        else:
+            # weights are not compressed
+            # need to be packed via npu_convert_weight_to_int4pack
+            layer.weight.data = torch_npu.npu_convert_weight_to_int4pack(
+                layer.weight.data.to(torch.int32))
+
+
+@register_scheme("W4A8_DYNAMIC", "moe")
+class AscendW4A8DynamicFusedMoEMethod(AscendMoEScheme):
+    """FusedMoE method for Ascend W4A8_DYNAMIC."""
+
+    # Declare the quantization type for this scheme
+    quant_type: QuantType = QuantType.W4A8
+
+    def __init__(self):
+        self.ep_group = get_ep_group()
+
+        vllm_config = get_current_vllm_config()
+        self.group_size = vllm_config.quant_config.quant_description.get(
+            "group_size", 256)
+        # NOTE: group_size == 0 means the weights were quantized from bf16 to int4 per channel
+        self.is_per_channel_weight = self.group_size == 0
+        quant_version = vllm_config.quant_config.quant_description.get(
+            "version", "0")
+        # NOTE: the new quantization format packs two int4 values into one int8
+        self.new_quant_version = quant_version == "1.0.0"
+        self.tp_size = 1 if vllm_config.parallel_config.enable_expert_parallel else self.ep_group.world_size
+        self.dynamic_eplb = get_ascend_config().eplb_config.dynamic_eplb
+        if self.new_quant_version and self.tp_size > 16:
+            raise ValueError(
+                "The current weight format does not support MoE tensor parallelism with tp_size > 16.")
+
+        try:
+            device_group = get_mc2_group().device_group
+            # TODO: Try local_rank = ep_group.rank_in_group
+            local_rank = torch.distributed.get_rank(group=device_group)
+            backend = device_group._get_backend(torch.device("npu"))
+            self.moe_all_to_all_group_name = backend.get_hccl_comm_name(
+                local_rank)
+        except AttributeError:
+            self.moe_all_to_all_group_name = ""
+
+    def get_weight(self, num_experts: int,
+                   intermediate_size_per_partition: int, hidden_sizes: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        param_dict = {}
+        if self.new_quant_version:
+            w13_output_size = intermediate_size_per_partition
+            w2_output_size = hidden_sizes // 2
+        else:
+            w13_output_size = 2 * intermediate_size_per_partition
+            w2_output_size = hidden_sizes
+
+        param_dict["w13_weight"] = torch.empty(num_experts,
+                                               w13_output_size,
+                                               hidden_sizes,
+                                               dtype=torch.int8)
+        param_dict["w2_weight"] = torch.empty(num_experts,
+                                              w2_output_size,
+                                              intermediate_size_per_partition,
+                                              dtype=torch.int8)
+        return param_dict
+
+    def get_dynamic_quant_param(self, num_experts: int,
+                                intermediate_size_per_partition: int,
+                                hidden_sizes: int,
+                                params_dtype: torch.dtype) -> Dict[str, Any]:
+        param_dict = {}
+        param_dict["w13_weight_scale"] = torch.empty(
+            num_experts,
+            2 * intermediate_size_per_partition,
+            1,
+            dtype=torch.float32)
+
+        param_dict["w13_weight_offset"] = torch.empty(
+            num_experts,
+            2 * intermediate_size_per_partition,
+            1,
+            dtype=torch.float32)
+
+        param_dict["w2_weight_scale"] = torch.empty(num_experts,
+                                                    hidden_sizes,
+                                                    1,
+                                                    dtype=torch.float32)
+        param_dict["w2_weight_offset"] = torch.empty(num_experts,
+                                                     hidden_sizes,
+                                                     1,
+                                                     dtype=torch.float32)
+        if not self.is_per_channel_weight:
+            param_dict["w13_weight_scale_second"] = torch.empty(
+                num_experts,
+                2 * intermediate_size_per_partition,
+                hidden_sizes // self.group_size,
+                dtype=torch.float32)
+            param_dict["w13_weight_offset_second"] = torch.empty(
+                num_experts,
+                2 * intermediate_size_per_partition,
+                hidden_sizes // self.group_size,
+                dtype=torch.float32)
+
+            param_dict["w2_weight_scale_second"] = torch.empty(
+                num_experts,
+                hidden_sizes,
+                intermediate_size_per_partition // self.group_size,
+                dtype=torch.float32)
+            param_dict["w2_weight_offset_second"] = torch.empty(
+                num_experts,
+                hidden_sizes,
+                intermediate_size_per_partition // self.group_size,
+                dtype=torch.float32)
+
+        if self.new_quant_version:
+            param_dict["w13_scale_bias"] = torch.empty(
+                num_experts,
+                2 * intermediate_size_per_partition,
+                1,
+                dtype=torch.float32)
+            param_dict["w2_scale_bias"] = torch.empty(num_experts,
+                                                      hidden_sizes,
+                                                      16 // self.tp_size,
+                                                      dtype=torch.float32)
+
+        return param_dict
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        router_logits: torch.Tensor,
+        top_k: int,
+        renormalize: bool,
+        use_grouped_topk: bool = False,
+        global_num_experts: int = -1,
+        expert_map: Optional[torch.Tensor] = None,
+        topk_group: Optional[int] = None,
+        num_expert_group: Optional[int] = None,
+        custom_routing_function: Optional[Callable] = None,
+        scoring_func: str = "softmax",
+        routed_scaling_factor: float = 1.0,
+        e_score_correction_bias: Optional[torch.Tensor] = None,
+        is_prefill: bool = True,
+        enable_force_load_balance: bool = False,
+        log2phy: Optional[torch.Tensor] = None,
+        global_redundant_expert_num: int = 0,
+        **kwargs,
+    ) -> torch.Tensor:
+        assert router_logits.shape[
+            1] == global_num_experts - global_redundant_expert_num, "Number of global experts mismatch (excluding redundancy)"
+
+        # NOTE: npu_moe_gating_top_k currently supports only the `group_count=256` pattern
+        topk_weights, topk_ids = select_experts(
+            hidden_states=x,
+            router_logits=router_logits,
+            top_k=top_k,
+            use_grouped_topk=use_grouped_topk,
+            renormalize=renormalize,
+            topk_group=topk_group,
+            num_expert_group=num_expert_group,
+            custom_routing_function=custom_routing_function,
+            scoring_func=scoring_func,
+            e_score_correction_bias=e_score_correction_bias,
+            global_num_experts=global_num_experts)
+
+        # This is a naive implementation of expert load balancing that avoids
+        # accumulating too many tokens on a single rank.
+        # It is currently only activated during profile runs.
+        if enable_force_load_balance:
+            random_matrix = torch.rand(topk_ids.size(0),
+                                       global_num_experts -
+                                       global_redundant_expert_num,
+                                       device=topk_ids.device)
+            topk_ids = torch.argsort(
+                random_matrix, dim=1)[:, :topk_ids.size(1)].to(topk_ids.dtype)
+
+        topk_weights = topk_weights.to(x.dtype)
+
+        moe_comm_method = get_forward_context().moe_comm_method
+        return moe_comm_method.fused_experts(
+            hidden_states=x,
+            w1=[layer.w13_weight],
+            w2=[layer.w2_weight],
+            w1_scale=[layer.w13_weight_scale],
+            w2_scale=[layer.w2_weight_scale],
+            w1_scale_bias=layer.w13_scale_bias,
+            w2_scale_bias=layer.w2_scale_bias,
+            topk_weights=topk_weights,
+            topk_ids=topk_ids,
+            use_int4_w4a8=True,
+            expert_map=expert_map,
+            log2phy=log2phy,
+            dynamic_eplb=self.dynamic_eplb,
+            mc2_mask=kwargs.get("mc2_mask", None))
+
+    def process_scale(self, weight: torch.Tensor, scale, per_group_scale):
+        scale = scale.transpose(1, 2).contiguous()
+        if self.is_per_channel_weight:
+            scale_np = scale.cpu().numpy()
+            scale_np.dtype = np.uint32
+            scale_uint64_tensor = torch.from_numpy(scale_np.astype(
+                np.int64)).npu()
+            return scale_uint64_tensor, None
+        per_group_scale = per_group_scale.transpose(1, 2).contiguous()
+        group_num, k, n = weight.shape
+        # In the new quant version the weight is packed along n, which halves n; restore the original n here
+        if self.new_quant_version:
+            n = n * 2
+        per_group_scale = per_group_scale.reshape(group_num, -1, n)
+        group_num, quantgroup_num, n = per_group_scale.shape
+        bias = None
+        if not self.new_quant_version:
+            weight_high = weight.to(torch.float32).reshape([group_num, quantgroup_num, -1, n]) * \
+                per_group_scale.reshape([group_num, quantgroup_num, 1, n])
+            weight_high = weight_high.reshape([group_num, k, n])
+            bias = 8 * (weight_high.to(torch.float32) * scale).sum(axis=1)
+        scale_fp32 = (scale * per_group_scale).to(torch.float16).to(
+            torch.float32)
+        scale_fp32_np = scale_fp32.cpu().numpy()
+        scale_fp32_np.dtype = np.uint32
+        sscale_uint64 = np.zeros((group_num, quantgroup_num, n * 2),
+                                 dtype=np.uint32)
+
+        sscale_uint64[..., ::2] = scale_fp32_np
+
+        sscale_uint64_buffer = np.frombuffer(sscale_uint64.tobytes(),
+                                             dtype=np.int64).copy()
+        sscale_uint64_tensor = torch.from_numpy(sscale_uint64_buffer).reshape(
+            group_num, quantgroup_num, n)
+        sscale_uint64_tensor = sscale_uint64_tensor.npu()
+        return sscale_uint64_tensor, bias
+
+    def update_bias(self, layer, w13_bias, w2_bias):
+        if self.new_quant_version:
+            layer.w13_scale_bias.data = layer.w13_scale_bias.data.transpose(
+                1, 2).contiguous().sum(axis=1)
+            layer.w2_scale_bias.data = layer.w2_scale_bias.data.transpose(
+                1, 2).contiguous().sum(axis=1)
+        else:
+            w13_scale_bias = torch.nn.Parameter(w13_bias, requires_grad=False)
+            layer.register_parameter("w13_scale_bias", w13_scale_bias)
+            w2_scale_bias = torch.nn.Parameter(w2_bias, requires_grad=False)
+            layer.register_parameter("w2_scale_bias", w2_scale_bias)
+
+    def pack_to_int32(self, weight: torch.Tensor):
+        if self.new_quant_version:
+            # Pack 4 int8 values (each holding 2 int4) into one int32, since int4 data is represented as int32 in PyTorch
+            assert weight.shape[
+                -1] % 4 == 0, "the last dim of weight must be divisible by 4"
+            return weight.view(torch.int32).contiguous()
+        else:
+            return torch_npu.npu_quantize(weight.to(torch.float32),
+                                          torch.tensor([1.]).npu(), None,
+                                          torch.quint4x2, -1, False)
+
+    def process_weights_after_loading(self, layer):
+        layer.w13_weight.data = layer.w13_weight.data.transpose(
+            1, 2).contiguous()
+        layer.w2_weight.data = layer.w2_weight.data.transpose(1,
+                                                              2).contiguous()
+
+        w13_weight_scale_second = layer.w13_weight_scale_second.data if hasattr(
+            layer, "w13_weight_scale_second") else None
+        w2_weight_scale_second = layer.w2_weight_scale_second.data if hasattr(
+            layer, "w2_weight_scale_second") else None
+        layer.w13_weight_scale.data, w13_bias = self.process_scale(
+            layer.w13_weight, layer.w13_weight_scale.data,
+            w13_weight_scale_second)
+        layer.w2_weight_scale.data, w2_bias = self.process_scale(
+            layer.w2_weight, layer.w2_weight_scale.data,
+            w2_weight_scale_second)
+        if hasattr(layer, "w13_weight_scale_second"):
+            # scale_second is no longer used, release this part of the memory
+            del layer.w13_weight_scale_second
+            del layer.w2_weight_scale_second
+            del layer.w13_weight_offset_second
+            del layer.w2_weight_offset_second
+
+        self.update_bias(layer, w13_bias, w2_bias)
+
+        layer.w13_weight.data = maybe_trans_nz(layer.w13_weight.data)
+        layer.w2_weight.data = maybe_trans_nz(layer.w2_weight.data)
+        layer.w13_weight.data = self.pack_to_int32(layer.w13_weight.data)
+        layer.w2_weight.data = self.pack_to_int32(layer.w2_weight.data)
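Note on the layout produced above: `process_scale` writes each float32 bit pattern into the even 32-bit slots of a zeroed buffer and reads the buffer back as int64, while `pack_to_int32` (new quant version) views four int8 bytes as one int32. The sketch below reproduces both steps with the standard library only. It assumes a little-endian host (the original code relies on native byte order), and the helper names are illustrative, not part of this PR.

```python
import struct


def fp32_bits_in_low_word(value: float) -> int:
    """Place the float32 bit pattern in the low 32 bits of a 64-bit word.

    Mirrors `sscale_uint64[..., ::2] = scale_fp32_np` followed by reading
    the buffer as int64, assuming little-endian byte order.
    """
    low = struct.pack("<f", value)  # 4 bytes of IEEE-754 float32
    high = b"\x00\x00\x00\x00"      # the odd 32-bit slot stays zero
    (word,) = struct.unpack("<q", low + high)
    return word


def pack_four_int8(values) -> int:
    """Reinterpret four int8 values as one little-endian int32."""
    assert len(values) == 4, "exactly four int8 values are expected"
    (packed,) = struct.unpack("<i", struct.pack("<4b", *values))
    return packed


scale_word = fp32_bits_in_low_word(1.0)
packed_weight = pack_four_int8([1, 2, 3, 4])
```

Because the high word is always zero, the resulting int64 is never negative, even for a negative scale; this differs from a sign-extending `int32 -> int64` cast.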
--- a/vllm_ascend/quantization/methods/w8a16.py
+++ b/vllm_ascend/quantization/methods/w8a16.py
@@ -0,0 +1,83 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Any, Dict, Optional
+
+import torch
+import torch_npu
+
+from vllm_ascend.utils import maybe_trans_nz
+
+from .base import AscendLinearScheme
+from .registry import register_scheme
+
+
+@register_scheme("W8A16", "linear")
+class AscendW8A16LinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W8A16.
+
+    This scheme uses 8-bit quantized weights with 16-bit activations.
+    """
+
+    def __init__(self) -> None:
+        pass
+
+    def get_weight(
+        self,
+        input_size: int,
+        output_size: int,
+        params_dtype: torch.dtype = torch.bfloat16,
+    ) -> Dict[str, Any]:
+        params_dict = {
+            "weight": torch.empty(output_size, input_size, dtype=torch.int8)
+        }
+        return params_dict
+
+    def get_perchannel_param(
+        self,
+        output_size: int,
+        params_dtype: torch.dtype,
+    ) -> Dict[str, Any]:
+        params_dict = {}
+        params_dict["weight_scale"] = torch.empty(output_size,
+                                                  1,
+                                                  dtype=params_dtype)
+        params_dict["weight_offset"] = torch.empty(output_size,
+                                                   1,
+                                                   dtype=params_dtype)
+        return params_dict
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        bias: Optional[torch.Tensor] = None,
+        tp_rank: Optional[int] = 0,
+    ) -> torch.Tensor:
+        output = torch_npu.npu_weight_quant_batchmatmul(
+            x=x,
+            weight=layer.weight,
+            antiquant_scale=layer.weight_scale,
+            antiquant_offset=layer.weight_offset,
+            bias=bias)
+        return output
+
+    def process_weights_after_loading(self, layer):
+        layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()
+        layer.weight.data = maybe_trans_nz(layer.weight.data)
+        layer.weight_scale.data = torch.flatten(layer.weight_scale.data)
+        layer.weight_offset.data = torch.flatten(layer.weight_offset.data)
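For reviewers unfamiliar with weight-only quantization, here is a small pure-Python reference of what the W8A16 path is expected to compute. It assumes the dequantization rule `(weight + offset) * scale` applied per output channel, which is my understanding of the `antiquant_scale` / `antiquant_offset` arguments and should be checked against the torch_npu documentation. The function name and the list-based layout are hypothetical.

```python
def w8a16_reference(x, weight_int8, scale, offset, bias=None):
    """Reference matmul with per-output-channel weight dequantization.

    x:           [tokens][in_features] floats (16-bit activations in practice)
    weight_int8: [out_features][in_features] ints in [-128, 127]
    scale/offset: one float per output channel
    Assumed rule: w_float = (w_int8 + offset) * scale
    """
    out = []
    for row in x:
        y = []
        for o, w_row in enumerate(weight_int8):
            acc = 0.0
            for x_val, w_val in zip(row, w_row):
                acc += x_val * ((w_val + offset[o]) * scale[o])
            if bias is not None:
                acc += bias[o]
            y.append(acc)
        out.append(y)
    return out


result = w8a16_reference(
    x=[[1.0, 2.0]],
    weight_int8=[[10, -20], [4, 8]],
    scale=[0.5, 0.25],
    offset=[2.0, 0.0],
)
```

The activations are never quantized in this scheme; only the weights are stored as int8 and expanded on the fly.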
--- a/vllm_ascend/quantization/methods/w8a8_dynamic.py
+++ b/vllm_ascend/quantization/methods/w8a8_dynamic.py
@@ -0,0 +1,331 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Any, Callable, Dict, Optional
+
+import torch
+import torch_npu
+from vllm.config import CompilationMode, get_current_vllm_config
+from vllm.distributed import get_ep_group
+from vllm.forward_context import get_forward_context
+
+import vllm_ascend.envs as envs_ascend
+from vllm_ascend.ascend_config import get_ascend_config
+from vllm_ascend.ascend_forward_context import MoECommType
+from vllm_ascend.distributed.parallel_state import get_mc2_group
+from vllm_ascend.flash_common3_context import get_flash_common3_context
+from vllm_ascend.ops.fused_moe.experts_selector import (select_experts,
+                                                        zero_experts_compute)
+from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, maybe_trans_nz
+
+from .base import AscendLinearScheme, AscendMoEScheme, QuantType
+from .registry import register_scheme
+
+
+def scale_from_float_to_int64(scale):
+    """Convert float32 scale to int64 representation."""
+    import numpy as np
+    scale = torch.from_numpy(
+        np.frombuffer(scale.cpu().to(torch.float32).numpy().tobytes(),
+                      dtype=np.int32).astype(np.int64)).to(scale.device)
+    return scale
+
+
+@register_scheme("W8A8_DYNAMIC", "linear")
+class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W8A8_DYNAMIC.
+
+    This scheme uses dynamic per-token quantization for activations
+    and per-channel quantization for weights.
+    """
+
+    def __init__(self):
+        pass
+
+    def get_weight(self, input_size: int, output_size: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        params_dict = {
+            "weight": torch.empty(output_size, input_size, dtype=torch.int8)
+        }
+        return params_dict
+
+    def get_perchannel_param(
+        self,
+        output_size: int,
+        params_dtype: torch.dtype,
+    ) -> Dict[str, Any]:
+        params_dict = {}
+        params_dict["weight_scale"] = torch.empty(output_size,
+                                                  1,
+                                                  dtype=params_dtype)
+        params_dict["weight_offset"] = torch.empty(output_size,
+                                                   1,
+                                                   dtype=params_dtype)
+        return params_dict
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        bias: Optional[torch.Tensor] = None,
+        tp_rank: Optional[int] = 0,
+    ) -> torch.Tensor:
+        quantized_x, pertoken_scale = torch_npu.npu_dynamic_quant(x)
+        output = torch_npu.npu_quant_matmul(
+            quantized_x,
+            layer.weight,
+            layer.weight_scale,
+            pertoken_scale=pertoken_scale,
+            bias=bias,
+            output_dtype=x.dtype,
+        )
+        return output
+
+    def process_weights_after_loading(self, layer):
+        layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()
+        # cast quantized weight tensors to NZ format for higher inference speed
+        layer.weight.data = maybe_trans_nz(layer.weight.data)
+        layer.weight_scale.data = layer.weight_scale.data.flatten()
+        layer.weight_scale_fp32 = layer.weight_scale.data.to(torch.float32)
+        layer.weight_offset.data = layer.weight_offset.data.flatten()
+
+
+@register_scheme("W8A8_DYNAMIC", "moe")
+class AscendW8A8DynamicFusedMoEMethod(AscendMoEScheme):
+    """FusedMoE method for Ascend W8A8_DYNAMIC."""
+
+    # Declare the quantization type for this scheme
+    quant_type: QuantType = QuantType.W8A8
+
+    def __init__(self):
+        self.ep_group = get_ep_group()
+
+        vllm_config = get_current_vllm_config()
+        ascend_config = get_ascend_config()
+        self.use_aclgraph = (vllm_config.compilation_config.mode
+                             == CompilationMode.VLLM_COMPILE
+                             and not vllm_config.model_config.enforce_eager)
+        self.multistream_overlap_gate = ascend_config.multistream_overlap_gate
+
+        self.dynamic_eplb = ascend_config.eplb_config.dynamic_eplb
+        self.in_dtype = vllm_config.model_config.dtype
+        self.supports_eplb = True
+
+        try:
+            device_group = get_mc2_group().device_group
+            # TODO: Try local_rank = ep_group.rank_in_group
+            local_rank = torch.distributed.get_rank(group=device_group)
+            backend = device_group._get_backend(torch.device("npu"))
+            self.moe_all_to_all_group_name = backend.get_hccl_comm_name(
+                local_rank)
+        except AttributeError:
+            self.moe_all_to_all_group_name = ""
+
+    def get_weight(self, num_experts: int,
+                   intermediate_size_per_partition: int, hidden_sizes: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        param_dict = {}
+        param_dict["w13_weight"] = torch.empty(num_experts,
+                                               2 *
+                                               intermediate_size_per_partition,
+                                               hidden_sizes,
+                                               dtype=torch.int8)
+        param_dict["w2_weight"] = torch.empty(num_experts,
+                                              hidden_sizes,
+                                              intermediate_size_per_partition,
+                                              dtype=torch.int8)
+        return param_dict
+
+    def get_dynamic_quant_param(self, num_experts: int,
+                                intermediate_size_per_partition: int,
+                                hidden_sizes: int,
+                                params_dtype: torch.dtype) -> Dict[str, Any]:
+        param_dict = {}
+        param_dict["w13_weight_scale"] = torch.empty(
+            num_experts,
+            2 * intermediate_size_per_partition,
+            1,
+            dtype=params_dtype)
+        param_dict["w13_weight_offset"] = torch.empty(
+            num_experts,
+            2 * intermediate_size_per_partition,
+            1,
+            dtype=params_dtype)
+        param_dict["w2_weight_scale"] = torch.empty(num_experts,
+                                                    hidden_sizes,
+                                                    1,
+                                                    dtype=params_dtype)
+        param_dict["w2_weight_offset"] = torch.empty(num_experts,
+                                                     hidden_sizes,
+                                                     1,
+                                                     dtype=params_dtype)
+        return param_dict
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        router_logits: torch.Tensor,
+        top_k: int,
+        renormalize: bool,
+        use_grouped_topk: bool = False,
+        global_num_experts: int = -1,
+        expert_map: Optional[torch.Tensor] = None,
+        topk_group: Optional[int] = None,
+        num_expert_group: Optional[int] = None,
+        custom_routing_function: Optional[Callable] = None,
+        scoring_func: str = "softmax",
+        routed_scaling_factor: float = 1.0,
+        e_score_correction_bias: Optional[torch.Tensor] = None,
+        is_prefill: bool = True,
+        enable_force_load_balance: bool = False,
+        log2phy: Optional[torch.Tensor] = None,
+        global_redundant_expert_num: int = 0,
+        pertoken_scale: Optional[Any] = None,
+        **kwargs,
+    ) -> torch.Tensor:
+        zero_expert_num = getattr(layer, "zero_expert_num", 0)
+        zero_expert_type = getattr(layer, "zero_expert_type", None)
+        if zero_expert_num == 0 or zero_expert_type is None:
+            assert router_logits.shape[1] == global_num_experts - global_redundant_expert_num, \
+                "Number of global experts mismatch (excluding redundancy)"
+
+        if self.multistream_overlap_gate:
+            fc3_context = get_flash_common3_context()
+            assert fc3_context is not None
+            topk_weights = fc3_context.topk_weights
+            topk_ids = fc3_context.topk_ids
+        else:
+            topk_weights, topk_ids = select_experts(
+                hidden_states=x,
+                router_logits=router_logits,
+                top_k=top_k,
+                use_grouped_topk=use_grouped_topk,
+                renormalize=renormalize,
+                topk_group=topk_group,
+                num_expert_group=num_expert_group,
+                custom_routing_function=custom_routing_function,
+                scoring_func=scoring_func,
+                routed_scaling_factor=routed_scaling_factor,
+                e_score_correction_bias=e_score_correction_bias,
+                global_num_experts=global_num_experts)
+        assert topk_ids is not None
+        assert topk_weights is not None
+        if zero_expert_num > 0 and zero_expert_type is not None:
+            topk_ids, topk_weights, zero_expert_result = zero_experts_compute(
+                expert_indices=topk_ids,
+                expert_scales=topk_weights,
+                num_experts=global_num_experts,
+                zero_expert_type=zero_expert_type,
+                hidden_states=x,
+            )
+        # This is a naive expert load-balancing implementation that avoids
+        # accumulating too many tokens on a single rank.
+        # Currently it is only activated during profile runs.
+        if enable_force_load_balance:
+            random_matrix = torch.rand(topk_ids.size(0),
+                                       global_num_experts -
+                                       global_redundant_expert_num,
+                                       device=topk_ids.device)
+            topk_ids = torch.argsort(
+                random_matrix, dim=1)[:, :topk_ids.size(1)].to(topk_ids.dtype)
+
+        assert topk_weights is not None
+        topk_weights = topk_weights.to(self.in_dtype)
+
+        moe_comm_method = get_forward_context().moe_comm_method
+        if self.dynamic_eplb:
+            w1 = layer.w13_weight_list
+            w1_scale = layer.w13_weight_scale_fp32_list
+            w2 = layer.w2_weight_list
+            w2_scale = layer.w2_weight_scale_list
+        else:
+            w1 = [layer.w13_weight]
+            w1_scale = [layer.w13_weight_scale_fp32]
+            w2 = [layer.w2_weight]
+            w2_scale = [layer.w2_weight_scale]
+
+        fused_scale_flag = (get_forward_context().moe_comm_type
+                            == MoECommType.FUSED_MC2
+                            and envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 == 1)
+        final_hidden_states = moe_comm_method.fused_experts(
+            hidden_states=x,
+            pertoken_scale=pertoken_scale,
+            w1=w1,
+            w1_scale=[layer.fused_w1_scale] if fused_scale_flag else w1_scale,
+            w2=w2,
+            w2_scale=[layer.fused_w2_scale] if fused_scale_flag else w2_scale,
+            topk_weights=topk_weights,
+            topk_ids=topk_ids,
+            use_int8_w8a8=True,
+            expert_map=expert_map,
+            log2phy=log2phy,
+            dynamic_eplb=self.dynamic_eplb,
+            mc2_mask=kwargs.get("mc2_mask", None))
+        if zero_expert_num > 0 and zero_expert_type is not None:
+            final_hidden_states += zero_expert_result
+        return final_hidden_states
+
+    def process_weights_after_loading(self, layer):
+        layer.w13_weight.data = layer.w13_weight.data.transpose(
+            1, 2).contiguous()
+        layer.w2_weight.data = layer.w2_weight.data.transpose(1,
+                                                              2).contiguous()
+        # TODO(zzzzwwjj): Currently, `torch_npu.npu_grouped_matmul_swiglu_quant`
+        # only supports weights in NZ format.
+        layer.w13_weight.data = torch_npu.npu_format_cast(
+            layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
+        layer.w2_weight.data = torch_npu.npu_format_cast(
+            layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)
+        layer.w13_weight_scale.data = layer.w13_weight_scale.data.view(
+            layer.w13_weight_scale.data.shape[0], -1)
+        layer.w13_weight_scale_fp32 = layer.w13_weight_scale.data.to(
+            torch.float32)
+        layer.w13_weight_offset.data = layer.w13_weight_offset.data.view(
+            layer.w13_weight_offset.data.shape[0], -1)
+        layer.w2_weight_scale.data = layer.w2_weight_scale.data.view(
+            layer.w2_weight_scale.data.shape[0], -1)
+        layer.w2_weight_offset.data = layer.w2_weight_offset.data.view(
+            layer.w2_weight_offset.data.shape[0], -1)
+
+        layer.fused_w1_scale = scale_from_float_to_int64(
+            layer.w13_weight_scale.data)
+        layer.fused_w2_scale = scale_from_float_to_int64(
+            layer.w2_weight_scale.data)
+
+        if self.dynamic_eplb:
+            layer.w13_weight_list = [
+                weight.clone()
+                for weight in layer.w13_weight.data.unbind(dim=0)
+            ]
+            layer.w2_weight_list = [
+                weight.clone() for weight in layer.w2_weight.data.unbind(dim=0)
+            ]
+            layer.w13_weight_scale_fp32_list = [
+                weight.clone()
+                for weight in layer.w13_weight_scale_fp32.data.unbind(dim=0)
+            ]
+            layer.w2_weight_scale_list = [
+                weight.clone()
+                for weight in layer.w2_weight_scale.data.unbind(dim=0)
+            ]
+            del layer.w13_weight
+            del layer.w2_weight
+            del layer.w13_weight_scale
+            del layer.w13_weight_scale_fp32
+            del layer.w2_weight_scale
+            torch.npu.empty_cache()
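A note on `scale_from_float_to_int64` in the file above: it does not round the scale to an integer. It reinterprets the float32 bytes as int32 and then widens to int64. The following standard-library sketch shows the same reinterpretation on plain Python floats; the helper name is illustrative, and explicit little-endian packing is used so the result does not depend on the host.

```python
import struct


def float32_bits_as_int64(values):
    """Return the float32 bit pattern of each value as a signed integer.

    Each value is packed as IEEE-754 float32, the same 4 bytes are read back
    as a signed int32, and the result is kept as a Python int (which covers
    the int64 range), matching `np.int32 -> np.int64` sign extension.
    """
    out = []
    for v in values:
        raw = struct.pack("<f", v)
        (as_int32,) = struct.unpack("<i", raw)
        out.append(as_int32)
    return out


bits = float32_bits_as_int64([1.0, 0.5, -2.0])
```

Note that a negative scale maps to a negative integer here because the int32 sign bit is extended.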
--- a/vllm_ascend/quantization/methods/w8a8_mxfp8.py
+++ b/vllm_ascend/quantization/methods/w8a8_mxfp8.py
@@ -0,0 +1,94 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Any, Dict, Optional
+
+import torch
+import torch_npu
+from vllm.config import get_current_vllm_config
+
+from .base import AscendLinearScheme
+from .registry import register_scheme
+
+
+@register_scheme("W8A8_MXFP8", "linear")
+class AscendW8A8MXFP8DynamicLinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W8A8_MXFP8 (Microscaling FP8) quantization.
+
+    This scheme uses microscaling FP8 quantization with per-group scales.
+    The activation is dynamically quantized to FP8 (E4M3FN format) with
+    microscaling, and weights are stored in FP8 format with per-group scales.
+    """
+    model_dtype = None
+
+    def __init__(self):
+        vllm_config = get_current_vllm_config()
+        self.group_size = vllm_config.quant_config.quant_description.get(
+            "group_size", 32)
+
+    def get_weight(self, input_size: int, output_size: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        params_dict = {
+            "weight":
+            torch.empty(output_size, input_size, dtype=torch.float8_e4m3fn)
+        }
+        return params_dict
+
+    def get_pergroup_param(self,
+                           input_size: int,
+                           output_size: int,
+                           params_dtype: torch.dtype,
+                           layer_type: Optional[str] = None) -> Dict[str, Any]:
+        params_dict = {}
+        params_dict["weight_scale"] = torch.empty(output_size,
+                                                  input_size //
+                                                  self.group_size,
+                                                  dtype=torch.uint8)
+        return params_dict
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        bias: Optional[torch.Tensor] = None,
+        tp_rank: Optional[int] = 0,
+    ) -> torch.Tensor:
+
+        quantized_x, dynamic_scale = torch_npu.npu_dynamic_mx_quant(
+            x, dst_type=torch.float8_e4m3fn)
+        pertoken_scale = dynamic_scale
+        output_dtype = x.dtype
+
+        output = torch_npu.npu_quant_matmul(
+            quantized_x,
+            layer.weight,
+            layer.weight_scale,
+            scale_dtype=torch_npu.float8_e8m0fnu,
+            pertoken_scale=pertoken_scale,
+            pertoken_scale_dtype=torch_npu.float8_e8m0fnu,
+            bias=bias,
+            output_dtype=output_dtype,
+            group_sizes=[1, 1, self.group_size])
+
+        return output
+
+    def process_weights_after_loading(self, layer):
+        n_dim, k_dim = layer.weight_scale.data.shape
+        layer.weight_scale.data = layer.weight_scale.data.reshape(
+            n_dim, k_dim // 2, 2)
+        layer.weight.data = layer.weight.data.transpose(0, 1)
+        layer.weight_scale.data = layer.weight_scale.data.transpose(0, 1)
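To make the per-group scale layout in the MXFP8 scheme easier to review, the sketch below computes the `weight_scale` shape used by `get_pergroup_param` and decodes a one-byte E8M0 scale. The decoding follows the OCP Microscaling convention (exponent bias 127, `0xFF` reserved for NaN); that the NPU kernel interprets the `uint8` scales exactly this way is an assumption, and both helper names are hypothetical.

```python
import math


def mx_scale_shape(output_size: int, input_size: int, group_size: int = 32):
    """Shape of the per-group scale tensor: one scale per group along the input dim."""
    assert input_size % group_size == 0, "input_size must be a multiple of group_size"
    return (output_size, input_size // group_size)


def decode_e8m0(byte: int) -> float:
    """Decode an E8M0 scale: 8 exponent bits, no sign, no mantissa."""
    assert 0 <= byte <= 0xFF, "E8M0 is a single unsigned byte"
    if byte == 0xFF:
        return math.nan  # reserved NaN encoding
    return 2.0 ** (byte - 127)


shape = mx_scale_shape(output_size=4, input_size=128)
unit_scale = decode_e8m0(127)
```

With the default `group_size` of 32, every 32 consecutive FP8 weights along the input dimension share one power-of-two scale.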
--- a/vllm_ascend/quantization/methods/w8a8_pdmix.py
+++ b/vllm_ascend/quantization/methods/w8a8_pdmix.py
@@ -0,0 +1,117 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+"""W8A8 Prefill-Decode Mix quantization methods.
+
+This module provides quantization methods that use different strategies
+for prefill and decode phases:
+- Prefill: Uses dynamic W8A8 quantization
+- Decode (KV consumer): Uses static W8A8 quantization
+"""
+
+from typing import Any, Dict, Optional
+
+import torch
+from vllm.config import get_current_vllm_config
+
+from .base import AscendLinearScheme
+from .registry import register_scheme
+from .w8a8_dynamic import (AscendW8A8DynamicFusedMoEMethod,
+                           AscendW8A8DynamicLinearMethod)
+from .w8a8_static import AscendW8A8LinearMethod
+
+
+@register_scheme("W8A8_MIX", "linear")
+class AscendW8A8PDMixLinearMethod(AscendLinearScheme):
+    """Linear method for W8A8 prefill-decode mix quantization.
+
+    This scheme uses composition to delegate to the appropriate quantization
+    method based on the execution phase:
+    - Static W8A8 for KV consumer (decode phase)
+    - Dynamic W8A8 for prefill phase
+
+    The static method is used for weight/parameter specifications since
+    it requires more parameters (input_scale, deq_scale, etc.) that are
+    needed for static quantization during decode.
+    """
+
+    def __init__(self):
+        self._static_method = AscendW8A8LinearMethod()
+        self._dynamic_method = AscendW8A8DynamicLinearMethod()
+
+        kv_transfer_config = get_current_vllm_config().kv_transfer_config
+        self._is_kv_consumer = (kv_transfer_config is not None
+                                and kv_transfer_config.is_kv_consumer)
+
+    def get_weight(self, input_size: int, output_size: int,
+                   params_dtype: torch.dtype) -> Dict[str, Any]:
+        return self._static_method.get_weight(input_size, output_size,
+                                              params_dtype)
+
+    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
+        return self._static_method.get_pertensor_param(params_dtype)
+
+    def get_perchannel_param(
+        self,
+        output_size: int,
+        params_dtype: torch.dtype,
+    ) -> Dict[str, Any]:
+        return self._static_method.get_perchannel_param(
+            output_size, params_dtype)
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        bias: Optional[torch.Tensor] = None,
+        tp_rank: Optional[int] = 0,
+    ) -> torch.Tensor:
+        if layer.is_kv_consumer:
+            return self._static_method.apply(layer, x, bias, tp_rank)
+        else:
+            return self._dynamic_method.apply(layer, x, bias, tp_rank)
+
+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+        self._static_method.process_weights_after_loading(layer)
+        layer.weight_scale_fp32 = layer.weight_scale.data.to(torch.float32)
+        layer.is_kv_consumer = self._is_kv_consumer
+
+
+@register_scheme("W8A8_MIX", "moe")
+class AscendW8A8PDMixFusedMoeMethod(AscendW8A8DynamicFusedMoEMethod):
+
+    def get_dynamic_quant_param(self, num_experts: int,
+                                intermediate_size_per_partition: int,
+                                hidden_sizes: int,
+                                params_dtype: torch.dtype) -> Dict[str, Any]:
+        param_dict = super().get_dynamic_quant_param(
+            num_experts, intermediate_size_per_partition, hidden_sizes,
+            params_dtype)
+        param_dict["w2_deq_scale"] = torch.empty(num_experts,
+                                                 hidden_sizes,
+                                                 dtype=torch.float32)
+        param_dict["w13_deq_scale"] = torch.empty(
+            num_experts,
+            2 * intermediate_size_per_partition,
+            dtype=torch.float32)
+        param_dict["w2_input_offset"] = torch.empty(num_experts,
+                                                    1,
+                                                    dtype=torch.int8)
+        param_dict["w13_input_offset"] = torch.empty(num_experts,
+                                                     1,
+                                                     dtype=torch.int8)
+
+        return param_dict
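The prefill-decode mix linear method above uses composition rather than inheritance: it holds one static and one dynamic scheme and picks between them at call time. A stripped-down sketch of that pattern follows; the class names and tuple return values are placeholders for illustration and do not exist in this PR.

```python
class StaticPath:
    """Stand-in for the static W8A8 scheme (decode / KV consumer)."""

    def apply(self, x):
        return ("static", x)


class DynamicPath:
    """Stand-in for the dynamic W8A8 scheme (prefill)."""

    def apply(self, x):
        return ("dynamic", x)


class PhaseMixMethod:
    """Delegates to one of two schemes based on a flag fixed at construction."""

    def __init__(self, is_kv_consumer: bool):
        self._static = StaticPath()
        self._dynamic = DynamicPath()
        self._is_kv_consumer = is_kv_consumer

    def apply(self, x):
        method = self._static if self._is_kv_consumer else self._dynamic
        return method.apply(x)


decode_result = PhaseMixMethod(is_kv_consumer=True).apply(1)
prefill_result = PhaseMixMethod(is_kv_consumer=False).apply(1)
```

In the PR the flag is additionally copied onto the layer in `process_weights_after_loading`, so `apply` reads it from `layer.is_kv_consumer` instead of from the scheme instance.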
--- a/vllm_ascend/quantization/methods/w8a8_static.py
+++ b/vllm_ascend/quantization/methods/w8a8_static.py
@@ -0,0 +1,181 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Any, Dict, Optional
+
+import torch
+import torch_npu
+
+from vllm_ascend.utils import (COMPRESSED_TENSORS_METHOD, AscendDeviceType,
+                               get_ascend_device_type,
+                               get_weight_prefetch_method, maybe_trans_nz)
+
+from .base import AscendLinearScheme
+from .registry import register_scheme
+
+
+@register_scheme("W8A8", "linear")
+class AscendW8A8LinearMethod(AscendLinearScheme):
+    """Linear method for Ascend W8A8 static quantization.
+
+    This scheme uses static per-tensor quantization for activations
+    and per-channel quantization for weights.
+    """
+
+    def __init__(self) -> None:
+        pass
+
+    def get_weight(
+        self,
+        input_size: int,
+        output_size: int,
+        params_dtype: torch.dtype = torch.bfloat16,
+    ) -> Dict[str, Any]:
+        params_dict = {
+            "weight": torch.empty(output_size, input_size, dtype=torch.int8)
+        }
+        return params_dict
+
+    def get_pertensor_param(self, params_dtype: torch.dtype) -> Dict[str, Any]:
+        params_dict = {}
+        params_dict["input_scale"] = torch.empty(1, dtype=params_dtype)
+        params_dict["input_offset"] = torch.empty(1, dtype=torch.int8)
+        return params_dict
+
+    def get_perchannel_param(
+        self,
+        output_size: int,
+        params_dtype: torch.dtype,
+    ) -> Dict[str, Any]:
+        params_dict = {}
+        params_dict["quant_bias"] = torch.empty(output_size, dtype=torch.int32)
+        if params_dtype == torch.bfloat16:
+            params_dict["deq_scale"] = torch.empty(output_size,
+                                                   dtype=torch.float32)
+        elif params_dtype == torch.float16:
+            params_dict["deq_scale"] = torch.empty(output_size,
+                                                   dtype=torch.int64)
+        params_dict["weight_scale"] = torch.empty(output_size,
+                                                  1,
+                                                  dtype=params_dtype)
+        params_dict["weight_offset"] = torch.empty(output_size,
+                                                   1,
+                                                   dtype=params_dtype)
+        return params_dict
+
+    def apply(
+        self,
+        layer: torch.nn.Module,
+        x: torch.Tensor,
+        bias: Optional[torch.Tensor] = None,
+        tp_rank: Optional[int] = 0,
+    ) -> torch.Tensor:
+        if x.dtype != torch.int8:
+            layer_cls_name = layer.__class__.__name__
+            weight_prefetch_method = get_weight_prefetch_method()
+            # prefetch qkvo_proj.weight preprocess
+            if weight_prefetch_method:
+                weight_prefetch_method.maybe_prefetch_attn_weight_preprocess(
+                    layer_cls_name=layer_cls_name,
+                    weight=layer.weight,
+                    start_flag=x,
+                )
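+            # "_quant_comm_config" is optional on the layer. Default to
+            # an empty dict so that the "communication_fn" lookup below
+            # yields None when it is absent.
+            quant_comm_config = getattr(layer, "_quant_comm_config", {})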
+            comm_fn = quant_comm_config.get("communication_fn")
+            enable_flashcomm2_quant_comm = comm_fn is not None and (
+                "o_proj" in layer.prefix or "out_proj" in layer.prefix)
+            if enable_flashcomm2_quant_comm:
+                quant_input_x = x.contiguous().view(
+                    -1, layer.aclnn_input_scale_reciprocal.size(0))
+                quant_x = torch.ops.vllm.quantize(
+                    quant_input_x,
+                    layer.aclnn_input_scale,
+                    layer.aclnn_input_scale_reciprocal,
+                    layer.aclnn_input_offset,
+                )
+                comm_input = quant_x.view(x.size(0), -1)
+                assert comm_fn is not None
+                x = comm_fn(comm_input)
+            else:
+                # Quantize activations to int8 with the static input scale.
+                x = torch.ops.vllm.quantize(
+                    x,
+                    layer.aclnn_input_scale,
+                    layer.aclnn_input_scale_reciprocal,
+                    layer.aclnn_input_offset,
+                )
+
+            # prefetch qkvo_proj.weight postprocess
+            if weight_prefetch_method:
+                weight_prefetch_method.maybe_prefetch_attn_weight_postprocess(
+                    layer_cls_name=layer_cls_name,
+                    stop_flag=x,
+                )
+
+        quant_bias = layer.quant_bias if tp_rank == 0 else None
+
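+        # "ascend_quant_method" is optional on the layer. For
+        # compressed-tensors checkpoints, the bias passed by the caller
+        # is used in place of layer.quant_bias.
+        ascend_quant_method = getattr(layer, "ascend_quant_method", "")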
+        if ascend_quant_method == COMPRESSED_TENSORS_METHOD:
+            quant_bias = bias
+
+        if get_ascend_device_type() == AscendDeviceType._310P:
+            # On the 300I Duo platform, the weight must be transposed again
+            # when the NZ format is used. torchair can skip this transpose.
+            output = torch_npu.npu_quant_matmul(
+                x,
+                layer.weight.data.transpose(1, 0),
+                layer.deq_scale,
+                bias=quant_bias,
+                output_dtype=layer.params_dtype,
+            )
+        else:
+            output = torch_npu.npu_quant_matmul(
+                x,
+                layer.weight,
+                layer.deq_scale,
+                bias=quant_bias,
+                output_dtype=layer.params_dtype,
+            )
+        return output
+
+    def process_weights_after_loading(self, layer):
+        expanding_factor = layer.weight.data.shape[1]
+        layer.aclnn_input_scale = torch.nn.Parameter(
+            layer.input_scale.data.repeat(expanding_factor),
+            requires_grad=False)
+        layer.aclnn_input_scale_reciprocal = torch.nn.Parameter(
+            1 / layer.input_scale.data.repeat(expanding_factor),
+            requires_grad=False)
+        layer.aclnn_input_offset = torch.nn.Parameter(
+            layer.input_offset.data.repeat(expanding_factor),
+            requires_grad=False).to(layer.aclnn_input_scale.dtype)
+        if get_ascend_device_type() != AscendDeviceType._310P:
+            layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()
+        layer.weight.data = maybe_trans_nz(layer.weight.data)
+        layer.weight_scale.data = torch.flatten(layer.weight_scale.data)
+        layer.weight_offset.data = torch.flatten(layer.weight_offset.data)
+        ascend_quant_method = getattr(layer, "ascend_quant_method", "")
+        if ascend_quant_method == COMPRESSED_TENSORS_METHOD:
+            deq_scale = layer.input_scale.data * layer.weight_scale.data
+            layer.deq_scale = torch.nn.Parameter(deq_scale,
+                                                 requires_grad=False)
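
For the compressed-tensors path, `process_weights_after_loading` above builds `deq_scale` as `input_scale * weight_scale`. The pure-Python sketch below emulates the arithmetic behind that product: static per-tensor quantization of the activation, an int8 matmul with int32 accumulation, and per-channel dequantization. It is an illustration under assumptions, not the NPU kernel: the rounding rule `round(x / scale) + offset` and the way `quant_bias` cancels the activation offset are assumed here, and the helper names are hypothetical.

```python
from typing import List

INT8_MIN, INT8_MAX = -128, 127


def quantize_per_tensor(x: List[float], scale: float,
                        offset: int) -> List[int]:
    """Quantize with one scale/offset for the whole tensor (assumed rule)."""
    return [
        max(INT8_MIN, min(INT8_MAX, round(v / scale) + offset)) for v in x
    ]


def quant_matmul(qx: List[int], qw: List[List[int]], deq_scale: List[float],
                 quant_bias: List[int]) -> List[float]:
    """Integer dot product per output channel, add bias, then dequantize."""
    out = []
    for row, scale, bias in zip(qw, deq_scale, quant_bias):
        acc = sum(a * b for a, b in zip(qx, row))  # int32-style accumulator
        out.append((acc + bias) * scale)
    return out


# Static activation parameters (per tensor) and weight scales (per channel).
input_scale, input_offset = 1 / 64, 3
weight_scale = [0.5, 0.125]
qw = [[10, -20, 30], [5, 5, 5]]  # int8 weights, one row per output channel
x = [0.5, -1.0, 0.25]

# Same product as in process_weights_after_loading.
deq_scale = [input_scale * s for s in weight_scale]
# Assumption: the bias removes the contribution of the activation offset.
quant_bias = [-input_offset * sum(row) for row in qw]

qx = quantize_per_tensor(x, input_scale, input_offset)
y = quant_matmul(qx, qw, deq_scale, quant_bias)

# Float reference: x multiplied by the dequantized weights.
reference = [
    sum(a * b * s for a, b in zip(x, row))
    for row, s in zip(qw, weight_scale)
]
```

With these values the quantized result matches the float reference exactly, because every input is an exact multiple of the scale. In general the two differ by the rounding error introduced in `quantize_per_tensor`.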