xc-llm-ascend

Author	SHA1	Message	Date
linfeng-yuan	d452d04656	[A5][bugfix] Fix fused MoE A5 MXFP8 scale normalization, load-balance routing and gating_topk ops (#7573 ) ### What this PR does / why we need it? This PR fixes A5 MXFP8 MoE scale handling in the fused MoE path. - It normalizes MXFP8 activation scales to the packed 3D layout expected by A5 kernels, including both precomputed dynamic_scale inputs and gmm1 output scales before they are consumed by downstream grouped matmul ops. - It also refines the MXFP8 force load-balancing path in profiling runs. - This PR also enables npu_gating_top_k from torch_npu instead of custom op when running ascend950 chip. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI and E2E serving tests on Ascend950DT passed. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-25 17:20:28 +08:00
linfeng-yuan	88d03a783f	[refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024 ) ### What this PR does / why we need it? Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business `**kwargs` with typed request objects and explicit stage boundaries. - Prepare, dispatch, MLP, and quant stages now have clearer ownership. - Main MoE path no longer depends on business `kwargs.get(...)` lookups. - Comm and dispatcher interfaces are request-only on the main path. - UTs can assert stage-level fields directly instead of inferring behavior indirectly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-20 23:23:57 +08:00
Siyuan Kong	a16c99141b	Adapt w8a8mxfp8 quantization for Qwen VL models (#7417 ) ### What this PR does / why we need it? This PR adapts the `w8a8_mxfp8` quantization method to support Qwen Vision-Language (VL) models. Key changes include: - Reshaping multi-dimensional input tensors to 2D before the quantized matrix multiplication. - Reshaping the 2D output back to its original multi-dimensional format. - Adding specific output reshaping for the visual components of Qwen VL models. - Casting the bias tensor to `float32` to comply with the `npu_quant_matmul` kernel requirements. These changes are necessary to enable `w8a8_mxfp8` quantization for models with multi-modal inputs like Qwen VL. ### Does this PR introduce _any_ user-facing change? No, this is a backend enhancement to extend quantization support to new model architectures. There are no user-facing API or behavior changes. ### How was this patch tested? CI is expected to pass. Manual testing should be performed with a Qwen VL model using `w8a8_mxfp8` quantization to verify correctness and performance. - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: ksiyuan <ksiyuan@umich.edu>	2026-03-20 16:18:58 +08:00
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110 ) ### What this PR does / why we need it? This PR aims to support aclgraph for model runner v2, please see RFC #5208. The PR contains these modifications: - adapt to newest commit of vllm main branch. - supply a unified interface of extra forward context for both model runner v1 and model runner v2. - implement graph mode for main model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
linfeng-yuan	68d8d20ca2	[misc] move mxfp_compat into device to decouple from quantization init chain (#6918 ) ### What this PR does / why we need it? `mxfp_compat` only provides dtype/symbol compatibility helpers for different `torch_npu` versions, but it was placed under `vllm_ascend.quantization`. Importing it from device/ops paths could trigger `quantization/__init__.py` and pull in heavy quantization method dependencies, increasing startup coupling and causing import-cycle risk (especially on 310P paths). ### Does this PR introduce _any_ user-facing change? No functional behavior change intended. ### How was this patch tested? CI passed. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-02 18:17:01 +08:00
Eric-dot	3c66a970f2	add mxfp8 moe quantization (#6670 ) ### What this PR does / why we need it? support mxfp8 quantization (Qwen MOE ) Using adaptor to make the hardware-specific behavior clearer and more maintainable ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: fangrongcan <17343701736@163.com> Signed-off-by: wangyao-i <iwangyao@outlook.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com> Co-authored-by: fangrongcan <f00876277@china.huawei.com> Co-authored-by: wangyao-i <iwangyao@outlook.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>	2026-03-02 11:04:06 +08:00
SILONG ZENG	99aedaff63	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #7 ) (#6023 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|` vllm_ascend/quantization/compressed_tensors/compressed_tensors.py`\| \|` vllm_ascend/quantization/quant_config.py`\| \|` vllm_ascend/quantization/utils.py`\| \|` vllm_ascend/quantization/w4a16.py`\| \|` vllm_ascend/quantization/w4a4_flatquant_dynamic.py`\| \|` vllm_ascend/quantization/w4a8_dynamic.py`\| \|` vllm_ascend/quantization/w8a16.py`\| \|` vllm_ascend/quantization/w8a8.py`\| \|` vllm_ascend/quantization/w8a8_dynamic.py`\| \|` vllm_ascend/quantization/w8a8_pdmix.py`\| \|` vllm_ascend/quantization/w8a8mxfp8.py`\| \|` vllm_ascend/sample/rejection_sampler.py`\| \|` vllm_ascend/sample/sampler.py`\| \|` vllm_ascend/worker/block_table.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-06 14:56:53 +08:00
Cao Yi	a69ef10c3a	[Refactor] Quantization Module Refactor (#5738 ) ### Summary This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes. ### Key Changes #### 1. Modular Directory Structure \| Before \| After \| \|--------\|-------\| \| Flat file structure with mixed responsibilities \| Organized into `methods/` subpackage for schemes \| \| Single `quant_config.py` (600+ lines) \| Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` \| \| `utils.py` with scheme lookup logic \| `methods/registry.py` with decorator-based registration \| #### 2. Registry-Based Scheme Discovery Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern: ```python # Before: Manual dictionary mapping ASCEND_QUANTIZATION_METHOD_MAP = { "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...}, ... } # After: Decorator-based registration @register_scheme("W8A8_DYNAMIC", "linear") class AscendW8A8DynamicLinearMethod(AscendLinearScheme): ... ``` #### 3. Abstract Base Classes Introduced three abstract base classes in `methods/base.py`: - `AscendLinearScheme` - Base for linear layer quantization - `AscendMoEScheme` - Base for MoE layer quantization - `AscendAttentionScheme` - Base for attention layer quantization #### 4. Separated Config and Wrapper Classes - Config classes (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection - Wrapper classes (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes #### 5. Cleaner Public API ```python # New clean module interface from vllm_ascend.quantization import ( AscendModelSlimConfig, AscendCompressedTensorsConfig, ) from vllm_ascend.quantization.methods import get_scheme_class ``` ### Architecture Diagram ```mermaid classDiagram direction TB class QuantizationConfig { <<vLLM Interface>> +get_quant_method() } class AscendModelSlimConfig { +quant_description +get_quant_method() -create_scheme_for_layer() } class AscendCompressedTensorsConfig { +target_scheme_map +get_quant_method() -_get_scheme_from_parts() } class AscendLinearMethod { <<Wrapper>> +quant_method: AscendLinearScheme +create_weights() +apply() } class AscendFusedMoEMethod { <<Wrapper>> +quant_method: AscendMoEScheme +create_weights() +apply() } class AscendLinearScheme { <<Abstract>> +get_weight()* +apply()* +get_pertensor_param() +get_perchannel_param() } class AscendMoEScheme { <<Abstract>> +get_weight()* +get_dynamic_quant_param()* +apply()* } class W8A8DynamicLinear { +get_weight() +apply() } class W8A8DynamicMoE { +get_weight() +apply() } QuantizationConfig <\|-- AscendModelSlimConfig QuantizationConfig <\|-- AscendCompressedTensorsConfig AscendModelSlimConfig ..> AscendLinearMethod : creates AscendModelSlimConfig ..> AscendFusedMoEMethod : creates AscendCompressedTensorsConfig ..> AscendLinearMethod : creates AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates AscendLinearMethod o-- AscendLinearScheme : delegates to AscendFusedMoEMethod o-- AscendMoEScheme : delegates to AscendLinearScheme <\|-- W8A8DynamicLinear AscendMoEScheme <\|-- W8A8DynamicMoE ``` ### Scheme Registration Flow ```mermaid sequenceDiagram participant Module as Scheme Module participant Registry as _SCHEME_REGISTRY participant Config as QuantConfig participant Wrapper as Wrapper Class Note over Module: At import time Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear") Registry->>Registry: Store (quant_type, layer_type) -> Class Note over Config: At runtime Config->>Config: Determine quant_type from description Config->>Registry: get_scheme_class(quant_type, layer_type) Registry-->>Config: Return scheme class Config->>Config: scheme = scheme_cls() Config->>Wrapper: Create wrapper with scheme Wrapper-->>Config: Return wrapper instance ``` ### File Changes Summary \| Original Files \| Refactored Files \| \|----------------\|------------------\| \| `__init__.py` (empty) \| `__init__.py` (exports public API) \| \| `quant_config.py` \| `modelslim_config.py` + `wrappers.py` \| \| `compressed_tensors/` \| `compressed_tensors_config.py` \| \| `utils.py` \| `methods/registry.py` \| \| `w8a8_dynamic.py` \| `methods/w8a8_dynamic.py` \| \| `w8a8.py` \| `methods/w8a8_static.py` \| \| `w4a4_flatquant_dynamic.py` \| `methods/w4a4_flatquant.py` \| \| ... \| `methods/base.py` (new) \| ### Benefits 1. Extensibility: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator 2. Maintainability: Clear separation between config parsing, wrapper logic, and scheme implementation 3. Testability: Abstract base classes enable easier unit testing and mocking 4. Discoverability: Registry pattern makes it easy to list all supported schemes 5. Reduced Coupling: Config classes no longer need to know about all scheme implementations ___ - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-23 14:13:47 +08:00

8 Commits