xc-llm-ascend

Author	SHA1	Message	Date
linfeng-yuan	88d03a783f	[refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024 ) ### What this PR does / why we need it? Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business `**kwargs` with typed request objects and explicit stage boundaries. - Prepare, dispatch, MLP, and quant stages now have clearer ownership. - Main MoE path no longer depends on business `kwargs.get(...)` lookups. - Comm and dispatcher interfaces are request-only on the main path. - UTs can assert stage-level fields directly instead of inferring behavior indirectly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-20 23:23:57 +08:00
Li Wang	83a4065b4b	[CI] Add pre-commit check for patch logger (#7446 ) ### What this PR does / why we need it? See https://github.com/vllm-project/vllm-ascend/pull/7402, pre-commit hook will forbid init_logger(__name__) in vllm_ascend patch modules - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-19 16:53:20 +08:00
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110 ) ### What this PR does / why we need it? This PR aims to support aclgraph for model runner v2, please see RFC #5208. The PR contains these modifications: - adapt to newest commit of vllm main branch. - supply a unified interface of extra forward context for both model runner v1 and model runner v2. - implement graph mode for main model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
pu-zhe	5df450bca4	[Feat] [310p] Support w8a8sc quantization method (#7075 ) ### What this PR does / why we need it? New Quantization Method: Introduced support for the W8A8SC static linear quantization scheme specifically for 310P hardware, enabling more efficient model compression. Refactored the save_sharded_state_310.py to avoid multi-process issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? W8A8SC quant E2E test. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-10 16:13:20 +08:00
Shaoxu Cheng	2064afe380	[300I][Bugfix] fix unquant model weight nd2nz error (#6851 ) ### What this PR does / why we need it? - This PR fixes an issue with weight format conversion for unquantized models running on Ascend 310P devices. - The changes refactor the logic for converting weights to the FRACTAL_NZ format. Previously, this was handled in a 310P-specific linear layer implementation (`AscendUnquantizedLinearMethod310`). This implementation has been removed, and the logic is now centralized in the `maybe_trans_nz` utility function. This function now checks if the device is a 310P and applies the NZ format cast accordingly for `float16`/`bfloat16` weights. - This refactoring simplifies the code by removing platform-specific duplication and ensures correct weight handling for unquantized models on 310P. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ut and local test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-03 15:57:26 +08:00
pu-zhe	5899438a86	[Feat][310p] 310P support w8a8s quantization and saving w8a8sc state (#6878 ) ### What this PR does / why we need it? This pull request introduces significant enhancements for 310P device support, primarily by enabling W8A8S quantization and facilitating the saving of models with W8A8SC state outputs. It provides an example script for saving sharded and compressed model states, implements the core W8A8S quantization method, and integrates metadata generation within the 310P worker to accurately describe the quantization types of saved parameters. These changes aim to improve efficiency and compatibility for quantized models on 310P hardware. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? W8A8S accuracy test and W8A8SC states save. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-02 20:09:15 +08:00
Shaoxu Cheng	b6bc3d2f9d	[Feat.][310P]: weightNZ feature with quant or unquant. (#6705 ) NZ Format Support for Linear Layers: Implemented support for the NZ (N-dimensional Z-order) format for linear layer weights on Ascend 310P, enhancing performance for both quantized and unquantized layers. Unquantized Linear Method for Ascend 310P: Introduced AscendUnquantizedLinearMethod310 to specifically handle and apply NZ format casting to unquantized linear layer weights during the loading process. MRotaryEmbedding Integration: Extended Rotary Embedding support by adding AscendMRotaryEmbedding310 to provide an Ascend-specific implementation for MRotaryEmbedding. Quantization Method Updates: Updated the w8a8_static quantization method to directly transpose weights and apply NZ format casting, ensuring consistency with the new format. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-02-13 15:41:02 +08:00
pu-zhe	02886e2641	[Feat] 310p support MoE W8A8 quantization (#6641 ) ### What this PR does / why we need it? This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by: - Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`. - Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths. - Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations. - Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8 quantization in a multi-card setup. - Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-10 17:17:44 +08:00
pu-zhe	23524f2ca4	[Refactor]refactor 310p ops and add ut (#6591 ) ### What this PR does / why we need it? This pull request focuses on a significant refactoring effort within the vllm-ascend project, specifically targeting operations optimized for the Ascend 310P hardware. The changes aim to streamline the implementation of core components like quantization and multi-head attention, making the codebase more maintainable and robust. Concurrently, new unit tests have been introduced to ensure the correctness and reliability of these refactored modules. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3-32b w8a8 - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:25:17 +08:00
Shaoxu Cheng	39e77fb9e4	[Feat.]: support 310p w8a8 (#6454 ) ### What this PR does / why we need it? Introduced 310P W8A8 Quantization Support: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform. Platform-Specific Quantization Configuration Loading: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device. Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation. Extended AscendModelSlimConfig for 310P: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-02-03 14:13:06 +08:00

10 Commits