xc-llm-ascend

Author	SHA1	Message	Date
Clorist33	4f0dddc9ee	[Bugfix] bugfix for moe_mlp in vllm-ascend/v0.11.0-dev (#4885) ### What this PR does / why we need it? This PR fixes a bug in the moe_mlp module by correcting the arguments passed to the torch_npu.npu_dequant_swiglu_quant function. It converts group_list from a cumulative sum to counts for the group_index parameter. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/main --------- Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: Mercykid-bash <ruanche0218@gmail.com>	2025-12-12 14:51:47 +08:00
Mercykid-bash	8f45f9ce29	BugFix: Resolve shape mismatch in eplb update and calculation issues in quant_apply_mlp (#4777) ## Description This PR addresses two key issues in the MoE module when redundant experts are enabled, and fixes a calculation precision bug in the forward inference of quantized MLP: ### 1. Shape Mismatch in EPLB Expert Map Update - Root Cause: When redundant experts are turned on, a shape inconsistency occurs during the expert map update in `Vllm_apaptor`: - The shape of `self.expert_map_per_layer[layer_id]` is `[num_physical_experts,]` (aligned with physical expert count). - The shape of `updated_expert_map` is `[num_logical_experts,]` (aligned with logical expert count). - Indices in `self.expert_map_per_layer[layer_id]` that exceed the logical expert count cannot be properly mapped, leading to tensor shape mismatch errors. - The same shape mismatch exists in the `log2phy` map update (between `self.log2phy_map_per_layer[layer_id]` and `updated_log2phy_map`). - Fix: - Fix the shape initialization of `expert_map_per_layer` and `log2phy_map_per_layer` to be consistently set to `[num_physical_experts,]` across the module lifecycle. - Align the shape of `updated_expert_map` and `updated_log2phy_map` with the pre-initialized physical-expert-sized tensors during update operations, ensuring shape consistency for index mapping. ### 2. Calculation Precision Issue in Quantized MoE MLP Forward Inference - Root Cause: In the forward pass of `moe_mlp`, the `torch_npu.npu_dequant_swiglu_quant` operator only accepts group lists in Count format as input. However, the group list provided by `quant_apply_mlp` was in Cumsum format, which caused operator input format mismatch and degraded calculation precision. - Fix: - Convert the cumsum-formatted group list from `quant_apply_mlp` to Count format before passing it to `torch_npu.npu_dequant_swiglu_quant`. - Ensure the input format of the dequantization operator meets its requirements, restoring the expected calculation precision for quantized MoE MLP layers. ## Impact - Resolves shape mismatch errors in EPLB expert/log2phy map updates when redundant experts are enabled, ensuring stable expert routing. - Fixes quantized MoE MLP forward precision issues on NPU, aligning operator input formats with NPU kernel requirements. - No breaking changes to existing interfaces; the fixes are backward-compatible for scenarios without redundant experts enabled. --------- Signed-off-by: Che Ruan <cr623@ic.ac.uk> Signed-off-by: Mercykid-bash <ruanche0218@gmail.com> Co-authored-by: Che Ruan <cr623@ic.ac.uk> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-09 15:46:58 +08:00
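The Cumsum-to-Count conversion described in the two commits above can be sketched in plain Python. This is only an illustration on lists: the function name `cumsum_to_counts` and the sample values are hypothetical, and the actual fix operates on NPU tensors whose code is not shown in this log.

```python
def cumsum_to_counts(group_list_cumsum):
    """Convert cumulative per-expert token totals into per-expert token counts.

    Hypothetical helper for illustration; not the code from the commit.
    """
    counts = []
    previous = 0
    for running_total in group_list_cumsum:
        # Each count is the difference between adjacent cumulative totals.
        counts.append(running_total - previous)
        previous = running_total
    return counts


# Four experts receiving 3, 0, 4 and 5 tokens respectively.
cumsum_group_list = [3, 3, 7, 12]
count_group_list = cumsum_to_counts(cumsum_group_list)
print(count_group_list)  # [3, 0, 4, 5]
```

The two formats carry the same information, which is why passing the wrong one does not raise a shape error and instead shows up as degraded precision, as the commit message describes.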
huangdong2022	3a53bbc508	[Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with bias, resolve conflict with weight prefetch (#3465) ### What this PR does / why we need it? 1. Qwen3 MoE uses the add_rms_norm_quant op instead of separate add_rms_norm and quant ops in quantization scenarios. 2. Fix the accuracy of the torch_npu.add_rms_norm_quant op when model weights are quantized with anti_method m4. m4 is an asymmetric outlier suppression method that generates a non-zero norm bias, so the add_rms_norm_quant op is updated to take this parameter into the calculation. 3. Add a torch-npu check. ### Does this PR introduce _any_ user-facing change? The new feature works if the torch_npu version is >= torch_npu-2.7.1.dev20250919. ### How was this patch tested? 1. No special parameters or new environment variables need to be set. The new feature works if the torch_npu version is >= torch_npu-2.7.1.dev20250919. 2. Tested with Qwen3 MoE quantized models, such as Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, and Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4). - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: h30027576 <huangdong51@huawei.com>	2025-10-17 09:30:51 +08:00
yuzhup	78777237a9	[2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203) ### What this PR does / why we need it? - Refactor and integrate a unified `WeightPrefetchMethod` - Integrate `gate_up_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators improves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Adds a new option to `--additional-config`: ```json { "weight_prefetch_config": { "enabled": true, "prefetch_ratio": { "moe": { "gate_up": 0.8 } } } } ``` This feature is enabled by default and can be disabled through this configuration. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com>	2025-10-14 20:16:33 +08:00
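As a minimal sketch, the `weight_prefetch_config` value from the commit above can be built as a Python dict and serialized to a JSON string. The keys and values are taken from the commit message; the variable names are hypothetical, and how the resulting string is handed to `--additional-config` is left to the caller.

```python
import json

# Keys and values as given in the commit message for #3203.
additional_config = {
    "weight_prefetch_config": {
        "enabled": True,  # Python bool; json.dumps emits the JSON literal true
        "prefetch_ratio": {
            "moe": {"gate_up": 0.8},
        },
    },
}

# Serialize to a JSON string suitable for passing as a command-line value.
additional_config_arg = json.dumps(additional_config)
print(additional_config_arg)
```

Serializing from a dict avoids hand-written JSON pitfalls such as Python-style `True` or trailing commas, neither of which is valid JSON.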
offline893	82b6c846ca	[BugFix]Fix eplb problems when using dynamic eplb. (#3364) ### What this PR does / why we need it? When using dynamic eplb, execution is blocked by nz tensors. We fix these problems by cloning the src tensor and the recv tensor. ### Does this PR introduce any user-facing change? ### How was this patch tested? Qwen3_moe in A3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-11 14:04:02 +08:00
florenceCH	14497b748d	Remove qwen3 moe MC2 cumsum & cast (#3126) What this PR does / why we need it? The Qwen3 MoE MC2 graph currently contains two redundant operators: a cumsum and a cast are applied after npu_moe_distribute_dispatch_v2. By using expert_token_nums_type=0 and not converting weight_scale to float32, these two operators can be eliminated, thereby improving inference performance. Does this PR introduce any user-facing change? No How was this patch tested? Not needed vLLM version: v0.10.2 vLLM main: `f225ea7dd9` --------- Signed-off-by: florenceCH <gaoxiang120@huawei.com> Co-authored-by: florenceCH <gaoxiang120@huawei.com>	2025-09-26 08:51:30 +08:00
weichen	37a0715eda	[Refactor] Adjustments to moe_comm_method selection process (#3001) ### What this PR does / why we need it? Fix issues mentioned in https://github.com/vllm-project/vllm-ascend/pull/2791 and some minor refactoring. 1. Use Enum instead of string. 2. Avoid setting a new property to forward_context in AscendFusedMoE.forward(). 3. Enable TokenDispatcherWithMoge. 4. Remove redundant code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing: 1. Enable/Disable EP 2. Aclgraph & eager - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-22 19:12:58 +08:00
weichen	18ca7861f6	[Main] [Refactor] Enable MoECommMethod in Eager Mode (#2791) ### What this PR does / why we need it? 1. Replace prepare/finalize operation in fused_moe.py by moe_comm_method.prepare()/finalize() 2. Replace unified_fused_experts by moe_comm_method.fused_experts() in fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py 3. Call _select_moe_comm_method in spec-decode proposers. 4. Currently, w4a8_dynamic does not support gatherep; use all2allv instead. 5. Remove redundant code. ### Does this PR introduce _any_ user-facing change? The AllgatherEP switch is disabled in aclgraph/eager mode; selection follows the rules in modelrunner_v1._select_moe_comm_method() ### How was this patch tested? e2e & ut - vLLM version: v0.10.2 - vLLM main: `7f6f2c1182` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-16 11:06:00 +08:00
weichen	a041d4f328	[main] [refactor] refactor common_fused_moe.py (#2706) ### What this PR does / why we need it? 1. Move prepare/finalize operation from moe_comm_method to /ops/moe/fused_moe_prepare_and_finalize 2. Adapt to token_dispatcher in moe_comm_method 3. Move moe_comm_method/experts_selector/token_dispatcher/fused_moe_prepare_and_finalize to /ops/moe ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: weichen <calvin_zhu0210@outlook.com> Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-08 20:09:50 +08:00

9 Commits