Commit Graph

17 Commits

Anion
5f8b1699ae [Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers (#3311)
### What this PR does / why we need it?
**Problem Description:**

The existing implementation for the w4a8-dynamic linear method only
supports the old quantization format from msmodelslim. When attempting
to load models quantized with the new version, vLLM encounters errors
due to mismatched tensor shapes and unprocessed quantization parameters.

Relevant issues:
- https://github.com/vllm-project/vllm-ascend/issues/3192
- https://github.com/vllm-project/vllm-ascend/issues/3152

**Proposed Changes:**
1. Add support for w4a8 dynamic (new format) in
AscendW4A8DynamicLinearMethod and TorchairAscendW4A8DynamicLinearMethod
2. Add unit tests and e2e tests for both the new- and old-format w4a8
dynamic models
<details>
<summary><b>details</b></summary>

1.  **Support for the new w4a8-dynamic format** (see the sketch after this details block):
* Detects the quantization format by reading the "version" field in
quant_description to ensure backward compatibility.
* Handles the new pre-packed weight format (`2x int4` in an `int8`),
which has a halved dimension, and tells the vLLM loader how to unpack it
using `_packed_dim` and `_packed_factor`.
* Supports the new `scale_bias` parameter, setting its shape based on
the layer type, as required by msmodelslim. For API consistency and
future use, the `layer_type` parameter was also added to the other
quantization methods.
* Updates the weight-processing logic: new-format weights are handled
with `.view(torch.int32)` since they're pre-packed, while old-format ones
are processed with `npu_convert_weight_to_int4pack`.

2.  **New unit and E2E tests:**
* Added unit tests that verify the logic for both the old and new
formats.
* Split the distributed E2E test to confirm that both old and new format
models work correctly.

</details>
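
A minimal sketch of the detection, unpacking hints, and weight processing described above; helper names, the packed dimension, and the version check are assumptions rather than the actual vllm-ascend code:

```python
import torch

def is_new_w4a8_format(quant_description: dict) -> bool:
    # New msmodelslim exports carry a "version" field; its absence
    # implies the old format, preserving backward compatibility.
    return "version" in quant_description

def annotate_packed_weight(weight: torch.nn.Parameter) -> None:
    # Two int4 values are packed into one int8, so the packed dim
    # holds half the logical elements; these attributes tell the
    # vLLM loader how to index into the packed tensor.
    weight._packed_dim = 0      # assumption: which dim is packed
    weight._packed_factor = 2   # 2x int4 per int8

def process_weight(weight: torch.Tensor, new_format: bool) -> torch.Tensor:
    if new_format:
        # Already packed by msmodelslim: reinterpret the int8 buffer
        # as int32 words for the int4 GEMM kernels.
        return weight.view(torch.int32)
    # Old format: pack on load with the NPU helper
    # (the real helper may expect an int32 input; see torch_npu docs).
    import torch_npu  # assumption: torch_npu is installed
    return torch_npu.npu_convert_weight_to_int4pack(weight)
```
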
Theoretically, these changes provide support for all common new-version
w4a8 (dynamic) models from msmodelslim.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
I implemented the relevant unit tests and e2e tests and tested the
changes with the following commands:
```bash
# unit tests
python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v

# e2e tests
pytest tests/e2e/singlecard/test_quantization.py -v -s

pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s

```

I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic
format:
```bash
vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384 
```

All tests mentioned passed locally.

**NOTE: I use a quantized model from my own repo in
test_offline_inference_distributed.py.** Here is the description:
[Anionex/Qwen3-1.7B-W4A8-V1](https://modelscope.cn/models/Anionex/Qwen3-1.7B-W4A8-V1/summary)
(including quantization steps). This should be replaced by a model in
the vllm-ascend CI ModelScope repo.

Thanks for reading!


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Anionex <1005128408@qq.com>
2025-10-21 20:18:39 +08:00
linfeng-yuan
068ed706c8 [feat][torchair] support super kernel feat for quantized dsr1 (#3485)
### What this PR does / why we need it?
Ports #1916 and #2157 to the master branch to fuse operators in DeepSeek
MoE layers, which reduces scheduling overhead on devices. Note that this
feature takes effect only when `tp_size = 1` and
`multistream_overlap_shared_expert` is enabled with torchair graph mode.

### Does this PR introduce _any_ user-facing change?
Users can enable this feature with `--additional-config
'{"torchair_graph_config":{"enabled":true, "enable_super_kernel":true},
"multistream_overlap_shared_expert":true}'`.

### How was this patch tested?
E2E deepseek serving with 2P1D disaggregated prefill scenarios.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-20 20:04:37 +08:00
Jade Zheng
e04a5e3dd3 [Bugfix] Fix race condition in d2h transfer (#3372)
### What this PR does / why we need it?

Using non-blocking operations for device-to-host transfers can lead to
data corruption in later steps: the CPU tensor is accessed right after
the transfer is triggered, but the transfer might not be complete yet,
so the data can be wrong. This problem was observed in the A3
environment during `profile_run`.
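
A minimal sketch of the hazard (the helper name is illustrative; the fix is to make the copy blocking or to synchronize before reading):

```python
import torch
import torch_npu  # assumption: registers the NPU device backend

def d2h_copy(dev_tensor: torch.Tensor) -> torch.Tensor:
    # The non-blocking copy returns immediately; the host tensor may
    # still contain stale bytes until the device stream finishes.
    host = dev_tensor.to("cpu", non_blocking=True)
    # Reading `host` right here is the race described above. Either
    # drop non_blocking=True on this path, or synchronize first:
    torch.npu.synchronize()
    return host
```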

### How was this patch tested?
CI pass.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-10-20 18:24:21 +08:00
offline893
6c9909c861 [Patch] Patch of v1 executor when enabling EPLB (#3511)
### What this PR does / why we need it?
When using dynamic EPLB, patch the v1 executor to avoid failures when
creating child processes.

### How was this patch tested?
Tested with DeepSeek V3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-19 10:54:26 +08:00
yechao237
4750d45d86 [BugFix]Support redundant experts in EPLB (#3473)
This PR adds support for redundant experts in the EPLB. 

Key points:
- Use `global_num_experts = num_experts + num_redundant_experts`
consistently (see the sketch below).
- Backward compatible when `num_redundant_experts = 0`.
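
A small illustration of the sizing rule (the numbers are illustrative, not from the PR):

```python
def global_expert_count(num_experts: int, num_redundant_experts: int = 0) -> int:
    # The sizing rule this PR applies consistently across EPLB code.
    return num_experts + num_redundant_experts

# E.g. 256 logical experts plus 16 redundant replicas:
assert global_expert_count(256, 16) == 272
# Backward compatible: the default of 0 reduces to num_experts.
assert global_expert_count(256) == 256
```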

Tested on a 16-rank setup (W8A8) with static EPLB and expert_map_path,
verifying router logits shape and successful requests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: yechao237 <yechao20180411@gmail.com>
2025-10-18 00:09:16 +08:00
offline893
5a3082cd15 [EPLB]Record expert map without dynamic eplb. (#3409)
### What this PR does / why we need it?
1. Record the expert map without dynamic EPLB.
2. Add `export PYTHONOPTIMIZE=1` when using dynamic EPLB (usage sketch below).
3. Update the EPLB doc.
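
A usage sketch for item 2; only `PYTHONOPTIMIZE=1` comes from this PR, while the serve command and the `dynamic_eplb` config key are illustrative assumptions (verify against the EPLB doc):

```bash
# Set before launching vLLM with dynamic EPLB enabled.
export PYTHONOPTIMIZE=1
# Hypothetical launch; the config key is an assumption.
vllm serve <model> --additional-config '{"dynamic_eplb": true}'
```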

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested with Qwen3-MoE on A3.

- vLLM version: v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-15 14:21:15 +08:00
linfeng-yuan
c55d99d13e [bugfix][torchair] fix missing weight nz cast for w13_weight in torchair_w8a8_dynamic.py (#3446)
### What this PR does / why we need it?
Fixes the missing NZ conversion for quantized weights used by GMM after
the moe_dispatch operator in the torchair scenario; the aclgraph and
single-op scenarios are not affected.
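
A minimal sketch of the kind of cast involved, assuming torch_npu's `npu_format_cast` and the FRACTAL_NZ ACL format id (29); the helper name is illustrative:

```python
import torch
import torch_npu  # assumption: Ascend PyTorch adapter installed

ACL_FORMAT_FRACTAL_NZ = 29  # ACL format id for FRACTAL_NZ

def cast_w13_to_nz(w13_weight: torch.Tensor) -> torch.Tensor:
    # Cast the grouped-matmul weight to NZ so the GMM that runs after
    # moe_dispatch does not fall back to the slower ND layout.
    return torch_npu.npu_format_cast(w13_weight, ACL_FORMAT_FRACTAL_NZ)
```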

### How was this patch tested?
vLLM serving passed with lower latency (~5 ms TPOT with bs_per_rank=28
and ep_size=32).

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-14 21:11:05 +08:00
anon189Ty
07e39620ea [Feat] Unquantized Linear to nz and control all nz-cast (#3356)
### What this PR does / why we need it?
Currently, when execution reaches the Linear layers of models in
vLLM-Ascend, the weight format is ND in the unquantized case and in the
skipped-ascend case. This PR supplements the execution logic for the
Linear layer. We use a new environment variable, VLLM_ASCEND_ENABLE_NZ:
when VLLM_ASCEND_ENABLE_NZ=1 and the CANN version is 8.3, the weights of
the Linear layer are converted to FRACTAL_NZ, in both the unquantized
and skipped-ascend cases. VLLM_ASCEND_ENABLE_NZ also controls the
existing NZ conversions, such as the w8a8-quantized case.
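
A rough sketch of the gating logic described above; the helper name is illustrative and the CANN 8.3 version check is elided:

```python
import os
import torch
import torch_npu  # assumption: Ascend PyTorch adapter installed

ACL_FORMAT_FRACTAL_NZ = 29  # ACL format id for FRACTAL_NZ

def maybe_cast_linear_weight(weight: torch.Tensor) -> torch.Tensor:
    # Opt-in only: the cast happens when the user sets
    # VLLM_ASCEND_ENABLE_NZ=1 (and, in the real code, CANN is 8.3).
    if os.getenv("VLLM_ASCEND_ENABLE_NZ", "0") == "1":
        return torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
    return weight
```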

### Does this PR introduce _any_ user-facing change?
Adds a new environment variable, VLLM_ASCEND_ENABLE_NZ. If you want to
use the NZ format, set VLLM_ASCEND_ENABLE_NZ=1.

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-10-14 17:39:26 +08:00
realliujiaxu
31682961af [Feat] enable hierarchical communication for mc2 ops on A2 (#3015)
On A2, setting the environment variables `HCCL_INTRA_PCIE_ENABLE=1` and
`HCCL_INTRA_ROCE_ENABLE=0` reduces cross-machine communication traffic
and significantly improves communication performance.
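
For reference, the environment setup this PR builds on, set before launching vLLM on A2 nodes:

```bash
# Prefer intra-node PCIe over RoCE for the mc2 dispatch/combine ops.
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
```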

For more details, please refer to the
[documentation](https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_moe_distribute_dispatch_v2.md).

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-10-13 16:13:17 +08:00
dsxsteven
847d12a389 [BugFix]Fix moe load problems in torchair when using dynamic eplb (#3381)
### What this PR does / why we need it?

When using dynamic EPLB, the MoE load is not collected. We fix this
problem by modifying the hidden-states return value in torchair.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested with DeepSeek V3 on A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: daishixun <dsxsteven@sina.com>
2025-10-13 11:38:57 +08:00
Wang Kunpeng
859e861d92 [main][quantization] Support deepseek w4a8 per-channel quantization (#3011)
### What this PR does / why we need it?
1. Support DeepSeek w4a8 per-channel quantization.
2. The eager mode supports converting weights to the NZ format.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim

##### Installation steps

```bash
git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh
```

##### Generate w4a8 per-channel weights

```bash
cd /example/DeepSeek
```

Command reference: msmodelslim/example/DeepSeek/README.md

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-09-27 21:01:16 +08:00
linfeng-yuan
d01fd1d1c3 [misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes (#3074)
### What this PR does / why we need it?
This miscellaneous PR contains several small fixes:
1) fix initialization and forward bugs of DeepseekMTPLayer with
`shared_expert_dp` enabled.
2) fix a tensor shape mismatch after o_proj caused by a work-around
change in NPUModelRunner.
3) avoid an unnecessary decline of kv_cache memory (default: 64 MB) with
`use_cached_kv_cache_bytes` disabled.
4) fall back `fused_moe_state` from `MC2` to `All2All`, since the padding
logic of `mc2_mask` is incompatible with the input hidden_states when
`shared_expert_dp` is enabled.

Once this PR is merged, users can launch disaggregated_prefill
deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp`, as in
the `v0.9.1-dev` branch. The remaining problem of the kv_cache token
decline relative to `v0.9.1-dev` will be resolved by
https://github.com/vllm-project/vllm-ascend/pull/3073.
 
### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?
E2E vLLM serving of deepseek_mtp with torchair graph mode, and of
`enable_shared_expert_dp` with eager mode. Large-EP deployments were
also tested with this PR.


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-23 14:52:42 +08:00
linfeng-yuan
ffdd1a36e2 [bugfix][torchair] fix wasted NPU memory buffer allocation for quantized deepseek with unquantized MTP layer (#3068)
### What this PR does / why we need it?
While running quantized deepseek models with an unquantized MTP layer,
free NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This
results from a wasted VRAM buffer allocation caused by calling
`dist.all_to_all_single` without the correct device process group
argument.
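
A minimal sketch of the fix implied above; the helper name is illustrative, and the point is to pass the existing device (HCCL) process group explicitly instead of letting `group` default to the global one, which makes HCCL lazily create a second communicator and its buffer:

```python
import torch
import torch.distributed as dist

def dispatch_tokens(output: torch.Tensor, input_: torch.Tensor,
                    device_group: dist.ProcessGroup) -> None:
    # Passing the existing device process group avoids allocating an
    # extra 2*HCCL_BUFFSIZE communication buffer on the default group.
    dist.all_to_all_single(output, input_, group=device_group)
```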

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We ran vLLM online serving with quantized deepseek-r1 and an unquantized
MTP layer, and observed that free memory increased, with no redundant
VRAM buffer for the HCCL communication op (all_to_all_single).

- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-22 14:06:43 +08:00
Angazenn
aeffe27b30 [Perf]set moe w2_weight default to be nz (#2842)
### What this PR does / why we need it?

This PR sets the default format of the GMM w2_weight in w8a8_dynamic to
NZ to improve performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: main
- vLLM main:
e40827280b

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-09-11 21:40:54 +08:00
22dimensions
37f5a29cd4 [1/N][Refactor][Quantization] remove redundant quantizer class (#2680)
### What this PR does / why we need it?

The AscendQuantizer/LLMQuantizer classes are used to select the quant
method based on the quant config and some other arguments, but it is
simpler and cleaner to replace these classes with a map, so I removed
them.
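
A hedged sketch of the class-to-map refactor; the class and key names are illustrative, not the actual vllm-ascend symbols:

```python
from typing import Callable, Dict

class W8A8DynamicMethod:
    """Placeholder for a real Ascend quant method class."""

class W4A8DynamicMethod:
    """Placeholder for a real Ascend quant method class."""

# The quantizer classes collapse into a plain lookup table.
QUANT_METHOD_MAP: Dict[str, Callable[[], object]] = {
    "w8a8_dynamic": W8A8DynamicMethod,
    "w4a8_dynamic": W4A8DynamicMethod,
}

def get_quant_method(quant_type: str) -> object:
    if quant_type not in QUANT_METHOD_MAP:
        raise ValueError(f"Unsupported quant type: {quant_type}")
    return QUANT_METHOD_MAP[quant_type]()
```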

### Does this PR introduce _any_ user-facing change?
No 

### How was this patch tested?

Unit tests and e2e tests.


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-04 11:35:14 +08:00
Wang Yixuan
936c102105 [bugfix][refactor]fix torchair_w8a8 (#2569)
### What this PR does / why we need it?
Separates torchair w8a8 and w4a8 from fused_moe, due to the refactor of
and changes to fused_moe.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-28 09:10:31 +08:00
Wang Yixuan
20a7bc4b71 [3/N][refactor] refactor quantization (#2504)
### What this PR does / why we need it?
Move the torchair-related quantization code into the torchair dir to
make the code clearer. As a next step, we'll remove all torchair-related
code outside of the torchair quantization module.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
959783fb99

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-27 10:45:50 +08:00