xc-llm-ascend

Author	SHA1	Message	Date
ZYang6263	6975d46627	[v0.11.0][Perf] Eliminating the zerolike operator through patch (#3632 ) ### What this PR does / why we need it? There is a zero-like operator before the attention operation in each decoding stage. After analysis, this operator can be eliminated. The purpose of this PR is to remove this operator and improve performance. --------- Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-23 14:49:28 +08:00
Wang Yixuan	edccd46d74	fix deepseek torchair precision (#3635 ) ### What this PR does / why we need it? The precision of deepseek torchair was broken by #3465 , which is due to the original patch or rmsnorm in torchair. This PR fixes the precision of deepseek torchair. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Signed-off-by: hust17yixuan <303660421@qq.com>	2025-10-22 20:20:32 +08:00
whx	bd11c0054f	[BugFix] Fix torchair+mtp bug after deleting deepseek_mtp. (#3590 ) This is a missing bug fix introduced by PR #3561 - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-21 22:23:52 +08:00
xuyexiong	79821106e6	[BugFix]Fix mtp torchair bug caused by #2719 (#3566 ) ### What this PR does / why we need it? Fix the mtp torchair bug caused by #2719. Since FIA needs extra space for padding, we need to enforce `self.max_num_seqs > self.scheduler_config.max_num_seqs` in KV consumer + MTP. This means that `self.max_num_seqs` is greater than the actual maximum number of requests (`self.scheduler_config.max_num_seqs`). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-21 22:21:44 +08:00
wangxiyuan	13e8e75143	[Refactor] refactor patch module (#3555 ) ### What this PR does / why we need it? we notice that `patch_main` is never used. Usually the patch is for all version. And if it's for specified version, we can use `vllm_version_is` instead. So let's remove the useless sub folder in patch module to make it clear. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-21 20:19:46 +08:00
Anion	5f8b1699ae	[Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers (#3311 ) ### What this PR does / why we need it? Problem Description: The existing implementation for the w4a8-dynamic linear method only supports the old quantization format from msmodelslim. When attempting to load models quantized with the new version, vLLM encounters errors due to mismatched tensor shapes and unprocessed quantization parameters. Relevant issues: - https://github.com/vllm-project/vllm-ascend/issues/3192 - https://github.com/vllm-project/vllm-ascend/issues/3152 Proposed Changes: 1. Add support for w4a8 dynamic (new format) in AscendW4A8DynamicLinearMethod and TorchairAscendW4A8DynamicLinearMethod 2. Add unit tests and e2e tests for w4a8 dynamic new and old format models <details> <summary><b>details</b></summary> 1. Support for new w4a8-dynamic format: * Detects quantization format by reading the "version" field in quant_description to ensure backward compatibility. * Handles the new pre-packed weight format (`2x int4` in an `int8`), which has a halved dimension. It tells the vLLM loader how to unpack it using `_packed_dim` and `_packed_factor`. * Supports the new `scale_bias` parameter, setting its shape based on the layer type, as required by msmodelslim. For API consistency and future use, the `layer_type` parameter was also added to other quantization methods. * Updates the weight processing logic: new format weights are handled with `.view(torch.int32)` since they're pre-packed, while old ones are processed with `npu_convert_weight_to_int4pack`. 2. New unit and E2E tests: * Added unit tests that verify the logic for both the old and new formats. * Split the distributed E2E test to confirm that both old and new format models work correctly. </details> Theoretically, these changes will provide support for all common new version w4a8 (dynamic) models from msmodelslim. ### Does this PR introduce _any_ user-facing change? 
no ### How was this patch tested? I implemented relevant unit tests and e2e tests and tested the changes with the following commands: ```bash # unit tests python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v # e2e tests pytest tests/e2e/singlecard/test_quantization.py -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s ``` I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic format: ``` vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384 ``` All tests mentioned passed locally. NOTE: I use a quantized model from my own repo in test_offline_inference_distributed.py. Here is the description: [Anionex/Qwen3-1.7B-W4A8-V1](https://modelscope.cn/models/Anionex/Qwen3-1.7B-W4A8-V1/summary) (including quantization steps). This should be replaced by a model in the vllm-ascend ci modelscope repo. Thanks for reading! - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Anionex <1005128408@qq.com>	2025-10-21 20:18:39 +08:00
Yizhou	274b708e0c	[Fix] Refactor dummy attention metadata creation (#3497 ) ### What this PR does / why we need it? The `force_attention` parameter is designed for flash infer kernel warmup; we don't actually need it on Ascend devices (at least for now), and it tends to make things more complicated. So we replace the `force_attention` parameter with `aclgraph_runtime_mode` in the attention metadata creation logic. This change makes the control flow more explicit by directly using the graph runtime mode to determine how to build attention metadata, rather than relying on an intermediate boolean flag. This simplification removes redundant logic and clarifies the conditions for building attention metadata for full decode graph mode. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? DP + `FULL_DECODE_ONLY` + online serving. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-21 00:00:42 +08:00
linfeng-yuan	068ed706c8	[feat][torchair] support super kernel feat for quantized dsr1 (#3485 ) ### What this PR does / why we need it? Port #1916 and #2157 to master branch to fuse operators in deepseek moe layers, which can reduce scheduling overhead on devices. Note that this feature is valid only when `tp_size = 1` and `multistream_overlap_shared_expert` is enabled with torchair graph mode. ### Does this PR introduce _any_ user-facing change? Users can enable this feature with `--additional-config '{"torchair_graph_config":{"enabled":true, "enable_super_kernel":true}, "multistream_overlap_shared_expert":true}'`. ### How was this patch tested? E2E deepseek serving with 2P1D disaggregated prefill scenarios. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-20 20:04:37 +08:00
Jade Zheng	e04a5e3dd3	[Bugfix] Fix race condition in d2h transfer (#3372 ) ### What this PR does / why we need it? Using non-blocking operations for device-to-host transfers can lead to data corruption in later steps. The CPU tensor is accessed right after the transfer is triggered, but the transfer might not be complete yet. As a result, the data could be wrong. This problem was seen in the A3 environment during `profile_run`. ### How was this patch tested? CI pass. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-10-20 18:24:21 +08:00
whx	f8b52fe950	[Model][1/N] Delete deepseek v2/v3 modeling codes. (#3189 ) This PR deletes model codes of deepseek_v2 and deepseek_v3 to reuse the model file from vLLM. vLLM Ascend now uses custom ops register way instead of model file hard-coding. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-20 15:31:34 +08:00
offline893	6c9909c861	[Patch]patch of v1 executor when enable eplb. (#3511 ) ### What this PR does / why we need it? When using dynamic eplb, patch the v1 executor to avoid failures when creating child processes. ### How was this patch tested? deepseek in v3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-19 10:54:26 +08:00
xuyexiong	0777e2f899	Optimize torchair kv_consumer padding logic (#3526 ) ### What this PR does / why we need it? Optimize torchair kv_consumer padding logic. Only pad when it is spec decoding ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-18 16:42:17 +08:00
yechao237	4750d45d86	[BugFix]Support redundant experts in EPLB (#3473 ) This PR adds support for redundant experts in the EPLB. Key points: - Use global_num_experts = num_experts + num_redundant_experts consistently. - Backward compatible when num_redundant_experts=0. Tested On a 16-rank setup (W8A8) with static EPLB and expert_map_path, verifying router logits shape and successful requests. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: yechao237 <yechao20180411@gmail.com>	2025-10-18 00:09:16 +08:00
xuyexiong	21769e8f44	[BUGFIX] Mtp torchair pd fix (#3506 ) ### What this PR does / why we need it? In memory of https://github.com/vllm-project/vllm-ascend/pull/2610 and #3449. Fix the Mtp torchair pd bug. In the PD Disaggregation scenario, the first token of the inference after the D node receives the kv follows the eager mode. Fixes: When running MTP torchair graph mode with Prefilling Decoding Disaggregation, if all requests processed by the D node are requests just transmitted from the P node, the torchair graph breaks. Reason: During PD Disaggregation, the P node only transmits the KV cache and prompt to the D node, not the actual tokens inferred (neither the main model tokens nor the MTP tokens are transmitted). Therefore, the D node will treat this request as one without MTP tokens for inference (seq_len=1). The community does not have graph mode issues because the community's attention has a seq_len=1 for each batch during the decode phase. We have issues because the graph mode pads according to processing 2 tokens per request. When there are some seq_len=1 and some seq_len=2, padding is done at the end. If all requests received by the D node are seq_len=1, padding cannot be performed normally according to the attention's FIA operator constraints. Solution: The kv consumer uses extra torchair graph padding to avoid breaking FIA graph constraints (the one this PR implements). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-17 21:57:05 +08:00
xuyexiong	30e3d86b0f	Revert "[BUGFIX] Mtp torchair pd fix (#3449 )" (#3500 ) This reverts commit `b0ae203e72`. ### What this PR does / why we need it? The fix is not ready yet, conflict with #3411 need to revert first. Will fix this issue later ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-17 09:42:48 +08:00
xuyexiong	b0ae203e72	[BUGFIX] Mtp torchair pd fix (#3449 ) ### What this PR does / why we need it? In memory of https://github.com/vllm-project/vllm-ascend/pull/2610 In the PD Disaggregation scenario, the first token of the inference after the D node receives the kv follows the eager mode. Fixes: When running MTP torchair graph mode with Prefilling Decoding Disaggregation, if all requests processed by the D node are requests just transmitted from the P node, the torchair graph breaks. Reason: During PD Disaggregation, the P node only transmits the KV cache and prompt to the D node, not the actual tokens inferred (neither the main model tokens nor the MTP tokens are transmitted). Therefore, the D node will treat this request as one without MTP tokens for inference (seq_len=1). The community does not have graph mode issues because the community's attention has a seq_len=1 for each batch during the decode phase. We have issues because the graph mode pads according to processing 2 tokens per request. When there are some seq_len=1 and some seq_len=2, padding is done at the end. If all requests received by the D node are seq_len=1, padding cannot be performed normally according to the attention's FIA operator constraints. Solution: The kv consumer uses extra torchair graph padding to avoid breaking FIA graph constraints (the one this PR implements). The kv producer provides the correct tokens to the kv consumer, so that our graph mode constraints are not broken, and all logic is the same as the PD mixed deployment. Since we are using the community scheduler, the modification requires patching the vllm scheduler, but theoretically, performance should be better. (Maybe later) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-16 09:03:49 +08:00
Mengqing Cao	8abe517870	[Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432 ) ### What this PR does / why we need it? Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch. The final goal is to remove all the patches and align the code arch to vllm, thus we need to do the following work in next prs. TODO: - [x] remove patch on attention spec - [ ] refactor the kvcache creation logic ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? 1. CI passed with existing test. 2. Test pass with deepseek-v3.2-exp - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-10-15 17:48:58 +08:00
offline893	5a3082cd15	[EPLB]Record expert map without dynamic eplb. (#3409 ) What this PR does / why we need it? 1.Record expert map without dynamic eplb. 2.Add export PYTHONOPTIMIZE=1 when using dynamic eplb. 3.change eplb doc Does this PR introduce any user-facing change? How was this patch tested? Qwen3_moe in A3. - vLLM version: v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-15 14:21:15 +08:00
LeeWenquan	4e720936d8	Fix warning msg print (#3421 ) ### What this PR does / why we need it? Avoid printing some warning msg as below : UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach ... ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-10-15 11:30:30 +08:00
zouyida2052	3642b64afc	bugfix for mtp with multistream_moe (#3419 ) ### What this PR does / why we need it? When running inference on the deepseek mtp layer with multistream_moe, we should pass a boolean to evaluate this feature and fix bugs that occur when we are in the mtp layer. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-15 08:59:58 +08:00
linfeng-yuan	c55d99d13e	[bugfix][torchair] fix missing weight nz cast for w13_weight in torchair_w8a8_dynamic.py (#3446 ) ### What this PR does / why we need it? Fix the issue of missing NZ conversion for quantized weights in GMM after moe_dispatch operator in torchair scenario, which does not involve aclgraph & single scenarios. ### How was this patch tested? vllm serving passed with lower latency (~5ms TPOT with bs_per_rank=28 & ep_size=32) - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-10-14 21:11:05 +08:00
anon189Ty	07e39620ea	[Feat] Unquantized Linear to nz and control all nz-cast (#3356 ) ### What this PR does / why we need it? Currently, when executing to the Linear layer of models in vLLM-Ascend, the weights format is ND in unquantized case and skipped ascend case. This PR supplements the execution logic for Linear layer. We use a new global variable: VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and CANN version is 8.3, the weights of the Linear layer will be converted to FRACTAL_NZ, in both unquantized case and skipped ascend case. We also use VLLM_ASCEND_ENABLE_NZ to control the existing NZ conversion, such as w8a8-quantized case. ### Does this PR introduce _any_ user-facing change? Add a new global variable VLLM_ASCEND_ENABLE_NZ. If you want to use NZ format, you should set VLLM_ASCEND_ENABLE_NZ=1. ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-14 17:39:26 +08:00
weijinqian0	6972df5951	[Feature] optimize sp & qwen3 next support sp. (#3225 ) This PR will accomplish the following tasks: optimize SP In the old version implementation, the first layer was all_reduce, which used rms to split chunks. We changed it to perform reduce_scatter on the embedding side, replace one all_reduce operation and one chunk with one reduce_scatter operation. Support qwen3 next Since Qwen3 Next includes a linear attention module, the prefix name of this module cannot take effect directly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-10-13 23:02:12 +08:00
realliujiaxu	31682961af	[Feat] enable hierarchical communication for mc2 ops on A2 (#3015 ) Currently, when in A2, setting the environment variables `HCCL_INTRA_PCIE_ENABLE=1` and `HCCL_INTRA_ROCE_ENABLE=0` can reduce cross-machine communication traffic and significantly improve communication performance. For more details, please refer to [document](https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_moe_distribute_dispatch_v2.md) - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-10-13 16:13:17 +08:00
dsxsteven	847d12a389	[BugFix]Fix moe load problems in torchair when using dynamic eplb (#3381 ) ### What this PR does / why we need it? When using dynamic eplb, moe load is not imported. We fix this problem by modifying the return value of hidden states in torchair. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? DeepseekV3 in A3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: daishixun <dsxsteven@sina.com>	2025-10-13 11:38:57 +08:00
offline893	82b6c846ca	[BugFix]Fix eplb problems when using dynamic eplb. (#3364 ) ### What this PR does / why we need it? When using dynamic eplb, it is blocked by nz tensors. We fix these problems by cloning the src tensor and the recv tensor. ### Does this PR introduce any user-facing change? ### How was this patch tested? Qwen3_moe in A3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-11 14:04:02 +08:00
zouyida2052	ee0a95e47f	bugfix for mtp when running torchair in a2 (#3354 ) ### What this PR does / why we need it? When the op torchair_fused_experts_with_mc2 is called, we need to pass a tp group, but currently it is only passed in the quantized scenario; we also need to pass it in the unquantized scenario. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-10-10 23:07:24 +08:00
offline893	1c2c72af8d	[bugfix]change log2phy map to npu (#3339 ) ### What this PR does / why we need it? Resolved the issue of EPLB failure caused by changes in the log2phy map due to device type modifications when using MTP rotation position encoding. ### Does this PR introduce any user-facing change? ### How was this patch tested? https://github.com/vllm-project/vllm/commit/releases/v0.11.0 - vLLM version: v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-10 08:47:55 +08:00
Ruri	ff37575936	[1/N][Feat] Add weight prefetch feature for Attention layers (#3146 ) ### What this PR does / why we need it? - Refactor and integrate a unified `WeightPrefetchMethod` - Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention modules - Prefetching these weights ahead of matmul-like operators improves performance by reducing L2 cache transfer latency ### Does this PR introduce _any_ user-facing change? Add a new config in `--additional-config` for configuration: ```json { "weight_prefetch_config": { "enabled": false, "prefetch_ratio": { "attn": { "qkv": 1.0, "o": 1.0 } } } } ``` This feature is enabled by default, and can be disabled through this configuration ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: yuzhup <15705211260@163.com> Signed-off-by: zhoux77899 <zhouxiang100@huawei.com> Co-authored-by: yuzhup <15705211260@163.com>	2025-10-09 20:38:39 +08:00
Wang Yixuan	30c5d947c3	[bugfix]fix multistream moe in torchair (#3164 ) ### What this PR does / why we need it? The multistream moe in torchair is only valid in decode and can't be applied to chunked prefill, so add some checks to isolate the scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hust17yixuan <303660421@qq.com>	2025-10-09 19:00:32 +08:00
weichen	94dd832815	[MoE] [Refactor] Combine common_fused_moe and fused_moe (#3176 ) ### What this PR does / why we need it? 1. Move additional functionalities from fused_moe.py to common_fused_moe.py and remove fused_moe.py 2. Remove unnecessary custom classes from qwen3_moe.py, and it will be completely removed after we release vllm-ascend v0.11.0 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing: 1. Enable/Disable EP 3. Aclgraph & eager 4. SP - vLLM version: v0.11.0 --------- Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-10-09 14:12:46 +08:00
wangxiyuan	f12f76d7ba	Drop 0.10.2 (#3284 ) Drop v0.10.2 support, we support vLLM 0.11.0rc3 now. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-09 10:28:38 +08:00
wangxiyuan	81bd6e4c99	Add DeepSeek V3.2 support (#3270 ) ### What this PR does / why we need it? This PR added the initial DeepSeek V3.2 support with [vLLM v0.11.0](https://github.com/vllm-project/vllm/tree/releases/v0.11.0) (not released yet). We will complete vLLM adaptation as soon as possible. This feature will be ready in recent 1-2 days. Related doc: https://github.com/vllm-project/vllm-ascend/pull/3223 . ### Does this PR introduce _any_ user-facing change? Yes! ### How was this patch tested? CI passed and Run deepseek doc soon. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wxsIcey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-30 03:25:58 +08:00
Mengqing Cao	4ff422c730	[CI][Bugfix] Quickfix for DPMetaData (#3234 ) ### What this PR does / why we need it? Fix `dpmetadata` and `Qwen3MoeSparseMoeBlock` break introduced by `26a7a33b88 (diff-c1550d0a38469d039370567d8981969530cbfffc7302cd1778e7c2c8a9322dea)` NOTE: we maintain a different sp in vllm-ascend with vllm, thus we can just use `cu_tokens_across_sp(1)` as `cu_tokens_across_dp_cpu` close https://github.com/vllm-project/vllm-ascend/issues/3236, https://github.com/vllm-project/vllm-ascend/issues/3239 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-28 21:11:22 +08:00
Wang Kunpeng	859e861d92	[main][quantization] Support deepseek w4a8 per-channel quantization (#3011 ) ### What this PR does / why we need it? 1.Support deepseek w4a8 per-channel quantization 2.The eager mode supports converting weights to the NZ format ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? #### How to get weights using Modelslim ##### Installation steps git clone https://gitcode.com/Ascend/msit.git cd msit/msmodelslim bash install.sh ##### Generate w4a8 per-channel weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-09-27 21:01:16 +08:00
wangxiyuan	2930e4a6bd	[CI] Upgrade vllm to newest commit (#3182 ) ### What this PR does / why we need it? Upgrade vLLM to newest commit - Fix the aclgraph doesn't work problem, caused by `24fab45d96` - Fix PoolerOutput import error, caused by `755ed7b05b` - Fix the aclgraph weight load error to keep the same with torchair fix. `4492e3a554` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All test should pass - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-26 06:18:15 +08:00
wangxiyuan	a055183821	[CI] Upgrade vLLM version (#3139 ) Upgrade vLLM version to the newest commit. - Fix the breaking change introduced by `969b4da3a6` - Add a patch to quick fix torchair `de94289a98` - Fix the ut error introduced by `de94289a98` Close: https://github.com/vllm-project/vllm-ascend/issues/3138 - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-25 07:36:51 +08:00
liziyu	464270e4ca	Remove useless PD check in deepseek (#3161 ) ### What this PR does / why we need it? Remove useless PD check in deepseek ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-09-24 23:25:47 +08:00
clrs97	cd1ffbb6cd	[1/N][Feat] Cut down memory usage for o_proj in DeepSeek (#2931 ) ### What this PR does / why we need it? To cut down the memory usage of large weight matrices, we often rely on various linear operations: - `ReplicatedLinear`: Stores the entire matrix, consuming excessive memory. - `RowParallelLinear`: Requires an `all_reduce` to merge answer, introducing additional communication overhead and potential accuracy loss. Each token is handled across multiple devices rather than a single device, which is undesirable in SP scenario. - ... Furthermore, in multi-way Data Parallelism (DP) configurations, layers typically store redundant weight copies. This PR introduces a shared-weight plugin for layers inheriting from `LinearBase`. It offers the following advantages: - It evenly distributes a set of layers with identical structures across devices. Each layer retains its complete weights, eliminating redundant memory usage. - It supports asynchronous broadcasting to prefetch weights for upcoming layers. - It preserves the custom `process_weights_after_loading()` method to make keeping NZ format possible. - It is compatible with any linear class that inherits from `LinearBase`, thereby preserving all the features of the original linear implementation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM main: `f4a948f33f` - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: clrs97 <524936896@qq.com> Co-authored-by: CalvinXKY <kyxiezju@163.com>	2025-09-24 17:16:41 +08:00
Mengqing Cao	2d885869c5	[KVCache][Bugfix] Fix kv cache initialization error of attention layer (#3113 ) ### What this PR does / why we need it? Fixes #3096 1. Fix kv cache initialization error of attention layer. There are some models with layer names like `attn.attn`, instead of `self_attn`, but the initialization of kv cache tensors only checks for `self_attn` and `attn.attn`, which leads to the error `AssertionError: Some layers are not correctly initialized` 2. Set the default value of input arg `sampling_metadata` in `compute_logits` for the modeling files in vllm-ascend, thus fixing the error `Qwen3NextForCausalLM.compute_logits() missing 1 required positional argument: 'sampling_metadata'` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Tested locally with internlm - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-24 11:32:34 +08:00
weijinqian0	6aa4253798	[Refactor] [SP]The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085 ) What this PR does / why we need it? there are two sets of sp implementations for moe and dense models. One is called sequence_parallelism, and the other is flashcomm_v1. We did the following things： Merge two sets of code with the same implementation into one. Remove the implementation of sequence_parallelism, as this solution cannot support aclgraph. Does this PR introduce any user-facing change? No How was this patch tested? e2e&ut - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-09-24 11:29:59 +08:00
linfeng-yuan	d01fd1d1c3	[misc][torchair] fix bugs around `deepseek mtp`, `enable_shared_expert_dp` and `use_cached_kv_cache_bytes` (#3074 ) ### What this PR does / why we need it? This miscellaneous PR contains several small fixes: 1) fix initialization and forward bugs of DeepseekMTPLayer with `shared_expert_dp` enabled. 2) fix a tensor shape mismatch after o_proj caused by a workaround change in NPUModelRunner. 3) avoid unnecessary decline of kv_cache memory (default: 64MB) with `use_cached_kv_cache_bytes` disabled. 4) fall back `fused_moe_state` from `MC2` to `All2All` since the padding logic of `mc2_mask` is incompatible with input hidden_states when `shared_expert_dp` is enabled. Once this PR is merged, users can launch disaggregated_prefill deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp` as on the `v0.9.1-dev` branch. The remaining problem of kv_cache tokens decline compared to `v0.9.1-dev` will be resolved by https://github.com/vllm-project/vllm-ascend/pull/3073. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving of deepseek_mtp with torchair graph mode and `enable_shared_expert_dp` with eager mode. Large ep deployments are also tested with this PR. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-23 14:52:42 +08:00
Li Wang	02f89d166f	[CI] Update vllm version to 20250922(5aeb925) (#3091 ) ### What this PR does / why we need it? This PR bumps the vllm commit hash to `5aeb925452` and fixes issues: 1. https://github.com/vllm-project/vllm/pull/25345 has removed v0 metadata 2. https://github.com/vllm-project/vllm/pull/25332 3. https://github.com/vllm-project/vllm/pull/25334 4. https://github.com/vllm-project/vllm/pull/23558, note that this vllm commit updates the model registration logic to check that every registered model has the `vllm.model_executor.models` path, which breaks our custom registration of the deepseek_v3 model (it doesn't exist in the vllm model path), so I moved the deepseek_v3 model registry to deepseek_v2 as a temporary workaround ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-22 22:18:13 +08:00
Yizhou	338231acaf	[Feat][Graph] Support `FULL_DECODE_ONLY` mode for GQA/MHA models (#2128 ) Note: This depends on [vLLM #25161](https://github.com/vllm-project/vllm/pull/25161) and the torch_npu release from September 30. ### What this PR does / why we need it? This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include: * Reduced dispatch latency: By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays. * Stabilized multi-device performance: Capturing the whole model as one static graph also mitigates the dispatch fluctuations across devices. * Stream/resource savings: Consolidating graph captures frees up streams, allowing more graphs to be captured. Known issues: 1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay; we’re working on a fix. There may be other corner cases. This PR is the first in a planned series; we’ll continue to iterate and address remaining issues in follow-ups. This is essentially a port of #1503 and #1677, but includes two major changes: 1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo. 2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet. ### Does this PR introduce _any_ user-facing change? ```python compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", }, ``` ### How was this patch tested? Tests included. - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-22 17:14:28 +08:00
linfeng-yuan	ffdd1a36e2	[bugfix][torchair] fix wasted NPU memory buffer allocation for quantized deepseek with unquantized MTP layer (#3068 ) ### What this PR does / why we need it? While running quantized deepseek models with an unquantized MTP layer, free NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This results from the wasted VRAM buffer allocation caused by calling `dist.all_to_all_single` without the correct device process group argument. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We ran vllm online serving with quantized deepseek-r1 and an unquantized MTP layer, and observed that free_memory increased without the redundant VRAM buffer for the HCCL communication op (all_to_all_single). - vLLM version: v0.10.2 - vLLM main: `6d8246aaff` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-22 14:06:43 +08:00
Li Wang	12bcbd02bb	[CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907 ) ### What this PR does / why we need it? 1. This PR bumps the vllm commit to `6d8246aaff` 2. fix upstream changes https://github.com/vllm-project/vllm/pull/24548 abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable 3. fix metadata_builder changes introduced by https://github.com/vllm-project/vllm/pull/23693 4. fix `structured_outputs_config` changes introduced by https://github.com/vllm-project/vllm/pull/22772 5. fix `moe_config` changes introduced by https://github.com/vllm-project/vllm/pull/22537 Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> - vLLM version: v0.10.2 - vLLM main: `c60e6137f0` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-20 17:37:57 +08:00
Lucas Kabela	53ecd89e8f	[Bugfix] Remove `VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE` (#2969 ) ### What this PR does / why we need it? This PR prepares for deleting the environment variable `VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vllm requires `fullgraph=True` to run - Fixes https://github.com/vllm-project/vllm/issues/21834 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? See CI - vLLM version: v0.10.2 - vLLM main: `99cc41ad50` --------- Signed-off-by: Lucas Kabela <lucaskabela@meta.com>	2025-09-20 08:22:30 +08:00
xuyexiong	2a87b4cecb	[Bugfix] Fix specdecoding in chunkedprefill scenario (#3025 ) ### What this PR does / why we need it? The speculative decode phase of chunkedprefill took an incorrect path; it should always use the TND layout for speculative decoding. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `6d8246aaff` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-19 14:05:08 +08:00
whx	0a526768f5	[Feature] Support moe multi-stream for aclgraph. (#2946 ) This PR puts the calculation of shared experts into a separate stream, overlapping with routing experts. - vLLM version: v0.10.2 - vLLM main: `fbd6523ac0` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-19 11:06:45 +08:00
LeeWenquan	f4e3d22432	Remove chunked_prefill_for_mla and fix ring_mla bug (#2781 ) ### What this PR does / why we need it? Remove the chunked prefill for mla branch in mla, and change the dtype of prefill_mask to avoid an accuracy problem ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `ef7eefe17a` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-09-18 19:43:26 +08:00

102 Commits