xc-llm-ascend

Author	SHA1	Message	Date
aipaes	f58e110afe	【feat】switch for fusion ops gmmswigluquant (#5992 ) ### What this PR does / why we need it? Add an additional config parameter to control whether the gmmswigluquant fused operator is enabled; setting it to True enables the operator. When enabled with a small number of GPUs, the gmmswigluquant fused operator can cause some performance degradation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` #### Perf test model: GLM 4.6(w8a8) - single A3 node(ep16, tp16), async-scheduling, mtp, FULL_DECODE_ONLY - bs=1, input_lens=32000, output_lens=1024 Without this PR: TPOT 32.22 ms With this PR: TPOT 30.23 ms --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-01-19 13:19:25 +00:00
wangqiankun13	ebb940691f	[Feature] Adapt DispatchGmmCombineDecode operator to align with weight scale dtype of small operators. [RFC: issue 5476] (#5755 ) ### What this PR does / why we need it? [Feature] Adapt DispatchGmmCombineDecode operator to align with weight scale dtype of small operators. - Before: weight scale must be float32 - After: weight scale can be float32/float16 when x is float16, float32/bfloat16 when x is float32/bfloat16. The w1 scale can also use a different dtype from the w2 scale. For more information about this operator, please refer to the RFC issue: https://github.com/vllm-project/vllm-ascend/issues/5476 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Perf > When scale is of type fp16 or bf16, it will be cast to fp32 internally within the operator, while the subsequent computations remain unchanged. Therefore, this PR will introduce an additional cast operation but halve the memory copy operations for scale. Furthermore, since the scale data is only a few KB in size and participates in relatively few computations, its impact is almost negligible compared to major operations like matrix multiplication. Thus, the theoretical performance change should be minimal. Tested single-operator cases from qwen3-235b: - single A3 node(ep16), 64 moe experts, 4 experts / die (like qwen3-235b ep32) - batch=18/32, token_hidden_size 4096, moe_intermediate_size 1536 The test was conducted for 100 rounds, and the average of the last 95 rounds was taken. \| \| bs18(us)\| bs32(us)\| \| -----\| -----\| -----\| \|Without this PR\|96.28\|108.83\| \|With this PR\|96.06\|107.90\| Note: Single-operator benchmarks represent an ideal scenario. They are usually only useful for referencing relative changes and may not fully align with performance data observed within the full model. 
#### Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-19 16:10:43 +08:00
LI SHENGYONG	83de5385b4	[EPLB][Bugfix] policy_swift_balancer bugfix and renaming (#5897 ) ### What this PR does / why we need it? 1. Rename dynamic_ep to default_eplb. 2. Rename dynamic_ep_v2 to swift_balancer 3. Discard func compose_expert_update_info_bipartite. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 05:47:40 +00:00
meihanc	9cad1a8349	[Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (#5928 ) ### What this PR does / why we need it? Migrate the torch profiler configuration from deprecated environment variables (`VLLM_TORCH_PROFILER_DIR`, `VLLM_TORCH_PROFILER_WITH_STACK`, `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`) to the explicit `ProfilerConfig` object, aligning with vLLM's configuration best practices. The profiler environment variable approach is deprecated in vLLM and will be removed in v0.14.0 or v1.0.0. ### Does this PR introduce _any_ user-facing change? Yes. Developers who want to collect profiler data should use `--profiler-config` instead of `VLLM_TORCH_PROFILER_DIR`. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-19 09:27:55 +08:00
LI SHENGYONG	bc1f6713e7	[EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (#5933 ) ### What this PR does / why we need it? 1. Move the logic of expert mapping forward to prevent shotgun changes 2. Disable the update of expert map. ### How was this patch tested? a2 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| GPQA_diamond \| 53064e \| accuracy \| gen \| 73.23 \| a3 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 09:24:25 +08:00
LI SHENGYONG	9fed2636cb	[EPLB][Nightly][Bugfix] Get expert from moe layer only (#5908 ) ### What this PR does / why we need it? 1. If the model has dense layers, the current code attempts to obtain the routing experts of the dense layers, which causes an error. This PR fixes that by skipping the dense layers when obtaining the routing experts. 2. Having the function output the global_expert_map directly affects the performance of dsv3.2. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? DeepSeek V3.1 conversation is normal. #### aime precision test (dsv3.1) baseline without eplb \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 66.67 \| eplb \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 70.00 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 09:23:28 +08:00
Song Zhixin	2b6dc100b5	Eagle3 mm support, enablement on qwen3vl (#4848 ) ### What this PR does / why we need it? Follows PR [https://github.com/vllm-project/vllm/pull/20788](https://github.com/vllm-project/vllm/pull/20788): Eagle3 mm support, enablement on qwen3vl. Target model: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct). Eagle3 model: [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv vLLM with eagle3: ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{ "method": "eagle3", "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3", "num_speculative_tokens": 3 }' ``` vLLM without eagle3: ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images ``` bench: ``` vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: jesse <szxfml@gmail.com>	2026-01-19 08:58:07 +08:00
Jade Zheng	22f253142a	[Feature] Support fine-grained shared expert overlap (#5482 ) Fine-grained control over shared expert overlap to prevent resource contention. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2026-01-17 11:53:22 +08:00
Shaoxu Cheng	1ffca8673f	[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 ) ### What this PR does / why we need it? Add basic 310p support. Only dense models work with eager mode now. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-01-17 11:49:18 +08:00
rjg-lyh	3af91e5ac4	[Bugfix] Fix the input constraints checks for the mlapo and bmm_transpose operators (#5764 ) ### What this PR does / why we need it? This PR fixes the input constraint checks for the mlapo and bmm_transpose operators. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` ### Perf 64K/3K, 1P1D, bs=32 before this PR: TPOT 29ms, TTFT 47s, TPS 606 token/s after this PR: TPOT 29ms, TTFT 48s, TPS 636 token/s Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-01-16 09:52:48 +00:00
zhangxinyuehfad	4f446aec4c	[CI] Add DeepSeek-V3.2-W8A8-Pruning e2e test (#5922 ) ### What this PR does / why we need it? 1. Fix DeepSeek-V3.2-W8A8-Pruning mtp 2. Add DeepSeek-V3.2-W8A8-Pruning e2e test ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-16 15:49:57 +08:00
wjunLu	73a3f822c7	[Main2Main] Upgrade vllm commit to releases/v0.14.0 (#5911 ) ### What this PR does / why we need it? Upgrade vllm commit to releases/v0.14.0 - Re-open cases in `tests/e2e/singlecard/pooling/test_scoring.py`, since the errors before have been fixed by https://github.com/vllm-project/vllm/pull/32243 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-15 23:22:43 +08:00
zhangxinyuehfad	372f979aa5	[CI] Add DeepSeek R1 W8A8 HMB nightly ci (#5874 ) ### What this PR does / why we need it? Add DeepSeek R1 W8A8 HMB nightly ci - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 20:48:20 +08:00
Qiu	72fee47cba	[CI](cp) skip bad UT test_models_chunked_prefill_with_empty_kvcache temporarily (#5919 ) Temporarily skip the bad UT test_models_chunked_prefill_with_empty_kvcache, which is incompatible with main2main 20260114. - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-15 15:50:06 +08:00
wangxiyuan	a25209252f	[CI] Add 310p e2e test back (#5797 ) This PR adds the 310P e2e test back to ensure that related PRs are tested on 310P. 1. For light e2e, we run the 310P test if the changed files are located in `vllm_ascend/_310p`. 2. For full e2e, we always run the 310P test. 3. For the main2main test, we stop running the 310P test. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-15 15:47:13 +08:00
meihanc	80fbb1b6b1	[CI]Fix nightly clang installation following previous attempt (#5907 ) ### What this PR does / why we need it? This PR fixes the issue where the previous PR https://github.com/vllm-project/vllm-ascend/pull/5733 failed to install Clang in nightly environment. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-15 14:18:11 +08:00
Shanshan Shen	efa0f64f22	[Doc] Add tutorials for Qwen3-VL-30B-A3B-Instruct (#5331 ) ### What this PR does / why we need it? Add tutorials for `Qwen3-VL-30B-A3B-Instruct`. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-15 10:56:19 +08:00
LI SHENGYONG	da958ee386	[EPLB]Eplb Config Renaming (#5533 ) ### What this PR does / why we need it? 1. Rename num_iterations_eplb_update to expert_heat_collection_interval. 2. Rename num_wait_worker_iterations to algorithm_execution_interval. 3. Rename init_redundancy_expert to num_redundant_experts because the variable with the same meaning in vLLM is named this way. 4. Delete gate_eplb because we don't need this feature. 5. Move eplb config into a dict in additional config. 6. Depend on pr5817 ### Does this PR introduce _any_ user-facing change? before this pr： `--additional-config '{"dynamic_eplb":true, "num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150, "init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'` after this pr: `--additional-config '{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000, "algorithm_execution_interval":150,"num_redundant_experts": 16, "expert_map_path": "xxx.json"}}'` ### How was this patch tested? #### test qwen3-235b eplb num_redundant_experts=16 without pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| with pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-15 10:26:44 +08:00
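The option renaming in #5533 can be sketched as a small converter from the old flat `--additional-config` payload to the nested `eplb_config` layout. The key names are taken from the commit message above; the helper name `migrate_eplb_config` is hypothetical and not part of vllm-ascend, and keeping unrelated options at the top level is an assumption.

```python
import json

# Old flat key -> new key inside "eplb_config" (names from the commit message).
RENAMED = {
    "num_iterations_eplb_update": "expert_heat_collection_interval",
    "num_wait_worker_iterations": "algorithm_execution_interval",
    "init_redundancy_expert": "num_redundant_experts",
}
# Keys that keep their name but move under "eplb_config" in the commit's example.
MOVED = ("dynamic_eplb", "expert_map_path")
# Option deleted by the PR.
DROPPED = ("gate_eplb",)


def migrate_eplb_config(old: dict) -> dict:
    """Hypothetical helper: rewrite a flat additional-config dict into the nested form."""
    new = {}
    eplb = {}
    for key, value in old.items():
        if key in DROPPED:
            continue
        if key in RENAMED:
            eplb[RENAMED[key]] = value
        elif key in MOVED:
            eplb[key] = value
        else:
            new[key] = value  # assumption: unrelated options stay at the top level
    if eplb:
        new["eplb_config"] = eplb
    return new


# The "before this pr" payload from the commit message.
old_cfg = {
    "dynamic_eplb": True,
    "num_iterations_eplb_update": 4000,
    "num_wait_worker_iterations": 150,
    "init_redundancy_expert": 16,
    "expert_map_path": "xxx.json",
}
new_cfg = migrate_eplb_config(old_cfg)
print(json.dumps(new_cfg))
```

The printed JSON can be passed as the value of `--additional-config`, matching the "after this pr" form shown in the commit.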
Zetong Li	ea01aeaab7	[Refactor][EAGLE] 4/N extract common methods from eagle and mtp (#5870 ) ### What this PR does / why we need it? This PR aims to extract common methods from eagle_proposer and mtp_proposer. This is a small step towards merging eagle and mtp. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-15 10:24:35 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which were never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are broken by https://github.com/vllm-project/vllm/pull/32148 where vLLM also fails https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604 This PR refactors vllm_ascend/distributed, moving all kv_transfer related code into a dedicated folder, as has already been done in vLLM. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
Li Wang	f34b3b8ee9	[nightly] Remove node tolerations for hk cluster (#5896 ) ### What this PR does / why we need it? Since we have upgraded the `cann` HDK version on all nodes to `25.3rc1`, we should no longer limit the nightly schedule to specific nodes. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-15 08:55:06 +08:00
meihanc	a9f730b853	[bugfix]Intermittent CI failure in the triton runtime jit (#5733 ) ### What this PR does / why we need it? fix bug : https://github.com/vllm-project/vllm-ascend/issues/5634 Intermittent CI failure due to a compilation error in the triton operator ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-14 22:58:08 +08:00
Qiu	a88937f5cb	[bugfix](cp) replace None with zeros/inf tensor to avoid TypeError (#5837 ) ### What this PR does / why we need it? When some devices have no kv cache, the `_compute_prefill_context` function returns `None`, which is unexpected. This PR replaces None with full zeros/-inf tensors to avoid the TypeError. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py -k test_models_chunked_prefill_with_empty_kvcache ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-14 20:57:48 +08:00
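One way to see why zeros and -inf are safe placeholders in #5837: if the prefill-context result is an (attention output, log-sum-exp) pair that is later merged with another partial result, then an output of 0 with a log-sum-exp of -inf carries zero weight in the merge. The sketch below uses plain Python floats instead of NPU tensors. Both the pair interpretation and the helper name `merge_partial_attention` are assumptions for illustration, not the code in `_compute_prefill_context`.

```python
import math


def merge_partial_attention(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results, weighting each by exp(lse)."""
    m = max(lse_a, lse_b)
    if m == -math.inf:
        # Both sides are empty: the merged result is empty as well.
        return 0.0, -math.inf
    w_a = math.exp(lse_a - m)  # exp(-inf) == 0.0 for an empty side
    w_b = math.exp(lse_b - m)
    total = w_a + w_b
    merged_out = (w_a * out_a + w_b * out_b) / total
    merged_lse = m + math.log(total)
    return merged_out, merged_lse


# Placeholder for a device that holds no kv cache (instead of returning None).
empty_out, empty_lse = 0.0, -math.inf
# Partial result from a device that does hold context (illustrative values).
local_out, local_lse = 0.75, 1.5

merged_out, merged_lse = merge_partial_attention(
    empty_out, empty_lse, local_out, local_lse
)
print(merged_out, merged_lse)
```

With these placeholders the merge leaves the non-empty side unchanged, whereas passing `None` into the same arithmetic would raise a `TypeError`.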
zhaomingyu13	01805fbd7d	Revert "[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 )"(#5902 ) This reverts commit `d886b81971`. it breaks pd function - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-01-14 20:55:10 +08:00
Ronald	e20813f441	[Feature] implement eagle spec decoding for model runner v2 (#5840 ) ### What this PR does / why we need it? This PR implements eagle spec decoding for model runner v2. Please see the RFC: https://github.com/vllm-project/vllm-ascend/issues/5208 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.13.0 --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-01-14 09:18:05 +08:00
LHXuuu	0415e694cd	[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W8A8 Int8 dynamic weight. 2. Specify W4A16 quantization configuration. Co-authored-by: menogrey 1299267905@qq.com Co-authored-by: kunpengW-code 1289706727@qq.com ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: menogrey <1299267905@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-01-14 09:17:26 +08:00
LI SHENGYONG	ecf2fa482e	[EPLB][Bugfix] Get expert map from layers (#5817 ) ### What this PR does / why we need it? The initialization method of expert_map used by the eplb module is different from that used by the fused_moe module. This PR deletes the expert_map initialization method used by the eplb module to make the initialization methods consistent. #### before bugfix self._expert_map=tensor([64, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,62, 63], device='npu:1', dtype=torch.int32) self.shared_dict["expert_maps"][0]=tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]], dtype=torch.int32) ### How was this patch tested? #### qwen3-235B-w8a8 aime \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-14 09:16:51 +08:00
drslark	48ec97821a	[Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816 ) ### What this PR does / why we need it? Fixed an accuracy problem when using eagle3 with sp. The problem is described in https://github.com/vllm-project/vllm-ascend/issues/5825. It also adds a much more precise way to determine whether the drafter should use `sp` or not. Also, it changes the `eager` of the drafter to be a real `eager` in the frontend to avoid an `fx-graph` problem. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? For simplicity, we test it as in https://github.com/vllm-project/vllm-ascend/issues/5825. We get the same result as `eagle3` with `sp` disabled. ```text -------------------------------------------------- total_num_output_tokens: 1000 num_drafts: 437 num_draft_tokens: 1311 num_accepted_tokens: 564 mean acceptance length: 2.29 -------------------------------------------------- acceptance at token 0: 0.62 acceptance at token 1: 0.40 acceptance at token 2: 0.27 acceptance at token 3: 0.00 acceptance at token 4: 0.00 acceptance at token 5: 0.00 ``` * vLLM version: v0.13.0 * vLLM main: `2f4e6548ef` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-14 09:00:37 +08:00
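The counters in the #5816 test output are mutually consistent, which the short check below makes explicit. It assumes the mean acceptance length is computed as one target-model token per draft step plus the accepted draft tokens per step, `1 + num_accepted_tokens / num_drafts`; that formula is an assumption that matches the reported 2.29, not something stated in the commit.

```python
# Counters copied from the test output in the commit message.
num_drafts = 437
num_draft_tokens = 1311
num_accepted_tokens = 564
per_position_acceptance = [0.62, 0.40, 0.27, 0.00, 0.00, 0.00]

# Draft tokens proposed per step.
tokens_per_draft = num_draft_tokens / num_drafts

# Assumed formula: one token from the target model plus accepted draft tokens.
accepted_per_draft = num_accepted_tokens / num_drafts
mean_acceptance_length = 1 + accepted_per_draft

print(f"tokens per draft: {tokens_per_draft:.0f}")
print(f"mean acceptance length: {mean_acceptance_length:.2f}")
print(f"sum of per-position rates: {sum(per_position_acceptance):.2f}")
```

The per-position acceptance rates also sum to roughly the accepted tokens per draft, and positions 3 to 5 are zero because only three tokens are drafted per step.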
zhangxinyuehfad	f7b904641e	[Main2Main] Upgrade vllm commit to 0109 (#5752 ) ### What this PR does / why we need it? Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df) 1. remove `init_cached_hf_modules` due to https://github.com/vllm-project/vllm/pull/31786 2. fix the spec_decode e2e test broken by https://github.com/vllm-project/vllm/pull/29821 3. fix `vllm.v1.attention.backends.utils` due to https://github.com/vllm-project/vllm/pull/31891 4. fix `self.seq_lens - query_lens` on same device due to https://github.com/vllm-project/vllm/pull/31773 5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-13 19:14:43 +08:00
yupeng	5b95c6b03a	[Test][e2e][LoRA] Add more e2e tests to cover scenarios of LoRA (#4075 ) ### What this PR does / why we need it? This PR depends on PR https://github.com/vllm-project/vllm-ascend/pull/4046 and will only work once the latter is merged. This PR aims to solve the issue https://github.com/vllm-project/vllm-ascend/issues/3240. The newly added Llama-2-7b-hf and Qwen3-0.6B test cases cover the scenarios in which LoRA weights are added to the q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens and lm_head modules. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_llama2_lora.py pytest -sv tests/e2e/singlecard/test_qwen3_multi_loras.py - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2026-01-13 16:32:28 +08:00
Rozwel-dx	8d571286dd	[Refactor] Modify the binding logic to allocate CPU cores for each NPU card (#5555 ) [Refactor] Modify the binding logic to allocate CPU cores for each NPU card ### What this PR does / why we need it? Modify the binding logic to allocate CPU cores for each NPU card based on NUMA affinity, while isolating acl_thread/release_thread and other processes to prevent mutual interference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `c85cc045f8` Signed-off-by: rowzwel_dx <1392851715@qq.com> - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: Rozwel-dx <1392851715@qq.com>	2026-01-13 09:21:28 +08:00
zhaomingyu13	d886b81971	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 ) ### What this PR does / why we need it? According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, debugging showed that the tensor parallel size (tp) of the Eagle model always matched that of the target model, so the tp setting for the draft model did not take effect as expected. Note: This feature has not yet been combined and tested with `sp` and `dp`. It will be adapted later. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, }, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Fixes vllm-project/vllm#31345 Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-13 09:14:30 +08:00
shiyuan680	7af3b880c1	support triton of mrope (#5664 ) ### What this PR does / why we need it? This PR adds support for a Triton mrope implementation, like cuda_forward, whose performance is equal to that of the AscendC ops. These Triton ops should be used with CANN 8.5.0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested on qwen3-vl-235b, textvqa accuracy: native 81.82, npu triton 81.58, cuda triton 81.52 - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shiyuan680 <917935075@qq.com>	2026-01-13 09:13:51 +08:00
Li Wang	75c92a3640	[CI] Move nightly-a2 test to hk (#5807 ) ### What this PR does / why we need it? As an initial test, this patch connects two nodes from the HK region to nightly A2. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-12 22:58:35 +08:00
SILONG ZENG	7a6fde80b1	[CI]Add Kimi k2 nightly test (#5682 ) ### What this PR does / why we need it? The PR add performance and accuracy tests for Kimi-K2-Instruct-W8A8 and Kimi-K2-Thinking models to the Nightly test suite. #### Test Configuration Kimi-K2-Instruct-W8A8 - model: vllm-ascend/Kimi-K2-Instruct-W8A8 - Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node) - Architecture: Unified Distributed Inference - Parallelism: DP4 + TP8 + EP (Data Parallel 4, Tensor Parallel 8, Expert Parallel enabled). - Optimization: torchair graph, no-prefix-caching. - Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8. - Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8. - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs2800. - Accuracy: vllm-ascend/gsm8k-lite. Kimi-K2-Thinking - Model: moonshotai/Kimi-K2-Thinking - Hardware: A3, 1 Node (16 NPUs total) - Architecture: Single Node Distributed Inference - Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled). - Optimization: no-prefix-caching - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs400. - Accuracy: vllm-ascend/gsm8k-lite. ### Does this PR introduce _any_ user-facing change? Yes. This PR enhances the ```AisbenchRunner``` to support dynamic configuration of the ```trust_remote_code``` flag. This allows the AISBench client to successfully load tokenizers for models that require custom code execution (e.g., Kimi-K2-Thinking and Kimi-K2-Instruct-W8A8). Changes: 1. ```AisbenchRunner.__init__ ```Added the ability to capture the ```trust_remote_code``` parameter from the case configuration. ``` python self.batch_size = aisbench_config["batch_size"] self.request_rate = aisbench_config.get("request_rate", 0) + self.trust_remote_code = aisbench_config.get("trust_remote_code", False) self.temperature = aisbench_config.get("temperature") self.top_k = aisbench_config.get("top_k") ``` 2. 
```AisbenchRunner._init_request_conf``` Added regex substitution to inject the parameter into the generated dynamic configuration file.

``` python
  content = re.sub(r'batch_size.', f'batch_size = {self.batch_size},', content)
+ content = re.sub(r'trust_remote_code=.',
+                  f'trust_remote_code={self.trust_remote_code},',
+                  content)
  content = content.replace("top_k", "#top_k")
  content = content.replace("seed", "#seed")
```

Details:
- New Config Key: Users can add ```"trust_remote_code": True``` to any dictionary within the ```aisbench_cases``` list.
- Default Value: Defaults to ```False``` to maintain existing security protocols for standard models.
- Impact: Resolves ```ValueError``` when benchmarking reasoning models or models with custom tokenizers that previously failed during the AISBench local initialization phase.

User Example: Users can now enable custom code execution for specific models (like Kimi-K2-Thinking) directly in their test suite:

```
# Now supported in test scripts:
aisbench_cases = [{
    "case_type": "performance",
    "request_conf": "vllm_api_stream_chat",
    "trust_remote_code": True,  # New user-facing parameter
    ...
}]
```

### How was this patch tested?

Actions:
- https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433

Results:

- Kimi-K2-Instruct-W8A8 (25m25s)

1. Accuracy test

```
dataset    version    metric    mode    vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen     96.88
```

2. Perf test

| Performance Parameters | Stage | Average | Min | Max | Median | P75 | P90 | P99 | N |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E2EL | total | 34571.489 ms | 28657.8054 ms | 36294.1788 ms | 34714.7329 ms | 35247.2724 ms | 35526.6758 ms | 36146.4314 ms | 512 |
| TTFT | total | 2043.9136 ms | 627.4718 ms | 3532.3978 ms | 1906.0194 ms | 2307.7979 ms | 2883.8528 ms | 3283.7012 ms | 512 |
| TPOT | total | 127.5591 ms | 106.4937 ms | 137.107 ms | 128.3135 ms | 129.5704 ms | 131.1332 ms | 134.1087 ms | 512 |
| ITL | total | 126.5571 ms | 0.0095 ms | 1340.783 ms | 104.1398 ms | 110.1272 ms | 119.6124 ms | 950.2924 ms | 512 |
| InputTokens | total | 3516.6055 | 3014.0 | 3985.0 | 3525.0 | 3525.0 | 3586.8 | 3800.67 | 512 |
| OutputTokens | total | 256.0 | 256.0 | 256.0 | 256.0 | 256.0 | 256.0 | 256.0 | 512 |
| OutputTokenThroughput | total | 7.4143 token/s | 7.0535 token/s | 8.933 token/s | 7.3744 token/s | 7.4118 token/s | 7.5608 token/s | 8.7051 token/s | 512 |

| Common Metric | Stage | Value |
| --- | --- | --- |
| Benchmark Duration | total | 279430.9375 ms |
| Total Requests | total | 512 |
| Failed Requests | total | 0 |
| Success Requests | total | 512 |
| Concurrency | total | 63.3452 |
| Max Concurrency | total | 64 |
| Request Throughput | total | 1.8323 req/s |
| Total Input Tokens | total | 1800502 |
| Prefill Token Throughput | total | 1720.5255 token/s |
| Total generated tokens | total | 131072 |
| Input Token Throughput | total | 6443.4598 token/s |
| Output Token Throughput | total | 469.0676 token/s |
| Total Token Throughput | total | 6912.5274 token/s |

- Kimi-K2-Thinking (43m51s)

1. Accuracy test

```
dataset    version    metric    mode    vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen     100.00
```

2. Perf test

| Performance Parameters | Stage | Average | Min | Max | Median | P75 | P90 | P99 | N |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E2EL | total | 172384.3573 ms | 34456.5517 ms | 205922.9407 ms | 174844.2216 ms | 202656.092 ms | 204428.9502 ms | 205468.6776 ms | 400 |
| TTFT | total | 138740.3228 ms | 655.1066 ms | 171777.3003 ms | 141088.0561 ms | 169237.5599 ms | 170716.4954 ms | 171393.1278 ms | 400 |
| TPOT | total | 131.9374 ms | 90.6331 ms | 135.4144 ms | 132.405 ms | 132.948 ms | 133.7549 ms | 135.2543 ms | 400 |
| ITL | total | 130.9028 ms | 0.0099 ms | 960.3683 ms | 116.9623 ms | 122.3127 ms | 132.0522 ms | 886.4662 ms | 400 |
| InputTokens | total | 3514.575 | 3014.0 | 3843.0 | 3525.0 | 3525.0 | 3588.0 | 3801.08 | 400 |
| OutputTokens | total | 256.0 | 256.0 | 256.0 | 256.0 | 256.0 | 256.0 | 256.0 | 400 |
| OutputTokenThroughput | total | 1.6799 token/s | 1.2432 token/s | 7.4296 token/s | 1.4642 token/s | 1.4737 token/s | 1.8754 token/s | 7.125 token/s | 400 |

| Common Metric | Stage | Value |
| --- | --- | --- |
| Benchmark Duration | total | 1166795.568 ms |
| Total Requests | total | 400 |
| Failed Requests | total | 0 |
| Success Requests | total | 400 |
| Concurrency | total | 59.0967 |
| Max Concurrency | total | 64 |
| Request Throughput | total | 0.3428 req/s |
| Total Input Tokens | total | 1405830 |
| Prefill Token Throughput | total | 25.332 token/s |
| Total generated tokens | total | 102400 |
| Input Token Throughput | total | 1204.864 token/s |
| Output Token Throughput | total | 87.7617 token/s |
| Total Token Throughput | total | 1292.6258 token/s |

- vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-12 15:56:07 +08:00
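The regex-based injection described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper name `inject_request_conf` and the template text are hypothetical, and the `.*` patterns are an assumption about what the substitution is meant to match (the rest of the line after the key).

```python
import re


def inject_request_conf(content: str, batch_size: int,
                        trust_remote_code: bool = False) -> str:
    """Hypothetical helper: overwrite two settings in a config template."""
    # Replace everything from the key to the end of its line.
    content = re.sub(r"batch_size.*", f"batch_size = {batch_size},", content)
    content = re.sub(r"trust_remote_code=.*",
                     f"trust_remote_code={trust_remote_code},", content)
    return content


# Made-up template standing in for the generated configuration file.
template = "\n".join([
    "models = [dict(",
    "    batch_size=1,",
    "    trust_remote_code=False,",
    ")]",
])

result = inject_request_conf(template, batch_size=64, trust_remote_code=True)
print(result)
```

Because `.` does not match a newline by default, each substitution stays within the line that holds the key, so the surrounding template is left untouched.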
Nengjun Ma	297f6deb09	[CI] Align multi-node nightly test parameter with corresponding tutorials document (#5756 ) ### What this PR does / why we need it? Align the multi-node nightly test parameters with the tutorial documents. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Tested locally and with the nightly e2e multi-node test cases. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-12 09:00:31 +08:00
gh924	6880c1b383	[Feature] Support for cross-attention and whisper model (#5592 ) ### What this PR does / why we need it? Addresses the issue https://github.com/vllm-project/vllm-ascend/issues/2262: - support for cross-attention when the model is encoder-decoder - support for the Whisper model - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: gh924 <guihao2@huawei.com> Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>	2026-01-11 11:38:45 +08:00
zxr2333	78b554dda9	[P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722 ) ### What this PR does / why we need it? Adds new functionality to the Mooncake layerwise connector: 1. Support for sparse attention, for DeepSeek-V3.2. 2. Distribution of transfer tasks to redundant kv_head cards. This PR is related to [[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support](https://github.com/vllm-project/vllm-ascend/issues/4842) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-10 23:04:16 +08:00
Levi	ecd4232698	[Feat] flashcomm2+oshard Generalized (#4723 ) ### What this PR does / why we need it? [FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf) introduces redundant storage of the o_proj matrix, which puts pressure on GPU memory. We propose the FlashComm2+Oshard approach, which integrates the shared linear layer feature (#2931). This approach distributes weights layer by layer to each GPU and accesses the o_proj of each layer via asynchronous broadcast operations, alleviating memory pressure while achieving nearly lossless performance compared to the original FlashComm2. This PR implements a generalized FlashComm2+Oshard solution. Use the following environment variable and additional config to enable FlashComm2 with oshard:
```shell
export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1
--additional-config '{ "layer_sharding": ["o_proj"] }'
```
### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2026-01-10 22:57:57 +08:00
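For launching from a script, the two settings named in the commit above could be assembled as in the sketch below. Only the environment variable name and the `layer_sharding` key come from the commit message; the variable names are illustrative, and the model path and remaining server arguments are deliberately left out.

```python
import json
import os

# Copy of the current environment with the FlashComm2 setting from the commit.
env = dict(os.environ, VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE="1")

# Serialize the additional config so quoting is handled by json, not by hand.
additional_config = json.dumps({"layer_sharding": ["o_proj"]})

# Illustrative fragment only: the rest of the command line is not shown here.
cli_fragment = ["--additional-config", additional_config]
print(cli_fragment)
```

Building the JSON with `json.dumps` avoids the shell-quoting mistakes that are easy to make when the config is typed inline.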
wangxiaoteng888	aa987ffe87	[P/D][bugfix]Fix the PCP port mapping error issue (#5706 ) ### What this PR does / why we need it? Fix the PCP port mapping error. In a multi-node PD separation scenario with the PCP feature enabled, the ZMQ transmission port is wrong: the IP and port received by the D side do not match. The cause is an error in the port mapping update logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By CI - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-10 22:43:52 +08:00
fems14	ff4c1a47b3	[bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751 ) ### What this PR does / why we need it? 1. Fixed memory retention on certain GPUs caused by missing PUT operations. 2. Fixed performance degradation resulting from architectural incompatibilities in the underlying refactor. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-01-09 17:46:23 +08:00
SILONG ZENG	09b3f9d91b	[CI]Add Disaggregated PD Nightly Test for Qwen3-235B and Qwen3-VL-235B (#5502 ) ### What this PR does / why we need it? This PR adds online Disaggregated Prefill/Decode performance and accuracy tests for the Qwen3-235B-A22B and Qwen3-VL-235B-A22B-Instruct models to the Nightly test suite. These test configurations simulate the deployment of massive MoE and Vision-Language models in a dual-node (32 NPU) environment, utilizing Mooncake (KVCache Transfer) technology to achieve efficient KV cache transfer between the Prefill node and the Decode node. #### Test Configuration Qwen3-235B-A22B - Model: Qwen/Qwen3-235B-A22B - Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node) - Architecture: Disaggregated Prefill & Decode - Node 0 (Producer/Prefill): DP2 + TP8 + EP + FLASHCOMM1 + FUSED_MC2. - Node 1 (Consumer/Decode): DP4 + TP4 + EP + FLASHCOMM1 + FUSED_MC2 + FULL_DECODE_ONLY. - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs2800. - Accuracy: vllm-ascend/gsm8k-lite. Qwen3-VL-235B-A22B-Instruct - Model: Qwen/Qwen3-VL-235B-A22B-Instruct - Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node) - Architecture: Disaggregated Prefill & Decode - Node 0 (Producer/Prefill): DP2 + TP8 + EP. - Node 1 (Consumer/Decode): DP4 + TP4 + EP + FULL_DECODE_ONLY. - Benchmarks: - Performance: vllm-ascend/textvqa-perf-1080p. - Accuracy: vllm-ascend/textvqa-lite. ### How was this patch tested? Nightly test action on CI - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-09 16:25:20 +08:00
1092626063	f63c1341d9	[Feature] GLM4.6 support mtp with fullgraph (#5460 ) ### What this PR does / why we need it? Support MTP with full graph for GLM4.6 to improve performance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ` export HCCL_BUFFSIZE=1024 export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_OP_EXPANSION_MODE=AIV vllm serve /weight/glm4.6_w8a8_with_float_mtp \ --data-parallel-size 1 \ --tensor-parallel-size 16 \ --seed 1024 \ --served-model-name glm \ --max-model-len 35000 \ --max-num-batched-tokens 16384 \ --max-num-seqs 16 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ --speculative-config '{"num_speculative_tokens": 1, "model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \ --compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --async-scheduling \ ` Test case: ` vllm bench serve \ --backend vllm \ --dataset-name prefix_repetition \ --prefix-repetition-prefix-len 22400 \ --prefix-repetition-suffix-len 9600 \ --prefix-repetition-output-len 1024 \ --num-prompts 1 \ --prefix-repetition-num-prefixes 1 \ --ignore-eos \ --model glm \ --tokenizer /weight/glm4.6_w8a8_with_float_mtp \ --seed 1000 \ --host 0.0.0.0 \ --port 8000 \ --endpoint /v1/completions \ --max-concurrency 1 \ --request-rate 1 ` - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: 1092626063 <1092626063@qq.com>	2026-01-09 16:07:42 +08:00
zzhxxx	64d29875f9	[Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698 ) ### What this PR does / why we need it? Based on the Sharded-CP feature PR: https://github.com/vllm-project/vllm-ascend/pull/4702; RFC: https://github.com/vllm-project/vllm/issues/30055 This PR officially integrates DeepSeek V3.2's DSA-CP support on top of https://github.com/vllm-project/vllm-ascend/pull/4702, improving inference efficiency and scalability under mixed prefill-decode workloads. The main improvements include: - Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for TP=1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-09 15:58:40 +08:00
ZT-AIA	e11ff8e535	[BugFix]Fix the error when using Ascend custom operators with rank=128 (#5394 ) ### What this PR does / why we need it? The custom Ascend operators sgmv_expand and sgmv_shrink apply only when rank is 8, 16, 32, or 64. When rank >= 128, the operator is out of range, causing the model to report an error. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Depends on this commit https://github.com/vllm-project/vllm/pull/31408 - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2026-01-09 15:57:43 +08:00
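The rank restriction described in the commit above suggests a guard of roughly the following shape. This is only a sketch of one way to express it: the set of supported ranks comes from the commit message, while the function and constant names are hypothetical and the PR's actual fix is not shown in this log.

```python
# Ranks the commit message lists as supported by sgmv_expand / sgmv_shrink.
SUPPORTED_SGMV_RANKS = frozenset({8, 16, 32, 64})


def can_use_custom_sgmv(rank: int) -> bool:
    """Hypothetical check: True when the custom operator covers this rank."""
    return rank in SUPPORTED_SGMV_RANKS


def pick_sgmv_path(rank: int) -> str:
    """Hypothetical dispatch: name the code path a caller would take."""
    return "custom_op" if can_use_custom_sgmv(rank) else "fallback"


for r in (8, 64, 128):
    print(r, pick_sgmv_path(r))
```

Checking membership in an explicit set, rather than comparing against an upper bound, also rejects unsupported in-between values such as 24.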
lhchg	dc99cfdc15	[CustomOp] support TensorList for dispatchFFNCombine (#5665 ) ### What this PR does / why we need it? Support TensorList for dispatch_ffn_combine so that it can work with EPLB. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Single-operator testing - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lhchg <lhao_cheng@163.com> Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>	2026-01-09 15:56:29 +08:00
InSec	2d713fee93	[CI] Accuracy issue of qwen3-next-w8a8 nightly test fix. (#5746 ) ### What this PR does / why we need it? Disable Full Graph mode to temporarily avoid an accuracy issue with Qwen3-Next-80B-A3B-Instruct-W8A8. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: InSec <1790766300@qq.com>	2026-01-09 15:55:13 +08:00
LeeWenquan	a3a74d6984	[CI] Add qwen3 next ci (#5395 ) ### What this PR does / why we need it? Add Qwen3Next CI ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-01-09 10:29:09 +08:00
Chenxi Qian	40eb3e1836	[OP] Enable custom op aclnnMoeInitRoutingCustom (#5332 ) ### What this PR does / why we need it? This PR enables custom op `aclnnMoeInitRoutingCustom` introduced in PR #5251 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: zzzzwwjj <1183291235@qq.com>	2026-01-09 09:35:18 +08:00


1120 Commits