### What this PR does / why we need it?
This patch mainly fixes the problem of not being able to determine the
exit status of the pod's entrypoint script, plus some other small
optimizations:
1. Shorten the wait-for-server timeout
2. Fix a typo
3. Fix the issue of ais_bench failing to correctly access the proxy URL
in a PD separation scenario.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Refactored the layerwise code to send to the D node first, preventing
P-node hangs due to communication timeouts when DP > 1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By CI
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix the precision issue caused by the inconsistency between the group
list type used by mc2 and the one used by eplb.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
This PR adds MLAPO for the deepseek aclgraph; we need to test it nightly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
This is the follow-up to PR #3189, which continues to refactor sfa
into mla and finally removes deepseek_v3_2.py. This is the last PR of
the deepseek modeling refactoring. After this, all deepseek-related model
code is removed from vllm_ascend.
Furthermore, after this PR deepseek v3.2 can run chunked prefill with
correct accuracy.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
1. Refactor the file `mtp_proposer.py`, splitting torchair-related code
into `mtp_torchair_proposer.py`.
2. Following https://github.com/vllm-project/vllm/pull/24539,
implement padded speculative decoding as described in
https://github.com/vllm-project/vllm/issues/21984.
### Does this PR introduce _any_ user-facing change?
Users can set `disable_padded_drafter_batch` to enable/disable padded
speculation; the default is `False`.
offline example:
```
speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False}
```
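For reference, a minimal offline sketch of how this config plugs into the `LLM` entry point; the model path and parallel settings below are illustrative, not taken from this PR:
```python
from vllm import LLM, SamplingParams

# Illustrative model and parallelism; adjust to your deployment.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=16,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
        # False (the default) keeps padded speculation enabled.
        "disable_padded_drafter_batch": False,
    },
)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```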
### How was this patch tested?
- [x] eager with pad/unpad
- [x] aclgraph with pad/unpad
- [x] torchair with pad/unpad
Performance test of deepseek-r1 with tp16, dp1:
aclgraph with pad ITL: 168 ms
aclgraph with unpad ITL: 169 ms
original: 178 ms
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
We noticed that users sometimes build vllm-ascend with an incorrect torch
version. In that case the build passes, but running the code raises
`AttributeError: '_OpNamespace' '_C_ascend' object has no
attribute 'weak_ref_tensor'`. Force the torch version to 2.7.1 and check
it when building from source to fix the issue.
closes: #3342
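A minimal sketch of the kind of build-time guard this adds; the constant and function names are illustrative, not the actual vllm-ascend setup code:
```python
EXPECTED_TORCH = "2.7.1"

def check_torch_version() -> None:
    import torch
    installed = torch.__version__.split("+")[0]  # drop local suffixes like "+cpu"
    if installed != EXPECTED_TORCH:
        raise RuntimeError(
            f"vllm-ascend must be built with torch=={EXPECTED_TORCH}, "
            f"but found torch=={installed}; a mismatched build only fails "
            "later at runtime with the 'weak_ref_tensor' AttributeError."
        )
```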
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
[UT] Fix the unit test for test_utils that
https://github.com/vllm-project/vllm-ascend/pull/3612 skipped.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main:
17c540a993
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Added instructions for resolving the 'invalid tar header' error during
`docker pull` on Kylin OS (ARM64) with Atlas300I hardware, including
steps for offline loading of Docker images.
---
### What this PR does / why we need it?
The primary motivation for this PR is to address a critical `docker
pull` failure that occurs on specific, yet important, enterprise
environments. Specifically, when operating on **Kylin OS (麒麟操作系统) with
an ARM64 architecture on Atlas300I hardware**, users frequently
encounter an `archive/tar: invalid tar header` error, which completely
blocks the setup process. This issue has been consistently reproduced,
with multiple retries failing with the same error, confirming that it is
a persistent environmental problem rather than a transient network
issue.
<img width="2060" height="525" alt="image"
src="https://github.com/user-attachments/assets/6c1c5728-de27-476f-8df4-723564fc290b"
/>
This guide provides a robust, step-by-step workaround using an
offline-loading method (`docker save` on a host machine and `docker
load` on the target machine). This solution is crucial for enabling
users on this platform to use vLLM.
This contribution does not directly fix an existing issue number, but it
proactively solves a significant environmental and usability problem for
a growing user base.
### Does this PR introduce _any_ user-facing change?
No. It does not alter any code, APIs, interfaces, or existing behavior of
the vLLM project.
### How was this patch tested?
The instructions and troubleshooting steps in this guide were validated
through a real-world, end-to-end test on my own hardware and OS.
The testing process was as follows:
1. **Problem Reproduction**: An attempt was made to directly `docker
pull` the `vllm-ascend:v0.10.0rc1-310p` image on a target machine
running Kylin OS (ARM64). The `invalid tar header` failure was
successfully and consistently reproduced, confirming the existence of
the problem.
2. **Solution Implementation**: The workaround detailed in the guide was
executed:
* On a separate host machine (Ubuntu x86_64), the image was successfully
pulled using the `--platform linux/arm64` flag.
* The image was then saved to a `.tar` archive using `docker save`.
* The `.tar` archive was transferred to the target Kylin OS machine.
* The image was successfully loaded from the archive using `docker load
-i ...`.
3. **End-to-End Validation**: After loading the image, the vLLM
container was launched on the target machine following the instructions
in the guide. Both online inference (via `curl` to the API server) and
offline inference (via the Python script) were executed successfully,
confirming that the entire workflow described in the document is
accurate and effective.
Since this is a documentation-only change based on a validated workflow,
no new unit or integration tests were added to the codebase.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: Liwx <liweixuan1014@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
Fix OOM in the deepseek-eplb nightly test.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
This PR fixes an mlapo accuracy problem related to weight processing.
Furthermore, it adds back the mlapo-related e2e test with a quantized
deepseek model.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Bugfix for MTP fullgraph.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
When using the multi connector, it does not define `get_finished_count`,
which causes the KV cache to be released prematurely.
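A hypothetical sketch of the idea, assuming a multi connector that fans out to several sub-connectors; the class shape and the sum-based aggregation below are illustrative, not the actual vLLM interface:
```python
class MultiConnector:
    """Illustrative wrapper around several KV-transfer connectors."""

    def __init__(self, connectors):
        self._connectors = connectors

    def get_finished_count(self):
        # Without an override like this, the wrapper inherits the base
        # default, and the KV cache can be freed before every
        # sub-connector has finished its transfers.
        return sum(c.get_finished_count() for c in self._connectors)
```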
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: baxingpiaochong <771405853@qq.com>
### What this PR does / why we need it?
Fix a typo in the mooncake layerwise connector: `connector_metadata`
has only `requests`, not `request`. This PR fixes that typo.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix eplb nightly tests.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
Adapt to the torch_npu version to avoid the precision problem of
torchair deepseek. The torch_npu version may select different
branches in the ops register: the rms_norm op has two branches
depending on the version check. This PR unifies rms_norm in torchair by
patching quant_rms_norm to rms_norm, fixing the accuracy issue in the torchair scenario.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
This patch optimizes the nightly CI:
1. Fix the ais_bench error where repo_type is None
2. Fix the kubectl install error on A2 with the ARM arch
3. Fix the multi_node CI being unable to determine whether the job
succeeded
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
The vLLM code has been updated; pin the vLLM commit id first to recover CI.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
After refactoring vllm_ascend/models and FusedMoE, we are unable to pass
`gate` from deepseekv2.py to `AscendFusedMoE.forward`, which results
in an error when running deepseek v3/r1 with allgather.
Hence, this PR removes `gate`-related computations from the FusedMoE module
in eager/aclgraph mode.
### Does this PR introduce _any_ user-facing change?
`rm_router_logits` is deprecated in eager/aclgraph.
### How was this patch tested?
e2e & ut
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
Part of https://github.com/vllm-project/vllm-ascend/pull/3106
Fix a hybrid kvcache sharing bug within the same attention type.
Change the `shared_by` logic so that layers with the same attention spec
share the same buffer instead of allocating more HBM.
After this PR, kvcache memory is reduced by 50% on qwen3-next compared with
before (`self_attn:linear_attn=1:3` in an `attn_group`), and
`gpu_memory_utilization` can be increased to `0.8` on Qwen3-Next when
running on A2 (64 GB/card) with tp4.
<img width="2833" height="1540" alt="image"
src="https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe"
/>
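A hypothetical sketch of the sharing idea (the data layout is illustrative, not the actual `shared_by` code): layers whose attention specs match reuse one buffer instead of each allocating its own.
```python
def assign_kv_buffers(layer_specs: dict[str, str]) -> dict[str, int]:
    """Map each layer to a buffer id, sharing one buffer per attention spec."""
    buffer_of_spec: dict[str, int] = {}
    assignment: dict[str, int] = {}
    for layer, spec in layer_specs.items():
        if spec not in buffer_of_spec:
            buffer_of_spec[spec] = len(buffer_of_spec)  # allocate a new buffer
        assignment[layer] = buffer_of_spec[spec]
    return assignment

# With self_attn:linear_attn = 1:3 in an attn_group, four layers now share
# two buffers instead of allocating four.
print(assign_kv_buffers({"layer0": "linear_attn", "layer1": "linear_attn",
                         "layer2": "linear_attn", "layer3": "self_attn"}))
```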
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Test pass with the latest e2e test case on qwen3-next
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Force `with_prefill` to true after allreduce in the KV producer.
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
We have optimized long-sequence performance. First, we modified the
input data format for attention computation: instead of the
original BSND format, we removed the logic for converting between TND and
BSND and adopted the TND format directly.
The TND input can be reused as-is, which shortens the data
path; converting to BSND was an unnecessary processing step. Second,
we switched the output update from a chain of small operators to the
npu_attention_update fusion operator to improve performance.
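For intuition, a toy illustration of the two layouts (shapes only, not the attention kernel itself): BSND is (batch, seq, num_heads, head_dim), while TND packs all tokens along a single axis.
```python
import torch

B, S, N, D = 2, 8, 4, 16
bsnd = torch.randn(B, S, N, D)    # (batch, seq, num_heads, head_dim)
tnd = bsnd.reshape(B * S, N, D)   # (total_tokens, num_heads, head_dim)
# This PR keeps inputs in TND end to end, removing round trips like the
# reshape above from the long-sequence data path.
```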
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: pichangping <1337510399@qq.com>
### What this PR does / why we need it?
This PR adds 2 more A2 cases that we need to test daily. It also
enhances the logging for aisbench test failures to improve issue
identification.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
The current MatmulReduceScatter operator suffers performance
degradation for small shapes, so this PR decides whether to use the
operator based on the shape size.
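A hypothetical sketch of the dispatch rule; the threshold and names are illustrative, not the values chosen in this PR:
```python
def use_matmul_reduce_scatter(num_tokens: int, threshold: int = 512) -> bool:
    # Small shapes take the unfused matmul + reduce-scatter path, where the
    # fused operator's launch overhead outweighs its benefit.
    return num_tokens >= threshold
```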
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
---------
Signed-off-by: ZYang6263 <zy626375@gmail.com>
### What this PR does / why we need it?
This patch adds a multi-node test case for A2.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds the 2P1D multi-node func/acc/perf test cases; we need to
test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Much of the FAQ content is out of date; this PR refreshes it.
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix a proxy decode bug when parsing non-UTF-8 characters.
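A hedged sketch of the usual fix pattern for this class of bug (the proxy's actual code may differ): decode streamed bytes tolerantly instead of assuming every chunk is complete, valid UTF-8.
```python
chunk = "窗前".encode("utf-8")[:-1]              # bytes cut mid-character
text = chunk.decode("utf-8", errors="replace")   # never raises UnicodeDecodeError
print(text)                                      # "窗" plus a replacement character
```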
- vLLM version: v0.11.0
- vLLM main:
c9461e05a4
---------
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
### What this PR does / why we need it?
It's a tiny bugfix in the `gen_ranktable.py` script, a utility that
helps set up an example case by preparing a ranktable before
disaggregated prefill deployment.
Elements of the `local_device_ids` list should be cast to `int`
before being used in a MOD (modulo) operation.
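A minimal sketch of the fix under illustrative values: the ranktable stores device ids as strings, so they must be cast before the modulo arithmetic.
```python
local_device_ids = ["32", "33", "34", "35"]   # as parsed from the ranktable
devices_per_node = 8                          # illustrative

# Without int(), "32" % 8 is string formatting and raises TypeError.
local_indices = [int(dev_id) % devices_per_node for dev_id in local_device_ids]
print(local_indices)  # [0, 1, 2, 3]
```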
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
- vLLM version: v0.11.0
- vLLM main:
c9461e05a4
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
This PR adds 2 jobs to the A3 nightly test, containing 4 test cases; we
need to test them nightly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
DCP/PCP now support full aclgraph, including MLA attention_v1.
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
The cache for MLA decode graph parameters was holding strong references
to tensors, preventing them from being garbage collected and leading to
increased memory usage.
This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.
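A minimal sketch of the pattern, not the actual vllm-ascend cache code: store weak references so cached tensors can be collected once the runner drops its own strong reference.
```python
import weakref

import torch

_graph_param_cache: dict[str, weakref.ref] = {}

def cache_param(key: str, tensor: torch.Tensor) -> None:
    # A weak reference does not keep the tensor alive by itself.
    _graph_param_cache[key] = weakref.ref(tensor)

def get_param(key: str):
    ref = _graph_param_cache.get(key)
    return ref() if ref is not None else None  # None once deallocated
```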
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
vllm requires opencv-python-headless >= 4.11.0, which requires
numpy >= 2 and < 2.3.0, but vllm-ascend's numpy version must be less than
2.0.0, so limiting opencv-python-headless below 4.11.0.86 fixes this
conflict.
### How was this patch tested?
tested by CI
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Fix Qwen3NextGatedDeltaNet, broken by
https://github.com/vllm-project/vllm/pull/26437.
### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "窗前明月光,",
        "The president of the United States is Mr.",
        "The capital of France is",
        "The future of AI is",
        "感时花溅泪,",
        "家书抵万金啥意思?",
        "plz tell me a story: ",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(
        max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
This PR adds a QwQ case to the nightly test for qwen-qwq on A3; we need
to test it daily.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: ckhw <cuikai1@huawei.com>
### What this PR does / why we need it?
Remove the DBO code.
vLLM now supports DBO via
https://github.com/vllm-project/vllm/pull/23693.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: zzzzwwjj <1183291235@qq.com>