82 Commits

Author SHA1 Message Date
weiguihua2
59a7526339 [CI][Misc] modify ds3.2+dcp ci (#7841)
### What this PR does / why we need it?

Because the current DCP solution all-gathers the KV cache, performance
deteriorates significantly and the CI may get stuck. This PR temporarily
removes the performance and accuracy benchmarks for
DeepSeek-V3.2-W8A8-cp to prevent CI hangs until the optimization is
complete.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified that the configuration file remains valid and that the CI no
longer attempts to run the problematic benchmarks.

pick-from: https://github.com/vllm-project/vllm-ascend/pull/7842

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-04-01 08:58:21 +08:00
weiguihua2
bc8e87f3db [v0.18.0][Bugfix] fix ds3.2 dcp mtp (#7681)
### What this PR does / why we need it?
Fixed the issue where DCP overlaps with MTP in the ds3.2 scenario.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-03-27 14:24:53 +08:00
zhangxinyuehfad
886756aea0 [Bugfix][CI] Fix aisbench installation to avoid Gitee authentication (#7536)
### What this PR does / why we need it?
- Pass GITEE_USERNAME (var) and GITEE_TOKEN (secret) as Docker build
  args in nightly image build so Dockerfile can authenticate to Gitee
- In Dockerfile.nightly.a2/a3, embed credentials into the clone URL to
  avoid auth failures during `git clone` (see the sketch after this
  list)
- In single-node and multi-node PR test workflows, backup the
  pre-installed benchmark from the nightly image before wiping
  vllm-ascend, then restore it instead of re-cloning from Gitee,
  which is inaccessible from fork PR contexts
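
For illustration, here is a minimal sketch of the credential-embedding
idea from the second item above. The real change lives in the Dockerfile
(using build args and shell), not Python; the org/repo path below is a
placeholder:

```python
import os

# Build args passed to `docker build` (GITEE_USERNAME / GITEE_TOKEN).
user = os.environ["GITEE_USERNAME"]
token = os.environ["GITEE_TOKEN"]

# Embedding the credentials in the URL lets `git clone` authenticate
# non-interactively inside the image build.
clone_url = f"https://{user}:{token}@gitee.com/<org>/<repo>.git"
print(clone_url)
```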

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-23 20:16:51 +08:00
pz1116
3effc4bc70 [Doc][KV Pool]Revision KV Pool User Guide (#7434)
### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise Mooncake environment variables and kvconnector extra configs.
2. Delete `use_ascend_direct` from the kv connector extra config, as it
is deprecated.
3. Delete `kv_buffer_device` and `kv_rank` from the P2P mooncake config.
4. Unify the default `max-model-len` and `max-num-batched-tokens` in the
given examples.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
2026-03-19 10:13:13 +08:00
zhangxinyuehfad
67d40f23fd [CI] Upgrade nightly multi-node-tests max-parallel to 2 (#7035)
### What this PR does / why we need it?

1. Increase nightly multi-node test max-parallel from 1 to 2, and fix
resource conflicts that arise when tests run concurrently.
2. Fix parse-trigger job: Add an if condition so it only runs on
schedule, workflow_dispatch, or PRs labeled nightly-test
3. Adjust nightly schedule: Shift trigger time from 24:00 to 23:45
(UTC+8)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-10 16:25:51 +08:00
zhangxinyuehfad
1e4017e3fa [CI] support nightly ci for per pr by labels (#6483)
### What this PR does / why we need it?

This PR refactors the nightly CI workflows (A2 and A3) to support
running tests against a specific PR's code, in addition to the existing
scheduled/dispatch runs using pre-built images.

#### Motivation:
Previously, nightly tests could only be triggered by schedule or
workflow_dispatch, always using the pre-built nightly image. This change
allows developers to trigger nightly tests against their own PR's source
code, enabling early validation without waiting for a nightly build.

#### Changes
Trigger logic (parse-trigger job)

A new parse-trigger job is introduced in both
schedule_nightly_test_a2.yaml and schedule_nightly_test_a3.yaml to
centralize trigger evaluation:

- `schedule` / `workflow_dispatch`: runs all tests with the pre-built
image (existing behavior preserved)
- `pull_request` (labeled + synchronize): runs only when the PR has the
nightly-test label and a `/nightly [test-names]` comment exists (the
latest one wins):

1. `/nightly` or `/nightly all` — runs all tests
2. `/nightly test1 test2` — runs only the named tests (comma-wrapped for
exact matching); a minimal parsing sketch follows below
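
For illustration, a minimal sketch of how such a comment could be parsed
(a hypothetical helper, not the actual workflow code):

```python
def parse_nightly_comment(comment: str) -> list[str] | None:
    """Return [] for "run all tests", a list of test names, or None when
    the comment is not a /nightly trigger."""
    parts = comment.strip().split()
    if not parts or parts[0] != "/nightly":
        return None
    if len(parts) == 1 or parts[1] == "all":
        return []      # `/nightly` or `/nightly all` -> all tests
    return parts[1:]   # `/nightly test1 test2` -> only the named tests

assert parse_nightly_comment("/nightly") == []
assert parse_nightly_comment("/nightly all") == []
assert parse_nightly_comment("/nightly test1 test2") == ["test1", "test2"]
```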

#### How to trigger
1. Add the nightly-test label to your PR
2. Comment /nightly (all tests) or /nightly test1 test2 (specific tests)
3. Re-triggering: add another /nightly comment and push a new commit
(synchronize event)

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-05 16:46:37 +08:00
zhangxinyuehfad
566c367a10 [CI] Add DeepSeek-V3.2 large EP nightly ci (#6378)
### What this PR does / why we need it?

Add DeepSeek-V3.2 nightly ci

Fix PD routing to exclude headless nodes when collecting
prefiller/decoder IPs

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-04 16:15:56 +08:00
Xiaoshuang Wang
f7a8befc20 [CI] Upgrade CANN to 8.5.1 (#6897)
### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-03-03 09:02:42 +08:00
wjunLu
b60b991005 [CI] Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector (#5441)
### What this PR does / why we need it?

Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-02-27 16:31:02 +08:00
starmountain1997
bc1622338c [CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)
### What this PR does / why we need it?

This version has no divisibility constraint between tp and mtp+1.
However, each cudagraph_capture_sizes entry must be a common multiple of
tp and mtp+1, with a maximum of tp * (mtp+1). Therefore, we pinned
cudagraph_capture_sizes accordingly.
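
A minimal sketch of that capture-size constraint (the helper below is
illustrative, not the actual test code):

```python
from math import lcm

def candidate_capture_sizes(tp: int, mtp: int) -> list[int]:
    # Each capture size must be a common multiple of tp and mtp+1,
    # capped at tp * (mtp + 1).
    step = lcm(tp, mtp + 1)
    limit = tp * (mtp + 1)
    return list(range(step, limit + 1, step))

print(candidate_capture_sizes(8, 1))  # tp=8, mtp=1 -> [8, 16]
```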

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

We also added an aime2025 test to the dual-node DeepSeek-V3.2 nightly test.

### How was this patch tested?

Tested in the nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-26 10:58:50 +08:00
Li Wang
ac9a7d1301 [Nightly] Increase VLLM_ENGINE_READY_TIMEOUT_S to avoid nightly failure (#6778)
### What this PR does / why we need it?
After some observation, I found some cases failing due to timeouts, such as
https://github.com/vllm-project/vllm-ascend/actions/runs/22280996034/job/64487867977#step:9:921
and
https://github.com/vllm-project/vllm-ascend/actions/runs/22315540111/job/64574590762#step:9:1809.
This may be caused by the excessively long model loading time (currently we
are still loading weights from network storage), so it is necessary to
adjust the timeout from 600s to 1800s.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-02-25 10:14:51 +08:00
SILONG ZENG
e2237819a9 [CI]Fixed the spell check function in typos.toml (#6753)
### What this PR does / why we need it?
The incorrect regular expression `.*[UE4M3|ue4m3].*` uses a character
class, so it actually ignores all words containing any of the characters
`U`, `E`, `4`, `M`, `3`, `|`, `u`, `e`, `m`.

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
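
For illustration, a short snippet (not from the PR) showing the
difference between the character class and the alternation group:

```python
import re

# `[UE4M3|ue4m3]` is a character class: it matches any ONE character
# from the set {U, E, 4, M, 3, |, u, e, m}.
char_class = re.compile(r".*[UE4M3|ue4m3].*")
# `(UE4M3|ue4m3)` is an alternation: it matches the literal substring.
alternation = re.compile(r".*(UE4M3|ue4m3).*")

print(bool(char_class.match("format")))        # True: contains 'm'
print(bool(alternation.match("format")))       # False: no such substring
print(bool(alternation.match("scale_ue4m3")))  # True: intended match
```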

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-14 11:57:26 +08:00
SILONG ZENG
6bc44bf49b [CI]fix nightly multi node test error for wait for pod ready (#6675)
### What this PR does / why we need it?
Fixes the issue where nightly multi-node tests hang during the "wait for
pod ready" stage due to strict shell mode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
13397841ab

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-11 18:11:00 +08:00
Qiu
cb7c419bc0 [Feat](sfa,dcp) support dcp for sfa (#6563)
### What this PR does / why we need it?
This PR adds DCP support to the SFA backend.

Please note that due to operator constraints, the current implementation
has to all-gather the entire KV cache and modify the block table to
satisfy the operator input requirements. This results in significantly
increased communication overhead and peak memory usage. Therefore, this
is only a temporary workaround and will be refactored once the operator
provides proper support.

Additionally, because of the above limitations,
`cp_kv_cache_interleave_size` is currently required to be equal to
`block_size`. This restriction will also be removed after the refactor.
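
A minimal validation sketch of that temporary constraint (an
illustrative helper, not vllm-ascend code):

```python
def check_dcp_sfa_config(cp_kv_cache_interleave_size: int,
                         block_size: int) -> None:
    # Temporary constraint of the all-gather workaround: the interleave
    # size must equal the KV-cache block size until the operator gains
    # proper support.
    if cp_kv_cache_interleave_size != block_size:
        raise ValueError(
            "cp_kv_cache_interleave_size must equal block_size: "
            f"{cp_kv_cache_interleave_size} != {block_size}")

check_dcp_sfa_config(128, 128)  # ok; unequal values would raise
```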

#### Test
accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8

| dataset | version | metric | mode | vllm-api-general-stream |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.35 |

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-02-09 18:52:25 +08:00
Li Wang
d018aeb5fa [Image] Bump mooncake version to v0.3.8.post1 (#6428)
### What this PR does / why we need it?
This patch bumps the mooncake version to the latest
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested locally:
`>>> from mooncake.engine import TransferEngine`
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-02-06 10:54:03 +08:00
meihanc
c08364f761 [Bugfix] Fix intermittent kv_port conflict with AscendDirectTransport (#6455)
### What this PR does / why we need it?

When using Mooncake on Ascend NPU, AscendDirectTransport randomly
allocates ports within range `[20000, 20000 + npu_per_node × 1000)`.
Reference:
[ascend_direct_transport.cpp#L554](https://github.com/kvcache-ai/Mooncake/blob/v0.3.7.post2/mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp#L475)

If `kv_port` overlaps with this range, users may encounter intermittent
startup failures:
```bash
zmq.error.ZMQError: Address already in use (addr='tcp://x.x.x.x:30012')
RuntimeError: KV Cache sending/receiving thread failed to start.
```
This PR fixes the intermittent kv_port conflict with AscendDirectTransport
in `Qwen3-235B-W8A8-EPLB.yaml` and adds a `kv_port Configuration Guide`
section in `pd_disaggregation_mooncake_multi_node.md`.

Test results (tests/e2e/nightly/multi_node/config/Qwen3-235B-W8A8-EPLB.yaml):
https://github.com/vllm-project/vllm-ascend/actions/runs/21540138907/job/62073265259

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-02-02 17:31:21 +08:00
dsxsteven
325cb16e3f [BugFix][CI]Fix DeepSeek-R1-W8A8-longseq nightly CI (#6297)
### What this PR does / why we need it?
The precision issue arose because the KV cache of the P node had not
been fetched for an extended period (>6 min) and was forcibly freed. To
avoid this problem, the batch size was reduced and the timeout period
was also extended.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: dsxsteven <dsxsteven@sina.com>
2026-01-28 16:36:24 +08:00
wangxiyuan
f8e76a49fa [CI] Upgrade transformers version (#6307)
Upgrade transformers to >=4.56.4

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-28 14:06:39 +08:00
starmountain1997
6c73b88dd6 [CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115)
### What this PR does / why we need it?

This PR enables FLASHCOMM1 communication optimization with layer
sharding for DeepSeek-V3.2 W8A8 model testing to
  validate PR #5702. The changes include:

1. Enable FLASHCOMM1: set VLLM_ASCEND_ENABLE_FLASHCOMM1=1, which
improves performance for distributed inference
2. Add layer sharding: configure layer_sharding: ["q_b_proj", "o_proj"]
3. Update baselines: adjust performance baselines to reflect the
improvements from FLASHCOMM1 and layer sharding

### Does this PR introduce _any_ user-facing change?

No. This is a CI/test-only change that enables new communication
optimization features for testing purposes.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-23 19:48:37 +08:00
zhangxinyuehfad
193acc2c19 [CI] Add nightly ci test for deepseek v3.1 (#5386)
### What this PR does / why we need it?
Add nightly ci test for deepseek v3.1

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-23 14:36:49 +08:00
meihanc
e54d294df3 [CI] Install clang in Dockerfile for triton ascend (#4409)
### What this PR does / why we need it?
Install clang in the Dockerfile for triton ascend

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-22 19:01:28 +08:00
wangxiyuan
69740039b7 [CI] Upgrade CANN to 8.5.0 (#6070)
### What this PR does / why we need it?
1. Upgrade CANN to 8.5.0
2. move triton-ascend 3.2.0 to requirements

note: we skipped the two failing e2e tests; see
https://github.com/vllm-project/vllm-ascend/issues/6076 for more detail.
We'll fix them soon.


### How was this patch tested?
Closes: https://github.com/vllm-project/vllm-ascend/issues/5494

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-22 09:29:50 +08:00
Nengjun Ma
ab676413e6 Default enable MLAPO (#5952)
### What this PR does / why we need it?
1) Enable MLAPO by default for DeepSeek MLA attention W8A8 models on the
PD disaggregation D instance, for example DeepSeekV3-W8A8 and
DeepSeek-R1-W8A8.
2) Enable MLAPO by default for DeepSeek SFA attention W8A8 models,
currently DeepSeek-V3.2-W8A8.

### Does this PR introduce _any_ user-facing change?
Users no longer need to manually set VLLM_ASCEND_ENABLE_MLAPO=1 to
enable the MLAPO feature for DeepSeek W8A8 models.

The effect of enabling MLAPO for the SFA model deployed on a single A3
node, tested with
tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py on
the gsm8k-lite dataset (no MTP, FULL GRAPH), is a ~19% improvement in
output token throughput:

| Metric | MLAPO disabled | MLAPO enabled by default |
|----- | ----- | -----|
| TTFT | 14055.8836 ms | 3753.1547 ms |
| ITL | 66.8171 ms | 61.4236 ms |
| Output Token Throughput | 104.9105 token/s | 125.2075 token/s |

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-22 09:26:39 +08:00
meihanc
53bfb38192 [CI]Update triton ascend version in 3.2.0 (#6067)
### What this PR does / why we need it?
update triton ascend version in 3.2.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-21 16:02:23 +08:00
zhangxinyuehfad
750c06c78a [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#4633)
### What this PR does / why we need it?
Add DeepSeek-V3.2-W8A8 nightly ci test:

DeepSeek-V3.2-W8A8, single node, DP2+TP8:
tests/e2e/nightly/models/test_deepseek_v3_2_w8a8.py

### Does this PR introduce _any_ user-facing change?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-20 21:05:15 +08:00
starmountain1997
0664c6e67a [Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (#5921)
### What this PR does / why we need it?

#### Documentation Improvements

New Configuration: Added the layer_sharding parameter to the
DeepSeek-V3.2-W8A8 deployment tutorial. This guides users to include
`["q_b_proj", "o_proj"]` in their prefill node setup for better resource
utilization.

#### CI and Testing Updates

Test Config Update: Updated the multi-node E2E test configuration file:
tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml.

including disabling `FLASHCOMM`, enabling `FULL_DECODE_ONLY`, and
updating the performance baseline.

### Does this PR introduce any user-facing change?

Yes. The documentation now recommends a more optimized startup command
for DeepSeek-V3.2-W8A8. Users following the updated tutorial will see
improved performance in multi-node PD disaggregation environments.

### How was this patch tested?
CI Validation: The updated E2E test configuration has been verified
through the nightly CI pipeline.

Environment:
* vLLM version: v0.13.0

Base Commit:
[11b6af5](11b6af5280)

Hardware: Ascend A3/A2 multi-node cluster.

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-20 12:40:54 +08:00
meihanc
80fbb1b6b1 [CI]Fix nightly clang installation following previous attempt (#5907)
### What this PR does / why we need it?
This PR fixes the issue where the previous PR
https://github.com/vllm-project/vllm-ascend/pull/5733 failed to install
Clang in nightly environment.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-15 14:18:11 +08:00
LI SHENGYONG
da958ee386 [EPLB]Eplb Config Renaming (#5533)
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depends on pr5817.

### Does this PR introduce _any_ user-facing change?

before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`

after this pr: 
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`

### How was this patch tested?

#### test qwen3-235b eplb num_redundant_experts=16

without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-15 10:26:44 +08:00
lty
295018ec0f [Refactor]Refactor of vllm_ascend/distributed module (#5719)
### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604

This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer-related code into a dedicated folder, as has already been
done in vLLM.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 08:57:40 +08:00
Li Wang
f34b3b8ee9 [nightly] Remove node tolerations for hk cluster (#5896)
### What this PR does / why we need it?
Since we have upgraded all the nodes' `cann` HDK version to `25.3rc1`,
we should no longer limit the nightly schedule to specific nodes.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-15 08:55:06 +08:00
meihanc
a9f730b853 [bugfix]Intermittent CI failure in the triton runtime jit (#5733)
### What this PR does / why we need it?
Fix bug: https://github.com/vllm-project/vllm-ascend/issues/5634, an
intermittent CI failure due to a compilation error in the triton
operator.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-14 22:58:08 +08:00
Li Wang
75c92a3640 [CI] Move nightly-a2 test to hk (#5807)
### What this PR does / why we need it?
As initial testing, this patch connects two nodes from the HK region to
the nightly A2 suite.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-12 22:58:35 +08:00
SILONG ZENG
7a6fde80b1 [CI]Add Kimi k2 nightly test (#5682)
### What this PR does / why we need it?
This PR adds performance and accuracy tests for **Kimi-K2-Instruct-W8A8**
and **Kimi-K2-Thinking** models to the Nightly test suite.

#### Test Configuration
**Kimi-K2-Instruct-W8A8**
- model: vllm-ascend/Kimi-K2-Instruct-W8A8
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Unified Distributed Inference
- Parallelism: **DP4 + TP8 + EP** (Data Parallel 4, Tensor Parallel 8,
Expert Parallel enabled).
  - Optimization: **torchair graph**, **no-prefix-caching**.
  - Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8.
  - Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8.
- Benchmarks:
  - Performance: vllm-ascend/GSM8K-in3500-bs2800.
  - Accuracy: vllm-ascend/gsm8k-lite.

**Kimi-K2-Thinking**
- Model: moonshotai/Kimi-K2-Thinking
- Hardware: A3, 1 Node (16 NPUs total)
- Architecture: Single Node Distributed Inference
- Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled).
  - Optimization: **no-prefix-caching**
- Benchmarks:
  - Performance: vllm-ascend/GSM8K-in3500-bs400.
  - Accuracy: vllm-ascend/gsm8k-lite.


### Does this PR introduce _any_ user-facing change?
**Yes.** This PR enhances the ```AisbenchRunner``` to support dynamic
configuration of the ```trust_remote_code``` flag. This allows the
AISBench client to successfully load tokenizers for models that require
custom code execution (e.g., **Kimi-K2-Thinking and
Kimi-K2-Instruct-W8A8**).

**Changes:**
1. ```AisbenchRunner.__init__```: added the ability to capture the
```trust_remote_code``` parameter from the case configuration.
``` python
         self.batch_size = aisbench_config["batch_size"]
         self.request_rate = aisbench_config.get("request_rate", 0)
+        self.trust_remote_code = aisbench_config.get("trust_remote_code", False)
         self.temperature = aisbench_config.get("temperature")
         self.top_k = aisbench_config.get("top_k")
```
2. ```AisbenchRunner._init_request_conf```: added a regex substitution
to inject the parameter into the generated dynamic configuration file.
``` python
         content = re.sub(r'batch_size.*', f'batch_size = {self.batch_size},',
                          content)
+        content = re.sub(r'trust_remote_code=.*',
+                         f'trust_remote_code={self.trust_remote_code},',
+                         content)
         content = content.replace("top_k", "#top_k")
         content = content.replace("seed", "#seed")
```

**Details:**
- New Config Key: Users can add ```"trust_remote_code": True``` to any
dictionary within the ```aisbench_cases``` list.
- Default Value: Defaults to ```False``` to maintain existing security
protocols for standard models.
- Impact: Resolves ```ValueError``` when benchmarking reasoning models
or models with custom tokenizers that previously failed during the
AISBench local initialization phase.

**User Example:**
Users can now enable custom code execution for specific models (like
Kimi-K2-Thinking) directly in their test suite:
```
# Now supported in test scripts:
aisbench_cases = [{
    "case_type": "performance",
    "request_conf": "vllm_api_stream_chat",
    "trust_remote_code": True,  # New user-facing parameter
    ...
}]
```
### How was this patch tested?
Actions:
- https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433

Result as following:

- **Kimi-K2-Instruct-W8A8**(25m25s)
1. Accuracy test
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                       96.88
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max           │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 34571.489 ms   │ 28657.8054 ms  │ 36294.1788 ms │ 34714.7329 ms  │ 35247.2724 ms  │ 35526.6758 ms  │ 36146.4314 ms  │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 2043.9136 ms   │ 627.4718 ms    │ 3532.3978 ms  │ 1906.0194 ms   │ 2307.7979 ms   │ 2883.8528 ms   │ 3283.7012 ms   │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 127.5591 ms    │ 106.4937 ms    │ 137.107 ms    │ 128.3135 ms    │ 129.5704 ms    │ 131.1332 ms    │ 134.1087 ms    │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 126.5571 ms    │ 0.0095 ms      │ 1340.783 ms   │ 104.1398 ms    │ 110.1272 ms    │ 119.6124 ms    │ 950.2924 ms    │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3516.6055      │ 3014.0         │ 3985.0        │ 3525.0         │ 3525.0         │ 3586.8         │ 3800.67        │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 256.0          │ 256.0          │ 256.0         │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 279430.9375 ms    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 512               │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 512               │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 63.3452           │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 64                │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 1.8323 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 1800502           │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 1720.5255 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 131072            │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 6443.4598 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 469.0676 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 6912.5274 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```

- **Kimi-K2-Thinking**(43m51s)
1. Accuracy test
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                      100.00
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 172384.3573 ms │ 34456.5517 ms  │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms  │ 204428.9502 ms │ 205468.6776 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 138740.3228 ms │ 655.1066 ms    │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 131.9374 ms    │ 90.6331 ms     │ 135.4144 ms    │ 132.405 ms     │ 132.948 ms     │ 133.7549 ms    │ 135.2543 ms    │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 130.9028 ms    │ 0.0099 ms      │ 960.3683 ms    │ 116.9623 ms    │ 122.3127 ms    │ 132.0522 ms    │ 886.4662 ms    │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3514.575       │ 3014.0         │ 3843.0         │ 3525.0         │ 3525.0         │ 3588.0         │ 3801.08        │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s  │ 400 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 1166795.568 ms    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 400               │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 400               │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 59.0967           │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 64                │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 0.3428 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 1405830           │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 25.332 token/s    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 102400            │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 1204.864 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 87.7617 token/s   │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 1292.6258 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2026-01-12 15:56:07 +08:00
Nengjun Ma
297f6deb09 [CI] Align multi-node nightly test parameter with corresponding tutorials document (#5756)
### What this PR does / why we need it?
Align multi-node nightly test parameters with the tutorials documents.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
Tested locally and with nightly e2e multi-node test cases.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-12 09:00:31 +08:00
SILONG ZENG
09b3f9d91b [CI]Add Disaggregated PD Nightly Test for Qwen3-235B and Qwen3-VL-235B (#5502)
### What this PR does / why we need it?
This PR adds online **Disaggregated Prefill/Decode** performance and
accuracy tests for the **Qwen3-235B-A22B** and
**Qwen3-VL-235B-A22B-Instruct** models to the Nightly test suite.

These test configurations simulate the deployment of massive MoE and
Vision-Language models in **a dual-node (32 NPU)** environment,
utilizing Mooncake (KVCache Transfer) technology to achieve efficient KV
cache transfer between the Prefill node and the Decode node.

#### Test Configuration
**Qwen3-235B-A22B**
- Model: Qwen/Qwen3-235B-A22B
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Disaggregated Prefill & Decode
- Node 0 (Producer/Prefill): **DP2 + TP8 + EP + FLASHCOMM1 +
FUSED_MC2**.
- Node 1 (Consumer/Decode): **DP4 + TP4 + EP + FLASHCOMM1 + FUSED_MC2 +
FULL_DECODE_ONLY**.
- Benchmarks:
  - Performance: vllm-ascend/GSM8K-in3500-bs2800.
  - Accuracy: vllm-ascend/gsm8k-lite.

**Qwen3-VL-235B-A22B-Instruct**
- Model: Qwen/Qwen3-VL-235B-A22B-Instruct
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Disaggregated Prefill & Decode
  - Node 0 (Producer/Prefill): **DP2 + TP8 + EP**.
  - Node 1 (Consumer/Decode): **DP4 + TP4 + EP + FULL_DECODE_ONLY**.
- Benchmarks:
  - Performance: vllm-ascend/textvqa-perf-1080p.
  - Accuracy: vllm-ascend/textvqa-lite.

### How was this patch tested?
Nightly test action on CI

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-09 16:25:20 +08:00
meihanc
6315a31399 [CI] Add triton ascend in nightly CI (#5716)
### What this PR does / why we need it?
Add triton ascend in nightly
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-08 21:17:32 +08:00
starmountain1997
086c093347 [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#5371)
### What this PR does / why we need it?

Add DeepSeek-V3.2-W8A8 dual-node nightly CI test and update A3 nightly
test configuration:

1. Add DeepSeek-V3.2-W8A8 dual-node test:
tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml
    - 2 nodes, 16 NPUs per node (32 NPUs total)
- Configuration: 2P+1D (data-parallel-size=4, tensor-parallel-size=8,
data-parallel-size-local=2)
    - Includes performance and accuracy benchmarks with GSM8K dataset
  2. Update A3 nightly workflow: .github/workflows/nightly_test_a3.yaml
- Added DeepSeek-V3.2-W8A8 dual-node test to the A3 nightly test matrix
    - Test name: multi-node-dpsk3.2-2node
3. Improve test scripts: Updated
.github/workflows/_e2e_nightly_multi_node.yaml and related scripts for
better multi-node testing support

test on A3 instances
  - Performance baseline: 1 (threshold: 0.97)
  - Accuracy baseline: 95% (threshold: 5%)
- Test dataset: GSM8K with 512 prompts for performance, gsm8k-lite for
accuracy
---------
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-07 10:02:02 +08:00
dsxsteven
129ba9fe1b [BugFix] Fix Smoke Testing Bug for DSR1 longseq (#5613)
### What this PR does / why we need it?
Fix Smoke Testing Bug for DSR1 longseq
We need to make this change because the daily smoke test case throws the
error: "max_tokens or max_completion_tokens is too large: 32768. This
model's maximum context length is 32768 tokens and your request has 128
input tokens". We hit this error because max-out-len equals
max-model-len; it can be fixed by increasing the max-model-len argument
in the script, as sketched below.
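
A minimal sketch of the length constraint behind that error
(illustrative, mirroring the server-side check):

```python
def max_valid_output_tokens(max_model_len: int, input_tokens: int) -> int:
    # The server requires input_tokens + max_tokens <= max_model_len.
    return max_model_len - input_tokens

# Failing smoke-test case: 128 input tokens with max_tokens = 32768
# exceeds a 32768-token context, so the request is rejected.
print(max_valid_output_tokens(32768, 128))  # 32640
```
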
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: daishixun <dsxsteven@sina.com>
2026-01-05 22:40:28 +08:00
weiguihua2
549be94397 [Bugfix] fix pcp + eplb error (#5561)
### What this PR does / why we need it?
Fix bugs in the PCP overlap feature:

1. Fix the bug related to PCP and EPLB overlap by including the PCP size
in the world_size calculation.
2. In the PCP pooling scenario, a hint has been added for setting
cp_kv_cache_interleave_size.

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-01-05 14:08:11 +08:00
dsxsteven
37fd48bee5 [CI] Move longseq Nightly CI (#5577)
### What this PR does / why we need it?
Move the longseq nightly CI to the correct path, following #5479 ([1/N]
Refactor nightly test structure).

Signed-off-by: daishixun <dsxsteven@sina.com>
2026-01-04 15:42:43 +08:00
dsxsteven
3c7e6c6817 [CI] Add multi-nodes longseq configs of DeepSeek-R1-W8A8 & Qwen3-235B-W8A8 (#5381)
### What this PR does / why we need it?
Add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs for the multi-node,
long-sequence (PCP & DCP) scenario.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
2026-01-04 10:38:40 +08:00
Li Wang
2ee17e50a1 [2/N] Upgrade nightly doc (#5534)
### What this PR does / why we need it?
Following up on https://github.com/vllm-project/vllm-ascend/pull/5479,
upgrade the corresponding doc for developers.

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-31 09:11:42 +08:00
Li Wang
e760aae1df [1/N] Refactor nightly test structure (#5479)
### What this PR does / why we need it?
This patch is a series of refactoring actions, including clarifying the
directory structure of nightly tests, refactoring the config retrieval
logic, and optimizing the workflow, etc. This is the first step:
refactoring the directory structure of nightly to make it more readable
and logical.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-30 19:03:02 +08:00
Li Wang
1d81bfaed1 Fix nightly (#5413)
### What this PR does / why we need it?
This patch mainly does the following things:
1. Bugfix for multi_node_tests log, log names must be unique when
uploading logs.
2. Optimize `get_cluster_ips` logic, increase the max retry times for
robustness
3. Abandoned the existing gh-proxy temporarily until it is stable
enough.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-27 18:16:46 +08:00
Nengjun Ma
f5af6bbd1e [CI] Add qwen-235b-a22b a2 multi-node test (#5393)
### What this PR does / why we need it?
Qwen3-235B-A22B is one of the TopN models, but test-case coverage for
the Qwen3-235B-A22B model on Atlas A2 is currently lacking, and most of
the machines currently owned by users in the community are A2. When
users encounter problems, we currently have no way of knowing whether
the model runs normally on the corresponding version of the code, so we
added it. In addition, the TopN models we currently cover include
qwen-dense, qwen3-30b-a3b, Qwen3-Next, and Qwen2.5-Omni, but
Qwen3-235B-A22B is missing.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
Test with multi-node, result as following:
1. Accuracy test (time to execute this test case: 25 minutes)
The test runs successfully; accuracy is as follows:
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                       95.68
```
2. Perf test (time to execute this test case: 1 h 15 minutes)
The test runs successfully; throughput is as follows (measured on Atlas
A3; for A2 the result is roughly A3/1.3):
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤══════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N   │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪══════╡
│ E2EL                     │ total   │ 384086.3958 ms │ 214767.0486 ms │ 528014.771 ms  │ 387621.5746 ms │ 388776.7492 ms │ 390164.3559 ms │ 488105.8512 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ TTFT                     │ total   │ 159409.9868 ms │ 1849.4588 ms   │ 302439.6965 ms │ 162183.7007 ms │ 162965.477 ms  │ 164274.1936 ms │ 262578.6041 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ TPOT                     │ total   │ 149.8842 ms    │ 130.2175 ms    │ 151.2625 ms    │ 150.473 ms     │ 150.6978 ms    │ 150.9102 ms    │ 151.2131 ms    │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ ITL                      │ total   │ 149.6789 ms    │ 0.0099 ms      │ 283.0242 ms    │ 150.3276 ms    │ 156.8649 ms    │ 168.1372 ms    │ 199.378 ms     │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ InputTokens              │ total   │ 3654.3079      │ 3108.0         │ 4280.0         │ 3629.0         │ 3728.0         │ 3842.1         │ 4079.0         │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ OutputTokens             │ total   │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 1500.0         │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ OutputTokenThroughput    │ total   │ 3.935 token/s  │ 2.8408 token/s │ 6.9843 token/s │ 3.8698 token/s │ 3.8799 token/s │ 3.9916 token/s │ 6.2137 token/s │ 2800 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧══════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 4391524.3389 ms   │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 2800              │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 2800              │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 244.8903          │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 256               │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 0.6376 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 10232062          │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 22.924 token/s    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 4200000           │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 2329.9568 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 956.3877 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 3286.3445 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-12-26 23:46:09 +08:00
wangxiyuan
29d2fe653d cleanup ascend config (#5296)
1. refresh additional config doc
2. move kv config logic to platform.
3. improve `dump_config` init logic and rename it to `dump_config_path`;
this change is user-impacting: dump_config is changed from a dict to a
string.
4. correct `enable_async_exponential` type
5. remove useless `chunked_prefill_for_mla`

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-26 14:07:37 +08:00
Li Wang
c2f776b846 [Nightly] Initial logging for nightly multi-node testing (#5362)
### What this PR does / why we need it?
Currently, our multi-node logs only show the master node's logs (via the
Kubernetes API), which is insufficient for effective problem
localization if other nodes experience issues. Therefore, this pull
request adds the ability to upload logs for other nodes.

Next plan: output structured directory logs, including logs from each
node and the plog.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-26 11:39:07 +08:00
zxr2333
073a3a6e6c [Doc][P/D] Fix MooncakeConnector's name (#5172)
### What this PR does / why we need it?
vLLM community has integrated their MooncakeConnector. The original
scripts will now find this MooncakeConnector instead of the one from
vLLM-Ascend. All scripts that involve using the MooncakeConnector need
to be modified to another name.

### Does this PR introduce _any_ user-facing change?
Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector.

### How was this patch tested?
By CI.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-12-18 22:29:19 +08:00
Li Wang
0f92d34a70 [CI] Pull latest vllm-ascend src before tests (#4988)
### What this PR does / why we need it?
Currently, our image build suffers from errors during cross-compilation,
which sometimes causes the image build to fail (see
https://github.com/vllm-project/vllm-ascend/actions/runs/20152861650/job/57849208186).
This results in the nightly test code not being the latest version.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-13 19:04:14 +08:00
Li Wang
5b12c068f9 [Nightly] Remove gen_ranktable logic (#4941)
### What this PR does / why we need it?
Since `llmdatadist` has been sunset, the gen_ranktable logic should also
be removed.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-12 17:20:18 +08:00