xc-llm-ascend

Author	SHA1	Message	Date
Li Wang	90ae114569	[CI] Fix nightly CI (#3821 ) ### What this PR does / why we need it? This patch fixes the nightly CI run [failure](https://github.com/vllm-project/vllm-ascend/actions/runs/18848144365) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-28 20:40:03 +08:00
Icey	a7450db1bd	Upgrade to 0.11.1 newest vllm commit (#3762 ) ### What this PR does / why we need it? `c9461e05a4` Fix ```spec decode rejection sampler```, caused by https://github.com/vllm-project/vllm/pull/26060 Fix some ```import```, caused by https://github.com/vllm-project/vllm/pull/27374 Fix ```scheduler_config.send_delta_data```, caused by https://github.com/vllm-project/vllm-ascend/pull/3719 Fix ```init_with_cudagraph_sizes```, caused by https://github.com/vllm-project/vllm/pull/26016 Fix ```vl model```of replacing PatchEmbed's conv3d to linear layer, caused by https://github.com/vllm-project/vllm/pull/27418 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-10-28 14:55:03 +08:00
Li Wang	f846bd20e4	[CI] Add multi-node test case for a2 (#3805 ) ### What this PR does / why we need it? This patch adds a multi-node test case for a2 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-27 23:10:17 +08:00
jiangyunfan1	9030106a14	[TEST]Add 2P1D multi node cases for nightly test (#3764 ) ### What this PR does / why we need it? This PR adds the 2P1D multi-node func/acc/perf test cases; we need to test them daily ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com>	2025-10-27 23:09:15 +08:00
Levi	d64bdd06ae	【Bugfix】bugfix for weight load of kimi-k2 (#3798 ) Signed-off-by: Levi-JQ <yujinqi2@huawei.com> ### What this PR does / why we need it? Fix kimi-k2 start bug, weight load ERROR: https://github.com/vllm-project/vllm-ascend/issues/3785 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: zhaozx-cn <zhaozx2116@163.com>	2025-10-27 21:18:35 +08:00
wangxiyuan	da5f2cc1e3	[Doc] Update FAQ (#3792 ) Much of the FAQ content is out of date; this PR refreshes it. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-27 20:32:17 +08:00
shiyuan680	00aa0bf33e	support prefill cache mode use fia op (#3696 ) ### What this PR does / why we need it? support prefill cache mode use fia op for full graph ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` origin ============ Serving Benchmark Result ============ Successful requests: 30 Maximum request concurrency: 256 Request rate configured (RPS): 0.70 Benchmark duration (s): 131.63 Total input tokens: 61363 Total generated tokens: 61440 Request throughput (req/s): 0.23 Output token throughput (tok/s): 466.77 Peak output token throughput (tok/s): 750.00 Peak concurrent requests: 30.00 Total Token throughput (tok/s): 932.95 ---------------Time to First Token---------------- Mean TTFT (ms): 125.17 Median TTFT (ms): 121.51 P50 TTFT (ms): 121.51 P90 TTFT (ms): 140.91 P99 TTFT (ms): 182.36 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 43.85 Median TPOT (ms): 43.84 P50 TPOT (ms): 43.84 P90 TPOT (ms): 44.28 P99 TPOT (ms): 44.32 ---------------Inter-token Latency---------------- Mean ITL (ms): 43.85 Median ITL (ms): 42.63 P50 ITL (ms): 42.63 P90 ITL (ms): 48.74 P99 ITL (ms): 59.62 ================================================== after ============ Serving Benchmark Result ============ Successful requests: 30 Maximum request concurrency: 256 Request rate configured (RPS): 0.70 Benchmark duration (s): 130.10 Total input tokens: 61363 Total generated tokens: 61440 Request throughput (req/s): 0.23 Output token throughput (tok/s): 472.26 Peak output token throughput (tok/s): 750.00 Peak concurrent requests: 30.00 Total Token throughput (tok/s): 943.94 ---------------Time to First Token---------------- Mean TTFT (ms): 123.69 Median TTFT (ms): 122.51 P50 TTFT (ms): 122.51 P90 TTFT (ms): 143.69 P99 TTFT (ms): 165.00 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 43.07 Median TPOT (ms): 43.13 P50 TPOT (ms): 43.13 P90 TPOT (ms): 43.50 P99 TPOT (ms): 43.57 ---------------Inter-token Latency---------------- Mean ITL (ms): 43.07 Median ITL (ms): 41.81 P50 ITL (ms): 41.81 P90 ITL (ms): 48.11 P99 ITL (ms): 62.13 ================================================== Signed-off-by: shiyuan680 <917935075@qq.com>	2025-10-27 19:41:07 +08:00
Shanshan Shen	3e5ae49160	[MM][Doc] Update online serving tutorials for `Qwen2-Audio` (#3606 ) ### What this PR does / why we need it? Update online serving tutorials for `Qwen2-Audio`. Part of https://github.com/vllm-project/vllm-ascend/issues/3508. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-10-27 16:58:03 +08:00
Shirley125	d8ca7fee75	[bugfix][main]fix proxy decode bug (#3750 ) ### What this PR does / why we need it? fix proxy decode bug when parsing non-UTF-8 characters. - vLLM version: v0.11.0 - vLLM main: `c9461e05a4` --------- Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>	2025-10-27 16:56:09 +08:00
yupeng	b8796b06c8	[Doc][Example][Bugfix] Elements in local_device_ids should be casted … (#3782 ) ### What this PR does / why we need it? It's a tiny bugfix in the `gen_ranktable.py` script. The script is a utility that helps set up an example case. It is used to prepare a ranktable before disaggregated prefill deployment. Elements in the `local_device_ids` list should be cast to `int` before being used in a modulo operation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. - vLLM version: v0.11.0 - vLLM main: `c9461e05a4` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-10-27 14:52:47 +08:00
dependabot[bot]	638d8d1a47	Bump actions/upload-artifact from 4 to 5 (#3786 ) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-10-27 14:11:53 +08:00
dependabot[bot]	79623e0bab	Bump actions/download-artifact from 5 to 6 (#3787 ) Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5 to 6. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-10-27 14:10:56 +08:00
jiangyunfan1	e9072429fb	[CI] Enable 2 jobs for nightly test (#3781 ) ### What this PR does / why we need it? This PR adds 2 jobs to the a3 nightly test, containing 4 test cases; we need to test them nightly ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-10-27 14:08:29 +08:00
Li Wang	60ee4af6d0	[CI] Add custom op to nightly (#3765 ) ### What this PR does / why we need it? 1. Add custom op to nightly tests, fix https://github.com/vllm-project/vllm-ascend/pull/3665 2. Correctly pass github secrets when using workflow_call, see https://docs.github.com/en/actions/how-tos/reuse-automations/reuse-workflows 3. Fix the single node mutual cancellation issue - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-27 14:07:03 +08:00
weiguihua2	4312a92a4f	[feat]dcp pcp support aclgraph (#3731 ) ### What this PR does / why we need it? dcp pcp support full aclgraph, including mla attention_v1 - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-10-27 09:58:23 +08:00
Yizhou	8ab8111fde	[Fix] Prevent memory leak in MLA decode graph (#3743 ) ### What this PR does / why we need it? The cache for MLA decode graph parameters was holding strong references to tensors, preventing them from being garbage collected and leading to increased memory usage. This change wraps the cached tensors in weak references, allowing them to be deallocated when no longer in use and reducing overall memory pressure. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-25 20:37:33 +08:00
22dimensions	afc58184ec	[Installation] limit opencv-python-headless version to resolve numpy version conflict (#3713 ) ### What this PR does / why we need it? vllm requires opencv-python-headless >= 4.11.0, which requires numpy (<2.3.0,>=2), but the numpy version for vllm-ascend must be less than 2.0.0, so limiting opencv-python-headless to less than 4.11.0.86 fixes this conflict. ### How was this patch tested? tested by CI - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-10-25 18:07:54 +08:00
Icey	bb5f16d926	[BugFix] Fix Qwen3-next break (#3428 ) ### What this PR does / why we need it? Fix Qwen3NextGatedDeltaNet, caused by https://github.com/vllm-project/vllm/pull/26437 ### How was this patch tested? ``` def main(): prompts = [ "窗前明月光，", "The president of the United States is Mr.", "The capital of France is", "The future of AI is", "感时花溅泪，", "家书抵万金啥意思？", "plz tell me a story: ", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4, enforce_eager=True, trust_remote_code=True, max_model_len=256, gpu_memory_utilization=0.7, block_size=64 ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Icey <1790571317@qq.com>	2025-10-25 18:03:36 +08:00
ck-hw-1018	7572939b94	add qwq testcase (#3757 ) ### What this PR does / why we need it? This PR adds a qwq case to the nightly test for qwen-qwq on A3; we need to test them daily ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? by running the test - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: ckhw <cuikai1@huawei.com>	2025-10-25 17:11:35 +08:00
zzzzwwjj	e5676fc36e	[main] remove dbo code (#3712 ) ### What this PR does / why we need it? Remove the dbo code. vLLM now supports dbo via PR: https://github.com/vllm-project/vllm/pull/23693. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-10-25 15:53:01 +08:00
Icey	d9cdc65854	Upgrade to new vllm commit (#3719 ) ### What this PR does / why we need it? Upgrade to new vllm commit: `c9461e05a4` - Fix many imports, caused by https://github.com/vllm-project/vllm/pull/26908 - Fix import ```sha256```, caused by https://github.com/vllm-project/vllm/pull/27169 - Remove ```SchedulerConfig.send_delta_data```, caused by https://github.com/vllm-project/vllm/pull/27142 - Fix ```FusedMoE``` because of dual stream execution, caused by https://github.com/vllm-project/vllm/pull/26440 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-10-25 15:36:32 +08:00
fems14	226f832c0b	[bugfixfix] correct _register function place for mooncake (#3747 ) Correct the placement of the _register function for mooncake. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: fems14 <1804143737@qq.com>	2025-10-25 14:20:09 +08:00
HuaJiaHeng	11f75883be	[Test] add test for prefix cache feature of deepseek (#3733 ) ### What this PR does / why we need it? This PR adds a prefix cache case to the nightly test for DeepSeek-r1-0528-W8A8 on A3; we need to test them daily. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: root <root@hostname-2pbfv.foreman.pxe> Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>	2025-10-25 14:08:15 +08:00
Yizhou	1f25d60870	[Fix] Cap max tokens to prevent potential OOM (#3720 ) ### What this PR does / why we need it? Caps the calculated maximum number of tokens at 512. This prevents allocating an excessively large buffer when a cudagraph capture size is not specified, mitigating the risk of out-of-memory errors. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-25 11:23:21 +08:00
weichen	63c363d3de	[Refactor] [MoE] Rename moe-related classes & files (#3646 ) ### What this PR does / why we need it? 1. Rename common_fused_moe.py to fused_moe.py. 2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize to prepare_finalize.py / PrepareAndFinalize. 3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe. 4. Move vllm_ascend/ops/fused_moe.py to vllm_ascend/ops/fused_moe/fused_moe.py ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-25 11:22:03 +08:00
zhangxinyuehfad	0637e8f021	[Doc] Update supported models (#3481 ) ### What this PR does / why we need it? Update supported models ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-10-25 11:13:46 +08:00
zhangxinyuehfad	8f6f967028	[Test] Add e2e test and accuracy test for Qwen3-Next-80B-A3B-Instruct (#3450 ) ### What this PR does / why we need it? Add e2e test and accuracy test for Qwen3-Next-80B-A3B-Instruct ### How was this patch tested? accuracy test: https://github.com/vllm-project/vllm-ascend/actions/runs/18771221544/job/53556027634?pr=3450 ci test: https://github.com/vllm-project/vllm-ascend/actions/runs/18771221530/job/53556027614?pr=3450 <img width="1703" height="562" alt="image" src="https://github.com/user-attachments/assets/973b6cfa-8240-41e3-893a-5024ff8d0693" /> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-10-25 10:57:56 +08:00
whx	d5609e2c48	[BugFix] Comment out newly added vlm e2e. (#3736 ) This PR comments out the newly added vlm e2e test for the ascend scheduler scenario because I found that it gets stuck when running in multi-batch. It needs to be added back after this issue is resolved. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-25 10:34:59 +08:00
lio	9e150e5009	[Refactor] optimize _prepare_inputs method in eagle_proposer (#3296 ) ### What this PR does / why we need it? We optimized the _prepare_input method in eagle_proposer and no longer use the _prepare_eagle_input_sequential method, improving the performance of eagle-3. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 13963 --dtype bfloat16 --model meta-llama/Llama-3.1-8B-Instruct --served-model-name Llama-3.1-8B-Instruct --tensor-parallel-size 1 --gpu-memory-utilization 0.85 --max-model-len 32768 --trust-remote-code --seed 42 --no-enable-prefix-caching --speculative_config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":2,"draft_tensor_parallel_size":1}' ``` Co-authored-by: QilaiZhang (245706640@qq.com ) - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: lio <1983142975@qq.com>	2025-10-25 09:49:42 +08:00
QilaiZhang	d30bb95b90	[Bugfix] Fix zero attention output in qwen3-next (#3572 ) ### What this PR does / why we need it? Since Attention and LinearAttention share the same ```slot_mapping```, and the ```slot_mapping``` for LinearAttention is all zeros, the ```slot_mapping``` for Attention gets overwritten, resulting in the computed output being all zeros. This PR removes the uniformly managed ```self.slot_mapping``` and directly passes the ```slot_mapping``` from ```input_batch.blocktable``` to ```attn_metadata```, along with modifying the relevant references. Due to hardware, the data type of ```block_table.slot_mapping``` needs to be set to int32. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: QilaiZhang <245706640@qq.com>	2025-10-25 09:47:03 +08:00
whx	e33751ef8b	[BugFix][Core] Fix a bug running multi-modal with ascend_scheduler (#3675 ) This PR fixes a bug in running multi-modal models with AscendScheduler. The bug was introduced by PR #2372, which used the same parameter names as vLLM but with different default values. For now, I fix it by changing the default values of these two parameters to align with vLLM. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-10-25 09:41:33 +08:00
wangxiyuan	1a9feb3ba5	Update version doc (#3599 ) 1. Add v0.11.0-dev branch info 2. mark rfc/long_seq_optimization branch as completed - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-25 09:37:56 +08:00
wangxiyuan	07c8d4547c	[CI] Skip ops test for e2e (#3665 ) ### What this PR does / why we need it? Skip ops test for e2e and will move it to nightly test in the following pr - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-25 09:37:30 +08:00
wangxiyuan	6922947033	[Misc] Limit ray version (#3660 ) We noticed that with ray>2.48.0, the NPU card count reported by ray is incorrect. This is a known bug. Let's limit the ray version to <2.48.0 for now. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-25 09:36:44 +08:00
Canlin Guo	8295136575	[UT][fix] Add missing get_ascend_config mock to NPUWorker initialization tests (#3729 ) ### What this PR does / why we need it? Enable the unit tests that #3612 skipped. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Unit tests. - vLLM main: `17c540a993` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2025-10-25 09:33:16 +08:00
Li Wang	7f73c28a24	[CI][Doc] Optimize multi-node CI (#3565 ) ### What this PR does / why we need it? This pull request mainly does the following: 1. Add a doc for multi-node CI, covering the mechanism and how to contribute 2. Simplify the config yaml to be more developer-friendly 3. Optimize the mooncake installation script to prevent accidental failures during installation 4. Fix the workflow to ensure the kubernetes can be applied correctly 5. Add Qwen3-235B-W8A8 disaggregated_prefill test 6. Add GLM-4.5 multi dp test 7. Add 2p1d 4nodes disaggregated_prefill test 8. Refactor nightly tests ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-25 09:23:47 +08:00
hucong	292cf339c3	[BugFix][P/D] Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache (#3641 ) ### What this PR does / why we need it? Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: underfituu <hzhucong@163.com>	2025-10-25 09:14:20 +08:00
shaopeng-666	39b994a987	[Feat] Add mrope fusion op (#3708 ) ### What this PR does / why we need it? Add mrope fusion op for qwen2.5-vl. This mrope operator doesn't currently support Qwen3-VL, so it only takes effect for qwen2.5-vl - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com> Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>	2025-10-25 09:12:18 +08:00
Yizhou	3158742a97	[Refactor] Refactor Ascend attention implementation forward (#3714 ) ### What this PR does / why we need it? This PR refactors the Ascend attention implementation to align with vLLM's core interfaces, simplifying the code and improving maintainability. ### Key Changes: * Align with vLLM's Attention Interface: The `forward` method signature in `AscendAttentionBackendImpl` now matches the base `AttentionImpl` in vLLM, removing the custom `trace_flag`. * Enable Opaque Attention Operator: By adding `opaque_attention_op` to `AscendPlatform`, we allow vLLM to wrap our attention kernel in its standard `vllm.unified_attention_with_output` operator. This avoids the need for a custom call path. * Remove Obsolete Code: * The custom op `vllm.unified_ascend_attention_with_output` has been deleted as it is now redundant. * The `trace_flag` and its associated logic were removed, reducing code complexity. * An outdated quantization branch within the attention implementation was cleaned up. * Improve Readability: Renamed output variables (`output` vs. `intermediate_output`) and added comments to clarify the in-place nature of the attention output. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? No extra tests needed. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-25 08:58:35 +08:00
ZYang6263	0b1da24742	[Main][Perf] Add fused matmul/reduce-scatter kernel for performance optimization. (#3693 ) ### What this PR does / why we need it? This PR boosts performance by introducing a fused kernel for the matrix matmul and reduce scatter operations. It supports both unquantized (e.g., BFloat16) and W8A8 quantized models. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: ZYang6263 <zy626375@gmail.com>	2025-10-24 18:19:58 +08:00
fems14	82a4970fe9	look up multi_tp key (#3699 ) ### What this PR does / why we need it? In multi-Tensor Parallel (TP) scenarios, the KV pool only queries the first GPU card. When keys on other cards are released, the query result still returns as successful, introducing accuracy issues. This PR modifies the KV pool's query logic to check all cards, resolving this problem. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-10-24 17:23:36 +08:00
fems14	c83efcb9e4	kvpool sync load (#3698 ) ### What this PR does / why we need it? In certain scenarios, the performance of synchronously loading data from the pool is better than that of asynchronously loading data. Therefore, a control logic (or switch) for asynchronous loading from the pool has been added. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-10-24 17:22:53 +08:00
何必问	59bb16b75c	[Bugfix] The server fails to locate the request, leading to the server hanging. (#3703 ) ### What this PR does / why we need it? Fix bug: in the mooncake pooling scenario, when the client closes the request, the server fails to locate the request, leading to the server hanging. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Bring up the PD-separated pooling service, send requests using aisbench, press CTRL+C twice, and check whether the vllm_ascend service exits. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: linhebiwen <linhebiwen@gmail.com>	2025-10-24 17:18:03 +08:00
wangyu	d301c56d1a	[TEST]Add initial multi modal cases of Qwen2.5-VL-32B-Instruct for nightly test (#3707 ) ### What this PR does / why we need it? This PR adds the initial multi-modal model for the nightly test, including 2 cases for the Qwen2.5-vl-32b acc/perf test on A3; we need to test them daily. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>	2025-10-24 17:12:06 +08:00
offline893	9b0baa1182	[BugFix] Check all expert maps when using multiple instances. (#3576 ) ### What this PR does / why we need it? Check all expert maps when using multiple instances. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Qwen 235B in double A3. Case 1: master has an expert map, slave has no expert map. Case 2: master has an expert map, slave has an erroneous expert map. Case 3: master has an expert map, slave has a correct expert map. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-24 17:10:14 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612 ) ### What this PR does / why we need it? This is step 1 of refactoring the code to adapt to vllm main, and this PR is aligned with `17c540a993` 1. refactor deepseek to the latest code arch as of `17c540a993` 2. a bunch of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import, caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` import, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` import, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured output break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
jiangyunfan1	ec9ec78b53	[TEST]Add initial prefix cache case for nightly test (#3709 ) ### What this PR does / why we need it? This PR adds the initial prefix cache case to the nightly test for Qwen3-32b-int8 on A3; we need to test them daily. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-10-24 16:33:18 +08:00
zzzzwwjj	6be321b95e	remove useless code (#3685 ) ### What this PR does / why we need it? `vanilla_chunked_prefill_mla` and `vanilla_decode_mla` are unused, so remove them. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-10-24 16:29:08 +08:00
lio	cd58a643c5	[UT] Fix test_sample_recovered_tokens_pytorch_autoregressive (#3434 ) ### What this PR does / why we need it? The 'test_rejection_sampler' unit test is wrong. > def test_sample_recovered_tokens_pytorch_autoregressive(self): > output_token_ids = torch.empty(2, dtype=torch.int32) > cu_num_draft_tokens = torch.tensor([1, 1]) > draft_token_ids = torch.tensor([0, 1]) Since len(draft_token_ids) = 2, cu_num_draft_tokens should be torch.tensor([1, 2]) or torch.tensor([2, 2]). I fixed it by setting cu_num_draft_tokens = torch.tensor([1, 2]). The methods both before and after optimization pass. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? NA - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: lio <1983142975@qq.com>	2025-10-24 11:20:57 +08:00
Li Wang	802c574532	[Benchmark] Upgrade benchmark args for new vllm version (#3218 ) ### What this PR does / why we need it? Since the newest vllm commit has deprecated the arg `--endpoint-type`, we should use `--backend` instead ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? test it locally: ```shell export VLLM_USE_MODELSCOPE=true export DATASET_PATH=/root/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json vllm serve Qwen/Qwen2.5-7B-Instruct --load-format dummy wget -O ${DATASET_PATH} /root/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json vllm bench serve --model Qwen/Qwen2.5-7B-Instruct --backend vllm --dataset-name sharegpt --dataset-path ${DATASET_PATH} --num-prompt 200 ``` and the result looks good: ```shell ============ Serving Benchmark Result ============ Successful requests: 200 Benchmark duration (s): 20.36 Total input tokens: 43560 Total generated tokens: 44697 Request throughput (req/s): 9.82 Output token throughput (tok/s): 2194.88 Peak output token throughput (tok/s): 4676.00 Peak concurrent requests: 200.00 Total Token throughput (tok/s): 4333.93 ---------------Time to First Token---------------- Mean TTFT (ms): 2143.85 Median TTFT (ms): 2486.17 P99 TTFT (ms): 2530.36 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 43.50 Median TPOT (ms): 30.75 P99 TPOT (ms): 309.22 ---------------Inter-token Latency---------------- Mean ITL (ms): 28.15 Median ITL (ms): 25.42 P99 ITL (ms): 38.30 ================================================== ``` - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-24 11:18:19 +08:00

1220 Commits