xc-llm-ascend

Author	SHA1	Message	Date
Angazenn	10a046ddce	[main][misc]change default capture size for Qwen3-MoE when using full dp (#4199 ) ### What this PR does / why we need it? Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4, 8, 16, 24, ..., max_capture_size]`. However, this is not the best choice in every situation. This PR changes the default when running Qwen3-MoE in a full-DP setting (`dp_size > 1` && `tp_size == 1`), which is usually applied in Large-Scale EP. old: `[1, 2, 4, 8, 16, 24, ..., max_capture_size]` new: `[1, 2, 5, 10, 15, 16, 24, ..., max_capture_size]` This is mainly because the performance of the `_npu_paged_attention` op degrades dramatically with the old settings. We hope to provide better performance when users do not set a specific `cudagraph_capture_size`. ### Does this PR introduce _any_ user-facing change? The default `cudagraph_capture_size` is modified in the above cases. However, if `cudagraph_capture_size` has already been set by the user, this PR has no effect. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-11-18 08:41:45 +08:00
jiangyunfan1	9a1cfb48d4	[TEST]Update prefixcache perf threshold for qwen3-32b-int8 (#4220 ) ### What this PR does / why we need it? This PR updates the prefixcache threshold for qwen3-32b-int8 from 0.4 to 0.8, as the baseline has improved. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-11-17 19:06:54 +08:00
XiaoxinWang	e38ef2c434	support FULL graph mode for GQA (#3970 ) ### What this PR does / why we need it? The current library only supports the FullDecodeOnly graph mode, which enables full graph execution during the decode. This PR extends support to allow full graph execution in both the prefill and decode, referred to as FULL graph mode. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-17 10:50:35 +08:00
zhangxinyuehfad	67f2b3a031	[Test] Add deepseek v3.2 exp nightly test (#4191 ) ### What this PR does / why we need it? - skip the nightly image build when the GitHub event is pull_request - set imagePullPolicy to Always for the multi_node test - move multi_node tests ahead so that some resources are cleaned up first - decouple the nightly image build from the nightly tests for fault tolerance - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: wangli <wangli858794774@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com>	2025-11-14 15:46:10 +08:00
欧派果奶我还要	f90ed95578	[CI] Add multi-nodes EPLB configs of DeepSeek-R1-W8A8 & Qwen3-235B-W8A8 (#4144 ) ### What this PR does / why we need it? add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and EPLB scenario ### Does this PR introduce _any_ user-facing change? no - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>	2025-11-14 08:50:29 +08:00
LookAround0301	5ec96fd46c	[long_seq_Feat] support chunk prefill (#4158 ) ### What this PR does / why we need it? 1. Qwen GQA attention_v1 optimization 2. DeepSeek MLA refactor: all gather q -> all gather kv 3. modelrunner refactor for chunk prefill; unused code is removed - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>	2025-11-14 08:43:37 +08:00
Li Wang	7294f89e43	[CI] Add daily images build for nightly ci (#3989 ) ### What this PR does / why we need it? Given the current excessively long build time of our nightly-ci, I recommend installing necessary, confirmed versions of packages in the Docker image to reduce the time required for integration testing. Including Mooncake vllm with fixed tags, This is expected to reduce nightly-ci duration by 2 hours. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-13 20:10:12 +08:00
CodeCat	49818dbbed	[Test]Add ut test qwen3_moe and sfa (#4121 ) ### What this PR does / why we need it? Currently, the UT tests lack coverage for the Qwen3_moe network and torchair_sfa. Therefore, supplementary tests are being added. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by CI - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>	2025-11-13 16:09:22 +08:00
drslark	9d84172359	[BugFix] adapted e2e tests for Qwen3-next-mtp (#4160 ) ### What this PR does / why we need it? Since https://github.com/vllm-project/vllm-ascend/pull/3967, chunked prefill and splitfuse are enabled by default, which breaks the e2e test for mtp. After locating the bug, we found that a triton operator does not support chunked prefill. Skipping the e2e test entirely would be bad, so we changed it to test only the case in which chunked prefill is off. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Because we only modified `test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY`, we only ran `pytest -s tests/e2e/multicard/test_qwen3_next.py::test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY` locally to test it. Below is the result: ```text ==================== warnings summary ==================== usr/local/python3.11.10/lib/python3.11/site-packages/torch_npu/dynamo/torchair/__init__.py:8 /usr/local/python3.11.10/lib/python3.11/site-packages/torch_npu/dynamo/torchair/__init__.py:8: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html import pkg_resources <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute tests/e2e/multicard/test_qwen3_next.py::test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY tests/e2e/multicard/test_qwen3_next.py::test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY /usr/local/python3.11.10/lib/python3.11/site-packages/pydantic/_internal/_dataclasses.py:121: DeprecationWarning: The 'task' option has been deprecated and will be removed in v0.13.0 or v1.0, whichever comes first. Please remove this option. s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ==================== 1 passed, 5 warnings in 314.52s (0:05:14) ==================== sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute ``` - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: drslark <slarksblood@qq.com>	2025-11-13 11:08:35 +08:00
realliujiaxu	5093192769	[Bugfix] fix mtp profile run error where main model and mtp model use different quantization (#4102 ) ### What this PR does / why we need it? In PR https://github.com/vllm-project/vllm-ascend/pull/3420, we initially placed the quantization type (quant_type) in the MoECommMethod class. However, since MoECommMethod follows a singleton pattern, it couldn't accommodate scenarios where different layers in the model might use different quantization approaches (e.g., MTP modules using floating-point computation while the main model employs quantized computation). In this PR, we've moved the quantization type to the AscendFusedMoe class and pass it as a parameter to MoECommMethod. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```bash export HCCL_BUFFSIZE=1024 export VLLM_VERSION=0.11.0 vllm serve /home/data/DeepSeek-R1_w8a8/ \ --data-parallel-size 2 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --served-model-name dsv3 \ --max-model-len 32768 \ --max-num-batched-tokens 4096 \ --max-num-seqs 16 \ --quantization ascend \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ --speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' ``` - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-11-13 11:02:31 +08:00
22dimensions	c272747d13	Upgrade to 0.11.1 newest vllm commit (#3982 ) ### What this PR does / why we need it? adapt vllm-ascend main branch with vllm releases/v0.11.1 fix `forward context not set` in test_vlm.py caused by: https://github.com/vllm-project/vllm/pull/23207 fix import `cdiv round` failed caused by: https://github.com/vllm-project/vllm/pull/27188 fix import `init_cached_hf_modules` failed caused by: https://github.com/vllm-project/vllm/pull/27567 adapt triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel` caused by: https://github.com/vllm-project/vllm/pull/27654 - remove unused code in sigmoid_gating.py - `class FusedRecurrentFunction` , `fused_recurrent_gated_delta_rule`, `fused_recurrent_gated_delta_rule_fwd` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-11-12 23:01:19 +08:00
Li Wang	3ca11d5a7c	[CI] Fix nightly-ci (#4159 ) ### What this PR does / why we need it? Explicitly specify `NUMEXPR_MAX_THREADS` to avoid `Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-12 22:06:49 +08:00
zhangsicheng5	a123f355e9	[feature] support pcp + mtp (in pd co-locate scenario) (#4098 ) 1. support pcp + mtp in pd co-locate scenario 2. llmdatadist connector pcp related bugfix and cleancode - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>	2025-11-12 17:22:21 +08:00
XiaoxinWang	1b4ce63ec9	fix fullgraph in ds. (#4016 ) ### What this PR does / why we need it? DS doesn't have an 'AscendAttentionMetadataBuilder' class, so it fails in fullgraph. We resolved the issue by modifying the code to only check for 'GDNAttentionMetadataBuilder', while all other attention cases follow the default branch. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-12 10:11:43 +08:00
Canlin Guo	1c677c3b87	[Test][Accuracy] Add accuracy evaluation config for InternVL3_5-8B (#3964 ) ### What this PR does / why we need it? To continuously monitor the accuracy of the InternVL3_5-8B model, this PR adds the corresponding configuration file to the CI. We need to add the `-hf` suffix to avoid incompatibility with the `lm-eval` preprocessor. ### How was this patch tested? `pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py --config ./tests/e2e/models/configs/InternVL3_5-8B.yaml` - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2025-11-12 09:05:55 +08:00
zzhxxx	46a41b26d3	oproj TP support acl graph (#4073 ) ### What this PR does / why we need it? Reference: #2167. oproj TP now supports ACL graph. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2025-11-11 19:39:06 +08:00
jiangyunfan1	0e6e08e939	[TEST]Update nightly cases and add mtpx (#4111 ) ### What this PR does / why we need it? This PR updates some nightly test cases and adds mtpx cases, we need to test them daily ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-11-11 17:39:58 +08:00
wangxiyuan	f811a24bf0	Remove VLLM_USE_V1 (#4086 ) Drop VLLM_USE_V1 usage. This env has been removed from vLLM already. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-11 15:43:39 +08:00
zhangxinyuehfad	d5567680a2	[Fixbug] Fix ut test (#4116 ) ### What this PR does / why we need it? Fix ut test: pytest<9.0.0. test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY fails because of https://github.com/vllm-project/vllm-ascend/pull/3967; skip it for now and fix it later. Test OK: https://github.com/vllm-project/vllm-ascend/actions/runs/19255274573/job/55048851066?pr=4116 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-11 15:31:00 +08:00
zhangxinyuehfad	b77b4f1abf	[Test] Add nightly test for DeepSeek-V3.2-Exp (#3908 ) ### What this PR does / why we need it? Add nightly test for DeepSeek-V3.2-Exp ### How was this patch tested? Test action: https://github.com/vllm-project/vllm-ascend/actions/runs/19156153634/job/54757008557?pr=3908 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-11 10:29:57 +08:00
Yikun Jiang	e384755ce1	[Doc] Recover installation doc to use pip install (#4109 ) ### What this PR does / why we need it? Use pip installation in installation doc and change related doctest to validate. ### Does this PR introduce _any_ user-facing change? No, doc only ### How was this patch tested? Doctest related CI passed - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-11-11 09:25:44 +08:00
Apocalypse	71866d5311	[feature] chunkprefill support pcp&dcp (#3801 ) ### What this PR does / why we need it? ChunkPrefill now can support Long Sequence Feature Pcp&Dcp ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI tests passed with self-test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu> Signed-off-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: Delphine-Nic <tanwenqin@huawei.com> Co-authored-by: Delphine-Nic <3834144971@qq.com>	2025-11-11 09:18:02 +08:00
zhaomingyu13	7ffbe73d54	[main][Bugfix] Fix ngram precision issue and open e2e ngram test (#4090 ) ### What this PR does / why we need it? Fix ngram precision issue and open e2e ngram test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: Icey <1790571317@qq.com>	2025-11-11 09:06:24 +08:00
Icey	e04a87f4be	[BugFix] Fixes Qwen3-Next enable nz accuracy problem (#4058 ) ### What this PR does / why we need it? - Fixes Qwen3-Next enable nz accuracy problem ### Does this PR introduce _any_ user-facing change? N/A - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com>	2025-11-10 20:54:57 +08:00
rjg-lyh	a1558b99c2	[Core] Restore scheduling logic under default configuration (#3967 ) ### What this PR does / why we need it? This PR reverts the changes introduced in PR #2894 Initially, due to performance issues with the older version of the chunked prefill ops, the default behavior was to use the Ascend scheduler to disable the chunked prefill feature. However, with the improvements in the performance of the new chunked prefill ops, this interception strategy has been removed. This change also aligns with the community's default configuration behavior. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-11-10 17:48:56 +08:00
zhangxinyuehfad	d40ba52454	[Fix] fix Qwen2-Audio-7B-Instruct accuracy test (#4017 ) ### What this PR does / why we need it? fix Qwen2-Audio-7B-Instruct accuracy test ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-10 11:54:18 +08:00
Levi	0a62e671fb	[Feat] flashcomm_v2 optim solution (#3232 ) ### What this PR does / why we need it? Supports generalized FlashComm2 optimization, which reduces communication overhead, decreases RmsNorm computation, and saves one AllGather step by replacing Allreduce operations in the Attention module with pre-AlltoAll and post-AllGather operations (used in combination with FlashComm1). This feature is enabled during the Prefill phase and is recommended to be used together with FlashComm1, delivering broad performance improvements, especially in long sequence scenarios with large tensor parallelism (TP) configurations. Benchmark tests show that under TP16DP1 configuration, it can improve the prefill performance of the DeepSeek model by 8% on top of FlashComm1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: zzhxx <2783294813@qq.com> Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: zzhxx <2783294813@qq.com>	2025-11-10 11:01:45 +08:00
jiangyunfan1	c116524379	[TEST]Add qwen3-235b-w8a8 and qwen3-30b-w8a8 nightly test (#3973 ) ### What this PR does / why we need it? This PR adds some qwen3-235b-w8a8 and qwen3-30b-w8a8 cases; we need to test them daily ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-11-08 18:49:28 +08:00
hucong	48094148f8	[BugFix] Improve the performance of prefixcache features (#4022 ) ### What this PR does / why we need it? The code bug caused an empty bubble. When the npu_paged_cache_load operator was called, it forcibly transferred seq_len2 to the device, which triggered synchronization and interrupted the CPU operator's launch stream. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: underfituu <hzhucong@163.com>	2025-11-08 18:45:31 +08:00
zxr2333	1d81a289d0	[P/D][BugFix]Fix proxy format processing errors & Layerwise connector performance optimization (#4043 ) ### What this PR does / why we need it? 1. Fix proxy format processing errors. 2. Layer-wise connector performance optimization. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-11-08 18:44:06 +08:00
wangx700	24d6314718	[Bugfix] fix sleepmode level2 e2e test (#4019 ) ### What this PR does / why we need it? enable sleepmode level2 e2e test and add the check logic to ensure the nz is not enabled. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? use e2e tests - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangx700 <wangxin700@huawei.com>	2025-11-08 14:11:55 +08:00
offline893	f7ca3bc0fa	[CI]Fix eplb ci. (#4052 ) ### What this PR does / why we need it? This pr fixes ci on eplb - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-11-07 23:53:35 +08:00
drslark	23b785fdfb	[Feat] Adapted mtp function to Qwen3-next (#3918 ) ### What this PR does / why we need it? Adapts mtp function to Qwen3-next. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: drslark <slarksblood@qq.com>	2025-11-07 16:39:03 +08:00
lilinsiman	22286fc67d	[UT] Add new ut case for aclgraph in auto enable (#4031 ) ### What this PR does / why we need it? add new ut case for aclgraph in auto enable ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-11-07 10:39:11 +08:00
Li Wang	259eb25f88	[CI] Quick fix mooncake for nightly-ci (#4028 ) ### What this PR does / why we need it? Since we have upgraded to CANN 8.3rc1, we will no longer use the privately maintained Mooncake repository, but instead use the official release released by Mooncake: https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.7.post2 . Next step: this is only a temporary solution. We will integrate mooncake into the vllm-ascend base image later for easier use. see https://github.com/vllm-project/vllm-ascend/pull/3989 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-06 18:46:00 +08:00
jiangyunfan1	34b278a339	[TEST]Update nightly acc test standard (#4032 ) ### What this PR does / why we need it? This PR updates the acc test standard for some cases, we need it to better maintain acc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-11-06 16:58:38 +08:00
weiguihua2	2eebe1dc0a	[feat]decode convert bsnd to tnd and fix bug when pcp and dcp (#3980 ) ### What this PR does / why we need it? 1. In the attention_v1 module, convert bsnd to tnd when pcp and dcp are used 2. Fix torchair bug: service startup problem ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-11-06 14:58:24 +08:00
Liziqi-77	25b24c02ea	[Feat](Mooncake) Supports multiple input suffixes for global_segment_size (#3690 ) ### What this PR does / why we need it? - global_segment_size and local_buffer_size use constants for unified management. - Newly added support for input formats ending with GB, MB, KB, and B, while being compatible with existing input methods. ### Does this PR introduce _any_ user-facing change? - Users can use new input methods - The documentation has also been modified ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: 李子琦 <liziqi_ing@163.com>	2025-11-06 14:48:15 +08:00
zxr2333	b206e831e9	[P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3981 ) ### What this PR does / why we need it? Make kv-transfer env variable take effect and Fix load-balance proxy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-11-06 12:02:47 +08:00
XiaoxinWang	738bf2b720	support qwen3-next full_decode_only mode. (#3949 ) ### What this PR does / why we need it? support qwen3-next full_decode_only mode. bs=1, max_token=1024 \| branch\| tps\| e2e time\| \| --- \| --- \| --- \| \|piecewise \|3.06 \| 8.15 \| \|fulldecodeonly \| 7.2 \| 3.47 \| - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-05 08:46:05 +08:00
zhangxinyuehfad	49e6983b3b	[Test] Add accuracy test for qwen3-30b-a3b-w8a8 (#3807 ) ### What this PR does / why we need it? Add accuracy test for qwen3-30b-a3b-w8a8 This PR depends on https://github.com/vllm-project/vllm-ascend/pull/3799 ### How was this patch tested? qwen3-30b-a3b-w8a8 accuracy test OK: https://github.com/vllm-project/vllm-ascend/actions/runs/19062045267/job/54443732877?pr=3807 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-04 18:56:31 +08:00
realliujiaxu	bedf223771	[Perf] move quant before allgather in Allgather EP (#3420 ) ### What this PR does / why we need it? move quant before allgather in Allgather EP, rely on https://github.com/vllm-project/vllm-ascend/pull/3334 Deepseek R1 W8A8 performance on A2 with `HCCL_ALGO="level0:NA;level1:pipeline"`: \| Seq length \| Mean TTFT (ms) main \| Mean TTFT (ms) this PR \| \|----------\|----------\|----------\| \| 4k \| 375.21 \| 364.99 \| \| 16k \| 1465.23 \| 1421.75 \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-11-04 16:49:58 +08:00
jiangyunfan1	44b58b8665	[TEST]Add full graph for multimodal nightly tests (#3968 ) ### What this PR does / why we need it? This PR adds full graph for the multimodal nightly test; we need to maintain this scenario ### How was this patch tested? by running the test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-11-04 16:47:48 +08:00
ZengSilong	dc1a6cb503	[Test]Add accuracy test for multiple models (#3823 ) ### What this PR does / why we need it? Add accuracy test for multiple models: - Meta_Llama_3.1_8B_Instruct - Qwen2.5-Omni-7B - Qwen3-VL-8B-Instruct - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2025-11-04 14:46:39 +08:00
zhangxinyuehfad	646fbac7a9	[Test] Add accuracy test for qwen3-8b-w8a8 (#3799 ) ### What this PR does / why we need it? Add accuracy test for qwen3-8b-w8a8 - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-04 09:23:11 +08:00
wangxiyuan	cc2cd42ad3	Upgrade CANN to 8.3.rc1 (#3945 ) ### What this PR does / why we need it? This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version check logic. TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base image for UT is still 8.2. We'll fix it later. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-03 20:21:07 +08:00
CodeCat	49d74785c4	[Test] Add new e2e test use deepseek-v2-lite in ge graph mode (#3937 ) ### What this PR does / why we need it? The current test cases lack end-to-end (e2e) testing for the deepseek-v2-lite network in ge graph mode. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>	2025-11-03 20:10:01 +08:00
Li Wang	8f222f21f1	[CI][Nightly] Fix mooncake build (#3958 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/pull/3943 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-03 20:07:47 +08:00
1Fire4	0b9b6d79fe	[Feat][UT] Support Deepseekv32 FULL_DECODE_ONLY mode and add unit test of sfa_v1 (#3763 ) ### What this PR does / why we need it? - Add support for DeepSeek v3.2 in FULL_DECODE_ONLY mode. - Add unit test for sfa_v1. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>	2025-11-03 10:02:47 +08:00
Li Wang	d0cc9c1203	[CI][Nightly] Correct the commit hash available for mooncake (#3943 ) ### What this PR does / why we need it? The previous commit hash was accidentally deleted or overwritten. This patch corrects the commit hash used for https://github.com/AscendTransport/Mooncake to make nightly ci happy ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-11-01 21:52:16 +08:00
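
Commit 25b24c02ea (#3690) in the log above says that `global_segment_size` accepts values ending in GB, MB, KB, or B while staying compatible with the existing input style. The log does not show the implementation, so the following is only a minimal sketch of how such parsing might look. The function name `parse_size` is hypothetical, and the sketch assumes that suffixes are binary multiples (1 KB = 1024 bytes) and that the existing input style is a plain byte count; neither assumption is confirmed by the commit message.

```python
import re

# Hypothetical helper for illustration; not taken from the vllm-ascend source.
# Assumption: suffixes are binary multiples (1 KB = 1024 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(GB|MB|KB|B)?\s*$", re.IGNORECASE)


def parse_size(value):
    """Return a size in bytes from an int or a string such as '4GB'."""
    if isinstance(value, int):
        # Assumed existing input style: a plain byte count.
        return value
    match = _SIZE_RE.match(str(value))
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    # A string without a suffix is treated as a byte count.
    return int(float(number) * _UNITS[(unit or "B").upper()])
```

With this sketch, `parse_size("4GB")` and `parse_size(4 * 1024**3)` yield the same byte count, and an unsupported suffix such as `"10TB"` raises `ValueError` rather than being silently misread.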


1128 Commits