xc-llm-ascend

Author	SHA1	Message	Date
Yang Yuxi	e776d5c0f1	[Bugfix]v0.18.0 support FlashComm1 & DCP for Qwen (#7726 ) ### What this PR does / why we need it? This PR backports the changes from #7673 ([Bugfix] support FlashComm1 & DCP for Qwen) to the releases/v0.18.0 branch. -------- Signed-off-by: Yang Yuxi <907276627@qq.com>	2026-03-29 15:59:19 +08:00
wangbj127	2ad0ca52a6	Qwen3.5 MoE supports flashcomm v1 (#7644 ) cherry pick from https://github.com/vllm-project/vllm-ascend/pull/7486 ### What this PR does / why we need it? Multimodal models like Qwen3.5 MoE perform embedding in model_runner, so when flash comm is enabled, the first AllGather operation should be skipped. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Wangbingjie <wangbj1207@126.com> Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>	2026-03-25 23:09:33 +08:00
realliujiaxu	5d12446573	[Feat][SP] Suport SP for VL MoE models (#7044 ) ### What this PR does / why we need it? 2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712, extend SP to VL MoE models. ### Does this PR introduce _any_ user-facing change? remove `sp_threshold` in additional config and reuse `sp_min_token_num` from vLLM. ### How was this patch tested? - Model: Qwen3-VL-30B-A3B, - TP4 DP2 - 100 reqs - max concurrency 1 \| Seq length \| Mean TTFT (ms) main \| Mean TTFT (ms) this PR \| \|------------\|---------------------\|------------------------\| \| 4k \| 429.40 \| 323.3 \| \| 16k \| 1297.01 \| 911.74 \| - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-24 17:16:00 +08:00
Nengjun Ma	8e0789bb36	[CI] Recover pd disaggregated encoder test case that been incorrectly skipped (#7505 ) ### What this PR does / why we need it? [CI] Recover the pd disaggregated encoder test case that was incorrectly skipped in PR: https://github.com/vllm-project/vllm-ascend/pull/7412 ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-03-23 21:41:28 +08:00
Qiu	71df17f4e6	bugfix(MC2): refactor the comm group of MC2 to be compatible with PP (#7291 ) ### What this PR does / why we need it? This PR refactors the communication group of MC2 to keep it consistent with vllm's EP group, making it compatible with PP. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-23 15:44:21 +08:00
Shanshan Shen	5c0d02f689	[Bugfix] Fix multi-instance serving OOM on single card (#7427 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/7308. Subtract `init_non_torch_memory` (which may be used by the first instance) from the total `non_torch_memory` when calculating `available_kv_cache_memory`; that is, directly use `non_torch_memory_increase` (contained in `non_kv_cache_memory`) to calculate `available_kv_cache_memory`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Launch two vllm-ascend instances sequentially on a single card. ```bash # Launch first instance vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \ --port 8100 \ --host 0.0.0.0 \ --additional-config='{"enable_cpu_binding":true}' \ --gpu-memory-utilization 0.3 \ --max-num-seqs 1 \ --max-model-len 2048 \ --max-num-batched-tokens 2048 \ --no-enable-prefix-caching \ --enforce-eager # Launch second instance vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \ --port 8101 \ --host 0.0.0.0 \ --additional-config='{"enable_cpu_binding":true}' \ --gpu-memory-utilization 0.3 \ --max-num-seqs 1 \ --max-model-len 2048 \ --max-num-batched-tokens 2048 \ --no-enable-prefix-caching \ --enforce-eager ``` Before this PR: ```bash # First instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2340388298034668 GiB init_non_torch_memory: 0.3616676330566406 GiB non_torch_memory_before_empty_cache: 0.3896217346191406 GiB non_torch_memory_increase: 0.0279541015625 GiB non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB ------------------------------------------------------------------ # Second instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2336344718933105 GiB init_non_torch_memory: 18.37220001220703 GiB non_torch_memory_before_empty_cache: 
18.399906158447266 GiB non_torch_memory_increase: 0.02754974365234375 GiB non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB ------------------------------------------------------------------ # available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache Available KV cache memory: -1.32 GiB ``` After this PR: ```bash # First instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.2340540885925293 GiB init_non_torch_memory: 0.36182403564453125 GiB non_torch_memory_before_empty_cache: 0.38979339599609375 GiB non_torch_memory_increase: 0.0279693603515625 GiB non_torch_memory_cleared_by_empty_cache: 0.0 GiB ------------------------------------------------------------------ # Second instance: ------------------------------------------------------------------ requested_memory: 18.287109375 GiB non_kv_cache_memory: 1.233344554901123 GiB init_non_torch_memory: 18.74309539794922 GiB non_torch_memory_before_empty_cache: 18.770355224609375 GiB non_torch_memory_increase: 0.02725982666015625 GiB non_torch_memory_cleared_by_empty_cache: 0.0 GiB ------------------------------------------------------------------ # available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache Available KV cache memory: 17.05 GiB ``` - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2026-03-23 14:22:59 +08:00
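The before/after figures logged for the second instance in #7427 can be re-derived from the formula quoted in the log comments. This is a minimal sketch of that arithmetic only: the helper name `available_kv_cache_gib` is hypothetical and is not taken from vllm-ascend, and the inputs are the values printed in the logs above.

```python
def available_kv_cache_gib(requested: float,
                           non_kv_cache: float,
                           cleared_by_empty_cache: float) -> float:
    """Formula quoted in the log comments of #7427 (all values in GiB)."""
    return requested - non_kv_cache - cleared_by_empty_cache


# Second instance, before the fix: memory held by the first instance is
# counted as "cleared by empty_cache", which drives the result negative.
before = available_kv_cache_gib(18.287109375, 1.2336344718933105, 18.372356414794922)

# Second instance, after the fix: the cleared term is reported as 0.0.
after = available_kv_cache_gib(18.287109375, 1.233344554901123, 0.0)

print(f"before: {before:.2f} GiB")  # before: -1.32 GiB
print(f"after: {after:.2f} GiB")    # after: 17.05 GiB
```

Both results match the `Available KV cache memory` lines in the logs, which suggests the negative value came entirely from the first instance's non-torch memory being attributed to the second.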
meihanc	bff4fbfca5	upgrade to 0.18.0 (#7502 ) ### What this PR does / why we need it? 1. upgrade to 0.18.0 2. ensure kernel_block_sizes is int for Eagle drafter ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-03-21 16:05:38 +08:00
linfeng-yuan	88d03a783f	[refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024 ) ### What this PR does / why we need it? Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business `**kwargs` with typed request objects and explicit stage boundaries. - Prepare, dispatch, MLP, and quant stages now have clearer ownership. - Main MoE path no longer depends on business `kwargs.get(...)` lookups. - Comm and dispatcher interfaces are request-only on the main path. - UTs can assert stage-level fields directly instead of inferring behavior indirectly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-20 23:23:57 +08:00
LI SHENGYONG	4e6dbe0956	[EPLB][Bugfix] Set parallel_config.enable_eplb to true to load redundant experts (#7470 ) ### What this PR does / why we need it? vLLM PR https://github.com/vllm-project/vllm/pull/37136 broke EPLB because it filters out redundant experts. vLLM PR https://github.com/vllm-project/vllm/pull/37322 fixed this by using parallel_config.enable_eplb to determine whether to skip the weight loading filter. However, in vllm-ascend, parallel_config.enable_eplb is always false, so when EPLB is used we temporarily set it to true. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-03-19_16-13-01](https://github.com/user-attachments/assets/b3a4911e-36b3-4c31-951c-7c091f416d00) \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-20 15:22:55 +08:00
Nengjun Ma	ee804ce23e	Main2main upgrade vllm to 0318 commit (#7412 ) ### What this PR does / why we need it? Upgrade vllm commit to 0318. Main content: some test cases failed because NPU memory was not released in time after the previous test cases ran; for those cases, added a pre-step that cleans up NPU memory and waits (by default at most 50s) for the cleanup to complete. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-03-19 17:17:36 +08:00
Nengjun Ma	8b79d4de52	Main2main upgrade to vllm 0317 afternoon (#7409 ) ### What this PR does / why we need it? 1.fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122) 2.fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027) 3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace memory related torch.cuda APIs"](https://github.com/vllm-project/vllm/pull/37031) 4.fix [Support multiple KV groups in OffloadingSpec ](https://github.com/vllm-project/vllm/pull/36610) removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor. 5.fix [Consolidate SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? E2E - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Claude Code <noreply@anthropic.com>	2026-03-18 23:24:27 +08:00
lilinsiman	8f278fc101	[eagle3][pcp] fix bug for eagle3 and cp enable (#7309 ) ### What this PR does / why we need it? This PR fixes the bug for eagle3 and cp enable introduced by the parallel speculative inference PR. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tests and ut - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-17 16:14:45 +08:00
rjg-lyh	4d443b9228	[bugfix] restore pr-7029 and fix patch error (#7294 ) ### What this PR does / why we need it? This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using the `lightning_indexer_quant` ops in the pd-mix stage. The original PR was reverted by #7288 because the patch did not work with the recompute scheduler. This PR also fixes the patching issue so that it works correctly with the recompute scheduler. ### Does this PR introduce _any_ user-facing change? Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to `"true"` in `additional_config`. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-16 15:39:42 +08:00
zhaomingyu13	9320365dab	[Test][Feature] Add e2e test for QuaRot model with eagle3 (#7128 ) ### What this PR does / why we need it? Add an e2e test for QuaRot model with eagle3 that runs both the QuaRot model and the float model, and then compares their acceptance rates. The PRs that adapted the QuaRot model to eagle3 are #6914 and #7038. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-03-16 15:35:55 +08:00
pppeng	7e85f2ff97	[CI] Add test_qwen3_5.py (#7133 ) ### What this PR does / why we need it? Add test_qwen3_5.py for base scenarios tp4 on Qwen3.5-27B and Qwen3.5-35B-A3B. - vLLM version: main - vLLM main: `4034c3d32e` --------- Signed-off-by: pppeng <zepengliu912@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-03-15 22:19:02 +08:00
Mengqing Cao	0c299f79b9	Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029 )" (#7288 ) ### What this PR does / why we need it? This reverts commit `7ed9e9de69`, which introduces an issue that the patch doesn't work with recompute scheduler enabled. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-15 20:19:09 +08:00
rjg-lyh	7ed9e9de69	[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029 ) ### What this PR does / why we need it? This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops in pd-mix stage mainly. Because the code for the current PD-disaggregated scenario is still under refactoring and cleanup, this PR prioritizes ensuring the C8 functionality in the pd-mix scenario. The next steps are planned in two parts: ① Once the optimized scatter operator is updated, we will replace the original operator to improve the performance of storing k_scale. ② Once the code logic for the PD-disaggregated scenario becomes stable, we will carry out more comprehensive validation and make appropriate adaptations. ③ Because enabling C8 currently introduces several new operators whose performance still needs improvement, performance may regress in some scenarios. Therefore, only after all the operators are fully ready can we ensure that this feature does not cause any performance degradation. At that point, we will enable this feature by default and remove the switch in `additional_config`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-13 14:47:42 +08:00
Li Wang	7fe0469e27	[CI][Misc] Use offline mode for model downloads (#7179 ) ### What this PR does / why we need it? 1. For all parts of the current test module involving the millisecond download model, add the `local_files_only` parameter to specify offline mode; this ensures that CI will not fail due to network instability. 2. Install modelscope from a fixed commit until its next release ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Check that the env or arg `local_files_only` works: 1) set the env: ```shell export HF_HUB_OFFLINE=1 ``` 2) run the script ```python from transformers import PretrainedConfig import huggingface_hub from modelscope.utils.hf_util import patch_hub patch_hub() model="Qwen/Qwen3-0.6B" kwargs = {} config_dict, _ = PretrainedConfig.get_config_dict( model, trust_remote_code=True, local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE, **kwargs, ) print(config_dict) ``` it works well: ```shell 2026-03-06 06:40:12,546 - modelscope - WARNING - We can not confirm the cached file is for revision: master The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored. 
{'architectures': ['Qwen3ForCausalLM'], 'attention_bias': False, 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'head_dim': 128, 'hidden_act': 'silu', 'hidden_size': 1024, 'initializer_range': 0.02, 'intermediate_size': 3072, 'max_position_embeddings': 40960, 'max_window_layers': 28, 'model_type': 'qwen3', 'num_attention_heads': 16, 'num_hidden_layers': 28, 'num_key_value_heads': 8, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000, 'sliding_window': None, 'tie_word_embeddings': True, 'torch_dtype': 'bfloat16', 'transformers_version': '4.51.0', 'use_cache': True, 'use_sliding_window': False, 'vocab_size': 151936, '_commit_hash': None} ``` 3) test a model repo that is not cached locally when the env `HF_HUB_OFFLINE`==True ```python from transformers import PretrainedConfig import huggingface_hub from modelscope.utils.hf_util import patch_hub patch_hub() model="FireRedTeam/FireRed-OCR" kwargs = {} config_dict, _ = PretrainedConfig.get_config_dict( model, trust_remote_code=True, local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE, **kwargs, ) print(config_dict) ``` and the result is as expected: ```shell File "/workspace/demo.py", line 12, in <module> config_dict, _ = PretrainedConfig.get_config_dict( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 189, in patch_get_config_dict model_dir = get_model_dir(pretrained_model_name_or_path, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 164, in get_model_dir model_dir = snapshot_download( ^^^^^^^^^^^^^^^^^^ File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 137, in snapshot_download return _snapshot_download( ^^^^^^^^^^^^^^^^^^^ File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 283, in 
_snapshot_download raise ValueError( ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable look-ups and downloads online, set 'local_files_only' to False ``` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-13 08:52:24 +08:00
XiaoxinWang	37d1bd8c50	fixed fia pad logic in graph mode. (#7144 ) ### What this PR does / why we need it? Related to vLLM PR #34043. That PR deletes the function `relax_for_mixed_batch_cudagraphs`, so num_reqs no longer equals the actual number of requests. Because the fia operator requires that query_start_loc[-1] equal the total number of computed tokens, deleting this function causes the ifa error. In full graph mode, set num_reqs_paded = num_reqs to fix the error. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2026-03-12 14:50:54 +08:00
meihanc	da01a74009	Revert "[CI] fix skiped e2e test when upgrade vllm version (#6654 )" (#7166 ) This reverts commit `f6db47f103`. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-03-11 23:03:15 +08:00
yupeng	830f39dd70	[Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650 ) ### What this PR does / why we need it? Fix the issue #6143 . ### Does this PR introduce _any_ user-facing change? Allow to start the server with "--enable-lora && --fully-sharded-loras && --tensor_parallel_size 2". ### How was this patch tested? pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-11 15:43:15 +08:00
meihanc	f6db47f103	[CI] fix skiped e2e test when upgrade vllm version (#6654 ) ### What this PR does / why we need it? Fix test_aclgraph_capture_replay.py being skipped when upgrading the vLLM version. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-03-10 09:55:35 +08:00
SILONG ZENG	43df2cb2fc	[Lint]Style: Convert `test/` to ruff format(Batch #1 ) (#6738 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `tests/e2e/310p/multicard/test_vl_model_multicard.py` \| \| `tests/e2e/310p/singlecard/test_vl_model_singlecard.py` \| \| `tests/e2e/310p/test_utils.py` \| \| `tests/e2e/conftest.py` \| \| `tests/e2e/model_utils.py` \| \| `tests/e2e/models/conftest.py` \| \| `tests/e2e/models/test_lm_eval_correctness.py` \| \| `tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py` \| \| `tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py` \| \| `tests/e2e/multicard/2-cards/test_data_parallel.py` \| \| `tests/e2e/multicard/2-cards/test_disaggregated_encoder.py` \| \| `tests/e2e/multicard/2-cards/test_expert_parallel.py` \| \| `tests/e2e/multicard/2-cards/test_external_launcher.py` \| \| `tests/e2e/multicard/2-cards/test_full_graph_mode.py` \| \| `tests/e2e/multicard/2-cards/test_ilama_lora_tp2.py` \| \| `tests/e2e/multicard/2-cards/test_offline_inference_distributed.py` \| \| `tests/e2e/multicard/2-cards/test_offline_weight_load.py` \| \| `tests/e2e/multicard/2-cards/test_pipeline_parallel.py` \| \| `tests/e2e/multicard/2-cards/test_prefix_caching.py` \| \| `tests/e2e/multicard/2-cards/test_quantization.py` \| \| `tests/e2e/multicard/2-cards/test_qwen3_moe.py` \| \| `tests/e2e/multicard/2-cards/test_qwen3_moe_routing_replay.py` \| \| `tests/e2e/multicard/2-cards/test_qwen3_performance.py` \| \| `tests/e2e/multicard/2-cards/test_shared_expert_dp.py` \| \| `tests/e2e/multicard/2-cards/test_single_request_aclgraph.py` \| \| `tests/e2e/multicard/2-cards/test_sp_pass.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-10 09:52:50 +08:00
ZT-AIA	ee5347e824	[qwen3 next ]add ascend c casual_conv1d_fn (#6661 ) ### What this PR does / why we need it? add ascend c casual_conv1d_fn - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-09 23:29:49 +08:00
Qiu	13adcbe44b	feat(attention_cp): support chunked prefill for Qwen3Next with PCP&DCP (#6900 ) ### What this PR does / why we need it? Support chunked prefill for Qwen3Next with PCP&DCP - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-09 17:55:09 +08:00
ZhaoJiangJiang	a51d6366b9	[Bugfix] Qwen3Next support FlashComm1 (#6830 ) ### What this PR does / why we need it? Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence Parallel (SP) and resolve precision problems in shared_out when both FlashComm1 is enabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com> Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>	2026-03-06 17:14:08 +08:00
xiaocongtou6	bc0fd7ca72	[Feat]Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for DeepSeek v3.2. (#6940 ) ### What this PR does / why we need it? Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for DeepSeek v3.2. ### How was this patch tested? Test output: {"object":"text_completion","model":"deepeek_v3","choices":[{"index":0,"text":" the head of state and head of government of the United States, indirectly elected to a four-year term by the American people through the Electoral College. The officeholder leads the executive branch of the federal government and is the commander-in-chief of the United States","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":1,"text":" Paris. This is the largest city in France and its main political, cultural and commercial center. The modern location of the city is the north of the central part of the country, on the banks of the Seine River Seine River Seine in 3\n\n","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":2,"text":" now\n\n# AI future is now\n\nThe world is changing at a rapid pace, and artificial intelligence (AI) is at the forefront of this transformation. From self-driving cars to virtual assistants, AI is already making a significant impact on our daily lives","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":3,"text":" a 3rd year student at the University of Lincoln studying Media Production. 
This blog is about my work throughout my final year on the course.\n\n## Tuesday 3 May 2016\n### Final Major Project - Evaluation\n\nFor my final project I","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":27,"total_tokens":227,"completion_tokens":200,"prompt_tokens_details":null},"kv_transfer_params":null} - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: xiaocongtou6 <2066962956@qq.com> Signed-off-by: xiaocongtou6 <105542647+xiaocongtou6@users.noreply.github.com>	2026-03-06 16:10:24 +08:00
Cao Yi	50441e4650	[BugFix][MTP] Fix prefill misclassified as decode when prompt tokens == num_spec_tokens + 1 (#6835 ) ## Problem When MTP is enabled, prefill requests with `prompt_tokens == num_spec_tokens + 1` are incorrectly classified as decode requests, causing accuracy issues. ## Root Cause The `uniform_decode` condition only checked: - `max_num_scheduled_tokens == uniform_decode_query_len` - `num_tokens == max_num_scheduled_tokens * num_reqs` This is insufficient because a prefill request with specific prompt length satisfies these conditions as well. ## Fix Add `is_all_decode` check to ensure all requests have `num_computed_tokens > 0` before classifying as uniform decode, since decode requests must have computed at least one token. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-03-05 17:33:10 +08:00
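The classification rule described in #6835 can be sketched as follows. This is an illustrative reconstruction from the commit message, not the vllm-ascend implementation: the function names and the list-based inputs are hypothetical, and `uniform_decode_query_len = num_spec_tokens + 1` is an assumption implied by the problem statement.

```python
from typing import Sequence


def old_is_uniform_decode(num_scheduled: Sequence[int],
                          uniform_decode_query_len: int) -> bool:
    """Pre-fix check: looks only at scheduled token counts."""
    if not num_scheduled:
        return False
    max_tokens = max(num_scheduled)
    return (max_tokens == uniform_decode_query_len
            and sum(num_scheduled) == max_tokens * len(num_scheduled))


def is_uniform_decode(num_scheduled: Sequence[int],
                      num_computed: Sequence[int],
                      uniform_decode_query_len: int) -> bool:
    """Post-fix check: additionally requires every request to be a decode,
    i.e. to have already computed at least one token."""
    is_all_decode = all(n > 0 for n in num_computed)
    return is_all_decode and old_is_uniform_decode(
        num_scheduled, uniform_decode_query_len)


num_spec_tokens = 2
query_len = num_spec_tokens + 1  # assumed relation, see lead-in

# A fresh prefill whose prompt happens to be exactly query_len tokens long.
print(old_is_uniform_decode([3], query_len))      # True (misclassified)
print(is_uniform_decode([3], [0], query_len))     # False
# Two genuine decode requests.
print(is_uniform_decode([3, 3], [10, 7], query_len))  # True
```

The extra `num_computed > 0` condition is cheap and separates the two cases without depending on prompt length.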
zhangxinyuehfad	a6745b8577	[CI] fix test_qwen3_moe_external_launcher_ep_tp2 (#6951 ) ### What this PR does / why we need it? fix test_qwen3_moe_external_launcher_ep_tp2 by wait_until_npu_memory_free ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-05 16:43:45 +08:00
whx	16c879cdf7	[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 ) ### What this PR does / why we need it? Add muls_add triton kernel with related fusion pass. In addition, this PR refactors `AscendCompilationConfig` and deletes `NpugraphExConfig`. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with new added test. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-02 17:54:25 +08:00
Bai Yongbin	9d09488b4a	[Feat] support basic pcp&dcp for qwen3next (#6091 ) ### What this PR does / why we need it? This PR implements Context Parallelism (CP) support for the Qwen3-Next model, including PCP (Prefill Context Parallelism) and DCP (Decode Context Parallelism). - vLLM version: v0.15.0 - vLLM main: `f176443446` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Co-authored-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-02-28 21:44:08 +08:00
realliujiaxu	5def28dcd3	[Feat]support sequence parallelism by pass for VL models (#5632 )	2026-02-27 08:27:41 +08:00
starmountain1997	bc1622338c	[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536 ) ### What this PR does / why we need it? This version has no divisibility constraint between tp and mtp+1. However, cudagraph_capture_sizes must be a common multiple of tp and mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed cudagraph_capture_sizes. We added a long-sequence test (64k input, 3k output) for the two-node mixed deployment scenario. Due to the excessive time required for performance benchmarking, we are only verifying functionality. The single-node scenario is skipped because VRAM limitations prevent launching the model with a max-model-len of 68,000. We also added an aime2025 test to the dual-node DeepSeek 3.2 nightly test. ### How was this patch tested? Tested in the nightly environment. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-26 10:58:50 +08:00
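One reading of the constraint stated in #6536 is that every capture size is a common multiple of `tp` and `mtp + 1`, capped at `tp * (mtp + 1)`. The sketch below enumerates the sizes that satisfy this reading. The function name `candidate_capture_sizes` is hypothetical, and the commit does not state which concrete values the tests use.

```python
from math import lcm  # available in Python 3.9+


def candidate_capture_sizes(tp: int, mtp: int) -> list[int]:
    """Common multiples of tp and (mtp + 1), up to tp * (mtp + 1) inclusive."""
    step = lcm(tp, mtp + 1)   # smallest size divisible by both
    upper = tp * (mtp + 1)    # maximum named in the commit message
    return list(range(step, upper + 1, step))


print(candidate_capture_sizes(4, 1))  # [4, 8]
print(candidate_capture_sizes(4, 2))  # [12]
print(candidate_capture_sizes(8, 3))  # [8, 16, 24, 32]
```

When `tp` and `mtp + 1` are coprime, the only admissible size under this reading is the upper bound itself.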
Li-Yongwen	2870f7c8ad	[Feat] Support routing replay (#6696 ) ### What this PR does / why we need it? [Feat] Support routing replay. Same as https://github.com/vllm-project/vllm-ascend/pull/6666; resubmitted because of a DOC failure. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: liyongwen <1310439159@qq.com> Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-26 10:22:47 +08:00
bowenli	e3927cc8f5	[Bugfix] fix bug for mtp (#6514 ) ### What this PR does / why we need it? fix(mtp): resolve MTP core bugs and enhance eager mode test cases 1. Resolved critical issues in eager mode MTP core execution logic; 2. Fixed functional bugs in the _update_states_after_model_execute function; 3. Updated and released test_mtp_qwen3_next.py to validate eager mode acceptance rate. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>	2026-02-25 17:50:57 +08:00
weiguihua2	db51a1b9b6	[Feat]ds3.2 support pcp (#6733 ) ### What this PR does / why we need it? The ds3.2 model adaptation supports the PCP feature. The solution is as follows: When saving the KV cache, first perform an allgather operation on the KVs, and then each node saves its own copy. When the attention or indexer performs calculations, they all gather the KV cache and then perform the calculations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 02/12 23:05:10 - AISBench - INFO - Running 1-th replica of evaluation 02/12 23:05:10 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]: {'accuracy': 96.35416666666667, 'type': 'GEN'} 02/12 23:05:10 - AISBench - INFO - time elapsed: 2.87s 02/12 23:05:12 - AISBench - INFO - Evaluation tasks completed. 02/12 23:05:12 - AISBench - INFO - Summarizing evaluation results... dataset version metric mode vllm-api-general-chat gsm8kdataset - accuracy gen 96.35 - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-25 09:46:57 +08:00
jiahao.quan	7221045777	[Attention] add gpt-oss support (#5901 ) ### What this PR does / why we need it? Please refer to the following link for the historical conversation https://github.com/vllm-project/vllm-ascend/pull/4467. We have made updates in light of the comments from the prior PR review. Given the refactoring of the attention_v1 component, we have carried out necessary adjustments to fit the newly revised code. ### Does this PR introduce _any_ user-facing change? 1. Modified the code in the Attention section to adapt to the SWA and Sink features required by gpt-oss. 2. Modified the code in the MoE section to add support for bias and swigluoai. ### How was this patch tested? Please refer to the https://github.com/vllm-project/vllm-ascend/pull/4467 for performance tests, on the basis of which the accuracy tests from AIME2024 have been newly added. ![img_v3_02tu_501e88e3-2217-4565-8edf-b9acf4f43f2g](https://github.com/user-attachments/assets/024f8283-18ab-4d4d-ab12-27917b5d7d06) - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: mikequan0425 <mikequan0425@foxmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Signed-off-by: pu-zhe <zpuaa@outlook.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: luomin2005 <luomin2005@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leon_tao <taoyao2@huawei.com> Co-authored-by: nurxat <738457498@qq.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: mikequan <199741451@qq.com> Co-authored-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com> Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Co-authored-by: pu-zhe <zpuaa@outlook.com> Co-authored-by: luomin2005 <luomin2005@huawei.com> Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: Cao Yi <slightwindsec@gmail.com> Co-authored-by: Icey <1790571317@qq.com> Co-authored-by: SILONG ZENG <2609716663@qq.com>	2026-02-12 10:55:34 +08:00
wangxiyuan	2a826b5fad	[Misc] upgrade to vllm main (#6646 ) ### What this PR does / why we need it? This PR upgrades the core vLLM dependency to a newer version from the main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is necessary to keep our project up-to-date with the latest features and fixes from upstream vLLM. 1. `ac32e66cf9` pass file is moved. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wxsIcey <1790571317@qq.com>	2026-02-10 14:08:59 +08:00
wangyu	c63b7a1188	[Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301 ) ### What this PR does / why we need it? This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test and by running CI. - vLLM version: release/v0.12.0 --------- Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>	2026-02-06 17:30:17 +08:00
starmountain1997	bfcc372f75	[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6499 ) ### What this PR does / why we need it? This PR enhances the test_deepseek3_2_w8a8_pruning_mtp_tp2_ep E2E test by adding both short and long prompt test cases: - Short test: Validates basic functionality with minimal input ("Hello ") - Long test: Validates the model can handle prompts near its maximum context length (~163K tokens, approaching the max_position_embeddings limit of 163,840) Additionally, explicitly sets max_model_len=163840 to ensure the test properly exercises the model's full context window capability. ### Does this PR introduce _any_ user-facing change? No. This change only affects internal E2E testing infrastructure. ### How was this patch tested? The modified test case will be executed as part of the E2E test suite and has been validated [here](https://github.com/vllm-project/vllm-ascend/actions/runs/21620195055/job/62308026205?pr=6499). - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-04 09:10:50 +08:00
Nengjun Ma	78fad4e348	[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442 ) ### What this PR does / why we need it? Refactor MLP weight prefetch for consistency with the MoE model's prefetching in terms of code and usage. The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is now as follows: --additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}' ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-04 09:08:18 +08:00
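The `--additional-config` value quoted in #6442 is a JSON string. As a minimal sketch, it could be built programmatically rather than written by hand. The helper name `build_prefetch_arg` is hypothetical and not part of vllm-ascend; only the keys shown in the commit message are used.

```python
import json
import shlex


def build_prefetch_arg(gate_up: float = 1.0, down: float = 1.0) -> str:
    """Return a shell-safe --additional-config argument (hypothetical helper)."""
    # Keys mirror the example in the commit message; nothing else is assumed.
    config = {
        "weight_prefetch_config": {
            "enabled": True,
            "prefetch_ratio": {"mlp": {"gate_up": gate_up, "down": down}},
        }
    }
    # shlex.quote wraps the JSON so a POSIX shell passes it as one argument.
    return "--additional-config " + shlex.quote(json.dumps(config))


print(build_prefetch_arg())
```

Serializing with `json.dumps` avoids quoting mistakes such as Python's `True` leaking into the payload where JSON expects `true`.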
Feng Liu	03a18ad6fd	[E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149 ) ### What this PR does / why we need it? Add E2E for Prefix Caching cp & Chunked Prefill cp ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2026-02-03 15:04:14 +08:00
LeeWenquan	b1de6cbb31	[Bugfix][CI]Add qwen3Next MTP+Full Decode (#6047 ) ### What this PR does / why we need it? Fix a bug in the repo and add a test case for MTP + Full Decode Only + Qwen3Next. The _build_dummy_attn_metadata function in NPUModelRunner appears to be missing a query_start_loc.copy_to_gpu operation, which leads to a difference between query_start_loc and query_start_loc_cpu; the two are required to be the same in the MTP + Full Decode Only + Qwen3Next case. Before this PR: `self.query_start_loc = [0, 0, 0, 0, ... , 0] self.query_start_loc_cpu = [0, 2, 4, 6, ... ,128]` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-02-03 14:26:21 +08:00
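The two arrays quoted in #6047 can be read as cumulative query offsets. The following is a minimal sketch in plain Python of what the CPU copy holds and why a device copy that is never refreshed disagrees with it. It assumes 64 requests with 2 query tokens each, an assumption inferred from the `[0, 2, 4, ..., 128]` example and not stated in the commit. `device_copy` is a hypothetical stand-in list, not an NPU tensor, and the slice assignment only imitates the missing copy step.

```python
from itertools import accumulate


def build_query_start_loc(query_lens):
    """Prefix sums of per-request query lengths, starting at 0."""
    return [0] + list(accumulate(query_lens))


# Assumed batch: 64 requests, 2 query tokens per request.
query_lens = [2] * 64
query_start_loc_cpu = build_query_start_loc(query_lens)

# Stale device-side buffer: all zeros because the copy was never issued.
device_copy = [0] * len(query_start_loc_cpu)
assert device_copy != query_start_loc_cpu

# Stand-in for the host-to-device copy that brings both views in sync.
device_copy[:] = query_start_loc_cpu
assert device_copy == query_start_loc_cpu
```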
LHXuuu	45a573cff1	[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W4A8 dynamic weight. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: menogrey <1299267905@qq.com>	2026-02-02 16:39:32 +08:00
Qiu	638cae824d	[bugfix](CP) Fix and unify the PD request discrimination logic. (#5939 ) ### What this PR does / why we need it? Since the PR (https://github.com/vllm-project/vllm/pull/32118) has modified the criteria for judging Prefill and Decode requests in vLLM, PCPManager needs to synchronize with this standard. As PCPManager involves multiple calculations of PD request counts, this PR attempts to consolidate the related logic and update the PD request count once per batch. ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ``` - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-31 10:26:02 +08:00
wjunLu	4970de4242	[CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195 ) ### What this PR does / why we need it? Enable the tests that were skipped due to an outdated driver version: - tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py - tests/e2e/multicard/4-cards/long_sequence/test_basic.py - tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py and some cases in - tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py - tests/e2e/multicard/2-cards/test_external_launcher.py - tests/e2e/multicard/2-cards/test_offline_weight_load.py - tests/e2e/multicard/2-cards/test_quantization.py - tests/e2e/multicard/4-cards/test_data_parallel_tp2.py TODO: - tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py - tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-29 22:41:41 +08:00
Qiu	50e0e87646	[bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344 ) ### What this PR does / why we need it? PR #5672 attempted to remove the -1 padding for duplicate tokens in the decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler slicing approach. However, in the single-ops logic and mixed PD batches, the decode slot_mapping did not eliminate the -1 and also shared the slicing method, resulting in incorrect slot_mapping. This PR resolves this issue, and the logic will be further consolidated in subsequent refactoring PRs. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-29 16:48:37 +08:00
wangxiyuan	f8e76a49fa	[CI] Upgrade transformers version (#6307 ) Upgrade transformers to >=4.56.4 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-28 14:06:39 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors: https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅ Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅ Upgrade vllm commit to: 0120 (148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors: https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅ Upgrade vllm commit to: 0121 (f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅ Upgrade vllm commit to: 0122 (8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅ Upgrade vllm commit to: 0123 (dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
Li Wang	c38c838d03	[CI] Decrease Qwen3 dense model output throughput baseline to make ci happy (#6233 ) ### What this PR does / why we need it? As https://github.com/vllm-project/vllm-ascend/actions/runs/21327913593/job/61388195448 shows, I encountered two CI failures. The results consistently pointed to a reduced value (1600 -> 1514). - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:04:13 +08:00


207 Commits