xc-llm-ascend

Author	SHA1	Message	Date
yupeng	830f39dd70	[Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650 ) ### What this PR does / why we need it? Fix the issue #6143 . ### Does this PR introduce _any_ user-facing change? Allow to start the server with "--enable-lora && --fully-sharded-loras && --tensor_parallel_size 2". ### How was this patch tested? pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-11 15:43:15 +08:00
pz1116	a7f91fce71	[KV Pool]get_num_new_matched_tokens return 0 if token length < block_size (#7146 ) ### What this PR does / why we need it? Currently, we call lookup_client for looking up token hit in KV Pool, however, when token length < block size, the key will be empty and there is no point to lookup in KV Pool backend since there will never be a hit. Hence, add early return in `get_num_new_matched_tokens` when `token_len` < `block_size` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: fems14 <1804143737@qq.com>	2026-03-11 15:05:34 +08:00
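The early return described in this commit can be sketched as below. This is a minimal illustration rather than the vllm-ascend implementation: the `lookup` callable standing in for the KV Pool lookup client and the simplified function signature are assumptions made for the example.

```python
from typing import Callable, Sequence


def get_num_new_matched_tokens(
    token_ids: Sequence[int],
    block_size: int,
    lookup: Callable[[Sequence[int]], int],
) -> int:
    """Return how many prompt tokens hit the KV pool (simplified sketch)."""
    token_len = len(token_ids)
    # With fewer tokens than one block, no complete block key can be formed,
    # so the backend can never report a hit. Skip the lookup round trip.
    if token_len < block_size:
        return 0
    # `lookup` is a hypothetical stand-in for the real lookup client call.
    return lookup(token_ids)
```

The guard only avoids a lookup that cannot succeed; prompts of at least one block still go through the normal path.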
Mengqing Cao	1a83c8e2f5	[CI] Build Image for v0.16.0rc1 (#7155 ) ### What this PR does / why we need it? Build Image for v0.16.0rc1 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-11 14:48:50 +08:00
SILONG ZENG	90aa048e60	[CI] Skip `test_mooncake_layerwise_connector.py` in `ut` (#7147 ) ### What this PR does / why we need it? The `test_mooncake_layerwise_connector.py` file in the `ut` test will be skipped for now and fixed later. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-11 11:46:29 +08:00
zxr2333	e16009b2cc	[BugFix]Fix recomputed scheduler bug (#7137 ) ### What this PR does / why we need it? Fix the wrong usage of `model_type`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-11 00:32:19 +08:00
SparrowMu	54668e73c5	[Model] Support Minimax-m2.5 on NPU (#7105 ) ### What this PR does / why we need it? Initial version to support minimax-m2.5 on vllm-ascend. This commit converts the original fp8 weights to quantized bf16 to support Minimax-m2.5 on NPU. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` ### Test Report Self-tested precision summary (the official AIME2025 precision score is 86.3): <img width="426" height="84" alt="image" src="https://github.com/user-attachments/assets/a3ce2452-92fa-4713-962e-862248e0b61a" /> --------- Signed-off-by: limuyuan <limuyuan3@huawei.com> Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com> Co-authored-by: limuyuan <limuyuan3@huawei.com>	2026-03-11 00:12:02 +08:00
zxr2333	239683c7a6	[P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (#7022 ) ### What this PR does / why we need it? Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-10 23:59:20 +08:00
pppeng	0f289fa2a8	Add patch_qwen3_5 for triton ops fused_recurrent_gated_delta_rule (#7109 ) ### What this PR does / why we need it? The ops `torch_npu.npu_recurrent_gated_delta_rule` currently does not support `ssm_state` inputs in float32 format, we temporarily retain the _forward_core implementation with triton for Qwen3_5 --------- Signed-off-by: pppeng <zepengliu912@qq.com> Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>	2026-03-10 23:28:58 +08:00
Canlin Guo	a78a00e0b1	[Doc][ReleaseNote] Add release notes for v0.16.0rc1 (#7067 ) Add release notes for v0.16.0rc1 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Canlin Guo <961750412@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-03-10 22:45:05 +08:00
Li Wang	881c38d210	[Misc] Download on both hk and guiyang region (#7129 ) ### What this PR does / why we need it? Since the PVC files for Guiyang and Hong Kong are not shared, we need to trigger the download of both regions simultaneously when downloading the model to ensure that the models in all regions are synchronized. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-10 19:22:32 +08:00
shaopeng-666	6e8d3681ae	[bugdix] The problem that the w4a8 weight fails to be loaded when the EP is not enabled is resolved. (#7090 ) ### What this PR does / why we need it? This bug fix resolves the issue where an MoE model fails to load quantized weights in w4a8 format when EP is not enabled. The parameters ["weight_scale_second", "weight_offset_second", "scale_bias"] are now always parsed in per-group mode, regardless of other conditions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2026-03-10 16:57:05 +08:00
lilinsiman	a5ea699e29	[eagle][cp] fix eagle_cp enable bug2 (#7079 ) ### What this PR does / why we need it? Fix acceptance and high-concurrency bugs that occur when eagle3 and CP are both enabled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests and UT. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-10 16:32:49 +08:00
zhangxinyuehfad	67d40f23fd	[CI]Upgrade niglty multi-node-tests max-parallel to 2 (#7035 ) ### What this PR does / why we need it? 1. Increase nightly multi-node test max-parallel from 1 to 2, and fix resource conflicts that arise when tests run concurrently. 2. Fix parse-trigger job: Add an if condition so it only runs on schedule, workflow_dispatch, or PRs labeled nightly-test 3. Adjust nightly schedule: Shift trigger time from 24:00 to 23:45 (UTC+8) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-10 16:25:51 +08:00
pu-zhe	5df450bca4	[Feat] [310p] Support w8a8sc quantization method (#7075 ) ### What this PR does / why we need it? New Quantization Method: Introduced support for the W8A8SC static linear quantization scheme specifically for 310P hardware, enabling more efficient model compression. Refactored the save_sharded_state_310.py to avoid multi-process issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? W8A8SC quant E2E test. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-10 16:13:20 +08:00
Frank Chen	14c71b19e1	[Doc][CPU binding] Add user/developer guide for CPU binding (#7045 ) ### What this PR does / why we need it? This PR adds comprehensive documentation for the CPU binding feature on Ascend NPUs. It includes: - A detailed developer guide (`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering the design, internal logic, allocation examples, and troubleshooting for the CPU binding mechanism. - A concise user guide (`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the core concepts, usage, and common issues for end-users. - An update to `additional_config.md` to use consistent terminology for binding strategies (`global-slicing` and `topo-affinity`). This documentation is needed to help both developers and users understand, use, and debug the CPU binding feature, which is critical for performance on ARM+Ascend platforms. ### Does this PR introduce _any_ user-facing change? No. This is a documentation-only update. ### How was this patch tested? The documentation has been reviewed for clarity and technical accuracy. The examples and descriptions align with the implementation in `vllm_ascend/cpu_binding.py`. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Signed-off-by: c00818886 <chenchuwei@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-03-10 15:59:31 +08:00
Li Wang	33234aa0c5	Revert "[Feature][Quant] Auto-detect quantization format from model f… (#6873 ) This reverts commit `3953dcf784`. to keep the basic functions available --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-10 11:27:32 +08:00
yupeng	40f7d93f1a	[bugfix][LoRA] Fix the lora accuracy issue introduced by the upstream vLLM changed. (#6958 ) ### What this PR does / why we need it? Fix the LoRA e2e test accuracy issue introduced by the upstream PR https://github.com/vllm-project/vllm/pull/32005 ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_llama32_lora.py - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: paulyu12 <507435917@qq.com> Signed-off-by: yupeng <507435917@qq.com>	2026-03-10 10:43:18 +08:00
ZRJ026	a398fa6a0b	[Bugfix]: correct streaming content-type in load balance proxy server (#6985 ) Set proper 'text/event-stream; charset=utf-8' media type for streaming requests instead of hardcoded 'application/json' ### What this PR does / why we need it? This PR fixes an issue in the disaggregated prefill proxy server where streaming requests (`"stream": true`) were always returned with a hardcoded `Content-Type: application/json`, even when the backend vLLM servers correctly returned Server-Sent Events (SSE) with `Content-Type: text/event-stream; charset=utf-8`. Specifically, the proxy used `StreamingResponse` with a fixed `media_type` of `application/json`, which caused FastAPI to override the response headers and break proper SSE semantics. As a result, clients (e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not reliably receive token-by-token streaming output. In addition, this incorrect response type causes compatibility issues with benchmarking and load-testing tools such as EvalScope. When streaming is enabled, these tools expect SSE-formatted responses to correctly parse token usage information. With the incorrect `application/json` content type, EvalScope fails to parse the response and reports errors similar to:`2025-12-15 09:27:56 - evalscope - ERROR: Failed to parse usage from response: list index out of range. Response: []` This PR updates the proxy to: - Detect whether the incoming request is a streaming request (`stream=true`) - Use `text/event-stream; charset=utf-8` for streaming responses - Preserve `application/json` for non-streaming responses This aligns the proxy behavior with native vLLM prefill/decoder servers and the OpenAI-compatible streaming API contract. Fixes incorrect streaming response headers that prevented proper real-time token delivery. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? 
This change was tested manually using a disaggregated prefill + decode setup with the proxy server. ### Test Steps 1. Start prefiller and decoder vLLM servers: ```bash vllm serve --host 0.0.0.0 --port 8001 ... vllm serve --host 0.0.0.0 --port 8002 ... ``` 2. Start the proxy server: ```bash python load_balance_proxy_server_example.py \ --host 127.0.0.1 --port 8000 \ --prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \ --decoder-hosts 127.0.0.1 --decoder-ports 8002 ``` 3. Send a streaming completion request through the proxy: ```bash curl -i -X POST http://127.0.0.1:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "test", "prompt": "hello", "max_tokens": 3, "stream": true }' ``` 4. Verify the following: - The response header is Content-Type: text/event-stream; charset=utf-8 - Tokens are streamed incrementally as SSE data: events - Non-streaming requests still return application/json No automated tests were added because this change affects an example proxy server and is limited to HTTP response headers. The behavior is directly verifiable using standard SSE-compatible clients. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: zrj026 <zhangrunjiang026@gmail.com> Co-authored-by: zrj026 <zhangrunjiang026@gmail.com>	2026-03-10 10:11:35 +08:00
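The media-type selection described in this commit might look roughly like the sketch below. It is an illustration only, not the proxy's code: the helper name `select_media_type` and the two constants are assumptions, and the example deliberately avoids any web framework so it runs on its own.

```python
import json

SSE_MEDIA_TYPE = "text/event-stream; charset=utf-8"
JSON_MEDIA_TYPE = "application/json"


def select_media_type(request_body: bytes) -> str:
    """Pick the response media type from the raw JSON request body."""
    try:
        payload = json.loads(request_body)
    except ValueError:
        # Malformed or non-UTF-8 bodies fall back to the non-streaming default.
        return JSON_MEDIA_TYPE
    # Only an explicit `"stream": true` switches the response to SSE.
    if isinstance(payload, dict) and payload.get("stream") is True:
        return SSE_MEDIA_TYPE
    return JSON_MEDIA_TYPE
```

In a FastAPI-based proxy, the returned string would presumably be passed as the `media_type` argument of `StreamingResponse` instead of a hardcoded value.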
NJX	bb7ed759d4	[Doc] Fix broken chunked-prefill URL in supported features (#6963 ) ## What this PR does / why we need it? Fixes the broken URL for chunked-prefill in the supported features documentation page. The chunked prefill documentation URL was moved from `performance/optimization.html` to `configuration/optimization.html` in upstream vLLM docs. This PR updates the link to point to the correct location. Before: https://docs.vllm.ai/en/stable/performance/optimization.html#chunked-prefill (404) After: https://docs.vllm.ai/en/stable/configuration/optimization.html#chunked-prefill (working) ## Does this PR introduce _any_ user-facing change? Yes - fixes a broken documentation link that users encounter when clicking 'Chunked Prefill' in the supported features page. ## How was this patch tested? - Verified the new URL resolves correctly - Documentation change only Closes #4217 - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: NJX-njx <3771829673@qq.com>	2026-03-10 10:10:07 +08:00
NJX	9b30d4e774	[Doc][Misc] Add metrics usage documentation and example (#6962 ) ## What this PR does / why we need it? This PR addresses issue #5027 where users find that `output.metrics` returns `None` when using the vLLM offline inference API. Root Cause: vLLM disables log stats by default (`disable_log_stats=True`), which causes `output.metrics` to be `None`. Changes: 1. Added a NOTE comment in `examples/offline_inference_npu.py` explaining how to enable metrics 2. Created a new example `examples/offline_inference_metrics.py` demonstrating how to access request-level metrics (`first_token_time`, `finished_time`, etc.) by setting `disable_log_stats=False` ## Does this PR introduce _any_ user-facing change? Yes - adds documentation and example code to help users understand how to access output metrics. ## How was this patch tested? - Documentation/example change only - Verified example code follows the same patterns as existing examples Closes #5027 - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: NJX-njx <3771829673@qq.com>	2026-03-10 10:09:50 +08:00
Yikun Jiang	326fd359aa	[Docs] add and publish llms.txt for LLM discovery (#6886 ) ### What this PR does / why we need it? - move llms.txt under docs/source and publish it at /llms.txt via html_extra_path - rewrite llms.txt to an LLM-friendly link index - use _sources markdown links and include missing entry points such as FAQs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2026-03-10 10:06:27 +08:00
ZKSU	bdad11e9a8	[doc] Update GLM4.x.md, add GLM4.x multi-node deploy tutorial (#6872 ) ### What this PR does / why we need it? This PR updates the GLM4.x documentation by adding a multi-node deployment tutorial, using 2 × Atlas 800 A2 (64G × 8) as the example. - What changed: Added instructions for deploying GLM-4.X models across multiple nodes, including environment variables and example commands. - Why needed: Although the previous tutorial stated that multi-node deployment on Atlas 800 A2 (64GB × 8) is not recommended, there are still situations where GLM-4.7 must be deployed on 2 × Atlas 800 A2 (64G × 8). We successfully ran GLM-4.7 on 2 nodes and it works fine, so it seems time to update this part. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Verified that the new documentation renders correctly in Markdown format. - Tested the multi-node deployment steps on 2 × Atlas 800 A2 (64G × 8) to ensure the commands work as described. - Confirmed that existing GLM4.x documentation links and structure remain intact. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: ZKSU <zksu@outlook.com>	2026-03-10 10:01:53 +08:00
xleoken	146b9d2a83	[BugFix] fix metadata execute error: integer modulo by zero (#6521 ) ### What this PR does / why we need it? fix metadata execute error: integer modulo by zero - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: xleoken <xleoken@163.com>	2026-03-10 09:58:06 +08:00
meihanc	f6db47f103	[CI] fix skiped e2e test when upgrade vllm version (#6654 ) ### What this PR does / why we need it? Fix test_aclgraph_capture_replay.py being skipped when the vLLM version is upgraded. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-03-10 09:55:35 +08:00
SILONG ZENG	43df2cb2fc	[Lint]Style: Convert `test/` to ruff format(Batch #1 ) (#6738 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `tests/e2e/310p/multicard/test_vl_model_multicard.py` \| \| `tests/e2e/310p/singlecard/test_vl_model_singlecard.py` \| \| `tests/e2e/310p/test_utils.py` \| \| `tests/e2e/conftest.py` \| \| `tests/e2e/model_utils.py` \| \| `tests/e2e/models/conftest.py` \| \| `tests/e2e/models/test_lm_eval_correctness.py` \| \| `tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py` \| \| `tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py` \| \| `tests/e2e/multicard/2-cards/test_data_parallel.py` \| \| `tests/e2e/multicard/2-cards/test_disaggregated_encoder.py` \| \| `tests/e2e/multicard/2-cards/test_expert_parallel.py` \| \| `tests/e2e/multicard/2-cards/test_external_launcher.py` \| \| `tests/e2e/multicard/2-cards/test_full_graph_mode.py` \| \| `tests/e2e/multicard/2-cards/test_ilama_lora_tp2.py` \| \| `tests/e2e/multicard/2-cards/test_offline_inference_distributed.py` \| \| `tests/e2e/multicard/2-cards/test_offline_weight_load.py` \| \| `tests/e2e/multicard/2-cards/test_pipeline_parallel.py` \| \| `tests/e2e/multicard/2-cards/test_prefix_caching.py` \| \| `tests/e2e/multicard/2-cards/test_quantization.py` \| \| `tests/e2e/multicard/2-cards/test_qwen3_moe.py` \| \| `tests/e2e/multicard/2-cards/test_qwen3_moe_routing_replay.py` \| \| `tests/e2e/multicard/2-cards/test_qwen3_performance.py` \| \| `tests/e2e/multicard/2-cards/test_shared_expert_dp.py` \| \| `tests/e2e/multicard/2-cards/test_single_request_aclgraph.py` \| \| `tests/e2e/multicard/2-cards/test_sp_pass.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-10 09:52:50 +08:00
xmpp777	9216e1b050	[fix] Add support for Qwen3.5 Dense and MoE on Ascend (#6933 ) ### What this PR does / why we need it? This pull request introduces support for the Qwen3.5 MoE model on Ascend devices. The key changes are: * Quantization Configuration for Qwen3.5 MoE: Adds necessary prefix mappings and packed module definitions for `qwen3_5_moe` in `vllm_ascend/quantization/modelslim_config.py` to enable ModelSlim quantization. * Triton Kernel Fix: Corrects a bug in the `fused_gdn_gating` Triton kernel. The calculation for `BLK_BATCHES` had an operator precedence issue which is now resolved. The calculation has also been made more robust with added clamping to prevent potential out-of-bounds memory access in the unified buffer. These changes enable the correct and efficient execution of Qwen3.5 MoE models on Ascend hardware. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI should be used to verify the correctness of these changes. It is recommended to run tests with the Qwen3.5 MoE model to ensure the new configurations and the kernel fix work as expected. Signed-off-by: xmpp777 <yangming2@huawei.com>	2026-03-10 09:09:31 +08:00
dependabot[bot]	3b25ded8b7	[CI] Bump docker/metadata-action from 5 to 6 (#7069 ) Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 5 to 6. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-10 09:06:04 +08:00
dependabot[bot]	2325bbe79b	[CI] Bump actions/checkout from 4 to 6 (#7070 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-10 09:05:22 +08:00
ZT-AIA	ee5347e824	[qwen3 next ]add ascend c casual_conv1d_fn (#6661 ) ### What this PR does / why we need it? add ascend c casual_conv1d_fn - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-09 23:29:49 +08:00
Hexiang Wang	48b624e4cc	[BugFix] Fix implementation bug of triton rope_siso (#7082 ) ### What this PR does / why we need it? The previous implementation of triton rope_siso did not store the second half of the rope results, which causes: 1. an accuracy problem in the neox-style scenario 2. ub overflow in the non-neox-style scenario. This PR fixes the bug and adds a nightly test case for it. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-09 23:08:43 +08:00
liuchen2026fly	542258ac9d	[feat] parameterize hardcoded MLA dimensions to support GLM5-W8A8 (#6902 ) Derive MLA dimension constants (q_lora_rank, qk_nope_head_dim, etc.) from tensor shapes at runtime instead of hardcoding DeepSeek V3 values. This enables the mla_preprocess fused op to work with both DeepSeek V3 and GLM5 models without Python API changes. - Add 9 dimension fields to MlaTilingData with DeepSeek V3 defaults - Add OpParam fields and dynamize all host-side tiling functions - Derive dimensions from wuk, gamma1, kv_cache_rope tensor shapes - Replace 310+ hardcoded constants across 4 kernel .hpp files - Remove unused MMSIZE1/MMSIZE2 constants ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: liuchenbing <chenliumail@163.com> Co-authored-by: liuchenbing <chenliumail@163.com>	2026-03-09 20:17:21 +08:00
Qiu	13adcbe44b	feat(attention_cp): support chunked prefill for Qwen3Next with PCP&DCP (#6900 ) ### What this PR does / why we need it? Support chunked prefill for Qwen3Next with PCP&DCP - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-09 17:55:09 +08:00
LI SHENGYONG	a76a509fae	[MOE][Bugfix] Cancel H2D for expert_map (#7000 ) ### What this PR does / why we need it? If expert_map is on the device, there may be occasional repeated answers in long output scenarios. dsv3.2-exp-w8a8 No garbled characters are displayed in the output. \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2025 \| ef2f4f \| accuracy \| gen \| 60.00 \| - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-09 17:53:54 +08:00
王远	82fdd40d49	[Feat]Xlite Qwen3 MoE Support Data Parallel (#6715 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE data parallel in Xlite. For more details about Xlite, please refer to the following link:[https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md). online server config: ```shell port=$1 log=$2 export VLLM_USE_V1=1 export TASK_QUEUE_ENABLE=1 export HCCL_BUFFSIZE=512 export HCCL_OP_EXPANSION_MODE="AIV" export OMP_PROC_BIND=false export VLLM_ASCEND_ENABLE_NZ=0 sysctl -w vm.swappiness=0 sysctl -w kernel.numa_balancing=0 sysctl kernel.sched_migration_cost_ns=50000 ip=127.0.0.1 python -m vllm.entrypoints.openai.api_server \ --model /mnt/nvme1n1/wy/models/Qwen3-30B-A3B \ --tensor-parallel-size 2 \ --enable-expert-parallel \ --data-parallel-size 4 \ --gpu-memory-utilization 0.9 \ --max-num-batched-tokens 32768 \ --data-parallel-size-local 4 \ --max-num-seqs=200 \ --block-size 128 \ --max-model-len 6656 \ --trust-remote-code \ --disable-log-requests \ --served-model-name qwen \ --no-enable-prefix-caching \ --additional-config '{"xlite_graph_config": {"enabled": true, "full_mode": true}, "enable_cpu_binding": true}' \ --compilation-config '{"cudagraph_capture_sizes":[1, 16, 32, 48, 64, 100, 150, 200], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --async-scheduling \ --host ${ip} \ --port ${port} > ${log} 2>&1 & ``` test_config: ```shell vllm bench serve \ --max-concurrency ${maxconcurrency} \ --num-prompts ${num_prompts} \ --host ${HOST} \ --port ${PORT} \ --model ${MODEL_NAME} \ --dataset-name random \ --backend openai-chat \ --random-input-len 512 \ --random-output-len 512 \ --random-range-ratio 0.2 \ --temperature 0.6 \ --metric-percentiles "50,90,99" \ --tokenizer ${TOKENIZER_PATH} \ --endpoint /v1/chat/completions \ --ignore-eos ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
- vLLM version: v0.16.0 - vLLM main: `c86cdcbcd2` Signed-off-by: uuzWY <Ethan.wangyuan@huawei.com> Co-authored-by: uuzWY <Ethan.wangyuan@huawei.com>	2026-03-09 17:53:35 +08:00
Shaoxu Cheng	ba1c82e758	[DOC] Add explaination of 310p special param: max-model-len (#7065 ) ### What this PR does / why we need it? This PR updates the documentation for running vLLM on Atlas 300I series (310p) hardware. It adds a warning to explicitly set `--max-model-len` to prevent potential Out-of-Memory (OOM) errors that can occur with the default configuration. The example commands and Python scripts for online and offline inference have been updated to: - Include `--max-model-len 4096` (or `max_model_len=4096`). - Remove the `compilation-config` parameter, which is no longer necessary for 310p devices. These changes ensure users have a clearer and more stable experience when using vLLM on Atlas 300I hardware. ### Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ### How was this patch tested? The changes are to documentation and do not require testing. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-09 16:54:43 +08:00
wanghuanjun2113	dec04ec8d8	[Bugfix] Fix incorrect layer count for MTP models in update_aclgraph_sizes (#7064 ) ## Summary - Fix incorrect layer count calculation for MTP (Multi-Token Prediction) models in `update_aclgraph_sizes()` function - For MTP models, the draft model's layer count is stored in `num_nextn_predict_layers` or `mtp_num_hidden_layers` (for Qwen3.5), not in the standard `num_hidden_layers` field - Directly accessing `draft.hf_config.num_hidden_layers` returns the main model's layer count instead of the MTP draft model's layer count ## Bug Description In `vllm_ascend/utils.py`, the `update_aclgraph_sizes()` function calculates `resources_per_graph` for speculative decoding scenarios. When calculating the resources needed for the draft model, the original code directly accessed: ```python resources_per_graph += draft.hf_config.num_hidden_layers + 1 ``` This works correctly for standard draft models, but fails for MTP models (like DeepSeek-V3's MTP or Qwen3.5's MTP) because: 1. MTP models store their layer count in model-specific fields: - `num_nextn_predict_layers` (DeepSeek-V3 MTP) - `mtp_num_hidden_layers` (Qwen3.5 MTP) 2. The `num_hidden_layers` field in these models contains the main model's layer count, not the MTP layer count 3. This leads to grossly overestimating the `resources_per_graph`, which in turn causes the calculated `max_batch_sizes` to be unnecessarily small ## Fix Use `draft.get_total_num_hidden_layers()` instead of directly accessing `draft.hf_config.num_hidden_layers`. 
This method correctly handles different model types through the `model_arch_config_convertor` infrastructure, returning the appropriate layer count for: - Standard draft models → `num_hidden_layers` - DeepSeek-V3 MTP → `num_nextn_predict_layers` - Qwen3.5 MTP → `mtp_num_hidden_layers` 🤖 Generated with [Claude Code](https://claude.com/claude-code) - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wanghuanjun2113 <wanghuanjun2113@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 16:14:51 +08:00
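The field mapping listed in this commit can be illustrated with a small stand-alone helper. This is only a sketch of the mapping, not how `get_total_num_hidden_layers()` or `model_arch_config_convertor` are implemented: the helper name `resolve_draft_num_layers` is hypothetical, and the config objects are plain namespaces with made-up layer counts.

```python
from types import SimpleNamespace

# MTP-specific fields named in the commit message, checked before the default.
MTP_LAYER_FIELDS = ("num_nextn_predict_layers", "mtp_num_hidden_layers")


def resolve_draft_num_layers(hf_config) -> int:
    """Return the draft model's layer count from a draft config (sketch)."""
    for field in MTP_LAYER_FIELDS:
        value = getattr(hf_config, field, None)
        if value is not None:
            return int(value)
    # Standard draft models keep their own count in num_hidden_layers.
    return int(hf_config.num_hidden_layers)


# Illustrative configs; the numbers are assumptions, not taken from the commit.
standard_draft = SimpleNamespace(num_hidden_layers=4)
mtp_draft = SimpleNamespace(num_hidden_layers=61, num_nextn_predict_layers=1)

# Mirrors the `+ 1` in the resources_per_graph snippet quoted in the commit.
resources_standard = resolve_draft_num_layers(standard_draft) + 1
resources_mtp = resolve_draft_num_layers(mtp_draft) + 1
```

With the buggy direct access, the MTP config above would contribute 62 instead of 2, which is the overestimate the commit describes.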
guanguan0308	4b4961ba5f	[fix]Resolve compilation errors that occur when building versions subsequent to b020 (#7059 ) ### What this PR does / why we need it? Resolve compilation errors that occur when building versions subsequent to b020. Root Cause: During operator compilation, we previously modified the names of structs HcclOpResParam and HcclRankRelationResV2 in the moe_distribute_base.h file. After version b020, moe_distribute_base.h was updated with additional code that references these two structs. This resulted in compilation errors, as renaming the structs alone broke the newly added references to them. Solution: We have added the moe_distribute_base.h file to the operator implementation. This avoids compilation errors caused by updates to this file in the CANN framework. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: guanguan0308 <1546542263@qq.com>	2026-03-09 16:09:35 +08:00
LoganJane	eb648f7398	[Bugfix] Support quant config in glm46v (#7062 ) ### What this PR does / why we need it? We need to support quant config in glm46v. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We tested w8a8 weights produced with the 'Ascend/msit' quantization method. They ran successfully on NPU with vllm-ascend. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com> Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>	2026-03-09 16:07:16 +08:00
tanhaoan333	57c554a23f	[bugfix]Fix parameter ordering bug in _merge_multimodal_embeddings (#7068 ) ### What this PR does / why we need it? This PR fixes a bug in the `_merge_multimodal_embeddings` function where the parameter order was incorrect. The `multimodal_embeddings` and `is_multimodal` parameters were swapped, which would lead to runtime errors when the function is called with positional arguments. This change corrects the function signature to align with its expected usage, ensuring that multimodal embeddings are correctly merged. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix for an internal utility function and has no user-facing impact. ### How was this patch tested? The correctness of this fix is validated by existing tests for multimodal functionality. With the incorrect function signature, these tests would fail due to argument type mismatches. CI passing confirms the fix is effective. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>	2026-03-09 16:05:52 +08:00
Cao Yi	cb4c7de856	[Perf] Optimize MTP execution by reordering state update operation (#6844 ) ## Summary - Move `_update_states_after_model_execute` call from after main model sampling to after draft model execution - This reordering reduces pipeline bubbles between main model and draft model execution - No accuracy impact - the state update operation is independent of draft token proposal ## Performance Impact Reduces idle time between main model and draft model execution stages, improving overall MTP (Multi-Token Prediction) performance. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>	2026-03-09 15:55:27 +08:00
zxr2333	d39d80830c	[KVCache]Qwen3.5 supports contiguous tensor hybrid-attn kv-cache (#6887 ) ### What this PR does / why we need it? Supports contiguous tensor hybrid-attn kv-cache on fullattn-mamba hybrid model, such as Qwen3Next and Qwen3.5. Due to the restrictions of Ascend operators, all KV tensors, conv tensors, and SSM tensors must be contiguous. Therefore, this PR uses the following solution to generate the KV cache: tensor1: [(kv_padding), conv , ...] tensor2: [k , ssm , ...] tensor3: [v , (mamba_padding), ...] Under this scheme, although some waste may occur, the tensors of all caches are guaranteed to be contiguous. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-09 15:28:40 +08:00
wangxiyuan	482d39c1b0	[community] Update contributor and refresh tool (#7072 ) ### What this PR does / why we need it? This PR refactors the `tools/collect_user_first_contribution.sh` script to improve how we track and update our contributors list. Key changes include: - Incremental Updates: The script can now perform incremental updates by storing and reading the last processed commit hash from `docs/source/community/contributors.md`. This is much more efficient than re-processing all commits every time. - Full Refresh Option: A `--full` flag is added to allow forcing a full recalculation of all contributors, useful for correcting errors or initial setup. - Improved Usage: Replaced positional arguments with command-line flags (`--repo`, `--file`, `--full`) for better usability and clarity. - Robust Contributor-ID detection: Improved logic to find a contributor's GitHub login, including a fallback to parse it from `noreply` email addresses. - In-place File Updates: The script now directly updates the `contributors.md` file with new contributors and correct numbering, automating the entire process. These changes make the process of maintaining the contributors list more automated, reliable, and efficient. ### Does this PR introduce _any_ user-facing change? No, this only changes a developer tool and does not affect the vLLM library's public API or behavior. ### How was this patch tested? The script can be tested locally by running it against the repository. For an incremental update: `GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh` For a full refresh: `GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh --full` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-09 15:19:35 +08:00
Cao Yi	aef9d4249d	[Perf] Avoid CPU sync in mrope_positions copy by using full tensor copy (#7014 ) ### What this PR does / why we need it? The index-select operation `mrope_positions.gpu[:, :total_num_scheduled_tokens].copy_(...)` triggers a CPU-NPU synchronization, which blocks subsequent operator dispatch and causes bubbles visible in Profiling. This PR changes to full tensor copy (`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync point. The trade-off is a negligible increase in memory usage since `mrope_positions.cpu` is a small tensor. Result: ~2-3% TPOT improvement with the profiling bubbles eliminated. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Verified via Profiling that the CPU sync bubble is eliminated and TPOT is reduced by 2-3%. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>	2026-03-09 14:46:37 +08:00
LeeWenquan	65eae6de7b	Add Ascend Ops recurrent_gated_delta_rule (#6725 ) ### What this PR does / why we need it? Change recurrent_gated_delta_rule ops from triton to ascend C version for better performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-03-09 14:14:14 +08:00
JIACHENG XU	23bf5d4d48	[EPLB][bugfix] Bugfix for fused mc2 (#6794 ) ### What this PR does / why we need it? This pull request addresses a bug related to the fused mc2 functionality within the EPLB (Expert Parallelism Load Balancing) system, specifically impacting quantization and MoE communication. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: Spicy-Stick <873805887@qq.com> Signed-off-by: root <root@localhost.localdomain>	2026-03-09 11:26:57 +08:00
Zetong Li	06ec136f08	[Bugfix] Obtain kernel block size for computing slot mapping correctly (#7019 ) ### What this PR does / why we need it? This PR aims to fix incorrect slot mapping in qwen35 due to mismatched block size. In qwen35, we should use `kernel_block_size` so that we can compute it in a correct way, and it is obtained in `load_model` when we have a chance to grab `draft_attn_layers`. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: Zetong Li <slippersss@126.com>	2026-03-09 11:05:01 +08:00
wangxiaoteng888	a3f4f6b10b	[P/D][Bugfix] Layerwise stacking MTP error. (#7036 ) ### What this PR does / why we need it? The community has added a cleaning mechanism for the metadata after the main model finishes running. The MTP layer should not clean the metadata, so a new condition has been added to keep it from doing so. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-03-09 10:55:43 +08:00
zxr2333	675387f1fd	[P/D][KVPool]Mooncake Layerwise Connector supports kv_pool (#7032 ) ### What this PR does / why we need it? This PR creates and registers `ascend_multi_connector`, which allows the `mooncake_layerwise_connector` to use the kv_pooling feature. We unregister the original vllm's `MultiConnector` and replace it with `AscendMultiConnector` when registering the connectors. ### Does this PR introduce _any_ user-facing change? No. User can use `MultiConnector` to initialize `AscendMultiConnector`. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-09 10:49:04 +08:00
drslark	6a7115fa0d	[main][feature] Support quarot for eagle3 without embedding (#7038 ) ### What this PR does / why we need it? When an `eagle3` model without `embed_tokens` is used with a `quarot` target model, the acceptance rate drops. This PR fixes that. The related vLLM PR is https://github.com/vllm-project/vllm/pull/36225. - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-09 10:43:06 +08:00
chenxi-hh	737dfcf638	[MOE] commit GMM custom operator (#7010 ) ### What this PR does / why we need it? GMM custom operator optimization in small batch scenarios ### How was this patch tested? Submit the GMM custom operator for subsequent integration into the MOE process. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: chenxi-hh <chen464822955@163.com> Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>	2026-03-09 09:56:31 +08:00

2567 Commits