xc-llm-ascend

Author	SHA1	Message	Date
yupeng	40f7d93f1a	[bugfix][LoRA] Fix the LoRA accuracy issue introduced by the upstream vLLM change. (#6958 ) ### What this PR does / why we need it? Fix the LoRA e2e test accuracy issue that was introduced by the upstream PR https://github.com/vllm-project/vllm/pull/32005 ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_llama32_lora.py - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: paulyu12 <507435917@qq.com> Signed-off-by: yupeng <507435917@qq.com>	2026-03-10 10:43:18 +08:00
ZRJ026	a398fa6a0b	[Bugfix]: correct streaming content-type in load balance proxy server (#6985 ) Set proper 'text/event-stream; charset=utf-8' media type for streaming requests instead of hardcoded 'application/json' ### What this PR does / why we need it? This PR fixes an issue in the disaggregated prefill proxy server where streaming requests (`"stream": true`) were always returned with a hardcoded `Content-Type: application/json`, even when the backend vLLM servers correctly returned Server-Sent Events (SSE) with `Content-Type: text/event-stream; charset=utf-8`. Specifically, the proxy used `StreamingResponse` with a fixed `media_type` of `application/json`, which caused FastAPI to override the response headers and break proper SSE semantics. As a result, clients (e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not reliably receive token-by-token streaming output. In addition, this incorrect response type causes compatibility issues with benchmarking and load-testing tools such as EvalScope. When streaming is enabled, these tools expect SSE-formatted responses to correctly parse token usage information. With the incorrect `application/json` content type, EvalScope fails to parse the response and reports errors similar to:`2025-12-15 09:27:56 - evalscope - ERROR: Failed to parse usage from response: list index out of range. Response: []` This PR updates the proxy to: - Detect whether the incoming request is a streaming request (`stream=true`) - Use `text/event-stream; charset=utf-8` for streaming responses - Preserve `application/json` for non-streaming responses This aligns the proxy behavior with native vLLM prefill/decoder servers and the OpenAI-compatible streaming API contract. Fixes incorrect streaming response headers that prevented proper real-time token delivery. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? 
This change was tested manually using a disaggregated prefill + decode setup with the proxy server. ### Test Steps 1. Start prefiller and decoder vLLM servers: ```bash vllm serve --host 0.0.0.0 --port 8001 ... vllm serve --host 0.0.0.0 --port 8002 ... ``` 2. Start the proxy server: ```bash python load_balance_proxy_server_example.py \ --host 127.0.0.1 --port 8000 \ --prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \ --decoder-hosts 127.0.0.1 --decoder-ports 8002 ``` 3. Send a streaming completion request through the proxy: ```bash curl -i -X POST http://127.0.0.1:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "test", "prompt": "hello", "max_tokens": 3, "stream": true }' ``` 4. Verify the following: - The response header is Content-Type: text/event-stream; charset=utf-8 - Tokens are streamed incrementally as SSE data: events - Non-streaming requests still return application/json No automated tests were added because this change affects an example proxy server and is limited to HTTP response headers. The behavior is directly verifiable using standard SSE-compatible clients. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: zrj026 <zhangrunjiang026@gmail.com> Co-authored-by: zrj026 <zhangrunjiang026@gmail.com>	2026-03-10 10:11:35 +08:00
NJX	bb7ed759d4	[Doc] Fix broken chunked-prefill URL in supported features (#6963 ) ## What this PR does / why we need it? Fixes the broken URL for chunked-prefill in the supported features documentation page. The chunked prefill documentation URL was moved from `performance/optimization.html` to `configuration/optimization.html` in upstream vLLM docs. This PR updates the link to point to the correct location. Before: https://docs.vllm.ai/en/stable/performance/optimization.html#chunked-prefill (404) After: https://docs.vllm.ai/en/stable/configuration/optimization.html#chunked-prefill (working) ## Does this PR introduce _any_ user-facing change? Yes - fixes a broken documentation link that users encounter when clicking 'Chunked Prefill' in the supported features page. ## How was this patch tested? - Verified the new URL resolves correctly - Documentation change only Closes #4217 - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: NJX-njx <3771829673@qq.com>	2026-03-10 10:10:07 +08:00
NJX	9b30d4e774	[Doc][Misc] Add metrics usage documentation and example (#6962 ) ## What this PR does / why we need it? This PR addresses issue #5027 where users find that `output.metrics` returns `None` when using the vLLM offline inference API. Root Cause: vLLM disables log stats by default (`disable_log_stats=True`), which causes `output.metrics` to be `None`. Changes: 1. Added a NOTE comment in `examples/offline_inference_npu.py` explaining how to enable metrics 2. Created a new example `examples/offline_inference_metrics.py` demonstrating how to access request-level metrics (`first_token_time`, `finished_time`, etc.) by setting `disable_log_stats=False` ## Does this PR introduce _any_ user-facing change? Yes - adds documentation and example code to help users understand how to access output metrics. ## How was this patch tested? - Documentation/example change only - Verified example code follows the same patterns as existing examples Closes #5027 - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: NJX-njx <3771829673@qq.com>	2026-03-10 10:09:50 +08:00
Yikun Jiang	326fd359aa	[Docs] add and publish llms.txt for LLM discovery (#6886 ) ### What this PR does / why we need it? - move llms.txt under docs/source and publish it at /llms.txt via html_extra_path - rewrite llms.txt to an LLM-friendly link index - use _sources markdown links and include missing entry points such as FAQs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2026-03-10 10:06:27 +08:00
ZKSU	bdad11e9a8	[doc] Update GLM4.x.md, add GLM4.x multi-node deploy tutorial (#6872 ) ### What this PR does / why we need it? This PR updates the GLM4.x documentation by adding a multi-node deployment tutorial, for example on 2 × Atlas 800 A2 (64G × 8). - What changed: Added instructions for deploying GLM-4.X models across multiple nodes, including environment variables and example commands. - Why needed: The previous tutorial stated that multi-node deployment on Atlas 800 A2 (64GB × 8) is not recommended, but some situations still require deploying GLM-4.7 on 2 × Atlas 800 A2 (64G × 8). We ran GLM-4.7 successfully on 2 nodes and it works fine, so it is time to update this part. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Verified that the new documentation renders correctly in Markdown format. - Tested the multi-node deployment steps on 2 × Atlas 800 A2 (64G × 8) to ensure the commands work as described. - Confirmed that existing GLM4.x documentation links and structure remain intact. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: ZKSU <zksu@outlook.com>	2026-03-10 10:01:53 +08:00
xleoken	146b9d2a83	[BugFix] fix metadata execute error: integer modulo by zero (#6521 ) ### What this PR does / why we need it? fix metadata execute error: integer modulo by zero - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: xleoken <xleoken@163.com>	2026-03-10 09:58:06 +08:00
meihanc	f6db47f103	[CI] fix skipped e2e test when upgrading vllm version (#6654 ) ### What this PR does / why we need it? Fix test_aclgraph_capture_replay.py being skipped when the vLLM version is upgraded. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-03-10 09:55:35 +08:00
SILONG ZENG	43df2cb2fc	[Lint]Style: Convert `test/` to ruff format(Batch #1 ) (#6738 ) ### What this PR does / why we need it? Scope of changes (file paths): `tests/e2e/310p/multicard/test_vl_model_multicard.py`, `tests/e2e/310p/singlecard/test_vl_model_singlecard.py`, `tests/e2e/310p/test_utils.py`, `tests/e2e/conftest.py`, `tests/e2e/model_utils.py`, `tests/e2e/models/conftest.py`, `tests/e2e/models/test_lm_eval_correctness.py`, `tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py`, `tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py`, `tests/e2e/multicard/2-cards/test_data_parallel.py`, `tests/e2e/multicard/2-cards/test_disaggregated_encoder.py`, `tests/e2e/multicard/2-cards/test_expert_parallel.py`, `tests/e2e/multicard/2-cards/test_external_launcher.py`, `tests/e2e/multicard/2-cards/test_full_graph_mode.py`, `tests/e2e/multicard/2-cards/test_ilama_lora_tp2.py`, `tests/e2e/multicard/2-cards/test_offline_inference_distributed.py`, `tests/e2e/multicard/2-cards/test_offline_weight_load.py`, `tests/e2e/multicard/2-cards/test_pipeline_parallel.py`, `tests/e2e/multicard/2-cards/test_prefix_caching.py`, `tests/e2e/multicard/2-cards/test_quantization.py`, `tests/e2e/multicard/2-cards/test_qwen3_moe.py`, `tests/e2e/multicard/2-cards/test_qwen3_moe_routing_replay.py`, `tests/e2e/multicard/2-cards/test_qwen3_performance.py`, `tests/e2e/multicard/2-cards/test_shared_expert_dp.py`, `tests/e2e/multicard/2-cards/test_single_request_aclgraph.py`, `tests/e2e/multicard/2-cards/test_sp_pass.py` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-10 09:52:50 +08:00
xmpp777	9216e1b050	[fix] Add support for Qwen3.5 Dense and MoE on Ascend (#6933 ) ### What this PR does / why we need it? This pull request introduces support for the Qwen3.5 MoE model on Ascend devices. The key changes are: * Quantization Configuration for Qwen3.5 MoE: Adds necessary prefix mappings and packed module definitions for `qwen3_5_moe` in `vllm_ascend/quantization/modelslim_config.py` to enable ModelSlim quantization. * Triton Kernel Fix: Corrects a bug in the `fused_gdn_gating` Triton kernel. The calculation for `BLK_BATCHES` had an operator precedence issue which is now resolved. The calculation has also been made more robust with added clamping to prevent potential out-of-bounds memory access in the unified buffer. These changes enable the correct and efficient execution of Qwen3.5 MoE models on Ascend hardware. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI should be used to verify the correctness of these changes. It is recommended to run tests with the Qwen3.5 MoE model to ensure the new configurations and the kernel fix work as expected. Signed-off-by: xmpp777 <yangming2@huawei.com>	2026-03-10 09:09:31 +08:00
dependabot[bot]	3b25ded8b7	[CI] Bump docker/metadata-action from 5 to 6 (#7069 ) Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 5 to 6. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-10 09:06:04 +08:00
dependabot[bot]	2325bbe79b	[CI] Bump actions/checkout from 4 to 6 (#7070 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-10 09:05:22 +08:00
ZT-AIA	ee5347e824	[qwen3 next ]add ascend c casual_conv1d_fn (#6661 ) ### What this PR does / why we need it? Add an Ascend C implementation of casual_conv1d_fn. - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-09 23:29:49 +08:00
Hexiang Wang	48b624e4cc	[BugFix] Fix implementation bug of triton rope_siso (#7082 ) ### What this PR does / why we need it? The previous implementation of triton rope_siso did not store the second half of the rope results, which resulted in: 1. an accuracy problem in the neox-style scenario 2. UB overflow in the non-neox-style scenario. This PR fixes it and adds a nightly test case for it. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-09 23:08:43 +08:00
liuchen2026fly	542258ac9d	[feat] parameterize hardcoded MLA dimensions to support GLM5-W8A8 (#6902 ) Derive MLA dimension constants (q_lora_rank, qk_nope_head_dim, etc.) from tensor shapes at runtime instead of hardcoding DeepSeek V3 values. This enables the mla_preprocess fused op to work with both DeepSeek V3 and GLM5 models without Python API changes. - Add 9 dimension fields to MlaTilingData with DeepSeek V3 defaults - Add OpParam fields and dynamize all host-side tiling functions - Derive dimensions from wuk, gamma1, kv_cache_rope tensor shapes - Replace 310+ hardcoded constants across 4 kernel .hpp files - Remove unused MMSIZE1/MMSIZE2 constants ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: liuchenbing <chenliumail@163.com> Co-authored-by: liuchenbing <chenliumail@163.com>	2026-03-09 20:17:21 +08:00
Qiu	13adcbe44b	feat(attention_cp): support chunked prefill for Qwen3Next with PCP&DCP (#6900 ) ### What this PR does / why we need it? Support chunked prefill for Qwen3Next with PCP&DCP - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-09 17:55:09 +08:00
LI SHENGYONG	a76a509fae	[MOE][Bugfix] Cancel H2D for expert_map (#7000 ) ### What this PR does / why we need it? If expert_map is on the device, there may be occasional repeated answers in long output scenarios. Result on dsv3.2-exp-w8a8 (no garbled characters are displayed in the output): dataset aime2025, version ef2f4f, metric accuracy, mode gen, vllm-api-stream-chat 60.00. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-09 17:53:54 +08:00
王远	82fdd40d49	[Feat]Xlite Qwen3 MoE Support Data Parallel (#6715 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE data parallel in Xlite. For more details about Xlite, please refer to the following link:[https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md). online server config: ```shell port=$1 log=$2 export VLLM_USE_V1=1 export TASK_QUEUE_ENABLE=1 export HCCL_BUFFSIZE=512 export HCCL_OP_EXPANSION_MODE="AIV" export OMP_PROC_BIND=false export VLLM_ASCEND_ENABLE_NZ=0 sysctl -w vm.swappiness=0 sysctl -w kernel.numa_balancing=0 sysctl kernel.sched_migration_cost_ns=50000 ip=127.0.0.1 python -m vllm.entrypoints.openai.api_server \ --model /mnt/nvme1n1/wy/models/Qwen3-30B-A3B \ --tensor-parallel-size 2 \ --enable-expert-parallel \ --data-parallel-size 4 \ --gpu-memory-utilization 0.9 \ --max-num-batched-tokens 32768 \ --data-parallel-size-local 4 \ --max-num-seqs=200 \ --block-size 128 \ --max-model-len 6656 \ --trust-remote-code \ --disable-log-requests \ --served-model-name qwen \ --no-enable-prefix-caching \ --additional-config '{"xlite_graph_config": {"enabled": true, "full_mode": true}, "enable_cpu_binding": true}' \ --compilation-config '{"cudagraph_capture_sizes":[1, 16, 32, 48, 64, 100, 150, 200], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --async-scheduling \ --host ${ip} \ --port ${port} > ${log} 2>&1 & ``` test_config: ```shell vllm bench serve \ --max-concurrency ${maxconcurrency} \ --num-prompts ${num_prompts} \ --host ${HOST} \ --port ${PORT} \ --model ${MODEL_NAME} \ --dataset-name random \ --backend openai-chat \ --random-input-len 512 \ --random-output-len 512 \ --random-range-ratio 0.2 \ --temperature 0.6 \ --metric-percentiles "50,90,99" \ --tokenizer ${TOKENIZER_PATH} \ --endpoint /v1/chat/completions \ --ignore-eos ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
- vLLM version: v0.16.0 - vLLM main: `c86cdcbcd2` Signed-off-by: uuzWY <Ethan.wangyuan@huawei.com> Co-authored-by: uuzWY <Ethan.wangyuan@huawei.com>	2026-03-09 17:53:35 +08:00
Shaoxu Cheng	ba1c82e758	[DOC] Add explanation of 310p special param: max-model-len (#7065 ) ### What this PR does / why we need it? This PR updates the documentation for running vLLM on Atlas 300I series (310p) hardware. It adds a warning to explicitly set `--max-model-len` to prevent potential Out-of-Memory (OOM) errors that can occur with the default configuration. The example commands and Python scripts for online and offline inference have been updated to: - Include `--max-model-len 4096` (or `max_model_len=4096`). - Remove the `compilation-config` parameter, which is no longer necessary for 310p devices. These changes ensure users have a clearer and more stable experience when using vLLM on Atlas 300I hardware. ### Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ### How was this patch tested? The changes are to documentation and do not require testing. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-09 16:54:43 +08:00
wanghuanjun2113	dec04ec8d8	[Bugfix] Fix incorrect layer count for MTP models in update_aclgraph_sizes (#7064 ) ## Summary - Fix incorrect layer count calculation for MTP (Multi-Token Prediction) models in `update_aclgraph_sizes()` function - For MTP models, the draft model's layer count is stored in `num_nextn_predict_layers` or `mtp_num_hidden_layers` (for Qwen3.5), not in the standard `num_hidden_layers` field - Directly accessing `draft.hf_config.num_hidden_layers` returns the main model's layer count instead of the MTP draft model's layer count ## Bug Description In `vllm_ascend/utils.py`, the `update_aclgraph_sizes()` function calculates `resources_per_graph` for speculative decoding scenarios. When calculating the resources needed for the draft model, the original code directly accessed: ```python resources_per_graph += draft.hf_config.num_hidden_layers + 1 ``` This works correctly for standard draft models, but fails for MTP models (like DeepSeek-V3's MTP or Qwen3.5's MTP) because: 1. MTP models store their layer count in model-specific fields: - `num_nextn_predict_layers` (DeepSeek-V3 MTP) - `mtp_num_hidden_layers` (Qwen3.5 MTP) 2. The `num_hidden_layers` field in these models contains the main model's layer count, not the MTP layer count 3. This leads to grossly overestimating the `resources_per_graph`, which in turn causes the calculated `max_batch_sizes` to be unnecessarily small ## Fix Use `draft.get_total_num_hidden_layers()` instead of directly accessing `draft.hf_config.num_hidden_layers`. 
This method correctly handles different model types through the `model_arch_config_convertor` infrastructure, returning the appropriate layer count for: - Standard draft models → `num_hidden_layers` - DeepSeek-V3 MTP → `num_nextn_predict_layers` - Qwen3.5 MTP → `mtp_num_hidden_layers` 🤖 Generated with [Claude Code](https://claude.com/claude-code) - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wanghuanjun2113 <wanghuanjun2113@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 16:14:51 +08:00
guanguan0308	4b4961ba5f	[fix]Resolve compilation errors that occur when building versions subsequent to b020 (#7059 ) ### What this PR does / why we need it? Resolve compilation errors that occur when building versions subsequent to b020. Root cause: During operator compilation, we previously modified the names of the structs HcclOpResParam and HcclRankRelationResV2 in the moe_distribute_base.h file. After version b020, moe_distribute_base.h was updated with additional code that references these two structs, so renaming the structs alone broke the newly added references and caused compilation errors. Solution: We have added the moe_distribute_base.h file to the operator implementation, which avoids compilation errors caused by updates to this file in the CANN framework. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: guanguan0308 <1546542263@qq.com>	2026-03-09 16:09:35 +08:00
LoganJane	eb648f7398	[Bugfix] Support quant config in glm46v (#7062 ) ### What this PR does / why we need it? We need to support quant config in glm46v. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We used the 'Ascend/msit' quantization method to test the w8a8 weights. The model ran successfully on NPU with vllm-ascend using the w8a8 weights. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com> Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>	2026-03-09 16:07:16 +08:00
tanhaoan333	57c554a23f	[bugfix]Fix parameter ordering bug in _merge_multimodal_embeddings (#7068 ) ### What this PR does / why we need it? This PR fixes a bug in the `_merge_multimodal_embeddings` function where the parameter order was incorrect. The `multimodal_embeddings` and `is_multimodal` parameters were swapped, which would lead to runtime errors when the function is called with positional arguments. This change corrects the function signature to align with its expected usage, ensuring that multimodal embeddings are correctly merged. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix for an internal utility function and has no user-facing impact. ### How was this patch tested? The correctness of this fix is validated by existing tests for multimodal functionality. With the incorrect function signature, these tests would fail due to argument type mismatches. CI passing confirms the fix is effective. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>	2026-03-09 16:05:52 +08:00
Cao Yi	cb4c7de856	[Perf] Optimize MTP execution by reordering state update operation (#6844 ) ## Summary - Move `_update_states_after_model_execute` call from after main model sampling to after draft model execution - This reordering reduces pipeline bubbles between main model and draft model execution - No accuracy impact - the state update operation is independent of draft token proposal ## Performance Impact Reduces idle time between main model and draft model execution stages, improving overall MTP (Multi-Token Prediction) performance. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>	2026-03-09 15:55:27 +08:00
zxr2333	d39d80830c	[KVCache]Qwen3.5 supports contiguous tensor hybrid-attn kv-cache (#6887 ) ### What this PR does / why we need it? Supports contiguous tensor hybrid-attn kv-cache on fullattn-mamba hybrid model, such as Qwen3Next and Qwen3.5. Due to the restrictions of Ascend operators, all KV tensors, conv tensors, and SSM tensors must be contiguous. Therefore, this PR uses the following solution to generate the KV cache: tensor1: [(kv_padding), conv , ...] tensor2: [k , ssm , ...] tensor3: [v , (mamba_padding), ...] Under this scheme, although some waste may occur, the tensors of all caches are guaranteed to be contiguous. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-09 15:28:40 +08:00
wangxiyuan	482d39c1b0	[community]update contributor and refresh tool (#7072 ) ### What this PR does / why we need it? This PR refactors the `tools/collect_user_first_contribution.sh` script to improve how we track and update our contributors list. Key changes include: - Incremental Updates: The script can now perform incremental updates by storing and reading the last processed commit hash from `docs/source/community/contributors.md`. This is much more efficient than re-processing all commits every time. - Full Refresh Option: A `--full` flag is added to allow forcing a full recalculation of all contributors, useful for correcting errors or initial setup. - Improved Usage: Replaced positional arguments with command-line flags (`--repo`, `--file`, `--full`) for better usability and clarity. - Robust Contributor-ID detection: Improved logic to find a contributor's GitHub login, including a fallback to parse it from `noreply` email addresses. - In-place File Updates: The script now directly updates the `contributors.md` file with new contributors and correct numbering, automating the entire process. These changes make the process of maintaining the contributors list more automated, reliable, and efficient. ### Does this PR introduce _any_ user-facing change? No, this only changes a developer tool and does not affect the vLLM library's public API or behavior. ### How was this patch tested? The script can be tested locally by running it against the repository. For an incremental update: `GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh` For a full refresh: `GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh --full` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-09 15:19:35 +08:00
Cao Yi	aef9d4249d	[Perf] Avoid CPU sync in mrope_positions copy by using full tensor copy (#7014 ) ### What this PR does / why we need it? The index-select operation `mrope_positions.gpu[:, :total_num_scheduled_tokens].copy_(...)` triggers a CPU-NPU synchronization, which blocks subsequent operator dispatch and causes bubbles visible in Profiling. This PR changes to full tensor copy (`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync point. The trade-off is a negligible increase in memory usage since `mrope_positions.cpu` is a small tensor. Result: ~2-3% TPOT improvement with the profiling bubbles eliminated. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Verified via Profiling that the CPU sync bubble is eliminated and TPOT is reduced by 2-3%. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>	2026-03-09 14:46:37 +08:00
LeeWenquan	65eae6de7b	Add Ascend Ops recurrent_gated_delta_rule (#6725 ) ### What this PR does / why we need it? Change recurrent_gated_delta_rule ops from triton to ascend C version for better performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-03-09 14:14:14 +08:00
JIACHENG XU	23bf5d4d48	[EPLB][bugfix] Bugfix for fused mc2 (#6794 ) ### What this PR does / why we need it? This pull request addresses a bug related to the fused mc2 functionality within the EPLB (Expert Parallelism Load Balancing) system, specifically impacting quantization and MoE communication. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: Spicy-Stick <873805887@qq.com> Signed-off-by: root <root@localhost.localdomain>	2026-03-09 11:26:57 +08:00
Zetong Li	06ec136f08	[Bugfix] Obtain kernel block size for computing slot mapping correctly (#7019 ) ### What this PR does / why we need it? This PR aims to fix incorrect slot mapping in qwen35 due to mismatched block size. In qwen35, we should use `kernel_block_size` so that we can compute it in a correct way, and it is obtained in `load_model` when we have a chance to grab `draft_attn_layers`. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: Zetong Li <slippersss@126.com>	2026-03-09 11:05:01 +08:00
wangxiaoteng888	a3f4f6b10b	[P/D][Bugfix] Layerwise stacking MTP error. (#7036 ) ### What this PR does / why we need it? The community has added a cleaning mechanism for the metadata after the main model finishes running. The MTP layer should not clean the metadata, and a new condition has been added to avoid cleaning it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-03-09 10:55:43 +08:00
zxr2333	675387f1fd	[P/D][KVPool]Mooncake Layerwise Connector supports kv_pool (#7032 ) ### What this PR does / why we need it? This PR creates and registers `ascend_multi_connector`, which allows the `mooncake_layerwise_connector` to use the kv_pooling feature. We unregister the original vllm's `MultiConnector` and replace it with `AscendMultiConnector` when registering the connectors. ### Does this PR introduce _any_ user-facing change? No. User can use `MultiConnector` to initialize `AscendMultiConnector`. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-09 10:49:04 +08:00
drslark	6a7115fa0d	[main][feature] Support quarot for eagle3 without embedding (#7038 ) ### What this PR does / why we need it? If an `eagle3` model without embed_tokens is used with a `quarot` target model, the acceptance rate drops. This PR fixes that. The related vLLM PR is https://github.com/vllm-project/vllm/pull/36225. - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-09 10:43:06 +08:00
chenxi-hh	737dfcf638	[MOE] commit GMM custom operator (#7010 ) ### What this PR does / why we need it? GMM custom operator optimization in small batch scenarios ### How was this patch tested? Submit the GMM custom operator for subsequent integration into the MOE process. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: chenxi-hh <chen464822955@163.com> Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>	2026-03-09 09:56:31 +08:00
lilinsiman	01d3515dcf	[eagle][cp][bugfix] Fix the bug in eagle and cp enabled (#6981 ) ### What this PR does / why we need it? When eagle and cp are enabled at the same time, there is an error in pcp_allgather due to hidden_states. This PR fixes this issue. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-06 20:49:49 +08:00
aipaes	1c0ecf806a	[bugfix] fix pass bug: pass real rope dim for npu_rotary_embedding (#6880 ) ### What this PR does / why we need it? Pass the real rope dim to npu_rotary_embedding. Before: q_rope, k_rope = torch.ops.vllm.npu_rotary_embedding( positions, q_flat, k_flat, cos_sin_cache, self.head_dim, self.head_dim, True ) After: q_rope, k_rope = torch.ops.vllm.npu_rotary_embedding( positions, q_flat, k_flat, cos_sin_cache, self.head_dim, self.rope_dim, True ) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-03-06 19:35:17 +08:00
tanhaoan333	094eb0eff9	[bugfix]Qwen-Omni quantization bugfix (#7042 ) ### What this PR does / why we need it? [bugfix]Qwen-Omni quantization bugfix fix Qwen-Omni quantization weight mapping to float weight ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>	2026-03-06 17:24:22 +08:00
ZhaoJiangJiang	a51d6366b9	[Bugfix] Qwen3Next support FlashComm1 (#6830 ) ### What this PR does / why we need it? Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence Parallel (SP) and resolve precision problems in shared_out when FlashComm1 is enabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com> Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>	2026-03-06 17:14:08 +08:00
Zetong Li	a2696006d1	[Refactor][EAGLE] 8/N delete mtp_proposer (re-pull) (#7033 ) ### What this PR does / why we need it? NOTE: This PR is re-pull of #7016 since ci mistakenly marked unfinished pr as having passed. This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and glm5, now it should be ok to remove mtp_proposer. The bug is actually about unnecessary slicing of `slot_mapping`. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-03-06 17:11:22 +08:00
Fager10086	c5dfa8d645	[OPS]add split_qkv_rmsnorm_mrope ops (#6730 ) ### What this PR does / why we need it? This PR adds a split_qkv_rmsnorm_mrope kernel with interleaved support for qwen3.5 and qwen3-vl to improve performance. ### Does this PR introduce _any_ user-facing change? No. ### How to use? ```python real_q, real_k, real_v, real_gate = torch.ops.vllm.triton_split_qkv_rmsnorm_mrope( qkv=qkv, q_weight=q_weight, k_weight=k_weight, cos_sin=cos_sin, num_q_heads=num_q_heads, num_kv_heads=num_kv_heads, head_size=head_size, eps=eps, mrope_section=mrope_section, is_interleaved=is_interleaved, rope_dim=rope_dim, has_gate=has_gate, ) ``` ### How was this patch tested? - vLLM version: v0.16.0 - Accuracy test script: ```shell pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_rmsnorm_mrope.py ``` --------- Signed-off-by: Fager <865071616@qq.com> Signed-off-by: Fager10086 <77871921+Fager10086@users.noreply.github.com> Signed-off-by: fager <865071616@qq.com>	2026-03-06 16:18:37 +08:00
xiaocongtou6	bc0fd7ca72	[Feat]Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for DeepSeek v3.2. (#6940 ) ### What this PR does / why we need it? Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for DeepSeek v3.2. ### How was this patch tested? Test output: {"object":"text_completion","model":"deepeek_v3","choices":[{"index":0,"text":" the head of state and head of government of the United States, indirectly elected to a four-year term by the American people through the Electoral College. The officeholder leads the executive branch of the federal government and is the commander-in-chief of the United States","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":1,"text":" Paris. This is the largest city in France and its main political, cultural and commercial center. The modern location of the city is the north of the central part of the country, on the banks of the Seine River Seine River Seine in 3\n\n","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":2,"text":" now\n\n# AI future is now\n\nThe world is changing at a rapid pace, and artificial intelligence (AI) is at the forefront of this transformation. From self-driving cars to virtual assistants, AI is already making a significant impact on our daily lives","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":3,"text":" a 3rd year student at the University of Lincoln studying Media Production. 
This blog is about my work throughout my final year on the course.\n\n## Tuesday 3 May 2016\n### Final Major Project - Evaluation\n\nFor my final project I","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":27,"total_tokens":227,"completion_tokens":200,"prompt_tokens_details":null},"kv_transfer_params":null} - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: xiaocongtou6 <2066962956@qq.com> Signed-off-by: xiaocongtou6 <105542647+xiaocongtou6@users.noreply.github.com>	2026-03-06 16:10:24 +08:00
Shanshan Shen	a813eadd2d	[MM][Perf] Enable 2.7x faster for convolution computation with aclnn BatchMatMulV2 (#7017 ) ### What this PR does / why we need it? Currently, we are using `e2b31243c0/vllm/model_executor/layers/conv.py (L219-L232)` for convolution computation, which is used in patch embedding for VL models. Profiling shows that this linear method takes about 6.87 ms, much slower than `F.conv3d()`. `F.conv3d()` calls aclnn `BatchMatMulV2`, which is optimized on Ascend NPU, takes only about 2.50 ms, and is about 2.7x faster than the linear method. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-03-06 14:26:37 +08:00
wanghengkang	c49ce18ea5	[Test] Add e2e test cases for the Qwen-VL model adaptation to Ascend 310p (#6977 ) ### What this PR does / why we need it? Add e2e test cases for the Qwen-VL model adaptation to Ascend 310p - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: gcw_61wqY8cy <wanghengkang1@huawei.com>	2026-03-06 14:25:10 +08:00
aipaes	620076b76a	[bugs] fix install FIA sh (#6989 ) ### What this PR does / why we need it? Update the replacement shell script for the FIA operator FD feature in CANN 8.5.1 - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-03-06 11:42:32 +08:00
wangxiyuan	16c3b0b822	Revert "[Refactor][EAGLE] 8/N delete mtp_proposer" (#7030 ) Reverts vllm-project/vllm-ascend#7016 It breaks E2E test - vLLM version: v0.16.0 - vLLM main: `4034c3d32e`	2026-03-06 11:24:05 +08:00
panchao-hub	8c2c82f3e1	[Bugfix] Fix the moe_forward error when setting enable_static_kernel … (#6964 ) ### What this PR does / why we need it? Fix the moe_forward error when setting enable_static_kernel to true. When static kernels are enabled, the forward pass runs twice (compilation + capture), causing moe_layer_index to overflow. Wrap the index to prevent out-of-bounds errors. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with the newly added test - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com>	2026-03-06 10:36:10 +08:00
pz1116	a7820d20f4	[Doc][KV Pool]Update Memcache local service config example: increase default world size to 256 and update description (#7025 ) ### What this PR does / why we need it? Update Memcache local service config example: increase default world size to 256 and update the description for better clarity. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>	2026-03-06 10:23:55 +08:00
MengLong Chen	a838a89630	[v0.16.0][P/D][Bugfix] Support ALL D-Nodes in fullgraph when running MTP in PD (#6948 ) ### What this PR does / why we need it? Fix the bug for v0.16.0 recompute_scheduler, the same way as https://github.com/vllm-project/vllm-ascend/pull/5472. Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2026-03-06 10:01:33 +08:00
LI SHENGYONG	ccd00798f3	[EPLB] Display the expert hotness comparison before and after eplb. (#6877 ) ### What this PR does / why we need it? To intuitively show the effect of the eplb algorithm, we print the expert heat before and after eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-02-28_17-23-42](https://github.com/user-attachments/assets/db1dadd1-cf96-44da-af34-57d41ccf412f) - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-06 09:53:29 +08:00
frank	18b52afe2b	[Ops][Misc] Optimize split_qkv_rmsnorm_rope op (#6827 ) ### What this PR does / why we need it? This PR optimizes the `split_qkv_rmsnorm_rope` operator by introducing a new Triton kernel, `split_qkv_rmsnorm_rope_prefill_kernel`, for the prefill stage (i.e., large batch sizes). The implementation now dynamically selects between the existing decode kernel and the new prefill kernel based on the batch size, which improves performance for large batch scenarios. Additionally, the RoPE implementation is updated to support partial rotation dimensions (`rope_dim`), making the operator more flexible. ### Does this PR introduce _any_ user-facing change? No. This is a performance optimization and is not expected to introduce any user-facing changes. ### How was this patch tested? CI should pass with existing tests. The new prefill path is triggered when the batch size is larger than the number of available vector cores. The partial RoPE feature can be tested by passing the `rope_dim` argument. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: guzhiyong <guzhiyong5@h-partners.com> Signed-off-by: frank <2547457096@qq.com> Co-authored-by: guzhiyong <guzhiyong5@h-partners.com>	2026-03-06 09:30:31 +08:00
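
The last entry above (#6827) describes two behaviors: choosing between a decode kernel and a prefill kernel by batch size, and rotating only part of each head via `rope_dim`. A minimal sketch of that logic, assuming hypothetical helper names (`select_kernel`, `rotary_split`) that are illustrative only and not taken from the vllm-ascend code:

```python
from typing import Optional, Tuple


def select_kernel(batch_size: int, num_vector_cores: int) -> str:
    """Pick a kernel variant by batch size (hypothetical helper).

    Mirrors the commit description: the prefill path is used when the
    batch size is larger than the number of available vector cores.
    """
    if batch_size > num_vector_cores:
        return "prefill"
    return "decode"


def rotary_split(head_size: int, rope_dim: Optional[int] = None) -> Tuple[int, int]:
    """Return (rotated_dims, passthrough_dims) for one head (hypothetical helper).

    With partial rotation, only the first `rope_dim` dimensions receive
    RoPE; the rest pass through unchanged. `rope_dim=None` is assumed
    here to mean full rotation.
    """
    if rope_dim is None:
        rope_dim = head_size
    if not 0 < rope_dim <= head_size:
        raise ValueError("rope_dim must be in (0, head_size]")
    return rope_dim, head_size - rope_dim
```

For example, `select_kernel(41, 40)` picks the prefill path, while `rotary_split(128, 64)` rotates half of each head. The real operator is a Triton kernel, so the actual selection and slicing happen inside the launch code rather than in helpers like these.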


2551 Commits