xc-llm-ascend

Author	SHA1	Message	Date
cookieyyds	51415aaa2f	[bugfix]support dsv3.2 enable both mtp and full_decode_only (#5849 ) ### What this PR does / why we need it? Support enabling both mtp and full_decode_only for dsv3.2. PR5626 To align with the community, the branch logic was modified. Previously, dsv32 could not reach inside the branch, and now an additional unpadded step is required, which causes transformations in positions and num_input_tokens, leading to changes in the cos and sin dimensions in sfa_v1.py. This, in turn, causes an illegal shape error when passed to the operator. 1. The unpadded function is introduced to align with the community, and in the community the function does not have the parameters num_input_tokens and positions. 2. The positions are split and num_input_tokens=num_actual_tokens is used to match the function name unpad, so that the padded positions and num_input_tokens are not output. In fact, attention_v1 does not use these two parameters. The cropping is done because someone might use these parameters later and hit shape mismatch issues without being aware of this. From the perspective of where they are acquired, positions are not cropped, so there is actually no need to add unpad in this case. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>	2026-01-14 22:57:38 +08:00
Qiu	a88937f5cb	[bugfix](cp) replace None with zeros/inf tensor to avoid TypeError (#5837 ) ### What this PR does / why we need it? When some devices hold no kv cache, the `_compute_prefill_context` function returns `None`, which is unexpected. This PR replaces `None` with full zeros/-inf tensors to avoid a TypeError. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py -k test_models_chunked_prefill_with_empty_kvcache ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-14 20:57:48 +08:00
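One plausible reading of why zeros/-inf placeholders are safe, sketched in plain Python rather than tensors: if partial attention results are merged through their log-sum-exp (LSE) values, a partial result with output 0 and LSE of -inf gets zero weight and leaves the other result unchanged. The function name `merge_partial` and the scalar output are illustrative assumptions, not code from the PR.

```python
import math


def merge_partial(out_a: float, lse_a: float, out_b: float, lse_b: float):
    """Merge two partial attention results for one query.

    Hypothetical helper for illustration: each partial result is an
    output value (a scalar here for brevity) plus the log-sum-exp of
    the attention scores that produced it.
    """
    lse_max = max(lse_a, lse_b)
    if lse_max == -math.inf:
        # Both sides are empty: stay with the neutral element.
        return 0.0, -math.inf
    # Weights relative to the larger LSE for numerical stability.
    w_a = math.exp(lse_a - lse_max)
    w_b = math.exp(lse_b - lse_max)
    total = w_a + w_b
    out = (out_a * w_a + out_b * w_b) / total
    lse = lse_max + math.log(total)
    return out, lse


# A device without kv cache contributes (0.0, -inf) instead of None.
merged = merge_partial(0.7, 1.3, 0.0, -math.inf)
print(merged)
```

Under this assumption the placeholder acts as a neutral element of the merge, so downstream code needs no special `None` branch.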
zhaomingyu13	d450ba24c7	Revert "[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5903 ) This reverts commit `d886b81971` - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-01-14 20:56:20 +08:00
zhaomingyu13	01805fbd7d	Revert "[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 )"(#5902 ) This reverts commit `d886b81971`, which breaks the PD function. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-01-14 20:55:10 +08:00
LICO67373	2a6d95c389	[Cleanup] Remove dead code make_attention_mask function (#5818 ) ### What this PR does / why we need it? This PR removes the unused `make_attention_mask` function from `vllm_ascend/worker/v2/attn_utils.py`. Why it's dead code: - After PR #4870 (attention mask unification refactor), attention mask generation has been centralized in the `AttentionMaskBuilder` singleton class - The mask is now generated directly by metadata builders when needed (e.g., `AscendAttentionMetadataBuilder`, `AscendMLAMetadataBuilder`) - The `make_attention_mask` function is no longer called anywhere in the codebase - The function's parameters (including `attn_mask` and `spec_attn_mask`) were also removed from `build_attn_metadata` in the same refactor Changes: - Remove `make_attention_mask` function (24 lines) from `vllm_ascend/worker/v2/attn_utils.py` ### Does this PR introduce _any_ user-facing change? No. This is a code cleanup that removes dead code. No user-facing behavior changes. ### How was this patch tested? - Verified that `make_attention_mask` is not called anywhere in the codebase (via `grep`) - CI tests pass to ensure no regressions - The function has been unused since PR #4870 was merged - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-14 16:52:51 +08:00
herizhen	d31170496b	[doc]index display by category (#5852 ) ### What this PR does / why we need it? upgrade tutorial doc index display by category ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-01-14 16:50:49 +08:00
Li Wang	f6a37fc549	[CI] Reduce the resource consumption of unit tests (#5891 ) ### What this PR does / why we need it? Reduce the resource consumption of unit tests: 32U/pr -> 16U /pr - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-14 16:33:19 +08:00
wangxiyuan	e5c46bf169	[CI] Fix lint CI (#5880 ) Quick fix for lint CI - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-14 11:23:38 +08:00
Ronald	e20813f441	[Feature] implement eagle spec decoding for model runner v2 (#5840 ) ### What this PR does / why we need it? This PR implements eagle spec decoding for model runner v2; please see the RFC at https://github.com/vllm-project/vllm-ascend/issues/5208 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.13.0 --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-01-14 09:18:05 +08:00
LHXuuu	0415e694cd	[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W8A8 Int8 dynamic weight. 2. Specify W4A16 quantization configuration. Co-authored-by: menogrey 1299267905@qq.com Co-authored-by: kunpengW-code 1289706727@qq.com ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: menogrey <1299267905@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-01-14 09:17:26 +08:00
LI SHENGYONG	ecf2fa482e	[EPLB][Bugfix] Get expert map from layers (#5817 ) ### What this PR does / why we need it? The initialization method of expert_map used by the eplb module is different from that used by the fused_moe module. This PR deletes the expert_map initialization method used by the eplb module to make the initialization methods consistent. #### before bugfix self._expert_map=tensor([64, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,62, 63], device='npu:1', dtype=torch.int32) self.shared_dict["expert_maps"][0]=tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]], dtype=torch.int32) ### How was this patch tested? #### qwen3-235B-w8a8 aime \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-14 09:16:51 +08:00
drslark	48ec97821a	[Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816 ) ### What this PR does / why we need it? Fixed an accuracy problem when using eagle3 with sp. The problem is described in https://github.com/vllm-project/vllm-ascend/issues/5825. It also adds a much more precise way to determine whether the drafter should use `sp` or not. Also, it changes the `eager` of the drafter to be a real `eager` in the frontend to avoid a `fx-graph` problem. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? For simplicity, we test it as in https://github.com/vllm-project/vllm-ascend/issues/5825, and we get the same result as `eagle3` with `sp` disabled. ```text -------------------------------------------------- total_num_output_tokens: 1000 num_drafts: 437 num_draft_tokens: 1311 num_accepted_tokens: 564 mean acceptance length: 2.29 -------------------------------------------------- acceptance at token 0: 0.62 acceptance at token 1: 0.40 acceptance at token 2: 0.27 acceptance at token 3: 0.00 acceptance at token 4: 0.00 acceptance at token 5: 0.00 ``` * vLLM version: v0.13.0 * vLLM main: `2f4e6548ef` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-14 09:00:37 +08:00
liziyu	e1bed43cff	[P/D] bugfix for p node force free requset (#5431 ) ### What this PR does / why we need it? Fix the bug where the P-node's scheduler dies after it force-frees a request due to timeout and then again receives the completed kv cache pulled by the D-node. The fix adds a list that records all requests. - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-14 08:51:31 +08:00
SILONG ZENG	78d5ce3e01	[Lint]Style: Convert `example` to `ruff format` (#5863 ) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-13 20:46:50 +08:00
zhangxinyuehfad	f7b904641e	[Main2Main] Upgrade vllm commit to 0109 (#5752 ) ### What this PR does / why we need it? Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df) 1. remove `init_cached_hf_modules` due to https://github.com/vllm-project/vllm/pull/31786 2. fix the spec_decode e2e test broken by https://github.com/vllm-project/vllm/pull/29821 3. fix `vllm.v1.attention.backends.utils` due to https://github.com/vllm-project/vllm/pull/31891 4. fix `self.seq_lens - query_lens` on same device due to https://github.com/vllm-project/vllm/pull/31773 5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-13 19:14:43 +08:00
liziyu	eed9e366a7	[Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (#5846 ) ### What this PR does / why we need it? Fix layerwise connector for decoder tp size > num kv heads. In this case prefiller should push kv cache to all decoder npu. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-13 17:30:33 +08:00
yupeng	5b95c6b03a	[Test][e2e][LoRA] Add more e2e tests to cover scenarios of LoRA (#4075 ) ### What this PR does / why we need it? This PR depends on PR https://github.com/vllm-project/vllm-ascend/pull/4046 and works only once the latter is merged. This PR aims to solve the issue https://github.com/vllm-project/vllm-ascend/issues/3240. The newly added Llama-2-7b-hf and Qwen3-0.6B test cases cover the scenarios in which the LoRA weights are added to the q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens and lm_head modules. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_llama2_lora.py pytest -sv tests/e2e/singlecard/test_qwen3_multi_loras.py - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2026-01-13 16:32:28 +08:00
Shanshan Shen	d350c2ada6	[CustomOp][Perf] Merge Q/K split to simplify AscendApplyRotaryEmb for better performance (#5799 ) ### What this PR does / why we need it? - Use upstream util function (`_pre_process()` and `_post_process()`) to reduce redundant codes. (Find more details at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py#L184-L213) - Merge Q/K split to simplify the logic of calling `torch_npu.npu_rotary_mul()` for better performance (TPOT has been reduced by 6.22%). ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? #### ✅ Functional test Launch the server: ```bash export VLLM_USE_MODELSCOPE=True vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --dtype bfloat16 \ --limit-mm-per-prompt '{"image": 1}' \ --max-model-len 16384 \ --max-num-batched-tokens 16384 ``` Query the server: ```bash curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": [ {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}}, {"type": "text", "text": "What is the text in the illustrate? How does it look?"} ]} ], "max_tokens": 100 }' ``` Output: ``` {"id":"chatcmpl-b2911ab6989ef098","object":"chat.completion","created":1768202780,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design, with \"TONGYI\" being more prominent and \"Qwen\" being slightly smaller and positioned below it. 
The font style is modern and clean, with \"TONGYI\" having a slightly bolder appearance compared to \"Qwen.\"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":178,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` #### ✅ Benchmark Run: ```bash export VLLM_USE_MODELSCOPE=False export HF_ENDPOINT="https://hf-mirror.com" vllm bench serve \ --model /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --hf-split train \ --dataset-path lmarena-ai/vision-arena-bench-v0.1 \ --num-prompts 10 \ --no-stream ``` Before this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 10 Failed requests: 0 Benchmark duration (s): 5.96 Total input tokens: 7191 Total generated tokens: 996 Request throughput (req/s): 1.68 Output token throughput (tok/s): 167.05 Peak output token throughput (tok/s): 261.00 Peak concurrent requests: 10.00 Total token throughput (tok/s): 1373.16 ---------------Time to First Token---------------- Mean TTFT (ms): 964.43 Median TTFT (ms): 858.48 P99 TTFT (ms): 1691.45 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 63.08 Median TPOT (ms): 40.86 P99 TPOT (ms): 241.30 ---------------Inter-token Latency---------------- Mean ITL (ms): 40.16 Median ITL (ms): 33.61 P99 ITL (ms): 250.30 ================================================== ``` After this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 10 Failed requests: 0 Benchmark duration (s): 5.71 Total input tokens: 7191 Total generated tokens: 996 Request throughput (req/s): 1.75 Output token throughput (tok/s): 174.45 Peak output token throughput (tok/s): 279.00 Peak concurrent requests: 10.00 Total token throughput (tok/s): 1433.95 ---------------Time to First Token---------------- Mean TTFT (ms): 992.14 Median TTFT (ms): 938.30 P99 TTFT (ms): 1728.71 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 59.16 Median TPOT (ms): 37.65 P99 TPOT (ms): 234.89 ---------------Inter-token Latency---------------- Mean ITL (ms): 36.55 Median ITL (ms): 30.73 P99 ITL (ms): 170.72 ================================================== ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-13 15:47:23 +08:00
SILONG ZENG	523e83016b	[Lint]Style: Convert `root`, `benchmarks`, `tools` and `docs` to `ruff format` (#5843 ) ### What this PR does / why we need it? Description This PR fixes linting issues in the root directory, benchmarks/, tools/ and docs/ to align with the project's Ruff configuration. This is part of a gradual effort to enable full linting coverage across the repository. The corresponding paths have been removed from the exclude list in pyproject.toml. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-13 15:29:34 +08:00
lhchg	4b679984de	enable ep32 for dispatch_ffn_combine (#5787 ) ### What this PR does / why we need it? To support dispatch_ffn_combine ep32 enabled ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Single operator tested --------- Signed-off-by: lhchg <lhao_cheng@163.com>	2026-01-13 14:35:52 +08:00
wangxiyuan	84d4f474c0	[CI] Unblock 4-cards test (#5831 ) CI cost time: single: 160min 2-cards: 110min 4-cards: 120min full cost time: before this PR: max(160, 110)+120 = 280min after this PR: min(160, 110)+120 = 230min Reduce 50min for e2e test. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-13 11:15:29 +08:00
weijinqian0	1ccb9acd9a	[Refactor] Provide a framework to accommodate operators for different hardware devices (#5735 ) come from: https://github.com/vllm-project/vllm-ascend/issues/5463 Reason: During the iteration process of the hardware version, there may be a large number of iterations for the operators, which can lead to short-term compatibility differences. Therefore, an intermediate adaptation layer is provided to accommodate the short-term differences in operators. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: weijinqian0 <1184188277@qq.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2026-01-13 09:53:26 +08:00
Rozwel-dx	8d571286dd	[Refactor] Modify the binding logic to allocate CPU cores for each NPU card (#5555 ) [Refactor] Modify the binding logic to allocate CPU cores for each NPU card ### What this PR does / why we need it? Modify the binding logic to allocate CPU cores for each NPU card based on NUMA affinity, while isolating acl_thread/release_thread and other processes to prevent mutual interference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `c85cc045f8` Signed-off-by: rowzwel_dx <1392851715@qq.com> - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: Rozwel-dx <1392851715@qq.com>	2026-01-13 09:21:28 +08:00
zhaomingyu13	d886b81971	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 ) ### What this PR does / why we need it? According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, actual debugging showed that the tensor parallel size (tp) of the Eagle model stays consistent with that of the target model, so the tp setting for the draft model did not take effect as expected. Note: This feature has not yet been combined and tested with `sp` and `dp`. It will be adapted later ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, }, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Fixes vllm-project/vllm#31345 Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-13 09:14:30 +08:00
shiyuan680	7af3b880c1	support triton of mrope (#5664 ) ### What this PR does / why we need it? This PR supports using a triton mrope, like cuda_forward, whose performance is equal to that of the ascendc ops. This triton op should be used with cann 8.5.0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested on qwen3-vl-235b, textvqa accuracy: native 81.82, npu triton 81.58, cuda triton 81.52 - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shiyuan680 <917935075@qq.com>	2026-01-13 09:13:51 +08:00
DreamerLeader	db7cf9b0ca	[bugfix] A2 Environment Pooling for Memcache Compatibility (#5601 ) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>	2026-01-13 09:07:38 +08:00
Yikun Jiang	fe251a2efe	[Doc] Update community contributors and versioning naming to follow vLLM (#5820 ) ### What this PR does / why we need it? This pull request updates documentation to align with vLLM's community standards. - Change `Maintainers` to `Committers` to follow vLLM naming: https://docs.vllm.ai/en/latest/governance/committers/ - Change release branch policy from `vX.Y.Z-dev` to `releases/vX.Y.Z` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doc ci passed - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2026-01-13 08:47:11 +08:00
LICO67373	c8a324ab73	[Refactor] Add comments for Metadata classes in attention module (#5789 ) ### What this PR does / why we need it? Add docstrings for Metadata and MetadataBuilder classes in the attention module to improve code readability. Related to #5463 (Item 11: Add some comments for CommonMetadata and others) Modified files: - `vllm_ascend/attention/context_parallel/common_cp.py`: Added comments for `AscendPCPMetadata`, `CPChunkedContextMetadata`, `AscendMetadataForPrefill`, `AscendMetadataForDecode` - `vllm_ascend/attention/utils.py`: Added comments for `AscendPrefillContextParallelMetadata` - `vllm_ascend/attention/mla_v1.py`: Added comments for `ChunkedContextMetadata`, `AscendMLADecodeMetadata` - `vllm_ascend/attention/attention_v1.py`: Added comments for `AscendMetadata`, `AscendAttentionMetadataBuilder` - `vllm_ascend/attention/context_parallel/attention_cp.py`: Added comments for `AscendAttentionCPMetadataBuilder` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation only, no functional changes. Signed-off-by: lico67373 <918688502@qq.com>	2026-01-13 08:46:50 +08:00
LiuYi-Up	dde547e900	[Bugfix] bugfix for the order of dummy run pad and sync (#5777 ) ### What this PR does / why we need it? This PR addresses an issue in piecewise graph mode when Multi-Token Prediction (MTP) is enabled. Specifically, the original dummy run sequence performs the following steps in order: 1. Sync DP (input length = 1 + k) 2. Dispatch (input length = 1 + k, with padding==graph size) However, in the model execution phase, the sequence differs, resulting in: 1. Padding (input length = 1, with padding) 2. Sync DP (input length = 1 + k) 3. Dispatch (input length 1 + k != graph size 1 + k, with padding) This discrepancy leads to a mismatch between the input sizes used in the model execution and those expected by the dispatch graph, causing an inconsistency in graph size. This PR ensures that the dispatch graph size aligns correctly by modifying the sequence of operations during model execution to match the dummy run sequence, resolving the mismatch issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: LiuYi-UP <1150854440@qq.com>	2026-01-13 08:44:10 +08:00
Li Wang	75c92a3640	[CI] Move nightly-a2 test to hk (#5807 ) ### What this PR does / why we need it? As initial testing, this patch connects two nodes from the HK region to nightly A2. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-12 22:58:35 +08:00
Li Wang	2a010a1f0e	[CI] Show disk usage for CI shared volume (#5821 ) ### What this PR does / why we need it? 1. Remove some useless but too large models from the shared volume 2. Add a new step to show current usage - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-12 22:56:23 +08:00
dependabot[bot]	86c4bea116	Bump actions/checkout from 4 to 6 (#5795 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-01-12 20:44:23 +08:00
dependabot[bot]	7ab63661f5	Bump actions/github-script from 7 to 8 (#5796 ) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 8. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-01-12 20:44:02 +08:00
Qiu	5f4b13ab3d	[bugfix](cp) align max_context_chunk to cp_virtual_block_size (#5767 ) ### What this PR does / why we need it? In the chunked prefill scenario, CP needs to align the `max_context_chunk` to the `cp_virtual_block_size`, but the current implementation only aligns it to the `block_size`. For PD-disaggregation, `cp_kv_cache_interleave_size` is typically set equal to `block_size`, in which case `cp_virtual_block_size=block_size * dcp_size * pcp_size`. Under specific conditions, this can lead to misalignment of certain chunks, subsequently triggering assertion check errors. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-12 20:11:46 +08:00
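The alignment issue described above can be illustrated with a small arithmetic sketch. The concrete sizes, the helper `round_down`, and the choice of rounding down are assumptions for illustration; only the relation `cp_virtual_block_size = block_size * dcp_size * pcp_size` comes from the PR description.

```python
def round_down(value: int, multiple: int) -> int:
    """Round value down to the nearest multiple (hypothetical helper)."""
    return (value // multiple) * multiple


# Illustrative sizes, not taken from the PR.
block_size = 128
dcp_size = 2
pcp_size = 2

# With cp_kv_cache_interleave_size == block_size, per the description:
cp_virtual_block_size = block_size * dcp_size * pcp_size  # 512

max_context_chunk = 1664  # a multiple of 128, but not of 512

# Aligning only to block_size leaves the chunk misaligned for CP.
aligned_to_block = round_down(max_context_chunk, block_size)
# Aligning to cp_virtual_block_size gives a chunk every CP rank can split evenly.
aligned_to_virtual = round_down(max_context_chunk, cp_virtual_block_size)

print(aligned_to_block, aligned_to_block % cp_virtual_block_size)
print(aligned_to_virtual, aligned_to_virtual % cp_virtual_block_size)
```

In this sketch the first alignment keeps a remainder of one `block_size` against the virtual block, which is the kind of misalignment that would trip an assertion on chunk boundaries.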
wangyongjun	4453c60262	[bugfix]limit graph replay sync (#5761 ) ### What this PR does / why we need it? When the graph mode is piecewise, synchronizing on replay hurts performance; a sync costs almost 250us ![123](https://github.com/user-attachments/assets/04d2a1f3-1f57-4dbb-85ce-b250f2ee7ff0) ### Does this PR introduce _any_ user-facing change? Sync only when the graph mode contains full mode ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangyongjun <wangyongjun7@huawei.com>	2026-01-12 16:46:21 +08:00
SILONG ZENG	7a6fde80b1	[CI]Add Kimi k2 nightly test (#5682 ) ### What this PR does / why we need it? The PR add performance and accuracy tests for Kimi-K2-Instruct-W8A8 and Kimi-K2-Thinking models to the Nightly test suite. #### Test Configuration Kimi-K2-Instruct-W8A8 - model: vllm-ascend/Kimi-K2-Instruct-W8A8 - Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node) - Architecture: Unified Distributed Inference - Parallelism: DP4 + TP8 + EP (Data Parallel 4, Tensor Parallel 8, Expert Parallel enabled). - Optimization: torchair graph, no-prefix-caching. - Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8. - Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8. - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs2800. - Accuracy: vllm-ascend/gsm8k-lite. Kimi-K2-Thinking - Model: moonshotai/Kimi-K2-Thinking - Hardware: A3, 1 Node (16 NPUs total) - Architecture: Single Node Distributed Inference - Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled). - Optimization: no-prefix-caching - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs400. - Accuracy: vllm-ascend/gsm8k-lite. ### Does this PR introduce _any_ user-facing change? Yes. This PR enhances the ```AisbenchRunner``` to support dynamic configuration of the ```trust_remote_code``` flag. This allows the AISBench client to successfully load tokenizers for models that require custom code execution (e.g., Kimi-K2-Thinking and Kimi-K2-Instruct-W8A8). Changes: 1. ```AisbenchRunner.__init__ ```Added the ability to capture the ```trust_remote_code``` parameter from the case configuration. ``` python self.batch_size = aisbench_config["batch_size"] self.request_rate = aisbench_config.get("request_rate", 0) + self.trust_remote_code = aisbench_config.get("trust_remote_code", False) self.temperature = aisbench_config.get("temperature") self.top_k = aisbench_config.get("top_k") ``` 2. 
```AisbenchRunner._init_request_conf``` Added regex substitution to inject the parameter into the generated dynamic configuration file. ``` python content = re.sub(r'batch_size.', f'batch_size = {self.batch_size},', content) + content = re.sub(r'trust_remote_code=.', + f'trust_remote_code={self.trust_remote_code},', + content) content = content.replace("top_k", "#top_k") content = content.replace("seed", "#seed") ``` Details: - New Config Key: Users can add ```"trust_remote_code": True``` to any dictionary within the ```aisbench_cases``` list. - Default Value: Defaults to ```False``` to maintain existing security protocols for standard models. - Impact: Resolves ```ValueError``` when benchmarking reasoning models or models with custom tokenizers that previously failed during the AISBench local initialization phase. User Example: Users can now enable custom code execution for specific models (like Kimi-K2-Thinking) directly in their test suite: ``` # Now supported in test scripts: aisbench_cases = [{ "case_type": "performance", "request_conf": "vllm_api_stream_chat", "trust_remote_code": True, # New user-facing parameter ... }] ``` ### How was this patch tested? Actions: - https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433 Result as following: - Kimi-K2-Instruct-W8A8(25m25s) 1. Accuracy test ``` dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 96.88 ``` 2. 
Perf test ``` ╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡ │ E2EL │ total │ 34571.489 ms │ 28657.8054 ms │ 36294.1788 ms │ 34714.7329 ms │ 35247.2724 ms │ 35526.6758 ms │ 36146.4314 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TTFT │ total │ 2043.9136 ms │ 627.4718 ms │ 3532.3978 ms │ 1906.0194 ms │ 2307.7979 ms │ 2883.8528 ms │ 3283.7012 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TPOT │ total │ 127.5591 ms │ 106.4937 ms │ 137.107 ms │ 128.3135 ms │ 129.5704 ms │ 131.1332 ms │ 134.1087 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ ITL │ total │ 126.5571 ms │ 0.0095 ms │ 1340.783 ms │ 104.1398 ms │ 110.1272 ms │ 119.6124 ms │ 950.2924 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ InputTokens │ total │ 3516.6055 │ 3014.0 │ 3985.0 │ 3525.0 │ 3525.0 │ 3586.8 │ 3800.67 │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 512 │ 
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokenThroughput │ total │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛ ╒══════════════════════════╤═════════╤═══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪═══════════════════╡ │ Benchmark Duration │ total │ 279430.9375 ms │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Requests │ total │ 512 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Success Requests │ total │ 512 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Concurrency │ total │ 63.3452 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Max Concurrency │ total │ 64 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Request Throughput │ total │ 1.8323 req/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Input Tokens │ total │ 1800502 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Prefill Token Throughput │ total │ 1720.5255 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total generated tokens │ total │ 131072 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Input Token Throughput │ total │ 6443.4598 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Output Token Throughput │ total │ 469.0676 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Token Throughput │ total │ 6912.5274 token/s │ ╘══════════════════════════╧═════════╧═══════════════════╛ ``` - 
Kimi-K2-Thinking(43m51s) 1. Accuracy test ``` dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 100.00 ``` 2. Perf test ``` ╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡ │ E2EL │ total │ 172384.3573 ms │ 34456.5517 ms │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms │ 204428.9502 ms │ 205468.6776 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TTFT │ total │ 138740.3228 ms │ 655.1066 ms │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TPOT │ total │ 131.9374 ms │ 90.6331 ms │ 135.4144 ms │ 132.405 ms │ 132.948 ms │ 133.7549 ms │ 135.2543 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ ITL │ total │ 130.9028 ms │ 0.0099 ms │ 960.3683 ms │ 116.9623 ms │ 122.3127 ms │ 132.0522 ms │ 886.4662 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ InputTokens │ total │ 3514.575 │ 3014.0 │ 3843.0 │ 3525.0 │ 3525.0 │ 3588.0 │ 3801.08 │ 400 │ 
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokenThroughput │ total │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s │ 400 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛ ╒══════════════════════════╤═════════╤═══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪═══════════════════╡ │ Benchmark Duration │ total │ 1166795.568 ms │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Requests │ total │ 400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Success Requests │ total │ 400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Concurrency │ total │ 59.0967 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Max Concurrency │ total │ 64 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Request Throughput │ total │ 0.3428 req/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Input Tokens │ total │ 1405830 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Prefill Token Throughput │ total │ 25.332 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total generated tokens │ total │ 102400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Input Token Throughput │ total │ 1204.864 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ 
Output Token Throughput │ total │ 87.7617 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Token Throughput │ total │ 1292.6258 token/s │ ╘══════════════════════════╧═════════╧═══════════════════╛ ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-12 15:56:07 +08:00
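The `trust_remote_code` injection described in #5682 above can be illustrated with a minimal, standalone sketch. It is not the `AisbenchRunner` code: the function name `inject_trust_remote_code`, the template text, and the regex pattern are assumptions chosen for illustration, and only the `aisbench_config.get("trust_remote_code", False)` default comes from the PR description.

```python
import re


def inject_trust_remote_code(content: str, trust_remote_code: bool) -> str:
    """Rewrite every `trust_remote_code=<value>` assignment in a config template.

    Hypothetical helper: the pattern below (key, '=', rest of the line) is an
    assumption for this sketch, not the pattern used in the PR.
    """
    return re.sub(
        r"trust_remote_code\s*=.*",
        f"trust_remote_code={trust_remote_code},",
        content,
    )


# Hypothetical config template standing in for the generated request config.
template = "models = [dict(\n    batch_size=1,\n    trust_remote_code=False,\n)]\n"

# A case dictionary shaped like the `aisbench_cases` entries in the PR text.
case = {"case_type": "performance", "trust_remote_code": True}

# Missing keys fall back to False, matching the default described in the PR.
flag = case.get("trust_remote_code", False)
patched = inject_trust_remote_code(template, flag)
print(patched)
```

Because `.` does not match a newline by default, the substitution stays on the line that holds the key and leaves neighbouring settings such as `batch_size` untouched.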
liziyu	451bbdc292	[Doc] add tls check to pd disaggregation readme (#5638 ) ### What this PR does / why we need it? Updates the PD disaggregation multi-node README: updates the environment check command for A3 and adds a TLS check. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-12 15:49:18 +08:00
wangxiyuan	5ccd53e28a	[CI] adapt v0.13.0 change (#5793 ) Add a `releases` match case for CI jobs and update the related docs for the v0.13.0 branch - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-12 14:06:56 +08:00
wangxiyuan	354ee3b330	[Doc] Update doc url link (#5781 ) Drop `dev` suffix for doc url. Rename url to `https://docs.vllm.ai/projects/ascend` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-12 11:21:31 +08:00
Nengjun Ma	297f6deb09	[CI] Align multi-node nightly test parameter with corresponding tutorials document (#5756 ) ### What this PR does / why we need it? Align the multi-node nightly test parameters with the tutorial documents. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Tested locally and with the nightly e2e multi-node test cases. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-12 09:00:31 +08:00
gh924	6880c1b383	[Feature] Support for cross-attention and whisper model (#5592 ) ### What this PR does / why we need it? Addresses issue https://github.com/vllm-project/vllm-ascend/issues/2262 by adding: - support for cross-attention when the model is encoder-decoder - support for the Whisper model - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: gh924 <guihao2@huawei.com> Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>	2026-01-11 11:38:45 +08:00
zzhxxx	db12c1e2c8	[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 ) ### What this PR does / why we need it? > Extracted from PR #5513 Based on the Sharded-CP feature PR: #4702; RFC: https://github.com/vllm-project/vllm/issues/30055 ### All-gather KV Cache for Communication Overlap: - This PR adjusts the calculation order in the SFA. - Split `index_select` into `indexer_select_pre_process` and `indexer_select_post_process`. - Combine `nope`, `rope` and `index-k` into a single tensor to perform an asynchronous all-gather. ### Benchmark: input=40k && num_batch_token=20k - before: ``` Mean TTFT (ms): 2614.52 Median TTFT (ms): 3148.03 P50 TTFT (ms): 3148.03 P90 TTFT (ms): 3163.48 P99 TTFT (ms): 3170.20 ``` - after: ``` Mean TTFT (ms): 2529.92 Median TTFT (ms): 3051.69 P50 TTFT (ms): 3051.69 P90 TTFT (ms): 3067.31 P99 TTFT (ms): 3072.15 ``` ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2026-01-11 09:47:27 +08:00
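To put the before/after TTFT figures from #5701 above in relative terms, the following sketch computes the percentage reduction for each reported statistic. The helper name `relative_reduction` is hypothetical; the millisecond values are copied from the benchmark in the commit message. Each statistic comes out roughly 3% lower after the change.

```python
def relative_reduction(before_ms: float, after_ms: float) -> float:
    """Return the percentage by which `after_ms` is lower than `before_ms`."""
    return (before_ms - after_ms) / before_ms * 100.0


# (before, after) TTFT in milliseconds, as reported in the commit message.
ttft_ms = {
    "mean": (2614.52, 2529.92),
    "median": (3148.03, 3051.69),
    "p90": (3163.48, 3067.31),
    "p99": (3170.20, 3072.15),
}

reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in ttft_ms.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% lower TTFT")
```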
lilinsiman	c5744e2350	[main][bugfix] Fix fullgraph padding bug in mtp eagle refactor (#5692 ) ### What this PR does / why we need it? The condition for determining padding in the fullgraph overlay with MTP and PCP has been modified to accommodate corner cases where the shape capture size is manually specified. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut and tests - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-01-10 23:07:48 +08:00
zxr2333	78b554dda9	[P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722 ) ### What this PR does / why we need it? Adds new functionality to the Mooncake layerwise connector: 1. Support for sparse attention, for DeepSeek-V3.2 2. Distribution of transfer tasks to redundant kv_head cards This PR is related to [[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support](https://github.com/vllm-project/vllm-ascend/issues/4842) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-10 23:04:16 +08:00
Feng-xiaosuo	c316679e65	adapt to minimax_m2 (#5624 ) ### What this PR does / why we need it? This PR fixes Minimax model loading in the vLLM Ascend backend by: - Adding a model type check for "minimax" and "minimax_m2" to replace the "mlp" prefix with "block_sparse_moe" - Implementing special handling for Minimax expert layer naming conventions - Adding the Minimax configuration to packed_modules_model_mapping for proper qkv_proj and experts module handling Without these changes, Minimax models fail to load on Ascend devices due to incompatible layer naming and module packing. ### Does this PR introduce _any_ user-facing change? Yes. Users can now successfully load and run Minimax models on Ascend hardware with vLLM. This enables inference capabilities for this model family on Ascend devices. ### How was this patch tested? Local Testing: - Verified model loading for minimax-xxx and minimax_m2-xxx model variants on Atlas 800I A2 hardware - Tested inference with sample prompts using vLLM's OpenAI-compatible API server Benchmark Validation: - Compared throughput and latency metrics against GPU baseline - Verified memory usage stays within expected limits for different batch sizes - Tested multi-card inference scenarios with tensor parallelism - vLLM version: v0.13.0 - vLLM main: `8be6432bda` --------- Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>	2026-01-10 23:01:35 +08:00
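The prefix replacement described in #5624 above can be sketched roughly as follows. This is an illustration only, not the vllm-ascend implementation: the function `remap_layer_prefix`, the constant `MINIMAX_MODEL_TYPES`, and the example layer paths are hypothetical, and only the model type names and the "mlp" to "block_sparse_moe" mapping come from the commit message.

```python
# Model types named in the commit message.
MINIMAX_MODEL_TYPES = {"minimax", "minimax_m2"}


def remap_layer_prefix(model_type: str, prefix: str) -> str:
    """Swap the `mlp` path component for `block_sparse_moe` on Minimax models.

    Hypothetical helper: only whole dot-separated components are replaced,
    so a name that merely contains "mlp" as a substring is left alone.
    """
    if model_type not in MINIMAX_MODEL_TYPES:
        return prefix
    parts = prefix.split(".")
    return ".".join("block_sparse_moe" if part == "mlp" else part for part in parts)


# Hypothetical layer paths, used only to exercise the helper.
print(remap_layer_prefix("minimax_m2", "model.layers.0.mlp.experts"))
print(remap_layer_prefix("qwen3", "model.layers.0.mlp.experts"))
```

Matching on whole path components rather than calling `str.replace` on the full string is a design choice in this sketch to avoid rewriting unrelated names.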
Levi	ecd4232698	[Feat] flashcomm2+oshard Generalized (#4723 ) ### What this PR does / why we need it? [FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf) introduces redundant storage of the o_proj matrix, which puts pressure on GPU memory. We propose the FlashComm2+Oshard approach, which integrates the shared linear layer feature (#2931). This approach distributes weights layer by layer to each GPU and accesses the o_proj of each layer via asynchronous broadcast operations, thereby alleviating memory pressure while achieving nearly lossless performance compared to the original FlashComm2. This PR implements a generalized FlashComm2+Oshard solution. Use the following environment variable and option to enable FlashComm2 with Oshard: ```shell export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 --additional-config '{ "layer_sharding": ["o_proj"] }' ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2026-01-10 22:57:57 +08:00
wangxiaoteng888	aa987ffe87	[P/D][bugfix]Fix the PCP port mapping error issue (#5706 ) ### What this PR does / why we need it? Fix the PCP port mapping error. In a multi-node PD separation scenario with the PCP feature enabled, the ZMQ transmission port is wrong: the IP and port received by Side D do not match. The cause is an error in the logic of the port mapping update strategy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By CI - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-10 22:43:52 +08:00
fems14	ff4c1a47b3	[bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751 ) ### What this PR does / why we need it? 1. Fixed memory retention on certain GPUs caused by missing PUT operations. 2. Fixed performance degradation resulting from architectural incompatibilities in the underlying refactor. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-01-09 17:46:23 +08:00
1092626063	3ba064f804	[Doc] Add GLM4.5 GLM4.6 doc (#5740 ) ### What this PR does / why we need it? Add GLM4.5 GLM4.6 doc - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: 1092626063 <1092626063@qq.com>	2026-01-09 16:40:49 +08:00
wangyao-i	3b997fdd32	support mxfp8 quantization (qwen dense) (#5723 ) ### What this PR does / why we need it? Support mxfp8 quantization (Qwen linear layer) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangyao <iwangyao@outlook.com>	2026-01-09 16:26:31 +08:00


2069 Commits