xc-llm-ascend

Author	SHA1	Message	Date
yydyzr	ff3a50d011	[Model] GLM5 adaptation (#6642 ) ### What this PR does / why we need it? GLM5 adaptation 1. use torch_npu.npu_lightning_indexer for GLM5 2. forbid eagle proposer when fullgraph mode is enabled because of bugs 3. add quantization config for GLM5 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM main: `978a37c823` --------- Signed-off-by: yydyzr <liuyuncong1@huawei.com> Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Co-authored-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-11 22:22:22 +08:00
Zetong Li	140fcaffc3	[Bugfix] Update target probs to target logits in rejection sample (#6685 ) ### What this PR does / why we need it? This PR updates `target_probs` to `target_logits` in `rejection_sample` to catch up with https://github.com/vllm-project/vllm/pull/32852. Otherwise, sampling with temperature causes accuracy problems, where tokens can be accepted or rejected unreasonably. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: Zetong Li <slippersss@126.com>	2026-02-11 21:31:40 +08:00
Angazenn	c0c2eb614e	[Main][Ops] Make triton rope support index_selecting from cos_sin_cache (#5450 ) ### What this PR does / why we need it? This PR extends original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as inputs. This fully aligns to vLLM RoPE api interface. Compared with earlier implementation for RoPE, the benefits are: 1. avoiding pre-computation of `cos` `sin` before model execution, which helps to remove redundant codes. 2. allowing eagle3 draft model to have different rope parameters with main model (see #6612 ). This help to recover accept rate && accuracy in that case. In addition, this kernel change only introduces very small performance degradation. Those `index_select` or `chunk` operations are now changed into simple memory access in triton kernel (For example, https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81). Highlights - RoPE Cache Unification: Replaced separate _sin and _cos global tensors with a unified cos_sin_cache and explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling. - Triton Kernel Integration: Updated Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations. - Custom Operation Registration: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation. - Refactored RoPE Forward Pass: Modified the rope_forward_oot function to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
- vLLM version: v0.13.0 - vLLM main: `5326c89803` Additional test on Qwen3-235b accuracy: \| Aime2024 \| GSM8K \| Livecodebench \| \| -------- \| -------- \| -------- \| \| 83.33 \| 96.26 \| 70.23 \| --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-02-11 21:20:53 +08:00
SILONG ZENG	6bc44bf49b	[CI]fix nightly multi node test error for wait for pod ready (#6675 ) ### What this PR does / why we need it? Fixes the issue where nightly multi-node tests hang during the "wait for pod ready" stage due to strict shell mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-11 18:11:00 +08:00
Icey	88773bb101	[main to main] upgrade main 0210 (#6673 ) ### What this PR does / why we need it? upgrade vllm commit to `9562912cead1f11e8540fb91306c5cbda66f0007` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? all tests passed - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-02-11 18:10:14 +08:00
Cao Yi	53b494b1e4	[main][Quant] Remove unused rotation functions and parameters from W4A4 LAOS quantization (#6648 ) ## Summary - Remove unused `set_rotation_config` and `apply_rotation` methods from `AscendW4A4LaosDynamicLinearMethod` - Remove unused `rotation_type` field and associated conditional quantization parameters (`heads_rotation`, `kronecker_rotation_n`, `kronecker_rotation_m`) These rotation-related functions and parameters are never called in the current W4A4 LAOS dynamic quantization workflow. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-02-11 16:38:45 +08:00
whx	bb73478c00	[Test][BugFix] Fix torch.rand usage in triton penalty test (#6680 ) ### What this PR does / why we need it? This PR fixes a `TypeError` in `tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py` that was causing nightly test failures. The `torch.rand()` function was being called with the `device` string as a positional argument, which is incorrect. This has been corrected to use the `device` keyword argument. Fixes # ### Does this PR introduce _any_ user-facing change? No, this change only affects a test file. ### How was this patch tested? CI is expected to pass with this fix. - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-02-11 16:31:49 +08:00
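The failure mode described in the commit above can be illustrated without an NPU or PyTorch installed. The sketch below uses a hypothetical pure-Python stand-in, `fake_rand`, which is not the real `torch.rand`; it only mimics the relevant shape of the signature (variadic integer sizes plus a keyword-only `device`) to show why a device string passed positionally ends in a `TypeError`, and why passing it by keyword does not.

```python
def fake_rand(*size, device=None):
    """Hypothetical stand-in for torch.rand: sizes are positional, device is keyword-only."""
    for dim in size:
        if not isinstance(dim, int):
            raise TypeError(f"size must be ints, got {type(dim).__name__}: {dim!r}")
    return {"shape": tuple(size), "device": device or "cpu"}


def call_positionally():
    # Buggy pattern: the device string is swallowed as one more size argument.
    return fake_rand(2, 3, "npu")


def call_with_keyword():
    # Fixed pattern: the device is passed by keyword.
    return fake_rand(2, 3, device="npu")
```

Calling `call_positionally()` raises `TypeError`, while `call_with_keyword()` returns the requested shape together with the device.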
luomin2005	0c1cfa2bac	Add Worker Interface:check_health (#6681 ) This pull request introduces a new capability to monitor the health of NPU cards directly from the Worker class. This enhancement allows for proactive detection of NPU issues by executing the npu-smi command, improving system reliability and operational visibility within the vllm_ascend worker environment. - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: luomin2005 <luomin2005@huawei.com> Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-02-11 15:24:48 +08:00
pu-zhe	02886e2641	[Feat] 310p support MoE W8A8 quantization (#6641 ) ### What this PR does / why we need it? This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by: - Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`. - Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths. - Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations. - Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8 quantization in a multi-card setup. - Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-10 17:17:44 +08:00
jiangyunfan1	1eb07986bf	[TEST]add a qwen3-30b acc case with mooncake mempool (#6244 ) ### What this PR does / why we need it? This PR adds a qwen3-30b w8a8 case with mooncake mempool, which needs to be tested regularly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2026-02-10 16:26:55 +08:00
LI SHENGYONG	7cf285a77a	[MOE Refactor] Remove QuantType in prepare_finalize.py (#6534 ) ### What this PR does / why we need it? To prevent confusion between different QuantType classes, we remove QuantType in prepare_finalize.py - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-10 15:59:58 +08:00
LI SHENGYONG	34eecacace	[EPLB] Avoiding eplb's dependency on a specified model (#6528 ) ### What this PR does / why we need it? 1. Currently, eplb registers different attributes for different models, but these attributes are not actually used. Now, these attributes are directly deleted. 2. Add some log about eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Deepseek v3.1 chat Of course! Here is a comprehensive explanation of deep learning, broken down for clarity.\n\n### The Simple Analogy: A Child Learning to Recognize a Cat\n\nImagine teaching a child what a cat is. You don't give them a rulebook with instructions like \"has pointy ears, whiskers, and a tail.\" Instead, you show them many pictures, saying \"this is a cat\" or \"this is not a cat.\" The child's brain gradually learns to identify the complex patterns—the combination of shapes, colors, and textures—that define \"cat-ness.\"\n\nDeep learning is essentially this, but for computers. It's a method for teaching computers to learn from examples and recognize patterns directly from data (like images, sound, or text) without being explicitly programmed with rigid rules.\n\n---\n\n### The Technical Definition\n\nDeep Learning is a subfield of machine learning, which itself is a subfield of artificial intelligence (AI). It uses artificial neural networks with many layers (\"deep\" networks) to model and understand complex patterns in data.\n\nHere are the key concepts in that definition:\n\n1. Artificial Intelligence (AI): The broad science of making machines smart and capable of performing tasks that typically require human intelligence.\n2. Machine Learning (ML): A subset of AI that gives computers the ability to learn from data without being explicitly programmed for every single rule.\n3. 
Deep Learning (DL): A specific, powerful - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-10 15:58:44 +08:00
wangxiyuan	7d4833bce9	[Doc][Misc] Restructure tutorial documentation (#6501 ) ### What this PR does / why we need it? This PR refactors the tutorial documentation by restructuring it into three categories: Models, Features, and Hardware. This improves the organization and navigation of the tutorials, making it easier for users to find relevant information. - The single `tutorials/index.md` is split into three separate index files: - `docs/source/tutorials/models/index.md` - `docs/source/tutorials/features/index.md` - `docs/source/tutorials/hardwares/index.md` - Existing tutorial markdown files have been moved into their respective new subdirectories (`models/`, `features/`, `hardwares/`). - The main `index.md` has been updated to link to these new tutorial sections. This change makes the documentation structure more logical and scalable for future additions. ### Does this PR introduce _any_ user-facing change? Yes, this PR changes the structure and URLs of the tutorial documentation pages. Users following old links to tutorials will encounter broken links. It is recommended to set up redirects if the documentation framework supports them. ### How was this patch tested? These are documentation-only changes. The documentation should be built and reviewed locally to ensure all links are correct and the pages render as expected. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-10 15:03:35 +08:00
Ronald	77305df398	implement batch invariant with ascendc (#6590 ) ### What this PR does / why we need it? there are batch invariant ops implemented by triton and ascendc, this pr aims to choose which kind of ops to be used to enable batch invariant. #5487 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-02-10 14:15:26 +08:00
Nengjun Ma	66b60c9440	[Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (#6629 ) ### What this PR does / why we need it? 1. [Refact] Refactor MLA/SFA weight prefetch to be consistent with MoE weight prefetch 2. Remove duplicated o_proj weight prefetch in forward for MLA/SFA ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? 1) Performance result: Perf test data: ) MLA: \| \| 1st test \| 2nd test \| Output Token Throughput(Avg) \| Performance improvement percentage \| \| --- \| --- \| --- \| --- \| --- \| \| o_proj duplicate prefetch \| 11.9669 token/s \| 12.0287 token/s \| 11.9978 \| \| o_proj no duplicate prefetch \| 12.5594 token/s \| 12.6216 token/s \| 12.5905 \| 4.94%\| \| single-layer performance improvement: 5%~8% ) SFA: \| \| 1st test \| 2nd test \| Output Token Throughput(Avg) \| Performance improvement percentage \| \| --- \| --- \| --- \| --- \| --- \| \| o_proj duplicate prefetch \| 13.0523 token/s \| 13.1084 token/s \| 13.08035 \| \| \| o_proj no duplicate prefetch \| 13.9844 token/s \| 14.1678 token/s \| 14.0761 \| 7.6% \| - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-10 14:14:37 +08:00
wangxiyuan	2a826b5fad	[Misc] upgrade to vllm main (#6646 ) ### What this PR does / why we need it? This PR upgrades the core vLLM dependency to a newer version from the main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is necessary to keep our project up-to-date with the latest features and fixes from upstream vLLM. 1. `ac32e66cf9` pass file is moved. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wxsIcey <1790571317@qq.com>	2026-02-10 14:08:59 +08:00
Cao Yi	1c7d1163f5	[main][Docs] Fix spelling errors across documentation (#6649 ) Fix various spelling mistakes in the project documentation to improve clarity and correctness. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-02-10 11:14:57 +08:00
meihanc	5b8e47cb68	[bugfix]Fix no attribute 'data' when MLAPO is enable (#6601 ) ### What this PR does / why we need it? This PR fixes an `AttributeError: 'Parameter' object has no attribute 'data'` that occurs when MLAPO is enabled with vLLM v0.15.0. The error is caused by a monkey-patch on `MLAAttention.process_weights_after_loading` which is incompatible with changes in vLLM v0.15.0. This is likely related to PyTorch's deprecation of the `.data` attribute on `torch.nn.Parameter` objects. This change makes the monkey-patch conditional, so it is not applied for vLLM v0.15.0 and newer versions, resolving the crash. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-02-10 09:04:32 +08:00
DreamerLeader	905f0764e0	[DOC]Add Memcache Usage Guide (#6476 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>	2026-02-09 21:55:00 +08:00
lilinsiman	9564c6bb5d	[main][bugfix] Fix spec acceptance rate problem in vllm_0.15.0 (#6606 ) ### What this PR does / why we need it? The speculative inference acceptance rate decreased after the vLLM version was upgraded to v0.15.0. This PR resolves the issue. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? UT and test cases - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-02-09 21:33:58 +08:00
yupeng	8d44ddacb0	[Test][LoRA] Add e2e test for base model inference (#6624 ) ### What this PR does / why we need it? This PR adds an end-to-end test case to verify the correctness of base model inference when LoRA is enabled. This is to ensure that after a LoRA base model request issue was fixed, the functionality remains correct and does not regress. The new test case calls `do_sample` with `lora_id=0` to target the base model and asserts the output against expected SQL queries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with the new test case. The test can be run with: ```bash pytest -sv tests/e2e/singlecard/test_llama32_lora.py ``` Signed-off-by: paulyu12 <507435917@qq.com>	2026-02-09 21:06:49 +08:00
Wang Kunpeng	156976b982	[refactor]Optimized the kvcache usage of Deepseek v3.2 (#6610 ) ### What this PR does / why we need it? For DeepSeek V3.2, DSA uses FullAttentionSpec, which allocates 2 * mla page size bytes, and we use half of that for the k cache in DSA. However, the k cache actually needs only a small part of that half, so a large amount of kvcache is wasted. The proportion of discarded kvcache is (576-128)/(576 x 2) ≈ 0.389. Running the same script to start DeepSeek V3.2 on a single A3 server gives the following comparison of kvcache usage: Before refactoring ``` [kv_cache_utils.py:1307] GPU KV cache size: 15,872 tokens ``` After refactoring ``` [kv_cache_utils.py:1307] GPU KV cache size: 25,984 tokens ``` This pull request refactors the KV cache allocation for Deepseek v3.2 models that use sparse attention. It replaces the use of `FullAttentionSpec` with `MLAAttentionSpec` and introduces a more principled way of calculating KV cache tensor split factors based on model configuration. This change removes hardcoded values and correctly sizes the cache tensors, leading to optimized memory usage and improved code maintainability. ### Does this PR introduce _any_ user-facing change? No, this is an internal optimization and does not introduce any user-facing changes. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-02-09 18:53:56 +08:00
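The arithmetic in the commit above can be checked with a short script. This is only a sanity check on the numbers quoted in the message: the values 576 and 128 and the two token counts are taken from it, while the interpretation (that reclaiming the wasted share should scale capacity by `1 / (1 - wasted_fraction)`) is an assumption made here for illustration.

```python
from fractions import Fraction

MLA_PAGE_UNITS = 576  # per-token size of one MLA cache slot, as quoted in the commit message
K_CACHE_UNITS = 128   # per-token size the DSA k cache actually uses, as quoted in the commit message

allocated = 2 * MLA_PAGE_UNITS            # FullAttentionSpec allocates two MLA-sized slots
wasted = MLA_PAGE_UNITS - K_CACHE_UNITS   # unused part of the slot reserved for the k cache
wasted_fraction = Fraction(wasted, allocated)

tokens_before = 15_872  # "GPU KV cache size" before refactoring
tokens_after = 25_984   # "GPU KV cache size" after refactoring
observed_gain = tokens_after / tokens_before

# Assumption: if only the wasted share is reclaimed, capacity grows by 1 / (1 - wasted_fraction).
predicted_gain = 1 / (1 - wasted_fraction)

print(f"wasted fraction: {float(wasted_fraction):.4f}")
print(f"observed gain:   {observed_gain:.3f}x")
print(f"predicted gain:  {float(predicted_gain):.3f}x")
```

The wasted fraction reduces to 7/18, roughly 0.39, and the predicted capacity gain of 18/11 is within one percent of the gain observed in the two log lines.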
Qiu	cb7c419bc0	[Feat](sfa,dcp) support dcp for sfa (#6563 ) ### What this PR does / why we need it? This PR adds DCP support to the SFA backend. Please note that due to operator constraints, the current implementation has to all-gather the entire KV cache and modify the block table to satisfy the operator input requirements. This results in significantly increased communication overhead and peak memory usage. Therefore, this is only a temporary workaround and will be refactored once the operator provides proper support. Additionally, because of the above limitations, `cp_kv_cache_interleave_size` is currently required to be equal to `block_size`. This restriction will also be removed after the refactor. #### Test accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8 \| dataset \| version \| metric \| mode \| vllm-api-general-stream \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8kdataset \| - \| accuracy \| gen \| 96.35 \| - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-02-09 18:52:25 +08:00
GoCHug	80e5812b39	[BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (#6581 ) ### What this PR does / why we need it? Issue: If a model such as Ling-1T adopts partial rotary position embedding (partial RoPE), but config.json uses the rotary_dim parameter instead of partial_rotary_factor, it will trigger a RuntimeError: The expanded size of the tensor (128) must match the existing size (64) at non-singleton dimension 3. <img width="1681" height="472" alt="image" src="https://github.com/user-attachments/assets/ba03d7df-ecba-4d6f-9ec1-4dc55f59799e" /> This PR addresses an issue where models using partial rotary position embedding (partial RoPE) with the `rotary_dim` parameter in `config.json` (instead of `partial_rotary_factor`) would encounter a `RuntimeError`. This change adds support for the `rotary_dim` parameter in `vllm_ascend/ops/rotary_embedding.py` to correctly calculate the `rope_dim`, resolving the tensor size mismatch error. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The patch was tested successfully with the Ling-1T model, which previously triggered the error. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: GoCHug <93277779+GoCHug@users.noreply.github.com>	2026-02-09 17:17:52 +08:00
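A minimal sketch of the parameter resolution described in the commit above, in plain Python. The helper name `resolve_rope_dim`, the dict-based config, and the precedence of `rotary_dim` over `partial_rotary_factor` are assumptions for illustration; the actual logic in `vllm_ascend/ops/rotary_embedding.py` may differ.

```python
def resolve_rope_dim(config: dict, head_dim: int) -> int:
    """Hypothetical helper: number of dimensions per head that receive rotary embedding."""
    rotary_dim = config.get("rotary_dim")
    if rotary_dim is not None:
        # An explicit rotary_dim in config.json is used directly (assumed precedence).
        return int(rotary_dim)
    # Otherwise derive it from partial_rotary_factor; 1.0 means full RoPE.
    factor = config.get("partial_rotary_factor", 1.0)
    return int(head_dim * factor)


# With head_dim=128, both config styles below describe the same partial RoPE of 64 dims.
print(resolve_rope_dim({"rotary_dim": 64}, head_dim=128))
print(resolve_rope_dim({"partial_rotary_factor": 0.5}, head_dim=128))
```

If `rotary_dim` were ignored, the first config would fall back to the full head size of 128, which matches the 128-versus-64 mismatch quoted in the error message.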
lhp-deep	d060c797ed	[fix bug] fix tensor mismatch bug in sigmoid operate test case (#6619 ) ### What this PR does / why we need it? This PR fixes a bug in the `test_triton_fusion_ops` test case. The test compares a fused kernel (`fused_sigmoid_gating_delta_rule_update`) with a split implementation. Both paths use a recurrent state tensor. The bug was that the state tensor was being modified in-place by the fused kernel call, and this modified tensor was then reused for the split implementation path. This led to an incorrect comparison and test failure. This fix ensures that each path starts with an identical, clean initial state by creating separate tensors. It also changes the state initialization from `torch.randn` to `torch.ones` to make the test deterministic. ### Does this PR introduce _any_ user-facing change? No, this change only affects a test case and has no user-facing impact. ### How was this patch tested? The fix is applied directly to the test case. The CI passing for `test_fused_sigmoid_gating_delta_rule.py` will confirm that the fix is working as expected. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>	2026-02-09 16:43:27 +08:00
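The test bug described in the commit above (a state tensor mutated in place by one path and then reused by the other) can be reproduced in miniature with plain Python lists. The functions `fused_update` and `split_update` are hypothetical stand-ins, not the real `fused_sigmoid_gating_delta_rule_update` kernel; only the sharing pattern is the point.

```python
import copy


def fused_update(state, delta):
    """Stand-in for a fused kernel: updates `state` in place and returns it."""
    for i, d in enumerate(delta):
        state[i] += d
    return state


def split_update(state, delta):
    """Stand-in for the reference path: returns a new list and leaves `state` untouched."""
    return [s + d for s, d in zip(state, delta)]


def compare_shared_state(delta):
    # Buggy test pattern: the second path sees the state already mutated by the first.
    state = [1.0] * len(delta)
    fused = list(fused_update(state, delta))
    split = split_update(state, delta)
    return fused == split


def compare_separate_state(delta):
    # Fixed test pattern: each path starts from its own copy of the same initial state.
    initial = [1.0] * len(delta)  # deterministic init, analogous to torch.ones
    fused = fused_update(copy.deepcopy(initial), delta)
    split = split_update(copy.deepcopy(initial), delta)
    return fused == split
```

With a non-zero `delta`, `compare_shared_state` reports a mismatch even though both implementations are correct, while `compare_separate_state` reports agreement.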
xulei	8325528368	[Kernel]: Optimize DispatchFFNCombine performance (#6468 ) ### What this PR does / why we need it? This PR focuses on performance optimization for the DispatchFFNCombine operator. The key optimizations include: 1. Improving communication efficiency by merging the transmission of tokens and scales; 2. Decoupling multi-core dependencies and reducing waiting bubbles in the combine process through tile-granularity communication; 3. Optimizing the full-card synchronization overhead before the unpermute operation. These optimizations aim to reduce the overall execution latency of the DispatchFFNCombine operator and enhance the runtime performance of the model inference process on Ascend devices. ### Does this PR introduce _any_ user-facing change? No. This PR only involves internal performance optimization of the DispatchFFNCombine operator and does not introduce any changes to user-facing APIs, interfaces, or behaviors. ### How was this patch tested? 1. Enable the DispatchFFNCombine operator by setting the environment variable: ``` export VLLM_ASCEND_ENABLE_FUSED_MC2=1 ``` 2. Run the standard model inference test suite with the above environment variable enabled; 3. Verify the correctness of model outputs (ensuring no functional regression) and measure the performance improvement of the DispatchFFNCombine operator (reduced latency and improved throughput). - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: xulei_ict <xulei292@huawei.com> Co-authored-by: xulei_ict <xulei292@huawei.com>	2026-02-09 16:30:34 +08:00
wangxiyuan	9c6d031797	[MISC] Clean up useless env USE_OPTIMIZED_MODEL (#6618 ) Clean up useless env `USE_OPTIMIZED_MODEL` - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-09 15:38:58 +08:00
Canlin Guo	b7aa511daa	[Patch] Remove the patch of MiniCPM (#5975 ) ### What this PR does / why we need it? Part of #5304. After https://github.com/vllm-project/vllm/pull/32523 merge, we could remove the patch of `MiniCPMAttention`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Test it locally. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-02-09 14:07:44 +08:00
liziyu	e5f0e0eaf7	[P/D] layerwise connector support recompute scheduler (#5900 ) ### What this PR does / why we need it? The layerwise connector now supports the recompute scheduler. NOTE: Triggering recompute will invoke the tokenizer again, which may lead to precision fluctuations. [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support https://github.com/vllm-project/vllm-ascend/issues/4842 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-02-07 15:24:42 +08:00
wangxiyuan	d266fd7b47	[CI] Add workflow support for lint image build (#6489 ) Support specify commit hash with lint image build workflow - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-07 09:32:01 +08:00
Zetong Li	4fa7cf6f50	[Bugfix] Fix problematic dummy_run & improper input_batch_size in eagle (#6517 ) ### What this PR does / why we need it? This PR aims to fix problematic dummy_run that will cause excessive npu memory and to fix improper input_batch_size that will degrade running performance. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: Zetong Li <slippersss@126.com> Signed-off-by: lilinsiman <lilinsiman@gmail.com> Co-authored-by: lilinsiman <lilinsiman@gmail.com>	2026-02-07 09:30:10 +08:00
pu-zhe	1cc225711d	[Refactor]310p_e2e test case update (#6539 ) ### What this PR does / why we need it? This pull request significantly enhances the test suite by adding new end-to-end test cases for Qwen3 models on the 310P hardware platform. The primary goal is to ensure the stability and correctness of these models under diverse operational conditions, including various parallelism strategies, data types, and quantization methods. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:28:37 +08:00
lty	c3db1aca2f	[Refactor]refactor p2p connector (#6551 ) ### What this PR does / why we need it? Redundant code is removed, and repeated logic is combined through the p2p connector refactor, making the code easy to extend. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? P node: ``` vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host 0.0.0.0 \ --port 8002 \ --data-parallel-size 2 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --seed 1024 \ --served-model-name model \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --max-num-seqs 16 \ --enforce-eager \ --trust-remote-code \ --gpu-memory-utilization 0.92 \ --quantization ascend \ --async-scheduling \ --additional-config '{"ascend_scheduler_config":{"enabled":true}}' \ --kv-transfer-config \ '{ "kv_connector": "MultiConnector", "kv_role": "kv_producer", "kv_connector_extra_config": { "use_layerwise": false, "connectors": [ { "kv_connector": "MooncakeConnectorV1", "kv_role": "kv_producer", "kv_port": "30000", "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 2, "tp_size": 8 }, "decode": { "dp_size": 4, "tp_size": 4 } } }, { "kv_connector": "AscendStoreConnector", "kv_role": "kv_producer", "kv_connector_extra_config": { "backend": "mooncake", "mooncake_rpc_port":"0" } } ] } }' ``` D node: ``` vllm serve /mnt/share/DeepSeek-V3.2-Exp-W8A8 \ --host 0.0.0.0 \ --port 8003 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --enable-expert-parallel \ --seed 1024 \ --served-model-name model \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --max-num-seqs 16 \ --enforce-eager \ --trust-remote-code \ --gpu-memory-utilization 0.92 \ --quantization ascend \ --async-scheduling \ --additional-config '{"ascend_scheduler_config":{"enabled":true}}' \ --kv-transfer-config \ '{ "kv_connector": "MultiConnector", "kv_role": "kv_consumer", "kv_connector_extra_config": { "use_layerwise": false, "connectors": [ { "kv_connector": "MooncakeConnectorV1", 
"kv_role": "kv_consumer", "kv_port": "30100", "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 2, "tp_size": 8 }, "decode": { "dp_size": 4, "tp_size": 4 } } },{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_consumer", "kv_connector_extra_config": { "backend": "mooncake", "mooncake_rpc_port":"1" } } ] } }' ``` - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-07 09:27:15 +08:00
pu-zhe	4f33e25046	[Refactor]refactor 310p attention impl and add ut (#6579 ) ### What this PR does / why we need it? This pull request significantly refactors the attention mechanism for the Ascend 310P hardware, enhancing its architecture by separating mask generation concerns from the core attention implementation. It introduces a dedicated mask builder class capable of handling various mask types, including causal, splitfuse, and sliding window attention masks, all optimized for the NPU's fractal data format. This change not only cleans up the codebase but also lays the groundwork for more robust and feature-rich attention operations on Ascend devices, backed by new, extensive unit tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3 and qwen3-moe - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:26:26 +08:00
pu-zhe	23524f2ca4	[Refactor]refactor 310p ops and add ut (#6591 ) ### What this PR does / why we need it? This pull request focuses on a significant refactoring effort within the vllm-ascend project, specifically targeting operations optimized for the Ascend 310P hardware. The changes aim to streamline the implementation of core components like quantization and multi-head attention, making the codebase more maintainable and robust. Concurrently, new unit tests have been introduced to ensure the correctness and reliability of these refactored modules. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3-32b w8a8 - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:25:17 +08:00
wangxiyuan	6c49f95da2	[Ops][Refactor] Remove custom rotary_embedding operator (#6523 ) ### What this PR does / why we need it? This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests. The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring and does not introduce any user-facing changes. ### How was this patch tested? The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-07 09:24:05 +08:00
SILONG ZENG	06aa6036f6	[Lint]Style: Convert `vllm-ascend/` to ruff format(new Batch #8 ) (#6604 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| vllm_ascend/ops/\_\_init\_\_.py \| \| vllm_ascend/ops/activation.py \| \| vllm_ascend/ops/flashcomm2_oshard_manager.py \| \| vllm_ascend/ops/layernorm.py \| \| vllm_ascend/ops/mla.py \| \| vllm_ascend/ops/mm_encoder_attention.py \| \| vllm_ascend/ops/register_custom_ops.py \| \| vllm_ascend/ops/vocab_parallel_embedding.py \| \| vllm_ascend/ops/weight_prefetch.py \| \| vllm_ascend/spec_decode/\_\_init\_\_.py \| \| vllm_ascend/spec_decode/eagle_proposer.py \| \| vllm_ascend/spec_decode/interface.py \| \| vllm_ascend/spec_decode/mtp_proposer.py \| \| vllm_ascend/spec_decode/ngram_proposer.py \| \| vllm_ascend/spec_decode/suffix_proposer.py \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-07 09:16:07 +08:00
wangyu	c63b7a1188	[Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301 ) ### What this PR does / why we need it? This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test and by running CI - vLLM version: release/v0.12.0 --------- Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>	2026-02-06 17:30:17 +08:00
wangxiyuan	06c0aed124	[CI] Fix broken CI (#6599 ) Revert `4fb3d5e1b2`, which breaks the E2E test - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd`	2026-02-06 17:23:58 +08:00
SILONG ZENG	19b5d44ea8	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #10 ) (#6173 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|`vllm_ascend/ops/layer_shard_linear.py`\| \|`vllm_ascend/ops/linear.py`\| \|`vllm_ascend/ops/linear_op.py`\| \|`vllm_ascend/worker/worker.py`\| \| ` vllm_ascend/patch/worker/patch_bert.py` \| \| ` vllm_ascend/patch/worker/patch_deepseek.py` \| \| ` vllm_ascend/patch/worker/patch_distributed.py` \| \| ` vllm_ascend/patch/worker/patch_module.py` \| \| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` \| \| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` \| \| ` vllm_ascend/patch/worker/patch_rope.py` \| \| ` vllm_ascend/patch/worker/patch_triton.py` \| \| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` \| \| ` vllm_ascend/patch/worker/patch_v2_egale.py` \| \|` vllm_ascend/worker/npu_input_batch.py`\| \|` vllm_ascend/worker/v2/aclgraph_utils.py`\| \|` vllm_ascend/worker/v2/attn_utils.py`\| \|` vllm_ascend/worker/v2/model_runner.py`\| \|` vllm_ascend/worker/v2/sample/gumbel.py`\| \|` vllm_ascend/worker/v2/sample/penalties.py`\| \|` vllm_ascend/worker/v2/sample/sampler.py`\| \|` vllm_ascend/worker/v2/spec_decode/__init__.py`\| \|` vllm_ascend/worker/v2/spec_decode/eagle.py`\| \|` vllm_ascend/worker/v2/states.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 15:35:06 +08:00
SILONG ZENG	65b7f716e6	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #11 ) (#6176 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/ops/fused_moe/comm_utils.py` \| \| `vllm_ascend/ops/fused_moe/experts_selector.py` \| \| `vllm_ascend/ops/fused_moe/fused_moe.py` \| \| `vllm_ascend/ops/fused_moe/moe_comm_method.py` \| \| `vllm_ascend/ops/fused_moe/moe_mlp.py` \| \| `vllm_ascend/ops/fused_moe/prepare_finalize.py` \| \| `vllm_ascend/ops/fused_moe/token_dispatcher.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-02-06 15:28:49 +08:00
SILONG ZENG	4fb3d5e1b2	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #8 ) (#6129 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| vllm_ascend/ops/\_\_init\_\_.py \| \| vllm_ascend/ops/activation.py \| \| vllm_ascend/ops/flashcomm2_oshard_manager.py \| \| vllm_ascend/ops/layernorm.py \| \| vllm_ascend/ops/mla.py \| \| vllm_ascend/ops/mm_encoder_attention.py \| \| vllm_ascend/ops/register_custom_ops.py \| \| vllm_ascend/ops/vocab_parallel_embedding.py \| \| vllm_ascend/ops/weight_prefetch.py \| \| vllm_ascend/spec_decode/\_\_init\_\_.py \| \| vllm_ascend/spec_decode/eagle_proposer.py \| \| vllm_ascend/spec_decode/interface.py \| \| vllm_ascend/spec_decode/mtp_proposer.py \| \| vllm_ascend/spec_decode/ngram_proposer.py \| \| vllm_ascend/spec_decode/suffix_proposer.py \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-02-06 15:25:08 +08:00
SILONG ZENG	99aedaff63	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #7 ) (#6023 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|` vllm_ascend/quantization/compressed_tensors/compressed_tensors.py`\| \|` vllm_ascend/quantization/quant_config.py`\| \|` vllm_ascend/quantization/utils.py`\| \|` vllm_ascend/quantization/w4a16.py`\| \|` vllm_ascend/quantization/w4a4_flatquant_dynamic.py`\| \|` vllm_ascend/quantization/w4a8_dynamic.py`\| \|` vllm_ascend/quantization/w8a16.py`\| \|` vllm_ascend/quantization/w8a8.py`\| \|` vllm_ascend/quantization/w8a8_dynamic.py`\| \|` vllm_ascend/quantization/w8a8_pdmix.py`\| \|` vllm_ascend/quantization/w8a8mxfp8.py`\| \|` vllm_ascend/sample/rejection_sampler.py`\| \|` vllm_ascend/sample/sampler.py`\| \|` vllm_ascend/worker/block_table.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-06 14:56:53 +08:00
wangxiyuan	d0bc16859c	[CI][Misc] Some improvement for github action (#6587 ) ### What this PR does / why we need it? - This PR removes several self-hosted runner labels from the `actionlint.yaml` configuration file. These runners are likely no longer in use, so this change cleans up the configuration and ensures `actionlint` has an accurate list of available runners. - Move all Action dockerfiles to one folder - remove useless `runner` input for e2e test. - update workflow option version ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a configuration change for the CI linter. The correctness will be verified by `actionlint` running in CI on subsequent pull requests. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 14:06:27 +08:00
Li Wang	d018aeb5fa	[Image] Bump mooncake version to v0.3.8.post1 (#6428 ) ### What this PR does / why we need it? This patch bumps the mooncake version to the latest [release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested locally: `>>> from mooncake.engine import TransferEngine` - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-02-06 10:54:03 +08:00
pu-zhe	85e33941e8	[Feat.]: 310p support MOE models (#6530 ) ### What this PR does / why we need it? This pull request integrates comprehensive support for Mixture of Experts (MoE) models on the Ascend 310P device within the vllm-ascend framework. It achieves this by introducing specialized modules for expert selection, fused MoE layers, and optimized all-gather communication. The changes also refine existing NPU operations, making them more consistent and efficient for 310P, ultimately enhancing the performance and compatibility of MoE models on this hardware. Highlights 310P MoE Support: Introduces dedicated implementations for Mixture of Experts (MoE) models on Ascend 310P devices, including new modules for expert selection, fused MoE layers, and communication. All-Gather Communication: Enforces the use of ALLGATHER communication for MoE operations on 310P, optimizing data transfer and leveraging NPU-specific token dispatching. Simplified NPU Operations: Removes conditional type casting for npu_swiglu and enables custom rotary embedding kernels unconditionally, suggesting improved native support for 310P. New MoE Classes Registered: Registers AscendFusedMoE310 and AscendSharedFusedMoE310 to integrate 310P-specific MoE layers into the system's custom operation registry. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? offline test and server test, with qwen3-30b-a3b,tp/ep 4 on 310p - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-06 10:30:56 +08:00
wangxiyuan	c38166eefa	[Doc] backport 0.13.0 release note (#6584 ) ### What this PR does / why we need it? Backport 0.13.0 release note to main branch and update related doc link ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? by doc CI - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 10:29:15 +08:00
Nengjun Ma	11339eb48a	[CI] Update UT CANN version to 8.5.0 for main branch (#6564 ) ### What this PR does / why we need it? Update UT CANN version to 8.5.0 ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-06 10:28:42 +08:00
zhangxinyuehfad	81f3c09d6d	[CI] Change A2 runner (#6557 ) ### What this PR does / why we need it? This PR updates the CI runner from `linux-aarch64-a2-` to `linux-aarch64-a2b3-` in various test configuration files. This change is necessary to adapt to updates in the CI infrastructure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The changes are configuration updates for CI tests. The correctness will be verified by the CI pipeline. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 23:43:57 +08:00
Ruowei Zheng	8e66299bf1	[Bugfix] Fix the incorrect use of the output parameter in _forward_fia_slidingwindow (#6469 ) ### What this PR does / why we need it? Fix the incorrect use of the `output` parameter in `_forward_fia_slidingwindow`: ``` # Original (incorrect) output, _ = torch_npu.npu_fused_infer_attention_score(...) output = output.view(batch_size, self.num_heads, self.head_size) ``` The original code rebound the `output` parameter to a new tensor, which is inconsistent with the interface definition: the buffer passed in by the caller was never updated. The fix writes the result into `output` in place: ``` attn_output, _ = torch_npu.npu_fused_infer_attention_score(...) attn_output = attn_output.view(batch_size, self.num_heads, self.head_size) output[:batch_size] = attn_output[:batch_size] ``` ### Does this PR introduce _any_ user-facing change? No change. Co-authored-by: GoCHug<gch59135228@163.com> ### How was this patch tested? vLLM ascend version: v0.13.0rc1 Signed-off-by: acat-rw <892882856@qq.com>	2026-02-05 20:58:54 +08:00
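
The `output` fix in `8e66299bf1` comes down to the difference between rebinding a parameter name and writing into the caller's buffer. The sketch below illustrates that difference with plain Python lists standing in for tensors. The function names `forward_rebind` and `forward_in_place` are hypothetical and are not taken from vllm-ascend.

```python
def forward_rebind(output, result):
    # Rebinding the local name creates a new object;
    # the caller's buffer is left untouched.
    output = list(result)
    return output


def forward_in_place(output, result):
    # Slice assignment writes into the buffer the caller passed in.
    n = len(result)
    output[:n] = result
    return output


buf = [0, 0, 0, 0]

forward_rebind(buf, [1, 2])
after_rebind = list(buf)      # caller's buffer is unchanged

forward_in_place(buf, [1, 2])
after_in_place = list(buf)    # first two slots now hold the result
```

A caller that preallocates `output` and ignores the return value sees the result only in the in-place variant, which matches the behaviour the commit message describes.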

2430 Commits