xc-llm-ascend

Author	SHA1	Message	Date
MengLong Chen	a838a89630	[v0.16.0][P/D][Bugfix] Support ALL D-Nodes in fullgraph when running MTP in PD (#6948 ) ### What this PR does / why we need it? Fix the bug for v0.16.0 recompute_scheduler, the same way as https://github.com/vllm-project/vllm-ascend/pull/5472. Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2026-03-06 10:01:33 +08:00
LI SHENGYONG	ccd00798f3	[EPLB] Display the expert hotness comparison before and after eplb. (#6877 ) ### What this PR does / why we need it? To intuitively show the effect of the eplb algorithm, we print the expert heat before and after eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-02-28_17-23-42](https://github.com/user-attachments/assets/db1dadd1-cf96-44da-af34-57d41ccf412f) - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-06 09:53:29 +08:00
frank	18b52afe2b	[Ops][Misc] Optimize split_qkv_rmsnorm_rope op (#6827 ) ### What this PR does / why we need it? This PR optimizes the `split_qkv_rmsnorm_rope` operator by introducing a new Triton kernel, `split_qkv_rmsnorm_rope_prefill_kernel`, for the prefill stage (i.e., large batch sizes). The implementation now dynamically selects between the existing decode kernel and the new prefill kernel based on the batch size, which improves performance for large batch scenarios. Additionally, the RoPE implementation is updated to support partial rotation dimensions (`rope_dim`), making the operator more flexible. ### Does this PR introduce _any_ user-facing change? No. This is a performance optimization and is not expected to introduce any user-facing changes. ### How was this patch tested? CI should pass with existing tests. The new prefill path is triggered when the batch size is larger than the number of available vector cores. The partial RoPE feature can be tested by passing the `rope_dim` argument. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: guzhiyong <guzhiyong5@h-partners.com> Signed-off-by: frank <2547457096@qq.com> Co-authored-by: guzhiyong <guzhiyong5@h-partners.com>	2026-03-06 09:30:31 +08:00
Zetong Li	a60e179c7f	[Refactor][EAGLE] 8/N delete mtp_proposer (#7016 ) ### What this PR does / why we need it? This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and glm5, now it should be ok to remove mtp_proposer. The bug is actually about unnecessary slicing of `slot_mapping`. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-03-06 09:10:57 +08:00
SILONG ZENG	bd571cf6d6	[Main2Main] Upgrade vLLM to 0303 (#6944 ) ### What this PR does / why we need it? break: - https://github.com/vllm-project/vllm/pull/34102 Disable_full param replaced with valid_modes/invalid_modes API - https://github.com/vllm-project/vllm/pull/35503 Now must return float compilation_time - https://github.com/vllm-project/vllm/pull/35564 New sequence_lengths param added - https://github.com/vllm-project/vllm/pull/33807 A check was performed (if runner_backend != "auto") - https://github.com/vllm-project/vllm/pull/34861 `BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to check process group state - https://github.com/vllm-project/vllm/pull/35274 Important change: - https://github.com/vllm-project/vllm/pull/28672 `matcher_utils` directly accesses `torch.ops._C.*` during the import phase. In the Ascend environment, some unregistered ops trigger `AttributeError`, causing e2e initialization failure. https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323 https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29 This PR adds temporary compatibility placeholders (rms_norm, fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant, silu_and_mul) to `vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to ensure no crashes during the import phase. Upstream repairs will be considered later. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: Claude Code <noreply@anthropic.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com>	2026-03-06 09:08:52 +08:00
liuchen2026fly	640ecd1b77	[BugFix] Fix muls_add fusion not working for GLM5 models (#6928 ) ### What this PR does / why we need it? fix: support model-specific routed_scaling_factor in muls_add fusion Previously, MulsAddFusionPass used a hardcoded scale=1.0, which failed to match the x * routed_scaling_factor + y pattern in models like GLM5 that use routed_scaling_factor=2.5. This caused the muls_add fusion to be skipped, leaving unoptimized mul+add operations. This fix reads routed_scaling_factor from model config (defaulting to 1.0 for backward compatibility) and uses it as the pattern scale, enabling correct fusion for GLM5 and other models with custom scaling factors. Fixes: Unoptimized mul+add in GLM5 attention blocks Tested: GLM5-W8A8 with routed_scaling_factor=2.5 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: liuchenbing <chenliumail@163.com> Co-authored-by: liuchenbing <chenliumail@163.com>	2026-03-05 22:35:54 +08:00
fems14	ae394767d4	【main】ADXL/HIXL supports FabricMem Mode (#6806 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-03-05 21:04:11 +08:00
Cao Yi	50441e4650	[BugFix][MTP] Fix prefill misclassified as decode when prompt tokens == num_spec_tokens + 1 (#6835 ) ## Problem When MTP is enabled, prefill requests with `prompt_tokens == num_spec_tokens + 1` are incorrectly classified as decode requests, causing accuracy issues. ## Root Cause The `uniform_decode` condition only checked: - `max_num_scheduled_tokens == uniform_decode_query_len` - `num_tokens == max_num_scheduled_tokens * num_reqs` This is insufficient because a prefill request with specific prompt length satisfies these conditions as well. ## Fix Add `is_all_decode` check to ensure all requests have `num_computed_tokens > 0` before classifying as uniform decode, since decode requests must have computed at least one token. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-03-05 17:33:10 +08:00
dsxsteven	91c39ebae6	[BugFix] [dcp] Fix GQA Model Error when Enable both DP and DCP (#7012 ) ### What this PR does / why we need it? For GQA model, when we enable both dp and dcp (disable pcp), the key-value pairs were not being captured correctly; we have now fixed it. Signed-off-by: dsxsteven <dsxsteven@sina.com>	2026-03-05 16:51:08 +08:00
tanhaoan333	1f2a083597	[bugfix]Qwen-Omni quantization model_type bugfix (#7007 ) ### What this PR does / why we need it? [bugfix]Qwen-Omni quantization model_type bugfix ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>	2026-03-05 16:34:34 +08:00
realliujiaxu	1a7f845696	[Feat][Worker] NPUWorker Profiler profile_prefix full adaptation (RFC #6954 ) (#6968 ) ## What this PR does / why we need it? Implements [RFC #6954](https://github.com/vllm-project/vllm-ascend/issues/6954): NPUWorker Profiler profile_prefix full adaptation for API parity with upstream vLLM. ### Changes - Lazy profiler init: Defer profiler creation until first `profile(is_start=True)` call - profile_prefix param: Add `profile_prefix` to `profile()`; compute `trace_name` from prefix + `get_worker_rank_suffix()` - Refactor `_init_profiler` → `_create_profiler(trace_name)`: Pass `worker_name` to `tensorboard_trace_handler` for unique trace files per worker - Unique trace files per worker; no collision in multi-worker setups ### Testing - Unit tests updated/added in `tests/ut/worker/test_worker_v1.py` - `pytest tests/ut/worker/test_worker_v1.py::TestNPUWorker` passed ## Does this PR introduce _any_ user-facing change? Yes. Trace file naming may differ (more descriptive with worker rank suffix). `profile(is_start=True, profile_prefix="warmup")` now supported. ## How was this patch tested? - Unit tests:`pytest tests/ut/worker/test_worker_v1.py::TestNPUWorker` - Manual: vLLM serve with profiler config, start/stop profile, verified trace files - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-05 16:18:34 +08:00
LeeWenquan	3047b724b3	Add GemmaRmsNorm ACLGraph Support (#6473 ) ### What this PR does / why we need it? 1. New Custom NPU Operation: Introduced npu_gemma_rms_norm in csrc/torch_binding.cpp to provide optimized Gemma RMS Normalization support for Ascend NPUs. This function includes logic to handle dynamic shapes for the gamma tensor. 2. PyTorch Operator Registration: The new npu_gemma_rms_norm operation has been registered with the PyTorch custom operator library, making it accessible from Python. 3. Meta-Implementation for ACLGraph: A corresponding meta-implementation, npu_gemma_rms_norm_meta, was added in csrc/torch_binding_meta.cpp. This is crucial for symbolic tracing and allowing the custom kernel to be captured and optimized by ACLGraph. 4. Python Frontend Integration: The vllm_ascend/ops/layernorm.py file was updated to utilize the newly added torch.ops._C_ascend.npu_gemma_rms_norm for Gemma RMS Normalization, replacing the generic torch_npu.npu_rms_norm. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: LeeWenquan <83354342+SunnyLee151064@users.noreply.github.com>	2026-03-05 16:15:07 +08:00
LI SHENGYONG	5a3744c542	[EPLB] The profiling can collect the time required for adjusting the eplb. (#7001 ) ### What this PR does / why we need it? To analyze the overhead of the dynamic eplb adjustment framework in detail, we added the time consumption of the adjustment to the print information in profiling mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-03-05_11-42-28](https://github.com/user-attachments/assets/41c2b82a-5dfa-4e39-8b50-f4649deed30c) - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-05 16:10:57 +08:00
songjianquan	43c8da3574	[Feat]fused_qkvzba_split_reshape supports token number greater than 65536 (#6740 ) ### What this PR does / why we need it? This pull request optimizes the fused_qkvzba_split_reshape_cat Triton kernel for the Qwen3-Next GatedDeltaNet model and removes the previous conditional restrictions in the forward pass. Key changes: 1. Refactored Triton kernel implementation: The fused_qkvzba_split_reshape_cat_kernel has been optimized with a new loop-based approach that supports arbitrary num_v_heads / num_k_heads ratios and batch sizes. The kernel now uses a configurable ROWS_PER_ITER for better memory utilization. 2. The optimized kernel now handles all scenarios directly without requiring a fallback path using fix_query_key_value_ordering and torch.cat. ### Does this PR introduce _any_ user-facing change? No. This is an internal optimization of the Triton kernel implementation and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with existing tests. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: songjianquan <songjianquan1@huawei.com> Co-authored-by: songjianquan <songjianquan1@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-05 14:41:38 +08:00
wangxiyuan	13777bf3f0	[Spec Decode]clean up spec decode interface (#6947 ) This pull request refactors the speculative decoding proposer interface to align with upstream vLLM, removing the local `Proposer` interface and renaming methods to `propose`. This is the first step. In the future, we should remove the class registration and add only a few Ascend-specific methods once the architecture in vLLM is ready. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-05 14:30:10 +08:00
rjg-lyh	2bd9c35788	[perf][refactor] Refactor and optimize sfa_v1.py for dsv3.2/glm5 (#6874 ) ### What this PR does / why we need it? This PR refactors sfa_v1.py to improve code readability and usability, fixes a code bug, and enhances performance through the replacement of certain operators. ### changes - improve code readability: Optimizes parts of the code structure in sfa_v1.py, adds comments for key code blocks, removes some unused variables, and improves the naming of certain functions and variables. - resolve a redundant double write to k_cache: Fixed redundant double writes of k_cache in the indexer_select module (in both the `forward` function and `indexer_select_post_process`), improving performance to some extent. - replace `scatter` ops with `reshape_and_cache`: This optimization replaces two separate cache storage operations on `k_nope` and `k_pe` with a single call to the `reshape_and_cache` operator, improving performance. The original `scatter` operator reorders slot_mapping for generality, which introduces significant scalar computation. In contrast, the `reshape_and_cache` operator eliminates this redundant reordering step, reducing unnecessary computation time. ### performance comparison 4A3, 1P1D, P dp2tp16, D dp8tp4, input/output: 64K/3K - origin: TTFT: 28s, TPOT: 26ms, TPS: 820 token/s - fixed redundant double writes of k_cache: TTFT: 24s, TPOT: 26ms, TPS: 840 token/s - replace scatter ops with reshape_and_cache: TTFT: 24s, TPOT: 26ms, TPS: 850 token/s ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-05 14:27:11 +08:00
Ronald	77e009d9fc	[Feature] Add docs of batch invariance and make some extra operators patch (#6910 ) ### What this PR does / why we need it? This PR adds docs for batch invariance and patches some extra operators according to the validation results. Please see https://github.com/vllm-project/vllm-ascend/issues/5487 to track progress. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-05 09:12:40 +08:00
tanhaoan333	f8315f5717	[bugfix]Qwen2.5VL accuracy issue (#6975 ) ### What this PR does / why we need it? The attention mechanism in the ViT model architecture of Qwen2.5VL consists of two parts and does not support using a cache to pass sequence lengths. ### Does this PR introduce _any_ user-facing change? Removes seq_lens_cache. ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>	2026-03-04 22:02:29 +08:00
Zhujiyang2	c3c265648f	[Ops][BugFix] Fix RoPE shape mismatch for mtp models with flashcomm v1 enabled (#6939 ) What this PR does / why we need it? When using a draft model (e.g., in MTP speculative decoding) with shared expert data parallelism (enabled via flashcomm), a shape mismatch error occurs in the rotary embedding calculation for models like GLM-4.7. This is because the positions tensor has an incorrect shape for this specific configuration. This PR fixes the issue by adding a check in AscendRotaryEmbedding.forward_oot. If the model is a draft model and shared expert DP is enabled, it processes the positions tensor using torch.ops.vllm.maybe_all_gather_and_maybe_unpad to ensure its shape is correct before applying the rotary embedding. This resolves the shape mismatch error. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>	2026-03-04 16:02:08 +08:00
zhaomingyu13	52d9086f64	[Bugfix] Fix the acceptance rate drop issue when applying eagle3 to QuaRot model (#6914 ) ### What this PR does / why we need it? When the target model has undergone rotational quantization, the acceptance rate decreases because the fc weight of the draft model has not undergone rotational quantization (issue: #6445). We fixed this issue by applying rotational quantization to the fc weight of the draft model, in the same way as for the main model, when loading the draft model. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-03-04 11:29:49 +08:00
Cao Yi	a0a904a3d4	[BugFix] Improve GDN layer detection for multimodal models (#6941 ) ## Summary - Enhanced `check_gdn_layer()` function to properly detect GDN layers in multimodal models - Added support for checking `text_config.layer_types` in addition to root-level `layer_types` - Fixed potential None reference errors when `layer_types` attribute is missing ## Changes - Modified `vllm_ascend/utils.py`: - Replaced `hasattr()` check with safer `getattr()` approach - Added fallback to empty list when `layer_types` is None - Added secondary check for `text_config.layer_types` to support models like Qwen-Omni ## Motivation Previous implementation only checked `layer_types` at the root config level, which failed to detect GDN layers in multimodal models where this information is nested under `text_config`. Additionally, it could raise errors when `layer_types` was None. --- Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code) - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>	2026-03-03 20:08:39 +08:00
weiguihua2	5b05b3a090	[feat]ds3.2 pcp support mtp and chunkprefill (#6917 ) ### What this PR does / why we need it? ds3.2 pcp supports the combination of MTP and chunkprefill features. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-03-03 19:03:50 +08:00
Frank Chen	b771ca9a47	[CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (#6945 ) ### What this PR does / why we need it? This PR introduces global CPU slicing for Ascend NPUs to ensure non-overlapping CPU partitions, addresses IRQ binding logical errors on A3, and enhances the logic for determining total NPUs in CPU allocation. These changes are necessary to optimize CPU resource management and improve system stability. - Global CPU Slicing: Introduced a global CPU slicing mechanism for Ascend NPUs to ensure non-overlapping CPU partitions across multiple processes or data parallel groups, preventing resource contention. - Improved IRQ Binding for A3 Devices: Refined the IRQ binding logic specifically for Ascend A3 devices, correctly mapping logical NPU IDs to physical card and chip IDs for accurate npu-smi queries and preventing multi-process overwrite of IRQ settings. - Enhanced NPU Count Determination: Improved the logic for determining the total number of logical NPUs, prioritizing NPU mapping information to ensure more accurate CPU allocation. - Minimum CPU Requirement: Established a minimum requirement of 5 CPUs per NPU for binding, reserving specific cores for IRQ, main, ACL, and release operations to ensure stable operation. ### Does this PR introduce _any_ user-facing change? No user-facing changes are introduced. ### How was this patch tested? CI passed with new added/existing tests. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: c00818886 <chenchuwei@huawei.com>	2026-03-03 17:20:52 +08:00
linfeng-yuan	700423156f	[Triton] Centralize Ascend extension op dispatch in triton_utils (#6937 ) ### What this PR does / why we need it? This pull request refactors the dispatch mechanism for the triton-ascend-specific operators `insert_slice`, `extract_slice`, and `get_element` to ensure compatibility with both CANN 8.5 and 9.0. A unified helper function, `_resolve_triton_ascend_op`, has been introduced in `vllm_ascend/ops/triton/triton_utils.py`. This function dynamically resolves these operators by first attempting to import them from the `triton.language.extra.cann.extension` module, which is present in newer CANN versions. If that fails, it falls back to the standard `triton.language` module. This approach centralizes operator dispatch logic, allowing individual Triton kernels to use these functions without being aware of the underlying Triton/CANN version. All call sites have been updated to use these new unified functions. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring of operator implementations and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with existing tests. Testing Context: - vLLM version: v0.16.0 - vLLM main: `15d76f74e2fdb12a95ea00f0ca283acf6219a2b7` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-03 17:10:30 +08:00
Shaoxu Cheng	2064afe380	[300I][Bugfix] fix unquant model weight nd2nz error (#6851 ) ### What this PR does / why we need it? - This PR fixes an issue with weight format conversion for unquantized models running on Ascend 310P devices. - The changes refactor the logic for converting weights to the FRACTAL_NZ format. Previously, this was handled in a 310P-specific linear layer implementation (`AscendUnquantizedLinearMethod310`). This implementation has been removed, and the logic is now centralized in the `maybe_trans_nz` utility function. This function now checks if the device is a 310P and applies the NZ format cast accordingly for `float16`/`bfloat16` weights. - This refactoring simplifies the code by removing platform-specific duplication and ensures correct weight handling for unquantized models on 310P. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ut and local test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-03 15:57:26 +08:00
tanhaoan333	15f6564976	[Model]Add Qwen3-Omni quantization Ascend NPU adaptation and optimization (#6828 ) ### What this PR does / why we need it? This pull request is for quantization adaptation of Qwen3Omni, and it achieves operator-level optimization and AUT (Auto-Quantization Tuning) component optimization through patch-based modifications. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>	2026-03-03 00:07:23 +08:00
wangxiaoteng888	dfa9ff7f2a	[P/D][v0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 (#6898 ) ### What this PR does / why we need it? Adapt the recompute feature to vLLM 0.16.0, where the D node forwards recompute requests to the P node. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-03-02 23:24:03 +08:00
pu-zhe	5899438a86	[Feat][310p] 310P support w8a8s quantization and saving w8a8sc state (#6878 ) ### What this PR does / why we need it? This pull request introduces significant enhancements for 310P device support, primarily by enabling W8A8S quantization and facilitating the saving of models with W8A8SC state outputs. It provides an example script for saving sharded and compressed model states, implements the core W8A8S quantization method, and integrates metadata generation within the 310P worker to accurately describe the quantization types of saved parameters. These changes aim to improve efficiency and compatibility for quantized models on 310P hardware. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? W8A8S accuracy test and W8A8SC state saving. <img width="886" height="184" alt="image" src="https://github.com/user-attachments/assets/e9bcac54-1f69-4d3a-a5b8-221a147ef99d" /> - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-02 20:09:15 +08:00
linfeng-yuan	68d8d20ca2	[misc] move mxfp_compat into device to decouple from quantization init chain (#6918 ) ### What this PR does / why we need it? `mxfp_compat` only provides dtype/symbol compatibility helpers for different `torch_npu` versions, but it was placed under `vllm_ascend.quantization`. Importing it from device/ops paths could trigger `quantization/__init__.py` and pull in heavy quantization method dependencies, increasing startup coupling and causing import-cycle risk (especially on 310P paths). ### Does this PR introduce _any_ user-facing change? No functional behavior change intended. ### How was this patch tested? CI passed. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-02 18:17:01 +08:00
whx	16c879cdf7	[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 ) ### What this PR does / why we need it? Add the muls_add Triton kernel with its related fusion pass. In addition, this PR refactors `AscendCompilationConfig` and deletes `NpugraphExConfig`. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with a newly added test. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-02 17:54:25 +08:00
Yuzhou Tong	9180dd6c51	[BugFix][PCP] Fix precision bugs for pcp/dcp in PD disaggregate (#6876 ) ### What this PR does / why we need it? Fix a bug in PCP/DCP under PD disaggregation: some conditions only considered MLA and ignored DSA. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `15d76f74e2` - vLLM Ascend main: `81fb7d5779` Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com> Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>	2026-03-02 16:11:00 +08:00
Shaoxu Cheng	ddc78dbade	[300I] support decode-only aclgraph mode (#6849 ) ### What this PR does / why we need it? Adds aclgraph mode on 310P, with some known limitations: - Because of the hardware event-id limit, the number of graphs is limited. - The CANN version that supports this feature is not available outside Huawei. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? local test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-02 14:15:14 +08:00
Eric-dot	3c66a970f2	add mxfp8 moe quantization (#6670 ) ### What this PR does / why we need it? Support mxfp8 quantization (Qwen MoE), using an adaptor to make the hardware-specific behavior clearer and more maintainable. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: fangrongcan <17343701736@163.com> Signed-off-by: wangyao-i <iwangyao@outlook.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com> Co-authored-by: fangrongcan <f00876277@china.huawei.com> Co-authored-by: wangyao-i <iwangyao@outlook.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>	2026-03-02 11:04:06 +08:00
Frank Chen	a77fe932e4	[Platform] Fix CPU binding logic (#6889 ) ### What this PR does / why we need it? - Rework CpuAlloc.handle_no_affinity() to build available NUMA nodes after allowed_cpus filtering, assign NPUs to NUMA nodes via round‑robin, and split CPUs per NPU with disjoint slices for better balance. - Improve bind_memory() robustness by deriving the target NUMA from each NPU’s CPU pool, validating NUMA existence, and skipping binding when data is missing. - bind_memory() now binds only the single NUMA node that corresponds to the NPU id, instead of 2 NUMA nodes. - Fix the issue where all NPUs bind to NUMA node 0 under DP16 because the global NPU id is not visible across DP domains. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added/updated unit tests: test_cpu_binding.py 1. test_binding_mode_table covers A2 vs A3 binding mode mapping. 2. test_build_cpu_pools_fallback_to_numa_balanced covers fallback when affinity info is missing. 3. TestBindingSwitch.test_is_arm_cpu covers ARM/x86/unknown arch detection. 4. test_bind_cpus_skip_non_arm covers non‑ARM skip path in bind_cpus. test_worker_v1.py 1. Updated mocks for enable_cpu_binding default True to align with new config default. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-03-01 20:30:43 +08:00
realliujiaxu	5e24b26a54	[Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883 ) ### What this PR does / why we need it? PR #5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? #### 1. start server ``` bash vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}' ``` #### 2. curl ```bash curl -s http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "auto", "messages": [ {"role": "user", "content": "Hello. I have a question. Who are you?"} ], "max_tokens": 10, "temperature": 0.0, "ignore_eos_token": true }' ``` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-01 20:22:50 +08:00
Bai Yongbin	9d09488b4a	[Feat] support basic pcp&dcp for qwen3next (#6091 ) ### What this PR does / why we need it? This PR implements Context Parallelism (CP) support for the Qwen3-Next model, including PCP (Prefill Context Parallelism) and DCP (Decode Context Parallelism). - vLLM version: v0.15.0 - vLLM main: `f176443446` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Co-authored-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-02-28 21:44:08 +08:00
starmountain1997	5ffae03156	[bugfix] fix capture shape in sp_eagle_fullgraph (#6846 ) ### What this PR does / why we need it? This was meant to be merged in #6536, but I accidentally restored a commit. You can find the relevant discussion [here](https://github.com/vllm-project/vllm-ascend/pull/6536#issuecomment-3882883471). Since `self.pass_config.enable_sp` is forcibly set to `False` in the [source code](`f176443446/vllm/config/compilation.py (L1066)`), this section will no longer verify whether the generated cudagraph shapes are multiples of both `uniform_decode_query_len` (`num_speculative_tokens + 1`) and `tensor_parallel_size`. This PR performs the `num_speculative_tokens + 1` and `tensor_parallel_size` check upfront, so the `cudagraph_size` is no longer silently rounded up, which previously surfaced to the user as a cryptic error. A typical example of this cryptic error looks like: ``` ValueError: could not broadcast input array from shape (196,) into shape (14,) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All tests passed. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com> Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: lilinsiman <lilinsiman@gmail.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-28 17:30:02 +08:00
luomin2005	3cc8bf15da	Support platform.get_device_uuid function (#6777 ) ### What this PR does / why we need it? Support the platform.get_device_uuid function. Currently, pytorch.npu.get_device_properties returns an all-zero uuid. vllm-ascend implements the interface first; once pytorch.npu.get_device_properties returns the real uuid, vllm-ascend will support it without modification. For more details, see https://github.com/vllm-project/vllm-ascend/issues/6669 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` root@localhost:/workspace/l00614971/vllm_test# python vllm_test.py INFO 02-24 09:43:48 [__init__.py:43] Available plugins for group vllm.platform_plugins: INFO 02-24 09:43:48 [__init__.py:45] - ascend -> vllm_ascend:register INFO 02-24 09:43:48 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load. INFO 02-24 09:43:48 [__init__.py:217] Platform plugin ascend is activated device_uuid = 00000000-0000-0000-0000-000000000000 --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: luomin2005 <luomin2005@huawei.com> Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-02-28 14:17:12 +08:00
wangxiyuan	3d563292f3	clean 0.15.0 support (#6852 ) Clean up vllm 0.15.0 related code - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-28 09:20:57 +08:00
drslark	5666ce03f5	[bugfix] Fixed an accuracy problem of gdn layer in graph (#6822 ) ### What this PR does / why we need it? Running a model with GDN attention in graph mode produces random outputs:
```python
from vllm import LLM, SamplingParams

prompts = [
    "1. Who are you?",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
# The second assignment overrides the first: greedy decoding, 5 new tokens.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5)
llm = LLM(
    model="/home/model/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    speculative_config={
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": 3,
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [8],
    },
    max_model_len=4096,
    enable_prefix_caching=False,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{output.prompt_token_ids=}")
    print(f"{output.outputs[0].token_ids=}")
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Before applying this change, the output was:
```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 323, 279, 1112, 279]
Prompt: '1. Who are you?', Generated text: ' What and the... the'
```
After applying this change, the output is:
```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 374, 697, 829, 30]
Prompt: '1. Who are you?', Generated text: ' What is your name?'
```
Why does this change solve the problem? `query_start_loc` is now padded because of `fia`, but for `gdn-attention` the padded version of `query_start_loc` causes an accuracy problem. We therefore keep an unpadded version of `query_start_loc`, named `gdn_query_start_loc`, and use it in `gdn-attention`. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? As described above. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: drslark <slarksblood@qq.com>	2026-02-28 08:57:53 +08:00
lilinsiman	c13d90b766	[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (#6811 ) ### What this PR does / why we need it? [Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into eagle_proposer.py This pull request refactors the speculative decoding mechanism by merging the Prefill Context Parallelism (PCP) and Multi-Token Prediction (MTP) functionality directly into eagle_proposer.py. The changes aim to improve the efficiency and correctness of distributed speculative decoding, in particular by enabling the Eagle feature to work with the disable_padded interface. This involves adjustments to attention metadata, input/output processing, and state management so that it operates correctly in parallel environments. 1. The PCP and MTP features are migrated to eagle_proposer.py 2. The Eagle and PCP features are integrated 3. The Eagle feature can use the disable_padded interface ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests and UT - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-02-27 16:06:56 +08:00
Canlin Guo	e4458b2d2b	[Main2Main] Upgrade vLLM to 0226 (#6813 ) ### What this PR does / why we need it? Breaking: 1. https://github.com/vllm-project/vllm/pull/33452 2. https://github.com/vllm-project/vllm/pull/33451 3. https://github.com/vllm-project/vllm/pull/32567 4. https://github.com/vllm-project/vllm/pull/32344 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-02-27 16:05:21 +08:00
realliujiaxu	5def28dcd3	[Feat]support sequence parallelism by pass for VL models (#5632 )	2026-02-27 08:27:41 +08:00
MengLong Chen	2d49f9079a	[BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (#5472 ) ### What this PR does / why we need it? Bug: When using prefill-decode disaggregation + MTP + full graph + asynchronous scheduling, the KV cache pulled by decode nodes from prefill nodes does not include spec tokens. As a result, the total_num_scheduled_tokens that decode nodes obtain from the scheduler lacks spec tokens. When determining whether to enqueue the full graph on decode nodes, the uniform_decode condition `scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs * max_query_len` is not met, so the current instance is not enqueued into the full graph. As a result, full graph and eagle mode instances coexist among the decode instances. Because of the synchronization wait in MoeDispatch, the decode instances in full graph are significantly slowed down by the instance in eagle mode. Solution: The scenario is PD separation + MTP + full graph + asynchronous scheduling. On the decode nodes, the spec tokens of requests whose KV cache comes from P need to be padded. The padded spec tokens are then rejected by sampling. This ensures that the uniform_decode condition is satisfied when determining whether decode nodes are included in the full graph, so that all decode instances run in the full graph and the synchronous wait in MoeDispatch is avoided. - vLLM version: v0.15.0 - vLLM main: `5326c89803` Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2026-02-26 19:09:05 +08:00
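The uniform_decode condition quoted in the entry above can be illustrated with a small, self-contained sketch. The helper names `is_uniform_decode` and `pad_to_query_len` are hypothetical and the per-request token counts are made up for illustration; only the equality `total_num_scheduled_tokens == num_reqs * max_query_len` comes from the commit message.

```python
def is_uniform_decode(scheduled_tokens_per_req, max_query_len):
    """Condition from the commit message: total == num_reqs * max_query_len."""
    total_num_scheduled_tokens = sum(scheduled_tokens_per_req)
    num_reqs = len(scheduled_tokens_per_req)
    return total_num_scheduled_tokens == num_reqs * max_query_len


def pad_to_query_len(scheduled_tokens_per_req, max_query_len):
    """Hypothetical padding step: requests that arrived without spec tokens
    are padded up to max_query_len; the padded tokens are assumed to be
    rejected later by sampling, as the commit message describes."""
    return [max(n, max_query_len) for n in scheduled_tokens_per_req]


# Illustrative numbers: 3 speculative tokens, so max_query_len = 3 + 1.
max_query_len = 4
# The second request just arrived from a prefill node without spec tokens.
scheduled = [4, 1, 4]
padded = pad_to_query_len(scheduled, max_query_len)
```

With these numbers the unpadded batch schedules 9 tokens against an expected 12, so the condition fails; after padding every request contributes 4 tokens and the condition holds.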
wangxiyuan	532f7a82f2	[Patch][Misc] Cleanup and update patches (#6802 ) ### What this PR does / why we need it? This PR performs a cleanup and update of the patch mechanism in `vllm-ascend`. - Removes several obsolete patches: `patch_deepseek.py`. - Updates the central patch documentation in `vllm_ascend/patch/__init__.py` to reflect these removals and additions, re-numbering and re-organizing the patch list for better clarity. ### Does this PR introduce _any_ user-facing change? No. These are internal changes to the patching mechanism and should not affect users. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-26 14:45:33 +08:00
pu-zhe	e76b69b9ef	[BugFix] [310p] Fix attention accuracy issue (#6803 ) ### What this PR does / why we need it? This pull request resolves an attention accuracy issue by enhancing the AttentionMaskBuilder310 to correctly handle the maximum model length. The change ensures that the attention mask generation process is properly parameterized by the model's configuration, rather than relying on a fixed internal value. This leads to more accurate attention mask creation, which is crucial for the correct functioning of the attention mechanism. Update fused_moe to main branch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3 dense mode & moe model e2e test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-26 14:30:39 +08:00
Canlin Guo	9f8b84e5fc	[Misc] Drop patch_rope.py (#6291 ) ### What this PR does / why we need it? Part of #5304. We have aligned with vLLM's latest change for `RotaryEmbeddingBase`, so this patch is no longer needed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-02-26 14:04:53 +08:00
Cao Yi	3953dcf784	[Feature][Quant] Auto-detect quantization format from model files (#6645 ) ## Summary - Add automatic quantization format detection, eliminating the need to manually specify `--quantization` when serving quantized models. - The detection inspects only lightweight JSON files (`quant_model_description.json` and `config.json`) at engine initialization time, with no `.safetensors` reads. - User-explicit `--quantization` flags are always respected; auto-detection only applies when the flag is omitted. ## Details Detection priority: 1. `quant_model_description.json` exists → `quantization="ascend"` (ModelSlim) 2. `config.json` contains `"quant_method": "compressed-tensors"` → `quantization="compressed-tensors"` (LLM-Compressor) 3. Neither → default float behavior Technical approach: Hooked into `NPUPlatform.check_and_update_config()` to run detection after `VllmConfig.__post_init__`. Since `quant_config` is already `None` at that point, we explicitly recreate it via `VllmConfig._get_quantization_config()` to trigger the full quantization initialization pipeline. ## Files Changed \| File \| Description \| \|------\|-------------\| \| `vllm_ascend/quantization/utils.py` \| Added `detect_quantization_method()` and `maybe_auto_detect_quantization()` \| \| `vllm_ascend/platform.py` \| Integrated auto-detection in `check_and_update_config()` \| \| `vllm_ascend/quantization/modelslim_config.py` \| Improved error handling for weight loading \| - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-02-26 10:59:25 +08:00
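The detection priority listed in the entry above can be sketched as follows. This is an illustrative approximation, not the code added in `vllm_ascend/quantization/utils.py`: the function name `sketch_detect_quantization`, its parameters, and the lookup of `quant_method` both at the top level of `config.json` and under a `quantization_config` key are assumptions.

```python
import json
from pathlib import Path


def sketch_detect_quantization(model_dir, user_quantization=None):
    """Hypothetical sketch of the detection order from the commit message."""
    # An explicit --quantization value is always respected.
    if user_quantization is not None:
        return user_quantization

    model_path = Path(model_dir)

    # 1. ModelSlim: the description file exists.
    if (model_path / "quant_model_description.json").is_file():
        return "ascend"

    # 2. LLM-Compressor: config.json declares compressed-tensors.
    config_file = model_path / "config.json"
    if config_file.is_file():
        with config_file.open(encoding="utf-8") as f:
            config = json.load(f)
        # Assumption: the key may sit at the top level or be nested
        # under "quantization_config".
        nested = config.get("quantization_config")
        if not isinstance(nested, dict):
            nested = {}
        quant_method = config.get("quant_method") or nested.get("quant_method")
        if quant_method == "compressed-tensors":
            return "compressed-tensors"

    # 3. Neither: default float behavior.
    return None
```

Only small JSON files are inspected here, which matches the stated goal of avoiding `.safetensors` reads at engine initialization.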
Dijurido	169e434f78	[CI] Fix EAGLE CI problems (#6702 ) ### What this PR does / why we need it? New FIA operator requires queryT equal to the last element of actualSequenceLengthQ. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed existing test (test_mtp_eagle_correctness.py). - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: Wangbingjie <wangbj1207@126.com> Signed-off-by: Wangbingjie <w30061490@china.huawei.com> Co-authored-by: Wangbingjie <w30061490@china.huawei.com>	2026-02-26 10:26:01 +08:00
Li-Yongwen	2870f7c8ad	[Feat] Support routing replay (#6696 ) ### What this PR does / why we need it? [Feat] Support routing replay. Same as https://github.com/vllm-project/vllm-ascend/pull/6666, resubmitted because of a DOC failure. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: liyongwen <1310439159@qq.com> Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-26 10:22:47 +08:00

1686 Commits