xc-llm-ascend

Author	SHA1	Message	Date
chenxi-hh	42bcad7e9b	GMM custom operator optimization in small batch scenarios (#7100 ) ### What this PR does / why we need it? GMM custom operator optimization in small batch scenarios ### How was this patch tested? Qwen3-30B, input: 4k, output: 1k. batch 1: TPOT 7.9 ms -> 7.0 ms, Output Token Throughput 125.4651 token/s -> 140.6278 token/s; batch 2: TPOT 9.4 ms -> 8.8 ms, Output Token Throughput 211.8187 token/s -> 225.2254 token/s; batch 16: TPOT 13.6 ms -> 13.5 ms, Output Token Throughput 1159.8213 token/s -> 1165.0982 token/s - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: chenxi-hh <chen464822955@163.com>	2026-03-19 16:10:30 +08:00
wangxiyuan	8e0ebb470a	[Misc] Drop Prefetch MLP Env (#7357 ) ### What this PR does / why we need it? remove deprecated environment variables related to MLP prefetching ### Does this PR introduce _any_ user-facing change? yes, the deprecated env vars can not be used then. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-19 14:27:27 +08:00
asunxiao	a370dfa962	[bugfix]Enable dispatch_ffn_combine feature for qwen3.5 (#7066 ) ### What this PR does / why we need it? Qwen3.5 Moe supports enabling the dispatch_ffn_combine fusion operator. Fix problem: In the w8a8 quantization scene, Qwen3.5 model's config.json lacks the quantize field. The previous logic strictly relied on quant_type == "w8a8_dynamic" to enable VLLM_ASCEND_ENABLE_FUSED_MC2. This caused the dispatch_ffn_combine fusion operator to fail to activate even when the environment variable was set. Enable dispatch_ffn_combine fusion operator for BF16 scenarios. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: asunxiao <asunxiao@qq.com>	2026-03-17 19:53:02 +08:00
rjg-lyh	7669963c27	[Perf] Optimize bias handling in AscendRMSNorm (#7226 ) ### What this PR does / why we need it? This PR optimizes bias handling in `AscendRMSNorm` without changing the intended functional behavior. In the current implementation, bias may be initialized for `AscendRMSNorm` based on configuration-level detection, even though some norm layers never actually load a bias weight. This can cause the inference path to enter the bias branch and execute an unnecessary `add_` operator. To improve this, this PR introduces a loader-based flag to record whether the bias has actually been loaded. The bias addition is then executed only when the bias is truly present. This optimization reduces redundant computation in inference and makes the bias application logic better aligned with the actual model weights. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-17 16:53:28 +08:00
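The loader-based flag described in the AscendRMSNorm entry above can be pictured roughly as follows. This is a minimal pure-Python sketch, not the vllm-ascend code: the class `RMSNormSketch`, its `load_bias` method and the `bias_loaded` attribute are hypothetical names chosen for illustration, and plain lists stand in for tensors.

```python
import math


class RMSNormSketch:
    """Toy RMSNorm over Python lists; every name here is hypothetical."""

    def __init__(self, hidden_size, eps=1e-6, has_bias_in_config=False):
        self.eps = eps
        self.weight = [1.0] * hidden_size
        # Config-level detection may allocate a bias that no checkpoint ever fills.
        self.bias = [0.0] * hidden_size if has_bias_in_config else None
        # Flipped only when a bias weight is actually loaded.
        self.bias_loaded = False

    def load_bias(self, values):
        # Stand-in for a weight loader callback that receives the bias tensor.
        self.bias = list(values)
        self.bias_loaded = True

    def forward(self, x):
        rms = math.sqrt(sum(v * v for v in x) / len(x) + self.eps)
        out = [v / rms * w for v, w in zip(x, self.weight)]
        # Add the bias only if it was really loaded, not merely allocated.
        if self.bias_loaded:
            out = [o + b for o, b in zip(out, self.bias)]
        return out
```

With `has_bias_in_config=True` but no call to `load_bias`, `forward` skips the addition entirely, which is the redundant `add_` the entry describes avoiding.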
pichangping	3f39ac9c8d	[Feature]Supports DSv3.1 PD separation and C8 quantization (#7222 ) Co-authored-by: kunpengW-code <1289706727@qq.com> Co-authored-by: linsheng1 <1950916997@qq.com> ### What this PR does / why we need it? Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD separation scenario. C8 refers to quantizing the KV cache to int8, which aims to reduce the GPU memory usage of the KV cache and improve the inference throughput. Constraints: 1. Only the PD separation mode can be used and MooncakeLayerwiseConnector can be used to run the model. 2. Currently, only the activation value supports dynamic quantization, and the KV cache supports static quantization. C8 quantization with MTP is not supported. You can use ModelSlim for quantization. The quantization procedure is as follows: pip install transformers==4.48.2 git clone https://gitcode.com/Ascend/msmodelslim.git cd msmodelslim bash install.sh cd example/DeepSeek/ python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pichangping <1337510399@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-03-16 22:49:05 +08:00
yupeng	29f195a91c	[Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156 ) ### What this PR does / why we need it? Fix the error that reports while initializing qwen3-reranker-0.6b model with `--enable-lora`. And add a testcase to verify the fix. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-03-15 17:55:42 +08:00
Angazenn	ce5544bfc1	[Hybrid] support prefix cache for Qwen3.5/Next with `--mamba-cache-mode align` (#7103 ) ### What this PR does / why we need it? To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits changes to functions which are overridden in vLLM-Ascend. Note: 1. `--mamba-cache-mode align` && PD disaggregation is still not supported yet in vLLM v0.17.0(see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295). 2. The current implementation of hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached. Although this behavior is consistent with vLLM, it still needs improvements in the future. 3. `--mamba-cache-mode align` requires to copy mamba states during forward steps. vLLM uses a triton kernel to implement it. However, the original version run into some bugs on Ascend hardwares. Thus we patch a new triton kernel to avoid this bug. ### Does this PR introduce _any_ user-facing change? To use mamba prefix cache, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the mamba state copy function(see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch native version, thus it might have trouble if users can't use triton. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-03-15 09:44:09 +08:00
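Note 2 of the prefix-cache entry above (a `block_size` of 2048 means shorter prefixes are never cached) follows from caching at whole-block granularity. A small sketch of that arithmetic, assuming only full blocks are reusable; the function name `cacheable_prefix_tokens` is hypothetical and not part of vLLM:

```python
def cacheable_prefix_tokens(prefix_len: int, block_size: int) -> int:
    """Number of prefix tokens that fall into complete blocks and can be reused."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # A partially filled trailing block is not counted as cacheable.
    return (prefix_len // block_size) * block_size


for prefix_len in (1000, 2047, 2048, 5000):
    print(prefix_len, "->", cacheable_prefix_tokens(prefix_len, 2048))
```

Under this assumption a 2047-token prefix contributes nothing to the cache, while a 5000-token prefix contributes two blocks (4096 tokens).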
Mengqing Cao	986cd45397	[Version] Drop 0.16.0 support (#7153 ) ### What this PR does / why we need it? Drop 0.16.0 support in main - Fix eagle proposer break introduced by https://github.com/vllm-project/vllm/pull/34552. Mainly change to use the draft attention group to initialize the attention metadata builder. - Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes` error, which is a bug in vLLM v0.17.0, and fixed by a later pr https://github.com/vllm-project/vllm/pull/30515 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-13 16:14:15 +08:00
kx	df1ee8070d	[feat][spec decode]Unified draft parallel (#6766 ) ### What this PR does / why we need it? Implement a unified parallelized speculative decoding in VLLM Ascend，which can simultaneously support parallel speculative inference schemes such as Pard, P-Eagle, etc. refer to https://github.com/vllm-project/vllm-ascend/pull/6565 and https://github.com/vllm-project/vllm-ascend/pull/4078 ### How was this patch tested? run with parallel drafting script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' base script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 benchmark script: MAX_CONCURRENCY=1 NUM_PROMPTS=80 vllm bench serve --port 8811 \ --temperature 0 \ --model /model/Llama-3.1-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --dataset-path philschmid/mt-bench \ --num-prompts ${NUM_PROMPTS} \ --max-concurrency ${MAX_CONCURRENCY} \ --seed 1234 test results : base(without spec decode): TTFT 79.46ms TPOT 26.99ms output_tokens_throughput 36.75 tok/s this pr(with parallel drafting): TTFT 72.24ms TPOT 13.45ms output_tokens_throughput 72.98 tok/s per-position acceptance(from position 0 to 7): 79.48%、56.93%、40%、27.90%、19.79%、14.25%、10.57%、7.61%. 
---------------------------------------------------------------------- run on qwen3 model script ： export target=/model/Qwen3-1.7B export draft=/model/PARD-Qwen3-0.6B export CUDA_VISIBLE_DEVICES=1 export ASCEND_RT_VISIBLE_DEVICES=1 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' cc @NickJudyHvv - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: 01267596 <xiongkai123@cmbchina.com> Signed-off-by: kx <1670186653@qq.com> Signed-off-by: HF-001 <1670186653@qq.com> Co-authored-by: 01267596 <xiongkai123@cmbchina.com>	2026-03-13 14:07:35 +08:00
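The Llama-3.1-8B numbers reported in the parallel drafting entry above can be turned into relative gains with a few lines of arithmetic. The sketch below only restates the measurements quoted in that entry; the dictionary keys are illustrative names.

```python
# Measurements copied from the entry above (Llama-3.1-8B target, PARD draft).
base = {"ttft_ms": 79.46, "tpot_ms": 26.99, "tok_per_s": 36.75}
pard = {"ttft_ms": 72.24, "tpot_ms": 13.45, "tok_per_s": 72.98}

# Lower is better for latency, higher is better for throughput.
tpot_speedup = base["tpot_ms"] / pard["tpot_ms"]
throughput_gain = pard["tok_per_s"] / base["tok_per_s"]
ttft_reduction = 1 - pard["ttft_ms"] / base["ttft_ms"]

print(f"TPOT speedup:    {tpot_speedup:.2f}x")
print(f"Throughput gain: {throughput_gain:.2f}x")
print(f"TTFT reduction:  {ttft_reduction:.1%}")
```

Both per-token latency and output throughput improve by roughly a factor of two, while time to first token drops by about 9%.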
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110 ) ### What this PR does / why we need it? This PR aims to support aclgraph for model runner v2, please see RFC #5208. The PR contains these modifications: - adapt to newest commit of vllm main branch. - supply a unified interface of extra forward context for both model runner v1 and model runner v2. - implement graph mode for main model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
Mercykid-bash	132f3c5d0a	Support per-step heat collection and enhance FlashLB for multi-stage load balancing (#6477 ) # Feature: FlashLB algorithm ## Purpose This Pull Request enhances the EPLB (Expert Parallelism Load Balancing) system by introducing a novel load balancing algorithm: FlashLB. 1. The default algorithm adopts two separate sub-procedures to optimize expert replication and placement independently: a. Expert Replica Allotment Sub-procedure: Determines the number of replicas for all experts. At each step, it greedily adds one more replica to the expert with the highest per-replica load, aiming to minimize load skew at the expert replica granularity (Min Max Replica, MMR). b. Expert Replica Placement Sub-procedure: Distributes all replicas across devices. First, it sorts the generated replicas in descending order of hotness, then iteratively places the currently hottest replica onto the device with the lowest cumulative load and available slots. However, this simplistic combination of two separate procedures lacks synergy and often leads to sub-optimal load balancing. For example, in the simple scenario illustrated below: Given 8 logical experts with hotness values [600, 560, 120, 120, 20, 10, 10, 10], and 2 replicas allocated per device across 8 devices, the default EPLB algorithm results in a maximum per-device hotness of 232 (peak-average load ratio 1.28), while our proposed FlashLB algorithm reduces this value to 205 (peak-average load ratio 1.13). <figure><img src="https://github.com/user-attachments/assets/b9b10fab-651e-4524-9942-adbca8d044a4" width="90%"></figure> 2. The default algorithm simply aggregates hotness measurements across the entire profiling window. While this provides a coarse approximation of the hotness distribution, it fails to capture the time-phased variations and temporal correlations in expert hotness (both within and between experts) across iterations, phenomena that have been observed in real-world scenarios. 
Such single-point hotness estimation degrades the solution quality of the load balancing algorithm. 3. The default algorithm regularly recalculates updated expert placement results for all layers without discrimination. Considering that excessive expert updates can impact Service Level Objectives (SLOs), such full-scale redeployment leads to excessively high adjustment overhead, which negatively affects end-to-end performance. ## FlashLB Algorithm Principle ### 1. Joint Optimization of Replica Allotment and Placement FlashLB achieves joint optimization of replica allotment and placement through a novel tree search approach, combined with carefully designed efficient pruning and lightweight look-ahead estimation. We partition all experts into several subsets, and for each subset, hierarchically determine the optimal replica count and placement. Leveraging efficient pruning and lightweight look-ahead estimation, the process consistently aims to optimize the globally expected inter-device load balance degree (considering both deployed and unexplored experts) while ensuring sufficient computational efficiency. Additionally, precompilation techniques are employed for acceleration, delivering load balancing that is both high-quality and practically efficient. ### 2. Multi-Episode Enhancement Instead of performing full-duration averaging like the default algorithm, FlashLB partitions each profiling interval (e.g., 1024 iterations) into multiple consecutive smaller episodes (e.g., 16 iterations). This preserves hotness fluctuation and correlation information. It then constructs a multi-objective optimization problem to co-optimize these episodes simultaneously, enabling adaptability to interleaved hotness patterns and improving statistical robustness. ### 3. Layer-wise Cherry-Picking Redeployment To reduce the overhead of frequent expert redeployment, FlashLB introduces a cherry-picking redeployment scheme. 
During each algorithmic decision cycle, it tracks the load balance degree of all layers in real time and triggers expert placement updates only for those layers whose peak-average ratio exceeds a predefined threshold. This avoids unnecessary redeployment for stable layers, significantly reducing adjustment overhead and thereby improving end-to-end performance. ## Co-author: Co-authored-by: Skywalker-EP 173723846@qq.com This PR mainly introduces two key optimizations for load balancing scheduling: 1. Add per-step heat collection function: Support real-time collection of per-step heat information during model inference. This enables more fine-grained load balancing decisions by taking per-step heat as the optimization target, improving scheduling accuracy for dynamic and fluctuating workloads. 2. Update FlashLB algorithm: Upgrade the FlashLB scheduling logic to better adapt to multi-stage heat distribution scenarios. The improved algorithm can comprehensively perceive and utilize multi-stage heat characteristics, achieving more stable and efficient load balancing under complex expert deployment and dynamic traffic patterns. --------- Signed-off-by: Mercykid-bash <ruanche0218@gmail.com> Signed-off-by: xuzewei28 <xuzewei2@h-partners.com> Co-authored-by: xuzewei28 <xuzewei2@h-partners.com>	2026-03-12 15:49:09 +08:00
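The two-step default procedure that the FlashLB entry above uses as its baseline (greedy replica allotment, then greedy placement) can be sketched in a few lines. This is an illustrative reading of the prose description, not the EPLB source: the function names are hypothetical, and the sketch does not prevent two replicas of the same expert from landing on one device.

```python
import heapq


def allot_replicas(hotness, total_slots):
    """Give each expert one replica, then repeatedly add a replica to the
    expert whose per-replica load is currently highest."""
    counts = [1] * len(hotness)
    heap = [(-h, i) for i, h in enumerate(hotness)]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(total_slots - len(hotness)):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-hotness[i] / counts[i], i))
    return counts


def place_replicas(hotness, counts, num_devices, slots_per_device):
    """Place replicas hottest-first onto the least-loaded device with a free slot."""
    replicas = sorted(
        (hotness[i] / c for i, c in enumerate(counts) for _ in range(c)),
        reverse=True,
    )
    loads = [0.0] * num_devices
    free = [slots_per_device] * num_devices
    for load in replicas:
        dev = min((d for d in range(num_devices) if free[d] > 0),
                  key=lambda d: loads[d])
        loads[dev] += load
        free[dev] -= 1
    return loads


hotness = [600, 560, 120, 120, 20, 10, 10, 10]
counts = allot_replicas(hotness, total_slots=8 * 2)
loads = place_replicas(hotness, counts, num_devices=8, slots_per_device=2)
peak = max(loads)
ratio = peak / (sum(hotness) / 8)
print(counts, peak, round(ratio, 2))
```

On the example from the entry, this sketch allots five replicas each to the two hottest experts and reaches a peak per-device hotness of 232 with a peak-average ratio of 1.28, matching the figures quoted for the default algorithm.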
xmpp777	9216e1b050	[fix] Add support for Qwen3.5 Dense and MoE on Ascend (#6933 ) ### What this PR does / why we need it? This pull request introduces support for the Qwen3.5 MoE model on Ascend devices. The key changes are: * Quantization Configuration for Qwen3.5 MoE: Adds necessary prefix mappings and packed module definitions for `qwen3_5_moe` in `vllm_ascend/quantization/modelslim_config.py` to enable ModelSlim quantization. * Triton Kernel Fix: Corrects a bug in the `fused_gdn_gating` Triton kernel. The calculation for `BLK_BATCHES` had an operator precedence issue which is now resolved. The calculation has also been made more robust with added clamping to prevent potential out-of-bounds memory access in the unified buffer. These changes enable the correct and efficient execution of Qwen3.5 MoE models on Ascend hardware. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI should be used to verify the correctness of these changes. It is recommended to run tests with the Qwen3.5 MoE model to ensure the new configurations and the kernel fix work as expected. Signed-off-by: xmpp777 <yangming2@huawei.com>	2026-03-10 09:09:31 +08:00
Hexiang Wang	48b624e4cc	[BugFix] Fix implementation bug of triton rope_siso (#7082 ) ### What this PR does / why we need it? The previous implementation of triton rope_siso did not store the second half of the RoPE results, which causes: 1. accuracy problems in the neox-style scenario 2. UB overflow in the non-neox-style scenario. This PR fixes it and adds a nightly test case for it. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-09 23:08:43 +08:00
LI SHENGYONG	a76a509fae	[MOE][Bugfix] Cancel H2D for expert_map (#7000 ) ### What this PR does / why we need it? If expert_map is on the device, there may be occasional repeated answers in long output scenarios. dsv3.2-exp-w8a8 No garbled characters are displayed in the output. \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2025 \| ef2f4f \| accuracy \| gen \| 60.00 \| - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-09 17:53:54 +08:00
JIACHENG XU	23bf5d4d48	[EPLB][bugfix] Bugfix for fused mc2 (#6794 ) ### What this PR does / why we need it? This pull request addresses a bug related to the fused mc2 functionality within the EPLB (Expert Parallelism Load Balancing) system, specifically impacting quantization and MoE communication. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: Spicy-Stick <873805887@qq.com> Signed-off-by: root <root@localhost.localdomain>	2026-03-09 11:26:57 +08:00
ZhaoJiangJiang	a51d6366b9	[Bugfix] Qwen3Next support FlashComm1 (#6830 ) ### What this PR does / why we need it? Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence Parallel (SP) and resolve precision problems in shared_out when FlashComm1 is enabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com> Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>	2026-03-06 17:14:08 +08:00
Fager10086	c5dfa8d645	[OPS]add split_qkv_rmsnorm_mrope ops (#6730 ) ### What this PR does / why we need it? This PR adds split_qkv_rmsnorm_mrope kernel with interleaved for qwen3.5 and qwen3-vl to improve performance. ### Does this PR introduce _any_ user-facing change? Does not. ### How to use? ```python real_q, real_k, real_v, real_gate = torch.ops.vllm.triton_split_qkv_rmsnorm_mrope( qkv=qkv, q_weight=q_weight, k_weight=k_weight, cos_sin=cos_sin, num_q_heads=num_q_heads, num_kv_heads=num_kv_heads, head_size=head_size, eps=eps, mrope_section=mrope_section, is_interleaved=is_interleaved, rope_dim=rope_dim, has_gate=has_gate, ) ``` ### How was this patch tested? - vLLM version: v0.16.0 - Accuracy test script： ```shell pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_rmsnorm_mrope.py ``` --------- Signed-off-by: Fager <865071616@qq.com> Signed-off-by: Fager10086 <77871921+Fager10086@users.noreply.github.com> Signed-off-by: fager <865071616@qq.com>	2026-03-06 16:18:37 +08:00
Shanshan Shen	a813eadd2d	[MM][Perf] Enable 2.7x faster for convolution computation with aclnn BatchMatMulV2 (#7017 ) ### What this PR does / why we need it? Currently, we are using `e2b31243c0/vllm/model_executor/layers/conv.py (L219-L232)` for convolution computation, which is used in patch embedding for VL models. After profiling, we find that this linear method will take about 6.87 ms, which is much slower than just using `F.conv3d()`. In `F.conv3d()`, it will call aclnn `BatchMatMulV2` with optimization on Ascend NPU, which only take about 2.50 ms and is 2.7x faster than linear method. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-03-06 14:26:37 +08:00
panchao-hub	8c2c82f3e1	[Bugfix] Fix the moe_forward error when setting enable_static_kernel … (#6964 ) ### What this PR does / why we need it? Fix the moe_forward error when setting enable_static_kernel to true. When static kernels are enabled, the forward pass runs twice (compilation + capture), causing moe_layer_index to overflow. Wrap the index to prevent out-of-bounds errors. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with new added test - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com>	2026-03-06 10:36:10 +08:00
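The index wrap described in the enable_static_kernel entry above amounts to taking the running layer counter modulo the number of MoE layers. A minimal sketch, assuming a per-forward counter; the class `MoeLayerIndexSketch` and its method are hypothetical names, not the vllm-ascend implementation:

```python
class MoeLayerIndexSketch:
    """Hands out a layer index per MoE forward call, wrapping around so that
    a second pass (e.g. compilation followed by capture) stays in range."""

    def __init__(self, num_moe_layers):
        self.num_moe_layers = num_moe_layers
        self._counter = 0

    def next_index(self):
        # Without the modulo, the second pass would yield indices >= num_moe_layers.
        idx = self._counter % self.num_moe_layers
        self._counter += 1
        return idx


tracker = MoeLayerIndexSketch(num_moe_layers=3)
indices = [tracker.next_index() for _ in range(6)]  # two passes over 3 layers
print(indices)
```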
frank	18b52afe2b	[Ops][Misc] Optimize split_qkv_rmsnorm_rope op (#6827 ) ### What this PR does / why we need it? This PR optimizes the `split_qkv_rmsnorm_rope` operator by introducing a new Triton kernel, `split_qkv_rmsnorm_rope_prefill_kernel`, for the prefill stage (i.e., large batch sizes). The implementation now dynamically selects between the existing decode kernel and the new prefill kernel based on the batch size, which improves performance for large batch scenarios. Additionally, the RoPE implementation is updated to support partial rotation dimensions (`rope_dim`), making the operator more flexible. ### Does this PR introduce _any_ user-facing change? No. This is a performance optimization and is not expected to introduce any user-facing changes. ### How was this patch tested? CI should pass with existing tests. The new prefill path is triggered when the batch size is larger than the number of available vector cores. The partial RoPE feature can be tested by passing the `rope_dim` argument. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: guzhiyong <guzhiyong5@h-partners.com> Signed-off-by: frank <2547457096@qq.com> Co-authored-by: guzhiyong <guzhiyong5@h-partners.com>	2026-03-06 09:30:31 +08:00
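The partial rotation (`rope_dim`) mentioned in the split_qkv_rmsnorm_rope entry above can be illustrated on a single head vector. This is a pure-Python sketch assuming the neox-style layout, where the rotated slice is split into two halves; the function name and argument layout are hypothetical and unrelated to the Triton kernel's actual signature.

```python
import math


def apply_partial_rope_neox(x, cos, sin, rope_dim):
    """Rotate the first rope_dim entries of x and pass the remainder through.

    cos and sin hold one value per rotated pair (rope_dim // 2 entries each).
    """
    if rope_dim % 2 != 0 or rope_dim > len(x):
        raise ValueError("rope_dim must be even and no larger than len(x)")
    half = rope_dim // 2
    x1, x2 = x[:half], x[half:rope_dim]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    # Both rotated halves are written back; entries beyond rope_dim are untouched.
    return first + second + list(x[rope_dim:])


angle = math.pi / 2
rotated = apply_partial_rope_neox([1.0, 0.0, 5.0], [math.cos(angle)], [math.sin(angle)], rope_dim=2)
print(rotated)
```

Setting `rope_dim` equal to the full head size reduces this to ordinary neox-style RoPE.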
SILONG ZENG	bd571cf6d6	[Main2Main] Upgrade vLLM to 0303 (#6944 ) ### What this PR does / why we need it? break: - https://github.com/vllm-project/vllm/pull/34102 Disable_full param replaced with valid_modes/invalid_modes API - https://github.com/vllm-project/vllm/pull/35503 Now must return float compilation_time - https://github.com/vllm-project/vllm/pull/35564 New sequence_lengths param added - https://github.com/vllm-project/vllm/pull/33807 A check was performed (if runner_backend != "auto") - https://github.com/vllm-project/vllm/pull/34861 `BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to check process group state - https://github.com/vllm-project/vllm/pull/35274 Important change: - https://github.com/vllm-project/vllm/pull/28672 `matcher_utils` directly accesses `torch.ops._C.*` during the import phase. In the Ascend environment, some unregistered ops trigger `AttributeError`, causing e2e initialization failure. https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323 https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29 This PR adds temporary compatibility placeholders (rms_norm, fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant, silu_and_mul) to `vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to ensure no crashes during the import phase. Upstream repairs will be considered later. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: Claude Code <noreply@anthropic.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com>	2026-03-06 09:08:52 +08:00
LeeWenquan	3047b724b3	Add GemmaRmsNorm ACLGraph Support (#6473 ) ### What this PR does / why we need it? 1. New Custom NPU Operation: Introduced npu_gemma_rms_norm in csrc/torch_binding.cpp to provide optimized Gemma RMS Normalization support for Ascend NPUs. This function includes logic to handle dynamic shapes for the gamma tensor. 2. PyTorch Operator Registration: The new npu_gemma_rms_norm operation has been registered with the PyTorch custom operator library, making it accessible from Python. Meta-Implementation for ACLGraph: A corresponding meta-implementation, npu_gemma_rms_norm_meta, was added in csrc/torch_binding_meta.cpp. This is crucial for symbolic tracing and allowing the custom kernel to be captured and optimized by ACLGraph. 3. Python Frontend Integration: The vllm_ascend/ops/layernorm.py file was updated to utilize the newly added torch.ops._C_ascend.npu_gemma_rms_norm for Gemma RMS Normalization, replacing the generic torch_npu.npu_rms_norm ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: LeeWenquan <83354342+SunnyLee151064@users.noreply.github.com>	2026-03-05 16:15:07 +08:00
songjianquan	43c8da3574	[Feat]fused_qkvzba_split_reshape supports token number greater than 65536 (#6740 ) ### What this PR does / why we need it? This pull request optimizes the fused_qkvzba_split_reshape_cat Triton kernel for Qwen3-Next GatedDeltaNet model and removes the previous conditional restrictions in the forward pass. Key changes: 1. Refactored Triton kernel implementation: The fused_qkvzba_split_reshape_cat_kernel has been optimized with a new loop-based approach that supports arbitrary num_v_heads / num_k_heads ratios and batch sizes. The kernel now uses configurable ROWS_PER_ITER for better memory utilization . 2. The optimized kernel now handles all scenarios directly without requiring a fallback path using fix_query_key_value_ordering and torch.cat. ### Does this PR introduce _any_ user-facing change? No. This is an internal optimization of the Triton kernel implementation and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with existing tests. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: songjianquan <songjianquan1@huawei.com> Co-authored-by: songjianquan <songjianquan1@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-05 14:41:38 +08:00
rjg-lyh	2bd9c35788	[perf][refactor] Refactor and optimize sfa_v1.py for dsv3.2/glm5 (#6874 ) ### What this PR does / why we need it? This PR refactors sfa_v1.py to improve code readability and usability, fixes a code bug, and enhances performance through the replacement of certain operators. ### changes - improve code readability: Optimizes parts of the code structure in sfa_v1.py, supplementary comments for key code blocks, removes some unused variables, and improves the naming of certain functions and variables. - resolved a duplicated double write to k_cache: Fixed redundant double writes of k_cache in the indexer_select module (in both the `forward` function and `indexer_select_post_process`), improving performance to some extent. - replace `scatter` ops with `reshape_and_cache`: This optimization replaces two separate cache storage operations on `k_nope` and `k_pe` with a single call to the `reshape_and_cache` operator, improving performance. The original `scatter` operator involves reordering slot_mapping for generality, introducing significant scalar computations. In contrast, the `reshape_and_cache` operator eliminates this redundant reordering step, thus reducing unnecessary computation time and enhancing the operator's performance. ### performance comparison 4A3, 1P1D, P dp2tp16, D dp8tp4, input/output: 64K/3K origin: TTFT: 28s, TPOT: 26ms, TPS: 820 token/s* fixed redundant double writes of k_cache: TTFT: 24s, TPOT: 26ms, TPS: 840 token/s replace scatter ops with reshape_and_cache: TTFT: 24s, TPOT: 26ms, TPS: 850 token/s ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-05 14:27:11 +08:00
tanhaoan333	f8315f5717	[bugfix]Qwen2.5VL accurate question (#6975 ) ### What this PR does / why we need it? The attention mechanism in the ViT model architecture of Qwen2.5VL consists of two parts and does not support using cache to pass sequence lengths. ### Does this PR introduce _any_ user-facing change? remove seq_lens_cache ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>	2026-03-04 22:02:29 +08:00
Zhujiyang2	c3c265648f	[Ops][BugFix] Fix RoPE shape mismatch for mtp models with flashcomm v1 enabled (#6939 ) What this PR does / why we need it? When using a draft model (e.g., in MTP speculative decoding) with shared expert data parallelism (enabled via flashcomm), a shape mismatch error occurs in the rotary embedding calculation for models like GLM-4.7. This is because the positions tensor has an incorrect shape for this specific configuration. This PR fixes the issue by adding a check in AscendRotaryEmbedding.forward_oot. If the model is a draft model and shared expert DP is enabled, it processes the positions tensor using torch.ops.vllm.maybe_all_gather_and_maybe_unpad to ensure its shape is correct before applying the rotary embedding. This resolves the shape mismatch error. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>	2026-03-04 16:02:08 +08:00
linfeng-yuan	700423156f	[Triton] Centralize Ascend extension op dispatch in triton_utils (#6937 ) ### What this PR does / why we need it? This pull request refactors the dispatch mechanism for the triton-ascend-specific operators `insert_slice`, `extract_slice`, and `get_element` to ensure compatibility with both CANN 8.5 and 9.0. A unified helper function, `_resolve_triton_ascend_op`, has been introduced in `vllm_ascend/ops/triton/triton_utils.py`. This function dynamically resolves these operators by first attempting to import them from the `triton.language.extra.cann.extension` module, which is present in newer CANN versions. If that fails, it falls back to the standard `triton.language` module. This approach centralizes operator dispatch logic, allowing individual Triton kernels to use these functions without being aware of the underlying Triton/CANN version. All call sites have been updated to use these new unified functions. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring of operator implementations and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with existing tests. Testing Context: - vLLM version: v0.16.0 - vLLM main: `15d76f74e2fdb12a95ea00f0ca283acf6219a2b7` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-03 17:10:30 +08:00
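The try-then-fall-back resolution described in the triton_utils entry above follows a common import pattern. The sketch below shows that pattern with a hypothetical helper `resolve_op`, which is not the actual `_resolve_triton_ascend_op`; standard-library modules stand in for the Triton modules so that it runs anywhere.

```python
import importlib


def resolve_op(name, preferred_module, fallback_module):
    """Return attribute `name` from preferred_module if it can be imported
    and provides it, otherwise from fallback_module."""
    for module_name in (preferred_module, fallback_module):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment, try the next one
        op = getattr(module, name, None)
        if op is not None:
            return op
    raise AttributeError(
        f"{name!r} not found in {preferred_module!r} or {fallback_module!r}"
    )


# Stand-in usage: the preferred module does not exist, so `math` is used.
sqrt = resolve_op("sqrt", "no_such_extension_module", "math")
print(sqrt(9.0))
```

In the setting of the entry, the two module names would be `triton.language.extra.cann.extension` and `triton.language`, and the resolved names `insert_slice`, `extract_slice` and `get_element`.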
linfeng-yuan	68d8d20ca2	[misc] move mxfp_compat into device to decouple from quantization init chain (#6918 ) ### What this PR does / why we need it? `mxfp_compat` only provides dtype/symbol compatibility helpers for different `torch_npu` versions, but it was placed under `vllm_ascend.quantization`. Importing it from device/ops paths could trigger `quantization/__init__.py` and pull in heavy quantization method dependencies, increasing startup coupling and causing import-cycle risk (especially on 310P paths). ### Does this PR introduce _any_ user-facing change? No functional behavior change intended. ### How was this patch tested? CI passed. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-02 18:17:01 +08:00
whx	16c879cdf7	[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 ) ### What this PR does / why we need it? Add muls_add triton kernel with related fusion pass. In addition, this PR refactors `AscendCompilationConfig` and deletes `NpugraphExConfig`. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with the newly added test. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-02 17:54:25 +08:00
Eric-dot	3c66a970f2	add mxfp8 moe quantization (#6670 ) ### What this PR does / why we need it? support mxfp8 quantization (Qwen MOE ) Using adaptor to make the hardware-specific behavior clearer and more maintainable ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: fangrongcan <17343701736@163.com> Signed-off-by: wangyao-i <iwangyao@outlook.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com> Co-authored-by: fangrongcan <f00876277@china.huawei.com> Co-authored-by: wangyao-i <iwangyao@outlook.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>	2026-03-02 11:04:06 +08:00
realliujiaxu	5e24b26a54	[Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883 ) ### What this PR does / why we need it? PR #5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? #### 1. start server ``` bash vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}' ``` #### 2. curl ```bash curl -s http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "auto", "messages": [ {"role": "user", "content": "Hello. I have a question. Who are you?"} ], "max_tokens": 10, "temperature": 0.0, "ignore_eos_token": true }' ``` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-01 20:22:50 +08:00
Bai Yongbin	9d09488b4a	[Feat] support basic pcp&dcp for qwen3next (#6091 ) ### What this PR does / why we need it? This PR implements Context Parallelism (CP) support for the Qwen3-Next model, including PCP (Prefill Context Parallelism) and DCP (Decode Context Parallelism). - vLLM version: v0.15.0 - vLLM main: `f176443446` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Co-authored-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-02-28 21:44:08 +08:00
wangxiyuan	3d563292f3	clean 0.15.0 support (#6852 ) Clean up vllm 0.15.0 related code - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-28 09:20:57 +08:00
Canlin Guo	e4458b2d2b	[Main2Main] Upgrade vLLM to 0226 (#6813 ) ### What this PR does / why we need it? Breaking: 1. https://github.com/vllm-project/vllm/pull/33452 2. https://github.com/vllm-project/vllm/pull/33451 3. https://github.com/vllm-project/vllm/pull/32567 4. https://github.com/vllm-project/vllm/pull/32344 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-02-27 16:05:21 +08:00
realliujiaxu	5def28dcd3	[Feat]support sequence parallelism by pass for VL models (#5632 )	2026-02-27 08:27:41 +08:00
Li-Yongwen	2870f7c8ad	[Feat] Support routing replay (#6696 ) ### What this PR does / why we need it? [Feat] Support routing replay same as https://github.com/vllm-project/vllm-ascend/pull/6666 resubmit because of DOC failure ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: liyongwen <1310439159@qq.com> Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-26 10:22:47 +08:00
Shanshan Shen	3a4292e5b7	[MM][Perf] Use `seq_lens` CPU cache to avoid frequent d2h copy for better performance (#6448 ) ### What this PR does / why we need it? Currently, the performance of multi-modal encoding (i.e., `AscendMMEncoderAttention` forward) is considerably bounded by heavy host pre-processing operations. The profiling results below show that, before the actual Attention computation, the device sits idle for a long time, which leads to extremely low NPU utilization. <img width="2264" height="1398" alt="iShot_2026-01-23_16 26 39" src="https://github.com/user-attachments/assets/37f21d06-e526-4f28-82fe-005746cf13bd" /> --- To optimize this, this PR proposes four changes: 1. Use `seq_lens` CPU cache to avoid frequent d2h copy. Before this PR, `AscendMMEncoderAttention` copied `cu_seqlens` from NPU to CPU in every forward, since the op `_npu_flash_attention_unpad()` requires CPU `cu_seqlens` (otherwise it will crash). Thus, we use `seq_lens_cpu_cache` to cache this tensor, since it is shared between all layers but may change between forward steps. When the current `layer_index` is `0`, we update the cache; otherwise we use the cache directly to avoid frequent `diff` and `copy` operations, which are costly. 2. Pre-compute the scale value to avoid calculating it in every forward. 3. Move the judgment of `enable_pad` from forward to the `__init__` method. 4. Revert https://github.com/vllm-project/vllm-ascend/pull/6204. Performance after these optimizations: - TTFT has been reduced by 7.43% ⬇️. - Throughput has been increased by 1.23% ⬆️. --- > [!NOTE] > This PR requires https://github.com/vllm-project/vllm/pull/33674 to be merged. --- ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
Launch the server: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --dtype bfloat16 \ --limit-mm-per-prompt '{"image": 1}' \ --max-model-len 16384 \ --max-num-batched-tokens 16384 \ --no-async-scheduling ``` Run benchmark: ```bash vllm bench serve \ --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --hf-split train \ --dataset-path lmarena-ai/vision-arena-bench-v0.1 \ --num-prompts 500 \ --request-rate 10 \ --burstiness 5 \ --no-stream ``` Before this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 500 Failed requests: 0 Request rate configured (RPS): 10.00 Benchmark duration (s): 82.23 Total input tokens: 33418 Total generated tokens: 61543 Request throughput (req/s): 6.08 Output token throughput (tok/s): 748.45 Peak output token throughput (tok/s): 3203.00 Peak concurrent requests: 402.00 Total token throughput (tok/s): 1154.86 ---------------Time to First Token---------------- Mean TTFT (ms): 10275.37 Median TTFT (ms): 6297.88 P99 TTFT (ms): 22918.26 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 263.02 Median TPOT (ms): 277.61 P99 TPOT (ms): 483.56 ---------------Inter-token Latency---------------- Mean ITL (ms): 257.31 Median ITL (ms): 94.83 P99 ITL (ms): 1773.90 ================================================== ``` After this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 500 Failed requests: 0 Request rate configured (RPS): 10.00 Benchmark duration (s): 81.20 Total input tokens: 33418 Total generated tokens: 61509 Request throughput (req/s): 6.16 Output token throughput (tok/s): 757.54 Peak output token throughput (tok/s): 2562.00 Peak concurrent requests: 395.00 Total token throughput (tok/s): 1169.11 ---------------Time to First Token---------------- Mean TTFT (ms): 9511.91 Median TTFT (ms): 5479.78 P99 TTFT (ms): 21427.21 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 261.12 Median TPOT (ms): 276.03 P99 TPOT (ms): 446.99 ---------------Inter-token Latency---------------- Mean ITL (ms): 254.04 Median ITL (ms): 97.71 P99 ITL (ms): 1516.67 ================================================== ``` - vLLM version: v0.15.0 - vLLM main: `dc917cceb8` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-02-26 08:49:36 +08:00
Icey	ee59429015	upgrade main to 0212 (#6712 ) ### What this PR does / why we need it? Fixes `transformers_utils/processors/__init__` import error, due to https://github.com/vllm-project/vllm/pull/33247 Fixes Fused MoE break introduced by `MoERunner abstraction,` due to https://github.com/vllm-project/vllm/pull/32344 > delete AscendMoERunnere when https://github.com/vllm-project/vllm/pull/35178 is merged Fixes `Make Qwen3VL compatible with Transformers v5`, due to https://github.com/vllm-project/vllm/pull/34262 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-02-25 09:17:29 +08:00
xulei	1e77077788	[Bugfix][DispatchFFNCombine] resolve vec error caused by unaligned UB access (#6707 ) ### What this PR does / why we need it? 1. Fix a vec error caused by unaligned UB access in DispatchFFNCombine; 2. Fix expert_token_nums tensor defined on host instead of NPU in moe_comm_method.py 3. Fix multi-core copy issue of expert_token_nums in the DispatchFFNCombine op (single aiv copy is sufficient) ### Does this PR introduce _any_ user-facing change? No, this PR does not introduce any user-facing changes. The fix only addresses internal memory access logic and does not modify any public APIs, interfaces, or user-visible behaviors. ### How was this patch tested? `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: xulei_ict <xulei292@huawei.com> Co-authored-by: xulei_ict <xulei292@huawei.com>	2026-02-14 10:32:50 +08:00
jiahao.quan	7221045777	[Attention] add gpt-oss support (#5901 ) ### What this PR does / why we need it? Please refer to the following link for the historical conversation https://github.com/vllm-project/vllm-ascend/pull/4467. We have made updates in light of the comments from the prior PR review. Given the refactoring of the attention_v1 component, we have carried out necessary adjustments to fit the newly revised code. ### Does this PR introduce _any_ user-facing change? 1. Modified the code in the Attention section to adapt to the SWA and Sink features required by gpt-oss. 2. Modified the code in the MoE section to add support for bias and swigluoai. ### How was this patch tested? Please refer to the https://github.com/vllm-project/vllm-ascend/pull/4467 for performance tests, on the basis of which the accuracy tests from AIME2024 have been newly added. ![img_v3_02tu_501e88e3-2217-4565-8edf-b9acf4f43f2g](https://github.com/user-attachments/assets/024f8283-18ab-4d4d-ab12-27917b5d7d06) - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: mikequan0425 <mikequan0425@foxmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: shenchuxiaofugui <1311027364@qq.com> Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Signed-off-by: pu-zhe <zpuaa@outlook.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: luomin2005 <luomin2005@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leon_tao <taoyao2@huawei.com> Co-authored-by: nurxat <738457498@qq.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: mikequan <199741451@qq.com> Co-authored-by: LI SHENGYONG 
<49200266+shenchuxiaofugui@users.noreply.github.com> Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com> Co-authored-by: pu-zhe <zpuaa@outlook.com> Co-authored-by: luomin2005 <luomin2005@huawei.com> Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: Cao Yi <slightwindsec@gmail.com> Co-authored-by: Icey <1790571317@qq.com> Co-authored-by: SILONG ZENG <2609716663@qq.com>	2026-02-12 10:55:34 +08:00
Angazenn	c0c2eb614e	[Main][Ops] Make triton rope support index_selecting from cos_sin_cache (#5450 ) ### What this PR does / why we need it? This PR extends the original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` and `positions` as inputs. This fully aligns with the vLLM RoPE API. Compared with the earlier RoPE implementation, the benefits are: 1. avoiding pre-computation of `cos` and `sin` before model execution, which helps remove redundant code. 2. allowing the eagle3 draft model to have different rope parameters from the main model (see #6612 ). This helps recover accept rate and accuracy in that case. In addition, this kernel change introduces only a very small performance degradation. Those `index_select` or `chunk` operations are now changed into simple memory accesses in the triton kernel (For example, https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81). Highlights - RoPE Cache Unification: Replaced separate _sin and _cos global tensors with a unified cos_sin_cache and explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling. - Triton Kernel Integration: Updated Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations. - Custom Operation Registration: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation. - Refactored RoPE Forward Pass: Modified the rope_forward_oot function to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
- vLLM version: v0.13.0 - vLLM main: `5326c89803` Additional test on Qwen3-235b accuracy: \| Aime2024 \| GSM8K \| Livecodebench \| \| -------- \| -------- \| -------- \| \| 83.33 \| 96.26 \| 70.23 \| --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-02-11 21:20:53 +08:00
LI SHENGYONG	7cf285a77a	[MOE Refactor] Remove QuantType in prepare_finalize.py (#6534 ) ### What this PR does / why we need it? To prevent confusion between different QuantType classes, we remove QuantType in prepare_finalize.py - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-10 15:59:58 +08:00
Nengjun Ma	66b60c9440	[Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (#6629 ) ### What this PR does / why we need it? 1. [Refact] Refactor MLA/SFA weight prefetch to be consistent with MoE weight prefetch 2. Remove duplicated o_proj weight prefetch in forward for MLA/SFA ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? 1) Performance result: Perf test data: ) MLA: \| \| 1st test \| 2nd test \| Output Token Throughput(Avg) \| Performance improvement percentage \| \| --- \| --- \| --- \| --- \| --- \| \| o_proj duplicate prefetch \| 11.9669 token/s \| 12.0287 token/s \| 11.9978 \| \| o_proj no duplicate prefetch \| 12.5594 token/s \| 12.6216 token/s \| 12.5905 \| 4.94%\| \| single-layer performance improvement: 5%~8% ) SFA: \| \| 1st test \| 2nd test \| Output Token Throughput(Avg) \| Performance improvement percentage \| \| --- \| --- \| --- \| --- \| --- \| \| o_proj duplicate prefetch \| 13.0523 token/s \| 13.1084 token/s \| 13.08035 \| \| \| o_proj no duplicate prefetch \| 13.9844 token/s \| 14.1678 token/s \| 14.0761 \| 7.6% \| - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-10 14:14:37 +08:00
meihanc	5b8e47cb68	[bugfix]Fix no attribute 'data' when MLAPO is enable (#6601 ) ### What this PR does / why we need it? This PR fixes an `AttributeError: 'Parameter' object has no attribute 'data'` that occurs when MLAPO is enabled with vLLM v0.15.0. The error is caused by a monkey-patch on `MLAAttention.process_weights_after_loading` which is incompatible with changes in vLLM v0.15.0. This is likely related to PyTorch's deprecation of the `.data` attribute on `torch.nn.Parameter` objects. This change makes the monkey-patch conditional, so it is not applied for vLLM v0.15.0 and newer versions, resolving the crash. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-02-10 09:04:32 +08:00
GoCHug	80e5812b39	[BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (#6581 ) ### What this PR does / why we need it? Issue: If a model such as Ling-1T adopts partial rotary position embedding (partial RoPE), but config.json uses the rotary_dim parameter instead of partial_rotary_factor, it will trigger a RuntimeError: The expanded size of the tensor (128) must match the existing size (64) at non-singleton dimension 3. <img width="1681" height="472" alt="image" src="https://github.com/user-attachments/assets/ba03d7df-ecba-4d6f-9ec1-4dc55f59799e" /> This PR addresses an issue where models using partial rotary position embedding (partial RoPE) with the `rotary_dim` parameter in `config.json` (instead of `partial_rotary_factor`) would encounter a `RuntimeError`. This change adds support for the `rotary_dim` parameter in `vllm_ascend/ops/rotary_embedding.py` to correctly calculate the `rope_dim`, resolving the tensor size mismatch error. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The patch was tested successfully with the Ling-1T model, which previously triggered the error. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: GoCHug <93277779+GoCHug@users.noreply.github.com>	2026-02-09 17:17:52 +08:00
wangxiyuan	9c6d031797	[MISC] Clean up useless env USE_OPTIMIZED_MODEL (#6618 ) Clean up useless env `USE_OPTIMIZED_MODEL` - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-09 15:38:58 +08:00
wangxiyuan	6c49f95da2	[Ops][Refactor] Remove custom rotary_embedding operator (#6523 ) ### What this PR does / why we need it? This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests. The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring and does not introduce any user-facing changes. ### How was this patch tested? The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-07 09:24:05 +08:00
SILONG ZENG	06aa6036f6	[Lint]Style: Convert `vllm-ascend/` to ruff format(new Batch #8 ) (#6604 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| vllm_ascend/ops/\_\_init\_\_.py \| \| vllm_ascend/ops/activation.py \| \| vllm_ascend/ops/flashcomm2_oshard_manager.py \| \| vllm_ascend/ops/layernorm.py \| \| vllm_ascend/ops/mla.py \| \| vllm_ascend/ops/mm_encoder_attention.py \| \| vllm_ascend/ops/register_custom_ops.py \| \| vllm_ascend/ops/vocab_parallel_embedding.py \| \| vllm_ascend/ops/weight_prefetch.py \| \| vllm_ascend/spec_decode/\_\_init\_\_.py \| \| vllm_ascend/spec_decode/eagle_proposer.py \| \| vllm_ascend/spec_decode/interface.py \| \| vllm_ascend/spec_decode/mtp_proposer.py \| \| vllm_ascend/spec_decode/ngram_proposer.py \| \| vllm_ascend/spec_decode/suffix_proposer.py \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-07 09:16:07 +08:00
wangxiyuan	06c0aed124	[CI] Fix broken CI (#6599 ) Revert `4fb3d5e1b2`; it breaks the E2E test - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd`	2026-02-06 17:23:58 +08:00
SILONG ZENG	19b5d44ea8	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #10 ) (#6173 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|`vllm_ascend/ops/layer_shard_linear.py`\| \|`vllm_ascend/ops/linear.py`\| \|`vllm_ascend/ops/linear_op.py`\| \|`vllm_ascend/worker/worker.py`\| \| ` vllm_ascend/patch/worker/patch_bert.py` \| \| ` vllm_ascend/patch/worker/patch_deepseek.py` \| \| ` vllm_ascend/patch/worker/patch_distributed.py` \| \| ` vllm_ascend/patch/worker/patch_module.py` \| \| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` \| \| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` \| \| ` vllm_ascend/patch/worker/patch_rope.py` \| \| ` vllm_ascend/patch/worker/patch_triton.py` \| \| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` \| \| ` vllm_ascend/patch/worker/patch_v2_egale.py` \| \|` vllm_ascend/worker/npu_input_batch.py`\| \|` vllm_ascend/worker/v2/aclgraph_utils.py`\| \|` vllm_ascend/worker/v2/attn_utils.py`\| \|` vllm_ascend/worker/v2/model_runner.py`\| \|` vllm_ascend/worker/v2/sample/gumbel.py`\| \|` vllm_ascend/worker/v2/sample/penalties.py`\| \|` vllm_ascend/worker/v2/sample/sampler.py`\| \|` vllm_ascend/worker/v2/spec_decode/__init__.py`\| \|` vllm_ascend/worker/v2/spec_decode/eagle.py`\| \|` vllm_ascend/worker/v2/states.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 15:35:06 +08:00

421 Commits