xc-llm-ascend

Author	SHA1	Message	Date
Nagisa125	2cb9195ff0	[Releases/v0.18.0][CI] Updated the parameters for the single-node test to fix the OOM issue for DeepSeek-V3.2 (#7862 ) ### What this PR does / why we need it? Fix the OOM (Out-of-Memory) error in the single-node-deepseek-v3-2-w8a8 nightly test of vllm-ascend: - Reduced the value of HCCL_BUFFSIZE - Lowered the gpu-memory-utilization Optimize service-side performance: Updated service-oriented configuration parameters (e.g., max-num-seqs, cudagraph_capture_sizes, batch_size) to improve the inference performance,so that the performance is closer to the optimal performance of the current mainline. Align performance baseline with main branch: Updated the performance baseline according to the latest performance data ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The test has passed. https://github.com/vllm-project/vllm-ascend/actions/runs/23734079080/job/69134387320?pr=7793 --------- Signed-off-by: wyh145 <1987244901@qq.com>	2026-04-01 10:28:46 +08:00
linfeng-yuan	ed4ef1f4e7	[releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties (#7794 ) ### What this PR does / why we need it? Implement get_token_bin_counts_and_mask and apply_penalties with Triton-Ascend kernels. This significantly reduces latency of the sampling process when repetition/frequency/presence penalties are enabled. Cherry-pick from main PR #7569 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: realliujiaxu <realliujiaxu@163.com>	2026-03-31 19:01:51 +08:00
ZT-AIA	66db070423	[cherry-pick][Test]repair for test_compute_slot_mapping (#7836 ) ### What this PR does / why we need it? repair for test_compute_slot_mapping Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-03-31 16:52:58 +08:00
Shaoxu Cheng	3f4087a8f0	[310P]fused recurrent gated delta rule pytorch core and ut (#7398 ) ### What this PR does / why we need it? RFC https://github.com/vllm-project/vllm-ascend/issues/7394 Add a PyTorch implementation of the fused recurrent gated delta ruler on 310P. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-25 08:53:14 +08:00
SILONG ZENG	1e3c1e76bf	[Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511 ) ### What this PR does / why we need it? This PR introduces several upstream `vllm`-aligned lint hooks into `vllm-ascend` and makes them part of the actual `pre-commit` flow. Main changes in this PR: - add `check-boolean-context-manager` to catch boolean expressions in `with` statements - add `check-forbidden-imports` to forbid direct `re` imports and disallowed direct `triton` imports - enable shell script linting through `tools/shellcheck.sh` - add root `.clang-format` aligned with upstream `vllm`, enable `clang-format` in `pre-commit`, temporarily exclude all `csrc/` from `clang-format` to avoid bringing a large native code reformat into this PR This PR focuses on landing the smaller and immediately useful lint alignment first, without mixing in the larger requirements-management migration. ### Does this PR introduce _any_ user-facing change? No. This PR only updates repository lint configuration, static checks, and internal import/style enforcement. It does not change runtime behavior or public interfaces. ### How was this patch tested? Tested locally in the project virtual environment. Commands used: ```bash bash format.sh ``` Verified checks passed: ``` bash ruff check...............................................................Passed ruff format..............................................................Passed codespell................................................................Passed typos....................................................................Passed clang-format.............................................................Passed Lint GitHub Actions workflow files.......................................Passed Lint shell scripts.......................................................Passed Lint PNG exports from excalidraw.........................................Passed Check for spaces in all filenames........................................Passed Enforce __init__.py in Python packages...................................Passed Check for forbidden imports..............................................Passed Check for boolean ops in with-statements.................................Passed Suggestion...............................................................Passed - hook id: suggestion - duration: 0s To bypass pre-commit hooks, add --no-verify to git commit. ``` note: clang-format is enabled but currently excludes all csrc/ - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-24 20:03:01 +08:00
lhp-deep	0e3186f07c	[model_runner_v2]:optimize the performance of the _compute_slot_mappings_kernel (#7575 ) ### What this PR does / why we need it? This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to improve performance. The key changes include: - A new Triton kernel implementation (`_compute_slot_mappings_kernel`) with NPU-specific optimizations, such as using `tl.gather` to handle non-contiguous memory access and replacing modulo operations. - A new method `compute_slot_mappings` in `AscendBlockTables` to use this new kernel. - An end-to-end test to verify the correctness of the new kernel against the reference GPU implementation. The optimization is needed to avoid performance degradation from scalar computation on Ascend devices. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>	2026-03-24 17:29:14 +08:00
LeeWenquan	9615bc33fd	Fix Qwen3Next CI Config (#7561 ) ### What this PR does / why we need it? This pr modifies qwen3Next nightly CI config. (1) Add a nightly CI . (2) Set a more precise accuracy standard - vLLM version: v0.18.0 - vLLM main: `6a9cceb219` Signed-off-by: Your Name <you@example.com> Co-authored-by: Your Name <you@example.com>	2026-03-24 17:08:17 +08:00
jiaojiao	1de805ce0a	[Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495 ) ### What this PR does / why we need it? During the prefill phase of Qwen3-Next and Qwen3.5, the `torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant performance bottlenecks. To address this, we have re-implemented the optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? 1 accuracy test ``` [2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ... +-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+ \| Task Name \| Process \| Progress \| Time Cost \| Status \| Log Path \| Extend Parameters \| +=============================+===========+============+=============+==========+===========================================+=====================+ \| vllm-api-general-chat/gsm8k \| 2918978 \| NA \| 0:00:01 \| finish \| logs/eval/vllm-api-general-chat/gsm8k.out \| None \| +-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+ [2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed. [2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results... dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 271d0b accuracy gen 96.21 ``` 2 ut modify test `pytest -sv /home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d` - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: wenba0 <3054239545@qq.com> Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>	2026-03-24 00:07:12 +08:00
weijinqian0	bdd90c0088	[model_runner_v2]optimize the performance of the post_update. (#7496 ) ### What this PR does / why we need it? - This PR aims to enhance the operator performance in the `post_update` phase of `model_runner_v2` on NPUs. By optimizing the relevant operations, it is expected to improve the overall efficiency and speed of the model running on NPU hardware, which is crucial for scenarios where high-performance inference is required. - when bs = 256, time cost reduce from 26us to 11 us; ### Does this PR introduce _any_ user-facing change? No, there are no changes to the API, interface, or other high-level behaviors that would directly affect the user's code or interaction with the system beyond the performance improvement. ### How was this patch tested? CI passed with new added/existing tests. In addition to the regular CI tests, specific benchmark tests were conducted on NPU hardware to measure the performance improvement of the `post_update` operators. --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2026-03-23 20:29:55 +08:00
Shaoxu Cheng	13397e9cb7	[310p] Add a PyTorch implementation of the GDN gating operator on 310P (#7430 ) ### What this PR does / why we need it? RFC #7394 Add a PyTorch implementation of the GDN gating operator on 310P. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT - vLLM version: v0.17.0 - vLLM main: `4497431df6` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-23 20:26:39 +08:00
liuhy1213-cell	fb283b5820	[CI] Add nightly CI test cases for the GLM-5 (#7429 ) ### What this PR does / why we need it? Add nightly CI test cases for the GLM-5 Add model download for the GLM-5 https://github.com/vllm-project/vllm-ascend/actions/runs/23286178651/job/67710409642#logs - vLLM version: v0.17.0 - vLLM main: `b31e9326a7` --------- Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com> Signed-off-by: liuhy1213-cell <liuhy1213@gmail.com> Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>	2026-03-23 19:14:19 +08:00
linfeng-yuan	88d03a783f	[refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024 ) ### What this PR does / why we need it? Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business `**kwargs` with typed request objects and explicit stage boundaries. - Prepare, dispatch, MLP, and quant stages now have clearer ownership. - Main MoE path no longer depends on business `kwargs.get(...)` lookups. - Comm and dispatcher interfaces are request-only on the main path. - UTs can assert stage-level fields directly instead of inferring behavior indirectly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed. --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-20 23:23:57 +08:00
ichaoren	9d1452c74d	[OPS]add split_qkv_tp_rmsnorm_rope ops (#7376 ) ### What this PR does / why we need it? This PR introduces a new fused Triton kernel, `split_qkv_tp_rmsnorm_rope` for Minimax-m2.5. The implementation includes two Triton kernels: 1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input and computes the local variance for RMSNorm. 2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering TP all-reduce for variance) and Neox-style RoPE. ### Does this PR introduce _any_ user-facing change? Does not. ### How was this patch tested? ```python pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py ``` ### Test Data A3 TP16 基线 \| data \| TTFT(ms) \| TPOT(ms) \| TPS \| \|------------\|---------:\|---------:\|-------:\| \| 4k/1k@bs1 \| 267.55 \| 25.5 \| 38.85 \| \| 4k/1k@bs4 \| 542.4 \| 26.51 \| 148.06 \| 测试线 \| data \| TTFT(ms) \| TPOT(ms) \| TPS \| \|------------\|---------:\|---------:\|-------:\| \| 4k/1k@bs1 \| 234.64 \| 20.96 \| 47.24 \| \| 4k/1k@bs4 \| 508.36 \| 22.16 \| 176.69 \| - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: xutianyi <xutianyi5@huawei.com> Co-authored-by: xutianyi <xutianyi5@huawei.com>	2026-03-19 17:19:18 +08:00
ZT-AIA	05afc7f8c3	[CI]repair for ci custom ops (#7461 ) ### What this PR does / why we need it? NPU resources are not released immediately when custom operator test cases are executed, causing an error when other operator test cases are executed. - vLLM version: v0.17.0 - vLLM main: `8a680463fa` Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2026-03-19 17:13:12 +08:00
aipaes	87d6424b2e	[CI] Add nightly CI test cases for the GLM-4.7 model. (#7391 ) ### What this PR does / why we need it? Add acc nightly CI test cases for the GLM-4.7 model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? through CI - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-03-19 16:43:29 +08:00
LoganJane	270c5cb8cd	[CI] Add nightly CI test cases for the Kimi-K2.5 (#7416 ) ### What this PR does / why we need it? Add nightly CI test cases for the Kimi-K2.5. - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: LoganJane <loganJane73@hotmail.com> Signed-off-by: LoganJane <42287016+LoganJane@users.noreply.github.com>	2026-03-19 11:02:29 +08:00
SparrowMu	fb8e22ec00	[DOC] MiniMax-M2.5 model intro (#7296 ) ### What this PR does / why we need it? 1. Add nightly test on MiniMax-M2.5 with deployment method on A3 2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: limuyuan <limuyuan3@huawei.com> Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com> Co-authored-by: limuyuan <limuyuan3@huawei.com>	2026-03-18 20:14:36 +08:00
liuhy1213-cell	58725b8b24	[doc] add Prefill-Decode Disaggregation doc for GLM5.md (#7300 ) ### What this PR does / why we need it? add Prefill-Decode Disaggregation doc for GLM5.md w8a8 65k-1.5k Concurrency: 80 prefixcache: 90% tps: 2054 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com> Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>	2026-03-18 17:00:31 +08:00
wangx700	22d0e1d3d7	[model_runner_v2]optimize the performance of the _topk_log_softmax_kernel (#7221 ) ### What this PR does / why we need it? Optimize the performance of the triton operator _topk_log_softmax_kernel in model_runner_v2 to 1.04xH100，which is 7% of its original value.(issue https://github.com/vllm-project/vllm-ascend/issues/5208) - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangx700 <wangxin700@huawei.com>	2026-03-16 16:49:10 +08:00
Angazenn	ce5544bfc1	[Hybrid] support prefix cache for Qwen3.5/Next with `--mamba-cache-mode align` (#7103 ) ### What this PR does / why we need it? To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits changes to functions which are overridden in vLLM-Ascend. Note: 1. `--mamba-cache-mode align` && PD disaggregation is still not supported yet in vLLM v0.17.0(see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295). 2. The current implementation of hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached. Although this behavior is consistent with vLLM, it still needs improvements in the future. 3. `--mamba-cache-mode align` requires to copy mamba states during forward steps. vLLM uses a triton kernel to implement it. However, the original version run into some bugs on Ascend hardwares. Thus we patch a new triton kernel to avoid this bug. ### Does this PR introduce _any_ user-facing change? To use mamba prefix cache, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the mamba state copy function(see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch native version, thus it might have trouble if users can't use triton. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-03-15 09:44:09 +08:00
kx	df1ee8070d	[feat][spec decode]Unified draft parallel (#6766 ) ### What this PR does / why we need it? Implement a unified parallelized speculative decoding in VLLM Ascend，which can simultaneously support parallel speculative inference schemes such as Pard, P-Eagle, etc. refer to https://github.com/vllm-project/vllm-ascend/pull/6565 and https://github.com/vllm-project/vllm-ascend/pull/4078 ### How was this patch tested? run with parallel drafting script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' base script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 benchmark script: MAX_CONCURRENCY=1 NUM_PROMPTS=80 vllm bench serve --port 8811 \ --temperature 0 \ --model /model/Llama-3.1-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --dataset-path philschmid/mt-bench \ --num-prompts ${NUM_PROMPTS} \ --max-concurrency ${MAX_CONCURRENCY} \ --seed 1234 test results : base(without spec decode): TTFT 79.46ms TPOT 26.99ms output_tokens_throughput 36.75 tok/s this pr(with parallel drafting): TTFT 72.24ms TPOT 13.45ms output_tokens_throughput 72.98 tok/s per-position acceptance(from position 0 to 7): 79.48%、56.93%、40%、27.90%、19.79%、14.25%、10.57%、7.61%. ---------------------------------------------------------------------- run on qwen3 model script ： export target=/model/Qwen3-1.7B export draft=/model/PARD-Qwen3-0.6B export CUDA_VISIBLE_DEVICES=1 export ASCEND_RT_VISIBLE_DEVICES=1 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' cc @NickJudyHvv - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: 01267596 <xiongkai123@cmbchina.com> Signed-off-by: kx <1670186653@qq.com> Signed-off-by: HF-001 <1670186653@qq.com> Co-authored-by: 01267596 <xiongkai123@cmbchina.com>	2026-03-13 14:07:35 +08:00
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110 ) ### What this PR does / why we need it? This PR aims to support aclgraph for model runner v2, please see RFC #5208. The PR contains these modifications: - adapt to newest commit of vllm main branch. - supply a unified interface of extra forward context for both model runner v1 and model runner v2. - implement graph mode for main model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
Li Wang	0a171b5cdd	[Test][BugFix] Fix dispatch_gmm_combine_decode test stability (#7097 ) ### What this PR does / why we need it? This patch fix the nightly failure 1. Each case uses a copy of the global kwargs instead of a reference to prevent parameter pollution between use cases. 2. Add weight initialization in the scenario of `eplb` + `w8a8_dynamic` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ```python pytest -sv tests/e2e/nightly/single_node/ops/multicard_ops_a3/test_dispatch_gmm_combine_decode.py ``` ```shell ===================================================================== 3 passed, 4 warnings in 194.86s (0:03:14) ====================================================================== ``` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-12 17:22:44 +08:00
shiyuan680	3b6b3c4214	[MODELRUNNERV2]fix penality ops (#7013 ) ### What this PR does / why we need it? fix penality ops for new version, and achieved a 10% performance improvement ### How was this patch tested? pytest ‎tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shiyuan680 <917935075@qq.com>	2026-03-11 17:13:34 +08:00
ZT-AIA	ee5347e824	[qwen3 next ]add ascend c casual_conv1d_fn (#6661 ) ### What this PR does / why we need it? add ascend c casual_conv1d_fn - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-09 23:29:49 +08:00
Hexiang Wang	48b624e4cc	[BugFix] Fix implementation bug of triton rope_siso (#7082 ) ### What this PR does / why we need it? Previously implemention of triton rope_siso missing the storage of second half of rope results, which will result in: 1. accuracy problem in neox-style scenario 2. ub overflow in non neox-style scenario This PR fixes it and supplement nightly test case for it. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-09 23:08:43 +08:00
LeeWenquan	65eae6de7b	Add Ascend Ops recurrent_gated_delta_rule (#6725 ) ### What this PR does / why we need it? Change recurrent_gated_delta_rule ops from triton to ascend C version for better performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-03-09 14:14:14 +08:00
Fager10086	c5dfa8d645	[OPS]add split_qkv_rmsnorm_mrope ops (#6730 ) ### What this PR does / why we need it? This PR adds split_qkv_rmsnorm_mrope kernel with interleaved for qwen3.5 and qwen3-vl to improve performance. ### Does this PR introduce _any_ user-facing change? Does not. ### How to use? ```python real_q, real_k, real_v, real_gate = torch.ops.vllm.triton_split_qkv_rmsnorm_mrope( qkv=qkv, q_weight=q_weight, k_weight=k_weight, cos_sin=cos_sin, num_q_heads=num_q_heads, num_kv_heads=num_kv_heads, head_size=head_size, eps=eps, mrope_section=mrope_section, is_interleaved=is_interleaved, rope_dim=rope_dim, has_gate=has_gate, ) ``` ### How was this patch tested? - vLLM version: v0.16.0 - Accuracy test script： ```shell pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_rmsnorm_mrope.py ``` --------- Signed-off-by: Fager <865071616@qq.com> Signed-off-by: Fager10086 <77871921+Fager10086@users.noreply.github.com> Signed-off-by: fager <865071616@qq.com>	2026-03-06 16:18:37 +08:00
frank	18b52afe2b	[Ops][Misc] Optimize split_qkv_rmsnorm_rope op (#6827 ) ### What this PR does / why we need it? This PR optimizes the `split_qkv_rmsnorm_rope` operator by introducing a new Triton kernel, `split_qkv_rmsnorm_rope_prefill_kernel`, for the prefill stage (i.e., large batch sizes). The implementation now dynamically selects between the existing decode kernel and the new prefill kernel based on the batch size, which improves performance for large batch scenarios. Additionally, the RoPE implementation is updated to support partial rotation dimensions (`rope_dim`), making the operator more flexible. ### Does this PR introduce _any_ user-facing change? No. This is a performance optimization and is not expected to introduce any user-facing changes. ### How was this patch tested? CI should pass with existing tests. The new prefill path is triggered when the batch size is larger than the number of available vector cores. The partial RoPE feature can be tested by passing the `rope_dim` argument. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: guzhiyong <guzhiyong5@h-partners.com> Signed-off-by: frank <2547457096@qq.com> Co-authored-by: guzhiyong <guzhiyong5@h-partners.com>	2026-03-06 09:30:31 +08:00
SILONG ZENG	859f2c25b9	[Nightly][Refactor]Migrate nightly single-node model tests from `.py` to `.yaml` (#6503 ) ### What this PR does / why we need it? This PR refactors the nightly single-node model test by migrating test configurations from Python scripts to a more maintainable `YAML-based` format. \| Original PR \| Python (`.py`) \| YAML (`.yaml`) \| \| :--- \| :--- \| :--- \| \| [#3568](https://github.com/vllm-project/vllm-ascend/pull/3568) \| `test_deepseek_r1_0528_w8a8_eplb.py` \| `DeepSeek-R1-0528-W8A8.yaml` \| \| [#3631](https://github.com/vllm-project/vllm-ascend/pull/3631) \| `test_deepseek_r1_0528_w8a8.py` \| `DeepSeek-R1-0528-W8A8.yaml` \| \| [#5874](https://github.com/vllm-project/vllm-ascend/pull/5874) \| `test_deepseek_r1_w8a8_hbm.py` \| `DeepSeek-R1-W8A8-HBM.yaml` \| \| [#3908](https://github.com/vllm-project/vllm-ascend/pull/3908) \| `test_deepseek_v3_2_w8a8.py` \| `DeepSeek-V3.2-W8A8.yaml` \| \| [#5682](https://github.com/vllm-project/vllm-ascend/pull/5682) \| `test_kimi_k2_thinking.py` \| `Kimi-K2-Thinking.yaml` \| \| [#4111](https://github.com/vllm-project/vllm-ascend/pull/4111) \| `test_mtpx_deepseek_r1_0528_w8a8.py` \| `MTPX-DeepSeek-R1-0528-W8A8.yaml` \| \| [#3733](https://github.com/vllm-project/vllm-ascend/pull/3733) \| `test_prefix_cache_deepseek_r1_0528_w8a8.py` \| `Prefix-Cache-DeepSeek-R1-0528-W8A8.yaml` \| \| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) \| `test_qwen3_235b_w8a8.py` \| `Qwen3-235B-A22B-W8A8.yaml` \| \| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) \| `test_qwen3_235b_a22b_w8a8_eplb.py` \| `Qwen3-235B-A22B-W8A8.yaml` \| \| [#3973](https://github.com/vllm-project/vllm-ascend/pull/3973) \| `test_qwen3_30b_w8a8.py` \| `Qwen3-30B-A3B-W8A8.yaml` \| \| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) \| `test_qwen3_32b_int8.py` \| `Qwen3-32B-Int8.yaml` \| \| [#3757](https://github.com/vllm-project/vllm-ascend/pull/3757) \| `test_qwq_32b.py` \| `QwQ-32B.yaml` \| \| [#5616](https://github.com/vllm-project/vllm-ascend/pull/5616) \| `test_qwen3_next_w8a8.py` \| `Qwen3-Next-80B-A3B-Instruct-W8A8.yaml` \| \| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) \| `test_qwen2_5_vl_7b.py` \| `Qwen2.5-VL-7B-Instruct.yaml` \| \| [#5301](https://github.com/vllm-project/vllm-ascend/pull/5301) \| `test_qwen2_5_vl_7b_epd.py` \| `Qwen2.5-VL-7B-Instruct-EPD.yaml` \| \| [#3707](https://github.com/vllm-project/vllm-ascend/pull/3707) \| `test_qwen2_5_vl_32b.py` \| `Qwen2.5-VL-32B-Instruct.yaml` \| \| [#3676](https://github.com/vllm-project/vllm-ascend/pull/3676) \| `test_qwen3_32b_int8_a3_feature_stack3.py` \| `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` \| \| [#3709](https://github.com/vllm-project/vllm-ascend/pull/3709) \| `test_prefix_cache_qwen3_32b_int8.py` \| `Prefix-Cache-Qwen3-32B-Int8.yaml` \| \| [#5395](https://github.com/vllm-project/vllm-ascend/pull/5395) \| `test_qwen3_next.py` \| `Qwen3-Next-80B-A3B-Instruct-A2.yaml` \| \| [#3474](https://github.com/vllm-project/vllm-ascend/pull/3474) \| `test_qwen3_32b.py` \| `Qwen3-32B.yaml` \| \| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) \| `test_qwen3_32b_int8.py` \| `Qwen3-32B-Int8-A2.yaml` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-03 20:13:43 +08:00
starmountain1997	248d07566f	[CI] nightly test timeout (#6912 ) ### What this PR does / why we need it? The nightly test is currently failing due to a [timeout](https://github.com/vllm-project/vllm-ascend/actions/runs/22547280169/job/65326335134). As noted in #6778, this issue can be resolved by applying this fix. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? run nightly test. Co-authored-by: guozr <guozr1997@hotmail.com>	2026-03-03 09:31:46 +08:00
whx	16c879cdf7	[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 ) ### What this PR does / why we need it? Add muls_add triton kernel with related fusion pass. What's more, this PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with new added test. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-02 17:54:25 +08:00
starmountain1997	bc1622338c	[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536 ) ### What this PR does / why we need it? This version has no divisibility constraint between tp and mtp+1. However, cudagraph_capture_sizes must be a common multiple of tp and mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed cudagraph_capture_sizes. We added a long-sequence test (64k input, 3k output) for the two-node mixed deployment scenario. Due to the excessive time required for performance benchmarking, we are only verifying functionality. The single-node scenario is skipped because VRAM limitations prevent launching the model with a max-model-len of 68,000. and we also add aime2025 test for dual-node deepseek 3.2 nightly test. ### How was this patch tested? test at nightly environment. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-26 10:58:50 +08:00
SILONG ZENG	e2237819a9	[CI]Fixed the spell check function in `typos.toml` (#6753 ) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-14 11:57:26 +08:00
JIACHENG XU	64aea60f2e	[EPLB][Nightly] Refactor UT (#6543 ) ### What this PR does / why we need it? The basic configs are extracted and reused for eplb UT. This is done so that if the basic configs are changed later, eplb UT does not need to be modified repeatedly. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: bigsir007 <xujiacheng12@huawei.com> Co-authored-by: bigsir007 <xujiacheng12@huawei.com>	2026-02-14 10:56:29 +08:00
lih827	f71812011d	[Feature] DispatchGmmCombineDecode support bf16/float16 gmm1/gmm2 weight and support gmm weight with ND format (#6393 ) ### What this PR does / why we need it? 1. support ND format gmm weight input. Before this pr, gmm1_weight and gmm2_weight could only be passed as input to the DispatchGmmCombineDecode operator in NZ data format. After the modification, they are allowed to be passed in ND data format. 2. support bf16/float16 gmm weight The current PR modification enables the DispatchGmmCombineDecode operator to support non-W8A8 scenarios, allowing gmm1_weight and gmm2_weight to be passed as float16/bfloat16 which is correspond with input token data type. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: lih827 <383084552@qq.com>	2026-02-12 10:37:41 +08:00
Angazenn	c0c2eb614e	[Main][Ops] Make triton rope support index_selecting from cos_sin_cache (#5450 ) ### What this PR does / why we need it? This PR extends original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as inputs. This fully aligns to vLLM RoPE api interface. Compared with earlier implementation for RoPE, the benefits are: 1. avoiding pre-computation of `cos` `sin` before model execution, which helps to remove redundant codes. 2. allowing eagle3 draft model to have different rope parameters with main model (see #6612 ). This help to recover accept rate && accuracy in that case. In addition, this kernel change only introduces very small performance degradation. Those `index_select` or `chunk` operations are now changed into simple memory access in triton kernel (For example, https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81). Highlights - RoPE Cache Unification: Replaced separate _sin and _cos global tensors with a unified cos_sin_cache and explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling. - Triton Kernel Integration: Updated Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations. - Custom Operation Registration: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation. - Refactored RoPE Forward Pass: Modified the rope_forward_oot function to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `5326c89803` Additional test on Qwen3-235b accuracy: \| Aime2024 \| GSM8K \| Livecodebench \| \| -------- \| -------- \| -------- \| \| 83.33 \| 96.26 \| 70.23 \| --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-02-11 21:20:53 +08:00
whx	bb73478c00	[Test][BugFix] Fix torch.rand usage in triton penalty test (#6680 ) ### What this PR does / why we need it? This PR fixes a `TypeError` in `tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py` that was causing nightly test failures. The `torch.rand()` function was being called with the `device` string as a positional argument, which is incorrect. This has been corrected to use the `device` keyword argument. Fixes # ### Does this PR introduce _any_ user-facing change? No, this change only affects a test file. ### How was this patch tested? CI is expected to pass with this fix. - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-02-11 16:31:49 +08:00
lhp-deep	d060c797ed	[fix bug] fix tensor mismatch bug in sigmoid operate test case (#6619 ) ### What this PR does / why we need it? This PR fixes a bug in the `test_triton_fusion_ops` test case. The test compares a fused kernel (`fused_sigmoid_gating_delta_rule_update`) with a split implementation. Both paths use a recurrent state tensor. The bug was that the state tensor was being modified in-place by the fused kernel call, and this modified tensor was then reused for the split implementation path. This led to an incorrect comparison and test failure. This fix ensures that each path starts with an identical, clean initial state by creating separate tensors. It also changes the state initialization from `torch.randn` to `torch.ones` to make the test deterministic. ### Does this PR introduce _any_ user-facing change? No, this change only affects a test case and has no user-facing impact. ### How was this patch tested? The fix is applied directly to the test case. The CI passing for `test_fused_sigmoid_gating_delta_rule.py` will confirm that the fix is working as expected. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>	2026-02-09 16:43:27 +08:00
wangxiyuan	6c49f95da2	[Ops][Refactor] Remove custom rotary_embedding operator (#6523 ) ### What this PR does / why we need it? This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests. The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring and does not introduce any user-facing changes. ### How was this patch tested? The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-07 09:24:05 +08:00
wangyu	c63b7a1188	[Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301 ) ### What this PR does / why we need it? This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test by running ci - vLLM version: release/v0.12.0 --------- Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>	2026-02-06 17:30:17 +08:00
zhangxinyuehfad	81f3c09d6d	[CI] Change A2 runner (#6557 ) ### What this PR does / why we need it? This PR updates the CI runner from `linux-aarch64-a2-` to `linux-aarch64-a2b3-` in various test configuration files. This change is necessary to adapt to updates in the CI infrastructure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The changes are configuration updates for CI tests. The correctness will be verified by the CI pipeline. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 23:43:57 +08:00
Nengjun Ma	78fad4e348	[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442 ) ### What this PR does / why we need it? Refactor MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage. Environments VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE is removed, usage as following: --additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}' ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-04 09:08:18 +08:00
whx	4d6444d5fd	[Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (#6505 ) ### What this PR does / why we need it? Remove kv_cache nz test case for test_mla_preprocess_nq.py. This case is added by https://github.com/vllm-project/vllm-ascend/pull/3072 but has not been tested on bf16 scenario. Results show that this is not currently supported. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-02-03 18:26:51 +08:00
lidenghui1110	79803932e2	[Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366 ) ### What this PR does / why we need it? As #2947 describe, we need to transpose kv cache layout after GQA kv transfer when prefill and decode tensor parallel size are heterogeneous, in the previous implementation, we use `npu_paged_cache_load ` + `tranpose` + `_npu_reshape_and_cache` to do this work. But obviously, it is not an efficient plan, the ops above need to be called for each layer, which introduces 3 * layer_num kernel launch, and 6 * layer_num data movement between L1 Cache and HBM for one request on decode node. Usually, decode node uses graph mode, so these op kernels will be called between decode forward launched by an async thread in mooncacke connector, this kernels maybe last for several decode forward and TTFT will increase by 3~4 decode forward time. In this PR, we implement an AscendC fused op `transpose_kv_cache_by_block` to do this with only once kernel launch and move data between L1 Cache and HBM only once. After using this fused op, the time cost in transpose kv cacke layout can be decreased to 0.24ms from 7ms in UT on 910C, and in PD disaggregation scenario, TTFT can decrease about 90 ~ 110 ms in qwen3-235B. \| request_num \| original \| fused_op\| \|:----------------------:\|:---------------:\|:-------------------:\| \| 1 \| 643 ms \| 578 ms \| \| 128 \| 1480 ms \| 1368 ms \| ### Does this PR introduce _any_ user-facing change? Use fused op by default, incase the op has bug in any scenario, provide fallback choice using env to disable it. DISABLE fused op by add following env `export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0` ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-02-03 14:10:01 +08:00
guanguan0308	dffac6db73	[Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402 ) ### What this PR does / why we need it? Add New Output for Expert Token Count An additional output tensor expert_token_nums is added to both operators to meet the requirement of tracking token distribution among experts: Tensor Name: expert_token_nums Dimension: 1D tensor Shape: (local_expert_num,) Data Type: int32 Semantics: Represents the number of tokens actually received by each expert on the current card. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: guanguan0308 <1546542263@qq.com> Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>	2026-02-03 10:41:06 +08:00
starmountain1997	b6256e8bc9	Revert "[CI] fix DS3.2 single node cudagraph_sizes config (#6241 )" (#6497 ) # What this PR does / why we need it? This PR reverts commit `8134146ab6`, which modified the DeepSeek V3.2 (W8A8) single-node nightly test configuration. as there is no limit between tp_size and MTP. # Does this PR introduce any user-facing change? No. This PR only affects CI/CD test configurations and does not introduce any user-facing changes. # How was this patch tested? N/A for a revert PR. The changes restore the previously known working configuration. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-03 08:42:58 +08:00
starmountain1997	8134146ab6	[CI] fix DS3.2 single node cudagraph_sizes config (#6241 ) # What this PR does / why we need it? This PR fixes the single-node nightly test for DeepSeek V3.2 (W8A8) model to ensure CI stability. The changes include: 1. Simplified nightly test matrix (nightly_test_a3.yaml): - Temporarily reduced to only run deepseek3_2-w8a8 test case for debugging - Changed trigger from schedule/workflow_dispatch to support push/pull_request for faster iteration 2. Updated DeepSeek V3.2 test configuration (test_deepseek_v3_2_w8a8.py): - Adjusted cudagraph_capture_sizes from [3, 6, 9, 12] to [8, 16, 24, 32] for better performance - Increased max-num-seqs from 4 to 8 - Increased gpu-memory-utilization from 0.92 to 0.98 - Increased num_speculative_tokens from 2 to 3 3. Added PR checkout step (_e2e_nightly_single_node.yaml): - Added ability to checkout a specific PR (#6241) for testing # Does this PR introduce any user-facing change? No. This PR only affects CI/CD test configurations and does not introduce any user-facing changes. # How was this patch tested? Mock nightly test has passed, see [here](https://github.com/vllm-project/vllm-ascend/actions/runs/21574655952/job/62159656622?pr=6241). <img width="1053" height="714" alt="a2f2ee359febb13e1f6330b1bd3c116b" src="https://github.com/user-attachments/assets/3262ad0f-adec-4c71-871f-d9cf2db06fbc" /> - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-02 11:47:32 +08:00
InSec	86b6ecac4c	[CI][BugFix] Import error fix. (#6293 ) ### What this PR does / why we need it? Fix the import error of qwen3-next nightly test. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: InSec <1790766300@qq.com>	2026-01-28 22:07:47 +08:00
linfeng-yuan	e25ee65729	[Misc][Test] add e2e test for apply_top_k_top_p_custom kernel (#6348 ) ### What this PR does / why we need it? Add e2e test case for apply_top_k_top_p_custom kernel and eliminate chinese comments. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest passed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-01-28 17:25:57 +08:00

1 2

81 Commits