xc-llm-ascend

Author	SHA1	Message	Date
shiyuan680	7af3b880c1	support triton of mrope (#5664 ) ### What this PR does / why we need it? This PR supports using the Triton mrope kernel, as cuda_forward does; its performance is on par with the AscendC ops. The Triton op requires CANN 8.5.0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested on qwen3-vl-235b; TextVQA accuracy: native 81.82, NPU Triton 81.58, CUDA Triton 81.52. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shiyuan680 <917935075@qq.com>	2026-01-13 09:13:51 +08:00
Li Wang	75c92a3640	[CI] Move nightly-a2 test to hk (#5807 ) ### What this PR does / why we need it? As an initial test, this patch connects two nodes from the HK region to the nightly A2 tests. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-12 22:58:35 +08:00
SILONG ZENG	7a6fde80b1	[CI]Add Kimi k2 nightly test (#5682 ) ### What this PR does / why we need it? This PR adds performance and accuracy tests for the Kimi-K2-Instruct-W8A8 and Kimi-K2-Thinking models to the Nightly test suite. #### Test Configuration Kimi-K2-Instruct-W8A8 - Model: vllm-ascend/Kimi-K2-Instruct-W8A8 - Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node) - Architecture: Unified Distributed Inference - Parallelism: DP4 + TP8 + EP (Data Parallel 4, Tensor Parallel 8, Expert Parallel enabled). - Optimization: torchair graph, no-prefix-caching. - Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8. - Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8. - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs2800. - Accuracy: vllm-ascend/gsm8k-lite. Kimi-K2-Thinking - Model: moonshotai/Kimi-K2-Thinking - Hardware: A3, 1 Node (16 NPUs total) - Architecture: Single Node Distributed Inference - Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled). - Optimization: no-prefix-caching - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs400. - Accuracy: vllm-ascend/gsm8k-lite. ### Does this PR introduce _any_ user-facing change? Yes. This PR enhances the ```AisbenchRunner``` to support dynamic configuration of the ```trust_remote_code``` flag. This allows the AISBench client to successfully load tokenizers for models that require custom code execution (e.g., Kimi-K2-Thinking and Kimi-K2-Instruct-W8A8). Changes: 1. ```AisbenchRunner.__init__``` Added the ability to capture the ```trust_remote_code``` parameter from the case configuration. ``` python self.batch_size = aisbench_config["batch_size"] self.request_rate = aisbench_config.get("request_rate", 0) + self.trust_remote_code = aisbench_config.get("trust_remote_code", False) self.temperature = aisbench_config.get("temperature") self.top_k = aisbench_config.get("top_k") ``` 2. 
```AisbenchRunner._init_request_conf``` Added regex substitution to inject the parameter into the generated dynamic configuration file. ``` python content = re.sub(r'batch_size.', f'batch_size = {self.batch_size},', content) + content = re.sub(r'trust_remote_code=.', + f'trust_remote_code={self.trust_remote_code},', + content) content = content.replace("top_k", "#top_k") content = content.replace("seed", "#seed") ``` Details: - New Config Key: Users can add ```"trust_remote_code": True``` to any dictionary within the ```aisbench_cases``` list. - Default Value: Defaults to ```False``` to maintain existing security protocols for standard models. - Impact: Resolves ```ValueError``` when benchmarking reasoning models or models with custom tokenizers that previously failed during the AISBench local initialization phase. User Example: Users can now enable custom code execution for specific models (like Kimi-K2-Thinking) directly in their test suite: ``` # Now supported in test scripts: aisbench_cases = [{ "case_type": "performance", "request_conf": "vllm_api_stream_chat", "trust_remote_code": True, # New user-facing parameter ... }] ``` ### How was this patch tested? Actions: - https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433 Result as following: - Kimi-K2-Instruct-W8A8(25m25s) 1. Accuracy test ``` dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 96.88 ``` 2. 
Perf test ``` ╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡ │ E2EL │ total │ 34571.489 ms │ 28657.8054 ms │ 36294.1788 ms │ 34714.7329 ms │ 35247.2724 ms │ 35526.6758 ms │ 36146.4314 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TTFT │ total │ 2043.9136 ms │ 627.4718 ms │ 3532.3978 ms │ 1906.0194 ms │ 2307.7979 ms │ 2883.8528 ms │ 3283.7012 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TPOT │ total │ 127.5591 ms │ 106.4937 ms │ 137.107 ms │ 128.3135 ms │ 129.5704 ms │ 131.1332 ms │ 134.1087 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ ITL │ total │ 126.5571 ms │ 0.0095 ms │ 1340.783 ms │ 104.1398 ms │ 110.1272 ms │ 119.6124 ms │ 950.2924 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ InputTokens │ total │ 3516.6055 │ 3014.0 │ 3985.0 │ 3525.0 │ 3525.0 │ 3586.8 │ 3800.67 │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 512 │ 
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokenThroughput │ total │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛ ╒══════════════════════════╤═════════╤═══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪═══════════════════╡ │ Benchmark Duration │ total │ 279430.9375 ms │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Requests │ total │ 512 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Success Requests │ total │ 512 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Concurrency │ total │ 63.3452 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Max Concurrency │ total │ 64 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Request Throughput │ total │ 1.8323 req/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Input Tokens │ total │ 1800502 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Prefill Token Throughput │ total │ 1720.5255 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total generated tokens │ total │ 131072 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Input Token Throughput │ total │ 6443.4598 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Output Token Throughput │ total │ 469.0676 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Token Throughput │ total │ 6912.5274 token/s │ ╘══════════════════════════╧═════════╧═══════════════════╛ ``` - 
Kimi-K2-Thinking(43m51s) 1. Accuracy test ``` dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 100.00 ``` 2. Perf test ``` ╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡ │ E2EL │ total │ 172384.3573 ms │ 34456.5517 ms │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms │ 204428.9502 ms │ 205468.6776 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TTFT │ total │ 138740.3228 ms │ 655.1066 ms │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TPOT │ total │ 131.9374 ms │ 90.6331 ms │ 135.4144 ms │ 132.405 ms │ 132.948 ms │ 133.7549 ms │ 135.2543 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ ITL │ total │ 130.9028 ms │ 0.0099 ms │ 960.3683 ms │ 116.9623 ms │ 122.3127 ms │ 132.0522 ms │ 886.4662 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ InputTokens │ total │ 3514.575 │ 3014.0 │ 3843.0 │ 3525.0 │ 3525.0 │ 3588.0 │ 3801.08 │ 400 │ 
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokenThroughput │ total │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s │ 400 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛ ╒══════════════════════════╤═════════╤═══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪═══════════════════╡ │ Benchmark Duration │ total │ 1166795.568 ms │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Requests │ total │ 400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Success Requests │ total │ 400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Concurrency │ total │ 59.0967 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Max Concurrency │ total │ 64 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Request Throughput │ total │ 0.3428 req/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Input Tokens │ total │ 1405830 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Prefill Token Throughput │ total │ 25.332 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total generated tokens │ total │ 102400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Input Token Throughput │ total │ 1204.864 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ 
Output Token Throughput │ total │ 87.7617 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Token Throughput │ total │ 1292.6258 token/s │ ╘══════════════════════════╧═════════╧═══════════════════╛ ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-12 15:56:07 +08:00
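The `trust_remote_code` injection described in the Kimi K2 commit above can be sketched as a standalone function. This is a minimal sketch, not the actual `AisbenchRunner` code: the function name `inject_request_conf`, the template string, and the exact regular expressions are assumptions made for illustration (the patterns shown in the commit message appear truncated, so a conservative "value up to the next comma" pattern is used here instead).

```python
import re


def inject_request_conf(content: str, batch_size: int, trust_remote_code: bool) -> str:
    """Rewrite the batch_size and trust_remote_code entries of a config template.

    Hypothetical helper: the name and the regex patterns are illustrative
    assumptions, not the patterns used by AisbenchRunner.
    """
    # Replace "batch_size = <anything up to the comma>" with the requested value.
    content = re.sub(r"batch_size\s*=\s*[^,\n]*,?",
                     f"batch_size = {batch_size},", content)
    # Same idea for trust_remote_code; a Python bool renders as True/False.
    content = re.sub(r"trust_remote_code\s*=\s*[^,\n]*,?",
                     f"trust_remote_code={trust_remote_code},", content)
    return content


# Assumed shape of a generated request config template.
template = "models = [dict(\n    batch_size = 1,\n    trust_remote_code=False,\n)]\n"
result = inject_request_conf(template, batch_size=64, trust_remote_code=True)
print(result)
```

Defaulting the flag to `False` in the case configuration, as the commit describes, keeps custom tokenizer code from running unless a test case opts in explicitly.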
Nengjun Ma	297f6deb09	[CI] Align multi-node nightly test paramter with corresponding tutorials document (#5756 ) ### What this PR does / why we need it? Align the multi-node nightly test parameters with the tutorial documents. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Tested locally and with the nightly e2e multi-node test cases. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-12 09:00:31 +08:00
SILONG ZENG	09b3f9d91b	[CI]Add Disaggregated PD Nightly Test for Qwen3-235B and Qwen3-VL-235B (#5502 ) ### What this PR does / why we need it? This PR adds online Disaggregated Prefill/Decode performance and accuracy tests for the Qwen3-235B-A22B and Qwen3-VL-235B-A22B-Instruct models to the Nightly test suite. These test configurations simulate the deployment of massive MoE and Vision-Language models in a dual-node (32 NPU) environment, utilizing Mooncake (KVCache Transfer) technology to achieve efficient KV cache transfer between the Prefill node and the Decode node. #### Test Configuration Qwen3-235B-A22B - Model: Qwen/Qwen3-235B-A22B - Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node) - Architecture: Disaggregated Prefill & Decode - Node 0 (Producer/Prefill): DP2 + TP8 + EP + FLASHCOMM1 + FUSED_MC2. - Node 1 (Consumer/Decode): DP4 + TP4 + EP + FLASHCOMM1 + FUSED_MC2 + FULL_DECODE_ONLY. - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs2800. - Accuracy: vllm-ascend/gsm8k-lite. Qwen3-VL-235B-A22B-Instruct - Model: Qwen/Qwen3-VL-235B-A22B-Instruct - Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node) - Architecture: Disaggregated Prefill & Decode - Node 0 (Producer/Prefill): DP2 + TP8 + EP. - Node 1 (Consumer/Decode): DP4 + TP4 + EP + FULL_DECODE_ONLY. - Benchmarks: - Performance: vllm-ascend/textvqa-perf-1080p. - Accuracy: vllm-ascend/textvqa-lite. ### How was this patch tested? Nightly test action on CI - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-09 16:25:20 +08:00
1092626063	f63c1341d9	[Feature] GLM4.6 support mtp with fullgraph (#5460 ) ### What this PR does / why we need it? GLM4.6 now supports MTP with full graph mode to improve performance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ` export HCCL_BUFFSIZE=1024 export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_OP_EXPANSION_MODE=AIV vllm serve /weight/glm4.6_w8a8_with_float_mtp \ --data-parallel-size 1 \ --tensor-parallel-size 16 \ --seed 1024 \ --served-model-name glm \ --max-model-len 35000 \ --max-num-batched-tokens 16384 \ --max-num-seqs 16 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ --speculative-config '{"num_speculative_tokens": 1, "model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \ --compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --async-scheduling \ ` Test case: ` vllm bench serve \ --backend vllm \ --dataset-name prefix_repetition \ --prefix-repetition-prefix-len 22400 \ --prefix-repetition-suffix-len 9600 \ --prefix-repetition-output-len 1024 \ --num-prompts 1 \ --prefix-repetition-num-prefixes 1 \ --ignore-eos \ --model glm \ --tokenizer /weight/glm4.6_w8a8_with_float_mtp \ --seed 1000 \ --host 0.0.0.0 \ --port 8000 \ --endpoint /v1/completions \ --max-concurrency 1 \ --request-rate 1 ` - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: 1092626063 <1092626063@qq.com>	2026-01-09 16:07:42 +08:00
lhchg	dc99cfdc15	[CustomOp] support TensorList for dispatchFFNCombine (#5665 ) ### What this PR does / why we need it? Support TensorList inputs for dispatch_ffn_combine in order to adapt it to EPLB. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Single Operator Testing - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lhchg <lhao_cheng@163.com> Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>	2026-01-09 15:56:29 +08:00
InSec	2d713fee93	[CI] Accuracy issue of qwen3-next-w8a8 nightly test fix. (#5746 ) ### What this PR does / why we need it? Temporarily disable Full Graph mode to avoid an accuracy issue with Qwen3-Next-80B-A3B-Instruct-W8A8. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: InSec <1790766300@qq.com>	2026-01-09 15:55:13 +08:00
LeeWenquan	a3a74d6984	[CI] Add qwen3 next ci (#5395 ) ### What this PR does / why we need it? Add Qwen3Next CI ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-01-09 10:29:09 +08:00
Li Wang	595d3484c4	[Nightly] Move ops to the correct path (#5642 ) ### What this PR does / why we need it? Move ops to the correct path where they belong - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-09 09:23:36 +08:00
meihanc	6315a31399	[CI] Add triton ascend in nightly CI (#5716 ) ### What this PR does / why we need it? Add triton ascend in nightly ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-08 21:17:32 +08:00
Aoxuan Chen	8763953f56	[Feature] add the magicmtp speculative decoding acceleration algorithm (#5542 ) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. Added Triton and PyTorch implementations, and added E2E test cases. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: chenaoxuan <cax1165@163.com>	2026-01-08 09:15:55 +08:00
ZCG12345	3be8e33fe9	[Kernel] Add moe_gating_top_k operator support for Ascend NPU (#5579 ) ### What this PR does / why we need it? 1. Replace the torch_npu moe_gating_top_k with a custom op. 2. Enable the renorm function of moe_gating_top_k in the softmax scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No testing needed. - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: ZCG12345 <2097562023@qq.com>	2026-01-07 21:42:31 +08:00
Icey	137f28341d	[Tests] Add qwen3-8b nightly test (#5597 ) ### What this PR does / why we need it? Add qwen3-8b nightly test - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-07 18:42:05 +08:00
wangxiyuan	6f7a81cd9f	[CI] cleanup single/multi-card test (#5623 ) 1. speed up e2e light test. 2. create `2-cards` and `4-cards` folder in multicard 3. move ops to nightly 4. run test in Alphabetical Order - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 14:13:34 +08:00
wangyibo1005	25baf6df09	[Feature]EPLB:Adapt DispatchGmmCombineDecode operator to eplb tensor list and expert token numbers (#5552 ) #### What this PR does / why we need it? This PR adapts the DispatchGmmCombineDecode operator to the EPLB tensor list and expert token numbers. The operator supports gmm1, gmm2, gmm1Scale and gmm2Scale in list format, and supports counting how many tokens each local expert receives via expertTokensNum. - vLLM version: v0.13.0 - vLLM main: `7157596103` For more information about this operator, please refer to the RFC issue https://github.com/vllm-project/vllm-ascend/issues/5476	2026-01-07 11:23:42 +08:00
starmountain1997	086c093347	[CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#5371 ) # What this PR does / why we need it? Add DeepSeek-V3.2-W8A8 dual-node nightly CI test and update A3 nightly test configuration: 1. Add DeepSeek-V3.2-W8A8 dual-node test: tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml - 2 nodes, 16 NPUs per node (32 NPUs total) - Configuration: 2P+1D (data-parallel-size=4, tensor-parallel-size=8, data-parallel-size-local=2) - Includes performance and accuracy benchmarks with GSM8K dataset 2. Update A3 nightly workflow: .github/workflows/nightly_test_a3.yaml - Added DeepSeek-V3.2-W8A8 dual-node test to the A3 nightly test matrix - Test name: multi-node-dpsk3.2-2node 3. Improve test scripts: Updated .github/workflows/_e2e_nightly_multi_node.yaml and related scripts for better multi-node testing support test on A3 instances - Performance baseline: 1 (threshold: 0.97) - Accuracy baseline: 95% (threshold: 5%) - Test dataset: GSM8K with 512 prompts for performance, gsm8k-lite for accuracy --------- Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-01-07 10:02:02 +08:00
InSec	089ca2ddcc	[Nightly][Test] Add Qwen3-Next-80B-A3B-Instruct-W8A8 nightly test (#5616 ) ### What this PR does / why we need it? There was an accuracy issue with Qwen3-Next-80B-A3B-Instruct-W8A8 model in the old version of Triton-Ascend, so, we are now adding one nightly test to maintain it. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: IncSec <1790766300@qq.com>	2026-01-06 17:36:00 +08:00
Li Wang	c5e2f48510	[CI] mv ops to correct path (#5615 ) ### What this PR does / why we need it? mv ops to correct path :`tests/e2e/nightly/single_node/ops/singlecard_ops/triton` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-05 23:17:07 +08:00
dsxsteven	129ba9fe1b	[BugFix] Fix Smoke Testing Bug for DSR1 longseq (#5613 ) ### What this PR does / why we need it? Fix the smoke testing bug for DSR1 longseq. This change is needed because the daily smoke test case throws an error: "max_tokens or max_completion_tokens is too large: 32768.This model's maximum context length is 32768 tokens and your request has 128 input tokens". The error occurs because max-out-len equals max-model-len, which leaves no room for the input tokens. It is fixed by increasing the max-model-len argument in the script. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: daishixun <dsxsteven@sina.com>	2026-01-05 22:40:28 +08:00
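The arithmetic behind the DSR1 longseq error above can be checked with a short sketch. It assumes the server rejects a request whenever input tokens plus requested output tokens exceed the context length, which is consistent with the quoted error message; the function names are hypothetical and not part of vLLM.

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """Return True if the prompt plus the requested output fits in the context window."""
    return prompt_tokens + max_tokens <= max_model_len


def required_model_len(prompt_tokens: int, max_tokens: int) -> int:
    """Smallest max-model-len that admits the request under the assumed rule."""
    return prompt_tokens + max_tokens


# Values taken from the error message quoted in the commit.
prompt_tokens, max_tokens, max_model_len = 128, 32768, 32768

ok_before = fits_context(prompt_tokens, max_tokens, max_model_len)
needed = required_model_len(prompt_tokens, max_tokens)
ok_after = fits_context(prompt_tokens, max_tokens, needed)
print(ok_before, needed, ok_after)
```

Under this assumption, any max-model-len of at least 32896 would admit the failing request; the value actually chosen in the script is not stated in the commit message.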
Angazenn	11e75494b1	[TRITON][TEST]Add nightly test for triton split_qkv_rmsnorm_rope (#5267 ) ### What this PR does / why we need it? Add nightly test for triton split_rmsnorm_rope ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-05 21:35:37 +08:00
ZT-AIA	58e8d19c35	[UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat (#5474 ) ### What this PR does / why we need it? [UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? pytest -sv tests/ut/ops/test_fused_qkvzba_split_reshape_cat.py - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: ZT-AIA <1028681969@qq.com>	2026-01-05 20:05:07 +08:00
Yizhou	755caeb06e	[Feat][Spec] Optimize token index calculation in spec decode with Triton kernel (#5356 ) ### What this PR does / why we need it? Replace multiple PyTorch operations with a fused Triton kernel to determine token indices for sampling during speculative decoding. This reduces kernel launch overhead and memory traffic, improving overall performance on Ascend hardware. --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-05 16:51:29 +08:00
daniel	8ffe3f5d78	feat: implement high-performance Triton kernels for rejection sampling: optimization for rejection_random_sample_kernel (#5259 ) ### What this PR does / why we need it? This PR introduces optimized Triton implementations for the rejection_random_sample_kernel delivering superior performance compared to the existing Triton implementations. The new Triton kernels maintain full functional accuracy while delivering significant performance improvements across various batch sizes and MTP configurations. ### Does this PR introduce _any_ user-facing change? Yes, this PR modifies rejection_sampler.py to use optimized Triton kernels: rejection_random_sample_kernel is modified and optimized ### How was this patch tested? Performance benchmark results:

| Batch Size | MTP | Original implementation (us) | Optimized version (us) |
| -- | -- | -- | -- |
| 1 | 1 | 2.934 | 3.64 |
| 8 | 1 | 4.467 | 4 |
| 32 | 1 | 6.98 | 4.54 |
| 64 | 1 | 11.087 | 6.42 |
| 128 | 1 | 13.414 | 7.84 |
| 256 | 1 | 19.66 | 8.487 |
| 512 | 1 | 39.908 | 11.62 |
| 1024 | 1 | 81.781 | 18.16 |
| 2048 | 1 | 137.923 | 32.934 |
| 1 | 2 | 3.4 | 4.02 |
| 8 | 2 | 3.74 | 4.24 |
| 32 | 2 | 6.373 | 7.394 |
| 64 | 2 | 9.747 | 6.46 |
| 128 | 2 | 12.98 | 7.76 |
| 256 | 2 | 20.834 | 9.787 |
| 512 | 2 | 39.314 | 13.56 |
| 1024 | 2 | 83.135 | 22.387 |
| 2048 | 2 | 157.563 | 40.607 |

- vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: 1024daniel <xxltju324@gmail.com>	2026-01-05 16:03:02 +08:00
Trunrain	91bf524364	[BugFix][kernel] fix matmul_allreduce_add_rmsnorm_kernel (#5335 ) ### What this PR does / why we need it? Fix matmul_allreduce_add_rmsnorm_kernel and add the hccl Init and SetCcTiling interfaces. The test case uses multicard-4. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? pytest -sv tests/e2e/nightly/ops/test_matmul_allreduce_add_rmsnorm.py multicard-4 pass https://github.com/vllm-project/vllm-ascend/actions/runs/20502630658/job/58914474652?pr=5335 - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: tongrunze <t00574058@china.huawei.com> Co-authored-by: tongrunze <t00574058@china.huawei.com>	2026-01-05 15:19:54 +08:00
weiguihua2	549be94397	[Bugfix] fix pcp + eplb error (#5561 ) ### What this PR does / why we need it? Fix bugs in the PCP overlay feature: 1. Fix the bug related to PCP and EPLB overlap by including the PCP size in the word_size calculation. 2. In the PCP pooling scenario, add a prompt for setting cp_kv_cache_interleave_size. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-05 14:08:11 +08:00
dsxsteven	37fd48bee5	[CI] Move longseq Nightly CI (#5577 ) ### What this PR does / why we need it? move longseq nightly CI to correct path due to #5479 [1/N] Refactor nightly test structure Signed-off-by: daishixun <dsxsteven@sina.com>	2026-01-04 15:42:43 +08:00
dsxsteven	3c7e6c6817	[CI] Add multi-nodes longseq configs of DeepSeek-R1-W8A8 & Qwen3-235B-W8A8 (#5381 ) ### What this PR does / why we need it? add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and longseq (PCP&DCP) scenario - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: daishixun <dsxsteven@sina.com>	2026-01-04 10:38:40 +08:00
Jade Zheng	7d5242faca	[Refactor] Formatting output types related to FuseMoE (#5481 ) Currently in the Fused MoE module, functions of classes like MoECommMethod and MoETokenDispatcher output data in dictionary or tuple format, which hampers code maintainability, readability, and extensibility. This PR introduces dataclasses for these key output types to address these issues. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-31 14:24:37 +08:00
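The refactor described in the FuseMoE commit above replaces dictionary and tuple return values with dataclasses. A minimal sketch of that idea follows; `TokenDispatchResult`, its fields, and the `dispatch` function are hypothetical names chosen for illustration, and plain Python lists stand in for tensors. They are not the types introduced by the PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TokenDispatchResult:
    """Hypothetical typed output for a token dispatcher (illustrative only)."""
    hidden_states: List[List[float]]              # tokens reordered by expert
    group_list: List[int]                         # number of tokens per expert
    dynamic_scale: Optional[List[float]] = None   # optional quantization scale


def dispatch(tokens: List[List[float]], expert_ids: List[int],
             num_experts: int) -> TokenDispatchResult:
    """Group tokens by their assigned expert and count tokens per expert."""
    # sorted() is stable, so tokens keep their relative order within an expert.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    group_list = [0] * num_experts
    for expert in expert_ids:
        group_list[expert] += 1
    return TokenDispatchResult(
        hidden_states=[tokens[i] for i in order],
        group_list=group_list,
    )


result = dispatch([[1.0], [2.0], [3.0]], expert_ids=[1, 0, 1], num_experts=2)
print(result.hidden_states, result.group_list)
```

Compared with `result["group_list"]` or `result[1]`, named and typed fields let callers and type checkers catch a misspelled or reordered field at the point of use.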
Jade Zheng	38570cfeb6	[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 ) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>	2025-12-31 14:24:04 +08:00
Li Wang	2ee17e50a1	[2/N] Upgrade nightly doc (#5534 ) ### What this PR does / why we need it? Follow up https://github.com/vllm-project/vllm-ascend/pull/5479, upgrade the corresponding doc for developers - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-31 09:11:42 +08:00
Li Wang	e760aae1df	[1/N] Refactor nightly test structure (#5479 ) ### What this PR does / why we need it? This patch is a series of refactoring actions, including clarifying the directory structure of nightly tests, refactoring the config retrieval logic, and optimizing the workflow, etc. This is the first step: refactoring the directory structure of nightly to make it more readable and logical. - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-30 19:03:02 +08:00
zzzzwwjj	71f729a661	Revert "moe_gating_top_k" (#5512 ) Reverts vllm-project/vllm-ascend#5271 It breaks e2e test - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1`	2025-12-30 15:05:47 +08:00
ZCG12345	45c3c279e2	moe_gating_top_k (#5271 ) 1. What this PR does / why we need it? This PR supports the moe_gating_top_k operator, which enables post-positioned renormalization (renorm) on the basis of softmax. 2. Does this PR introduce any user-facing change? No user-facing changes are required. 3. How was this patch tested? This patch was tested with the test_npu_moe_gating_top_k test case. vLLM version: release/v0.13.0 vLLM main: `ad32e3e19c` --------- Signed-off-by: ZCG12345 <2097562023@qq.com> Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>	2025-12-30 09:28:01 +08:00
jiazhengyi	d5f72835e6	[OP] add custom op aclnnMoeInitRoutingCustom (#5251 ) ### What this PR does / why we need it? This pull request introduces a new custom operator `aclnnMoeInitRoutingCustom` for Mixture-of-Experts models. It can be replaced by `aclnnMoeInitRoutingV3` once CANN 8.5 becomes available. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? --------- Signed-off-by: jiazhengyi <jiazhengyi@huawei.com> Signed-off-by: Chenxi Qian <chenxi.qian.cq@outlook.com> Co-authored-by: jiazhengyi <jiazhengyi@huawei.com> Co-authored-by: Chenxi Qian <chenxi.qian.cq@outlook.com>	2025-12-29 19:29:40 +08:00
Li Wang	1d81bfaed1	Fix nightly (#5413 ) ### What this PR does / why we need it? This patch mainly does the following: 1. Fixes a bug in the multi_node_tests logs: log names must be unique when uploading logs. 2. Optimizes the `get_cluster_ips` logic and increases the maximum number of retries for robustness. 3. Temporarily drops the existing gh-proxy until it is stable enough. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-27 18:16:46 +08:00
Nengjun Ma	f5af6bbd1e	[CI] Add qwen-235b-a22b a2 multi-node test (#5393 ) ### What this PR does / why we need it? Qwen3-235B-A22B is one of the TopN models, but test coverage for Qwen3-235B-A22B on Atlas A2 is currently lacking, and most of the machines owned by community users are A2. When users run into problems, we currently have no way of knowing whether the model runs normally on the corresponding version of the code, so this PR adds the test. In addition, the tests currently include TopN models such as qwen-dense, qwen3-30b-a3b, Qwen3-Next, and Qwen2.5-Omni, but Qwen3-235B-A22B is missing. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Tested on multiple nodes, with the following results: 1. Accuracy test (time to execute this test case: 25 minutes): the test ran successfully, with the following accuracy: ``` dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 95.68 ``` 2. 
Perf test (time to execute this test case: 1 h 15 min): the test ran successfully, with the following throughput (measured on Atlas A3; for A2, expect roughly the A3 result divided by 1.3): ``` ╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤══════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪══════╡ │ E2EL │ total │ 384086.3958 ms │ 214767.0486 ms │ 528014.771 ms │ 387621.5746 ms │ 388776.7492 ms │ 390164.3559 ms │ 488105.8512 ms │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ TTFT │ total │ 159409.9868 ms │ 1849.4588 ms │ 302439.6965 ms │ 162183.7007 ms │ 162965.477 ms │ 164274.1936 ms │ 262578.6041 ms │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ TPOT │ total │ 149.8842 ms │ 130.2175 ms │ 151.2625 ms │ 150.473 ms │ 150.6978 ms │ 150.9102 ms │ 151.2131 ms │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ ITL │ total │ 149.6789 ms │ 0.0099 ms │ 283.0242 ms │ 150.3276 ms │ 156.8649 ms │ 168.1372 ms │ 199.378 ms │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ InputTokens │ total │ 3654.3079 │ 3108.0 │ 4280.0 │ 3629.0 │ 3728.0 │ 3842.1 │ 4079.0 │ 2800 │ 
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ OutputTokens │ total │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 2800 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤ │ OutputTokenThroughput │ total │ 3.935 token/s │ 2.8408 token/s │ 6.9843 token/s │ 3.8698 token/s │ 3.8799 token/s │ 3.9916 token/s │ 6.2137 token/s │ 2800 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧══════╛ ╒══════════════════════════╤═════════╤═══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪═══════════════════╡ │ Benchmark Duration │ total │ 4391524.3389 ms │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Requests │ total │ 2800 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Success Requests │ total │ 2800 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Concurrency │ total │ 244.8903 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Max Concurrency │ total │ 256 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Request Throughput │ total │ 0.6376 req/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Input Tokens │ total │ 10232062 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Prefill Token Throughput │ total │ 22.924 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total generated tokens │ total │ 4200000 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Input Token Throughput │ total │ 2329.9568 token/s │ 
├──────────────────────────┼─────────┼───────────────────┤ │ Output Token Throughput │ total │ 956.3877 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Token Throughput │ total │ 3286.3445 token/s │ ╘══════════════════════════╧═════════╧═══════════════════╛ ``` - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-12-26 23:46:09 +08:00
jiangyunfan1	48854aef5c	[TEST]Add sending request with and without chat (#5286 ) ### What this PR does / why we need it? This PR adds methods for sending chat and non-chat requests, which are needed to test many of the following cases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-12-26 18:04:17 +08:00
Zhu Yi Lin	18302c8467	Revert "Add MagicMTP(block verify) and Triton optimization (#4443 )" (#5380 ) ### What this PR does / why we need it? #4443 introduces a precision issue in scenarios where MTP >= 3 + deepseek v3.1, and this pr reverts it - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: GDzhu01 <809721801@qq.com>	2025-12-26 15:06:13 +08:00
wangxiyuan	29d2fe653d	cleanup ascend config (#5296 ) 1. refresh the additional config doc 2. move the kv config logic to platform 3. improve the `dump_config` init logic and rename it to `dump_config_path`. This change affects users: `dump_config` changes from a dict to a string. 4. correct the `enable_async_exponential` type 5. remove the unused `chunked_prefill_for_mla` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-26 14:07:37 +08:00
ZT-AIA	adaa89a7a5	Update vllm pin to 12.25 (#5342 ) ### What this PR does / why we need it? - Fix the vLLM breakage introduced by the PR: 1. [Drop v0.14 deprecations] https://github.com/vllm-project/vllm/pull/31285 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: ZT-AIA <1028681969@qq.com>	2025-12-26 14:05:40 +08:00
Li Wang	c2f776b846	[Nightly] Initial logging for nightly multi-node testing (#5362 ) ### What this PR does / why we need it? Currently, our multi-node logs only show the master node's logs (via the Kubernetes API), which is insufficient for effective problem localization if other nodes experience issues. Therefore, this pull request adds the ability to upload logs for other nodes. Next plan: Output structured directory logs, including logs from each node and the polog. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-26 11:39:07 +08:00
Qi Mao	7372225bcb	[FIX] Update _causal_conv1d_update_kernel for Efficient Conv State Handling on NPU (#5322 ) Description: This PR updates the implementation of the Triton operator for deployment on NPU devices, focusing on optimizing grid size and memory handling based on NPU limitations. Design Plan: Grid Calculation: The grid size is now dynamically calculated by batch and dim to ensure that the number of programs executed does not exceed the NPU's vector core capacity. This ensures optimal parallelism without overloading the hardware. Data Block Handling: Due to the limited on-chip memory (UB) on Ascend NPUs, this implementation splits large data into smaller chunks of 32k or less per block. The kernel performs a for-loop to process the data in these smaller chunks, minimizing memory usage and avoiding potential overflows. Changes Compared to GPU Implementation: Grid and Block Sizing: For GPU, the grid and block size were determined based on available thread counts and memory size. In contrast, the NPU version dynamically adjusts these parameters using B_TILE and BLOCK_N to optimize for NPU’s architecture. Memory Chunking: The original GPU implementation did not require chunking due to the higher available memory and processing capacity. For the NPU, data is divided into smaller chunks (32k or smaller) to comply with memory constraints on the device. The kernel has been modified to handle this chunking mechanism inside a loop. Optimized Thread Usage: The NPU implementation takes into account the hardware-specific thread limit (24 threads per vector core), ensuring that the number of active programs is aligned with the NPU's vector core count, avoiding over-subscription that would lead to serial processing. This PR ensures that the operator functions efficiently on Ascend NPU, considering hardware limitations while maintaining the same functionality and input parameters as the GPU implementation. 
- vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: maoxx241 <maomaoyu870@gmail.com>	2025-12-26 09:12:30 +08:00
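The grid sizing and chunking described in the commit above can be sketched on the host side in plain Python. This is only an illustration under stated assumptions, not the PR's code: `plan_grid` and `chunk_ranges` are hypothetical helpers, `num_vector_cores` is a caller-supplied value, and "32k" is taken here to mean 32 * 1024 elements (the commit message does not say whether it counts elements or bytes).

```python
import math


def chunk_ranges(total, max_chunk=32 * 1024):
    """Yield (start, stop) ranges covering `total` items, each at most
    `max_chunk` long, mirroring the in-kernel for-loop over small chunks."""
    for start in range(0, total, max_chunk):
        yield start, min(start + max_chunk, total)


def plan_grid(batch, dim, block_n, num_vector_cores):
    """Choose B_TILE / BLOCK_N so the 2-D grid has at most
    `num_vector_cores` programs (hypothetical sizing rule)."""
    # Grow BLOCK_N if needed so the dim axis alone fits on the cores.
    block_n = max(block_n, math.ceil(dim / num_vector_cores))
    dim_programs = math.ceil(dim / block_n)
    # Remaining parallelism goes to the batch axis.
    batch_slots = num_vector_cores // dim_programs
    b_tile = math.ceil(batch / batch_slots)
    batch_programs = math.ceil(batch / b_tile)
    return b_tile, block_n, (batch_programs, dim_programs)


b_tile, block_n, grid = plan_grid(batch=100, dim=4096, block_n=1024,
                                  num_vector_cores=40)
chunks = list(chunk_ranges(70000))
```

Keeping the program count at or below the core count is what avoids the over-subscription mentioned in the commit message; the chunk loop then bounds how much data each program holds at once.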
Aoxuan Chen	8caad0510d	fix e2e rejection-sampler error (#5341 ) ### What this PR does / why we need it? Fixed the error in the CI process for vllm-ascend/tests/e2e/nightly/ops/triton/test_rejection_sampler.py Error: test_rejection_sampler_block_verify_triton_kernel: duplicate parametrization of 'vocab_size'. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: chenaoxuan <cax1165@163.com>	2025-12-25 11:39:38 +08:00
wangxiyuan	2ae0bad96d	Remove VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE (#5272 ) `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with `VLLM_ASCEND_ENABLE_PREFETCH_MLP`, which makes it completely useless on its own. This PR removes it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-25 11:09:56 +08:00
Aoxuan Chen	6d25372baa	Add MagicMTP(block verify) and Triton optimization (#4443 ) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. The rejection sampling logic in rejection_sampler.py was restructured using Triton-Ascend, enabling it to operate under high concurrency, thus resolving CPU and NPU operator bottlenecks and enhancing throughput. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: chenaoxuan <cax1165@163.com>	2025-12-25 09:00:25 +08:00
Ascendyh	a90482803d	[Kernel] add l2norm triton kernel (#4595 ) ### What this PR does / why we need it? This pull request introduces an L2 normalization kernel implemented in Triton, specifically optimized for Ascend NPUs. ### Does this PR introduce _any_ user-facing change? No, this PR does not introduce any user-facing changes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-25 06:06:18 +08:00
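For reference, L2 normalization of a vector divides each element by the vector's Euclidean norm. The pure-Python sketch below shows the arithmetic only; the function name `l2norm`, the `eps` default, and placing `eps` inside the square root are assumptions for illustration and may not match the Triton kernel added in this commit.

```python
import math


def l2norm(x, eps=1e-6):
    """Return x scaled to unit L2 norm.

    Assumption: eps is added to the sum of squares before the square
    root to avoid division by zero; the real kernel may differ.
    """
    inv_norm = 1.0 / math.sqrt(sum(v * v for v in x) + eps)
    return [v * inv_norm for v in x]


unit = l2norm([3.0, 4.0], eps=0.0)
```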
Wang Kunpeng	c3a8d13ca7	[refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204 ) ### What this PR does / why we need it? Remove unnecessary attributes from set_ascend_forward_context 1.prefetch_stream 2.weight_prefetch_method ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-23 08:49:52 +08:00
jiangyunfan1	3ba920a65b	[TEST]Update mm param --mm-processor-cache-gb (#5242 ) ### What this PR does / why we need it? This PR updates the mm param --mm-processor-cache-gb, which is needed to run the test case. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2025-12-22 18:54:03 +08:00
wangqiankun13	118b0ed346	[Feature] Add token mask for DispatchGmmCombineDecode operator (#5171 ) ### What this PR does / why we need it? In this PR, DispatchGmmCombineDecode adds an optional input x_active_mask; when it is provided, only tokens whose mask is True are dispatched and handled. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2025-12-19 16:31:48 +08:00
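The effect of an optional active-token mask can be pictured with a small pure-Python sketch. It illustrates only the selection step: `select_active_tokens` is a hypothetical helper, and treating a missing mask as "all tokens active" is an assumption, not the operator's documented behavior.

```python
def select_active_tokens(tokens, x_active_mask=None):
    """Return (indices, selected_tokens) for tokens whose mask is True.

    Assumption: when no mask is given, every token is treated as active.
    """
    if x_active_mask is None:
        return list(range(len(tokens))), list(tokens)
    if len(x_active_mask) != len(tokens):
        raise ValueError("mask length must match the number of tokens")
    indices = [i for i, active in enumerate(x_active_mask) if active]
    return indices, [tokens[i] for i in indices]


indices, active = select_active_tokens(["t0", "t1", "t2"], [True, False, True])
```

Keeping the original indices alongside the selected tokens is what lets a later combine step place results back at the right positions.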

121 Commits