xc-llm-ascend

Author	SHA1	Message	Date
InSec	595b57c4d4	[CI][BugFix] Qwen3-Next nightly test fix. (#6247 ) ### What this PR does / why we need it? Qwen3-Next nightly test fix. Temporarily avoid the accuracy issue in the full graph mode. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: InSec <1790766300@qq.com>	2026-01-26 19:53:53 +08:00
yjmyl	e90b14140b	[feature] add_rms_norm support bias (#5790 ) ### What this PR does / why we need it? This PR replaces addRmsNorm and Add with addRmsNormBias, which gives a more efficient result. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Full test pass - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com> Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>	2026-01-23 21:09:54 +08:00
starmountain1997	6c73b88dd6	[CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115 ) ### What this PR does / why we need it? This PR enables the FLASHCOMM1 communication optimization with layer sharding for DeepSeek-V3.2 W8A8 model testing to validate PR #5702. The changes include: 1. Enable FLASHCOMM1: Set VLLM_ASCEND_ENABLE_FLASHCOMM1=1 to improve performance for distributed inference 2. Add layer sharding: Configure layer_sharding: ["q_b_proj", "o_proj"] 3. Update baselines: Adjust performance baselines to reflect the improvements from FLASHCOMM1 and layer sharding ### Does this PR introduce _any_ user-facing change? No. This is a CI/test-only change that enables new communication optimization features for testing purposes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-01-23 19:48:37 +08:00
Nengjun Ma	ab676413e6	Default enable MLAPO (#5952 ) ### What this PR does / why we need it? 1) Enable MLAPO by default for DeepSeek MLA Attention W8A8 models on the D instance in PD disaggregation, for example DeepSeekV3-W8A8 and DeepSeek-R1-W8A8. 2) Enable MLAPO by default for DeepSeek SFA Attention W8A8 models, currently DeepSeek-V3.2-W8A8. ### Does this PR introduce _any_ user-facing change? Users no longer need to manually set VLLM_ASCEND_ENABLE_MLAPO=1 to enable the MLAPO feature for DeepSeek W8A8 models. Effect of enabling MLAPO for an SFA model deployed on a single A3 node: tested with tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py, dataset: gsm8k-lite, without MTP, FULL GRAPH; output token throughput improves by about 19%. When MLAPO is not enabled by default: ├─────────────────────────┤ │ TTFT │ 14055.8836 ms │ ├─────────────────────────┤ │ ITL │ 66.8171 ms │ ├─────────────────────────┤ │ Output Token Throughput │ 104.9105 token/s │ ├─────────────────────────┤ When MLAPO is enabled by default: ├─────────────────────────┤ │ TTFT │ 3753.1547 ms │ ├─────────────────────────┤ │ ITL │ 61.4236 ms │ ├─────────────────────────┤ │ Output Token Throughput │ 125.2075 token/s │ ├─────────────────────────┤ - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-22 09:26:39 +08:00
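The "about 19%" figure in the MLAPO commit above can be checked against the three metrics it reports. The following is a minimal sketch of that arithmetic; the helper name `relative_change` is hypothetical and not part of the repository, and the numbers are copied from the commit message.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Values reported in the commit message (MLAPO off -> MLAPO on by default).
ttft_change = relative_change(14055.8836, 3753.1547)      # time to first token, ms
itl_change = relative_change(66.8171, 61.4236)            # inter-token latency, ms
throughput_change = relative_change(104.9105, 125.2075)   # output tokens per second

print(f"TTFT: {ttft_change:.1f}%")
print(f"ITL: {itl_change:.1f}%")
print(f"Output token throughput: {throughput_change:.1f}%")
```

Negative values mean the latency dropped; only the throughput change lands near the 19% quoted in the message.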
Li Wang	839e03cbc9	[Nightly] Use Qwen repo for qwen3-next (#6064 ) ### What this PR does / why we need it? Use Qwen repo for qwen3-next to make nightly test happy. see https://github.com/vllm-project/vllm-ascend/actions/runs/21179025996/job/60915871441 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-21 10:39:12 +08:00
guanguan0308	1ed9524763	add dispath_ffn_combine_bf16 (#5866 ) ### What this PR does / why we need it? add dispath_ffn_combine_bf16 - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: guanguan0308 <1546542263@qq.com>	2026-01-21 09:30:30 +08:00
zhangxinyuehfad	750c06c78a	[CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#4633 ) ### What this PR does / why we need it? Add DeepSeek-V3.2-W8A8 nightly CI test: DeepSeek-V3.2-W8A8, 1 node, DP2+TP8: tests/e2e/nightly/models/test_deepseek_v3_2_w8a8.py ### Does this PR introduce _any_ user-facing change? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-20 21:05:15 +08:00
shiyuan680	cea48c2a34	model runner v2 support triton of penalty (#5854 ) ### What this PR does / why we need it? Optimize operator performance and add unit tests. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested with qwen2.5 7b vl; operator time improved by 90%. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` This PR is for https://github.com/vllm-project/vllm-ascend/issues/5208 Signed-off-by: shiyuan680 <917935075@qq.com>	2026-01-20 12:26:05 +00:00
Icey	402872050a	[Tests] move qwen3 performance test from nightly to e2e (#5980 ) ### What this PR does / why we need it? Move the qwen3 performance test from nightly to e2e to intercept performance degradation. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-20 17:08:43 +08:00
wangqiankun13	ebb940691f	[Feature] Adapt DispatchGmmCombineDecode operator to align with weight scale dtype of small operators. [RFC: issue 5476] (#5755 ) ### What this PR does / why we need it? [Feature] Adapt the DispatchGmmCombineDecode operator to align with the weight scale dtype of small operators. - Before: weight scale must be float32 - After: weight scale can be float32/float16 when x is float16, float32/bfloat16 when x is float32/bfloat16. The w1 scale can also use a different dtype from the w2 scale. For more information about this operator, refer to RFC issue https://github.com/vllm-project/vllm-ascend/issues/5476 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Perf > When scale is of type fp16 or bf16, it will be cast to fp32 internally within the operator, while the subsequent computations remain unchanged. Therefore, this PR will introduce an additional cast operation but halve the memory copy operations for scale. Furthermore, since the scale data is only a few KB in size and participates in relatively few computations, its impact is almost negligible compared to major operations like matrix multiplication. Thus, the theoretical performance change should be minimal. Tested single-operator cases from qwen3-235b: - single A3 node (ep16), 64 moe experts, 4 experts / die (like qwen3-235b ep32) - batch=18/32, token_hidden_size 4096, moe_intermediate_size 1536 The test was conducted for 100 rounds, and the average of the last 95 rounds was taken. \| \| bs18(us)\| bs32(us)\| \| -----\| -----\| -----\| \|Without this PR\|96.28\|108.83\| \|With this PR\|96.06\|107.90\| Note: Single-operator benchmarks represent an ideal scenario. They are usually only useful for referencing relative changes and may not fully align with performance data observed within the full model. 
#### Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-19 16:10:43 +08:00
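The dtype rule stated in the commit above (which weight scale dtypes are accepted for each dtype of x, with the w1 and w2 scales chosen independently) can be written as a small lookup. This is only an illustrative sketch: the table transcribes the commit message literally, dtypes are represented as plain strings, and `scales_are_supported` is a hypothetical helper, not the operator's real validation code.

```python
# Allowed weight scale dtypes per activation (x) dtype, transcribed from the
# commit message as written. Assumption: dtypes are modelled as strings here.
ALLOWED_SCALE_DTYPES = {
    "float16": {"float32", "float16"},
    "bfloat16": {"float32", "bfloat16"},
    "float32": {"float32", "bfloat16"},
}


def scales_are_supported(x_dtype: str, w1_scale_dtype: str, w2_scale_dtype: str) -> bool:
    """Check each scale independently, since w1 and w2 scales may differ."""
    allowed = ALLOWED_SCALE_DTYPES.get(x_dtype, set())
    return w1_scale_dtype in allowed and w2_scale_dtype in allowed


# Mixed scale dtypes are fine as long as each one is allowed for x.
print(scales_are_supported("float16", "float32", "float16"))
# A bfloat16 scale is not listed for float16 activations.
print(scales_are_supported("float16", "bfloat16", "float32"))
```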
zhangxinyuehfad	372f979aa5	[CI] Add DeepSeek R1 W8A8 HMB nightly ci (#5874 ) ### What this PR does / why we need it? Add DeepSeek R1 W8A8 HMB nightly ci - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 20:48:20 +08:00
LI SHENGYONG	da958ee386	[EPLB]Eplb Config Renaming (#5533 ) ### What this PR does / why we need it? 1. Rename num_iterations_eplb_update to expert_heat_collection_interval. 2. Rename num_wait_worker_iterations to algorithm_execution_interval. 3. Rename init_redundancy_expert to num_redundant_experts because the variable with the same meaning in vLLM is named this way. 4. Delete gate_eplb because we don't need this feature. 5. Move the eplb config into a dict in the additional config. 6. Depends on pr5817 ### Does this PR introduce _any_ user-facing change? Before this PR: `--additional-config '{"dynamic_eplb":true, "num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150, "init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'` After this PR: `--additional-config '{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000, "algorithm_execution_interval":150,"num_redundant_experts": 16, "expert_map_path": "xxx.json"}}'` ### How was this patch tested? #### test qwen3-235b eplb num_redundant_experts=16 without pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| with pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-15 10:26:44 +08:00
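The renaming in the EPLB commit above amounts to a mechanical rewrite of the `--additional-config` JSON. The sketch below shows one way to convert an old-style dict into the new nested form; `migrate_eplb_config` and the three key tables are hypothetical names for illustration, not code from the repository, and the key lists cover only what the commit message mentions.

```python
import json

# Old key -> new key, as listed in the commit message.
RENAMED_KEYS = {
    "num_iterations_eplb_update": "expert_heat_collection_interval",
    "num_wait_worker_iterations": "algorithm_execution_interval",
    "init_redundancy_expert": "num_redundant_experts",
}
# EPLB keys that keep their name but move under "eplb_config".
UNCHANGED_KEYS = ("dynamic_eplb", "expert_map_path")
# Keys the commit deletes outright.
REMOVED_KEYS = ("gate_eplb",)


def migrate_eplb_config(old_config: dict) -> dict:
    """Move EPLB options into a nested "eplb_config" dict and rename them."""
    new_config = {}
    eplb_config = {}
    for key, value in old_config.items():
        if key in REMOVED_KEYS:
            continue
        if key in RENAMED_KEYS:
            eplb_config[RENAMED_KEYS[key]] = value
        elif key in UNCHANGED_KEYS:
            eplb_config[key] = value
        else:
            # Anything unrelated to EPLB stays at the top level.
            new_config[key] = value
    if eplb_config:
        new_config["eplb_config"] = eplb_config
    return new_config


old = {
    "dynamic_eplb": True,
    "num_iterations_eplb_update": 4000,
    "num_wait_worker_iterations": 150,
    "init_redundancy_expert": 16,
    "expert_map_path": "xxx.json",
}
new = migrate_eplb_config(old)
print(json.dumps(new))
```

The printed JSON can be passed as the value of `--additional-config`, matching the "after" form shown in the commit.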
shiyuan680	7af3b880c1	support triton of mrope (#5664 ) ### What this PR does / why we need it? This PR adds support for a Triton mrope implementation, like cuda_forward, whose performance is equal to the AscendC ops. These Triton ops require CANN 8.5.0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested on qwen3-vl-235b, textvqa accuracy: native 81.82, npu triton 81.58, cuda triton 81.52 - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shiyuan680 <917935075@qq.com>	2026-01-13 09:13:51 +08:00
SILONG ZENG	7a6fde80b1	[CI]Add Kimi k2 nightly test (#5682 ) ### What this PR does / why we need it? This PR adds performance and accuracy tests for the Kimi-K2-Instruct-W8A8 and Kimi-K2-Thinking models to the Nightly test suite. #### Test Configuration Kimi-K2-Instruct-W8A8 - Model: vllm-ascend/Kimi-K2-Instruct-W8A8 - Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node) - Architecture: Unified Distributed Inference - Parallelism: DP4 + TP8 + EP (Data Parallel 4, Tensor Parallel 8, Expert Parallel enabled). - Optimization: torchair graph, no-prefix-caching. - Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8. - Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8. - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs2800. - Accuracy: vllm-ascend/gsm8k-lite. Kimi-K2-Thinking - Model: moonshotai/Kimi-K2-Thinking - Hardware: A3, 1 Node (16 NPUs total) - Architecture: Single Node Distributed Inference - Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled). - Optimization: no-prefix-caching - Benchmarks: - Performance: vllm-ascend/GSM8K-in3500-bs400. - Accuracy: vllm-ascend/gsm8k-lite. ### Does this PR introduce _any_ user-facing change? Yes. This PR enhances the ```AisbenchRunner``` to support dynamic configuration of the ```trust_remote_code``` flag. This allows the AISBench client to successfully load tokenizers for models that require custom code execution (e.g., Kimi-K2-Thinking and Kimi-K2-Instruct-W8A8). Changes: 1. ```AisbenchRunner.__init__```: Added the ability to capture the ```trust_remote_code``` parameter from the case configuration. ``` python self.batch_size = aisbench_config["batch_size"] self.request_rate = aisbench_config.get("request_rate", 0) + self.trust_remote_code = aisbench_config.get("trust_remote_code", False) self.temperature = aisbench_config.get("temperature") self.top_k = aisbench_config.get("top_k") ``` 2. 
```AisbenchRunner._init_request_conf```: Added regex substitution to inject the parameter into the generated dynamic configuration file. ``` python content = re.sub(r'batch_size.*', f'batch_size = {self.batch_size},', content) + content = re.sub(r'trust_remote_code=.*', + f'trust_remote_code={self.trust_remote_code},', + content) content = content.replace("top_k", "#top_k") content = content.replace("seed", "#seed") ``` Details: - New Config Key: Users can add ```"trust_remote_code": True``` to any dictionary within the ```aisbench_cases``` list. - Default Value: Defaults to ```False``` to maintain existing security protocols for standard models. - Impact: Resolves ```ValueError``` when benchmarking reasoning models or models with custom tokenizers that previously failed during the AISBench local initialization phase. User Example: Users can now enable custom code execution for specific models (like Kimi-K2-Thinking) directly in their test suite: ``` # Now supported in test scripts: aisbench_cases = [{ "case_type": "performance", "request_conf": "vllm_api_stream_chat", "trust_remote_code": True, # New user-facing parameter ... }] ``` ### How was this patch tested? Actions: - https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433 Results are as follows: - Kimi-K2-Instruct-W8A8(25m25s) 1. Accuracy test ``` dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 96.88 ``` 2. 
Perf test ``` ╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡ │ E2EL │ total │ 34571.489 ms │ 28657.8054 ms │ 36294.1788 ms │ 34714.7329 ms │ 35247.2724 ms │ 35526.6758 ms │ 36146.4314 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TTFT │ total │ 2043.9136 ms │ 627.4718 ms │ 3532.3978 ms │ 1906.0194 ms │ 2307.7979 ms │ 2883.8528 ms │ 3283.7012 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TPOT │ total │ 127.5591 ms │ 106.4937 ms │ 137.107 ms │ 128.3135 ms │ 129.5704 ms │ 131.1332 ms │ 134.1087 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ ITL │ total │ 126.5571 ms │ 0.0095 ms │ 1340.783 ms │ 104.1398 ms │ 110.1272 ms │ 119.6124 ms │ 950.2924 ms │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ InputTokens │ total │ 3516.6055 │ 3014.0 │ 3985.0 │ 3525.0 │ 3525.0 │ 3586.8 │ 3800.67 │ 512 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 512 │ 
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokenThroughput │ total │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛ ╒══════════════════════════╤═════════╤═══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪═══════════════════╡ │ Benchmark Duration │ total │ 279430.9375 ms │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Requests │ total │ 512 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Success Requests │ total │ 512 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Concurrency │ total │ 63.3452 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Max Concurrency │ total │ 64 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Request Throughput │ total │ 1.8323 req/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Input Tokens │ total │ 1800502 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Prefill Token Throughput │ total │ 1720.5255 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total generated tokens │ total │ 131072 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Input Token Throughput │ total │ 6443.4598 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Output Token Throughput │ total │ 469.0676 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Token Throughput │ total │ 6912.5274 token/s │ ╘══════════════════════════╧═════════╧═══════════════════╛ ``` - 
Kimi-K2-Thinking(43m51s) 1. Accuracy test ``` dataset version metric mode vllm-api-general-chat --------- --------- -------- ------ ----------------------- gsm8k 7cd45e accuracy gen 100.00 ``` 2. Perf test ``` ╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡ │ E2EL │ total │ 172384.3573 ms │ 34456.5517 ms │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms │ 204428.9502 ms │ 205468.6776 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TTFT │ total │ 138740.3228 ms │ 655.1066 ms │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TPOT │ total │ 131.9374 ms │ 90.6331 ms │ 135.4144 ms │ 132.405 ms │ 132.948 ms │ 133.7549 ms │ 135.2543 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ ITL │ total │ 130.9028 ms │ 0.0099 ms │ 960.3683 ms │ 116.9623 ms │ 122.3127 ms │ 132.0522 ms │ 886.4662 ms │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ InputTokens │ total │ 3514.575 │ 3014.0 │ 3843.0 │ 3525.0 │ 3525.0 │ 3588.0 │ 3801.08 │ 400 │ 
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 400 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokenThroughput │ total │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s │ 400 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛ ╒══════════════════════════╤═════════╤═══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪═══════════════════╡ │ Benchmark Duration │ total │ 1166795.568 ms │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Requests │ total │ 400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Success Requests │ total │ 400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Concurrency │ total │ 59.0967 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Max Concurrency │ total │ 64 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Request Throughput │ total │ 0.3428 req/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Input Tokens │ total │ 1405830 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Prefill Token Throughput │ total │ 25.332 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total generated tokens │ total │ 102400 │ ├──────────────────────────┼─────────┼───────────────────┤ │ Input Token Throughput │ total │ 1204.864 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ 
Output Token Throughput │ total │ 87.7617 token/s │ ├──────────────────────────┼─────────┼───────────────────┤ │ Total Token Throughput │ total │ 1292.6258 token/s │ ╘══════════════════════════╧═════════╧═══════════════════╛ ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-12 15:56:07 +08:00
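The regex injection described in the Kimi K2 commit above can be tried in isolation. This is a minimal sketch, assuming the pattern is meant to consume the rest of the line (`.*`) and that the generated config contains a line of the form `trust_remote_code=<value>,`; the function name and the template text are made up for illustration and are not the actual AISBench file.

```python
import re


def inject_trust_remote_code(content: str, trust_remote_code: bool) -> str:
    """Rewrite every `trust_remote_code=...` line with the requested value.

    `.` does not match a newline by default, so `.*` stops at the end of
    the line and leaves the following lines untouched.
    """
    return re.sub(
        r"trust_remote_code=.*",
        f"trust_remote_code={trust_remote_code},",
        content,
    )


# Hypothetical fragment of a generated request config.
template = (
    "    batch_size = 1,\n"
    "    trust_remote_code=False,\n"
    "    temperature=0.5,\n"
)

patched = inject_trust_remote_code(template, True)
print(patched)
```

Only the `trust_remote_code` line changes; the neighbouring `batch_size` and `temperature` lines are preserved as they were.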
1092626063	f63c1341d9	[Feature] GLM4.6 support mtp with fullgraph (#5460 ) ### What this PR does / why we need it? Support MTP with full graph for GLM4.6 to improve performance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ` export HCCL_BUFFSIZE=1024 export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_OP_EXPANSION_MODE=AIV vllm serve /weight/glm4.6_w8a8_with_float_mtp \ --data-parallel-size 1 \ --tensor-parallel-size 16 \ --seed 1024 \ --served-model-name glm \ --max-model-len 35000 \ --max-num-batched-tokens 16384 \ --max-num-seqs 16 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ --speculative-config '{"num_speculative_tokens": 1, "model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \ --compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --async-scheduling ` Test case: ` vllm bench serve \ --backend vllm \ --dataset-name prefix_repetition \ --prefix-repetition-prefix-len 22400 \ --prefix-repetition-suffix-len 9600 \ --prefix-repetition-output-len 1024 \ --num-prompts 1 \ --prefix-repetition-num-prefixes 1 \ --ignore-eos \ --model glm \ --tokenizer /weight/glm4.6_w8a8_with_float_mtp \ --seed 1000 \ --host 0.0.0.0 \ --port 8000 \ --endpoint /v1/completions \ --max-concurrency 1 \ --request-rate 1 ` - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: 1092626063 <1092626063@qq.com>	2026-01-09 16:07:42 +08:00
lhchg	dc99cfdc15	[CustomOp] support TensorList for dispatchFFNCombine (#5665 ) ### What this PR does / why we need it? Support TensorList in dispatch_ffn_combine so that it can work with EPLB. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Single-operator testing - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lhchg <lhao_cheng@163.com> Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>	2026-01-09 15:56:29 +08:00
InSec	2d713fee93	[CI] Accuracy issue of qwen3-next-w8a8 nightly test fix. (#5746 ) ### What this PR does / why we need it? Disable Full Graph mode to temporarily avoid the accuracy issue for Qwen3-Next-80B-A3B-Instruct-W8A8. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: InSec <1790766300@qq.com>	2026-01-09 15:55:13 +08:00
LeeWenquan	a3a74d6984	[CI] Add qwen3 next ci (#5395 ) ### What this PR does / why we need it? Add Qwen3Next CI ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-01-09 10:29:09 +08:00
Li Wang	595d3484c4	[Nightly] Move ops to the correct path (#5642 ) ### What this PR does / why we need it? Move ops to the correct path where they belong - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-09 09:23:36 +08:00
Aoxuan Chen	8763953f56	[Feature] add the magicmtp speculative decoding acceleration algorithm (#5542 ) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. Added Triton and PyTorch implementations, and added E2E test cases. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: chenaoxuan <cax1165@163.com>	2026-01-08 09:15:55 +08:00
ZCG12345	3be8e33fe9	[Kernel] Add moe_gating_top_k operator support for Ascend NPU (#5579 ) ### What this PR does / why we need it? 1. Replace moe_gating_top_k from torch_npu with a custom op. 2. Enable the renorm function of moe_gating_top_k in the softmax scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No testing needed - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: ZCG12345 <2097562023@qq.com>	2026-01-07 21:42:31 +08:00
Icey	137f28341d	[Tests] Add qwen3-8b nightly test (#5597 ) ### What this PR does / why we need it? Add qwen3-8b nightly test - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-07 18:42:05 +08:00
wangxiyuan	6f7a81cd9f	[CI] cleanup single/multi-card test (#5623 ) 1. speed up e2e light test. 2. create `2-cards` and `4-cards` folder in multicard 3. move ops to nightly 4. run test in Alphabetical Order - vLLM version: v0.13.0 - vLLM main: `8be6432bda` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 14:13:34 +08:00
wangyibo1005	25baf6df09	[Feature]EPLB:Adapt DispatchGmmCombineDecode operator to eplb tensor list and expert token numbers (#5552 ) #### What this PR does / why we need it? This PR adapts the DispatchGmmCombineDecode operator to the eplb tensor list and expert token numbers. The operator supports gmm1, gmm2, gmm1Scale and gmm2Scale in list format. It also supports counting how many tokens each local expert receives via expertTokensNum. - vLLM version: v0.13.0 - vLLM main: `7157596103` For more information about this operator, refer to RFC issue https://github.com/vllm-project/vllm-ascend/issues/5476	2026-01-07 11:23:42 +08:00
InSec	089ca2ddcc	[Nightly][Test] Add Qwen3-Next-80B-A3B-Instruct-W8A8 nightly test (#5616 ) ### What this PR does / why we need it? There was an accuracy issue with Qwen3-Next-80B-A3B-Instruct-W8A8 model in the old version of Triton-Ascend, so, we are now adding one nightly test to maintain it. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: IncSec <1790766300@qq.com>	2026-01-06 17:36:00 +08:00
Li Wang	c5e2f48510	[CI] mv ops to correct path (#5615 ) ### What this PR does / why we need it? mv ops to correct path :`tests/e2e/nightly/single_node/ops/singlecard_ops/triton` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-05 23:17:07 +08:00
Yizhou	755caeb06e	[Feat][Spec] Optimize token index calculation in spec decode with Triton kernel (#5356 ) ### What this PR does / why we need it? Replace multiple PyTorch operations with a fused Triton kernel to determine token indices for sampling during speculative decoding. This reduces kernel launch overhead and memory traffic, improving overall performance on Ascend hardware. --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-05 16:51:29 +08:00
Trunrain	91bf524364	[BugFix][kernel] fix matmul_allreduce_add_rmsnorm_kernel (#5335 ) ### What this PR does / why we need it? Fix matmul_allreduce_add_rmsnorm_kernel and add the hccl Init and SetCcTiling interfaces. The test case uses multicard-4. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? pytest -sv tests/e2e/nightly/ops/test_matmul_allreduce_add_rmsnorm.py multicard-4 pass https://github.com/vllm-project/vllm-ascend/actions/runs/20502630658/job/58914474652?pr=5335 - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: tongrunze <t00574058@china.huawei.com> Co-authored-by: tongrunze <t00574058@china.huawei.com>	2026-01-05 15:19:54 +08:00
Jade Zheng	7d5242faca	[Refactor] Formatting output types related to FuseMoE (#5481 ) Currently in the Fused MoE module, functions of classes like MoECommMethod and MoETokenDispatcher output data in dictionary or tuple format, which hampers code maintainability, readability, and extensibility. This PR introduces dataclasses for these key output types to address these issues. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-12-31 14:24:37 +08:00
Jade Zheng	38570cfeb6	[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 ) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>	2025-12-31 14:24:04 +08:00
Li Wang	e760aae1df	[1/N] Refactor nightly test structure (#5479 ) ### What this PR does / why we need it? This patch is a series of refactoring actions, including clarifying the directory structure of nightly tests, refactoring the config retrieval logic, and optimizing the workflow, etc. This is the first step: refactoring the directory structure of nightly to make it more readable and logical. - vLLM version: v0.13.0 - vLLM main: `5326c89803` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-30 19:03:02 +08:00


81 Commits