### What this PR does / why we need it?
Add New Output for Expert Token Count
An additional output tensor expert_token_nums is added to both operators
to meet the requirement of tracking token distribution among experts:
Tensor Name: expert_token_nums
Dimension: 1D tensor
Shape: (local_expert_num,)
Data Type: int32
Semantics: Represents the number of tokens actually received by each
expert on the current card.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: guanguan0308 <1546542263@qq.com>
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>
# What this PR does / why we need it?
This PR reverts commit 8134146ab6, which
modified the DeepSeek V3.2 (W8A8) single-node nightly test
configuration. as there is no limit between tp_size and MTP.
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
N/A for a revert PR. The changes restore the previously known working
configuration.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
When using Mooncake on Ascend NPU, AscendDirectTransport randomly
allocates ports within range `[20000, 20000 + npu_per_node × 1000)`.
Reference:
[ascend_direct_transport.cpp#L554](https://github.com/kvcache-ai/Mooncake/blob/v0.3.7.post2/mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp#L475)
If `kv_port` overlaps with this range, users may encounter intermittent
startup failures:
```bash
zmq.error.ZMQError: Address already in use (addr='tcp://x.x.x.x:30012')
RuntimeError: KV Cache sending/receiving thread failed to start.
```
This pr fix intermittent kv_port conflict with AscendDirectTransport in
`Qwen3-235B-W8A8-EPLB.yaml`, and add Added `kv_port Configuration Guide`
section in `pd_disaggregation_mooncake_multi_node.md`.
test
Results(tests/e2e/nightly/multi_node/config/Qwen3-235B-W8A8-EPLB.yaml):
https://github.com/vllm-project/vllm-ascend/actions/runs/21540138907/job/62073265259
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
# What this PR does / why we need it?
This PR fixes the single-node nightly test for DeepSeek V3.2 (W8A8)
model to ensure CI stability. The changes include:
1. Simplified nightly test matrix (nightly_test_a3.yaml):
- Temporarily reduced to only run deepseek3_2-w8a8 test case for
debugging
- Changed trigger from schedule/workflow_dispatch to support
push/pull_request for faster iteration
2. Updated DeepSeek V3.2 test configuration
(test_deepseek_v3_2_w8a8.py):
- Adjusted cudagraph_capture_sizes from [3, 6, 9, 12] to [8, 16, 24, 32]
for better performance
- Increased max-num-seqs from 4 to 8
- Increased gpu-memory-utilization from 0.92 to 0.98
- Increased num_speculative_tokens from 2 to 3
3. Added PR checkout step (_e2e_nightly_single_node.yaml):
- Added ability to checkout a specific PR (#6241) for testing
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
Mock nightly test has passed, see
[here](https://github.com/vllm-project/vllm-ascend/actions/runs/21574655952/job/62159656622?pr=6241).
<img width="1053" height="714" alt="a2f2ee359febb13e1f6330b1bd3c116b"
src="https://github.com/user-attachments/assets/3262ad0f-adec-4c71-871f-d9cf2db06fbc"
/>
- vLLM version: v0.14.1
- vLLM main:
d68209402d
---------
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
Fix the **import error** of qwen3-next nightly test.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: InSec <1790766300@qq.com>
### What this PR does / why we need it?
Add e2e test case for apply_top_k_top_p_custom kernel and eliminate
chinese comments.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
pytest passed.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
The precision issue arose because the kv cache of the p-node had not
been fetched for an extended period(>6min) and was forcibly freed. To
avoid this problem, the batch size was reduced and the timeout period
has also been extended.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: dsxsteven <dsxsteven@sina.com>
### What this PR does / why we need it?
Qwen3-Next nightly test fix. Temporarily avoid the accuracy issue in the
**full graph** mode.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
d68209402d
Signed-off-by: InSec <1790766300@qq.com>
### What this PR does / why we need it?
This PR is to replace addRmsNorm and Add With addRmsNormBias. This way
can lead to a more effecient result.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Full Test Pass
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com>
Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>
### What this PR does / why we need it?
This PR enables FLASHCOMM1 communication optimization with layer
sharding for DeepSeek-V3.2 W8A8 model testing to
validate PR #5702. The changes include:
1. Enable FLASHCOMM1: Set VLLM_ASCEND_ENABLE_FLASHCOMM1=1
improves performance for distributed inference
2. Add layer sharding: Configure layer_sharding: ["q_b_proj", "o_proj"]
4. Update baselines: Adjust performance baselines to reflect the
improvements from FLASHCOMM1 and layer sharding
### Does this PR introduce _any_ user-facing change?
No. This is a CI/test-only change that enables new communication
optimization features for testing purposes.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
Add nightly ci test for deepseek v3.1
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Install clang in dokerfile for triton ascend
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
1) Default enable MLAPO for deepseek MLA Attention W8A8 models on PD
disagregation D Instance, for example: DeepSeekV3-W8A8,
DeepSeek-R1-W8A8.
2) Default enable MLAPO for DeepSeek SFA Attention W8A8 models,
currently is DeepSeek-V3.2-W8A8.
### Does this PR introduce _any_ user-facing change?
Don't need use manully to VLLM_ASCEND_ENABLE_MLAPO=1, to enable MLAPO
feature for deepseek w8a8 model
The effect of enabling MLAPO SFA model deployed on a single A3 Node:
Test
with:tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py
dataset: gsm8k-lite,without set MTP, FULL GRAPH, has 19% promote:
未默认开启 MLAPO 时:
├─────────────────────────┤
│ TTFT │ 14055.8836 ms │
├─────────────────────────┤
│ ITL │ 66.8171 ms. │
├─────────────────────────┤
│ Output Token Throughput │ 104.9105 token/s │
├─────────────────────────┤
默认开启 MLAPO 时:
├─────────────────────────┤
│ TTFT │ 3753.1547 ms │
├─────────────────────────┤
│ ITL. │ 61.4236 ms. │
├─────────────────────────┤
│ Output Token Throughput │ 125.2075 token/s│
├─────────────────────────┤
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
update triton ascend version in 3.2.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
add dispath_ffn_combine_bf16
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: guanguan0308 <1546542263@qq.com>
### What this PR does / why we need it?
Add DeepSeek-V3.2-W8A8 nightly ci test:
DeepSeek-V3.2-W8A8 1node DP2+TP8
:tests/e2e/nightly/models/test_deepseek_v3_2_w8a8.py
### Does this PR introduce _any_ user-facing change
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Optimized operator performance and add ut test
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
test in qwen2.5 7b vl, ops time approved 90%
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
this pr is for
# https://github.com/vllm-project/vllm-ascend/issues/5208
Signed-off-by: shiyuan680 <917935075@qq.com>
### What this PR does / why we need it?
Move the qwen3 performance test from nightly to e2e to intercept
performance degradation.
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
#### Documentation Improvements
New Configuration: Added the layer_sharding parameter to the
DeepSeek-V3.2-W8A8 deployment tutorial. This guides users to include
`["q_b_proj", "o_proj"]` in their prefill node setup for better resource
utilization.
#### CI and Testing Updates
Test Config Update: Updated the multi-node E2E test configuration file:
tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml.
including disable `FLASHCOMM` and enable `FULL_DECODE_ONLY` and update
performance baseline.
### Does this PR introduce any user-facing change?
Yes. The documentation now recommends a more optimized startup command
for DeepSeek-V3.2-W8A8. Users following the updated tutorial will see
improved performance in multi-node PD disaggregation environments.
### How was this patch tested?
CI Validation: The updated E2E test configuration has been verified
through the nightly CI pipeline.
Environment: * vLLM version: v0.13.0
Base Commit:
[11b6af5](11b6af5280)
Hardware: Ascend A3/A2 multi-node cluster.
---------
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
[Feature] Adapt DispathGmmCombineDecode opertor to align with weight
scale dtype of small operators.
- **Before**: weight scale must be float32
- **After**: weight scale can be float32/float16 when x is float16,
float32/bfloat16 when x is float32/bfloat16. And w1 scale can use
different dtype with w2 scale.
More info about this operator, please refer to RFC: issue
https://github.com/vllm-project/vllm-ascend/issues/5476
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
#### Perf
> When scale is of type fp16 or bf16, it will be cast to fp32 internally
within the operator, while the subsequent computations remain unchanged.
Therefore, this PR will introduce an additional cast operation but halve
the memory copy operations for scale . Furthermore, since the scale data
is only a few KB in size and participates in relatively few
computations, its impact is almost negligible compared to major
operations like matrix multiplication. Thus, the theoretical performance
change should be minimal.
test single operator cases from qwen3-235b,
- single A3 node(ep16), 64 moe experts, 4 experts / die (like qwen3-235b
ep32)
- batch=18/32, token_hidden_size 4096, moe_intermediate_size 1536
The test was conducted for 100 rounds, and the average of the last 95
rounds was taken.
| | bs18(us)| bs32(us)|
| -----| -----| -----|
|Without this PR|96.28|108.83|
|With this PR|96.06|107.90|
Note: Single-operator benchmarks represent an ideal scenario. They are
usually only useful for referencing relative changes and may not fully
align with performance data observed within the full model.
#### Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
### What this PR does / why we need it?
Add DeepSeek R1 W8A8 HMB nightly ci
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depend on pr5817
### Does this PR introduce _any_ user-facing change?
before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`
after this pr:
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`
### How was this patch tested?
#### test qwen3-235b eplb num_redundant_experts=16
without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |
with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604
This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer realtaed codes into a dedicated folder, which has already
been done in vLLM
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lty <linhebiwen@gmail.com>
### What this PR does / why we need it?
Since we have upgrade all the nodes' `cann` HDK version to `25.3rc1`, we
should not limit nightly schedule to the specific nodes
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
fix bug : https://github.com/vllm-project/vllm-ascend/issues/5634
Intermittent CI failure due to a compilation error in the triton
operator
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
this pr support use triton mrope like cuda_forward, which performance is
equal to ascendc ops
this triton ops should use cann 8.5.0
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
test in qwen3-vl-235b acc textvqa
native 81.82
npu triton 81.58
cuda triton 81.52
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: shiyuan680 <917935075@qq.com>
### What this PR does / why we need it?
This patch initial testing involved connecting two nodes from the HK
region to nightly A2.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Align multi-node nightly test paramter with tutorials documents.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Test locally and nighly e2e multi-node test cases.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
To support tensorList for dispatch_ffn_combine, to adjust eplb
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Single Operator Testing
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lhchg <lhao_cheng@163.com>
Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>
### What this PR does / why we need it?
Close the **Full Graph** mode to temporarily avoid accuracy issue for
**Qwen3-Next-80B-A3B-Instruct-W8A8**.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: InSec <1790766300@qq.com>
### What this PR does / why we need it?
Add Qwen3Next CI
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
Move ops to the correct path where they belong
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add triton ascend in nightly
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
1. MagicMTP (paper: "Block Verification Accelerates Speculative
Decoding") was introduced to consider the influence among multiple draft
tokens, improving the acceptance rate without compromising accuracy.
2. Added Triton and PyTorch implementations, and added E2E test cases.
### Does this PR introduce _any_ user-facing change?
MagicMTP will automatically take effect when the parameter
"num_speculative_tokens" >= 3.
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: chenaoxuan <cax1165@163.com>
### What this PR does / why we need it?
1.replace moe_gating_top_k from torch_npu with custom op
2.enable the renorm function of moe_gating_top_k in softmax scenerio
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need test
- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: ZCG12345 <2097562023@qq.com>
### What this PR does / why we need it?
Add qwen3-8b nightly test
- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
1. speed up e2e light test.
2. create `2-cards` and `4-cards` folder in multicard
3. move ops to nightly
4. run test in Alphabetical Order
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
#### What this PR does / why we need it?
This PR adapt DispatchGmmCombineDecode operator to eplb tensor list and
expert token numbers.
This operator support gmm1, gmm2, gmm1Scale and gmm2Scale in format of
list.
This operator support couting how many token each local expert recieves
by expertTokensNum .
- vLLM version: v0.13.0
- vLLM main:
7157596103
More info about this operator, please refer to RFC: issue
https://github.com/vllm-project/vllm-ascend/issues/5476
# What this PR does / why we need it?
Add DeepSeek-V3.2-W8A8 dual-node nightly CI test and update A3 nightly
test configuration:
1. Add DeepSeek-V3.2-W8A8 dual-node test:
tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml
- 2 nodes, 16 NPUs per node (32 NPUs total)
- Configuration: 2P+1D (data-parallel-size=4, tensor-parallel-size=8,
data-parallel-size-local=2)
- Includes performance and accuracy benchmarks with GSM8K dataset
2. Update A3 nightly workflow: .github/workflows/nightly_test_a3.yaml
- Added DeepSeek-V3.2-W8A8 dual-node test to the A3 nightly test matrix
- Test name: multi-node-dpsk3.2-2node
3. Improve test scripts: Updated
.github/workflows/_e2e_nightly_multi_node.yaml and related scripts for
better multi-node testing support
test on A3 instances
- Performance baseline: 1 (threshold: 0.97)
- Accuracy baseline: 95% (threshold: 5%)
- Test dataset: GSM8K with 512 prompts for performance, gsm8k-lite for
accuracy
---------
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
There was an accuracy issue with **Qwen3-Next-80B-A3B-Instruct-W8A8**
model in the old version of **Triton-Ascend**, so, we are now adding one
nightly test to maintain it.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: IncSec <1790766300@qq.com>
### What this PR does / why we need it?
mv ops to correct path
:`tests/e2e/nightly/single_node/ops/singlecard_ops/triton`
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Fix Smoke Testing Bug for DSR1 longseq
We need to make this change because the daily smoke test case is
throwing an error: "max_tokens or max_completion_tokens is too large:
32768.This model's maximum context length is 32768 tokens and your
request has 128 input tokens". We encounter this error due to
max-out-len equals to max-model-len. We can fix this error by increasing
max-model-len argument in the script.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: daishixun <dsxsteven@sina.com>
### What this PR does / why we need it?
Add nightly test for triton split_rmsnorm_rope
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Angazenn <supperccell@163.com>