### What this PR does / why we need it?
The `OffloadingSpec` interface changed last month
(https://github.com/vllm-project/vllm/pull/27743). This PR fixes the
resulting breakage and adds an e2e test for CPU offloading.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI passed with the newly added test.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
1. Fix errors when using the KV Pool for KV transfer in PD disaggregation
scenarios.
2. Update the KV Pool documentation.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Provide sample guidance for running long-sequence DeepSeek across
multiple nodes.
A practical example is provided to guide users on using the context
parallel feature.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
We'll release 0.13.0 soon. The main branch is frozen. Let's revert the
newest change and redo it once 0.13.0 is released.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
### What this PR does / why we need it?
Since the performance of the `_npu_ring_mla` operator degrades in
long-sequence scenarios, long sequences are split into shorter ones
before being passed to the operator to improve performance.
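The splitting idea can be sketched as follows. This is a minimal illustration, not the PR's implementation: `split_into_chunks` and the chunk length are hypothetical, and the real change operates on the tensors fed to the operator rather than on plain index ranges.

```python
def split_into_chunks(seq_len, max_chunk_len):
    """Return (start, end) index pairs that cover [0, seq_len) in order.

    Hypothetical helper: each pair describes one shorter sub-sequence
    that would be passed to the attention operator separately.
    """
    if max_chunk_len <= 0:
        raise ValueError("max_chunk_len must be positive")
    return [
        (start, min(start + max_chunk_len, seq_len))
        for start in range(0, seq_len, max_chunk_len)
    ]


# A sequence of 10 tokens split into pieces of at most 4 tokens.
chunks = split_into_chunks(10, 4)
```

The last chunk is allowed to be shorter than `max_chunk_len`, so no padding is needed.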
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: pichangping <1337510399@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR adds a common method for running vllm bench. It is needed for
test cases that will be added later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
This PR adds acceptance tests for eagle/eagle3 using llama/qwen models.
The golden baselines were obtained from several runs on a healthy main
branch, which makes them a feasible and convincing reference.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Add an installation script for the `fused_infer_attention_score` kernel
with flash decoding.
### User-facing changes
Users can install the `fused_infer_attention_score` kernel with the flash
decoding feature by running
`bash tools/install_flash_infer_attention_score_ops_a2.sh` or
`bash tools/install_flash_infer_attention_score_ops_a3.sh`.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Qwen3-235B-A22B is one of the TopN models, but it currently lacks test
cases on Atlas A2, even though most machines owned by community users
are A2. When users encounter problems, we have no way of knowing whether
the model runs normally on the corresponding version of the code, so
this PR adds the test cases. In addition, other TopN models such as
qwen-dense, qwen3-30b-a3b, Qwen3-Next, and Qwen2.5-Omni are already
covered, but Qwen3-235B-A22B is missing.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Tested with multiple nodes. Results are as follows:
1. Accuracy test (execution time: 25 minutes).
The test ran successfully, with the following accuracy:
```
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 7cd45e accuracy gen 95.68
```
2. Perf test (execution time: 1 hour 15 minutes).
The test ran successfully, with the following throughput (measured on
Atlas A3; for A2, the result is roughly the A3 result divided by 1.3):
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤══════╕
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪══════╡
│ E2EL │ total │ 384086.3958 ms │ 214767.0486 ms │ 528014.771 ms │ 387621.5746 ms │ 388776.7492 ms │ 390164.3559 ms │ 488105.8512 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ TTFT │ total │ 159409.9868 ms │ 1849.4588 ms │ 302439.6965 ms │ 162183.7007 ms │ 162965.477 ms │ 164274.1936 ms │ 262578.6041 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ TPOT │ total │ 149.8842 ms │ 130.2175 ms │ 151.2625 ms │ 150.473 ms │ 150.6978 ms │ 150.9102 ms │ 151.2131 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ ITL │ total │ 149.6789 ms │ 0.0099 ms │ 283.0242 ms │ 150.3276 ms │ 156.8649 ms │ 168.1372 ms │ 199.378 ms │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ InputTokens │ total │ 3654.3079 │ 3108.0 │ 4280.0 │ 3629.0 │ 3728.0 │ 3842.1 │ 4079.0 │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ OutputTokens │ total │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 1500.0 │ 2800 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────┤
│ OutputTokenThroughput │ total │ 3.935 token/s │ 2.8408 token/s │ 6.9843 token/s │ 3.8698 token/s │ 3.8799 token/s │ 3.9916 token/s │ 6.2137 token/s │ 2800 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧══════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric │ Stage │ Value │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration │ total │ 4391524.3389 ms │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests │ total │ 2800 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests │ total │ 0 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests │ total │ 2800 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency │ total │ 244.8903 │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency │ total │ 256 │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput │ total │ 0.6376 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 10232062 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 22.924 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens │ total │ 4200000 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 2329.9568 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 956.3877 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 3286.3445 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Update vllm pin to 12.26
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
1. The `npu_fused_infer_attention_score` kernel supports specifying the
output layout. By selecting the appropriate layout, we can avoid the
transpose operation typically required after the attention.
2. The `transpose_batchmatmul` function allows us to control whether the
output tensor is transposed. If we configure `perm_y`, an additional
transpose after executing `v_up` becomes unnecessary.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
Currently, MHA models (e.g., minicpm-2b, Baichuan-7b) encounter errors
when running in piecewise graph mode, with error messages similar to:
```
(E89999): When layout is TND and PA not enabled, keyT(8) and valueT(8) must be equal to the last element of actualSeqenceLengthKV(5)[FUNC:CheckInputShapeWhenLayoutIsTND][FILE:prompt_flash_attention_tiling.cpp][LINE:3618]
```
The error occurs because the qkv in the prefill stage is also padded,
which makes the shape inconsistent with `actual_seq_lengths`.
This PR adds unpadding logic for kv.
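A minimal sketch of the unpadding step, under stated assumptions: plain Python lists stand in for the key/value tensors, `unpad_kv` is a hypothetical helper name, and the lengths are treated as cumulative, as the error message suggests by comparing T with the last element of the KV sequence lengths.

```python
def unpad_kv(key, value, actual_seq_lengths_kv):
    """Drop trailing padding so the token dimension matches the real length.

    Assumption: in TND layout the sequence lengths are cumulative, so the
    last entry is the total number of real tokens.
    """
    num_actual_tokens = actual_seq_lengths_kv[-1]
    return key[:num_actual_tokens], value[:num_actual_tokens]


# Mirrors the numbers in the error above: 8 padded tokens, 5 real ones.
padded_key = list(range(8))
padded_value = list(range(100, 108))
key, value = unpad_kv(padded_key, padded_value, [2, 5])
```

After unpadding, the token dimension of `key` and `value` is 5, which matches the last element of the sequence lengths.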
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Roll back the `causal_conv1d_fn` op from the Triton version to the torch
version to fix hanging issues, and update the Qwen3Next doc.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This PR adds methods for sending chat and non-chat requests, which are
needed by many of the following test cases.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
This PR updates the DeepSeek-R1/V3.1 doc with a simple recipe for
reproducing our latest performance on Atlas A3/A2 servers.
### Does this PR introduce any user-facing change?
No.
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
This PR fixes an unsuitable `moe_comm_type` in the `ep=1` scenario.
The related issue #5375 reports that `ep=1` can cause errors in local
environments, while the same cases work well on CI. The cause is a
difference between machines, which means `moe_comm_type` may not be
chosen correctly.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: Zetong Li <slippersss@126.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Add a developer guide for PCP & DCP.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
#4443 introduces a precision issue in scenarios with MTP >= 3 and DeepSeek v3.1, so this PR reverts it.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
Add cudagraph_capture_sizes for E2E CI test.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: menogrey <1299267905@qq.com>
1. Refresh the additional config doc.
2. Move the kv config logic to platform.
3. Improve the `dump_config` init logic and rename it to `dump_config_path`.
This change affects users: the value changes from a dict to a string.
4. Correct the `enable_async_exponential` type.
5. Remove the unused `chunked_prefill_for_mla`.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- Fix the vLLM break introduced by this PR:
1. [Drop v0.14 deprecations](https://github.com/vllm-project/vllm/pull/31285)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
### What this PR does / why we need it?
Currently, our multi-node logs only show the master node's logs (via the
Kubernetes API), which is insufficient for effective problem
localization if other nodes experience issues. Therefore, this pull
request adds the ability to upload logs for other nodes.
Next plan: Output structured directory logs, including logs from each
node and the polog.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
The contiguous() operation temporarily increases memory usage, leading
to higher peak GPU memory, which necessitates reducing
gpu_memory_utilization. However, making tensors contiguous in
modelrunnerv1 significantly enhances operator performance, resulting in
greater end-to-end model benefits despite the memory overhead.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Currently, the vLLM PR https://github.com/vllm-project/vllm/pull/24252
causes operator fusion to fail, which can be mitigated by patching
the backend. Once the problem is completely resolved, I will submit a
new pull request to remove the patch.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
Description:
This PR updates the implementation of the Triton operator for deployment
on NPU devices, focusing on optimizing grid size and memory handling
based on NPU limitations.
Design Plan:
Grid Calculation: The grid size is now dynamically calculated by batch
and dim to ensure that the number of programs executed does not exceed
the NPU's vector core capacity. This ensures optimal parallelism without
overloading the hardware.
Data Block Handling: Due to the limited on-chip memory (UB) on Ascend
NPUs, this implementation splits large data into smaller chunks of 32k
or less per block. The kernel performs a for-loop to process the data in
these smaller chunks, minimizing memory usage and avoiding potential
overflows.
Changes Compared to GPU Implementation:
Grid and Block Sizing:
For GPU, the grid and block size were determined based on available
thread counts and memory size. In contrast, the NPU version dynamically
adjusts these parameters using B_TILE and BLOCK_N to optimize for NPU’s
architecture.
Memory Chunking:
The original GPU implementation did not require chunking due to the
higher available memory and processing capacity. For the NPU, data is
divided into smaller chunks (32k or smaller) to comply with memory
constraints on the device. The kernel has been modified to handle this
chunking mechanism inside a loop.
Optimized Thread Usage:
The NPU implementation takes into account the hardware-specific thread
limit (24 threads per vector core), ensuring that the number of active
programs is aligned with the NPU's vector core count, avoiding
over-subscription that would lead to serial processing.
This PR ensures that the operator functions efficiently on Ascend NPU,
considering hardware limitations while maintaining the same
functionality and input parameters as the GPU implementation.
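The grid and chunk sizing described above can be sketched in plain Python. This is a simplified illustration under several assumptions: `plan_launch` is a hypothetical helper, the core count is a made-up parameter value, the 32k limit is treated as an element count, and only the batch dimension is tiled (the PR derives the grid from both batch and dim).

```python
def plan_launch(batch, dim, num_vector_cores, max_block=32 * 1024):
    """Pick B_TILE / BLOCK_N style parameters for a capped grid.

    Returns (grid, b_tile, block_n, num_chunks):
      grid       - number of programs, never more than num_vector_cores
      b_tile     - rows handled by each program
      block_n    - elements processed per loop iteration (<= max_block)
      num_chunks - loop iterations needed to cover dim
    """
    if batch <= 0 or dim <= 0 or num_vector_cores <= 0:
        raise ValueError("batch, dim and num_vector_cores must be positive")
    block_n = min(dim, max_block)
    num_chunks = -(-dim // block_n)       # ceiling division
    grid = min(batch, num_vector_cores)   # cap programs at the core count
    b_tile = -(-batch // grid)            # rows per program
    grid = -(-batch // b_tile)            # drop programs left with no rows
    return grid, b_tile, block_n, num_chunks


# Hypothetical sizes: 100 rows of 100000 elements on 48 vector cores.
plan = plan_launch(batch=100, dim=100_000, num_vector_cores=48)
```

Because the grid never exceeds the core count, every program can run concurrently, and the in-kernel loop over `num_chunks` keeps each block within the on-chip memory budget.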
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
### What this PR does / why we need it?
Update the configuration for optimal performance of deepseek v3.2 in the usage tutorial.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Fix the xlite decode-only e2e test. The xlite decode-only mode uses
aclgraph's prefill and is therefore affected by aclgraph, so the test
length is shortened.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: changdawei1 <changdawei3@huawei.com>
### What this PR does / why we need it?
[Bugfix] Fix the issue where 128K context does not work in long-sequence
scenarios.
This issue is caused by not splitting `num_token` according to `pcp_size`
during `profile_run`.
During `profile_run`, a warm-up is performed based on
`self.max_num_tokens`. When PCP is enabled, each PCP group will only
schedule up to `self.max_num_tokens / pcp_size`. After `profile_run` is
completed, the original scheduling size needs to be restored.
This is a temporary workaround; once
https://github.com/vllm-project/vllm/pull/28988/files is implemented,
this part can be removed.
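The save/shrink/restore pattern described above can be sketched as a context manager. This is a minimal sketch under stated assumptions: `Runner` is a stand-in class, `per_pcp_group_tokens` is a hypothetical helper, and integer division is assumed for the per-group size; the actual workaround lives in the model runner.

```python
from contextlib import contextmanager


class Runner:
    """Stand-in for a model runner; only the fields used here are modelled."""

    def __init__(self, max_num_tokens, pcp_size):
        self.max_num_tokens = max_num_tokens
        self.pcp_size = pcp_size
        self.profiled_with = None

    def profile_run(self):
        # The warm-up is sized from the current max_num_tokens.
        self.profiled_with = self.max_num_tokens


@contextmanager
def per_pcp_group_tokens(runner):
    """Temporarily shrink max_num_tokens to the per-PCP-group share."""
    original = runner.max_num_tokens
    runner.max_num_tokens = original // runner.pcp_size
    try:
        yield runner
    finally:
        # Restore the original scheduling size even if profiling fails.
        runner.max_num_tokens = original


runner = Runner(max_num_tokens=131072, pcp_size=2)
with per_pcp_group_tokens(runner):
    runner.profile_run()
```

Using `try`/`finally` guarantees the original value is restored even when the warm-up raises.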
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Fixed the error in the CI process for
vllm-ascend/tests/e2e/nightly/ops/triton/test_rejection_sampler.py
Error: test_rejection_sampler_block_verify_triton_kernel: duplicate
parametrization of 'vocab_size'.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: chenaoxuan <cax1165@163.com>
`VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with
`VLLM_ASCEND_ENABLE_PREFETCH_MLP`, so it is entirely redundant. This PR
removes it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
When matmul_and_reduce is enabled, the prefix attribute is required.
However, in some models, the prefix is not passed correctly, causing
errors when starting the service.
The issue of incorrect prefix passing will be fixed in vLLM in the
future.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
The variable `self.num_pcp_pads` was incorrectly truncated during
assignment, causing errors in certain scenarios such as PD
disaggregation. This issue has now been resolved.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
add xlite e2e test
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
Signed-off-by: DaweiChang <405739598@qq.com>
### What this PR does / why we need it?
1. MagicMTP (paper: "Block Verification Accelerates Speculative
Decoding") was introduced to consider the influence among multiple draft
tokens, improving the acceptance rate without compromising accuracy.
2. The rejection sampling logic in rejection_sampler.py was restructured
using Triton-Ascend, enabling it to operate under high concurrency, thus
resolving CPU and NPU operator bottlenecks and enhancing throughput.
### Does this PR introduce _any_ user-facing change?
MagicMTP will automatically take effect when the parameter
"num_speculative_tokens" >= 3.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: chenaoxuan <cax1165@163.com>
### What this PR does / why we need it?
This pull request introduces an L2 normalization kernel implemented in
Triton, specifically optimized for Ascend NPUs.
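For reference, the operation such a kernel computes can be written in a few lines of plain Python. This is only a reference sketch of L2 normalization, not the Triton kernel; the function name, the `eps` default, and adding `eps` inside the square root are assumptions.

```python
import math


def l2norm_reference(row, eps=1e-6):
    """Scale a vector so that its Euclidean norm is (approximately) 1."""
    inv_norm = 1.0 / math.sqrt(sum(x * x for x in row) + eps)
    return [x * inv_norm for x in row]


# With eps=0 the classic 3-4-5 triangle gives an exact unit vector.
out = l2norm_reference([3.0, 4.0], eps=0.0)
```

A reference like this is useful as a golden value when checking a device kernel's output within a tolerance.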
### Does this PR introduce _any_ user-facing change?
No, this PR does not introduce any user-facing changes.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: Ascendyh <hw7osiris@outlook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Revert "[KV-Sharing] Support KV-Sharing feature in CLA models" (#4138)
because it causes DeepSeek v3.2 to hang.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
`VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, let's
remove it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1. Use the optimized apply_top_k_top_p for the NPU platform in the
rejection sampler (this avoids scatter elements and can reduce TPOT by
~26 ms with bs=24 per DP).
2. <del>Avoid D2H Synchronization before calling npu_top_k_top_p
introduced by parameter validation which improves inference speed with
`async_scheduling` enabled;</del> To eliminate the D2H
synchronization introduced by parameter validation before calling
`npu_top_k_top_p`, we drop this fused operator entirely, since its
performance improvement is not significant compared to async_scheduling
and it may introduce accuracy problems.
3. Refactor the implementation of AscendTopKTopPSampler to align with
that of vLLM.
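For readers unfamiliar with the filtering itself, the semantics of combined top-k and top-p can be sketched in pure Python. This is a reference illustration only, not the NPU or vLLM implementation: `top_k_top_p_reference` is a hypothetical name, and applying top-p to the probabilities renormalized over the top-k survivors is an assumption of this sketch.

```python
import math


def top_k_top_p_reference(logits, k=None, p=None):
    """Return a copy of logits with filtered-out entries set to -inf."""
    # Token indices ordered from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = order if k is None else order[:k]
    if p is not None:
        # Softmax over the top-k survivors only.
        top = max(logits[i] for i in keep)
        exps = [math.exp(logits[i] - top) for i in keep]
        total = sum(exps)
        cumulative, nucleus = 0.0, []
        for idx, e in zip(keep, exps):
            nucleus.append(idx)
            cumulative += e / total
            if cumulative >= p:
                break  # smallest prefix whose mass reaches p
        keep = nucleus
    kept = set(keep)
    return [x if i in kept else -math.inf for i, x in enumerate(logits)]


filtered = top_k_top_p_reference([2.0, 1.0, 0.5, -1.0], k=3, p=0.8)
```

Masking with `-inf` rather than deleting entries keeps the vocabulary dimension unchanged, so a later softmax assigns zero probability to the filtered tokens.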
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E serving test with combinations of `k=500` and `p=0.95` with
async_scheduling in single node and wide-EP scenarios.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Skip some failing ops tests.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add pa_shape_list description to qwen dense tutorial.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?
- This PR removes the Expert Parallel (EP) HCCL buffer allocation that
was previously introduced by the fused-op `dispatch_ffn_combine` (#3532),
since the fused-op has switched to the MC2 HCCL buffer (#5156).
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
### What this PR does / why we need it?
[E2E] Optimize e2e test.
- Remove the test_basic_camem testcase.
- Change Qwen2.5-0.5B-Instruct-W8A8 to Qwen3-0.6B-W8A8
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Some E2E test cases are not in our CI workflow. This PR adds them back.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: menogrey <1299267905@qq.com>