### What this PR does / why we need it?
In the chunked prefill scenario, CP needs to align the
`max_context_chunk` to the `cp_virtual_block_size`, but the current
implementation only aligns it to the `block_size`. For
PD-disaggregation, `cp_kv_cache_interleave_size` is typically set equal
to `block_size`, in which case `cp_virtual_block_size=block_size *
dcp_size * pcp_size`. Under specific conditions, this can lead to
misalignment of certain chunks, subsequently triggering assertion check
errors.
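A minimal sketch of the alignment difference, assuming `cp_kv_cache_interleave_size == block_size` as in the PD-disaggregation case above. The helper names and the sample sizes are illustrative, not taken from the vllm-ascend code:

```python
def cp_virtual_block_size(block_size: int, dcp_size: int, pcp_size: int) -> int:
    # Valid only under the assumption cp_kv_cache_interleave_size == block_size.
    return block_size * dcp_size * pcp_size


def round_down(value: int, multiple: int) -> int:
    # Largest multiple of `multiple` that does not exceed `value`.
    return (value // multiple) * multiple


# Illustrative sizes (assumptions, not values from the PR).
block_size, dcp_size, pcp_size = 128, 2, 2
virtual = cp_virtual_block_size(block_size, dcp_size, pcp_size)

raw_chunk = 1664  # 13 * block_size
aligned_to_block = round_down(raw_chunk, block_size)   # old behaviour
aligned_to_virtual = round_down(raw_chunk, virtual)    # behaviour CP needs

print(virtual, aligned_to_block, aligned_to_virtual)
```

With these sizes the block-aligned chunk is still not a multiple of the virtual block size, which is the kind of misalignment described above.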
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
When the graph mode is piecewise, synchronizing on replay hurts
performance: each sync costs almost 250 us.

### Does this PR introduce _any_ user-facing change?
Synchronization now happens only when the graph mode includes full mode.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: wangyongjun <wangyongjun7@huawei.com>
### What this PR does / why we need it?
Update the PD-disaggregation multi_node README: update the environment
check command for A3 and add a TLS check.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: liziyu <liziyu16@huawei.com>
Add `releases` match case for CI jobs and update related doc for v0.13.0
branch
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Align multi-node nightly test parameters with the tutorial documents.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Tested locally and with the nightly e2e multi-node test cases.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
> Extracted from PR #5513
Based on the Sharded-CP feature PR: #4702;
RFC: https://github.com/vllm-project/vllm/issues/30055
### All-gather KV Cache for Communication Overlap:
- This PR adjusts the calculation order in the SFA.
- Split `index_select` into `indexer_select_pre_process` and
`indexer_select_post_process`.
- Combine `nope`, `rope` and `index-k` into a tensor to perform
asynchronous all-gather.
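As a rough illustration of the last bullet, the sketch below packs three per-token pieces into one row so that a single gather moves all of them, then splits them back by width. It uses plain Python lists and a stand-in `fake_all_gather`; the function names, dimensions, and the synchronous gather are all assumptions, and the real change operates on device tensors with an asynchronous collective:

```python
def pack_rows(nope, rope, index_k):
    # Concatenate the three per-token pieces along the feature dimension.
    return [n + r + k for n, r, k in zip(nope, rope, index_k)]


def unpack_rows(packed, nope_dim, rope_dim):
    # Split each packed row back into its three pieces by known widths.
    nope = [row[:nope_dim] for row in packed]
    rope = [row[nope_dim:nope_dim + rope_dim] for row in packed]
    index_k = [row[nope_dim + rope_dim:] for row in packed]
    return nope, rope, index_k


def fake_all_gather(per_rank):
    # Stand-in for a collective: concatenates every rank's rows in rank order.
    return [row for rank_rows in per_rank for row in rank_rows]


# One token per rank, with hypothetical widths nope=2, rope=1, index_k=1.
rank0 = pack_rows([[1.0, 2.0]], [[3.0]], [[4.0]])
rank1 = pack_rows([[5.0, 6.0]], [[7.0]], [[8.0]])

gathered = fake_all_gather([rank0, rank1])  # one gather instead of three
nope, rope, index_k = unpack_rows(gathered, nope_dim=2, rope_dim=1)
print(nope, rope, index_k)
```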
### benchmark:
input=40k && num_batch_token=20k
- before:
```
Mean TTFT (ms): 2614.52
Median TTFT (ms): 3148.03
P50 TTFT (ms): 3148.03
P90 TTFT (ms): 3163.48
P99 TTFT (ms): 3170.20
```
- after:
```
Mean TTFT (ms): 2529.92
Median TTFT (ms): 3051.69
P50 TTFT (ms): 3051.69
P90 TTFT (ms): 3067.31
P99 TTFT (ms): 3072.15
```
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
### What this PR does / why we need it?
The condition for determining padding in the fullgraph overlay with MTP
and PCP has been modified to accommodate corner cases where the shape
capture size is manually specified.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut and tests
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Add new features to the mooncake layerwise connector, including:
1. Support for sparse attention, for DeepSeek-V3.2
2. Distributing transfer tasks to redundant kv_head cards
This PR is related to [[RFC]: CDCP Scheduling for Disaggregated
Prefilling with KV Cache Layerwise Push
Support](https://github.com/vllm-project/vllm-ascend/issues/4842)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
This PR fixes Minimax model loading in the vLLM Ascend backend by:
- Adding a model type check for "minimax" and "minimax_m2" that replaces
the "mlp" prefix with "block_sparse_moe"
- Implementing special handling for Minimax expert layer naming
conventions
- Adding Minimax configuration to packed_modules_model_mapping for proper
qkv_proj and experts module handling

Without these changes, Minimax models fail to load on Ascend devices due
to incompatible layer naming and module packing.
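The prefix replacement in the first item could look roughly like the sketch below. The function name `remap_prefix`, the dotted-name matching rule, and the sample layer names are assumptions for illustration; only the model types and the "mlp" to "block_sparse_moe" mapping come from the description above:

```python
MINIMAX_MODEL_TYPES = {"minimax", "minimax_m2"}


def remap_prefix(model_type: str, prefix: str) -> str:
    # Only Minimax model types get the rename; everything else is untouched.
    if model_type not in MINIMAX_MODEL_TYPES:
        return prefix
    # Replace whole dotted components only, so names that merely contain
    # "mlp" as a substring are left alone (an assumption of this sketch).
    parts = prefix.split(".")
    return ".".join("block_sparse_moe" if p == "mlp" else p for p in parts)


print(remap_prefix("minimax_m2", "model.layers.0.mlp.experts"))
print(remap_prefix("qwen3", "model.layers.0.mlp.experts"))
```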
### Does this PR introduce _any_ user-facing change?
Yes. Users can now successfully load and run Minimax models on Ascend
hardware with vLLM. This enables inference capabilities for this model
family on Ascend devices.
### How was this patch tested?
Local Testing:
- Verified model loading for minimax-xxx and minimax_m2-xxx model variants
on Atlas 800I A2 hardware
- Tested inference with sample prompts using vLLM's OpenAI-compatible API
server

Benchmark Validation:
- Compared throughput and latency metrics against GPU baseline
- Verified memory usage stays within expected limits for different batch
sizes
- Tested multi-card inference scenarios with tensor parallelism
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
---------
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
### What this PR does / why we need it?
Fix the PCP port mapping error. In a multi-node PD separation scenario
with the PCP feature enabled, the ZMQ transmission port is wrong:
the IP and port received by the D side do not match. The cause is an
error in the port mapping update strategy logic.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
1. Fixed memory retention on certain GPUs caused by missing PUT
operations.
2. Fixed performance degradation resulting from architectural
incompatibilities in the underlying refactor.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Add GLM4.5 GLM4.6 doc
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: 1092626063 <1092626063@qq.com>
### What this PR does / why we need it?
Support mxfp8 quantization (Qwen linear layer)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangyao <iwangyao@outlook.com>
### What this PR does / why we need it?
This PR fixes the precision issue from improper Tensor maintenance in
`vllm_ascend/ops/linear_op.py` under the Verl reinforcement learning
(RL) scenario. issue:
https://github.com/vllm-project/vllm-ascend/issues/5747
Key changes:
1. Remove the custom class member `self.weight_t` in
`vllm_ascend/ops/linear_op.py`;
2. Adjust the input logic of the `npu_mm_all_reduce_base` operator to
directly fetch weight parameters from the model's `nn.Parameters`,
instead of using pre-created Tensors.
> In the vllm model, it is recommended to avoid creating additional
parameter copies (such as self.weight_t) for computation; if already
created, they must be synchronized with the model's original parameters.
This is because parameter synchronization between training and inference
in the Verl reinforcement learning (RL) scenario may cause memory
address changes to nn.Parameters, and unsynchronized extra Tensors will
reference old memory without updating with the parameters—ultimately
leading to precision issues.
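The failure mode in the note above can be reproduced without any framework. In this sketch `Param` is a stand-in for `nn.Parameter`, and both layer classes are hypothetical; the point is only that a copy made at construction time keeps pointing at the old values once the parameter's storage is replaced:

```python
class Param:
    """Stand-in for nn.Parameter: `data` may be rebound to new storage."""

    def __init__(self, data):
        self.data = data


def transpose(m):
    return [list(col) for col in zip(*m)]


def vec_mat(x, m):
    # x: [k], m: [k][n] -> [n]
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


class CachedLinear:
    def __init__(self, weight):
        self.weight = weight
        self.weight_t = transpose(weight.data)  # extra copy, made once

    def forward(self, x):
        return vec_mat(x, self.weight_t)  # reads the cached copy


class DirectLinear:
    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return vec_mat(x, transpose(self.weight.data))  # reads the live parameter


weight = Param([[1, 0], [0, 1]])
cached, direct = CachedLinear(weight), DirectLinear(weight)

# Simulate a training-to-inference weight sync that rebinds the storage.
weight.data = [[2, 0], [0, 2]]

stale = cached.forward([1, 1])
fresh = direct.forward([1, 1])
print(stale, fresh)
```

`CachedLinear` still computes with the pre-sync weights, while `DirectLinear` sees the update, which mirrors why this PR reads the weight from the model's parameters at call time.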
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: icerain-alt <450125138@qq.com>
Co-authored-by: Shangwei-Li <lishangwei@mail.ustc.edu.cn>
### What this PR does / why we need it?
The custom Ascend operators sgmv_expand and sgmv_shrink apply only
to ranks 8, 16, 32, and 64. When rank >= 128, the operator
is out of range, causing the model to report an error.
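A guard for the supported ranks might be sketched as follows. The constant and function names are hypothetical, and how the actual fix dispatches when the rank is unsupported is not described here:

```python
# Ranks the custom sgmv_expand / sgmv_shrink operators are stated to support.
SUPPORTED_SGMV_RANKS = (8, 16, 32, 64)


def sgmv_rank_supported(rank: int) -> bool:
    # True when the custom operators can be used for this LoRA rank.
    return rank in SUPPORTED_SGMV_RANKS


for rank in (16, 64, 128):
    print(rank, sgmv_rank_supported(rank))
```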
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Depends on this commit https://github.com/vllm-project/vllm/pull/31408
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
1. Don't run lint and ut again once the PR is merged, to save CI resources
2. Update codecov every 4 hours
3. Rename `model_downloader` to a more suitable name
4. Update the scheduled jobs to run at better times.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Support tensorList in dispatch_ffn_combine to accommodate EPLB.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Single Operator Testing
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lhchg <lhao_cheng@163.com>
Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>
### What this PR does / why we need it?
This PR fixes a bug in the Xlite backend
(https://atomgit.com/openeuler/GVirt/issues/1). The direct cause
of the problem is that the XModel::PrepareAttn function obtained an
illegal number of tokens to be inferred, -540. This illegal value results
from the padding feature of graph-mode inference combined with residual
state across steps. The issue is triggered when a new prefill request is
added in a step and a decode ends in the same step. As a first fix, this
PR uses num_decode_tokens instead of attn_metadata.num_decodes.
1. In graph mode, vllm_ascend has padding characteristics. In the
_prepare_inputs function, if the number of tokens to be inferred is less
than the set threshold (8 in this case), the attn_metadata.num_decode
array will be expanded to 8.
2. Meanwhile, vllm_ascend uses the class variable self.query_start_loc
of NPUModelRunner to record the tokens to be inferred. Due to poor
coordination with the graph mode padding mechanism when crossing steps,
in some cases (such as when a decode request is completed in a certain
step and a new prefill request is added at the same time), negative
values may be calculated for attn_metadata.query_lens.
3. After type conversion, the negative values in query_lens cause an
overflow. Xlite detects that the number of tokens to be inferred for the
decode request is too large and triggers a "decode len too long" alert.
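The overflow in step 3 can be illustrated in isolation. The `query_start_loc` values below are made up so that the last difference equals the -540 mentioned above, and the unsigned 32-bit target type is an assumption, since the description does not name the converted type:

```python
import ctypes


def query_lens(query_start_loc):
    # Per-request token counts are differences of consecutive start offsets.
    return [b - a for a, b in zip(query_start_loc, query_start_loc[1:])]


# Hypothetical stale/padded offsets: the last entry is smaller than the
# one before it, so the last difference is negative.
stale_query_start_loc = [0, 1, 541, 1]
lens = query_lens(stale_query_start_loc)

# Reinterpreting a negative length as unsigned 32-bit wraps around to a
# very large value (assumed target type for this sketch).
as_u32 = [ctypes.c_uint32(n).value for n in lens]
print(lens, as_u32)
```

A wrapped value of this size is what a length check would then report as an overly long decode.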
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Same with https://atomgit.com/openeuler/GVirt/issues/1
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wwwumr <1127858301@qq.com>
### What this PR does / why we need it?
Disable **Full Graph** mode to temporarily avoid an accuracy issue for
**Qwen3-Next-80B-A3B-Instruct-W8A8**.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: InSec <1790766300@qq.com>
**What this PR does / why we need it?**
This PR fixes a bug in NetLoader
[PR#2888](https://github.com/vllm-project/vllm-ascend/pull/2888). The
bug was caused by
[PR#3612](https://github.com/vllm-project/vllm-ascend/pull/3612)
([1/N][Refactor] Refactor code to adapt with vllm main), which removed
the `stateless_init_device_torch_dist_pg` function from platform.py,
leading to a failure in the call. This PR adds a way to create a
stateless process group that does not depend on external code.
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Same with
[PR#2888](https://github.com/vllm-project/vllm-ascend/pull/2888)
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: destinysky <kangrui10@126.com>
### What this PR does / why we need it?
lint and ut use self_hosted runner
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Fix the title of the document for the layer_sharding feature
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
### What this PR does / why we need it?
The rotary algorithm in the DeepSeek indexer should be neox-style instead
of gptj-style. PR #4413 fixed this accuracy bug with a new Triton kernel.
This PR fixes the original PyTorch version.
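For reference, the two rotary layouts differ in which elements are paired. The list-based sketch below is not the indexer code; the function names are illustrative, and `cos`/`sin` are passed in directly with one entry per rotated pair:

```python
def rope_neox(x, cos, sin):
    # Neox-style: element i is paired with element i + d/2.
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    o1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    o2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return o1 + o2


def rope_gptj(x, cos, sin):
    # GPT-J-style: adjacent elements (2i, 2i + 1) are paired.
    x1, x2 = x[0::2], x[1::2]
    o1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    o2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    out = []
    for a, b in zip(o1, o2):
        out += [a, b]  # re-interleave the rotated pairs
    return out


x = [1.0, 2.0, 3.0, 4.0]
cos, sin = [0.0, 0.0], [1.0, 1.0]  # a 90-degree rotation for every pair

neox_out = rope_neox(x, cos, sin)
gptj_out = rope_gptj(x, cos, sin)
print(neox_out, gptj_out)
```

The same input and angles give different outputs under the two layouts, so picking the wrong style changes the results rather than raising an error.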
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI passed with existing test.
- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
1. add PaddleOCR-VL.md in the `docs/source/tutorials/`
2. add PaddleOCR-VL index in `docs/source/tutorials/index.md`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by CI
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: zouyizhou <zouyizhou@huawei.com>
### What this PR does / why we need it?
Add Qwen3Next CI
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This PR enables custom op `aclnnMoeInitRoutingCustom` introduced in PR
#5251
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
Move ops to the correct path where they belong
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangli <wangli858794774@gmail.com>
There is a problem with the workflow_dispatch trigger for the image build.
Let's remove it for now to make CI happy. I'll add it back once it's well
tested.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
mlapo in DeepSeek gives a large performance improvement in decode. This PR
adds support for PCP and DCP with mlapo.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
Add triton ascend in nightly
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This addresses the issue brought up by #5356 and #4963, and we believe
the unnecessary conditions are the root cause.
Change the unpad trigger to be driven by actual size mismatches
(num_reqs vs base_num_reqs or scheduled vs input token counts) rather
than specific speculative-method flags. Then remove brittle workarounds
that forced request counts and sliced query start locations.
This prevents incorrect indexing and length mismatches during
speculative decoding and makes metadata unpadding more robust across
scheduling modes.
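The new trigger condition can be summarized as a predicate on sizes alone. This is a sketch: `num_reqs` and `base_num_reqs` are named above, while the function name and the two token-count parameter names are assumptions:

```python
def needs_unpad(num_reqs: int, base_num_reqs: int,
                num_scheduled_tokens: int, num_input_tokens: int) -> bool:
    # Unpad whenever padding actually changed a size, independent of which
    # speculative-decoding method is active.
    return (num_reqs != base_num_reqs
            or num_scheduled_tokens != num_input_tokens)


print(needs_unpad(4, 4, 16, 16))  # nothing was padded
print(needs_unpad(8, 5, 16, 16))  # request count was padded
print(needs_unpad(4, 4, 13, 16))  # token count was padded
```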
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Tested by existing cases.
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Add Qwen3-Omni-30B-A3B-Thinking Tutorials
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
PR #5230 introduced a problem: when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
### What this PR does / why we need it?
Adapted sp to eagle3.
There may still be some problems, e.g., accuracy in some scenes,
`sp`+`dp`...
We will fix them later.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
We tested it mainly in a new `e2e`.
```shell
pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance
```
```text
.
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============= 3 passed, 1 skipped, 2 warnings in 142.05s (0:02:22) =============
```
It passed.
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Correcting some outdated use cases:
`tests/e2e/singlecard/test_aclgraph_accuracy.py::test_models_output` ->
`tests/e2e/singlecard/test_aclgraph_accuracy.py::test_piecewise_res_consistency`
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
EPLB currently does not have CI related to aclgraph and redundancy
experts; this PR adds them.
release on #5529
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Tested the use cases to be added in this PR.
PASSED
====================================================== warnings summary
==========================================================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type
SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type
SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 1 passed, 2
warnings in 272.24s (0:04:32)
=====================================================
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
`ref` doesn't work with workflow_dispatch, so change it to the raw way.
This PR also merges the pr_create jobs into one runner to save resources.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Optimize the format of the warning printed when deprecated code is
detected in use in vllm-ascend.
### Does this PR introduce _any_ user-facing change?
NA
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
1. MagicMTP (paper: "Block Verification Accelerates Speculative
Decoding") was introduced to consider the influence among multiple draft
tokens, improving the acceptance rate without compromising accuracy.
2. Added Triton and PyTorch implementations, and added E2E test cases.
### Does this PR introduce _any_ user-facing change?
MagicMTP will automatically take effect when the parameter
"num_speculative_tokens" >= 3.
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: chenaoxuan <cax1165@163.com>
### What this PR does / why we need it?
The function `get_kv_cache_spec` in model_runner changed a lot, which
caused an error in the cpuoffloading connector, whose implementation is
copied from model_runner. This PR adapts the connector to the newly
implemented `get_kv_cache_spec` to fix it.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
### What this PR does / why we need it?
- Delete the environment variable
`VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED`
- Introduce layer_sharding as a configurable feature in
additional_config
- Revise the term "shared weight" to "shard weight."
Configuration: the feature is opt-in via the additional_config
argument:
```
--additional-config '{
"layer_sharding": ["o_proj", "q_b_proj"]
}'
```
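When the command line is assembled from a script, the same argument can be generated rather than hand-quoted. This is a small sketch using only the standard library; the variable names are illustrative, and only the `layer_sharding` key and its values come from the example above:

```python
import json
import shlex

# Same payload as the command-line example above.
additional_config = {"layer_sharding": ["o_proj", "q_b_proj"]}

# json.dumps produces the JSON text; shlex.quote makes it safe for a shell.
arg = "--additional-config " + shlex.quote(json.dumps(additional_config))
print(arg)

# Round-trip check: a shell-style split recovers the original config.
flag, payload = shlex.split(arg)
recovered = json.loads(payload)
```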
This is orthogonal to standard tensor parallelism and weight replication
strategies. It is treated as a separate, explicit feature. It can be used
in any scenario, combined with the flashcomm2
(https://github.com/vllm-project/vllm-ascend/pull/3232) feature
or the ShardedCP #4702 feature, to achieve significant performance gains.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
### What this PR does / why we need it?
Fix the pre-create link parameter error: `batch_transfer_sync_write`
requires a list.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>