### What this PR does / why we need it?
This PR adds a Triton RoPE kernel which supports the `rope_dim !=
head_dim` case. This removes the split op before rope and the concat op
after rope. Profiling shows improvement.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
I will add the related UTs once CI is integrated with Triton.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
This PR integrates suffix decoding (https://arxiv.org/abs/2411.04975)
from vLLM (https://github.com/vllm-project/vllm/pull/25784).
Suffix Decoding is a dynamic n-gram matching method that:
1. Uses suffix trees to generate speculative tokens quickly using branch
frequency counts.
2. Can keep a history of prior model responses, which tends to work very
well with repetitive agentic use cases.
3. Can be dynamically updated with newly generated tokens, with FIFO
eviction of older requests.
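The three properties above can be illustrated with a toy sketch. This is not the arctic-inference implementation: it replaces the suffix tree with a plain dictionary of n-gram continuation counts, and every name in it (`SuffixSpeculator`, `max_ctx`, `max_requests`) is hypothetical.

```python
from collections import Counter, OrderedDict, defaultdict


class SuffixSpeculator:
    """Toy stand-in for a suffix tree: n-gram context -> next-token counts."""

    def __init__(self, max_ctx=3, max_requests=2):
        self.max_ctx = max_ctx            # longest context length to match on
        self.max_requests = max_requests  # FIFO capacity for cached requests
        self.history = OrderedDict()      # request id -> token list

    def add(self, req_id, tokens):
        """Cache a response, evicting the oldest request when full (FIFO)."""
        self.history[req_id] = list(tokens)
        while len(self.history) > self.max_requests:
            self.history.popitem(last=False)

    def _counts(self):
        # Rebuilt on every call for clarity; a real suffix tree would be
        # updated incrementally instead.
        counts = defaultdict(Counter)
        for tokens in self.history.values():
            for n in range(1, self.max_ctx + 1):
                for i in range(len(tokens) - n):
                    counts[tuple(tokens[i:i + n])][tokens[i + n]] += 1
        return counts

    def propose(self, context, num_speculative_tokens):
        """Greedily extend `context` with the most frequent continuation."""
        counts = self._counts()
        ctx, draft = list(context), []
        for _ in range(num_speculative_tokens):
            nxt = None
            # Prefer the longest suffix of the context seen in the history.
            for n in range(min(self.max_ctx, len(ctx)), 0, -1):
                seen = counts.get(tuple(ctx[-n:]))
                if seen:
                    nxt = seen.most_common(1)[0][0]
                    break
            if nxt is None:
                break  # no match: stop speculating
            draft.append(nxt)
            ctx.append(nxt)
        return draft
```

After caching a response `[1, 2, 3, 4, 5]`, calling `propose([9, 2, 3], 2)` matches the suffix `(2, 3)` and drafts `[4, 5]`; once that request is evicted, the same call drafts nothing.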
### Does this PR introduce _any_ user-facing change?
This feature is opt-in and remains seamless for users who do not
require suffix speculative decoding.
Users who wish to enable it must first install arctic-inference:
`pip install arctic-inference`
After installation, the suffix speculative decoding feature can be
enabled using the following speculative config:
`--speculative_config '{"method": "suffix", "num_speculative_tokens": 5}'`
### How was this patch tested?
This PR is currently being tested on vLLM
main:83f478bb19
with PR https://github.com/vllm-project/vllm/pull/25784
In our previous testing, suffix decoding achieved a 13%-30% throughput
improvement over n-gram on the sonnet dataset, tested on vllm-ascend
v0.9.1 with concurrency ranging from 2 to 40.
- vLLM version: v0.11.2
---------
Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
### What this PR does / why we need it?
Previously, the dummy run executed compute_logits only once, regardless
of num_speculative_tokens. This caused execute_model to hang on
compute_logits when lm head tensor parallelism exceeded 1. The fix
ensures compute_logits executes correctly during dummy run, matching
num_speculative_tokens.
I set the `non_blocking` argument to False when moving
`exceeds_max_model_len` to the CPU. From what I understand, using
`non_blocking=True` and immediately accessing the tensor on the CPU can
cause accuracy problems. However, this issue doesn't happen when
transferring data to a device. ref:
https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/18
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
The Ascend scheduler was originally added for the non-chunked-prefill
case, because the NPU ops didn't work well with chunked prefill.
Now that the ops work better with chunked prefill, it's time to remove
the Ascend scheduler and use the vLLM default scheduler.
- vLLM version: v0.11.2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Fix a hang when the model runs _npu_flash_attention in
_forward_prefill_no_cache. It was caused by a wrong attention mask dtype.
### How was this patch tested?
Yes, tested on Qwen2.5-VL and Qwen2.5-Omni
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: Ting FU <futing10@huawei.com>
### What this PR does / why we need it?
Currently, the code determines the chip type in two ways:
`get_ascend_soc_version` uses the `get_soc_version` API in torch_npu, and
`is_310p` uses `_build_info.__soc_version__`, which is generated at
install time. We need to unify the two paths, based on the following
points:
1. We need to ensure consistency in chip type judgment between compiling
and running states;
2. In compiling state, we need chip type to complete op's compilation,
but in running state, we only need device
type(910B/910_93/310P/910_95/etc) to make code branch judgement;
3. In compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's api.
Based on the above points, we have made the following changes:
1. When the user sets env `SOC_VERSION`, use it; when it is not set,
query soc_version via `npu-smi`;
2. Generate device_type from soc_version when compiling, and write
`__device_type__` instead of `__soc_version__` in `_build_info.py`;
3. In running state, use `__device_type__` to judge code branch.
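A minimal sketch of the resolution order described above, under stated assumptions: the two table entries are illustrative (the real table is `soc_to_device` in `setup.py`), and `resolve_device_type`, `render_build_info` and the injected `query_soc_version` callable are hypothetical names. The `npu-smi` query is passed in as a function so that no particular command-line output format is assumed.

```python
import os

# Illustrative excerpt only; the real mapping lives in `soc_to_device`
# in setup.py and its keys may be spelled differently.
SOC_TO_DEVICE = {
    "ascend910b1": "910B",
    "ascend310p3": "310P",
}


def resolve_device_type(query_soc_version, env=None):
    """Pick soc_version from env SOC_VERSION, else from the query callable."""
    env = os.environ if env is None else env
    soc_version = env.get("SOC_VERSION") or query_soc_version()
    key = soc_version.lower()
    if key not in SOC_TO_DEVICE:
        raise ValueError(f"unsupported SOC_VERSION: {soc_version!r}")
    return SOC_TO_DEVICE[key]


def render_build_info(device_type):
    """Text written at build time in place of __soc_version__."""
    return f"__device_type__ = {device_type!r}\n"
```

At runtime, code branches would then read only `__device_type__`, so the build-time and run-time decisions come from the same source.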
### Does this PR introduce _any_ user-facing change?
When env `SOC_VERSION` is not set, it no longer defaults to
`ASCEND910B1`; soc_version is queried via `npu-smi` instead. When set,
`SOC_VERSION` must be in the list `soc_to_device` in `setup.py`.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
When cudagraph_mode is set to FULL_DECODE_ONLY, if dp > 1, the dummy-run
process will be triggered. When calling the update_attn_params function,
the num_tokens parameter needs to be passed, and this value is obtained
through positions.shape[0]. However, the multimodal model uses mRope
(multi-dimensional rotary positional embeddings), which causes the shape
of positions to be 2. As a result, the value obtained from
positions.shape[0] is incorrect. We solve this problem by replacing
positions.shape[0] with num_tokens.
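The mismatch can be reproduced without any tensor library. The sketch below uses nested lists and a small hypothetical `shape` helper; the `(3, num_tokens)` layout for mRoPE positions is an assumption made for illustration.

```python
def shape(nested):
    """Return the shape of a rectangular nested list, like tensor.shape."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


num_tokens = 5
# Ordinary RoPE: one position per token, shape (num_tokens,).
positions_1d = list(range(num_tokens))
# mRoPE (assumed layout): one row per rotary section, shape (3, num_tokens).
positions_mrope = [list(range(num_tokens)) for _ in range(3)]

print(shape(positions_1d)[0])     # 5, equals num_tokens
print(shape(positions_mrope)[0])  # 3, not num_tokens
```

Passing `num_tokens` explicitly avoids depending on which axis of `positions` holds the token dimension.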
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
### What this PR does / why we need it?
vllm-ascend needs to dump data during model execution to debug some
precision problems. msprobe provides this capability, so this PR
integrates msprobe into vllm-ascend to make debugging easier.
### Does this PR introduce _any_ user-facing change?
```
'dump_config': '/path/to/config.json'
```
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: Tjh-UKN <2559659915@qq.com>
The main purposes of this PR are as follows:
1. Remove the multicast-related code;
Reason:
1. In scenarios such as A2 dual-system back-to-back networking,
multicast performs worse than all_gather. In e2e tests, throughput was
3 tps before the modification and is 10 tps after it.
2. Also, we usually enable the SP feature, which is consistent
with the current logic.
3. The advantage of broadcast communication is that it
does not suffer from uneven DP load and does not require the prefill ACL
graph to be enabled. However, prefill ACL graph is now supported.
So we see no need to keep multicast as an option in
MoE communication.
Performance benefits are as follows:
Without enable_flashcomm1, TTFT remains relatively stable at around
43000ms, which is approximately 15000ms faster than before the
modification.
With enable_flashcomm1, there is no difference; TTFT remains relatively
stable at around 29000ms.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian0 <1184188277@qq.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
There is a lot of hack code for v0.11.0, which makes the code hard to
upgrade to newer vLLM versions. Since v0.11.0 will be released soon,
let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
In [#26016](https://github.com/vllm-project/vllm/pull/26016), vLLM
changed `cudagraph_capture_sizes` to be in ascending order. This PR
fixes the related issues caused by this change.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Support the quantized Qwen3-Next-80B-A3B-Instruct model and fix the
NZ issue. The Triton kernel doesn't support the NZ data format, so we
skip converting weights to NZ on the `conv1d` layer.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: IncSec <1790766300@qq.com>
### What this PR does / why we need it?
Currently, the MTP model still runs in eager in full graph mode. This PR
adapts the MTP with the full graph capture and execution. When the graph
mode is set to "FULL_DECODE_ONLY", the MTP will run in full-graph to
improve the performance.
The changes, which apply whether disable_padded_drafter_batch is True
or False, include:
1. Add _mtp_graph_params in acl_graph.py to isolate the data of main
model and the data of MTP.
2. Padding some metadata in mla_v1.py when in fullgraph mode.
3. Fixed the essential data address that will be used in model.forward.
4. Adapted to the aclgraph capture framework:
    1. Rebuild the MTP model with ACLGraphWrapper.
    2. Add common attn metadata when starting capture in MTP dummy_run.
    3. Add common attn metadata update in MTP.
    4. Adapted the data update when num_speculative_tokens > 1.
5. Add a patch of MTP to adapt vllm v0.11.0.
Existing Issues:
1. When disable_padded_drafter_batch=True and running in FullGraph mode,
the data of the first-round requests in MTP is abnormal. We need to
identify the cause subsequently.
2. When disable_padded_drafter_batch=False and running in FullGraph
mode, the acceptance rate of the second and third tokens decreases
(for example, with num_speculative_tokens=3, the acceptance rate of
the first token is 90%, the second is only 50%, lower than 60%, and the
third is only 20%, lower than 30%). The reason is that the data processed
after the model runs does not match. This problem comes from another PR.
It works fine in eager and PIECEWISE mode, but fails in FullGraph
mode. Once we have a solution, we will submit a bugfix.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Fix pcp + mtp bug when using acl graph.
When using pcp + mtp, we need to flatten block_table to avoid an
irregular attn mask shape. This was done in the mla attn_metadata
builder, but we found that it changes the block_table address and leads
to incorrect results when acl graph is enabled.
To fix this, we enlarge the block_table buffer and flatten block_table
in model_runner prepare_inputs, so the block_table address is not
affected.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
### What this PR does / why we need it?
Sorts aclgraph batch sizes in ascending order, corresponding to vLLM
[#26016](https://github.com/vllm-project/vllm/pull/26016)
Ensures batch sizes for aclgraph are sorted ascending when aclgraph mode
is enabled, improving consistency and compatibility with later logic
that may depend on order.
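As a hedged illustration of why ordering matters, the sketch below shows one kind of lookup that relies on ascending sizes: finding the smallest captured size that fits a batch. `padded_batch_size` is a hypothetical helper, not a function from vLLM or vllm-ascend.

```python
import bisect


def padded_batch_size(capture_sizes, num_tokens):
    """Return the smallest captured size >= num_tokens, or None if none fits."""
    sizes = sorted(capture_sizes)  # ascending order, as in vLLM #26016
    idx = bisect.bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else None
```

`bisect` is only correct on sorted input, which is why a descending or unsorted list would silently select the wrong graph size here.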
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Waiting for #3886
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
The current library only supports the FullDecodeOnly graph mode, which
enables full graph execution during the decode. This PR extends support
to allow full graph execution in both the prefill and decode, referred
to as FULL graph mode.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
1. Qwen GQA attention_v1 optimization
2. DeepSeek MLA refactor: all gather q -> all gather kv
3. modelrunner refactor for chunked prefill, removing some unused code
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
### What this PR does / why we need it?
Only CPU tensors with `pin_memory=True` can be asynchronously copied to
the device. Currently, there are two instances where non-pinned CPU
tensors are being copied to the device, which will trigger synchronous
operations, reducing the expected benefits of asynchronous scheduling.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Currently, we set `seq_lens` in dummy attn_metadata to be
`max_model_len` to get max workspace for attention during capturing.
However, always setting it to `max_model_len` causes dummy_run
to execute a long attention during actual inference. For example,
if there is a single request with `seq_lens` of [8] but `max_model_len`
is 131072, the whole process is slowed down by dummy_run because it
executes a fake long-sequence attention. Therefore, we set it to
max_query_len instead, which is also consistent with the vLLM GPU
implementation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
DS doesn't have an 'AscendAttentionMetadataBuilder' class, so it fails
in fullgraph.
We resolved the issue by modifying the code to only check for
'GDNAttentionMetadataBuilder', while all other attention cases follow
the default branch.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
ChunkPrefill now supports the long-sequence features PCP and DCP
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests passed with self-test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <3834144971@qq.com>
### What this PR does / why we need it?
Enable the sleep mode level 2 e2e test and add check logic to ensure
that NZ is not enabled.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
use e2e tests
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangx700 <wangxin700@huawei.com>
### What this PR does / why we need it?
Adapts mtp function to Qwen3-next.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
1. In the mla_v1 module, add the torch_npu.npu_attention_update op when pcp and dcp are enabled
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
1. In the attention_v1 module, convert bsnd to tnd when pcp and dcp are enabled
2. Fix torchair bug: service startup problem
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Refactor kv cache tensor initialization logic.
1. Unify the kvcache tensor initialization logic of deepseek and normal
models
2. Split `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors`
and `_reshape_kv_cache_tensors`, following the GPU modelrunner in vLLM
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
1. prefill disaggregation scenario
2. deepseek + aclgraph/eager mode
3. qwen3 next
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
1. Revert the TND modification for dcp/pcp, which was introduced by
f57bdb09fc
2. Handle the aclgraph pad border issue
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version
check logic.
TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base
image for UT is still 8.2. We'll fix it later.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- Add support for DeepSeek v3.2 in FULL_DECODE_ONLY mode.
- Add unit test for sfa_v1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
### What this PR does / why we need it?
Support pcp + mtp (with PD disaggregation, pcp only on P nodes)
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
### What this PR does / why we need it?
Fix the inference hang that occurs when MTP is enabled and
lmhead_tensor_parallel_size=16 is set.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wyh145 <1987244901@qq.com>
### What this PR does / why we need it?
1. Revert [bugfix for mtp in fullgraph](0948483642)
and support it once vLLM supports it
2. Raise an error when cudagraph_capture_sizes is not an integer multiple
of uniform_decode_query_len
3. bugfix when max_num_seqs=14 in mtp=2 scenario
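The check in item 2 could look roughly like the sketch below. The function name `check_capture_sizes` and the error message are hypothetical; only the divisibility rule comes from the description above.

```python
def check_capture_sizes(cudagraph_capture_sizes, uniform_decode_query_len):
    """Raise if any capture size is not a multiple of the decode query length."""
    bad = [
        size for size in cudagraph_capture_sizes
        if size % uniform_decode_query_len != 0
    ]
    if bad:
        raise ValueError(
            f"cudagraph_capture_sizes {bad} are not integer multiples of "
            f"uniform_decode_query_len={uniform_decode_query_len}"
        )
```

For example, with `uniform_decode_query_len=3`, sizes `[3, 6, 12]` pass while `[3, 8]` is rejected because of `8`.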
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
1. Refactor `mtp_proposer.py`, splitting torchair-related code
into `mtp_torchair_proposer.py`
2. Following https://github.com/vllm-project/vllm/pull/24539,
implement padded speculative decoding as described in
https://github.com/vllm-project/vllm/issues/21984.
### Does this PR introduce _any_ user-facing change?
Users can use `disable_padded_drafter_batch` to disable/enable padded
speculation; the default is `False`.
Offline example:
```
speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False}
```
### How was this patch tested?
- [x] eager with pad/unpad
- [x] aclgraph with pad/unpad
- [x] torchair with pad/unpad
Performance test of deepseek-r1 with tp16, dp1:
aclgraph with pad ITL: 168ms
aclgraph with unpad ITL: 169ms
original: 178ms
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
bugfix for mtp fullgraph
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
Part of https://github.com/vllm-project/vllm-ascend/pull/3106
Fix Hybrid kvcache sharing bug in same attention type
Change the `shared_by` logic so that the same attention spec could share
the same buffer instead of allocating more hbm.
After this PR, kvcache memory usage in qwen3-next is reduced by 50%
compared with before (`self_attn:linear_attn=1:3` in an `attn_group`),
and `gpu_memory_utilization` can be increased to `0.8` on Qwen3-Next
when running on A2 64G/card with tp4
<img width="2833" height="1540" alt="image"
src="https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe"
/>
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Test pass with the latest e2e test case on qwen3-next
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Force with_prefill to true after allreduce in the kv producer
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: liziyu <liziyu16@huawei.com>