### What this PR does / why we need it?
This PR backports the changes from #7673 ([Bugfix] support FlashComm1 &
DCP for Qwen) to the releases/v0.18.0 branch.
--------
Signed-off-by: Yang Yuxi <907276627@qq.com>
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736
**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute
'enabling_fa_quant'`
**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.
**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This pull request updates the NPU profiler configuration in the Ascend
worker by changing the `aic_metrics` parameter from `AiCoreNone` to
`PipeUtilization`. This change enables the collection of pipe
utilization metrics during profiling sessions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
This PR aims to adapt to newest commit of vllm main branch for model
runner v2. please refer to
https://github.com/vllm-project/vllm-ascend/issues/5208
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to
improve performance. The key changes include:
- A new Triton kernel implementation (`_compute_slot_mappings_kernel`)
with NPU-specific optimizations, such as using `tl.gather` to handle
non-contiguous memory access and replacing modulo operations.
- A new method `compute_slot_mappings` in `AscendBlockTables` to use
this new kernel.
- An end-to-end test to verify the correctness of the new kernel against
the reference GPU implementation.
The optimization is needed to avoid performance degradation from scalar
computation on Ascend devices.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
### What this PR does / why we need it?
2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.
### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.
### How was this patch tested?
- Model: Qwen3-VL-30B-A3B,
- TP4 DP2
- 100 reqs
- max concurrency 1
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k | 429.40 | 323.3 |
| 16k | 1297.01 | 911.74 |
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
1. When the FullGraph mode is used, the branches in the Triton operator
are compiled and fixed during the graph capture process, causing the
branch condition in the `fused_recurrent_gated_delta_rule` operator,
which checks whether `ssm_state_indices >= 0` before writing to the SSM
cache, to become invalid. Now, the write operation is performed
regardless of the value. This results in the operator performing address
offset calculations and writing to the SSM cache based on the -1 offset
after -1 is used for padding in vLLM GDN backend. Since the conv cache
and SSM cache in vLLM Ascend implementation are actually a single
continuous tensor divided into two parts, this leads to data overwriting
and the generation of NaN values.
This PR addresses two cases where padding -1 is required in the GDN
metadata builder. The same logic is used to replace the padding with 0
to avoid the problem of memory overwriting, because block 0 is a
reserved block.
2. Fix layerwise connector bug for mamba cache sending on heterogeneous
TP.
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
- This PR aims to enhance the operator performance in the `post_update`
phase of `model_runner_v2` on NPUs. By optimizing the relevant
operations, it is expected to improve the overall efficiency and speed
of the model running on NPU hardware, which is crucial for scenarios
where high-performance inference is required.
- when bs = 256, time cost reduce from 26us to 11 us;
### Does this PR introduce _any_ user-facing change?
No, there are no changes to the API, interface, or other high-level
behaviors that would directly affect the user's code or interaction with
the system beyond the performance improvement.
### How was this patch tested?
CI passed with new added/existing tests. In addition to the regular CI
tests, specific benchmark tests were conducted on NPU hardware to
measure the performance improvement of the `post_update` operators.
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
In fulldecodeonly mode, num_req_padded was set to an incorrect value,
causing accuracy degradation in Qwen3-Next. Therefore, we added a check
for compilation_config.cudagraph_mode to the conditional logic, ensuring
that padding is applied only in FULL mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
This PR aims to fix padding logic in eagle proposer for kimi25. Main
changes involve:
1. modify the way to obtain draft model attention builder and backend
2. add block table padding & related tensor slicing in common metadata
when `draft_step>1` for solving fia verifying error
3. replace block table in `update_graph_params` for solving fia
verifying error
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR enables separate attention backend configuration for target and
draft models in speculative decoding, decoupling the previously bound
attention backend settings between the two models.
It solves the compatibility issue where some draft models do not support
the attention backend used by the target model, and allows users to
select the optimal attention backend for each model individually to
maximize inference performance. The change is fully backward compatible.
---------
Signed-off-by: SidaoY <1024863041@qq.com>
### What this PR does / why we need it?
pr: https://github.com/vllm-project/vllm/pull/37136 break eplb because
it filters out redundant experts.
pr: https://github.com/vllm-project/vllm/pull/37322 fix it due to use
parallel_config.enable_eplb to determine whether to skip the weight
loading filter.
But in vllm-ascend, parallel_config.enable_eplb is always false. When we
use eplb, we temporarily set it to true.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
1.fix "TypeError: get_attn_backend() remove variable": [Refactor
`check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122)
2.fix [Rename `compile_ranges_split_points` to
`compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027)
3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace
memory related torch.cuda
APIs"](https://github.com/vllm-project/vllm/pull/37031)
4.fix [Support multiple KV groups in OffloadingSpec
](https://github.com/vllm-project/vllm/pull/36610) removed
self.offloaded_block_size and changed self.gpu_block_size from a scalar
to a tuple of per-group block sizes, adding block_size_factor.
5.fix [Consolidate
SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed
get_eagle3_aux_hidden_state_layers() to
get_eagle3_default_aux_hidden_state_layers() and added a
supports_eagle3() guard before calling it.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
### What this PR does / why we need it?
Some bug fixes, mainly including:
1. For A2, the number of experts each single card cannot be greater than
16 when using MC2. The PR fixed the error in the A2 moe communication
method selection, which would cause the selection of an incorrect
communication method when the number of model experts exceeds 256. For
example, when using an A2 16-cards model to load the PD-disaggregation D
node with Qwen3.5 series models, the incorrect MC2 method would be
chosen.
2. Fixed the issue where the layerwise connector sends the kv-cache of
the MTP layer multiple times when `num_spec_tokens` > 1. Now, the
kv-cache is sent only when the MTP layer is forward for the first time.
3. Fix the accuracy issue of qwen3.5 when using MTP for PD
disaggregation. The cause is that `num_decode_draft_tokens` does not
consider that `spec_tokens` are not existed during the first inference
when PD disaggregation (`spec_tokens` are generated during the first
inference). However, `spec_tokens_padding` is added by
`recomputed_scheduler`. As a result, `gdn_metadata` incorrectly
considers that the prefill with a length of 2 is performed.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>
### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints:
1. Only the PD separation mode can be used and
MooncakeLayerwiseConnector can be used to run the model.
2. Currently, only the activation value supports dynamic quantization,
and the KV cache supports static quantization. C8 quantization with MTP
is not supported. You can use ModelSlim for quantization. The
quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path
<path/quant_weight>
--anti_dataset../common/deepseek_anti_prompt_50_v3_1.json
--calib_dataset../common/deepseek_calib_prompt_50_v3_1.json --rot
--trust_remote_code True --fa_quant --dynamic --anti_method m6
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Two problems have been solved in this pr.
These problems occur in the `FULL_DECODE_ONLY` mode that `num_tokens`
should be padded to some value in `cudagraph_capture_sizes`.
1. We found the length of `seq_lens_list` in drafter's `attn_metadata`
is 1 shorter than expected. It will raise a kernel exception to make
vllm crash.
e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is padded well to [4, 8, 12, 20]. But
`seq_lens_list` = [5742, 4700, 7996], it is not padded.
3. Though the length of `seq_lens_list` in target's `attn_metadata` is
the same as expected in `FULL_DECODE_ONLY`, some data are corrupted at
the end of the list.
e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is padded well to [4, 8, 12, 20]. But
`seq_lens_list` = [5742, 4700, 7996, 5738], it has corrupted at the end
of the list.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Optimize the performance of the triton operator _topk_log_softmax_kernel
in model_runner_v2 to 1.04xH100,which is 7% of its original value.(issue
https://github.com/vllm-project/vllm-ascend/issues/5208)
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangx700 <wangxin700@huawei.com>
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.
The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.
This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.
### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This reverts commit 7ed9e9de69, which
introduces an issue that the patch doesn't work with recompute scheduler
enabled.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly
follows the design in
[#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits
changes to functions which are overridden in vLLM-Ascend.
Note:
1. `--mamba-cache-mode align` && PD disaggregation is still not
supported yet in vLLM v0.17.0(see
https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of hybrid kv cache might result in a very
large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B
with `-tp 2`, the block_size is adjusted to 2048, which means that any
prefix shorter than 2048 will never be cached. Although this behavior is
consistent with vLLM, it still needs improvements in the future.
3. `--mamba-cache-mode align` requires to copy mamba states during
forward steps. vLLM uses a triton kernel to implement it. However, the
original version run into some bugs on Ascend hardwares. Thus we patch a
new triton kernel to avoid this bug.
### Does this PR introduce _any_ user-facing change?
To use mamba prefix cache, set `--enable-prefix-caching` and
`--mamba-cache-mode align`. Note that the mamba state copy function(see
[do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132))
does not provide a torch native version, thus it might have trouble if
users can't use triton.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
This pr fix the Unified draft parallel feature.
1. In Draft model proposer, there are exceed 1 attention layers in
target model, thus removing the assertion on layer number.
2. we should get block size through `draft_attn_groups` instead of
`attn_metadata_builder` after 0.17.0.
3. `attn_update_stack_num_spec_norm` shouldn't be done when unified
draft parallel is enabled
### How was this patch tested?
Test pass with
`tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_parallel_drafting_acceptance`,
which is already included in CI
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552. Mainly change to use
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0, and fixed by a later pr
https://github.com/vllm-project/vllm/pull/30515
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops
in pd-mix stage mainly.
Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.
The next steps are planned in two parts:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR aims to support aclgraph for model runner v2, please see RFC
#5208. The PR contains these modifications:
- adapt to newest commit of vllm main branch.
- supply a unified interface of extra forward context for both model
runner v1 and model runner v2.
- implement graph mode for main model.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
This PR fixes a bug in `reshape_kvcache_tensors` when reshaping the
Mamba cache for models like Qwen3.5. The previous implementation did not
correctly handle cases where the KV cache tensors have different data
types. This change ensures that slicing is performed based on byte
offsets before reshaping the tensors, which correctly handles
heterogeneous dtypes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
The merged graph of draft in `FULL` mode is broken now.
This pr solves it.
Also, `actual_seq_lengths_q` in `model_runner` is found redundant, so,
it is removed.
It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and
https://github.com/vllm-project/vllm-ascend/pull/7148.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Test code is shown as below:
```python
prompts = [
"1.Who are you?",
"2. Who are you?",
]
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
tensor_parallel_size=1,
max_num_seqs=32,
# enforce_eager=True,
disable_log_stats=False,
distributed_executor_backend="mp",
gpu_memory_utilization=0.7,
async_scheduling=True,
speculative_config={
"enforce_eager": True,
"model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
"disable_padded_drafter_batch": False,
"method": "eagle3",
"num_speculative_tokens": 3,
},
compilation_config={
"cudagraph_mode": "FULL",
"cudagraph_num_of_warmups": 1,
},
max_model_len=4096,
enable_prefix_caching=False,
)
outputs = llm.generate(prompts, sampling_params)
```
The result before:
```text
File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia
graph_params.events[num_tokens].append(event)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 132
```
The result after:
```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 242
num_draft_tokens: 726
num_accepted_tokens: 156
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.07
```
We also test `FULL_DECODE_ONLY` mode.
The result is:
```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 244
num_draft_tokens: 732
num_accepted_tokens: 155
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.06
```
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Fix the issue where no exception is thrown when graph capture fails.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
Currently when tp>1, we have extra processes on tp rank0 device which
consumes extra HBM memory. This is caused by `import
torch_npu._inductor` before set_device which introduces extra
initialization of device.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
All ci passed.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
related to vllm PR #34043 this pr delete func
‘relax_for_mixed_batch_cudagraphs’, num_reqs no longer equals the actual
number of requests, due to fia operator requires that
query_start_loc[-1] equals the total number of computed tokens, so this
func delete cause the ifa error.
In full graph mode, set num_reqs_paded = num_reqs to fix the error
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
fix penality ops for new version, and achieved a 10% performance
improvement
### How was this patch tested?
pytest
tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shiyuan680 <917935075@qq.com>
### What this PR does / why we need it?
Fix acceptance and high-concurrency bug in eagle3 and cp enabled
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
tests and ut
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Fix the LoRA e2e test accuracy issue that introduced by the upstream PR
https://github.com/vllm-project/vllm/pull/32005
### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama32_lora.py
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>
### What this PR does / why we need it?
Support chunked prefill for Qwen3Next with PCP&DCP
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
## Summary
- Move `_update_states_after_model_execute` call from after main model
sampling to after draft model execution
- This reordering reduces pipeline bubbles between main model and draft
model execution
- No accuracy impact - the state update operation is independent of
draft token proposal
## Performance Impact
Reduces idle time between main model and draft model execution stages,
improving overall MTP (Multi-Token Prediction) performance.
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
### What this PR does / why we need it?
Supports contiguous tensor hybrid-attn kv-cache on fullattn-mamba hybrid
model, such as Qwen3Next and Qwen3.5.
Due to the restrictions of Ascend operators, all KV tensors, conv
tensors, and SSM tensors must be contiguous. Therefore, this PR uses the
following solution to generate the KV cache:
tensor1: [(kv_padding), conv , ...]
tensor2: [k , ssm , ...]
tensor3: [v , (mamba_padding), ...]
Under this scheme, although some waste may occur, the tensors of all
caches are guaranteed to be contiguous.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
The index-select operation `mrope_positions.gpu[:,
:total_num_scheduled_tokens].copy_(...)` triggers a CPU-NPU
synchronization, which blocks subsequent operator dispatch and causes
bubbles visible in Profiling.
This PR changes to full tensor copy
(`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync
point. The trade-off is a negligible increase in memory usage since
`mrope_positions.cpu` is a small tensor.
**Result:** ~2-3% TPOT improvement with the profiling bubbles
eliminated.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Verified via Profiling that the CPU sync bubble is eliminated and TPOT
is reduced by 2-3%.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
### What this PR does / why we need it?
The community has added a cleaning mechanism for the metadata after the
main model finishes running. The MTP layer should not clean the
metadata, and a new condition has been added to avoid cleaning it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
This PR creates and registers `ascend_multi_connector`, which allows the
`mooncake_layerwise_connector` to use the kv_pooling feature.
We unregister the original vllm's `MultiConnector` and replace it with
`AscendMultiConnector` when registering the connectors.
### Does this PR introduce _any_ user-facing change?
No. User can use `MultiConnector` to initialize `AscendMultiConnector`.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>