By converting the KV cache from ND to NZ format when the decode node
receives it, this PR ensures that the KV NZ feature works correctly
during the decoding phase in disagg-prefill scenario.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: ghphotoframe <854746559@qq.com>
Co-authored-by: alex101-ops <alex1015718386@gmail.com>
### What this PR does / why we need it?
1. This PR is proposed to support complicated pcp/dcp parallelisms in
Prefill and Decode nodes in Mooncake, such as Prefill: TP8/PCP2DCP8 and
Decode: TP8/DCP4/DP2, which is not supported now. We establish the link
mappings to transfer KVCache between prefill and decode nodes. The main
function is realized in Function of `_get_kv_split_metadata` in
Mooncake_connector.py
2. After a prefill rank is pulled KVCache by a decode rank, the decode
rank will send `DONE_RECVING_MSG` to the prefill rank and the prefill
rank will free its KVCache blocks. If a prefill rank is pulled KVCache
more than one time by several decode ranks and it surely could happen in
complicated pcp/dcp parallelisms, it will cause the prefill rank free
its KVCache blocks for several times, which could cause memory issue.
This PR solve this issue by counting the times of prefill rank would be
pulled KVCache and in the last time, it will free the prefill rank
KVCache blocks. The related code is in Function of `run_busy_loop` in
Mooncake_connector.py
3. If a prefill rank is not pulled KVCache by any decode ranks, the
first rank in decode node will send "DONE_RECVING_MSG" to free its
blocks. The related code is in Function of
`_send_done_signal_to_free_remote_port` in Mooncake_connector.py
### How was this patch tested?
This PR is tested in many pcp/dcp parallelisms, and the accuracy are all
correct.
MLA model:
Prefill node: TP8/DP2, Decode node: TP8/DP2
Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DP2
Prefill node: TP8/PCP2/DCP8, Decode node: TP8/DCP4/DP2
Prefill node: TP8/PCP2/DCP4, Decode node: TP4/DCP2/DP4
Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DCP4/DP4
Prefill node: TP8/PCP2, Decode node: TP4/DCP2
GQA model:
Prefill node: TP8/DP2, Decode node: TP8/DP2
Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DP2
Prefill node: TP8/PCP2/DCP2, Decode node: TP8/DCP2/DP2
Prefill node: TP8/PCP2/DCP2, Decode node: TP4/DP4
Prefill node: TP16/DCP2/PCP1, Decode node: TP8/DCP2/DP2
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
- Co-author by: Daishixun dsxtsteven@sina.com
---------
Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
We should transfer the mm_embed to the dtype of input_embed before
performing the in-place assignment
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Refactor pcp& dcp related code. we use pcp_manager class to Unifiy
Manage pcp & dcp . as we do this , many code can be deleted from
model_runner, and can avoid break pcp & dcp by other developments.
RFC:https://github.com/vllm-project/vllm-ascend/issues/5449
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?
The float kernel of MOE_init_routing_v2 in the dispatch allgather
operation does not support tensor format for active_expert_range; it
only supports int.
PR5311 To unify the variables `local_num_experts` and
`self.local_num_experts`, `self.local_num_experts` was used
consistently, which led to the subsequent integer type parameter being
converted to a tensor type.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
gsm8k | exact_match,strict-match: ground_truth=0.89 | measured=0.8939 |
success=✅
gsm8k | exact_match,flexible-extract: ground_truth=0.85 | measured=0.856
| success=✅
ceval-valid | acc,none: ground_truth=0.84 | measured=0.8373 | success=✅
Model Parameters:
{'pretrained': 'Qwen/Qwen3-30B-A3B', 'tensor_parallel_size': 2, 'dtype':
'auto', 'trust_remote_code': False, 'max_model_len': 4096,
'gpu_memory_utilization': 0.6, 'enable_expert_parallel': True}
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
1. What this PR does / why we need it?
This PR supports the moe_gating_top_k operator, which enables
post-positioned renormalization (renorm) on the basis of softmax.
2. Does this PR introduce any user-facing change?
No user-facing changes are required.
3. How was this patch tested?
This patch was tested with the test_npu_moe_gating_top_k test case.
vLLM version: release/v0.13.0
vLLM main:
ad32e3e19c
---------
Signed-off-by: ZCG12345 <2097562023@qq.com>
Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.
Currently, most of the logic in the capture_model method is similar to
that in the vllm code. Directly using the vllm method can reduce the
maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to
initialize_attn_backend
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
- Fixes vllm break:
1. [[BugFix] register quant scale tensors as buffer #31395]
(https://github.com/vllm-project/vllm/pull/31395)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
This PR aims to refactor eagle-related modules in vllm-ascend.
This is the starting PR of eagle refactoring. Provided with vllm-eagle,
ascend-eagle and ascend-mtp, we first let ascend-mtp inherit from
ascend-eagle and let ascend-eagle inherit from vllm-eagle. As a
initialization, we just delete `__init__` in mtp_proposer and simplify
the corresponding logic in eagle_proposer.
Based on "vllm-eagle <----- ascend-eagle <----- ascend-mtp", our target
is to gradually delete ascend-mtp and enable ascend-eagle to converge to
vllm-eagle. So the main workspace is eagle_proposer. In this way, we
hope that contributors can concurrently refactor eagle.
Incoming changes:
1. delete common methods in vllm-eagle & ascend-eagle & ascend-mtp
2. delete `load_model` in mtp_proposer
3. delete `dummy_run` and `propose` in mtp_proposer
4. ......
RFC: #5467
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This PR moves reject sample related triton kernels into `ops/triton`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing test.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
#5051 only implement a basic framework for model runner v2, but there
are still some bugs for e2e functionality, this PR aim to enable basic
functionality.
model runner v2 plans:
https://github.com/vllm-project/vllm-ascend/issues/5208
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Enable fia to support sliding window function and adapt to the Gemma3
model.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: nsdie <yeyifan@huawei.com>
### What this PR does / why we need it?
We support to use full graph with eagle.
Change list:
1. Distinguish between processing graph_params and draft_graph_params in
attention_v1.
2. Adapt the full-graph mode in eagle_proposer, include:
1). If use full graph, make Fullgraph Wrapper when load model.
2). Build a new meatadata, set running mode in FULL and mark attention
update in dummy_run when in Fullgraph mode.
3). Fixed and fill any attn_metadata, such as
attn_metadata.slot_mapping.
4). Add a descriptor.
5). Set running mode and triggered update metadata.
3. Trans is_mtp_model to is_draft_model, and add the update of
workspace.
NOTE:
When set async_scheduling=True, the draft model will enforce execution
in eager mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
### What this PR does / why we need it?
Unify the loading logic for expert_map and log2phy.
1. The map generated when enabling the redundancy expert is incorrect.
The community generation map function only accepts the number of global
experts. When we pass in the number of logical experts plus redundant
experts, the local expert ID of the last card will index to an expert ID
that does not exist. Now we ensure that the index points to a real
existing expert ID, and each expert can be accessed. Moreover, when
redundant experts are not enabled, the output of our function remains
consistent with the community's function.
2. The map we generate is based on the length of the physical expert,
but in reality, we only need to use the length of the logical expert.
Later on, we will need to pad it accordingly, so we can simply generate
a map with the length of the logical [expert.]
3. Unify the initialization logic across different scenarios and
simplify the code for fused_moe.
**Before refactoring**
- map path is not None:
expert map: get_rank_placement_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.
log2phy: get_rank_log2phy_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.
- map path is None:
expert map: determine_expert_map from '_vllm.laye_r', The function does
not support the redundant experts of vllm-ascend.
log2phy: determine_default_log2phy_map from _'eplb_utils.py'_. The
function does not support the redundant experts of vllm-ascend.
**Refactoring**
eplb_utils.py
init_eplb_config
generate placement
generate expert map
generate log2phy
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Expert Mapping Test Generation:
ep size: 16, num of experts: 256, num of redundant experts: 16
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16]
+++++++++++++++++++++++++++++++++++++++++
Improved map:
[16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
Expert Mapping Test Generation:
ep size: 16, num of experts: 256, num of redundant experts: 0
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
+++++++++++++++++++++++++++++++++++++++
Improved map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
dsr1 baselie:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |
dsr1 eplb:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
fix typo of _skip_all_reduce_across_dp_group
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com>
We'll release 0.13.0 soon. The main branch is freeze. Let's revert the
newest change and redo it once 0.13.0 is released.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Previously, it was necessary to set the environment variables
HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0. This PR enables
hierarchical MC2 operations on A2 by default.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
In the speculative decoding scenario, the original code performs
Host-Device synchronization, which slows down the main model's execution
speed.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Last month the interface of `OffloadingSpec` has
changed(https://github.com/vllm-project/vllm/pull/27743). This PR fixes
this bug and adds e2e test for cpu offloading.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI passed with new added test.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
1.KV Pool for KV Transfer in PD Disaggregation Scenarios Error
Resolution
2.Update KV Pool Documentation
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: fems14 <1804143737@qq.com>
We'll release 0.13.0 soon. The main branch is freeze. Let's revert the
newest change and redo it once 0.13.0 is released
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
### What this PR does / why we need it?
Since the _npu_ring_mla operator deteriorates in long-sequencescenarios,
the long sequence is split into shorter sequences for input to improve
performance.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: pichangping <1337510399@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
1. The `npu_fused_infer_attention_score` kernel supports specifying the
output layout. By selecting the appropriate layout, we can avoid the
transpose operation typically required after the attention.
2. The `transpose_batchmatmul` function allows us to control whether the
output tensor is transposed. If we configure `perm_y`, an additional
transpose after executing `v_up` becomes unnecessary.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
Currently, MHA models (eg: minicpm-2b, Baichuan-7b) will encounter
errors when running in piecewise graph mode, with error messages similar
to:
```
(E89999): When layout is TND and PA not enabled, keyT(8) and valueT(8) must be equal to the last element of actualSeqenceLengthKV(5)[FUNC:CheckInputShapeWhenLayoutIsTND][FILE:prompt_flash_attention_tiling.cpp][LINE:3618]
```
The error occurs because the qkv in the Prefill stage is also padded,
causing the shape to be inconsistent with actual_seq_lengths.
Add unpadding logic for kv.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Rollback causal_conv1d_fn ops from triton to torch version to fix
hanging issues,meanwhile update Qwen3Next doc
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This PR aims to fix unsuitable `moe_comm_type` under `ep=1` scenario.
The related issue #5375 have reported that `ep=1` can cause errors in
local environment, but those cases work well on ci. The point is the
difference between machines and `moe_comm_type` may not be chosen
correctly.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: Zetong Li <slippersss@126.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
#4443 introduces a precision issue in scenarios where MTP >= 3 + deepseek v3.1, and this pr reverts it
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: GDzhu01 <809721801@qq.com>
1. refresh additional config doc
2. move kv config logic to platform.
3. improve `dump_config` init logic and rename it to `dump_config_path`
this change is user impacted. dump_config is changed from dict to
string.
4. correct `enable_async_exponential` type
5. remove useless `chunked_prefill_for_mla`
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The contiguous() operation temporarily increases memory usage, leading
to higher peak GPU memory, which necessitates reducing
gpu_memory_utilization. However, making tensors contiguous in
modelrunnerv1 significantly enhances operator performance, resulting in
greater end-to-end model benefits despite the memory overhead.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Currently, the vllm pr: https://github.com/vllm-project/vllm/pull/24252
is causing operator fusion to fail, which can be mitigated by patching
the backend. Once the problem is completely resolved, I will submit a
new pull request to remove the patch.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
Description:
This PR updates the implementation of the Triton operator for deployment
on NPU devices, focusing on optimizing grid size and memory handling
based on NPU limitations.
Design Plan:
Grid Calculation: The grid size is now dynamically calculated by batch
and dim to ensure that the number of programs executed does not exceed
the NPU's vector core capacity. This ensures optimal parallelism without
overloading the hardware.
Data Block Handling: Due to the limited on-chip memory (UB) on Ascend
NPUs, this implementation splits large data into smaller chunks of 32k
or less per block. The kernel performs a for-loop to process the data in
these smaller chunks, minimizing memory usage and avoiding potential
overflows.
Changes Compared to GPU Implementation:
Grid and Block Sizing:
For GPU, the grid and block size were determined based on available
thread counts and memory size. In contrast, the NPU version dynamically
adjusts these parameters using B_TILE and BLOCK_N to optimize for NPU’s
architecture.
Memory Chunking:
The original GPU implementation did not require chunking due to the
higher available memory and processing capacity. For the NPU, data is
divided into smaller chunks (32k or smaller) to comply with memory
constraints on the device. The kernel has been modified to handle this
chunking mechanism inside a loop.
Optimized Thread Usage:
The NPU implementation takes into account the hardware-specific thread
limit (24 threads per vector core), ensuring that the number of active
programs is aligned with the NPU's vector core count, avoiding
over-subscription that would lead to serial processing.
This PR ensures that the operator functions efficiently on Ascend NPU,
considering hardware limitations while maintaining the same
functionality and input parameters as the GPU implementation.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
### What this PR does / why we need it?
[Bugfix] Fixing the issue where 128K context does not work in long
sequence scenarios.
This issue is caused by not splitting num_token according to pcp_size
during profile_run.
During `profile_run`, a warm-up is performed based on
`self.max_num_tokens`. When PCP is enabled, each PCP group will only
schedule up to `self.max_num_tokens / pcp_size`. After `profile_run` is
completed, the original scheduling size needs to be restored.
This is a temporary workaround; once
https://github.com/vllm-project/vllm/pull/28988/files is implemented,
this part can be removed.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
`VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with
`VLLM_ASCEND_ENABLE_PREFETCH_MLP` which is useless totally. This PR
remove it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
When matmul_and_reduce is enabled, the prefix attribute is required.
However, in some models, the prefix is not passed correctly, causing
errors when starting the service.
The issue of incorrect prefix passing will be fixed in vLLM in the
future.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
The variable `self.num_pcp_pads` was incorrectly truncated during
assignment, causing errors in certain scenarios such as PD
disaggregated. This issue has now been resolved.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Co-author by: QiuChunshuo <qiuchunshuo@huawei.com>
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
add xlite e2e test
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
Signed-off-by: DaweiChang <405739598@qq.com>
### What this PR does / why we need it?
1. MagicMTP (paper: "Block Verification Accelerates Speculative
Decoding") was introduced to consider the influence among multiple draft
tokens, improving the acceptance rate without compromising accuracy.
2. The rejection sampling logic in rejection_sampler.py was restructured
using Triton-Ascend, enabling it to operate under high concurrency, thus
resolving CPU and NPU operator bottlenecks and enhancing throughput.
### Does this PR introduce _any_ user-facing change?
MagicMTP will automatically take effect when the parameter
"num_speculative_tokens" >= 3.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: chenaoxuan <cax1165@163.com>
### What this PR does / why we need it?
This pull request introduces an L2 normalization kernel implemented in
Triton, specifically optimized for Ascend NPUs.
### Does this PR introduce _any_ user-facing change?
No, this PR does not introduce any user-facing changes.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: Ascendyh <hw7osiris@outlook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as
it causes deepseek v3.2 hang error
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
`VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, let's
remove it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1. Use optimized apply_top_k_top_p for NPU platfrom in rejection
sampler; (avoid scatter elements which can reduce ~26ms TPOT with bs=24
per DP)
2. <del>Avoid D2H Synchronization before calling npu_top_k_top_p
introduced by parameter validation which improves inference speed with
`async_scheduling` enabled;</del> In order to elminate the D2H
synchronization introduced by parameter validation before calling
`npu_top_k_top_p`, we directly drop this fused operator since the
performance improvement is not significant compared to async_scheduling
and may bring potential accuracy problem.
3. Refactor the implementation of AscendTopKTopPSampler to align that of
vLLM.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E serving test with combinations of `k=500` and `p=0.95` with
async_scheduling in single node and wide-EP scenarios.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>