### What this PR does / why we need it?
Since the param `task` has been depprecated, we should use the latest
unified standard parameters for pooling models, this should be more
clear
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
add ut for model runner
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
moe multistream overlap to improve the performance.
### How was this patch tested?
--additional-config '{"multistream_overlap_gate": true}'
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
### What this PR does / why we need it?
Corrects attention metadata size for MTP when both asynchronous
scheduling and full ACL graph mode are enabled. This prevents potential
size mismatches during execution.
Additionally, improves the robustness of calculating token sample
indices by explicitly aligning tensor shapes.
Finally, prevents padding when the number of input tokens exceeds the
maximum ACL graph batch size to avoid out-of-bounds errors.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Need to add corresponding test case ASAP.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix
the merge conflict
### What this PR does / why we need it?
Currently, the all_reduce operation in _sync_metadata_across_dp is
performed with gloo backend which is extremely time-consuming when
DPEngineCores are in different nodes. This operation cannot be ignored
by async scheduling in multi-node-scenarios with speculative decoding
(e.g., EAGLE, mtp).
This pr eliminates the all_reduce operation for D Nodes and change the
input parameter of MoEDispatch & MoeCombine operators to make MC2EP
support different num_tokens across all ranks.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios
while enabling async scheduling. This pr can remove cross-node
all_reduce with gloo backend and further reduce latency with correct
accuracy.
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Correct more doc mistakes
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
now vllm-ascend uses AsyncGPUModelRunnerOutput
,AsyncNPUModelRunnerOutput before is outdated, so we should fix it
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
The newest version crashes in PD separation scenarios because the
function is missing the `vllm_config` parameter.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
Add mtp_proposer ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### What this PR does / why we need it?
Correct mistakes in doc
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
refactor npu_modelrunner, we should be close to gpu_modelrunner
### Does this PR introduce _any_ user-facing change?
NO
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
### What this PR does / why we need it?
Since the `llmdatadist` has sunset, the logic gen_ranktable should also
be removed
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
The recommended configuration in the document kv_pool.md is ascend.
Modify the default value of the protocol to ascend,Improve usability
#### 1.Configure mooncake.json
The environment variable **MOONCAKE_CONFIG_PATH** is configured to the
full path where mooncake.json is located.
```
{
"local_hostname": "xx.xx.xx.xx",
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"alloc_in_same_node": true,
"master_server_address": "xx.xx.xx.xx:50088",
"global_segment_size": "1GB" (1024MB/1048576KB/1073741824B/1073741824)
}
```
**local_hostname**: Configured as the IP address of the current master
node.
**metadata_server**: Configured as **P2PHANDSHAKE**.
**protocol:** Configured for Ascend to use Mooncake's HCCL
communication.
**device_name**: ""
**alloc_in_same_node**: Indicator for preferring local buffer allocation
strategy.
**master_server_address**: Configured with the IP and port of the master
service.
**global_segment_size**: Expands the kvcache size registered by the PD
node to the master.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Mooncake does not set up a protocol to launch the pooled VLLM service;
test whether the pooling function is working.
Signed-off-by: lty <linhebiwen@gmail.com>
### What this PR does / why we need it?
vllm-ascend support Ascend950 with Qwen dense model
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangyao <iwangyao@outlook.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
fix qwen2.5vl readme, del gen ranktable and add install mooncake
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Updated some issues that caused sleep mode document content to be
unavailable due to changes/outdated environment variables.
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
doc tutorials add model feature matrix:
DeepSeekR1
DeepSeekV3.1
Qwen3-Dense
Qwen3-Moe
Qwen3-Next
Qwen2.5
Qwen2.5-VL
Qwen3-VL
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: 1092626063 <1092626063@qq.com>
## Description
Fix the AttributeError caused by incorrect invocation of the warm-up
function in the FlashLB algorithm:
1. **Root Cause**: The warm-up function for FlashLB is defined outside
the `PolicyFlashlb` class (not a class method), but the code incorrectly
attempted to call it via the `PolicyFlashlb` class instance.
2. **Key Fix**: Clarify the invocation rule for FlashLB: when selecting
the FlashLB algorithm, the warm-up function must be called in advance to
precompile and warm up the algorithm (invoked as a standalone function),
instead of calling it through the `PolicyFlashlb` class.
3. **Impact**: Resolve the runtime error when using FlashLB, ensure the
algorithm pre-compilation/warm-up process works as expected, and avoid
attribute missing exceptions.
Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
### What this PR does / why we need it?
This PR fixes a bug in the moe_mlp module by correcting the arguments
passed to the torch_npu.npu_dequant_swiglu_quant function.It properly
converts group_list from a cumulative sum to counts for the group_index
parameter.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
Similar to #2309 , this PR introduces Embedding tensor model parallel to
achieve decreasing of memory consumption. It support both eager mode and
graph mode.
And this PR refactor module tensor parallel configurations supported in
#2309, #2167, #2120, merge all config into `finegrained_tp_config` in
`additional_config`, including:
`lmhead_tensor_parallel_size`
`oproj_tensor_parallel_size`
`embedding_tensor_parallel_size`
`mlp_tensor_parallel_size`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
In the PD separation scenario, the D node does not need to perform get
operations, and therefore does not need to create ZeroMQ (ZMQ)
communication.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
Remove FusedMoEState which is used by torchair.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
This PR updates the CI configuration and adjusts a set of end-to-end
(e2e) tests under tests/e2e/multicard, in order to refactor the test
suite and ensure compatibility with current codebase and CI workflows.
1. tests/e2e/multicard/test_prefix_caching.py: change model to Qwen3-8B
and rename the test case
2. tests/e2e/multicard/test_quantization.py: rename the test case
3. tests/e2e/multicard/test_qwen3_moe.py: remove duplicate test and
rename test cases
4. tests/e2e/multicard/test_qwen3_next.py: rename test cases and change
the W8A8 pruning model to the W8A8 model and remove the eager parameter
5. tests/e2e/multicard/test_shared_expert_dp.py: rename test case and
remove the eager parameter
6. tests/e2e/multicard/test_single_request_aclgraph.py: rename test case
and change Qwen3-30B to Qwen3-0.6B
7. tests/e2e/multicard/test_torchair_graph_mode.py: delete test cases
about torchair
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Currently, the initialization and fundamental functions of
RecomputeScheduler are broken with `vLLM v0.12.0`. This PR fixes the
conflicts of `RecomputeScheduler` and refactor its implementations by
inheriting original `Scheduler` of vLLM. Meanwhile, this PR also
supports async cheduling with recompute scheduler by implementing
`AsyncRecomputeScheduler` which is simply inherited `AsncyScheduler` of
vLLM and `RecomputeScheduler` of vLLM-Ascend with python MRO.
### Does this PR introduce _any_ user-facing change?
No. The switch naming is the same as v0.11.0 :
`recompute_scheduler_enable`
### How was this patch tested?
E2E serving with 2P1D dsv3.1 passed. The performance was the same as
original vllm scheduler with `async_scheduling` and preempted requests
in D Nodes are successfully transfered to Proxy and further to P Node.
This significantly improves the performance and robustness of PD
disaggregation deployments.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
From a resource-saving perspective, canceling old jobs when submitting
new commits can reduce github_hosted in queue
Signed-off-by: wangli <wangli858794774@gmail.com>
- merge image build workflow into one
- merge package build workflow into one
- merge community related workflow into one
This change makes the workflow more clear
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Currently, the usage of structured output feature in vllm-ascend is
totally the same as that in vllm.
Thus, IMO, it's better to remove this doc directly to avoid some case
that there are some changes in the upstream doc and we don't update our
doc in time, which can be misleading to users.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
mindie_turbo is out of data for long time. This PR remove the related register method.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Previously, if the KVCacheSendingThread couldn't create a socket because
of port conflicts or other problems, the main thread would wait
endlessly for the ready_event signal, causing the entire engine
initialization to freeze. This update fixes the issue by adding timeouts
for thread startup and handling unexpected thread exits, so the
initialization process no longer gets stuck indefinitely.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
This PR standardizes the fusion naming, changing
`enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e
testing.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
The variable ’is_deepseek_v3_r1‘ now is useless in the repository, so
delete it now. And the funciton 'get_fused_moe_state' is used only for
torchair, so it need to be deleted along with torchair
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Current mooncake connector has following problems with PP and MTP
enabled:
1. MTP layer kv caches are not transfered, it may cause decreasing of
accept ratio: This PR add MTP layer indices for last PP stage after
calculating end_layer in transfer_kv_cache
2. While MTP enabled, PP layers divided by default may cause imbalance
between stages, we need to use `VLLM_PP_LAYER_PARTITION` environment to
make it balance by hand, but in mooncake connector kv transfer, decode
doesn't know the partition of prefill node: This PR add config
`pp_layer_partition` in `kv_connector_extra_config` to make decode node
acquire the partition information of prefill node.
### Does this PR introduce _any_ user-facing change?
When prefill using `VLLM_PP_LAYER_PARTITION` environment, add
`pp_layer_partition` in `kv_connector_extra_config` like below:
```
export VLLM_PP_LAYER_PARTITION=33,28
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 8,
"pp_size": 2,
"pp_layer_partition": "33,28"
},
"decode": {
"dp_size": 16,
"tp_size": 1,
"pp_size": 1
}
}
```
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
### What this PR does / why we need it?
This PR adjusts the layer prefix matching rules for tensor parallelism
(column/row parallel ops) to fit Bailing model's naming conventions
(adding "query_key_value" for column parallel and "attention.dense" for
row parallel), enabling flashcomm1 to work properly on the Bailing
model.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hwhaokun <haokun0405@163.com>
### What this PR does / why we need it?
This PR fix the bug in sfa-cp under multi-DP scenarios.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
None
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zzhxx <2783294813@qq.com>
Co-authored-by: clrs97 <524936896@qq.com>
The bmm_transpose operator in version 3.2 is only used in the decoding stage due to shape limitations.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: ChrisGelhLan <33011886+xlan-huawei@users.noreply.github.com>
### What this PR does / why we need it?
Support triton causal_conv1d_fn ops.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: QilaiZhang <245706640@qq.com>
This PR clean up useless torchair logic in model runner. The moge doc is
only for torchair, it can be removed as well.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
This PR adds mlapo operation support for bf16 no_quant mode.
### Does this PR introduce _any_ user-facing change?
This PR makes quant related parameters optional.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: chenjunyi <isjunyi.chen@gmail.com>