**[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool:**
https://github.com/vllm-project/vllm-ascend/issues/3380
### What this PR does / why we need it?
Support elastic scaling for P/D instances based on the Mooncake connector
deployment.
**Support API routes**
* `/instances/add`: add prefill nodes or decode nodes to the list.
* `/instances/remove`: remove prefill nodes or decode nodes from the
list.
**Support functions**
* Support **adding** prefill nodes or decode nodes.
- If a prefill or decode server is deployed **after the proxy is deployed**,
it can use the `/instances/add` API to join the proxy. The prefill or
decode server sends a signal to the proxy server, and the proxy server
checks the status of the node until the node is available (a minimal
sketch of this waiting loop follows the list).
* Support **removing** prefill nodes or decode nodes:
- Support using the `/instances/remove` API to **delete the node** from the
proxy server.
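
The waiting behavior described above can be pictured as a small polling loop. This is only an illustrative sketch; the `/health` endpoint and the helper name are assumptions, not the proxy's actual implementation:

```python
import time
import requests

def wait_for_instance(host_port: str, max_retries: int = 3,
                      interval: float = 10.0) -> bool:
    """Poll a newly added node until it answers, then admit it to the list."""
    for _ in range(max_retries):
        try:
            # hypothetical health probe; the real proxy may check differently
            if requests.get(f"http://{host_port}/health", timeout=5).ok:
                return True
        except requests.RequestException:
            pass
        time.sleep(interval)
    return False
```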
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`:
**Add 2 params**
When adding nodes to the proxy, the proxy waits for the nodes to start,
retrying up to a configured number of times.
| name | type | default | help |
| ----- | ---- | ---- | ---- |
| max-waiting-retries | int | 3 | Maximum number of retries while waiting for nodes to start |
| waiting-retry-interval | float | 10 | Check interval (seconds) while waiting for nodes to start |
For example:
```shell
python load_balance_proxy_server_example.py \
--host 0.0.0.0 --port 9000 \
--prefiller-hosts 127.0.0.1 127.0.0.1 \
--prefiller-ports 8100 8101 \
--decoder-hosts 127.0.0.1 127.0.0.1 \
--decoder-ports 8200 8201 \
--max-waiting-retries 3 \
--waiting-retry-interval 10
```
**Add 2 API routes**
* Add instances: `instances/add`
For example, add 2 prefiller instances:
```shell
curl -X POST http://localhost:9000/instances/add \
-H "Content-Type: application/json" \
-d '{
"type": "prefill",
"instances": ["127.0.0.1:8102", "127.0.0.1:8103"]
}'
```
Response:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102', '127.0.0.1:8103'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102', '127.0.0.1:8103'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
If the node '127.0.0.1:8103' has not been started:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102']. Instances ['127.0.0.1:8103'] are waiting to be added.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* Remove instances: `instances/remove`
For example, remove 1 decoder instance:
```shell
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
Response:
```shell
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
Run the proxy and use the `/instances/add` API to add nodes and the
`/instances/remove` API to remove nodes.
* vLLM version: v0.11.0.rc3
* vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0.rc3
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
### What this PR does / why we need it?
This PR adds a w4a8 accuracy test case for e2e testing.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: cuikai (C) <c00827167@china.huawei.com>
Co-authored-by: cuikai (C) <c00827167@china.huawei.com>
### What this PR does / why we need it?
Add the Triton op `fused_qkvzba_split_reshape_cat` for the qwen3_next
GatedDeltaNet.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
### What this PR does / why we need it?
Add Qwen3 reranker tutorials.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
---------
Signed-off-by: TingW09 <944713709@qq.com>
### What this PR does / why we need it?
We will expose the enabling switch for npugraph_ex to better facilitate
subsequent optimization.
### Does this PR introduce _any_ user-facing change?
Previously, the enable_npugraph_ex switch would trigger an error; now we
have removed the error reporting mechanism to better facilitate
subsequent optimization efforts.
Basic functionalities are available in CANN and torch_npu for Q3, while
advanced optimizations will depend on the Q4 release.
### How was this patch tested?
```python
llm = LLM(
    model=model,
    enforce_eager=False,
    additional_config={
        "enable_npugraph_ex": True
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [16],
    },
)
```
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
### What this PR does / why we need it?
Synchronize the host `query_start_loc` with device values to prevent shape
mismatches when async scheduling is not enabled.
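A minimal sketch of the synchronization, with assumed buffer names (the actual model-runner fields may differ):

```python
import torch

def sync_query_start_loc(host_buf: torch.Tensor, device_buf: torch.Tensor,
                         num_reqs: int) -> None:
    # Copy the device values back into the host buffer so host- and
    # device-side views of query_start_loc agree when async scheduling is off.
    host_buf[: num_reqs + 1].copy_(device_buf[: num_reqs + 1])
```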
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Delete `self.in_profile_run` in model_runner so that EPLB works as expected.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
We split the branches based on the applicable scenarios of pagedAttention
and fusedInferAttention, making the code clearer. This also makes
subsequent iterations easier: sliding_window and sinks support, and
removing the PA ops once FIA is ready.
Todo:
- remove PA ops after FIA is ready
- add sliding_window and ops for gpt_oss
- replace FIA with FIA_v2
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
Fix a data conversion bug introduced by
[main#4655](3b7eb5179f).
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
GatherEP has been fixed in
https://github.com/vllm-project/vllm-ascend/pull/3279, so remove all2all in
the w4a8_dynamic scenario.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
The mooncake_connector.py file was importing the wrong arguments, which
could cause errors when using PCP; this issue has been corrected.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
### What this PR does / why we need it?
PanguProMoEV1 is no longer supported in vllm-ascend; remove the related
code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
Add instructions on using permissions with Docker.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Rename `_910B` to `A2`;
Rename `_910_93` to `A3`;
Rename `_910_95` to `A5`;
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
pd disaggregated support cross-machine.
We send the primary and secondary node information of node p to node d.
When node d pulls the KV data, it retrieves the corresponding primary or
secondary node information from the mapping.
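An illustrative sketch of the mapping (the structure and names are assumptions for clarity, not the connector's actual schema):

```python
# The P node advertises its primary and secondary endpoints; the D node
# resolves one of them from this mapping when it pulls the KV data.
node_map = {
    "prefill-node-0": {"primary": "10.0.0.1:15000", "secondary": "10.0.0.2:15000"},
}

def resolve_kv_source(prefill_node: str, use_secondary: bool = False) -> str:
    entry = node_map[prefill_node]
    return entry["secondary"] if use_secondary else entry["primary"]
```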
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
When async_scheduling is enabled in a large-scale EP scenario, the MTP
module goes into eagle mode, which results in a mismatch of
`seq_lens_list` and `block_table`. So adjust the check before the draft
model forward.
Fixes #4986
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Add UTs for PCP and DCP in the attention_cp file.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: pichangping <1337510399@qq.com>
### What this PR does / why we need it?
This PR adds the `qkv_rmsnorm_rope` operator and introduces a graph fusion
pass for `qknorm_rope` operations. The implementation includes a new
configuration flag, a pattern-matching pass using
`torch._inductor.pattern_matcher`, and a custom Triton kernel for the
fused operation.
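As a rough illustration of how such a pass is wired up with `torch._inductor.pattern_matcher` (the pattern/replacement bodies below are placeholders, not the actual qknorm_rope math, and the fused Triton kernel call is omitted):

```python
import torch
from torch._inductor.pattern_matcher import (PatternMatcherPass, fwd_only,
                                              register_replacement)

qknorm_rope_pass = PatternMatcherPass(pass_name="qknorm_rope_fusion")

def pattern(q, weight):
    # unfused op sequence the pass searches for (placeholder math)
    return torch.nn.functional.rms_norm(q, q.shape[-1:], weight)

def replacement(q, weight):
    # in the real pass this would call the fused Triton kernel instead
    return torch.nn.functional.rms_norm(q, q.shape[-1:], weight)

example_inputs = [torch.empty(4, 128), torch.empty(128)]
register_replacement(pattern, replacement, example_inputs, fwd_only,
                     qknorm_rope_pass)
```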
Co-authored-by: Angazenn <supperccell@163.com>
### Does this PR introduce _any_ user-facing change?
Yes, a new `additional_config` option is added.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Upstream vLLM PR [#30212](https://github.com/vllm-project/vllm/pull/30212)
refactored the attention backend selection interface. This PR adapts
vllm-ascend's `get_attn_backend_cls` to align with the new upstream
standard, ensuring compatibility and reducing maintenance overhead.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Co-authored-by: leo-pony <nengjunma@outlook.com>
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zxwang <1476209578@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
We refactored eagle_proposer.py to adapt to the framework of eagle.py in
vLLM v0.12.0, supporting the logits of the padded drafter batch and the
async scheduler.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
This PR adds triton-ascend installation to nightly e2e single card
environment.
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
The name of the smoke test file for DeepSeek EPLB has been changed, but
the name in the script hasn't been updated. Fix this bug.
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Add a Mooncake installation tutorial for the KV pool and update the
existing Mooncake installation tutorial.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
(1) Refactor npu_model_runner for profile_run.
(2) Move _select_moe_comm_method to ascend_forward_context.
(3) Delete _init_model_kwargs in npu_model_runner.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
### What this PR does / why we need it?
Fix the `extern "C"` declaration in the A2 kernel aclnn interface.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
Signed-off-by: tongrunze <t00574058@china.huawei.com>
Co-authored-by: tongrunze <t00574058@china.huawei.com>
### What this PR does / why we need it?
Add a user guide for speculative decoding that covers n-gram, EAGLE,
MTP, and suffix decoding.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
### What this PR does / why we need it?
When using MTP, full-decode-only, and async_scheduling together, the FIA
op gets an input error because 'actual_seq_lengths' is not non-decreasing.
This bug is caused by how the variable 'query_start_loc' is filled. We
need to fill the tail of 'query_start_loc' with 'cu_num_tokens' instead
of '-1'.
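A small illustration of the fix (tensor names and sizes are assumed for the example):

```python
import torch

cu_num_tokens = torch.tensor([0, 4, 9, 12])            # per-request prefix sums
query_start_loc = torch.full((8,), -1, dtype=torch.int64)

query_start_loc[: cu_num_tokens.numel()] = cu_num_tokens
# pad the unused tail with the last cumulative count instead of -1, so the
# derived actual_seq_lengths stays non-decreasing for the FIA op
query_start_loc[cu_num_tokens.numel():] = cu_num_tokens[-1]
```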
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Pin the fastapi version to == 0.123.10 (< 0.124.0).
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR removes apply_gramme in npu_model_runner. We move the logits to
CPU and do the same thing as gpu_model_runner.
This may change performance; we will revisit it after torch.compile is
supported with the NPU inductor.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
Related to #4084. Previously, we temporarily added patches so that
`set_forward_context` is patched by `set_ascend_forward_context` in the
`_process_image_input` and `_process_video_input` functions of the
`Qwen2.5-VL` and `Qwen2.5-Omni` models. After removing these patches, I
hit an `AttributeError` because `ForwardContext` is missing
`prefetch_mlp_enabled`. So we need to add a defensive check for
`prefetch_mlp_enabled`.
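A minimal sketch of the defensive check (the helper name is hypothetical; only the `getattr` fallback mirrors the fix described above):

```python
from vllm.forward_context import ForwardContext

def mlp_prefetch_enabled(forward_context: ForwardContext) -> bool:
    # A plain ForwardContext built outside set_ascend_forward_context has no
    # prefetch_mlp_enabled attribute, so fall back to False instead of raising.
    return getattr(forward_context, "prefetch_mlp_enabled", False)
```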
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--max-model-len 30000 \
--max-num-batched-tokens 50000 \
--max-num-seqs 30 \
--no-enable-prefix-caching \
--trust-remote-code \
--dtype bfloat16
```
```
{"id":"chatcmpl-b66d8acb76905c49","object":"chat.completion","created":1765796863,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration reads \"TONGYI Qwen.\"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":88,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The current KVPool has an accuracy issue:
https://github.com/vllm-project/vllm-ascend/issues/4412. This PR aims to
fix the precision problem without impacting prefill performance.
Note: Due to a bug in ADXL, calling `current_event.synchronize()` may
occasionally hang. This issue will be fixed in CANN version 8.5.RC1.
Before that release, you can manually build the master branch of the
project at https://gitcode.com/cann/hixl to resolve it.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: LCAIZJ <leichao139636@163.com>
### What this PR does / why we need it?
Currently, we are using `AscendRejectionSampler`, which extends
`RejectionSampler`, in spec decoding. `AscendRejectionSampler` overrides
`forward` of `RejectionSampler` only to replace the `rejection_sample`
func. As a result, a lot of code in `RejectionSampler` cannot be reused,
for example:
- https://github.com/vllm-project/vllm/pull/19482
- https://github.com/vllm-project/vllm/pull/26060
- https://github.com/vllm-project/vllm/pull/29223
#### Proposed Change:
- Delete `AscendRejectionSampler` and use `RejectionSampler` directly in
the model runner.
- Patch `RejectionSampler.expand_batch_to_tokens` and
`RejectionSampler.rejection_sample` (a minimal sketch follows this list);
a better approach may be to make them custom ops.
- Modify `NPUModelRunner` following
https://github.com/vllm-project/vllm/pull/26060
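
A minimal sketch of the patching idea (the Ascend replacement functions and their module path are hypothetical placeholders):

```python
from vllm.v1.sample.rejection_sampler import RejectionSampler

from vllm_ascend.sample.rejection_sampler import (  # hypothetical module path
    ascend_expand_batch_to_tokens, ascend_rejection_sample)

# Swap in the NPU-friendly implementations while keeping the rest of
# RejectionSampler (logprobs, logits processors, ...) reusable as-is.
RejectionSampler.expand_batch_to_tokens = ascend_expand_batch_to_tokens
RejectionSampler.rejection_sample = ascend_rejection_sample
```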
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async scheduling (test with
https://github.com/vllm-project/vllm-ascend/pull/4893/)
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
The fused alltoall operator itself was not designed or implemented to
handle the scenario where tensors are lists, but the weights for dynamic
load balancing are in list form.
Therefore, we have disabled this operator when using dynamic load
balancing.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
This PR introduces the initial integration of **UCM (Unified Cache
Management)** into the vllm-ascend distributed KV-cache system.
Specifically, it adds:
- A new `UCMConnector` implementation under the distributed KV-transfer
framework.
- Support for offloading KV-cache blocks to external UCM backends (DRAM
/ NFS / local disk), depending on the UCM configuration.
- Integration with vLLM V1 KV connector interface, including metadata
handling and role registration.
**Why it is needed:**
- UCM provides a unified, high-performance storage layer for KV-cache
externalization.
- This enables vllm-ascend to support out-of-core KV-cache workloads,
improve memory efficiency, and leverage hardware-accelerated storage
paths (RDMA / NFS / hybrid modes).
- This connector is a required component to allow future work on
multi-node inference + UCM-based scaling.
---
### Does this PR introduce _any_ user-facing change?
Yes, but limited:
- A new `kv_connector=UCMConnector` option becomes available through the
configuration interface (an illustrative launch command is shown below).
- When selected, vllm-ascend workers may initialize UCM and offload
KV-cache blocks externally.
- No default behaviors are changed. Users must explicitly enable this
connector.
This PR does **not** modify:
- existing APIs,
- default execution paths,
- model runner behavior,
- user workflow unless `UCMConnector` is configured.
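
For illustration, enabling the connector might look like the command below; the exact fields follow vLLM's `--kv-transfer-config` format and should be checked against the UCM documentation:

```shell
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --kv-transfer-config '{"kv_connector": "UCMConnector", "kv_role": "kv_both"}'
```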
---
### How was this patch tested?
---
### Prefix Caching Benchmark
We provide preliminary TTFT (ms) measurements using the vLLM benchmark.
Tests were run on 2 x Ascend 910B3, vllm-ascend 0.11.0, tensor parallel
size 2, with UCM (local disk) enabled.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
### What this PR does / why we need it?
This Pull Request removes the @pytest.mark.skip decorators from
test_mtp1_correctness_piecewise_graph and
test_mtp2_correctness_piecewise_graph.
These tests were temporarily skipped because of an issue with the MTP
ACL Graph (as per the original TODO comment). Since the relevant
bug/issue has been resolved, these tests are now re-enabled to ensure
full correctness coverage for MTP functionality.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Upgrade vllm commit hash to `4429d934de3c5cc327b0d7aec8e473aeba38db90`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Fix the bug `TypeError: 'NoneType' object is not iterable` in
vllm_ascend/compilation/acl_graph.py.
The reason is that `attn_metadata` is `None` in the `dummy_run` of MTP.
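A minimal sketch of the guard (the surrounding function name is a placeholder for the update logic in acl_graph.py):

```python
def update_attn_params(attn_metadata, layer_names):
    # The MTP dummy_run passes no attention metadata, so skip the per-layer
    # update instead of iterating over None.
    if attn_metadata is None:
        return
    for name in layer_names:
        metadata = attn_metadata[name]
        # ... update captured graph parameters for this layer ...
```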
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### What this PR does / why we need it?
Use `group_list[0]` to replace `group_diff[0]` in the function
`cumsum_group_list` (moe_mlp.py).
The purpose is to correct the logic for converting a cumsum into
per-group counts.
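An illustrative example of the corrected conversion (values are made up):

```python
import torch

group_list = torch.tensor([3, 7, 12])           # cumulative token counts per group
counts = torch.empty_like(group_list)
counts[0] = group_list[0]                       # first count is the first cumsum value
counts[1:] = group_list[1:] - group_list[:-1]   # -> tensor([3, 4, 5])
```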
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
### What this PR does / why we need it?
In the current KV Pool scenario for models like MLA and GQA, where
different TP ranks generate identical KV caches, the system is designed
to store only a single copy. The previous approach allowed each card to
query storage requirements dynamically, but inconsistent query results
across cards led to incorrect storage. To fix this, the new solution
pre-allocates storage responsibilities; each card now simply stores its
pre-assigned blocks, bypassing the inconsistent query step and ensuring
data correctness.
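An illustrative sketch of the pre-assignment idea (the ownership rule shown is a hypothetical example, not the actual assignment scheme):

```python
def blocks_owned_by(block_ids: list[int], tp_rank: int, tp_size: int) -> list[int]:
    # Each block is statically owned by exactly one TP rank, so only that card
    # writes it to the pool; no runtime query of the store is needed.
    return [b for b in block_ids if b % tp_size == tp_rank]

# e.g. with tp_size=2, rank 0 stores even blocks and rank 1 stores odd blocks
print(blocks_owned_by([0, 1, 2, 3, 4], tp_rank=0, tp_size=2))   # [0, 2, 4]
```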
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: fems14 <1804143737@qq.com>
The `attn_metadata` is not used by any draft proposer, so we can remove
it.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>