### What this PR does / why we need it?
Check the wildcard address for the layerwise connector.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Set proper 'text/event-stream; charset=utf-8' media type for streaming
requests instead of hardcoded 'application/json'
### What this PR does / why we need it?
This PR fixes an issue in the disaggregated prefill proxy server where
streaming requests (`"stream": true`) were always returned with a
hardcoded `Content-Type: application/json`, even when the backend vLLM
servers correctly returned Server-Sent Events (SSE) with `Content-Type:
text/event-stream; charset=utf-8`.
Specifically, the proxy used `StreamingResponse` with a fixed
`media_type` of `application/json`, which caused FastAPI to override the
response headers and break proper SSE semantics. As a result, clients
(e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not
reliably receive token-by-token streaming output.
In addition, this incorrect response type causes compatibility issues
with benchmarking and load-testing tools such as **EvalScope**. When
streaming is enabled, these tools expect SSE-formatted responses to
correctly parse token usage information. With the incorrect
`application/json` content type, EvalScope fails to parse the response
and reports errors similar to: `2025-12-15 09:27:56 - evalscope - ERROR: Failed to parse usage from response: list index out of range. Response: []`
This PR updates the proxy to:
- Detect whether the incoming request is a streaming request
(`stream=true`)
- Use `text/event-stream; charset=utf-8` for streaming responses
- Preserve `application/json` for non-streaming responses
This aligns the proxy behavior with native vLLM prefill/decoder servers
and the OpenAI-compatible streaming API contract.
Fixes incorrect streaming response headers that prevented proper
real-time token delivery.
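As a rough illustration of the approach (a minimal sketch, not the exact proxy code; the function and parameter names below are illustrative):

```python
from fastapi.responses import StreamingResponse

async def proxy_response(request_json: dict, upstream_body_iterator):
    # Pick the media type from the request: SSE for streaming, JSON otherwise.
    is_stream = bool(request_json.get("stream", False))
    media_type = (
        "text/event-stream; charset=utf-8" if is_stream else "application/json"
    )
    # FastAPI sets the Content-Type header from media_type, so streaming
    # clients now see proper SSE semantics instead of application/json.
    return StreamingResponse(upstream_body_iterator, media_type=media_type)
```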
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
This change was tested manually using a disaggregated prefill + decode setup with the proxy server.
### Test Steps
1. Start prefiller and decoder vLLM servers:
```bash
vllm serve --host 0.0.0.0 --port 8001 ...
vllm serve --host 0.0.0.0 --port 8002 ...
```
2. Start the proxy server:
```bash
python load_balance_proxy_server_example.py \
--host 127.0.0.1 --port 8000 \
--prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \
--decoder-hosts 127.0.0.1 --decoder-ports 8002
```
3. Send a streaming completion request through the proxy:
```bash
curl -i -X POST http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "test",
"prompt": "hello",
"max_tokens": 3,
"stream": true
}'
```
4. Verify the following:
- The response header is `Content-Type: text/event-stream; charset=utf-8`
- Tokens are streamed incrementally as SSE `data:` events
- Non-streaming requests still return `application/json`
No automated tests were added because this change affects an example
proxy
server and is limited to HTTP response headers. The behavior is directly
verifiable using standard SSE-compatible clients.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
Co-authored-by: zrj026 <zhangrunjiang026@gmail.com>
### What this PR does / why we need it?
The incorrect regular expression syntax `.*[UE4M3|ue4m3].*` uses a character class, so it actually ignores every identifier containing any one of the characters `U`, `E`, `4`, `M`, `3`, `|`, `u`, `e`, `m`:
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
This should be fixed to use grouping with alternation:
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
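The difference can be checked quickly with Python's `re` module (a small illustration; the typos tool uses its own regex engine, but the character-class vs. alternation semantics are the same):

```python
import re

# Character class: matches ANY single listed character, so unrelated
# identifiers such as "tensor" are ignored because they contain 'e'.
assert re.search(r".*[UE4M3|ue4m3].*", "tensor")

# Grouping with alternation: matches only the intended substrings.
assert re.search(r".*(UE4M3|ue4m3).*", "quant_UE4M3_scale")
assert not re.search(r".*(UE4M3|ue4m3).*", "tensor")
```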
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
The layerwise connector now supports the recompute scheduler.
NOTE:
Triggering recompute will invoke the tokenizer again, which may lead to
precision fluctuations.
[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache
Layerwise Push Support
https://github.com/vllm-project/vllm-ascend/issues/4842
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
The variables `self.prefiller_heap` and `self.decoder_heap` are used as `List[tuple[float, int, ServerState]]` but annotated as `List[tuple[int, int, ServerState]]`, which causes mypy to fail; see
https://github.com/vllm-project/vllm-ascend/actions/runs/21351411010/job/61448739554?pr=6265
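A minimal sketch of the corrected annotation (the `ServerState` stub and values here are illustrative only, not the proxy's real code):

```python
import heapq
from typing import List

class ServerState:  # placeholder for the proxy's real server-state class
    ...

# The heap entries are (load_score, index, server); the load score is a
# float, so the first tuple element must be annotated as float.
prefiller_heap: List[tuple[float, int, ServerState]] = []
heapq.heappush(prefiller_heap, (0.5, 0, ServerState()))
```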
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
d68209402d
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy is not processing requests.
But sometimes we need to **isolate** some faulty nodes while a large number of **requests** are coming in.
So we now support **isolating** faulty nodes by **lowering their priority** and **deleting** them once the proxy is no longer processing requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using the `/instances/remove` API to delete a node from the proxy server:
```bash
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* **(New)** When the proxy is processing requests, isolate the nodes and remove them once the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: yuxinshan <syx_ctyg@126.com>
### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604
This PR refactors vllm_ascend/distributed, moving all kv_transfer related code into a dedicated folder, as has already been done in vLLM.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lty <linhebiwen@gmail.com>
**[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool:**
https://github.com/vllm-project/vllm-ascend/issues/3380
### What this PR does / why we need it?
Support elastic scaling for P/D instances based on the Mooncake connector deployment.
**Support API routes**
* `/instances/add`: add prefill nodes or decode nodes to the list.
* `/instances/remove`: remove prefill nodes or decode nodes from the
list.
**Support functions**
* Support **adding** prefill nodes or decode nodes.
- If a prefill or decode server is deployed **after the proxy**, it can use the `/instances/add` API to join the proxy server. The prefill or decode server sends a signal to the proxy server, and the proxy server will check the status of the node until it is available.
* Support **removing** prefill nodes or decode nodes:
- Support using the `/instances/remove` API to **delete the node** from the proxy server.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`:
**Add 2 params**
When adding nodes to the proxy, the proxy will wait for the nodes to start, retrying up to a configured number of times.

| name | type | default | help |
| ----- | ---- | ---- | ---- |
| max-waiting-retries | int | 3 | Maximum number of retries when waiting for nodes to start |
| waiting-retry-interval | float | 10 | Check interval (seconds) when waiting for nodes to start |
For example:
```shell
python load_balance_proxy_server_example.py \
--host 0.0.0.0 --port 9000 \
--prefiller-hosts 127.0.0.1 127.0.0.1 \
--prefiller-ports 8100 8101 \
--decoder-hosts 127.0.0.1 127.0.0.1 \
--decoder-ports 8200 8201 \
--max-waiting-retries 3 \
--waiting-retry-interval 10
```
**Add 2 API routes**
* Add instances: `instances/add`
For example, add 2 prefiller instances:
```shell
curl -X POST http://localhost:9000/instances/add \
-H "Content-Type: application/json" \
-d '{
"type": "prefill",
"instances": ["127.0.0.1:8102", "127.0.0.1:8103"]
}'
```
Response:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102', '127.0.0.1:8103'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102', '127.0.0.1:8103'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
If the node '127.0.0.1:8103' has not been started:
```shell
{"message": "add prefill instances: ['127.0.0.1:8102']. Instances ['127.0.0.1:8103'] are waiting to be added.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* Remove instances: `instances/remove`
For example, remove 1 decoder instance:
```shell
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
Response:
```shell
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
Run the proxy and use the `/instances/add` API to add nodes and the `/instances/remove` API to remove nodes.
* vLLM version: v0.11.0.rc3
* vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0.rc3
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
### What this PR does / why we need it?
Pin the fastapi version to == 0.123.10 (< 0.124.0).
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
As support for the mooncake connector is now available, the llmdatadist
connector is no longer being maintained, so the llmdatadist-related
files need to be retired.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix a configuration error in our documentation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
NA.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
1. Optimize the multi-node waiting logic
2. Remove the `tee` pipeline for logs, which can lead to a hang issue
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Currently, there are two code paths for determining the chip type: `get_ascend_soc_version` uses the `get_soc_version` API in torch_npu, while `is_310p` uses `_build_info.__soc_version__`, which is generated at install time. We need to unify these two paths.
We need to unify them based on the following points:
1. We need to ensure consistency of chip type judgment between the compiling and running states;
2. In the compiling state, we need the chip type to complete op compilation, but in the running state we only need the device type (910B/910_93/310P/910_95/etc.) to make code-branch decisions;
3. In the compiling state, torch_npu may not have been installed yet, so we can't use torch_npu's API.
Based on the above points, we have made the following changes:
1. When the user sets the env `SOC_VERSION`, use it; when not set, query the soc_version via `npu-smi`;
2. Generate the device_type from the soc_version when compiling, and write `__device_type__` instead of `__soc_version__` into `_build_info.py`;
3. In the running state, use `__device_type__` to select the code branch.
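A rough sketch of the compile-time step under these assumptions (the `soc_to_device` entries and the helper below are illustrative, not the actual `setup.py` contents):

```python
# Illustrative mapping; the real table lives in setup.py as `soc_to_device`.
soc_to_device = {"ASCEND910B1": "910B", "ASCEND310P3": "310P"}

def write_build_info(soc_version: str, path: str = "_build_info.py") -> None:
    # Map the raw soc_version to a coarse device type and persist it, so that
    # at runtime only __device_type__ is needed for branch decisions.
    if soc_version not in soc_to_device:
        raise ValueError(f"SOC_VERSION {soc_version!r} is not in soc_to_device")
    with open(path, "w") as f:
        f.write(f"__device_type__ = {soc_to_device[soc_version]!r}\n")
```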
### Does this PR introduce _any_ user-facing change?
When the env `SOC_VERSION` is not set, it will no longer default to `ASCEND910B1`; instead, the soc_version is queried via `npu-smi`. The env `SOC_VERSION` must be in the `soc_to_device` list in `setup.py`.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR adds a load-balance DP proxy server which can be used in an external DP scenario without Disaggregated Prefill enabled. In addition, it adds a doc for external DP and the load-balance DP proxy server.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
See the new doc.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Fix the proxy when the host IP is specified as a domain name.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
The PD proxy now supports IPv6; the Mooncake connector checks whether an IPv6 address is used and notifies the user.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Drop VLLM_USE_V1 usage. This env var has already been removed from vLLM.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Closes #3728, #3657.
The main branch is now aligned with the vllm `releases/v0.11.1` branch,
which no longer supports `Python 3.9`. Check it
[here](https://github.com/vllm-project/vllm/blob/releases/v0.11.1/pyproject.toml).
### Does this PR introduce _any_ user-facing change?
The newest version of vllm-ascend doesn't support Python 3.9.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
1. Fix proxy format processing errors.
2. Layer-wise connector performance optimization.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Make the kv-transfer env variable take effect and fix the load-balance proxy.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version check logic.
TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base image for UT is still 8.2. We'll fix it later.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Upgrade torch-npu to the official release version 2.7.1
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Cancel tokenization for layerwise_proxy.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by ci
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Refactored the layerwise code to send to the D node first, preventing
P-node hangs due to communication timeouts when DP > 1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
We noticed that users sometimes build vllm-ascend with an incorrect torch version. In this case, the build passes, but when running the code, the error `AttributeError: '_OpNamespace' '_C_ascend' object has no attribute 'weak_ref_tensor'` is raised. Let's pin the torch version to 2.7.1 and check the torch version when building from source to fix the issue.
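A minimal sketch of such a build-time guard (assumed placement in setup.py; not the exact implementation):

```python
import importlib.metadata

REQUIRED_TORCH = "2.7.1"

def check_torch_version() -> None:
    # Fail the source build early instead of surfacing an obscure
    # AttributeError on the custom ops at runtime.
    installed = importlib.metadata.version("torch")
    if not installed.startswith(REQUIRED_TORCH):
        raise RuntimeError(
            f"vllm-ascend requires torch=={REQUIRED_TORCH}, but found {installed}"
        )
```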
closes: #3342
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix a proxy decode bug when parsing non-UTF-8 characters.
- vLLM version: v0.11.0
- vLLM main:
c9461e05a4
---------
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
### What this PR does / why we need it?
It's a tiny bugfix in the `gen_ranktable.py` script. The script is a utility that helps set up an example case; it is used to prepare a ranktable before a disaggregated prefill deployment.
Elements of the `local_device_ids` list should be cast to `int` before being used in a modulo (MOD) operation.
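A minimal sketch of the fix (values are illustrative; the surrounding `gen_ranktable.py` logic is omitted):

```python
# Device IDs typically arrive as strings (e.g. parsed from CLI or env),
# so cast them to int before the modulo operation.
local_device_ids = ["4", "5", "6", "7"]
local_ranks = [int(device_id) % 8 for device_id in local_device_ids]
```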
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
- vLLM version: v0.11.0
- vLLM main:
c9461e05a4
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
Modify the recomputation logic to prevent waiting requests from filling up the D-node KV cache.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
In certain scenarios, synchronously loading data from the pool performs better than asynchronously loading it. Therefore, a control switch for asynchronous loading from the pool has been added.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
This PR aims to fix the recompute out-of-memory bug in the decode instance. When recompute happens in decode, KV cache usage may exceed the pre-allocated memory and cause OOM.
So we propose a new scheduling strategy: when the decode instance cannot allocate a new block for running requests, we stop the requests that would be preempted. These stopped requests are recognized by the proxy, sent to the prefill instance again to recompute the KV cache, and then directed back to the decode instance.
This is a temporary plan to fix the bug. The long-term strategy is to use CPU offload in the decode instance.
### Does this PR introduce _any_ user-facing change?
An extra Ascend configuration option **recompute_scheduler_enable = True** is added to enable this strategy. The default value is `False`.
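For illustration, assuming the option is passed through vLLM's `--additional-config` (the exact wiring may differ):

```bash
# Hypothetical decode-instance launch enabling the recompute scheduler.
vllm serve <model> \
  --additional-config '{"recompute_scheduler_enable": true}'
```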
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
### What this PR does / why we need it?
Refactor the multi-machine CI use cases. The purpose of this PR is to make it easier to add multi-machine CI use cases, allowing developers to add multi-machine cluster model testing cases (including PD separation) by simply adding a new YAML configuration file.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
A new kv_role "kv_both" is added to run mixed deployment scenarios. The
mixed deployment will involve a decode phase, where with_prefill should
be false.
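For illustration, a mixed-deployment instance might be launched with `kv_role` set to `kv_both` (the connector name below is a placeholder; the other fields depend on the actual connector):

```bash
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "<YourConnector>", "kv_role": "kv_both"}'
```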
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
c60e6137f0
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Remove the chunked prefill branch in MLA, and change the dtype of `prefill_mask` to avoid an accuracy problem.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable KV cache reuse in scenarios with system prompts or multi-turn dialogues.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5963b98b46
---------
Signed-off-by: LCAIZJ <leichao139636@163.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: lizy124 <1950471827@qq.com>
Co-authored-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
The LLMDataDist connector adapts the distributed KV aggregation for the main branch. The P node is changed from returning "finish sending" only when TP0 responds to returning "finish sending" as soon as each NPU receives it. The D node will send a "finish receive" signal to the corresponding TP rank of the P node.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
GSM8K test on 2*A3 with 1P 1D:
- P: dp2 tp8, D: dp4 tp4
- P: dp2 tp8, D: dp2 tp8
- vLLM version: main
- vLLM main:
cc99baf14d
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
This PR adds the prefiller & decoder disaggregation deployment guide.
The scenario of the guide is:
- 3 nodes in total, with 2 NPUs on each node
- Qwen3-30B-A3B
- 1P2D
- Expert Parallel
The deployment can be used to verify PD Disaggregation / Expert Parallel features with relatively few resources.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
The details have been clarified in this issue:
https://github.com/vllm-project/vllm-ascend/issues/2557
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Easy to test because we just need to echo the variable.
- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6
---------
Signed-off-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: LCAIZJ <leichao139636@163.com>
### What this PR does / why we need it?
Resolve the issue of waiting queue accumulation when requests are
canceled.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.10.1.1
- vLLM main:
006477e60b
---------
Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
This PR adopts the Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation.
### Does this PR introduce any user-facing change?
No
### Dependencies
1. CANN Dependencies
Using the Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (see details in Mooncake [#502](https://github.com/kvcache-ai/Mooncake/pull/502)).
2. vllm-ascend
This PR depends on changes introduced by #950 (modifications to
`model_runner_v1`) and #1361 (updates to `schedule`), both of which have
been merged into the `v0.9.1-dev` branch and are expected to land in
`main` shortly.
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
1c859a1387
---------
Signed-off-by: leichao.lc <leichao139636@163.com>
Co-authored-by: jianzs <zheng.shoujian@outlook.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: chris668899 <15105191595@126.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
We recently added the disaggregated_prefill and ascend_forward_context features in
ba3dfbd59e
and
df0ec55162.
This PR fixes some nits introduced by them to make the code clearer.
1. Drop `current_platform` usage; it can lead to obscure circular import errors in some cases.
2. Update the `set_ascend_forward_context` function to make the logic clearer, for example by removing V0 support from this function.
3. Remove the unused `self.local_rank_across_dp` in the worker.
4. Remove `soc_info.py` and use `get_ascend_soc_version` instead.
- vLLM version: v0.10.0
- vLLM main:
02f82fe438
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- On the D node, the `max-num-batched-tokens` parameter can be set to a smaller value since the D node processes at most `max-num-seqs` batches concurrently. As the profile_run only needs to handle `max-num-seqs` sequences at a time, we can safely set `max-num-batched-tokens` equal to `max-num-seqs`. This optimization helps reduce activation memory consumption (see the example below).
- Restore the default configuration items for PD separation.
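For illustration only, a decode-node launch following this guidance might look like the following (model and values are placeholders, not recommended settings):

```bash
vllm serve <model> \
  --max-num-seqs 256 \
  --max-num-batched-tokens 256
```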
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
61dcc280fa
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
This PR is a cherry-pick from v0.9.1:
https://github.com/vllm-project/vllm-ascend/pull/1953
It introduces a new load-balance proxy server example implementation for disaggregated PD, which supports a simple token- and kv_cache-aware load-balancing routing strategy for the disaggregated PD system, compared with the original round-robin toy_proxy.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
tested on real workload and unittest
- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Currently our workflow takes about 3 hours in total, which seriously affects the developer experience, so optimization is urgent. After this PR, the full CI run time is expected to be shortened to 1h40min.
- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
a2480251ec
---------
Signed-off-by: wangli <wangli858794774@gmail.com>