We recently added the disaggregated_prefill and ascend_forward_context
features in
ba3dfbd59e
and
df0ec55162.
This PR fixes some nits they introduced to make the code clearer:
1. Drop `current_platform` usage, which can lead to circular import
errors in some cases.
2. Update the `set_ascend_forward_context` function to make its logic
clearer; for example, remove V0 support from it.
3. Remove the unused `self.local_rank_across_dp` in the worker.
4. Remove `soc_info.py` and use `get_ascend_soc_version` instead (see
the sketch below).
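As a rough illustration of item 4, callers can query the SoC version through the shared helper rather than a dedicated `soc_info.py` module. A minimal sketch, assuming the helper lives in `vllm_ascend.utils`; this is not the PR's actual diff:

```python
# Minimal sketch, not the PR's actual diff: query the SoC version via
# the shared helper instead of a dedicated soc_info.py module.
# The import path is an assumption based on the PR description.
from vllm_ascend.utils import get_ascend_soc_version

def log_soc_version() -> None:
    # get_ascend_soc_version() becomes the single source of truth
    # once soc_info.py is removed.
    print(f"Ascend SoC version: {get_ascend_soc_version()}")
```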
- vLLM version: v0.10.0
- vLLM main:
02f82fe438
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- On the D (decode) node, the max-num-batched-tokens parameter can be
set to a smaller value, since the D node processes at most max-num-seqs
batches concurrently. Because profile_run only needs to handle
max-num-seqs sequences at a time, we can safely set
max-num-batched-tokens equal to max-num-seqs. This reduces activation
memory consumption (see the sketch after this list).
- Restore the default configuration items for PD separation.
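For illustration only, a decode-node engine could pin the two limits together. A minimal sketch, assuming the standard vLLM `LLM` entry point; the model name and values are placeholders, not taken from this PR:

```python
# Hypothetical decode-node setup (placeholder model and values): with
# max_num_batched_tokens equal to max_num_seqs, profile_run sizes
# activation memory for at most max_num_seqs decode tokens per step.
from vllm import LLM

llm = LLM(
    model="some-org/some-model",  # placeholder, not from the PR
    max_num_seqs=256,
    max_num_batched_tokens=256,   # == max_num_seqs on the D node
)
```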
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
61dcc280fa
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
This PR is a cherry-pick from v0.9.1:
https://github.com/vllm-project/vllm-ascend/pull/1953
It introduces a new load-balancing proxy server example for
disaggregated PD, which supports a simple token- and kv_cache-aware
load-balancing routing strategy for the disaggregated PD system, in
contrast to the original round-robin toy_proxy.
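As a rough sketch of the routing idea (illustrative only; the names and structure below are not taken from the example's actual code):

```python
# Illustrative token-aware routing, not the proxy example's real code:
# send each request to the backend with the least outstanding work,
# instead of rotating round-robin.
from dataclasses import dataclass

@dataclass
class Backend:
    url: str
    inflight_tokens: int = 0  # tokens currently queued on this backend

def pick_backend(backends: list[Backend], prompt_tokens: int) -> Backend:
    # Least-loaded choice by token count; a kv_cache-aware variant
    # would also weigh free KV-cache blocks reported by each server.
    best = min(backends, key=lambda b: b.inflight_tokens)
    best.inflight_tokens += prompt_tokens
    return best
```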
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
tested on real workload and unittest
- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Currently our full workflow takes about 3 hours to run, which seriously
hurts the developer experience, so optimizing it is urgent. After this
PR, the full CI run is expected to drop to about 1h40min.
- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 to 2 parallel TP2 jobs (max-parallel: 2)
- Move the DeepSeek-V2-Lite-W8A8 test to a single card
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
a2480251ec
---------
Signed-off-by: wangli <wangli858794774@gmail.com>