### What this PR does / why we need it?
Fix an out-of-memory issue in the layerwise connector during large buffer transfers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By nightly CI.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Previously, when doing a put to the KV Pool, we found the first non-existing key and put all blocks starting from that index. However, if the prefix-cache blocks come from another request and some of them have been evicted by LRU, we end up putting blocks that still exist in the pool, causing MooncakeStore to print unnecessary logs in the master service.
What this PR does:
- Look up all the keys and only put the ones that are missing (see the sketch below).
- Fix lookup_scheduler in pool_worker so it handles GQA correctly.
- Fix a few existing typos.
- Add unit tests (written by Codex).
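A minimal sketch of the lookup-then-put path; `_pool`, `lookup_all`, and `put` are hypothetical stand-ins for the MooncakeStore client calls, not its real API:
```python
_pool: dict[str, bytes] = {}  # stand-in for the remote KV pool

def lookup_all(keys: list[str]) -> list[bool]:
    return [k in _pool for k in keys]

def put(key: str, block: bytes) -> None:
    _pool[key] = block

def put_missing_blocks(keys: list[str], blocks: list[bytes]) -> None:
    # Previously: find the first missing key and put everything after it.
    # Now: check every key and upload only the blocks that are missing.
    for key, block, hit in zip(keys, blocks, lookup_all(keys)):
        if not hit:
            put(key, block)
```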
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: DreamerLeader <2270923832@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Adjust the timing of popping requests from the request map. This pull request optimizes the KV cache transfer mechanism by streamlining how requests are tracked and cleaned up. By removing unnecessary mapping structures and adjusting the timing of request removal, the system achieves more efficient state management during the transfer process.
Picked from: https://github.com/vllm-project/vllm-ascend/pull/7855
### How was this patch tested?
By CI.
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Mooncake layerwise connector supports Mamba prefix caching on prefiller
nodes.
### Does this PR introduce _any_ user-facing change?
Yes. Use `--enable-prefix-caching` and `--mamba-cache-mode align` to enable Mamba align-mode prefix caching on P/D prefill nodes. This function is not supported on decode nodes yet.
### How was this patch tested?
By P/D E2E test.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
This PR introduces several upstream `vllm`-aligned lint hooks into
`vllm-ascend` and makes them part of the actual `pre-commit` flow.
Main changes in this PR:
- add `check-boolean-context-manager` to catch boolean expressions in
`with` statements
- add `check-forbidden-imports` to forbid direct `re` imports and
disallowed direct `triton` imports
- enable shell script linting through `tools/shellcheck.sh`
- add root `.clang-format` aligned with upstream `vllm`, enable
`clang-format` in `pre-commit`, temporarily **exclude all `csrc/**`**
from `clang-format` to avoid bringing a large native code reformat into
this PR
This PR focuses on landing the smaller and immediately useful lint
alignment first, without mixing in the larger requirements-management
migration.
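For illustration, a minimal sketch of what a forbidden-import hook of this kind can look like; the real hook's rules and output format may differ:
```python
import sys

def check_file(path: str) -> bool:
    ok = True
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.strip()
            if stripped == "import re" or stripped.startswith(
                ("import re ", "import re,", "from re import")
            ):
                print(f"{path}:{lineno}: direct `re` import is forbidden")
                ok = False
    return ok

if __name__ == "__main__":
    # pre-commit passes the staged file names as arguments
    results = [check_file(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```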
### Does this PR introduce _any_ user-facing change?
No.
This PR only updates repository lint configuration, static checks, and
internal import/style enforcement. It does not change runtime behavior
or public interfaces.
### How was this patch tested?
Tested locally in the project virtual environment.
Commands used:
```bash
bash format.sh
```
Verified checks passed:
```
ruff check...............................................................Passed
ruff format..............................................................Passed
codespell................................................................Passed
typos....................................................................Passed
clang-format.............................................................Passed
Lint GitHub Actions workflow files.......................................Passed
Lint shell scripts.......................................................Passed
Lint PNG exports from excalidraw.........................................Passed
Check for spaces in all filenames........................................Passed
Enforce __init__.py in Python packages...................................Passed
Check for forbidden imports..............................................Passed
Check for boolean ops in with-statements.................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s
To bypass pre-commit hooks, add --no-verify to git commit.
```
**note:**
clang-format is enabled but currently excludes all csrc/**
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This log is printed too frequently and is unnecessary, so this PR lowers its level from INFO to DEBUG.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
Fix the mooncake layerwise connector dying when update_decoder_info fails. In the scenario where node D is dead, node P failing to update_decoder_info should not cause node P to die as well.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
By CI.
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
1. When FullGraph mode is used, branches in the Triton operator are compiled and fixed during graph capture. This invalidates the branch condition in the `fused_recurrent_gated_delta_rule` operator that checks `ssm_state_indices >= 0` before writing to the SSM cache: the write is now performed regardless of the value. As a result, after -1 is used for padding in the vLLM GDN backend, the operator computes address offsets and writes to the SSM cache based on the -1 offset. Since the conv cache and SSM cache in the vLLM Ascend implementation are actually a single contiguous tensor divided into two parts, this leads to data overwriting and the generation of NaN values.
This PR addresses the two cases where -1 padding is required in the GDN metadata builder, replacing the padding with 0 (block 0 is a reserved block) to avoid memory overwriting; see the sketch after this list.
2. Fix a layerwise connector bug for Mamba cache sending on heterogeneous TP.
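A minimal PyTorch sketch of the padding change in item 1, with illustrative names (not the exact GDN metadata-builder code):
```python
import torch

def pad_ssm_state_indices(indices: torch.Tensor, target_len: int) -> torch.Tensor:
    pad = target_len - indices.numel()
    # Pad with 0 rather than -1: block 0 is a reserved block, so a captured
    # graph that writes unconditionally cannot corrupt live cache data.
    return torch.nn.functional.pad(indices, (0, pad), value=0)
```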
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
The LayerwiseConnector supports virtual-push functionality on node D. By adding a do_virtual flag to request metadata, the system can identify and process certain requests virtually, bypassing the actual KV cache transfer. This allows these requests to complete immediately from the consumer's perspective, enabling optimizations or specific testing scenarios where physical data transfer is not required.
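A minimal sketch of the virtual-push path; `ReqMeta` and the helpers below are illustrative stand-ins, not the connector's real structures:
```python
from dataclasses import dataclass

finished_recving: set[str] = set()  # requests the consumer sees as done

@dataclass
class ReqMeta:
    request_id: str
    do_virtual: bool = False  # new flag carried in request metadata

def transfer_kv_cache(meta: ReqMeta) -> None:
    """Normal path: physical KV-cache transfer (placeholder)."""

def handle_request(meta: ReqMeta) -> None:
    if meta.do_virtual:
        # Bypass the physical transfer and complete immediately
        # from the consumer's (node D's) perspective.
        finished_recving.add(meta.request_id)
        return
    transfer_kv_cache(meta)
```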
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By CI.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
#### Problem
When the decode node enables prefix cache and the local prefix cache fully hits, the following assertion error occurs:
```
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 520, in step_with_batch_queue
(EngineCore_DP3 pid=34912) engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 1520, in update_from_output
(EngineCore_DP3 pid=34912) self._update_from_kv_xfer_finished(kv_connector_output)
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 2120, in _update_from_kv_xfer_finished
(EngineCore_DP3 pid=34912) assert RequestStatus.is_finished(req.status)
(EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912) AssertionError
```
The error is triggered in scheduler.py at _update_from_kv_xfer_finished:
```
if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
self.finished_recving_kv_req_ids.add(req_id)
else:
assert RequestStatus.is_finished(req.status)
```
#### Root Cause
When the decode node has prefix cache enabled and the local prefix cache fully hits:
1. get_num_new_matched_tokens returns ext_tokens=0, load_kv_async=False
when decode prefix cache fully hits
2. Request status becomes RUNNING (not WAITING_FOR_REMOTE_KVS)
3. However, update_state_after_alloc still adds the request to
_reqs_need_recv because remote_block_ids exists in kv_transfer_params
4. Worker processes the request in _handle_request:
- _transfer_kv_cache returns immediately (no actual transfer,
local_block_ids is empty)
- finally block still calls update_done_task_count(request_id)
5. finished_recving contains this request
6. When _update_from_kv_xfer_finished processes finished_recving,
request status is RUNNING
7. Assertion fails
#### Solution
In _handle_request, only notify the scheduler (update_done_task_count) when an actual KV transfer happened (local_block_ids is not empty). The signals that notify Prefill to release the KVCache (_send_done_signal_to_free_remote_port and _send_done_recv_signal) are still sent regardless, as sketched below.
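A minimal sketch of the guard; the host class name is illustrative and the helper methods are the connector's existing ones:
```python
class ConnectorWorker:  # illustrative host class
    def _handle_request(self, request_id: str,
                        local_block_ids: list[int]) -> None:
        try:
            self._transfer_kv_cache(request_id, local_block_ids)
        finally:
            # Always tell Prefill it may free its KV cache.
            self._send_done_signal_to_free_remote_port(request_id)
            self._send_done_recv_signal(request_id)
            # Only notify the scheduler when a transfer actually ran;
            # otherwise a fully-hit RUNNING request would land in
            # finished_recving and trip the scheduler's assertion.
            if local_block_ids:
                self.update_done_task_count(request_id)
```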
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: LCAIZJ <leichao139636@163.com>
### What this PR does / why we need it?
Some bug fixes, mainly including:
1. On A2, the number of experts per card cannot be greater than 16 when using MC2. This PR fixes an error in the A2 MoE communication-method selection that caused an incorrect method to be chosen when the number of model experts exceeds 256: for example, loading the PD-disaggregation D node with Qwen3.5 series models on a 16-card A2 setup would incorrectly choose the MC2 method.
2. Fix the issue where the layerwise connector sends the KV cache of the MTP layer multiple times when `num_spec_tokens` > 1. Now the KV cache is sent only when the MTP layer runs its forward pass for the first time.
3. Fix an accuracy issue of Qwen3.5 when using MTP for PD disaggregation. The cause is that `num_decode_draft_tokens` does not account for `spec_tokens` not existing during the first inference under PD disaggregation (`spec_tokens` are generated during that first inference), while `spec_tokens_padding` is still added by the `recomputed_scheduler`. As a result, `gdn_metadata` incorrectly assumes a prefill of length 2 is performed.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>
### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD-separation scenario. C8 refers to quantizing the KV cache to int8, which reduces the GPU memory usage of the KV cache and improves inference throughput.
Constraints:
1. Only the PD-separation mode can be used, and MooncakeLayerwiseConnector can be used to run the model.
2. Currently, only dynamic quantization is supported for activations, and only static quantization for the KV cache. C8 quantization with MTP is not supported. You can use ModelSlim for quantization. The quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Conditional import for Mooncake: the import of `mooncake.engine.TransferEngine` was moved into a try-except block in the `GlobalTE` class's constructor. This ensures that mooncake is only imported when needed and provides a clear error message with installation instructions if it is missing.
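A minimal sketch of the pattern; the class name comes from the PR text, but the body and error message here are illustrative:
```python
class GlobalTE:
    def __init__(self) -> None:
        try:
            # Imported lazily so environments without mooncake can still
            # load this module; only constructing GlobalTE requires it.
            from mooncake.engine import TransferEngine
        except ImportError as e:
            raise RuntimeError(
                "mooncake is not installed; please install it to use the "
                "Ascend transfer engine (see the Mooncake project docs)."
            ) from e
        self.engine = TransferEngine()
```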
### Does this PR introduce _any_ user-facing change?
The CI error "ascend_transport.so: cannot open shared object file: No such file or directory" is fixed, ensuring the CI runs normally.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
### What this PR does / why we need it?
LMCache-Ascend is LMCache's solution on the Ascend platform and one of
the KVCache pooling solutions for Ascend. We hope to integrate
LMCache-Ascend into the vLLM-Ascend community as one of the official
KVCache pooling solutions for vLLM-Ascend.
We added a new LMCacheAscendConnector in vLLM-Ascend and registered it.
### Does this PR introduce _any_ user-facing change?
Users can specify the KV connector using `--kv-transfer-config`, allowing them to freely choose which connector to use; there is no other user-facing change.
### How was this patch tested?
Test by specifying `--kv-transfer-config
'{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'`
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: chloroethylene <jjysama@gmail.com>
### What this PR does / why we need it?
Currently, we call lookup_client to look up token hits in the KV Pool. However, when the token length is smaller than the block size, the key will be empty and there is no point in querying the KV Pool backend, since there can never be a hit.
Hence, add an early return in `get_num_new_matched_tokens` when `token_len` < `block_size`.
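A minimal sketch of the early return, assuming an illustrative surrounding class and return convention:
```python
class PoolScheduler:
    def __init__(self, block_size: int) -> None:
        self.block_size = block_size

    def get_num_new_matched_tokens(self, token_len: int) -> int:
        if token_len < self.block_size:
            # Keys are derived per full block, so a request shorter than
            # one block can never hit the KV pool: skip the remote lookup.
            return 0
        return self._lookup_remote(token_len)

    def _lookup_remote(self, token_len: int) -> int:
        return 0  # placeholder for the real KV Pool backend lookup
```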
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Mooncake Layerwise Connector supports the hybrid attention manager with multiple KV-cache groups.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
This PR creates and registers `ascend_multi_connector`, which allows the
`mooncake_layerwise_connector` to use the kv_pooling feature.
We unregister the original vllm's `MultiConnector` and replace it with
`AscendMultiConnector` when registering the connectors.
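A minimal sketch of the registration swap; `KVConnectorFactory.register_connector` follows vLLM's factory API, but the module path and the unregister step here are illustrative:
```python
from vllm.distributed.kv_transfer.kv_connector.factory import KVConnectorFactory

# Drop the original entry so the name can be rebound (illustrative).
KVConnectorFactory._registry.pop("MultiConnector", None)

KVConnectorFactory.register_connector(
    "MultiConnector",
    "vllm_ascend.distributed.kv_transfer.ascend_multi_connector",
    "AscendMultiConnector",
)
```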
### Does this PR introduce _any_ user-facing change?
No. Users can still use `MultiConnector`; it now initializes `AscendMultiConnector`.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Fix a bug in PD disaggregation with PCP/DCP: some conditions only consider MLA while ignoring DSA.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
15d76f74e2
- vLLM Ascend main: 81fb7d5779
Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>
### What this PR does / why we need it?
Fix wrong computed_tokens when an exception occurs. This pull request addresses a bug in the KV transfer mechanism where an exception during token-lookup operations could lead to an incorrect computed_tokens count. By modifying the exception handling in both the lookup and lookup_scheduler functions to return 0 instead of the start index, the system now correctly indicates that no tokens were successfully processed when a remote connection failure occurs. This improves the robustness and accuracy of token management within the vllm_ascend distributed KV pool.
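A minimal sketch of the change; the client class and remote call are illustrative:
```python
class PoolLookupClient:
    def _remote_lookup(self, keys: list[str]) -> int:
        return 0  # placeholder for the remote call, which may raise

    def lookup(self, keys: list[str], start_index: int) -> int:
        try:
            return start_index + self._remote_lookup(keys[start_index:])
        except ConnectionError:
            # Previously this returned start_index, overstating
            # computed_tokens; 0 signals that no tokens were matched.
            return 0
```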
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Signed-off-by: xleoken <xleoken@163.com>
### What this PR does / why we need it?
The incorrect regular-expression syntax `.*[UE4M3|ue4m3].*` is a character class, so it actually ignores all identifiers containing any of the characters `U`, `E`, `4`, `M`, `3`, `u`, `e`, `m`, or `|`:
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
Fixed to use an alternation group:
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*(UE4M3|ue4m3).*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
Mooncake layerwise connector supports the PCP function.
PCP (Prefill Context Parallelism) support: introduce explicit support for Prefill Context Parallelism (PCP) and Decode Context Parallelism (DCP) in the Mooncake layerwise KV cache transfer mechanism, allowing more granular control and awareness of parallel configurations during data transfer.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By CI.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
refer to https://github.com/vllm-project/vllm-ascend/issues/6391,
Currently adapted the complete process of event publishing in vllm:
* `kv_connector_model_runner_mixin` invoke kv-connector
`get_kv_connector_kv_cache_events` func to collect kvevents
* in `scheduler.py` , it's `update_from_output` func will invoke
`_update_from_kv_xfer_finished` which invoke
`connector.update_connector_output` to collect kv-events from all
kv-worker, and then scheduler will invoke `connector.take_events` api to
collect all kv-events and add it to the events which from
`kv_cache_manager`
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
You can add the `--kv-events-config` parameter to the `vllm serve` command to enable this feature.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: yejj710 <abyss1999@163.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
As #2947 describes, we need to transpose the KV cache layout after a GQA KV transfer when the prefill and decode tensor-parallel sizes are heterogeneous. The previous implementation used `npu_paged_cache_load` + `transpose` + `_npu_reshape_and_cache` to do this.
That is not an efficient plan: the ops must be called for each layer, introducing 3 * layer_num kernel launches and 6 * layer_num data movements between L1 cache and HBM for one request on the decode node. Decode nodes usually run in graph mode, so these kernels are launched between decode forward passes by an async thread in the mooncake connector; they may span several decode forward passes, increasing TTFT by 3~4 decode forward times.
In this PR, we implement an AscendC fused op `transpose_kv_cache_by_block` that does this work with a single kernel launch and moves data between L1 cache and HBM only once.
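For intuition, an eager PyTorch equivalent of the layout change; the tensor layouts here are assumptions for illustration, and the fused AscendC op performs this in a single kernel launch:
```python
import torch

def transpose_kv_cache_by_block_eager(src: torch.Tensor) -> torch.Tensor:
    # Assumed layouts: src is [num_blocks, block_size, num_kv_heads, head_dim]
    # as received from prefill; decode expects the per-block layout
    # [num_blocks, num_kv_heads, block_size, head_dim].
    return src.transpose(1, 2).contiguous()
```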
After using this fused op, the time spent transposing the KV cache layout decreases from 7 ms to 0.24 ms in a unit test on 910C, and in the PD-disaggregation scenario TTFT decreases by about 90~110 ms on qwen3-235B.
| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
| 1 | 643 ms | 578 ms |
| 128 | 1480 ms | 1368 ms |
### Does this PR introduce _any_ user-facing change?
The fused op is used by default; in case it has a bug in some scenario, a fallback env is provided to disable it.
**DISABLE the fused op by setting the following env:**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
### What this PR does / why we need it?
Use the cache-load operator to replace the index-select operator.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
This PR fixes a critical bug in the PD-separated inference pipeline
where KV cache on the Prefill (P) side was not being properly released.
The issue arises when multiple clients use the same x-request-id: to
avoid request ID collisions, both Prefill and Decode nodes append a
random suffix to the incoming x-request-id. A previous PR ensured
consistency by having the P-side pass its final request_id as
remote_request_id to the D-side via kv_transfer_param. However, during
KV cache cleanup, the D-side incorrectly used the local req_id (instead
of remote_request_id) to select the target P-side rank. This mismatch
caused the P-side KV cache to remain unreleased on certain ranks,
leading to memory leaks. This PR corrects the logic to use
remote_request_id consistently when determining the P-side rank.
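A minimal sketch of the corrected selection; the hash-based mapping is an illustrative stand-in for the connector's real request-id-to-rank logic:
```python
def select_prefill_rank(kv_transfer_params: dict, prefill_tp_size: int) -> int:
    # Use remote_request_id (the P side's own id, carried in
    # kv_transfer_params) rather than the D side's local req_id, so the
    # release message reaches the rank that actually holds the KV cache.
    remote_request_id: str = kv_transfer_params["remote_request_id"]
    return hash(remote_request_id) % prefill_tp_size
```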
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The fix was validated by running multiple concurrent benchmark instances.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: ghphotoframe <854746559@qq.com>
### What this PR does / why we need it?
To support elastic scaling when using mooncake connector, we should
support to **configure different tp sizes for different nodes**.
As a result, we transfer the prefill node information, such as tp size,
through **the request's kv_transfer_params**.
The decode nodes **get the prefill tp size** through the request's
kv_transfer_params, instead of getting it from the configuration of the
mooncake connector .
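A minimal sketch of reading the prefill TP size from the request; the field name `remote_tp_size` and the config fallback are assumptions:
```python
def get_prefill_tp_size(request, connector_config) -> int:
    params = getattr(request, "kv_transfer_params", None) or {}
    # Prefer the per-request value sent by the prefill node; fall back to
    # the static connector config only when the request does not carry it.
    return params.get("remote_tp_size", connector_config.prefill_tp_size)
```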
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
This reverts commit 8966a99710.
It breaks the test
`tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`
- vLLM version: v0.14.0
- vLLM main:
d68209402d
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/distributed/kv_transfer/__init__.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` |
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
**Refactor: Unify full-graph parameter update logic**
This PR consolidates the scattered full-graph parameter update logic
into a unified approach, improving code architecture and eliminating
duplication.
**Key improvements:**
1. **Unified interface**
- Create `update_full_graph_params` as the single entry point for all
full-graph updates
- Replace multiple scattered update calls with one unified function
- Remove ~50 lines of duplicated if-else logic across
`model_runner_v1.py` and `eagle_proposer.py`
2. **Better architecture**
- Move update logic to respective Backend classes
(`AscendAttentionBackend`, `AscendMLABackend`)
- Each Backend manages its own parameter update logic internally
- Simplify caller code to just dispatch to the appropriate Backend
3. **Cleaner parameter handling**
- Remove unnecessary `pcp_size` and `dcp_size` parameter passing
- Get parallel configuration directly from distributed groups
- Consistent with how other parts of the codebase obtain these values
**Why we need it:**
- **Maintainability**: Future changes only need to be made in one place
per Backend
- **Code quality**: Follows DRY principle and Single Responsibility
Principle
- **Readability**: Cleaner, more intuitive code structure
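A minimal sketch of the dispatch pattern, assuming illustrative method bodies; the Backend class names come from the PR text:
```python
class AscendAttentionBackend:
    def update_full_graph_params(self, graph_params, attn_metadata) -> None:
        ...  # vanilla-attention-specific parameter updates live here

class AscendMLABackend:
    def update_full_graph_params(self, graph_params, attn_metadata) -> None:
        ...  # MLA-specific parameter updates live here

def update_full_graph_params(backend, graph_params, attn_metadata) -> None:
    # Single entry point: callers dispatch to the backend instead of
    # branching on its type at every call site.
    backend.update_full_graph_params(graph_params, attn_metadata)
```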
### Does this PR introduce _any_ user-facing change?
**No.** This is a pure refactoring with no functional changes - same
behavior, cleaner code.
### How was this patch tested?
- All existing unit tests pass with updated mocks
- No new tests needed (pure refactoring, no behavior changes)
- CI validates correctness
---
- vLLM version: v0.13.0
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Remove restrictions on mooncake for IPv6.
Dependencies: CANN 8.5, mooncake v0.3.8.post1
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
ucm_connector add has `has_connector_metadata` interface to adapt to the
latest KV connector in vLLM.
### Does this PR introduce _any_ user-facing change?
This PR doesn't introduce any user-facing change.
### How was this patch tested?
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
### What this PR does / why we need it?
Rectify the problem that the pcp and pd separation and kv pooling
scenario.
In the pooling scenario, multi_nodes_meta_mapping is empty. As a result,
an error is reported when the remote_host information is obtained
through the get_remote_port_send_num method.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
When the P node accesses the proxy meteserver, add the SSL certificate
and the CA certificate path to improve security.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By CI.
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>