<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
Before when we do put for KV Pool, we find the first non-existing key
and put all the blocks starting from that index; however, if the prefix
cache blocks is from another request, and some of the blocks are evicted
due to LRU, we will be putting blocks that still exist in the pool, and
causing MooncakeStore printing unnecessary logs in master service.
What this PR does:
Now we lookup all the keys and only put the ones that are missing.
Fix lookup_scheduler in pool_worker so it handles GQA correctly.
Fixes a few existing typos
Add UT, written by codex
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: DreamerLeader <2270923832@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
This PR introduces several upstream `vllm`-aligned lint hooks into
`vllm-ascend` and makes them part of the actual `pre-commit` flow.
Main changes in this PR:
- add `check-boolean-context-manager` to catch boolean expressions in
`with` statements
- add `check-forbidden-imports` to forbid direct `re` imports and
disallowed direct `triton` imports
- enable shell script linting through `tools/shellcheck.sh`
- add root `.clang-format` aligned with upstream `vllm`, enable
`clang-format` in `pre-commit`, temporarily **exclude all `csrc/**`**
from `clang-format` to avoid bringing a large native code reformat into
this PR
This PR focuses on landing the smaller and immediately useful lint
alignment first, without mixing in the larger requirements-management
migration.
### Does this PR introduce _any_ user-facing change?
No.
This PR only updates repository lint configuration, static checks, and
internal import/style enforcement. It does not change runtime behavior
or public interfaces.
### How was this patch tested?
Tested locally in the project virtual environment.
Commands used:
```bash
bash format.sh
```
Verified checks passed:
``` bash
ruff check...............................................................Passed
ruff format..............................................................Passed
codespell................................................................Passed
typos....................................................................Passed
clang-format.............................................................Passed
Lint GitHub Actions workflow files.......................................Passed
Lint shell scripts.......................................................Passed
Lint PNG exports from excalidraw.........................................Passed
Check for spaces in all filenames........................................Passed
Enforce __init__.py in Python packages...................................Passed
Check for forbidden imports..............................................Passed
Check for boolean ops in with-statements.................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s
To bypass pre-commit hooks, add --no-verify to git commit.
```
**note:**
clang-format is enabled but currently excludes all csrc/**
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This log is printed too frequently and unecessary, Thus lowering its
level from INFO to DEBUG.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
LMCache-Ascend is LMCache's solution on the Ascend platform and one of
the KVCache pooling solutions for Ascend. We hope to integrate
LMCache-Ascend into the vLLM-Ascend community as one of the official
KVCache pooling solutions for vLLM-Ascend.
We added a new LMCacheAscendConnector in vLLM-Ascend and registered it.
### Does this PR introduce _any_ user-facing change?
Users can specify the kvconnector using `--kv-transfer-config`, allowing
them to freely choose which kvconnector to use, without any user-facing
change.
### How was this patch tested?
Test by specifying `--kv-transfer-config
'{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'`
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: chloroethylene <jjysama@gmail.com>
### What this PR does / why we need it?
Currently, we call lookup_client for looking up token hit in KV Pool,
however, when token length < block size, the key will be empty and there
is no point to lookup in KV Pool backend since there will never be a
hit.
Hence, add early return in `get_num_new_matched_tokens` when `token_len`
< `block_size`
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: fems14 <1804143737@qq.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
Fix wrong computed_tokens when meet exception. This pull request
addresses a bug in the KV transfer mechanism where an exception during
token lookup operations could lead to an incorrect count of
computed_tokens. By modifying the exception handling in both the lookup
and lookup_scheduler functions to return 0 instead of the start index,
the system now correctly indicates that no tokens were successfully
processed when a remote connection failure occurs. This enhancement
improves the robustness and accuracy of token management within the
vllm_ascend distributed KV pool.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
NO.
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: xleoken <xleoken@163.com>
### What this PR does / why we need it?
refer to https://github.com/vllm-project/vllm-ascend/issues/6391,
Currently adapted the complete process of event publishing in vllm:
* `kv_connector_model_runner_mixin` invoke kv-connector
`get_kv_connector_kv_cache_events` func to collect kvevents
* in `scheduler.py` , it's `update_from_output` func will invoke
`_update_from_kv_xfer_finished` which invoke
`connector.update_connector_output` to collect kv-events from all
kv-worker, and then scheduler will invoke `connector.take_events` api to
collect all kv-events and add it to the events which from
`kv_cache_manager`
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
You can add `--kv-events-config` parameter to the `vllm server` command
to enable this feature.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: yejj710 <abyss1999@163.com>
Co-authored-by: fems14 <1804143737@qq.com>
This reverts commit 8966a99710.
It breaks the test
`tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`
- vLLM version: v0.14.0
- vLLM main:
d68209402d
### What this PR does / why we need it?
**Refactor: Unify full-graph parameter update logic**
This PR consolidates the scattered full-graph parameter update logic
into a unified approach, improving code architecture and eliminating
duplication.
**Key improvements:**
1. **Unified interface**
- Create `update_full_graph_params` as the single entry point for all
full-graph updates
- Replace multiple scattered update calls with one unified function
- Remove ~50 lines of duplicated if-else logic across
`model_runner_v1.py` and `eagle_proposer.py`
2. **Better architecture**
- Move update logic to respective Backend classes
(`AscendAttentionBackend`, `AscendMLABackend`)
- Each Backend manages its own parameter update logic internally
- Simplify caller code to just dispatch to the appropriate Backend
3. **Cleaner parameter handling**
- Remove unnecessary `pcp_size` and `dcp_size` parameter passing
- Get parallel configuration directly from distributed groups
- Consistent with how other parts of the codebase obtain these values
**Why we need it:**
- **Maintainability**: Future changes only need to be made in one place
per Backend
- **Code quality**: Follows DRY principle and Single Responsibility
Principle
- **Readability**: Cleaner, more intuitive code structure
### Does this PR introduce _any_ user-facing change?
**No.** This is a pure refactoring with no functional changes - same
behavior, cleaner code.
### How was this patch tested?
- All existing unit tests pass with updated mocks
- No new tests needed (pure refactoring, no behavior changes)
- CI validates correctness
---
- vLLM version: v0.13.0
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
ucm_connector add has `has_connector_metadata` interface to adapt to the
latest KV connector in vLLM.
### Does this PR introduce _any_ user-facing change?
this PR doesn't introduce _any_ user-facing change.
### How was this patch tested?
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Resolved a double-free memory vulnerability in the pooling layer under
re-computation scenarios.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
This PR refactors `get_kv_cache_spec` method to delegate AttentionSpec
creation to each attention module's own `get_kv_cache_spec()` method,
aligning with the vllm source code structure.
**Changes:**
- Simplify `get_kv_cache_spec` in `model_runner_v1.py` and
`cpu_offload_connector.py`
- Remove manual `AttentionType` checks for `Attention` modules
- Delegate spec creation to each attention module's `get_kv_cache_spec`
method directly
- Let `MambaBase` layers use their own `get_kv_cache_spec` method
- Keep `use_sparse` hack for `MLAAttention` (DeepSeek DSA mode) as
Ascend-specific handling
This change follows RFC #5463 item 12: move AttentionSpec to Attention
module.
- Fixes#5463 (item 12)
### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring that simplifies code structure
without changing any external behavior.
### How was this patch tested?
- Syntax validation passed via `python -m py_compile`
- CI tests will verify the changes work correctly with existing test
cases
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: lico67373 <918688502@qq.com>
### What this PR does / why we need it?
As issue #5948 reported,when using cpu_offload_connector with TP=1, the
server will hang on starting, we found several bugs here to fix.
1. some crash error encountered because of code changed with vllm
version updating, some of them can be fixed as #5948, and this PR fixed
all of them.
2. hang problem described in #5948, the direct reason is that in
cpu_offload_connector, RPC client using the same client id in scheduler
and worker when tensor_parrallel_size is 1, this PR force the client id
to be different, then it is fixed.
- Why we didn't find this hang problem before?
Because we using --distributed-executor-backend mp or
tensor_parrallel_size > 1 in our test, in our old test case, the
scheduler and workers are different procceses, then client ids build by
`worker-{os.getpid()}` are not the same. But when using
tensor_parrallel_size=1, vllm will use uniproc as
distributed-executor-backend by default, the scheduler and worker will
by in the same proccess, then client ids are the same and hang.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
### What this PR does / why we need it?
Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604
This PR is a refactoring of vllm_ascend/distributed.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: lty <linhebiwen@gmail.com>
### What this PR does / why we need it?
Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604
This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer realtaed codes into a dedicated folder, which has already
been done in vLLM
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lty <linhebiwen@gmail.com>