## Summary
This PR was auto-generated by the **Update estimated test times**
[workflow](https://github.com/vllm-project/vllm-ascend/actions/runs/23226502411).
It updates the `estimated_time` values in
`.github/workflows/scripts/config.yaml` based on actual elapsed times
collected from CI workflow runs.
### Methodology
- Each e2e test job uploads its elapsed time as a `timing-data-*`
artifact upon completion.
- The workflow aggregates all collected timing artifacts across jobs.
- For each test, the **median** elapsed time is computed to reduce
outlier impact.
- A **10% safety buffer** is applied and the result is rounded to the
nearest 10 seconds.
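For reference, the per-test estimate reduces to roughly the following (a minimal sketch; the actual aggregation script in the workflow may differ in details such as the rounding direction):

```python
import statistics

def estimate_time(elapsed_seconds: list[float]) -> int:
    """Median of observed runtimes, plus a 10% buffer, rounded to the nearest 10 s."""
    median = statistics.median(elapsed_seconds)
    return round(median * 1.10 / 10) * 10

# e.g. samples of 583 s, 601 s, 597 s -> median 597 -> 656.7 -> 660
assert estimate_time([583, 601, 597]) == 660
```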
### Review Checklist
- [ ] Verify that updated `estimated_time` values are within a
reasonable range.
- [ ] Confirm no test entries are missing or unexpectedly removed.
> If the new values look reasonable, feel free to merge. Otherwise,
leave a comment describing the anomaly.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
### What this PR does / why we need it?
This PR introduces a new fused Triton kernel, `split_qkv_tp_rmsnorm_rope`, for MiniMax-M2.5.
The implementation includes two Triton kernels:
1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input
and computes the local variance for RMSNorm.
2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering
TP all-reduce for variance) and Neox-style RoPE.
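For reference, the numerical behavior of the fused path corresponds roughly to the eager sketch below, assuming the RMSNorm reduction spans a dimension that is sharded across TP ranks (which is why the variance needs an all-reduce); all names and shapes here are illustrative, not the kernel's API:

```python
import torch
import torch.distributed as dist

def split_qkv_rmsnorm_rope_reference(qkv, q_size, k_size, v_size,
                                     q_weight, k_weight, cos, sin,
                                     global_norm_dim, eps=1e-6):
    # Kernel 1: split the fused QKV projection output and accumulate the
    # local sum of squares needed for the RMSNorm variance.
    q, k, v = qkv.split([q_size, k_size, v_size], dim=-1)
    local_stats = torch.stack([q.float().pow(2).sum(-1),
                               k.float().pow(2).sum(-1)])

    # Between the kernels: all-reduce so every TP rank sees the global variance.
    dist.all_reduce(local_stats)
    q_var = local_stats[0] / global_norm_dim
    k_var = local_stats[1] / global_norm_dim

    # Kernel 2: apply the global RMSNorm, then Neox-style RoPE.
    q = q.float() * torch.rsqrt(q_var + eps).unsqueeze(-1) * q_weight
    k = k.float() * torch.rsqrt(k_var + eps).unsqueeze(-1) * k_weight

    def neox_rope(x):
        # Neox style pairs dimension i with i + dim // 2 (half-split, not
        # interleaved); cos/sin are assumed to broadcast against each half.
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    return neox_rope(q).to(qkv.dtype), neox_rope(k).to(qkv.dtype), v
```

Splitting the computation at the all-reduce boundary is what motivates the two separate kernels.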
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```bash
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py
```
### Test Data
A3 TP16
Baseline
| data | TTFT(ms) | TPOT(ms) | TPS |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1 | 267.55 | 25.5 | 38.85 |
| 4k/1k@bs4 | 542.4 | 26.51 | 148.06 |
With this PR
| data | TTFT(ms) | TPOT(ms) | TPS |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1 | 234.64 | 20.96 | 47.24 |
| 4k/1k@bs4 | 508.36 | 22.16 | 176.69 |
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: xutianyi <xutianyi5@huawei.com>
Co-authored-by: xutianyi <xutianyi5@huawei.com>
### What this PR does / why we need it?
Upgrade the vLLM commit to 0318.
Main change: add a pre-step to the test cases that previously failed because earlier test cases did not release NPU memory in time. The pre-step cleans up NPU memory and waits (by default up to 50 s) for the cleanup to complete.
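The waiting step reduces to a polling loop along these lines (a sketch assuming `torch.npu` mirrors the `torch.cuda` memory APIs via `torch_npu`; the helper name and threshold are illustrative):

```python
import gc
import time

import torch
import torch_npu  # noqa: F401  (registers the torch.npu namespace)

def wait_for_npu_memory_release(threshold_bytes: int = 1 << 30,
                                timeout_s: float = 50.0) -> bool:
    """Poll until allocated NPU memory drops below a threshold, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        gc.collect()
        torch.npu.empty_cache()
        if torch.npu.memory_allocated() < threshold_bytes:
            return True
        time.sleep(1.0)
    return False  # caller can proceed anyway and log a warning
```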
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
NPU resources are not released immediately when custom operator test
cases are executed, causing an error when other operator test cases are
executed.
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
### What this PR does / why we need it?
This PR fixes the layer name mapping logic in `AscendModelSlimConfig`
for quantization config loading.
1. **kimi_k2 model layer name mapping issue**: The `kimi_k2` model has a
unique layer naming convention that differs from the standard
`hf_to_vllm` mapping. One layer was defined in the mapper but was not
being correctly applied, causing quantization config lookup failures.
2. **Manual mapping registration timing issue**: The manual mapping
check in `apply_vllm_mapper` was executed before `vllm_config` was
initialized, causing `model_type` to be unavailable. This prevented some
models with manual mappings from being correctly registered.
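A minimal sketch of the re-ordering (class and method names here are illustrative, not the actual `AscendModelSlimConfig` API):

```python
QUANT_MODEL_PREFIX_MAPPINGS = {
    # populated per model family, e.g. "kimi_k2": {<hf prefix>: <vllm prefix>}
}

class QuantConfigSketch:
    def __init__(self):
        self.vllm_config = None   # not available yet at construction time
        self.prefix_mapping = {}

    def set_vllm_config(self, vllm_config):
        self.vllm_config = vllm_config
        # Register manual mappings only *after* vllm_config is attached,
        # because model_type is read from it; a check in __init__ always
        # saw model_type as unavailable and silently skipped registration.
        model_type = vllm_config.model_config.hf_config.model_type
        manual = QUANT_MODEL_PREFIX_MAPPINGS.get(model_type)
        if manual is not None:
            self.prefix_mapping.update(manual)
```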
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Tested with `kimi_k2` model to verify the special layer name mapping
works correctly. Also tested with other models that have manual mappings
defined in `QUANT_MODEL_PREFIX_MAPPINGS` to ensure the registration
timing fix works properly.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Add accuracy (acc) nightly CI test cases for the GLM-4.7 model.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
through CI
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
Fix issues in the GLM4.7 documentation and add some missing
explanations.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
document test
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise the Memcache parameters for better clarity, and add a note that heterogeneous protocol settings are currently not supported (e.g. enabling `device_rdma` and `device_sdma` at the same time; an example scenario would be data transfer by Memcache across different super pods).
2. Modify the condition for the Mooncakestore warmup: warmup is now needed only when `ASCEND_BUFFER_POOL` is enabled.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
### What this PR does / why we need it?
remove deprecated environment variables related to MLP prefetching
### Does this PR introduce _any_ user-facing change?
Yes. The deprecated env vars can no longer be used.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Adds a scheduled CI workflow (schedule_release_code_and_wheel.yml) to
automatically build and release vllm-ascend source packages and binary
wheels for multiple Ascend hardware targets.
Key features:
1. Source release: Builds tar.gz sdist and uploads to PyPI on version
tag push
2. Multi-hardware wheel builds: Supports three hardware targets in
parallel:
2.1 A2 (Ascend 910B): x86_64 + ARM64, Python 3.10 / 3.11
2.2 A3 (Ascend 910C): x86_64 + ARM64, Python 3.10 / 3.11
2.3 310P: x86_64 + ARM64, Python 3.10 / 3.11
3. Wheel repair: Uses auditwheel to produce manylinux-compatible wheels,
excluding Ascend NPU runtime libs (libascend*.so, libtorch*.so, etc.)
that must be provided by the runtime environment
4. Variant wheels: Generates hardware-variant wheels via variantlib for
hardware-specific distribution
5. OBS upload: Aggregates all variant wheels and a combined index JSON,
then uploads to Huawei OBS for hosting
### Does this PR introduce _any_ user-facing change?
Yes. Users will be able to install hardware-specific vllm-ascend wheels
from PyPI or the OBS variant index, eliminating the need to build from
source.
### How was this patch tested?
1. CI verification only — workflow syntax and job dependency logic
reviewed manually
2. Wheel build steps validated against existing Dockerfiles
(Dockerfile.buildwheel.a2/a3/310p)
3. auditwheel exclusion list verified against known Ascend runtime
shared libraries
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: YanZhicong <mryanzhicong@163.com>
Co-authored-by: YanZhicong <mryanzhicong@163.com>
### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise Mooncake environment variables and kv connector extra configs.
2. Delete `use_ascend_direct` from the kv connector extra config, as it is deprecated.
3. Delete `kv_buffer_device` and `kv_rank` from the P2P Mooncake config.
4. Unify the default `max-model-len` and `max-num-batched-tokens` in the given examples.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
### What this PR does / why we need it?
This PR adds a new CI log summarizer, `ci_log_summary.py`, and wires it
into unit-test and e2e workflows so failed jobs publish a structured
failure summary to the GitHub step summary.
Examples:
- `python3 .github/workflows/scripts/ci_log_summary.py --log-file
/tmp/unit-test.log --mode ut --step-name "Unit test"`
- `python3 .github/workflows/scripts/ci_log_summary.py --run-id
23127187822 --format json`
A maintenance note is added to `ci_utils.py` to clarify that the `START`
/ `PASSED` / `FAILED (exit code X)` log lines are parsed by
`ci_log_summary.py`, so any future format changes must be coordinated
with the corresponding summarizer regexes.
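Conceptually, the contract between the two scripts is a handful of line formats; the summarizer side looks roughly like this (the regexes here are illustrative, not copied from `ci_log_summary.py`):

```python
import re

# Hypothetical patterns mirroring the START / PASSED / FAILED (exit code X)
# lines emitted by ci_utils.py; the real regexes live in ci_log_summary.py.
START_RE = re.compile(r"^START\s+(?P<name>\S+)")
PASSED_RE = re.compile(r"^PASSED\s+(?P<name>\S+)")
FAILED_RE = re.compile(r"^FAILED \(exit code (?P<code>\d+)\)\s+(?P<name>\S+)")

def collect_failures(log_lines):
    """Return (test name, exit code) pairs for every FAILED line."""
    failed = []
    for line in log_lines:
        m = FAILED_RE.match(line)
        if m:
            failed.append((m.group("name"), int(m.group("code"))))
    return failed
```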
🤖 Generated with Codex <noreply@openai.com>
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: meihanc <jcccx.cmh@gmail.com>
Co-authored-by: Codex <noreply@openai.com>
### What this PR does / why we need it?
1. Mamba Cache Support on 310P: Implemented logic to correctly
initialize and allocate KV cache for Mamba models on the 310P platform,
including handling of state tensors and page size alignment.
2. Increased Attention Head Size Support: Modified the attention backend
to support `attn_head_size` larger than 128 by dynamically selecting
appropriate kernel block sizes based on hardware limitations (e.g.,
`block_size * head_size <= 16384`); see the sketch after this list.
3. Refactored KV Cache Allocation: Consolidated and improved the KV
cache allocation mechanism, moving from separate size calculation and
allocation steps to a unified _allocate_kv_cache_tensors method that
handles both Attention and Mamba specific cache structures.
4. Dynamic Mamba Config Patching: Introduced conditional loading of
Mamba configuration patches, specifically using patch_mamba_config_310
for the 310P platform to ensure platform-specific optimizations and
validations.
5. Reserve reasonable memory to allocate KV cache to avoid OOM issue
with default gpu_memory_utilization.
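For item 2, the block-size selection reduces to shrinking the kernel block size until the hardware constraint holds (a minimal sketch; the 16384 limit comes from the description above, while the halving strategy is illustrative):

```python
MAX_BLOCK_ELEMS = 16384  # hardware limit cited above: block_size * head_size

def pick_kernel_block_size(requested_block_size: int, head_size: int) -> int:
    """Shrink the kernel block size until block_size * head_size fits."""
    block_size = requested_block_size
    while block_size > 1 and block_size * head_size > MAX_BLOCK_ELEMS:
        block_size //= 2
    return block_size

assert pick_kernel_block_size(128, 128) == 128  # 16384: exactly at the limit
assert pick_kernel_block_size(128, 192) == 64   # head_size > 128 forces smaller blocks
```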
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3.5 E2E test
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
1. Fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122).
2. Fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027).
3. Fix "RuntimeError: device_allocator not a DeviceAllocator": [Replace memory related torch.cuda APIs](https://github.com/vllm-project/vllm/pull/37031).
4. Fix [Support multiple KV groups in OffloadingSpec](https://github.com/vllm-project/vllm/pull/36610), which removed `self.offloaded_block_size` and changed `self.gpu_block_size` from a scalar to a tuple of per-group block sizes, adding `block_size_factor`.
5. Fix [Consolidate SupportsEagle](https://github.com/vllm-project/vllm/pull/36063), which renamed `get_eagle3_aux_hidden_state_layers()` to `get_eagle3_default_aux_hidden_state_layers()` and added a `supports_eagle3()` guard before calling it.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
### What this PR does / why we need it?
Adapt to the model type of Qwen3-VL-8B-Instruct-W8A8
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
### What this PR does / why we need it?
1. Add a nightly test for MiniMax-M2.5 deployed on A3.
2. Add a MiniMax-M2.5 deployment introduction to the vllm-ascend docs.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
### What this PR does / why we need it?
Add Kimi-K2.5 weights download.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: LoganJane <loganJane73@hotmail.com>
### What this PR does / why we need it?
Documented an issue in the 2-node PD mixed deployment scenario where
inference may hang when concurrency exceeds 8 (GLM5).
Noted that the issue has been fixed in PRs:
- #7235
- #7290
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
This PR fixes the logger initialization in patches so that the log info
can be displayed as expected.
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Updated the DSV32 document.
1. Changed the PD-separation boot mode to layerwise.
2. Changed `max-num-batched-tokens` to a multiple of the TP size to avoid triggering a validation error.
3. Added a link to help users adjust the configuration.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: wyh145 <1987244901@qq.com>
### What this PR does / why we need it?
The rotary algorithm in the DeepSeek indexer should be Neox-style instead of GPT-J-style. PR #4641 fixed this accuracy bug in the original PyTorch version, but PR #5701 accidentally removed the fixed code line and reverted the implementation to the problematic version. This PR restores the fix.
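For context, the two styles differ only in how dimension pairs are chosen; these are the standard definitions (reference PyTorch, not code from this PR):

```python
import torch

def rope_neox(x, cos, sin):
    # Neox style: pair dimension i with i + dim // 2 (rotate the two halves).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_gptj(x, cos, sin):
    # GPT-J style: pair adjacent even/odd dimensions (2i with 2i + 1).
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return out.flatten(-2)
```

Applying one style where the weights expect the other scrambles the pairings, which is exactly the kind of silent accuracy bug described here.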
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
LayerwiseConnector now supports the virtual push functionality on the D node. By adding a `do_virtual` flag to the request metadata, the system can identify and process certain requests virtually, bypassing the actual KV cache transfer. This allows immediate completion of these requests from the consumer's perspective, potentially enabling optimizations or specific testing scenarios where physical data transfer is not required.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By CI.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
When we check out a fork repo and want to push commits back to it, the `pat_token` is needed.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Issues labeled `resolved`: marked stale after 7 days of inactivity, then closed 14 days after going stale, carrying the `stale` and `resolved` labels.
2. Issues labeled `awaiting-feedback`: marked stale after 7 days of inactivity, then closed 14 days after going stale, carrying the `stale` and `awaiting-feedback` labels.
Change items:
- Add a scheduled stale-management workflow to process resolved and awaiting-feedback issues independently.
- Automatically mark inactive issues as stale, post tailored reminder messages, and close issues after a grace period.
- Remove source labels when issues become active again, and disable PR stale handling so the automation remains issue-scoped.
### Does this PR introduce _any_ user-facing change?
- No API or runtime behavior changes.
- This PR only updates GitHub issue automation (labeling and stale
management workflow).
### How was this patch tested?
- Test locally
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: drizzlezyk <drizzlezyk@163.com>
- Replace minimal label rules with a comprehensive keyword-based issue
labeler taxonomy.
- Add grouped labels for core features and advanced capabilities to
improve issue routing.
- Expand model-related matching for LLM, multimodal generation,
multimodal understanding, audio, and omni scenarios.
- Add/normalize regex patterns for common model families (DeepSeek,
Kimi, GLM, Qwen, 310p, etc.) to increase auto-label coverage and
consistency.
### What this PR does / why we need it?
- Expands `.github/issue-labeler.yml` from a minimal set of rules to a
richer keyword-based labeling configuration.
- Adds grouped label dimensions for:
- Core features (e.g., PD disaggregation, KV cache pool, ACLGraph, async
scheduler, CPU binding, quantization)
- Advanced features (e.g., long sequence, DPC/PCP, MTP/speculative
decode)
- Model categories (LLM, multimodal generation, multimodal
understanding, audio, omni, etc.)
- Specific model families (e.g., DeepSeek, Kimi, GLM, Qwen, 310p)
- Improves automatic issue triage accuracy and reduces manual label
maintenance effort.
- Makes issue categorization more consistent for maintainers and
contributors.
Why needed:
- Existing labeler rules were too limited and could not adequately cover
current feature/model issue distribution.
- Broader and more structured matching helps faster routing,
prioritization, and ownership assignment.
Fixes #N/A
### Does this PR introduce _any_ user-facing change?
- No runtime/API user-facing changes.
- This PR only updates GitHub issue automation rules.
### How was this patch tested?
- Performed static validation and review of `.github/issue-labeler.yml`
structure and regex entries.
- Verified that rule groups and label keys are correctly formatted for
GitHub issue labeler consumption.
- Confirmed that legacy minimal rules were replaced by expanded taxonomy
without syntax-breaking YAML changes.
- No unit/e2e tests were added because this is repository automation
configuration (GitHub labeling rules) rather than application runtime
logic.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: drizzlezyk <drizzlezyk@163.com>
### What this PR does / why we need it?
#### Problem
When decode node enables prefix cache and the local prefix cache fully
hits, the following assertion error occurs:
```
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 520, in step_with_batch_queue
(EngineCore_DP3 pid=34912) engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 1520, in update_from_output
(EngineCore_DP3 pid=34912) self._update_from_kv_xfer_finished(kv_connector_output)
(EngineCore_DP3 pid=34912) File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 2120, in _update_from_kv_xfer_finished
(EngineCore_DP3 pid=34912) assert RequestStatus.is_finished(req.status)
(EngineCore_DP3 pid=34912) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912) AssertionError
```
The error is triggered in `scheduler.py` at `_update_from_kv_xfer_finished`:
```python
if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
self.finished_recving_kv_req_ids.add(req_id)
else:
assert RequestStatus.is_finished(req.status)
```
#### Root Cause
When decode node has prefix cache enabled and local prefix cache fully
hits:
1. `get_num_new_matched_tokens` returns `ext_tokens=0, load_kv_async=False` when the decode prefix cache fully hits.
2. The request status becomes RUNNING (not WAITING_FOR_REMOTE_KVS).
3. However, `update_state_after_alloc` still adds the request to `_reqs_need_recv` because `remote_block_ids` exists in `kv_transfer_params`.
4. The worker processes the request in `_handle_request`:
   - `_transfer_kv_cache` returns immediately (no actual transfer; `local_block_ids` is empty)
   - the `finally` block still calls `update_done_task_count(request_id)`
5. `finished_recving` contains this request.
6. When `_update_from_kv_xfer_finished` processes `finished_recving`, the request status is RUNNING.
7. The assertion fails.
#### Solution
In `_handle_request`, only notify the scheduler (`update_done_task_count`) when an actual KV transfer happened (`local_block_ids` is not empty). The signals that notify Prefill to release its KVCache (`_send_done_signal_to_free_remote_port` and `_send_done_recv_signal`) are still sent regardless.
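A minimal sketch of the guard (method names follow the description above; the actual connector code may differ):

```python
def _handle_request(self, request_id, local_block_ids, remote_meta):
    try:
        if local_block_ids:  # an actual KV transfer is needed
            self._transfer_kv_cache(request_id, local_block_ids, remote_meta)
    finally:
        # Always tell Prefill it can release its KVCache ...
        self._send_done_signal_to_free_remote_port(request_id)
        self._send_done_recv_signal(request_id)
        # ... but only notify the scheduler when a transfer really ran, so a
        # fully prefix-cache-hit RUNNING request never lands in finished_recving.
        if local_block_ids:
            self.update_done_task_count(request_id)
```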
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: LCAIZJ <leichao139636@163.com>
### What this PR does / why we need it?
Some bug fixes, mainly including:
1. For A2, the number of experts on each single card cannot be greater than 16 when using MC2. This PR fixes an error in the A2 MoE communication-method selection that caused an incorrect communication method to be chosen when the number of model experts exceeds 256. For example, when loading the PD-disaggregation D node with Qwen3.5-series models on a 16-card A2 setup, the incorrect MC2 method would be chosen.
2. Fixed the issue where the layerwise connector sends the kv-cache of the MTP layer multiple times when `num_spec_tokens` > 1. Now the kv-cache is sent only when the MTP layer is forwarded for the first time.
3. Fixed the accuracy issue of Qwen3.5 when using MTP for PD disaggregation. The cause is that `num_decode_draft_tokens` does not account for the fact that `spec_tokens` do not yet exist during the first inference under PD disaggregation (`spec_tokens` are generated during that first inference), while `spec_tokens_padding` is still added by the `recomputed_scheduler`. As a result, `gdn_metadata` incorrectly assumes that a prefill of length 2 is being performed.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
Upload docs for Qwen3.5-27B and Qwen3.5-397B-A17B on Ascend, based on vllm-ascend:v0.17.0rc1.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pppeng <zepengliu912@qq.com>
Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
### What this PR does / why we need it?
Qwen3.5 MoE supports enabling the `dispatch_ffn_combine` fusion operator.
Fixed problem: in the W8A8 quantization scenario, the Qwen3.5 model's config.json lacks the quantize field. The previous logic strictly relied on `quant_type == "w8a8_dynamic"` to enable `VLLM_ASCEND_ENABLE_FUSED_MC2`, so the `dispatch_ffn_combine` fusion operator failed to activate even when the environment variable was set.
Also enables the `dispatch_ffn_combine` fusion operator for BF16 scenarios.
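The relaxed enabling condition amounts to something like this (a sketch of the described logic; only the env var name comes from the description, the rest is illustrative):

```python
import os

def fused_mc2_enabled(quant_type: str | None, dtype: str) -> bool:
    # The environment switch must be on in any case.
    if os.environ.get("VLLM_ASCEND_ENABLE_FUSED_MC2", "0") != "1":
        return False
    # Previously only quant_type == "w8a8_dynamic" qualified, so a model whose
    # config.json lacks the quantize field could never enable the fusion.
    # Now BF16 (unquantized) also qualifies.
    return quant_type == "w8a8_dynamic" or dtype == "bfloat16"
```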
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: asunxiao <asunxiao@qq.com>
### What this PR does / why we need it?
Fix the incorrect decompression path of the FIA operator package, which created unnecessary folders.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
This PR optimizes bias handling in `AscendRMSNorm` without changing the intended functional behavior.
In the current implementation, a bias may be initialized for `AscendRMSNorm` based on configuration-level detection, even though some norm layers never actually load a bias weight. This can cause the inference path to enter the bias branch and execute an unnecessary `add_` operator.
To improve this, this PR introduces a loader-based flag to record whether the bias has actually been loaded. The bias addition is then executed only when the bias is truly present.
This optimization reduces redundant computation in inference and makes the bias-application logic better aligned with the actual model weights.
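A minimal sketch of the loader-based flag (illustrative; the real `AscendRMSNorm` interface differs):

```python
import torch
from torch import nn

class RMSNormWithOptionalBias(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps
        self.bias_loaded = False  # set by the weight loader, not by config

    def load_bias(self, loaded_weight: torch.Tensor) -> None:
        self.bias.data.copy_(loaded_weight)
        self.bias_loaded = True   # a real bias exists in the checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        var = x.float().pow(2).mean(dim=-1, keepdim=True)
        out = (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight
        if self.bias_loaded:      # skip the redundant add when no bias was loaded
            out = out + self.bias
        return out
```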
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR fixes the bug for eagle3 and cp enable introduced by the
parallel speculative inference PR.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
E2E tests and UT.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
As issue #7201 reported, some `TransposeKvCacheByBlock`-related ERRORs appear in the plog when vLLM launches. Although they don't affect vLLM's operation, these ERRORs are confusing during debugging, so this PR fixes the problem as suggested.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>
### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD separation scenario. C8 refers to quantizing the KV cache to int8, which reduces the KV cache's memory usage and improves inference throughput.
Constraints:
1. Only the PD separation mode can be used, and MooncakeLayerwiseConnector must be used to run the model.
2. Currently, only the activation values support dynamic quantization, and the KV cache supports static quantization. C8 quantization with MTP is not supported. You can use ModelSlim for quantization; the quantization procedure is as follows:
```bash
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> \
    --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json \
    --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json \
    --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6
```
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Two problems are solved in this PR. Both occur in the `FULL_DECODE_ONLY` mode, where `num_tokens` should be padded to some value in `cudagraph_capture_sizes`.
1. We found that the length of `seq_lens_list` in the drafter's `attn_metadata` is 1 shorter than expected, which raises a kernel exception and crashes vLLM. E.g., with `num_reqs` = 3 and `cudagraph_capture_sizes` = [20], `actual_seq_lengths_q` is padded correctly to [4, 8, 12, 20], but `seq_lens_list` = [5742, 4700, 7996] is not padded.
2. Though the length of `seq_lens_list` in the target's `attn_metadata` matches what `FULL_DECODE_ONLY` expects, some data at the end of the list is corrupted. E.g., with `num_reqs` = 3 and `cudagraph_capture_sizes` = [20], `actual_seq_lengths_q` is padded correctly to [4, 8, 12, 20], but `seq_lens_list` = [5742, 4700, 7996, 5738] is corrupted at the end.
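Conceptually, both fixes amount to padding `seq_lens_list` out to the number of request slots in the captured graph (a sketch; the pad value and slot count here are illustrative):

```python
def pad_seq_lens(seq_lens: list[int], graph_num_reqs: int,
                 pad_value: int = 1) -> list[int]:
    """Pad (and drop any stale tail data) so len(seq_lens) matches the
    number of request slots in the captured graph."""
    padded = seq_lens[:graph_num_reqs]
    padded += [pad_value] * (graph_num_reqs - len(padded))
    return padded

# Example from the description: 3 real requests, 4 slots in the captured graph.
assert pad_seq_lens([5742, 4700, 7996], 4) == [5742, 4700, 7996, 1]
```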
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
This PR fixes a bug in the Xlite backend (https://atomgit.com/openeuler/GVirt/issues/3).
It adds support for the mrope (multimodal rotary position embedding) and deepstack features in the xlite backend. These features are necessary for running certain multimodal models that utilize them.
The main changes include:
- Updating `_build_model_config` to parse mrope and deepstack
configurations from the model's `hf_config`.
- Modifying `XliteWrapper.__call__` to handle `deepstack_input_embeds`
and mrope positions during the model forward pass.
- Replacing `ModelAttnMeta` with the newer `AttnMeta` to accommodate the
new metadata fields required by these features.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
online server config:
```
python -m vllm.entrypoints.openai.api_server \
--model /mnt/nvme0n1/models/checkpoint-8200 \
--additional-config='{"xlite_graph_config": {"enabled": true}}' \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 8192 \
--max-num-seqs=20 \
--block-size 128 \
--max-model-len 8192 \
--trust-remote-code \
--served-model-name Qwen3-VL-8B \
--host localhost \
--generation-config vllm \
--port 6777
```
test_config:
```
vllm bench serve \
--max-concurrency ${maxconcurrency} \
--num-prompts ${num_prompts} \
--host ${HOST} \
--port ${PORT} \
--model ${MODEL_NAME} \
--dataset-name random \
--backend openai-chat \
--random-input-len 512 \
--random-output-len 512 \
--random-range-ratio 0.2 \
--temperature 0.6 \
--metric-percentiles "50,90,99" \
--tokenizer ${TOKENIZER_PATH} \
--endpoint /v1/chat/completions \
--ignore-eos
```
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: LVYANGGUO <lvyangguo@huawei.com>
Co-authored-by: LVYANGGUO <lvyangguo@huawei.com>
### What this PR does / why we need it?
Optimize the performance of the Triton operator `_topk_log_softmax_kernel` in model_runner_v2 to 1.04x H100, roughly a 7% improvement over its original value (issue https://github.com/vllm-project/vllm-ascend/issues/5208).
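For reference, the operator's semantics, independent of the Triton optimization, are just a fused log-softmax plus top-k (an eager PyTorch equivalent, not the kernel itself):

```python
import torch

def topk_log_softmax(logits: torch.Tensor, k: int):
    """Top-k log-probabilities per row: log_softmax followed by topk."""
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.topk(k, dim=-1)  # (values, indices)
```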
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangx700 <wangxin700@huawei.com>
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.
The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.
This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.
### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
Add an e2e test for the QuaRot model with eagle3 that runs both the QuaRot model and the float model and compares their acceptance rates. The PRs adapting the QuaRot model to eagle3 are #6914 and #7038.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>