### What this PR does / why we need it?
mooncake layerwise support pcp function
PCP (Prefill Context Parallelism) Support: Introduced explicit support
for Prefill Context Parallelism (PCP) and Decode Context Parallelism
(DCP) in the Mooncake layerwise KV cache transfer mechanism, allowing
for more granular control and awareness of parallel configurations
during data transfer.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
refer to https://github.com/vllm-project/vllm-ascend/issues/6391,
Currently adapted the complete process of event publishing in vllm:
* `kv_connector_model_runner_mixin` invoke kv-connector
`get_kv_connector_kv_cache_events` func to collect kvevents
* in `scheduler.py` , it's `update_from_output` func will invoke
`_update_from_kv_xfer_finished` which invoke
`connector.update_connector_output` to collect kv-events from all
kv-worker, and then scheduler will invoke `connector.take_events` api to
collect all kv-events and add it to the events which from
`kv_cache_manager`
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
You can add `--kv-events-config` parameter to the `vllm server` command
to enable this feature.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: yejj710 <abyss1999@163.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
vllm model runner v2 use uva buffer to prepare input data, but npu
doesn't support uva yet, this pr implement a uvawrapper class to mimic
gpu's uva backend. what's more, this pr make some modifications to adapt
to the newer main branch.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM main:
13397841ab
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Fix the issue where, in graph mode, the fused `AddRMSNormQuant` operator
does not take effect when there is no bias.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: ZYang6263 <zy626375@gmail.com>
### What this PR does / why we need it?
This pull request enables the `npugraph_ex` backend by default to
improve performance on Ascend NPUs, as proposed in the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214).
### Does this PR introduce _any_ user-facing change?
Yes. `npugraph_ex` is now enabled by default. Users can disable it by
setting `enable: false` in the `npugraph_ex_config` section of the
`additional_config`.
### How was this patch tested?
CI passed. The changes are covered by existing and new E2E tests
(`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`)
that have been updated to reflect the new default behavior. The tests
verify correctness and consistency with `npugraph_ex` enabled and
disabled, as well as with the new static kernel option.
Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>
### What this PR does / why we need it?
GLM5 adaptation
1. use torch_npu.npu_lightning_indexer for GLM5
2. forbid eagle proposer when fullgraph mode is enabled because of bugs
3. add quatization config for GLM5
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM main:
978a37c823
---------
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
This PR aims to update `target_probs` to `target_logits` in
`rejection_sample`, for catching up with
https://github.com/vllm-project/vllm/pull/32852. Otherwise, sampling
with temperature will incur accuracy problem where tokens can be
accepted or rejected unreasonably.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.15.0
- vLLM main:
13397841ab
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This PR extends original `rope_triton_forward` and
`split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as
inputs. This fully aligns to vLLM RoPE api interface. Compared with
earlier implementation for RoPE, the benefits are:
1. avoiding pre-computation of `cos` `sin` before model execution, which
helps to remove redundant codes.
2. allowing eagle3 draft model to have different rope parameters with
main model (see #6612 ). This help to recover accept rate && accuracy in
that case.
In addition, this kernel change only introduces very small performance
degradation. Those `index_select` or `chunk` operations are now changed
into simple memory access in triton kernel (For example,
https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81).
**Highlights**
- **RoPE Cache Unification**: Replaced separate _sin and _cos global
tensors with a unified cos_sin_cache and explicit positions tensor for
Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated Triton kernels
(split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the
cos_sin_cache and positions for more efficient and integrated RoPE
calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a
new custom operation, allowing its use in fused compilation passes and
providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified the rope_forward_oot
function to accept the new cos_sin_cache and positions arguments,
enabling a more flexible and integrated RoPE application within the
system.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
5326c89803
Additional test on Qwen3-235b accuracy:
| Aime2024 | GSM8K | Livecodebench |
| -------- | -------- | -------- |
| 83.33 | 96.26 | 70.23 |
---------
Signed-off-by: Angazenn <supperccell@163.com>
## Summary
- Remove unused `set_rotation_config` and `apply_rotation` methods from
`AscendW4A4LaosDynamicLinearMethod`
- Remove unused `rotation_type` field and associated conditional
quantization parameters (`heads_rotation`, `kronecker_rotation_n`,
`kronecker_rotation_m`)
These rotation-related functions and parameters are never called in the
current W4A4 LAOS dynamic quantization workflow.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
This pull request introduces a new capability to monitor the health of
NPU cards directly from the Worker class. This enhancement allows for
proactive detection of NPU issues by executing the npu-smi command,
improving system reliability and operational visibility within the
vllm_ascend worker environment.
- vLLM version: v0.15.0
- vLLM main:
13397841ab
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
This PR introduces support for W8A8 dynamic quantization for
Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved
by:
- Implementing a new quantization scheme
`AscendW8A8DynamicFusedMoEMethod310`.
- Adding a unified MLP implementation (`unified_apply_mlp`) for 310P
that handles both quantized and unquantized paths.
- Refactoring the MoE and quantization configuration logic to correctly
route to the new 310P-specific implementations.
- Adding new e2e and unit tests to verify the functionality of MoE W8A8
quantization.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8
quantization in a multi-card setup.
- Added several new unit tests for the 310P-specific MoE components,
including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`,
and the new `w8a8_dynamic` quantization method.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
To prevent confusion between different QuantType classes, we remove**
QuantType in prepare_finalize.py
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
1. Currently, eplb registers different attributes for different models,
but these attributes are not actually used. Now, these attributes are
directly deleted.
2. Add some log about eplb.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
#### Deepseek v3.1 chat
Of course! Here is a comprehensive explanation of deep learning, broken
down for clarity.\n\n### The Simple Analogy: A Child Learning to
Recognize a Cat\n\nImagine teaching a child what a cat is. You don't
give them a rulebook with instructions like \"has pointy ears, whiskers,
and a tail.\" Instead, you show them many pictures, saying \"this is a
cat\" or \"this is not a cat.\" The child's brain gradually learns to
identify the complex patterns—the combination of shapes, colors, and
textures—that define \"cat-ness.\"\n\n**Deep learning is essentially
this, but for computers.** It's a method for teaching computers to learn
from examples and recognize patterns directly from data (like images,
sound, or text) without being explicitly programmed with rigid
rules.\n\n---\n\n### The Technical Definition\n\n**Deep Learning is a
subfield of machine learning, which itself is a subfield of artificial
intelligence (AI).** It uses artificial **neural networks** with many
layers (\"deep\" networks) to model and understand complex patterns in
data.\n\nHere are the key concepts in that definition:\n\n1.
**Artificial Intelligence (AI):** The broad science of making machines
smart and capable of performing tasks that typically require human
intelligence.\n2. **Machine Learning (ML):** A subset of AI that gives
computers the ability to learn from data *without* being explicitly
programmed for every single rule.\n3. **Deep Learning (DL):** A
specific, powerful
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
there are batch invariant ops implemented by triton and ascendc, this pr
aims to choose which kind of ops to be used to enable batch invariant.
#5487
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
This PR upgrades the core vLLM dependency to a newer version from the
main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is
necessary to keep our project up-to-date with the latest features and
fixes from upstream vLLM.
1.
ac32e66cf9
pass file is moved.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
This PR fixes an `AttributeError: 'Parameter' object has no attribute
'data'` that occurs when MLAPO is enabled with vLLM v0.15.0.
The error is caused by a monkey-patch on
`MLAAttention.process_weights_after_loading` which is incompatible with
changes in vLLM v0.15.0. This is likely related to PyTorch's deprecation
of the `.data` attribute on `torch.nn.Parameter` objects.
This change makes the monkey-patch conditional, so it is not applied for
vLLM v0.15.0 and newer versions, resolving the crash.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
The speculative inference acceptance rate decreases after the vllm
version is upgraded to v0.15.0. This issue is resolved.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
UT and tests case
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
For deepseek v3.2, DSA use FullAttentionSpec, allocate 2 * mla page size
bytes, and we use half of that for k cache in DSA
However, the actual proportion of k cache is not high, which results in
a large amount of kvcache being wasted. The proportion of discarded
kvcache is (576-128)/(576 x 2) = 0.388.
Run the same script to start DeepSeek V3.2 on a single A3 server. The
following shows the comparison of kvcache usage:
Before refactoring
```
[kv_cache_utils.py:1307] GPU KV cache size: 15,872 tokens
```
After refactoring
```
[kv_cache_utils.py:1307] GPU KV cache size: 25,984 tokens
```
This pull request refactors the KV cache allocation for Deepseek v3.2
models that use sparse attention. It replaces the use of
`FullAttentionSpec` with `MLAAttentionSpec` and introduces a more
principled way of calculating KV cache tensor split factors based on
model configuration.
This change removes hardcoded values and correctly sizes the cache
tensors, leading to optimized memory usage and improved code
maintainability.
### Does this PR introduce _any_ user-facing change?
No, this is an internal optimization and does not introduce any
user-facing changes.
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This PR adds DCP support to the SFA backend.
Please note that due to operator constraints, the current implementation
has to all-gather the entire KV cache and modify the block table to
satisfy the operator input requirements. This results in significantly
increased communication overhead and peak memory usage. Therefore, this
is only a temporary workaround and will be refactored once the operator
provides proper support.
Additionally, because of the above limitations,
`cp_kv_cache_interleave_size` is currently required to be equal to
`block_size`. This restriction will also be removed after the refactor.
#### Test
accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8
| dataset | version | metric | mode | vllm-api-general-stream |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.35 |
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
Issue: If a model such as Ling-1T adopts partial rotary position
embedding (partial RoPE), but config.json uses the rotary_dim parameter
instead of partial_rotary_factor, it will trigger a RuntimeError: The
expanded size of the tensor (128) must match the existing size (64) at
non-singleton dimension 3.
<img width="1681" height="472" alt="image"
src="https://github.com/user-attachments/assets/ba03d7df-ecba-4d6f-9ec1-4dc55f59799e"
/>
This PR addresses an issue where models using partial rotary position
embedding (partial RoPE) with the `rotary_dim` parameter in
`config.json` (instead of `partial_rotary_factor`) would encounter a
`RuntimeError`.
This change adds support for the `rotary_dim` parameter in
`vllm_ascend/ops/rotary_embedding.py` to correctly calculate the
`rope_dim`, resolving the tensor size mismatch error.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The patch was tested successfully with the Ling-1T model, which
previously triggered the error.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
Signed-off-by: GoCHug <93277779+GoCHug@users.noreply.github.com>
### What this PR does / why we need it?
Part of #5304.
After https://github.com/vllm-project/vllm/pull/32523 merge, we could
remove the patch of `MiniCPMAttention`.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Test it locally.
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
layerwise connector support recompute scheduler.
NOTE:
Triggering recompute will invoke the tokenizer again, which may lead to
precision fluctuations.
[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache
Layerwise Push Support
https://github.com/vllm-project/vllm-ascend/issues/4842
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
This PR aims to fix problematic dummy_run that will cause excessive npu
memory and to fix improper input_batch_size that will degrade running
performance.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: Zetong Li <slippersss@126.com>
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Co-authored-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
This pull request significantly enhances the test suite by adding new
end-to-end test cases for Qwen3 models on the 310P hardware platform.
The primary goal is to ensure the stability and correctness of these
models under diverse operational conditions, including various
parallelism strategies, data types, and quantization methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E test
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
This pull request significantly refactors the attention mechanism for
the Ascend 310P hardware, enhancing its architecture by separating mask
generation concerns from the core attention implementation. It
introduces a dedicated mask builder class capable of handling various
mask types, including causal, splitfuse, and sliding window attention
masks, all optimized for the NPU's fractal data format. This change not
only cleans up the codebase but also lays the groundwork for more robust
and feature-rich attention operations on Ascend devices, backed by new,
extensive unit tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E test with qwen3 and qwen3-moe
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
This pull request focuses on a significant refactoring effort within the
vllm-ascend project, specifically targeting operations optimized for the
Ascend 310P hardware. The changes aim to streamline the implementation
of core components like quantization and multi-head attention, making
the codebase more maintainable and robust. Concurrently, new unit tests
have been introduced to ensure the correctness and reliability of these
refactored modules.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E test with qwen3-32b w8a8
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.
The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.
### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.
### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This pull request integrates comprehensive support for Mixture of
Experts (MoE) models on the Ascend 310P device within the vllm-ascend
framework. It achieves this by introducing specialized modules for
expert selection, fused MoE layers, and optimized all-gather
communication. The changes also refine existing NPU operations, making
them more consistent and efficient for 310P, ultimately enhancing the
performance and compatibility of MoE models on this hardware.
Highlights
310P MoE Support: Introduces dedicated implementations for Mixture of
Experts (MoE) models on Ascend 310P devices, including new modules for
expert selection, fused MoE layers, and communication.
All-Gather Communication: Enforces the use of ALLGATHER communication
for MoE operations on 310P, optimizing data transfer and leveraging
NPU-specific token dispatching.
Simplified NPU Operations: Removes conditional type casting for
npu_swiglu and enables custom rotary embedding kernels unconditionally,
suggesting improved native support for 310P.
New MoE Classes Registered: Registers AscendFusedMoE310 and
AscendSharedFusedMoE310 to integrate 310P-specific MoE layers into the
system's custom operation registry.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
offline test and server test, with qwen3-30b-a3b,tp/ep 4 on 310p
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
Fix the incorrect use of the `output` parameter in
`_forward_fia_slidingwindow`:
```
# Original (incorrect)
output, _ = torch_npu.npu_fused_infer_attention_score(...)
output= output.view(batch_size, self.num_heads, self.head_size)
```
In the original writing, the `output `parameter was directly assigned a
new value, which is inconsistent with the interface definition,
resulting in the inability to directly update `output `when calling
externally.
```
attn_output, _ = torch_npu.npu_fused_infer_attention_score(...)
attn_output = attn_output.view(batch_size, self.num_heads, self.head_size)
output[:batch_size] = attn_output[:batch_size]
```
### Does this PR introduce _any_ user-facing change?
No change.
Co-authored-by: GoCHug<gch59135228@163.com>
### How was this patch tested?
vLLM ascend version: v0.13.0rc1
Signed-off-by: acat-rw <892882856@qq.com>
### What this PR does / why we need it?
This PR fixes an accuracy issue that occurs when using Prefill/Decode
Context Parallelism (PCP/DCP) in conjunction with speculative decoding
(MTP). The issue is caused by an irregular attention mask shape when
both features are enabled.
The fix involves flattening the `block_table` for speculative decoding
requests under PCP/DCP to ensure a regular attention mask. This PR also
introduces a `use_cp` property for cleaner code and updates dummy runs
to handle this scenario correctly.
### Does this PR introduce _any_ user-facing change?
No. This is a bug fix that improves accuracy and should not have
user-facing API changes.
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
**Optimization:** Replaces fixed block sizes (128x128x128) in
`linear_persistent_kernel` with adaptive selection logic that considers:
- Matrix dimensions (M, N, K)
- Device NPU vector core count
- Data type (float32 vs others)
**Why:** Fixed block sizes lead to suboptimal hardware utilization
across different matrix shapes. Adaptive sizing maximizes occupancy and
memory efficiency for varied workload patterns, improving throughput for
batch-invariant linear operations in LLM inference.
**Details:**
- Small matrices (M < 256): Size-proportional allocation
- Medium matrices (256 ≤ M < 1024): Balanced distribution based on grid
capacity
- Large matrices (M ≥ 1024): Optimized for dominant dimension
### Does this PR introduce _any_ user-facing change?
No. This is a performance optimization. The API and numerical results
remain unchanged; only kernel execution efficiency improves.
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: DDCHY <843049740@qq.com>
Signed-off-by: zjchenn <zjchenn@gmail.com>
Co-authored-by: DDCHY <843049740@qq.com>
### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to
satisfy FIA/TND constraint #6459 (commit
5b0a6bcfe9)" and fixes a check in
`model_runner_v1`.
**A key change is that we remove the strict assertion in the latest
commit, as it turns out MLA + PIECEWISE will slice during computing,
leaving our assertion uncalled for and will only cause false alarm.**
This handles both uniform and mixed batches (by inserting a dummy
request for mixed batches), consolidates ad-hoc padding into a single
helper, copies the updated buffer to the device, which prevents kernel
mismatches or failures and ensure correct shapes for FIA/TND execution
in full graph modes.
We currently place this helper in `execute_model`. My original design
was to include it in `_prepare_inputs`, but that doesn’t work because it
must run after padding. While I’d prefer to minimize the impact and
reuse as much of the base class as possible in the future, it doesn’t
seem achievable at the moment.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Test cases added.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This PR introduces four new patterns to support the fusion of RMSNorm
and DynamicQuant operators. After replacing the fusion operators, the
execution time has been reduced from 22.8us to 16.9us.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
d7de043d55
Signed-off-by: Bryan <250470359+Zhang-Bryan@users.noreply.github.com>
### What this PR does / why we need it?
Following [PR
#4233](https://github.com/vllm-project/vllm-ascend/pull/4233), a
synchronization mechanism was introduced between steps in asynchronous
scheduling with ACL Graph to address a hanging issue. However, full
device-level synchronization is unnecessary—only the operations on the
current stream need to be synchronized. Otherwise, if other background
operations (such as send and recv) are running concurrently, they may
negatively impact inference performance for the instance.
hang problem

Synchronizing only the current stream can also resolve the hang issue.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: For_YL <zhangtangwei@huawei.com>
Co-authored-by: For_YL <zhangtangwei@huawei.com>
### What this PR does / why we need it?
Refactor MLP weight prefetch to consistency with MoE Model's prefetching
in terms of code and usage.
Environments VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE is removed, usage as following:
--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
When the draft model also uses vllmbackend for graph compilation, the
fusion pass registration occurs again, resulting in errors due to
duplicate patterns.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>