Commit Graph

5 Commits

Author SHA1 Message Date
LICO67373
380f089fbf [Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870)
## What this PR does / why we need it?

This PR fixes the `AttentionMaskBuilder` singleton initialization issue
introduced in PR #4779 and removes the unused `pcp_prefill_mask` field.

### Background

After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton`
decorator, the class constructor now requires a `device` parameter.
However, two initialization sites were still using the old parameterless
constructor, causing failures.

### Changes

1. **Fix singleton initialization**
- Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)`
in `AscendMLAMetadataBuilder.__init__()`
- Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)`
in `AscendAttentionMetadataBuilder.__init__()`

2. **Remove unused field**
- Removed `pcp_prefill_mask` field from
`AscendPrefillContextParallelMetadata` (never used in codebase)
   - Updated related test assertions

### Related

- Issue #5463
- PR #4779 (Unify all mask generation methods)
- PR #5389 (Make AttentionMaskBuilder singleton)

## Does this PR introduce _any_ user-facing change?

No. This is an internal refactoring.

## How was this patch tested?

-  Local testing: No linter errors
-  Unit tests for attention modules verified
-  CI pipeline

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2026-01-07 17:09:52 +08:00
Feng Liu
cbc987db0b [bugfix (pcp)] fix chunked prefill accurancy issue (#5647)
### What this PR does / why we need it?
Purpose: initialize padded slot mapping buffer to prevent garbage
values.

In PCP mode, the `pcp_padded_slot_mapping` buffer is reused across
invocations. Without explicit initialization, this buffer retain stale
values from previous runs, which can lead to incorrect results.

This change ensures the buffer is filled with -1.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: F.Liu <liufeng248@huawei.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>
2026-01-07 10:01:27 +08:00
pichangping
50e7934415 MLA prefill preformance optimization (#5456)
### What this PR does / why we need it?
Since the _npu_ring_mla operator deteriorates in long-sequencescenarios,
the long sequence is split into shorter sequences for input to improve
performance.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

---------

Signed-off-by: pichangping <1337510399@qq.com>
2026-01-05 11:41:59 +08:00
LookAround0301
d25a2c20c5 [Bugfix] Fix chunk prefill bug for long_sequence feature (#5444)
### What this PR does / why we need it?
Fix chunk prefill bug for long_sequence feature

When there are two requests with chunk prefill enabled in the
long-sequence scenario, if one request has only 1 token during
scheduling, it will be identified as a decode request and trigger an
error. This PR fixes the issue.
Closes: https://github.com/vllm-project/vllm-ascend/issues/5445

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: LookAround <lixushi@huawei.com>
2026-01-05 09:16:36 +08:00
zhenwenqi2024
5d9fde9819 [Feature] Refactor PCP &DCP related code (#5214)
### What this PR does / why we need it?
Refactor pcp& dcp related code. we use pcp_manager class to Unifiy
Manage pcp & dcp . as we do this , many code can be deleted from
model_runner, and can avoid break pcp & dcp by other developments.
RFC:https://github.com/vllm-project/vllm-ascend/issues/5449
### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
2025-12-31 09:29:57 +08:00