[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870)
## What this PR does / why we need it?

This PR fixes the `AttentionMaskBuilder` singleton initialization issue introduced in PR #4779 and removes the unused `pcp_prefill_mask` field.

### Background

After PR #4779 made `AttentionMaskBuilder` a singleton with the `@singleton` decorator, the class constructor requires a `device` parameter. However, two initialization sites still used the old parameterless constructor, causing failures.

### Changes

1. **Fix singleton initialization**
   - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendMLAMetadataBuilder.__init__()`
   - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendAttentionMetadataBuilder.__init__()`
2. **Remove unused field**
   - Removed the `pcp_prefill_mask` field from `AscendPrefillContextParallelMetadata` (never used in the codebase)
   - Updated related test assertions

### Related

- Issue #5463
- PR #4779 (Unify all mask generation methods)
- PR #5389 (Make AttentionMaskBuilder singleton)

## Does this PR introduce _any_ user-facing change?

No. This is an internal refactoring.

## How was this patch tested?

- ✅ Local testing: No linter errors
- ✅ Unit tests for attention modules verified
- ⏳ CI pipeline

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
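To illustrate the failure mode described under Background, here is a minimal sketch of a singleton decorator wrapped around a class whose constructor requires a `device` argument. The `singleton` function and the `MaskBuilder` class below are hypothetical stand-ins written for this example; they are not the actual vllm-ascend implementation, which may differ in how it caches instances and handles arguments.

```python
from typing import Any, Callable, Dict


def singleton(cls: type) -> Callable[..., Any]:
    """Hypothetical decorator: build one instance on first call, reuse it after."""
    instances: Dict[type, Any] = {}

    def get_instance(*args: Any, **kwargs: Any) -> Any:
        if cls not in instances:
            # Constructor arguments are only consumed by the first call.
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return get_instance


@singleton
class MaskBuilder:
    """Stand-in for AttentionMaskBuilder; only the constructor signature is modelled."""

    def __init__(self, device: str) -> None:
        self.device = device


def build_without_device() -> str:
    """Mimic an old call site that still uses the parameterless constructor."""
    try:
        MaskBuilder()  # no device passed, as in the pre-fix call sites
    except TypeError as exc:
        return type(exc).__name__
    return "ok"


# The parameterless call fails because no instance exists yet.
error_name = build_without_device()

# Call sites that pass the device share one instance.
first = MaskBuilder("npu:0")
second = MaskBuilder("npu:0")
```

In this sketch, a parameterless call raises only when it is the first call to reach the constructor; once an instance is cached, later calls return it regardless of arguments. That ordering dependence is one reason to pass `self.device` at every call site, as this PR does.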
@@ -32,8 +32,7 @@ from vllm.v1.worker.gpu.sample.output import SamplerOutput
 
 from vllm_ascend.worker.v2.aclgraph_utils import AclGraphManager
 from vllm_ascend.worker.v2.attn_utils import (build_attn_metadata,
-                                              build_attn_state,
-                                              make_attention_mask)
+                                              build_attn_state)
 from vllm_ascend.worker.v2.input_batch import AscendInputBuffers
 from vllm_ascend.worker.v2.sample.sampler import AscendSampler
 from vllm_ascend.worker.v2.states import AscendRequestState, uva_wrapper
@@ -155,12 +154,6 @@ class NPUModelRunner(GPUModelRunner):
             num_scheduled_tokens,
             num_valid_tokens,
         )
-        attn_mask = make_attention_mask(
-            self.vllm_config,
-            attn_state,
-            self.dtype,
-            self.device,
-        )
 
         idx_mapping_list = [
             self.req_states.req_id_to_index[req_id] for req_id in req_ids
@@ -284,7 +277,6 @@ class NPUModelRunner(GPUModelRunner):
             slot_mappings=slot_mappings.to(torch.int32),
             kv_cache_config=self.kv_cache_config,
             decode_token_per_req=self.decode_token_per_req,
-            attn_mask=attn_mask,
             attn_state=attn_state,
         )
 