[Refactor] Replace npu_ring_mla with FIA in MLA prefill (#5704)
### What this PR does / why we need it?

**Refactor: Replace npu_ring_mla with FIA in MLA prefill**

This PR refactors the MLA (Multi-head Latent Attention) prefill implementation by replacing the `npu_ring_mla` operator with `npu_fused_infer_attention_score` (FIA), unifying the attention backend with the standard attention implementation.

**Key changes:**

1. **Core prefill refactoring (`mla_v1.py`)**
   - Replace `npu_ring_mla` with `npu_fused_infer_attention_score` in `_forward_prefill` and `_compute_prefill_context`
   - Use the TND layout with `softmax_lse_flag=True` for prefill attention
   - Use `npu_attention_update` to merge multiple chunk outputs via their LSE (log-sum-exp) values (see the LSE-merge sketch after this description)
   - Change `attn_mask` from `get_final_mla_mask()` to `get_splitfuse_attn_mask()` for FIA compatibility
2. **Data type handling**
   - Add automatic float16 → bfloat16 conversion, since FIA with the TND layout only supports bfloat16 (see the FIA sketch after this description)
   - Convert the output back to its original dtype after the FIA computation
3. **Metadata optimization**
   - Pre-calculate `actual_seq_lengths_q` in `AscendMLAPrefillMetadata`
   - Pre-calculate `chunk_actual_seq_lengths_kv_list` in `ChunkedContextMetadata`
   - Move `torch.cumsum` operations from the forward pass to the metadata-building phase
4. **CP compatibility (`mla_cp.py`)**
   - Add `_ring_mla_mask_builder` to build `npu_ring_mla`-compatible masks for Context Parallel scenarios
   - Add a `chunk_actual_seq_lengths_kv_list` field to `CPChunkedContextMetadata`

**Why we need it:**

- **Backend unification**: Aligns MLA prefill with the standard attention implementation (`attention_v1.py`)
- **Better chunked context support**: FIA + `npu_attention_update` provides native LSE-based output merging
- **Future compatibility**: Prepares for the eventual removal of `npu_ring_mla` across the codebase

### Does this PR introduce _any_ user-facing change?

**No.** This is a pure refactoring with no functional changes: same behavior, unified backend.

---

- Related issue: #5463 (item 7)
- vLLM version: v0.14.1

Signed-off-by: lico67373 <918688502@qq.com>
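The FIA sketch referenced above: a minimal illustration of the new prefill path, using the documented `torch_npu.npu_fused_infer_attention_score` keyword arguments. The function name, tensor names, and shapes here are illustrative assumptions, not the exact code in `mla_v1.py`.

```python
# Sketch of the FIA-based prefill call (assumed helper, not the actual
# mla_v1.py implementation).
import torch
import torch_npu


def fia_prefill(query, key, value, attn_mask, seq_lens_q, seq_lens_kv,
                num_heads, scale):
    # FIA with the TND layout only supports bfloat16, so convert
    # float16 inputs up front and restore the dtype afterwards.
    orig_dtype = query.dtype
    if orig_dtype == torch.float16:
        query, key, value = (t.to(torch.bfloat16)
                             for t in (query, key, value))

    # With the TND layout, actual_seq_lengths* are cumulative sequence
    # lengths; in this PR the cumsum is pre-computed during metadata
    # building rather than here in the forward pass.
    actual_seq_lengths_q = torch.cumsum(seq_lens_q, dim=0).tolist()
    actual_seq_lengths_kv = torch.cumsum(seq_lens_kv, dim=0).tolist()

    attn_out, softmax_lse = torch_npu.npu_fused_infer_attention_score(
        query, key, value,
        atten_mask=attn_mask,
        actual_seq_lengths=actual_seq_lengths_q,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
        num_heads=num_heads,
        scale=scale,
        input_layout="TND",
        softmax_lse_flag=True,  # also return the per-row log-sum-exp
    )
    return attn_out.to(orig_dtype), softmax_lse
```

Returning the LSE alongside the output is what makes the chunked-context merge below possible: each chunk's partial output can be combined exactly, without recomputing attention over the full prefix.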
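The PR merges per-chunk outputs with `npu_attention_update`; the pure-PyTorch sketch below shows only the underlying LSE math that such a merge performs. The shapes, broadcasting, and the claim that this matches the op's semantics are assumptions for illustration.

```python
# Sketch of LSE-based merging of two partial attention outputs.
import torch


def merge_with_lse(out_a, lse_a, out_b, lse_b):
    """Merge partial attention outputs using their log-sum-exp values.

    out_*: [tokens, heads, head_dim] partial attention outputs
    lse_*: [tokens, heads] log-sum-exp of the attention logits per row
    """
    lse = torch.logaddexp(lse_a, lse_b)         # combined normalizer
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)  # weight of chunk A
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)  # weight of chunk B
    return w_a * out_a + w_b * out_b, lse
```

Because the merge is associative, chunks can be folded in one at a time, which is how `_compute_prefill_context` can walk the cached prefix chunk by chunk and still produce the exact full-context result.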
@@ -85,8 +85,8 @@ CASE_DS_FULL_DECODE_ONLY = LLMTestCase(
    prompts=PROMPTS_LONG,
    golden_answers=[
        "\n\nSelect an assignment template",
        "\n\nI'm not sure how to approach this problem. I'm not sure if I should use the law of total probability or if I should use",
        "\n\n## Answer\n\n$a + b + c = 0$\n\nSolution\n\nLet $x$ be the common root of the equations",
        "\n\nI'm not sure how to approach this problem. I'm thinking that the area of the triangle is $1/2$ times the area",
        "\n\n## Answer\n\n$a + b + c = 0$\n\nSolution\n\nLet $x = \\alpha$ be the common root",
    ],
)
@@ -106,8 +106,8 @@ CASE_DS_EX = LLMTestCase(
    prompts=PROMPTS_LONG,
    golden_answers=[
        "\n\nSelect an assignment template",
        "\n\nI'm not sure how to approach this problem. I'm not sure if I should use the law of total probability or if I should use",
        "\n\n## Answer\n\n$a + b + c = 0$\n\nSolution\n\nLet $x$ be the common root of the equations",
        "\n\nI'm not sure how to approach this problem. I'm thinking that the area of the triangle is $1/2$ times the area",
        "\n\n## Answer\n\n$a + b + c = 0$\n\nSolution\n\nLet $x = \\alpha$ be the common root",
    ],
)