[Perf] Deepseekv3 performance optimization for eager mode (#598)

### What this PR does / why we need it?
DeepSeek V3 currently uses vanilla chunked prefill on the MLA part, which is
inefficient for computation but necessary for chunked prefill. Since PR
https://github.com/vllm-project/vllm-ascend/pull/543 brought the v0 scheduler
into vllm-ascend, we can now use torch_npu._npu_flash_attention inside
the MLA backend for a further performance boost. Some redundant
computation inside the RoPE path has also been removed. This PR
should bring a performance gain for DeepSeek eager-mode inference.
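To illustrate the RoPE cleanup mentioned above, here is a minimal, hypothetical sketch of the general technique: precomputing the cos/sin tables once and reusing them across forward calls, instead of recomputing them for every token. All names (`build_rope_cache`, `apply_rope`) are illustrative and are not the PR's actual code.

```python
import math

def build_rope_cache(max_seq_len, head_dim, base=10000.0):
    """Precompute cos/sin tables for rotary embeddings once, up front."""
    cache = []
    for pos in range(max_seq_len):
        row = []
        for i in range(0, head_dim, 2):
            # Inverse frequency for this (even, odd) feature pair.
            freq = 1.0 / (base ** (i / head_dim))
            angle = pos * freq
            row.append((math.cos(angle), math.sin(angle)))
        cache.append(row)
    return cache

def apply_rope(x_pair, cos_sin):
    """Rotate one (even, odd) feature pair by its cached angle."""
    x0, x1 = x_pair
    c, s = cos_sin
    return (x0 * c - x1 * s, x0 * s + x1 * c)

# Build the cache once; every subsequent call only does the rotation.
cache = build_rope_cache(max_seq_len=8, head_dim=4)
rotated = apply_rope((1.0, 0.0), cache[0][0])  # position 0: identity rotation
```

The point is that the trigonometric tables depend only on position and head dimension, so rebuilding them inside each forward pass is pure redundant work.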

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Committed by Pleaplusone on 2025-04-29 17:12:03 +08:00 (via GitHub)
parent 87975fa058
commit 0329fad927
4 changed files with 180 additions and 102 deletions


@@ -136,11 +136,6 @@ class RotaryEmbedding(nn.Module):
# test with leading dimension and merge seqlen and batch_size as num_tokens
# TODO(ganyi): open this test in the future
@pytest.mark.skip(
    reason="skip this test by default for now because of ci issue, will enable it in the future"
)
@pytest.mark.parametrize("is_neox_style", IS_NEOX_STYLE)
@pytest.mark.parametrize("batch_size", BATCH_SIZES)
@pytest.mark.parametrize("seq_len", SEQ_LENS)