Files
xc-llm-ascend/tests
Pleaplusone 0329fad927 [Perf] Deepseekv3 performance optimization for eager mode (#598)
### What this PR does / why we need it?
Deepseek v3 now adopt vanilla chunked prefill on MLA part which is
ineffcient for computing but necessary for chunked prefill. Since PR
https://github.com/vllm-project/vllm-ascend/pull/543 bring v0 scheduler
into vllm-ascend, we can now adopt torch_npu._npu_flash_attention inside
the mla backend for more performance boost. Also there are some
redundant computation inside the rope, which is also removed. This PR
should bring some performance gain for deepseek eager mode inference.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-29 17:12:03 +08:00
..
2025-04-23 20:56:24 +08:00
2025-04-24 17:12:12 +08:00
2025-04-24 17:12:12 +08:00