[Feature] Add FlashAttention3 as a backend for VisionAttention (#5764)
Co-authored-by: othame <chenzhu_912@zju.edu.cn>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
@@ -166,10 +166,11 @@ Please consult the documentation below and [server_args.py](https://github.com/s
## Kernel backend

| Arguments | Description | Defaults |
|----------|-------------|---------|
| `attention_backend` | This argument specifies the backend for attention computation and KV cache management, which can be `fa3`, `flashinfer`, `triton`, `flashmla`, `cutlass_mla`, or `torch_native`. When deploying DeepSeek models, use this argument to specify the MLA backend. | None |
| `sampling_backend` | Specifies the backend used for sampling. | None |

| Arguments | Description | Defaults |
|------------------------|-------------|---------|
| `attention_backend` | This argument specifies the backend for attention computation and KV cache management, which can be `fa3`, `flashinfer`, `triton`, `cutlass_mla`, or `torch_native`. When deploying DeepSeek models, use this argument to specify the MLA backend. | None |
| `sampling_backend` | Specifies the backend used for sampling. | None |
| `mm_attention_backend` | Set multimodal attention backend. | None |

## Constrained Decoding
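As a hedged sketch of how the arguments in the table above might be passed at launch time: the entry point `sglang.launch_server` and the flag names follow the documented argument names, but the model path below is a placeholder, and passing `fa3` to `--mm-attention-backend` assumes the FlashAttention3 VisionAttention backend this commit adds.

```shell
# Sketch: choosing kernel backends when starting the server.
# The model path is a placeholder; swap in any vision-language model you serve.
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --attention-backend fa3 \
  --mm-attention-backend fa3
```

If a flag is omitted, the table's default of None applies and the server picks a backend automatically.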