Add Cutlass MLA attention backend (#5390)

Trevor Morris
2025-04-27 20:58:53 -07:00
committed by GitHub
parent 40d9b8acce
commit 84810da4ae
7 changed files with 305 additions and 3 deletions


@@ -138,7 +138,7 @@ Please consult the documentation below to learn more about the parameters you ma
## Kernel backend
-* `attention_backend`: This argument specifies the backend for attention computation and KV cache management, which can be `fa3`, `flashinfer`, `triton`, or `torch_native`. When deploying DeepSeek models, use this argument to specify the MLA backend.
+* `attention_backend`: This argument specifies the backend for attention computation and KV cache management, which can be `fa3`, `flashinfer`, `triton`, `cutlass_mla`, or `torch_native`. When deploying DeepSeek models, use this argument to specify the MLA backend.
* `sampling_backend`: The backend for sampling.
## Constrained Decoding
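To illustrate how the documented argument is used, a minimal launch sketch follows. The flag spelling `--attention-backend` and the CLI entry point are taken from SGLang's server launcher; the model path is an example placeholder:

```shell
# Sketch: launch an SGLang server for a DeepSeek model using the new
# Cutlass MLA attention backend added in this commit.
# (model path is illustrative; pick any DeepSeek MLA model you serve)
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --attention-backend cutlass_mla
```

The other documented values (`fa3`, `flashinfer`, `triton`, `torch_native`) are passed the same way via `--attention-backend`.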