[Feature] Support ragged prefill in flashinfer mla backend (#3967)

Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
Author: Baizhou Zhang
Date: 2025-02-28 18:13:56 -08:00
Committed by: GitHub
Parent: f3b99f73b3
Commit: 90a4b7d98a
9 changed files with 308 additions and 407 deletions


@@ -133,7 +133,6 @@ Please consult the documentation below to learn more about the parameters you ma
 * `attention_backend`: The backend for attention computation and KV cache management.
 * `sampling_backend`: The backend for sampling.
-* `enable_flashinfer_mla`: The backend for flashinfer MLA wrapper that accelerates deepseek models. (In Experiment Stage)
 ## Constrained Decoding
@@ -186,3 +185,5 @@ Please consult the documentation below to learn more about the parameters you ma
 * `cuda_graph_bs`: The batch sizes to capture by `CudaGraphRunner`. By default this is done for you.
 * `torchao_config`: Experimental feature that optimizes the model with [torchao](https://github.com/pytorch/ao). Possible choices are: int8dq, int8wo, int4wo-<group_size>, fp8wo, fp8dq-per_tensor, fp8dq-per_row.
 * `triton_attention_num_kv_splits`: Use to adjust the number of KV splits in triton kernels. Default is 8.
+* `enable_flashinfer_mla`: The backend for flashinfer MLA wrapper that accelerates deepseek models.
+* `flashinfer_mla_disable_ragged`: Disable the ragged prefill wrapper in the flashinfer MLA attention backend. Only takes effect when `enable_flashinfer_mla` is turned on.
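As a sketch of how the two documented flags combine (assuming they are exposed as the usual `--kebab-case` CLI arguments of `sglang.launch_server`, and using an illustrative DeepSeek model path), a launch that enables the flashinfer MLA backend but opts out of the ragged prefill wrapper might look like:

```shell
# Hypothetical invocation; flag names taken from the diff above,
# model path is only an example.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V2-Lite \
  --enable-flashinfer-mla \
  --flashinfer-mla-disable-ragged
```

Omitting `--flashinfer-mla-disable-ragged` would leave the new ragged prefill path active, which is the default this commit introduces.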