From 71ed01833dd766e581ce050bc26f72178408cf1d Mon Sep 17 00:00:00 2001
From: Baizhou Zhang
Date: Wed, 26 Feb 2025 20:40:45 -0800
Subject: [PATCH] [doc] Update document for flashinfer mla (#3907)

---
 docs/backend/server_arguments.md | 1 +
 docs/references/deepseek.md      | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/backend/server_arguments.md b/docs/backend/server_arguments.md
index c78e0a98c..204396a63 100644
--- a/docs/backend/server_arguments.md
+++ b/docs/backend/server_arguments.md
@@ -133,6 +133,7 @@ Please consult the documentation below to learn more about the parameters you ma
 * `attention_backend`: The backend for attention computation and KV cache management.
 * `sampling_backend`: The backend for sampling.
+* `enable_flashinfer_mla`: Enable the FlashInfer MLA attention wrapper, which can improve the throughput of DeepSeek models.
 
 ## Constrained Decoding
 
diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md
index a0d114781..3aac5e077 100644
--- a/docs/references/deepseek.md
+++ b/docs/references/deepseek.md
@@ -113,7 +113,7 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be
 
 - **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.
 
-- **Triton Decoding Kernel Optimization**: In the MLA decoding kernel, there is only one KV head. This optimization reduces memory access to the KV cache by processing multiple query heads within one block, accelerating the decoding process.
+- **Flashinfer MLA Wrapper**: By providing the `--enable-flashinfer-mla` argument, the server will use MLA kernels customized by FlashInfer. This optimization can be significant in long-context scenarios. For more details, refer to [this document](https://docs.flashinfer.ai/api/mla.html).
 - **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enable efficient FP8 inference. Additionally, we have implemented a Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
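
As a usage sketch, the flag documented in this patch is passed when launching the SGLang server; the model path and tensor-parallel size below are illustrative, not prescribed by the patch:

```shell
# Launch the SGLang server with the FlashInfer MLA wrapper enabled.
# deepseek-ai/DeepSeek-V3 and --tp 8 are illustrative; substitute your
# own DeepSeek MLA model and parallelism settings.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --enable-flashinfer-mla \
  --tp 8 \
  --trust-remote-code
```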