From 3e02526b1ff77c40c192bcf199673e696d3bb702 Mon Sep 17 00:00:00 2001
From: Baizhou Zhang
Date: Thu, 27 Feb 2025 01:55:36 -0800
Subject: [PATCH] [Doc] Add experimental tag for flashinfer mla (#3925)

---
 docs/backend/server_arguments.md | 2 +-
 docs/references/deepseek.md      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/backend/server_arguments.md b/docs/backend/server_arguments.md
index 204396a63..7879ada57 100644
--- a/docs/backend/server_arguments.md
+++ b/docs/backend/server_arguments.md
@@ -133,7 +133,7 @@ Please consult the documentation below to learn more about the parameters you ma
 
 * `attention_backend`: The backend for attention computation and KV cache management.
 * `sampling_backend`: The backend for sampling.
-* `enable_flashinfer_mla`: The backend for flashinfer MLA wrapper. It can optimize the throughput of deepseek models.
+* `enable_flashinfer_mla`: Use the FlashInfer MLA wrapper as the attention backend, which accelerates DeepSeek models. (Experimental)
 
 ## Constrained Decoding
 
diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md
index 2ed666088..01ac87496 100644
--- a/docs/references/deepseek.md
+++ b/docs/references/deepseek.md
@@ -85,7 +85,7 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be
 
 - **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.
 
-- **Flashinfer MLA Wrapper**: By providing `--enable-flashinfer-mla` argument, the server will use MLA kernels customized by Flashinfer. This optimization can be significant under long context scenarios. More details can be referred to [this document](https://docs.flashinfer.ai/api/mla.html).
+- **Flashinfer MLA Wrapper**: By providing the `--enable-flashinfer-mla` argument, the server will use MLA kernels customized by FlashInfer. More details can be found in [this document](https://docs.flashinfer.ai/api/mla.html). (Experimental)
 
 - **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
 
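For context, the experimental flag documented by this patch is passed when launching the SGLang server. A minimal invocation might look like the following sketch; the model path and port are illustrative placeholders, not taken from this patch:

```shell
# Launch the SGLang server with the experimental FlashInfer MLA backend.
# Model path and port below are illustrative, not prescribed by this patch.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V2-Lite \
  --enable-flashinfer-mla \
  --trust-remote-code \
  --port 30000
```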