diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md
index 4a1ed37d2..5375008b7 100644
--- a/docs/references/deepseek.md
+++ b/docs/references/deepseek.md
@@ -136,7 +136,9 @@ With data parallelism attention enabled, we have achieved up to **1.9x** decodin
 - **Weight**: Per-128x128-block quantization for better numerical stability.
 
-**Usage**: Turn on by default for DeepSeek V3 models.
+- **DeepGEMM**: The [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) kernel library is designed for FP8 matrix multiplications. Note that enabling DeepGEMM causes a large compilation overhead during the first few runs.
+
+**Usage**: The activation and weight optimizations above are turned on by default for DeepSeek V3 models. DeepGEMM is turned off by default and can be enabled with the environment variable `SGL_ENABLE_JIT_DEEPGEMM=1`.
 
 ### Multi-token Prediction
 
 **Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively on H200 TP8 setting.
diff --git a/python/sglang/srt/layers/quantization/fp8_kernel.py b/python/sglang/srt/layers/quantization/fp8_kernel.py
index e2b597c4f..36060d374 100644
--- a/python/sglang/srt/layers/quantization/fp8_kernel.py
+++ b/python/sglang/srt/layers/quantization/fp8_kernel.py
@@ -45,7 +45,9 @@ if _is_cuda:
     from sgl_kernel import sgl_per_token_group_quant_fp8, sgl_per_token_quant_fp8
 
     sm_version = get_device_sm()
-    if sm_version == 90 and get_bool_env_var("SGL_ENABLE_JIT_DEEPGEMM", default="true"):
+    if sm_version == 90 and get_bool_env_var(
+        "SGL_ENABLE_JIT_DEEPGEMM", default="false"
+    ):
         _enable_jit_deepgemm = True
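For context on the second hunk, here is a minimal sketch of how a boolean environment-variable gate like this typically behaves. The helper name `get_bool_env_var` and its call site come from the diff; the function body below is an assumption for illustration, not the actual sglang implementation:

```python
import os


def get_bool_env_var(name: str, default: str = "false") -> bool:
    """Sketch: read an environment variable and interpret it as a boolean.

    Assumed semantics: the string default is used when the variable is
    unset, and truthy spellings like "true" / "1" enable the flag.
    """
    value = os.getenv(name, default)
    return value.lower() in ("true", "1")


# With default="false", DeepGEMM stays off unless the user opts in, e.g.:
#   $ SGL_ENABLE_JIT_DEEPGEMM=1 python -m sglang.launch_server ...
if __name__ == "__main__":
    print(get_bool_env_var("SGL_ENABLE_JIT_DEEPGEMM", default="false"))
```

Flipping the default from `"true"` to `"false"` makes DeepGEMM opt-in, which matches the doc change: users who want the FP8 kernels accept the JIT compilation overhead explicitly rather than paying it by default.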