From 06427dfab12b8cb149809ca17ddfa61c97b7b70e Mon Sep 17 00:00:00 2001
From: Shenggui Li
Date: Wed, 26 Feb 2025 01:43:28 +0800
Subject: [PATCH] [doc] added quantization doc for dpsk (#3843)

---
 docs/references/deepseek.md | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md
index 0267ab5a8..883dd6119 100644
--- a/docs/references/deepseek.md
+++ b/docs/references/deepseek.md
@@ -84,6 +84,14 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
 
 ## FAQ
 
-**Question**: What should I do if model loading takes too long and NCCL timeout occurs?
+1. **Question**: What should I do if model loading takes too long and NCCL timeout occurs?
 
-Answer: You can try to add `--dist-timeout 3600` when launching the model, this allows for 1-hour timeout.
+   **Answer**: You can try to add `--dist-timeout 3600` when launching the model; this allows for a 1-hour timeout.
+
+2. **Question**: How can I use quantized DeepSeek models?
+
+   **Answer**: DeepSeek's MLA does not support quantization. You need to add the `--disable-mla` flag to run a quantized model successfully. Meanwhile, AWQ does not support BF16, so also add the `--dtype half` flag if AWQ is used for quantization. One example is as follows:
+
+   ```bash
+   python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half --disable-mla
+   ```