From 7c1692aa90b208b1925292f155908fa745a52335 Mon Sep 17 00:00:00 2001 From: Chayenne Date: Wed, 26 Feb 2025 15:16:31 -0800 Subject: [PATCH] Docs: Reorganize dpsk links (#3900) --- benchmark/deepseek_v3/README.md | 19 ++++--- docs/references/amd.md | 2 + docs/references/deepseek.md | 89 +++++++++++++++++++++++++++++---- 3 files changed, 91 insertions(+), 19 deletions(-) diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md index 910862db5..9d0ead627 100644 --- a/benchmark/deepseek_v3/README.md +++ b/benchmark/deepseek_v3/README.md @@ -2,18 +2,8 @@ The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang also supports [MLA optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [DP attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended). -Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources. - For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.ai/references/deepseek.html). -## Hardware Recommendation - -- 8 x NVIDIA H200 GPUs - -If you do not have GPUs with large enough memory, please try multi-node tensor parallelism. There is an example serving with [2 H20 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) below. - -For running on AMD MI300X, use this as a reference.
[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) - ## Installation & Launch If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. @@ -183,6 +173,15 @@ python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0. python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128 ``` + +### Example: Serving with 8 A100/A800 with AWQ Quantization + +AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization. One example is as follows: + +```bash +python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half +``` + ### Example: Serving on any cloud or Kubernetes with SkyPilot SkyPilot helps find cheapest available GPUs across any cloud or existing Kubernetes clusters and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1). diff --git a/docs/references/amd.md b/docs/references/amd.md index 4b1c8230d..4f88b1373 100644 --- a/docs/references/amd.md +++ b/docs/references/amd.md @@ -124,6 +124,8 @@ drun -p 30000:30000 \ --port 30000 ``` +[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference. + ### Running Llama3.1 Running Llama3.1 is nearly identical. 
The only difference is in the model specified when starting the server, shown by the following example command: diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md index 68bffc6e3..a0d114781 100644 --- a/docs/references/deepseek.md +++ b/docs/references/deepseek.md @@ -2,9 +2,88 @@ SGLang provides several optimizations specifically designed for the DeepSeek model to boost its inference speed. This document outlines current optimizations for DeepSeek. Additionally, the SGLang team is actively developing enhancements for [DeepSeek V3](https://github.com/sgl-project/sglang/issues/2591). +Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources. + ## Launch DeepSeek V3 with SGLang -SGLang is recognized as one of the top engines for [DeepSeek model inference](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). Refer to [installation and launch](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#installation--launch) to learn how to run fast inference of DeepSeek V3/R1 with SGLang. +SGLang is recognized as one of the top engines for [DeepSeek model inference](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). To run DeepSeek V3/R1 models, the requirements are as follows:
+
+<table>
+  <tr>
+    <th colspan="2">Weight Configurations</th>
+  </tr>
+  <tr>
+    <th>Weight Type</th>
+    <th>Configuration</th>
+  </tr>
+  <tr>
+    <td rowspan="3">Full precision FP8 (recommended)</td>
+    <td>8 x H200</td>
+  </tr>
+  <tr>
+    <td>8 x MI300X</td>
+  </tr>
+  <tr>
+    <td>2 x 8 x H100/800/20</td>
+  </tr>
+  <tr>
+    <td rowspan="4">Full precision BF16</td>
+    <td>2 x 8 x H200</td>
+  </tr>
+  <tr>
+    <td>2 x 8 x MI300X</td>
+  </tr>
+  <tr>
+    <td>4 x 8 x H100/800/20</td>
+  </tr>
+  <tr>
+    <td>4 x 8 x A100/A800</td>
+  </tr>
+  <tr>
+    <td rowspan="2">Quantized weights (AWQ)</td>
+    <td>8 x H100/800/20</td>
+  </tr>
+  <tr>
+    <td>8 x A100/A800</td>
+  </tr>
+</table>
+
+Detailed commands for reference: + +- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended) +- [8 x MI300X](https://docs.sglang.ai/references/amd.html#running-deepseek-v3) +- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes) +- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes) +- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization) ### Download Weights @@ -87,11 +166,3 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o 1. **Question**: What should I do if model loading takes too long and NCCL timeout occurs? **Answer**: You can try to add `--dist-timeout 3600` when launching the model, this allows for 1-hour timeout. - -2. **Question**: How to use quantized DeepSeek models? - - **Answer**: AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization. One example is as follows: - - ```bash - python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half - ```
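Editor's aside (not part of the patch): the hardware table added to docs/references/deepseek.md can be sanity-checked with rough weight-memory arithmetic. The numbers below are assumptions of this sketch, not taken from the patch: ~671B total parameters for DeepSeek-V3/R1, 1 byte per parameter for FP8, 2 bytes for BF16, and per-GPU HBM of 141 GB (H200), 192 GB (MI300X), 80 GB (H100/A100). Activations, KV cache, and CUDA graphs need additional headroom, so aggregate HBM covering the raw weights is a necessary minimum, not a sufficient condition.

```python
# Back-of-envelope check of the weight configurations in the table above.
# All constants are assumptions of this sketch, not taken from the patch.

PARAMS_B = 671  # assumed total parameter count, in billions
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}
GPU_GB = {"H200": 141, "MI300X": 192, "H100": 80, "A100": 80}  # assumed HBM sizes

def weights_gb(dtype: str) -> float:
    """Approximate raw weight size in GB for the given dtype."""
    return PARAMS_B * BYTES_PER_PARAM[dtype]

def fits(dtype: str, gpu: str, count: int) -> bool:
    """True if `count` GPUs of type `gpu` have enough aggregate HBM for the weights."""
    return count * GPU_GB[gpu] >= weights_gb(dtype)

if __name__ == "__main__":
    for dtype, gpu, count in [("fp8", "H200", 8), ("fp8", "H100", 16),
                              ("bf16", "H200", 16), ("bf16", "A100", 32)]:
        print(f"{dtype} on {count} x {gpu}: fits={fits(dtype, gpu, count)}")
```

Under these assumptions, every row of the table clears the minimum (e.g. 8 x H200 gives 1128 GB against ~671 GB of FP8 weights), which is consistent with a single 8-GPU H100 node being insufficient for full-precision weights.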
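Editor's aside (not part of the patch): the README's advice to make sure the weights have finished downloading before launching can be automated. The sketch below assumes the standard Hugging Face sharded-safetensors layout, where `model.safetensors.index.json` lists every shard in its `"weight_map"`; the helper name `missing_shards` is illustrative, not an SGLang API.

```python
# Pre-flight check: list weight shards referenced by the checkpoint index
# that are not yet present on disk. Assumes the Hugging Face sharded
# safetensors layout (model.safetensors.index.json with a "weight_map").
import json
from pathlib import Path

def missing_shards(model_dir: str) -> list[str]:
    """Return shard filenames referenced by the index but absent on disk."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    shards = sorted(set(index["weight_map"].values()))  # de-duplicate tensor->shard map
    return [s for s in shards if not (root / s).exists()]
```

Running this against the local model directory before `sglang.launch_server` (and re-downloading while the returned list is non-empty) avoids the restart-until-complete loop the README describes.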