Docs: Reorganize dpsk links (#3900)
@@ -2,18 +2,8 @@
The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang also supports [MLA optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [DP attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended).
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
For the optimizations SGLang applies to the DeepSeek series of models, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.ai/references/deepseek.html).
## Hardware Recommendation
- 8 x NVIDIA H200 GPUs
If you do not have GPUs with large enough memory, please try multi-node tensor parallelism. There is an example of serving with [2 H20 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) below.
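As a minimal sketch of that multi-node setup (the IP address, port, and model path below are placeholders; see the linked example for the full walkthrough), each node runs the same launch command with its own rank:

```bash
# Node 0 (placeholder address 10.0.0.1) hosts the HTTP endpoint.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# Node 1 joins the same tensor-parallel group via the shared init address.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```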
For running on AMD MI300X, use [Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) as a reference.
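As a rough sketch only (the container setup and any ROCm-specific options are assumptions; follow the linked post for the authoritative steps), the launch on a single 8-GPU MI300X node mirrors the NVIDIA case:

```bash
# Hypothetical sketch: inside an SGLang ROCm environment, tensor parallelism
# across the eight MI300X GPUs is requested the same way as on NVIDIA GPUs.
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code
```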
## Installation & Launch
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand, or to restart the server until all weights have been downloaded.
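One way to pre-download the weights is with the Hugging Face CLI (shown here for the DeepSeek-V3 checkpoint; swap in whichever model you plan to serve):

```bash
# Fetch all weight shards into the local Hugging Face cache ahead of time,
# so sglang.launch_server does not have to download them at startup.
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-V3
```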
@@ -183,6 +173,15 @@ python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.
```bash
# Benchmark a single batch (batch size 1, 128 input / 128 output tokens) against the running server.
python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128
```
### Example: Serving with 8 A100/A800 GPUs and AWQ Quantization
AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization. One example is as follows:
```bash
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
```
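Once the server is up, a quick sanity check is to send a request to the OpenAI-compatible endpoint SGLang exposes (the prompt below is just a placeholder, and the port assumes the default 30000):

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cognitivecomputations/DeepSeek-R1-AWQ",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 32
  }'
```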
### Example: Serving on any cloud or Kubernetes with SkyPilot
SkyPilot helps find the cheapest available GPUs across clouds and existing Kubernetes clusters, and launches distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).
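As a minimal sketch (the recipe filename is taken from the linked repository and may change; cloud credentials or a Kubernetes context must already be configured), a launch looks roughly like:

```bash
# Grab the SkyPilot examples and launch the DeepSeek-R1 serving recipe on the
# cheapest GPUs SkyPilot can find across the configured clouds/clusters.
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/deepseek-r1
sky launch deepseek-r1-671B.yaml -c deepseek-r1 --retry-until-up
```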