Docs: Reorganize dpsk links (#3900)

This commit is contained in:
Chayenne
2025-02-26 15:16:31 -08:00
committed by GitHub
parent 8f019c7d1a
commit 7c1692aa90
3 changed files with 91 additions and 19 deletions


@@ -124,6 +124,8 @@ drun -p 30000:30000 \
--port 30000
```
[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference.
### Running Llama3.1
Running Llama3.1 is nearly identical. The only difference is in the model specified when starting the server, shown by the following example command:
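A hedged sketch of such a launch command (the model ID and port are illustrative assumptions, not taken from this diff):

```shell
# Sketch: launching Llama 3.1 with SGLang; the model ID and port are illustrative
python3 -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```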


@@ -2,9 +2,88 @@
SGLang provides several optimizations specifically designed for the DeepSeek model to boost its inference speed. This document outlines current optimizations for DeepSeek. Additionally, the SGLang team is actively developing enhancements for [DeepSeek V3](https://github.com/sgl-project/sglang/issues/2591).
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
## Launch DeepSeek V3 with SGLang
SGLang is recognized as one of the top engines for [DeepSeek model inference](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). To run DeepSeek V3/R1 models, the requirements are as follows:
<table>
  <thead>
    <tr>
      <th>Weight Type</th>
      <th>Configuration</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="3"><b>Full precision FP8 (recommended)</b></td>
      <td>8 x H200</td>
    </tr>
    <tr>
      <td>8 x MI300X</td>
    </tr>
    <tr>
      <td>2 x 8 x H100/800/20</td>
    </tr>
    <tr>
      <td rowspan="4">Full precision BF16</td>
      <td>2 x 8 x H200</td>
    </tr>
    <tr>
      <td>2 x 8 x MI300X</td>
    </tr>
    <tr>
      <td>4 x 8 x H100/800/20</td>
    </tr>
    <tr>
      <td>4 x 8 x A100/A800</td>
    </tr>
    <tr>
      <td rowspan="2">Quantized weights (AWQ)</td>
      <td>8 x H100/800/20</td>
    </tr>
    <tr>
      <td>8 x A100/A800</td>
    </tr>
  </tbody>
</table>
Detailed commands for reference:
- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended)
- [8 x MI300X](https://docs.sglang.ai/references/amd.html#running-deepseek-v3)
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
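The hardware requirements above can be sanity-checked with rough arithmetic. The figures below (a ~671B parameter count for DeepSeek-V3 and nominal per-GPU HBM sizes) are approximations from outside this document, and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead:

```python
# Rough VRAM check: can a given GPU pool hold the DeepSeek-V3 weights?
# Assumptions: ~671B parameters; nominal HBM capacities; weight bytes only.

PARAMS_B = 671  # approximate DeepSeek-V3 parameter count, in billions

GPU_GB = {"H200": 141, "MI300X": 192, "H100": 80, "A100": 80}

def weight_gb(bytes_per_param: float) -> float:
    """Total weight footprint in GB (1e9 params * bytes / 1e9 bytes-per-GB)."""
    return PARAMS_B * bytes_per_param

def fits(gpu: str, count: int, bytes_per_param: float) -> bool:
    """True if the pooled HBM of `count` GPUs can hold the weights."""
    return count * GPU_GB[gpu] >= weight_gb(bytes_per_param)

print(fits("H200", 8, 1))    # FP8 (1 byte/param): 1128 GB >= 671 GB -> True
print(fits("H200", 8, 2))    # BF16 on one node: 1128 GB < 1342 GB -> False
print(fits("H200", 16, 2))   # BF16 on two nodes: 2256 GB >= 1342 GB -> True
```

This matches the table's pattern: FP8 fits on a single 8-GPU node of large-memory cards, while BF16 roughly doubles the footprint and needs twice the hardware.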
### Download Weights
@@ -87,11 +166,3 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
1. **Question**: What should I do if model loading takes too long and NCCL timeout occurs?
    **Answer**: You can try adding `--dist-timeout 3600` when launching the model; this allows a 1-hour timeout.
2. **Question**: How to use quantized DeepSeek models?
    **Answer**: AWQ does not support BF16, so pass the `--dtype half` flag when serving an AWQ-quantized model. One example is as follows:
```bash
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
```
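Putting the FAQ answers together, a hedged launch sketch (the model ID and TP size are illustrative, not prescriptive):

```shell
# Sketch: a launch combining the flags discussed in the FAQ above
python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code \
  --dist-timeout 3600   # tolerate up to 1 hour of weight loading / NCCL init
```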