Docs: Reorganize dpsk links (#3900)
@@ -124,6 +124,8 @@ drun -p 30000:30000 \
--port 30000
```
[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) may also be a useful reference.
### Running Llama3.1
Running Llama3.1 is nearly identical. The only difference is the model specified when starting the server, as shown in the following example command:
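For instance, a launch command along these lines (the model name here is illustrative, not taken from this page):

```shell
# Same server invocation as for DeepSeek, with only the model swapped out.
python3 -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```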
@@ -2,9 +2,88 @@
SGLang provides several optimizations specifically designed for the DeepSeek model to boost its inference speed. This document outlines current optimizations for DeepSeek. Additionally, the SGLang team is actively developing enhancements for [DeepSeek V3](https://github.com/sgl-project/sglang/issues/2591).
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
## Launch DeepSeek V3 with SGLang
SGLang is recognized as one of the top engines for [DeepSeek model inference](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). To run DeepSeek V3/R1 models, the requirements are as follows:
<table>
  <thead>
    <tr>
      <th>Weight Type</th>
      <th>Configuration</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="3"><b>Full precision FP8 (recommended)</b></td>
      <td>8 x H200</td>
    </tr>
    <tr>
      <td>8 x MI300X</td>
    </tr>
    <tr>
      <td>2 x 8 x H100/800/20</td>
    </tr>
    <tr>
      <td rowspan="4">Full precision BF16</td>
      <td>2 x 8 x H200</td>
    </tr>
    <tr>
      <td>2 x 8 x MI300X</td>
    </tr>
    <tr>
      <td>4 x 8 x H100/800/20</td>
    </tr>
    <tr>
      <td>4 x 8 x A100/A800</td>
    </tr>
    <tr>
      <td rowspan="2">Quantized weights (AWQ)</td>
      <td>8 x H100/800/20</td>
    </tr>
    <tr>
      <td>8 x A100/A800</td>
    </tr>
  </tbody>
</table>
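The configurations above track the raw weight footprint of the model. A rough sanity check of the numbers (the parameter count and per-GPU memory figures below are assumptions for illustration, not taken from this page):

```python
# Back-of-the-envelope check of the weight-memory requirements in the table.
# Assumed figures: DeepSeek-V3 has ~671B parameters; an H200 has 141 GB of
# HBM and an H100/H800 has 80 GB. Real deployments also need headroom for
# KV cache and activations, so a bare fit is not enough in practice.
PARAMS_B = 671  # model size in billions of parameters

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1e9 params * N bytes = N GB per billion)."""
    return PARAMS_B * bytes_per_param

fp8 = weight_gb(1.0)   # FP8:  1 byte/param  -> ~671 GB
bf16 = weight_gb(2.0)  # BF16: 2 bytes/param -> ~1342 GB

print(fp8 <= 8 * 141)   # True:  FP8 weights fit on 8 x H200 (1128 GB total)
print(fp8 <= 8 * 80)    # False: FP8 does not fit on 8 x H100, hence 2 x 8 nodes
print(bf16 <= 8 * 141)  # False: BF16 exceeds one H200 node, hence 2 x 8 nodes
```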
Detailed commands for reference:
- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended)
- [8 x MI300X](https://docs.sglang.ai/references/amd.html#running-deepseek-v3)
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
### Download Weights
@@ -87,11 +166,3 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
1. **Question**: What should I do if model loading takes too long and an NCCL timeout occurs?
   **Answer**: You can try adding `--dist-timeout 3600` when launching the model; this allows a 1-hour timeout.
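   For example (the model path and `--tp` value are placeholders; the timeout flag is the part that matters):

   ```shell
   # Extend the distributed-init timeout to 1 hour for slow weight loading.
   python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 \
     --trust-remote-code --dist-timeout 3600
   ```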
2. **Question**: How do I use quantized DeepSeek models?
**Answer**: AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization. One example is as follows:
```bash
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
```