Add an example of using deepseekv3 int8 sglang. (#4177)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
@@ -184,6 +184,26 @@ AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for qua
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
```

### Example: Serving with 16 A100/A800 with int8 Quantization
There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to Huggingface. One example is as follows:
- [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
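For intuition on the two checkpoint formats: block-wise quantization stores one int8 scale per fixed-size weight tile (128 × 128 here mirrors the block size commonly used for DeepSeek block-format weights), while per-channel quantization stores one scale per output channel. Below is a minimal NumPy sketch of the two symmetric layouts — illustrative only; the checkpoints above were produced by their authors' own pipeline, and SGLang's actual kernels are fused GPU implementations.

```python
import numpy as np

def quant_per_channel(w):
    """Symmetric int8 quantization: one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quant_blockwise(w, block=128):
    """Symmetric int8 quantization: one scale per block x block weight tile."""
    q = np.empty(w.shape, dtype=np.int8)
    scales = np.empty((w.shape[0] // block, w.shape[1] // block))
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            tile = w[i:i + block, j:j + block]
            s = np.abs(tile).max() / 127.0
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = np.clip(np.round(tile / s), -127, 127)
    return q, scales

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

# Quantize, then dequantize to measure the reconstruction error of each layout.
q_pc, s_pc = quant_per_channel(w)
deq_pc = q_pc.astype(np.float32) * s_pc

q_bw, s_bw = quant_blockwise(w)
deq_bw = q_bw.astype(np.float32) * np.repeat(np.repeat(s_bw, 128, axis=0), 128, axis=1)

print("per-channel max abs error:", np.abs(w - deq_pc).max())
print("block-wise  max abs error:", np.abs(w - deq_bw).max())
```

Block-wise trades a little extra scale storage for tighter locality: one outlier only inflates the scale of its own tile rather than an entire channel.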
```bash
# master node (node rank 0)
python3 -m sglang.launch_server \
  --model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
  HEAD_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
# second node (node rank 1)
python3 -m sglang.launch_server \
  --model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
  HEAD_IP:5000 --nnodes 2 --node-rank 1 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
```
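A back-of-the-envelope check shows why int8 fits on 16 A100/A800 while BF16 does not. The arithmetic below assumes the commonly cited ~671B total parameter count for DeepSeek-R1 and 80 GB GPUs, and ignores activations, KV cache, and quantization-scale overhead — a rough sketch, not an official sizing guide.

```python
# Rough weight-memory estimate (assumptions stated above).
n_params = 671e9
gpu_mem_gb = 80                 # A100/A800, 80 GB variant
int8_gb = n_params * 1 / 1e9    # 1 byte per weight
bf16_gb = n_params * 2 / 1e9    # 2 bytes per weight

budget = 16 * gpu_mem_gb        # 16 GPUs -> 1280 GB total
print(f"16 GPUs ({budget:.0f} GB): int8 weights {int8_gb:.0f} GB, "
      f"bf16 weights {bf16_gb:.0f} GB")
```

The ~671 GB of int8 weights fit in the 1280 GB budget with headroom for KV cache, while ~1342 GB of BF16 weights would not, which is why unquantized serving needs more nodes.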
### Example: Serving on any cloud or Kubernetes with SkyPilot
SkyPilot helps find cheapest available GPUs across any cloud or existing Kubernetes clusters and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).
@@ -17,6 +17,7 @@ SGLang is recognized as one of the top engines for [DeepSeek model inference](ht
| | 4 x 8 x A100/A800 |
| **Quantized weights (AWQ)** | 8 x H100/800/20 |
| | 8 x A100/A800 |
| **Quantized weights (int8)** | 16 x A100/A800 |
<style>
.md-typeset__table {
@@ -54,6 +55,7 @@ Detailed commands for reference:
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
- [16 x A100 (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
### Download Weights