Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. (#4418)

Wenbo Yang
2025-03-17 15:03:43 +08:00
committed by GitHub
parent 0f52fb55ec
commit 75b656488a
7 changed files with 489 additions and 11 deletions


@@ -218,6 +218,33 @@ python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_c
> **Note: using `--parallel 200` can accelerate accuracy benchmarking**.
### Example: Serving with 32 L40S with int8 Quantization
Run with the per-channel quantized model:
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
Assuming the master node IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8`, and the port is 5000, launch the server with the following commands:
```bash
# master node (rank 0)
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
--dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 0 --trust-remote \
--enable-torch-compile --torch-compile-max-bs 32
# remaining nodes in the cluster (ranks 1 to 3)
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
--dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 1 --trust-remote \
--enable-torch-compile --torch-compile-max-bs 32
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
--dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 2 --trust-remote \
--enable-torch-compile --torch-compile-max-bs 32
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
--dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 3 --trust-remote \
--enable-torch-compile --torch-compile-max-bs 32
```
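The four commands above differ only in `--node-rank`, so they can be generated from one template. The following is a minimal sketch under that assumption, not part of SGLang: the `build_launch_cmd` helper and the `NNODES` variable are hypothetical names introduced here for illustration, and `MASTER_IP` is the same placeholder used above. The script only prints the commands; it does not start any server.

```bash
#!/usr/bin/env bash
# Sketch: print the launch command for each node rank.
# build_launch_cmd and NNODES are illustrative names, not SGLang features.

MASTER_IP="${MASTER_IP:-MASTER_IP}"  # placeholder; export the real address first
NNODES=4                             # matches --nnodes 4 in the commands above

# Print the launch command for the node rank given as $1.
build_launch_cmd() {
  echo "python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8" \
       "--tp 32 --quantization w8a8_int8" \
       "--dist-init-addr ${MASTER_IP}:5000 --nnodes ${NNODES} --node-rank $1" \
       "--trust-remote" \
       "--enable-torch-compile --torch-compile-max-bs 32"
}

# One line per node; run the line for rank N on node N.
for rank in $(seq 0 $((NNODES - 1))); do
  build_launch_cmd "$rank"
done
```

Printing rather than executing keeps the sketch safe to run anywhere; on a real cluster each node would run only the line matching its own rank.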
The benchmarking method is the same as described in the previous [16 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization) example.
### Example: Serving on any cloud or Kubernetes with SkyPilot
SkyPilot helps find the cheapest available GPUs across any cloud or existing Kubernetes clusters and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).