Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. (#4418)
@@ -218,6 +218,33 @@ python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_c
> **Note: using `--parallel 200` can accelerate accuracy benchmarking**.
### Example: Serving with 32 L40S with int8 Quantization
Run with the per-channel quantized model:
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
Assuming the master node's IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8`, and the distributed init port is 5000, launch the server across the four nodes with the following commands:
```bash
# master node (node-rank 0)
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 0 --trust-remote-code \
    --enable-torch-compile --torch-compile-max-bs 32

# worker nodes (run one command per node, with node-rank 1, 2, and 3)
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 1 --trust-remote-code \
    --enable-torch-compile --torch-compile-max-bs 32

python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 2 --trust-remote-code \
    --enable-torch-compile --torch-compile-max-bs 32

python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 3 --trust-remote-code \
    --enable-torch-compile --torch-compile-max-bs 32
```
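Once all four nodes are up, a quick sanity check is to send a test request to the master node. A sketch, assuming SGLang's default serving port `30000` (adjust if you pass `--port` to `launch_server`):

```bash
# Send one completion request to the OpenAI-compatible endpoint on the master node
# (port 30000 is the SGLang default serving port, an assumption here)
curl -s http://MASTER_IP:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meituan/DeepSeek-R1-Channel-INT8",
        "prompt": "The capital of France is",
        "max_tokens": 16
      }'
```

A JSON response with generated text indicates the tensor-parallel group across all four nodes is healthy.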
The benchmarking method is the same as described in the previous [16 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization) example.
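Concretely, a throughput run against this deployment might look like the sketch below; the host, port, and dataset path are placeholders, and the default serving port `30000` is assumed:

```bash
# Throughput benchmark from any machine that can reach the master node
python3 -m sglang.bench_serving --backend sglang --host MASTER_IP --port 30000 \
    --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200
```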
### Example: Serving on any cloud or Kubernetes with SkyPilot
SkyPilot finds the cheapest available GPUs across clouds or existing Kubernetes clusters and launches distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).
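For reference, a SkyPilot deployment is typically a single `sky launch` call against the recipe's task YAML. A sketch only: the cluster name and YAML filename below are illustrative, so use the file from the linked recipe:

```bash
# Launch the serving cluster from a SkyPilot task definition
# (filename is hypothetical; see the linked recipe for the real one)
sky launch -c deepseek-r1 deepseek-r1.yaml
```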