diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md
index 0a41ceae1..8863142c5 100644
--- a/benchmark/deepseek_v3/README.md
+++ b/benchmark/deepseek_v3/README.md
@@ -218,6 +218,33 @@ python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_c
 > **Note: using `--parallel 200` can accelerate accuracy benchmarking**.
 
+### Example: Serving with 32 x L40S with int8 Quantization
+
+Run with the per-channel quantized model:
+
+- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
+
+Assuming the master node IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8`, and the port is `5000`, use the following commands to launch the server:
+
+```bash
+# master node
+python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
+    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 0 --trust-remote-code \
+    --enable-torch-compile --torch-compile-max-bs 32
+# worker nodes
+python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
+    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 1 --trust-remote-code \
+    --enable-torch-compile --torch-compile-max-bs 32
+python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
+    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 2 --trust-remote-code \
+    --enable-torch-compile --torch-compile-max-bs 32
+python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
+    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 3 --trust-remote-code \
+    --enable-torch-compile --torch-compile-max-bs 32
+```
+
+The benchmarking method is the same as described in the previous [16 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization) example.
+
 ### Example: Serving on any cloud or Kubernetes with SkyPilot
 
 SkyPilot helps find cheapest available GPUs across any cloud or existing Kubernetes clusters and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).
diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md
index a80aab9dd..a056f2498 100644
--- a/docs/references/deepseek.md
+++ b/docs/references/deepseek.md
@@ -18,6 +18,7 @@ SGLang is recognized as one of the top engines for [DeepSeek model inference](ht
 | **Quantized weights (AWQ)** | 8 x H100/800/20 |
 | | 8 x A100/A800 |
 | **Quantized weights (int8)** | 16 x A100/800 |
+| | 32 x L40S |
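As a sanity check on the hardware-requirement table extended by this diff, a back-of-envelope calculation sketches why 32 L40S cards can hold the int8 weights. The figures used (roughly 671B parameters for DeepSeek-R1 and 48 GB of VRAM per L40S) are assumptions and do not appear in the diff itself:

```python
# Back-of-envelope check: do DeepSeek-R1 int8 weights fit on 32 x L40S?
# Assumed figures (not from the diff): ~671e9 parameters, 48 GB VRAM per L40S.
NUM_PARAMS = 671e9
BYTES_PER_PARAM = 1          # int8 weight-only quantization: one byte per weight
NUM_GPUS = 32
VRAM_PER_GPU_GB = 48

weight_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9   # total weight footprint in GB
total_vram_gb = NUM_GPUS * VRAM_PER_GPU_GB       # aggregate VRAM across the cluster
per_gpu_weight_gb = weight_gb / NUM_GPUS         # each rank's share under --tp 32

print(f"weights: {weight_gb:.0f} GB, cluster VRAM: {total_vram_gb} GB, "
      f"per-GPU share: {per_gpu_weight_gb:.1f} GB")
```

Under these assumptions the weights occupy about 671 GB of the cluster's 1536 GB, leaving the remaining VRAM per GPU for KV cache, activations, and CUDA graph or torch.compile overhead.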