From 75b656488a6418c14421611132ea0b4bc10e993d Mon Sep 17 00:00:00 2001
From: Wenbo Yang
Date: Mon, 17 Mar 2025 15:03:43 +0800
Subject: [PATCH] Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. (#4418)

---
 benchmark/deepseek_v3/README.md                    |  27 +++
 docs/references/deepseek.md                        |   2 +
 .../attention/triton_ops/extend_attention.py       |  19 ++-
 ...evice_name=NVIDIA_L20,dtype=int8_w8a8.json      | 146 ++++++++++++++++
 ...vice_name=NVIDIA_L40S,dtype=int8_w8a8.json      | 146 ++++++++++++++++
 sgl-kernel/csrc/gemm/int8_gemm_kernel.cu           | 158 +++++++++++++++++-
 sgl-kernel/tests/test_int8_gemm.py                 |   2 +-
 7 files changed, 489 insertions(+), 11 deletions(-)
 create mode 100644 python/sglang/srt/layers/moe/fused_moe_triton/configs/E=256,N=64,device_name=NVIDIA_L20,dtype=int8_w8a8.json
 create mode 100644 python/sglang/srt/layers/moe/fused_moe_triton/configs/E=256,N=64,device_name=NVIDIA_L40S,dtype=int8_w8a8.json

diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md
index 0a41ceae1..8863142c5 100644
--- a/benchmark/deepseek_v3/README.md
+++ b/benchmark/deepseek_v3/README.md
@@ -218,6 +218,33 @@ python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_c
 > **Note: using `--parallel 200` can accelerate accuracy benchmarking**.
+### Example: Serving with 32 L40S with int8 Quantization
+
+Run with the per-channel quantized model:
+
+- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
+
+Assuming the master node IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8`, and the port is 5000, launch the server with the following commands:
+
+```bash
+# master node (rank 0)
+python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
+    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 0 --trust-remote-code \
+    --enable-torch-compile --torch-compile-max-bs 32
+# worker nodes (ranks 1-3)
+python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
+    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 1 --trust-remote-code \
+    --enable-torch-compile --torch-compile-max-bs 32
+python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
+    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 2 --trust-remote-code \
+    --enable-torch-compile --torch-compile-max-bs 32
+python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
+    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 3 --trust-remote-code \
+    --enable-torch-compile --torch-compile-max-bs 32
+```
+
+The benchmarking method is the same as described in the previous [16 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization) example.
+
 ### Example: Serving on any cloud or Kubernetes with SkyPilot

 SkyPilot helps find cheapest available GPUs across any cloud or existing Kubernetes clusters and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).
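The launch commands in this hunk split one 32-way tensor-parallel group across 4 nodes (`--tp 32 --nnodes 4 --node-rank 0..3`). As a side note to the patch, a minimal shell sketch of the layout arithmetic those flags imply — the variable names here are illustrative, not SGLang options:

```shell
# Layout implied by the launch flags in this patch: --tp 32 --nnodes 4.
TP_SIZE=32
NNODES=4
# Each node must contribute an equal share of the tensor-parallel group,
# i.e. every machine needs TP_SIZE / NNODES visible GPUs.
GPUS_PER_NODE=$((TP_SIZE / NNODES))
echo "GPUs per node: ${GPUS_PER_NODE}"
# Valid values for --node-rank run from 0 to NNODES-1, matching ranks 0..3 above.
echo "Highest node rank: $((NNODES - 1))"
```

This is why the example uses exactly four machines of 8 L40S each: the TP group size must divide evenly across the nodes.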
diff --git a/docs/references/deepseek.md b/docs/references/deepseek.md
index a80aab9dd..a056f2498 100644
--- a/docs/references/deepseek.md
+++ b/docs/references/deepseek.md
@@ -18,6 +18,7 @@ SGLang is recognized as one of the top engines for [DeepSeek model inference](ht
 | **Quantized weights (AWQ)** | 8 x H100/800/20 |
 | | 8 x A100/A800 |
 | **Quantized weights (int8)** | 16 x A100/800 |
+| | 32 x L40S |
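For context on the new 32 x L40S row: a rough weight-memory estimate suggests why this GPU count works for the int8 checkpoint. The figures below are ballpark assumptions, not values from the patch — roughly 671B parameters for DeepSeek-R1 (about one byte per parameter at int8) and 48 GB of memory per L40S:

```shell
PARAMS_B=671    # assumed DeepSeek-R1 parameter count, in billions
TP_SIZE=32      # GPUs in the tensor-parallel group (from the launch example)
GPU_MEM_GB=48   # assumed memory per L40S, in GB
# At int8, weights take ~1 byte/parameter, so ~671 GB total,
# sharded evenly over the 32-way tensor-parallel group.
PER_GPU_GB=$(awk -v p="$PARAMS_B" -v n="$TP_SIZE" 'BEGIN { printf "%.1f", p / n }')
echo "Approx. weight memory per GPU: ${PER_GPU_GB} GB of ${GPU_MEM_GB} GB"
```

Roughly 21 GB of each 48 GB card goes to weights, leaving the remainder for activations and KV cache — tight, but feasible, which matches the PR's claim of serving on 32 L40S.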