Add deepseek-v3 a100 serving example (#3404)
@@ -74,10 +74,10 @@ If the command fails, try setting the `GLOO_SOCKET_IFNAME` parameter. For more i
```bash
# node 1
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```
If you have two H100 nodes, the usage is similar to the aforementioned H20.
@@ -131,6 +131,35 @@ docker run --gpus all \
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000 --output-file "deepseekv3_multinode.jsonl"
```
### Example: Serving with four A100*8 nodes
To serve DeepSeek-V3 on A100 GPUs, which lack native FP8 support, we first need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with the [`fp8_cast_bf16.py` script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) from the DeepSeek-V3 repository.
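As a minimal sketch of that conversion step, the snippet below prints the command it would run and only executes it when `DRY_RUN=0` is set. The `run` helper and the `FP8_DIR` path are hypothetical names introduced for illustration, and the two flags are the ones the script is understood to accept at the time of writing, so it is worth confirming them with `python3 fp8_cast_bf16.py --help` before relying on this.

```bash
# Hypothetical local paths; replace them with your own.
FP8_DIR=/path/to/DeepSeek-V3            # downloaded FP8 checkpoint (assumed location)
BF16_DIR=/path/to/DeepSeek-V3-BF16      # output path used by the launch commands

# Print a command, and execute it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# Intended to be run from the DeepSeek-V3 repository's inference/ directory.
run python3 fp8_cast_bf16.py \
  --input-fp8-hf-path "$FP8_DIR" \
  --output-bf16-hf-path "$BF16_DIR"
```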

Since the BF16 model is over 1.3 TB, we need four A100 nodes, each with eight 80 GB GPUs. Assuming the first node's IP is `10.0.0.1` and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can launch the server with the following commands.
```bash
# node 1
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 30000

# node 2
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 1 --trust-remote-code

# node 3
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 2 --trust-remote-code

# node 4
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 3 --trust-remote-code
```
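The sizing behind these commands can be sanity-checked with some quick arithmetic. This is only a rough sketch in decimal gigabytes: it treats the "over 1.3 TB" figure as a lower bound of 1300 GB and ignores the KV cache, activations, and runtime overhead, which also need room.

```bash
# Figures taken from the text above; weights_gb is a lower bound ("over 1.3 TB").
nodes=4
gpus_per_node=8
gb_per_gpu=80
weights_gb=1300

# The tensor-parallel size spans every GPU in the cluster, matching --tp 32.
tp=$((nodes * gpus_per_node))
total_gb=$((nodes * gpus_per_node * gb_per_gpu))
echo "tp=${tp}, aggregate GPU memory=${total_gb} GB"

# Two such nodes would not even hold the weights.
two_node_gb=$((2 * gpus_per_node * gb_per_gpu))
if [ "$two_node_gb" -lt "$weights_gb" ]; then
  echo "two nodes provide ${two_node_gb} GB, less than the ${weights_gb} GB of weights"
fi
```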
Then we can benchmark the accuracy and latency by accessing the first node's exposed port with the following example commands.
```bash
# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.0.1 --port 30000

# bench latency
python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128
```
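Before running the benchmarks above, it may help to confirm that the first node answers a single request. The sketch below assumes SGLang's native `/generate` endpoint and its `text` / `sampling_params` request fields. The `smoke_test` function name, the prompt, and the timeout are illustrative choices, not part of the original example.

```bash
# Address and port of the first node, as used in the launch command.
BASE_URL=http://10.0.0.1:30000

# Send one short generation request and print the raw JSON response.
smoke_test() {
  curl -sS --max-time 120 "$BASE_URL/generate" \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
}

# Call this once all four nodes have finished loading the model:
# smoke_test
```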
## DeepSeek V3 Optimization Plan
https://github.com/sgl-project/sglang/issues/2591