add INT8 example into dsv3 README (#4079)
@@ -184,6 +184,7 @@ AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for qua
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
```

### Example: Serving with 16 A100/A800 with int8 Quantization

Both block-wise and per-channel quantization methods are available, and the quantization parameters have already been uploaded to Hugging Face. One example is as follows:
@@ -191,18 +192,31 @@ There are block-wise and per-channel quantization methods, and the quantization
- [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)

Assuming that the master node IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-INT8`, and the port is 5000, the server can be launched with the following commands:
```bash
# master node (rank 0)
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
    MASTER_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
# second node (rank 1)
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
    MASTER_IP:5000 --nnodes 2 --node-rank 1 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
```
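
The two launch commands differ only in `--node-rank`, so they can be generated from a few variables. The following is a minimal sketch, not part of the README itself: `build_launch_cmd` is a hypothetical helper, the default `MASTER_IP` is a placeholder documentation address, and the flags are copied from the commands in this section (with `--trust-remote-code` spelled out in full, as in the AWQ example). The script only prints the commands; it does not start a server.

```bash
#!/usr/bin/env bash
# Sketch: print the launch command for each node of a 2-node, TP=16 deployment.
# MASTER_IP defaults to a placeholder address; replace it with the real master IP.
MASTER_IP="${MASTER_IP:-192.0.2.10}"
PORT="${PORT:-5000}"
MODEL="${MODEL:-meituan/DeepSeek-R1-Block-INT8}"
NNODES=2

# Hypothetical helper: prints (does not run) the command for one node rank.
build_launch_cmd() {
    local node_rank="$1"
    echo "python3 -m sglang.launch_server" \
        "--model ${MODEL} --tp 16" \
        "--dist-init-addr ${MASTER_IP}:${PORT}" \
        "--nnodes ${NNODES} --node-rank ${node_rank}" \
        "--trust-remote-code --enable-torch-compile --torch-compile-max-bs 8"
}

# One line per rank; each line is meant to be run on its own node.
for ((rank = 0; rank < NNODES; rank++)); do
    build_launch_cmd "${rank}"
done
```

Printing the commands first makes it easy to check that every node receives the same `--dist-init-addr` and a distinct `--node-rank` before anything is launched.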

> **Note: the launch commands above enable the `torch.compile` optimization.** For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

Then on the **master node**, supposing the ShareGPT data is located at `/path/to/ShareGPT_V3_unfiltered_cleaned_split.json`, you can run the following commands to benchmark the launched server:

```bash
# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319

# bench serving
python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --random-input 128 --random-output 128 --num-prompts 1000 --request-rate 128 --random-range-ratio 1.0
```

> **Note: using `--parallel 200` can accelerate accuracy benchmarking**.
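
As a concrete illustration of the `--parallel` note, the accuracy benchmark command with that flag appended might look like the sketch below. It assumes the working directory is the sglang repository root (the script path is relative), and `RUN_BENCH` is a hypothetical switch introduced here so that the command is only printed by default.

```bash
#!/usr/bin/env bash
# Sketch: GSM8K accuracy benchmark with 200 parallel requests.
# Assumes the current directory is the sglang repository root.
BENCH_CMD=(python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 200)

# Print the command for inspection; set RUN_BENCH=1 (hypothetical switch) to run it.
echo "${BENCH_CMD[*]}"
if [ "${RUN_BENCH:-0}" = "1" ]; then
    "${BENCH_CMD[@]}"
fi
```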
### Example: Serving on any cloud or Kubernetes with SkyPilot