Revert "fix some typos" (#6244)
@@ -45,10 +45,10 @@ Add [performance optimization options](#performance-optimization-options) as nee
### Performance Optimization Options
[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations that can be enabled as needed.
- [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add `--enable-torch-compile` argument to enable it. This will take some time while the server starts. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`. It's recommended to set it between `1` and `8`. (e.g., `--torch-compile-max-bs 8`)
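A combined example of both options is sketched below; the model path, `--tp 8`, and `--port 30000` are illustrative placeholders, not values taken from this guide.

```bash
# Hedged sketch: launch with DP attention and torch.compile enabled together.
# The model path, --tp 8, and --port 30000 are placeholders for illustration only.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 \
  --trust-remote-code --port 30000 \
  --enable-dp-attention \
  --enable-torch-compile --torch-compile-max-bs 8
```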
### Example: Sending requests with OpenAI API
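As a minimal hedged sketch (the host, port `30000`, and model name below are assumptions, not values from this guide), a request can be sent to the OpenAI-compatible endpoint with `curl`:

```bash
# Hedged sketch: query the server's OpenAI-compatible chat completions endpoint.
# localhost:30000 and the model name are assumptions.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
      }'
```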
@@ -90,7 +90,7 @@ If you have two H100 nodes, the usage is similar to the aforementioned H20.
> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).
### Example: Serving with two H200\*8 nodes and Docker
There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.
A single H200 node with 8 devices can run DeepSeek V3; the dual-H200 setup is only meant to demonstrate multi-node usage.
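A hedged sketch of the node-0 Docker launch follows; the image tag, cache mount, and shared-memory size are assumptions, while the IP, `--host`, `--port`, and `--dist-init-addr` values come from the description above. The second node would run the same command with `--node-rank 1`.

```bash
# Hedged sketch for the first node (node rank 0); the image tag, volume mount, and
# shm size are assumptions. --tp 16 / --nnodes 2 reflect the 2 x 8-GPU setup above.
docker run --gpus all --network host --shm-size 32g \
  -v /path/to/hf_cache:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 16 --nnodes 2 --node-rank 0 \
    --dist-init-addr 192.168.114.10:20000 \
    --trust-remote-code --host 0.0.0.0 --port 40000
```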
@@ -147,7 +147,7 @@ docker run --gpus all \
To serve DeepSeek-V3 with A100 GPUs, we first need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 using the [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).
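A hedged sketch of that conversion step is shown below; the argument names are assumptions, so check the script's `--help` before running it.

```bash
# Hedged sketch of the FP8 -> BF16 conversion; the argument names are assumptions
# and should be verified against the script itself.
cd DeepSeek-V3/inference
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/DeepSeek-V3 \
  --output-bf16-hf-path /path/to/DeepSeek-V3-BF16
```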
Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80GB GPUs. Assuming the first node's IP is `10.0.0.1`, and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can run the following commands to launch the server.
```bash
# node 1
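# Hedged reconstruction of the node-1 launch command (not shown in full above).
# --tp 32 assumes 4 nodes x 8 GPUs; the rendezvous port 5000 is an assumption.
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 \
  --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 0 --trust-remote-code
```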
@@ -178,7 +178,7 @@ python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1
### Example: Serving with 8 A100/A800 with AWQ Quantization
Add the `--quantization moe_wna16` flag to enable the MoE wna16 kernel for better performance.
One example is as follows:
```bash
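# Hedged sketch for a single 8-GPU node; the model name matches the hunk context
# further below, while --tp 8 and --trust-remote-code are assumptions.
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ \
  --tp 8 --quantization moe_wna16 --trust-remote-code
```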
@@ -188,12 +188,12 @@ python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --
### Example: Serving with 16 A100/A800 with int8 Quantization
There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to HuggingFace. One example is as follows:
- [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
Assuming that the master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-INT8` and port=5000, we can run the following commands to launch the server:
```bash
#master
python3 -m sglang.launch_server \
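  --model-path /path/to/DeepSeek-R1-INT8 --tp 16 \
  --dist-init-addr MASTER_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code
# The continuation flags above are a hedged reconstruction: --tp 16 and --nnodes 2
# assume 2 nodes x 8 GPUs, and MASTER_IP:5000 follows the text above.
```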
@@ -225,7 +225,7 @@ Running with per-channel quantization model:
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
Assuming that the master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8` and port=5000, we can run the following commands to launch the server:
```bash
#master
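# Hedged sketch of the master-node command, mirroring the block-INT8 example above;
# --tp 16 and --nnodes 2 assume 2 nodes x 8 GPUs.
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-R1-Channel-INT8 --tp 16 \
  --dist-init-addr MASTER_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code
```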