sglang/benchmark/deepseek_v3/README.md

# DeepSeek V3 Support

The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang also supports [MLA optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [DP attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), making it one of the best open-source LLM engines for running DeepSeek models.

Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.

## Hardware Recommendation
- 8 x NVIDIA H200 GPUs

If your GPUs do not have enough memory, try multi-node tensor parallelism ([help 1](https://github.com/sgl-project/sglang/blob/637de9e8ce91fd3e92755eb2a842860925954ab1/docs/backend/backend.md?plain=1#L88-L95) [help 2](https://github.com/sgl-project/sglang/blob/637de9e8ce91fd3e92755eb2a842860925954ab1/docs/backend/backend.md?plain=1#L152-L168)). See the [example serving with two H20 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208).
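
As a rough illustration of why GPU memory drives this recommendation, the sketch below compares the FP8 weight size with the aggregate memory of one 8-GPU node. The figures are assumptions that this README does not state: roughly 671B total parameters for DeepSeek V3, one byte per FP8 parameter, and nominal capacities of 141 GB (H200), 96 GB (H20), and 80 GB (H100). It ignores the KV cache, activations, and runtime overhead, so the result is only a lower bound on what is needed.

```python
# Back-of-the-envelope check: do the FP8 weights fit on one 8-GPU node?
# All numbers below are assumptions for illustration, not measured values.
TOTAL_PARAMS_B = 671   # assumed total parameter count, in billions
BYTES_PER_PARAM = 1    # FP8 stores one byte per parameter
GPU_MEMORY_GB = {"H200": 141, "H20": 96, "H100": 80}  # nominal per-GPU capacity


def headroom_gb(gpu: str, num_gpus: int = 8) -> int:
    """Return aggregate GPU memory minus weight size (negative: weights do not fit)."""
    weights_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM  # 1e9 params * 1 byte = 1 GB
    return GPU_MEMORY_GB[gpu] * num_gpus - weights_gb


for gpu in GPU_MEMORY_GB:
    print(f"8 x {gpu}: {headroom_gb(gpu):+d} GB left after loading weights")
```

Under these assumptions, a single H100 node cannot hold the weights at all and a single H20 node leaves little room for the KV cache, which is consistent with the two-node example in this README.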

## Installation & Launch

If the server fails to start, first check that the model weights have finished downloading. We recommend downloading the weights in advance; otherwise, restart the server until all weights are downloaded.
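
One way to check whether a download is complete is to compare the shard files on disk with those listed in the checkpoint index. The sketch below is a minimal example, assuming the checkpoint follows the Hugging Face convention of a `model.safetensors.index.json` file with a `weight_map` entry; the helper name `missing_shards` and the directory you pass to it are hypothetical.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """List weight shard files named in the safetensors index but absent on disk."""
    root = Path(model_dir)
    # Assumption: the checkpoint ships a Hugging Face style index file.
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # weight_map maps each tensor name to the shard file that stores it.
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (root / name).is_file()]
```

An empty list means every shard named in the index is present; any returned names are files that still need to be downloaded.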

### Using Docker (Recommended)
```bash
# Pull latest image
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest

# Launch
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```

For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.

### Using pip
```bash
# Installation
pip install "sglang[all]>=0.4.1.post3" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer

# Launch
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
```

For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.

### Example with OpenAI API

```python3
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
```
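
The example above prints the whole response object. If only the generated text is needed, a small wrapper can pull it out of the first choice. This is a sketch under the assumption that the server returns the standard OpenAI chat completion shape; the helper name `reply_text` is hypothetical.

```python
def reply_text(client, prompt: str, max_tokens: int = 64) -> str:
    """Send one user message and return only the generated text."""
    response = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=max_tokens,
    )
    # The first choice holds the assistant message; its content is the reply text.
    return response.choices[0].message.content


# Usage against a running server (not executed here):
#   import openai
#   client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
#   print(reply_text(client, "List 3 countries and their capitals."))
```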

### Example serving with 2 H20*8
Suppose you have two H20 nodes, each with 8 GPUs. The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. Run the matching command on each node; both point `--nccl-init` at the first node.

```bash
# node 1
GLOO_SOCKET_IFNAME=eth0 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
GLOO_SOCKET_IFNAME=eth0 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```

If you have two H100 nodes, the setup is similar to the H20 example above.
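
The per-node commands in the multi-node example differ only in `--node-rank`, and `--tp` equals the total GPU count across nodes. The sketch below generates them for any number of nodes. The helper `launch_commands` is hypothetical, it uses only the flags shown in that example, and it leaves out the `GLOO_SOCKET_IFNAME` prefix because the interface name depends on the host.

```python
def launch_commands(init_addr: str, nnodes: int, gpus_per_node: int = 8,
                    model: str = "deepseek-ai/DeepSeek-V3") -> list[str]:
    """Build one launch command per node; only --node-rank differs between them."""
    tp = nnodes * gpus_per_node  # tensor parallelism spans every GPU on every node
    return [
        f"python -m sglang.launch_server --model-path {model} --tp {tp} "
        f"--nccl-init {init_addr} --nnodes {nnodes} --node-rank {rank} "
        "--trust-remote-code"
        for rank in range(nnodes)
    ]


for rank, cmd in enumerate(launch_commands("10.0.0.1:5000", nnodes=2)):
    print(f"# node {rank + 1}\n{cmd}")
```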

## DeepSeek V3 Optimization Plan

https://github.com/sgl-project/sglang/issues/2591

## Appendix

SGLang is the inference engine officially recommended by the DeepSeek team.

https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended