
# SGLang v0.4.1 - DeepSeek V3 Support

We're excited to announce [SGLang v0.4.1](https://github.com/sgl-project/sglang/releases/tag/v0.4.1), which adds support for [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base), at the time of this release the strongest open-source LLM, with reported benchmark results that surpass even GPT-4o.

The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang already supported MLA optimization and DP attention, which makes it one of the best open-source LLM engines for running DeepSeek models.

Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.

## Hardware Recommendation
- 8 x NVIDIA H200 GPUs

If your GPUs do not have enough memory, try multi-node tensor parallelism (see [help 1](https://github.com/sgl-project/sglang/blob/637de9e8ce91fd3e92755eb2a842860925954ab1/docs/backend/backend.md?plain=1#L88-L95) and [help 2](https://github.com/sgl-project/sglang/blob/637de9e8ce91fd3e92755eb2a842860925954ab1/docs/backend/backend.md?plain=1#L152-L168)).
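
As a rough illustration of why the recommendation is 8 x H200, the sketch below compares an estimated FP8 weight footprint against aggregate GPU memory. The parameter count (671B), the per-GPU capacities (141 GB for H200, 80 GB for H100), and the 20% headroom for KV cache and activations are assumptions for this example, not figures from this README, and the helper `weights_fit` is hypothetical.

```python
# Back-of-the-envelope check: do the FP8 weights fit in aggregate GPU memory?
# All constants are illustrative assumptions, not values stated in this README.
PARAMS_BILLION = 671   # assumed total parameter count of DeepSeek V3, in billions
BYTES_PER_PARAM = 1    # FP8 stores one byte per parameter


def weights_fit(num_gpus: int, gb_per_gpu: float, headroom: float = 0.2) -> bool:
    """Return True if the weights fit while keeping `headroom` of memory free."""
    weight_gb = PARAMS_BILLION * BYTES_PER_PARAM  # 1e9 params * 1 byte = 1 GB
    usable_gb = num_gpus * gb_per_gpu * (1 - headroom)
    return weight_gb <= usable_gb


print("8 x H200 (141 GB):", weights_fit(8, 141))
print("8 x H100 (80 GB): ", weights_fit(8, 80))
print("16 x H100 (80 GB):", weights_fit(16, 80))
```

This only counts weights plus a flat headroom. Real memory use also depends on context length, batch size, and runtime overhead, so treat the result as a first filter rather than a guarantee.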

## Installation & Launch

If the server fails to start, first check that the model weights have finished downloading. We recommend downloading them beforehand; otherwise, restart the server until all weights are downloaded.
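
One way to check whether a download is complete is to compare the shard files on disk against the checkpoint index. The sketch below assumes the local model directory uses the standard sharded safetensors layout, with a `model.safetensors.index.json` file whose `weight_map` maps tensor names to shard filenames. The helper `missing_shards` and the example path are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def missing_shards(model_dir: str) -> List[str]:
    """List shard files named in the safetensors index that are not on disk."""
    root = Path(model_dir)
    index_path = root / "model.safetensors.index.json"
    if not index_path.is_file():
        # Without the index we cannot tell which shards are expected.
        return [index_path.name]
    with index_path.open() as f:
        weight_map = json.load(f)["weight_map"]
    expected = sorted(set(weight_map.values()))
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical path: point this at your local copy of the model.
    missing = missing_shards("/path/to/DeepSeek-V3")
    print("all shards present" if not missing else f"missing: {missing}")
```

An empty list means every shard named in the index exists. It does not verify file sizes or checksums.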

### Using Docker (Recommended)
```bash
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```
For high-QPS scenarios, add the `--enable-dp-attention` flag to boost throughput.
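
If you script your deployments, the optional flag can be toggled when assembling the launch command. This is a minimal sketch that uses only the flags shown in this README. The helper `build_launch_command` is hypothetical and not part of SGLang.

```python
import shlex
from typing import List


def build_launch_command(
    model: str = "deepseek-ai/DeepSeek-V3",
    tp: int = 8,
    port: int = 30000,
    enable_dp_attention: bool = False,
) -> List[str]:
    """Assemble the sglang.launch_server argv, optionally with DP attention."""
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model", model,
        "--tp", str(tp),
        "--trust-remote-code",
        "--port", str(port),
    ]
    if enable_dp_attention:
        # Intended for high-QPS serving, per the note above.
        cmd.append("--enable-dp-attention")
    return cmd


# Print a shell-safe command line that can be pasted into the Docker or pip launch.
print(shlex.join(build_launch_command(enable_dp_attention=True)))
```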

### Using pip
```bash
# Installation
pip install "sglang[all]==0.4.1.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer

# Launch
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
```
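
Loading a model of this size can take a while, so it can help to wait until the server responds before sending requests. The sketch below polls the OpenAI-compatible model listing at `/v1/models`. The endpoint choice, the timeout values, and the helper names `server_is_up` and `wait_for_server` are assumptions for this example.

```python
import time
import urllib.request


def server_is_up(base_url: str) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_for_server(
    base_url: str = "http://127.0.0.1:30000/v1",
    timeout_s: float = 1800,
    interval_s: float = 5,
    probe=server_is_up,
    sleep=time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_for_server() else "timed out")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.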

### Example with OpenAI API

```python3
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
```
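
If you prefer not to install the `openai` package, the same request can be sent as a plain HTTP POST to the OpenAI-compatible `/v1/chat/completions` route. This is a minimal standard-library sketch, assuming the server from the launch commands above is listening on port 30000. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    user_prompt: str,
    base_url: str = "http://127.0.0.1:30000/v1",
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build the POST request equivalent to the client call above."""
    payload = {
        "model": "default",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant"},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer EMPTY",  # same placeholder key as above
        },
        method="POST",
    )


def chat(user_prompt: str) -> str:
    """Send the request and return only the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("List 3 countries and their capitals."))
```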

## DeepSeek V3 Optimization Plan

https://github.com/sgl-project/sglang/issues/2591

## Appendix

SGLang is the inference engine officially recommended by the DeepSeek team.

https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended