[doc] deepseekv31 support (#9544)
# DeepSeek V3.1/V3/R1 Support
The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang also supports [MLA optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [DP attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended).

Add [performance optimization options](#performance-optimization-options) as needed:
- [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add the `--enable-torch-compile` argument to enable it. Compilation takes some time during server startup. The maximum batch size for torch.compile can be controlled with `--torch-compile-max-bs`; a value between `1` and `8` is recommended (e.g., `--torch-compile-max-bs 8`). A combined launch sketch follows this list.
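
For reference, here is a minimal launch sketch combining the flags above. The model path, `--tp` value, and single-node setup are assumptions; adjust them to your checkpoint and hardware.

```python3
# Hypothetical launch helper: assembles the server command with the
# optimization flags described above and starts the server process.
import subprocess

cmd = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3",  # assumed checkpoint
    "--tp", "8",                                # assumed: 8 GPUs on one node
    "--trust-remote-code",
    "--enable-dp-attention",                    # DP attention for high-QPS serving
    "--enable-torch-compile",                   # torch.compile latency optimization
    "--torch-compile-max-bs", "8",              # recommended compile batch-size cap
]
subprocess.run(cmd, check=True)
```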
### Usage: Chat with DeepSeek
#### DeepSeek V3/R1
```python3
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=1024,
)
print(response)
```
#### DeepSeek V3.1
On top of the basic usage shown in the DeepSeek V3/R1 example, DeepSeek V3.1 supports a request-level thinking/non-thinking toggle: switch the `"thinking"` field in `extra_body={"chat_template_kwargs": {"thinking": True}}` to enable or disable thinking mode.

##### Non-Thinking
```python3
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "Answer the following with the second letter of the correct answer only: What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(response.choices[0].message.content)
```
Answer:
```
h
```
* The expected response is 'A', the second letter of 'Paris'; with thinking disabled, the model answers incorrectly.

##### Thinking
```python3
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "Answer the following with the second letter of the correct answer only: What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"thinking": True}},
)
print(response)
```
Answer:
```
First, the question is: "What is the capital of France?" I know that the capital of France is Paris.

The user says: "Answer the following with the second letter of the correct answer only." So, I need to provide only the second letter of the correct answer.

The correct answer is "Paris". Now, I need to find the second letter of "Paris".

Let's spell it out: P-A-R-I-S.

- First letter: P

- Second letter: A

- Third letter: R

- Fourth letter: I

- Fifth letter: S

So, the second letter is "A".

I should only output the second letter, which is "A". No additional text or explanation, just the letter.

The user emphasized "the second letter of the correct answer only", so my response should be just "A".

Finally, I need to make sure that this is the correct answer. Yes, Paris is indeed the capital of France.</think>A
```
* The response contains the thinking trace terminated by `</think>`, and the model was able to derive the correct answer from it.
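
Because the raw content terminates the trace with `</think>` and then gives the final answer, the two parts can be separated by splitting on that marker. A minimal sketch, reusing the `response` object from the thinking example above:

```python3
# Split the raw message content into the thinking trace and the final
# answer, assuming the trace is terminated by "</think>" as shown above.
content = response.choices[0].message.content
thinking, sep, answer = content.partition("</think>")
if not sep:  # no marker found: treat the whole content as the answer
    thinking, answer = "", content

print("thinking trace:", thinking.strip())
print("final answer:", answer.strip())
```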
### Example: Serving with two H20\*8 nodes
Suppose there are two H20 nodes, each with 8 GPUs. The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. Please **use the first node's IP** in the commands on both nodes.
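
As a hedged sketch (the model path, the port in `--dist-init-addr`, and the `--tp 16` split are assumptions), the launch command is identical on both nodes except for `--node-rank`:

```python3
# Sketch: launch one SGLang server process per node; tensor parallelism
# spans all 16 GPUs across the two nodes. Run with node rank 0 on
# 10.0.0.1 and node rank 1 on 10.0.0.2.
import subprocess
import sys

node_rank = sys.argv[1] if len(sys.argv) > 1 else "0"

subprocess.run([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3",  # assumed checkpoint
    "--tp", "16",                               # 16 GPUs across both nodes
    "--dist-init-addr", "10.0.0.1:5000",        # first node's IP, used on both nodes
    "--nnodes", "2",
    "--node-rank", node_rank,                   # "0" on the first node, "1" on the second
    "--trust-remote-code",
], check=True)
```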
SGLang provides many optimizations specifically designed for the DeepSeek models.
This document outlines current optimizations for DeepSeek.
For an overview of the implemented features, see the completed [Roadmap](https://github.com/sgl-project/sglang/issues/2591).
## Launch DeepSeek V3.1/V3/R1 with SGLang
To run DeepSeek V3.1/V3/R1 models, the recommended settings are as follows:
| Weight Type | Configuration |
|------------|-------------------|