diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md
index be22cfda2..12172ca45 100644
--- a/benchmark/deepseek_v3/README.md
+++ b/benchmark/deepseek_v3/README.md
@@ -1,4 +1,4 @@
-# DeepSeek V3 Support
+# DeepSeek V3.1/V3/R1 Support
 
 The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang also supports [MLA optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [DP attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended).
 
@@ -50,7 +50,9 @@ Add [performance optimization options](#performance-optimization-options) as nee
 - [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
 - [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add `--enable-torch-compile` argument to enable it. This will take some time while server starts. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`. It's recommended to set it between `1` and `8`. (e.g., `--torch-compile-max-bs 8`)
 
-### Example: Sending requests with OpenAI API
+### Usage: Chat with DeepSeek
+
+#### DeepSeek V3/R1
 
 ```python3
 import openai
@@ -70,6 +72,82 @@ response = client.chat.completions.create(
 print(response)
 ```
 
+#### DeepSeek V3.1
+On top of the basic usage similar to the DeepSeek V3/R1 example, DeepSeek V3.1 supports a request-level thinking/non-thinking toggle. Simply switch the `"thinking"` field in `extra_body={"chat_template_kwargs": {"thinking": True}}` to enable/disable the thinking mode.
+
+##### Non Thinking
+```python3
+import openai
+client = openai.Client(
+    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
+
+# Chat completion
+response = client.chat.completions.create(
+    model="default",
+    messages=[
+        {"role": "system", "content": "You are a helpful AI assistant"},
+        {"role": "user", "content": "Answer the following with the second letter of the correct answer only: What is the capital of France?"},
+    ],
+    temperature=0,
+    max_tokens=1024,
+    extra_body = {"chat_template_kwargs": {"thinking": False}}
+)
+print(response.choices[0].message.content)
+```
+Answer:
+```
+h
+```
+* The correct response should be 'A', as the correct answer to the question is 'Paris'.
+##### Thinking
+```python3
+import openai
+client = openai.Client(
+    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
+
+# Chat completion
+response = client.chat.completions.create(
+    model="default",
+    messages=[
+        {"role": "system", "content": "You are a helpful AI assistant"},
+        {"role": "user", "content": "Answer the following with the second letter of the correct answer only: What is the capital of France?"},
+    ],
+    temperature=0,
+    max_tokens=1024,
+    extra_body = {"chat_template_kwargs": {"thinking": True}}
+)
+print(response)
+```
+Answer:
+```
+First, the question is: "What is the capital of France?" I know that the capital of France is Paris.
+
+The user says: "Answer the following with the second letter of the correct answer only." So, I need to provide only the second letter of the correct answer.
+
+The correct answer is "Paris". Now, I need to find the second letter of "Paris".
+
+Let's spell it out: P-A-R-I-S.
+
+- First letter: P
+
+- Second letter: A
+
+- Third letter: R
+
+- Fourth letter: I
+
+- Fifth letter: S
+
+So, the second letter is "A".
+
+I should only output the second letter, which is "A". No additional text or explanation, just the letter.
+
+The user emphasized "the second letter of the correct answer only", so my response should be just "A".
+
+Finally, I need to make sure that this is the correct answer. Yes, Paris is indeed the capital of France.</think>A
+```
+* The response contains the `</think>` thinking trace, and the model was able to derive the correct answer from it.
+
 ### Example: Serving with two H20\*8 nodes
 
 For example, there are two H20 nodes, each with 8 GPUs. The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. Please **use the first node's IP** for both commands.
diff --git a/docs/basic_usage/deepseek.md b/docs/basic_usage/deepseek.md
index 8c6fcfea5..56d650601 100644
--- a/docs/basic_usage/deepseek.md
+++ b/docs/basic_usage/deepseek.md
@@ -5,9 +5,9 @@ SGLang provides many optimizations specifically designed for the DeepSeek models
 This document outlines current optimizations for DeepSeek.
 For an overview of the implemented features see the completed [Roadmap](https://github.com/sgl-project/sglang/issues/2591).
 
-## Launch DeepSeek V3 with SGLang
+## Launch DeepSeek V3.1/V3/R1 with SGLang
 
-To run DeepSeek V3/R1 models, the requirements are as follows:
+To run DeepSeek V3.1/V3/R1 models, the recommended settings are as follows:
 
 | Weight Type | Configuration |
 |------------|-------------------|