Signed-off-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: YorkSu <york_su@qq.com>
@@ -235,6 +235,44 @@ Important Notes:

2. To receive more consistent tool call results, it is recommended to use `--chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja`, which provides an improved unified prompt.

### Thinking Budget for DeepSeek R1

In SGLang, we can implement a thinking budget with a `CustomLogitProcessor`.

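
The core idea behind such a processor can be sketched without SGLang at all: once the number of generated reasoning tokens reaches the budget, mask every logit except the end-of-thinking token. This is a minimal sketch, assuming a plain list of logits and a hypothetical token id `END_THINK_ID` standing in for `</think>`; SGLang's actual `CustomLogitProcessor` interface differs.

```python
# Minimal sketch of a thinking-budget logit processor, independent of SGLang.
# END_THINK_ID is a hypothetical token id standing in for "</think>".
END_THINK_ID = 1

def apply_thinking_budget(logits, tokens_generated, budget):
    """Pass logits through until the budget is spent, then force END_THINK_ID."""
    if tokens_generated < budget:
        return logits  # within budget: leave the distribution untouched
    forced = [float("-inf")] * len(logits)
    forced[END_THINK_ID] = 0.0  # only the end-of-thinking token remains viable
    return forced

# Within budget: unchanged. Budget spent: only END_THINK_ID survives.
assert apply_thinking_budget([0.3, 0.7], 10, 512) == [0.3, 0.7]
assert apply_thinking_budget([0.3, 0.7], 512, 512) == [float("-inf"), 0.0]
```

The `DeepSeekR1ThinkingBudgetLogitProcessor` shipped with SGLang applies this kind of masking per request; the function above is only an illustration of the mechanism.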
Launch a server with the `--enable-custom-logit-processor` flag:

```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --port 30000 --host 0.0.0.0 --mem-fraction-static 0.9 --disable-cuda-graph --reasoning-parser deepseek-r1 --enable-custom-logit-processor
```

Sample request:

```python
import openai
from rich.pretty import pprint

from sglang.srt.sampling.custom_logit_processor import DeepSeekR1ThinkingBudgetLogitProcessor

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {
            "role": "user",
            "content": "Question: Is Paris the Capital of France?",
        }
    ],
    max_tokens=1024,
    extra_body={
        # Serialize the processor and cap reasoning at 512 tokens.
        "custom_logit_processor": DeepSeekR1ThinkingBudgetLogitProcessor().to_str(),
        "custom_params": {
            "thinking_budget": 512,
        },
    },
)
pprint(response)
```
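
To make the effect of `thinking_budget` concrete, here is a toy greedy decoding loop (pure Python, no server needed). The token ids are hypothetical: id `0` plays the role of "keep thinking" and id `1` the role of the end-of-thinking token; with a budget of 3, the end-of-thinking token is forced on the fourth step.

```python
# Toy greedy decoding loop illustrating a thinking budget of 3 tokens.
# Token ids are hypothetical: 0 = "keep thinking", 1 = "</think>".
def budgeted_decode(budget, max_steps=6):
    generated = []
    for _ in range(max_steps):
        logits = [1.0, 0.0]  # the model always prefers to keep thinking
        if len(generated) >= budget:
            logits = [float("-inf"), 0.0]  # budget spent: force </think>
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        generated.append(next_id)
        if next_id == 1:  # end-of-thinking emitted, stop
            break
    return generated

print(budgeted_decode(3))  # [0, 0, 0, 1]
```

In a real deployment the budget bounds only the reasoning phase; the answer tokens after `</think>` are still limited by `max_tokens`.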

## FAQ

**Q: Model loading is taking too long, and I'm encountering an NCCL timeout. What should I do?**

@@ -319,3 +319,27 @@ response = requests.post(
)
print(response.json())
```

Send an OpenAI chat completion request:

```python
import openai
from sglang.utils import print_highlight

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.0,
    max_tokens=32,
    extra_body={
        # DeterministicLogitProcessor is the CustomLogitProcessor subclass
        # defined earlier in this example.
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "custom_params": {"token_id": 5},
    },
)

print_highlight(f"Response: {response}")
```
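
For intuition, the behavior of a deterministic logit processor can be mimicked by a toy class that masks every logit except the requested `token_id`, so greedy sampling can only ever pick that token. The class below is an illustrative stand-in under that assumption, not SGLang's actual `CustomLogitProcessor` API.

```python
# Illustrative stand-in for a deterministic logit processor: every logit
# except the requested token_id is masked to -inf, so sampling (or argmax)
# can only pick token_id. This mirrors the idea, not SGLang's actual API.
class ToyDeterministicProcessor:
    def __call__(self, logits, custom_params):
        token_id = custom_params["token_id"]
        masked = [float("-inf")] * len(logits)
        masked[token_id] = 0.0  # the only surviving candidate
        return masked

proc = ToyDeterministicProcessor()
masked = proc([0.2, 0.5, 0.1, 0.9, 0.3, 0.0], {"token_id": 5})
assert max(range(len(masked)), key=lambda i: masked[i]) == 5
```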