diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md
index 73b3612da..763c76186 100644
--- a/benchmark/deepseek_v3/README.md
+++ b/benchmark/deepseek_v3/README.md
@@ -9,12 +9,16 @@ Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model
 ## Hardware Recommendation
 - 8 x NVIDIA H200 GPUs
 
+If you do not have GPUs with large enough memory, please try multi-node tensor parallelism ([help 1](https://github.com/sgl-project/sglang/blob/637de9e8ce91fd3e92755eb2a842860925954ab1/docs/backend/backend.md?plain=1#L88-L95) [help 2](https://github.com/sgl-project/sglang/blob/637de9e8ce91fd3e92755eb2a842860925954ab1/docs/backend/backend.md?plain=1#L152-L168)).
+
 ## Installation & Launch
 
+If you see errors when launching the server, please check if it has finished downloading the weights. It is recommended to download the weights before launching, or to launch multiple times until all the weights have been downloaded.
+
 ### Using Docker (Recommended)
 ```bash
 docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host lmsysorg/sglang:latest \
-    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-Base --enable-dp-attention --tp 8 --trust-remote-code --port 30000
+    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --enable-dp-attention --tp 8 --trust-remote-code --port 30000
 ```
 
 ### Using pip
@@ -23,11 +27,9 @@ docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/roo
 pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
 
 # Launch
-python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-Base --enable-dp-attention --tp 8 --trust-remote-code
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --enable-dp-attention --tp 8 --trust-remote-code
 ```
 
-If you see errors when launching the server, please check if it has finished downloading the weights. It is recommended to download the weights before launching, or to launch multiple times until all the weights have been downloaded.
-
 ### Example with OpenAI API
 
 ```python3
@@ -48,7 +50,7 @@ response = client.chat.completions.create(
 print(response)
 ```
 
-## DeepSeek V3 optimization plan
+## DeepSeek V3 Optimization Plan
 
 https://github.com/sgl-project/sglang/issues/2591
 
diff --git a/docs/backend/openai_api_completions.ipynb b/docs/backend/openai_api_completions.ipynb
index 9340f953f..8e6634900 100644
--- a/docs/backend/openai_api_completions.ipynb
+++ b/docs/backend/openai_api_completions.ipynb
@@ -223,7 +223,7 @@
     "## Structured decoding (JSON, Regex)\n",
     "You can define a JSON schema or regular expression to constrain the model's output. The model output will be guaranteed to follow the given constraints and this depends on the grammar backend.\n",
     "\n",
-    "SGlang has two backends: outlines (default) and Xgrammar. Xgrammar enhances JSON decoding performance but does not support regular expressions. To use Xgrammar, add the `--grammar-backend xgrammar` when launching the server:\n",
+    "SGlang has two backends: [Outlines](https://github.com/dottxt-ai/outlines) (default) and [XGrammar](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar). Xgrammar accelerates JSON decoding performance but does not support regular expressions. To use Xgrammar, add the `--grammar-backend xgrammar` when launching the server:\n",
     "\n",
     "```bash\n",
     "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
diff --git a/docs/references/sampling_params.md b/docs/references/sampling_params.md
index 147e6c2ab..c5b5e17f1 100644
--- a/docs/references/sampling_params.md
+++ b/docs/references/sampling_params.md
@@ -1,8 +1,7 @@
 # Sampling Parameters in SGLang Runtime
 This doc describes the sampling parameters of the SGLang Runtime.
 It is the low-level endpoint of the runtime.
-If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API
-](https://github.com/sgl-project/sglang?tab=readme-ov-file#openai-compatible-api).
+If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](../backend/openai_api_completions.ipynb).
 
 The `/generate` endpoint accepts the following arguments in the JSON format.
 
diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
index 9be35b0ee..f283f0412 100644
--- a/python/sglang/srt/managers/scheduler.py
+++ b/python/sglang/srt/managers/scheduler.py
@@ -565,7 +565,7 @@ class Scheduler:
 
         if req.logprob_start_len == -1:
             # By default, only return the logprobs for output tokens
-            req.logprob_start_len = len(recv_req.input_ids) - 1
+            req.logprob_start_len = len(req.origin_input_ids) - 1
 
         # Truncate prompts that are too long
         if len(req.origin_input_ids) > self.max_req_input_len: