sglang v0.5.2 & support Qwen3-Next-80B-A3B-Instruct
42
docs/references/custom_chat_template.md
Normal file
@@ -0,0 +1,42 @@

# Custom Chat Template

**NOTE**: There are two chat template systems in the SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)).

By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.

If needed, you can also override the chat template when launching the server:

```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
```

If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.

## JSON Format

You can load a template in the JSON format, which is defined by `conversation.py`.

```json
{
  "name": "my_model",
  "system": "<|im_start|>system",
  "user": "<|im_start|>user",
  "assistant": "<|im_start|>assistant",
  "sep_style": "CHATML",
  "sep": "<|im_end|>",
  "stop_str": ["<|im_end|>", "<|im_start|>"]
}
```
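
As a rough illustration of how such a template shapes the final prompt, here is a hedged sketch (not the actual `conversation.py` logic) that expands a message list using the ChatML-style fields above:

```python
# Hypothetical renderer; the field names mirror the JSON template above,
# but the real conversation.py implementation differs in detail.
TEMPLATE = {
    "system": "<|im_start|>system",
    "user": "<|im_start|>user",
    "assistant": "<|im_start|>assistant",
    "sep": "<|im_end|>",
}


def render(messages):
    parts = []
    for role, text in messages:
        parts.append(f"{TEMPLATE[role]}\n{text}{TEMPLATE['sep']}\n")
    # Leave an open assistant turn for the model to complete.
    parts.append(TEMPLATE["assistant"] + "\n")
    return "".join(parts)


print(render([("system", "You are helpful."), ("user", "Hi!")]))
```

The `stop_str` entries in the JSON template tell the server when generation of the assistant turn should end.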

```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
```

## Jinja Format

You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers.

```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
```
97
docs/references/environment_variables.md
Normal file
@@ -0,0 +1,97 @@

# Environment Variables

SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.

*Note: SGLang uses two prefixes for environment variables: `SGL_` and `SGLANG_`. This is likely due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.*

## General Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_MODELSCOPE` | Enable loading models from ModelScope | `false` |
| `SGLANG_HOST_IP` | Host IP address for the server | `0.0.0.0` |
| `SGLANG_PORT` | Port for the server | auto-detected |
| `SGLANG_LOGGING_CONFIG_PATH` | Custom logging configuration path | Not set |
| `SGLANG_DISABLE_REQUEST_LOGGING` | Disable request logging | `false` |
| `SGLANG_HEALTH_CHECK_TIMEOUT` | Timeout for health checks in seconds | `20` |
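
Environment variables are read at process start, so set them before launching the server. As a hedged sketch (the exact parsing inside SGLang may differ), boolean-style variables like those above are conventionally interpreted along these lines:

```python
import os


def env_flag(name: str, default: str = "false") -> bool:
    # Common convention: "1"/"true" (case-insensitive) means enabled.
    # This mirrors, but is not guaranteed to match, SGLang's own parsing.
    return os.environ.get(name, default).strip().lower() in ("1", "true")


os.environ["SGLANG_USE_MODELSCOPE"] = "true"
print(env_flag("SGLANG_USE_MODELSCOPE"))           # True
print(env_flag("SGLANG_DISABLE_REQUEST_LOGGING"))  # False (unset, default)
```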

## Performance Tuning

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_ENABLE_TORCH_INFERENCE_MODE` | Control whether to use `torch.inference_mode` | `false` |
| `SGLANG_ENABLE_TORCH_COMPILE` | Enable `torch.compile` | `true` |
| `SGLANG_SET_CPU_AFFINITY` | Enable CPU affinity setting (often set to `1` in Docker builds) | `0` |
| `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN` | Allow the scheduler to overwrite longer context length requests (often set to `1` in Docker builds) | `0` |
| `SGLANG_IS_FLASHINFER_AVAILABLE` | Control the FlashInfer availability check | `true` |
| `SGLANG_SKIP_P2P_CHECK` | Skip the P2P (peer-to-peer) access check | `false` |
| `SGL_CHUNKED_PREFIX_CACHE_THRESHOLD` | Set the threshold for enabling chunked prefix caching | `8192` |
| `SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION` | Enable RoPE fusion in fused MLA attention | `1` |

## DeepGEMM Configuration (Advanced Optimization)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGL_ENABLE_JIT_DEEPGEMM` | Enable just-in-time compilation of DeepGEMM kernels | `"true"` |
| `SGL_JIT_DEEPGEMM_PRECOMPILE` | Enable precompilation of DeepGEMM kernels | `"true"` |
| `SGL_JIT_DEEPGEMM_COMPILE_WORKERS` | Number of workers for parallel DeepGEMM kernel compilation | `4` |
| `SGL_IN_DEEPGEMM_PRECOMPILE_STAGE` | Indicator flag used during the DeepGEMM precompile script | `"false"` |
| `SGL_DG_CACHE_DIR` | Directory for caching compiled DeepGEMM kernels | `~/.cache/deep_gemm` |
| `SGL_DG_USE_NVRTC` | Use NVRTC (instead of Triton) for JIT compilation (experimental) | `"0"` |
| `SGL_USE_DEEPGEMM_BMM` | Use DeepGEMM for batched matrix multiplication (BMM) operations | `"false"` |

## Memory Management

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DEBUG_MEMORY_POOL` | Enable memory pool debugging | `false` |
| `SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION` | Clip the max-new-tokens estimation used for memory planning | `4096` |
| `SGLANG_DETOKENIZER_MAX_STATES` | Maximum number of states for the detokenizer | System-dependent |
| `SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK` | Disable the check for memory imbalance across tensor-parallel ranks | Not set (check enabled) |

## Model-Specific Options

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_AITER` | Use the AITER optimized implementation | `false` |
| `SGLANG_INT4_WEIGHT` | Enable INT4 weight quantization | `false` |
| `SGLANG_MOE_PADDING` | Enable MoE padding (sets the padding size to 128 if the value is `1`; often set to `1` in Docker builds) | `0` |
| `SGLANG_FORCE_FP8_MARLIN` | Force FP8 MARLIN kernels even if other FP8 kernels are available | `false` |
| `SGLANG_ENABLE_FLASHINFER_GEMM` | Use FlashInfer kernels when running blockwise FP8 GEMM on Blackwell GPUs | `false` |
| `SGLANG_SUPPORT_CUTLASS_BLOCK_FP8` | Use CUTLASS kernels when running blockwise FP8 GEMM on Hopper or Blackwell GPUs | `false` |
| `SGLANG_CUTLASS_MOE` | Use the CUTLASS FP8 MoE kernel on Blackwell GPUs | `false` |

## Distributed Computing

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_BLOCK_NONZERO_RANK_CHILDREN` | Control blocking of non-zero-rank child processes | `1` |
| `SGL_IS_FIRST_RANK_ON_NODE` | Indicates whether the current process is the first rank on its node | `"true"` |
| `SGLANG_PP_LAYER_PARTITION` | Pipeline-parallel layer partition specification | Not set |

## Testing & Debugging (Internal/CI)

*These variables are primarily used for internal testing, continuous integration, or debugging.*

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_IS_IN_CI` | Indicates whether SGLang is running in a CI environment | `false` |
| `SGLANG_AMD_CI` | Indicates the AMD CI environment | `0` |
| `SGLANG_TEST_RETRACT` | Enable retract decode testing | `false` |
| `SGLANG_RECORD_STEP_TIME` | Record step time for profiling | `false` |
| `SGLANG_TEST_REQUEST_TIME_STATS` | Test request time statistics | `false` |
| `SGLANG_CI_SMALL_KV_SIZE` | Use a small KV cache size in CI | Not set |

## Profiling & Benchmarking

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_TORCH_PROFILER_DIR` | Directory for PyTorch profiler output | `/tmp` |
| `SGLANG_PROFILE_WITH_STACK` | Set the `with_stack` option (bool) for the PyTorch profiler (capture stack traces) | `true` |

## Storage & Caching

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DISABLE_OUTLINES_DISK_CACHE` | Disable the Outlines disk cache | `true` |
35
docs/references/faq.md
Normal file
@@ -0,0 +1,35 @@

# Troubleshooting and Frequently Asked Questions

## Troubleshooting

This page lists common errors and tips for resolving them.

### CUDA Out of Memory

If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:

- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common cause of OOM is requesting input logprobs for a long prompt, as this requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for the full long prompt, try reducing `--mem-fraction-static`.

### CUDA Error: Illegal Memory Access Encountered

This error may result from kernel errors or out-of-memory issues:

- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.

## Frequently Asked Questions

### The results are not deterministic, even with a temperature of 0

You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.

From our initial investigation, this indeterminism arises from two factors: dynamic batching and prefix caching. Roughly speaking, dynamic batching accounts for about 95% of the indeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/cuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. This difference accumulates across many layers, resulting in nondeterministic output when the batch size changes. Similarly, when prefix caching is enabled, it can also dispatch to different kernels. Even when the computations are mathematically equivalent, small numerical differences from different kernel implementations lead to nondeterministic final outputs.
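
The root of these numerical differences is that floating-point addition is not associative, so a kernel that merely sums the same values in a different order can produce a different result. A minimal illustration:

```python
# Summing the same four numbers in two different orders.
# The large intermediate value absorbs the small ones differently
# depending on the order, so the final results differ.
values = [0.1, 1e16, -1e16, 0.3]

left_to_right = ((values[0] + values[1]) + values[2]) + values[3]
right_to_left = values[0] + (values[1] + (values[2] + values[3]))

print(left_to_right)                   # 0.3
print(right_to_left)                   # 0.1
print(left_to_right == right_to_left)  # False
```

A GPU kernel performs millions of such reductions, so changing the batch size or kernel changes the summation order and, slightly, the result.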

To achieve more deterministic outputs in the current code, you can add `--disable-radix-cache` and send only one request at a time. The results will be mostly deterministic under this setting.

We are still investigating the root causes and potential solutions. In the short term, we may introduce a "deterministic mode" that uses more padding to address the variance caused by dynamic batching. This mode will be more deterministic but slower.

We have two issues to track our progress:

- The deterministic mode is tracked at [https://github.com/sgl-project/sglang/issues/1729](https://github.com/sgl-project/sglang/issues/1729).
- The per-request random seed is tracked at [https://github.com/sgl-project/sglang/issues/1335](https://github.com/sgl-project/sglang/issues/1335).
77
docs/references/frontend/choices_methods.md
Normal file
@@ -0,0 +1,77 @@

# Choices Methods in SGLang

This doc describes the choices methods supported by SGLang.

The optional `choices_method` arg determines how options supplied to SGLang's `choices` primitive are selected. Only the `RuntimeEndpoint` backend supports the `choices_method` arg. Other backends, such as `OpenAI`, have bespoke selection implementations due to API limitations.

## Methods

### Token Length Normalized

Token length normalized is the default SGLang choices method. It selects the option with the highest average logprob across all of its tokens.

Usage example (alternatively, simply omit the `choices_method` arg):

```python
@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.token_length_normalized,
        )
    )
```

This can perform poorly if an option contains many tokens whose later tokens are predicted with high confidence based on its earlier tokens. For instance, even strong models will fail the above example if the specified options are `["Paris", "Antidisestablishmentarianism"]`.
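
A hedged sketch of the scoring rule (not SGLang's actual implementation) shows the failure mode: with hypothetical per-token logprobs, a long option whose later tokens are near-certain can overtake a shorter, better option on average logprob.

```python
def token_length_normalized(options):
    # Pick the option with the highest mean per-token logprob.
    return max(options, key=lambda opt: sum(options[opt]) / len(options[opt]))


# Hypothetical per-token logprobs for illustration only.
logprobs = {
    "Paris": [-0.4, -0.3],                                   # mean -0.35
    "Antidisestablishmentarianism": [-3.0] + [-0.001] * 20,  # mean ~ -0.14
}
print(token_length_normalized(logprobs))  # "Antidisestablishmentarianism"
```

The unlikely first token of the long option is averaged away by its twenty near-certain continuation tokens, so the wrong answer wins.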

### Greedy Token Selection

Greedy token selection simply selects the option with the highest logprob for its initial token. For overlapping options where one option is a subset of a longer option, the logprobs of the shorter option are extended using its average logprob for comparison against the longer option.

Usage example:

```python
@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.greedy_token_selection,
        )
    )
```

This can perform poorly if an option misleads the model down a bad path based on an attractive initial token. For instance, greedy selection will result in an incorrect response for this example:

```python
@sgl.function
def us_president_example(s):
    s += sgl.user("Name a US president.")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["Donald Duck", "Millard Fillmore"],
            choices_method=sgl.greedy_token_selection,
        )
    )
```
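
Again as a hedged sketch (not the library's implementation), hypothetical first-token logprobs show why greedy selection goes wrong here: "Donald" is a very likely first token after "Name a US president.", so the option that starts with it wins regardless of how the option ends.

```python
def greedy_token_selection(options):
    # Pick the option whose FIRST token has the highest logprob.
    return max(options, key=lambda opt: options[opt][0])


# Hypothetical per-token logprobs for illustration only.
logprobs = {
    "Donald Duck": [-0.2, -6.0],       # likely start, absurd continuation
    "Millard Fillmore": [-4.0, -0.1],  # unlikely start, correct answer
}
print(greedy_token_selection(logprobs))  # "Donald Duck"
```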

### Unconditional Likelihood Normalized

Unconditional likelihood normalized selects the option with the highest average token logprob once normalized by the unconditional token logprobs, as described in [this EleutherAI blogpost](https://blog.eleuther.ai/multiple-choice-normalization/). This method incurs an additional LLM call to obtain the unconditional likelihoods.

Usage example:

```python
@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.unconditional_likelihood_normalized,
        )
    )
```
9
docs/references/frontend/frontend_index.rst
Normal file
@@ -0,0 +1,9 @@

Frontend Language
=================

.. toctree::
   :maxdepth: 1
   :caption: Frontend Language

   frontend_tutorial.ipynb
   choices_methods.md
456
docs/references/frontend/frontend_tutorial.ipynb
Normal file
@@ -0,0 +1,456 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SGLang Frontend Language"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The SGLang frontend language can be used to define simple and easy prompts in a convenient, structured way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server\n",
"\n",
"Launch the server in your terminal and wait for it to initialize."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sglang import assistant_begin, assistant_end\n",
"from sglang import assistant, function, gen, system, user\n",
"from sglang import image\n",
"from sglang import RuntimeEndpoint\n",
"from sglang.lang.api import set_default_backend\n",
"from sglang.srt.utils import load_image\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import print_highlight, terminate_process, wait_for_server\n",
"\n",
"server_process, port = launch_server_cmd(\n",
"    \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"print(f\"Server started on http://localhost:{port}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set the default backend. Note: besides the local server, you may also use `OpenAI` or other API endpoints."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set_default_backend(RuntimeEndpoint(f\"http://localhost:{port}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Usage\n",
"\n",
"The simplest way of using the SGLang frontend language is a question-answer dialog between a user and an assistant."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def basic_qa(s, question):\n",
"    s += system(\"You are a helpful assistant that can answer questions.\")\n",
"    s += user(question)\n",
"    s += assistant(gen(\"answer\", max_tokens=512))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"state = basic_qa(\"List 3 countries and their capitals.\")\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-turn Dialog\n",
"\n",
"The SGLang frontend language can also be used to define multi-turn dialogs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def multi_turn_qa(s):\n",
"    s += system(\"You are a helpful assistant that can answer questions.\")\n",
"    s += user(\"Please give me a list of 3 countries and their capitals.\")\n",
"    s += assistant(gen(\"first_answer\", max_tokens=512))\n",
"    s += user(\"Please give me another list of 3 countries and their capitals.\")\n",
"    s += assistant(gen(\"second_answer\", max_tokens=512))\n",
"    return s\n",
"\n",
"\n",
"state = multi_turn_qa()\n",
"print_highlight(state[\"first_answer\"])\n",
"print_highlight(state[\"second_answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Control Flow\n",
"\n",
"You may use any Python code within the function to define more complex control flows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def tool_use(s, question):\n",
"    s += assistant(\n",
"        \"To answer this question: \"\n",
"        + question\n",
"        + \". I need to use a \"\n",
"        + gen(\"tool\", choices=[\"calculator\", \"search engine\"])\n",
"        + \". \"\n",
"    )\n",
"\n",
"    if s[\"tool\"] == \"calculator\":\n",
"        s += assistant(\"The math expression is: \" + gen(\"expression\"))\n",
"    elif s[\"tool\"] == \"search engine\":\n",
"        s += assistant(\"The key word to search is: \" + gen(\"word\"))\n",
"\n",
"\n",
"state = tool_use(\"What is 2 * 2?\")\n",
"print_highlight(state[\"tool\"])\n",
"print_highlight(state[\"expression\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parallelism\n",
"\n",
"Use `fork` to launch parallel prompts. Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def tip_suggestion(s):\n",
"    s += assistant(\n",
"        \"Here are two tips for staying healthy: \"\n",
"        \"1. Balanced Diet. 2. Regular Exercise.\\n\\n\"\n",
"    )\n",
"\n",
"    forks = s.fork(2)\n",
"    for i, f in enumerate(forks):\n",
"        f += assistant(\n",
"            f\"Now, expand tip {i+1} into a paragraph:\\n\"\n",
"            + gen(\"detailed_tip\", max_tokens=256, stop=\"\\n\\n\")\n",
"        )\n",
"\n",
"    s += assistant(\"Tip 1:\" + forks[0][\"detailed_tip\"] + \"\\n\")\n",
"    s += assistant(\"Tip 2:\" + forks[1][\"detailed_tip\"] + \"\\n\")\n",
"    s += assistant(\n",
"        \"To summarize the above two tips, I can say:\\n\" + gen(\"summary\", max_tokens=512)\n",
"    )\n",
"\n",
"\n",
"state = tip_suggestion()\n",
"print_highlight(state[\"summary\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Constrained Decoding\n",
"\n",
"Use `regex` to specify a regular expression as a decoding constraint. This is only supported for local models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def regular_expression_gen(s):\n",
"    s += user(\"What is the IP address of the Google DNS servers?\")\n",
"    s += assistant(\n",
"        gen(\n",
"            \"answer\",\n",
"            temperature=0,\n",
"            regex=r\"((25[0-5]|2[0-4]\\d|[01]?\\d\\d?).){3}(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\",\n",
"        )\n",
"    )\n",
"\n",
"\n",
"state = regular_expression_gen()\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `regex` to define a `JSON` decoding schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"character_regex = (\n",
"    r\"\"\"\\{\\n\"\"\"\n",
"    + r\"\"\"    \"name\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
"    + r\"\"\"    \"house\": \"(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)\",\\n\"\"\"\n",
"    + r\"\"\"    \"blood status\": \"(Pure-blood|Half-blood|Muggle-born)\",\\n\"\"\"\n",
"    + r\"\"\"    \"occupation\": \"(student|teacher|auror|ministry of magic|death eater|order of the phoenix)\",\\n\"\"\"\n",
"    + r\"\"\"    \"wand\": \\{\\n\"\"\"\n",
"    + r\"\"\"        \"wood\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
"    + r\"\"\"        \"core\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
"    + r\"\"\"        \"length\": [0-9]{1,2}\\.[0-9]{0,2}\\n\"\"\"\n",
"    + r\"\"\"    \\},\\n\"\"\"\n",
"    + r\"\"\"    \"alive\": \"(Alive|Deceased)\",\\n\"\"\"\n",
"    + r\"\"\"    \"patronus\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
"    + r\"\"\"    \"bogart\": \"[\\w\\d\\s]{1,16}\"\\n\"\"\"\n",
"    + r\"\"\"\\}\"\"\"\n",
")\n",
"\n",
"\n",
"@function\n",
"def character_gen(s, name):\n",
"    s += user(\n",
"        f\"{name} is a character in Harry Potter. Please fill in the following information about this character.\"\n",
"    )\n",
"    s += assistant(gen(\"json_output\", max_tokens=256, regex=character_regex))\n",
"\n",
"\n",
"state = character_gen(\"Harry Potter\")\n",
"print_highlight(state[\"json_output\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batching\n",
"\n",
"Use `run_batch` to run a batch of prompts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def text_qa(s, question):\n",
"    s += user(question)\n",
"    s += assistant(gen(\"answer\", stop=\"\\n\"))\n",
"\n",
"\n",
"states = text_qa.run_batch(\n",
"    [\n",
"        {\"question\": \"What is the capital of the United Kingdom?\"},\n",
"        {\"question\": \"What is the capital of France?\"},\n",
"        {\"question\": \"What is the capital of Japan?\"},\n",
"    ],\n",
"    progress_bar=True,\n",
")\n",
"\n",
"for i, state in enumerate(states):\n",
"    print_highlight(f\"Answer {i+1}: {state['answer']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Streaming\n",
"\n",
"Use `stream` to stream the output to the user."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def text_qa(s, question):\n",
"    s += user(question)\n",
"    s += assistant(gen(\"answer\", stop=\"\\n\"))\n",
"\n",
"\n",
"state = text_qa.run(\n",
"    question=\"What is the capital of France?\", temperature=0.1, stream=True\n",
")\n",
"\n",
"for out in state.text_iter():\n",
"    print(out, end=\"\", flush=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Complex Prompts\n",
"\n",
"You may use `{system|user|assistant}_{begin|end}` to define complex prompts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def chat_example(s):\n",
"    s += system(\"You are a helpful assistant.\")\n",
"    # Same as: s += s.system(\"You are a helpful assistant.\")\n",
"\n",
"    with s.user():\n",
"        s += \"Question: What is the capital of France?\"\n",
"\n",
"    s += assistant_begin()\n",
"    s += \"Answer: \" + gen(\"answer\", max_tokens=100, stop=\"\\n\")\n",
"    s += assistant_end()\n",
"\n",
"\n",
"state = chat_example()\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-modal Generation\n",
"\n",
"You may use the SGLang frontend language to define multi-modal prompts.\n",
"See [here](https://docs.sglang.ai/supported_models/generative_models.html) for supported models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
"    \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"print(f\"Server started on http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set_default_backend(RuntimeEndpoint(f\"http://localhost:{port}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ask a question about an image."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def image_qa(s, image_file, question):\n",
"    s += user(image(image_file) + question)\n",
"    s += assistant(gen(\"answer\", max_tokens=256))\n",
"\n",
"\n",
"image_url = \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
"image_bytes, _ = load_image(image_url)\n",
"state = image_qa(image_bytes, \"What is in the image?\")\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
7
docs/references/learn_more.md
Normal file
@@ -0,0 +1,7 @@

# Learn more

You can find more blogs, slides, and videos about SGLang at [https://github.com/sgl-project/sgl-learning-materials](https://github.com/sgl-project/sgl-learning-materials).

The latest SGLang features and updates are shared through the [LMSYS blog](https://lmsys.org/blog/).

The 2025 H2 roadmap can be found in this [issue](https://github.com/sgl-project/sglang/issues/7736).
337
docs/references/multi_node_deployment/deploy_on_k8s.md
Normal file
@@ -0,0 +1,337 @@
# Deploy On Kubernetes

This document describes how to deploy a two-node, RoCE network-based SGLang inference service on a Kubernetes (K8s) cluster.

[LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is multi-host/multi-node distributed inference.

SGLang can also be deployed with LWS on Kubernetes for distributed model serving.

Please see this guide for more details on deploying SGLang on Kubernetes using LWS.

Here we take the deployment of DeepSeek-R1 as an example.

## Prerequisites

1. At least two Kubernetes nodes are required; this example uses two H20 nodes, each with eight GPUs.

2. Make sure your K8s cluster has LWS correctly installed. If it hasn't been set up yet, please follow the [installation instructions](https://github.com/kubernetes-sigs/lws/blob/main/site/content/en/docs/installation/_index.md). **Note:** For LWS versions ≤0.5.x, you must use the Downward API to obtain `LWS_WORKER_INDEX`, as native support for this feature was introduced in v0.6.0.
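
For those older LWS versions, a minimal sketch of exposing the worker index via the Downward API (the same `fieldRef` pattern used by the manifests later in this document) looks like this:

```yaml
env:
  - name: LWS_WORKER_INDEX
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
```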

## Basic example

For the basic example documentation, refer to [Deploy Distributed Inference Service with SGLang and LWS on GPUs](https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/sglang).

However, that document only covers the basic NCCL socket mode.

In this section, we'll make some simple modifications to adapt the setup to the RDMA scenario.

## RDMA RoCE case

* Check your env:

```bash
[root@node1 ~]# ibstatus
Infiniband device 'mlx5_bond_0' port 1 status:
    default gid:     fe80:0000:0000:0000:0225:9dff:fe64:c79a
    base lid:        0x0
    sm lid:          0x0
    state:           4: ACTIVE
    phys state:      5: LinkUp
    rate:            200 Gb/sec (2X NDR)
    link_layer:      Ethernet

Infiniband device 'mlx5_bond_1' port 1 status:
    default gid:     fe80:0000:0000:0000:0225:9dff:fe6e:c3ec
    base lid:        0x0
    sm lid:          0x0
    state:           4: ACTIVE
    phys state:      5: LinkUp
    rate:            200 Gb/sec (2X NDR)
    link_layer:      Ethernet

Infiniband device 'mlx5_bond_2' port 1 status:
    default gid:     fe80:0000:0000:0000:0225:9dff:fe73:0dd7
    base lid:        0x0
    sm lid:          0x0
    state:           4: ACTIVE
    phys state:      5: LinkUp
    rate:            200 Gb/sec (2X NDR)
    link_layer:      Ethernet

Infiniband device 'mlx5_bond_3' port 1 status:
    default gid:     fe80:0000:0000:0000:0225:9dff:fe36:f7ff
    base lid:        0x0
    sm lid:          0x0
    state:           4: ACTIVE
    phys state:      5: LinkUp
    rate:            200 Gb/sec (2X NDR)
    link_layer:      Ethernet
```

* Prepare the `lws.yaml` file for deploying on K8s.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        hostIPC: true
        containers:
          - name: sglang-leader
            image: sglang:latest
            securityContext:
              privileged: true
            env:
              - name: NCCL_IB_GID_INDEX
                value: "3"
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --mem-fraction-static
              - "0.93"
              - --torch-compile-max-bs
              - "8"
              - --max-running-requests
              - "20"
              - --tp
              - "16" # Size of Tensor Parallelism
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --host
              - "0.0.0.0"
              - --port
              - "40000"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 40000
            readinessProbe:
              tcpSocket:
                port: 40000
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: model
                mountPath: /work/models
              - name: ib
                mountPath: /dev/infiniband
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: model
            hostPath:
              path: '< your models dir >' # modify according to your models dir
          - name: ib
            hostPath:
              path: /dev/infiniband
    workerTemplate:
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        hostIPC: true
        containers:
          - name: sglang-worker
            image: sglang:latest
            securityContext:
              privileged: true
            env:
              - name: NCCL_IB_GID_INDEX
                value: "3"
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --mem-fraction-static
              - "0.93"
              - --torch-compile-max-bs
              - "8"
              - --max-running-requests
              - "20"
              - --tp
              - "16" # Size of Tensor Parallelism
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
            resources:
              limits:
                nvidia.com/gpu: "8"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: model
                mountPath: /work/models
              - name: ib
                mountPath: /dev/infiniband
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ib
            hostPath:
              path: /dev/infiniband
          - name: model
            hostPath:
              path: /data1/models/deepseek_v3_moe
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-leader
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: sglang
    role: leader
  ports:
    - protocol: TCP
      port: 40000
      targetPort: 40000
```

* Then run `kubectl apply -f lws.yaml`. You should see output like this:

```text
NAME         READY   STATUS    RESTARTS   AGE
sglang-0     0/1     Running   0          9s
sglang-0-1   1/1     Running   0          9s
```

Wait for the SGLang leader (`sglang-0`) status to change to 1/1, which indicates it is `Ready`.

You can use the command `kubectl logs -f sglang-0` to view the logs of the leader node.

Once successful, you should see output like this:

```text
[2025-02-17 05:27:24 TP1] Capture cuda graph end. Time elapsed: 84.89 s
[2025-02-17 05:27:24 TP6] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP0] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP7] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP3] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP2] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP4] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP1] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP5] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24] INFO:     Started server process [1]
[2025-02-17 05:27:24] INFO:     Waiting for application startup.
[2025-02-17 05:27:24] INFO:     Application startup complete.
[2025-02-17 05:27:24] INFO:     Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
[2025-02-17 05:27:25] INFO:     127.0.0.1:48908 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-17 05:27:25 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-17 05:27:32] INFO:     127.0.0.1:48924 - "POST /generate HTTP/1.1" 200 OK
[2025-02-17 05:27:32] The server is fired up and ready to roll!
```

If the server doesn't start up successfully, please follow these steps to check for any remaining issues.

### Debug

* Set `NCCL_DEBUG=TRACE` to check whether it is an NCCL communication problem.

This should help diagnose most NCCL-related issues.

***Notice: If you find that `NCCL_DEBUG=TRACE` is not effective in the container environment, but the process is stuck or you encounter hard-to-diagnose issues, try switching to a different container image. Some images may not handle standard error output properly.***

#### RoCE scenario

* Please make sure that RDMA devices are available in the cluster environment.
* Please make sure that the nodes in the cluster have Mellanox NICs with RoCE. In this example, we use Mellanox ConnectX-5 NICs, and the proper OFED driver has been installed. If not, please refer to the document [Install OFED Driver](https://docs.nvidia.com/networking/display/mlnxofedv461000/installing+mellanox+ofed) to install the driver.
* Check your env:

```shell
$ lspci -nn | grep Eth | grep Mellanox
0000:7f:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:7f:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:c7:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:c7:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:08:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:08:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:a2:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:a2:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
```

* Check the OFED driver:

```shell
$ ofed_info -s
OFED-internal-23.07-0.5.0:
```

* Show RDMA link status and check IB devices:

```shell
$ rdma link show
8/1: mlx5_bond_0/1: state ACTIVE physical_state LINK_UP netdev reth0
9/1: mlx5_bond_1/1: state ACTIVE physical_state LINK_UP netdev reth2
10/1: mlx5_bond_2/1: state ACTIVE physical_state LINK_UP netdev reth4
11/1: mlx5_bond_3/1: state ACTIVE physical_state LINK_UP netdev reth6

$ ibdev2netdev
mlx5_bond_0 port 1 ==> reth0 (Up)
mlx5_bond_1 port 1 ==> reth2 (Up)
mlx5_bond_2 port 1 ==> reth4 (Up)
mlx5_bond_3 port 1 ==> reth6 (Up)
```

* Test RoCE network speed on the host:

```shell
yum install qperf
# on the server, run:
qperf
# on the client, run:
qperf -t 60 -cm1 <server_ip> rc_rdma_write_bw
```

* Check that RDMA is accessible in your container:

```shell
# ibv_devices
# ibv_devinfo
```

## Keys to success

* In the YAML configuration above, pay attention to the NCCL environment variables. For older versions of NCCL, you should check the `NCCL_IB_GID_INDEX` environment setting.
* `NCCL_SOCKET_IFNAME` is also crucial, but in a containerized environment this typically isn't an issue.
* In some cases, it's necessary to configure `GLOO_SOCKET_IFNAME` correctly.
* `NCCL_DEBUG` is essential for troubleshooting, but I've found that sometimes it doesn't show error logs within containers. This could be related to the Docker image you're using. You may want to try switching images if needed.
* Avoid using Docker images based on Ubuntu 18.04, as they tend to have compatibility issues.
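
Taken together, these variables map onto the container spec roughly as follows. This is only a sketch: the `NCCL_IB_GID_INDEX` value comes from the manifest above, while `bond0` is a placeholder for your actual host interface name:

```yaml
env:
  - name: NCCL_IB_GID_INDEX   # required for RoCE with older NCCL versions
    value: "3"
  - name: NCCL_SOCKET_IFNAME  # placeholder: your host's TCP interface
    value: "bond0"
  - name: GLOO_SOCKET_IFNAME  # placeholder: usually the same interface
    value: "bond0"
  - name: NCCL_DEBUG          # verbose logging while troubleshooting
    value: "TRACE"
```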

## Remaining issues

* In Kubernetes, Docker, or Containerd environments, we use hostNetwork to prevent performance degradation.
* We utilize privileged mode, which isn't secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.

## TODO

* Integrate with [k8s-rdma-shared-dev-plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin).

@@ -0,0 +1,12 @@
apiVersion: v1
kind: Service
metadata:
  name: deepseekr10528-decode-main
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: deepseekr10528-decode-main
    role: leader
  ports:
    - protocol: TCP
      port: 30000
      targetPort: 30000
290
docs/references/multi_node_deployment/lws_pd/lws-examples/d.yaml
Normal file
@@ -0,0 +1,290 @@
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseekr10528-decode-main
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - command:
              - python3
              - -m
              - sglang.launch_server
              - --port
              - "30000"
              - --host
              - "0.0.0.0"
              - --model-path
              - /work/models
              - --chunked-prefill-size
              - "262144"
              - --page-size
              - "64"
              - --enable-dp-attention
              - --enable-dp-lm-head
              - --dp-size
              - "16"
              - --moe-a2a-backend
              - deepep
              - --disaggregation-mode
              - decode
              - --mem-fraction-static
              - "0.849"
              - --context-length
              - "32768"
              - --disaggregation-ib-device
              - "mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3"
              - --cuda-graph-max-bs
              - "64"
              - --max-running-requests
              - "2048"
              - --tp-size
              - "16" # Size of Tensor Parallelism
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20102
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --ep-num-redundant-experts
              - "32"
              - --moe-dense-tp-size
              - "1"
            env:
              - name: CUDA_LAUNCH_BLOCKING
                value: "0"
              - name: NVSHMEM_IB_GID_INDEX
                value: "3"
              - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
                value: "1"
              - name: NVSHMEM_HCA_PE_MAPPING
                value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
              - name: NCCL_IB_QPS_PER_CONNECTION
                value: "8"
              - name: NCCL_IB_SPLIT_DATA_ON_QPS
                value: "1"
              - name: NCCL_NET_PLUGIN
                value: "none"
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_MIN_NCHANNELS
                value: "4"
              - name: NCCL_IB_SL
                value: "5"
              - name: MC_TE_METRIC
                value: "true"
              - name: SGLANG_MOONCAKE_TRANS_THREAD
                value: "16"
              - name: SGL_ENABLE_JIT_DEEPGEMM
                value: "1"
              - name: NCCL_IB_HCA
                value: ^=mlx5_0,mlx5_5,mlx5_6
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:latest
            name: sglang-leader
            ports:
              - containerPort: 30000
                protocol: TCP
            readinessProbe:
              periodSeconds: 30
              tcpSocket:
                port: 30000
            resources:
              limits:
                nvidia.com/gpu: "8"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
              privileged: true
            volumeMounts:
              - mountPath: /root/.cache
                name: sgl-cache
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /work/models
                name: model
              - mountPath: /dev/infiniband
                name: ib
              - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
                name: cf
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        nodeSelector:
          # modify according to your deployment env
          pd: "yes"
        tolerations:
          # modify according to your deployment env
          - key: bopd
            operator: Exists
          - key: node-role
            operator: Exists
        volumes:
          - hostPath:
              path: /data1/sgl_cache1
              type: DirectoryOrCreate
            name: sgl-cache
          - emptyDir:
              medium: Memory
            name: dshm
          - hostPath:
              path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
            name: model
          - hostPath:
              path: /dev/infiniband
            name: ib
          - hostPath:
              path: /data1/maas_hosted_models/models/fused_moe_triton/configs
            name: cf
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      metadata: {}
      spec:
        containers:
          - command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --chunked-prefill-size
              - "262144"
              - --page-size
              - "64"
              - --enable-dp-attention
              - --enable-dp-lm-head
              - --dp-size
              - "16"
              - --moe-a2a-backend
              - deepep
              - --disaggregation-mode
              - decode
              - --mem-fraction-static
              - "0.849"
              - --context-length
              - "32768"
              - --disaggregation-ib-device
              - "mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3"
              - --cuda-graph-max-bs
              - "64"
              - --max-running-requests
              - "2048"
              - --tp-size
              - "16" # Size of Tensor Parallelism
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20102
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --ep-num-redundant-experts
              - "32"
              - --moe-dense-tp-size
              - "1"
            env:
              - name: NVSHMEM_IB_TRAFFIC_CLASS
                value: "16"
              - name: NVSHMEM_IB_GID_INDEX
                value: "3"
              - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
                value: "1"
              - name: NVSHMEM_HCA_PE_MAPPING
                value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
              - name: NCCL_IB_QPS_PER_CONNECTION
                value: "8"
              - name: NCCL_IB_SPLIT_DATA_ON_QPS
                value: "1"
              - name: NCCL_NET_PLUGIN
                value: "none"
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_MIN_NCHANNELS
                value: "4"
              - name: MC_TE_METRIC
                value: "true"
              - name: NCCL_IB_SL
                value: "5"
              - name: SGLANG_MOONCAKE_TRANS_THREAD
                value: "16"
              - name: SGL_ENABLE_JIT_DEEPGEMM
                value: "1"
              - name: NCCL_IB_HCA
                value: ^=mlx5_0,mlx5_5,mlx5_6
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:latest
            name: sglang-worker
            ports:
              - containerPort: 30001
            resources:
              limits:
                nvidia.com/gpu: "8"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
              privileged: true
            volumeMounts:
              - mountPath: /root/.cache
                name: sgl-cache
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /work/models
                name: model
              - mountPath: /dev/infiniband
                name: ib
              - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
                name: cf
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        nodeSelector:
          # modify according to your deployment env
          pd: "yes"
        tolerations:
          # modify according to your deployment env
          - key: bopd
            operator: Exists
          - key: node-role
            operator: Exists
        volumes:
          - hostPath:
              path: /data1/sgl_cache1
              type: DirectoryOrCreate
            name: sgl-cache
          - emptyDir:
              medium: Memory
            name: dshm
          - hostPath:
              path: /dev/infiniband
            name: ib
          - hostPath:
              # modify according to your deployment env
              path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
            name: model
          - hostPath:
              # modify according to your deployment env
              path: /data1/maas_hosted_models/models/fused_moe_triton/configs
            name: cf
  networkConfig:
    subdomainPolicy: Shared
  replicas: 1
  rolloutStrategy:
    rollingUpdateConfiguration:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  startupPolicy: LeaderCreated

@@ -0,0 +1,56 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseekr10528-lb-main
  labels:
    app: deepseekr10528-lb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseekr10528-lb
  template:
    metadata:
      labels:
        app: deepseekr10528-lb
    spec:
      nodeSelector:
        bo: "yes"
      tolerations:
        - key: bopd
          operator: Exists
        - key: node-role
          operator: Exists
      containers:
        - name: sgl-minilb
          image: lmsysorg/sglang:latest
          command:
            - python
            - -m
            - sglang_router.launch_router
            - --pd-disaggregation
            - --prefill
            - http://deepseekr10528-prefill-main:30000
            - --decode
            - http://deepseekr10528-decode-main:30000
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          ports:
            - containerPort: 8000

---
apiVersion: v1
kind: Service
metadata:
  name: deepseekr10528-lb-service
spec:
  type: NodePort # NodePort is easy to test; you can also specify `ClusterIP`
  selector:
    app: deepseekr10528-lb
  ports:
    - protocol: TCP
      port: 8000 # service port (in-cluster)
      targetPort: 8000 # exposed container port
      nodePort: 30800

@@ -0,0 +1,12 @@
apiVersion: v1
kind: Service
metadata:
  name: deepseekr10528-prefill-main
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: deepseekr10528-prefill-main
    role: leader
  ports:
    - protocol: TCP
      port: 30000
      targetPort: 30000
304
docs/references/multi_node_deployment/lws_pd/lws-examples/p.yaml
Normal file
@@ -0,0 +1,304 @@
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseekr10528-prefill-main
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - command:
              - python3
              - -m
              - sglang.launch_server
              - --port
              - "30000"
              - --host
              - "0.0.0.0"
              - --model-path
              - /work/models
              - --disaggregation-ib-device
              # modify according to your RDMA env
              - mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
              - --chunked-prefill-size
              - "524288"
              - --max-prefill-tokens
              - "32768"
              - --page-size
              - "64"
              - --ep-dispatch-algorithm
              - dynamic
              - --eplb-algorithm
              - deepseek
              - --enable-dp-lm-head
              - --enable-dp-attention
              - --dp-size
              - "16"
              - --disable-radix-cache
              - --moe-a2a-backend
              - deepep
              - --disaggregation-mode
              - prefill
              - --mem-fraction-static
              - "0.7"
              - --context-length
              - "32768"
              - --tp
              - "16"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20102
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --ep-num-redundant-experts
              - "32"
              - --moe-dense-tp-size
              - "1"
              - --max-running-requests
              - "1024"
            env:
              - name: NVSHMEM_HCA_PE_MAPPING
                # modify according to your RDMA env
                value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
              - name: NVSHMEM_IB_GID_INDEX
                value: "3"
              - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
                value: "1"
              - name: SGLANG_SET_CPU_AFFINITY
                value: "true"
              - name: SGL_ENABLE_JIT_DEEPGEMM
                value: "1"
              - name: NCCL_IB_QPS_PER_CONNECTION
                value: "8"
              - name: NCCL_IB_SPLIT_DATA_ON_QPS
                value: "1"
              - name: NCCL_NET_PLUGIN
                value: none
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_MIN_NCHANNELS
                value: "4"
              - name: MC_TE_METRIC
                value: "false"
              - name: NCCL_IB_SL
                value: "5"
              - name: NCCL_IB_HCA
                value: ^=mlx5_0,mlx5_5,mlx5_6
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:latest
            name: sglang-leader
            ports:
              - containerPort: 30000
                protocol: TCP
            readinessProbe:
              periodSeconds: 30
              tcpSocket:
                port: 30000
            resources:
              limits:
                nvidia.com/gpu: "8"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
              privileged: true
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /work/models
                name: model
              - mountPath: /dev/infiniband
                name: ib
              - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
                name: cf
              - mountPath: /root/.cache
                name: sgl-cache
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        nodeSelector:
          # modify according to your deployment env
          pd: "yes"
        tolerations:
          # modify according to your deployment env
          - key: bopd
            operator: Exists
          - key: node-role
            operator: Exists
        volumes:
          - emptyDir:
              medium: Memory
            name: dshm
          - hostPath:
              path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
            name: model
          - hostPath:
              path: /dev/infiniband
            name: ib
          - hostPath:
              path: /data1/maas_hosted_models/models/fused_moe_triton/configs
            name: cf
          - hostPath:
              path: /data1/sgl_cache
              type: DirectoryOrCreate
            name: sgl-cache
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      metadata: {}
      spec:
        containers:
          - command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --disaggregation-ib-device
              # modify according to your RDMA env
              - mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
              - --chunked-prefill-size
              - "524288"
              - --max-prefill-tokens
              - "32768"
              - --page-size
              - "64"
              - --ep-dispatch-algorithm
              - dynamic
              - --eplb-algorithm
              - deepseek
              # - --deepep-config
              # - /home/aiges/tuned/tuned_8sms.json
              # can be tuned using deepep test scripts
              - --enable-dp-lm-head
              - --enable-dp-attention
              - --dp-size
              - "16"
              - --disable-radix-cache
              - --moe-a2a-backend
              - deepep
              - --disaggregation-mode
              - prefill
              - --mem-fraction-static
              - "0.7"
              - --context-length
              - "32768"
              - --tp
              - "16"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20102
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --ep-num-redundant-experts
              - "32"
              - --moe-dense-tp-size
              - "1"
              - --max-running-requests
              - "1024"
            env:
              - name: SGLANG_SET_CPU_AFFINITY
                value: "true"
              - name: NVSHMEM_HCA_PE_MAPPING
                # modify according to your RDMA env
                value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
              - name: NCCL_IB_HCA
                value: ^=mlx5_0,mlx5_5,mlx5_6
              - name: NVSHMEM_IB_TRAFFIC_CLASS
                value: "16"
              - name: NVSHMEM_IB_GID_INDEX
                value: "3"
              - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
                value: "1"
              - name: CUDA_LAUNCH_BLOCKING
                value: "0"
              - name: SGLANG_MOONCAKE_TRANS_THREAD
                value: "8"
              - name: SGL_ENABLE_JIT_DEEPGEMM
                value: "1"
              - name: SGL_CHUNKED_PREFIX_CACHE_THRESHOLD
                value: "0"
              - name: NCCL_IB_QPS_PER_CONNECTION
                value: "8"
              - name: NCCL_IB_SPLIT_DATA_ON_QPS
                value: "1"
              - name: NCCL_NET_PLUGIN
                value: none
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_MIN_NCHANNELS
                value: "4"
              - name: MC_TE_METRIC
                value: "true"
              - name: NCCL_IB_SL
                value: "5"
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:latest
            name: sglang-worker
            ports:
              - containerPort: 30001
                protocol: TCP
            resources:
              limits:
                nvidia.com/gpu: "8"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
              privileged: true
            volumeMounts:
              - mountPath: /root/.cache
                name: sgl-cache
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /work/models
                name: model
              - mountPath: /dev/infiniband
                name: ib
              - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
                name: cf
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        nodeSelector:
          # modify according to your deployment env
          pd: "yes"
        tolerations:
          # modify according to your deployment env
          - key: bopd
            operator: Exists
          - key: node-role
            operator: Exists
        volumes:
          - emptyDir:
              medium: Memory
            name: dshm
          - hostPath:
              path: /dev/infiniband
            name: ib
          - hostPath:
              # modify according to your deployment env
              path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
            name: model
          - hostPath:
              # modify according to your deployment env
              path: /data1/maas_hosted_models/models/fused_moe_triton/configs
            name: cf
          - hostPath:
              # modify according to your deployment env
              path: /data1/sgl_cache
              type: DirectoryOrCreate
            name: sgl-cache
783
docs/references/multi_node_deployment/lws_pd/lws_pd_deploy.md
Normal file
@@ -0,0 +1,783 @@
|
||||
# LWS Based PD Deploy
|
||||
|
||||
## 0. Prerequisites
|
||||
|
||||
1. k8s >=1.26
|
||||
2. lws installed on k8s.
|
||||
|
||||
## 1. Image Preparation
|
||||
|
||||
`lmsysorg/sglang:deepep`
|
||||
|
||||
## 2. Deployment Manifest Files
|
||||
|
||||
***Notice: We will package all deployment files into Helm Chart format in the near future. Interested community members can contact us to contribute***
|
||||
|
||||
### Prefill
|
||||
|
||||
Prefill manifest file [prefill.yaml](lws-examples/p.yaml)
|
||||
|
||||
*Note: The NodeSelector section, model location section, and taint toleration section can be adjusted according to your actual deployment environment*
|
||||
|
||||
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseekr10528-prefill-main
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - command:
              - python3
              - -m
              - sglang.launch_server
              - --port
              - "30000"
              - --host
              - "0.0.0.0"
              - --model-path
              - /work/models
              - --disaggregation-ib-device
              # modify according to your RDMA env
              - mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
              - --chunked-prefill-size
              - "524288"
              - --max-prefill-tokens
              - "32768"
              - --page-size
              - "64"
              # - --init-expert-location
              # - /home/aiges/tuned/attachment_ep_statistics/prefill_in1024.json
              - --ep-dispatch-algorithm
              - dynamic
              - --eplb-algorithm
              - deepseek
              # - --deepep-config
              # - /home/aiges/tuned/tuned_8sms.json
              - --enable-dp-lm-head
              - --enable-dp-attention
              - --dp-size
              - "16"
              - --disable-radix-cache
              - --moe-a2a-backend
              - deepep
              - --disaggregation-mode
              - prefill
              - --mem-fraction-static
              - "0.7"
              - --context-length
              - "32768"
              - --tp
              - "16"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20102
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --ep-num-redundant-experts
              - "32"
              - --moe-dense-tp-size
              - "1"
              - --max-running-requests
              - "1024"
            env:
              # - name: NVSHMEM_HCA_PE_MAPPING
              #   value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
              # - name: NVSHMEM_HCA_LIST
              #   value: "mlx5_bond_0:1,mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1"
              - name: NVSHMEM_IB_GID_INDEX
                value: "3"
              - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
                value: "1"
              - name: SGLANG_SET_CPU_AFFINITY
                value: "true"
              - name: SGL_ENABLE_JIT_DEEPGEMM
                value: "1"
              - name: NCCL_IB_QPS_PER_CONNECTION
                value: "8"
              - name: NCCL_IB_SPLIT_DATA_ON_QPS
                value: "1"
              - name: NCCL_NET_PLUGIN
                value: none
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_MIN_NCHANNELS
                value: "4"
              - name: MC_TE_METRIC
                value: "false"
              - name: NCCL_IB_SL
                value: "5"
              - name: NCCL_IB_HCA
                value: ^=mlx5_0,mlx5_5,mlx5_6
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:deepep
            name: sglang-leader
            ports:
              - containerPort: 30000
                protocol: TCP
            readinessProbe:
              periodSeconds: 30
              tcpSocket:
                port: 30000
            resources:
              limits:
                nvidia.com/gpu: "8"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
              privileged: true
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /work/models
                name: model
              - mountPath: /dev/infiniband
                name: ib
              - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
                name: cf
              - mountPath: /root/.cache
                name: sgl-cache
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        nodeSelector:
          pd: "yes"
        tolerations:
          - key: pd
            operator: Exists
          - key: node-role
            operator: Exists
        volumes:
          - emptyDir:
              medium: Memory
            name: dshm
          - hostPath:
              # modify according to your deployment env
              path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
            name: model
          - hostPath:
              path: /dev/infiniband
            name: ib
          - hostPath:
              # modify according to your deployment env
              path: /data1/maas_hosted_models/models/fused_moe_triton/configs
            name: cf
          - hostPath:
              # modify according to your deployment env
              path: /data1/sgl_cache
              type: DirectoryOrCreate
            name: sgl-cache
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      metadata: {}
      spec:
        containers:
          - command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --disaggregation-ib-device
              # modify according to your RDMA env
              - mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
              - --chunked-prefill-size
              - "524288"
              - --max-prefill-tokens
              - "32768"
              - --page-size
              - "64"
              # - --init-expert-location
              # - /home/aiges/tuned/attachment_ep_statistics/prefill_in1024.json
              - --ep-dispatch-algorithm
              - dynamic
              - --eplb-algorithm
              - deepseek
              # - --deepep-config
              # - /home/aiges/tuned/tuned_8sms.json
              - --enable-dp-lm-head
              - --enable-dp-attention
              - --dp-size
              - "16"
              - --disable-radix-cache
              - --moe-a2a-backend
              - deepep
              - --disaggregation-mode
              - prefill
              - --mem-fraction-static
              - "0.7"
              - --context-length
              - "32768"
              - --tp
              - "16"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20102
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --ep-num-redundant-experts
              - "32"
              - --moe-dense-tp-size
              - "1"
              - --max-running-requests
              - "1024"
            env:
              - name: SGLANG_SET_CPU_AFFINITY
                value: "true"
              - name: SGLANG_HACK_DEEPEP_NUM_SMS
                value: "8"
              - name: SGLANG_HACK_DEEPEP_NEW_MODE
                value: "0"
              # - name: NVSHMEM_HCA_PE_MAPPING
              #   value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
              # - name: NVSHMEM_HCA_LIST
              #   value: "mlx5_bond_0:1,mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1"
              - name: NCCL_IB_HCA
                value: ^=mlx5_0,mlx5_5,mlx5_6
              - name: NVSHMEM_IB_TRAFFIC_CLASS
                value: "16"
              - name: NVSHMEM_IB_GID_INDEX
                value: "3"
              - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
                value: "1"
              - name: CUDA_LAUNCH_BLOCKING
                value: "0"
              - name: SGLANG_MOONCAKE_TRANS_THREAD
                value: "8"
              - name: SGL_ENABLE_JIT_DEEPGEMM
                value: "1"
              - name: SGL_CHUNKED_PREFIX_CACHE_THRESHOLD
                value: "0"
              - name: NCCL_IB_QPS_PER_CONNECTION
                value: "8"
              - name: NCCL_IB_SPLIT_DATA_ON_QPS
                value: "1"
              - name: NCCL_NET_PLUGIN
                value: none
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_MIN_NCHANNELS
                value: "4"
              - name: MC_TE_METRIC
                value: "true"
              - name: NCCL_IB_SL
                value: "5"
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:deepep
            name: sglang-worker
            ports:
              - containerPort: 30001
                protocol: TCP
            resources:
              limits:
                nvidia.com/gpu: "8"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
              privileged: true
            volumeMounts:
              - mountPath: /root/.cache
                name: sgl-cache
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /work/models
                name: model
              - mountPath: /dev/infiniband
                name: ib
              - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
                name: cf
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        nodeSelector:
          pd: "yes"
        tolerations:
          - key: pd
            operator: Exists
          - key: node-role
            operator: Exists
        volumes:
          - emptyDir:
              medium: Memory
            name: dshm
          - hostPath:
              path: /dev/infiniband
            name: ib
          - hostPath:
              path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
            name: model
          - hostPath:
              path: /data1/maas_hosted_models/models/fused_moe_triton/configs
            name: cf
          - hostPath:
              path: /data1/sgl_cache
              type: DirectoryOrCreate
            name: sgl-cache
```

### Decode

Decode manifest file: [decode.yaml](lws-examples/d.yaml)

*Note: The nodeSelector, model location, and taint toleration sections should be adjusted to match your actual deployment environment.*

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseekr10528-decode-main
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - command:
              - python3
              - -m
              - sglang.launch_server
              - --port
              - "30000"
              - --host
              - "0.0.0.0"
              - --model-path
              - /work/models
              - --chunked-prefill-size
              - "262144"
              - --page-size
              - "64"
              - --enable-dp-attention
              - --enable-dp-lm-head
              - --dp-size
              - "16"
              - --moe-a2a-backend
              - deepep
              - --disaggregation-mode
              - decode
              - --mem-fraction-static
              - "0.849"
              - --context-length
              - "32768"
              - --disaggregation-ib-device
              # modify according to your RDMA env
              - "mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3"
              - --cuda-graph-max-bs
              - "64"
              - --max-running-requests
              - "2048"
              - --tp-size
              - "16" # tensor parallel size
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20102
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --ep-num-redundant-experts
              - "32"
              - --moe-dense-tp-size
              - "1"
            env:
              - name: CUDA_LAUNCH_BLOCKING
                value: "0"
              - name: NVSHMEM_IB_GID_INDEX
                value: "3"
              - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
                value: "1"
              - name: NCCL_IB_QPS_PER_CONNECTION
                value: "8"
              - name: NCCL_IB_SPLIT_DATA_ON_QPS
                value: "1"
              - name: NCCL_NET_PLUGIN
                value: "none"
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_MIN_NCHANNELS
                value: "4"
              - name: NCCL_IB_SL
                value: "5"
              - name: MC_TE_METRIC
                value: "true"
              - name: SGLANG_MOONCAKE_TRANS_THREAD
                value: "16"
              - name: SGL_ENABLE_JIT_DEEPGEMM
                value: "1"
              - name: NCCL_IB_HCA
                value: ^=mlx5_0,mlx5_5,mlx5_6
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:deepep
            name: sglang-leader
            ports:
              - containerPort: 30000
                protocol: TCP
            readinessProbe:
              periodSeconds: 30
              tcpSocket:
                port: 30000
            resources:
              limits:
                nvidia.com/gpu: "8"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
              privileged: true
            volumeMounts:
              - mountPath: /root/.cache
                name: sgl-cache
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /work/models
                name: model
              - mountPath: /dev/infiniband
                name: ib
              - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
                name: cf
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        nodeSelector:
          pd: "yes"
        tolerations:
          - key: pd
            operator: Exists
          - key: node-role
            operator: Exists
        volumes:
          - hostPath:
              path: /data1/sgl_cache1
              type: DirectoryOrCreate
            name: sgl-cache
          - emptyDir:
              medium: Memory
            name: dshm
          - hostPath:
              path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
            name: model
          - hostPath:
              path: /dev/infiniband
            name: ib
          - hostPath:
              path: /data1/maas_hosted_models/models/fused_moe_triton/configs
            name: cf
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      metadata: {}
      spec:
        containers:
          - command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --chunked-prefill-size
              - "262144"
              - --page-size
              - "64"
              - --enable-dp-attention
              - --enable-dp-lm-head
              # - --enable-two-batch-overlap
              - --dp-size
              - "16"
              - --moe-a2a-backend
              - deepep
              - --disaggregation-mode
              - decode
              - --mem-fraction-static
              - "0.849"
              - --context-length
              - "32768"
              - --disaggregation-ib-device
              # modify according to your RDMA env
              - "mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3"
              - --cuda-graph-max-bs
              - "64"
              - --max-running-requests
              - "2048"
              - --tp-size
              - "16" # tensor parallel size
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20102
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --ep-num-redundant-experts
              - "32"
              - --moe-dense-tp-size
              - "1"
            env:
              - name: SGLANG_HACK_DEEPEP_NUM_SMS
                value: "24"
              - name: SGLANG_HACK_DEEPEP_NEW_MODE
                value: "0"
              - name: NVSHMEM_IB_TRAFFIC_CLASS
                value: "16"
              - name: NVSHMEM_IB_GID_INDEX
                value: "3"
              - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
                value: "1"
              - name: NCCL_IB_QPS_PER_CONNECTION
                value: "8"
              - name: NCCL_IB_SPLIT_DATA_ON_QPS
                value: "1"
              - name: NCCL_NET_PLUGIN
                value: "none"
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_MIN_NCHANNELS
                value: "4"
              - name: MC_TE_METRIC
                value: "true"
              - name: NCCL_IB_SL
                value: "5"
              - name: SGLANG_MOONCAKE_TRANS_THREAD
                value: "16"
              - name: SGL_ENABLE_JIT_DEEPGEMM
                value: "1"
              - name: NCCL_IB_HCA
                value: ^=mlx5_0,mlx5_5,mlx5_6
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            image: lmsysorg/sglang:deepep
            name: sglang-worker
            ports:
              - containerPort: 30001
            resources:
              limits:
                nvidia.com/gpu: "8"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
              privileged: true
            volumeMounts:
              - mountPath: /root/.cache
                name: sgl-cache
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /work/models
                name: model
              - mountPath: /dev/infiniband
                name: ib
              - mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
                name: cf
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        nodeSelector:
          pd: "yes"
        tolerations:
          - key: pd
            operator: Exists
          - key: node-role
            operator: Exists
        volumes:
          - hostPath:
              path: /data1/sgl_cache1
              type: DirectoryOrCreate
            name: sgl-cache
          - emptyDir:
              medium: Memory
            name: dshm
          - hostPath:
              path: /dev/infiniband
            name: ib
          - hostPath:
              # modify according to your deployment env
              path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
            name: model
          - hostPath:
              # modify according to your deployment env
              path: /data1/maas_hosted_models/models/fused_moe_triton/configs
            name: cf
  networkConfig:
    subdomainPolicy: Shared
  replicas: 1
  rolloutStrategy:
    rollingUpdateConfiguration:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  startupPolicy: LeaderCreated
```

Execute them separately:

```bash
kubectl apply -f p.yaml
kubectl apply -f d.yaml
```

At this point, we have completed the deployment of the 1P1D SGLang engine part.

To let users access the model API directly, we still need a load balancer that chains the prefill and decode calls. Different companies implement load balancers differently, and the community will also officially release a new LB component written in Rust in the near future.

For now, we use a static Kubernetes Service + minilb approach to serve model API calls.

### Creating Services for Prefill and Decode

#### Create the prefill Kubernetes Service

[p-svc.yaml](lws-examples/p-svc.yaml)
```yaml
apiVersion: v1
kind: Service
metadata:
  name: deepseekr10528-prefill-main
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: deepseekr10528-prefill-main
    role: leader
  ports:
    - protocol: TCP
      port: 30000
      targetPort: 30000
```

Execute `kubectl apply -f p-svc.yaml`.

#### Create the decode Kubernetes Service

[d-svc.yaml](lws-examples/d-svc.yaml)
```yaml
apiVersion: v1
kind: Service
metadata:
  name: deepseekr10528-decode-main
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: deepseekr10528-decode-main
    role: leader
  ports:
    - protocol: TCP
      port: 30000
      targetPort: 30000
```

Execute `kubectl apply -f d-svc.yaml`.

#### Deploy minilb and the LB Service

[lb.yaml](lws-examples/lb.yaml)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseekr10528-lb-main
  labels:
    app: deepseekr10528-lb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseekr10528-lb
  template:
    metadata:
      labels:
        app: deepseekr10528-lb
    spec:
      nodeSelector:
        pd: "yes"
      tolerations:
        - key: pd
          operator: Exists
        - key: node-role
          operator: Exists
      containers:
        - name: sgl-minilb
          image: lmsysorg/sglang:deepep
          command:
            - python
            - -m
            - sglang_router.launch_router
            - --pd-disaggregation
            - --prefill
            - http://deepseekr10528-prefill-main:30000
            - --decode
            - http://deepseekr10528-decode-main:30000
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: deepseekr10528-lb-service
spec:
  type: NodePort
  selector:
    app: deepseekr10528-lb
  ports:
    - protocol: TCP
      port: 8000       # service port (in-cluster)
      targetPort: 8000 # exposed container port
      nodePort: 30800
```

Execute `kubectl apply -f lb.yaml`.

After all model deployments succeed, you will see output like the following:

```bash
[root@ecs-001]# kubectl get po
deepseekr10528-decode-main-0              1/1   Running   0   74m
deepseekr10528-decode-main-0-1            1/1   Running   0   74m
deepseekr10528-lb-main-9c5dbfc57-6lcbd    1/1   Running   0   22m
deepseekr10528-prefill-main-0             1/1   Running   0   74m
deepseekr10528-prefill-main-0-1           1/1   Running   0   74m
[root@ecs-cbm-x1-pd-cpu-001 main_doc]# kubectl get svc | grep dee
deepseekr10528-decode-main    ClusterIP   None             <none>   <none>           97m
deepseekr10528-lb-service     NodePort    172.16.242.169   <none>   8000:30800/TCP   22m
deepseekr10528-prefill-main   ClusterIP   None             <none>   <none>           97m
```

At this point, pick any node's IP and access the service at nodePort 30800:

```bash
[root@ecs-001]# curl -X POST "http://{node_ip}:30800/v1/chat/completions" \
>   -H "Content-Type: application/json" \
>   -H "Authorization: Bearer None" \
>   -d '{
>     "rid": "ccccdd",
>     "model": "r1",
>     "messages": [
>       {"role": "system", "content": "0: You are a helpful AI assistant"},
>       {"role": "user", "content": "你是谁?."}
>     ],
>     "max_tokens": 221
>   }'
{"id":"ccccdd","object":"chat.completion","created":1750252498,"model":"qwen2","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n嗯,用户问了一个很基础的自我介绍问题"你是谁?"。这可能是第一次互动时的常规开场白,也可能是想确认我的身份和功能范围。\n\n用户没有提供任何背景信息,语气简洁中性。这种场景下新用户的可能性较高,需要给出清晰友好的自我介绍,同时突出实用价值来降低陌生感。\n\n考虑到中文用户,应该用简体中文回复。重点要说明三点:身份归属(深度求索)、功能定位(AI助手)、服务范围(学习/工作/生活)。结尾用开放性问题引导对话很关键——既能了解需求,又能避免让用户面对空白输入框时不知所措。\n\n用波浪线结尾可以软化语气,那个笑脸表情😊刚好能中和AI的机械感。不过要控制表情符号数量,避免显得轻浮。\n</think>\n你好呀!我是你的AI助手,由深度求索公司(DeepSeek)开发的语言模型,名字叫 **DeepSeek-R1**。你可以把我当成一个知识丰富、随叫随到的小帮手~😊\n\n我的任务就是陪你聊天、解答问题、","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":14,"total_tokens":235,"completion_tokens":221,"prompt_tokens_details":null}}
```

## FAQ

1. The current startup parameters may not be fully compatible with all RDMA scenarios. Different network environments may require different RDMA/NCCL-related environment variables.

2. Some preset, optimized EPLB configurations are not used here. You can adjust them as needed following [#6017](https://github.com/sgl-project/sglang/issues/6017).
docs/references/multi_node_deployment/multi_node.md

# Multi-Node Deployment

## Llama 3.1 405B

**Run 405B (fp16) on two nodes**

```bash
# Replace 172.16.4.52:20000 with the IP address and port of the first node.

# Node 0
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0

# Node 1
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1
```

Note that Llama 405B (fp8) can also be launched on a single node:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
```

## DeepSeek V3/R1

Please refer to the [DeepSeek documentation](https://docs.sglang.ai/basic_usage/deepseek.html#running-examples-on-multi-node).

## Multi-Node Inference on SLURM

This example shows how to serve an SGLang server across multiple nodes with SLURM. Submit the following job to the SLURM cluster.

```
#!/bin/bash -l

#SBATCH -o SLURM_Logs/%x_%j_master.out
#SBATCH -e SLURM_Logs/%x_%j_master.err
#SBATCH -D ./
#SBATCH -J Llama-405B-Online-Inference-TP16-SGL

#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1  # Ensure 1 task per node
#SBATCH --cpus-per-task=18
#SBATCH --mem=224GB
#SBATCH --partition="lmsys.org"
#SBATCH --gres=gpu:8
#SBATCH --time=12:00:00

echo "[INFO] Activating environment on node $SLURM_PROCID"
if ! source ENV_FOLDER/bin/activate; then
    echo "[ERROR] Failed to activate environment" >&2
    exit 1
fi

# Define parameters
model=MODEL_PATH
tp_size=16

echo "[INFO] Running inference"
echo "[INFO] Model: $model"
echo "[INFO] TP Size: $tp_size"

# Set the NCCL initialization address using the hostname of the head node
HEAD_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n 1)
NCCL_INIT_ADDR="${HEAD_NODE}:8000"
echo "[INFO] NCCL_INIT_ADDR: $NCCL_INIT_ADDR"

# Launch the model server on each node using SLURM
srun --ntasks=2 --nodes=2 --output="SLURM_Logs/%x_%j_node$SLURM_NODEID.out" \
    --error="SLURM_Logs/%x_%j_node$SLURM_NODEID.err" \
    python3 -m sglang.launch_server \
    --model-path "$model" \
    --grammar-backend "xgrammar" \
    --tp "$tp_size" \
    --dist-init-addr "$NCCL_INIT_ADDR" \
    --nnodes 2 \
    --node-rank "$SLURM_NODEID" &

# Wait for the SGLang server to be ready on port 30000
while ! nc -z "$HEAD_NODE" 30000; do
    sleep 1
    echo "[INFO] Waiting for $HEAD_NODE:30000 to accept connections"
done

echo "[INFO] $HEAD_NODE:30000 is ready to accept connections"

# Keep the script running until the SLURM job times out
wait
```

Then you can test the server by sending requests as described in the other [documents](https://docs.sglang.ai/backend/openai_api_completions.html).

Thanks to [aflah02](https://github.com/aflah02) for providing this example, based on his [blog post](https://aflah02.substack.com/p/multi-node-llm-inference-with-sglang).
docs/references/multi_node_deployment/multi_node_index.rst

Multi-Node Deployment
=====================

.. toctree::
   :maxdepth: 1
   :caption: Multi-Node Deployment

   multi_node.md
   deploy_on_k8s.md
   lws_pd/lws_pd_deploy.md

- `Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs <https://lmsys.org/blog/2025-05-05-large-scale-ep/>`_
- `Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs <https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/>`_
docs/references/production_metrics.md

# Production Metrics

SGLang exposes the following metrics via Prometheus. Enable them by adding `--enable-metrics` when launching the server.

An example monitoring dashboard is available in [examples/monitoring/grafana.json](https://github.com/sgl-project/sglang/blob/main/examples/monitoring/grafana/dashboards/json/sglang-dashboard.json).
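
Once metrics are enabled, Prometheus can scrape the server's `/metrics` endpoint with an ordinary scrape job. This is a minimal sketch; the target address `localhost:30000` is an assumption based on the default server port used in the examples above.

```yaml
# Minimal Prometheus scrape job for SGLang (target address is an assumption).
scrape_configs:
  - job_name: sglang
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:30000"]
```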

Here is an example of the metrics:

```
$ curl http://localhost:30000/metrics
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.128902e+06
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.557572e+06
# HELP sglang:token_usage The token usage
# TYPE sglang:token_usage gauge
sglang:token_usage{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.28
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.007507552643049313
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 2.3518979474117756e+06
sglang:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="0.06",model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
sglang:time_to_first_token_seconds_bucket{le="0.08",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:time_to_first_token_seconds_bucket{le="0.25",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:time_to_first_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:time_to_first_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 27.0
sglang:time_to_first_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
sglang:time_to_first_token_seconds_bucket{le="5.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 314.0
sglang:time_to_first_token_seconds_bucket{le="7.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 941.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1330.0
sglang:time_to_first_token_seconds_bucket{le="15.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1970.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 2326.0
sglang:time_to_first_token_seconds_bucket{le="25.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 2417.0
sglang:time_to_first_token_seconds_bucket{le="30.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 2513.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 11008.0
sglang:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 11008.0
# HELP sglang:e2e_request_latency_seconds Histogram of end-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.116093850019932e+06
sglang:e2e_request_latency_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:e2e_request_latency_seconds_bucket{le="1.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:e2e_request_latency_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 6.0
sglang:e2e_request_latency_seconds_bucket{le="5.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 10.0
sglang:e2e_request_latency_seconds_bucket{le="15.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 11.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 14.0
sglang:e2e_request_latency_seconds_bucket{le="30.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 247.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 486.0
sglang:e2e_request_latency_seconds_bucket{le="50.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 845.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1513.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 11228.0
sglang:e2e_request_latency_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 11228.0
# HELP sglang:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE sglang:time_per_output_token_seconds histogram
sglang:time_per_output_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 866964.5791549598
sglang:time_per_output_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 73.0
sglang:time_per_output_token_seconds_bucket{le="0.015",model_name="meta-llama/Llama-3.1-8B-Instruct"} 382.0
sglang:time_per_output_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 593.0
sglang:time_per_output_token_seconds_bucket{le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 855.0
sglang:time_per_output_token_seconds_bucket{le="0.03",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1035.0
sglang:time_per_output_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1815.0
sglang:time_per_output_token_seconds_bucket{le="0.05",model_name="meta-llama/Llama-3.1-8B-Instruct"} 11685.0
sglang:time_per_output_token_seconds_bucket{le="0.075",model_name="meta-llama/Llama-3.1-8B-Instruct"} 433413.0
sglang:time_per_output_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 4.950195e+06
sglang:time_per_output_token_seconds_bucket{le="0.15",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.039435e+06
sglang:time_per_output_token_seconds_bucket{le="0.2",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.171662e+06
sglang:time_per_output_token_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.266055e+06
sglang:time_per_output_token_seconds_bucket{le="0.4",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.296752e+06
|
||||
sglang:time_per_output_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.312226e+06
|
||||
sglang:time_per_output_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.339675e+06
|
||||
sglang:time_per_output_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.357747e+06
|
||||
sglang:time_per_output_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.389414e+06
|
||||
sglang:time_per_output_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.400757e+06
|
||||
sglang:time_per_output_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.400757e+06
|
||||
# HELP sglang:func_latency_seconds Function latency in seconds
|
||||
# TYPE sglang:func_latency_seconds histogram
|
||||
sglang:func_latency_seconds_sum{name="generate_request"} 4.514771912145079
|
||||
sglang:func_latency_seconds_bucket{le="0.05",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="0.07500000000000001",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="0.1125",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="0.16875",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="0.253125",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="0.3796875",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="0.56953125",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="0.8542968750000001",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="1.2814453125",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="1.9221679687500002",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="2.8832519531250003",name="generate_request"} 14006.0
|
||||
sglang:func_latency_seconds_bucket{le="4.3248779296875",name="generate_request"} 14007.0
|
||||
sglang:func_latency_seconds_bucket{le="6.487316894531251",name="generate_request"} 14007.0
|
||||
sglang:func_latency_seconds_bucket{le="9.730975341796876",name="generate_request"} 14007.0
|
||||
sglang:func_latency_seconds_bucket{le="14.596463012695313",name="generate_request"} 14007.0
|
||||
sglang:func_latency_seconds_bucket{le="21.89469451904297",name="generate_request"} 14007.0
|
||||
sglang:func_latency_seconds_bucket{le="32.84204177856446",name="generate_request"} 14007.0
|
||||
sglang:func_latency_seconds_bucket{le="49.26306266784668",name="generate_request"} 14007.0
|
||||
sglang:func_latency_seconds_bucket{le="+Inf",name="generate_request"} 14007.0
|
||||
sglang:func_latency_seconds_count{name="generate_request"} 14007.0
|
||||
# HELP sglang:num_running_reqs The number of running requests
|
||||
# TYPE sglang:num_running_reqs gauge
|
||||
sglang:num_running_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 162.0
|
||||
# HELP sglang:num_used_tokens The number of used tokens
|
||||
# TYPE sglang:num_used_tokens gauge
|
||||
sglang:num_used_tokens{model_name="meta-llama/Llama-3.1-8B-Instruct"} 123859.0
|
||||
# HELP sglang:gen_throughput The generate throughput (token/s)
|
||||
# TYPE sglang:gen_throughput gauge
|
||||
sglang:gen_throughput{model_name="meta-llama/Llama-3.1-8B-Instruct"} 86.50814177726902
|
||||
# HELP sglang:num_queue_reqs The number of requests in the waiting queue
|
||||
# TYPE sglang:num_queue_reqs gauge
|
||||
sglang:num_queue_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 2826.0
|
||||
```
|
||||
|
||||
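The histograms above expose cumulative bucket counts, from which approximate quantiles can be derived; this is what PromQL's `histogram_quantile()` does. As a minimal sketch of that computation, the following uses the `sglang:time_per_output_token_seconds` bucket values shown above (the helper function name is illustrative, not part of SGLang):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from a Prometheus-style histogram.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with (float("inf"), total_count). Uses linear interpolation
    within the bucket containing the target rank, like PromQL.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Cumulative counts from the sglang:time_per_output_token_seconds dump above.
tpot_buckets = [
    (0.005, 1), (0.01, 73), (0.015, 382), (0.02, 593), (0.025, 855),
    (0.03, 1035), (0.04, 1815), (0.05, 11685), (0.075, 433413),
    (0.1, 4950195), (0.15, 7039435), (0.2, 7171662), (0.3, 7266055),
    (0.4, 7296752), (0.5, 7312226), (0.75, 7339675), (1.0, 7357747),
    (2.5, 7389414), (float("inf"), 7400757),
]

print(f"p50 time-per-output-token ~= {histogram_quantile(0.5, tpot_buckets):.3f}s")
# -> p50 time-per-output-token ~= 0.093s
```

The same calculation can be run directly in Prometheus with `histogram_quantile(0.5, rate(sglang:time_per_output_token_seconds_bucket[5m]))`.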
## Setup Guide

This section describes how to set up the monitoring stack (Prometheus + Grafana) provided in the `examples/monitoring` directory.

### Prerequisites

- Docker and Docker Compose installed
- SGLang server running with metrics enabled

### Usage

1. **Start your SGLang server with metrics enabled:**

   ```bash
   python -m sglang.launch_server --model-path <your_model_path> --port 30000 --enable-metrics
   ```

   Replace `<your_model_path>` with the actual path to your model (e.g., `meta-llama/Meta-Llama-3.1-8B-Instruct`). Ensure the server is accessible from the monitoring stack (you might need `--host 0.0.0.0` if running in Docker). By default, the metrics endpoint will be available at `http://<sglang_server_host>:30000/metrics`.

2. **Navigate to the monitoring example directory:**

   ```bash
   cd examples/monitoring
   ```

3. **Start the monitoring stack:**

   ```bash
   docker compose up -d
   ```

   This command starts Prometheus and Grafana in the background.

4. **Access the monitoring interfaces:**

   * **Grafana:** Open your web browser and go to [http://localhost:3000](http://localhost:3000).
   * **Prometheus:** Open your web browser and go to [http://localhost:9090](http://localhost:9090).

5. **Log in to Grafana:**

   * Default Username: `admin`
   * Default Password: `admin`

   You will be prompted to change the password upon your first login.

6. **View the Dashboard:**

   The SGLang dashboard is pre-configured and should be available automatically. Navigate to `Dashboards` -> `Browse` -> `SGLang Monitoring` folder -> `SGLang Dashboard`.

### Troubleshooting

* **Port Conflicts:** If you encounter errors like "port is already allocated," check whether other services (including previous instances of Prometheus/Grafana) are using ports `9090` or `3000`. Use `docker ps` to find running containers and `docker stop <container_id>` to stop them, or use `lsof -i :<port>` to find other processes using the ports. You may need to adjust the ports in the `docker-compose.yaml` file if they permanently conflict with other essential services on your system.

  To change Grafana's port (for example, to 3090) in the Docker Compose file, explicitly specify it under the `grafana` service. There are two options:

  Option 1: Add `GF_SERVER_HTTP_PORT` to the environment section:

  ```yaml
  environment:
    - GF_AUTH_ANONYMOUS_ENABLED=true
    - GF_SERVER_HTTP_PORT=3090 # <-- Add this line
  ```

  Option 2: Use port mapping:

  ```yaml
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3090:3000" # <-- Host:Container port mapping
  ```

* **Connection Issues:**
  * Ensure both Prometheus and Grafana containers are running (`docker ps`).
  * Verify the Prometheus data source configuration in Grafana (usually auto-configured via `grafana/datasources/datasource.yaml`). Go to `Connections` -> `Data sources` -> `Prometheus`. The URL should point to the Prometheus service (e.g., `http://prometheus:9090`).
  * Confirm that your SGLang server is running and the metrics endpoint (`http://<sglang_server_host>:30000/metrics`) is accessible *from the Prometheus container*. If SGLang is running on your host machine and Prometheus is in Docker, use `host.docker.internal` (on Docker Desktop) or your machine's network IP instead of `localhost` in the `prometheus.yaml` scrape configuration.
* **No Data on Dashboard:**
  * Generate some traffic to your SGLang server to produce metrics. For example, run a benchmark:

    ```bash
    python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 100 --random-input 128 --random-output 128
    ```

  * Check the Prometheus UI (`http://localhost:9090`) under `Status` -> `Targets` to see whether the SGLang endpoint is being scraped successfully.
  * Verify that the `model_name` and `instance` labels in your Prometheus metrics match the variables used in the Grafana dashboard. You may need to adjust the Grafana dashboard variables or the labels in your Prometheus configuration.

### Configuration Files

The monitoring setup is defined by the following files within the `examples/monitoring` directory:

* `docker-compose.yaml`: Defines the Prometheus and Grafana services.
* `prometheus.yaml`: Prometheus configuration, including scrape targets.
* `grafana/datasources/datasource.yaml`: Configures the Prometheus data source for Grafana.
* `grafana/dashboards/config/dashboard.yaml`: Tells Grafana to load dashboards from the specified path.
* `grafana/dashboards/json/sglang-dashboard.json`: The actual Grafana dashboard definition in JSON format.

You can customize the setup by modifying these files. For instance, you might need to update the `static_configs` target in `prometheus.yaml` if your SGLang server runs on a different host or port.

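As a concrete illustration, a minimal `prometheus.yaml` scrape configuration for this setup might look like the sketch below; the job name, scrape interval, and target address are assumptions and must match your actual deployment:

```yaml
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: sglang
    static_configs:
      # Use host.docker.internal when SGLang runs on the host and
      # Prometheus runs inside Docker Desktop; otherwise use the host's IP.
      - targets: ["host.docker.internal:30000"]
```
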
#### Check if the metrics are being collected

Run `python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5` to generate some requests.

Then you should be able to see the metrics in the Grafana dashboard.

docs/references/torch_compile_cache.md

# Enabling cache for torch.compile

SGLang uses the `max-autotune-no-cudagraphs` mode of `torch.compile`, and the auto-tuning step can be slow.
If you want to deploy a model on many different machines, you can ship the `torch.compile` cache to these machines and skip the compilation steps.

This is based on the [PyTorch compilation caching tutorial](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html).

1. Generate the cache by setting `TORCHINDUCTOR_CACHE_DIR` and running the model once:

   ```bash
   TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
   ```

2. Copy the cache folder to other machines and launch the server with `TORCHINDUCTOR_CACHE_DIR` set to the copied path.
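
   The copy step can be sketched as follows; the `/tmp` paths here are illustrative stand-ins for the real cache directory (`/root/inductor_root_cache` above) and the target machine's filesystem:

   ```bash
   # Bundle the inductor cache directory produced in step 1.
   CACHE_DIR=/tmp/inductor_root_cache
   mkdir -p "$CACHE_DIR"   # stand-in for a cache populated by a prior run
   tar -czf /tmp/inductor_cache.tgz -C "$(dirname "$CACHE_DIR")" "$(basename "$CACHE_DIR")"

   # On the target machine (after copying the archive over, e.g. with scp):
   mkdir -p /tmp/target
   tar -xzf /tmp/inductor_cache.tgz -C /tmp/target
   # Then launch with the restored cache:
   #   TORCHINDUCTOR_CACHE_DIR=/tmp/target/inductor_root_cache \
   #     python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
   ```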