diff --git a/docs/advanced_features/attention_backend.md b/docs/advanced_features/attention_backend.md
index 68e4318d8..c00ae0ade 100644
--- a/docs/advanced_features/attention_backend.md
+++ b/docs/advanced_features/attention_backend.md
@@ -29,51 +29,93 @@ The "❌" and "✅" symbols in the table above under "Page Size > 1" indicate wh
 - FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40)
 ```bash
-python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend flashinfer
-python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-V3 --attention-backend flashinfer --trust-remote-code
+python3 -m sglang.launch_server \
+    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --attention-backend flashinfer
+python3 -m sglang.launch_server \
+    --tp 8 \
+    --model deepseek-ai/DeepSeek-V3 \
+    --attention-backend flashinfer \
+    --trust-remote-code
 ```
 - FlashAttention 3 (Default for Hopper Machines, e.g., H100, H200, H20)
 ```bash
-python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend fa3
-python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --attention-backend fa3
+python3 -m sglang.launch_server \
+    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --attention-backend fa3
+python3 -m sglang.launch_server \
+    --tp 8 \
+    --model deepseek-ai/DeepSeek-V3 \
+    --trust-remote-code \
+    --attention-backend fa3
 ```
 - Triton
 ```bash
-python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend triton
-python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-V3 --attention-backend triton --trust-remote-code
+python3 -m sglang.launch_server \
+    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --attention-backend triton
+python3 -m sglang.launch_server \
+    --tp 8 \
+    --model deepseek-ai/DeepSeek-V3 \
+    --attention-backend triton \
+    --trust-remote-code
 ```
 - Torch Native
 ```bash
-python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend torch_native
+python3 -m sglang.launch_server \
+    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --attention-backend torch_native
 ```
 - FlashMLA
 ```bash
-python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend flashmla --trust-remote-code
-python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend flashmla --kv-cache-dtype fp8_e4m3 --trust-remote-code
+python3 -m sglang.launch_server \
+    --tp 8 \
+    --model deepseek-ai/DeepSeek-R1 \
+    --attention-backend flashmla \
+    --trust-remote-code
+python3 -m sglang.launch_server \
+    --tp 8 \
+    --model deepseek-ai/DeepSeek-R1 \
+    --attention-backend flashmla \
+    --kv-cache-dtype fp8_e4m3 \
+    --trust-remote-code
 ```
 - TRTLLM MLA (Optimized for Blackwell Architecture, e.g., B200)
 ```bash
-python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend trtllm_mla --trust-remote-code
+python3 -m sglang.launch_server \
+    --tp 8 \
+    --model deepseek-ai/DeepSeek-R1 \
+    --attention-backend trtllm_mla \
+    --trust-remote-code
 ```
 - TRTLLM MLA with FP8 KV Cache (Higher concurrency, lower memory footprint)
 ```bash
-python3 -m sglang.launch_server --tp 8 --model deepseek-ai/DeepSeek-R1 --attention-backend trtllm_mla --kv-cache-dtype fp8_e4m3 --trust-remote-code
+python3 -m sglang.launch_server \
+    --tp 8 \
+    --model deepseek-ai/DeepSeek-R1 \
+    --attention-backend trtllm_mla \
+    --kv-cache-dtype fp8_e4m3 \
+    --trust-remote-code
 ```
 - Ascend
 ```bash
-python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend ascend
+python3 -m sglang.launch_server \
+    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --attention-backend ascend
 ```
 - Wave
 ```bash
-python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --attention-backend wave
+python3 -m sglang.launch_server \
+    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --attention-backend wave
 ```
 ## Steps to add a new attention backend
diff --git a/docs/advanced_features/pd_disaggregation.md b/docs/advanced_features/pd_disaggregation.md
index 85a5db07e..d33744713 100644
--- a/docs/advanced_features/pd_disaggregation.md
+++ b/docs/advanced_features/pd_disaggregation.md
@@ -34,22 +34,88 @@ uv pip install mooncake-transfer-engine
 ### Llama Single Node
 ```bash
-$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --disaggregation-ib-device mlx5_roce0
-$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1 --disaggregation-ib-device mlx5_roce0
-$ python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-3.1-8B-Instruct \
+    --disaggregation-mode prefill \
+    --disaggregation-ib-device mlx5_roce0
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-3.1-8B-Instruct \
+    --disaggregation-mode decode \
+    --port 30001 \
+    --base-gpu-id 1 \
+    --disaggregation-ib-device mlx5_roce0
+python -m sglang_router.launch_router \
+    --pd-disaggregation \
+    --prefill http://127.0.0.1:30000 \
+    --decode http://127.0.0.1:30001 \
+    --host 0.0.0.0 \
+    --port 8000
 ```
 ### DeepSeek Multi-Node
 ```bash
 # prefill 0
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device ${device_name} --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-ib-device ${device_name} \
+    --disaggregation-mode prefill \
+    --host ${local_ip} \
+    --port 30000 \
+    --trust-remote-code \
+    --dist-init-addr ${prefill_master_ip}:5000 \
+    --nnodes 2 \
+    --node-rank 0 \
+    --tp-size 16 \
+    --dp-size 8 \
+    --enable-dp-attention \
+    --moe-a2a-backend deepep \
+    --mem-fraction-static 0.8
 # prefill 1
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device ${device_name} --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-ib-device ${device_name} \
+    --disaggregation-mode prefill \
+    --host ${local_ip} \
+    --port 30000 \
+    --trust-remote-code \
+    --dist-init-addr ${prefill_master_ip}:5000 \
+    --nnodes 2 \
+    --node-rank 1 \
+    --tp-size 16 \
+    --dp-size 8 \
+    --enable-dp-attention \
+    --moe-a2a-backend deepep \
+    --mem-fraction-static 0.8
 # decode 0
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device ${device_name} --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8 --max-running-requests 128
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-ib-device ${device_name} \
+    --disaggregation-mode decode \
+    --host ${local_ip} \
+    --port 30001 \
+    --trust-remote-code \
+    --dist-init-addr ${decode_master_ip}:5000 \
+    --nnodes 2 \
+    --node-rank 0 \
+    --tp-size 16 \
+    --dp-size 8 \
+    --enable-dp-attention \
+    --moe-a2a-backend deepep \
+    --mem-fraction-static 0.8 \
+    --max-running-requests 128
 # decode 1
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device ${device_name} --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8 --max-running-requests 128
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-ib-device ${device_name} \
+    --disaggregation-mode decode \
+    --host ${local_ip} \
+    --port 30001 \
+    --trust-remote-code \
+    --dist-init-addr ${decode_master_ip}:5000 \
+    --nnodes 2 \
+    --node-rank 1 \
+    --tp-size 16 \
+    --dp-size 8 \
+    --enable-dp-attention \
+    --moe-a2a-backend deepep \
+    --mem-fraction-static 0.8 \
+    --max-running-requests 128
 ```
 ### Advanced Configuration
@@ -98,22 +164,88 @@ pip install . --config-settings=setup-args="-Ducx_path=/path/to/ucx"
 ### Llama Single Node
 ```bash
-$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --disaggregation-transfer-backend nixl
-$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1 --disaggregation-transfer-backend nixl
-$ python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-3.1-8B-Instruct \
+    --disaggregation-mode prefill \
+    --disaggregation-transfer-backend nixl
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-3.1-8B-Instruct \
+    --disaggregation-mode decode \
+    --port 30001 \
+    --base-gpu-id 1 \
+    --disaggregation-transfer-backend nixl
+python -m sglang_router.launch_router \
+    --pd-disaggregation \
+    --prefill http://127.0.0.1:30000 \
+    --decode http://127.0.0.1:30001 \
+    --host 0.0.0.0 \
+    --port 8000
 ```
 ### DeepSeek Multi-Node
 ```bash
 # prefill 0
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend nixl --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-transfer-backend nixl \
+    --disaggregation-mode prefill \
+    --host ${local_ip} \
+    --port 30000 \
+    --trust-remote-code \
+    --dist-init-addr ${prefill_master_ip}:5000 \
+    --nnodes 2 \
+    --node-rank 0 \
+    --tp-size 16 \
+    --dp-size 8 \
+    --enable-dp-attention \
+    --moe-a2a-backend deepep \
+    --mem-fraction-static 0.8
 # prefill 1
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend nixl --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-transfer-backend nixl \
+    --disaggregation-mode prefill \
+    --host ${local_ip} \
+    --port 30000 \
+    --trust-remote-code \
+    --dist-init-addr ${prefill_master_ip}:5000 \
+    --nnodes 2 \
+    --node-rank 1 \
+    --tp-size 16 \
+    --dp-size 8 \
+    --enable-dp-attention \
+    --moe-a2a-backend deepep \
+    --mem-fraction-static 0.8
 # decode 0
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend nixl --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8 --max-running-requests 128
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-transfer-backend nixl \
+    --disaggregation-mode decode \
+    --host ${local_ip} \
+    --port 30001 \
+    --trust-remote-code \
+    --dist-init-addr ${decode_master_ip}:5000 \
+    --nnodes 2 \
+    --node-rank 0 \
+    --tp-size 16 \
+    --dp-size 8 \
+    --enable-dp-attention \
+    --moe-a2a-backend deepep \
+    --mem-fraction-static 0.8 \
+    --max-running-requests 128
 # decode 1
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend nixl --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep --mem-fraction-static 0.8 --max-running-requests 128
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-transfer-backend nixl \
+    --disaggregation-mode decode \
+    --host ${local_ip} \
+    --port 30001 \
+    --trust-remote-code \
+    --dist-init-addr ${decode_master_ip}:5000 \
+    --nnodes 2 \
+    --node-rank 1 \
+    --tp-size 16 \
+    --dp-size 8 \
+    --enable-dp-attention \
+    --moe-a2a-backend deepep \
+    --mem-fraction-static 0.8 \
+    --max-running-requests 128
 ```
 ## ASCEND
@@ -135,16 +267,44 @@ export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
 ### Llama Single Node
 ```bash
-$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --disaggregation-transfer-backend ascend
-$ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1 --disaggregation-transfer-backend ascend
-$ python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-3.1-8B-Instruct \
+    --disaggregation-mode prefill \
+    --disaggregation-transfer-backend ascend
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-3.1-8B-Instruct \
+    --disaggregation-mode decode \
+    --port 30001 \
+    --base-gpu-id 1 \
+    --disaggregation-transfer-backend ascend
+python -m sglang_router.launch_router \
+    --pd-disaggregation \
+    --prefill http://127.0.0.1:30000 \
+    --decode http://127.0.0.1:30001 \
+    --host 0.0.0.0 \
+    --port 8000
 ```
 ### DeepSeek Multi-Node
 ```bash
 # prefill 0
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend ascend --disaggregation-mode prefill --host ${local_ip} --port 30000 --trust-remote-code --dist-init-addr ${prefill_master_ip}:5000 --nnodes 1 --node-rank 0 --tp-size 16
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-transfer-backend ascend \
+    --disaggregation-mode prefill \
+    --host ${local_ip} \
+    --port 30000 \
+    --trust-remote-code \
+    --dist-init-addr ${prefill_master_ip}:5000 \
+    --nnodes 1 \
+    --node-rank 0 \
+    --tp-size 16
 # decode 0
-$ python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --disaggregation-transfer-backend ascend --disaggregation-mode decode --host ${local_ip} --port 30001 --trust-remote-code --dist-init-addr ${decode_master_ip}:5000 --nnodes 1 --node-rank 0 --tp-size 16
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --disaggregation-transfer-backend ascend \
+    --disaggregation-mode decode \
+    --host ${local_ip} \
+    --port 30001 \
+    --trust-remote-code \
+    --dist-init-addr ${decode_master_ip}:5000 \
+    --nnodes 1 \
+    --node-rank 0 \
+    --tp-size 16
 ```
diff --git a/docs/advanced_features/server_arguments.md b/docs/advanced_features/server_arguments.md
index 873fa8b05..d3425eecf 100644
--- a/docs/advanced_features/server_arguments.md
+++ b/docs/advanced_features/server_arguments.md
@@ -43,10 +43,20 @@ You can find all arguments by `python3 -m sglang.launch_server --help`
 ```bash
 # Node 0
-  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0
+  python -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
+    --tp 4 \
+    --dist-init-addr sgl-dev-0:50000 \
+    --nnodes 2 \
+    --node-rank 0
 # Node 1
-  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
+  python -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
+    --tp 4 \
+    --dist-init-addr sgl-dev-0:50000 \
+    --nnodes 2 \
+    --node-rank 1
 ```
 Please consult the documentation below and [server_args.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py) to learn more about the arguments you may provide when launching a server.
diff --git a/docs/basic_usage/deepseek.md b/docs/basic_usage/deepseek.md
index 8a71696f5..0f397b3cd 100644
--- a/docs/basic_usage/deepseek.md
+++ b/docs/basic_usage/deepseek.md
@@ -158,7 +158,14 @@ The precompilation process typically takes around 10 minutes to complete.
 **Usage**: Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature.
 For example:
 ```
-python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 --trust-remote-code --tp 8
+python3 -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3-0324 \
+    --speculative-algorithm EAGLE \
+    --speculative-num-steps 1 \
+    --speculative-eagle-topk 1 \
+    --speculative-num-draft-tokens 2 \
+    --trust-remote-code \
+    --tp 8
 ```
 - The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with the [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for a given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve a speedup for larger batch sizes.
 - The FlashAttention3, FlashMLA, and Triton backends fully support MTP. For the FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding, the `--speculative-eagle-topk` parameter should be set to `1`. MTP support for the CutlassMLA and TRTLLM MLA backends is still under development.
@@ -177,7 +184,14 @@ See [Reasoning Parser](https://docs.sglang.ai/advanced_features/separate_reasoni
 Add arguments `--tool-call-parser deepseekv3` and `--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja` (recommended) to enable this feature. For example (running on 1 * H20 node):
 ```
-python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-0324 --tp 8 --port 30000 --host 0.0.0.0 --mem-fraction-static 0.9 --tool-call-parser deepseekv3 --chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja
+python3 -m sglang.launch_server \
+    --model deepseek-ai/DeepSeek-V3-0324 \
+    --tp 8 \
+    --port 30000 \
+    --host 0.0.0.0 \
+    --mem-fraction-static 0.9 \
+    --tool-call-parser deepseekv3 \
+    --chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja
 ```
 Sample Request:
diff --git a/docs/basic_usage/gpt_oss.md b/docs/basic_usage/gpt_oss.md
index 798fd678d..d1af32f5f 100644
--- a/docs/basic_usage/gpt_oss.md
+++ b/docs/basic_usage/gpt_oss.md
@@ -43,7 +43,12 @@ export PYTHON_EXECUTION_BACKEND=UV
 Launch the server with the demo tool server:
-`python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --tool-server demo --tp 2`
+```bash
+python3 -m sglang.launch_server \
+    --model-path openai/gpt-oss-120b \
+    --tool-server demo \
+    --tp 2
+```
 For production usage, SGLang can act as an MCP client for multiple services. An [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) is provided.
 Start the servers and point SGLang to them:
 ```bash
diff --git a/docs/basic_usage/llama4.md b/docs/basic_usage/llama4.md
index 07cc2b737..cdc62864a 100644
--- a/docs/basic_usage/llama4.md
+++ b/docs/basic_usage/llama4.md
@@ -11,7 +11,10 @@ Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-projec
 To serve Llama 4 models on 8xH100/H200 GPUs:
 ```bash
-python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --tp 8 --context-length 1000000
+python3 -m sglang.launch_server \
+    --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
+    --tp 8 \
+    --context-length 1000000
 ```
 ### Configuration Tips
@@ -29,7 +32,16 @@ python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-In
 **Usage**: Add arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
 ```
-python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct --speculative-algorithm EAGLE3 --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --trust-remote-code --tp 8 --context-length 1000000
+python3 -m sglang.launch_server \
+    --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
+    --speculative-algorithm EAGLE3 \
+    --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
+    --speculative-num-steps 3 \
+    --speculative-eagle-topk 1 \
+    --speculative-num-draft-tokens 4 \
+    --trust-remote-code \
+    --tp 8 \
+    --context-length 1000000
 ```
 - **Note**: The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* can only recognize conversations in chat mode.
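To make the chat-mode requirement above concrete, a minimal chat-style request against SGLang's OpenAI-compatible endpoint might look like the sketch below. This is illustrative only and not part of the patch; the default port 30000 and the sample payload are assumptions.

```bash
# Illustrative chat-mode request; adjust host/port to match your launch command.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```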
@@ -50,11 +62,21 @@ Commands:
 ```bash
 # Llama-4-Scout-17B-16E-Instruct model
-python -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 30000 --tp 8 --mem-fraction-static 0.8 --context-length 65536
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
+    --port 30000 \
+    --tp 8 \
+    --mem-fraction-static 0.8 \
+    --context-length 65536
 lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
 # Llama-4-Maverick-17B-128E-Instruct
-python -m sglang.launch_server --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct --port 30000 --tp 8 --mem-fraction-static 0.8 --context-length 65536
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
+    --port 30000 \
+    --tp 8 \
+    --mem-fraction-static 0.8 \
+    --context-length 65536
 lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
 ```
diff --git a/docs/basic_usage/qwen3.md b/docs/basic_usage/qwen3.md
index 89295be60..c68a304b0 100644
--- a/docs/basic_usage/qwen3.md
+++ b/docs/basic_usage/qwen3.md
@@ -21,7 +21,13 @@ python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4
 Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
 ```bash
-python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-algo NEXTN
+python3 -m sglang.launch_server \
+    --model Qwen/Qwen3-Next-80B-A3B-Instruct \
+    --tp 4 \
+    --speculative-num-steps 3 \
+    --speculative-eagle-topk 1 \
+    --speculative-num-draft-tokens 4 \
+    --speculative-algo NEXTN
 ```
 Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/10233).
diff --git a/docs/basic_usage/sampling_params.md b/docs/basic_usage/sampling_params.md
index c1394a9fd..7e8db3a16 100644
--- a/docs/basic_usage/sampling_params.md
+++ b/docs/basic_usage/sampling_params.md
@@ -258,7 +258,10 @@ Detailed example in [structured outputs](../advanced_features/structured_outputs
 Launch a server with the `--enable-custom-logit-processor` flag on.
 ```bash
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --enable-custom-logit-processor
+python -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
+    --port 30000 \
+    --enable-custom-logit-processor
 ```
 Define a custom logit processor that will always sample a specific token id.
diff --git a/docs/references/custom_chat_template.md b/docs/references/custom_chat_template.md
index 557af5bf5..f22ee8bec 100644
--- a/docs/references/custom_chat_template.md
+++ b/docs/references/custom_chat_template.md
@@ -8,7 +8,10 @@ It should just work for most official models such as Llama-2/Llama-3.
 If needed, you can also override the chat template when launching the server:
 ```bash
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-2-7b-chat-hf \
+    --port 30000 \
+    --chat-template llama-2
 ```
 If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.
@@ -30,7 +33,10 @@ You can load the JSON format, which is defined by `conversation.py`.
 ```
 ```bash
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-2-7b-chat-hf \
+    --port 30000 \
+    --chat-template ./my_model_template.json
 ```
 ## Jinja Format
@@ -38,5 +44,8 @@ You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers.
 ```bash
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
+python -m sglang.launch_server \
+    --model-path meta-llama/Llama-2-7b-chat-hf \
+    --port 30000 \
+    --chat-template ./my_model_template.jinja
 ```
diff --git a/docs/references/multi_node_deployment/multi_node.md b/docs/references/multi_node_deployment/multi_node.md
index 204b60586..28bc2a821 100644
--- a/docs/references/multi_node_deployment/multi_node.md
+++ b/docs/references/multi_node_deployment/multi_node.md
@@ -7,9 +7,19 @@
 ```bash
 # replace 172.16.4.52:20000 with your own node IP address and port of the first node
-python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0
+python3 -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3.1-405B-Instruct \
+    --tp 16 \
+    --dist-init-addr 172.16.4.52:20000 \
+    --nnodes 2 \
+    --node-rank 0
-python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1
+python3 -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3.1-405B-Instruct \
+    --tp 16 \
+    --dist-init-addr 172.16.4.52:20000 \
+    --nnodes 2 \
+    --node-rank 1
 ```
 Note that Llama 405B (fp8) can also be launched on a single node.
diff --git a/docs/references/production_metrics.md b/docs/references/production_metrics.md
index 16afaca67..85a6ff8a6 100644
--- a/docs/references/production_metrics.md
+++ b/docs/references/production_metrics.md
@@ -139,7 +139,10 @@ This section describes how to set up the monitoring stack (Prometheus + Grafana)
 1. **Start your SGLang server with metrics enabled:**
    ```bash
-   python -m sglang.launch_server --model-path <model_path> --port 30000 --enable-metrics
+   python -m sglang.launch_server \
+     --model-path <model_path> \
+     --port 30000 \
+     --enable-metrics
    ```
    Replace `<model_path>` with the actual path to your model (e.g., `meta-llama/Meta-Llama-3.1-8B-Instruct`). Ensure the server is accessible from the monitoring stack (you might need `--host 0.0.0.0` if running in Docker). By default, the metrics endpoint will be available at `http://<host>:30000/metrics`.
@@ -212,6 +215,17 @@ You can customize the setup by modifying these files. For instance, you might ne
 #### Check if the metrics are being collected
-Run `python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5` to generate some requests.
+Run the following command to generate some requests:
+```bash
+python3 -m sglang.bench_serving \
+    --backend sglang \
+    --dataset-name random \
+    --num-prompts 3000 \
+    --random-input 1024 \
+    --random-output 1024 \
+    --random-range-ratio 0.5
+```
 Then you should be able to see the metrics in the Grafana dashboard.
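Beyond the dashboard, it can also be worth confirming that the endpoint itself is emitting data. A minimal sanity check might look like the sketch below, assuming the server from step 1 above is listening on localhost:30000:

```bash
# Fetch the raw Prometheus metrics directly from the server
# (metric names are typically prefixed with "sglang:").
curl -s http://localhost:30000/metrics | head -n 20
```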