Revert "fix some typos" (#6244)

This commit is contained in:
Lianmin Zheng
2025-05-12 12:53:26 -07:00
committed by GitHub
parent bad7c26fdc
commit e8e18dcdcc
95 changed files with 276 additions and 276 deletions

View File

@@ -69,7 +69,7 @@
This command sets the number of prompts to 2 with `--num-prompts` argument and limits the length of output sequences to 100 with `--sharegpt-output-len` argument, which can generate a small trace file for browser to open smoothly.
Additionally, if you want to locate the SGLang Python source code through the CUDA kernel in Trace, you need to disable CUDA Graph when starting the service. This can be done by using the `--disable-cuda-graph` parameter in the command to start the service.
Additionally, if you want to locate the SGLang Python source code through the cuda kernel in Trace, you need to disable CUDA Graph when starting the service. This can be done by using the `--disable-cuda-graph` parameter in the command to start the service.
## Profile with Nsight

View File

@@ -35,7 +35,7 @@ SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unitt
## Writing Documentation & Running Docs CI
We recommend new contributors start from writing documentation, which helps you quickly understand the SGLang codebase. For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
We recommend new contributors start from writing documentation, which helps you quickly understand SGLang codebase. For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
## Tips for Newcomers

View File

@@ -69,13 +69,13 @@ If you encounter errors when starting the server, ensure the weights have finish
The DeepSeek series have huge model weights, it takes some time to compile the model with `torch.compile` for the first time if you have added the flag `--enable-torch-compile`. You can refer [here](https://docs.sglang.ai/backend/hyperparameter_tuning.html#try-advanced-options) to optimize the caching of compilation results, so that the cache can be used to speed up the next startup.
### Launch with one node of 8 x H200
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8**. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
### Running examples on Multi-node
- [Serving with two H20*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes).
- [Serving with two H200*8 nodes and Docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).
- [Serving with two H200*8 nodes and docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).
- [Serving with four A100*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes).
@@ -89,13 +89,13 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be
- **MLA Attention Backends**: Currently SGLang supports different optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/mla.html), [FlashMLA](https://github.com/deepseek-ai/FlashMLA), [CutlassMLA](https://github.com/sgl-project/sglang/pull/5390), and [Triton](https://github.com/triton-lang/triton) backends. The default FA3 provides good performance across wide workloads.
- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented the Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
- **CUDA Graph & torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.
- **CUDA Graph & Torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and Torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.
- **Chunked Prefix Cache**: Chunked prefix cache optimization can increase throughput by cutting prefix cache into chunks, processing them with multi-head attention and merging their states. Its improvement can be significant when doing chunked prefill on long sequences. Currently this optimization is only available for the FlashAttention3 backend.
- **Chunked Prefix Cache**: Chunked prefix cache optimization can increase throughput by cutting prefix cache into chunks, processing them with multi-head attention and merging their states. Its improvement can be significant when doing chunked prefill on long sequences. Currently this optimization is only available for FlashAttention3 backend.
Overall, with these optimizations, we have achieved up to a **7x** acceleration in output throughput compared to the previous version.
Overall, with these optimizations, we have achieved up to **7x** acceleration in output throughput compared to the previous version.
<p align="center">
<img src="https://lmsys.org/images/blog/sglang_v0_3/deepseek_mla.svg" alt="Multi-head Latent Attention for DeepSeek Series Models">
@@ -113,7 +113,7 @@ Overall, with these optimizations, we have achieved up to a **7x** acceleration
<img src="https://lmsys.org/images/blog/sglang_v0_4/dp_attention.svg" alt="Data Parallelism Attention for DeepSeek Series Models">
</p>
With data parallelism attention enabled, we have achieved up to a **1.9x** decoding throughput improvement compared to the previous version.
With data parallelism attention enabled, we have achieved up to **1.9x** decoding throughput improvement compared to the previous version.
<p align="center">
<img src="https://lmsys.org/images/blog/sglang_v0_4/deepseek_coder_v2.svg" alt="Data Parallelism Attention Performance Comparison">
@@ -150,7 +150,7 @@ python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --tru
The precompilation process typically takes around 10 minutes to complete.
### Multi-token Prediction
**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively with H200 TP8 setting.
**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively on H200 TP8 setting.
**Usage**:
Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
@@ -161,7 +161,7 @@ python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --spec
- FlashAttention3 and Triton backend fully supports MTP usage. For FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding,`--speculative-eagle-topk` parameter should be set to `1`. MTP support for the FlashMLA backend and CutlassMLA backend is still under development.
- To enable DeepSeek MTP for large batch sizes (>32), there are some parameters should be changed (Reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
- Adjust `--max-running-requests` to a larger number. The default value is `32` for MTP. For larger batch sizes, you should increase this value beyond the default value.
- Set `--cuda-graph-bs`. It is a list of batch sizes for CUDA graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.
- Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.
### Reasoning Content for DeepSeek R1
@@ -209,7 +209,7 @@ data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\"}"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":null}}], "finish_reason": "tool_calls"}
data: [DONE]
```
The client needs to concatenate all fragments of the tool call arguments to reconstruct the complete tool call:
The client needs to concatenate all arguments fragments to reconstruct the complete tool call:
```
{"city": "Qingdao"}
```
@@ -223,4 +223,4 @@ Important Notes:
1. **Question**: What should I do if model loading takes too long and NCCL timeout occurs?
**Answer**: You can try to add `--dist-timeout 3600` when launching the model, this allows for a 1-hour timeout.
**Answer**: You can try to add `--dist-timeout 3600` when launching the model, this allows for 1-hour timeout.

View File

@@ -330,7 +330,7 @@ This should resolve most NCCL-related issues.
## Remaining issues
* In Kubernetes, Docker, or Containerd environments, we use hostNetwork to prevent performance degradation.
* We utilize privileged mode, which isn't secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.
* We utilize privileged mode, which isnt secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.
## TODO

View File

@@ -25,4 +25,4 @@ docker run --gpus all \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```
Note that ModelScope uses a different cache directory than HuggingFace. You may need to set it manually to avoid running out of disk space.
Note that modelscope uses a different cache directory than huggingface. You may need to set it manually to avoid running out of disk space.