If you frequently see `token usage < 0.9` and `#queue-req > 0`, it means the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.3.
This over-conservatism typically happens when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS or stop strings.
On the other hand, if you frequently see warnings like `KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
If you see `KV cache pool is full. Retract requests.` occasionally but not frequently, it is okay.
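A minimal launch sketch (the model path is a placeholder; `--schedule-conservativeness` is the only flag of interest here):

```bash
# Hypothetical model path; substitute your own.
# Lower values (e.g. 0.3) admit queued requests more aggressively;
# higher values (e.g. 1.3) make retractions less likely.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --schedule-conservativeness 0.3
```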
To support higher concurrency, you should maximize the KV cache pool capacity by setting `--mem-fraction-static` as high as possible while still reserving enough memory for activations and CUDA graph buffers.
SGLang uses simple heuristics to set the default value of `--mem-fraction-static`, but you can optimize it for your use cases.
As a rule of thumb, reserving 5–8 GB of memory for activations is typically sufficient. You can check this by inspecting the logs just before the server is ready.
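As a sketch (the fraction shown is illustrative; the right value depends on your GPU, model, and workload), you might raise the fraction above the heuristic default and watch the logs for OOM:

```bash
# Illustrative value only: raise --mem-fraction-static gradually,
# keeping roughly 5-8 GB free for activations and CUDA graph buffers.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.9
```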
If you encounter out-of-memory (OOM) errors, you can tune the following parameters (a combined example follows the list):

- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down prefill for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache pool and helps prevent OOM errors during both prefill and decoding, but it also limits maximum concurrency and reduces peak throughput.
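A hedged example combining the three knobs above (all values are illustrative starting points, not universal recommendations):

```bash
# Sketch: conservative settings to avoid OOM. Smaller values trade
# peak throughput for memory headroom; tune each for your hardware.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```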
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism. Use the [sglang router](../advanced_features/router.md) for data parallelism instead of the `dp_size` parameter.
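A sketch of co-launching data-parallel workers behind the router (assumes the `sglang-router` package is installed; the module path and flags may differ across versions):

```bash
# Assumption: `pip install sglang-router`. This starts one router
# endpoint in front of two data-parallel workers.
python3 -m sglang_router.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 2
```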
- If the workload has many shared prefixes, try `--schedule-policy lpm`. Here, `lpm` stands for longest prefix match. It reorders requests to encourage more cache hits but introduces more scheduling overhead.
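For example:

```bash
# `lpm` reorders the waiting queue by longest matched prefix to
# increase prefix-cache hits, at the cost of scheduling overhead.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --schedule-policy lpm
```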