From 30ceccc74a5dffeaa834257146fb68a0ae7a6681 Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Sun, 22 Jun 2025 22:42:55 -0700
Subject: [PATCH] Update hyperparameter_tuning.md (#7454)

---
 docs/backend/hyperparameter_tuning.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/backend/hyperparameter_tuning.md b/docs/backend/hyperparameter_tuning.md
index fc380103a..3dde43881 100644
--- a/docs/backend/hyperparameter_tuning.md
+++ b/docs/backend/hyperparameter_tuning.md
@@ -11,15 +11,15 @@ When the server is running at full load in a steady state, look for the followin
 
 `#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req: 0`, it suggests that your client code is submitting requests too slowly.
-A healthy range for `#queue-req` is `100 - 1000`.
+A healthy range for `#queue-req` is `100 - 2000`.
 However, avoid making `#queue-req` too large, as this will increase the scheduling overhead on the server.
 
-### Tune `--schedule-conservativeness` to achieve a high `token usage`.
+### Achieve a high `token usage`
 
 `token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization.
 If you frequently see `token usage < 0.9` and `#queue-req > 0`, it means the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.3.
-The case of server being too conservative can happen when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS or stop strings.
+The case of a server being too conservative can happen when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS or stop strings.
 
 On the other hand, if you see `token usage` very high and you frequently see warnings like
 `KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
@@ -36,7 +36,7 @@ for activations and CUDA graph buffers.
 
 A simple strategy is to increase `--mem-fraction-static` by 0.01 each time until you encounter out-of-memory errors.
 
-## Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests`
+### Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests`
 
 If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:
@@ -57,5 +57,6 @@ Data parallelism is better for throughput. When there is enough GPU memory, alwa
 ### Try other options
 
 - `torch.compile` accelerates small models on small batch sizes. You can enable it with `--enable-torch-compile`.
-- Try other quantization (e.g. FP8 quantizatioin) or other parallelism strategies (e.g. expert parallelism)
+- Try other quantization (e.g. FP8 quantization with `--quantization fp8`)
+- Try other parallelism strategies (e.g. expert parallelism) or DP attention for deepseek models (with `--enable-dp-attention --dp-size 8`).
 - If the workload has many shared prefixes, try `--schedule-policy lpm`. Here, `lpm` stands for longest prefix match. It reorders requests to encourage more cache hits but introduces more scheduling overhead.
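
For context, a minimal sketch of a launch command that combines the flags this patch documents. It assumes the `sglang` package is installed; the model path is a placeholder and every value is workload-dependent, not a recommendation from the patch itself:

```bash
# Hypothetical example: launch an SGLang server with the tuning knobs
# covered in docs/backend/hyperparameter_tuning.md.
# The model path is a placeholder; tune each value against your own workload.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.88 \
  --chunked-prefill-size 4096 \
  --max-running-requests 256 \
  --schedule-conservativeness 0.3 \
  --schedule-policy lpm

# Per the doc: if you frequently see "KV cache pool is full. Retract requests."
# warnings, increase conservativeness instead, e.g. --schedule-conservativeness 1.3,
# and raise --mem-fraction-static in 0.01 steps until you hit OOM errors.
```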