Improve TTFT when using Mooncake (#6125)
### What this PR does / why we need it?

Improve the performance of Mooncake by changing the log level from info to debug.

### ENV

2P + 4D, EP

1. Benchmark script

```
evalscope perf \
    --parallel 512 \
    --number 1024 \
    --model deepseek \
    --url http://localhost:9000/v1/chat/completions \
    --api openai \
    --dataset random \
    --max-tokens 2 \
    --min-tokens 2 \
    --prefix-length 0 \
    --min-prompt-length 512 \
    --max-prompt-length 512 \
    --tokenizer-path /tmp/DeepSeek-v3-0324-w8a8-0814 \
    --extra-args '{"ignore_eos": true}' \
    --rate 2
```

2. Before patch

```
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   209.484 |
+-----------------------------------+-----------+
| Number of concurrency             |       512 |
+-----------------------------------+-----------+
| Request rate (req/s)              |         6 |
+-----------------------------------+-----------+
| Total requests                    |      1024 |
+-----------------------------------+-----------+
| Succeed requests                  |      1022 |
+-----------------------------------+-----------+
| Failed requests                   |         2 |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |    9.7573 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |   2507.62 |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    4.8786 |
+-----------------------------------+-----------+
| Average latency (s)               |    7.0561 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    5.7444 |
+-----------------------------------+-----------+
| Average time per output token (s) |    1.3117 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    1.3117 |
+-----------------------------------+-----------+
| Average input tokens per request  |       512 |
+-----------------------------------+-----------+
| Average output tokens per request |         2 |
+-----------------------------------+-----------+
2026-01-22 14:56:32 - evalscope - INFO: Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10%         |   0.6062 |  0.5113 |   0.5113 |       1.234 |          512 |             2 |         0.0888 |       22.8338 |
| 25%         |   0.7248 |  0.5639 |   0.5639 |      1.4114 |          512 |             2 |            0.2 |       51.3919 |
| 50%         |   0.9092 |  0.7748 |   0.7748 |      1.6767 |          512 |             2 |         1.1935 |      306.7171 |
| 66%         |   1.0745 |  1.0345 |   1.0345 |      3.1308 |          512 |             2 |         1.3395 |      344.2495 |
| 75%         |   7.0812 |  1.5389 |   1.5389 |     10.0016 |          512 |             2 |          1.417 |      364.1808 |
| 80%         |  10.6944 |  1.8552 |   1.8552 |     13.3717 |          512 |             2 |         1.4778 |      379.7911 |
| 90%         |  19.2342 |  2.4325 |   2.4326 |     22.5105 |          512 |             2 |         1.6208 |      416.5381 |
| 95%         |  24.4399 |  2.8289 |   2.8289 |     26.0329 |          512 |             2 |         1.7548 |      450.9942 |
| 98%         |  45.0941 |  3.4098 |   3.4098 |     45.6287 |          512 |             2 |         1.8193 |      467.5476 |
| 99%         |  46.2786 |  3.8492 |   3.8492 |     46.9282 |          512 |             2 |         1.8576 |      477.4157 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```

3. After patch

```
Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   191.613 |
+-----------------------------------+-----------+
| Number of concurrency             |       512 |
+-----------------------------------+-----------+
| Request rate (req/s)              |         6 |
+-----------------------------------+-----------+
| Total requests                    |      1024 |
+-----------------------------------+-----------+
| Succeed requests                  |      1024 |
+-----------------------------------+-----------+
| Failed requests                   |         0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   10.6882 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |   2746.87 |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    5.3441 |
+-----------------------------------+-----------+
| Average latency (s)               |    2.0407 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.7989 |
+-----------------------------------+-----------+
| Average time per output token (s) |    1.2419 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    1.2419 |
+-----------------------------------+-----------+
| Average input tokens per request  |       512 |
+-----------------------------------+-----------+
| Average output tokens per request |         2 |
+-----------------------------------+-----------+
2026-01-22 15:10:31 - evalscope - INFO: Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10%         |   0.5727 |  0.5051 |   0.5051 |      1.1761 |          512 |             2 |         1.0368 |      266.4696 |
| 25%         |   0.6497 |  0.5324 |   0.5324 |      1.3159 |          512 |             2 |         1.1763 |      302.3184 |
| 50%         |   0.7767 |  0.6908 |   0.6908 |      1.4793 |          512 |             2 |         1.3521 |      347.4944 |
| 66%         |   0.8711 |  0.7912 |   0.7912 |      1.5916 |          512 |             2 |         1.4518 |      373.1092 |
| 75%         |   0.9125 |  0.8797 |   0.8797 |      1.7008 |          512 |             2 |          1.521 |      390.9018 |
| 80%         |   0.9381 |  0.9442 |   0.9442 |      1.7657 |          512 |             2 |         1.5749 |      404.7606 |
| 90%         |    0.994 |  1.0818 |   1.0818 |      1.9289 |          512 |             2 |         1.7006 |      437.0518 |
| 95%         |   1.0369 |  1.2454 |   1.2454 |      2.2154 |          512 |             2 |         1.7937 |      460.9731 |
| 98%         |   1.1237 | 18.8814 |  18.8814 |     19.4607 |          512 |             2 |         1.8755 |      482.0097 |
| 99%         |   1.6752 | 24.4406 |  24.4406 |     25.4734 |          512 |             2 |          1.907 |      490.0993 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```

---------

Signed-off-by: xleoken <xleoken@163.com>
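
To put the two result tables side by side, the relative change can be computed directly from the reported figures. The sketch below only copies numbers from the "before patch" and "after patch" tables above; the dictionary keys and the `reduction_pct` helper are illustrative names, and the result reflects the single benchmark run reported here.

```python
# Values copied from the "before patch" and "after patch" tables above.
# The key names and the helper are illustrative, not part of evalscope.
before = {"avg_ttft_s": 5.7444, "p99_ttft_s": 46.2786, "avg_latency_s": 7.0561}
after = {"avg_ttft_s": 0.7989, "p99_ttft_s": 1.6752, "avg_latency_s": 2.0407}


def reduction_pct(old: float, new: float) -> float:
    """Return how much lower `new` is than `old`, in percent."""
    return (old - new) / old * 100.0


# Relative reduction per metric, rounded to one decimal place.
summary = {key: round(reduction_pct(before[key], after[key]), 1) for key in before}

for key, value in summary.items():
    print(f"{key}: {value}% lower after the patch")
```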
@@ -187,7 +187,7 @@ class KVCacheStoreSendingThread(KVTransferThread):
             ends = ends[skip_block_num:]
             keys = keys[skip_block_num:]

-            logger.info(
+            logger.debug(
                 "Storing KV cache for %d out of %d blocks (skip_block_num=%d) for request %s",
                 len(keys),
                 token_len // self.block_size,
@@ -91,7 +91,7 @@ class KVPoolScheduler:
         else:
             need_to_allocate = num_external_hit_tokens - num_computed_tokens

-        logger.info(
+        logger.debug(
             "Reqid: %s, Total tokens %d, kvpool hit tokens: %d, need to load: %d",
             request.request_id,
             request.num_tokens,
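
The effect of switching these calls from `logger.info` to `logger.debug` can be illustrated with the standard `logging` module. This is a minimal sketch, assuming the service runs with the logger's effective level at INFO; the logger name `kv_pool_demo`, the `CountingRequestId` class, and the sample values are hypothetical and not taken from the patched code. With %-style arguments, a suppressed `debug` call neither formats the message nor writes to any handler.

```python
import io
import logging

# Hypothetical logger and in-memory handler, for illustration only.
logger = logging.getLogger("kv_pool_demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False
logger.setLevel(logging.INFO)  # assumed production level


class CountingRequestId:
    """Counts how many times it is formatted into a log message."""

    format_calls = 0

    def __str__(self) -> str:
        CountingRequestId.format_calls += 1
        return "req-0"


req_id = CountingRequestId()
message = "Storing KV cache for %d out of %d blocks for request %s"

# DEBUG is below the logger's level: no record is created, no formatting runs.
logger.debug(message, 3, 4, req_id)
debug_output = stream.getvalue()
debug_format_calls = CountingRequestId.format_calls

# INFO is enabled: the record is formatted and written by the handler.
logger.info(message, 3, 4, req_id)
info_output = stream.getvalue()
info_format_calls = CountingRequestId.format_calls
```

Passing the arguments separately, as the patched calls already do, is what lets the logging module skip the string formatting when the level is disabled; an f-string would be built before the call regardless of level.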