xleoken
77b43492ae
Improve TTFT when using Mooncake (#6125)
### What this PR does / why we need it?
Improve Mooncake performance by changing the log level of its messages from
info to debug.
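A minimal sketch of the kind of change described above, using Python's standard `logging` module. The logger name `kv_transfer_demo` and the function `record_transfer` are hypothetical and are not taken from the Mooncake connector code. The sketch only shows that a message demoted to debug produces no output while the logger runs at INFO, which is presumably where the saving on the request path comes from.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("kv_transfer_demo")
logger.propagate = False  # keep demo output away from the root logger


def record_transfer(request_id: str, num_blocks: int) -> None:
    """Stand-in for a per-request code path that used to log at INFO."""
    # Before a change like this one, the call would be logger.info(...),
    # emitting one record per request at the default INFO level.
    # Lazy %-style arguments are only formatted if the record is emitted.
    logger.debug("transfer done: request=%s blocks=%d", request_id, num_blocks)


def count_emitted_lines(level: int, calls: int = 1000) -> int:
    """Run the hot path `calls` times and count the log lines written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        for i in range(calls):
            record_transfer(f"req-{i}", 4)
    finally:
        logger.removeHandler(handler)
    return len(stream.getvalue().splitlines())


if __name__ == "__main__":
    print("lines at INFO :", count_emitted_lines(logging.INFO))
    print("lines at DEBUG:", count_emitted_lines(logging.DEBUG))
```

With the logger at INFO, the debug call returns after a cheap level check, so nothing is formatted or written; the messages remain available by lowering the level to DEBUG.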
### ENV
2P + 4D, EP
1. benchmark script
```
evalscope perf \
--parallel 512 \
--number 1024 \
--model deepseek \
--url http://localhost:9000/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 2 \
--min-tokens 2 \
--prefix-length 0 \
--min-prompt-length 512 \
--max-prompt-length 512 \
--tokenizer-path /tmp/DeepSeek-v3-0324-w8a8-0814 \
--extra-args '{"ignore_eos": true}' \
--rate 2
```
2. before patch
```
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 209.484 |
+-----------------------------------+-----------+
| Number of concurrency | 512 |
+-----------------------------------+-----------+
| Request rate (req/s) | 6 |
+-----------------------------------+-----------+
| Total requests | 1024 |
+-----------------------------------+-----------+
| Succeed requests | 1022 |
+-----------------------------------+-----------+
| Failed requests | 2 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 9.7573 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 2507.62 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 4.8786 |
+-----------------------------------+-----------+
| Average latency (s) | 7.0561 |
+-----------------------------------+-----------+
| Average time to first token (s) | 5.7444 |
+-----------------------------------+-----------+
| Average time per output token (s) | 1.3117 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 1.3117 |
+-----------------------------------+-----------+
| Average input tokens per request | 512 |
+-----------------------------------+-----------+
| Average output tokens per request | 2 |
+-----------------------------------+-----------+
2026-01-22 14:56:32 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.6062 | 0.5113 | 0.5113 | 1.234 | 512 | 2 | 0.0888 | 22.8338 |
| 25% | 0.7248 | 0.5639 | 0.5639 | 1.4114 | 512 | 2 | 0.2 | 51.3919 |
| 50% | 0.9092 | 0.7748 | 0.7748 | 1.6767 | 512 | 2 | 1.1935 | 306.7171 |
| 66% | 1.0745 | 1.0345 | 1.0345 | 3.1308 | 512 | 2 | 1.3395 | 344.2495 |
| 75% | 7.0812 | 1.5389 | 1.5389 | 10.0016 | 512 | 2 | 1.417 | 364.1808 |
| 80% | 10.6944 | 1.8552 | 1.8552 | 13.3717 | 512 | 2 | 1.4778 | 379.7911 |
| 90% | 19.2342 | 2.4325 | 2.4326 | 22.5105 | 512 | 2 | 1.6208 | 416.5381 |
| 95% | 24.4399 | 2.8289 | 2.8289 | 26.0329 | 512 | 2 | 1.7548 | 450.9942 |
| 98% | 45.0941 | 3.4098 | 3.4098 | 45.6287 | 512 | 2 | 1.8193 | 467.5476 |
| 99% | 46.2786 | 3.8492 | 3.8492 | 46.9282 | 512 | 2 | 1.8576 | 477.4157 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```
3. after patch
```
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 191.613 |
+-----------------------------------+-----------+
| Number of concurrency | 512 |
+-----------------------------------+-----------+
| Request rate (req/s) | 6 |
+-----------------------------------+-----------+
| Total requests | 1024 |
+-----------------------------------+-----------+
| Succeed requests | 1024 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 10.6882 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 2746.87 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 5.3441 |
+-----------------------------------+-----------+
| Average latency (s) | 2.0407 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.7989 |
+-----------------------------------+-----------+
| Average time per output token (s) | 1.2419 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 1.2419 |
+-----------------------------------+-----------+
| Average input tokens per request | 512 |
+-----------------------------------+-----------+
| Average output tokens per request | 2 |
+-----------------------------------+-----------+
2026-01-22 15:10:31 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.5727 | 0.5051 | 0.5051 | 1.1761 | 512 | 2 | 1.0368 | 266.4696 |
| 25% | 0.6497 | 0.5324 | 0.5324 | 1.3159 | 512 | 2 | 1.1763 | 302.3184 |
| 50% | 0.7767 | 0.6908 | 0.6908 | 1.4793 | 512 | 2 | 1.3521 | 347.4944 |
| 66% | 0.8711 | 0.7912 | 0.7912 | 1.5916 | 512 | 2 | 1.4518 | 373.1092 |
| 75% | 0.9125 | 0.8797 | 0.8797 | 1.7008 | 512 | 2 | 1.521 | 390.9018 |
| 80% | 0.9381 | 0.9442 | 0.9442 | 1.7657 | 512 | 2 | 1.5749 | 404.7606 |
| 90% | 0.994 | 1.0818 | 1.0818 | 1.9289 | 512 | 2 | 1.7006 | 437.0518 |
| 95% | 1.0369 | 1.2454 | 1.2454 | 2.2154 | 512 | 2 | 1.7937 | 460.9731 |
| 98% | 1.1237 | 18.8814 | 18.8814 | 19.4607 | 512 | 2 | 1.8755 | 482.0097 |
| 99% | 1.6752 | 24.4406 | 24.4406 | 25.4734 | 512 | 2 | 1.907 | 490.0993 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```
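The two result tables can be compared with a few lines of arithmetic. The sketch below copies selected values from the "before patch" and "after patch" output and computes the ratio and relative reduction for each; the dictionary keys are names chosen for this example, not evalscope field names.

```python
# Values copied from the "before patch" and "after patch" tables.
before = {
    "avg_ttft_s": 5.7444,
    "p99_ttft_s": 46.2786,
    "avg_latency_s": 7.0561,
    "req_per_s": 4.8786,
}
after = {
    "avg_ttft_s": 0.7989,
    "p99_ttft_s": 1.6752,
    "avg_latency_s": 2.0407,
    "req_per_s": 5.3441,
}

# Each request carries 512 input tokens and 2 output tokens.
TOKENS_PER_REQUEST = 512 + 2


def speedup(metric: str) -> float:
    """Ratio before/after; values above 1 mean the metric got smaller."""
    return before[metric] / after[metric]


def reduction_pct(metric: str) -> float:
    """Relative reduction of the metric, in percent."""
    return 100.0 * (before[metric] - after[metric]) / before[metric]


for metric in ("avg_ttft_s", "p99_ttft_s", "avg_latency_s"):
    print(f"{metric}: {speedup(metric):.2f}x lower, "
          f"{reduction_pct(metric):.1f}% reduction")

# Cross-check: request throughput times tokens per request should be close
# to the reported total token throughput (2507.62 and 2746.87 tok/s).
print(f"before total tok/s: {before['req_per_s'] * TOKENS_PER_REQUEST:.2f}")
print(f"after total tok/s: {after['req_per_s'] * TOKENS_PER_REQUEST:.2f}")
```

On these numbers, average TTFT drops by roughly 7.2x (about 86%) and P99 TTFT falls from 46.28 s to 1.68 s. Note that the 98% and 99% ITL/TPOT values are higher in the after-patch run (18.88 s and 24.44 s versus 3.41 s and 3.85 s), and each configuration is represented by a single run here.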
---------
Signed-off-by: xleoken <xleoken@163.com>
2026-03-12 16:13:48 +08:00