Improve TTFT when using Mooncake (#6125)

### What this PR does / why we need it?
Improve Mooncake performance by lowering the level of two log statements in the KV pool code (`kv_transfer.py` and `pool_scheduler.py`) from `info` to `debug`.
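
The changed calls pass their arguments in `%`-style, so at the default `INFO` level a `logger.debug(...)` call returns right after the level check, while `logger.info(...)` formats the message and writes it through every handler. The sketch below illustrates that difference with Python's standard `logging` module. The logger name `kv_pool_demo`, the in-memory handler, and the `CountingArg` helper are assumptions made for the illustration and are not part of vllm-ascend.

```python
import io
import logging

# Hypothetical logger and in-memory handler, used only for this illustration.
stream = io.StringIO()
logger = logging.getLogger("kv_pool_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the root logger's handlers out of the picture
logger.addHandler(logging.StreamHandler(stream))


class CountingArg:
    """Counts how many times it is formatted into a log message."""

    calls = 0

    def __str__(self):
        CountingArg.calls += 1
        return "req-0"


arg = CountingArg()

# Enabled at INFO: the record is created, formatted, and written.
logger.info("Storing KV cache for %d blocks for request %s", 4, arg)
info_format_calls = CountingArg.calls

# Disabled at INFO: the call returns after the level check, so no record
# is created, nothing is formatted, and nothing reaches the handler.
logger.debug("Storing KV cache for %d blocks for request %s", 4, arg)
debug_format_calls = CountingArg.calls - info_format_calls

print("formatted by info():", info_format_calls)
print("formatted by debug():", debug_format_calls)
print("lines written:", stream.getvalue().count("\n"))
```

Only the `info` call costs formatting and handler I/O here. The messages remain available when the logger is set to `DEBUG`.
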
### ENV
2P + 4D, EP

1. benchmark script
```
evalscope perf \
  --parallel 512 \
  --number 1024 \
  --model deepseek \
  --url http://localhost:9000/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 2 \
  --min-tokens 2 \
  --prefix-length 0 \
  --min-prompt-length 512 \
  --max-prompt-length 512 \
  --tokenizer-path /tmp/DeepSeek-v3-0324-w8a8-0814  \
  --extra-args '{"ignore_eos": true}' \
  --rate 2
```

2. before patch
```
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  209.484  |
+-----------------------------------+-----------+
| Number of concurrency             |  512      |
+-----------------------------------+-----------+
| Request rate (req/s)              |    6      |
+-----------------------------------+-----------+
| Total requests                    | 1024      |
+-----------------------------------+-----------+
| Succeed requests                  | 1022      |
+-----------------------------------+-----------+
| Failed requests                   |    2      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |    9.7573 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2507.62   |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    4.8786 |
+-----------------------------------+-----------+
| Average latency (s)               |    7.0561 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    5.7444 |
+-----------------------------------+-----------+
| Average time per output token (s) |    1.3117 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    1.3117 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request |    2      |
+-----------------------------------+-----------+
2026-01-22 14:56:32 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.6062  | 0.5113  |  0.5113  |    1.234    |     512      |       2       |     0.0888     |    22.8338    |
|     25%     |  0.7248  | 0.5639  |  0.5639  |   1.4114    |     512      |       2       |      0.2       |    51.3919    |
|     50%     |  0.9092  | 0.7748  |  0.7748  |   1.6767    |     512      |       2       |     1.1935     |   306.7171    |
|     66%     |  1.0745  | 1.0345  |  1.0345  |   3.1308    |     512      |       2       |     1.3395     |   344.2495    |
|     75%     |  7.0812  | 1.5389  |  1.5389  |   10.0016   |     512      |       2       |     1.417      |   364.1808    |
|     80%     | 10.6944  | 1.8552  |  1.8552  |   13.3717   |     512      |       2       |     1.4778     |   379.7911    |
|     90%     | 19.2342  | 2.4325  |  2.4326  |   22.5105   |     512      |       2       |     1.6208     |   416.5381    |
|     95%     | 24.4399  | 2.8289  |  2.8289  |   26.0329   |     512      |       2       |     1.7548     |   450.9942    |
|     98%     | 45.0941  | 3.4098  |  3.4098  |   45.6287   |     512      |       2       |     1.8193     |   467.5476    |
|     99%     | 46.2786  | 3.8492  |  3.8492  |   46.9282   |     512      |       2       |     1.8576     |   477.4157    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```

3. after patch
```
Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  191.613  |
+-----------------------------------+-----------+
| Number of concurrency             |  512      |
+-----------------------------------+-----------+
| Request rate (req/s)              |    6      |
+-----------------------------------+-----------+
| Total requests                    | 1024      |
+-----------------------------------+-----------+
| Succeed requests                  | 1024      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   10.6882 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2746.87   |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    5.3441 |
+-----------------------------------+-----------+
| Average latency (s)               |    2.0407 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.7989 |
+-----------------------------------+-----------+
| Average time per output token (s) |    1.2419 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    1.2419 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request |    2      |
+-----------------------------------+-----------+
2026-01-22 15:10:31 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.5727  | 0.5051  |  0.5051  |   1.1761    |     512      |       2       |     1.0368     |   266.4696    |
|     25%     |  0.6497  | 0.5324  |  0.5324  |   1.3159    |     512      |       2       |     1.1763     |   302.3184    |
|     50%     |  0.7767  | 0.6908  |  0.6908  |   1.4793    |     512      |       2       |     1.3521     |   347.4944    |
|     66%     |  0.8711  | 0.7912  |  0.7912  |   1.5916    |     512      |       2       |     1.4518     |   373.1092    |
|     75%     |  0.9125  | 0.8797  |  0.8797  |   1.7008    |     512      |       2       |     1.521      |   390.9018    |
|     80%     |  0.9381  | 0.9442  |  0.9442  |   1.7657    |     512      |       2       |     1.5749     |   404.7606    |
|     90%     |  0.994   | 1.0818  |  1.0818  |   1.9289    |     512      |       2       |     1.7006     |   437.0518    |
|     95%     |  1.0369  | 1.2454  |  1.2454  |   2.2154    |     512      |       2       |     1.7937     |   460.9731    |
|     98%     |  1.1237  | 18.8814 | 18.8814  |   19.4607   |     512      |       2       |     1.8755     |   482.0097    |
|     99%     |  1.6752  | 24.4406 | 24.4406  |   25.4734   |     512      |       2       |     1.907      |   490.0993    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```
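
To put the two runs side by side, the relative change can be worked out from the figures in the tables above. This is a minimal sketch; the dictionary keys and helper names are chosen here for illustration, and the values are copied from the before and after summaries.

```python
# Values in seconds, copied from the "before patch" and "after patch" tables.
before = {"avg_ttft": 5.7444, "p99_ttft": 46.2786, "avg_latency": 7.0561}
after = {"avg_ttft": 0.7989, "p99_ttft": 1.6752, "avg_latency": 2.0407}


def speedup(key: str) -> float:
    """How many times smaller the metric is after the patch."""
    return before[key] / after[key]


def reduction_pct(key: str) -> float:
    """Percentage by which the metric dropped after the patch."""
    return (1.0 - after[key] / before[key]) * 100.0


for key in before:
    print(f"{key}: {speedup(key):.2f}x lower ({reduction_pct(key):.1f}% reduction)")
```

For average TTFT this works out to roughly a 7x reduction. The same comparison is worth applying to the ITL columns: the 98% and 99% ITL values are higher in the after-patch run (18.8814 s and 24.4406 s, against 3.4098 s and 3.8492 s before), so the tail may need a separate look.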

---------

Signed-off-by: xleoken <xleoken@163.com>

xleoken, 2026-03-12 16:13:48 +08:00, committed by GitHub

parent f244f3c4a9, commit 77b43492ae

2 changed files with 2 additions and 2 deletions

									
										2

vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/kv_transfer.py
									
												View File
												
				@@ -187,7 +187,7 @@ class KVCacheStoreSendingThread(KVTransferThread):

				        ends = ends[skip_block_num:]

				        keys = keys[skip_block_num:]

				        logger.info(

				        logger.debug(

				            "Storing KV cache for %d out of %d blocks (skip_block_num=%d) for request %s",

				            len(keys),

				            token_len // self.block_size,

									
										2

vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py
									
												View File
												
				@@ -91,7 +91,7 @@ class KVPoolScheduler:

				        else:

				            need_to_allocate = num_external_hit_tokens - num_computed_tokens

				        logger.info(

				        logger.debug(

				            "Reqid: %s, Total tokens %d, kvpool hit tokens: %d, need to load: %d",

				            request.request_id,

				            request.num_tokens,
