enginex-ascend-910-vllm/perf_result_template.md at docs-readme

Yang Jun01 9149384e03 v0.10.1rc1

2025-09-09 09:40:35 +08:00

Online serving tests

Input length: randomly sample 200 prompts from the ShareGPT and lmarena-ai/vision-arena-bench-v0.1 (multi-modal) datasets, using a fixed random seed.
Output length: the output lengths that correspond to these 200 prompts.
Batch size: determined dynamically by vLLM and the arrival pattern of the requests.
Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For the other QPS values, the arrival time of each request is drawn from a Poisson process with a fixed random seed.
Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
Evaluation metrics: throughput, TTFT (median time to first token), ITL (median inter-token latency), and TPOT (median time per output token).
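
The arrival pattern described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the helper name `arrival_times` and its `seed` parameter are hypothetical. It relies on the fact that in a Poisson process with rate `qps`, the gaps between consecutive arrivals are exponentially distributed with mean `1 / qps`.

```python
import math
import random


def arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Return the send offset (in seconds) of each request.

    Hypothetical helper for illustration; not taken from the benchmark scripts.
    """
    if math.isinf(qps):
        # QPS = inf: every request is sent at time zero.
        return [0.0] * num_requests

    rng = random.Random(seed)  # fixed seed keeps the schedule reproducible
    times = []
    now = 0.0
    for _ in range(num_requests):
        # Inter-arrival gaps of a Poisson process are exponential with rate qps.
        now += rng.expovariate(qps)
        times.append(now)
    return times


if __name__ == "__main__":
    schedule = arrival_times(200, qps=4.0)
    print(f"last request sent after {schedule[-1]:.1f} s")
```

With 200 requests at an average of 4 QPS, the last request is sent after roughly 50 seconds on average; the exact value depends on the seed.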

{serving_tests_markdown_table}
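
The latency metrics reported for the serving tests (TTFT, ITL, TPOT) can be derived from per-token timestamps. The sketch below shows one common way to compute them and is an assumption about the definitions, not a copy of the benchmark code: the function names and the input layout (a send time plus a list of token arrival times per request) are hypothetical, and every request is assumed to produce at least one token.

```python
from statistics import median


def request_metrics(send_time: float, token_times: list[float]):
    """Compute TTFT, the list of ITLs, and TPOT for one request."""
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - send_time
    # ITL: gap between each pair of consecutive tokens.
    itls = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # TPOT: time spent after the first token, averaged over the remaining tokens.
    decode_tokens = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / decode_tokens if decode_tokens else None
    return ttft, itls, tpot


def summarize(requests: list[tuple[float, list[float]]]) -> dict[str, float]:
    """Aggregate per-request metrics into medians across all requests."""
    ttfts, all_itls, tpots = [], [], []
    for send_time, token_times in requests:
        ttft, itls, tpot = request_metrics(send_time, token_times)
        ttfts.append(ttft)
        all_itls.extend(itls)
        if tpot is not None:  # single-token responses have no TPOT
            tpots.append(tpot)
    return {
        "median_ttft": median(ttfts),
        "median_itl": median(all_itls),
        "median_tpot": median(tpots),
    }
```

Note that ITL is pooled over all token gaps, while TPOT is one value per request; the two medians can therefore differ even on the same run.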

Latency tests

{latency_tests_markdown_table}

Throughput tests

Input length: randomly sample 200 prompts from the ShareGPT and lmarena-ai/vision-arena-bench-v0.1 (multi-modal) datasets, using a fixed random seed.
Output length: the output lengths that correspond to these 200 prompts.
Batch size: determined dynamically by vLLM to achieve maximum throughput.
Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
Evaluation metrics: throughput.
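
As a rough sketch of the setup above, the snippet below shows seeded sampling of 200 prompts and a simple throughput calculation. Both helpers (`sample_prompts`, `throughput`) are hypothetical names introduced for illustration, and the dataset is assumed to be already loaded into a Python list.

```python
import random


def sample_prompts(dataset: list, k: int = 200, seed: int = 0) -> list:
    """Draw k distinct prompts; the same seed always yields the same sample."""
    return random.Random(seed).sample(dataset, k)


def throughput(num_requests: int, total_tokens: int, elapsed_s: float) -> tuple[float, float]:
    """Return (requests per second, tokens per second) for one benchmark run."""
    return num_requests / elapsed_s, total_tokens / elapsed_s
```

Using a dedicated `random.Random(seed)` instance, rather than seeding the global generator, keeps the sample stable even if other code draws random numbers in between.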

{throughput_tests_markdown_table}
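
The `{..._markdown_table}` placeholders in this template match Python's `str.format` field syntax, so one plausible way to render the report is shown below. This is a sketch under that assumption: the shortened `TEMPLATE`, the `to_markdown_table` helper, and the column names are hypothetical, and the row values are dummy placeholders rather than measured results.

```python
TEMPLATE = """Online serving tests

{serving_tests_markdown_table}

{latency_tests_markdown_table}

{throughput_tests_markdown_table}
"""


def to_markdown_table(rows: list[dict]) -> str:
    """Render a list of same-keyed dicts as a Markdown table."""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


# Dummy rows for illustration only; these are not benchmark results.
dummy_rows = [{"Test name": "example_test", "Value": 0.0}]

report = TEMPLATE.format(
    serving_tests_markdown_table=to_markdown_table(dummy_rows),
    latency_tests_markdown_table=to_markdown_table(dummy_rows),
    throughput_tests_markdown_table=to_markdown_table(dummy_rows),
)

if __name__ == "__main__":
    print(report)
```

If a placeholder is left out of the `format` call, Python raises `KeyError`, which makes a missing table easy to spot.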