xc-llm-ascend/benchmarks/scripts/perf_result_template.md
Li Wang 76dacf3fa0 [CI][Benchmark] Optimize performance benchmark workflow (#1039)
### What this PR does / why we need it?

This is a follow-up patch to #1014 with several convenience optimizations:
- Set a cached dataset path for speed
- Install `escli-tool` from PyPI
- Add a benchmark-results conversion script to produce a
developer-friendly result
- Patch `benchmark_dataset.py` to disable streaming loads over the
internet
- Add more trigger modes for different purposes: `pr` for debugging,
`schedule` for daily tests, `dispatch` and `pr-labled` for manual testing
of a single (current) commit
- Disable the latency test for `qwen-2.5-vl` (the script does not support
multi-modal models yet)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-03 23:38:34 +08:00


Online serving tests

  • Input length: 200 prompts randomly sampled from the ShareGPT and lmarena-ai/vision-arena-bench-v0.1 (multi-modal) datasets (with a fixed random seed).
  • Output length: the corresponding output length of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is determined by a random Poisson process (with a fixed random seed).
  • Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
  • Evaluation metrics: throughput, TTFT (median time to first token), ITL (median inter-token latency), and TPOT (median time per output token).
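The Poisson arrival pattern above can be sketched as follows. This is not the benchmark's own code, just a minimal illustration of the idea: inter-arrival gaps of a Poisson process with rate QPS are exponentially distributed with mean 1/QPS, and a fixed seed keeps the schedule reproducible across runs. The function name is hypothetical.

```python
import random

def poisson_arrival_times(num_requests, qps, seed=0):
    """Sample arrival times (in seconds) from a Poisson process with rate `qps`.

    The gap between consecutive arrivals is drawn from an exponential
    distribution with mean 1 / qps; a fixed seed makes runs reproducible.
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # expected gap: 1 / qps seconds
        times.append(t)
    return times

# QPS = inf corresponds to every request arriving at time 0 (all at once).
```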

{serving_tests_markdown_table}

Offline tests

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
  • Evaluation metrics: end-to-end latency.
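End-to-end latency here means the wall-clock time for a fixed batch to run from first input token to last output token. A minimal sketch of such a measurement, assuming a hypothetical `generate` callable standing in for the engine:

```python
import statistics
import time

def measure_e2e_latency(generate, batch, iters=3):
    """Time full batched generation end to end; report the median over `iters` runs.

    `generate` is a hypothetical callable that runs prefill + decode for the
    whole batch and returns only when every sequence has finished.
    """
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        generate(batch)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Using the median over a few repetitions dampens one-off scheduling noise.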

{latency_tests_markdown_table}

Throughput tests

  • Input length: 200 prompts randomly sampled from the ShareGPT and lmarena-ai/vision-arena-bench-v0.1 (multi-modal) datasets (with a fixed random seed).
  • Output length: the corresponding output length of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
  • Evaluation metrics: throughput.
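Throughput in this setting is just tokens divided by wall-clock time; it is often reported both for output tokens only and for total (prompt + output) tokens. A minimal sketch (function and key names are illustrative, not the benchmark's own API):

```python
def throughput_tokens_per_s(prompt_lens, output_lens, elapsed_s):
    """Compute token throughput from per-request lengths and total wall time.

    Returns output-only and total (prompt + output) tokens per second.
    """
    total_output = sum(output_lens)
    total_tokens = sum(prompt_lens) + total_output
    return {
        "output_tok_per_s": total_output / elapsed_s,
        "total_tok_per_s": total_tokens / elapsed_s,
    }
```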

{throughput_tests_markdown_table}