## Online serving tests
- Input length: 200 prompts randomly sampled (with a fixed random seed) from the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main) (multi-modal) datasets.
- Output length: the corresponding output length of each sampled prompt.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- **Average QPS (queries per second)**: 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other values, the arrival time of each query is drawn from a Poisson process with a fixed random seed, as sketched after the results table below.
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput, TTFT (median time to first token), TPOT (median time per output token), and ITL (median inter-token latency).
{serving_tests_markdown_table}
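
For reference, a minimal Python sketch of how the arrival schedule and the median metrics above can be computed. The exponential inter-arrival sampling follows directly from the Poisson model described above; the `token_times` layout is a hypothetical data structure for illustration, not the benchmark script's actual internals.

```python
import numpy as np

def poisson_arrivals(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Arrival offsets (seconds) for a Poisson process with mean rate `qps`.

    Inter-arrival gaps of a Poisson process are i.i.d. exponential with
    rate `qps`; QPS = inf collapses every arrival to t = 0.
    """
    if np.isinf(qps):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible schedule
    return np.cumsum(rng.exponential(scale=1.0 / qps, size=num_requests))

def median_metrics(token_times: list[list[float]]) -> dict[str, float]:
    """Median TTFT/TPOT/ITL given per-request output-token timestamps.

    token_times[i] holds the request-relative timestamp of every output
    token of request i, so token_times[i][0] is that request's TTFT.
    """
    ttfts = [ts[0] for ts in token_times]
    itls = [b - a for ts in token_times for a, b in zip(ts, ts[1:])]
    tpots = [(ts[-1] - ts[0]) / (len(ts) - 1) for ts in token_times if len(ts) > 1]
    return {
        "median_ttft_s": float(np.median(ttfts)),
        "median_tpot_s": float(np.median(tpots)),
        "median_itl_s": float(np.median(itls)),
    }

# e.g. the arrival schedule for the 200 sampled prompts at QPS = 4
schedule = poisson_arrivals(200, qps=4.0, seed=0)
```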
## Offline tests
### Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: end-to-end latency (sketched below the results table).
{latency_tests_markdown_table}
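
A minimal sketch of this measurement using vLLM's offline Python API; the dummy 32-token prompt, the single warmup iteration, and the `ignore_eos=True` setting (to force exactly 128 output tokens) are assumptions for illustration, not taken from the actual latency script.

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")

# Fixed batch of 8 prompts, roughly 32 tokens each (dummy prompt shown here).
prompts = [" ".join(["hello"] * 32)] * 8
params = SamplingParams(max_tokens=128, ignore_eos=True)  # force 128 output tokens

llm.generate(prompts, params)  # warmup iteration, excluded from timing

start = time.perf_counter()
llm.generate(prompts, params)
print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")
```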
### Throughput tests
- Input length: 200 prompts randomly sampled (with a fixed random seed) from the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main) (multi-modal) datasets.
- Output length: the corresponding output length of each sampled prompt.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput (sketched below the results table).
{throughput_tests_markdown_table}
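
A minimal sketch of an offline throughput measurement under the settings above; the loading code assumes the public ShareGPT JSON layout (`conversations` / `value` fields), and whitespace splitting is only a rough stand-in for tokenizing the reference outputs.

```python
import json
import random
import time

from vllm import LLM, SamplingParams

# Sample 200 prompts from ShareGPT with a fixed seed (file path assumed local).
data = json.load(open("ShareGPT_V3_unfiltered_cleaned_split.json"))
convs = [d["conversations"] for d in data if len(d.get("conversations", [])) >= 2]
random.seed(0)
sampled = random.sample(convs, 200)
prompts = [c[0]["value"] for c in sampled]  # first human turn as the input
out_lens = [max(1, len(c[1]["value"].split())) for c in sampled]  # rough output budget

llm = LLM(model="Qwen/Qwen3-8B")  # vLLM batches the requests internally
start = time.perf_counter()
outputs = llm.generate(
    prompts,
    [SamplingParams(max_tokens=n, ignore_eos=True) for n in out_lens],
)
elapsed = time.perf_counter() - start

total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {total_output_tokens / elapsed:.1f} output tokens/s")
```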