xc-llm-ascend/benchmarks/tests/serving-tests.json

[
    {
        "test_name": "serving_qwen2_5vl_7B_tp1",
        "qps_list": [
            1,
            4,
            16,
            "inf"
        ],
        "server_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "trust_remote_code": "",
            "max_model_len": 16384
        },
        "client_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
"backend": "openai-chat",
"dataset_name": "hf",
"hf_split": "train",
"endpoint": "/v1/chat/completions",
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200,
"no_stream": ""
}
},
    {
        "test_name": "serving_qwen3_8B_tp1",
        "qps_list": [
            1,
            4,
            16,
            "inf"
        ],
        "server_parameters": {
            "model": "Qwen/Qwen3-8B",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "Qwen/Qwen3-8B",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
    {
        "test_name": "serving_qwen2_5_7B_tp1",
        "qps_list": [
            1,
            4,
            16,
            "inf"
        ],
        "server_parameters": {
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "Qwen/Qwen2.5-7B-Instruct",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
}
]
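
Each entry pairs a `server_parameters` map (flags for `vllm serve`) with a `client_parameters` map (flags for `vllm bench serve`), swept over every value in `qps_list`. Below is a minimal sketch, not the actual runner script in this repo, of how a driver could expand one entry into CLI invocations; it assumes the common convention that an empty-string value stands for a bare boolean switch, and the `params_to_flags` helper and `--request-rate` sweep are illustrative.

```python
import json
import shlex

def params_to_flags(params: dict) -> list[str]:
    """Map a parameter dict to CLI flags. Keys use underscores in the JSON,
    flags use dashes; an empty-string value is treated as a bare switch
    (assumption based on how such benchmark configs are typically consumed)."""
    flags: list[str] = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        if value == "":
            flags.append(flag)                # e.g. "disable_log_stats": "" -> --disable-log-stats
        else:
            flags.extend([flag, str(value)])  # e.g. "swap_space": 16 -> --swap-space 16
    return flags

with open("serving-tests.json") as f:
    tests = json.load(f)

for test in tests:
    server = dict(test["server_parameters"])
    model = server.pop("model")
    serve_cmd = ["vllm", "serve", model, *params_to_flags(server)]
    bench_cmd = ["vllm", "bench", "serve", *params_to_flags(test["client_parameters"])]
    print(test["test_name"])
    print("  server:", shlex.join(serve_cmd))
    for qps in test["qps_list"]:              # one client run per QPS value ("inf" = unthrottled)
        print("  client:", shlex.join(bench_cmd + ["--request-rate", str(qps)]))
```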