From d5fef22149c40e25810362cf4c0eec90106f38d5 Mon Sep 17 00:00:00 2001
From: Canlin Guo
Date: Wed, 19 Nov 2025 16:00:39 +0800
Subject: [PATCH] [Docs] Improve the AISBench multi-modal testing docs (#4255)

### What this PR does / why we need it?
Add some of the pitfalls I ran into when using AISBench to test multi-modal models.

- vLLM version: v0.11.0
- vLLM main: https://github.com/vllm-project/vllm/commit/2918c1b49c88c29783c86f78d2c4221cb9622379

---------

Signed-off-by: gcanlin
---
 .../evaluation/using_ais_bench.md             | 41 +++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/docs/source/developer_guide/evaluation/using_ais_bench.md b/docs/source/developer_guide/evaluation/using_ais_bench.md
index 62b1db7b..25811aa7 100644
--- a/docs/source/developer_guide/evaluation/using_ais_bench.md
+++ b/docs/source/developer_guide/evaluation/using_ais_bench.md
@@ -152,6 +152,9 @@ rm gsm8k.zip
 
 Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`. There are several arguments that you should update according to your environment.
 
+- `attr`: Inference backend type; set it to `service` (serving-based inference) or `local` (local model).
+- `type`: Selects the backend API type.
+- `abbr`: Unique identifier for a local task, used to distinguish between multiple tasks.
 - `path`: Update to your model weight path.
 - `model`: Update to your model name in vLLM.
 - `host_ip` and `host_port`: Update to your vLLM server ip and port.
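As an illustration, the arguments listed above might be laid out in `vllm_api_general_chat.py` roughly as follows. This is a minimal sketch, not the actual shipped config: every value is a placeholder to adapt to your environment, and the `type` value in real AISBench configs references a backend class rather than a plain string.

```python
# Hypothetical sketch of an AISBench model config entry; field names follow
# the bullet list above, all values are placeholders for your environment.
models = [
    dict(
        attr="service",                  # serving-based inference (vs. "local")
        type="vllm_api_general_chat",    # backend API type (a class in real configs)
        abbr="vllm-api-general-chat-1",  # unique local task identifier
        path="/path/to/model/weights",   # model weight path
        model="my-served-model-name",    # model name registered in vLLM
        host_ip="127.0.0.1",             # vLLM server IP
        host_port=8000,                  # vLLM server port
    )
]
```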
@@ -242,6 +245,8 @@ After each dataset execution, you can get the result from saved files such as `o
 
 #### Execute Performance Evaluation
 
+Text-only benchmarks:
+
 ```shell
 # run C-Eval dataset
 ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf
@@ -262,6 +267,13 @@ ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_
 ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --summarizer default_perf --mode perf
 ```
 
+Multi-modal benchmarks (text + images):
+
+```shell
+# run textvqa dataset
+ais_bench --models vllm_api_stream_chat --datasets textvqa_gen_base64 --summarizer default_perf --mode perf
+```
+
 After execution, you can get the result from saved files, there is an example as follows:
 
 ```
@@ -281,3 +293,32 @@ After execution, you can get the result from saved files, there is an example as
 |-- cevaldataset_plot.html # Final performance results (in html format)
 `-- cevaldataset_rps_distribution_plot_with_actual_rps.html # Final performance results (in html format)
 ```
+
+### 3. Troubleshooting
+
+#### Invalid Image Path Error
+
+If you download the TextVQA dataset following the AISBench documentation:
+
+```bash
+cd ais_bench/datasets
+git lfs install
+git clone https://huggingface.co/datasets/maoxx241/textvqa_subset
+mv textvqa_subset/ textvqa/
+mkdir textvqa/textvqa_json/
+mv textvqa/*.json textvqa/textvqa_json/
+mv textvqa/*.jsonl textvqa/textvqa_json/
+```
+
+you may encounter the following error:
+
+```bash
+AISBench - ERROR - /vllm-workspace/benchmark/ais_bench/benchmark/clients/base_client.py - raise_error - 35 - [AisBenchClientException] Request failed: HTTP status 400. Server response: {"error":{"message":"1 validation error for ChatCompletionContentPartImageParam\nimage_url\n Input should be a valid dictionary [type=dict_type, input_value='data/textvqa/train_images/b2ae0f96dfbea5d8.jpg', input_type=str]\n For further information visit https://errors.pydantic.dev/2.12/v/dict_type None","type":"BadRequestError","param":null,"code":400}}
+```
+
+You need to manually rewrite the relative image paths in the dataset JSON as absolute paths. In the command below, replace `/path/to/benchmark/ais_bench/datasets/textvqa/train_images/` with the actual absolute directory where the images are stored:
+
+```bash
+cd ais_bench/datasets/textvqa/textvqa_json
+sed -i 's#data/textvqa/train_images/#/path/to/benchmark/ais_bench/datasets/textvqa/train_images/#g' textvqa_val.json
+```
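The `sed` rewrite above can also be sketched in Python, which may be easier to extend if other JSON files need the same fix. This is a minimal illustration under the same assumptions as the `sed` command: the prefix and target directory are placeholders, and the `image_path` key in the sample record is hypothetical.

```python
# Placeholders matching the sed command above: the relative prefix stored in
# the dataset JSON and the absolute directory that actually holds the images.
OLD_PREFIX = "data/textvqa/train_images/"
NEW_PREFIX = "/path/to/benchmark/ais_bench/datasets/textvqa/train_images/"

def absolutize(obj):
    """Recursively replace the relative image-path prefix in any string field."""
    if isinstance(obj, str):
        return obj.replace(OLD_PREFIX, NEW_PREFIX)
    if isinstance(obj, list):
        return [absolutize(x) for x in obj]
    if isinstance(obj, dict):
        return {k: absolutize(v) for k, v in obj.items()}
    return obj

# Example usage on a single record (in practice, load textvqa_val.json with
# json.load, pass the parsed data through absolutize, and write it back):
record = {"image_path": "data/textvqa/train_images/b2ae0f96dfbea5d8.jpg"}
fixed = absolutize(record)
```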