## Run evaluation
### Evaluate sglang
Host the VLM:
```
python -m sglang.launch_server --model-path Qwen/Qwen2-VL-7B-Instruct --port 30000
```
To reduce memory usage, it is recommended to append something like `--mem-fraction-static 0.6` to the command above.
Benchmark:
```
python benchmark/mmmu/bench_sglang.py --port 30000 --concurrency 16
```
You can adjust `--concurrency` to control the number of concurrent OpenAI-compatible API calls.
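To illustrate what `--concurrency` controls, here is a minimal sketch of bounding the number of in-flight requests with `asyncio.Semaphore`. This is an illustration of the pattern, not the benchmark's actual code, and the API call is replaced by a placeholder:

```python
import asyncio

async def bounded_requests(prompts, concurrency=16):
    """Run one task per prompt, with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def one_call(prompt):
        nonlocal in_flight, peak
        async with sem:  # blocks once `concurrency` tasks hold the semaphore
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the actual API call
            in_flight -= 1
            return f"answer to {prompt}"

    results = await asyncio.gather(*(one_call(p) for p in prompts))
    return results, peak

results, peak = asyncio.run(
    bounded_requests([f"q{i}" for i in range(40)], concurrency=16)
)
```

Higher concurrency finishes the benchmark faster but increases server load; `peak` here never exceeds the semaphore's limit.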
You can use `--lora-path` to specify the LoRA adapter to apply during benchmarking. E.g.,
```
# Launch server with LoRA enabled
python -m sglang.launch_server --model-path microsoft/Phi-4-multimodal-instruct --port 30000 --trust-remote-code --disable-radix-cache --lora-paths vision=<LoRA path>
# Apply LoRA adapter during inference
python benchmark/mmmu/bench_sglang.py --concurrency 8 --lora-path vision
```
You can use `--response-answer-regex` to specify how to extract the answer from the response string. E.g.,
```
python3 -m sglang.launch_server --model-path zai-org/GLM-4.1V-9B-Thinking --reasoning-parser glm45
python3 bench_sglang.py --response-answer-regex "<\|begin_of_box\|>(.*)<\|end_of_box\|>" --concurrency 64
```
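The pattern above captures the text the thinking model places between its box markers. A quick sketch of the extraction, using a hypothetical response string:

```python
import re

# The value passed to --response-answer-regex, with `|` escaped for the regex engine
pattern = r"<\|begin_of_box\|>(.*)<\|end_of_box\|>"

# Hypothetical model response containing a boxed final answer
response = "Let me reason step by step... <|begin_of_box|>B<|end_of_box|>"

match = re.search(pattern, response)
# Keep the captured group if the pattern matches, else fall back to the raw text
answer = match.group(1) if match else response
```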
You can use `--extra-request-body` to specify additional OpenAI request parameters. E.g.,
```
python3 bench_sglang.py --extra-request-body '{"max_new_tokens": 128, "temperature": 0.01}'
```
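The flag's value is a JSON object merged into each request. A minimal sketch of how such a merge could work (illustrative; the `request` dict below is a placeholder, not the benchmark's exact code):

```python
import json

# The value passed to --extra-request-body on the command line
extra_request_body = '{"max_new_tokens": 128, "temperature": 0.01}'

# Hypothetical base parameters sent with every request
request = {"model": "Qwen/Qwen2-VL-7B-Instruct", "messages": []}

# Parsed extra parameters extend (and would override) the defaults
request.update(json.loads(extra_request_body))
```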
### Evaluate HF
```
python benchmark/mmmu/bench_hf.py --model-path Qwen/Qwen2-VL-7B-Instruct
```
## Profiling MMMU
If you run this benchmark with the profile option, follow the standard instructions in the [dedicated profiling doc](../../docs/developer_guide/benchmark_and_profiling.md). We recommend `--concurrency 1` for consistency, which also makes profiling and debugging easier.