bench: Add MMMU benchmark for vLM (#3562)

This commit is contained in:
Mick
2025-02-23 00:10:59 +08:00
committed by GitHub
parent 9087694006
commit 45205d88a0
9 changed files with 1026 additions and 7 deletions

@@ -57,14 +57,25 @@ For most models, you should be able to find a similar model to start with (e.g.,
## How to Support a New vLM
To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM.
1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return `True` for your model (a sketch follows this list).
2. **Process Images**: Create a new `ImageProcessor` class that inherits from `BaseImageProcessor` and register this processor as your model's dedicated processor (sketched below). See [`image_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/image_processor.py) for more details.
3. **Handle Image Tokens**: Implement a `pad_input_ids` function for your new model, in which image tokens in the prompt are expanded and replaced with image hashes, so that SGLang can recognize different images for `RadixAttention` (sketched below).
4. **Replace ViT Attention**: Replace the multi-headed `Attention` of the ViT with SGLang's `VisionAttention` (sketched below).
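For step 1, a minimal sketch of the `is_multimodal_model` extension, assuming the function matches on architecture names; `MyVLMForCausalLM` is a hypothetical placeholder:

```python
from typing import List


def is_multimodal_model(model_architectures: List[str]) -> bool:
    # Hedged sketch of the check in model_config.py; the real function may
    # take a different argument shape. "MyVLMForCausalLM" is a placeholder.
    multimodal_architectures = {
        "Qwen2VLForConditionalGeneration",  # existing vLM
        "MyVLMForCausalLM",  # <-- add your new architecture here
    }
    return any(arch in multimodal_architectures for arch in model_architectures)
```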
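For step 2, a sketch of a dedicated image processor; the constructor arguments and the `process_images_async` hook are assumptions, so check `image_processor.py` for the actual interface and registration mechanism:

```python
from sglang.srt.managers.image_processor import BaseImageProcessor


class MyVLMImageProcessor(BaseImageProcessor):
    """Hypothetical processor for a new vLM; hook names are assumptions."""

    def __init__(self, hf_config, server_args, _processor):
        super().__init__(hf_config, server_args, _processor)

    async def process_images_async(self, image_data, input_text, request_obj):
        # Decode and resize the images, then return the pixel tensors plus
        # the per-image hashes that pad_input_ids (step 3) will consume.
        ...
```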
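For step 3, a sketch of the token expansion, assuming each image placeholder expands into a fixed-length run of hash-derived pad tokens; `IMAGE_TOKEN_ID` and `TOKENS_PER_IMAGE` are illustrative placeholders, and in practice this is a method on your model class:

```python
from typing import List

IMAGE_TOKEN_ID = 0  # placeholder; use your model's real image token id
TOKENS_PER_IMAGE = 576  # placeholder; depends on the vision encoder


def pad_input_ids(self, input_ids: List[int], image_inputs) -> List[int]:
    """Expand each image token into hash-derived pad tokens (hedged sketch)."""
    new_input_ids: List[int] = []
    image_idx = 0
    for token_id in input_ids:
        if token_id == IMAGE_TOKEN_ID:
            # image_inputs.pad_values is assumed to hold one hash-derived
            # value per image, letting RadixAttention tell images apart.
            pad_value = image_inputs.pad_values[image_idx]
            new_input_ids.extend([pad_value] * TOKENS_PER_IMAGE)
            image_idx += 1
        else:
            new_input_ids.append(token_id)
    return new_input_ids
```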
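For step 4, a sketch loosely modeled on `qwen2_vl.py`; the import path and constructor arguments of `VisionAttention` are assumptions, so verify them against the current SGLang source:

```python
import torch.nn as nn

from sglang.srt.layers.attention.vision import VisionAttention


class MyVLMVisionBlock(nn.Module):
    """A ViT block with its stock multi-head attention swapped out."""

    def __init__(self, dim: int, num_heads: int, quant_config=None):
        super().__init__()
        # Arguments mirror the Qwen2VL usage; treat them as assumptions.
        self.attn = VisionAttention(
            embed_dim=dim,
            num_heads=num_heads,
            projection_size=dim,
            use_qkv_parallel=True,
            quant_config=quant_config,
        )
```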
You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLMs. These models demonstrate how to properly handle both visual and textual inputs.
You should test the new vLM locally against Hugging Face models. See [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) for an example (commands sketched below).
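A hypothetical way to run that comparison, assuming `bench_sglang.py` and `bench_hf.py` as the entry points in the `mmmu` directory; check its README for the real flags:

```
# Illustrative commands: score the model through SGLang, then through the
# Hugging Face reference, and compare MMMU accuracy.
python benchmark/mmmu/bench_sglang.py --model-path <your-model>
python benchmark/mmmu/bench_hf.py --model-path <your-model>
```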
### Test the correctness