bench: Add MMMU benchmark for vLM (#3562)

This commit is contained in:
Mick
2025-02-23 00:10:59 +08:00
committed by GitHub
parent 9087694006
commit 45205d88a0
9 changed files with 1026 additions and 7 deletions

@@ -57,14 +57,25 @@ For most models, you should be able to find a similar model to start with (e.g.,
## How to Support a New vLM
To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM.
1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return `True` for your model (a sketch follows this list).
2. **Process Images**: Create a new `ImageProcessor` class that inherits from `BaseImageProcessor` and register this processor as your model's dedicated processor (sketched below). See [`image_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/image_processor.py) for more details.
3. **Handle Image Tokens**: Implement a `pad_input_ids` function for your new model, in which image tokens in the prompt are expanded and replaced with image hashes, so that SGLang can recognize different images for `RadixAttention` (sketched below).
4. **Replace ViT Attention**: Replace the multi-headed `Attention` of the ViT with SGLang's `VisionAttention` (sketched below).
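For step 1, a minimal sketch of the `is_multimodal_model` extension, assuming the function matches on architecture names; `MyVLMForCausalLM` is a hypothetical placeholder:

```python
from typing import List


def is_multimodal_model(model_architectures: List[str]) -> bool:
    # Hedged sketch of the check in model_config.py; the real function may
    # take a different argument shape. "MyVLMForCausalLM" is a placeholder.
    multimodal_architectures = {
        "Qwen2VLForConditionalGeneration",  # existing vLM
        "MyVLMForCausalLM",  # <-- add your new architecture here
    }
    return any(arch in multimodal_architectures for arch in model_architectures)
```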
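For step 2, a sketch of a dedicated image processor; the constructor arguments and the `process_images_async` hook are assumptions, so check `image_processor.py` for the actual interface and registration mechanism:

```python
from sglang.srt.managers.image_processor import BaseImageProcessor


class MyVLMImageProcessor(BaseImageProcessor):
    """Hypothetical processor for a new vLM; hook names are assumptions."""

    def __init__(self, hf_config, server_args, _processor):
        super().__init__(hf_config, server_args, _processor)

    async def process_images_async(self, image_data, input_text, request_obj):
        # Decode and resize the images, then return the pixel tensors plus
        # the per-image hashes that pad_input_ids (step 3) will consume.
        ...
```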
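For step 3, a sketch of the token expansion, assuming each image placeholder expands into a fixed-length run of hash-derived pad tokens; `IMAGE_TOKEN_ID` and `TOKENS_PER_IMAGE` are illustrative placeholders, and in practice this is a method on your model class:

```python
from typing import List

IMAGE_TOKEN_ID = 0  # placeholder; use your model's real image token id
TOKENS_PER_IMAGE = 576  # placeholder; depends on the vision encoder


def pad_input_ids(self, input_ids: List[int], image_inputs) -> List[int]:
    """Expand each image token into hash-derived pad tokens (hedged sketch)."""
    new_input_ids: List[int] = []
    image_idx = 0
    for token_id in input_ids:
        if token_id == IMAGE_TOKEN_ID:
            # image_inputs.pad_values is assumed to hold one hash-derived
            # value per image, letting RadixAttention tell images apart.
            pad_value = image_inputs.pad_values[image_idx]
            new_input_ids.extend([pad_value] * TOKENS_PER_IMAGE)
            image_idx += 1
        else:
            new_input_ids.append(token_id)
    return new_input_ids
```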
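For step 4, a sketch loosely modeled on `qwen2_vl.py`; the import path and constructor arguments of `VisionAttention` are assumptions, so verify them against the current SGLang source:

```python
import torch.nn as nn

from sglang.srt.layers.attention.vision import VisionAttention


class MyVLMVisionBlock(nn.Module):
    """A ViT block with its stock multi-head attention swapped out."""

    def __init__(self, dim: int, num_heads: int, quant_config=None):
        super().__init__()
        # Arguments mirror the Qwen2VL usage; treat them as assumptions.
        self.attn = VisionAttention(
            embed_dim=dim,
            num_heads=num_heads,
            projection_size=dim,
            use_qkv_parallel=True,
            quant_config=quant_config,
        )
```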
You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLMs. These models demonstrate how to properly handle both visual and textual inputs.
You should test the new vLM locally against Hugging Face models. See [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) for an example (commands sketched below).
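A hypothetical way to run that comparison, assuming `bench_sglang.py` and `bench_hf.py` as the entry points in the `mmmu` directory; check its README for the real flags:

```
# Illustrative commands: score the model through SGLang, then through the
# Hugging Face reference, and compare MMMU accuracy.
python benchmark/mmmu/bench_sglang.py --model-path <your-model>
python benchmark/mmmu/bench_hf.py --model-path <your-model>
```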
### Test the correctness