bench: Add MMMU benchmark for vLM (#3562)
@@ -57,14 +57,25 @@ For most models, you should be able to find a similar model to start with (e.g.,
## How to Support a New vLM
To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM.
1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return True for your model.
2. **Process Images**: Create a new `ImageProcessor` class that inherits from `BaseImageProcessor` and register this processor as your model's dedicated processor. See [`image_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/image_processor.py) for more details.
3. **Handle Image Tokens**: Implement a `pad_input_ids` function for your new model, in which image tokens in the prompt should be expanded and replaced with image-hashes, so that SGLang can recognize different images for `RadixAttention`.
4. **Replace Attention**: Replace the multi-headed `Attention` of the ViT with SGLang's `VisionAttention`.
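Step 3 can be sketched in plain Python. This is a minimal illustration, not SGLang's actual API: the token IDs, the tokens-per-image count, and the hashing scheme below are all hypothetical placeholders. The idea is that identical images expand to identical hash-derived token sequences, so a prefix cache like `RadixAttention` can recognize repeated images.

```python
import hashlib

# Hypothetical values for illustration; real IDs come from the model's
# tokenizer and config.
IMAGE_TOKEN_ID = 32000   # placeholder token marking an image in the prompt
TOKENS_PER_IMAGE = 4     # how many positions each image expands to

def image_hash_tokens(image_bytes: bytes) -> list[int]:
    """Derive per-image pad token IDs from a hash of the image content,
    so the same image always yields the same token sequence."""
    digest = hashlib.sha256(image_bytes).digest()
    # Slice the 32-byte digest into TOKENS_PER_IMAGE 4-byte pseudo-token IDs.
    return [
        int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
        for i in range(TOKENS_PER_IMAGE)
    ]

def pad_input_ids(input_ids: list[int], images: list[bytes]) -> list[int]:
    """Expand each image placeholder into hash-derived token IDs,
    consuming images in order of appearance."""
    out, img_iter = [], iter(images)
    for tok in input_ids:
        if tok == IMAGE_TOKEN_ID:
            out.extend(image_hash_tokens(next(img_iter)))
        else:
            out.append(tok)
    return out
```

With `TOKENS_PER_IMAGE = 4`, a prompt `[1, IMAGE_TOKEN_ID, 2]` with one image expands to six tokens, and two requests carrying the same image bytes produce identical expansions.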
You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLMs. These models demonstrate how to properly handle both visual and textual inputs.
You should test the new vLM locally against the Hugging Face models. See [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) for an example.
### Test the correctness
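A common way to check correctness is to run the same benchmark split through both the SGLang port and the Hugging Face reference, then compare accuracies. The helpers below are a hypothetical sketch of that comparison (the function names and the 1% tolerance are assumptions, not part of the mmmu benchmark code); answers are assumed to be multiple-choice letters keyed by question ID.

```python
def score_accuracy(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Fraction of benchmark questions where the predicted answer letter
    matches the reference (case-insensitive, whitespace-tolerant)."""
    if not references:
        return 0.0
    hits = sum(
        predictions.get(qid, "").strip().upper() == ans.strip().upper()
        for qid, ans in references.items()
    )
    return hits / len(references)

def close_enough(sglang_acc: float, hf_acc: float, tol: float = 0.01) -> bool:
    """A port is usually considered correct when its benchmark accuracy
    lands within a small tolerance of the reference implementation's."""
    return abs(sglang_acc - hf_acc) <= tol
```

Small numerical differences (e.g., from attention kernels or dtype choices) are expected, which is why a tolerance comparison is more useful than exact output matching.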