model: Minicpmo (#3023)

Mick
2025-03-25 11:08:40 +08:00
committed by GitHub
parent 64129fa632
commit 1e86457c90
40 changed files with 2906 additions and 493 deletions


@@ -24,7 +24,7 @@
- InternLM 2
- Exaone 3
- BaiChuan2
-- MiniCPM / MiniCPM 3 / MiniCPMV
+- MiniCPM / MiniCPM 3 / MiniCPM-v / MiniCPM-o
- XVERSE / XVERSE MoE
- SmolLM
- GLM-4
@@ -70,9 +70,9 @@ LLM.
1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [
`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to
return True for your model.
-2. **Process Images**: Create a new `ImageProcessor` class that inherits from `BaseImageProcessor` and register this
+2. **Process Images**: Define a new `Processor` class that inherits from `BaseProcessor` and register this
processor as your model's dedicated processor. See [
-`image_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/image_processor.py)
+`multimodal_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py)
for more details.
3. **Handle Image Tokens**: Implement a `pad_input_ids` function for your new model, in which image tokens in the prompt
should be expanded and replaced with image-hashes, so that SGLang can recognize different images for
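
Steps 1–3 above boil down to three small hooks. The following is a minimal, self-contained sketch rather than the in-tree code: `MULTIMODAL_ARCHS`, `MyVLMProcessor`, the `BaseProcessor` stub, the placeholder token id, and the `pad_values` / `tokens_per_image` arguments are all hypothetical stand-ins, and the real base class in `multimodal_processor.py` exposes a richer (async) interface.

```python
# Hedged sketch of steps 1-3. Every name here (MULTIMODAL_ARCHS, the token
# id, MyVLMProcessor, pad_values) is a hypothetical stand-in, not SGLang's
# actual identifier.
from typing import List

# Step 1: is_multimodal_model returns True for the new architecture,
# assuming model_config.py dispatches on architecture-name strings.
MULTIMODAL_ARCHS = {
    "Qwen2VLForConditionalGeneration",
    "MyVLMForConditionalGeneration",  # hypothetical: your new model
}

def is_multimodal_model(model_architectures: List[str]) -> bool:
    return any(arch in MULTIMODAL_ARCHS for arch in model_architectures)

# Step 2: a dedicated processor registered for the model. The stub base
# class below only shows the shape; the real one lives in
# multimodal_processor.py.
class BaseProcessor:
    """Stand-in for SGLang's base multimodal processor."""

class MyVLMProcessor(BaseProcessor):
    image_token_id = 151655  # hypothetical placeholder token id

    def process(self, image_data, input_text):
        # The real processors run the HF processor and return pixel values
        # plus one hash per image for pad_input_ids to splice in.
        raise NotImplementedError

# Step 3: expand each image placeholder into that image's hash, repeated
# once per visual token, so the prefix cache can tell images apart.
def pad_input_ids(input_ids: List[int], pad_values: List[int],
                  tokens_per_image: int, image_token_id: int) -> List[int]:
    out: List[int] = []
    img_idx = 0
    for tok in input_ids:
        if tok == image_token_id and img_idx < len(pad_values):
            out.extend([pad_values[img_idx]] * tokens_per_image)
            img_idx += 1
        else:
            out.append(tok)
    return out

# Two images hashed to 9001 and 9002, three visual tokens each:
# [1, 9001, 9001, 9001, 42, 9002, 9002, 9002, 7]
print(pad_input_ids([1, 151655, 42, 151655, 7], [9001, 9002], 3, 151655))
```

The design point in step 3 is that the pads are per-image hashes rather than one shared placeholder id, which is what lets SGLang's prefix cache distinguish two requests that reuse the same prompt with different images.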
@@ -80,7 +80,7 @@ LLM.
4. Replace the multi-headed `Attention` of the ViT with SGLang's `VisionAttention` (see the sketch below).
You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other
-vLMs. These models demonstrate how to properly handle both visual and textual inputs.
+vLMs. These models demonstrate how to properly handle both multimodal and textual inputs.
You should test the new vLM locally against HF models. See [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) for an example.
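
For step 4, the swap is mechanical: wherever the ViT block constructs `nn.MultiheadAttention` (or the HF equivalent), build SGLang's `VisionAttention` instead. A rough sketch follows, assuming the import path used by `qwen2_vl.py`; the constructor keywords and the `forward` signature shown here are assumptions to verify against that file.

```python
import torch
import torch.nn as nn
# Import path as used by qwen2_vl.py; verify it against the current tree.
from sglang.srt.layers.attention.vision import VisionAttention

class VisionBlock(nn.Module):
    """One ViT block with its attention swapped for VisionAttention."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Before: self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # After: SGLang's vision attention. The keyword names below are
        # assumptions; copy the exact constructor call from qwen2_vl.py.
        self.attn = VisionAttention(
            embed_dim=dim,
            num_heads=num_heads,
            projection_size=dim,
            use_qkv_parallel=True,
        )

    def forward(self, x: torch.Tensor, cu_seqlens: torch.Tensor) -> torch.Tensor:
        # Residual attention over flattened patch sequences; rotary position
        # embeddings are omitted here, while the in-tree blocks pass them.
        return x + self.attn(self.norm1(x), cu_seqlens=cu_seqlens)
```

After the swap, compare outputs against the Hugging Face implementation on a handful of images before running the `mmmu` benchmark linked above.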