model: Minicpmo (#3023)

Mick
2025-03-25 11:08:40 +08:00
committed by GitHub
parent 64129fa632
commit 1e86457c90
40 changed files with 2906 additions and 493 deletions


@@ -24,7 +24,7 @@
- InternLM 2
- Exaone 3
- BaiChuan2
-- MiniCPM / MiniCPM 3 / MiniCPMV
+- MiniCPM / MiniCPM 3 / MiniCPM-v / MiniCPM-o
- XVERSE / XVERSE MoE
- SmolLM
- GLM-4
@@ -70,9 +70,9 @@ LLM.
1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [
`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to
return True for your model.
-2. **Process Images**: Create a new `ImageProcessor` class that inherits from `BaseImageProcessor` and register this
+2. **Process Images**: Define a new `Processor` class that inherits from `BaseProcessor` and register this
processor as your model's dedicated processor. See [
-`image_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/image_processor.py)
+`multimodal_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py)
for more details.
3. **Handle Image Tokens**: Implement a `pad_input_ids` function for your new model, in which image tokens in the prompt
should be expanded and replaced with image-hashes, so that SGLang can recognize different images for
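
Steps 1–3 above boil down to three small hooks. The following is a minimal, self-contained sketch rather than the in-tree code: `MULTIMODAL_ARCHS`, `MyVLMProcessor`, the `BaseProcessor` stub, the placeholder token id, and the `pad_values` / `tokens_per_image` arguments are all hypothetical stand-ins, and the real base class in `multimodal_processor.py` exposes a richer (async) interface.

```python
# Hedged sketch of steps 1-3. Every name here (MULTIMODAL_ARCHS, the token
# id, MyVLMProcessor, pad_values) is a hypothetical stand-in, not SGLang's
# actual identifier.
from typing import List

# Step 1: is_multimodal_model returns True for the new architecture,
# assuming model_config.py dispatches on architecture-name strings.
MULTIMODAL_ARCHS = {
    "Qwen2VLForConditionalGeneration",
    "MyVLMForConditionalGeneration",  # hypothetical: your new model
}

def is_multimodal_model(model_architectures: List[str]) -> bool:
    return any(arch in MULTIMODAL_ARCHS for arch in model_architectures)

# Step 2: a dedicated processor registered for the model. The stub base
# class below only shows the shape; the real one lives in
# multimodal_processor.py.
class BaseProcessor:
    """Stand-in for SGLang's base multimodal processor."""

class MyVLMProcessor(BaseProcessor):
    image_token_id = 151655  # hypothetical placeholder token id

    def process(self, image_data, input_text):
        # The real processors run the HF processor and return pixel values
        # plus one hash per image for pad_input_ids to splice in.
        raise NotImplementedError

# Step 3: expand each image placeholder into that image's hash, repeated
# once per visual token, so the prefix cache can tell images apart.
def pad_input_ids(input_ids: List[int], pad_values: List[int],
                  tokens_per_image: int, image_token_id: int) -> List[int]:
    out: List[int] = []
    img_idx = 0
    for tok in input_ids:
        if tok == image_token_id and img_idx < len(pad_values):
            out.extend([pad_values[img_idx]] * tokens_per_image)
            img_idx += 1
        else:
            out.append(tok)
    return out

# Two images hashed to 9001 and 9002, three visual tokens each:
# [1, 9001, 9001, 9001, 42, 9002, 9002, 9002, 7]
print(pad_input_ids([1, 151655, 42, 151655, 7], [9001, 9002], 3, 151655))
```

The design point in step 3 is that the pads are per-image hashes rather than one shared placeholder id, which is what lets SGLang's prefix cache distinguish two requests that reuse the same prompt with different images.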
@@ -80,7 +80,7 @@ LLM.
4. Replace the multi-headed `Attention` of the ViT with SGLang's `VisionAttention` (see the sketch below).
You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other
-vLMs. These models demonstrate how to properly handle both visual and textual inputs.
+vLMs. These models demonstrate how to properly handle both multimodal and textual inputs.
You should test the new vLM locally against HF models. See [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) for an example.
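
For step 4, the swap is mechanical: wherever the ViT block constructs `nn.MultiheadAttention` (or the HF equivalent), build SGLang's `VisionAttention` instead. A rough sketch follows, assuming the import path used by `qwen2_vl.py`; the constructor keywords and the `forward` signature shown here are assumptions to verify against that file.

```python
import torch
import torch.nn as nn
# Import path as used by qwen2_vl.py; verify it against the current tree.
from sglang.srt.layers.attention.vision import VisionAttention

class VisionBlock(nn.Module):
    """One ViT block with its attention swapped for VisionAttention."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Before: self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # After: SGLang's vision attention. The keyword names below are
        # assumptions; copy the exact constructor call from qwen2_vl.py.
        self.attn = VisionAttention(
            embed_dim=dim,
            num_heads=num_heads,
            projection_size=dim,
            use_qkv_parallel=True,
        )

    def forward(self, x: torch.Tensor, cu_seqlens: torch.Tensor) -> torch.Tensor:
        # Residual attention over flattened patch sequences; rotary position
        # embeddings are omitted here, while the in-tree blocks pass them.
        return x + self.attn(self.norm1(x), cu_seqlens=cu_seqlens)
```

After the swap, compare outputs against the Hugging Face implementation on a handful of images before running the `mmmu` benchmark linked above.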