From ced680663cce8dfa5663f5872543ce520d7ea749 Mon Sep 17 00:00:00 2001
From: Mick
Date: Wed, 12 Feb 2025 16:43:14 +0800
Subject: [PATCH] doc: Support a new vLM (#3405)

Co-authored-by: zhaochenyang20
---
 docs/references/supported_models.md | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/docs/references/supported_models.md b/docs/references/supported_models.md
index 66df55111..834428e04 100644
--- a/docs/references/supported_models.md
+++ b/docs/references/supported_models.md
@@ -48,12 +48,23 @@
 - InternLM2ForRewardModel
   - `python -m sglang.launch_server --model-path internlm/internlm2-7b-reward --is-embedding --trust-remote-code`
 
-## How to Support a New Model
+## How to Support a New Language Model
 
 To support a new model in SGLang, you only need to add a single file under [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models).
 You can learn from existing model implementations and create new files for the new models.
 For most models, you should be able to find a similar model to start with (e.g., starting from Llama).
 
+## How to Support a New Vision-Language Model
+
+To support a new vision-language model (VLM) in SGLang, you need to implement several components beyond what a standard language model requires. Minimal sketches of the code-level steps follow the list below.
+
+1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) so that it returns `True` for your model.
+2. **Process images**: Create a new `ImageProcessor` class that inherits from `BaseImageProcessor` and register it as your model's dedicated processor. See [`image_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/image_processor.py) for more details.
+3. **Handle image tokens**: Implement a `pad_input_ids` function for your new model. It should expand the image tokens in the prompt and replace them with per-image hash values, so that SGLang can tell different images apart when `RadixAttention` matches cached prefixes.
+4. **Adapt the ViT attention**: Replace the multi-headed `Attention` in your model's ViT with SGLang's `VisionAttention`.
+
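+For step 1, the change amounts to making the architecture check cover your model. The sketch below assumes `is_multimodal_model` receives the model's architecture strings; `MyVLMForConditionalGeneration` is a hypothetical name, and the exact upstream function body may differ:
+
+```python
+# python/sglang/srt/configs/model_config.py (sketch, not the exact upstream code)
+def is_multimodal_model(model_architectures):
+    multimodal_archs = [
+        "Qwen2VLForConditionalGeneration",
+        "MyVLMForConditionalGeneration",  # hypothetical: your new model
+    ]
+    return any(arch in multimodal_archs for arch in model_architectures)
+```
+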
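+For steps 2-4, most of the work lives in the image processor and the model file. The skeleton below is only a sketch: all `MyVLM*` names are hypothetical, and the exact `BaseImageProcessor`, `ImageInputs`, and `VisionAttention` signatures should be checked against your SGLang version:
+
+```python
+# Sketch of a new model file under python/sglang/srt/models/ (names hypothetical).
+from torch import nn
+
+from sglang.srt.layers.attention.vision import VisionAttention
+from sglang.srt.managers.image_processor import BaseImageProcessor
+
+
+class MyVLMImageProcessor(BaseImageProcessor):  # step 2: dedicated processor
+    async def process_images_async(self, image_data, input_text, request_obj, **kwargs):
+        # Preprocess the images into tensors and compute one hash per image;
+        # the hashes become the pad values consumed by pad_input_ids below.
+        ...
+
+
+class MyVLMVisionBlock(nn.Module):  # step 4: hypothetical ViT block
+    def __init__(self, embed_dim: int, num_heads: int):
+        super().__init__()
+        # Swap the ViT's original multi-headed attention for SGLang's
+        # VisionAttention; the constructor arguments shown here are
+        # assumptions and depend on your SGLang version.
+        self.attn = VisionAttention(
+            embed_dim=embed_dim,
+            num_heads=num_heads,
+            projection_size=embed_dim,
+            use_qkv_parallel=True,
+        )
+
+
+class MyVLMForConditionalGeneration(nn.Module):  # hypothetical model class
+    def pad_input_ids(self, input_ids, image_inputs):
+        # Step 3: expand each image placeholder to the number of tokens the
+        # ViT emits for that image, and fill the span with the per-image
+        # hash values (image_inputs.pad_values) so that RadixAttention can
+        # tell different images apart when matching cached prefixes.
+        ...
+```
+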
+You can refer to [Qwen2-VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other VLM implementations; these models demonstrate how to handle both visual and textual inputs properly.
+
 ### Test the correctness
 
 #### Interactive debugging