From ced680663cce8dfa5663f5872543ce520d7ea749 Mon Sep 17 00:00:00 2001
From: Mick
Date: Wed, 12 Feb 2025 16:43:14 +0800
Subject: [PATCH] doc: Support a new vLM (#3405)

Co-authored-by: zhaochenyang20
---
 docs/references/supported_models.md | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/docs/references/supported_models.md b/docs/references/supported_models.md
index 66df55111..834428e04 100644
--- a/docs/references/supported_models.md
+++ b/docs/references/supported_models.md
@@ -48,12 +48,23 @@
 - InternLM2ForRewardModel
   - `python -m sglang.launch_server --model-path internlm/internlm2-7b-reward --is-embedding --trust-remote-code`
 
-## How to Support a New Model
+## How to Support a New Language Model
 
 To support a new model in SGLang, you only need to add a single file under [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models).
 You can learn from existing model implementations and create new files for the new models.
 For most models, you should be able to find a similar model to start with (e.g., starting from Llama).
 
+## How to Support a New Vision-Language Model
+
+To support a new vision-language model (VLM) in SGLang, you need to implement several components beyond what a standard language model requires. Minimal sketches of the code-level steps follow the list below.
+
+1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) so that it returns `True` for your model.
+2. **Process images**: Create a new `ImageProcessor` class that inherits from `BaseImageProcessor` and register it as your model's dedicated processor. See [`image_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/image_processor.py) for more details.
+3. **Handle image tokens**: Implement a `pad_input_ids` function for your new model. It should expand the image tokens in the prompt and replace them with per-image hash values, so that SGLang can tell different images apart when `RadixAttention` matches cached prefixes.
+4. **Adapt the ViT attention**: Replace the multi-headed `Attention` in your model's ViT with SGLang's `VisionAttention`.
+
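+For step 1, the change amounts to making the architecture check cover your model. The sketch below assumes `is_multimodal_model` receives the model's architecture strings; `MyVLMForConditionalGeneration` is a hypothetical name, and the exact upstream function body may differ:
+
+```python
+# python/sglang/srt/configs/model_config.py (sketch, not the exact upstream code)
+def is_multimodal_model(model_architectures):
+    multimodal_archs = [
+        "Qwen2VLForConditionalGeneration",
+        "MyVLMForConditionalGeneration",  # hypothetical: your new model
+    ]
+    return any(arch in multimodal_archs for arch in model_architectures)
+```
+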
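+For steps 2-4, most of the work lives in the image processor and the model file. The skeleton below is only a sketch: all `MyVLM*` names are hypothetical, and the exact `BaseImageProcessor`, `ImageInputs`, and `VisionAttention` signatures should be checked against your SGLang version:
+
+```python
+# Sketch of a new model file under python/sglang/srt/models/ (names hypothetical).
+from torch import nn
+
+from sglang.srt.layers.attention.vision import VisionAttention
+from sglang.srt.managers.image_processor import BaseImageProcessor
+
+
+class MyVLMImageProcessor(BaseImageProcessor):  # step 2: dedicated processor
+    async def process_images_async(self, image_data, input_text, request_obj, **kwargs):
+        # Preprocess the images into tensors and compute one hash per image;
+        # the hashes become the pad values consumed by pad_input_ids below.
+        ...
+
+
+class MyVLMVisionBlock(nn.Module):  # step 4: hypothetical ViT block
+    def __init__(self, embed_dim: int, num_heads: int):
+        super().__init__()
+        # Swap the ViT's original multi-headed attention for SGLang's
+        # VisionAttention; the constructor arguments shown here are
+        # assumptions and depend on your SGLang version.
+        self.attn = VisionAttention(
+            embed_dim=embed_dim,
+            num_heads=num_heads,
+            projection_size=embed_dim,
+            use_qkv_parallel=True,
+        )
+
+
+class MyVLMForConditionalGeneration(nn.Module):  # hypothetical model class
+    def pad_input_ids(self, input_ids, image_inputs):
+        # Step 3: expand each image placeholder to the number of tokens the
+        # ViT emits for that image, and fill the span with the per-image
+        # hash values (image_inputs.pad_values) so that RadixAttention can
+        # tell different images apart when matching cached prefixes.
+        ...
+```
+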
+You can refer to [Qwen2-VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other VLM implementations; these models demonstrate how to handle both visual and textual inputs properly.
+
 ### Test the correctness
 
 #### Interactive debugging