[Docs] Supported Model Docs - Major restructuring (#5290)

Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-04-11 21:47:47 +05:30
parent c35dcfdb30
commit 4aa6bab0b0
8 changed files with 239 additions and 2 deletions
--- a/docs/backend/openai_api_vision.ipynb
+++ b/docs/backend/openai_api_vision.ipynb
@@ -10,7 +10,7 @@
    "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n",
    "This tutorial covers the vision APIs for vision language models.\n",
    "\n",
-    "SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](https://docs.sglang.ai/references/supported_models): \n",
+    "SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](https://docs.sglang.ai/supported_models/vision_language_models): \n",
    "- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)  \n",
    "- [lmms-lab/llava-onevision-qwen2-72b-ov-chat](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-chat)  \n",
    "- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)\n",
--- a/docs/frontend/frontend.ipynb
+++ b/docs/frontend/frontend.ipynb
@@ -393,7 +393,7 @@
    "## Multi-modal Generation\n",
    "\n",
    "You may use SGLang frontend language to define multi-modal prompts.\n",
-    "See [here](https://docs.sglang.ai/references/supported_models.html) for supported models."
+    "See [here](https://docs.sglang.ai/supported_models/generative_models.html) for supported models."
   ]
  },
  {
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -33,6 +33,16 @@ The core features include:
   backend/hyperparameter_tuning.md
   backend/structured_outputs_for_reasoning_models.ipynb

+.. toctree::
+   :maxdepth: 1
+   :caption: Supported Models
+
+   supported_models/generative_models.md
+   supported_models/vision_language_models.md
+   supported_models/embedding_models.md
+   supported_models/reward_models.md
+   supported_models/support_new_models.md
+
 .. toctree::
   :maxdepth: 1
   :caption: Advanced Features
--- a/docs/supported_models/embedding_models.md
+++ b/docs/supported_models/embedding_models.md
@@ -0,0 +1,27 @@
+# Embedding Models
+
+SGLang provides robust support for embedding models by integrating efficient serving mechanisms with its flexible programming interface. This integration allows for streamlined handling of embedding tasks, facilitating faster and more accurate retrieval and semantic search operations. SGLang's architecture enables better resource utilization and reduced latency in embedding model deployment.
+
+```{important}
+They are executed with `--is-embedding` and some may require `--trust-remote-code` and/or `--chat-template`
+```
+
+## Example launch Command
+
+```shell
+python3 -m sglang.launch_server \
+  --model-path Alibaba-NLP/gme-Qwen2-VL-2B-Instruct \  # example HF/local path
+  --is-embedding \
+  --host 0.0.0.0 \
+  --chat-template gme-qwen2-vl \                     # set chat template
+  --port 30000 \
+```
+
+## Supporting Matrixs
+
+| Model Family (Embedding)                        | Example HuggingFace Identifier                | Chat Template | Description                                                                                                                          |
+|-------------------------------------------------|-----------------------------------------------|---------------|--------------------------------------------------------------------------------------------------------------------------------------|
+| **Llama/Mistral based (E5EmbeddingModel)**      | `intfloat/e5-mistral-7b-instruct`             | N/A           | Mistral/Llama-based embedding model fine‑tuned for high‑quality text embeddings (top‑ranked on the MTEB benchmark).                   |
+| **GTE (QwenEmbeddingModel)**                    | `Alibaba-NLP/gte-Qwen2-7B-instruct`           | N/A           | Alibaba’s general text embedding model (7B), achieving state‑of‑the‑art multilingual performance in English and Chinese.             |
+| **GME (MultimodalEmbedModel)**                  | `Alibaba-NLP/gme-Qwen2-VL-2B-Instruct`        | `gme-qwen2-vl`          | Multimodal embedding model (2B) based on Qwen2‑VL, encoding image + text into a unified vector space for cross‑modal retrieval.      |
+| **CLIP (CLIPEmbeddingModel)**                   | `openai/clip-vit-large-patch14-336`           | N/A           | OpenAI’s CLIP model (ViT‑L/14) for embedding images (and text) into a joint latent space; widely used for image similarity search.   |
--- a/docs/supported_models/generative_models.md
+++ b/docs/supported_models/generative_models.md
@@ -0,0 +1,37 @@
+# Large Language Models
+
+These models accept text input and produce text output (e.g., chat completions). They are primarily large language models (LLMs), some with mixture-of-experts (MoE) architectures for scaling.
+
+## Example launch Command
+
+```shell
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-1B-Instruct \  # example HF/local path
+  --host 0.0.0.0 \
+  --port 30000 \
+```
+
+## Supporting Matrixs
+
+
+| Model Family (Variants)             | Example HuggingFace Identifier                     | Description                                                                            |
+|-------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------------|
+| **DeepSeek** (v1, v2, v3/R1)        | `deepseek-ai/DeepSeek-R1`                        | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides Deepseek v3/R1 model-specific optimizations](https://docs.sglang.ai/references/deepseek)|
+| **Qwen** (2, 2.5 series, MoE)       | `Qwen/Qwen2.5-14B-Instruct`                       | Alibaba’s Qwen model family (7B to 72B) with SOTA performance; Qwen2.5 series improves multilingual capability and includes base, instruct, MoE, and code-tuned variants. |
+| **Llama** (2, 3.x, 4 series)        | `meta-llama/Llama-4-Scout-17B-16E-Instruct`       | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](https://docs.sglang.ai/references/llama4)  |
+| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2`             | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. |
+| **Gemma** (v1, v2, v3)              | `google/gemma-3-1b-it`                            | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
+| **Phi** (Phi-3, Phi-4 series)      | `microsoft/Phi-4-multimodal-instruct`            | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-mini is a high-accuracy text model and Phi-4-multimodal (5.6B) processes text, images, and speech in one compact model. |
+| **MiniCPM** (v3, 4B)               | `openbmb/MiniCPM3-4B`                            | OpenBMB’s series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results in text tasks. |
+| **OLMoE** (Open MoE)               | `allenai/OLMoE-1B-7B-0924`                       | Allen AI’s open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation. |
+| **StableLM** (3B, 7B)               | `stabilityai/stablelm-tuned-alpha-7b`            | StabilityAI’s early open-source LLM (3B & 7B) for general text generation; a demonstration model with basic instruction-following ability. |
+| **Command-R** (Cohere)              | `CohereForAI/c4ai-command-r-v01`                 | Cohere’s open conversational LLM (Command series) optimized for long context, retrieval-augmented generation, and tool use. |
+| **DBRX** (Databricks)              | `databricks/dbrx-instruct`                       | Databricks’ 132B-parameter MoE model (36B active) trained on 12T tokens; competes with GPT-3.5 quality as a fully open foundation model. |
+| **Grok** (xAI)                     | `xai-org/grok-1`                                | xAI’s grok-1 model known for vast size(314B parameters) and high quality; integrated in SGLang for high-performance inference. |
+| **ChatGLM** (GLM-130B family)       | `THUDM/chatglm2-6b`                              | Zhipu AI’s bilingual chat model (6B) excelling at Chinese-English dialogue; fine-tuned for conversational quality and alignment. |
+| **InternLM 2** (7B, 20B)           | `internlm/internlm2-7b`                          | Next-gen InternLM (7B and 20B) from SenseTime, offering strong reasoning and ultra-long context support (up to 200K tokens). |
+| **ExaONE 3** (Korean-English)      | `LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct`           | LG AI Research’s Korean-English model (7.8B) trained on 8T tokens; provides high-quality bilingual understanding and generation. |
+| **Baichuan 2** (7B, 13B)           | `baichuan-inc/Baichuan2-13B-Chat`                | BaichuanAI’s second-generation Chinese-English LLM (7B/13B) with improved performance and an open commercial license. |
+| **XVERSE** (MoE)                   | `xverse/XVERSE-MoE-A36B`                         | Yuanxiang’s open MoE LLM (XVERSE-MoE-A36B: 255B total, 36B active) supporting ~40 languages; delivers 100B+ dense-level performance via expert routing. |
+| **SmolLM** (135M–1.7B)            | `HuggingFaceTB/SmolLM-1.7B`                      | Hugging Face’s ultra-small LLM series (135M–1.7B params) offering surprisingly strong results, enabling advanced AI on mobile/edge devices. |
+| **GLM-4** (Multilingual 9B)        | `ZhipuAI/glm-4-9b-chat`                          | Zhipu’s GLM-4 series (up to 9B parameters) – open multilingual models with support for 1M-token context and even a 5.6B multimodal variant (Phi-4V). |
--- a/docs/supported_models/reward_models.md
+++ b/docs/supported_models/reward_models.md
@@ -0,0 +1,28 @@
+# Reward Models
+
+These models output a scalar reward score or classification result, often used in reinforcement learning or content moderation tasks.
+
+```{important}
+They are executed with `--is-embedding` and some may require `--trust-remote-code`.
+```
+
+## Example launch Command
+
+```shell
+python3 -m sglang.launch_server \
+  --model-path Qwen/Qwen2.5-Math-RM-72B \  # example HF/local path
+  --is-embedding \
+  --host 0.0.0.0 \
+  --tp-size=4 \                          # set for tensor parallelism
+  --port 30000 \
+```
+
+## Supporting Matrixs
+
+| Model Family (Reward)                                                     | Example HuggingFace Identifier                              | Description                                                                     |
+|---------------------------------------------------------------------------|-----------------------------------------------------|---------------------------------------------------------------------------------|
+| **Llama (3.1 Reward / `LlamaForSequenceClassification`)**                   | `Skywork/Skywork-Reward-Llama-3.1-8B-v0.2`            | Reward model (preference classifier) based on Llama 3.1 (8B) for scoring and ranking responses for RLHF.  |
+| **Gemma 2 (27B Reward / `Gemma2ForSequenceClassification`)**                | `Skywork/Skywork-Reward-Gemma-2-27B-v0.2`             | Derived from Gemma‑2 (27B), this model provides human preference scoring for RLHF and multilingual tasks.  |
+| **InternLM 2 (Reward / `InternLM2ForRewardMode`)**                         | `internlm/internlm2-7b-reward`                       | InternLM 2 (7B)–based reward model used in alignment pipelines to guide outputs toward preferred behavior.  |
+| **Qwen2.5 (Reward - Math / `Qwen2ForRewardModel`)**                         | `Qwen/Qwen2.5-Math-RM-72B`                           | A 72B math-specialized RLHF reward model from the Qwen2.5 series, tuned for evaluating and refining responses.  |
+| **Qwen2.5 (Reward - Sequence / `Qwen2ForSequenceClassification`)**          | `jason9693/Qwen2.5-1.5B-apeach`                      | A smaller Qwen2.5 variant used for sequence classification, offering an alternative RLHF scoring mechanism.  |
--- a/docs/supported_models/support_new_models.md
+++ b/docs/supported_models/support_new_models.md
@@ -0,0 +1,104 @@
+# How to Support New Models
+
+This document explains how to add support for new language models and vision‐language models (VLMs) in SGLang. It also covers how to test new models and register external implementations.
+
+## How to Support a new Language Model
+
+To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sgllang/srt/models). You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start with (e.g., starting from Llama). Also refer how to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang)
+
+## How to Support a new Vision-Language model
+
+To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM support:
+
+1. **Register your new model as multimodal**:
+   Extend `is_multimodal_model` in [model_config.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return `True` for your model.
+
+2. **Process Images**:
+   Define a new `Processor` class that inherits from `BaseProcessor` and register this processor as your model’s dedicated processor. See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details.
+
+3. **Handle Image Tokens**:
+   Implement a `pad_input_ids` function for your new model. In this function, image tokens in the prompt should be expanded and replaced with image-hashes so that SGLang can recognize different images when using `RadixAttention`.
+
+4. **Replace Vision Attention**:
+   Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.
+
+You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
+
+You should test the new vLM locally against Hugging Face models. See the [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
+
+## Test the Correctness
+
+### Interactive Debugging
+
+For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits:
+
+- Get the reference output:
+  ```bash
+  python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm}
+  ```
+- Get the SGLang output:
+  ```bash
+  python3 -m sglang.bench_one_batch --correct --model [new model]
+  ```
+
+### Add the Model to the Test Suite
+
+To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR.
+
+This is the command to test a new model on your local machine:
+
+```bash
+ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
+```
+
+## Port a Model from vLLM to SGLang
+
+The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models from vLLM to SGLang.
+
+To port a model from vLLM to SGLang:
+
+- Compare these two files for guidance:
+  - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sgllang/srt/models/llama.py)
+  - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py)
+- The major differences include:
+  - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`).
+  - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.**
+  - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.**
+  - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
+  - **Remove `Sample`.**
+  - **Change the `forward()` functions** and add a `forward_batch()` method.
+  - **Add `EntryClass`** at the end.
+  - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
+
+## Registering an External Model Implementation
+
+In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. This allows you to integrate your model without modifying the source code.
+
+For example:
+
+```python
+from sglang.srt.models.registry import ModelRegistry
+from sglang.srt.entrypoints.http_server import launch_server
+
+# For a single model, add it to the registry:
+ModelRegistry.models[model_name] = model_class
+
+# For multiple models, you can imitate the import_model_classes() function:
+from functools import lru_cache
+
+@lru_cache()
+def import_new_model_classes():
+    model_arch_name_to_cls = {}
+    # Populate model_arch_name_to_cls with your new model classes.
+    ...
+    return model_arch_name_to_cls
+
+ModelRegistry.models.update(import_new_model_classes())
+
+# Launch the server with your server arguments:
+launch_server(server_args)
+```
+
+---
+
+By following these guidelines, you can add support for new language models and vision-language models in SGLang and ensure they are thoroughly tested and easily integrated into the system.
--- a/docs/supported_models/vision_language_models.md
+++ b/docs/supported_models/vision_language_models.md
@@ -0,0 +1,31 @@
+# Vision Language Models
+
+These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with visual encoders and require a specific chat template for handling vision prompts.
+
+```{important}
+We need to specify `--chat-template` for VLMs because the chat template provided in HuggingFace tokenizer only supports text. If you do not specify a vision model’s `--chat-template`, the server uses HuggingFace’s default template, which only supports text and the images won’t be passed in.
+```
+
+## Example launch Command
+
+```shell
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \  # example HF/local path
+  --chat-template llama_3_vision \                        # required chat template
+  --host 0.0.0.0 \
+  --port 30000 \
+```
+
+## Supporting Matrixs
+
+| Model Family (Variants)        | Example HuggingFace Identifier                     | Chat Template        | Description                                                                            |
+|--------------------------------|--------------------------------------------------|----------------------|----------------------------------------------------------------------------------------|
+| **Qwen-VL** (Qwen2 series)     | `Qwen/Qwen2.5-VL-7B-Instruct`                    | `qwen2-vl`           | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
+| **DeepSeek-VL2**               | `deepseek-ai/deepseek-vl2`                       | `deepseek-vl2`       | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
+| **Janus-Pro** (1B, 7B)      | `deepseek-ai/Janus-Pro-7B`                     | `janus-pro`       | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
+| **MiniCPM-V / MiniCPM-o**      | `openbmb/MiniCPM-V-2_6`                          | `minicpmv`           | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
+| **Llama 3.2 Vision** (11B)     | `meta-llama/Llama-3.2-11B-Vision-Instruct`        | `llama_3_vision`     | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
+| **LLaVA** (v1.5 & v1.6)        | *e.g.* `liuhaotian/llava-v1.5-13b`               | `vicuna_v1.1`        | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
+| **LLaVA-NeXT** (8B, 72B)       | `lmms-lab/llava-next-72b`                        | `chatml-llava`       | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
+| **LLaVA-OneVision**            | `lmms-lab/llava-onevision-qwen2-7b-ov`           | `chatml-llava`       | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
+| **Gemma 3 (Multimodal)**       | `google/gemma-3-4b-it`                           | `gemma-it`           | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |