adapt to sglang v0.5.2rc1 on dcu

This commit is contained in:
maxiao
2025-09-04 15:56:33 +08:00
commit 909abb58f5
2320 changed files with 489411 additions and 0 deletions

View File

@@ -0,0 +1,87 @@
# Embedding Models
SGLang provides robust support for embedding models by integrating efficient serving mechanisms with its flexible programming interface. This integration allows for streamlined handling of embedding tasks, facilitating faster and more accurate retrieval and semantic search operations. SGLang's architecture enables better resource utilization and reduced latency in embedding model deployment.
```{important}
Embedding models are executed with `--is-embedding` flag and some may require `--trust-remote-code`
```
## Quick Start
### Launch Server
```shell
python3 -m sglang.launch_server \
--model-path Qwen/Qwen3-Embedding-4B \
--is-embedding \
--host 0.0.0.0 \
--port 30000
```
### Client Request
```python
import requests
url = "http://127.0.0.1:30000"
payload = {
"model": "Qwen/Qwen3-Embedding-4B",
"input": "What is the capital of France?",
"encoding_format": "float"
}
response = requests.post(url + "/v1/embeddings", json=payload).json()
print("Embedding:", response["data"][0]["embedding"])
```
## Multimodal Embedding Example
For multimodal models like GME that support both text and images:
```shell
python3 -m sglang.launch_server \
--model-path Alibaba-NLP/gme-Qwen2-VL-2B-Instruct \
--is-embedding \
--chat-template gme-qwen2-vl \
--host 0.0.0.0 \
--port 30000
```
```python
import requests
url = "http://127.0.0.1:30000"
text_input = "Represent this image in embedding space."
image_path = "https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/023.jpg"
payload = {
"model": "gme-qwen2-vl",
"input": [
{
"text": text_input
},
{
"image": image_path
}
],
}
response = requests.post(url + "/v1/embeddings", json=payload).json()
print("Embeddings:", [x.get("embedding") for x in response.get("data", [])])
```
## Supported Models
| Model Family | Example Model | Chat Template | Description |
| ------------------------------------------ | -------------------------------------- | ------------- | --------------------------------------------------------------------------- |
| **E5 (Llama/Mistral based)** | `intfloat/e5-mistral-7b-instruct` | N/A | High-quality text embeddings based on Mistral/Llama architectures |
| **GTE-Qwen2** | `Alibaba-NLP/gte-Qwen2-7B-instruct` | N/A | Alibaba's text embedding model with multilingual support |
| **Qwen3-Embedding** | `Qwen/Qwen3-Embedding-4B` | N/A | Latest Qwen3-based text embedding model for semantic representation |
| **BGE** | `BAAI/bge-large-en-v1.5` | N/A | BAAI's text embeddings (requires `attention-backend` triton/torch_native) |
| **GME (Multimodal)** | `Alibaba-NLP/gme-Qwen2-VL-2B-Instruct`| `gme-qwen2-vl`| Multimodal embedding for text and image cross-modal tasks |
| **CLIP** | `openai/clip-vit-large-patch14-336` | N/A | OpenAI's CLIP for image and text embeddings |

View File

@@ -0,0 +1,55 @@
# Large Language Models
These models accept text input and produce text output (e.g., chat completions). They are primarily large language models (LLMs), some with mixture-of-experts (MoE) architectures for scaling.
## Example launch Command
```shell
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Instruct \ # example HF/local path
--host 0.0.0.0 \
--port 30000 \
```
## Supported models
Below the supported models are summarized in a table.
If you are unsure if a specific architecture is implemented, you can search for it via GitHub. For example, to search for `Qwen3ForCausalLM`, use the expression:
```
repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen3ForCausalLM
```
in the GitHub search bar.
| Model Family (Variants) | Example HuggingFace Identifier | Description |
|-------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------------|
| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides Deepseek v3/R1 model-specific optimizations](../basic_usage/deepseek.md) and [Reasoning Parser](../advanced_features/separate_reasoning.ipynb)|
| **Qwen** (3, 3MoE, 2.5, 2 series) | `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B` | Alibabas latest Qwen3 series for complex reasoning, language understanding, and generation tasks; Support for MoE variants along with previous generation 2.5, 2, etc. [SGLang provides Qwen3 specific reasoning parser](../advanced_features/separate_reasoning.ipynb)|
| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](../basic_usage/llama4.md) |
| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. |
| **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Googles family of efficient multilingual models (1B27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsofts Phi family of small models (1.3B5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech, Phi-4-mini is a high-accuracy text model and Phi-3.5-MoE is a mixture-of-experts model. |
| **MiniCPM** (v3, 4B) | `openbmb/MiniCPM3-4B` | OpenBMBs series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results in text tasks. |
| **OLMoE** (Open MoE) | `allenai/OLMoE-1B-7B-0924` | Allen AIs open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation. |
| **StableLM** (3B, 7B) | `stabilityai/stablelm-tuned-alpha-7b` | StabilityAIs early open-source LLM (3B & 7B) for general text generation; a demonstration model with basic instruction-following ability. |
| **Command-R** (Cohere) | `CohereForAI/c4ai-command-r-v01` | Coheres open conversational LLM (Command series) optimized for long context, retrieval-augmented generation, and tool use. |
| **DBRX** (Databricks) | `databricks/dbrx-instruct` | Databricks 132B-parameter MoE model (36B active) trained on 12T tokens; competes with GPT-3.5 quality as a fully open foundation model. |
| **Grok** (xAI) | `xai-org/grok-1` | xAIs grok-1 model known for vast size(314B parameters) and high quality; integrated in SGLang for high-performance inference. |
| **ChatGLM** (GLM-130B family) | `THUDM/chatglm2-6b` | Zhipu AIs bilingual chat model (6B) excelling at Chinese-English dialogue; fine-tuned for conversational quality and alignment. |
| **InternLM 2** (7B, 20B) | `internlm/internlm2-7b` | Next-gen InternLM (7B and 20B) from SenseTime, offering strong reasoning and ultra-long context support (up to 200K tokens). |
| **ExaONE 3** (Korean-English) | `LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct` | LG AI Researchs Korean-English model (7.8B) trained on 8T tokens; provides high-quality bilingual understanding and generation. |
| **Baichuan 2** (7B, 13B) | `baichuan-inc/Baichuan2-13B-Chat` | BaichuanAIs second-generation Chinese-English LLM (7B/13B) with improved performance and an open commercial license. |
| **XVERSE** (MoE) | `xverse/XVERSE-MoE-A36B` | Yuanxiangs open MoE LLM (XVERSE-MoE-A36B: 255B total, 36B active) supporting ~40 languages; delivers 100B+ dense-level performance via expert routing. |
| **SmolLM** (135M1.7B) | `HuggingFaceTB/SmolLM-1.7B` | Hugging Faces ultra-small LLM series (135M1.7B params) offering surprisingly strong results, enabling advanced AI on mobile/edge devices. |
| **GLM-4** (Multilingual 9B) | `ZhipuAI/glm-4-9b-chat` | Zhipus GLM-4 series (up to 9B parameters) open multilingual models with support for 1M-token context and even a 5.6B multimodal variant (Phi-4V). |
| **MiMo** (7B series) | `XiaomiMiMo/MiMo-7B-RL` | Xiaomi's reasoning-optimized model series, leverages Multiple-Token Prediction for faster inference. |
| **ERNIE-4.5** (4.5, 4.5MoE series) | `baidu/ERNIE-4.5-21B-A3B-PT` | Baidu's ERNIE-4.5 series which consists of MoE with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model. |
| **Arcee AFM-4.5B** | `arcee-ai/AFM-4.5B-Base` | Arcee's foundational model series for real world reliability and edge deployments. |
| **Persimmon** (8B) | `adept/persimmon-8b-chat` | Adepts open 8B model with a 16K context window and fast inference; trained for broad usability and licensed under Apache 2.0. |
| **Ling** (16.8B290B) | `inclusionAI/Ling-lite`, `inclusionAI/Ling-plus` | InclusionAIs open MoE models. Ling-Lite has 16.8B total / 2.75B active parameters, and Ling-Plus has 290B total / 28.8B active parameters. They are designed for high performance on NLP and complex reasoning tasks. |
| **Granite 3.0, 3.1** (IBM) | `ibm-granite/granite-3.1-8b-instruct` | IBM's open dense foundation models optimized for reasoning, code, and business AI use cases. Integrated with Red Hat and watsonx systems. |
| **Granite 3.0 MoE** (IBM) | `ibm-granite/granite-3.0-3b-a800m-instruct` | IBMs Mixture-of-Experts models offering strong performance with cost-efficiency. MoE expert routing designed for enterprise deployment at scale. |
| **Llama Nemotron Super** (v1, v1.5, NVIDIA) | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, `nvidia/Llama-3_3-Nemotron-Super-49B-v1_5` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family builds on the strongest open models in the ecosystem by enhancing them with greater accuracy, efficiency, and transparency using NVIDIA open synthetic datasets, advanced techniques, and tools. This enables the creation of practical, right-sized, and high-performing AI agents. |
| **Llama Nemotron Ultra** (v1, NVIDIA) | `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family builds on the strongest open models in the ecosystem by enhancing them with greater accuracy, efficiency, and transparency using NVIDIA open synthetic datasets, advanced techniques, and tools. This enables the creation of practical, right-sized, and high-performing AI agents. |

View File

@@ -0,0 +1,28 @@
# Use Models From ModelScope
To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable `SGLANG_USE_MODELSCOPE`.
```bash
export SGLANG_USE_MODELSCOPE=true
```
We take [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) as an example.
Launch the Server:
```bash
python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```
Or start it by docker:
```bash
docker run --gpus all \
-p 30000:30000 \
-v ~/.cache/modelscope:/root/.cache/modelscope \
--env "SGLANG_USE_MODELSCOPE=true" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```
Note that modelscope uses a different cache directory than huggingface. You may need to set it manually to avoid running out of disk space.

View File

@@ -0,0 +1,42 @@
# Multimodal Language Models
These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders.
## Example launch Command
```shell
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \ # example HF/local path
--host 0.0.0.0 \
--port 30000 \
```
## Supported models
Below the supported models are summarized in a table.
If you are unsure if a specific architecture is implemented, you can search for it via GitHub. For example, to search for `Qwen2_5_VLForConditionalGeneration`, use the expression:
```
repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen2_5_VLForConditionalGeneration
```
in the GitHub search bar.
| Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description |
|----------------------------|--------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Qwen-VL** (Qwen2 series) | `Qwen/Qwen2.5-VL-7B-Instruct` | `qwen2-vl` | Alibabas vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
| **DeepSeek-VL2** | `deepseek-ai/deepseek-vl2` | `deepseek-vl2` | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
| **Janus-Pro** (1B, 7B) | `deepseek-ai/Janus-Pro-7B` | `janus-pro` | DeepSeeks open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
| **MiniCPM-V / MiniCPM-o** | `openbmb/MiniCPM-V-2_6` | `minicpmv` | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
| **LLaVA** (v1.5 & v1.6) | *e.g.* `liuhaotian/llava-v1.5-13b` | `vicuna_v1.1` | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
| **LLaVA-NeXT** (8B, 72B) | `lmms-lab/llava-next-72b` | `chatml-llava` | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
| **LLaVA-OneVision** | `lmms-lab/llava-onevision-qwen2-7b-ov` | `chatml-llava` | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
| **Gemma 3 (Multimodal)** | `google/gemma-3-4b-it` | `gemma-it` | Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
| **Kimi-VL** (A3B) | `moonshotai/Kimi-VL-A3B-Instruct` | `kimi-vl` | Kimi-VL is a multimodal model that can understand and generate text from images. |
| **Mistral-Small-3.1-24B** | `mistralai/Mistral-Small-3.1-24B-Instruct-2503` | `mistral` | Mistral 3.1 is a multimodal model that can generate text from text or images input. It also supports tool calling and structured output. |
| **Phi-4-multimodal-instruct** | `microsoft/Phi-4-multimodal-instruct` | `phi-4-mm` | Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. It supports text, vision and audio modalities in SGLang. |
| **MiMo-VL** (7B) | `XiaomiMiMo/MiMo-VL-7B-RL` | `mimo-vl` | Xiaomi's compact yet powerful vision-language model featuring a native resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. |
| **GLM-4.5V** (106B) / **GLM-4.1V**(9B) | `zai-org/GLM-4.5V` | `glm-4v` | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning |

View File

@@ -0,0 +1,49 @@
# Rerank Models
SGLang offers comprehensive support for rerank models by incorporating optimized serving frameworks with a flexible programming interface. This setup enables efficient processing of cross-encoder reranking tasks, improving the accuracy and relevance of search result ordering. SGLangs design ensures high throughput and low latency during reranker model deployment, making it ideal for semantic-based result refinement in large-scale retrieval systems.
```{important}
They are executed with `--is-embedding` and some may require `--trust-remote-code`
```
## Example Launch Command
```shell
python3 -m sglang.launch_server \
--model-path BAAI/bge-reranker-v2-m3 \
--host 0.0.0.0 \
--disable-radix-cache \
--chunked-prefill-size -1 \
--attention-backend triton \
--is-embedding \
--port 30000
```
## Example Client Request
```python
import requests
url = "http://127.0.0.1:30000/v1/rerank"
payload = {
"model": "BAAI/bge-reranker-v2-m3",
"query": "what is panda?",
"documents": [
"hi",
"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."
]
}
response = requests.post(url, json=payload)
response_json = response.json()
for item in response_json:
print(f"Score: {item['score']:.2f} - Document: '{item['document']}'")
```
## Supported rerank models
| Model Family (Rerank) | Example HuggingFace Identifier | Chat Template | Description |
|------------------------------------------------|--------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------------------|
| **BGE-Reranker (BgeRerankModel)** | `BAAI/bge-reranker-v2-m3` | N/A | Currently only support `attention-backend` `triton` and `torch_native`. high-performance cross-encoder reranker model from BAAI. Suitable for reranking search results based on semantic relevance. |

View File

@@ -0,0 +1,28 @@
# Reward Models
These models output a scalar reward score or classification result, often used in reinforcement learning or content moderation tasks.
```{important}
They are executed with `--is-embedding` and some may require `--trust-remote-code`.
```
## Example launch Command
```shell
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2.5-Math-RM-72B \ # example HF/local path
--is-embedding \
--host 0.0.0.0 \
--tp-size=4 \ # set for tensor parallelism
--port 30000 \
```
## Supported models
| Model Family (Reward) | Example HuggingFace Identifier | Description |
|---------------------------------------------------------------------------|-----------------------------------------------------|---------------------------------------------------------------------------------|
| **Llama (3.1 Reward / `LlamaForSequenceClassification`)** | `Skywork/Skywork-Reward-Llama-3.1-8B-v0.2` | Reward model (preference classifier) based on Llama 3.1 (8B) for scoring and ranking responses for RLHF. |
| **Gemma 2 (27B Reward / `Gemma2ForSequenceClassification`)** | `Skywork/Skywork-Reward-Gemma-2-27B-v0.2` | Derived from Gemma2 (27B), this model provides human preference scoring for RLHF and multilingual tasks. |
| **InternLM 2 (Reward / `InternLM2ForRewardMode`)** | `internlm/internlm2-7b-reward` | InternLM 2 (7B)based reward model used in alignment pipelines to guide outputs toward preferred behavior. |
| **Qwen2.5 (Reward - Math / `Qwen2ForRewardModel`)** | `Qwen/Qwen2.5-Math-RM-72B` | A 72B math-specialized RLHF reward model from the Qwen2.5 series, tuned for evaluating and refining responses. |
| **Qwen2.5 (Reward - Sequence / `Qwen2ForSequenceClassification`)** | `jason9693/Qwen2.5-1.5B-apeach` | A smaller Qwen2.5 variant used for sequence classification, offering an alternative RLHF scoring mechanism. |

View File

@@ -0,0 +1,144 @@
# How to Support New Models
This document explains how to add support for new language models and multimodal large language models (MLLMs) in
SGLang. It also covers how to test new models and register external implementations.
## How to Support a New Language Model
To support a new model in SGLang, you only need to add a single file under
the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn
from existing model implementations and create a new file for your model. For most models, you should be able to find a
similar model to start with (e.g., starting from Llama). Also refer how
to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang)
## How to Support a New Multimodal Large Language Model
To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the
standard LLM support:
1. **Register your new model as multimodal**:
Extend `is_multimodal_model`
in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561)
to return `True` for your model.
2. **Register a new chat-template**:
Only when your default chat-template is unable to accept images as input: Register a new chat template in [conversation.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/conversation.py) and the corresponding matching function.
3. **Multimodal Data Processor**:
Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your
models dedicated processor.
See [multimodal_processor.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors)
for more details.
4. **Handle Multimodal Tokens**:
Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be
expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data
with `RadixAttention`.
5. **Handle Image Feature Extraction**:
Implement a `get_image_feature` function for your new model, which extracts image features from raw image data and converts them into the embeddings used by the language model.
6. **Adapt to Vision Attention**:
Adapt the multi-headed `Attention` of ViT with SGLangs `VisionAttention`.
You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or
other mllm implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
## Testing and Debugging
Please note all your testing and benchmarking results in PR description.
### Interactive Debugging
For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands
should give the same text output and very similar prefill logits:
- Get the reference output:
```bash
python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm}
```
- Get the SGLang output:
```bash
python3 -m sglang.bench_one_batch --correct --model [new model]
```
### Add the Model to the Test Suite
To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in
the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py)
file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU,
MMMU-Pro, etc.) in your PR. \\
For VLMs, also include a test in `test_vision_openai_server_{x}.py` (e.g. [test_vision_openai_server_a.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server_a.py), [test_vision_openai_server_b.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server_b.py)).
This is an example command to run to test a new model on your local machine:
```bash
ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
```
### Benchmark
- **(Required) MMMU**: follow MMMU benchmark [README.md](https://github.com/sgl-project/sglang/blob/main/benchmark/mmmu/README.md) to get SGLang vs. HF Transformer accuracy comparison. The accuracy score from SGLang run should not be much lower than that from HF Transformer run. Similarly, follow https://docs.sglang.ai/developer_guide/benchmark_and_profiling.html to get performance comparison: TTFT and throughput must meet or exceed baselines (e.g., HF Transformer).
- **(Optional) Other evals**: If you ran other evals, please note the results in PR description.
## Port a Model from vLLM to SGLang
The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable
resource, as vLLM covers many models. SGLang reuses vLLMs interface and some layers, making it easier to port models
from vLLM to SGLang.
To port a model from vLLM to SGLang:
- Compare these two files for guidance:
- [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py)
- [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py)
- The major differences include:
- **Replace vLLMs `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`).
- **Replace vLLMs `LogitsProcessor` with SGLangs `LogitsProcessor`.**
- **Replace the multi-headed `Attention` of ViT with SGLangs `VisionAttention`.**
- **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
- **Remove `Sample`.**
- **Change the `forward()` functions** and add a `forward_batch()` method.
- **Add `EntryClass`** at the end.
- **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
Note: make sure you add your new model to the supported models list in the supported models documentation.
## Registering an External Model Implementation
In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server.
This allows you to integrate your model without modifying the source code.
For example:
```python
from sglang.srt.models.registry import ModelRegistry
from sglang.srt.entrypoints.http_server import launch_server
# For a single model, add it to the registry:
ModelRegistry.models[model_name] = model_class
# For multiple models, you can imitate the import_model_classes() function:
from functools import lru_cache
@lru_cache()
def import_new_model_classes():
model_arch_name_to_cls = {}
# Populate model_arch_name_to_cls with your new model classes.
...
return model_arch_name_to_cls
ModelRegistry.models.update(import_new_model_classes())
# Launch the server with your server arguments:
launch_server(server_args)
```
## Documentation
Add to table of supported models in [generative_models.md](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/generative_models.md) or [multimodal_language_models.md](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/multimodal_language_models.md)
---
By following these guidelines, you can add support for new language models and multimodal large language models in
SGLang and ensure they are thoroughly tested and easily integrated into the system.

View File

@@ -0,0 +1,58 @@
# Transformers fallback in SGLang
`sglang` can fall back to using models that are available in `transformers`. This works for most decoder-style language models and support for vision-language models is coming soon!
## Example launch Command
By default, we will use sglang implementation if it is available. Otherwise, we will fall back to transformers one. However, you can switch the implementation by setting `--model-impl` to `transformers`.
```shell
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--model-impl transformers
```
## Supported features
### Quantization
Transformers fall back has supported most of available quantization in SGLang (except GGUF). See [Quantization page](../advanced_features/quantization.md) for more information about supported quantization in SGLang.
### Remote code
This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` that correctly implements attention can be used in production!
A model just needs the following two things:
```python
from transformers import PreTrainedModel
from torch import nn
class MyAttention(nn.Module):
def forward(self, hidden_states, **kwargs): # <- kwargs are required
...
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
attn_output, attn_weights = attention_interface(
self,
query_states,
key_states,
value_states,
**kwargs,
)
...
class MyModel(PreTrainedModel):
_supports_attention_backend = True
```
Here is what happens in the background:
1. The config is loaded
2. `MyModel` python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
3. The `TransformersModel` backend is used. See `/srt/models/transformers`, which leverages `self.config._attn_implementation = "sglang"`, thus the need to use `ALL_ATTENTION_FUNCTIONS`.
That's it!