sglang/docs/supported_models/transformers_fallback.md

# Transformers fallback in SGLang

`sglang` can fall back to using models that are available in `transformers`. This works for most decoder-style language models and support for vision-language models is coming soon!

## Example launch Command

By default, we will use sglang implementation if it is available. Otherwise, we will fall back to transformers one. However, you can switch the implementation by setting `--model-impl` to `transformers`.

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --model-impl transformers
```

## Supported features

### Quantization

Transformers fall back has supported most of available quantization in SGLang (except GGUF). See [Quantization page](../advanced_features/quantization.md) for more information about supported quantization in SGLang.

### Remote code

This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` that correctly implements attention can be used in production!

A model just needs the following two things:

```python
from transformers import PreTrainedModel
from torch import nn

class MyAttention(nn.Module):

  def forward(self, hidden_states, **kwargs): # <- kwargs are required

    ...
    attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
    attn_output, attn_weights = attention_interface(
      self,
      query_states,
      key_states,
      value_states,
      **kwargs,
    )
    ...

class MyModel(PreTrainedModel):
  _supports_attention_backend = True
```

Here is what happens in the background:

1. The config is loaded
2. `MyModel` python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
3. The `TransformersModel` backend is used. See `/srt/models/transformers`, which leverages `self.config._attn_implementation = "sglang"`, thus the need to use `ALL_ATTENTION_FUNCTIONS`.

That's it!
[FEAT] Add transformers backend support (#5929) 2025-06-04 06:05:29 +02:00			`# Transformers fallback in SGLang`

			`sglang` can fall back to using models that are available in `transformers`. This works for most decoder-style language models and support for vision-language models is coming soon!

			`## Example launch Command`

Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00			By default, we will use sglang implementation if it is available. Otherwise, we will fall back to transformers one. However, you can switch the implementation by setting `--model-impl` to `transformers`.
[FEAT] Add transformers backend support (#5929) 2025-06-04 06:05:29 +02:00
			```shell
			`python3 -m sglang.launch_server \`
			`--model-path meta-llama/Llama-3.2-1B-Instruct \`
			`--host 0.0.0.0 \`
			`--port 30000 \`
Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00			`--model-impl transformers`
[FEAT] Add transformers backend support (#5929) 2025-06-04 06:05:29 +02:00			```

Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00			`## Supported features`
[FEAT] Add transformers backend support (#5929) 2025-06-04 06:05:29 +02:00
Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00			`### Quantization`
[FEAT] Add transformers backend support (#5929) 2025-06-04 06:05:29 +02:00
Improve docs and developer guide (#9044) 2025-08-10 21:05:18 -07:00			`Transformers fall back has supported most of available quantization in SGLang (except GGUF). See [Quantization page](../advanced_features/quantization.md) for more information about supported quantization in SGLang.`
[FEAT] Add transformers backend support (#5929) 2025-06-04 06:05:29 +02:00
Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00			`### Remote code`
[FEAT] Add transformers backend support (#5929) 2025-06-04 06:05:29 +02:00
			This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` that correctly implements attention can be used in production!

			`A model just needs the following two things:`

			```python
			`from transformers import PreTrainedModel`
			`from torch import nn`

			`class MyAttention(nn.Module):`

			`def forward(self, hidden_states, **kwargs): # <- kwargs are required`

			`...`
			`attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]`
			`attn_output, attn_weights = attention_interface(`
			`self,`
			`query_states,`
			`key_states,`
			`value_states,`
			`**kwargs,`
			`)`
			`...`

			`class MyModel(PreTrainedModel):`
			`_supports_attention_backend = True`
			```

			`Here is what happens in the background:`

			`1. The config is loaded`
			2. `MyModel` python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
			3. The `TransformersModel` backend is used. See `/srt/models/transformers`, which leverages `self.config._attn_implementation = "sglang"`, thus the need to use `ALL_ATTENTION_FUNCTIONS`.

			`That's it!`