[FEAT] Add transformers backend support (#5929)
This commit is contained in:
@@ -63,6 +63,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|
||||
| `kv_cache_dtype` | Dtype of the kv cache. | `auto` |
|
||||
| `context_length` | The model's maximum context length. Defaults to None (will use the value from the model's config.json instead). Note that extending the default might lead to strange behavior. | None |
|
||||
| `device` | The device we put the model. | None |
|
||||
| `impl` | The implementation of the model to use. Defaults to SGlang implementation and fall back to transformers if needed | `auto` |
|
||||
| `served_model_name` | Override the model name returned by the v1/models endpoint in OpenAI API server.| None |
|
||||
| `is_embedding` | Set to `true` to perform [embedding](./openai_api_embeddings.ipynb) / [encode](https://docs.sglang.ai/backend/native_api#Encode-(embedding-model)) and [reward](https://docs.sglang.ai/backend/native_api#Classify-(reward-model)) tasks. | `False` |
|
||||
| `revision` | Adjust if a specific version of the model should be used. | None |
|
||||
|
||||
@@ -47,6 +47,7 @@ The core features include:
|
||||
supported_models/embedding_models.md
|
||||
supported_models/reward_models.md
|
||||
supported_models/support_new_models.md
|
||||
supported_models/transformers_fallback.md
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
58
docs/supported_models/transformers_fallback.md
Normal file
58
docs/supported_models/transformers_fallback.md
Normal file
@@ -0,0 +1,58 @@
|
||||
# Transformers fallback in SGLang
|
||||
|
||||
`sglang` can fall back to using models that are available in `transformers`. This works for most decoder-style language models and support for vision-language models is coming soon!
|
||||
|
||||
## Example launch Command
|
||||
|
||||
By default, we will use sglang implementation if it is available. Otherwise, we will fall back to transformers one. However, you can switch the implementation by setting `impl` to `transformers`.
|
||||
|
||||
```shell
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path meta-llama/Llama-3.2-1B-Instruct \
|
||||
--host 0.0.0.0 \
|
||||
--port 30000 \
|
||||
--impl transformers
|
||||
```
|
||||
|
||||
#### Supported features
|
||||
|
||||
##### Quantization
|
||||
|
||||
Transformers fall back has supported most of available quantization in SGLang (except GGUF). See [Quantization page](https://docs.sglang.ai/backend/quantization.html) for more information about supported quantization in SGLang.
|
||||
|
||||
##### Remote code
|
||||
|
||||
This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` that correctly implements attention can be used in production!
|
||||
|
||||
A model just needs the following two things:
|
||||
|
||||
```python
|
||||
from transformers import PreTrainedModel
|
||||
from torch import nn
|
||||
|
||||
class MyAttention(nn.Module):
|
||||
|
||||
def forward(self, hidden_states, **kwargs): # <- kwargs are required
|
||||
|
||||
...
|
||||
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
||||
attn_output, attn_weights = attention_interface(
|
||||
self,
|
||||
query_states,
|
||||
key_states,
|
||||
value_states,
|
||||
**kwargs,
|
||||
)
|
||||
...
|
||||
|
||||
class MyModel(PreTrainedModel):
|
||||
_supports_attention_backend = True
|
||||
```
|
||||
|
||||
Here is what happens in the background:
|
||||
|
||||
1. The config is loaded
|
||||
2. `MyModel` python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
|
||||
3. The `TransformersModel` backend is used. See `/srt/models/transformers`, which leverages `self.config._attn_implementation = "sglang"`, thus the need to use `ALL_ATTENTION_FUNCTIONS`.
|
||||
|
||||
That's it!
|
||||
Reference in New Issue
Block a user