Clean up server_args.py to have a dedicated function for model specific adjustments (#8983)
This commit is contained in:
@@ -189,8 +189,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|
||||
| Arguments | Description | Defaults |
|
||||
|-----------|-------------|----------|
|
||||
| `--attention-backend` | Choose the kernels for attention layers. | None |
|
||||
| `decode_attention_backend` | (Experimental) This argument specifies the backend for decode attention computation. Note that this argument has priority over `attention_backend`. | None |
|
||||
| `prefill_attention_backend` | (Experimental) This argument specifies the backend for prefill attention computation. Note that this argument has priority over `attention_backend`. | None |
|
||||
| `--prefill-attention-backend` | (Experimental) This argument specifies the backend for prefill attention computation. Note that this argument has priority over `attention_backend`. | None |
|
||||
| `--decode-attention-backend` | (Experimental) This argument specifies the backend for decode attention computation. Note that this argument has priority over `attention_backend`. | None |
|
||||
| `--sampling-backend` | Choose the kernels for sampling layers. | None |
|
||||
| `--grammar-backend` | Choose the backend for grammar-guided decoding. | None |
|
||||
| `--mm-attention-backend` | Set multimodal attention backend. | None |
|
||||
|
||||
@@ -5,10 +5,10 @@ SGLang is a fast serving framework for large language models and vision language
|
||||
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
|
||||
The core features include:
|
||||
|
||||
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor parallelism, pipeline parallelism, expert parallelism, structured outputs, chunked prefill, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
|
||||
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-lora batching.
|
||||
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
|
||||
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
|
||||
- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
|
||||
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
|
||||
- **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
Reference in New Issue
Block a user