Update readme (#731)
19
README.md
@@ -14,13 +14,14 @@ The core features include:
 - **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.

 ## News
-- [2024/04] 🔥 SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
-- [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
-- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
+- [2024/07] 🔥 Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
+- [2024/04] SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
+- [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).

 <details>
 <summary>More</summary>

+- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
 - [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).

 </details>
@@ -58,6 +59,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

 ### Method 3: Using docker
 The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](docker).
+Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).

 ```bash
 docker run --gpus all \
@@ -411,15 +413,12 @@ for out in state.text_iter():
 - The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
 - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.


 ## Benchmark And Performance
-- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
-![llama_7b](assets/llama_7b.jpg)
+![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
+![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)

-- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
-![mixtral_8x7b](assets/mixtral_8x7b.jpg)
+Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).

-- Learn more about the above [results](docs/benchmark_results.md).
-- Synthetic latency and throughput benchmark [scripts](https://github.com/sgl-project/sglang/tree/main/benchmark/latency_throughput).

 ## Roadmap
 [Development Roadmap (2024 Q3)](https://github.com/sgl-project/sglang/issues/634)
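Note on the `choices` bullet in the last hunk: token-length normalization can be illustrated with a small sketch. This is not SGLang's implementation. The helper name `select_choice` and the log-probability values are hypothetical, and the sketch assumes the per-token log probabilities of each candidate have already been obtained from a model.

```python
from typing import Dict, List


def select_choice(token_logprobs: Dict[str, List[float]]) -> str:
    """Return the choice with the highest mean per-token log probability.

    `token_logprobs` maps each candidate string to the log probabilities of
    its tokens (hypothetical input; a real system would get these from the model).
    """
    def normalized(logprobs: List[float]) -> float:
        if not logprobs:
            raise ValueError("each choice needs at least one token")
        # Dividing by the token count keeps long choices from being
        # penalized merely for having more tokens.
        return sum(logprobs) / len(logprobs)

    return max(token_logprobs, key=lambda c: normalized(token_logprobs[c]))


# Made-up numbers: the raw sums are -0.5 vs. about -0.6, so an unnormalized
# comparison would favor "yes"; the per-token means are -0.5 vs. about -0.2.
scores = {
    "yes": [-0.5],
    "definitely not": [-0.3, -0.1, -0.2],
}
best = select_choice(scores)
```

With these made-up values the normalized comparison selects the three-token choice, while a comparison of raw sums would have selected the one-token choice.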