Revert "fix some typos" (#6244)
@@ -82,7 +82,7 @@ if is_in_ci():

 llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

-# Terminate Engine
+# Terminalte Engine
 llm.shutdown()
 ```

@@ -94,7 +94,7 @@ llm.shutdown()

 ### **Model Selection**

-For demonstrations in the docs, we **prefer smaller models** to reduce memory consumption and speed up inference. Running larger models in CI can lead to instability due to memory constraints.
+For demonstrations in the docs, **prefer smaller models** to reduce memory consumption and speed up inference. Running larger models in CI can lead to instability due to memory constraints.

 ### **Prompt Alignment Example**

@@ -134,7 +134,7 @@ python3 -m sglang.launch_server \

 SGLang supports the following quantization methods based on torchao `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`.

-Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `"int8dq"` method currently has some bugs when using together with CUDA graph capture. So we suggest to disable the CUDA graph capture when using `"int8dq"` method. Namely, please use the following command:
+Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `"int8dq"` method currently has some bugs when using together with cuda graph capture. So we suggest to disable cuda graph capture when using `"int8dq"` method. Namely, please use the following command:

 ```bash
 python3 -m sglang.launch_server \
@@ -38,7 +38,7 @@ memory management, and optimization techniques.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
 - If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](custom_chat_template.md).
-- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, so you can use the following commands. If you encounter deadlocks, please try to add `--disable-cuda-graph`.
+- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, you can use the following commands. If you meet deadlock, please try to add `--disable-cuda-graph`

 ```bash
 # Node 0
@@ -1,9 +1,9 @@
 # Development Guide Using Docker

 ## Setup VSCode on a Remote Host
-(Optional - you can skip this step if you plan to run the SGLang dev container locally)
+(Optional - you can skip this step if you plan to run sglang dev container locally)

-1. In the remote host, download `code` from [https://code.visualstudio.com/docs/?dv=linux64cli](https://code.visualstudio.com/download) and run `code tunnel` in a shell.
+1. In the remote host, download `code` from [Https://code.visualstudio.com/docs/?dv=linux64cli](https://code.visualstudio.com/download) and run `code tunnel` in a shell.

 Example
 ```bash
@@ -19,20 +19,20 @@ tar xf vscode_cli_alpine_x64_cli.tar.gz
 ## Setup Docker Container

 ### Option 1. Use the default dev container automatically from VSCode
-There is a `.devcontainer` folder in the SGLang repository root folder to allow VSCode to automatically start up within dev container. You can read more about this VSCode extension in VSCode official document [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).
+There is a `.devcontainer` folder in the sglang repository root folder to allow VSCode to automatically start up within dev container. You can read more about this VSCode extension in VSCode official document [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).
 
 (*Figure 1: Diagram from VSCode official documentation [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).*)

 To enable this, you only need to:
-1. Start Visual Studio Code and install the [VSCode dev container extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers).
+1. Start Visual Studio Code and install [VSCode dev container extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers).
 2. Press F1, type and choose "Dev Container: Open Folder in Container.
 3. Input the `sglang` local repo path in your machine and press enter.

-The first time you open it in the dev container might take longer due to docker pull and build. Once it's successful, you should set on your status bar at the bottom left displaying that you are in a dev container:
+The first time you open it in dev container might take longer due to docker pull and build. Once it's successful, you should set on your status bar at the bottom left displaying that you are in a dev container:

 

-Now when you run `sglang.launch_server` in the VSCode terminal or start debugging using F5, the SGLang server will be started in the dev container with all your local changes applied automatically:
+Now when you run `sglang.launch_server` in the VSCode terminal or start debugging using F5, sglang server will be started in the dev container with all your local changes applied automatically:

 

@@ -52,21 +52,21 @@ docker run -itd --shm-size 32g --gpus all -v <volumes-to-mount> --ipc=host --net
 docker exec -it sglang_dev /bin/zsh
 ```
 Some useful volumes to mount are:
-1. **HuggingFace model cache**: mounting model cache can avoid the need to re-download every time docker restarts. Default location on Linux is `~/.cache/huggingface/`.
+1. **Huggingface model cache**: mounting model cache can avoid re-download every time docker restarts. Default location on Linux is `~/.cache/huggingface/`.
 2. **SGLang repository**: code changes in the SGLang local repository will be automatically synced to the .devcontainer.

-Example 1: Mounting local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the devcontainer.
+Example 1: Monting local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the devcontainer.
 ```bash
 docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
 docker exec -it sglang_zhyncs /bin/zsh
 ```
-Example 2: Mounting both the HuggingFace cache and the local SGLang repo. Local code changes are automatically synced to the devcontainer as SGLang is installed in editable mode in the dev image.
+Example 2: Mounting both HuggingFace cache and local SGLang repo. Local code changes are automatically synced to the devcontainer as the SGLang is installed in editable mode in the dev image.
 ```bash
 docker run -itd --shm-size 32g --gpus all -v $HOME/.cache/huggingface/:/root/.cache/huggingface -v $HOME/src/sglang:/sgl-workspace/sglang --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
 docker exec -it sglang_zhyncs /bin/zsh
 ```
 ## Debug SGLang with VSCode Debugger
-1. (Create if it does not exist) open `launch.json` in VSCode.
+1. (Create if not exist) open `launch.json` in VSCode.
 2. Add the following config and save. Please note that you can edit the script as needed to apply different parameters or debug a different program (e.g. benchmark script).
 ```JSON
 {
@@ -4,7 +4,7 @@

 ### Step 1: Start a docker container.

-You can mount a folder for the shared HuggingFace model weights cache. The command below uses `/tmp/huggingface` as an example.
+You can mount a folder for the shared huggingface model weights cache. The command below uses `/tmp/huggingface` as an example.

 ```
 docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
@@ -5,7 +5,7 @@ SGLang is a fast serving framework for large language models and vision language
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
 The core features include:

-- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (PagedAttention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
+- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
 - **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
 - **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
 - **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
@@ -69,7 +69,7 @@

 This command sets the number of prompts to 2 with `--num-prompts` argument and limits the length of output sequences to 100 with `--sharegpt-output-len` argument, which can generate a small trace file for browser to open smoothly.

-Additionally, if you want to locate the SGLang Python source code through the CUDA kernel in Trace, you need to disable CUDA Graph when starting the service. This can be done by using the `--disable-cuda-graph` parameter in the command to start the service.
+Additionally, if you want to locate the SGLang Python source code through the cuda kernel in Trace, you need to disable CUDA Graph when starting the service. This can be done by using the `--disable-cuda-graph` parameter in the command to start the service.

 ## Profile with Nsight

@@ -35,7 +35,7 @@ SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unitt

 ## Writing Documentation & Running Docs CI

-We recommend new contributors start from writing documentation, which helps you quickly understand the SGLang codebase. For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
+We recommend new contributors start from writing documentation, which helps you quickly understand SGLang codebase. For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).

 ## Tips for Newcomers

@@ -69,13 +69,13 @@ If you encounter errors when starting the server, ensure the weights have finish
 The DeepSeek series have huge model weights, it takes some time to compile the model with `torch.compile` for the first time if you have added the flag `--enable-torch-compile`. You can refer [here](https://docs.sglang.ai/backend/hyperparameter_tuning.html#try-advanced-options) to optimize the caching of compilation results, so that the cache can be used to speed up the next startup.

 ### Launch with one node of 8 x H200
-Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8**. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
+Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.

 ### Running examples on Multi-node

 - [Serving with two H20*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes).

-- [Serving with two H200*8 nodes and Docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).
+- [Serving with two H200*8 nodes and docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).

 - [Serving with four A100*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes).

@@ -89,13 +89,13 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be

 - **MLA Attention Backends**: Currently SGLang supports different optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/mla.html), [FlashMLA](https://github.com/deepseek-ai/FlashMLA), [CutlassMLA](https://github.com/sgl-project/sglang/pull/5390), and [Triton](https://github.com/triton-lang/triton) backends. The default FA3 provides good performance across wide workloads.

-- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented the Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.
+- **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.

-- **CUDA Graph & torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.
+- **CUDA Graph & Torch.compile**: Both MLA and Mixture of Experts (MoE) are compatible with CUDA Graph and Torch.compile, which reduces latency and accelerates decoding speed for small batch sizes.

-- **Chunked Prefix Cache**: Chunked prefix cache optimization can increase throughput by cutting prefix cache into chunks, processing them with multi-head attention and merging their states. Its improvement can be significant when doing chunked prefill on long sequences. Currently this optimization is only available for the FlashAttention3 backend.
+- **Chunked Prefix Cache**: Chunked prefix cache optimization can increase throughput by cutting prefix cache into chunks, processing them with multi-head attention and merging their states. Its improvement can be significant when doing chunked prefill on long sequences. Currently this optimization is only available for FlashAttention3 backend.

-Overall, with these optimizations, we have achieved up to a **7x** acceleration in output throughput compared to the previous version.
+Overall, with these optimizations, we have achieved up to **7x** acceleration in output throughput compared to the previous version.

 <p align="center">
 <img src="https://lmsys.org/images/blog/sglang_v0_3/deepseek_mla.svg" alt="Multi-head Latent Attention for DeepSeek Series Models">
@@ -113,7 +113,7 @@ Overall, with these optimizations, we have achieved up to a **7x** acceleration
 <img src="https://lmsys.org/images/blog/sglang_v0_4/dp_attention.svg" alt="Data Parallelism Attention for DeepSeek Series Models">
 </p>

-With data parallelism attention enabled, we have achieved up to a **1.9x** decoding throughput improvement compared to the previous version.
+With data parallelism attention enabled, we have achieved up to **1.9x** decoding throughput improvement compared to the previous version.

 <p align="center">
 <img src="https://lmsys.org/images/blog/sglang_v0_4/deepseek_coder_v2.svg" alt="Data Parallelism Attention Performance Comparison">
@@ -150,7 +150,7 @@ python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --tru
 The precompilation process typically takes around 10 minutes to complete.

 ### Multi-token Prediction
-**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively with H200 TP8 setting.
+**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.ai/backend/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively on H200 TP8 setting.

 **Usage**:
 Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
@@ -161,7 +161,7 @@ python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --spec
 - FlashAttention3 and Triton backend fully supports MTP usage. For FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding,`--speculative-eagle-topk` parameter should be set to `1`. MTP support for the FlashMLA backend and CutlassMLA backend is still under development.
 - To enable DeepSeek MTP for large batch sizes (>32), there are some parameters should be changed (Reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
 - Adjust `--max-running-requests` to a larger number. The default value is `32` for MTP. For larger batch sizes, you should increase this value beyond the default value.
-- Set `--cuda-graph-bs`. It is a list of batch sizes for CUDA graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.
+- Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.


 ### Reasoning Content for DeepSeek R1
@@ -209,7 +209,7 @@ data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\"}"}}]}}]}
 data: {"choices":[{"delta":{"tool_calls":null}}], "finish_reason": "tool_calls"}
 data: [DONE]
 ```
-The client needs to concatenate all fragments of the tool call arguments to reconstruct the complete tool call:
+The client needs to concatenate all arguments fragments to reconstruct the complete tool call:
 ```
 {"city": "Qingdao"}
 ```
@@ -223,4 +223,4 @@ Important Notes:

 1. **Question**: What should I do if model loading takes too long and NCCL timeout occurs?

-**Answer**: You can try to add `--dist-timeout 3600` when launching the model, this allows for a 1-hour timeout.
+**Answer**: You can try to add `--dist-timeout 3600` when launching the model, this allows for 1-hour timeout.
@@ -330,7 +330,7 @@ This should resolve most NCCL-related issues.
 ## Remaining issues

 * In Kubernetes, Docker, or Containerd environments, we use hostNetwork to prevent performance degradation.
-* We utilize privileged mode, which isn't secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.
+* We utilize privileged mode, which isn’t secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.

 ## TODO

@@ -25,4 +25,4 @@ docker run --gpus all \
 python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
 ```

-Note that ModelScope uses a different cache directory than HuggingFace. You may need to set it manually to avoid running out of disk space.
+Note that modelscope uses a different cache directory than huggingface. You may need to set it manually to avoid running out of disk space.
@@ -23,7 +23,7 @@ uv pip install "sglang[all]>=0.4.6.post3"
 1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
 2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.

-- If you encounter `ImportError; cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, try to use the specified version of `transformers` in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, run `pip install transformers==4.51.1`.
+- If you encounter `ImportError; cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, try to use the specified version of `transformers` in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, just running `pip install transformers==4.51.1`.

 ## Method 2: From source

@@ -54,10 +54,10 @@ cd ..
 pip install -e "python[all_hip]"
 ```

-## Method 3: Using Docker
+## Method 3: Using docker

 The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
-Replace `<secret>` below with your HuggingFace hub [token](https://huggingface.co/docs/hub/en/security-tokens).
+Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).

 ```bash
 docker run --gpus all \
@@ -89,7 +89,7 @@ drun -p 30000:30000 \
 drun v0.4.6.post3-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
 ```

-## Method 4: Using Docker Compose
+## Method 4: Using docker compose

 <details>
 <summary>More</summary>
@@ -164,4 +164,4 @@ sky status --endpoint 30000 sglang
 - [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
 - If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
 - The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install sglang[srt]`. `srt` is the abbreviation of SGLang runtime.
-- To reinstall FlashInfer locally, use the following command: `pip install "flashinfer-python==0.2.5" -i https://flashinfer.ai/whl/cu124/torch2.6 --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
+- To reinstall flashinfer locally, use the following command: `pip install "flashinfer-python==0.2.5" -i https://flashinfer.ai/whl/cu124/torch2.6 --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.