Sync from v0.13
This commit is contained in:
327
docs/deployment/frameworks/skypilot.md
Normal file
327
docs/deployment/frameworks/skypilot.md
Normal file
@@ -0,0 +1,327 @@
|
||||
# SkyPilot
|
||||
|
||||
<p align="center">
|
||||
<img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
|
||||
</p>
|
||||
|
||||
vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
|
||||
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
|
||||
- Check that `sky check` shows clouds or Kubernetes are enabled.
|
||||
|
||||
```bash
|
||||
pip install skypilot-nightly
|
||||
sky check
|
||||
```
|
||||
|
||||
## Run on a single instance
|
||||
|
||||
See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).
|
||||
|
||||
??? code "Yaml"
|
||||
|
||||
```yaml
|
||||
resources:
|
||||
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
|
||||
use_spot: True
|
||||
disk_size: 512 # Ensure model checkpoints can fit.
|
||||
disk_tier: best
|
||||
ports: 8081 # Expose to internet traffic.
|
||||
|
||||
envs:
|
||||
PYTHONUNBUFFERED: 1
|
||||
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
|
||||
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
|
||||
|
||||
setup: |
|
||||
conda create -n vllm python=3.10 -y
|
||||
conda activate vllm
|
||||
|
||||
pip install vllm==0.4.0.post1
|
||||
# Install Gradio for web UI.
|
||||
pip install gradio openai
|
||||
pip install flash-attn==2.5.7
|
||||
|
||||
run: |
|
||||
conda activate vllm
|
||||
echo 'Starting vllm api server...'
|
||||
vllm serve $MODEL_NAME \
|
||||
--port 8081 \
|
||||
--trust-remote-code \
|
||||
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
|
||||
2>&1 | tee api_server.log &
|
||||
|
||||
echo 'Waiting for vllm api server to start...'
|
||||
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
|
||||
|
||||
echo 'Starting gradio server...'
|
||||
git clone https://github.com/vllm-project/vllm.git || true
|
||||
python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
|
||||
-m $MODEL_NAME \
|
||||
--port 8811 \
|
||||
--model-url http://localhost:8081/v1 \
|
||||
--stop-token-ids 128009,128001
|
||||
```
|
||||
|
||||
Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
|
||||
|
||||
```bash
|
||||
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
|
||||
```
|
||||
|
||||
Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the LLaMA model to do the text completion.
|
||||
|
||||
```console
|
||||
(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
|
||||
```
|
||||
|
||||
**Optional**: Serve the 70B model instead of the default 8B and use more GPU:
|
||||
|
||||
```bash
|
||||
HF_TOKEN="your-huggingface-token" \
|
||||
sky launch serving.yaml \
|
||||
--gpus A100:8 \
|
||||
--env HF_TOKEN \
|
||||
--env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
|
||||
```
|
||||
|
||||
## Scale up to multiple replicas
|
||||
|
||||
SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file.
|
||||
|
||||
??? code "Yaml"
|
||||
|
||||
```yaml
|
||||
service:
|
||||
replicas: 2
|
||||
# An actual request for readiness probe.
|
||||
readiness_probe:
|
||||
path: /v1/chat/completions
|
||||
post_data:
|
||||
model: $MODEL_NAME
|
||||
messages:
|
||||
- role: user
|
||||
content: Hello! What is your name?
|
||||
max_completion_tokens: 1
|
||||
```
|
||||
|
||||
??? code "Yaml"
|
||||
|
||||
```yaml
|
||||
service:
|
||||
replicas: 2
|
||||
# An actual request for readiness probe.
|
||||
readiness_probe:
|
||||
path: /v1/chat/completions
|
||||
post_data:
|
||||
model: $MODEL_NAME
|
||||
messages:
|
||||
- role: user
|
||||
content: Hello! What is your name?
|
||||
max_completion_tokens: 1
|
||||
|
||||
resources:
|
||||
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
|
||||
use_spot: True
|
||||
disk_size: 512 # Ensure model checkpoints can fit.
|
||||
disk_tier: best
|
||||
ports: 8081 # Expose to internet traffic.
|
||||
|
||||
envs:
|
||||
PYTHONUNBUFFERED: 1
|
||||
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
|
||||
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
|
||||
|
||||
setup: |
|
||||
conda create -n vllm python=3.10 -y
|
||||
conda activate vllm
|
||||
|
||||
pip install vllm==0.4.0.post1
|
||||
# Install Gradio for web UI.
|
||||
pip install gradio openai
|
||||
pip install flash-attn==2.5.7
|
||||
|
||||
run: |
|
||||
conda activate vllm
|
||||
echo 'Starting vllm api server...'
|
||||
vllm serve $MODEL_NAME \
|
||||
--port 8081 \
|
||||
--trust-remote-code \
|
||||
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
|
||||
2>&1 | tee api_server.log
|
||||
```
|
||||
|
||||
Start the serving the Llama-3 8B model on multiple replicas:
|
||||
|
||||
```bash
|
||||
HF_TOKEN="your-huggingface-token" \
|
||||
sky serve up -n vllm serving.yaml \
|
||||
--env HF_TOKEN
|
||||
```
|
||||
|
||||
Wait until the service is ready:
|
||||
|
||||
```bash
|
||||
watch -n10 sky serve status vllm
|
||||
```
|
||||
|
||||
Example outputs:
|
||||
|
||||
```console
|
||||
Services
|
||||
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
|
||||
vllm 1 35s READY 2/2 xx.yy.zz.100:30001
|
||||
|
||||
Service Replicas
|
||||
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
|
||||
vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
|
||||
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
|
||||
```
|
||||
|
||||
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
|
||||
|
||||
??? console "Commands"
|
||||
|
||||
```bash
|
||||
ENDPOINT=$(sky serve status --endpoint 8081 vllm)
|
||||
curl -L http://$ENDPOINT/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful assistant."
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Who are you?"
|
||||
}
|
||||
],
|
||||
"stop_token_ids": [128009, 128001]
|
||||
}'
|
||||
```
|
||||
|
||||
To enable autoscaling, you could replace the `replicas` with the following configs in `service`:
|
||||
|
||||
```yaml
|
||||
service:
|
||||
replica_policy:
|
||||
min_replicas: 2
|
||||
max_replicas: 4
|
||||
target_qps_per_replica: 2
|
||||
```
|
||||
|
||||
This will scale the service up to when the QPS exceeds 2 for each replica.
|
||||
|
||||
??? code "Yaml"
|
||||
|
||||
```yaml
|
||||
service:
|
||||
replica_policy:
|
||||
min_replicas: 2
|
||||
max_replicas: 4
|
||||
target_qps_per_replica: 2
|
||||
# An actual request for readiness probe.
|
||||
readiness_probe:
|
||||
path: /v1/chat/completions
|
||||
post_data:
|
||||
model: $MODEL_NAME
|
||||
messages:
|
||||
- role: user
|
||||
content: Hello! What is your name?
|
||||
max_completion_tokens: 1
|
||||
|
||||
resources:
|
||||
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
|
||||
use_spot: True
|
||||
disk_size: 512 # Ensure model checkpoints can fit.
|
||||
disk_tier: best
|
||||
ports: 8081 # Expose to internet traffic.
|
||||
|
||||
envs:
|
||||
PYTHONUNBUFFERED: 1
|
||||
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
|
||||
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
|
||||
|
||||
setup: |
|
||||
conda create -n vllm python=3.10 -y
|
||||
conda activate vllm
|
||||
|
||||
pip install vllm==0.4.0.post1
|
||||
# Install Gradio for web UI.
|
||||
pip install gradio openai
|
||||
pip install flash-attn==2.5.7
|
||||
|
||||
run: |
|
||||
conda activate vllm
|
||||
echo 'Starting vllm api server...'
|
||||
vllm serve $MODEL_NAME \
|
||||
--port 8081 \
|
||||
--trust-remote-code \
|
||||
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
|
||||
2>&1 | tee api_server.log
|
||||
```
|
||||
|
||||
To update the service with the new config:
|
||||
|
||||
```bash
|
||||
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
|
||||
```
|
||||
|
||||
To stop the service:
|
||||
|
||||
```bash
|
||||
sky serve down vllm
|
||||
```
|
||||
|
||||
### **Optional**: Connect a GUI to the endpoint
|
||||
|
||||
It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas.
|
||||
|
||||
??? code "Yaml"
|
||||
|
||||
```yaml
|
||||
envs:
|
||||
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
|
||||
ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
|
||||
|
||||
resources:
|
||||
cpus: 2
|
||||
|
||||
setup: |
|
||||
conda create -n vllm python=3.10 -y
|
||||
conda activate vllm
|
||||
|
||||
# Install Gradio for web UI.
|
||||
pip install gradio openai
|
||||
|
||||
run: |
|
||||
conda activate vllm
|
||||
export PATH=$PATH:/sbin
|
||||
|
||||
echo 'Starting gradio server...'
|
||||
git clone https://github.com/vllm-project/vllm.git || true
|
||||
python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
|
||||
-m $MODEL_NAME \
|
||||
--port 8811 \
|
||||
--model-url http://$ENDPOINT/v1 \
|
||||
--stop-token-ids 128009,128001 | tee ~/gradio.log
|
||||
```
|
||||
|
||||
1. Start the chat web UI:
|
||||
|
||||
```bash
|
||||
sky launch \
|
||||
-c gui ./gui.yaml \
|
||||
--env ENDPOINT=$(sky serve status --endpoint vllm)
|
||||
```
|
||||
|
||||
2. Then, we can access the GUI at the returned gradio link:
|
||||
|
||||
```console
|
||||
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
|
||||
```
|
||||
Reference in New Issue
Block a user