change file tree (#1859)

Co-authored-by: Chayenne <zhaochenyang@g.ucla.edu>
Chayenne
2024-10-31 20:10:16 -07:00
committed by GitHub
parent b9fd178f1b
commit 61cf00e112
24 changed files with 1177 additions and 456 deletions

docs/starts/install.md Normal file

@@ -0,0 +1,101 @@
# Install SGLang
You can install SGLang using any of the methods below.
## Method 1: With pip
```bash
pip install --upgrade pip
pip install "sglang[all]"
# Install FlashInfer accelerated kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
Note: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.
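The index URL in the command above encodes a CUDA version (`cu121`) and a PyTorch version (`torch2.4`). As a minimal sketch, assuming the same `cu<major><minor>/torch<major>.<minor>` layout holds for other combinations, a hypothetical helper `flashinfer_index_url` could derive the URL from version strings such as those reported by `torch.version.cuda` and `torch.__version__`. The helper is not part of SGLang or FlashInfer, and the FlashInfer installation doc remains the authority on which combinations are actually published.

```python
def flashinfer_index_url(cuda_version: str, torch_version: str) -> str:
    """Build a FlashInfer wheel index URL (hypothetical helper).

    Assumes the URL layout seen in the pip command above. It does not check
    that wheels exist for the given combination.
    """
    # "12.1" -> ("12", "1"), which becomes "cu121"
    cuda_major, cuda_minor = cuda_version.split(".")[:2]
    # "2.4.0+cu121" -> ("2", "4"): drop the local suffix and the patch version
    torch_major, torch_minor = torch_version.split("+")[0].split(".")[:2]
    return (
        f"https://flashinfer.ai/whl/cu{cuda_major}{cuda_minor}/"
        f"torch{torch_major}.{torch_minor}/"
    )


print(flashinfer_index_url("12.1", "2.4.0+cu121"))
# prints https://flashinfer.ai/whl/cu121/torch2.4/
```

The resulting string can be passed to `pip install flashinfer -i <url>` once you have confirmed in the FlashInfer doc that the combination is supported.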
## Method 2: From source
```bash
# Use the latest release branch
git clone -b v0.3.4.post2 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
# Install FlashInfer accelerated kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
Note: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.
## Method 3: Using Docker
The Docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from the [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
docker run --gpus all \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
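The container needs some time to download and load the model before it answers on the published port 30000. A minimal sketch of a readiness check follows, using only the Python standard library. The helper names `http_ok` and `wait_until_ready` are hypothetical, and probing the `/v1/models` path is an assumption based on the server exposing an OpenAI-compatible API.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout: float = 300.0, interval: float = 2.0
) -> bool:
    """Call `probe` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A call such as `wait_until_ready(lambda: http_ok("http://localhost:30000/v1/models"))` then blocks until the server responds or the timeout expires. Passing the probe as a callable keeps the retry loop independent of the network, so it can be exercised without a running server.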
## Method 4: Using docker compose
<details>
<summary>More</summary>
> This method is recommended if you plan to run SGLang as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).
1. Copy the [compose.yaml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.
</details>
## Method 5: Run on Kubernetes or Clouds with SkyPilot
<details>
<summary>More</summary>
To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).
1. Install SkyPilot and set up a Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>
```yaml
# sglang.yaml
envs:
HF_TOKEN: null
resources:
image_id: docker:lmsysorg/sglang:latest
accelerators: A100
ports: 30000
run: |
conda deactivate
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
```
</details>
```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>
## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without a GPU and set up the backend on a GPU-enabled machine. To install the frontend, run `pip install sglang`; to install the backend, run `pip install "sglang[srt]"`. This allows you to build SGLang programs locally and execute them by connecting to the remote backend.
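
The FlashInfer note above can be turned into a small launch-time decision. The following is a minimal sketch, assuming you already know the GPU's compute capability as a `(major, minor)` tuple, for example from `torch.cuda.get_device_capability()`. The helper `backend_flags` is hypothetical and not part of SGLang. It only encodes the sm75 threshold and the two fallback flags quoted above; whether the fallback backends work on a particular older GPU is not something this sketch verifies.

```python
from typing import List, Tuple

# Minimum compute capability FlashInfer supports, per the note above (sm75).
FLASHINFER_MIN_CAPABILITY: Tuple[int, int] = (7, 5)


def backend_flags(capability: Tuple[int, int]) -> List[str]:
    """Return extra launch_server flags for a GPU compute capability.

    Hypothetical helper: returns no extra flags when FlashInfer's documented
    minimum is met, and the Triton/PyTorch fallback flags otherwise.
    """
    if capability >= FLASHINFER_MIN_CAPABILITY:
        return []
    return ["--attention-backend", "triton", "--sampling-backend", "pytorch"]


print(backend_flags((7, 0)))  # below sm75, e.g. V100: fallback flags
print(backend_flags((8, 0)))  # sm80, e.g. A100: no extra flags
```

The returned list can be appended to the `python3 -m sglang.launch_server ...` argument list. On sm75+ devices that still hit FlashInfer issues, the same two flags can be added manually as described above.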


@@ -0,0 +1,403 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick Start: Launch A Server and Send Requests\n",
"\n",
"This notebook provides a quick-start guide for using SGLang after installation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch a server\n",
"\n",
"This code block is equivalent to executing\n",
"\n",
"```bash\n",
"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
"--port 30000 --host 0.0.0.0\n",
"```\n",
"\n",
"in your command line and waiting for the server to be ready."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-01T02:46:13.611212Z",
"iopub.status.busy": "2024-11-01T02:46:13.611093Z",
"iopub.status.idle": "2024-11-01T02:46:42.810261Z",
"shell.execute_reply": "2024-11-01T02:46:42.809147Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:18] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=706578968, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:24 TP0] Init torch distributed begin.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:24 TP0] Load weight begin. avail mem=47.27 GB\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:25 TP0] lm_eval is not installed, GPTQ may not be usable\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO 10-31 19:46:26 weight_utils.py:243] Using model weights format ['*.safetensors']\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 2.50it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:00<00:00, 2.39it/s]\n",
"\r",
"Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:00<00:00, 3.45it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00, 2.95it/s]\n",
"\r",
"Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00, 2.90it/s]\n",
"\n",
"[2024-10-31 19:46:28 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=32.22 GB\n",
"[2024-10-31 19:46:28 TP0] Memory pool end. avail mem=4.60 GB\n",
"[2024-10-31 19:46:28 TP0] Capture cuda graph begin. This can take up to several minutes.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:36 TP0] max_total_num_tokens=217512, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:36] INFO: Started server process [1548791]\n",
"[2024-10-31 19:46:36] INFO: Waiting for application startup.\n",
"[2024-10-31 19:46:36] INFO: Application startup complete.\n",
"[2024-10-31 19:46:36] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:37] INFO: 127.0.0.1:46022 - \"GET /v1/models HTTP/1.1\" 200 OK\n",
"[2024-10-31 19:46:37] INFO: 127.0.0.1:46028 - \"GET /get_model_info HTTP/1.1\" 200 OK\n",
"[2024-10-31 19:46:37 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:38] INFO: 127.0.0.1:46042 - \"POST /generate HTTP/1.1\" 200 OK\n",
"[2024-10-31 19:46:38] The server is fired up and ready to roll!\n"
]
},
{
"data": {
"text/html": [
"<strong style='color: #00008B;'><br><br> NOTE: Typically, the server runs in a separate terminal.<br> In this notebook, we run the server and notebook code together, so their outputs are combined.<br> To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.<br> </strong>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sglang.utils import (\n",
" execute_shell_command,\n",
" wait_for_server,\n",
" terminate_process,\n",
" print_highlight,\n",
")\n",
"\n",
"server_process = execute_shell_command(\n",
"\"\"\"\n",
"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
"--port 30000 --host 0.0.0.0\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(\"http://localhost:30000\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Send a Request\n",
"\n",
"Once the server is running, you can send test requests using curl. The server implements the [OpenAI-compatible API](https://platform.openai.com/docs/api-reference/chat)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-01T02:46:42.813656Z",
"iopub.status.busy": "2024-11-01T02:46:42.813354Z",
"iopub.status.idle": "2024-11-01T02:46:51.436613Z",
"shell.execute_reply": "2024-11-01T02:46:51.435965Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:42 TP0] Prefill batch. #new-seq: 1, #new-token: 46, #cached-token: 1, cache hit rate: 1.85%, token usage: 0.00, #running-req: 0, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:43 TP0] Decode batch. #running-req: 1, #token: 80, token usage: 0.00, gen throughput (token/s): 5.40, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:44 TP0] Decode batch. #running-req: 1, #token: 120, token usage: 0.00, gen throughput (token/s): 42.48, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:45 TP0] Decode batch. #running-req: 1, #token: 160, token usage: 0.00, gen throughput (token/s): 42.37, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:46 TP0] Decode batch. #running-req: 1, #token: 200, token usage: 0.00, gen throughput (token/s): 42.33, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:47 TP0] Decode batch. #running-req: 1, #token: 240, token usage: 0.00, gen throughput (token/s): 42.34, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:48 TP0] Decode batch. #running-req: 1, #token: 280, token usage: 0.00, gen throughput (token/s): 42.28, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:49 TP0] Decode batch. #running-req: 1, #token: 320, token usage: 0.00, gen throughput (token/s): 42.28, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:50 TP0] Decode batch. #running-req: 1, #token: 360, token usage: 0.00, gen throughput (token/s): 42.24, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:51] INFO: 127.0.0.1:46046 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n",
"{\"id\":\"f9761ee1b1444bd7a640286884a90842\",\"object\":\"chat.completion\",\"created\":1730429211,\"model\":\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"LLM stands for Large Language Model. It's a type of artificial intelligence (AI) designed to process and comprehend human language in a way that's similar to how humans do.\\n\\nLarge Language Models are trained on massive amounts of text data, which allows them to learn patterns and relationships in language. This training enables them to generate text, answer questions, summarize content, and even engage in conversation.\\n\\nSome key characteristics of LLMs include:\\n\\n1. **Language understanding**: LLMs can comprehend the meaning of text, including nuances like idioms, sarcasm, and figurative language.\\n2. **Contextual awareness**: LLMs can understand the context in which a piece of text is written, including the topic, tone, and intent.\\n3. **Generative capabilities**: LLMs can generate text, including entire articles, conversations, or even creative writing like stories or poetry.\\n4. **Continuous learning**: LLMs can learn from new data and update their understanding of language over time.\\n\\nLLMs are used in a wide range of applications, including:\\n\\n1. **Virtual assistants**: LLMs power virtual assistants like Siri, Alexa, and Google Assistant.\\n2. **Chatbots**: LLMs are used to create chatbots that can engage with customers and provide support.\\n3. **Language translation**: LLMs can translate text from one language to another with high accuracy.\\n4. **Content generation**: LLMs can generate content, such as articles, social media posts, and product descriptions.\\n5. **Research and analysis**: LLMs can help researchers analyze and understand large amounts of text data.\\n\\nIn the context of our conversation, I'm a Large Language Model designed to provide helpful and informative responses to your questions!\"},\"logprobs\":null,\"finish_reason\":\"stop\",\"matched_stop\":128009}],\"usage\":{\"prompt_tokens\":47,\"total_tokens\":400,\"completion_tokens\":353,\"prompt_tokens_details\":null}}"
]
}
],
"source": [
"!curl http://localhost:30000/v1/chat/completions \\\n",
" -H \"Content-Type: application/json\" \\\n",
" -H \"Authorization: Bearer None\" \\\n",
" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What is a LLM?\"}]}'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Python Client\n",
"\n",
"You can also use the OpenAI Python API library to send requests."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-01T02:46:51.439372Z",
"iopub.status.busy": "2024-11-01T02:46:51.439178Z",
"iopub.status.idle": "2024-11-01T02:46:52.895776Z",
"shell.execute_reply": "2024-11-01T02:46:52.895318Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:51 TP0] Prefill batch. #new-seq: 1, #new-token: 20, #cached-token: 29, cache hit rate: 29.13%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
"[2024-10-31 19:46:51 TP0] Decode batch. #running-req: 1, #token: 50, token usage: 0.00, gen throughput (token/s): 27.57, #queue-req: 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 19:46:52 TP0] Decode batch. #running-req: 1, #token: 90, token usage: 0.00, gen throughput (token/s): 42.69, #queue-req: 0\n",
"[2024-10-31 19:46:52] INFO: 127.0.0.1:40952 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n"
]
},
{
"data": {
"text/html": [
"<strong style='color: #00008B;'>ChatCompletion(id='c563abb8fe74496f83203fe21ec4ff61', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\\n\\n1. **Country:** Japan\\n**Capital:** Tokyo\\n\\n2. **Country:** Australia\\n**Capital:** Canberra\\n\\n3. **Country:** Brazil\\n**Capital:** Brasília', refusal=None, role='assistant', function_call=None, tool_calls=None), matched_stop=128009)], created=1730429212, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=46, prompt_tokens=49, total_tokens=95, prompt_tokens_details=None))</strong>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import openai\n",
"\n",
"client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"print_highlight(response)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-01T02:46:52.898411Z",
"iopub.status.busy": "2024-11-01T02:46:52.898149Z",
"iopub.status.idle": "2024-11-01T02:46:54.398382Z",
"shell.execute_reply": "2024-11-01T02:46:54.397564Z"
}
},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}