add qwen3

2026-02-04 17:22:39 +08:00
parent d1c0f68ab4
commit 8511fe8530
1932 changed files with 300426 additions and 0 deletions
--- a/vllm-v0.6.2/docs/source/serving/compatibility_matrix.rst
+++ b/vllm-v0.6.2/docs/source/serving/compatibility_matrix.rst
@@ -0,0 +1,427 @@
+.. _compatibility_matrix:
+
+Compatibility Matrix
+====================
+
+The tables below show which features are mutually exclusive and which features are supported on selected hardware.
+
+.. note::
+
+   Entries marked '✗' that carry a link point to the tracking issue or pull request for that unsupported feature/hardware combination.
+
+Feature x Feature
+-----------------
+
+
+.. raw:: html
+
+    <style>
+      /* Make smaller to try to improve readability  */
+      td {
+        font-size: 0.8rem;
+        text-align: center;
+      }
+
+      th {
+        text-align: center;
+        font-size: 0.8rem;
+      }
+    </style>
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Feature
+     - :ref:`CP <chunked-prefill>`
+     - :ref:`APC <apc>`
+     - :ref:`LoRA <lora>`
+     - :abbr:`prmpt adptr (Prompt Adapter)`
+     - :ref:`SD <spec_decode>`
+     - CUDA graph
+     - :abbr:`enc-dec (Encoder-Decoder Models)`
+     - :abbr:`logP (Logprobs)`
+     - :abbr:`prmpt logP (Prompt Logprobs)`
+     - :abbr:`async output (Async Output Processing)`
+     - multi-step
+     - :abbr:`MM (Multimodal)`
+     - best-of
+     - beam-search
+     - :abbr:`guided dec (Guided Decoding)`
+   * - :ref:`CP <chunked-prefill>`
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :ref:`APC <apc>`
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :ref:`LoRA <lora>`
+     - `✗ <https://github.com/vllm-project/vllm/pull/9057>`__ 
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`prmpt adptr (Prompt Adapter)`
+     - ✅
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :ref:`SD <spec_decode>`
+     - ✗
+     - ✅
+     - ✗
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - CUDA graph
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`enc-dec (Encoder-Decoder Models)`
+     - ✗
+     - `✗ <https://github.com/vllm-project/vllm/issues/7366>`__ 
+     - ✗ 
+     - ✗
+     - `✗ <https://github.com/vllm-project/vllm/issues/7366>`__ 
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`logP (Logprobs)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`prmpt logP (Prompt Logprobs)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/pull/8199>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`async output (Async Output Processing)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ✅ 
+     - ✗
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - multi-step
+     - ✗
+     - ✅
+     - ✗
+     - ✅
+     - ✗
+     - ✅
+     - ✗
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/8198>`__ 
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`MM (Multimodal)`
+     -  `✗ <https://github.com/vllm-project/vllm/pull/8346>`__ 
+     -  `✗ <https://github.com/vllm-project/vllm/pull/8348>`__ 
+     -  `✗ <https://github.com/vllm-project/vllm/pull/7199>`__ 
+     - ?
+     - ?
+     - ✅
+     - ✗
+     - ✅
+     - ✅
+     - ✅
+     - ?
+     - 
+     - 
+     - 
+     - 
+   * - best-of
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/6137>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ?
+     - `✗ <https://github.com/vllm-project/vllm/issues/7968>`__ 
+     - ✅
+     - 
+     - 
+     - 
+   * - beam-search
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/6137>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ?
+     - `✗ <https://github.com/vllm-project/vllm/issues/7968>`__ 
+     - ?
+     - ✅
+     - 
+     - 
+   * - :abbr:`guided dec (Guided Decoding)`
+     - ✅
+     - ✅
+     - ?
+     - ?
+     - ✅
+     - ✅
+     - ?
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/9893>`__ 
+     - ?
+     - ✅
+     - ✅
+     - 
+
+
+Feature x Hardware
+------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Feature
+     - Volta
+     - Turing
+     - Ampere
+     - Ada
+     - Hopper
+     - CPU
+     - AMD
+   * - :ref:`CP <chunked-prefill>`
+     - `✗ <https://github.com/vllm-project/vllm/issues/2729>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗ 
+     - ✅
+   * - :ref:`APC <apc>`
+     - `✗ <https://github.com/vllm-project/vllm/issues/3687>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ✅
+   * - :ref:`LoRA <lora>`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/pull/4830>`__ 
+     - ✅
+   * - :abbr:`prmpt adptr (Prompt Adapter)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/8475>`__ 
+     - ✅
+   * - :ref:`SD <spec_decode>`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - CUDA graph
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ✅
+   * - :abbr:`enc-dec (Encoder-Decoder Models)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+   * - :abbr:`logP (Logprobs)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - :abbr:`prmpt logP (Prompt Logprobs)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - :abbr:`async output (Async Output Processing)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ✗
+   * - multi-step
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/8477>`__ 
+     - ✅
+   * - :abbr:`MM (Multimodal)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - best-of
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - beam-search
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - :abbr:`guided dec (Guided Decoding)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
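Note on the compatibility matrix above: one way to put the Feature x Feature table to use is to encode its ✗ cells as data and check a planned configuration against them. The Python sketch below does this for a handful of cells transcribed from the table. The names `INCOMPATIBLE` and `find_conflicts` and the feature labels are hypothetical and are not part of vLLM. The set is deliberately incomplete, so an empty result does not prove that a combination is supported.

```python
from itertools import combinations

# A subset of the pairs marked with a cross in the Feature x Feature table.
# The labels mirror the table's abbreviations; they are not vLLM option names.
INCOMPATIBLE = {
    frozenset({"LoRA", "CP"}),
    frozenset({"SD", "CP"}),
    frozenset({"SD", "LoRA"}),
    frozenset({"enc-dec", "CP"}),
    frozenset({"enc-dec", "APC"}),
    frozenset({"multi-step", "CP"}),
    frozenset({"multi-step", "SD"}),
}


def find_conflicts(features):
    """Return the requested feature pairs that the table marks as unsupported together."""
    conflicts = []
    # Sorting makes the output order deterministic regardless of input order.
    for first, second in combinations(sorted(set(features)), 2):
        if frozenset({first, second}) in INCOMPATIBLE:
            conflicts.append((first, second))
    return conflicts
```

Calling `find_conflicts(["CP", "LoRA", "APC"])` reports the chunked-prefill/LoRA pair, while a combination such as `["APC", "logP"]` yields an empty list.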
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_bentoml.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_bentoml.rst
@@ -0,0 +1,8 @@
+.. _deploying_with_bentoml:
+
+Deploying with BentoML
+======================
+
+`BentoML <https://github.com/bentoml/BentoML>`_ allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
+
+For details, see the tutorial `vLLM inference in the BentoML documentation <https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html>`_.
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_cerebrium.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_cerebrium.rst
@@ -0,0 +1,112 @@
+.. _deploying_with_cerebrium:
+
+Deploying with Cerebrium
+============================
+
+.. raw:: html
+
+    <p align="center">
+        <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
+    </p>
+
+vLLM can be run on a cloud-based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
+
+To install the Cerebrium client, run:
+
+.. code-block:: console
+
+    $ pip install cerebrium
+    $ cerebrium login
+
+Next, to create your Cerebrium project, run:
+    
+.. code-block:: console
+
+    $ cerebrium init vllm-project
+
+Next, to install the required packages, add the following to your cerebrium.toml:
+
+.. code-block:: toml
+
+    [cerebrium.deployment]
+    docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
+
+    [cerebrium.dependencies.pip]
+    vllm = "latest"
+
+Next, add code to handle inference for the LLM of your choice (``mistralai/Mistral-7B-Instruct-v0.1`` in this example). Add the following to your ``main.py``:
+    
+.. code-block:: python
+
+    from vllm import LLM, SamplingParams
+
+    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+
+    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+    
+        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
+        outputs = llm.generate(prompts, sampling_params)
+
+        # Collect each prompt together with its generated text.
+        results = []
+        for output in outputs:
+            prompt = output.prompt
+            generated_text = output.outputs[0].text
+            results.append({"prompt": prompt, "generated_text": generated_text})
+
+        return {"results": results}
+
+
+Then, run the following command to deploy it to the cloud:
+
+.. code-block:: console
+
+    $ cerebrium deploy
+
+If the deployment succeeds, Cerebrium returns a ``curl`` command that you can use to run inference. Remember to end the URL with the name of the function you are calling (in our case, ``/run``):
+
+.. code-block:: console
+
+    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
+     -H 'Content-Type: application/json' \
+     -H 'Authorization: <JWT TOKEN>' \
+     --data '{
+       "prompts": [
+         "Hello, my name is",
+         "The president of the United States is",
+         "The capital of France is",
+         "The future of AI is"
+       ]
+     }'
+
+You should get a response like:
+
+.. code-block:: json
+    
+    {
+        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
+        "result": {
+            "results": [
+                {
+                    "prompt": "Hello, my name is",
+                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
+                },
+                {
+                    "prompt": "The president of the United States is",
+                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
+                },
+                {
+                    "prompt": "The capital of France is",
+                    "generated_text": " Paris.\n"
+                },
+                {
+                    "prompt": "The future of AI is",
+                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
+                }
+            ]
+        },
+        "run_time_ms": 152.53663063049316
+    }
+
+You now have an autoscaling endpoint where you only pay for the compute you use!
+
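Note on the Cerebrium guide above: the same request can be issued from Python instead of `curl`. The sketch below only builds the request object, using the standard library, with the headers and JSON body taken from the `curl` example. The function name `build_run_request` is hypothetical, and the URL and token are the placeholders from the guide, so substitute the values Cerebrium prints for your own deployment.

```python
import json
import urllib.request


def build_run_request(endpoint_url, jwt_token, prompts):
    """Build (but do not send) a POST request equivalent to the curl example."""
    # The body matches the signature of run(): a JSON object with a "prompts" list.
    body = json.dumps({"prompts": prompts}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": jwt_token,
        },
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and decode the response body with `json.loads`.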
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_docker.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_docker.rst
@@ -0,0 +1,53 @@
+.. _deploying_with_docker:
+
+Deploying with Docker
+============================
+
+vLLM offers an official Docker image for deployment.
+The image can be used to run an OpenAI-compatible server and is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
+
+.. code-block:: console
+
+    $ docker run --runtime nvidia --gpus all \
+        -v ~/.cache/huggingface:/root/.cache/huggingface \
+        --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
+        -p 8000:8000 \
+        --ipc=host \
+        vllm/vllm-openai:latest \
+        --model mistralai/Mistral-7B-v0.1
+
+
+.. note::
+
+        You can either use the ``--ipc=host`` flag or ``--shm-size`` flag to allow the
+        container to access the host's shared memory. vLLM uses PyTorch, which uses shared
+        memory to share data between processes under the hood, particularly for tensor parallel inference.
+
+
+You can build and run vLLM from source via the provided `Dockerfile <https://github.com/vllm-project/vllm/blob/main/Dockerfile>`_. To build vLLM:
+
+.. code-block:: console
+
+    $ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai # optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
+
+
+.. note::
+
+        By default vLLM will build for all GPU types for widest distribution. If you are just building for the
+        current GPU type the machine is running on, you can add the argument ``--build-arg torch_cuda_arch_list=""``
+        for vLLM to find the current GPU type and build for that.
+
+
+To run vLLM:
+
+.. code-block:: console
+
+    $ docker run --runtime nvidia --gpus all \
+        -v ~/.cache/huggingface:/root/.cache/huggingface \
+        -p 8000:8000 \
+        --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
+        vllm/vllm-openai <args...>
+
+.. note::
+
+        **For v0.4.1 and v0.4.2 only**: the vLLM Docker images for these versions are meant to be run as the root user, because a library under the root user's home directory, i.e. ``/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1``, must be loaded at runtime. If you run the container as a different user, you may first need to change the permissions of the library (and all of its parent directories) so that the user can access it, and then run vLLM with the environment variable ``VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1``.
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_dstack.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_dstack.rst
@@ -0,0 +1,103 @@
+.. _deploying_with_dstack:
+
+Deploying with dstack
+============================
+
+.. raw:: html
+
+    <p align="center">
+        <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
+    </p>
+
+vLLM can be run on a cloud-based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
+
+To install the dstack client, run:
+
+.. code-block:: console
+
+    $ pip install "dstack[all]"
+    $ dstack server
+
+Next, to configure your dstack project, run:
+    
+.. code-block:: console
+
+    $ mkdir -p vllm-dstack
+    $ cd vllm-dstack
+    $ dstack init
+
+Next, to provision a VM instance with the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` in this example), create the following `serve.dstack.yml` file for the dstack `Service`:
+    
+.. code-block:: yaml
+
+    type: service
+    
+    python: "3.11"
+    env:
+        - MODEL=NousResearch/Llama-2-7b-chat-hf
+    port: 8000
+    resources:
+        gpu: 24GB
+    commands:
+        - pip install vllm
+        - vllm serve $MODEL --port 8000
+    model:
+        format: openai
+        type: chat
+        name: NousResearch/Llama-2-7b-chat-hf
+
+Then, run the following command to provision the service:
+
+.. code-block:: console
+
+    $ dstack run . -f serve.dstack.yml
+    
+    ⠸ Getting run plan...
+     Configuration  serve.dstack.yml             
+     Project        deep-diver-main              
+     User           deep-diver                   
+     Min resources  2..xCPU, 8GB.., 1xGPU (24GB) 
+     Max price      -                            
+     Max duration   -                            
+     Spot policy    auto                         
+     Retry policy   no                           
+    
+     #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE       
+     1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
+     2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
+     3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
+        ...                                                                                            
+     Shown 3 of 193 offers, $5.876 max
+    
+    Continue? [y/n]: y
+    ⠙ Submitting run...
+    ⠏ Launching spicy-treefrog-1 (pulling)
+    spicy-treefrog-1 provisioning completed (running)
+    Service is published at ...
+
+After provisioning, you can interact with the model by using the OpenAI SDK:
+
+.. code-block:: python
+
+    from openai import OpenAI
+    
+    client = OpenAI(
+        base_url="https://gateway.<gateway domain>",
+        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
+    )
+    
+    completion = client.chat.completions.create(
+        model="NousResearch/Llama-2-7b-chat-hf",
+        messages=[
+            {
+                "role": "user",
+                "content": "Compose a poem that explains the concept of recursion in programming.",
+            }
+        ]
+    )
+
+    print(completion.choices[0].message.content)
+
+.. note::
+
+    dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; the `Task` is intended for development purposes only. For more hands-on material on serving vLLM with dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__.
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_k8s.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_k8s.rst
@@ -0,0 +1,175 @@
+.. _deploying_with_k8s:
+
+Deploying with Kubernetes
+==========================
+
+Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
+
+Prerequisites
+-------------
+Before you begin, ensure that you have the following:
+
+- A running Kubernetes cluster
+- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``), available at https://github.com/NVIDIA/k8s-device-plugin/
+- Available GPU resources in your cluster
+
+Deployment Steps
+----------------
+
+1.  **Create a PVC, Secret, and Deployment for vLLM**
+
+
+The PVC is used to store the model cache. It is optional; you can use hostPath or other storage options instead.
+
+.. code-block:: yaml
+
+  apiVersion: v1
+  kind: PersistentVolumeClaim
+  metadata:
+    name: mistral-7b
+    namespace: default
+  spec:
+    accessModes:
+    - ReadWriteOnce
+    resources:
+      requests:
+        storage: 50Gi
+    storageClassName: default
+    volumeMode: Filesystem
+
+The Secret is optional and only required for accessing gated models. You can skip this step if you are not using gated models.
+
+.. code-block:: yaml
+
+  apiVersion: v1
+  kind: Secret
+  metadata:
+    name: hf-token-secret
+    namespace: default
+  type: Opaque
+  stringData:
+    token: "REPLACE_WITH_TOKEN"
+
+
+Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
+
+.. code-block:: yaml
+
+  apiVersion: apps/v1
+  kind: Deployment
+  metadata:
+    name: mistral-7b
+    namespace: default
+    labels:
+      app: mistral-7b
+  spec:
+    replicas: 1
+    selector:
+      matchLabels:
+        app: mistral-7b
+    template:
+      metadata:
+        labels:
+          app: mistral-7b
+      spec:
+        volumes:
+        - name: cache-volume
+          persistentVolumeClaim:
+            claimName: mistral-7b
+        # vLLM needs to access the host's shared memory for tensor parallel inference.
+        - name: shm
+          emptyDir:
+            medium: Memory
+            sizeLimit: "2Gi"
+        containers:
+        - name: mistral-7b
+          image: vllm/vllm-openai:latest
+          command: ["/bin/sh", "-c"]
+          args: [
+            "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
+          ]
+          env:
+          - name: HUGGING_FACE_HUB_TOKEN
+            valueFrom:
+              secretKeyRef:
+                name: hf-token-secret
+                key: token
+          ports:
+          - containerPort: 8000
+          resources:
+            limits:
+              cpu: "10"
+              memory: 20G
+              nvidia.com/gpu: "1"
+            requests:
+              cpu: "2"
+              memory: 6G
+              nvidia.com/gpu: "1"
+          volumeMounts:
+          - mountPath: /root/.cache/huggingface
+            name: cache-volume
+          - name: shm
+            mountPath: /dev/shm
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: 8000
+            initialDelaySeconds: 60
+            periodSeconds: 10
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 8000
+            initialDelaySeconds: 60
+            periodSeconds: 5
+
+2. **Create a Kubernetes Service for vLLM**
+
+Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
+
+.. code-block:: yaml
+
+    apiVersion: v1
+    kind: Service
+    metadata:
+      name: mistral-7b
+      namespace: default
+    spec:
+      ports:
+      - name: http-mistral-7b
+        port: 80
+        protocol: TCP
+        targetPort: 8000
+      # The label selector should match the deployment labels & it is useful for prefix caching feature
+      selector:
+        app: mistral-7b
+      sessionAffinity: None
+      type: ClusterIP
+
+3. **Deploy and Test**
+
+Apply the deployment and service configurations using ``kubectl apply -f <filename>``:
+
+.. code-block:: console
+
+    kubectl apply -f deployment.yaml
+    kubectl apply -f service.yaml
+
+To test the deployment, run the following ``curl`` command:
+
+.. code-block:: console
+
+    curl http://mistral-7b.default.svc.cluster.local/v1/completions \
+      -H "Content-Type: application/json" \
+      -d '{
+            "model": "mistralai/Mistral-7B-Instruct-v0.3",
+            "prompt": "San Francisco is a",
+            "max_tokens": 7,
+            "temperature": 0
+          }'
+
+If the service is correctly deployed, you should receive a response from the vLLM model.
+
+Conclusion
+----------
+Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
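Note on the Kubernetes guide above: the Deployment's liveness and readiness probes both call `/health` on port 8000, so a script can poll the same path before sending test requests. The Python sketch below is one way to do that with the standard library. The function name `wait_until_healthy` and its timeout defaults are hypothetical, and it assumes that an HTTP 200 from `/health` means the server is ready, as the probe configuration implies.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url, timeout_s=300.0, interval_s=5.0):
    """Poll GET <base_url>/health until it answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not reachable or not ready yet; fall through and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

From inside the cluster, the base URL would be the Service address used in the test command, `http://mistral-7b.default.svc.cluster.local`.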
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_kserve.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_kserve.rst
@@ -0,0 +1,8 @@
+.. _deploying_with_kserve:
+
+Deploying with KServe
+============================
+
+vLLM can be deployed with `KServe <https://github.com/kserve/kserve>`_ on Kubernetes for highly scalable distributed model serving.
+
+Please see `this guide <https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/>`_ for more details on using vLLM with KServe.
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_lws.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_lws.rst
@@ -0,0 +1,12 @@
+.. _deploying_with_lws:
+
+Deploying with LWS
+============================
+
+LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
+A major use case is for multi-host/multi-node distributed inference.
+
+vLLM can be deployed with `LWS <https://github.com/kubernetes-sigs/lws>`_ on Kubernetes for distributed model serving.
+
+Please see `this guide <https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm>`_ for more details on
+deploying vLLM on Kubernetes using LWS.
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_nginx.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_nginx.rst
@@ -0,0 +1,142 @@
+.. _nginxloadbalancer:
+
+Deploying with Nginx Loadbalancer
+=================================
+
+This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers. 
+
+Table of contents:
+
+#. :ref:`Build Nginx Container <nginxloadbalancer_nginx_build>`
+#. :ref:`Create Simple Nginx Config file <nginxloadbalancer_nginx_conf>`
+#. :ref:`Build vLLM Container <nginxloadbalancer_nginx_vllm_container>`
+#. :ref:`Create Docker Network <nginxloadbalancer_nginx_docker_network>`
+#. :ref:`Launch vLLM Containers <nginxloadbalancer_nginx_launch_container>`
+#. :ref:`Launch Nginx <nginxloadbalancer_nginx_launch_nginx>`
+#. :ref:`Verify That vLLM Servers Are Ready <nginxloadbalancer_nginx_verify_nginx>`
+
+.. _nginxloadbalancer_nginx_build:
+
+Build Nginx Container
+---------------------
+
+This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
+
+.. code-block:: console
+
+    export vllm_root=`pwd`
+
+Create a file named ``Dockerfile.nginx``:
+
+.. code-block:: docker
+
+    FROM nginx:latest
+    RUN rm /etc/nginx/conf.d/default.conf
+    EXPOSE 80
+    CMD ["nginx", "-g", "daemon off;"]
+
+Build the container:
+
+.. code-block:: console
+
+    docker build . -f Dockerfile.nginx --tag nginx-lb
+
+.. _nginxloadbalancer_nginx_conf:
+
+Create Simple Nginx Config file
+-------------------------------
+
+Create a file named ``nginx_conf/nginx.conf``. You can add as many servers as you'd like; the example below starts with two. To add more, add another ``server vllmN:8000 max_fails=3 fail_timeout=10000s;`` entry to ``upstream backend``.
+
+.. code-block:: nginx
+
+    upstream backend {
+        least_conn;
+        server vllm0:8000 max_fails=3 fail_timeout=10000s;
+        server vllm1:8000 max_fails=3 fail_timeout=10000s;
+    }     
+    server {
+        listen 80;
+        location / {
+            proxy_pass http://backend;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Proto $scheme;
+        }
+    }
+
+.. _nginxloadbalancer_nginx_vllm_container:
+
+Build vLLM Container
+--------------------
+
+.. code-block:: console
+
+    cd $vllm_root
+    docker build -f Dockerfile . --tag vllm
+
+
+If you are behind a proxy, you can pass the proxy settings to the docker build command as shown below:
+
+.. code-block:: console
+
+    cd $vllm_root
+    docker build -f Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
+
+.. _nginxloadbalancer_nginx_docker_network:
+
+Create Docker Network
+---------------------
+
+.. code-block:: console
+
+    docker network create vllm_nginx
+
+
+.. _nginxloadbalancer_nginx_launch_container:
+
+Launch vLLM Containers
+----------------------
+
+Notes:
+
+* If you have your HuggingFace models cached somewhere else, update ``hf_cache_dir`` below. 
+* If you don't have an existing HuggingFace cache you will want to start ``vllm0`` and wait for the model to complete downloading and the server to be ready. This will ensure that ``vllm1`` can leverage the model you just downloaded and it won't have to be downloaded again.
+* The example below assumes the GPU backend is used. If you are using the CPU backend, remove ``--gpus all`` and add the ``VLLM_CPU_KVCACHE_SPACE`` and ``VLLM_CPU_OMP_THREADS_BIND`` environment variables to the docker run command.
+* Adjust the model name that you want to use in your vLLM servers if you don't want to use ``Llama-2-7b-chat-hf``. 
+
+.. code-block:: console
+
+    mkdir -p ~/.cache/huggingface/hub/
+    hf_cache_dir=~/.cache/huggingface/
+    docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
+    docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
+
+.. note::
+    If you are behind a proxy, you can pass the proxy settings to the docker run command via ``-e http_proxy=$http_proxy -e https_proxy=$https_proxy``.
+
+.. _nginxloadbalancer_nginx_launch_nginx:
+
+Launch Nginx
+------------
+
+.. code-block:: console
+
+    docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
+    
+.. _nginxloadbalancer_nginx_verify_nginx:
+
+Verify That vLLM Servers Are Ready
+----------------------------------
+
+.. code-block:: console
+    
+    docker logs vllm0 | grep Uvicorn
+    docker logs vllm1 | grep Uvicorn
+
+Both outputs should look like this:
+
+.. code-block:: console
+
+    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
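Note on the Nginx guide above: the config section says that more backends are added by appending further `server vllmN:8000 ...` entries to `upstream backend`. A small Python sketch can generate that block for any number of containers. The function name `render_upstream` is hypothetical; the defaults simply repeat the values used in the example config.

```python
def render_upstream(num_servers, port=8000, max_fails=3, fail_timeout="10000s"):
    """Render an ``upstream backend`` block with one entry per vLLM container."""
    if num_servers < 1:
        raise ValueError("at least one server is required")
    lines = ["upstream backend {", "    least_conn;"]
    # Container names follow the vllm0, vllm1, ... scheme used by the docker run commands.
    for index in range(num_servers):
        lines.append(
            f"    server vllm{index}:{port} max_fails={max_fails} fail_timeout={fail_timeout};"
        )
    lines.append("}")
    return "\n".join(lines)
```

The output replaces only the `upstream backend` block; the `server { ... }` block from the example stays unchanged, and each generated name needs a matching container on the `vllm_nginx` network.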
--- a/vllm-v0.6.2/docs/source/serving/deploying_with_triton.rst
+++ b/vllm-v0.6.2/docs/source/serving/deploying_with_triton.rst
@@ -0,0 +1,6 @@
+.. _deploying_with_triton:
+
+Deploying with NVIDIA Triton
+============================
+
+The `Triton Inference Server <https://github.com/triton-inference-server>`_ hosts a tutorial demonstrating how to quickly deploy a simple `facebook/opt-125m <https://huggingface.co/facebook/opt-125m>`_ model using vLLM. Please see `Deploying a vLLM model in Triton <https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton>`_ for more details.
--- a/vllm-v0.6.2/docs/source/serving/distributed_serving.rst
+++ b/vllm-v0.6.2/docs/source/serving/distributed_serving.rst
@@ -0,0 +1,107 @@
+.. _distributed_serving:
+
+Distributed Inference and Serving
+=================================
+
+How to decide the distributed inference strategy?
+-------------------------------------------------
+
+Before going into the details of distributed inference and serving, let's first clarify when to use distributed inference and which strategies are available. The common practice is:
+
+- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
+- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
+- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallelism together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
+
+In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
+
+After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like ``# GPU blocks: 790``. Multiply the number by ``16`` (the block size) to get roughly the maximum number of tokens that can be served on the current configuration. If this number is not sufficient, e.g. you want higher throughput, you can further increase the number of GPUs or nodes until the number of blocks is enough.
+
+.. note::
+    There is one edge case: if the model fits in a single node with multiple GPUs, but the model cannot be split evenly across that number of GPUs (for example, the number of attention heads is not divisible by the GPU count), you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
+
+Details for Distributed Inference and Serving
+----------------------------------------------
+
+vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with either `Ray <https://github.com/ray-project/ray>`_ or Python-native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
+
+Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured :code:`tensor_parallel_size`; otherwise Ray will be used. This default can be overridden via the :code:`LLM` class :code:`distributed_executor_backend` argument or the :code:`--distributed-executor-backend` API server argument. Set it to :code:`mp` for multiprocessing or :code:`ray` for Ray. Ray does not need to be installed for the multiprocessing case.
+
+To run multi-GPU inference with the :code:`LLM` class, set the :code:`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
+
+.. code-block:: python
+
+    from vllm import LLM
+    llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
+    output = llm.generate("San Francisco is a")
+
+To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
+
+.. code-block:: console
+
+    $ vllm serve facebook/opt-13b \
+          --tensor-parallel-size 4
+
+You can also additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
+
+.. code-block:: console
+
+    $ vllm serve gpt2 \
+          --tensor-parallel-size 4 \
+          --pipeline-parallel-size 2
+
+Multi-Node Inference and Serving
+--------------------------------
+
+If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path and the Python environment. The recommended way is to use Docker images to ensure the same environment, and to hide the heterogeneity of the host machines by mapping them into the same Docker configuration.
+
+The first step is to start containers and organize them into a cluster. We have provided a helper `script <https://github.com/vllm-project/vllm/tree/main/examples/run_cluster.sh>`_ to start the cluster.
+
+Pick a node as the head node, and run the following command:
+
+.. code-block:: console
+
+    $ bash run_cluster.sh \
+          vllm/vllm-openai \
+          ip_of_head_node \
+          --head \
+          /path/to/the/huggingface/home/in/this/node
+
+On the rest of the worker nodes, run the following command:
+
+.. code-block:: console
+
+    $ bash run_cluster.sh \
+          vllm/vllm-openai \
+          ip_of_head_node \
+          --worker \
+          /path/to/the/huggingface/home/in/this/node
+
+You now have a Ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument ``ip_of_head_node`` should be the IP address of the head node, which must be accessible by all the worker nodes. A common mistake is to pass the IP address of the worker node instead.
+
+Then, on any node, use ``docker exec -it node /bin/bash`` to enter the container, and execute ``ray status`` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
+
+After that, on any node, you can use vLLM as usual, just as if you had all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
+
+.. code-block:: console
+
+    $ vllm serve /path/to/the/model/in/the/container \
+          --tensor-parallel-size 8 \
+          --pipeline-parallel-size 2
+
+You can also use tensor parallelism without pipeline parallelism: just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
+
+.. code-block:: console
+
+    $ vllm serve /path/to/the/model/in/the/container \
+          --tensor-parallel-size 16
+
+To make tensor parallelism performant, you should make sure the communication between nodes is efficient, e.g. by using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like ``--privileged -e NCCL_IB_HCA=mlx5`` to the ``run_cluster.sh`` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm that InfiniBand is working is to run vLLM with the ``NCCL_DEBUG=TRACE`` environment variable set, e.g. ``NCCL_DEBUG=TRACE vllm serve ...``, and check the logs for the NCCL version and the network used. If you find ``[send] via NET/Socket`` in the logs, NCCL is using raw TCP sockets, which is not efficient for cross-node tensor parallelism. If you find ``[send] via NET/IB/GDRDMA`` in the logs, NCCL is using InfiniBand with GPU-Direct RDMA, which is efficient.
+
+.. warning::
+    After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the `sanity check script <https://docs.vllm.ai/en/latest/getting_started/debugging.html>`_ for more information. If you need to set some environment variables for the communication configuration, you can append them to the ``run_cluster.sh`` script, e.g. ``-e NCCL_SOCKET_IFNAME=eth0``. Note that setting environment variables in the shell (e.g. ``NCCL_SOCKET_IFNAME=eth0 vllm serve ...``) only works for the processes on the same node, not for the processes on the other nodes. Setting environment variables when you create the cluster is the recommended way. See the `discussion <https://github.com/vllm-project/vllm/issues/6803>`_ for more information.
+
+.. warning::
+
+    Please make sure you have downloaded the model to all the nodes (with the same path), or that the model is downloaded to a distributed file system that is accessible by all nodes.
+
+    When you use a Hugging Face repo ID to refer to the model, you should append your Hugging Face token to the ``run_cluster.sh`` script, e.g. ``-e HF_TOKEN=``. The recommended way is to download the model first, and then use the path to refer to the model.
--- a/vllm-v0.6.2/docs/source/serving/env_vars.rst
+++ b/vllm-v0.6.2/docs/source/serving/env_vars.rst
@@ -0,0 +1,14 @@
+Environment Variables
+========================
+
+vLLM uses the following environment variables to configure the system:
+
+.. warning::
+    Please note that ``VLLM_PORT`` and ``VLLM_HOST_IP`` set the port and IP address for vLLM's **internal usage**. They are not the port and IP address for the API server. If you use ``--host $VLLM_HOST_IP`` and ``--port $VLLM_PORT`` to start the API server, it will not work.
+
+    All environment variables used by vLLM are prefixed with ``VLLM_``. **Special care should be taken for Kubernetes users**: please do not name the service as ``vllm``, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because `Kubernetes sets environment variables for each service with the capitalized service name as the prefix <https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables>`_.
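+
+To see why a Kubernetes service named ``vllm`` is a problem, the following sketch reproduces a subset of the variable names Kubernetes injects for a service. ``k8s_service_env`` is a hypothetical helper, and the cluster IP and port are made-up values:
+
+.. code-block:: python
+
+    def k8s_service_env(service_name: str, cluster_ip: str, port: int) -> dict[str, str]:
+        """Return some of the environment variables Kubernetes sets for a service."""
+        # The service name is upper-cased and dashes become underscores.
+        prefix = service_name.upper().replace("-", "_")
+        return {
+            f"{prefix}_SERVICE_HOST": cluster_ip,
+            f"{prefix}_SERVICE_PORT": str(port),
+            f"{prefix}_PORT": f"tcp://{cluster_ip}:{port}",
+        }
+
+    env = k8s_service_env("vllm", "10.0.0.11", 8000)
+    # VLLM_PORT now holds a URL instead of the port number vLLM expects.
+    print(env["VLLM_PORT"])  # tcp://10.0.0.11:8000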
+
+.. literalinclude:: ../../../vllm/envs.py
+    :language: python
+    :start-after: begin-env-vars-definition
+    :end-before: end-env-vars-definition
--- a/vllm-v0.6.2/docs/source/serving/faq.rst
+++ b/vllm-v0.6.2/docs/source/serving/faq.rst
@@ -0,0 +1,31 @@
+Frequently Asked Questions
+===========================
+
+    Q: How can I serve multiple models on a single port using the OpenAI API?
+
+A: Assuming that you're referring to using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. Instead, you can run multiple instances of the server (each serving a different model) at the same time, and add another layer that routes each incoming request to the correct server.
+
+----------------------------------------
+
+    Q: Which model to use for offline inference embedding?
+
+A: If you want to use an embedding model, try: https://huggingface.co/intfloat/e5-mistral-7b-instruct. In contrast, models such as Llama-3-8b and Mistral-7B-Instruct-v0.3 are generation models rather than embedding models.
+
+----------------------------------------
+
+    Q: Can the output of a prompt vary across runs in vLLM?
+
+A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
+numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details, 
+see the `Numerical Accuracy section <https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations>`_.
+
+In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
+changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations, 
+can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in 
+different tokens being sampled. Once a different token is sampled, further divergence is likely.
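+
+The underlying effect can be seen with plain Python floats, independent of vLLM or Torch: floating-point addition is not associative, so summing the same values in a different order (as can happen when batching changes) may give a slightly different result. This is only an illustration of the numerical issue, not of vLLM's kernels:
+
+.. code-block:: python
+
+    # Same three numbers, two different summation orders.
+    a = (0.1 + 0.2) + 0.3
+    b = 0.1 + (0.2 + 0.3)
+
+    print(a == b)  # False
+    print(a, b)    # 0.6000000000000001 0.6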
+
+**Mitigation Strategies**
+
+- For improved stability and reduced variance, use ``float32``. Note that this will require more memory.
+- If using ``bfloat16``, switching to ``float16`` can also help.
+- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
--- a/vllm-v0.6.2/docs/source/serving/integrations.rst
+++ b/vllm-v0.6.2/docs/source/serving/integrations.rst
@@ -0,0 +1,16 @@
+Integrations
+------------
+
+.. toctree::
+   :maxdepth: 1
+
+   run_on_sky
+   deploying_with_kserve
+   deploying_with_triton
+   deploying_with_bentoml
+   deploying_with_cerebrium
+   deploying_with_lws
+   deploying_with_dstack
+   serving_with_langchain
+   serving_with_llamaindex
+   serving_with_llamastack
--- a/vllm-v0.6.2/docs/source/serving/metrics.rst
+++ b/vllm-v0.6.2/docs/source/serving/metrics.rst
@@ -0,0 +1,13 @@
+Production Metrics
+==================
+
+vLLM exposes a number of metrics that can be used to monitor the health of the
+system. These metrics are exposed via the ``/metrics`` endpoint on the vLLM
+OpenAI compatible API server.
+
+The following metrics are exposed:
+
+.. literalinclude:: ../../../vllm/engine/metrics.py
+    :language: python
+    :start-after: begin-metrics-definitions
+    :end-before: end-metrics-definitions
--- a/vllm-v0.6.2/docs/source/serving/openai_compatible_server.md
+++ b/vllm-v0.6.2/docs/source/serving/openai_compatible_server.md
@@ -0,0 +1,430 @@
+# OpenAI Compatible Server
+
+vLLM provides an HTTP server that implements OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.
+
+You can start the server using Python, or using [Docker](deploying_with_docker.rst):
+```bash
+vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
+```
+
+To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
+```python
+from openai import OpenAI
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="token-abc123",
+)
+
+completion = client.chat.completions.create(
+  model="NousResearch/Meta-Llama-3-8B-Instruct",
+  messages=[
+    {"role": "user", "content": "Hello!"}
+  ]
+)
+
+print(completion.choices[0].message)
+```
+
+## API Reference
+
+We currently support the following OpenAI APIs:
+
+- [Completions API](https://platform.openai.com/docs/api-reference/completions)
+  - *Note: `suffix` parameter is not supported.*
+- [Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
+  - [Vision](https://platform.openai.com/docs/guides/vision)-related parameters are supported; see [Using VLMs](../models/vlm.rst).
+    - *Note: `image_url.detail` parameter is not supported.*
+  - We also support `audio_url` content type for audio files.
+    - Refer to [vllm.entrypoints.chat_utils](https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/chat_utils.py) for the exact schema.
+    - *TODO: Support `input_audio` content type as defined [here](https://github.com/openai/openai-python/blob/v1.52.2/src/openai/types/chat/chat_completion_content_part_input_audio_param.py).*
+  - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
+- [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
+  - Instead of `input`, you can pass in a list of `messages` (same schema as Chat Completions API),
+    which will be treated as a single prompt to the model according to its chat template.
+    - This enables multi-modal inputs to be passed to embedding models, see [Using VLMs](../models/vlm.rst).
+  - *Note: You should run `vllm serve` with `--task embedding` to ensure that the model is being run in embedding mode.*
+
+## Extra Parameters
+
+vLLM supports a set of parameters that are not part of the OpenAI API.
+To use them, you can pass them as extra parameters in the OpenAI client,
+or merge them directly into the JSON payload if you are making the HTTP call yourself.
+
+```python
+completion = client.chat.completions.create(
+  model="NousResearch/Meta-Llama-3-8B-Instruct",
+  messages=[
+    {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+  ],
+  extra_body={
+    "guided_choice": ["positive", "negative"]
+  }
+)
+```
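+
+For a raw HTTP call, the same parameter goes at the top level of the JSON body, next to the standard OpenAI fields. The sketch below only builds the request with the standard library; the URL and API key are the ones used earlier on this page, and actually sending it assumes a server is running:
+
+```python
+import json
+import urllib.request
+
+payload = {
+    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
+    "messages": [
+        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+    ],
+    # vLLM-specific parameter, merged directly into the payload.
+    "guided_choice": ["positive", "negative"],
+}
+
+request = urllib.request.Request(
+    "http://localhost:8000/v1/chat/completions",
+    data=json.dumps(payload).encode("utf-8"),
+    headers={
+        "Content-Type": "application/json",
+        "Authorization": "Bearer token-abc123",
+    },
+    method="POST",
+)
+
+# Sending requires a running server, so it is left commented out:
+# with urllib.request.urlopen(request) as response:
+#     print(json.load(response)["choices"][0]["message"]["content"])
+```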
+
+### Extra HTTP Headers
+
+Only `X-Request-Id` HTTP request header is supported for now.
+
+```python
+completion = client.chat.completions.create(
+  model="NousResearch/Meta-Llama-3-8B-Instruct",
+  messages=[
+    {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+  ],
+  extra_headers={
+    "x-request-id": "sentiment-classification-00001",
+  }
+)
+print(completion._request_id)
+
+completion = client.completions.create(
+  model="NousResearch/Meta-Llama-3-8B-Instruct",
+  prompt="A robot may not injure a human being",
+  extra_headers={
+    "x-request-id": "completion-test",
+  }
+)
+print(completion._request_id)
+```
+
+### Extra Parameters for Completions API
+
+The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-completion-sampling-params
+:end-before: end-completion-sampling-params
+```
+
+The following extra parameters are supported:
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-completion-extra-params
+:end-before: end-completion-extra-params
+```
+
+### Extra Parameters for Chat Completions API
+
+The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-chat-completion-sampling-params
+:end-before: end-chat-completion-sampling-params
+```
+
+The following extra parameters are supported:
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-chat-completion-extra-params
+:end-before: end-chat-completion-extra-params
+```
+
+### Extra Parameters for Embeddings API
+
+The following [pooling parameters (click through to see documentation)](../dev/pooling_params.rst) are supported.
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-embedding-pooling-params
+:end-before: end-embedding-pooling-params
+```
+
+The following extra parameters are supported:
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-embedding-extra-params
+:end-before: end-embedding-extra-params
+```
+
+## Chat Template
+
+In order for the language model to support chat protocol, vLLM requires the model to include
+a chat template in its tokenizer configuration. The chat template is a Jinja2 template that
+specifies how roles, messages, and other chat-specific tokens are encoded in the input.
+
+An example chat template for `NousResearch/Meta-Llama-3-8B-Instruct` can be found [here](https://github.com/meta-llama/llama3?tab=readme-ov-file#instruction-tuned-models).
+
+Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models,
+you can manually specify their chat template in the `--chat-template` parameter with the file path to the chat
+template, or the template in string form. Without a chat template, the server will not be able to process chat
+and all chat requests will error.
+
+```bash
+vllm serve <model> --chat-template ./path-to-chat-template.jinja
+```
+
+The vLLM community provides a set of chat templates for popular models. You can find them in the examples
+directory [here](https://github.com/vllm-project/vllm/tree/main/examples/).
+
+With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
+both a `type` and a `text` field. An example is provided below:
+```python
+completion = client.chat.completions.create(
+  model="NousResearch/Meta-Llama-3-8B-Instruct",
+  messages=[
+    {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
+  ]
+)
+```
+Most chat templates for LLMs expect the `content` to be a `string` but there are some newer models like
+`meta-llama/Llama-Guard-3-1B` that expect the content to be parsed with the new OpenAI spec. In order to choose which
+format the content needs to be parsed in by vLLM, please use the `--chat-template-text-format` argument to specify
+between `string` or `openai`. The default value is `string` and vLLM internally converts both spec formats to match
+this, unless explicitly specified.
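+
+To make the difference between the two formats concrete, here is a minimal sketch of flattening the list-of-parts form into a plain string. `content_to_string` is a hypothetical helper, and joining text parts with a newline is an assumption made for illustration; it is not taken from vLLM's implementation:
+
+```python
+from typing import Union
+
+Content = Union[str, list[dict]]
+
+
+def content_to_string(content: Content) -> str:
+    """Flatten OpenAI-style text parts into a single string."""
+    if isinstance(content, str):
+        return content
+    # Keep only the text parts; other part types are ignored in this sketch.
+    texts = [part["text"] for part in content if part.get("type") == "text"]
+    return "\n".join(texts)
+
+
+message = {
+    "role": "user",
+    "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}],
+}
+print(content_to_string(message["content"]))
+```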
+
+
+## Command line arguments for the server
+
+```{argparse}
+:module: vllm.entrypoints.openai.cli_args
+:func: create_parser_for_docs
+:prog: vllm serve
+```
+
+
+### Config file
+
+The `serve` module can also accept arguments from a config file in
+`yaml` format. The arguments in the yaml must be specified using the
+long form of the argument outlined [here](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server):
+
+For example:
+
+```yaml
+# config.yaml
+
+host: "127.0.0.1"
+port: 6379
+uvicorn-log-level: "info"
+```
+
+```bash
+$ vllm serve SOME_MODEL --config config.yaml
+```
+---
+**NOTE**
+If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
+The order of priorities is `command line > config file values > defaults`.
+
+---
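+
+The precedence order can be pictured as a layered lookup. The sketch below uses `collections.ChainMap` purely to illustrate which value wins; it does not reflect how vLLM parses its arguments, and all values are illustrative:
+
+```python
+from collections import ChainMap
+
+# Illustrative values only; see the argument reference for the real defaults.
+defaults = {"host": "0.0.0.0", "port": 8000, "uvicorn-log-level": "info"}
+config_file = {"host": "127.0.0.1", "port": 6379}
+command_line = {"port": 8080}
+
+# Earlier maps shadow later ones: command line > config file > defaults.
+effective = ChainMap(command_line, config_file, defaults)
+
+print(effective["port"])               # 8080, from the command line
+print(effective["host"])               # 127.0.0.1, from the config file
+print(effective["uvicorn-log-level"])  # info, from the defaults
+```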
+
+## Tool calling in the chat completion API
+vLLM currently supports named function calling, as well as the `auto` and `none` options for the `tool_choice` field in the chat completion API. The `tool_choice` option `required` is **not yet supported** but is on the roadmap.
+
+It is the caller's responsibility to prompt the model with the tool information; vLLM will not automatically manipulate the prompt.
+Please see below for recommended configuration and chat templates to use when function calling is to be used with the different models.
+
+
+### Named Function Calling
+vLLM supports named function calling in the chat completion API. It is implemented with Outlines, so it is
+enabled by default and works with any supported model. You are guaranteed a validly parsable function call, not a
+high-quality one.
+
+vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.
+
+To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and
+specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request.
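+
+As a sketch of what such a request looks like, the body below follows the OpenAI chat completion schema for `tools` and `tool_choice`. The `get_weather` function and its parameters are made up for illustration:
+
+```python
+import json
+
+# Hypothetical tool definition, described with a JSON schema.
+get_weather_tool = {
+    "type": "function",
+    "function": {
+        "name": "get_weather",
+        "description": "Look up the current weather for a city.",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "city": {"type": "string"},
+                "metric": {"type": "string", "enum": ["celsius", "fahrenheit"]},
+            },
+            "required": ["city"],
+        },
+    },
+}
+
+request_body = {
+    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
+    "messages": [{"role": "user", "content": "What is the weather in Seattle?"}],
+    "tools": [get_weather_tool],
+    # Naming one of the tools here selects named function calling.
+    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
+}
+
+print(json.dumps(request_body["tool_choice"]))
+```
+
+With the OpenAI Python client, the same `tools` and `tool_choice` values can be passed as keyword arguments to `client.chat.completions.create`.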
+
+
+### Automatic Function Calling
+To enable this feature, you should set the following flags:
+* `--enable-auto-tool-choice` -- **mandatory**. Tells vLLM that you want to enable the model to generate its own tool calls when it
+deems appropriate.
+* `--tool-call-parser` -- selects the tool parser to use (listed below). Additional tool parsers
+will continue to be added in the future, and you can also register your own tool parsers with `--tool-parser-plugin`.
+* `--tool-parser-plugin` -- **optional**. A tool parser plugin used to register user-defined tool parsers into vLLM; the registered tool parser name can then be specified in `--tool-call-parser`.
+* `--chat-template` -- **optional** for auto tool choice. The path to the chat template which handles `tool`-role messages and `assistant`-role messages
+that contain previously generated tool calls. Hermes, Mistral and Llama models have tool-compatible chat templates in their
+`tokenizer_config.json` files, but you can specify a custom template. This argument can be set to `tool_use` if your model has a tool use-specific chat
+template configured in the `tokenizer_config.json`. In this case, it will be used per the `transformers` specification. More on this [here](https://huggingface.co/docs/transformers/en/chat_templating#why-do-some-models-have-multiple-templates)
+from HuggingFace; you can find an example of this in a `tokenizer_config.json` [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json).
+
+If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!
+
+
+#### Hermes Models (`hermes`)
+
+All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.
+* `NousResearch/Hermes-2-Pro-*`
+* `NousResearch/Hermes-2-Theta-*`
+* `NousResearch/Hermes-3-*`
+
+
+_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge
+step in their creation_.
+
+Flags: `--tool-call-parser hermes`
+
+
+#### Mistral Models (`mistral`)
+
+Supported models:
+* `mistralai/Mistral-7B-Instruct-v0.3` (confirmed)
+* Additional mistral function-calling models are compatible as well.
+
+Known issues:
+1. Mistral 7B struggles to generate parallel tool calls correctly.
+2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is
+much shorter than what vLLM generates. Since an exception is thrown when this condition
+is not met, the following additional chat templates are provided:
+
+* `examples/tool_chat_template_mistral.jinja` - this is the "official" Mistral chat template, but tweaked so that
+it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits)
+* `examples/tool_chat_template_mistral_parallel.jinja` - this is a "better" version that adds a tool-use system prompt
+when tools are provided, that results in much better reliability when working with parallel tool calling.
+
+
+Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja`
+
+
+#### Llama Models (`llama3_json`)
+
+Supported models:
+* `meta-llama/Meta-Llama-3.1-8B-Instruct`
+* `meta-llama/Meta-Llama-3.1-70B-Instruct`
+* `meta-llama/Meta-Llama-3.1-405B-Instruct`
+* `meta-llama/Meta-Llama-3.1-405B-Instruct-FP8`
+
+The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) in Llama-3.2 models, see the `pythonic` tool parser below.
+Other tool calling formats, such as the built-in Python tool calling or custom tool calling, are not supported.
+
+Known issues:
+1. Parallel tool calls are not supported.
+2. The model can generate parameters with a wrong format, such as generating
+   an array serialized as string instead of an array.
+
+The `tool_chat_template_llama3_json.jinja` file contains the "official" Llama chat template, but tweaked so that
+it works better with vLLM.
+
+Recommended flags: `--tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3_json.jinja`
+
+#### IBM Granite
+
+Supported models:
+* `ibm-granite/granite-3.0-8b-instruct`
+
+Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
+
+`examples/tool_chat_template_granite.jinja`: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.
+
+* `ibm-granite/granite-20b-functioncalling`
+
+Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`
+
+`examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
+
+
+#### InternLM Models (`internlm`)
+
+Supported models:
+* `internlm/internlm2_5-7b-chat` (confirmed)
+* Additional internlm2.5 function-calling models are compatible as well
+
+Known issues:
+* Although this implementation also supports InternLM2, the tool call results are not stable when testing with the `internlm/internlm2-chat-7b` model.
+
+Recommended flags: `--tool-call-parser internlm --chat-template examples/tool_chat_template_internlm2_tool.jinja`
+
+
+#### Jamba Models (`jamba`)
+AI21's Jamba-1.5 models are supported.
+* `ai21labs/AI21-Jamba-1.5-Mini`
+* `ai21labs/AI21-Jamba-1.5-Large`
+
+
+Flags: `--tool-call-parser jamba`
+
+
+#### Models with Pythonic Tool Calls (`pythonic`)
+
+A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
+
+As a concrete example, these models may look up the weather in San Francisco and Seattle by generating:
+```python
+[get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')]
+```
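+
+One way to read such output safely is to parse it with Python's `ast` module instead of executing it. The sketch below is an illustration of the idea, not vLLM's `pythonic` parser; `parse_pythonic_tool_calls` is a hypothetical helper that only handles keyword arguments with literal values:
+
+```python
+import ast
+
+
+def parse_pythonic_tool_calls(text: str) -> list[dict]:
+    """Parse "[f(a=1), g(b='x')]" into a list of {"name", "arguments"} dicts."""
+    tree = ast.parse(text.strip(), mode="eval")
+    if not isinstance(tree.body, ast.List):
+        raise ValueError("expected a Python list of calls")
+    calls = []
+    for node in tree.body.elts:
+        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
+            raise ValueError("each list element must be a plain function call")
+        if node.args or any(kw.arg is None for kw in node.keywords):
+            raise ValueError("only keyword arguments are handled in this sketch")
+        # literal_eval only accepts literals, so no model-generated code is run.
+        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
+        calls.append({"name": node.func.id, "arguments": arguments})
+    return calls
+
+
+output = ("[get_weather(city='San Francisco', metric='celsius'), "
+          "get_weather(city='Seattle', metric='celsius')]")
+for call in parse_pythonic_tool_calls(output):
+    print(call)
+```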
+
+Limitations:
+* The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls.  (In particular, the Llama 3.2 models emit no such tokens.)
+* Llama's smaller models struggle to use tools effectively.
+
+Example supported models:
+* `meta-llama/Llama-3.2-1B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
+* `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
+* `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
+* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
+
+Flags: `--tool-call-parser pythonic --chat-template {see_above}`
+
+---
+**WARNING**
+Llama's smaller models frequently fail to emit tool calls in the correct format. Your mileage may vary.
+
+---
+
+
+### How to write a tool parser plugin
+
+A tool parser plugin is a Python file containing one or more `ToolParser` implementations. You can write a `ToolParser` similar to the `Hermes2ProToolParser` in `vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py`.
+
+Here is a summary of a plugin file:
+
+```python
+# import the required packages. These import paths follow
+# hermes_tool_parser.py; verify them against your vLLM version.
+from typing import Sequence, Union
+
+from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
+                                              DeltaMessage,
+                                              ExtractedToolCallInformation)
+from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import (
+    ToolParser, ToolParserManager)
+from vllm.transformers_utils.tokenizer import AnyTokenizer
+
+
+# define a tool parser and register it to vllm
+# the name list in register_module can be used
+# in --tool-call-parser. you can define as many
+# tool parsers as you want here.
+@ToolParserManager.register_module(["example"])
+class ExampleToolParser(ToolParser):
+    def __init__(self, tokenizer: AnyTokenizer):
+        super().__init__(tokenizer)
+
+    # adjust request. e.g.: set skip special tokens
+    # to False for tool call output.
+    def adjust_request(
+            self, request: ChatCompletionRequest) -> ChatCompletionRequest:
+        return request
+
+    # implement the tool call parse for stream call
+    def extract_tool_calls_streaming(
+        self,
+        previous_text: str,
+        current_text: str,
+        delta_text: str,
+        previous_token_ids: Sequence[int],
+        current_token_ids: Sequence[int],
+        delta_token_ids: Sequence[int],
+        request: ChatCompletionRequest,
+    ) -> Union[DeltaMessage, None]:
+        # placeholder: build and return a DeltaMessage once a delta is parsed
+        return None
+
+    # implement the tool parse for non-stream call
+    def extract_tool_calls(
+        self,
+        model_output: str,
+        request: ChatCompletionRequest,
+    ) -> ExtractedToolCallInformation:
+        # placeholder: report no tool calls and pass the text through
+        return ExtractedToolCallInformation(tools_called=False,
+                                            tool_calls=[],
+                                            content=model_output)
+```
+
+Then you can use this plugin on the command line like this:
+```
+    --enable-auto-tool-choice \
+    --tool-parser-plugin <absolute path of the plugin file> \
+    --tool-call-parser example \
+    --chat-template <your chat template>
+```
+
--- a/vllm-v0.6.2/docs/source/serving/run_on_sky.rst
+++ b/vllm-v0.6.2/docs/source/serving/run_on_sky.rst
@@ -0,0 +1,366 @@
+.. _on_cloud:
+
+Deploying and scaling up with SkyPilot
+================================================
+
+.. raw:: html
+
+  <p align="center">
+    <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
+  </p>
+
+vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
+
+
+Prerequisites
+-------------
+
+- Go to the `HuggingFace model page <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__ and request access to the model :code:`meta-llama/Meta-Llama-3-8B-Instruct`.
+- Check that you have installed SkyPilot (`docs <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__).
+- Check that :code:`sky check` shows clouds or Kubernetes are enabled.
+
+.. code-block:: console
+
+  pip install skypilot-nightly
+  sky check
+
+
+Run on a single instance
+------------------------
+
+See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml>`__.
+
+.. code-block:: yaml
+
+  resources:
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    use_spot: True
+    disk_size: 512  # Ensure model checkpoints can fit.
+    disk_tier: best
+    ports: 8081  # Expose to internet traffic.
+
+  envs:
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+
+  setup: |
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+
+    pip install vllm==0.4.0.post1
+    # Install Gradio for web UI.
+    pip install gradio openai
+    pip install flash-attn==2.5.7
+
+  run: |
+    conda activate vllm
+    echo 'Starting vllm api server...'
+    python -u -m vllm.entrypoints.openai.api_server \
+      --port 8081 \
+      --model $MODEL_NAME \
+      --trust-remote-code \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      2>&1 | tee api_server.log &
+    
+    echo 'Waiting for vllm api server to start...'
+    while ! grep -q 'Uvicorn running on' api_server.log; do sleep 1; done
+
+    echo 'Starting gradio server...'
+    git clone https://github.com/vllm-project/vllm.git || true
+    python vllm/examples/gradio_openai_chatbot_webserver.py \
+      -m $MODEL_NAME \
+      --port 8811 \
+      --model-url http://localhost:8081/v1 \
+      --stop-token-ids 128009,128001
+
+Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
+
+.. code-block:: console
+
+  HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
+
+Check the output of the command. It contains a shareable Gradio link (like the last line of the following). Open it in your browser to use the Llama-3 model for text completion.
+
+.. code-block:: console
+
+  (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
+
+**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:
+
+.. code-block:: console
+
+  HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
+
+
+Scale up to multiple replicas
+-----------------------------
+
+SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a ``service`` section to the YAML file.
+
+.. code-block:: yaml
+
+  service:
+    replicas: 2
+    # An actual request for readiness probe.
+    readiness_probe:
+      path: /v1/chat/completions
+      post_data:
+        model: $MODEL_NAME
+        messages:
+          - role: user
+            content: Hello! What is your name?
+        max_completion_tokens: 1
+    
+.. raw:: html
+
+  <details>
+  <summary>Click to see the full recipe YAML</summary>
+
+
+.. code-block:: yaml
+
+  service:
+    replicas: 2
+    # An actual request for readiness probe.
+    readiness_probe:
+      path: /v1/chat/completions
+      post_data:
+        model: $MODEL_NAME
+        messages:
+          - role: user
+            content: Hello! What is your name?
+        max_completion_tokens: 1
+
+  resources:
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    use_spot: True
+    disk_size: 512  # Ensure model checkpoints can fit.
+    disk_tier: best
+    ports: 8081  # Expose to internet traffic.
+
+  envs:
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+
+  setup: |
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+
+    pip install vllm==0.4.0.post1
+    # Install Gradio for web UI.
+    pip install gradio openai
+    pip install flash-attn==2.5.7
+
+  run: |
+    conda activate vllm
+    echo 'Starting vllm api server...'
+    python -u -m vllm.entrypoints.openai.api_server \
+      --port 8081 \
+      --model $MODEL_NAME \
+      --trust-remote-code \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      2>&1 | tee api_server.log
+
+.. raw:: html
+
+  </details>
+
+Start serving the Llama-3 8B model on multiple replicas:
+
+.. code-block:: console
+
+  HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
+
+
+Wait until the service is ready:
+
+.. code-block:: console
+
+  watch -n10 sky serve status vllm
+
+
+.. raw:: html
+
+  <details>
+  <summary>Example outputs:</summary>
+
+.. code-block:: console
+
+  Services
+  NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
+  vllm  1        35s     READY   2/2       xx.yy.zz.100:30001
+
+  Service Replicas
+  SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                STATUS  REGION
+  vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
+  vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
+
+.. raw:: html
+  
+  </details>
+
+After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
+
+.. code-block:: console
+
+  ENDPOINT=$(sky serve status --endpoint 8081 vllm)
+  curl -L http://$ENDPOINT/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+      "messages": [
+      {
+        "role": "system",
+        "content": "You are a helpful assistant."
+      },
+      {
+        "role": "user",
+        "content": "Who are you?"
+      }
+      ],
+      "stop_token_ids": [128009,  128001]
+    }'
+
+To enable autoscaling, you could replace ``replicas`` with the following configs in ``service``:
+
+.. code-block:: yaml
+
+  service:
+    replica_policy:
+      min_replicas: 2
+      max_replicas: 4
+      target_qps_per_replica: 2
+
+This will scale the service up when the QPS exceeds 2 for each replica.
+
+    
+.. raw:: html
+
+  <details>
+  <summary>Click to see the full recipe YAML</summary>
+
+
+.. code-block:: yaml
+
+  service:
+    replica_policy:
+      min_replicas: 2
+      max_replicas: 4
+      target_qps_per_replica: 2
+    # An actual request for readiness probe.
+    readiness_probe:
+      path: /v1/chat/completions
+      post_data:
+        model: $MODEL_NAME
+        messages:
+          - role: user
+            content: Hello! What is your name?
+        max_completion_tokens: 1
+
+  resources:
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    use_spot: True
+    disk_size: 512  # Ensure model checkpoints can fit.
+    disk_tier: best
+    ports: 8081  # Expose to internet traffic.
+
+  envs:
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+
+  setup: |
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+
+    pip install vllm==0.4.0.post1
+    # Install Gradio for web UI.
+    pip install gradio openai
+    pip install flash-attn==2.5.7
+
+  run: |
+    conda activate vllm
+    echo 'Starting vllm api server...'
+    python -u -m vllm.entrypoints.openai.api_server \
+      --port 8081 \
+      --model $MODEL_NAME \
+      --trust-remote-code \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      2>&1 | tee api_server.log
+
+
+.. raw:: html
+  
+  </details>
+
+To update the service with the new config:
+
+.. code-block:: console
+
+  HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
+
+
+To stop the service:
+
+.. code-block:: console
+
+  sky serve down vllm
+
+
+**Optional**: Connect a GUI to the endpoint
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
+
+.. raw:: html
+
+  <details>
+  <summary>Click to see the full GUI YAML</summary>
+
+.. code-block:: yaml
+
+  envs:
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. 
+
+  resources:
+    cpus: 2
+
+  setup: |
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+
+    # Install Gradio for web UI.
+    pip install gradio openai
+
+  run: |
+    conda activate vllm
+    export PATH=$PATH:/sbin
+
+    echo 'Starting gradio server...'
+    git clone https://github.com/vllm-project/vllm.git || true
+    python vllm/examples/gradio_openai_chatbot_webserver.py \
+      -m $MODEL_NAME \
+      --port 8811 \
+      --model-url http://$ENDPOINT/v1 \
+      --stop-token-ids 128009,128001 | tee ~/gradio.log
+
+
+.. raw:: html
+  
+  </details>
+
+1. Start the chat web UI:
+
+.. code-block:: console
+
+  sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
+
+
+2. Then, we can access the GUI at the returned gradio link:
+
+.. code-block:: console
+
+  | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
+
+
--- a/vllm-v0.6.2/docs/source/serving/serving_with_langchain.rst
+++ b/vllm-v0.6.2/docs/source/serving/serving_with_langchain.rst
@@ -0,0 +1,31 @@
+.. _run_on_langchain:
+
+Serving with Langchain
+============================
+
+vLLM is also available via `Langchain <https://github.com/langchain-ai/langchain>`_ .
+
+To install langchain, run
+
+.. code-block:: console
+
+    $ pip install langchain langchain_community -q
+
+To run inference on a single GPU or multiple GPUs, use the ``VLLM`` class from ``langchain``.
+
+.. code-block:: python
+
+    from langchain_community.llms import VLLM
+
+    llm = VLLM(model="mosaicml/mpt-7b",
+               trust_remote_code=True,  # mandatory for hf models
+               max_new_tokens=128,
+               top_k=10,
+               top_p=0.95,
+               temperature=0.8,
+               # tensor_parallel_size=... # for distributed inference
+    )
+
+    print(llm.invoke("What is the capital of France?"))
+
+Please refer to this `Tutorial <https://python.langchain.com/docs/integrations/llms/vllm>`_ for more details.
--- a/vllm-v0.6.2/docs/source/serving/serving_with_llamaindex.rst
+++ b/vllm-v0.6.2/docs/source/serving/serving_with_llamaindex.rst
@@ -0,0 +1,27 @@
+.. _run_on_llamaindex:
+
+Serving with llama_index
+============================
+
+vLLM is also available via `llama_index <https://github.com/run-llama/llama_index>`_ .
+
+To install llamaindex, run
+
+.. code-block:: console
+
+    $ pip install llama-index-llms-vllm -q
+
+To run inference on a single GPU or multiple GPUs, use the ``Vllm`` class from ``llamaindex``.
+
+.. code-block:: python
+
+    from llama_index.llms.vllm import Vllm
+
+    llm = Vllm(
+        model="microsoft/Orca-2-7b",
+        tensor_parallel_size=4,
+        max_new_tokens=100,
+        vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
+    )
+
+Please refer to this `Tutorial <https://docs.llamaindex.ai/en/latest/examples/llm/vllm/>`_ for more details.
--- a/vllm-v0.6.2/docs/source/serving/serving_with_llamastack.rst
+++ b/vllm-v0.6.2/docs/source/serving/serving_with_llamastack.rst
@@ -0,0 +1,42 @@
+.. _run_on_llamastack:
+
+Serving with Llama Stack
+============================
+
+vLLM is also available via `Llama Stack <https://github.com/meta-llama/llama-stack>`_ .
+
+To install Llama Stack, run
+
+.. code-block:: console
+
+    $ pip install llama-stack -q
+
+Inference using OpenAI Compatible API
+-------------------------------------
+
+Then start Llama Stack server pointing to your vLLM server with the following configuration:
+
+.. code-block:: yaml
+
+    inference:
+      - provider_id: vllm0
+        provider_type: remote::vllm
+        config:
+          url: http://127.0.0.1:8000
+
+Please refer to `this guide <https://github.com/meta-llama/llama-stack/blob/main/docs/source/getting_started/distributions/self_hosted_distro/remote_vllm.md>`_ for more details on this remote vLLM provider.
+
+Inference via Embedded vLLM
+---------------------------
+
+An `inline vLLM provider
+<https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm>`_
+is also available. This is a sample of configuration using that method:
+
+.. code-block:: yaml
+
+    inference:
+      - provider_type: vllm
+        config:
+          model: Llama3.1-8B-Instruct
+          tensor_parallel_size: 4
--- a/vllm-v0.6.2/docs/source/serving/tensorizer.rst
+++ b/vllm-v0.6.2/docs/source/serving/tensorizer.rst
@@ -0,0 +1,15 @@
+.. _tensorizer:
+
+Loading Models with CoreWeave's Tensorizer
+==========================================
+vLLM supports loading models with `CoreWeave's Tensorizer <https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer>`_.
+vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or an S3 endpoint can be deserialized
+at runtime extremely quickly directly to the GPU, resulting in significantly
+shorter Pod startup times and lower CPU memory usage. Tensor encryption is also supported.
+
+For more information on CoreWeave's Tensorizer, please refer to
+`CoreWeave's Tensorizer documentation <https://github.com/coreweave/tensorizer>`_. For more information on serializing a vLLM model, as well as a general usage guide to using Tensorizer with vLLM, see
+the `vLLM example script <https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html>`_.
+
+.. note::
+  To use this feature, you need to install ``tensorizer`` by running ``pip install vllm[tensorizer]``.
--- a/vllm-v0.6.2/docs/source/serving/usage_stats.md
+++ b/vllm-v0.6.2/docs/source/serving/usage_stats.md
@@ -0,0 +1,57 @@
+# Usage Stats Collection
+
+vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent, does not contain any sensitive information, and will be publicly released for the community's benefit.
+
+## What data is collected?
+
+You can see the up-to-date list of data collected by vLLM in [usage_lib.py](https://github.com/vllm-project/vllm/blob/main/vllm/usage/usage_lib.py).
+
+Here is an example as of v0.4.0:
+
+```json
+{
+  "uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
+  "provider": "GCP",
+  "num_cpu": 24,
+  "cpu_type": "Intel(R) Xeon(R) CPU @ 2.20GHz",
+  "cpu_family_model_stepping": "6,85,7",
+  "total_memory": 101261135872,
+  "architecture": "x86_64",
+  "platform": "Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31",
+  "gpu_count": 2,
+  "gpu_type": "NVIDIA L4",
+  "gpu_memory_per_device": 23580639232,
+  "model_architecture": "OPTForCausalLM",
+  "vllm_version": "0.3.2+cu123",
+  "context": "LLM_CLASS",
+  "log_time": 1711663373492490000,
+  "source": "production",
+  "dtype": "torch.float16",
+  "tensor_parallel_size": 1,
+  "block_size": 16,
+  "gpu_memory_utilization": 0.9,
+  "quantization": null,
+  "kv_cache_dtype": "auto",
+  "enable_lora": false,
+  "enable_prefix_caching": false,
+  "enforce_eager": false,
+  "disable_custom_all_reduce": true
+}
+```
+
+You can preview the collected data by running the following command:
+
+```bash
+tail ~/.config/vllm/usage_stats.json
+```
+
+## Opt-out of Usage Stats Collection
+
+You can opt out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file:
+
+```bash
+# Any of the following methods can disable usage stats collection
+export VLLM_NO_USAGE_STATS=1
+export DO_NOT_TRACK=1
+mkdir -p ~/.config/vllm && touch ~/.config/vllm/do_not_track
+```
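
The three opt-out methods above amount to a simple "any of these" check. The sketch below mirrors that logic so that a deployment script could verify its own configuration. It is not vLLM's implementation: the function name `usage_stats_disabled` is hypothetical, and treating an empty or `0` value as "not set" is an assumption of this example.

```python
import os
from pathlib import Path


def usage_stats_disabled(env=None, config_dir=None) -> bool:
    """Return True if any of the documented opt-out methods is active.

    `env` and `config_dir` are injectable for testing; by default the
    real environment and ~/.config/vllm are inspected.
    """
    env = os.environ if env is None else env
    if config_dir is None:
        config_dir = Path.home() / ".config" / "vllm"
    else:
        config_dir = Path(config_dir)

    # Assumption: a variable set to "" or "0" does not count as opting out.
    env_opt_out = any(
        env.get(name, "0") not in ("", "0")
        for name in ("VLLM_NO_USAGE_STATS", "DO_NOT_TRACK")
    )
    # The marker file only needs to exist; its contents are ignored.
    file_opt_out = (config_dir / "do_not_track").exists()
    return env_opt_out or file_opt_out
```

Calling `usage_stats_disabled()` with no arguments after running one of the commands above should return `True` under these assumptions.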