Sync from v0.13

2026-01-19 10:38:50 +08:00
parent b2ef04d792
commit 5aef6c175a
3714 changed files with 854317 additions and 89342 deletions
--- a/docs/deployment/integrations/kaito.md
+++ b/docs/deployment/integrations/kaito.md
@@ -0,0 +1,5 @@
+# KAITO
+
+[KAITO](https://kaito-project.github.io/kaito/docs/) is a Kubernetes operator that supports deploying and serving LLMs with vLLM. It manages large models through container images with built-in OpenAI-compatible inference, auto-provisions GPU nodes, and provides curated model presets.
+
+Please refer to the [quick start](https://kaito-project.github.io/kaito/docs/quick-start) for more details.
--- a/docs/deployment/integrations/kserve.md
+++ b/docs/deployment/integrations/kserve.md
@@ -0,0 +1,5 @@
+# KServe
+
+vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
+
+Please see [this guide](https://kserve.github.io/website/docs/model-serving/generative-inference/overview) for more details on using vLLM with KServe.
--- a/docs/deployment/integrations/kthena.md
+++ b/docs/deployment/integrations/kthena.md
@@ -0,0 +1,333 @@
+# Kthena
+
+[**Kthena**](https://github.com/volcano-sh/kthena) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads.
+
+This guide shows how to deploy a production-grade, **multi-node vLLM** service on Kubernetes.
+
+We’ll:
+
+- Install the required components (Kthena + Volcano).
+- Deploy a multi-node vLLM model via Kthena’s `ModelServing` CR.
+- Validate the deployment.
+
+---
+
+## 1. Prerequisites
+
+You need:
+
+- A Kubernetes cluster with **GPU nodes**.
+- `kubectl` access with cluster-admin or equivalent permissions.
+- **Volcano** installed for gang scheduling.
+- **Kthena** installed with the `ModelServing` CRD available.
+- A valid **Hugging Face token** if loading models from Hugging Face Hub.
+
+### 1.1 Install Volcano
+
+```bash
+helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
+helm repo update
+helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
+```
+
+This provides the gang-scheduling and network topology features used by Kthena.
+
+### 1.2 Install Kthena
+
+```bash
+helm install kthena oci://ghcr.io/volcano-sh/charts/kthena --version v0.1.0 --namespace kthena-system --create-namespace
+```
+
+After the install completes, the `kthena-system` namespace should exist, and the Kthena controllers and CRDs,
+including `ModelServing`, should be installed and healthy.
+
+Validate:
+
+```bash
+kubectl get crd | grep modelserving
+```
+
+You should see:
+
+```text
+modelservings.workload.serving.volcano.sh   ...
+```
+
+---
+
+## 2. The Multi-Node vLLM `ModelServing` Example
+
+Kthena provides an example manifest to deploy a **multi-node vLLM cluster running Llama**. Conceptually this is equivalent to the vLLM production stack Helm deployment, but expressed with `ModelServing`.
+
+A simplified version of the example (`llama-multinode`) looks like:
+
+- `spec.replicas: 1` – one `ServingGroup` (one logical model deployment).
+- `roles`:
+    - `entryTemplate` – defines **leader** pods that run:
+        - vLLM’s **multi-node cluster bootstrap script** (Ray cluster).
+        - vLLM **OpenAI-compatible API server**.
+    - `workerTemplate` – defines **worker** pods that join the leader’s Ray cluster.
+
+Key points from the example YAML:
+
+- **Image**: `vllm/vllm-openai:latest` (matches upstream vLLM images).
+- **Command** (leader):
+
+  ```yaml
+  command:
+    - sh
+    - -c
+    - >
+      bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=2;
+      python3 -m vllm.entrypoints.openai.api_server
+      --port 8080
+      --model meta-llama/Llama-3.1-405B-Instruct
+      --tensor-parallel-size 8
+      --pipeline-parallel-size 2
+  ```
+
+- **Command** (worker):
+
+  ```yaml
+  command:
+    - sh
+    - -c
+    - >
+      bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(ENTRY_ADDRESS)
+  ```
+
+---
+
+## 3. Deploying Multi-Node Llama vLLM via Kthena
+
+### 3.1 Prepare the Manifest
+
+**Recommended**: store the Hugging Face token in a Secret instead of a raw env var:
+
+```bash
+kubectl create secret generic hf-token \
+  -n default \
+  --from-literal=HUGGING_FACE_HUB_TOKEN='<your-token>'
+```
+
+### 3.2 Apply the `ModelServing`
+
+```bash
+cat <<'EOF' | kubectl apply -f -
+apiVersion: workload.serving.volcano.sh/v1alpha1
+kind: ModelServing
+metadata:
+  name: llama-multinode
+  namespace: default
+spec:
+  schedulerName: volcano
+  replicas: 1  # group replicas
+  template:
+    restartGracePeriodSeconds: 60
+    gangPolicy:
+      minRoleReplicas:
+        405b: 1
+    roles:
+      - name: 405b
+        replicas: 2
+        entryTemplate:
+          spec:
+            containers:
+              - name: leader
+                image: vllm/vllm-openai:latest
+                env:
+                  - name: HUGGING_FACE_HUB_TOKEN
+                    valueFrom:
+                      secretKeyRef:
+                        name: hf-token
+                        key: HUGGING_FACE_HUB_TOKEN
+                command:
+                  - sh
+                  - -c
+                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=2;
+                    python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
+                resources:
+                  limits:
+                    nvidia.com/gpu: "8"
+                    memory: 1124Gi
+                    ephemeral-storage: 800Gi
+                  requests:
+                    ephemeral-storage: 800Gi
+                    cpu: 125
+                ports:
+                  - containerPort: 8080
+                readinessProbe:
+                  tcpSocket:
+                    port: 8080
+                  initialDelaySeconds: 15
+                  periodSeconds: 10
+                volumeMounts:
+                  - mountPath: /dev/shm
+                    name: dshm
+            volumes:
+            - name: dshm
+              emptyDir:
+                medium: Memory
+                sizeLimit: 15Gi
+        workerReplicas: 1
+        workerTemplate:
+          spec:
+            containers:
+              - name: worker
+                image: vllm/vllm-openai:latest
+                command:
+                  - sh
+                  - -c
+                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(ENTRY_ADDRESS)"
+                resources:
+                  limits:
+                    nvidia.com/gpu: "8"
+                    memory: 1124Gi
+                    ephemeral-storage: 800Gi
+                  requests:
+                    ephemeral-storage: 800Gi
+                    cpu: 125
+                env:
+                  - name: HUGGING_FACE_HUB_TOKEN
+                    valueFrom:
+                      secretKeyRef:
+                        name: hf-token
+                        key: HUGGING_FACE_HUB_TOKEN
+                volumeMounts:
+                  - mountPath: /dev/shm
+                    name: dshm
+            volumes:
+            - name: dshm
+              emptyDir:
+                medium: Memory
+                sizeLimit: 15Gi
+EOF
+```
+
+Kthena will:
+
+- Create a `ModelServing` object.
+- Derive a `PodGroup` for Volcano gang scheduling.
+- Create the leader and worker pods for each `ServingGroup` and `Role`.
+
+---
+
+## 4. Verifying the Deployment
+
+### 4.1 Check ModelServing Status
+
+Use the snippet from the Kthena docs:
+
+```bash
+kubectl get modelserving -oyaml | grep status -A 10
+```
+
+You should see something like:
+
+```yaml
+status:
+  availableReplicas: 1
+  conditions:
+    - type: Available
+      status: "True"
+      reason: AllGroupsReady
+      message: All Serving groups are ready
+    - type: Progressing
+      status: "False"
+      ...
+  replicas: 1
+  updatedReplicas: 1
+```
+
+### 4.2 Check Pods
+
+List pods for your deployment:
+
+```bash
+kubectl get pod -A -owide -l modelserving.volcano.sh/name=llama-multinode
+```
+
+Example output (from docs):
+
+```text
+NAMESPACE   NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE           ...
+default     llama-multinode-0-405b-0-0    1/1     Running   0          15m   10.244.0.56   192.168.5.12   ...
+default     llama-multinode-0-405b-0-1    1/1     Running   0          15m   10.244.0.58   192.168.5.43   ...
+default     llama-multinode-0-405b-1-0    1/1     Running   0          15m   10.244.0.57   192.168.5.58   ...
+default     llama-multinode-0-405b-1-1    1/1     Running   0          15m   10.244.0.53   192.168.5.36   ...
+```
+
+Pod name pattern:
+
+- `llama-multinode-<group-idx>-<role-name>-<replica-idx>-<ordinal>`.
+
+The first index identifies the `ServingGroup`, the role name (`405b`) identifies the `Role`, and the last two indices identify the role replica and the pod within that replica.
+
+---
+
+## 5. Accessing the vLLM OpenAI-Compatible API
+
+Expose the entry pods via a Service:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: llama-multinode-openai
+  namespace: default
+spec:
+  selector:
+    modelserving.volcano.sh/name: llama-multinode
+    modelserving.volcano.sh/entry: "true"
+    # optionally further narrow to leader role if you label it
+  ports:
+    - name: http
+      port: 80
+      targetPort: 8080
+  type: ClusterIP
+```
+
+Port-forward from your local machine:
+
+```bash
+kubectl port-forward svc/llama-multinode-openai 30080:80 -n default
+```
+
+Then:
+
+- List models:
+
+  ```bash
+  curl -s http://localhost:30080/v1/models
+  ```
+
+- Send a completion request (mirroring the vLLM production stack docs):
+
+  ```bash
+  curl -X POST http://localhost:30080/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+      "model": "meta-llama/Llama-3.1-405B-Instruct",
+      "prompt": "Once upon a time,",
+      "max_tokens": 10
+    }'
+  ```
+
+You should see an OpenAI-style response from vLLM.
+
+---
+
+## 6. Clean Up
+
+To remove the deployment and its resources:
+
+```bash
+kubectl delete modelserving llama-multinode -n default
+```
+
+If you’re done with the entire stack:
+
+```bash
+helm uninstall kthena -n kthena-system   # or your Kthena release name
+helm uninstall volcano -n volcano-system
+```
--- a/docs/deployment/integrations/kubeai.md
+++ b/docs/deployment/integrations/kubeai.md
@@ -0,0 +1,13 @@
+# KubeAI
+
+[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load-based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
+
+Please see the Installation Guides for environment-specific instructions:
+
+- [Any Kubernetes Cluster](https://www.kubeai.org/installation/any/)
+- [EKS](https://www.kubeai.org/installation/eks/)
+- [GKE](https://www.kubeai.org/installation/gke/)
+
+Once you have KubeAI installed, you can
+[configure text generation models](https://www.kubeai.org/how-to/configure-text-generation-models/)
+using vLLM.
--- a/docs/deployment/integrations/kuberay.md
+++ b/docs/deployment/integrations/kuberay.md
@@ -0,0 +1,20 @@
+# KubeRay
+
+[KubeRay](https://github.com/ray-project/kuberay) provides a Kubernetes-native way to run vLLM workloads on Ray clusters.
+A Ray cluster can be declared in YAML, and the operator then handles pod scheduling, networking configuration, restarts, and blue-green deployments — all while preserving the familiar Kubernetes experience.
+
+## Why KubeRay instead of manual scripts?
+
+| Feature | Manual scripts | KubeRay |
+|---------|-----------------------------------------------------------|---------|
+| Cluster bootstrap | Manually SSH into every node and run a script | One command to create or update the whole cluster: `kubectl apply -f cluster.yaml` |
+| Autoscaling | Manual | Automatically patches CRDs for adjusting cluster size |
+| Upgrades | Tear down & re-create manually | Blue/green deployment updates supported |
+| Declarative config | Bash flags & environment variables | Git-ops-friendly YAML CRDs (RayCluster/RayService) |
+
+Using KubeRay reduces the operational burden and simplifies integration of Ray + vLLM with existing Kubernetes workflows (CI/CD, secrets, storage classes, etc.).
+
+## Learn more
+
+* ["Serve a Large Language Model using Ray Serve LLM on Kubernetes"](https://docs.ray.io/en/master/cluster/kubernetes/examples/rayserve-llm-example.html) - An end-to-end example of how to serve a model using vLLM, KubeRay, and Ray Serve.
+* [KubeRay documentation](https://docs.ray.io/en/latest/cluster/kubernetes/index.html)
--- a/docs/deployment/integrations/llamastack.md
+++ b/docs/deployment/integrations/llamastack.md
@@ -0,0 +1,36 @@
+# Llama Stack
+
+vLLM is also available via [Llama Stack](https://github.com/llamastack/llama-stack).
+
+To install Llama Stack, run:
+
+```bash
+pip install llama-stack -q
+```
+
+## Inference using OpenAI-Compatible API
+
+Start the Llama Stack server and configure it to point to your vLLM server with the following settings:
+
+```yaml
+inference:
+  - provider_id: vllm0
+    provider_type: remote::vllm
+    config:
+      url: http://127.0.0.1:8000
+```
+
+Please refer to [this guide](https://llama-stack.readthedocs.io/en/latest/providers/inference/remote_vllm.html) for more details on this remote vLLM provider.
+
+## Inference using Embedded vLLM
+
+An [inline provider](https://github.com/llamastack/llama-stack/tree/main/llama_stack/providers/inline/inference)
+is also available. Here is a sample configuration using that method:
+
+```yaml
+inference:
+  - provider_type: vllm
+    config:
+      model: Llama3.1-8B-Instruct
+      tensor_parallel_size: 4
+```
--- a/docs/deployment/integrations/llmaz.md
+++ b/docs/deployment/integrations/llmaz.md
@@ -0,0 +1,5 @@
+# llmaz
+
+[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed at production use. It uses vLLM as the default model serving backend.
+
+Please refer to the [Quick Start](https://github.com/InftyAI/llmaz?tab=readme-ov-file#quick-start) for more details.
--- a/docs/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -0,0 +1,158 @@
+# Production stack
+
+Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
+
+* **Upstream vLLM compatibility** – It wraps around upstream vLLM without modifying its code.
+* **Ease of use** – Simplified deployment via Helm charts and observability through Grafana dashboards.
+* **High performance** – Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.
+
+If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](https://github.com/vllm-project/production-stack), we provide a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set up everything and get started in **4 minutes**!
+
+## Prerequisites
+
+Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
+
+## Deployment using vLLM production stack
+
+The standard vLLM production stack is installed using a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/utils/install-helm.sh) to install Helm on your GPU server.
+
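+To install the vLLM production stack, run the following commands on your desktop from the root of a clone of the [production stack repo](https://github.com/vllm-project/production-stack), since the values file path is relative to it: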
+
+```bash
+sudo helm repo add vllm https://vllm-project.github.io/production-stack
+sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
+```
+
+This will instantiate a vLLM-production-stack-based deployment named `vllm` that runs a small LLM (the `facebook/opt-125m` model).
+
+### Validate Installation
+
+Monitor the deployment status using:
+
+```bash
+sudo kubectl get pods
+```
+
+The pods for the `vllm` deployment should transition to the `Running` state:
+
+```text
+NAME                                           READY   STATUS    RESTARTS   AGE
+vllm-deployment-router-859d8fb668-2x2b7        1/1     Running   0          2m38s
+vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   0          2m38s
+```
+
+!!! note
+    It may take some time for the containers to download the Docker images and LLM weights.
+
+### Send a Query to the Stack
+
+Forward the `vllm-router-service` port to the host machine:
+
+```bash
+sudo kubectl port-forward svc/vllm-router-service 30080:80
+```
+
+Then send a query to the OpenAI-compatible API to list the available models:
+
+```bash
+curl -o- http://localhost:30080/v1/models
+```
+
+??? console "Output"
+
+    ```json
+    {
+      "object": "list",
+      "data": [
+        {
+          "id": "facebook/opt-125m",
+          "object": "model",
+          "created": 1737428424,
+          "owned_by": "vllm",
+          "root": null
+        }
+      ]
+    }
+    ```
+
+To send an actual completion request, issue a curl request to the OpenAI-compatible `/v1/completions` endpoint:
+
+```bash
+curl -X POST http://localhost:30080/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "facebook/opt-125m",
+    "prompt": "Once upon a time,",
+    "max_tokens": 10
+  }'
+```
+
+??? console "Output"
+
+    ```json
+    {
+      "id": "completion-id",
+      "object": "text_completion",
+      "created": 1737428424,
+      "model": "facebook/opt-125m",
+      "choices": [
+        {
+          "text": " there was a brave knight who...",
+          "index": 0,
+          "finish_reason": "length"
+        }
+      ]
+    }
+    ```
+
+### Uninstall
+
+To remove the deployment, run:
+
+```bash
+sudo helm uninstall vllm
+```
+
+---
+
+### (Advanced) Configuring vLLM production stack
+
+The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
+
+??? code "Yaml"
+
+    ```yaml
+    servingEngineSpec:
+      runtimeClassName: ""
+      modelSpec:
+      - name: "opt125m"
+        repository: "vllm/vllm-openai"
+        tag: "latest"
+        modelURL: "facebook/opt-125m"
+
+        replicaCount: 1
+
+        requestCPU: 6
+        requestMemory: "16Gi"
+        requestGPU: 1
+
+        pvcStorage: "10Gi"
+    ```
+
+In this YAML configuration:
+
+* **`modelSpec`** includes:
+    * `name`: A nickname for the model.
+    * `repository`: Docker repository of the vLLM image.
+    * `tag`: Docker image tag.
+    * `modelURL`: The model to serve.
+    * `replicaCount`: Number of replicas.
+    * `requestCPU` and `requestMemory`: CPU and memory resource requests for the pod.
+    * `requestGPU`: Number of GPUs required.
+    * `pvcStorage`: Size of the persistent storage allocated for the model.
+
+!!! note
+    If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).
+
+!!! tip
+    vLLM production stack offers many more features (*e.g.* CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!