# Using Volcano Kthena

This guide shows how to run **prefill–decode (PD) disaggregation** on Huawei Ascend NPUs using **vLLM-Ascend**, with [**Kthena**](https://kthena.volcano.sh/) handling orchestration on Kubernetes. For vLLM support with Kthena, please refer to [Deploy vLLM with Kthena](https://docs.vllm.ai/en/latest/deployment/integrations/kthena/).

---

## 1. What is Prefill–Decode Disaggregation?

Large language model inference naturally splits into two phases:

- **Prefill**
    - Processes input tokens and builds the key–value (KV) cache.
    - Batch-friendly, high-throughput, well-suited to parallel NPU execution.
- **Decode**
    - Consumes the KV cache to generate output tokens.
    - Latency-sensitive, memory-intensive, more sequential.

PD disaggregation runs these two phases on separate groups of workers so each can be provisioned and scaled independently. From the client's perspective, this still looks like a single Chat / Completions endpoint.

---

## 2. Deploy on Kubernetes with Kthena

[Kthena](https://kthena.volcano.sh/) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads.

In this example, we use three key Custom Resource Definitions (CRDs):

- `ModelServing` — defines the workloads (prefill and decode roles).
- `ModelServer` — manages PD groupings and internal routing.
- `ModelRoute` — exposes a stable model endpoint.

This section uses the `deepseek-ai/DeepSeek-V2-Lite` example, but you can swap in any model supported by vLLM-Ascend.

### 2.1 Prerequisites

- Kubernetes cluster with Ascend NPU nodes. The NPU resource names advertised by different NPU drivers vary slightly. For example:
    - If using [MindCluster](https://gitee.com/ascend/mind-cluster), please use `huawei.com/Ascend310P` or `huawei.com/Ascend910`.
    - If running on CCE (Cloud Container Engine) of Huawei Cloud with the [CCE AI Suite Plugin (Ascend NPU)](https://support.huaweicloud.com/intl/en-us/usermanual-cce/cce_10_0239.html) installed, please use `huawei.com/ascend-310` or `huawei.com/ascend-1980`.
- Kthena installed. Please follow the [Kthena installation guide](https://kthena.volcano.sh/docs/getting-started/installation).
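Before deploying, you can sanity-check the cluster. The commands below are a minimal sketch: the node name is a placeholder, and the `dev` namespace is simply the one used by the example manifests that follow.

```bash
# The example manifests below deploy into the "dev" namespace; create it if needed.
kubectl create namespace dev

# Inspect a node's allocatable resources and confirm it advertises the NPU
# resource key that matches your driver (e.g. huawei.com/ascend-1980).
kubectl describe node <your-npu-node> | grep -A 8 'Allocatable:'
```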
### 2.2 Deploy Prefill-Decode Disaggregated DeepSeek-V2-Lite on Kubernetes

A concrete example is provided in the Kthena repository. Deploy it with the command below:

```bash
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-serving/prefill-decode-disaggregation.yaml
```

or

```bash
cat << 'EOF' | kubectl apply -f -
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: deepseek-v2-lite
  namespace: dev
spec:
  schedulerName: volcano
  replicas: 1
  recoveryPolicy: ServingGroupRecreate
  template:
    restartGracePeriodSeconds: 60
    roles:
      - name: prefill
        replicas: 1
        entryTemplate:
          spec:
            initContainers:
              - name: downloader
                imagePullPolicy: Always
                image: ghcr.io/volcano-sh/downloader:latest
                args:
                  - --source
                  - deepseek-ai/DeepSeek-V2-Lite
                  - --output-dir
                  - /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
            containers:
              - name: runtime
                image: ghcr.io/volcano-sh/runtime:latest
                ports:
                  - containerPort: 8100
                args:
                  - --port
                  - "8100"
                  - --engine
                  - vllm
                  - --pod
                  - $(POD_NAME).$(NAMESPACE)
                  - --model
                  - deepseek-v2-lite
                  - --engine-base-url
                  - http://localhost:8000
              - name: vllm
                image: ghcr.io/volcano-sh/kthena-engine:vllm-ascend_v0.10.1rc1_mooncake_v0.3.5
                ports:
                  - containerPort: 8000
                env:
                  - name: HF_HUB_OFFLINE
                    value: "1"
                  - name: HCCL_IF_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: GLOO_SOCKET_IFNAME
                    value: eth0
                  - name: TP_SOCKET_IFNAME
                    value: eth0
                  - name: HCCL_SOCKET_IFNAME
                    value: eth0
                  - name: VLLM_LOGGING_LEVEL
                    value: DEBUG
                  - name: AscendRealDevices
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/AscendReal']
                args:
                  - "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/"
                  - "--served-model-name"
                  - "deepseek-ai/DeepSeekV2"
                  - "--tensor-parallel-size"
                  - "2"
                  - "--gpu-memory-utilization"
                  - "0.8"
                  - "--max-model-len"
                  - "8192"
                  - "--max-num-batched-tokens"
                  - "8192"
                  - "--trust-remote-code"
                  - "--enforce-eager"
                  - "--kv-transfer-config"
                  - '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_producer","kv_parallel_size":1,"kv_port":"20001","engine_id":"0","kv_rank":0,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}'
                imagePullPolicy: Always
                resources:
                  limits:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                  requests:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                readinessProbe:
                  initialDelaySeconds: 5
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                livenessProbe:
                  initialDelaySeconds: 900
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                    readOnly: true
                  - name: hccn-config
                    mountPath: /etc/hccn.conf
                    readOnly: true
                  - name: shared-memory-volume
                    mountPath: /dev/shm
            volumes:
              - name: models
                hostPath:
                  path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                  type: DirectoryOrCreate
              - name: hccn-config
                hostPath:
                  path: /etc/hccn.conf
                  type: File
              - name: shared-memory-volume
                emptyDir:
                  sizeLimit: 256Mi
                  medium: Memory
      - name: decode
        replicas: 1
        entryTemplate:
          spec:
            initContainers:
              - name: downloader
                imagePullPolicy: Always
                image: ghcr.io/volcano-sh/downloader:latest
                args:
                  - --source
                  - deepseek-ai/DeepSeek-V2-Lite
                  - --output-dir
                  - /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
            containers:
              - name: vllm
                image: ghcr.io/volcano-sh/kthena-engine:vllm-ascend_v0.10.1rc1_mooncake_v0.3.5
                ports:
                  - containerPort: 8000
                env:
                  - name: HF_HUB_OFFLINE
                    value: "1"
                  - name: HCCL_IF_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: GLOO_SOCKET_IFNAME
                    value: eth0
                  - name: TP_SOCKET_IFNAME
                    value: eth0
                  - name: HCCL_SOCKET_IFNAME
                    value: eth0
                  - name: VLLM_LOGGING_LEVEL
                    value: DEBUG
                  - name: AscendRealDevices
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/AscendReal']
                args:
                  - "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/"
                  - "--served-model-name"
                  - "deepseek-ai/DeepSeekV2"
                  - "--tensor-parallel-size"
                  - "2"
                  - "--gpu-memory-utilization"
                  - "0.8"
                  - "--max-model-len"
                  - "8192"
                  - "--max-num-batched-tokens"
                  - "16384"
                  - "--trust-remote-code"
                  - "--no-enable-prefix-caching"
                  - "--enforce-eager"
                  - "--kv-transfer-config"
                  - '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_consumer","kv_parallel_size":1,"kv_port":"20002","engine_id":"1","kv_rank":1,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}'
                imagePullPolicy: Always
                resources:
                  limits:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                  requests:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                readinessProbe:
                  initialDelaySeconds: 5
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                livenessProbe:
                  initialDelaySeconds: 900
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                    readOnly: true
                  - name: hccn-config
                    mountPath: /etc/hccn.conf
                    readOnly: true
                  - name: shared-memory-volume
                    mountPath: /dev/shm
            volumes:
              - name: models
                hostPath:
                  path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                  type: DirectoryOrCreate
              - name: hccn-config
                hostPath:
                  path: /etc/hccn.conf
                  type: File
              - name: shared-memory-volume
                emptyDir:
                  sizeLimit: 256Mi
                  medium: Memory
EOF
```
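The `--kv-transfer-config` values above carry the core of the PD wiring but are hard to read inline. For reference, here is the prefill side's value pretty-printed; the decode side is identical except that `kv_role` is `kv_consumer`, `kv_port` is `"20002"`, `engine_id` is `"1"`, and `kv_rank` is `1`:

```json
{
  "kv_connector": "MooncakeConnectorV1",
  "kv_buffer_device": "npu",
  "kv_role": "kv_producer",
  "kv_parallel_size": 1,
  "kv_port": "20001",
  "engine_id": "0",
  "kv_rank": 0,
  "kv_connector_extra_config": {
    "prefill": { "dp_size": 2, "tp_size": 2 },
    "decode": { "dp_size": 2, "tp_size": 2 }
  }
}
```

The prefill engine acts as the KV producer and the decode engine as the consumer; both sides must agree on the topology declared in `kv_connector_extra_config`.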
"1" - name: HCCL_IF_IP valueFrom: fieldRef: fieldPath: status.podIP - name: GLOO_SOCKET_IFNAME value: eth0 - name: TP_SOCKET_IFNAME value: eth0 - name: HCCL_SOCKET_IFNAME value: eth0 - name: VLLM_LOGGING_LEVEL value: DEBUG - name: AscendRealDevices valueFrom: fieldRef: fieldPath: metadata.annotations['huawei.com/AscendReal'] args: - "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/" - "--served-model-name" - "deepseek-ai/DeepSeekV2" - "--tensor-parallel-size" - "2" - "--gpu-memory-utilization" - "0.8" - "--max-model-len" - "8192" - "--max-num-batched-tokens" - "16384" - "--trust-remote-code" - "--no-enable-prefix-caching" - "--enforce-eager" - "--kv-transfer-config" - '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_consumer","kv_parallel_size":1,"kv_port":"20002","engine_id":"1","kv_rank":1,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}' imagePullPolicy: Always resources: limits: cpu: "8" memory: 64Gi huawei.com/ascend-1980: "4" requests: cpu: "8" memory: 64Gi huawei.com/ascend-1980: "4" readinessProbe: initialDelaySeconds: 5 periodSeconds: 5 failureThreshold: 3 httpGet: path: /health port: 8000 livenessProbe: initialDelaySeconds: 900 periodSeconds: 5 failureThreshold: 3 httpGet: path: /health port: 8000 volumeMounts: - name: models mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/ readOnly: true - name: hccn-config mountPath: /etc/hccn.conf readOnly: true - name: shared-memory-volume mountPath: /dev/shm volumes: - name: models hostPath: path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/ type: DirectoryOrCreate - name: hccn-config hostPath: path: /etc/hccn.conf type: File - name: shared-memory-volume emptyDir: sizeLimit: 256Mi medium: Memory EOF ``` You should see Pods such as: - `deepseek-v2-lite-0-prefill-0-0` - `deepseek-v2-lite-0-decode-0-0` To enable the LLM access, we still need to configure the routing layer with `ModelServer` and `ModelRoute`. ### 2.3 ModelServer: PD Group Management The `ModelServer` resource: - Selects the `ModelServing` workloads via labels. - Groups prefill and decode Pods into PD pairs. - Configures KV connector details and timeouts. - Exposes an internal gRPC/HTTP interface. Create ModelServer with the command below: ```bash kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/kthena-router/ModelServer-prefill-decode-disaggregation.yaml ``` or ```bash cat << EOF | kubectl apply -f - apiVersion: networking.serving.volcano.sh/v1alpha1 kind: ModelServer metadata: name: deepseek-v2 namespace: dev spec: kvConnector: type: nixl workloadSelector: matchLabels: modelserving.volcano.sh/name: deepseek-v2-lite pdGroup: groupKey: "modelserving.volcano.sh/group-name" prefillLabels: modelserving.volcano.sh/role: prefill decodeLabels: modelserving.volcano.sh/role: decode workloadPort: port: 8000 model: "deepseek-ai/DeepSeekV2" inferenceEngine: "vLLM" trafficPolicy: timeout: 10s EOF ``` ### 2.4 ModelRoute: User-Facing Endpoint The `ModelRoute` resource maps a model name (e.g., `"deepseek-ai/DeepSeekV2"`) to the `ModelServer`. Example manifest: ```bash cat << EOF | kubectl apply -f - apiVersion: networking.serving.volcano.sh/v1alpha1 kind: ModelRoute metadata: name: deepseek-v2 namespace: dev spec: modelName: "deepseek-ai/DeepSeekV2" rules: - name: "default" targetModels: - modelServerName: "deepseek-v2" EOF ``` --- ## 3. 
---

## 3. Verification

### 3.1 Check Workloads

Confirm that the prefill and decode Pods are up:

```bash
kubectl get modelserving deepseek-v2-lite -n dev -o yaml | grep status -A 10
kubectl get pod -n dev -owide \
  -l modelserving.volcano.sh/name=deepseek-v2-lite
```

You should see both roles in the `Running` and `Ready` state.

### 3.2 Test the Chat Endpoint

Once routing is configured, you can send a test request to the Kthena router:

```bash
export ENDPOINT=$(kubectl get svc kthena-router -n kthena-system --output=jsonpath='{.status.loadBalancer.ingress[0].ip}:{.spec.ports[0].port}')
curl --location "http://${ENDPOINT}/v1/chat/completions" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "deepseek-ai/DeepSeekV2",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of China?"
      }
    ],
    "stream": false
  }'
```

A successful JSON response confirms that:

- The prefill and decode services are both running on Ascend NPUs.
- KV transfer between them is working.
- The Kthena routing layer is correctly fronting the vLLM-Ascend plugin.

---

## 4. Cleanup

To remove the deployment:

```bash
# 1. Remove user-facing routing
kubectl delete modelroute deepseek-v2 -n dev

# 2. Remove internal server
kubectl delete modelserver deepseek-v2 -n dev

# 3. Remove workloads
kubectl delete modelserving deepseek-v2-lite -n dev
```

---

## 5. Summary

This guide deployed PD-disaggregated DeepSeek-V2-Lite on Ascend NPUs with vLLM-Ascend, using Kthena's `ModelServing`, `ModelServer`, and `ModelRoute` resources for orchestration and routing. For more advanced features, please refer to the [Kthena website](https://kthena.volcano.sh/).