Files
xc-llm-ascend/docs/source/user_guide/deployment_guide/using_volcano_kthena.md

434 lines
15 KiB
Markdown
Raw Permalink Normal View History

# Using Volcano Kthena
This guide shows how to run **prefilldecode (PD) disaggregation** on Huawei Ascend NPUs using **vLLM-Ascend**, with [**Kthena**](https://kthena.volcano.sh/) handling orchestration on Kubernetes. About vLLM support with Kthena, please refer to [Deploy vLLM with Kthena](https://docs.vllm.ai/en/latest/deployment/integrations/kthena/).
---
## 1. What is PrefillDecode Disaggregation?
Large language model inference naturally splits into two phases:
- **Prefill**
- Processes input tokens and builds the keyvalue (KV) cache.
- Batch-friendly, high-throughput, well-suited to parallel NPU execution.
- **Decode**
- Consumes the KV cache to generate output tokens.
- Latency-sensitive, memory-intensive, more sequential.
From the client's perspective, this still looks like a single Chat / Completions endpoint.
---
## 2. Deploy on Kubernetes with Kthena
[Kthena](https://kthena.volcano.sh/) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high-performance and enterprise-grade scalability for LLM inference workloads. In this example, we use three key Custom Resource Definitions (CRDs):
- `ModelServing` — defines the workloads (prefill and decode roles).
- `ModelServer` — manages PD groupings and internal routing.
- `ModelRoute` — exposes a stable model endpoint.
This section uses the `deepseek-ai/DeepSeek-V2-Lite` example, but you can swap in any model supported by vLLM-Ascend.
### 2.1 Prerequisites
- Kubernetes cluster with Ascend NPU nodes:
The resources corresponding to different NPU Drivers may vary slightly. For example:
- If using [MindCluster](https://gitee.com/ascend/mind-cluster#https://gitee.com/link?target=https%3A%2F%2Fgitcode.com%2FAscend%2Fmind-cluster), please use `huawei.com/Ascend310P` or `huawei.com/Ascend910`.
- If running on CCE (Cloud Container Engine) of Huawei Cloud and the [CCE AI Suite Plugin (Ascend NPU)](https://support.huaweicloud.com/intl/en-us/usermanual-cce/cce_10_0239.html) is installed, please use `huawei.com/ascend-310` or `huawei.com/ascend-1980`.
- Kthena installed. Please follow the [Kthena installation guide](https://kthena.volcano.sh/docs/getting-started/installation).
### 2.2 Deploy Prefill-Decode Disaggregated DeepSeek-V2-Lite on Kubernetes
A concrete example is provided in Kthena as <https://github.com/volcano-sh/kthena/blob/main/examples/model-serving/prefill-decode-disaggregation.yaml>
Deploy it with the command below:
```bash
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-serving/prefill-decode-disaggregation.yaml
```
or
```bash
cat << EOF | kubectl apply -f -
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
name: deepseek-v2-lite
namespace: dev
spec:
schedulerName: volcano
replicas: 1
recoveryPolicy: ServingGroupRecreate
template:
restartGracePeriodSeconds: 60
roles:
- name: prefill
replicas: 1
entryTemplate:
spec:
initContainers:
- name: downloader
imagePullPolicy: Always
image: ghcr.io/volcano-sh/downloader:latest
args:
- --source
- deepseek-ai/DeepSeek-V2-Lite
- --output-dir
- /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
volumeMounts:
- name: models
mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
containers:
- name: runtime
image: ghcr.io/volcano-sh/runtime:latest
ports:
- containerPort: 8100
args:
- --port
- "8100"
- --engine
- vllm
- --pod
- $(POD_NAME).$(NAMESPACE)
- --model
- deepseek-v2-lite
- --engine-base-url
- http://localhost:8000
- name: vllm
image: ghcr.io/volcano-sh/kthena-engine:vllm-ascend_v0.10.1rc1_mooncake_v0.3.5
ports:
- containerPort: 8000
env:
- name: HF_HUB_OFFLINE
value: "1"
- name: HCCL_IF_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: GLOO_SOCKET_IFNAME
value: eth0
- name: TP_SOCKET_IFNAME
value: eth0
- name: HCCL_SOCKET_IFNAME
value: eth0
- name: VLLM_LOGGING_LEVEL
value: DEBUG
- name: AscendRealDevices
valueFrom:
fieldRef:
fieldPath: metadata.annotations['huawei.com/AscendReal']
args:
- "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/"
- "--served-model-name"
- "deepseek-ai/DeepSeekV2"
- "--tensor-parallel-size"
- "2"
- "--gpu-memory-utilization"
- "0.8"
- "--max-model-len"
- "8192"
- "--max-num-batched-tokens"
- "8192"
- "--trust-remote-code"
- "--enforce-eager"
- "--kv-transfer-config"
- '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_producer","kv_parallel_size":1,"kv_port":"20001","engine_id":"0","kv_rank":0,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}'
imagePullPolicy: Always
resources:
limits:
cpu: "8"
memory: 64Gi
huawei.com/ascend-1980: "4"
requests:
cpu: "8"
memory: 64Gi
huawei.com/ascend-1980: "4"
readinessProbe:
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
httpGet:
path: /health
port: 8000
livenessProbe:
initialDelaySeconds: 900
periodSeconds: 5
failureThreshold: 3
httpGet:
path: /health
port: 8000
volumeMounts:
- name: models
mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
readOnly: true
- name: hccn-config
mountPath: /etc/hccn.conf
readOnly: true
- name: shared-memory-volume
mountPath: /dev/shm
volumes:
- name: models
hostPath:
path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
type: DirectoryOrCreate
- name: hccn-config
hostPath:
path: /etc/hccn.conf
type: File
- name: shared-memory-volume
emptyDir:
sizeLimit: 256Mi
medium: Memory
- name: decode
replicas: 1
entryTemplate:
spec:
initContainers:
- name: downloader
imagePullPolicy: Always
image: ghcr.io/volcano-sh/downloader:latest
args:
- --source
- deepseek-ai/DeepSeek-V2-Lite
- --output-dir
- /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
volumeMounts:
- name: models
mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
containers:
- name: vllm
image: ghcr.io/volcano-sh/kthena-engine:vllm-ascend_v0.10.1rc1_mooncake_v0.3.5
ports:
- containerPort: 8000
env:
- name: HF_HUB_OFFLINE
value: "1"
- name: HCCL_IF_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: GLOO_SOCKET_IFNAME
value: eth0
- name: TP_SOCKET_IFNAME
value: eth0
- name: HCCL_SOCKET_IFNAME
value: eth0
- name: VLLM_LOGGING_LEVEL
value: DEBUG
- name: AscendRealDevices
valueFrom:
fieldRef:
fieldPath: metadata.annotations['huawei.com/AscendReal']
args:
- "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/"
- "--served-model-name"
- "deepseek-ai/DeepSeekV2"
- "--tensor-parallel-size"
- "2"
- "--gpu-memory-utilization"
- "0.8"
- "--max-model-len"
- "8192"
- "--max-num-batched-tokens"
- "16384"
- "--trust-remote-code"
- "--no-enable-prefix-caching"
- "--enforce-eager"
- "--kv-transfer-config"
- '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_consumer","kv_parallel_size":1,"kv_port":"20002","engine_id":"1","kv_rank":1,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}'
imagePullPolicy: Always
resources:
limits:
cpu: "8"
memory: 64Gi
huawei.com/ascend-1980: "4"
requests:
cpu: "8"
memory: 64Gi
huawei.com/ascend-1980: "4"
readinessProbe:
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
httpGet:
path: /health
port: 8000
livenessProbe:
initialDelaySeconds: 900
periodSeconds: 5
failureThreshold: 3
httpGet:
path: /health
port: 8000
volumeMounts:
- name: models
mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
readOnly: true
- name: hccn-config
mountPath: /etc/hccn.conf
readOnly: true
- name: shared-memory-volume
mountPath: /dev/shm
volumes:
- name: models
hostPath:
path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
type: DirectoryOrCreate
- name: hccn-config
hostPath:
path: /etc/hccn.conf
type: File
- name: shared-memory-volume
emptyDir:
sizeLimit: 256Mi
medium: Memory
EOF
```
You should see Pods such as:
- `deepseek-v2-lite-0-prefill-0-0`
- `deepseek-v2-lite-0-decode-0-0`
To enable the LLM access, we still need to configure the routing layer with `ModelServer` and `ModelRoute`.
### 2.3 ModelServer: PD Group Management
The `ModelServer` resource:
- Selects the `ModelServing` workloads via labels.
- Groups prefill and decode Pods into PD pairs.
- Configures KV connector details and timeouts.
- Exposes an internal gRPC/HTTP interface.
Create ModelServer with the command below:
```bash
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/kthena-router/ModelServer-prefill-decode-disaggregation.yaml
```
or
```bash
cat << EOF | kubectl apply -f -
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelServer
metadata:
name: deepseek-v2
namespace: dev
spec:
kvConnector:
type: nixl
workloadSelector:
matchLabels:
modelserving.volcano.sh/name: deepseek-v2-lite
pdGroup:
groupKey: "modelserving.volcano.sh/group-name"
prefillLabels:
modelserving.volcano.sh/role: prefill
decodeLabels:
modelserving.volcano.sh/role: decode
workloadPort:
port: 8000
model: "deepseek-ai/DeepSeekV2"
inferenceEngine: "vLLM"
trafficPolicy:
timeout: 10s
EOF
```
### 2.4 ModelRoute: User-Facing Endpoint
The `ModelRoute` resource maps a model name (e.g., `"deepseek-ai/DeepSeekV2"`) to the `ModelServer`.
Example manifest:
```bash
cat << EOF | kubectl apply -f -
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
name: deepseek-v2
namespace: dev
spec:
modelName: "deepseek-ai/DeepSeekV2"
rules:
- name: "default"
targetModels:
- modelServerName: "deepseek-v2"
EOF
```
---
## 3. Verification
### 3.1 Check Workloads
Confirm that prefill and decode Pods are up:
```bash
kubectl get modelserving deepseek-v2-lite -n dev -o yaml | grep status -A 10
kubectl get pod -n dev -owide \
-l modelserving.volcano.sh/name=deepseek-v2-lite
```
You should see both roles in `Running` and `Ready` state.
### 3.2 Test the Chat Endpoint
Once routing is configured, you can send a test request to the Kthena-router:
```bash
export ENDPOINT=$(kubectl get svc kthena-router -n kthena-system --output=jsonpath='{.status.loadBalancer.ingress[0].ip}:{.spec.ports[0].port}')
curl --location "http://${ENDPOINT}/v1/chat/completions" \
--header "Content-Type: application/json" \
--data '{
"model": "deepseek-ai/DeepSeekV2",
"messages": [
{
"role": "user",
"content": "Where is the capital of China?"
}
],
"stream": false
}'
```
A successful JSON response confirms that:
- The prefill and decode services are both running on Ascend NPUs.
- KV transfer between them is working.
- The Kthena routing layer is correctly fronting the vLLM-Ascend plugin.
---
## 4. Cleanup
To remove the deployment:
```bash
# 1. Remove user-facing routing
kubectl delete modelroute deepseek-v2 -n dev
# 2. Remove internal server
kubectl delete modelserver deepseek-v2 -n dev
# 3. Remove workloads
kubectl delete modelserving deepseek-v2-lite -n dev
```
---
## 5. Summary
For more advanced features, please refer to the [Kthena website](https://kthena.volcano.sh/).