[Doc][v0.18.0] Fix documentation formatting and improve code examples (#8701)
### What this PR does / why we need it?

This PR fixes various documentation issues and improves code examples throughout the project.

Signed-off-by: MrZ20 <2609716663@qq.com>
@@ -18,7 +18,7 @@ enable_custom_op()
- Create a new operation folder under the `csrc` directory (see the sketch after this list).
- Create `op_host` and `op_kernel` directories for the host and kernel source code.
- Add build options in `csrc/build_aclnn.sh` for the supported SOCs. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS="op1;op2;op3"`.
- Bind the aclnn operators to the `torch.ops._C_ascend` module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` so that the op can be captured into the aclgraph.
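A minimal sketch of what the first two steps look like on disk; the op names `my_op` and `another_op` are purely illustrative:

```bash
# Hypothetical layout for a new custom op named "my_op"
mkdir -p csrc/my_op/op_host csrc/my_op/op_kernel

# When building, list every op to compile, separated by ';' (as noted above)
CUSTOM_OPS="my_op;another_op"
```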
@@ -35,38 +35,36 @@ From the workflow perspective, we can see how the final test script is executed,
```yaml
npu_per_node: 16

# All env vars you need should be added here
env_common:
  VLLM_USE_MODELSCOPE: true
  OMP_PROC_BIND: false
  OMP_NUM_THREADS: 100
  HCCL_BUFFSIZE: 1024
  SERVER_PORT: 8080

disaggregated_prefill:
  enabled: true
  # node indices (a list) that meet all the conditions:
  # - prefiller
  # - not headless (has an api server)
  prefiller_host_index: [0]
  # node indices (a list) that meet all the conditions:
  # - decoder
  decoder_host_index: [1]

# Add each node's vllm serve cli command just like you run it locally
# Add each node's individual envs like the following
deployment:
  - envs:
      # fill with envs like: <key>: <value>
    server_cmd: >
      vllm serve ...
  - envs:
      # fill with envs like: <key>: <value>
    server_cmd: >
      vllm serve ...

benchmarks:
  perf:
    # fill with performance test kwargs
  acc:
    # fill with accuracy test kwargs
```
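Note that `server_cmd` uses YAML's folded block scalar (`>`), so a multi-line `vllm serve` command is folded into a single line when the config is parsed.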
@@ -74,38 +72,38 @@ From the workflow perspective, we can see how the final test script is executed,
Currently, the multi-node test workflow is defined in [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml).

```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node
          config_file_path: Qwen3-235B-W8A8.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node-eplb
          config_file_path: Qwen3-235B-W8A8-EPLB.yaml
          size: 2
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```
The matrix above defines all the parameters required to add a multi-node use case. If you are adding a new one, the parameters worth noting are `size` and the path to the YAML configuration file: the former sets the number of nodes your use case requires, and the latter points to the configuration file you completed in step 2.
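Because the job's `if` condition only fires for `schedule` and `workflow_dispatch` events, a run can also be triggered manually instead of waiting for the nightly schedule. A possible invocation with the GitHub CLI (a sketch; assumes `gh` is installed and authenticated, and that you may dispatch workflows in this repo):

```bash
# Manually dispatch the nightly multi-node workflow (illustrative)
gh workflow run schedule_nightly_test_a3.yaml --repo vllm-project/vllm-ascend
```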
@@ -125,130 +123,130 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: test-server
  namespace: vllm-project
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: None
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
                cpu: 125
            ports:
              - containerPort: 8080
            # readinessProbe:
            #   tcpSocket:
            #     port: 8080
            #   initialDelaySeconds: 15
            #   periodSeconds: 10
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                ephemeral-storage: 100Gi
                cpu: 125
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
  namespace: vllm-project
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
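To try the manifest on a cluster, something like the following should work (a sketch; assumes the manifest above is saved as `lws-test-server.yaml` and that the LeaderWorkerSet CRD is already installed):

```bash
# Apply the LeaderWorkerSet and Service, then watch the pods come up (illustrative)
kubectl apply -f lws-test-server.yaml
kubectl -n vllm-project get pods -w
```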
@@ -40,6 +40,7 @@ export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirror

```bash
# src path
export SRC_WORKSPACE=/vllm-workspace
mkdir -p $SRC_WORKSPACE
cd $SRC_WORKSPACE

apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
```
@@ -38,11 +38,11 @@ Run the vLLM server in the docker.
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 35000 &
```

:::{note}
`--max-model-len` should be greater than `35000`, which is suitable for most datasets; otherwise the accuracy evaluation may be affected.
:::
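Once the server is up, a quick sanity check is to list the models it exposes (a sketch; assumes the server listens on the default port 8000):

```bash
# Qwen/Qwen2.5-0.5B-Instruct should appear in the returned model list
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```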

The vLLM server has started successfully if you see logs like the following:
@@ -29,7 +29,7 @@ docker run --rm \

```bash
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```

If the vLLM server started successfully, you will see information like the following:
@@ -32,7 +32,7 @@ docker run --rm \

```bash
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 4096 &
```

The vLLM server has started successfully if you see logs like the following:
@@ -48,28 +48,36 @@ INFO: Application startup complete.
You can query the result with input prompts:
```shell
# Note: the '"'"' sequences close the single-quoted string, emit a literal
# apostrophe, and reopen the string, so the prompt can contain apostrophes.
PROMPT='<|im_start|>system
You are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>
<|im_start|>user
Question: A company'"'"'s balance sheet as of December 31, 2023 shows:
Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan
Non-current assets: Net fixed assets 12 million yuan
Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan
Non-current liabilities: Long-term loans 9 million yuan
Owner'"'"'s equity: Paid-in capital 10 million yuan, Retained earnings ?
Requirement: Calculate the company'"'"'s Asset-Liability Ratio and Current Ratio (round to two decimal places).
Options:
A. Asset-Liability Ratio=58.33%, Current Ratio=1.90
B. Asset-Liability Ratio=62.50%, Current Ratio=2.17
C. Asset-Liability Ratio=65.22%, Current Ratio=1.75
D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>
<|im_start|>assistant
'

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "$(jq -n \
        --arg model "Qwen/Qwen2.5-0.5B-Instruct" \
        --arg prompt "$PROMPT" \
        '{
            model: $model,
            prompt: $prompt,
            max_completion_tokens: 1,
            temperature: 0,
            stop: ["<|im_end|>"]
        }')" | python3 -m json.tool
```
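Building the request body with `jq -n --arg` lets `jq` handle the JSON escaping of the multi-line prompt, avoiding fragile hand-written shell quoting inside the JSON payload.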
The output format matches the following:
@@ -29,7 +29,7 @@ docker run --rm \

```bash
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```

The vLLM server has started successfully if you see information like the following:
@@ -158,6 +158,12 @@ Scheduling optimization:

```{code-block} bash
:substitutions:
# Optimize the operator delivery queue. This affects peak memory usage and may degrade performance when memory is tight.
export TASK_QUEUE_ENABLE=2
```
or

```{code-block} bash
:substitutions:

# This greatly improves CPU-bottlenecked models while keeping the same performance for NPU-bottlenecked models.
export CPU_AFFINITY_CONF=1
```
@@ -223,7 +223,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code

```shell
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
vllm bench serve \
    --model Qwen/Qwen3-Embedding-8B \
    --backend openai-embeddings \