[Doc][v0.18.0] Fix documentation formatting and improve code examples (#8701)

### What this PR does / why we need it?
This PR fixes various documentation issues and improves code examples
throughout the project.

Signed-off-by: MrZ20 <2609716663@qq.com>
Authored by SILONG ZENG on 2026-04-28 09:01:25 +08:00, committed by GitHub.
Parent: 9a0b786f2b
Commit: 2e2aaa2fae
38 changed files with 205 additions and 188 deletions


@@ -32,7 +32,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
- Fix multi-modal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
- SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
-- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
+- support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
### Dependencies


@@ -50,6 +50,8 @@ All environment variables must be defined in `vllm_ascend/envs.py` using the cen
**Example:**
```python
+import os
env_variables = {
    "VLLM_ASCEND_ENABLE_NZ": lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", 1)),
    # ...
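As a quick illustration of how a variable declared this way is consumed, the sketch below sets it from the shell before launching vLLM. This is a hedged example: the 0/1 integer convention is inferred from the `int(os.getenv(...))` lambda above, and the launch command is only a placeholder.

```bash
# Hedged sketch: export the variable before starting vLLM so the lambda in
# vllm_ascend/envs.py picks it up (0 disables, 1 is the default shown above).
export VLLM_ASCEND_ENABLE_NZ=0
vllm serve Qwen/Qwen2.5-0.5B-Instruct   # placeholder model; any serve command works
```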


@@ -18,7 +18,7 @@ enable_custom_op()
- Create a new operation folder under `csrc` directory.
- Create `op_host` and `op_kernel` directories for host and kernel source code.
-- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`.
+- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS="op1;op2;op3"`.
- Bind aclnn operators to torch.ops._C_ascend module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` for the op to be captured into the aclgraph.
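To make the quoting change above concrete, here is a hedged sketch of the build-option line inside `csrc/build_aclnn.sh`. The op names and any surrounding script logic are assumptions; only the quoted, `;`-separated `CUSTOM_OPS` form comes from the step above.

```bash
# Hedged sketch: quote the value so the shell does not treat ';' as a command separator.
CUSTOM_OPS="my_op_a;my_op_b;my_op_c"   # op names are placeholders
```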


@@ -35,38 +35,36 @@ From the workflow perspective, we can see how the final test script is executed,
npu_per_node: 16
# All env vars you need should add it here
env_common:
  VLLM_USE_MODELSCOPE: true
  OMP_PROC_BIND: false
  OMP_NUM_THREADS: 100
  HCCL_BUFFSIZE: 1024
  SERVER_PORT: 8080
disaggregated_prefill:
  enabled: true
  # node index(a list) which meet all the conditions:
  # - prefiller
  # - no headless(have api server)
  prefiller_host_index: [0]
  # node index(a list) which meet all the conditions:
  # - decoder
  decoder_host_index: [1]
# Add each node's vllm serve cli command just like you run locally
# Add each node's individual envs like follow
deployment:
-  -
-    envs:
+  - envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
-  -
-    envs:
+  - envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
benchmarks:
  perf:
    # fill with performance test kwargs
  acc:
    # fill with accuracy test kwargs
```
@@ -74,38 +72,38 @@ From the workflow perspective, we can see how the final test script is executed,
Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)
```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node
          config_file_path: Qwen3-235B-W8A8.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node-eplb
          config_file_path: Qwen3-235B-W8A8-EPLB.yaml
          size: 2
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```
The matrix above defines all the parameters required to add a multi-machine use case. The parameters worth noting (if you are adding a new use case) are `size` and the path to the yaml configuration file. The former defines the number of nodes required for your use case, and the latter defines the path to the configuration file you have completed in step 2.
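Once a new matrix entry is merged, a run does not have to wait for the nightly schedule: the `if:` condition above also accepts `workflow_dispatch`. A hedged sketch with the GitHub CLI, assuming the workflow really declares `workflow_dispatch` as that condition suggests and that you have permission to dispatch it:

```bash
# Hedged sketch: manually dispatch the nightly multi-node workflow on main.
gh workflow run schedule_nightly_test_a3.yaml --ref main
```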
@@ -125,130 +123,130 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: test-server
  namespace: vllm-project
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: None
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
                cpu: 125
            ports:
              - containerPort: 8080
            # readinessProbe:
            #   tcpSocket:
            #     port: 8080
            #   initialDelaySeconds: 15
            #   periodSeconds: 10
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                ephemeral-storage: 100Gi
                cpu: 125
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
  namespace: vllm-project
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
```bash


@@ -40,6 +40,7 @@ export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirror
# src path
export SRC_WORKSPACE=/vllm-workspace
mkdir -p $SRC_WORKSPACE
+cd $SRC_WORKSPACE
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2


@@ -38,11 +38,11 @@ Run the vLLM server in the docker.
```{code-block} bash
:substitutions:
-vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
+vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 35000 &
```
:::{note}
-`--max_model_len` should be greater than `35000`, this will be suitable for most datasets. Otherwise the accuracy evaluation may be affected.
+`--max-model-len` should be greater than `35000`, this will be suitable for most datasets. Otherwise the accuracy evaluation may be affected.
:::
The vLLM server is started successfully, if you see logs as below:


@@ -29,7 +29,7 @@ docker run --rm \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
-    vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+    vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```
If the vLLM server is started successfully, you can see information shown below:


@@ -32,7 +32,7 @@ docker run --rm \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
    /bin/bash
-vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
+vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 4096 &
```
The vLLM server is started successfully, if you see logs as below:
@@ -48,28 +48,36 @@ INFO: Application startup complete.
You can query the result with input prompts:
```shell
+PROMPT='<|im_start|>system
+You are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>
+<|im_start|>user
+Question: A company'"'"'s balance sheet as of December 31, 2023 shows:
+ Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan
+ Non-current assets: Net fixed assets 12 million yuan
+ Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan
+ Non-current liabilities: Long-term loans 9 million yuan
+ Owner'"'"'s equity: Paid-in capital 10 million yuan, Retained earnings ?
+Requirement: Calculate the company'"'"'s Asset-Liability Ratio and Current Ratio (round to two decimal places).
+Options:
+A. Asset-Liability Ratio=58.33%, Current Ratio=1.90
+B. Asset-Liability Ratio=62.50%, Current Ratio=2.17
+C. Asset-Liability Ratio=65.22%, Current Ratio=1.75
+D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>
+<|im_start|>assistant
+'
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
-    -d '{
-        "model": "Qwen/Qwen2.5-0.5B-Instruct",
-        "prompt": "'"<|im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"\
-        "<|im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\
-        " Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\
-        " Non-current assets: Net fixed assets 12 million yuan\n"\
-        " Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\
-        " Non-current liabilities: Long-term loans 9 million yuan\n"\
-        " Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\
-        "Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\
-        "Options:\n"\
-        "A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\
-        "B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\
-        "C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
-        "D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
-        "<|im_start|>assistant\n"'",
-        "max_completion_tokens": 1,
-        "temperature": 0,
-        "stop": ["<|im_end|>"]
-    }' | python3 -m json.tool
+    -d "$(jq -n \
+        --arg model "Qwen/Qwen2.5-0.5B-Instruct" \
+        --arg prompt "$PROMPT" \
+        '{
+            model: $model,
+            prompt: $prompt,
+            max_completion_tokens: 1,
+            temperature: 0,
+            stop: ["<|im_end|>"]
+        }')" | python3 -m json.tool
```
The output format matches the following:


@@ -29,7 +29,7 @@ docker run --rm \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
-    vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+    vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```
The vLLM server is started successfully, if you see information as below:


@@ -158,6 +158,12 @@ Scheduling optimization:
:substitutions:
# Optimize operator delivery queue. This will affect the memory peak value, and may degrade if the memory is tight.
export TASK_QUEUE_ENABLE=2
+```
+or
+```{code-block} bash
+:substitutions:
# This will greatly improve the CPU bottleneck model and ensure the same performance for the NPU bottleneck model.
export CPU_AFFINITY_CONF=1


@@ -223,7 +223,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
```shell
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve \
    --model Qwen/Qwen3-Embedding-8B \
    --backend openai-embeddings \


@@ -284,7 +284,7 @@ python example.py
If you encounter a connection error with Hugging Face (e.g., `We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.`), run the following commands to use ModelScope as an alternative:
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
pip install modelscope
python example.py
```


@@ -12,6 +12,8 @@
## Setup environment using container
+Before using containers, make sure Docker is installed on your system. If Docker is not installed, please refer to the [Docker installation guide](https://docs.docker.com/get-docker/) for installation instructions.
:::::{tab-set}
::::{tab-item} Ubuntu
@@ -91,7 +93,7 @@ You can use ModelScope mirror to speed up download:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
```
There are two ways to start vLLM on Ascend NPU:


@@ -559,7 +559,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompts 20 --request-rate 0 --save-result --result-dir ./
```


@@ -72,7 +72,7 @@ Run the following script to execute online 128k inference.
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -166,7 +166,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompts 1 --request-rate 1 --save-result --result-dir ./
```


@@ -96,7 +96,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --served_model_name qwen --dtype float16 \
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
-    --quantization ascend --max_model_len 16384
+    --quantization ascend --max-model-len 16384
# `--load_format` is required only for the W8A8SC quantized weight format.
#
```
@@ -134,7 +134,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --enforce-eager \
    --dtype float16 \
    --quantization ascend \
-    --max_model_len 10240
+    --max-model-len 10240
```
Argument notes: `--tensor-parallel-size`: `W8A8SC` quantized weights are tightly coupled to the TP size, so you must specify the TP size you plan to use at serving time when running compression. `--model` is the path to the input `w8a8s` weights, and `--output` is the output path for the compressed `w8a8sc` weights.
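For readers who want those three arguments in one place, here is a hedged sketch of a compression invocation. The entry point name `compress_w8a8sc.py` is purely a hypothetical placeholder (use the actual command from the preceding step); only the flags and their relationship come from the note above.

```bash
# Hedged sketch only; "compress_w8a8sc.py" is a hypothetical placeholder for the real compression command.
python compress_w8a8sc.py \
    --model /path/to/qwen3-w8a8s \          # input w8a8s weights (example path)
    --output /path/to/qwen3-w8a8sc-tp4 \    # output path for the compressed w8a8sc weights
    --tensor-parallel-size 4                # must match the TP size used at serving time
```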
@@ -159,7 +159,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
    --quantization ascend \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --load_format="sharded_state"
```
@@ -178,7 +178,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
    --quantization ascend \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --load_format="sharded_state"
```
@@ -199,7 +199,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
    --quantization ascend \
-    --max_model_len 20480 \
+    --max-model-len 20480 \
    --no-enable-prefix-caching \
    --load_format="sharded_state"
```


@@ -302,7 +302,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -943,7 +943,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -93,7 +93,7 @@ vllm serve /root/.cache/DeepSeek-OCR-2 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --port 1055 \
-    --max_model_len 8192 \
+    --max-model-len 8192 \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.8 \
    --allowed-local-media-path / \


@@ -784,7 +784,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Eco-Tech/Kimi-K2.5-w4a8 --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -72,7 +72,7 @@ Run the following script to start the vLLM server on single 910B4:
```shell
#!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
@@ -97,11 +97,11 @@ Run the following script to start the vLLM server on single Atlas 300 inference
```shell
#!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
vllm serve ${MODEL_PATH} \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
    --served-model-name PaddleOCR-VL-0.9B \
    --trust-remote-code \
    --no-enable-prefix-caching \
@@ -112,7 +112,7 @@ vllm serve ${MODEL_PATH} \
```
:::{note}
-The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
+The `--max-model-len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
:::
::::


@@ -323,12 +323,12 @@ Run docker container to start the vLLM server on single-NPU:
:substitutions:
vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --dtype bfloat16 \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
    --max-num-batched-tokens 16384
```
:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
+Add `--max-model-len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
:::
If your service start successfully, you can see the info shown below:
@@ -415,7 +415,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
```
:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
+Add `--max-model-len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
:::
If your service start successfully, you can see the info shown below:


@@ -74,7 +74,7 @@ The environment variable `LOCAL_MEDIA_PATH` which allows API requests to read lo
:::
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH="Qwen/Qwen2.5-Omni-7B"
export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
@@ -104,7 +104,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
#### Multiple NPU (Qwen2.5-Omni-7B)
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH=Qwen/Qwen2.5-Omni-7B
export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
export DP_SIZE=8


@@ -95,7 +95,7 @@ Run the following script to execute online 128k inference.
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -157,7 +157,7 @@ Node 0
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -199,7 +199,7 @@ Node1
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -309,7 +309,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
@@ -335,7 +335,7 @@ Example server scripts:
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -408,7 +408,7 @@ export TP_SOCKET_IFNAME=${ifname}
export HCCL_SOCKET_IFNAME=${ifname}
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -470,7 +470,7 @@ export TP_SOCKET_IFNAME=${ifname}
export HCCL_SOCKET_IFNAME=${ifname}
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1024
@@ -534,7 +534,7 @@ export TP_SOCKET_IFNAME=${ifname}
export HCCL_SOCKET_IFNAME=${ifname}
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1024


@@ -93,7 +93,7 @@ The converted model files look like:
Run the following script to start the vLLM server with the quantized model:
```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODEL_PATH=vllm-ascend/Qwen3-8B-W4A8
vllm serve ${MODEL_PATH} --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
```


@@ -64,7 +64,7 @@ For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at
```shell
#!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --served-model-name qwen3-coder --tensor-parallel-size 4 --enable_expert_parallel
```


@@ -163,7 +163,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -94,7 +94,7 @@ Node 0
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -137,7 +137,7 @@ Node1
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this is obtained through ifconfig
@@ -269,7 +269,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -94,7 +94,7 @@ Run the following script to execute online 128k inference.
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -190,7 +190,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Eco-Tech/Qwen3.5-27B-w8a8-mtp --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -94,7 +94,7 @@ Run the following script to execute online 128k inference On 1 Atlas 800 A3(64G*
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
@@ -157,7 +157,7 @@ Node 0
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -203,7 +203,7 @@ Node1
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
@@ -595,7 +595,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```


@@ -38,9 +38,9 @@ So far, dynamic batch performs better on several dense models including Qwen and
Dynamic batch is used in the online inference. A fully executable example is as follows:
```shell
-SLO_LITMIT=50
+SLO_LIMIT=50
vllm serve Qwen/Qwen2.5-14B-Instruct\
-    --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LITMIT}'}' \
+    --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LIMIT}'}' \
    --max-num-seqs 256 \
    --block-size 128 \
    --tensor_parallel_size 8 \


@@ -54,25 +54,25 @@ To enable Netloader, pass `--load-format=netloader` and provide configuration vi
### Server
```shell
-VLLM_SLEEP_WHEN_IDLE=1 vllm serve `<model_file>` \
+VLLM_SLEEP_WHEN_IDLE=1 vllm serve <model_file> \
    --tensor-parallel-size 1 \
-    --served-model-name `<model_name>` \
+    --served-model-name <model_name> \
    --enforce-eager \
-    --port `<port>` \
+    --port <port> \
    --load-format netloader
```
### Client
```shell
-export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["`<server_IP>`:`<server_Port>`"]}]}'
-VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=`<device_id_diff_from_server>` \
-vllm serve `<model_file>` \
+export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["<server_IP>:<server_Port>"]}]}'
+VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=<device_id_diff_from_server> \
+vllm serve <model_file> \
    --tensor-parallel-size 1 \
-    --served-model-name `<model_name>` \
+    --served-model-name <model_name> \
    --enforce-eager \
-    --port `<client_port>` \
+    --port <client_port> \
    --load-format netloader \
    --model-loader-extra-config="${NETLOADER_CONFIG}"
```


@@ -80,7 +80,7 @@ A simple planner implementation is provided at [`rfork_planner.py`](../../../../
```shell
python rfork_planner.py \
    --host 0.0.0.0 \
-    --port `<planner_port>`
+    --port <planner_port>
```
### 3. Start vLLM Instances
@@ -93,15 +93,15 @@ For later instances, if the planner can allocate a compatible seed, RFork will t
```shell
export RFORK_CONFIG='{
-  "model_url": "`<model_url>`",
-  "model_deploy_strategy_name": "`<deploy_strategy>`",
-  "rfork_scheduler_url": "http://`<planner_ip>`:`<planner_port>`"
+  "model_url": "<model_url>",
+  "model_deploy_strategy_name": "<deploy_strategy>",
+  "rfork_scheduler_url": "http://<planner_ip>:<planner_port>"
}'
-vllm serve `<model_path>` \
+vllm serve <model_path> \
    --tensor-parallel-size 1 \
-    --served-model-name `<served_model_name>` \
-    --port `<port>` \
+    --served-model-name <served_model_name> \
+    --port <port> \
    --load-format rfork \
    --model-loader-extra-config "${RFORK_CONFIG}"
```


@@ -381,7 +381,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
- Fix multimodal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
- SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
-- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
+- support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
### Dependencies


@@ -7,7 +7,7 @@ export HCCL_SOCKET_IFNAME="eth0"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export ASCEND_LAUNCH_BLOCKING=0


@@ -80,7 +80,7 @@ async def test_qwen3_moe_w8a8_distributed_tp2_ep_dynamic_eplb():
    port = get_open_port()
    compilation_config = json.dumps({"cudagraph_capture_sizes": [8]})
    server_args = [
-        "--max_model_len",
+        "--max-model-len",
        "8192",
        "--tensor_parallel_size",
        "2",


@@ -239,7 +239,7 @@ test_cases:
    <<: *envs
    server_cmd: *server_cmd
    benchmarks:
-      <<: *benchmarks_acc
+      <<: *benchmarks
```
#### EPD / Disaggregated Case


@@ -21,7 +21,7 @@ set -eo errexit
. $(dirname "$0")/common.sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_HUB_FILE_LOCK=false
export HF_HUB_OFFLINE=1