[Doc][v0.18.0] Fix documentation formatting and improve code examples (#8701)
### What this PR does / why we need it?

This PR fixes various documentation issues and improves code examples throughout the project.

Signed-off-by: MrZ20 <2609716663@qq.com>
@@ -32,7 +32,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
- Fix multi-modal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default and is enabled automatically on the decode node in PD deployment. Note that this feature costs more memory; if you are memory sensitive, set it to `False`. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
- SSL config can be set in `kv_extra_config` for PD deployment with the mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
- Support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)

### Dependencies
@@ -50,6 +50,8 @@ All environment variables must be defined in `vllm_ascend/envs.py` using the cen
**Example:**

```python
import os

env_variables = {
    "VLLM_ASCEND_ENABLE_NZ": lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", 1)),
    # ...
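As an aside to the hunk above: the table maps each variable name to a zero-argument lambda, so the environment is read lazily at access time rather than once at import time. The sketch below is not part of the diff; `read_env` is a hypothetical helper added here only to exercise the pattern.

```python
import os

# Each entry is a zero-argument lambda: the env var is re-read on every call,
# so changes to os.environ after import are still observed.
env_variables = {
    "VLLM_ASCEND_ENABLE_NZ": lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", 1)),
}

def read_env(name: str) -> int:
    """Resolve a variable from the table on every access (hypothetical helper)."""
    return env_variables[name]()

print(read_env("VLLM_ASCEND_ENABLE_NZ"))  # default: 1
os.environ["VLLM_ASCEND_ENABLE_NZ"] = "0"
print(read_env("VLLM_ASCEND_ENABLE_NZ"))  # reflects the change: 0
```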
@@ -18,7 +18,7 @@ enable_custom_op()
- Create a new operation folder under the `csrc` directory.
- Create `op_host` and `op_kernel` directories for host and kernel source code.
- Add build options in `csrc/build_aclnn.sh` for the supported SOCs. Note that multiple ops should be separated with `;`, e.g. `CUSTOM_OPS="op1;op2;op3"`.
- Bind aclnn operators to the `torch.ops._C_ascend` module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` so the op can be captured into the aclgraph.
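The quoting fix in the hunk above matters because an unquoted `;` is a command separator in the shell. A small demonstration (not part of the diff, just an illustration of the shell semantics):

```python
import subprocess

# Unquoted: the shell parses `CUSTOM_OPS=op1;op2;op3` as an assignment of
# "op1" followed by two (failing) commands `op2` and `op3`.
unquoted = subprocess.run(
    ["sh", "-c", 'CUSTOM_OPS=op1;op2;op3; echo "$CUSTOM_OPS"'],
    capture_output=True, text=True)

# Quoted: the whole semicolon-separated list survives as one value.
quoted = subprocess.run(
    ["sh", "-c", 'CUSTOM_OPS="op1;op2;op3"; echo "$CUSTOM_OPS"'],
    capture_output=True, text=True)

print(unquoted.stdout.strip())  # op1
print(quoted.stdout.strip())    # op1;op2;op3
```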
@@ -35,38 +35,36 @@ From the workflow perspective, we can see how the final test script is executed,
npu_per_node: 16
# All env vars you need should be added here
env_common:
  VLLM_USE_MODELSCOPE: true
  OMP_PROC_BIND: false
  OMP_NUM_THREADS: 100
  HCCL_BUFFSIZE: 1024
  SERVER_PORT: 8080
disaggregated_prefill:
  enabled: true
  # node indices (a list) which meet all the conditions:
  # - prefiller
  # - not headless (has an api server)
  prefiller_host_index: [0]
  # node indices (a list) which meet all the conditions:
  # - decoder
  decoder_host_index: [1]

# Add each node's vllm serve cli command just like you run locally
# Add each node's individual envs like the following
deployment:
  - envs:
      # fill with envs like: <key>: <value>
    server_cmd: >
      vllm serve ...
  - envs:
      # fill with envs like: <key>: <value>
    server_cmd: >
      vllm serve ...
benchmarks:
  perf:
    # fill with performance test kwargs
  acc:
    # fill with accuracy test kwargs
```
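As a side note to the config above: `prefiller_host_index` and `decoder_host_index` let the harness derive each node's role from its index. A minimal sketch of that lookup (names like `node_role` are assumptions for illustration, not the actual test harness):

```python
# A trimmed-down config mirroring the YAML above.
config = {
    "npu_per_node": 16,
    "disaggregated_prefill": {
        "enabled": True,
        "prefiller_host_index": [0],
        "decoder_host_index": [1],
    },
}

def node_role(config: dict, node_index: int) -> str:
    """Classify a node by its index, per the disaggregated_prefill section."""
    pd = config.get("disaggregated_prefill", {})
    if not pd.get("enabled"):
        return "standalone"
    if node_index in pd.get("prefiller_host_index", []):
        return "prefiller"
    if node_index in pd.get("decoder_host_index", []):
        return "decoder"
    # Nodes listed in neither index run without their own api server.
    return "headless"

print([node_role(config, i) for i in range(3)])
```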
@@ -74,38 +72,38 @@ From the workflow perspective, we can see how the final test script is executed,
Currently, the multi-node test workflow is defined in [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)

```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node
          config_file_path: Qwen3-235B-W8A8.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node-eplb
          config_file_path: Qwen3-235B-W8A8-EPLB.yaml
          size: 2
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```

The matrix above defines all the parameters required to add a multi-node use case. If you are adding a new use case, the parameters worth noting are `size` and the path to the YAML configuration file: the former defines the number of nodes your use case requires, and the latter points to the configuration file you completed in step 2.
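For intuition, the `strategy.matrix` above fans out into one reusable-workflow invocation per `test_config` entry, each receiving the entry's `size` and `config_file_path` as inputs. A rough sketch of that expansion (not the actual Actions runner):

```python
# Entries copied from the matrix above.
matrix = {
    "test_config": [
        {"name": "multi-node-deepseek-pd", "config_file_path": "DeepSeek-V3.yaml", "size": 2},
        {"name": "multi-node-qwen3-dp", "config_file_path": "Qwen3-235B-A22B.yaml", "size": 2},
        {"name": "multi-node-qwenw8a8-2node", "config_file_path": "Qwen3-235B-W8A8.yaml", "size": 2},
        {"name": "multi-node-qwenw8a8-2node-eplb", "config_file_path": "Qwen3-235B-W8A8-EPLB.yaml", "size": 2},
    ]
}

def expand_jobs(matrix: dict) -> list:
    """Substitute ${{ matrix.test_config.* }} into the reusable-workflow inputs."""
    return [
        {
            "soc_version": "a3",
            "replicas": 1,
            "size": tc["size"],
            "config_file_path": tc["config_file_path"],
        }
        for tc in matrix["test_config"]
    ]

jobs = expand_jobs(matrix)
print(len(jobs), jobs[0]["config_file_path"])
```

With `max-parallel: 1` these four jobs run one at a time, since they share the same pool of multi-node runners.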
@@ -125,130 +123,130 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: test-server
  namespace: vllm-project
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: None
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
                cpu: 125
            ports:
              - containerPort: 8080
            # readinessProbe:
            #   tcpSocket:
            #     port: 8080
            #   initialDelaySeconds: 15
            #   periodSeconds: 10
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                ephemeral-storage: 100Gi
                cpu: 125
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
  namespace: vllm-project
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```

```bash
@@ -40,6 +40,7 @@ export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirror
# src path
export SRC_WORKSPACE=/vllm-workspace
mkdir -p $SRC_WORKSPACE
cd $SRC_WORKSPACE

apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
@@ -38,11 +38,11 @@ Run the vLLM server in the docker.
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 35000 &
```

:::{note}
`--max-model-len` should be greater than `35000`, which is suitable for most datasets. Otherwise, the accuracy evaluation may be affected.
:::

The vLLM server has started successfully if you see logs like the following:
@@ -29,7 +29,7 @@ docker run --rm \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```

If the vLLM server has started successfully, you will see information like the following:
@@ -32,7 +32,7 @@ docker run --rm \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 4096 &
```

The vLLM server has started successfully if you see logs like the following:
@@ -48,28 +48,36 @@ INFO: Application startup complete.
You can query the result with input prompts:

```shell
PROMPT='<|im_start|>system
You are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>
<|im_start|>user
Question: A company'"'"'s balance sheet as of December 31, 2023 shows:
Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan
Non-current assets: Net fixed assets 12 million yuan
Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan
Non-current liabilities: Long-term loans 9 million yuan
Owner'"'"'s equity: Paid-in capital 10 million yuan, Retained earnings ?
Requirement: Calculate the company'"'"'s Asset-Liability Ratio and Current Ratio (round to two decimal places).
Options:
A. Asset-Liability Ratio=58.33%, Current Ratio=1.90
B. Asset-Liability Ratio=62.50%, Current Ratio=2.17
C. Asset-Liability Ratio=65.22%, Current Ratio=1.75
D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>
<|im_start|>assistant
'

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "$(jq -n \
        --arg model "Qwen/Qwen2.5-0.5B-Instruct" \
        --arg prompt "$PROMPT" \
        '{
            model: $model,
            prompt: $prompt,
            max_completion_tokens: 1,
            temperature: 0,
            stop: ["<|im_end|>"]
        }')" | python3 -m json.tool
```

The output format matches the following:
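The point of the jq rewrite above is that the prompt is injected as data (`--arg`) instead of being spliced into a shell string, which avoids the fragile quote escaping of the old version. The same payload can be built in Python with `json.dumps`; the prompt is abbreviated here, and the endpoint and field names are simply those of the OpenAI-compatible completions API shown in the curl command:

```python
import json

# Build the request body programmatically; json.dumps handles all escaping.
prompt = (
    "<|im_start|>system\n"
    "You are a professional accountant. Answer questions using accounting "
    "knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"
    "<|im_start|>user\n"
    "Question: ...<|im_end|>\n"  # abbreviated; see the full prompt above
    "<|im_start|>assistant\n"
)
payload = json.dumps({
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": prompt,
    "max_completion_tokens": 1,
    "temperature": 0,
    "stop": ["<|im_end|>"],
})
print(payload[:60])
```

To send it, POST `payload` to `http://localhost:8000/v1/completions` with a `Content-Type: application/json` header (e.g. via `urllib.request`), assuming the server from the previous step is running.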
@@ -29,7 +29,7 @@ docker run --rm \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```

The vLLM server has started successfully if you see information like the following:
@@ -158,6 +158,12 @@ Scheduling optimization:
:substitutions:
# Optimize the operator delivery queue. This affects the peak memory usage and may degrade performance if memory is tight.
export TASK_QUEUE_ENABLE=2
```

or

```{code-block} bash
:substitutions:

# This greatly improves CPU-bound models and keeps the same performance for NPU-bound models.
export CPU_AFFINITY_CONF=1
@@ -223,7 +223,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
```shell
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
vllm bench serve \
    --model Qwen/Qwen3-Embedding-8B \
    --backend openai-embeddings \
@@ -284,7 +284,7 @@ python example.py
If you encounter a connection error with Hugging Face (e.g., `We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.`), run the following commands to use ModelScope as an alternative:

```bash
export VLLM_USE_MODELSCOPE=True
pip install modelscope
python example.py
```
@@ -12,6 +12,8 @@
## Setup environment using container

Before using containers, make sure Docker is installed on your system. If Docker is not installed, please refer to the [Docker installation guide](https://docs.docker.com/get-docker/) for installation instructions.

:::::{tab-set}
::::{tab-item} Ubuntu
@@ -91,7 +93,7 @@ You can use ModelScope mirror to speed up download:
<!-- Consider updating tests/e2e/doctest/001-quickstart-test.sh as well -->

```bash
export VLLM_USE_MODELSCOPE=True
```

There are two ways to start vLLM on Ascend NPU:
@@ -559,7 +559,7 @@ There are three `vllm bench` subcommands:
Take the `serve` subcommand as an example. Run the code as follows.

```shell
export VLLM_USE_MODELSCOPE=True
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompts 20 --request-rate 0 --save-result --result-dir ./
```
@@ -72,7 +72,7 @@ Run the following script to execute online 128k inference.
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# To reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
@@ -166,7 +166,7 @@ There are three `vllm bench` subcommands:
Take the `serve` subcommand as an example. Run the code as follows.

```shell
export VLLM_USE_MODELSCOPE=True
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompts 1 --request-rate 1 --save-result --result-dir ./
```
@@ -96,7 +96,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
--served_model_name qwen --dtype float16 \
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
--quantization ascend --max-model-len 16384
# `--load_format` is required only for the W8A8SC quantized weight format.
#
```
@@ -134,7 +134,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
--enforce-eager \
--dtype float16 \
--quantization ascend \
--max-model-len 10240
```

Argument notes: `W8A8SC` quantized weights are tightly coupled to the TP size, so when running compression you must specify the same `--tensor-parallel-size` you plan to use at serving time. `--model` is the path to the input `w8a8s` weights, and `--output` is the output path for the compressed `w8a8sc` weights.
@@ -159,7 +159,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
--quantization ascend \
--max-model-len 16384 \
--no-enable-prefix-caching \
--load_format="sharded_state"
```
@@ -178,7 +178,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
--quantization ascend \
--max-model-len 16384 \
--no-enable-prefix-caching \
--load_format="sharded_state"
```
@@ -199,7 +199,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
--quantization ascend \
--max-model-len 20480 \
--no-enable-prefix-caching \
--load_format="sharded_state"
```
@@ -302,7 +302,7 @@ There are three `vllm bench` subcommands:
Take the `serve` subcommand as an example. Run the code as follows.

```shell
export VLLM_USE_MODELSCOPE=True
vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
@@ -943,7 +943,7 @@ There are three `vllm bench` subcommands:
|
|||||||
Take the `serve` as an example. Run the code as follows.
|
Take the `serve` as an example. Run the code as follows.
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
export VLLM_USE_MODELSCOPE=true
|
export VLLM_USE_MODELSCOPE=True
|
||||||
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
|
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -93,7 +93,7 @@ vllm serve /root/.cache/DeepSeek-OCR-2 \
|
|||||||
--trust-remote-code \
|
--trust-remote-code \
|
||||||
--tensor-parallel-size 1 \
|
--tensor-parallel-size 1 \
|
||||||
--port 1055 \
|
--port 1055 \
|
||||||
--max_model_len 8192 \
|
--max-model-len 8192 \
|
||||||
--no-enable-prefix-caching \
|
--no-enable-prefix-caching \
|
||||||
--gpu-memory-utilization 0.8 \
|
--gpu-memory-utilization 0.8 \
|
||||||
--allowed-local-media-path / \
|
--allowed-local-media-path / \
|
||||||
|
|||||||
@@ -784,7 +784,7 @@ There are three `vllm bench` subcommands:
|
|||||||
Take the `serve` as an example. Run the code as follows.
|
Take the `serve` as an example. Run the code as follows.
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
export VLLM_USE_MODELSCOPE=true
|
export VLLM_USE_MODELSCOPE=True
|
||||||
vllm bench serve --model Eco-Tech/Kimi-K2.5-w4a8 --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
|
vllm bench serve --model Eco-Tech/Kimi-K2.5-w4a8 --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -72,7 +72,7 @@ Run the following script to start the vLLM server on single 910B4:
 
 ```shell
 #!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
 export TASK_QUEUE_ENABLE=1
 export CPU_AFFINITY_CONF=1
@@ -97,11 +97,11 @@ Run the following script to start the vLLM server on single Atlas 300 inference
 
 ```shell
 #!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
 
 vllm serve ${MODEL_PATH} \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
     --served-model-name PaddleOCR-VL-0.9B \
     --trust-remote-code \
     --no-enable-prefix-caching \
@@ -112,7 +112,7 @@ vllm serve ${MODEL_PATH} \
 ```
 
 :::{note}
-The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
+The `--max-model-len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
 :::
 
 ::::
@@ -323,12 +323,12 @@ Run docker container to start the vLLM server on single-NPU:
 :substitutions:
 vllm serve Qwen/Qwen3-VL-8B-Instruct \
     --dtype bfloat16 \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
     --max-num-batched-tokens 16384
 ```
 
 :::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
+Add `--max-model-len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
 :::
 
 If your service start successfully, you can see the info shown below:
@@ -415,7 +415,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
 ```
 
 :::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
+Add `--max-model-len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
 :::
 
 If your service start successfully, you can see the info shown below:
@@ -74,7 +74,7 @@ The environment variable `LOCAL_MEDIA_PATH` which allows API requests to read lo
 :::
 
 ```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 export MODEL_PATH="Qwen/Qwen2.5-Omni-7B"
 export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
 
@@ -104,7 +104,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
 #### Multiple NPU (Qwen2.5-Omni-7B)
 
 ```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 export MODEL_PATH=Qwen/Qwen2.5-Omni-7B
 export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
 export DP_SIZE=8
@@ -95,7 +95,7 @@ Run the following script to execute online 128k inference.
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export HCCL_BUFFSIZE=512
@@ -157,7 +157,7 @@ Node 0
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 # this obtained through ifconfig
@@ -199,7 +199,7 @@ Node1
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 # this obtained through ifconfig
@@ -309,7 +309,7 @@ There are three `vllm bench` subcommands:
 Take the `serve` as an example. Run the code as follows.
 
 ```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```
 
@@ -335,7 +335,7 @@ Example server scripts:
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export HCCL_BUFFSIZE=512
@@ -408,7 +408,7 @@ export TP_SOCKET_IFNAME=${ifname}
 export HCCL_SOCKET_IFNAME=${ifname}
 
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export HCCL_BUFFSIZE=512
@@ -470,7 +470,7 @@ export TP_SOCKET_IFNAME=${ifname}
 export HCCL_SOCKET_IFNAME=${ifname}
 
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export HCCL_BUFFSIZE=1024
@@ -534,7 +534,7 @@ export TP_SOCKET_IFNAME=${ifname}
 export HCCL_SOCKET_IFNAME=${ifname}
 
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export HCCL_BUFFSIZE=1024
@@ -93,7 +93,7 @@ The converted model files look like:
 Run the following script to start the vLLM server with the quantized model:
 
 ```bash
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 export MODEL_PATH=vllm-ascend/Qwen3-8B-W4A8
 vllm serve ${MODEL_PATH} --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
 ```
@@ -64,7 +64,7 @@ For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at
 
 ```shell
 #!/bin/sh
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 
 vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --served-model-name qwen3-coder --tensor-parallel-size 4 --enable_expert_parallel
 ```
@@ -163,7 +163,7 @@ There are three `vllm bench` subcommands:
 Take the `serve` as an example. Run the code as follows.
 
 ```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```
 
@@ -94,7 +94,7 @@ Node 0
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 # this obtained through ifconfig
@@ -137,7 +137,7 @@ Node1
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 # this is obtained through ifconfig
@@ -269,7 +269,7 @@ There are three `vllm bench` subcommands:
 Take the `serve` as an example. Run the code as follows.
 
 ```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```
 
@@ -94,7 +94,7 @@ Run the following script to execute online 128k inference.
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export HCCL_BUFFSIZE=512
@@ -190,7 +190,7 @@ There are three `vllm bench` subcommands:
 Take the `serve` as an example. Run the code as follows.
 
 ```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 vllm bench serve --model Eco-Tech/Qwen3.5-27B-w8a8-mtp --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```
 
@@ -94,7 +94,7 @@ Run the following script to execute online 128k inference On 1 Atlas 800 A3(64G*
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export HCCL_OP_EXPANSION_MODE="AIV"
@@ -157,7 +157,7 @@ Node 0
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 # this obtained through ifconfig
@@ -203,7 +203,7 @@ Node1
 ```shell
 #!/bin/sh
 # Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 # To reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 # this obtained through ifconfig
@@ -595,7 +595,7 @@ There are three `vllm bench` subcommands:
 Take the `serve` as an example. Run the code as follows.
 
 ```shell
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 vllm bench serve --model Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```
 
@@ -38,9 +38,9 @@ So far, dynamic batch performs better on several dense models including Qwen and
 Dynamic batch is used in the online inference. A fully executable example is as follows:
 
 ```shell
-SLO_LITMIT=50
+SLO_LIMIT=50
 vllm serve Qwen/Qwen2.5-14B-Instruct\
-    --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LITMIT}'}' \
+    --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LIMIT}'}' \
     --max-num-seqs 256 \
     --block-size 128 \
     --tensor_parallel_size 8 \
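Since the `--additional_config` value is built by shell string interpolation, the resulting JSON can be checked before the server is ever started. A small sketch using the corrected variable name:

```shell
# Sketch: build the additional_config JSON exactly as the corrected example does.
SLO_LIMIT=50
CFG='{"SLO_limits_for_dynamic_batch":'${SLO_LIMIT}'}'
# The single-quoted fragments and the unquoted variable concatenate into one JSON object.
echo "${CFG}"
```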
@@ -54,25 +54,25 @@ To enable Netloader, pass `--load-format=netloader` and provide configuration vi
 ### Server
 
 ```shell
-VLLM_SLEEP_WHEN_IDLE=1 vllm serve `<model_file>` \
+VLLM_SLEEP_WHEN_IDLE=1 vllm serve <model_file> \
     --tensor-parallel-size 1 \
-    --served-model-name `<model_name>` \
+    --served-model-name <model_name> \
     --enforce-eager \
-    --port `<port>` \
+    --port <port> \
     --load-format netloader
 ```
 
 ### Client
 
 ```shell
-export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["`<server_IP>`:`<server_Port>`"]}]}'
+export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["<server_IP>:<server_Port>"]}]}'
 
-VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=`<device_id_diff_from_server>` \
-vllm serve `<model_file>` \
+VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=<device_id_diff_from_server> \
+vllm serve <model_file> \
     --tensor-parallel-size 1 \
-    --served-model-name `<model_name>` \
+    --served-model-name <model_name> \
     --enforce-eager \
-    --port `<client_port>` \
+    --port <client_port> \
     --load-format netloader \
     --model-loader-extra-config="${NETLOADER_CONFIG}"
 ```
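With the stray backticks removed, `NETLOADER_CONFIG` is plain JSON and can be validated before launch. A hedged sketch (the `10.0.0.1:7000` endpoint below is a placeholder, not a real server):

```shell
# Sketch: validate the de-backticked NETLOADER_CONFIG with the stdlib JSON parser.
# 10.0.0.1:7000 is a placeholder endpoint.
export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["10.0.0.1:7000"]}]}'
# Parse the config from the environment and print the first device id.
python3 -c 'import json, os; cfg = json.loads(os.environ["NETLOADER_CONFIG"]); print(cfg["SOURCE"][0]["device_id"])'
```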
@@ -80,7 +80,7 @@ A simple planner implementation is provided at [`rfork_planner.py`](../../../../
 ```shell
 python rfork_planner.py \
     --host 0.0.0.0 \
-    --port `<planner_port>`
+    --port <planner_port>
 ```
 
 ### 3. Start vLLM Instances
@@ -93,15 +93,15 @@ For later instances, if the planner can allocate a compatible seed, RFork will t
 
 ```shell
 export RFORK_CONFIG='{
-  "model_url": "`<model_url>`",
-  "model_deploy_strategy_name": "`<deploy_strategy>`",
-  "rfork_scheduler_url": "http://`<planner_ip>`:`<planner_port>`"
+  "model_url": "<model_url>",
+  "model_deploy_strategy_name": "<deploy_strategy>",
+  "rfork_scheduler_url": "http://<planner_ip>:<planner_port>"
 }'
 
-vllm serve `<model_path>` \
+vllm serve <model_path> \
     --tensor-parallel-size 1 \
-    --served-model-name `<served_model_name>` \
-    --port `<port>` \
+    --served-model-name <served_model_name> \
+    --port <port> \
     --load-format rfork \
     --model-loader-extra-config "${RFORK_CONFIG}"
 ```
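The same pre-flight check works for `RFORK_CONFIG` once the backticks are gone. A sketch with placeholder URLs (none of the values below refer to a real deployment):

```shell
# Sketch: check that RFORK_CONFIG is valid JSON after removing the backticks.
# All URLs and names below are placeholders.
export RFORK_CONFIG='{
  "model_url": "file:///tmp/model",
  "model_deploy_strategy_name": "default",
  "rfork_scheduler_url": "http://127.0.0.1:8000"
}'
# Parse the config and print its keys in sorted order.
python3 -c 'import json, os; cfg = json.loads(os.environ["RFORK_CONFIG"]); print(sorted(cfg))'
```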
@@ -381,7 +381,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
 - Fix multimodal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
 - `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
 - SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
-- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
+- support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
 
 ### Dependencies
 
@@ -7,7 +7,7 @@ export HCCL_SOCKET_IFNAME="eth0"
 export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=10
 
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 
 export ASCEND_LAUNCH_BLOCKING=0
 
@@ -80,7 +80,7 @@ async def test_qwen3_moe_w8a8_distributed_tp2_ep_dynamic_eplb():
     port = get_open_port()
     compilation_config = json.dumps({"cudagraph_capture_sizes": [8]})
     server_args = [
-        "--max_model_len",
+        "--max-model-len",
         "8192",
         "--tensor_parallel_size",
         "2",
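The rename only changes the spelling on the command line; argument parsers map dashes in flag names to underscores in the parsed attribute, so downstream code is unaffected. A minimal sketch, independent of vLLM, using stdlib argparse:

```shell
# Sketch, independent of vLLM: argparse turns dashes in a flag name
# into underscores in the parsed attribute name.
python3 -c 'import argparse
p = argparse.ArgumentParser()
p.add_argument("--max-model-len", type=int)
print(p.parse_args(["--max-model-len", "8192"]).max_model_len)'
```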
@@ -239,7 +239,7 @@ test_cases:
     <<: *envs
     server_cmd: *server_cmd
     benchmarks:
-      <<: *benchmarks_acc
+      <<: *benchmarks
 ```
 
 #### EPD / Disaggregated Case
@@ -21,7 +21,7 @@ set -eo errexit
 
 . $(dirname "$0")/common.sh
 
-export VLLM_USE_MODELSCOPE=true
+export VLLM_USE_MODELSCOPE=True
 export MODELSCOPE_HUB_FILE_LOCK=false
 export HF_HUB_OFFLINE=1
 