xc-llm-ascend/docs/source/tutorials/models/GLM5.md

# GLM-5

## Introduction

[GLM-5](https://huggingface.co/zai-org/GLM-5)use a Mixture-of-Experts (MoE) architecture and targeting at complex systems engineering and long-horizon agentic tasks.

This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.

## Supported Features

Refer to [supported features](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_models.html)to get the model's supported feature matrix.

Refer to [feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html) to get the feature's configuration.

## Environment Preparation

### Model Weight

- `GLM-5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
- `GLM-5-w4a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Eco-Tech/GLM-5-w4a8).
- You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantify the model naively.

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`

### Installation

vLLM and vLLM-ascend only support GLM-5 on our main branches. you can use our official docker images and upgrade vllm and vllm-ascend for inference.

:::::{tab-set}
:sync-group: install

::::{tab-item} A3 series
:sync: A3

Start the docker image on your each node.

```{code-block} bash
   :substitutions:

# Update --device according to your device (Atlas A3:/dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment.
# Note you should download the weight to /root/.cache in advance.
# Update the vllm-ascend image, glm5-a3 can be replaced by: glm5;glm5-openeuler;glm5-a3-openeuler
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:glm5-a3
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

::::
::::{tab-item} A2 series
:sync: A2

Start the docker image on your each node.

```{code-block} bash
   :substitutions:

export IMAGE=quay.io/ascend/vllm-ascend:glm5
docker run --rm \
    --name vllm-ascend \
    --shm-size=1g \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
```

::::
:::::

In addition, if you don't want to use the docker image as above, you can also build all from source:

- Install `vllm-ascend` from source, refer to [installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).

To inference `GLM-5`, you should upgrade vllm、vllm-ascend、transformers to main branches:

```shell
# upgrade vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 978a37c82387ce4a40aaadddcdbaf4a06fc4d590
VLLM_TARGET_DEVICE=empty pip install -v .

# upgrade vllm-ascend
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git checkout ff3a50d011dcbea08f87ebed69ff1bf156dbb01e
git submodule update --init --recursive
pip install -v .

# reinstall transformers
pip install git+https://github.com/huggingface/transformers.git
```

If you want to deploy multi-node environment, you need to set up environment on each node.

## Deployment

### Single-node Deployment

:::::{tab-set}
:sync-group: install

::::{tab-item} A3 series
:sync: A3

- Quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A3 (64G × 16) .

Run the following script to execute online inference.

```{code-block} bash
   :substitutions:
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_BALANCE_SCHEDULING=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 1 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name glm-5 \
--max-num-seqs 8 \
--max-model-len 66600 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--quantization ascend \
--enable-chunked-prefill \
--enable-prefix-caching \
--async-scheduling \
--additional-config '{"multistream_overlap_shared_expert":true}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' 
```

::::
::::{tab-item} A2 series
:sync: A2

- Quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A2 (64G × 8) .

Run the following script to execute online inference.

```{code-block} bash
   :substitutions:
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_BALANCE_SCHEDULING=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 1 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name glm-5 \
--max-num-seqs 2 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--quantization ascend \
--enable-chunked-prefill \
--enable-prefix-caching \
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"multistream_overlap_shared_expert":true}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

::::
:::::

**Notice:**
The parameters are explained as follows:

- For single-node deployment, we recommend using `dp1tp16` and turn off expert parallel in low-latency scenarios.
- `--async-scheduling` Asynchronous scheduling is a technique used to optimize inference efficiency. It allows non-blocking task scheduling to improve concurrency and throughput, especially when processing large-scale models.

### Multi-node Deployment

:::::{tab-set}
:sync-group: install

::::{tab-item} A3 series
:sync: A3

- `glm-5-bf16`: require at least 2 Atlas 800 A3 (64G × 16).

Run the following scripts on two nodes respectively.

**node 0**

```{code-block} bash
   :substitutions:
# this obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

export HCCL_OP_EXPANSION_MODE="AIV"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name glm-5 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

**node 1**

```{code-block} bash
   :substitutions:
# this obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

export HCCL_OP_EXPANSION_MODE="AIV"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name glm-5 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

::::
::::{tab-item} A2 series
:sync: A2

Run the following scripts on two nodes respectively.

**node 0**

```{code-block} bash
   :substitutions:
# this obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxx"

export HCCL_OP_EXPANSION_MODE="AIV"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name glm-5 \
--enable-expert-parallel \
--max-num-seqs 2 \
--max-model-len 131072 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"multistream_overlap_shared_expert":true}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

**node 1**

```{code-block} bash
   :substitutions:
# this obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxx"

export HCCL_OP_EXPANSION_MODE="AIV"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name glm-5 \
--enable-expert-parallel \
--max-num-seqs 2 \
--max-model-len 131072 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"multistream_overlap_shared_expert":true}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

::::
:::::

### Prefill-Decode Disaggregation

Not test yet.

## Accuracy Evaluation

Here are two accuracy evaluation methods.

### Using AISBench

1. Refer to [Using AISBench](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html) for details.

2. After execution, you can get the result.

### Using Language Model Evaluation Harness

Not test yet.

## Performance

### Using AISBench

Refer to [Using AISBench for performance evaluation](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html#execute-performance-evaluation) for details.

### Using vLLM Benchmark

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								# GLM-5
 								## Introduction
 								[GLM-5](https://huggingface.co/zai-org/GLM-5)use a Mixture-of-Experts (MoE) architecture and targeting at complex systems engineering and long-horizon agentic tasks.
 								This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
 								## Supported Features
 								Refer to [supported features](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_models.html)to get the model's supported feature matrix.
 								Refer to [feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html) to get the feature's configuration.
 								## Environment Preparation
 								### Model Weight
 								- `GLM-5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
 								- `GLM-5-w4a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Eco-Tech/GLM-5-w4a8).
 								- You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantify the model naively.
 								It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
 								### Installation
 								vLLM and vLLM-ascend only support GLM-5 on our main branches. you can use our official docker images and upgrade vllm and vllm-ascend for inference.
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								:::::{tab-set}
 								:sync-group: install
 								::::{tab-item} A3 series
 								:sync: A3
 								Start the docker image on your each node.
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								```{code-block} bash
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								   :substitutions:
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								# Update --device according to your device (Atlas A3:/dev/davinci[0-15]).
 								# Update the vllm-ascend image according to your environment.
 								# Note you should download the weight to /root/.cache in advance.
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								# Update the vllm-ascend image, glm5-a3 can be replaced by: glm5;glm5-openeuler;glm5-a3-openeuler
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:glm5-a3
 								export NAME=vllm-ascend
 								# Run the container using the defined variables
 								# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
 								docker run --rm \
 								--name $NAME \
 								--net=host \
 								--shm-size=1g \
 								--device /dev/davinci0 \
 								--device /dev/davinci1 \
 								--device /dev/davinci2 \
 								--device /dev/davinci3 \
 								--device /dev/davinci4 \
 								--device /dev/davinci5 \
 								--device /dev/davinci6 \
 								--device /dev/davinci7 \
-												[Docs] Fix GLM-5 deploy command (#6711)

This pull request refines the GLM-5 deployment documentation by updating
the Docker run command to include a more comprehensive set of device
mappings and by removing an extraneous quantization flag from the `vllm
serve` commands. These changes aim to correct and clarify the deployment
instructions, ensuring users can successfully set up and run the GLM-5
model as intended.


- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: Canlin Guo <961750412@qq.com>
											
										
										
											2026-02-12 08:55:48 +08:00
+								--device /dev/davinci8 \
 								--device /dev/davinci9 \
 								--device /dev/davinci10 \
 								--device /dev/davinci11 \
 								--device /dev/davinci12 \
 								--device /dev/davinci13 \
 								--device /dev/davinci14 \
 								--device /dev/davinci15 \
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								--device /dev/davinci_manager \
 								--device /dev/devmm_svm \
 								--device /dev/hisi_hdc \
 								-v /usr/local/dcmi:/usr/local/dcmi \
 								-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
 								-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
 								-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
 								-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
 								-v /etc/ascend_install.info:/etc/ascend_install.info \
 								-v /root/.cache:/root/.cache \
 								-it $IMAGE bash
 								```
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								::::
 								::::{tab-item} A2 series
 								:sync: A2
 								Start the docker image on your each node.
 								```{code-block} bash
 								   :substitutions:
 								export IMAGE=quay.io/ascend/vllm-ascend:glm5
 								docker run --rm \
 								    --name vllm-ascend \
 								    --shm-size=1g \
 								    --net=host \
 								    --device /dev/davinci0 \
 								    --device /dev/davinci1 \
 								    --device /dev/davinci2 \
 								    --device /dev/davinci3 \
 								    --device /dev/davinci4 \
 								    --device /dev/davinci5 \
 								    --device /dev/davinci6 \
 								    --device /dev/davinci7 \
 								    --device /dev/davinci_manager \
 								    --device /dev/devmm_svm \
 								    --device /dev/hisi_hdc \
 								    -v /usr/local/dcmi:/usr/local/dcmi \
 								    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
 								    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
 								    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
 								    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
 								    -v /etc/ascend_install.info:/etc/ascend_install.info \
 								    -v /root/.cache:/root/.cache \
 								    -it $IMAGE bash
 								```
 								::::
 								:::::
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								In addition, if you don't want to use the docker image as above, you can also build all from source:
 								- Install `vllm-ascend` from source, refer to [installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
 								To inference `GLM-5`, you should upgrade vllm、vllm-ascend、transformers to main branches:
 								```shell
 								# upgrade vllm
 								git clone https://github.com/vllm-project/vllm.git
 								cd vllm
 								git checkout 978a37c82387ce4a40aaadddcdbaf4a06fc4d590
 								VLLM_TARGET_DEVICE=empty pip install -v .
 								# upgrade vllm-ascend
 								git clone https://github.com/vllm-project/vllm-ascend.git
 								cd vllm-ascend
 								git checkout ff3a50d011dcbea08f87ebed69ff1bf156dbb01e
 								git submodule update --init --recursive
 								pip install -v .
 								# reinstall transformers
 								pip install git+https://github.com/huggingface/transformers.git
 								```
 								If you want to deploy multi-node environment, you need to set up environment on each node.
 								## Deployment
 								### Single-node Deployment
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								:::::{tab-set}
 								:sync-group: install
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								::::{tab-item} A3 series
 								:sync: A3
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
 								- Quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A3 (64G × 16) .
 								Run the following script to execute online inference.
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								```{code-block} bash
 								   :substitutions:
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								export HCCL_OP_EXPANSION_MODE="AIV"
 								export OMP_PROC_BIND=false
 								export OMP_NUM_THREADS=10
 								export VLLM_USE_V1=1
 								export HCCL_BUFFSIZE=200
 								export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 								export VLLM_ASCEND_BALANCE_SCHEDULING=1
 								vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8 \
 								--host 0.0.0.0 \
 								--port 8077 \
 								--data-parallel-size 1 \
 								--tensor-parallel-size 16 \
 								--enable-expert-parallel \
 								--seed 1024 \
 								--served-model-name glm-5 \
 								--max-num-seqs 8 \
 								--max-model-len 66600 \
 								--max-num-batched-tokens 4096 \
 								--trust-remote-code \
 								--gpu-memory-utilization 0.95 \
 								--quantization ascend \
 								--enable-chunked-prefill \
 								--enable-prefix-caching \
 								--async-scheduling \
 								--additional-config '{"multistream_overlap_shared_expert":true}' \
 								--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
 								--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
 								```
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								::::
 								::::{tab-item} A2 series
 								:sync: A2
 								- Quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A2 (64G × 8) .
 								Run the following script to execute online inference.
 								```{code-block} bash
 								   :substitutions:
 								export HCCL_OP_EXPANSION_MODE="AIV"
 								export OMP_PROC_BIND=false
 								export OMP_NUM_THREADS=10
 								export VLLM_USE_V1=1
 								export HCCL_BUFFSIZE=200
 								export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 								export VLLM_ASCEND_BALANCE_SCHEDULING=1
 								vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
 								--host 0.0.0.0 \
 								--port 8077 \
 								--data-parallel-size 1 \
 								--tensor-parallel-size 8 \
 								--enable-expert-parallel \
 								--seed 1024 \
 								--served-model-name glm-5 \
 								--max-num-seqs 2 \
 								--max-model-len 32768 \
 								--max-num-batched-tokens 4096 \
 								--trust-remote-code \
 								--gpu-memory-utilization 0.95 \
 								--quantization ascend \
 								--enable-chunked-prefill \
 								--enable-prefix-caching \
 								--async-scheduling \
 								--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
 								--additional-config '{"multistream_overlap_shared_expert":true}' \
 								--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
 								```
 								::::
 								:::::
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								**Notice:**
 								The parameters are explained as follows:
 								- For single-node deployment, we recommend using `dp1tp16` and turn off expert parallel in low-latency scenarios.
 								- `--async-scheduling` Asynchronous scheduling is a technique used to optimize inference efficiency. It allows non-blocking task scheduling to improve concurrency and throughput, especially when processing large-scale models.
 								### Multi-node Deployment
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								:::::{tab-set}
 								:sync-group: install
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								::::{tab-item} A3 series
 								:sync: A3
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
 								- `glm-5-bf16`: require at least 2 Atlas 800 A3 (64G × 16).
 								Run the following scripts on two nodes respectively.
 								**node 0**
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								```{code-block} bash
 								   :substitutions:
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								# this obtained through ifconfig
 								# nic_name is the network interface name corresponding to local_ip of the current node
 								nic_name="xxx"
 								local_ip="xxx"
 								# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
 								node0_ip="xxxx"
 								export HCCL_OP_EXPANSION_MODE="AIV"
 								export HCCL_IF_IP=$local_ip
 								export GLOO_SOCKET_IFNAME=$nic_name
 								export TP_SOCKET_IFNAME=$nic_name
 								export HCCL_SOCKET_IFNAME=$nic_name
 								export OMP_PROC_BIND=false
 								export OMP_NUM_THREADS=10
 								export VLLM_USE_V1=1
 								export HCCL_BUFFSIZE=200
 								export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 								vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
 								--host 0.0.0.0 \
 								--port 8077 \
 								--data-parallel-size 2 \
 								--data-parallel-size-local 1 \
 								--data-parallel-address $node0_ip \
 								--data-parallel-rpc-port 12890 \
 								--tensor-parallel-size 16 \
 								--seed 1024 \
 								--served-model-name glm-5 \
 								--enable-expert-parallel \
 								--max-num-seqs 16 \
 								--max-model-len 8192 \
 								--max-num-batched-tokens 4096 \
 								--trust-remote-code \
 								--no-enable-prefix-caching \
 								--gpu-memory-utilization 0.95 \
 								--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
 								--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
 								```
 								**node 1**
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								```{code-block} bash
 								   :substitutions:
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								# this obtained through ifconfig
 								# nic_name is the network interface name corresponding to local_ip of the current node
 								nic_name="xxx"
 								local_ip="xxx"
 								# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
 								node0_ip="xxxx"
 								export HCCL_OP_EXPANSION_MODE="AIV"
 								export HCCL_IF_IP=$local_ip
 								export GLOO_SOCKET_IFNAME=$nic_name
 								export TP_SOCKET_IFNAME=$nic_name
 								export HCCL_SOCKET_IFNAME=$nic_name
 								export OMP_PROC_BIND=false
 								export OMP_NUM_THREADS=10
 								export VLLM_USE_V1=1
 								export HCCL_BUFFSIZE=200
 								export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 								vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
 								--host 0.0.0.0 \
 								--port 8077 \
 								--headless \
 								--data-parallel-size 2 \
 								--data-parallel-size-local 1 \
 								--data-parallel-start-rank 1 \
 								--data-parallel-address $node0_ip \
 								--data-parallel-rpc-port 12890 \
 								--tensor-parallel-size 16 \
 								--seed 1024 \
 								--served-model-name glm-5 \
 								--enable-expert-parallel \
 								--max-num-seqs 16 \
 								--max-model-len 8192 \
 								--max-num-batched-tokens 4096 \
 								--trust-remote-code \
 								--no-enable-prefix-caching \
 								--gpu-memory-utilization 0.95 \
 								--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
 								--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
 								```
-												[doc] add A2 series doc for GLM5.md (#6717)

### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007
											
										
										
											2026-02-12 16:08:17 +08:00
+								::::
 								::::{tab-item} A2 series
 								:sync: A2
 								Run the following scripts on two nodes respectively.
 								**node 0**
 								```{code-block} bash
 								   :substitutions:
 								# this obtained through ifconfig
 								# nic_name is the network interface name corresponding to local_ip of the current node
 								nic_name="xxx"
 								local_ip="xxx"
 								# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
 								node0_ip="xxx"
 								export HCCL_OP_EXPANSION_MODE="AIV"
 								export HCCL_IF_IP=$local_ip
 								export GLOO_SOCKET_IFNAME=$nic_name
 								export TP_SOCKET_IFNAME=$nic_name
 								export HCCL_SOCKET_IFNAME=$nic_name
 								export OMP_PROC_BIND=false
 								export OMP_NUM_THREADS=10
 								export VLLM_USE_V1=1
 								export HCCL_BUFFSIZE=200
 								export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 								vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
 								--host 0.0.0.0 \
 								--port 8077 \
 								--data-parallel-size 2 \
 								--data-parallel-size-local 1 \
 								--data-parallel-address $node0_ip \
 								--data-parallel-rpc-port 13389 \
 								--tensor-parallel-size 8 \
 								--quantization ascend \
 								--seed 1024 \
 								--served-model-name glm-5 \
 								--enable-expert-parallel \
 								--max-num-seqs 2 \
 								--max-model-len 131072 \
 								--max-num-batched-tokens 4096 \
 								--trust-remote-code \
 								--no-enable-prefix-caching \
 								--gpu-memory-utilization 0.95 \
 								--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
 								--additional-config '{"multistream_overlap_shared_expert":true}' \
 								--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
 								```
 								**node 1**
 								```{code-block} bash
 								   :substitutions:
 								# this obtained through ifconfig
 								# nic_name is the network interface name corresponding to local_ip of the current node
 								nic_name="xxx"
 								local_ip="xxx"
 								# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
 								node0_ip="xxx"
 								export HCCL_OP_EXPANSION_MODE="AIV"
 								export HCCL_IF_IP=$local_ip
 								export GLOO_SOCKET_IFNAME=$nic_name
 								export TP_SOCKET_IFNAME=$nic_name
 								export HCCL_SOCKET_IFNAME=$nic_name
 								export OMP_PROC_BIND=false
 								export OMP_NUM_THREADS=10
 								export VLLM_USE_V1=1
 								export HCCL_BUFFSIZE=200
 								export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 								vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
 								--host 0.0.0.0 \
 								--port 8077 \
 								--headless \
 								--data-parallel-size 2 \
 								--data-parallel-size-local 1 \
 								--data-parallel-start-rank 1 \
 								--data-parallel-address $node0_ip \
 								--data-parallel-rpc-port 13389 \
 								--tensor-parallel-size 8 \
 								--quantization ascend \
 								--seed 1024 \
 								--served-model-name glm-5 \
 								--enable-expert-parallel \
 								--max-num-seqs 2 \
 								--max-model-len 131072 \
 								--max-num-batched-tokens 4096 \
 								--trust-remote-code \
 								--no-enable-prefix-caching \
 								--gpu-memory-utilization 0.95 \
 								--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
 								--additional-config '{"multistream_overlap_shared_expert":true}' \
 								--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
 								```
 								::::
 								:::::
-												[doc]add GLM5.md (#6709)

### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/9562912cead1f11e8540fb91306c5cbda66f0007

Signed-off-by: nakairika <982275964@qq.com>
											
										
										
											2026-02-12 04:00:40 +08:00
+								### Prefill-Decode Disaggregation
 								Not test yet.
 								## Accuracy Evaluation
 								Here are two accuracy evaluation methods.
 								### Using AISBench
 . Refer to [Using AISBench](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html) for details.
 . After execution, you can get the result.
 								### Using Language Model Evaluation Harness
 								Not test yet.
 								## Performance
 								### Using AISBench
 								Refer to [Using AISBench for performance evaluation](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html#execute-performance-evaluation) for details.
 								### Using vLLM Benchmark
 								Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.