The GLM-4.x series models use a Mixture-of-Experts (MoE) architecture and are foundation models designed for agent applications.
The `GLM-4.5` model is first supported in `vllm-ascend:v0.10.0rc1`.
This document describes the main verification steps for these models, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, and accuracy and performance evaluation.
## Supported Features
Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
Refer to [feature guide](../user_guide/feature_guide/index.md) to get the feature's configuration.
- `GLM-4.5` (BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5).
- `GLM-4.6` (BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6).
- `GLM-4.7` (BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7).
- `GLM-4.5-w8a8-with-float-mtp` (quantized version with MTP): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8).
- `GLM-4.6-w8a8` (quantized version without MTP): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). vLLM did not support the GLM-4.6 MTP module when these weights were released, so no MTP version is provided. MTP support has since been added, and you can use the quantization scheme below to add MTP weights to the quantized weights.
- Quantization method: [quantization scheme](https://blog.csdn.net/qq_37368095/article/details/156429653?spm=1011.2124.3001.6209). You can follow these steps to quantize the model yourself.
It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.
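For example, the BF16 weights can be fetched with the ModelScope CLI (installed with the `modelscope` Python package); the target directory here is an assumption matching the shared-cache recommendation above:

```shell
# Requires: pip install modelscope
# Download GLM-4.5 BF16 weights into the shared cache directory.
modelscope download --model ZhipuAI/GLM-4.5 --local_dir /root/.cache/GLM-4.5
```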
### Installation
You can use our official Docker image to run `GLM-4.x` directly.
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
```{code-block} bash
:substitutions:
# Update --device according to your device (Atlas A2: /dev/davinci[0-7], Atlas A3: /dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment.
# Note: you should download the weight to /root/.cache in advance.
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm -it --name vllm-ascend --net=host \
  --device /dev/davinci0 \
  --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  $IMAGE bash
```
- For single-node deployment, we recommend using `dp1tp16` and turning off expert parallelism in low-latency scenarios.
- `--async-scheduling`: asynchronous scheduling optimizes inference efficiency by allowing non-blocking task scheduling, which improves concurrency and throughput, especially when serving large-scale models.
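As a sketch of the recommendations above, a hypothetical single-node launch on a 16-NPU node might look like the following; the model path and served model name are assumptions, and expert parallelism stays off simply because `--enable-expert-parallel` is not passed:

```shell
# dp1tp16: data parallel 1, tensor parallel 16 across the node's NPUs.
vllm serve /root/.cache/GLM-4.5 \
  --tensor-parallel-size 16 \
  --data-parallel-size 1 \
  --async-scheduling \
  --served-model-name glm-4.5
```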
### Multi-node Deployment
Multi-node deployment is not recommended on Atlas 800 A2 (64GB * 8).
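If you do deploy across nodes (for example, two Atlas A3 nodes), vLLM's multi-node data-parallel flags can be used. The sketch below is an assumption, not a verified configuration: the IP address, port, and parallel sizes are placeholders you must adapt to your cluster.

```shell
# Node 0 (runs the API server, DP rank 0); 192.0.2.10 is a placeholder head-node IP.
vllm serve /root/.cache/GLM-4.5 \
  --tensor-parallel-size 16 --data-parallel-size 2 --data-parallel-size-local 1 \
  --data-parallel-address 192.0.2.10 --data-parallel-rpc-port 13389

# Node 1 (headless worker hosting DP rank 1).
vllm serve /root/.cache/GLM-4.5 \
  --tensor-parallel-size 16 --data-parallel-size 2 --data-parallel-size-local 1 \
  --data-parallel-start-rank 1 --headless \
  --data-parallel-address 192.0.2.10 --data-parallel-rpc-port 13389
```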