[Doc][Misc][v0.18.0] Add GLM5 to supported model list and update deployment document for GLM5 (#7963)
### What this PR does / why we need it?
1. Add version notes for GLM5.
2. Add parameter modifications for GLM5.
3. Add GLM5 to supported model list.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main: 35141a7eed
---------
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>
Co-authored-by: Zhu Jiyang <zhujiyang2@huawei.com>
@@ -4,7 +4,7 @@

 [GLM-5](https://huggingface.co/zai-org/GLM-5) use a Mixture-of-Experts (MoE) architecture and targeting at complex systems engineering and long-horizon agentic tasks.

-The `GLM-5` model is first supported in `vllm-ascend:v0.17.0rc1`, and the version of transformers need to be upgraded to 5.2.0.
+The `GLM-5` model is first supported in `vllm-ascend:v0.17.0rc1`. In `vllm-ascend:v0.17.0rc1` and `vllm-ascend:v0.18.0rc1`, the version of transformers needs to be upgraded to 5.2.0.

 This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
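The note above pins transformers at 5.2.0 for these releases. A minimal upgrade sketch, assuming a pip-managed Python environment inside the vllm-ascend container:

```shell
# Assumption: pip manages the Python environment in the container.
pip install "transformers==5.2.0"
# Verify the pin took effect; expect 5.2.0.
python -c "import transformers; print(transformers.__version__)"
```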
@@ -154,7 +154,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8 \
     --seed 1024 \
     --served-model-name glm-5 \
     --max-num-seqs 8 \
-    --max-model-len 66600 \
+    --max-model-len 200000 \
     --max-num-batched-tokens 4096 \
     --trust-remote-code \
     --gpu-memory-utilization 0.95 \
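Once the server is up, a quick smoke test against vLLM's OpenAI-compatible endpoint (port 8000 is an assumption here, vLLM's default, since the hunk does not show a `--port` flag; the model name matches `--served-model-name`):

```shell
# Assumption: the server listens on the default port 8000 on this host.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-5", "prompt": "Hello, GLM-5!", "max_tokens": 16}'
```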
@@ -563,7 +563,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w8a8 \
     --served-model-name glm-5 \
     --enable-expert-parallel \
     --max-num-seqs 16 \
-    --max-model-len 65536 \
+    --max-model-len 200000 \
     --max-num-batched-tokens 4096 \
     --trust-remote-code \
     --gpu-memory-utilization 0.95 \
@@ -615,7 +615,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w8a8 \
     --served-model-name glm-5 \
     --enable-expert-parallel \
     --max-num-seqs 16 \
-    --max-model-len 65536 \
+    --max-model-len 200000 \
     --max-num-batched-tokens 4096 \
     --trust-remote-code \
     --gpu-memory-utilization 0.95 \
@@ -742,6 +742,7 @@ Before you start, please

 2. prepare the script `run_dp_template.sh` on each node.

+   To support a 200k context window at the prefill stage, the parameter `"layer_sharding": ["q_b_proj"]` needs to be added to `--additional-config` on each prefill node.
 1. Prefill node 0

    ```shell
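For illustration, a minimal sketch of what a prefill node's `--additional-config` might look like with `layer_sharding` merged in; the other keys are taken from the `--additional-config` values shown in the hunks below, and the exact key set on your nodes may differ:

```shell
# Hypothetical merged flag for a prefill node. "layer_sharding" comes from the
# note above; the remaining keys mirror the --additional-config shown below.
--additional-config '{"layer_sharding": ["q_b_proj"], "fuse_muls_add": true, "multistream_overlap_shared_expert": true, "recompute_scheduler_enable": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
```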
@@ -789,10 +790,12 @@ Before you start, please
     --seed 1024 \
     --served-model-name glm-5 \
     --max-model-len 131072 \
-    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
+    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "recompute_scheduler_enable": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
     --max-num-batched-tokens 4096 \
     --trust-remote-code \
     --max-num-seqs 64 \
+    --async-scheduling \
+    --enable-chunked-prefill \
     --quantization ascend \
     --gpu-memory-utilization 0.95 \
     --enforce-eager \
@@ -807,8 +810,8 @@ Before you start, please
     "kv_connector_extra_config": {
         "use_ascend_direct": true,
         "prefill": {
-            "dp_size": 4,
-            "tp_size": 8
+            "dp_size": 2,
+            "tp_size": 16
         },
         "decode": {
             "dp_size": 16,
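Note that the prefill layout keeps the same total rank count: 4 × 8 = 32 before and 2 × 16 = 32 after, so the change widens tensor parallelism per prefill replica rather than adding hardware, presumably to fit the 200k context window mentioned above.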
@@ -868,10 +871,12 @@ Before you start, please
     --seed 1024 \
     --served-model-name glm-5 \
     --max-model-len 131072 \
-    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
+    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "recompute_scheduler_enable": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
     --max-num-batched-tokens 4096 \
     --trust-remote-code \
     --max-num-seqs 64 \
+    --async-scheduling \
+    --enable-chunked-prefill \
     --gpu-memory-utilization 0.95 \
     --quantization ascend \
     --enforce-eager \
@@ -886,8 +891,8 @@ Before you start, please
     "kv_connector_extra_config": {
         "use_ascend_direct": true,
         "prefill": {
-            "dp_size": 4,
-            "tp_size": 8
+            "dp_size": 2,
+            "tp_size": 16
         },
         "decode": {
             "dp_size": 16,
@@ -951,7 +956,7 @@ Before you start, please
     --max-model-len 200000 \
     --max-num-batched-tokens 32 \
     --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \
-    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
+    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "recompute_scheduler_enable": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
     --trust-remote-code \
     --max-num-seqs 8 \
     --gpu-memory-utilization 0.92 \
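Taken together with the prefill hunks above: decode nodes keep a small token budget (`--max-num-batched-tokens 32`) and rely on full-graph capture (`cudagraph_mode: FULL_DECODE_ONLY`), while prefill nodes run `--enforce-eager` with a 4096-token budget; the `recompute_scheduler_enable` option is added uniformly to both roles.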
@@ -968,8 +973,8 @@ Before you start, please
     "kv_connector_extra_config": {
         "use_ascend_direct": true,
         "prefill": {
-            "dp_size": 4,
-            "tp_size": 8
+            "dp_size": 2,
+            "tp_size": 16
         },
         "decode": {
             "dp_size": 16,
@@ -1032,7 +1037,7 @@ Before you start, please
     --max-model-len 200000 \
     --max-num-batched-tokens 32 \
     --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \
-    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
+    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "recompute_scheduler_enable": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
     --trust-remote-code \
     --max-num-seqs 8 \
     --gpu-memory-utilization 0.92 \
@@ -1049,8 +1054,8 @@ Before you start, please
     "kv_connector_extra_config": {
         "use_ascend_direct": true,
         "prefill": {
-            "dp_size": 4,
-            "tp_size": 8
+            "dp_size": 2,
+            "tp_size": 16
         },
         "decode": {
             "dp_size": 16,
@@ -1113,7 +1118,7 @@ Before you start, please
     --max-model-len 200000 \
     --max-num-batched-tokens 32 \
     --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \
-    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
+    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "recompute_scheduler_enable": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
     --trust-remote-code \
     --max-num-seqs 8 \
     --gpu-memory-utilization 0.92 \
@@ -1130,8 +1135,8 @@ Before you start, please
     "kv_connector_extra_config": {
         "use_ascend_direct": true,
         "prefill": {
-            "dp_size": 4,
-            "tp_size": 8
+            "dp_size": 2,
+            "tp_size": 16
         },
         "decode": {
             "dp_size": 16,
@@ -1194,7 +1199,7 @@ Before you start, please
     --max-model-len 200000 \
     --max-num-batched-tokens 32 \
     --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \
-    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
+    --additional-config '{"fuse_muls_add": true, "multistream_overlap_shared_expert": true, "recompute_scheduler_enable": true, "ascend_compilation_config": {"enable_npugraph_ex": true}}' \
     --trust-remote-code \
     --max-num-seqs 8 \
     --gpu-memory-utilization 0.92 \
@@ -1211,8 +1216,8 @@ Before you start, please
     "kv_connector_extra_config": {
         "use_ascend_direct": true,
         "prefill": {
-            "dp_size": 4,
-            "tp_size": 8
+            "dp_size": 2,
+            "tp_size": 16
         },
         "decode": {
             "dp_size": 16,
@@ -1228,14 +1233,14 @@ Once the preparation is done, you can start the server with the following command

    ```shell
    # change ip to your own
-   python launch_online_dp.py --dp-size 4 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address $node_p0_ip --dp-rpc-port 10521 --vllm-start-port 6700
+   python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address $node_p0_ip --dp-rpc-port 10521 --vllm-start-port 6700
    ```

 2. Prefill node 1

    ```shell
    # change ip to your own
-   python launch_online_dp.py --dp-size 4 --tp-size 8 --dp-size-local 2 --dp-rank-start 2 --dp-address $node_p0_ip --dp-rpc-port 10521 --vllm-start-port 6700
+   python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address $node_p0_ip --dp-rpc-port 10521 --vllm-start-port 6700
    ```

 3. Decode node 0
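With `--dp-size 2 --tp-size 16 --dp-size-local 1`, each of the two prefill nodes hosts exactly one data-parallel rank spanning 16 NPUs, which is why the two launch commands differ only in `--dp-rank-start` (0 and 1); under the old `--dp-size 4 --dp-size-local 2` layout each node hosted two ranks, hence the previous offsets 0 and 2.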
@@ -1283,8 +1288,8 @@ python load_balance_proxy_server_example.py \
        $node_p1_ip \
        $node_p1_ip \
     --prefiller-ports \
-        6700 6701 \
-        6700 6701 \
+        6700 \
+        6700 \
     --decoder-hosts \
        $node_d0_ip \
        $node_d0_ip \
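The port list shrinks accordingly: with `--dp-size-local 1`, each prefill node now serves a single vLLM instance on port 6700, so the proxy registers one port per prefiller host instead of the earlier pair 6700 6701.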
@@ -26,6 +26,7 @@ Get the latest info here: <https://github.com/vllm-project/vllm-ascend/issues/16
 | Qwen3-Next | 🔵 | | ✅ | A2/A3 | ✅ |||||| ✅ ||| ✅ || ✅ | ✅ ||| [Qwen3-Next](../../tutorials/models/Qwen3-Next.md) |
 | Qwen2.5 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ |||| ✅ ||| ✅ |||||| [Qwen2.5-7B](../../tutorials/models/Qwen2.5-7B.md) |
 | GLM-4.x | 🔵 | || A2/A3 |✅|✅|✅||✅|✅|✅||✅|✅|✅|✅|✅|198k||[GLM-4.x](../../tutorials/models/GLM4.x.md)|
+| GLM-5 | 🔵 | | ✅ | A2/A3 | ✅ | ✅ | ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 198k || [GLM-5](../../tutorials/models/GLM5.md) |
 | Kimi-K2-Thinking | 🔵 | || A2/A3 |||||||||||||||| [Kimi-K2-Thinking](../../tutorials/models/Kimi-K2-Thinking.md) |

 #### Extended Compatible Models