[Doc] modify glm doc (#6770)

### What this PR does / why we need it?
1. Add a description of another version of the GLM-5-w4a8 weight.
2. Update the installation introduction.
3. Introduce a script to enable BF16 MTP.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
N/A
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: yydyzr <liuyuncong1@huawei.com>
This commit is contained in:
yydyzr
2026-02-14 16:47:23 +08:00
committed by GitHub
parent e2237819a9
commit 70e26551cf


@@ -17,14 +17,15 @@ Refer to [feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_gui
### Model Weight
- `GLM-5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
- `GLM-5-w4a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Eco-Tech/GLM-5-w4a8).
- `GLM-5-w4a8`(Quantized version without MTP quant): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8).
- `GLM-5-w4a8`(Quantized version with MTP quant): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8-mtp-QuaRot).
- You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantize the model yourself.
It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.
### Installation
vLLM and vLLM-Ascend support GLM-5 only on their main branches. You can use our official docker images and upgrade vllm and vllm-ascend for inference.
vLLM and vLLM-Ascend support GLM-5 only on their main branches. You can use our GLM-5 docker images for inference.
:::::{tab-set}
:sync-group: install
@@ -121,7 +122,7 @@ In addition, if you don't want to use the docker image as above, you can also bu
- Install `vllm-ascend` from source, refer to [installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
To run inference with `GLM-5`, you should upgrade vllm, vllm-ascend, and transformers to their main branches:
- After installing `vllm-ascend` from source, upgrade vllm, vllm-ascend, and transformers to their main branches:
```shell
# upgrade vllm
@@ -240,6 +241,8 @@ The parameters are explained as follows:
### Multi-node Deployment
If you want to deploy multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](https://docs.vllm.ai/projects/ascend/en/latest/installation.html#verify-multi-node-communication).
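Before the full HCCN-level verification in the linked guide, it can help to rule out basic network problems first. The sketch below is a hypothetical TCP reachability check (the `can_connect` helper is illustrative, not part of vllm-ascend) and does not replace the verification steps in the installation guide:

```python
# Hypothetical TCP reachability sketch; this does NOT replace the
# multi-node communication verification in the installation guide.
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a locally bound listener (stand-in for a peer node).
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
reachable = can_connect("127.0.0.1", port)
server.close()
print(reachable)
```

On a real cluster you would point `can_connect` at each peer node's service port from every other node.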
:::::{tab-set}
:sync-group: install
@@ -447,6 +450,64 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
::::
:::::
- For bf16 weight, use this script on each node to enable [Multi Token Prediction (MTP)](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/Multi_Token_Prediction.html).
```shell
python adjust_weight.py "path_of_bf16_weight"
```
```python
# adjust_weight.py
import json
import os
import sys

from safetensors.torch import safe_open, save_file

# Weights the MTP layer (layer 78) shares with the base model.
target_keys = ["model.embed_tokens.weight", "lm_head.weight"]


def get_tensor_info(file_path):
    """Load all tensors from a safetensors file into a dict on CPU."""
    with safe_open(file_path, framework="pt", device="cpu") as f:
        return {name: f.get_tensor(name) for name in f.keys()}


if __name__ == "__main__":
    directory_path = sys.argv[1]
    json_path = os.path.join(directory_path, "model.safetensors.index.json")
    with open(json_path, "r", encoding="utf-8") as f:
        json_data = json.load(f)
    weight_map = json_data.get("weight_map", {})

    # Locate the shard files that contain the shared weights.
    file_list = [os.path.join(directory_path, weight_map[key]) for key in target_keys]

    # Duplicate the shared weights under the MTP layer's key names.
    new_dict = {}
    for file_path in file_list:
        tensor_dict = get_tensor_info(file_path)
        for key in target_keys:
            if key in tensor_dict:
                if key == "model.embed_tokens.weight":
                    new_key = "model.layers.78.embed_tokens.weight"
                elif key == "lm_head.weight":
                    new_key = "model.layers.78.shared_head.head.weight"
                new_dict[new_key] = tensor_dict[key]

    # Write the duplicated tensors to a new shard and register it in the index.
    new_file_name = os.path.join(directory_path, "mtp-others.safetensors")
    save_file(tensors=new_dict, filename=new_file_name)
    for key in new_dict:
        json_data["weight_map"][key] = "mtp-others.safetensors"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(json_data, f, indent=2)
```
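The index rewrite that the script performs can be sketched on a toy in-memory index (the shard file names here are hypothetical placeholders, not real GLM-5 shards):

```python
# Toy demonstration of the index remapping done by adjust_weight.py;
# the shard names below are made up and no weight files are touched.
import json

index = {
    "weight_map": {
        "model.embed_tokens.weight": "model-00001.safetensors",
        "lm_head.weight": "model-00009.safetensors",
    }
}

# Register the shared weights under the MTP layer-78 key names,
# pointing them at the newly written shard.
rename = {
    "model.embed_tokens.weight": "model.layers.78.embed_tokens.weight",
    "lm_head.weight": "model.layers.78.shared_head.head.weight",
}
for old_key, new_key in rename.items():
    index["weight_map"][new_key] = "mtp-others.safetensors"

print(json.dumps(index["weight_map"], indent=2))
```

Note that the original keys stay in place; the MTP keys are added alongside them, exactly as the script leaves `weight_map` after running.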
### Prefill-Decode Disaggregation
Not tested yet.