[Doc] modify glm doc (#6770)
### What this PR does / why we need it?
1. add description of another version of glm5-w4a8 weight
2. update the introduction of installation
3. introduce a script to enable bf16 MTP
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
N/A
- vLLM version: v0.15.0
- vLLM main: 9562912cea
---------
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
@@ -17,14 +17,15 @@ Refer to [feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_gui
### Model Weight
- `GLM-5` (BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
- `GLM-5-w4a8` (quantized version without MTP quantization): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8).
- `GLM-5-w4a8` (quantized version with MTP quantization): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8-mtp-QuaRot).
- You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantize the model yourself.
It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.
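For example, the BF16 weights can be fetched into the shared path with the ModelScope CLI. This is a sketch, not part of the original doc: the exact `--local_dir` target is an assumption, and the `modelscope` package must be installed first.

```shell
# assumption: a Python environment with pip is available
pip install modelscope
# download the BF16 GLM-5 weights into the shared cache directory
modelscope download --model ZhipuAI/GLM-5 --local_dir /root/.cache/modelscope/hub/models/ZhipuAI/GLM-5
```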
### Installation
vLLM and vLLM-ascend only support GLM-5 on their main branches. You can use our GLM-5 docker images for inference.
:::::{tab-set}
:sync-group: install
@@ -121,7 +122,7 @@ In addition, if you don't want to use the docker image as above, you can also bu
- Install `vllm-ascend` from source; refer to [installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
- After installing `vllm-ascend` from source, upgrade vllm, vllm-ascend, and transformers to their main branches:
```shell
# upgrade vllm
@@ -240,6 +241,8 @@ The parameters are explained as follows:
### Multi-node Deployment
If you want to deploy a multi-node environment, first verify multi-node communication according to [verify multi-node communication environment](https://docs.vllm.ai/projects/ascend/en/latest/installation.html#verify-multi-node-communication).
:::::{tab-set}
:sync-group: install
@@ -447,6 +450,64 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
::::
:::::
- For the BF16 weights, run the following script on each node to enable [Multi Token Prediction (MTP)](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/Multi_Token_Prediction.html).
```shell
python adjust_weight.py "path_of_bf16_weight"
```
```python
# adjust_weight.py
import json
import os
import sys

from safetensors.torch import safe_open, save_file

# Original tensor names and their new names under the MTP layer (layer 78).
key_mapping = {
    "model.embed_tokens.weight": "model.layers.78.embed_tokens.weight",
    "lm_head.weight": "model.layers.78.shared_head.head.weight",
}


def get_tensor_info(file_path):
    """Load every tensor of a safetensors shard into a dict on CPU."""
    tensor_dict = {}
    with safe_open(file_path, framework="pt", device="cpu") as f:
        for name in f.keys():
            tensor_dict[name] = f.get_tensor(name)
    return tensor_dict


if __name__ == "__main__":
    directory_path = sys.argv[1]
    json_path = os.path.join(directory_path, "model.safetensors.index.json")
    with open(json_path, "r", encoding="utf-8") as f:
        json_data = json.load(f)
    weight_map = json_data.get("weight_map", {})

    # Locate the shard files that contain the target tensors.
    file_list = []
    for key in key_mapping:
        safetensor_file = weight_map.get(key)
        if safetensor_file is None:
            raise KeyError(f"{key} not found in weight_map")
        file_list.append(os.path.join(directory_path, safetensor_file))

    # Copy each target tensor under its new MTP name.
    new_dict = {}
    for file_path in set(file_list):
        tensor_dict = get_tensor_info(file_path)
        for key, new_key in key_mapping.items():
            if key in tensor_dict:
                new_dict[new_key] = tensor_dict[key]

    # Save the renamed tensors into a new shard and register it in the index.
    new_file_name = os.path.join(directory_path, "mtp-others.safetensors")
    save_file(tensors=new_dict, filename=new_file_name)
    for new_key in key_mapping.values():
        json_data["weight_map"][new_key] = "mtp-others.safetensors"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(json_data, f, indent=2)
```
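The index update at the end of the script can be sanity-checked in isolation before touching real weights. A minimal sketch with a synthetic `weight_map` and hypothetical shard names (no real tensors involved):

```python
import json
import os
import tempfile

# the two entries adjust_weight.py registers for the MTP layer
NEW_ENTRIES = {
    "model.layers.78.embed_tokens.weight": "mtp-others.safetensors",
    "model.layers.78.shared_head.head.weight": "mtp-others.safetensors",
}


def register_mtp_shard(index_path):
    """Add the MTP tensor entries to model.safetensors.index.json."""
    with open(index_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    data["weight_map"].update(NEW_ENTRIES)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data


with tempfile.TemporaryDirectory() as d:
    index_path = os.path.join(d, "model.safetensors.index.json")
    # synthetic index with hypothetical shard names
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"weight_map": {
            "model.embed_tokens.weight": "model-00001.safetensors",
            "lm_head.weight": "model-00099.safetensors",
        }}, f)
    result = register_mtp_shard(index_path)
    print(sorted(result["weight_map"]))
```

Existing entries are left untouched; only the two MTP keys are appended, so the updated index remains valid for loaders that read the original shards.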
### Prefill-Decode Disaggregation
Not tested yet.