Initialize project; model provided by the ModelHub XC community
Model: tclf90/llama-3-chinese-8b-instruct-v3-GPTQ-Int4 Source: Original Platform
---
license: llama3
license_name: llama3
license_link: LICENSE
pipeline_tag: text-generation
tags:
- llama3
- gptq
- int4
- 量化修复
- vLLM
---

# Llama-3-chinese-8b-instruct-v3-GPTQ-Int4 (Quantization Repair)

Original model: [ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3](https://www.modelscope.cn/models/ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3)

### 【Model Update Date】

`2024-06-02`

### 【Model Size】

`6.2GB`

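As a rough sanity check on the size listed above, the sketch below estimates the on-disk size of a GPTQ checkpoint. It is a back-of-the-envelope calculation, not the author's measurement, and it relies on assumptions this README does not state for this model: about 8.03B total parameters, about 1.05B of them (input embeddings and output head) kept in fp16, and 4-bit weights with `group_size=32`, storing one fp16 scale and one 4-bit zero point per group. The function name `estimate_gptq_size_gb` is hypothetical.

```python
def estimate_gptq_size_gb(total_params, fp16_params, bits=4, group_size=32):
    """Estimate checkpoint size in decimal GB for a GPTQ-style layout."""
    quantized_params = total_params - fp16_params
    # Per quantized weight: the packed value, plus its share of the
    # per-group fp16 scale (16 bits) and per-group zero point (`bits` bits).
    bits_per_weight = bits + (16 + bits) / group_size
    quantized_bytes = quantized_params * bits_per_weight / 8
    fp16_bytes = fp16_params * 2  # two bytes per unquantized parameter
    return (quantized_bytes + fp16_bytes) / 1e9


# Assumed parameter counts for an 8B Llama-3-style model (not from this README).
size_gb = estimate_gptq_size_gb(total_params=8.03e9, fp16_params=1.05e9)
print(f"estimated size: {size_gb:.1f} GB")
```

Under these assumptions the estimate lands near 6 GB, in the same range as the listed size. A larger `group_size` would shrink the per-group overhead and therefore the estimate.
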
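### 【Introduction】

Llama-3-Chinese-8B-Instruct-v3 (instruction model) merges v1, v2, and Meta's original Instruct model. It substantially outperforms v1/v2 on Chinese tasks, stays on par with Meta's original model on English tasks, and shows a marked improvement in subjective quality.

[More details...](https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/releases/tag/v3.0)
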
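### 【Quantization Repair】

The quantization strategy of the existing `AWQ` and `GPTQ` algorithms has been tuned. `Int3` models carrying the `量化修复` (quantization repair) tag can match the capability of `Int8` models produced with the default `AWQ` and `GPTQ` algorithms.

1. Quantization repair greatly reduces the cases in which quantization loss makes a model unusable, such as (1) garbled output, (2) infinite loops, and (3) loss of long-text ability.

2. After tuning, `AWQ` and `GPTQ` models show no clear difference in capability. Because `GPTQ` gives the best concurrent inference efficiency on the `vLLM` engine, `AWQ` models are no longer produced.

3. With the tuned quantization algorithm, `int4` and `int3` show no clear difference on large (30B+) models.

4. For small models, `int4` with `group_size=32` is used where possible, to keep quantization loss as low as it can be.
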
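To illustrate why a smaller `group_size` tends to reduce quantization loss, here is a minimal sketch using plain round-to-nearest quantization with one scale and offset per group. This is not the tuned strategy described above, nor GPTQ itself; the function names and the random Gaussian "weights" are assumptions made purely for illustration.

```python
import random


def quantize_dequantize(weights, bits=4, group_size=32):
    """Round-to-nearest asymmetric quantization per group, then dequantize."""
    levels = 2 ** bits - 1
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        w_min, w_max = min(group), max(group)
        # One scale and one offset are stored per group.
        scale = (w_max - w_min) / levels or 1.0  # guard against constant groups
        for w in group:
            code = min(levels, max(0, round((w - w_min) / scale)))
            restored.append(code * scale + w_min)
    return restored


def mean_squared_error(a, b):
    """Average squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(32 * 1024)]

errors = {}
for group_size in (128, 32):
    restored = quantize_dequantize(weights, bits=4, group_size=group_size)
    errors[group_size] = mean_squared_error(weights, restored)
    print(f"group_size={group_size:>3}: mean squared error = {errors[group_size]:.6f}")
```

Smaller groups cover a narrower value range, so each group gets a finer step size. The cost is more scales and offsets to store, which is why the smaller setting is easier to afford on small models.
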
### 【Quantization-Repaired Models Released in the Same Period】

| Model name | Disk size (GB) |
|---|---|
| [Yi-1.5-6B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Yi-1.5-6B-Chat-GPTQ-Int3) | 3.3 |
| [Yi-1.5-9B-Chat-16K-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Yi-1.5-9B-Chat-16K-GPTQ-Int3) | 4.4 |
| [Yi-1.5-34B-Chat-16K-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Yi-1.5-34B-Chat-16K-GPTQ-Int3) | 15.1 |
| [Qwen1.5-7B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-7B-Chat-GPTQ-Int3) | 5.1 |
| [Qwen1.5-14B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-14B-Chat-GPTQ-Int3) | 8.1 |
| [Qwen1.5-32B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-32B-Chat-GPTQ-Int3) | 15.4 |
| [Qwen1.5-72B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-72B-Chat-GPTQ-Int3) | 32.5 |
| [Qwen1.5-110B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-110B-Chat-GPTQ-Int3) | 47.9 |
| [openbuddy-llama3-70b-v21.1-8k-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/openbuddy-llama3-70b-v21.1-8k-GPTQ-Int3) | 31.5 |

### 【Model Download】

```python
from modelscope import snapshot_download

# Replace the placeholders with the model name and a local cache directory.
model_dir = snapshot_download('tclf90/<model-name>', cache_dir="<local-path>")
```

### 【[vLLM](https://github.com/vllm-project/vllm) Inference (currently Linux only)】

#### 1. Quick debugging in Python

```python
import os

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 4000, 1
model_name = "<local-path>/tclf90/<model-name>"  # e.g. "./my_models/tclf90/Qwen1.5-32B-Chat-GPTQ-Int3"
# Map '.' to '___' in the model directory name only, so that a leading "./"
# or dots in parent directories are left untouched.
model_dir, model_base = os.path.split(model_name)
model_name = os.path.join(model_dir, model_base.replace('.', '___'))
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Introduce yourself"}],
    [{"role": "user", "content": "Write a quicksort function in Python"}],
]

# Apply the chat template to turn each conversation into prompt token IDs.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

# Newer vLLM releases may expect prompts=[{"prompt_token_ids": ids}, ...] instead of this keyword.
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

#### 2. ChatGPT-style RESTful API server

```
python -m vllm.entrypoints.openai.api_server --model <local-path>/tclf90/<model-name>
```

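Once the server from the command above is running, it can be queried over HTTP. The following is a minimal client sketch using only the standard library. It assumes the server listens on vLLM's default address `http://localhost:8000` and exposes the OpenAI-style `/v1/chat/completions` route, and that the `model` field matches the value passed to `--model`. The helper names `build_chat_request` and `send_chat_request` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(model, user_message, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,  # assumed to match the value passed to --model
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


# Placeholder path; replace it with the same value given to --model.
request = build_chat_request("<local-path>/tclf90/<model-name>", "Who are you?")
print(request.full_url)
# With the server running: print(send_chat_request(request))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.
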
### 【Transformers Inference】

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "<local-path>/tclf90/<model-name>"  # e.g. "./my_models/tclf90/Qwen1.5-32B-Chat-GPTQ-Int3"
# Map '.' to '___' in the model directory name only, so that a leading "./"
# or dots in parent directories are left untouched.
model_dir, model_base = os.path.split(model_name)
model_name = os.path.join(model_dir, model_base.replace('.', '___'))
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Loading GPTQ weights through transformers typically also needs a GPTQ backend package installed.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
# Reuse the end-of-sequence token for padding.
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Hello, who are you?"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```