Model: tclf90/llama-3-chinese-8b-instruct-v3-GPTQ-Int4
Source: Original Platform
2026-05-19 20:02:03 +08:00

---
license: llama3
license_name: llama3
license_link: LICENSE
pipeline_tag: text-generation
tags:
- llama3
- gptq
- int4
- 量化修复
- vLLM
---
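# Llama-3-chinese-8b-instruct-v3-GPTQ-Int4-量化修复 (Quantization Repair)
Original model: [ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3](https://www.modelscope.cn/models/ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3)
### 【Model Update Date】
`2024-06-02`
### 【Model Size】
`6.2GB`
### 【Introduction】
The Llama-3-Chinese-8B-Instruct-v3 instruction model merges v1, v2, and Meta's original Instruct model. It substantially outperforms v1/v2 on Chinese tasks, stays on par with Meta's original model on English tasks, and gives a markedly better subjective experience.
[More details...](https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/releases/tag/v3.0)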
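### 【Quantization Repair】
The quantization strategy of the existing `AWQ` and `GPTQ` quantization algorithms has been tuned. `Int3` models carrying the `量化修复` (quantization repair) tag can match the capability of `Int8` models produced by the default `AWQ` and `GPTQ` algorithms.
1. Quantization repair greatly reduces the cases where quantization loss makes a model unusable, such as `1. garbled output`, `2. infinite loops`, and `3. loss of long-text capability`.
2. After tuning, `AWQ` and `GPTQ` models show no obvious difference in capability. Because `GPTQ` gives the best concurrent inference efficiency in the `vLLM` engine, `AWQ` models are no longer produced.
3. With the tuned quantization algorithm, `int4` and `int3` show no obvious difference on large (30B+) models.
4. For small models, `int4` with `group_size=32` is used wherever possible, to keep quantization loss as low as possible.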
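
To make the `int4` and `group_size=32` choice in item 4 more concrete, here is a rough sketch of the storage cost per weight under group-wise quantization. It assumes a common GPTQ-style layout with one fp16 scale and one zero point (packed at the weight bit width) per group, and it ignores everything else in a checkpoint (embeddings, norms, index tensors). The function name `effective_bits_per_weight` is hypothetical and not part of any library.

```python
def effective_bits_per_weight(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Approximate storage cost of one quantized weight, in bits.

    Assumption: every group of `group_size` weights shares one scale
    (`scale_bits` wide) and one zero point (`bits` wide).
    """
    per_group_overhead = scale_bits + bits
    return bits + per_group_overhead / group_size


# Compare a few configurations mentioned in this README.
for bits, group_size in [(4, 32), (4, 128), (3, 128)]:
    cost = effective_bits_per_weight(bits, group_size)
    print(f"int{bits}, group_size={group_size}: {cost:.3f} bits/weight")
```

Under these assumptions a smaller group size costs more bits per weight, because scales and zero points are stored more often. That extra storage is the price of the finer-grained quantization used for small models.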
### 【Quantization-Repair Models Released in the Same Period】
| Model name | Disk size (GB) |
|------------|----------------|
| [Yi-1.5-6B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Yi-1.5-6B-Chat-GPTQ-Int3) | 3.3 |
| [Yi-1.5-9B-Chat-16K-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Yi-1.5-9B-Chat-16K-GPTQ-Int3) | 4.4 |
| [Yi-1.5-34B-Chat-16K-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Yi-1.5-34B-Chat-16K-GPTQ-Int3) | 15.1 |
| [Qwen1.5-7B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-7B-Chat-GPTQ-Int3) | 5.1 |
| [Qwen1.5-14B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-14B-Chat-GPTQ-Int3) | 8.1 |
| [Qwen1.5-32B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-32B-Chat-GPTQ-Int3) | 15.4 |
| [Qwen1.5-72B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-72B-Chat-GPTQ-Int3) | 32.5 |
| [Qwen1.5-110B-Chat-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/Qwen1.5-110B-Chat-GPTQ-Int3) | 47.9 |
| [openbuddy-llama3-70b-v21.1-8k-GPTQ-Int3 (quantization repair)](https://www.modelscope.cn/models/tclf90/openbuddy-llama3-70b-v21.1-8k-GPTQ-Int3) | 31.5 |
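### 【Model Download】
```python
from modelscope import snapshot_download

# Replace MODEL_NAME with the repository name and LOCAL_PATH with a local cache directory.
# snapshot_download returns the local directory that holds the downloaded files.
model_dir = snapshot_download('tclf90/MODEL_NAME', cache_dir="LOCAL_PATH")
```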
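
The inference snippets in this README replace `.` with `___` in the model name before loading, which suggests the download directory is named that way on disk. A minimal sketch of that mapping follows, assuming the layout `cache_dir/owner/model-name`. `local_model_dir` is a hypothetical helper, not a ModelScope function; in practice the path returned by `snapshot_download` can be used directly.

```python
import posixpath


def local_model_dir(cache_dir: str, model_id: str) -> str:
    """Guess the on-disk directory for a model ID of the form "owner/name".

    Assumption: only dots in the model name are rewritten to "___";
    the cache directory and owner are left untouched.
    """
    owner, name = model_id.split("/", 1)
    return posixpath.join(cache_dir, owner, name.replace(".", "___"))


print(local_model_dir("./my_models", "tclf90/Qwen1.5-32B-Chat-GPTQ-Int3"))
```

Restricting the replacement to the model name matters: applying it to the whole path would also rewrite the dot in a leading `./`.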
### 【[vLLM](https://github.com/vllm-project/vllm) Inference】 (currently Linux only)
#### 1. Simple Python debugging
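```python
import os

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 4000, 1
model_name = "LOCAL_PATH/tclf90/MODEL_NAME"  # e.g. "./my_models/tclf90/Qwen1.5-32B-Chat-GPTQ-Int3"
# The download directory uses "___" in place of "." in the model name.
# Rewrite only the last path component so that a leading "./" stays intact.
parent_dir, repo_name = os.path.split(model_name)
model_name = os.path.join(parent_dir, repo_name.replace('.', '___'))

tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# Three independent single-turn conversations, generated as one batch.
messages_list = [
    [{"role": "user", "content": "你是谁"}],  # "Who are you?"
    [{"role": "user", "content": "介绍一下你自己"}],  # "Introduce yourself."
    [{"role": "user", "content": "用python写一个快排函数"}],  # "Write a quicksort function in Python."
]

# Tokenize with the chat template; each entry is a list of token IDs.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
# Note: newer vLLM releases may expect token IDs as prompts=[{"prompt_token_ids": ids}, ...] instead of this keyword.
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```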
#### 2. ChatGPT-style RESTful API server
```
python -m vllm.entrypoints.openai.api_server --model LOCAL_PATH/tclf90/MODEL_NAME
```
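
Once the server is running, it can be queried like any OpenAI-style chat endpoint. The sketch below uses only the standard library and assumes vLLM's defaults: the server listens on `http://localhost:8000`, and the served model name equals the value passed to `--model`. The helper names `build_chat_payload` and `send_chat_request` are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(model: str, user_message: str, max_tokens: int = 256) -> dict:
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to /v1/chat/completions and return the reply text.

    Assumption: the server runs locally on the default port.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `send_chat_request(build_chat_payload("LOCAL_PATH/tclf90/MODEL_NAME", "你是谁"))` returns the assistant's reply as a string.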
### 【Transformers Inference】
```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "LOCAL_PATH/tclf90/MODEL_NAME"  # e.g. "./my_models/tclf90/Qwen1.5-32B-Chat-GPTQ-Int3"
# Replace "." with "___" in the model name only, leaving the rest of the path (e.g. a leading "./") untouched.
parent_dir, repo_name = os.path.split(model_name)
model_name = os.path.join(parent_dir, repo_name.replace('.', '___'))

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# float16 is used here because GPTQ kernels generally expect float16 activations.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "你好你是谁"}  # "Hello, who are you?"
]
# Apply the chat template and return the prompt as a tensor of token IDs.
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```
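
`apply_chat_template` turns the message list into the prompt format the model was trained on. For reference, the sketch below renders the standard Meta Llama-3 Instruct layout by hand. Whether this checkpoint's tokenizer ships exactly this template is an assumption, so the tokenizer's own template remains the source of truth. `render_llama3_prompt` is a hypothetical helper.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render messages in the Meta Llama-3 Instruct prompt layout.

    Assumption: the model follows the stock Llama-3 template, where each
    turn is wrapped in header tokens and closed with <|eot_id|>.
    """
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


print(render_llama3_prompt([{"role": "user", "content": "你好你是谁"}]))
```

Because every turn in this layout ends with `<|eot_id|>`, it is worth checking that the stop token used during generation matches the end-of-turn token the tokenizer actually defines.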