Initialize project; model provided by the ModelHub XC community
Model: ciCic/llama-3.2-1B-Instruct-AWQ Source: Original Platform
---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
library_name: transformers
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama-3
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
# Represents

A quantized version of Llama 3.2 1B Instruct, produced with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq).

## Use with transformers/autoawq

Starting with these package versions:
- `transformers==4.45.1`
- `accelerate==0.34.2`
- `torch==2.3.1`
- `numpy==2.0.0`
- `autoawq==0.2.6`
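
Before running the examples, it can help to confirm that the local environment matches the versions listed above. The following is a minimal sketch using only the standard library. The helper names `installed_version` and `find_mismatches` are hypothetical, and treating the listed versions as exact pins (rather than minimums) is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above; treating them as exact pins is an assumption.
PINNED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.3.1",
    "numpy": "2.0.0",
    "autoawq": "0.2.6",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that differs from its pin to a (wanted, found) pair."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        # torch wheels often carry a local suffix such as "+cu121"; ignore it.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{package}: wanted {wanted}, found {found or 'not installed'}")
```

An empty result from `find_mismatches(PINNED)` means every listed package is installed at the listed version.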

Tested on:
- OS = Windows
- GPU = Nvidia GeForce RTX 3080 10GB
- CPU = Intel Core i5-9600K
- RAM = 32GB

### For CUDA users

**AutoAWQ**

NOTE: this example uses `fuse_layers=True` to fuse the attention and MLP layers together for faster inference.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Declare prompt
prompt = (
    "You're standing on the surface of the Earth. "
    "You walk one mile south, one mile west and one mile north. "
    "You end up exactly where you started. Where are you?"
)

# Tokenization of the prompt
tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output in a streaming fashion
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```

**Transformers**

```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import torch

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quant_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda"
)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt = (
    "You're standing on the surface of the Earth. "
    "You walk one mile south, one mile west and one mile north. "
    "You end up exactly where you started. Where are you?"
)

tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```

#### Issue/Solution
- `torch.from_numpy` fails
  - This may be caused by an issue in the `torch==2.3.1` .cpp files. Because AutoAWQ uses torch 2.3.1 rather than the most recent release, the error can occur in the module `marlin.py`, in `def _get_perms()`.
  - Module path: Python\Python311\site-packages\awq\modules\linear\marlin.py
  - Solution:
    - `_get_perms()` converts to numpy (CPU) and then back to a tensor (GPU) in several places. These steps can be replaced entirely with tensor operations, so numpy is not needed, which (temporarily) works around the `from_numpy()` issue.

```python
def _get_perms():
    perm = []
    for i in range(32):
        perm1 = []
        col = i // 4
        for block in [0, 1]:
            for row in [
                2 * (i % 4),
                2 * (i % 4) + 1,
                2 * (i % 4 + 4),
                2 * (i % 4 + 4) + 1,
            ]:
                perm1.append(16 * row + col + 8 * block)

        for j in range(4):
            perm.extend([p + 256 * j for p in perm1])

    # perm = np.array(perm)
    perm = torch.asarray(perm)
    # interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    interleave = torch.asarray([0, 2, 4, 6, 1, 3, 5, 7])
    perm = perm.reshape((-1, 8))[:, interleave].ravel()
    # perm = torch.from_numpy(perm)
    scale_perm = []
    for i in range(8):
        scale_perm.extend([i + 8 * j for j in range(8)])
    scale_perm_single = []
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```
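
To sanity-check the patched function, the same index arithmetic can be rebuilt in plain Python, with neither torch nor numpy. The sketch below is not part of AutoAWQ and the helper name `build_perm` is hypothetical. It mirrors the loops above, applies the interleave to each group of 8 entries, and checks that the result is a valid permutation of 0..1023.

```python
def build_perm():
    """Pure-Python rebuild of the weight permutation from _get_perms()."""
    perm = []
    for i in range(32):
        perm1 = []
        col = i // 4
        for block in [0, 1]:
            for row in [
                2 * (i % 4),
                2 * (i % 4) + 1,
                2 * (i % 4 + 4),
                2 * (i % 4 + 4) + 1,
            ]:
                perm1.append(16 * row + col + 8 * block)
        for j in range(4):
            perm.extend([p + 256 * j for p in perm1])

    # Equivalent of perm.reshape((-1, 8))[:, interleave].ravel():
    # reorder every consecutive group of 8 entries by the interleave pattern.
    interleave = [0, 2, 4, 6, 1, 3, 5, 7]
    out = []
    for start in range(0, len(perm), 8):
        group = perm[start:start + 8]
        out.extend(group[k] for k in interleave)
    return out


perm = build_perm()
assert len(perm) == 1024
assert sorted(perm) == list(range(1024))  # every index appears exactly once
```

With the patched module importable, comparing `_get_perms()[0].tolist()` against `build_perm()` should show the two lists are equal.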