---
license: cc-by-nc-4.0
language:
- en
datasets:
- Gryphe/Opus-WritingPrompts
- Sao10K/Claude-3-Opus-Instruct-15K
- Sao10K/Short-Storygen-v2
- Sao10K/c2-Logs-Filtered
quantized_by: bartowski
pipeline_tag: text-generation
---

## 4-bit GEMM AWQ Quantizations of L3-8B-Stheno-v3.2

Using <a href="https://github.com/casper-hansen/AutoAWQ/">AutoAWQ</a> release <a href="https://github.com/casper-hansen/AutoAWQ/releases/tag/v0.2.5">v0.2.5</a> for quantization.

Original model: https://huggingface.co/Sao10K/L3-8B-Stheno-v3.2

## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>


```

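If you build the prompt by hand rather than through the tokenizer's chat template, the format above can be filled in with plain string formatting. The following is a minimal sketch for a single system + user turn; the helper name `build_llama3_prompt` is hypothetical and not part of any library.

```python
def build_llama3_prompt(system_prompt: str, prompt: str) -> str:
    """Fill the Llama 3 prompt format for one system turn and one user turn."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Example values are placeholders for illustration only.
text = build_llama3_prompt(
    "You are a concise assistant that helps answer questions.",
    "Where are you?",
)
print(text)
```

Because the string already starts with `<|begin_of_text|>`, you will likely want to tokenize it with `add_special_tokens=False` so the tokenizer does not prepend a second one.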

## AWQ Parameters

- q_group_size: 128
- w_bit: 4
- zero_point: True
- version: GEMM

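For reference, these parameters map onto the `quant_config` dictionary that AutoAWQ accepts when quantizing a model. The sketch below follows the general pattern of AutoAWQ's quantization example and is not necessarily the exact script used for this repo; the `quantize` wrapper function and the paths in the usage comment are assumptions for illustration.

```python
# Quantization settings copied from the parameter list above.
quant_config = {
    "q_group_size": 128,
    "w_bit": 4,
    "zero_point": True,
    "version": "GEMM",
}


def quantize(model_path: str, quant_path: str) -> None:
    """Quantize a full-precision checkpoint and save it (hypothetical wrapper)."""
    # Imported lazily so the config can be inspected without autoawq installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Runs AWQ calibration and quantizes the weights with the settings above.
    model.quantize(tokenizer, quant_config=quant_config)

    # Write the quantized weights and the tokenizer files side by side.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)


# Usage (assumed paths, requires a CUDA GPU):
# quantize("Sao10K/L3-8B-Stheno-v3.2", "models/L3-8B-Stheno-v3.2-AWQ")
```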

## How to run

Adapted from the AutoAWQ repo example [here](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/generate.py).

First, install the `autoawq` package from PyPI:

```
pip install autoawq
```

Then run the following:

```
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer


quant_path = "models/L3-8B-Stheno-v3.2-AWQ"

# Load the quantized model, its tokenizer, and a streamer that prints tokens as they arrive
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = (
    "You're standing on the surface of the Earth. "
    "You walk one mile south, one mile west and one mile north. "
    "You end up exactly where you started. Where are you?"
)

chat = [
    {"role": "system", "content": "You are a concise assistant that helps answer questions."},
    {"role": "user", "content": prompt},
]

# <|eot_id|> used for llama 3 models
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

# Apply the chat template; add_generation_prompt appends the assistant header
# so the input matches the prompt format above
tokens = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,
    return_tensors="pt",
).cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=64,
    eos_token_id=terminators,
)
```

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski