---
license: apache-2.0
inference: false
---

# MistralLite-AWQ Model

MistralLite-AWQ is a version of the [MistralLite](https://huggingface.co/amazon/MistralLite) model that was
quantized using the AWQ method developed by [Lin et al. (2023)](https://arxiv.org/abs/2306.00978).
The MistralLite-AWQ models are approximately **70% smaller** than those of MistralLite whilst maintaining comparable performance.

Please refer to the [original MistralLite model card](https://huggingface.co/amazon/MistralLite) for details about the model
preparation and training processes.

## MistralLite-AWQ Variants

| Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
|--------|---:|---------------:|--------:|-----------|
| [main](https://huggingface.co/amazon/MistralLite-AWQ/tree/main) | 3.9 GB | 128 | 4 | GEMM |
| [MistralLite-AWQ-64g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-64g-4b-GEMM) | 4.0 GB | 64 | 4 | GEMM |
| [MistralLite-AWQ-32g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-32g-4b-GEMM) | 4.3 GB | 32 | 4 | GEMM |

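The small size differences between branches are consistent with how group-wise quantization stores its metadata: every group of `q_group_size` weights carries its own scale and zero point, so smaller groups add more overhead per weight. The sketch below estimates the effective bits per weight. It assumes one 16-bit scale and one `w_bit`-wide zero point per group, which is an illustrative assumption rather than something this model card states, and it ignores any layers that are left unquantized. The function name is hypothetical.

```python
def effective_bits_per_weight(w_bit: int, q_group_size: int,
                              scale_bits: int = 16) -> float:
    """Estimate the storage cost per quantized weight, in bits.

    Assumption (not stated in this model card): each group stores one
    scale of `scale_bits` bits and one zero point of `w_bit` bits.
    """
    per_group_overhead = scale_bits + w_bit
    return w_bit + per_group_overhead / q_group_size


# Group sizes taken from the variants table.
for group_size in (128, 64, 32):
    bits = effective_bits_per_weight(w_bit=4, q_group_size=group_size)
    print(f"q_group_size={group_size:>3}: ~{bits:.3f} bits per weight")
```

Under these assumptions the per-weight cost rises as the group size shrinks, which matches the ordering of the approximate model sizes in the table.
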
## Dependencies

- [`autoawq==0.2.5`](https://pypi.org/project/autoawq/0.2.5/) – [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) was used to quantize the MistralLite model.
- [`vllm==0.4.2`](https://pypi.org/project/vllm/0.4.2/) – [vLLM](https://github.com/vllm-project/vllm) was used to host models for benchmarking.

## Evaluations

### Long Context

The following benchmark results are shown as _accuracy_ (%) values, unless stated otherwise.

#### Topic Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/

| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|:-----------|------------:|------------:|------------:|------------:|------------:|
| _n_tokens_ (approx.) = | _3048_ | _5966_ | _8903_ | _11832_ | _14757_ |
| MistralLite | 100 | 100 | 100 | 100 | 98 |
| **MistralLite-AWQ** | **100** | **100** | **100** | **100** | **98** |
| **MistralLite-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
| **MistralLite-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
| Mistral-7B-Instruct-v0.1 | 96 | 52 | 2 | 0 | 0 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 0 | 0 | 0 | 0 | 0 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 |

#### [Line Retrieval](https://lmsys.org/blog/2023-06-29-longchat/#longeval-results)

See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results

| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|:-----------|------------:|------------:|------------:|------------:|------------:|------------:|
| _n_tokens_ (approx.) = | _4317_ | _6415_ | _8510_ | _10610_ | _12698_ | _14373_ |
| MistralLite | 100 | 94 | 86 | 82 | 76 | 66 |
| **MistralLite-AWQ** | **96** | **94** | **88** | **80** | **70** | **62** |
| **MistralLite-AWQ-64g-4b-GEMM** | **96** | **96** | **90** | **70** | **72** | **60** |
| **MistralLite-AWQ-32g-4b-GEMM** | **98** | **96** | **84** | **76** | **70** | **62** |
| Mistral-7B-Instruct-v0.1 | 96 | 56 | 38 | 36 | 30 | 30 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 96 | 98 | 96 | 84 |
| Mixtral-8x7B-v0.1 | 54 | 38 | 56 | 66 | 62 | 38 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |

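Line retrieval is the long-context task where the quantized variants differ most from the original model, so it can help to condense each row into a single number. The sketch below copies the MistralLite rows from the table and computes an unweighted mean across the six `n_lines` settings. The unweighted mean is only one possible summary and is an assumption made here, not a metric reported by the benchmark.

```python
# Accuracy (%) per n_lines setting, copied from the Line Retrieval table.
line_retrieval = {
    "MistralLite": [100, 94, 86, 82, 76, 66],
    "MistralLite-AWQ": [96, 94, 88, 80, 70, 62],
    "MistralLite-AWQ-64g-4b-GEMM": [96, 96, 90, 70, 72, 60],
    "MistralLite-AWQ-32g-4b-GEMM": [98, 96, 84, 76, 70, 62],
}

# Unweighted mean per model (an illustrative summary, not a benchmark metric).
summary = {name: sum(scores) / len(scores)
           for name, scores in line_retrieval.items()}
baseline = summary["MistralLite"]

for name, avg in summary.items():
    print(f"{name}: mean={avg:.1f}, delta vs. MistralLite={avg - baseline:+.1f}")
```
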
#### Pass Key Retrieval

See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101

| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|:-----------|----------------:|----------------:|----------------:|----------------:|----------------:|----------------:|
| _n_tokens_ (approx.) = | _3272_ | _5405_ | _8338_ | _10205_ | _12071_ | _16072_ |
| MistralLite | 100 | 100 | 100 | 100 | 100 | 100 |
| **MistralLite-AWQ** | **100** | **100** | **100** | **100** | **100** | **100** |
| **MistralLite-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
| **MistralLite-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
| Mistral-7B-Instruct-v0.1 | 100 | 50 | 30 | 20 | 10 | 10 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 90 | 100 | 100 |

#### QuALITY (Question Answering with Long Input Texts, Yes!)

See https://nyu-mll.github.io/quality/

| Model Name | Test set Accuracy | Hard subset Accuracy |
|:-----------|------------------:|---------------------:|
| MistralLite | 56.8 | 74.5 |
| **MistralLite-AWQ** | **55.3** | **71.8** |
| **MistralLite-AWQ-64g-4b-GEMM** | **55.2** | **72.9** |
| **MistralLite-AWQ-32g-4b-GEMM** | **56.6** | **72.8** |
| Mistral-7B-Instruct-v0.1 | 45.2 | 58.9 |
| Mistral-7B-Instruct-v0.2 | 55.5 | 74 |
| Mixtral-8x7B-v0.1 | 75 | 74.1 |
| Mixtral-8x7B-Instruct-v0.1 | 68.7 | 83.3 |

## Usage

## Inference via vLLM HTTP Host

### Launch Host
```bash
python -m vllm.entrypoints.openai.api_server \
    --model amazon/MistralLite-AWQ \
    --quantization awq
```

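The launch command serves the `main` branch by default. To host one of the other branches from the variants table, vLLM's `--revision` argument can select it. The sketch below only assembles the command line; `build_launch_command` is a hypothetical helper name introduced for illustration.

```python
import shlex


def build_launch_command(revision: str = "main") -> list[str]:
    """Assemble the vLLM server command for one branch (hypothetical helper)."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "amazon/MistralLite-AWQ",
        "--revision", revision,
        "--quantization", "awq",
    ]


command = build_launch_command("MistralLite-AWQ-64g-4b-GEMM")
print(shlex.join(command))
# To start the server from Python instead of a shell:
#   subprocess.run(command, check=True)
```
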
### Query Host
```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "amazon/MistralLite-AWQ",
        "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
        "temperature": 0,
        "echo": false
  }'
```

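The same request can be issued from Python using only the standard library. The sketch below mirrors the `curl` payload and wraps the question in the `<|prompter|>...</s><|assistant|>` template used in the examples. The helper names (`format_prompt`, `build_request`, `query`) are hypothetical, and the response parsing assumes the OpenAI-style completions format with the generated text under `choices[0].text`.

```python
import json
import urllib.request


def format_prompt(question: str) -> str:
    """Wrap a question in the prompt template used in the examples."""
    return f"<|prompter|>{question}</s><|assistant|>"


def build_request(question: str,
                  base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a completions request for the vLLM host."""
    payload = {
        "model": "amazon/MistralLite-AWQ",
        "prompt": format_prompt(question),
        "temperature": 0,
        "echo": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(question: str) -> str:
    """Send the request and return the generated text (needs a running host)."""
    with urllib.request.urlopen(build_request(question)) as response:
        body = json.load(response)
    # Assumes an OpenAI-style completions response.
    return body["choices"][0]["text"]
```

Calling `query("What are the main challenges to support a long context for LLM?")` requires the host from the launch step to be running on `localhost:8000`.
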
## Inference via [vLLM Offline Inference](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html)
```python
from vllm import LLM, SamplingParams

# Prompts follow the <|prompter|>...</s><|assistant|> template.
prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
# temperature=0 selects greedy decoding; max_tokens caps the generated length.
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="amazon/MistralLite-AWQ")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## License

Apache 2.0

## Limitations

Before using the MistralLite-AWQ model, it is important to perform your own
independent assessment and to take measures to ensure that your use complies
with your own specific quality control practices and standards, and with the
local rules, laws, regulations, licenses and terms that apply to you and your
content.