76 lines
3.4 KiB
Markdown
76 lines
3.4 KiB
Markdown
|
|
---
|
|||
|
|
license: apache-2.0
|
|||
|
|
language:
|
|||
|
|
- en
|
|||
|
|
- ru
|
|||
|
|
base_model:
|
|||
|
|
- bond005/meno-lite-0.1
|
|||
|
|
tags:
|
|||
|
|
- rag
|
|||
|
|
- ner
|
|||
|
|
- information-extraction
|
|||
|
|
- summarization
|
|||
|
|
- question-answering
|
|||
|
|
- document-qa
|
|||
|
|
- long-context
|
|||
|
|
pipeline_tag: text-generation
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
# Meno-Lite-0.1 GGUF
|
|||
|
|
|
|||
|
|
This repository contains quantized GGUF versions of [Meno-Lite-0.1](https://huggingface.co/bond005/meno-lite-0.1).
|
|||
|
|
All variants were produced using an **importance matrix** computed on the `train` split of the [`ru_llm_calibration`](https://huggingface.co/datasets/bond005/ru_llm_calibration) dataset, and are intended to be run with [`llama.cpp`](https://github.com/ggerganov/llama.cpp).
|
|||
|
|
|
|||
|
|
## Available Formats
|
|||
|
|
|
|||
|
|
| Quantization type | File size | Quality | Recommendation |
|
|||
|
|
| :--- | :--- | :--- | :--- |
|
|||
|
|
| **Q8_0** | ~8.05 GB | Virtually identical to FP16 | **Best quality**. Ideal for CPU inference when memory is not a constraint. |
|
|||
|
|
| **Q5_K_M** | ~5.41 GB | Minimal degradation | **Recommended balance**. Excellent speed and quality, fits most consumer GPUs. |
|
|||
|
|
| **Q4_K_M** | ~4.65 GB | Moderate degradation | **"Golden standard"**. Best trade-off between size and quality. |
|
|||
|
|
| **IQ3_M** | ~3.54 GB | Noticeable degradation | **Maximum memory savings**. Quality drops visibly; suited for highly constrained devices. |
|
|||
|
|
|
|||
|
|
## Quality Evaluation
|
|||
|
|
|
|||
|
|
Quality was measured on the `test` split of the [**Ru LLM Calibration**](https://huggingface.co/datasets/bond005/ru_llm_calibration) dataset using the `llama-perplexity` utility. The original FP16 model served as the reference.
|
|||
|
|
|
|||
|
|
| Metric | Q8_0 | Q5_K_M | Q4_K_M | IQ3_M |
|
|||
|
|
| :--- | :--- | :--- | :--- | :--- |
|
|||
|
|
| **Mean PPL (Q) ↓** | 9.047 | 9.075 | 9.135 | 9.689 |
|
|||
|
|
| **PPL correlation ↑** | 99.97% | 99.87% | 99.69% | 98.64% |
|
|||
|
|
| **Mean KLD ↓** | 0.0020 | 0.0077 | 0.0174 | 0.0804 |
|
|||
|
|
| **Same top p ↑** | 96.71% | 94.36% | 92.16% | 84.58% |
|
|||
|
|
|
|||
|
|
> ↑ – higher is better; ↓ – lower is better
|
|||
|
|
|
|||
|
|
**How to interpret these metrics:**
|
|||
|
|
- **Mean PPL (Q)**: Lower is better. Shows the average perplexity of the quantized model.
|
|||
|
|
- **PPL correlation**: Closer to 100% indicates the quantized model behaves almost identically to FP16. Values above 99.5% are considered excellent.
|
|||
|
|
- **Mean KLD**: Measures the divergence between the output probability distributions. Lower is better; 0 means identical distributions.
|
|||
|
|
- **Same top p**: The percentage of tokens where the quantized model's top prediction matches the FP16 model. Higher is better – it reflects how often the model's first-choice token remains unchanged.
|
|||
|
|
|
|||
|
|
## Usage
|
|||
|
|
|
|||
|
|
### 1. Install `llama.cpp`
|
|||
|
|
|
|||
|
|
Follow the [official build instructions](https://github.com/ggerganov/llama.cpp#build).
|
|||
|
|
|
|||
|
|
### 2. Run the model
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
# CLI
|
|||
|
|
./llama-cli -hf bond005/meno-lite-0.1-gguf -m meno-lite-0.1-Q4_K_M.gguf -p "Привет, как дела?"
|
|||
|
|
|
|||
|
|
# Server with WebUI (default http://127.0.0.1:8080)
|
|||
|
|
./llama-server -hf bond005/meno-lite-0.1-gguf -m meno-lite-0.1-Q4_K_M.gguf --host 0.0.0.0 --port 8080
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
For more details on available parameters, see the [`llama.cpp` documentation](https://github.com/ggerganov/llama.cpp/tree/master/examples).
|
|||
|
|
|
|||
|
|
## About Meno-Lite-0.1
|
|||
|
|
|
|||
|
|
Meno-Lite-0.1 is a 7B model based on Qwen2.5, fine-tuned for **RAG, document QA, information extraction, and knowledge graph construction**. Read more about its capabilities, training procedure, and limitations in the [main model card](https://huggingface.co/bond005/meno-lite-0.1).
|
|||
|
|
|
|||
|
|
## License
|
|||
|
|
|
|||
|
|
All quantized variants inherit the license of the original model (Apache 2.0).
|