---
license: llama3
language:
- tr
- en
tags:
- gguf
- ggml
- llama3
- cosmosllama
- turkish llama
---
# CosmosLLaMa GGUFs

## Objective

Due to the need for quantized models in real-time applications, we introduce our GGUF-formatted models. These models are part of the GGML project, with the hope of democratizing the use of large models. Depending on the quantization type, there are 20+ models.

### Features
* All quantization details are listed by Hugging Face on the right-hand side of this page.
* All the models have been tested in `llama.cpp` environments with `llama-cli` and `llama-server` (a server query sketch follows the video link below).
* Furthermore, a YouTube video introduces the basics of using `lmstudio` with these models. 👇

[![Youtube Video](https://img.shields.io/badge/Youtube-Video-red)](https://www.youtube.com/watch?v=JRID-6sRl7I)
### Code Example

Usage example with `llama-cpp-python`:
```py
from llama_cpp import Llama

# Define the inference parameters (only the sampling-related entries are
# passed to llama-cpp-python below; the rest are kept for reference)
inference_params = {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.95,
    "temp": 0.8,
    "repeat_penalty": 1.1,
    # Llama 3 chat-template pieces (note: real newlines, not escaped "\\n")
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "antiprompt": [],
    # System prompt (Turkish): "You are an AI assistant. The user will give
    # you a task. Your goal is to complete the task as faithfully as possible."
    "pre_prompt": "Sen bir yapay zeka asistanısın. Kullanıcı sana bir görev verecek. Amacın görevi olabildiğince sadık bir şekilde tamamlamak.",
    "pre_prompt_suffix": "<|eot_id|>",
    "pre_prompt_prefix": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": True,
    "multiline_input": False,
    "penalize_nl": True
}

# Initialize the Llama model; this downloads the matching GGUF from the Hub
llama = Llama.from_pretrained(
    repo_id="ytu-ce-cosmos/Turkish-Llama-8b-Instruct-v0.1-GGUF",
    filename="*Q4_K.gguf",
    n_threads=inference_params["n_threads"],
    verbose=False
)

# Example input (Turkish): "What is the capital of Türkiye?"
user_input = "Türkiye'nin başkenti neresidir?"

# Construct the prompt in the Llama 3 chat format: system block, then user turn
prompt = (
    f"{inference_params['pre_prompt_prefix']}{inference_params['pre_prompt']}"
    f"{inference_params['pre_prompt_suffix']}"
    f"{inference_params['input_prefix']}{user_input}{inference_params['input_suffix']}"
)

# Generate the response (n_predict = -1 means generate until EOS/context limit)
response = llama(
    prompt,
    max_tokens=inference_params["n_predict"],
    temperature=inference_params["temp"],
    top_k=inference_params["top_k"],
    top_p=inference_params["top_p"],
    min_p=inference_params["min_p"],
    repeat_penalty=inference_params["repeat_penalty"],
)

# Output the response
print(response["choices"][0]["text"])
```
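Alternatively, instead of assembling the Llama 3 template by hand, `llama-cpp-python` can apply the chat template embedded in the GGUF file through `create_chat_completion`. A minimal sketch, reusing the `llama` object and `inference_params` from above:

```py
# Let llama-cpp-python apply the GGUF's built-in chat template
response = llama.create_chat_completion(
    messages=[
        {"role": "system", "content": inference_params["pre_prompt"]},
        {"role": "user", "content": user_input},
    ],
    temperature=inference_params["temp"],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```

Both approaches should yield the same prompt, as long as the GGUF ships the standard Llama 3 chat template.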
The quantization was performed with `llama.cpp`; in our experience, this method gives the most stable results.

As expected, inference quality is better for the higher-bit models, while inference time is similar across the low-bit models.

Each model's memory footprint can be estimated from the quantization docs of either [Hugging Face](https://huggingface.co/docs/transformers/main/en/quantization/overview) or [llama.cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize).
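As a rough rule of thumb, running a GGUF model requires at least its file size in RAM or VRAM, plus overhead for the KV cache that grows with the context length. A minimal sketch that reads each quant's on-disk size from the Hub with `huggingface_hub` (the 10% headroom factor is an illustrative assumption, not a measured value):

```py
from huggingface_hub import HfApi

# List every GGUF file in the repo along with its size on disk
api = HfApi()
info = api.model_info(
    "ytu-ce-cosmos/Turkish-Llama-8b-Instruct-v0.1-GGUF",
    files_metadata=True,
)

for f in info.siblings:
    if f.rfilename.endswith(".gguf"):
        gib = f.size / 1024**3
        # File size is a lower bound; add ~10% headroom for the KV cache etc.
        print(f"{f.rfilename}: {gib:.2f} GiB on disk, ~{gib * 1.1:.2f} GiB to run")
```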
## Acknowledgments

- Research supported with Cloud TPUs from [Google's TensorFlow Research Cloud](https://sites.research.google/trc/about/) (TFRC). Thanks for providing access to the TFRC ❤️
- Thanks to the generous support from the Hugging Face team, it is possible to download models from their S3 storage 🤗
## Contact

*Feel free to contact us whenever you run into any problems :)*

COSMOS AI Research Group, Yildiz Technical University Computer Engineering Department

https://cosmos.yildiz.edu.tr/

cosmos@yildiz.edu.tr