Initialize project; model provided by the ModelHub XC community
Model: NYTK/PULI-LlumiX-32K Source: Original Platform
41
README.md
Normal file
@@ -0,0 +1,41 @@
---
license: llama2
language:
- hu
- en
tags:
- puli
---

# PULI LlumiX 32K base (6.74 billion parameters)

For further details, or to test our instruct model, see [our demo site](https://puli.nytud.hu/puli-llumix-instruct).

- Trained with [OpenChatKit](https://github.com/togethercomputer/OpenChatKit)
- The [LLaMA-2-7B-32K](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) model was further pretrained on a Hungarian dataset
- The model has been extended to a context length of 32K with position interpolation
- Checkpoint: 100 000 steps
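
The position interpolation mentioned in the list above can be illustrated with a minimal sketch. It assumes the common formulation in which position indices are rescaled linearly, so that a long sequence maps back into the position range seen during the original pretraining. The 4096-token original context, the head dimension, the rotary base, and the function names are assumptions made for illustration; they are not taken from this model's training code.

```python
ORIGINAL_CONTEXT = 4096    # assumed: context length of the Llama 2 base model
EXTENDED_CONTEXT = 32768   # the 32K context length stated in this README


def interpolated_position(position, original=ORIGINAL_CONTEXT, extended=EXTENDED_CONTEXT):
    """Linearly rescale a position index into the original position range."""
    return position * original / extended


def rotary_angles(position, head_dim=128, base=10000.0):
    """Rotary embedding angles for one (possibly fractional) position index.

    head_dim and base are assumed values, chosen only for this illustration.
    """
    return [position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


scale = EXTENDED_CONTEXT / ORIGINAL_CONTEXT
last_position = interpolated_position(EXTENDED_CONTEXT - 1)

print(scale)          # 8.0
print(last_position)  # 4095.875, still inside the original 0..4095 range
angles = rotary_angles(last_position)
```

Under these assumptions every position is divided by 8, so neighbouring tokens end up 0.125 apart in position space instead of 1.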
## Dataset for continued pretraining

- Hungarian: 7.9 billion words in 763K documents, each exceeding 5000 words in length
- English: Long Context QA (2 billion words), BookSum (78 million words)

## Limitations

- max_seq_length = 32 768
- float16
- vocab size: 32 000
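
As a rough illustration of what these limits imply in practice, the sketch below estimates the memory needed for the weights alone in float16 and checks whether a prompt plus the requested new tokens fits into the context window. The helper names are hypothetical, and the estimate deliberately ignores activations and the key-value cache, so real memory use will be higher.

```python
PARAMETERS = 6.74e9        # parameter count given in the title of this README
BYTES_PER_FLOAT16 = 2      # a float16 value occupies two bytes
MAX_SEQ_LENGTH = 32768     # max_seq_length from the list above


def weights_size_gib(parameters=PARAMETERS, bytes_per_value=BYTES_PER_FLOAT16):
    """Size of the model weights in GiB (weights only, no activations)."""
    return parameters * bytes_per_value / 2**30


def fits_context(prompt_tokens, max_new_tokens, max_seq_length=MAX_SEQ_LENGTH):
    """True if the prompt plus the generated tokens stay within the context window."""
    return prompt_tokens + max_new_tokens <= max_seq_length


print(f"{weights_size_gib():.1f} GiB")  # 12.6 GiB
print(fits_context(32000, 30))          # True
print(fits_context(32750, 30))          # False
```

A check like `fits_context` is useful before calling the generator with long documents, since the token count of the prompt and `max_new_tokens` share the same 32 768-token budget.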
## Usage with pipeline

```python
from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer

# Load the base model and its tokenizer from the Hugging Face Hub
model = LlamaForCausalLM.from_pretrained("NYTK/PULI-LlumiX-32K")
tokenizer = LlamaTokenizer.from_pretrained("NYTK/PULI-LlumiX-32K")

# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

# The pipeline returns a list of dicts; print the text of the first result
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])
```