Initialize project; model provided by the ModelHub XC community

Model: NYTK/PULI-LlumiX-32K Source: Original Platform
2026-04-30 21:40:23 +08:00
commit 59f7cb06eb
12 changed files with 93863 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,41 @@
+---
+license: llama2
+language:
+- hu
+- en
+tags:
+- puli
+---
+
+# PULI LlumiX 32K base (6.74 billion parameters)
+
+For further details or testing our instruct model, see [our demo site](https://puli.nytud.hu/puli-llumix-instruct).
+
+  - Trained with [OpenChatKit](https://github.com/togethercomputer/OpenChatKit)
+  - The [LLaMA-2-7B-32K](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) model was further pretrained on a Hungarian dataset
+  - The model has been extended to a context length of 32K with position interpolation
+  - Checkpoint: 100 000 steps
+
+## Dataset for continued pretraining
+
+- Hungarian: 7.9 billion words from 763K documents, each longer than 5000 words
+- English: Long Context QA (2 billion words), BookSum (78 million words)
+
+## Limitations
+
+- max_seq_length = 32 768
+- float16
+- vocab size: 32 000
+
+## Usage with pipeline
+
+```python
+from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer
+
+model = LlamaForCausalLM.from_pretrained("NYTK/PULI-LlumiX-32K")
+tokenizer = LlamaTokenizer.from_pretrained("NYTK/PULI-LlumiX-32K")
+prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
+generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
+
+print(generator(prompt, max_new_tokens=30)[0]["generated_text"])
+```
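
The README lists `max_seq_length = 32 768` as a limitation, so the prompt and the generated tokens together have to fit in that window. The following is a minimal sketch of how a caller might cap `max_new_tokens` before generating. The helper name `max_new_tokens_budget` and the constant `MAX_SEQ_LENGTH` are hypothetical and not part of the model card or of `transformers`.

```python
# Context window taken from the "Limitations" list in the README.
MAX_SEQ_LENGTH = 32_768


def max_new_tokens_budget(
    prompt_token_count: int,
    requested_new_tokens: int,
    max_seq_length: int = MAX_SEQ_LENGTH,
) -> int:
    """Return how many new tokens fit in the context window.

    Hypothetical helper: clamps the requested number of new tokens so that
    prompt tokens + new tokens never exceeds max_seq_length.
    """
    if prompt_token_count < 0 or requested_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Tokens left in the window after the prompt; may be negative if the
    # prompt alone is already too long.
    remaining = max_seq_length - prompt_token_count
    return max(0, min(requested_new_tokens, remaining))


if __name__ == "__main__":
    # A short prompt leaves the full request available.
    print(max_new_tokens_budget(prompt_token_count=10, requested_new_tokens=30))
    # A prompt near the limit gets a reduced budget.
    print(max_new_tokens_budget(prompt_token_count=32_760, requested_new_tokens=30))
```

With the tokenizer from the usage example, the prompt length could be obtained as `len(tokenizer(prompt)["input_ids"])` and the result passed as `max_new_tokens` to the pipeline call. A budget of `0` signals that the prompt itself should be shortened.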