The LLaMA-2-7B-32K model was continually pretrained on a Hungarian dataset.
The model's context length has been extended to 32K tokens with position interpolation.
Checkpoint: 100 000 steps
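Position interpolation rescales rotary position indices so that a long sequence maps back into the positional range the base model was trained on. A minimal sketch of the idea, assuming the base LLaMA-2 context of 4096 tokens (the scaling factor and the function below are illustrative, not the model's actual code):

# Sketch of linear position interpolation, under the assumption that the
# base model was trained with a 4096-token context window.
original_len = 4096
extended_len = 32768
scale = extended_len / original_len  # 8.0

def interpolated_position(m: int) -> float:
    # Position m of the long sequence is mapped back into [0, original_len)
    # before it reaches the rotary embedding.
    return m / scale

print(interpolated_position(32767))  # ~4095.9, inside the trained range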
Dataset for continued pretraining
Hungarian: 7.9 billion words, drawn from 763K documents that each exceed 5000 words in length
English: Long Context QA (2 billion words), BookSum (78 million words)
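The Hungarian subset was selected by document length; a hypothetical filter reproducing that criterion (whitespace word counting and the docs iterable are assumptions, not the authors' exact preprocessing):

# Hypothetical length filter; splitting on whitespace is an assumed
# word-count method, not the authors' documented preprocessing.
def keep_long_documents(docs, min_words=5000):
    for text in docs:
        if len(text.split()) > min_words:
            yield text

corpus = ["too short", "word " * 6000]
print(len(list(keep_long_documents(corpus))))  # 1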
Limitations
max_seq_length = 32 768
precision: float16
vocab size: 32 000
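These values can be read straight from the hosted configuration; a small check using the standard transformers API (the attribute names follow the usual LlamaConfig fields):

from transformers import AutoConfig

# Fetches the model's config from the Hub and prints the limits listed above.
config = AutoConfig.from_pretrained("NYTK/PULI-LlumiX-32K")
print(config.max_position_embeddings)  # expected: 32768
print(config.vocab_size)               # expected: 32000
print(config.torch_dtype)              # expected: torch.float16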
Usage with pipeline
from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("NYTK/PULI-LlumiX-32K")
tokenizer = LlamaTokenizer.from_pretrained("NYTK/PULI-LlumiX-32K")
# Hungarian prompt: "I'll tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."

generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])
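Because the checkpoint is stored in float16, passing an explicit dtype avoids the default float32 upcast and roughly halves memory use. The torch_dtype and device_map keyword arguments are standard from_pretrained options; device_map="auto" additionally requires the accelerate package. An optional loading variant:

import torch
from transformers import LlamaForCausalLM

# Optional: load the weights in their native float16 precision and let
# accelerate place them on the available device(s).
model = LlamaForCausalLM.from_pretrained(
    "NYTK/PULI-LlumiX-32K",
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)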