diff --git a/README.md b/README.md
index 55a06a5..8bbddea 100644
--- a/README.md
+++ b/README.md
@@ -4,3 +4,63 @@ license_name: nvidia-open-model-license
 license_link: >-
   https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
 ---
+
+# Minitron 8B Base
+
+Minitron is a family of small language models (SLMs) obtained by pruning NVIDIA's [Nemotron-4 15B](https://arxiv.org/abs/2402.16819) model. We prune the model's embedding size, attention heads, and MLP intermediate dimension, and then perform continued training with distillation to arrive at the final models.
+
+Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to **40x fewer training tokens** per model compared to training from scratch; this results in **compute cost savings of 1.8x** for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our [arXiv paper]() for more details.
+
+Minitron models are for research and development only.
+
+## HuggingFace Quickstart
+
+The [PR](https://github.com/huggingface/transformers/pull/31699) adding support for our models to Hugging Face Transformers is under review and is expected to be merged soon. In the meantime, this [branch](https://github.com/suiyoubi/transformers/tree/aot/nemotron-support) can be used to run our Minitron models.
+
+```
+git clone git@github.com:suiyoubi/transformers.git
+cd transformers
+git checkout aot/nemotron-support
+pip install .
+```
+The following code provides an example of how to load the Minitron-8B model and use it to perform text generation.
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+# Load the tokenizer and model
+model_path = "nvidia/Minitron-8B-Base"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+device='cuda'
+dtype=torch.bfloat16
+model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype, device_map=device)
+
+# Prepare the input text
+prompt = "To be or not to be,"
+input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
+
+# Generate the output
+output_ids = model.generate(input_ids, max_length=50, num_return_sequences=1)
+
+# Decode and print the output
+output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+print(output_text)
+```
+
+## License
+
+Minitron is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
+
+## Citation
+
+If you find our work helpful, please consider citing our paper:
+```
+@article{minitron2024,
+ title={Compact Language Models via Pruning and Knowledge Distillation},
+ author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},
+ journal={arXiv preprint arXiv:XXX},
+ year={2024}
+}
+```