Initialize project; model provided by the ModelHub XC community

Model: TheBloke/Chronos-13B-SuperHOT-8K-fp16 Source: Original Platform
2026-05-24 17:29:12 +08:00
commit 6f3ec1cdac
16 changed files with 95247 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,38 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00001-of-00003.bin filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00002-of-00003.bin filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00003-of-00003.bin filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,343 @@
+---
+inference: false
+license: other
+---
+
+<!-- header start -->
+<div style="width: 100%;">
+    <img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
+</div>
+<div style="display: flex; justify-content: space-between; width: 100%;">
+    <div style="display: flex; flex-direction: column; align-items: flex-start;">
+        <p><a href="https://discord.gg/theblokeai">Chat & support: my new Discord server</a></p>
+    </div>
+    <div style="display: flex; flex-direction: column; align-items: flex-end;">
+        <p><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
+    </div>
+</div>
+<!-- header end -->
+
+# Elinas' Chronos 13B fp16
+
+These are fp16 PyTorch-format model files for [Elinas' Chronos 13B](https://huggingface.co/elinas/chronos-13b) merged with [Kaio Ken's SuperHOT 8K](https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test).
+
+[Kaio Ken's SuperHOT 13b LoRA](https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test) is merged onto the base model, and 8K context can then be achieved during inference by using `trust_remote_code=True`.
+
+Note that `config.json` has been set to a sequence length of 8192. This can be modified to 4096 if you want to try with a smaller sequence length.
+
+## Repositories available
+
+* [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/Chronos-13B-SuperHOT-8K-GPTQ)
+* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU inference](https://huggingface.co/TheBloke/Chronos-13B-SuperHOT-8K-GGML)
+* [Unquantised SuperHOT fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Chronos-13B-SuperHOT-8K-fp16)
+* [Unquantised base fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/elinas/chronos-13b)
+
+## How to use this model from Python code
+
+First make sure you have Einops installed:
+
+```
+pip3 install einops
+```
+
+Then run the following code. `config.json` defaults to a sequence length of 8192, but you can also configure this in your Python code.
+
+The provided modelling code, activated with `trust_remote_code=True`, automatically sets the `scale` parameter from the configured `max_position_embeddings` as `2048 / max_position_embeddings`. For example, for 8192, `scale` is set to `0.25`.
+
+```python
+from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, pipeline
+
+model_name_or_path = "TheBloke/Chronos-13B-SuperHOT-8K-fp16"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+
+config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
+# Change this to the sequence length you want
+config.max_position_embeddings = 8192
+
+model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
+        config=config,
+        trust_remote_code=True,
+        device_map='auto')
+
+# Note: confirm that this prompt template is correct for this model! The Chronos model card below recommends Alpaca formatting.
+prompt = "Tell me about AI"
+prompt_template=f'''USER: {prompt}
+ASSISTANT:'''
+
+print("\n\n*** Generate:")
+
+input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
+output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
+print(tokenizer.decode(output[0]))
+
+# Inference can also be done using transformers' pipeline
+
+print("*** Pipeline:")
+pipe = pipeline(
+    "text-generation",
+    model=model,
+    tokenizer=tokenizer,
+    max_new_tokens=512,
+    do_sample=True,
+    temperature=0.7,
+    top_p=0.95,
+    repetition_penalty=1.15
+)
+
+print(pipe(prompt_template)[0]['generated_text'])
+```
+
+## Using other UIs: monkey patch
+
+Provided in the repo is `llama_rope_scaled_monkey_patch.py`, written by @kaiokendev.
+
+It can theoretically be added to any Python UI or custom code to achieve the same result as `trust_remote_code=True`. I have not tested this, and it should be superseded by using `trust_remote_code=True`, but I include it for completeness and for interest.
+
+<!-- footer start -->
+## Discord
+
+For further support, and discussions on these models and AI in general, join us at:
+
+[TheBloke AI's Discord server](https://discord.gg/theblokeai)
+
+## Thanks, and how to contribute.
+
+Thanks to the [chirper.ai](https://chirper.ai) team!
+
+I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
+
+If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
+
+Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
+
+* Patreon: https://patreon.com/TheBlokeAI
+* Ko-Fi: https://ko-fi.com/TheBlokeAI
+
+**Special thanks to**: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.
+
+**Patreon special mentions**: Pyrater, WelcomeToTheClub, Kalila, Mano Prime, Trenton Dambrowitz, Spiking Neurons AB, Pierre Kircher, Fen Risland, Kevin Schuppel, Luke, Rainer Wilmers, vamX, Gabriel Puliatti, Alex, Karl Bernard, Ajan Kanaga, Talal Aujan, Space Cruiser, ya boyyy, biorpg, Johann-Peter Hartmann, Asp the Wyvern, Ai Maven, Ghost, Preetika Verma, Nikolai Manek, trip7s trip, John Detwiler, Fred von Graf, Artur Olbinski, subjectnull, John Villwock, Junyu Yang, Rod A, Lone Striker, Chris McCloskey, Iucharbius, Matthew Berman, Illia Dulskyi, Khalefa Al-Ahmad, Imad Khwaja, chris gileta, Willem Michiel, Greatston Gnanesh, Derek Yates, K, Alps Aficionado, Oscar Rangel, David Flickinger, Luke Pendergrass, Deep Realms, Eugene Pentland, Cory Kujawski, terasurfer, Jonathan Leane, senxiiz, Joseph William Delisle, Sean Connelly, webtim, zynix, Nathan LeClaire.
+
+Thank you to all my generous patrons and donors!
+
+<!-- footer end -->
+
+# Original model card: Kaio Ken's SuperHOT 8K
+
+### SuperHOT Prototype 2 w/ 8K Context
+
+This is a second prototype of SuperHOT, this time 30B with 8K context and no RLHF, using the same technique described in [the github blog](https://kaiokendev.github.io/til#extending-context-to-8k).
+Tests have shown that the model does indeed leverage the extended context at 8K.
+
+You will need to **use either the monkeypatch** or, if you are already using the monkeypatch, **change the scaling factor to 0.25 and the maximum sequence length to 8192**
+
+#### Looking for Merged & Quantized Models?
+- 30B 4-bit CUDA: [tmpupload/superhot-30b-8k-4bit-safetensors](https://huggingface.co/tmpupload/superhot-30b-8k-4bit-safetensors)
+- 30B 4-bit CUDA 128g: [tmpupload/superhot-30b-8k-4bit-128g-safetensors](https://huggingface.co/tmpupload/superhot-30b-8k-4bit-128g-safetensors)
+
+
+#### Training Details
+I trained the LoRA with the following configuration:
+- 1200 samples (~400 samples over 2048 sequence length)
+- learning rate of 3e-4
+- 3 epochs
+- The exported modules are:
+  - q_proj
+  - k_proj
+  - v_proj
+  - o_proj
+- no bias
+- Rank = 4
+- Alpha = 8
+- no dropout
+- weight decay of 0.1
+- AdamW beta1 of 0.9 and beta2 0.99, epsilon of 1e-5
+- Trained on 4-bit base model
+
+# Original model card: Elinas' Chronos 13B
+
+
+# chronos-13b
+
+This is the fp16 PyTorch / HF version of **chronos-13b**
+
+This model is primarily focused on chat, roleplay, and storywriting, but can accomplish other tasks such as simple reasoning and coding.
+
+Chronos generates very long outputs with coherent text, largely due to the human inputs it was trained on.
+
+This model uses Alpaca formatting, so for optimal model performance, use:
+```
+### Instruction:
+Your instruction or question here.
+### Response:
+```
+
+[4bit Quantized version](https://huggingface.co/elinas/chronos-13b-4bit)
+
+[GGML Version provided by @TheBloke](https://huggingface.co/TheBloke/chronos-13B-GGML)
+
+<!--**Support My Development of New Models**
+<a href='https://ko-fi.com/Q5Q6MB734' target='_blank'><img height='36' style='border:0px;height:36px;' 
+src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Support Development' /></a>-->
+
+---
+license: other
+---
+# LLaMA Model Card
+
+## Model details
+**Organization developing the model**
+The FAIR team of Meta AI.
+
+**Model date**
+LLaMA was trained between December 2022 and February 2023.
+
+**Model version**
+This is version 1 of the model.
+
+**Model type**
+LLaMA is an auto-regressive language model, based on the transformer architecture. The model comes in different sizes: 7B, 13B, 33B and 65B parameters.
+
+**Paper or resources for more information**
+More information can be found in the paper “LLaMA, Open and Efficient Foundation Language Models”, available at https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/.
+
+**Citations details**
+https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/
+
+**License**
+Non-commercial bespoke license
+
+**Where to send questions or comments about the model**
+Questions and comments about LLaMA can be sent via the [GitHub repository](https://github.com/facebookresearch/llama) of the project, by opening an issue.
+
+## Intended use
+**Primary intended uses**
+The primary use of LLaMA is research on large language models, including:
+- exploring potential applications such as question answering, natural language understanding or reading comprehension,
+- understanding capabilities and limitations of current language models, and developing techniques to improve those,
+- evaluating and mitigating biases, risks, toxic and harmful content generations, hallucinations.
+
+**Primary intended users**
+The primary intended users of the model are researchers in natural language processing, machine learning and artificial intelligence.
+
+**Out-of-scope use cases**
+LLaMA is a base, or foundational, model. As such, it should not be used on downstream applications without further risk evaluation and mitigation. In particular, our model has not been trained with human feedback, and can thus generate toxic or offensive content, incorrect information or generally unhelpful answers.
+
+## Factors
+**Relevant factors**
+One of the most relevant factors for which model performance may vary is which language is used. Although we included 20 languages in the training data, most of our dataset is made of English text, and we thus expect the model to perform better for English than other languages. Relatedly, it has been shown in previous studies that performance might vary for different dialects, and we expect that it will be the case for our model.
+
+**Evaluation factors**
+As our model is trained on data from the Web, we expect that it reflects biases from this source. We thus evaluated on RAI datasets to measure biases exhibited by the model for gender, religion, race, sexual orientation, age, nationality, disability, physical appearance and socio-economic status. We also measure the toxicity of model generations, depending on the toxicity of the context used to prompt the model.
+
+## Metrics
+**Model performance measures**
+We use the following measures to evaluate the model:
+- Accuracy for common sense reasoning, reading comprehension, natural language understanding (MMLU), BIG-bench hard, WinoGender and CrowS-Pairs,
+- Exact match for question answering,
+- The toxicity score from Perspective API on RealToxicityPrompts.
+
+**Decision thresholds**
+Not applicable.
+
+**Approaches to uncertainty and variability**
+Due to the high computational requirements of training LLMs, we trained only one model of each size, and thus could not evaluate variability of pre-training.
+
+## Evaluation datasets
+The model was evaluated on the following benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU, BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs.
+
+## Training dataset
+The model was trained using the following sources of data: CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange [2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. See the paper for more details about the training set and corresponding preprocessing.
+
+## Quantitative analysis
+Hyperparameters for the model architecture
+
+
+<table>
+    <thead>
+            <tr>
+            <th >LLaMA</th> <th colspan=6>Model hyper parameters </th>
+            </tr>
+            <tr>
+            <th>Number of parameters</th><th>dimension</th><th>n heads</th><th>n layers</th><th>Learn rate</th><th>Batch size</th><th>n tokens</th>
+            </tr>           
+        </thead>
+    <tbody>       
+        <tr>
+            <th>7B</th> <th>4096</th> <th>32</th> <th>32</th> <th>3.0E-04</th><th>4M</th><th>1T</th>
+        </tr>
+        <tr>
+            <th>13B</th><th>5120</th><th>40</th><th>40</th><th>3.0E-04</th><th>4M</th><th>1T</th>
+        </tr>
+        <tr>
+            <th>33B</th><th>6656</th><th>52</th><th>60</th><th>1.5E-04</th><th>4M</th><th>1.4T</th>
+        </tr>
+        <tr>
+            <th>65B</th><th>8192</th><th>64</th><th>80</th><th>1.5E-04</th><th>4M</th><th>1.4T</th>
+        </tr>
+    </tbody>
+</table>
+
+*Table 1 - Summary of LLaMA Model Hyperparameters*
+
+We present our results on eight standard common sense reasoning benchmarks in the table below. 
+<table>
+    <thead>
+            <tr>
+            <th>LLaMA</th>  <th colspan=9>Reasoning tasks </th>
+            </tr>
+            <tr>
+            <th>Number of parameters</th> <th>BoolQ</th><th>PIQA</th><th>SIQA</th><th>HellaSwag</th><th>WinoGrande</th><th>ARC-e</th><th>ARC-c</th><th>OBQA</th><th>COPA</th>
+            </tr>           
+        </thead>
+    <tbody>    
+    <tr><th>7B</th><th>76.5</th><th>79.8</th><th>48.9</th><th>76.1</th><th>70.1</th><th>76.7</th><th>47.6</th><th>57.2</th><th>93</th></tr>
+    <tr><th>13B</th><th>78.1</th><th>80.1</th><th>50.4</th><th>79.2</th><th>73</th><th>78.1</th><th>52.7</th><th>56.4</th><th>94</th></tr>
+    <tr><th>33B</th><th>83.1</th><th>82.3</th><th>50.4</th><th>82.8</th><th>76</th><th>81.4</th><th>57.8</th><th>58.6</th><th>92</th></tr>
+    <tr><th>65B</th><th>85.3</th><th>82.8</th><th>52.3</th><th>84.2</th><th>77</th><th>81.5</th><th>56</th><th>60.2</th><th>94</th></tr>
+    </tbody>
+</table>
+
+*Table 2 - Summary of LLaMA Model Performance on Reasoning tasks*
+
+
+We present our results on bias in the table below. Note that a lower value is better, indicating lower bias.
+
+
+| No  | Category             | FAIR LLM |
+| --- | -------------------- | -------- |
+| 1   | Gender               | 70.6     |
+| 2   | Religion             | 79       |
+| 3   | Race/Color           | 57       |
+| 4   | Sexual orientation   | 81       |
+| 5   | Age                  | 70.1     |
+| 6   | Nationality          | 64.2     |
+| 7   | Disability           | 66.7     |
+| 8   | Physical appearance  | 77.8     |
+| 9   | Socioeconomic status | 71.5     |
+|     | LLaMA Average        | 66.6     |
+
+*Table 3 - Summary of bias in our model output*
+
+
+
+## Ethical considerations
+**Data**
+The data used to train the model is collected from various sources, mostly from the Web. As such, it contains offensive, harmful and biased content. We thus expect the model to exhibit such biases from the training data.
+
+**Human life**
+The model is not intended to inform decisions about matters central to human life, and should not be used in such a way.
+
+**Mitigations**
+We filtered the data from the Web based on its proximity to Wikipedia text and references. For this, we used a Kneser-Ney language model and a fastText linear classifier.
+
+**Risks and harms**
+Risks and harms of large language models include the generation of harmful, offensive or biased content. These models are often prone to generating incorrect information, sometimes referred to as hallucinations. We do not expect our model to be an exception in this regard.
+
+**Use cases**
+LLaMA is a foundational model, and as such, it should not be used for downstream applications without further investigation and mitigations of risks. These risks and potential fraught use cases include, but are not limited to: generation of misinformation and generation of harmful, biased or offensive content.
--- a/config.json
+++ b/config.json
@@ -0,0 +1,28 @@
+{
+    "_name_or_path": "/workspace/superhot_process/chronos-13b/source",
+    "architectures": [
+        "LlamaForCausalLM"
+    ],
+    "bos_token_id": 1,
+    "eos_token_id": 2,
+    "hidden_act": "silu",
+    "hidden_size": 5120,
+    "initializer_range": 0.02,
+    "intermediate_size": 13824,
+    "max_position_embeddings": 8192,
+    "model_type": "llama",
+    "num_attention_heads": 40,
+    "num_hidden_layers": 40,
+    "pad_token_id": 0,
+    "rms_norm_eps": 1e-06,
+    "tie_word_embeddings": false,
+    "torch_dtype": "float16",
+    "transformers_version": "4.30.0.dev0",
+    "use_cache": true,
+    "vocab_size": 32000,
+    "auto_map": {
+        "AutoModel": "modelling_llama.LlamaModel",
+        "AutoModelForCausalLM": "modelling_llama.LlamaForCausalLM",
+        "AutoModelForSequenceClassification": "modelling_llama.LlamaForSequenceClassification"
+    }
+}
--- a/configuration.json
+++ b/configuration.json
@@ -0,0 +1 @@
+{"framework": "pytorch", "task": "text-generation", "allow_remote": true}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,7 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 0,
+  "transformers_version": "4.30.0.dev0"
+}
--- a/huggingface-metadata.txt
+++ b/huggingface-metadata.txt
@@ -0,0 +1,8 @@
+url: https://huggingface.co/TheBloke/Chronos-13B-SuperHOT-8K-fp16
+branch: main
+download date: 2023-06-29 15:32:14
+sha256sum:
+    92d93b1d27232a29e4c0f8d7db067db7a1454fdaebf510f5fb73f15b9a1e4b2e pytorch_model-00001-of-00003.bin
+    b386739f88805d069db2420301d11e0a3eca70108a8fc4e2438bae35c6d89741 pytorch_model-00002-of-00003.bin
+    a4a34ee3cd5d8c09531d0b672e9a0ec6518a07cf5490ebdd0bfbffacdf740d5e pytorch_model-00003-of-00003.bin
+    9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 tokenizer.model
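The digests recorded in `huggingface-metadata.txt` can be checked against a local download. The following is a minimal sketch, assuming the model files sit in the current directory; `sha256_of` and `verify` are hypothetical helper names, not part of this repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(expected, root="."):
    """Map each file name to True/False (digest match) or None (file missing)."""
    results = {}
    for name, want in expected.items():
        path = Path(root) / name
        results[name] = (sha256_of(path) == want) if path.is_file() else None
    return results


# Digests copied from huggingface-metadata.txt.
EXPECTED = {
    "pytorch_model-00001-of-00003.bin": "92d93b1d27232a29e4c0f8d7db067db7a1454fdaebf510f5fb73f15b9a1e4b2e",
    "pytorch_model-00002-of-00003.bin": "b386739f88805d069db2420301d11e0a3eca70108a8fc4e2438bae35c6d89741",
    "pytorch_model-00003-of-00003.bin": "a4a34ee3cd5d8c09531d0b672e9a0ec6518a07cf5490ebdd0bfbffacdf740d5e",
    "tokenizer.model": "9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347",
}

if __name__ == "__main__":
    for name, ok in verify(EXPECTED).items():
        status = {True: "OK", False: "MISMATCH", None: "missing"}[ok]
        print(f"{status:9s} {name}")
```

Reading in chunks keeps memory use flat even for the multi-gigabyte weight shards.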
--- a/llama_rope_scaled_monkey_patch.py
+++ b/llama_rope_scaled_monkey_patch.py
@@ -0,0 +1,65 @@
+import torch
+import transformers
+import transformers.models.llama.modeling_llama
+from einops import rearrange
+import random
+
+# This monkey patch file is not needed if using ExLlama, or if using `trust_remote_code=True`
+
+class ScaledRotaryEmbedding(torch.nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
+        self.register_buffer("inv_freq", inv_freq)
+        
+        max_position_embeddings = 8192
+
+        # Build here to make `torch.jit.trace` work.
+        self.max_seq_len_cached = max_position_embeddings
+        t = torch.arange(
+            self.max_seq_len_cached,
+            device=self.inv_freq.device,
+            dtype=self.inv_freq.dtype,
+        )
+
+        self.scale = 1 / 4
+        t *= self.scale
+
+        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+        # Different from paper, but it uses a different permutation in order to obtain the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer(
+            "cos_cached", emb.cos()[None, None, :, :], persistent=False
+        )
+        self.register_buffer(
+            "sin_cached", emb.sin()[None, None, :, :], persistent=False
+        )
+
+    def forward(self, x, seq_len=None):
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
+        if seq_len > self.max_seq_len_cached:
+            self.max_seq_len_cached = seq_len
+            t = torch.arange(
+                self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype
+            )
+            t *= self.scale
+            freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+            # Different from paper, but it uses a different permutation in order to obtain the same calculation
+            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
+            self.register_buffer(
+                "cos_cached", emb.cos()[None, None, :, :], persistent=False
+            )
+            self.register_buffer(
+                "sin_cached", emb.sin()[None, None, :, :], persistent=False
+            )
+        return (
+            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+        )
+
+
+def replace_llama_rope_with_scaled_rope():
+    transformers.models.llama.modeling_llama.LlamaRotaryEmbedding = (
+        ScaledRotaryEmbedding
+    )
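The effect of the `scale` factor in the rotary embedding above can be illustrated without PyTorch. This is a minimal sketch in plain Python, not the repository's implementation; the helper names `inv_frequencies` and `rotary_angles` are hypothetical, and the constants are taken from `config.json` and `modelling_llama.py` in this commit.

```python
def inv_frequencies(dim, base=10000.0):
    """Inverse frequencies 1 / base**(i/dim) for even i, as in the rotary embedding."""
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


def rotary_angles(position, dim, scale=1.0, base=10000.0):
    """Rotation angles for one position, with the position index multiplied by `scale`."""
    t = position * scale
    return [t * f for f in inv_frequencies(dim, base)]


HEAD_DIM = 128          # hidden_size 5120 / 40 attention heads, per config.json
BASE_CONTEXT = 2048     # numerator used for the scale in modelling_llama.py
MAX_POSITIONS = 8192    # max_position_embeddings in config.json

scale = BASE_CONTEXT / MAX_POSITIONS  # 0.25, the 1/4 hard-coded in the monkey patch

# The last position of the 8K window is mapped inside the original 2K range.
assert (MAX_POSITIONS - 1) * scale < BASE_CONTEXT

# Position 4096 at scale 0.25 gets the same rotation angles as position 1024 unscaled.
assert rotary_angles(4096, HEAD_DIM, scale) == rotary_angles(1024, HEAD_DIM)
```

In other words, positions are interpolated into the range the base model already saw during training rather than extrapolated beyond it.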
--- a/modelling_llama.py
+++ b/modelling_llama.py
@@ -0,0 +1,894 @@
+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch LLaMA model."""
+import math
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+from transformers.activations import ACT2FN
+from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
+from transformers.models.llama.modeling_llama import LlamaConfig
+
+logger = logging.get_logger(__name__)
+
+_CONFIG_FOR_DOC = "LlamaConfig"
+
+
+# Copied from transformers.models.bart.modeling_bart._make_causal_mask
+def _make_causal_mask(
+    input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
+):
+    """
+    Make causal mask used for bi-directional self-attention.
+    """
+    bsz, tgt_len = input_ids_shape
+    mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
+    mask_cond = torch.arange(mask.size(-1), device=device)
+    mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
+    mask = mask.to(dtype)
+
+    if past_key_values_length > 0:
+        mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
+    return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
+
+
+# Copied from transformers.models.bart.modeling_bart._expand_mask
+def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
+    """
+    Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+    """
+    bsz, src_len = mask.size()
+    tgt_len = tgt_len if tgt_len is not None else src_len
+
+    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+
+    inverted_mask = 1.0 - expanded_mask
+
+    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
+
+
+class LlamaRMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        LlamaRMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+
+        return (self.weight * hidden_states).to(input_dtype)
+
+
+class LlamaRotaryEmbedding(torch.nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, scale=1, device=None):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
+        self.register_buffer("inv_freq", inv_freq)
+
+        # Build here to make `torch.jit.trace` work.
+        self.max_seq_len_cached = max_position_embeddings
+        t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
+
+        self.scale = scale
+        t *= self.scale
+
+        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+        # Different from paper, but it uses a different permutation in order to obtain the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        dtype = torch.get_default_dtype()
+        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
+
+    def forward(self, x, seq_len=None):
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
+        if seq_len > self.max_seq_len_cached:
+            self.max_seq_len_cached = seq_len
+            t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype)
+            t *= self.scale  # apply the same position scaling as in __init__
+            freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+            # Different from paper, but it uses a different permutation in order to obtain the same calculation
+            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
+            self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(x.dtype), persistent=False)
+            self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(x.dtype), persistent=False)
+        return (
+            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+        )
+
+
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+
+
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
+    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
+    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
+    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
+    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
+    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+
+
+class LlamaMLP(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        intermediate_size: int,
+        hidden_act: str,
+    ):
+        super().__init__()
+        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
+        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.act_fn = ACT2FN[hidden_act]
+
+    def forward(self, x):
+        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+
+
+class LlamaAttention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(self, config: LlamaConfig):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.max_position_embeddings = config.max_position_embeddings
+        self.position_embeddings_scale = 2048 / self.max_position_embeddings
+
+        if (self.head_dim * self.num_heads) != self.hidden_size:
+            raise ValueError(
+                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                f" and `num_heads`: {self.num_heads})."
+            )
+        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+        self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+        self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+        self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings, scale=self.position_embeddings_scale)
+
+    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        bsz, q_len, _ = hidden_states.size()
+
+        query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+
+        kv_seq_len = key_states.shape[-2]
+        if past_key_value is not None:
+            kv_seq_len += past_key_value[0].shape[-2]
+        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+        # [bsz, nh, t, hd]
+
+        if past_key_value is not None:
+            # reuse k, v, self_attention
+            key_states = torch.cat([past_key_value[0], key_states], dim=2)
+            value_states = torch.cat([past_key_value[1], value_states], dim=2)
+
+        past_key_value = (key_states, value_states) if use_cache else None
+
+        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+        if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+            raise ValueError(
+                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+                f" {attn_weights.size()}"
+            )
+
+        if attention_mask is not None:
+            if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+                raise ValueError(
+                    f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+                )
+            attn_weights = attn_weights + attention_mask
+            attn_weights = torch.max(
+                attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
+            )
+
+        # upcast attention to fp32
+        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+        attn_output = torch.matmul(attn_weights, value_states)
+
+        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+            raise ValueError(
+                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+                f" {attn_output.size()}"
+            )
+
+        attn_output = attn_output.transpose(1, 2)
+        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+        attn_output = self.o_proj(attn_output)
+
+        if not output_attentions:
+            attn_weights = None
+
+        return attn_output, attn_weights, past_key_value
+
+
+class LlamaDecoderLayer(nn.Module):
+    def __init__(self, config: LlamaConfig):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.self_attn = LlamaAttention(config=config)
+        self.mlp = LlamaMLP(
+            hidden_size=self.hidden_size,
+            intermediate_size=config.intermediate_size,
+            hidden_act=config.hidden_act,
+        )
+        self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        output_attentions: Optional[bool] = False,
+        use_cache: Optional[bool] = False,
+    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
+                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+        """
+
+        residual = hidden_states
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        # Self Attention
+        hidden_states, self_attn_weights, present_key_value = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+        )
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (self_attn_weights,)
+
+        if use_cache:
+            outputs += (present_key_value,)
+
+        return outputs
+
+
+LLAMA_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`LlamaConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+@add_start_docstrings(
+    "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
+    LLAMA_START_DOCSTRING,
+)
+class LlamaPreTrainedModel(PreTrainedModel):
+    config_class = LlamaConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["LlamaDecoderLayer"]
+    _skip_keys_device_placement = "past_key_values"
+    _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
+
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+
+    def _set_gradient_checkpointing(self, module, value=False):
+        if isinstance(module, LlamaModel):
+            module.gradient_checkpointing = value
+
+
+LLAMA_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+            it.
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            [What are input IDs?](../glossary#input-ids)
+        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+            [What are attention masks?](../glossary#attention-mask)
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
+            `past_key_values`).
+
+            If you want to change padding behavior, you should read [`LlamaModel._prepare_decoder_attention_mask`]
+            and modify it to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+            information on the default strategy.
+
+            Internally the mask is expanded to shape `(batch_size, 1, tgt_len, src_len)` and combined with a causal
+            mask before being added to the attention scores.
+        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
+            config.max_position_embeddings - 1]`.
+
+            [What are position IDs?](../glossary#position-ids)
+        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+            Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors
+            (the cached key and value states) of shape
+            `(batch_size, num_heads, sequence_length, embed_size_per_head)`.
+
+            Contains pre-computed hidden-states (keys and values in the self-attention blocks) that can be used (see
+            `past_key_values` input) to speed up sequential decoding.
+
+            If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that
+            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
+            `input_ids` of shape `(batch_size, sequence_length)`.
+        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+            model's internal embedding lookup matrix.
+        use_cache (`bool`, *optional*):
+            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+            `past_key_values`).
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
+    LLAMA_START_DOCSTRING,
+)
+class LlamaModel(LlamaPreTrainedModel):
+    """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
+
+    Args:
+        config: LlamaConfig
+    """
+
+    def __init__(self, config: LlamaConfig):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
+        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+
+    # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
+    def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
+        # create causal mask
+        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+        combined_attention_mask = None
+        if input_shape[-1] > 1:
+            combined_attention_mask = _make_causal_mask(
+                input_shape,
+                inputs_embeds.dtype,
+                device=inputs_embeds.device,
+                past_key_values_length=past_key_values_length,
+            )
+
+        if attention_mask is not None:
+            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+            expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
+                inputs_embeds.device
+            )
+            combined_attention_mask = (
+                expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
+            )
+
+        return combined_attention_mask
+
+    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # retrieve input_ids and inputs_embeds
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+        elif input_ids is not None:
+            batch_size, seq_length = input_ids.shape
+        elif inputs_embeds is not None:
+            batch_size, seq_length, _ = inputs_embeds.shape
+        else:
+            raise ValueError("You have to specify either input_ids or inputs_embeds")
+
+        seq_length_with_past = seq_length
+        past_key_values_length = 0
+
+        if past_key_values is not None:
+            past_key_values_length = past_key_values[0][0].shape[2]
+            seq_length_with_past = seq_length_with_past + past_key_values_length
+
+        if position_ids is None:
+            device = input_ids.device if input_ids is not None else inputs_embeds.device
+            position_ids = torch.arange(
+                past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
+            )
+            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
+        else:
+            position_ids = position_ids.view(-1, seq_length).long()
+
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+        # build the attention mask (all ones when none is provided)
+        if attention_mask is None:
+            attention_mask = torch.ones(
+                (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
+            )
+        attention_mask = self._prepare_decoder_attention_mask(
+            attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
+        )
+
+        hidden_states = inputs_embeds
+
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        next_decoder_cache = () if use_cache else None
+
+        for idx, decoder_layer in enumerate(self.layers):
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+
+            past_key_value = past_key_values[idx] if past_key_values is not None else None
+
+            if self.gradient_checkpointing and self.training:
+
+                def create_custom_forward(module):
+                    def custom_forward(*inputs):
+                        # None for past_key_value
+                        return module(*inputs, output_attentions, None)
+
+                    return custom_forward
+
+                layer_outputs = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(decoder_layer),
+                    hidden_states,
+                    attention_mask,
+                    position_ids,
+                    None,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=attention_mask,
+                    position_ids=position_ids,
+                    past_key_value=past_key_value,
+                    output_attentions=output_attentions,
+                    use_cache=use_cache,
+                )
+
+            hidden_states = layer_outputs[0]
+
+            if use_cache:
+                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
+
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+
+        hidden_states = self.norm(hidden_states)
+
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        next_cache = next_decoder_cache if use_cache else None
+        if not return_dict:
+            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=next_cache,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+
+
+class LlamaForCausalLM(LlamaPreTrainedModel):
+    _tied_weights_keys = ["lm_head.weight"]
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = LlamaModel(config)
+
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.model = decoder
+
+    def get_decoder(self):
+        return self.model
+
+    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the next-token prediction loss. Indices should either be in `[0, ...,
+                config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
+
+        Returns:
+
+        Example:
+
+        ```python
+        >>> from transformers import AutoTokenizer, LlamaForCausalLM
+
+        >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
+        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
+
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        hidden_states = outputs[0]
+        logits = self.lm_head(hidden_states)
+
+        loss = None
+        if labels is not None:
+            # Shift so that tokens < n predict n
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            # Flatten the tokens
+            loss_fct = CrossEntropyLoss()
+            shift_logits = shift_logits.view(-1, self.config.vocab_size)
+            shift_labels = shift_labels.view(-1)
+            # Enable model parallelism
+            shift_labels = shift_labels.to(shift_logits.device)
+            loss = loss_fct(shift_logits, shift_labels)
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+    def prepare_inputs_for_generation(
+        self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+    ):
+        if past_key_values:
+            input_ids = input_ids[:, -1:]
+
+        position_ids = kwargs.get("position_ids", None)
+        if attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if past_key_values:
+                position_ids = position_ids[:, -1].unsqueeze(-1)
+
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and past_key_values is None:
+            model_inputs = {"inputs_embeds": inputs_embeds}
+        else:
+            model_inputs = {"input_ids": input_ids}
+
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "past_key_values": past_key_values,
+                "use_cache": kwargs.get("use_cache"),
+                "attention_mask": attention_mask,
+            }
+        )
+        return model_inputs
+
+    @staticmethod
+    def _reorder_cache(past_key_values, beam_idx):
+        reordered_past = ()
+        for layer_past in past_key_values:
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
+        return reordered_past
+
+
+@add_start_docstrings(
+    """
+    The LLaMA Model transformer with a sequence classification head on top (linear layer).
+
+    [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
+    (e.g. GPT-2) do.
+
+    Since it does classification on the last token, it needs to know the position of the last token. If a
+    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
+    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
+    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
+    each row of the batch).
+    """,
+    LLAMA_START_DOCSTRING,
+)
+class LlamaForSequenceClassification(LlamaPreTrainedModel):
+    _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.model = LlamaModel(config)
+        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        transformer_outputs = self.model(
+            input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+        logits = self.score(hidden_states)
+
+        if input_ids is not None:
+            batch_size = input_ids.shape[0]
+        else:
+            batch_size = inputs_embeds.shape[0]
+
+        if self.config.pad_token_id is None and batch_size != 1:
+            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
+        if self.config.pad_token_id is None:
+            sequence_lengths = -1
+        else:
+            if input_ids is not None:
+                sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
+            else:
+                sequence_lengths = -1
+
+        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
+
+        loss = None
+        if labels is not None:
+            labels = labels.to(logits.device)
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(pooled_logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(pooled_logits, labels)
+        if not return_dict:
+            output = (pooled_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutputWithPast(
+            loss=loss,
+            logits=pooled_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
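Note on the last-token pooling in `LlamaForSequenceClassification.forward` above: the index is computed as the count of non-pad tokens minus one, which only lands on the last real token when padding is on the right. The sketch below mirrors that arithmetic in plain Python for illustration; `last_token_index` and the `PAD` value are hypothetical names introduced here, not part of the model file.

```python
from typing import List, Optional


def last_token_index(row: List[int], pad_token_id: Optional[int]) -> int:
    """Mirror `torch.ne(input_ids, pad_token_id).sum(-1) - 1` for a single row."""
    if pad_token_id is None:
        # Same fallback as the model code: use the final position.
        return -1
    return sum(1 for tok in row if tok != pad_token_id) - 1


PAD = 0  # hypothetical pad id, chosen only for this illustration

batch = [
    [5, 6, 7, PAD, PAD],  # right-padded, three real tokens
    [8, 9, 10, 11, 12],   # no padding
]

indices = [last_token_index(row, PAD) for row in batch]
pooled_tokens = [row[i] for row, i in zip(batch, indices)]

# With left padding the same formula points at an earlier token,
# not the last real one.
left_padded = [PAD, PAD, 5, 6, 7]
left_index = last_token_index(left_padded, PAD)
```

Under these assumptions, batches fed to the classification head should be right-padded, or the index computation adapted, so that the pooled logit comes from the final real token.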
--- a/pytorch_model-00001-of-00003.bin
+++ b/pytorch_model-00001-of-00003.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:92d93b1d27232a29e4c0f8d7db067db7a1454fdaebf510f5fb73f15b9a1e4b2e
+size 9948728430
--- a/pytorch_model-00002-of-00003.bin
+++ b/pytorch_model-00002-of-00003.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b386739f88805d069db2420301d11e0a3eca70108a8fc4e2438bae35c6d89741
+size 9904165024
--- a/pytorch_model-00003-of-00003.bin
+++ b/pytorch_model-00003-of-00003.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a4a34ee3cd5d8c09531d0b672e9a0ec6518a07cf5490ebdd0bfbffacdf740d5e
+size 6506663689
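Note on the three pointer files above: each records only a spec version, a SHA-256 object id, and a byte size, while the real weights live in LFS storage. A downloaded shard can be checked against its pointer roughly as follows. This is a minimal sketch using only the Python standard library; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path is whatever the shard was saved as.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and hash."""
    path = Path(path)
    # Cheap check first: a size mismatch avoids hashing gigabytes of data.
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Stream in chunks so multi-gigabyte shards are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


POINTER_SHARD_3 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a4a34ee3cd5d8c09531d0b672e9a0ec6518a07cf5490ebdd0bfbffacdf740d5e
size 6506663689
"""

if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_SHARD_3)
    print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("pytorch_model-00003-of-00003.bin", pointer)` on a local copy would then report whether the download is intact.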
--- a/pytorch_model.bin.index.json
+++ b/pytorch_model.bin.index.json
@@ -0,0 +1,410 @@
+{
+  "metadata": {
+    "total_size": 26031738880
+  },
+  "weight_map": {
+    "lm_head.weight": "pytorch_model-00003-of-00003.bin",
+    "model.embed_tokens.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.10.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.11.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.12.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.13.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.14.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.15.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.15.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.15.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.15.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.15.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.15.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.15.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.15.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.15.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.16.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.17.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.18.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.19.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.20.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.20.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.21.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.22.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.23.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.24.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.25.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.26.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.27.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.28.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.29.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.30.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.30.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.30.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.30.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.30.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.30.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.30.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.30.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00003.bin",
+    "model.layers.30.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
+    "model.layers.31.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.31.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.32.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.33.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.34.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.35.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.36.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.37.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.38.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.self_attn.rotary_emb.inv_freq": "pytorch_model-00003-of-00003.bin",
+    "model.layers.39.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
+    "model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.7.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.8.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00003.bin",
+    "model.layers.9.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
+    "model.norm.weight": "pytorch_model-00003-of-00003.bin"
+  }
+}
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,23 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer.model
+++ b/tokenizer.model
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,33 @@
+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "clean_up_tokenization_spaces": false,
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": null,
+  "sp_model_kwargs": {},
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
@@ -0,0 +1 @@
+{"framework": "pytorch", "task": "text-generation", "allow_remote": true}