Initialize project; model provided by the ModelHub XC community

Model: BAAI/OpenSeek-Mid-v1 Source: Original Platform
2026-05-19 20:10:08 +08:00
commit db932fe0b1
18 changed files with 3113 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,66 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*.tfevents* filter=lfs diff=lfs merge=lfs -text
+*.db* filter=lfs diff=lfs merge=lfs -text
+*.ark* filter=lfs diff=lfs merge=lfs -text
+**/*ckpt*data* filter=lfs diff=lfs merge=lfs -text
+**/*ckpt*.meta filter=lfs diff=lfs merge=lfs -text
+**/*ckpt*.index filter=lfs diff=lfs merge=lfs -text
+ 
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.gguf* filter=lfs diff=lfs merge=lfs -text
+*.ggml filter=lfs diff=lfs merge=lfs -text
+*.llamafile* filter=lfs diff=lfs merge=lfs -text
+*.pt2 filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+iter_0023860_hf/model-00001-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+iter_0023860_hf/model-00002-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+iter_0023860_hf/model-00003-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+iter_0023860_hf/model-00004-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+iter_0023860_hf/model-00005-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+iter_0023860_hf/model-00006-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+iter_0023860_hf/model-00007-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+
+model-00001-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
+
+model-00002-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
+
+model-00003-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
+
+model-00004-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
+
+model-00005-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
+
+qwen.tiktoken filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,244 @@
+---
+extra_gated_prompt: >-
+  You agree to not use the dataset to conduct experiments that cause harm to
+  human subjects.
+extra_gated_fields:
+  Company/Organization: text
+  Country: country
+pipeline_tag: text-generation
+library_name: transformers
+---
+---
+license: apache-2.0
+---
+# OpenSeek-Mid-v1
+
+**OpenSeek-Mid-v1** is a 10.61-billion-parameter language model grown from [Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) through a two-stage model expansion pipeline and trained on only **2 trillion tokens** of fully open-source data.
+
+Despite having **25% fewer parameters** and using **18x less training data**, OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base across multiple benchmarks.
+
+
+
+<img src="https://cdn-uploads.huggingface.co/production/uploads/642ee226a7e765fff0bf00ac/VcTNOdzlJK1tw5PjgeSSi.png" width="90%" alt="results_all">
+
+
+---
+
+## Highlights
+
+- **Model Growth, Not From-Scratch Training**: Grown from Qwen3-4B via width expansion + partial depth stacking, inheriting the seed model's learned representations.
+- **Extreme Data Efficiency**: Matches Qwen3-14B-Base (~36T tokens) with only 2T training tokens, an 18x reduction in training data.
+- **Muon Optimizer**: Spectral whitening ensures expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
+- **Fully Open-Source Data**: All training data comes from publicly available datasets (NemotronCC-v2, Stack-Edu, Dolmino, CCI, etc.).
+
+---
+
+## Architecture
+
+| Specification | Value |
+|---|---|
+| Parameters | 10.61B |
+| Layers | 56 |
+| Hidden Size (d_model) | 2560 |
+| FFN Intermediate Size (d_FFN) | 19456 |
+| Attention Heads | 32 |
+| KV Heads | 8 |
+| Training Sequence Length | 8192 |
+| Vocabulary Size | 151851 (`vocab_size` in `config.json`) |
+
+### Growth Pipeline
+
+```
+Qwen3-4B (4.02B, 36L)
+    │  Width expansion (d_FFN: 9728 → 19456, SNR=10dB)
+    ▼
+Width-Expanded (7.10B, 36L)
+    │  Partial depth stacking (layers 14–34 × 2)
+    ▼
+OpenSeek-Mid-v1 (10.61B, 56L)
+    │  Continual pretraining with Muon (2T tokens)
+    ▼
+Final Model
+```
+
+---
+
+## Training
+
+### Training Configuration
+
+| Parameter | Value |
+|---|---|
+| Optimizer | Muon |
+| Sequence Length | 8192 |
+| Global Batch Size | 2048 sequences |
+| Peak Learning Rate | 1e-4 |
+| LR Schedule | Cosine with linear warmup |
+| Warmup Steps | 1000 |
+| Weight Decay | 0.1 |
+| Training Framework | FlagScale (FlagOS) |
+| Total Training Tokens | ~2.06T |
+
+### Stage 1: Broad Knowledge Acquisition (1.36T tokens)
+
+#### Stage 1 Data Mixture
+
+| Category | Proportion | Tokens (B) |
+|---|---|---|
+| Web | 42% | ~571B |
+| Math | 20% | ~272B |
+| Code | 20% | ~272B |
+| STEM | 15% | ~204B |
+| Multilingual | 3% | ~41B |
+
+---
+
+### Stage 2: Capability Specialization (0.70T tokens)
+
+#### Stage 2 Data Mixture
+
+| Category | Proportion | Tokens (B) | Delta vs. Stage 1 (percentage points) |
+|---|---|---|---|
+| Web | 35% | ~245B | -7 |
+| Math | 20% | ~140B | 0 |
+| Code | 24% | ~168B | +4 |
+| STEM | 18% | ~126B | +3 |
+| Multilingual | 3% | ~21B | 0 |
+
+---
+
+### Detailed Dataset Composition
+
+Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage. "—" indicates the dataset is not used in that stage.
+
+**Web**
+
+| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
+|---|---|---|---|
+| Nemotron-CC-v2-HQ-Syn | 798.41 | 23.24 | 19.36 |
+| Nemotron-CC-v2-Diverse-QA (×5 shards) | 340.81 | 9.92 | 8.26 |
+| Nemotron-CC-v2-HQ (×5 shards) | 303.82 | 8.84 | 7.36 |
+| dolmino-mix-1124-wiki | 3.82 | 0.15 | 0.18 |
+| dolmino-mix-1124-stackexchange | 1.30 | 0.05 | 0.06 |
+
+**Math**
+
+| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
+|---|---|---|---|
+| Nemotron-SFT-MATH | 207.46 | 11.70 | 11.70 |
+| Nemotron-CC-Math-v1-4plus-MIND | 74.34 | 4.19 | 4.19 |
+| Nemotron-CC-Math-v1-4plus | 53.37 | 3.01 | 3.01 |
+| Dolmino-math | 11.17 | 0.63 | 0.63 |
+| OpenMathInstruct-2 | 5.30 | 0.30 | 0.30 |
+| OpenMathReasoning-4k | 2.48 | 0.14 | 0.14 |
+| NuminaMath-1.5 | 0.38 | 0.02 | 0.02 |
+
+**Code**
+
+| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
+|---|---|---|---|
+| Nemotron-Pretraining-Code-v1-Syn | 171.53 | 9.05 | 10.86 |
+| Nemotron-SFT-Code | 57.47 | 3.03 | 3.64 |
+| stack-edu-Java | 31.70 | 1.06 | 1.27 |
+| stack-edu-Markdown | 26.64 | 0.38 | 0.45 |
+| stack-edu-Python | 18.27 | 1.54 | 1.85 |
+| stack-edu-Cpp | 12.62 | 1.11 | 1.33 |
+| stack-edu-JavaScript | 8.99 | 1.00 | 1.20 |
+| stack-edu-SQL | 8.23 | 0.37 | 0.44 |
+| github-issue | 8.46 | 0.25 | 0.30 |
+| stack-edu-PHP | 7.43 | 0.25 | 0.30 |
+| stack-edu-CSharp | 7.26 | 0.37 | 0.44 |
+| stack-edu-C | 4.80 | 0.43 | 0.52 |
+| stack-edu-Shell | 2.60 | 0.01 | 0.01 |
+| stack-edu-TypeScript | 2.51 | 0.18 | 0.22 |
+| OpenCodeInstruct | 1.59 | — | 0.10 |
+| stack-edu-Swift | 1.53 | 0.06 | 0.07 |
+| stack-edu-Rust | 1.45 | 0.05 | 0.06 |
+| stack-edu-Go | 1.42 | 0.03 | 0.04 |
+| kaggle-notebooks | 1.42 | 0.65 | 0.78 |
+| stack-edu-Ruby | 1.36 | 0.01 | 0.01 |
+| OpenCodeReasoning-2-cpp-4k | 0.76 | 0.04 | 0.05 |
+| OpenCodeReasoning-2-python-4k | 0.58 | 0.03 | 0.04 |
+| github-code-review | 0.32 | — | 0.02 |
+
+**STEM & Science**
+
+| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
+|---|---|---|---|
+| Nemotron-Pretraining-Specialized-v1 (×4 shards) | 276.83 | 10.55 | 12.73 |
+| Nemotron-Pretraining-SFT-v1-General | 86.93 | 3.31 | 4.00 |
+| dolmino-mix-1124-pes2o | 60.19 | 0.50 | 0.50 |
+| Nemotron-Pretraining-Specialized-v1.1 | 9.04 | — | 0.42 |
+| OpenScienceReasoning-2-4k | 1.72 | 0.07 | 0.08 |
+| MegaScience | 0.98 | 0.04 | 0.04 |
+
+**Multilingual**
+
+| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
+|---|---|---|---|
+| Nemotron-CC-v2-Translated-Diverse-QA | 135.80 | 1.74 | 1.74 |
+| CCI4_0-Zh-High | 98.76 | 1.26 | 1.26 |
+
+---
+
+### Checkpoint Merging
+
+The final model is a weighted average of 5 complementary checkpoints, each selected for a unique strength:
+
+| Checkpoint | Weight | Role | Key Metric |
+|---|---|---|---|
+| iter 169984 | 0.30 | Code anchor | MBPP **78.84** |
+| iter 219136 | 0.25 | Reasoning lead | GPQA-d **44.39** |
+| iter 174080 | 0.15 | Code peak | EvalPlus **68.88** |
+| iter 190464 | 0.15 | Math bridge | GPQA-d **42.86** |
+| iter 217088 | 0.15 | General boost | BBH **82.84** |
+
+---
+
+## Evaluation Results
+
+All evaluations were conducted with `lm-eval-harness` under consistent settings.
+
+| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3.5-9B | Nemotron-12B | Gemma3-12B | Qwen3-14B | **OpenSeek-Mid-v1** |
+|---|---|---|---|---|---|---|---|
+| *Training tokens* | *36T* | *36T* | *36T* | *20T* | *12T* | *36T* | ***2T*** |
+| MMLU (5-shot) | 72.72 | 76.57 | 78.64 | 78.07 | 73.28 | **80.57** | <u>79.31</u> |
+| MMLU-Pro (5-shot CoT) | 49.31 | 52.35 | <u>58.48</u> | 57.57 | 41.16 | 56.00 | **66.57** |
+| AGIEval-en (0-shot) | 45.92 | 49.09 | 45.15 | 49.20 | 44.89 | **52.83** | <u>52.18</u> |
+| BBH (3-shot CoT) | 71.20 | 77.75 | <u>82.23</u> | 69.65 | 73.78 | 78.71 | **82.55** |
+| HellaSwag (5-shot) | 75.36 | 79.47 | 81.04 | <u>83.13</u> | **83.45** | 82.05 | 81.81 |
+| Winogrande (5-shot) | 71.90 | 77.51 | 76.80 | 79.24 | **80.35** | <u>79.40</u> | 79.24 |
+| PIQA (5-shot) | 78.89 | 81.39 | 81.61 | 82.97 | 81.80 | **83.30** | <u>83.19</u> |
+| OpenBookQA (5-shot) | 45.00 | 49.00 | 50.00 | <u>50.20</u> | 49.60 | **50.80** | 49.80 |
+| ARC-C (0-shot) | 51.19 | 56.91 | 56.83 | 60.58 | **64.68** | 59.30 | <u>62.12</u> |
+| GSM8K (4-shot CoT) | 84.31 | 86.73 | 85.60 | 81.43 | 72.02 | **90.07** | <u>89.16</u> |
+| MATH (4-shot CoT) | 50.16 | 52.48 | 56.16 | 57.30 | 43.30 | <u>59.70</u> | **65.88** |
+| GPQA-diamond (3-shot CoT) | 32.65 | 35.71 | <u>37.76</u> | 31.12 | 23.47 | <u>37.76</u> | **45.41** |
+| MBPP (0-shot) | 73.81 | 75.66 | <u>77.51</u> | 73.81 | 73.28 | **84.92** | 76.19 |
+| EvalPlus Avg (0-shot) | 63.96 | <u>67.95</u> | 59.54 | 61.20 | 53.48 | **73.41** | 66.45 |
+| | | | | | | | |
+| **Avg General** | 62.39 | 66.67 | 67.86 | 65.04 | 60.98 | <u>69.22</u> | **70.75** |
+| **Avg All** | 61.88 | 65.61 | 66.24 | 65.39 | 61.32 | <u>69.20</u> | **69.99** |
+
+- **Avg General**: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
+- **Avg All**: average of all benchmarks above, including math, STEM, and code (+ GSM8K, MATH, GPQA-diamond, MBPP, EvalPlus Avg).
+
+---
+
+## Citation
+
+If you find this work useful, please cite:
+
+```bibtex
+@misc{openseek-mid-v1,
+  title={OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
+  year={2026},
+  note={Technical report coming soon}
+}
+```
+
+---
+
+## Acknowledgements
+
+This project was built using open-source data and tools, including NemotronCC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.
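Note on the "Checkpoint Merging" table in README.md: the final model is described as a weighted average of five checkpoints. The sketch below shows what such an average looks like, assuming each checkpoint is a mapping from parameter name to a flat list of floats. `merge_checkpoints` and the toy "tensors" are hypothetical; the merging tool actually used is not part of this commit.

```python
import math
from typing import Dict, List

# Merge weights copied from the "Checkpoint Merging" table (iteration -> weight).
MERGE_WEIGHTS: Dict[int, float] = {
    169984: 0.30,
    219136: 0.25,
    174080: 0.15,
    190464: 0.15,
    217088: 0.15,
}

StateDict = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Dict[int, StateDict],
                      weights: Dict[int, float]) -> StateDict:
    """Return the element-wise weighted average of the given checkpoints."""
    if set(checkpoints) != set(weights):
        raise ValueError("checkpoints and weights must cover the same iterations")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("merge weights should sum to 1")

    reference = next(iter(checkpoints.values()))
    merged: StateDict = {}
    for name, values in reference.items():
        acc = [0.0] * len(values)
        for iteration, state in checkpoints.items():
            w = weights[iteration]
            for i, v in enumerate(state[name]):
                acc[i] += w * v
        merged[name] = acc
    return merged


# Toy stand-ins for real checkpoints: one two-element "tensor" per iteration.
toy = {it: {"layer.weight": [float(k), 1.0]}
       for k, it in enumerate(MERGE_WEIGHTS)}
merged = merge_checkpoints(toy, MERGE_WEIGHTS)
```

Because the weights sum to 1, any parameter that is identical across all five checkpoints is left unchanged by the merge.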
--- a/config.json
+++ b/config.json
@@ -0,0 +1,32 @@
+{
+  "architectures": [
+    "Qwen3ForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_qwen3.Qwen3Config",
+    "AutoModelForCausalLM": "modeling_qwen3.Qwen3ForCausalLM"
+  },
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 2560,
+  "initializer_range": 0.006,
+  "intermediate_size": 19456,
+  "max_position_embeddings": 32768,
+  "max_window_layers": 28,
+  "model_type": "qwen3",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 56,
+  "num_key_value_heads": 8,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": null,
+  "rope_theta": 1000000,
+  "sliding_window": 4096,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.51.3",
+  "use_cache": true,
+  "use_sliding_window": false,
+  "vocab_size": 151851
+}
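Note on config.json: the 10.61B parameter figure in README.md can be cross-checked against these values. The sketch below assumes the usual bias-free Qwen3 layer layout (q/k/v/o projections, gate/up/down MLP, two RMSNorm weights per layer, per-head `q_norm`/`k_norm`, and one final norm); the tensor shapes are an assumption, since only the tensor names appear in this commit.

```python
# Values copied from config.json in this commit.
config = {
    "vocab_size": 151851,
    "hidden_size": 2560,
    "intermediate_size": 19456,
    "num_hidden_layers": 56,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "tie_word_embeddings": False,
}


def count_parameters(cfg: dict) -> int:
    """Count weights for a bias-free Qwen3-style decoder (assumed layout)."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]

    attention = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o projections
    mlp = 3 * h * cfg["intermediate_size"]              # gate, up, down projections
    norms = 2 * h + 2 * cfg["head_dim"]                 # two layer norms, q_norm, k_norm
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * h
    if not cfg["tie_word_embeddings"]:
        embeddings *= 2                                 # separate lm_head matrix
    # The trailing "+ h" accounts for the assumed final norm weight.
    return cfg["num_hidden_layers"] * per_layer + embeddings + h


total = count_parameters(config)
print(total)      # 10613423616, about 10.61B parameters
print(total * 2)  # 21226847232 bytes at 2 bytes per bfloat16 weight
```

The byte count agrees with `total_size` in `model.safetensors.index.json`, which suggests the assumed layout matches the stored tensors.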
--- a/configuration.json
+++ b/configuration.json
@@ -0,0 +1 @@
+{"framework": "pytorch", "task": "others", "allow_remote": true}
--- a/configuration_qwen3.py
+++ b/configuration_qwen3.py
@@ -0,0 +1,208 @@
+# coding=utf-8
+# Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Qwen3 model configuration"""
+
+from transformers.configuration_utils import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation
+from transformers.utils import logging
+
+logger = logging.get_logger(__name__)
+
+
+class Qwen3Config(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`Qwen3Model`]. It is used to instantiate a
+    Qwen3 model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of
+    Qwen3-8B [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 151936):
+            Vocabulary size of the Qwen3 model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`Qwen3Model`].
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 22016):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        num_key_value_heads (`int`, *optional*, defaults to 32):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if
+            `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by mean-pooling all the original heads within that group. For more details, check out [this
+            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `32`.
+        head_dim (`int`, *optional*, defaults to 128):
+            The attention head dimension.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 32768):
+            The maximum sequence length that this model might ever be used with.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether the model's input and output word embeddings should be tied.
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
+            and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
+            accordingly.
+            Expected contents:
+                `rope_type` (`str`):
+                    The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
+                    'llama3'], with 'default' being the original RoPE implementation.
+                `factor` (`float`, *optional*):
+                    Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
+                    most scaling types, a `factor` of x will enable the model to handle sequences of length x *
+                    original maximum pre-trained length.
+                `original_max_position_embeddings` (`int`, *optional*):
+                    Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
+                    pretraining.
+                `attention_factor` (`float`, *optional*):
+                    Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
+                    computation. If unspecified, it defaults to value recommended by the implementation, using the
+                    `factor` field to infer the suggested value.
+                `beta_fast` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 32.
+                `beta_slow` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 1.
+                `short_factor` (`List[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to short contexts (<
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `long_factor` (`List[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to long contexts (>
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `low_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
+                `high_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
+        attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in the query, key, value and output projection layers during self-attention.
+        use_sliding_window (`bool`, *optional*, defaults to `False`):
+            Whether to use sliding window attention.
+        sliding_window (`int`, *optional*, defaults to 4096):
+            Sliding window attention (SWA) window size. If not specified, will default to `4096`.
+        max_window_layers (`int`, *optional*, defaults to 28):
+            The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+
+    ```python
+    >>> from transformers import Qwen3Model, Qwen3Config
+
+    >>> # Initializing a Qwen3 style configuration
+    >>> configuration = Qwen3Config()
+
+    >>> # Initializing a model from the Qwen3-8B style configuration
+    >>> model = Qwen3Model(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "qwen3"
+    keys_to_ignore_at_inference = ["past_key_values"]
+
+    # Default tensor parallel plan for base model `Qwen3`
+    base_model_tp_plan = {
+        "layers.*.self_attn.q_proj": "colwise",
+        "layers.*.self_attn.k_proj": "colwise",
+        "layers.*.self_attn.v_proj": "colwise",
+        "layers.*.self_attn.o_proj": "rowwise",
+        "layers.*.mlp.gate_proj": "colwise",
+        "layers.*.mlp.up_proj": "colwise",
+        "layers.*.mlp.down_proj": "rowwise",
+    }
+    base_model_pp_plan = {
+        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+        "norm": (["hidden_states"], ["hidden_states"]),
+    }
+
+    def __init__(
+        self,
+        vocab_size=151936,
+        hidden_size=4096,
+        intermediate_size=22016,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        num_key_value_heads=32,
+        head_dim=128,
+        hidden_act="silu",
+        max_position_embeddings=32768,
+        initializer_range=0.02,
+        rms_norm_eps=1e-6,
+        use_cache=True,
+        tie_word_embeddings=False,
+        rope_theta=10000.0,
+        rope_scaling=None,
+        attention_bias=False,
+        use_sliding_window=False,
+        sliding_window=4096,
+        max_window_layers=28,
+        attention_dropout=0.0,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.use_sliding_window = use_sliding_window
+        self.sliding_window = sliding_window  # we check `use_sliding_window` in the modeling code
+        self.max_window_layers = max_window_layers
+
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.attention_bias = attention_bias
+        self.attention_dropout = attention_dropout
+        # Validate the correctness of rotary position embeddings parameters
+        # BC: if there is a 'type' field, move it to 'rope_type'.
+        if self.rope_scaling is not None and "type" in self.rope_scaling:
+            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
+        rope_config_validation(self)
+
+        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
+
+
+__all__ = ["Qwen3Config"]
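Note on configuration_qwen3.py: the `num_key_value_heads` docstring describes grouped-query attention and mean-pooling heads when converting a multi-head checkpoint. The plain-Python sketch below illustrates both ideas with the head counts from this repository's config.json (32 query heads, 8 KV heads). The helper names are hypothetical, heads are plain lists of floats rather than tensors, and contiguous grouping of query heads is an assumption.

```python
from typing import List

NUM_ATTENTION_HEADS = 32  # from config.json
NUM_KEY_VALUE_HEADS = 8   # from config.json


def kv_head_for_query_head(query_head: int,
                           num_heads: int = NUM_ATTENTION_HEADS,
                           num_kv_heads: int = NUM_KEY_VALUE_HEADS) -> int:
    """Map a query head to the KV head it shares, assuming contiguous groups."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads
    return query_head // group_size


def mean_pool_heads(heads: List[List[float]],
                    num_kv_heads: int) -> List[List[float]]:
    """Average each contiguous group of per-head vectors into one KV head."""
    if len(heads) % num_kv_heads != 0:
        raise ValueError("head count must be a multiple of num_kv_heads")
    group_size = len(heads) // num_kv_heads
    pooled = []
    for g in range(num_kv_heads):
        group = heads[g * group_size:(g + 1) * group_size]
        # zip(*group) walks the group column by column (one entry per dimension).
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


# Query heads 0-3 share KV head 0, heads 4-7 share KV head 1, and so on.
groups = [kv_head_for_query_head(h) for h in range(NUM_ATTENTION_HEADS)]

# Toy example: four 2-dimensional heads pooled into two KV heads.
pooled = mean_pool_heads([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], 2)
```

With 32 query heads and 8 KV heads, each KV head serves a group of four query heads.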
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,6 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 151849,
+  "eos_token_id": 151850,
+  "transformers_version": "4.41.2"
+}
--- a/model-00001-of-00005.safetensors
+++ b/model-00001-of-00005.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dabc704327965bdac65cfecb620d55881d0f6e7d0353badd87de75cc9d3115c7
+size 4992896952
--- a/model-00002-of-00005.safetensors
+++ b/model-00002-of-00005.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9d5fb1d13829606228ecd380df10cb650363f822636013033a3517a3bbe6e01b
+size 4970419768
--- a/model-00003-of-00005.safetensors
+++ b/model-00003-of-00005.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c0ea891c5f8e618c07b575a93d7ff6cef8242929536df7204091f5e90ffdab37
+size 4917989760
--- a/model-00004-of-00005.safetensors
+++ b/model-00004-of-00005.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f2cb71ea4200692279d02e3b81e4378de68cfaed8816ebe806737638fa2b544b
+size 4917989760
--- a/model-00005-of-00005.safetensors
+++ b/model-00005-of-00005.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2a1fbcf10aee4f71b4bc8f715990c3a710f42ca5a09862e15a8250268f871fb9
+size 1427622384
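Note on the `.safetensors` entries above: each one is a Git LFS pointer (spec version, SHA-256 object id, size in bytes) rather than the weights themselves. The sketch below parses such a pointer and checks a downloaded shard against it. The function names are hypothetical; the pointer text is copied from the `model-00005-of-00005.safetensors` entry.

```python
import hashlib

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:2a1fbcf10aee4f71b4bc8f715990c3a710f42ca5a09862e15a8250268f871fb9
size 1427622384
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the "key value" lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte shards are never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
```

Calling `matches_pointer("model-00005-of-00005.safetensors", pointer)` on a downloaded shard would return `True` only if both the byte count and the digest match the pointer.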
--- a/model.safetensors.index.json
+++ b/model.safetensors.index.json
@@ -0,0 +1,626 @@
+{
+  "metadata": {
+    "total_size": 21226847232
+  },
+  "weight_map": {
+    "lm_head.weight": "model-00005-of-00005.safetensors",
+    "model.embed_tokens.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.26.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.26.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.26.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.26.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.27.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.28.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.30.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.31.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.32.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.33.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.34.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.35.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.36.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.37.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.38.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.39.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.40.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.40.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.40.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.40.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.40.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.40.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.40.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.40.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.40.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.40.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.40.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.41.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.41.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.42.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.43.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.44.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.45.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.46.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.47.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.48.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.49.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.50.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.50.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.51.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.52.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.53.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.54.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.54.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.54.mlp.gate_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.54.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.54.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.54.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.54.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.54.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.54.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.54.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.54.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.55.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.mlp.gate_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.self_attn.k_norm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.self_attn.k_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.self_attn.o_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.self_attn.q_norm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.self_attn.q_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.55.self_attn.v_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.norm.weight": "model-00005-of-00005.safetensors"
+  }
+}
--- a/modeling_qwen3.py
+++ b/modeling_qwen3.py
--- a/qwen.tiktoken
+++ b/qwen.tiktoken
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b2b1b8dfb5cc5f024bafc373121c6aba3f66f9a5a0269e243470a1de16a33186
+size 2561218
--- a/qwen_generation_utils.py
+++ b/qwen_generation_utils.py
@@ -0,0 +1,416 @@
+# Copyright (c) Alibaba Cloud.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""Generation support."""
+
+from typing import Tuple, List, Union, Iterable
+
+import numpy as np
+import torch
+import torch.nn.functional as F
+from transformers import PreTrainedTokenizer
+from transformers import logging
+from transformers.generation import LogitsProcessor
+
+logger = logging.get_logger(__name__)
+
+# Types.
+HistoryType = List[Tuple[str, str]]
+TokensType = List[int]
+BatchTokensType = List[List[int]]
+
+
+def pad_batch(batch: BatchTokensType, pad_id: int, seq_length: int) -> BatchTokensType:
+    for tokens in batch:
+        context_length = len(tokens)
+        if context_length < seq_length:
+            tokens.extend([pad_id] * (seq_length - context_length))
+    return batch
+
+
+def get_ltor_masks_and_position_ids(
+    data,
+    eod_token,
+    reset_position_ids,
+    reset_attention_mask,
+    eod_mask_loss,
+):
+    """Build masks and position ids for a left-to-right model."""
+
+    # Extract batch size and sequence length.
+    micro_batch_size, seq_length = data.size()
+
+    # Attention mask (lower triangular).
+    if reset_attention_mask:
+        att_mask_batch = micro_batch_size
+    else:
+        att_mask_batch = 1
+    attention_mask = torch.tril(
+        torch.ones((att_mask_batch, seq_length, seq_length), device=data.device)
+    ).view(att_mask_batch, 1, seq_length, seq_length)
+
+    # Loss mask.
+    loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device)
+    if eod_mask_loss:
+        loss_mask[data == eod_token] = 0.0
+
+    # Position ids.
+    position_ids = torch.arange(seq_length, dtype=torch.long, device=data.device)
+    position_ids = position_ids.unsqueeze(0).expand_as(data)
+    # We need to clone as the ids will be modified based on batch index.
+    if reset_position_ids:
+        position_ids = position_ids.clone()
+
+    if reset_position_ids or reset_attention_mask:
+        # Loop through the batches:
+        for b in range(micro_batch_size):
+
+            # Find indices where EOD token is.
+            eod_index = position_ids[b, data[b] == eod_token]
+            # Detach indices from positions if going to modify positions.
+            if reset_position_ids:
+                eod_index = eod_index.clone()
+
+            # Loop through EOD indices:
+            prev_index = 0
+            for j in range(eod_index.size()[0]):
+                i = eod_index[j]
+                # Block attention from tokens after the EOD to tokens up to and including it.
+                if reset_attention_mask:
+                    attention_mask[b, 0, (i + 1) :, : (i + 1)] = 0
+                # Reset positions.
+                if reset_position_ids:
+                    position_ids[b, (i + 1) :] -= i + 1 - prev_index
+                    prev_index = i + 1
+
+    # Convert attention mask to binary:
+    attention_mask = attention_mask < 0.5
+
+    return attention_mask, loss_mask, position_ids
+
+
+def get_batch(context_tokens: torch.LongTensor, eod_id: int):
+    """Generate batch from context tokens."""
+    # Make the tensor contiguous; it stays on its current device.
+    tokens = context_tokens.contiguous().to(context_tokens.device)
+    # Get the attention mask and position ids.
+    attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
+        tokens,
+        eod_id,
+        reset_position_ids=False,
+        reset_attention_mask=False,
+        eod_mask_loss=False,
+    )
+    return tokens, attention_mask, position_ids
+
+
+def get_stop_words_ids(chat_format, tokenizer):
+    if chat_format == "raw":
+        stop_words_ids = [tokenizer.encode("Human:"), [tokenizer.eod_id]]
+    elif chat_format == "chatml":
+        stop_words_ids = [[tokenizer.im_end_id], [tokenizer.im_start_id]]
+    else:
+        raise NotImplementedError(f"Unknown chat format {chat_format!r}")
+    return stop_words_ids
+
+
+def make_context(
+    tokenizer: PreTrainedTokenizer,
+    query: str,
+    history: List[Tuple[str, str]] = None,
+    system: str = "",
+    max_window_size: int = 6144,
+    chat_format: str = "chatml",
+):
+    if history is None:
+        history = []
+
+    if chat_format == "chatml":
+        im_start, im_end = "<|im_start|>", "<|im_end|>"
+        im_start_tokens = [tokenizer.im_start_id]
+        im_end_tokens = [tokenizer.im_end_id]
+        nl_tokens = tokenizer.encode("\n")
+
+        def _tokenize_str(role, content):
+            return f"{role}\n{content}", tokenizer.encode(
+                role, allowed_special=set()
+            ) + nl_tokens + tokenizer.encode(content, allowed_special=set())
+
+        system_text, system_tokens_part = _tokenize_str("system", system)
+        system_tokens = im_start_tokens + system_tokens_part + im_end_tokens
+
+        raw_text = ""
+        context_tokens = []
+
+        for turn_query, turn_response in reversed(history):
+            query_text, query_tokens_part = _tokenize_str("user", turn_query)
+            query_tokens = im_start_tokens + query_tokens_part + im_end_tokens
+            response_text, response_tokens_part = _tokenize_str(
+                "assistant", turn_response
+            )
+            response_tokens = im_start_tokens + response_tokens_part + im_end_tokens
+
+            next_context_tokens = nl_tokens + query_tokens + nl_tokens + response_tokens
+            prev_chat = (
+                f"\n{im_start}{query_text}{im_end}\n{im_start}{response_text}{im_end}"
+            )
+
+            current_context_size = (
+                len(system_tokens) + len(next_context_tokens) + len(context_tokens)
+            )
+            if current_context_size < max_window_size:
+                context_tokens = next_context_tokens + context_tokens
+                raw_text = prev_chat + raw_text
+            else:
+                break
+
+        context_tokens = system_tokens + context_tokens
+        raw_text = f"{im_start}{system_text}{im_end}" + raw_text
+        context_tokens += (
+            nl_tokens
+            + im_start_tokens
+            + _tokenize_str("user", query)[1]
+            + im_end_tokens
+            + nl_tokens
+            + im_start_tokens
+            + tokenizer.encode("assistant")
+            + nl_tokens
+        )
+        raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n"
+
+    elif chat_format == "raw":
+        raw_text = query
+        context_tokens = tokenizer.encode(raw_text)
+    else:
+        raise NotImplementedError(f"Unknown chat format {chat_format!r}")
+
+    return raw_text, context_tokens
+
+
+def _decode_default(
+    tokens: List[int],
+    *,
+    stop_words: List[str],
+    eod_words: List[str],
+    tokenizer: PreTrainedTokenizer,
+    raw_text_len: int,
+    verbose: bool = False,
+    return_end_reason: bool = False,
+    errors: str='replace',
+):
+    trim_decode_tokens = tokenizer.decode(tokens, errors=errors)[raw_text_len:]
+    if verbose:
+        print("\nRaw Generate: ", trim_decode_tokens)
+
+    end_reason = f"Gen length {len(tokens)}"
+    for stop_word in stop_words:
+        trim_decode_tokens = trim_decode_tokens.replace(stop_word, "").strip()
+    for eod_word in eod_words:
+        if eod_word in trim_decode_tokens:
+            end_reason = f"Gen {eod_word!r}"
+        trim_decode_tokens = trim_decode_tokens.split(eod_word)[0]
+    trim_decode_tokens = trim_decode_tokens.strip()
+    if verbose:
+        print("\nEnd Reason:", end_reason)
+        print("\nGenerate: ", trim_decode_tokens)
+
+    if return_end_reason:
+        return trim_decode_tokens, end_reason
+    else:
+        return trim_decode_tokens
+
+
+def _decode_chatml(
+    tokens: List[int],
+    *,
+    stop_words: List[str],
+    eod_token_ids: List[int],
+    tokenizer: PreTrainedTokenizer,
+    raw_text_len: int,
+    context_length: int,
+    verbose: bool = False,
+    return_end_reason: bool = False,
+    errors: str='replace'
+):
+    end_reason = f"Gen length {len(tokens)}"
+    eod_token_idx = context_length
+    for eod_token_idx in range(context_length, len(tokens)):
+        if tokens[eod_token_idx] in eod_token_ids:
+            end_reason = f"Gen {tokenizer.decode([tokens[eod_token_idx]])!r}"
+            break
+
+    trim_decode_tokens = tokenizer.decode(tokens[:eod_token_idx], errors=errors)[raw_text_len:]
+    if verbose:
+        print("\nRaw Generate w/o EOD:", tokenizer.decode(tokens, errors=errors)[raw_text_len:])
+        print("\nRaw Generate:", trim_decode_tokens)
+        print("\nEnd Reason:", end_reason)
+    for stop_word in stop_words:
+        trim_decode_tokens = trim_decode_tokens.replace(stop_word, "").strip()
+    trim_decode_tokens = trim_decode_tokens.strip()
+    if verbose:
+        print("\nGenerate:", trim_decode_tokens)
+
+    if return_end_reason:
+        return trim_decode_tokens, end_reason
+    else:
+        return trim_decode_tokens
+
+
+def decode_tokens(
+    tokens: Union[torch.LongTensor, TokensType],
+    tokenizer: PreTrainedTokenizer,
+    raw_text_len: int,
+    context_length: int,
+    chat_format: str,
+    verbose: bool = False,
+    return_end_reason: bool = False,
+    errors: str="replace",
+) -> str:
+    if torch.is_tensor(tokens):
+        tokens = tokens.cpu().numpy().tolist()
+
+    if chat_format == "chatml":
+        return _decode_chatml(
+            tokens,
+            stop_words=[],
+            eod_token_ids=[tokenizer.im_start_id, tokenizer.im_end_id],
+            tokenizer=tokenizer,
+            raw_text_len=raw_text_len,
+            context_length=context_length,
+            verbose=verbose,
+            return_end_reason=return_end_reason,
+            errors=errors,
+        )
+    elif chat_format == "raw":
+        return _decode_default(
+            tokens,
+            stop_words=["<|endoftext|>"],
+            eod_words=["<|endoftext|>"],
+            tokenizer=tokenizer,
+            raw_text_len=raw_text_len,
+            verbose=verbose,
+            return_end_reason=return_end_reason,
+            errors=errors,
+        )
+    else:
+        raise NotImplementedError(f"Unknown chat format {chat_format!r}")
+
+
+class StopWordsLogitsProcessor(LogitsProcessor):
+    """
+    :class:`transformers.LogitsProcessor` that forces generation to stop once any of the specified sequences appears.
+
+    Args:
+        stop_words_ids (:obj:`List[List[int]]`):
+            List of token-id lists, one list per stop sequence. To get the token ids of a
+            stop word, use :obj:`tokenizer(stop_word,
+            add_prefix_space=True).input_ids`.
+        eos_token_id (:obj:`int`):
+            The id of the `end-of-sequence` token.
+    """
+
+    def __init__(self, stop_words_ids: Iterable[Iterable[int]], eos_token_id: int):
+
+        if not isinstance(stop_words_ids, List) or len(stop_words_ids) == 0:
+            raise ValueError(
+                f"`stop_words_ids` has to be a non-empty list, but is {stop_words_ids}."
+            )
+        if any(not isinstance(bad_word_ids, list) for bad_word_ids in stop_words_ids):
+            raise ValueError(
+                f"`stop_words_ids` has to be a list of lists, but is {stop_words_ids}."
+            )
+        if any(
+            any(
+                (not isinstance(token_id, (int, np.integer)) or token_id < 0)
+                for token_id in stop_word_ids
+            )
+            for stop_word_ids in stop_words_ids
+        ):
+            raise ValueError(
+                f"Each list in `stop_words_ids` has to be a list of non-negative integers, but is {stop_words_ids}."
+            )
+
+        self.stop_words_ids = list(
+            filter(
+                lambda bad_token_seq: bad_token_seq != [eos_token_id], stop_words_ids
+            )
+        )
+        self.eos_token_id = eos_token_id
+        for stop_token_seq in self.stop_words_ids:
+            assert (
+                len(stop_token_seq) > 0
+            ), "Stop words token sequences {} cannot have an empty list".format(
+                stop_words_ids
+            )
+
+    def __call__(
+        self, input_ids: torch.LongTensor, scores: torch.FloatTensor
+    ) -> torch.FloatTensor:
+        stopped_samples = self._calc_stopped_samples(input_ids)
+        for i, should_stop in enumerate(stopped_samples):
+            if should_stop:
+                scores[i, self.eos_token_id] = float(2**15)
+        return scores
+
+    def _tokens_match(self, prev_tokens: torch.LongTensor, tokens: List[int]) -> bool:
+        if len(tokens) == 0:
+            # an empty stop sequence matches trivially
+            return True
+        elif len(tokens) > len(prev_tokens):
+            # if the stop sequence is longer than prev input_ids they can't be equal
+            return False
+        elif prev_tokens[-len(tokens) :].tolist() == tokens:
+            # if tokens match
+            return True
+        else:
+            return False
+
+    def _calc_stopped_samples(self, prev_input_ids: Iterable[int]) -> Iterable[int]:
+        stopped_samples = []
+        for prev_input_ids_slice in prev_input_ids:
+            match = False
+            for stop_token_seq in self.stop_words_ids:
+                if self._tokens_match(prev_input_ids_slice, stop_token_seq):
+                    # a stop sequence ends this sample; no need to check the rest
+                    match = True
+                    break
+            stopped_samples.append(match)
+
+        return stopped_samples
+
+
+def top_k_logits(logits, top_k=0, top_p=0.0, filter_value=-float("Inf")):
+    """This function has been mostly taken from huggingface conversational
+    ai code at
+        https://medium.com/huggingface/how-to-build-a-state-of-the-art-
+             conversational-ai-with-transfer-learning-2d818ac26313"""
+
+    if top_k > 0:
+        # Remove all tokens with a probability less than the
+        # last token of the top-k
+        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
+        logits[indices_to_remove] = filter_value
+
+    if top_p > 0.0:
+        # Sort logits in descending order along the vocabulary dimension
+        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
+        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+
+        # Remove tokens with cumulative probability above the threshold
+        sorted_indices_to_remove = cumulative_probs > top_p
+        # Shift the indices to the right to keep also the first token
+        # above the threshold
+        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+        sorted_indices_to_remove[..., 0] = 0
+        for i in range(sorted_indices.size(0)):
+            indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]]
+            logits[i][indices_to_remove] = filter_value
+
+    return logits
+
+
+def switch(val1, val2, boolean):
+    boolean = boolean.type_as(val1)
+    return (1 - boolean) * val1 + boolean * val2
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,6 @@
+{
+  "bos_token": "<|extra_203|>",
+  "eos_token": "<|extra_204|>",
+  "unk_token": "<|endoftext|>",
+  "pad_token": "<|endoftext|>"
+}
--- a/tokenization_qwen.py
+++ b/tokenization_qwen.py
@@ -0,0 +1,276 @@
+# Copyright (c) Alibaba Cloud.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""Tokenization classes for QWen."""
+
+import base64
+import logging
+import os
+import unicodedata
+from typing import Collection, Dict, List, Set, Tuple, Union
+
+import tiktoken
+from transformers import PreTrainedTokenizer, AddedToken
+
+logger = logging.getLogger(__name__)
+
+
+VOCAB_FILES_NAMES = {"vocab_file": "qwen.tiktoken"}
+
+PAT_STR = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
+ENDOFTEXT = "<|endoftext|>"
+IMSTART = "<|im_start|>"
+IMEND = "<|im_end|>"
+# as the default behavior is changed to allow special tokens in
+# regular texts, the surface forms of special tokens need to be
+# as different as possible to minimize the impact
+EXTRAS = tuple((f"<|extra_{i}|>" for i in range(205)))
+# changed to use actual index to avoid misconfiguration with vocabulary expansion
+SPECIAL_START_ID = 151643
+SPECIAL_TOKENS = tuple(
+    enumerate(
+        (
+            (
+                ENDOFTEXT,
+                IMSTART,
+                IMEND,
+            )
+            + EXTRAS
+        ),
+        start=SPECIAL_START_ID,
+    )
+)
+SPECIAL_TOKENS_SET = set(t for i, t in SPECIAL_TOKENS)
+
+
+def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
+    with open(tiktoken_bpe_file, "rb") as f:
+        contents = f.read()
+    return {
+        base64.b64decode(token): int(rank)
+        for token, rank in (line.split() for line in contents.splitlines() if line)
+    }
+
+
+class QWenTokenizer(PreTrainedTokenizer):
+    """QWen tokenizer."""
+
+    vocab_files_names = VOCAB_FILES_NAMES
+
+    def __init__(
+        self,
+        vocab_file,
+        errors="replace",
+        extra_vocab_file=None,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        # how to handle errors in decoding UTF-8 byte sequences
+        # use ignore if you are in streaming inference
+        self.errors = errors
+
+        self.mergeable_ranks = _load_tiktoken_bpe(vocab_file)  # type: Dict[bytes, int]
+        self.special_tokens = {
+            token: index
+            for index, token in SPECIAL_TOKENS
+        }
+
+        # try load extra vocab from file
+        if extra_vocab_file is not None:
+            used_ids = set(self.mergeable_ranks.values()) | set(self.special_tokens.values())
+            extra_mergeable_ranks = _load_tiktoken_bpe(extra_vocab_file)
+            for token, index in extra_mergeable_ranks.items():
+                if token in self.mergeable_ranks:
+                    logger.info(f"extra token {token} exists, skipping")
+                    continue
+                if index in used_ids:
+                    logger.info(f"the index {index} for extra token {token} exists, skipping")
+                    continue
+                self.mergeable_ranks[token] = index
+            # the index may be sparse after this, but don't worry tiktoken.Encoding will handle this
+
+        enc = tiktoken.Encoding(
+            "Qwen",
+            pat_str=PAT_STR,
+            mergeable_ranks=self.mergeable_ranks,
+            special_tokens=self.special_tokens,
+        )
+        assert (
+            len(self.mergeable_ranks) + len(self.special_tokens) == enc.n_vocab
+        ), f"{len(self.mergeable_ranks) + len(self.special_tokens)} != {enc.n_vocab} in encoding"
+
+        self.decoder = {
+            v: k for k, v in self.mergeable_ranks.items()
+        }  # type: dict[int, bytes|str]
+        self.decoder.update({v: k for k, v in self.special_tokens.items()})
+
+        self.tokenizer = enc  # type: tiktoken.Encoding
+
+        self.eod_id = self.tokenizer.eot_token
+        self.im_start_id = self.special_tokens[IMSTART]
+        self.im_end_id = self.special_tokens[IMEND]
+
+    def __getstate__(self):
+        # for pickle lovers
+        state = self.__dict__.copy()
+        del state["tokenizer"]
+        return state
+
+    def __setstate__(self, state):
+        # tokenizer is not python native; don't pass it; rebuild it
+        self.__dict__.update(state)
+        enc = tiktoken.Encoding(
+            "Qwen",
+            pat_str=PAT_STR,
+            mergeable_ranks=self.mergeable_ranks,
+            special_tokens=self.special_tokens,
+        )
+        self.tokenizer = enc
+
+    def __len__(self) -> int:
+        return self.tokenizer.n_vocab
+
+    def get_vocab(self) -> Dict[bytes, int]:
+        return self.mergeable_ranks
+
+    def convert_tokens_to_ids(
+        self, tokens: Union[bytes, str, List[Union[bytes, str]]]
+    ) -> Union[int, List[int]]:
+        ids = []
+        if isinstance(tokens, (str, bytes)):
+            if tokens in self.special_tokens:
+                return self.special_tokens[tokens]
+            else:
+                return self.mergeable_ranks.get(tokens)
+        for token in tokens:
+            if token in self.special_tokens:
+                ids.append(self.special_tokens[token])
+            else:
+                ids.append(self.mergeable_ranks.get(token))
+        return ids
+
+    def _add_tokens(
+        self,
+        new_tokens: Union[List[str], List[AddedToken]],
+        special_tokens: bool = False,
+    ) -> int:
+        if not special_tokens and new_tokens:
+            raise ValueError("Adding regular tokens is not supported")
+        for token in new_tokens:
+            surface_form = token.content if isinstance(token, AddedToken) else token
+            if surface_form not in SPECIAL_TOKENS_SET:
+                raise ValueError("Adding unknown special tokens is not supported")
+        return 0
+
+    def save_vocabulary(self, save_directory: str, **kwargs) -> Tuple[str]:
+        """
+        Save only the mergeable BPE ranks of the tokenizer; special tokens are not written.
+
+        Returns:
+            `Tuple[str]`: Paths to the files saved.
+        """
+        file_path = os.path.join(save_directory, "qwen.tiktoken")
+        with open(file_path, "w", encoding="utf8") as w:
+            for k, v in self.mergeable_ranks.items():
+                line = base64.b64encode(k).decode("utf8") + " " + str(v) + "\n"
+                w.write(line)
+        return (file_path,)
+
+    def tokenize(
+        self,
+        text: str,
+        allowed_special: Union[Set, str] = "all",
+        disallowed_special: Union[Collection, str] = (),
+        **kwargs,
+    ) -> List[Union[bytes, str]]:
+        """
+        Converts a string into a sequence of tokens.
+
+        Args:
+            text (`str`):
+                The sequence to be encoded.
+            allowed_special (`Literal["all"]` or `set`):
+                The surface forms of the tokens to be encoded as special tokens in regular texts.
+                Defaults to "all".
+            disallowed_special (`Literal["all"]` or `Collection`):
+                The surface forms of the tokens that should not appear in regular texts and that raise an error if they do.
+                Defaults to an empty tuple.
+
+            kwargs (additional keyword arguments, *optional*):
+                Will be passed to the underlying model specific encode method.
+
+        Returns:
+            `List[bytes|str]`: The list of tokens.
+        """
+        tokens = []
+        text = unicodedata.normalize("NFC", text)
+
+        # this implementation takes a detour: text -> token id -> token surface forms
+        for t in self.tokenizer.encode(
+            text, allowed_special=allowed_special, disallowed_special=disallowed_special
+        ):
+            tokens.append(self.decoder[t])
+        return tokens
+
+    def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
+        """
+        Converts a sequence of tokens into a single string.
+        """
+        text = ""
+        temp = b""
+        for t in tokens:
+            if isinstance(t, str):
+                if temp:
+                    text += temp.decode("utf-8", errors=self.errors)
+                    temp = b""
+                text += t
+            elif isinstance(t, bytes):
+                temp += t
+            else:
+                raise TypeError("token should only be of type bytes or str")
+        if temp:
+            text += temp.decode("utf-8", errors=self.errors)
+        return text
+
+    @property
+    def vocab_size(self):
+        return self.tokenizer.n_vocab
+
+    def _convert_id_to_token(self, index: int) -> Union[bytes, str]:
+        """Converts an id to a token, special tokens included"""
+        if index in self.decoder:
+            return self.decoder[index]
+        raise ValueError("unknown ids")
+
+    def _convert_token_to_id(self, token: Union[bytes, str]) -> int:
+        """Converts a token to an id using the vocab, special tokens included"""
+        if token in self.special_tokens:
+            return self.special_tokens[token]
+        if token in self.mergeable_ranks:
+            return self.mergeable_ranks[token]
+        raise ValueError("unknown token")
+
+    def _tokenize(self, text: str, **kwargs):
+        """
+        Converts a string into a sequence of tokens (string), using the tokenizer. Splits into words for word-based
+        vocabularies or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).
+
+        Does NOT take care of added tokens.
+        """
+        raise NotImplementedError
+
+    def _decode(
+        self,
+        token_ids: Union[int, List[int]],
+        skip_special_tokens: bool = False,
+        errors: str = None,
+        **kwargs,
+    ) -> str:
+        if isinstance(token_ids, int):
+            token_ids = [token_ids]
+        if skip_special_tokens:
+            token_ids = [i for i in token_ids if i < self.eod_id]
+        return self.tokenizer.decode(token_ids, errors=errors or self.errors)
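Note on `convert_tokens_to_string`: regular tokens are `bytes` and special tokens are `str`, so a multi-byte UTF-8 character can be split across two adjacent byte tokens. The sketch below is a standalone illustration, not part of the commit. The helper name `join_tokens` and the sample token list are hypothetical. It suggests why the method buffers consecutive byte tokens and decodes them together rather than one at a time.

```python
from typing import List, Union


def join_tokens(tokens: List[Union[bytes, str]], errors: str = "replace") -> str:
    """Hypothetical stand-in mirroring the buffering in convert_tokens_to_string."""
    text = ""
    pending = b""  # byte tokens waiting to be decoded as one UTF-8 run
    for tok in tokens:
        if isinstance(tok, str):
            # a str token (a special token) flushes the pending bytes first
            if pending:
                text += pending.decode("utf-8", errors=errors)
                pending = b""
            text += tok
        elif isinstance(tok, bytes):
            pending += tok
        else:
            raise TypeError("token should only be of type bytes or str")
    if pending:
        text += pending.decode("utf-8", errors=errors)
    return text


# "你" is three UTF-8 bytes; split it across two byte tokens on purpose.
pieces = [b"\xe4\xbd", b"\xa0", "<|im_end|>"]

# Buffered decoding reassembles the character before decoding.
joined = join_tokens(pieces)

# Decoding each byte token on its own cannot recover the character.
naive = "".join(
    p.decode("utf-8", errors="replace") if isinstance(p, bytes) else p
    for p in pieces
)

print(repr(joined))
print(repr(naive))
```

With `errors="replace"`, the per-token decode yields replacement characters, while the buffered version recovers the original text. This is also why the constructor comment suggests `errors="ignore"` for streaming inference, where a chunk may end in the middle of a character.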
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,10 @@
+{
+  "model_max_length": 8192,
+  "tokenizer_class": "QWenTokenizer",
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenization_qwen.QWenTokenizer",
+      null
+      ]
+  }
+}
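Note on `auto_map`: the `AutoTokenizer` entry appears to follow the Transformers convention of a two-element list, with a reference to the slow tokenizer class first and a fast tokenizer class second. Here the second slot is `null`, so no fast tokenizer is declared. The sketch below is a standalone illustration under that assumption and does not call Transformers. The helper `split_reference` and the inlined `CONFIG_TEXT` are hypothetical names used only for this example.

```python
import json
from typing import Optional, Tuple

# Inlined copy of the config above, so the sketch runs without any files.
CONFIG_TEXT = """
{
  "model_max_length": 8192,
  "tokenizer_class": "QWenTokenizer",
  "auto_map": {
    "AutoTokenizer": ["tokenization_qwen.QWenTokenizer", null]
  }
}
"""


def split_reference(ref: Optional[str]) -> Optional[Tuple[str, str]]:
    """Split 'module.ClassName' into (module, class name); None stays None."""
    if ref is None:
        return None
    module_name, _, class_name = ref.rpartition(".")
    return module_name, class_name


config = json.loads(CONFIG_TEXT)
slow_ref, fast_ref = config["auto_map"]["AutoTokenizer"]

slow = split_reference(slow_ref)  # module file and class defined in this repo
fast = split_reference(fast_ref)  # JSON null becomes None: no fast tokenizer

print("slow:", slow)
print("fast:", fast)
```

Because the class lives in the repository's own `tokenization_qwen.py` rather than in the library, loading it through `AutoTokenizer` would normally require opting in to remote code.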
@@ -0,0 +1 @@
+{"framework": "pytorch", "task": "others", "allow_remote": true}