Initialize project; model provided by the ModelHub XC community
Model: daekeun-ml/phi-2-ko-v0.1 Source: Original Platform
---
library_name: transformers
license: cc-by-sa-3.0
datasets:
- wikimedia/wikipedia
- maywell/korean_textbooks
- nampdn-ai/tiny-codes
- Open-Orca/OpenOrca
language:
- ko
- en
inference: false
---
# phi-2-ko-v0.1

## Model Details

This model is a Korean-specialized model built on phi-2 by adding a Korean tokenizer and training on Korean data. (English is also supported.)

Although phi-2 performs very well, it does not support Korean: its tokenizer was not trained on a Korean corpus, so tokenizing Korean text uses many times more tokens than equivalent English text.

To overcome these limitations, I continued training the model on an open-license Korean corpus together with some English corpus. The reasons for including the English corpus are as follows:

1. To preserve the strong performance of the base model by preventing catastrophic forgetting.
2. Mixing English and Korean prompts usually produces better results than using Korean-only prompts.

Since my role is not that of a working developer but of a solutions architect helping customers with quick PoCs/prototypes, and since I was limited by the AWS GPU resources available to me, I trained on only 5GB of data instead of hundreds of GB of massive data.
### Vocab Expansion

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original phi-2 | 50,295 | BBPE (byte-level BPE) |
| **phi-2-ko** | 66,676 | BBPE; added Korean vocab and merges |

**Tokenizing "아마존 세이지메이커"**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original phi-2 | 25 | `[168, 243, 226, 167, 100, 230, 168, 94, 112, 23821, 226, 116, 35975, 112, 168, 100, 222, 167, 102, 242, 35975, 112, 168, 119, 97]` |
| **phi-2-ko** | 6 | `[57974, 51299, 50617, 51005, 52027, 51446]` |
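As a rough illustration of the savings, the token counts from the table above imply about a 4x compression for this phrase (a minimal arithmetic sketch using only the numbers in the table; real ratios vary from text to text):

```python
# Token counts for "아마존 세이지메이커", taken from the table above
phi2_count = 25     # original phi-2 tokenizer
phi2_ko_count = 6   # phi-2-ko tokenizer with added Korean vocab

ratio = phi2_count / phi2_ko_count
print(f"phi-2-ko uses about {ratio:.1f}x fewer tokens for this phrase")
```

Fewer tokens per Korean sentence means more effective context within the model's fixed context window, and faster generation for the same amount of text.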
### Continued pre-training

The datasets used for training are as follows. To prevent catastrophic forgetting, I included some English corpus in the training data.

- Wikipedia Korean dataset (https://huggingface.co/datasets/wikimedia/wikipedia)
- Massive Korean synthetic dataset (https://huggingface.co/datasets/maywell/korean_textbooks)
- Tiny code dataset (https://huggingface.co/datasets/nampdn-ai/tiny-codes)
- OpenOrca dataset (https://huggingface.co/datasets/Open-Orca/OpenOrca)
- Some of the various sentences I wrote myself (personal blog, chat, etc.)

Note that performance is not guaranteed, since only a small amount of data was used for this experiment; the training set contains only about 5 million samples after tokenization.

For distributed training, all weights were trained without adapter techniques, and sharding parallelization was performed with ZeRO-2. The DeepSpeed presets are as follows.
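To see why ZeRO-2 matters on limited GPUs, here is a back-of-the-envelope per-GPU memory estimate, following the model-state accounting in the ZeRO paper (fp16 training with Adam). The 8-GPU world size is my assumption for illustration and is not stated in this card:

```python
# ZeRO stage 2: fp16 weights (2 bytes/param) stay replicated on every GPU,
# while fp16 gradients (2 bytes/param) and Adam optimizer states
# (12 bytes/param: fp32 master weights, momentum, variance) are
# partitioned across the N data-parallel workers.
params = 2.7e9  # phi-2 has ~2.7B parameters
n_gpus = 8      # assumed world size

baseline_gb = (2 + 2 + 12) * params / 1e9                  # no sharding
zero2_gb = (2 * params + (2 + 12) * params / n_gpus) / 1e9  # ZeRO-2

print(f"replicated: {baseline_gb:.1f} GB/GPU, ZeRO-2: {zero2_gb:.1f} GB/GPU")
```

With `"cpu_offload": true` as in the config below, the partitioned optimizer states are additionally moved to CPU memory, cutting GPU usage further at the cost of extra host-device traffic.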
Since this model has not been fine-tuned, it is recommended to perform fine-tuning such as instruction tuning or alignment tuning according to your use case.
```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```
Some hyperparameters are listed below.

```
batch_size: 2
num_epochs: 1
learning_rate: 3e-4
gradient_accumulation_steps: 8
lr_scheduler_type: "linear"
group_by_length: False
```
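The `"auto"` values in the DeepSpeed config above are resolved from these hyperparameters at launch time. For example, DeepSpeed checks that the global batch size equals the micro-batch size times gradient accumulation times world size; a quick sketch of that relationship (the 8-GPU world size is my assumption):

```python
micro_batch_per_gpu = 2   # batch_size above
grad_accum_steps = 8      # gradient_accumulation_steps above
world_size = 8            # assumed number of GPUs

# DeepSpeed: train_batch_size = micro_batch * grad_accum * world_size
effective_batch = micro_batch_per_gpu * grad_accum_steps * world_size
print(effective_batch)  # 128
```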
## How to Get Started with the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("daekeun-ml/phi-2-ko-v0.1", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/phi-2-ko-v0.1", trust_remote_code=True)

# Korean
inputs = tokenizer("머신러닝은 ", return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

# English
inputs = tokenizer('''def print_prime(n):
    """
    Print all primes between 1 and n
    """''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```
### References

- Base model: [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
## Notes

### License

CC-BY-SA 3.0. The license of phi-2 itself is MIT, but I chose CC-BY-SA 3.0 in consideration of the licensing of the datasets used for training.
### Caution

This model was created as a personal experiment and is unrelated to the organization I work for. It may not operate correctly, as no separate validation was performed. Please use it only for personal experimentation or PoC (Proof of Concept) purposes!