Initialize the project; model provided by the ModelHub XC community
Model: khazarai/Llama-electronic-radiology-TR Source: Original Platform
.gitattributes (vendored, new file, 36 lines)
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
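Each of these rules routes matching paths through Git LFS (`filter`, `diff`, and `merge` set to `lfs`) and unsets the `text` attribute so the files are treated as binary. A minimal sketch of how such lines parse, using a hypothetical pure-Python helper (real Git uses its own attribute matcher):

```python
# Sketch: parse Git LFS rules like those in this .gitattributes.
# Hypothetical helper for illustration only.

def parse_gitattributes(text):
    """Map each pattern to its attribute dict, e.g. {'filter': 'lfs', ...}."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *attrs = line.split()
        parsed = {}
        for attr in attrs:
            if attr.startswith("-"):
                parsed[attr[1:]] = False   # "-text" unsets the attribute
            elif "=" in attr:
                key, value = attr.split("=", 1)
                parsed[key] = value
            else:
                parsed[attr] = True
        rules[pattern] = parsed
    return rules

rules = parse_gitattributes("*.safetensors filter=lfs diff=lfs merge=lfs -text")
print(rules["*.safetensors"])
# {'filter': 'lfs', 'diff': 'lfs', 'merge': 'lfs', 'text': False}
```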
README.md (new file, 131 lines)
@@ -0,0 +1,131 @@
---
license: llama3.2
datasets:
- hazal/electronic-radiology-phd-thesis-trR
language:
- tr
base_model:
- unsloth/Llama-3.2-1B
pipeline_tag: text-generation
library_name: transformers
tags:
- radiology
---
# Model Card for Llama-electronic-radiology-TR

## Model Details

## Model Summary

This model is a domain-adapted version of Llama-3.2-1B, fine-tuned via continued pretraining on Turkish-language electronic radiology PhD theses.

The model was trained in an autoregressive (causal language modeling) setup using the [hazal/electronic-radiology-phd-thesis-trR](https://huggingface.co/datasets/hazal/electronic-radiology-phd-thesis-trR) dataset.

Unlike instruction-tuned models, this version focuses on improving the model's fluency, vocabulary, and semantic consistency in highly technical medical and radiological contexts.

It is intended for downstream applications such as domain-specific generation, summarization, and potential fine-tuning for clinical tasks in Turkish.
### Model Description

- **Language(s) (NLP):** Turkish
- **License:** Llama 3.2
- **Finetuned from model:** Llama-3.2-1B
## Uses

### Direct Use

The primary intended uses include:

- **Domain-specific generation**: Generating fluent, semantically rich Turkish text in radiological contexts, e.g., imaging protocols, research summaries, or academic abstracts.
- **Medical document summarization**: Summarizing long Turkish-language radiological texts, such as reports or thesis chapters.
- **Language modeling for downstream tasks**: Serving as a base model for fine-tuning into instruction-tuned clinical models or QA systems in radiology.
- **Research applications**: Assisting in the development of Turkish-language models for clinical NLP, especially in low-resource and domain-specific contexts.

This model **is not instruction-tuned** and does not perform well in prompt-based Q&A or dialogue setups without additional supervised fine-tuning.
## Bias, Risks, and Limitations

### 🔬 Domain Bias

The model has been trained exclusively on Turkish PhD-level academic texts in radiology. As such, its knowledge and language patterns are narrowly focused on:

- Formal, academic Turkish
- Medical terminology in radiology and imaging
- Structured, dissertation-like content

It may underperform or produce awkward completions when applied to:

- Conversational Turkish
- Non-medical or non-radiological topics
- Informal writing styles or dialectal Turkish
### ❌ Medical Safety

This model **should not be used for clinical decision-making, diagnosis, or treatment recommendations**. Despite being trained on medical content, it lacks factual grounding, context awareness, and real-time clinical judgment.

Any outputs generated by this model must be verified by licensed medical professionals.
### 🧠 Memorization Risk

Continued pretraining on a small or repetitive corpus can lead to memorization of phrases, potentially exposing:

- Patient case formats
- Study identifiers
- Sections of dissertations

Although this dataset appears anonymized and academic, **you should not use the model for data anonymization, patient privacy protection, or regulatory compliance tasks**.
### 🧪 Limitations

- The model has no grounding in real-world imaging data or structured knowledge bases.
- Outputs may hallucinate plausible-sounding but incorrect medical facts.
- It is limited to Turkish and does not generalize to multilingual or English medical contexts.
- Repetition or looping may still occur in long sequences if decoding is not configured properly (`repetition_penalty`, `eos_token_id`, etc.).
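A repetition penalty rescales the logits of tokens that were already generated before sampling the next one. A minimal pure-Python sketch of the CTRL-style rule (an assumption about the exact formula; the transformers implementation operates on batched tensors):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Discourage tokens that already appeared: shrink positive logits,
    push negative logits further down (CTRL-style penalty)."""
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out

# Toy vocabulary of three tokens; tokens 0 and 1 were already generated.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1]))
```

With `penalty > 1.0` every previously seen token becomes less likely, which is why values like 1.2 help suppress the looping mentioned above.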
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("khazarai/Llama-electronic-radiology-TR")
model = AutoModelForCausalLM.from_pretrained(
    "khazarai/Llama-electronic-radiology-TR",
    device_map={"": 0},  # place the whole model on GPU 0
)

input_text = "Bulgular: Gruplar arası yaş ve cinsiyet dağılımı açısından istatistiksel olarak anlamlı farklılık saptanmadı."

inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    repetition_penalty=1.2,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details

### Training Data

- **Hours used:** 10 hours
## Dataset: hazal/electronic-radiology-phd-thesis-trR

This dataset contains Turkish-language PhD theses focused on electronic and diagnostic radiology. It was curated for the purpose of training language models in the Turkish medical domain. The documents are academic in tone, rich in domain-specific vocabulary, and structured into medical sections (e.g., materials & methods, results, discussion).

- **Language**: Turkish
- **Domain**: Electronic radiology
- **Type**: Academic dissertations
- **Preprocessing**: The dataset was tokenized and truncated to a maximum sequence length suitable for LLM training. No instruction-style formatting was applied.

Dataset link: [hazal/electronic-radiology-phd-thesis-trR](https://huggingface.co/datasets/hazal/electronic-radiology-phd-thesis-trR)
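The preprocessing bullet above describes tokenizing and truncating to a fixed maximum length. A toy sketch of the usual concatenate-then-chunk packing for causal-LM continued pretraining (the integer IDs stand in for real tokenizer output; the actual pipeline is not documented here):

```python
def pack_into_blocks(token_ids, block_size):
    """Chunk a concatenated token stream into fixed-length blocks,
    as in typical causal-LM pretraining; the trailing remainder
    shorter than block_size is dropped."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]

# Stand-in for real tokenizer output:
ids = list(range(10))
print(pack_into_blocks(ids, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```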
config.json (new file, 37 lines)
@@ -0,0 +1,37 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "float16",
  "eos_token_id": 128001,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pad_token_id": 128004,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "transformers_version": "4.56.2",
  "unsloth_fixed": true,
  "use_cache": true,
  "vocab_size": 128256
}
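Several of the fields above are mutually consistent, which is worth sanity-checking when editing a config. A quick arithmetic check with values copied from the JSON:

```python
# Values from config.json above
hidden_size = 2048
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 64

# Per-head dimension follows from hidden size / number of attention heads.
assert hidden_size // num_attention_heads == head_dim  # 2048 / 32 = 64

# Grouped-query attention: each key/value head is shared by this many query heads.
gqa_group_size = num_attention_heads // num_key_value_heads
print(gqa_group_size)  # 4
```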
generation_config.json (new file, 11 lines)
@@ -0,0 +1,11 @@
{
  "_from_model_config": true,
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "max_length": 131072,
  "pad_token_id": 128004,
  "temperature": 0.6,
  "top_p": 0.9,
  "transformers_version": "4.56.2"
}
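The `temperature` and `top_p` defaults above control sampling; nucleus (top-p) filtering keeps the smallest set of tokens whose cumulative probability reaches `top_p` and renormalizes over that set. A minimal pure-Python illustration (not the transformers implementation):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; return {index: renormalized probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy next-token distribution over four tokens; top_p=0.9 keeps tokens 0-2
# (0.5 + 0.3 = 0.8 < 0.9, so token 2 is also needed) and drops the tail.
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```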
model.safetensors (new file, 3 lines, LFS pointer)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:98b7609e3824aa6e3255ddab70748790146612f33b9ed52d80f37d9bf3e1bca3
size 2471645464
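The `size` field of this LFS pointer is consistent with a 1B-class model stored in float16 (2 bytes per parameter, per the `dtype` in config.json). The estimate is approximate because the safetensors file also contains a small metadata header:

```python
size_bytes = 2_471_645_464   # from the LFS pointer above
bytes_per_param = 2          # float16 storage

approx_params = size_bytes // bytes_per_param
print(approx_params)  # 1235822732, i.e. about 1.24B parameters
```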
special_tokens_map.json (new file, 23 lines)
@@ -0,0 +1,23 @@
{
  "bos_token": {
    "content": "<|begin_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|finetune_right_pad_id|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json (new file, 3 lines, LFS pointer)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
size 17209920
tokenizer_config.json (new file, 2066 lines)
File diff suppressed because it is too large