Initialize the project; model provided by the ModelHub XC community
Model: khazarai/Llama-electronic-radiology-TR Source: Original Platform
.gitattributes (vendored, new file, 36 lines)
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
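Each of these rules routes matching paths through Git LFS (`filter`, `diff`, and `merge` set to `lfs`) and unsets the `text` attribute so the files are treated as binary. A minimal sketch of how such lines parse, using a hypothetical pure-Python helper (real Git uses its own attribute matcher):

```python
# Sketch: parse Git LFS rules like those in this .gitattributes.
# Hypothetical helper for illustration only.

def parse_gitattributes(text):
    """Map each pattern to its attribute dict, e.g. {'filter': 'lfs', ...}."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *attrs = line.split()
        parsed = {}
        for attr in attrs:
            if attr.startswith("-"):
                parsed[attr[1:]] = False   # "-text" unsets the attribute
            elif "=" in attr:
                key, value = attr.split("=", 1)
                parsed[key] = value
            else:
                parsed[attr] = True
        rules[pattern] = parsed
    return rules

rules = parse_gitattributes("*.safetensors filter=lfs diff=lfs merge=lfs -text")
print(rules["*.safetensors"])
# {'filter': 'lfs', 'diff': 'lfs', 'merge': 'lfs', 'text': False}
```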
README.md (new file, 131 lines)
@@ -0,0 +1,131 @@
---
license: llama3.2
datasets:
- hazal/electronic-radiology-phd-thesis-trR
language:
- tr
base_model:
- unsloth/Llama-3.2-1B
pipeline_tag: text-generation
library_name: transformers
tags:
- radiology
---
# Model Card for Llama-electronic-radiology-TR

## Model Details

## Model Summary

This model is a domain-adapted version of Llama-3.2-1B, fine-tuned via continued pretraining on Turkish-language electronic radiology PhD theses.

The model was trained in an autoregressive (causal language modeling) setup using the [hazal/electronic-radiology-phd-thesis-trR](https://huggingface.co/datasets/hazal/electronic-radiology-phd-thesis-trR) dataset.

Unlike instruction-tuned models, this version focuses on improving the model's fluency, vocabulary, and semantic consistency in highly technical medical and radiological contexts.

It is intended for downstream applications such as domain-specific generation, summarization, and potential fine-tuning for clinical tasks in Turkish.
### Model Description

- **Language(s) (NLP):** Turkish
- **License:** Llama 3.2
- **Finetuned from model:** Llama-3.2-1B
## Uses

### Direct Use

The primary intended uses include:

- **Domain-specific generation**: Generating fluent, semantically rich Turkish text in radiological contexts, e.g., imaging protocols, research summaries, or academic abstracts.
- **Medical document summarization**: Summarizing long Turkish-language radiological texts, such as reports or thesis chapters.
- **Language modeling for downstream tasks**: Serving as a base model for fine-tuning into instruction-tuned clinical models or QA systems in radiology.
- **Research applications**: Assisting in the development of Turkish-language models for clinical NLP, especially in low-resource and domain-specific contexts.

This model **is not instruction-tuned** and does not perform well in prompt-based Q&A or dialogue setups without additional supervised fine-tuning.
## Bias, Risks, and Limitations

### 🔬 Domain Bias

The model has been trained exclusively on Turkish PhD-level academic texts in radiology. As such, its knowledge and language patterns are narrowly focused on:

- Formal, academic Turkish
- Medical terminology in radiology and imaging
- Structured, dissertation-like content

It may underperform or produce awkward completions when applied to:

- Conversational Turkish
- Non-medical or non-radiological topics
- Informal writing styles or dialectal Turkish
### ❌ Medical Safety

This model **should not be used for clinical decision-making, diagnosis, or treatment recommendations**. Despite being trained on medical content, it lacks factual grounding, context awareness, and real-time clinical judgment.

Any outputs generated by this model must be verified by licensed medical professionals.
### 🧠 Memorization Risk

Continued pretraining on a small or repetitive corpus can lead to memorization of phrases, potentially exposing:

- Patient case formats
- Study identifiers
- Sections of dissertations

Although this dataset appears anonymized and academic, **you should not use the model for data anonymization, patient privacy protection, or regulatory compliance tasks**.
### 🧪 Limitations

- The model has no grounding in real-world imaging data or structured knowledge bases.
- Outputs may hallucinate plausible-sounding but incorrect medical facts.
- It is limited to Turkish and does not generalize to multilingual or English medical contexts.
- Repetition or looping may still occur in long sequences if decoding is not configured properly (`repetition_penalty`, `eos_token_id`, etc.).
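A repetition penalty rescales the logits of tokens that were already generated before sampling the next one. A minimal pure-Python sketch of the CTRL-style rule (an assumption about the exact formula; the transformers implementation operates on batched tensors):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Discourage tokens that already appeared: shrink positive logits,
    push negative logits further down (CTRL-style penalty)."""
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out

# Toy vocabulary of three tokens; tokens 0 and 1 were already generated.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1]))
```

With `penalty > 1.0` every previously seen token becomes less likely, which is why values like 1.2 help suppress the looping mentioned above.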
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("khazarai/Llama-electronic-radiology-TR")
model = AutoModelForCausalLM.from_pretrained(
    "khazarai/Llama-electronic-radiology-TR",
    device_map={"": 0},  # place the whole model on GPU 0
)

input_text = "Bulgular: Gruplar arası yaş ve cinsiyet dağılımı açısından istatistiksel olarak anlamlı farklılık saptanmadı."

inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    repetition_penalty=1.2,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details

### Training Data

- **Hours used:** 10 hours
## Dataset: hazal/electronic-radiology-phd-thesis-trR

This dataset contains Turkish-language PhD theses focused on electronic and diagnostic radiology. It was curated for the purpose of training language models in the Turkish medical domain. The documents are academic in tone, rich in domain-specific vocabulary, and structured into medical sections (e.g., materials & methods, results, discussion).

- **Language**: Turkish
- **Domain**: Electronic radiology
- **Type**: Academic dissertations
- **Preprocessing**: The dataset was tokenized and truncated to a maximum sequence length suitable for LLM training. No instruction-style formatting was applied.

Dataset link: [hazal/electronic-radiology-phd-thesis-trR](https://huggingface.co/datasets/hazal/electronic-radiology-phd-thesis-trR)
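The preprocessing bullet above describes tokenizing and truncating to a fixed maximum length. A toy sketch of the usual concatenate-then-chunk packing for causal-LM continued pretraining (the integer IDs stand in for real tokenizer output; the actual pipeline is not documented here):

```python
def pack_into_blocks(token_ids, block_size):
    """Chunk a concatenated token stream into fixed-length blocks,
    as in typical causal-LM pretraining; the trailing remainder
    shorter than block_size is dropped."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]

# Stand-in for real tokenizer output:
ids = list(range(10))
print(pack_into_blocks(ids, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```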
config.json (new file, 37 lines)
@@ -0,0 +1,37 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "float16",
  "eos_token_id": 128001,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pad_token_id": 128004,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "transformers_version": "4.56.2",
  "unsloth_fixed": true,
  "use_cache": true,
  "vocab_size": 128256
}
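Several of the fields above are mutually consistent, which is worth sanity-checking when editing a config. A quick arithmetic check with values copied from the JSON:

```python
# Values from config.json above
hidden_size = 2048
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 64

# Per-head dimension follows from hidden size / number of attention heads.
assert hidden_size // num_attention_heads == head_dim  # 2048 / 32 = 64

# Grouped-query attention: each key/value head is shared by this many query heads.
gqa_group_size = num_attention_heads // num_key_value_heads
print(gqa_group_size)  # 4
```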
generation_config.json (new file, 11 lines)
@@ -0,0 +1,11 @@
{
  "_from_model_config": true,
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "max_length": 131072,
  "pad_token_id": 128004,
  "temperature": 0.6,
  "top_p": 0.9,
  "transformers_version": "4.56.2"
}
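The `temperature` and `top_p` defaults above control sampling; nucleus (top-p) filtering keeps the smallest set of tokens whose cumulative probability reaches `top_p` and renormalizes over that set. A minimal pure-Python illustration (not the transformers implementation):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; return {index: renormalized probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy next-token distribution over four tokens; top_p=0.9 keeps tokens 0-2
# (0.5 + 0.3 = 0.8 < 0.9, so token 2 is also needed) and drops the tail.
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```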
model.safetensors (new file, 3 lines, LFS pointer)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:98b7609e3824aa6e3255ddab70748790146612f33b9ed52d80f37d9bf3e1bca3
size 2471645464
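The `size` field of this LFS pointer is consistent with a 1B-class model stored in float16 (2 bytes per parameter, per the `dtype` in config.json). The estimate is approximate because the safetensors file also contains a small metadata header:

```python
size_bytes = 2_471_645_464   # from the LFS pointer above
bytes_per_param = 2          # float16 storage

approx_params = size_bytes // bytes_per_param
print(approx_params)  # 1235822732, i.e. about 1.24B parameters
```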
special_tokens_map.json (new file, 23 lines)
@@ -0,0 +1,23 @@
{
  "bos_token": {
    "content": "<|begin_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|finetune_right_pad_id|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json (new file, 3 lines, LFS pointer)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
size 17209920
tokenizer_config.json (new file, 2066 lines)
File diff suppressed because it is too large