Initialize project; model provided by the ModelHub XC community

Model: toroe/SmolLM-3B-Science-DE Source: Original Platform
2026-05-29 03:56:16 +08:00
commit 85dd4b53c0
25 changed files with 8479 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,280 @@
+---
+language:
+  - de
+license: other
+base_model: HuggingFaceTB/SmolLM3-3B
+tags:
+  - sft
+  - instruction-tuning
+  - reasoning
+  - german
+  - multilingual
+  - long-context
+  - fsdp
+  - transformers
+datasets:
+  - DGurgurov/Nemotron-Multilingual-Reasoning
+metrics:
+  - token_accuracy
+library_name: transformers
+pipeline_tag: text-generation
+---
+
+ # SmolLM3-3B — German Reasoning Instruction SFT (Nemotron Multilingual Reasoning)
+
+ ## Model Description
+
+ This model is a **Supervised Fine-Tuned (SFT)** version of:
+
+ `HuggingFaceTB/SmolLM3-3B`
+
+ It was fine-tuned on the **German (`de`) split** of the dataset:
+
+ `DGurgurov/Nemotron-Multilingual-Reasoning`
+
+Training aimed to improve:
+
+- German instruction following
+- Step-by-step reasoning
+- Long-context conversation behavior
+
+The model was trained on chat-formatted conversations with **completion-only loss**: only tokens in assistant responses contributed to the loss.
+
+Key properties:
+
+- Base model: SmolLM3-3B
+- Language specialization: German
+- Context length during training: **16,384 tokens**
+- Chat-formatted dataset
+- Sequence packing enabled (packed sequences of up to 16,384 tokens)
+
+ ---
+
+ ## Intended Uses
+
+ ### Suitable For
+- German conversational assistants
+- Educational tutoring
+- Reasoning and structured explanation tasks
+- Long-document Q&A in German
+- Research experiments with long-context small LLMs
+
+ ### Not Suitable For
+- Medical or legal advice without human review
+- Autonomous decision-making
+- Safety-critical systems
+- High-stakes financial decisions
+
+ ---
+
+ ## Training Data
+
+ Dataset used:
+
+ `DGurgurov/Nemotron-Multilingual-Reasoning`
+
+ Processing configuration:
+
+ - Language filtering: **German only**
+- Converted into chat messages (`prepare_messages=True`)
+- Assistant-only optimization (`completion_only_loss=True`)
+
+ Only the assistant responses were used to compute loss; user and system messages were masked.
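
The masking described above can be illustrated with a small sketch. It assumes the common convention of setting ignored label positions to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The `roles` list and the `mask_non_assistant` helper are hypothetical names used only for illustration; the actual training script is not shown in this commit.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_non_assistant(input_ids, roles):
    """Return labels in which only assistant tokens keep their token ids.

    `roles` is a hypothetical per-token list ("system", "user", "assistant");
    a real pipeline would derive it from the chat template.
    """
    if len(input_ids) != len(roles):
        raise ValueError("input_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(input_ids, roles)
    ]


# Toy example: 2 system tokens, 3 user tokens, 3 assistant tokens.
input_ids = [11, 12, 21, 22, 23, 31, 32, 33]
roles = ["system"] * 2 + ["user"] * 3 + ["assistant"] * 3
labels = mask_non_assistant(input_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32, 33]
```

Positions labeled `-100` contribute nothing to the loss, so gradients come only from the assistant tokens.
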
+
+ Please review the dataset card for provenance and limitations.
+
+ ---
+
+ ## Training Procedure
+
+Training used **Hugging Face Accelerate with FSDP (Fully Sharded Data Parallel)** across 8 processes.
+
+ ### Core Setup
+
+ - Training method: Supervised fine-tuning (SFT)
+- Epochs: **3**
+- Maximum sequence length: **16,384**
+- Sequence packing: enabled
+- Precision: **bfloat16**
+- Kernel optimization: Liger kernel enabled
+- Gradient checkpointing: enabled
+- Distributed: FSDP (8 processes)
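
To make "sequence packing" concrete, here is a minimal sketch of one common approach: concatenate the tokenized examples into a single stream and cut it into blocks of at most `max_length` tokens. The commit does not state which packing strategy the trainer uses, so `pack_sequences` is a hypothetical helper and should not be read as the trainer's implementation.

```python
from itertools import chain


def pack_sequences(sequences, max_length):
    """Concatenate token lists and cut the stream into blocks of at most `max_length` tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    stream = list(chain.from_iterable(sequences))
    return [stream[i:i + max_length] for i in range(0, len(stream), max_length)]


# Toy example with max_length=8 instead of the 16,384 used in training.
examples = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10], [11, 12, 13, 14]]
packed = pack_sequences(examples, max_length=8)
print(packed)  # [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14]]
```

Packing reduces padding waste when examples are much shorter than the maximum length. Real implementations may also track example boundaries so that attention does not cross from one example into the next.
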
+
+ ---
+
+ ### Optimization
+
+- Optimizer: `adamw_torch_fused`
+- Per-device batch size: 4
+- Gradient accumulation: 4
+- Effective batch size: 16 sequences per optimizer step per GPU (4 × 4), or 128 across all 8 processes
+- Weight decay: 0.05
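
The batch-size figures above can be checked with a few lines of arithmetic. This sketch assumes each of the 8 processes drives one GPU with its own data shard, and that every packed sequence is filled to the full 16,384 tokens, so the token count is an upper bound.

```python
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_processes = 8       # assumption: one GPU per process
max_length = 16_384     # tokens per packed sequence (upper bound)

sequences_per_gpu = per_device_batch_size * gradient_accumulation_steps
sequences_global = sequences_per_gpu * num_processes
max_tokens_per_step = sequences_global * max_length

print(sequences_per_gpu)    # 16
print(sequences_global)     # 128
print(max_tokens_per_step)  # 2097152
```
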
+
+ Learning rate schedule:
+
+ - Scheduler: `cosine_with_min_lr`
+- Warmup ratio: 0.05
+- Minimum LR: 5e-6
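
The schedule above (linear warmup, then cosine decay down to a floor) can be sketched in plain Python. This is an illustrative re-implementation, not the library code. The peak learning rate and total step count are not given in this commit, so `peak_lr=2e-5` and `total_steps=1000` below are hypothetical values.

```python
import math


def cosine_with_min_lr(step, total_steps, peak_lr, min_lr=5e-6, warmup_ratio=0.05):
    """Learning rate at `step`: linear warmup, then cosine decay from peak_lr to min_lr."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical run: 1000 optimizer steps with a peak learning rate of 2e-5.
for step in (0, 25, 50, 525, 1000):
    print(step, cosine_with_min_lr(step, total_steps=1000, peak_lr=2e-5))
```

With these values the rate starts at zero, reaches the peak at the end of warmup (step 50), and ends at the 5e-6 floor instead of decaying to zero.
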
+
+ ---
+
+ ### Logging & Checkpoints
+
+ - Logging every 5 steps
+- Checkpoint every 450 steps
+- Weights & Biases tracking enabled
+- Token accuracy logged during training
+
+ ---
+
+ ### Data Processing
+
+- Dataset workers: 16 (`dataset_num_proc 16`)
+- Dataset preparation: enabled (`skip_prepare_dataset False`)
+- Chat message preparation: enabled (`prepare_messages True`)
+- Language split: German (`lang_split de`)
+
+ ---
+
+ ## Usage
+
+ ### Transformers
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "YOUR_USERNAME/YOUR_MODEL_NAME"  # placeholder: replace with this model's repo id
+
+tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",  # requires the `accelerate` package
+    torch_dtype=torch.bfloat16,  # matches the training precision
+)
+
+messages = [
+    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
+    {"role": "user", "content": "Warum ist der Himmel blau?"},
+]
+
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=512,
+    temperature=0.7,
+    top_p=0.9,
+    do_sample=True,  # sampling must be on for temperature and top_p to take effect
+)
+
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # prints prompt and completion
+```
+**Important:**  
+Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, so prompts that bypass the template will likely perform worse.
+
+ ---
+
+ ## Evaluation
+
+ During training, **token accuracy** was logged as a diagnostic metric.
+
+ Token accuracy:
+- is useful for monitoring training stability
+- is **NOT** a benchmark score
+- does not represent real reasoning performance
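
For reference, token accuracy is usually computed as the share of unmasked positions where the predicted token equals the label. The sketch below assumes that definition and the `-100` ignore index. The exact computation behind `log_token_accuracy` is not shown in this commit, so `token_accuracy` here is a hypothetical helper.

```python
IGNORE_INDEX = -100  # positions excluded from the loss are excluded here as well


def token_accuracy(predictions, labels):
    """Fraction of non-ignored positions where the predicted token id equals the label."""
    pairs = [(p, y) for p, y in zip(predictions, labels) if y != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(p == y for p, y in pairs) / len(pairs)


# Toy example: two masked prompt positions, four assistant positions, one mismatch.
labels = [-100, -100, 31, 32, 33, 34]
predictions = [7, 8, 31, 99, 33, 34]
acc = token_accuracy(predictions, labels)
print(acc)  # 0.75
```

A high value means the model predicts the reference tokens well under teacher forcing. It says little about the quality of free-running generation.
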
+
+ For proper evaluation, use:
+- German instruction-following benchmarks
+- reasoning datasets
+- long-context evaluation tasks
+
+ ---
+
+ ## Limitations
+
+ - May hallucinate facts
+- Reasoning chains can still contain logical errors
+- Performance near 16k context depends heavily on prompt structure
+- Improvements mainly apply to German
+- Smaller model size means weaker world knowledge than large LLMs
+- Not aligned for safety-critical deployment
+
+ ---
+
+ ## Bias & Safety
+
+ This model inherits biases from:
+- the base model
+- the training dataset
+
+ Recommended mitigations:
+- add moderation filters
+- use system prompts enforcing safe behavior
+- include human review for sensitive deployments
+
+ ---
+
+ ## License
+
+ This model is a derivative of:
+
+ `HuggingFaceTB/SmolLM3-3B`
+
+ Therefore, the original base model license and usage restrictions apply, along with any dataset terms.
+
+ Verify compatibility before commercial deployment.
+
+ ---
+
+ ## Reproducibility (Training Arguments)
+
+ ```text
+accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py
+
+ --model_name HuggingFaceTB/SmolLM3-3B
+--tokenizer_name HuggingFaceTB/SmolLM3-3B
+--dataset_path DGurgurov/Nemotron-Multilingual-Reasoning
+--skip_prepare_dataset False
+--lang_split de
+--prepare_messages True
+--completion_only_loss True
+--max_length 16384
+--dataset_num_proc 16
+--packing True
+--use_liger_kernel True
+--bf16 True
+--log_token_accuracy True
+--optim adamw_torch_fused
+--gradient_checkpointing True
+--per_device_train_batch_size 4
+--gradient_accumulation_steps 4
+--ddp_find_unused_parameters False
+--lr_scheduler_type cosine_with_min_lr
+--lr_scheduler_kwargs {"min_lr": 5.0e-6}
+--warmup_ratio 0.05
+--weight_decay 0.05
+--report_to wandb
+--run_name smol_3b_3epochs_lns_de
+--num_train_epochs 3
+--save_strategy steps
+--logging_steps 5
+--save_steps 450
+```
+---
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ - `HuggingFaceTB/SmolLM3-3B`
+- `DGurgurov/Nemotron-Multilingual-Reasoning`
+
+ ---
+
+ ## Acknowledgements
+
+ - HuggingFaceTB — SmolLM3 base model
+- Nemotron Multilingual Reasoning dataset authors
+- Hugging Face Accelerate and Transformers libraries