---
license: apache-2.0
language: en
library_name: peft
pipeline_tag: text-generation
tags:
- text-generation
- qlora
- lora
- peft
- dental
- medical
- domain-specialized
- 7b
- gguf
base_model: Qwen/Qwen2.5-7B-Instruct
---



> 🔧 **Toolkit:** This model was forged with [**SLM-Forge**](https://github.com/Dshamir/slm-forge), a public, MIT-licensed, semi-autonomous skill tree for the Claude Code TUI that takes you from corpus + budget to a trained + evaluated + quantized + published Small Specialty Language Model in a single session, with one human gate. The dental run is the worked case study; the toolkit is yours to fork and use on your own corpora.

# dental-ai-research-slm: Qwen2.5-7B-Instruct + IntelliDent dental-AI research LoRA (v0)

A **research-methodology assistant**, not a clinical assistant.

LoRA fine-tune of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) on a curated corpus of dental-AI research papers from the IntelliDent / Polytechnique Montréal group, covering tooth segmentation (MeshSegNet, iMeshSegNet, MC-Net), preparation margin-line detection, crown generation (PointR / PointNet++), intraoral 3D scan processing, and adjacent dental imaging research.

> ## ⚠️ The slug says `slm-0m`. The model is **not** a 0M-parameter model.
>
> `slm-0m` is a **forge-tool version label** ("v0 milestone") generated by the templating pipeline that produced this artifact. It does **not** describe the model size.
>
> **Actual size:**
> - Base: **Qwen/Qwen2.5-7B-Instruct**, 7.62 B parameters
> - Trainable LoRA adapter: **~20 M parameters** (0.46 % of the 4.37 B parameters counted after 4-bit quantization, or about 0.26 % of the full 7.62 B)
>
> Future milestones will use clearer slugs (`v0`, `v0.1`, `v1`).

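
The adapter size quoted above can be reproduced from the LoRA configuration listed under *Training* (rank 32 on `q_proj`, `k_proj`, `v_proj`, `o_proj`). The sketch below is a back-of-the-envelope check, not forge code: the layer dimensions are assumptions taken from the base model's published configuration (hidden size 3584, 4 key/value heads of dimension 128, 28 layers), and `lora_params` is a hypothetical helper name.

```python
# Assumed Qwen2.5-7B-Instruct dimensions (from the base model's public config,
# not stated in this card).
HIDDEN = 3584      # hidden size
KV_DIM = 4 * 128   # 4 key/value heads x head dim 128 (grouped-query attention)
NUM_LAYERS = 28
RANK = 32          # LoRA rank from the Training table

# (in_features, out_features) of each targeted projection.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(rank: int, targets: dict, num_layers: int) -> int:
    """Each adapted matrix adds A (rank x in) plus B (out x rank) weights."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in targets.values())
    return per_layer * num_layers


trainable = lora_params(RANK, TARGETS, NUM_LAYERS)
print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of 4.37 B (4-bit packed count): {100 * trainable / 4.37e9:.2f} %")
print(f"share of 7.62 B (full base):          {100 * trainable / 7.62e9:.2f} %")
```

Under these assumptions the count matches the 20,185,088 trainable parameters reported in the Training table. The two percentages differ only in the denominator: the 4-bit loader packs several weights into each stored element, so the reported total shrinks while the adapter stays the same size.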
## ✅ What this model is for / 🚫 what it is not for

| ✅ In-scope | 🚫 Out-of-scope (model abstains) |
|---|---|
| MeshSegNet / iMeshSegNet / MC-Net architectures | Clinical questions (caries, hygiene, treatment, diagnosis) |
| Margin-line detection methods | Generic dental questions a patient would ask |
| Crown-generation pipelines (PointR, PointNet++, GAN completion) | Other specialties: orthodontics, periodontics, implantology, oral surgery, oral cancer, TMJ, cosmetic dentistry |
| Intraoral 3D scan acquisition + decimation | Methods outside the IntelliDent corpus |
| Method comparison / methodology summaries in an academic register | Drug design, dental insurance, clinical recommendations |

**Abstention contract.** When asked anything out-of-scope, the model is instructed to answer exactly:

> "This question is outside the dental-AI research-methodology corpus this model was trained on. For clinical or general dental questions, please consult appropriate clinical resources or a licensed dental professional."

The abstention contract is enforced via the system prompt baked into the shipped [`Modelfile`](./Modelfile). **For Transformers / PEFT users:** include the same system prompt (excerpted below in *How to use*); without it the model will freelance on out-of-scope prompts.

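
Because the contract asks for an exact sentence, it can be checked mechanically in a smoke test. The following is a minimal sketch, assuming you already have the model's reply as a string; `normalize` and `is_abstention` are hypothetical helper names, not part of SLM-Forge or the shipped files.

```python
# The abstention sentence, copied from the contract above.
ABSTENTION = (
    "This question is outside the dental-AI research-methodology corpus "
    "this model was trained on. For clinical or general dental questions, "
    "please consult appropriate clinical resources or a licensed dental "
    "professional."
)


def normalize(text: str) -> str:
    """Collapse runs of whitespace and drop surrounding quotes or spaces."""
    collapsed = " ".join(text.split())
    return collapsed.strip('"\u201c\u201d ')


def is_abstention(reply: str) -> bool:
    """True when the reply is the abstention sentence and nothing else."""
    return normalize(reply) == ABSTENTION


# Example: a quoted, newline-terminated reply still counts as an abstention,
# while a clinical-sounding answer does not.
print(is_abstention('"' + ABSTENTION + '"\n'))
print(is_abstention("Cavities are caused by acid-producing bacteria."))
```

An exact-match check is deliberately strict: a reply that abstains and then adds clinical detail fails it, which is the behavior the contract is meant to rule out.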
## What's in this repo

| File / dir | Purpose | Size |
|---|---|---|
| `adapter_model.safetensors` + `adapter_config.json` | LoRA adapter (r=32, α=64); apply on top of `Qwen/Qwen2.5-7B-Instruct` via PEFT | ~80 MB |
| `gguf/model-Q4_K_M.gguf` | Merged + quantized 4-bit GGUF; best size/quality trade-off, runs on a modern laptop | ~4.7 GB |
| `gguf/model-Q8_0.gguf` | Merged + quantized 8-bit GGUF; near-lossless | ~8.1 GB |
| `Modelfile` | Ollama Modelfile pre-configured with system prompt + sampling defaults (see below) | - |
| `tokenizer.json` / `vocab.json` / `merges.txt` / `chat_template.jinja` | Qwen2 tokenizer + ChatML chat template | - |
| `LICENSE` | Apache-2.0 (inherited from base) | - |

## Training

| Setting | Value |
|---|---|
| **Base model** | `Qwen/Qwen2.5-7B-Instruct` (Apache-2.0) |
| **Regime** | QLoRA: base loaded in 4-bit NF4 (bitsandbytes), LoRA adapter trained in bf16 |
| **LoRA config** | rank 32, alpha 64, dropout 0.05; targets `q_proj, k_proj, v_proj, o_proj` |
| **Trainable params** | 20,185,088 (0.46 % of 4.37 B post-quantization) |
| **Sequence length** | 2048 tokens |
| **Effective batch** | 8 (micro-batch 2 × grad-accum 4) |
| **Steps** | 900 (~2.94 epochs over 2,209 train examples) |
| **Learning rate** | 1e-4, cosine schedule, gradient checkpointing on |
| **Hardware** | AWS `g5.2xlarge`: 1× NVIDIA A10G (24 GB), 8 vCPU, 32 GB RAM, ca-central-1 |
| **Train wall-clock** | 5 h 53 min |
| **Total forge wall-clock** | 9 h 03 min (training + provisioning + repeated phase reruns from in-flight bug fixes) |
| **AWS spend** | $11.62 (on-demand g5.2xlarge × 8 h actual EC2 + Claude API for synth + plan-fit grading) |

## Training data

A 320-document corpus of dental-AI research artifacts (PDFs of theses, journal papers, conference proceedings: PolyMtl, MICCAI, SPIE, ISBI, JBHI, JMI, IEEE-EMBC, MEDIA, JDentistry, Computer Biology & Medicine, plus internal IVADO presentations).

> **Scope of v0's corpus (important for setting expectations).**
> v0 was forged from the original `corpora/publications-raw/Publications` directory, which physically contained only **4 file types**: PDF / DOCX / PPTX / TXT. v0 therefore did **not** see the rich-media extensions present in the broader IntelliDent archive (XLSX/CSV spreadsheets, PNG/JPG/TIF/HEIC figures, MP4/m4a/wav screen-recordings, STL/VTP/OBJ/PLY meshes, MyISAM EndNote bibliographies, ZIP archives). Those are queued for **v1**; see [Roadmap](#roadmap).

| Stage | Output |
|---|---|
| `prep` | 2,448 raw passages extracted from PDF/DOCX/PPTX/TXT (`pdf=2041, docx=163, pptx=241, txt=3`) |
| `audit` | 843 passages retained (34.4 %) after MinHash dedup, length filter, language filter, domain density check, contamination scrub. **655,562 clean tokens.** |
| `synth` | 2,514 Q/A pairs synthesized via Claude Haiku 4.5 (factual / mechanism / clinical Q types) |
| `synth-filter` | 2,455 Q/A pairs after schema + quality filter |
| `shape` | 90/5/5 split → 2,209 train / 122 val / 124 test (deterministic shuffle seeded on forge id) |

The corpus is heavily concentrated on:

- 3D mesh segmentation of dental arches (MeshSegNet, iMeshSegNet, PointNet++)
- Preparation margin-line detection (regression + classification approaches)
- AI-driven crown generation (PointR shell prediction, GAN-based completion)
- Dental scan acquisition + decimation pipelines

## Plan-fit gate (pre-spend validation)

Before any GPU spend, the forge ran a 7-axis Q/A grading gate via Claude Sonnet 4.6:

| Axis | Result |
|---|---|
| In-domain classification (Haiku) | **100 %** in domain (threshold: 95 %) |
| Subdomain coverage | passed |
| Q/A factual + appropriate + grounded + concise score | mean **4.135 / 5** (threshold: 4.0); min individual **3.0+** |
| Q/A type diversity | max single type 40 % (threshold: 50 %) |
| Hyperparameter sanity | passed (rank, epochs, LR, seq-len in safe envelopes) |
| Training format roundtrip | passed (ChatML round-trips through tokenizer) |
| Budget headroom | $189.90 remaining of $200 cap (synth cost came in 94.4 % under its estimate) |

## Evaluation

Post-training perplexity / sample-generation eval **was skipped on this run** due to a bug in the forge's eval phase: it loaded base + adapter in fp32, which ran out of memory on the 32 GB instance. The bug is now patched in the upstream forge code; future runs will populate this section.

**Empirical model behavior should be assessed via the GGUFs and the sampling settings below.** Hand-evaluation showed two failure modes that the v0 patch (this artifact, 2026-04-27) targets directly:

1. **Out-of-scope confabulation.** Questions like "What causes cavities?" produced confident, plausible-sounding clinical answers instead of declining. **Root cause:** v0 was shipped with a placeholder system prompt. **Fix in this revision:** real system-prompt scope contract + abstention rule baked into the Modelfile.
2. **In-scope name corruption.** Answers about MeshSegNet sometimes leaked tokenizer-edge artifacts ("Mesh-Seg-Label") and sometimes invented architectural components (a 2D-CNN stage). **Root cause:** sampling was tuned for fluency (T=0.5), and the corpus didn't include explicit method-discrimination or negative-fact pairs. **Partial fix in this revision:** sampling tightened (T=0.2, top_p=0.85, top_k=20, min_p=0.05), which kills the name-corruption class. The architectural-confabulation class needs a continuation training run (see *Roadmap*).

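
The fp32 out-of-memory failure is easy to sanity-check with arithmetic on the weights alone. The sketch below is an illustrative estimate, not forge code: it uses the 7.62 B base parameter count from this card, ignores the adapter, activations, and KV cache, and treats NF4 as a flat 0.5 bytes per weight (quantization constants add a little on top). `weight_gb` is a hypothetical helper name.

```python
# Approximate storage cost per weight for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}

BASE_PARAMS = 7.62e9  # Qwen2.5-7B-Instruct parameter count quoted in this card


def weight_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "bf16", "int8", "nf4"):
    print(f"{dtype:>5}: {weight_gb(BASE_PARAMS, dtype):6.2f} GB")
```

At roughly 30.5 GB for fp32 weights, a 32 GB host has almost no headroom left for the rest of the process, while bf16 (about 15.2 GB) fits on the 24 GB A10G. That is consistent with the card's advice to load the base in `torch.bfloat16`.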
## Recommended sampling settings

For factual recall on this narrow corpus (all values pre-baked in the [`Modelfile`](./Modelfile)):

| param | value | reason |
|---|---|---|
| `temperature` | **0.2** | factual recall, kills tokenizer-edge corruption like `Mesh-Seg-Label` |
| `top_p` | **0.85** | tighter tail than v0's 0.9 |
| `top_k` | **20** | tighter tail than v0's 40 |
| `min_p` | **0.05** | filters noise tokens |
| `repeat_penalty` | 1.18 | kills paragraph loops |
| `repeat_last_n` | 256 | window for the penalty |
| `max_tokens` | 320 | keep responses short |

**Stop strings** (LM Studio "Stop strings" field, or llama-cli `--reverse-prompt`):

```
<|im_end|>
<|im_start|>
<|endoftext|>
ttiuser
ttiassistant
```

## How to use

> **Always pass a scope-enforcing system prompt.** The Modelfile and Space app bake one in. If you're calling the model via Transformers / PEFT directly, paste the [system prompt block](#system-prompt-for-transformerspeft-users) below into your `apply_chat_template` call.

### Load the LoRA adapter on top of Qwen2.5-7B-Instruct (Transformers + PEFT)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

REPO = "Nexless/dental-ai-research-slm-0m-20260425-3845"
BASE = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(REPO)
base = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,  # required: fp32 won't fit on a 24 GB GPU
    device_map="auto",
)
model = PeftModel.from_pretrained(base, REPO)  # attach the LoRA adapter
model.eval()

SYSTEM_PROMPT = """You are a research-methodology assistant trained on the IntelliDent / Polytechnique Montréal research group's published dental-AI papers.
... (see full text under 'System prompt for Transformers/PEFT users' below)"""

prompt = "How does iMeshSegNet differ from MeshSegNet for dental arch segmentation?"

# return_dict=True gives input_ids + attention_mask regardless of library defaults.
inputs = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=320,
        do_sample=True,
        temperature=0.2,
        top_p=0.85,
        top_k=20,
        min_p=0.05,  # matches the Modelfile; needs a transformers release with min_p support
        repetition_penalty=1.18,
    )

# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
```

### Use the GGUF in Ollama

```bash
curl -L -O https://huggingface.co/Nexless/dental-ai-research-slm-0m-20260425-3845/resolve/main/Modelfile
ollama create dental-research -f Modelfile
ollama run dental-research
```

The shipped [`Modelfile`](./Modelfile) already encodes the system prompt + sampling defaults + stop strings.

### Use the GGUF in LM Studio / llama.cpp / text-generation-webui

Download `gguf/model-Q4_K_M.gguf` (4.7 GB, recommended for laptop CPUs) or `gguf/model-Q8_0.gguf` (8.1 GB, higher quality on machines with 16+ GB RAM). In the UI's advanced configuration panel:

1. Paste the system prompt from [System prompt for Transformers/PEFT users](#system-prompt-for-transformerspeft-users) below.
2. Set the sampling values from the [recommended sampling settings table](#recommended-sampling-settings).
3. Add the stop strings.

### System prompt for Transformers/PEFT users

```text
You are a research-methodology assistant trained on the IntelliDent / Polytechnique Montréal research group's published dental-AI papers. Your scope is strictly:

IN-SCOPE
- Methods from the IntelliDent corpus: MeshSegNet, iMeshSegNet, MC-Net, margin-line detection, crown-generation pipelines (PointR / PointNet++ / GAN completion), intraoral-scan acquisition + decimation, dental arch segmentation
- Comparison and explanation of these methods (architecture, training data, loss functions, post-processing)
- Methodology summaries written in an academic register

OUT-OF-SCOPE
- Clinical or medical questions (caries causes, hygiene, treatment, diagnosis)
- Methods outside the IntelliDent corpus
- Generic dental questions a patient would ask
- Other dental specialties (orthodontics, periodontics, endodontics, implantology, oral surgery, oral cancer, cosmetic dentistry, TMJ, dental insurance)

ABSTENTION CONTRACT
For ANY out-of-scope question, respond exactly:
"This question is outside the dental-AI research-methodology corpus this model was trained on. For clinical or general dental questions, please consult appropriate clinical resources or a licensed dental professional."

WHEN ANSWERING IN-SCOPE QUESTIONS
- Name methods exactly: "MeshSegNet" (not "Mesh-Seg-Label" or "MeshSegLabel")
- MeshSegNet uses graph-constrained learning modules (GLMs) with multi-scale adjacency matrices, performs CELL-WISE classification on mesh triangles (NOT vertex-wise), and operates directly in 3D space (NO 2D-CNN component)
- iMeshSegNet replaces GLMs with EdgeConv for efficiency
- MC-Net targets mesh completion, not segmentation
- If uncertain about a specific detail, say "the corpus does not specify this clearly" rather than fabricating

Example 1 (in-scope):
Q: What is MeshSegNet?
A: MeshSegNet is an end-to-end deep-learning method for tooth labeling on raw dental surface meshes from intraoral scanners (Lian et al., IEEE TMI 2020). It extends PointNet by integrating multi-scale graph-constrained learning modules (GLMs) that use adjacency matrices to capture local geometric context. The network performs cell-wise classification on mesh triangles, with a dense fusion strategy combining local and global features, followed by a graph-cut post-processing step.

Example 2 (out-of-scope):
Q: What causes cavities?
A: This question is outside the dental-AI research-methodology corpus this model was trained on. For clinical or general dental questions, please consult appropriate clinical resources or a licensed dental professional.
```

## Limitations & caveats

- **Narrow domain.** Answers to generic out-of-domain prompts will be worse than the base model's. Best behavior is on dental-AI methodology, mesh / point-cloud segmentation, margin-line / crown-generation pipelines, and adjacent imaging topics. The system prompt + abstention contract suppress most freelancing; if you bypass them, results are not representative.
- **Small training set.** ~655K clean tokens / 2,455 Q/A pairs is on the lower end for 7B SFT. A larger corpus would improve recall of paper-specific facts.
- **No baseline perplexity comparison** in this release; see the Evaluation section.
- **Method-discrimination + negative-fact gaps.** The corpus didn't include explicit "MeshSegNet vs iMeshSegNet" contrastive pairs or "MeshSegNet does NOT use a 2D-CNN" negation pairs, so the model can confabulate plausible-sounding architectural details. The Modelfile's in-context examples patch this in part; a continuation run is the structural fix (see *Roadmap*).
- **Live demo Space currently broken on free tier.** The auto-published [Gradio Space](https://huggingface.co/spaces/Nexless/dental-ai-research-slm-0m-20260425-3845-demo) tries to load the full 7B in HuggingFace Spaces' free CPU tier (16 GB RAM) and runs out of memory. Use the GGUF locally instead, or fork the Space onto a paid GPU runtime.
- **Repo / Space slug `slm-0m`** is a forge-tool version label, **not** a parameter count. Future revisions will use clearer slugs.

## Roadmap

This is **v0**.
Two follow-up artifacts are planned:

### v0.1: continuation training (~3-4 h on A10G, ~$5)

Add three synth buckets to the existing Q/A set and continue-train the LoRA at LR=5e-5 for one epoch:

| Bucket | Pairs | Purpose |
|---|---|---|
| **Abstention pairs** | ~250 | Out-of-scope questions paired with the abstention response (3-4 phrasing variants), covering generic hygiene, orthodontics, periodontics, drug design, mesh decimation, dental insurance, cosmetic dentistry, oral cancer, TMJ, implants. Adds learned abstention behavior on top of the system-prompt enforcement. |
| **Method discrimination pairs** | ~150 | "What's the difference between [method A] and [method B]?" with explicit contrastive answers naming the actual differentiator. Pairs to cover: MeshSegNet vs iMeshSegNet (GLM vs EdgeConv), vs PointNet (mesh vs raw point cloud), vs TSGCNet (single vs dual stream), vs DGCNN; MC-Net vs MeshSegNet (completion vs segmentation). |
| **Negative-fact pairs** | ~100 | Explicit corrections: "Does MeshSegNet use 2D CNNs? No. MeshSegNet operates directly in 3D space on mesh cells." Negation training is an underused trick: a model that has never seen an explicit correction has no training signal against a confident wrong answer. |

### v1: full re-forge (expanded corpus, all file types)

v1 moves from `corpora/publications-raw/Publications` (4 file types, 320 docs) to **`corpora2/extracted/Publications`**: **1,547 files spanning 18 extensions**, routed through all 19 SLM-Forge prep plugins.

New file-type coverage relative to v0:

| Extension family | Plugin | What v0 missed |
|---|---|---|
| **PDF / DOCX / PPTX / TXT** | `pdf`, `docx_plugin`, `pptx_plugin`, `text_simple` | (already covered in v0) |
| **XLSX / CSV** | `tabular` | data tables, references lists in spreadsheets |
| **EPUB** | `epub` | textbook chapters |
| **Notebooks (.ipynb)** | `notebook` | analysis code + markdown narratives |
| **Code (.py / .js / .ts / .cpp / etc.)** | `code` | training scripts, model definitions |
| **PNG / JPG / TIF / HEIC** | `ocr` (disabled in v1: `FORGE_DISABLE_OCR=1`) | (figures yielded low-value OCR text; explicitly skipped to save tokens + Claude budget) |
| **MP4 / MKV / MOV / WEBM** | `video` (whitelisted: ≥120 s + audio stream required) | narrated method walkthroughs (~10 of 171 clips have value; rest are silent MeshLab screen-recordings) |
| **m4a / wav / mp3** | `audio` (faster-whisper local CPU transcribe) | recorded talks, voice memos |
| **STL / PLY / OBJ / VTP** | `mesh_metadata` | 3D mesh metadata (vertex/face counts, bounding boxes) |
| **DICOM** | `dicom` | medical imaging headers |
| **MyISAM (.frm/.myi/.myd/.ibd)** | `mysql_revive` | EndNote bibliographic backups (paper abstracts → high signal) |
| **EML / MBOX** | `email_plugin` | research correspondence |
| **GeoTIFF / SHP** | `geo` | (not present here, future-proofing) |
| **HDF5 / NetCDF** | `scientific` | dataset archives |
| **ZIP / TAR / RAR** | `archive` | recursive extraction (RAR support via `unrar` system binary) |
| **Binary (default)** | `binary_metadata` | catch-all metadata extraction |

Plus structural changes targeting the failure modes documented in *Evaluation*:

- **Smaller base, lower rank.** 7B + r=32 + 1.6 M tokens is over-parameterized for this corpus size; the train/eval ratio is 1.61×, which is the overfitting signature. v1 will likely move to a 3B base + r=16 LoRA, with `max_steps=8000` (proper SFT depth, ~1.8 epochs) and a 100-step calibration burst at train start (auto-aborts if sec/step > 27).
- **New synth buckets baked in.** The v0.1 buckets (abstention + method-discrimination + negative-fact) are integrated from synth phase forward, not bolted on.
- **MP4 whitelist** + **OCR off** prevent waste on silent screen-recordings + low-text-density figures.

## Reproducibility

Forge run id: `v2-20260424-131000-3845`. Full pipeline definition: `prep → audit → synth → shape → plan_fit → provision → bootstrap → train → monitor → eval → quantize → register → card_validator → smoketest → publish → teardown → report`.

## Citation

```bibtex
@misc{dental_ai_research_slm_2026,
  title        = {dental-ai-research-slm: a Qwen2.5-7B-Instruct LoRA fine-tuned on dental-AI research papers},
  author       = {Nexless},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Nexless/dental-ai-research-slm-0m-20260425-3845}}
}
```

---

## ⚠️ Disclaimer: Proof of Concept

This artifact is a **proof of concept** of **semi-autonomous skills running inside the Claude Code TUI**, developed by **Nexless**. The dental-AI research content is published **for educational purposes only**. It is part of a broader experiment testing the capabilities of **SLMs, (Small) Specialty Language Models,** as a category: how narrow domain corpora, plan-fit pre-spend gates, abstention contracts, and skill-tree-driven training pipelines compose into a publishable, scope-honest small model.

**Not for clinical use.** Not a substitute for licensed dental or medical advice. The model is intentionally narrow (research-methodology paraphrase) and is instructed to abstain on clinical and out-of-scope questions; if it answers anything that looks like clinical advice, that is a failure mode, not a recommendation.

---

*Trained with [SLM-Forge](https://github.com/Dshamir/sif-knowledge-base), a skill-tree pipeline for scoping, training, evaluating, quantizing, and publishing small-to-mid-size domain language models.*

*Card revised 2026-04-27 to add an explicit scope + abstention contract, tighten sampling from the v0 defaults, and add the v0.1/v1 roadmap and PoC disclaimer.*