---
license: apache-2.0
language:
- en
library_name: peft
pipeline_tag: text-generation
tags:
- text-generation
- qlora
- lora
- peft
- dental
- medical
- domain-specialized
- 7b
- gguf
base_model: Qwen/Qwen2.5-7B-Instruct
---

SLM-Forge banner

GitHub: Dshamir/slm-forge · License: MIT · Built with Claude Code · Status: PoC

🔧 Toolkit: This model was forged with SLM-Forge — a public, MIT-licensed, semi-autonomous skill tree for the Claude Code TUI that takes you from corpus + budget to a trained + evaluated + quantized + published Small Specialty Language Model in a single session, with one human gate. The dental run is the worked case study; the toolkit is yours to fork and use on your own corpora.

dental-ai-research-slm — Qwen2.5-7B-Instruct + IntelliDent dental-AI research LoRA (v0)

A research-methodology assistant, not a clinical assistant. LoRA fine-tune of Qwen/Qwen2.5-7B-Instruct on a curated corpus of dental-AI research papers from the IntelliDent / Polytechnique Montréal group — covering tooth segmentation (MeshSegNet, iMeshSegNet, MC-Net), preparation margin-line detection, crown generation (PointR / PointNet++), intraoral 3D scan processing, and adjacent dental imaging research.

⚠️ The slug says slm-0m. The model is not a 0M-parameter model.

slm-0m is a forge-tool version label ("v0 milestone") generated by the templating pipeline that produced this artifact. It does not describe the model size.

Actual size:

  • Base: Qwen/Qwen2.5-7B-Instruct — 7.62 B parameters
  • Trainable LoRA adapter: ~20 M parameters (0.26 % of the 7.62 B base)

Future milestones will use clearer slugs (v0, v0.1, v1).
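The ~20 M figure can be reproduced from Qwen2.5-7B's published geometry (hidden size 3584, 28 decoder layers, GQA with 512-dim k/v projections) and the LoRA config shipped in adapter_config.json (r=32 on the four attention projections) — a back-of-envelope sketch:

```python
# Reproduce the adapter's trainable-parameter count from Qwen2.5-7B's
# architecture (hidden_size=3584, 28 layers, GQA k/v projection dim 512)
# and the LoRA config in adapter_config.json (r=32 on q/k/v/o_proj).
HIDDEN, KV_DIM, LAYERS, RANK = 3584, 512, 28, 32

def lora_params(in_dim, out_dim, r):
    """A LoRA pair adds an (in_dim x r) A matrix and an (r x out_dim) B matrix."""
    return r * (in_dim + out_dim)

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
total = per_layer * LAYERS
print(total)  # 20185088 — matches the "Trainable params" row under Training
```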

What this model is for / 🚫 what it is not for

| In-scope | 🚫 Out-of-scope (model abstains) |
|---|---|
| MeshSegNet / iMeshSegNet / MC-Net architectures | Clinical questions (caries, hygiene, treatment, diagnosis) |
| Margin-line detection methods | Generic dental Qs a patient would ask |
| Crown-generation pipelines (PointR, PointNet++, GAN completion) | Other specialties: orthodontics, periodontics, implantology, oral surgery, oral cancer, TMJ, cosmetic dentistry |
| Intraoral 3D scan acquisition + decimation | Methods outside the IntelliDent corpus |
| Method comparison / methodology summaries in an academic register | Drug design, dental insurance, clinical recommendations |

Abstention contract. When asked anything out-of-scope, the model is instructed to answer exactly:

"This question is outside the dental-AI research-methodology corpus this model was trained on. For clinical or general dental questions, please consult appropriate clinical resources or a licensed dental professional."

The abstention contract is enforced via the system prompt baked into the shipped Modelfile. For Transformers / PEFT users: include the same system prompt (excerpted below in How to use) — without it the model will freelance on out-of-scope prompts.
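Because the out-of-scope answer is a single exact string, the contract is trivial to regression-test in client code. A minimal sketch (the helper name is ours, not part of this repo):

```python
# The exact abstention string from the contract above.
ABSTENTION = (
    "This question is outside the dental-AI research-methodology corpus "
    "this model was trained on. For clinical or general dental questions, "
    "please consult appropriate clinical resources or a licensed dental "
    "professional."
)

def complies_with_abstention(response: str) -> bool:
    """True if a response to an out-of-scope probe is exactly the contract string."""
    return response.strip() == ABSTENTION

print(complies_with_abstention(ABSTENTION))                        # True
print(complies_with_abstention("Cavities are caused by bacteria"))  # False — freelancing
```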

What's in this repo

| File / dir | Purpose | Size |
|---|---|---|
| adapter_model.safetensors + adapter_config.json | LoRA adapter (r=32, α=64) — apply on top of Qwen/Qwen2.5-7B-Instruct via PEFT | ~80 MB |
| gguf/model-Q4_K_M.gguf | Merged + quantized 4-bit GGUF — best size/quality trade-off; runs on a modern laptop | ~4.7 GB |
| gguf/model-Q8_0.gguf | Merged + quantized 8-bit GGUF — near-lossless | ~8.1 GB |
| Modelfile | Ollama Modelfile pre-configured with system prompt + sampling defaults (see below) | — |
| tokenizer.json / vocab.json / merges.txt / chat_template.jinja | Qwen2 tokenizer + ChatML chat template | — |
| LICENSE | Apache-2.0 (inherited from base) | — |

Training

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct (Apache-2.0) |
| Regime | QLoRA — base loaded in 4-bit NF4 (bitsandbytes), LoRA adapter trained in bf16 |
| LoRA config | rank 32, alpha 64, dropout 0.05; targets q_proj, k_proj, v_proj, o_proj |
| Trainable params | 20,185,088 (0.46 % of 4.37 B post-quantization) |
| Sequence length | 2048 tokens |
| Effective batch | 8 (micro-batch 2 × grad-accum 4) |
| Steps | 900 (~2.94 epochs over 2,209 train examples) |
| Learning rate | 1e-4, cosine schedule, gradient checkpointing on |
| Hardware | AWS g5.2xlarge — 1× NVIDIA A10G (24 GB), 8 vCPU, 32 GB RAM, ca-central-1 |
| Train wall-clock | 5 h 53 min |
| Total forge wall-clock | 9 h 03 min (training + provisioning + repeated phase reruns from in-flight bug fixes) |
| AWS spend | $11.62 (on-demand g5.2xlarge × 8 h actual EC2 + Claude API for synth + plan-fit grading) |

Training data

A 320-document corpus of dental-AI research artifacts (PDFs of theses, journal papers, conference proceedings: PolyMtl, MICCAI, SPIE, ISBI, JBHI, JMI, IEEE-EMBC, MEDIA, JDentistry, Computers in Biology and Medicine, plus internal IVADO presentations).

Scope of v0's corpus (important for setting expectations). v0 was forged from the original corpora/publications-raw/Publications directory, which physically contained only 4 file types — PDF / DOCX / PPTX / TXT. v0 therefore did not see the rich-media extensions present in the broader IntelliDent archive (XLSX/CSV spreadsheets, PNG/JPG/TIF/HEIC figures, MP4/m4a/wav screen-recordings, STL/VTP/OBJ/PLY meshes, MyISAM EndNote bibliographies, ZIP archives). Those are queued for v1 — see Roadmap.

| Stage | Output |
|---|---|
| prep | 2,448 raw passages extracted from PDF/DOCX/PPTX/TXT (pdf=2041, docx=163, pptx=241, txt=3) |
| audit | 843 passages retained (34.4 %) after MinHash dedup, length filter, language filter, domain density check, contamination scrub. 655,562 clean tokens. |
| synth | 2,514 Q/A pairs synthesized via Claude Haiku 4.5 (factual / mechanism / clinical Q types) |
| synth-filter | 2,455 Q/A pairs after schema + quality filter |
| shape | 90/5/5 split → 2,209 train / 122 val / 124 test (deterministic shuffle seeded on forge id) |
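The shape-stage arithmetic is easy to reproduce: integer-floor the train and val fractions and give the remainder to test. A sketch of the split logic (the seeding-on-forge-id detail is paraphrased, not the forge's actual code):

```python
import random

def shape_split(examples, seed, train_frac=0.90, val_frac=0.05):
    """Deterministic shuffle + 90/5/5 split; the test set takes the remainder."""
    rng = random.Random(seed)   # e.g. seeded on the forge id for reproducibility
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = shape_split(list(range(2455)), seed=3845)
print(len(train), len(val), len(test))  # 2209 122 124 — matches the shape row
```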

The corpus is heavily concentrated on:

  • 3D mesh segmentation of dental arches (MeshSegNet, iMeshSegNet, PointNet++)
  • Preparation margin-line detection (regression + classification approaches)
  • AI-driven crown generation (PointR shell prediction, GAN-based completion)
  • Dental scan acquisition + decimation pipelines

Plan-fit gate (pre-spend validation)

Before any GPU spend, the forge ran a 7-axis Q/A grading gate via Claude Sonnet 4.6:

| Axis | Result |
|---|---|
| In-domain classification (Haiku) | 100 % in domain (threshold: 95 %) |
| Subdomain coverage | passed |
| Q/A factual + appropriate + grounded + concise score | mean 4.135 / 5 (threshold: 4.0); min individual 3.0+ |
| Q/A type diversity | max single type 40 % (threshold: 50 %) |
| Hyperparameter sanity | passed (rank, epochs, LR, seq-len in safe envelopes) |
| Training format roundtrip | passed (ChatML round-trips through tokenizer) |
| Budget headroom | $189.90 remaining of $200 cap (−94.4 % synth overrun → much cheaper than estimated) |
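The gate's pass/fail logic reduces to a handful of threshold checks run before any GPU is provisioned. A sketch over the numbers in the table (function and field names are illustrative, not the forge's schema):

```python
def plan_fit_gate(in_domain_pct, mean_score, min_score, max_type_pct, remaining_budget):
    """Pre-spend gate: every axis must clear its threshold before provisioning."""
    checks = {
        "in_domain": in_domain_pct >= 95.0,       # Haiku in-domain classification
        "qa_quality_mean": mean_score >= 4.0,     # 7-axis Q/A grading mean
        "qa_quality_min": min_score >= 3.0,       # worst individual pair
        "type_diversity": max_type_pct <= 50.0,   # no single Q type dominates
        "budget": remaining_budget > 0.0,         # headroom under the cap
    }
    return all(checks.values()), checks

ok, detail = plan_fit_gate(100.0, 4.135, 3.0, 40.0, 189.90)
print(ok)  # True — this run cleared every axis
```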

Evaluation

Post-training perplexity / sample-generation eval was skipped on this run due to a bug in the forge's eval phase (loaded base + adapter in fp32 → OOM on the 32 GB instance). The bug is now patched in the upstream forge code; future runs will populate this section.

Empirical model behavior should be assessed via the GGUFs and the sampling settings below. Hand-evaluation showed two failure modes that the v0 patch (this artifact, 2026-04-27) targets directly:

  1. Out-of-scope confabulation — questions like "What causes cavities?" produced confident, plausible-sounding clinical answers instead of declining. Root cause: v0 was shipped with a placeholder system prompt. Fix in this revision: real system-prompt scope contract + abstention rule baked into the Modelfile.
  2. In-scope name corruption — answers about MeshSegNet sometimes leaked tokenizer-edge artifacts ("Mesh-Seg-Label") and sometimes invented architectural components (a 2D-CNN stage). Root cause: sampling was tuned for fluency (T=0.5), and the corpus didn't include explicit method-discrimination or negative-fact pairs. Partial fix in this revision: sampling tightened (T=0.2, top_p=0.85, top_k=20, min_p=0.05) which kills the name-corruption class. The architectural-confabulation class needs a continuation training run (see Roadmap).

Recommended sampling settings for factual recall on this narrow corpus (all values pre-baked in the Modelfile):

| param | value | reason |
|---|---|---|
| temperature | 0.2 | factual recall, kills tokenizer-edge corruption like Mesh-Seg-Label |
| top_p | 0.85 | tighter tail than v0's 0.9 |
| top_k | 20 | tighter tail than v0's 40 |
| min_p | 0.05 | filters noise tokens |
| repeat_penalty | 1.18 | kills paragraph loops |
| repeat_last_n | 256 | window for the penalty |
| max_tokens | 320 | keep responses short |
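To see why these values kill noise tokens, here is a minimal sketch of how top-k, top-p, and min_p truncate a toy next-token distribution (not llama.cpp's actual sampler code, which applies its chain in a configurable order):

```python
def truncate(probs, top_k=20, top_p=0.85, min_p=0.05):
    """Return the token ids that survive top-k, then top-p, then min_p filtering."""
    # top-k: keep only the k highest-probability tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # top-p (nucleus): keep the smallest prefix whose mass reaches top_p
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # min_p: drop anything below min_p * (probability of the best token)
    floor = min_p * probs[ranked[0]]
    return [i for i in kept if probs[i] >= floor]

# toy distribution: one dominant token, a mid token, and low-probability noise
probs = [0.70, 0.20, 0.04, 0.03, 0.02, 0.01]
print(truncate(probs))  # [0, 1] — the noise tail never reaches the sampler
```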

Stop strings (LM Studio "Stop strings" field, or llama-cli --reverse-prompt):

<|im_end|>
<|im_start|>
<|endoftext|>
ttiuser
ttiassistant
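Front-ends that lack a stop-string field can apply the same cut in post-processing — a small sketch, truncating at the earliest occurrence of any stop string:

```python
# Stop strings from the list above, applied client-side.
STOP_STRINGS = ["<|im_end|>", "<|im_start|>", "<|endoftext|>", "ttiuser", "ttiassistant"]

def cut_at_stop(text, stops=STOP_STRINGS):
    """Truncate generated text at the earliest occurrence of any stop string."""
    cut = len(text)
    for s in stops:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(cut_at_stop("GLMs capture local context.<|im_end|>ttiuser next"))
# → "GLMs capture local context."
```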

How to use

Always pass a scope-enforcing system prompt. The Modelfile and Space app bake one in. If you're calling the model via Transformers / PEFT directly, paste the system prompt block below into your apply_chat_template call.

Load the LoRA adapter on top of Qwen2.5-7B-Instruct (Transformers + PEFT)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

REPO = "Nexless/dental-ai-research-slm-0m-20260425-3845"
BASE = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(REPO)
base = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,   # required — fp32 won't fit on a 24 GB GPU
    device_map="auto",
)
model = PeftModel.from_pretrained(base, REPO)
model.eval()

SYSTEM_PROMPT = """You are a research-methodology assistant trained on the IntelliDent / Polytechnique Montréal research group's published dental-AI papers. ...
(see full text under 'System prompt for Transformers/PEFT users' below)"""

prompt = "How does iMeshSegNet differ from MeshSegNet for dental arch segmentation?"
inputs = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": prompt},
    ],
    return_tensors="pt", add_generation_prompt=True,
).to(model.device)

with torch.no_grad():
    out = model.generate(
        inputs,
        max_new_tokens=320,
        temperature=0.2,
        top_p=0.85,
        top_k=20,
        min_p=0.05,               # mirrors the Modelfile defaults (transformers >= 4.42)
        repetition_penalty=1.18,
        do_sample=True,
    )
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

Use the GGUF in Ollama

curl -L -O https://huggingface.co/Nexless/dental-ai-research-slm-0m-20260425-3845/resolve/main/Modelfile
ollama create dental-research -f Modelfile
ollama run dental-research

The shipped Modelfile already encodes the system prompt + sampling defaults + stop strings.

Use the GGUF in LM Studio / llama.cpp / text-generation-webui

Download gguf/model-Q4_K_M.gguf (4.7 GB, recommended for laptop CPUs) or gguf/model-Q8_0.gguf (8.1 GB, higher quality on machines with 16+ GB RAM). In the UI's advanced configuration panel:

  1. Paste the system prompt from System prompt for Transformers/PEFT users below.
  2. Set the sampling values from the recommended sampling settings table.
  3. Add the stop strings.
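If you script against the GGUF with llama-cpp-python instead of a UI, the same three steps translate to keyword arguments on `Llama.create_chat_completion`. A sketch of just the payload, mirroring the tables above (assumes llama-cpp-python; only the settings are shown, not model loading):

```python
# Sampling payload mirroring the recommended settings; pass these as keyword
# arguments to llama_cpp.Llama.create_chat_completion (library assumed installed).
SAMPLING = dict(
    temperature=0.2,
    top_p=0.85,
    top_k=20,
    min_p=0.05,
    repeat_penalty=1.18,
    max_tokens=320,
    stop=["<|im_end|>", "<|im_start|>", "<|endoftext|>", "ttiuser", "ttiassistant"],
)

def build_messages(system_prompt, user_question):
    """ChatML-style message list; the system prompt (scope contract) is mandatory."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]

msgs = build_messages("...scope contract from the section below...",
                      "What is iMeshSegNet?")
print(len(msgs), SAMPLING["temperature"])  # 2 0.2
```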

System prompt for Transformers/PEFT users

You are a research-methodology assistant trained on the IntelliDent / Polytechnique Montréal research group's published dental-AI papers. Your scope is strictly:

IN-SCOPE
- Methods from the IntelliDent corpus: MeshSegNet, iMeshSegNet, MC-Net, margin-line detection, crown-generation pipelines (PointR / PointNet++ / GAN completion), intraoral-scan acquisition + decimation, dental arch segmentation
- Comparison and explanation of these methods (architecture, training data, loss functions, post-processing)
- Methodology summaries written in an academic register

OUT-OF-SCOPE
- Clinical or medical questions (caries causes, hygiene, treatment, diagnosis)
- Methods outside the IntelliDent corpus
- Generic dental questions a patient would ask
- Other dental specialties (orthodontics, periodontics, endodontics, implantology, oral surgery, oral cancer, cosmetic dentistry, TMJ, dental insurance)

ABSTENTION CONTRACT
For ANY out-of-scope question, respond exactly:
"This question is outside the dental-AI research-methodology corpus this model was trained on. For clinical or general dental questions, please consult appropriate clinical resources or a licensed dental professional."

WHEN ANSWERING IN-SCOPE QUESTIONS
- Name methods exactly: "MeshSegNet" (not "Mesh-Seg-Label" or "MeshSegLabel")
- MeshSegNet uses graph-constrained learning modules (GLMs) with multi-scale adjacency matrices, performs CELL-WISE classification on mesh triangles (NOT vertex-wise), and operates directly in 3D space (NO 2D-CNN component)
- iMeshSegNet replaces GLMs with EdgeConv for efficiency
- MC-Net targets mesh completion, not segmentation
- If uncertain about a specific detail, say "the corpus does not specify this clearly" rather than fabricating

Example 1 (in-scope):
Q: What is MeshSegNet?
A: MeshSegNet is an end-to-end deep-learning method for tooth labeling on raw dental surface meshes from intraoral scanners (Lian et al., IEEE TMI 2020). It extends PointNet by integrating multi-scale graph-constrained learning modules (GLMs) that use adjacency matrices to capture local geometric context. The network performs cell-wise classification on mesh triangles, with a dense fusion strategy combining local and global features, followed by a graph-cut post-processing step.

Example 2 (out-of-scope):
Q: What causes cavities?
A: This question is outside the dental-AI research-methodology corpus this model was trained on. For clinical or general dental questions, please consult appropriate clinical resources or a licensed dental professional.

Limitations & caveats

  • Narrow domain. Out-of-domain generic prompts will degrade vs. base. Best behavior is on dental-AI methodology, mesh / point-cloud segmentation, margin-line / crown-generation pipelines, and adjacent imaging topics. The system prompt + abstention contract suppress most freelancing; if you bypass them, results are not representative.
  • Small training set. ~655K clean tokens / 2,455 Q/A pairs is on the lower end for 7B SFT. A larger corpus would improve recall of paper-specific facts.
  • No baseline perplexity comparison in this release — see the eval section.
  • Method-discrimination + negative-fact gaps. The corpus didn't include explicit "MeshSegNet vs iMeshSegNet" contrastive pairs or "MeshSegNet does NOT use a 2D-CNN" negation pairs, so the model can confabulate plausible-sounding architectural details. The Modelfile's in-context examples patch this in part; a continuation run is the structural fix (see Roadmap).
  • Live demo Space currently broken on free tier. The auto-published Gradio Space tries to load the full 7B in HuggingFace Spaces' free CPU tier (16 GB RAM) and OOMs. Use the GGUF locally instead, or fork the Space onto a paid GPU runtime.
  • Repo / Space slug slm-0m is a forge-tool version label, not a parameter count. Future revisions will use clearer slugs.

Roadmap

This is v0. Two follow-up artifacts are planned:

v0.1 — continuation training (~3–4 h on A10G, ~$5)

Add three synth buckets to the existing Q/A set and continue-train the LoRA at LR=5e-5 for one epoch:

| Bucket | Pairs | Purpose |
|---|---|---|
| Abstention pairs | ~250 | Out-of-scope questions paired with the abstention response (3-4 phrasing variants), covering generic hygiene, orthodontics, periodontics, drug design, mesh decimation, dental insurance, cosmetic dentistry, oral cancer, TMJ, implants. Adds learned abstention behavior on top of the system-prompt enforcement. |
| Method discrimination pairs | ~150 | "What's the difference between [method A] and [method B]?" with explicit contrastive answers naming the actual differentiator. Pairs to cover: MeshSegNet vs iMeshSegNet (GLM vs EdgeConv), vs PointNet (mesh vs raw point cloud), vs TSGCNet (single vs dual stream), vs DGCNN; MC-Net vs MeshSegNet (completion vs segmentation). |
| Negative-fact pairs | ~100 | Explicit corrections: "Does MeshSegNet use 2D CNNs? No — MeshSegNet operates directly in 3D space on mesh cells." Negation training is the underused trick — LLMs hallucinate confidently because they've never seen explicit corrections. |
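In SFT terms each bucket is just more (user, assistant) pairs in the training set. A sketch of what one abstention pair might look like as a JSONL record (the field names are illustrative, not the forge's actual schema):

```python
import json

# Exact abstention string from the contract earlier in this card.
ABSTENTION = (
    "This question is outside the dental-AI research-methodology corpus "
    "this model was trained on. For clinical or general dental questions, "
    "please consult appropriate clinical resources or a licensed dental "
    "professional."
)

record = {
    "bucket": "abstention",   # vs "method_discrimination" / "negative_fact"
    "messages": [
        {"role": "user", "content": "How often should I floss?"},
        {"role": "assistant", "content": ABSTENTION},
    ],
}
line = json.dumps(record)                 # one JSONL line per training pair
print(json.loads(line)["bucket"])         # round-trips cleanly
```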

v1 — full re-forge (expanded corpus, all file types)

v1 moves from corpora/publications-raw/Publications (4 file types, 320 docs) to corpora2/extracted/Publications — 1,547 files spanning 18 extensions routed through all 19 SLM-Forge prep plugins. New file-type coverage relative to v0:

| Extension family | Plugin | What v0 missed |
|---|---|---|
| PDF / DOCX / PPTX / TXT | pdf, docx_plugin, pptx_plugin, text_simple | (already covered in v0) |
| XLSX / CSV | tabular | data tables, references lists in spreadsheets |
| EPUB | epub | textbook chapters |
| Notebooks (.ipynb) | notebook | analysis code + markdown narratives |
| Code (.py / .js / .ts / .cpp / etc.) | code | training scripts, model definitions |
| PNG / JPG / TIF / HEIC | ocr (disabled in v1: FORGE_DISABLE_OCR=1) | (figures yielded low-value OCR text — explicitly skipped to save tokens + Claude budget) |
| MP4 / MKV / MOV / WEBM | video (whitelisted: ≥120 s + audio stream required) | narrated method walkthroughs (~10 of 171 clips have value; rest are silent MeshLab screen-recordings) |
| m4a / wav / mp3 | audio (faster-whisper local CPU transcribe) | recorded talks, voice memos |
| STL / PLY / OBJ / VTP | mesh_metadata | 3D mesh metadata (vertex/face counts, bounding boxes) |
| DICOM | dicom | medical imaging headers |
| MyISAM (.frm/.myi/.myd/.ibd) | mysql_revive | EndNote bibliographic backups (paper abstracts → high signal) |
| EML / MBOX | email_plugin | research correspondence |
| GeoTIFF / SHP | geo | (not present here, future-proofing) |
| HDF5 / NetCDF | scientific | dataset archives |
| ZIP / TAR / RAR | archive | recursive extraction (RAR support via unrar system binary) |
| Binary (default) | binary_metadata | catch-all metadata extraction |

Plus structural changes targeting the failure modes documented in Evaluation:

  • Smaller base, lower rank. 7B + r=32 + 1.6 M tokens is over-parameterized for this corpus size; the 1.61× train/eval loss ratio is a classic overfitting signature. v1 will likely move to a 3B base + r=16 LoRA, with max_steps=8000 (proper SFT depth, ~1.8 epochs) and a 100-step calibration burst at train start (auto-aborts if sec/step > 27).
  • New synth buckets baked in. The v0.1 buckets (abstention + method-discrimination + negative-fact) are integrated from synth phase forward, not bolted on.
  • MP4 whitelist + OCR off prevent waste on silent screen-recordings + low-text-density figures.
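The calibration burst is a cheap guard: measure sec/step over the first N steps and abort before committing the full budget. A sketch of that check (the 27 s threshold is from the bullet above; the rest is illustrative):

```python
def calibration_guard(step_times, threshold=27.0, burst=100):
    """Abort if mean sec/step over the calibration burst exceeds the threshold."""
    window = step_times[:burst]
    mean = sum(window) / len(window)
    return mean <= threshold, mean

ok, mean = calibration_guard([23.5] * 100)
print(ok)   # True — 23.5 sec/step is under the 27 s abort line

ok, mean = calibration_guard([31.0] * 100)
print(ok)   # False — would auto-abort before burning the budget
```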

Reproducibility

Forge run id: v2-20260424-131000-3845. Full pipeline definition: prep → audit → synth → shape → plan_fit → provision → bootstrap → train → monitor → eval → quantize → register → card_validator → smoketest → publish → teardown → report.

Citation

@misc{dental_ai_research_slm_2026,
  title  = {dental-ai-research-slm: a Qwen2.5-7B-Instruct LoRA fine-tuned on dental-AI research papers},
  author = {Nexless},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Nexless/dental-ai-research-slm-0m-20260425-3845}}
}

⚠️ Disclaimer — Proof of Concept

This artifact is a proof of concept of semi-autonomous skills running inside the Claude Code TUI, developed by Nexless. The dental-AI research content is published for educational purposes only — it is part of a broader experiment testing the capabilities of SLMs — (Small) Specialty Language Models — as a category: how narrow domain corpora, plan-fit pre-spend gates, abstention contracts, and skill-tree-driven training pipelines compose into a publishable, scope-honest small model.

Not for clinical use. Not a substitute for licensed dental or medical advice. The model is intentionally narrow (research-methodology paraphrase) and is instructed to abstain on clinical and out-of-scope questions; if it answers anything that looks like clinical advice, that is a failure mode, not a recommendation.


Trained with SLM-Forge — a skill-tree pipeline for scoping, training, evaluating, quantizing, and publishing small-to-mid-size domain language models.

Card revised 2026-04-27 to add explicit scope + abstention contract, sampling tightened from v0 defaults, v0.1/v1 roadmap, and PoC disclaimer.
