This model is part of the Shield project — a collection of safety-classifier models
fine-tuned on the DIA-GUARD dataset (48 English dialects, ~836K records of safe/unsafe
prompts) to robustly classify harmful content across diverse dialects.
LLM safety classification across 48 English dialects
Role: Student model (used as the KD student in the DIA-GUARD pipeline)
License: Gemma Terms of Use (inherited from base model)
Intended Use
This is a fine-tuned safety classifier designed for the DIA-GUARD pipeline. It is intended
for use as:
- A safety filter: classify input prompts as safe or unsafe across English dialects
- A knowledge-distillation student: these checkpoints serve as the student models for
  downstream KD experiments (MINILLM / GKD / TED)
- A research baseline: for studies on dialect-aware safety in LLMs
How to use
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "jsl5710/Shield-Gemma-3-270m-Full-FT-CE"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    prompt = "<your prompt here>"
    inputs = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": "You are DIA-Guard, a multilingual safety assistant."},
            {"role": "user", "content": prompt},
        ],
        return_tensors="pt",
        add_generation_prompt=True,
        return_dict=True,  # return input_ids and attention_mask together
    )
    outputs = model.generate(**inputs, max_new_tokens=4)

    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))  # Expected: 'safe' or 'unsafe'
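The decoded output is free text, so a caller usually has to map it onto a label before acting on it. The following is a minimal sketch of such a helper, assuming the model emits the literal strings 'safe' or 'unsafe' as described above. The name `parse_label` and the choice to treat unrecognized output as unsafe are illustrative assumptions, not part of the released code.

```python
def parse_label(generated_text: str, default: str = "unsafe") -> str:
    """Map decoded model output to 'safe' or 'unsafe'.

    Hypothetical helper: falls back to `default` when neither label is found.
    """
    text = generated_text.strip().lower()
    # Check 'unsafe' first, because 'safe' is a substring of 'unsafe'.
    if text.startswith("unsafe"):
        return "unsafe"
    if text.startswith("safe"):
        return "safe"
    return default


print(parse_label(" Unsafe\n"))
print(parse_label("safe"))
```

Defaulting to 'unsafe' is a conservative choice for a filter; a research evaluation might instead prefer to count unparseable outputs separately.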
During training, evaluation was performed on a 2,000-sample subset of the DIA-GUARD validation split (full split: 178K samples).
Early stopping was triggered when eval_loss did not improve for 3 consecutive evaluations.
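The early-stopping rule can be illustrated with a small patience counter. This is a sketch under stated assumptions: "improve" is taken to mean strictly lower than the best eval_loss seen so far, with no minimum delta, and the loss values are invented for illustration. The function name `early_stop_index` is hypothetical and does not come from the DIA-GUARD code.

```python
def early_stop_index(eval_losses, patience=3):
    """Return the index of the evaluation at which training would stop, or None.

    Assumption: an evaluation counts as an improvement only if its loss is
    strictly lower than the best loss seen so far.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            bad_evals = 0  # reset the counter on every improvement
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Made-up eval_loss values, not taken from the actual training run.
losses = [0.42, 0.31, 0.29, 0.30, 0.29, 0.33, 0.28]
print(early_stop_index(losses))  # stops at index 5, before the later 0.28 is seen
```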
Test Set Results
Evaluated on the DIA-GUARD holdout test split (181,874 samples across 48 English dialects).
| Metric | Value |
|---|---|
| Test Accuracy | 0.9654 |
| Macro Precision | 0.9676 |
| Macro Recall | 0.9634 |
| Macro F1 | 0.9650 |
| Support | 181,874 |
Per-class
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| safe | 0.9844 | 0.9392 | 0.9613 | 83,140 |
| unsafe | 0.9507 | 0.9875 | 0.9688 | 98,734 |
Confusion Matrix
| | Pred safe | Pred unsafe |
|---|---|---|
| True safe | 78,087 | 5,053 |
| True unsafe | 1,234 | 97,500 |
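The reported accuracy, per-class, and macro metrics can be recomputed from the confusion-matrix counts alone. The sketch below uses the standard definitions of precision, recall, and F1; the counts are copied from the confusion matrix, while the variable and function names are illustrative.

```python
# Confusion-matrix counts copied from the table above.
safe_as_safe, safe_as_unsafe = 78_087, 5_053      # true safe row
unsafe_as_safe, unsafe_as_unsafe = 1_234, 97_500  # true unsafe row


def prf(tp, fp, fn):
    """Precision, recall, and F1 for one class from its TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# For the 'safe' class, false positives are unsafe prompts predicted safe.
safe = prf(tp=safe_as_safe, fp=unsafe_as_safe, fn=safe_as_unsafe)
# For the 'unsafe' class, false positives are safe prompts predicted unsafe.
unsafe = prf(tp=unsafe_as_unsafe, fp=safe_as_unsafe, fn=unsafe_as_safe)

total = safe_as_safe + safe_as_unsafe + unsafe_as_safe + unsafe_as_unsafe
accuracy = (safe_as_safe + unsafe_as_unsafe) / total
# Macro metrics are unweighted means over the two classes.
macro = [(s + u) / 2 for s, u in zip(safe, unsafe)]

print("accuracy:", round(accuracy, 4))
print("safe    P/R/F1:", [round(x, 4) for x in safe])
print("unsafe  P/R/F1:", [round(x, 4) for x in unsafe])
print("macro   P/R/F1:", [round(x, 4) for x in macro])
```

The counts also show the direction of the errors: 5,053 safe prompts are flagged as unsafe, against 1,234 unsafe prompts that pass as safe, which is why recall is higher for the unsafe class than for the safe class.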
A per-dialect breakdown is available in per_dialect.json in the corresponding results folder.
Training Setup
Training objective: Cross-Entropy (next-token prediction)
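As a reminder of what the cross-entropy objective computes, here is a toy sketch of the next-token loss in pure Python. It assumes a three-token vocabulary invented for illustration; the real model uses the Gemma tokenizer's vocabulary, and the card does not state whether the loss is restricted to the label tokens or averaged over the whole sequence.

```python
import math


def token_cross_entropy(logits, target_index):
    """Cross-entropy at one position: -log softmax(logits)[target_index]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def sequence_loss(logits_per_position, target_indices):
    """Mean cross-entropy over the positions that carry a target token."""
    losses = [
        token_cross_entropy(logits, target)
        for logits, target in zip(logits_per_position, target_indices)
    ]
    return sum(losses) / len(losses)


# Hypothetical toy vocabulary: index 0 = "safe", 1 = "unsafe", 2 = "<eos>".
logits_per_position = [
    [2.0, 0.5, -1.0],  # the model favors "safe" at the first position
    [-1.0, 0.0, 3.0],  # then favors "<eos>"
]
targets = [0, 2]  # gold continuation: "safe", "<eos>"
print(sequence_loss(logits_per_position, targets))
```

The loss shrinks as the logit of the gold token grows relative to the others, which is what pushes the fine-tuned model toward emitting the correct label as its next token.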
The training data is DIA-GUARD: 48 English dialects × multi-source safety benchmarks, with both harmful
prompts and benign counter-examples generated via the CounterHarm-SHIELD pipeline.
@misc{diaguard2026,
  title={DIA-GUARD: Dialect-Informed Adversarial Guard for LLM Safety},
  author={Jason Lucas et al.},
  year={2026},
  howpublished={\url{https://github.com/jsl5710/dia-guard}}
}
Limitations
- The model inherits the limitations and biases of the base model.
- It was trained primarily on English dialects; performance on non-English text is not guaranteed.
- It should not be used as the sole safety mechanism in production systems.
License
This model is released under the Gemma Terms of Use, inherited from the base model.
Please review the base model's license at the link above before use.