Initialize project; model provided by the ModelHub XC community

Model: distil-labs/distil-email-classifier
Source: Original Platform
ModelHub XC
2026-05-08 06:43:51 +08:00
commit ceeee13104
18 changed files with 152376 additions and 0 deletions

37
.gitattributes vendored Normal file

@@ -0,0 +1,37 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
model.gguf filter=lfs diff=lfs merge=lfs -text

31
LICENSE Normal file

@@ -0,0 +1,31 @@
GENERAL TERMS AND CONDITIONS
Note that if you want to use the Commercial licence, please contact us at contact@distillabs.ai
- Model License Terms -
R&D License
1. SERVICES, PRICES AND PAYMENT
1.1 The Customer pays a one-time license fee, as indicated in the check-out process, for running of one (1) training process of the selected Base Model using Customer Data (“License Fee”).
1.2 The License Fee shall be due for payment in advance. The Customer shall only be permitted to set off against payment claims of Distil Labs if the Customer's claims are undisputed or have become res judicata.
2. MODEL LICENSE: R&D LICENSE
2.1 Subject to Customer's payment of the license fee, Distil Labs grants to Customer the Model License (as defined below). For clarification, Distil Labs retains any other rights in its software or know-how, in particular in the codebase needed for the fine-tuning of the Trained Model.
2.2 Subject to the requirements of the Base Model License (cf. Section 2.5 below), Distil Labs transfers to the Customer the perpetual, non-exclusive usage right to the Trained Model for non-commercial purposes of prototyping and research & development. The Parties agree, that commercial purposes include deployment in production externally (to be used by Customer's customers paid or free of charge) or internally (as a tool for Customer's employees). The territorial scope of the license is limited to the use within the United States of America and the European Economic Area including all member states of the European Union (“Model License”).
2.3 The Model License for non-commercial purposes of prototyping and research & development shall include (i) the non-exclusive right to permanent or temporary reproduction, in whole or in part, by any means and in any form (e.g. permanent and/or volatile storage on electrical, electromagnetic, optical storage media, such as any type of SDD, HDD, DVD, memory cards, USB sticks), (ii) the non-exclusive right to distribution in any form, media and by any means regardless of whether the distribution is in tangible or intangible form, in particular to transmit the Trained Model via wired and wireless networks (e.g. for download from internet or intranet by wire or wireless means including broadband, cable, fiberglass, WIFI, LTE, 5G, satellite internet, other data networks), and (iii) the non-exclusive right of making available to the public in such a way that members of the public can access it from places and at times of their choice (e.g. by web or mobile app, virtual or augmented reality, cloud storage, cloud hosting, decentralized hosting, non-fungible token, application service providing, software as a service, or cloud computing). The license shall also contain, to the extent necessary for prototyping and research & development, the right to adapt and modify the Trained Model subject to the limitation in Section 2.4 and 2.5 below, to further develop the Trained Model including changes to functions or appearance, adapt to other software versions, to exchange parts of the Trained Model or combine the Trained Model with other results of work and to use the results in the same way as the original Trained Model. Any derived models from the Trained Model shall retain this model license.
2.4 The Customer shall not, without the prior written consent of Distil Labs:
2.4.1 train, fine-tune, re-train, or otherwise modify the Trained Model, unless for purpose of research & development;
2.4.2 use the Trained Model or any part thereof to create derivative models or services that compete with those of Distil Labs;
2.4.3 circumvent any technical restrictions embedded in the Trained Model or Base Model that are designed to enforce usage limitations.
2.5 The Parties acknowledge and agree that the Trained Model is developed from Base Models which are supplied by a third party. Therefore, the Model License is subject to the restrictions resulting from the open-source or any other applicable license of the Base Model (“Base Model License”) and the Customer must use the Trained Model in compliance with the Base Model License. In particular, the Customer must oblige their clients to compliance with the Base Model License in any case of transferring or sublicensing the rights to or making available in any way the Trained Model. The applicable Base Model License is defined in the Training Configuration and will be provided for download. The Customer agrees to indemnify Distil Labs for any and all claims brought by the Base Model provider for violations of the Base Model License.

59
Modelfile Normal file

@@ -0,0 +1,59 @@
FROM ./model.gguf
TEMPLATE """{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}
<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) -}}
<think>{{ .Thinking }}</think>
{{ end -}}
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ if and $.IsThinkSet (not $.Think) -}}
<think>
</think>
{{ end -}}
{{ end }}
{{- end }}"""

187
README.md Normal file

@@ -0,0 +1,187 @@
---
library_name: transformers
tags:
- email
- classification
- qwen3
- distillation
- privacy
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B
---
<div align="center">
<img src="https://github.com/distil-labs/badges/blob/main/distillabs-logo.svg?raw=true" width="40%" alt="distil labs" />
</div>
---
<div align="center">
<table>
<tr>
<td align="center">
<a href="https://www.distillabs.ai/?utm_source=huggingface&utm_medium=referral&utm_campaign=distil-n8n-email-classifier">
<img src="https://github.com/distil-labs/badges/blob/main/badge-distillabs-home.svg?raw=true" alt="Homepage"/>
</a>
</td>
<td align="center">
<a href="https://github.com/distil-labs">
<img src="https://github.com/distil-labs/badges/blob/main/badge-github.svg?raw=true" alt="GitHub"/>
</a>
</td>
<td align="center">
<a href="https://huggingface.co/distil-labs">
<img src="https://github.com/distil-labs/badges/blob/main/badge-huggingface.svg?raw=true" alt="Hugging Face"/>
</a>
</td>
</tr>
<tr>
<td align="center">
<a href="https://www.linkedin.com/company/distil-labs/">
<img src="https://github.com/distil-labs/badges/blob/main/badge-linkedin.svg?raw=true" alt="LinkedIn"/>
</a>
</td>
<td align="center">
<a href="https://distil-labs-community.slack.com/join/shared_invite/zt-36zqj87le-i3quWUn2bjErRq22xoE58g">
<img src="https://github.com/distil-labs/badges/blob/main/badge-slack.svg?raw=true" alt="Slack"/>
</a>
</td>
<td align="center">
<a href="https://x.com/distil_labs">
<img src="https://github.com/distil-labs/badges/blob/main/badge-twitter.svg?raw=true" alt="Twitter"/>
</a>
</td>
</tr>
</table>
</div>
# We fine-tuned an email classification model so you can auto-label your emails locally with n8n.
We built a fully local Gmail auto-labeler with n8n and a fine-tuned 0.6B model; no email content is sent to cloud LLMs.
Most inboxes are a mix of the useful and the distracting. Labels can help bring order to the chaos, but labelling every email manually takes time too.
We put together a setup that auto-labels Gmail **locally**, so email content does not go to external LLM APIs.
### What it does (end-to-end local):
- An n8n trigger fires when you receive an email
- It sends the email text (subject + snippet/body) to a fine-tuned model running on localhost via Ollama
- It applies the predicted label back in Gmail (we recommend prefixing labels with AI/)
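The second step above (sending subject plus snippet/body) can be sketched as a small helper. This is a minimal illustration, not the workflow's actual code: the function name `build_email_text`, the `Subject:` line format (borrowed from the sample email in `model_client.py`), and the `max_chars` cap are all assumptions.

```python
def build_email_text(subject: str, body: str = "", snippet: str = "",
                     max_chars: int = 4000) -> str:
    """Join the subject and the body into one classifier input.

    Falls back to the Gmail snippet when the body is empty.
    max_chars is a hypothetical cap to keep the prompt small.
    """
    content = body.strip() or snippet.strip()
    text = f"Subject: {subject.strip()}\n{content}"
    return text[:max_chars]


print(build_email_text("Your invoice", snippet="Payment successful"))
```

Truncating long emails is a design choice made here for illustration; the real workflow may pass the full text.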
#### Label set (10-way closed set):
Billing, Newsletter, Work, Personal, Promotional, Security, Shipping, Travel, Spam, Other
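Because the label set is closed, it is cheap to validate the model's reply before touching Gmail. The sketch below assumes the reply uses the `<output>...</output>` wrapper requested by the prompt in `model_client.py` and the recommended `AI/` prefix. The names `parse_label`, `ALLOWED_LABELS`, and `FALLBACK_LABEL`, and the choice to fall back to `AI/Other`, are illustrative assumptions.

```python
import re

# Closed label set from this README, with the recommended "AI/" prefix.
CATEGORIES = ["Billing", "Newsletter", "Work", "Personal", "Promotional",
              "Security", "Shipping", "Travel", "Spam", "Other"]
ALLOWED_LABELS = {f"AI/{name}" for name in CATEGORIES}
FALLBACK_LABEL = "AI/Other"  # assumption: unknown replies fall back to Other

_OUTPUT_RE = re.compile(r"<output>(.*?)</output>", re.DOTALL)


def parse_label(raw_reply: str) -> str:
    """Extract the predicted label and reject anything outside the closed set."""
    match = _OUTPUT_RE.search(raw_reply)
    candidate = (match.group(1) if match else raw_reply).strip()
    return candidate if candidate in ALLOWED_LABELS else FALLBACK_LABEL


print(parse_label("<output>AI/Shipping</output>"))  # AI/Shipping
print(parse_label("something unexpected"))          # AI/Other
```

Rejecting unknown strings keeps a malformed reply from creating stray labels in the mailbox.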
### Results:
| Model | Accuracy |
| --- | --- |
| Teacher (GPT-OSS-120B) | 93% |
| Base student (Qwen3-0.6B) | 38% |
| Fine-tuned student (Qwen3-0.6B) | 93% |
### Training setup
| Student model | Qwen3-0.6B (600M parameters) |
| --- | --- |
| Teacher model | GPT-OSS-120B |
| Training method | Knowledge distillation + supervised fine-tuning (SFT) |
| Seed data | 154 examples |
| Training data | 10,000 synthetic email examples across 10 categories generated using our data synthesis pipeline |
## **System Setup**
**Installation Steps:**
1. **Install n8n locally**
```bash
# Install Node.js (if not installed)
brew install node
# Install n8n globally
npm install -g n8n
# Start n8n
n8n
```
Access n8n at: [**http://localhost:5678**](http://localhost:5678/)
2. **Download the model**
```bash
#install the Hugging Face CLI if it is not installed
python3 -m pip install -U huggingface_hub
#download the model
hf download distil-labs/distil-email-classifier --local-dir ./distil-email-classifier
```
3. **Run the model**
```bash
#install Ollama (or download it from https://ollama.com/download)
brew install ollama
#start Ollama
ollama serve
#navigate to your model folder
cd ./distil-email-classifier
#create model in ollama
ollama create email-classifier -f Modelfile
#verify that the model was created
ollama list
#run the model
ollama run email-classifier "test"
#check whether the model is running
ollama ps
#Expected output:
#NAME ID SIZE PROCESSOR CONTEXT UNTIL
#email-classifier:latest 695190b0f07f 3.5 GB 100% GPU 4096 4 minutes from now
#to keep the model loaded indefinitely, run the command below
OLLAMA_KEEP_ALIVE=-1 ollama run email-classifier "test"
#`ollama ps` now shows Forever instead of 4 minutes from now.
```
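Before wiring up n8n, it can help to confirm from a script that the model is registered. The sketch below is an illustration under stated assumptions: it relies on Ollama's `/api/tags` endpoint on the default port 11434 (the same port `model_client.py` defaults to), and the helper names `model_is_registered` and `fetch_tags` are hypothetical. The sample payload is hand-written, not captured server output.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default local address
MODEL_NAME = "email-classifier"        # name passed to `ollama create` above


def model_is_registered(tags_payload: dict, model_name: str) -> bool:
    """Return True if model_name (with or without a tag) is in a /api/tags reply."""
    for entry in tags_payload.get("models", []):
        name = entry.get("name", "")
        if name == model_name or name.split(":", 1)[0] == model_name:
            return True
    return False


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline illustration with a hand-written payload.
sample = {"models": [{"name": "email-classifier:latest"}]}
print(model_is_registered(sample, MODEL_NAME))  # True
```

With `ollama serve` running, `model_is_registered(fetch_tags(), MODEL_NAME)` performs the same check against the live server.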
Once setup is complete, open n8n in your browser ([**`http://localhost:5678`**](http://localhost:5678/)) and sign up with your email. n8n then sends an access code to that address, which you can enter in n8n.
### Import n8n workflows
Download the workflow JSON files from our GitHub repository: https://github.com/distil-labs/distil-n8n-gmail-automation
#### Two workflows are available:
- Real-time Classification: Triggers automatically on each incoming email
- Batch Processing: Classifies multiple existing emails at once
To connect your Gmail account, you need to set up Gmail OAuth in the Google Cloud Console; detailed steps are in the GitHub README.
Before running the workflow, create all 10 labels manually in your Gmail account. Use the "AI/" prefix to match the model output (AI/Billing, AI/Work, AI/Travel, and so on).
Once this is running, new messages are labeled automatically.
If you want different labels, you can distill a custom version of this classifier on the [distil labs platform](https://www.distillabs.ai/?utm_source=huggingface&utm_medium=referral&utm_campaign=distil-n8n-email-classifier). When you sign up, you get two free training credits to train the model.
Full write-up: [https://www.distillabs.ai/blog/building-a-local-agent-for-email-classification-using-n8n-distil-labs](https://www.distillabs.ai/blog/building-a-local-agent-for-email-classification-using-n8n-distil-labs?utm_source=huggingface&utm_medium=referral&utm_campaign=distil-n8n-email-classifier)
Workflows: https://github.com/distil-labs/distil-n8n-gmail-automation
Model: https://huggingface.co/distil-labs/distil-email-classifier

13
STUDENT_LICENSE Normal file

@@ -0,0 +1,13 @@
Copyright 2023 Qwen
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

13
TEACHER_LICENSE Normal file

@@ -0,0 +1,13 @@
Copyright 2025 OpenAI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

28
added_tokens.json Normal file

@@ -0,0 +1,28 @@
{
"</think>": 151668,
"</tool_call>": 151658,
"</tool_response>": 151666,
"<think>": 151667,
"<tool_call>": 151657,
"<tool_response>": 151665,
"<|box_end|>": 151649,
"<|box_start|>": 151648,
"<|endoftext|>": 151643,
"<|file_sep|>": 151664,
"<|fim_middle|>": 151660,
"<|fim_pad|>": 151662,
"<|fim_prefix|>": 151659,
"<|fim_suffix|>": 151661,
"<|im_end|>": 151645,
"<|im_start|>": 151644,
"<|image_pad|>": 151655,
"<|object_ref_end|>": 151647,
"<|object_ref_start|>": 151646,
"<|quad_end|>": 151651,
"<|quad_start|>": 151650,
"<|repo_name|>": 151663,
"<|video_pad|>": 151656,
"<|vision_end|>": 151653,
"<|vision_pad|>": 151654,
"<|vision_start|>": 151652
}

89
chat_template.jinja Normal file

@@ -0,0 +1,89 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}

62
config.json Normal file

@@ -0,0 +1,62 @@
{
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 40960,
"max_window_layers": 28,
"model_type": "qwen3",
"num_attention_heads": 16,
"num_hidden_layers": 28,
"num_key_value_heads": 8,
"pad_token": "<|endoftext|>",
"pad_token_id": 151643,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.53.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}

13
generation_config.json Normal file

@@ -0,0 +1,13 @@
{
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"temperature": 0.6,
"top_k": 20,
"top_p": 0.95,
"transformers_version": "4.53.0"
}

151388
merges.txt Normal file

File diff suppressed because it is too large

3
model.gguf Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d652cb3fe34a8196a3ae532e634cc50337f5994575d8ef031fd5388a73f8d6b6
size 2390146560

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4ee220168cc593d01caaea8c0908e68e7e7501ba6c6c00b3896cd6c3c7585747
size 1192135096

168
model_client.py Normal file

@@ -0,0 +1,168 @@
import argparse
from openai import OpenAI
DEFAULT_QUESTION = """Subject: Package Delivered to Neighbor - Building 4, Apt 2B
Tracking: USPS-9405511899562871456789
Delivery Notice
Dear Customer,
Your package was delivered to your neighbor as you were not home.
Delivered To:
- Name: Sarah Johnson
- Address: Building 4, Apartment 2B
- Relationship: Neighbor (same floor)
- Signed: Yes, at 3:15 PM today
Your Address:
Building 4, Apartment 2A
Chicago, IL 60614
Package Info:
- From: Target.com
- Weight: 5.2 lbs
- Service: USPS Priority Mail
- Tracking: 9405511899562871456789
Delivery Attempt:
We attempted delivery at 3:10 PM. No answer at your door. Your neighbor (Apt 2B) answered and accepted the package on your behalf.
Delivery photo: https://usps.com/photo/9405511899562871456789
(Shows your neighbor receiving package)
Pickup Instructions:
Please retrieve your package from Sarah Johnson at Apartment 2B.
Note: USPS policy allows delivery to neighbors in multi-unit buildings when recipient is unavailable and neighbor is willing to accept.
We left a notice on your door.
Questions?
Call: 1-800-ASK-USPS
USPS Delivery Services"""
class DistilLabsLLM(object):
    """Thin client for a locally served model behind an OpenAI-compatible API."""

    def __init__(self, model_name: str, api_key: str = "EMPTY", port: int = 11434):
        self.model_name = model_name
        self.client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key=api_key)

    def get_prompt(
        self,
        question: str,
    ) -> list[dict[str, str]]:
        # Build the system + user messages for one email to classify.
        return [
{
"role": "system",
"content": """
You are a classifier working on a problem described in task_description XML block:
<task_description>## Task
Classify incoming emails into one of ten predefined categories to enable intelligent email organization, prioritization, and automation workflows. The classification must accurately determine the email's primary purpose and intent based on comprehensive analysis of sender information, subject line, body content, formatting patterns, and contextual signals. The system should handle multi-lingual emails (English and French), mixed personal/professional contexts, and edge cases where emails may contain elements of multiple categories.
## Inputs
Complete email content including:
- Subject line (required)
- Email body text (required)
- Sender information (when available)
- Metadata such as timestamps, domains, and formatting (when available)
## Outputs
A single category label that best represents the email's primary purpose and expected user action.
## Classification Guidelines
1. **Multi-category Resolution:** When an email contains elements of multiple categories, classify based on the PRIMARY user action required. Priority order for ambiguous cases: Security > Spam > Billing > Work > Travel > Shipping > Personal > Promotional > Newsletter > Other.
2. **Language Handling:** The system must accurately classify emails in both English and French based on content meaning. French keywords (e.g., 'facture', 'virement', 'livraison') must be recognized.
3. **Context Awareness:** Consider sender domain and structure. E.g., '@linkedin.com' about jobs is AI/Work, but about network posts is AI/Newsletter.
4. **Edge Case Principles:**
- Security concerns always take precedence.
- Obvious spam/phishing is always AI/Spam.
- Transactional emails (receipts) go to AI/Billing.
- Personal relationships override platform context.
- Work context is determined by professional intent, not just sender.
## Decision Logic Examples
- **LinkedIn Flow:** Job posting = AI/Work; Personal msg = AI/Personal; Digest = AI/Newsletter; Profile view = AI/Other.
- **Payment Flow:** If amount+ID present = AI/Billing; If phishing/scam = AI/Spam; If shipping focus = AI/Shipping.
- **Notification Flow:** Security alert = AI/Security; Payment = AI/Billing; Delivery = AI/Shipping; Personal msg = AI/Personal.</task_description>
Classify the input into one of the available classes, each class has a name in class_name and description in class_description XML block:
<class_name>AI/Promotional</class_name>
<class_description>Marketing and sales communications from businesses, services, or platforms promoting products, services, features, or special offers. INDICATORS: Discount codes, limited-time deals, product launches, 'Upgrade today' calls-to-action, webinar invitations. EXAMPLES: SaaS discount offers, early access invites, Black Friday sales. EDGE CASES: Work-related webinar invites from vendors count as Promotional.</class_description>
<class_name>AI/Travel</class_name>
<class_description>All communications related to travel arrangements, transportation bookings, accommodations, and trip logistics. INDICATORS: Flight/Hotel confirmations, boarding passes, car rental reservations, itineraries. EXAMPLES: Air France confirmations, Airbnb bookings, Eurostar tickets. EDGE CASES: Work conference travel is AI/Travel (logistics focus), not AI/Work.</class_description>
<class_name>AI/Spam</class_name>
<class_description>Unsolicited, fraudulent, or malicious emails including phishing attempts, scams, lottery notifications, and suspicious requests. INDICATORS: Unrealistic promises ('You won!'), urgent threats, requests for passwords/SSN, generic greetings, poor grammar, mismatched sender domains. EXAMPLES: Phishing impersonating Amazon/banks, inheritance scams, crypto schemes. EDGE CASES: Legitimate security alerts go to AI/Security; aggressive but legitimate marketing goes to AI/Promotional.</class_description>
<class_name>AI/Shipping</class_name>
<class_description>Order fulfillment communications including shipping confirmations, tracking updates, delivery notifications, and returns. INDICATORS: Tracking numbers (UPS/FedEx), 'Out for delivery' status, delivered photos, return labels. EXAMPLES: Amazon shipment updates, UPS delivery notifications. EDGE CASES: Order confirmations without shipping info often go to AI/Billing.</class_description>
<class_name>AI/Security</class_name>
<class_description>Account security notifications including login alerts, authentication codes, and password changes. INDICATORS: New device logins, 2FA codes, password reset requests, suspicious activity alerts. EXAMPLES: Google sign-in alerts, Microsoft 2FA codes, GitHub security keys. EDGE CASES: If the email is a scam threat, it is AI/Spam.</class_description>
<class_name>AI/Billing</class_name>
<class_description>Financial transaction communications including invoices, payment receipts, subscription charges, and refunds. INDICATORS: Invoice numbers, transaction IDs, 'Payment successful', subscription renewals, tax receipts. EXAMPLES: Stripe receipts, Netflix renewals, cloud billing statements. EDGE CASES: Order confirmations with payment info are AI/Billing; Expired trial upsells are AI/Promotional.</class_description>
<class_name>AI/Work</class_name>
<class_description>Professional and employment-related communications including job opportunities, project updates, team collaboration, and career development. INDICATORS: Job postings, meeting agendas, sprint planning, pull requests, performance reviews, client emails. EXAMPLES: LinkedIn Recruiter messages, Jira updates, Slack digest, client project requirements. EDGE CASES: Professional conference travel is AI/Travel; Work-related SaaS receipts are AI/Billing.</class_description>
<class_name>AI/Newsletter</class_name>
<class_description>Curated content digests, regular informational updates, or periodic communications from subscribed sources. INDICATORS: Daily/weekly cadence, multiple article links, 'digest', 'roundup', unsubscribe links. EXAMPLES: TechCrunch daily, GitHub trending, Substack newsletters. EDGE CASES: A single dedicated promotional email from a newsletter sender is AI/Promotional.</class_description>
<class_name>AI/Personal</class_name>
<class_description>Direct personal communications from friends, family, or colleagues regarding non-professional matters. INDICATORS: Casual tone, social plans (coffee/dinner), birthday wishes, personal advice. EXAMPLES: Friend asking to hang out, family updates, personal networking. EDGE CASES: Colleagues emailing about work are AI/Work; Platform notifications about messages are AI/Other.</class_description>
<class_name>AI/Other</class_name>
<class_description>Miscellaneous communications including platform notifications, system messages, event registrations, and administrative notices. INDICATORS: Automated system updates, terms of service changes, community moderation, badge awards, meetup confirmations. EXAMPLES: Reddit upvote notifications, Terms of Service updates, event registrations. EDGE CASES: Security notifications must go to AI/Security.</class_description>
Write the name of the predicted class inside output XML block
For example, if the input matches class test_output, write
<output>test_output</output>
""",
},
{
"role": "user",
"content": f"""
Now for the real task, classify the following example
<question>{question}</question>
""",
},
]
    def invoke(self, question: str) -> str:
        # temperature=0 keeps the predicted label deterministic;
        # reasoning_effort="none" requests a reply without a thinking phase.
        chat_response = self.client.chat.completions.create(
            model=self.model_name,
            messages=self.get_prompt(question),
            temperature=0,
            reasoning_effort="none",
        )
        return chat_response.choices[0].message.content


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question", type=str, default=DEFAULT_QUESTION, required=False)
    parser.add_argument("--api-key", type=str, default="EMPTY", required=False)
    parser.add_argument("--model", type=str, default="model", required=False)
    parser.add_argument("--port", type=int, default=11434, required=False)
    args = parser.parse_args()
    client = DistilLabsLLM(model_name=args.model, api_key=args.api_key, port=args.port)
    print(client.invoke(args.question))

38
special_tokens_map.json Normal file

@@ -0,0 +1,38 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"bos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

3
tokenizer.json Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fbd5dd30a62db2f0ead71513492e40939dca4240dd5141e0a525212e2a45ff74
size 11422923

240
tokenizer_config.json Normal file

@@ -0,0 +1,240 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151665": {
      "content": "<tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151666": {
      "content": "</tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151667": {
      "content": "<think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151668": {
      "content": "</think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "padding_side": "left",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long