Initialize project; model provided by the ModelHub XC community
Model: betterdataai/PII_DETECTION_MODEL Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
120
README.md
Normal file
@@ -0,0 +1,120 @@
---
library_name: transformers
pipeline_tag: text-generation
---

# Model Card

This model is used for PII detection. It covers 29 classes across 7 languages (English, Spanish, Swedish, German, Italian, Dutch, and French).
## Model Details

The model is built on top of a decoder transformer model (Qwen2-0.5B). We used publicly available datasets with permissive licenses, as well as synthetic data, to train the model.

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Betterdata.ai
- **Model type:** Decoder Transformer
- **License:** apache-2.0
- **Finetuned from model:** Qwen2-0.5B
## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

With the advent of ChatGPT, professionals and organizations use public chat interfaces for a variety of applications. This often leaks PII and causes privacy issues, as users enter names, dates, or even API keys to give the model better context. With the PII class tags, this confidential information can be masked out as class tags, which lets the end models understand the context without data leaving the server. Developers or teams building applications on top of third-party APIs can also use these models for better privacy. The image below illustrates this:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6503a4a52a4582cdf4a6a0ec/4WRyVU-JyowJZwzWJGC65.png)

The PII model shouldn't add too much latency and should be able to take in long documents, so we used Qwen2-0.5B as the base model. Another consideration for the model size was that we felt a model for privacy should be easy to run even on CPUs. We do have larger models in house with better performance. Coverage for Southeast Asian languages, as well as the ability for users to define custom classes, is part of our plans.

We will be constantly improving this model, so always pull the latest version.
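As an illustrative sketch of the masking idea described above (not part of this repository's code), the snippet below replaces detected PII values with their class tags; the `detections` dict here is a hypothetical example of detector output:

```python
# Illustrative sketch: mask detected PII values with their class tags
# before the text leaves the server. `detections` maps a class tag to
# the list of values the detector found (hypothetical example data).
def mask_pii(text: str, detections: dict) -> str:
    """Replace each detected PII value in `text` with its class tag."""
    for tag, values in detections.items():
        for value in values:
            text = text.replace(value, tag)
    return text

masked = mask_pii(
    "Write an email to Julia indicating I won't be coming to office on the 29th of June",
    {"<name>": ["Julia"], "<date>": ["the 29th of June"]},
)
print(masked)
# Write an email to <name> indicating I won't be coming to office on <date>
```

The masked text can then be sent to a third-party API, while the tag-to-value mapping stays local for restoring the response.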
### Out-of-Scope Use

1. We currently replace the PII text with the respective class tags. In the future we plan to replace the data with synthetic class values.

## Bias, Risks, and Limitations

1. The model may not always be 100% accurate, but we are working on it.
2. In our 'vibe test', where we fed the model out-of-distribution data, it does very well on classes like names, emails, and phone numbers. We find that it can do better on classes like API keys, credit card CVV numbers, and bank account numbers. We are creating more data for these classes to further improve performance; you can expect improvement in the next iteration.
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("betterdataai/PII_DETECTION_MODEL").to(device)
tokenizer = AutoTokenizer.from_pretrained("betterdataai/PII_DETECTION_MODEL")

classes_list = ['<pin>', '<api_key>', '<bank_routing_number>', '<bban>', '<company>', '<credit_card_number>', '<credit_card_security_code>', '<customer_id>', '<date>', '<date_of_birth>', '<date_time>', '<driver_license_number>', '<email>', '<employee_id>', '<first_name>', '<iban>', '<ipv4>', '<ipv6>', '<last_name>', '<local_latlng>', '<name>', '<passport_number>', '<password>', '<phone_number>', '<social_security_number>', '<street_address>', '<swift_bic_code>', '<time>', '<user_name>']

prompt = """You are an AI assistant who is responsible for identifying Personal Identifiable Information (PII). You will be given a passage of text and you have to \
identify the PII data present in the passage. You should only identify the data based on the classes provided and not make up any class on your own.

```PII Classes```
{classes}

The given text is:
{text}

The PII data are:
"""

user_input = "Write an email to Julia indicating I won't be coming to office on the 29th of June"
new_prompt = prompt.format(classes="\n".join(classes_list), text=user_input)
tokenized_input = tokenizer(new_prompt, return_tensors="pt").to(device)

output = model.generate(**tokenized_input, max_new_tokens=6000)
pii_classes = tokenizer.decode(output[0], skip_special_tokens=True).split("The PII data are:\n")[1]

print(pii_classes)
# Output:
# <name> : ['Julia']
# <date> : ['the 29th of June']
```
## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

1. We developed an in-house test set that we evaluate the model on. The test set was manually annotated to ensure high quality. It mainly comprises financial documents, contracts, etc. across all 7 languages, covering all the classes.

#### Metrics

1. We used precision, recall, and F1 score to compare the model's output against the ground truth.

We consider precision and recall to be:

- Precision: The ratio of correctly identified PII instances to the total identified instances.
- Recall: The ratio of correctly identified PII instances to the total actual instances in the dataset.
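These definitions can be sketched directly in code; the span tuples below are hypothetical examples, not from the actual test set:

```python
# Minimal sketch of the metrics above, computed over sets of
# (class_tag, value) spans. Hypothetical example data.
def precision_recall_f1(predicted: set, actual: set):
    tp = len(predicted & actual)  # correctly identified instances
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

pred = {("<name>", "Julia"), ("<date>", "June 29")}
gold = {("<name>", "Julia"), ("<date>", "June 29"), ("<email>", "a@b.com")}
print(precision_recall_f1(pred, gold))
# (1.0, 0.6666666666666666, 0.8)
```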
## Model Card Authors

Srinivasan

## Model Card Contact

For feedback, suggestions, and any collaborations, reach us at:

srini@betterdata.ai
contact@betterdata.ai
6
added_tokens.json
Normal file
@@ -0,0 +1,6 @@
{
  "<|PAD_TOKEN|>": 151646,
  "<|endoftext|>": 151643,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644
}
31
config.json
Normal file
@@ -0,0 +1,31 @@
{
  "_name_or_path": "./final_model_qwen-0.5B_with_allclasses_v2_3epochs_12batchsize_4e-5lr/",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_embeddings": 131072,
  "max_window_layers": 24,
  "model_type": "qwen2",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "pad_token_id": 151646,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": true,
  "torch_dtype": "float32",
  "transformers_version": "4.42.4",
  "unsloth_version": "2024.7",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
6
generation_config.json
Normal file
@@ -0,0 +1,6 @@
{
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "max_new_tokens": 2048,
  "transformers_version": "4.42.4"
}
151388
merges.txt
Normal file
File diff suppressed because it is too large
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:31cc8cf5fb40ccdcd6b7f7f215c07e1fa598362a35cdadbc2ba20933e42dbfd2
size 1976163472
BIN
pii_image.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 25 KiB |
20
special_tokens_map.json
Normal file
@@ -0,0 +1,20 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>"
  ],
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|PAD_TOKEN|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
303133
tokenizer.json
Normal file
File diff suppressed because it is too large
58
tokenizer_config.json
Normal file
@@ -0,0 +1,58 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|PAD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>"
  ],
  "bos_token": null,
  "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "max_length": 131072,
  "model_max_length": 131072,
  "pad_to_multiple_of": null,
  "pad_token": "<|PAD_TOKEN|>",
  "pad_token_type_id": 0,
  "padding_side": "left",
  "split_special_tokens": false,
  "stride": 0,
  "tokenizer_class": "Qwen2Tokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": null
}
1
vocab.json
Normal file
File diff suppressed because one or more lines are too long