Initialize project; model provided by the ModelHub XC community

Model: betterdataai/PII_DETECTION_MODEL
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-06 20:14:55 +08:00
commit 53b12b8500
12 changed files with 454801 additions and 0 deletions

35
.gitattributes vendored Normal file
View File

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

120
README.md Normal file
View File

@@ -0,0 +1,120 @@
---
library_name: transformers
pipeline_tag: text-generation
---
# Model Card
This model is used for PII detection. It covers 29 classes across 7 languages (English, Spanish, Swedish, German, Italian, Dutch, and French).
## Model Details
The model is built on top of a decoder transformer model (Qwen2-0.5B). We trained it on publicly available datasets with permissive licenses as well as synthetic data.
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Betterdata.ai
- **Model type:** Decoder Transformer
- **License:** apache-2.0
- **Finetuned from model :** Qwen2-0.5B
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
With the advent of ChatGPT, professionals and organizations use public chat interfaces for many applications. This often leaks PII and causes privacy issues, as users enter names, dates, or even API keys to give the model better context. With the PII class tags, this confidential information can be masked out as class tags, which lets the end models understand context without data leaving the server. Developers and teams building applications on third-party APIs can also use these models for better privacy. The image below illustrates this:
![Use Case](pii_image.png)
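The masking step described above can be sketched as follows. This is an illustrative helper, not part of the released API; the `detections` dict mimics the model's output format shown later in this card.

```python
# Hypothetical sketch: replace each detected PII value in the text
# with its class tag, so downstream models see structure, not data.

def mask_pii(text, detections):
    """Replace every detected PII value with its class tag."""
    for tag, values in detections.items():
        for value in values:
            text = text.replace(value, tag)
    return text

masked = mask_pii(
    "Write an email to Julia indicating I won't be coming to office on the 29th of June",
    {"<name>": ["Julia"], "<date>": ["the 29th of June"]},
)
print(masked)
# Write an email to <name> indicating I won't be coming to office on <date>
```

The masked text can then be sent to a third-party API, with the tag-to-value mapping kept server-side for restoring the response.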
The PII model shouldn't add much latency and should handle long documents, so we used Qwen2-0.5B as the base model. Another consideration for the model size was that a privacy model should be easy to run even on CPUs. We do have larger in-house models with better performance. Our roadmap includes coverage for Southeast Asian languages as well as the ability for users to define custom classes.
We will be constantly improving this model, so always pull the latest version.
### Out-of-Scope Use
1. The model currently replaces PII text with the respective class tags; in the future we plan to replace the data with synthetic values instead.
## Bias, Risks, and Limitations
1. The model may not always be 100% accurate, but we are working on it.
2. In our 'vibe test', where we fed the model out-of-distribution data, it does very well on classes like names, emails, and phone numbers. It can do better on classes like API keys, credit card CVV numbers, and bank account numbers. We are creating more data for these classes to further improve performance; you can expect improvement in the next iteration.
## How to Get Started with the Model
Use the code below to get started with the model.
````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("betterdataai/PII_DETECTION_MODEL").to(device)
tokenizer = AutoTokenizer.from_pretrained("betterdataai/PII_DETECTION_MODEL")

classes_list = ['<pin>','<api_key>','<bank_routing_number>','<bban>','<company>','<credit_card_number>','<credit_card_security_code>','<customer_id>','<date>','<date_of_birth>','<date_time>','<driver_license_number>','<email>','<employee_id>','<first_name>','<iban>','<ipv4>','<ipv6>','<last_name>','<local_latlng>','<name>','<passport_number>','<password>','<phone_number>','<social_security_number>','<street_address>','<swift_bic_code>','<time>','<user_name>']

prompt = """You are an AI assistant who is responisble for identifying Personal Identifiable information (PII). You will be given a passage of text and you have to \
identify the PII data present in the passage. You should only identify the data based on the classes provided and not make up any class on your own.
```PII Classes```
{classes}
The given text is:
{text}
The PII data are:
"""

user_input = "Write an email to Julia indicating I won't be coming to office on the 29th of June"
new_prompt = prompt.format(classes="\n".join(classes_list), text=user_input)

tokenized_input = tokenizer(new_prompt, return_tensors="pt").to(device)
output = model.generate(**tokenized_input, max_new_tokens=6000)
pii_classes = tokenizer.decode(output[0], skip_special_tokens=True).split("The PII data are:\n")[1]
print(pii_classes)
# output:
# <name> : ['Julia']
# <date> : ['the 29th of June']
````
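If you need the detections as a structured object rather than raw text, the `"<tag> : [...]"` output lines shown above can be parsed with a small helper. This parsing function is an assumption for illustration, not part of the official API.

```python
# Sketch: parse the model's "<tag> : ['value', ...]" output lines
# into a dict mapping class tags to lists of detected values.
import ast

def parse_pii_output(raw):
    detections = {}
    for line in raw.strip().splitlines():
        tag, sep, values = line.partition(" : ")
        if sep and tag.startswith("<"):
            detections[tag] = ast.literal_eval(values)
    return detections

print(parse_pii_output("<name> : ['Julia']\n<date> : ['the 29th of June']"))
# {'<name>': ['Julia'], '<date>': ['the 29th of June']}
```

`ast.literal_eval` safely evaluates the Python-list literal without executing arbitrary code.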
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
1. We evaluate the model on an in-house test set that was manually annotated to ensure high quality. It mainly comprises financial documents, contracts, etc. across all 7 languages, covering all the classes.
#### Metrics
1. We used precision, recall, and F1 score to compare the model's output with the ground truth.
We define precision and recall as:
- Precision: the ratio of correctly identified PII instances to the total identified instances.
- Recall: the ratio of correctly identified PII instances to the total actual instances in the dataset.
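The definitions above can be sketched as a small computation over sets of predicted and ground-truth PII instances (an illustrative sketch, not the actual evaluation harness):

```python
# Compute precision, recall, and F1 from sets of predicted
# and actual (ground-truth) PII instances.

def pii_metrics(predicted, actual):
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Two of three predictions are correct; both true instances are found.
p, r, f1 = pii_metrics({"Julia", "29th of June", "office"}, {"Julia", "29th of June"})
print(p, r, f1)  # ~0.667, 1.0, 0.8
```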
## Model Card Authors
Srinivasan
## Model Card Contact
For feedback, suggestions, and collaborations, reach us at:
srini@betterdata.ai
contact@betterdata.ai

6
added_tokens.json Normal file
View File

@@ -0,0 +1,6 @@
{
"<|PAD_TOKEN|>": 151646,
"<|endoftext|>": 151643,
"<|im_end|>": 151645,
"<|im_start|>": 151644
}

31
config.json Normal file
View File

@@ -0,0 +1,31 @@
{
"_name_or_path": "./final_model_qwen-0.5B_with_allclasses_v2_3epochs_12batchsize_4e-5lr/",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 896,
"initializer_range": 0.02,
"intermediate_size": 4864,
"max_position_embeddings": 131072,
"max_window_layers": 24,
"model_type": "qwen2",
"num_attention_heads": 14,
"num_hidden_layers": 24,
"num_key_value_heads": 2,
"pad_token_id": 151646,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": true,
"torch_dtype": "float32",
"transformers_version": "4.42.4",
"unsloth_version": "2024.7",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}

6
generation_config.json Normal file
View File

@@ -0,0 +1,6 @@
{
"bos_token_id": 151643,
"eos_token_id": 151643,
"max_new_tokens": 2048,
"transformers_version": "4.42.4"
}

151388
merges.txt Normal file

File diff suppressed because it is too large Load Diff

3
model.safetensors Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:31cc8cf5fb40ccdcd6b7f7f215c07e1fa598362a35cdadbc2ba20933e42dbfd2
size 1976163472

BIN
pii_image.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 25 KiB

20
special_tokens_map.json Normal file
View File

@@ -0,0 +1,20 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>"
],
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|PAD_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

303133
tokenizer.json Normal file

File diff suppressed because it is too large Load Diff

58
tokenizer_config.json Normal file
View File

@@ -0,0 +1,58 @@
{
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151646": {
"content": "<|PAD_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>"
],
"bos_token": null,
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"errors": "replace",
"max_length": 131072,
"model_max_length": 131072,
"pad_to_multiple_of": null,
"pad_token": "<|PAD_TOKEN|>",
"pad_token_type_id": 0,
"padding_side": "left",
"split_special_tokens": false,
"stride": 0,
"tokenizer_class": "Qwen2Tokenizer",
"truncation_side": "right",
"truncation_strategy": "longest_first",
"unk_token": null
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long