commit 53b12b85002fe373856972f9ad0632f9887ad920
Author: ModelHub XC
Date:   Wed May 6 20:14:55 2026 +0800

    Initialize project; model provided by the ModelHub XC community
    Model: betterdataai/PII_DETECTION_MODEL
    Source: Original Platform

diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..a6344aa
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..656e8ee
--- /dev/null
+++ b/README.md
@@ -0,0 +1,120 @@
+---
+library_name: transformers
+pipeline_tag: text-generation
+---
+
+# Model Card
+
+This model is used for PII detection. It covers 29 classes across 7 languages (English, Spanish, Swedish, German, Italian, Dutch, French).
+
+
+## Model Details
+The model is built on top of a decoder transformer model (Qwen2-0.5B). We trained it on publicly available datasets with a permissive license as well as synthetic data.
+
+### Model Description
+
+
+
+- **Developed by:** Betterdata.ai
+- **Model type:** Decoder Transformer
+- **License:** apache-2.0
+- **Finetuned from model:** Qwen2-0.5B
+
+## Uses
+
+
+
+With the advent of ChatGPT, professionals and organizations use public chat interfaces for various applications. This often leads to leakage of PII, which causes privacy issues, because users enter names, dates, or even API keys to give the model better context. With PII class tags, this confidential information can be masked out and replaced by class tags, which lets the end model understand the context without the data leaving the server. Developers and teams building applications on third-party APIs can also use these models for better privacy. The image below illustrates this:
+
+
+![Use Case](pii_image.png)
+
+The PII model should not add much latency and should be able to take in long documents, so we used Qwen2-0.5B as the base model. Another consideration for the model size was that we felt a privacy model should be easy to run even on CPUs. We do have larger models in house with better performance. Our plans include coverage for Southeast Asian languages and the ability for users to define custom classes.
+
+
+We will keep improving this model, so always pull the latest version.
+
+
+### Out-of-Scope Use
+
+1. Currently, the model aims to replace PII text with the respective class tags. In the future, we plan to replace the data with synthetic class values.
+
+## Bias, Risks, and Limitations
+
+1. The model may not always be 100% accurate; we are working on improving it.
+2. In our 'vibe test', where we fed the model out-of-distribution data, it did very well on classes such as names, emails, and phone numbers. It has room to improve on classes such as API_KEYS, credit card CVV numbers, and bank account numbers. We are creating more data for these classes to further improve performance, and you can expect improvement in the next iteration.
+
+## How to Get Started with the Model
+
+Use the code below to get started with the model.
+
+```
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
+model = AutoModelForCausalLM.from_pretrained("betterdataai/PII_DETECTION_MODEL").to(device)
+tokenizer = AutoTokenizer.from_pretrained("betterdataai/PII_DETECTION_MODEL")
+
+classes_list = ['','','','','','','','','','','','','','','','','','','','','','','','','','','','