--- license: apache-2.0 language: - en tags: - zero-latency - pii-scrubbing - pii - compliance - privacy - function-calling - arpa - micro-f1-mask - micro-series pipeline_tag: text-generation library_name: transformers base_model: google/gemma-3-270m-it datasets: - synthetic model-index: - name: micro-f1-mask results: [] ---
# ARPA MICRO SERIES: F1 MASK **Zero-Latency PII Scrubbing - 270M Parameter Middleware** Base Model Task Training License
--- **ARPA Micro Series: F1 Mask** is a high-performance fine-tuned model built to provide real-time identification and tokenization of Personally Identifiable Information (PII) for secure cloud computing. Developed by [ARPA Hellenic Logical Systems](https://arpacorp.net), it acts as a privacy firewall for incoming/outgoing LLM prompts. **GitHub**: [arpahls/micro-f1-mask](https://github.com/arpahls/micro-f1-mask) — Full training pipeline, Redis vault, and infrastructure. ## Model Summary F1 Mask is a specialized fine-tune of [Gemma 3 270M IT](https://huggingface.co/google/gemma-3-270m-it). It is trained exclusively to output structured `replace_pii` function calls, effectively mapping sensitive data to safe tokens before they reach cloud-based LLMs. ## Quick Start ### 1. Register with Ollama ```bash # Direct SafeTensors registration ollama create micro-f1-mask --from arpacorp/micro-f1-mask # Run detection ollama run micro-f1-mask "John Doe called from 555-0123 about invoice GB29NWBK60161331926819." ``` ### 2. Python (Transformers) ```python from transformers import AutoModelForCausalLM, AutoTokenizer model = AutoModelForCausalLM.from_pretrained("arpacorp/micro-f1-mask") tokenizer = AutoTokenizer.from_pretrained("arpacorp/micro-f1-mask") prompt = """user You are Micro F1 Mask. Extract PII and output the 'replace_pii' function call. Draft an email to Jane Smith at jane@example.com. model """ inputs = tokenizer(prompt, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.0) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` ## Binary Mapping & Tokens The model uses a deterministic tokenization scheme: | Category | Token | |----------|-------| | INDIVIDUAL | [INDIVIDUAL_N] | | FINANCIAL | [FINANCIAL_N] | | LOCATION | [LOCATION_N] | | CONTACT | [CONTACT_N] | | ACCESS | [ACCESS_N] | | CORP | [CORP_N] | ### Example Output ```json { "name": "replace_pii", "arguments": { "entities": [ {"type": "INDIVIDUAL", "val": "Jane Smith", "id": "[INDIVIDUAL_1]"}, {"type": "CONTACT", "val": "jane@example.com", "id": "[CONTACT_1]"} ] } } ``` ## Training Methodology - **Dataset**: 1,000 synthetic samples generated via high-entropy LLM workflows. - **Method**: PEFT / LoRA (Rank 16, Alpha 32). - **Epochs**: 3. - **Accuracy**: 76.10% (token-level generation). - **Latency**: Sub-50ms (inference on RTX 2070). ## Production Optimization Roadmap While this repository provides a fully functional 1,000-sample prototype, reaching 95%+ enterprise accuracy requires the following architectural optimizations: ### 1. Hard-Negative Mining (Re-training) To push accuracy into the high 90s, the model must iteratively learn from its mistakes: 1. **Scale**: Use the synthetic generator to produce 10,000 - 50,000 highly diverse samples tailored to your industry. 2. **Evaluate**: Run an evaluation script to benchmark against samples of your real-world traffic. 3. **Mine Edge Cases**: Every time the model misses a PII token (a "hard negative"), extract that sentence structure, generate 500 synthetic variations of that specific edge-case, and re-run the fine-tuning pipeline. ### 2. Human-In-The-Loop (HITL) Workflows For mission-critical data, we recommend extending the middleware bridge to include human oversight: - **Pre-Cloud Quarantine (Maximum Security):** Modify the endpoint so that when F1 Mask detects PII, the API payload pauses. The application UI highlights the detected entities to the user. The user manually verifies the masking *before* the payload is authorized to hit the external cloud. - **Post-Reconstruction Review (Quality Control):** Allow the fully automated process to finish. Before the final reconstructed cloud response is saved or emailed, route it to an analyst dashboard where a human can manually verify the grammar of the reconstructed payload. ## Enterprise Solutions The public release of **ARPA F1 Mask** serves as a lightweight demonstration of how the *Function One (F1)* architecture can be fine-tuned for structured privacy enforcement. For mission-critical infrastructure, ARPA offers an actively maintained, highly robust enterprise tier. Organizations can deploy our gated version out-of-the-box and completely offload the burden of continuous maintenance, bespoke fine-tuning, concept drift avoidance, and rigorous scenario evaluations. For enterprise licensing or to discuss tailoring the F1 model to your proprietary data schemas, reach out to: **[input@arpacorp.net](mailto:input@arpacorp.net)** ## Ethical Considerations **Data Provenance**: No real PII was used in the training of this model. All examples were synthetically generated to mimic enterprise communication patterns. **Intended Use**: This model is designed for middleware. It is not intended to be used as a conversational assistant. It is a one-way security gate that focuses exclusively on privacy enforcement. ---
Built by [ARPA Hellenic Logical Systems](https://arpacorp.net)