Initialize project; model provided by the ModelHub XC community

Model: LeviDeHaan/SecInt-SmolLM2-360M-nginx Source: Original Platform
2026-05-11 13:04:17 +08:00
commit b9a2daf84c
14 changed files with 294489 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,335 @@
+---
+license: apache-2.0
+base_model: HuggingFaceTB/SmolLM2-360M-Instruct
+tags:
+- security
+- log-analysis
+- threat-detection
+- nginx
+- text-classification
+- lora
+- cpu
+- llama-cpp
+language:
+- en
+library_name: transformers
+pipeline_tag: text-classification
+datasets:
+- nginx_security
+metrics:
+- accuracy
+model-index:
+- name: SecInt-SmolLM2-360M-nginx
+  results:
+  - task:
+      type: text-classification
+      name: Security Log Classification
+    metrics:
+    - type: accuracy
+      value: 99.0
+      name: Accuracy
+---
+
+# SecInt-SmolLM2-360M-nginx
+
+**SecInt** (Security Intelligence Monitor) is a fine-tuned SmolLM2-360M model for real-time nginx security log classification. This is the first model in the SecInt series, designed to automatically detect security threats, errors, and normal traffic patterns in web server logs.
+
+**Note:** There are two GGUF models. Try version 04; it has been trained on much more data.
+
+## Model Overview
+
+- **Base Model**: [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct)
+- **Model Size**: 360M parameters (~691MB)
+- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
+- **Task**: Multi-class text classification (3 classes)
+- **Classes**: `hack`, `error`, `normal`
+- **Inference**: CPU-optimized (~2GB RAM, 32 tokens/sec)
+- **Format**: Safetensors + GGUF (llama.cpp compatible)
+
+## Key Features
+
+- **99%+ Accuracy** on production security logs
+- **Real-time Detection**: ~100ms latency per classification
+- **CPU Inference**: No GPU required, runs on any system
+- **Production-Tested**: Battle-tested since October 2025, processing logs from 8 domains
+- **Lightweight**: Only ~2GB RAM needed
+- **Fast**: 32 tokens/second on CPU
+
+## Quick Start
+
+### Using Transformers
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+# Load model and tokenizer
+model_name = "LeviDeHaan/SecInt-SmolLM2-360M-nginx"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name)
+
+# Example log entry
+log_entry = '192.168.1.100 - - [28/Oct/2025:12:34:56 +0000] "GET /.env HTTP/1.1" 404 162 "-" "curl/7.68.0"'
+
+# System prompt with classification rules
+system_prompt = """You are a security log analyzer. Classify the log entry as one of: hack, error, or normal.
+
+HACK - Any of these patterns indicate an attack:
+- Scanning for sensitive files: .env, .git, .php, config.php, wp-admin, phpmyadmin
+- SQL injection attempts, XSS attempts
+- Invalid login attempts, brute force, "invalid user", "failed password"
+- Exploit attempts: /cgi-bin/, shell commands, malformed requests
+- 403/404 errors with suspicious paths
+- "access forbidden by rule" with .env, .git, admin, wp-, .php
+- Scanner user-agents: sqlmap, nikto, zgrab, nuclei
+- Webshell access attempts
+
+ERROR - Application errors:
+- 500 errors, crashes, exceptions
+- SSL/TLS errors
+- Database connection failures
+- [emerg], [alert], [crit], [error] log levels
+
+NORMAL - Everything else:
+- 200/304 responses to legitimate paths
+- Regular API calls, static files
+- Known good bots: googlebot, facebookbot
+
+Respond with only one word: hack, error, or normal."""
+
+# Format prompt using chat template
+messages = [
+    {"role": "system", "content": system_prompt},
+    {"role": "user", "content": f"Classify this log entry as hack, error, or normal.\n\n{log_entry}"}
+]
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+# Generate classification
+inputs = tokenizer(prompt, return_tensors="pt")
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=10,
+        temperature=0.01,
+        top_p=0.38,
+        top_k=10,
+        do_sample=True,
+        pad_token_id=tokenizer.eos_token_id
+    )
+
+# Extract result
+result = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip()
+print(f"Classification: {result}")  # Expected for this log entry: hack
+```
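+
+A generative model can occasionally wrap the label in extra whitespace, punctuation, or capitalization. The helper below is a minimal sketch, not part of the original SecInt code, that maps raw output to one of the three classes. The function name `normalize_label` and the `unknown` fallback are assumptions made for illustration.
+
+```python
+import re
+
+VALID_LABELS = ("hack", "error", "normal")
+
+
+def normalize_label(raw: str, fallback: str = "unknown") -> str:
+    """Return the first valid class label found in the model output."""
+    # Lowercase and split into alphabetic words so "Hack." or " normal\n" still match.
+    for word in re.findall(r"[a-z]+", raw.lower()):
+        if word in VALID_LABELS:
+            return word
+    # No recognizable label: let the caller decide how to handle it.
+    return fallback
+```
+
+Calling `normalize_label(result)` on the decoded text from the example above keeps downstream code from branching on free-form strings.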
+
+### Using llama.cpp
+
+The repository includes GGUF files for efficient CPU inference. The example below uses the `nginx02` file:
+
+```bash
+# Download the GGUF model
+huggingface-cli download LeviDeHaan/SecInt-SmolLM2-360M-nginx smollm-security-nginx02-merged.gguf
+
+# Run inference with llama.cpp
+./llama-cli -m smollm-security-nginx02-merged.gguf \
+  --temp 0.01 \
+  --top-p 0.38 \
+  --top-k 10 \
+  --seed 42 \
+  -p "<|im_start|>system\nYou are a security log analyzer...<|im_end|>\n<|im_start|>user\nClassify this log entry...<|im_end|>\n<|im_start|>assistant\n"
+```
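+
+The `-p` argument above follows the ChatML-style layout used by the SmolLM2 chat template. As a sketch, the same string can be assembled in Python before it is handed to llama.cpp. `build_chatml_prompt` is a hypothetical helper name, and the layout is copied from the command above rather than from the tokenizer's template file.
+
+```python
+def build_chatml_prompt(system_prompt: str, log_entry: str) -> str:
+    """Assemble a ChatML-style prompt matching the llama.cpp example."""
+    user_message = (
+        "Classify this log entry as hack, error, or normal.\n\n" + log_entry
+    )
+    return (
+        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
+        f"<|im_start|>user\n{user_message}<|im_end|>\n"
+        # Leave the assistant turn open so the model writes the label next.
+        "<|im_start|>assistant\n"
+    )
+```
+
+Building the prompt with real newline characters avoids relying on the shell or the CLI to expand `\n` escapes.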
+
+## Training Details
+
+### Dataset
+
+- **Source**: Real production nginx logs from 8 domains
+- **Total Examples**: 1,646 labeled samples
+- **Class Distribution**:
+  - `hack`: 800 examples (48.6%) - SQL injection, path traversal, scanner activity, exploit attempts
+  - `error`: 46 examples (2.8%) - 500 errors, SSL failures, application crashes
+  - `normal`: 800 examples (48.6%) - Legitimate traffic, API calls, static file requests
+
+### LoRA Configuration
+
+```yaml
+LoRA Rank (r): 8
+LoRA Alpha: 16
+LoRA Dropout: 0.05
+Target Modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj
+RSLoRA: enabled
+```
+
+### Training Hyperparameters
+
+```yaml
+Learning Rate: 2e-05
+Scheduler: cosine_with_restarts
+Warmup Steps: 5
+Batch Size: 10 per device
+Gradient Accumulation: 8 steps
+Effective Batch Size: 80
+Epochs: 10
+Max Sequence Length: 2048 tokens
+Optimizer: AdamW (betas=0.9,0.999, eps=1e-08)
+Seed: 42
+```
+
+### Training Results
+
+- **Training Duration**: ~50 minutes (210 steps)
+- **Final Loss**: 0.2575
+- **Throughput**: 3,121 tokens/second
+- **Total Tokens**: 9.29M
+- **Hardware**: CPU training (no GPU required)
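+
+The reported figures are consistent with each other. The sketch below recomputes them from the numbers in this card, assuming that all 1,646 samples were used for training and that the last partial batch of each epoch counts as a full optimizer step. Neither assumption is stated in the source.
+
+```python
+import math
+
+samples = 1646
+per_device_batch = 10
+grad_accumulation = 8
+epochs = 10
+total_tokens = 9.29e6
+tokens_per_second = 3121
+
+# Effective batch size is the per-device batch times the accumulation steps.
+effective_batch = per_device_batch * grad_accumulation
+# Assumption: a partial final batch still produces one optimizer step.
+steps_per_epoch = math.ceil(samples / effective_batch)
+total_steps = steps_per_epoch * epochs
+# Duration implied by the token count and throughput, in minutes.
+duration_minutes = total_tokens / tokens_per_second / 60
+
+print(effective_batch, steps_per_epoch, total_steps, round(duration_minutes, 1))
+```
+
+Under these assumptions the step count matches the reported 210 steps, and the implied duration is just under 50 minutes.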
+
+## Use Cases
+
+### Real-time Web Server Security Monitoring
+
+SecInt is designed for integration into security monitoring systems to provide automated threat detection:
+
+1. **Log Ingestion**: Monitor nginx access/error logs
+2. **Classification**: Identify attacks, errors, and normal traffic
+3. **Alerting**: Trigger notifications for security threats
+4. **Analytics**: Track attack patterns and trends
+5. **Response**: Feed into incident response workflows
+
+### Typical Integration Architecture
+
+```
+nginx logs → Log Parser → SecInt Classifier → Alert System
+                              ↓
+                         Database Storage → Dashboard
+```
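+
+The flow in the diagram can be sketched as follows. This is an illustrative outline only, not the SecInt implementation: the regular expression assumes the nginx `combined` access log format, `classify` is a stand-in for a call to the model, and the `alerts` list stands in for a real alert channel.
+
+```python
+import re
+from typing import Callable, Optional
+
+# Pattern for the nginx "combined" access log format (assumed log layout).
+LOG_PATTERN = re.compile(
+    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
+    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
+    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
+)
+
+
+def parse_line(line: str) -> Optional[dict]:
+    """Parse one access log line; return None if it does not match."""
+    match = LOG_PATTERN.match(line)
+    return match.groupdict() if match else None
+
+
+def process_line(
+    line: str, classify: Callable[[str], str], alerts: list
+) -> Optional[dict]:
+    """Run one line through parse -> classify -> alert and return the record."""
+    record = parse_line(line)
+    if record is None:
+        return None  # unparsed lines (e.g. error log format) are skipped here
+    record["label"] = classify(line)
+    if record["label"] == "hack":
+        alerts.append(record)  # stand-in for a notification call
+    return record
+```
+
+The returned record is what a storage layer would persist for the dashboard; error log lines use a different layout and would need their own parser.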
+
+### Detection Capabilities
+
+The model can identify:
+
+**Attack Patterns (hack)**:
+- File/directory scanning (`.env`, `.git`, `config.php`, `wp-admin`, `phpmyadmin`)
+- SQL injection (`UNION SELECT`, `OR 1=1`, etc.)
+- Cross-site scripting (XSS) attempts
+- Path traversal (`../../../`)
+- Command injection attempts
+- Known exploit attempts (PHPUnit RCE, ThinkPHP, etc.)
+- Webshell access (c99, r57, alfa, wso)
+- Scanner signatures (sqlmap, nikto, zgrab, nuclei)
+- Brute force attacks (failed passwords, invalid users)
+- Request obfuscation (null bytes, encoding tricks)
+
+**Application Errors (error)**:
+- HTTP 500 errors
+- SSL/TLS handshake failures
+- Application crashes and exceptions
+- Database connection errors
+- Critical log levels ([emerg], [alert], [crit])
+
+**Normal Traffic (normal)**:
+- HTTP 200/304 responses to legitimate paths
+- API endpoints and authenticated requests
+- Static file serving (CSS, JS, images)
+- Known good bots (Googlebot, etc.)
+
+## Performance Metrics
+
+### Optimization Features
+
+When deployed in the full SecInt system:
+- **Intelligent Caching**: 95%+ cache hit rate reduces redundant LLM calls
+- **Session Tracking**: Sampling mode after 50 requests from same IP
+- **Whitelist Support**: Known-good traffic bypasses classification
+- **Batch Processing**: Groups requests for efficient processing
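+
+The source does not show how the full system implements these features. As a rough sketch of the first two, the wrapper below caches labels by request signature and switches to sampling once an IP has sent more than 50 requests. The class name, the `signature` argument (for example, request line plus status code), and the 1-in-10 sampling rate are assumptions for illustration.
+
+```python
+from collections import Counter
+from typing import Callable, Dict, Optional
+
+
+class CachedClassifier:
+    """Wrap a classifier with a result cache and per-IP sampling."""
+
+    def __init__(
+        self,
+        classify: Callable[[str], str],
+        sample_after: int = 50,
+        sample_every: int = 10,
+    ):
+        self.classify = classify
+        self.sample_after = sample_after  # threshold from the list above
+        self.sample_every = sample_every  # assumed sampling rate
+        self.cache: Dict[str, str] = {}
+        self.seen: Counter = Counter()
+        self.llm_calls = 0
+
+    def label(self, ip: str, signature: str) -> Optional[str]:
+        """Return a label, or None when sampling skips the request."""
+        self.seen[ip] += 1
+        count = self.seen[ip]
+        # Cache hit: reuse the earlier label without calling the model.
+        if signature in self.cache:
+            return self.cache[signature]
+        # Past the threshold, only every Nth uncached request is classified.
+        if count > self.sample_after and count % self.sample_every != 0:
+            return None
+        self.llm_calls += 1
+        self.cache[signature] = self.classify(signature)
+        return self.cache[signature]
+```
+
+Keying the cache on a signature that excludes the IP and timestamp is what lets repeated scanner requests share one model call.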
+
+## Recommended Inference Settings
+
+For optimal security classification results:
+
+```python
+temperature = 0.01      # Near-greedy sampling for stable output
+max_tokens = 1024       # Upper bound only; the expected answer is one word
+top_k = 10              # Limit vocabulary
+top_p = 0.38            # Nucleus sampling
+seed = 42               # Fixed for consistency
+```
+
+These settings give consistent, near-deterministic classification suitable for production security monitoring.
+
+## Prompt Template
+
+The model requires the SmolLM2 chat template format. **Critical**: Use the exact system prompt shown in the Quick Start section for best results. The system prompt contains:
+
+1. Clear task definition
+2. Detailed attack pattern definitions (HACK class)
+3. Error pattern definitions (ERROR class)
+4. Normal traffic definitions (NORMAL class)
+5. Instruction to respond with a single word only
+
+Deviation from this prompt format may significantly reduce accuracy.
+
+## Limitations
+
+- **nginx-Specific**: Trained exclusively on nginx log format; may require fine-tuning for Apache, IIS, or other web servers
+- **Prompt-Dependent**: Requires exact prompt template for optimal performance
+- **CPU Inference**: Optimized for CPU; no GPU-specific optimizations
+- **English Only**: Trained on English-language logs
+- **Context Length**: Limited to 2048 tokens per prompt, which covers the system prompt plus the log entry
+- **No Multi-log Context**: Classifies individual log entries; does not correlate across multiple logs
+
+## Model Architecture
+
+Built on SmolLM2-360M-Instruct, a decoder-only transformer model optimized for instruction following:
+
+- **Parameters**: 360M
+- **Architecture**: Transformer decoder with grouped-query attention
+- **Context Length**: 2048 tokens
+- **Vocabulary Size**: 49,152 tokens
+- **Base Training**: Pre-trained on diverse text corpus, instruction-tuned
+
+LoRA fine-tuning targets all attention and MLP projection layers for maximum adaptation to security log classification while maintaining base model knowledge.
+
+## Citation
+
+If you use this model in your research or production systems, please cite:
+
+```bibtex
+@misc{secint-smollm2-nginx,
+  author = {Levi DeHaan},
+  title = {SecInt: SmolLM2-360M Fine-tuned for nginx Security Log Classification},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/LeviDeHaan/SecInt-SmolLM2-360M-nginx}}
+}
+```
+
+## Acknowledgments
+
+- **HuggingFace** for the SmolLM2-360M-Instruct base model
+- **llama.cpp** team for efficient CPU inference capabilities
+- **LLaMA-Factory** for streamlined LoRA fine-tuning framework
+
+## License
+
+This model is released under the Apache 2.0 license, consistent with the base SmolLM2 model. You are free to use, modify, and distribute this model for commercial and non-commercial purposes.
+
+## Project
+
+SecInt is part of the **Security Intelligence Monitor v2** project, a comprehensive real-time security monitoring system for web servers. The full system includes:
+
+- Multi-format log ingestion (nginx, Apache, custom)
+- AI-powered threat classification
+- Threat intelligence enrichment (GeoIP, Shodan)
+- Breach detection (7+ detection rules)
+- Real-time alerting (Pushover, email, webhooks)
+- Interactive dashboard (Streamlit)
+- Attack session management
+- SQLite-based persistence and analytics
+
+For more information about the full SecInt system, visit: [logwatcher project](https://levidehaan.com/projects)
+
+## Model Card Contact
+
+For questions, issues, or collaboration opportunities:
+- **Hugging Face**: [@LeviDeHaan](https://huggingface.co/LeviDeHaan)
+- **Model Repository**: [SecInt-SmolLM2-360M-nginx](https://huggingface.co/LeviDeHaan/SecInt-SmolLM2-360M-nginx)