336 lines
11 KiB
Markdown
336 lines
11 KiB
Markdown
|
|
---
|
||
|
|
license: apache-2.0
|
||
|
|
base_model: HuggingFaceTB/SmolLM2-360M-Instruct
|
||
|
|
tags:
|
||
|
|
- security
|
||
|
|
- log-analysis
|
||
|
|
- threat-detection
|
||
|
|
- nginx
|
||
|
|
- text-classification
|
||
|
|
- lora
|
||
|
|
- cpu
|
||
|
|
- llama-cpp
|
||
|
|
language:
|
||
|
|
- en
|
||
|
|
library_name: transformers
|
||
|
|
pipeline_tag: text-classification
|
||
|
|
datasets:
|
||
|
|
- nginx_security
|
||
|
|
metrics:
|
||
|
|
- accuracy
|
||
|
|
model-index:
|
||
|
|
- name: SecInt-SmolLM2-360M-nginx
|
||
|
|
results:
|
||
|
|
- task:
|
||
|
|
type: text-classification
|
||
|
|
name: Security Log Classification
|
||
|
|
metrics:
|
||
|
|
- type: accuracy
|
||
|
|
value: 99.0
|
||
|
|
name: Accuracy
|
||
|
|
---
|
||
|
|
|
||
|
|
# SecInt-SmolLM2-360M-nginx
|
||
|
|
|
||
|
|
**SecInt** (Security Intelligence Monitor) is a fine-tuned SmolLM2-360M model for real-time nginx security log classification. This is the first model in the SecInt series, designed to automatically detect security threats, errors, and normal traffic patterns in web server logs.
|
||
|
|
|
||
|
|
**There are 2 GGUF models, try version 04 its been trained on a lot more data.
|
||
|
|
|
||
|
|
## Model Overview
|
||
|
|
|
||
|
|
- **Base Model**: [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct)
|
||
|
|
- **Model Size**: 360M parameters (~691MB)
|
||
|
|
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
|
||
|
|
- **Task**: Multi-class text classification (3 classes)
|
||
|
|
- **Classes**: `hack`, `error`, `normal`
|
||
|
|
- **Inference**: CPU-optimized (~2GB RAM, 32 tokens/sec)
|
||
|
|
- **Format**: Safetensors + GGUF (llama.cpp compatible)
|
||
|
|
|
||
|
|
## Key Features
|
||
|
|
|
||
|
|
- **99%+ Accuracy** on production security logs
|
||
|
|
- **Real-time Detection**: ~100ms latency per classification
|
||
|
|
- **CPU Inference**: No GPU required, runs on any system
|
||
|
|
- **Production-Tested**: Battle-tested since October 2025, processing logs from 8 domains
|
||
|
|
- **Lightweight**: Only ~2GB RAM needed
|
||
|
|
- **Fast**: 32 tokens/second on CPU
|
||
|
|
|
||
|
|
## Quick Start
|
||
|
|
|
||
|
|
### Using Transformers
|
||
|
|
|
||
|
|
```python
|
||
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||
|
|
import torch
|
||
|
|
|
||
|
|
# Load model and tokenizer
|
||
|
|
model_name = "LeviDeHaan/SecInt-SmolLM2-360M-nginx"
|
||
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||
|
|
model = AutoModelForCausalLM.from_pretrained(model_name)
|
||
|
|
|
||
|
|
# Example log entry
|
||
|
|
log_entry = '192.168.1.100 - - [28/Oct/2025:12:34:56 +0000] "GET /.env HTTP/1.1" 404 162 "-" "curl/7.68.0"'
|
||
|
|
|
||
|
|
# System prompt with classification rules
|
||
|
|
system_prompt = """You are a security log analyzer. Classify the log entry as one of: hack, error, or normal.
|
||
|
|
|
||
|
|
HACK - Any of these patterns indicate an attack:
|
||
|
|
- Scanning for sensitive files: .env, .git, .php, config.php, wp-admin, phpmyadmin
|
||
|
|
- SQL injection attempts, XSS attempts
|
||
|
|
- Invalid login attempts, brute force, "invalid user", "failed password"
|
||
|
|
- Exploit attempts: /cgi-bin/, shell commands, malformed requests
|
||
|
|
- 403/404 errors with suspicious paths
|
||
|
|
- "access forbidden by rule" with .env, .git, admin, wp-, .php
|
||
|
|
- Scanner user-agents: sqlmap, nikto, zgrab, nuclei
|
||
|
|
- Webshell access attempts
|
||
|
|
|
||
|
|
ERROR - Application errors:
|
||
|
|
- 500 errors, crashes, exceptions
|
||
|
|
- SSL/TLS errors
|
||
|
|
- Database connection failures
|
||
|
|
- [emerg], [alert], [crit], [error] log levels
|
||
|
|
|
||
|
|
NORMAL - Everything else:
|
||
|
|
- 200/304 responses to legitimate paths
|
||
|
|
- Regular API calls, static files
|
||
|
|
- Known good bots: googlebot, facebookbot
|
||
|
|
|
||
|
|
Respond with only one word: hack, error, or normal."""
|
||
|
|
|
||
|
|
# Format prompt using chat template
|
||
|
|
messages = [
|
||
|
|
{"role": "system", "content": system_prompt},
|
||
|
|
{"role": "user", "content": f"Classify this log entry as hack, error, or normal.\n\n{log_entry}"}
|
||
|
|
]
|
||
|
|
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
||
|
|
|
||
|
|
# Generate classification
|
||
|
|
inputs = tokenizer(prompt, return_tensors="pt")
|
||
|
|
with torch.no_grad():
|
||
|
|
outputs = model.generate(
|
||
|
|
**inputs,
|
||
|
|
max_new_tokens=10,
|
||
|
|
temperature=0.01,
|
||
|
|
top_p=0.38,
|
||
|
|
top_k=10,
|
||
|
|
do_sample=True,
|
||
|
|
pad_token_id=tokenizer.eos_token_id
|
||
|
|
)
|
||
|
|
|
||
|
|
# Extract result
|
||
|
|
result = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip()
|
||
|
|
print(f"Classification: {result}") # Output: hack
|
||
|
|
```
|
||
|
|
|
||
|
|
### Using llama.cpp
|
||
|
|
|
||
|
|
The model includes a GGUF file for efficient CPU inference:
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Download the GGUF model
|
||
|
|
huggingface-cli download LeviDeHaan/SecInt-SmolLM2-360M-nginx smollm-security-nginx02-merged.gguf
|
||
|
|
|
||
|
|
# Run inference with llama.cpp
|
||
|
|
./llama-cli -m smollm-security-nginx02-merged.gguf \
|
||
|
|
--temp 0.01 \
|
||
|
|
--top-p 0.38 \
|
||
|
|
--top-k 10 \
|
||
|
|
--seed 42 \
|
||
|
|
-p "<|im_start|>system\nYou are a security log analyzer...<|im_end|>\n<|im_start|>user\nClassify this log entry...<|im_end|>\n<|im_start|>assistant\n"
|
||
|
|
```
|
||
|
|
|
||
|
|
## Training Details
|
||
|
|
|
||
|
|
### Dataset
|
||
|
|
|
||
|
|
- **Source**: Real production nginx logs from 8 domains
|
||
|
|
- **Total Examples**: 1,646 labeled samples
|
||
|
|
- **Class Distribution**:
|
||
|
|
- `hack`: 800 examples (48.6%) - SQL injection, path traversal, scanner activity, exploit attempts
|
||
|
|
- `error`: 46 examples (2.8%) - 500 errors, SSL failures, application crashes
|
||
|
|
- `normal`: 800 examples (48.6%) - Legitimate traffic, API calls, static file requests
|
||
|
|
|
||
|
|
### LoRA Configuration
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
LoRA Rank (r): 8
|
||
|
|
LoRA Alpha: 16
|
||
|
|
LoRA Dropout: 0.05
|
||
|
|
Target Modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj
|
||
|
|
RSLoRA: enabled
|
||
|
|
```
|
||
|
|
|
||
|
|
### Training Hyperparameters
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
Learning Rate: 2e-05
|
||
|
|
Scheduler: cosine_with_restarts
|
||
|
|
Warmup Steps: 5
|
||
|
|
Batch Size: 10 per device
|
||
|
|
Gradient Accumulation: 8 steps
|
||
|
|
Effective Batch Size: 80
|
||
|
|
Epochs: 10
|
||
|
|
Max Sequence Length: 2048 tokens
|
||
|
|
Optimizer: AdamW (betas=0.9,0.999, eps=1e-08)
|
||
|
|
Seed: 42
|
||
|
|
```
|
||
|
|
|
||
|
|
### Training Results
|
||
|
|
|
||
|
|
- **Training Duration**: ~50 minutes (210 steps)
|
||
|
|
- **Final Loss**: 0.2575
|
||
|
|
- **Throughput**: 3,121 tokens/second
|
||
|
|
- **Total Tokens**: 9.29M
|
||
|
|
- **Hardware**: CPU training (no GPU required)
|
||
|
|
|
||
|
|
## Use Cases
|
||
|
|
|
||
|
|
### Real-time Web Server Security Monitoring
|
||
|
|
|
||
|
|
SecInt is designed for integration into security monitoring systems to provide automated threat detection:
|
||
|
|
|
||
|
|
1. **Log Ingestion**: Monitor nginx access/error logs
|
||
|
|
2. **Classification**: Identify attacks, errors, and normal traffic
|
||
|
|
3. **Alerting**: Trigger notifications for security threats
|
||
|
|
4. **Analytics**: Track attack patterns and trends
|
||
|
|
5. **Response**: Feed into incident response workflows
|
||
|
|
|
||
|
|
### Typical Integration Architecture
|
||
|
|
|
||
|
|
```
|
||
|
|
nginx logs → Log Parser → SecInt Classifier → Alert System
|
||
|
|
↓
|
||
|
|
Database Storage → Dashboard
|
||
|
|
```
|
||
|
|
|
||
|
|
### Detection Capabilities
|
||
|
|
|
||
|
|
The model can identify:
|
||
|
|
|
||
|
|
**Attack Patterns (hack)**:
|
||
|
|
- File/directory scanning (`.env`, `.git`, `config.php`, `wp-admin`, `phpmyadmin`)
|
||
|
|
- SQL injection (`UNION SELECT`, `OR 1=1`, etc.)
|
||
|
|
- Cross-site scripting (XSS) attempts
|
||
|
|
- Path traversal (`../../../`)
|
||
|
|
- Command injection attempts
|
||
|
|
- Known exploit attempts (PHPUnit RCE, ThinkPHP, etc.)
|
||
|
|
- Webshell access (c99, r57, alfa, wso)
|
||
|
|
- Scanner signatures (sqlmap, nikto, zgrab, nuclei)
|
||
|
|
- Brute force attacks (failed passwords, invalid users)
|
||
|
|
- Request obfuscation (null bytes, encoding tricks)
|
||
|
|
|
||
|
|
**Application Errors (error)**:
|
||
|
|
- HTTP 500 errors
|
||
|
|
- SSL/TLS handshake failures
|
||
|
|
- Application crashes and exceptions
|
||
|
|
- Database connection errors
|
||
|
|
- Critical log levels ([emerg], [alert], [crit])
|
||
|
|
|
||
|
|
**Normal Traffic (normal)**:
|
||
|
|
- HTTP 200/304 responses to legitimate paths
|
||
|
|
- API endpoints and authenticated requests
|
||
|
|
- Static file serving (CSS, JS, images)
|
||
|
|
- Known good bots (Googlebot, etc.)
|
||
|
|
|
||
|
|
## Performance Metrics
|
||
|
|
|
||
|
|
### Optimization Features
|
||
|
|
|
||
|
|
When deployed in the full SecInt system:
|
||
|
|
- **Intelligent Caching**: 95%+ cache hit rate reduces redundant LLM calls
|
||
|
|
- **Session Tracking**: Sampling mode after 50 requests from same IP
|
||
|
|
- **Whitelist Support**: Known-good traffic bypasses classification
|
||
|
|
- **Batch Processing**: Groups requests for efficient processing
|
||
|
|
|
||
|
|
## Recommended Inference Settings
|
||
|
|
|
||
|
|
For optimal security classification results:
|
||
|
|
|
||
|
|
```python
|
||
|
|
temperature = 0.01 # Very deterministic
|
||
|
|
max_tokens = 1024 # Classification is short
|
||
|
|
top_k = 10 # Limit vocabulary
|
||
|
|
top_p = 0.38 # Nucleus sampling
|
||
|
|
seed = 42 # Fixed for consistency
|
||
|
|
```
|
||
|
|
|
||
|
|
These settings ensure consistent, deterministic classification suitable for production security monitoring.
|
||
|
|
|
||
|
|
## Prompt Template
|
||
|
|
|
||
|
|
The model requires the SmolLM2 chat template format. **Critical**: Use the exact system prompt shown in the Quick Start section for best results. The system prompt contains:
|
||
|
|
|
||
|
|
1. Clear task definition
|
||
|
|
2. Detailed attack pattern definitions (HACK class)
|
||
|
|
3. Error pattern definitions (ERROR class)
|
||
|
|
4. Normal traffic definitions (NORMAL class)
|
||
|
|
5. Instruction to respond with single word only
|
||
|
|
|
||
|
|
Deviation from this prompt format may significantly reduce accuracy.
|
||
|
|
|
||
|
|
## Limitations
|
||
|
|
|
||
|
|
- **nginx-Specific**: Trained exclusively on nginx log format; may require fine-tuning for Apache, IIS, or other web servers
|
||
|
|
- **Prompt-Dependent**: Requires exact prompt template for optimal performance
|
||
|
|
- **CPU Inference**: Optimized for CPU; no GPU-specific optimizations
|
||
|
|
- **English Only**: Trained on English-language logs
|
||
|
|
- **Context Length**: Limited to 2048 tokens per log entry
|
||
|
|
- **No Multi-log Context**: Classifies individual log entries; does not correlate across multiple logs
|
||
|
|
|
||
|
|
## Model Architecture
|
||
|
|
|
||
|
|
Built on SmolLM2-360M-Instruct, a decoder-only transformer model optimized for instruction following:
|
||
|
|
|
||
|
|
- **Parameters**: 360M
|
||
|
|
- **Architecture**: Transformer decoder with grouped-query attention
|
||
|
|
- **Context Length**: 2048 tokens
|
||
|
|
- **Vocabulary Size**: 49,152 tokens
|
||
|
|
- **Base Training**: Pre-trained on diverse text corpus, instruction-tuned
|
||
|
|
|
||
|
|
LoRA fine-tuning targets all attention and MLP projection layers for maximum adaptation to security log classification while maintaining base model knowledge.
|
||
|
|
|
||
|
|
## Citation
|
||
|
|
|
||
|
|
If you use this model in your research or production systems, please cite:
|
||
|
|
|
||
|
|
```bibtex
|
||
|
|
@misc{secint-smollm2-nginx,
|
||
|
|
author = {Levi DeHaan},
|
||
|
|
title = {SecInt: SmolLM2-360M Fine-tuned for nginx Security Log Classification},
|
||
|
|
year = {2025},
|
||
|
|
publisher = {Hugging Face},
|
||
|
|
howpublished = {\url{https://huggingface.co/LeviDeHaan/SecInt-SmolLM2-360M-nginx}}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
## Acknowledgments
|
||
|
|
|
||
|
|
- **HuggingFace** for the SmolLM2-360M-Instruct base model
|
||
|
|
- **llama.cpp** team for efficient CPU inference capabilities
|
||
|
|
- **LLaMA-Factory** for streamlined LoRA fine-tuning framework
|
||
|
|
|
||
|
|
## License
|
||
|
|
|
||
|
|
This model is released under Apache 2.0 license, consistent with the base SmolLM2 model. You are free to use, modify, and distribute this model for commercial and non-commercial purposes.
|
||
|
|
|
||
|
|
## Project
|
||
|
|
|
||
|
|
SecInt is part of the **Security Intelligence Monitor v2** project, a comprehensive real-time security monitoring system for web servers. The full system includes:
|
||
|
|
|
||
|
|
- Multi-format log ingestion (nginx, Apache, custom)
|
||
|
|
- AI-powered threat classification
|
||
|
|
- Threat intelligence enrichment (GeoIP, Shodan)
|
||
|
|
- Breach detection (7+ detection rules)
|
||
|
|
- Real-time alerting (Pushover, email, webhooks)
|
||
|
|
- Interactive dashboard (Streamlit)
|
||
|
|
- Attack session management
|
||
|
|
- SQLite-based persistence and analytics
|
||
|
|
|
||
|
|
For more information about the full SecInt system, visit: [logwatcher project](https://levidehaan.com/projects)
|
||
|
|
|
||
|
|
## Model Card Contact
|
||
|
|
|
||
|
|
For questions, issues, or collaboration opportunities:
|
||
|
|
- **Hugging Face**: [@LeviDeHaan](https://huggingface.co/LeviDeHaan)
|
||
|
|
- **Model Repository**: [SecInt-SmolLM2-360M-nginx](https://huggingface.co/LeviDeHaan/SecInt-SmolLM2-360M-nginx)
|