DLM-NL2JSON-4B/eval/results.md
Model: dataslab/DLM-NL2JSON-4B
2026-05-04 04:44:49 +08:00

Evaluation Results — DLM-NL2JSON-4B vs Baselines

Test Configuration

  • Test set: task_analysis_sft_251128_test.jsonl (2,041 samples, 10 categories)
  • Metric: Field-level exact match accuracy (summary field excluded)
  • Note: 64 CSM samples with known gold-label noise are excluded from the adjusted metrics (see Notes)
  • Train/Test overlap: 16/2,041 (0.78%) — retained for consistency across models
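
The metric above can be sketched as follows. This is a minimal illustration, not the evaluation script used for these results: it assumes each model output is a JSON object, that a sample counts as correct only when every field other than `summary` matches the gold object exactly, and that unparseable output counts as incorrect. The function names are hypothetical.

```python
import json

EXCLUDED_FIELDS = {"summary"}  # the report excludes the summary field


def fields_match(pred: dict, gold: dict, excluded=EXCLUDED_FIELDS) -> bool:
    """True when every non-excluded field is identical in pred and gold."""
    keys = (set(pred) | set(gold)) - set(excluded)
    # A sentinel distinguishes a missing key from an explicit null value.
    missing = object()
    return all(pred.get(k, missing) == gold.get(k, missing) for k in keys)


def exact_match_accuracy(pred_texts, gold_objects) -> float:
    """Fraction of samples whose parsed prediction matches the gold object."""
    correct = 0
    for text, gold in zip(pred_texts, gold_objects):
        try:
            pred = json.loads(text)
        except json.JSONDecodeError:
            continue  # assumption: unparseable output is scored as incorrect
        if isinstance(pred, dict) and fields_match(pred, gold):
            correct += 1
    return correct / len(gold_objects)
```

Under these assumptions a prediction that differs from the gold object only in `summary` still counts as correct, while a missing or extra field counts as a mismatch.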

Per-Category Accuracy

| Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B-A3B |
|---|---:|---:|---:|---:|
| ALP-A (pattern) | 250 | 99.6% | 56.0% | 47.6% |
| ALP-B (flow) | 250 | 98.4% | 50.4% | 46.8% |
| CSM (consumption) | 700 | 90.6% | 90.1% | 86.1% |
| CREDIT-Income | 58 | 94.8% | 53.4% | 34.5% |
| CREDIT-Spending | 77 | 97.4% | 92.2% | 51.9% |
| CREDIT-Loan/Default | 73 | 98.6% | 94.5% | 72.6% |
| CPI (business) | 219 | 86.3% | 87.2% | 54.8% |
| GIS-Inflow | 72 | 97.2% | 79.2% | 93.1% |
| GIS-Outflow | 62 | 98.4% | 77.4% | 98.4% |
| GIS-Consumption | 280 | 98.2% | 99.6% | 97.5% |
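
A per-category win/tie/loss tally can be derived directly from the accuracies in the table. The sketch below compares DLM-NL2JSON-4B against the better of the two baselines in each category. The values are copied from the table, and the comparison runs on the rounded percentages, so a tie here means a tie at one decimal place.

```python
# Accuracy (%) per category: (DLM-NL2JSON-4B, GPT-4o, Qwen3.5-35B-A3B)
RESULTS = {
    "ALP-A": (99.6, 56.0, 47.6),
    "ALP-B": (98.4, 50.4, 46.8),
    "CSM": (90.6, 90.1, 86.1),
    "CREDIT-Income": (94.8, 53.4, 34.5),
    "CREDIT-Spending": (97.4, 92.2, 51.9),
    "CREDIT-Loan/Default": (98.6, 94.5, 72.6),
    "CPI": (86.3, 87.2, 54.8),
    "GIS-Inflow": (97.2, 79.2, 93.1),
    "GIS-Outflow": (98.4, 77.4, 98.4),
    "GIS-Consumption": (98.2, 99.6, 97.5),
}


def tally(results):
    """Classify each category as a win, tie, or loss for the first model."""
    outcome = {"win": [], "tie": [], "loss": []}
    for category, (ours, *baselines) in results.items():
        best_baseline = max(baselines)
        if ours > best_baseline:
            outcome["win"].append(category)
        elif ours == best_baseline:
            outcome["tie"].append(category)
        else:
            outcome["loss"].append(category)
    return outcome


outcome = tally(RESULTS)
for label, categories in outcome.items():
    print(f"{label}: {len(categories)} ({', '.join(categories)})")
```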

Overall (Raw)

| Model | Params | Accuracy | Avg Latency |
|---|---|---:|---:|
| DLM-NL2JSON-4B | 4B | 94.4% (1926/2041) | 2.59s |
| GPT-4o | ~200B+ | 80.5% (1643/2041) | 1.58s |
| Qwen3.5-35B-A3B | 35B (3B active) | 72.2% (1473/2041) | 0.85s |
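
The raw overall figures are sample-weighted (micro) averages: the per-category correct counts sum to the totals shown. As a consistency check, the sketch below recovers each category's correct count by rounding `N × accuracy`, sums the counts, and prints the unweighted (macro) mean of the category accuracies for contrast. Recovering counts by rounding is an assumption of this sketch; the report does not list per-category counts.

```python
# Per-category sample counts and accuracies (%), copied from the per-category table.
N = [250, 250, 700, 58, 77, 73, 219, 72, 62, 280]
ACC = {
    "DLM-NL2JSON-4B": [99.6, 98.4, 90.6, 94.8, 97.4, 98.6, 86.3, 97.2, 98.4, 98.2],
    "GPT-4o": [56.0, 50.4, 90.1, 53.4, 92.2, 94.5, 87.2, 79.2, 77.4, 99.6],
    "Qwen3.5-35B-A3B": [47.6, 46.8, 86.1, 34.5, 51.9, 72.6, 54.8, 93.1, 98.4, 97.5],
}


def micro_correct(accs, counts):
    """Recover per-category correct counts by rounding, then sum them."""
    return sum(round(n * a / 100) for n, a in zip(counts, accs))


def macro_average(accs):
    """Unweighted mean of the category accuracies, for contrast."""
    return sum(accs) / len(accs)


total = sum(N)
for model, accs in ACC.items():
    correct = micro_correct(accs, N)
    print(f"{model}: {correct}/{total} = {100 * correct / total:.1f}% "
          f"(macro average {macro_average(accs):.1f}%)")
```

The recovered totals match the reported 1926, 1643, and 1473 correct samples out of 2,041.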

Overall (Adjusted — 64 CSM gold noise samples excluded)

| Model | Accuracy | N |
|---|---:|---:|
| DLM-NL2JSON-4B | 96.8% (1914/1977) | 1977 |
| GPT-4o | 82.5% (1631/1977) | 1977 |
| Qwen3.5-35B-A3B | 73.9% (1461/1977) | 1977 |
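
The adjusted figures follow from the raw ones by removing the 64 noisy CSM samples from the denominator (2,041 − 64 = 1,977) and subtracting whichever of those samples had been scored correct. The short check below reproduces the arithmetic from the counts reported in the two overall tables. The variable and function names are illustrative.

```python
RAW_N, NOISY_N = 2041, 64
RAW_CORRECT = {"DLM-NL2JSON-4B": 1926, "GPT-4o": 1643, "Qwen3.5-35B-A3B": 1473}
ADJ_CORRECT = {"DLM-NL2JSON-4B": 1914, "GPT-4o": 1631, "Qwen3.5-35B-A3B": 1461}


def pct(correct: int, total: int) -> float:
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


adjusted_n = RAW_N - NOISY_N
for model in RAW_CORRECT:
    # Excluded samples that had been counted as correct in the raw score.
    dropped_correct = RAW_CORRECT[model] - ADJ_CORRECT[model]
    print(f"{model}: raw {pct(RAW_CORRECT[model], RAW_N)}%, "
          f"adjusted {pct(ADJ_CORRECT[model], adjusted_n)}% "
          f"({dropped_correct} of {NOISY_N} excluded samples were scored correct)")
```

By these counts, each model loses 12 correct samples to the exclusion, so the adjustment raises every model's accuracy without changing the ranking.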

Hardware

| Model | Serving | GPU |
|---|---|---|
| DLM-NL2JSON-4B | vLLM (TensorRT-LLM) | NVIDIA L4 24GB |
| GPT-4o | OpenAI API | N/A |
| Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB |

Notes

  • CSM gold noise: 64/700 CSM test samples have age_cd capped at 60 instead of 70 for "all ages" queries, conflicting with the prompt specification (age_cd: [10,20,30,40,50,60,70]). This affects all models equally.
  • DLM-NL2JSON-4B wins 7/10 categories outright, ties 1 (GIS-Outflow, 98.4% with Qwen3.5-35B-A3B), and loses 2: CPI (86.3% vs GPT-4o 87.2%) and GIS-Consumption (98.2% vs GPT-4o 99.6%).
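
One way to surface candidates for the CSM gold noise described above is to flag CSM samples whose gold `age_cd` stops at 60. This is only a heuristic sketch. The record layout (`category` and `gold` keys) is hypothetical and is not taken from the test file, and a query that really asks for ages up to 60 produces the same gold value, so flagged samples still need a manual check against the query text.

```python
FULL_AGE_RANGE = [10, 20, 30, 40, 50, 60, 70]  # from the prompt specification
CAPPED_AGE_RANGE = FULL_AGE_RANGE[:-1]         # noisy gold pattern: capped at 60


def is_suspect(sample: dict) -> bool:
    """Flag CSM samples whose gold age_cd stops at 60 (candidate noise only)."""
    if sample.get("category") != "CSM":  # hypothetical key name
        return False
    return sample.get("gold", {}).get("age_cd") == CAPPED_AGE_RANGE


def split_adjusted(samples):
    """Return (kept, suspects); suspects still need a manual check of the query."""
    kept = [s for s in samples if not is_suspect(s)]
    suspects = [s for s in samples if is_suspect(s)]
    return kept, suspects
```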