Evaluation Results — DLM-NL2JSON-4B vs Baselines
Test Configuration
Test set: task_analysis_sft_251128_test.jsonl (2,041 samples, 10 categories)
Metric: Field-level exact match accuracy (summary field excluded)
Note: 64 CSM samples with known gold-label noise are excluded from the adjusted metrics (see Notes)
Train/test overlap: 16/2,041 samples (0.78%), retained for consistency across models
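A minimal sketch of how this metric could be scored, assuming each prediction and gold label is one flat JSON object per line and that a sample counts as correct only when every field other than `summary` matches exactly. The function names and the sample records are hypothetical and are not taken from the actual evaluation harness.

```python
import json

EXCLUDED_FIELDS = {"summary"}  # free-text field that is not scored
_MISSING = object()            # sentinel so an absent key never equals null


def parse_object(line: str):
    """Parse one JSON line; return None if it is not a valid JSON object."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def sample_is_correct(pred_line: str, gold_line: str) -> bool:
    """True only when every scored field matches the gold label exactly."""
    pred, gold = parse_object(pred_line), parse_object(gold_line)
    if pred is None or gold is None:
        return False  # unparseable output is scored as wrong (assumption)
    keys = (set(pred) | set(gold)) - EXCLUDED_FIELDS
    return all(pred.get(k, _MISSING) == gold.get(k, _MISSING) for k in keys)


def exact_match_accuracy(pred_lines, gold_lines) -> float:
    """Fraction of samples whose scored fields all match."""
    pairs = list(zip(pred_lines, gold_lines))
    return sum(sample_is_correct(p, g) for p, g in pairs) / len(pairs)


# Made-up records for illustration only.
gold = [
    '{"age_cd": [10, 20], "summary": "gold text"}',
    '{"age_cd": [10, 20, 30]}',
]
pred = [
    '{"age_cd": [10, 20], "summary": "different wording"}',  # summary ignored
    '{"age_cd": [10, 20]}',                                   # age_cd differs
]
print(exact_match_accuracy(pred, gold))
```

Treating malformed model output as an incorrect sample is a design choice in this sketch; a harness could equally report parse failures as a separate count.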
Per-Category Accuracy
| Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B-A3B |
|---|---|---|---|---|
| ALP-A (pattern) | 250 | 99.6% | 56.0% | 47.6% |
| ALP-B (flow) | 250 | 98.4% | 50.4% | 46.8% |
| CSM (consumption) | 700 | 90.6% | 90.1% | 86.1% |
| CREDIT-Income | 58 | 94.8% | 53.4% | 34.5% |
| CREDIT-Spending | 77 | 97.4% | 92.2% | 51.9% |
| CREDIT-Loan/Default | 73 | 98.6% | 94.5% | 72.6% |
| CPI (business) | 219 | 86.3% | 87.2% | 54.8% |
| GIS-Inflow | 72 | 97.2% | 79.2% | 93.1% |
| GIS-Outflow | 62 | 98.4% | 77.4% | 98.4% |
| GIS-Consumption | 280 | 98.2% | 99.6% | 97.5% |
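The overall raw accuracy reported below is the sample-weighted average of these per-category figures, not their simple mean, so large categories such as CSM dominate it. The sketch below recomputes that weighted average for DLM-NL2JSON-4B from the per-category numbers; because the published percentages are rounded to one decimal, the result is approximate.

```python
# (category, N, DLM-NL2JSON-4B accuracy in %), copied from the per-category results
rows = [
    ("ALP-A", 250, 99.6),
    ("ALP-B", 250, 98.4),
    ("CSM", 700, 90.6),
    ("CREDIT-Income", 58, 94.8),
    ("CREDIT-Spending", 77, 97.4),
    ("CREDIT-Loan/Default", 73, 98.6),
    ("CPI", 219, 86.3),
    ("GIS-Inflow", 72, 97.2),
    ("GIS-Outflow", 62, 98.4),
    ("GIS-Consumption", 280, 98.2),
]

total_n = sum(n for _, n, _ in rows)

# Weight each category's accuracy by its sample count.
weighted_acc = sum(n * acc for _, n, acc in rows) / total_n

print(f"samples: {total_n}, weighted accuracy: {weighted_acc:.1f}%")
```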
Overall (Raw)
| Model | Params | Accuracy | Avg Latency |
|---|---|---|---|
| DLM-NL2JSON-4B | 4B | 94.4% (1926/2041) | 2.59s |
| GPT-4o | ~200B+ | 80.5% (1643/2041) | 1.58s |
| Qwen3.5-35B-A3B | 35B (3B active) | 72.2% (1473/2041) | 0.85s |
Overall (Adjusted — 64 CSM gold noise samples excluded)
| Model | Accuracy | N |
|---|---|---|
| DLM-NL2JSON-4B | 96.8% (1914/1977) | 1977 |
| GPT-4o | 82.5% (1631/1977) | 1977 |
| Qwen3.5-35B-A3B | 73.9% (1461/1977) | 1977 |
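The adjusted figures follow from the raw counts by dropping the 64 noisy CSM samples from the denominator. Comparing the raw and adjusted counts, each model's number of correct samples falls by 12 (for example, 1926 to 1914). That suggests 12 of the excluded samples had been scored correct for every model, though this reading is inferred from the reported counts rather than stated in the source. A short sketch of the arithmetic:

```python
RAW_TOTAL = 2041  # full test set
EXCLUDED = 64     # CSM samples with gold-label noise

# Correct-sample counts as reported in the raw and adjusted results.
raw_correct = {"DLM-NL2JSON-4B": 1926, "GPT-4o": 1643, "Qwen3.5-35B-A3B": 1473}
adj_correct = {"DLM-NL2JSON-4B": 1914, "GPT-4o": 1631, "Qwen3.5-35B-A3B": 1461}

adj_total = RAW_TOTAL - EXCLUDED

adjusted_pct = {}
dropped_correct = {}
for model, raw in raw_correct.items():
    adjusted_pct[model] = 100 * adj_correct[model] / adj_total
    # Excluded samples that had counted as correct in the raw score.
    dropped_correct[model] = raw - adj_correct[model]
    print(f"{model}: {adjusted_pct[model]:.1f}% "
          f"({adj_correct[model]}/{adj_total}), "
          f"{dropped_correct[model]} excluded samples were previously correct")
```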
Hardware
| Model | Serving | GPU |
|---|---|---|
| DLM-NL2JSON-4B | vLLM (TensorRT-LLM) | NVIDIA L4 24GB |
| GPT-4o | OpenAI API | N/A |
| Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB |
Notes
CSM gold noise: 64/700 CSM test samples have age_cd capped at 60 instead of 70 for "all ages" queries, conflicting with the prompt specification (age_cd: [10,20,30,40,50,60,70]). This affects all models equally.
DLM-NL2JSON-4B wins 7/10 categories outright, ties 1 (GIS-Outflow, 98.4% alongside Qwen3.5-35B-A3B), and loses 2 to GPT-4o: CPI (86.3% vs 87.2%) and GIS-Consumption (98.2% vs 99.6%).