ModelHub XC 72ae6ed524 Initialize project; model provided by the ModelHub XC community

Model: dataslab/DLM-NL2JSON-4B
Source: Original Platform

2026-05-04 04:44:49 +08:00

Evaluation Results — DLM-NL2JSON-4B vs Baselines

Test Configuration

Test set: task_analysis_sft_251128_test.jsonl (2,041 samples, 10 categories)
Metric: Field-level exact match accuracy (summary field excluded)
Note: 64 CSM samples with known gold-label noise are excluded from the adjusted metrics (see below)
Train/Test overlap: 16/2,041 (0.78%) — retained for consistency across models
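
The metric above can be sketched roughly as follows. This is a minimal illustration, not the evaluation script used for these results. It assumes that a sample counts as correct only when every field other than summary matches the gold label exactly, which is consistent with accuracy being reported as sample counts but is not confirmed by the source. The function names and the region_cd field are hypothetical; age_cd is taken from the CSM note.

```python
import json

# Fields ignored during comparison; the source excludes only "summary".
EXCLUDED_FIELDS = {"summary"}


def comparable(obj: dict) -> dict:
    """Return a copy of obj without the excluded fields."""
    return {k: v for k, v in obj.items() if k not in EXCLUDED_FIELDS}


def sample_is_correct(pred_json: str, gold: dict) -> bool:
    """Assumed rule: all non-summary fields must match exactly."""
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return False  # unparseable output counts as wrong
    if not isinstance(pred, dict):
        return False
    return comparable(pred) == comparable(gold)


def accuracy(pairs):
    """pairs: iterable of (model output string, gold dict)."""
    pairs = list(pairs)
    correct = sum(sample_is_correct(p, g) for p, g in pairs)
    return correct, len(pairs)
```

For example, a prediction whose summary differs from the gold label but whose other fields agree would still be scored as correct under this rule, while malformed JSON would be scored as incorrect.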

Category	N	DLM-NL2JSON-4B	GPT-4o	Qwen3.5-35B-A3B
ALP-A (pattern)	250	99.6%	56.0%	47.6%
ALP-B (flow)	250	98.4%	50.4%	46.8%
CSM (consumption)	700	90.6%	90.1%	86.1%
CREDIT-Income	58	94.8%	53.4%	34.5%
CREDIT-Spending	77	97.4%	92.2%	51.9%
CREDIT-Loan/Default	73	98.6%	94.5%	72.6%
CPI (business)	219	86.3%	87.2%	54.8%
GIS-Inflow	72	97.2%	79.2%	93.1%
GIS-Outflow	62	98.4%	77.4%	98.4%
GIS-Consumption	280	98.2%	99.6%	97.5%

Model	Params	Accuracy	Avg Latency
DLM-NL2JSON-4B	4B	94.4% (1926/2041)	2.59s
GPT-4o	~200B+	80.5% (1643/2041)	1.58s
Qwen3.5-35B-A3B	35B (3B active)	72.2% (1473/2041)	0.85s
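
As a consistency check, the overall counts can be recovered from the per-category table by converting each percentage back to a number of correct samples and summing. The sketch below assumes each per-category percentage is rounded to one decimal, so rounding N × accuracy to the nearest integer recovers the original count; the function and variable names are illustrative only.

```python
# Per-category sample counts and accuracies (%) copied from the category table:
# (N, DLM-NL2JSON-4B, GPT-4o, Qwen3.5-35B-A3B)
CATEGORIES = {
    "ALP-A": (250, 99.6, 56.0, 47.6),
    "ALP-B": (250, 98.4, 50.4, 46.8),
    "CSM": (700, 90.6, 90.1, 86.1),
    "CREDIT-Income": (58, 94.8, 53.4, 34.5),
    "CREDIT-Spending": (77, 97.4, 92.2, 51.9),
    "CREDIT-Loan/Default": (73, 98.6, 94.5, 72.6),
    "CPI": (219, 86.3, 87.2, 54.8),
    "GIS-Inflow": (72, 97.2, 79.2, 93.1),
    "GIS-Outflow": (62, 98.4, 77.4, 98.4),
    "GIS-Consumption": (280, 98.2, 99.6, 97.5),
}
MODELS = ("DLM-NL2JSON-4B", "GPT-4o", "Qwen3.5-35B-A3B")


def reconstruct_totals(categories=CATEGORIES):
    """Return {model: (correct, total)} summed over all categories."""
    total_n = sum(row[0] for row in categories.values())
    totals = {}
    for i, model in enumerate(MODELS, start=1):
        # Assumption: rounding N * accuracy recovers the per-category count.
        correct = sum(round(row[0] * row[i] / 100) for row in categories.values())
        totals[model] = (correct, total_n)
    return totals


for model, (correct, total) in reconstruct_totals().items():
    print(f"{model}: {correct}/{total} = {100 * correct / total:.1f}%")
```

Under that rounding assumption, the sums come out to 1926, 1643, and 1473 correct samples out of 2,041, matching the overall table.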

CSM gold noise: 64/700 CSM test samples have age_cd capped at 60 instead of 70 for "all ages" queries, conflicting with the prompt specification (age_cd: [10,20,30,40,50,60,70]). This affects all models equally.
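
One way such an adjusted metric could be computed is sketched below. It is a hypothetical reconstruction, not the authors' procedure: it assumes the noisy gold labels carry age_cd equal to [10,20,30,40,50,60], and that each sample record has an all_ages flag marking "all ages" queries and a correct flag holding the per-sample result. Both flags and the function names are invented for illustration.

```python
FULL_AGE_CODES = [10, 20, 30, 40, 50, 60, 70]  # from the prompt specification
CAPPED_AGE_CODES = FULL_AGE_CODES[:-1]         # assumed shape of the noisy label


def is_noisy_gold(sample: dict) -> bool:
    """Flag an 'all ages' sample whose gold age_cd stops at 60."""
    return bool(sample.get("all_ages")) and sample["gold"].get("age_cd") == CAPPED_AGE_CODES


def adjusted_accuracy(samples):
    """Accuracy over samples that remain after dropping noisy gold labels."""
    kept = [s for s in samples if not is_noisy_gold(s)]
    if not kept:
        return 0, 0
    return sum(bool(s["correct"]) for s in kept), len(kept)
```

With the 64 flagged samples removed, the CSM denominator would drop from 700 to 636 and the overall denominator from 2,041 to 1,977.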
DLM-NL2JSON-4B wins 7/10 categories outright, ties 1 (GIS-Outflow, 98.4% alongside Qwen3.5-35B-A3B), and loses 2 to GPT-4o: CPI (86.3% vs 87.2%) and GIS-Consumption (98.2% vs 99.6%).