# Evaluation Results — DLM-NL2JSON-4B vs Baselines

## Test Configuration
- **Test set**: `task_analysis_sft_251128_test.jsonl` (2,041 samples, 10 categories)
- **Metric**: Field-level exact match accuracy (summary field excluded)
- **Note**: 64 CSM samples with known gold-label noise are excluded from the adjusted metrics (see Notes)
- **Train/Test overlap**: 16/2,041 (0.78%) — retained for consistency across models
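
The metric above can be read in more than one way. Because the overall scores are reported as sample counts out of 2,041, one plausible reading is that a sample counts as correct only when every field except `summary` matches the gold label exactly. The sketch below implements that reading; it is an assumption, not the project's actual scoring script, and the `region` field in the usage example is hypothetical.

```python
EXCLUDED_FIELDS = {"summary"}  # the summary field is excluded from scoring
_MISSING = object()            # sentinel so a missing key never equals None


def sample_is_correct(pred: dict, gold: dict) -> bool:
    """True when every scored field matches exactly (assumed reading of the metric)."""
    keys = (set(pred) | set(gold)) - EXCLUDED_FIELDS
    return all(pred.get(k, _MISSING) == gold.get(k, _MISSING) for k in keys)


def accuracy(preds: list, golds: list) -> float:
    """Fraction of samples whose scored fields all match the gold label."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(sample_is_correct(p, g) for p, g in zip(preds, golds))
    return correct / len(golds)


# Usage with made-up records ("region" is a hypothetical field name).
gold = {"age_cd": [10, 20], "region": "A", "summary": "gold text"}
good = {"age_cd": [10, 20], "region": "A", "summary": "different wording"}
bad = {"age_cd": [10], "region": "A", "summary": "gold text"}
print(accuracy([good, bad], [gold, gold]))
```

Under this reading, a differing `summary` never costs a point, while a missing or extra scored field does.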

## Per-Category Accuracy

| Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B-A3B |
|----------|---|-------------|--------|-----------------|
| ALP-A (pattern) | 250 | **99.6%** | 56.0% | 47.6% |
| ALP-B (flow) | 250 | **98.4%** | 50.4% | 46.8% |
| CSM (consumption) | 700 | **90.6%** | 90.1% | 86.1% |
| CREDIT-Income | 58 | **94.8%** | 53.4% | 34.5% |
| CREDIT-Spending | 77 | **97.4%** | 92.2% | 51.9% |
| CREDIT-Loan/Default | 73 | **98.6%** | 94.5% | 72.6% |
| CPI (business) | 219 | 86.3% | **87.2%** | 54.8% |
| GIS-Inflow | 72 | **97.2%** | 79.2% | 93.1% |
| GIS-Outflow | 62 | **98.4%** | 77.4% | **98.4%** |
| GIS-Consumption | 280 | 98.2% | **99.6%** | 97.5% |

## Overall (Raw)

| Model | Params | Accuracy | Avg Latency |
|-------|--------|----------|-------------|
| **DLM-NL2JSON-4B** | **4B** | **94.4% (1926/2041)** | 2.59s |
| GPT-4o | ~200B+ (unofficial estimate) | 80.5% (1643/2041) | 1.58s |
| Qwen3.5-35B-A3B | 35B (3B active) | 72.2% (1473/2041) | 0.85s |
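
The raw totals can be cross-checked against the Per-Category table. The sketch below copies the per-category rows, converts each rounded percentage back to a whole number of correct samples, and sums them per model. It also tallies, per category, whether DLM-NL2JSON-4B leads, ties, or trails the best baseline. Because the percentages are rounded to one decimal, this is a consistency check rather than a re-evaluation.

```python
# Rows copied from the Per-Category Accuracy table:
# (category, N, DLM-NL2JSON-4B %, GPT-4o %, Qwen3.5-35B-A3B %)
ROWS = [
    ("ALP-A", 250, 99.6, 56.0, 47.6),
    ("ALP-B", 250, 98.4, 50.4, 46.8),
    ("CSM", 700, 90.6, 90.1, 86.1),
    ("CREDIT-Income", 58, 94.8, 53.4, 34.5),
    ("CREDIT-Spending", 77, 97.4, 92.2, 51.9),
    ("CREDIT-Loan/Default", 73, 98.6, 94.5, 72.6),
    ("CPI", 219, 86.3, 87.2, 54.8),
    ("GIS-Inflow", 72, 97.2, 79.2, 93.1),
    ("GIS-Outflow", 62, 98.4, 77.4, 98.4),
    ("GIS-Consumption", 280, 98.2, 99.6, 97.5),
]
MODELS = ["DLM-NL2JSON-4B", "GPT-4o", "Qwen3.5-35B-A3B"]


def reconstruct_correct_counts(rows):
    """Sum round(pct * N / 100) per model across all categories."""
    totals = [0] * len(MODELS)
    for _, n, *pcts in rows:
        for i, pct in enumerate(pcts):
            totals[i] += round(pct * n / 100)
    return dict(zip(MODELS, totals))


def tally_vs_best_baseline(rows):
    """Count categories where the first model leads, ties, or trails the best baseline."""
    wins = ties = losses = 0
    for _, _, dlm, *baselines in rows:
        best = max(baselines)
        if dlm > best:
            wins += 1
        elif dlm == best:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses


total_n = sum(n for _, n, *_ in ROWS)
print(total_n)                           # 2041
print(reconstruct_correct_counts(ROWS))  # 1926, 1643, 1473: matches the raw table
print(tally_vs_best_baseline(ROWS))
```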

## Overall (Adjusted — 64 CSM gold noise samples excluded)

| Model | Accuracy | N |
|-------|----------|---|
| **DLM-NL2JSON-4B** | **96.8% (1914/1977)** | 1977 |
| GPT-4o | 82.5% (1631/1977) | 1977 |
| Qwen3.5-35B-A3B | 73.9% (1461/1977) | 1977 |
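
The adjusted figures follow from removing the 64 excluded samples from both the numerator and the denominator. Comparing the raw and adjusted tables implies that each model was scored correct on 12 of the 64 excluded samples (for example, 1926 - 1914). That figure is inferred from the reported counts, not stated directly in the results. A short worked check:

```python
TOTAL_SAMPLES = 2041
EXCLUDED_SAMPLES = 64  # CSM samples with gold-label noise

# model -> (raw correct, adjusted correct), as reported in the two tables
REPORTED = {
    "DLM-NL2JSON-4B": (1926, 1914),
    "GPT-4o": (1643, 1631),
    "Qwen3.5-35B-A3B": (1473, 1461),
}


def adjusted_accuracy(raw_correct, correct_on_excluded,
                      total=TOTAL_SAMPLES, excluded=EXCLUDED_SAMPLES):
    """Accuracy after dropping the excluded samples from numerator and denominator."""
    return (raw_correct - correct_on_excluded) / (total - excluded)


for model, (raw, adjusted) in REPORTED.items():
    correct_on_excluded = raw - adjusted  # inferred from the tables
    acc = adjusted_accuracy(raw, correct_on_excluded)
    print(f"{model}: {correct_on_excluded} correct among excluded, adjusted {acc:.1%}")
```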

## Hardware

| Model | Serving | GPU |
|-------|---------|-----|
| DLM-NL2JSON-4B | vLLM (TensorRT-LLM) | NVIDIA L4 24GB |
| GPT-4o | OpenAI API | N/A |
| Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB |

## Notes
- CSM gold noise: 64/700 CSM test samples have `age_cd` capped at 60 instead of 70 for "all ages" queries, conflicting with the prompt specification (`age_cd: [10,20,30,40,50,60,70]`). This affects all models equally.
- DLM-NL2JSON-4B wins 7/10 categories outright, ties 1 (GIS-Outflow, level with Qwen3.5-35B-A3B at 98.4%), and trails GPT-4o in 2: CPI (86.3% vs 87.2%) and GIS-Consumption (98.2% vs 99.6%).
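
The CSM gold-noise pattern described above could be flagged along the following lines. This is a minimal sketch: the record keys `category` and `gold` are hypothetical, since the layout of the test file is not documented here. A label that stops at 60 can also be legitimate when the query really asks for that range, so flagged samples are candidates for manual review, not confirmed errors.

```python
FULL_AGE_CODES = [10, 20, 30, 40, 50, 60, 70]  # from the prompt specification
CAPPED_AGE_CODES = FULL_AGE_CODES[:-1]         # the noisy variant, capped at 60


def is_suspect_label(gold: dict) -> bool:
    """True when age_cd lists every age code except the final 70."""
    return gold.get("age_cd") == CAPPED_AGE_CODES


def split_suspects(samples: list) -> tuple:
    """Separate CSM samples with a capped age_cd from everything else."""
    clean, suspect = [], []
    for sample in samples:
        flagged = sample["category"] == "CSM" and is_suspect_label(sample["gold"])
        (suspect if flagged else clean).append(sample)
    return clean, suspect


# Usage with made-up records ("category" and "gold" are assumed key names).
samples = [
    {"category": "CSM", "gold": {"age_cd": [10, 20, 30, 40, 50, 60]}},
    {"category": "CSM", "gold": {"age_cd": FULL_AGE_CODES}},
    {"category": "CPI", "gold": {"age_cd": [10, 20, 30, 40, 50, 60]}},
]
clean, suspect = split_suspects(samples)
print(len(clean), len(suspect))  # 2 1
```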

Initial project commit; model provided by the ModelHub XC community. Model: dataslab/DLM-NL2JSON-4B. Source: Original Platform. 2026-05-04 04:44:49 +08:00
