Esmeralda Llama 3.1 8B Control

An advanced, high-parseability agentic language model optimized for structural consistency, tool-use execution, and stable conversational automation.

Model Details

Esmeralda-Llama-3.1-8B-control is a specialized agentic model fine-tuned on the Locutusque/esmeralda-agentic dataset. The dataset is designed to train models in agentic routines, structured system-prompt adherence, reasoning loops, and multi-turn tool interactions. This control variant prioritizes syntax stability (it scored 100% on the Percent Parseable benchmark reported under Evaluation) so that malformed output does not break downstream orchestration frameworks such as LangChain, CrewAI, or AutoGen. This is the *control* model of the Esmeralda family; it will be the first to be released and serves as a proof of concept.

Developed by: Locutusque (Sebastian Gabarain)
Model type: Transformer Decoder (LLM)
Language(s) (NLP): English
License: Llama-3.1 License
Finetuned from: Llama-3.1-8B

Model Sources

Repository: Locutusque/Esmeralda-Llama-3.1-8B-control
Dataset: Locutusque/esmeralda-agentic

Uses

Direct Use

This model is built for AI agent loops, multi-turn function calling, programmatic tool use, and structured data extraction. It is intended to take complex API schemas or system prompts as input and produce predictably structured output that an execution environment can consume directly.

Downstream Use

The model is best integrated as the core model of an autonomous-agent architecture. It works best when paired with a strict parser or execution layer that depends on well-structured outputs (JSON, XML blocks, or custom agent formatting syntax).
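As an illustration of the kind of strict execution layer described above, the sketch below parses a model reply as a JSON tool call and rejects anything that deviates. The `{"name": ..., "arguments": {...}}` shape, the `ToolCall` type, and the `parse_tool_call` helper are assumptions made for this example; this card does not specify the exact output format the model emits.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    # Hypothetical container for a parsed call; not part of any library.
    name: str
    arguments: dict[str, Any]


def parse_tool_call(raw: str, allowed_tools: set[str]) -> ToolCall:
    """Strictly parse a JSON tool call; raise ValueError on any deviation."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("tool call must be a JSON object")
    name = payload.get("name")
    arguments = payload.get("arguments")
    # Only tools the caller has registered are accepted.
    if not isinstance(name, str) or name not in allowed_tools:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object")
    return ToolCall(name=name, arguments=arguments)
```

Failing loudly with a single exception type keeps the calling agent loop simple: it can catch `ValueError` and decide whether to retry or abort.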

Out-of-Scope Use

The model is not intended for heavy multilingual generation or specialized multi-modal tasks without additional fine-tuning. Avoid using it for unstructured creative writing, where its bias toward programmatic structure could hurt flow and artistic variation.

Bias, Risks, and Limitations

Because the model was trained to conform closely to precise agent structures, it may fixate on specific formatting even when a general conversational response is expected. It also inherits the societal biases and hallucination risks of the base Llama-3.1-8B model.

💡 Recommendations

Users should implement a validation-and-retry loop in their applications. Although the model scored 100% on parseability in the evaluation reported in this card, validating output syntax programmatically guards against malformed output in critical workflows.
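A minimal sketch of such a validation-and-retry loop follows. It assumes the application expects a JSON object and that `generate` is any callable that sends a prompt to the model and returns its text; both `generate_with_retry` and the `generate` parameter are hypothetical names for this example, not part of Transformers.

```python
import json
from typing import Callable


def generate_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call `generate` until its output parses as a JSON object, or give up."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError("output is valid JSON but not an object")
    raise RuntimeError(f"no parseable output after {max_attempts} attempts") from last_error
```

In a real agent, `generate` would wrap the pipeline call shown under "How to Get Started with the Model", and a retry might also append the parser error to the prompt so the model can correct itself.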

How to Get Started with the Model

Use the standard Transformers pipeline setup to initialize and prompt the model:

import torch
import transformers

model_id = "Locutusque/Esmeralda-Llama-3.1-8B-control"

# Load the weights in bfloat16 and let Accelerate place them on the available device(s).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input; the pipeline applies the model's chat template.
messages = [
    {"role": "system", "content": "You are Esmeralda, an expert agentic assistant capable of executing complex tools accurately."},
    {"role": "user", "content": "Generate the tool arguments required to lookup weather data for Paris and Tokyo simultaneously."},
]

outputs = pipeline(messages, max_new_tokens=256)
# generated_text holds the whole conversation; the last entry is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])

Training Details

Training Data

The model was fine-tuned on 4,625 curated examples from the Locutusque/esmeralda-agentic dataset. The data covers multi-turn agentic conversations, step-by-step reasoning traces, and explicit tool-execution routines, selected to maximize syntax compliance and sustained multi-step reasoning.

Training Hyperparameters

Training regime: bf16 mixed precision
Formatting style: Standard Llama 3 Chat Template formatting
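For readers unfamiliar with the Llama 3 chat layout, the sketch below renders a message list into that layout by hand. It is an approximation for illustration only: `render_llama3_prompt` is a hypothetical helper, and the template bundled with the model's tokenizer (applied through `tokenizer.apply_chat_template`) is authoritative and may inject additional system text.

```python
def render_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Approximate the Llama 3 chat layout for a list of role/content messages."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content'].strip()}<|eot_id|>")
    # Open an assistant turn so the model continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

In practice the Transformers pipeline handles this step automatically when given a list of messages, so hand-rendering is mainly useful for debugging what the model actually sees.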

Evaluation

Llama 3.1 8B Benchmark Comparison

The table below compares Esmeralda-Llama-3.1-8B-control against Llama 3.1 8B Instruct and Hermes-3-Llama-3.1-8B.

Benchmark	Esmeralda-Llama-3.1-8B-control	Llama 3.1 8B Instruct	Hermes-3-Llama-3.1-8B
HumanEval	57.3	56.1	52.4
MBPP	53.2	56.8	48.2
MMLU-Pro	31.9	N/A	31.0
GPQA Diamond	15.7	15.7	18.2
EQ-Bench	59.2	61.1	63.1
Percent Parseable	100.0	92.4	91.2
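This card does not define how Percent Parseable is measured. One plausible reading, sketched below purely as an assumption, is the share of model outputs that a parser accepts without error. The sketch uses JSON as the target syntax; `percent_parseable` is a hypothetical helper, not the evaluation code behind the numbers in the table.

```python
import json


def percent_parseable(outputs: list[str]) -> float:
    """Return the percentage of outputs that parse as JSON (assumed definition)."""
    if not outputs:
        raise ValueError("need at least one output")
    parsed = 0
    for text in outputs:
        try:
            json.loads(text)
            parsed += 1
        except json.JSONDecodeError:
            # Malformed output counts against the score.
            pass
    return 100.0 * parsed / len(outputs)
```

For example, `percent_parseable(['{"a": 1}', 'nope', '[]', '{bad'])` counts two valid documents out of four.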

Visual Performance Overview

HumanEval (Coding): Esmeralda-Llama-3.1-8B 57.3, Llama 3.1 Instruct 56.1, Hermes-3 52.4

MBPP (Python Coding): Esmeralda-Llama-3.1-8B 53.2, Llama 3.1 Instruct 56.8, Hermes-3 48.2

Percent Parseable (Syntax Stability): Esmeralda-Llama-3.1-8B 100.0, Llama 3.1 Instruct 92.4, Hermes-3 91.2

Key Takeaways

Esmeralda-Llama-3.1-8B-control slightly leads on HumanEval despite using a relatively small finetuning dataset.
Hermes-3-Llama-3.1-8B shows the strongest EQ-Bench and GPQA performance.
Llama 3.1 8B Instruct remains the strongest of the three on MBPP.
Esmeralda-Llama-3.1-8B-control achieves the best parseability, at 100%.

Interpretation

Esmeralda-Llama-3.1-8B-control stays roughly level with Llama 3.1 8B Instruct on the coding and knowledge benchmarks (1.2 points ahead on HumanEval, 3.6 points behind on MBPP, tied on GPQA Diamond) while raising parseability from 92.4% to 100%. Whereas Hermes-3 emphasizes conversational reasoning and fluid persona characteristics, the Esmeralda control model focuses on output predictability and software integration stability.

🐉 Here Be Dragons

The following results are exploratory and are not directly comparable to standard TruthfulQA leaderboard scores. In addition, Esmeralda-Llama-3.1-8B-control was quantized to 8-bit precision to speed up evaluation, so its scores may be slightly lower than full-precision inference would produce.

Experimental Truthfulness Evaluation

Esmeralda-Llama-3.1-8B-control was evaluated on TruthfulQA using a freeform-generation setup rather than the standard multiple-choice MC1/MC2 methodology.

Evaluation procedure:

- The model generated unrestricted freeform answers.
- A separate judge model, Gemma 4 26B A4B, was prompted to assign:
  - 1 for correct/truthful answers
  - 0 for incorrect/hallucinated answers
- The judge compared generations against the TruthfulQA reference answers.
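The procedure above can be sketched as follows, assuming the reported score is the mean of the judge's binary verdicts (the card does not state the aggregation explicitly). Both `judge_score` and the `judge` callable are hypothetical; in the real evaluation the judge would be a prompted call to the judge model.

```python
from typing import Callable


def judge_score(
    examples: list[tuple[str, str, str]],
    judge: Callable[[str, str, str], int],
) -> float:
    """Average binary judge verdicts over (question, answer, reference) triples."""
    if not examples:
        raise ValueError("need at least one example")
    verdicts = []
    for question, answer, reference in examples:
        verdict = judge(question, answer, reference)
        # The judge must return exactly 0 (incorrect) or 1 (truthful).
        if verdict not in (0, 1):
            raise ValueError(f"judge returned non-binary verdict: {verdict!r}")
        verdicts.append(verdict)
    return sum(verdicts) / len(verdicts)
```

Rejecting non-binary verdicts up front matters with an LLM judge, since a judge that occasionally returns free text would otherwise silently skew the average.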

Model	Evaluation Method	Score
Esmeralda-Llama-3.1-8B-control	TruthfulQA LLM Judge	0.682
Hermes-3-Llama-3.1-8B	TruthfulQA MC2 (self-reported)	0.587

Notes

These numbers are not directly comparable due to differing evaluation setups.
MC2 evaluates constrained multiple-choice accuracy, while the Esmeralda evaluation measures freeform answer truthfulness judged semantically by an auxiliary LLM.
Manual inspection of sampled generations suggested the judge model behaved reliably for this experiment.
No official TruthfulQA score for Llama 3.1 8B Instruct could be located at the time of writing.

*This section is provided as an experimental reference rather than a standardized leaderboard claim.*