---
library_name: transformers
tags:
- agent
- tool-use
- function-calling
- llama-3.1
- text-generation-inference
model_creator: Locutusque
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- Locutusque/esmeralda-agentic
---
An advanced, high-parseability agentic language model optimized for structural consistency, tool-use execution, and stable conversational automation.
Esmeralda-Llama-3.1-8B-control is a specialized agentic model fine-tuned on the Locutusque/esmeralda-agentic dataset. The dataset is designed to train models in agentic routines, strict system prompt adherence, reasoning loops, and multi-turn tool interactions. This control variant prioritizes stable output syntax (it scored 100% on the parseability benchmark reported in this card) to reduce runtime failures in downstream orchestration frameworks such as LangChain, CrewAI, or AutoGen. It is the *control* model of the Esmeralda family: the first to be released, serving as a proof of concept.
This model is built for AI agent loops, multi-turn function calling, programmatic tool usage, and structured data extraction workloads. It is intended to take complex API schemas or system prompts as input and produce consistently structured output that an execution environment can consume directly.
It is best used as the core reasoning model in an autonomous agent architecture, paired with a strict parser or execution layer that depends on well-formed structured output (JSON, XML blocks, or a custom agent format).
Not intended for heavy multilingual generation or specialized multi-modal tasks without additional fine-tuning. Avoid using this model for unstructured creative writing, where its structural constraints could hurt flow and artistic variation.
Because the model was trained to conform closely to agent formats, it may over-apply those formats even when a plain conversational response is expected. It also inherits the societal biases and hallucination risks of the base Llama-3.1-8B model.
Users should implement a validation retry loop in their applications. Even with a high measured parseability rate, validating output syntax programmatically guards against the occasional malformed response in critical workflows.
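A validation retry loop of the kind recommended above could look like the following minimal sketch. It assumes the expected output is a single JSON object; `generate_with_retry` and `fake_generate` are hypothetical names introduced here for illustration, and `fake_generate` is a stand-in for a real model call.

```python
import json
from typing import Callable, Optional


def generate_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call `generate` until its output parses as a JSON object, or give up."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: ask again
        if isinstance(parsed, dict):
            return parsed
    return None  # caller decides how to handle repeated failure


# Hypothetical stand-in for a model call: truncated JSON first, then valid JSON.
_replies = iter(['{"city": "Paris"', '{"city": "Paris"}'])


def fake_generate(prompt: str) -> str:
    return next(_replies)


result = generate_with_retry(fake_generate, "Look up the weather in Paris.")
print(result)
```

Returning `None` after the final attempt, rather than raising, leaves the fallback policy (abort, escalate, or use a default) to the calling application.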
Use the standard Transformers pipeline setup to initialize and prompt the model:
```python
import transformers
import torch

model_id = "Locutusque/Esmeralda-Llama-3.1-8B-control"

# Build a text-generation pipeline; bf16 weights, devices assigned automatically.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: a system prompt followed by the user request.
messages = [
    {"role": "system", "content": "You are Esmeralda, an expert agentic assistant capable of executing complex tools accurately."},
    {"role": "user", "content": "Generate the tool arguments required to lookup weather data for Paris and Tokyo simultaneously."},
]

outputs = pipeline(messages, max_new_tokens=256)

# The last message in the returned conversation is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])
```
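The reply printed above is plain text, so an execution layer still has to parse it and route each call. The sketch below shows one way to do that, assuming the reply is a JSON list of objects with `name` and `arguments` keys. That format is an assumption for illustration, not a documented output format of this model, and `get_weather`, `TOOLS`, and `run_tool_calls` are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, List


# Hypothetical tool; a real implementation would call a weather service.
def get_weather(city: str) -> str:
    return f"(weather for {city} would be fetched here)"


# Registry mapping tool names to callables (illustrative only).
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(reply: str) -> List[Any]:
    """Parse a JSON list of {"name", "arguments"} objects and run each call."""
    calls = json.loads(reply)
    if not isinstance(calls, list):
        raise ValueError("expected a JSON list of tool calls")
    results = []
    for call in calls:
        name = call["name"]
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        results.append(TOOLS[name](**call["arguments"]))
    return results


# Hand-written reply in the assumed format (not actual model output).
sample_reply = json.dumps([
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
])
tool_results = run_tool_calls(sample_reply)
print(tool_results)
```

Rejecting unknown tool names before dispatch keeps a malformed or unexpected reply from reaching arbitrary code paths.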
The model was fine-tuned on 4,625 examples curated from the Locutusque/esmeralda-agentic dataset. The training data covers multi-turn agentic conversations, step-by-step reasoning traces, and explicit tool-execution routines, selected to improve syntax compliance and persistence through multi-step analysis.
Training precision: bf16 mixed precision

| Benchmark | Esmeralda-Llama-3.1-8B-control | Llama 3.1 8B Instruct | Hermes-3-Llama-3.1-8B |
|---|---|---|---|
| HumanEval | 57.3 | 56.1 | 52.4 |
| MBPP | 53.2 | 56.8 | 48.2 |
| MMLU-Pro | 31.9 | N/A | 31.0 |
| GPQA Diamond | 15.7 | 15.7 | 18.2 |
| EQ-Bench | 59.2 | 61.1 | 63.1 |
| Percent Parseable | 100.0 | 92.4 | 91.2 |
Esmeralda-Llama-3.1-8B-control stays close to Llama 3.1 8B Instruct on general benchmarks: it is slightly ahead on HumanEval (57.3 vs. 56.1), level on GPQA Diamond (15.7), and behind on MBPP (53.2 vs. 56.8) and EQ-Bench (59.2 vs. 61.1). Its clearest gain is parseability, which rises from 92.4% to 100%. Whereas Hermes-3 emphasizes conversational reasoning and fluid persona characteristics, the Esmeralda control model focuses on output predictability and stable software integration.
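This card does not specify how the Percent Parseable figure was computed. As a rough illustration only, the sketch below treats an output as parseable when `json.loads` accepts it and reports the share of such outputs. The `percent_parseable` helper and the sample strings are hypothetical, and the actual benchmark may use a different parser or format.

```python
import json
from typing import Iterable


def percent_parseable(outputs: Iterable[str]) -> float:
    """Return the percentage of outputs that parse as valid JSON."""
    items = list(outputs)
    if not items:
        raise ValueError("no outputs to score")
    ok = 0
    for text in items:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass  # count as unparseable
    return 100.0 * ok / len(items)


# Hand-written samples: two valid JSON documents, two invalid ones.
samples = ['{"a": 1}', '[1, 2]', '{"a": ', 'plain text']
rate = percent_parseable(samples)
print(rate)
```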
The following results are exploratory and are not directly comparable to standard TruthfulQA leaderboard scores. In addition, Esmeralda-Llama-3.1-8B-control was quantized to 8-bit precision to speed up evaluation, so its scores may be slightly lower than they would be at full precision.
Esmeralda-Llama-3.1-8B-control was evaluated on TruthfulQA using a freeform-generation setup rather than the standard multiple-choice MC1/MC2 methodology.
Evaluation procedure:

The LLM judge, Gemma 4 26B A4B, was prompted to assign:

- 1 for correct/truthful answers
- 0 for incorrect/hallucinated answers

| Model | Evaluation Method | Score |
|---|---|---|
| Esmeralda-Llama-3.1-8B-control | TruthfulQA LLM Judge | 0.682 |
| Hermes-3-Llama-3.1-8B | TruthfulQA MC2 (self-reported) | 0.587 |
*This section is provided as an experimental reference rather than a standardized leaderboard claim.*
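With a binary judge, one natural way to aggregate is to average the 0/1 labels, so the score is the fraction of answers judged truthful. The sketch below assumes that aggregation, which this card does not state explicitly. `judge_score` is a hypothetical helper and the labels are made up for illustration.

```python
from typing import Sequence


def judge_score(labels: Sequence[int]) -> float:
    """Average binary judge labels (1 = truthful, 0 = incorrect)."""
    if not labels:
        raise ValueError("no labels to aggregate")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("labels must be 0 or 1")
    return sum(labels) / len(labels)


# Made-up labels for four answers; three were judged truthful.
example_labels = [1, 1, 0, 1]
score = judge_score(example_labels)
print(score)
```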