---
library_name: transformers
tags:
- agent
- tool-use
- function-calling
- llama-3.1
- text-generation-inference
model_creator: Locutusque
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- Locutusque/esmeralda-agentic
---

<style>
  .card-container {
    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
    color: #e2e8f0;
    line-height: 1.7;
    max-width: 900px;
    margin: 0 auto;
    background-color: #0f172a; /* Deep dark slate */
    padding: 2.8rem;
    border-radius: 16px;
    box-shadow: 0 10px 30px rgba(0, 0, 0, 0.5);
    border: 1px solid #1e293b;
    overflow: hidden; /* Keep container clean */
  }

  .card-container a {
    color: #38bdf8; /* High-contrast bright cyan */
    font-weight: 500;
    text-decoration: none;
    border-bottom: 1px dashed #38bdf8;
    transition: color 0.2s ease;
  }

  .card-container a:hover {
    color: #7dd3fc;
    border-bottom-style: solid;
  }

  .hero-header {
    background: linear-gradient(135deg, #065f46 0%, #022c22 100%); /* Deep cyber green */
    color: #ffffff;
    padding: 2.5rem;
    border-radius: 16px;
    margin-bottom: 2rem;
    border: 1px solid #10b981;
    box-shadow: 0 4px 20px rgba(16, 185, 129, 0.15);
  }

  .hero-header h1 {
    color: #34d399 !important; /* Vibrant mint */
    margin-top: 0;
    font-size: 2.2rem;
    font-weight: 700;
    letter-spacing: -0.02em;
  }

  .hero-header p {
    color: #cbd5e1;
    font-size: 1.05rem;
  }

  .section-title {
    border-bottom: 2px solid #334155;
    padding-bottom: 0.5rem;
    color: #f8fafc;
    margin-top: 3rem;
    font-weight: 600;
    letter-spacing: -0.01em;
  }

  h3 {
    color: #f1f5f9 !important;
    margin-top: 1.8rem;
  }

  .grid-details {
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
    gap: 1rem;
    margin-bottom: 1.5rem;
    margin-top: 1.5rem;
  }

  .detail-card {
    background: #1e293b;
    padding: 1.25rem;
    border-radius: 0 8px 8px 0;
    border: 1px solid #334155;
    border-left: none;
    box-shadow: 0 2px 4px rgba(0,0,0,0.2);
  }

  .detail-card strong {
    color: #94a3b8;
    display: block;
    margin-bottom: 0.3rem;
    font-size: 0.85rem;
    text-transform: uppercase;
    letter-spacing: 0.05em;
  }

  ul, ol {
    padding-left: 1.4rem;
  }

  ul li, ol li {
    margin-bottom: 0.5rem;
    color: #cbd5e1;
  }

  /* CRITICAL SCROLL CORRECTIONS */
  .table-responsive {
    display: block !important;
    width: 100% !important;
    overflow-x: auto !important;
    overflow-y: hidden !important;
    -webkit-overflow-scrolling: touch;
    margin: 1.5rem 0;
    border-radius: 8px;
    border: 1px solid #334155;
  }

  .custom-table {
    width: 100%;
    min-width: 600px; /* Forces table element structure to stay wide enough to require scrolling instead of compressing */
    border-collapse: collapse;
    margin: 0;
    font-size: 0.95rem;
    background: #1e293b;
  }

  .custom-table th {
    background-color: #334155;
    color: #f8fafc;
    text-align: left;
    padding: 14px;
    font-weight: 600;
  }

  .custom-table td {
    padding: 14px;
    border-bottom: 1px solid #334155;
    color: #cbd5e1;
  }

  .custom-table tr:last-child td {
    border-bottom: none;
  }

  .custom-table tr:nth-child(even) {
    background-color: #111827;
  }

  .bar-container {
    margin: 1.5rem 0;
    background: #1e293b;
    padding: 1.5rem;
    border-radius: 12px;
    border: 1px solid #334155;
  }

  .bar-title {
    font-weight: 600;
    color: #f1f5f9;
    margin-bottom: 1rem;
  }

  .bar-wrapper {
    display: flex;
    align-items: center;
    margin-bottom: 0.8rem;
  }

  .bar-label {
    width: 180px;
    font-size: 0.9rem;
    color: #94a3b8;
  }

  .bar-track {
    flex-grow: 1;
    background-color: #0f172a;
    height: 12px;
    border-radius: 6px;
    overflow: hidden;
    margin: 0 1rem;
    border: 1px solid #334155;
  }

  .bar-fill {
    height: 100%;
    border-radius: 6px;
  }

  .bar-fill.esmeralda { background: linear-gradient(90deg, #10b981, #34d399); }
  .bar-fill.llama { background: linear-gradient(90deg, #3b82f6, #60a5fa); }
  .bar-fill.hermes { background: linear-gradient(90deg, #8b5cf6, #a78bfa); }

  .bar-value {
    width: 45px;
    text-align: right;
    font-family: monospace;
    font-weight: bold;
    color: #f8fafc;
  }

  .alert-box {
    padding: 1.25rem;
    border-radius: 8px;
    margin: 1.5rem 0;
    border: 1px solid transparent;
  }

  .alert-box h4 {
    margin-top: 0;
    margin-bottom: 0.5rem;
    font-size: 1.05rem;
  }

  .alert-box.info {
    background-color: rgba(14, 116, 144, 0.15);
    border-color: #06b6d4;
    color: #e0f2fe;
  }
  .alert-box.info h4 { color: #38bdf8; }

  .alert-box.dragon {
    background-color: rgba(185, 28, 28, 0.15);
    border-color: #ef4444;
    color: #fee2e2;
  }
  .alert-box.dragon h4 { color: #f87171; }

  code {
    background-color: #1e293b;
    color: #f43f5e;
    padding: 0.2rem 0.4rem;
    border-radius: 4px;
    font-size: 0.9em;
    font-family: monospace;
    border: 1px solid #334155;
  }

  pre code {
    background-color: #0b0f19;
    color: #cbd5e1;
    padding: 1.25rem;
    border-radius: 8px;
    display: block;
    overflow-x: auto;
    border: 1px solid #1e293b;
    font-size: 0.9rem;
    line-height: 1.6;
  }
</style>
<div class="card-container">

<div class="hero-header">
  <h1>Esmeralda Llama 3.1 8B Control</h1>
  <p>An advanced, high-parseability agentic language model optimized for structural consistency, tool-use execution, and stable conversational automation.</p>
</div>

<h2 class="section-title">Model Details</h2>

<p><strong>Esmeralda-Llama-3.1-8B-control</strong> is a specialized agentic model fine-tuned on the <code>Locutusque/esmeralda-agentic</code> dataset, which is designed to train models on agentic routines, strict adherence to structured system prompts, reasoning loops, and multi-turn tool interactions. This control variant prioritizes stable output syntax (it reached 100% parseability in the evaluation reported below) to reduce runtime failures in downstream orchestration frameworks such as LangChain, CrewAI, or AutoGen. It is the <em>control</em> model of the Esmeralda family: the first to be released, serving as a proof of concept.</p>

<div class="grid-details">
  <div class="detail-card">
    <strong>Developed by</strong>
    Locutusque (Sebastian Gabarain)
  </div>
  <div class="detail-card">
    <strong>Model type</strong>
    Decoder-only Transformer (LLM)
  </div>
  <div class="detail-card">
    <strong>Language(s)</strong>
    English (NLP)
  </div>
  <div class="detail-card">
    <strong>License</strong>
    Llama-3.1 License
  </div>
  <div class="detail-card">
    <strong>Finetuned from</strong>
    Llama-3.1-8B
  </div>
</div>

<h2 class="section-title">Model Sources</h2>
<ul>
  <li><strong>Repository:</strong> <a href="https://huggingface.co/Locutusque/Esmeralda-Llama-3.1-8B-control">Locutusque/Esmeralda-Llama-3.1-8B-control</a></li>
  <li><strong>Dataset:</strong> <a href="https://huggingface.co/datasets/Locutusque/esmeralda-agentic">Locutusque/esmeralda-agentic</a></li>
</ul>

<h2 class="section-title">Uses</h2>

<h3>Direct Use</h3>
<p>This model is built for AI agent loops, multi-turn function calling, programmatic tool use, and structured data extraction. It is intended to ingest complex API schemas or system prompts and emit predictable, machine-parseable output for an execution environment.</p>

<h3>Downstream Use</h3>
<p>The model is best used as the core model in an autonomous agent architecture, paired with a strict parser or execution layer that depends on well-formed output (JSON, XML blocks, or a custom agent format).</p>
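<p>As a rough illustration of such a strict parser layer, the sketch below validates a JSON tool call before anything is executed. The call shape (<code>name</code> plus <code>arguments</code>), the <code>TOOLS</code> registry, and <code>parse_tool_call</code> are assumptions made for this example; they are not a format guaranteed by the model or the dataset.</p>

```python
import json

# Hypothetical tool registry: tool name -> names of required arguments.
TOOLS = {"get_weather": {"city"}}


def parse_tool_call(text: str) -> dict:
    """Parse and validate a tool call of the assumed form
    {"name": "<tool>", "arguments": {...}}; raise ValueError otherwise."""
    call = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be a JSON object")
    missing = TOOLS[name] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return call
```

<p>Rejecting a malformed call with a single exception type keeps the execution layer simple: it either receives a validated dictionary or a clear error it can log or feed back to the model.</p>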

<h3>Out-of-Scope Use</h3>
<p>Not intended for heavy multilingual generation or multimodal tasks without additional fine-tuning. Avoid using this model for unstructured creative writing, where its bias toward rigid formatting could hurt flow and stylistic variation.</p>

<h2 class="section-title">Bias, Risks, and Limitations</h2>
<p>Because the model was trained to conform closely to precise agent formats, it may fixate on those formats even when a plain conversational response is expected. It also inherits the societal biases and hallucination risks of the base <code>Llama-3.1-8B</code> model.</p>

<div class="alert-box info">
  <h4>💡 Recommendations</h4>
  <p>Users should implement a validate-and-retry loop in their applications. Although the model scored 100% on parseability in the evaluation below, checking output syntax programmatically guards against malformed output on inputs outside the evaluated set, which matters most in critical workflows.</p>
</div>
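<p>A minimal sketch of such a validation retry loop follows. It assumes the expected output is a single JSON object and that <code>generate</code> is any callable mapping a prompt string to the model's text; both, along with the name <code>generate_with_retry</code>, are illustrative choices rather than part of the model's API.</p>

```python
import json
from typing import Callable


def generate_with_retry(
    generate: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call `generate` until its output parses as a JSON object, or give up."""
    base_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        text = generate(prompt)
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError as exc:
            last_error = str(exc)
        else:
            if isinstance(parsed, dict):
                return parsed
            last_error = "output was valid JSON but not an object"
        # Feed the error back so the next attempt can correct itself.
        prompt = (
            f"{base_prompt}\n\nYour previous output was invalid ({last_error}). "
            "Reply with a single JSON object only."
        )
    raise ValueError(f"no parseable output after {max_attempts} attempts: {last_error}")
```

<p>In practice <code>generate</code> could wrap a Transformers pipeline call or an HTTP request to an inference server; bounding the number of attempts keeps a persistently malformed response from stalling the agent loop.</p>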

<h2 class="section-title">How to Get Started with the Model</h2>

<p>Use the standard Transformers pipeline setup to initialize and prompt the model:</p>

<pre><code>import torch
import transformers

model_id = "Locutusque/Esmeralda-Llama-3.1-8B-control"

# Load the weights in bfloat16 and let device_map="auto" place them on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input; the pipeline applies the model's chat template.
messages = [
    {"role": "system", "content": "You are Esmeralda, an expert agentic assistant capable of executing complex tools accurately."},
    {"role": "user", "content": "Generate the tool arguments required to look up weather data for Paris and Tokyo simultaneously."}
]

outputs = pipeline(messages, max_new_tokens=256)
# generated_text holds the whole conversation; the last entry is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])</code></pre>

<h2 class="section-title">Training Details</h2>
<h3>Training Data</h3>
<p>The model was fine-tuned on <strong>4,625 examples</strong> selected from the <a href="https://huggingface.co/datasets/Locutusque/esmeralda-agentic">Locutusque/esmeralda-agentic</a> dataset. The data covers diverse multi-turn agentic conversations, step-by-step reasoning traces, and explicit tool-execution routines, chosen to maximize syntax compliance and analytical rigor.</p>
<h3>Training Hyperparameters</h3>
<ul>
<li><strong>Training regime:</strong> <code>bf16 mixed precision</code></li>
<li><strong>Formatting style:</strong> Standard Llama 3 Chat Template formatting</li>
</ul>
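<p>For readers unfamiliar with that format, the sketch below reproduces the general header layout of the Llama 3 chat template by hand. It is an illustration only: <code>format_llama3</code> is a hypothetical helper, and the tokenizer's own <code>apply_chat_template</code> method remains the authoritative source, since the template shipped with a given checkpoint may add further system text that is not shown here.</p>

```python
def format_llama3(messages, add_generation_prompt=True):
    """Render chat messages using the Llama 3 header-token layout (simplified)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example = format_llama3(
    [
        {"role": "system", "content": "You are Esmeralda."},
        {"role": "user", "content": "List your tools."},
    ]
)
print(example)
```
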
<h2 class="section-title">Evaluation</h2>
<div style="text-align: center; font-weight: bold; font-size: 1.2rem; margin-bottom: 0.5rem; color: #f8fafc;">
Llama 3.1 8B Benchmark Comparison
</div>
<div style="text-align: center; color: #94a3b8; font-size: 0.9rem; margin-bottom: 1.5rem;">
Comparing Esmeralda-Llama-3.1-8B-control against Llama 3.1 8B Instruct and Hermes-3-Llama-3.1-8B.
</div>

<div class="table-responsive">
<table class="custom-table">
<thead>
<tr>
<th>Benchmark</th>
<th>Esmeralda-Llama-3.1-8B-control</th>
<th>Llama 3.1 8B Instruct</th>
<th>Hermes-3-Llama-3.1-8B</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>HumanEval</strong></td>
<td><strong>57.3</strong></td>
<td>56.1</td>
<td>52.4</td>
</tr>
<tr>
<td><strong>MBPP</strong></td>
<td>53.2</td>
<td><strong>56.8</strong></td>
<td>48.2</td>
</tr>
<tr>
<td><strong>MMLU-Pro</strong></td>
<td><strong>31.9</strong></td>
<td>N/A</td>
<td>31.0</td>
</tr>
<tr>
<td><strong>GPQA Diamond</strong></td>
<td>15.7</td>
<td>15.7</td>
<td><strong>18.2</strong></td>
</tr>
<tr>
<td><strong>EQ-Bench</strong></td>
<td>59.2</td>
<td>61.1</td>
<td><strong>63.1</strong></td>
</tr>
<tr>
<td><strong>Percent Parseable</strong></td>
<td><strong>100.0</strong></td>
<td>92.4</td>
<td>91.2</td>
</tr>
</tbody>
</table>
</div>
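<p>As one illustration of what a parseability percentage measures, the sketch below assumes that "parseable" means an output loads as valid JSON. The check used to produce the numbers above may differ, and <code>percent_parseable</code> is a hypothetical helper.</p>

```python
import json


def percent_parseable(outputs):
    """Return the percentage of outputs that load as valid JSON."""
    if not outputs:
        return 0.0
    parsed_ok = 0
    for text in outputs:
        try:
            json.loads(text)
            parsed_ok += 1
        except json.JSONDecodeError:
            pass  # Malformed output counts as unparseable.
    return 100.0 * parsed_ok / len(outputs)


samples = ['{"a": 1}', "oops", "[1, 2]", '{"b":']
print(percent_parseable(samples))
```
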

<h3 style="color: #f1f5f9; margin-top: 2rem;">Visual Performance Overview</h3>
<div class="bar-container">
<div class="bar-title">HumanEval (Coding)</div>
<div class="bar-wrapper">
<div class="bar-label">Esmeralda-Llama-3.1-8B</div>
<div class="bar-track"><div class="bar-fill esmeralda" style="width: 57.3%;"></div></div>
<div class="bar-value">57.3</div>
</div>
<div class="bar-wrapper">
<div class="bar-label">Llama 3.1 Instruct</div>
<div class="bar-track"><div class="bar-fill llama" style="width: 56.1%;"></div></div>
<div class="bar-value">56.1</div>
</div>
<div class="bar-wrapper">
<div class="bar-label">Hermes-3</div>
<div class="bar-track"><div class="bar-fill hermes" style="width: 52.4%;"></div></div>
<div class="bar-value">52.4</div>
</div>
</div>
<div class="bar-container">
<div class="bar-title">MBPP (Python Coding)</div>
<div class="bar-wrapper">
<div class="bar-label">Esmeralda-Llama-3.1-8B</div>
<div class="bar-track"><div class="bar-fill esmeralda" style="width: 53.2%;"></div></div>
<div class="bar-value">53.2</div>
</div>
<div class="bar-wrapper">
<div class="bar-label">Llama 3.1 Instruct</div>
<div class="bar-track"><div class="bar-fill llama" style="width: 56.8%;"></div></div>
<div class="bar-value">56.8</div>
</div>
<div class="bar-wrapper">
<div class="bar-label">Hermes-3</div>
<div class="bar-track"><div class="bar-fill hermes" style="width: 48.2%;"></div></div>
<div class="bar-value">48.2</div>
</div>
</div>
<div class="bar-container">
<div class="bar-title">Percent Parseable (Syntax Stability)</div>
<div class="bar-wrapper">
<div class="bar-label">Esmeralda-Llama-3.1-8B</div>
<div class="bar-track"><div class="bar-fill esmeralda" style="width: 100%;"></div></div>
<div class="bar-value">100.0</div>
</div>
<div class="bar-wrapper">
<div class="bar-label">Llama 3.1 Instruct</div>
<div class="bar-track"><div class="bar-fill llama" style="width: 92.4%;"></div></div>
<div class="bar-value">92.4</div>
</div>
<div class="bar-wrapper">
<div class="bar-label">Hermes-3</div>
<div class="bar-track"><div class="bar-fill hermes" style="width: 91.2%;"></div></div>
<div class="bar-value">91.2</div>
</div>
</div>
<h3>Key Takeaways</h3>
<ul>
<li><strong>Esmeralda-Llama-3.1-8B-control</strong> slightly leads on HumanEval (57.3 vs. 56.1 and 52.4) despite using a relatively small fine-tuning dataset.</li>
<li><strong>Hermes-3-Llama-3.1-8B</strong> shows the strongest EQ-Bench and GPQA Diamond performance.</li>
<li><strong>Llama 3.1 8B Instruct</strong> scores highest on MBPP.</li>
<li><strong>Esmeralda-Llama-3.1-8B-control</strong> achieves the best parseability, at <strong>100%</strong>.</li>
</ul>
<h3>Interpretation</h3>
<p>Esmeralda-Llama-3.1-8B-control stays close to Llama 3.1 8B Instruct on the coding benchmarks (1.2 points higher on HumanEval, 3.6 points lower on MBPP) while raising parseability from 92.4% to 100%. Whereas Hermes-3 leads on EQ-Bench and GPQA Diamond, the Esmeralda control model focuses on output predictability and stable software integration.</p>
<div class="alert-box dragon">
<h4>🐉 Here Be Dragons</h4>
<p>The following results are exploratory and are not directly comparable to standard TruthfulQA leaderboard scores. In addition, Esmeralda-Llama-3.1-8B-control was quantized to 8-bit precision to speed up evaluation, so the reported scores may be slightly lower than what the model achieves at full precision.</p>
</div>
<h3>Experimental Truthfulness Evaluation</h3>
<p>Esmeralda-Llama-3.1-8B-control was evaluated on TruthfulQA using a freeform-generation setup rather than the standard multiple-choice MC1/MC2 methodology.</p>
<p><strong>Evaluation procedure:</strong></p>
<ol>
<li>The model generated unrestricted freeform answers.</li>
<li>A separate judge model, <code>Gemma 4 26B A4B</code>, was prompted to assign:
<ul>
<li><code>1</code> for correct/truthful answers</li>
<li><code>0</code> for incorrect/hallucinated answers</li>
</ul>
</li>
<li>The judge compared generations against the TruthfulQA reference answers.</li>
</ol>
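<p>The procedure above reduces to averaging binary judge labels. The sketch below shows that aggregation under stated assumptions: <code>judge</code> is a hypothetical callable standing in for the prompted judge model, and the toy judge in the usage lines is a simple substring check, not the actual judging prompt.</p>

```python
from typing import Callable, Iterable, Tuple


def judged_accuracy(
    examples: Iterable[Tuple[str, str, str]],
    judge: Callable[[str, str, str], int],
) -> float:
    """Average the 0/1 labels a judge assigns to (question, answer, reference) triples."""
    labels = []
    for question, answer, reference in examples:
        label = judge(question, answer, reference)
        if label not in (0, 1):
            raise ValueError(f"judge returned {label!r}; expected 0 or 1")
        labels.append(label)
    if not labels:
        raise ValueError("no examples to score")
    return sum(labels) / len(labels)


# Toy stand-in for the judge model: 1 if the reference appears in the answer.
def toy_judge(question: str, answer: str, reference: str) -> int:
    return int(reference.lower() in answer.lower())


score = judged_accuracy(
    [
        ("Capital of France?", "It is Paris.", "Paris"),
        ("Capital of Japan?", "Kyoto.", "Tokyo"),
        ("2 + 2?", "The answer is 4.", "4"),
    ],
    toy_judge,
)
print(round(score, 3))
```
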

<div class="table-responsive">
<table class="custom-table">
<thead>
<tr>
<th>Model</th>
<th>Evaluation Method</th>
<th>Score</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Esmeralda-Llama-3.1-8B-control</strong></td>
<td>TruthfulQA LLM Judge</td>
<td><strong>0.682</strong></td>
</tr>
<tr>
<td><strong>Hermes-3-Llama-3.1-8B</strong></td>
<td>TruthfulQA MC2 (self-reported)</td>
<td>0.587</td>
</tr>
</tbody>
</table>
</div>

<h3 style="color: #f1f5f9;">Notes</h3>
<ul>
<li>These numbers are <strong>not directly comparable</strong> due to differing evaluation setups.</li>
<li>MC2 evaluates constrained multiple-choice accuracy, while the Esmeralda evaluation measures freeform answer truthfulness judged semantically by an auxiliary LLM.</li>
<li>Manual inspection of sampled generations suggested the judge model behaved reliably for this experiment.</li>
<li>No official TruthfulQA score for Llama 3.1 8B Instruct could be located at the time of writing.</li>
</ul>
<p><em>This section is provided as an experimental reference rather than a standardized leaderboard claim.</em></p>
</div>