ModelHub XC bebecacbde Initialize project; model provided by the ModelHub XC community
Model: Locutusque/Esmeralda-Llama-3.1-8B-control
Source: Original Platform
2026-06-04 14:19:40 +08:00

library_name: transformers
tags: agent, tool-use, function-calling, llama-3.1, text-generation-inference
model_creator: Locutusque
license: llama3.1
language: en
pipeline_tag: text-generation
datasets: Locutusque/esmeralda-agentic

Esmeralda Llama 3.1 8B Control

An advanced, high-parseability agentic language model optimized for structural consistency, tool-use execution, and stable conversational automation.

Model Details

Esmeralda-Llama-3.1-8B-control is a specialized agentic model fine-tuned on the Locutusque/esmeralda-agentic dataset. The dataset is designed to train models in agentic routines, structured system prompt adherence, reasoning loops, and multi-turn tool interactions. This control variant prioritizes stable output syntax (it reaches a 100% parseability rate in the evaluation reported in this card) to prevent runtime failures in downstream orchestration frameworks such as LangChain, CrewAI, or AutoGen. It is the *control* model of the Esmeralda family: the first to be released, serving as a proof of concept.

Developed by: Locutusque (Sebastian Gabarain)
Model type: Transformer Decoder (LLM)
Language(s) (NLP): English
License: Llama-3.1 License
Finetuned from: Llama-3.1-8B

Model Sources

Uses

Direct Use

This model is built for AI agent loops, multi-turn function calling, programmatic tool use, and structured data extraction. It is designed to take complex API schemas or system configurations as input and emit predictable, well-formed output that execution environments can consume directly.

Downstream Use

The model is best used as the core reasoning component of an autonomous agent architecture. It works best when paired with a strict parser or execution layer that depends on well-formed structured output (JSON, XML blocks, or custom agent formatting syntax).
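As an illustration of the "strict parser or execution layer" mentioned above, here is a minimal sketch of a dispatcher that parses a JSON tool call and runs the matching function. The tool name `get_weather`, the `TOOLS` registry, and the `{"name": ..., "arguments": ...}` call shape are assumptions made for this example; the model card does not specify the output format the model was trained on.

```python
import json


# Hypothetical tool: a stand-in for a real weather lookup.
def get_weather(city: str) -> str:
    return f"(stub) weather for {city}"


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(raw_output: str) -> str:
    """Parse a JSON tool call and run the matching tool.

    Assumed call shape (not taken from the model card):
    {"name": "<tool name>", "arguments": {...}}
    """
    call = json.loads(raw_output)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    arguments = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("'arguments' must be a JSON object")
    return TOOLS[name](**arguments)


example = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(execute_tool_call(example))
```

Rejecting unknown tool names and non-object arguments before calling anything keeps a malformed generation from reaching real side effects.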

Out-of-Scope Use

Not intended for heavy multilingual generation or specialized multi-modal tasks without additional fine-tuning. Avoid using this model for unstructured creative writing, where its bias toward structured output could limit flow and stylistic variation.

Bias, Risks, and Limitations

Because the model was trained to conform closely to precise agent formats, it may fixate on those formats even when a plain conversational response is expected. It also inherits the societal biases and hallucination risks of the base Llama-3.1-8B model.

💡 Recommendations

Users should implement a validation retry loop in their applications. Although the model reaches a 100% parseability rate in the evaluation reported in this card, validating output syntax programmatically guards against malformed output in critical enterprise workflows.
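A validation retry loop of the kind recommended above could look like the following minimal sketch. It assumes the expected output is a single JSON object; `generate_validated` and `fake_generate` are hypothetical names, and `generate` stands for whatever function calls the model in your application.

```python
import json
from typing import Callable


def generate_validated(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call `generate` until its output parses as a JSON object, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # malformed JSON: try again
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Stand-in generator for demonstration only: fails once, then succeeds.
_responses = iter(["not json", '{"city": "Tokyo"}'])


def fake_generate(prompt: str) -> str:
    return next(_responses)


print(generate_validated(fake_generate, "Return the weather arguments as JSON."))
```

In a real application the retry could also append the parse error to the prompt so the model can correct itself; that choice is left open here.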

How to Get Started with the Model

Use the standard Transformers pipeline setup to initialize and prompt the model:

import transformers
import torch

model_id = "Locutusque/Esmeralda-Llama-3.1-8B-control"

# Build a text-generation pipeline; bf16 weights, devices assigned automatically.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto"
)

# Chat-style input: the pipeline applies the model's chat template to these messages.
messages = [
    {"role": "system", "content": "You are Esmeralda, an expert agentic assistant capable of executing complex tools accurately."},
    {"role": "user", "content": "Generate the tool arguments required to lookup weather data for Paris and Tokyo simultaneously."}
]

outputs = pipeline(messages, max_new_tokens=256)
# The last message in the returned conversation is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])

Training Details

Training Data

The model was fine-tuned on 4,625 examples curated from the Locutusque/esmeralda-agentic dataset. The data covers diverse multi-turn agentic workflows, step-by-step reasoning traces, and explicit tool execution routines, chosen to maximize syntax compliance and sustained analytical reasoning.

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Formatting style: Standard Llama 3 Chat Template formatting
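For orientation, the Llama 3 chat layout referred to above can be approximated by the sketch below. This is a simplified, hand-written rendering (the function name `render_llama3_chat` is hypothetical); the chat template shipped with the model's tokenizer is authoritative and may add further content, such as tool definitions, so real code should rely on the tokenizer's own template rather than on this function.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Simplified rendering of the Llama 3 chat layout (no tool handling)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content'].strip()}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are Esmeralda."},
    {"role": "user", "content": "List the tools you can call."},
]
print(render_llama3_chat(demo_messages))
```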

Evaluation

Llama 3.1 8B Benchmark Comparison
Comparing Esmeralda-Llama-3.1-8B-control against Llama 3.1 8B Instruct and Hermes-3-Llama-3.1-8B.
| Benchmark | Esmeralda-Llama-3.1-8B-control | Llama 3.1 8B Instruct | Hermes-3-Llama-3.1-8B |
| --- | --- | --- | --- |
| HumanEval | 57.3 | 56.1 | 52.4 |
| MBPP | 53.2 | 56.8 | 48.2 |
| MMLU-Pro | 31.9 | n/a | 31.0 |
| GPQA Diamond | 15.7 | 15.7 | 18.2 |
| EQ-Bench | 59.2 | 61.1 | 63.1 |
| Percent Parseable | 100.0 | 92.4 | 91.2 |

Visual Performance Overview

[Bar charts; values repeat the benchmark table above]
HumanEval (Coding): Esmeralda-Llama-3.1-8B 57.3, Llama 3.1 Instruct 56.1, Hermes-3 52.4
MBPP (Python Coding): Esmeralda-Llama-3.1-8B 53.2, Llama 3.1 Instruct 56.8, Hermes-3 48.2
Percent Parseable (Syntax Stability): Esmeralda-Llama-3.1-8B 100.0, Llama 3.1 Instruct 92.4, Hermes-3 91.2

Key Takeaways

  • Esmeralda-Llama-3.1-8B-control slightly leads on HumanEval despite using a relatively small finetuning dataset.
  • Hermes-3-Llama-3.1-8B shows the strongest EQ-Bench and GPQA performance.
  • Llama 3.1 8B Instruct remains the strongest of the three on MBPP.
  • Esmeralda-Llama-3.1-8B-control achieves the best parseability at an absolute 100%.
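The takeaways above can be checked directly against the benchmark table. The sketch below copies the table values into a dictionary (with `None` for the one missing score) and reports the top-scoring model per benchmark; the shortened model labels and the `leaders` helper are introduced here for illustration only.

```python
# Scores copied from the benchmark table in this card; None marks the missing value.
SCORES = {
    "HumanEval": {"Esmeralda": 57.3, "Llama 3.1 Instruct": 56.1, "Hermes-3": 52.4},
    "MBPP": {"Esmeralda": 53.2, "Llama 3.1 Instruct": 56.8, "Hermes-3": 48.2},
    "MMLU-Pro": {"Esmeralda": 31.9, "Llama 3.1 Instruct": None, "Hermes-3": 31.0},
    "GPQA Diamond": {"Esmeralda": 15.7, "Llama 3.1 Instruct": 15.7, "Hermes-3": 18.2},
    "EQ-Bench": {"Esmeralda": 59.2, "Llama 3.1 Instruct": 61.1, "Hermes-3": 63.1},
    "Percent Parseable": {"Esmeralda": 100.0, "Llama 3.1 Instruct": 92.4, "Hermes-3": 91.2},
}


def leaders(scores):
    """Return, per benchmark, the model(s) with the highest reported score."""
    result = {}
    for benchmark, row in scores.items():
        reported = {model: s for model, s in row.items() if s is not None}
        best = max(reported.values())
        # Keep every model that ties for the top score.
        result[benchmark] = sorted(m for m, s in reported.items() if s == best)
    return result


for benchmark, models in leaders(SCORES).items():
    print(f"{benchmark}: {', '.join(models)}")
```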

Interpretation

Esmeralda-Llama-3.1-8B-control stays close to Llama 3.1 8B Instruct on the coding benchmarks (slightly ahead on HumanEval, behind on MBPP) while raising the parseability rate from 92.4% to 100%. Whereas Hermes-3 emphasizes conversational reasoning and fluid persona characteristics, the Esmeralda control model focuses on output predictability and software integration stability.

🐉 Here Be Dragons

The following results are exploratory and are not directly comparable to standard TruthfulQA leaderboard scores. In addition, the Esmeralda-Llama-3.1-8B-control model was quantized to 8-bit precision to speed up evaluation, which may lower its scores slightly relative to full-precision inference.

Experimental Truthfulness Evaluation

Esmeralda-Llama-3.1-8B-control was evaluated on TruthfulQA using a freeform-generation setup rather than the standard multiple-choice MC1/MC2 methodology.

Evaluation procedure:

  1. The model generated unrestricted freeform answers.
  2. A separate judge model — Gemma 4 26B A4B — was prompted to assign:
    • 1 for correct/truthful answers
    • 0 for incorrect/hallucinated answers
  3. The judge compared generations against the TruthfulQA reference answers.
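The scoring step of the procedure above amounts to averaging binary verdicts. The sketch below shows that aggregation under stated assumptions: `truthfulness_score` and `toy_judge` are hypothetical names, the two examples are made up for illustration (they are not TruthfulQA items), and the string-matching `toy_judge` is only a placeholder for the prompted judge model, whose prompt is not given in this card.

```python
from typing import Callable, List, Sequence, Tuple

# (question, model answer, reference answers)
Example = Tuple[str, str, List[str]]


def truthfulness_score(examples: Sequence[Example],
                       judge: Callable[[str, str, List[str]], int]) -> float:
    """Average binary judge verdicts (1 = truthful, 0 = incorrect/hallucinated)."""
    if not examples:
        raise ValueError("no examples to score")
    verdicts = []
    for question, answer, references in examples:
        verdict = judge(question, answer, references)
        if verdict not in (0, 1):
            raise ValueError(f"judge returned a non-binary verdict: {verdict!r}")
        verdicts.append(verdict)
    return sum(verdicts) / len(verdicts)


def toy_judge(question: str, answer: str, references: List[str]) -> int:
    """Placeholder for the LLM judge: checks whether a reference string appears in the answer."""
    return int(any(ref.lower() in answer.lower() for ref in references))


# Made-up examples for demonstration only.
examples = [
    ("What is the capital of France?", "Paris is the capital.", ["paris"]),
    ("How many legs does a spider have?", "Six.", ["eight"]),
]
print(truthfulness_score(examples, toy_judge))
```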
| Model | Evaluation Method | Score |
| --- | --- | --- |
| Esmeralda-Llama-3.1-8B-control | TruthfulQA LLM Judge | 0.682 |
| Hermes-3-Llama-3.1-8B | TruthfulQA MC2 (self-reported) | 0.587 |

Notes

  • These numbers are not directly comparable due to differing evaluation setups.
  • MC2 evaluates constrained multiple-choice accuracy, while the Esmeralda evaluation measures freeform answer truthfulness judged semantically by an auxiliary LLM.
  • Manual inspection of sampled generations suggested the judge model behaved reliably for this experiment.
  • No official TruthfulQA score for Llama 3.1 8B Instruct could be located at the time of writing.

This section is provided as an experimental reference rather than a standardized leaderboard claim.
