Initialize project; model provided by the ModelHub XC community

Model: domyn/Domyn-Small-v1.0
Source: Original Platform
ModelHub XC
2026-06-01 12:09:16 +08:00
commit d1020dd6c0
15 changed files with 9709 additions and 0 deletions

36
.gitattributes vendored Normal file

@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text

395
README.md Normal file

@@ -0,0 +1,395 @@
---
library_name: transformers
license: other
license_name: mit
license_link: https://www.domyn.com/legal/software-licenses/domyn-small
pipeline_tag: text-generation
language:
- en
- it
- es
- fr
- de
tags:
- reasoning
- dual-mode
- thinking
- tool-calling
- agentic
- multilingual
---
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://cdn.prod.website-files.com/682dcb35e7fd7a313cae803a/6835cb0fb6bd22a7cba7d645_domyn-logo-payoff-primary-white.svg">
<img alt="Domyn" src="https://cdn.prod.website-files.com/682dcb35e7fd7a313cae803a/6835ca86622d317630fa3861_domyn-logo-primary.svg" width="400">
</picture>
</p>
# Domyn Small
Domyn Small is a 10B-parameter open-weight reasoning model designed for resource-constrained, agentic, and fine-tunable deployments. It pairs a dual-mode (thinking on/off) inference design with grouped-query attention, a native 32k context window (extensible to 131k via YaRN), and tool calling. On several reasoning benchmarks it reaches accuracy comparable to leading 7-10B reasoning peers while spending roughly **2-4× fewer reasoning tokens**, placing it on a favourable accuracy/cost Pareto frontier for production inference and downstream fine-tuning.
Fine-tune Domyn Small to your domain to unlock its real power and to retain full ownership and control over the resulting model.
## Highlights
- **Token-efficient reasoning** — ~32% of Qwen3.5-9B's reasoning-token budget and ~35% of OLMo-3-7B-Think's at comparable accuracy on several reasoning tasks ([Token Efficiency](#token-efficiency)).
- **Dual-mode inference** — `thinking on` for deep multi-step reasoning, `thinking off` for fast, compact output. Toggleable from the system prompt or the API.
- **Tool calling** — first-class function calling via `<tool_call>` XML tags, with a chat template that handles tool injection automatically. Strong BFCL V3 single-turn results (75.9 Non-Live / 68.3 Live) at ~280 mean tokens per problem.
- **Expandable context** — 32,768 tokens natively, extensible to 131,072 (128k) via YaRN at inference time.
- **Multilingual** — 50+ languages with explicit coverage; optimised for English and the Tier-A European set (Italian, Spanish, French, German).
## Model Overview
- **Developed by**: Domyn S.p.A.
- **Version**: 1.0
- **Released and last updated on**: May 2026
- **Input / Output**: Text-only / Text-only
- **Model size**: ~10B parameters
- **Attention**: Grouped-Query Attention (48 query heads, 8 KV heads)
- **Tokenizer**: 256,000-token SentencePiece BPE vocabulary
- **Native context**: 32,768 tokens
- **Extended context**: 131,072 tokens (YaRN, 4× at inference time)
- **Language(s)**: 50+ languages; optimised for English and the Tier-A European set (Italian, Spanish, French, German)
- **Base model**: Initialised from Italia 10B and continually pre-trained on 503B tokens
- **Knowledge cut-off date**: September 2024 (based on pre-training dataset cut-off)
- **License**: MIT
A full architecture and training-recipe specification is available in the Domyn Small technical report.
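The ~10B figure can be cross-checked against the values in the shipped `config.json`. The sketch below is a back-of-the-envelope count, not an official breakdown. It assumes an ungated two-matrix MLP (`up_proj` and `down_proj`), two LayerNorms per block with weight and bias, and one final norm with weight and bias; the final-norm shape in particular is an assumption.

```python
# Shapes copied from config.json in this checkpoint.
hidden_size = 4096
intermediate_size = 16384
num_layers = 40
num_q_heads = 48
num_kv_heads = 8
head_dim = 128
vocab_size = 256_000

# Attention projections (attention_bias is false, so no bias terms).
q_proj = hidden_size * num_q_heads * head_dim
k_proj = hidden_size * num_kv_heads * head_dim
v_proj = k_proj
o_proj = num_q_heads * head_dim * hidden_size
attention = q_proj + k_proj + v_proj + o_proj

# Ungated MLP: up_proj and down_proj only (mlp_bias is false).
mlp = 2 * hidden_size * intermediate_size

# Two LayerNorms per block, each with a weight and a bias vector.
norms = 2 * 2 * hidden_size

per_layer = attention + mlp + norms

# Untied input and output embeddings (tie_word_embeddings is false).
embeddings = 2 * vocab_size * hidden_size

# Assumption: one final LayerNorm with weight and bias.
final_norm = 2 * hidden_size

total_parameters = num_layers * per_layer + embeddings + final_norm
bf16_bytes = 2 * total_parameters  # two bytes per bfloat16 value

print(f"{total_parameters:,} parameters, {bf16_bytes / 1e9:.1f} GB in bfloat16")
```

Under these assumptions the count comes to 9,815,334,912 parameters, which matches the `total_parameters` entry in the safetensors index, and doubling it for bfloat16 gives the index's `total_size` of 19,630,669,824 bytes.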
## Quickstart
```python
from openai import OpenAI

# Point the OpenAI client at a running vLLM server.
client = OpenAI(
    base_url="http://<your-vllm-host>/v1",
    api_key="none",
)

response = client.chat.completions.create(
    model="domyn/Domyn-Small-v1.0",
    messages=[
        {"role": "system", "content": "You are Domyn Small, a helpful assistant."},
        {"role": "user", "content": "What is the capital of Italy?"},
    ],
)
print(response.choices[0].message.content)
```
## Deployment
> We recommend **vLLM ≥ 0.9.2** for all the snippets below.
### vLLM — Basic
```bash
vllm serve domyn/Domyn-Small-v1.0 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.9
```
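When choosing `--max-model-len` and `--max-num-seqs`, it helps to know how much KV cache one token costs. The sketch below is a rough estimate from the `config.json` values (40 layers, 8 KV heads, head dimension 128, bfloat16). It ignores vLLM's block-allocation overhead and activations, so treat the numbers as a lower bound rather than a measurement.

```python
num_layers = 40
num_kv_heads = 8
num_q_heads = 48
head_dim = 128
bytes_per_value = 2  # bfloat16

def kv_cache_bytes(num_tokens: int, kv_heads: int = num_kv_heads) -> int:
    """Bytes of K and V cache for num_tokens tokens across all layers."""
    return 2 * num_layers * kv_heads * head_dim * bytes_per_value * num_tokens

GIB = 1024 ** 3

per_token = kv_cache_bytes(1)
native = kv_cache_bytes(32_768)
extended = kv_cache_bytes(131_072)
# Hypothetical comparison: the same model with one KV head per query head.
mha_native = kv_cache_bytes(32_768, kv_heads=num_q_heads)

print(f"per token:        {per_token / 1024:.0f} KiB")
print(f"one 32k sequence: {native / GIB:.0f} GiB")
print(f"one 128k sequence: {extended / GIB:.0f} GiB")
print(f"32k without GQA:  {mha_native / GIB:.0f} GiB")
```

A single full-length 32k sequence needs about 5 GiB of cache on top of roughly 19.6 GB of weights, so a setting such as `--max-num-seqs 256` only pays off when most requests are far shorter than the maximum length.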
### vLLM — With Reasoning Parsing
To have vLLM automatically extract the model's `<think>` blocks and expose them as a structured `reasoning_content` field, add a reasoning-parser flag. Which flag to use depends on your vLLM version.
**vLLM < 0.21.0** — Domyn Small emits the same `<think>…</think>` format as OLMo 3, and earlier vLLM releases work with the OLMo 3 parser directly:
```bash
vllm serve domyn/Domyn-Small-v1.0 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.9 \
--reasoning-parser olmo3
```
**vLLM ≥ 0.21.0 (recommended)** — use the Domyn-specific parser plugin shipped with this checkpoint (`reasoning_parser_plugin.py`). It reads the per-request `enable_thinking` flag (or the `thinking on` / `thinking off` system-prompt directive) and routes streamed output to the correct lane (`reasoning` vs `content`) for both modes.
```bash
vllm serve domyn/Domyn-Small-v1.0 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.9 \
--reasoning-parser think_block \
--reasoning-parser-plugin /path/to/reasoning_parser_plugin.py
```
Replace `/path/to/` with the actual path to the plugin file bundled with the checkpoint. The parser name `think_block` is the registration string declared inside the plugin and must match exactly.
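Once a reasoning parser is active, the reasoning text arrives in a separate field from the final answer. The helper below is a minimal sketch that reads either field name mentioned above (`reasoning_content` or `reasoning`); which one your server populates depends on the vLLM version. `split_message` is a hypothetical name, not part of any library.

```python
from types import SimpleNamespace

def split_message(message):
    """Return (reasoning, answer) from a chat-completion message object.

    Both field names are checked because they differ between vLLM releases.
    """
    reasoning = (
        getattr(message, "reasoning_content", None)
        or getattr(message, "reasoning", None)
        or ""
    )
    answer = message.content or ""
    return reasoning, answer

# Stand-in for response.choices[0].message so the sketch runs offline.
demo = SimpleNamespace(reasoning_content="17 x 24 = 340 + 68 = 408", content="408")
reasoning, answer = split_message(demo)
print("reasoning:", reasoning)
print("answer:", answer)
```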
### vLLM — Extended Context with YaRN
> YaRN scaling may impact model quality on inputs shorter than 32k. Enable it only when you actually need contexts beyond the native 32,768-token window.
```bash
# vLLM < 0.12.0
vllm serve domyn/Domyn-Small-v1.0 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--rope-scaling '{"rope_type": "yarn", "factor": 4, "original_max_position_embeddings": 32768}' \
--max-model-len 131072

# vLLM >= 0.12.0
vllm serve domyn/Domyn-Small-v1.0 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--hf-overrides '{"rope_parameters": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}}' \
--max-model-len 131072
```
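The scaling factor is the target context length divided by the native one (131,072 / 32,768 = 4). If you generate the launch command from a script, a small helper like the one below keeps the two numbers consistent. `yarn_overrides` is a hypothetical helper; the JSON keys are the ones used in the `--hf-overrides` example above.

```python
import json

NATIVE_CONTEXT = 32_768

def yarn_overrides(target_context: int) -> str:
    """Build the JSON string for --hf-overrides for a given target length."""
    if target_context <= NATIVE_CONTEXT:
        raise ValueError("YaRN is only needed beyond the native 32,768-token window")
    factor = target_context / NATIVE_CONTEXT
    return json.dumps({
        "rope_parameters": {
            "rope_type": "yarn",
            "factor": factor,
            "original_max_position_embeddings": NATIVE_CONTEXT,
        }
    })

print(yarn_overrides(131_072))
```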
### vLLM — With Tool Calling
Tool calling requires three extra flags and the bundled plugin files (shipped with this model checkpoint):
```bash
vllm serve domyn/Domyn-Small-v1.0 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.9 \
--enable-auto-tool-choice \
--tool-call-parser xml_tool_call \
--tool-parser-plugin /path/to/tool_parser_plugin.py \
--chat-template /path/to/chat_template.jinja
```
Replace `/path/to/` with the actual paths to the files bundled with the checkpoint.
### Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "domyn/Domyn-Small-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "system",
        "content": "You are Domyn Small, a helpful assistant. thinking on",
    },
    {"role": "user", "content": "Solve step by step: what is 17 × 24?"},
]

# return_dict=True yields a mapping with input_ids and attention_mask,
# which is what generate(**inputs) and the slicing below expect.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(
    tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1] :], skip_special_tokens=True
    )
)
```
## Thinking Mode
Domyn Small supports chain-of-thought reasoning controlled by a directive in the system prompt:
- **Thinking off** (default): omit the directive, or include `thinking off`.
- **Thinking on**: append `thinking on` to your system prompt.
```python
messages = [
    {"role": "system", "content": "You are Domyn Small, a helpful assistant. thinking on"},
    {"role": "user", "content": "Solve step by step: what is 17 × 24?"},
]
```
When thinking is on, the model emits its reasoning inside `<think>…</think>` tags before the final answer.
Alternatively, you can control reasoning by passing `enable_thinking` as an extra request parameter. This has the same effect as adding `thinking on` to the system prompt. Because `enable_thinking` is not part of the standard OpenAI schema, it must be forwarded to vLLM via the OpenAI client's `extra_body` field:
```python
response = client.chat.completions.create(
    model="domyn/Domyn-Small-v1.0",
    messages=[
        {"role": "user", "content": "Solve step by step: what is 17 × 24?"},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
```
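If the server runs without a reasoning parser, the reasoning arrives inline in `content`. Because the bundled chat template already writes the opening `<think>` into the generation prompt, the generated text may contain only the closing `</think>` tag. The sketch below splits on the closing tag and tolerates a missing opening tag. `split_think` is a hypothetical helper name, and the exact output on your server may differ, so inspect a raw response first.

```python
def split_think(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer)."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag: treat everything as the answer. This covers
        # thinking off, but also output truncated in the middle of reasoning.
        return "", text.strip()
    # Drop the opening tag if the server echoed it back.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()

raw = "17 × 24 = 17 × 20 + 17 × 4 = 340 + 68 = 408\n</think>\n\nThe answer is 408."
reasoning, answer = split_think(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```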
### Recommended Sampling Parameters
| Mode | temperature | top_p | top_k | min_p |
|------|-------------|-------|-------|-------|
| Thinking **off** | 0.1 | 0.95 | 50 | 0.1 |
| Thinking **on** | 0.6 | 0.90 | 25 | 0.1 |
> Do **not** use greedy decoding in thinking mode — it degrades reasoning quality and may cause repetition.
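The table can be applied per request. `temperature` and `top_p` are standard OpenAI parameters, while `top_k` and `min_p` are vLLM extensions that travel through `extra_body`, alongside the `enable_thinking` flag. The sketch below bundles them in one place; `request_kwargs` is a hypothetical helper, not part of the model's tooling.

```python
# Recommended values from the table above, keyed by thinking mode.
SAMPLING = {
    False: {"temperature": 0.1, "top_p": 0.95, "top_k": 50, "min_p": 0.1},
    True: {"temperature": 0.6, "top_p": 0.90, "top_k": 25, "min_p": 0.1},
}

def request_kwargs(thinking: bool) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    params = SAMPLING[thinking]
    return {
        "temperature": params["temperature"],
        "top_p": params["top_p"],
        "extra_body": {
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }

kwargs = request_kwargs(thinking=True)
# Usage, assuming `client` and `messages` are defined as in the Quickstart:
# client.chat.completions.create(model="domyn/Domyn-Small-v1.0", messages=messages, **kwargs)
print(kwargs)
```

Note that in the bundled chat template a `thinking on` directive in the system prompt still enables reasoning even when `enable_thinking` is false, so keep the two controls consistent.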
## Tool Calling
### How It Works
Domyn Small has been trained to call functions using `<tool_call>` XML tags. The chat template handles tool formatting automatically: **you do not need to write tool instructions in your system prompt.**
When you pass a `tools` list to the API, the chat template prepends a structured tool-instruction block to the system prompt automatically. Your own system message (for persona or context) is appended after that block. The final rendered system block looks like:
```
<auto-generated tool instruction containing the tools JSON>
<your system message>
thinking on/off
```
This means your system prompt stays clean — just describe the assistant's persona or context.
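The assembly order can be mirrored in a few lines of Python. This is an illustrative re-implementation of what the bundled `chat_template.jinja` does when a system message is present, not the template itself. `render_system_block` is a hypothetical name, and the `tool_instruction` argument stands in for the auto-generated text.

```python
def render_system_block(system_message: str, tool_instruction: str = "",
                        enable_thinking: bool = False) -> str:
    """Mimic the system block the chat template renders (sketch only)."""
    thinking = enable_thinking or "thinking on" in system_message
    # Directives are stripped from the user text and re-appended as the last line.
    clean = (
        system_message.replace("thinking on", "").replace("thinking off", "").strip()
    )
    lines = ["<extra_id_0>System"]
    if tool_instruction:
        lines.append(tool_instruction)
    if clean:
        lines.append(clean)
    lines.append("thinking on" if thinking else "thinking off")
    return "\n".join(lines) + "\n"

block = render_system_block(
    "You are Domyn Small, a helpful assistant. thinking on",
    tool_instruction="<auto-generated tool instruction containing the tools JSON>",
)
print(block)
```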
### Python Example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<your-vllm-host>/v1",
    api_key="none",
)

# Tool schema in the standard OpenAI function-calling format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather_forecast",
            "description": "Get the weather forecast for a location on a given date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
                },
                "required": ["location", "date"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="domyn/Domyn-Small-v1.0",
    messages=[
        {"role": "system", "content": "You are Domyn Small, a helpful assistant."},
        {"role": "user", "content": "What's the weather like in Rome today?"},
    ],
    tools=tools,
    temperature=0.0,
)

# Print every function call the model requested.
choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    for tc in choice.message.tool_calls:
        print(f"Function: {tc.function.name}")
        print(f"Arguments: {tc.function.arguments}")
```
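After the model requests a call, the application runs the function and sends the result back as a `tool` message so the model can write the final answer. The sketch below builds that follow-up message list. The local `get_weather_forecast` is a hypothetical stub with made-up return values, and `build_followup` is an illustrative helper, not part of the bundled plugins.

```python
import json

def get_weather_forecast(location: str, date: str) -> dict:
    # Hypothetical stub: a real implementation would query a weather service.
    return {"location": location, "date": date, "forecast": "sunny", "high_c": 27}

LOCAL_TOOLS = {"get_weather_forecast": get_weather_forecast}

def build_followup(messages: list, tool_calls: list) -> list:
    """Append the assistant's tool calls and their results to the history.

    tool_calls is a list of dicts with "id", "name", and a JSON-string "arguments".
    The input list is copied, not modified.
    """
    followup = list(messages)
    followup.append({
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": tc["id"],
                "type": "function",
                "function": {"name": tc["name"], "arguments": tc["arguments"]},
            }
            for tc in tool_calls
        ],
    })
    for tc in tool_calls:
        result = LOCAL_TOOLS[tc["name"]](**json.loads(tc["arguments"]))
        followup.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": json.dumps(result),
        })
    return followup

history = [{"role": "user", "content": "What's the weather like in Rome today?"}]
calls = [{
    "id": "call_0",
    "name": "get_weather_forecast",
    "arguments": '{"location": "Rome", "date": "2026-06-01"}',
}]
followup = build_followup(history, calls)
print([m["role"] for m in followup])
```

Passing `followup` as `messages` in a second `client.chat.completions.create(...)` call, with the same `tools` list, lets the model turn the tool result into a natural-language answer.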
## Evaluations
Domyn Small is evaluated against four peer models in the 7-10B parameter class: **Qwen3.5-9B**, **OLMo-3-7B-Think**, **Llama-3.1-Nemotron-Nano-8B-v1**, and **Ministral-3-8B-Reasoning**. All scores are in thinking-on mode at 32,768-token sequence length (RULER extends to 131,072 via YaRN).
| Category | Benchmark | Domyn Small | Qwen3.5-9B | OLMo-3-7B-Think | Nemotron-Nano | Ministral-3-8B |
|---|---|---|---|---|---|---|
| Reasoning | MATH-500 | **93.2** | 97.4 | 96.8 | 95.4 | 89.2 |
| | AIME 2025 (avg@48) | **35.7** | 90.0 | 70.4 | 51.2 | 32.3 |
| | GPQA-Diamond | **50.0** | 82.7 | 50.8 | 42.4 | 43.9 |
| Code | HumanEval (pass@1) | **96.3** | 93.3 | 95.7 | 91.5 | 86.6 |
| | LiveCodeBench (pass@1)| **55.0** | 86.2 | 74.8 | 67.2 | 46.0 |
| | MBPP (pass@1) | **76.8** | 76.8 | 86.6 | 77.6 | 66.6 |
| General Knowledge | MMLU | **80.3** | 84.6 | 75.2 | 56.0 | 75.3 |
| | MMLU-PRO | **67.7** | 84.4 | 64.0 | 28.8 | 62.0 |
| Instruction | IFEval (strict) | **79.9** | 91.0 | 83.7 | 70.4 | 62.5 |
| Multilingual | MGSM | **73.1** | 88.9 | 64.0 | 19.9 | 75.5 |
| Long context | RULER 32k | **59.5** | 89.8 | 69.8 | 34.0 | 88.7 |
| | RULER 64k | **29.6** | 87.9 | 17.2 | 18.7 | 85.9 |
| Tool calling | BFCL V3 Non-Live | **75.9** | 78.1 | 61.1 | 63.3 | — |
| | BFCL V3 Live | **68.3** | 78.4 | 66.9 | 40.2 | — |
| | BFCL V3 Multi-Turn | **7.0** | 50.6 | 2.1 | 0.1 | — |
Domyn Small attains its single-turn BFCL results at ~280 mean tokens per problem against ~590 for Qwen3.5-9B and ~2,429 for OLMo-3-7B-Think — the best accuracy-per-token tool-calling profile in the peer set among models that fully engage the reasoning path. Ministral-3-8B is excluded from the BFCL comparison: during evaluation it consistently failed to close the `[/THINK]` reasoning delimiter, making its structured outputs unparseable by the benchmark.
## Token Efficiency
The table below compares mean generated tokens per problem (thinking on, lower is better) against the strongest accuracy peer in the set, Qwen3.5-9B. Grand means weight each benchmark by its problem count.
| Category | Benchmark | Domyn Small | Qwen3.5-9B |
|---|---|---|---|
| Reasoning | MATH-500 | **2,261** | 7,614 |
| | AIME 2025 | **5,190** | 18,668 |
| | GPQA-Diamond | **3,396** | 8,976 |
| | **Grand mean** | **2,690** | 8,440 |
| Code | HumanEval | **1,884** | 1,144 |
| | LCB-Gen | **5,010** | 12,739 |
| | MBPP | **2,420** | 1,927 |
| | **Grand mean** | **3,312** | 5,870 |
| General Knowledge | MMLU | **1,236** | 3,262 |
| | MMLU-PRO | **2,947** | 4,666 |
| | **Grand mean** | **2,026** | 3,910 |
| Instruction | IFEval | **775** | 3,874 |
| Multilingual | MGSM | **796** | 3,140 |
On the reasoning suite Domyn Small produces approximately **32% of Qwen3.5-9B's** token budget — a 3.1× saving at comparable accuracy on several benchmarks.
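The reasoning grand means can be reproduced from the per-benchmark rows. The sketch below assumes the usual problem counts for these benchmarks (500 for MATH-500, 30 for AIME 2025, 198 for GPQA-Diamond); the counts are not stated in this card, so treat them as an assumption.

```python
# (problem count, Domyn Small mean tokens, Qwen3.5-9B mean tokens)
# Problem counts are assumed; token means come from the table above.
reasoning_suite = {
    "MATH-500": (500, 2261, 7614),
    "AIME 2025": (30, 5190, 18668),
    "GPQA-Diamond": (198, 3396, 8976),
}

total_problems = sum(count for count, _, _ in reasoning_suite.values())

# Weight each benchmark's mean by its number of problems.
domyn_mean = sum(c * d for c, d, _ in reasoning_suite.values()) / total_problems
qwen_mean = sum(c * q for c, _, q in reasoning_suite.values()) / total_problems
ratio = domyn_mean / qwen_mean

print(f"Domyn Small: {domyn_mean:,.0f} tokens per problem")
print(f"Qwen3.5-9B:  {qwen_mean:,.0f} tokens per problem")
print(f"ratio: {ratio:.2f} ({1 / ratio:.1f}x fewer tokens)")
```

With these counts the weighted means round to 2,690 and 8,440, a ratio of about 0.32, or roughly 3.1× fewer tokens.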
## Dual-Mode Comparison (Thinking ON vs. OFF)
Effect of the reasoning toggle on Domyn Small. Same evaluation harness; thinking-on AIME 2025 is reported as avg@48, other thinking-on entries are single-pass.
| Benchmark | Thinking off | Thinking on | Δ |
|---|---|---|---|
| MATH-500 | 91.4 | 93.2 | +1.8 |
| AIME 2025 | 31.0 | 35.7 | +4.7 |
| LiveCodeBench | 33.8 | 55.0 | +21.2 |
| MBPP | 54.6 | 76.8 | +22.2 |
| HumanEval | 69.5 | 96.3 | +26.8 |
| GPQA-Diamond | 40.0 | 50.0 | +10.0 |
| MMLU-PRO | 60.0 | 67.7 | +7.7 |
| MGSM | 59.7 | 73.1 | +13.4 |
| IFEval (prompt strict) | 78.6 | 79.9 | +1.3 |
The toggle helps most when the bottleneck is multi-step search or program synthesis (code, science reasoning, multilingual math); it helps least when the bottleneck is recall or format compliance.
## Intended Uses
### Primary Use Cases
Domyn Small is intended for commercial and research use in multiple languages:
- Regulated-industry use cases in **resource-constrained environments** that need reduced computational cost and faster response times in production.
- **Fine-tuning to any desired domain knowledge** across industries, to equip the model with the context and expertise needed to excel on real-world applications.
- **Agentic applications**, especially agents that need to solve coding and mathematical problems and perform sequential, tool-calling tasks.
### Out-of-Scope Use Cases
Domyn Small is not specifically designed or evaluated for all downstream purposes. As with any language model, developers should carefully evaluate accuracy, safety, and fairness before applying it to specific downstream scenarios, particularly high-risk ones. Developers should also ensure compliance with all applicable laws and regulations (including, but not limited to, privacy and trade compliance) relevant to their use case.
## EU AI Act Compliance
Domyn Small is released as a general-purpose AI (GPAI) model under the EU AI Act. Article 53 transparency obligations are discharged via this model card, the Domyn Small technical report (architecture, training data composition, training stages, evaluations, and known limitations end-to-end), and the MIT-licensed open-weights release. The training-data summary required by Article 53(1)(d) is provided as a companion artefact to the model release.
To uphold data-subject rights and comply with the AI Act and EU copyright framework, we operate an opt-out procedure for rights holders. Anyone who believes their copyrighted material was inadvertently included in our training corpora can contact `copyright@domyn.com`, and we will exclude the affected data from subsequent model iterations.
## Citation
If you find this work valuable, please consider citing it:
```bibtex
@misc{domynsmall2026,
title = {Domyn Small},
author = {Domyn S.p.A.},
year = {2026},
eprint = {TBD},
note = {Technical report, forthcoming},
}
```
## Contacts
- For general inquiries about Domyn Small, please contact: `models@domyn.com`
- For copyright-related complaints, please contact: `copyright@domyn.com`
*Affected rightsholders and their authorised representatives, including collective management organisations, may submit sufficiently precise and adequately substantiated complaints electronically concerning any non-compliance with our commitments under the Copyright Chapter of the GPAI Code of Practice. We commit to handling such complaints diligently, impartially, and within a reasonable timeframe, except in cases where the complaint is manifestly unfounded or has already been addressed. This mechanism complements, but does not limit, the available legal measures, remedies, and sanctions under Union and national copyright law.*

123
chat_template.jinja Normal file

@@ -0,0 +1,123 @@
{#- This template extends the base model template to support tool calling.
Tools are injected into the system prompt using <tools> XML tags.
Tool calls use <tool_call> XML tags, tool responses use <tool_response> tags.
Handles all combinations:
1. tools + user system message → single system block: tool instructions + user content
2. tools + NO system message → single synthetic system block with tool instructions
3. no tools + user system message → single system block with user content
4. no tools + no system message → minimal default system block
Thinking mode: if the enable_thinking kwarg is true or any system message
contains "thinking on", all system blocks end with "thinking on" and the
generation prompt emits an open <think>.
Otherwise "thinking off" is used and <think></think> is immediately closed.
"thinking on" / "thinking off" is stripped from user system content and
re-appended as the final line of the system block to keep it consistent.
-#}
{%- set loop_messages = messages %}
{%- set ns = namespace(thinking=false, has_tools=false, has_system=false, system_emitted=false) %}
{#- ===== Method A: Check kwarg ===== -#}
{%- if enable_thinking is defined and enable_thinking %}
{%- set ns.thinking = true %}
{%- endif %}
{#- ===== PASS 1: Scan messages to detect thinking mode and system presence ===== -#}
{%- for message in loop_messages %}
{%- if message['role'] == 'system' %}
{%- set ns.has_system = true %}
{%- if 'thinking on' in message['content'] %}
{%- set ns.thinking = true %}
{%- endif %}
{%- endif %}
{%- endfor %}
{#- ===== Build tool instruction block if tools provided ===== -#}
{%- if tools is defined and tools is not none and tools | length > 0 %}
{%- set ns.has_tools = true %}
{%- set tool_instruction %}
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.
You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
Here are the available tools:
<tools>
{{ tools | tojson }}
</tools>
For each function call, return a JSON object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"name": <function-name>, "arguments": <args-dict>}
</tool_call>
{%- endset %}
{%- endif %}
{#- ===== Thinking mode suffix — appended to every system block ===== -#}
{%- set thinking_suffix = "thinking on" if ns.thinking else "thinking off" %}
{#- ===== SYNTHETIC SYSTEM BLOCK (only if NO system messages in conversation) ===== -#}
{%- if not ns.has_system %}
{%- if ns.has_tools %}
{{- '<extra_id_0>System\n' }}
{{- tool_instruction + '\n\n' }}
{{- thinking_suffix + '\n' }}
{%- else %}
{{- '<extra_id_0>System\n' }}
{{- thinking_suffix + '\n' }}
{%- endif %}
{%- set ns.system_emitted = true %}
{%- endif %}
{#- ===== PASS 2: Render messages ===== -#}
{%- for message in loop_messages %}
{#- ---- SYSTEM MESSAGE ---- -#}
{%- if message['role'] == 'system' %}
{#- Strip thinking directives from the content — we handle them via thinking_suffix -#}
{%- set clean_content = message['content'] | replace('thinking on', '') | replace('thinking off', '') | trim %}
{{- '<extra_id_0>System\n' }}
{%- if ns.has_tools %}
{{- tool_instruction + '\n' }}
{%- endif %}
{%- if clean_content %}
{{- clean_content + '\n' }}
{%- endif %}
{{- thinking_suffix + '\n' }}
{%- set ns.system_emitted = true %}
{#- ---- USER MESSAGE ---- -#}
{%- elif message['role'] == 'user' %}
{{- '<extra_id_1>User\n' }}
{{- message['content'] + '\n' }}
{#- ---- ASSISTANT MESSAGE ---- -#}
{%- elif message['role'] == 'assistant' %}
{%- if message.tool_calls is defined and message.tool_calls | length > 0 %}
{{- '<extra_id_1>Assistant\n' }}
{%- for tool_call in message.tool_calls %}
{{- '<tool_call>\n' }}
{{- {"name": tool_call.function.name, "arguments": tool_call.function.arguments} | tojson + '\n' }}
{{- '</tool_call>\n' }}
{%- endfor %}
{%- else %}
{{- '<extra_id_1>Assistant\n' }}
{{- message['content'] + '\n' }}
{%- endif %}
{#- ---- TOOL RESPONSE MESSAGE ---- -#}
{%- elif message['role'] == 'tool' %}
{{- '<extra_id_1>User\n' }}
{{- '<tool_response>\n' }}
{{- message['content'] + '\n' }}
{{- '</tool_response>\n' }}
{%- endif %}
{%- endfor %}
{#- ===== Generation prompt ===== -#}
{%- if add_generation_prompt %}
{{- '<extra_id_1>Assistant\n' }}
{%- if ns.thinking %}
{{- '<think>\n' }}
{%- else %}
{{- '<think>\n</think>\n\n' }}
{%- endif %}
{%- endif %}

30
config.json Normal file

@@ -0,0 +1,30 @@
{
"architectures": [
"NemotronForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 2,
"dtype": "bfloat16",
"eos_token_id": 5,
"head_dim": 128,
"hidden_act": "relu2",
"hidden_size": 4096,
"initializer_range": 0.0134,
"intermediate_size": 16384,
"max_position_embeddings": 32768,
"mlp_bias": false,
"model_type": "nemotron",
"nemo_version": "0.2.0",
"norm_eps": 1e-05,
"num_attention_heads": 48,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"pad_token_id": 0,
"partial_rotary_factor": 0.5,
"rope_theta": 500000,
"tie_word_embeddings": false,
"transformers_version": "4.57.1",
"use_cache": false,
"vocab_size": 256000
}

9
generation_config.json Normal file

@@ -0,0 +1,9 @@
{
"_from_model_config": true,
"bos_token_id": 2,
"eos_token_id": [
5
],
"pad_token_id": 0,
"transformers_version": "4.57.1"
}


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6f2f368aab9b9d6a047a1f1f7d19e4b7e8531edda3690b2eda63c047afd9108c
size 4915962352


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4bb6e10249871a9647a138170b75f788c7d65fd01e8a3a25945cd10e4ec2a02e
size 4966496880


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:08a15f8b372b3e05f01a21127912407b712088ab192eda97eecf791189bead22
size 4949719448


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d6156ef734c8705b9aa73693cd4179eba0d315c7f96c4ffa8077450d323dce28
size 4798537992


@@ -0,0 +1,412 @@
{
"metadata": {
"total_parameters": 9815334912,
"total_size": 19630669824
},
"weight_map": {
"lm_head.weight": "model-00004-of-00004.safetensors",
"model.embed_tokens.weight": "model-00001-of-00004.safetensors",
"model.layers.0.input_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.0.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.1.input_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.1.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.10.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.10.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.11.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.11.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.12.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.12.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.13.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.13.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.14.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.14.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.15.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.15.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.16.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.16.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.17.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.17.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.18.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.18.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.19.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.19.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.2.input_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.2.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.20.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.20.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.20.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.21.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.21.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.22.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.22.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.23.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.23.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.24.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.24.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.25.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.25.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.26.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.26.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.27.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.27.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.28.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.28.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.29.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.29.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.3.input_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.3.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.30.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.30.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.30.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.30.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.30.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.30.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.31.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.31.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.31.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.31.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.31.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.31.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.31.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.31.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.31.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.32.input_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.32.input_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.32.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.32.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.32.post_attention_layernorm.bias": "model-00003-of-00004.safetensors",
"model.layers.32.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
"model.layers.32.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.32.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.32.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.32.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
"model.layers.33.input_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.33.input_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.33.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.33.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.33.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.33.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.33.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.33.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.33.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.33.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.34.input_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.34.input_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.34.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.34.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.34.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.34.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.34.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.34.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.34.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.34.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.35.input_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.35.input_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.35.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.35.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.35.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.35.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.35.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.35.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.35.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.35.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.36.input_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.36.input_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.36.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.36.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.36.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.36.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.36.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.36.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.36.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.36.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.37.input_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.37.input_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.37.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.37.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.37.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.37.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.37.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.37.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.37.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.37.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.38.input_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.38.input_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.38.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.38.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.38.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.38.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.38.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.38.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.38.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.38.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.39.input_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.39.input_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.39.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.39.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.39.post_attention_layernorm.bias": "model-00004-of-00004.safetensors",
"model.layers.39.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
"model.layers.39.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.39.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.39.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.39.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
"model.layers.4.input_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.4.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.5.input_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.5.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.6.input_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.6.post_attention_layernorm.bias": "model-00001-of-00004.safetensors",
"model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.7.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.7.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.7.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.7.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.7.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.7.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
"model.layers.8.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.8.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.8.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.8.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.8.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.8.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.8.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.8.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.8.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.8.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.9.input_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.9.post_attention_layernorm.bias": "model-00002-of-00004.safetensors",
"model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
"model.layers.9.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.9.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.9.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
"model.layers.9.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
"model.norm.bias": "model-00004-of-00004.safetensors",
"model.norm.weight": "model-00004-of-00004.safetensors"
}
}
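The weight map above records, for each tensor name, the shard file that stores it. As a rough sketch of how such an index can be queried, the snippet below groups tensor names by shard and looks up a single tensor. The `weight_map` dictionary is a hand-copied excerpt of the entries listed above, and the helper names `tensors_by_shard` and `shard_for` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Hand-copied excerpt of the "weight_map" entries listed above (illustrative only).
weight_map = {
    "model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
    "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
    "model.layers.20.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
    "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
    "model.norm.weight": "model-00004-of-00004.safetensors",
}


def tensors_by_shard(index: dict[str, str]) -> dict[str, list[str]]:
    """Group tensor names by the shard file that stores them (hypothetical helper)."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for name, shard in index.items():
        grouped[shard].append(name)
    return {shard: sorted(names) for shard, names in grouped.items()}


def shard_for(index: dict[str, str], tensor_name: str) -> str:
    """Return the shard file holding ``tensor_name``; raises KeyError if absent."""
    return index[tensor_name]


grouped = tensors_by_shard(weight_map)
# Layer 20 straddles a shard boundary: the q/k projections sit in shard 2,
# while o_proj and the MLP weights sit in shard 3.
print(shard_for(weight_map, "model.layers.20.self_attn.q_proj.weight"))
print(sorted(grouped))
```

A loader that needs only some tensors could use a lookup like this to open just the relevant shard files. Note that a single layer may span two shards, as layer 20 does in the listing above.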

297
reasoning_parser_plugin.py Normal file

@@ -0,0 +1,297 @@
"""Reasoning parser plugin for Domyn-Small ``<think>...</think>`` outputs.

Loaded into vLLM with ``--reasoning-parser-plugin <path>`` and selected via
``--reasoning-parser think_block``. The parser splits each model output on
the literal ``</think>`` marker: everything before it is reasoning,
everything after is final content.

See :class:`ThinkBlockReasoningParser` for the streaming state machine and
how per-request thinking-on/off is discovered.
"""

from __future__ import annotations

from collections.abc import Iterable, Sequence
from typing import TYPE_CHECKING

from vllm.reasoning import ReasoningParser, ReasoningParserManager

if TYPE_CHECKING:
    from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
    from vllm.entrypoints.openai.engine.protocol import DeltaMessage
    from vllm.entrypoints.openai.responses.protocol import ResponsesRequest

# Literal markers emitted by the Domyn-Small chat template. `<think>` is
# pre-emitted by the prompt, so model output never starts with it; only `</think>`
# actually has to be detected at runtime.
START = "<think>"
END = "</think>"


def _max_suffix_prefix(s: str, marker: str) -> str:
    """Longest non-empty suffix of ``s`` that is also a proper prefix of ``marker``.

    Returns ``""`` when no such suffix exists.

    Used to decide how many trailing characters of the streaming buffer must
    be held back: if those characters could still grow into ``marker`` on the
    next delta, releasing them now would fragment the marker across deltas
    (e.g. emitting ``</thi`` and then ``nk>``).
    """
    # Try the longest candidate first; stop one short of the full marker,
    # since a complete marker is handled by the caller as a confirmed split.
    for i in range(min(len(marker) - 1, len(s)), 0, -1):
        if s.endswith(marker[:i]):
            return s[-i:]
    return ""


@ReasoningParserManager.register_module("think_block")
class ThinkBlockReasoningParser(ReasoningParser):
    """Splits model output on the literal ``</think>`` marker.

    **Streaming.** Olmo3-style buffered state machine: incoming text is
    accumulated in :attr:`_buffer` and only released when the marker is
    either confirmed (split point reached) or ruled out (the buffer tail
    can no longer be a prefix of ``</think>``). This guarantees the marker
    is never fragmented across deltas.

    **Per-request lane.** The initial lane (``"reasoning"`` vs
    ``"content"``) is set from the request itself: ``"reasoning"`` if
    ``chat_template_kwargs.enable_thinking`` (or ``.thinking``) is truthy,
    or if any system message contains the literal ``"thinking on"``
    directive, mirroring the chat template's own detection; ``"content"``
    otherwise.

    **Request discovery.** vLLM instantiates the parser per request from
    inside ``create_chat_completion(self, request, ...)``, but does not
    pass the request to the constructor. We recover it by walking the call
    stack at ``__init__`` time, inspecting only each frame's *function
    arguments* (so we don't accidentally match request-shaped objects in
    module globals or unrelated locals). If no request is found we fall
    back to ``thinking=off``, which keeps tool-call streaming working out
    of the box.
    """

    def __init__(self, tokenizer, *args, **kwargs) -> None:
        # Base ReasoningParser only accepts `tokenizer`; swallow any extras so
        # the registration signature stays compatible across vLLM versions.
        super().__init__(tokenizer)
        self._buffer: str = ""
        # Current lane for streaming output: "reasoning" while inside
        # <think>...</think>, "content" otherwise. Locked to "content" once
        # `</think>` is observed.
        self._state: str = "content"
        # Tracks whether we have applied per-request configuration yet.
        # Stack-walking covers the streaming path; `extract_reasoning` also
        # configures on the first non-streaming call as a safety net.
        self._configured: bool = False
        request = self._find_request_in_stack()
        if request is not None:
            self._configure_for_request(request)

    @staticmethod
    def _looks_like_request(obj) -> bool:
        """Duck-typed check for ChatCompletionRequest / ResponsesRequest.

        Avoids importing vLLM's protocol module, which differs across forks
        and isn't guaranteed to be importable at plugin load time.
        """
        return hasattr(obj, "messages") and (
            hasattr(obj, "chat_template_kwargs") or hasattr(obj, "stream")
        )

    @classmethod
    def _find_request_in_stack(cls, max_depth: int = 12):
        """Locate the in-flight request by scanning caller-frame arguments.

        Walks a bounded number of caller frames via ``sys._getframe`` /
        ``frame.f_back`` and inspects only each frame's *function
        arguments*, never its full locals. This matches vLLM's
        ``create_chat_completion(self, request, ...)`` signature and avoids
        matching request-shaped objects that happen to live in module
        globals or unrelated locals (e.g. test fixtures).

        We deliberately avoid :func:`inspect.stack`, which reads source
        files via ``linecache`` and builds ``FrameInfo`` objects for the
        whole stack on every call. That is measurable overhead per request
        under high concurrency, since parser construction is per-request and
        runs under the GIL on the serving event loop.
        """
        import sys

        try:
            frame = sys._getframe(1)
        except Exception:
            return None
        depth = 0
        while frame is not None and depth < max_depth:
            code = frame.f_code
            # Positional and keyword-only parameter names come first in
            # `co_varnames`, so this slice covers exactly the declared arguments.
            n_args = code.co_argcount + code.co_kwonlyargcount
            for name in code.co_varnames[:n_args]:
                value = frame.f_locals.get(name)
                if cls._looks_like_request(value):
                    return value
            frame = frame.f_back
            depth += 1
        return None

    def _configure_for_request(self, request) -> None:
        """Set initial streaming lane from the request's thinking flag."""
        self._state = "reasoning" if self._thinking_was_enabled(request) else "content"
        self._configured = True

    def _decode(self, ids: Sequence[int]) -> str:
        # `skip_special_tokens=False` is required: `</think>` may be tokenized
        # as (or contain) special tokens that the default decode would strip,
        # which would silently break marker detection.
        try:
            return self.model_tokenizer.decode(list(ids), skip_special_tokens=False)
        except Exception:
            return ""

    @property
    def reasoning_start_str(self) -> str | None:
        return START

    @property
    def reasoning_end_str(self) -> str | None:
        return END

    def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
        return END in self._decode(input_ids)

    def is_reasoning_end_streaming(
        self, input_ids: Sequence[int], delta_ids: Iterable[int]
    ) -> bool:
        # Decode a 64-token tail window so the marker is detected even when
        # it straddles the previous-vs-delta token boundary (BPE may split
        # `</think>` across multiple tokens, especially around punctuation).
        tail = list(input_ids)[-64:]
        return END in self._decode(tail)

    def extract_content_ids(self, input_ids: list[int]) -> list[int]:
        text = self._decode(input_ids)
        idx = text.rfind(END)
        if idx < 0:
            return []
        try:
            return self.model_tokenizer.encode(
                text[idx + len(END):], add_special_tokens=False
            )
        except Exception:
            return []

    def count_reasoning_tokens(self, token_ids: Sequence[int]) -> int:
        text = self._decode(token_ids)
        idx = text.find(END)
        prefix = text if idx < 0 else text[:idx]
        try:
            return len(self.model_tokenizer.encode(prefix, add_special_tokens=False))
        except Exception:
            return 0

    def extract_reasoning(
        self,
        model_output: str,
        request: "ChatCompletionRequest | ResponsesRequest",
    ) -> tuple[str | None, str | None]:
        """Split a full (non-streaming) output into ``(reasoning, content)``.

        Returns ``(None, content)`` when the request has thinking disabled
        and the output contains no marker: the chat template pre-emits
        ``<think></think>`` in the prompt in that case, so a marker-less
        output is purely the answer.
        """
        # Configure streaming state as a side effect: a fork's serving layer
        # may call this before streaming starts, and we don't want the
        # streaming path to fall back to the `thinking=off` default if the
        # request actually had thinking enabled.
        if not self._configured:
            self._configure_for_request(request)
        s = model_output
        if s.startswith(START):
            s = s[len(START):]
        if END in s:
            reasoning, _, content = s.partition(END)
            return (reasoning.strip("\n") or None, content.lstrip("\n") or None)
        # No `</think>` in output: only treat the text as truncated reasoning
        # if we have positive evidence that thinking was enabled; otherwise
        # it is the final answer.
        if self._thinking_was_enabled(request):
            return (s.strip("\n") or None, None)
        return (None, s.lstrip("\n") or None)

    @staticmethod
    def _thinking_was_enabled(request) -> bool:
        """Whether ``request`` asked for reasoning to be emitted.

        Mirrors the chat template's own detection so the parser stays in
        lockstep with prompt construction: enabled iff
        ``chat_template_kwargs.enable_thinking`` (or ``.thinking``) is
        truthy, or any system message contains the literal ``"thinking on"``
        directive (case-insensitive).
        """
        kwargs = getattr(request, "chat_template_kwargs", None) or {}
        if kwargs.get("enable_thinking") or kwargs.get("thinking"):
            return True
        messages = getattr(request, "messages", None) or []
        for m in messages:
            role = m.get("role") if isinstance(m, dict) else getattr(m, "role", None)
            if role != "system":
                continue
            content = m.get("content") if isinstance(m, dict) else getattr(m, "content", None)
            if isinstance(content, str) and "thinking on" in content.lower():
                return True
        return False

    def extract_reasoning_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> "DeltaMessage | None":
        """Emit at most one ``DeltaMessage`` per delta, routed to reasoning or content.

        The marker ``</think>`` is never emitted to the client. Trailing
        characters of the buffer that *could* still grow into the marker on
        the next delta are held back, so the marker is never fragmented
        across deltas (e.g. ``</thi`` ... ``nk>``). When the marker is
        observed, pre-marker text goes to the current lane and post-marker
        text goes to ``content``; the lane is then locked to ``content``.
        Returns ``None`` when a delta releases no text.
        """
        from vllm.entrypoints.openai.engine.protocol import DeltaMessage

        self._buffer += delta_text
        # Case 1: marker fully present in the buffer, so split and switch lane.
        # The pre-marker chunk stays on the *current* lane (reasoning if we
        # were inside <think>, content otherwise); the post-marker chunk
        # always goes to content; the lane is locked to content afterwards.
        idx = self._buffer.find(END)
        if idx >= 0:
            pre = self._buffer[:idx]
            post = self._buffer[idx + len(END):]
            self._buffer = ""
            pre_lane = self._state
            self._state = "content"
            if not pre and not post:
                return None
            fields: dict = {}
            if pre:
                fields[pre_lane] = pre
            if post:
                # `.get` covers the edge case where pre_lane is already
                # "content" and both pre and post are non-empty: they get
                # concatenated into a single content delta.
                fields["content"] = fields.get("content", "") + post
            return DeltaMessage(**fields)
        # Case 2: no marker yet, so release everything except a possible
        # partial-marker tail, which we retain for the next delta.
        held = _max_suffix_prefix(self._buffer, END)
        safe_end = len(self._buffer) - len(held)
        if safe_end == 0:
            return None
        chunk = self._buffer[:safe_end]
        self._buffer = self._buffer[safe_end:]
        return DeltaMessage(**{self._state: chunk})
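
The hold-back behaviour described in `extract_reasoning_streaming` can be tried outside vLLM with a small standalone sketch. It is illustrative only and not part of the plugin: `longest_marker_prefix_suffix` is a hypothetical stand-in for what `_max_suffix_prefix` (defined earlier in the file) is assumed to do, `route_deltas` is an invented driver, and the final flush at end of stream is an addition that the parser above does not perform.

```python
END_MARKER = "</think>"  # assumed to match the parser's END constant


def longest_marker_prefix_suffix(buffer: str, marker: str) -> str:
    """Return the longest proper prefix of `marker` that `buffer` ends with."""
    for size in range(min(len(buffer), len(marker) - 1), 0, -1):
        if buffer.endswith(marker[:size]):
            return buffer[-size:]
    return ""


def route_deltas(deltas: list[str], marker: str = END_MARKER) -> tuple[str, str]:
    """Feed deltas through the hold-back logic; return (reasoning, content)."""
    buffer = ""
    lane = "reasoning"
    out = {"reasoning": "", "content": ""}
    for delta in deltas:
        buffer += delta
        idx = buffer.find(marker)
        if idx >= 0:
            # Marker complete: text before it stays on the current lane,
            # text after it goes to content, and the marker is dropped.
            out[lane] += buffer[:idx]
            out["content"] += buffer[idx + len(marker):]
            buffer = ""
            lane = "content"
            continue
        # No marker yet: release all but a possible partial-marker tail.
        held = longest_marker_prefix_suffix(buffer, marker)
        safe_end = len(buffer) - len(held)
        out[lane] += buffer[:safe_end]
        buffer = buffer[safe_end:]
    # Addition for this sketch: flush any held tail when the stream ends.
    out[lane] += buffer
    return out["reasoning"], out["content"]


if __name__ == "__main__":
    print(route_deltas(["Let me think", "</thi", "nk>", "Answer"]))
```

Because a partial tail such as `</thi` is withheld until the next delta either completes the marker or rules it out, a client never sees a fragment of `</think>`, at the cost of delaying a few characters by one delta.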

37
special_tokens_map.json Normal file

@@ -0,0 +1,37 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<extra_id_1>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "<extra_id_1>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

3
tokenizer.json Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:85c33485220c152141b14e438e9bf16141eb14ad4b17b6e9329ab35fc96d1137
size 34809687

8041
tokenizer_config.json Normal file

File diff suppressed because it is too large.

314
tool_parser_plugin.py Normal file

@@ -0,0 +1,314 @@
"""
Custom vLLM tool parser plugin for models that use <tool_call> XML tags.

The model outputs tool calls in this format:

    <tool_call>
    {"name": "function_name", "arguments": {"arg1": "val1"}}
    </tool_call>

Multiple tool calls can appear in a single response (parallel tool calling).

Usage:

    vllm serve <model> \
        --enable-auto-tool-choice \
        --tool-parser-plugin /absolute/path/to/tool_parser_plugin.py \
        --tool-call-parser xml_tool_call \
        --chat-template /absolute/path/to/tool_chat_template.jinja
"""
import ast
import json
import re
import uuid
from typing import Sequence, Union

# ---------------------------------------------------------------------------
# Import compatibility: newer vLLM releases moved the protocol classes and
# the tool parser base classes to new module paths; older releases keep
# them under vllm.entrypoints.openai. Try the new locations first.
# ---------------------------------------------------------------------------
try:
    # Newer vLLM, roughly 0.15+
    from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
    from vllm.entrypoints.openai.engine.protocol import (
        DeltaFunctionCall,
        DeltaMessage,
        DeltaToolCall,
        ExtractedToolCallInformation,
        FunctionCall,
        ToolCall,
    )
except ImportError:
    # Older vLLM
    from vllm.entrypoints.openai.protocol import (
        ChatCompletionRequest,
        DeltaFunctionCall,
        DeltaMessage,
        DeltaToolCall,
        ExtractedToolCallInformation,
        FunctionCall,
        ToolCall,
    )

try:
    from vllm.tool_parsers.abstract_tool_parser import ToolParser, ToolParserManager
except ImportError:
    from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import (
        ToolParser,
        ToolParserManager,
    )

from vllm.logger import init_logger

logger = init_logger(__name__)


def _generate_tool_call_id() -> str:
    """Generate a unique tool-call ID in the format expected by OpenAI."""
    return f"call_{uuid.uuid4().hex[:24]}"


# ---------------------------------------------------------------------------
# Register the parser so it can be referenced via --tool-call-parser
# ---------------------------------------------------------------------------
@ToolParserManager.register_module(["xml_tool_call"])
class XMLToolCallParser(ToolParser):
    """
    Parses tool calls wrapped in <tool_call>...</tool_call> XML tags.

    Handles both single and parallel (multiple) tool calls in one response.
    Supports streaming and non-streaming extraction.
    """

    # Regex to match complete <tool_call>...</tool_call> blocks
    TOOL_CALL_RE = re.compile(
        r"<tool_call>\s*(.*?)\s*</tool_call>",
        re.DOTALL,
    )
    # Regex that also matches an incomplete (still-streaming) block
    TOOL_CALL_OPEN_RE = re.compile(
        r"<tool_call>\s*(.*?)(?:</tool_call>|$)",
        re.DOTALL,
    )
    TOOL_CALL_START = "<tool_call>"
    TOOL_CALL_END = "</tool_call>"

    def __init__(self, tokenizer, tools=None):
        # vLLM newer versions: ToolParser.__init__(tokenizer, tools)
        # vLLM older versions: ToolParser.__init__(tokenizer)
        try:
            super().__init__(tokenizer, tools)
        except TypeError:
            super().__init__(tokenizer)
        self.tools = tools or []
        # ---- streaming state ----
        self.current_tool_id: int = -1
        self.current_tool_name_sent: bool = False
        self.prev_tool_call_arr: list[dict] = []
        self.streamed_args_for_tool: list[str] = []

    # ------------------------------------------------------------------
    # Helper: lenient parsing of a single tool-call payload
    # ------------------------------------------------------------------
    @staticmethod
    def _parse_tool_json(raw: str) -> dict | None:
        """Parse a tool call JSON block, handling Python-style single quotes.

        Returns None unless the payload parses to a dict, so callers can
        safely use ``.get`` on the result.
        """
        # Try standard JSON first
        try:
            result = json.loads(raw)
            if isinstance(result, dict):
                return result
        except (json.JSONDecodeError, ValueError):
            pass
        # Fall back to ast.literal_eval for Python-style dicts with single quotes
        try:
            result = ast.literal_eval(raw)
            if isinstance(result, dict):
                return result
        except (ValueError, SyntaxError):
            pass
        return None

    # ------------------------------------------------------------------
    # Optional: adjust the request before inference
    # ------------------------------------------------------------------
    def adjust_request(
        self, request: ChatCompletionRequest
    ) -> ChatCompletionRequest:
        return request

    # ------------------------------------------------------------------
    # NON-STREAMING extraction
    # ------------------------------------------------------------------
    def extract_tool_calls(
        self,
        model_output: str,
        request: ChatCompletionRequest,
    ) -> ExtractedToolCallInformation:
        """
        Parse all <tool_call>...</tool_call> blocks from the full model
        output and convert them to OpenAI ToolCall objects.
        """
        # Find all complete tool-call blocks
        raw_matches = self.TOOL_CALL_RE.findall(model_output)
        tool_calls: list[ToolCall] = []
        for raw_json in raw_matches:
            parsed = self._parse_tool_json(raw_json)
            # Guard against payloads that parse to a non-dict (list, string).
            if not isinstance(parsed, dict):
                logger.warning(
                    "Failed to parse tool call JSON: %s", raw_json
                )
                continue
            fn_name = parsed.get("name", "")
            fn_args = parsed.get("arguments", {})
            # Ensure arguments is a JSON string (OpenAI format)
            if isinstance(fn_args, dict):
                fn_args_str = json.dumps(fn_args)
            elif isinstance(fn_args, str):
                # Model may emit arguments as a JSON string: validate and pass through
                try:
                    json.loads(fn_args)
                    fn_args_str = fn_args
                except (json.JSONDecodeError, ValueError):
                    # Try ast.literal_eval for Python-style dicts (e.g. single
                    # quotes, True/False/None). If that also fails, emit an
                    # empty dict so downstream json.loads never sees an
                    # invalid string.
                    try:
                        recovered = ast.literal_eval(fn_args)
                        fn_args_str = json.dumps(recovered if isinstance(recovered, dict) else {})
                    except (ValueError, SyntaxError, TypeError):
                        fn_args_str = "{}"
            else:
                # Any other type (list, number, None): serialize as JSON
                # rather than as a Python repr, which may not be valid JSON.
                fn_args_str = json.dumps(fn_args, default=str)
            tool_calls.append(
                ToolCall(
                    id=_generate_tool_call_id(),
                    type="function",
                    function=FunctionCall(
                        name=fn_name,
                        arguments=fn_args_str,
                    ),
                )
            )
        if not tool_calls:
            # No usable tool calls (none found, or none parsed): return the
            # text as-is instead of reporting an empty tool-call list.
            return ExtractedToolCallInformation(
                tools_called=False,
                tool_calls=[],
                content=model_output,
            )
        # Strip tool-call blocks from content to get any surrounding text
        remaining_content = self.TOOL_CALL_RE.sub("", model_output).strip()
        return ExtractedToolCallInformation(
            tools_called=True,
            tool_calls=tool_calls,
            content=remaining_content if remaining_content else None,
        )

    # ------------------------------------------------------------------
    # STREAMING extraction
    # ------------------------------------------------------------------
    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
        request: ChatCompletionRequest,
    ) -> Union[DeltaMessage, None]:
        """
        Incrementally parse tool calls from the streaming token output.

        Strategy:
        - Before seeing <tool_call>, stream tokens as regular content.
        - Once <tool_call> is detected, buffer until </tool_call>.
        - On </tool_call>, emit the complete tool call delta.
        - Support multiple sequential tool calls.
        """
        # If we haven't seen a tool_call opening tag yet, pass through as
        # regular content, holding back only a trailing partial start tag.
        if self.TOOL_CALL_START not in current_text:
            def _held(text: str) -> int:
                # Length of the longest proper prefix of the start tag that
                # `text` ends with (0 if none).
                for i in range(len(self.TOOL_CALL_START) - 1, 0, -1):
                    if text.endswith(self.TOOL_CALL_START[:i]):
                        return i
                return 0

            # Everything before `start` was emitted by earlier calls; emit
            # up to `end` now, so text held back on a previous delta is
            # released (not dropped) once the tag turns out not to be forming.
            start = len(previous_text) - _held(previous_text)
            end = len(current_text) - _held(current_text)
            chunk = current_text[start:end]
            return DeltaMessage(content=chunk) if chunk else None
        # ---- We are inside or past a <tool_call> block ----
        # Find all *complete* tool call blocks so far
        complete_matches = self.TOOL_CALL_RE.findall(current_text)
        num_complete = len(complete_matches)
        # Blocks already handled: those streamed plus those skipped because
        # they failed to parse. Counting skipped blocks keeps one bad block
        # from being retried on every delta and from blocking later calls.
        num_skipped = getattr(self, "_num_unparseable_blocks", 0)
        num_handled = len(self.prev_tool_call_arr) + num_skipped
        if num_complete > num_handled:
            # A new tool call just completed: emit it
            new_raw = complete_matches[num_handled]
            parsed = self._parse_tool_json(new_raw)
            if not isinstance(parsed, dict):
                logger.warning(
                    "Streaming: failed to parse tool call JSON: %s",
                    new_raw,
                )
                self._num_unparseable_blocks = num_skipped + 1
                return None
            fn_name = parsed.get("name", "")
            fn_args = parsed.get("arguments", {})
            if isinstance(fn_args, dict):
                fn_args_str = json.dumps(fn_args)
            elif isinstance(fn_args, str):
                try:
                    json.loads(fn_args)
                    fn_args_str = fn_args
                except (json.JSONDecodeError, ValueError):
                    try:
                        recovered = ast.literal_eval(fn_args)
                        fn_args_str = json.dumps(recovered if isinstance(recovered, dict) else {})
                    except (ValueError, SyntaxError, TypeError):
                        fn_args_str = "{}"
            else:
                # Serialize other types as JSON rather than as a Python repr.
                fn_args_str = json.dumps(fn_args, default=str)
            self.current_tool_id += 1
            self.prev_tool_call_arr.append(parsed)
            self.streamed_args_for_tool.append(fn_args_str)
            self.current_tool_name_sent = True
            return DeltaMessage(
                tool_calls=[
                    DeltaToolCall(
                        index=self.current_tool_id,
                        id=_generate_tool_call_id(),
                        type="function",
                        function=DeltaFunctionCall(
                            name=fn_name,
                            arguments=fn_args_str,
                        ),
                    )
                ]
            )
        # If we're currently inside an incomplete tool call block,
        # don't emit anything; wait for it to complete.
        # Check if there's an open <tool_call> without a matching close
        open_count = current_text.count(self.TOOL_CALL_START)
        close_count = current_text.count(self.TOOL_CALL_END)
        if open_count > close_count:
            # Still buffering inside a tool call
            return None
        # Past all tool call blocks: text after the last closing tag is not
        # streamed as content by this parser (unlikely for most models).
        return None