Initialize project; model provided by the ModelHub XC community
Model: ellamind/propella-1-4b Source: Original Platform
46
.gitattributes
vendored
Normal file
@@ -0,0 +1,46 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
res/bf16_vs_fp8.png filter=lfs diff=lfs merge=lfs -text
res/eu_cofunding.png filter=lfs diff=lfs merge=lfs -text
res/overall_scores_by_model.png filter=lfs diff=lfs merge=lfs -text
res/per_property_scores_by_model.png filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
res/propella_artwork_21_9.jpeg filter=lfs diff=lfs merge=lfs -text
res/propella_artwork_portrait.jpeg filter=lfs diff=lfs merge=lfs -text
res/propella_artwork_21_9_w1600.png filter=lfs diff=lfs merge=lfs -text
res/propella_artwork_21_9_w1600.jpeg filter=lfs diff=lfs merge=lfs -text
res/propella-oral.pdf filter=lfs diff=lfs merge=lfs -text
res/propella-poster.pdf filter=lfs diff=lfs merge=lfs -text
435
README.md
Normal file
@@ -0,0 +1,435 @@
---
license: apache-2.0
language:
- eng
- spa
- ita
- fra
- deu
- pol
- ukr
- nld
- tha
- jpn
- heb
- ell
- kor
- isl
- dan
- cat
- slk
- rus
- kat
- por
- ben
- fas
- ekk
- fin
- tur
- swe
- ind
- ces
- lit
- slv
- vie
- eus
- bul
- mlt
- lvs
- nob
- hun
- urd
- ron
- glg
- gle
- nno
- ltg
- yue
- cmn
- hrv
- arb
- bos
- mkd
- srp
- hin
- als
- sqi
- est
- nor
- lav
- swa
---

<!-- <p align="center">
<img src="res/propella_logo.svg" alt="propella logo" width="150">
</p> -->

<!-- <h1 align="center">propella-1</h1>
<p align="center"><em>propel your data curation to the next level. </em></p> -->

<p align="center">
<img src="res/propella_artwork_21_9_w1600.jpeg" alt="propella-1 artwork" width=800>
</p>

propella-1 is a family of small multilingual LLMs that annotate text documents across six categories: core content, classification, quality & value, audience & purpose, safety & compliance, and geographic relevance. The annotations can be used to filter, select, and curate LLM training data at scale.

*Disclaimer: This is a research project, not an official ellamind product. For production-ready evaluation solutions, check out [elluminate](https://elluminate.de).*

## Highlights

- **Annotate 18 properties**: Covers well-established dimensions like content quality and educational value, plus underexplored ones like reasoning indicators and time-sensitivity.
- **Fast & accurate**: Small models (0.6B, 1.7B, 4B) that punch above their weight. Trained in fp8, ready for high-throughput inference.
- **Any text, any format**: Handles web pages, PDFs, code, math, post-training data and more.
- **Highly multilingual**: Supports 57 languages.

<p align="center">
<img src="res/overall_scores_by_model.png" alt="overall-performance-plot" width="700">
</p>

## The propella-1 family of models
| Model | Parameters | Performance | Docs/s (A100/H100) |
|-------|:----------:|:-----------:|:------------------:|
| [propella-1-4b](https://huggingface.co/ellamind/propella-1-4b) | 4B | 0.779 | 10.3 / 27.0 |
| [propella-1-1.7b](https://huggingface.co/ellamind/propella-1-1.7b) | 1.7B | 0.737 | 17.8 / 39.1 |
| [propella-1-0.6b](https://huggingface.co/ellamind/propella-1-0.6b) | 0.6B | 0.729 | 21.5 / 39.9 |


## Properties
propella-1 models evaluate documents across 18 properties organized into six categories:

| Category | Property | Short Description |
|----------|----------|-------------|
| **Core Content** | Content Integrity | Completeness and technical quality of the content |
| | Content Ratio | Proportion of content vs. navigation/UI elements |
| | Content Length | Amount of substantive content |
| **Classification** | One-Sentence Description | Ultra-short neutral description of the document |
| | Content Type | Functional structure and purpose |
| | Business Sector | Industry domain relevance |
| | Technical Content | Type and intensity of specialized knowledge |
| **Quality & Value** | Content Quality | Overall writing and presentation quality |
| | Information Density | Ratio of valuable information to redundancy |
| | Educational Value | Potential for teaching and learning |
| | Reasoning Indicators | Presence of logical reasoning and analysis |
| **Audience & Purpose** | Audience Level | Target sophistication level |
| | Commercial Bias | Commercial influence on objectivity |
| | Time-Sensitivity | How content value changes over time |
| **Safety & Compliance** | Content Safety | Presence of inappropriate or harmful content |
| | PII Presence | Contains personally identifiable information |
| **Geographic** | Regional Relevance | Primary regional/cultural context |
| | Country Relevance | Specific country relevance |

Read the [property reference](property_descriptions.md) for detailed definitions and enum values.


## Datasets annotated with propella-1
See [openeurollm/propella-annotations](https://huggingface.co/datasets/openeurollm/propella-annotations).


## Input
A text document in any of the [57 supported languages](#language-support).
The model has a 64k context length, but we recommend truncating documents to 50k characters (see [usage](#usage)).
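The truncation recommendation above can be applied before a document is sent to the model. This is a minimal sketch that assumes a plain character-based cut; `MAX_CHARS` and `truncate_document` are illustrative names and are not taken from `propella.py`.

```python
MAX_CHARS = 50_000  # character limit recommended above


def truncate_document(text: str, max_chars: int = MAX_CHARS) -> str:
    """Cut a document to at most `max_chars` characters (hypothetical helper)."""
    return text[:max_chars]


long_document = "x" * 120_000
print(len(truncate_document(long_document)))
```

Cutting on characters rather than tokens keeps the preprocessing step independent of the tokenizer, which may be why the recommendation is stated in characters.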

## Output

A JSON object containing annotations. The output strictly conforms to a predefined schema with enumerated values for categorical properties.

<details>
<summary><strong>Example output</strong></summary>

```json
{
    "content_integrity": "complete",
    "content_ratio": "mostly_content",
    "content_length": "moderate",
    "one_sentence_description": "Technical documentation explaining how to define and evaluate structured LLM output schemas using elluminate's Python client.",
    "content_type": [
        "technical_documentation",
        "instructional",
        "source_code"
    ],
    "business_sector": [
        "technology_software"
    ],
    "technical_content": [
        "code_heavy"
    ],
    "information_density": "dense",
    "content_quality": "excellent",
    "audience_level": "advanced",
    "commercial_bias": "minimal",
    "time_sensitivity": "slowly_changing",
    "content_safety": "safe",
    "educational_value": "high",
    "reasoning_indicators": "explanatory",
    "pii_presence": "no_pii",
    "regional_relevance": [
        "global"
    ],
    "country_relevance": [
        "none"
    ]
}
```
</details>

## Usage
See `propella.py` for prompts and schemas. We recommend enforcing a strict JSON schema without any whitespace for error-free generation.

### Serving
We recommend serving propella models with [SGLang](https://github.com/sgl-project/sglang) and the [llguidance](https://github.com/guidance-ai/llguidance) structured output backend:
```bash
python -m sglang.launch_server \
    --model-path outputs/propella-1-4b \
    --host 0.0.0.0 \
    --port 8000 \
    --context-length 65536 \
    --max-running-requests 256 \
    --chunked-prefill-size 8192 \
    --enable-mixed-chunk \
    --num-continuous-decode-steps 8 \
    --grammar-backend llguidance \
    --mem-fraction-static 0.7
```

<details>
<summary>fp8 on H100</summary>

```bash
python -m sglang.launch_server \
    --model-path outputs/propella-1-4b \
    --quantization w8a8_fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --host 0.0.0.0 \
    --port 8000 \
    --context-length 65536 \
    --max-running-requests 256 \
    --chunked-prefill-size 8192 \
    --enable-mixed-chunk \
    --num-continuous-decode-steps 8 \
    --grammar-backend llguidance \
    --mem-fraction-static 0.7
```

</details>

For single-node multi-GPU setups, we recommend increasing `data-parallel-size`.
For large-scale offline inference on SLURM clusters, we use [inference-hive](https://github.com/ellamind/inference-hive).

### Sending requests via the OpenAI SDK
```python
from openai import OpenAI
from propella import (
    create_messages,
    AnnotationResponse,
    get_annotation_response_schema,
)

document = "Hi, its me Max."

# Point the OpenAI client at the local SGLang server (port 8000, as in the serving command).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Request an annotation and constrain the output to the annotation JSON schema.
response = client.chat.completions.create(
    model="ellamind/propella-1-4b",
    messages=create_messages(document),
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "AnnotationResponse",
            "schema": get_annotation_response_schema(flatten=True, compact_whitespace=True),
            "strict": True,
        }
    },
)
# Parse and validate the returned JSON string, then pretty-print it.
response_content = response.choices[0].message.content
result = AnnotationResponse.model_validate_json(response_content)
print(result.model_dump_json(indent=4))
```
<details>
<summary>Result</summary>

```json
{
    "content_integrity": "complete",
    "content_ratio": "complete_content",
    "content_length": "minimal",
    "one_sentence_description": "A short personal greeting introducing someone named Max.",
    "content_type": [
        "conversational"
    ],
    "business_sector": [
        "general_interest"
    ],
    "technical_content": [
        "non_technical"
    ],
    "information_density": "dense",
    "content_quality": "good",
    "audience_level": "general",
    "commercial_bias": "none",
    "time_sensitivity": "evergreen",
    "content_safety": "safe",
    "educational_value": "none",
    "reasoning_indicators": "none",
    "pii_presence": "contains_pii",
    "regional_relevance": [
        "culturally_neutral"
    ],
    "country_relevance": [
        "none"
    ]
}
```
</details>
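Once documents are annotated, the JSON fields can drive data filtering. The following is a minimal sketch and not part of `propella.py`: `keep_document`, `ALLOWED_QUALITY`, and the filtering policy are hypothetical, and the enum values are only those visible in the two example outputs in this README (the property reference lists the full sets).

```python
import json

# Illustrative policy: these accepted values are assumptions taken from the example outputs.
ALLOWED_QUALITY = {"good", "excellent"}


def keep_document(annotation: dict) -> bool:
    """Keep documents that are safe, free of PII, and of at least good quality."""
    return (
        annotation["content_safety"] == "safe"
        and annotation["pii_presence"] == "no_pii"
        and annotation["content_quality"] in ALLOWED_QUALITY
    )


# Two abbreviated annotations mirroring the example outputs.
annotations = [
    json.loads('{"content_safety": "safe", "pii_presence": "no_pii", "content_quality": "excellent"}'),
    json.loads('{"content_safety": "safe", "pii_presence": "contains_pii", "content_quality": "good"}'),
]
kept = [annotation for annotation in annotations if keep_document(annotation)]
print(f"kept {len(kept)} of {len(annotations)} documents")
```

Because the categorical properties are enumerated, a filter like this reduces to simple equality and set-membership checks.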

## Throughput

The throughput results below provide a rough estimate for GPU-hours required to annotate 1M documents. After a short warmup, we run inference for 5k documents, sending 1k concurrent requests to the SGLang server.

| Model | GPU | Docs/s | GPU-hours per 1M docs | Prompt TPS | Output TPS | Total TPS |
|-------|-----|:------:|:---------------------:|:----------:|:----------:|:---------:|
| propella-1-4b | A100 80GB | 10.3 | 27.0 | 19.1k | 1.5k | 20.5k |
| propella-1-4b | H100 96GB | 22.4 | 12.4 | 41.6k | 3.2k | 44.8k |
| propella-1-4b (fp8) | H100 96GB | 27.0 | 10.3 | 50.1k | 3.9k | 54.0k |
| propella-1-1.7b | A100 80GB | 17.8 | 15.6 | 33.0k | 2.6k | 35.6k |
| propella-1-1.7b | H100 96GB | 35.8 | 7.8 | 66.5k | 5.2k | 71.8k |
| propella-1-1.7b (fp8) | H100 96GB | 39.1 | 7.1 | 72.7k | 5.7k | 78.4k |
| propella-1-0.6b | A100 80GB | 21.5 | 12.9 | 40.0k | 3.1k | 43.1k |
| propella-1-0.6b | H100 96GB | 39.9 | 7.0 | 74.2k | 5.7k | 79.9k |
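The hours column follows directly from the Docs/s column. As a small sketch of that arithmetic (the function name `gpu_hours_per_million` is ours, and a single GPU per server is assumed):

```python
def gpu_hours_per_million(docs_per_second: float) -> float:
    """GPU-hours needed to annotate 1M documents at a given throughput."""
    seconds = 1_000_000 / docs_per_second
    return seconds / 3600


# Docs/s values copied from the table above.
for name, docs_per_second in [
    ("propella-1-4b, A100", 10.3),
    ("propella-1-4b, H100", 22.4),
    ("propella-1-0.6b, H100", 39.9),
]:
    print(f"{name}: {gpu_hours_per_million(docs_per_second):.1f} GPU-hours per 1M docs")
```

To estimate a larger job, scale linearly: annotating 100M documents would take roughly 100 times the listed GPU-hours, assuming a similar document length distribution.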


## Evaluation

We evaluate the propella-1 models on a test set of 3k documents. For these documents we obtain annotations from Gemini-3-Pro (reasoning_effort: high), which we treat as ground-truth labels under the assumption that they represent the upper limit of annotation quality.

All baseline models use the detailed annotator system and user prompts defined in `propella.py`. For throughput reasons, the propella-1 models use a very short, propella-1-specific prompt. We also tested some baseline models with the propella-1 prompt, which always led to worse performance because the prompt lacks detail.

### Metrics by Property Type

Properties are grouped into four categories. Three are evaluated with an appropriate metric, and the free-text property is excluded from scoring:

- **Ordinal Properties** (11 properties): **QWK** (Quadratic Weighted Kappa), which measures agreement while accounting for the ordinal nature of labels. It penalizes larger disagreements more heavily.
- **Binary Properties** (1 property): **F1**, the harmonic mean of precision and recall.
- **Multi-select Properties** (5 properties): **IoU** (Jaccard Index), intersection-over-union averaged across samples.
- **Free-text Properties** (1 property): The `one_sentence_description` property is excluded from quantitative evaluation.
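For readers unfamiliar with these metrics, the following is an illustrative pure-Python sketch of the three textbook definitions. It is not the evaluation code used for propella-1; the function names and the integer encoding of ordinal labels are our assumptions.

```python
def quadratic_weighted_kappa(gold: list[int], pred: list[int], num_classes: int) -> float:
    """QWK for ordinal labels encoded as integers in range(num_classes), num_classes >= 2."""
    n = len(gold)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic weight: disagreements that are further apart cost more.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * gold_hist[i] * pred_hist[j] / n
    if weighted_expected == 0:  # every label falls into one class; treated as agreement here
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def binary_f1(gold: list[bool], pred: list[bool]) -> float:
    """F1 of the positive class: harmonic mean of precision and recall."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_iou(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Jaccard index per sample, averaged over samples."""
    scores = []
    for g, p in zip(gold, pred):
        union = g | p
        scores.append(len(g & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 1], num_classes=3))
print(binary_f1([True, True, False], [True, False, False]))
print(mean_iou([{"instructional", "source_code"}], [{"source_code"}]))
```

In QWK, a prediction that is off by two ordinal levels is penalized four times as heavily as one that is off by a single level, which is the behavior the bullet above describes.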

<p align="center">
<a href="propella-1-4b/blob/main/res/per_property_scores_by_model.png">
<img src="res/per_property_scores_by_model.png" alt="per-property-performance-plot">
</a>
<br>
<sub><a href="propella-1-4b/blob/main/res/per_property_scores_by_model.png"><i>Click to view at full size</i></a></sub>
</p>

### Overall Score

The overall score is a weighted average of the primary metric for each property type:

```
overall = (11/17 × avg_QWK) + (1/17 × avg_F1) + (5/17 × avg_IoU)
```
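As a worked example of the formula, the sketch below weights each metric by the number of properties it covers (11 + 1 + 5 = 17 scored properties). The per-type averages passed in are hypothetical values chosen for illustration, not results from the evaluation.

```python
# Number of scored properties per metric type, as stated in the formula.
NUM_ORDINAL, NUM_BINARY, NUM_MULTISELECT = 11, 1, 5


def overall_score(avg_qwk: float, avg_f1: float, avg_iou: float) -> float:
    """Weighted average of the per-type metrics, weighted by property count."""
    total = NUM_ORDINAL + NUM_BINARY + NUM_MULTISELECT  # 17
    weighted_sum = NUM_ORDINAL * avg_qwk + NUM_BINARY * avg_f1 + NUM_MULTISELECT * avg_iou
    return weighted_sum / total


# Hypothetical per-type averages, for illustration only.
print(round(overall_score(avg_qwk=0.80, avg_f1=0.90, avg_iou=0.70), 3))
```

Because the weights are property counts, this is the same as averaging the primary metric over all 17 scored properties, so the ordinal properties dominate the overall score.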

<p align="center">
<a href="propella-1-4b/blob/main/res/overall_scores_by_model.png">
<img src="res/overall_scores_by_model.png" alt="overall-performance-plot" width="700">
</a>
</p>

### Performance with fp8

propella-1 models are trained with fp8 precision and work well in both bf16 and fp8 inference modes. The table below compares annotation quality across precisions for the 4b and 1.7b models. For the 0.6b model we recommend bf16 precision.


| Model | bf16 score | fp8 score | Diff |
|-------|:----------:|:---------:|:----:|
| propella-1-4b | 0.780 | 0.783 | +0.38% |
| propella-1-1.7b | 0.737 | 0.731 | -0.81% |
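The diff column appears to be the relative change of the fp8 score with respect to the bf16 score; that reading is our assumption, and the sketch below reproduces the listed values under it. The function name `relative_diff_percent` is illustrative.

```python
def relative_diff_percent(bf16_score: float, fp8_score: float) -> float:
    """Relative change of the fp8 score with respect to the bf16 score, in percent."""
    return (fp8_score - bf16_score) / bf16_score * 100


# Scores copied from the table above.
for model, bf16, fp8 in [
    ("propella-1-4b", 0.780, 0.783),
    ("propella-1-1.7b", 0.737, 0.731),
]:
    print(f"{model}: {relative_diff_percent(bf16, fp8):+.2f}%")
```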


## Language Support
The training data for propella-1 contains documents in 57 languages:

| lang_script | percent |
|-------------|---------|
| eng_Latn | 35.08 |
| spa_Latn | 3.98 |
| ita_Latn | 3.97 |
| fra_Latn | 3.95 |
| deu_Latn | 3.86 |
| pol_Latn | 3.81 |
| code | 2.82 |
| math | 2.77 |
| sft | 2.41 |
| ukr_Cyrl | 0.95 |
| nld_Latn | 0.95 |
| tha_Thai | 0.95 |
| jpn_Jpan | 0.94 |
| heb_Hebr | 0.94 |
| ell_Grek | 0.93 |
| kor_Hang | 0.93 |
| isl_Latn | 0.93 |
| dan_Latn | 0.92 |
| cat_Latn | 0.92 |
| slk_Latn | 0.92 |
| rus_Cyrl | 0.91 |
| kat_Geor | 0.90 |
| por_Latn | 0.90 |
| ben_Beng | 0.90 |
| fas_Arab | 0.89 |
| ekk_Latn | 0.89 |
| fin_Latn | 0.89 |
| tur_Latn | 0.89 |
| swe_Latn | 0.88 |
| ind_Latn | 0.88 |
| ces_Latn | 0.88 |
| lit_Latn | 0.88 |
| slv_Latn | 0.87 |
| vie_Latn | 0.87 |
| eus_Latn | 0.87 |
| bul_Cyrl | 0.86 |
| mlt_Latn | 0.86 |
| lvs_Latn | 0.86 |
| nob_Latn | 0.86 |
| hun_Latn | 0.85 |
| urd_Arab | 0.85 |
| ron_Latn | 0.84 |
| glg_Latn | 0.83 |
| gle_Latn | 0.83 |
| nno_Latn | 0.83 |
| ltg_Latn | 0.77 |
| yue_Hant | 0.49 |
| cmn_Hant | 0.48 |
| hrv_Latn | 0.43 |
| arb_Arab | 0.39 |
| bos_Latn | 0.39 |
| mkd_Cyrl | 0.39 |
| srp_Latn | 0.37 |
| cmn_Hani | 0.37 |
| hin_Deva | 0.36 |
| srp_Cyrl | 0.36 |
| als_Latn | 0.35 |
| sqi_Latn | 0.03 |
| est_Latn | 0.02 |
| nor_Latn | 0.02 |
| lav_Latn | 0.02 |
| swa_Latn | 0.02 |

## Citation
```bibtex
@misc{idahl2026propella1multipropertydocumentannotation,
      title={propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale},
      author={Maximilian Idahl and Benedikt Droste and Björn Plüster and Jan Philipp Harries},
      year={2026},
      eprint={2602.12414},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.12414},
}
```


## Acknowledgements
* This project is supported by the OpenEuroLLM project, co-funded by the Digital Europe Programme under GA no. 101195233. For more information see [openeurollm.eu](https://openeurollm.eu).
* This project is supported by the LLMs4EU project, co-funded by the Digital Europe Programme under GA no. 101198470. For more information see the [LLMs4EU website](https://www.alt-edic.eu/projects/llms4eu/).
* This project is supported by the German Federal Ministry for Economic Affairs and Energy (BMWE) under the soofi (Sovereign Open Source Foundation Models for European Intelligence) project.
* We acknowledge the EuroHPC Joint Undertaking for supporting this project through access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium, through a EuroHPC AI Factory Large Scale Access call (EHPC-AIF-2025LS01-028).
* We thank the AI Service Center for Sensitive and Critical Infrastructures (KISSKI), hosted by GWDG, for additional compute access.

<img src="res/eu_cofunding.png" alt="eu-cofunding-logo" width="300" style="vertical-align: middle;">
28
added_tokens.json
Normal file
@@ -0,0 +1,28 @@
{
  "</think>": 151668,
  "</tool_call>": 151658,
  "</tool_response>": 151666,
  "<think>": 151667,
  "<tool_call>": 151657,
  "<tool_response>": 151665,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
}
13
chat_template.jinja
Normal file
@@ -0,0 +1,13 @@
{%- if messages[0].role == 'system' %}
    {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- for message in messages %}
    {%- if message.role == "user" or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>assistant\n' + message.content + '<|im_end|>\n' }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
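To make the control flow of `chat_template.jinja` easier to follow, here is a plain-Python mirror of the same logic. It is only an illustrative sketch: `render_chat` is a hypothetical function, it assumes a non-empty message list with string contents, and the actual rendering is done by the Jinja template itself.

```python
def render_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Plain-Python mirror of the Jinja template above (assumes a non-empty message list)."""
    parts = []
    # A leading system message is emitted once, before the loop.
    if messages[0]["role"] == "system":
        parts.append("<|im_start|>system\n" + messages[0]["content"] + "<|im_end|>\n")
    for index, message in enumerate(messages):
        role, content = message["role"], message["content"]
        # index == 0 corresponds to Jinja's loop.first, so the leading system
        # message is not emitted a second time here.
        if role == "user" or (role == "system" and index != 0):
            parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
        elif role == "assistant":
            parts.append(f"<|im_start|>assistant\n{content}<|im_end|>\n")
        # Any other role (for example "tool") is skipped, as in the template.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chat([
    {"role": "system", "content": "You annotate documents."},
    {"role": "user", "content": "Hi, its me Max."},
]))
```

Note that the template has no branches for tool calls or thinking blocks, even though the corresponding special tokens exist in `added_tokens.json`.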
68
config.json
Normal file
@@ -0,0 +1,68 @@
{
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 9728,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 262144,
  "max_window_layers": 36,
  "model_type": "qwen3",
  "num_attention_heads": 32,
  "num_hidden_layers": 36,
  "num_key_value_heads": 8,
  "pad_token_id": 151643,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 5000000,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "transformers_version": "4.57.1",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 151936
}
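The attention settings in `config.json` allow a back-of-the-envelope estimate of KV-cache memory, which is relevant when choosing `--mem-fraction-static` and `--kv-cache-dtype`. The sketch below uses the standard estimate for grouped-query attention (keys and values for every layer and KV head); it is an approximation under that assumption and ignores any overhead added by the serving framework.

```python
# Values copied from config.json above.
NUM_HIDDEN_LAYERS = 36
NUM_KEY_VALUE_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes_per_token(bytes_per_value: int) -> int:
    """Bytes of KV cache needed for one token across all layers."""
    # Factor 2 accounts for storing both keys and values in every layer.
    return 2 * NUM_HIDDEN_LAYERS * NUM_KEY_VALUE_HEADS * HEAD_DIM * bytes_per_value


CONTEXT_LENGTH = 65536  # matches --context-length in the README's serving command
for name, bytes_per_value in [("bf16", 2), ("fp8", 1)]:
    per_token = kv_cache_bytes_per_token(bytes_per_value)
    full_sequence = per_token * CONTEXT_LENGTH
    print(f"{name}: {per_token} bytes/token, {full_sequence / 2**30:.2f} GiB for one full-length sequence")
```

Under this estimate, an fp8 KV cache needs half the memory of a bf16 cache, so twice as many cached tokens fit in the same memory budget.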
12
generation_config.json
Normal file
@@ -0,0 +1,12 @@
{
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8,
  "transformers_version": "4.57.1"
}
26
inference_example.py
Normal file
@@ -0,0 +1,26 @@
from openai import OpenAI
from propella import (
    create_messages,
    AnnotationResponse,
    get_annotation_response_schema,
)

document = "Hi, its me Max."

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ellamind/propella-1-4b",
    messages=create_messages(document),
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "AnnotationResponse",
            "schema": get_annotation_response_schema(flatten=True, compact_whitespace=True),
            "strict": True,
        }
    },
)
response_content = response.choices[0].message.content
result = AnnotationResponse.model_validate_json(response_content)
print(result.model_dump_json(indent=4))
151388
merges.txt
Normal file
File diff suppressed because it is too large
3
model-00001-of-00002.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:593a0cf241a56e7f0e830365bcc9e72f128b3a2cc56a8994773467cc573ff02b
size 4967215360
3
model-00002-of-00002.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:787b95bc9da1801e0078afa670576b9b9701c420574b2c65af16101a28a94eb9
size 3855679144
406
model.safetensors.index.json
Normal file
@@ -0,0 +1,406 @@
{
  "metadata": {
    "total_size": 8822848512
  },
  "weight_map": {
    "lm_head.weight": "model-00002-of-00002.safetensors",
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.20.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.20.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||||
|
"model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||||
|
"model.norm.weight": "model-00002-of-00002.safetensors"
|
||||||
|
}
|
||||||
|
}
|
||||||
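The map above assigns each tensor name to one of two shard files, and layer 20 is split across both. A minimal sketch of how such an index could be inspected follows. It assumes the enclosing JSON object stores the map under a `weight_map` key (the top of the file is outside this excerpt, so the key name is an assumption), and `group_by_shard` is a hypothetical helper, not part of this repository.

```python
import json
from collections import defaultdict


def group_by_shard(index: dict) -> dict:
    """Map each shard filename to the sorted tensor names stored in it.

    Assumes `index` has a "weight_map" key of the form
    {tensor_name: shard_filename}; that key name is an assumption here.
    """
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in shards.items()}


# A few entries copied from the map above, wrapped in the assumed key.
example_index = json.loads("""
{
  "weight_map": {
    "model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.norm.weight": "model-00002-of-00002.safetensors"
  }
}
""")

grouped = group_by_shard(example_index)
for shard, names in sorted(grouped.items()):
    # Print each shard with the number of tensors it holds in this excerpt.
    print(shard, len(names))
```

Grouping by shard makes it easy to see which files must be present before a given layer can be loaded; for layer 20, both shards are needed.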
916
propella.py
Normal file
@@ -0,0 +1,916 @@
|
|||||||
|
import json
|
||||||
|
from copy import deepcopy
|
||||||
|
from enum import Enum
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import List, Type, Union
|
||||||
|
|
||||||
|
from pydantic import BaseModel, ConfigDict, Field
|
||||||
|
|
||||||
|
SYSTEM_PROMPT = """Annotate the document. Any language; assess quality within its linguistic norms. Respond with a JSON object:
|
||||||
|
content_integrity: technical completeness (complete|mostly_complete|fragment|severely_degraded)
|
||||||
|
content_ratio: content vs navigation/boilerplate ratio (complete_content|mostly_content|mixed_content|mostly_navigation|minimal_content)
|
||||||
|
content_length: substantive words (substantial 2k+|moderate 500-2k|brief 100-500|minimal <100)
|
||||||
|
one_sentence_description: neutral ~10 word summary in English
|
||||||
|
content_type[]: functional purpose (analytical|instructional|reference|procedural|qa_structured|conversational|creative|transactional|boilerplate|news_report|opinion_editorial|review_critique|technical_documentation|specification_standard|legal_document|press_release|structured_data|source_code)
|
||||||
|
business_sector[]: industry domain (academic_research|education_sector|technology_software|hardware_electronics|healthcare_medical|pharmaceutical_biotech|financial_services|legal_services|government_public|manufacturing_industrial|mining_resources|chemicals_materials|energy_utilities|retail_commerce|wholesale_distribution|real_estate_construction|transportation_logistics|automotive_industry|telecommunications|media_entertainment|advertising_marketing|hospitality_tourism|agriculture_food|environmental_services|aerospace_defense|insurance_industry|nonprofit_ngo|consulting_professional|human_resources|security_cyber|gaming_industry|gambling_betting|travel_aviation|food_beverage_hospitality|consumer_goods|general_interest|other)
|
||||||
|
technical_content[]: specialized knowledge (code_heavy|math_heavy|scientific|data_heavy|engineering|basic_technical|non_technical)
|
||||||
|
content_quality: writing/presentation quality (excellent|good|adequate|poor|unacceptable)
|
||||||
|
information_density: signal vs padding (dense|adequate|moderate|thin|empty)
|
||||||
|
educational_value: teaching potential (high|moderate|basic|minimal|none)
|
||||||
|
reasoning_indicators: logical analysis depth (analytical|explanatory|basic_reasoning|minimal|none)
|
||||||
|
audience_level: assumed background (expert|advanced|general|beginner|youth|children)
|
||||||
|
commercial_bias: promotional influence (none|minimal|moderate|heavy|pure_marketing)
|
||||||
|
time_sensitivity: temporal decay (evergreen|slowly_changing|regularly_updating|time_sensitive)
|
||||||
|
content_safety: harmful content (safe|mild_concerns|nsfw|harmful|illegal)
|
||||||
|
pii_presence: private individual data (no_pii|contains_pii)
|
||||||
|
regional_relevance[]: geographic/cultural context (european|north_american|east_asian|south_asian|southeast_asian|middle_eastern|sub_saharan_african|latin_american|oceanian|central_asian|russian_sphere|global|culturally_neutral|indeterminate)
|
||||||
|
country_relevance[]: specific countries as ISO names, or supranational|none
|
||||||
|
"""
|
||||||
|
|
||||||
|
USER_PROMPT = """<start_of_document>
|
||||||
|
{content}
|
||||||
|
<end_of_document>
|
||||||
|
"""
|
||||||
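# Illustrative sketch (not part of the original module): one way to turn the
# compact prompts above into a chat-style message list. The helper name
# `build_messages` and the role/content dict layout are assumptions; adapt
# them to whatever chat template or client is actually in use.
def build_messages(content: str, system_prompt: str, user_template: str) -> list:
    """Return [system, user] messages with the document placed into the template."""
    # str.replace only touches the {content} placeholder, so any other braces
    # that a template might contain are left alone.
    user_message = user_template.replace("{content}", content)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

# Possible usage: build_messages(document_text, SYSTEM_PROMPT, USER_PROMPT)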
|
|
||||||
|
ANNOTATOR_SYSTEM_PROMPT = """You are an expert content analysis assistant specializing in document annotations for LLM pretraining data. Your team is curating a multilingual dataset for language model training. Your task is to analyze documents and annotate them with specific properties that will later on be used to filter the dataset. The user will provide a document inside of "<start_of_document>" and "<end_of_document>" tags. Analyze the content of the document systematically and objectively. Respond with your annotations in JSON format, following the annotation framework below.
|
||||||
|
|
||||||
|
# Annotation Framework
|
||||||
|
## Output Requirements
|
||||||
|
- You must respond with a JSON object that matches the specified schema.
|
||||||
|
- Use the exact enum values provided in the property descriptions.
|
||||||
|
- Ensure all fields are included.
|
||||||
|
- For multi-select properties, always return arrays (even if only one value applies). Multi-select fields: content_type, business_sector, technical_content, regional_relevance, country_relevance. All other properties are single-select strings.
|
||||||
|
- Do not include any explanatory text, comments, or additional formatting.
|
||||||
|
|
||||||
|
## Key Principles
|
||||||
|
* Objective assessment: Base decisions on clear criteria, not subjective preferences.
|
||||||
|
* Completeness: Address all properties for every document.
|
||||||
|
* Consistency: Apply the same standards across all documents.
|
||||||
|
* Multilinguality: The user provided document can be in any language, the language itself should not influence the annotations.
|
||||||
|
|
||||||
|
## Properties to Annotate
|
||||||
|
The annotation framework evaluates documents across 18 key properties organized into six main categories:
|
||||||
|
|
||||||
|
**Core Content Properties:**
|
||||||
|
- Content Integrity: Completeness and technical quality (complete, mostly_complete, fragment, severely_degraded)
|
||||||
|
- Content Ratio: Proportion of meaningful content vs navigation/UI elements (complete_content, mostly_content, mixed_content, mostly_navigation, minimal_content)
|
||||||
|
- Content Length: Amount of substantive content (substantial, moderate, brief, minimal)
|
||||||
|
|
||||||
|
**Content Classification:**
|
||||||
|
- One-Sentence Description: Ultra-short neutral description; exactly one sentence; target 8–15 words (soft max 20)
|
||||||
|
- Content Type: Functional structure and purpose (analytical, instructional, reference, procedural, qa_structured, conversational, creative, transactional, boilerplate, news_report, opinion_editorial, review_critique, technical_documentation, specification_standard, legal_document, press_release, structured_data, source_code)
|
||||||
|
- Business Sector: Industry domain relevance (see Detailed Property Descriptions for exact enum values)
|
||||||
|
- Technical Content: Type and intensity of specialized knowledge (code_heavy, math_heavy, scientific, data_heavy, engineering, basic_technical, non_technical)
|
||||||
|
|
||||||
|
**Quality and Value Assessment:**
|
||||||
|
- Content Quality: Overall writing and presentation quality (excellent, good, adequate, poor, unacceptable)
|
||||||
|
- Information Density: Ratio of valuable information to redundancy (dense, adequate, moderate, thin, empty)
|
||||||
|
- Educational Value: Potential for teaching and learning (high, moderate, basic, minimal, none)
|
||||||
|
- Reasoning Indicators: Presence of logical reasoning and analysis (analytical, explanatory, basic_reasoning, minimal, none)
|
||||||
|
|
||||||
|
**Audience and Purpose:**
|
||||||
|
- Audience Level: Target sophistication level (expert, advanced, general, beginner, youth, children)
|
||||||
|
- Commercial Bias: Commercial influence on objectivity (none, minimal, moderate, heavy, pure_marketing)
|
||||||
|
- Time-Sensitivity: How content value changes over time (evergreen, slowly_changing, regularly_updating, time_sensitive)
|
||||||
|
|
||||||
|
**Safety and Compliance:**
|
||||||
|
- Content Safety: Presence of inappropriate or harmful content (safe, mild_concerns, nsfw, harmful, illegal)
|
||||||
|
- PII Presence: Contains personally identifiable information (no_pii, contains_pii)
|
||||||
|
|
||||||
|
**Geographic Relevance:**
|
||||||
|
- Regional Relevance: Primary regional context (european, north_american, east_asian, south_asian, southeast_asian, middle_eastern, sub_saharan_african, latin_american, oceanian, central_asian, russian_sphere, global, culturally_neutral, indeterminate)
|
||||||
|
- Country Relevance: Specific country relevance (array of country names or special values: "supranational", "none")
|
||||||
|
|
||||||
|
|
||||||
|
{property_descriptions}
|
||||||
|
|
||||||
|
## JSON Schema for the Response
|
||||||
|
Return a single JSON object that strictly conforms to the following JSON Schema:
|
||||||
|
```json
|
||||||
|
{json_schema}
|
||||||
|
```
|
||||||
|
|
||||||
|
## Multilingual Annotation Guidelines
|
||||||
|
|
||||||
|
### Universal Principles
|
||||||
|
1. **Evaluate content quality within language context** - Don't penalize non-English content for being non-English
|
||||||
|
2. **Consider linguistic norms** - Writing styles, sentence lengths, and paragraph structures vary by language
|
||||||
|
3. **Respect script directionality** - RTL languages (Arabic, Hebrew) may have different navigation patterns
|
||||||
|
4. **Account for morphological complexity** - Agglutinative/polysynthetic languages pack more information per word
|
||||||
|
|
||||||
|
### Language-Specific Considerations
|
||||||
|
|
||||||
|
Here are some examples of language-specific considerations:
|
||||||
|
|
||||||
|
**Chinese/Japanese:**
|
||||||
|
- Character count more relevant than word count
|
||||||
|
- Lack of spaces between words is normal
|
||||||
|
- Mixed script usage (especially Japanese) is standard
|
||||||
|
|
||||||
|
**Arabic/Hebrew/Persian:**
|
||||||
|
- RTL text direction affects layout assessment
|
||||||
|
- Diacritical marks may be absent in informal content
|
||||||
|
- Mixed Arabic/English is common in technical content
|
||||||
|
|
||||||
|
**Indian Languages (Hindi, Bengali, Tamil, etc.):**
|
||||||
|
- Code-mixing with English is extremely common and acceptable
|
||||||
|
- Technical terms often borrowed from English
|
||||||
|
- Multiple scripts may appear in same document
|
||||||
|
|
||||||
|
**European Languages:**
|
||||||
|
- Formal/informal distinctions (tu/vous, du/Sie) indicate audience
|
||||||
|
- Compound words affect word count metrics
|
||||||
|
- Regional variants (Brazilian vs European Portuguese, Spanish vs Catalan, etc.) are both valid
|
||||||
|
|
||||||
|
# Annotation Workflow
|
||||||
|
- The user will provide a document in "<start_of_document>" and "<end_of_document>" tags. Analyze the content of the document systematically and objectively
|
||||||
|
- You must respond with a valid JSON object that matches the schema above.
|
||||||
|
- Use the exact enum values provided in the property descriptions
|
||||||
|
- Ensure all required fields are included
|
||||||
|
- For multi-select properties, always return arrays, even if only one value applies (content_type, business_sector, technical_content, regional_relevance, country_relevance). All other properties are single-select strings.
|
||||||
|
- Do not include any explanatory text, comments, or formatting
|
||||||
|
"""
|
||||||
|
|
||||||
|
ANNOTATOR_USER_PROMPT = """Analyze the following document and provide annotations in JSON format according to the annotation framework. Return only the JSON object.
|
||||||
|
<start_of_document>
|
||||||
|
{content}
|
||||||
|
<end_of_document>"""
|
||||||
|
|
||||||
|
|
||||||
|
# Default max length for one_sentence_description field
|
||||||
|
ONE_SENTENCE_DESCRIPTION_MAX_LENGTH = 200
|
||||||
|
|
||||||
|
|
||||||
|
class ContentIntegrity(str, Enum):
|
||||||
|
"""Content completeness and technical quality"""
|
||||||
|
COMPLETE = "complete"
|
||||||
|
MOSTLY_COMPLETE = "mostly_complete"
|
||||||
|
FRAGMENT = "fragment"
|
||||||
|
SEVERELY_DEGRADED = "severely_degraded"
|
||||||
|
|
||||||
|
|
||||||
|
class ContentRatio(str, Enum):
|
||||||
|
"""Ratio of meaningful content vs navigation/UI elements"""
|
||||||
|
COMPLETE_CONTENT = "complete_content"
|
||||||
|
MOSTLY_CONTENT = "mostly_content"
|
||||||
|
MIXED_CONTENT = "mixed_content"
|
||||||
|
MOSTLY_NAVIGATION = "mostly_navigation"
|
||||||
|
MINIMAL_CONTENT = "minimal_content"
|
||||||
|
|
||||||
|
|
||||||
|
class ContentLength(str, Enum):
|
||||||
|
"""Amount of substantive content"""
|
||||||
|
SUBSTANTIAL = "substantial" # 2k+ words (per SYSTEM_PROMPT)
MODERATE = "moderate" # 500-2k words
BRIEF = "brief" # 100-500 words
MINIMAL = "minimal" # <100 words
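# Illustrative sketch (not part of the original module): mapping a word count
# to a content_length label. The thresholds follow the ranges listed in
# SYSTEM_PROMPT (substantial 2k+, moderate 500-2k, brief 100-500, minimal <100).
# The helper name `content_length_label` and the handling of boundary values
# (500 counts as moderate, 2000 as substantial) are assumptions. For languages
# written without spaces between words (e.g. Chinese, Japanese) a
# character-based measure is likely more appropriate than a word count.
def content_length_label(word_count: int) -> str:
    """Return the content_length enum value for a count of substantive words."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    if word_count >= 2000:
        return "substantial"
    if word_count >= 500:
        return "moderate"
    if word_count >= 100:
        return "brief"
    return "minimal"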
|
||||||
|
|
||||||
|
|
||||||
|
class ContentType(str, Enum):
|
||||||
|
"""Primary purpose and type of content"""
|
||||||
|
ANALYTICAL = "analytical"
|
||||||
|
INSTRUCTIONAL = "instructional"
|
||||||
|
REFERENCE = "reference"
|
||||||
|
PROCEDURAL = "procedural"
|
||||||
|
QA_STRUCTURED = "qa_structured"
|
||||||
|
CONVERSATIONAL = "conversational"
|
||||||
|
CREATIVE = "creative"
|
||||||
|
TRANSACTIONAL = "transactional"
|
||||||
|
BOILERPLATE = "boilerplate"
|
||||||
|
NEWS_REPORT = "news_report"
|
||||||
|
OPINION_EDITORIAL = "opinion_editorial"
|
||||||
|
REVIEW_CRITIQUE = "review_critique"
|
||||||
|
TECHNICAL_DOCUMENTATION = "technical_documentation"
|
||||||
|
SPECIFICATION_STANDARD = "specification_standard"
|
||||||
|
LEGAL_DOCUMENT = "legal_document"
|
||||||
|
PRESS_RELEASE = "press_release"
|
||||||
|
STRUCTURED_DATA = "structured_data"
|
||||||
|
SOURCE_CODE = "source_code"
|
||||||
|
|
||||||
|
|
||||||
|
class BusinessSector(str, Enum):
|
||||||
|
"""Industry domain(s) for sector classification (multi-select)"""
|
||||||
|
ACADEMIC_RESEARCH = "academic_research"
|
||||||
|
EDUCATION_SECTOR = "education_sector"
|
||||||
|
TECHNOLOGY_SOFTWARE = "technology_software"
|
||||||
|
HARDWARE_ELECTRONICS = "hardware_electronics"
|
||||||
|
HEALTHCARE_MEDICAL = "healthcare_medical"
|
||||||
|
PHARMACEUTICAL_BIOTECH = "pharmaceutical_biotech"
|
||||||
|
FINANCIAL_SERVICES = "financial_services"
|
||||||
|
LEGAL_SERVICES = "legal_services"
|
||||||
|
GOVERNMENT_PUBLIC = "government_public"
|
||||||
|
MANUFACTURING_INDUSTRIAL = "manufacturing_industrial"
|
||||||
|
MINING_RESOURCES = "mining_resources"
|
||||||
|
CHEMICALS_MATERIALS = "chemicals_materials"
|
||||||
|
ENERGY_UTILITIES = "energy_utilities"
|
||||||
|
RETAIL_COMMERCE = "retail_commerce"
|
||||||
|
WHOLESALE_DISTRIBUTION = "wholesale_distribution"
|
||||||
|
REAL_ESTATE_CONSTRUCTION = "real_estate_construction"
|
||||||
|
TRANSPORTATION_LOGISTICS = "transportation_logistics"
|
||||||
|
AUTOMOTIVE_INDUSTRY = "automotive_industry"
|
||||||
|
TELECOMMUNICATIONS = "telecommunications"
|
||||||
|
MEDIA_ENTERTAINMENT = "media_entertainment"
|
||||||
|
ADVERTISING_MARKETING = "advertising_marketing"
|
||||||
|
HOSPITALITY_TOURISM = "hospitality_tourism"
|
||||||
|
AGRICULTURE_FOOD = "agriculture_food"
|
||||||
|
ENVIRONMENTAL_SERVICES = "environmental_services"
|
||||||
|
AEROSPACE_DEFENSE = "aerospace_defense"
|
||||||
|
INSURANCE_INDUSTRY = "insurance_industry"
|
||||||
|
NONPROFIT_NGO = "nonprofit_ngo"
|
||||||
|
CONSULTING_PROFESSIONAL = "consulting_professional"
|
||||||
|
HUMAN_RESOURCES = "human_resources"
|
||||||
|
SECURITY_CYBER = "security_cyber"
|
||||||
|
GAMING_INDUSTRY = "gaming_industry"
|
||||||
|
GAMBLING_BETTING = "gambling_betting"
|
||||||
|
TRAVEL_AVIATION = "travel_aviation"
|
||||||
|
FOOD_BEVERAGE_HOSPITALITY = "food_beverage_hospitality"
|
||||||
|
CONSUMER_GOODS = "consumer_goods"
|
||||||
|
GENERAL_INTEREST = "general_interest"
|
||||||
|
OTHER = "other"
|
||||||
|
|
||||||
|
|
||||||
|
class TechnicalContent(str, Enum):
|
||||||
|
"""Type and intensity of specialized technical knowledge"""
|
||||||
|
CODE_HEAVY = "code_heavy"
|
||||||
|
MATH_HEAVY = "math_heavy"
|
||||||
|
SCIENTIFIC = "scientific"
|
||||||
|
DATA_HEAVY = "data_heavy"
|
||||||
|
ENGINEERING = "engineering"
|
||||||
|
BASIC_TECHNICAL = "basic_technical"
|
||||||
|
NON_TECHNICAL = "non_technical"
|
||||||
|
|
||||||
|
|
||||||
|
class InformationDensity(str, Enum):
|
||||||
|
"""Ratio of valuable information to redundancy and padding"""
|
||||||
|
DENSE = "dense"
|
||||||
|
ADEQUATE = "adequate"
|
||||||
|
MODERATE = "moderate"
|
||||||
|
THIN = "thin"
|
||||||
|
EMPTY = "empty"
|
||||||
|
|
||||||
|
|
||||||
|
class ContentQuality(str, Enum):
|
||||||
|
"""Overall quality considering writing, value, and presentation"""
|
||||||
|
EXCELLENT = "excellent"
|
||||||
|
GOOD = "good"
|
||||||
|
ADEQUATE = "adequate"
|
||||||
|
POOR = "poor"
|
||||||
|
UNACCEPTABLE = "unacceptable"
|
||||||
|
|
||||||
|
|
||||||
|
class AudienceLevel(str, Enum):
|
||||||
|
"""Intended sophistication level and background knowledge assumptions"""
|
||||||
|
EXPERT = "expert"
|
||||||
|
ADVANCED = "advanced"
|
||||||
|
GENERAL = "general"
|
||||||
|
BEGINNER = "beginner"
|
||||||
|
YOUTH = "youth"
|
||||||
|
CHILDREN = "children"
|
||||||
|
|
||||||
|
|
||||||
|
class CommercialBias(str, Enum):
|
||||||
|
"""Commercial influence on objectivity and informational value"""
|
||||||
|
NONE = "none"
|
||||||
|
MINIMAL = "minimal"
|
||||||
|
MODERATE = "moderate"
|
||||||
|
HEAVY = "heavy"
|
||||||
|
PURE_MARKETING = "pure_marketing"
|
||||||
|
|
||||||
|
|
||||||
|
class ContentSafety(str, Enum):
|
||||||
|
"""Presence of inappropriate, harmful, or legally problematic content"""
|
||||||
|
SAFE = "safe"
|
||||||
|
MILD_CONCERNS = "mild_concerns"
|
||||||
|
NSFW = "nsfw"
|
||||||
|
HARMFUL = "harmful"
|
||||||
|
ILLEGAL = "illegal"
|
||||||
|
|
||||||
|
|
||||||
|
class EducationalValue(str, Enum):
|
||||||
|
"""Potential for teaching, learning, and knowledge transfer"""
|
||||||
|
HIGH = "high"
|
||||||
|
MODERATE = "moderate"
|
||||||
|
BASIC = "basic"
|
||||||
|
MINIMAL = "minimal"
|
||||||
|
NONE = "none"
|
||||||
|
|
||||||
|
|
||||||
|
class ReasoningIndicators(str, Enum):
|
||||||
|
"""Presence and quality of logical reasoning and analysis"""
|
||||||
|
ANALYTICAL = "analytical"
|
||||||
|
EXPLANATORY = "explanatory"
|
||||||
|
BASIC_REASONING = "basic_reasoning"
|
||||||
|
MINIMAL = "minimal"
|
||||||
|
NONE = "none"
|
||||||
|
|
||||||
|
|
||||||
|
class RegionalRelevance(str, Enum):
|
||||||
|
"""Primary regional, cultural, or geopolitical sphere(s)"""
|
||||||
|
EUROPEAN = "european"
|
||||||
|
NORTH_AMERICAN = "north_american"
|
||||||
|
EAST_ASIAN = "east_asian"
|
||||||
|
SOUTH_ASIAN = "south_asian"
|
||||||
|
SOUTHEAST_ASIAN = "southeast_asian"
|
||||||
|
MIDDLE_EASTERN = "middle_eastern"
|
||||||
|
SUB_SAHARAN_AFRICAN = "sub_saharan_african"
|
||||||
|
LATIN_AMERICAN = "latin_american"
|
||||||
|
OCEANIAN = "oceanian"
|
||||||
|
CENTRAL_ASIAN = "central_asian"
|
||||||
|
RUSSIAN_SPHERE = "russian_sphere"
|
||||||
|
GLOBAL = "global"
|
||||||
|
CULTURALLY_NEUTRAL = "culturally_neutral"
|
||||||
|
INDETERMINATE = "indeterminate"
|
||||||
|
|
||||||
|
|
||||||
|
class TimeSensitivity(str, Enum):
|
||||||
|
"""How time-sensitive the content is"""
|
||||||
|
EVERGREEN = "evergreen"
|
||||||
|
SLOWLY_CHANGING = "slowly_changing"
|
||||||
|
REGULARLY_UPDATING = "regularly_updating"
|
||||||
|
TIME_SENSITIVE = "time_sensitive"
|
||||||
|
|
||||||
|
|
||||||
|
class PiiPresence(str, Enum):
|
||||||
|
"""Presence of personally identifiable information"""
|
||||||
|
NO_PII = "no_pii"
|
||||||
|
CONTAINS_PII = "contains_pii"
|
||||||
|
|
||||||
|
|
||||||
|
class Country(str, Enum):
|
||||||
|
"""
|
||||||
|
Country names for country relevance classification.
|
||||||
|
Based on ISO 3166-1 standard - the authoritative international standard
|
||||||
|
for country codes maintained by the International Organization for Standardization.
|
||||||
|
|
||||||
|
Covers the 193 UN member states and 2 UN observer states, plus a selection of
dependent territories, special areas, and partially recognized states. It is
not the complete set of 249 ISO 3166-1 entries.
|
||||||
|
|
||||||
|
References:
|
||||||
|
- https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes
|
||||||
|
- https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area
|
||||||
|
"""
|
||||||
|
|
||||||
|
# UN member states (193 total) and UN observer states (2 total), plus Cook Islands, Kosovo, and Niue
|
||||||
|
AFGHANISTAN = "afghanistan"
|
||||||
|
ALBANIA = "albania"
|
||||||
|
ALGERIA = "algeria"
|
||||||
|
ANDORRA = "andorra"
|
||||||
|
ANGOLA = "angola"
|
||||||
|
ANTIGUA_AND_BARBUDA = "antigua_and_barbuda"
|
||||||
|
ARGENTINA = "argentina"
|
||||||
|
ARMENIA = "armenia"
|
||||||
|
AUSTRALIA = "australia"
|
||||||
|
AUSTRIA = "austria"
|
||||||
|
AZERBAIJAN = "azerbaijan"
|
||||||
|
BAHAMAS = "bahamas"
|
||||||
|
BAHRAIN = "bahrain"
|
||||||
|
BANGLADESH = "bangladesh"
|
||||||
|
BARBADOS = "barbados"
|
||||||
|
BELARUS = "belarus"
|
||||||
|
BELGIUM = "belgium"
|
||||||
|
BELIZE = "belize"
|
||||||
|
BENIN = "benin"
|
||||||
|
BHUTAN = "bhutan"
|
||||||
|
BOLIVIA = "bolivia"
|
||||||
|
BOSNIA_AND_HERZEGOVINA = "bosnia_and_herzegovina"
|
||||||
|
BOTSWANA = "botswana"
|
||||||
|
BRAZIL = "brazil"
|
||||||
|
BRUNEI = "brunei"
|
||||||
|
BULGARIA = "bulgaria"
|
||||||
|
BURKINA_FASO = "burkina_faso"
|
||||||
|
BURUNDI = "burundi"
|
||||||
|
CABO_VERDE = "cabo_verde"
|
||||||
|
CAMBODIA = "cambodia"
|
||||||
|
CAMEROON = "cameroon"
|
||||||
|
CANADA = "canada"
|
||||||
|
CENTRAL_AFRICAN_REPUBLIC = "central_african_republic"
|
||||||
|
CHAD = "chad"
|
||||||
|
CHILE = "chile"
|
||||||
|
CHINA = "china"
|
||||||
|
COLOMBIA = "colombia"
|
||||||
|
COMOROS = "comoros"
|
||||||
|
CONGO = "congo"
|
||||||
|
CONGO_DEMOCRATIC_REPUBLIC = "congo_democratic_republic"
|
||||||
|
COOK_ISLANDS = "cook_islands"
|
||||||
|
COSTA_RICA = "costa_rica"
|
||||||
|
CROATIA = "croatia"
|
||||||
|
CUBA = "cuba"
|
||||||
|
CYPRUS = "cyprus"
|
||||||
|
CZECH_REPUBLIC = "czech_republic"
|
||||||
|
DENMARK = "denmark"
|
||||||
|
DJIBOUTI = "djibouti"
|
||||||
|
DOMINICA = "dominica"
|
||||||
|
DOMINICAN_REPUBLIC = "dominican_republic"
|
||||||
|
ECUADOR = "ecuador"
|
||||||
|
EGYPT = "egypt"
|
||||||
|
EL_SALVADOR = "el_salvador"
|
||||||
|
EQUATORIAL_GUINEA = "equatorial_guinea"
|
||||||
|
ERITREA = "eritrea"
|
||||||
|
ESTONIA = "estonia"
|
||||||
|
ESWATINI = "eswatini"
|
||||||
|
ETHIOPIA = "ethiopia"
|
||||||
|
FIJI = "fiji"
|
||||||
|
FINLAND = "finland"
|
||||||
|
FRANCE = "france"
|
||||||
|
GABON = "gabon"
|
||||||
|
GAMBIA = "gambia"
|
||||||
|
GEORGIA = "georgia"
|
||||||
|
GERMANY = "germany"
|
||||||
|
GHANA = "ghana"
|
||||||
|
GREECE = "greece"
|
||||||
|
GRENADA = "grenada"
|
||||||
|
GUATEMALA = "guatemala"
|
||||||
|
GUINEA = "guinea"
|
||||||
|
GUINEA_BISSAU = "guinea_bissau"
|
||||||
|
GUYANA = "guyana"
|
||||||
|
HAITI = "haiti"
|
||||||
|
HONDURAS = "honduras"
|
||||||
|
HUNGARY = "hungary"
|
||||||
|
ICELAND = "iceland"
|
||||||
|
INDIA = "india"
|
||||||
|
INDONESIA = "indonesia"
|
||||||
|
IRAN = "iran"
|
||||||
|
IRAQ = "iraq"
|
||||||
|
IRELAND = "ireland"
|
||||||
|
ISRAEL = "israel"
|
||||||
|
ITALY = "italy"
|
||||||
|
IVORY_COAST = "ivory_coast"
|
||||||
|
JAMAICA = "jamaica"
|
||||||
|
JAPAN = "japan"
|
||||||
|
JORDAN = "jordan"
|
||||||
|
KAZAKHSTAN = "kazakhstan"
|
||||||
|
KENYA = "kenya"
|
||||||
|
KIRIBATI = "kiribati"
|
||||||
|
NORTH_KOREA = "north_korea"
|
||||||
|
SOUTH_KOREA = "south_korea"
|
||||||
|
KOSOVO = "kosovo"
|
||||||
|
KUWAIT = "kuwait"
|
||||||
|
KYRGYZSTAN = "kyrgyzstan"
|
||||||
|
LAOS = "laos"
|
||||||
|
LATVIA = "latvia"
|
||||||
|
LEBANON = "lebanon"
|
||||||
|
LESOTHO = "lesotho"
|
||||||
|
LIBERIA = "liberia"
|
||||||
|
LIBYA = "libya"
|
||||||
|
LIECHTENSTEIN = "liechtenstein"
|
||||||
|
LITHUANIA = "lithuania"
|
||||||
|
LUXEMBOURG = "luxembourg"
|
||||||
|
MADAGASCAR = "madagascar"
|
||||||
|
MALAWI = "malawi"
|
||||||
|
MALAYSIA = "malaysia"
|
||||||
|
MALDIVES = "maldives"
|
||||||
|
MALI = "mali"
|
||||||
|
MALTA = "malta"
|
||||||
|
MARSHALL_ISLANDS = "marshall_islands"
|
||||||
|
MAURITANIA = "mauritania"
|
||||||
|
MAURITIUS = "mauritius"
|
||||||
|
MEXICO = "mexico"
|
||||||
|
MICRONESIA = "micronesia"
|
||||||
|
MOLDOVA = "moldova"
|
||||||
|
MONACO = "monaco"
|
||||||
|
MONGOLIA = "mongolia"
|
||||||
|
MONTENEGRO = "montenegro"
|
||||||
|
MOROCCO = "morocco"
|
||||||
|
MOZAMBIQUE = "mozambique"
|
||||||
|
MYANMAR = "myanmar"
|
||||||
|
NAMIBIA = "namibia"
|
||||||
|
NAURU = "nauru"
|
||||||
|
NEPAL = "nepal"
|
||||||
|
NETHERLANDS = "netherlands"
|
||||||
|
NEW_ZEALAND = "new_zealand"
|
||||||
|
NICARAGUA = "nicaragua"
|
||||||
|
NIGER = "niger"
|
||||||
|
NIGERIA = "nigeria"
|
||||||
|
NIUE = "niue"
|
||||||
|
NORTH_MACEDONIA = "north_macedonia"
|
||||||
|
NORWAY = "norway"
|
||||||
|
OMAN = "oman"
|
||||||
|
PAKISTAN = "pakistan"
|
||||||
|
PALAU = "palau"
|
||||||
|
PALESTINE = "palestine" # UN Observer State
|
||||||
|
PANAMA = "panama"
|
||||||
|
PAPUA_NEW_GUINEA = "papua_new_guinea"
|
||||||
|
PARAGUAY = "paraguay"
|
||||||
|
PERU = "peru"
|
||||||
|
PHILIPPINES = "philippines"
|
||||||
|
POLAND = "poland"
|
||||||
|
PORTUGAL = "portugal"
|
||||||
|
QATAR = "qatar"
|
||||||
|
ROMANIA = "romania"
|
||||||
|
RUSSIA = "russia"
|
||||||
|
RWANDA = "rwanda"
|
||||||
|
SAINT_KITTS_AND_NEVIS = "saint_kitts_and_nevis"
|
||||||
|
SAINT_LUCIA = "saint_lucia"
|
||||||
|
SAINT_VINCENT_AND_THE_GRENADINES = "saint_vincent_and_the_grenadines"
|
||||||
|
SAMOA = "samoa"
|
||||||
|
SAN_MARINO = "san_marino"
|
||||||
|
SAO_TOME_AND_PRINCIPE = "sao_tome_and_principe"
|
||||||
|
SAUDI_ARABIA = "saudi_arabia"
|
||||||
|
SENEGAL = "senegal"
|
||||||
|
SERBIA = "serbia"
|
||||||
|
SEYCHELLES = "seychelles"
|
||||||
|
SIERRA_LEONE = "sierra_leone"
|
||||||
|
SINGAPORE = "singapore"
|
||||||
|
SLOVAKIA = "slovakia"
|
||||||
|
SLOVENIA = "slovenia"
|
||||||
|
SOLOMON_ISLANDS = "solomon_islands"
|
||||||
|
SOMALIA = "somalia"
|
||||||
|
SOUTH_AFRICA = "south_africa"
|
||||||
|
SOUTH_SUDAN = "south_sudan"
|
||||||
|
SPAIN = "spain"
|
||||||
|
SRI_LANKA = "sri_lanka"
|
||||||
|
SUDAN = "sudan"
|
||||||
|
SURINAME = "suriname"
|
||||||
|
SWEDEN = "sweden"
|
||||||
|
SWITZERLAND = "switzerland"
|
||||||
|
SYRIA = "syria"
|
||||||
|
TAJIKISTAN = "tajikistan"
|
||||||
|
TANZANIA = "tanzania"
|
||||||
|
THAILAND = "thailand"
|
||||||
|
TIMOR_LESTE = "timor_leste"
|
||||||
|
TOGO = "togo"
|
||||||
|
TONGA = "tonga"
|
||||||
|
TRINIDAD_AND_TOBAGO = "trinidad_and_tobago"
|
||||||
|
TUNISIA = "tunisia"
|
||||||
|
TURKEY = "turkey"
|
||||||
|
TURKMENISTAN = "turkmenistan"
|
||||||
|
TUVALU = "tuvalu"
|
||||||
|
UGANDA = "uganda"
|
||||||
|
UKRAINE = "ukraine"
|
||||||
|
UNITED_ARAB_EMIRATES = "united_arab_emirates"
|
||||||
|
UNITED_KINGDOM = "united_kingdom"
|
||||||
|
UNITED_STATES = "united_states"
|
||||||
|
URUGUAY = "uruguay"
|
||||||
|
UZBEKISTAN = "uzbekistan"
|
||||||
|
VANUATU = "vanuatu"
|
||||||
|
VATICAN_CITY = "vatican_city" # UN Observer State
|
||||||
|
VENEZUELA = "venezuela"
|
||||||
|
VIETNAM = "vietnam"
|
||||||
|
YEMEN = "yemen"
|
||||||
|
ZAMBIA = "zambia"
|
||||||
|
ZIMBABWE = "zimbabwe"
|
||||||
|
|
||||||
|
# Dependent Territories and Special Administrative Regions (from ISO 3166-1)
|
||||||
|
ALAND_ISLANDS = "aland_islands" # Finland
|
||||||
|
AMERICAN_SAMOA = "american_samoa" # United States
|
||||||
|
ANGUILLA = "anguilla" # United Kingdom
|
||||||
|
ANTARCTICA = "antarctica" # Antarctic Treaty
|
||||||
|
ARUBA = "aruba" # Netherlands
|
||||||
|
ASCENSION_ISLAND = "ascension_island" # United Kingdom
|
||||||
|
BERMUDA = "bermuda" # United Kingdom
|
||||||
|
BRITISH_VIRGIN_ISLANDS = "british_virgin_islands" # United Kingdom
|
||||||
|
CAYMAN_ISLANDS = "cayman_islands" # United Kingdom
|
||||||
|
CHRISTMAS_ISLAND = "christmas_island" # Australia
|
||||||
|
COCOS_ISLANDS = "cocos_islands" # Australia
|
||||||
|
CURACAO = "curacao" # Netherlands
|
||||||
|
FALKLAND_ISLANDS = "falkland_islands" # United Kingdom
|
||||||
|
FAROE_ISLANDS = "faroe_islands" # Denmark
|
||||||
|
FRENCH_GUIANA = "french_guiana" # France
|
||||||
|
FRENCH_POLYNESIA = "french_polynesia" # France
|
||||||
|
GIBRALTAR = "gibraltar" # United Kingdom
|
||||||
|
GREENLAND = "greenland" # Denmark
|
||||||
|
GUADELOUPE = "guadeloupe" # France
|
||||||
|
GUAM = "guam" # United States
|
||||||
|
GUERNSEY = "guernsey" # United Kingdom
|
||||||
|
HONG_KONG = "hong_kong" # China
|
||||||
|
ISLE_OF_MAN = "isle_of_man" # United Kingdom
|
||||||
|
JERSEY = "jersey" # United Kingdom
|
||||||
|
MACAU = "macau" # China
|
||||||
|
MARTINIQUE = "martinique" # France
|
||||||
|
MAYOTTE = "mayotte" # France
|
||||||
|
MONTSERRAT = "montserrat" # United Kingdom
|
||||||
|
NEW_CALEDONIA = "new_caledonia" # France
|
||||||
|
NORFOLK_ISLAND = "norfolk_island" # Australia
|
||||||
|
NORTHERN_MARIANA_ISLANDS = "northern_mariana_islands" # United States
|
||||||
|
PITCAIRN_ISLANDS = "pitcairn_islands" # United Kingdom
|
||||||
|
PUERTO_RICO = "puerto_rico" # United States
|
||||||
|
REUNION = "reunion" # France
|
||||||
|
SAINT_BARTHELEMY = "saint_barthelemy" # France
|
||||||
|
SAINT_HELENA = "saint_helena" # United Kingdom
|
||||||
|
SAINT_MARTIN = "saint_martin" # France
|
||||||
|
SAINT_PIERRE_AND_MIQUELON = "saint_pierre_and_miquelon" # France
|
||||||
|
SINT_MAARTEN = "sint_maarten" # Netherlands
|
||||||
|
SVALBARD_AND_JAN_MAYEN = "svalbard_and_jan_mayen" # Norway
|
||||||
|
TAIWAN = "taiwan" # China (disputed)
|
||||||
|
TOKELAU = "tokelau" # New Zealand
|
||||||
|
TRISTAN_DA_CUNHA = "tristan_da_cunha" # United Kingdom
|
||||||
|
TURKS_AND_CAICOS_ISLANDS = "turks_and_caicos_islands" # United Kingdom
|
||||||
|
US_VIRGIN_ISLANDS = "us_virgin_islands" # United States
|
||||||
|
WALLIS_AND_FUTUNA = "wallis_and_futuna" # France
|
||||||
|
WESTERN_SAHARA = "western_sahara" # Disputed
|
||||||
|
|
||||||
|
|
||||||
|
class CountryRelevanceSpecial(str, Enum):
|
||||||
|
"""
|
||||||
|
Special values for country relevance classification from annotation guidelines.
|
||||||
|
These are used when content doesn't relate to specific countries.
|
||||||
|
"""
|
||||||
|
SUPRANATIONAL = "supranational"
|
||||||
|
NONE = "none"
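# Illustrative sketch (not part of the original module): checking that a
# country_relevance entry is either a country name or one of the special
# values. The helper name `is_member_of_any` is an assumption; it takes the
# Enum classes as arguments so that it does not depend on this module.
def is_member_of_any(value: str, *enum_classes) -> bool:
    """Return True if `value` equals the value of a member of any given Enum class."""
    return any(value in {member.value for member in cls} for cls in enum_classes)

# Possible usage: is_member_of_any("supranational", Country, CountryRelevanceSpecial)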
|
||||||
|
|
||||||
|
|
||||||
|
def create_annotation_response_model(
|
||||||
|
one_sentence_description_max_length: int = ONE_SENTENCE_DESCRIPTION_MAX_LENGTH,
|
||||||
|
) -> Type[BaseModel]:
|
||||||
|
"""
|
||||||
|
Factory function to create an AnnotationResponse model with configurable max_length
|
||||||
|
for the one_sentence_description field.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
one_sentence_description_max_length: Maximum length for the one_sentence_description field.
|
||||||
|
Defaults to ONE_SENTENCE_DESCRIPTION_MAX_LENGTH (200).
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A Pydantic model class with the specified configuration.
|
||||||
|
"""
|
||||||
|
|
||||||
|
    class _AnnotationResponse(BaseModel):
        """
        Property annotation pydantic model for LLM pretraining data.
        It captures all 18 properties as defined in the annotation guidelines for consistently identifying high-value content for language model training.
        """

        # Property 1: Content Integrity
        content_integrity: ContentIntegrity = Field(
            ...,
            description="Completeness and technical quality of the content itself"
        )

        # Property 2: Content Ratio
        content_ratio: ContentRatio = Field(
            ...,
            description="Ratio of meaningful content vs navigation/UI elements"
        )

        # Property 3: Content Length
        content_length: ContentLength = Field(
            ...,
            description="Amount of substantive content, ignoring navigation and boilerplate"
        )

        # Property 4: One-Sentence Description
        one_sentence_description: str = Field(
            ...,
            description="Ultra-short neutral description of the document. Exactly one sentence. Target 8–15 words (soft max 20). Neutral tone; avoid boilerplate intros and calls to action.",
            max_length=one_sentence_description_max_length,
        )

        # Property 5: Content Type (multi-select)
        content_type: List[ContentType] = Field(
            ...,
            description="Primary purpose and type of content - always return an array (one or more types)",
            min_length=1,
            max_length=5
        )

        # Property 6: Business Sector (multi-select)
        business_sector: List[BusinessSector] = Field(
            ...,
            description="Industry sector(s) - always return an array (one or more sectors)",
            min_length=1,
            max_length=10
        )

        # Property 7: Technical Content (multi-select)
        technical_content: List[TechnicalContent] = Field(
            ...,
            description="Type and intensity of specialized technical knowledge - always return an array (one or more types)",
            min_length=1
        )

        # Property 8: Information Density
        information_density: InformationDensity = Field(
            ...,
            description="Ratio of valuable information to redundancy, padding, and repetition"
        )

        # Property 9: Content Quality
        content_quality: ContentQuality = Field(
            ...,
            description="Overall quality considering writing excellence, substantive value, and presentation"
        )

        # Property 10: Audience Level
        audience_level: AudienceLevel = Field(
            ...,
            description="Intended sophistication level and background knowledge assumptions"
        )

        # Property 11: Commercial Bias
        commercial_bias: CommercialBias = Field(
            ...,
            description="How much commercial interests influence objectivity and informational value"
        )

        # Property 12: Time Sensitivity
        time_sensitivity: TimeSensitivity = Field(
            ...,
            description="How time-sensitive the content is"
        )

        # Property 13: Content Safety
        content_safety: ContentSafety = Field(
            ...,
            description="Presence of inappropriate, harmful, or legally problematic content"
        )

        # Property 14: Educational Value
        educational_value: EducationalValue = Field(
            ...,
            description="Potential for teaching, learning, and knowledge transfer"
        )

        # Property 15: Reasoning Indicators
        reasoning_indicators: ReasoningIndicators = Field(
            ...,
            description="Presence and quality of logical reasoning, analysis, and explanatory content"
        )

        # Property 16: PII Presence
        pii_presence: PiiPresence = Field(
            ...,
            description="Whether the content contains personally identifiable information"
        )

        # Property 17: Regional Relevance (multi-select)
        regional_relevance: List[RegionalRelevance] = Field(
            ...,
            description="Primary regional, cultural, or geopolitical sphere(s) - always return an array (one or multiple regions)",
            min_length=1,
            max_length=3
        )

        # Property 18: Country Relevance (multi-select)
        country_relevance: List[Union[Country, CountryRelevanceSpecial]] = Field(
            ...,
            description="Specific country/countries the content mentions or is relevant for (or special values for supranational/non-country-specific) - always return an array (one or more countries/special values)",
            min_length=1,
            max_length=10
        )

        model_config = ConfigDict(
            validate_assignment=True,
            extra="forbid",  # Don't allow extra fields
            json_schema_extra={
                "example": {
                    "content_integrity": "complete",
                    "content_ratio": "mostly_content",
                    "content_length": "substantial",
                    "one_sentence_description": "API reference for payment endpoints and error codes.",
                    "content_type": ["analytical", "instructional"],
                    "business_sector": ["academic_research", "technology_software"],
                    "technical_content": ["scientific", "data_heavy"],
                    "information_density": "dense",
                    "content_quality": "excellent",
                    "audience_level": "expert",
                    "commercial_bias": "none",
                    "time_sensitivity": "slowly_changing",
                    "content_safety": "safe",
                    "educational_value": "high",
                    "reasoning_indicators": "analytical",
                    "pii_presence": "no_pii",
                    "regional_relevance": ["european"],
                    "country_relevance": ["germany"]
                }
            },
        )

    return _AnnotationResponse


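The model above combines three Pydantic v2 features: enum-typed fields, length bounds on strings and lists, and `extra="forbid"`. The sketch below shows how those constraints behave on a much smaller model. `Region`, `MiniAnnotation`, and `is_valid` are hypothetical names invented for this illustration, and the block assumes Pydantic v2 is installed; it is not the repository's actual model.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class Region(str, Enum):
    """Hypothetical two-value enum standing in for RegionalRelevance."""
    european = "european"
    global_ = "global"


class MiniAnnotation(BaseModel):
    """Hypothetical two-field model mirroring the constraint style used above."""
    model_config = ConfigDict(extra="forbid")  # reject unknown keys

    one_sentence_description: str = Field(..., max_length=150)
    regional_relevance: List[Region] = Field(..., min_length=1, max_length=3)


def is_valid(payload: dict) -> bool:
    """Return True if the payload satisfies every constraint of MiniAnnotation."""
    try:
        MiniAnnotation.model_validate(payload)
        return True
    except ValidationError:
        return False


ok = MiniAnnotation.model_validate(
    {
        "one_sentence_description": "API reference for payment endpoints.",
        "regional_relevance": ["european"],
    }
)
empty_list_ok = is_valid({"one_sentence_description": "x", "regional_relevance": []})
extra_key_ok = is_valid(
    {"one_sentence_description": "x", "regional_relevance": ["global"], "note": "hi"}
)
```

An empty list fails `min_length=1`, and the unexpected `note` key fails because of `extra="forbid"`, so both helper calls return `False`.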
def flatten_model_json_schema(schema: dict) -> dict:
    """Inline all #/$defs/... references and remove $defs from a Pydantic JSON Schema.

    - Recursively resolves $ref entries that point into local $defs
    - Preserves sibling constraints next to $ref by shallow-merging into the resolved target
    - Drops any nested $defs occurrences
    """
    schema_copy = deepcopy(schema)
    defs = schema_copy.pop("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            if "$ref" in node:
                ref = node.get("$ref")
                extra = {k: v for k, v in node.items() if k != "$ref" and k != "$defs"}
                if isinstance(ref, str) and ref.startswith("#/$defs/"):
                    name = ref.split("/")[-1]
                    replacement = deepcopy(defs.get(name, {}))
                    resolved_replacement = resolve(replacement)
                    resolved_extra = resolve(extra)
                    if isinstance(resolved_replacement, dict) and isinstance(resolved_extra, dict):
                        return {**resolved_replacement, **resolved_extra}
                    return resolved_replacement
                resolved_extra = resolve(extra)
                return {**({"$ref": ref}), **(resolved_extra if isinstance(resolved_extra, dict) else {})}
            return {k: resolve(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(schema_copy)


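To make the inlining behaviour concrete, here is a minimal sketch on a toy schema. `inline_local_refs` is a hypothetical, simplified stand-in written for this example (it is not the function above), and the `Color` definition is invented. Like the original, it assumes the definitions are not recursive; a self-referencing `$defs` entry would recurse without end.

```python
from copy import deepcopy


def inline_local_refs(schema: dict) -> dict:
    """Simplified illustration: replace local #/$defs/ references with their targets."""
    root = deepcopy(schema)
    defs = root.pop("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                # Resolve the referenced definition first, then let sibling
                # keys (for example "description") override it.
                target = resolve(deepcopy(defs.get(ref.split("/")[-1], {})))
                siblings = {
                    k: resolve(v) for k, v in node.items() if k not in ("$ref", "$defs")
                }
                return {**target, **siblings}
            return {k: resolve(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(root)


example = {
    "$defs": {"Color": {"type": "string", "enum": ["red", "green"]}},
    "type": "object",
    "properties": {
        "color": {"$ref": "#/$defs/Color", "description": "Preferred color"},
    },
}
flat = inline_local_refs(example)
```

After the call, `flat["properties"]["color"]` carries the enum inline together with the sibling `description`, and the top-level `$defs` key is gone. The input `example` is left unmodified because the function works on a deep copy.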
def get_annotation_response_schema(
    use_country_enum: bool = True,
    flatten: bool = True,
    as_string: bool = False,
    minify: bool = True,
    one_sentence_description_max_length=ONE_SENTENCE_DESCRIPTION_MAX_LENGTH,
    compact_whitespace: bool = True
) -> Union[dict, str]:
    """
    Build the JSON Schema for `AnnotationResponse` with an option to avoid large country enums.

    - If `use_country_enum` is True (default), the schema uses enum definitions for
      `country_relevance` items as generated by Pydantic.
    - If `use_country_enum` is False, `country_relevance` becomes a list of strings
      (no enum) while the property's description still contains the full list of
      valid values. This avoids very large enum blocks for APIs that do not support them.
    - If `flatten` is True (default), inline all local $defs via `flatten_model_json_schema`.
    - If `as_string` is True, return the schema as a JSON string. When `as_string`
      is True and `minify` is True (default), emit compact JSON with no extra
      whitespace to reduce token usage. If `minify` is False, pretty-print with indentation.
    - If `compact_whitespace` is True (default), add an `x-guidance` directive to enforce
      compact JSON output with no tabs, newlines, or extra whitespace between tokens.
      This prevents models from generating whitespace-heavy malformed JSON.
    """
    schema = create_annotation_response_model(one_sentence_description_max_length).model_json_schema()

    # Add x-guidance directive for llguidance to enforce compact JSON (no tabs/newlines/whitespace)
    if compact_whitespace:
        schema["x-guidance"] = {"whitespace_flexible": False}

    if not use_country_enum:
        # Construct the list of valid values from the enums but do not emit them as enum types
        valid_values = [e.value for e in Country] + [e.value for e in CountryRelevanceSpecial]

        country_prop = schema.get("properties", {}).get("country_relevance")
        if isinstance(country_prop, dict):
            existing_description = country_prop.get("description", "")

            # Ensure the property is an array of strings without duplicating the long values list
            country_prop["type"] = "array"
            country_prop["items"] = {"type": "string"}

            # Retain minItems and other constraints already present on the property

            # Put the full list of valid values only in the property description (not in items)
            values_text = f" Valid values: {', '.join(valid_values)}"
            if existing_description and "Valid values:" not in existing_description:
                country_prop["description"] = existing_description.rstrip() + values_text
            elif not existing_description:
                country_prop["description"] = values_text.strip()

    if flatten:
        schema = flatten_model_json_schema(schema)

    if as_string:
        if minify:
            return json.dumps(schema, separators=(",", ":"), ensure_ascii=False)
        return json.dumps(schema, indent=2, ensure_ascii=False)

    return schema


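The `minify` switch comes down to two different `json.dumps` calls. The sketch below applies both to a small made-up schema (the `toy_schema` dict is an assumption for illustration, not the real annotation schema) to show that they serialize the same data and differ only in whitespace.

```python
import json

# Hypothetical miniature schema; the real one is generated by Pydantic.
toy_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "x-guidance": {"whitespace_flexible": False},
}

# Compact form: no space after "," or ":", which keeps the prompt short.
compact = json.dumps(toy_schema, separators=(",", ":"), ensure_ascii=False)

# Pretty form: two-space indentation, easier to read while debugging.
pretty = json.dumps(toy_schema, indent=2, ensure_ascii=False)

same_data = json.loads(compact) == json.loads(pretty)
saved_chars = len(pretty) - len(compact)
```

Because the schema string is embedded in every system prompt, the characters saved by the compact form are saved on every request.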
# Default AnnotationResponse model with default max_length
AnnotationResponse = create_annotation_response_model()


TRUNCATION_TAG = "<truncated_content>"


def truncate_content(content: str, max_content_chars: int) -> str:
    """Cut content to max_content_chars and append TRUNCATION_TAG; a limit <= 0 disables truncation."""
    if max_content_chars > 0 and len(content) > max_content_chars:
        return f"{content[:max_content_chars]}\n{TRUNCATION_TAG}"
    return content


# Read with an explicit encoding so the result does not depend on the platform default
with open(Path("property_descriptions.md"), "r", encoding="utf-8") as f:
    property_descriptions = f.read()


def create_messages(document_text: str, max_content_chars: int = 50_000) -> list[dict]:
    document_text = truncate_content(document_text, max_content_chars)
    user_prompt = USER_PROMPT.format(content=document_text)

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    return messages


schema_str = get_annotation_response_schema(as_string=True, one_sentence_description_max_length=150)
annotator_system_prompt = ANNOTATOR_SYSTEM_PROMPT.format(json_schema=schema_str, property_descriptions=property_descriptions)


def create_annotator_messages(document_text: str, max_content_chars: int = 50_000) -> list[dict]:
    document_text = truncate_content(document_text, max_content_chars)
    user_prompt = ANNOTATOR_USER_PROMPT.format(content=document_text)

    messages = [
        {"role": "system", "content": annotator_system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    return messages
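The truncation and message-building helpers can be exercised in isolation. In the sketch below, `SYSTEM_PROMPT` and `USER_PROMPT` are hypothetical placeholder templates (the real ones are defined elsewhere in the repository and are not shown here), and `build_messages` is an illustrative name rather than one of the functions above.

```python
TRUNCATION_TAG = "<truncated_content>"

# Hypothetical stand-ins for the repository's real prompt templates.
SYSTEM_PROMPT = "You annotate documents."
USER_PROMPT = "<content>\n{content}\n</content>"


def truncate_content(content: str, max_content_chars: int) -> str:
    """Keep at most max_content_chars characters and mark the cut with a tag."""
    if max_content_chars > 0 and len(content) > max_content_chars:
        return f"{content[:max_content_chars]}\n{TRUNCATION_TAG}"
    return content


def build_messages(document_text: str, max_content_chars: int = 50_000) -> list[dict]:
    """Return a system/user message pair in the usual chat format."""
    document_text = truncate_content(document_text, max_content_chars)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT.format(content=document_text)},
    ]


# Braces inside the document are safe: str.format only interprets the template.
short = build_messages("hello {world}")

# A 10-character document cut to 4 characters gets the truncation tag appended.
cut = build_messages("abcdefghij", max_content_chars=4)
```

Note that the limit counts characters, not tokens, and that the tag is added on top of the limit, so a truncated document is slightly longer than `max_content_chars`.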
1182
property_descriptions.md
Normal file
File diff suppressed because it is too large
3
res/bf16_vs_fp8.png
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8b7ce77dab7ced79f7cde9bfbbb64efd001e0cbfde7b1097994b66b07cfd7d7f
size 120263
3
res/eu_cofunding.png
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4ebae7d757f29f364b508e7ee3e608e74f7ca8c1159c1dceb06a71b3c8ac89bb
size 314650
3
res/overall_scores_by_model.png
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5d17dfbf73e540374a2a4913aacef3ecb447cc5921c9b28a5a6acce18cb13c68
size 249170
3
res/per_property_scores_by_model.png
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2bf0415dc3bc17c84c178de0d7eaab543d335b4f067c58c863a399073bfc57c6
size 425653
3
res/propella-oral.pdf
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6e06f094c2dcca4943342af6d1dc294ac18128bfc9f9b2ee347fb185946005bb
size 9938715
3
res/propella-poster.pdf
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:945d5d40a41ca51fc412c7ffe1f26dc0318780cf1b7b6a24e6e522ca3b251dc0
size 960006
3
res/propella_artwork_21_9.jpeg
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d12dc905b329bfae0653149ce599b1641b1ea3dba200406c3e7aa952445a7006
size 7167527
3
res/propella_artwork_21_9_w1600.jpeg
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0597d6d8fab1b833475107304a379ccb625be7a2c1c67740c377fb1c5e49077d
size 194078
3
res/propella_artwork_portrait.jpeg
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:099b5eca4eeeb3f43a3f3034fe3837730d97b4eaf04b581741caf910734e34d4
size 7976009
40
res/propella_logo.svg
Normal file
@@ -0,0 +1,40 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" fill="none" stroke="black" stroke-width="3" stroke-linecap="round" stroke-linejoin="round">
  <defs>
    <!-- Define a single blade upright (pointing North) -->
    <!-- Blade width: 20px (-10 to 10) -->
    <!-- Blade length: extends to -44px -->
    <!-- Base starts at -8px to ensure corners are hidden by the center circle -->
    <g id="blade">
      <!-- Main Page Outline -->
      <path d="M -10 -8
               L -10 -44
               L 4 -44
               L 10 -38
               L 10 -8
               Z"
            fill="white"/>

      <!-- The Dog-Ear Fold (Top Right corner) -->
      <path d="M 4 -44 L 4 -38 L 10 -38" fill="white"/>

      <!-- Text Lines (Centered) -->
      <line x1="-6" y1="-30" x2="6" y2="-30" />
      <line x1="-6" y1="-23" x2="6" y2="-23" />
      <line x1="-6" y1="-16" x2="6" y2="-16" />
    </g>
  </defs>

  <!-- Rotate the blade 4 times around the center (50,50) -->
  <!-- 45° (Top Right) -->
  <use href="#blade" transform="translate(50, 50) rotate(45)" />
  <!-- 135° (Bottom Right) -->
  <use href="#blade" transform="translate(50, 50) rotate(135)" />
  <!-- 225° (Bottom Left) -->
  <use href="#blade" transform="translate(50, 50) rotate(225)" />
  <!-- 315° (Top Left) -->
  <use href="#blade" transform="translate(50, 50) rotate(315)" />

  <!-- Center Hub Circle -->
  <!-- Radius 13 ensures it covers the blade corners (sqrt(10^2 + 8^2) ≈ 12.8) -->
  <circle cx="50" cy="50" r="13" fill="white" />
</svg>
Size: 1.5 KiB
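The hub-radius comment in the SVG can be checked with a few lines of arithmetic. This sketch only restates the numbers from the SVG comments (half-width 10, base offset 8, radius 13); the variable names are made up for the illustration, and it ignores stroke width.

```python
import math

half_width = 10.0   # blade spans x from -10 to 10
base_offset = 8.0   # blade base sits at y = -8
hub_radius = 13.0   # r attribute of the center circle

# Distance from the rotation center to either base corner of a blade.
corner_distance = math.hypot(half_width, base_offset)

# The corners are hidden when they fall inside the hub circle.
corners_hidden = corner_distance < hub_radius
```

Rotation about the center does not change this distance, so the same check holds for all four blades.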
31
special_tokens_map.json
Normal file
@@ -0,0 +1,31 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
BIN
tokenizer.json
(Stored with Git LFS)
Normal file
Binary file not shown.
239
tokenizer_config.json
Normal file
@@ -0,0 +1,239 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151665": {
      "content": "<tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151666": {
      "content": "</tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151667": {
      "content": "<think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151668": {
      "content": "</think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 1010000,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
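A config like this can be inspected with the standard library alone. The sketch below parses a trimmed three-token excerpt of the file (the excerpt and the variable names are assumptions made for brevity; only the ids, contents, and `special` flags are taken from the config) and looks up the ids of the end-of-sequence and padding tokens.

```python
import json

# Trimmed excerpt of tokenizer_config.json, reduced to the fields used here.
config_text = """
{
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "special": true},
    "151645": {"content": "<|im_end|>", "special": true},
    "151667": {"content": "<think>", "special": false}
  },
  "eos_token": "<|im_end|>",
  "pad_token": "<|endoftext|>"
}
"""

config = json.loads(config_text)
decoder = config["added_tokens_decoder"]

# JSON object keys are strings, so convert the ids to integers.
id_by_token = {entry["content"]: int(token_id) for token_id, entry in decoder.items()}

# Only entries flagged "special" count as special tokens in this config.
special_tokens = sorted(
    entry["content"] for entry in decoder.values() if entry["special"]
)

eos_id = id_by_token[config["eos_token"]]
pad_id = id_by_token[config["pad_token"]]
```

In this config the padding token (`<|endoftext|>`) and the end-of-sequence token (`<|im_end|>`) are distinct, and `<think>` is registered as an added token without the `special` flag.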
1
vocab.json
Normal file
File diff suppressed because one or more lines are too long