Initialize project; model provided by the ModelHub XC community
Model: Pclanglais/POntAvignon-4b Source: Original Platform
125
README.md
Normal file
@@ -0,0 +1,125 @@
---
license: apache-2.0
language:
- fr
- en
base_model: Qwen/Qwen3-4B
tags:
- linked-art
- performing-arts
- ontology
- json-ld
- theater
- festival-avignon
- digital-humanities
datasets:
- PleIAs/linked-art-avignon
pipeline_tag: text-generation
---

# POntAvignon-4b

A 4B reasoning model that annotates French theater programmes against the [Linked Art](https://linked.art/) performing-arts ontology extension. Given raw programme markdown from the Festival d'Avignon (1947–present), it extracts structured JSON-LD entities with chain-of-thought reasoning.

POntAvignon-4b is based on Qwen3-4B but uses the SYNTH reasoning syntax from Pleias' Baguettotron.

## Model Details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| **Parameters** | 4B |
| **Training** | Full SFT, 3 epochs, bf16 |
| **Context** | 16k tokens |
| **Output** | `<think>` reasoning + Linked Art JSON-LD |
| **Train loss** | 0.360 |
| **Token accuracy** | 96.6% |
| **Valid JSON rate** | 97% (on held-out test set) |

## Entity Types

The model extracts 7 entity types from a single programme, each as a separate query:

| Entity | Linked Art type | Description |
|---|---|---|
| **A** | `PropositionalObject` | The abstract creative Work: title, creator, BnF role, adaptation links |
| **B_meta** | `Activity` | Production metadata: title, venue, dates, genre, festival link |
| **B_cast** | `Activity.produced_by` | All performers: actors, dancers, musicians with BnF roles and characters |
| **B_crew** | `Activity.produced_by` | Creative/technical staff: director, lighting, costumes, set design with BnF roles |
| **C** | `Activity` | Single Performance: one specific date/time, venue, parent production link |
| **Festival** | `Activity` | Overall Event: festival edition (e.g. Festival d'Avignon 1996) |
| **Text** | `LinguisticObject` | Source literary work: the play or text being adapted/performed |

## Reasoning

The model uses `<think>` tags to produce SYNTH-style dense reasoning traces before outputting JSON-LD. The reasoning:

- Names the task explicitly and stays focused on it
- Engages with document structure, era, and typographic conventions
- Works through French theatrical vocabulary and BnF role mapping
- Resolves ontological boundaries (Work vs Production vs Performance)
- Uses confidence markers: ● sure, ◐ probable, ○ guess, ⚠ risk

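
Since the generation mixes a reasoning trace with the JSON-LD payload, downstream code usually needs to separate the two. The following is a minimal sketch, assuming the trace ends with a closing `</think>` tag and that the entity is the first JSON object after it; `split_output` and `count_markers` are hypothetical helper names, not part of the model's tooling.

```python
import json

# Marker meanings as listed in the model card.
CONFIDENCE_MARKERS = {"●": "sure", "◐": "probable", "○": "guess", "⚠": "risk"}


def split_output(text):
    """Return (reasoning, entity); entity is None if no JSON object parses."""
    reasoning, closing_tag, answer = text.partition("</think>")
    if not closing_tag:
        # Assumption: with no closing tag, the whole text is the answer.
        reasoning, answer = "", text
    reasoning = reasoning.replace("<think>", "").strip()
    start = answer.find("{")
    if start == -1:
        return reasoning, None
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        entity, _ = json.JSONDecoder().raw_decode(answer, start)
    except json.JSONDecodeError:
        entity = None
    return reasoning, entity


def count_markers(reasoning):
    """Count each confidence marker in the reasoning trace."""
    return {label: reasoning.count(symbol) for symbol, label in CONFIDENCE_MARKERS.items()}


# Illustrative input, not real model output.
raw = (
    "Task: Work entity. ● title is Médée ◐ year taken from filename\n"
    "</think>\n"
    '{"type": "PropositionalObject", "_label": "Médée"}'
)
reasoning, entity = split_output(raw)
print(entity["type"], count_markers(reasoning))
```

Returning `None` instead of raising keeps invalid or truncated outputs easy to filter when processing many programmes in a batch.
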
## Usage
### With vLLM (recommended)

```python
from vllm import LLM, SamplingParams

# Repo ID as published in this card; adjust it if you load the weights from another location.
# max_model_len caps prompt + generation; raise it (up to the 16k context) for long programmes.
llm = LLM(model="PleIAs/linked-art-qwen3-4b", dtype="bfloat16", max_model_len=4096)
tokenizer = llm.get_tokenizer()

# Raw programme markdown, as produced by the digitization pipeline.
programme = """____ PAGE 1 ____

FESTIVAL D'AVIGNON
COUR D'HONNEUR DU PALAIS DES PAPES
7, 8, 12, 15, 18, 19 juillet à 22 h

# Médée

de Sénèque
Mise en scène Jacques Lassalle
"""

# One query per entity type; this one requests the Work entity (A).
messages = [
    {"role": "system", "content": "You annotate French theater programmes from the Festival d'Avignon against the Linked Art performing-arts ontology. Extract structured JSON-LD entities from programme markdown."},
    {"role": "user", "content": f"Extract the Work entity (A) from this theater programme. Output valid Linked Art JSON-LD for the Work as a PropositionalObject, including title, creator with BnF role, source attribution, and any adaptation/influence links.\n\nSource: Medee_FDA1996.md\n\n---\n\n{programme}"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Open the reasoning block so the model starts with its trace.
prompt += "<think>\n"

output = llm.generate([prompt], SamplingParams(max_tokens=2048, temperature=0.7, top_p=0.9))
print(output[0].outputs[0].text)
```

### Demo notebook

A [Colab notebook](demo/linked_art_demo.ipynb) runs all 7 entity queries on a demo programme and shows the results in a tabbed display. It runs on a free T4 GPU.

## Training Data

The model was trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), spanning three source collections:

- **BnF fonds**: digitized paper programmes (1971–2002)
- **CommAvignon**: born-digital programmes from festival communications (2007–2022)
- **SiteAvignon**: programmes scraped from the festival website (2018)

Each sample pairs a raw programme markdown with a Linked Art JSON-LD entity and a SYNTH-style reasoning trace. Reasoning traces were generated by Claude Sonnet (3,110 samples) and then scaled to the full dataset using a Gemma 12B backreasoning model trained on the Sonnet traces.

## Ontology

The model targets the [Linked Art Performing Arts extension](https://linked.art/) (v0.9), developed in collaboration with the ERC "From Stage to Data" project (Clarisse Bardiot). Key features:

- **BnF role vocabulary**: French-language controlled terms from the Bibliothèque nationale de France, mapped to person attributions
- **Deterministic IDs**: content-derived identifiers (e.g. `W-MEDE-JACQ-1996-a3f1`) for deduplication
- **Source attribution**: every extracted fact links back to the programme document as primary source

## Limitations

- **Specialized corpus**: trained exclusively on Festival d'Avignon programmes. Other festivals or theatrical traditions may need additional training data.
- **French-centric**: programme text, role vocabulary, and typographic conventions are French. Non-French programmes are out of scope.
- **Large cast/crew**: productions with 30+ team members may produce truncated B_cast/B_crew outputs near the context limit.
- **Date inference**: when the programme doesn't state the year explicitly, the model infers it from the filename (e.g. `FDA1996`). Without a filename, year extraction may fail.
- **Reasoning traces**: they improve accuracy and explainability, but may contain intermediate errors that the final JSON-LD output corrects.

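
Because the year is inferred from the filename, it can help to check before querying that the `Source:` filename actually carries one. This is a minimal sketch, assuming filenames follow the `FDA` plus four-digit year pattern seen in `Medee_FDA1996.md`; `year_from_filename` is a hypothetical helper, not part of the model's tooling.

```python
import re
from typing import Optional

# "FDA" followed by a four-digit year starting with 19 or 20 (assumed convention).
_FDA_YEAR = re.compile(r"FDA((?:19|20)\d{2})")


def year_from_filename(filename: str) -> Optional[int]:
    """Return the festival year encoded in the filename, or None if absent."""
    match = _FDA_YEAR.search(filename)
    return int(match.group(1)) if match else None


for name in ("Medee_FDA1996.md", "programme.md"):
    print(name, year_from_filename(name))
```

A `None` result flags programmes whose year should be supplied another way, for example by renaming the file before building the prompt.
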