---
license: apache-2.0
language:
- fr
- en
base_model: Qwen/Qwen3-4B
tags:
- linked-art
- performing-arts
- ontology
- json-ld
- theater
- festival-avignon
- digital-humanities
datasets:
- PleIAs/linked-art-avignon
pipeline_tag: text-generation
---
# POntAvignon-4b
A 4B reasoning model that annotates French theater programmes against the [Linked Art](https://linked.art/) performing-arts ontology extension. Given raw programme markdown from the Festival d'Avignon (1947–present), it extracts structured JSON-LD entities with chain-of-thought reasoning.
POntAvignon-4b is based on Qwen3 but uses the SYNTH reasoning syntax from Pleias' Baguettotron.
## Model Details
| | |
|---|---|
| **Base model** | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| **Parameters** | 4B |
| **Training** | Full SFT, 3 epochs, bf16 |
| **Context** | 16k tokens |
| **Output** | `<think>` reasoning + Linked Art JSON-LD |
| **Train loss** | 0.360 |
| **Token accuracy** | 96.6% |
| **Valid JSON rate** | 97% (on held-out test set) |
## Entity Types
The model extracts 7 entity types from a single programme, each as a separate query:
| Entity | Linked Art type | Description |
|---|---|---|
| **A** | `PropositionalObject` | The abstract creative Work — title, creator, BnF role, adaptation links |
| **B_meta** | `Activity` | Production metadata — title, venue, dates, genre, festival link |
| **B_cast** | `Activity.produced_by` | All performers — actors, dancers, musicians with BnF roles and characters |
| **B_crew** | `Activity.produced_by` | Creative/technical staff — director, lighting, costumes, set design with BnF roles |
| **C** | `Activity` | Single Performance — one specific date/time, venue, parent production link |
| **Festival** | `Activity` | Overall Event — festival edition (e.g. Festival d'Avignon 1996) |
| **Text** | `LinguisticObject` | Source literary work — the play or text being adapted/performed |
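To make the table concrete, the sketch below shows roughly what a Work (A) entity looks like as JSON-LD. The field names follow general Linked Art conventions and the ID is taken from the example later in this card; the exact keys the model emits may differ.

```python
import json

# Illustrative only: keys follow common Linked Art conventions; the model's
# actual output schema may differ in detail.
work_entity = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "W-MEDE-JACQ-1996-a3f1",  # deterministic, content-derived ID
    "type": "PropositionalObject",   # entity A: the abstract creative Work
    "_label": "Médée",
    "created_by": {
        "type": "Creation",
        "carried_out_by": [{"type": "Person", "_label": "Sénèque"}],
    },
}

# The structure round-trips as plain JSON.
serialized = json.dumps(work_entity, ensure_ascii=False)
assert json.loads(serialized)["type"] == "PropositionalObject"
```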
## Reasoning
The model uses `<think>` tags to produce SYNTH-style dense reasoning traces before outputting JSON-LD. The reasoning:
- Names the task explicitly and stays focused on it
- Engages with document structure, era, and typographic conventions
- Works through French theatrical vocabulary and BnF role mapping
- Resolves ontological boundaries (Work vs Production vs Performance)
- Uses confidence markers: ● sure, ◐ probable, ○ guess, ⚠ risk
## Usage
### With vLLM (recommended)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="PleIAs/linked-art-qwen3-4b", dtype="bfloat16", max_model_len=4096)
tokenizer = llm.get_tokenizer()

programme = """____ PAGE 1 ____
FESTIVAL D'AVIGNON
COUR D'HONNEUR DU PALAIS DES PAPES
7, 8, 12, 15, 18, 19 juillet à 22 h
# Médée
de Sénèque
Mise en scène Jacques Lassalle
"""

messages = [
    {"role": "system", "content": "You annotate French theater programmes from the Festival d'Avignon against the Linked Art performing-arts ontology. Extract structured JSON-LD entities from programme markdown."},
    {"role": "user", "content": f"Extract the Work entity (A) from this theater programme. Output valid Linked Art JSON-LD for the Work as a PropositionalObject, including title, creator with BnF role, source attribution, and any adaptation/influence links.\n\nSource: Medee_FDA1996.md\n\n---\n\n{programme}"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"  # open the reasoning block so generation starts inside it

output = llm.generate([prompt], SamplingParams(max_tokens=2048, temperature=0.7, top_p=0.9))
print(output[0].outputs[0].text)
```
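The generation contains a reasoning trace followed by the JSON-LD payload. A minimal post-processing sketch, assuming the trace closes with a Qwen-style `</think>` tag (the delimiter and the fence handling are assumptions, not a documented contract of this model):

```python
import json

def extract_jsonld(generation: str) -> dict:
    """Strip the reasoning trace and parse the trailing JSON-LD.

    Assumes a Qwen-style </think> delimiter; the JSON payload is whatever
    follows it. Raises ValueError if the payload is not valid JSON.
    """
    _, _, tail = generation.partition("</think>")
    payload = tail.strip()
    # Tolerate an optional markdown code fence around the JSON.
    if payload.startswith("```"):
        payload = payload.strip("`").removeprefix("json").strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError as e:
        raise ValueError(f"model output is not valid JSON-LD: {e}") from e

sample = 'dense reasoning...</think>\n{"type": "PropositionalObject", "_label": "Médée"}'
entity = extract_jsonld(sample)
```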
### Demo notebook
A [Colab notebook](demo/linked_art_demo.ipynb) runs all 7 entity queries on a demo programme with a tabbed display. It runs on a free T4 GPU.
## Training Data
The model was trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), spanning three source collections:
- **BnF fonds** — digitized paper programmes (1971–2002)
- **CommAvignon** — born-digital programmes from festival communications (2007–2022)
- **SiteAvignon** — programmes scraped from the festival website (2018)
Each sample pairs a raw programme markdown with a Linked Art JSON-LD entity and a SYNTH-style reasoning trace. Reasoning traces were generated by Claude Sonnet (3,110 samples) and then scaled to the full dataset using a Gemma 12B backreasoning model trained on the Sonnet traces.
## Ontology
The model targets the [Linked Art Performing Arts extension](https://linked.art/) (v0.9), developed in collaboration with the ERC "From Stage to Data" project (Clarisse Bardiot). Key features:
- **BnF role vocabulary** — French-language controlled terms from the Bibliothèque nationale de France, mapped to person attributions
- **Deterministic IDs** — content-derived identifiers (e.g. `W-MEDE-JACQ-1996-a3f1`) for deduplication
- **Source attribution** — every extracted fact links back to the programme document as primary source
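The deterministic-ID idea can be sketched as follows. The actual scheme used by the ontology is not specified here, so the slugging rules, hash algorithm, and field order below are illustrative assumptions; only the general shape (`W-MEDE-JACQ-1996-a3f1`) comes from the card.

```python
import hashlib
import unicodedata

def deterministic_id(prefix: str, title: str, person: str, year: int) -> str:
    """Sketch of a content-derived ID such as W-MEDE-JACQ-1996-a3f1.

    Assumed scheme, for illustration only: fixed-width ASCII slugs of the
    title and principal person, plus a short content hash for deduplication.
    """
    def slug(text: str, width: int = 4) -> str:
        # Strip accents, keep letters only, truncate, uppercase.
        ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
        letters = "".join(c for c in ascii_text if c.isalpha())
        return letters[:width].upper()

    digest = hashlib.sha1(f"{title}|{person}|{year}".encode()).hexdigest()[:4]
    return f"{prefix}-{slug(title)}-{slug(person)}-{year}-{digest}"

print(deterministic_id("W", "Médée", "Jacques Lassalle", 1996))
# -> W-MEDE-JACQ-1996-<hash>; the suffix depends only on the input content,
#    so re-extracting the same Work yields the same ID.
```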
## Limitations
- **Specialized corpus**: trained exclusively on Festival d'Avignon programmes. Other festivals or theatrical traditions may need additional training data.
- **French-centric**: programme text, role vocabulary, and typographic conventions are French. Non-French programmes are out of scope.
- **Large cast/crew**: productions with 30+ team members may produce truncated B_cast/B_crew outputs near the context limit.
- **Date inference**: when the programme doesn't state the year explicitly, the model infers it from the filename (e.g. `FDA1996`). Without a filename, year extraction may fail.
- **Reasoning traces** improve accuracy and provide explainability but may contain intermediate errors that are corrected in the final JSON-LD output.