---
license: apache-2.0
language:
- fr
- en
base_model: Qwen/Qwen3-4B
tags:
- linked-art
- performing-arts
- ontology
- json-ld
- theater
- festival-avignon
- digital-humanities
datasets:
- PleIAs/linked-art-avignon
pipeline_tag: text-generation
---

# POntAvignon-4b

A 4B reasoning model that annotates French theater programmes against the [Linked Art](https://linked.art/) performing-arts ontology extension. Given raw programme markdown from the Festival d'Avignon (1947–present), it extracts structured JSON-LD entities with chain-of-thought reasoning.

POntAvignon-4b is based on Qwen3-4B but adopts the SYNTH reasoning syntax introduced by Pleias' Baguettotron.

## Model Details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| **Parameters** | 4B |
| **Training** | Full SFT, 3 epochs, bf16 |
| **Context** | 16k tokens |
| **Output** | `<think>` reasoning + Linked Art JSON-LD |
| **Train loss** | 0.360 |
| **Token accuracy** | 96.6% |
| **Valid JSON rate** | 97% (on held-out test set) |

## Entity Types

The model extracts 7 entity types from a single programme, each as a separate query:

| Entity | Linked Art type | Description |
|---|---|---|
| **A** | `PropositionalObject` | The abstract creative Work — title, creator, BnF role, adaptation links |
| **B_meta** | `Activity` | Production metadata — title, venue, dates, genre, festival link |
| **B_cast** | `Activity.produced_by` | All performers — actors, dancers, musicians with BnF roles and characters |
| **B_crew** | `Activity.produced_by` | Creative/technical staff — director, lighting, costumes, set design with BnF roles |
| **C** | `Activity` | Single Performance — one specific date/time, venue, parent production link |
| **Festival** | `Activity` | Overall Event — festival edition (e.g. Festival d'Avignon 1996) |
| **Text** | `LinguisticObject` | Source literary work — the play or text being adapted/performed |

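The seven-query pattern can be driven by a small helper that builds one chat request per entity type. A minimal sketch; the per-entity instruction strings below are illustrative placeholders, not the exact training prompts:

```python
# Sketch: one chat request per entity type. The instruction strings are
# illustrative placeholders, not the canonical training prompts.
SYSTEM = (
    "You annotate French theater programmes from the Festival d'Avignon "
    "against the Linked Art performing-arts ontology. Extract structured "
    "JSON-LD entities from programme markdown."
)

ENTITY_INSTRUCTIONS = {
    "A": "Extract the Work entity (A) as a PropositionalObject.",
    "B_meta": "Extract the Production metadata entity (B_meta) as an Activity.",
    "B_cast": "Extract all performers (B_cast).",
    "B_crew": "Extract the creative/technical staff (B_crew).",
    "C": "Extract one specific Performance entity (C) as an Activity.",
    "Festival": "Extract the overall festival edition as an Activity.",
    "Text": "Extract the source literary work as a LinguisticObject.",
}

def build_messages(entity, source_name, programme):
    """Return the chat message list for a single entity-type query."""
    user = (
        f"{ENTITY_INSTRUCTIONS[entity]}\n\n"
        f"Source: {source_name}\n\n---\n\n{programme}"
    )
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
    ]

requests = {e: build_messages(e, "Medee_FDA1996.md", "…programme markdown…")
            for e in ENTITY_INSTRUCTIONS}
```

Each request is then sent through the same chat template shown in the Usage section.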
## Reasoning

The model uses `<think>` tags to produce SYNTH-style dense reasoning traces before outputting JSON-LD. The reasoning:

- Names the task explicitly and stays focused on it
- Engages with document structure, era, and typographic conventions
- Works through French theatrical vocabulary and BnF role mapping
- Resolves ontological boundaries (Work vs Production vs Performance)
- Uses confidence markers: ● sure, ◐ probable, ○ guess, ⚠ risk

## Usage

### With vLLM (recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="PleIAs/linked-art-qwen3-4b", dtype="bfloat16", max_model_len=4096)
tokenizer = llm.get_tokenizer()

programme = """____ PAGE 1 ____

FESTIVAL D'AVIGNON
COUR D'HONNEUR DU PALAIS DES PAPES
7, 8, 12, 15, 18, 19 juillet à 22 h

# Médée

de Sénèque
Mise en scène Jacques Lassalle
"""

messages = [
    {"role": "system", "content": "You annotate French theater programmes from the Festival d'Avignon against the Linked Art performing-arts ontology. Extract structured JSON-LD entities from programme markdown."},
    {"role": "user", "content": f"Extract the Work entity (A) from this theater programme. Output valid Linked Art JSON-LD for the Work as a PropositionalObject, including title, creator with BnF role, source attribution, and any adaptation/influence links.\n\nSource: Medee_FDA1996.md\n\n---\n\n{programme}"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Pre-open the reasoning block so the model starts inside <think>.
prompt += "<think>\n"

output = llm.generate([prompt], SamplingParams(max_tokens=2048, temperature=0.7, top_p=0.9))
print(output[0].outputs[0].text)
```

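Because the prompt pre-opens `<think>`, the completion contains the reasoning trace, a closing `</think>`, and then the JSON-LD payload. A minimal sketch of splitting the two and validating the JSON; the sample completion string is fabricated for illustration:

```python
import json

def split_output(text):
    """Separate the <think> reasoning trace from the JSON-LD payload.

    `text` is the raw completion generated after the prompt's opening
    "<think>\\n": trace, closing </think>, then JSON-LD.
    """
    trace, sep, rest = text.partition("</think>")
    if not sep:  # no closing tag: treat the whole completion as payload
        trace, rest = "", text
    return trace.strip(), json.loads(rest.strip())

# Fabricated example completion, for illustration only.
completion = """Work extraction. ● Title: Médée, author Sénèque.</think>
{"type": "PropositionalObject", "_label": "M\\u00e9d\\u00e9e"}"""

trace, entity = split_output(completion)
```

A `json.JSONDecodeError` here corresponds to the ~3% invalid-JSON cases reported above, so callers should catch it and retry or discard.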
### Demo notebook

A [Colab notebook](demo/linked_art_demo.ipynb) runs all 7 entity types on a demo programme with a tabbed display; it fits on a free T4 GPU.

## Training Data

The model was trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), spanning three source collections:

- **BnF fonds** — digitized paper programmes (1971–2002)
- **CommAvignon** — born-digital programmes from festival communications (2007–2022)
- **SiteAvignon** — programmes scraped from the festival website (2018)

Each sample pairs raw programme markdown with a Linked Art JSON-LD entity and a SYNTH-style reasoning trace. Reasoning traces were generated by Claude Sonnet (3,110 samples), then scaled to the full dataset with a Gemma 12B backreasoning model trained on the Sonnet traces.

## Ontology

The model targets the [Linked Art Performing Arts extension](https://linked.art/) (v0.9), developed in collaboration with the ERC "From Stage to Data" project (Clarisse Bardiot). Key features:

- **BnF role vocabulary** — French-language controlled terms from the Bibliothèque nationale de France, mapped to person attributions
- **Deterministic IDs** — content-derived identifiers (e.g. `W-MEDE-JACQ-1996-a3f1`) for deduplication
- **Source attribution** — every extracted fact links back to the programme document as primary source

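To illustrate the idea behind IDs like `W-MEDE-JACQ-1996-a3f1`, here is one way a content-derived identifier can be built. This is only a sketch of the general technique (slugs plus a short content hash), not the extension's documented scheme:

```python
import hashlib
import unicodedata

def deterministic_id(title, creator, year, prefix="W"):
    """Illustrative content-derived identifier: identical inputs always
    yield the identical ID, so re-extracted entities deduplicate.
    NOT the extension's documented scheme, just a sketch of the idea."""
    def slug(s, n=4):
        # Strip accents, keep the first n letters, uppercase.
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
        return "".join(c for c in s if c.isalpha())[:n].upper()
    digest = hashlib.sha1(f"{title}|{creator}|{year}".encode()).hexdigest()[:4]
    return f"{prefix}-{slug(title)}-{slug(creator)}-{year}-{digest}"

wid = deterministic_id("Médée", "Jacques Lassalle", 1996)
```

Because the hash is derived from the entity's content rather than from a counter, two independent extraction runs over the same programme produce the same ID.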
## Limitations

- **Specialized corpus**: trained exclusively on Festival d'Avignon programmes. Other festivals or theatrical traditions may need additional training data.
- **French-centric**: programme text, role vocabulary, and typographic conventions are French. Non-French programmes are out of scope.
- **Large cast/crew**: productions with 30+ team members may produce truncated B_cast/B_crew outputs near the context limit.
- **Date inference**: when the programme doesn't state the year explicitly, the model infers it from the filename (e.g. `FDA1996`). Without a filename, year extraction may fail.
- **Reasoning traces** improve accuracy and provide explainability but may contain intermediate errors that are corrected in the final JSON-LD output.
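
Since year inference depends on the filename, callers can check up front whether a source name carries a usable year token. A hypothetical helper, matching names like `Medee_FDA1996.md`:

```python
import re

def filename_year(source_name):
    """Return the 4-digit year embedded in a programme filename
    (e.g. 'Medee_FDA1996.md'), or None if absent.
    Hypothetical pre-check; the model itself performs this inference."""
    m = re.search(r"(19[4-9]\d|20[0-2]\d)", source_name)
    return int(m.group(1)) if m else None
```

When this returns `None`, either rename the source file to include the year or expect the extracted dates to lack one.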