---
license: apache-2.0
language:
- fr
- en
base_model: Qwen/Qwen3-4B
tags:
- linked-art
- performing-arts
- ontology
- json-ld
- theater
- festival-avignon
- digital-humanities
datasets:
- PleIAs/linked-art-avignon
pipeline_tag: text-generation
---

# POntAvignon-4b

A 4B reasoning model that annotates French theater programmes against the [Linked Art](https://linked.art/) performing-arts ontology extension. Given raw programme markdown from the Festival d'Avignon (1947–present), it extracts structured JSON-LD entities with chain-of-thought reasoning.

POntAvignon-4b is based on Qwen3 but relies on the SYNTH reasoning syntax from Pleias' Baguettotron.

## Model Details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| **Parameters** | 4B |
| **Training** | Full SFT, 3 epochs, bf16 |
| **Context** | 16k tokens |
| **Output** | `<think>` reasoning + Linked Art JSON-LD |
| **Train loss** | 0.360 |
| **Token accuracy** | 96.6% |
| **Valid JSON rate** | 97% (on held-out test set) |

## Entity Types

The model extracts 7 entity types from a single programme, each as a separate query:

| Entity | Linked Art type | Description |
|---|---|---|
| **A** | `PropositionalObject` | The abstract creative Work: title, creator, BnF role, adaptation links |
| **B_meta** | `Activity` | Production metadata: title, venue, dates, genre, festival link |
| **B_cast** | `Activity.produced_by` | All performers: actors, dancers, musicians with BnF roles and characters |
| **B_crew** | `Activity.produced_by` | Creative/technical staff: director, lighting, costumes, set design with BnF roles |
| **C** | `Activity` | Single Performance: one specific date/time, venue, parent production link |
| **Festival** | `Activity` | Overall Event: festival edition (e.g. Festival d'Avignon 1996) |
| **Text** | `LinguisticObject` | Source literary work: the play or text being adapted/performed |

## Reasoning

The model uses `<think>` tags to produce SYNTH-style dense reasoning traces before outputting JSON-LD. The reasoning:

- Names the task explicitly and stays focused on it
- Engages with document structure, era, and typographic conventions
- Works through French theatrical vocabulary and BnF role mapping
- Resolves ontological boundaries (Work vs Production vs Performance)
- Uses confidence markers: ● sure, ◐ probable, ○ guess, ⚠ risk

## Usage

### With vLLM (recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="PleIAs/linked-art-qwen3-4b", dtype="bfloat16", max_model_len=4096)
tokenizer = llm.get_tokenizer()

programme = """____ PAGE 1 ____

FESTIVAL D'AVIGNON

COUR D'HONNEUR DU PALAIS DES PAPES
7, 8, 12, 15, 18, 19 juillet à 22 h

# Médée

de Sénèque

Mise en scène Jacques Lassalle
"""

messages = [
    {"role": "system", "content": "You annotate French theater programmes from the Festival d'Avignon against the Linked Art performing-arts ontology. Extract structured JSON-LD entities from programme markdown."},
    {"role": "user", "content": f"Extract the Work entity (A) from this theater programme. Output valid Linked Art JSON-LD for the Work as a PropositionalObject, including title, creator with BnF role, source attribution, and any adaptation/influence links.\n\nSource: Medee_FDA1996.md\n\n---\n\n{programme}"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"

output = llm.generate([prompt], SamplingParams(max_tokens=2048, temperature=0.7, top_p=0.9))
print(output[0].outputs[0].text)
```

### Demo notebook

A [Colab notebook](demo/linked_art_demo.ipynb) runs all 7 entity types on a demo programme with a tabbed display. It runs on a free T4 GPU.
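Since the model emits a reasoning trace followed by JSON-LD, downstream code typically needs to separate the two and validate the payload. A minimal sketch, assuming the trace is wrapped in `<think>…</think>` tags and the JSON-LD may be enclosed in a markdown code fence (`parse_output` is an illustrative helper, not part of the released tooling):

```python
import json
import re

def parse_output(text: str):
    """Split a generation into (reasoning trace, parsed JSON-LD entity)."""
    reasoning = ""
    # Pull out the reasoning trace if present
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if m:
        reasoning = m.group(1).strip()
        text = text[m.end():]
    # Strip an optional ```json fence around the payload, then parse
    body = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    return reasoning, json.loads(body)

sample = (
    '<think>● Work: Médée, tragedy by Sénèque ◐ adaptation link</think>\n'
    '```json\n{"type": "PropositionalObject", "_label": "Médée"}\n```'
)
reasoning, entity = parse_output(sample)
print(entity["_label"])
```

Parsing with `json.loads` also gives a direct way to measure the valid-JSON rate reported above on your own outputs.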
## Training Data

The model was trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), spanning three source collections:

- **BnF fonds**: digitized paper programmes (1971–2002)
- **CommAvignon**: born-digital programmes from festival communications (2007–2022)
- **SiteAvignon**: programmes scraped from the festival website (2018)

Each sample pairs raw programme markdown with a Linked Art JSON-LD entity and a SYNTH-style reasoning trace. Reasoning traces were generated by Claude Sonnet (3,110 samples), then scaled to the full dataset using a Gemma 12B backreasoning model trained on the Sonnet traces.

## Ontology

The model targets the [Linked Art Performing Arts extension](https://linked.art/) (v0.9), developed in collaboration with the ERC "From Stage to Data" project (Clarisse Bardiot). Key features:

- **BnF role vocabulary**: French-language controlled terms from the Bibliothèque nationale de France, mapped to person attributions
- **Deterministic IDs**: content-derived identifiers (e.g. `W-MEDE-JACQ-1996-a3f1`) for deduplication
- **Source attribution**: every extracted fact links back to the programme document as its primary source

## Limitations

- **Specialized corpus**: trained exclusively on Festival d'Avignon programmes; other festivals or theatrical traditions may need additional training data.
- **French-centric**: programme text, role vocabulary, and typographic conventions are French. Non-French programmes are out of scope.
- **Large cast/crew**: productions with 30+ team members may produce truncated B_cast/B_crew outputs near the context limit.
- **Date inference**: when the programme does not state the year explicitly, the model infers it from the filename (e.g. `FDA1996`). Without a filename, year extraction may fail.
- **Reasoning traces**: improve accuracy and provide explainability, but may contain intermediate errors that are corrected in the final JSON-LD output.
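To illustrate the deterministic-ID idea from the Ontology section: identifiers like `W-MEDE-JACQ-1996-a3f1` appear to combine an entity prefix, slugged title and person names, a year, and a short hash. The exact slugging and hashing rules are not documented here, so the following is a hypothetical reconstruction (`slug` and `work_id` are illustrative names), showing only why content-derived IDs enable deduplication: identical inputs always yield the identical ID.

```python
import hashlib
import unicodedata

def slug(text: str, n: int = 4) -> str:
    """Uppercase ASCII letter prefix of a name: 'Médée' -> 'MEDE'."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    letters = "".join(c for c in ascii_text if c.isalpha()).upper()
    return letters[:n]

def work_id(title: str, person: str, year: int) -> str:
    """Content-derived Work ID in the style of W-MEDE-JACQ-1996-a3f1.

    The 4-hex suffix here is a SHA-1 prefix over the full fields;
    the actual pipeline's hash function and field order may differ.
    """
    key = f"{title}|{person}|{year}".lower()
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:4]
    return f"W-{slug(title)}-{slug(person)}-{year}-{digest}"

print(work_id("Médée", "Jacques Lassalle", 1996))
```

Because the ID is a pure function of the entity's content, re-extracting the same Work from two different programmes produces the same identifier, which is what makes cross-document deduplication possible without a registry.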