Initialize project; model provided by the ModelHub XC community

Model: Pclanglais/POntAvignon-4b Source: Original Platform
2026-04-29 22:35:11 +08:00
commit cf70142aff
14 changed files with 152395 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,125 @@
+---
+license: apache-2.0
+language:
+- fr
+- en
+base_model: Qwen/Qwen3-4B
+tags:
+- linked-art
+- performing-arts
+- ontology
+- json-ld
+- theater
+- festival-avignon
+- digital-humanities
+datasets:
+- PleIAs/linked-art-avignon
+pipeline_tag: text-generation
+---
+
+# POntAvignon-4b
+
+A 4B reasoning model that annotates French theater programmes against the [Linked Art](https://linked.art/) performing-arts ontology extension. Given raw programme markdown from the Festival d'Avignon (1947–present), it extracts structured JSON-LD entities with chain-of-thought reasoning.
+
+POntAvignon-4b is based on Qwen3-4B but adopts the SYNTH reasoning syntax used by Pleias' Baguettotron.
+
+## Model Details
+
+| | |
+|---|---|
+| **Base model** | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
+| **Parameters** | 4B |
+| **Training** | Full SFT, 3 epochs, bf16 |
+| **Context** | 16k tokens |
+| **Output** | `<think>` reasoning + Linked Art JSON-LD |
+| **Train loss** | 0.360 |
+| **Token accuracy** | 96.6% |
+| **Valid JSON rate** | 97% (on held-out test set) |
+
+## Entity Types
+
+The model extracts 7 entity types from a single programme, each requested in a separate query:
+
+| Entity | Linked Art type | Description |
+|---|---|---|
+| **A** | `PropositionalObject` | The abstract creative Work — title, creator, BnF role, adaptation links |
+| **B_meta** | `Activity` | Production metadata — title, venue, dates, genre, festival link |
+| **B_cast** | `Activity.produced_by` | All performers — actors, dancers, musicians with BnF roles and characters |
+| **B_crew** | `Activity.produced_by` | Creative/technical staff — director, lighting, costumes, set design with BnF roles |
+| **C** | `Activity` | Single Performance — one specific date/time, venue, parent production link |
+| **Festival** | `Activity` | Overall Event — festival edition (e.g. Festival d'Avignon 1996) |
+| **Text** | `LinguisticObject` | Source literary work — the play or text being adapted/performed |
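
Because each entity is requested in its own query, covering a full programme means issuing seven prompts. The sketch below shows one way to prepare them. `ENTITY_TYPES` is copied from the table above; `build_messages` and its generic instruction wording are hypothetical, since this README documents only the Work (A) prompt.

```python
# Entity keys and Linked Art types, copied from the table above.
ENTITY_TYPES = {
    "A": "PropositionalObject",
    "B_meta": "Activity",
    "B_cast": "Activity.produced_by",
    "B_crew": "Activity.produced_by",
    "C": "Activity",
    "Festival": "Activity",
    "Text": "LinguisticObject",
}


def build_messages(entity: str, programme: str, source: str) -> list[dict]:
    """Build one chat request for one entity type (hypothetical helper)."""
    if entity not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity}")
    # Assumption: generic wording; the model may expect entity-specific instructions.
    instruction = (
        f"Extract the {entity} entity from this theater programme. "
        f"Output valid Linked Art JSON-LD ({ENTITY_TYPES[entity]})."
    )
    content = f"{instruction}\n\nSource: {source}\n\n---\n\n{programme}"
    return [{"role": "user", "content": content}]


# One request per entity type for the same programme.
requests = {
    entity: build_messages(entity, "# Médée\n\nde Sénèque", "Medee_FDA1996.md")
    for entity in ENTITY_TYPES
}
print(len(requests))
```

Keeping the filename in the `Source:` line mirrors the documented Work prompt, which matters for year inference when the programme text omits the year.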
+
+## Reasoning
+
+The model uses `<think>` tags to produce SYNTH-style dense reasoning traces before outputting JSON-LD. The reasoning:
+
+- Names the task explicitly and stays focused on it
+- Engages with document structure, era, and typographic conventions
+- Works through French theatrical vocabulary and BnF role mapping
+- Resolves ontological boundaries (Work vs Production vs Performance)
+- Uses confidence markers: ● sure, ◐ probable, ○ guess, ⚠ risk
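
The confidence markers make a trace easy to triage automatically. As a minimal sketch, the helper below counts each marker in a reasoning trace; `tally_confidence` is a hypothetical name and the example trace is made up for illustration, not real model output.

```python
from collections import Counter

# Marker meanings as listed above.
MARKERS = {"●": "sure", "◐": "probable", "○": "guess", "⚠": "risk"}


def tally_confidence(trace: str) -> Counter:
    """Count how often each confidence marker appears in a reasoning trace."""
    counts = Counter({label: 0 for label in MARKERS.values()})
    for char in trace:
        if char in MARKERS:
            counts[MARKERS[char]] += 1
    return counts


# Made-up trace, for illustration only.
example_trace = "Title 'Médée' ● ; author Sénèque ● ; year from filename ◐ ; translator ○"
counts = tally_confidence(example_trace)
print(dict(counts))
```

A trace with many guess or risk markers could be routed to manual review before its JSON-LD is ingested.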
+
+## Usage
+
+### With vLLM (recommended)
+
+```python
+from vllm import LLM, SamplingParams
+
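+# max_model_len caps prompt + generated tokens; the model supports a 16k context, so raise it for long programmes.
+llm = LLM(model="PleIAs/linked-art-qwen3-4b", dtype="bfloat16", max_model_len=4096)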
+tokenizer = llm.get_tokenizer()
+
+programme = """____ PAGE 1 ____
+
+FESTIVAL D'AVIGNON
+COUR D'HONNEUR DU PALAIS DES PAPES
+7, 8, 12, 15, 18, 19 juillet à 22 h
+
+# Médée
+
+de Sénèque
+Mise en scène Jacques Lassalle
+"""
+
+messages = [
+    {"role": "system", "content": "You annotate French theater programmes from the Festival d'Avignon against the Linked Art performing-arts ontology. Extract structured JSON-LD entities from programme markdown."},
+    {"role": "user", "content": f"Extract the Work entity (A) from this theater programme. Output valid Linked Art JSON-LD for the Work as a PropositionalObject, including title, creator with BnF role, source attribution, and any adaptation/influence links.\n\nSource: Medee_FDA1996.md\n\n---\n\n{programme}"},
+]
+
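+# Render the chat template as plain text, then open the reasoning block so generation starts inside <think>.
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+prompt += "<think>\n"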
+
+output = llm.generate([prompt], SamplingParams(max_tokens=2048, temperature=0.7, top_p=0.9))
+print(output[0].outputs[0].text)
+```
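
The generated text contains the reasoning trace followed by the JSON-LD. A minimal post-processing sketch is shown here, assuming the trace is closed by a `</think>` tag and a single JSON object follows it; `split_output` is a hypothetical helper and the sample string is made up.

```python
import json


def split_output(generated: str) -> tuple[str, dict]:
    """Separate the reasoning trace from the JSON-LD and parse the latter."""
    reasoning, sep, answer = generated.partition("</think>")
    if not sep:
        raise ValueError("no closing </think> tag found (output may be truncated)")
    # Keep only the outermost JSON object, ignoring any stray text around it.
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found after the reasoning trace")
    return reasoning.strip(), json.loads(answer[start : end + 1])


# Made-up output, for illustration only.
sample = 'Task: Work entity ●\n</think>\n{"type": "PropositionalObject", "_label": "Médée"}'
reasoning, entity = split_output(sample)
print(entity["type"])
```

`json.loads` raises `json.JSONDecodeError` on malformed output, which gives a simple way to detect the small share of generations that are not valid JSON.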
+
+### Demo notebook
+
+A [Colab notebook](demo/linked_art_demo.ipynb) runs all 7 entity types on a demo programme and shows the results in a tabbed display. It runs on a free T4 GPU.
+
+## Training Data
+
+The model was trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), spanning three source collections:
+
+- **BnF fonds** — digitized paper programmes (1971–2002)
+- **CommAvignon** — born-digital programmes from festival communications (2007–2022)
+- **SiteAvignon** — programmes scraped from the festival website (2018)
+
+Each sample pairs a raw programme markdown with a Linked Art JSON-LD entity and a SYNTH-style reasoning trace. Reasoning traces were generated by Claude Sonnet (3,110 samples) and then scaled to the full dataset using a Gemma 12B backreasoning model trained on the Sonnet traces.
+
+## Ontology
+
+The model targets the [Linked Art Performing Arts extension](https://linked.art/) (v0.9), developed in collaboration with the ERC "From Stage to Data" project (Clarisse Bardiot). Key features:
+
+- **BnF role vocabulary** — French-language controlled terms from the Bibliothèque nationale de France, mapped to person attributions
+- **Deterministic IDs** — content-derived identifiers (e.g. `W-MEDE-JACQ-1996-a3f1`) for deduplication
+- **Source attribution** — every extracted fact links back to the programme document as primary source
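
To illustrate what a content-derived identifier can look like, here is a sketch that rebuilds the shape of the example ID from a title, a person name, and a year. The actual derivation rule is not documented in this README, so the field choice, the 4-letter slugs, and the SHA-1 suffix are all assumptions, and the suffix will not reproduce `a3f1`.

```python
import hashlib
import unicodedata


def _slug4(text: str) -> str:
    """Fold accents, keep ASCII letters, return the first four in uppercase."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    letters = "".join(c for c in folded if c.isalpha())
    return letters[:4].upper()


def make_id(prefix: str, title: str, person: str, year: int) -> str:
    """Hypothetical scheme: PREFIX-TITLE-PERSON-YEAR-HASH, stable for equal inputs."""
    key = f"{title}|{person}|{year}".casefold()
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}-{_slug4(title)}-{_slug4(person)}-{year}-{digest}"


work_id = make_id("W", "Médée", "Jacques Lassalle", 1996)
print(work_id)
```

Because the ID depends only on the content fields, two extractions of the same Work from different programmes map to the same identifier, which is what makes deduplication possible.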
+
+## Limitations
+
+- **Specialized corpus**: trained exclusively on Festival d'Avignon programmes. Other festivals or theatrical traditions may need additional training data.
+- **French-centric**: programme text, role vocabulary, and typographic conventions are French. Non-French programmes are out of scope.
+- **Large cast/crew**: productions with 30+ team members may produce truncated B_cast/B_crew outputs near the context limit.
+- **Date inference**: when the programme doesn't state the year explicitly, the model infers it from the filename (e.g. `FDA1996`). Without a filename, year extraction may fail.
+- **Reasoning traces**: they improve accuracy and provide explainability, but may contain intermediate errors that are corrected in the final JSON-LD output.
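
Since the model falls back on the filename for the year, callers may want to check that a filename carries one before sending a request. The sketch below assumes the year is encoded as `FDA` followed by four digits, as in the `FDA1996` example; `year_from_filename` is a hypothetical helper, not part of the model's tooling.

```python
import re
from typing import Optional

# Assumption: the edition year appears as "FDA" followed by four digits.
_FDA_YEAR = re.compile(r"FDA(\d{4})")


def year_from_filename(filename: str) -> Optional[int]:
    """Return the festival year encoded in the filename, or None if absent."""
    match = _FDA_YEAR.search(filename)
    if match is None:
        return None
    year = int(match.group(1))
    # The festival started in 1947, so earlier values are treated as noise.
    return year if year >= 1947 else None


print(year_from_filename("Medee_FDA1996.md"))
print(year_from_filename("programme.md"))
```

When the helper returns `None` and the programme text states no year, the extracted dates should be treated with caution.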