---
base_model: dicta-il/dictalm2.0-instruct
library_name: transformers
license: mit
language:
- he
tags:
- qasem
- hebrew
- causal-lm
- semantic-parsing
datasets:
- biu-nlp/Multilingual_QASem_Datasets
pipeline_tag: text-generation
---

# QASem Hebrew Full Model (DictaLM 2.0)

This model performs **QA-based semantic parsing (QASem) in Hebrew**.

## Overview

This repository provides a **fully fine-tuned model** for **QA-based semantic parsing (QASem) in Hebrew**. QASem represents predicate–argument structure using **natural-language question–answer pairs** rather than predefined semantic role labels, which makes the representation more interpretable and flexible across languages.

**Base model:** `dicta-il/dictalm2.0-instruct`, fully fine-tuned for QA-based semantic parsing.

## ✨ Why this model matters

Traditional semantic role labeling relies on fixed label schemas and costly expert annotation. This model takes a different approach by:

- Representing semantics as **natural-language question–answer pairs**
- Enabling **automatic dataset construction** via cross-lingual projection
- Supporting **scalable semantic parsing across languages**
- Achieving strong performance with **efficient fine-tuned models**

This makes it possible to build semantic parsers for new languages at minimal cost.
## Use Cases

This model can be used for:

- Research in **QA-based semantic parsing (QASem)** and semantic representation learning
- Extraction of **predicate–argument structures** from Hebrew text
- Automatic **dataset creation** for training semantic models in new languages
- Downstream NLP applications such as:
  - Information extraction
  - Text understanding
  - Factuality and attribution evaluation

## Language

- Hebrew 🇮🇱

## Training Data

The model was trained on the **Multilingual QASem Dataset**:
👉 https://huggingface.co/datasets/biu-nlp/Multilingual_QASem_Datasets

The dataset includes:

- Automatically generated QASem annotations
- Train / development / test splits
- Multiple languages: **French, Hebrew, Russian**
- Tens of thousands of QA pairs per language

The data was constructed using a cross-lingual projection approach, which makes the annotation process scalable across languages.

## 📄 Associated Work

This model and the underlying dataset are introduced in:
[Effective QA-Driven Annotation of Predicate-Argument Relations Across Languages](https://aclanthology.org/2026.eacl-long.112/).

The paper presents the full methodology, the dataset construction process, and evaluation across multiple languages.

## 🚀 Quick Start (Recommended)

### Using the XQASem Parser

For a simple, structured interface, use the XQASem parser.

### Installation

```bash
pip install xqasem
```

### Basic Example

```python
from xqasem import XQasemParser

parser = XQasemParser.from_language("he")

sentences = [
    "המומחים הדגישו שהאלגוריתם החדש מאיץ משמעותית את עיבוד הבקשות המורכבות."
]

df = parser(sentences)
print(df)
```

## Output Format

The model produces structured predicate–argument representations consisting of:

- A predicate (verb or nominal)
- A natural-language question
- A corresponding answer span from the sentence

This structure can easily be converted into tabular or JSON format for downstream use.
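For example, QA rows in this structure can be serialized to JSON for downstream pipelines. The sketch below uses a hand-written row whose field names mirror the example output table; the actual column names produced by the parser may differ.

```python
import json

# A hypothetical QA row mirroring the model's output structure
# (field names assumed from the example output; the real parser may differ).
rows = [
    {
        "sentence": "המומחים הדגישו שהאלגוריתם החדש מאיץ משמעותית את עיבוד הבקשות המורכבות.",
        "predicate": "הדגישו",
        "predicate_type": "verb",
        "question": "מי הדגיש משהו?",
        "answer": "המומחים",
    },
]

# Serialize for downstream consumers; ensure_ascii=False keeps Hebrew readable.
json_str = json.dumps(rows, ensure_ascii=False, indent=2)
print(json_str)
```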
### Example Output

| sentence | predicate | predicate_type | question | answer |
| --- | --- | --- | --- | --- |
| המומחים הדגישו שהאלגוריתם החדש מאיץ משמעותית את עיבוד הבקשות המורכבות. | הדגישו | verb | מי הדגיש משהו? | המומחים |
| המומחים הדגישו שהאלגוריתם החדש מאיץ משמעותית את עיבוד הבקשות המורכבות. | הדגישו | verb | מה מישהו הדגיש? | שהאלגוריתם החדש מאיץ משמעותית את עיבוד הבקשות המורכבות |
| המומחים הדגישו שהאלגוריתם החדש מאיץ משמעותית את עיבוד הבקשות המורכבות. | מאיץ | verb | מה מאיץ משהו? | האלגוריתם החדש |
| המומחים הדגישו שהאלגוריתם החדש מאיץ משמעותית את עיבוד הבקשות המורכבות. | מאיץ | verb | מה משהו מאיץ? | את עיבוד הבקשות המורכבות |

👉 For more details and advanced usage, see the project repository:
https://github.com/JohnnieDavidov/xqasem

## Manual Model Loading (Advanced)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "YonatanDavidov/qasem-he-dictalm2-full"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

## Limitations

- Performance may degrade on out-of-domain text
- Complex or ambiguous predicates may lead to inconsistent outputs
- The model is optimized for QASem-style generation, not for general-purpose text generation

## 📄 Citation

If you use this model, please cite our work:

```
@inproceedings{davidov-etal-2026-effective,
    title = "Effective {QA}-Driven Annotation of Predicate{--}Argument Relations Across Languages",
    author = "Davidov, Jonathan and Slobodkin, Aviv and Klein, Shmuel Tomi and Tsarfaty, Reut and Dagan, Ido and Klein, Ayal",
    editor = "Demberg, Vera and Inui, Kentaro and Marquez, Llu{\'i}s",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.eacl-long.112/",
    doi = "10.18653/v1/2026.eacl-long.112",
    pages = "2484--2502",
    ISBN = "979-8-89176-380-7",
}
```