---
language:
- en
base_model:
- meta-llama/Meta-Llama-3-8B
pipeline_tag: text-generation
---
## Janus
(Built with Meta Llama 3)
For the variant that also takes a part-of-speech (PoS) tag as input, see [Janus (PoS)](https://huggingface.co/ChangeIsKey/llama3-janus-pos).
### Model Details
- **Model Name**: Janus
- **Version**: 1.0
- **Developers**: Pierluigi Cassotti, Nina Tahmasebi
- **Affiliation**: University of Gothenburg
- **License**: MIT
- **GitHub Repository**: [Historical Word Usage Generation](https://github.com/ChangeIsKey/historical-word-usage-generation)
- **Paper**: [Sense-specific Historical Word Usage Generation](https://doi.org/10.1162/tacl_a_00761)
- **Contact**: pierluigi.cassotti@gu.se
### Model Description
Janus is a fine-tuned **Llama 3 8B** model designed to generate historically and semantically accurate word usages. It takes as input a word, its sense definition, and a year and produces example sentences that reflect linguistic usage from the specified period. This model is particularly useful for **semantic change detection**, **historical NLP**, and **linguistic research**.
### Intended Use
- **Semantic Change Detection**: Investigating how word meanings evolve over time.
- **Historical Text Processing**: Enhancing the understanding and modeling of historical texts.
- **Corpus Expansion**: Generating sense-annotated corpora for linguistic studies.
### Training Data
- **Dataset**: Extracted from the **Oxford English Dictionary (OED)**
- **Size**: Over **1.2 million** sense-annotated historical usages
- **Time Span**: **1700 - 2020**
- **Data Format**:
```
<year><|t|><lemma><|t|><definition><|s|><historical usage sentence><|end|>
```
- **Janus (PoS) Format**:
```
<year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|><historical usage sentence><|end|>
```
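The same layouts serve as the prompt at inference time: everything up to and including `<|s|>` is supplied, and the model continues with the usage sentence. Below is a minimal sketch of a prompt builder, assuming the separators are used exactly as written above. The name `build_prompt` and the tag value `"ADJ"` are hypothetical; the PoS label inventory expected by Janus (PoS) is not specified in this card.

```python
from typing import Optional


def build_prompt(year: int, lemma: str, definition: str, pos: Optional[str] = None) -> str:
    """Assemble a generation prompt that ends at the <|s|> separator."""
    prompt = f"{year}<|t|>{lemma}<|t|>{definition}"
    if pos is not None:
        # Janus (PoS) layout: the tag is wrapped in a pair of <|p|> markers.
        prompt += f"<|p|>{pos}<|p|>"
    return prompt + "<|s|>"


definition = "Used to emphasize something unpleasant or negative; such a, an absolute."
plain = build_prompt(1800, "awful", definition)
# "ADJ" is a placeholder tag for illustration only.
tagged = build_prompt(1800, "awful", definition, pos="ADJ")
print(plain)
print(tagged)
```

Looping `build_prompt` over a range of years with a fixed lemma and definition gives one prompt per period, which is one way to collect usages for diachronic comparison.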
### Training Procedure
- **Base Model**: `meta-llama/Meta-Llama-3-8B`
- **Optimization**: **QLoRA** (Quantized Low-Rank Adaptation)
- **Batch Size**: **4**
- **Learning Rate**: **2e-4**
- **Epochs**: **1**
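For orientation, the settings above could map onto a `transformers` + `peft` setup roughly as follows. This is a sketch, not the authors' training script: only the values in `STATED` come from this card. Everything in `ASSUMED` (LoRA rank, alpha, dropout, target modules), the NF4 quantization type, the compute dtype, the output path, and the reading of batch size as per-device are placeholders chosen for illustration.

```python
# Values taken from this model card.
STATED = {
    "base_model": "meta-llama/Meta-Llama-3-8B",
    "batch_size": 4,
    "learning_rate": 2e-4,
    "epochs": 1,
}

# ASSUMPTIONS: not given in this card; placeholder values only.
ASSUMED = {
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def build_qlora_configs():
    """Return (quantization config, LoRA config, training arguments)."""
    # Imported lazily so the constants above can be inspected without the GPU stack.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig, TrainingArguments

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit base weights (the "Q" in QLoRA)
        bnb_4bit_quant_type="nf4",              # assumption
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption
    )
    lora = LoraConfig(
        r=ASSUMED["lora_r"],
        lora_alpha=ASSUMED["lora_alpha"],
        lora_dropout=ASSUMED["lora_dropout"],
        target_modules=ASSUMED["target_modules"],
        task_type="CAUSAL_LM",
    )
    args = TrainingArguments(
        output_dir="janus-qlora",  # hypothetical path
        per_device_train_batch_size=STATED["batch_size"],  # assumes batch size is per device
        learning_rate=STATED["learning_rate"],
        num_train_epochs=STATED["epochs"],
    )
    return bnb, lora, args
```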
### Model Performance
- **Temporal Accuracy**: Root mean squared error (RMSE) of **~52.7 years** (close to OED ground truth)
- **Semantic Accuracy**: Comparable to OED test data on human evaluations
- **Context Variability**: Low lexical repetition, preserving natural linguistic diversity
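As a reminder of what the temporal metric measures, RMSE is the square root of the mean squared difference between the target year and the year estimated for each generated sentence. The sketch below assumes such per-sentence year estimates are already available; how they are obtained is not described in this card, and the numbers are made up for illustration.

```python
import math
from typing import Sequence


def rmse(target_years: Sequence[float], estimated_years: Sequence[float]) -> float:
    """Root mean squared error between target and estimated years."""
    if len(target_years) != len(estimated_years) or not target_years:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - e) ** 2 for t, e in zip(target_years, estimated_years)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: errors of 10, 20 and 0 years.
targets = [1800, 1900, 1950]
estimates = [1810, 1880, 1950]
print(round(rmse(targets, estimates), 2))  # 12.91
```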
### Usage Example
#### Generating Historical Usages
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChangeIsKey/llama3-janus"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Prompt layout: <year><|t|><lemma><|t|><definition><|s|>
input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; such a, an absolute.<|s|>"
# Send inputs to the device chosen by device_map instead of hard-coding "cuda".
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature and top_p to take effect.
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
For more examples, see the GitHub repository [Historical Word Usage Generation](https://github.com/ChangeIsKey/historical-word-usage-generation).
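The decoded output contains the prompt followed by the continuation, so a small post-processing step is usually needed to isolate the generated sentence. The helper below is a sketch under the assumption that the text was decoded with `skip_special_tokens=False`, so that the `<|s|>` and `<|end|>` markers are still present; whether the tokenizer registers them as special tokens is not stated in this card. `extract_usage` and the sample string are hypothetical.

```python
def extract_usage(decoded: str, sep: str = "<|s|>", end: str = "<|end|>") -> str:
    """Return the text between the first <|s|> and the following <|end|>."""
    _, found, tail = decoded.partition(sep)
    if not found:
        raise ValueError(f"separator {sep!r} not found in decoded text")
    # Drop the end marker and anything the model produced after it.
    return tail.split(end, 1)[0].strip()


# Made-up decoded string, for illustration only.
sample = "1800<|t|>awful<|t|>some definition<|s|> It was an awful mess. <|end|>trailing tokens"
print(extract_usage(sample))
```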
### Limitations & Ethical Considerations
- **Historical Bias**: The model may reflect biases present in historical texts.
- **Time Granularity**: The temporal resolution is approximate (RMSE of ~52.7 years; see Model Performance).
- **Modern Influence**: Despite fine-tuning, the model may still generate modern phrases in older contexts.
- **Not Trained for Fairness**: The model has not been explicitly trained to be fair or unbiased. It may produce sensitive, outdated, or culturally inappropriate content.
### Citation
If you use Janus, please cite:
```
@article{10.1162/tacl_a_00761,
author = {Cassotti, Pierluigi and Tahmasebi, Nina},
title = {Sense-specific Historical Word Usage Generation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {13},
pages = {690-708},
year = {2025},
month = {07},
abstract = {Large-scale sense-annotated corpora are important for a range of tasks but are hard to come by. Dictionaries that record and describe the vocabulary of a language often offer a small set of real-world example sentences for each sense of a word. However, on their own, these sentences are too few to be used as diachronic sense-annotated corpora. We propose a targeted strategy for training and evaluating generative models producing historically and semantically accurate word usages given any word, sense definition, and year triple. Our results demonstrate that fine-tuned models can generate usages with the same properties as real-world example sentences from a reference dictionary. Thus the generated usages will be suitable for training and testing computational models where large-scale sense-annotated corpora are needed but currently unavailable.},
issn = {2307-387X},
doi = {10.1162/tacl_a_00761},
url = {https://doi.org/10.1162/tacl\_a\_00761},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00761/2535111/tacl\_a\_00761.pdf},
}
```