Update README.md

Xinyu Hu
2024-06-27 13:21:57 +00:00
committed by system
parent 00f8dd3dc9
commit 2dc14212b6

@@ -30,7 +30,7 @@ We believe that an ideal evaluator should be convenient to use and possess these
We conduct experiments on several common NLG evaluation tasks and datasets to compare our Themis with other methods, including SummEval for summarization, Topical-Chat for dialogue response generation, SFRES&SFHOT for data-to-text, QAGS for factuality, MANS for story generation, and WMT23 zh-en for machine translation. Experimental results show that our Themis achieves better overall evaluation performance than other evaluation models, including GPT-4.
| Method | SummEval | Topical-Chat | SFHOT&SFRES | QAGS | MANS | WMT23 | Average Spearman |
| ------ | :------: | :----------: | :---------: | :--: | :--: | :---: | :--------------: |
| BLEU   | 0.075    | 0.388        | 0.024       | -    | 0.032  | 0.021 | - |
| ROUGE  | 0.152    | 0.412        | 0.101       | -    | -0.002 | 0.151 | - |
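
The numbers in the table are Spearman rank correlations between each evaluator's scores and human judgments. As a minimal sketch of how such a correlation is computed, the pure-Python snippet below implements the rank-difference formula for untied data; the human and metric scores are made-up illustrative values, not results from any of the benchmarks above.

```python
def ranks(values):
    """Map each value to its 1-based rank (ties not handled, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rho via the rank-difference formula for untied data."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical example: five outputs rated by humans and by a metric.
human = [4.0, 2.5, 3.0, 5.0, 1.5]        # made-up human ratings
metric = [0.62, 0.31, 0.45, 0.70, 0.20]  # made-up metric scores
print(spearman(human, metric))  # ranks align perfectly -> 1.0
```

Because Spearman correlation depends only on ranks, it rewards an evaluator that orders outputs the same way humans do, regardless of the absolute scale of its scores.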