| line | dataset | prompt | metric | value |
|---|---|---|---|---|
| 2 | anli_dev_r1 | GPT-3 style | accuracy | 0.351 |
| 3 | anli_dev_r1 | MNLI crowdsource | accuracy | 0.334 |
| 4 | anli_dev_r1 | can we infer | accuracy | 0.351 |
| 5 | anli_dev_r1 | guaranteed/possible/impossible | accuracy | 0.288 |
| 6 | anli_dev_r1 | justified in saying | accuracy | 0.345 |
| 7 | anli_dev_r1 | median | accuracy | 0.345 |
| 8 | anli_dev_r2 | GPT-3 style | accuracy | 0.339 |
| 9 | anli_dev_r2 | MNLI crowdsource | accuracy | 0.335 |
| 10 | anli_dev_r2 | can we infer | accuracy | 0.354 |
| 11 | anli_dev_r2 | guaranteed/possible/impossible | accuracy | 0.297 |
| 12 | anli_dev_r2 | justified in saying | accuracy | 0.345 |
| 13 | anli_dev_r2 | median | accuracy | 0.339 |
| 14 | anli_dev_r3 | GPT-3 style | accuracy | 0.37583333333333335 |
| 15 | anli_dev_r3 | MNLI crowdsource | accuracy | 0.3408333333333333 |
| 16 | anli_dev_r3 | can we infer | accuracy | 0.36333333333333334 |
| 17 | anli_dev_r3 | guaranteed/possible/impossible | accuracy | 0.31083333333333335 |
| 18 | anli_dev_r3 | justified in saying | accuracy | 0.34 |
| 19 | anli_dev_r3 | median | accuracy | 0.3408333333333333 |
| 20 | story_cloze_2016 | Answer Given options | accuracy | 0.8305718866916088 |
| 21 | story_cloze_2016 | Choose Story Ending | accuracy | 0.8706574024585783 |
| 22 | story_cloze_2016 | Generate Ending | accuracy | 0.7183324425440941 |
| 23 | story_cloze_2016 | Novel Correct Ending | accuracy | 0.848743987172635 |
| 24 | story_cloze_2016 | Story Continuation and Options | accuracy | 0.8466060929983966 |
| 25 | story_cloze_2016 | median | accuracy | 0.8466060929983966 |
| 26 | super_glue_cb | GPT-3 style | accuracy | 0.625 |
| 27 | super_glue_cb | MNLI crowdsource | accuracy | 0.08928571428571429 |
| 28 | super_glue_cb | can we infer | accuracy | 0.5892857142857143 |
| 29 | super_glue_cb | guaranteed/possible/impossible | accuracy | 0.5 |
| 30 | super_glue_cb | justified in saying | accuracy | 0.5357142857142857 |
| 31 | super_glue_cb | median | accuracy | 0.5357142857142857 |
| 32 | super_glue_copa | C1 or C2? premise, so/because… | accuracy | 0.66 |
| 33 | super_glue_copa | best_option | accuracy | 0.67 |
| 34 | super_glue_copa | cause_effect | accuracy | 0.78 |
| 35 | super_glue_copa | i_am_hesitating | accuracy | 0.8 |
| 36 | super_glue_copa | plausible_alternatives | accuracy | 0.81 |
| 37 | super_glue_copa | median | accuracy | 0.78 |
| 38 | super_glue_rte | GPT-3 style | accuracy | 0.7870036101083032 |
| 39 | super_glue_rte | MNLI crowdsource | accuracy | 0.7220216606498195 |
| 40 | super_glue_rte | does it follow that | accuracy | 0.6678700361010831 |
| 41 | super_glue_rte | guaranteed true | accuracy | 0.6714801444043321 |
| 42 | super_glue_rte | should assume | accuracy | 0.6678700361010831 |
| 43 | super_glue_rte | median | accuracy | 0.6714801444043321 |
| 44 | winogrande_winogrande_xl | Replace | accuracy | 0.5406471981057617 |
| 45 | winogrande_winogrande_xl | True or False | accuracy | 0.5074980268350434 |
| 46 | winogrande_winogrande_xl | does underscore refer to | accuracy | 0.5177584846093133 |
| 47 | winogrande_winogrande_xl | stand for | accuracy | 0.510655090765588 |
| 48 | winogrande_winogrande_xl | underscore refer to | accuracy | 0.5256511444356748 |
| 49 | winogrande_winogrande_xl | median | accuracy | 0.5177584846093133 |
| 50 | xcopa_id | C1 or C2? premise, so/because… | accuracy | 0.47 |
| 51 | xcopa_id | best_option | accuracy | 0.51 |
| 52 | xcopa_id | cause_effect | accuracy | 0.65 |
| 53 | xcopa_id | i_am_hesitating | accuracy | 0.66 |
| 54 | xcopa_id | plausible_alternatives | accuracy | 0.67 |
| 55 | xcopa_id | median | accuracy | 0.65 |
| 56 | xcopa_sw | C1 or C2? premise, so/because… | accuracy | 0.58 |
| 57 | xcopa_sw | best_option | accuracy | 0.57 |
| 58 | xcopa_sw | cause_effect | accuracy | 0.46 |
| 59 | xcopa_sw | i_am_hesitating | accuracy | 0.48 |
| 60 | xcopa_sw | plausible_alternatives | accuracy | 0.45 |
| 61 | xcopa_sw | median | accuracy | 0.48 |
| 62 | xcopa_ta | C1 or C2? premise, so/because… | accuracy | 0.57 |
| 63 | xcopa_ta | best_option | accuracy | 0.67 |
| 64 | xcopa_ta | cause_effect | accuracy | 0.71 |
| 65 | xcopa_ta | i_am_hesitating | accuracy | 0.71 |
| 66 | xcopa_ta | plausible_alternatives | accuracy | 0.69 |
| 67 | xcopa_ta | median | accuracy | 0.69 |
| 68 | xcopa_vi | C1 or C2? premise, so/because… | accuracy | 0.55 |
| 69 | xcopa_vi | best_option | accuracy | 0.61 |
| 70 | xcopa_vi | cause_effect | accuracy | 0.67 |
| 71 | xcopa_vi | i_am_hesitating | accuracy | 0.66 |
| 72 | xcopa_vi | plausible_alternatives | accuracy | 0.65 |
| 73 | xcopa_vi | median | accuracy | 0.65 |
| 74 | xcopa_zh | C1 or C2? premise, so/because… | accuracy | 0.62 |
| 75 | xcopa_zh | best_option | accuracy | 0.61 |
| 76 | xcopa_zh | cause_effect | accuracy | 0.77 |
| 77 | xcopa_zh | i_am_hesitating | accuracy | 0.72 |
| 78 | xcopa_zh | plausible_alternatives | accuracy | 0.74 |
| 79 | xcopa_zh | median | accuracy | 0.72 |
| 80 | xnli_ar | GPT-3 style | accuracy | 0.5040160642570282 |
| 81 | xnli_ar | MNLI crowdsource | accuracy | 0.39879518072289155 |
| 82 | xnli_ar | can we infer | accuracy | 0.506425702811245 |
| 83 | xnli_ar | guaranteed/possible/impossible | accuracy | 0.4799196787148594 |
| 84 | xnli_ar | justified in saying | accuracy | 0.41526104417670684 |
| 85 | xnli_ar | median | accuracy | 0.4799196787148594 |
| 86 | xnli_en | GPT-3 style | accuracy | 0.5590361445783133 |
| 87 | xnli_en | MNLI crowdsource | accuracy | 0.342570281124498 |
| 88 | xnli_en | can we infer | accuracy | 0.5449799196787148 |
| 89 | xnli_en | guaranteed/possible/impossible | accuracy | 0.41164658634538154 |
| 90 | xnli_en | justified in saying | accuracy | 0.4634538152610442 |
| 91 | xnli_en | median | accuracy | 0.4634538152610442 |
| 92 | xnli_es | GPT-3 style | accuracy | 0.5373493975903615 |
| 93 | xnli_es | MNLI crowdsource | accuracy | 0.40441767068273093 |
| 94 | xnli_es | can we infer | accuracy | 0.5277108433734939 |
| 95 | xnli_es | guaranteed/possible/impossible | accuracy | 0.44216867469879517 |
| 96 | xnli_es | justified in saying | accuracy | 0.4534136546184739 |
| 97 | xnli_es | median | accuracy | 0.4534136546184739 |
| 98 | xnli_fr | GPT-3 style | accuracy | 0.5248995983935743 |
| 99 | xnli_fr | MNLI crowdsource | accuracy | 0.3895582329317269 |
| 100 | xnli_fr | can we infer | accuracy | 0.5337349397590362 |
| 101 | xnli_fr | guaranteed/possible/impossible | accuracy | 0.42971887550200805 |
| 102 | xnli_fr | justified in saying | accuracy | 0.4738955823293173 |
| 103 | xnli_fr | median | accuracy | 0.4738955823293173 |
| 104 | xnli_hi | GPT-3 style | accuracy | 0.4983935742971888 |
| 105 | xnli_hi | MNLI crowdsource | accuracy | 0.38714859437751004 |
| 106 | xnli_hi | can we infer | accuracy | 0.45542168674698796 |
| 107 | xnli_hi | guaranteed/possible/impossible | accuracy | 0.41405622489959837 |
| 108 | xnli_hi | justified in saying | accuracy | 0.38795180722891565 |
| 109 | xnli_hi | median | accuracy | 0.41405622489959837 |
| 110 | xnli_sw | GPT-3 style | accuracy | 0.43493975903614457 |
| 111 | xnli_sw | MNLI crowdsource | accuracy | 0.363855421686747 |
| 112 | xnli_sw | can we infer | accuracy | 0.42891566265060244 |
| 113 | xnli_sw | guaranteed/possible/impossible | accuracy | 0.3457831325301205 |
| 114 | xnli_sw | justified in saying | accuracy | 0.3650602409638554 |
| 115 | xnli_sw | median | accuracy | 0.3650602409638554 |
| 116 | xnli_ur | GPT-3 style | accuracy | 0.43493975903614457 |
| 117 | xnli_ur | MNLI crowdsource | accuracy | 0.3895582329317269 |
| 118 | xnli_ur | can we infer | accuracy | 0.45180722891566266 |
| 119 | xnli_ur | guaranteed/possible/impossible | accuracy | 0.40120481927710844 |
| 120 | xnli_ur | justified in saying | accuracy | 0.37630522088353413 |
| 121 | xnli_ur | median | accuracy | 0.40120481927710844 |
| 122 | xnli_vi | GPT-3 style | accuracy | 0.5196787148594377 |
| 123 | xnli_vi | MNLI crowdsource | accuracy | 0.38112449799196785 |
| 124 | xnli_vi | can we infer | accuracy | 0.5080321285140562 |
| 125 | xnli_vi | guaranteed/possible/impossible | accuracy | 0.38393574297188754 |
| 126 | xnli_vi | justified in saying | accuracy | 0.43614457831325304 |
| 127 | xnli_vi | median | accuracy | 0.43614457831325304 |
| 128 | xnli_zh | GPT-3 style | accuracy | 0.5052208835341365 |
| 129 | xnli_zh | MNLI crowdsource | accuracy | 0.4 |
| 130 | xnli_zh | can we infer | accuracy | 0.5228915662650603 |
| 131 | xnli_zh | guaranteed/possible/impossible | accuracy | 0.4738955823293173 |
| 132 | xnli_zh | justified in saying | accuracy | 0.45863453815261046 |
| 133 | xnli_zh | median | accuracy | 0.4738955823293173 |
| 134 | xstory_cloze_ar | Answer Given options | accuracy | 0.7518199867637326 |
| 135 | xstory_cloze_ar | Choose Story Ending | accuracy | 0.7749834546657842 |
| 136 | xstory_cloze_ar | Generate Ending | accuracy | 0.586366644606221 |
| 137 | xstory_cloze_ar | Novel Correct Ending | accuracy | 0.7518199867637326 |
| 138 | xstory_cloze_ar | Story Continuation and Options | accuracy | 0.7438782263401721 |
| 139 | xstory_cloze_ar | median | accuracy | 0.7518199867637326 |
| 140 | xstory_cloze_es | Answer Given options | accuracy | 0.7835870284579749 |
| 141 | xstory_cloze_es | Choose Story Ending | accuracy | 0.8292521508934481 |
| 142 | xstory_cloze_es | Generate Ending | accuracy | 0.6399735274652548 |
| 143 | xstory_cloze_es | Novel Correct Ending | accuracy | 0.7935142289874255 |
| 144 | xstory_cloze_es | Story Continuation and Options | accuracy | 0.7888815354070152 |
| 145 | xstory_cloze_es | median | accuracy | 0.7888815354070152 |
| 146 | xstory_cloze_eu | Answer Given options | accuracy | 0.7041694242223693 |
| 147 | xstory_cloze_eu | Choose Story Ending | accuracy | 0.6823295830575777 |
| 148 | xstory_cloze_eu | Generate Ending | accuracy | 0.5625413633355394 |
| 149 | xstory_cloze_eu | Novel Correct Ending | accuracy | 0.6671078755790867 |
| 150 | xstory_cloze_eu | Story Continuation and Options | accuracy | 0.671740569159497 |
| 151 | xstory_cloze_eu | median | accuracy | 0.671740569159497 |
| 152 | xstory_cloze_hi | Answer Given options | accuracy | 0.6915949702183984 |
| 153 | xstory_cloze_hi | Choose Story Ending | accuracy | 0.7220383851753805 |
| 154 | xstory_cloze_hi | Generate Ending | accuracy | 0.5883520847121112 |
| 155 | xstory_cloze_hi | Novel Correct Ending | accuracy | 0.6743878226340172 |
| 156 | xstory_cloze_hi | Story Continuation and Options | accuracy | 0.6816677696889477 |
| 157 | xstory_cloze_hi | median | accuracy | 0.6816677696889477 |
| 158 | xstory_cloze_id | Answer Given options | accuracy | 0.7445400397088021 |
| 159 | xstory_cloze_id | Choose Story Ending | accuracy | 0.771012574454004 |
| 160 | xstory_cloze_id | Generate Ending | accuracy | 0.6029119788219722 |
| 161 | xstory_cloze_id | Novel Correct Ending | accuracy | 0.7485109199205824 |
| 162 | xstory_cloze_id | Story Continuation and Options | accuracy | 0.7438782263401721 |
| 163 | xstory_cloze_id | median | accuracy | 0.7445400397088021 |
| 164 | xstory_cloze_zh | Answer Given options | accuracy | 0.7610853739245532 |
| 165 | xstory_cloze_zh | Choose Story Ending | accuracy | 0.7961614824619457 |
| 166 | xstory_cloze_zh | Generate Ending | accuracy | 0.6214427531436135 |
| 167 | xstory_cloze_zh | Novel Correct Ending | accuracy | 0.7696889477167439 |
| 168 | xstory_cloze_zh | Story Continuation and Options | accuracy | 0.7670416942422237 |
| 169 | xstory_cloze_zh | median | accuracy | 0.7670416942422237 |
| 170 | xwinograd_en | Replace | accuracy | 0.5225806451612903 |
| 171 | xwinograd_en | True or False | accuracy | 0.48946236559139783 |
| 172 | xwinograd_en | does underscore refer to | accuracy | 0.5281720430107527 |
| 173 | xwinograd_en | stand for | accuracy | 0.5062365591397849 |
| 174 | xwinograd_en | underscore refer to | accuracy | 0.5372043010752688 |
| 175 | xwinograd_en | median | accuracy | 0.5225806451612903 |
| 176 | xwinograd_fr | Replace | accuracy | 0.5060240963855421 |
| 177 | xwinograd_fr | True or False | accuracy | 0.5421686746987951 |
| 178 | xwinograd_fr | does underscore refer to | accuracy | 0.5542168674698795 |
| 179 | xwinograd_fr | stand for | accuracy | 0.4819277108433735 |
| 180 | xwinograd_fr | underscore refer to | accuracy | 0.5301204819277109 |
| 181 | xwinograd_fr | median | accuracy | 0.5301204819277109 |
| 182 | xwinograd_pt | Replace | accuracy | 0.5133079847908745 |
| 183 | xwinograd_pt | True or False | accuracy | 0.4714828897338403 |
| 184 | xwinograd_pt | does underscore refer to | accuracy | 0.5209125475285171 |
| 185 | xwinograd_pt | stand for | accuracy | 0.5019011406844106 |
| 186 | xwinograd_pt | underscore refer to | accuracy | 0.5399239543726235 |
| 187 | xwinograd_pt | median | accuracy | 0.5133079847908745 |
| 188 | xwinograd_zh | Replace | accuracy | 0.5257936507936508 |
| 189 | xwinograd_zh | True or False | accuracy | 0.5297619047619048 |
| 190 | xwinograd_zh | does underscore refer to | accuracy | 0.5218253968253969 |
| 191 | xwinograd_zh | stand for | accuracy | 0.4444444444444444 |
| 192 | xwinograd_zh | underscore refer to | accuracy | 0.5198412698412699 |
| 193 | xwinograd_zh | median | accuracy | 0.5218253968253969 |
| 194 | multiple | average | multiple | 0.5631550819200618 |
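
The `median` rows and the final `average` row can be reproduced with a few lines of Python. The sketch below assumes that each `median` is the median accuracy over the five prompts of a dataset, and that `average` is the mean of those per-dataset medians. This reading is inferred from the numbers in the table, not stated in the source. The names `scores` and `summarize` are hypothetical, and only two datasets are copied in to keep the example short.

```python
from statistics import mean, median

# Per-prompt accuracies copied from the table for two datasets.
# The "median" rows are left out on purpose so they can be recomputed.
scores = {
    "anli_dev_r1": [0.351, 0.334, 0.351, 0.288, 0.345],
    "super_glue_copa": [0.66, 0.67, 0.78, 0.8, 0.81],
}


def summarize(per_prompt):
    """Return (median accuracy per dataset, mean of those medians).

    Assumption: this mirrors how the table's "median" and "average"
    rows were derived; the source does not document the procedure.
    """
    medians = {name: median(values) for name, values in per_prompt.items()}
    return medians, mean(list(medians.values()))


medians, average = summarize(scores)
for name, value in medians.items():
    print(f"{name}: median accuracy = {value}")
print(f"mean of medians over {len(medians)} datasets = {average:.4f}")
```

Running the same function over all 32 datasets in the table should give a mean of medians close to the reported 0.5632, provided the assumption above holds.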