Aira-OPT-1B3/lm-evaluation-harness.ipynb

{"cells":[{"cell_type":"markdown","metadata":{"id":"Ac6wadk3rmkK"},"source":["# LM Evaluation Harness (by [EleutherAI](https://www.eleuther.ai/))\n","\n","This [`LM-Evaluation-Harness`](https://github.com/EleutherAI/lm-evaluation-harness) provides a unified framework to test generative language models on a large number of different evaluation tasks. For a complete list of available tasks, see the [task table](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md), or scroll to the bottom of the page.\n","\n","1. Clone the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and install the necessary libraries (`sentencepiece` is required for the Llama tokenizer)."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":40508,"status":"ok","timestamp":1698580809187,"user":{"displayName":"Nicholas Corrêa","userId":"09736120585766268588"},"user_tz":-60},"id":"UA5I86u91e0A","outputId":"2342bf64-d93b-441f-8643-8e4003c6ef6c"},"outputs":[],"source":["!git clone --branch master https://github.com/EleutherAI/lm-evaluation-harness\n","!cd lm-evaluation-harness && pip install -e . -q\n","!pip install cohere tiktoken sentencepiece -q"]},{"cell_type":"markdown","metadata":{},"source":["2. 
Run the evaluation harness on the selected tasks."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":1753416,"status":"ok","timestamp":1698583348574,"user":{"displayName":"Nicholas Corrêa","userId":"09736120585766268588"},"user_tz":-60},"id":"pnHoAVK25QZn","outputId":"23f65f99-82f8-423f-9c8a-b1d4f2bdbd56"},"outputs":[],"source":["# Assumes your Hugging Face access token is stored in the HF_TOKEN environment variable; never hard-code it in the notebook.\n","!huggingface-cli login --token $HF_TOKEN\n","!cd lm-evaluation-harness && python main.py \\\n","    --model hf-causal \\\n","    --model_args pretrained=nicholasKluge/Aira-OPT-1B3 \\\n","    --tasks arc_challenge,truthfulqa_mc,toxigen \\\n","    --device cuda:0"]},{"cell_type":"markdown","metadata":{"id":"4Bm78wiZ4Own"},"source":["## Task Table 📚\n","\n","|                        Task Name                        |Train|Val|Test|Val/Test Docs|                                                                                     Metrics                                                                                     |\n","|---------------------------------------------------------|-----|---|----|------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n","|anagrams1                                                |     |✓  |    |        10000|acc                                                                                                                                                                              |\n","|anagrams2                                                |     |✓  |    |        10000|acc                                                                                                                                                                              |\n","|anli_r1                                                  |✓    |✓  |✓   |         1000|acc                         
                                                                                                                                                     |\n","|anli_r2                                                  |✓    |✓  |✓   |         1000|acc                                                                                                                                                                              |\n","|anli_r3                                                  |✓    |✓  |✓   |         1200|acc                                                                                                                                                                              |\n","|arc_challenge