{"cells":[{"cell_type":"markdown","metadata":{"id":"Ac6wadk3rmkK"},"source":["# LM Evaluation Harness (by [EleutherAI](https://www.eleuther.ai/))\n","\n","This [`LM-Evaluation-Harness`](https://github.com/EleutherAI/lm-evaluation-harness) provides a unified framework to test generative language models on a large number of different evaluation tasks. For a complete list of available tasks, see the [task table](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md), or scroll to the bottom of the page.\n","\n","1. Clone the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and install the necessary libraries (`sentencepiece` is required for the Llama tokenizer)."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":40508,"status":"ok","timestamp":1698580809187,"user":{"displayName":"Nicholas Corrêa","userId":"09736120585766268588"},"user_tz":-60},"id":"UA5I86u91e0A","outputId":"2342bf64-d93b-441f-8643-8e4003c6ef6c"},"outputs":[],"source":["!git clone --branch master https://github.com/EleutherAI/lm-evaluation-harness\n","!cd lm-evaluation-harness && pip install -e . -q\n","!pip install cohere tiktoken sentencepiece -q"]},{"cell_type":"markdown","metadata":{},"source":["2. 
Run the evaluation harness on the selected tasks."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":1753416,"status":"ok","timestamp":1698583348574,"user":{"displayName":"Nicholas Corrêa","userId":"09736120585766268588"},"user_tz":-60},"id":"pnHoAVK25QZn","outputId":"23f65f99-82f8-423f-9c8a-b1d4f2bdbd56"},"outputs":[],"source":["!huggingface-cli login --token $HF_TOKEN\n","!cd lm-evaluation-harness && python main.py \\\n"," --model hf-causal \\\n"," --model_args pretrained=nicholasKluge/Aira-OPT-1B3 \\\n"," --tasks arc_challenge,truthfulqa_mc,toxigen \\\n"," --device cuda:0"]},{"cell_type":"markdown","metadata":{"id":"4Bm78wiZ4Own"},"source":["## Task Table 📚\n","\n","| Task Name |Train|Val|Test|Val/Test Docs| Metrics |\n","|---------------------------------------------------------|-----|---|----|------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n","|anagrams1 | |✓ | | 10000|acc |\n","|anagrams2 | |✓ | | 10000|acc |\n","|anli_r1 |✓ |✓ |✓ | 1000|acc |\n","|anli_r2 |✓ |✓ |✓ | 1000|acc |\n","|anli_r3 |✓ |✓ |✓ | 1200|acc |\n","|arc_challenge
|