- AI2 Reasoning Challenge (25-shot)
- HellaSwag (10-shot)
- MMLU (5-shot)
- TruthfulQA (0-shot)
For all of these evaluations, a higher score is better.
We chose these benchmarks because they test reasoning and general knowledge across a wide range of fields, in both 0-shot and few-shot settings.
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
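As an optional sanity check (assuming the clone above is your working directory), printing the CLI help confirms that the harness and its dependencies installed correctly:

python main.py --help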
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0
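The same pattern extends to the other benchmarks listed above. The sketch below is one way to do it; task names and flags can vary between harness versions, and arc_challenge, hendrycksTest-*, and truthfulqa_mc are the names used by older releases of the harness, with --num_fewshot setting the shot count for each run:

# AI2 Reasoning Challenge, 25-shot
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --device cuda:0

# MMLU (5-shot) and TruthfulQA (0-shot) follow the same form, with
# --tasks hendrycksTest-* and --tasks truthfulqa_mc respectively.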