- AI2 Reasoning Challenge (25-shot)
- HellaSwag (10-shot)
- MMLU (5-shot)
- TruthfulQA (0-shot)
For all of these evaluations, a higher score is better.
We chose these benchmarks because they test reasoning and general knowledge across a wide range of fields, in both 0-shot and few-shot settings.
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
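As an optional sanity check (assuming the clone above is your working directory), printing the CLI help confirms that the harness and its dependencies installed correctly:

python main.py --help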
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0
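The same pattern extends to the other benchmarks listed above. The sketch below is one way to do it; task names and flags can vary between harness versions, and arc_challenge, hendrycksTest-*, and truthfulqa_mc are the names used by older releases of the harness, with --num_fewshot setting the shot count for each run:

# AI2 Reasoning Challenge, 25-shot
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --device cuda:0

# MMLU (5-shot) and TruthfulQA (0-shot) follow the same form, with
# --tasks hendrycksTest-* and --tasks truthfulqa_mc respectively.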