The four tasks currently used by the Hugging Face Open LLM Leaderboard, all run through the EleutherAI Language Model Evaluation Harness:

AI2 Reasoning Challenge (25-shot)

HellaSwag (10-shot)

MMLU (5-shot)

TruthfulQA (0-shot)

For all these evaluations, a higher score is a better score.

We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
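To make the "few-shot" settings above concrete: a k-shot evaluation prepends k solved demonstrations to the question the model must answer, while 0-shot shows only the question. A minimal sketch of that prompt assembly (the function and format here are illustrative, not the harness's actual template):

```python
# Hypothetical sketch: build a k-shot prompt by prepending k solved
# (question, answer) demonstrations to the target question.
def build_few_shot_prompt(demos, question, k):
    """demos: list of (question, answer) pairs; k: number to prepend."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demos[:k]]
    parts.append(f"Q: {question}\nA:")  # target question, answer left blank
    return "\n\n".join(parts)

demos = [("What is 2+2?", "4"), ("What is the capital of France?", "Paris")]
prompt = build_few_shot_prompt(demos, "What is 3+3?", k=2)
```

With k=0 the same function produces a 0-shot prompt containing only the target question, which is how TruthfulQA is scored above.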

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Basic Usage


python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0
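To reproduce the leaderboard settings, each task must be run with its matching `--num_fewshot` value. A small sketch that builds the corresponding command lines (the task names follow the older harness CLI shown above and should be checked against your installed version; the helper function is our own, not part of the harness):

```python
# Leaderboard tasks mapped to their few-shot counts (from the list above).
# Task identifiers are assumptions based on the older harness CLI.
LEADERBOARD_TASKS = {
    "arc_challenge": 25,   # AI2 Reasoning Challenge
    "hellaswag": 10,
    "hendrycksTest-*": 5,  # MMLU
    "truthfulqa_mc": 0,
}

def harness_command(model, task, num_fewshot):
    """Build a harness invocation for one task (illustrative helper)."""
    return (
        f"python main.py --model hf-causal "
        f"--model_args pretrained={model} "
        f"--tasks {task} --num_fewshot {num_fewshot} --device cuda:0"
    )

commands = [
    harness_command("EleutherAI/gpt-j-6B", task, shots)
    for task, shots in LEADERBOARD_TASKS.items()
]
```

Running all four commands and averaging the reported scores approximates how a model's leaderboard entry is computed.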