How to Evaluate LLMs with lm-evaluation-harness
A practical guide to EleutherAI's lm-evaluation-harness for ML engineers: CLI and Python API usage; running MMLU, HellaSwag, ARC, and TruthfulQA; evaluating fine-tuned checkpoints; writing custom YAML tasks for domain benchmarks; understanding acc vs. acc_norm vs. mc2 metrics; and avoiding the prompt-format mismatches and data-contamination issues that produce misleading benchmark numbers.
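As a taste of the custom-task workflow covered later, here is a minimal sketch of a multiple-choice task definition in the harness's YAML format. The field names follow the lm-evaluation-harness task schema, but the task name, dataset id, and column names (`question`, `choices`, `answer`) are illustrative placeholders; adapt them to your dataset's actual schema.

```yaml
# Hypothetical custom multiple-choice task for lm-evaluation-harness.
# Task name and dataset path are placeholders, not real resources.
task: my_domain_qa
dataset_path: my_org/domain_qa   # a Hugging Face dataset id (placeholder)
output_type: multiple_choice
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_choice: "{{choices}}"
doc_to_target: "{{answer}}"
metric_list:
  - metric: acc
  - metric: acc_norm
```

Dropping a file like this into a task directory and pointing the harness at it (via `--include_path`) is typically all that is needed to run a domain benchmark alongside the built-in tasks.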