How to Evaluate LLMs with lm-evaluation-harness

lm-evaluation-harness, maintained by EleutherAI, is the standard tool for benchmarking language models against a comprehensive suite of tasks — MMLU, HellaSwag, ARC, TruthfulQA, GSM8K, and hundreds more. It handles prompt formatting, few-shot example injection, metric computation, and result aggregation, and it supports models loaded via Hugging Face Transformers, vLLM, or the OpenAI API. If you are evaluating a fine-tuned model before deployment, comparing checkpoint quality during training, or trying to reproduce a published benchmark number, this is the tool to use. This article covers practical usage: running standard benchmarks, adding custom tasks, interpreting results correctly, and avoiding the common mistakes that produce misleading numbers.

Installation and Basic Usage

The harness installs cleanly via pip and runs from the command line or Python API. The quickest way to evaluate a model is with the lm_eval CLI, which handles model loading and task execution in a single command.

# Install
pip install lm-eval

# Optional extras for specific task dependencies (quote the brackets so zsh
# does not expand them; check the project README for the extras in your release):
pip install "lm-eval[math,ifeval]"

# Basic evaluation: run MMLU on a HuggingFace model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path ./results/llama-1b-mmlu

# Multiple tasks in one run. Note: --num_fewshot takes a single integer applied
# to every task, so omit it here to use each task's configured default; run
# tasks separately if you need different shot counts.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2 \
  --batch_size auto \
  --output_path ./results/llama-1b-standard

# Evaluate a fine-tuned model from a local path
lm_eval --model hf \
  --model_args pretrained=/path/to/finetuned-model,dtype=bfloat16 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path ./results/finetuned-mmlu

# Evaluate via vLLM backend (faster for large models)
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,gpu_memory_utilization=0.8 \
  --tasks mmlu,gsm8k \
  --batch_size auto \
  --output_path ./results/llama-8b-vllm

The --batch_size auto flag lets the harness find the largest batch size that fits in GPU memory, which significantly speeds up evaluation. Use the vLLM backend for models above roughly 7B parameters: it uses continuous batching and is substantially faster than the HuggingFace backend at that scale. The --num_fewshot argument controls how many examples are prepended to each prompt, and applies a single value to every task in the run. The conventional shot counts for the standard benchmarks are fixed (5-shot MMLU, 10-shot HellaSwag, 25-shot ARC, 0-shot TruthfulQA); use them if you want numbers comparable to published results.
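
Task names must match the harness registry exactly (MMLU subtasks, for instance, carry an mmlu_ prefix). A quick way to check is to enumerate registered tasks from Python; this sketch assumes the v0.4+ TaskManager API, which has moved between releases, so verify against your installed version.

from lm_eval.tasks import TaskManager

# Enumerate registered task names (includes MMLU subtasks, e.g. mmlu_formal_logic)
task_manager = TaskManager()
mmlu_tasks = [t for t in task_manager.all_tasks if t.startswith("mmlu")]
print(f"{len(task_manager.all_tasks)} tasks registered, {len(mmlu_tasks)} MMLU-related")
print(mmlu_tasks[:5])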

Python API for Programmatic Evaluation

For integrating evaluation into a training pipeline — running benchmarks after each checkpoint, or automating comparison across experiments — the Python API is more convenient than the CLI.

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
import json
from pathlib import Path

def evaluate_checkpoint(
    model_path: str,
    tasks: list[str] = ["mmlu", "hellaswag", "arc_challenge"],
    num_fewshot: int = 5,
    batch_size: int = 8,
    output_dir: str = "./eval_results",
) -> dict:
    """Evaluate a checkpoint on standard benchmarks and return results dict."""
    # Load model
    lm = HFLM(
        pretrained=model_path,
        dtype="bfloat16",
        batch_size=batch_size,
    )
    # Run evaluation
    results = evaluator.simple_evaluate(
        model=lm,
        tasks=tasks,
        num_fewshot=num_fewshot,
        batch_size=batch_size,
        log_samples=False,  # set True to inspect individual predictions
    )
    # Save results
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    output_path = f"{output_dir}/{Path(model_path).name}_results.json"
    with open(output_path, "w") as f:
        json.dump(results["results"], f, indent=2)

    # Print summary (metric keys look like "acc,none": metric name + filter name)
    for task, metrics in results["results"].items():
        acc = metrics.get("acc,none", metrics.get("acc_norm,none", float("nan")))
        print(f"{task:30s}: {acc:.4f}")

    return results["results"]

# Example: evaluate every checkpoint saved during training
import glob
# Sort numerically: a lexical sort would put checkpoint-1000 before checkpoint-500
checkpoints = sorted(glob.glob("./checkpoints/checkpoint-*"),
                     key=lambda p: int(p.rsplit("-", 1)[-1]))
for ckpt in checkpoints:
    print(f"
=== Evaluating {ckpt} ===")
    evaluate_checkpoint(ckpt, tasks=["mmlu"], num_fewshot=5)

Writing a Custom Task

The harness supports custom tasks defined as YAML configuration files, making it straightforward to evaluate a model on a domain-specific benchmark or a proprietary held-out dataset. A task YAML specifies the dataset source, prompt template, output type (multiple choice, generation, or loglikelihood), and metric. The YAML-based task format was introduced in harness v0.4 and is the current standard; older Python-class-based task definitions still work but are being phased out.

# custom_tasks/my_domain_qa.yaml
task: my_domain_qa
dataset_path: json                  # load from local JSON files
dataset_kwargs:
  data_files:
    test: /path/to/test_data.jsonl
doc_to_text: "Question: {{question}}
Answer:"
doc_to_target: "{{answer}}"
output_type: generate_until          # free-form generation
generation_kwargs:
  until:
    - "
"                           # stop at newline
  max_gen_toks: 64
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: f1
    aggregation: mean
    higher_is_better: true
num_fewshot: 3
fewshot_config:
  sampler: first_n                   # use first N examples as few-shot demos

# Run the custom task (num_fewshot: 3 is picked up from the task YAML)
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B \
  --tasks my_domain_qa \
  --include_path ./custom_tasks \
  --output_path ./results/custom-qa

For multiple-choice tasks, use output_type: multiple_choice and define doc_to_choice to return the list of answer options. The harness will score each option by log-likelihood and select the highest as the model’s prediction, which is the standard approach for benchmarks like MMLU and ARC. For generation tasks, define the stop sequences carefully — an incomplete stop sequence that allows the model to generate multiple answers before terminating will silently inflate or deflate scores depending on the metric.
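
As a concrete illustration, here is a minimal multiple-choice variant of the task above. The file name and the question/choices/label fields are hypothetical: the sketch assumes each JSONL row holds the question text, a list of option strings, and the integer index of the correct option.

# custom_tasks/my_domain_mc.yaml (hypothetical multiple-choice variant)
task: my_domain_mc
dataset_path: json
dataset_kwargs:
  data_files:
    test: /path/to/mc_test.jsonl
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_choice: "{{choices}}"         # list of answer option strings
doc_to_target: label                 # integer index of the correct option
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true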

Understanding the Output and Common Metrics

The harness outputs a JSON file and a printed table with metric values for each task. The key metrics to understand are: acc (accuracy on exact match or multiple-choice selection), acc_norm (accuracy normalised by answer length, used for multiple-choice tasks where answer options have very different lengths), exact_match (strict string matching for generation tasks), and perplexity (for language modelling tasks). For MMLU, use acc; for ARC, use acc_norm; for TruthfulQA, use mc2 (which measures the probability mass on all correct answers rather than just the top prediction).

import json

def compare_checkpoints(result_files: list[str]) -> None:
    """Print a comparison table of benchmark results across checkpoints."""
    all_results = {}
    for path in result_files:
        with open(path) as f:
            data = json.load(f)
        name = path.split("/")[-1].replace("_results.json", "")
        all_results[name] = data

    # Print header
    tasks = list(next(iter(all_results.values())).keys())
    print(f"{'Model':<35}", end="")
    for task in tasks:
        print(f"{task[:12]:>14}", end="")
    print()
    print("-" * (35 + 14 * len(tasks)))

    # Print each model row
    for model_name, results in all_results.items():
        print(f"{model_name:<35}", end="")
        for task in tasks:
            metrics = results.get(task, {})
            # Prefer acc_norm, then acc, then the first numeric metric
            # (skipping string entries like "alias" that appear in v0.4 results)
            val = metrics.get("acc_norm,none") or metrics.get("acc,none")
            if val is None:
                numeric = [v for v in metrics.values()
                           if isinstance(v, (int, float))]
                val = numeric[0] if numeric else float("nan")
            print(f"{val:14.4f}", end="")
        print()

# Usage
compare_checkpoints([
    "./results/base_model_results.json",
    "./results/finetuned_epoch1_results.json",
    "./results/finetuned_epoch3_results.json",
])

Avoiding Evaluation Mistakes That Produce Misleading Numbers

Several common mistakes produce benchmark numbers that look credible but do not reflect actual model quality. The most frequent is prompt format mismatch: MMLU and other benchmarks have specific expected prompt formats, and even minor deviations — adding a trailing space, changing capitalisation of “Answer:”, or using a different delimiter between the question and options — can shift accuracy by 2–5 percentage points. Always use the harness’s built-in task definitions rather than re-implementing prompts manually, and if you need to compare against a paper’s numbers, verify that the paper used the same harness version and the same prompt format.
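
One way to catch a format mismatch is to render a prompt directly from the task definition and inspect it. This sketch leans on v0.4 internals (TaskManager, get_task_dict, eval_docs) that are not a stable public API, so treat it as a debugging aid and expect it to need adjustment across versions.

from lm_eval.tasks import TaskManager, get_task_dict

# Render the first evaluation document's prompt exactly as the harness builds it
# (few-shot examples are prepended separately at request-construction time)
tm = TaskManager()
task = get_task_dict(["mmlu_formal_logic"], tm)["mmlu_formal_logic"]
doc = task.eval_docs[0]
print(task.doc_to_text(doc))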

The second common mistake is contamination: if your fine-tuning data contains examples from the evaluation benchmarks, your scores will be inflated. MMLU questions and answers are widely reproduced on the internet and appear in many popular instruction-tuning datasets. The harness does not check for contamination — you must do this yourself by checking whether your training data overlaps with evaluation questions. A simple n-gram overlap check between your training text and the evaluation dataset detects most contamination. Third, do not compare few-shot numbers across different values of num_fewshot: a model evaluated at 5-shot MMLU and another at 0-shot MMLU are not directly comparable, since few-shot examples improve performance by providing format demonstrations even when they add no factual information. Use the canonical few-shot counts for each benchmark consistently across all models you compare.
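
A minimal sketch of such a check, assuming your training examples and evaluation questions are available as lists of strings; the 13-gram window follows common practice in contamination studies but is a tunable assumption:

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(train_texts: list[str], eval_questions: list[str],
                      n: int = 13) -> list[int]:
    """Return indices of eval questions sharing any n-gram with training data."""
    train_ngrams = set()
    for text in train_texts:
        train_ngrams |= ngrams(text, n)
    return [i for i, q in enumerate(eval_questions)
            if ngrams(q, n) & train_ngrams]

# Usage (train_texts and eval_questions are your own lists of strings);
# any non-empty result deserves manual inspection before trusting scores:
# flagged = flag_contaminated(train_texts, eval_questions)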

Evaluating Instruction-Tuned Models Correctly

Instruction-tuned and chat models require extra care when running benchmark evaluations. These models are trained to follow instructions and often interpret the few-shot examples in a benchmark prompt as part of a multi-turn conversation rather than as plain-text demonstrations, which can cause them to respond to the wrong thing or apply chat formatting that breaks the metric computation. The harness has chat template support that correctly formats prompts for instruction-tuned models, and using it is important for getting valid numbers.

# Evaluate an instruction-tuned model with its chat template applied
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct,dtype=bfloat16 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path ./results/llama-1b-instruct-mmlu

The --apply_chat_template flag wraps each prompt in the model’s chat template (loaded from the tokenizer config), and --fewshot_as_multiturn formats few-shot examples as a multi-turn conversation with user/assistant turns rather than as a single concatenated context. Without these flags, instruction-tuned models typically score significantly lower than their base model counterparts on multiple-choice benchmarks because the prompt format is mismatched to what they were trained on. With them, instruction-tuned models typically score higher or comparably to the base model, reflecting their actual capability. Always compare instruction-tuned models with --apply_chat_template and base models without it — mixing the two produces meaningless comparisons.

Tracking Evaluation Results Across Training

Integrating benchmark evaluation into your training loop gives you a learning curve for benchmark performance in addition to training loss, which is far more informative for understanding whether fine-tuning is improving capabilities or causing regression. A practical approach is to evaluate on a lightweight subset of tasks — a single MMLU subcategory, or a small split of your domain benchmark — every few hundred training steps, and run the full benchmark suite only on final checkpoints. This gives you early signal on whether the fine-tuning direction is working without the cost of full MMLU at every checkpoint, which takes 30–90 minutes depending on model size and hardware.

The specific MMLU subcategories most useful as cheap proxies depend on your fine-tuning domain. For code-focused fine-tuning, mmlu_college_computer_science and mmlu_computer_security track relevant capability changes. For science and reasoning, mmlu_high_school_mathematics and mmlu_formal_logic are sensitive to regression from fine-tuning on narrow instruction data. Running two or three subcategories rather than all 57 reduces evaluation time by roughly 20x with minimal loss of signal for detecting major regressions, as the sketch below shows. When you see a drop on these proxy tasks during fine-tuning, it is usually a sign of catastrophic forgetting — the model is overwriting general capabilities in favour of task-specific patterns — and the right response is to add general instruction data to the training mix or reduce the fine-tuning learning rate.
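
A sketch of that loop, reusing the evaluate_checkpoint helper from the Python API section above; the checkpoint paths, subtask choices, and alarm threshold are illustrative assumptions:

# Cheap proxy evaluation during training: two MMLU subtasks instead of all 57
PROXY_TASKS = ["mmlu_high_school_mathematics", "mmlu_formal_logic"]

baseline = evaluate_checkpoint("./checkpoints/checkpoint-500", tasks=PROXY_TASKS)
for step in range(1000, 5001, 500):
    results = evaluate_checkpoint(f"./checkpoints/checkpoint-{step}", tasks=PROXY_TASKS)
    for task in PROXY_TASKS:
        drop = baseline[task].get("acc,none", 0) - results[task].get("acc,none", 0)
        if drop > 0.03:  # a 3-point drop on a proxy task suggests forgetting
            print(f"WARNING: {task} dropped {drop:.3f} at step {step}")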

Interpreting Results: What the Numbers Actually Mean

Benchmark scores are a compressed signal about model capability, not a ground truth. MMLU accuracy above 70% indicates strong general knowledge but says nothing about instruction following, factual grounding, or output quality. HellaSwag measures commonsense sentence completion and is largely saturated for models above 13B parameters — differences below 0.5 percentage points are within noise. GSM8K measures grade-school math reasoning and is a better signal for reasoning capability than MMLU for most practical purposes, since it requires multi-step reasoning rather than retrieval of memorised facts. TruthfulQA mc2 measures whether a model avoids stating common misconceptions as fact, which is particularly relevant for models deployed in question-answering contexts.

The most useful benchmark for detecting fine-tuning regression is the one most closely related to the task you are fine-tuning away from. If you are fine-tuning a general model on a narrow domain, run a broad benchmark before and after to quantify the regression. If you are fine-tuning an instruction model on a specific skill, run a diverse instruction-following benchmark (IFEval is a good choice) alongside your domain benchmark to check that general instruction-following capability is preserved. A model that scores 10 points higher on your domain benchmark but 5 points lower on IFEval has likely traded general utility for narrow performance, which may or may not be acceptable depending on your deployment context.
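
A small helper makes the trade-off explicit by printing per-task deltas between a before and after results file. It assumes the file layout written by evaluate_checkpoint above; the metric-key prefixes reflect the v0.4 naming convention (metric name plus filter suffix) and may need adjusting:

import json

def headline(metrics: dict) -> float:
    """First metric whose key starts with acc or exact_match (filter suffix varies)."""
    for key, val in metrics.items():
        if key.startswith(("acc,", "acc_norm,", "exact_match,",
                           "prompt_level_strict_acc,")):
            return val
    return float("nan")

def print_regressions(before_path: str, after_path: str) -> None:
    """Print per-task headline-metric deltas between two saved results files."""
    with open(before_path) as f:
        before = json.load(f)
    with open(after_path) as f:
        after = json.load(f)
    for task in sorted(set(before) & set(after)):
        b, a = headline(before[task]), headline(after[task])
        marker = "  <-- regression" if a < b else ""
        print(f"{task:30s} {b:.4f} -> {a:.4f} ({a - b:+.4f}){marker}")

print_regressions("./results/base_model_results.json",
                  "./results/finetuned_epoch3_results.json")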

Speed and Cost: Making Evaluation Practical

Full MMLU across all 57 subcategories takes roughly 15–30 minutes for a 1B model and 60–120 minutes for a 7B model on a single A100, using the HuggingFace backend. The vLLM backend is typically 2–4x faster for generation tasks and roughly comparable for multiple-choice tasks where generation length is short. For development iteration — checking whether a code change broke something — run a single fast task like arc_easy or a single MMLU subcategory rather than the full suite. Reserve the full standard benchmark suite for final model selection and for numbers you plan to report externally.

If you are evaluating models frequently as part of a larger pipeline, caching model outputs and rerunning only the metric computation is a major time saver. The harness supports saving raw model outputs with --log_samples, which writes per-example predictions and loglikelihoods to disk. You can then reload these and recompute metrics with different aggregation logic or additional metrics without re-running inference. This is particularly useful when you want to add a custom metric to an existing evaluation after the fact, or when you want to perform error analysis on specific failure cases rather than looking only at aggregate accuracy numbers.
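
A sketch of that workflow, recomputing exact match from a logged samples file. The field names (target, filtered_resps) follow the v0.4 per-sample format, which has changed between releases, so inspect a line of your own output before relying on this; the file path is illustrative:

import json

def recompute_exact_match(samples_path: str) -> float:
    """Re-score a --log_samples output file without re-running inference."""
    correct, total = 0, 0
    with open(samples_path) as f:
        for line in f:
            rec = json.loads(line)
            target = str(rec["target"]).strip()
            prediction = str(rec["filtered_resps"][0]).strip()
            correct += int(prediction == target)
            total += 1
    return correct / max(total, 1)

# Example: rescore a generation task's logged samples
print(recompute_exact_match("./results/samples_my_domain_qa.jsonl"))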

Running the harness as part of a CI/CD pipeline for model development — evaluating every pull request that changes training data or fine-tuning hyperparameters against a fixed benchmark subset — gives you an automated regression detector that catches capability drops before they reach production. The infrastructure cost is low (one GPU job per PR), and the signal is high: a PR that drops MMLU by more than one percentage point or GSM8K by more than two points deserves careful review regardless of what the training loss says. Training loss and benchmark performance are correlated but not identical, and benchmark regression is the more reliable signal for whether a model change is safe to ship.
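
A minimal CI gate under those thresholds might look like the following; it reads two results files of the kind saved earlier, and the headline-metric lookup assumes v0.4 metric-key naming:

import json
import sys

# Maximum allowed absolute drop per task, per the rules of thumb above
THRESHOLDS = {"mmlu": 0.01, "gsm8k": 0.02}

def headline(metrics: dict) -> float:
    """First metric whose key starts with acc or exact_match (filter suffix varies)."""
    for key, val in metrics.items():
        if key.startswith(("acc,", "exact_match,")):
            return val
    return float("nan")

def gate(baseline_path: str, candidate_path: str) -> int:
    """Return nonzero if any tracked task regressed past its threshold."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)
    failed = False
    for task, max_drop in THRESHOLDS.items():
        drop = headline(baseline[task]) - headline(candidate[task])
        if drop > max_drop:
            print(f"FAIL: {task} dropped {drop:.4f} (limit {max_drop})")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))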

Quick Reference: Standard Benchmark Configurations

MMLU: 5-shot, acc, 57 subcategories covering knowledge across STEM, humanities, and social sciences.
HellaSwag: 10-shot, acc_norm, commonsense sentence completion; largely saturated above 13B.
ARC-Challenge: 25-shot, acc_norm, science questions requiring reasoning rather than recall.
TruthfulQA: 0-shot, mc2, measures avoidance of common misconceptions.
GSM8K: 8-shot, exact_match, grade-school math reasoning with chain-of-thought.
IFEval: 0-shot, prompt_level_strict_acc, instruction following with verifiable constraints.
WinoGrande: 5-shot, acc, commonsense pronoun resolution.

Use these canonical settings consistently to produce numbers that are comparable across models and reproducible across runs.
