How to Train a Reward Model for RLHF

Reward models are the foundation of RLHF (Reinforcement Learning from Human Feedback), and they are increasingly relevant beyond pure alignment work — reward models are used to filter synthetic training data, to rank candidate outputs in best-of-N sampling, and as automated evaluators in LLM evaluation pipelines. Despite their importance, the practical details of how to train a reward model — what data it needs, how to structure the training objective, what quality problems to watch for — are less well documented than fine-tuning the policy model itself. This post covers the full reward model training pipeline from preference data collection through to deployment.

What a Reward Model Does

A reward model takes a prompt and a response as input and outputs a scalar score representing how good the response is. During RLHF, this scalar is used as the reward signal for PPO or other RL algorithms to fine-tune the language model. Outside of RLHF, reward models power best-of-N sampling (generate N candidates, pick the highest-scored one), data quality filtering (keep only generated data that scores above a threshold), and automated evaluation (replace expensive human ratings with model ratings at scale). A reward model is typically initialised from the same base LLM that will be fine-tuned, with the language modelling head replaced by a linear layer that maps the final hidden state to a single scalar.

Preference Data: The Reward Model’s Training Signal

Reward models are trained on preference data: pairs of responses to the same prompt, one labelled as preferred and one as rejected. The preference label can come from human annotators (the original RLHF setup), from a stronger model acting as a judge (constitutional AI and RLAIF approaches), or from implicit signals like upvotes, engagement, or task completion. The quality and diversity of this preference data is the single largest determinant of reward model quality — a reward model trained on narrow preference data will have a narrow notion of quality and will fail to generalise to out-of-distribution prompts. Practically, you want your preference dataset to span the distribution of prompts and response types you care about, to have consistent labelling criteria (annotators should agree on close calls), and to include hard negatives — pairs where the preference is clear but requires genuine understanding rather than surface-level features like length or formatting.

from datasets import Dataset
import json

# Example preference dataset structure
preference_data = [
    {
        "prompt": "Explain gradient descent in one paragraph.",
        "chosen": "Gradient descent is an optimisation algorithm that iteratively updates model parameters by computing the gradient of the loss function and stepping in the direction that reduces the loss. At each step, parameters are updated as θ ← θ - η∇L(θ), where η is the learning rate. The process repeats until the loss converges or a stopping criterion is met.",
        "rejected": "Gradient descent helps train neural networks. You compute gradients and update weights. It's a common method used in machine learning to optimise models."
    },
    {
        "prompt": "What is the difference between precision and recall?",
        "chosen": "Precision measures what fraction of positive predictions are correct (TP / (TP + FP)), while recall measures what fraction of actual positives were retrieved (TP / (TP + FN)). They trade off against each other: optimising for high precision means being selective at the cost of missing some positives; optimising for high recall means capturing most positives at the cost of more false positives.",
        "rejected": "Precision is about how accurate your predictions are. Recall is about how many relevant results you found. Both are important metrics for classification."
    }
]

dataset = Dataset.from_list(preference_data)
print(dataset)

Training Objective: Bradley-Terry Pairwise Loss

The standard reward model training objective is derived from the Bradley-Terry model for pairwise comparisons. Given a chosen response y_w and a rejected response y_l for the same prompt x, the loss is the negative log-likelihood that the reward model r_θ assigns a higher score to the chosen response: L(θ) = -log σ(r_θ(x, y_w) - r_θ(x, y_l)), where σ is the logistic sigmoid. Minimising this loss widens the score gap between chosen and rejected responses:

import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class RewardModel(nn.Module):
    """Reward model wrapping a pretrained LLM with a scalar head."""
    def __init__(self, model_name: str):
        super().__init__()
        # Load base model with a single-output classification head
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            num_labels=1,          # scalar reward output
            torch_dtype=torch.bfloat16,
        )
        # Disable caching during training
        self.model.config.use_cache = False

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits.squeeze(-1)  # (batch,) scalar rewards

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss. Maximises P(chosen > rejected)."""
    # log sigmoid of the reward difference
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Training step
def reward_model_step(model, batch, tokenizer):
    # Causal-LM tokenizers often lack a pad token; reuse EOS so padding works
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # Tokenize chosen and rejected prompt+response pairs separately
    chosen_enc = tokenizer(
        [f"{p} {c}" for p, c in zip(batch['prompt'], batch['chosen'])],
        padding=True, truncation=True, max_length=512, return_tensors='pt'
    )
    rejected_enc = tokenizer(
        [f"{p} {r}" for p, r in zip(batch['prompt'], batch['rejected'])],
        padding=True, truncation=True, max_length=512, return_tensors='pt'
    )
    reward_chosen = model(**chosen_enc)
    reward_rejected = model(**rejected_enc)
    loss = bradley_terry_loss(reward_chosen, reward_rejected)
    # Track accuracy: fraction of pairs where model correctly scores chosen > rejected
    accuracy = (reward_chosen > reward_rejected).float().mean()
    return loss, accuracy

Pairwise accuracy — the fraction of preference pairs where the reward model correctly assigns a higher score to the chosen response — is the most useful training metric to monitor. A random baseline achieves 50%; well-trained reward models on standard preference datasets typically reach 70–85% pairwise accuracy. If accuracy plateaus well below 70%, the most common causes are noisy preference labels, a preference dataset that is too narrow or too easy (the model learns superficial correlations rather than quality), or a model that is too small relative to task complexity.

Training Setup and Hyperparameters

Reward model training uses the same infrastructure as supervised fine-tuning. Key hyperparameters that differ from SFT: use a lower learning rate (1e-5 to 5e-6, compared to 1e-4 to 2e-5 for SFT) because the reward signal is noisier than supervised labels; use a smaller batch size (8–32 pairs) because each preference pair contributes two forward passes; and train for fewer epochs (1–3) because reward models overfit faster than SFT models on the same dataset size. LoRA works well for reward model training and substantially reduces memory requirements — attach LoRA adapters to the query and value projection matrices with rank 8–32, the same settings that work for SFT.

from transformers import TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType

def setup_reward_model_training(model_name: str, output_dir: str):
    base_model = RewardModel(model_name)

    # LoRA config for reward model training
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
    )
    # Apply LoRA to the inner transformers model so PEFT can locate q_proj/v_proj
    base_model.model = get_peft_model(base_model.model, lora_config)
    base_model.model.print_trainable_parameters()
    model = base_model

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=2,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=4,   # effective batch size 32
        learning_rate=1e-5,              # lower than SFT; see discussion above
        warmup_ratio=0.05,
        lr_scheduler_type="cosine",
        bf16=True,
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=100,
        save_steps=200,
        dataloader_num_workers=4,
        remove_unused_columns=False,
    )
    return model, training_args
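
The stock Trainer computes a label-based classification loss, so reward model training needs a subclass that applies the pairwise loss instead. A minimal sketch, assuming a pairwise data collator that emits the hypothetical batch keys chosen_input_ids / rejected_input_ids (plus matching attention masks) — rename them to match your own collator:

```python
import torch
import torch.nn as nn
from transformers import Trainer

class RewardTrainer(Trainer):
    """Trainer subclass applying the Bradley-Terry pairwise loss.

    Assumes the data collator emits 'chosen_input_ids', 'chosen_attention_mask',
    'rejected_input_ids', 'rejected_attention_mask' per batch (hypothetical
    key names, not a transformers convention).
    """

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Two forward passes per preference pair: one scalar reward each
        reward_chosen = model(
            input_ids=inputs["chosen_input_ids"],
            attention_mask=inputs["chosen_attention_mask"],
        )
        reward_rejected = model(
            input_ids=inputs["rejected_input_ids"],
            attention_mask=inputs["rejected_attention_mask"],
        )
        # Negative log-likelihood that chosen outranks rejected
        loss = -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
        if return_outputs:
            return loss, {"chosen": reward_chosen, "rejected": reward_rejected}
        return loss
```

Pass this subclass the model and TrainingArguments from setup_reward_model_training in place of the plain Trainer.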

Reward Hacking and How to Mitigate It

Reward hacking is the most serious failure mode in reward model deployment. It occurs when the policy model learns to exploit weaknesses in the reward model — finding inputs that score highly according to the reward model but are actually low quality or harmful by human standards. Common reward hacking patterns include verbosity hacking (reward models often spuriously reward longer responses, so the policy learns to pad outputs), sycophancy (agreeing with the user’s implied preferences regardless of accuracy), and formatting exploitation (heavy use of bullet points or headers that the reward model associates with structured and therefore high-quality responses). Mitigations include adding length normalisation to the reward (penalise reward per token rather than per response), training on diverse preference data that includes examples where concise responses are preferred, and using KL divergence penalties during PPO to prevent the policy from drifting too far from the SFT baseline. Monitoring response length distributions and running periodic human evaluation of policy outputs throughout RLHF training catches hacking early before it compounds.
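
As a concrete example of the first mitigation, a per-token length penalty takes only a few lines; the length_penalty value below is a hypothetical tuning knob, not a standard default:

```python
def length_normalised_reward(raw_reward: float, num_tokens: int,
                             length_penalty: float = 0.001) -> float:
    """Subtract a per-token penalty so padding an answer costs the policy reward.

    Too high a penalty pushes the policy toward terse, unhelpful answers;
    too low leaves verbosity hacking intact. Sweep it on held-out prompts.
    """
    return raw_reward - length_penalty * num_tokens
```

With this in place, two responses the reward model scores identically are separated by their token counts, removing the gradient toward padding.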

Best-of-N Sampling as an Alternative to PPO

Full PPO-based RLHF is complex to implement correctly and sensitive to hyperparameters. For many practical applications, best-of-N (BoN) sampling with a reward model achieves similar quality improvements with far less infrastructure: generate N candidate responses to each prompt, score all of them with the reward model, and return the highest-scored response. BoN requires no RL training, is completely stateless, and is trivially parallelisable. The compute cost scales linearly with N, whereas PPO requires multiple forward-backward passes per policy update step. Research has found that BoN with N = 32–64 achieves quality roughly comparable to PPO-trained models, and BoN with N = 256 often exceeds them, though the inference cost at N = 256 is prohibitive for most production systems. For offline use cases — generating high-quality synthetic training data, filtering a large candidate set — BoN is almost always the right choice. For online inference where you can only afford to run the model once per user request, you need either PPO-trained weights or a distilled model trained on BoN outputs.
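
The whole BoN procedure fits in a few lines; in this sketch, generate and score are stand-ins for your sampling call and reward model:

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Draw n candidates and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

Because each candidate is generated and scored independently, both loops parallelise trivially across requests or devices.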

Choosing Which Stage to Apply Your Reward Model

Reward models see the most return when applied to filter data rather than drive RL training. If your goal is improving a fine-tuned model’s output quality, the most cost-effective path is often: (1) generate a large candidate set with the SFT model, (2) filter with the reward model to keep the top quartile, (3) fine-tune again on the filtered data. This rejection-sampling fine-tuning approach, used in Llama 2 and other open-weight alignment recipes, is simpler to implement than PPO, less prone to reward hacking, and easier to debug because the training data is inspectable. Reserve full PPO-based RLHF for cases where rejection sampling has saturated — you have already run several rounds of rejection sampling fine-tuning and the reward model cannot distinguish good from bad outputs in your SFT model’s current output distribution.
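
Step (2) of that recipe is simple once candidates are scored; the sketch below keeps the top quartile by reward (the 75th-percentile threshold is illustrative — pick the cut that matches your data budget):

```python
from typing import List

def keep_top_quartile(candidates: List[str], rewards: List[float]) -> List[str]:
    """Keep candidates whose reward reaches the 75th percentile of the batch."""
    threshold = sorted(rewards)[int(0.75 * len(rewards))]
    return [c for c, r in zip(candidates, rewards) if r >= threshold]
```

The surviving examples feed directly into a second SFT round, and because they are plain text they can be read and spot-checked before training.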

Preference Data Quality: What Actually Matters

The most common failure mode in reward model training is poor preference data, not architecture or hyperparameter choices. Annotator agreement rate is the most reliable indicator of data quality — if human annotators agree on which response is better less than 70% of the time, the labels are too noisy to train a reliable reward model, and you need either clearer annotation guidelines, better annotator selection, or a different preference elicitation method. Inter-annotator agreement (measured by Cohen’s kappa or simply the percentage of pairs where majority agreement exceeds 2:1) should be computed before training and used to filter out the most contested examples.
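
Cohen's kappa for two annotators can be computed directly from the preference labels; a minimal sketch for binary A/B preferences:

```python
from typing import Sequence

def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa between two annotators' preference labels.

    Each sequence holds one label per pair, e.g. 'A' or 'B' for which response
    the annotator preferred. Undefined (division by zero) when expected
    agreement is 1, i.e. both annotators always pick the same single label.
    """
    n = len(labels_a)
    # Observed agreement: fraction of pairs where the annotators match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label rates
    expected = sum(
        (list(labels_a).count(c) / n) * (list(labels_b).count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)
```

Compute this on a doubly-annotated subset before training, and drop the pairs where annotators disagree if the overall kappa is low.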

Margin matters too. Preference datasets that include only clear-cut cases (one response is obviously better) produce reward models that are well-calibrated on easy cases but poorly calibrated on close calls — exactly the situations where an accurate reward signal matters most for RL training. Deliberately including hard cases where the better response requires domain expertise to identify improves reward model generalisation, even if it reduces pairwise training accuracy. A good preference dataset therefore has a mix of easy, medium, and hard pairs, with the hard pairs contributing most of the training signal for domain-specific quality distinctions that the model could not learn from length, formatting, or other surface features.

For teams without resources for large-scale human annotation, using a frontier model (GPT-4o, Claude Opus) as a preference judge is a practical alternative. Constitutional AI and RLAIF (Reinforcement Learning from AI Feedback) use this approach: generate pairs of responses, ask a judge model to compare them according to specified criteria, and use the judge’s preferences as training signal. The resulting reward models are less reliable than human-annotated ones for nuanced quality distinctions, but work well for filtering obvious quality differences and are dramatically cheaper to produce. When using AI-generated preferences, the most important safeguard is to specify detailed evaluation criteria in the judge prompt — vague instructions produce noisier preferences that reflect the judge model’s idiosyncrasies rather than meaningful quality distinctions.
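
A sketch of a judge prompt with explicit criteria; the criteria, priority order, and output format below are illustrative, not a canonical RLAIF template:

```python
def build_judge_prompt(prompt: str, response_a: str, response_b: str) -> str:
    """Assemble a pairwise-comparison prompt for an LLM judge."""
    return (
        "Compare the two responses to the prompt below.\n"
        "Criteria, in priority order: factual accuracy, completeness,\n"
        "clarity, conciseness. Do not reward length or formatting alone.\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with exactly one letter: A or B."
    )
```

Constraining the output to a single letter makes the judge's verdict trivially parseable; randomising which response appears as A versus B across calls guards against position bias.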

Reward Model Generalisation and Out-of-Distribution Robustness

A reward model trained on one preference distribution will have degraded accuracy when used to score outputs from a policy that has been fine-tuned away from that distribution. This is the fundamental tension in iterative RLHF: the more the policy improves, the more it generates outputs that differ from the preference data distribution, and the less reliable the reward model becomes. The standard mitigation is iterative reward model training — after each round of policy fine-tuning, collect new preference data on policy outputs, retrain the reward model, and use the updated reward model for the next PPO round. This adds annotation cost but is necessary to maintain reward model reliability across multiple alignment iterations. For organisations running production RLHF pipelines, budget for at least two to three reward model retraining rounds per policy training cycle, and monitor reward model pairwise accuracy on fresh held-out preference data as the policy evolves rather than evaluating only on the original test set.

Decision Framework: When to Train a Custom Reward Model

Train a custom reward model when your quality criteria are domain-specific enough that off-the-shelf models or simple heuristics cannot capture them reliably. For general instruction-following quality — helpfulness, harmlessness, honesty — strong open-weight reward models (Llama-3-based reward models, Qwen reward models available on Hugging Face) are a reasonable starting point and may be sufficient without any custom training. Reserve the engineering investment of custom reward model training for tasks with quality signals that generic models miss: medical accuracy (requires expert annotators), legal reasoning (requires lawyers), or a specific writing style your organisation wants to enforce. For code correctness, prefer execution-based rewards (run the tests) over a learned reward model wherever execution is feasible. If you can define your quality criterion precisely enough to write an evaluation rubric, you can build a reward model around it. If the quality criterion is hard to articulate, you likely need human preference data collection to make it legible before training can begin.
