How to Use Multi-Armed Bandits for A/B Testing and Online Decision Making

Traditional A/B testing treats exploration and exploitation as sequential phases: run the experiment for a fixed duration, declare a winner, then exploit that winner permanently. Multi-armed bandits collapse these phases into a single continuous process — the algorithm simultaneously explores alternatives and exploits what it has learned so far, dynamically shifting traffic toward better-performing options as evidence accumulates. For product and ML teams that run frequent experiments, bandits reduce the cost of exploration (fewer users exposed to inferior variants) while reaching conclusions faster than fixed-horizon A/B tests.

The name comes from the slot machine analogy: a gambler facing multiple slot machines (“one-armed bandits”) with unknown payout rates must decide how to allocate pulls to maximise total reward. Each pull provides a noisy signal about that arm’s true payout rate. The exploration-exploitation tradeoff is the core challenge: pulling the current best arm exclusively exploits existing knowledge but may miss a better arm; pulling all arms equally explores thoroughly but wastes pulls on clearly inferior options.

Epsilon-Greedy

Epsilon-greedy is the simplest bandit algorithm and a useful baseline. With probability epsilon, it selects a random arm (exploration); with probability 1-epsilon, it selects the arm with the highest observed mean reward (exploitation). The algorithm is easy to implement, easy to reason about, and works reasonably well in practice — its main weakness is that it explores uniformly at random rather than directing exploration toward arms with high uncertainty, which wastes exploration budget on arms that are clearly inferior.

import numpy as np
from dataclasses import dataclass, field

@dataclass
class ArmStats:
    pulls: int = 0
    total_reward: float = 0.0

    @property
    def mean_reward(self):
        return self.total_reward / self.pulls if self.pulls > 0 else 0.0

class EpsilonGreedy:
    def __init__(self, n_arms: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.arms = [ArmStats() for _ in range(n_arms)]

    def select_arm(self) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.arms))  # explore
        return max(range(len(self.arms)), key=lambda i: self.arms[i].mean_reward)  # exploit

    def update(self, arm: int, reward: float):
        self.arms[arm].pulls += 1
        self.arms[arm].total_reward += reward

# Usage
bandit = EpsilonGreedy(n_arms=3, epsilon=0.1)
for _ in range(1000):
    arm = bandit.select_arm()
    reward = simulate_reward(arm)  # 0 or 1 for click/no-click
    bandit.update(arm, reward)

Epsilon decay — reducing epsilon over time as confidence in arm estimates grows — improves on fixed epsilon by reducing exploration as the algorithm converges. A common schedule is epsilon = 1/sqrt(t) where t is the total number of pulls. This ensures exploration eventually reaches zero while maintaining sufficient early exploration to identify the best arm.

Thompson Sampling

Thompson Sampling is the most widely used bandit algorithm in production ML systems and consistently outperforms epsilon-greedy in empirical comparisons. For binary reward problems (click/no-click, convert/no-convert), it models each arm's true conversion rate as a Beta distribution and selects arms by sampling from those distributions — arms with higher uncertainty get selected more often because their sampled values are more likely to exceed the current best arm's sample. This naturally directs exploration toward uncertain arms rather than exploring uniformly.

class ThompsonSampling:
    def __init__(self, n_arms: int):
        # Beta(alpha, beta): alpha = successes + 1, beta = failures + 1
        self.alpha = np.ones(n_arms)  # prior: 1 success
        self.beta = np.ones(n_arms)   # prior: 1 failure

    def select_arm(self) -> int:
        # Sample from each arm's posterior Beta distribution
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: float):
        # reward should be 0 or 1 for binary outcomes
        self.alpha[arm] += reward
        self.beta[arm] += (1 - reward)

    def get_estimates(self):
        means = self.alpha / (self.alpha + self.beta)
        # 95% credible intervals
        from scipy.stats import beta
        lower = beta.ppf(0.025, self.alpha, self.beta)
        upper = beta.ppf(0.975, self.alpha, self.beta)
        return means, lower, upper

# Check current estimates
ts = ThompsonSampling(n_arms=3)
# ... after many updates ...
means, lower, upper = ts.get_estimates()
for i, (m, l, u) in enumerate(zip(means, lower, upper)):
    print(f'Arm {i}: {m:.3f} [{l:.3f}, {u:.3f}]')

Thompson Sampling's Bayesian nature gives you credible intervals on arm estimates for free — you always know not just which arm is winning but how confident you are. This is valuable for deciding when to stop the experiment: once the credible intervals of all arms are sufficiently narrow and one arm's lower bound exceeds all other arms' upper bounds, you have strong evidence for a winner and can stop exploration.

UCB1: Upper Confidence Bound

UCB1 takes a frequentist approach: for each arm, it computes an upper confidence bound on the true mean reward and selects the arm with the highest UCB. Arms with few pulls have wide confidence bounds and get selected for exploration; arms with many pulls have narrow bounds and are selected only when their mean reward genuinely makes them the best option. UCB1 is deterministic given the same history — unlike Thompson Sampling, which involves sampling — making it easier to debug and reproduce:

import math

class UCB1:
    def __init__(self, n_arms: int):
        self.arms = [ArmStats() for _ in range(n_arms)]
        self.total_pulls = 0

    def select_arm(self) -> int:
        n_arms = len(self.arms)
        # Pull each arm once before applying UCB formula
        for i, arm in enumerate(self.arms):
            if arm.pulls == 0:
                return i
        ucb_values = [
            arm.mean_reward + math.sqrt(2 * math.log(self.total_pulls) / arm.pulls)
            for arm in self.arms
        ]
        return int(np.argmax(ucb_values))

    def update(self, arm: int, reward: float):
        self.arms[arm].pulls += 1
        self.arms[arm].total_reward += reward
        self.total_pulls += 1

UCB1 has strong theoretical guarantees — its cumulative regret (total reward lost compared to always playing the optimal arm) grows as O(log T), which is optimal for the stochastic bandit setting. In practice, Thompson Sampling often achieves lower regret empirically despite similar theoretical bounds, because the Bayesian prior provides better calibration in the early pulls when data is sparse.

Contextual Bandits

Standard bandits choose between arms with no information about the current context. Contextual bandits extend this to settings where each decision arrives with features — user demographics, page context, time of day — and the optimal arm may differ by context. This is the setting for personalised recommendations, dynamic pricing, and content ranking: the best headline for a 25-year-old mobile user may differ from the best headline for a 55-year-old desktop user, and a contextual bandit learns these per-context preferences simultaneously.

LinUCB is the standard linear contextual bandit algorithm. It fits a linear model mapping context features to expected reward for each arm, with UCB-style exploration bonuses based on prediction uncertainty. For non-linear reward functions, neural contextual bandits replace the linear model with a neural network while preserving the UCB exploration structure:

import numpy as np

class LinUCB:
    def __init__(self, n_arms: int, n_features: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(n_features) for _ in range(n_arms)]       # feature covariance
        self.b = [np.zeros(n_features) for _ in range(n_arms)]     # reward accumulator

    def select_arm(self, context: np.ndarray) -> int:
        ucb_scores = []
        for i in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[i])
            theta = A_inv @ self.b[i]                               # estimated coefficients
            uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
            ucb_scores.append(theta @ context + uncertainty)
        return int(np.argmax(ucb_scores))

    def update(self, arm: int, context: np.ndarray, reward: float):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context

# Usage
linucb = LinUCB(n_arms=3, n_features=10, alpha=0.5)
context = extract_user_features(user)  # shape: (10,)
arm = linucb.select_arm(context)
reward = get_reward(arm, user)
linucb.update(arm, context, reward)

Bandits vs A/B Testing: When to Use Each

Bandits and A/B tests answer different questions and suit different situations. A/B tests are the right tool when statistical rigour is paramount — when you need a p-value or confidence interval you can defend to stakeholders, when the decision has long-term consequences that justify waiting for a clean result, or when the experiment will run for a fixed duration regardless. The fixed-horizon design of A/B tests has well-understood statistical properties and is less susceptible to peeking problems (inflated false positive rates from stopping early when results look good).

Bandits are the right tool when minimising the cost of exploration matters more than statistical elegance — high-traffic settings where assigning users to inferior variants carries real business cost, short-lived experiments where the winner will be deployed quickly regardless, and continuous optimisation settings where there is no natural stopping point. The tradeoff is that bandit algorithms are harder to analyse post-hoc for statistical significance, and their adaptive nature makes it harder to compute valid confidence intervals on the final estimates (the traffic allocation is correlated with the reward history, which violates standard frequentist assumptions).

A practical rule: use A/B tests for major product decisions where you need clean statistical inference and can afford the fixed sample cost; use bandits for high-volume optimisation tasks (ad selection, recommendation ranking, notification timing) where continuous online learning and low regret matter more than post-hoc significance testing. For LLM prompt optimisation and model variant selection in production, Thompson Sampling is often the most practical choice — it is straightforward to implement, converges quickly, and provides Bayesian credible intervals that are more interpretable than p-values for operational decisions.

Implementing Bandits in Production

Production bandit systems need persistent state, atomic updates, and monitoring. The arm statistics (alpha/beta for Thompson Sampling, or pulls/rewards for UCB1) must be stored in a shared data store so that arm selections and reward updates from multiple serving instances are consistent. Redis is the standard choice: atomic increment operations prevent race conditions when multiple requests update the same arm simultaneously, and reads are fast enough to not add meaningful latency to the selection step. Log every arm selection and reward with a shared experiment ID and timestamp — this enables post-hoc analysis, debugging, and the ability to replay the experiment history if you need to audit or reproduce the bandit's decisions.

Delayed and Partial Rewards

Many production reward signals are delayed or partial, which complicates bandit updates. A click happens immediately; a conversion may happen hours later; a subscription upgrade may happen days after the initial interaction. Naive bandit implementations that only update on immediate reward ignore the delayed signal entirely, which produces arm estimates that undervalue options with high delayed conversion rates relative to options with high immediate click rates. The standard fix is to use a reward model: log the arm selection with a request ID, and when the delayed reward signal arrives (conversion event, upgrade event), look up the request ID and update the arm statistics retroactectively.

Partial rewards — where the reward is a continuous value between 0 and 1 rather than binary — work naturally with Thompson Sampling using a Normal or Beta approximation depending on the reward distribution. For continuous rewards bounded in [0, 1], the Beta-Binomial model extends by treating fractional rewards as fractional successes: update alpha by the reward value and beta by (1 - reward). For unbounded continuous rewards, use a Normal-Normal conjugate model instead, maintaining a running mean and variance estimate for each arm and sampling from the posterior Normal distribution at selection time.

Non-Stationary Environments

Standard bandit algorithms assume arm reward rates are stationary — that the true conversion rate of arm A today is the same as it was last week. In practice, many production environments are non-stationary: user preferences shift, seasonality affects conversion rates, and product changes alter the reward landscape. A Thompson Sampling instance that has accumulated thousands of pulls will be slow to adapt to a sudden shift in arm performance because the accumulated prior overwhelms the new signal.

The standard solution is a sliding window or discounted update. In a sliding window bandit, only the last N observations per arm contribute to the arm's statistics — older observations are discarded as new ones arrive. In a discounted bandit, each existing observation is multiplied by a discount factor gamma (typically 0.99–0.999) at each new pull, giving recent observations exponentially more weight. Both approaches trade statistical efficiency (less data used per estimate) for adaptability (faster response to distribution shifts). Choose the window size or discount factor based on the expected timescale of environmental change — if you expect reward rates to shift meaningfully over days, a window of several thousand pulls is appropriate; if shifts happen over weeks, a larger window or smaller discount is better.

Choosing Between Thompson Sampling, UCB1, and Epsilon-Greedy

In practice, Thompson Sampling is the default choice for most production bandit problems: it consistently achieves the lowest empirical regret across a wide range of settings, its Bayesian credible intervals are directly useful for monitoring and stopping decisions, and the implementation complexity is low. UCB1 is preferable when you need deterministic, reproducible arm selections — for example, in audited systems where you need to explain exactly why a particular arm was selected at a given time. Epsilon-greedy is worth using only as a debugging baseline or in extremely simple settings where you want to verify the bandit plumbing before introducing more complex selection logic. The epsilon parameter requires manual tuning and the uniform exploration is almost always suboptimal relative to Thompson Sampling's uncertainty-directed exploration, so there is rarely a production reason to prefer it over Thompson Sampling once the implementation is verified.

In all three cases, the most important operational decision is not the algorithm but the reward definition: a bandit optimising for immediate clicks will happily sacrifice long-term retention to maximise short-term engagement. Define the reward signal to match the business outcome you actually care about — even if that requires a delayed reward pipeline — before committing to any exploration strategy, since the algorithm can only optimise what it is given to measure. Getting the reward definition right is more impactful than any choice between Thompson Sampling, UCB1, or epsilon-greedy. Start there, then pick your algorithm.

Leave a Comment