LLM Watermarking: How It Works and What It Means for Production Systems

LLM watermarking embeds a statistical signal into generated text during the decoding process — a signal that is invisible to a human reader but detectable by a verifier with access to the watermarking key. As AI-generated text proliferates across the web and in high-stakes contexts like academic work, legal documents, and news, watermarking is becoming a practical tool for content provenance and regulatory compliance. This article covers how the main watermarking approaches work mechanically, what the known attack surfaces are, how to implement a basic green/red list watermark in PyTorch, and what the real limitations are for production use.

How Green/Red List Watermarking Works

The most widely studied LLM watermarking scheme, introduced by Kirchenbauer et al. (2023) at the University of Maryland, partitions the vocabulary at each decoding step into two sets: a “green list” (favoured tokens) and a “red list” (all remaining tokens). The partition is pseudorandom and is determined by a keyed hash of the preceding token. The model’s logits are then modified before sampling: tokens in the green list receive a positive score boost (delta), making them more likely to be sampled, while red-list tokens are left unchanged rather than forbidden. This process repeats at every token position. The resulting text contains a statistical fingerprint: across a sufficiently long passage, tokens from the green list will appear at a significantly higher rate than chance would predict.
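
To make the effect of the boost concrete, here is a minimal sketch using only the standard library. The four-token vocabulary, the logit values, and the choice of green tokens are all made up for illustration; only the "add delta to green logits, then softmax" step comes from the scheme described above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boost_green(logits, green_ids, delta):
    """Add delta to green-list logits; red-list logits are left unchanged."""
    return [x + delta if i in green_ids else x for i, x in enumerate(logits)]


# Hypothetical toy vocabulary of 4 tokens; tokens 1 and 3 are green at this step.
logits = [2.0, 1.0, 0.5, 0.0]
green_ids = {1, 3}

before = softmax(logits)
after = softmax(boost_green(logits, green_ids, delta=2.0))

green_mass_before = sum(before[i] for i in green_ids)
green_mass_after = sum(after[i] for i in green_ids)
print(f"green probability mass: {green_mass_before:.2f} -> {green_mass_after:.2f}")
```

In this toy case the probability of sampling a green token rises from roughly 0.29 to roughly 0.75, even though the single most likely token before the boost was red. That shift, repeated at every position, is what the detector later measures.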

Detection works by: (1) taking the text to be tested and tokenizing it, (2) reconstructing the green/red partition at each position using the same hash function and key, (3) counting how many tokens in the text fall in their position’s green list, and (4) computing a z-score comparing the observed green token count to the count expected under the null hypothesis (no watermark). A high z-score (typically above 4.0) indicates the text was watermarked. Detection does not require the original model or the original prompt, only the watermarking key, the same tokenizer, and the text to be tested.
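
Step (4) is a one-proportion z-test. The sketch below works through it with made-up counts (140 and 104 green tokens out of 200 scored tokens); `gamma` is the green list fraction, assumed here to be 0.5.

```python
import math


def watermark_z_score(green_count: int, n_tokens: int, gamma: float = 0.5) -> float:
    """z-score of the observed green count against the no-watermark null."""
    expected = gamma * n_tokens
    std = math.sqrt(n_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std


# Hypothetical counts for two 200-token passages.
z_marked = watermark_z_score(green_count=140, n_tokens=200)
z_plain = watermark_z_score(green_count=104, n_tokens=200)
print(f"z (70% green): {z_marked:.2f}")
print(f"z (52% green): {z_plain:.2f}")
```

The first passage scores about 5.66 and clears a threshold of 4.0; the second scores about 0.57 and does not.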

Implementing a Basic Watermark

import torch
import hashlib
from transformers import AutoTokenizer, AutoModelForCausalLM

class GreenRedWatermark:
    """
    Kirchenbauer et al. (2023) green/red list watermarking.
    
    At each decoding step, splits the vocabulary into a green list
    (50% of tokens, determined by hash of previous token) and a red list.
    Adds delta to green token logits before sampling.
    """
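    def __init__(self, tokenizer, delta: float = 2.0, seed: int = 42,
                 green_list_fraction: float = 0.5, vocab_size=None):
        self.tokenizer = tokenizer
        # The green mask must have the same length as the model's logits.
        # len(tokenizer) counts added special tokens, which tokenizer.vocab_size
        # omits; pass model.config.vocab_size explicitly if the model's
        # embedding matrix is padded beyond the tokenizer's vocabulary.
        self.vocab_size = vocab_size if vocab_size is not None else len(tokenizer)
        self.delta = delta
        self.seed = seed
        self.green_fraction = green_list_fraction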

    def _get_green_list(self, prev_token_id: int) -> torch.Tensor:
        """Deterministically generate green list from previous token."""
        # Hash previous token with secret seed to get a reproducible RNG state
        hash_input = f"{self.seed}:{prev_token_id}".encode()
        hash_value = int(hashlib.sha256(hash_input).hexdigest(), 16)
        rng = torch.Generator()
        rng.manual_seed(hash_value % (2**32))
        # Sample a random permutation and take the first green_fraction
        perm = torch.randperm(self.vocab_size, generator=rng)
        n_green = int(self.vocab_size * self.green_fraction)
        green_list = torch.zeros(self.vocab_size, dtype=torch.bool)
        green_list[perm[:n_green]] = True
        return green_list

    def apply_watermark_logits(self, logits: torch.Tensor,
                                prev_token_id: int) -> torch.Tensor:
        """Add delta to green list token logits."""
        green_list = self._get_green_list(prev_token_id).to(logits.device)
        watermarked = logits.clone()
        watermarked[green_list] += self.delta
        return watermarked

    def detect(self, token_ids: list[int], z_threshold: float = 4.0) -> dict:
        """Detect watermark in a sequence of token ids."""
        if len(token_ids) < 2:
            return {"detected": False, "z_score": 0.0, "green_count": 0}

        green_count = 0
        n_tokens = len(token_ids) - 1   # skip the first (no previous token)

        for i in range(1, len(token_ids)):
            prev_token = token_ids[i - 1]
            token = token_ids[i]
            green_list = self._get_green_list(prev_token)
            if green_list[token]:
                green_count += 1

        # Z-score: how many standard deviations above expected green rate?
        expected = n_tokens * self.green_fraction
        std = (n_tokens * self.green_fraction * (1 - self.green_fraction)) ** 0.5
        z_score = (green_count - expected) / std if std > 0 else 0.0

        return {
            "detected": z_score > z_threshold,
            "z_score": round(z_score, 3),
            "green_count": green_count,
            "total_tokens": n_tokens,
            "green_fraction_observed": round(green_count / n_tokens, 3),
        }

Integrating Watermarking into Generation

def generate_with_watermark(model, tokenizer, prompt: str,
                             watermark: GreenRedWatermark,
                             max_new_tokens: int = 200,
                             temperature: float = 1.0) -> str:
    """Sampling-based generation with the watermark applied to the logits at each step.

    For clarity this recomputes the full forward pass every step (no KV cache).
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_ids = inputs["input_ids"]
    generated = input_ids.clone()

    with torch.no_grad():
        for _ in range(max_new_tokens):
            outputs = model(generated)
            logits = outputs.logits[:, -1, :]   # (1, vocab_size)

            # Apply watermark: modify logits based on last generated token
            prev_token_id = generated[0, -1].item()
            logits = watermark.apply_watermark_logits(logits[0], prev_token_id).unsqueeze(0)

            # Sample next token
            if temperature != 1.0:
                logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            generated = torch.cat([generated, next_token], dim=-1)

            if next_token.item() == tokenizer.eos_token_id:
                break

    new_tokens = generated[0, input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Usage
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
watermark = GreenRedWatermark(tokenizer, delta=2.0, seed=12345)

text = generate_with_watermark(model, tokenizer,
                                "Explain gradient descent in one paragraph:",
                                watermark=watermark)
print(text)

# Detect watermark
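# Skip special tokens so a prepended BOS token is not scored as part of the text
token_ids = tokenizer.encode(text, add_special_tokens=False)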
result = watermark.detect(token_ids)
print(f"Watermark detected: {result['detected']} | z={result['z_score']}")

Attack Surfaces and Robustness Limitations

The most important thing to understand about LLM watermarking is that no current scheme is robust to a determined adversary with enough compute. The green/red list watermark is vulnerable to paraphrasing attacks: if a user copies the watermarked text and asks a second LLM to rewrite it, the paraphrased output will not preserve the green/red token pattern and will pass detection as non-watermarked. Kirchenbauer et al. showed that even simple synonym replacement, if applied at a high enough rate (roughly 40% of tokens), is sufficient to drop the z-score below the detection threshold for passages of a few hundred tokens.

Token substitution attacks work similarly: replacing individual words with semantically equivalent alternatives breaks the watermark faster than the text’s meaning is degraded. More sophisticated attacks include: copy-paste attacks (inserting unwatermarked text into a watermarked document to dilute the signal), translation attacks (translate to another language and back), and regeneration attacks (use the watermarked text as a prompt to generate a new response). None of these destroy the original content’s utility for a human, but all are effective at defeating statistical watermark detection.
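
The copy-paste (dilution) attack is easy to reason about numerically. The sketch below estimates the expected z-score when unwatermarked text is mixed into a watermarked passage. The 70% green rate for watermarked tokens is an illustrative assumption, not a measured value, and inserted tokens are assumed to be green at the chance rate `gamma`.

```python
import math


def expected_z_after_dilution(n_watermarked: int, n_inserted: int,
                              green_rate: float = 0.7, gamma: float = 0.5) -> float:
    """Expected z-score for watermarked tokens mixed with unwatermarked ones.

    Inserted tokens contribute no excess green tokens but still enlarge
    the denominator, so the z-score shrinks.
    """
    total = n_watermarked + n_inserted
    excess_green = (green_rate - gamma) * n_watermarked
    return excess_green / math.sqrt(total * gamma * (1 - gamma))


z_original = expected_z_after_dilution(200, 0)
z_diluted = expected_z_after_dilution(200, 600)
print(f"200 watermarked tokens alone:       z = {z_original:.2f}")
print(f"plus 600 unwatermarked tokens:      z = {z_diluted:.2f}")
```

Under these assumptions, padding a 200-token watermarked passage with 600 unwatermarked tokens halves the expected z-score (about 5.66 down to about 2.83), which is enough to fall below a threshold of 4.0. Scoring a sliding window rather than the whole document is one possible mitigation.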

Cryptographic watermarking approaches, where the watermark is derived from a pseudorandom function keyed to a secret known only to the provider, give stronger guarantees than the heuristic hash approach described above against forgery and against detection by anyone without the key. They tend, however, to be more brittle to text modification because the signal depends on exact token positions. Robust watermarking that survives paraphrasing without sacrificing generation quality is an active research area with no fully satisfying solution yet.

What Changes in Practice at the Token Level

A practical concern for engineers integrating watermarking is the effect on generation quality. Adding a positive delta to 50% of vocabulary tokens at every step effectively biases the model away from the red list tokens, some of which may be the most likely next tokens under the model’s actual distribution. The impact depends on delta: at delta=2.0, the watermark is detectable in ~200 tokens but causes measurable quality degradation for tasks that require precise token choice (code generation, mathematical notation, named entity generation). At delta=1.0, quality degradation is smaller but detection requires more tokens (~500–800). The right delta depends on your use case — for long-form prose, delta=2.0 is typically acceptable; for code or structured output, lower values or task-specific green list construction are needed.
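
The link between watermark strength and required text length follows from the z-score formula. The sketch below solves for the number of scored tokens at which the expected z-score reaches the threshold, given an average green hit rate. How a particular delta maps to a green hit rate depends on the model and on the entropy of the text, so the rates used here are hypothetical inputs rather than properties of delta=1.0 or delta=2.0.

```python
import math


def tokens_needed(green_rate: float, z_threshold: float = 4.0,
                  gamma: float = 0.5) -> int:
    """Tokens required for the expected z-score to reach z_threshold.

    Solves (green_rate - gamma) * n / sqrt(n * gamma * (1 - gamma)) = z_threshold.
    """
    if green_rate <= gamma:
        raise ValueError("green_rate must exceed gamma for detection to be possible")
    n = (z_threshold * math.sqrt(gamma * (1 - gamma)) / (green_rate - gamma)) ** 2
    # Subtract a tiny epsilon so float noise does not push an exact answer up by one.
    return math.ceil(n - 1e-9)


for rate in (0.55, 0.60, 0.70):  # hypothetical average green hit rates
    print(f"green rate {rate:.2f}: about {tokens_needed(rate)} tokens")
```

Because the required length scales with the inverse square of the excess green rate, halving the excess (0.70 to 0.60) quadruples the tokens needed (100 to 400). Note that reaching the threshold in expectation corresponds to roughly even odds of detection, so reliable detection needs more text than these figures.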

def evaluate_watermark_impact(model, tokenizer, watermark, prompts: list[str],
                               n_samples: int = 50) -> dict:
    """Compare detection rates on watermarked vs. non-watermarked generations."""
    watermarked_texts, plain_texts = [], []

    for prompt in prompts[:n_samples]:
        wm_text = generate_with_watermark(model, tokenizer, prompt, watermark)
        watermarked_texts.append(wm_text)
        # Plain generation without watermark for comparison
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=200, do_sample=True)
        # Keep only the continuation so both sets exclude the prompt
        new_tokens = output[0, inputs["input_ids"].shape[1]:]
        plain_texts.append(tokenizer.decode(new_tokens, skip_special_tokens=True))

    def detection_rate(texts: list[str]) -> float:
        """Fraction of texts whose z-score exceeds the default threshold."""
        if not texts:
            return 0.0
        hits = sum(
            watermark.detect(tokenizer.encode(t, add_special_tokens=False))["detected"]
            for t in texts
        )
        return hits / len(texts)

    return {
        "detection_rate": detection_rate(watermarked_texts),
        # Plain generations flagged as watermarked are false positives
        "false_positive_rate": detection_rate(plain_texts),
        "n_samples": len(watermarked_texts),
    }

Regulatory Context and Where Watermarking Fits

The EU AI Act includes provisions requiring that AI-generated content be detectable as such, which has pushed watermarking from a research curiosity into an engineering consideration for products deploying LLMs in the EU. Anthropic, Google, and OpenAI were among the companies that, in voluntary commitments made to the US government in 2023, agreed to develop mechanisms such as watermarking so that users can tell when content is AI-generated; those commitments focused mainly on audio and visual content. The practical implication for engineers is that watermarking is likely to become a production requirement for certain deployment contexts rather than optional research infrastructure, and understanding the tradeoffs now is worthwhile.

For production use, the current practical recommendation is to treat watermarking as one signal among several in a content provenance system, not as a standalone reliable detector. Combine it with metadata logging (which requests generated which outputs), model fingerprinting approaches, and usage policy enforcement. A watermark that is defeated by paraphrasing is still useful if paraphrasing is detectable through other means or if the threat model does not include sophisticated adversaries. The decision of which delta to use, which scheme to implement, and how to handle false positives (legitimate text that triggers detection) should be made with a clear model of who the actual adversaries are and what level of false positive rate is acceptable for the application.

Alternative Watermarking Approaches

The green/red list scheme is the most widely studied but not the only approach. Three other approaches are worth understanding for production decisions.

Semantic watermarking embeds the watermark in the semantic content of the text rather than at the token level, for example by preferring certain sentence structures, synonym choices, or paragraph organisations that are statistically detectable but semantically equivalent to alternatives. (The term “soft watermark” is best avoided for this family, because Kirchenbauer et al. use it for the delta-boosted green list scheme described earlier, in contrast to a “hard” variant that forbids red-list tokens outright.) This approach is more robust to token-level manipulation because paraphrasing may preserve the semantic patterns even when it changes every word. The tradeoff is that semantic watermarks are harder to implement without degrading generation quality and require more text for reliable detection.

Post-hoc watermarking applies a text modification to already-generated content rather than modifying the generation process. The simplest version is synonym substitution guided by a keyed pseudorandom function — at each eligible token position, the watermarking function decides whether to substitute the token with a synonym from a pre-defined equivalence class. This approach works with any existing LLM without modifying the inference stack, which makes it practical for API-based deployment where you do not control the model’s decoding process. The detection signal is weaker than logit-modification watermarks for the same text length, requiring longer passages for reliable detection.
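
A minimal sketch of the keyed synonym-substitution idea follows. The synonym pairs, the key, and the use of the previous word as context are all assumptions chosen for illustration; a real system would need a curated, context-aware synonym resource and a proper significance test on the match count.

```python
import hashlib
import hmac

# Hypothetical synonym pairs for illustration only.
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"),
                 ("begin", "start"), ("show", "display")]
PAIR_OF = {word: pair for pair in SYNONYM_PAIRS for word in pair}


def keyed_choice(key: bytes, prev_word: str, pair: tuple) -> str:
    """Pick one member of the pair from an HMAC over the previous word."""
    message = f"{prev_word}|{pair[0]}|{pair[1]}".encode("utf-8")
    bit = hmac.new(key, message, hashlib.sha256).digest()[0] & 1
    return pair[bit]


def embed(words: list, key: bytes) -> list:
    """Rewrite each eligible word to the synonym the keyed function selects."""
    out = []
    for word in words:
        pair = PAIR_OF.get(word.lower())
        if pair is None:
            out.append(word)
        else:
            prev_word = out[-1].lower() if out else ""
            out.append(keyed_choice(key, prev_word, pair))
    return out


def count_matches(words: list, key: bytes) -> tuple:
    """Return (matches, eligible): how many eligible words agree with the key."""
    matches = eligible = 0
    for i, word in enumerate(words):
        pair = PAIR_OF.get(word.lower())
        if pair is None:
            continue
        prev_word = words[i - 1].lower() if i > 0 else ""
        eligible += 1
        matches += int(keyed_choice(key, prev_word, pair) == word.lower())
    return matches, eligible


key = b"demo-key"  # hypothetical secret
original = "the quick dog will begin to show a big smile".split()
marked = embed(original, key)
matches, eligible = count_matches(marked, key)
print(" ".join(marked), f"({matches}/{eligible} eligible words match the key)")
```

Text that was never watermarked should match the keyed choice at about half of its eligible positions, so the match count can be scored with the same kind of z-test used for the green list scheme. Only eligible positions carry signal, which is why this family needs longer passages.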

Distortion-free watermarking (Christ et al., 2023) is a theoretically motivated approach that modifies the sampling process to embed a watermark without distorting the output distribution, meaning watermarked and non-watermarked text are statistically indistinguishable in terms of token probabilities to anyone without the key, while still being detectable by a verifier with the key. This is achieved by using a keyed pseudorandom function to supply the randomness consumed by the sampling step rather than modifying logits. The advantage is that the scheme introduces no quality degradation by construction; the disadvantage is that detection requires more computation and the scheme is less robust to text modification than logit-modification approaches.
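
To illustrate how keyed randomness can drive sampling without changing token probabilities, the sketch below uses the exponential-race (Gumbel-style) selection rule commonly attributed to Aaronson's watermarking proposal. It is a simpler stand-in for the general idea and is not the construction from Christ et al.; the key, the context encoding, and the scoring statistic are assumptions for illustration.

```python
import hashlib
import hmac
import math


def keyed_uniform(key: bytes, context: tuple, token_id: int) -> float:
    """Pseudorandom float strictly inside (0, 1), derived from key and context."""
    message = repr((context, token_id)).encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return ((int.from_bytes(digest[:8], "big") >> 12) + 0.5) / 2**52


def select_token(probs: list, key: bytes, context: tuple) -> int:
    """Choose argmax of log(u_i) / p_i.

    If the u_i were truly uniform, token i would be chosen with probability
    p_i, so averaged over contexts the model's distribution is preserved.
    """
    best_id, best_score = -1, -math.inf
    for token_id, p in enumerate(probs):
        if p <= 0:
            continue
        score = math.log(keyed_uniform(key, context, token_id)) / p
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id


def detection_score(token_ids: list, contexts: list, key: bytes) -> float:
    """Mean of -log(1 - u) for the observed tokens; about 1.0 without a watermark."""
    total = sum(-math.log(1 - keyed_uniform(key, ctx, tok))
                for tok, ctx in zip(token_ids, contexts))
    return total / len(token_ids)


key = b"demo-key"                       # hypothetical secret
probs = [0.6, 0.3, 0.1]                 # hypothetical next-token distribution
contexts = [(i,) for i in range(500)]   # stand-in for preceding-token contexts
chosen = [select_token(probs, key, ctx) for ctx in contexts]
print("watermarked score:", round(detection_score(chosen, contexts, key), 2))
```

The verifier recomputes the same pseudorandom values and checks whether the observed tokens tend to have unusually large ones. Because the signal is tied to the exact context used at generation time, edits that shift or replace tokens remove it quickly.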

False Positive Rates and Human Text

A critical operational parameter for any watermarking deployment is the false positive rate: the probability that non-watermarked text (human-written or generated by a different model without watermarking) triggers detection. The z-score threshold of 4.0 corresponds to a false positive rate of roughly 3 in 100,000 under the assumption that non-watermarked text has no green/red list bias and that tokens are scored independently. In practice, reliability depends heavily on text length and domain. The z-score already normalises for length, so under that assumption the nominal false positive rate does not rise for short texts; what suffers is detection power. For very short texts (under 100 tokens) a genuine watermark rarely accumulates enough evidence to clear the threshold, and the normal approximation behind the z-score is less accurate, while long texts (500+ tokens) achieve reliable discrimination.
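
The relationship between the z-threshold and the nominal false positive rate can be checked numerically. The sketch below computes the one-sided normal tail and, for comparison, the exact binomial tail for a given text length; both assume each scored token is independently green with probability `gamma`, which real text only approximates.

```python
import math


def normal_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def binomial_tail(n: int, k: int, gamma: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, gamma)."""
    return sum(math.comb(n, i) * gamma**i * (1 - gamma)**(n - i)
               for i in range(k, n + 1))


def exact_fpr(n_tokens: int, z_threshold: float = 4.0, gamma: float = 0.5) -> float:
    """Exact false positive rate of the rule 'z-score > z_threshold'."""
    cutoff = gamma * n_tokens + z_threshold * math.sqrt(n_tokens * gamma * (1 - gamma))
    k = math.floor(cutoff) + 1  # smallest green count strictly above the cutoff
    return binomial_tail(n_tokens, k, gamma)


print(f"normal approximation at z=4.0: {normal_tail(4.0):.2e}")
for n in (50, 200, 1000):
    print(f"exact rate for {n} tokens:      {exact_fpr(n):.2e}")
```

Comparing the two columns for your typical text lengths shows how far the normal approximation can be trusted before a threshold is chosen.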

Domain effects matter too. Technical text, particularly code and mathematical notation, has a more constrained token distribution than general prose — certain symbols and keywords appear at high frequency regardless of watermarking, which can coincidentally align with the green list and inflate z-scores. For applications that watermark code-heavy outputs, calibrating the false positive rate requires testing on a large corpus of human-written code, not just general text. Setting the detection threshold based on a desired false positive rate on your specific content domain is more principled than using the paper’s default threshold.

def calibrate_threshold(watermark: GreenRedWatermark, tokenizer,
                         human_texts: list[str],
                         target_fpr: float = 0.001) -> float:
    """Find z-score threshold that achieves target false positive rate on human text."""
    z_scores = []
    for text in human_texts:
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        result = watermark.detect(token_ids, z_threshold=0)  # compute z, don't threshold
        z_scores.append(result["z_score"])

    if not z_scores:
        raise ValueError("human_texts must not be empty")

    # Sort ascending so the index below lands on the (1 - target_fpr) quantile:
    # at most a target_fpr fraction of human texts then score above the threshold.
    # The estimate is only meaningful with on the order of 1 / target_fpr texts.
    z_scores.sort()
    idx = int(len(z_scores) * (1 - target_fpr))
    threshold = z_scores[min(idx, len(z_scores) - 1)]
    print(f"Threshold for {target_fpr*100:.1f}% FPR: {threshold:.3f}")
    return threshold

Practical Recommendation for Production

For ML engineers building systems that need content provenance today, the most pragmatic architecture is a layered one. Use metadata logging as the primary mechanism — every generated response is stored with its request ID, timestamp, model version, and prompt hash in an append-only log. This gives you provenance for content you generated without any quality tradeoff and is not vulnerable to the attacks that defeat statistical watermarking. Add statistical watermarking as a secondary signal for detecting re-distribution of your content in contexts where you cannot rely on metadata — specifically, for detecting whether content scraped from the web was originally generated by your system. Set the detection threshold conservatively to control false positives, use longer texts for detection decisions, and treat watermark detection as a probabilistic signal that requires human review rather than an automated enforcement mechanism. This combination gives you most of the practical value of watermarking while being honest about what current watermarking technology can and cannot reliably achieve.
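
As a rough sketch of the metadata-logging layer, the code below writes one JSON line per generated response. The field names, the JSON Lines file, and the inclusion of a response hash are assumptions for illustration; in production the append-only guarantee would come from the storage layer rather than from file mode `"a"`.

```python
import hashlib
import json
import time
import uuid


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_provenance_record(prompt: str, response: str, model_version: str,
                           request_id=None, timestamp=None) -> dict:
    """Build a log record that stores hashes rather than raw text."""
    return {
        "request_id": request_id or str(uuid.uuid4()),
        "timestamp": timestamp if timestamp is not None else time.time(),
        "model_version": model_version,
        "prompt_sha256": sha256_hex(prompt),
        # Assumption: hashing the response allows exact-match lookups later.
        "response_sha256": sha256_hex(response),
    }


def append_record(path: str, record: dict) -> None:
    """Append one JSON line to the log file."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, sort_keys=True) + "\n")


def was_generated_by_us(path: str, candidate_text: str) -> bool:
    """Exact-match provenance check against logged response hashes."""
    target = sha256_hex(candidate_text)
    with open(path, encoding="utf-8") as log_file:
        return any(json.loads(line)["response_sha256"] == target
                   for line in log_file if line.strip())
```

An exact hash match only identifies unmodified copies, which is where the statistical watermark complements it: the watermark can still give a probabilistic signal after light edits that change the hash.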

Open Questions and Research Directions

Several open problems in LLM watermarking are directly relevant to production engineers tracking this space. Multi-model detection — being able to determine not just that text was AI-generated but which model generated it — remains unsolved for the general case, though provider-specific watermarks can trivially solve the simpler question of “did our system generate this.” Adaptive adversaries who have access to the watermarking scheme (but not the key) and can query the detection API repeatedly to refine their attacks represent a realistic threat model for high-value content that current schemes do not adequately address. And watermarking for structured outputs like code and JSON, where the token distribution is fundamentally constrained by syntax, requires different approaches than the general text case because the green/red partition cannot freely bias a constrained vocabulary without introducing syntax errors. These are the problems most likely to produce practically important results in the near term, and they are worth monitoring if watermarking becomes a compliance requirement in your deployment context. The engineering work required to prepare for these requirements now — logging infrastructure, detection pipelines, threshold calibration — is well-defined and worth building before it becomes urgent.