How to Compare Two Documents with a Local LLM

Comparing two documents — contracts, drafts, reports, research papers — is a task where local LLMs add genuine value. You can ask the model to highlight what changed between versions, identify conflicting clauses, summarise the key differences, or flag sections that need review. All of this stays local, which matters when the documents contain sensitive business, legal, or personal content.

The Basic Comparison Pattern

import ollama

def compare_documents(doc_a: str, doc_b: str, focus: str = '') -> str:
    """
    Compare two documents and return a structured analysis.
    focus: optional specific aspect to compare (e.g. 'payment terms', 'liability')
    """
    focus_instruction = f'Focus especially on: {focus}.' if focus else ''
    prompt = f"""Compare these two documents and identify:
1. Key differences (what changed, was added, or was removed)
2. Areas of agreement or similarity
3. Anything that seems contradictory or inconsistent between the two
{focus_instruction}

Document A:
{doc_a}

Document B:
{doc_b}

Provide a clear, structured comparison:"""

    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}],
        options={'temperature': 0.2, 'num_ctx': 16384}
    )
    return response['message']['content']

# Example usage
doc_a = open('contract_v1.txt').read()
doc_b = open('contract_v2.txt').read()
print(compare_documents(doc_a, doc_b, focus='payment terms and liability'))

Structured Diff Output with Pydantic

from pydantic import BaseModel
import ollama

class DocumentDiff(BaseModel):
    additions: list[str]       # things in B but not A
    removals: list[str]        # things in A but not B
    modifications: list[str]   # things that changed between A and B
    conflicts: list[str]       # contradictions or inconsistencies
    summary: str               # one-paragraph overall assessment

def structured_compare(doc_a: str, doc_b: str, model: str = 'llama3.2') -> DocumentDiff:
    response = ollama.chat(
        model=model,
        messages=[{
            'role': 'user',
            'content': f'Compare these two documents and identify all significant differences.\n\nDocument A:\n{doc_a[:6000]}\n\nDocument B:\n{doc_b[:6000]}'
        }],
        format=DocumentDiff.model_json_schema(),
        options={'temperature': 0.1, 'num_ctx': 8192}  # enough context for both 6,000-character excerpts
    )
    return DocumentDiff.model_validate_json(response['message']['content'])
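As a quick usage sketch (reusing the contract files from the earlier example; the field names come from the DocumentDiff model above):

result = structured_compare(open('contract_v1.txt').read(), open('contract_v2.txt').read())
for item in result.additions:
    print('ADDED:', item)
for item in result.removals:
    print('REMOVED:', item)
print('Summary:', result.summary)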

Handling Long Documents with Chunked Comparison

def chunked_compare(doc_a: str, doc_b: str, chunk_size: int = 4000) -> str:
    """
    Compare long documents by splitting them into sections.
    Works for documents too long to fit in a single context window.
    chunk_size: number of words per section.
    """
    words_a = doc_a.split()
    words_b = doc_b.split()

    # Split both documents into matching chunks
    chunks_a = [' '.join(words_a[i:i+chunk_size]) for i in range(0, len(words_a), chunk_size)]
    chunks_b = [' '.join(words_b[i:i+chunk_size]) for i in range(0, len(words_b), chunk_size)]

    # Pad to same number of chunks
    max_chunks = max(len(chunks_a), len(chunks_b))
    chunks_a.extend(['[Section not present]'] * (max_chunks - len(chunks_a)))
    chunks_b.extend(['[Section not present]'] * (max_chunks - len(chunks_b)))

    section_diffs = []
    for i, (ca, cb) in enumerate(zip(chunks_a, chunks_b)):
        if ca.strip() == cb.strip():
            continue  # Skip identical sections
        r = ollama.chat(
            model='llama3.2',
            messages=[{'role': 'user', 'content': f'Section {i+1} comparison:\n\nVersion A:\n{ca}\n\nVersion B:\n{cb}\n\nWhat changed?'}],
            options={'temperature': 0.1, 'num_ctx': 16384}  # both chunks must fit in the context window
        )
        section_diffs.append(f'Section {i+1}: {r["message"]["content"]}')

    if not section_diffs:
        return 'No significant differences found between the documents.'

    # Final synthesis
    synthesis = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Synthesise these section-by-section differences into an overall summary:\n\n' + '\n\n'.join(section_diffs)}],
        options={'temperature': 0.3, 'num_ctx': 16384}
    )
    return synthesis['message']['content']

# Example usage (long_doc_a and long_doc_b loaded elsewhere)
print(chunked_compare(long_doc_a, long_doc_b))

Question-Answering Across Two Documents

def ask_across_docs(doc_a: str, doc_b: str, question: str) -> dict:
    """Ask a question that requires consulting both documents."""
    response = ollama.chat(
        model='llama3.2',
        messages=[{
            'role': 'user',
            'content': f"""{question}

Document A:
{doc_a[:5000]}

Document B:
{doc_b[:5000]}

Answer based on both documents, citing which document supports each point."""
        }],
        options={'temperature': 0.2, 'num_ctx': 8192}  # enough context for both 5,000-character excerpts
    )
    return {'question': question, 'answer': response['message']['content']}

# Examples:
print(ask_across_docs(contract_v1, contract_v2,
    'What are the payment terms in each version and how do they differ?'))
print(ask_across_docs(proposal_a, proposal_b,
    'Which proposal offers the better value for money and why?'))

Use Cases Where This Shines

Document comparison with a local LLM is particularly valuable for contract review: comparing a revised vendor contract against the previous version to catch unfavourable term changes before signing. Legal documents often change in subtle ways (a deadline shifts from 30 to 14 days, an indemnity clause expands) that are easy to miss in a dense 20-page contract but that a focused comparison prompt can surface for review. The local inference piece is essential here: sending legal contracts to a cloud API creates data handling concerns that many organisations cannot accept.

Research and academic use cases include comparing different versions of a research paper, identifying what changed between a preprint and published version, or comparing two papers on the same topic to identify where they agree and disagree. Policy and compliance teams compare regulatory documents across versions to track what changed in new guidance. Software teams compare API specifications or RFC documents across versions. The pattern is identical across all these cases — the prompt structure and chunking logic from this article applies with minimal modification.

Why Local Inference for Document Comparison

Document comparison tasks are among the most privacy-sensitive AI use cases. When you compare two versions of a vendor contract, you are exposing confidential pricing, terms, and legal obligations to whatever service processes the text. When comparing research paper drafts, you may be working with unpublished findings under embargo. When comparing HR policies or employment contracts, you are handling sensitive personnel information. For all of these cases, sending documents to a cloud API creates data handling risks — even if the API provider has strong privacy policies, the data leaves your control the moment it crosses your network boundary.

Local inference with Ollama eliminates this risk entirely. The document text never leaves your machine. The model processes everything locally, and the comparison output stays local too. For organisations with strict data governance requirements — financial services, healthcare, legal, and government — local document comparison is not a preference but a requirement. The Python code in this article runs entirely on your own hardware and can be integrated into document review workflows without any cloud API configuration or data processing agreements.

Choosing the Right Model for Comparison Tasks

Document comparison benefits from a model with good instruction following and the ability to maintain attention across long context. For short documents (under 5,000 words combined), any 7–8B model works well: Llama 3.1 8B, Qwen2.5 7B, and Mistral 7B all handle focused comparison prompts reliably. For long documents requiring 16K+ context, consider Mistral Nemo 12B with a 32K Modelfile configuration; it maintains better coherence across very long prompts than 7B models. For the most critical comparisons (high-stakes contracts, important research), running the same comparison twice with different models and checking the outputs for consistency is a quality control step that takes minutes but catches cases where one model missed something the other caught.
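A minimal sketch of that Modelfile configuration (assuming the mistral-nemo model has already been pulled; the 32768 value and the mistral-nemo-32k name are illustrative):

# Modelfile: raise the default context window for long comparisons
FROM mistral-nemo
PARAMETER num_ctx 32768

Build it with ollama create mistral-nemo-32k -f Modelfile and pass model='mistral-nemo-32k' to the comparison functions, or simply set 'num_ctx' in the options dict per request as the earlier code does.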

Set temperature to 0.1–0.2 for comparison tasks. Higher temperatures introduce randomness that can cause the model to miss differences it would have caught deterministically, or to hallucinate differences that do not exist. Document comparison is a precision task where consistency matters more than creativity — use the settings that maximise consistency.

Comparing Across Multiple Documents

def compare_many(documents: dict[str, str], question: str) -> str:
    """
    Compare multiple documents against a specific question.
    documents: {'label': 'content'} mapping
    """
    sections = '\n\n'.join(
        f'Document {label}:\n{content[:3000]}'
        for label, content in documents.items()
    )
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role':'user','content':
            f'{question}\n\nAnswer for each document, then compare across all:\n\n{sections}'
        }],
        options={'temperature':0.2,'num_ctx':16384}
    )
    return response['message']['content']

# Example: compare vendor proposals
proposals = {
    'Vendor A': open('proposal_vendor_a.txt').read(),
    'Vendor B': open('proposal_vendor_b.txt').read(),
    'Vendor C': open('proposal_vendor_c.txt').read(),
}
print(compare_many(proposals,
    'What are the delivery timeline and pricing terms for each vendor?'))

Building a Document Review CLI

#!/usr/bin/env python3
# compare_docs.py — python compare_docs.py doc_a.txt doc_b.txt
import argparse
import ollama
# compare_documents and structured_compare are the functions defined earlier in this
# article; define them above or import them before running the script

parser = argparse.ArgumentParser(description='Compare two documents with a local LLM')
parser.add_argument('doc_a', help='First document')
parser.add_argument('doc_b', help='Second document')
parser.add_argument('--focus', default='', help='Specific aspect to focus on')
parser.add_argument('--model', default='llama3.2')
parser.add_argument('--structured', action='store_true', help='Output structured diff')
args = parser.parse_args()

doc_a = open(args.doc_a).read()
doc_b = open(args.doc_b).read()

if args.structured:
    # Use the Pydantic structured output approach
    result = structured_compare(doc_a, doc_b, args.model)
    print(f'Additions: {result.additions}')
    print(f'Removals: {result.removals}')
    print(f'Changes: {result.modifications}')
    print(f'Summary: {result.summary}')
else:
    result = compare_documents(doc_a, doc_b, args.focus)
    print(result)

Getting Started

The basic compare_documents function from this article works immediately with any running Ollama instance and model. For most comparison tasks, start with the plain text output approach — it is more readable than structured JSON and easier to scan for the differences you care about. Add the Pydantic structured output approach when you need to process the comparison programmatically (feeding it into a database, a reporting tool, or a downstream automation). The chunked comparison handles documents too long for a single context window and scales to very long documents with manageable latency. Start simple, evaluate quality on your actual documents, and add complexity only where the basic approach falls short.

Pre-Processing Documents for Better Comparison

Raw document text often contains formatting noise — headers, footers, page numbers, table formatting — that distracts from the substantive content differences. Simple pre-processing improves comparison quality:

import re

def clean_document(text: str) -> str:
    """Basic cleanup for comparison — remove noise, normalise whitespace."""
    # Remove common footer/header patterns
    text = re.sub(r'Page \d+ of \d+', '', text)
    # Remove stamp lines (only where they appear on a line of their own)
    text = re.sub(r'^\s*(CONFIDENTIAL|DRAFT|PRIVILEGED)\s*$', '', text,
                  flags=re.IGNORECASE | re.MULTILINE)
    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Collapse runs of spaces and tabs within lines, keeping paragraph breaks
    text = re.sub(r'[ \t]{2,}', ' ', text)
    return text.strip()

doc_a_clean = clean_document(raw_doc_a)
doc_b_clean = clean_document(raw_doc_b)
result = compare_documents(doc_a_clean, doc_b_clean)

For PDF documents, use pdfplumber or PyMuPDF to extract text before passing to the comparison function. For Word documents, python-docx extracts text cleanly. The comparison function itself is format-agnostic — it only needs plain text input, so any text extraction library works as a pre-processing step.
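A minimal extraction sketch using pdfplumber and python-docx (both are separate third-party packages; the file names are illustrative, and clean_document and compare_documents come from earlier in this article):

import pdfplumber
from docx import Document

def extract_pdf_text(path: str) -> str:
    # Concatenate the extracted text of every page; pages with no extractable text yield ''
    with pdfplumber.open(path) as pdf:
        return '\n'.join(page.extract_text() or '' for page in pdf.pages)

def extract_docx_text(path: str) -> str:
    # Join paragraph text; tables, headers, and footers are not included
    return '\n'.join(p.text for p in Document(path).paragraphs)

doc_a = clean_document(extract_pdf_text('contract_v1.pdf'))
doc_b = clean_document(extract_pdf_text('contract_v2.pdf'))
print(compare_documents(doc_a, doc_b))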

Comparison Quality and Validation

LLM-based document comparison can miss subtle differences, particularly in numerical values (a contract clause changing “30 days” to “14 days” may be missed if the model focuses on the surrounding context) and in formatting-dependent content (tables, structured data). For high-stakes comparisons, combine LLM analysis with a simple text diff tool — Python’s difflib provides character-level and line-level diffs that catch every literal change, while the LLM provides semantic understanding of which changes are significant and what they mean. The combination is more reliable than either alone:

import difflib

def hybrid_compare(doc_a: str, doc_b: str) -> dict:
    # Line-level diff catches every literal change
    differ = difflib.Differ()
    diff_lines = list(differ.compare(doc_a.splitlines(), doc_b.splitlines()))
    changed_lines = [l for l in diff_lines if l.startswith('+ ') or l.startswith('- ')]
    literal_changes = '\n'.join(changed_lines[:50])  # first 50 changed lines

    # LLM provides the semantic reading of what the changes mean
    semantic = compare_documents(doc_a, doc_b)

    return {
        'literal_changes': literal_changes,
        'semantic_analysis': semantic,
        'change_count': len(changed_lines),
    }
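Using the two outputs together (doc_a and doc_b as loaded earlier; the keys match the dict returned above):

report = hybrid_compare(doc_a, doc_b)
print(f"{report['change_count']} changed lines")
print(report['literal_changes'])      # every literal change, for verification
print(report['semantic_analysis'])    # what the changes mean and which ones matter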

Practical Accuracy Tips

For best comparison accuracy, keep each document section under 4,000 words where possible, splitting longer documents into logical sections before comparing. Provide explicit context about what type of documents you are comparing at the start of the prompt (“These are software licensing agreements” tells the model what to look for). Ask the model to list specific differences as numbered items rather than prose, since numbered lists are easier to verify than flowing paragraphs. For very important comparisons, run the comparison twice with slightly different prompts and check whether the identified differences are consistent between runs; inconsistencies indicate areas where the model is uncertain and warrant manual review.
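A minimal sketch of that double-run check, reusing compare_documents from earlier (the reconciliation prompt wording is illustrative):

def double_check(doc_a: str, doc_b: str) -> dict:
    # Run the comparison twice with differently worded instructions
    first = compare_documents(doc_a, doc_b)
    second = compare_documents(doc_a, doc_b,
        focus='every specific difference as a numbered item, including dates, amounts, and obligations')
    # Ask the model to point out findings present in one run but missing from the other
    review = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content':
            'These are two independent comparisons of the same pair of documents. '
            'List any differences mentioned in one analysis but not the other.\n\n'
            f'Analysis 1:\n{first}\n\nAnalysis 2:\n{second}'}],
        options={'temperature': 0.1, 'num_ctx': 8192}
    )
    return {'run_1': first, 'run_2': second, 'inconsistencies': review['message']['content']}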

Automating Document Comparison in Workflows

The comparison functions in this article integrate naturally into document workflows. A Git pre-commit hook that compares changed contract files against their previous versions and flags significant term changes is a practical 30-line implementation that catches accidental changes before they are committed. A weekly script that compares the current version of a key regulatory document against last week’s version and emails the diff summary keeps compliance teams informed without manual review. A document review tool that compares incoming vendor proposals against a reference requirements document and scores how well each proposal addresses each requirement automates the initial screening step of a procurement process. In all of these cases, the local inference approach means sensitive documents stay within your organisation’s infrastructure rather than flowing through external APIs, which is often a compliance requirement rather than merely a preference.
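A sketch of the weekly-check idea, reusing compare_documents from the start of the article (file paths are hypothetical and the notification step is left as a placeholder):

# weekly_check.py: compare the current guidance document against last week's copy
from pathlib import Path
import shutil

CURRENT = Path('guidance/current.txt')     # hypothetical paths
PREVIOUS = Path('guidance/last_week.txt')

if PREVIOUS.exists():
    summary = compare_documents(PREVIOUS.read_text(), CURRENT.read_text(),
                                focus='new obligations, changed deadlines, and removed requirements')
    Path('guidance/diff_summary.txt').write_text(summary)
    # send_email(summary)  # plug in your own notification mechanism here
shutil.copy(CURRENT, PREVIOUS)  # the current version becomes next week's baseline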

Limitations to Know

LLM document comparison has real limitations that are worth understanding before relying on it for high-stakes decisions. The model can miss differences that are syntactically subtle but semantically significant — a clause changing “shall” to “may” in a legal document is a major change in obligation, but a model focused on content may not flag it as a key difference without explicit prompting to check for modal verb changes. Very long documents with many small differences (a heavily tracked-changes document) can overwhelm the model’s ability to systematically enumerate all changes — the chunked approach helps but is not a complete solution for documents with hundreds of edits. And the model may incorrectly characterise the significance of a difference — flagging a trivial formatting change as significant while missing a substantive term change elsewhere. Use LLM comparison as a first-pass tool that highlights areas for human review, not as a replacement for careful human reading of important documents. The combination of LLM semantic analysis and a literal diff tool (difflib, git diff) provides more coverage than either alone.
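One practical mitigation is to spell out the subtle change types you care about through the focus parameter of the basic function (the wording here is illustrative):

result = compare_documents(doc_a, doc_b,
    focus='changes in obligation wording (shall/may/must), all numeric values, '
          'dates and deadlines, and any clause that was deleted outright')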

Getting the Most Value

The highest-value application of local LLM document comparison is for documents you review regularly where the same types of changes recur — recurring vendor contracts, updated regulatory guidance, iterative proposal revisions. Once you have established a comparison workflow that surfaces the differences that matter for your specific document type, running it consistently takes seconds and catches changes that might otherwise be missed in a time-pressured review. Build the workflow once, validate it against a few known examples, and let it become a standard step in your document review process. The local AI aspect means there is no ongoing cost per comparison, no data leaving your infrastructure, and no dependency on a third-party service’s availability — all of which make it genuinely sustainable as a regular part of your workflow rather than an occasional tool you use for the highest-stakes reviews only.

When to Skip This Approach

LLM document comparison is not the right tool for every scenario. For pure text diff tasks — finding every character-level change between two versions of a source file — difflib, git diff, or a dedicated diff tool is faster, more precise, and does not require a running LLM. For structured data comparison (two CSV files, two JSON objects) where the structure defines what counts as a difference, purpose-built comparison tools outperform general-purpose LLM prompts. For very short documents (under 500 words), the overhead of LLM inference rarely adds enough value over a simple text diff to justify it. The LLM approach shines for medium-to-long prose documents where semantic understanding matters — contracts, policies, research papers, proposals — where the question is not just what changed but what the changes mean and whether they are significant. For documents in that category, the combination of privacy, zero cost per comparison, and genuine semantic understanding makes local LLM comparison one of the most practically useful applications in this series.

The local AI ecosystem in 2026 has reached a point where the tools and patterns described in this article are stable, well-supported, and ready for production use. Whether you reach for Go, Python, JavaScript, or another language entirely, the underlying Ollama API remains the same: consistent, version-stable, and well suited to the practical, privacy-respecting applications this series has covered, particularly in settings where data governance and long-term operational cost matter.
