PDFs are everywhere — research papers, contracts, reports, invoices, technical manuals. Being able to summarise, extract information from, and query a PDF using a local LLM means you can process sensitive documents without sending them to a cloud API. This guide walks through building a complete local PDF processing pipeline in Python: extracting text, handling multi-page documents that exceed the model context window, generating structured summaries, and building a simple question-answering system over a PDF collection.
The approach uses Ollama for inference and pymupdf for PDF text extraction, one of the fastest and most reliable Python PDF libraries. All processing happens locally: the PDF never leaves your machine and no API keys are required.
Setup
Install the required packages and pull the models you need:
pip install pymupdf httpx numpy
ollama pull llama3.2
ollama pull nomic-embed-text
For summarisation and Q&A over general documents, llama3.2 works well. For technical PDFs such as academic papers, engineering specifications, or medical literature, qwen2.5-coder:7b can produce summaries that better preserve technical nuance, despite its coding-focused name. Pull it with ollama pull qwen2.5-coder:7b before use.
Extracting Text from a PDF
pymupdf extracts clean text and generally handles columns, tables, and headers better than most alternatives. The basic extraction pattern reads each page and joins the results:
import fitz  # pymupdf

def extract_text(pdf_path: str) -> str:
    """Return the text of every non-empty page, each prefixed with a page marker."""
    doc = fitz.open(pdf_path)
    pages = []
    for i, page in enumerate(doc):
        text = page.get_text("text")
        # Skip pages with no extractable text (for example, scanned images)
        if text.strip():
            pages.append(f"--- Page {i+1} ---\n{text.strip()}")
    doc.close()
    return "\n".join(pages)

The “text” mode returns plain text with whitespace preserved. For PDFs with two-column layouts like academic papers, the “blocks” mode returns text blocks with bounding box coordinates. You can then sort the blocks by column first and by vertical position second to reconstruct the correct reading order rather than getting columns interleaved.
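The column reordering described above can be sketched as follows. This is a minimal sketch that assumes a simple two-column page split at the horizontal midpoint. The helper name order_blocks and the midpoint heuristic are illustrative and not part of pymupdf. The input is a list of tuples in the shape returned by page.get_text("blocks"): (x0, y0, x1, y1, text, block_no, block_type).

```python
from typing import List, Tuple

# Shape of one entry from page.get_text("blocks")
Block = Tuple[float, float, float, float, str, int, int]


def order_blocks(blocks: List[Block], page_width: float) -> List[str]:
    """Return block texts in reading order for a two-column page.

    Assumption: the columns are separated at the page's horizontal midpoint.
    """
    mid = page_width / 2
    # block_type 0 is text, 1 is an image block; keep non-empty text only
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]

    def reading_key(b: Block) -> Tuple[int, float, float]:
        centre = (b[0] + b[2]) / 2
        column = 0 if centre < mid else 1
        # Left column first, then top to bottom within each column
        return (column, b[1], b[0])

    return [b[4].strip() for b in sorted(text_blocks, key=reading_key)]
```

With a real page this could be called as order_blocks(page.get_text("blocks"), page.rect.width). Full-width elements such as titles or wide tables do not fit the two-column assumption and would need separate handling.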
One-Shot Summarisation
For PDFs that fit within the model context window, pass the full text in a single request with a style parameter that controls the summary format:
import httpx

OLLAMA_URL = "http://localhost:11434"
MODEL = "llama3.2"
NUM_CTX = 32768  # context window to request; lower it if memory is tight

def summarise_pdf(pdf_path: str, style: str = "executive") -> str:
    text = extract_text(pdf_path)
    prompts = {
        "executive": "Write a 3-paragraph executive summary focusing on key findings and conclusions.",
        "bullet": "Summarise as 10 bullet points covering the most important information.",
        "technical": "Provide a technical summary including methods, results, and conclusions.",
        "simple": "Explain this document in plain English a non-expert could understand.",
    }
    # Fall back to the executive style for unknown style names
    instruction = prompts.get(style, prompts["executive"])
    messages = [
        {"role": "system", "content": "You are an expert document analyst."},
        {"role": "user", "content": f"{instruction}\nDocument:\n{text[:80000]}"},
    ]
    with httpx.Client(timeout=180) as client:
        resp = client.post(
            f"{OLLAMA_URL}/api/chat",
            json={
                "model": MODEL,
                "messages": messages,
                "stream": False,
                # Without an explicit num_ctx, Ollama uses a much smaller default window
                "options": {"num_ctx": NUM_CTX},
            },
        )
        resp.raise_for_status()
    return resp.json()["message"]["content"]

The text[:80000] slice caps input at roughly 20,000 tokens, assuming about four characters per token for English text. That is a conservative limit for long-context models: llama3.2 supports up to 128k tokens, so you can push it much higher. Ollama only allocates the context window requested through the num_ctx option, and its default is far smaller than the model maximum, so the request sets num_ctx explicitly. Without it, a long prompt is truncated to fit the default window. The 180-second timeout leaves room for long documents to finish processing.
Chunked Summarisation for Long Documents
For very long PDFs — lengthy reports, full books, legal contracts — use a map-reduce approach: summarise each chunk independently, then summarise the summaries into a coherent final result. The 200-word overlap between chunks ensures sentences straddling boundaries are captured in at least one chunk. For a 100-page PDF this creates around 4 to 5 chunks, typically completing in 1 to 3 minutes on a machine with a mid-range GPU.
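The map-reduce approach can be sketched as below. This is a minimal sketch, not a definitive implementation: the function names, the 12,000-word chunk size (chosen so that a 100-page document of roughly 500 words per page yields about five chunks), and the summarise callable are all assumptions. The callable stands in for whatever function sends text to Ollama and returns a summary string.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int = 12000, overlap: int = 200) -> List[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks: List[str] = []
    step = chunk_size - overlap
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def map_reduce_summary(
    text: str,
    summarise: Callable[[str], str],  # hypothetical: sends text to the model
    chunk_size: int = 12000,
    overlap: int = 200,
) -> str:
    """Summarise each chunk, then summarise the combined chunk summaries."""
    chunks = chunk_words(text, chunk_size, overlap)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarise(chunks[0])
    # Map step: one summary per chunk
    partials = [summarise(chunk) for chunk in chunks]
    # Reduce step: summarise the labelled partial summaries
    combined = "\n\n".join(
        f"Section {i + 1} summary:\n{p}" for i, p in enumerate(partials)
    )
    return summarise(combined)
```

A chunk of 12,000 words is on the order of 16,000 tokens, so the model's context window has to be configured large enough to hold it.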
Question Answering Over a PDF
For interactive Q&A, embed the PDF text page by page and retrieve the most relevant pages before asking the model. This approach — sometimes called retrieval-augmented generation — gives accurate answers without sending the full document on every question. Build the embedding index once at startup, then for each question embed only the query string, find the top 3 most similar pages by cosine similarity, and send just those pages as context to Ollama. The focused context leads to more precise answers than sending the entire document, and the response comes back much faster.
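The retrieval step can be sketched as follows. This is a sketch under stated assumptions: embed is a hypothetical callable that returns an embedding vector for a string (for example, a small wrapper around Ollama's embeddings endpoint with nomic-embed-text), and the function names are illustrative. The similarity is written in plain Python to stay dependency-free. For large indexes a NumPy matrix product would be the faster equivalent.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedder = Callable[[str], Sequence[float]]  # hypothetical embedding function
IndexEntry = Tuple[int, str, List[float]]    # (page number, page text, vector)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(pages: List[str], embed: Embedder) -> List[IndexEntry]:
    """Embed every non-empty page once; page numbers are 1-based."""
    return [
        (i + 1, text, list(embed(text)))
        for i, text in enumerate(pages)
        if text.strip()
    ]


def top_pages(
    question: str, index: List[IndexEntry], embed: Embedder, k: int = 3
) -> List[Tuple[int, str]]:
    """Return the k pages most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda e: cosine(query, e[2]), reverse=True)
    return [(page_no, text) for page_no, text, _ in ranked[:k]]


def build_prompt(question: str, hits: List[Tuple[int, str]]) -> str:
    """Assemble the retrieved pages and the question into one prompt."""
    context = "\n\n".join(f"[Page {n}]\n{t}" for n, t in hits)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The string returned by build_prompt would then be sent as the user message in the same kind of chat request used for summarisation.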
For repeated use of the same document, persist the embedding index to disk using Python’s pickle module, keyed on an MD5 hash of the PDF file bytes. Loading a cached index takes under a second compared to several minutes for regenerating embeddings from scratch. Invalidate the cache automatically by hashing the file contents rather than the filename, so updated PDFs rebuild their index on the next run without any manual intervention.
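The caching scheme above could look roughly like this. It is a minimal sketch: the cache directory name, the function names, and the build callable (which stands in for whatever function computes the embedding index for a PDF) are assumptions.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable, Union


def file_md5(path: Path) -> str:
    """Hash the file contents in 1 MiB blocks so large PDFs stay off the heap."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def load_or_build_index(
    pdf_path: Union[str, Path],
    build: Callable[[Path], Any],  # hypothetical: computes the index for a PDF
    cache_dir: Union[str, Path] = ".pdf_index_cache",
) -> Any:
    """Return the cached index for this exact file content, building it if needed."""
    pdf_path = Path(pdf_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The key depends on content, so an edited PDF gets a new cache file
    cache_file = cache_dir / f"{file_md5(pdf_path)}.pkl"
    if cache_file.exists():
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    index = build(pdf_path)
    with open(cache_file, "wb") as f:
        pickle.dump(index, f)
    return index
```

Because unpickling can execute arbitrary code, only load cache files that your own process wrote. MD5 is used here purely as a change detector, not for security.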
Structured Data Extraction
Use Ollama’s JSON schema mode to extract structured information from PDFs. This is particularly valuable for invoices, contracts, resumes, and forms where you need to pull specific fields reliably. Define a JSON Schema object describing the fields you want to extract and pass it in the format field of the Ollama request, and the model’s output is constrained to that schema. You can usually call json.loads on the response content directly, but still handle the case where generation is cut off by a token limit and the JSON is incomplete. The schema also guarantees only the shape of the output, not the correctness of the extracted values, so spot-check important fields. For batch invoice processing, loop over a directory of PDFs and write the extracted data to a CSV or database for further analysis. Schema constraints are dramatically more reliable than asking the model to “respond in JSON format” via the prompt alone.
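A schema-constrained request might be assembled like this. It is a sketch under assumptions: the invoice fields, the function names, and the system prompt are illustrative. The payload is meant to be posted to Ollama's /api/chat endpoint, and the parser expects the response shape of a non-streaming chat call.

```python
import json
from typing import Any, Dict

# Illustrative schema; adjust the fields to the documents you process
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "vendor", "total"],
}


def build_extraction_request(
    text: str, schema: Dict[str, Any], model: str = "llama3.2"
) -> Dict[str, Any]:
    """Build a chat payload whose output is constrained by `schema`."""
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": "Extract the requested fields from the document. "
                "Use only values that appear in the text.",
            },
            {"role": "user", "content": text},
        ],
        "format": schema,  # JSON Schema object, as described above
        "stream": False,
        "options": {"temperature": 0},  # favour repeatable extraction
    }


def parse_extraction(response_json: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the model output and check that required fields are present."""
    data = json.loads(response_json["message"]["content"])
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data
```

The explicit required-field check is a cheap safeguard for batch runs, where one bad extraction should be logged rather than written silently into the output.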
Handling Scanned PDFs
Scanned PDFs contain images rather than text, so pymupdf’s get_text returns nothing useful. Use a hybrid approach: try native text extraction first, fall back to Tesseract OCR only for pages without sufficient text. Install pytesseract and the Tesseract binary, then rasterise each page at 200 DPI using pymupdf’s get_pixmap method and pass the resulting image to pytesseract.image_to_string. The 200 DPI rasterisation balances OCR accuracy against memory usage for typical document scans. Higher resolutions like 300 DPI improve accuracy on small fonts but consume significantly more memory for large documents.
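The hybrid approach can be sketched per page as follows. This is a minimal sketch: the 50-character threshold for "sufficient text" and the function names are assumptions, and page is expected to be a pymupdf page object. pytesseract and Pillow are imported lazily so they are only required when a page actually needs OCR.

```python
from typing import Any


def needs_ocr(native_text: str, min_chars: int = 50) -> bool:
    """Treat a page as scanned if native extraction yields almost no text."""
    return len(native_text.strip()) < min_chars


def page_text(page: Any, dpi: int = 200, min_chars: int = 50) -> str:
    """Return native text when available, otherwise OCR the rendered page."""
    native = page.get_text("text")
    if not needs_ocr(native, min_chars):
        return native

    # Optional dependencies, loaded only for pages that need OCR
    import pytesseract
    from PIL import Image

    # Rasterise the page; the default pixmap is RGB without an alpha channel
    pix = page.get_pixmap(dpi=dpi)
    image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(image)
```

Calling page_text for each page of an opened document gives the same per-page strings as native extraction, so the rest of the pipeline does not need to know which pages were scanned.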
OCR quality depends on scan resolution, font clarity, and language. Tesseract handles English well but struggles with non-Latin scripts, very small fonts, and handwritten text. For production OCR on critical documents, commercial alternatives offer better accuracy — but for most personal and team use cases, Tesseract with a clean scan produces text that Ollama can summarise accurately.
Choosing the Right Extraction Approach
Native text PDFs generated digitally by Word, LaTeX, or Google Docs extract cleanly and essentially instantaneously with pymupdf. These are by far the easiest PDFs to work with and where Ollama gives the best results, because the extracted text closely matches what a human would read. Password-protected PDFs require the password to open — pymupdf supports this with the doc.authenticate method before extraction. PDFs with digital rights management that prevents copying are harder to handle and may require platform-specific tools to unlock first.
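For password-protected files, a small guard before extraction might look like the sketch below. The helper name unlock is hypothetical. It relies on the needs_pass property and the authenticate method of a pymupdf document, where authenticate returns a non-zero value on success.

```python
from typing import Any, Optional


def unlock(doc: Any, password: Optional[str] = None) -> bool:
    """Return True if the document is readable, authenticating if required."""
    if not doc.needs_pass:
        return True  # not encrypted, or already unlocked
    if password is None:
        return False  # a password is required but none was supplied
    # authenticate() returns 0 when the password is wrong
    return doc.authenticate(password) > 0
```

A caller would open the file, call unlock(doc, password), and skip or report the file when the result is False instead of attempting extraction.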
Improving Summary Quality
The quality of Ollama’s summaries depends heavily on prompt quality. A few techniques consistently improve output. First, tell the model what type of document it is processing — “This is a quarterly financial report for a publicly listed company” gives much better context than just dumping the text. Second, specify the audience — “Summarise for a non-technical executive who needs to make a budget decision” produces more focused output. Third, ask for specific elements you care about — “Include the key recommendations, financial figures, and any risks identified” gives the model a clear checklist.
For technical documents, ask the model to preserve specific terminology rather than simplifying it. For legal documents, ask the model to flag any clauses it is uncertain about rather than summarising them confidently — models can misinterpret complex legal language, and a flag is much more useful than a confident but wrong summary. These simple prompt adjustments have a larger impact on output quality than switching models for most document types.
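These prompt adjustments can be collected into one small builder. The function name, parameters, and exact wording below are illustrative assumptions rather than a fixed recipe.

```python
from typing import Iterable


def build_summary_prompt(
    doc_type: str,
    audience: str,
    must_include: Iterable[str] = (),
    preserve_terminology: bool = False,
    flag_uncertain: bool = False,
) -> str:
    """Compose summary instructions from document type, audience, and checklist."""
    lines = [f"This is {doc_type}.", f"Summarise it for {audience}."]
    items = list(must_include)
    if items:
        lines.append("Include: " + ", ".join(items) + ".")
    if preserve_terminology:
        lines.append(
            "Preserve the document's technical terminology rather than simplifying it."
        )
    if flag_uncertain:
        lines.append(
            "Flag any clause you are uncertain about instead of summarising it confidently."
        )
    return "\n".join(lines)
```

For example, build_summary_prompt("a quarterly financial report for a publicly listed company", "a non-technical executive who needs to make a budget decision", ["the key recommendations", "financial figures", "any risks identified"]) produces instructions that can replace the fixed style prompts in the summariser.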
When to Use Local vs Cloud Processing
Local PDF processing with Ollama is the right choice when documents are sensitive — personal financial records, medical records, legal documents, internal business reports, or customer data — and you cannot or should not send them to a third-party API. It is also right when you need to process large volumes of documents and cloud API costs would be prohibitive, or when you need processing to work offline or in air-gapped environments.
Cloud APIs are the better choice when you need the absolute best quality, since large frontier models consistently outperform local 7B models for complex document understanding, or when you need features like native PDF vision that local models do not yet support well. Many production workflows use a hybrid approach: screen documents locally to identify the non-sensitive ones that need careful, high-quality review, send only those to a cloud API, and process the remainder locally at low cost. This keeps sensitive document handling local while reserving cloud API budget for cases where quality genuinely matters.
Batch Processing a Document Library
For batch processing an entire directory of PDFs — summarising a month’s worth of reports, cataloguing a research library, or processing submitted applications — the key engineering concerns are progress persistence, error isolation, and throughput. Write results to a CSV as you go rather than accumulating them in memory, and flush after each row so that if the batch is interrupted by a crash, power failure, or keyboard interrupt, the completed summaries are not lost. Handle errors per file so a single corrupt or password-protected PDF does not abort the entire batch — catch exceptions, write an error row to the CSV, and continue to the next file. Track which files have already been processed by checking the CSV before starting so you can resume an interrupted batch without reprocessing completed documents.
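The persistence and error-isolation pattern can be sketched as below. This is a sketch under assumptions: the column names, the function names, and the summarise callable (any function that takes a PDF path and returns a summary string) are illustrative.

```python
import csv
from pathlib import Path
from typing import Callable, Set, Union

FIELDS = ["file", "status", "summary"]


def already_done(csv_path: Path) -> Set[str]:
    """Return the file names that already have a row in the results CSV."""
    if not csv_path.exists():
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["file"] for row in csv.DictReader(f)}


def process_folder(
    folder: Union[str, Path],
    csv_path: Union[str, Path],
    summarise: Callable[[Path], str],  # hypothetical per-file summariser
) -> int:
    """Summarise every unprocessed PDF in `folder`; return how many were handled."""
    folder, csv_path = Path(folder), Path(csv_path)
    done = already_done(csv_path)
    is_new = not csv_path.exists() or csv_path.stat().st_size == 0
    processed = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        for pdf in sorted(folder.glob("*.pdf")):
            if pdf.name in done:
                continue  # resume: skip files recorded by an earlier run
            try:
                row = {"file": pdf.name, "status": "ok", "summary": summarise(pdf)}
            except Exception as exc:
                # One bad file is recorded and must not abort the batch
                row = {"file": pdf.name, "status": "error", "summary": str(exc)}
            writer.writerow(row)
            f.flush()  # keep completed rows even if the run is interrupted
            processed += 1
    return processed
```

In this sketch, files that failed are also treated as processed on resume. To retry them, remove their error rows from the CSV before the next run.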
For throughput, run with a small pool of concurrent workers; two or three is usually optimal. Ollama queues requests beyond its configured parallelism (the OLLAMA_NUM_PARALLEL setting), and the GPU is the shared bottleneck either way, so adding more workers beyond three rarely improves throughput and increases memory pressure. The ideal batch setup depends on your hardware: on a machine with a fast GPU and NVMe storage, two workers with a small fast model (llama3.2:3b) can process 50 to 100 short documents per hour. For longer documents or larger models, plan for longer per-document processing times and adjust your batch scheduling accordingly.
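A small worker pool along these lines could be sketched with the standard library. The function name run_pool and the work callable (any function that processes one item, such as summarising one PDF) are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator, Tuple


def run_pool(
    items: Iterable[str],
    work: Callable[[str], str],  # hypothetical per-item task
    workers: int = 2,
) -> Iterator[Tuple[str, bool, str]]:
    """Yield (item, succeeded, result_or_error) as each task finishes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(work, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                yield item, True, future.result()
            except Exception as exc:
                # Report the failure for this item and keep the pool running
                yield item, False, str(exc)
```

Threads are sufficient here because each worker spends most of its time waiting on the HTTP response from Ollama. Writing result rows inside the consuming loop keeps all file output on one thread, so the CSV writer is never shared between workers.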
Integrating PDF Processing into Existing Applications
The functions in this guide are designed to be imported into larger applications rather than run only as standalone scripts. In a Django or FastAPI web application, wrap the synchronous extraction and Ollama calls in Celery tasks to avoid blocking request handlers. In a data pipeline built with tools like Prefect or Airflow, each extraction, chunking, embedding, and summarisation step maps naturally to a pipeline task with its own retry logic and monitoring. In a document management system, trigger PDF processing automatically when new files are uploaded using filesystem watchers like watchdog or cloud storage event triggers.
The embedding-based Q&A system scales from a single PDF to a collection of hundreds of documents by storing embeddings in a proper vector database like ChromaDB or Qdrant rather than in memory. Both have Python clients and run locally without any cloud dependency. ChromaDB in particular is trivial to integrate — replace the in-memory list of chunks with a ChromaDB collection, and the rest of the Q&A code stays almost identical. The payoff is persistent storage across restarts, faster similarity search for large collections, and metadata filtering so you can restrict searches to specific documents or date ranges.
The combination of pymupdf for extraction, Ollama for inference, and nomic-embed-text for embeddings gives you a fully local PDF intelligence stack that handles everything from quick one-shot summaries to persistent Q&A indexes over large document libraries. No cloud accounts, no per-page costs, no data leaving your infrastructure — just fast, private, and accurate document processing that runs entirely on hardware you already own.
Start with the one-shot summariser for documents under 50 pages, add the chunked approach when you encounter longer documents, and layer in the Q&A system when your users need to ask specific questions rather than read a full summary. Each capability builds on the same extraction and Ollama client foundations, so adding them incrementally requires minimal code changes to what you have already built.