How to Read and Summarise Research Papers with a Local LLM

Research papers are notoriously dense. A typical machine learning paper runs 15–30 pages and contains mathematical notation, experimental results across multiple baselines, and technical prose that takes significant time to parse even for domain experts. A local LLM can dramatically speed up your reading workflow — summarising papers, extracting key contributions, explaining concepts, generating questions, and helping you situate a paper within broader literature — entirely offline, without sending research content to cloud APIs.

Fetching Papers from arXiv

import arxiv
import re

def fetch_arxiv_paper(arxiv_id: str) -> dict:
    """Fetch paper metadata and full text from arXiv.
    arxiv_id: e.g. '2410.21276' or full URL
    """
    # Clean ID from URL if needed
    arxiv_id = re.sub(r'.*arxiv\.org/abs/', '', arxiv_id).strip()
    arxiv_id = re.sub(r'v\d+$', '', arxiv_id)  # drop version suffix ('2410.21276v2' -> '2410.21276')
    client = arxiv.Client()
    search = arxiv.Search(id_list=[arxiv_id])
    paper = next(client.results(search))
    return {
        'title': paper.title,
        'authors': [a.name for a in paper.authors],
        'abstract': paper.summary,
        'published': str(paper.published.date()),
        'pdf_url': paper.pdf_url,
        'arxiv_id': arxiv_id
    }

# Install: pip install arxiv
paper = fetch_arxiv_paper('2410.21276')
print(paper['title'])
print(paper['abstract'][:300])

Extracting Text from a PDF

import urllib.request
import tempfile
from pathlib import Path

def download_and_extract_pdf(pdf_url: str) -> str:
    """Download PDF and extract text."""
    # pip install pymupdf (fitz)
    import fitz
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
        tmp_path = Path(tmp.name)
    # Download after the temp handle is closed (avoids a double-open on Windows)
    urllib.request.urlretrieve(pdf_url, tmp_path)
    doc = fitz.open(tmp_path)
    text = ''
    for page in doc:
        text += page.get_text()
    doc.close()
    tmp_path.unlink()
    return text

paper_text = download_and_extract_pdf(paper['pdf_url'])
print(f'Extracted {len(paper_text.split())} words')

Core Summarisation Prompts

import ollama

MODEL = 'llama3.2'  # or 'mistral-nemo' for long papers

def summarise_paper(paper_text: str, title: str, max_words: int = 8000) -> str:
    # Trim to context window
    words = paper_text.split()
    if len(words) > max_words:
        paper_text = ' '.join(words[:max_words])

    prompt = f"""Read this research paper and provide:

1. **One-sentence summary**: What does this paper do in plain English?
2. **Problem solved**: What gap or problem does it address?
3. **Key contribution**: What is the main technical innovation?
4. **Method**: How does it work? (2-3 sentences)
5. **Results**: Key quantitative results and what they mean
6. **Limitations**: What are the acknowledged weaknesses or scope restrictions?
7. **Relevance**: Who should read this paper and why?

Paper title: {title}

Paper text:
{paper_text}"""

    response = ollama.chat(
        model=MODEL,
        messages=[{'role': 'user', 'content': prompt}],
        options={'temperature': 0.2, 'num_ctx': 16384}
    )
    return response['message']['content']

summary = summarise_paper(paper_text, paper['title'])
print(summary)

Abstract-Only Quick Triage

def triage_paper(abstract: str, title: str,
                  your_interests: str, model: str = 'llama3.2') -> str:
    """Quick relevance triage from abstract alone."""
    prompt = f"""Given my research interests: {your_interests}

Assess this paper's relevance:
Title: {title}
Abstract: {abstract}

Answer:
1. Relevant? (Yes/Partial/No) and why in one sentence
2. Key concept or technique introduced
3. Should I read the full paper? (Yes/Skim/No)"""

    return ollama.chat(
        model=model,
        messages=[{'role':'user','content':prompt}],
        options={'temperature':0.1}
    )['message']['content']

# Quickly triage 20 papers from a reading list
my_interests = 'local LLM deployment, quantization, inference optimization'
print(triage_paper(paper['abstract'], paper['title'], my_interests))

Batch Processing a Reading List

import json
from datetime import datetime

def process_reading_list(arxiv_ids: list[str],
                          your_interests: str,
                          model: str = 'llama3.2') -> list[dict]:
    results = []
    for i, arxiv_id in enumerate(arxiv_ids):
        print(f'[{i+1}/{len(arxiv_ids)}] Processing {arxiv_id}...')
        try:
            paper = fetch_arxiv_paper(arxiv_id)
            triage = triage_paper(paper['abstract'], paper['title'],
                                   your_interests, model)
            results.append({
                'arxiv_id': arxiv_id,
                'title': paper['title'],
                'authors': paper['authors'][:3],
                'published': paper['published'],
                'triage': triage,
                'abstract': paper['abstract']
            })
        except Exception as e:
            print(f'  Error: {e}')
            results.append({'arxiv_id': arxiv_id, 'error': str(e)})
    return results

paper_ids = ['2410.21276', '2405.04434', '2309.10305']  # replace with your list
results = process_reading_list(paper_ids, my_interests)

# Save as readable Markdown
output = f'# Paper Triage — {datetime.now().strftime("%Y-%m-%d")}\n\n'
for r in results:
    if 'error' not in r:
        output += f'## [{r["title"]}](https://arxiv.org/abs/{r["arxiv_id"]})\n'
        output += f'*{", ".join(r["authors"])} ({r["published"]})*\n\n'
        output += r['triage'] + '\n\n---\n\n'
Path('reading_list_triage.md').write_text(output)
print('Saved to reading_list_triage.md')

Deep Reading: Interactive Q&A on a Paper

def paper_qa_session(paper_text: str, title: str,
                      model: str = 'llama3.2'):
    """Interactive Q&A about a paper."""
    words = paper_text.split()
    if len(words) > 12000:
        paper_text = ' '.join(words[:12000])

    system = f"""You are an expert research assistant. Answer questions about
the following paper. Be precise, cite specific sections when possible,
and acknowledge uncertainty when the paper doesn't address a question.

Paper: {title}\n\n{paper_text}"""

    history = [{'role':'system','content':system}]
    print(f'Paper loaded. Ask questions about: {title}')
    print('Type "quit" to exit\n')

    while True:
        q = input('Q: ').strip()
        if q.lower() == 'quit': break
        history.append({'role':'user','content':q})
        r = ollama.chat(model=model, messages=history,
                        options={'temperature':0.2,'num_ctx':16384})
        answer = r['message']['content']
        print(f'A: {answer}\n')
        history.append({'role':'assistant','content':answer})

paper_qa_session(paper_text, paper['title'])

Choosing the Right Model

For abstract triage (processing 50 papers in a session), use the smallest capable model — Llama 3.2 3B or even 1B. Speed matters more than depth when triage is the goal. For full-paper summarisation, Llama 3.1 8B with num_ctx set to 16K handles most papers well. For long papers (30+ pages, 15,000+ words), Mistral Nemo 12B with 32K context processes the full text in a single pass rather than truncating, which produces better summaries that capture conclusions and related work rather than just the introduction and methods.
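This heuristic can be captured in a small helper. The model tags and word-count thresholds below are illustrative assumptions (matching common Ollama tags), not measured cut-offs:

```python
def pick_model(word_count: int) -> dict:
    """Choose an Ollama model and context size based on paper length."""
    if word_count <= 1000:
        # Abstract-only triage: smallest capable model, speed over depth
        return {'model': 'llama3.2:3b', 'num_ctx': 4096}
    if word_count <= 12000:
        # Typical full paper: mid-size model with 16K context
        return {'model': 'llama3.1:8b', 'num_ctx': 16384}
    # Long papers: larger context avoids truncating conclusions
    return {'model': 'mistral-nemo', 'num_ctx': 32768}
```

The returned dict can be splatted straight into the chat call: `ollama.chat(model=cfg['model'], options={'num_ctx': cfg['num_ctx'], ...})`.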

Privacy and Offline Advantages

Research workflows often involve pre-publication papers shared under embargo, proprietary datasets, or internal research that cannot be sent to cloud APIs. Processing papers locally gives you full confidentiality for all content — summaries, extracted insights, and Q&A sessions never leave your machine. For researchers at organisations with data handling policies, this makes local LLM processing the only compliant option for paper review workflows involving sensitive or embargoed content. Combined with offline functionality, it also means your reading workflow is not interrupted by API outages or rate limits during intensive paper-reading sessions.

Why Research Papers Are a Good Fit for Local LLMs

Research papers have a specific structure that local LLMs handle well: abstract, introduction, related work, methods, experiments, results, discussion, conclusion. This structure is highly predictable, which means a well-crafted prompt can reliably extract each component. The model does not need to reason about ambiguous instructions or generate creative content — it needs to read carefully and report accurately, which is exactly the type of extractive task where 7B and 8B models perform at near-parity with much larger models.
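That predictable structure can also be exploited before prompting at all: a rough section splitter lets you feed the model only the sections you care about. The heading regex below is a heuristic and an assumption about layout (real PDFs vary with numbered or all-caps headings), so expect to tune it for your field:

```python
import re

# Matches common ML-paper headings, optionally numbered ('3. Results')
SECTION_PATTERN = re.compile(
    r'^\s*(?:\d+\.?\s+)?'
    r'(Abstract|Introduction|Related Work|Methods?|Experiments?|'
    r'Results|Discussion|Conclusions?|References)\s*$',
    re.IGNORECASE | re.MULTILINE
)

def split_sections(paper_text: str) -> dict[str, str]:
    """Split extracted paper text into sections keyed by heading."""
    matches = list(SECTION_PATTERN.finditer(paper_text))
    sections = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(paper_text)
        sections[m.group(1).title()] = paper_text[start:end].strip()
    return sections
```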

The volume of papers being published has also made manual reading workflows increasingly inadequate. The arXiv ML section alone publishes hundreds of papers per week. Keeping up with even a focused subfield requires reading 20–50 papers per week, which is impractical if each paper requires 30–60 minutes of careful reading. A local LLM triage workflow reduces the per-paper time to 30–90 seconds for abstract triage and 3–5 minutes for full summarisation, making it practical to maintain genuine awareness of a broad literature without sacrificing depth on the papers that matter most.

Handling Mathematical Content

ML papers contain substantial mathematical notation — loss functions, gradient updates, attention formulas, statistical definitions. PDF text extraction preserves most of this content but LaTeX symbols and equation formatting can produce garbled text (Greek letters rendered as Unicode, fractions as text approximations). Current 7B and 8B models handle this imperfectly — they often understand the intent of a formula from context without being able to parse the extracted notation precisely. For most summarisation and triage tasks this is acceptable: the model correctly identifies that a section describes the model architecture’s attention mechanism without needing to parse every term in the formula precisely. For tasks requiring precise mathematical reasoning about paper equations, prompt the model explicitly to describe the mathematical content in natural language rather than attempt to reproduce the notation: “Describe what Equation 3 computes without using mathematical notation.”

The practical workaround for papers where mathematical precision is important is to include the abstract and introduction (which typically describe the key equations conceptually) plus the results section, and to skip the dense methods sections where notation is heaviest. This approach captures the what and why of the paper accurately, with somewhat less precision on the how — which is appropriate for triage and preliminary assessment even if it is insufficient for implementing the method yourself.
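A minimal sketch of that workaround: keep the head of the paper (abstract and introduction) plus the results section, and drop the notation-heavy methods in between. The heading search here is a naive assumption about layout, not a robust parser:

```python
def triage_slice(paper_text: str, head_words: int = 2500,
                 tail_words: int = 1500) -> str:
    """Keep the paper's opening plus its results section, skip the middle."""
    words = paper_text.split()
    head = ' '.join(words[:head_words])
    # Naive heading search; assumes 'Results' starts a line in the extraction
    idx = paper_text.lower().find('\nresults')
    if idx == -1:
        return head
    tail = ' '.join(paper_text[idx:].split()[:tail_words])
    return head + '\n\n[... methods omitted ...]\n\n' + tail
```

The sliced text can then be passed to summarise_paper in place of the full extraction.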

Building a Personal Research Knowledge Base

The most powerful extension of individual paper summarisation is building a searchable knowledge base of your reading history. Each paper you summarise generates a structured Markdown file with title, authors, date, one-sentence summary, key contributions, and relevance notes. Storing these in an Obsidian vault creates a searchable, interconnected knowledge base where you can find papers by topic, author, concept, or recency without relying on memory.

Combined with the Smart Connections Obsidian plugin (which indexes the vault with local embeddings), this becomes semantically searchable — you can ask “what have I read about mixture-of-experts efficiency?” and get back the relevant summaries from your reading history, even if none of the papers used those exact words in your notes. Over six months to a year of consistent use, this knowledge base becomes a significant research asset that compounds in value as the collection grows. The key is consistency: summarise every paper you read rather than just the important ones, because you often do not know which papers will become important until later when you need to retrieve them.

Citing and Connecting Papers

Local LLMs can help with the lateral reading task of understanding how papers relate to each other. After building summaries for a set of related papers, ask the model to compare them:

def compare_papers(summaries: list[str], topic: str,
                    model: str = 'mistral-nemo') -> str:
    combined = '\n\n---\n\n'.join(summaries)
    prompt = f"""Given these summaries of papers on {topic}, provide:
1. How do these papers relate to each other? (agreements, contradictions, extensions)
2. What is the progression of ideas across these papers?
3. What gaps or open questions remain across this literature?
4. Which paper should someone read first to understand this area?

Paper summaries:
{combined}"""
    return ollama.chat(
        model=model,
        messages=[{'role':'user','content':prompt}],
        options={'temperature':0.3}
    )['message']['content']

# Load previously saved summaries
summary_files = list(Path('summaries/').glob('*.md'))
summaries = [f.read_text() for f in summary_files[:5]]  # compare 5 papers
comparison = compare_papers(summaries, 'LLM inference optimization')
print(comparison)

Daily arXiv Digest

A practical workflow for staying current: run a daily script that fetches the day’s new arXiv papers in your target categories, runs abstract triage on each, and emails or saves a digest of the relevant ones. This eliminates the manual step of browsing arXiv daily while ensuring you do not miss papers in your area. The triage prompt’s relevance judgment is reliable enough for filtering — you will occasionally miss a borderline paper, but the time savings from automated triage of 50–100 abstracts per day more than compensate for the occasional miss. Tune the relevance threshold by reviewing the first week’s triage results and adjusting your interest description to reduce false positives and false negatives based on what the model is getting right and wrong for your specific research focus.
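The fetch step of that digest can be sketched as follows, assuming the same arxiv package used earlier. The category list is an example, and the submittedDate range follows the arXiv API's query syntax; results from this would be fed through triage_paper:

```python
from datetime import datetime, timedelta

def daily_query(categories: list[str], days_back: int = 1) -> str:
    """Build an arXiv API query for recent submissions in target categories."""
    since = (datetime.now() - timedelta(days=days_back)).strftime('%Y%m%d')
    until = datetime.now().strftime('%Y%m%d')
    cats = ' OR '.join(f'cat:{c}' for c in categories)
    return f'({cats}) AND submittedDate:[{since}0000 TO {until}2359]'

def fetch_daily_papers(categories: list[str], max_results: int = 100):
    import arxiv  # pip install arxiv
    client = arxiv.Client()
    search = arxiv.Search(query=daily_query(categories),
                          max_results=max_results,
                          sort_by=arxiv.SortCriterion.SubmittedDate)
    return list(client.results(search))

# e.g. fetch_daily_papers(['cs.CL', 'cs.LG']) then triage each abstract
```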

Getting Started in 10 Minutes

The minimum viable research paper workflow requires three pip installs and a running Ollama instance: pip install arxiv pymupdf ollama. From there, copy the fetch and summarise functions from this article and run them on a paper you have read recently so you can evaluate the summary quality against your own knowledge. The structured seven-point summary format in this article is a strong starting point, but it is worth iterating on the prompt for your specific field — the sections that matter most differ between empirical ML papers, theoretical work, systems papers, and applied research. Spending 30 minutes tuning the prompt on five papers you know well will produce a template that serves you reliably across hundreds of future papers.
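The full setup reduces to a few shell commands (the model tag matches the default used in this article's code; swap in mistral-nemo for long papers):

```shell
pip install arxiv pymupdf ollama
ollama pull llama3.2
```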

The research paper summarisation workflow is one of the strongest demonstrations of local LLMs providing genuine, measurable productivity value to knowledge workers. The time savings are concrete and immediate — 30 seconds of triage versus 5–10 minutes of manual abstract reading scales to hours saved per week for active researchers. The privacy benefit is equally concrete for anyone working with pre-publication or proprietary research. And the knowledge base accumulation benefit compounds over time in a way that makes the early investment in setup increasingly valuable. For researchers and analysts who read papers regularly, this is among the highest-return-on-setup-time local LLM workflows available.

Integrating with Reference Managers

Most researchers use a reference manager — Zotero, Mendeley, or Papers — to organise their reading library. Local LLM summaries can slot into this workflow at two points. First, when you add a paper to your reference manager, run the triage script on the abstract immediately and add the triage result as a note or tag. This means your reference library always has a first-pass assessment attached to every paper, searchable within the reference manager itself. Second, for papers you have already collected but never fully read, run the batch summarisation script over your reference manager’s export (most support BibTeX or CSV export with abstracts) to generate summaries for the entire archive. A library of 500 papers with machine-generated one-paragraph summaries is dramatically more useful than the same library with only titles and abstracts, because you can scan the summaries much faster than the originals and surface papers you filed away but never got back to.

The combination of a local LLM pipeline, an Obsidian knowledge base, and a reference manager creates a research workflow that scales well with volume: the LLM handles the initial processing workload that grows with the number of papers read, the reference manager handles citation management and export, and Obsidian handles the synthesis and connection-making that requires human judgment. Each tool does what it is best at, with the local LLM removing the bottleneck that makes high-volume reading unsustainable — the time cost of reading and annotating every paper before deciding whether it is worth keeping in active memory.

A Note on Accuracy

Local LLM paper summaries are accurate for factual content that is clearly stated in the paper — titles, author contributions, experimental baselines, main results. They are less reliable for nuanced interpretations, implicit assumptions, and the significance of results relative to prior work that is not explicitly discussed in the paper itself. A summary that says a model achieved “state-of-the-art results on three benchmarks” is accurate if the paper says so, but the model cannot independently verify whether those benchmarks are meaningful or whether the comparison baselines are appropriate. Treat LLM summaries as a fast first read that captures what the paper claims, and reserve your own judgment for evaluating whether those claims are well-supported and significant. This division — LLM for extraction, human for evaluation — is the right way to integrate the workflow into serious research practice: it preserves the intellectual rigour of the process while removing the time cost of the extraction work that precedes evaluation. The scripts in this article give you everything you need to build that workflow in an afternoon, with the flexibility to adapt each component to your field and reading habits.
