How to Build a Web Scraper with Ollama and Playwright

Playwright is Microsoft’s modern browser automation library — fast, reliable, and capable of controlling Chromium, Firefox, and WebKit from Python, JavaScript, or TypeScript. Paired with Ollama, it gives you a powerful combination: Playwright handles the browser automation and data collection, while Ollama handles the intelligent analysis, summarisation, and classification of the collected content. This guide covers the core patterns for combining Playwright and Ollama — scraping and summarising web content, extracting structured data from pages, monitoring sites for changes, and building an AI-powered research assistant.

All processing happens locally — Playwright runs in your Python process and Ollama runs on your machine. No cloud APIs, no rate limits beyond what the target websites impose, and no data leaving your infrastructure.

Setup

pip install playwright httpx
playwright install chromium
ollama pull llama3.2

The playwright install chromium command downloads the Chromium browser binary that Playwright manages. This is separate from any browser you have installed on your system — Playwright controls its own dedicated browser instances.

Scraping and Summarising a Web Page

Here is the basic pattern — fetch a page with Playwright, extract the text content, and summarise it with Ollama:

import asyncio, httpx, json
from playwright.async_api import async_playwright

OLLAMA_URL = "http://localhost:11434"
MODEL = "llama3.2"

async def scrape_and_summarise(url: str) -> dict:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded", timeout=30000)
        title = await page.title()
        # Extract readable text, excluding nav/footer noise
        text = await page.evaluate("""
            () => {
                const remove = document.querySelectorAll('nav,footer,header,aside,script,style');
                remove.forEach(el => el.remove());
                return document.body.innerText;
            }
        """)
        await browser.close()

    # Summarise with Ollama
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(f"{OLLAMA_URL}/api/chat", json={
            "model": MODEL,
            "messages": [{"role": "user", "content": f"Summarise this web page in 3 bullet points:\n\n{text[:6000]}"}],
            "stream": False
        })
    return {"title": title, "url": url, "summary": resp.json()["message"]["content"]}

# Run it
result = asyncio.run(scrape_and_summarise("https://example.com"))
print(result["summary"])

The JavaScript snippet removes nav, footer, header, aside, script, and style elements before extracting text, which matters a great deal for clean output. Without it, the extracted text is cluttered with navigation menus, footer links, sidebars, and other boilerplate that wastes context window tokens and degrades the quality of the summary. Cookie banners and similar overlays usually live in ordinary div elements, so they need site-specific selectors if they show up in your output.
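Even after those elements are stripped, innerText tends to contain runs of blank lines and stray whitespace that still cost tokens. The following is a minimal sketch of a cleanup step you could apply before building the prompt. The name clean_text is a hypothetical helper (not part of Playwright or Ollama), and the 6000-character default simply mirrors the slice used in the example above.

```python
import re

def clean_text(text: str, max_chars: int = 6000) -> str:
    """Collapse whitespace and truncate at a word boundary (hypothetical helper)."""
    # Squeeze runs of spaces/tabs, trim each line, and drop empty lines
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    compact = "\n".join(line for line in lines if line)
    if len(compact) <= max_chars:
        return compact
    # Cut at the last space before the limit so no word is split in half
    cut = compact.rfind(" ", 0, max_chars)
    end = cut if cut > 0 else max_chars
    return compact[:end]
```

In scrape_and_summarise you would pass clean_text(text) to the prompt instead of text[:6000].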

Extracting Structured Data from Pages

Use Playwright to scrape content and Ollama’s JSON schema mode to extract structured data reliably:

async def extract_job_listings(url: str) -> list[dict]:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        await page.wait_for_selector(".job-listing", timeout=10000)
        text = await page.inner_text(".jobs-container")
        await browser.close()

    schema = {
        "type": "object",
        "properties": {
            "jobs": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "company": {"type": "string"},
                        "location": {"type": "string"},
                        "salary": {"type": "string"},
                        "remote": {"type": "boolean"}
                    },
                    "required": ["title", "company"]
                }
            }
        }
    }

    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(f"{OLLAMA_URL}/api/chat", json={
            "model": MODEL,
            "messages": [{"role": "user", "content": f"Extract all job listings from this text:\n\n{text[:8000]}"}],
            "format": schema,
            "stream": False
        })
    data = json.loads(resp.json()["message"]["content"])
    return data.get("jobs", [])

This pattern is often more robust than traditional CSS selector-based scraping for sites with inconsistent markup. Instead of identifying exactly which CSS classes contain which data, you extract the visible text from the relevant section and let Ollama parse it semantically. The model tolerates variations in page layout, missing fields, and different formatting styles, although it can still misread or omit entries, so spot-check the output on each new site.

Batch Scraping Multiple URLs

For scraping multiple pages concurrently, use Playwright’s browser context and asyncio to run parallel requests efficiently:

async def batch_scrape(urls: list[str], max_concurrent: int = 3) -> list[dict]:
    results = []
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        semaphore = asyncio.Semaphore(max_concurrent)

        async def scrape_one(url: str) -> dict:
            async with semaphore:
                context = await browser.new_context(
                    user_agent="Mozilla/5.0 (compatible; research-bot/1.0)"
                )
                page = await context.new_page()
                try:
                    await page.goto(url, timeout=20000)
                    text = await page.evaluate(
                        "() => document.body.innerText"
                    )
                    return {"url": url, "text": text[:5000], "ok": True}
                except Exception as e:
                    return {"url": url, "text": "", "ok": False, "error": str(e)}
                finally:
                    await context.close()

        results = await asyncio.gather(*[scrape_one(u) for u in urls])
        await browser.close()

    return [r for r in results if r["ok"]]

The Semaphore(max_concurrent) limits the number of simultaneous browser pages to 3, preventing memory exhaustion when processing large URL lists. Each URL gets its own browser context so cookies, local storage, and session state are fully isolated between requests. Keep concurrency conservative — 3 to 5 parallel pages is a good default that respects target servers while still processing lists of URLs significantly faster than sequential scraping.

AI-Powered Site Monitor

Combine Playwright and Ollama to build a site monitor that not only detects changes but also explains what changed in plain English:

import hashlib, json, pathlib, time

STATE_FILE = pathlib.Path("site_state.json")

async def check_site(url: str) -> None:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        text = await page.evaluate("() => document.body.innerText")
        await browser.close()

    current_hash = hashlib.md5(text.encode()).hexdigest()
    prev_hash = state.get(url)

    if prev_hash and prev_hash != current_hash:
        async with httpx.AsyncClient(timeout=60) as client:
            resp = await client.post(f"{OLLAMA_URL}/api/chat", json={
                "model": MODEL,
                "messages": [{"role": "user", "content": f"Describe what likely changed on this page based on the new content. Be specific:\n\n{text[:4000]}"}],
                "stream": False
            })
        explanation = resp.json()["message"]["content"]
        print(f"CHANGE DETECTED at {url}:\n{explanation}")

    state[url] = current_hash
    STATE_FILE.write_text(json.dumps(state))

Run this from a cron job or a simple loop with asyncio.sleep to monitor pages continuously. The AI explanation adds context to the alert (a price drop, a product back in stock, an updated policy, a newly published article) rather than just reporting that the page hash changed. Note that the prompt contains only the new text, so the model is inferring what changed; for an exact account, store the previous text alongside the hash and send both versions.
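One way to give the model both versions without doubling the prompt size is to send a diff. The sketch below uses Python's difflib to build the prompt; build_change_prompt is a hypothetical helper, and it assumes you have extended the state file to keep the previous page text as well as its hash.

```python
import difflib

def build_change_prompt(old_text: str, new_text: str, max_chars: int = 4000) -> str:
    """Build a prompt from a unified diff of two page snapshots (hypothetical helper)."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
        n=1,  # one line of context around each change
    )
    diff_text = "\n".join(diff)
    return (
        "Below is a unified diff between two snapshots of a web page. "
        "Lines starting with '-' were removed and lines starting with '+' were added. "
        "Describe what changed in plain English:\n\n" + diff_text[:max_chars]
    )
```

The returned string would replace the "content" value in the check_site request.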

AI Research Assistant

Build a research assistant that takes a topic and a list of URLs, scrapes the pages, and synthesises the findings:

async def research(topic: str, urls: list[str]) -> str:
    # Scrape all pages
    pages = await batch_scrape(urls)

    # Summarise each page
    summaries = []
    async with httpx.AsyncClient(timeout=120) as client:
        for page in pages:
            resp = await client.post(f"{OLLAMA_URL}/api/chat", json={
                "model": MODEL,
                "messages": [{"role": "user",
                              "content": f"Summarise the key information about '{topic}' from this page:\n\n{page['text']}"}],
                "stream": False
            })
            summaries.append(f"Source: {page['url']}\n{resp.json()['message']['content']}")

        # Synthesise into a final report
        combined = "\n\n".join(summaries)
        resp = await client.post(f"{OLLAMA_URL}/api/chat", json={
            "model": MODEL,
            "messages": [{"role": "user",
                          "content": f"Synthesise these source summaries about '{topic}' into a comprehensive report:\n\n{combined}"}],
            "stream": False
        })
    return resp.json()["message"]["content"]

This map-reduce pattern (summarise each source independently, then synthesise the summaries) scales to dozens of sources before context window limits become a concern. Each page is processed independently, and the final synthesis call receives only the condensed summaries rather than the full raw text of all pages.
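If the combined summaries themselves outgrow the model's context window, one option is to reduce in stages: group the summaries into batches, synthesise each batch, then synthesise the batch reports. A minimal sketch of the grouping step follows. batch_by_budget is a hypothetical helper and the 12000-character budget is an assumption, not a figure from Ollama; tune it to the model you run.

```python
def batch_by_budget(summaries: list[str], max_chars: int = 12000) -> list[list[str]]:
    """Group summaries so each batch stays under a character budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    size = 0
    for summary in summaries:
        # Start a new batch when adding this summary would exceed the budget
        if current and size + len(summary) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(summary)
        size += len(summary)
    if current:
        batches.append(current)
    return batches
```

A single summary longer than the budget still gets a batch of its own rather than being dropped.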

Handling JavaScript-Heavy Sites

Playwright’s main advantage over plain HTTP clients like requests or httpx is that it runs the page's JavaScript in a real browser. Single-page applications built with React, Vue, or Angular render their content dynamically after the initial page load, so a plain HTTP request gets an empty shell with no data. With Playwright, the JavaScript executes and the DOM populates, and you extract text once the content you need has appeared.

Use page.wait_for_selector to wait for a specific element that indicates the page has finished loading its content. For API-driven sites where the data loads asynchronously, page.expect_response (the Python counterpart of page.waitForResponse in JavaScript) waits until a specific API call completes before you extract the rendered output. For infinite-scroll pages, simulate scrolling with page.evaluate("window.scrollTo(0, document.body.scrollHeight)") and wait between scrolls to trigger additional content loading.
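The infinite-scroll technique can be sketched as a loop that scrolls until the page height stops growing. scroll_to_bottom is a hypothetical helper that takes an already-open Playwright page; the scroll limit and the one-second pause are assumptions to adjust per site.

```python
async def scroll_to_bottom(page, max_scrolls: int = 10, pause_ms: int = 1000) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    previous_height = await page.evaluate("document.body.scrollHeight")
    for scrolls in range(1, max_scrolls + 1):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give lazy-loaded content time to arrive before measuring again
        await page.wait_for_timeout(pause_ms)
        height = await page.evaluate("document.body.scrollHeight")
        if height == previous_height:
            return scrolls  # nothing new loaded, so we reached the end
        previous_height = height
    return max_scrolls
```

Call it after page.goto and before extracting innerText. The max_scrolls cap keeps a truly endless feed from looping forever.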

Ethical Scraping Practices

When building web scrapers, check the site’s robots.txt and terms of service before scraping. Set a descriptive User-Agent string identifying your bot. Add delays between requests to avoid hammering servers — a 1 to 2 second sleep between page loads is a reasonable default for most sites. Never scrape login-protected content without permission. For sites that provide an API, use the API rather than scraping the web interface. These practices keep your scraper from being blocked and respect the server resources of sites you depend on.
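Two of these practices are easy to automate with the standard library. The sketch below checks a robots.txt body with urllib.robotparser and enforces a minimum delay per host. is_allowed and HostThrottle are hypothetical helpers, the bot name reuses the illustrative one from batch_scrape, and the 1.5-second default is just a value inside the 1 to 2 second range suggested above.

```python
import asyncio
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-bot/1.0"  # illustrative bot name, as in batch_scrape

def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt body permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Intended for a sequential crawl loop; it does no locking.
    """

    def __init__(self, min_delay: float = 1.5):
        self.min_delay = min_delay
        self._last: dict[str, float] = {}

    async def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self._last.get(host, float("-inf"))
        if elapsed < self.min_delay:
            await asyncio.sleep(self.min_delay - elapsed)
        self._last[host] = time.monotonic()
```

Fetch each site's /robots.txt once, pass its text to is_allowed for every candidate URL, and call await throttle.wait(url) before each page.goto.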

Playwright and Ollama together are well-suited to research automation, competitive intelligence, content aggregation, and dataset collection for machine learning. The combination of reliable browser automation and intelligent text processing handles the full pipeline from raw web pages to structured, analysed, and summarised information — entirely locally, entirely under your control.

Using Playwright for Login-Required Pages

Playwright can handle authenticated sessions, making it possible to scrape your own accounts on sites that require login — your social media analytics, your SaaS dashboard, your email newsletter statistics. The recommended approach is to log in once manually in a Playwright browser window, save the authenticated session state to a file, then reuse that saved state in subsequent automated runs without re-entering credentials each time.

Playwright saves session state as a JSON file containing cookies, local storage, and session tokens. Load it into a new browser context with browser.new_context(storage_state="auth.json") and the context starts already logged in. Session files expire when the underlying cookies expire — typically anywhere from a few hours to 30 days depending on the site’s remember-me policy. Refresh the session file by running the login flow again before the old session expires. Never commit session files to version control since they contain authentication credentials that provide full access to the account.
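A minimal sketch of that save-and-reuse flow is shown below. save_session and new_logged_in_context are hypothetical helpers that take an already-launched Playwright browser; the selector you pass should match an element that only appears after login, and the five-minute timeout for the manual login is an assumption.

```python
AUTH_FILE = "auth.json"  # hypothetical path; add it to .gitignore

async def save_session(browser, login_url: str, logged_in_selector: str) -> None:
    """Open the login page, wait for a manual login, then save the session."""
    context = await browser.new_context()
    page = await context.new_page()
    await page.goto(login_url)
    # Log in by hand in the visible window; this waits up to 5 minutes
    await page.wait_for_selector(logged_in_selector, timeout=300_000)
    # Writes cookies and local storage to AUTH_FILE as JSON
    await context.storage_state(path=AUTH_FILE)
    await context.close()

async def new_logged_in_context(browser):
    """Create a context that starts with the saved cookies and storage."""
    return await browser.new_context(storage_state=AUTH_FILE)
```

For the first run, launch the browser with pw.chromium.launch(headless=False) so the login window is visible; later automated runs can stay headless and call new_logged_in_context.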

Screenshot-Based Analysis

Playwright can take full-page screenshots, and Ollama’s vision-capable models can analyse images. This opens up a different class of scraping task: instead of extracting text and sending it to a text model, you take a screenshot of the page and send it to a multimodal model for visual analysis. This works well for pages with complex visual layouts — charts, graphs, diagrams, infographics — where the visual presentation carries information that is lost in the plain text extraction.

Take a screenshot with await page.screenshot(path="page.png", full_page=True), encode it as base64, and include it in the Ollama API request alongside a text prompt using the messages format that includes an images field. Models like llava and gemma3 handle image inputs and can describe what they see, extract text from images, answer questions about visual content, and classify what type of page is shown. This approach is particularly useful for monitoring dashboards, extracting data from charts, and analysing pages where the visual structure matters as much as the text content.
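The request shape described above can be sketched as follows. build_vision_request and describe_page are hypothetical helpers; the code assumes the llava model has been pulled, that page is an open Playwright page, and that client is an httpx.AsyncClient. Here the screenshot is kept in memory rather than written to page.png.

```python
import base64

VISION_MODEL = "llava"  # assumes `ollama pull llava` has been run

def build_vision_request(image_bytes: bytes, prompt: str, model: str = VISION_MODEL) -> dict:
    """Build an /api/chat payload with one base64-encoded image attached."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }

async def describe_page(page, client, prompt: str,
                        ollama_url: str = "http://localhost:11434") -> str:
    """Screenshot a Playwright page and ask a vision model about it."""
    png = await page.screenshot(full_page=True)  # returns the image as bytes
    resp = await client.post(f"{ollama_url}/api/chat",
                             json=build_vision_request(png, prompt))
    resp.raise_for_status()
    return resp.json()["message"]["content"]
```

Vision requests are slower than text requests, so give the client a generous timeout, such as the 120 seconds used earlier.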

Playwright vs httpx for Ollama Workflows

For simple scraping tasks where the page content is fully server-rendered — news articles, blog posts, Wikipedia, documentation sites — the httpx library is faster, lighter, and simpler than Playwright. It makes a plain HTTP request and returns the HTML, which you then parse with BeautifulSoup or extract text from directly. Playwright is the right choice when you need JavaScript execution (single-page apps, infinite scroll, lazy-loaded content), browser interaction (clicking buttons, filling forms, handling popups), session management (authenticated scraping), or screenshot capture. If a page works fine when you disable JavaScript in your browser, it will work fine with httpx. If it breaks, you need Playwright.

A common pattern in production Ollama scraping pipelines is to use httpx for the majority of requests and fall back to Playwright for the subset of URLs that require JavaScript. Check whether httpx returns useful content first — if the response body contains the data you need, use it. If the body is empty or contains only a loading spinner, switch to Playwright. This keeps the pipeline fast and resource-efficient for the easy cases while correctly handling the hard ones.
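A minimal sketch of that fallback follows. looks_rendered is a rough, hypothetical heuristic (the 500-character threshold is an assumption), and fetch_html takes the two fetchers as parameters so you can plug in your own httpx and Playwright functions.

```python
import re

def looks_rendered(html: str, min_text_chars: int = 500) -> bool:
    """Rough heuristic: does the static HTML contain enough visible text?"""
    # Drop script/style bodies first, then strip the remaining tags
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", without_code)
    return len(" ".join(text.split())) >= min_text_chars

async def fetch_html(url: str, fetch_static, fetch_rendered) -> tuple[str, str]:
    """Try the cheap static fetch first; fall back to a browser if needed.

    fetch_static and fetch_rendered are async callables taking a URL and
    returning HTML, e.g. thin wrappers around httpx and Playwright.
    """
    try:
        html = await fetch_static(url)
        if looks_rendered(html):
            return html, "httpx"
    except Exception:
        pass  # network or HTTP error: let the browser path try instead
    return await fetch_rendered(url), "playwright"
```

The second element of the returned tuple records which path served the page, which is useful for logging how often the fallback triggers.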

Storing and Querying Scraped Content

For research workflows that accumulate scraped content over time, a simple SQLite database is more practical than flat files. Store each scraped page as a row with the URL, scrape timestamp, raw text, and the Ollama-generated summary. Query by date to review content scraped in a specific period, by URL pattern to filter by source domain, or full-text search across summaries to find pages covering a specific topic. Python’s built-in sqlite3 module handles this without any additional dependencies, and the single-file database format makes it easy to back up, move between machines, and open in any SQLite client for ad-hoc analysis.
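As a sketch of the storage layer described above, the following uses only sqlite3 from the standard library. The table layout, the scraped.db file name, and the helper names are assumptions. The search uses a simple LIKE substring match; SQLite's FTS5 extension is an option for real full-text search where your build includes it.

```python
import sqlite3
from datetime import datetime, timezone

def open_db(path: str = "scraped.db") -> sqlite3.Connection:
    """Open (or create) the database and make sure the pages table exists."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url        TEXT NOT NULL,
            scraped_at TEXT NOT NULL,   -- ISO 8601 UTC timestamp
            raw_text   TEXT NOT NULL,
            summary    TEXT NOT NULL
        )
    """)
    return conn

def save_page(conn: sqlite3.Connection, url: str, raw_text: str, summary: str) -> None:
    """Insert one scraped page with the current UTC time."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO pages (url, scraped_at, raw_text, summary) VALUES (?, ?, ?, ?)",
            (url, datetime.now(timezone.utc).isoformat(), raw_text, summary),
        )

def search_summaries(conn: sqlite3.Connection, term: str) -> list[tuple[str, str]]:
    """Return (url, summary) rows whose summary contains term, newest first."""
    rows = conn.execute(
        "SELECT url, summary FROM pages WHERE summary LIKE ? ORDER BY scraped_at DESC",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

Because the timestamps are ISO 8601 strings in UTC, ordinary string comparison in a WHERE clause is enough to filter by date range.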

Add embeddings alongside summaries to enable semantic search across your scraped content. Store the nomic-embed-text embedding for each page’s summary, and at query time embed the search query and find the most similar stored embeddings using cosine similarity. This gives you a personal searchable knowledge base of everything you have scraped and analysed — a local alternative to commercial research tools that keeps all your data under your control and processes new content instantly rather than waiting for cloud indexing.
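The similarity search can be sketched in plain Python for small collections. cosine_similarity, top_matches, and embed are hypothetical helpers; embed assumes nomic-embed-text has been pulled and that client is an httpx.AsyncClient, and it calls Ollama's /api/embed endpoint.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_matches(query_vec: list[float], stored: dict[str, list[float]],
                k: int = 3) -> list[tuple[str, float]]:
    """Rank stored URL -> embedding pairs by similarity to the query vector."""
    scored = [(url, cosine_similarity(query_vec, vec)) for url, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

async def embed(client, text: str, ollama_url: str = "http://localhost:11434") -> list[float]:
    """Request an embedding from Ollama; assumes `ollama pull nomic-embed-text`."""
    resp = await client.post(f"{ollama_url}/api/embed",
                             json={"model": "nomic-embed-text", "input": text})
    resp.raise_for_status()
    return resp.json()["embeddings"][0]
```

A linear scan like this is fine for a few thousand pages; beyond that a vector index is worth considering.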

Production Considerations

For scraping workflows that run on a schedule — daily news digests, weekly competitive analysis, hourly price monitoring — wrap the Playwright and Ollama calls in proper error handling with retry logic. Network requests fail, sites go down, Ollama occasionally returns unexpected responses. Log errors with enough context to diagnose what went wrong — the URL, the error message, and the timestamp — so you can review failures and fix systematic problems without losing track of which pages were successfully processed and which need reprocessing.
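The retry-and-log advice can be sketched as a small wrapper with exponential backoff. with_retries is a hypothetical helper; the three attempts, the one-second base delay, and the logger name are assumptions.

```python
import asyncio
import logging
import random

logger = logging.getLogger("scraper")

async def with_retries(make_call, url: str, attempts: int = 3, base_delay: float = 1.0):
    """Await make_call() up to `attempts` times with exponential backoff.

    make_call is a zero-argument callable returning a fresh awaitable,
    e.g. lambda: scrape_and_summarise(url).
    """
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception as exc:
            # The timestamp is added by logging if the format includes %(asctime)s
            logger.warning("attempt %d/%d failed for %s: %s", attempt, attempts, url, exc)
            if attempt == attempts:
                raise  # out of retries: let the caller record the failure
            # Delays of roughly 1s, 2s, 4s, ... plus a little jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1 * base_delay)
            await asyncio.sleep(delay)
```

Configure the log format once with logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s") so every failure line carries its timestamp.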

For long-running scraping jobs, monitor memory usage. Playwright browser contexts accumulate memory over time, particularly on JavaScript-heavy sites that don’t clean up properly. Close contexts explicitly after each page rather than relying on garbage collection, and restart the browser instance periodically for very long batches. A simple pattern is to close and reopen the browser every 50 to 100 pages, which keeps memory usage flat regardless of how long the job runs.
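The restart pattern can be sketched as a loop over fixed-size chunks of the URL list. scrape_in_batches is a hypothetical helper; launch_browser and scrape_page are callables you supply, and the default of 50 pages per browser comes from the range suggested above.

```python
async def scrape_in_batches(urls: list[str], launch_browser, scrape_page,
                            pages_per_browser: int = 50) -> list:
    """Restart the browser every `pages_per_browser` URLs to keep memory flat.

    launch_browser: async callable returning a browser,
                    e.g. lambda: pw.chromium.launch()
    scrape_page:    async callable taking (context, url) and returning a result
    """
    results = []
    for start in range(0, len(urls), pages_per_browser):
        browser = await launch_browser()
        try:
            for url in urls[start:start + pages_per_browser]:
                context = await browser.new_context()
                try:
                    results.append(await scrape_page(context, url))
                finally:
                    await context.close()  # release the context explicitly
        finally:
            await browser.close()  # a fresh browser starts on the next chunk
    return results
```

This version processes pages sequentially for clarity; it can be combined with the semaphore approach from batch_scrape inside each chunk.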

Playwright and Ollama represent a genuinely new capability in automation: a scraper that does not just collect data but understands it. Traditional web scrapers return raw text or structured fields that require post-processing; a Playwright plus Ollama pipeline returns analysed, summarised, and classified information ready for decision-making. For researchers, analysts, and developers who need to make sense of large amounts of web content quickly, this combination is one of the most practically useful things you can build with a local LLM.

Start with the basic scrape-and-summarise pattern, get it working reliably on the sites you care about, then add the research assistant and monitor patterns as your needs grow. The investment in setting up a solid scraping foundation pays back every time you need to process a batch of pages that would take hours to read manually.