How to Evaluate Ollama Prompts with Langfuse

Langfuse is an open-source LLM observability platform — it records every Ollama prompt and response, tracks latency and token usage, and lets you compare prompt versions to understand what changes actually improve output quality. This guide covers integrating Langfuse with Ollama in Python for systematic prompt evaluation.

Setup

pip install langfuse ollama
# Self-host Langfuse:
docker run -d -p 3000:3000 langfuse/langfuse
from langfuse import Langfuse
import ollama, os

lf = Langfuse(
    public_key=os.getenv('LANGFUSE_PUBLIC_KEY', 'pk-lf-local'),
    secret_key=os.getenv('LANGFUSE_SECRET_KEY', 'sk-lf-local'),
    host=os.getenv('LANGFUSE_HOST', 'http://localhost:3000')
)

Tracing Ollama Calls

def traced_chat(prompt: str, model: str = 'llama3.2', session_id: str = None) -> str:
    trace = lf.trace(name='chat', session_id=session_id)
    generation = trace.generation(
        name='ollama-chat',
        model=model,
        input=[{'role': 'user', 'content': prompt}]
    )
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}]
    )
    content = response['message']['content']
    generation.end(
        output=content,
        usage={
            'prompt_tokens': response.get('prompt_eval_count', 0),
            'completion_tokens': response.get('eval_count', 0)
        }
    )
    trace.update(output=content)
    return content

result = traced_chat('Explain Docker networking in one paragraph.')

Prompt Versioning and A/B Testing

# Create prompts in Langfuse UI, then fetch by name + version
prompt_v1 = lf.get_prompt('summarise', version=1)
prompt_v2 = lf.get_prompt('summarise', version=2)

def evaluate_prompts(test_inputs: list[str]) -> dict:
    results = {'v1': [], 'v2': []}
    for text in test_inputs:
        for version, prompt in [('v1', prompt_v1), ('v2', prompt_v2)]:
            compiled = prompt.compile(text=text)
            output = traced_chat(compiled, session_id=f'{version}-eval')
            results[version].append({'input': text[:50], 'output': output[:100]})
    return results

# Compare outputs in Langfuse dashboard after running

Score Tracking

def rate_output(trace_id: str, score: float, comment: str = ''):
    """Record human or automated quality scores (0-1)."""
    lf.score(trace_id=trace_id, name='quality', value=score, comment=comment)

# Automated scoring with a judge model
def auto_score(prompt: str, response: str) -> float:
    judge = ollama.chat(
        model='llama3.2',
        messages=[{
            'role': 'user',
            'content': f'Rate this response 0-10 (10=excellent).\nQ: {prompt}\nA: {response}\nReturn only the number.'
        }],
        options={'temperature': 0}
    )
    try:
        return float(judge['message']['content'].strip()) / 10
    except:
        return 0.5

What to Measure

Langfuse is most useful for tracking metrics that are hard to evaluate by eye when running individual tests. Prompt latency across model sizes (measured automatically via the generation end timestamp), output length consistency across prompt versions, failure rates (tracked via trace status), and quality scores from either human raters or automated judge models all become visible in Langfuse’s dashboard after instrumentation. The A/B prompt comparison feature is particularly useful when iterating on a prompt — rather than relying on anecdotal impressions from a few manual tests, you run both versions against a fixed test set and see the quality distribution difference directly. For production Ollama deployments, Langfuse complements the Prometheus metrics approach from this series: Prometheus for infrastructure-level observability, Langfuse for AI-quality-level observability.

Getting Started

Self-host Langfuse with the single Docker command above, create an API key in the web interface at localhost:3000, add the traced_chat wrapper to your application, and run a few queries. Every call appears in the Langfuse dashboard with full prompt, response, latency, and token usage visible. Add scoring to the calls you care most about evaluating, create prompt versions for the prompts you are actively iterating on, and use the dashboard’s comparison view to make data-driven decisions about prompt changes rather than guessing from manual spot-checks.

Why Prompt Evaluation Matters

Iterating on prompts without systematic evaluation is one of the most common ways AI projects stagnate. A developer changes a prompt, tries it on three examples, thinks it is better, and ships it — only to find edge cases in production that the manual test set never covered. Langfuse solves this by making it easy to: record every prompt and response persistently, compare performance across prompt versions on a fixed evaluation set, track quality scores over time, and spot regressions before they reach production. For teams using Ollama, where the inference is local and free, the main cost of thorough evaluation is time — and Langfuse reduces that cost significantly by automating the recording and comparison workflow.

Self-Hosted vs Cloud Langfuse

Langfuse is available as a managed cloud service (langfuse.com) and as an open-source self-hosted deployment. For Ollama users who care about data privacy, self-hosting is the natural choice — prompts and responses stay on your infrastructure rather than being sent to Langfuse’s cloud. The self-hosted deployment is a single Docker Compose stack (Langfuse app + Postgres + Redis) that takes about 10 minutes to configure and runs comfortably on the same machine as Ollama. The cloud service is simpler to set up and includes hosted collaboration features, which may be worth it for teams that do not have an objection to sending prompt data to a third-party service.

Automated Evaluation Pipelines

import json, os
from langfuse import Langfuse
import ollama

lf = Langfuse(public_key='...', secret_key='...', host='http://localhost:3000')

# Load a fixed evaluation dataset
EVAL_DATASET = [
    {'input': 'What is Docker?', 'expected_keywords': ['container', 'image', 'isolated']},
    {'input': 'Explain async/await', 'expected_keywords': ['asynchronous', 'await', 'promise']},
]

def keyword_score(response: str, keywords: list) -> float:
    found = sum(1 for kw in keywords if kw.lower() in response.lower())
    return found / len(keywords)

def run_evaluation(prompt_template: str, model: str = 'llama3.2') -> float:
    scores = []
    for item in EVAL_DATASET:
        trace = lf.trace(name='eval-run')
        gen = trace.generation(name='eval', model=model,
            input=[{'role':'user','content':prompt_template.format(q=item['input'])}])
        response = ollama.chat(model=model,
            messages=[{'role':'user','content':prompt_template.format(q=item['input'])}])
        content = response['message']['content']
        gen.end(output=content)
        score = keyword_score(content, item['expected_keywords'])
        lf.score(trace_id=trace.id, name='keyword_recall', value=score)
        scores.append(score)
    avg = sum(scores) / len(scores)
    print(f'Average keyword recall: {avg:.2f}')
    return avg

# Compare two prompt templates
v1_score = run_evaluation('Answer this question: {q}')
v2_score = run_evaluation('You are a technical expert. Answer clearly: {q}')
print(f'V1: {v1_score:.2f} vs V2: {v2_score:.2f}')

Session Grouping for Conversation Analysis

Langfuse’s session feature groups related traces together, making it easy to analyse full conversation flows rather than individual turns. Pass a consistent session_id for all turns in a conversation — Langfuse’s session view then shows the complete conversation timeline, latency per turn, and where quality scores drop. This is particularly useful for identifying conversation patterns that cause the model to lose context or produce lower-quality responses as the conversation lengthens. For RAG applications, group the embedding lookup trace and the generation trace in the same session to see the full request flow in a single view.

Langfuse in the Production Observability Stack

Langfuse complements the infrastructure-level observability from Prometheus and OpenTelemetry (both covered earlier in this series) with AI-quality-level observability. Prometheus tells you Ollama is responding in 5 seconds average; OpenTelemetry shows you which step in your pipeline that 5 seconds is spent in; Langfuse shows you whether the responses generated in those 5 seconds are actually high quality. All three layers are necessary for a complete picture of a production AI application’s health — infrastructure metrics for capacity planning, distributed tracing for performance debugging, and prompt evaluation for quality monitoring. The combination turns your Ollama deployment from a black box into a measurable, improvable system.

Getting Started

Run Langfuse locally with Docker, create API keys in the web interface, and wrap your first Ollama call with the traced_chat function from this article. Open the Langfuse dashboard after running a few queries and explore the trace view — seeing prompts, responses, and latency recorded automatically is immediately useful even before you add scoring or prompt versioning. Add the automated evaluation pipeline when you are actively iterating on prompts. The investment in Langfuse pays back fastest during the prompt development phase of a project, when systematic evaluation turns guesswork into data-driven iteration.

Connecting Langfuse to Your CI Pipeline

Automated quality gates in CI ensure that prompt changes do not regress without detection. Add a CI step that runs your evaluation dataset against both the current and previous prompt versions, compares average scores, and fails the build if quality drops below a threshold:

# ci_eval.py — run in CI after prompt changes
import sys
from langfuse import Langfuse
import ollama

THRESHOLD = 0.75  # Minimum acceptable average quality score

lf = Langfuse(public_key=..., secret_key=..., host='http://langfuse:3000')

def run_ci_eval(prompt_name: str) -> float:
    prompt = lf.get_prompt(prompt_name)  # Latest version
    scores = []
    for item in EVAL_DATASET:
        trace = lf.trace(name='ci-eval', tags=['ci'])
        gen = trace.generation(name='gen', model='llama3.2',
            input=[{'role':'user','content':prompt.compile(**item['vars'])}])
        response = ollama.chat(model='llama3.2',
            messages=[{'role':'user','content':prompt.compile(**item['vars'])}])
        content = response['message']['content']
        gen.end(output=content)
        score = evaluate(content, item['expected'])
        lf.score(trace_id=trace.id, name='quality', value=score)
        scores.append(score)
    return sum(scores)/len(scores)

score = run_ci_eval('my-prompt')
print(f'Average score: {score:.3f} (threshold: {THRESHOLD})')
if score < THRESHOLD:
    sys.exit(1)  # Fail CI

Langfuse Datasets

Langfuse’s Datasets feature provides a structured way to manage evaluation sets — you create a dataset in the UI, add items (input/expected output pairs), and run your prompt against the dataset via the SDK. This is more organised than maintaining a Python list of test cases and integrates natively with Langfuse’s run comparison view, which shows all dataset runs side-by-side with scores. For teams with more than a handful of evaluation cases, the dataset approach scales better than hardcoded test lists and gives non-technical team members a way to contribute evaluation examples through the Langfuse UI without modifying code.

The Case for Systematic Prompt Evaluation

The difference between teams that ship reliable AI features and teams that continuously struggle with prompt quality is usually not model selection or infrastructure — it is evaluation rigour. Teams that evaluate systematically catch regressions early, improve prompts confidently based on data rather than intuition, and build up a shared understanding of what good output looks like through the scored examples they accumulate over time. Langfuse makes systematic evaluation accessible without requiring a custom evaluation framework — the infrastructure is there, the patterns are documented, and the integration with Ollama takes an afternoon to set up. The investment in good evaluation practice is one of the highest-leverage things you can do for the long-term quality and reliability of an AI-powered application.

Langfuse vs Manual Evaluation

Before Langfuse, teams evaluating prompts typically maintained spreadsheets of test inputs and expected outputs, ran prompts manually, and recorded results by hand. This approach is slow, error-prone, and does not scale beyond a handful of test cases. Langfuse replaces it with automated recording, persistent storage, structured scoring, and comparison views that make systematic evaluation as fast as running your application normally. The key shift is moving from periodic manual evaluation to continuous automated evaluation — every production call is recorded, scores accumulate automatically, and quality trends are visible in a dashboard rather than requiring a dedicated evaluation session. This continuous visibility catches prompt drift (model updates or prompt changes that gradually degrade quality) before it affects enough users to generate support complaints.

Summary

Langfuse is the missing observability layer for teams building AI applications on Ollama. Infrastructure metrics show whether the service is healthy; distributed traces show where time is spent; Langfuse shows whether the outputs are actually good. Setting it up takes an afternoon, and the return is visible from the first prompt iteration that you evaluate systematically rather than by feel. For teams serious about shipping reliable AI features, adding Langfuse to the observability stack is not optional — it is the difference between shipping AI features with confidence and shipping them while hoping quality is acceptable.

Alternatives to Langfuse

Several other tools provide LLM observability that works with Ollama. Helicone is a proxy-based approach that intercepts API calls and records them without requiring code changes — convenient for quick setup, but the proxy adds a network hop. Promptlayer offers similar prompt versioning and evaluation features with a hosted-only option. Phoenix from Arize is another open-source option with strong evaluation capabilities and OpenTelemetry integration. Langfuse is recommended for Ollama deployments primarily because of its robust self-hosting support (important for keeping prompt data local) and the maturity of its evaluation and scoring features. If you are already using one of the alternatives, the patterns in this article translate directly — the concepts of traces, generations, scores, and prompt versions are consistent across the LLM observability ecosystem, even if the specific API calls differ.

Building an Evaluation Culture

Langfuse is a tool, but systematic prompt evaluation is a practice — and the practice matters more than the tool. The teams that get the most value from Langfuse are those that integrate evaluation into their normal development workflow: every prompt change is accompanied by an evaluation run, every model upgrade is validated against the existing test set, and quality scores are tracked as part of the project’s health metrics alongside uptime and latency. This level of discipline requires buy-in from the team and explicit allocation of time for evaluation — it does not happen by default just because the tooling is in place. Start by evaluating the two or three prompts that matter most to your application’s core functionality, establish a baseline score, and commit to maintaining or improving that score as the project evolves. That focused, disciplined approach produces measurable quality improvements and builds the evaluation habit that scales to the rest of the application over time.

Getting the Most from Prompt Evaluation

Prompt evaluation works best when you treat it like any other engineering discipline: start with a hypothesis (“this system prompt will improve classification accuracy”), define a measurable metric (classification accuracy on a labelled test set), run the experiment, and let the data decide. Langfuse provides the infrastructure to do this systematically rather than relying on anecdotal impressions from a handful of manual test cases. The combination of tracing (to see exactly what prompts were sent and what responses were received), datasets (to test against a consistent set of representative inputs), and evaluation scores (to measure quality in an objective, comparable way) creates a feedback loop that continuously improves your prompts without guesswork. For teams building serious local AI applications, this systematic evaluation approach is what separates production-quality prompt engineering from vibe-based iteration.

Leave a Comment