How to Summarise Audio and Podcasts Locally with Ollama

Transcribing and summarising audio locally — podcasts, meetings, interviews, voice notes — requires two tools: Whisper for speech-to-text and Ollama for the summarisation step. Both run entirely offline with no cloud API. This guide covers the full pipeline from audio file to structured summary, with options for different hardware and audio lengths.

The Two-Step Pipeline

Local audio summarisation works in two stages. First, faster-whisper (a fast CPU/GPU implementation of OpenAI’s Whisper model) transcribes the audio to text. Second, Ollama summarises, extracts key points, or answers questions about the transcript. The two tools are independent and complement each other — Whisper handles the speech recognition task that LLMs are not designed for, and Ollama handles the language understanding task that Whisper cannot do.

Installation

# Install faster-whisper (faster than openai-whisper, same quality)
pip install faster-whisper

# Install ffmpeg (required for audio format conversion)
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg
# Windows (via Chocolatey)
choco install ffmpeg

# Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2  # for summarisation

Basic Transcription with faster-whisper

from faster_whisper import WhisperModel

def transcribe(audio_path: str, model_size: str = 'base') -> str:
    """
    model_size options: tiny, base, small, medium, large-v3
    - tiny/base: fast, lower accuracy (good for clear speech)
    - small/medium: balanced (recommended for podcasts)
    - large-v3: best accuracy, slowest
    """
    model = WhisperModel(model_size, device='auto', compute_type='auto')
    segments, info = model.transcribe(audio_path, beam_size=5)
    print(f'Detected language: {info.language} ({info.language_probability:.0%})')
    return ' '.join(seg.text.strip() for seg in segments)

# Usage
text = transcribe('podcast_episode.mp3', model_size='small')
print(f'Transcript length: {len(text.split())} words')

Full Pipeline: Transcribe + Summarise

from faster_whisper import WhisperModel
import ollama

def transcribe_and_summarise(
    audio_path: str,
    whisper_model: str = 'small',
    ollama_model: str = 'llama3.2',
    summary_style: str = 'bullets'
) -> dict:
    # Step 1: Transcribe
    print('Transcribing...')
    wmodel = WhisperModel(whisper_model, device='auto', compute_type='auto')
    segments, info = wmodel.transcribe(audio_path, beam_size=5)
    transcript = ' '.join(seg.text.strip() for seg in segments)
    word_count = len(transcript.split())
    print(f'Transcribed {word_count} words ({info.language})')

    # Step 2: Handle long transcripts by chunking if needed
    MAX_WORDS = 6000  # chunk size in words; Ollama's default context window is often smaller than this, so num_ctx is raised below
    words = transcript.split()
    if len(words) > MAX_WORDS:
        print(f'Transcript too long ({len(words)} words), summarising in chunks...')
        chunk_summaries = []
        for i in range(0, len(words), MAX_WORDS):
            chunk = ' '.join(words[i:i+MAX_WORDS])
            r = ollama.chat(
                model=ollama_model,
                messages=[{'role':'user','content':f'Summarise this transcript section briefly:\n\n{chunk}'}],
                options={'temperature': 0.3, 'num_ctx': 8192}
            )
            chunk_summaries.append(r['message']['content'])
        combined = '\n\n'.join(chunk_summaries)
    else:
        combined = transcript

    # Step 3: Final summary
    prompts = {
        'bullets': 'Extract the 5-7 most important points as bullet points.',
        'paragraph': 'Write a 3-paragraph summary of the key ideas.',
        'tldr': 'Write a 2-sentence TL;DR summary.',
        'chapters': 'Identify the main topics/chapters with a one-sentence description of each.'
    }
    prompt = prompts.get(summary_style, prompts['bullets'])
    print('Summarising...')
    response = ollama.chat(
        model=ollama_model,
        messages=[{'role':'user','content':f'{prompt}\n\nTranscript:\n{combined}'}],
        options={'temperature': 0.3, 'num_ctx': 8192}
    )
    return {
        'transcript': transcript,
        'summary': response['message']['content'],
        'language': info.language,
        'word_count': word_count
    }

# Example usage
result = transcribe_and_summarise(
    'lex_fridman_ep500.mp3',
    whisper_model='small',
    ollama_model='llama3.2',
    summary_style='bullets'
)
print(result['summary'])

Extracting Action Items from Meeting Recordings

def extract_action_items(audio_path: str) -> dict:
    result = transcribe_and_summarise(audio_path, summary_style='bullets')
    # Second pass for action items
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role':'user','content':
            f'From this meeting transcript, extract:\n1. Action items (who does what by when)\n2. Decisions made\n3. Open questions\n\nTranscript:\n{result["transcript"][:8000]}'
        }],
        options={'temperature':0.1}
    )
    result['action_items'] = response['message']['content']
    return result

Command-Line Script

#!/usr/bin/env python3
# summarise_audio.py — python summarise_audio.py recording.mp3 --style bullets
import argparse
from faster_whisper import WhisperModel
import ollama

# The transcribe_and_summarise function from the Full Pipeline section must be
# available in this file: paste it in here or import it from wherever you saved it.

parser = argparse.ArgumentParser()
parser.add_argument('audio', help='Path to audio file')
parser.add_argument('--style', choices=['bullets','paragraph','tldr','chapters'], default='bullets')
parser.add_argument('--whisper', default='small', help='Whisper model size')
parser.add_argument('--model', default='llama3.2', help='Ollama model')
parser.add_argument('--save', help='Save transcript to file')
args = parser.parse_args()

result = transcribe_and_summarise(args.audio, args.whisper, args.model, args.style)
if args.save:
    with open(args.save, 'w') as f:
        f.write(result['transcript'])
    print(f'Transcript saved to {args.save}')
print('\n--- SUMMARY ---')
print(result['summary'])

Why Local Audio Processing Matters

Audio transcription and summarisation are among the most privacy-sensitive AI tasks. Meeting recordings contain confidential business discussions, salary negotiations, strategic plans, and personal conversations. Podcast notes and interview transcripts may include unreleased information, personal opinions shared informally, or content under NDA. Sending this audio to a cloud API — even a reputable one — means the content leaves your control and enters someone else’s infrastructure, logging pipelines, and potential training datasets. For personal productivity, this may be an acceptable trade-off. For professional use with sensitive content, running the pipeline locally eliminates the risk entirely.

The practical quality of local transcription with faster-whisper has reached a level where it is difficult to justify cloud alternatives for most use cases. The small Whisper model achieves word error rates comparable to cloud services on clear speech, and the medium/large models often match or exceed them. Combined with Ollama’s summarisation capability, the local pipeline produces results that are genuinely competitive with cloud-based audio intelligence tools — at zero per-minute cost and with no data leaving your machine.

Choosing the Right Whisper Model Size

The faster-whisper model size significantly affects both speed and accuracy. For a 1-hour podcast on a modern laptop with CPU inference, approximate processing times are: tiny (5–8 minutes), base (8–12 minutes), small (15–25 minutes), medium (35–55 minutes), large-v3 (75–120 minutes). On a GPU, all sizes run 4–8× faster. For most podcast and meeting transcription, the small model provides the best balance — good accuracy on clear speech with reasonable processing time. Upgrade to medium or large-v3 if you are working with accented speech, technical jargon, multiple speakers in a noisy environment, or languages other than English where the smaller models underperform.

The device='auto' and compute_type='auto' settings let faster-whisper choose the optimal device (GPU if available, CPU otherwise) and precision level automatically. On Apple Silicon, faster-whisper currently runs on the CPU, which it uses efficiently; check the faster-whisper documentation for the latest guidance on Apple Silicon acceleration, as this is an area of active development.
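
If you want to pin the device and precision explicitly rather than relying on auto-detection, here is a minimal sketch (the compute types actually available depend on your hardware and CTranslate2 build):

from faster_whisper import WhisperModel

# NVIDIA GPU: float16 is the usual choice (requires CUDA and cuDNN installed)
gpu_model = WhisperModel('small', device='cuda', compute_type='float16')

# CPU-only machine (including Apple Silicon): int8 quantisation keeps memory
# use and runtime down at a small accuracy cost
cpu_model = WhisperModel('small', device='cpu', compute_type='int8')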

Handling Different Audio Formats

faster-whisper accepts most common audio formats directly (MP3, WAV, M4A, FLAC, OGG) as long as ffmpeg is installed. For video files (MP4, MKV, MOV), extract the audio first:

# Extract audio from video
ffmpeg -i recording.mp4 -vn -acodec mp3 audio.mp3

# Convert to optimal format for whisper (16kHz mono WAV)
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Split a long file into 30-minute chunks for parallel processing
ffmpeg -i long_podcast.mp3 -f segment -segment_time 1800 -c copy chunk_%03d.mp3

Speaker Diarisation (Who Said What)

faster-whisper alone does not identify individual speakers — it produces a single transcript stream. For meeting recordings where you need to know who said what, add pyannote.audio for speaker diarisation:

pip install pyannote.audio
# Requires HuggingFace account and model access token
# See pyannote.audio docs for setup
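
As a rough sketch of how diarisation fits in once that setup is done (the model name below is the commonly used 3.x checkpoint and requires accepting its licence on Hugging Face; the exact API can vary between pyannote.audio versions):

from pyannote.audio import Pipeline

# Placeholder token -- create one at huggingface.co/settings/tokens
pipeline = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token='YOUR_HF_TOKEN'
)

diarization = pipeline('meeting.wav')
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Each turn carries start/end times in seconds and a label like SPEAKER_00,
    # which you can merge with Whisper segments by timestamp overlap
    print(f'{turn.start:.1f}s - {turn.end:.1f}s: {speaker}')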

For most podcast and meeting summarisation use cases, speaker identification is not critical — the key points and action items are valuable regardless of which participant said them. Speaker diarisation adds significant complexity and compute time, so evaluate whether your specific use case genuinely requires it before adding it to the pipeline.

Batch Processing Multiple Files

from pathlib import Path
from faster_whisper import WhisperModel
import ollama

def batch_summarise(folder: str, output_folder: str, whisper_model='small'):
    Path(output_folder).mkdir(exist_ok=True)
    audio_exts = {'.mp3','.wav','.m4a','.flac','.ogg'}
    files = [f for f in Path(folder).iterdir() if f.suffix.lower() in audio_exts]
    print(f'Processing {len(files)} files...')
    wmodel = WhisperModel(whisper_model, device='auto', compute_type='auto')
    for audio_file in files:
        print(f'\nProcessing: {audio_file.name}')
        segments, _ = wmodel.transcribe(str(audio_file), beam_size=5)
        transcript = ' '.join(s.text.strip() for s in segments)
        words = transcript.split()
        text = ' '.join(words[:6000]) if len(words) > 6000 else transcript
        r = ollama.chat(
            model='llama3.2',
            messages=[{'role':'user','content':f'Summarise in 5 bullet points:\n\n{text}'}],
            options={'temperature': 0.3, 'num_ctx': 8192}  # raise context so long transcripts fit
        )
        out_path = Path(output_folder) / (audio_file.stem + '_summary.txt')
        out_path.write_text(f'FILE: {audio_file.name}\n\nSUMMARY:\n{r["message"]["content"]}\n\nTRANSCRIPT:\n{transcript}')
        print(f'Saved to {out_path}')

batch_summarise('podcasts/', 'summaries/')

Performance on Common Hardware

For a 1-hour podcast using the small Whisper model followed by Llama 3.2 (3B) summarisation: on a modern Apple Silicon Mac (M2 Pro), expect roughly 20 minutes total processing time, with Whisper running on the CPU and the summary step completing quickly. On a machine with an NVIDIA RTX 3080, expect 6–10 minutes total with GPU-accelerated Whisper. On a CPU-only laptop without Apple Silicon, expect 35–50 minutes. The processing is not interactive — you start the script, come back later, and have the summary waiting. For regular podcast listeners or people who record lots of meetings, setting up a folder-based batch script and running it overnight is the most practical workflow.

Whisper Model Quality vs Speed Trade-offs in Practice

Choosing the right Whisper model size for your workflow comes down to two factors: audio quality and your patience budget. The tiny and base models are genuinely impressive for what they are — they transcribe clear, clean audio from a single native English speaker with surprisingly low error rates. Where they struggle is with accents, overlapping speech, background noise, technical vocabulary, and languages other than English. If you are transcribing professionally recorded podcasts with good microphones and clear speakers, base is often sufficient and your pipeline runs 3–4× faster than with small. If you are transcribing conference call recordings, interviews in noisy environments, or content in non-English languages, medium or large-v3 is worth the extra processing time because the accuracy difference is meaningful enough to affect the quality of the downstream summarisation.

The summarisation quality from Ollama is also sensitive to transcript accuracy — garbage in, garbage out. A transcript full of errors from an undersized Whisper model produces a summary that may miss key points or include nonsensical content. Spending extra minutes on a higher-quality transcription often produces better final summaries than using a faster model and a more sophisticated summarisation prompt. If you find your summaries miss important information or contain odd phrases, upgrading the Whisper model size is usually the most effective first fix, before investing time in prompt engineering.

Integrating with Your Existing Workflow

The most practical integration for regular users is a folder-watch script that automatically processes new audio files as they appear. Tools like watchdog (Python) can monitor a directory and trigger the pipeline whenever a new file is added — drop a meeting recording into a folder, and find the summary waiting when you check back later. For podcast listeners, a simple cron job that runs the batch processor on a podcasts download folder each morning processes overnight downloads and has summaries ready when you start your day. These workflow integrations transform the pipeline from a tool you run manually into infrastructure that quietly works in the background, making local AI summarisation a passive productivity enhancement rather than an active task.
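
Here is a minimal folder-watch sketch using watchdog (the recordings/ folder name and the call into transcribe_and_summarise from earlier are assumptions, not a fixed convention):

import time
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

AUDIO_EXTS = {'.mp3', '.wav', '.m4a', '.flac', '.ogg'}

class AudioHandler(FileSystemEventHandler):
    def on_created(self, event):
        path = Path(event.src_path)
        if event.is_directory or path.suffix.lower() not in AUDIO_EXTS:
            return
        # In practice you may want to wait until the file has finished copying
        print(f'New recording detected: {path.name}')
        result = transcribe_and_summarise(str(path))  # function defined earlier
        path.with_suffix('.summary.txt').write_text(result['summary'])

observer = Observer()
observer.schedule(AudioHandler(), 'recordings/', recursive=False)
observer.start()
try:
    while True:
        time.sleep(10)
except KeyboardInterrupt:
    observer.stop()
observer.join()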

The combination of Whisper and Ollama for audio processing represents one of the clearest demonstrations of what the local AI stack can do that cloud tools struggle to match on cost and privacy simultaneously. No per-minute transcription fees, no data leaving your machine, and quality that is genuinely competitive with commercial services for most audio types. For anyone who regularly consumes long-form audio content or records meetings, building this pipeline once pays dividends every day it runs.

Q&A Over Audio Content

Beyond summarisation, the transcript enables a question-answering workflow — ask specific questions about the audio content rather than getting a fixed summary format. This is particularly useful for long technical podcasts or recorded lectures where you want to retrieve specific information:

def ask_about_audio(audio_path: str, question: str, whisper_model='small') -> str:
    wmodel = WhisperModel(whisper_model, device='auto', compute_type='auto')
    segments, _ = wmodel.transcribe(audio_path, beam_size=5)
    transcript = ' '.join(s.text.strip() for s in segments)
    words = transcript.split()
    text = ' '.join(words[:8000]) if len(words) > 8000 else transcript
    response = ollama.chat(
        model='llama3.2',
        messages=[{
            'role': 'user',
            'content': f'Answer this question based on the transcript.\nIf not covered, say so.\n\nQuestion: {question}\n\nTranscript:\n{text}'
        }],
        options={'temperature': 0.2, 'num_ctx': 8192}  # raise context so the truncated transcript fits
    )
    return response['message']['content']

# Examples
print(ask_about_audio('interview.mp3', 'What advice did they give about career transitions?'))
print(ask_about_audio('lecture.mp3', 'What were the three main criticisms of the approach?'))

This Q&A pattern is more flexible than fixed-format summaries for exploratory use — you can ask follow-up questions about different aspects of the same transcript without re-running the expensive Whisper transcription step. Cache the transcript to disk after the first run and reload it for subsequent questions against the same audio file.
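
A simple way to implement that caching (the .transcript.txt file sitting next to the audio is just an illustrative convention):

from pathlib import Path
from faster_whisper import WhisperModel

def get_transcript(audio_path: str, whisper_model: str = 'small') -> str:
    # Reuse a cached transcript next to the audio file if one exists
    cache_path = Path(audio_path).with_suffix('.transcript.txt')
    if cache_path.exists():
        return cache_path.read_text()
    model = WhisperModel(whisper_model, device='auto', compute_type='auto')
    segments, _ = model.transcribe(audio_path, beam_size=5)
    transcript = ' '.join(s.text.strip() for s in segments)
    cache_path.write_text(transcript)
    return transcript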

Getting Started

The minimum viable setup is three commands: pip install faster-whisper, brew install ffmpeg (or your OS's equivalent), and ollama pull llama3.2. Copy the transcribe_and_summarise function from this article, point it at an audio file, and you have a working pipeline. Refine the Whisper model size and Ollama model based on your hardware and quality requirements. The entire setup takes under 15 minutes for most developers, and the result is a reusable tool you own completely — no API keys, no monthly costs, no data leaving your machine.

Alternative: openai-whisper vs faster-whisper

faster-whisper is recommended over the original openai-whisper package for most users because it runs 2–4× faster on the same hardware using CTranslate2 as the backend, with identical transcription quality. The API is slightly different but the concepts are the same. If you already have openai-whisper installed and working, there is no urgent need to switch — both produce the same quality output. For new installations, faster-whisper is the better starting point. A third option is whisper.cpp, a C++ port that is particularly fast on Apple Silicon and can be called from Python via ctypes or a subprocess — worth considering if you are processing very large volumes and need maximum performance on Mac hardware. All three share the same underlying Whisper model weights and produce equivalent transcription quality at the same model size; the differences are purely in inference speed and integration convenience.
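
If you do go the whisper.cpp route, a subprocess call is the simplest integration. Treat the following as a sketch only: the binary name, flags, and output naming vary between whisper.cpp versions, so check your build's help output before relying on it.

import subprocess
from pathlib import Path

def transcribe_with_whisper_cpp(wav_path: str, model_path: str = 'models/ggml-base.bin') -> str:
    # whisper.cpp expects 16 kHz mono WAV input (see the ffmpeg conversion above);
    # -otxt asks it to write the transcript to <input>.txt alongside the audio
    subprocess.run(['./main', '-m', model_path, '-f', wav_path, '-otxt'], check=True)
    return Path(wav_path + '.txt').read_text()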

Building a Personal Podcast Intelligence System

The natural evolution of the basic pipeline is a personal system that tracks which podcasts you have processed, stores transcripts for later retrieval, and lets you search and query across your archive. A simple SQLite database — one table for files, one for transcript text — combined with nomic-embed-text embeddings enables semantic search across months of processed audio content. When you want to find everything a particular person said about a specific topic across dozens of interviews, or retrieve all the book recommendations mentioned across a podcast feed, the combination of Whisper transcription and local embeddings makes this possible with no external service dependencies. It is one of the more compelling practical demonstrations of what a complete local AI stack can do when the individual pieces — transcription, language models, embeddings — are combined into a purposeful application.
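
Here is a minimal sketch of that archive, assuming the ollama Python client's embeddings call and a brute-force cosine search (adequate for a few hundred transcripts; swap in a proper vector store if the archive grows):

import json
import sqlite3
import ollama  # requires: ollama pull nomic-embed-text

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
    return dot / norm if norm else 0.0

conn = sqlite3.connect('podcast_archive.db')
conn.execute('CREATE TABLE IF NOT EXISTS episodes (file TEXT PRIMARY KEY, transcript TEXT, embedding TEXT)')

def add_episode(filename: str, transcript: str):
    # Embed the opening of the transcript; chunking gives better recall on long episodes
    emb = ollama.embeddings(model='nomic-embed-text', prompt=transcript[:8000])['embedding']
    conn.execute('INSERT OR REPLACE INTO episodes VALUES (?, ?, ?)',
                 (filename, transcript, json.dumps(emb)))
    conn.commit()

def search(query: str, top_k: int = 3):
    q = ollama.embeddings(model='nomic-embed-text', prompt=query)['embedding']
    rows = conn.execute('SELECT file, transcript, embedding FROM episodes').fetchall()
    ranked = sorted(rows, key=lambda r: cosine(q, json.loads(r[2])), reverse=True)
    return [(f, t[:200]) for f, t, _ in ranked[:top_k]]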

Start simple — one audio file, the basic transcribe_and_summarise function, and see what quality you get. Then scale to batch processing, add Q&A capability, and eventually build the archive system if the use case justifies it. The progression from simple script to personal intelligence system follows naturally from a working foundation.

The privacy argument alone is worth the setup time for anyone handling sensitive audio content professionally. For everyone else, the zero ongoing cost and offline capability make this one of the most practically valuable tools in the local AI toolkit: most developers can be up and running in an afternoon and using it productively every day thereafter. The setup cost is low, the daily utility compounds with every recording you process, and the value accrues quietly in the background, one file at a time.
