How to Build a Discord Bot with Ollama

Discord has become one of the most popular platforms for developer communities, gaming groups, and hobbyist projects alike. If you’re already running a local LLM with Ollama, building a Discord bot that connects to it is a natural next step: you get a private, free AI assistant available to your entire server, with no API costs and no prompts sent to a third-party model provider (messages still pass through Discord itself, as with any bot).

This guide walks through building a fully functional Discord bot in Python that talks to Ollama, handles multi-turn conversations, and streams responses back to users in real time. By the end you’ll have a bot running locally that your Discord server can query just like a chatbot.

What You’ll Need

Before writing any code, you need three things in place: Ollama installed and running locally with at least one model pulled, a Discord account with permission to create a bot, and Python 3.10 or later. The Discord bot library we’ll use is discord.py (version 2.0 or later, which is required for slash commands), one of the most widely used and actively maintained Python libraries for the Discord API in 2026.

You’ll also need to create a Discord application and bot token. Head to the Discord Developer Portal, create a new application, and open the Bot tab (click Add Bot if the portal hasn’t already created a bot user for you). Copy the token, using Reset Token if it is no longer displayed; you’ll need it shortly. Under OAuth2 → URL Generator, select the bot and applications.commands scopes and grant the following permissions: Send Messages, Read Message History, and Use Slash Commands. Use the generated URL to invite the bot to your server.

Project Setup

Create a new directory for the project and set up a virtual environment to keep dependencies isolated.

mkdir ollama-discord-bot
cd ollama-discord-bot
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install discord.py httpx python-dotenv

We’re using httpx for async HTTP requests to the Ollama API rather than the ollama Python package, because it gives us more direct control over streaming — which matters when you want to push partial responses to Discord as they arrive.

Create a .env file in the project root to store your bot token and configuration:

DISCORD_TOKEN=your_bot_token_here
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2
BOT_PREFIX=!

Basic Bot: Single-Turn Responses

Start with the simplest possible implementation — a bot that responds to a slash command by sending the user’s message to Ollama and returning the full response once it’s complete. This is the easiest pattern to reason about before adding streaming or conversation history.

import os
import discord
from discord import app_commands
import httpx
from dotenv import load_dotenv

load_dotenv()

TOKEN = os.getenv('DISCORD_TOKEN')
OLLAMA_URL = os.getenv('OLLAMA_BASE_URL', 'http://localhost:11434')
MODEL = os.getenv('OLLAMA_MODEL', 'llama3.2')

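# Slash commands arrive as interactions, so the default intents are enough.
# message_content is a privileged intent: requesting it without enabling it
# in the Developer Portal makes the bot fail at login.
intents = discord.Intents.default()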
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

async def ask_ollama(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=120) as http:
        resp = await http.post(
            f'{OLLAMA_URL}/api/chat',
            json={
                'model': MODEL,
                'messages': [{'role': 'user', 'content': prompt}],
                'stream': False
            }
        )
        data = resp.json()
        return data['message']['content']

@tree.command(name='ask', description='Ask the local LLM a question')
async def ask(interaction: discord.Interaction, prompt: str):
    await interaction.response.defer(thinking=True)
    answer = await ask_ollama(prompt)
    if len(answer) > 1900:
        answer = answer[:1900] + '\n... *(truncated)*'
    await interaction.followup.send(answer)

@client.event
async def on_ready():
    await tree.sync()
    print(f'Logged in as {client.user}')

client.run(TOKEN)

Save this as bot.py, run it with python bot.py, and you should see the bot come online. The /ask slash command will then appear in any channel the bot has access to. The defer(thinking=True) call is important: it makes Discord show the “Bot is thinking…” indicator immediately, which prevents the interaction from timing out while Ollama is generating a response. Without it, you have only 3 seconds to respond before Discord considers the interaction failed; after deferring, the followup can be sent for up to 15 minutes.

Streaming Responses with Message Editing

One of the nicest things about chat interfaces is seeing responses arrive word by word. Discord doesn’t support true server-sent streaming the way a web UI does, but you can approximate it by sending a placeholder message and then editing it repeatedly as new tokens arrive from Ollama. The result feels natural and gives users immediate feedback that the model is working.

import asyncio
import json

async def ask_ollama_stream(prompt: str, callback):
    """Stream a reply from Ollama, calling callback(text_so_far) after each chunk."""
    buf = ''
    async with httpx.AsyncClient(timeout=120) as http:
        async with http.stream(
            'POST',
            f'{OLLAMA_URL}/api/chat',
            json={
                'model': MODEL,
                'messages': [{'role': 'user', 'content': prompt}],
                'stream': True
            }
        ) as resp:
            # Ollama streams newline-delimited JSON: one object per line.
            async for line in resp.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                token = chunk.get('message', {}).get('content', '')
                buf += token
                await callback(buf)
                if chunk.get('done'):
                    break
    return buf

@tree.command(name='chat', description='Chat with streaming output')
async def chat(interaction: discord.Interaction, prompt: str):
    await interaction.response.defer(thinking=True)
    msg = await interaction.followup.send('...')
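    last_len = 0  # length of the full reply at the time of the last edit

    async def update(text: str):
        nonlocal last_len
        # Compare against the full text, not the displayed tail; otherwise
        # edits would stop once the reply passes 1900 characters.
        if len(text) - last_len > 50:
            # Show only the most recent 1900 characters to stay under Discord's limit.
            await msg.edit(content=text[-1900:])
            last_len = len(text)
            await asyncio.sleep(0.5)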

    final = await ask_ollama_stream(prompt, update)
    display = final[-1900:] if len(final) > 1900 else final
    await msg.edit(content=display)

The key design decision here is the edit throttle. Discord rate-limits message edits per channel; the exact limits are reported in response headers rather than fixed in the documentation, and when a bot hits one, discord.py pauses and retries, which makes the output visibly stall. By editing only when the buffer has grown by more than 50 characters and sleeping 0.5 seconds after each edit, you cap the bot at about two edits per second per response. You can tune the 50-character threshold and the sleep duration based on how fast your model generates tokens, and raise either value if updates start to lag.

Adding Multi-Turn Conversation History

A single-turn bot is useful but a conversational bot is much more powerful. To support multi-turn conversations you need to track message history per user (or per channel) and include it in each request to Ollama. The simplest approach is an in-memory dictionary keyed by Discord user ID.

from collections import defaultdict, deque

histories = defaultdict(lambda: deque(maxlen=20))

async def ask_ollama_with_history(user_id: int, user_msg: str) -> str:
    hist = histories[user_id]
    hist.append({'role': 'user', 'content': user_msg})

    async with httpx.AsyncClient(timeout=120) as http:
        resp = await http.post(
            f'{OLLAMA_URL}/api/chat',
            json={
                'model': MODEL,
                'messages': list(hist),
                'stream': False
            }
        )
        data = resp.json()
        reply = data['message']['content']
        hist.append({'role': 'assistant', 'content': reply})
        return reply

@tree.command(name='converse', description='Multi-turn conversation with memory')
async def converse(interaction: discord.Interaction, message: str):
    await interaction.response.defer(thinking=True)
    reply = await ask_ollama_with_history(interaction.user.id, message)
    if len(reply) > 1900:
        reply = reply[:1900] + '\n... *(truncated)*'
    await interaction.followup.send(reply)

@tree.command(name='reset', description='Clear your conversation history')
async def reset(interaction: discord.Interaction):
    histories[interaction.user.id].clear()
    await interaction.response.send_message('Conversation history cleared.', ephemeral=True)

Using a deque with maxlen=20 is a simple way to implement a sliding window over conversation history. When the deque is full and a new message arrives, the oldest message is automatically dropped. This keeps the history from growing indefinitely and makes it less likely that a request exceeds the model’s context window, although it caps the number of messages rather than the number of tokens, so 20 very long messages can still overflow. Llama 3.2 supports a context window of up to 128k tokens, but Ollama applies its own context length setting (num_ctx), which by default is usually much smaller, so the constraint matters even for models with a large advertised window.
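
To see the sliding window in action, and to cap history by size rather than only by message count, a sketch along these lines may help. The trim_to_budget helper and the 8,000-character default are assumptions for illustration, and characters are only a rough proxy for tokens.

```python
from collections import deque

# A deque with maxlen silently discards the oldest entry once it is full.
window = deque(maxlen=3)
for i in range(5):
    window.append({'role': 'user', 'content': f'message {i}'})
# window now holds only messages 2, 3 and 4


def trim_to_budget(history, max_chars: int = 8000) -> list[dict]:
    """Keep the most recent messages whose combined length fits max_chars.

    Hypothetical helper: walks the history newest-first, stops once the
    budget would be exceeded, then restores chronological order. The newest
    message is always kept, even if it alone exceeds the budget.
    """
    kept, used = [], 0
    for msg in reversed(history):
        used += len(msg['content'])
        if used > max_chars and kept:
            break
        kept.append(msg)
    return list(reversed(kept))
```

Passing trim_to_budget(hist) instead of list(hist) in the request would apply both limits at once.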

The /reset confirmation is sent as an ephemeral message, visible only to the user who invoked the command, which keeps the channel clean and avoids cluttering it with administrative replies.

Handling Long Responses with Chunked Replies

Discord enforces a hard 2,000 character limit per message. If Ollama generates a long response — a detailed code explanation, a multi-step plan, or a lengthy essay — you need to split it across multiple messages rather than silently truncating it. Here’s a utility function that handles chunking cleanly at word boundaries:

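def chunk_text(text: str, size: int = 1900) -> list[str]:
    """Split text into pieces of at most `size` characters, breaking at whitespace.

    Unlike str.split(), this keeps newlines inside each chunk, so lists and
    code blocks in the model's reply retain their formatting.
    """
    chunks = []
    text = text.strip()
    while len(text) > size:
        # Break at the last newline or space that keeps the chunk within size.
        cut = max(text.rfind('\n', 0, size + 1), text.rfind(' ', 0, size + 1))
        if cut <= 0:
            cut = size  # no whitespace available: hard-cut an overlong word
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks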

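async def send_chunked(followup, text: str):
    # A followup webhook can send several messages, so every chunk goes
    # through it. (followup.channel is not populated for interaction
    # webhooks, so followup.channel.send() would fail.)
    for chunk in chunk_text(text):
        await followup.send(chunk)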

Replace the truncation logic in your earlier commands with await send_chunked(interaction.followup, reply) to get clean multi-message output instead of cut-off responses.

Adding a System Prompt

You can give your bot a personality or constrain its behaviour by prepending a system message to every conversation. This is useful if you want the bot to act as a coding assistant, a writing helper, or a domain-specific expert for your community.

SYSTEM_PROMPT = (
    "You are a helpful assistant in a Discord server for developers. "
    "Keep responses concise and use markdown formatting where appropriate. "
    "When showing code, always use fenced code blocks with the language specified."
)

async def ask_ollama_with_history(user_id: int, user_msg: str) -> str:
    hist = histories[user_id]
    hist.append({'role': 'user', 'content': user_msg})
    messages = [{'role': 'system', 'content': SYSTEM_PROMPT}] + list(hist)

    async with httpx.AsyncClient(timeout=120) as http:
        resp = await http.post(
            f'{OLLAMA_URL}/api/chat',
            json={'model': MODEL, 'messages': messages, 'stream': False}
        )
        data = resp.json()
        reply = data['message']['content']
        hist.append({'role': 'assistant', 'content': reply})
        return reply

Notice that the system message is prepended to messages each time but is not stored in hist. This means it’s always present at the start of the conversation without consuming slots in the history deque. You can make the system prompt configurable via a slash command if you want server admins to be able to change the bot’s persona on the fly.

Restricting the Bot to Specific Channels

If you don’t want the bot responding in every channel on your server, add a channel whitelist check at the top of each command handler:

ALLOWED_CHANNELS = {123456789012345678, 987654321098765432}

def channel_allowed(interaction: discord.Interaction) -> bool:
    return not ALLOWED_CHANNELS or interaction.channel_id in ALLOWED_CHANNELS

@tree.command(name='ask', description='Ask the local LLM a question')
async def ask(interaction: discord.Interaction, prompt: str):
    if not channel_allowed(interaction):
        await interaction.response.send_message(
            'This bot is only available in designated channels.', ephemeral=True
        )
        return
    await interaction.response.defer(thinking=True)
    answer = await ask_ollama(prompt)
    await send_chunked(interaction.followup, answer)

If ALLOWED_CHANNELS is an empty set, the check passes everywhere — so you can disable channel restriction by leaving it empty without touching any other logic. Channel IDs can be found by enabling Developer Mode in Discord’s settings (Settings → Advanced → Developer Mode) and right-clicking any channel.
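
If you would rather keep the channel list in .env alongside the other settings, one possible approach is to parse a comma-separated value at startup. This is a sketch under assumptions: ALLOWED_CHANNEL_IDS is a hypothetical variable name that does not appear in the .env file shown earlier.

```python
import os


def parse_channel_ids(raw: str) -> set[int]:
    """Turn a comma-separated string of channel IDs into a set of ints.

    Blank entries are skipped; a non-numeric entry raises ValueError so a
    typo in the configuration is noticed at startup.
    """
    return {int(part) for part in raw.split(',') if part.strip()}


# An unset or empty variable yields an empty set, which disables the restriction.
ALLOWED_CHANNELS = parse_channel_ids(os.getenv('ALLOWED_CHANNEL_IDS', ''))
```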

Error Handling and Resilience

A bot that fails silently when Ollama is unreachable or returns an unexpected response is frustrating to debug. Wrapping your Ollama calls in proper exception handling means users get a clear error message rather than a “thinking” indicator that never resolves.

async def ollama_chat_safe(messages: list) -> str:
    try:
        async with httpx.AsyncClient(timeout=60) as http:
            r = await http.post(
                f'{OLLAMA_URL}/api/chat',
                json={'model': MODEL, 'messages': messages, 'stream': False}
            )
            r.raise_for_status()
            return r.json()['message']['content']
    except httpx.ConnectError:
        return 'Could not reach Ollama. Is it running on your machine?'
    except httpx.TimeoutException:
        return 'Ollama took too long to respond. The model may be overloaded.'
    except Exception as e:
        return f'An unexpected error occurred: {str(e)}'

Reducing the timeout to 60 seconds rather than 120 is intentional here. Discord users expect fairly quick responses, and if a model is taking more than a minute to reply something has likely gone wrong — the model hasn’t loaded, the machine is under heavy load, or the prompt is unusually demanding. Surfacing that as a clean error message is better than leaving users staring at the “thinking” indicator indefinitely.

It’s also worth catching the case where the Ollama model hasn’t been pulled yet. If the model name in your .env doesn’t match any locally available model, Ollama returns a 404. You can catch this with r.raise_for_status() and handle the httpx.HTTPStatusError to give a specific “model not found — run ollama pull” message.
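
One way to produce that message is a small mapping from status code to user-facing text. The describe_http_error helper below is hypothetical; the sketch assumes it is called from an except httpx.HTTPStatusError branch, where the status is available as e.response.status_code.

```python
def describe_http_error(status_code: int, model: str) -> str:
    """Map an HTTP status from Ollama to a user-facing message (hypothetical helper)."""
    if status_code == 404:
        return f'Model "{model}" was not found. Try running: ollama pull {model}'
    if status_code >= 500:
        return 'Ollama reported an internal error. Check its logs for details.'
    return f'Ollama returned an unexpected HTTP status ({status_code}).'


# Inside ollama_chat_safe, placed before the generic `except Exception` branch:
#     except httpx.HTTPStatusError as e:
#         return describe_http_error(e.response.status_code, MODEL)
```

The branch has to come before except Exception, because that generic handler would otherwise catch the status error first.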

Rate Limiting Per User

On a busy server with many active users, you may want to prevent any single person from flooding the bot with requests. A simple token bucket or cooldown approach works well here. The following pattern adds a per-user cooldown of 10 seconds between requests:

import time
user_cooldowns: dict[int, float] = {}
COOLDOWN_SECONDS = 10

def is_on_cooldown(user_id: int) -> float:
    last = user_cooldowns.get(user_id, 0)
    remaining = COOLDOWN_SECONDS - (time.time() - last)
    return remaining if remaining > 0 else 0

def set_cooldown(user_id: int):
    user_cooldowns[user_id] = time.time()

This is intentionally lightweight — a dictionary lookup and a time comparison. For most Discord bots this is sufficient. If you’re running the bot for a very active community and need more sophisticated rate limiting, discord.py’s built-in app_commands.checks.cooldown decorator is a cleaner solution that integrates with the slash command framework directly.

Putting It All Together

Here’s a complete bot.py that combines the pieces above (everything except the streaming command) into a single file you can run directly:

import os
import time
import discord
from discord import app_commands
import httpx
from dotenv import load_dotenv
from collections import defaultdict, deque

load_dotenv()
TOKEN = os.getenv('DISCORD_TOKEN')
OLLAMA_URL = os.getenv('OLLAMA_BASE_URL', 'http://localhost:11434')
MODEL = os.getenv('OLLAMA_MODEL', 'llama3.2')
ALLOWED_CHANNELS: set[int] = set()
SYSTEM_PROMPT = (
    "You are a helpful assistant in a Discord server for developers. "
    "Keep responses concise and use markdown formatting where appropriate."
)

# Slash commands arrive as interactions, so the default intents are sufficient.
intents = discord.Intents.default()
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)
histories: dict[int, deque] = defaultdict(lambda: deque(maxlen=20))
user_cooldowns: dict[int, float] = {}
COOLDOWN_SECONDS = 10

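def chunk_text(text: str, size: int = 1900) -> list[str]:
    """Split text into pieces of at most `size` characters, keeping newlines intact."""
    chunks = []
    text = text.strip()
    while len(text) > size:
        # Break at the last newline or space that keeps the chunk within size.
        cut = max(text.rfind('\n', 0, size + 1), text.rfind(' ', 0, size + 1))
        if cut <= 0:
            cut = size  # no whitespace available: hard-cut an overlong word
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

async def send_chunked(target, text: str):
    # Send every chunk through the followup webhook; its .channel attribute
    # is not populated for interaction webhooks.
    for chunk in chunk_text(text):
        await target.send(chunk)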

def channel_ok(interaction: discord.Interaction) -> bool:
    return not ALLOWED_CHANNELS or interaction.channel_id in ALLOWED_CHANNELS

def is_on_cooldown(user_id: int) -> float:
    remaining = COOLDOWN_SECONDS - (time.time() - user_cooldowns.get(user_id, 0))
    return remaining if remaining > 0 else 0

async def ollama_chat(messages: list) -> str:
    try:
        async with httpx.AsyncClient(timeout=60) as http:
            r = await http.post(
                f'{OLLAMA_URL}/api/chat',
                json={'model': MODEL, 'messages': messages, 'stream': False}
            )
            r.raise_for_status()
            return r.json()['message']['content']
    except httpx.ConnectError:
        return 'Could not reach Ollama. Is it running?'
    except httpx.TimeoutException:
        return 'Ollama timed out. The model may be overloaded.'
    except Exception as e:
        return f'Error: {e}'

@tree.command(name='ask', description='Single-turn question')
async def ask(interaction: discord.Interaction, prompt: str):
    if not channel_ok(interaction):
        await interaction.response.send_message('Not available here.', ephemeral=True); return
    if cd := is_on_cooldown(interaction.user.id):
        await interaction.response.send_message(f'Wait {cd:.1f}s.', ephemeral=True); return
    user_cooldowns[interaction.user.id] = time.time()
    await interaction.response.defer(thinking=True)
    reply = await ollama_chat([
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': prompt}
    ])
    await send_chunked(interaction.followup, reply)

@tree.command(name='chat', description='Multi-turn conversation')
async def chat(interaction: discord.Interaction, message: str):
    if not channel_ok(interaction):
        await interaction.response.send_message('Not available here.', ephemeral=True); return
    if cd := is_on_cooldown(interaction.user.id):
        await interaction.response.send_message(f'Wait {cd:.1f}s.', ephemeral=True); return
    user_cooldowns[interaction.user.id] = time.time()
    await interaction.response.defer(thinking=True)
    hist = histories[interaction.user.id]
    hist.append({'role': 'user', 'content': message})
    reply = await ollama_chat([{'role': 'system', 'content': SYSTEM_PROMPT}] + list(hist))
    hist.append({'role': 'assistant', 'content': reply})
    await send_chunked(interaction.followup, reply)

@tree.command(name='reset', description='Clear conversation history')
async def reset(interaction: discord.Interaction):
    histories[interaction.user.id].clear()
    await interaction.response.send_message('History cleared.', ephemeral=True)

@client.event
async def on_ready():
    await tree.sync()
    print(f'Ready: {client.user}')

client.run(TOKEN)

Keeping the Bot Running

For a bot that stays online reliably you’ll want to run it as a background process rather than keeping a terminal open. On Linux, a simple systemd service works well. Create /etc/systemd/system/discord-ollama-bot.service:

[Unit]
Description=Discord Ollama Bot
After=network.target ollama.service

[Service]
Type=simple
WorkingDirectory=/home/youruser/ollama-discord-bot
ExecStart=/home/youruser/ollama-discord-bot/venv/bin/python bot.py
Restart=on-failure
RestartSec=10
EnvironmentFile=/home/youruser/ollama-discord-bot/.env

[Install]
WantedBy=multi-user.target

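Enable and start it with sudo systemctl enable --now discord-ollama-bot. The After=ollama.service directive orders the bot after the Ollama service at boot, which avoids most connection errors on startup. It only controls ordering, though: it does not start Ollama or wait for it to finish loading, so the bot’s own error handling still matters. Logs are accessible with journalctl -u discord-ollama-bot -f.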

On macOS or Windows, a simpler option is a process manager like pm2 (which works with Python via the --interpreter python3 flag) or simply running the bot inside a screen or tmux session.

Model Selection and Performance

The best model for a Discord bot depends on your hardware and how fast you need responses. For most desktop machines, llama3.2:3b is a solid choice: it loads quickly, generates tokens fast enough that the streaming edit pattern feels responsive, and handles general conversation well. If your server is a beefier machine with 16 GB or more VRAM, a larger model such as llama3.1:8b (Llama 3.2 itself ships in 1B and 3B text sizes) gives noticeably better reasoning and instruction following without too much added latency.

One practical consideration: Discord bots are inherently multi-user, and Ollama processes only a limited number of requests at a time. Depending on your Ollama version and settings such as OLLAMA_NUM_PARALLEL, that limit may be as low as one per model. If two users send /chat simultaneously, the second request may queue behind the first. For small Discord servers this is usually fine, since the wait is typically a few seconds. For busier servers, consider a smaller, faster model or running Ollama on a machine with enough memory to serve several requests in parallel.
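
If you prefer to make the queueing explicit on the bot side, so that you can tell users how many requests are ahead of them, a semaphore is one option. The RequestGate class below is a hypothetical sketch and is not part of the bot above; it assumes a single asyncio event loop, which is what discord.py uses.

```python
import asyncio


class RequestGate:
    """Limit how many coroutines talk to Ollama at once (hypothetical helper)."""

    def __init__(self, max_concurrent: int = 1):
        self._slots = asyncio.Semaphore(max_concurrent)
        self.waiting = 0  # requests that have not yet been given a slot

    async def run(self, coro_fn, *args):
        """Await coro_fn(*args) once a slot is free and return its result."""
        self.waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self.waiting -= 1
        try:
            return await coro_fn(*args)
        finally:
            self._slots.release()
```

A handler could then call reply = await gate.run(ollama_chat, messages) and read gate.waiting beforehand to decide whether to warn the user about a delay.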

Testing Locally Before Deploying

Before inviting the bot to a real server, test it in a private development server with just yourself as a member. This lets you verify that slash commands register correctly, responses arrive cleanly, and the conversation history resets as expected, without exposing half-finished features to other users. Global slash commands can take a while to show up after tree.sync() (historically up to an hour), whereas guild-specific syncing, by passing your test server’s guild ID as tree.sync(guild=discord.Object(id=YOUR_GUILD_ID)), updates immediately during development. Because the commands in this guide are registered globally, call tree.copy_global_to(guild=discord.Object(id=YOUR_GUILD_ID)) before the guild sync so there is something to sync to that guild.

It’s also worth testing what happens when Ollama isn’t running: start the bot without starting Ollama and confirm the error handling returns a friendly message rather than an unhandled exception. Then test with a prompt that produces a very long response to verify that your chunking logic splits at word boundaries and never cuts a word in half.

Extending Further

The bot as built covers the core use case, but there are several natural directions to take it depending on what your Discord community needs. Adding vision support is straightforward if you switch to a multimodal model like llava or gemma3 — you’d accept image attachments in the command, convert them to base64, and pass them through the Ollama /api/chat endpoint alongside the text prompt.
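
The message-building half of that could look roughly like the sketch below. build_vision_message is a hypothetical helper name; it relies on the images field that Ollama’s /api/chat accepts on a message, which holds base64-encoded image data.

```python
import base64


def build_vision_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a chat message carrying one image for a multimodal model.

    The raw bytes are base64-encoded and placed in the message's 'images'
    list next to the text prompt.
    """
    encoded = base64.b64encode(image_bytes).decode('ascii')
    return {'role': 'user', 'content': prompt, 'images': [encoded]}
```

In a slash command that takes a discord.Attachment parameter, the bytes would come from await attachment.read(), and the resulting dictionary would go into the messages list in place of the plain text message.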

Persistent conversation history across bot restarts can be added by serialising the history dictionaries to a JSON file on disk periodically or on graceful shutdown. For larger deployments, SQLite is a better fit — the aiosqlite library integrates cleanly with asyncio and lets you store per-user history in a proper database without adding much complexity.
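
A minimal sketch of the JSON approach follows, assuming the same defaultdict-of-deques structure used earlier. The function names, the MAX_HISTORY constant, and the file path are placeholders; the write goes to a temporary file first so a crash mid-write cannot corrupt the saved history.

```python
import json
from collections import defaultdict, deque
from pathlib import Path

MAX_HISTORY = 20


def save_histories(histories, path: Path) -> None:
    """Write every user's history to a JSON file (keys become strings in JSON)."""
    data = {str(user_id): list(hist) for user_id, hist in histories.items()}
    tmp = path.with_suffix('.tmp')
    tmp.write_text(json.dumps(data), encoding='utf-8')
    tmp.replace(path)  # swap the finished file into place in one step


def load_histories(path: Path):
    """Rebuild the defaultdict of deques, or start empty if the file is missing."""
    histories = defaultdict(lambda: deque(maxlen=MAX_HISTORY))
    if path.exists():
        saved = json.loads(path.read_text(encoding='utf-8'))
        for user_id, messages in saved.items():
            histories[int(user_id)].extend(messages)
    return histories
```

Calling load_histories once at startup and save_histories after each reply (or on a timer) would be enough for a small server.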

You can also hook the bot into Discord’s thread system: rather than using slash commands in a main channel, create a new thread for each conversation automatically. This keeps the main channel clean and groups each conversation’s messages together, making it much easier to follow long exchanges. In discord.py, the create_thread() method on text channels (reachable as interaction.channel.create_thread() when the command runs in a text channel) makes this straightforward to implement.
