LLM Memory Patterns for AI Agents: Short-Term, Long-Term, and Episodic Storage

Why Memory Matters for Agents

A language model has no persistent memory between API calls. Every request starts fresh from whatever you put in the context window. For a simple Q&A chatbot, this is fine — each turn is mostly self-contained. But for agents that work on multi-step tasks, handle long-running workflows, or serve the same user over time, statelessness becomes a fundamental limitation. The agent cannot remember what it has already done, what the user told it last week, or what decisions it made three steps ago in the current task.

Memory patterns are the architectural solutions to this problem. They define how information is stored, retrieved, and injected into context so that agents can operate coherently across turns, sessions, and long task horizons. Getting memory right is one of the key engineering challenges in building reliable agentic systems — too little memory and the agent forgets critical context; too much and you blow the context window with irrelevant information or pay for tokens that add noise rather than signal.

The Four Types of Agent Memory

A useful taxonomy divides agent memory into four types, each serving a different purpose and implemented differently.

In-context memory is the simplest: the full conversation history and any relevant information are injected directly into the current API call’s context window. Everything the agent needs right now lives in the prompt. This works well for single-session tasks with bounded complexity, but it hits the context window limit as tasks grow and becomes expensive at scale.

External storage memory persists information outside the model (in a database, a vector store, or a file system) and retrieves relevant pieces into context on demand. This is the foundation of retrieval-augmented generation (RAG) and the key to scaling beyond context window limits. The agent does not see everything at once; it sees what retrieval decides is relevant to the current moment.

In-weights memory is knowledge baked into the model’s parameters through training or fine-tuning. The model “remembers” this information intrinsically — it does not need to be injected into context. Fine-tuning on domain-specific data is the main way to create in-weights memory for custom applications, though it is expensive and inflexible compared to retrieval-based approaches.

In-cache memory refers to the key-value (KV) cache that the model’s attention mechanism builds while processing a prompt. Prompt caching at the provider level is a form of this: previously computed representations are reused to speed up calls that share a prefix. This is mostly an optimisation concern rather than an architectural memory pattern, but it matters for latency and cost in long-running sessions.

Short-Term Memory: Managing Conversation History

Short-term memory is what the agent knows within the current session. The naive implementation is to keep the full conversation history and append every new turn. This works until the history exceeds the context window or becomes too expensive to include in every call.

Three strategies manage short-term memory growth. A sliding window keeps only the most recent N turns and discards older ones; it is simple and fast, but it loses context established early in the conversation. Summarisation periodically compresses older turns into a running summary, replacing raw history with a compact representation; it preserves the semantic content at the cost of detail. Selective retention keeps only turns that are explicitly marked as important (decisions made, facts established, instructions given) and discards the rest; it is more precise than summarisation but requires logic to identify what is worth keeping.

from anthropic import Anthropic

client = Anthropic()

class ShortTermMemory:
    def __init__(self, max_turns: int = 20):
        self.messages: list[dict] = []
        self.max_turns = max_turns

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_turns * 2:  # pairs of user+assistant
            self._summarise_oldest()

    def _summarise_oldest(self):
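        # Keep the most recent half and summarise the older half. The split is
        # rounded down to an even index so the retained history still starts
        # with a user turn (assumes strictly alternating user/assistant turns).
        split = len(self.messages) // 2
        split -= split % 2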
        to_summarise = self.messages[:split]
        self.messages = self.messages[split:]

        summary_prompt = "Summarise this conversation history concisely:\n" + "\n".join(
            f"{m['role']}: {m['content']}" for m in to_summarise
        )
        summary = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=256,
            messages=[{"role": "user", "content": summary_prompt}]
        ).content[0].text

        self.messages.insert(0, {"role": "user", "content": f"[Earlier conversation summary: {summary}]"})
        self.messages.insert(1, {"role": "assistant", "content": "Understood, I have the context from our earlier conversation."})

    def get_messages(self) -> list[dict]:
        return self.messages
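
The class above covers summarisation. The other two strategies, sliding window and selective retention, can be combined in one small helper. The following is a minimal sketch under stated assumptions: `estimate_tokens`, `trim_to_budget`, and the `pinned` flag are hypothetical names introduced for illustration, and the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: roughly four characters per token. Swap in a real
    # tokenizer if you need accurate counts.
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep pinned messages plus the newest messages that fit the budget."""
    # Selective retention: messages flagged as pinned are always kept.
    pinned = {i for i, m in enumerate(messages) if m.get("pinned")}
    budget = max_tokens - sum(estimate_tokens(messages[i]["content"]) for i in pinned)

    kept = set(pinned)
    # Sliding window: walk backwards from the newest message and stop at
    # the first one that no longer fits, so the window stays contiguous.
    for i in range(len(messages) - 1, -1, -1):
        if i in pinned:
            continue
        cost = estimate_tokens(messages[i]["content"])
        if cost > budget:
            break
        budget -= cost
        kept.add(i)

    # Return in original order, without the bookkeeping flag.
    return [
        {"role": m["role"], "content": m["content"]}
        for i, m in enumerate(messages)
        if i in kept
    ]
```

Dropping messages from the middle of a history can leave two turns with the same role next to each other, so check the result against whatever turn-ordering rules your provider enforces before sending it.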

Long-Term Memory: Persisting Across Sessions

Long-term memory persists information between separate sessions — what the user told you last week, their preferences, past decisions, completed work. The most common implementation stores key facts and events in a database and retrieves relevant ones at the start of each new session.

A practical long-term memory system has three components: a write path that extracts and stores memorable information from the current session, a retrieval path that fetches relevant memories for the new session, and a schema that defines what kinds of information are worth remembering.

import json
from datetime import datetime
import anthropic

client = anthropic.Anthropic()

EXTRACT_PROMPT = """Review this conversation and extract facts worth remembering long-term.
Focus on: user preferences, decisions made, important facts stated, tasks completed.
Return a JSON array of memory objects: [{{"content": "...", "type": "preference|decision|fact|task", "importance": 1-5}}]
Return empty array if nothing is worth remembering.

Conversation:
{conversation}"""

def extract_memories(conversation: list[dict]) -> list[dict]:
    conv_text = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512,
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(conversation=conv_text)}]
    )
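    try:
        memories = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # The model may wrap the JSON in prose or a code fence; treat an
        # unparseable reply as "nothing to remember" rather than crashing.
        return []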
    for m in memories:
        m["timestamp"] = datetime.now().isoformat()
    return memories

def build_memory_context(memories: list[dict], max_memories: int = 10) -> str:
    if not memories:
        return ""
    # Sort by importance and recency, take top N
    sorted_memories = sorted(memories, key=lambda m: (m.get("importance", 1), m["timestamp"]), reverse=True)
    top = sorted_memories[:max_memories]
    lines = ["Relevant memories from previous sessions:"]
    for m in top:
        lines.append(f"- [{m['type']}] {m['content']}")
    return "\n".join(lines)

Store memories in a database with the user ID as a partition key. At session start, retrieve the most relevant memories — either all of them if the collection is small, or the top-K by semantic similarity to the current query if you have thousands. Inject them into the system prompt before the conversation begins.
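
As one possible shape for the storage layer, here is a minimal sketch that uses SQLite from the Python standard library. The `MemoryStore` class, the table layout, and the column names are assumptions made for illustration; the dictionaries it accepts follow the fields produced by the extraction code above (`content`, `type`, `importance`, `timestamp`). It ranks by importance and recency only, so for large collections you would replace `load` with a semantic-similarity query.

```python
import sqlite3


class MemoryStore:
    """Hypothetical per-user memory table backed by SQLite."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS memories (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   user_id TEXT NOT NULL,
                   content TEXT NOT NULL,
                   type TEXT NOT NULL,
                   importance INTEGER NOT NULL,
                   timestamp TEXT NOT NULL
               )"""
        )
        # The index on user_id plays the role of the partition key.
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS idx_memories_user ON memories (user_id)"
        )

    def save(self, user_id: str, memories: list[dict]) -> None:
        # "with self.db" commits on success and rolls back on error.
        with self.db:
            self.db.executemany(
                "INSERT INTO memories (user_id, content, type, importance, timestamp) "
                "VALUES (?, ?, ?, ?, ?)",
                [
                    (user_id, m["content"], m["type"], m.get("importance", 1), m["timestamp"])
                    for m in memories
                ],
            )

    def load(self, user_id: str, limit: int = 10) -> list[dict]:
        rows = self.db.execute(
            "SELECT content, type, importance, timestamp FROM memories "
            "WHERE user_id = ? ORDER BY importance DESC, timestamp DESC LIMIT ?",
            (user_id, limit),
        ).fetchall()
        keys = ("content", "type", "importance", "timestamp")
        return [dict(zip(keys, row)) for row in rows]
```

Every query filters on `user_id`, which keeps one user's memories from leaking into another user's session.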

Episodic Memory: Remembering Past Events

Episodic memory stores complete records of past agent runs — what the agent was asked to do, what steps it took, what tools it called, and what the outcome was. This is distinct from long-term memory of facts: episodic memory preserves the narrative of past actions rather than extracted knowledge.

Episodic memory is most valuable for agents that repeat similar tasks and should learn from experience. A coding agent that successfully debugged a specific type of error can retrieve that episode when it encounters a similar error again. A research agent that compiled a report on a topic can retrieve the episode to avoid duplicating work. Storing episodes in a vector database keyed by a semantic summary of the task enables efficient retrieval of past episodes relevant to the current task.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Episode:
    task: str
    steps: list[dict] = field(default_factory=list)
    outcome: str = ""
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())

    def add_step(self, action: str, result: str):
        self.steps.append({"action": action, "result": result})

    def summarise(self) -> str:
        # Only the first five steps are included, to keep the summary short.
        steps_text = "; ".join(f"{s['action']} -> {s['result']}" for s in self.steps[:5])
        return f"Task: {self.task} | Steps: {steps_text} | Outcome: {self.outcome}"
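
To retrieve a past episode for a new task, the episode summaries need to be searchable. The sketch below is a stand-in for a vector database, not a replacement for one: it scores episodes by word overlap (Jaccard similarity) between task descriptions instead of embedding similarity. `EpisodeIndex`, `tokens`, and `jaccard` are hypothetical names introduced for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric words; a crude substitute for an embedding.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class EpisodeIndex:
    """Hypothetical in-memory index of episode summaries keyed by task text."""

    def __init__(self):
        self._episodes: list[tuple[set[str], str]] = []

    def add(self, task: str, summary: str) -> None:
        self._episodes.append((tokens(task), summary))

    def search(self, task: str, k: int = 3, min_score: float = 0.1) -> list[str]:
        query = tokens(task)
        scored = [(jaccard(query, t), s) for t, s in self._episodes]
        # Drop weak matches so unrelated episodes are never injected.
        scored = [pair for pair in scored if pair[0] >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [summary for _, summary in scored[:k]]
```

The `min_score` threshold matters as much as the ranking: injecting an unrelated episode can mislead the agent more than injecting nothing.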

Semantic Memory: A Knowledge Base for Agents

Semantic memory is the agent’s domain knowledge — facts, procedures, and relationships about the world relevant to its task. Unlike episodic memory (what happened) or short-term memory (what was just said), semantic memory is general knowledge that applies across many sessions and tasks. For most practical agents, semantic memory is implemented as a RAG pipeline: a vector store of documents, FAQs, procedures, and reference material that the agent can query when it needs specific information.

The key design decision is what to put in semantic memory versus in-weights memory. In-weights knowledge (baked into the model) is fast to access but inflexible — updating it requires retraining. Semantic memory in a vector store can be updated instantly by adding or modifying documents, but retrieval adds latency and introduces the risk of fetching the wrong information. For stable, general knowledge, rely on the base model. For domain-specific, frequently updated, or proprietary knowledge, use semantic memory with a vector store.
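
The trade-off can be made concrete with a toy knowledge base. This is a sketch under stated assumptions: it ranks documents by cosine similarity over raw word counts, which stands in for the embedding model and vector store a real RAG pipeline would use, and `KnowledgeBase`, `vectorise`, and `cosine` are hypothetical names. The point it illustrates is that updating a document takes effect on the very next query, with no retraining.

```python
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    # Bag-of-words term counts; a placeholder for a real embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class KnowledgeBase:
    """Hypothetical document store with similarity search."""

    def __init__(self):
        self.docs: list[tuple[str, str, Counter]] = []

    def add(self, doc_id: str, text: str) -> None:
        # Re-adding an existing id replaces the old version immediately.
        self.docs = [d for d in self.docs if d[0] != doc_id]
        self.docs.append((doc_id, text, vectorise(text)))

    def query(self, question: str, k: int = 2) -> list[tuple[str, str]]:
        q = vectorise(question)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, text, vec in self.docs]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(doc_id, text) for _, doc_id, text in scored[:k]]
```

Word-count matching misses paraphrases ("take" versus "takes"), which is the kind of retrieval error the paragraph above warns about and the reason production systems use learned embeddings.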

Combining Memory Types in Practice

Production agents typically combine several of these memory types. At session start, the system injects relevant background knowledge retrieved from semantic memory, followed by the user’s long-term memory context, followed by any relevant episodic memories from past similar tasks. The conversation then proceeds with short-term memory managing the current session’s history. At session end, new facts and decisions are extracted and written to long-term memory, and the episode is stored for future retrieval.
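
The session-start ordering described above can be sketched as a single prompt-assembly function. The function name, section titles, and the per-section cap are assumptions made for illustration; the inputs are plain strings that would come from whatever retrieval layers you have in place.

```python
def build_system_prompt(
    base_instructions: str,
    knowledge: list[str],
    memories: list[str],
    episodes: list[str],
    max_items_per_section: int = 5,
) -> str:
    """Assemble session-start context: knowledge, then user memory, then episodes."""
    sections = [
        ("Background knowledge", knowledge),
        ("What we know about this user", memories),
        ("Similar past tasks", episodes),
    ]
    parts = [base_instructions.strip()]
    for title, items in sections:
        if not items:
            continue  # Omit empty sections rather than printing a bare heading.
        body = "\n".join(f"- {item}" for item in items[:max_items_per_section])
        parts.append(f"{title}:\n{body}")
    return "\n\n".join(parts)
```

Capping each section keeps one noisy memory layer from crowding out the others, which is a cheap guard against the "too much memory" failure mode.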

The complexity this introduces is real — more moving parts means more failure modes and more latency. Start with the simplest memory pattern that solves your immediate problem, and add complexity only when you observe specific failures that more memory would address. A well-designed short-term memory with good summarisation handles the majority of use cases; long-term and episodic memory are worth adding once you have users returning across multiple sessions and you can measure the quality improvement they deliver.

Memory and Privacy

Any system that persists information about users across sessions has privacy implications that deserve explicit attention. Users should know what is being remembered, be able to review it, and be able to delete it. Storing sensitive information — health details, financial data, personal relationships — in an agent’s memory without clear disclosure and user consent creates both legal and ethical exposure. Design your memory schema to capture only what is necessary for the agent to be useful, not everything it could possibly infer. Give users a memory management interface, even a simple one, so they are not in the dark about what the system knows about them. Treat memory data with the same care you would treat any other sensitive user data: encrypted at rest, access-controlled, with a clear retention and deletion policy.
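
A memory management interface does not need to be elaborate. The following is a minimal in-memory sketch of the three operations the paragraph calls for: review, deletion, and a retention policy. `UserMemoryVault` and its method names are hypothetical, and encryption at rest and access control are deliberately out of scope here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class UserMemoryVault:
    """Hypothetical store that lets users review and delete their memories."""

    def __init__(self, retention_days: int = 365):
        self.retention = timedelta(days=retention_days)
        self._records: dict[int, dict] = {}
        self._next_id = 1

    def remember(self, user_id: str, content: str, now: Optional[datetime] = None) -> int:
        memory_id = self._next_id
        self._next_id += 1
        self._records[memory_id] = {
            "id": memory_id,
            "user_id": user_id,
            "content": content,
            "created_at": now or datetime.now(timezone.utc),
        }
        return memory_id

    def review(self, user_id: str) -> list[dict]:
        # Return copies so callers cannot mutate stored records.
        return [dict(r) for r in self._records.values() if r["user_id"] == user_id]

    def forget(self, user_id: str, memory_id: int) -> bool:
        record = self._records.get(memory_id)
        # Users may only delete their own memories.
        if record is None or record["user_id"] != user_id:
            return False
        del self._records[memory_id]
        return True

    def purge_expired(self, now: Optional[datetime] = None) -> int:
        cutoff = (now or datetime.now(timezone.utc)) - self.retention
        expired = [i for i, r in self._records.items() if r["created_at"] < cutoff]
        for i in expired:
            del self._records[i]
        return len(expired)
```

Running `purge_expired` on a schedule turns the retention policy from a document into behaviour.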

Debugging Memory Issues

Memory bugs in agents are some of the hardest to diagnose because they involve state that accumulates across many interactions. A few patterns help. Log every memory read and write with the session ID, timestamp, and the content stored or retrieved — this creates an audit trail that makes it possible to reconstruct exactly what the agent knew at any point in a session. Add a memory inspection tool to your agent that lets you query the current memory state during development. Implement memory health checks that flag anomalies: memory entries growing without bound, retrieval returning irrelevant results, or the same facts being stored multiple times. Test memory behaviour explicitly in your evaluation suite — not just whether the agent produces correct answers, but whether it correctly uses and updates memory across multi-turn scenarios. Memory correctness is easy to overlook in standard evals that only test single-turn response quality.
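
Two of these practices, the audit trail and the health check, fit in a thin wrapper around any memory store. This is a sketch under stated assumptions: `AuditedMemory` is a hypothetical class, the store is a plain list with keyword lookup, and the health check covers only unbounded growth and duplicate entries.

```python
import logging
from collections import Counter
from datetime import datetime, timezone

logger = logging.getLogger("agent.memory")


class AuditedMemory:
    """Hypothetical memory wrapper that records every read and write."""

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.entries: list[str] = []
        self.audit_log: list[dict] = []

    def _record(self, op: str, content) -> None:
        self.audit_log.append({
            "session_id": self.session_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "op": op,
            "content": content,
        })
        logger.info("memory %s session=%s content=%r", op, self.session_id, content)

    def write(self, content: str) -> None:
        self.entries.append(content)
        self._record("write", content)

    def read(self, keyword: str) -> list[str]:
        hits = [e for e in self.entries if keyword.lower() in e.lower()]
        # Log what was retrieved, not just the query, so the agent's view
        # of memory at this moment can be reconstructed later.
        self._record("read", hits)
        return hits

    def health_check(self, max_entries: int = 1000) -> list[str]:
        issues = []
        if len(self.entries) > max_entries:
            issues.append(f"entry count {len(self.entries)} exceeds {max_entries}")
        counts = Counter(e.strip().lower() for e in self.entries)
        dupes = sorted(text for text, n in counts.items() if n > 1)
        if dupes:
            issues.append(f"duplicate entries: {dupes}")
        return issues
```

Because the audit log is ordinary data, the same structure can feed evaluation tests that assert which memories were read before a given answer.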

Memory as a Competitive Advantage

For consumer-facing AI applications, memory is increasingly a product differentiator. An assistant that remembers your preferences, your ongoing projects, your communication style, and the context of your past requests feels fundamentally more useful than one that treats every session as the first. The technical challenge is manageable: the patterns described here are well understood and implementable with existing tools. The harder challenge is product design: deciding what to remember, how to surface it, and how to give users meaningful control. Teams that invest in thoughtful memory design early build an advantage that is hard for competitors to copy, because the value of a personalised assistant accumulates with time and usage in a way that a better base model alone cannot match. Memory is not just a technical pattern; it is the mechanism by which an AI application builds a relationship with its users.

Memory in Multi-Agent Systems

When multiple agents collaborate on a task — a common pattern in more sophisticated agentic architectures — memory becomes shared infrastructure rather than a per-agent concern. A planner agent that breaks a task into subtasks and delegates them to specialist agents needs a shared memory store where all agents can read the overall task state, write their intermediate results, and check what other agents have already done. Without shared memory, agents duplicate work, contradict each other, and lose track of the overall goal.

The simplest shared memory for multi-agent systems is a structured state object passed explicitly between agents — a task graph where each node records the agent responsible, the action taken, the result, and whether the step is complete. More sophisticated systems use a shared database that all agents can read and write transactionally, with locking to prevent race conditions when multiple agents try to update the same piece of state simultaneously. The key design principle is that shared memory should be the single source of truth for task state — agents should not maintain private copies of shared information, because divergent state is the root cause of most multi-agent coordination failures.
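
For agents running as threads in one process, the shared state object and its locking can be sketched as follows. `SharedTaskState`, `Step`, and the claim/complete protocol are assumptions made for illustration; agents in separate processes or on separate machines would need a database with transactions in place of `threading.Lock`.

```python
import threading
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Step:
    name: str
    agent: Optional[str] = None   # Which agent owns this step, if any.
    result: Optional[str] = None
    done: bool = False


class SharedTaskState:
    """Hypothetical single source of truth for a multi-agent task."""

    def __init__(self, step_names: list[str]):
        self._lock = threading.Lock()
        self._steps = {name: Step(name) for name in step_names}

    def claim(self, name: str, agent: str) -> bool:
        # Check-and-set under the lock so two agents cannot both win.
        with self._lock:
            step = self._steps[name]
            if step.agent is not None:
                return False
            step.agent = agent
            return True

    def complete(self, name: str, agent: str, result: str) -> None:
        with self._lock:
            step = self._steps[name]
            if step.agent != agent:
                raise PermissionError(f"{agent} does not own step {name!r}")
            step.result = result
            step.done = True

    def snapshot(self) -> dict[str, dict]:
        # Agents read copies; the only way to change state is via the methods.
        with self._lock:
            return {name: asdict(step) for name, step in self._steps.items()}
```

Handing out snapshots rather than references is what keeps agents from holding private, mutable copies that drift away from the shared state.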

Choosing the Right Memory Architecture

The right memory architecture depends on your use case, user base, and operational constraints. For single-session task agents with bounded complexity, in-context memory with a sliding window is sufficient. For conversational assistants serving returning users, add long-term memory with a lightweight extraction and retrieval pipeline. For agents handling complex multi-step tasks that span many sessions, add episodic memory so the agent can build on past work. For knowledge-intensive agents in specialised domains, invest in semantic memory with a high-quality vector store and chunking strategy.

In all cases, start simpler than you think you need and add complexity when you have evidence that it improves outcomes. Memory systems that are more complex than necessary introduce latency, cost, privacy risk, and debugging difficulty without proportional benefit. Measure the impact of each memory layer on task success rate and user satisfaction before adding the next one — the data will tell you more reliably than intuition whether the investment is justified.
