How to Use Ollama with Neovim

Neovim has become the editor of choice for a significant portion of the developer community, and its Lua-based plugin ecosystem makes it surprisingly capable as a local AI coding assistant. By connecting Neovim to Ollama, you get code completions, inline chat, and documentation generation that run entirely on your own hardware — no GitHub Copilot subscription, no telemetry, and no requests leaving your machine. This guide covers the best Neovim plugins for Ollama integration and how to configure them effectively, plus how to write your own minimal Lua integration if you prefer full control over the experience.

Neovim’s architecture makes it well-suited for LLM integration. The async job API lets plugins make HTTP requests to Ollama in the background without blocking the editor, and the floating window system provides a natural UI for chat interactions and inline suggestions. The Lua scripting layer gives you the ability to customise every aspect of the integration — from how completions are triggered to how responses are formatted and inserted.

Prerequisites

You need Ollama installed and running with at least one model pulled. For coding assistance, qwen2.5-coder:7b is an excellent choice — it is fine-tuned on code across dozens of languages and produces accurate, idiomatic suggestions. For general chat and documentation, llama3.2 works well. On the Neovim side you need Neovim 0.9 or later and a plugin manager — the examples here use lazy.nvim, which is the current standard.

Option 1: llm.nvim

llm.nvim is a lightweight completion plugin from Hugging Face. It is minimal by design: no chat window and no complex UI, just suggestions rendered as ghost text at your cursor position that you accept into the buffer with a keypress. Under the hood it talks to the llm-ls language server, which the plugin downloads on first use. This makes it feel like a natural extension of the editor rather than a separate chat interface bolted on.

Add it to your lazy.nvim config:

{
  "huggingface/llm.nvim",
  opts = {
    backend = "ollama",
    url = "http://localhost:11434",
    model = "qwen2.5-coder:7b",
    request_body = {
      options = {
        temperature = 0.2,
      },
    },
    enable_suggestions_on_startup = false,
    accept_keymap = "<Tab>",
    dismiss_keymap = "<S-Tab>",
  },
}

With enable_suggestions_on_startup = false, completions are triggered manually rather than appearing automatically as you type. Use :LLMSuggestion to request a completion at the cursor; by default Tab accepts it and Shift-Tab dismisses it (both are configurable through accept_keymap and dismiss_keymap). Setting temperature = 0.2 makes code completions more deterministic: at a lower temperature the model samples less randomly and is more likely to produce the most probable continuation, which is what you want for code.

Option 2: ollama.nvim

ollama.nvim takes a different approach — it provides a prompt-based interface where you send selected text or a custom prompt to Ollama and receive the response in a floating window or directly in the buffer. It is more interactive than llm.nvim and better suited for refactoring tasks, explanation requests, and documentation generation.

{
  "nomnivore/ollama.nvim",
  dependencies = { "nvim-lua/plenary.nvim" },
  keys = {
    { "<leader>oo", ":<c-u>lua require('ollama').prompt()<cr>",                desc = "Ollama Prompt", mode = { "n", "v" } },
    { "<leader>oG", ":<c-u>lua require('ollama').prompt('Generate_Code')<cr>", desc = "Generate Code", mode = { "n", "v" } },
  },
  opts = {
    model = "llama3.2",
    url = "http://localhost:11434",
    serve = { on_start = false },
    prompts = {
      Sample_Data = {
        prompt = "Generate sample data for the following:\n$buf",
        action = "display",
      },
      Explain_Code = {
        prompt = "Explain what this code does:\n$sel",
        action = "display",
      },
    },
  },
}

The $sel variable inserts the currently selected text into the prompt, and $buf inserts the full buffer contents. The action = "display" setting shows the response in a floating window rather than inserting it into the buffer — useful for explanations and analysis where you want to read the output without modifying your file. Change it to "replace" to have Ollama’s response replace the selected text, which is useful for refactoring tasks.

Option 3: model.nvim

model.nvim is the most fully-featured option, providing a chat buffer, completion integration, and support for multiple backends including Ollama. It feels closest to how GitHub Copilot Chat works — a persistent side panel where you can have a multi-turn conversation with context from your open files.

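{
  "gsuuon/model.nvim",
  cmd = { "M", "Model", "Mchat" },
  config = function()
    -- require() only works once the plugin is loaded, so it lives in config
    local ollama = require("model.providers.ollama")
    require("model").setup({
      prompts = {
        ["ollama:llama3.2"] = {
          provider = ollama,
          params = { model = "llama3.2" },
          builder = function(input)
            return { prompt = input }
          end,
        },
      },
    })
  end,
  keys = {
    { "<leader>m",  ":Model ollama:llama3.2<cr>", mode = { "n", "v" } },
    { "<leader>mc", ":Mchat ",                     mode = "n" },
  },
}

The Model command (also available as M) runs a named prompt. In visual mode it sends the selected text as the input for a one-shot request and streams the result back. The builder function returns the request fields sent to Ollama, so this is the place to add a system prompt or a model-specific template. The Mchat command takes the name of a chat prompt and opens a chat buffer: you type your message, run Mchat again in that buffer, and the response streams in below. The conversation persists for the session, so you can refer back to earlier exchanges. Chat prompts are defined separately from one-shot prompts, under the chats key of setup; the plugin README documents the schema.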

Writing Your Own Lua Integration

If you want complete control over the experience, writing a minimal Lua integration is straightforward. Neovim’s vim.loop (libuv bindings) and vim.fn.jobstart provide the async primitives needed to call Ollama without blocking the editor. Here is a minimal implementation using curl via jobstart:

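-- lua/ollama.lua
-- Note: ollama.nvim also ships a module named "ollama"; rename this file if you use both.
local M = {}

-- Stream a chat completion from Ollama, calling on_token for each content chunk.
local function ask_ollama(prompt, on_token)
  local body = vim.json.encode({
    model = "llama3.2",
    messages = { { role = "user", content = prompt } },
    stream = true,
  })

  local pending = "" -- partial line carried over between on_stdout calls
  vim.fn.jobstart({
    "curl", "-sS", "-N", "-X", "POST",
    "http://localhost:11434/api/chat",
    "-H", "Content-Type: application/json",
    "-d", body,
  }, {
    on_stdout = function(_, data)
      -- The first item continues the previous partial line; the last item is incomplete.
      data[1] = pending .. data[1]
      pending = table.remove(data)
      for _, line in ipairs(data) do
        if line ~= "" then
          local ok, chunk = pcall(vim.json.decode, line)
          if ok and chunk.message and chunk.message.content then
            on_token(chunk.message.content)
          end
        end
      end
    end,
    stdout_buffered = false,
  })
end

-- Open a scratch buffer in a vertical split and stream the answer into it.
function M.ask_with_context(prompt)
  local bufnr = vim.api.nvim_create_buf(false, true)
  vim.api.nvim_buf_set_lines(bufnr, 0, -1, false, { "--- Ollama ---", "" })
  vim.cmd("vsplit")
  vim.api.nvim_win_set_buf(0, bufnr)
  ask_ollama(prompt, function(token)
    vim.schedule(function()
      if not vim.api.nvim_buf_is_valid(bufnr) then return end
      local last = vim.api.nvim_buf_line_count(bufnr) - 1
      local cur = vim.api.nvim_buf_get_lines(bufnr, last, last + 1, false)[1] or ""
      -- Tokens may contain newlines, which nvim_buf_set_lines rejects; split first.
      local parts = vim.split(cur .. token, "\n", { plain = true })
      vim.api.nvim_buf_set_lines(bufnr, last, last + 1, false, parts)
    end)
  end)
end

-- Ask for a prompt interactively, then display the streamed answer.
function M.prompt_and_display()
  vim.ui.input({ prompt = "Ask Ollama: " }, function(input)
    if not input or input == "" then return end
    M.ask_with_context(input)
  end)
end

return M

The pending variable matters because job output arrives in arbitrary chunks: a JSON line can be split across two on_stdout calls, so the incomplete tail is carried over and prepended to the next chunk instead of being discarded. Tokens can also contain newlines, which nvim_buf_set_lines rejects inside a single string, so the text is split into lines before it is written. The vim.schedule wrapper defers each buffer update to the main loop. jobstart callbacks can normally call the API directly, so here it is a defensive choice, but it becomes essential if you later switch to vim.system or raw libuv handles, whose callbacks run in a fast-event context where most API calls raise an error. The stdout_buffered = false setting (the default), together with curl's -N flag, ensures tokens are delivered as they arrive rather than batched, giving you the streaming typewriter effect.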
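Because job output arrives in arbitrary chunks, the line reassembly is the part of a streaming integration that is easiest to get wrong. The sketch below isolates that logic as plain Lua so it can be tried outside Neovim. The name make_line_reader is hypothetical, and the chunk format is assumed to be the one described under :help channel-lines (the first item continues the previous line, the last item is an incomplete line).

```lua
-- Hypothetical helper: reassembles complete lines from jobstart-style chunks.
-- Each chunk is a list of strings. The first item continues the previous
-- partial line; the last item is an incomplete line ("" right after a newline).
local function make_line_reader(on_line)
  local pending = ""
  return function(data)
    for i, piece in ipairs(data) do
      if i == 1 then
        piece = pending .. piece
      end
      if i == #data then
        pending = piece -- incomplete tail, kept for the next call
      elseif piece ~= "" then
        on_line(piece)
      end
    end
  end
end

-- Usage: a JSON object split across two chunks is delivered once, whole.
local lines = {}
local feed = make_line_reader(function(line)
  lines[#lines + 1] = line
end)
feed({ '{"a":1}', '{"b"' })
feed({ ':2}', "" })
```

Inside an on_stdout callback you would call feed(data) and decode each complete line in on_line.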

Useful Keybindings and Workflows

Regardless of which plugin you choose, a few keybinding patterns make the Ollama integration feel natural in a Neovim workflow. Map a leader key sequence to send the current visual selection to Ollama with a predefined prompt, so you can highlight a function and ask for an explanation with two keystrokes. Map another sequence to generate a docstring for the function under the cursor by sending the surrounding lines as context. And map a third to open a scratch buffer pre-populated with the current file’s content so you can ask architecture questions with full context available.

Here is an example set of keybindings that work with the custom Lua integration above:

-- In init.lua or a keybindings file
local ollama = require("ollama")

vim.keymap.set("n", "<leader>ao", ollama.prompt_and_display, { desc = "Ask Ollama" })

vim.keymap.set("v", "<leader>ae", function()
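  -- Pass the active visual mode so linewise and blockwise selections are captured correctly
  local sel = vim.fn.getregion(vim.fn.getpos("v"), vim.fn.getpos("."), { type = vim.fn.mode() })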
  local text = table.concat(sel, "\n")
  ollama.ask_with_context("Explain this code:\n" .. text)
end, { desc = "Explain selection" })

vim.keymap.set("n", "<leader>ad", function()
  local line = vim.fn.line(".")
  local context = table.concat(
    vim.api.nvim_buf_get_lines(0, math.max(0, line-10), line+10, false), "\n"
  )
  ollama.ask_with_context("Write a docstring for the function in this code:\n" .. context)
end, { desc = "Generate docstring" })

The visual mode keybinding captures the selected text using getregion (available in Neovim 0.10+) and appends it to a prompt string. The normal mode docstring keybinding grabs 20 lines of context around the cursor — enough to include the function signature and body for most functions — without requiring you to manually select anything.
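The context window calculation can be factored into a small helper that clamps to the start and end of the buffer. This is a minimal sketch in plain Lua; context_range is a hypothetical name, and it takes a symmetric radius (the same number of lines above and below the cursor), which differs slightly from the inline arithmetic in the keybinding.

```lua
-- Hypothetical helper: returns a 0-based, end-exclusive line range covering
-- `radius` lines on each side of the cursor, clamped to the buffer.
-- cursor_line is 1-based, as returned by vim.fn.line(".").
local function context_range(cursor_line, line_count, radius)
  local first = math.max(0, cursor_line - 1 - radius)
  local last = math.min(line_count, cursor_line + radius)
  return first, last
end

-- In Neovim the result plugs straight into nvim_buf_get_lines:
--   local s, e = context_range(vim.fn.line("."), vim.api.nvim_buf_line_count(0), 10)
--   local lines = vim.api.nvim_buf_get_lines(0, s, e, false)
```

Keeping the radius as a parameter makes it easy to use a tight window for docstrings and a wider one for explanation requests.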

Model Selection for Neovim Use Cases

The right model depends on what you are using the Neovim integration for. For inline code completions triggered mid-edit, you want a model that responds in under two seconds. qwen2.5-coder:7b on a machine with a mid-range GPU typically manages this, while larger models like qwen2.5-coder:32b are too slow for interactive completions unless you have significant GPU memory. For chat-style interactions where you ask questions about code and read longer responses, latency matters less and a larger model produces noticeably better explanations. llama3.1:8b is a solid choice for this use case (llama3.2 itself is only published in 1B and 3B sizes).

You can configure different models for different keybindings, using a fast small model for completions and a slower larger model for explanations and documentation. This is easy to implement in the custom Lua integration by making the model name a parameter, or in ollama.nvim by defining separate named prompts that each specify a different model.
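As an illustration of the ollama.nvim approach, the fragment below defines two named prompts that each set their own model. It is a sketch under assumptions: the prompt names and model tags are examples, and the table is meant to be passed as the prompts field of the ollama.nvim opts shown earlier.

```lua
-- Example prompts table for ollama.nvim; assign it to opts.prompts.
-- Prompt names (Explain_Code, Tidy_Code) and model tags are illustrative.
local prompts = {
  Explain_Code = {
    prompt = "Explain what this code does:\n$sel",
    model = "llama3.1:8b", -- larger model, read-only output
    action = "display",
  },
  Tidy_Code = {
    prompt = "Rewrite this $ftype code to be more readable. "
      .. "Respond with only the code in a fenced block:\n$sel",
    model = "qwen2.5-coder:7b", -- faster code model for in-place edits
    action = "replace",
  },
}
```

Each prompt can then be bound to its own key with require('ollama').prompt('Explain_Code') or require('ollama').prompt('Tidy_Code').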

Performance Tips

A few configuration choices make a significant difference to perceived performance. First, keep the model loaded in Ollama’s memory between requests by setting keep_alive to a long duration — the default is 5 minutes, but for an interactive coding session you want the model to stay loaded for as long as you are actively working. Pass "keep_alive": "1h" in the request body, or set it globally via the OLLAMA_KEEP_ALIVE environment variable.
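In a custom integration, the model name and keep_alive can both be set where the request body is built. The following is a minimal sketch in plain Lua; build_chat_request is a hypothetical helper, and the default values are assumptions chosen to match the examples in this guide.

```lua
-- Hypothetical helper: builds the table sent to Ollama's /api/chat endpoint.
-- opts.model and opts.keep_alive are optional overrides.
local function build_chat_request(prompt, opts)
  opts = opts or {}
  return {
    model = opts.model or "llama3.2",
    messages = { { role = "user", content = prompt } },
    stream = true,
    keep_alive = opts.keep_alive or "1h", -- keep the model loaded between requests
  }
end

-- A fast code model for completions, a larger one for explanations.
local fast = build_chat_request("Complete this function", { model = "qwen2.5-coder:7b" })
local slow = build_chat_request("Explain this module", { model = "llama3.1:8b", keep_alive = "30m" })

-- In Neovim, pass the table through vim.json.encode() before handing it to curl.
```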

Second, keep prompts short and focused. Sending the entire buffer as context on every request significantly increases the number of tokens the model processes, which directly increases latency. For most coding tasks, the 20 lines around the cursor provide enough context for accurate completions and explanations. Reserve full-buffer context for explicit architecture or review requests where the breadth of context genuinely matters.

Third, use a quantised model. Ollama defaults to Q4 quantisation for most models, which reduces memory usage and increases generation speed with a small quality trade-off that is usually imperceptible for code tasks. If you are pulling a model and have GPU memory to spare, Q5 or Q8 gives slightly better quality — but Q4 is the right starting point for interactive use where speed is the priority.

Using avante.nvim for a Copilot-Style Experience

avante.nvim is the most polished Ollama plugin available for Neovim in 2026, providing a sidebar chat interface and inline diff suggestions that closely resemble the GitHub Copilot Chat experience. It uses a panel on the right side of the editor where you type requests, and it can propose changes directly as diffs that you can accept or reject inline — the same workflow as Cursor’s AI editing mode.

{
  "yetone/avante.nvim",
  build = "make",
  dependencies = {
    "nvim-treesitter/nvim-treesitter",
    "stevearc/dressing.nvim",
    "nvim-lua/plenary.nvim",
    "MunifTanjim/nui.nvim",
  },
  opts = {
    provider = "ollama",
    -- Recent avante.nvim releases expect provider settings under `providers`
    providers = {
      ollama = {
        model = "qwen2.5-coder:7b",
        endpoint = "http://127.0.0.1:11434",
      },
    },
  },
}

The build step runs make, which fetches prebuilt native libraries that avante uses internally (or compiles them from source if you run make BUILD_FROM_SOURCE=true, which requires a Rust toolchain). After installation, <leader>aa opens the avante panel where you can ask questions, request edits, and review suggested changes. The diff view shows exactly what avante proposes to change in your file, and you accept or reject individual hunks with standard key bindings. For developers accustomed to the Cursor or Copilot Chat workflow, avante.nvim is the closest equivalent in a fully local, privacy-preserving setup.

Integrating with nvim-cmp for Automatic Completions

If you use nvim-cmp for completions, you can add Ollama as a completion source so that AI suggestions appear alongside LSP completions, snippets, and buffer words in the familiar completion popup. The cmp-ai plugin provides this integration and supports Ollama as a backend:

{
  "tzachar/cmp-ai",
  dependencies = { "hrsh7th/nvim-cmp", "nvim-lua/plenary.nvim" },
  config = function()
    local cmp_ai = require("cmp_ai.config")
    cmp_ai:setup({
      max_lines = 100,
      provider = "Ollama",
      provider_options = {
        model = "qwen2.5-coder:7b",
      },
      notify = true,
      notify_callback = function(msg)
        vim.notify(msg)
      end,
      run_on_every_keystroke = false,
      ignored_file_types = {},
    })
  end,
}

Setting run_on_every_keystroke = false is strongly recommended: triggering an Ollama request on every keystroke would make the editor feel sluggish, because each request takes at least a few hundred milliseconds even on fast hardware. Instead, request AI completions explicitly through an nvim-cmp mapping when you actually want a suggestion. The source registers under the name cmp_ai, so it must appear either in your nvim-cmp sources list or in the mapping that triggers it; otherwise no AI suggestions will show up. The max_lines = 100 setting limits how much of the surrounding file is sent as context, keeping requests fast.
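A manual trigger can be wired up as a dedicated nvim-cmp mapping that queries only the AI source. The fragment below follows the pattern suggested in the cmp-ai README; the <C-x> key is an arbitrary choice, and the entry is meant to be merged into the mapping table of your existing cmp.setup call rather than used as a second setup.

```lua
-- Fragment for your nvim-cmp configuration (requires nvim-cmp to be loaded).
local cmp = require("cmp")

cmp.setup({
  mapping = cmp.mapping.preset.insert({
    -- <C-x> in insert mode opens the popup with suggestions from cmp_ai only
    ["<C-x>"] = cmp.mapping(
      cmp.mapping.complete({
        config = {
          sources = cmp.config.sources({
            { name = "cmp_ai" },
          }),
        },
      }),
      { "i" }
    ),
  }),
})
```

Keeping the AI source out of the default sources list means ordinary completion stays instant, and the slower Ollama request only happens when you ask for it.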

Troubleshooting Common Issues

A few issues come up regularly when setting up Ollama with Neovim. The most common is the model taking several seconds to respond on the first request after a period of inactivity — this is the model loading from disk into GPU memory. Set a long keep_alive value as described above to keep the model warm. If responses feel slow even after the model is loaded, check whether the request is sending more context than necessary — logging the full prompt in your Lua integration is the fastest way to diagnose this.

If the plugin cannot reach Ollama at all, verify that Ollama is running with ollama ps and that it is listening on the expected address. By default Ollama binds to 127.0.0.1:11434, which is only accessible from the same machine. If you run Neovim inside a Docker container, Ollama on the host needs to listen on an address the container can reach, such as 0.0.0.0, configured via the OLLAMA_HOST environment variable. Across SSH, port forwarding avoids changing Ollama's bind address, but the direction matters. If Ollama runs on the remote host and Neovim runs locally, use ssh -L 11434:localhost:11434 user@host. If Neovim runs on the remote host and Ollama runs on your local machine, use ssh -R 11434:localhost:11434 user@host instead.
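The same reachability check can be scripted from inside Neovim. This is a minimal sketch, assuming curl is on your PATH and using Ollama's /api/version endpoint; check_ollama is a hypothetical helper, and the run parameter exists only so the logic can be exercised outside Neovim.

```lua
-- Hypothetical helper: returns true plus the response body if Ollama answers,
-- or false plus the error text. `run` is injectable for testing; inside Neovim
-- it defaults to vim.fn.system, which sets vim.v.shell_error.
local function check_ollama(base_url, run)
  run = run or function(cmd)
    local out = vim.fn.system(cmd)
    return out, vim.v.shell_error
  end
  local out, code = run({ "curl", "-sS", "--max-time", "2", base_url .. "/api/version" })
  if code ~= 0 then
    return false, out
  end
  return true, out
end

-- In Neovim:
--   local ok, msg = check_ollama("http://127.0.0.1:11434")
--   vim.notify((ok and "Ollama is up: " or "Ollama unreachable: ") .. msg)
```

Running this against the address your plugin is configured with quickly separates a network or bind-address problem from a plugin configuration problem.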

Which Plugin to Choose

The right plugin depends on your workflow. If you want a minimal integration that stays out of the way and just offers completions at the cursor, llm.nvim is the right choice. If you want a prompt library with named workflows you can trigger by name, ollama.nvim gives you that structure. If you want one-shot prompts and a chat buffer with support for several backends, model.nvim covers both. If you want the closest thing to GitHub Copilot Chat in a local, privacy-preserving setup, avante.nvim is worth the slightly more complex installation. If you already use nvim-cmp and want AI suggestions in the same popup, cmp-ai slots into your existing setup. And if you prefer to control every detail of the integration, the custom Lua approach gives you complete flexibility with a few dozen lines of code. All of these approaches work with Ollama, and the choice is largely a matter of how much UI you want the plugin to manage versus how much you prefer to handle yourself in Lua.
