Tabby: The Self-Hosted Coding Assistant That Beats Copilot for Completions

Tabby is a self-hosted, open-source coding assistant that you run on your own server or workstation. Unlike Continue, which is an IDE extension that delegates inference to a separate backend such as Ollama, Tabby is a dedicated inference server built specifically for code completion: it handles the fill-in-the-middle (FIM) completion logic, context window management, and IDE integration itself. The result is faster, more accurate tab completions than a general-purpose LLM configured for coding, because Tabby uses models fine-tuned specifically for FIM code completion tasks.

Tabby vs Continue + Ollama

The key distinction: Continue sends completion requests to whichever model you have configured in a backend such as Ollama, often an instruction-tuned model (for example, an instruct variant of Qwen2.5-Coder), and prompts it using the FIM format. Tabby runs a dedicated completion-only model (like StarCoder2 or DeepSeek-Coder) that was trained specifically for inline completion, not chat. Dedicated completion models are faster at this task and produce better single-line and multi-line completions because that is all they were trained to do. The tradeoff is that this setup gives you no chat sidebar: with only a completion model loaded, Tabby is purely an autocomplete tool. Use Tabby for completions and a separate chat interface (Continue’s chat, Open WebUI, or similar) for questions and refactoring.

Installation

Tabby runs as a Docker container or a native binary. Docker is the easiest path:

# NVIDIA GPU (see the note below for the CPU-only variant)
docker run -it \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder2-3B --device cuda

# Check it is running
curl http://localhost:8080/v1/health

For NVIDIA GPU acceleration, ensure the NVIDIA Container Toolkit is installed (the same requirement as running Ollama in Docker with a GPU). For CPU-only setups, remove --gpus all and change --device cuda to --device cpu.
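
The first start can take a while because the model has to download and load, so a single curl may fail even when everything is configured correctly. Below is a minimal Python sketch that polls the same /v1/health endpoint until it answers. The function names, retry count, and delay are illustrative choices, and it assumes the endpoint returns a JSON body and requires no authentication on your instance.

```python
import json
import time
import urllib.request


def fetch_health(base_url, timeout=2.0):
    """GET <base_url>/v1/health and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(f"{base_url}/v1/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(base_url, attempts=30, delay=2.0,
                     fetch=fetch_health, sleep=time.sleep):
    """Poll the health endpoint until it answers, or give up.

    Returns the decoded health payload, or None if every attempt failed.
    `fetch` and `sleep` are injectable so the loop can be exercised offline.
    """
    for _ in range(attempts):
        try:
            return fetch(base_url)
        except (OSError, ValueError):
            # Connection refused, timeout, HTTP error, or a non-JSON body:
            # treat all of these as "not ready yet" and retry.
            sleep(delay)
    return None


# Example (requires a running server):
#   info = wait_until_ready("http://localhost:8080")
#   print("ready" if info is not None else "server did not come up")
```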

Choosing a Completion Model

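Tabby pulls models from its own model registry (registry.tabbyml.com). Pass the model ID to --model exactly as the registry spells it, since registry IDs can differ slightly from the upstream model names. The best options for local use: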

StarCoder2-3B — fastest, works on CPU, good for general code completion
StarCoder2-7B — better quality, needs 8GB+ VRAM
DeepSeek-Coder-1.3B — very fast, strong on Python and JavaScript
DeepSeek-Coder-6.7B — excellent quality, needs 8GB VRAM
CodeLlama-7B — strong multilanguage support

# List available models
curl http://localhost:8080/v1/models

# Switch models by restarting with a different --model flag
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model DeepSeek-Coder-6.7B --device cuda
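
As a rough illustration of the selection logic, the sketch below turns the VRAM guidance from the list above into the arguments for `serve`. It covers only the two StarCoder2 entries, treats 0 GB as "no NVIDIA GPU", and the function names are hypothetical; adjust the table to match the models and hardware you actually use.

```python
# (model name, minimum VRAM in GB), best quality first.
# Figures come from the list above: StarCoder2-7B needs 8GB+,
# StarCoder2-3B is described as workable on CPU, so its minimum is 0.
CANDIDATES = [
    ("StarCoder2-7B", 8.0),
    ("StarCoder2-3B", 0.0),
]


def pick_model(vram_gb):
    """Return the first (highest-quality) model whose VRAM minimum is met."""
    for name, min_vram in CANDIDATES:
        if vram_gb >= min_vram:
            return name
    raise ValueError("vram_gb must be non-negative")


def serve_args(vram_gb):
    """Build the arguments that follow `tabbyml/tabby` in the docker command."""
    device = "cuda" if vram_gb > 0 else "cpu"
    return ["serve", "--model", pick_model(vram_gb), "--device", device]


print(" ".join(serve_args(10)))
print(" ".join(serve_args(0)))
```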

VS Code Integration

Install the Tabby VS Code extension from the marketplace (search “Tabby”). In the extension settings, set the server endpoint to http://localhost:8080. Tab completions appear automatically as you type — press Tab to accept, Escape to dismiss. The Tabby extension shows a status indicator in the VS Code status bar that confirms the connection to your Tabby server is active.

Neovim Integration

-- In your Neovim config (lazy.nvim example)
{
  'TabbyML/vim-tabby',
  config = function()
    vim.g.tabby_agent_start_command = {'npx', 'tabby-agent', '--stdio'}
    vim.g.tabby_trigger_mode = 'auto'
    -- Set the server endpoint in ~/.tabby-client/agent/config.toml
  end
}

# ~/.tabby-client/agent/config.toml
[server]
endpoint = "http://localhost:8080"

JetBrains Integration

Install the Tabby plugin from the JetBrains Marketplace. Configure the server URL in Settings → Tools → Tabby. The plugin works across IntelliJ IDEA, PyCharm, WebStorm, GoLand, and all other JetBrains IDEs with the same configuration.

Docker Compose Setup

version: '3.8'
services:
  tabby:
    image: tabbyml/tabby
    command: serve --model StarCoder2-3B --device cuda
    ports:
      - '8080:8080'
    volumes:
      - tabby_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  tabby_data:

Context-Aware Completion with Repository Indexing

Tabby’s most powerful feature beyond basic completion is repository indexing. When you configure a repository, Tabby indexes the codebase and uses it as additional context for completions, so it can complete function calls using your project’s actual APIs rather than generic patterns from the training data. This makes completions far more useful in large codebases with established conventions and abstractions that a general model has never seen.

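# Index a local repository (verify this endpoint against your Tabby version;
# some releases declare repositories in ~/.tabby/config.toml instead)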
curl -X POST http://localhost:8080/v1/repositories \
  -H 'Content-Type: application/json' \
  -d '{"name": "my-project", "git_url": "file:///path/to/repo"}'

# Tabby re-indexes automatically when files change
# Check index status
curl http://localhost:8080/v1/repositories

Monitoring and Usage Analytics

Tabby includes a built-in dashboard at http://localhost:8080 (open in browser) that shows completion acceptance rate, number of completions generated, and performance metrics. The acceptance rate — what fraction of shown completions the developer actually accepts — is the most useful signal for whether the model is producing relevant suggestions. A healthy acceptance rate for a well-configured Tabby instance is 20–35%. Below 15% suggests the model or context configuration needs adjustment; above 40% is excellent.
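
To make the thresholds concrete, here is a small Python sketch that computes an acceptance rate from raw counts and maps it onto the bands described above. The text leaves 15–20% and 35–40% unlabelled, so the "borderline" and "above typical" labels are assumptions added for illustration, as are the function names.

```python
def acceptance_rate(accepted, shown):
    """Fraction of shown completions that the developer accepted."""
    if shown <= 0:
        raise ValueError("shown must be positive")
    return accepted / shown


def classify(rate):
    """Map an acceptance rate (0.0 to 1.0) onto the bands from the text."""
    if rate < 0.15:
        return "needs adjustment"
    if rate < 0.20:
        return "borderline"        # assumption: band not named in the text
    if rate <= 0.35:
        return "healthy"
    if rate <= 0.40:
        return "above typical"     # assumption: band not named in the text
    return "excellent"


# Hypothetical counts: 270 accepted out of 1000 shown.
rate = acceptance_rate(270, 1000)
print(f"{rate:.0%} -> {classify(rate)}")
```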

When to Use Tabby vs Continue

Use Tabby when inline tab completion is your primary need and you want the best completion quality for the least latency. Its dedicated completion models are faster and more accurate for this specific task than routing completions through a general chat model. Use Continue when you need the chat sidebar, codebase semantic search with @codebase, and slash commands — Continue’s chat capabilities are richer than Tabby’s completion-only focus. Many developers run both simultaneously: Tabby for tab completions in the editor and Continue (or Open WebUI) for chat-based code assistance, using each tool for what it does best.

Why Dedicated Completion Models Matter

The difference between a dedicated completion model and a general chat model for tab completions is more significant than it first appears. A general chat model like Llama 3.1 8B was trained primarily on instruction-following and conversation tasks. When used for code completion via the fill-in-the-middle prompt format, it works, but the model is applying general language understanding to a specialised task it was not optimised for. A dedicated completion model like StarCoder2 or DeepSeek-Coder was trained almost exclusively on code, with the FIM task as a primary objective. The training data, tokeniser, and fine-tuning process are all optimised specifically for the pattern of seeing a prefix, a suffix, and filling in the gap.
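
The prefix/suffix/gap pattern is easier to see in code. The sketch below splits a buffer at the cursor and assembles a FIM prompt using the StarCoder-family sentinel tokens (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`). Other model families use different sentinels, and Tabby assembles the real prompt server-side, so treat this as an illustration of the format rather than Tabby's implementation; `build_fim_prompt` is a hypothetical helper.

```python
# StarCoder-family FIM layout: the model generates the text that
# belongs between the prefix and the suffix, after <fim_middle>.
FIM_TEMPLATE = "<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


def build_fim_prompt(source, cursor):
    """Split `source` at the cursor offset and wrap both halves in FIM sentinels."""
    prefix, suffix = source[:cursor], source[cursor:]
    return FIM_TEMPLATE.format(prefix=prefix, suffix=suffix)


source = "def add(a, b):\n    return \nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")  # cursor sits right after "return "
print(build_fim_prompt(source, cursor))
```

A model trained with this objective learns to stop once the gap is filled, which is part of why it behaves better for inline completion than a chat model given the same text.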

In practice this translates to three measurable differences. First, latency: a 3B completion model produces its first token in roughly 200–400ms on a typical modern GPU (less on faster cards), fast enough to feel instantaneous during typing. A 7B general model used for completions typically takes 500ms–1s for the first token, which creates a noticeable lag that interrupts flow. Second, single-line accuracy: completion models are significantly better at completing the current line correctly, which is the most common and valuable completion case. Third, context efficiency: completion models use their context window more efficiently for code, paying less attention to irrelevant parts of the file and more to the immediate syntactic context around the cursor.

Performance Expectations by Hardware

Tabby’s performance varies significantly by model and hardware. On an RTX 3080 (10GB VRAM), StarCoder2-3B runs at 80–120 tokens/sec with sub-200ms first-token latency — fast enough that completions appear before most developers finish typing a line. DeepSeek-Coder-6.7B on the same hardware runs at 40–60 tokens/sec with 300–500ms first-token latency, which is acceptable but more noticeable. On Apple Silicon, the Metal backend gives StarCoder2-3B roughly 60–80 tokens/sec on an M2 Pro, which is excellent for a laptop workstation. On CPU-only hardware, StarCoder2-3B runs at 8–15 tokens/sec — slow for interactive completion but usable for developers on machines without a discrete GPU who want any AI assistance.
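
Tokens per second and first-token latency combine into the number a developer actually feels: the time until the whole suggestion is on screen. The sketch below estimates it as first-token latency plus decode time for the remaining tokens. The 20-token completion length is an assumption for illustration, and the inputs are picked from the ranges quoted above rather than measured.

```python
def completion_time_ms(first_token_ms, tokens_per_sec, completion_tokens=20):
    """Estimate the time to produce a full completion, in milliseconds.

    The first token arrives after `first_token_ms`; each remaining token
    takes 1000 / tokens_per_sec milliseconds to decode.
    """
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    remaining = max(completion_tokens - 1, 0)
    return first_token_ms + 1000.0 * remaining / tokens_per_sec


# Mid-range figures taken from the paragraph above (RTX 3080):
print("StarCoder2-3B:      ", round(completion_time_ms(200, 100)), "ms")
print("DeepSeek-Coder-6.7B:", round(completion_time_ms(400, 50)), "ms")
```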

The acceptance rate metric in Tabby’s dashboard is more useful than raw token speed for assessing whether the model is a good fit for your workflow. A model that generates at 100 tokens/sec but produces completions you accept only 10% of the time is less valuable than a model that runs at 50 tokens/sec but produces completions you accept 30% of the time. Profile your acceptance rate over a week of normal coding work before concluding that you need a larger model — the quality improvement from StarCoder2-3B to StarCoder2-7B is real but modest for most developers, while the latency difference is significant.
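
One rough way to put the two example models on the same scale is to multiply generation speed by acceptance rate, giving "accepted tokens per second of generation". This is a deliberately crude metric invented for this illustration, not something Tabby's dashboard reports, and it ignores completion length and latency.

```python
def accepted_tokens_per_sec(tokens_per_sec, acceptance_pct):
    """Generation speed scaled by the percentage of completions accepted."""
    return tokens_per_sec * acceptance_pct / 100


fast_but_ignored = accepted_tokens_per_sec(100, 10)  # 100 tok/s, 10% accepted
slower_but_used = accepted_tokens_per_sec(50, 30)    # 50 tok/s, 30% accepted

print(fast_but_ignored, slower_but_used)
print("slower model delivers more accepted output:", slower_but_used > fast_but_ignored)
```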

Repository Indexing in Practice

The repository indexing feature is worth setting up even though it requires an extra configuration step. Without indexing, Tabby’s completions are based only on the content currently open in your editor — the current file and a small window of recently opened files. With indexing, Tabby can pull relevant context from anywhere in your codebase: if you are calling a function defined in another file, Tabby has seen that function’s signature and implementation and can complete the call correctly. For small projects under a few thousand files, the index builds in seconds and is kept up to date automatically as files change. For very large monorepos, consider indexing only the most relevant subdirectories rather than the full repository to keep index build times reasonable.

The index also enables a feature called declaration snippets — when completing a function call, Tabby can retrieve the function’s declaration and include it as context for the completion model even if the declaration is in a file you have not recently opened. This is particularly valuable in large codebases where the relevant context is spread across many files and manual navigation to find the definition would interrupt the coding flow.
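
The idea behind declaration snippets can be shown with a toy index. The sketch below scans Python files for `def` lines, then looks up any identifier on the line being typed and returns the matching declaration as extra context. It is a simplified illustration only: Tabby's real index is not built this way, and `build_index` and `snippets_for` are hypothetical names.

```python
import re

# Matches a single-line Python function declaration and captures its name.
DECL_RE = re.compile(r"^[ \t]*def\s+(\w+)\s*\(.*$", re.MULTILINE)
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def build_index(files):
    """Map function name -> 'path + declaration line' across all files.

    `files` is a dict of {path: source text}.
    """
    index = {}
    for path, text in files.items():
        for match in DECL_RE.finditer(text):
            index[match.group(1)] = f"# {path}\n{match.group(0).strip()}"
    return index


def snippets_for(prefix, index):
    """Return declarations for identifiers on the line currently being typed."""
    current_line = prefix.rsplit("\n", 1)[-1]
    names = IDENT_RE.findall(current_line)
    return [index[name] for name in names if name in index]


files = {
    "billing.py": "def charge(customer_id: int, cents: int) -> bool:\n    return True\n",
}
index = build_index(files)
# The editor has a different file open; the cursor is right after "charge(".
for snippet in snippets_for("import billing\nok = billing.charge(", index):
    print(snippet)
```

With the declaration in the prompt, the completion model can see the parameter names and types even though billing.py was never opened in the editor.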

Multi-User Team Setup

Tabby is designed to work as a shared team server, not just a personal tool. Running a single Tabby instance on a powerful server and having all team members point their IDE plugins at it eliminates the need for every developer to have a GPU-equipped machine. The server handles authentication (JWT tokens or simple API keys), per-user usage analytics, and concurrent completions from multiple developers. For a team of 5–10 developers, a single server with an RTX 3080 or 3090 runs Tabby comfortably for all users simultaneously — the per-user active typing rate is low enough that the GPU is rarely saturated even with many connected users, because developers spend most of their time reading and thinking rather than actively triggering completions.
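
The "rarely saturated" argument is a back-of-envelope calculation: multiply the number of developers by how often each one triggers a completion and by how long each completion occupies the GPU. The sketch below does that arithmetic. Every input in the example is a hypothetical placeholder, not a measurement; substitute figures from your own dashboard.

```python
def gpu_busy_fraction(developers, completions_per_dev_per_min, ms_per_completion):
    """Fraction of each minute the GPU spends generating, assuming requests
    are served one at a time and do not overlap."""
    busy_ms_per_min = developers * completions_per_dev_per_min * ms_per_completion
    return busy_ms_per_min / 60_000


# Hypothetical team: 10 developers, 6 completions per minute each
# while actively typing, 400 ms of GPU time per completion.
load = gpu_busy_fraction(10, 6, 400)
print(f"GPU busy about {load:.0%} of the time")
```

Because developers are rarely all typing at once, real load is usually lower than this worst-case estimate, which treats every developer as continuously active.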

Privacy Advantages

Like all local AI tools, Tabby’s primary privacy advantage is that your code never leaves your infrastructure. This matters in ways that go beyond compliance: proprietary algorithms, unreleased features, client data embedded in tests, and security-sensitive configurations are all sent to a third party whenever a cloud-based code assistant generates a completion. Self-hosting Tabby removes that exposure. The code that passes through Tabby stays within your network, and completion requests are processed ephemerally without logging code content by default. For organisations with strict data handling requirements or developers working on sensitive projects, a self-hosted tool like Tabby may be the only viable option for AI-assisted coding, regardless of quality comparisons with cloud alternatives.

Getting Started in 5 Minutes

The fastest path to a working Tabby setup: install Docker, run the Docker command with StarCoder2-3B, install the VS Code extension, point it at localhost:8080, and open any code file. Within two minutes of running the Docker command you will have AI tab completions in your editor with no subscription cost. The acceptance rate dashboard at localhost:8080 then shows whether the completions are useful for your specific codebase and coding style.

If StarCoder2-3B feels too slow for interactive use, switch to DeepSeek-Coder-1.3B for lower latency. If the quality feels insufficient on complex completions, upgrade to DeepSeek-Coder-6.7B or StarCoder2-7B when you have the VRAM headroom. The model swap requires only restarting the Docker container with a different --model flag. No configuration changes are needed in the IDE plugins, since they connect to the same endpoint regardless of which model is running.

Tabby is actively developed, with regular releases adding new models, improved context strategies, and better IDE integrations; checking the TabbyML GitHub repository monthly keeps you current with the improvements that matter for your editor and language stack. It represents a maturing approach to developer tooling: purpose-built, self-hosted, privacy-preserving, and increasingly competitive with cloud alternatives on the quality metrics that matter during a coding session. If you have been using cloud-based AI completions and have not tried a self-hosted alternative, Tabby is a compelling reason to make the switch: the setup takes five minutes, and the result is a coding workflow that can be both faster and more private than a cloud-based one.