How to Use Ollama with JavaScript and Node.js

A complete guide to using Ollama from JavaScript and Node.js: installing the official ollama npm package; chat completions with system prompts and options; streaming responses with async iterators; text generation for classification; generating embeddings with cosine similarity; managing models programmatically; building a streaming Express SSE endpoint; consuming the stream from browser JavaScript; connecting to a remote Ollama host; multi-turn conversation with history; TypeScript types; and when to use the native JS library versus the OpenAI SDK.
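Two of the topics above can be sketched in a few lines, assuming the official `ollama` npm package is installed and a local server has the model pulled; the model name "llama3.2" and the prompts are illustrative assumptions:

```javascript
// Cosine similarity between two embedding vectors, e.g. the arrays returned
// by the embeddings endpoint -- higher means more semantically similar.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Streaming chat: with stream: true the library returns an async iterator
// that yields partial responses as the model generates them.
async function streamChat(userPrompt) {
  const { default: ollama } = await import('ollama'); // npm install ollama
  const stream = await ollama.chat({
    model: 'llama3.2',
    messages: [
      { role: 'system', content: 'You are a concise assistant.' },
      { role: 'user', content: userPrompt },
    ],
    stream: true,
    options: { temperature: 0.2 },
  });
  for await (const part of stream) {
    process.stdout.write(part.message.content);
  }
}
```

Calling `streamChat('Why is the sky blue?')` requires a running Ollama server; the cosine helper is plain JavaScript and works on any pair of equal-length vectors.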

Ollama Keep-Alive and Model Preloading: Eliminate Cold Start Latency

A practical guide to eliminating Ollama cold-start latency: how keep-alive works and why it matters; setting keep_alive per-request to -1 for permanent loading or 0 for immediate unloading; setting OLLAMA_KEEP_ALIVE globally; pre-loading models at application startup with a minimal dummy request; running multiple models simultaneously with OLLAMA_MAX_LOADED_MODELS; inspecting loaded models and VRAM usage via /api/ps; manually unloading models to free VRAM; and recommended settings for interactive chat, batch processing, multi-model RAG, and low-VRAM machines.
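The pre-load and unload patterns above can be sketched with plain fetch, assuming the default localhost:11434 host; a generate request with no prompt loads the model without producing tokens, and keep_alive controls how long it stays resident:

```javascript
// Host is an assumption -- Ollama listens on 11434 by default.
const OLLAMA_HOST = 'http://localhost:11434';

// keep_alive values: -1 = keep loaded indefinitely, 0 = unload immediately,
// a duration string like "10m" = unload after ten minutes of inactivity.
function keepAliveBody(model, keepAlive) {
  return JSON.stringify({ model, keep_alive: keepAlive });
}

// Pre-load at application startup: a prompt-less generate call pulls the
// model into VRAM so the first real request skips the cold start.
async function preload(model) {
  await fetch(`${OLLAMA_HOST}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: keepAliveBody(model, -1),
  });
}

// Manually unload to free VRAM, e.g. before loading a larger model.
async function unload(model) {
  await fetch(`${OLLAMA_HOST}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: keepAliveBody(model, 0),
  });
}
```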

Tabby: The Self-Hosted Coding Assistant That Beats Copilot for Completions

A complete guide to Tabby, the self-hosted coding assistant built specifically for inline tab completions: how it differs from Continue and why dedicated completion models are faster and more accurate; Docker installation with an NVIDIA GPU; choosing between StarCoder2 and DeepSeek-Coder models; VS Code, Neovim, and JetBrains plugin setup; Docker Compose for persistent deployment; repository indexing for codebase-aware completions; monitoring acceptance rates in the built-in dashboard; and when to use Tabby versus Continue.
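A minimal sketch of the persistent Docker Compose deployment mentioned above; the image, model name, and GPU reservation follow Tabby's published examples, but treat the exact tags and flags as assumptions to verify against the current docs:

```
services:
  tabby:
    image: tabbyml/tabby
    restart: unless-stopped
    command: serve --model StarCoder2-3B --device cuda
    ports:
      - "8080:8080"
    volumes:
      - "$HOME/.tabby:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

The `/data` volume keeps downloaded models and the repository index across container restarts, which is what makes the deployment persistent.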

Ollama REST API Reference: Every Endpoint with Examples

A complete reference for the Ollama REST API: listing and managing models with /api/tags, /api/pull, /api/delete, and /api/copy; running chat completions and raw text generation with /api/chat and /api/generate; generating embeddings with /api/embeddings; inspecting running models and VRAM usage with /api/ps; getting model details and Modelfiles with /api/show; creating custom models programmatically with /api/create; all key inference options; parsing the streaming response format including performance statistics; and health check patterns.
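The streaming response format and its performance statistics can be sketched as follows, assuming Node 18+ (global fetch) and the default host; each line of the stream is one JSON object, and the final one carries done: true plus fields such as eval_count and eval_duration:

```javascript
// Split a newline-delimited JSON payload into parsed chunk objects.
function parseStreamChunks(ndjson) {
  return ndjson
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}

// Tokens per second from the final chunk's statistics
// (eval_duration is reported in nanoseconds).
function tokensPerSecond(finalChunk) {
  return finalChunk.eval_count / (finalChunk.eval_duration / 1e9);
}

// Requires a running server. For simplicity this assumes each network chunk
// contains whole lines; a robust client buffers partial lines across chunks.
async function generate(model, prompt) {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt, stream: true }),
  });
  let text = '';
  const decoder = new TextDecoder();
  for await (const chunk of res.body) {
    for (const part of parseStreamChunks(decoder.decode(chunk))) {
      if (part.done) console.error(`${tokensPerSecond(part).toFixed(1)} tok/s`);
      else text += part.response;
    }
  }
  return text;
}
```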

Mistral Nemo 12B: What It Is and When to Use It Locally

A practical guide to Mistral Nemo 12B for local use: what makes it distinctive, including a native 128K context window, strong multilingual support, and the efficient Tekken tokeniser; hardware requirements at each quantisation level; running it with Ollama and configuring a 32K-context Modelfile; benchmarking it against Llama 3.2 8B for your specific tasks; the four scenarios where its VRAM premium is justified; and how it compares to Mistral 7B and Mixtral 8x7B.
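The 32K-context Modelfile mentioned above can be sketched in two lines (the base tag and context size are illustrative; 32K trades away part of the native 128K window to save VRAM):

```
FROM mistral-nemo
PARAMETER num_ctx 32768
```

Build it with `ollama create mistral-nemo-32k -f Modelfile` and run the resulting tag as you would any other model.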

How to Fine-Tune Embedding Models with Contrastive Learning

A practical guide to fine-tuning embedding models with contrastive learning: the Multiple Negatives Ranking Loss objective; building training datasets with synthetic query generation; hard negative mining; Matryoshka Representation Learning for flexible dimensions; evaluation with InformationRetrievalEvaluator; and when domain adaptation is actually worth the engineering cost.
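As a sketch of the objective named above: with in-batch negatives, Multiple Negatives Ranking Loss scores each anchor (query) embedding against every positive passage in the batch and treats only its own pair as correct; in the usual temperature-scaled similarity form,

```latex
\mathcal{L}_{\mathrm{MNRL}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\!\big(\mathrm{sim}(a_i, p_i)/\tau\big)}
              {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(a_i, p_j)/\tau\big)}
```

where \(a_i\) is the anchor embedding, \(p_i\) its positive, \(\mathrm{sim}\) is typically cosine similarity, and \(\tau\) is a temperature (equivalently a similarity scale factor). Larger batches give more in-batch negatives, which is why batch size matters so much for this objective.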