How to Use Cross-Encoders for Reranking in RAG Pipelines

A practical guide to cross-encoder reranking for ML engineers building RAG systems: why bi-encoder retrieval misses relevant chunks, how cross-encoders score query-document pairs jointly, reranking with sentence-transformers' ms-marco and BAAI/bge-reranker models, integrating via LangChain's ContextualCompressionRetriever, latency and batching optimisation, and how to choose between open-source and hosted reranker options.

Open WebUI: Features, Settings, and Admin Guide

A complete guide to Open WebUI beyond the basics: multi-user management with admin and pending roles, configuring system prompts and custom models per use case, the document RAG library for team knowledge bases, web search integration with SearXNG or Bing, conversation branching and message editing, Arena mode for side-by-side model comparison, the Functions and Pipelines extensibility system, the OpenAI-compatible API with generated keys, and key admin settings to configure for team deployments.

Chunking Strategies for RAG: Fixed-Size, Semantic, and Hierarchical

A practical guide to RAG chunking strategies for ML engineers: recursive fixed-size chunking with token-aware overlap, semantic chunking via sentence-level similarity breakpoints, hierarchical parent-child chunking for precision-plus-context retrieval, document-aware splitting for structured corpora, and how to choose chunk size empirically using RAGAS context recall.
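The simplest of the strategies above, fixed-size chunking with overlap, is a sliding window; this is a minimal character-based sketch (production splitters count tokens and prefer natural separators, as LangChain's RecursiveCharacterTextSplitter does), with all names chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous by `overlap`.

    Overlap keeps sentences that straddle a boundary retrievable from
    at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```

A token-aware version would measure lengths with the embedding model's tokenizer instead of `len()`, but the windowing logic is the same.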

How to Summarise Meeting Notes with a Local LLM

A practical guide to summarising meeting notes and transcripts locally with Ollama: a structured summarisation prompt with sections for overview, discussion points, decisions, action items, and open questions, transcribing recorded meetings with Whisper then summarising, extracting action items as structured JSON, generating follow-up emails from the summary, a complete command-line tool with argparse, choosing between Llama 3.2, Mistral Nemo, and Qwen2.5, handling variable note formats, and the privacy case for keeping sensitive meeting content off cloud services.
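The structured-prompt approach above can be sketched against Ollama's local HTTP API using only the standard library; the section headings, model name, and function are illustrative placeholders, and a running Ollama server with the model pulled is assumed.

```python
import json
import urllib.request

# Illustrative structured prompt; the real article's prompt also covers
# discussion points and open questions.
PROMPT = """Summarise the following meeting notes.
Use these sections: Overview, Decisions, Action Items.

Notes:
{notes}
"""


def summarise(notes: str, model: str = "llama3.2",
              host: str = "http://localhost:11434") -> str:
    """Send the filled-in prompt to a local Ollama server and return the summary."""
    payload = json.dumps({
        "model": model,
        "prompt": PROMPT.format(notes=notes),
        "stream": False,  # return one complete JSON response, not a token stream
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because everything runs against localhost, the meeting content never leaves the machine, which is the privacy argument the article makes.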

Mamba and State Space Models: How They Work and How They Compare to Transformers

A practical deep-dive into Mamba and state space models for ML engineers: the SSM recurrence and linear-time scaling, Mamba's selective state spaces with input-dependent parameters, parallel scan CUDA kernels, Mamba vs transformer performance tradeoffs on recall and throughput, Mamba-2 and SSM variants including RWKV and Griffin, and when to reach for Mamba over a transformer in production.

Best Ollama Models in 2026: A Practical Guide by Use Case

A curated guide to the best Ollama models in 2026 by use case: Llama 3.2 8B as the best all-around daily driver, Qwen2.5-Coder 7B for coding and debugging, Gemma 3 4B for constrained hardware with multimodal capability, Mistral Nemo 12B for long documents with 32K context, nomic-embed-text for RAG and embeddings, Qwen2.5-VL 7B for structured image analysis, Gemma 3 27B and Llama 3.3 70B for Apple Silicon with large unified memory, multilingual options, and a quick reference table for all use cases.

How to Evaluate a RAG Pipeline: Metrics, Tools, and What to Fix

A practical guide to RAG evaluation for ML engineers: decomposing retrieval and generation quality, RAGAS metrics including context precision, context recall, faithfulness and answer relevancy, diagnosing low retrieval recall with chunking and re-ranking fixes, diagnosing generation faithfulness failures, and building an automated production eval pipeline with online and offline metrics.

Continue vs GitHub Copilot: Which AI Coding Assistant Is Better?

A practical comparison of Continue and GitHub Copilot for VS Code developers: setup requirements and time to first completion, completion quality for everyday tasks vs complex problems, chat features including Continue’s @codebase semantic search across your entire project vs Copilot’s open-file context, privacy implications of cloud vs local processing, cost breakdown for individuals and teams, the hybrid Continue+cloud API approach, IDE support across editors, and guidance on which tool to choose based on your specific priorities.

Transformer Models for Time Series Forecasting: TFT, PatchTST, and iTransformer

A practical guide to transformer-based time series forecasting: Temporal Fusion Transformer for multivariate problems with rich covariates and probabilistic output, PatchTST for long-horizon univariate forecasting via patch tokenisation, iTransformer for dense multivariate problems via inverted attention, when to use each, and why you should always benchmark against simple baselines first.

Gemma 3: Google’s Multimodal Local LLM Explained

A practical guide to running Google’s Gemma 3 locally with Ollama: the 1B, 4B, 12B, and 27B variants and their VRAM requirements, native multimodal image analysis at every size above 1B, CLI and Python usage including image inputs, how Gemma 3 4B compares to Llama 3.2 8B on reasoning tasks, the 12B as a multimodal sweet spot, 27B for frontier-class local quality on Apple Silicon, configuring a 32K context Modelfile, strong multilingual support, and how to choose between Gemma 3 and other local model families.