How to Count Tokens and Estimate LLM Costs Before You Ship

A practical guide to LLM token counting and cost estimation for ML engineers: accurate token counting with tiktoken for OpenAI models and the Anthropic token counting API for Claude, building a multi-provider cost estimator with current pricing, pre-flight checks to catch context window overflows and budget breaches before API calls, and a production logging wrapper for per-request cost attribution and identifying expensive outlier requests.
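
The estimator piece of that guide can be sketched in a few lines of plain Python. The model names and per-million-token prices below are illustrative placeholders, not current rates; a real estimator would load prices from the providers' pricing pages and take token counts from tiktoken or the Anthropic counting API:

```python
# Hypothetical per-million-token prices (USD) -- placeholders, not real rates.
PRICING_PER_MTOK = {
    "gpt-4o":        {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one request, given token counts."""
    rates = PRICING_PER_MTOK[model]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000

cost = estimate_cost("gpt-4o", input_tokens=12_000, output_tokens=800)
print(f"${cost:.4f}")  # prints $0.0380
```

A pre-flight check is then just a comparison of this estimate against a per-request budget before the API call is made.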

How to Add Image Captioning to Your App with a Local LLM

A practical guide to local image captioning and visual AI with Ollama: pulling LLaVA, moondream, and Gemma 3 vision models, captioning images with base64 encoding, visual question answering for receipts and diagrams, batch processing a folder of images to CSV, extracting text from photos with OCR-style prompts, structured image classification with Pydantic and Literal types, and a comparison of LLaVA vs moondream vs Gemma 3 for different vision tasks.
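
The base64 step that captioning hinges on needs nothing beyond the standard library. This sketch builds the JSON body for a POST to Ollama's `/api/chat` endpoint, which accepts base64-encoded images in a message's `images` field; the prompt, model name, and fake image bytes are placeholders:

```python
import base64
import json

def build_caption_request(image_bytes: bytes, model: str = "llava") -> str:
    """Build the JSON body for a POST to Ollama's /api/chat endpoint.

    The image travels as a base64 string inside the message's "images"
    list; swap in whichever vision model you have pulled.
    """
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }
    return json.dumps(payload)

# In practice image_bytes would come from open("photo.jpg", "rb").read().
body = build_caption_request(b"\x89PNG fake bytes")
```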

Label Smoothing: When It Helps and When It Hurts

A practical guide to label smoothing for ML engineers: how soft targets prevent logit overconfidence, PyTorch implementation with nn.CrossEntropyLoss and a manual version for fine-grained control, the three settings where smoothing reliably helps (large-scale classification, seq2seq, small-data fine-tuning), why it actively hurts knowledge distillation, choosing smoothing values, and measuring calibration improvement with Expected Calibration Error.
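
The soft-target construction at the heart of the technique fits in one function. This plain-Python sketch follows the convention used by `torch.nn.CrossEntropyLoss(label_smoothing=...)`, where the uniform mass `eps / K` is spread over all classes including the true one; `eps=0.1` is just a common default:

```python
def smooth_targets(label: int, num_classes: int, eps: float = 0.1) -> list[float]:
    """Soft target distribution for one example: uniform mass eps/K on
    every class, plus 1 - eps concentrated on the true class."""
    base = eps / num_classes
    targets = [base] * num_classes
    targets[label] += 1.0 - eps
    return targets

t = smooth_targets(label=2, num_classes=4, eps=0.1)
# approximately [0.025, 0.025, 0.925, 0.025]; always sums to 1
```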

How to Use Ollama with LangChain

A complete guide to using Ollama as the LangChain backend: installing langchain-ollama, using OllamaLLM and ChatOllama with system and human messages, building LCEL chains with prompt templates and StrOutputParser, a full RAG pipeline using OllamaEmbeddings with nomic-embed-text and Chroma vectorstore, adding conversation memory with InMemoryChatMessageHistory and RunnableWithMessageHistory, and creating a tool-using ReAct agent with LangGraph.

ColBERT and Late Interaction Retrieval: How It Works and When to Use It

A practical guide to ColBERT late interaction retrieval for ML engineers: how MaxSim scoring over per-token embeddings outperforms single-vector bi-encoders, using RAGatouille for indexing and search, two-stage retrieval with bi-encoder first stage plus ColBERT reranking, fine-tuning ColBERT on domain-specific query-document triples with RAGTrainer, and when to use bi-encoder vs ColBERT vs cross-encoder for different RAG pipeline architectures.
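
The MaxSim score itself reduces to a few lines: for each query token embedding, take the maximum similarity against all document token embeddings, then sum over query tokens. The toy 2-d embeddings below are invented for illustration (real ColBERT embeddings are normalised vectors of dimension ~128):

```python
def maxsim_score(query_embs: list[list[float]],
                 doc_embs: list[list[float]]) -> float:
    """ColBERT late-interaction score: sum over query tokens of the
    max dot product against every document token embedding."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]  # one token matches a query token exactly
doc_b = [[0.5, 0.5]]              # only a partial match for both
# doc_a scores higher: 1.0 + 0.5 = 1.5 vs 0.5 + 0.5 = 1.0
```

Because each query token is matched independently, a document covering only part of the query still earns credit for the tokens it does match, which is the advantage over single-vector bi-encoders.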

How to Compare Two Documents with a Local LLM

A practical guide to comparing documents with a local LLM using Ollama: a general compare_documents function with a focus parameter, structured diff output using Pydantic with additions, removals, modifications, conflicts, and summary fields, a chunked comparison approach for long documents that exceed the context window, question-answering across two documents simultaneously, and specific use cases where local inference is essential, including legal contracts, research papers, and policy documents.
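
The chunked approach for long documents can be sketched as a character-window chunker plus chunk pairing; the window and overlap sizes below are arbitrary placeholders, and each pair would then be fed into a comparison prompt:

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows so each pairwise
    comparison prompt stays inside the model's context window."""
    step = max(1, max_chars - overlap)  # guard against a zero/negative step
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
    return chunks

def pair_chunks(doc_a: str, doc_b: str, **kwargs) -> list[tuple[str, str]]:
    """Align the two documents' chunks positionally, padding the shorter
    side with empty strings so every chunk gets compared."""
    a, b = chunk_text(doc_a, **kwargs), chunk_text(doc_b, **kwargs)
    n = max(len(a), len(b))
    a += [""] * (n - len(a))
    b += [""] * (n - len(b))
    return list(zip(a, b))
```

Positional pairing assumes the documents are roughly parallel in structure; for heavily reordered documents you would match chunks by similarity instead.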

Hard Negative Mining for Embedding Model Training

A practical guide to hard negative mining for ML engineers training embedding models: why random negatives produce weak gradient signal, BM25-mined hard negatives with rank_bm25, embedding-mined negatives with FAISS and sentence-transformers, cross-encoder filtering to identify the hardest candidates, training with MultipleNegativesRankingLoss, and iterative mining pipelines used by state-of-the-art models like E5 and BGE.
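
The selection step can be sketched independently of any retrieval library: given a candidate list already ranked by similarity (from BM25 or an embedding index), take the highest-ranked non-positives as hard negatives. The `skip_top` guard against unlabelled false negatives is a common convention, and the document ids here are invented:

```python
def mine_hard_negatives(ranked_ids: list[str], positive_id: str,
                        k: int = 4, skip_top: int = 1) -> list[str]:
    """From a similarity-ranked candidate list, take the k highest-ranked
    non-positives as hard negatives. skip_top discards the very top ranks,
    which often contain unlabelled true positives."""
    negatives = []
    for rank, doc_id in enumerate(ranked_ids):
        if doc_id == positive_id:
            continue  # never use the labelled positive as a negative
        if rank < skip_top:
            continue  # too close to the top: possible false negative
        negatives.append(doc_id)
        if len(negatives) == k:
            break
    return negatives

hard = mine_hard_negatives(["d9", "d3", "d7", "d1", "d5"],
                           positive_id="d3", k=2, skip_top=1)
# -> ["d7", "d1"]
```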

How to Use Ollama with Go

A complete guide to the official Ollama Go library: installing with go get, streaming chat with the callback handler, collecting a complete non-streaming response, raw completions via the generate endpoint, generating embeddings with nomic-embed-text, listing and pulling models with progress callbacks, connecting to a remote Ollama server with a custom client URL, and building a full multi-turn CLI chatbot with conversation history.

How to Use HuggingFace Fast Tokenizers Efficiently

A practical guide to HuggingFace fast tokenizers for ML engineers: how Rust-backed fast tokenizers differ from slow Python tokenizers, using offset mappings for NER and QA span alignment, high-throughput batched tokenisation with datasets.map and multiprocessing, sliding window tokenisation for long documents with stride and overflow, training a custom BPE vocabulary with the tokenizers library, and debugging gotchas around special tokens and sequence pair handling.
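
The stride-and-overflow behaviour discussed there can be mimicked in plain Python to make the semantics concrete: with HF fast tokenizers, `stride` is the number of tokens shared between consecutive overflow windows, so each window starts `max_len - stride` tokens after the previous one. The `max_len` and `stride` values below are placeholders:

```python
def sliding_windows(token_ids: list[int], max_len: int = 512,
                    stride: int = 128) -> list[list[int]]:
    """Split a long token sequence into overflow windows, mirroring the
    return_overflowing_tokens/stride semantics of HF fast tokenizers:
    consecutive windows overlap by `stride` tokens."""
    step = max_len - stride  # how far each new window advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end
    return windows

ids = list(range(1000))
windows = sliding_windows(ids, max_len=512, stride=128)
# 3 windows; each shares its first 128 tokens with the previous window's tail
```

(This sketch handles token ids only; the real tokenizer also remaps offset mappings and `overflow_to_sample_mapping` for you.)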

How to Use Local AI with Obsidian: Smart Notes Without the Cloud

A guide to connecting Ollama with Obsidian for fully local AI-assisted note-taking: the Ollama community plugin for per-note summarisation and action item extraction, Smart Connections plugin for semantic indexing of the entire vault with nomic-embed-text and vault-wide RAG chat, Text Generator plugin via the OpenAI-compatible endpoint, a practical meeting notes workflow, building a queryable personal knowledge base, hardware recommendations, and getting started with just two ollama pulls.