How to Use Ollama with Celery for Async AI Tasks

A complete guide to using Celery and Redis with Ollama for async AI background processing: why slow LLM inference belongs in a task queue, defining Celery tasks for text summarisation and document classification with Pydantic structured output, starting workers with configurable concurrency, a FastAPI integration that returns a task ID immediately and exposes a polling endpoint, and batch processing of documents in parallel with Celery's group primitive.

How to Use einops for Cleaner Tensor Operations in PyTorch

A practical guide to einops for ML engineers: rearrange for readable dimension splitting, merging, and transposing with named axes, reduce for explicit pooling over named dimensions, repeat as a drop-in for unsqueeze/expand chains, einsum with readable named contractions, the Rearrange nn.Module layer for use in Sequential and torch.compile, and a complete ViT patch embedding implementation using einops throughout.

How to Use Ollama with Swift and iOS

A complete guide to local AI in Swift: a URLSession-based OllamaClient with non-streaming and streaming chat using AsyncStream, SwiftUI integration with async token display, configuring Ollama for access from iOS devices on the same LAN, and Apple Foundation Models for true on-device inference on iOS 26+ and macOS 26+ using the Neural Engine with zero network access required.

Mixed Precision Training with PyTorch AMP: fp16, bf16, and GradScaler

A practical guide to PyTorch Automatic Mixed Precision for ML engineers: the numerical difference between fp16 and bf16 and when to use each, a complete AMP training loop with autocast and GradScaler, how GradScaler's adaptive scaling works and how to tune it, which ops autocast converts and which it keeps in float32, AMP with HuggingFace Trainer, mixed precision inference with permanently converted model weights, and a hook-based debugger for finding which operation first produces NaN under fp16.

How to Use Ollama with C# and .NET

A complete guide to Ollama integration in C# and .NET: the OllamaSharp library for chat and model listing, Microsoft.Extensions.AI with OllamaChatClient registered as IChatClient for dependency injection, streaming with IAsyncEnumerable, Semantic Kernel with AddOllamaChatCompletion and prompt templates, an ASP.NET Core streaming SSE endpoint, and generating embeddings with cosine similarity in C#.

Online Hard Example Mining (OHEM): How It Works and When to Use It

A practical guide to Online Hard Example Mining (OHEM) for ML engineers: how the forward-select-backward mechanism concentrates gradient signal on informative samples, PyTorch implementation with per-sample cross-entropy and topk selection, segmentation-specific OHEM with threshold-based pixel selection, OHEM vs focal loss and when to combine them, OHEM for metric learning and embedding training, and tuning the keep_ratio hyperparameter without destabilising training.

How to Use Ollama with Java and Spring Boot

A complete guide to integrating Ollama in Java Spring Boot applications: adding spring-ai-ollama-spring-boot-starter, configuring base URL and model in application.yml, chat with ChatClient including streaming with Flux, prompt templates with system prompts and parameter injection, generating embeddings with EmbeddingModel and computing cosine similarity, and a direct HTTP approach using RestClient with record-based request/response types for full control without Spring AI.

Matryoshka Representation Learning: How It Works and Why It Matters for RAG

A practical guide to Matryoshka Representation Learning (MRL) for ML engineers: how nested dimension training works, fine-tuning MRL models with sentence-transformers MatryoshkaLoss, using truncated embeddings at inference time with correct renormalisation, building two-stage RAG retrieval with a small-dimension FAISS index for recall and full-dimension reranking for precision, benchmarking quality across dimensions on your own domain data, and how MRL compares to PCA-based dimensionality reduction.

How to Monitor Ollama with Prometheus and Grafana

A practical guide to monitoring Ollama with Prometheus and Grafana: what performance metadata the Ollama API exposes in every response, a Python Prometheus exporter that collects ollama ps output and /api/tags data, a FastAPI middleware proxy that intercepts requests and records request counts, latency histograms, and token counters, Prometheus scrape configuration, and key Grafana dashboard PromQL queries for models loaded, VRAM usage, request rate, tokens per second, and P95 latency.

How to Write Tests for ML Models with pytest

A practical guide to pytest for ML engineers: structuring test suites by speed and scope, shared fixtures for tiny models and small batches, testing data preprocessing deterministically, model shape and gradient flow tests, the overfit test for catching silent training bugs, loss function correctness tests, and configuring pytest markers and GitHub Actions CI to run fast unit tests on every push and GPU integration tests on a schedule.