How to Use Ollama with C# and .NET

A complete guide to Ollama integration in C# and .NET: the OllamaSharp library for chat and model listing, Microsoft.Extensions.AI with OllamaChatClient registered as IChatClient for dependency injection, streaming with IAsyncEnumerable, Semantic Kernel with AddOllamaChatCompletion and prompt templates, an ASP.NET Core streaming SSE endpoint, and generating embeddings with cosine similarity in C#.

Online Hard Example Mining (OHEM): How It Works and When to Use It

A practical guide to Online Hard Example Mining (OHEM) for ML engineers: how the forward-select-backward mechanism concentrates gradient signal on informative samples, PyTorch implementation with per-sample cross-entropy and topk selection, segmentation-specific OHEM with threshold-based pixel selection, OHEM vs focal loss and when to combine them, OHEM for metric learning and embedding training, and tuning the keep_ratio hyperparameter without destabilising training.
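The per-sample cross-entropy with topk selection described above can be sketched in PyTorch; the keep_ratio value, batch size, and class count here are illustrative placeholders, not the guide's recommended settings:

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, targets, keep_ratio=0.25):
    """Online Hard Example Mining: compute per-sample cross-entropy,
    keep only the hardest keep_ratio fraction of the batch, and average
    over those, so the backward pass concentrates gradient signal on
    the most informative samples."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")  # (B,)
    n_keep = max(1, int(keep_ratio * per_sample.numel()))
    hard_losses, _ = torch.topk(per_sample, n_keep)
    return hard_losses.mean()

# toy usage: batch of 8 samples, 5 classes
torch.manual_seed(0)
logits = torch.randn(8, 5, requires_grad=True)
targets = torch.randint(0, 5, (8,))
loss = ohem_cross_entropy(logits, targets, keep_ratio=0.25)
loss.backward()  # gradients flow only through the 2 hardest samples
```

Because unselected samples contribute nothing to the loss, their logit gradients are exactly zero, which is the forward-select-backward mechanism in miniature.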

How to Use Ollama with Java and Spring Boot

A complete guide to integrating Ollama in Java Spring Boot applications: adding spring-ai-ollama-spring-boot-starter, configuring base URL and model in application.yml, chat with ChatClient including streaming with Flux, prompt templates with system prompts and parameter injection, generating embeddings with EmbeddingModel and computing cosine similarity, and a direct HTTP approach using RestClient with record-based request/response types for full control without Spring AI.

Matryoshka Representation Learning: How It Works and Why It Matters for RAG

A practical guide to Matryoshka Representation Learning (MRL) for ML engineers: how nested dimension training works, fine-tuning MRL models with sentence-transformers MatryoshkaLoss, using truncated embeddings at inference time with correct renormalisation, building two-stage RAG retrieval with a small-dimension FAISS index for recall and full-dimension reranking for precision, benchmarking quality across dimensions on your own domain data, and how MRL compares to PCA-based dimensionality reduction.
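The truncate-and-renormalise step and the two-stage retrieval pattern can be sketched with NumPy alone (in practice a small-dimension FAISS index would replace the brute-force stage-1 search); the corpus here is random data, purely for illustration:

```python
import numpy as np

def truncate_and_renorm(embeddings, dim):
    """Truncate Matryoshka embeddings to the first `dim` dimensions and
    L2-renormalise, so dot products remain valid cosine similarities."""
    trunc = embeddings[:, :dim]
    norms = np.linalg.norm(trunc, axis=1, keepdims=True)
    return trunc / np.clip(norms, 1e-12, None)

# toy corpus: 100 full-dimension (768-d) embeddings of random values
rng = np.random.default_rng(0)
corpus_full = rng.normal(size=(100, 768)).astype(np.float32)
query_full = rng.normal(size=(1, 768)).astype(np.float32)

# stage 1: recall with cheap 64-d truncated vectors
corpus_64 = truncate_and_renorm(corpus_full, 64)
query_64 = truncate_and_renorm(query_full, 64)
candidates = np.argsort(-(corpus_64 @ query_64.T).ravel())[:10]

# stage 2: rerank only the candidates at full 768-d precision
corpus_768 = truncate_and_renorm(corpus_full, 768)
query_768 = truncate_and_renorm(query_full, 768)
scores = (corpus_768[candidates] @ query_768.T).ravel()
reranked = candidates[np.argsort(-scores)]
```

The renormalisation after truncation is the easy-to-miss step: without it, vectors truncated to different dimensions have inconsistent norms and cosine scores are not comparable.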

How to Monitor Ollama with Prometheus and Grafana

A practical guide to monitoring Ollama with Prometheus and Grafana: what performance metadata the Ollama API exposes in every response, a Python Prometheus exporter that scrapes the ollama ps command and the api/tags endpoint, a FastAPI middleware proxy that intercepts requests and records request count, latency histograms, and token counters, Prometheus scrape configuration, and key Grafana dashboard PromQL queries for models loaded, VRAM usage, request rate, tokens per second, and P95 latency.
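The performance metadata mentioned above arrives in every final Ollama response as nanosecond duration and token count fields; a minimal sketch of turning that payload into metrics such as tokens per second (the sample values below are made up for illustration):

```python
def extract_metrics(response: dict) -> dict:
    """Pull the performance metadata Ollama attaches to every final
    response; all *_duration fields are reported in nanoseconds."""
    eval_s = response["eval_duration"] / 1e9
    return {
        "total_seconds": response["total_duration"] / 1e9,
        "load_seconds": response["load_duration"] / 1e9,
        "prompt_tokens": response["prompt_eval_count"],
        "output_tokens": response["eval_count"],
        "tokens_per_second": response["eval_count"] / eval_s if eval_s else 0.0,
    }

# hypothetical payload shaped like an /api/generate final response
sample = {
    "total_duration": 5_000_000_000,   # 5 s end to end
    "load_duration": 500_000_000,      # 0.5 s model load
    "prompt_eval_count": 26,
    "eval_count": 120,
    "eval_duration": 4_000_000_000,    # 4 s generating
}
metrics = extract_metrics(sample)  # 120 tokens / 4 s = 30 tokens/sec
```

An exporter or middleware proxy would feed these values into Prometheus counters and histograms on every request.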

How to Write Tests for ML Models with pytest

A practical guide to pytest for ML engineers: structuring test suites by speed and scope, shared fixtures for tiny models and small batches, testing data preprocessing deterministically, model shape and gradient flow tests, the overfit test for catching silent training bugs, loss function correctness tests, and configuring pytest markers and GitHub Actions CI to run fast unit tests on every push and GPU integration tests on a schedule.
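The overfit test mentioned above can be sketched as a small pytest case, assuming a PyTorch model; the tiny architecture, step count, and loss threshold are illustrative choices, not fixed prescriptions:

```python
import torch
from torch import nn

def test_tiny_model_overfits_one_batch():
    """Overfit test: a correct model/loss/optimiser loop should drive
    the loss toward zero on a single fixed batch; if it cannot, some
    part of the training path is silently broken (frozen parameters,
    detached graph, mismatched labels, wrong loss reduction)."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
    x = torch.randn(8, 4)
    y = torch.randint(0, 2, (8,))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(300):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    assert loss.item() < 0.1
```

Fixing the seed keeps the test deterministic, so it can run as a fast unit test on every push.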

How to Build a Telegram Bot with Ollama

A complete guide to building an Ollama-powered Telegram bot with python-telegram-bot: a basic bot that responds to messages with typing indicator, per-user conversation history with trimming, /start, /clear, and /model commands, restricting access to specific Telegram user IDs, and running the bot as a persistent systemd service alongside Ollama.
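The per-user history trimming can be sketched in plain Python, independent of the python-telegram-bot wiring; the MAX_TURNS value and the dict-based store are assumptions for illustration (a real bot might persist history elsewhere):

```python
from collections import defaultdict

MAX_TURNS = 10  # keep the last 10 user/assistant exchanges per user

# per-user conversation history, keyed by Telegram user id
histories: dict[int, list[dict]] = defaultdict(list)

def add_turn(user_id: int, role: str, content: str) -> list[dict]:
    """Append a message and trim to the most recent MAX_TURNS
    exchanges, so the prompt sent to Ollama stays bounded rather than
    growing until it overflows the model's context window."""
    history = histories[user_id]
    history.append({"role": role, "content": content})
    del history[:-2 * MAX_TURNS]  # 2 messages per exchange
    return history

# usage: each handler call records both sides of the exchange
add_turn(42, "user", "hello")
add_turn(42, "assistant", "hi there")
```

The returned list is already in the chat-messages shape Ollama's chat API expects, so the handler can pass it straight through.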

How to Use W&B Sweeps for Hyperparameter Search

A practical guide to W&B Sweeps for ML engineers: how the sweep controller and agent architecture works, configuring Bayesian vs random vs grid search with the right parameter distributions, writing sweep-compatible training scripts that read from wandb.config, running parallel agents across multiple GPUs and SLURM clusters, using Hyperband early termination to save compute, interpreting parallel coordinates plots, and avoiding common pitfalls like over-broad search spaces and treating the best sweep run as a final result without seed averaging.
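A sweep configuration combining Bayesian search, explicit parameter distributions, and Hyperband early termination might look like the sketch below; the parameter names and ranges are placeholders for whatever your training script reads from wandb.config:

```python
# W&B sweep config: Bayesian search over three hyperparameters,
# with Hyperband stopping unpromising runs early to save compute
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        # log-uniform keeps the search from wasting trials at large lrs
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
        "dropout": {"distribution": "uniform", "min": 0.0, "max": 0.5},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}
```

This dict would be passed to wandb.sweep() to create the sweep, with wandb.agent() launched once per GPU to run trials in parallel.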

How to Use Ollama with the Cursor Editor

A complete guide to using Ollama as the AI backend in the Cursor code editor: configuring the OpenAI-compatible endpoint in Cursor settings, choosing fast small models for Tab completion versus larger models for Chat, creating a Modelfile with 16K context for better code responses, using Cursor Chat and Cmd+K inline editing with a local model, performance tips including model preloading and code-specialised model selection, and privacy implications of the fully local setup for proprietary codebases.

How to Count Tokens and Estimate LLM Costs Before You Ship

A practical guide to LLM token counting and cost estimation for ML engineers: accurate token counting with tiktoken for OpenAI models and the Anthropic token counting API for Claude, building a multi-provider cost estimator with current pricing, pre-flight checks to catch context window overflows and budget breaches before API calls, and a production logging wrapper for per-request cost attribution and identifying expensive outlier requests.
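The multi-provider cost estimator can be sketched as a pricing table plus a helper; the model names and per-million-token prices below are placeholder values, so check each provider's current pricing page before relying on them:

```python
# placeholder per-million-token prices: (input USD, output USD)
PRICING = {
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate a request's USD cost from token counts and the
    pricing table; output tokens usually cost several times more
    than input tokens, so the two are priced separately."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Paired with an exact tokenizer count (tiktoken for OpenAI models, the Anthropic token counting API for Claude), this is enough for a pre-flight budget check before the request is ever sent.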