How to Evaluate LLMs with lm-evaluation-harness

A practical guide to EleutherAI's lm-evaluation-harness for ML engineers: CLI and Python API usage, running MMLU, HellaSwag, ARC, and TruthfulQA, evaluating fine-tuned checkpoints, writing custom YAML tasks for domain benchmarks, understanding acc vs acc_norm vs mc2 metrics, and avoiding the prompt-format mismatches and contamination issues that produce misleading benchmark numbers.
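
To give a flavour of the workflow, here is a minimal sketch of the Python API; the checkpoint, task, and batch size are illustrative placeholders, not recommendations from the guide:

```python
# Minimal lm-evaluation-harness run via the Python API; the checkpoint
# (pythia-160m) and the single task are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])  # per-task metrics, including acc and acc_norm
```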

Mistral Nemo 12B: What It Is and When to Use It

A practical guide to Mistral Nemo 12B: its distinctive technical features including a 128K-token native context window, the Tekken tokeniser, and strong multilingual training across 11 languages, hardware requirements at Q4_K_M (~7GB), when the 12B quality jump over 7–8B models is worth the extra VRAM, a Modelfile for long-context use, multilingual Python examples, and a clear comparison against Llama 3.1 8B showing where Nemo wins.
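
As a taste of the multilingual angle, a minimal sketch using the ollama Python package; the French prompt is a made-up example and the tag assumes `ollama pull mistral-nemo` has completed:

```python
import ollama  # assumes the ollama Python package and a running local Ollama server

# Illustrative multilingual check: ask a technical question in French.
response = ollama.chat(
    model="mistral-nemo",
    messages=[{
        "role": "user",
        "content": "Explique la différence entre RAM et VRAM en deux phrases.",
    }],
)
print(response["message"]["content"])
```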

Text Data Augmentation for LLM Training: Techniques That Actually Work

A practical guide to text data augmentation for ML engineers: why text augmentation is harder than image augmentation, word-level perturbations with EDA, back-translation with MarianMT, LLM-based paraphrasing for instruction datasets, embedding-space Mixup for classification, and how to verify empirically that augmentation is actually helping rather than hurting model quality.
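
For orientation, a minimal back-translation sketch with MarianMT; the Helsinki-NLP checkpoints are the standard public ones, and the input sentence is a made-up example:

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    """Translate a batch of sentences with a MarianMT checkpoint."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

# Round-trip en -> fr -> en to produce a paraphrased variant.
original = ["The quarterly report shows a modest increase in revenue."]
french = translate(original, "Helsinki-NLP/opus-mt-en-fr")
augmented = translate(french, "Helsinki-NLP/opus-mt-fr-en")
print(augmented)
```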

How to Summarise Audio and Podcasts Locally with Ollama

A complete guide to a local audio summarisation pipeline using faster-whisper for transcription and Ollama for summarisation: installing faster-whisper and ffmpeg, transcribing with model sizes from tiny to large-v3, a full pipeline function that handles long transcripts by chunking, four summary styles (bullets, paragraph, tldr, chapters), extracting action items and decisions from meeting recordings, and a command-line script for quick use from the terminal.
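
The core of the pipeline, sketched minimally; the model sizes, audio filename, and summarisation model are illustrative, and long transcripts would still need the chunking the guide covers:

```python
from faster_whisper import WhisperModel
import ollama

# Transcribe: "base" is a CPU-friendly placeholder; the guide covers tiny to large-v3.
whisper = WhisperModel("base", device="cpu", compute_type="int8")
segments, _info = whisper.transcribe("meeting.mp3")
transcript = " ".join(segment.text.strip() for segment in segments)

# Summarise with a locally pulled model (the tag is illustrative).
response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": f"Summarise as bullet points:\n\n{transcript}"}],
)
print(response["message"]["content"])
```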

How to Use PyTorch Lightning for LLM Training and Fine-Tuning

A practical guide to PyTorch Lightning for ML engineers working on LLMs: LightningModule structure for fine-tuning, LightningDataModule for reproducible data pipelines, Trainer configuration for multi-GPU with FSDP and mixed precision, LoRA fine-tuning with PEFT integration, gradient accumulation and checkpointing, and a clear-eyed comparison of when Lightning helps versus when Accelerate or raw PyTorch DDP is the better choice.
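
A bare-bones LightningModule for causal-LM fine-tuning, as a sketch of the structure; the model name and learning rate are placeholders, and batches are assumed to be tokenised dicts with input_ids, attention_mask, and labels:

```python
import lightning as L
import torch
from transformers import AutoModelForCausalLM

class CausalLMFineTuner(L.LightningModule):
    """Assumes batches are dicts with input_ids, attention_mask, and labels."""

    def __init__(self, model_name: str = "gpt2", lr: float = 5e-5):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        outputs = self.model(**batch)  # HF models return loss when labels are present
        self.log("train_loss", outputs.loss, prog_bar=True)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# trainer = L.Trainer(precision="bf16-mixed", accumulate_grad_batches=4)
# trainer.fit(CausalLMFineTuner(), train_dataloaders=train_loader)
```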

Llama 3.3 70B: Running a Frontier-Class Model Locally

A practical guide to running Llama 3.3 70B locally with Ollama: hardware requirements across Apple Silicon configurations and NVIDIA GPU setups, pulling the model and verifying GPU layer loading with ollama ps, configuring large context windows with a Modelfile, the four task areas where the 70B significantly outperforms 7–8B models, Python usage for complex reasoning tasks with streaming, realistic tokens-per-second figures on an M3 Max and dual RTX 4090s, and a decision framework for when to use 70B versus smaller models.
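
Streaming from the Python client looks like this in sketch form; the tag assumes `ollama pull llama3.3` has completed, and the prompt is illustrative:

```python
import ollama

stream = ollama.chat(
    model="llama3.3",
    messages=[{
        "role": "user",
        "content": "Compare B-trees and LSM-trees for a write-heavy workload.",
    }],
    stream=True,  # yields chunks as they are generated
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```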

How to Write Custom Autograd Functions in PyTorch

A practical guide to torch.autograd.Function for ML engineers: implementing custom forward and backward passes, ctx.save_for_backward rules, numerically stable operations, straight-through estimation for quantisation-aware training, handling non-differentiable inputs, and verifying correctness with gradcheck and gradgradcheck.
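
The straight-through estimator is the most compact illustration, sketched below; note that gradcheck compares against numerical gradients, so it suits exact backward passes rather than a deliberately biased surrogate like this one:

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)  # non-differentiable quantisation step

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through: treat the forward as identity

x = torch.randn(4, requires_grad=True)
y = RoundSTE.apply(x)
y.sum().backward()
print(x.grad)  # all ones, as the identity surrogate dictates
```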

How to Run Ollama on a Raspberry Pi or ARM Device

A practical guide to running Ollama on ARM hardware: supported devices from Raspberry Pi 4/5 to Jetson Orin with realistic speed expectations, installation via the standard installer (which auto-detects ARM), model selection for 4GB and 8GB RAM constraints, setting up Ollama as a systemd service, performance optimisation with smaller quantisations and reduced context windows, four practical use cases including an offline home assistant and edge IoT, and honest expectations about 3–8 tokens per second versus the power-efficiency advantages of an always-on Pi deployment.
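
A low-resource request suited to a 4GB Pi might look like this sketch; the 1B model tag and reduced num_ctx are illustrative choices, not the guide's exact settings:

```python
import ollama  # talks to the local Ollama service

response = ollama.generate(
    model="llama3.2:1b",        # small enough for a 4GB device
    prompt="Summarise in one sentence: greenhouse sensor reads 31 °C at 14:00.",
    options={"num_ctx": 2048},  # a reduced context window saves memory
)
print(response["response"])
```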

How to Get Structured JSON Output from Ollama with Pydantic

A practical guide to getting reliable structured JSON output from Ollama using Pydantic: the JSON prompt approach with markdown stripping and ValidationError handling, Ollama’s native format parameter that accepts a Pydantic model_json_schema() directly to constrain generation, nested Pydantic models for complex structured extraction, batch extraction with per-item error handling, and model selection guidance for simple versus complex schemas.
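
The native-format path in sketch form; the model choice and schema fields are illustrative:

```python
from pydantic import BaseModel
import ollama

class Person(BaseModel):
    name: str
    age: int

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Extract the person: Alice is 30."}],
    format=Person.model_json_schema(),  # constrains generation to the schema
)
person = Person.model_validate_json(response["message"]["content"])
print(person)  # e.g. name='Alice' age=30
```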