IA3 vs LoRA: Choosing a Parameter-Efficient Fine-Tuning Method

A practical comparison of IA3 and LoRA for ML engineers: how IA3's learned activation scaling differs from LoRA's low-rank weight updates, when each method wins (data volume, task type, adapter size), implementing IA3 with HuggingFace PEFT for classification and causal LM tasks, combining IA3 with 4-bit quantisation on consumer GPUs, and a decision framework for choosing between PEFT methods in production fine-tuning projects.
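
For a feel of the setup cost, a minimal IA3 sketch with HuggingFace PEFT for sequence classification; roberta-base and its module names are illustrative assumptions, since target and feedforward modules vary by architecture:

```python
from transformers import AutoModelForSequenceClassification
from peft import IA3Config, TaskType, get_peft_model

# roberta-base is an assumed example; pick target modules for your model.
base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

config = IA3Config(
    task_type=TaskType.SEQ_CLS,
    target_modules=["key", "value", "output.dense"],  # where scaling vectors attach
    feedforward_modules=["output.dense"],             # subset treated as feedforward
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # IA3 typically trains far fewer params than LoRA
```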

How to Use Ollama with JavaScript and Node.js

A complete guide to the official Ollama npm package in Node.js: installing with npm/yarn/bun, the generate and chat APIs with stream:false and stream:true, a multi-turn CLI chatbot built on readline, generating embeddings and computing cosine similarity, model management including pull with progress, delete, and ps, connecting to a remote Ollama server with a custom client, structured output by passing a Zod schema to the format parameter, and image input for vision models.
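
The guide's examples are JavaScript, but the cosine-similarity step it covers is plain vector arithmetic; a minimal sketch of that computation in Python, with the two vectors assumed to come from Ollama's embeddings endpoint:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(a, b) = (a . b) / (|a| |b|); vectors assumed to have equal length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # 1.0 = identical direction, 0 = unrelated
```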

Sequence Packing for LLM Training: Eliminating Padding Waste

A practical guide to sequence packing for ML engineers training LLMs: measuring padding waste and estimating speedup, greedy packing implementation with EOS separation, the attention leakage problem in naive packing, document-aware attention masks with Flash Attention cu_seqlens, TRL SFTTrainer packing configuration, and how to verify packing efficiency and model quality after implementation.
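
As a rough illustration of the greedy step, a minimal packing sketch under simplifying assumptions (documents already tokenised, one EOS per document, no document-aware mask yet):

```python
def greedy_pack(docs: list[list[int]], max_len: int, eos_id: int) -> list[list[int]]:
    """Concatenate tokenised documents into bins of at most max_len tokens,
    separating documents with EOS, so batches carry almost no padding."""
    bins: list[list[int]] = []
    current: list[int] = []
    for doc in sorted(docs, key=len, reverse=True):  # long documents first
        seq = doc + [eos_id]              # EOS marks the document boundary
        if len(seq) > max_len:
            seq = seq[:max_len]           # truncate over-long documents
        if current and len(current) + len(seq) > max_len:
            bins.append(current)          # close the bin, start a new one
            current = []
        current += seq
    if current:
        bins.append(current)
    return bins
```

Real pipelines also record per-document boundaries (e.g. cu_seqlens for Flash Attention) so attention cannot leak across packed documents, which is exactly the failure mode the guide covers.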

Ollama REST API Reference: Every Endpoint with Examples

A complete Ollama REST API reference with curl examples for every endpoint: health check, /api/generate with streaming and options, /api/chat with multi-turn history and structured output format parameter, /api/embeddings, /api/tags to list models, /api/pull with progress streaming, /api/delete, /api/copy, /api/create from a Modelfile string, /api/ps for loaded models, /api/show for model details, and the OpenAI-compatible /v1/chat/completions, /v1/models, and /v1/embeddings endpoints.
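
The reference uses curl throughout; the same endpoints work from any HTTP client. A short sketch in Python, assuming a local server on the default port and a pulled llama3.2 model:

```python
import requests

OLLAMA = "http://localhost:11434"  # default Ollama address

# Non-streaming generate: one JSON object with the full completion.
r = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": False,
})
print(r.json()["response"])

# /api/tags lists locally available models.
for m in requests.get(f"{OLLAMA}/api/tags").json()["models"]:
    print(m["name"])
```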

Multi-Task Learning: Hard Parameter Sharing, Soft Sharing, and When It Beats Single-Task Models

A practical guide to multi-task learning for ML engineers: hard parameter sharing with task-specific heads, soft parameter sharing with cross-model L2 regularisation, gradient cosine similarity for detecting negative transfer, homoscedastic uncertainty loss weighting, task sampling strategies, and an honest assessment of when multi-task training beats separate single-task baselines and when it does not.
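
To make the hard-sharing pattern concrete, a minimal PyTorch sketch; the dimensions and task names are placeholders:

```python
import torch
import torch.nn as nn

class HardSharedModel(nn.Module):
    """Hard parameter sharing: a single shared trunk, one small head per task."""

    def __init__(self, in_dim: int, hidden: int, task_out_dims: dict[str, int]):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One lightweight head per task; only the trunk is shared.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, out) for task, out in task_out_dims.items()}
        )

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        return self.heads[task](self.trunk(x))

model = HardSharedModel(128, 256, {"sentiment": 2, "topic": 10})
logits = model(torch.randn(4, 128), task="sentiment")
```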

Tabby: The Self-Hosted Coding Assistant

A complete guide to Tabby, the open-source self-hosted coding assistant: what it does and how it compares to GitHub Copilot, installing via brew, binary, or Docker, running with built-in code models on CPU and GPU, connecting to VS Code and JetBrains IDEs with API token setup, model selection by hardware tier from 1.3B CPU to 13B GPU, enabling repository context indexing for project-aware completions, and running as a systemd service for persistent availability.
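
Once a server is running, a quick smoke test from Python; the port, path, and payload shape follow Tabby's documented completion API but are assumptions to verify against your own server's Swagger page:

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",              # default Tabby port
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},  # token from the Tabby UI
    json={
        "language": "python",
        "segments": {"prefix": "def fib(n):\n    ", "suffix": ""},
    },
)
print(resp.json())  # completion choices for the cursor position
```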

How to Evaluate LLMs with lm-evaluation-harness

A practical guide to EleutherAI lm-evaluation-harness for ML engineers: CLI and Python API usage, running MMLU, HellaSwag, ARC and TruthfulQA, evaluating fine-tuned checkpoints, writing custom YAML tasks for domain benchmarks, understanding acc vs acc_norm vs mc2 metrics, and avoiding the prompt format mismatches and contamination issues that produce misleading benchmark numbers.
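
The Python API reduces to a single call; a minimal sketch assuming lm-eval v0.4+ and a small HuggingFace model:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-160m",  # or a local checkpoint path
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task acc, acc_norm, stderr, etc.
```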

Mistral Nemo 12B: What It Is and When to Use It

A practical guide to Mistral Nemo 12B: its distinctive technical features including 128K native context, the Tekken tokeniser, and strong multilingual training across 11 languages, hardware requirements at Q4_K_M (~7GB), when the 12B quality jump over 7–8B models is worth the extra VRAM, a Modelfile for long-context use, multilingual Python examples, and a clear comparison against Llama 3.1 8B showing where Nemo wins.
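
A minimal multilingual call through the ollama Python client, assuming the mistral-nemo tag has been pulled; the num_ctx value is illustrative and can be raised towards 128K as memory allows:

```python
import ollama  # pip install ollama

resp = ollama.chat(
    model="mistral-nemo",
    messages=[{"role": "user",
               "content": "Fasse diesen Text in drei Stichpunkten zusammen: ..."}],
    options={"num_ctx": 32768},  # raise towards 128K if RAM/VRAM allows
)
print(resp["message"]["content"])
```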

Text Data Augmentation for LLM Training: Techniques That Actually Work

A practical guide to text data augmentation for ML engineers: why text augmentation is harder than image augmentation, word-level perturbations with EDA, back-translation with MarianMT, LLM-based paraphrasing for instruction datasets, embedding-space Mixup for classification, and how to verify empirically that augmentation is actually helping rather than hurting model quality.
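
As a concrete example of one technique, a minimal back-translation sketch with MarianMT; the French pivot and the Helsinki-NLP model names are arbitrary but real choices:

```python
from transformers import MarianMTModel, MarianTokenizer

def back_translate(texts: list[str], pivot: str = "fr") -> list[str]:
    """Paraphrase English text by round-tripping en -> pivot -> en."""
    def load(src: str, tgt: str):
        name = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
        return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

    def translate(batch: list[str], tok, model) -> list[str]:
        enc = tok(batch, return_tensors="pt", padding=True, truncation=True)
        return tok.batch_decode(model.generate(**enc), skip_special_tokens=True)

    fwd_tok, fwd = load("en", pivot)
    bwd_tok, bwd = load(pivot, "en")
    return translate(translate(texts, fwd_tok, fwd), bwd_tok, bwd)

print(back_translate(["The model quickly learned the new task."]))
```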

How to Summarise Audio and Podcasts Locally with Ollama

A complete guide to a local audio summarisation pipeline using faster-whisper for transcription and Ollama for summarisation: installing faster-whisper and ffmpeg, transcribing with different model sizes from tiny to large-v3, a full pipeline function that handles long transcripts by chunking, four summary styles (bullets, paragraph, tldr, chapters), extracting action items and decisions from meeting recordings, and a command-line script for quick use from the terminal.
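
An end-to-end sketch of the core pipeline, without the chunking the full guide adds for long transcripts; the "small" Whisper size and the llama3.2 tag are assumptions:

```python
from faster_whisper import WhisperModel
import ollama

def summarise_audio(path: str, style: str = "bullets") -> str:
    # Transcribe locally; "small" trades accuracy for speed, large-v3 is best.
    whisper = WhisperModel("small", compute_type="int8")
    segments, _info = whisper.transcribe(path)
    transcript = " ".join(seg.text.strip() for seg in segments)

    # Hand the transcript to a local Ollama model for summarisation.
    prompt = f"Summarise this transcript as {style}:\n\n{transcript}"
    resp = ollama.chat(model="llama3.2",
                       messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]

print(summarise_audio("meeting.mp3", style="bullets"))
```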