mljourney, Author at ML Journey

Llama 3.3 70B: Running a Frontier-Class Model Locally

May 1, 2026 by mljourney

A practical guide to running Llama 3.3 70B locally with Ollama: hardware requirements across Apple Silicon configurations and NVIDIA GPU setups, pulling the model and verifying GPU layer loading with ollama ps, configuring large context windows with a Modelfile, the four specific task areas where 70B quality significantly outperforms 7-8B models, Python usage for complex reasoning tasks with streaming, realistic tokens per second on M3 Max and dual RTX 4090, and a decision framework for when to use 70B versus smaller models.

How to Write Custom Autograd Functions in PyTorch

May 1, 2026 by mljourney

A practical guide to torch.autograd.Function for ML engineers: implementing custom forward and backward passes, ctx.save_for_backward rules, numerically stable operations, straight-through estimation for quantisation-aware training, handling non-differentiable inputs, and verifying correctness with gradcheck and gradgradcheck.

How to Run Ollama on a Raspberry Pi or ARM Device

April 30, 2026 by mljourney

A practical guide to running Ollama on ARM hardware: supported devices from Raspberry Pi 4/5 to Jetson Orin with realistic speed expectations, installation via the standard installer, which auto-detects ARM, model selection for 4GB and 8GB RAM constraints, setting up Ollama as a systemd service, performance optimisation with smaller quantisation and reduced context windows, four practical use cases including an offline home assistant and edge IoT, and honest expectations about 3–8 tokens per second weighed against the power-efficiency advantages of an always-on Pi deployment.

Temperature, Top-p, and Top-k: LLM Sampling Strategies Explained

April 30, 2026 by mljourney

A practical guide to LLM sampling parameters for ML engineers: how temperature scales logits and why it matters, top-k hard truncation and its context-insensitivity, top-p nucleus sampling and its adaptive vocabulary selection, repetition penalty and min-p, and recommended settings by task type for code generation, chat, creative writing, and structured output.
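The three core parameters named in this excerpt can be sketched in plain Python. This is an illustrative toy over a five-token vocabulary, not code from the article: the function names (`softmax`, `top_k_filter`, `top_p_filter`) and the example logits are assumptions made up for the sketch.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalise to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalise (fixed cutoff)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Hypothetical logits for a five-token vocabulary.
logits = [4.0, 2.0, 1.0, 0.5, 0.1]
sharp = softmax(logits, temperature=0.5)  # low temperature: peaked
flat = softmax(logits, temperature=2.0)   # high temperature: flatter

# Top-k always keeps k tokens; top-p keeps more when the distribution is flat.
print(len(top_k_filter(sharp, 2)), len(top_k_filter(flat, 2)))
print(len(top_p_filter(sharp, 0.9)), len(top_p_filter(flat, 0.9)))
```

The last two lines show the contrast the excerpt describes: the top-k cutoff ignores the shape of the distribution, while the top-p cutoff adapts to it.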

How to Get Structured JSON Output from Ollama with Pydantic

April 29, 2026 by mljourney

A practical guide to getting reliable structured JSON output from Ollama using Pydantic: the JSON prompt approach with markdown stripping and ValidationError handling, Ollama’s native format parameter that accepts a Pydantic model_json_schema() directly to constrain generation, nested Pydantic models for complex structured extraction, batch extraction with per-item error handling, and model selection guidance for simple versus complex schemas.

How to Extend Context Length in LLMs: RoPE Scaling, YaRN, and NTK-Aware Interpolation

April 29, 2026 by mljourney

A practical guide to extending LLM context length beyond the training window: why RoPE breaks at out-of-range positions, position interpolation as the baseline, NTK-aware base frequency scaling for zero-shot extension, YaRN selective interpolation by frequency band with attention temperature correction, HuggingFace rope_scaling configuration, and when each method requires fine-tuning versus working out of the box.
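Two of the methods listed here reduce to short formulas, sketched below in plain Python. This is a minimal illustration, assuming the standard RoPE inverse-frequency definition and the commonly cited NTK-aware base rescaling `base * scale ** (dim / (dim - 2))`; the function names, `dim=128`, and `scale=4.0` are hypothetical choices for the example, and YaRN is not covered.

```python
import math

def rope_inv_freq(dim, base=10000.0):
    """Per-pair inverse frequencies used by rotary position embeddings."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def interpolate_position(position, scale):
    """Position interpolation: squeeze positions back into the trained range."""
    return position / scale

def ntk_scaled_base(base, scale, dim):
    """NTK-aware scaling: enlarge the RoPE base instead of shrinking positions."""
    return base * scale ** (dim / (dim - 2))

dim, base, scale = 128, 10000.0, 4.0  # hypothetical head size and 4x extension

original = rope_inv_freq(dim, base)
ntk = rope_inv_freq(dim, ntk_scaled_base(base, scale, dim))

# The fastest-rotating pair is left untouched, while the slowest pair is
# slowed down by the full scale factor; pairs in between are scaled partially.
print(ntk[0] / original[0])
print(original[-1] / ntk[-1])

# Position interpolation instead maps position 8192 to 2048 for every pair.
print(interpolate_position(8192, scale))
```

Position interpolation compresses every frequency band equally, whereas the rescaled base leaves high-frequency (local) information intact, which is the usual argument for why the NTK-aware variant tolerates use without fine-tuning better.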

Ollama vs LM Studio in 2026: Which Should You Use?

April 28, 2026 by mljourney

A practical comparison of Ollama and LM Studio in 2026: what each tool is designed for, installation and setup friction, model library size and discovery, API access and whether it requires manual enabling, programmability and automation in scripts and CI/CD, Modelfile persistence vs session-only configuration, identical underlying inference performance, clear guidance on who should use each tool, and how developers can use both together — Ollama as the always-running backend and LM Studio for model discovery.

How to Export PyTorch Models: TorchScript, ONNX, and TensorRT

April 28, 2026 by mljourney

A practical guide to PyTorch model export for production: TorchScript tracing vs scripting and when to use each, ONNX export with dynamic axes and opset version considerations, ONNX Runtime performance benchmarking, TensorRT engine building with FP16 and INT8 calibration, and a decision framework for choosing between the three based on hardware, portability, and throughput requirements.

How to Use Ollama in a React or Next.js App

April 27, 2026 by mljourney

A complete guide to integrating Ollama into React and Next.js applications: solving the CORS problem with OLLAMA_ORIGINS or a server-side proxy, a full streaming chat component that calls Ollama directly from the browser with real-time token display, a Next.js App Router API route that proxies Ollama streams to the client, and the AI SDK useChat hook approach that replaces manual streaming code with a clean abstraction — including the route handler using createOpenAI pointed at the local Ollama endpoint.

AdamW vs Adafactor vs Lion: Choosing an Optimizer for LLM Training

April 27, 2026 by mljourney

A practical guide to optimizers for LLM training: how AdamW works and why decoupled weight decay matters, the memory cost problem at 7B to 70B scale, Adafactor's factored second moments for pretraining, 8-bit Adam as a drop-in memory reduction, Lion's sign-based updates and their hyperparameter tradeoffs, and a decision framework for matching the optimizer to training scale and budget.
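The first two points in this excerpt can be made concrete with a scalar sketch. The code below is an illustration rather than the article's implementation: `adamw_step` and `adam_state_gib` are hypothetical helper names, the hyperparameter defaults are conventional values chosen for the example, and the memory estimate assumes two fp32 state tensors per parameter with no sharding or quantisation.

```python
import math

def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised moments.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay: applied to the parameter directly, so it is
    # not rescaled by the adaptive denominator below.
    param = param - lr * weight_decay * param
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

def adam_state_gib(n_params, bytes_per_state=4):
    """Memory for Adam's two moment tensors, in GiB (assumes fp32 states)."""
    return 2 * n_params * bytes_per_state / 2**30

# With a zero gradient, only the decay term moves the parameter.
p, m, v = adamw_step(1.0, 0.0, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.1)
print(p)

# Optimizer state alone for 7B and 70B parameters, under the fp32 assumption.
print(round(adam_state_gib(7e9), 1), round(adam_state_gib(70e9), 1))
```

The state estimate excludes weights, gradients, and activations, so it is a lower bound on training memory; it is the part that factored or 8-bit optimizers aim to shrink.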