ML Journey

Continuous Batching for LLM Inference: How It Works and When to Use It

April 3, 2026 by mljourney

A deep technical explainer on continuous batching for LLM inference: why static batching wastes GPU compute on autoregressive generation, how iteration-level scheduling works, the prefill vs decode phase distinction, PagedAttention and KV cache memory management, throughput vs latency tradeoffs, and vLLM configuration parameters for tuning continuous batching in production.

AnythingLLM Setup: Chat with Your Documents Locally

April 2, 2026 by mljourney

A practical guide to AnythingLLM for local document chat: desktop app and Docker installation, connecting to Ollama or LM Studio as the LLM backend, creating workspaces with isolated document collections, uploading PDFs and URLs for RAG, querying documents with source citations, using agents with web search and code execution, setting up multi-user access with role-based permissions, and a direct comparison with Open WebUI to help you choose the right tool for your workflow.

How to Deploy ML Models on Kubernetes with KServe

April 2, 2026 by mljourney

A complete guide to deploying machine learning models on Kubernetes using KServe: cluster setup, packaging PyTorch and HuggingFace models, InferenceService manifests, autoscaling with Knative, canary deployments for safe rollouts, and when KServe is worth the operational overhead versus simpler alternatives.

LM Studio: Complete Setup and Usage Guide

April 1, 2026 by mljourney

A complete guide to LM Studio for local LLMs: installing on Mac, Windows, and Linux, browsing and downloading models with RAM requirements shown upfront, loading models and adjusting generation parameters in the chat UI, starting the built-in OpenAI-compatible server on port 1234, connecting with the OpenAI Python SDK, tuning GPU layers and context length for your hardware, and a direct comparison of LM Studio versus Ollama to help you decide which fits your workflow.

Prefix Tuning vs Prompt Tuning vs P-Tuning: Soft Prompt Methods Compared

April 1, 2026 by mljourney

A practical comparison of prefix tuning, prompt tuning, and P-Tuning v2: how each method works, where soft tokens are inserted, parameter counts, scaling behavior, and when to choose each over LoRA for multi-task serving from a single frozen model.

How to Use Ollama’s OpenAI-Compatible API

March 31, 2026 by mljourney

A practical guide to Ollama’s OpenAI-compatible API: using the OpenAI Python SDK pointed at localhost, streaming completions, generating embeddings with nomic-embed-text, switching existing OpenAI code to Ollama by changing two lines, integrating with LangChain and LlamaIndex, using environment variables to toggle between local and cloud, and a clear summary of what the compatibility layer does and does not support.

How to Fine-Tune Llama 3 with FSDP on Multiple GPUs

March 31, 2026 by mljourney

A complete guide to fine-tuning Llama 3 using PyTorch FSDP across multiple GPUs: wrapping strategy with transformer_auto_wrap_policy, sharding strategies (FULL_SHARD vs HYBRID_SHARD), gradient checkpointing integration, bfloat16 training loop, full state dict checkpointing, and memory budget planning for 8B and 70B models.

How to Set Up Open WebUI with Ollama (Complete Guide)

March 30, 2026 by mljourney

A complete setup guide for Open WebUI with Ollama: installing via Docker with a single run command, pip installation without Docker, connecting to Ollama and troubleshooting disconnection issues, switching and pulling models from the UI, setting system prompts and custom personas, uploading documents for local RAG, accessing Open WebUI from other devices on your network, keeping conversations across updates, and the most useful settings to configure for a single-user local setup.

How to Write Triton Kernels for PyTorch

March 30, 2026 by mljourney

A practical guide to writing GPU kernels with OpenAI Triton: the tile-based programming model, a minimal working kernel, fused softmax, autotuning block sizes, 2D matrix kernels, autograd integration, debugging with the interpreter, and performance profiling against the memory roofline.

Best Coding LLMs to Run Locally in 2026

March 29, 2026 by mljourney

A practical guide to the best coding LLMs for local use in 2026: Qwen2.5-Coder 7B, 14B, and 32B as the overall best across VRAM tiers, DeepSeek-Coder-V2 as a fast MoE option, Codestral 22B for fill-in-the-middle completions, hardware requirements at each tier from 8GB to 24GB VRAM, setting up Continue in VS Code with a local Ollama model, recommended Modelfiles with coding-optimised parameters, and how to choose the right model for your hardware and workflow.