ML Journey


How to Use FlashAttention-2 in Practice

March 23, 2026 by mljourney

FlashAttention-2 delivers 2–4x faster attention and cuts attention memory from O(N²) to O(N), often with a single argument change. A practical guide to enabling it in HuggingFace, PyTorch SDPA, and fine-tuning pipelines — including attention mask compatibility and where it helps most.
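A minimal sketch of the "single argument change" idea: PyTorch's `scaled_dot_product_attention` uses the same call signature everywhere and dispatches to a FlashAttention kernel when run on a supported GPU in fp16/bf16 (shown here on CPU with toy shapes; the HuggingFace equivalent is noted in a comment).

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, heads, seq_len, head_dim)
batch, heads, seq, head_dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# Same call on CPU or GPU; on a supported GPU in fp16/bf16 this
# dispatches to a fused Flash-style kernel automatically.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])

# In HuggingFace Transformers, the switch is one loading argument:
# model = AutoModelForCausalLM.from_pretrained(
#     "model-id",                       # placeholder model name
#     attn_implementation="flash_attention_2",
#     torch_dtype=torch.bfloat16)
```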

Categories: Generative AI

RLHF vs DPO vs PPO: How to Align LLMs Without Losing Your Mind

March 22, 2026 by mljourney

RLHF with PPO is powerful but complex. DPO achieves comparable alignment with a fraction of the engineering overhead. A practical comparison of PPO, DPO, and newer variants like SimPO and ORPO — covering when to use each and how to build a preference dataset that actually works.
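The "fraction of the engineering overhead" claim comes down to DPO's loss being a simple closed form over log-probs, with no reward model or rollout loop. A pure-Python sketch for a single preference pair (real implementations batch this in PyTorch; the inputs are summed log-probs of each response under the trained policy and a frozen reference model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair."""
    # Implicit rewards: how much more the policy prefers each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy already
    # prefers the chosen response by a wide margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen response more than the reference -> low loss.
print(round(dpo_loss(-10.0, -14.0, -12.0, -13.0), 3))  # 0.554
```

With a zero margin the loss is exactly log 2, and it decreases monotonically as the policy's preference for the chosen response grows — which is why a plain optimizer over pairs suffices.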

Categories: Generative AI

How to Serve Multiple LoRA Adapters from a Single Base Model

March 20, 2026 by mljourney

Serving multiple fine-tuned model variants with separate deployments wastes GPU memory proportional to the number of adapters. A guide to multi-adapter serving with vLLM, S-LoRA architecture, adapter routing strategies, and GPU memory planning.
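A back-of-envelope sizing helper (a hypothetical function, not a vLLM API) shows why multi-adapter serving beats per-variant deployments: a rank-16 LoRA adapter for a 7B-class model is tens of megabytes, versus ~14 GB for another fp16 copy of the base model. It assumes square d×d target projections, two low-rank matrices (A: d×r, B: r×d) per targeted module:

```python
def lora_adapter_bytes(hidden_size, num_layers, rank,
                       targets_per_layer=4, bytes_per_param=2):
    """Rough GPU memory for one LoRA adapter (illustrative sizing helper).

    Assumes each targeted module is a square hidden_size x hidden_size
    projection, adapted by A (d x r) and B (r x d) in fp16/bf16.
    """
    params_per_module = 2 * hidden_size * rank        # A plus B
    total_params = num_layers * targets_per_layer * params_per_module
    return total_params * bytes_per_param

# Llama-7B-like config: hidden 4096, 32 layers, rank 16, 4 targets/layer.
mib = lora_adapter_bytes(4096, 32, 16) / 2**20
print(f"{mib:.0f} MiB per adapter")  # 32 MiB per adapter
```

At ~32 MiB each, hundreds of adapters fit in the memory one extra base-model replica would consume — the core economics behind S-LoRA-style serving.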

Categories: Generative AI

Feast vs Tecton vs Hopsworks: Choosing a Feature Store

March 19, 2026 by mljourney

Feast, Tecton, and Hopsworks each take a different approach to feature store architecture. A practical comparison covering offline and online stores, streaming feature support, managed vs self-hosted trade-offs, and how to choose based on your actual requirements.

Categories: Generative AI

How to Debug Slow PyTorch Dataloaders

March 17, 2026 by mljourney

GPU sitting at 40–60% utilization while the model code looks fine? The dataloader is likely the bottleneck. A systematic guide to diagnosing and fixing slow data loading in PyTorch training pipelines.
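The first diagnostic step can be sketched in a few lines: measure what fraction of wall time the training loop spends blocked waiting for the next batch (the dataset, delay, and helper names here are illustrative, with a `time.sleep` standing in for disk I/O and decoding):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SlowDataset(Dataset):
    """Toy dataset with an artificial per-item loading cost."""
    def __init__(self, n=64, delay=0.001):
        self.n, self.delay = n, delay
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        time.sleep(self.delay)      # stands in for disk I/O / decoding
        return torch.randn(8)

def data_wait_fraction(loader):
    """Fraction of wall time spent blocked on the dataloader."""
    wait = 0.0
    start = time.perf_counter()
    t0 = start
    for batch in loader:
        wait += time.perf_counter() - t0   # time waiting for this batch
        _ = batch.sum()                     # stand-in for the training step
        t0 = time.perf_counter()
    return wait / (time.perf_counter() - start)

loader = DataLoader(SlowDataset(), batch_size=8, num_workers=0)
print(f"waiting on data: {data_wait_fraction(loader):.0%}")
```

If that fraction is high while GPU utilization is low, the input pipeline is the bottleneck; re-run with `num_workers=2` or more (and `pin_memory=True` when training on GPU) and compare.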

Categories: Generative AI

Gradient Accumulation and Gradient Checkpointing Explained

March 16, 2026 by mljourney

Gradient accumulation and gradient checkpointing are frequently confused but solve different problems. A precise guide to how each works, when to use them, how to combine them, and how to reason about GPU memory during training.
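The distinction in one sketch: gradient accumulation trades wall-clock time for a larger effective batch, while checkpointing (noted in a comment, since it is a one-line toggle) trades compute for activation memory. A minimal accumulation loop on a toy linear model, simulating a batch of 32 with micro-batches of 8:

```python
import torch

# Gradient checkpointing is the *other* technique and is orthogonal:
# it re-computes activations during backward to save memory, enabled
# via torch.utils.checkpoint (or model.gradient_checkpointing_enable()
# in HuggingFace). Below: gradient accumulation only.
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4                     # 4 micro-batches of 8 = effective batch 32

opt.zero_grad()
for step in range(8):
    x = torch.randn(8, 16)
    y = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward() # scale so summed grads average correctly
    if (step + 1) % accum_steps == 0:
        opt.step()                  # one optimizer step per 4 micro-batches
        opt.zero_grad()
```

Note the loss is divided by `accum_steps` before `backward()`, so the accumulated gradient matches what a single large batch would produce.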

Categories: Generative AI

Attention Mechanisms Explained: From Scaled Dot-Product to GQA

March 15, 2026 by mljourney

A practical guide to how attention actually works — scaled dot-product, multi-head, MQA, GQA, Flash Attention, and RoPE — with the implications for memory, throughput, and context length that matter for production deployments.
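The core computation, plus the GQA trick, fits in a short NumPy sketch: scaled dot-product attention is `softmax(QKᵀ/√d)V`, and grouped-query attention simply shares each KV head across a group of query heads (here 8 query heads over 2 KV heads, a 4x smaller KV cache):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128, 64))       # 8 query heads: (heads, seq, d)
k_kv = rng.standard_normal((2, 128, 64))    # only 2 KV heads (GQA)
v_kv = rng.standard_normal((2, 128, 64))
k = np.repeat(k_kv, 4, axis=0)              # each KV head serves 4 query heads
v = np.repeat(v_kv, 4, axis=0)
out = attention(q, k, v)
print(out.shape)  # (8, 128, 64)
```

MQA is the limiting case with a single KV head; Flash Attention changes none of this math, only how the softmax and matmuls are tiled to avoid materializing the full score matrix.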

Categories: Generative AI

How to Build an LLM Eval Dataset from Production Logs

March 14, 2026 by mljourney

Handwritten test cases give false confidence. Building an eval dataset from production logs — with stratified sampling, proper labeling, and slice-based reporting — produces evaluations that actually catch regressions before they reach users.
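The stratified-sampling step can be sketched in plain Python (the `intent` field and record shape are illustrative, standing in for whatever slice key your logs carry): cap the sample per slice so rare-but-important traffic isn't drowned out by the head of the distribution.

```python
import random
from collections import defaultdict

def stratified_sample(logs, slice_key, per_slice=50, seed=0):
    """Sample up to per_slice records from each slice of production logs."""
    rng = random.Random(seed)                 # fixed seed: reproducible evals
    by_slice = defaultdict(list)
    for record in logs:
        by_slice[record[slice_key]].append(record)
    sample = []
    for slice_name, records in sorted(by_slice.items()):
        sample.extend(rng.sample(records, min(per_slice, len(records))))
    return sample

# Skewed traffic: 9 "chat" requests for every "billing" request.
logs = [{"intent": "billing" if i % 10 == 0 else "chat", "prompt": f"p{i}"}
        for i in range(1000)]
evalset = stratified_sample(logs, "intent", per_slice=50)
print(len(evalset))  # 100: 50 per slice, despite the 9:1 skew
```

Reporting scores per slice (rather than one aggregate number) is what then turns this sample into a regression detector.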

Categories: Generative AI

MLflow vs Weights and Biases vs Neptune: Choosing an Experiment Tracker

March 13, 2026 by mljourney

MLflow, W&B, and Neptune all track experiments but optimize for different teams and workflows. A practical comparison across UI quality, self-hosting, hyperparameter optimization, and pricing — with a clear decision framework.

Categories: Generative AI

How to Reduce GPU Memory During LLM Training

March 12, 2026 by mljourney

CUDA out of memory errors are almost always solvable without buying more hardware. A practical checklist of GPU memory reduction techniques — gradient checkpointing, 8-bit Adam, LoRA, Flash Attention, and more — in the order to try them.
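A quick way to reason about the checklist is a per-parameter byte count (an illustrative sizing helper, not a library API): bf16 weights and grads cost 2 bytes each, and standard Adam adds two fp32 moments at 8 bytes per parameter — before any activations.

```python
def training_memory_gb(n_params_b, bytes_weights=2, bytes_grads=2,
                       bytes_optim=8):
    """Back-of-envelope GPU memory for weights + grads + optimizer state.

    Excludes activations. Defaults: bf16 weights/grads, fp32 Adam
    moments (2 x 4 bytes). Billions of params x bytes/param = GB.
    """
    return n_params_b * (bytes_weights + bytes_grads + bytes_optim)

# 7B model with standard Adam: ~84 GB before activations -- hence the
# checklist: 8-bit Adam shrinks optimizer state, gradient checkpointing
# shrinks activations, LoRA shrinks grads + optimizer state to the
# adapter's tiny parameter count.
print(training_memory_gb(7))                  # 84
print(training_memory_gb(7, bytes_optim=2))   # 8-bit Adam: 42
```

Numbers like these explain why the checklist's cheap items (8-bit Adam, checkpointing) often recover tens of gigabytes before any hardware change is needed.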

Categories: Generative AI

Recent Posts

  • Best Coding LLMs to Run Locally in 2026
  • How to Get Reliable Structured Output from LLMs
  • How to Reduce LLM Inference Latency: KV Cache, Batching, and Quantization
  • How to Evaluate LLM Outputs Without Ground Truth Labels
  • How to Write a Custom CUDA Kernel for PyTorch
© 2026 ML Journey • Built with GeneratePress