ML Journey

Generative AI

Tensor Parallelism vs Pipeline Parallelism for LLM Inference

March 26, 2026 by mljourney

Tensor parallelism and pipeline parallelism split models across GPUs in fundamentally different ways with different latency and throughput implications. A practical guide covering how each works, NVLink vs InfiniBand trade-offs, and how they combine for 70B+ multi-node serving.

How to Fine-Tune an Embedding Model for Domain-Specific Retrieval

March 25, 2026 by mljourney

Off-the-shelf embedding models underperform on domain-specific retrieval. A practical guide to synthetic query generation, hard negative mining, contrastive fine-tuning, and Matryoshka training — with realistic expectations on NDCG@10 improvement.

AWQ vs GPTQ vs bitsandbytes: LLM Quantization Methods Compared

March 24, 2026 by mljourney

AWQ, GPTQ, and bitsandbytes each quantize LLM weights differently. A practical comparison covering how each works, quality at 4-bit and 8-bit, inference speed, and when to use which — including the right choice for QLoRA training vs production serving.

How to Use FlashAttention-2 in Practice

March 23, 2026 by mljourney

FlashAttention-2 can deliver 2–4x faster attention and avoids materializing the O(N²) attention matrix in GPU memory, with a single argument change. A practical guide to enabling it in Hugging Face, PyTorch SDPA, and fine-tuning pipelines, including attention mask compatibility and where it helps most.

RLHF vs DPO vs PPO: How to Align LLMs Without Losing Your Mind

March 22, 2026 by mljourney

RLHF with PPO is powerful but complex. DPO can achieve comparable alignment with a fraction of the engineering overhead. A practical comparison of PPO, DPO, and newer variants like SimPO and ORPO, covering when to use each and how to build a preference dataset that actually works.

How to Serve Multiple LoRA Adapters from a Single Base Model

March 20, 2026 by mljourney

Serving each fine-tuned variant from its own deployment loads a separate copy of the base model per adapter, so GPU memory use grows in proportion to the number of adapters. A guide to multi-adapter serving with vLLM, S-LoRA architecture, adapter routing strategies, and GPU memory planning.

Feast vs Tecton vs Hopsworks: Choosing a Feature Store

March 19, 2026 by mljourney

Feast, Tecton, and Hopsworks each take a different approach to feature store architecture. A practical comparison covering offline and online stores, streaming feature support, managed vs self-hosted trade-offs, and how to choose based on your actual requirements.

How to Debug Slow PyTorch Dataloaders

March 17, 2026 by mljourney

GPU sitting at 40–60% utilization while the model code looks fine? The dataloader is likely the bottleneck. A systematic guide to diagnosing and fixing slow data loading in PyTorch training pipelines.

Gradient Accumulation and Gradient Checkpointing Explained

March 16, 2026 by mljourney

Gradient accumulation and gradient checkpointing are frequently confused but solve different problems. A precise guide to how each works, when to use them, how to combine them, and how to reason about GPU memory during training.

Attention Mechanisms Explained: From Scaled Dot-Product to GQA

March 15, 2026 by mljourney

A practical guide to how attention actually works (scaled dot-product, multi-head, MQA, GQA, FlashAttention, and RoPE), with the implications for memory, throughput, and context length that matter for production deployments.
