ML Journey

mljourney

RAG in Production: Fixing Retrieval Failures with Hybrid Search and Reranking

March 27, 2026 by mljourney

Most RAG failures are retrieval failures. A practical guide to the techniques that actually move the needle: hybrid BM25 + dense retrieval, reciprocal rank fusion, cross-encoder reranking, HyDE query transformation, and chunking strategies — with retrieval evaluation built in.

Categories: Generative AI

Tensor Parallelism vs Pipeline Parallelism for LLM Inference

March 26, 2026 by mljourney

Tensor parallelism and pipeline parallelism split models across GPUs in fundamentally different ways with different latency and throughput implications. A practical guide covering how each works, NVLink vs InfiniBand trade-offs, and how they combine for 70B+ multi-node serving.

Categories: Generative AI

How to Fine-Tune an Embedding Model for Domain-Specific Retrieval

March 25, 2026 by mljourney

Off-the-shelf embedding models underperform on domain-specific retrieval. A practical guide to synthetic query generation, hard negative mining, contrastive fine-tuning, and Matryoshka training — with realistic expectations on NDCG@10 improvement.

Categories: Generative AI

AWQ vs GPTQ vs bitsandbytes: LLM Quantization Methods Compared

March 24, 2026 by mljourney

AWQ, GPTQ, and bitsandbytes each quantize LLM weights differently. A practical comparison covering how each works, quality at 4-bit and 8-bit, inference speed, and when to use which — including the right choice for QLoRA training vs production serving.

Categories: Generative AI

How to Use FlashAttention-2 in Practice

March 23, 2026 by mljourney

FlashAttention-2 delivers 2–4x faster attention and cuts attention memory from O(N²) to O(N) in sequence length with a single argument change. A practical guide to enabling it in HuggingFace, PyTorch SDPA, and fine-tuning pipelines, including attention mask compatibility and where it helps most.

Categories: Generative AI

RLHF vs DPO vs PPO: How to Align LLMs Without Losing Your Mind

March 22, 2026 by mljourney

RLHF with PPO is powerful but complex. DPO achieves comparable alignment with a fraction of the engineering overhead. A practical comparison of PPO, DPO, and newer variants like SimPO and ORPO — covering when to use each and how to build a preference dataset that actually works.

Categories: Generative AI

How to Serve Multiple LoRA Adapters from a Single Base Model

March 20, 2026 by mljourney

Serving each fine-tuned variant as a separate deployment duplicates the base model in GPU memory once per adapter. A guide to multi-adapter serving with vLLM, S-LoRA architecture, adapter routing strategies, and GPU memory planning.

Categories: Generative AI

Feast vs Tecton vs Hopsworks: Choosing a Feature Store

March 19, 2026 by mljourney

Feast, Tecton, and Hopsworks each take a different approach to feature store architecture. A practical comparison covering offline and online stores, streaming feature support, managed vs self-hosted trade-offs, and how to choose based on your actual requirements.

Categories: Generative AI

How to Debug Slow PyTorch Dataloaders

March 17, 2026 by mljourney

GPU sitting at 40–60% utilization while the model code looks fine? The dataloader is likely the bottleneck. A systematic guide to diagnosing and fixing slow data loading in PyTorch training pipelines.

Categories: Generative AI

Gradient Accumulation and Gradient Checkpointing Explained

March 16, 2026 by mljourney

Gradient accumulation and gradient checkpointing are frequently confused but solve different problems. A precise guide to how each works, when to use them, how to combine them, and how to reason about GPU memory during training.

Categories: Generative AI

© 2026 ML Journey