mljourney, Author at ML Journey

RAG in Production: Fixing Retrieval Failures with Hybrid Search and Reranking

March 27, 2026 by mljourney

Most RAG failures are retrieval failures. A practical guide to the techniques that actually move the needle: hybrid BM25 + dense retrieval, reciprocal rank fusion, cross-encoder reranking, HyDE query transformation, and chunking strategies — with retrieval evaluation built in.

Tensor Parallelism vs Pipeline Parallelism for LLM Inference

March 26, 2026 by mljourney

Tensor parallelism and pipeline parallelism split models across GPUs in fundamentally different ways, with different latency and throughput implications. A practical guide covering how each works, NVLink vs InfiniBand trade-offs, and how the two combine for multi-node serving of 70B+ parameter models.

How to Fine-Tune an Embedding Model for Domain-Specific Retrieval

March 25, 2026 by mljourney

Off-the-shelf embedding models underperform on domain-specific retrieval. A practical guide to synthetic query generation, hard negative mining, contrastive fine-tuning, and Matryoshka training — with realistic expectations on NDCG@10 improvement.

AWQ vs GPTQ vs bitsandbytes: LLM Quantization Methods Compared

March 24, 2026 by mljourney

AWQ, GPTQ, and bitsandbytes each quantize LLM weights differently. A practical comparison covering how each works, quality at 4-bit and 8-bit, inference speed, and when to use which — including the right choice for QLoRA training vs production serving.

How to Use FlashAttention-2 in Practice

March 23, 2026 by mljourney

FlashAttention-2 delivers 2–4x faster attention and cuts attention memory from O(N²) to O(N) in sequence length with a single argument change. A practical guide to enabling it in Hugging Face, PyTorch SDPA, and fine-tuning pipelines, including attention mask compatibility and where it helps most.

RLHF vs DPO vs PPO: How to Align LLMs Without Losing Your Mind

March 22, 2026 by mljourney

RLHF with PPO is powerful but complex. DPO achieves comparable alignment with a fraction of the engineering overhead. A practical comparison of PPO, DPO, and newer variants like SimPO and ORPO — covering when to use each and how to build a preference dataset that actually works.

How to Serve Multiple LoRA Adapters from a Single Base Model

March 20, 2026 by mljourney

Serving each fine-tuned variant as a separate deployment loads a full copy of the base model per adapter, so GPU memory use grows in proportion to the number of adapters. A guide to multi-adapter serving with vLLM, the S-LoRA architecture, adapter routing strategies, and GPU memory planning.

Feast vs Tecton vs Hopsworks: Choosing a Feature Store

March 19, 2026 by mljourney

Feast, Tecton, and Hopsworks each take a different approach to feature store architecture. A practical comparison covering offline and online stores, streaming feature support, managed vs self-hosted trade-offs, and how to choose based on your actual requirements.

How to Debug Slow PyTorch Dataloaders

March 17, 2026 by mljourney

GPU sitting at 40–60% utilization while the model code looks fine? The dataloader is likely the bottleneck. A systematic guide to diagnosing and fixing slow data loading in PyTorch training pipelines.

Gradient Accumulation and Gradient Checkpointing Explained

March 16, 2026 by mljourney

Gradient accumulation and gradient checkpointing are frequently confused but solve different problems. A precise guide to how each works, when to use them, how to combine them, and how to reason about GPU memory during training.