How to Build an LLM Eval Dataset from Production Logs
Handwritten test cases give false confidence: they cover the inputs their author imagined, not the ones users actually send. An eval dataset built from production logs, with stratified sampling, careful labeling, and slice-based reporting, catches regressions before they reach users.
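As a rough illustration of two of the ideas named above, stratified sampling and slice-based reporting, here is a minimal Python sketch. It assumes each log record is a dict; the field names `intent` and `passed` and both function names are hypothetical, chosen only for this example, and a real pipeline would add deduplication, PII scrubbing, and labeling in between.

```python
import random
from collections import defaultdict


def stratified_sample(logs, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    groups = defaultdict(list)
    for record in logs:
        groups[record[key]].append(record)

    sample = []
    for name in sorted(groups):  # sorted so the output order is deterministic
        bucket = groups[name]
        k = min(per_stratum, len(bucket))  # rare strata contribute all they have
        sample.extend(rng.sample(bucket, k))
    return sample


def pass_rate_by_slice(examples, key):
    """Report the pass rate per slice instead of one aggregate number."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for ex in examples:
        totals[ex[key]][0] += int(ex["passed"])
        totals[ex[key]][1] += 1
    return {name: passed / n for name, (passed, n) in totals.items()}
```

Sampling a fixed number per stratum, rather than uniformly over all traffic, keeps rare request types in the dataset, and the per-slice pass rate makes a regression in one of those rare types visible even when the overall average barely moves.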