The Measurement Problem: Why Answer-Level Metrics Misdiagnose RAG Systems
RAG evaluation suffers from Diagnostic Collapse: single scalar scores cannot distinguish retrieval failure from grounding failure. Models achieve 31% deflection rate when grounding is insufficient.