Curious what folks think about this paper: https://arxiv.org/abs/2508.08285
In my own experience doing hallucination-detection research, the other popular benchmarks are also low-signal, even the ones that don't suffer from the flaw highlighted in this work.
Other common flaws in existing benchmarks:
- Too synthetic, when the aim is to catch real, high-stakes hallucinations in production LLM use cases.
- Full of incorrect annotations of whether each LLM response is actually correct, due to low-quality human review or reliance on automated LLM-powered annotation (see the sketch after this list).
- Only considering responses generated by old LLMs, which are no longer representative of the kinds of mistakes modern LLMs make.
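On the annotation-quality point, a cheap first check is to measure how often a benchmark's human labels and its automated labels even agree, and then hand-review a sample of the disagreements. Here's a minimal sketch of that idea, assuming a hypothetical JSONL layout with `human_label` / `auto_label` fields (the field names and file path are placeholders, not from any specific benchmark):

```python
# Sketch: audit a benchmark's correctness labels by comparing two annotation
# sources and sampling their disagreements for manual review.
import json
import random

def load_records(path):
    """Load one JSON object per line; field names below are placeholders."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def audit_labels(records, sample_size=25, seed=0):
    """Return human/auto agreement rate plus a random sample of disagreements."""
    disagreements = [r for r in records if r["human_label"] != r["auto_label"]]
    agreement = 1 - len(disagreements) / len(records)
    random.seed(seed)
    to_review = random.sample(disagreements, min(sample_size, len(disagreements)))
    return agreement, to_review

if __name__ == "__main__":
    records = load_records("benchmark.jsonl")  # placeholder path
    agreement, to_review = audit_labels(records)
    print(f"human/auto label agreement: {agreement:.1%}")
    for r in to_review:
        print("NEEDS REVIEW:", r["prompt"][:80], "|", r["response"][:80])
```

Low agreement doesn't tell you which source is wrong, but in my experience it's a fast way to spot benchmarks whose "ground truth" shouldn't be trusted.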
I think part of the challenge in this field is simply how hard proper evals are in general. For instance, evals are much easier in multiple-choice / closed domains, but those aren't the settings where LLM hallucinations pose the biggest concern.
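To make that contrast concrete: multiple-choice scoring is essentially exact match, while open-ended answers need some notion of semantic equivalence, which is exactly where annotation noise creeps in. A rough sketch of the difference, with naive string normalization standing in for the open-ended judge (a real setup would need human or LLM-based grading):

```python
import re

def score_multiple_choice(predicted: str, gold: str) -> bool:
    # Closed domain: correctness is an unambiguous exact match on the option letter.
    return predicted.strip().upper() == gold.strip().upper()

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so trivial formatting differences don't matter.
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def score_open_ended(predicted: str, gold: str) -> bool:
    # Open domain: naive containment check; misses paraphrases and partial truths,
    # which is why benchmarks fall back to human or LLM judges (and inherit their errors).
    return normalize(gold) in normalize(predicted)

print(score_multiple_choice("b", "B"))                                  # True
print(score_open_ended("The capital is Canberra.", "Canberra"))         # True
print(score_open_ended("It's the largest city, Sydney.", "Canberra"))   # False
```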