Curious what folks think about this paper: https://arxiv.org/abs/2508.08285

From my own experience doing hallucination-detection research, the other popular benchmarks are also low-signal, even the ones that don’t suffer from the flaw highlighted in this work.

Other common flaws in existing benchmarks:

  • Too synthetic, when the aim is to catch real high-stakes hallucinations in production LLM use-cases.

  • Full of incorrect annotations of whether each LLM response is actually correct, due to either low-quality human review or reliance on automated LLM-powered annotation.

  • Only considering responses generated by old LLMs, which are no longer representative of the type of mistakes that modern LLMs make.

I think part of the challenge in this field is simply the overall difficulty of doing proper evals. For instance, evals are much easier in multiple-choice / closed domains, but those aren’t the settings where LLM hallucinations pose the biggest concern.
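
To make that contrast concrete, here’s a minimal sketch (function and variable names are my own, purely illustrative): scoring a multiple-choice eval reduces to exact-match accuracy against a fixed answer key, while grading an open-ended response has no such shortcut and typically needs a human or LLM judge — which is exactly where the annotation-quality issues above creep in.

```python
# Illustrative sketch: why closed-domain evals are easy to score
# and open-ended ones aren't. Names are hypothetical, not from any benchmark.

def score_multiple_choice(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy: unambiguous because the answer space is fixed (e.g. 'A'-'D')."""
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold))
    return correct / len(gold)


def score_open_ended(response: str, reference: str) -> float:
    """No exact-match shortcut here: you need a (fallible) human reviewer or an
    LLM-as-judge, and the judge's own errors become label noise in the benchmark."""
    raise NotImplementedError("requires a human or LLM judge")


if __name__ == "__main__":
    preds = ["A", "c", "B"]
    gold = ["A", "C", "D"]
    print(f"multiple-choice accuracy: {score_multiple_choice(preds, gold):.2f}")  # 0.67
```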