This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Wiskkey on 2024-06-28 15:35:35+00:00.

Original Title: NoCha: a benchmark for long-context language models that measures claim verification about recent fiction books. Paper: ‘One Thousand and One Pairs: A “novel” challenge for long-context language models’.


From A Novel Challenge for long-context language models:

NoCha measures how well long-context language models can verify claims written about fictional books. Check out our paper and GitHub repo for more details.

About the benchmark: NoCha contains 1001 narrative minimal pairs written about recently published novels, where one claim is true and the other is false. Given the book text and a claim, a model is instructed to verify whether the claim is true or false. The model only gets credit for a pair if it correctly labels both the true and the false claim.

Paper. I am not affiliated with the authors.

Abstract:

Synthetic long-context LLM benchmarks (e.g., “needle-in-the-haystack”) test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models.

GitHub repo.

X/Twitter thread about the work from one of the authors.

Example of a claim pair (image from the aforementioned X thread):

Claim pair accuracy (image from the aforementioned X thread):

According to the paper, humans achieved a claim pair accuracy of 96.9% on a subset of 64 claim pairs. Because a pair is scored correct only when both of its claims are labeled correctly, a random guesser would on average get 25% of claim pairs correct (0.5 × 0.5).
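The pair-scoring rule and the 25% random baseline can be sketched in a few lines of Python (the `pair_accuracy` helper is hypothetical, not from the NoCha repo):

```python
import random

def pair_accuracy(pred_pairs):
    """Score claim pairs: a pair counts only if BOTH claims are
    labeled correctly -- the true claim as True AND the false
    claim as False. (Hypothetical helper for illustration.)"""
    correct = sum(1 for p_true, p_false in pred_pairs
                  if p_true is True and p_false is False)
    return correct / len(pred_pairs)

# A random guesser labels each claim True/False with probability 0.5,
# so it gets a whole pair right with probability 0.5 * 0.5 = 0.25.
random.seed(0)
n = 100_000
guesses = [(random.random() < 0.5, random.random() < 0.5) for _ in range(n)]
print(round(pair_accuracy(guesses), 2))
```

Simulating many random guesses this way converges to roughly 0.25, which is why pair-level scoring is stricter than the 50% baseline of ordinary binary claim verification.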

An article about the work: Fact or Fiction? NOCHA: A New Benchmark for Evaluating Long-Context Reasoning in LLMs.