This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/pigeon57434 on 2025-07-01 16:58:40+00:00.


Allen AI has released SciArena, which is basically LMArena, except that only *ACTUAL EXPERTS* vote (102 people with ≥2 peer-reviewed publications and prior AI-assisted experience), only on scientific topics, and with rigorously fair eval methods

basically, for science *o3 is still on top by a long shot*. more eval details:

Fairness is enforced through a fixed multi-stage RAG pipeline (query decomposition, passage retrieval, re-ranking) adapted from Scholar QA. Every competitor gets the identical retrieval index and prompt workflow, so retrieval is held constant and differences in the answers come from the model itself. To reduce stylistic bias, long-form model outputs are algorithmically stripped of unique formatting and post-processed into a standardized plain-text format with consistent citation styles before being shown to voters in a blind, side-by-side interface. The resulting expert preference data is validated with agreement checks, showing strong self-consistency (weighted Cohen’s κ=0.91) and inter-annotator agreement (κ=0.76). By separating the LM from retrieval and presentation differences, the benchmark aims to give a cleaner signal of scientific reasoning capability rather than of surface style.
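For anyone wondering what the κ numbers mean in practice, here is a minimal sketch of weighted Cohen's κ for two sets of votes over the same comparisons. This is not SciArena's code: the function name `weighted_kappa`, the ordinal vote encoding (0 = A better, 1 = tie, 2 = B better), and the choice of linear weights are all assumptions made for illustration, since the post does not say which weighting scheme was used.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, num_categories, weighting="linear"):
    """Weighted Cohen's kappa for two lists of ordinal labels in 0..num_categories-1.

    Hypothetical helper for illustration; the weighting actually used by
    SciArena is not stated in the post.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_categories < 2:
        raise ValueError("need at least two categories")

    n = len(ratings_a)
    k = num_categories

    def disagreement(i, j):
        # 0 when the labels match, 1 when they are as far apart as possible.
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d * d

    observed = Counter(zip(ratings_a, ratings_b))  # joint label counts
    marg_a = Counter(ratings_a)                    # marginal counts, rater A
    marg_b = Counter(ratings_b)                    # marginal counts, rater B

    # Average disagreement actually observed.
    observed_dis = sum(disagreement(i, j) * c / n for (i, j), c in observed.items())
    # Average disagreement expected if the two raters were independent.
    expected_dis = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j] / (n * n)
        for i in range(k)
        for j in range(k)
    )
    if expected_dis == 0:
        return 1.0  # both raters always used the same single label
    return 1.0 - observed_dis / expected_dis


# Assumed encoding: 0 = "A better", 1 = "tie", 2 = "B better".
first_pass = [0, 2, 2, 1, 0, 2]
second_pass = [0, 2, 2, 1, 0, 2]
print(weighted_kappa(first_pass, second_pass, num_categories=3))
```

κ is 1 when the two sets of votes agree perfectly and about 0 when they agree no more than chance would predict. Self-consistency compares the same expert's votes on repeated comparisons, while inter-annotator agreement compares different experts on the same comparison.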

https://preview.redd.it/6qfr80vomaaf1.png?width=659&format=png&auto=webp&s=5da7e36648c53b47972277e51a80cd8c147c8a99

https://preview.redd.it/dnpdrhnpmaaf1.png?width=512&format=png&auto=webp&s=277cb313728d343ba1bda75fcb81521c6f2029d2

blog: https://allenai.org/blog/sciarena

code: https://github.com/yale-nlp/SciArena/