Hi, I’ve been toying with a simple idea for a future-proof, dynamic AI model benchmark. The idea is straightforward: a hidden function transforms data, and the model only gets to see the before and after, from which it has to deduce the hidden logic. I’ve carefully curated several levels of gradually increasing difficulty, and I’ve been surprised to see that most current models I can access (GPT, o1, Sonnet, Gemini) suck at it.
For instance, the first puzzle simply XORs each byte of the input buffer with 0x55 (i.e., `^= 0x55`), yet most models struggle to spot or deduce it.
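
To make the setup concrete, here's a minimal sketch (my own illustration in Python, not the repo's actual code) of that first puzzle's hidden transform and the kind of before/after pair a model would be shown:

```python
import os

def hidden_transform(data: bytes) -> bytes:
    """The first puzzle's hidden rule: XOR every byte with 0x55."""
    return bytes(b ^ 0x55 for b in data)

# Generate a random "before" buffer and its transformed "after" buffer.
before = os.urandom(8)
after = hidden_transform(before)

# The model sees only these two hex strings and must deduce the rule.
print("before:", before.hex())
print("after: ", after.hex())
```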
I’ve spun up an open-source MIT-licensed repo with a live demo, so others can give this idea a try or contribute. I’d appreciate any feedback. Thanks!