Hi, I’ve been toying with a simple idea for a future-proof, dynamic AI model benchmark. The idea is straightforward: a hidden function transforms data, and the model only gets to see the before and after, from which it has to deduce the hidden logic. I’ve carefully curated several levels of gradually increasing difficulty, and I’ve been surprised to see that most current models I can access (GPT, o1, Sonnet, Gemini) suck at it.
For instance, the first puzzle simply XORs each byte of the input buffer with 0x55 (i.e., `^= 0x55`), yet most models struggle to spot or deduce it.
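
To make the setup concrete, here's a minimal sketch (my own illustration in Python, not the repo's actual code) of that first puzzle's hidden transform and the kind of before/after pair a model would be shown:

```python
import os

def hidden_transform(data: bytes) -> bytes:
    """The first puzzle's hidden rule: XOR every byte with 0x55."""
    return bytes(b ^ 0x55 for b in data)

# Generate a random "before" buffer and its transformed "after" buffer.
before = os.urandom(8)
after = hidden_transform(before)

# The model sees only these two hex strings and must deduce the rule.
print("before:", before.hex())
print("after: ", after.hex())
```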
I’ve spun up an open-source MIT-licensed repo with a live demo, so others can give this idea a try or contribute. I’d appreciate any feedback. Thanks!