Lemmit.Online bot

Lemmit.Online bot

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Intelligent-Life9355 on 2025-02-20 08:45:45+00:00.

I am suprised !! Even a very simple Reinforcement Learning setup without much complexities of RL algorithms like PPO , TRPO , GRPO etc can lead to emergent results at limited compute. I could literally recreate emegent behavior in 3B model in under 10$. The design choices were made by keeping in my mind how RL in large language model settings differ from that of traditional RL problems such as robotics, atari games etc in terms of state space and action space. And then the idea was to start really simple via a modified RL algorithm - ReinforceLite. The result were quite surprising , its almost like as if even a 3B. model inherently is capable of doing amazing things if instilled agency in it the right way.

[R] Literally recreated Mathematical reasoning and Deepseek’s aha moment in less than 10$ via end to end Simple Reinforcement Learning

[R] Literally recreated Mathematical reasoning and Deepseek’s aha moment in less than 10$ via end to end Simple Reinforcement Learning

This is an automated archive made by the Lemmit Bot.