npub1nlk…jm9c

2025-03-06 18:57:33

GRPO & EGO

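GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that DeepSeek-R1 made famous (it was first described in the earlier DeepSeekMath paper). Why is it a big deal? It allowed models to reject their own earlier reasoning.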

A model outputs some words while trying to solve a math or coding problem. If it cannot solve the problem, in the next round it may try longer reasoning. While doing all of this, at some point a "Wait!", "But," or "On the other hand" shows up by chance in the reasoning, and that lets the model re-think what it has written and correct itself. Once these chance appearances of reflection help it solve problems, it does more of that in later rounds, because it was rewarded when it did. Hence it gradually gets smarter thanks to self-reflection.
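To make the reward part concrete, here is a minimal Python sketch of the group-relative scoring that gives GRPO its name. The four sampled answers, the 0/1 correctness reward, the function name and the choice of population standard deviation are all my assumptions for illustration; real implementations add more on top, such as clipped probability ratios and a KL penalty.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer relative to the other answers in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: four answers sampled for the same math problem.
# Assumed reward: 1.0 if the final answer is correct, 0.0 otherwise.
samples = [
    "x = 5",
    "x = 5. Wait, let me check the sign... x = 3",
    "x = 7",
    "x = 4",
]
rewards = [0.0, 1.0, 0.0, 0.0]

advantages = group_relative_advantages(rewards)
for text, adv in zip(samples, advantages):
    print(f"{adv:+.2f}  {text}")
```

Only the answer that stopped to reconsider gets a positive score here, so a policy update weighted by these numbers would make that kind of output more likely and the other three less likely. If every answer in the group earns the same reward, all the scores are zero and the group carries no learning signal.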

I think this is better than SFT (supervised fine-tuning) because the model fixes its own errors, while SFT primarily focuses on teaching new skills. Inverting the error is a kind of "fixing the mistakes in itself" (the GRPO method), and it could be more effective than installing new ideas and hoping the old ideas go away (the SFT method).
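A toy Python sketch of where the two update signals come from, under heavy simplification: the "model" is just three logits over three candidate answers, and every number and helper name below is hypothetical. The only math it relies on is that the gradient of log softmax(z)[k] with respect to z is onehot(k) minus the probabilities. It is not the actual GRPO or SFT training loop.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob_grad(logits, index):
    """Gradient of log softmax(logits)[index] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == index else 0.0) - p for i, p in enumerate(probs)]

def step(logits, index, weight, lr=1.0):
    """Move the logits along weight * grad log p(index)."""
    grad = log_prob_grad(logits, index)
    return [z + lr * weight * g for z, g in zip(logits, grad)]

# Hypothetical policy: answer 0 is the model's current (wrong) favourite,
# answer 2 is the correct one.
logits = [2.0, 0.0, 0.0]

# SFT-style signal: raise the likelihood of an externally supplied answer.
sft_logits = step(logits, index=2, weight=1.0)

# GRPO-style signal: the model sampled answer 0 itself, it scored below its
# group's average (assumed advantage of -0.58), so its own output is pushed down.
rl_logits = step(logits, index=0, weight=-0.58)

before = softmax(logits)
after_sft = softmax(sft_logits)
after_rl = softmax(rl_logits)
print("before   ", [round(p, 3) for p in before])
print("after SFT", [round(p, 3) for p in after_sft])
print("after RL ", [round(p, 3) for p in after_rl])
```

In this toy both steps lower the probability of the wrong answer, so it does not prove one method is better. What it shows is the difference in the source of the signal: the SFT step needs a reference answer from outside, while the GRPO-style step only needs the model's own sample and a score for it.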

LLMs fixing their own errors allows them to self-learn. This has analogies to how humans operate. Rejecting the ego is liberation from its shackles. For the model, its past words are a kind of shackles, and when it corrects itself it is "thinking outside the box". We find our mistakes, contemplate them, learn from them, and don't repeat them next time. We f around and find out, basically. F around is enjoying life recklessly; finding out is "divine scripts work most of the time and should have priority in decision making". Controlling the ego and getting outside the box of the ego is how we ascend.

Author Public Key

npub1nlk894teh248w2heuu0x8z6jjg2hyxkwdc8cxgrjtm9lnamlskcsghjm9c

Published at

2025-03-06 18:57:33

Kind type

1 Short Text Note
