Tim Hanson on Nostr:
npub15swlxudlhx4ttcgsd4556zuqrl57qndxmt4n3dnzrkqn89nxv6lsjzx855
Well, the papers I mentioned have two sets of ensembles for intrinsic (never-give-up, NGU) and extrinsic (game score) rewards, where the ensembling is over the discount factor. So, sorta...
Eq. 7 in the linked paper is in effect a fixed-policy bandit, so architecturally the two are similar; only the NGU reward is dynamically learned, and these 4 are "instinctual".
Model-based RL like MuZero and EfficientZero adds state prediction as an independent objective, of course.
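To make the ensemble-plus-bandit structure concrete, here is a minimal sketch of the Agent57/NGU-style setup: an ensemble of policies indexed by (beta, gamma) pairs, where beta weights the intrinsic (NGU) bonus in the combined reward and gamma is the per-member discount factor, with a bandit choosing which member to run each episode. The arm values, reward mixing, and plain UCB1 rule here are illustrative assumptions (Agent57 uses a sliding-window UCB variant), not the exact method from the papers.

```python
import math

def mixed_reward(r_ext, r_int, beta):
    # Combined reward: extrinsic game score plus a beta-weighted
    # intrinsic (NGU-style) exploration bonus. Toy form, for illustration.
    return r_ext + beta * r_int

class UCB1Bandit:
    """Plain UCB1 over ensemble members; each arm is one (beta, gamma) policy."""
    def __init__(self, n_arms, c=1.0):
        self.counts = [0] * n_arms   # times each arm was played
        self.values = [0.0] * n_arms # running mean episode return per arm
        self.c = c                   # exploration coefficient
        self.t = 0                   # total pulls so far

    def select(self):
        self.t += 1
        for i, n in enumerate(self.counts):
            if n == 0:
                return i  # play every arm once before using the UCB rule
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + self.c * math.sqrt(math.log(self.t) / self.counts[i]),
        )

    def update(self, arm, episode_return):
        # Incremental mean update of the chosen arm's value estimate.
        self.counts[arm] += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]

# Hypothetical ensemble: exploratory members pair a large beta with a
# smaller gamma; exploitative members pair beta near 0 with gamma near 1.
arms = [(beta, gamma) for beta in (0.0, 0.3) for gamma in (0.99, 0.997)]
bandit = UCB1Bandit(len(arms))
```

Usage is the obvious loop: each episode, `select()` an arm, act with that member's (beta, gamma), then `update()` the bandit with the undiscounted extrinsic return, so the bandit gradually concentrates on whichever reward/discount mixture is currently paying off.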