Tim Hanson on Nostr:
npub15swlxudlhx4ttcgsd4556zuqrl57qndxmt4n3dnzrkqn89nxv6lsjzx855
Well, the papers I mentioned have two sets of ensembles for intrinsic (never-give-up, NGU) and extrinsic (game score) rewards, where the ensembling is over the discount factor. So, sorta...
Eq. 7 in the linked paper is in effect a fixed-policy bandit, so architecturally the two are similar; only the NGU reward is dynamically learned, and these 4 are "instinctual".
Model-based RL like MuZero and EfficientZero adds state prediction as an independent objective, of course.
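To make the ensemble-plus-bandit structure concrete, here is a minimal sketch of the Agent57/NGU-style setup: an ensemble of policies indexed by (beta, gamma) pairs, where beta weights the intrinsic (NGU) bonus in the combined reward and gamma is the per-member discount factor, with a bandit choosing which member to run each episode. The arm values, reward mixing, and plain UCB1 rule here are illustrative assumptions (Agent57 uses a sliding-window UCB variant), not the exact method from the papers.

```python
import math

def mixed_reward(r_ext, r_int, beta):
    # Combined reward: extrinsic game score plus a beta-weighted
    # intrinsic (NGU-style) exploration bonus. Toy form, for illustration.
    return r_ext + beta * r_int

class UCB1Bandit:
    """Plain UCB1 over ensemble members; each arm is one (beta, gamma) policy."""
    def __init__(self, n_arms, c=1.0):
        self.counts = [0] * n_arms   # times each arm was played
        self.values = [0.0] * n_arms # running mean episode return per arm
        self.c = c                   # exploration coefficient
        self.t = 0                   # total pulls so far

    def select(self):
        self.t += 1
        for i, n in enumerate(self.counts):
            if n == 0:
                return i  # play every arm once before using the UCB rule
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + self.c * math.sqrt(math.log(self.t) / self.counts[i]),
        )

    def update(self, arm, episode_return):
        # Incremental mean update of the chosen arm's value estimate.
        self.counts[arm] += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]

# Hypothetical ensemble: exploratory members pair a large beta with a
# smaller gamma; exploitative members pair beta near 0 with gamma near 1.
arms = [(beta, gamma) for beta in (0.0, 0.3) for gamma in (0.99, 0.997)]
bandit = UCB1Bandit(len(arms))
```

Usage is the obvious loop: each episode, `select()` an arm, act with that member's (beta, gamma), then `update()` the bandit with the undiscounted extrinsic return, so the bandit gradually concentrates on whichever reward/discount mixture is currently paying off.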