Tim Hanson on Nostr: npub15swlx…zx855 Also in this ensembling vein: Agent57 and MEME use 32 ...
npub15swlxudlhx4ttcgsd4556zuqrl57qndxmt4n3dnzrkqn89nxv6lsjzx855 (npub15sw…x855)
Also in this ensembling vein: Agent57 and MEME use 32 approximations of the Q-value & policy for intrinsic and extrinsic reward. A multi-armed bandit controller is then used to select which policy to follow.
MEME is sota on Atari, slightly harder than 4-gaussian-blob world 😉
(neither was cited)
[1] http://arxiv.org/abs/2003.13350
[2] http://arxiv.org/abs/2209.07550
Also in this ensembling vein: Agent57 and MEME use 32 approximations of the Q-value & policy for intrinsic and extrinsic reward. A multi-armed bandit controller is then used to select which policy to follow.
MEME is sota on Atari, slightly harder than 4-gaussian-blob world 😉
(neither was cited)
[1] http://arxiv.org/abs/2003.13350
[2] http://arxiv.org/abs/2209.07550