npub15swlx…zx855 Also in this ensembling vein: Agent57 and MEME use 32 ...

2024-01-26 19:32:13

npub15swlxudlhx4ttcgsd4556zuqrl57qndxmt4n3dnzrkqn89nxv6lsjzx855 (npub15sw…x855)

Also in this ensembling vein: Agent57 and MEME use 32 approximations of the Q-value & policy for intrinsic and extrinsic reward. A multi-armed bandit controller is then used to select which policy to follow.

MEME is sota on Atari, slightly harder than 4-gaussian-blob world 😉
(neither was cited)

[1] http://arxiv.org/abs/2003.13350
[2] http://arxiv.org/abs/2209.07550

Author Public Key

npub10egtpxtvjwdx00htm464c6hgwmz0ngwn3kgz90rv2qeq9qqqpdcs4jr2yr

Show more details

Tim Hanson on Nostr: npub15swlx…zx855 Also in this ensembling vein: Agent57 and MEME use 32 ...