2024-07-06 10:07:12

Each time step it gets a reward based on the height of the pendulum, ranging from zero if it’s at the bottom to two if it’s at the top.
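One simple way to get a reward with this range is to use the height of the pendulum tip above its lowest point. This is a sketch rather than the author's actual formula: the unit pendulum length, the angle convention (measured from hanging straight down), and the name `height_reward` are all assumptions made for illustration.

```python
import math


def height_reward(theta: float) -> float:
    """Reward equal to the tip height above the lowest point.

    Assumptions (not from the original post): the pendulum has unit
    length and theta is the angle in radians measured from hanging
    straight down, so theta = 0 is the bottom and theta = pi is the top.
    """
    return 1.0 - math.cos(theta)


if __name__ == "__main__":
    # Print the reward at the bottom, horizontal, and top positions.
    for label, theta in [("bottom", 0.0), ("side", math.pi / 2), ("top", math.pi)]:
        print(f"{label}: {height_reward(theta):.2f}")
```

Under these assumptions, an average reward of 1.96 would mean the pendulum spends nearly all of its time close to upright.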

By the time it reaches a thousand episodes, it’s performing near optimally, with an average reward of 1.96, which includes spinning up from the bottom.

That represents 1 million time steps of learning at four time steps per second, about three days at 1X speed.
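The arithmetic behind that estimate can be checked in a few lines. The steps-per-episode figure is inferred here by dividing the 1 million time steps by the thousand episodes mentioned above; it is not stated directly in the post, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def real_time_days(total_steps: int, steps_per_second: float) -> float:
    """Convert a number of time steps into days of wall-clock time at 1X speed."""
    return total_steps / steps_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    total_steps = 1_000_000
    episodes = 1_000  # assumption: the 1 million steps span the thousand episodes
    print(f"steps per episode (inferred): {total_steps // episodes}")
    print(f"days at 1X speed: {real_time_days(total_steps, 4):.2f}")
```

One million steps at four per second is 250,000 seconds, or roughly 2.9 days, which matches the "about three days" figure.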

Author Public Key

npub1jh4qsxnz0nhyfefjsfvcdmxxvgfe6p5vf0dvh6pq4r6ytwwxcp4sl9eag0

Brandon Rohrer on Nostr: Each time step it gets a reward based on the height of the pendulum is— ranging ...