Brandon Rohrer on Nostr: Each time step it gets a reward based on the height of the pendulum is— ranging ...
Each time step it gets a reward based on the height of the pendulum, ranging from zero if it’s at the bottom to two if it’s at the top.
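One way a height-based reward like this could be written is sketched below. The post doesn't give the actual formula, so this assumes a unit-length pendulum with the angle measured from straight down; `height_reward` and `theta` are made-up names for illustration.

```python
import math

def height_reward(theta: float) -> float:
    """Hypothetical reward: height of the pendulum tip above its lowest point.

    Assumes a unit-length pendulum and theta in radians,
    with theta = 0 meaning hanging straight down.
    """
    return 1.0 - math.cos(theta)

print(height_reward(0.0))          # pendulum at the bottom
print(height_reward(math.pi / 2))  # pendulum horizontal
print(height_reward(math.pi))      # pendulum at the top
```

With this choice the reward is zero at the bottom, two at the top, and one when the pendulum is horizontal.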
By the time it reaches a thousand episodes, it’s performing near optimally, with an average reward of 1.96, which includes spinning up from the bottom.
That represents 1 million time steps of learning at four time steps per second, about three days at 1X speed.
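A quick check of that arithmetic, using only the numbers quoted above. The steps-per-episode figure isn't stated directly; it's inferred here from 1 million steps over a thousand episodes.

```python
TOTAL_STEPS = 1_000_000   # time steps of learning
STEPS_PER_SECOND = 4      # simulation rate at 1X speed
EPISODES = 1_000

# Inferred, not stated in the post: steps in each episode.
steps_per_episode = TOTAL_STEPS // EPISODES

seconds = TOTAL_STEPS / STEPS_PER_SECOND
days = seconds / 86_400   # 86,400 seconds in a day

print(steps_per_episode)  # 1000
print(seconds)            # 250000.0
print(round(days, 2))     # 2.89
```

So each episode runs a bit over four minutes of simulated time, and the whole run comes to just under three days.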