Reinforced Learning (Continued)
Based on Graident Descent, it has four steps:
- compute for each step the gradient that make chosen action more likely
- after run several episodes and compute action’s score
- evaluate action score
- compute all resulting gradient vectors (using normalized mean) and use GD
Markov Decision Process
It is a twist to Markov chains: at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action.
Bellman optimaity equation
recursive equation says that if the agent acts optimally, then the optimal value of the current state is equal to the reward it will get on average after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to.
V = sum(T * (R + yV(s’))) optimal state value: V T: trasition prob R: reward
state-action values(Q values) that evaluate Q(state, action)
Temporal Difference learning (TD algorithm)
Because initially it does not know T, R, the agent takes a purely random policy to explore; then TD learning algorithm updates the estimates of state values