Reinforcement Learning (Continued)
Policy gradient
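Based on Gradient Descent, it has four steps:
 let the policy play several episodes and, at each step, compute the gradient that would make the chosen action more likely, without applying it yet
 after running several episodes, compute each action's score (its discounted future rewards, normalized across all episodes)
 multiply each gradient vector by the corresponding action's score, so that good actions become more likely and bad actions less likely
 compute the mean of all the resulting gradient vectors and use it to perform a Gradient Descent step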
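The scoring and averaging steps above can be sketched in plain Python. This is only an illustration under assumptions: the function names are hypothetical, rewards are given as one list per episode, and each gradient is assumed to be the gradient of a loss whose minimization makes the chosen action more likely.

```python
def discount_rewards(rewards, discount_rate):
    """Return the discounted sum of future rewards for each step of one episode."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + discount_rate * running
        discounted[i] = running
    return discounted


def discount_and_normalize_rewards(all_rewards, discount_rate):
    """Discount each episode, then normalize the scores across all episodes."""
    all_discounted = [discount_rewards(r, discount_rate) for r in all_rewards]
    flat = [x for episode in all_discounted for x in episode]
    mean = sum(flat) / len(flat)
    std = (sum((x - mean) ** 2 for x in flat) / len(flat)) ** 0.5
    if std == 0.0:
        # All scores are identical: no action is better than any other.
        return [[0.0 for _ in episode] for episode in all_discounted]
    return [[(x - mean) / std for x in episode] for episode in all_discounted]


def mean_weighted_gradient(gradients, scores):
    """Multiply each gradient vector by its action score and average the results."""
    n = len(gradients)
    dim = len(gradients[0])
    return [
        sum(score * grad[j] for grad, score in zip(gradients, scores)) / n
        for j in range(dim)
    ]
```

For example, `discount_rewards([10, 0, -50], 0.8)` gives approximately `[-22, -40, -50]`: the first action is penalized for the large negative reward that follows it. The vector returned by `mean_weighted_gradient` is what would be passed to the optimizer for one Gradient Descent step.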
Markov Decision Process
It is a twist on Markov chains: at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action.
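One possible way to represent such a process in code is a nested dictionary. The two-state MDP below, its state and action names, and the `step` helper are all made up for illustration.

```python
import random

# Hypothetical MDP: transitions[state][action] is a list of
# (probability, next_state, reward) triples.
transitions = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 2.0)],
        "go": [(1.0, "s0", 0.0)],
    },
}


def step(state, action, rng):
    """Sample (next_state, reward) for the chosen action in the given state."""
    outcomes = transitions[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

Calling `step("s0", "go", random.Random())` moves to `s1` with probability 0.8 and stays in `s0` otherwise, which is exactly the sense in which the transition probabilities depend on the chosen action.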
Bellman optimality equation
This recursive equation says that if the agent acts optimally, then the optimal value of the current state is equal to the reward it will get on average after taking one optimal action, plus the expected (discounted) optimal value of all possible next states that this action can lead to.
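V*(s) = max_a Σ_s' T(s, a, s') [R(s, a, s') + γ V*(s')]
V*(s): optimal value of state s; T(s, a, s'): transition probability from s to s' when the agent chooses action a; R(s, a, s'): reward for that transition; γ: discount rate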
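State-action values (Q-values): Q*(s, a) evaluates a (state, action) pair, i.e. the expected discounted return of taking action a in state s and acting optimally afterwards.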
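When T and R are known, the Bellman optimality equation can be turned into an iterative update on the Q-values. The sketch below assumes a small made-up MDP stored as `transitions[state][action] = [(probability, next_state, reward), ...]`; the function names are hypothetical.

```python
def q_value_iteration(transitions, discount_rate, n_iterations):
    """Repeatedly apply the Bellman optimality update to every (state, action) pair."""
    q = {s: {a: 0.0 for a in actions} for s, actions in transitions.items()}
    for _ in range(n_iterations):
        new_q = {}
        for s, actions in transitions.items():
            new_q[s] = {}
            for a, outcomes in actions.items():
                # Expected reward plus discounted best value of the next state.
                new_q[s][a] = sum(
                    p * (r + discount_rate * max(q[s2].values()))
                    for p, s2, r in outcomes
                )
        q = new_q
    return q


def greedy_policy(q):
    """Pick, in each state, the action with the highest Q-value."""
    return {s: max(actions, key=actions.get) for s, actions in q.items()}


# Hypothetical two-state MDP used only for illustration.
example_mdp = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
example_q = q_value_iteration(example_mdp, discount_rate=0.5, n_iterations=100)
```

The optimal state value is then recovered as V*(s) = max_a Q*(s, a), and `greedy_policy(example_q)` gives the corresponding policy.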
Temporal Difference learning (TD algorithm)
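Because the agent initially does not know the transition probabilities T or the rewards R, it uses an exploration policy (for example, a purely random one) to explore the MDP; as it goes, the TD learning algorithm updates the state value estimates based on the transitions and rewards it actually observes.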
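A minimal sketch of this idea, assuming the usual TD update V(s) ← (1 − α) V(s) + α (r + γ V(s')) with learning rate α. The environment is simulated from a made-up transition table that the agent only samples from and never reads directly; all names are hypothetical. With a purely random policy, the estimates track the values of that random policy rather than the optimal ones.

```python
import random


def td_update(values, state, reward, next_state, learning_rate, discount_rate):
    """Move V(state) toward the observed target reward + discount * V(next_state)."""
    target = reward + discount_rate * values[next_state]
    values[state] = (1 - learning_rate) * values[state] + learning_rate * target


def explore_with_td(transitions, start_state, n_steps, learning_rate, discount_rate, seed=0):
    """Follow a purely random policy and update state value estimates along the way."""
    rng = random.Random(seed)
    values = {s: 0.0 for s in transitions}
    state = start_state
    for _ in range(n_steps):
        action = rng.choice(sorted(transitions[state]))  # random exploration
        outcomes = transitions[state][action]
        weights = [p for p, _, _ in outcomes]
        # The environment samples the outcome; the agent only observes it.
        _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
        td_update(values, state, reward, next_state, learning_rate, discount_rate)
        state = next_state
    return values


# Hypothetical two-state environment used only for illustration.
example_env = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
```

A call such as `explore_with_td(example_env, "s0", 1000, 0.05, 0.5)` returns one value estimate per state; a smaller learning rate gives smoother but slower-moving estimates.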
