Hands-On Machine Learning with Scikit-Learn and TensorFlow (VII)

Posted by Kaiyuan Chen on September 25, 2017

Reinforcement Learning (Continued)

Policy Gradients

Based on Gradient Descent, it has four steps:

  1. let the policy play the game several times; at each step, compute the gradients that would make the chosen action even more likely, but do not apply them yet
  2. after running several episodes, compute each action's score as the discounted sum of the rewards that followed it
  3. normalize the scores across episodes and multiply each gradient vector by its action's score: a positive score makes the action more likely, a negative score less likely (see the sketch after this list)
  4. compute the mean of all the resulting gradient vectors and use it to perform a Gradient Descent step
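Steps 2 and 3 amount to scoring each action by the discounted sum of the rewards that followed it, then normalizing the scores across all episodes so that roughly half come out positive and half negative. A minimal NumPy sketch of that scoring step (the function names, toy rewards, and 0.8 discount rate are all illustrative):

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    """Sum each reward with the discounted rewards that follow it."""
    discounted = np.zeros(len(rewards))
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + discount_rate * cumulative
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    """Score actions across episodes: discount, then normalize to mean 0, std 1."""
    all_discounted = [discount_rewards(rewards, discount_rate)
                      for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    return [(discounted - mean) / std for discounted in all_discounted]

# Two toy episodes with per-step rewards:
print(discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8))
```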

Markov Decision Process

It is a twist on Markov chains: at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action; some transitions also yield a reward (positive or negative).
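Concretely, a small discrete MDP can be stored as a transition tensor T[s, a, s'] and a reward tensor R[s, a, s']. A sketch of such a representation (the 3-state layout and all the numbers are made up for illustration):

```python
import numpy as np

nan = np.nan  # marks transitions for actions that are not available in a state

# T[s, a, s'] = probability of moving to state s' when taking action a in state s
T = np.array([
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[nan, nan, nan], [0.8, 0.1, 0.1], [nan, nan, nan]],
])

# R[s, a, s'] = reward received for that transition
R = np.array([
    [[10., 0., 0.], [0., 0., 0.], [0., 0., 0.]],
    [[0., 0., 0.], [0., 0., 0.], [0., 0., -50.]],
    [[0., 0., 0.], [40., 0., 0.], [0., 0., 0.]],
])

possible_actions = [[0, 1, 2], [0, 2], [1]]  # actions available in each state
```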

Bellman optimality equation

This recursive equation says that if the agent acts optimally, then the optimal value of the current state equals the reward it will get on average after taking one optimal action, plus the expected optimal value of all the possible next states that this action can lead to.

$$ V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right] $$

where $V^*(s)$ is the optimal value of state $s$, $T(s, a, s')$ is the transition probability from $s$ to $s'$ when the agent chooses action $a$, $R(s, a, s')$ is the corresponding reward, and $\gamma$ is the discount rate.
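Turning this equation into an iterative update gives the Value Iteration algorithm: start all state values at zero and apply the equation repeatedly until the values converge. A sketch, meant to be used with the T, R, and possible_actions arrays from the MDP sketch above:

```python
import numpy as np

def value_iteration(T, R, possible_actions, discount_rate=0.95, n_iterations=100):
    """Apply the Bellman optimality equation as an iterative update."""
    n_states = T.shape[0]
    V = np.zeros(n_states)  # start with all state values at zero
    for _ in range(n_iterations):
        V_prev = V.copy()
        for s in range(n_states):
            # best achievable expected return over the actions available in s
            V[s] = max(np.sum(T[s, a] * (R[s, a] + discount_rate * V_prev))
                       for a in possible_actions[s])
    return V

# e.g. value_iteration(T, R, possible_actions) with the arrays defined above
```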

State-action values (Q-Values) evaluate Q(state, action): Q(s, a) is the expected total discounted future reward after the agent reaches state s and chooses action a, assuming it acts optimally afterwards. The Bellman optimality equation has an analogous form for Q-Values:

$$ Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right] $$
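The same iteration works for Q-Values (Q-Value Iteration); once it converges, the optimal policy is simply the argmax over actions in each state. A sketch under the same array representation as above:

```python
import numpy as np

def q_value_iteration(T, R, possible_actions, discount_rate=0.95, n_iterations=100):
    """Like Value Iteration, but over (state, action) pairs."""
    n_states, n_actions = T.shape[0], T.shape[1]
    Q = np.full((n_states, n_actions), -np.inf)  # -inf for impossible actions
    for s, actions in enumerate(possible_actions):
        Q[s, actions] = 0.0  # start every possible action's value at zero
    for _ in range(n_iterations):
        Q_prev = Q.copy()
        for s in range(n_states):
            for a in possible_actions[s]:
                Q[s, a] = np.sum(T[s, a] *
                                 (R[s, a] + discount_rate * Q_prev.max(axis=1)))
    return Q

# The optimal policy then follows directly: best_actions = Q.argmax(axis=1)
```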

Temporal Difference learning (TD algorithm)

Because the agent initially knows neither the transition probabilities T nor the rewards R, it must explore, for example with a purely random policy. As it observes transitions and their rewards, the TD Learning algorithm updates the state-value estimates, nudging each estimate toward the just-observed target:

$$ V(s) \leftarrow (1 - \alpha) V(s) + \alpha \left( r + \gamma V(s') \right) $$

where $\alpha$ is the learning rate, $r$ is the observed reward, and $s'$ is the observed next state.
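A minimal sketch of this update on a made-up 2-state environment: the agent never reads the dynamics directly; only the sampled transitions and rewards feed the updates.

```python
import numpy as np

rng = np.random.default_rng(42)

def step(s, a):
    """Hidden environment dynamics (unknown to the agent): (next_state, reward)."""
    if s == 0:
        return (1, 1.0) if a == 0 else (0, 0.0)
    return (0, 2.0) if rng.random() < 0.9 else (1, -1.0)

V = np.zeros(2)               # state-value estimates for the 2 toy states
discount_rate, learning_rate = 0.95, 0.05
s = 0
for _ in range(10_000):
    a = rng.integers(2)       # purely random exploration policy
    s_next, r = step(s, a)
    # TD update: move V(s) a small step toward the observed target r + gamma*V(s')
    V[s] += learning_rate * (r + discount_rate * V[s_next] - V[s])
    s = s_next

print(V)
```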