Hands On Machine Learning with Scikit and Tensorflow(VII)

Posted by Kaiyuan Chen on September 24, 2017

Reinforcement Learning (Chapter 16)

In reinforcement learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards. Its objective is to learn to act in a way that maximizes its expected long-term rewards. Typical applications:


  • controlling a walking robot
  • playing a game
  • trading on stock prices
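
The observe/act/reward loop described above can be sketched in a few lines. This is only a toy illustration: `CorridorEnv`, `run_episode`, and both policies are hypothetical names invented for this note, not something from the book or from any library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: the agent walks along positions 0..4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is simply the position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # small penalty per step, bonus at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total (undiscounted) reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)                  # the agent acts on its observation
        obs, reward, done = env.step(action)  # the environment answers with a reward
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(obs):
    return 1


def random_walk(obs):
    return random.choice([-1, 1])
```

Calling `run_episode(CorridorEnv(), always_right)` reaches the goal in four steps, while `random_walk` usually collects more step penalties on the way. Which policy gets more reward is exactly what the agent is supposed to learn.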

  • Policy: the algorithm the software agent uses to determine its actions.
  • Policy search: exploring the space of policy parameters, for example with a brute-force approach that tries out many different values for the parameters.
  • Genetic algorithm: randomly create a first generation of 100 policies and try them out, then kill the 80 worst policies and let the 20 survivors produce 4 offspring each. An offspring is just a copy of its parent plus some random variation.
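
As a rough sketch of the genetic approach with the numbers above (100 policies, 20 survivors, 4 offspring each), here is one generation step in plain Python. A policy is represented as a plain list of parameters, and `fitness` is a hypothetical stand-in for "average reward over a few episodes": it just rewards being close to an arbitrary `TARGET`. All names here are made up for illustration.

```python
import random

POPULATION = 100  # policies per generation
SURVIVORS = 20    # the 80 worst are discarded
OFFSPRING = 4     # children per survivor: 20 + 20 * 4 = 100

TARGET = [0.5, -0.3, 0.8]  # hypothetical "ideal" parameters, unknown to the search


def random_policy(n_params=3):
    return [random.uniform(-1.0, 1.0) for _ in range(n_params)]


def mutate(policy, scale=0.1):
    # An offspring is a copy of its parent plus some random variation
    return [p + random.gauss(0.0, scale) for p in policy]


def fitness(policy):
    # Stand-in for running the policy in an environment and averaging rewards
    return -sum((p - t) ** 2 for p, t in zip(policy, TARGET))


def next_generation(population):
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:SURVIVORS]
    children = [mutate(parent) for parent in survivors for _ in range(OFFSPRING)]
    return survivors + children


def evolve(generations=30, seed=0):
    random.seed(seed)
    population = [random_policy() for _ in range(POPULATION)]
    for _ in range(generations):
        population = next_generation(population)
    return max(population, key=fitness)
```

Because the survivors are carried over unchanged, the best fitness in this sketch never gets worse from one generation to the next. In a real setting `fitness` would be noisy, since it comes from playing episodes, so that guarantee would only hold on average.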

With a neural network policy, the network takes an observation as input and outputs a probability for each action. We then pick a random action based on those probabilities, rather than just picking the action with the highest score. This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well.
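
The difference between the two ways of choosing an action is small in code. The sketch below assumes the network's output is already available as a list of action probabilities. `greedy_action` and `sample_action` are hypothetical helper names, and `random.choices` from the standard library does the weighted sampling.

```python
import random


def greedy_action(probabilities):
    # Pure exploitation: always take the action with the highest probability
    return max(range(len(probabilities)), key=lambda a: probabilities[a])


def sample_action(probabilities, rng=random):
    # Exploration and exploitation: action a is chosen with probability probabilities[a]
    actions = range(len(probabilities))
    return rng.choices(actions, weights=probabilities, k=1)[0]
```

With probabilities `[0.7, 0.3]`, `greedy_action` returns action 0 every time, while `sample_action` still tries action 1 roughly 30% of the time, so the agent keeps getting feedback on the less favored action.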

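Rewards are usually sparse and delayed, so when the agent gets a reward it is hard to know which actions deserve the credit or the blame. This is the credit assignment problem. A common strategy is to evaluate an action based on the sum of all the rewards that come after it, applying a discount rate r at each step, so that an action ends up counted as good or bad depending on what followed it.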
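
To make the discounting concrete, here is a minimal sketch that scores each step by its own reward plus the discounted sum of the rewards that follow it. The function name `discount_rewards` and the list-based interface are assumptions made for this note.

```python
def discount_rewards(rewards, discount_rate):
    """Return, for each step, its reward plus the discounted future rewards."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the total already computed for the next one
    for step in reversed(range(len(rewards))):
        running = rewards[step] + discount_rate * running
        discounted[step] = running
    return discounted
```

For example, with rewards `[10, 0, -50]` and a discount rate of 0.8, the first action is scored 10 + 0.8 * 0 + 0.8^2 * (-50) = -22, so it counts as a bad action even though its immediate reward was positive.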

Policy gradient