Reinforcement Learning
Table of Contents
- Trajectory = Episode = Rollout
- Value Functions
- Optimal Action is the argmax_a of the Optimal Q-Function
- Bellman Equation
- Advantage Functions
- Policy Optimization vs. Q-Learning
- Policy Optimization and Policy Gradient
- Reduce sample variance
- Vanilla Policy Gradient (VPG)
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed DDPG (TD3)
- Soft Actor Critic (SAC)
Trajectory = Episode = Rollout
A trajectory is a sequence of states and actions. τ ~ π means τ = (s1, a1, s2, a2, …) is a random trajectory sampled by following policy π. Two things are random here: the state transitions and the actions drawn from the policy.
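In the usual notation, the probability of a trajectory under a policy factorizes into the initial-state distribution, the policy, and the environment dynamics:

```latex
P(\tau \mid \pi) = \rho_0(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)
```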
Value Functions
There are four types of value functions, obtained by combining the following two distinctions.
On-Policy vs. Optimal
A specific policy π vs. the optimal policy π* (max over all policies).
Value Function vs. Action-Value Function
Expectation over actions drawn from the policy vs. conditioning on a specific first action a (which need not come from the policy).
Q-Function is Action-Value Function
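For reference, the four combinations in the usual notation:

```latex
V^{\pi}(s)   = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_1 = s \right]
Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_1 = s,\, a_1 = a \right]
V^{*}(s)     = \max_{\pi} V^{\pi}(s)
Q^{*}(s,a)   = \max_{\pi} Q^{\pi}(s,a)
```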
Optimal Action is the argmax_a of the Optimal Q-Function
a*(s) = argmax_a Q*(s, a), since Q* already assumes acting optimally from the next step onward.
Bellman Equation
Expand any of the 4 value functions one step and make it recursive.
Optimal vs. on-policy: max_a vs. E_{a~π}.
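Written out (r(s, a) is the one-step reward, s' the next state, γ the discount):

```latex
V^{\pi}(s)   = \mathbb{E}_{a \sim \pi,\, s'}\big[ r(s,a) + \gamma V^{\pi}(s') \big]
Q^{\pi}(s,a) = \mathbb{E}_{s'}\big[ r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi}\, Q^{\pi}(s',a') \big]
V^{*}(s)     = \max_{a} \mathbb{E}_{s'}\big[ r(s,a) + \gamma V^{*}(s') \big]
Q^{*}(s,a)   = \mathbb{E}_{s'}\big[ r(s,a) + \gamma \max_{a'} Q^{*}(s',a') \big]
```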
Advantage Functions
A^π(s, a) = Q^π(s, a) − V^π(s): how much better taking action a is than acting according to π on average.
Policy Optimization vs. Q-Learning
- Both are model-free RL methods.
- Policy optimization predicts the distribution of the action given the observation.
- Policy optimization's objective/loss function is a [surrogate] for the value function: "surrogate" means its gradient is the same (or similar), but the objective itself is not literally the value function.
- Q-Learning estimates Q-function based on the Bellman equation
- More vs. less stable: satisfying the Bellman equation is only an indirect proxy for maximizing reward, and Q-learning has known failure modes.
- Less vs. more sample efficient (on-policy vs. off-policy): policy optimization can only use data collected by the latest policy.
Policy Optimization and Policy Gradient
The NN directly learns the policy itself: given a state/observation as input, it outputs which action to take (or the probability of each action).
Instead of computing a loss as the "expected total reward of the policy", J = E[R(τ)], taking its gradient, and doing gradient ascent (impossible directly, because we cannot differentiate through the environment terms p(s1), p(s2|s1, a1), …; in model-based RL these would be learned), we derive an expression for the gradient of J and drop every term whose gradient is 0. The resulting expression is called the grad-log-prob form.
You can think of the grad-log-prob expression before the gradient is taken, namely log-prob × reward, as a stand-in for J, because the two have the same gradient.
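The derivation sketched above, in the standard form (the environment terms drop out because they do not depend on θ):

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \big]
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big]
```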
Now the NN takes an observation and predicts the logits of π(a|s), i.e. the parameters of the action distribution in the current state. You then sample from this distribution to get an action. The so-called loss is log-prob × reward.
However, log-prob × reward is not a real loss function: its value going up or down says nothing about whether the policy is good. The only useful thing about it is its gradient at the current step.
In summary, the NN predicts the action distribution, and the loss is designed so that its gradient equals dJ/dθ (see the sketch below).
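A minimal PyTorch-style sketch of this idea (policy_net and the batch names are hypothetical placeholders; the point is only that the "loss" is built so its gradient matches the policy-gradient estimate):

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(policy_net, obs, actions, returns):
    """'Loss' whose gradient equals the grad-log-prob policy gradient estimate.

    obs:     (batch, obs_dim) observations
    actions: (batch,) actions that were actually taken
    returns: (batch,) total reward (or reward-to-go / advantage) for each sample
    """
    logits = policy_net(obs)            # parameters of the action distribution
    dist = Categorical(logits=logits)   # discrete-action case for illustration
    log_prob = dist.log_prob(actions)   # log pi(a | s)
    # Negative sign: optimizers minimize, but we want gradient *ascent* on J.
    # The value of this "loss" is meaningless; only its gradient matters.
    return -(log_prob * returns).mean()
```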
Reduce sample variance
Reward-to-go:
- Causality: policy cannot affect the previous reward
- The reason we can drop the sum of past rewards (t' = 0 to t−1) is that its expected contribution to the gradient is 0. In the CS285 lecture he mentions the proof is somewhat involved; it relies on the EGLP lemma.
Baselines
To subtract a baseline, b must not depend on the action; b can be a constant or a function of the state. The proof for the b = const case is simple; in lecture 6 he mentions that the b = f(s) case can also be proved. The resulting gradient estimate is written out below.
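With both tricks applied, the gradient estimate becomes (b(s_t) is the baseline, typically the value function):

```latex
\nabla_\theta J(\theta) \approx
  \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
  \Big( \sum_{t' = t}^{T} r(s_{t'}, a_{t'}) - b(s_t) \Big) \Big]
```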
Vanilla Policy Gradient (VPG)
- Use NNs to estimate both the policy and the value function.
- Update the policy NN with the gradient described above (reward-to-go, with the value function as baseline).
- Update the value-function NN with an L2 loss between the predicted value and the observed return (see the sketch below).
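A minimal sketch of one VPG update along these lines (policy_net, value_net, the optimizers, and the batch contents are hypothetical placeholders, not a full implementation):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def vpg_update(policy_net, value_net, pi_opt, v_opt, obs, actions, rewards_to_go):
    """One VPG step: policy gradient with a value baseline, then value regression."""
    # Advantage estimate = reward-to-go minus the value baseline (no grad through baseline).
    with torch.no_grad():
        adv = rewards_to_go - value_net(obs).squeeze(-1)

    # Policy update: same grad-log-prob surrogate as above, weighted by the advantage.
    dist = Categorical(logits=policy_net(obs))
    pi_loss = -(dist.log_prob(actions) * adv).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

    # Value update: L2 regression of the predicted value onto the observed reward-to-go.
    v_loss = F.mse_loss(value_net(obs).squeeze(-1), rewards_to_go)
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()
```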
Trust Region Policy Optimization (TRPO)
- Why do we want the policies before and after an update to stay close?
- CS285 lecture: TRPO and PPO derive a new "loss function". For this loss to match the one mentioned before, one condition is needed: if πθ' is close to πθ, then the state distribution pθ'(sₜ) is close to pθ(sₜ).
- However, a small gradient update to the policy NN does NOT necessarily mean the resulting change in the policy is small (the change is bounded, but for some gradient directions it can still be large). This means one bad policy-gradient step can collapse the policy.
- Maximize a new surrogate advantage, subject to the constraint that the KL divergence between the policies before and after the update stays below a threshold (written out after this list).
- the new surrogate advantage has the same gradient as the one in VPG.
- Solve the Lagrangian duality.
- The gradient update will involve the Hessian of KL.
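The constrained problem solved at each TRPO step, in its usual form (π_θ is the old policy, π_θ' the new one):

```latex
\max_{\theta'} \ \mathbb{E}_{s, a \sim \pi_\theta}
  \left[ \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}\, A^{\pi_\theta}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_\theta}\big[ D_{\mathrm{KL}}\!\left( \pi_{\theta'}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \big] \le \delta
```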
Proximal Policy Optimization (PPO)
PPO-Penalty: similar to TRPO, but the KL term appears as a penalty inside the loss function itself rather than as a constraint.
PPO-Clip
- same surrogate advantage function as TRPO, which contains a ratio between new and old policy.
- The ratio is clipped inside the loss so that the update cannot push the new policy far from the old one.
- A proxy loss with the same gradient is used in practice, so that PyTorch's autograd can be applied directly (see the sketch below).
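A minimal sketch of the PPO-Clip objective as it is usually coded (the names are placeholders; log_prob_old comes from the policy that collected the data):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """PPO-Clip surrogate loss (to be minimized)."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the minimum so the objective never profits from moving the ratio too far.
    return -torch.min(unclipped, clipped).mean()
```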
Deep Deterministic Policy Gradient (DDPG)
- DDPG is an off-policy algorithm that learns a deterministic policy and Q* at the same time.
- VPG, TRPO, and PPO are all on-policy algorithms that learn a policy and the value function of that policy.
- DDPG is Q-learning adapted to continuous action spaces.
- Getting the optimal policy from Q* is trivial for a finite/discretized action space (so Q-learning alone is enough). For a continuous action space it is not easy, which is why DDPG learns the optimal policy instead of computing it from Q*.
- To learn this optimal policy, the loss is simply to maximize Q* (while the parameters of the Q* NN are held fixed), as written out below.
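Concretely, with a deterministic policy μ_θ and the learned Q-network Q_φ held fixed, the policy is trained to solve:

```latex
\max_{\theta} \ \mathbb{E}_{s \sim \mathcal{D}} \big[ Q_{\phi}\big(s, \mu_{\theta}(s)\big) \big]
```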
Tricks:
Experience Replay
- For the on-policy algorithms above (VPG, TRPO, PPO), the loss and its gradient involve an expectation over samples from the current policy. This makes old experience unusable.
- For Q-learning, the loss is the mean-squared Bellman error (MSBE), and the Bellman equation should hold for any transition regardless of which policy collected it, so old experience can be replayed.
Target Networks
- There is a "max Q*" term inside the Bellman equation.
- As mentioned above, to estimate the max part in a continuous action space, you need to use the estimated optimal policy.
- If both Q* terms in the Bellman equation come from the same NN estimate, training is unstable. Therefore they use one target network for Q* and one for the optimal policy; these two are used to estimate the "max Q*" part of the target. A target network is just a slowly moving (Polyak) running average of the corresponding main network.
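A minimal PyTorch-style sketch of the DDPG losses described above, assuming hypothetical networks q_net, policy_net and their target copies q_targ, policy_targ, plus a replay batch (a sketch under these assumptions, not any particular library's implementation):

```python
import torch
import torch.nn.functional as F

def ddpg_losses(batch, q_net, policy_net, q_targ, policy_targ, gamma=0.99):
    """Compute the MSBE loss for Q and the policy loss for one replay batch.

    batch: dict of tensors s, a, r, s2, done sampled from the replay buffer.
    q_targ / policy_targ: target networks (Polyak averages of the main nets).
    """
    s, a, r, s2, done = batch["s"], batch["a"], batch["r"], batch["s2"], batch["done"]

    # MSBE loss: the "max Q*" part of the Bellman target uses the target networks.
    with torch.no_grad():
        a2 = policy_targ(s2)                          # deterministic target action
        y = r + gamma * (1 - done) * q_targ(s2, a2)   # Bellman target
    q_loss = F.mse_loss(q_net(s, a), y)

    # Policy loss: maximize Q(s, mu(s)), i.e. minimize its negative.
    # In practice only the policy optimizer steps on this loss, so the Q parameters stay fixed.
    pi_loss = -q_net(s, policy_net(s)).mean()

    return q_loss, pi_loss
```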
Twin Delayed DDPG (TD3)
Three tricks
- Learn two Q-functions and use the smaller of the two Q-values in the target (because Q-values tend to be over-estimated [CS285 lecture 8]).
- Update the policy less frequently than the Q-functions.
- Add noise to the target action ("target policy smoothing") so that Q cannot exploit sharp, over-estimated peaks; the resulting target is written out below.
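Put together, the standard TD3 target used for both Q-functions (ε is the smoothing noise, c the noise clip):

```latex
a'(s') = \mathrm{clip}\big( \mu_{\theta_{\mathrm{targ}}}(s') + \mathrm{clip}(\epsilon, -c, c),\; a_{low},\; a_{high} \big), \quad \epsilon \sim \mathcal{N}(0, \sigma)
y = r + \gamma (1 - d)\, \min_{i=1,2} Q_{\phi_{\mathrm{targ},\, i}}\big( s', a'(s') \big)
```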
Soft Actor Critic (SAC)
- similar to TD3
- learn a policy and two Q-functions.
- Add an entropy regularization term for the policy into the Q-function targets; this encourages the policy to stay more random. The target is written out below.
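The entropy-regularized target used for both Q-functions (α is the entropy temperature; a' is freshly sampled from the current policy, not taken from the replay buffer):

```latex
y = r + \gamma (1 - d) \Big( \min_{i=1,2} Q_{\phi_{\mathrm{targ},\, i}}(s', a') - \alpha \log \pi_\theta(a' \mid s') \Big), \quad a' \sim \pi_\theta(\cdot \mid s')
```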