Notes on CS285: Deep Reinforcement Learning
Table of Contents
- The loss function for imitation learning (behavior cloning and DAgger)
- Distribution
- Behaviour Cloning (BC) vs. DAgger
- Discount factor γ
- Causality: the policy cannot affect previous rewards
- Baselines in advantage
- Multithread and Multiprocess
- Fitted Q/value iteration and Actor-Critic are not guaranteed to converge
- Make sample_trajectories multithread / multiprocess
The loss function for imitation learning (behavior cloning and DAgger)
MSELoss()(dist.rsample(), expert_action)
- I saw this solution here. I feel this might be the intended solution because, in the provided code, there is already a self.loss = MSELoss().
-dist.log_prob(expert_action).sum(-1).mean()
- However, I think this is a better loss because it avoids the unnecessary randomness introduced by dist.rsample(). As evidence, I noticed this loss yields a better EvalAverageReturn with fewer training steps and a smaller training batch size.
- sum(-1) multiplies the 8 independent Gaussians into one joint Gaussian (their log-probs add); mean() instead of sum() keeps each batch's gradient contribution the same no matter what the batch size is, whereas sum() makes each sample's contribution the same, so a larger batch size would contribute more. A side-by-side sketch of the two losses is below.
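A minimal sketch of the two losses side by side. The shapes (batch of 32, 8-dimensional action) and the bare Normal policy head are assumptions for illustration; the homework wraps this in a policy class.

```python
import torch
from torch import nn
from torch.distributions import Normal

# Stand-in shapes: a batch of 32 observations, an 8-dimensional action space.
mean = torch.zeros(32, 8, requires_grad=True)   # would come from the policy network
std = torch.ones(32, 8)
expert_action = torch.randn(32, 8)

dist = Normal(mean, std)

# Loss 1: MSE between a reparameterized sample and the expert action.
mse_loss = nn.MSELoss()(dist.rsample(), expert_action)

# Loss 2: negative log-likelihood of the expert action (no sampling needed).
# .sum(-1) combines the 8 independent Gaussian dims into one joint log-prob;
# .mean() averages over the batch so the gradient scale is batch-size independent.
nll_loss = -dist.log_prob(expert_action).sum(-1).mean()
```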
Distribution
- Discrete: prob ∈ [0, 1], logprob ∈ (-inf, 0]
- Continuous: prob (== likelihood, a density) ∈ [0, +inf), logprob ∈ (-inf, +inf)
Normal as independent Gaussians
- To get the joint probability over action dimensions, multiply all the probs or, equivalently, sum all the logprobs (see the check below).
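A small check of the "sum the logprobs" point with torch.distributions (the 8-dimensional action is an assumption carried over from the homework):

```python
import torch
from torch.distributions import Normal, Independent

mean, std = torch.zeros(8), torch.ones(8)
action = torch.randn(8)

per_dim = Normal(mean, std)        # 8 independent 1-D Gaussians
joint = Independent(per_dim, 1)    # reinterpret the last dim as one joint event

# Summing the per-dimension log-probs equals the joint log-prob.
assert torch.allclose(per_dim.log_prob(action).sum(-1), joint.log_prob(action))
```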
Behaviour Cloning (BC) vs. DAgger
Keeping the total training steps and the model the same:
- For easy environments like Ant, DAgger is only slightly better than BC (total training steps kept the same for BC and DAgger).
- For hard environments like Humanoid, DAgger is only about 2x better than BC when the total training steps are kept the same. After I increased the training iterations another 5x (and the amount of expert-labeled data 5x as well), DAgger became more than 10x better than BC.
- I think this is partly because DAgger sees a much wider variety of data than BC (see the sketch after this list).
Also keeping the total amount of data the same:
- For easy environments like Ant, DAgger is again only slightly better than BC.
- For hard environments like Humanoid, it is too hard to train a good model with just 2000 data samples (the amount of expert data I have for BC).
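For reference, a minimal DAgger loop sketch. The interfaces here (policy.train, get_action, a gym-style env) are assumptions for illustration, not the homework's exact API:

```python
def dagger(env, policy, expert_policy, expert_data, n_iters, steps_per_iter):
    """Hedged DAgger sketch: the learner visits states under its *own* policy,
    and the expert relabels those states with the actions it would have taken."""
    dataset = list(expert_data)                 # start from the BC demonstrations
    for _ in range(n_iters):
        policy.train(dataset)                   # supervised fit on the aggregated data
        obs = env.reset()
        for _ in range(steps_per_iter):
            act = policy.get_action(obs)        # roll out the learner, not the expert
            dataset.append((obs, expert_policy.get_action(obs)))  # expert relabels
            obs, _, done, _ = env.step(act)
            if done:
                obs = env.reset()
    return policy
```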
Discount factor γ
The effect is that it modifies the MDP by adding a death (absorbing) state: at every step there is probability 1 - γ of transitioning into the death state, where all rewards are 0.
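A short derivation of why this interpretation matches the discounted return (my reconstruction, with rewards in the death state defined as 0):

```latex
% In the modified MDP the episode "survives" each step with probability \gamma,
% so the probability of still being alive at step t is \gamma^{t}. Hence
\mathbb{E}\Big[\sum_{t \ge 0} r_t \,\Big|\, \text{modified MDP}\Big]
  = \sum_{t \ge 0} \Pr(\text{alive at } t)\,\mathbb{E}[r_t]
  = \sum_{t \ge 0} \gamma^{t}\,\mathbb{E}[r_t],
% which is exactly the expected discounted return of the original MDP.
```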
Causality: the policy cannot affect previous rewards
The reason we can drop the past rewards $\sum_{t'=0}^{t-1} r(s_{t'}, a_{t'})$ from the term multiplying $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is that the expectation of that product is 0. In the lecture, he mentioned the proof is somewhat involved. However, I think there must be some additional requirements on the reward, e.g. that it is centered.
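A sketch of the argument as I reconstruct it (not the lecture's proof): the key fact is that the score function has zero mean under the policy, so no extra condition such as a centered reward seems to be required.

```latex
% Condition on the trajectory prefix (s_0, a_0, \dots, s_t): the past rewards are then
% fixed constants, and only a_t \sim \pi_\theta(\cdot \mid s_t) remains random.
\mathbb{E}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\sum_{t'=0}^{t-1} r(s_{t'}, a_{t'})\Big]
 = \mathbb{E}\Big[\Big(\sum_{t'=0}^{t-1} r(s_{t'}, a_{t'})\Big)\,
     \underbrace{\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]}_{=\,0}\Big] = 0
% The inner expectation is zero because
% \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)]
%   = \int \pi_\theta \,\nabla_\theta \log \pi_\theta \, da
%   = \nabla_\theta \int \pi_\theta \, da = \nabla_\theta 1 = 0.
```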
Baselines in advantage
To subtract a baseline without biasing the gradient, b must not depend on the action. b can be a constant or a function of the state. The proof for the b = const case is simple; in lecture 6, he mentioned it can also be proved for b = f(s), as sketched below.
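The b = f(s) case reduces to the same zero-mean score-function fact (my sketch, not the lecture's derivation):

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
 = b(s)\,\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\big[\nabla_\theta \log \pi_\theta(a \mid s)\big]
 = b(s)\cdot 0 = 0
% so subtracting b(s) leaves the policy gradient unbiased (it only changes the
% variance); the argument breaks if b is allowed to depend on the action a.
```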
Multithread and Multiprocess
The hw2 bonus question is to make sample_trajectories multithreaded. However, after multithreading, it is much slower (the more threads, the slower). My guesses: 1) something under the hood does not release the GIL, for example gym; 2) sampling is already fast on my machine (< 1 ms), so the threading overhead dominates. Multiprocessing does not work out of the box because it uses fork, and CUDA does not work across fork (see the sketch below). P.S. batch size = 50k.
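A hedged sketch of the usual workaround for the fork/CUDA crash, using the 'spawn' start method of torch.multiprocessing. The worker body and names here are placeholders, not the homework code:

```python
import torch.multiprocessing as mp


def rollout_worker(worker_id, result_queue):
    # Placeholder worker: create the env/policy *inside* the child process so that
    # any CUDA context is initialized after the process starts, not inherited via fork.
    result_queue.put(f"worker {worker_id}: would collect trajectories here")


if __name__ == "__main__":
    # 'spawn' launches fresh interpreters instead of fork(); CUDA cannot be reused in
    # a child created by fork() from a parent that has already initialized CUDA.
    mp.set_start_method("spawn")
    queue = mp.Queue()
    workers = [mp.Process(target=rollout_worker, args=(i, queue)) for i in range(4)]
    for w in workers:
        w.start()
    results = [queue.get() for _ in workers]
    for w in workers:
        w.join()
    print(results)
```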
Fitted Q/value iteration and Actor-Critic are not guaranteed to converge
- Q/value iteration (the Bellman backup part) converges on its own.
- The deep learning regression that fits the Q/value function also converges on its own.
- However, the combination of the two is no longer guaranteed to converge (see the norm argument below).
- For the same reason, the critic in Actor-Critic is not guaranteed to converge either.
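In symbols, the lecture's argument as I remember it: the Bellman backup and the regression/projection step are each contractions, but in different norms, so their composition need not contract.

```latex
% Bellman backup: a \gamma-contraction in the infinity norm
\|\mathcal{B}V - \mathcal{B}\bar V\|_{\infty} \le \gamma\,\|V - \bar V\|_{\infty}
% Projection onto the function class (the least-squares fit): a contraction in the L2 norm
\|\Pi V - \Pi \bar V\|_{2} \le \|V - \bar V\|_{2}
% But \Pi\mathcal{B} is a contraction in neither norm, so fitted value iteration
% V \leftarrow \Pi\mathcal{B}V (and likewise the fitted critic) may diverge.
```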
Make sample_trajectories multithread / multiprocess
- Before multithreading / multiprocessing, only one CPU is used at 100%, even without setting OPENBLAS_NUM_THREADS=1.
- Multithreading works; however, all CPUs sit at ~10% utilization, and the total time is 3x longer than single-threaded.
- Multiprocessing crashes (because of the way torch initializes CUDA under fork; some extra work is needed to make it work); however, all CPUs go to 100% before the crash.
- I think something (the env or torch) does not release the GIL.