This article collects notes on Reinforcement Learning issues I have encountered. Continuously updated.
Learning
I suggest learning in the following order:
- Python (or other language)
- PyTorch (or other Deep Learning package)
- Deep Supervised Learning (e.g. MNIST handwritten digit recognition)
- Tabular Reinforcement Learning (e.g. Q-Learning, TD($\lambda$))
- Deep Q-Learning (e.g. DQN, D3QN)
- Policy Gradient methods (e.g. DDPG, SAC, PPO, TD3)
- Advanced works (e.g. AlphaGo, MuZero, Agent57)
Python
菜鸟教程 (Chinese)
Multi-Armed Bandit
The Multi-Armed Bandit Problem and Its Solutions, Lilian Weng
Deep Reinforcement Learning
A (Long) Peek into Reinforcement Learning, Lilian Weng
Policy Gradient Algorithms, Lilian Weng
Advanced works
MuZero, 知乎 (Chinese)
Existing Kits
RLlib
https://docs.ray.io/en/master/rllib.html
- A popular, all-round open-source library
- Takes time to learn
ElegantRL
https://github.com/AI4Finance-Foundation/ElegantRL
- A lightweight and scalable open-source library
- Easy to learn
- Still under active development
Environment
Generally an Environment class derives from gym.Env and follows its interface definitions.
Generally the computational cost of the Environment is concentrated on the CPU.
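A minimal sketch of a custom Environment deriving from gym.Env (the CountdownEnv name and its dynamics are invented for illustration; newer gymnasium versions use slightly different reset/step signatures):

```python
import gym
from gym import spaces
import numpy as np

class CountdownEnv(gym.Env):
    """Toy environment: drive a counter down to zero (illustrative only)."""

    def __init__(self, start=10):
        super().__init__()
        self.start = start
        self.action_space = spaces.Discrete(2)  # 0: wait, 1: decrement
        self.observation_space = spaces.Box(0.0, float(start), shape=(1,), dtype=np.float32)

    def reset(self):
        self.count = self.start
        return np.array([self.count], dtype=np.float32)

    def step(self, action):
        if action == 1:
            self.count -= 1
        done = self.count == 0
        reward = 1.0 if done else -0.01  # small per-step penalty
        return np.array([self.count], dtype=np.float32), reward, done, {}
```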
State
Representation
Many popular CNN components do not fit Reinforcement Learning well, including
- Batch Norm
- Shift-invariance operations, e.g. pooling
and an RL network sometimes needs
- Orthogonal initialization at the output layer of deep networks (see the sketch below)
- A lower Learning Rate, or frozen parameters, as the number of parameters grows
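As a sketch of the orthogonal-initialization point above, here is one common PyTorch pattern (the network shape and gain values are illustrative assumptions, not fixed rules):

```python
import torch.nn as nn

def build_policy_net(obs_dim, act_dim):
    net = nn.Sequential(
        nn.Linear(obs_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, act_dim),
    )
    # orthogonal weights, zero biases for every linear layer
    for layer in net:
        if isinstance(layer, nn.Linear):
            nn.init.orthogonal_(layer.weight, gain=2 ** 0.5)
            nn.init.zeros_(layer.bias)
    # a much smaller gain on the output layer keeps initial action logits near zero
    nn.init.orthogonal_(net[-1].weight, gain=0.01)
    return net
```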
Reward
Design
There is no standard recipe for reward design. In principle, learning is faster with a denser Reward, and the final converged performance depends on how well the Reward guides the agent toward the intended behavior.
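As a hypothetical illustration of dense versus sparse signals, compare a sparse goal-reaching reward with a shaped one (the distance-based shaping term and coefficients are assumptions, not a general recipe):

```python
import numpy as np

def sparse_reward(pos, goal, eps=0.1):
    # signal only when the goal is reached; learning signal is rare
    return 1.0 if np.linalg.norm(pos - goal) < eps else 0.0

def shaped_reward(pos, prev_pos, goal, eps=0.1):
    # denser signal: reward progress toward the goal at every step
    progress = np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal)
    bonus = 1.0 if np.linalg.norm(pos - goal) < eps else 0.0
    return 0.1 * progress + bonus
```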
Replay Buffer
The Replay Buffer should be as large as possible, so that its samples adequately describe the Environment and the current Policy.
When the Replay Buffer cannot be made large enough, try lowering the Learning Rate.
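A minimal uniform Replay Buffer sketch (the capacity and transition layout are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```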
Prioritized Experience Replay (PER)
Prioritized Experience Replay shifts the data distribution seen by the Deep Network, which hurts performance noticeably when Rewards are dense. Generally PER is used only in the early stage of training.
As an improvement, try Self-Imitation Learning.
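A simplified sketch of proportional prioritized sampling with importance-sampling weights, which partly corrects the distribution shift mentioned above (real implementations usually use a sum-tree; the hyperparameters here are illustrative):

```python
import numpy as np

class SimplePER:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def push(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, beta=0.4):
        p = np.array(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        # importance-sampling weights counteract the induced distribution shift
        weights = (len(self.data) * p[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + eps
```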
Self-Play
How do you evaluate a policy when no baseline is available?
One idea: maintain a baseline policy set composed of historical policies.
- Record historical policies during training.
- Grade policies by 'domination': a higher-level policy beats lower-level ones.
- Draw policies from each level to form the baseline policy set.
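A rough sketch of such a leveled baseline pool, where 'domination' is judged by head-to-head win rate (the evaluate_vs callback and the 0.55 threshold are assumptions):

```python
import random

class BaselinePool:
    """Historical policies grouped into levels; a policy that dominates
    every existing level starts a new, higher level."""

    def __init__(self, win_threshold=0.55):
        self.levels = []  # levels[0] is the weakest
        self.win_threshold = win_threshold

    def add(self, policy, evaluate_vs):
        # evaluate_vs(a, b) -> win rate of policy a against policy b (user-provided)
        for level in self.levels:
            if evaluate_vs(policy, random.choice(level)) < self.win_threshold:
                level.append(policy)  # does not dominate this level: join it
                return
        self.levels.append([policy])  # dominates all levels: open a new one on top

    def sample_baselines(self):
        # draw one policy from each level to form the baseline policy set
        return [random.choice(level) for level in self.levels]
```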
Testing Methods
Reinforcement Learning methods are more complicated than supervised ones. When bugs arise, it is often hard to know where to start. Here is some advice:
- Go from easy to hard. For example, build D3QN from DQN, build a hard Env from a basic Env, test RL methods on simple environments such as 'gridworld' or 'cartpole', or modify RL methods based on a mature codebase.
- Write tests for each module in the method, such as the Replay Buffer (see the sketch after this list).
- Learn more about RL, PyTorch/Tensorflow, Python, etc.
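For instance, a tiny test for the ReplayBuffer sketch above (plain asserts; a pytest-style test would work the same way):

```python
def test_replay_buffer_capacity_and_sampling():
    buf = ReplayBuffer(capacity=3)            # ReplayBuffer from the sketch above
    for t in range(5):
        buf.push((t, 0, 0.0, t + 1, False))   # dummy transitions
    assert len(buf) == 3                      # oldest transitions were evicted
    batch = buf.sample(2)
    assert len(batch) == 2
    assert all(tr in buf.buffer for tr in batch)

test_replay_buffer_capacity_and_sampling()
```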
Tools
- PyCharm profiler
- memory_profiler, line_profiler
- torch.autograd: profiler, gradcheck, anomaly detection (see the sketch after this list)
- Breakpoints
- print()
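A quick sketch of the torch.autograd tools listed above (the toy tensors are arbitrary):

```python
import torch

# anomaly detection: locates the operation that produced NaN/Inf during backward
torch.autograd.set_detect_anomaly(True)

x = torch.randn(8, 4, dtype=torch.double, requires_grad=True)
w = torch.randn(4, 2, dtype=torch.double, requires_grad=True)

# gradcheck: compares analytic gradients with finite differences (use double precision)
assert torch.autograd.gradcheck(lambda a, b: (a @ b).sum(), (x, w))

# profiler: where do forward and backward spend their time?
with torch.profiler.profile() as prof:
    loss = (x @ w).pow(2).mean()
    loss.backward()
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```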