This article collects notes on Reinforcement Learning issues I have encountered. Continuously updated.
Learning
I suggest learning in the following order:
- Python (or other language)
- PyTorch (or other Deep Learning package)
- Deep Supervised Learning (e.g. MNIST handwritten digit recognition)
- Tabular Reinforcement Learning (e.g. Q-Learning, TD($\lambda$))
- Deep Q-Learning (e.g. DQN, D3QN)
- Policy Gradient methods (e.g. DDPG, SAC, PPO, TD3)
- Advanced works (e.g. AlphaGo, MuZero, Agent57)
Python
菜鸟教程 (Chinese)
Multi-Armed Bandit
The Multi-Armed Bandit Problem and Its Solutions, Lilian Weng
Deep Reinforcement Learning
A (Long) Peek into Reinforcement Learning, Lilian Weng
Policy Gradient Algorithms, Lilian Weng
Advanced works
MuZero, 知乎 (Chinese)
Existing Kits
RLlib
https://docs.ray.io/en/master/rllib.html
- A popular, all-round open-source library
- Takes time to learn
ElegantRL
https://github.com/AI4Finance-Foundation/ElegantRL
- A lightweight and scalable open-source library
- Easy to learn
- Still under active development
Environment
Generally an Environment class derives from gym.Env and follows its interface definitions.
Generally the computational cost of the Environment is concentrated on the CPU.
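A minimal sketch of a custom Environment deriving from gym.Env (the CountdownEnv name and its dynamics are invented for illustration; newer gymnasium versions use slightly different reset/step signatures):

```python
import gym
from gym import spaces
import numpy as np

class CountdownEnv(gym.Env):
    """Toy environment: drive a counter down to zero (illustrative only)."""

    def __init__(self, start=10):
        super().__init__()
        self.start = start
        self.action_space = spaces.Discrete(2)  # 0: wait, 1: decrement
        self.observation_space = spaces.Box(0.0, float(start), shape=(1,), dtype=np.float32)

    def reset(self):
        self.count = self.start
        return np.array([self.count], dtype=np.float32)

    def step(self, action):
        if action == 1:
            self.count -= 1
        done = self.count == 0
        reward = 1.0 if done else -0.01  # small per-step penalty
        return np.array([self.count], dtype=np.float32), reward, done, {}
```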
State
Representation
Many popular CNN components do not fit Reinforcement Learning well, including
- Batch Norm
- Shift-invariance operations, e.g. pooling
and an RL network sometimes needs
- Orthogonal initialization at the output layer of deep networks (see the sketch below)
- A lower Learning Rate, or frozen parameters, as the number of parameters grows
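As a sketch of the orthogonal-initialization point above, here is one common PyTorch pattern (the network shape and gain values are illustrative assumptions, not fixed rules):

```python
import torch.nn as nn

def build_policy_net(obs_dim, act_dim):
    net = nn.Sequential(
        nn.Linear(obs_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, act_dim),
    )
    # orthogonal weights, zero biases for every linear layer
    for layer in net:
        if isinstance(layer, nn.Linear):
            nn.init.orthogonal_(layer.weight, gain=2 ** 0.5)
            nn.init.zeros_(layer.bias)
    # a much smaller gain on the output layer keeps initial action logits near zero
    nn.init.orthogonal_(net[-1].weight, gain=0.01)
    return net
```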
Reward
Design
There is no standard recipe for reward design. In principle, learning is faster with a denser Reward, and the final converged performance depends on how well the Reward guides the agent toward the intended behavior.
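As a hypothetical illustration of dense versus sparse signals, compare a sparse goal-reaching reward with a shaped one (the distance-based shaping term and coefficients are assumptions, not a general recipe):

```python
import numpy as np

def sparse_reward(pos, goal, eps=0.1):
    # signal only when the goal is reached; learning signal is rare
    return 1.0 if np.linalg.norm(pos - goal) < eps else 0.0

def shaped_reward(pos, prev_pos, goal, eps=0.1):
    # denser signal: reward progress toward the goal at every step
    progress = np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal)
    bonus = 1.0 if np.linalg.norm(pos - goal) < eps else 0.0
    return 0.1 * progress + bonus
```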
Replay Buffer
The Replay Buffer should be as large as possible, so that its samples adequately describe the Environment and the current Policy.
When the Replay Buffer cannot be made large enough, try lowering the Learning Rate.
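A minimal uniform Replay Buffer sketch (the capacity and transition layout are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```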
Prioritized Experience Replay (PER)
Prioritized Experience Replay shifts the data distribution seen by the Deep Network, which hurts performance noticeably when Rewards are dense. Generally PER is used only in the early stage of training.
As an improvement, try Self-Imitation Learning.
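A simplified sketch of proportional prioritized sampling with importance-sampling weights, which partly corrects the distribution shift mentioned above (real implementations usually use a sum-tree; the hyperparameters here are illustrative):

```python
import numpy as np

class SimplePER:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def push(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, beta=0.4):
        p = np.array(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        # importance-sampling weights counteract the induced distribution shift
        weights = (len(self.data) * p[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + eps
```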
Self-Play
How do you evaluate a policy when no baseline is available?
One idea: maintain a baseline policy set composed of historical policies.
- Record historical policies during training.
- Grade policies by 'domination': a higher-level policy beats lower-level ones.
- Draw policies from each level to form the baseline policy set.
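A rough sketch of such a leveled baseline pool, where 'domination' is judged by head-to-head win rate (the evaluate_vs callback and the 0.55 threshold are assumptions):

```python
import random

class BaselinePool:
    """Historical policies grouped into levels; a policy that dominates
    every existing level starts a new, higher level."""

    def __init__(self, win_threshold=0.55):
        self.levels = []  # levels[0] is the weakest
        self.win_threshold = win_threshold

    def add(self, policy, evaluate_vs):
        # evaluate_vs(a, b) -> win rate of policy a against policy b (user-provided)
        for level in self.levels:
            if evaluate_vs(policy, random.choice(level)) < self.win_threshold:
                level.append(policy)  # does not dominate this level: join it
                return
        self.levels.append([policy])  # dominates all levels: open a new one on top

    def sample_baselines(self):
        # draw one policy from each level to form the baseline policy set
        return [random.choice(level) for level in self.levels]
```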
Testing Methods
Reinforcement Learning methods are more complicated than supervised ones. When bugs arise, it is often hard to know where to start. Here is some advice:
- Go from easy to hard. For example, build D3QN from DQN, build a hard Env from a basic Env, test RL methods on simple environments such as 'gridworld' or 'cartpole', or modify RL methods based on a mature codebase.
- Write tests for each module in the method, such as the Replay Buffer (see the sketch after this list).
- Learn more about RL, PyTorch/Tensorflow, Python, etc.
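For instance, a tiny test for the ReplayBuffer sketch above (plain asserts; a pytest-style test would work the same way):

```python
def test_replay_buffer_capacity_and_sampling():
    buf = ReplayBuffer(capacity=3)            # ReplayBuffer from the sketch above
    for t in range(5):
        buf.push((t, 0, 0.0, t + 1, False))   # dummy transitions
    assert len(buf) == 3                      # oldest transitions were evicted
    batch = buf.sample(2)
    assert len(batch) == 2
    assert all(tr in buf.buffer for tr in batch)

test_replay_buffer_capacity_and_sampling()
```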
Tools
- PyCharm profiler
- memory_profiler, line_profiler
- torch.autograd: profiler, gradcheck, anomaly detection (see the sketch after this list)
- Breakpoints
- print()
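A quick sketch of the torch.autograd tools listed above (the toy tensors are arbitrary):

```python
import torch

# anomaly detection: locates the operation that produced NaN/Inf during backward
torch.autograd.set_detect_anomaly(True)

x = torch.randn(8, 4, dtype=torch.double, requires_grad=True)
w = torch.randn(4, 2, dtype=torch.double, requires_grad=True)

# gradcheck: compares analytic gradients with finite differences (use double precision)
assert torch.autograd.gradcheck(lambda a, b: (a @ b).sum(), (x, w))

# profiler: where do forward and backward spend their time?
with torch.profiler.profile() as prof:
    loss = (x @ w).pow(2).mean()
    loss.backward()
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```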