
C51

Overview

C51 was first proposed in A Distributional Perspective on Reinforcement Learning. Unlike previous works, C51 evaluates the complete distribution of the Q-value rather than only its expectation. The authors designed a distributional Bellman operator, which preserves multimodality in value distributions and is believed to achieve more stable learning and to mitigate the negative effects of learning from a non-stationary policy.

Quick Facts

  1. C51 is a model-free and value-based RL algorithm.

  2. C51 only supports discrete action spaces.

  3. C51 is an off-policy algorithm.

  4. Usually, C51 uses eps-greedy or multinomial sampling for exploration (see the sketch after this list).

  5. C51 can be equipped with an RNN.
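
As a minimal sketch of fact 4 (illustrative PyTorch only, not the DI-engine exploration code): given atom probabilities of shape (B, A, n_atom) and the support z, the expected Q-value is recovered per action and actions are then chosen eps-greedily or by multinomial sampling.

import torch

def select_action(dist: torch.Tensor, z: torch.Tensor, eps: float) -> torch.Tensor:
    # dist: (B, A, n_atom) atom probabilities per action; z: (n_atom,) atom support.
    q = (dist * z).sum(dim=-1)                           # expected Q-value per action, (B, A)
    greedy = q.argmax(dim=-1)                            # greedy actions, (B,)
    random = torch.randint(q.shape[1], (q.shape[0],))    # uniform random actions, (B,)
    explore = torch.rand(q.shape[0]) < eps               # eps-greedy mask
    return torch.where(explore, random, greedy)

# Multinomial alternative: sample actions in proportion to softmax(Q):
# action = torch.multinomial(torch.softmax(q, dim=-1), num_samples=1).squeeze(-1)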

Pseudo-code

../_images/C51.png

Note

C51 models the value distribution using a discrete distribution whose support is a set of N atoms: \(z_i = V_\min + i \Delta z,\ i = 0, 1, \ldots, N-1\), where \(\Delta z = (V_\max - V_\min) / (N - 1)\). Each atom \(z_i\) has a parameterized probability \(p_i\). The Bellman update of C51 projects the distribution of \(r + \gamma z_j^{(t+1)}\) onto the support \(\{z_i^{(t)}\}\).

Key Equations or Key Graphs

The Bellman target of C51 is derived by projecting the returned distribution \(r + \gamma z_j\) onto the current support \(\{z_i\}\). Given a sample transition \((x, a, r, x')\), we compute the Bellman update \(\hat{T} z_j := r + \gamma z_j\) for each atom \(z_j\), then distribute its probability \(p_{j}(x', \pi(x'))\) to the immediate neighbors \(z_i\) of \(\hat{T} z_j\):

\[\left(\Phi \hat{T} Z_{\theta}(x, a)\right)_{i}=\sum_{j=0}^{N-1}\left[1-\frac{\left|\left[\hat{\mathcal{T}} z_{j}\right]_{V_{\mathrm{MIN}}}^{V_{\mathrm{MAX}}}-z_{i}\right|}{\Delta z}\right]_{0}^{1} p_{j}\left(x^{\prime}, \pi\left(x^{\prime}\right)\right)\]
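
In code, this projection can be written compactly. The following is a minimal PyTorch sketch of the categorical projection (an illustration, not the DI-engine implementation); it assumes next-state atom probabilities next_dist of shape (B, n_atom) for the greedy action \(\pi(x')\), and float tensors reward and done of shape (B,):

import torch

def categorical_projection(next_dist, reward, done, gamma, v_min=-10., v_max=10., n_atom=51):
    # next_dist: (B, n_atom) probabilities for the greedy next action; reward, done: (B,) float tensors.
    batch_size = next_dist.shape[0]
    delta_z = (v_max - v_min) / (n_atom - 1)
    z = torch.linspace(v_min, v_max, n_atom)                          # support atoms z_i
    tz = reward.unsqueeze(1) + gamma * (1.0 - done.unsqueeze(1)) * z  # Bellman update T z_j
    tz = tz.clamp(v_min, v_max)                                       # clip to [V_min, V_max]
    b = (tz - v_min) / delta_z                                        # fractional atom index of T z_j
    lower, upper = b.floor().long(), b.ceil().long()
    proj = torch.zeros(batch_size, n_atom)
    # Split each probability p_j between the two neighboring atoms l and u.
    proj.scatter_add_(1, lower, next_dist * (upper.float() - b))
    proj.scatter_add_(1, upper, next_dist * (b - lower.float()))
    # If T z_j lands exactly on an atom (l == u), both weights above are zero; restore p_j there.
    proj.scatter_add_(1, lower, next_dist * (upper == lower).float())
    return proj

The cross-entropy between this projected target and the predicted distribution \(p_i(x, a)\) is then minimized as the C51 loss.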

Extensions

  • C51 can be combined with:
    • PER (Prioritized Experience Replay)

    • Multi-step TD-loss

    • Double (target) network

    • Dueling head

    • RNN

Implementation

Tip

Our benchmark result of C51 uses the same hyper-parameters as DQN, except for C51's exclusive n_atom, which is empirically set to 51.

The default config of C51 is defined as follows:

class ding.policy.c51.C51Policy(cfg: easydict.EasyDict, model: Optional[torch.nn.modules.module.Module] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of C51 algorithm.

Config:

ID | Symbol | Type | Default Value | Description | Other (Shape)
1 | type | str | c51 | RL policy register name, refer to registry POLICY_REGISTRY | this arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | this arg can be different between modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | priority sample, update priority
5 | model.v_min | float | -10 | Value of the smallest atom in the support set |
6 | model.v_max | float | 10 | Value of the largest atom in the support set |
7 | model.n_atom | int | 51 | Number of atoms in the support set of the value distribution |
8 | other.eps.start | float | 0.95 | Start value for epsilon decay |
9 | other.eps.end | float | 0.1 | End value for epsilon decay |
10 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | may be 1 for sparse-reward envs
11 | nstep | int | 1 | N-step reward discount sum for target q_value estimation |
12 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after the collector's one collection; only valid in serial training | this arg can vary across envs; a bigger value means more off-policy
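
As a rough sketch of how the dotted field names in the table map onto a nested config (key nesting inferred from the table; the exact schema and values should be checked against the config files linked in the benchmark below):

from easydict import EasyDict

# Hypothetical override of the C51 defaults listed above.
c51_policy_config = EasyDict(dict(
    cuda=False,
    on_policy=False,
    priority=False,
    model=dict(v_min=-10, v_max=10, n_atom=51),
    discount_factor=0.97,
    nstep=1,
    learn=dict(update_per_collect=3),
    other=dict(eps=dict(start=0.95, end=0.1)),
))

Such a dict would typically be merged with the policy's defaults and passed as cfg to C51Policy below.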

The network interface C51 used is defined as follows:

class ding.model.template.q_learning.C51DQN(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], head_hidden_size: Optional[int] = None, head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, v_min: Optional[float] = - 10, v_max: Optional[float] = 10, n_atom: Optional[int] = 51)[source]
Overview:

The neural network structure and computation graph of C51DQN, which combines distributional RL and DQN. You can refer to https://arxiv.org/pdf/1707.06887.pdf for more details. C51DQN is composed of an encoder and a head: the encoder extracts features from the observation, and the head computes the distribution of the Q-value.

Interfaces:

__init__, forward

Note

Current C51DQN supports two types of encoder: FCEncoder and ConvEncoder.
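
A hedged usage sketch for image observations (shapes are illustrative; whether ConvEncoder is selected automatically for a 3-dimensional obs_shape should be verified against the DI-engine source):

>>> # Hypothetical image-observation usage: a 3-dim obs_shape is expected to pick ConvEncoder.
>>> model = C51DQN(obs_shape=[4, 84, 84], action_shape=6)
>>> inputs = torch.randn(2, 4, 84, 84)
>>> outputs = model(inputs)
>>> assert outputs['logit'].shape == torch.Size([2, 6])
>>> assert outputs['distribution'].shape == torch.Size([2, 6, 51])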

forward(x: torch.Tensor) Dict[source]
Overview:

C51DQN forward computation graph, input observation tensor to predict q_value and its distribution.

Arguments:
  • x (torch.Tensor): The input observation tensor data.

Returns:
  • outputs (Dict): The output of C51DQN's forward, including q_value and distribution.

ReturnsKeys:
  • logit (torch.Tensor): Discrete Q-value output of each possible action dimension.

  • distribution (torch.Tensor): Q-Value discretized distribution, i.e., probability of each uniformly spaced atom Q-value, such as dividing [-10, 10] into 51 uniform spaces.

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.

  • logit (torch.Tensor): \((B, M)\), where M is action_shape.

  • distribution(torch.Tensor): \((B, M, P)\), where P is n_atom.

Examples:
>>> model = C51DQN(128, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 128)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> # default head_hidden_size: int = 64,
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int = 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])

Note

For consistency and compatibility, we name all the outputs of the network which are related to action selections as logit.

Note

For convenience, we recommend that the number of atoms be odd, so that the middle atom lies exactly at the midpoint of the support \([V_\min, V_\max]\) (zero for the default symmetric support).
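
For example, with the defaults \(V_\min = -10\), \(V_\max = 10\) and n_atom = 51, the atom spacing is \(\Delta z = 20 / 50 = 0.4\) and the middle atom is \(z_{25} = -10 + 25 \times 0.4 = 0\), so the support is symmetric around zero.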

Benchmark

Benchmark and comparison of c51 algorithm

environment | best mean reward | evaluation results | config link | comparison
Pong (PongNoFrameskip-v4) | 20.6 | ../_images/c51_pong.png | config_link_p | Tianshou (20)
Qbert (QbertNoFrameskip-v4) | 20006 | ../_images/c51_qbert.png | config_link_q | Tianshou (16245)
SpaceInvaders (SpaceInvadersNoFrameskip-v4) | 2766 | ../_images/c51_spaceinvaders.png | config_link_s | Tianshou (988.5)

P.S.:

  1. The above results are obtained by running the same configuration on five different random seeds (0, 1, 2, 3, 4).

  2. For discrete action space algorithms like DQN, the Atari environment set is generally used for testing (including sub-environments such as Pong), and Atari environments are generally evaluated by the highest mean reward reached within 10M env_steps of training. For more details about Atari, please refer to the Atari Env Tutorial.

Other Public Implementations
