
ding.rl_utils

a2c

Please refer to ding/rl_utils/a2c for more details.

a2c_error

ding.rl_utils.a2c_error(data: collections.namedtuple) → collections.namedtuple[source]
Overview:

Implementation of A2C (Advantage Actor-Critic) (arXiv:1602.01783) for discrete action space

Arguments:
  • data (namedtuple): a2c input data with fields shown in a2c_data

Returns:
  • a2c_loss (namedtuple): the a2c loss item, all of them are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • value (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> data = a2c_data(
>>>     logit=torch.randn(2, 3),
>>>     action=torch.randint(0, 3, (2, )),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error(data)

a2c_error_continuous

ding.rl_utils.a2c_error_continuous(data: collections.namedtuple) → collections.namedtuple[source]
Overview:

Implementation of A2C (Advantage Actor-Critic) (arXiv:1602.01783) for continuous action space

Arguments:
  • data (namedtuple): a2c input data with fields shown in a2c_data

Returns:
  • a2c_loss (namedtuple): the a2c loss item, all of them are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, N)\)

  • value (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> data = a2c_data(
>>>     logit={'mu': torch.randn(2, 3), 'sigma': torch.sqrt(torch.randn(2, 3)**2)},
>>>     action=torch.randn(2, 3),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error_continuous(data)

acer

Please refer to ding/rl_utils/acer for more details.

acer_policy_error

ding.rl_utils.acer_policy_error(q_values: torch.Tensor, q_retraces: torch.Tensor, v_pred: torch.Tensor, target_logit: torch.Tensor, actions: torch.Tensor, ratio: torch.Tensor, c_clip_ratio: float = 10.0) → Tuple[torch.Tensor, torch.Tensor][source]
Overview:

Get ACER policy loss.

Arguments:
  • q_values (torch.Tensor): Q values

  • q_retraces (torch.Tensor): Q values calculated by the retrace method

  • v_pred (torch.Tensor): V values

  • target_pi (torch.Tensor): The new policy’s probability

  • actions (torch.Tensor): The actions in replay buffer

  • ratio (torch.Tensor): ratio of the new policy to the behavior policy

  • c_clip_ratio (float): clip value for ratio

Returns:
  • actor_loss (torch.Tensor): policy loss from q_retrace

  • bc_loss (torch.Tensor): bias correction policy loss

Shapes:
  • q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim

  • q_retraces (torch.FloatTensor): \((T, B, 1)\)

  • v_pred (torch.FloatTensor): \((T, B, 1)\)

  • target_pi (torch.FloatTensor): \((T, B, N)\)

  • actions (torch.LongTensor): \((T, B)\)

  • ratio (torch.FloatTensor): \((T, B, N)\)

  • actor_loss (torch.FloatTensor): \((T, B, 1)\)

  • bc_loss (torch.FloatTensor): \((T, B, 1)\)

Examples:
>>> q_values=torch.randn(2, 3, 4),
>>> q_retraces=torch.randn(2, 3, 1),
>>> v_pred=torch.randn(2, 3, 1),
>>> target_pi=torch.randn(2, 3, 4),
>>> actions=torch.randint(0, 4, (2, 3)),
>>> ratio=torch.randn(2, 3, 4),
>>> loss = acer_policy_error(q_values, q_retraces, v_pred, target_pi, actions, ratio)

acer_value_error

ding.rl_utils.acer_value_error(q_values, q_retraces, actions)[source]
Overview:

Get ACER critic loss.

Arguments:
  • q_values (torch.Tensor): Q values

  • q_retraces (torch.Tensor): Q values calculated by the retrace method

  • actions (torch.Tensor): The actions in replay buffer

  • ratio (torch.Tensor): ratio of the new policy to the behavior policy

Returns:
  • critic_loss (torch.Tensor): critic loss

Shapes:
  • q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim

  • q_retraces (torch.FloatTensor): \((T, B, 1)\)

  • actions (torch.LongTensor): \((T, B)\)

  • critic_loss (torch.FloatTensor): \((T, B, 1)\)

Examples:
>>> q_values=torch.randn(2, 3, 4)
>>> q_retraces=torch.randn(2, 3, 1)
>>> actions=torch.randint(0, 4, (2, 3))
>>> loss = acer_value_error(q_values, q_retraces, actions)

acer_trust_region_update

ding.rl_utils.acer_trust_region_update(actor_gradients: List[torch.Tensor], target_logit: torch.Tensor, avg_logit: torch.Tensor, trust_region_value: float) → List[torch.Tensor][source]
Overview:

Calculate gradients with a trust region constraint

Arguments:
  • actor_gradients (list(torch.Tensor)): gradient values for different parts

  • target_pi (torch.Tensor): The new policy’s probability

  • avg_pi (torch.Tensor): The average policy’s probability

  • trust_region_value (float): the range of trust region

Returns:
  • update_gradients (list(torch.Tensor)): gradients with trust region constraint

Shapes:
  • target_pi (torch.FloatTensor): \((T, B, N)\)

  • avg_pi (torch.FloatTensor): \((T, B, N)\)

  • update_gradients (list(torch.FloatTensor)): \((T, B, N)\)

Examples:
>>> actor_gradients=[torch.randn(2, 3, 4)]
>>> target_pi=torch.randn(2, 3, 4)
>>> avg_pi=torch.randn(2, 3, 4)
>>> loss = acer_trust_region_update(actor_gradients, target_pi, avg_pi, 0.1)

adder

Please refer to ding/rl_utils/adder for more details.

Adder

class ding.rl_utils.adder.Adder[source]
Overview:

Adder is a component that handles different transformations and calculations for transitions in the Collector Module (data generation and processing), such as GAE, n-step return, transition sampling, etc.

Interface:

__init__, get_gae, get_gae_with_default_last_value, get_nstep_return_data, get_train_sample

classmethod _get_null_transition(template: dict, null_transition: Optional[dict] = None) → dict[source]
Overview:

Get null transition for padding. If cls._null_transition is None, return input template instead.

Arguments:
  • template (dict): The template for null transition.

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • null_transition (dict): The deepcopied null transition.
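
A minimal usage sketch (the transition keys below are only illustrative):
>>> import torch
>>> from ding.rl_utils.adder import Adder
>>> template = dict(obs=torch.randn(4), reward=torch.zeros(1), done=False)
>>> null = Adder._get_null_transition(template)  # deepcopy of template, since no null_transition is given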

classmethod get_gae(data: List[Dict[str, Any]], last_value: torch.Tensor, gamma: float, gae_lambda: float, cuda: bool) → List[Dict[str, Any]][source]
Overview:

Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for calculation.

Arguments:
  • data (list): Transitions list, each element is a transition dict with at least [‘value’, ‘reward’]

  • last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

  • cuda (bool): Whether use cuda in GAE computation

Returns:
  • data (list): transitions list like the input one, but each element owns an extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)
classmethod get_gae_with_default_last_value(data: collections.deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) → List[Dict[str, Any]][source]
Overview:

Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case where last_value is not passed. If the transition is not done yet, it would use the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would set last_value to 0.

Arguments:
  • data (deque): Transitions list, each element is a transition dict with at least [‘value’, ‘reward’]

  • done (bool): Whether the transition reaches the end of an episode (i.e. whether the env is done)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

  • cuda (bool): Whether use cuda in GAE computation

Returns:
  • data (List[Dict[str, Any]]): transitions list like the input one, but each element owns an extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)
classmethod get_nstep_return_data(data: collections.deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) → collections.deque[source]
Overview:

Process raw traj data by updating keys [‘next_obs’, ‘reward’, ‘done’] in data’s dict element.

Arguments:
  • data (deque): Transitions list, each element is a transition dict

  • nstep (int): Number of steps. If it equals 1, return data directly; otherwise update each element with the nstep value.

Returns:
  • data (deque): Transitions list like the input one, but each element updated with the nstep value.

Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)
classmethod get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: Optional[dict] = None) → List[Dict[str, Any]][source]
Overview:

Process raw trajectory data by updating the keys [‘next_obs’, ‘reward’, ‘done’] in each dict element of data. If unroll_len equals 1, no processing is needed and data is returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data.

Arguments:
  • data (List[Dict[str, Any]]): Transitions list, each element is a transition dict

  • unroll_len (int): Learn training unroll length

  • last_fn_type (str): The method type name for dealing with last residual data in a traj after splitting, should be in [‘last’, ‘drop’, ‘null_padding’]

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • data (List[Dict[str, Any]]): Transitions list processed after unrolling
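
A minimal usage sketch (the transition keys are illustrative; any per-step dict works):
>>> import torch
>>> from ding.rl_utils.adder import Adder
>>> T, unroll_len = 5, 2
>>> data = [dict(obs=torch.randn(4), reward=torch.randn(1), done=False) for _ in range(T)]
>>> samples = Adder.get_train_sample(data, unroll_len, last_fn_type='null_padding')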

get_gae

ding.rl_utils.adder.get_gae(data: List[Dict[str, Any]], last_value: torch.Tensor, gamma: float, gae_lambda: float, cuda: bool) → List[Dict[str, Any]]
Overview:

Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for calculation.

Arguments:
  • data (list): Transitions list, each element is a transition dict with at least [‘value’, ‘reward’]

  • last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

  • cuda (bool): Whether use cuda in GAE computation

Returns:
  • data (list): transitions list like the input one, but each element owns an extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)

get_gae_with_default_last_value

ding.rl_utils.adder.get_gae_with_default_last_value(data: collections.deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) → List[Dict[str, Any]]
Overview:

Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case where last_value is not passed. If the transition is not done yet, it would use the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would set last_value to 0.

Arguments:
  • data (deque): Transitions list, each element is a transition dict with at least [‘value’, ‘reward’]

  • done (bool): Whether the transition reaches the end of an episode (i.e. whether the env is done)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

  • cuda (bool): Whether use cuda in GAE computation

Returns:
  • data (List[Dict[str, Any]]): transitions list like the input one, but each element owns an extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)

get_nstep_return_data

ding.rl_utils.adder.get_nstep_return_data(data: collections.deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) → collections.deque
Overview:

Process raw traj data by updating keys [‘next_obs’, ‘reward’, ‘done’] in data’s dict element.

Arguments:
  • data (deque): Transitions list, each element is a transition dict

  • nstep (int): Number of steps. If it equals 1, return data directly; otherwise update each element with the nstep value.

Returns:
  • data (deque): Transitions list like the input one, but each element updated with the nstep value.

Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)

get_train_sample

ding.rl_utils.adder.get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: Optional[dict] = None) → List[Dict[str, Any]]
Overview:

Process raw trajectory data by updating the keys [‘next_obs’, ‘reward’, ‘done’] in each dict element of data. If unroll_len equals 1, no processing is needed and data is returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data.

Arguments:
  • data (List[Dict[str, Any]]): Transitions list, each element is a transition dict

  • unroll_len (int): Learn training unroll length

  • last_fn_type (str): The method type name for dealing with last residual data in a traj after splitting, should be in [‘last’, ‘drop’, ‘null_padding’]

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • data (List[Dict[str, Any]]): Transitions list processed after unrolling
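
A minimal usage sketch of the module-level function (the transition keys are illustrative):
>>> import torch
>>> from ding.rl_utils.adder import get_train_sample
>>> T, unroll_len = 5, 2
>>> data = [dict(obs=torch.randn(4), reward=torch.randn(1), done=False) for _ in range(T)]
>>> samples = get_train_sample(data, unroll_len, last_fn_type='drop')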

beta_function

Please refer to ding/rl_utils/beta_function for more details.

cpw

ding.rl_utils.beta_function.cpw(x: Union[torch.Tensor, float], eta: float = 0.71) → Union[torch.Tensor, float][source]

CVaR

ding.rl_utils.beta_function.CVaR(x: Union[torch.Tensor, float], eta: float = 0.71) → Union[torch.Tensor, float][source]

beta_function_map

rl_utils.beta_function_map = {'CPW': <function cpw>, 'CVaR': <function CVaR>, 'Pow': <function Pow>, 'uniform': <function <lambda>>}
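
A minimal usage sketch of the beta functions listed above (per the signatures, each accepts a tensor or a float):
>>> import torch
>>> from ding.rl_utils.beta_function import cpw, CVaR
>>> tau = torch.rand(8)  # quantile fractions in [0, 1]
>>> w1 = cpw(tau, eta=0.71)
>>> w2 = CVaR(tau, eta=0.71)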

coma

Please refer to ding/rl_utils/coma for more details.

coma_error

ding.rl_utils.coma_error(data: collections.namedtuple, gamma: float, lambda_: float) → collections.namedtuple[source]
Overview:

Implementation of COMA

Arguments:
  • data (namedtuple): coma input data with fields shown in coma_data

Returns:
  • coma_loss (namedtuple): the coma loss item, all of them are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((T, B, A, N)\), where B is batch size, A is the agent num, and N is action dim

  • action (torch.LongTensor): \((T, B, A)\)

  • q_value (torch.FloatTensor): \((T, B, A, N)\)

  • target_q_value (torch.FloatTensor): \((T, B, A, N)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.FloatTensor or None): \((T, B, A)\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> agent_num = 3
>>> data = coma_data(
>>>     logit=torch.randn(2, 3, agent_num, action_dim),
>>>     action=torch.randint(0, action_dim, (2, 3, agent_num)),
>>>     q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     target_q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     reward=torch.randn(2, 3),
>>>     weight=torch.ones(2, 3, agent_num),
>>> )
>>> loss = coma_error(data, 0.99, 0.99)

exploration

Please refer to ding/rl_utils/exploration for more details.

get_epsilon_greedy_fn

ding.rl_utils.exploration.get_epsilon_greedy_fn(start: float, end: float, decay: int, type_: str = 'exp') → Callable[source]
Overview:

Generate an epsilon_greedy function with decay, which inputs current timestep and outputs current epsilon.

Arguments:
  • start (float): Epsilon start value. For ‘linear’, it should be 1.0.

  • end (float): Epsilon end value.

  • decay (int): Controls the speed at which epsilon decreases from start to end. We recommend that epsilon decays according to env step rather than iteration.

  • type (str): How epsilon decays, now supports [‘linear’, ‘exp’ (exponential)]

Returns:
  • eps_fn (function): The epsilon greedy function with decay
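
A minimal usage sketch (the parameter values are illustrative):
>>> from ding.rl_utils.exploration import get_epsilon_greedy_fn
>>> eps_fn = get_epsilon_greedy_fn(start=0.95, end=0.1, decay=10000, type_='exp')
>>> eps = eps_fn(1000)  # current epsilon at env step 1000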

BaseNoise

class ding.rl_utils.exploration.BaseNoise[source]
Overview:

Base class for action noise

Interface:

__init__, __call__

Examples:
>>> noise_generator = OUNoise()  # init one type of noise
>>> noise = noise_generator(action.shape, action.device)  # generate noise
abstract __call__(shape: tuple, device: str) → torch.Tensor[source]
Overview:

Generate noise according to action tensor’s shape, device

Arguments:
  • shape (tuple): size of the action tensor, output noise’s size should be the same

  • device (str): device of the action tensor, output noise’s device should be the same as it

Returns:
  • noise (torch.Tensor): generated action noise, which has the same shape and device as the input action tensor

__init__() → None[source]
Overview:

Initialization method

GaussianNoise

class ding.rl_utils.exploration.GaussianNoise(mu: float = 0.0, sigma: float = 1.0)[source]
Overview:

Derived class for generating gaussian noise, which satisfies \(X \sim N(\mu, \sigma^2)\)

Interface:

__init__, __call__

__call__(shape: tuple, device: str) → torch.Tensor[source]
Overview:

Generate gaussian noise according to action tensor’s shape, device

Arguments:
  • shape (tuple): size of the action tensor, output noise’s size should be the same

  • device (str): device of the action tensor, output noise’s device should be the same as it

Returns:
  • noise (torch.Tensor): generated action noise, which has the same shape and device as the input action tensor

__init__(mu: float = 0.0, sigma: float = 1.0) → None[source]
Overview:

Initialize \(\mu\) and \(\sigma\) in Gaussian Distribution

Arguments:
  • mu (float): \(\mu\) , mean value

  • sigma (float): \(\sigma\) , standard deviation, should be positive
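
A minimal usage sketch, following the BaseNoise calling convention documented above:
>>> import torch
>>> from ding.rl_utils.exploration import GaussianNoise
>>> action = torch.randn(4, 2)
>>> noise_generator = GaussianNoise(mu=0.0, sigma=0.1)
>>> noise = noise_generator(action.shape, action.device)
>>> noisy_action = action + noise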

OUNoise

class ding.rl_utils.exploration.OUNoise(mu: float = 0.0, sigma: float = 0.3, theta: float = 0.15, dt: float = 0.01, x0: Optional[Union[float, torch.Tensor]] = 0.0)[source]
Overview:

Derived class for generating Ornstein-Uhlenbeck process noise. Satisfies \(dx_t=\theta(\mu-x_t)dt + \sigma dW_t\), where \(W_t\) denotes a Wiener process, acting as a random perturbation term.

Interface:

__init__, reset, __call__

reset() → None[source]
Overview:

Reset _x to the initial state _x0

property x0: Union[float, torch.Tensor]
Overview:

Get self._x0
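
A minimal usage sketch; reset() restores the internal state _x to the initial state _x0 between episodes:
>>> from ding.rl_utils.exploration import OUNoise
>>> noise_generator = OUNoise(mu=0.0, sigma=0.3, theta=0.15)
>>> noise = noise_generator((4, 2), 'cpu')
>>> noise_generator.reset()  # start a fresh process, e.g. at an episode boundary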

noise_mapping

exploration.noise_mapping = {'gauss': <class 'ding.rl_utils.exploration.GaussianNoise'>, 'ou': <class 'ding.rl_utils.exploration.OUNoise'>}

create_noise_generator

ding.rl_utils.exploration.create_noise_generator(noise_type: str, noise_kwargs: dict) → ding.rl_utils.exploration.BaseNoise[source]
Overview:

Given the key (noise_type), create a new noise generator instance if the key is in noise_mapping, otherwise raise a KeyError. In other words, a derived noise generator must first be registered, then create_noise_generator can be called to get the instance object.

Arguments:
  • noise_type (str): the type of noise generator to be created

Returns:
  • noise (BaseNoise): the created new noise generator, should be an instance of one of noise_mapping’s values
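
A minimal usage sketch (assuming noise_kwargs is forwarded to the chosen noise class constructor):
>>> from ding.rl_utils.exploration import create_noise_generator
>>> noise_generator = create_noise_generator('gauss', {'mu': 0.0, 'sigma': 0.1})
>>> noise = noise_generator((4, 2), 'cpu')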

gae

Please refer to ding/rl_utils/gae for more details.

gae_data

class ding.rl_utils.gae.gae_data(value, next_value, reward, done, traj_flag)

shape_fn_gae

ding.rl_utils.gae.shape_fn_gae(args, kwargs)[source]
Overview:

Return shape of gae for hpc

Returns:

shape: [T, B]

gae

ding.rl_utils.gae.gae(data: collections.namedtuple, gamma: float = 0.99, lambda_: float = 0.97) → torch.FloatTensor[source]
Overview:

Implementation of Generalized Advantage Estimator (arXiv:1506.02438)

Arguments:
  • data (namedtuple): gae input data with fields [‘value’, ‘reward’], which contains some episodes or trajectories data.

  • gamma (float): the future discount factor, should be in [0, 1], defaults to 0.99.

  • lambda (float): the gae parameter lambda, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

Returns:
  • adv (torch.FloatTensor): the calculated advantage

Shapes:
  • value (torch.FloatTensor): \((T, B)\), where T is trajectory length and B is batch size

  • next_value (torch.FloatTensor): \((T, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • adv (torch.FloatTensor): \((T, B)\)

Examples:
>>> value = torch.randn(2, 3)
>>> next_value = torch.randn(2, 3)
>>> reward = torch.randn(2, 3)
>>> data = gae_data(value, next_value, reward, None, None)
>>> adv = gae(data)

isw

Please refer to ding/rl_utils/isw for more details.

compute_importance_weights

ding.rl_utils.isw.compute_importance_weights(target_output: Union[torch.Tensor, dict], behaviour_output: Union[torch.Tensor, dict], action: torch.Tensor, action_space_type: str = 'discrete', requires_grad: bool = False)[source]
Overview:

Compute importance sampling weights with the given outputs and actions

Arguments:
  • target_output (Union[torch.Tensor,dict]): the output of the current policy network for the taken action; usually this output is the network output logit if the action space is discrete, or a dict containing parameters of the action distribution if the action space is continuous.

  • behaviour_output (Union[torch.Tensor,dict]): the output of the behaviour policy network for the taken action; usually this output is the network output logit if the action space is discrete, or a dict containing parameters of the action distribution if the action space is continuous.

  • action (torch.Tensor): the chosen action (index for the discrete action space) in the trajectory, i.e. behaviour_action

  • action_space_type (str): action space types in [‘discrete’, ‘continuous’]

  • requires_grad (bool): whether requires grad computation

Returns:
  • rhos (torch.Tensor): Importance sampling weight

Shapes:
  • target_output (Union[torch.FloatTensor,dict]): \((T, B, N)\), where T is timestep, B is batch size and N is action dim

  • behaviour_output (Union[torch.FloatTensor,dict]): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • rhos (torch.FloatTensor): \((T, B)\)

Examples:
>>> target_output = torch.randn(2, 3, 4)
>>> behaviour_output = torch.randn(2, 3, 4)
>>> action = torch.randint(0, 4, (2, 3))
>>> rhos = compute_importance_weights(target_output, behaviour_output, action)

ppg

Please refer to ding/rl_utils/ppg for more details.

ppg_data

class ding.rl_utils.ppg.ppg_data(logit_new, logit_old, action, value_new, value_old, return_, weight)

ppg_joint_loss

class ding.rl_utils.ppg.ppg_joint_loss(auxiliary_loss, behavioral_cloning_loss)

ppg_joint_error

ding.rl_utils.ppg.ppg_joint_error(data: collections.namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Get PPG joint loss

Arguments:
  • data (namedtuple): ppg input data with fields shown in ppg_data

  • clip_ratio (float): clip value for ratio

  • use_value_clip (bool): whether use value clip

Returns:
  • ppg_joint_loss (namedtuple): the ppg loss item, all of them are differentiable 0-dim tensors

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B,)\)

  • value_new (torch.FloatTensor): \((B, 1)\)

  • value_old (torch.FloatTensor): \((B, 1)\)

  • return (torch.FloatTensor): \((B, 1)\)

  • weight (torch.FloatTensor): \((B,)\)

  • auxiliary_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • behavioral_cloning_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppg_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3, 1),
>>>     value_old=torch.randn(3, 1),
>>>     return_=torch.randn(3, 1),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppg_joint_error(data, 0.99, 0.99)

ppo

Please refer to ding/rl_utils/ppo for more details.

ppo_data

class ding.rl_utils.ppo.ppo_data(logit_new, logit_old, action, value_new, value_old, adv, return_, weight)

ppo_policy_data

class ding.rl_utils.ppo.ppo_policy_data(logit_new, logit_old, action, adv, weight)

ppo_value_data

class ding.rl_utils.ppo.ppo_value_data(value_new, value_old, return_, weight)

ppo_loss

class ding.rl_utils.ppo.ppo_loss(policy_loss, value_loss, entropy_loss)

ppo_policy_loss

class ding.rl_utils.ppo.ppo_policy_loss(policy_loss, entropy_loss)

ppo_info

class ding.rl_utils.ppo.ppo_info(approx_kl, clipfrac)

shape_fn_ppo

ding.rl_utils.ppo.shape_fn_ppo(args, kwargs)[source]
Overview:

Return shape of ppo for hpc

Returns:

shape: [B, N]

ppo_error

ding.rl_utils.ppo.ppo_error(data: collections.namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: Optional[float] = None) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • use_value_clip (bool): whether to use clip in value loss with the same ratio as policy

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalar

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • value_new (torch.FloatTensor): \((B, )\)

  • value_old (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error(data)

Note

adv is an already normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8), and there are many ways to calculate this mean and std, such as over the data buffer or the train batch, so we don’t couple this part into ppo_error; you can refer to our examples for different ways.
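
For reference, a minimal sketch of the batch-level normalization described in the note (other granularities, e.g. over a whole data buffer, follow the same pattern):
>>> import torch
>>> adv = torch.randn(3)
>>> adv = (adv - adv.mean()) / (adv.std() + 1e-8)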

ppo_policy_error

ding.rl_utils.ppo.ppo_policy_error(data: collections.namedtuple, clip_ratio: float = 0.2, dual_clip: Optional[float] = None) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Get PPO policy loss

Arguments:
  • data (namedtuple): ppo input data with fields shown in ppo_policy_data

  • clip_ratio (float): clip value for ratio

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_policy_loss (namedtuple): the ppo policy loss item, all of them are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalar

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_policy_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error(data)

ppo_value_error

ding.rl_utils.ppo.ppo_value_error(data: collections.namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) → torch.Tensor[source]
Overview:

Get PPO value loss

Arguments:
  • data (namedtuple): ppo input data with fields shown in ppo_value_data

  • clip_ratio (float): clip value for ratio

  • use_value_clip (bool): whether use value clip

Returns:
  • value_loss (torch.FloatTensor): the ppo value loss item, a differentiable 0-dim tensor

Shapes:
  • value_new (torch.FloatTensor): \((B, )\), where B is batch size

  • value_old (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • value_loss (torch.FloatTensor): \(()\), 0-dim tensor

Examples:
>>> action_dim = 4
>>> data = ppo_value_data(
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppo_value_error(data)

ppo_error_continuous

ding.rl_utils.ppo.ppo_error_continuous(data: collections.namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: Optional[float] = None) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • use_value_clip (bool): whether to use clip in value loss with the same ratio as policy

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalar

Shapes:
  • mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • value_new (torch.FloatTensor): \((B, )\)

  • value_old (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_data_continuous(
>>>     mu_sigma_new= dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old= dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error_continuous(data)

Note

adv is an already normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8), and there are many ways to calculate this mean and std, such as over the data buffer or the train batch, so we don’t couple this part into ppo_error; you can refer to our examples for different ways.

ppo_policy_error_continuous

ding.rl_utils.ppo.ppo_policy_error_continuous(data: collections.namedtuple, clip_ratio: float = 0.2, dual_clip: Optional[float] = None) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalar

Shapes:
  • mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_policy_data_continuous(
>>>     mu_sigma_new=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error_continuous(data)

retrace

Please refer to ding/rl_utils/retrace for more details.

compute_q_retraces

ding.rl_utils.retrace.compute_q_retraces(q_values: torch.Tensor, v_pred: torch.Tensor, rewards: torch.Tensor, actions: torch.Tensor, weights: torch.Tensor, ratio: torch.Tensor, gamma: float = 0.9) → torch.Tensor[source]
Shapes:
  • q_values (torch.Tensor): \((T + 1, B, N)\), where T is unroll_len, B is batch size, N is discrete action dim.

  • v_pred (torch.Tensor): \((T + 1, B, 1)\)

  • rewards (torch.Tensor): \((T, B)\)

  • actions (torch.Tensor): \((T, B)\)

  • weights (torch.Tensor): \((T, B)\)

  • ratio (torch.Tensor): \((T, B, N)\)

  • q_retraces (torch.Tensor): \((T + 1, B, 1)\)

Examples:
>>> T=2
>>> B=3
>>> N=4
>>> q_values=torch.randn(T+1, B, N)
>>> v_pred=torch.randn(T+1, B, 1)
>>> rewards=torch.randn(T, B)
>>> actions=torch.randint(0, N, (T, B))
>>> weights=torch.ones(T, B)
>>> ratio=torch.randn(T, B, N)
>>> q_retraces = compute_q_retraces(q_values, v_pred, rewards, actions, weights, ratio)

Note

The q_retrace operation doesn’t need to compute gradients; it just executes forward computation.

sampler

Please refer to ding/rl_utils/sampler for more details.

ArgmaxSampler

class ding.rl_utils.sampler.ArgmaxSampler[source]
Overview:

Argmax sampler, return the index of the maximum value

MultinomialSampler

class ding.rl_utils.sampler.MultinomialSampler[source]
Overview:

Multinomial sampler, return the index of the sampled value

MuSampler

class ding.rl_utils.sampler.MuSampler[source]
Overview:

Mu sampler, return the mu of the input tensor

ReparameterizationSampler

class ding.rl_utils.sampler.ReparameterizationSampler[source]
Overview:

Reparameterization sampler, return the reparameterized value of the input tensor

HybridStochasticSampler

class ding.rl_utils.sampler.HybridStochasticSampler[source]
Overview:

Hybrid stochastic sampler, return the sampled action type and the reparameterized action args

HybridDeterminsticSampler

class ding.rl_utils.sampler.HybridDeterminsticSampler[source]
Overview:

Hybrid deterministic sampler, return the argmax action type and the mu action args
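
A minimal usage sketch for the discrete samplers above (assuming each sampler is a callable that takes a logit tensor of shape (B, N) and returns an action index of shape (B, )):
>>> import torch
>>> from ding.rl_utils.sampler import ArgmaxSampler, MultinomialSampler
>>> logit = torch.randn(4, 6)
>>> greedy_action = ArgmaxSampler()(logit)       # index of the maximum value per sample
>>> stochastic_action = MultinomialSampler()(logit)  # index sampled from the logit distribution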

td

Please refer to ding/rl_utils/td for more details.

q_1step_td_data

class ding.rl_utils.td.q_1step_td_data(q, next_q, act, next_act, reward, done, weight)

q_1step_td_error

ding.rl_utils.td.q_1step_td_error(data: collections.namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

1 step td_error, supporting both the single-agent case and the multi-agent case.

Arguments:
  • data (q_1step_td_data): The input data, q_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error

Shapes:
  • data (q_1step_td_data): the q_1step_td_data containing [‘q’, ‘next_q’, ‘act’, ‘next_act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • act (torch.LongTensor): \((B, )\)

  • next_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     next_act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)).bool(),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_1step_td_error(data, 0.99)

m_q_1step_td_data

class ding.rl_utils.td.m_q_1step_td_data(q, target_q, next_q, act, reward, done, weight)

m_q_1step_td_error

ding.rl_utils.td.m_q_1step_td_error(data: collections.namedtuple, gamma: float, tau: float, alpha: float, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

Munchausen td_error for the DQN algorithm, supporting 1 step td error.

Arguments:
  • data (m_q_1step_td_data): The input data, m_q_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • tau (float): Entropy factor for Munchausen DQN

  • alpha (float): Discount factor for Munchausen term

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (m_q_1step_td_data): the m_q_1step_td_data containing [‘q’, ‘target_q’, ‘next_q’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • target_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = m_q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     target_q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = m_q_1step_td_error(data, 0.99, 0.01, 0.01)

q_v_1step_td_data

class ding.rl_utils.td.q_v_1step_td_data(q, v, act, reward, done, weight)

q_v_1step_td_error

ding.rl_utils.td.q_v_1step_td_error(data: collections.namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

td_error between q and v values for the SAC algorithm, supporting 1 step td error.

Arguments:
  • data (q_v_1step_td_data): The input data, q_v_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (q_v_1step_td_data): the q_v_1step_td_data containing [‘q’, ‘v’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • v (torch.FloatTensor): \((B, )\)

  • act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = q_v_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     v=torch.randn(3),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_v_1step_td_error(data, 0.99)

nstep_return_data

class ding.rl_utils.td.nstep_return_data(reward, next_value, done)

nstep_return

ding.rl_utils.td.nstep_return(data: collections.namedtuple, gamma: Union[float, list], nstep: int, value_gamma: Optional[torch.Tensor] = None)[source]
Overview:

Calculate nstep return for the DQN algorithm, supporting both the single-agent case and the multi-agent case.

Arguments:
  • data (nstep_return_data): The input data, nstep_return_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num

  • value_gamma (torch.Tensor): Discount factor for value

Returns:
  • return (torch.Tensor): nstep return

Shapes:
  • data (nstep_return_data): the nstep_return_data containing [‘reward’, ‘next_value’, ‘done’]

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • next_value (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> data = nstep_return_data(
>>>     reward=torch.randn(3, 3),
>>>     next_value=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>> )
>>> loss = nstep_return(data, 0.99, 3)

dist_1step_td_data

class ding.rl_utils.td.dist_1step_td_data(dist, next_dist, act, next_act, reward, done, weight)

dist_1step_td_error

ding.rl_utils.td.dist_1step_td_error(data: collections.namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int) → torch.Tensor[source]
Overview:

1 step td_error for distributional q-learning based algorithms

Arguments:
  • data (dist_1step_td_data): The input data, dist_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • v_min (float): The min value of support

  • v_max (float): The max value of support

  • n_atom (int): The num of atom

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (dist_1step_td_data): the dist_1step_td_data containing [‘dist’, ‘next_n_dist’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]

  • next_dist (torch.FloatTensor): \((B, N, n_atom)\)

  • act (torch.LongTensor): \((B, )\)

  • next_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_dist = torch.randn(4, 3, 51).abs()
>>> act = torch.randint(0, 3, (4,))
>>> next_act = torch.randint(0, 3, (4,))
>>> reward = torch.randn(4)
>>> done = torch.randint(0, 2, (4,))
>>> data = dist_1step_td_data(dist, next_dist, act, next_act, reward, done, None)
>>> loss = dist_1step_td_error(data, 0.99, -10.0, 10.0, 51)

dist_nstep_td_data

ding.rl_utils.td.dist_nstep_td_data

alias of ding.rl_utils.td.dist_1step_td_data

shape_fn_dntd

ding.rl_utils.td.shape_fn_dntd(args, kwargs)[source]
Overview:

Return dntd shape for hpc

Returns:

shape: [T, B, N, n_atom]

dist_nstep_td_error

ding.rl_utils.td.dist_nstep_td_error(data: collections.namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int, nstep: int = 1, value_gamma: Optional[torch.Tensor] = None) → torch.Tensor[source]
Overview:

Multistep (1 step or n step) td_error for distributional q-learning based algorithms, supporting both the single-agent case and the multi-agent case.

Arguments:
  • data (dist_nstep_td_data): The input data, dist_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (dist_nstep_td_data): the dist_nstep_td_data containing [‘dist’, ‘next_n_dist’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]

  • next_n_dist (torch.FloatTensor): \((B, N, n_atom)\)

  • act (torch.LongTensor): \((B, )\)

  • next_n_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_n_dist = torch.randn(4, 3, 51).abs()
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> reward = torch.randn(5, 4)
>>> data = dist_nstep_td_data(dist, next_n_dist, action, next_action, reward, done, None)
>>> loss, _ = dist_nstep_td_error(data, 0.95, -10.0, 10.0, 51, 5)

v_1step_td_data

class ding.rl_utils.td.v_1step_td_data(v, next_v, reward, done, weight)

v_1step_td_error

ding.rl_utils.td.v_1step_td_error(data: collections.namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

1 step td_error for value based algorithms

Arguments:
  • data (v_1step_td_data): The input data, v_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (v_1step_td_data): the v_1step_td_data containing [‘v’, ‘next_v’, ‘reward’, ‘done’, ‘weight’]

  • v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]

  • next_v (torch.FloatTensor): \((B, )\)

  • reward (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5)
>>> done = torch.zeros(5)
>>> data = v_1step_td_data(v, next_v, reward, done, None)
>>> loss, td_error_per_sample = v_1step_td_error(data, 0.99)

v_nstep_td_data

class ding.rl_utils.td.v_nstep_td_data(v, next_n_v, reward, done, weight, value_gamma)

v_nstep_td_error

ding.rl_utils.td.v_nstep_td_error(data: collections.namedtuple, gamma: float, nstep: int = 1, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

Multistep (n step) td_error for value based algorithms

Arguments:
  • data (v_nstep_td_data): The input data, v_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (v_nstep_td_data): The v_nstep_td_data containing [‘v’, ‘next_n_v’, ‘reward’, ‘done’, ‘weight’, ‘value_gamma’]

  • v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]

  • next_v (torch.FloatTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

  • value_gamma (torch.Tensor): If the remaining data in the buffer is less than n_step, we use value_gamma as the gamma discount value for next_v rather than gamma**n_step

Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5, 5)
>>> done = torch.zeros(5)
>>> data = v_nstep_td_data(v, next_v, reward, done, 0.9, 0.99)
>>> loss, td_error_per_sample = v_nstep_td_error(data, 0.99, 5)

q_nstep_td_data

class ding.rl_utils.td.q_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, weight)

dqfd_nstep_td_data

class ding.rl_utils.td.dqfd_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, done_one_step, weight, new_n_q_one_step, next_n_action_one_step, is_expert)

shape_fn_qntd

ding.rl_utils.td.shape_fn_qntd(args, kwargs)[source]
Overview:

Return qntd shape for hpc

Returns:

shape: [T, B, N]

q_nstep_td_error

ding.rl_utils.td.q_nstep_td_error(data: collections.namedtuple, gamma: Union[float, list], nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

Multistep (1 step or n step) td_error for q-learning based algorithm

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = q_nstep_td_error(data, 0.95, nstep=nstep)

bdq_nstep_td_error

ding.rl_utils.td.bdq_nstep_td_error(data: collections.namedtuple, gamma: Union[float, list], nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

Multistep (1 step or n step) td_error for the BDQ algorithm; see the paper “Action Branching Architectures for Deep Reinforcement Learning”, link: https://arxiv.org/pdf/1711.08946. The original paper only provides the 1-step TD-error calculation method, and here we extend it to the n-step case.

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, D, N)\) i.e. [batch_size, branch_num, action_bins_per_branch]

  • next_n_q (torch.FloatTensor): \((B, D, N)\)

  • action (torch.LongTensor): \((B, D)\)

  • next_n_action (torch.LongTensor): \((B, D)\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> action_per_branch = 3
>>> next_q = torch.randn(8, 6, action_per_branch)
>>> done = torch.randn(8)
>>> action = torch.randint(0, action_per_branch, size=(8, 6))
>>> next_action = torch.randint(0, action_per_branch, size=(8, 6))
>>> nstep = 3
>>> q = torch.randn(8, 6, action_per_branch).requires_grad_(True)
>>> reward = torch.rand(nstep, 8)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = bdq_nstep_td_error(data, 0.95, nstep=nstep)

shape_fn_qntd_rescale

ding.rl_utils.td.shape_fn_qntd_rescale(args, kwargs)[source]
Overview:

Return qntd_rescale shape for hpc

Returns:

shape: [T, B, N]

q_nstep_td_error_with_rescale

ding.rl_utils.td.q_nstep_td_error_with_rescale(data: collections.namedtuple, gamma: Union[float, list], nstep: int = 1, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) → torch.Tensor[source]
Overview:

Multistep (1 step or n step) td_error with value rescaling

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

  • criterion (torch.nn.modules): Loss function criterion

  • trans_fn (Callable): Value transform function, defaults to value_transform (refer to rl_utils/value_rescale.py)

  • inv_trans_fn (Callable): Value inverse transform function, defaults to value_inv_transform (refer to rl_utils/value_rescale.py)

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, _ = q_nstep_td_error_with_rescale(data, 0.95, nstep=nstep)

dqfd_nstep_td_error

ding.rl_utils.td.dqfd_nstep_td_error(data: collections.namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, margin_function: float, lambda_one_step_td: float = 1.0, nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss()) torch.Tensor[源代码]
Overview:

Multistep n step td_error + 1 step td_error + supervised margin loss for DQfD

Arguments:
  • data (dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss

  • gamma (float): discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensor

  • td_error_per_sample (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor

Shapes:
  • data (dqfd_nstep_td_data): The dqfd_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘weight’, ‘new_n_q_one_step’, ‘next_n_action_one_step’, ‘is_expert’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

  • new_n_q_one_step (torch.FloatTensor): \((B, N)\)

  • next_n_action_one_step (torch.LongTensor): \((B, )\)

  • is_expert (int): 0 or 1

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> loss, td_error_per_sample, loss_statistics = dqfd_nstep_td_error(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     margin_function=0.8, nstep=nstep
>>> )

dqfd_nstep_td_error_with_rescale

ding.rl_utils.td.dqfd_nstep_td_error_with_rescale(data: collections.namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, lambda_one_step_td: float, margin_function: float, nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) torch.Tensor[源代码]
Overview:

Multistep n step td_error + 1 step td_error + supervised margin loss for DQfD, with value rescaling

Arguments:
  • data (dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensor

  • td_error_per_sample (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor

Shapes:
  • data (dqfd_nstep_td_data): The dqfd_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘weight’, ‘new_n_q_one_step’, ‘next_n_action_one_step’, ‘is_expert’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

  • new_n_q_one_step (torch.FloatTensor): \((B, N)\)

  • next_n_action_one_step (torch.LongTensor): \((B, )\)

  • is_expert (int): 0 or 1
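
No example is rendered for this rescaled variant; the following is a minimal usage sketch, assuming the same dqfd_nstep_td_data layout as in dqfd_nstep_td_error above (the result is left unpacked because the exact return tuple is not documented here):

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> outputs = dqfd_nstep_td_error_with_rescale(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     lambda_one_step_td=1, margin_function=0.8, nstep=nstep
>>> )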

qrdqn_nstep_td_data

class ding.rl_utils.td.qrdqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, tau, weight)

qrdqn_nstep_td_error

ding.rl_utils.td.qrdqn_nstep_td_error(data: collections.namedtuple, gamma: float, nstep: int = 1, value_gamma: Optional[torch.Tensor] = None) torch.Tensor[源代码]
Overview:

Multistep (1 step or n step) td_error in QRDQN

Arguments:
  • data (qrdqn_nstep_td_data): The input data, qrdqn_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((tau, B, N)\) i.e. [num_quantiles, batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((tau', B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = qrdqn_nstep_td_data(q, next_q, action, next_action, reward, done, 3, None)
>>> loss, td_error_per_sample = qrdqn_nstep_td_error(data, 0.95, nstep=nstep)

q_nstep_sql_td_error

ding.rl_utils.td.q_nstep_sql_td_error(data: collections.namedtuple, gamma: float, alpha: float, nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss()) torch.Tensor[源代码]
Overview:

Multistep (1 step or n step) td_error for soft Q-learning (SQL) based algorithms

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_sql_td_data to calculate loss

  • gamma (float): Discount factor

  • alpha (float): A parameter to weight the entropy term in the policy equation

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target soft_q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample, record_target_v = q_nstep_sql_td_error(data, 0.95, 1.0, nstep=nstep)

iqn_nstep_td_data

class ding.rl_utils.td.iqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, replay_quantiles, weight)

iqn_nstep_td_error

ding.rl_utils.td.iqn_nstep_td_error
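
The autodoc body for this function did not render. Based on the iqn_nstep_td_data fields above and by analogy with qrdqn_nstep_td_error, a hypothetical usage sketch could look like the following (the quantile-first layout of q and the shape of replay_quantiles are assumptions; consult ding/rl_utils/td.py for the exact shapes and return values):

>>> next_q = torch.randn(3, 4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(3, 4, 3).requires_grad_(True)
>>> replay_quantiles = torch.randn([3, 4, 1])
>>> reward = torch.rand(nstep, 4)
>>> data = iqn_nstep_td_data(q, next_q, action, next_action, reward, done, replay_quantiles, None)
>>> loss, td_error_per_sample = iqn_nstep_td_error(data, 0.95, nstep=nstep)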

fqf_nstep_td_data

class ding.rl_utils.td.fqf_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, quantiles_hats, weight)

fqf_nstep_td_error

ding.rl_utils.td.fqf_nstep_td_error(data: collections.namedtuple, gamma: float, nstep: int = 1, kappa: float = 1.0, value_gamma: Optional[torch.Tensor] = None) torch.Tensor[源代码]
Overview:

Multistep (1 step or n step) td_error in FQF, referenced paper "Fully Parameterized Quantile Function for Distributional Reinforcement Learning" (https://arxiv.org/pdf/1911.02140.pdf)

Arguments:
  • data (fqf_nstep_td_data): The input data, fqf_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

  • criterion (torch.nn.modules): Loss function criterion

  • beta_function (Callable): The risk function

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, tau, N)\) i.e. [batch_size, tau, action_dim]

  • next_n_q (torch.FloatTensor): \((B, tau', N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • quantiles_hats (torch.FloatTensor): \((B, tau)\)

Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> quantiles_hats = torch.randn([4, 3])
>>> reward = torch.rand(nstep, 4)
>>> data = fqf_nstep_td_data(q, next_q, action, next_action, reward, done, quantiles_hats, None)
>>> loss, td_error_per_sample = fqf_nstep_td_error(data, 0.95, nstep=nstep)

evaluate_quantile_at_action

ding.rl_utils.td.evaluate_quantile_at_action(q_s, actions)[源代码]

fqf_calculate_fraction_loss

ding.rl_utils.td.fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)[源代码]
Overview:

Calculate the fraction loss in FQF, referenced paper "Fully Parameterized Quantile Function for Distributional Reinforcement Learning" (https://arxiv.org/pdf/1911.02140.pdf)

Arguments:
  • q_tau_i (torch.FloatTensor): \((batch_size, num_quantiles-1, action_dim)\)

  • q_value (torch.FloatTensor): \((batch_size, num_quantiles, action_dim)\)

  • quantiles (torch.FloatTensor): \((batch_size, num_quantiles+1)\)

  • actions (torch.LongTensor): \((batch_size, )\)

Returns:
  • fraction_loss (torch.Tensor): fraction loss, 0-dim tensor

td_lambda_data

class ding.rl_utils.td.td_lambda_data(value, reward, weight)

shape_fn_td_lambda

ding.rl_utils.td.shape_fn_td_lambda(args, kwargs)[源代码]
Overview:

Return td_lambda shape for hpc

Returns:

shape: [T, B]

td_lambda_error

ding.rl_utils.td.td_lambda_error(data: collections.namedtuple, gamma: float = 0.9, lambda_: float = 0.8) torch.Tensor[源代码]
Overview:

Computing TD(lambda) loss given constant gamma and lambda. There is no special handling for terminal state value; if some state has reached the terminal, just fill in zeros for values and rewards beyond the terminal (including the terminal state itself, i.e. values[terminal] should also be 0)

Arguments:
  • data (namedtuple): td_lambda input data with fields [‘value’, ‘reward’, ‘weight’]

  • gamma (float): Constant discount factor gamma, should be in [0, 1], defaults to 0.9

  • lambda (float): Constant lambda, should be in [0, 1], defaults to 0.8

Returns:
  • loss (torch.Tensor): Computed MSE loss, averaged over the batch

Shapes:
  • value (torch.FloatTensor): \((T+1, B)\), where T is trajectory length and B is batch, which is the estimation of the state value at step 0 to T

  • reward (torch.FloatTensor): \((T, B)\), the returns from time step 0 to T-1

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

  • loss (torch.FloatTensor): \(()\), 0-dim tensor

Examples:
>>> T, B = 8, 4
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> loss = td_lambda_error(td_lambda_data(value, reward, None))

generalized_lambda_returns

ding.rl_utils.td.generalized_lambda_returns(bootstrap_values: torch.Tensor, rewards: torch.Tensor, gammas: float, lambda_: float, done: Optional[torch.Tensor] = None) torch.Tensor[源代码]
Overview:

Functional equivalent of trfl.value_ops.generalized_lambda_returns (https://github.com/deepmind/trfl/blob/2c07ac22512a16715cc759f0072be43a5d12ae45/trfl/value_ops.py#L74). Passing in a number instead of a tensor makes that value constant for all samples in the batch.

Arguments:
  • bootstrap_values (torch.Tensor or float): estimation of the value at step 0 to T, of size [T_traj+1, batchsize]

  • rewards (torch.Tensor): The returns from 0 to T-1, of size [T_traj, batchsize]

  • gammas (torch.Tensor or float): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]

  • lambda (torch.Tensor or float): Determining the mix of bootstrapping vs further accumulation of multistep returns at each timestep, of size [T_traj, batchsize]

  • done (torch.Tensor or float): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]

Returns:
  • return (torch.Tensor): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
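
No example is rendered for this function; a minimal usage sketch consistent with the shapes above (scalar gamma and lambda_, which the overview explicitly allows) could be:

>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T + 1, B).requires_grad_(True)
>>> rewards = torch.randn(T, B)
>>> returns = generalized_lambda_returns(bootstrap_values, rewards, 0.99, 0.95)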

multistep_forward_view

ding.rl_utils.td.multistep_forward_view(bootstrap_values: torch.Tensor, rewards: torch.Tensor, gammas: float, lambda_: float, done: Optional[torch.Tensor] = None) torch.Tensor[源代码]
Overview:

Same as trfl.sequence_ops.multistep_forward_view, implementing equation (12.18) in Sutton & Barto:

result[T-1] = rewards[T-1] + gammas[T-1] * bootstrap_values[T]
for t in 0 ... T-2:
    result[t] = rewards[t] + gammas[t] * (lambdas[t] * result[t+1] + (1 - lambdas[t]) * bootstrap_values[t+1])
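
For illustration only, a plain PyTorch sketch of this backward recursion (not the library implementation, which may differ in details such as how done flags are handled). Here bootstrap_values is indexed 0-based, with bootstrap_values[t] holding the value estimate of state t+1, matching the documented [T_traj, batchsize] shape:

>>> import torch
>>> def forward_view_sketch(bootstrap_values, rewards, gammas, lambdas):
>>>     # all inputs: (T, B); bootstrap_values[t] is the value estimate of state t+1
>>>     T = rewards.shape[0]
>>>     result = torch.empty_like(rewards)
>>>     # the last step bootstraps directly on the final value estimate
>>>     result[T - 1] = rewards[T - 1] + gammas[T - 1] * bootstrap_values[T - 1]
>>>     for t in range(T - 2, -1, -1):
>>>         # mix the bootstrapped value and the accumulated multistep return
>>>         result[t] = rewards[t] + gammas[t] * (
>>>             lambdas[t] * result[t + 1] + (1 - lambdas[t]) * bootstrap_values[t]
>>>         )
>>>     return result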

Assuming the first dim of the input tensors corresponds to the trajectory (time) index and the second dim to the batch index.

Arguments:
  • bootstrap_values (torch.Tensor): Estimation of the value at step 1 to T, of size [T_traj, batchsize]

  • rewards (torch.Tensor): The returns from 0 to T-1, of size [T_traj, batchsize]

  • gammas (torch.Tensor): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]

  • lambda (torch.Tensor): Determining the mix of bootstrapping vs further accumulation of

    multistep returns at each timestep, of size [T_traj, batchsize]; the element for T-1 is ignored and effectively set to 0, as there is no information about future rewards.

  • done (torch.Tensor or float): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]

Returns:
  • ret (torch.Tensor): Computed lambda return value

    for each state from 0 to T-1, of size [T_traj, batchsize]

upgo

Please refer to ding/rl_utils/upgo for more details.

upgo_returns

ding.rl_utils.upgo.upgo_returns(rewards: torch.Tensor, bootstrap_values: torch.Tensor) torch.Tensor[源代码]
Overview:

Computing UPGO return targets. Also notice there is no special handling for the terminal state.
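
As a reminder (this definition is paraphrased from the AlphaStar literature rather than taken from this docstring), the UPGO target bootstraps whenever the sampled one-step return falls below the value estimate: \(G_t^{U} = r_t + G_{t+1}^{U}\) if \(r_{t+1} + V(s_{t+2}) \geq V(s_{t+1})\), and \(G_t^{U} = r_t + V(s_{t+1})\) otherwise, with \(G_{T-1}^{U} = r_{T-1} + V(s_T)\).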

Arguments:
  • rewards (torch.Tensor): the returns from time step 0 to T-1,

    of size [T_traj, batchsize]

  • bootstrap_values (torch.Tensor): estimation of the state value at step 0 to T,

    of size [T_traj+1, batchsize]

Returns:
  • ret (torch.Tensor): Computed lambda return value for each state from 0 to T-1,

    of size [T_traj, batchsize]

Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rewards = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B).requires_grad_(True)
>>> returns = upgo_returns(rewards, bootstrap_values)

upgo_loss

ding.rl_utils.upgo.upgo_loss(target_output: torch.Tensor, rhos: torch.Tensor, action: torch.Tensor, rewards: torch.Tensor, bootstrap_values: torch.Tensor, mask=None) torch.Tensor[源代码]
Overview:

Computing UPGO loss given constant gamma and lambda. There is no special handling for terminal state value; if the last state in the trajectory is terminal, just pass 0 as the bootstrap terminal value.

Arguments:
  • target_output (torch.Tensor): the output computed by the target policy network,

    of size [T_traj, batchsize, n_output]

  • rhos (torch.Tensor): the importance sampling ratio, of size [T_traj, batchsize]

  • action (torch.Tensor): the action taken, of size [T_traj, batchsize]

  • rewards (torch.Tensor): the returns from time step 0 to T-1, of size [T_traj, batchsize]

  • bootstrap_values (torch.Tensor): estimation of the state value at step 0 to T,

    of size [T_traj+1, batchsize]

Returns:
  • loss (torch.Tensor): Computed importance sampled UPGO loss, averaged over the samples, of size []

Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rhos = torch.randn(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> action = torch.randint(0, N, size=(T, B))
>>> rewards, bootstrap_values = torch.randn(T, B), torch.randn(T + 1, B)
>>> loss = upgo_loss(target_output, rhos, action, rewards, bootstrap_values)

value_rescale

Please refer to ding/rl_utils/value_rescale for more details.

value_transform

ding.rl_utils.value_rescale.value_transform(x: torch.Tensor, eps: float = 0.01) torch.Tensor[源代码]
Overview:

A function to reduce the scale of the action-value function: \(h(x) = \operatorname{sign}(x)(\sqrt{|x|+1} - 1) + \epsilon \cdot x\).

Arguments:
  • x: (torch.Tensor) The input tensor to be normalized.

  • eps: (float) The coefficient of the additive regularization term

    to ensure h^{-1} is Lipschitz continuous

Returns:
  • (torch.Tensor) Normalized tensor.

Note

Observe and Look Further: Achieving Consistent Performance on Atari

(https://arxiv.org/abs/1805.11593)

value_inv_transform

ding.rl_utils.value_rescale.value_inv_transform(x: torch.Tensor, eps: float = 0.01) torch.Tensor[源代码]
Overview:

The inverse form of value rescale: \(h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1+4\epsilon(|x|+1+\epsilon)}-1}{2\epsilon}\right)^{2}-1\right)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be unnormalized.

  • eps: (float) The coefficient of the additive regularization term

    to ensure h^{-1} is Lipschitz continuous

Returns:
  • (torch.Tensor) Unnormalized tensor.
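
A quick round-trip sanity check (added here, not part of the original docs) showing that value_inv_transform undoes value_transform up to floating-point error:

>>> x = torch.tensor([-10., -1., 0., 1., 10.])
>>> h_x = value_transform(x)
>>> assert torch.allclose(value_inv_transform(h_x), x, atol=1e-4)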

symlog

ding.rl_utils.value_rescale.symlog(x: torch.Tensor) torch.Tensor[源代码]
Overview:

A function to normalize the targets: \(\operatorname{symlog}(x) = \operatorname{sign}(x)\ln(|x|+1)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be normalized.

Returns:
  • (torch.Tensor) Normalized tensor.

Note

Mastering Diverse Domains through World Models

(https://arxiv.org/abs/2301.04104)

inv_symlog

ding.rl_utils.value_rescale.inv_symlog(x: torch.Tensor) torch.Tensor[源代码]
Overview:

The inverse form of symlog: \(\operatorname{symexp}(x) = \operatorname{sign}(x)(\exp(|x|)-1)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be unnormalized.

Returns:
  • (torch.Tensor) Unnormalized tensor.
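
Similarly, a small sanity check (added here, not from the source) that inv_symlog inverts symlog up to floating-point error:

>>> x = torch.tensor([-100., -1., 0., 1., 100.])
>>> assert torch.allclose(inv_symlog(symlog(x)), x, atol=1e-3)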

vtrace

Please refer to ding/rl_utils/vtrace for more details.

vtrace_nstep_return

ding.rl_utils.vtrace.vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values, gamma=0.99, lambda_=0.95)[源代码]
Overview:

Computation of vtrace return.

Returns:
  • vtrace_return (torch.FloatTensor): the computed vtrace n-step return, one value per timestep and batch element

Shapes:
  • clipped_rhos (torch.FloatTensor): \((T, B)\), where T is timestep, B is batch size

  • clipped_cs (torch.FloatTensor): \((T, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • bootstrap_values (torch.FloatTensor): \((T+1, B)\)

  • vtrace_return (torch.FloatTensor): \((T, B)\)
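
For reference, the v-trace target defined in the IMPALA paper (arXiv:1802.01561) has the form \(v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s}\big(\prod_{i=s}^{t-1} c_i\big)\rho_t\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big)\), with the clipped importance weights \(\rho_t\) and \(c_i\) passed in here as clipped_rhos and clipped_cs (the lambda_ argument additionally mixes in a TD(lambda)-style decay). A minimal usage sketch consistent with the shapes above:

>>> T, B = 4, 8
>>> clipped_rhos = torch.rand(T, B)
>>> clipped_cs = torch.rand(T, B)
>>> reward = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B)
>>> vtrace_return = vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values)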

vtrace_advantage

ding.rl_utils.vtrace.vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, gamma)[源代码]
Overview:

Computation of vtrace advantage.

Returns:
  • vtrace_advantage (torch.FloatTensor): the computed vtrace advantage, one value per timestep and batch element

Shapes:
  • clipped_pg_rhos (torch.FloatTensor): \((T, B)\), where T is timestep, B is batch size

  • reward (torch.FloatTensor): \((T, B)\)

  • return (torch.FloatTensor): \((T, B)\)

  • bootstrap_values (torch.FloatTensor): \((T, B)\)

  • vtrace_advantage (torch.FloatTensor): \((T, B)\)
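
A minimal usage sketch following the shapes above (in practice return_ is the shifted output of vtrace_nstep_return; random tensors are used here only to illustrate the call):

>>> T, B = 4, 8
>>> clipped_pg_rhos = torch.rand(T, B)
>>> reward = torch.randn(T, B)
>>> return_ = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T, B)
>>> adv = vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, 0.99)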

vtrace_data

class ding.rl_utils.vtrace.vtrace_data(target_output, behaviour_output, action, value, reward, weight)

vtrace_loss

class ding.rl_utils.vtrace.vtrace_loss(policy_loss, value_loss, entropy_loss)

vtrace_error_discrete_action

ding.rl_utils.vtrace.vtrace_error_discrete_action(data: collections.namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[源代码]
Overview:

Implementation of vtrace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for discrete action space

Arguments:
  • data (namedtuple): input data with fields shown in vtrace_data
    • target_output (torch.Tensor): the output taking the action by the current policy network, usually this output is network output logit

    • behaviour_output (torch.Tensor): the output taking the action by the behaviour policy network, usually this output is network output logit, which is used to produce the trajectory(collector)

    • action (torch.Tensor): the chosen action(index for the discrete action space) in trajectory, i.e.: behaviour_action

  • gamma (float): the future discount factor, defaults to 0.99

  • lambda_ (float): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95

  • rho_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)

  • c_clip_ratio (float): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)

  • rho_pg_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage

Returns:
  • trace_loss (namedtuple): the vtrace loss item, all of them are the differentiable 0-dim tensor

Shapes:
  • target_output (torch.FloatTensor): \((T, B, N)\), where T is timestep, B is batch size and N is action dim

  • behaviour_output (torch.FloatTensor): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • value (torch.FloatTensor): \((T+1, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.FloatTensor or None): \((T, B)\)

Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> behaviour_output = torch.randn(T, B, N)
>>> action = torch.randint(0, N, size=(T, B))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_discrete_action(data, rho_clip_ratio=1.1)

vtrace_error_continuous_action

ding.rl_utils.vtrace.vtrace_error_continuous_action(data: collections.namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[源代码]
Overview:

Implementation of vtrace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for continuous action space

Arguments:
  • data (namedtuple): input data with fields shown in vtrace_data
    • target_output (dict{key:torch.Tensor}): the output taking the action by the current policy network, usually this output is network output, which represents the distribution by reparameterization trick.

    • behaviour_output (dict{key:torch.Tensor}): the output taking the action by the behaviour policy network, usually this output is network output logit, which represents the distribution by reparameterization trick.

    • action (torch.Tensor): the chosen action (continuous action vector) in the trajectory, i.e. behaviour_action

  • gamma (float): the future discount factor, defaults to 0.99

  • lambda_ (float): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95

  • rho_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)

  • c_clip_ratio (float): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)

  • rho_pg_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage

Returns:
  • trace_loss (namedtuple): the vtrace loss item, all of them are the differentiable 0-dim tensor

Shapes:
  • target_output (dict{key:torch.FloatTensor}): \((T, B, N)\), where T is timestep, B is batch size and N is action dim. The keys are usually parameters of reparameterization trick.

  • behaviour_output (dict{key:torch.FloatTensor}): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • value (torch.FloatTensor): \((T+1, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.FloatTensor or None): \((T, B)\)

Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = dict(
>>>     mu=torch.randn(T, B, N).requires_grad_(True),
>>>     sigma=torch.exp(torch.randn(T, B, N).requires_grad_(True)),
>>> )
>>> behaviour_output = dict(
>>>     mu=torch.randn(T, B, N),
>>>     sigma=torch.exp(torch.randn(T, B, N)),
>>> )
>>> action = torch.randn((T, B, N))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_continuous_action(data, rho_clip_ratio=1.1)