
ding.rl_utils

a2c

Please refer to ding/rl_utils/a2c for more details.

a2c_error

ding.rl_utils.a2c_error(data: collections.namedtuple) → collections.namedtuple[source]
Overview:

Implementation of A2C (Advantage Actor-Critic) (arXiv:1602.01783) for discrete action space

Arguments:
  • data (namedtuple): a2c input data with fields shown in a2c_data

Returns:
  • a2c_loss (namedtuple): the a2c loss item, all of them are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • value (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> data = a2c_data(
>>>     logit=torch.randn(2, 3),
>>>     action=torch.randint(0, 3, (2, )),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error(data)

a2c_error_continuous

ding.rl_utils.a2c_error_continuous(data: collections.namedtuple) → collections.namedtuple[source]
Overview:

Implementation of A2C (Advantage Actor-Critic) (arXiv:1602.01783) for continuous action space

Arguments:
  • data (namedtuple): a2c input data with fields shown in a2c_data

Returns:
  • a2c_loss (namedtuple): the a2c loss item, all of them are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, N)\)

  • value (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> data = a2c_data(
>>>     logit={'mu': torch.randn(2, 3), 'sigma': torch.sqrt(torch.randn(2, 3)**2)},
>>>     action=torch.randn(2, 3),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error_continuous(data)

acer

Please refer to ding/rl_utils/acer for more details.

acer_policy_error

ding.rl_utils.acer_policy_error(q_values: torch.Tensor, q_retraces: torch.Tensor, v_pred: torch.Tensor, target_logit: torch.Tensor, actions: torch.Tensor, ratio: torch.Tensor, c_clip_ratio: float = 10.0) → Tuple[torch.Tensor, torch.Tensor][source]
Overview:

Get ACER policy loss.

Arguments:
  • q_values (torch.Tensor): Q values

  • q_retraces (torch.Tensor): Q values calculated by the retrace method

  • v_pred (torch.Tensor): V values

  • target_pi (torch.Tensor): The new policy’s probability

  • actions (torch.Tensor): The actions in replay buffer

  • ratio (torch.Tensor): ratio of the new policy to the behavior policy

  • c_clip_ratio (float): clip value for ratio

Returns:
  • actor_loss (torch.Tensor): policy loss from q_retrace

  • bc_loss (torch.Tensor): bias correction policy loss

Shapes:
  • q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim

  • q_retraces (torch.FloatTensor): \((T, B, 1)\)

  • v_pred (torch.FloatTensor): \((T, B, 1)\)

  • target_pi (torch.FloatTensor): \((T, B, N)\)

  • actions (torch.LongTensor): \((T, B)\)

  • ratio (torch.FloatTensor): \((T, B, N)\)

  • actor_loss (torch.FloatTensor): \((T, B, 1)\)

  • bc_loss (torch.FloatTensor): \((T, B, 1)\)

Examples:
>>> q_values=torch.randn(2, 3, 4),
>>> q_retraces=torch.randn(2, 3, 1),
>>> v_pred=torch.randn(2, 3, 1),
>>> target_pi=torch.randn(2, 3, 4),
>>> actions=torch.randint(0, 4, (2, 3)),
>>> ratio=torch.randn(2, 3, 4),
>>> loss = acer_policy_error(q_values, q_retraces, v_pred, target_pi, actions, ratio)

acer_value_error

ding.rl_utils.acer_value_error(q_values, q_retraces, actions)[source]
Overview:

Get ACER critic loss.

Arguments:
  • q_values (torch.Tensor): Q values

  • q_retraces (torch.Tensor): Q values calculated by the retrace method

  • actions (torch.Tensor): The actions in replay buffer

  • ratio (torch.Tensor): ratio of the new policy to the behavior policy

Returns:
  • critic_loss (torch.Tensor): critic loss

Shapes:
  • q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim

  • q_retraces (torch.FloatTensor): \((T, B, 1)\)

  • actions (torch.LongTensor): \((T, B)\)

  • critic_loss (torch.FloatTensor): \((T, B, 1)\)

Examples:
>>> q_values=torch.randn(2, 3, 4)
>>> q_retraces=torch.randn(2, 3, 1)
>>> actions=torch.randint(0, 4, (2, 3))
>>> loss = acer_value_error(q_values, q_retraces, actions)

acer_trust_region_update

ding.rl_utils.acer_trust_region_update(actor_gradients: List[torch.Tensor], target_logit: torch.Tensor, avg_logit: torch.Tensor, trust_region_value: float) → List[torch.Tensor][source]
Overview:

Calculate gradients with a trust region constraint

Arguments:
  • actor_gradients (list(torch.Tensor)): gradient values for different parts

  • target_pi (torch.Tensor): The new policy’s probability

  • avg_pi (torch.Tensor): The average policy’s probability

  • trust_region_value (float): the range of trust region

Returns:
  • update_gradients (list(torch.Tensor)): gradients with trust region constraint

Shapes:
  • target_pi (torch.FloatTensor): \((T, B, N)\)

  • avg_pi (torch.FloatTensor): \((T, B, N)\)

  • update_gradients (list(torch.FloatTensor)): \((T, B, N)\)

Examples:
>>> actor_gradients=[torch.randn(2, 3, 4)]
>>> target_pi=torch.randn(2, 3, 4)
>>> avg_pi=torch.randn(2, 3, 4)
>>> loss = acer_trust_region_update(actor_gradients, target_pi, avg_pi, 0.1)

adder

Please refer to ding/rl_utils/adder for more details.

Adder

class ding.rl_utils.adder.Adder[source]
Overview:

Adder is a component that handles different transformations and calculations for transitions in the Collector Module (data generation and processing), such as GAE, n-step return, transition sampling, etc.

Interface:

__init__, get_gae, get_gae_with_default_last_value, get_nstep_return_data, get_train_sample

classmethod _get_null_transition(template: dict, null_transition: Optional[dict] = None) → dict[source]
Overview:

Get null transition for padding. If cls._null_transition is None, return input template instead.

Arguments:
  • template (dict): The template for null transition.

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • null_transition (dict): The deepcopied null transition.
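
A minimal usage sketch (the transition keys below are only illustrative):
>>> import torch
>>> from ding.rl_utils.adder import Adder
>>> template = dict(obs=torch.randn(4), reward=torch.zeros(1), done=False)
>>> null = Adder._get_null_transition(template)  # deepcopy of template, since no null_transition is given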

classmethod get_gae(data: List[Dict[str, Any]], last_value: torch.Tensor, gamma: float, gae_lambda: float, cuda: bool) → List[Dict[str, Any]][source]
Overview:

Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for calculation.

Arguments:
  • data (list): Transitions list, each element is a transition dict with at least [‘value’, ‘reward’]

  • last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

  • cuda (bool): Whether use cuda in GAE computation

Returns:
  • data (list): transitions list like the input one, but each element owns an extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)
classmethod get_gae_with_default_last_value(data: collections.deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) → List[Dict[str, Any]][source]
Overview:

Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case where last_value is not passed. If the transition is not done yet, it would use the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would set last_value to 0.

Arguments:
  • data (deque): Transitions list, each element is a transition dict with at least [‘value’, ‘reward’]

  • done (bool): Whether the transition reaches the end of an episode (i.e. whether the env is done)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

  • cuda (bool): Whether use cuda in GAE computation

Returns:
  • data (List[Dict[str, Any]]): transitions list like the input one, but each element owns an extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)
classmethod get_nstep_return_data(data: collections.deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) → collections.deque[source]
Overview:

Process raw traj data by updating keys [‘next_obs’, ‘reward’, ‘done’] in data’s dict element.

Arguments:
  • data (deque): Transitions list, each element is a transition dict

  • nstep (int): Number of steps. If it equals 1, return data directly; otherwise update each element with the nstep value.

Returns:
  • data (deque): Transitions list like the input one, but each element updated with the nstep value.

Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)
classmethod get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: Optional[dict] = None) → List[Dict[str, Any]][source]
Overview:

Process raw trajectory data by updating the keys [‘next_obs’, ‘reward’, ‘done’] in each dict element of data. If unroll_len equals 1, no processing is needed and data is returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data.

Arguments:
  • data (List[Dict[str, Any]]): Transitions list, each element is a transition dict

  • unroll_len (int): Learn training unroll length

  • last_fn_type (str): The method type name for dealing with last residual data in a traj after splitting, should be in [‘last’, ‘drop’, ‘null_padding’]

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • data (List[Dict[str, Any]]): Transitions list processed after unrolling
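
A minimal usage sketch (the transition keys are illustrative; any per-step dict works):
>>> import torch
>>> from ding.rl_utils.adder import Adder
>>> T, unroll_len = 5, 2
>>> data = [dict(obs=torch.randn(4), reward=torch.randn(1), done=False) for _ in range(T)]
>>> samples = Adder.get_train_sample(data, unroll_len, last_fn_type='null_padding')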

get_gae

ding.rl_utils.adder.get_gae(data: List[Dict[str, Any]], last_value: torch.Tensor, gamma: float, gae_lambda: float, cuda: bool) → List[Dict[str, Any]]
Overview:

Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for calculation.

Arguments:
  • data (list): Transitions list, each element is a transition dict with at least [‘value’, ‘reward’]

  • last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

  • cuda (bool): Whether use cuda in GAE computation

Returns:
  • data (list): transitions list like the input one, but each element owns an extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)

get_gae_with_default_last_value

ding.rl_utils.adder.get_gae_with_default_last_value(data: collections.deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) → List[Dict[str, Any]]
Overview:

Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case where last_value is not passed. If the transition is not done yet, it would use the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would set last_value to 0.

Arguments:
  • data (deque): Transitions list, each element is a transition dict with at least [‘value’, ‘reward’]

  • done (bool): Whether the transition reaches the end of an episode (i.e. whether the env is done)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

  • cuda (bool): Whether use cuda in GAE computation

Returns:
  • data (List[Dict[str, Any]]): transitions list like the input one, but each element owns an extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)

get_nstep_return_data

ding.rl_utils.adder.get_nstep_return_data(data: collections.deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) → collections.deque
Overview:

Process raw traj data by updating keys [‘next_obs’, ‘reward’, ‘done’] in data’s dict element.

Arguments:
  • data (deque): Transitions list, each element is a transition dict

  • nstep (int): Number of steps. If it equals 1, return data directly; otherwise update each element with the nstep value.

Returns:
  • data (deque): Transitions list like the input one, but each element updated with the nstep value.

Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)

get_train_sample

ding.rl_utils.adder.get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: Optional[dict] = None) → List[Dict[str, Any]]
Overview:

Process raw trajectory data by updating the keys [‘next_obs’, ‘reward’, ‘done’] in each dict element of data. If unroll_len equals 1, no processing is needed and data is returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data.

Arguments:
  • data (List[Dict[str, Any]]): Transitions list, each element is a transition dict

  • unroll_len (int): Learn training unroll length

  • last_fn_type (str): The method type name for dealing with last residual data in a traj after splitting, should be in [‘last’, ‘drop’, ‘null_padding’]

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • data (List[Dict[str, Any]]): Transitions list processed after unrolling
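
A minimal usage sketch of the module-level function (the transition keys are illustrative):
>>> import torch
>>> from ding.rl_utils.adder import get_train_sample
>>> T, unroll_len = 5, 2
>>> data = [dict(obs=torch.randn(4), reward=torch.randn(1), done=False) for _ in range(T)]
>>> samples = get_train_sample(data, unroll_len, last_fn_type='drop')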

beta_function

Please refer to ding/rl_utils/beta_function for more details.

cpw

ding.rl_utils.beta_function.cpw(x: Union[torch.Tensor, float], eta: float = 0.71) → Union[torch.Tensor, float][source]

CVaR

ding.rl_utils.beta_function.CVaR(x: Union[torch.Tensor, float], eta: float = 0.71) → Union[torch.Tensor, float][source]

beta_function_map

rl_utils.beta_function_map = {'CPW': <function cpw>, 'CVaR': <function CVaR>, 'Pow': <function Pow>, 'uniform': <function <lambda>>}
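
A minimal usage sketch of the beta functions listed above (per the signatures, each accepts a tensor or a float):
>>> import torch
>>> from ding.rl_utils.beta_function import cpw, CVaR
>>> tau = torch.rand(8)  # quantile fractions in [0, 1]
>>> w1 = cpw(tau, eta=0.71)
>>> w2 = CVaR(tau, eta=0.71)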

coma

Please refer to ding/rl_utils/coma for more details.

coma_error

ding.rl_utils.coma_error(data: collections.namedtuple, gamma: float, lambda_: float) → collections.namedtuple[source]
Overview:

Implementation of COMA

Arguments:
  • data (namedtuple): coma input data with fields shown in coma_data

Returns:
  • coma_loss (namedtuple): the coma loss item, all of them are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((T, B, A, N)\), where B is batch size, A is the agent num, and N is action dim

  • action (torch.LongTensor): \((T, B, A)\)

  • q_value (torch.FloatTensor): \((T, B, A, N)\)

  • target_q_value (torch.FloatTensor): \((T, B, A, N)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.FloatTensor or None): \((T, B, A)\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> agent_num = 3
>>> data = coma_data(
>>>     logit=torch.randn(2, 3, agent_num, action_dim),
>>>     action=torch.randint(0, action_dim, (2, 3, agent_num)),
>>>     q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     target_q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     reward=torch.randn(2, 3),
>>>     weight=torch.ones(2, 3, agent_num),
>>> )
>>> loss = coma_error(data, 0.99, 0.99)

exploration

Please refer to ding/rl_utils/exploration for more details.

get_epsilon_greedy_fn

ding.rl_utils.exploration.get_epsilon_greedy_fn(start: float, end: float, decay: int, type_: str = 'exp') → Callable[source]
Overview:

Generate an epsilon_greedy function with decay, which inputs current timestep and outputs current epsilon.

Arguments:
  • start (float): Epsilon start value. For ‘linear’, it should be 1.0.

  • end (float): Epsilon end value.

  • decay (int): Controls the speed at which epsilon decreases from start to end. We recommend that epsilon decays according to env step rather than iteration.

  • type (str): How epsilon decays, now supports [‘linear’, ‘exp’ (exponential)]

Returns:
  • eps_fn (function): The epsilon greedy function with decay
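
A minimal usage sketch (the parameter values are illustrative):
>>> from ding.rl_utils.exploration import get_epsilon_greedy_fn
>>> eps_fn = get_epsilon_greedy_fn(start=0.95, end=0.1, decay=10000, type_='exp')
>>> eps = eps_fn(1000)  # current epsilon at env step 1000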

BaseNoise

class ding.rl_utils.exploration.BaseNoise[source]
Overview:

Base class for action noise

Interface:

__init__, __call__

Examples:
>>> noise_generator = OUNoise()  # init one type of noise
>>> noise = noise_generator(action.shape, action.device)  # generate noise
abstract __call__(shape: tuple, device: str) → torch.Tensor[source]
Overview:

Generate noise according to action tensor’s shape, device

Arguments:
  • shape (tuple): size of the action tensor, output noise’s size should be the same

  • device (str): device of the action tensor, output noise’s device should be the same as it

Returns:
  • noise (torch.Tensor): generated action noise, which has the same shape and device as the input action tensor

__init__() → None[source]
Overview:

Initialization method

GaussianNoise

class ding.rl_utils.exploration.GaussianNoise(mu: float = 0.0, sigma: float = 1.0)[source]
Overview:

Derived class for generating gaussian noise, which satisfies \(X \sim N(\mu, \sigma^2)\)

Interface:

__init__, __call__

__call__(shape: tuple, device: str) → torch.Tensor[source]
Overview:

Generate gaussian noise according to action tensor’s shape, device

Arguments:
  • shape (tuple): size of the action tensor, output noise’s size should be the same

  • device (str): device of the action tensor, output noise’s device should be the same as it

Returns:
  • noise (torch.Tensor): generated action noise, which has the same shape and device as the input action tensor

__init__(mu: float = 0.0, sigma: float = 1.0) → None[source]
Overview:

Initialize \(\mu\) and \(\sigma\) in Gaussian Distribution

Arguments:
  • mu (float): \(\mu\) , mean value

  • sigma (float): \(\sigma\) , standard deviation, should be positive
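
A minimal usage sketch, following the BaseNoise calling convention documented above:
>>> import torch
>>> from ding.rl_utils.exploration import GaussianNoise
>>> action = torch.randn(4, 2)
>>> noise_generator = GaussianNoise(mu=0.0, sigma=0.1)
>>> noise = noise_generator(action.shape, action.device)
>>> noisy_action = action + noise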

OUNoise

class ding.rl_utils.exploration.OUNoise(mu: float = 0.0, sigma: float = 0.3, theta: float = 0.15, dt: float = 0.01, x0: Optional[Union[float, torch.Tensor]] = 0.0)[source]
Overview:

Derived class for generating Ornstein-Uhlenbeck process noise. Satisfies \(dx_t=\theta(\mu-x_t)dt + \sigma dW_t\), where \(W_t\) denotes a Wiener process, acting as a random perturbation term.

Interface:

__init__, reset, __call__

reset() → None[source]
Overview:

Reset _x to the initial state _x0

property x0: Union[float, torch.Tensor]
Overview:

Get self._x0
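
A minimal usage sketch; reset() restores the internal state _x to the initial state _x0 between episodes:
>>> from ding.rl_utils.exploration import OUNoise
>>> noise_generator = OUNoise(mu=0.0, sigma=0.3, theta=0.15)
>>> noise = noise_generator((4, 2), 'cpu')
>>> noise_generator.reset()  # start a fresh process, e.g. at an episode boundary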

noise_mapping

exploration.noise_mapping = {'gauss': <class 'ding.rl_utils.exploration.GaussianNoise'>, 'ou': <class 'ding.rl_utils.exploration.OUNoise'>}

create_noise_generator

ding.rl_utils.exploration.create_noise_generator(noise_type: str, noise_kwargs: dict) → ding.rl_utils.exploration.BaseNoise[source]
Overview:

Given the key (noise_type), create a new noise generator instance if the key is in noise_mapping, otherwise raise a KeyError. In other words, a derived noise generator must first be registered, then create_noise_generator can be called to get the instance object.

Arguments:
  • noise_type (str): the type of noise generator to be created

Returns:
  • noise (BaseNoise): the created new noise generator, should be an instance of one of noise_mapping’s values
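
A minimal usage sketch (assuming noise_kwargs is forwarded to the chosen noise class constructor):
>>> from ding.rl_utils.exploration import create_noise_generator
>>> noise_generator = create_noise_generator('gauss', {'mu': 0.0, 'sigma': 0.1})
>>> noise = noise_generator((4, 2), 'cpu')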

gae

Please refer to ding/rl_utils/gae for more details.

gae_data

class ding.rl_utils.gae.gae_data(value, next_value, reward, done, traj_flag)

shape_fn_gae

ding.rl_utils.gae.shape_fn_gae(args, kwargs)[source]
Overview:

Return shape of gae for hpc

Returns:

shape: [T, B]

gae

ding.rl_utils.gae.gae(data: collections.namedtuple, gamma: float = 0.99, lambda_: float = 0.97) → torch.FloatTensor[source]
Overview:

Implementation of Generalized Advantage Estimator (arXiv:1506.02438)

Arguments:
  • data (namedtuple): gae input data with fields [‘value’, ‘reward’], which contains some episodes or trajectories data.

  • gamma (float): the future discount factor, should be in [0, 1], defaults to 0.99.

  • lambda (float): the gae parameter lambda, should be in [0, 1], defaults to 0.97; when lambda -> 0 it induces bias, and when lambda -> 1 it has high variance due to the sum of terms.

Returns:
  • adv (torch.FloatTensor): the calculated advantage

Shapes:
  • value (torch.FloatTensor): \((T, B)\), where T is trajectory length and B is batch size

  • next_value (torch.FloatTensor): \((T, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • adv (torch.FloatTensor): \((T, B)\)

Examples:
>>> value = torch.randn(2, 3)
>>> next_value = torch.randn(2, 3)
>>> reward = torch.randn(2, 3)
>>> data = gae_data(value, next_value, reward, None, None)
>>> adv = gae(data)

isw

Please refer to ding/rl_utils/isw for more details.

compute_importance_weights

ding.rl_utils.isw.compute_importance_weights(target_output: Union[torch.Tensor, dict], behaviour_output: Union[torch.Tensor, dict], action: torch.Tensor, action_space_type: str = 'discrete', requires_grad: bool = False)[source]
Overview:

Compute importance sampling weights with the given outputs and actions

Arguments:
  • target_output (Union[torch.Tensor,dict]): the output of the current policy network for the taken action; usually this output is the network output logit if the action space is discrete, or a dict containing parameters of the action distribution if the action space is continuous.

  • behaviour_output (Union[torch.Tensor,dict]): the output of the behaviour policy network for the taken action; usually this output is the network output logit if the action space is discrete, or a dict containing parameters of the action distribution if the action space is continuous.

  • action (torch.Tensor): the chosen action (index for the discrete action space) in the trajectory, i.e. behaviour_action

  • action_space_type (str): action space types in [‘discrete’, ‘continuous’]

  • requires_grad (bool): whether requires grad computation

Returns:
  • rhos (torch.Tensor): Importance sampling weight

Shapes:
  • target_output (Union[torch.FloatTensor,dict]): \((T, B, N)\), where T is timestep, B is batch size and N is action dim

  • behaviour_output (Union[torch.FloatTensor,dict]): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • rhos (torch.FloatTensor): \((T, B)\)

Examples:
>>> target_output = torch.randn(2, 3, 4)
>>> behaviour_output = torch.randn(2, 3, 4)
>>> action = torch.randint(0, 4, (2, 3))
>>> rhos = compute_importance_weights(target_output, behaviour_output, action)

ppg

Please refer to ding/rl_utils/ppg for more details.

ppg_data

class ding.rl_utils.ppg.ppg_data(logit_new, logit_old, action, value_new, value_old, return_, weight)

ppg_joint_loss

class ding.rl_utils.ppg.ppg_joint_loss(auxiliary_loss, behavioral_cloning_loss)

ppg_joint_error

ding.rl_utils.ppg.ppg_joint_error(data: collections.namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Get PPG joint loss

Arguments:
  • data (namedtuple): ppg input data with fields shown in ppg_data

  • clip_ratio (float): clip value for ratio

  • use_value_clip (bool): whether use value clip

Returns:
  • ppg_joint_loss (namedtuple): the ppg loss item, all of them are differentiable 0-dim tensors

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B,)\)

  • value_new (torch.FloatTensor): \((B, 1)\)

  • value_old (torch.FloatTensor): \((B, 1)\)

  • return (torch.FloatTensor): \((B, 1)\)

  • weight (torch.FloatTensor): \((B,)\)

  • auxiliary_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • behavioral_cloning_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppg_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3, 1),
>>>     value_old=torch.randn(3, 1),
>>>     return_=torch.randn(3, 1),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppg_joint_error(data, 0.99, 0.99)

ppo

Please refer to ding/rl_utils/ppo for more details.

ppo_data

class ding.rl_utils.ppo.ppo_data(logit_new, logit_old, action, value_new, value_old, adv, return_, weight)

ppo_policy_data

class ding.rl_utils.ppo.ppo_policy_data(logit_new, logit_old, action, adv, weight)

ppo_value_data

class ding.rl_utils.ppo.ppo_value_data(value_new, value_old, return_, weight)

ppo_loss

class ding.rl_utils.ppo.ppo_loss(policy_loss, value_loss, entropy_loss)

ppo_policy_loss

class ding.rl_utils.ppo.ppo_policy_loss(policy_loss, entropy_loss)

ppo_info

class ding.rl_utils.ppo.ppo_info(approx_kl, clipfrac)

shape_fn_ppo

ding.rl_utils.ppo.shape_fn_ppo(args, kwargs)[source]
Overview:

Return shape of ppo for hpc

Returns:

shape: [B, N]

ppo_error

ding.rl_utils.ppo.ppo_error(data: collections.namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: Optional[float] = None) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • use_value_clip (bool): whether to use clip in value loss with the same ratio as policy

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalar

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • value_new (torch.FloatTensor): \((B, )\)

  • value_old (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error(data)

Note

adv is an already normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8), and there are many ways to calculate this mean and std, such as over the data buffer or the train batch, so we don’t couple this part into ppo_error; you can refer to our examples for different ways.
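
For reference, a minimal sketch of the batch-level normalization described in the note (other granularities, e.g. over a whole data buffer, follow the same pattern):
>>> import torch
>>> adv = torch.randn(3)
>>> adv = (adv - adv.mean()) / (adv.std() + 1e-8)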

ppo_policy_error

ding.rl_utils.ppo.ppo_policy_error(data: collections.namedtuple, clip_ratio: float = 0.2, dual_clip: Optional[float] = None) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Get PPO policy loss

Arguments:
  • data (namedtuple): ppo input data with fields shown in ppo_policy_data

  • clip_ratio (float): clip value for ratio

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_policy_loss (namedtuple): the ppo policy loss item, all of them are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalar

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_policy_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error(data)

ppo_value_error

ding.rl_utils.ppo.ppo_value_error(data: collections.namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) → torch.Tensor[source]
Overview:

Get PPO value loss

Arguments:
  • data (namedtuple): ppo input data with fields shown in ppo_value_data

  • clip_ratio (float): clip value for ratio

  • use_value_clip (bool): whether use value clip

Returns:
  • value_loss (torch.FloatTensor): the ppo value loss item, a differentiable 0-dim tensor

Shapes:
  • value_new (torch.FloatTensor): \((B, )\), where B is batch size

  • value_old (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • value_loss (torch.FloatTensor): \(()\), 0-dim tensor

Examples:
>>> action_dim = 4
>>> data = ppo_value_data(
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppo_value_error(data)

ppo_error_continuous

ding.rl_utils.ppo.ppo_error_continuous(data: collections.namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: Optional[float] = None) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • use_value_clip (bool): whether to use clip in value loss with the same ratio as policy

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalar

Shapes:
  • mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • value_new (torch.FloatTensor): \((B, )\)

  • value_old (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_data_continuous(
>>>     mu_sigma_new= dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old= dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error_continuous(data)

Note

adv is an already normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8), and there are many ways to calculate this mean and std, such as over the data buffer or the train batch, so we don’t couple this part into ppo_error; you can refer to our examples for different ways.

ppo_policy_error_continuous

ding.rl_utils.ppo.ppo_policy_error_continuous(data: collections.namedtuple, clip_ratio: float = 0.2, dual_clip: Optional[float] = None) → Tuple[collections.namedtuple, collections.namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalar

Shapes:
  • mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_policy_data_continuous(
>>>     mu_sigma_new=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error_continuous(data)

retrace

Please refer to ding/rl_utils/retrace for more details.

compute_q_retraces

ding.rl_utils.retrace.compute_q_retraces(q_values: torch.Tensor, v_pred: torch.Tensor, rewards: torch.Tensor, actions: torch.Tensor, weights: torch.Tensor, ratio: torch.Tensor, gamma: float = 0.9) → torch.Tensor[source]
Shapes:
  • q_values (torch.Tensor): \((T + 1, B, N)\), where T is unroll_len, B is batch size, N is discrete action dim.

  • v_pred (torch.Tensor): \((T + 1, B, 1)\)

  • rewards (torch.Tensor): \((T, B)\)

  • actions (torch.Tensor): \((T, B)\)

  • weights (torch.Tensor): \((T, B)\)

  • ratio (torch.Tensor): \((T, B, N)\)

  • q_retraces (torch.Tensor): \((T + 1, B, 1)\)

Examples:
>>> T=2
>>> B=3
>>> N=4
>>> q_values=torch.randn(T+1, B, N)
>>> v_pred=torch.randn(T+1, B, 1)
>>> rewards=torch.randn(T, B)
>>> actions=torch.randint(0, N, (T, B))
>>> weights=torch.ones(T, B)
>>> ratio=torch.randn(T, B, N)
>>> q_retraces = compute_q_retraces(q_values, v_pred, rewards, actions, weights, ratio)

Note

The q_retrace operation doesn’t need to compute gradients; it just executes forward computation.

sampler

Please refer to ding/rl_utils/sampler for more details.

ArgmaxSampler

class ding.rl_utils.sampler.ArgmaxSampler[source]
Overview:

Argmax sampler, return the index of the maximum value

MultinomialSampler

class ding.rl_utils.sampler.MultinomialSampler[source]
Overview:

Multinomial sampler, return the index of the sampled value

MuSampler

class ding.rl_utils.sampler.MuSampler[source]
Overview:

Mu sampler, return the mu of the input tensor

ReparameterizationSampler

class ding.rl_utils.sampler.ReparameterizationSampler[source]
Overview:

Reparameterization sampler, return the reparameterized value of the input tensor

HybridStochasticSampler

class ding.rl_utils.sampler.HybridStochasticSampler[source]
Overview:

Hybrid stochastic sampler, return the sampled action type and the reparameterized action args

HybridDeterminsticSampler

class ding.rl_utils.sampler.HybridDeterminsticSampler[source]
Overview:

Hybrid deterministic sampler, return the argmax action type and the mu action args
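
A minimal usage sketch for the discrete samplers above (assuming each sampler is a callable that takes a logit tensor of shape (B, N) and returns an action index of shape (B, )):
>>> import torch
>>> from ding.rl_utils.sampler import ArgmaxSampler, MultinomialSampler
>>> logit = torch.randn(4, 6)
>>> greedy_action = ArgmaxSampler()(logit)       # index of the maximum value per sample
>>> stochastic_action = MultinomialSampler()(logit)  # index sampled from the logit distribution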

td

Please refer to ding/rl_utils/td for more details.

q_1step_td_data

class ding.rl_utils.td.q_1step_td_data(q, next_q, act, next_act, reward, done, weight)

q_1step_td_error

ding.rl_utils.td.q_1step_td_error(data: collections.namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

1 step td_error, supporting both the single-agent case and the multi-agent case.

Arguments:
  • data (q_1step_td_data): The input data, q_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error

Shapes:
  • data (q_1step_td_data): the q_1step_td_data containing [‘q’, ‘next_q’, ‘act’, ‘next_act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • act (torch.LongTensor): \((B, )\)

  • next_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     next_act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)).bool(),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_1step_td_error(data, 0.99)

m_q_1step_td_data

class ding.rl_utils.td.m_q_1step_td_data(q, target_q, next_q, act, reward, done, weight)

m_q_1step_td_error

ding.rl_utils.td.m_q_1step_td_error(data: collections.namedtuple, gamma: float, tau: float, alpha: float, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

Munchausen td_error for the DQN algorithm, supporting 1 step td error.

Arguments:
  • data (m_q_1step_td_data): The input data, m_q_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • tau (float): Entropy factor for Munchausen DQN

  • alpha (float): Discount factor for Munchausen term

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (m_q_1step_td_data): the m_q_1step_td_data containing [‘q’, ‘target_q’, ‘next_q’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • target_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = m_q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     target_q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = m_q_1step_td_error(data, 0.99, 0.01, 0.01)

q_v_1step_td_data

class ding.rl_utils.td.q_v_1step_td_data(q, v, act, reward, done, weight)

q_v_1step_td_error

ding.rl_utils.td.q_v_1step_td_error(data: collections.namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

td_error between q and v values for the SAC algorithm, supporting 1 step td error.

Arguments:
  • data (q_v_1step_td_data): The input data, q_v_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (q_v_1step_td_data): the q_v_1step_td_data containing [‘q’, ‘v’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • v (torch.FloatTensor): \((B, )\)

  • act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = q_v_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     v=torch.randn(3),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_v_1step_td_error(data, 0.99)

nstep_return_data

class ding.rl_utils.td.nstep_return_data(reward, next_value, done)

nstep_return

ding.rl_utils.td.nstep_return(data: collections.namedtuple, gamma: Union[float, list], nstep: int, value_gamma: Optional[torch.Tensor] = None)[source]
Overview:

Calculate nstep return for the DQN algorithm, supporting both the single-agent case and the multi-agent case.

Arguments:
  • data (nstep_return_data): The input data, nstep_return_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num

  • value_gamma (torch.Tensor): Discount factor for value

Returns:
  • return (torch.Tensor): nstep return

Shapes:
  • data (nstep_return_data): the nstep_return_data containing [‘reward’, ‘next_value’, ‘done’]

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • next_value (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> data = nstep_return_data(
>>>     reward=torch.randn(3, 3),
>>>     next_value=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>> )
>>> loss = nstep_return(data, 0.99, 3)

dist_1step_td_data

class ding.rl_utils.td.dist_1step_td_data(dist, next_dist, act, next_act, reward, done, weight)

dist_1step_td_error

ding.rl_utils.td.dist_1step_td_error(data: collections.namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int) → torch.Tensor[source]
Overview:

1 step td_error for distributional q-learning based algorithms

Arguments:
  • data (dist_1step_td_data): The input data, dist_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • v_min (float): The min value of support

  • v_max (float): The max value of support

  • n_atom (int): The num of atom

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (dist_1step_td_data): the dist_1step_td_data containing [‘dist’, ‘next_n_dist’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]

  • next_dist (torch.FloatTensor): \((B, N, n_atom)\)

  • act (torch.LongTensor): \((B, )\)

  • next_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_dist = torch.randn(4, 3, 51).abs()
>>> act = torch.randint(0, 3, (4,))
>>> next_act = torch.randint(0, 3, (4,))
>>> reward = torch.randn(4)
>>> done = torch.randint(0, 2, (4,))
>>> data = dist_1step_td_data(dist, next_dist, act, next_act, reward, done, None)
>>> loss = dist_1step_td_error(data, 0.99, -10.0, 10.0, 51)

dist_nstep_td_data

ding.rl_utils.td.dist_nstep_td_data

alias of ding.rl_utils.td.dist_1step_td_data

shape_fn_dntd

ding.rl_utils.td.shape_fn_dntd(args, kwargs)[source]
Overview:

Return dntd shape for hpc

Returns:

shape: [T, B, N, n_atom]

dist_nstep_td_error

ding.rl_utils.td.dist_nstep_td_error(data: collections.namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int, nstep: int = 1, value_gamma: Optional[torch.Tensor] = None) → torch.Tensor[source]
Overview:

Multistep (1 step or n step) td_error for distributional q-learning based algorithms, supporting both the single-agent case and the multi-agent case.

Arguments:
  • data (dist_nstep_td_data): The input data, dist_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (dist_nstep_td_data): the dist_nstep_td_data containing [‘dist’, ‘next_n_dist’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]

  • next_n_dist (torch.FloatTensor): \((B, N, n_atom)\)

  • act (torch.LongTensor): \((B, )\)

  • next_n_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_n_dist = torch.randn(4, 3, 51).abs()
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> reward = torch.randn(5, 4)
>>> data = dist_nstep_td_data(dist, next_n_dist, action, next_action, reward, done, None)
>>> loss, _ = dist_nstep_td_error(data, 0.95, -10.0, 10.0, 51, 5)

v_1step_td_data

class ding.rl_utils.td.v_1step_td_data(v, next_v, reward, done, weight)

v_1step_td_error

ding.rl_utils.td.v_1step_td_error(data: collections.namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

1 step td_error for value based algorithms

Arguments:
  • data (v_1step_td_data): The input data, v_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (v_1step_td_data): the v_1step_td_data containing [‘v’, ‘next_v’, ‘reward’, ‘done’, ‘weight’]

  • v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]

  • next_v (torch.FloatTensor): \((B, )\)

  • reward (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5)
>>> done = torch.zeros(5)
>>> data = v_1step_td_data(v, next_v, reward, done, None)
>>> loss, td_error_per_sample = v_1step_td_error(data, 0.99)

v_nstep_td_data

class ding.rl_utils.td.v_nstep_td_data(v, next_n_v, reward, done, weight, value_gamma)

v_nstep_td_error

ding.rl_utils.td.v_nstep_td_error(data: collections.namedtuple, gamma: float, nstep: int = 1, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

Multistep (n step) td_error for value based algorithms

Arguments:
  • data (v_nstep_td_data): The input data, v_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (v_nstep_td_data): The v_nstep_td_data containing [‘v’, ‘next_n_v’, ‘reward’, ‘done’, ‘weight’, ‘value_gamma’]

  • v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]

  • next_v (torch.FloatTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

  • value_gamma (torch.Tensor): If the remaining data in the buffer is less than n_step, we use value_gamma as the gamma discount value for next_v rather than gamma**n_step

Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5, 5)
>>> done = torch.zeros(5)
>>> data = v_nstep_td_data(v, next_v, reward, done, 0.9, 0.99)
>>> loss, td_error_per_sample = v_nstep_td_error(data, 0.99, 5)

q_nstep_td_data

class ding.rl_utils.td.q_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, weight)

dqfd_nstep_td_data

class ding.rl_utils.td.dqfd_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, done_one_step, weight, new_n_q_one_step, next_n_action_one_step, is_expert)

shape_fn_qntd

ding.rl_utils.td.shape_fn_qntd(args, kwargs)[source]
Overview:

Return qntd shape for hpc

Returns:

shape: [T, B, N]

q_nstep_td_error

ding.rl_utils.td.q_nstep_td_error(data: collections.namedtuple, gamma: Union[float, list], nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

Multistep (1 step or n step) td_error for q-learning based algorithm

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = q_nstep_td_error(data, 0.95, nstep=nstep)

bdq_nstep_td_error

ding.rl_utils.td.bdq_nstep_td_error(data: collections.namedtuple, gamma: Union[float, list], nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss()) → torch.Tensor[source]
Overview:

Multistep (1 step or n step) td_error for the BDQ algorithm; see the paper “Action Branching Architectures for Deep Reinforcement Learning”, link: https://arxiv.org/pdf/1711.08946. The original paper only provides the 1-step TD-error calculation method, and here we extend it to the n-step case.

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, D, N)\) i.e. [batch_size, branch_num, action_bins_per_branch]

  • next_n_q (torch.FloatTensor): \((B, D, N)\)

  • action (torch.LongTensor): \((B, D)\)

  • next_n_action (torch.LongTensor): \((B, D)\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> action_per_branch = 3
>>> next_q = torch.randn(8, 6, action_per_branch)
>>> done = torch.randn(8)
>>> action = torch.randint(0, action_per_branch, size=(8, 6))
>>> next_action = torch.randint(0, action_per_branch, size=(8, 6))
>>> nstep = 3
>>> q = torch.randn(8, 6, action_per_branch).requires_grad_(True)
>>> reward = torch.rand(nstep, 8)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = bdq_nstep_td_error(data, 0.95, nstep=nstep)

shape_fn_qntd_rescale

ding.rl_utils.td.shape_fn_qntd_rescale(args, kwargs)[source]
Overview:

Return qntd_rescale shape for hpc

Returns:

shape: [T, B, N]

q_nstep_td_error_with_rescale

ding.rl_utils.td.q_nstep_td_error_with_rescale(data: collections.namedtuple, gamma: Union[float, list], nstep: int = 1, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) → torch.Tensor[source]
Overview:

Multistep (1 step or n step) td_error with value rescaling

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

  • criterion (torch.nn.modules): Loss function criterion

  • trans_fn (Callable): Value transform function, defaults to value_transform (refer to rl_utils/value_rescale.py)

  • inv_trans_fn (Callable): Value inverse transform function, defaults to value_inv_transform (refer to rl_utils/value_rescale.py)

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, _ = q_nstep_td_error_with_rescale(data, 0.95, nstep=nstep)

dqfd_nstep_td_error

ding.rl_utils.td.dqfd_nstep_td_error(data: collections.namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, margin_function: float, lambda_one_step_td: float = 1.0, nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss()) torch.Tensor[源代码]
Overview:

Multistep n step td_error + 1 step td_error + supervised margin loss for DQfD

Arguments:
  • data (dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss

  • gamma (float): discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensor

  • td_error_per_sample (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor

Shapes:
  • data (dqfd_nstep_td_data): The dqfd_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘weight’, ‘new_n_q_one_step’, ‘next_n_action_one_step’, ‘is_expert’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

  • new_n_q_one_step (torch.FloatTensor): \((B, N)\)

  • next_n_action_one_step (torch.LongTensor): \((B, )\)

  • is_expert (int): 0 or 1

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> loss, td_error_per_sample, loss_statistics = dqfd_nstep_td_error(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     margin_function=0.8, nstep=nstep
>>> )

dqfd_nstep_td_error_with_rescale

ding.rl_utils.td.dqfd_nstep_td_error_with_rescale(data: collections.namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, lambda_one_step_td: float, margin_function: float, nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) torch.Tensor[源代码]
Overview:

Multistep n step td_error + 1 step td_error + supervised margin loss for DQfD, with value rescaling

Arguments:
  • data (dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensor

  • td_error_per_sample (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor

Shapes:
  • data (dqfd_nstep_td_data): The dqfd_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘weight’, ‘new_n_q_one_step’, ‘next_n_action_one_step’, ‘is_expert’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

  • new_n_q_one_step (torch.FloatTensor): \((B, N)\)

  • next_n_action_one_step (torch.LongTensor): \((B, )\)

  • is_expert (int): 0 or 1
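
No example is rendered for this rescaled variant; the following is a minimal usage sketch, assuming the same dqfd_nstep_td_data layout as in dqfd_nstep_td_error above (the result is left unpacked because the exact return tuple is not documented here):

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> outputs = dqfd_nstep_td_error_with_rescale(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     lambda_one_step_td=1, margin_function=0.8, nstep=nstep
>>> )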

qrdqn_nstep_td_data

class ding.rl_utils.td.qrdqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, tau, weight)

qrdqn_nstep_td_error

ding.rl_utils.td.qrdqn_nstep_td_error(data: collections.namedtuple, gamma: float, nstep: int = 1, value_gamma: Optional[torch.Tensor] = None) torch.Tensor[源代码]
Overview:

Multistep (1 step or n step) td_error in QRDQN

Arguments:
  • data (qrdqn_nstep_td_data): The input data, qrdqn_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((tau, B, N)\) i.e. [num_quantiles, batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((tau', B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = qrdqn_nstep_td_data(q, next_q, action, next_action, reward, done, 3, None)
>>> loss, td_error_per_sample = qrdqn_nstep_td_error(data, 0.95, nstep=nstep)

q_nstep_sql_td_error

ding.rl_utils.td.q_nstep_sql_td_error(data: collections.namedtuple, gamma: float, alpha: float, nstep: int = 1, cum_reward: bool = False, value_gamma: Optional[torch.Tensor] = None, criterion: torch.nn.modules = MSELoss()) torch.Tensor[源代码]
Overview:

Multistep (1 step or n step) td_error for soft Q-learning (SQL) based algorithms

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_sql_td_data to calculate loss

  • gamma (float): Discount factor

  • alpha (float): A parameter to weight the entropy term in the policy equation

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target soft_q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample, record_target_v = q_nstep_sql_td_error(data, 0.95, 1.0, nstep=nstep)

iqn_nstep_td_data

class ding.rl_utils.td.iqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, replay_quantiles, weight)

iqn_nstep_td_error

ding.rl_utils.td.iqn_nstep_td_error
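
The autodoc body for this function did not render. Based on the iqn_nstep_td_data fields above and by analogy with qrdqn_nstep_td_error, a hypothetical usage sketch could look like the following (the quantile-first layout of q and the shape of replay_quantiles are assumptions; consult ding/rl_utils/td.py for the exact shapes and return values):

>>> next_q = torch.randn(3, 4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(3, 4, 3).requires_grad_(True)
>>> replay_quantiles = torch.randn([3, 4, 1])
>>> reward = torch.rand(nstep, 4)
>>> data = iqn_nstep_td_data(q, next_q, action, next_action, reward, done, replay_quantiles, None)
>>> loss, td_error_per_sample = iqn_nstep_td_error(data, 0.95, nstep=nstep)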

fqf_nstep_td_data

class ding.rl_utils.td.fqf_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, quantiles_hats, weight)

fqf_nstep_td_error

ding.rl_utils.td.fqf_nstep_td_error(data: collections.namedtuple, gamma: float, nstep: int = 1, kappa: float = 1.0, value_gamma: Optional[torch.Tensor] = None) torch.Tensor[源代码]
Overview:

Multistep (1 step or n step) td_error in FQF, referenced paper "Fully Parameterized Quantile Function for Distributional Reinforcement Learning" (https://arxiv.org/pdf/1911.02140.pdf)

Arguments:
  • data (fqf_nstep_td_data): The input data, fqf_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

  • criterion (torch.nn.modules): Loss function criterion

  • beta_function (Callable): The risk function

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, tau, N)\) i.e. [batch_size, tau, action_dim]

  • next_n_q (torch.FloatTensor): \((B, tau', N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • quantiles_hats (torch.FloatTensor): \((B, tau)\)

Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> quantiles_hats = torch.randn([4, 3])
>>> reward = torch.rand(nstep, 4)
>>> data = fqf_nstep_td_data(q, next_q, action, next_action, reward, done, quantiles_hats, None)
>>> loss, td_error_per_sample = fqf_nstep_td_error(data, 0.95, nstep=nstep)

evaluate_quantile_at_action

ding.rl_utils.td.evaluate_quantile_at_action(q_s, actions)[源代码]

fqf_calculate_fraction_loss

ding.rl_utils.td.fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)[源代码]
Overview:

Calculate the fraction loss in FQF, referenced paper "Fully Parameterized Quantile Function for Distributional Reinforcement Learning" (https://arxiv.org/pdf/1911.02140.pdf)

Arguments:
  • q_tau_i (torch.FloatTensor): \((batch_size, num_quantiles-1, action_dim)\)

  • q_value (torch.FloatTensor): \((batch_size, num_quantiles, action_dim)\)

  • quantiles (torch.FloatTensor): \((batch_size, num_quantiles+1)\)

  • actions (torch.LongTensor): \((batch_size, )\)

Returns:
  • fraction_loss (torch.Tensor): fraction loss, 0-dim tensor

td_lambda_data

class ding.rl_utils.td.td_lambda_data(value, reward, weight)

shape_fn_td_lambda

ding.rl_utils.td.shape_fn_td_lambda(args, kwargs)[源代码]
Overview:

Return td_lambda shape for hpc

Returns:

shape: [T, B]

td_lambda_error

ding.rl_utils.td.td_lambda_error(data: collections.namedtuple, gamma: float = 0.9, lambda_: float = 0.8) torch.Tensor[源代码]
Overview:

Computing TD(lambda) loss given constant gamma and lambda. There is no special handling for terminal state value; if some state has reached the terminal, just fill in zeros for values and rewards beyond the terminal (including the terminal state itself, i.e. values[terminal] should also be 0)

Arguments:
  • data (namedtuple): td_lambda input data with fields [‘value’, ‘reward’, ‘weight’]

  • gamma (float): Constant discount factor gamma, should be in [0, 1], defaults to 0.9

  • lambda (float): Constant lambda, should be in [0, 1], defaults to 0.8

Returns:
  • loss (torch.Tensor): Computed MSE loss, averaged over the batch

Shapes:
  • value (torch.FloatTensor): \((T+1, B)\), where T is trajectory length and B is batch, which is the estimation of the state value at step 0 to T

  • reward (torch.FloatTensor): \((T, B)\), the returns from time step 0 to T-1

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

  • loss (torch.FloatTensor): \(()\), 0-dim tensor

Examples:
>>> T, B = 8, 4
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> loss = td_lambda_error(td_lambda_data(value, reward, None))

generalized_lambda_returns

ding.rl_utils.td.generalized_lambda_returns(bootstrap_values: torch.Tensor, rewards: torch.Tensor, gammas: float, lambda_: float, done: Optional[torch.Tensor] = None) torch.Tensor[源代码]
Overview:

Functional equivalent of trfl.value_ops.generalized_lambda_returns (https://github.com/deepmind/trfl/blob/2c07ac22512a16715cc759f0072be43a5d12ae45/trfl/value_ops.py#L74). Passing in a number instead of a tensor makes that value constant for all samples in the batch.

Arguments:
  • bootstrap_values (torch.Tensor or float): estimation of the value at step 0 to T, of size [T_traj+1, batchsize]

  • rewards (torch.Tensor): The returns from 0 to T-1, of size [T_traj, batchsize]

  • gammas (torch.Tensor or float): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]

  • lambda (torch.Tensor or float): Determining the mix of bootstrapping vs further accumulation of multistep returns at each timestep, of size [T_traj, batchsize]

  • done (torch.Tensor or float): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]

Returns:
  • return (torch.Tensor): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
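
No example is rendered for this function; a minimal usage sketch consistent with the shapes above (scalar gamma and lambda_, which the overview explicitly allows) could be:

>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T + 1, B).requires_grad_(True)
>>> rewards = torch.randn(T, B)
>>> returns = generalized_lambda_returns(bootstrap_values, rewards, 0.99, 0.95)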

multistep_forward_view

ding.rl_utils.td.multistep_forward_view(bootstrap_values: torch.Tensor, rewards: torch.Tensor, gammas: float, lambda_: float, done: Optional[torch.Tensor] = None) torch.Tensor[源代码]
Overview:

Same as trfl.sequence_ops.multistep_forward_view, implementing equation (12.18) in Sutton & Barto:

result[T-1] = rewards[T-1] + gammas[T-1] * bootstrap_values[T]
for t in 0 ... T-2:
    result[t] = rewards[t] + gammas[t] * (lambdas[t] * result[t+1] + (1 - lambdas[t]) * bootstrap_values[t+1])
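
For illustration only, a plain PyTorch sketch of this backward recursion (not the library implementation, which may differ in details such as how done flags are handled). Here bootstrap_values is indexed 0-based, with bootstrap_values[t] holding the value estimate of state t+1, matching the documented [T_traj, batchsize] shape:

>>> import torch
>>> def forward_view_sketch(bootstrap_values, rewards, gammas, lambdas):
>>>     # all inputs: (T, B); bootstrap_values[t] is the value estimate of state t+1
>>>     T = rewards.shape[0]
>>>     result = torch.empty_like(rewards)
>>>     # the last step bootstraps directly on the final value estimate
>>>     result[T - 1] = rewards[T - 1] + gammas[T - 1] * bootstrap_values[T - 1]
>>>     for t in range(T - 2, -1, -1):
>>>         # mix the bootstrapped value and the accumulated multistep return
>>>         result[t] = rewards[t] + gammas[t] * (
>>>             lambdas[t] * result[t + 1] + (1 - lambdas[t]) * bootstrap_values[t]
>>>         )
>>>     return result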

Assuming the first dim of the input tensors corresponds to the trajectory (time) index and the second dim to the batch index.

Arguments:
  • bootstrap_values (torch.Tensor): Estimation of the value at step 1 to T, of size [T_traj, batchsize]

  • rewards (torch.Tensor): The returns from 0 to T-1, of size [T_traj, batchsize]

  • gammas (torch.Tensor): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]

  • lambda (torch.Tensor): Determining the mix of bootstrapping vs further accumulation of

    multistep returns at each timestep, of size [T_traj, batchsize]; the element for T-1 is ignored and effectively set to 0, as there is no information about future rewards.

  • done (torch.Tensor or float): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]

Returns:
  • ret (torch.Tensor): Computed lambda return value

    for each state from 0 to T-1, of size [T_traj, batchsize]

upgo

Please refer to ding/rl_utils/upgo for more details.

upgo_returns

ding.rl_utils.upgo.upgo_returns(rewards: torch.Tensor, bootstrap_values: torch.Tensor) torch.Tensor[源代码]
Overview:

Computing UPGO return targets. Also notice there is no special handling for the terminal state.
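
As a reminder (this definition is paraphrased from the AlphaStar literature rather than taken from this docstring), the UPGO target bootstraps whenever the sampled one-step return falls below the value estimate: \(G_t^{U} = r_t + G_{t+1}^{U}\) if \(r_{t+1} + V(s_{t+2}) \geq V(s_{t+1})\), and \(G_t^{U} = r_t + V(s_{t+1})\) otherwise, with \(G_{T-1}^{U} = r_{T-1} + V(s_T)\).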

Arguments:
  • rewards (torch.Tensor): the returns from time step 0 to T-1,

    of size [T_traj, batchsize]

  • bootstrap_values (torch.Tensor): estimation of the state value at step 0 to T,

    of size [T_traj+1, batchsize]

Returns:
  • ret (torch.Tensor): Computed lambda return value for each state from 0 to T-1,

    of size [T_traj, batchsize]

Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rewards = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B).requires_grad_(True)
>>> returns = upgo_returns(rewards, bootstrap_values)

upgo_loss

ding.rl_utils.upgo.upgo_loss(target_output: torch.Tensor, rhos: torch.Tensor, action: torch.Tensor, rewards: torch.Tensor, bootstrap_values: torch.Tensor, mask=None) torch.Tensor[源代码]
Overview:

Computing UPGO loss given constant gamma and lambda. There is no special handling for terminal state value; if the last state in the trajectory is terminal, just pass 0 as the bootstrap terminal value.

Arguments:
  • target_output (torch.Tensor): the output computed by the target policy network,

    of size [T_traj, batchsize, n_output]

  • rhos (torch.Tensor): the importance sampling ratio, of size [T_traj, batchsize]

  • action (torch.Tensor): the action taken, of size [T_traj, batchsize]

  • rewards (torch.Tensor): the returns from time step 0 to T-1, of size [T_traj, batchsize]

  • bootstrap_values (torch.Tensor): estimation of the state value at step 0 to T,

    of size [T_traj+1, batchsize]

Returns:
  • loss (torch.Tensor): Computed importance sampled UPGO loss, averaged over the samples, of size []

Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rhos = torch.randn(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> action = torch.randint(0, N, size=(T, B))
>>> rewards, bootstrap_values = torch.randn(T, B), torch.randn(T + 1, B)
>>> loss = upgo_loss(target_output, rhos, action, rewards, bootstrap_values)

value_rescale

Please refer to ding/rl_utils/value_rescale for more details.

value_transform

ding.rl_utils.value_rescale.value_transform(x: torch.Tensor, eps: float = 0.01) torch.Tensor[源代码]
Overview:

A function to reduce the scale of the action-value function: \(h(x) = \operatorname{sign}(x)(\sqrt{|x|+1} - 1) + \epsilon \cdot x\).

Arguments:
  • x: (torch.Tensor) The input tensor to be normalized.

  • eps: (float) The coefficient of the additive regularization term

    to ensure h^{-1} is Lipschitz continuous

Returns:
  • (torch.Tensor) Normalized tensor.

Note

Observe and Look Further: Achieving Consistent Performance on Atari

(https://arxiv.org/abs/1805.11593)

value_inv_transform

ding.rl_utils.value_rescale.value_inv_transform(x: torch.Tensor, eps: float = 0.01) torch.Tensor[源代码]
Overview:

The inverse form of value rescale: \(h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1+4\epsilon(|x|+1+\epsilon)}-1}{2\epsilon}\right)^{2}-1\right)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be unnormalized.

  • eps: (float) The coefficient of the additive regularization term

    to ensure h^{-1} is Lipschitz continuous

Returns:
  • (torch.Tensor) Unnormalized tensor.
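
A quick round-trip sanity check (added here, not part of the original docs) showing that value_inv_transform undoes value_transform up to floating-point error:

>>> x = torch.tensor([-10., -1., 0., 1., 10.])
>>> h_x = value_transform(x)
>>> assert torch.allclose(value_inv_transform(h_x), x, atol=1e-4)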

symlog

ding.rl_utils.value_rescale.symlog(x: torch.Tensor) torch.Tensor[源代码]
Overview:

A function to normalize the targets: \(\operatorname{symlog}(x) = \operatorname{sign}(x)\ln(|x|+1)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be normalized.

Returns:
  • (torch.Tensor) Normalized tensor.

Note

Mastering Diverse Domains through World Models

(https://arxiv.org/abs/2301.04104)

inv_symlog

ding.rl_utils.value_rescale.inv_symlog(x: torch.Tensor) torch.Tensor[源代码]
Overview:

The inverse form of symlog: \(\operatorname{symexp}(x) = \operatorname{sign}(x)(\exp(|x|)-1)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be unnormalized.

Returns:
  • (torch.Tensor) Unnormalized tensor.
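
Similarly, a small sanity check (added here, not from the source) that inv_symlog inverts symlog up to floating-point error:

>>> x = torch.tensor([-100., -1., 0., 1., 100.])
>>> assert torch.allclose(inv_symlog(symlog(x)), x, atol=1e-3)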

vtrace

Please refer to ding/rl_utils/vtrace for more details.

vtrace_nstep_return

ding.rl_utils.vtrace.vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values, gamma=0.99, lambda_=0.95)[源代码]
Overview:

Computation of vtrace return.

Returns:
  • vtrace_return (torch.FloatTensor): the computed vtrace n-step return, one value per timestep and batch element

Shapes:
  • clipped_rhos (torch.FloatTensor): \((T, B)\), where T is timestep, B is batch size

  • clipped_cs (torch.FloatTensor): \((T, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • bootstrap_values (torch.FloatTensor): \((T+1, B)\)

  • vtrace_return (torch.FloatTensor): \((T, B)\)
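
For reference, the v-trace target defined in the IMPALA paper (arXiv:1802.01561) has the form \(v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s}\big(\prod_{i=s}^{t-1} c_i\big)\rho_t\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big)\), with the clipped importance weights \(\rho_t\) and \(c_i\) passed in here as clipped_rhos and clipped_cs (the lambda_ argument additionally mixes in a TD(lambda)-style decay). A minimal usage sketch consistent with the shapes above:

>>> T, B = 4, 8
>>> clipped_rhos = torch.rand(T, B)
>>> clipped_cs = torch.rand(T, B)
>>> reward = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B)
>>> vtrace_return = vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values)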

vtrace_advantage

ding.rl_utils.vtrace.vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, gamma)[源代码]
Overview:

Computation of vtrace advantage.

Returns:
  • vtrace_advantage (torch.FloatTensor): the computed vtrace advantage, one value per timestep and batch element

Shapes:
  • clipped_pg_rhos (torch.FloatTensor): \((T, B)\), where T is timestep, B is batch size

  • reward (torch.FloatTensor): \((T, B)\)

  • return (torch.FloatTensor): \((T, B)\)

  • bootstrap_values (torch.FloatTensor): \((T, B)\)

  • vtrace_advantage (torch.FloatTensor): \((T, B)\)
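
A minimal usage sketch following the shapes above (in practice return_ is the shifted output of vtrace_nstep_return; random tensors are used here only to illustrate the call):

>>> T, B = 4, 8
>>> clipped_pg_rhos = torch.rand(T, B)
>>> reward = torch.randn(T, B)
>>> return_ = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T, B)
>>> adv = vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, 0.99)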

vtrace_data

class ding.rl_utils.vtrace.vtrace_data(target_output, behaviour_output, action, value, reward, weight)

vtrace_loss

class ding.rl_utils.vtrace.vtrace_loss(policy_loss, value_loss, entropy_loss)

vtrace_error_discrete_action

ding.rl_utils.vtrace.vtrace_error_discrete_action(data: collections.namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[源代码]
Overview:

Implementation of vtrace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for discrete action space

Arguments:
  • data (namedtuple): input data with fields shown in vtrace_data
    • target_output (torch.Tensor): the output taking the action by the current policy network, usually this output is network output logit

    • behaviour_output (torch.Tensor): the output taking the action by the behaviour policy network, usually this output is network output logit, which is used to produce the trajectory(collector)

    • action (torch.Tensor): the chosen action(index for the discrete action space) in trajectory, i.e.: behaviour_action

  • gamma (float): the future discount factor, defaults to 0.99

  • lambda_ (float): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95

  • rho_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)

  • c_clip_ratio (float): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)

  • rho_pg_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage

Returns:
  • trace_loss (namedtuple): the vtrace loss item, all of them are the differentiable 0-dim tensor

Shapes:
  • target_output (torch.FloatTensor): \((T, B, N)\), where T is timestep, B is batch size and N is action dim

  • behaviour_output (torch.FloatTensor): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • value (torch.FloatTensor): \((T+1, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.FloatTensor or None): \((T, B)\)

Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> behaviour_output = torch.randn(T, B, N)
>>> action = torch.randint(0, N, size=(T, B))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_discrete_action(data, rho_clip_ratio=1.1)

vtrace_error_continuous_action

ding.rl_utils.vtrace.vtrace_error_continuous_action(data: collections.namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[源代码]
Overview:

Implementation of vtrace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for continuous action space

Arguments:
  • data (namedtuple): input data with fields shown in vtrace_data
    • target_output (dict{key:torch.Tensor}): the output taking the action by the current policy network, usually this output is network output, which represents the distribution by reparameterization trick.

    • behaviour_output (dict{key:torch.Tensor}): the output taking the action by the behaviour policy network, usually this output is network output logit, which represents the distribution by reparameterization trick.

    • action (torch.Tensor): the chosen action (continuous action vector) in the trajectory, i.e. behaviour_action

  • gamma (float): the future discount factor, defaults to 0.99

  • lambda_ (float): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95

  • rho_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)

  • c_clip_ratio (float): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)

  • rho_pg_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage

Returns:
  • trace_loss (namedtuple): the vtrace loss item, all of them are the differentiable 0-dim tensor

Shapes:
  • target_output (dict{key:torch.FloatTensor}): \((T, B, N)\), where T is timestep, B is batch size and N is action dim. The keys are usually parameters of reparameterization trick.

  • behaviour_output (dict{key:torch.FloatTensor}): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • value (torch.FloatTensor): \((T+1, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.FloatTensor or None): \((T, B)\)

Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = dict(
>>>     mu=torch.randn(T, B, N).requires_grad_(True),
>>>     sigma=torch.exp(torch.randn(T, B, N).requires_grad_(True)),
>>> )
>>> behaviour_output = dict(
>>>     mu=torch.randn(T, B, N),
>>>     sigma=torch.exp(torch.randn(T, B, N)),
>>> )
>>> action = torch.randn((T, B, N))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_continuous_action(data, rho_clip_ratio=1.1)