
CollaQ

Overview

CollaQ (Collaborative Q-learning; Zhang et al. 2020) is a multi-agent collaboration approach based on Q-learning that formulates multi-agent collaboration as a joint optimization problem over reward assignments. CollaQ decomposes the decentralized Q-function of each agent into two terms: a self term that depends only on the agent’s own state, and an interactive term that depends on the states of nearby agents. CollaQ is trained with the regular DQN objective, regularized by a Multi-Agent Reward Attribution (MARA) loss.

Quick Facts

  1. CollaQ is a model-free and value-based multi-agent RL approach.

  2. CollaQ only supports discrete action spaces.

  3. CollaQ is an off-policy algorithm.

  4. CollaQ considers a partially observable scenario in which each agent only obtains individual observations.

  5. CollaQ uses a DRQN architecture for individual Q-learning (a minimal sketch follows this list).

  6. Compared with QMIX and VDN, CollaQ does not need a centralized Q-function; instead, it expands each agent’s individual Q-function with a reward assignment that depends on the joint state.
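To make fact 5 concrete, here is a minimal recurrent Q-network cell in the spirit of DRQN (a hedged sketch with hypothetical names, not DI-engine's exact module): a GRU carries a per-agent hidden state across timesteps to cope with partial observability.

import torch
import torch.nn as nn

class DRQNCell(nn.Module):
    """One step of a recurrent Q-network: encode obs, update GRU state, output Q-values."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        h = self.rnn(torch.relu(self.encoder(obs)), h)  # carry memory across the episode
        return self.head(h), h                          # per-action Q-values, next hidden state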

Key Equations or Key Graphs

The overall architecture of the attention-based Q-function model in CollaQ:

../_images/collaq.png

The Q-function for agent i:

\[Q_{i}(s_{i},a_{i};\hat{\textbf{r}}_{i}) = \underbrace{Q_{i}(s_{i}, a_{i};\textbf{r}_{0i})}_{Q^{alone}(s_{i},a_{i})} + \underbrace{\nabla_{\textbf{r}}Q_{i}(s_{i},a_{i};\textbf{r}_{0i})\cdot(\hat{\textbf{r}}_{i} - \textbf{r}_{0i}) + \mathcal{O}(||\hat{\textbf{r}}_{i} - \textbf{r}_{0i}||^{2})}_{Q^{collab}(s^{local}_{i}, a_{i})}\]
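As a rough illustration of this decomposition (a minimal sketch with hypothetical module names, not DI-engine's CollaQ network), the two terms can be realized as two separate heads whose outputs are summed:

import torch
import torch.nn as nn

class DecomposedQ(nn.Module):
    """Q_i = Q^alone(s_i, a_i) + Q^collab(s_i^local, a_i), as in the equation above."""

    def __init__(self, alone_obs_dim: int, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        # Q^alone only sees the agent's own (ally-free) observation.
        self.q_alone = nn.Sequential(nn.Linear(alone_obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))
        # Q^collab sees the full local observation that includes nearby allies.
        self.q_collab = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, obs_alone: torch.Tensor, obs_local: torch.Tensor):
        q_alone, q_collab = self.q_alone(obs_alone), self.q_collab(obs_local)
        return q_alone + q_collab, q_alone, q_collab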

The overall training objective, combining the standard DQN loss with the MARA loss:

\[L = \mathbb{E}_{s_{i},a_{i}\sim\rho(\cdot)}\Big[\underbrace{(y-Q_{i}(o_{i},a_{i}))^{2}}_{\text{DQN Objective}} + \underbrace{\alpha\,(Q_{i}^{collab}(o_{i}^{alone}, a_{i}))^{2}}_{\text{MARA Objective}}\Big]\]
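A hedged sketch of this objective using the DecomposedQ module above (obs_alone_padded stands for the alone observation padded to the full local shape, in the spirit of the network's agent_alone_padding_state input; alpha corresponds to learn.collaq_loss_weight in the config table below):

def collaq_loss(q_net, obs_alone, obs_local, obs_alone_padded, action, target_y, alpha=1.0):
    # Standard DQN objective on the full Q-value Q_i = Q^alone + Q^collab.
    q_total, _, _ = q_net(obs_alone, obs_local)
    q_taken = q_total.gather(-1, action.unsqueeze(-1)).squeeze(-1)
    dqn_loss = ((target_y.detach() - q_taken) ** 2).mean()
    # MARA objective: Q^collab should vanish when evaluated on the ally-free observation.
    _, _, q_collab_alone = q_net(obs_alone, obs_alone_padded)
    mara = q_collab_alone.gather(-1, action.unsqueeze(-1)).squeeze(-1)
    mara_loss = (mara ** 2).mean()
    return dqn_loss + alpha * mara_loss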

Extensions

  • CollaQ can optionally use an attention-based architecture, because the observation can be spatially large and cover agents whose states contribute little to a given agent’s policy. Specifically, CollaQ uses a transformer-style architecture (stacking multiple attention layers), which empirically improves performance on multi-agent tasks; a single-layer sketch is given below.
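A minimal single-attention-layer sketch of this idea (hypothetical names; the actual model stacks several such layers): the agent's own features form the query, ally features form the keys and values, so allies that are irrelevant to the agent receive low attention weight.

import torch
import torch.nn as nn

class AllyAttention(nn.Module):
    """Single attention layer: query = self features, keys/values = ally features."""

    def __init__(self, self_dim: int, ally_dim: int, attention_size: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(self_dim, attention_size)
        self.k_proj = nn.Linear(ally_dim, attention_size)
        self.v_proj = nn.Linear(ally_dim, attention_size)

    def forward(self, self_feat: torch.Tensor, ally_feat: torch.Tensor):
        # self_feat: (B, self_dim); ally_feat: (B, n_ally, ally_dim)
        q = self.q_proj(self_feat).unsqueeze(1)                      # (B, 1, d)
        k, v = self.k_proj(ally_feat), self.v_proj(ally_feat)        # (B, n_ally, d)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                                 # (B, d) aggregated ally embedding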

Implementations

The default config is defined as follows:

class ding.policy.collaq.CollaQPolicy(cfg: easydict.EasyDict, model: Optional[torch.nn.modules.module.Module] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of the CollaQ algorithm. CollaQ is a multi-agent reinforcement learning algorithm.

Interface:
_init_learn, _data_preprocess_learn, _forward_learn, _reset_learn, _state_dict_learn, _load_state_dict_learn

_init_collect, _forward_collect, _reset_collect, _process_transition, _init_eval, _forward_eval, _reset_eval, _get_train_sample, default_model

Config:

| ID | Symbol                    | Type  | Default Value | Description                                                                                                   | Other (Shape)                                                        |
|----|---------------------------|-------|---------------|---------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
| 1  | type                      | str   | collaq        | RL policy register name, refer to registry POLICY_REGISTRY                                                     | this arg is optional, a placeholder                                  |
| 2  | cuda                      | bool  | True          | Whether to use cuda for the network                                                                            | this arg can be different across modes                               |
| 3  | on_policy                 | bool  | False         | Whether the RL algorithm is on-policy or off-policy                                                            |                                                                      |
| 4  | priority                  | bool  | False         | Whether to use prioritized experience replay (PER)                                                             | priority sample, update priority                                     |
| 5  | priority_IS_weight        | bool  | False         | Whether to use Importance Sampling weight to correct the biased update                                         | IS weight                                                            |
| 6  | learn.update_per_collect  | int   | 20            | How many updates (iterations) to train after one collection by the collector. Only valid in serial training    | this arg can vary across envs; a bigger value means more off-policy  |
| 7  | learn.target_update_theta | float | 0.001         | Target network update momentum parameter                                                                       | between [0, 1]                                                       |
| 8  | learn.discount_factor     | float | 0.99          | Reward's future discount factor, aka. gamma                                                                    | may be 1 in sparse-reward envs                                       |
| 9  | learn.collaq_loss_weight  | float | 1.0           | The weight of the CollaQ MARA loss                                                                             |                                                                      |
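For illustration, a fragment of a policy config assembled from the defaults in the table above might look like this (a hedged sketch; the surrounding env, model, and collect/eval fields are omitted):

from easydict import EasyDict

collaq_policy_cfg = EasyDict(dict(
    type='collaq',
    cuda=True,
    on_policy=False,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=20,
        target_update_theta=0.001,
        discount_factor=0.99,
        collaq_loss_weight=1.0,
    ),
))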

The network interface used by CollaQ is defined as follows:

class ding.model.template.collaq.CollaQ(agent_num: int, obs_shape: int, alone_obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, attention: bool = False, self_feature_range: Optional[List[int]] = None, ally_feature_range: Optional[List[int]] = None, attention_size: int = 32, mixer: bool = True, lstm_type: str = 'gru', activation: torch.nn.modules.module.Module = ReLU(), dueling: bool = False)[source]
Overview:

The network of the CollaQ (Collaborative Q-learning) algorithm. It consists of two parts: q_network and q_alone_network. The q_network computes the Q-value from the agent’s observation together with the observed features of the allies the agent attends to. The q_alone_network computes the Q-value from the agent’s observation without the features of those allies. Multi-Agent Collaboration via Reward Attribution Decomposition https://arxiv.org/abs/2010.08531

Interface:

__init__, forward, _setup_global_encoder

forward(data: dict, single_step: bool = True) -> dict[source]
Overview:

The forward method calculates the q_value of each agent and the total q_value of all agents. The q_value of each agent is calculated by the q_network, and the total q_value is calculated by the mixer.

Arguments:
  • data (dict): input data dict with keys [‘obs’, ‘prev_state’, ‘action’]
    • agent_state (torch.Tensor): each agent’s local state (obs)

    • agent_alone_state (torch.Tensor): each agent’s local state alone; in the SMAC setting this is the observation without ally features (obs_alone)

    • global_state (torch.Tensor): global state (obs)

    • prev_state (list): previous rnn state, which should include 3 parts: one hidden state of q_network, and two hidden states of q_alone_network for the obs and obs_alone inputs

    • action (torch.Tensor or None): if action is None, use the argmax q_value index as the action to calculate agent_q_act

  • single_step (bool): whether to perform a single-step forward; if so, add a timestep dim before the forward pass and remove it afterwards

Returns:
  • ret (dict): output data dict with keys [‘total_q’, ‘logit’, ‘next_state’]
    • total_q (torch.Tensor): total q_value, which is the output of the mixer network

    • agent_q (torch.Tensor): each agent’s q_value (returned under the key ‘logit’)

    • next_state (list): next rnn state

Shapes:
  • agent_state (torch.Tensor): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, and N is obs_shape

  • global_state (torch.Tensor): \((T, B, M)\), where M is global_obs_shape

  • prev_state (list): \((B, A)\), a list of length B, where each element is a list of length A

  • action (torch.Tensor): \((T, B, A)\)

  • total_q (torch.Tensor): \((T, B)\)

  • agent_q (torch.Tensor): \((T, B, A, P)\), where P is action_shape

  • next_state (list): \((B, A)\), a list of length B, where each element is a list of length A

Examples:
>>> collaQ_model = CollaQ(
>>>     agent_num=4,
>>>     obs_shape=32,
>>>     alone_obs_shape=24,
>>>     global_obs_shape=32 * 4,
>>>     action_shape=9,
>>>     hidden_size_list=[128, 64],
>>>     self_feature_range=[8, 10],
>>>     ally_feature_range=[10, 16],
>>>     attention_size=64,
>>>     mixer=True,
>>>     activation=torch.nn.Tanh()
>>> )
>>> data={
>>>     'obs': {
>>>         'agent_state': torch.randn(8, 4, 4, 32),
>>>         'agent_alone_state': torch.randn(8, 4, 4, 24),
>>>         'agent_alone_padding_state': torch.randn(8, 4, 4, 32),
>>>         'global_state': torch.randn(8, 4, 32 * 4),
>>>         'action_mask': torch.randint(0, 2, size=(8, 4, 4, 9))
>>>     },
>>>     'prev_state': [[[None for _ in range(4)] for _ in range(3)] for _ in range(4)],
>>>     'action': torch.randint(0, 9, size=(8, 4, 4))
>>> }
>>> output = collaQ_model(data, single_step=False)
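As a hedged follow-up to the example above, the returned dict can be inspected as follows (key names and shapes are taken from the Returns and Shapes sections; with single_step=False the expected layout is (T, B, ...) = (8, 4, ...)):

>>> print(output['total_q'].shape)    # expected torch.Size([8, 4]), i.e. (T, B)
>>> print(output['logit'].shape)      # per-agent Q-values, expected torch.Size([8, 4, 4, 9]), i.e. (T, B, A, P)
>>> print(len(output['next_state']))  # B = 4; each element holds the per-agent rnn states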

The benchmark results of CollaQ on SMAC (Samvelyan et al. 2019), a benchmark for StarCraft II micromanagement problems, implemented in DI-engine, are shown below.

Benchmark

| Environment | Best mean reward | Evaluation results              | Config link   | Comparison  |
|-------------|------------------|---------------------------------|---------------|-------------|
| 5m6m        | 1                | ../_images/smac_5m6m_collaq.png | config_link_p | Pymarl(0.8) |
| MMM         | 0.7              | ../_images/smac_MMM_collaq.png  | config_link_q | Pymarl(1)   |
| 3s5z        | 1                | ../_images/smac_3s5z_collaq.png | config_link_s | Pymarl(1)   |

P.S.:

The above results are obtained by running the same configuration on three different random seeds (0, 1, 2).

References

  • Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian. Multi-Agent Collaboration via Reward Attribution Decomposition. arXiv preprint arXiv:2010.08531, 2020.

  • Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.

Other Public Implementations
