
CQL

Overview

Offline reinforcement learning (RL) is a re-emerging area of study that aims to learn behaviors from large, previously collected datasets without further environment interaction. It has the potential to enable tremendous progress in a number of real-world decision-making problems where active data collection is expensive (e.g., robotics, drug discovery, dialogue generation, recommendation systems) or unsafe (e.g., healthcare, autonomous driving, education). Moreover, the amount of data that can be gathered online is usually far smaller than what is already available in offline datasets. This paradigm therefore promises to resolve a key obstacle to bringing reinforcement learning algorithms out of constrained lab settings and into the real world.

However, directly applying existing value-based off-policy RL algorithms in the offline setting generally results in poor performance, due to bootstrapping from out-of-distribution actions and overfitting. Consequently, various constraint techniques are added on top of basic online RL algorithms. Conservative Q-learning (CQL), first proposed in Conservative Q-Learning for Offline Reinforcement Learning, is one such method: it learns a conservative Q-function whose expected value lower-bounds the true policy value, via a simple modification to standard value-based RL algorithms.

Quick Facts

  1. CQL is an offline RL algorithm.

  2. CQL can be implemented with less than 20 lines of code on top of a number of standard, online RL algorithms.

  3. CQL supports both discrete and continuous action spaces.

Key Equations or Key Graphs

CQL can be implemented with less than 20 lines of code on top of a number of standard, online RL algorithms, simply by adding the CQL regularization terms to the Q-function update.

For conservative off-policy evaluation, the Q-function is trained via the following iterative update:

../_images/cql_policy_evaluation.png

Taking a closer look at the above equation, we see that it consists of two parts: the regularization term and the usual Bellman error, combined with a tradeoff factor \(\alpha\). Within the regularization term, the first term pushes the Q-value down on the (s,a) pairs sampled from \(\mu\), whereas the second term pushes the Q-value up on the (s,a) pairs drawn from the offline dataset.
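For reference, the iterative update rendered in the image above can be written out as follows (restated here from the original paper, where \(\hat{\pi}_\beta\) denotes the behavior policy that generated the dataset \(\mathcal{D}\) and \(\hat{\mathcal{B}}^{\pi}\) the empirical Bellman backup operator):

\[
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\,\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(a|s)}\big[Q(s,a)\big] - \mathbb{E}_{s\sim\mathcal{D},\,a\sim\hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big) + \frac{1}{2}\,\mathbb{E}_{s,a,s'\sim\mathcal{D}}\Big[\big(Q(s,a)-\hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\big)^{2}\Big]
\]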

According to the corresponding theorem in the paper, the above update lower-bounds the expected value under the policy \(\pi\) when \(\mu = \pi\).

For a suitable \(\alpha\), the bound holds under sampling error and function approximation. We also note that as more data becomes available and \(|D(s,a)|\) increases, the theoretical value of \(\alpha\) needed to guarantee a lower bound decreases, which indicates that in the limit of infinite data, a lower bound can be obtained with extremely small values of \(\alpha\).

Note that this analysis assumes that no function approximation is used in the Q-function, meaning that each iterate can be represented exactly. The result can be further generalized to both linear function approximators and non-linear neural network function approximators, where the latter builds on the neural tangent kernel (NTK) framework. For more details, please refer to Theorem D.1 and Theorem D.2 in Appendix D.1 of the original paper.

So, how should we utilize this for policy optimization? We could alternate between performing full off-policy evaluation for each policy iterate, \(\hat{\pi}^{k}(a|s)\), and one step of policy improvement. However, this can be computationally expensive. Alternatively, since the policy \(\hat{\pi}^{k}(a|s)\) is typically derived from the Q-function, we could instead choose \(\mu(a|s)\) to approximate the policy that would maximize the current Q-function iterate, giving rise to an online algorithm. Thus, for a complete offline RL algorithm, the Q-function is in general updated as follows:

../_images/cql_general_3.png
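Written out (restated from the original paper), this corresponds to the min-max objective that defines the \(CQL(\mathcal{R})\) family:

\[
\min_{Q}\max_{\mu}\;\alpha\,\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(a|s)}\big[Q(s,a)\big]-\mathbb{E}_{s\sim\mathcal{D},\,a\sim\hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big)+\frac{1}{2}\,\mathbb{E}_{s,a,s'\sim\mathcal{D}}\Big[\big(Q(s,a)-\hat{\mathcal{B}}^{\pi_k}\hat{Q}^{k}(s,a)\big)^{2}\Big]+\mathcal{R}(\mu)
\]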

where each instance of \(CQL(\mathcal{R})\) is characterized by a particular choice of regularizer \(\mathcal{R}(\mu)\). If \(\mathcal{R}(\mu)\) is chosen to be the KL-divergence against a prior distribution \(\rho(a|s)\), then \(\mu(a|s)\propto \rho(a|s)\exp(Q(s,a))\). First, if \(\rho(a|s) = \mathrm{Unif}(a)\), the first term above corresponds to a soft-maximum of the Q-values at any state \(s\), giving rise to the following variant, called CQL(H):

../_images/cql_equation_4.png
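Restated from the paper, this CQL(H) objective (equation (4) there) reads:

\[
\min_{Q}\;\alpha\,\mathbb{E}_{s\sim\mathcal{D}}\Big[\log\sum_{a}\exp\big(Q(s,a)\big)-\mathbb{E}_{a\sim\hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big]+\frac{1}{2}\,\mathbb{E}_{s,a,s'\sim\mathcal{D}}\Big[\big(Q(s,a)-\hat{\mathcal{B}}^{\pi_k}\hat{Q}^{k}(s,a)\big)^{2}\Big]
\]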

Secondly, if \(\rho(a|s)\) is chosen to be the previous policy \(\hat{\pi}^{k-1}\), the first term in equation (4) above is replaced by an exponentially weighted average of the Q-values of actions sampled from \(\hat{\pi}^{k-1}(a|s)\).

Pseudo-code

The pseudo-code is shown in Algorithm 1, with differences from conventional actor-critic algorithms (e.g., SAC) and deep Q-learning algorithms (e.g., DQN) marked in red.

../_images/cql.png

Equation (4) in the above pseudo-code is:

../_images/cql_equation_4.png

Note that in the implementation, the first term of equation (4) is computed with torch.logsumexp, which consumes a large share of the running time.
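As an illustration, the snippet below is a minimal PyTorch sketch of the discrete-action CQL(H) loss in equation (4). The function and tensor names are illustrative assumptions and do not correspond one-to-one to DI-engine's DiscreteCQLPolicy.

```python
import torch
import torch.nn.functional as F


def discrete_cql_loss(q_net, target_q_net, obs, act, rew, next_obs, done,
                      alpha=1.0, gamma=0.99):
    # Q-values for all actions at the current states: shape (batch, num_actions).
    q_all = q_net(obs)
    # Q(s, a) for the actions actually stored in the offline dataset.
    q_taken = q_all.gather(1, act.unsqueeze(1)).squeeze(1)

    # Standard Bellman target computed from the target network (no gradient).
    with torch.no_grad():
        next_q = target_q_net(next_obs).max(dim=1).values
        target = rew + gamma * (1.0 - done) * next_q
    bellman_error = F.mse_loss(q_taken, target)

    # CQL(H) penalty: push down the soft-maximum over all actions (logsumexp)
    # and push up the Q-value of the dataset action.
    cql_penalty = (torch.logsumexp(q_all, dim=1) - q_taken).mean()

    return bellman_error + alpha * cql_penalty
```

For continuous action spaces the Q-values cannot be enumerated over all actions, so implementations typically sample a set of actions (e.g., from the current policy and a uniform distribution) and apply torch.logsumexp over the sampled Q-values; these repeated Q-network evaluations are exactly the expensive step mentioned above.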

Implementations

The default config of CQLPolicy is defined as follows:

class ding.policy.cql.CQLPolicy(cfg: easydict.EasyDict, model: Optional[torch.nn.modules.module.Module] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of CQL algorithm for continuous control. Paper link: https://arxiv.org/abs/2006.04779.

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
|----|--------|------|---------------|-------------|---------------|
| 1 | type | str | cql | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | True | Whether to use cuda for network | |
| 3 | random_collect_size | int | 10000 | Number of randomly collected training samples in replay buffer when training starts | Default to 10000 for SAC, 25000 for DDPG/TD3 |
| 4 | model.policy_embedding_size | int | 256 | Linear layer size for policy network | |
| 5 | model.soft_q_embedding_size | int | 256 | Linear layer size for soft q network | |
| 6 | model.value_embedding_size | int | 256 | Linear layer size for value network | Default to None when model.value_network is False |
| 7 | learn.learning_rate_q | float | 3e-4 | Learning rate for soft q network | Default to 1e-3 when model.value_network is True |
| 8 | learn.learning_rate_policy | float | 3e-4 | Learning rate for policy network | Default to 1e-3 when model.value_network is True |
| 9 | learn.learning_rate_value | float | 3e-4 | Learning rate for value network | Default to None when model.value_network is False |
| 10 | learn.alpha | float | 0.2 | Entropy regularization coefficient | alpha is the initialization for auto alpha, when auto_alpha is True |
| 11 | learn.reparameterization | bool | True | Determine whether to use reparameterization trick | |
| 12 | learn.auto_alpha | bool | False | Determine whether to use auto temperature parameter alpha | Temperature parameter determines the relative importance of the entropy term against the reward |
| 13 | learn.ignore_done | bool | False | Determine whether to ignore done flag | Use ignore_done only in halfcheetah env |
| 14 | learn.target_theta | float | 0.005 | Used for soft update of the target network | aka. interpolation factor in polyak averaging for target networks |

class ding.policy.cql.DiscreteCQLPolicy(cfg: easydict.EasyDict, model: Optional[torch.nn.modules.module.Module] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of discrete CQL algorithm in discrete action space environments. Paper link: https://arxiv.org/abs/2006.04779.
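As a rough orientation, the sketch below assembles the defaults from the table above into the nested EasyDict layout that the cfg argument appears to follow. It is a minimal sketch, assuming that nesting; the authoritative field set is the default config shipped with DI-engine and the config files linked in the Benchmark section below.

```python
from easydict import EasyDict

# Illustrative values mirroring the default-config table above; the exact
# nesting and any fields not listed there are assumptions, not guaranteed API.
cql_config = EasyDict(dict(
    type='cql',
    cuda=True,
    random_collect_size=10000,
    model=dict(
        policy_embedding_size=256,
        soft_q_embedding_size=256,
        value_embedding_size=256,
    ),
    learn=dict(
        learning_rate_q=3e-4,
        learning_rate_policy=3e-4,
        learning_rate_value=3e-4,
        alpha=0.2,
        auto_alpha=False,
        reparameterization=True,
        ignore_done=False,
        target_theta=0.005,
    ),
))
```

Such a dict is normally merged into a full experiment config and handed to CQLPolicy (or DiscreteCQLPolicy) by DI-engine's training entry points rather than used in isolation.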

Benchmark

| environment | best mean reward | evaluation results | config link | comparison |
|-------------|------------------|--------------------|-------------|------------|
| HalfCheetah (Medium Expert) | 57.6 \(\pm\) 3.7 | ../_images/halfcheetah_cql.png | config_link_ha | CQL Repo (75.6 \(\pm\) 25.7) |
| Walker2d (Medium Expert) | 109.7 \(\pm\) 0.8 | ../_images/walker2d_cql.png | config_link_w | CQL Repo (107.9 \(\pm\) 1.6) |
| Hopper (Medium Expert) | 85.4 \(\pm\) 14.8 | ../_images/hopper_cql.png | config_link_ho | CQL Repo (105.6 \(\pm\) 12.9) |

Specifically, for each dataset, the results of our implementation are as follows:

| environment | random | medium replay | medium expert | medium | expert |
|-------------|--------|---------------|---------------|--------|--------|
| HalfCheetah | 18.7 \(\pm\) 1.2 | 47.1 \(\pm\) 0.3 | 57.6 \(\pm\) 3.7 | 49.7 \(\pm\) 0.4 | 75.1 \(\pm\) 18.4 |
| Walker2d | 22.0 \(\pm\) 0.0 | 82.6 \(\pm\) 3.4 | 109.7 \(\pm\) 0.8 | 82.4 \(\pm\) 1.9 | 109.2 \(\pm\) 0.3 |
| Hopper | 3.1 \(\pm\) 2.6 | 98.3 \(\pm\) 1.8 | 85.4 \(\pm\) 14.8 | 79.6 \(\pm\) 8.5 | 105.4 \(\pm\) 7.2 |

P.S.:

  1. The above results are obtained by running the same configuration on four different random seeds (5, 10, 20, 30).

  2. The above benchmark is for HalfCheetah-v2, Hopper-v2, Walker2d-v2.

  3. The comparison results above are taken from the paper Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning. The complete table is shown below:

    ../_images/cql_official.png
  4. The Tensorboard results above are unnormalized.

Reference

  • Kumar, Aviral, et al. “Conservative q-learning for offline reinforcement learning.” arXiv preprint arXiv:2006.04779 (2020).

  • Bai, Chenjia, et al. “Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning.”

Other Public Implementations
