Imitation Learning
==================

Problem and Significance
------------------------

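Imitation Learning (IL) is a family of methods in which an agent extracts knowledge from expert data and then reproduces the behavior shown in that data. By its nature, IL faces two major difficulties: it needs a large amount of training data, and that data must be of high quality. To address these difficulties, work on IL can broadly be divided into three directions: IRL (inverse reinforcement learning), BC (behavioral cloning), and Adversarial Structured IL. Each direction is briefly analyzed below.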

Research Directions
-------------------

BC
~~~~~~~~

BC was first proposed in [1]. It is a supervised learning approach that fits the expert data to build a direct mapping from states to actions.
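The state-to-action fit described above can be illustrated with a minimal sketch. It assumes a linear policy fitted by least squares and a made-up expert (``a = -2 * s + 0.5``); the names ``fit_bc_policy``, ``expert_states`` and ``expert_actions`` are hypothetical and do not come from [1]. Practical BC would usually replace the linear model with a neural network.

```python
import numpy as np


def fit_bc_policy(states, actions):
    """Fit a linear policy a = [s, 1] @ W by least squares on expert pairs."""
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    # Append a constant feature so the policy can learn a bias term.
    features = np.hstack([states, np.ones((len(states), 1))])
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)

    def policy(state):
        state = np.asarray(state, dtype=float)
        return np.append(state, 1.0) @ weights

    return policy


# Hypothetical expert demonstrations (assumption): a = -2 * s + 0.5.
rng = np.random.default_rng(0)
expert_states = rng.uniform(-1.0, 1.0, size=(200, 1))
expert_actions = -2.0 * expert_states + 0.5

bc_policy = fit_bc_policy(expert_states, expert_actions)
predicted_action = bc_policy([0.3])  # action for an unseen state
```

Because the fit only sees states that appear in the demonstrations, its predictions are reliable only near that data.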

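The main advantages of BC are its efficiency and simplicity. However, once the agent encounters a state it has never seen, it may act incorrectly. This problem is known as "state drift" (often called distribution shift). To address it, DAgger [2] updates the dataset dynamically: it keeps adding new expert data for the states that the trained policy actually encounters. In later work, IBC [3] takes an implicit approach to behavioral cloning. Its key idea is to train a neural network that takes an observation and an action as input and outputs a single number that is low for expert actions and high for non-expert actions, which turns behavioral cloning into an energy-based modeling problem.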
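The dataset-aggregation loop of DAgger can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact algorithm of [2]: the environment is a toy 1-D system ``s' = s + a``, the expert is a hypothetical linear controller, the learner is a least-squares linear policy, and the expert/learner mixing schedule of the original paper is reduced to "expert in the first iteration, learner afterwards". All function names are hypothetical.

```python
import numpy as np


def expert_action(state):
    # Hypothetical expert (assumption): steer the 1-D state toward zero.
    return -0.5 * state


def rollout(policy, start_state, horizon):
    """Return the states visited when `policy` controls s' = s + a."""
    states, state = [], start_state
    for _ in range(horizon):
        states.append(state)
        state = state + policy(state)
    return states


def fit_linear_policy(states, actions):
    """Least-squares fit of a = w * s + b."""
    features = np.column_stack([states, np.ones(len(states))])
    (w, b), *_ = np.linalg.lstsq(features, np.asarray(actions), rcond=None)
    return lambda s: w * s + b


def dagger(num_iterations=5, horizon=20, seed=0):
    rng = np.random.default_rng(seed)
    data_states, data_actions = [], []
    policy = expert_action  # the first iteration collects states under the expert
    for _ in range(num_iterations):
        start = rng.uniform(-1.0, 1.0)
        # 1. Visit states with the current policy.
        visited = rollout(policy, start, horizon)
        # 2. Ask the expert to label those states and aggregate the dataset.
        data_states.extend(visited)
        data_actions.extend(expert_action(s) for s in visited)
        # 3. Retrain the learner on everything collected so far.
        policy = fit_linear_policy(data_states, data_actions)
    return policy, len(data_states)


learned_policy, num_samples = dagger()
```

The important design point is step 2: labels always come from the expert, but the states come from the learner's own rollouts, so the dataset covers the states the learner actually reaches.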

Current research on BC focuses mainly on two topics: meta-learning and behavioral cloning with VR devices.

IRL
~~~~~~~~

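IRL mainly aims to address the difficulty of collecting enough high-quality data. Specifically, IRL first learns a reward function from the expert data and then uses this reward function for subsequent RL training. In theory, this allows IRL to achieve performance beyond that of the expert data.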

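In terms of specific work, Ziebart et al. [4] first proposed maximum entropy IRL, which relies on the maximum entropy distribution to obtain good guarantees and efficient optimization. In 2016, Finn et al. [5] proposed a model-based IRL method called guided cost learning, which represents the cost with a neural network to improve expressiveness. Hester et al. then proposed DQfD [6], which needs only a small amount of expert data and significantly accelerates training through a pre-training phase followed by a regular learning phase. Later methods such as T-REX [7] rank the expert demonstrations and learn a reward function indirectly by comparing which demonstrations perform better.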
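The idea of learning a reward from ranked demonstrations can be sketched with a pairwise ranking loss. The sketch below is an assumption-laden illustration rather than the implementation of [7]: the reward is linear in the state (``r(s) = theta . s``), the loss is a Bradley-Terry style cross-entropy over trajectory returns, and the ranked trajectories are synthetic, ordered by a hidden reward that depends only on the first state feature. All names are hypothetical.

```python
import numpy as np


def trajectory_return(theta, trajectory):
    """Sum of predicted per-state rewards r(s) = theta . s."""
    return float(np.sum(trajectory @ theta))


def ranking_loss_and_grad(theta, worse, better):
    """Loss -log P(better is preferred over worse) and its gradient in theta."""
    diff = trajectory_return(theta, better) - trajectory_return(theta, worse)
    loss = np.logaddexp(0.0, -diff)          # equals -log(sigmoid(diff))
    prob_worse = 1.0 / (1.0 + np.exp(diff))  # equals 1 - sigmoid(diff)
    grad = -prob_worse * (better.sum(axis=0) - worse.sum(axis=0))
    return loss, grad


# Synthetic ranked demonstrations (assumption): the hidden reward is feature 0.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, 0.0])
trajectories = [rng.normal(size=(10, 2)) for _ in range(20)]
trajectories.sort(key=lambda t: trajectory_return(true_theta, t))  # worst to best

theta = np.zeros(2)
for _ in range(200):
    # Sample a pair; the lower index is the lower-ranked trajectory.
    i, j = sorted(rng.choice(len(trajectories), size=2, replace=False))
    _, grad = ranking_loss_and_grad(theta, trajectories[i], trajectories[j])
    theta -= 0.05 * grad
```

Only the ordering of the demonstrations is used, never their true returns, which is why the learned reward can in principle extrapolate beyond the best demonstration.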

Adversarial Structured IL
~~~~~~~~~~~~~~~~~~~~~~~~~~~

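Adversarial Structured IL methods mainly aim to address the inefficiency of IRL. As the IRL procedure shows, even when it learns a very good reward function, obtaining the final policy still requires a reinforcement learning step. Learning the policy directly from expert data would therefore be much more efficient. Based on this idea, GAIL [8] combines generative adversarial networks (GAN) with maximum entropy IRL and can train continuously and efficiently without requiring humans to keep labeling expert data.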
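The discriminator side of this adversarial setup can be sketched as below. It is a minimal illustration under assumptions, not the method of [8] as published: the discriminator is a logistic regression over ``(state, action)`` pairs, it is trained to output a high probability for expert pairs, and the surrogate reward is ``-log(1 - D(s, a))``. Sign conventions differ between papers and implementations. The synthetic data and all names are hypothetical, and the policy update that would consume the reward is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_discriminator(expert_sa, policy_sa, steps=500, lr=0.1):
    """Logistic regression D(s, a) = P(pair came from the expert)."""
    inputs = np.vstack([expert_sa, policy_sa])
    inputs = np.hstack([inputs, np.ones((len(inputs), 1))])  # bias feature
    labels = np.concatenate([np.ones(len(expert_sa)), np.zeros(len(policy_sa))])
    weights = np.zeros(inputs.shape[1])
    for _ in range(steps):
        preds = sigmoid(inputs @ weights)
        # Gradient of the mean binary cross-entropy loss.
        weights -= lr * inputs.T @ (preds - labels) / len(labels)
    return weights


def imitation_reward(weights, state_action):
    """Surrogate reward -log(1 - D(s, a)): larger when the pair looks expert-like."""
    logit = np.append(state_action, 1.0) @ weights
    return float(np.logaddexp(0.0, logit))  # -log(1 - sigmoid(logit))


# Synthetic data (assumption): the expert pushes right, the current policy pushes left.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(256, 1))
expert_sa = np.hstack([states, 1.0 + 0.1 * rng.normal(size=(256, 1))])
policy_sa = np.hstack([states, -1.0 + 0.1 * rng.normal(size=(256, 1))])

disc_weights = train_discriminator(expert_sa, policy_sa)
reward_expert_like = imitation_reward(disc_weights, [0.0, 1.0])
reward_policy_like = imitation_reward(disc_weights, [0.0, -1.0])
```

In a full training loop the discriminator update and a policy-gradient update on this reward would alternate, so no separate reward-learning stage followed by a complete RL run is needed.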

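Many works have improved on GAIL. For example, InfoGAIL [9] replaces the GAN with a WGAN and achieves good results. Some more recent works, such as GoalGAIL [10], TRGAIL [11], and DGAIL [12], combine GAIL with other methods, such as hindsight relabeling and DDPG, to achieve faster convergence and better final performance.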
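Hindsight relabeling, mentioned above, can be illustrated with a short sketch. It assumes transitions stored as ``(state, action, next_state, goal)`` tuples and the common "future" strategy, in which the goal is replaced by a state reached later in the same trajectory. The tuple layout and the name ``hindsight_relabel`` are hypothetical; how [10] applies relabeling in detail is not reproduced here.

```python
import random


def hindsight_relabel(trajectory, rng):
    """Relabel each transition's goal with a state reached later in the same trajectory."""
    relabeled = []
    for t, (state, action, next_state, _goal) in enumerate(trajectory):
        # Pick a step at or after t and use the state reached there as the new goal.
        future = rng.randrange(t, len(trajectory))
        new_goal = trajectory[future][2]
        reached = next_state == new_goal
        relabeled.append((state, action, next_state, new_goal, reached))
    return relabeled


# Toy trajectory on a line (assumption): the original goal 10 is never reached.
demo = [(0, 1, 1, 10), (1, 1, 2, 10), (2, 1, 3, 10)]
relabeled_demo = hindsight_relabel(demo, random.Random(0))
```

After relabeling, every transition points at a goal that the trajectory really did reach, so even unsuccessful trajectories provide useful goal-conditioned training signal.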

Future Outlook
--------------

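Imitation learning still faces many challenges, mainly the following:

- Current imitation learning methods target one specific task; methods that apply across multiple tasks are lacking;

- When the expert data is suboptimal, current imitation learning algorithms struggle to surpass it and reach the optimal result;

- Current imitation learning algorithms work mainly from observations and do not yet incorporate multimodal inputs such as speech and natural language;

- Current imitation learning can find a local optimum but often fails to find the global optimum.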

References
----------

[1] Michael Bain and Claude Sammut. 1999. A framework for behavioural
cloning. In *Machine Intelligence 15*. Oxford University Press, 103–129.

[2] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A
reduction of imitation learning and structured prediction to no-regret
online learning. In *Proceedings of the fourteenth international
conference on artificial intelligence and statistics*. JMLR Workshop
and Conference Proceedings, 627–635.

[3] P. Florence, C. Lynch, A. Zeng, O. Ramirez, A. Wahid, L. Downs,
et al. 2021. Implicit behavioral cloning.

[4] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey.
2008. Maximum entropy inverse reinforcement learning. In *AAAI*, Vol. 8.
Chicago, IL, USA, 1433–1438.

[5] Chelsea Finn, Sergey Levine, and Pieter Abbeel. 2016. Guided cost
learning: Deep inverse optimal control via policy optimization. In
*International conference on machine learning*. PMLR, 49–58.

[6] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom
Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel
Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas
Gruslys. 2017. Deep Q learning from Demonstrations.
*arXiv:1704.03732 [cs]* (Nov. 2017). http://arxiv.org/abs/1704.03732

[7] Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum.
2019. Extrapolating beyond suboptimal demonstrations via inverse
reinforcement learning from observations. In *International Conference
on Machine Learning*. PMLR, 783–792.

[8] Jonathan Ho and Stefano Ermon. 2016. Generative Adversarial
Imitation Learning. In *Advances in Neural Information Processing
Systems 29*, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and
R. Garnett (Eds.). Curran Associates, Inc., 4565–4573.
http://papers.nips.cc/paper/6391-generative-adversarial-imitation-learning.pdf

[9] Yunzhu Li, Jiaming Song, and Stefano Ermon. 2017. InfoGAIL:
Interpretable Imitation Learning from Visual Demonstrations. In
*Advances in Neural Information Processing Systems 30*, I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
R. Garnett (Eds.). Curran Associates, Inc., 3812–3822.
http://papers.nips.cc/paper/6971-infogail-interpretable-imitation-learning-from-visual-demonstrations.pdf

[10] Yiming Ding, Carlos Florensa, Mariano Phielipp, and Pieter Abbeel.
2019. Goal-conditioned imitation learning. *arXiv preprint
arXiv:1906.05838* (2019).

[11] Akira Kinose and Tadahiro Taniguchi. 2020. Integration of imitation
learning using GAIL and reinforcement learning using task-achievement
rewards via probabilistic graphical model. *Advanced Robotics*
(June 2020), 1–13. https://doi.org/10.1080/01691864.2020.1778521

[12] Guoyu Zuo, Kexin Chen, Jiahao Lu, and Xiangsheng Huang. 2020.
Deterministic generative adversarial imitation learning.
*Neurocomputing* 388 (May 2020), 60–69.
https://doi.org/10.1016/j.neucom.2020.01.016