[Reinforcement Learning 02] Value-based reinforcement learning

刘兴禄 · Published 2022-01-27 00:33:24

This post is compiled from a teaching video (by Shusen Wang): https://www.bilibili.com/video/BV1rv41167yx?from=search&seid=18272266068137655483&spm_id_from=333.337.0.0

Value-based reinforcement learning

Action-value functions

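The return $U_t$ is the sum of future rewards. It is random because it depends on all future actions $A_t, A_{t+1}, A_{t+2}, \cdots$ and all future states $S_t, S_{t+1}, S_{t+2}, \cdots$ . The action-value function of a policy $\pi$ is the conditional expectation of the return, $Q_{\pi}(s_t, a_t) = \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]$ . Taking the expectation integrates out the randomness of all later actions and states (it eliminates the random variables), so only the current observations $a_t$ and $s_t$ remain.

The optimal action-value function takes the maximum over all policies $\pi$:

$\begin{aligned} Q^{*}(s_t, a_t) = \max_{\pi} Q_{\pi}(s_t, a_t) \end{aligned}$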
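
As a small worked example of these two definitions, the sketch below computes $Q_{\pi}$ exactly for a toy two-step problem. Everything in it is hypothetical and chosen only for illustration: the policies `pi_random` and `pi_greedy`, the reward values, and the discount factor `GAMMA` (the text above does not specify discounting).

```python
# Toy two-step problem (all numbers are illustrative assumptions).
GAMMA = 0.9          # assumed discount factor
REWARD_NOW = 1.0     # reward received for taking a_t in s_t
NEXT_REWARD = {"left": 0.0, "right": 10.0}  # reward of the next action A_{t+1}

# Each policy is a probability distribution over the next action A_{t+1}.
POLICIES = {
    "pi_random": {"left": 0.5, "right": 0.5},
    "pi_greedy": {"left": 0.0, "right": 1.0},
}


def q_pi(policy):
    """Q_pi(s_t, a_t) = E[U_t | s_t, a_t], with U_t = R_t + GAMMA * R_{t+1}."""
    expected_next = sum(prob * NEXT_REWARD[action] for action, prob in policy.items())
    return REWARD_NOW + GAMMA * expected_next


q_by_policy = {name: q_pi(policy) for name, policy in POLICIES.items()}
# Maximum over the two candidate policies; in this toy problem no other
# policy can do better than always choosing "right".
q_star = max(q_by_policy.values())
```

The expectation removes the randomness of $A_{t+1}$, and the maximum over policies selects `pi_greedy`, which always picks the better next action.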

Deep Q Network (DQN)

DQN uses a neural network to approximate the Q function.

We approximate the Q function $Q^{*}(s, a)$ with a neural network, which we denote by $Q(s, a; \mathbf{w})$ , where:

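$s, a$ are the arguments of the network: the state $s$ is fed in as input, and the action $a$ selects which output score to read

$\mathbf{w}$ denotes the parameters of the network, that is, the weights of its connections

The output of the network is a vector of values that score all possible actions, one score per action

We train this network using rewards. As training proceeds, the scores it assigns to actions gradually improve and become more accurate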
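
The input/output shape described above can be sketched in a few lines. This is a minimal illustration, not the architecture used in the video: the layer sizes, the single hidden ReLU layer, and the helper names `make_q_network`, `q_values`, and `greedy_action` are all assumptions.

```python
import random


def make_q_network(state_dim, hidden_dim, num_actions, seed=0):
    """Create the parameters w: two weight matrices with small random entries."""
    rng = random.Random(seed)

    def layer(n_out, n_in):
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

    return {"W1": layer(hidden_dim, state_dim), "W2": layer(num_actions, hidden_dim)}


def q_values(w, state):
    """Forward pass: state in, one score per action out."""
    hidden = [max(0.0, sum(wi * si for wi, si in zip(row, state))) for row in w["W1"]]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w["W2"]]


def greedy_action(w, state):
    """Index of the action with the highest score."""
    scores = q_values(w, state)
    return max(range(len(scores)), key=scores.__getitem__)


w = make_q_network(state_dim=4, hidden_dim=8, num_actions=3)
state = [0.1, -0.2, 0.3, 0.0]
scores = q_values(w, state)       # Q(s, a; w) for every action a
action = greedy_action(w, state)  # action with the largest score
```

With untrained weights the scores are arbitrary; training is what makes the largest score correspond to a good action.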

Algorithm for training the neural network: `Temporal difference algorithm`
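
A minimal sketch of one temporal difference (TD) update may help make the idea concrete. It assumes the standard one-step TD target $y = r + \gamma \max_{a'} Q(s', a'; \mathbf{w})$ and, to stay short, replaces the neural network with a linear model that has one weight row per action. The names `linear_q_values` and `td_update`, the discount factor `gamma`, and the learning rate `lr` are assumptions made for illustration.

```python
def linear_q_values(W, state):
    """Linear stand-in for Q(s, .; w): one score per action (one row of W each)."""
    return [sum(w_i * s_i for w_i, s_i in zip(row, state)) for row in W]


def td_update(W, s, a, r, s_next, gamma=0.9, lr=0.1, done=False):
    """Apply one TD step to W in place and return the TD error."""
    q_sa = linear_q_values(W, s)[a]
    # Bootstrap from the best next-state score unless the episode has ended.
    bootstrap = 0.0 if done else max(linear_q_values(W, s_next))
    td_target = r + gamma * bootstrap
    td_error = q_sa - td_target
    # Gradient of 0.5 * td_error**2 with respect to W[a] is td_error * s,
    # treating the TD target as a constant.
    W[a] = [w_i - lr * td_error * s_i for w_i, s_i in zip(W[a], s)]
    return td_error


# Two actions, two state features, weights initialised to zero.
W = [[0.0, 0.0], [0.0, 0.0]]
s, s_next = [1.0, 0.0], [0.0, 1.0]
first_error = td_update(W, s, a=1, r=1.0, s_next=s_next)
second_error = td_update(W, s, a=1, r=1.0, s_next=s_next)
```

Repeating the same transition shrinks the TD error in magnitude, which is the sense in which the scores become more accurate as rewards are observed.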
