Reinforcement Learning Resources
- State $S_t$: the environment at time $t$; the state space can be continuous or discrete.
- Action $A_t$: an action taken by the agent, drawn from the set of available actions $A = \{a_1, \dots, a_N\}$.
- Reward $R_t$: the reward the agent receives after taking an action, usually a real number.
- Policy $\pi(a|s)$: a conditional probability distribution giving the probability of each action the agent takes after observing the current state $S_t$.
- State value function: $v_{\pi}(S_t)$.
- Q-function (state-action value function): $Q(S_t, a_t)$. (A toy sketch of these objects follows this list.)
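To anchor these definitions, here is a minimal sketch of the agent/environment loop on a toy 2-state MDP. Everything here (the state and action counts, the transition table `P`, the policy values) is an illustrative assumption, not from any particular environment or library:

```python
import numpy as np

N_STATES, N_ACTIONS = 2, 2

# Policy pi(a|s): one categorical distribution over actions per state.
pi = np.array([[0.7, 0.3],   # action probabilities in state 0
               [0.4, 0.6]])  # action probabilities in state 1

# Environment dynamics (made up): (S_t, A_t) -> (S_{t+1}, R_{t+1}).
P = {(0, 0): (0, 1.0), (0, 1): (1, 0.0),
     (1, 0): (0, 0.0), (1, 1): (1, 2.0)}

rng = np.random.default_rng(0)

def sample_action(s):
    """Agent: draws A_t ~ pi(.|S_t)."""
    return rng.choice(N_ACTIONS, p=pi[s])

s = 0
for t in range(5):
    a = sample_action(s)
    s_next, r = P[(s, a)]  # environment step
    print(f"t={t}  S_t={s}  A_t={a}  R_{{t+1}}={r}")
    s = s_next
```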
Cumulative return $G_t$
$$ \begin{equation} \begin{split} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\ &= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \end{split} \end{equation} $$
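A small sketch of computing this sum from a sampled reward sequence (the rewards and the helper name `discounted_return` are made up for illustration); running the backward recursion $g \leftarrow r + \gamma g$ evaluates the truncated sum in one pass:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed backwards in one pass."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```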
State value function $v_{\pi}(s_t)$
In state $s_t$, the state value function is the expectation of the cumulative return under policy $\pi$:
$$ \begin{equation} \begin{split} v_{\pi}(s_t) &= \mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\big|\, S_t=s_t\big] \\ &= \mathbb{E}_{\pi}\big[G_t \,\big|\, S_t=s_t\big]. \end{split} \end{equation} $$
The way I understand the difference between $G_t$ and $v_{\pi}(s_t)$: $G_t$ is the cumulative return of a single trajectory sampled under $\pi$, and is therefore a random variable, since the action taken in each state is drawn from $\pi$; $v_{\pi}(s_t)$ is the expectation of $G_t$ over those random actions (and the environment's transitions).
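This distinction can be made concrete with a Monte Carlo sketch: each rollout under $\pi$ yields one sample of the random variable $G_t$, and averaging many samples estimates $v_{\pi}(s_t)$. The toy MDP below is the same made-up example as earlier; the horizon truncation and all names are assumptions:

```python
import numpy as np

pi = np.array([[0.7, 0.3], [0.4, 0.6]])    # pi(a|s)
P = {(0, 0): (0, 1.0), (0, 1): (1, 0.0),
     (1, 0): (0, 0.0), (1, 1): (1, 2.0)}   # (s, a) -> (s', r)
rng = np.random.default_rng(0)

def sample_return(s, gamma=0.9, horizon=50):
    """One rollout from state s: a single sample of the random return G_t."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):               # truncate the infinite sum
        a = rng.choice(2, p=pi[s])
        s, r = P[(s, a)]
        g += discount * r
        discount *= gamma
    return g

samples = [sample_return(0) for _ in range(5000)]
print("one G_t sample:", samples[0])
print("v_pi(0) ~", np.mean(samples))       # Monte Carlo estimate of E[G_t | S_t=0]
```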
Similarly, the state-action value function (Q-function) conditions on the first action as well:
$$ \begin{equation} q_{\pi}(s_t, a) = \mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\big|\, S_t=s_t, A_t=a \big], \end{equation} $$
where $\gamma \in [0, 1)$ is the discount factor. (A tabular Q-learning sketch that estimates this function follows the resource list below.)
- A brief introduction to reinforcement learning.
- Deep Reinforcement Learning: Pong from Pixels. Python Code, PyTorch Code
- TDM: From Model-Free to Model-Based Deep Reinforcement Learning.
- RL — Policy Gradient Explained.
- Policy Gradients in a Nutshell.
- An introduction to Policy Gradients with Cartpole and Doom.
- What is an example of temporal difference learning?
- Applications of RL in Computer Vision.
- What is the difference between model-based and model-free reinforcement learning?
- Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward, AAAI 2018. [PyTorch Code], [arXiv].
- Environment Upgrade Reinforcement Learning for Non-differentiable Multi-stage Pipelines, CVPR 2018. [PDF]
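As promised above, here is a minimal tabular Q-learning sketch: a temporal-difference method (cf. the TD link in the list) that estimates the state-action value function defined earlier. The toy MDP, learning rate, and exploration rate are illustrative assumptions:

```python
import numpy as np

P = {(0, 0): (0, 1.0), (0, 1): (1, 0.0),
     (1, 0): (0, 0.0), (1, 1): (1, 2.0)}   # made-up dynamics: (s, a) -> (s', r)
Q = np.zeros((2, 2))                       # Q[s, a], initialized to zero
gamma, alpha, eps = 0.9, 0.1, 0.1          # discount, learning rate, exploration
rng = np.random.default_rng(0)

s = 0
for _ in range(10000):
    # epsilon-greedy behaviour policy
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = P[(s, a)]
    # TD update: move Q(s,a) toward the bootstrapped target r + gamma * max_a' Q(s',a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)  # greedy policy = argmax over each row
```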