Reinforcement Learning/Policy iteration

Policy Iteration (PI) is one of the algorithms for finding the optimal policy of a Markov decision process (MDP), a task known as MDP control.

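Policy iteration is a model-based algorithm: it requires the reward function $R(s,a)$ and the transition model $P(s'\mid s,a)$ of the MDP to be known.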

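Each iteration computes $|A|\times |S|$ state-action values, so the algorithm computes $|A|\times |S|\times k$ of them in total, where $k$ is the number of iterations needed for convergence. Theoretically, the number of iterations is at most $|A|^{|S|}$, the number of distinct deterministic policies.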

The algorithm converges to the global optimum.

State-action value $Q$

The state-action value of a policy $\pi$ is the expected return obtained by taking the specified action $a$ immediately in state $s$ and then following the policy $\pi$:

$$Q^{\pi }(s,a)=R(s,a)+\gamma \sum _{s'\in S}P(s'\mid s,a)V^{\pi }(s')$$

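Here, $R(s,a)$ is the reward function of the MDP, $P(s'\mid s,a)$ is the transition model, $\gamma$ is the discount factor, and $V^{\pi }(s')$ is the state value of $s'$ under the policy $\pi$.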
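The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation. The array layout (`P[s, a, s']`, `R[s, a]`), the function names, and the two-state toy MDP are assumptions made for the example. It also assumes a deterministic policy, so that $V^{\pi }$ can be obtained exactly by solving a linear system.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Return V^pi for a deterministic policy by solving (I - gamma * P_pi) V = R_pi."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]   # shape (S, S): transitions under the policy's actions
    R_pi = R[idx, policy]   # shape (S,): rewards under the policy's actions
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def q_from_v(P, R, V, gamma):
    """Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s')."""
    return R + gamma * (P @ V)   # (S, A, S) @ (S,) -> (S, A)

# Hypothetical toy MDP with 2 states and 2 actions (assumed for illustration).
# P[s, a, s'] is the transition probability, R[s, a] the immediate reward.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
gamma = 0.9

policy = np.array([0, 0])            # always take action 0
V = evaluate_policy(P, R, policy, gamma)
Q = q_from_v(P, R, V, gamma)
```

By construction, `Q[s, policy[s]]` equals `V[s]` for every state, which matches the identity $Q^{\pi }(s,\pi (s))=V^{\pi }(s)$.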

Algorithm

Set $i=0$
Initialize $\pi _{0}(s)$ randomly for all states $s$
While $i=0$ or $|\pi _{i}-\pi _{i-1}|_{1}>0$ (the L1-norm measures whether the policy changed for any state):
- Compute the state value $V^{\pi _{i}}$ of the current policy (policy evaluation), then the state-action value of $\pi _{i}$ for all $s\in S$ and all $a\in A$: $Q^{\pi _{i}}(s,a)=R(s,a)+\gamma \sum _{s'\in S}P(s'\mid s,a)V^{\pi _{i}}(s')$
- Compute the new policy $\pi _{i+1}$ for all $s\in S$ by choosing, in each state, the action with the maximum state-action value: $\pi _{i+1}(s)=\arg \max _{a}Q^{\pi _{i}}(s,a)~~~\forall s\in S$
- Set $i=i+1$
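The loop above might be sketched as follows. This is one possible reading under stated assumptions rather than a definitive implementation. The array layout (`P[s, a, s']`, `R[s, a]`), the function name `policy_iteration`, and the toy MDP are hypothetical. Policy evaluation is done by an exact linear solve, which the text does not prescribe. The current action is also kept when it ties with the best one, an added safeguard against oscillating between equally good policies.

```python
import numpy as np

def policy_iteration(P, R, gamma, seed=0):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    idx = np.arange(n_states)
    rng = np.random.default_rng(seed)
    policy = rng.integers(n_actions, size=n_states)   # random initial policy pi_0
    iterations = 0
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi, R_pi = P[idx, policy], R[idx, policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # State-action values of the current policy, shape (S, A).
        Q = R + gamma * (P @ V)
        # Greedy improvement; keep the current action on (near-)ties (assumption).
        keep = np.isclose(Q[idx, policy], Q.max(axis=1))
        new_policy = np.where(keep, policy, Q.argmax(axis=1))
        iterations += 1
        if np.array_equal(new_policy, policy):         # policy unchanged: converged
            return policy, V, iterations
        policy = new_policy

# Hypothetical toy MDP with 2 states and 2 actions (assumed for illustration).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])

pi_star, V_star, n_iter = policy_iteration(P, R, gamma=0.9)
```

In this toy problem the only reward is earned by staying in state 1, so the returned policy moves from state 0 to state 1 and then stays there.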

Explanation

In each iteration, by definition we have

$$\max _{a}Q^{\pi _{i}}(s,a)\geq Q^{\pi _{i}}(s,\pi _{i}(s))=V^{\pi _{i}}(s)~~~\forall s\in S$$

Proof

$${\begin{aligned}V^{\pi _{i}}(s)\leq &~\max _{a}Q^{\pi _{i}}(s,a)\\=&~\max _{a}{\Big [}R(s,a)+\gamma \sum _{s'\in S}P(s'|s,a)V^{\pi _{i}}(s'){\Big ]}\\=&~R(s,\pi _{i+1}(s))+\gamma \sum _{s'\in S}P(s'|s,\pi _{i+1}(s))V^{\pi _{i}}(s')~~~\leftarrow {\text{by definition the action with the maximum Q value is taken as the new policy}}\\\leq &~R(s,\pi _{i+1}(s))+\gamma \sum _{s'\in S}P(s'|s,\pi _{i+1}(s)){\Big [}\max _{a'}Q^{\pi _{i}}(s',a'){\Big ]}\\=&~R(s,\pi _{i+1}(s))+\gamma \sum _{s'\in S}P(s'|s,\pi _{i+1}(s)){\Bigg [}R(s',\pi _{i+1}(s'))+\gamma \sum _{s''\in S}P(s''|s',\pi _{i+1}(s'))V^{\pi _{i}}(s''){\Bigg ]}\\\vdots &\\\leq &~V^{\pi _{i+1}}(s)\end{aligned}}$$
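The inequality can also be checked numerically on a small example. The sketch below is only a sanity check under assumptions, not part of the proof. It builds a random MDP with 3 states and 2 actions (the sizes, the seed, and the helper names `evaluate` and `greedy` are arbitrary choices for the example), enumerates all $|A|^{|S|}$ deterministic policies, and verifies that one greedy improvement step never lowers the value of any state.

```python
import itertools
import numpy as np

def evaluate(P, R, policy, gamma):
    """Exact V^pi of a deterministic policy via a linear solve."""
    idx = np.arange(R.shape[0])
    P_pi, R_pi = P[idx, policy], R[idx, policy]
    return np.linalg.solve(np.eye(R.shape[0]) - gamma * P_pi, R_pi)

def greedy(P, R, V, gamma):
    """Greedy policy with respect to Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')."""
    return (R + gamma * (P @ V)).argmax(axis=1)

# Random MDP (assumed sizes and seed, for illustration only).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)        # normalize rows into probabilities
R = rng.random((n_states, n_actions))

n_checked = 0
never_worse = True
for actions in itertools.product(range(n_actions), repeat=n_states):
    policy = np.array(actions)
    V_old = evaluate(P, R, policy, gamma)
    V_new = evaluate(P, R, greedy(P, R, V_old, gamma), gamma)
    # Small tolerance absorbs floating-point error in the linear solves.
    never_worse = never_worse and bool(np.all(V_new >= V_old - 1e-9))
    n_checked += 1
```

After the loop, `never_worse` records whether $V^{\pi _{i+1}}(s)\geq V^{\pi _{i}}(s)$ held for every starting policy and every state, and `n_checked` equals $|A|^{|S|}=2^{3}=8$.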

This article is issued from Wikiversity. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
