Policy iteration vs Value iteration

Policy iteration computes the optimal value function and an optimal policy.
Value iteration:
- Maintains the optimal value of starting in a state $s$ when a finite number of steps $k$ are left in the episode
- Iterates to consider longer and longer episodes
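
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-state, two-action model (`P`, `R`), the discount `GAMMA`, and the function name `value_iteration` are all assumptions made up for the example.

```python
# Hypothetical toy MDP (not from the article): 2 states, 2 actions.
# P[s][a][s2] is the probability of landing in s2 after taking action a in s.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
# R[s][a] is the immediate reward for taking action a in state s.
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]
GAMMA = 0.9  # assumed discount factor


def value_iteration(P, R, gamma, horizon):
    """Return [V_0, ..., V_horizon]; V_k[s] is the optimal value with k steps left."""
    n_states = len(P)
    V = [0.0] * n_states  # with 0 steps left, no reward can be collected
    history = [V]
    for _ in range(horizon):
        # One more step in the episode: best immediate reward plus discounted V_{k-1}.
        V = [
            max(
                R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        history.append(V)
    return history


history = value_iteration(P, R, GAMMA, horizon=200)
print(history[1], history[2], history[200])
```

For this toy model the values grow with the horizon and approach 19 and 20 for states 0 and 1, which is the infinite-horizon optimum.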

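Policy iteration and value iteration both converge to the same optimal value function, and therefore to an optimal policy (the same one, unless several actions are equally good in some state).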
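
A small experiment can make this agreement concrete. The sketch below runs both methods on a made-up two-state model; the model, the tolerance, and all function names (`q_value`, `evaluate`, `greedy`, `policy_iteration`, `value_iteration`) are assumptions for illustration only, and policy evaluation is done by simple repeated substitution rather than an exact solve.

```python
# Hypothetical toy MDP (not from the article): 2 states, 2 actions.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # P[s][a][s2]: transition probabilities
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [
    [0.0, 1.0],  # R[s][a]: immediate rewards
    [2.0, 0.0],
]
GAMMA = 0.9  # assumed discount factor


def q_value(s, a, V):
    """One-step lookahead: reward now plus discounted expected value of the next state."""
    return R[s][a] + GAMMA * sum(p * v for p, v in zip(P[s][a], V))


def evaluate(policy, tol=1e-10):
    """Approximate V^pi by repeatedly substituting into the Bellman equation."""
    V = [0.0] * len(P)
    while True:
        new_V = [q_value(s, policy[s], V) for s in range(len(P))]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def greedy(V):
    """Pick, in every state, the action with the largest one-step lookahead."""
    return [
        max(range(len(P[s])), key=lambda a: q_value(s, a, V))
        for s in range(len(P))
    ]


def policy_iteration():
    policy = [0] * len(P)  # arbitrary starting policy
    while True:
        V = evaluate(policy)
        improved = greedy(V)
        if improved == policy:  # no state wants to switch action: done
            return policy, V
        policy = improved


def value_iteration(tol=1e-10):
    V = [0.0] * len(P)
    while True:
        new_V = [
            max(q_value(s, a, V) for a in range(len(P[s])))
            for s in range(len(P))
        ]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return greedy(new_V), new_V
        V = new_V


pi_policy, pi_values = policy_iteration()
vi_policy, vi_values = value_iteration()
print(pi_policy, vi_policy)
```

On this model both methods select action 1 in state 0 and action 0 in state 1, and their value estimates agree up to the stopping tolerance.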

Algorithm

The value function of a policy $\pi$ is the solution to the Bellman equation

$$V^{\pi }(s)=R^{\pi }(s)+\gamma \sum _{s'\in S}P^{\pi }(s'|s)V^{\pi }(s')$$
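
Because this equation is linear in $V^{\pi }$, it can be rearranged to $(I-\gamma P^{\pi })V^{\pi }=R^{\pi }$ and solved directly when the state space is small. The following is a minimal sketch under that assumption; the policy-induced model `P_pi` and `R_pi`, the discount of 0.9, and the helper names `solve_linear` and `evaluate_policy` are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(P_pi, R_pi, gamma):
    """Solve (I - gamma * P_pi) V = R_pi for the value function of a fixed policy."""
    n = len(R_pi)
    A = [
        [(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(A, R_pi)


# Hypothetical policy-induced model: state 0 always moves to state 1, state 1 stays put.
P_pi = [[0.0, 1.0], [0.0, 1.0]]  # P_pi[s][s2] = P^pi(s2 | s)
R_pi = [1.0, 2.0]                # R_pi[s] = R^pi(s)
V_pi = evaluate_policy(P_pi, R_pi, 0.9)
print(V_pi)
```

The result should satisfy the Bellman equation in every state; here that means values of about 19 for state 0 and 20 for state 1.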

The Bellman backup operator ${\mathcal {B}}$ is applied to a value function and returns a new value function. In each state it chooses the action with the best one-step lookahead, so it improves the value wherever that is possible

$${\mathcal {B}}V(s)=\max _{a}\left[R(s,a)+\gamma \sum _{s'\in S}P(s'|s,a)V(s')\right]$$

${\mathcal {B}}V$ yields a value function over all states $s$.
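
Written as code, the operator is a function from one list of state values to another. The sketch below assumes a made-up two-state model and a discount of 0.9; `bellman_backup` and `sup_norm` are illustrative names. It also checks two standard properties on this example: the backup shrinks the largest gap between two value functions by at least the factor $\gamma$, and the optimal value function is left unchanged.

```python
# Hypothetical toy MDP (not from the article): 2 states, 2 actions.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # P[s][a][s2]: transition probabilities
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [
    [0.0, 1.0],  # R[s][a]: immediate rewards
    [2.0, 0.0],
]
GAMMA = 0.9  # assumed discount factor


def bellman_backup(V):
    """Apply the backup once: returns a new value for every state."""
    return [
        max(
            R[s][a] + GAMMA * sum(p * v for p, v in zip(P[s][a], V))
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ]


def sup_norm(V, W):
    """Largest absolute difference between two value functions."""
    return max(abs(v - w) for v, w in zip(V, W))


V = [0.0, 0.0]
W = [10.0, -5.0]  # an arbitrary second value function for comparison
BV = bellman_backup(V)
BW = bellman_backup(W)
print(sup_norm(BV, BW), "<=", GAMMA * sup_norm(V, W))

# For this toy model the optimal values are 19 and 20; backing them up changes nothing.
V_star = [19.0, 20.0]
print(bellman_backup(V_star))
```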

This article is issued from Wikiversity. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.