< Reinforcement Learning
Policy Iteration (PI) is one of the algorithms for finding the optimal policy (MDP control).
Policy iteration is a model-based algorithm.
The complexity of the algorithm is where is the number of iterations needed for convergence. Theoretically, the maximum number of iterations is .
The algorithm converges to the global optimum.
State-action value
State-action value of a policy , is calculated by taking the specified action immediately, then following the policy
Here, is the reward function in MDP and is the transition model.
Algorithm
- Set
- Initialize randomly for all states
- While or (L1-norm, measures if the policy changed for any state):
- Compute state-action value of a policy , for all and all
- Compute new policy , for all by choosing the action that returns the maximum state-action value for each specific state
- Compute state-action value of a policy , for all and all
Explanation
In each iteration, by definition we have
Proof
This article is issued from Wikiversity. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.