Policy iteration vs Value iteration
- Policy iteration computes the optimal value function and the optimal policy.
- Value iteration:
  - Maintain the optimal value of starting in a state s when only a finite number of steps remain in the episode.
  - Iterate to consider longer and longer episodes.
Policy iteration and value iteration converge to the same optimal value function, and therefore to an optimal policy.
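The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-state, two-action MDP (the arrays `P` and `R`), the discount factor `GAMMA`, and the helper names `q_values`, `value_iteration`, and `policy_iteration` are all assumptions introduced here, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, invented purely for illustration.
# P[a, s, t] = probability of moving from state s to state t under action a.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.1, 0.9], [0.7, 0.3]],  # action 1
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0], [0.0, 2.0]])
GAMMA = 0.9  # discount factor (assumed)


def q_values(V):
    """Return Q[s, a] = R[s, a] + GAMMA * sum_t P[a, s, t] * V[t]."""
    return R + GAMMA * np.einsum("ast,t->sa", P, V)


def value_iteration(tol=1e-10, max_iters=10_000):
    """Repeatedly back up V, i.e. consider longer and longer episodes."""
    V = np.zeros(R.shape[0])  # optimal value with zero steps left
    for _ in range(max_iters):
        V_new = q_values(V).max(axis=1)  # one more step of lookahead
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, q_values(V).argmax(axis=1)


def policy_iteration():
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_states = R.shape[0]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)  # arbitrary starting policy
    while True:
        # Evaluation: solve the linear system (I - GAMMA * P_pi) V = R_pi.
        P_pi = P[policy, states]  # row s is P[policy[s], s, :]
        R_pi = R[states, policy]
        V = np.linalg.solve(np.eye(n_states) - GAMMA * P_pi, R_pi)
        # Improvement: act greedily with respect to the evaluated V.
        new_policy = q_values(V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy


V_vi, pi_vi = value_iteration()
V_pi, pi_pi = policy_iteration()
print("value iteration policy: ", pi_vi)
print("policy iteration policy:", pi_pi)
```

On this toy problem the two methods should report the same greedy policy and nearly identical values. Policy iteration reaches it in a handful of exact evaluations, while value iteration needs many cheap sweeps.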
Algorithm
The value function of a policy is the solution to the Bellman equation for that policy.
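For reference, one common way to write the Bellman equation for a fixed policy is given below. The notation is an assumption, since the source does not fix one: \pi is the policy, R(s, a) the expected immediate reward, P(s' | s, a) the transition probability, and \gamma the discount factor.

```latex
V^{\pi}(s) \;=\; R\bigl(s, \pi(s)\bigr)
  \;+\; \gamma \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\, V^{\pi}(s')
  \qquad \text{for every state } s .
```

For a finite state space this is a linear system with one equation per state, so V^{\pi} can be obtained with a linear solver or by iterating the right-hand side until it stops changing.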
The Bellman-backup operator B is applied to a value function and returns a new value function. Where an improvement is possible, the operator improves the value; the optimal value function is left unchanged.
Applying B to a value function V yields a value function BV over all states s.
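Written out, a standard form of the Bellman-backup operator is shown below. The symbols B, V, R, P, and \gamma are assumptions chosen for illustration, not notation taken from the source.

```latex
(BV)(s) \;=\; \max_{a} \Bigl[\, R(s, a)
  \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Bigr]
  \qquad \text{for every state } s ,
\\[6pt]
V_{0} = 0, \qquad V_{k+1} = B\,V_{k}, \qquad V^{*} = B\,V^{*} .
```

Under this reading, V_k is the optimal value when k steps remain in the episode, so each application of B corresponds to considering an episode one step longer, and the optimal value function V^{*} is the fixed point of B.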
This article is issued from Wikiversity. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.