Temporal difference (TD) learning is a central and novel idea in reinforcement learning.
- It is a combination of Monte Carlo and dynamic programming methods
- It is a model-free learning algorithm
- It both bootstraps (builds on top of the previous best estimate) and samples (see the comparison of update targets after this list)
- It can be used for both episodic and infinite-horizon (non-episodic) domains
- Immediately updates the estimate of V after each observed transition
- Requires the system to be Markovian
- Biased estimator of the value function, but often has much lower variance than the Monte Carlo estimator
- Converges to the true value function in the finite (tabular) state case, but does not always converge when the value function is represented with function approximation (e.g., for infinite or very large state spaces)
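To make the bootstrapping-and-sampling point concrete, here is a sketch of the three update targets in standard notation (assuming a discount factor $\gamma$ and sampled rewards $r_t$; this comparison is illustrative and not part of the original article):

```latex
% Monte Carlo target: pure sampling, uses the full observed return
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots

% Dynamic programming target: pure bootstrapping, uses the model's expectation
\mathbb{E}\left[ r_t + \gamma V(s_{t+1}) \mid s_t \right]

% TD(0) target: samples one transition, then bootstraps on the current estimate V
r_t + \gamma V(s_{t+1})
```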
Algorithm
TD learning methods form a spectrum between pure Monte Carlo and dynamic programming, but the simplest TD learning algorithm, TD(0), is as follows:
- Input: step size $\alpha \in (0, 1]$, discount factor $\gamma$
- Initialize $V(s) = 0$ for all states $s$
- Loop
- Sample tuple $(s_t, a_t, r_t, s_{t+1})$
- Update $V(s_t) \leftarrow V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)$
The temporal difference error is defined as $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, so the update can equivalently be written as $V(s_t) \leftarrow V(s_t) + \alpha \delta_t$.
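As a concrete illustration, here is a minimal tabular TD(0) sketch in Python. The environment interface (Gym-style `reset()` and `step()`), the policy argument, and the hyperparameter values are assumptions made for this example, not part of the original article:

```python
from collections import defaultdict

def td0_value_estimation(env, policy, alpha=0.1, gamma=0.99, num_episodes=1000):
    """Tabular TD(0): estimate V for a fixed policy.

    Assumes a Gym-style environment with reset() -> state and
    step(action) -> (next_state, reward, done, info).
    """
    V = defaultdict(float)  # Initialize V(s) = 0 for all states

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                          # sample a_t from the policy
            next_state, reward, done, _ = env.step(action)  # sample (r_t, s_{t+1})

            # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            # Terminal states have value 0, so the bootstrap term is dropped when done.
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]

            # Update immediately after each transition (bootstrapping + sampling)
            V[state] += alpha * delta

            state = next_state

    return V
```

For example, `V = td0_value_estimation(env, lambda s: env.action_space.sample())` would estimate the value function of a uniformly random policy, assuming the environment exposes a Gym-style `action_space`.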