Temporal difference (TD) learning is a central and novel idea in reinforcement learning.

  • It combines ideas from Monte Carlo and dynamic programming methods
  • It is a model-free learning algorithm
  • It both bootstraps (builds on top of the previous best estimate) and samples
  • It can be used for both episodic and infinite-horizon (non-episodic) domains
  • It updates the estimate of V immediately after each sampled transition (state, action, reward, next state)
  • It requires the system to be Markovian
  • It is a biased estimator of the value function, but often has much lower variance than the Monte Carlo estimator
  • It converges to the true value function in the finite-state (tabular) case, but is not guaranteed to converge when the state space is infinite and function approximation must be used

Algorithm

TD methods span a spectrum between pure Monte Carlo and dynamic programming. The simplest variant, TD(0), is as follows:

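  • Input: step size $\alpha$
  • Initialize $V^\pi(s) = 0$ for all states $s$
  • Loop
    • Sample tuple $(s_t, a_t, r_t, s_{t+1})$
    • Update $V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \left( [r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t) \right)$, where $\gamma$ is the discount factor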
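
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the environment (a five-state random walk with terminal states at both ends and a reward of 1 for exiting on the right) and the function name `td0_random_walk` are assumptions chosen for the example, as are the default step size and episode count.

```python
import random


def td0_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate V for a uniformly random policy on an assumed 5-state random walk.

    States 1..5 are non-terminal; states 0 and 6 are terminal.
    Reward is 1.0 on entering state 6 and 0.0 otherwise.
    """
    rng = random.Random(seed)
    V = [0.0] * 7                      # Initialize V(s) = 0 for all s

    for _ in range(num_episodes):
        s = 3                          # each episode starts in the middle state
        while s not in (0, 6):
            # Sample a transition (s, a, r, s'): step left or right at random.
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0

            # TD(0) update: move V(s) toward the bootstrapped target.
            target = r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])

            s = s_next
    return V


if __name__ == "__main__":
    estimates = td0_random_walk()
    # With gamma = 1 the true values of states 1..5 are 1/6, 2/6, ..., 5/6.
    print([round(v, 2) for v in estimates[1:6]])
```

Because the update is applied after every sampled transition, the estimate changes during an episode rather than only at its end, which is the main practical difference from a Monte Carlo update.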

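The temporal difference error is defined as

$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$

where $r_t$ is the sampled reward, $\gamma$ is the discount factor, and $V^\pi$ is the current value estimate.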
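
As a worked numeric illustration, the snippet below computes a one-step TD error and applies it to a single value estimate. It assumes the standard one-step form (reward plus discounted next-state value, minus the current value). The helper names `td_error` and `td_update` and the numbers are hypothetical, chosen so the arithmetic is easy to follow.

```python
def td_error(reward, v_s, v_next, gamma):
    """One-step TD error: (reward + gamma * v_next) - v_s."""
    return reward + gamma * v_next - v_s


def td_update(v_s, delta, alpha):
    """Move the current estimate a fraction alpha along the TD error."""
    return v_s + alpha * delta


# Assumed example values: reward 1.0, V(s) = 0.5, V(s') = 2.0, gamma = 0.5.
delta = td_error(reward=1.0, v_s=0.5, v_next=2.0, gamma=0.5)   # 1.0 + 1.0 - 0.5 = 1.5
new_v = td_update(v_s=0.5, delta=delta, alpha=0.5)             # 0.5 + 0.5 * 1.5 = 1.25
```

A positive error means the sampled transition turned out better than the current estimate predicted, so the estimate is raised; a negative error lowers it.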


This article is issued from Wikiversity. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.