The goal is to estimate by generating many episodes under policy .
- An episode is a series of states, actions, and rewards () created for an Markov Decision Process (MDP) under policy .
- In this method, we simply simulate many trajectories (decision processes), and calculate the average returns.
- The error of calculated reward reduces with , where is the number of trajectories created.
- This method can be used only for episodic decision processes, meaning that the trajectories are finite and terminates after a number of states.
- The evaluation does NOT require formal derivation of dynamics and rewards models.
- This method does NOT assume states to be Markov.
- Generally a high variance estimator. Reducing the variance can require a lot of data. Therefore, in cases where data is expensive to acquire or the stakes are high, MC may be impractical.
There are different types of Monte Carlo policy evaluation:
- First-visit Monte Carlo
- Every-visit Monte Carlo
- Incremental Monte Carlo
First-visit Monte Carlo
Algorithm:
Initialize
Loop:
- Sample episode
- Define as return from time step onwards in th episode
- For each state visited in episode
- For first time that state is visited in episode
- Increment counter of total first visits:
- Increment total return
- Update estimate
- For first time that state is visited in episode
Properties:
- first-time MC estimator is an unbiased estimator of true . (Read more about Bias of an estimator).
- By law of large numbers, as
Every-visit Monte Carlo
Algorithm:
Initialize
Loop:
- Sample episode
- Define as return from time step onwards in th episode
- For each state visited in episode
- For every time that state is visited in episode
- Increment counter of total first visits:
- Increment total return
- Update estimate
- For every time that state is visited in episode
Properties:
- every-visit MC estimator is a biased estimator of true . (Read more about Bias of an estimator).
- The every-visit MC estimator has MSE (variance + bias2) than first-visit estimator, because we collect way more data when we count every visit.
- The every-visit estimator is a consistent estimator, meaning that the bias value consistently decreases with increasing number of simulated episodes. The bias of a consistent estimator asymptotically goes to zero with increasing number of sample size.
Incremental Monte Carlo
Incremental MC policy evaluation is a more general form of policy evaluation that can be applied to both first-visit and every-visit policy evaluation algorithms.
The benefit of using incremental MC algorithm is that it can be applied to cases where the system is non-stationary. The algorithm does this by giving higher weight to newer data.
In both first-visit and every-visit MC algorithms the value function is updated by the following equation
This equation is easily derivable by looking value of , , and each time the value function is updated. If we change the update equation to the following we arrive at the incremental MC algorithm which can have both first-visit and every-visit variations
If we set , we arrive at the original first-visit or every-visit MC algorithms, but if set we have an algorithm that gives more weight to the newer data and is more suitable for non-stationary domains.