In this exercise we revisit the included racetrack_environment
to look at temporal difference (TD) algorithms. The following topics are covered:
- policy evaluation using TD learning
- on-policy epsilon-greedy control using TD learning
- off-policy epsilon-greedy control using TD learning (Q-learning)
- using double Q-learning in stochastic environments
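
As a rough orientation for the four topics above, the textbook tabular update rules can be sketched as follows. This is a minimal sketch, not the exercise's reference solution: it assumes states and actions are small integer indices, that `V` is a list of state values and `Q` a list of per-state action-value lists, and all function and parameter names (`td0_update`, `sarsa_update`, `alpha`, `gamma`, ...) are hypothetical. It does not use the racetrack_environment API.

```python
import random


def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=1.0):
    """TD(0) policy evaluation: move V[s] toward r + gamma * V[s_next]."""
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])


def epsilon_greedy(Q, s, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    """On-policy TD control (SARSA): bootstrap from the action actually taken next."""
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """Off-policy TD control (Q-learning): bootstrap from the greedy next action."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def double_q_update(Q1, Q2, s, a, r, s_next, done,
                    alpha=0.1, gamma=1.0, rng=random):
    """Double Q-learning: one table selects the next action, the other evaluates it."""
    # Randomly choose which table is updated in this step.
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1
    best = max(range(len(Q1[s_next])), key=lambda b: Q1[s_next][b])
    target = r if done else r + gamma * Q2[s_next][best]
    Q1[s][a] += alpha * (target - Q1[s][a])
```

The only difference between the SARSA and Q-learning sketches is the bootstrap term: SARSA uses the value of the action the behavior policy actually selects next, while Q-learning uses the maximum over next actions. Double Q-learning decouples action selection from action evaluation, which reduces the overestimation that the maximum introduces when rewards or transitions are noisy.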