Reading tips: Extra reading material
How is it possible to learn a behaviour from a scalar reward signal?
Over what time horizon do we want to optimize the reward?
What can RL be used for?
What are the basic assumptions behind RL?
What is meant by policy and by value function?
How can an optimal policy be found when one has a (stochastic) model of the world?
What is meant by value iteration and policy iteration?
How can one find an optimal policy in an unknown environment?
What is the exploitation-exploration dilemma? How is it solved?
What are the characteristics of the TD (temporal difference) methods?
What is meant by Q-learning?
What is meant by Sarsa-learning?
What is an eligibility trace, and how can it be used to speed up learning?
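As a study aid for the questions on value iteration and finding an optimal policy from a known model, here is a minimal sketch on a hypothetical 4-state chain MDP (states 0..3, state 3 terminal, reward 1 on reaching state 3). The MDP and all names are illustrative assumptions, not from the reading material.

```python
import numpy as np

# Hypothetical chain MDP: states 0..3, state 3 terminal; actions 0 = left, 1 = right.
n_states, n_actions, gamma = 4, 2, 0.9

def step(s, a):
    """Deterministic model of the world: next state and reward."""
    if s == 3:                              # terminal state absorbs with zero reward
        return s, 0.0
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, 3)
    return s2, (1.0 if s2 == 3 else 0.0)

# Value iteration: repeatedly apply the Bellman optimality backup to V.
V = np.zeros(n_states)
for _ in range(100):
    V_new = np.array([max(step(s, a)[1] + gamma * V[step(s, a)[0]]
                          for a in range(n_actions))
                      for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-8:    # stop once the sweep changes nothing
        V = V_new
        break
    V = V_new

# The optimal policy is then the greedy policy with respect to V.
policy = [max(range(n_actions),
              key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in range(n_states)]
```

Here V converges to [0.81, 0.9, 1.0, 0.0] and the greedy policy moves right in every non-terminal state. Policy iteration alternates full policy evaluation with greedy improvement instead of backing up the max directly.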
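For the questions on learning in an unknown environment, the exploration-exploitation dilemma, and Q-learning, a tabular Q-learning sketch with epsilon-greedy exploration can look as follows. The same hypothetical chain MDP is assumed; the agent only samples transitions, it has no model.

```python
import random

# Hypothetical chain MDP as before: states 0..3, state 3 terminal.
n_states, n_actions = 4, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

random.seed(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Exploration-exploitation: explore with probability epsilon, else exploit.
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda a_: Q[s][a_])
        s2, r = step(s, a)
        # TD update; bootstrapping on max over next actions makes this off-policy.
        td_target = r + gamma * max(Q[s2])
        Q[s][a] += alpha * (td_target - Q[s][a])
        s = s2
```

Sarsa differs only in the target: it bootstraps on `Q[s2][a2]` for the action `a2` the epsilon-greedy policy actually takes next, which makes it on-policy.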
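For the last two questions, a Sarsa(lambda) sketch with accumulating eligibility traces, again on the hypothetical chain MDP. The trace marks recently visited state-action pairs so that a single TD error updates all of them at once, which speeds up the propagation of reward information.

```python
import random

n_states, n_actions = 4, 2
alpha, gamma, lam, epsilon = 0.5, 0.9, 0.8, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

def eps_greedy(s):
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

random.seed(1)
for episode in range(300):
    e = [[0.0] * n_actions for _ in range(n_states)]   # traces reset each episode
    s, a = 0, eps_greedy(0)
    while s != n_states - 1:
        s2, r = step(s, a)
        a2 = eps_greedy(s2)
        # On-policy TD error: bootstrap on the action actually taken next.
        delta = r + gamma * Q[s2][a2] - Q[s][a]
        e[s][a] += 1.0                                 # accumulate trace for this pair
        for si in range(n_states):
            for ai in range(n_actions):
                Q[si][ai] += alpha * delta * e[si][ai]
                e[si][ai] *= gamma * lam               # decay every trace
        s, a = s2, a2
```

With lam = 0 this reduces to plain one-step Sarsa; larger lam spreads each TD error further back along the trajectory.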