Using upper confidence bounds to control exploration and exploitation
Lecture slides:
- Using upper confidence bounds to control exploration and exploitation
- Contents
- Exploration vs. Exploitation
- Exploration vs. Exploitation: Some Applications
- Bandit Problems – “Optimism in the Face of Uncertainty”
- Parametric Bandits [Lai & Robbins]
- Bounds
- UCB1 Algorithm (Auer et al., 2002)
- Bandits in Continuous Time
- Formal framework
- Evaluating allocation rules (policies)
- Gain, action values and regret
- Model-based UCB
- Algorithm
- Regret bound
- Key proposition
- Open problems
- Levente Kocsis, Remi Munos
- Bandits with large action-spaces
- Structure helps!
- UCT: Upper Confidence based Tree search
- Example (t=1)
- Example (t=2)
- Example (t=3)
- Example (t=4)
- What is the next time a suboptimal action is sampled?
- UCT variations
- UCT variations
- Theoretical results
- Planning in MDPs: Sailing
- Planning in MDPs: Sailing
- Planning in MDPs: Sailing
- Results in games
- Thank you!
Author: Csaba Szepesvari, Department of Computing Science, University of Alberta