A Selection Variation for Improved Throughput and Accuracy of Monte Carlo Tree Search Algorithms

abstract

  • Reinforcement Learning (RL) is a computational approach to understanding and automating goal-directed learning and decision making. Methods such as dynamic programming, Monte Carlo, and temporal difference can all be used to implement RL, and each leaves room for continuous improvement. Such improvements face the challenge of balancing the accuracy of a decision against the time spent making it. This study presents an improvement of the Monte Carlo Tree Search (MCTS) algorithm that enhances its selection phase. The variation was implemented by combining the Upper Confidence Bound applied to Trees (UCT) and “lean Last Good Reply with Forgetting (LGRF)” strategies, while taking care to preserve the generality of the algorithm. The experiments were conducted by having two agents play Othello against each other, each agent applying a different variation of the MCTS algorithm, and a Chi-square test was then used to analyze the results. The “lean LGRF” strategy recorded a win rate of 100% over 1,000 playouts against the UCT strategy. The Progressive Bias (PB) strategy had a win rate of 45.8% against UCT. As expected, the UCT strategy had a win rate of 49.7% (an almost 50-50 split) against itself. The results show a significant level of variance among the selection variation techniques and that the “lean LGRF” variation is superior to the UCT and PB variations.
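
  The abstract names UCT selection and a “lean LGRF” reply policy but gives no implementation details. The Python sketch below is only an illustration of how a UCT selection step might be combined with a last-good-reply table with forgetting; the Node fields (visits, wins, children, move), the reply_table dictionary, and the player_of helper are assumptions for the example, not the authors' code.

    # Illustrative sketch: UCT selection with a last-good-reply table (LGR-F style).
    import math

    C = math.sqrt(2)  # common UCT exploration constant

    def uct_score(parent_visits, child):
        """Standard UCT value: exploitation term plus exploration bonus."""
        if child.visits == 0:
            return float("inf")  # always try unvisited children first
        exploit = child.wins / child.visits
        explore = C * math.sqrt(math.log(parent_visits) / child.visits)
        return exploit + explore

    def select_child(node, reply_table, last_opponent_move):
        """Prefer the stored 'last good reply' to the opponent's previous move
        when that move is still legal here; otherwise fall back to plain UCT."""
        reply = reply_table.get(last_opponent_move)
        if reply is not None:
            for child in node.children:
                if child.move == reply:
                    return child
        return max(node.children, key=lambda c: uct_score(node.visits, c))

    def update_replies(reply_table, move_sequence, winner, player_of):
        """After a playout: remember replies played by the winner, and forget
        stored replies that the loser repeated (the 'forgetting' part)."""
        for prev_move, reply in zip(move_sequence, move_sequence[1:]):
            if player_of(reply) == winner:
                reply_table[prev_move] = reply
            elif reply_table.get(prev_move) == reply:
                del reply_table[prev_move]

  In this reading, the reply table steers selection toward moves that previously answered the opponent's last move successfully, while UCT handles all remaining cases; how the published “lean” variant prunes or weights this table is not specified in the abstract.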

publication date

  • November 2018

keywords

  • Artificial Intelligence; Monte Carlo Tree Search; Reinforcement Learning; decision theory; board games

start page

  • 286

end page

  • 294

volume

  • 7

issue

  • 6