
利用熟練者知識加速強化式學習演算法之效率─以西洋雙陸棋作為探討實例

Improving Learning Efficacy of Reinforcement Learning from Seniors' Knowledge ─ A Case Study on TD-GAMMON

Advisor: 虞台文

Abstract


Reinforcement learning is a distinctive machine learning method. Unlike ordinary supervised learning, it needs no external labeled examples: it can learn through self-play, improving itself by interacting with an unknown environment, which supervised learning cannot do. In practice, however, we are rarely in a completely unknown environment. In other words, there is usually a great deal of knowledge from seniors who have experience with the environment that we can draw on. In board games, for instance, human masters, game records, and game-playing programs are all sources of experience, each with its own level of skill, and it would be a waste not to use them. Our research goal is therefore how to combine self-play with the knowledge of experienced players so as to learn efficiently and ultimately surpass the teachers.

This thesis validates our concepts and ideas using TD-Gammon as a case study. TD-Gammon is a highly representative example of reinforcement learning: it combines TD(λ) with a neural network for function approximation to learn to play backgammon, and after about 1.5 million self-play games it reaches a strength close to world-class players. During TD-Gammon's self-play, we add an advisor's suggestions based on its view of the current board position, thereby improving the learning results. We also experiment with the effect on learning performance when multiple advisors participate, and discuss the experimental results.
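The thesis does not spell out the exact mechanism for injecting an advisor's suggestions into self-play, but one simple way to picture it is a move selector that usually acts greedily on the learned value function and occasionally defers to the senior's choice. The function names, the `advice_prob` parameter, and the mixing scheme below are all illustrative assumptions, not the author's method:

```python
# Purely illustrative sketch: blending a senior's (teacher's) move choice
# into self-play move selection. The thesis does not specify this scheme;
# choose_move, teacher_fn, and advice_prob are assumptions for illustration.
import random


def choose_move(moves, value_fn, teacher_fn=None, advice_prob=0.2, rng=random):
    """Pick the greedy move by learned value, occasionally following the teacher.

    moves      -- list of candidate moves (afterstates)
    value_fn   -- learned evaluation of a move, higher is better
    teacher_fn -- optional senior's policy: picks one move from the list
    advice_prob-- probability of deferring to the teacher this turn
    """
    if teacher_fn is not None and rng.random() < advice_prob:
        return teacher_fn(moves)        # defer to the senior's suggestion
    return max(moves, key=value_fn)     # otherwise act greedily on V
```

With `advice_prob = 0` this degenerates to plain self-play; raising it lets the learner see the positions a stronger player would reach, which is one plausible reading of "adding the advisor's view of the current board position".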

Keywords

Reinforcement learning; TD-Gammon

Parallel Abstract


Reinforcement learning is a distinctive machine learning method. Unlike supervised learning, it can learn by self-play without labeled examples, which means it can learn in an unknown environment through interaction alone, something supervised learning cannot do. The problems we usually face, however, are not set in a completely unknown environment; in other words, there is usually a great deal of prior experience we can refer to. Taking board games as an example, we have many sources of experience, and leaving that knowledge unused would waste those resources. The goal of our study is therefore how to combine self-play with such knowledge. We take TD-Gammon as our example. TD-Gammon is one of the most impressive applications of reinforcement learning: its learning algorithm is a straightforward combination of the TD(λ) algorithm and nonlinear function approximation using a multilayer neural network trained by backpropagating TD errors, and it learned to play extremely well, near the level of the world's strongest grandmasters. We inject seniors' knowledge into TD-Gammon's self-play process to improve learning efficacy, and we further attempt to improve learning efficacy by using more seniors.
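The core update described above, TD(λ) with eligibility traces driving a function approximator, can be sketched compactly. For brevity this uses a linear value function rather than TD-Gammon's multilayer network (where the trace would hold a decayed sum of gradients per weight); all names are illustrative:

```python
# Minimal sketch of one TD(lambda) step with eligibility traces.
# A linear value function V(s) = w . phi(s) stands in for TD-Gammon's
# neural network; alpha, gamma, lam defaults are illustrative choices.

def td_lambda_update(w, e, phi_t, phi_next, reward,
                     alpha=0.1, gamma=1.0, lam=0.7):
    """Return updated weights and eligibility trace after one transition."""
    v_t = sum(wi * xi for wi, xi in zip(w, phi_t))        # V(s_t)
    v_next = sum(wi * xi for wi, xi in zip(w, phi_next))  # V(s_{t+1})
    delta = reward + gamma * v_next - v_t                 # TD error
    # Decay the trace, then accumulate the current feature gradient.
    e = [gamma * lam * ei + xi for ei, xi in zip(e, phi_t)]
    # Move every weight along its trace, scaled by the TD error.
    w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
    return w, e
```

In TD-Gammon the same delta is backpropagated through the network instead of applied to linear weights, but the trace-decay-then-update structure is identical.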

Parallel Keywords

Reinforcement learning; TD-Gammon

