
時間差預測於智慧型多人遊戲之應用

A Temporal-Difference Prediction Approach for Intelligent Multiplayer Games

Advisor: 虞台文

Abstract


Reinforcement learning is a learning paradigm in which an agent, without expert guidance, learns in a dynamic environment through repeated trial and error, gaining experience from its mistakes. By continually exploiting what it already knows and exploring what it has not yet experienced, the agent keeps updating a so-called value function and uses it as the criterion for revising its policy. The value function records the value of each state in the environment; as the agent accumulates real experience, it keeps updating the values of the states it has visited, and these values serve as the reference for later action decisions. Present-day reinforcement learning, however, faces two challenges. First, the value function cannot be stored as a table, because the problems we wish to solve often have an extremely large number of states, more than memory can hold. Second, traditional reinforcement learning algorithms are suited to environments with perfect information, which most real-world environments are not.

This thesis proposes a novel learning method that can be applied in environments with imperfect information. In this method, two neural networks are cascaded to form the learning engine. One of the networks is designed on the concept of the temporal-difference network; its task is to predict the probabilities of the events that each candidate action will cause over a span of consecutive future time steps. The predictions, together with other known information, are passed to the second network for value estimation, which serves as the basis for choosing the policy. The trained agent can assess the attacks its opponents are likely to mount and strike back hard.

We use the card game Hearts (known in Chinese as 傷心小棧, also called 西洋拱猪) as the test platform. It is a typical example of a game with imperfect information, in which traditional reinforcement learning methods perform poorly. In a 100-game match against MSHEARTS, the built-in game of Microsoft Windows, our trained agent emerged as the overall winner.
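To make the value-function idea above concrete, the following minimal Python sketch shows a tabular TD(0) update, the kind of incremental value learning the abstract refers to. The states, reward, learning rate, and discount factor are illustrative assumptions, not the setup used in the thesis.

```python
# A minimal sketch of a tabular TD(0) value-function update (illustrative only;
# the states, reward, and hyper-parameters below are assumptions, not the
# thesis's Hearts setup).

ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

value = {}    # tabular value function V(s)

def td0_update(state, reward, next_state):
    """Move V(state) toward the one-step TD target: r + gamma * V(next_state)."""
    v_s = value.get(state, 0.0)
    v_next = value.get(next_state, 0.0)
    td_error = reward + GAMMA * v_next - v_s
    value[state] = v_s + ALPHA * td_error
    return td_error

# One step of experience: the agent moved from "s0" to "s1" and received reward 1.0.
td0_update("s0", 1.0, "s1")
print(value)  # {'s0': 0.1}
```

For Hearts the number of distinct states is far too large for such a table, which is why the thesis replaces the table with neural-network function approximation.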

Abstract (English)


Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. Rather than being given expert guidance, the agent tries actions on its own and continuously observes the feedback returned by the environment in order to find an effective way to accomplish its goal. By continually exploiting what it already knows and exploring what it has not yet experienced, the agent progressively refines its policy based on a value function that it builds incrementally. The value function records the merit of each state in the environment, so the agent can consult it to determine the best action in the current situation. There are two challenges in reinforcement learning. First, the value function cannot be stored in a table when the environment has an enormous number of states, because memory is insufficient. Second, traditional reinforcement learning methods assume an environment with perfect information, but many real-world problems do not satisfy this assumption.

In this thesis, we propose a novel learning method for environments with imperfect information. In this approach, two cascaded neural networks serve as the learning engine. One network is designed based on the concept of the temporal-difference network; its task is to predict, in a probabilistic sense, the successive events that would follow a trial action. The predictions, together with some known information, are then passed to the other network for value estimation. The best action for the current state is determined by choosing the one with the highest estimated value. The trained agent is able to anticipate opponents' attacks and mount a strong counterattack.

The card game Hearts is our test target. It is a typical example of an imperfect-information game, and it is difficult enough that traditional reinforcement learning methods fail to learn it well. In a 100-game match against MSHEARTS, the game built into Microsoft Windows, our well-trained agent won the championship.
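As a rough illustration of the cascaded two-network design described above, the sketch below chains a prediction network (standing in for the temporal-difference network) into a value-estimation network and picks the highest-valued candidate action. The layer sizes, feature encoding, and untrained random weights are placeholder assumptions; the thesis's actual architecture and training procedure are not reproduced here.

```python
import numpy as np

# Sketch of the cascaded two-network evaluation described in the abstract.
# Layer sizes, the feature encoding, and the random weights are assumptions.

rng = np.random.default_rng(0)

def make_net(in_dim, hidden, out_dim):
    """Random weights for a tiny one-hidden-layer network."""
    return {"W1": rng.normal(size=(in_dim, hidden)) * 0.1,
            "W2": rng.normal(size=(hidden, out_dim)) * 0.1}

def forward(net, x):
    return np.tanh(x @ net["W1"]) @ net["W2"]

N_FEATURES, N_EVENTS = 20, 8                          # assumed sizes
predictor = make_net(N_FEATURES, 32, N_EVENTS)        # predicts future-event probabilities
evaluator = make_net(N_EVENTS + N_FEATURES, 32, 1)    # maps predictions + features to a value

def action_value(features):
    """Chain the two networks: predict events for this candidate action,
    then feed the predictions plus the known features to the value network."""
    probs = 1.0 / (1.0 + np.exp(-forward(predictor, features)))  # sigmoid -> probabilities
    return forward(evaluator, np.concatenate([probs, features])).item()

def choose_action(candidate_feature_vectors):
    """Pick the candidate action whose features receive the highest estimated value."""
    values = [action_value(f) for f in candidate_feature_vectors]
    return int(np.argmax(values))

# Example: score three candidate actions, each encoded as a feature vector.
candidates = [rng.normal(size=N_FEATURES) for _ in range(3)]
print(choose_action(candidates))
```

This untrained sketch only shows how the outputs of the prediction network become inputs to the value network; in the thesis both networks are trained and the predictions cover events over a span of future time steps.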

