In this dissertation, an adaptive model learning algorithm structured as a decision tree is first proposed and combined with reinforcement learning to form a model-based reinforcement learning algorithm. While the agent interacts with the environment, Q-learning performs policy learning, and the adaptive model learning algorithm simultaneously approximates the environment model, referred to as the virtual environment model. When the agent samples states, learning switches to planning mode, in which the action functions work with the virtual environment model to generate simulated interaction experience for adjusting the policy, thereby reducing the time the agent spends exploring the environment and learning the policy. In the second part, when the agent operates in a stochastic environment, the problem of state-transition probabilities arises. Therefore, based on the adaptive model learning algorithm, an online clustering algorithm is proposed; through clustering, the learned model can approximate the state-transition probabilities, and the resulting method is called the stochastic model learning algorithm. Furthermore, the concept of backward recall is adopted to transform the stochastic model learning algorithm, so that it recalls the agent's past experience instead of predicting future experience; this gives the agent one additional step of recall in planning mode and accelerates policy adjustment. In the third part, the stochastic model learning algorithm is extended to multi-agent systems. To reduce the cost of exploring unknown regions, and because model learning is based on a tree structure, a knowledge-sharing algorithm for multiple agents is proposed. By sharing the knowledge each agent accumulates while interacting with the environment, the virtual model quickly approaches the environment model, and through the virtual model each agent's policy can be adjusted rapidly in an indirect manner.
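For context, the direct policy-learning step mentioned above is the standard one-step Q-learning update (a textbook rule rather than a contribution of this work); with learning rate $\alpha$ and discount factor $\gamma$ it reads

\[
Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big].
\]

In planning mode, the same update is applied, but the tuple $(s, a, r, s')$ is generated by the action functions together with the virtual environment model rather than by real interaction.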
In this dissertation, an adaptive model learning method based on a tree structure is first presented to improve sample efficiency in reinforcement learning problems. The proposed method is composed of Q-learning, a decision tree, and action functions. Q-learning learns the policy, while the decision tree performs model learning, building an environment model that captures the effect of actions across continuous states. Once the agent has developed an accurate model, the action functions use it to generate simulated experience, allowing the agent to perform value iterations quickly. Second, the model-based reinforcement learning method is applied to tasks in stochastic environments. An online clustering method is therefore proposed and integrated into the adaptive model learning method. Clustering enables the model learner to estimate transition probabilities; the resulting method is named the stochastic model learning method. Moreover, the concept of backward recall is introduced to transform the stochastic model learning method: forward prediction is replaced by backward recall, which gives the agent one additional planning step and further accelerates policy learning. In model-based reinforcement learning, the environment model is built from sampled experience. However, building a sufficient model within a short exploration time is an important issue, especially in complicated environments. When model-based algorithms are applied to multi-agent systems, an agent that consults its peers' experiences can not only learn faster but also reduce the burden of exploring unvisited states and unseen situations. Finally, model sharing methods among multiple agents are introduced. After sharing experiences and constructing a more global model from the scattered local models held by individual agents, the agents can learn quickly by alternating between direct learning and indirect learning, i.e., planning. To reduce the complexity of the sharing process, the proposed method performs model sharing between cooperative agents by grafting partial tree branches that contain the required and useful experiences, instead of merging whole trees.
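As a rough illustration of the learn/plan alternation summarized above, the Python sketch below pairs one-step Q-learning with a learned one-step model and Dyna-style planning. The identifiers (q_update, select_action, dyna_step, env.step) are hypothetical and do not come from the dissertation; the adaptive model learning algorithm replaces the tabular model shown here with a decision tree over continuous states, its stochastic extension clusters outcomes to estimate transition probabilities, and backward recall plans from past experience rather than forward predictions.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON, PLAN_STEPS = 0.1, 0.95, 0.1, 10

Q = defaultdict(float)   # Q[(state, action)] -> estimated value
model = {}               # learned one-step model: (s, a) -> (r, s')

def q_update(s, a, r, s_next, actions):
    # One-step Q-learning update, used for both real and simulated experience.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def select_action(s, actions):
    # Epsilon-greedy action selection over the current value estimates.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def dyna_step(env, s, actions):
    # Direct learning on one real transition, then planning on the learned model.
    a = select_action(s, actions)
    r, s_next = env.step(s, a)          # assumed environment interface
    q_update(s, a, r, s_next, actions)  # direct learning
    model[(s, a)] = (r, s_next)         # model learning (deterministic simplification)
    for _ in range(PLAN_STEPS):         # planning with simulated experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next, actions)
    return s_next

In this simplified picture, multi-agent model sharing would amount to merging entries of the model dictionary across agents; the proposed method instead grafts partial subtrees of the decision-tree model, so only the required and useful experiences are exchanged.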