
透由間接學習改進Dyna-Q之效能

An Improved Dyna-Q Algorithm via Indirect Learning

Advisors: 黃國勝, 陳昱仁

Abstract


This thesis builds on the learning concept of the Dyna architecture and integrates the Q-Learning algorithm, Ant Colony Optimization (ACO), and Prioritized Sweeping to propose a fast-learning system. Q-Learning is responsible for policy learning, and the information gathered while interacting with the environment is used to build a virtual environment model; for the agent's exploration, an exploration factor is added to improve exploration efficiency and shorten the time needed to build the virtual model. Between interactions with the real environment, the agent interacts with the virtual environment through Q-Learning to perform planning updates, achieving fast learning. In the Dyna-Q architecture, planning updates are performed by randomly sampling experiences stored in the virtual model; Prioritized Sweeping was later proposed as one way to improve this indirect learning in Dyna-Q. This thesis proposes two methods that make planning updates on the model more efficient: Depth-first Planning and Hybrid Planning. Depth-first Planning combines the pheromone concept of ACO with an exploration factor that increases the ants' probability of exploring; Hybrid Planning integrates the advantages of Depth-first Planning and Prioritized Sweeping (breadth-first planning). To compensate for deficiencies of the virtual model, a Model Shaping mechanism is proposed to predict the information the virtual model lacks. Finally, we simulate the proposed algorithms in maze and Mountain Car environments; the simulation results show that the proposed methods clearly improve learning speed.
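
As background for the planning step described above, the sketch below shows a standard tabular Dyna-Q loop (Sutton's formulation), in which planning updates replay randomly chosen transitions from the learned model. This is only an illustrative sketch: the environment interface (reset, step, actions), the hyperparameter values, and the dictionary-based model are assumptions made for the example, not the implementation used in this thesis.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=200, alpha=0.1, gamma=0.95, epsilon=0.1, n_planning=10):
    """Tabular Dyna-Q: direct Q-learning plus random planning updates from a learned model."""
    Q = defaultdict(float)      # Q[(state, action)] -> value estimate
    model = {}                  # model[(state, action)] -> (reward, next_state, done)

    def best(s):
        return max(Q[(s, a)] for a in env.actions)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])

            s2, r, done = env.step(a)

            # direct learning from the real transition
            target = r + (0.0 if done else gamma * best(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # model learning: remember the observed transition
            model[(s, a)] = (r, s2, done)

            # planning: replay n randomly chosen remembered transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * best(ps2))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s2
    return Q
```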

Parallel Abstract


In this thesis, we apply additional algorithms, such as Ant Colony Optimization (ACO) and Prioritized Sweeping, to improve the learning speed of the Dyna-Q learning algorithm. The agent interacts with the environment, learns a policy by Q-Learning, and builds the interaction information into a virtual backward-prediction model. While exploring the unknown environment, the agent selects actions using an exploration factor. In the Dyna architecture, the planning method randomly replays state-action pairs that have been experienced; Prioritized Sweeping (breadth-first planning) was proposed to improve Dyna-Q's planning method. In this thesis, we propose two planning methods: depth-first planning and hybrid planning. Depth-first planning applies the ACO concept together with the exploration factor; hybrid planning combines the advantages of depth-first planning and Prioritized Sweeping (breadth-first planning). To overcome shortcomings of the model, we propose model shaping, which predicts the information missing from the model. To verify the proposed methods, we run simulations in maze and Mountain Car environments. The simulation results show that the proposed methods improve the speed and efficiency of learning.
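
For reference, Prioritized Sweeping (the breadth-first planning baseline mentioned above) replaces random replay with a priority queue ordered by the magnitude of each state-action pair's expected value change, and sweeps backwards through predecessor states after every update. The sketch below assumes a tabular setting with a deterministic learned model; the argument names (model, predecessors, actions) and threshold values are illustrative assumptions rather than the thesis's own code.

```python
import heapq
from itertools import count

def prioritized_sweeping_planning(Q, model, predecessors, actions,
                                  alpha=0.1, gamma=0.95, theta=1e-4, n_updates=10):
    """One round of prioritized-sweeping planning over a learned (deterministic) model.

    Q            : dict  (state, action) -> value estimate
    model        : dict  (state, action) -> (reward, next_state)
    predecessors : dict  state -> iterable of (prev_state, prev_action) observed to lead into it
    """
    tie = count()            # tie-breaker so the heap never compares states directly

    def td_error(s, a):
        r, s2 = model[(s, a)]
        best_next = max(Q.get((s2, b), 0.0) for b in actions)
        return r + gamma * best_next - Q.get((s, a), 0.0)

    # seed the queue with every remembered pair whose urgency exceeds the threshold
    pq = []
    for (s, a) in model:
        p = abs(td_error(s, a))
        if p > theta:
            heapq.heappush(pq, (-p, next(tie), (s, a)))   # max-heap via negated priority

    for _ in range(n_updates):
        if not pq:
            break
        _, _, (s, a) = heapq.heappop(pq)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error(s, a)

        # sweep backwards: predecessors of s may now be urgent as well
        for (ps, pa) in predecessors.get(s, ()):
            p = abs(td_error(ps, pa))
            if p > theta:
                heapq.heappush(pq, (-p, next(tie), (ps, pa)))
    return Q
```

The depth-first and hybrid planning methods proposed in the thesis change how state-action pairs are selected from the model during this planning phase, rather than the update rule itself.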

Parallel Keywords

ACO, Dyna-Q, Reinforcement learning, Planning, Model learning

