
具混合規劃架構之並行Dyna-Q學習演算法

A Hybrid Planning in Concurrent Dyna-Q Learning for Multi-agent Systems

Advisor: 黃國勝

Abstract


Traditional reinforcement learning algorithms such as Q-learning are built on a single agent learning one step at a time without a model. In recent years, many researchers have therefore proposed using multiple agents and repeatedly learning from a model to address this low learning efficiency, for example Dyna-Q and multi-agent systems. In this thesis, we combine algorithms from several fields, apply their concepts to reinforcement learning, and extend the existing ideas of Dyna-Q and multi-agent systems. For exploration, we add the UCB algorithm to make the agents' exploration more efficient and shorten the time needed to build the virtual environment model. For the virtual environment model of Dyna-Q, we introduce an image-processing concept to sharpen the model. We also propose a planning algorithm that parallelizes over the environment's state space, allowing parallel computation to accelerate Dyna-Q learning, and we incorporate prioritized sweeping into it to further increase planning efficiency and make effective use of computing resources. Based on these extensions and combinations, we implement the simulation on the CUDA (Compute Unified Device Architecture) platform using the concept of GPGPU (General Purpose Computing on Graphics Processing Units), and verify through simulation how the proposed methods affect the learning speed of Dyna-Q.
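As a rough illustration of the UCB-driven exploration mentioned above, the sketch below shows, in Python, how a UCB-1 bonus can replace greedy or epsilon-greedy action selection so that rarely tried actions are visited sooner and the environment model is filled in faster. This is only a minimal sketch under common assumptions; the names Q, N, c and select_action_ucb are illustrative and do not come from the thesis itself.

import numpy as np

def select_action_ucb(Q, N, state, t, c=1.0):
    # Pick an action by the UCB-1 rule: value estimate plus an exploration
    # bonus that shrinks as the action is tried more often (illustrative only).
    counts = N[state]                        # visit count of each action in this state
    untried = np.where(counts == 0)[0]
    if untried.size > 0:                     # try every action at least once
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)  # UCB-1 exploration bonus
    return int(np.argmax(Q[state] + bonus))

Here Q and N would be arrays of shape (num_states, num_actions) holding action values and visit counts, and t is the number of visits to the current state so far.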

Keywords

Dyna-Q, UCB, Reinforcement learning, GPGPU

Abstract (English)


Traditional reinforcement learning algorithms, such as Q-learning, are based on a single agent performing one-step learning without a model. In recent years, many researchers have proposed using multiple agents and reusing a learned model for repeated training to increase learning efficiency, as in Dyna-Q and multi-agent systems. In this thesis, we integrate several algorithms from different domains, apply their concepts to reinforcement learning, and extend existing concepts such as Dyna-Q and multi-agent systems. We add the UCB algorithm to improve the exploration efficiency of the agents and shorten the time needed to build the virtual environment model. For the virtual environment model of Dyna-Q, we introduce a concept from image processing to sharpen the model. We also propose a planning algorithm that parallelizes over the environment's state space, which can perform parallel computation and accelerate Dyna-Q learning. The concept of prioritized sweeping is integrated to further increase planning efficiency and make effective use of computing resources. After extending and integrating the above algorithms, the concept of GPGPU (General Purpose Computing on Graphics Processing Units) is used to implement the simulation on the CUDA (Compute Unified Device Architecture) platform. The simulation is used to verify the impact of the proposed methods on the learning speed of Dyna-Q.
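As a rough sketch of the planning step described in the abstracts, the following Python fragment combines the Dyna-Q planning loop with a prioritized-sweeping queue, so that simulated backups are spent first on the state-action pairs whose values are expected to change the most. It assumes a deterministic learned model and serial execution; the thesis additionally distributes this kind of planning over the state space on a GPU (CUDA), which is not shown here. All names (model, predecessors, pqueue, theta) are hypothetical.

import heapq
import numpy as np

def plan_prioritized(Q, model, predecessors, pqueue,
                     n_steps=50, alpha=0.1, gamma=0.95, theta=1e-4):
    # Dyna-Q planning with prioritized sweeping (illustrative sketch).
    # model[(s, a)]   -> (reward, next_state), a deterministic learned model
    # predecessors[s] -> set of (s_prev, a_prev) pairs known to lead to s
    # pqueue          -> heap of (-priority, s, a), largest priority popped first
    for _ in range(n_steps):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        # one-step Q-learning backup on simulated experience
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        # re-queue predecessors of s whose values would change by more than theta
        for s_prev, a_prev in predecessors.get(s, ()):
            r_prev, _ = model[(s_prev, a_prev)]
            priority = abs(r_prev + gamma * Q[s].max() - Q[s_prev, a_prev])
            if priority > theta:
                heapq.heappush(pqueue, (-priority, s_prev, a_prev))
    return Q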

Keywords (English)

Dyna-Q, UCB, Reinforcement learning, GPGPU


Cited by


Wu, L. H. (2013). 透由間接學習改進Dyna-Q之效能 [master's thesis, National Chung Cheng University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0033-2110201613561218
姜冠毅 (2015). 基於熵調整增強式學習探索率之研究 [master's thesis, National Chung Cheng University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0033-2110201614041577
