In reinforcement learning, Q-Learning is a well-known model-free learning method. However, Q-Learning suffers from a bootstrapping problem: an update must wait until the value of the next state (state-value) is correct before that value can be propagated further backward, so values diffuse very slowly. This thesis uses a Dyna-Q-based method that builds a reverse model with a tree structure to provide prediction. The agent collects the state transitions it has experienced together with the associated rewards, and classifies these data to characterize local state transitions, transition probabilities, and the rewards obtained. With these data, similar state transitions, transition probabilities, and rewards can be used to indirectly learn about states the agent has never visited, giving the agent additional experience with which to update its policy. Prediction allows the agent to simulate the transitions between states, so values can be propagated backward in consecutive steps, reducing learning time and improving learning efficiency. However, while the model is still incomplete, prediction errors occur; in environments that contain absorbing states, such errors degrade learning performance or even prevent the optimal policy from being learned. This thesis therefore also proposes methods to reduce the cases in which prediction errors cause the agent to pass through absorbing states. Finally, we simulate the proposed algorithm in a maze environment and in an environment where speed is regulated to avoid obstacles; the simulation results show that the proposed method indeed improves the speed and performance of learning.
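To make the bootstrapping issue concrete, the following is a minimal sketch of the standard tabular Q-Learning update, not the thesis code. States and actions are assumed to be integer indices, and the environment interface env.reset()/env.step() and all parameter values are illustrative assumptions. The one-step update shows why reward information moves backward only one state per visit.

import numpy as np

# Minimal tabular Q-Learning sketch (standard algorithm, not the thesis code).
# env.reset()/env.step() are a hypothetical Gym-like interface returning
# integer states; parameter values are placeholders.
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Bootstrapped one-step update: new reward information moves
            # backward only one state per visit, which is why value
            # diffusion is slow without a model.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q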
Q-Learning is a well-known model-free reinforcement learning method. However, it suffers from a bootstrapping problem: when a good reward is discovered, many iterations are needed to diffuse that information backward. In this thesis, we use an adaptive, tree-structure-based model learning method built on Dyna-Q to predict states and rewards that the agent has never reached. Prediction gives the agent additional experience with which to indirectly learn its policy, and by simulating transitions between states the agent can propagate value information quickly. However, when the model is not yet complete, the predictions may pass through an absorbing state, which degrades learning or even prevents the optimal policy from being learned. We propose methods to avoid passing through absorbing states. To verify the proposed methods, we simulate them in a maze environment and in a velocity-control collision-avoidance environment. The simulation results show that the proposed method improves the speed and efficiency of learning a policy.
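The planning idea the thesis builds on can be illustrated with a plain Dyna-Q loop. The sketch below is hypothetical: it replaces the thesis's tree-structured reverse model with a simple dictionary model, and the environment interface and parameters are assumptions, so it shows the general Dyna-Q planning mechanism rather than the proposed algorithm.

import random
import numpy as np

# Dyna-Q planning loop sketch.  The thesis learns a tree-structured reverse
# model that generalizes over similar transitions; here a plain dictionary
# model stands in for it, so this illustrates the planning idea only.
def dyna_q(env, n_states, n_actions, episodes=200, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    model = {}                       # (s, a) -> (r, s_next), from real experience
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Direct RL update from real experience
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            model[(s, a)] = (r, s_next)
            # Planning: replay simulated transitions so reward information
            # propagates backward without further real interaction
            for _ in range(planning_steps):
                ps, pa = random.choice(list(model.keys()))
                pr, ps_next = model[(ps, pa)]
                Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps, pa])
            s = s_next
    return Q

Each simulated update in the planning loop reuses stored transitions, which is what lets value information propagate backward without additional real interaction.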