
以一種創新的自適應性探索策略革新加強式學習理論之架構

Innovated Architectures of Reinforcement Learning with a Novel Adaptive Exploration Strategy

Advisor: 黃國勝

Abstract


Reinforcement learning is a form of unsupervised learning and one of the most popular approaches among machine learning schemes. It describes how to act in a dynamic environment so as to maximize a payoff function. In a reinforcement learning architecture, a computational agent learns, through evaluative feedback, to adjust its behavior and take appropriate actions. A metric evaluation is therefore needed to indicate how successful an action is, so that the action can be rewarded or penalized. During learning, the iterative methods used for policy evaluation and policy improvement often converge rather slowly because of estimation variance, especially in the initial learning stage. Another critical issue for reinforcement learning algorithms is the trade-off between exploration and exploitation: at every action taken, the agent must balance exploring for new experience against exploiting existing knowledge in order to obtain the maximum return. Solutions to all of these problems are discussed in this dissertation.

This dissertation first proposes an algorithm based on a synthetic approach from Grey theory, which eliminates all step-size parameters and improves data efficiency, and its stability is explained from the viewpoint of Grey theory. The algorithm, together with a critic-actor reinforcement learning model, is implemented on a System-on-a-Programmable-Chip (SOPC) board and further compared with the Adaptive Heuristic Critic (AHC) model; the experimental results show that the proposed control mechanism can learn to control a system with very little prior knowledge.

To accelerate policy improvement in reinforcement learning, a Dyna-style system is proposed that combines two learning schemes: one uses the temporal-difference (TD) method for direct learning, and the other uses relative values for indirect learning between two successive direct learning cycles. Instead of building a complicated world model, this approach introduces a simple predictor of the average reward into the simulated (planning) mode of the critic-actor architecture. The relative value of a state, defined as the accumulated difference between the immediate reward and the average reward, is used to steer the improvement process in the right direction.

Finally, the dissertation uses a natural metric, an algorithm based on adaptive state aggregation, to construct a local model over the state-action space; that is, it introduces a planning algorithm whose time required for global exploration depends only on the resolution of the metric rather than on the size of the state space. In addition, to address the exploration-exploitation trade-off in reinforcement learning, a Tabu list with a shrinking length and an adaptive ε-greedy exploration scheme are introduced. The adaptive ε-greedy strategy is based on information entropy: the value of ε is adjusted according to the learning progress rather than by hand, and it achieves good convergence in maze simulation experiments.
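To make the planning scheme above concrete, the following Python sketch gives one simplified reading of it; it is an illustration under stated assumptions, not the dissertation's implementation. It maintains a running average-reward predictor, accumulates each state's relative value as the sum of (immediate reward minus average reward), and replays remembered transitions between two direct TD(0) learning cycles, using the relative value to steer the critic. The class name and the parameters alpha, beta, and plan_steps are illustrative assumptions.

    import random
    from collections import defaultdict

    class RelativeValuePlanner:
        """Sketch of direct TD(0) learning plus relative-value planning.

        Hypothetical reconstruction: the relative value of a state is
        accumulated as (immediate reward - average reward) and biases the
        critic during planning steps taken between direct-learning cycles.
        """

        def __init__(self, gamma=0.95, alpha=0.1, beta=0.01):
            self.V = defaultdict(float)    # critic: state values
            self.rel = defaultdict(float)  # relative value of each state
            self.avg_reward = 0.0          # simple average-reward predictor
            self.gamma, self.alpha, self.beta = gamma, alpha, beta
            self.visited = []              # remembered (s, r, s_next) transitions

        def direct_step(self, s, r, s_next):
            # Direct learning: standard TD(0) update from real experience.
            td_error = r + self.gamma * self.V[s_next] - self.V[s]
            self.V[s] += self.alpha * td_error
            # Update the average-reward predictor and the relative value of s.
            self.avg_reward += self.beta * (r - self.avg_reward)
            self.rel[s] += r - self.avg_reward
            self.visited.append((s, r, s_next))

        def plan(self, plan_steps=10):
            # Indirect learning: replay remembered transitions and let the
            # relative value, not a full world model, steer the critic.
            for s, r, s_next in random.sample(self.visited,
                                              min(plan_steps, len(self.visited))):
                target = self.rel[s] + self.gamma * self.V[s_next]
                self.V[s] += self.alpha * (target - self.V[s])

In this reading, the scalar average-reward predictor and the stored transitions stand in for a complicated world model, which is the point of the planning scheme described above.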

Parallel Abstract (English)


Reinforcement learning is one of the most popular approaches among machine learning schemes and a well-known form of unsupervised learning in robot learning. It describes how to act in a dynamic environment so as to maximize a payoff function. In the architecture of reinforcement learning, a computational agent learns to perform appropriate actions from evaluative feedback. A metric evaluation is therefore needed to indicate the degree of success of an action for the assignment of credit or blame. Furthermore, during learning, the iterative methods of policy evaluation and policy improvement often lead to quite slow convergence due to estimation variance, especially at the initial stage. Another critical issue for a reinforcement learning algorithm is the trade-off between exploration and exploitation: at each action taken, the learning agent needs to balance exploring for new experience against exploiting current knowledge so as to gain a maximum reward. All of these problems are addressed in this thesis.

In this dissertation, we propose an algorithm, based on a synthetic approach from Grey theory, that eliminates all step-size parameters and improves data efficiency, and we also establish the stability of the proposed algorithm from the viewpoint of Grey theory. The algorithm, along with a critic-actor reinforcement learning model, is implemented on a System-on-a-Programmable-Chip (SOPC) board. The experimental results, including a comparison with the renowned Adaptive Heuristic Critic (AHC) model, demonstrate that the proposed control mechanism can learn to control a system with very little a priori knowledge.

To accelerate the process of policy improvement in reinforcement learning, we propose a Dyna-style system that combines two learning schemes: one utilizes a TD method for direct learning, while the other uses relative values for indirect learning during planning between two successive direct learning cycles. Instead of establishing a complicated world model, the approach introduces a simple predictor of average rewards into the actor-critic architecture in the simulation (planning) mode. The relative value of a state, defined as the accumulated difference between the immediate reward and the average reward, is used to steer the improvement process in the right direction.

Finally, we propose a method that utilizes a natural metric, an adaptive state-aggregation algorithm, on the state-action space to construct a local model. It also introduces a planning algorithm in which the time required for global exploration depends only on the metric resolution rather than on the size of the state space. In addition, to solve the trade-off problem between exploration and exploitation, a Tabu list with a shrinking length and an adaptive ε-greedy exploration scheme are introduced. The adaptive ε-greedy strategy is based on information entropy: ε varies with the learning progress instead of being tuned manually, and it yields good convergence in the maze simulation.
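As a rough illustration of the entropy-based adaptive ε-greedy scheme, the Python sketch below assumes one plausible formulation: ε is taken to be the normalized Shannon entropy of a softmax over the current action values, so exploration stays high while the values are still nearly indistinguishable and decays automatically as learning separates them. The function name and the temperature and eps_min parameters are assumptions for illustration; the dissertation's exact measure of learning progress may differ.

    import math
    import random

    def adaptive_epsilon_greedy(q_values, temperature=1.0, eps_min=0.01):
        """Entropy-driven epsilon-greedy action selection (hypothetical sketch)."""
        n = len(q_values)
        if n == 1:
            return 0
        # Softmax over the Q-values (numerically stabilized).
        m = max(q_values)
        exp_q = [math.exp((q - m) / temperature) for q in q_values]
        z = sum(exp_q)
        probs = [e / z for e in exp_q]
        # Normalized entropy in [0, 1]; flat Q-values give entropy close to 1.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        epsilon = max(eps_min, entropy)
        if random.random() < epsilon:
            return random.randrange(n)                   # explore
        return max(range(n), key=lambda i: q_values[i])  # exploit

    # Nearly flat Q-values keep epsilon close to 1 (mostly exploring), while
    # sharply separated Q-values drive epsilon toward eps_min (mostly greedy).
    print(adaptive_epsilon_greedy([0.10, 0.11, 0.09, 0.10]))
    print(adaptive_epsilon_greedy([5.0, 0.1, 0.0, -0.2]))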
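The shrinking Tabu list can be sketched in the same spirit. In the hypothetical fragment below, recently taken state-action pairs are temporarily forbidden, which pushes the agent toward unvisited choices early on, and the list's capacity decays after every addition so the restriction gradually fades; initial_length, min_length, and decay are illustrative parameters rather than values taken from the thesis.

    from collections import deque

    class ShrinkingTabuList:
        """Tabu list whose capacity shrinks over time (hypothetical sketch)."""

        def __init__(self, initial_length=50, min_length=1, decay=0.99):
            self.capacity = float(initial_length)
            self.min_length = min_length
            self.decay = decay
            self.items = deque()

        def add(self, state, action):
            self.items.append((state, action))
            # Shrink the allowed capacity a little after every addition.
            self.capacity = max(self.min_length, self.capacity * self.decay)
            while len(self.items) > int(self.capacity):
                self.items.popleft()   # the oldest entries become legal again

        def is_tabu(self, state, action):
            return (state, action) in self.items

An agent would consult is_tabu() before choosing an action and apply the ε-greedy rule above only among the non-tabu candidates.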


Cited by


姜冠毅 (2015). 基於熵調整增強式學習探索率之研究 [A study on entropy-based adjustment of the exploration rate in reinforcement learning; Master's thesis, National Chung Cheng University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0033-2110201614041577
