
Incremental Reinforcement Learning with Dual-Adaptive ϵ-greedy Exploration

Advisor: 陳銘憲

Abstract


In recent years, reinforcement learning (RL) has shown impressive results in many domains. However, most RL frameworks oversimplify the problem setting: the assumption of a static, unchanging environment is often hard to generalize to real-world applications. This thesis introduces a new and more realistic challenge, Incremental Reinforcement Learning (Incremental RL), in which the search space of the Markov Decision Process (MDP) keeps growing as learning proceeds instead of remaining fixed. Previous methods, which explore unseen transitions in the environment either randomly or based on the learning process, mostly suffer from a lack of efficiency, especially in a continually growing exploration space. To address this challenge, we propose a novel yet simple and effective algorithm, Dual-Adaptive ϵ-greedy Exploration. In particular, Dual-Adaptive ϵ-greedy Exploration employs a Meta Policy and an Explorer network to avoid redundant sampling: the Meta Policy evaluates the exploration convergence of a given state with a heuristic strategy and adaptively assigns the value of ϵ, while the Explorer estimates how often each action has been taken in a given state and selects the least-explored action. In addition, we build a testbed, based on an exponentially growing environment and the Atari benchmark, to validate the effectiveness of exploration algorithms under Incremental RL. Experimental results show that the proposed framework can efficiently explore and learn the unseen transitions in the environment, achieving an average performance improvement of up to 81.12% over eight state-of-the-art baselines.

Parallel Abstract


Recently, reinforcement learning (RL) methods have achieved impressive performance in various domains. However, most RL frameworks oversimplify the problem by assuming a static, fixed environment and therefore generalize poorly to real-world scenarios. In this paper, we address a new challenge with a more realistic setting, Incremental Reinforcement Learning (Incremental RL), where the search space of the Markov Decision Process (MDP) continually expands. Previous methods, which explore unseen transitions either by random sampling or by relying on the learning process, usually lack efficiency, especially as the search space keeps growing. We therefore present a new exploration framework named Dual-Adaptive ϵ-greedy Exploration (DAE), a simple yet effective method for the challenge of Incremental RL. Specifically, DAE employs a Meta Policy and an Explorer to avoid redundant computation on sufficiently learned samples. On the one hand, the Meta Policy evaluates a state's exploration convergence through a heuristic strategy and adaptively assigns the value of ϵ. On the other hand, the Explorer estimates the occurrence of each action, conditioned on a given state, and nominates the least-tried one to explore. Furthermore, we also release a testbed based on an exponentially growing environment and the Atari benchmark to validate the effectiveness of exploration algorithms under Incremental RL. Experimental results demonstrate that the proposed framework can efficiently learn the unseen transitions in new environments, leading to a prominent performance improvement, i.e., more than 80% on average, over the eight state-of-the-art methods examined.
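To make the two-part mechanism concrete, below is a minimal Python sketch of how a dual-adaptive ϵ-greedy action selector could be organized. The abstract only states that the Meta Policy assigns ϵ per state via a heuristic convergence estimate and that the Explorer nominates the least-tried action; the class name, the count-based convergence heuristic, and the decay schedule below are illustrative assumptions rather than the thesis's exact formulation.

```python
import random
from collections import defaultdict


class DualAdaptiveEpsilonGreedy:
    """Minimal sketch of dual-adaptive epsilon-greedy action selection.

    Assumptions: the Meta Policy's "exploration convergence" heuristic is
    approximated here by per-state visit counts, and the Explorer tracks
    per-(state, action) counts to find the least-tried action.
    """

    def __init__(self, n_actions, eps_max=1.0, eps_min=0.05, decay=0.99):
        self.n_actions = n_actions
        self.eps_max, self.eps_min, self.decay = eps_max, eps_min, decay
        # Per-(state, action) visit counts used by the Explorer.
        self.counts = defaultdict(lambda: [0] * n_actions)

    def _meta_epsilon(self, state):
        # Meta Policy (assumed heuristic): a frequently visited state is
        # treated as closer to exploration convergence, so it gets a smaller epsilon.
        visits = sum(self.counts[state])
        return max(self.eps_min, self.eps_max * (self.decay ** visits))

    def select_action(self, state, q_values):
        eps = self._meta_epsilon(state)
        if random.random() < eps:
            # Explorer: nominate the least-tried action in this state.
            action = min(range(self.n_actions), key=lambda a: self.counts[state][a])
        else:
            # Exploit the current value estimates.
            action = max(range(self.n_actions), key=lambda a: q_values[a])
        self.counts[state][action] += 1
        return action
```

In use, `select_action(state, q_values)` would stand in for the standard ϵ-greedy step of a value-based agent such as DQN: well-visited states receive a small ϵ, while exploration in newly reachable parts of the growing MDP favors actions that have rarely been tried.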

