
RMC: Rethinking the Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning

Advisor: 廖世偉

Abstract


Many complex multi-agent systems, such as robot swarm control and autonomous vehicle coordination, can be modeled as Multi-Agent Reinforcement Learning (MARL) tasks. QMIX, a popular MARL algorithm based on a monotonicity constraint, has been used as a baseline for benchmark environments such as the StarCraft Multi-Agent Challenge (SMAC) and Predator-Prey (PP). Recent variants of QMIX aim to relax its monotonicity constraint in order to improve its expressive power, which yields performance improvements in SMAC. However, we find that the performance improvements of these variants are significantly affected by various implementation tricks. In this paper, we revisit the monotonicity constraint of QMIX: (1) we design a novel model, RMC, to further investigate the monotonicity constraint; the results show that the monotonicity constraint can improve sample efficiency in some purely cooperative tasks; (2) we then re-evaluate the performance of QMIX and these variants with a grid hyperparameter search over the implementation tricks; the results show that QMIX achieves the best performance among them; (3) we analyze the monotonic mixing network from a theoretical perspective and show that it can represent any purely cooperative task. These analyses demonstrate that relaxing the monotonicity constraint of the mixing network does not always improve the performance of QMIX, which breaks our previous impression of the monotonicity constraint.
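
To make the monotonicity constraint concrete: QMIX factorizes the joint action-value Q_tot as a state-conditioned mixture of per-agent utilities Q_i and requires dQ_tot/dQ_i >= 0 for every agent, so that each agent acting greedily on its own Q_i is also greedy with respect to Q_tot. The following is a minimal PyTorch sketch of such a monotonic mixing network in the spirit of QMIX; it is illustrative only (layer sizes, names, and the two-layer hypernetwork structure are assumptions, and this is not the RMC model proposed in the thesis). Monotonicity is obtained by forcing the hypernetwork-generated mixing weights to be non-negative via an absolute value.

    # Illustrative sketch of a QMIX-style monotonic mixing network.
    # Monotonicity (dQ_tot/dQ_i >= 0) is enforced by taking the absolute
    # value of the hypernetwork outputs, so all mixing weights are >= 0.
    import torch
    import torch.nn as nn

    class MonotonicMixer(nn.Module):
        def __init__(self, n_agents, state_dim, embed_dim=32):
            super().__init__()
            self.n_agents = n_agents
            self.embed_dim = embed_dim
            # Hypernetworks condition the mixing weights on the global state.
            self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
            self.hyper_b1 = nn.Linear(state_dim, embed_dim)
            self.hyper_w2 = nn.Linear(state_dim, embed_dim)
            self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                          nn.ReLU(),
                                          nn.Linear(embed_dim, 1))

        def forward(self, agent_qs, state):
            # agent_qs: (batch, n_agents) per-agent Q-values
            # state:    (batch, state_dim) global state
            bs = agent_qs.size(0)
            agent_qs = agent_qs.view(bs, 1, self.n_agents)
            # Non-negative weights keep Q_tot monotonic in each agent's Q.
            w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
            b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
            hidden = torch.relu(torch.bmm(agent_qs, w1) + b1)
            w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
            b2 = self.hyper_b2(state).view(bs, 1, 1)
            q_tot = torch.bmm(hidden, w2) + b2
            return q_tot.view(bs, 1)

    # Example usage with random tensors (batch of 4, 3 agents, 10-dim state).
    mixer = MonotonicMixer(n_agents=3, state_dim=10)
    q_tot = mixer(torch.rand(4, 3), torch.rand(4, 10))

Because every mixing weight is non-negative, maximizing each Q_i individually also maximizes Q_tot, which is what allows decentralized greedy action selection after centralized training.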
