

Batch-Augmented Multi-Agent Reinforcement Learning for Efficient Traffic Signal Optimization

Advisor: 林守德 (Shou-De Lin)

Abstract


This work proposes a sample-efficient reinforcement learning framework for traffic signal optimization, with the aim of alleviating traffic congestion. Although reinforcement learning has achieved remarkable results in many domains, such methods come with demanding requirements, such as an enormous number of samples, which makes them hard to apply to tasks that lack an efficient simulation environment. Traffic signal optimization is one such task: many traffic simulators exist, but a simulator that accurately models traffic flow requires heavy computation at congested intersections, which sharply slows simulation.

We identify several challenges that arise in practice when optimizing traffic signals and propose corresponding solutions. Because an adaptive signal system relies on intersection cameras to provide real-time traffic conditions, we must also handle the case in which cameras break down. Moreover, as noted above, an efficient traffic simulator is expensive and difficult to obtain, so another of our goals is a reinforcement learning algorithm that can optimize traffic signals without a simulator. Finally, we consider a multi-agent setting in which the controller at each intersection cannot observe the states of other intersections, and we enable the controllers to cooperate and relieve congestion under this constraint.

The proposed framework has two stages. The first stage is an Evolution Strategies algorithm, for which we propose an exploration scheme over the phase lengths and the total cycle length of a fixed-time plan, finding a local optimum among fixed-time schedules. The second stage is a multi-agent reinforcement learning algorithm that starts from a fixed-time schedule and obtains a better adaptive signal system without needing a traffic simulator for optimization. When lightweight simulation is affordable, we can first run the stage-one Evolution Strategies optimization and hand its result to the stage-two algorithm; when no simulator is available at all, the stage-two multi-agent algorithm can be applied directly, optimizing from traffic data collected in the field. For the second stage we propose three components: bounded action, which stabilizes cooperation among agents that cannot communicate; batch augmentation, which enriches the traffic-flow information that batch data can provide; and surrogate reward clipping, which lets an off-policy reinforcement learning algorithm learn an appropriate policy without interacting with the environment.

In our experiments, the full framework reduces network-wide waiting time by 36% with only 600 queries to the simulator, showing that the method achieves high sample efficiency for network-level traffic signal optimization.
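To make the first stage concrete, the following is a minimal sketch of how Evolution Strategies can tune the phase lengths of a fixed-time plan; the simulator query evaluate_waiting_time is a hypothetical stand-in for a traffic simulator, and every hyperparameter and bound below is illustrative rather than taken from the thesis.

    import numpy as np

    def evaluate_waiting_time(phase_lengths):
        # Hypothetical simulator query (name and interface assumed, not
        # from the thesis): returns the total network waiting time for
        # the given fixed-time phase lengths; lower is better.
        raise NotImplementedError("plug in a traffic simulator here")

    def es_optimize_fixed_time(init_phases, iters=30, pairs=10, sigma=2.0, lr=0.5):
        # Vanilla Evolution Strategies with mirrored (antithetic)
        # perturbations over the vector of phase lengths, in seconds.
        theta = np.asarray(init_phases, dtype=float)
        for _ in range(iters):
            eps = np.random.randn(pairs, theta.size)
            r_plus = np.array([-evaluate_waiting_time(np.clip(theta + sigma * e, 5.0, 120.0)) for e in eps])
            r_minus = np.array([-evaluate_waiting_time(np.clip(theta - sigma * e, 5.0, 120.0)) for e in eps])
            # Search-gradient estimate from mirrored reward differences.
            grad = ((r_plus - r_minus)[:, None] * eps).sum(axis=0) / (2 * pairs * sigma)
            theta = np.clip(theta + lr * grad, 5.0, 120.0)  # keep phases in a plausible range
        return theta

With these illustrative defaults, each iteration costs 2 × 10 = 20 simulator queries, so 30 iterations consume 600 queries, the same order as the budget quoted above; the actual schedule used in the thesis may differ.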

Parallel Abstract


The goal of this work is to provide a viable solution based on reinforcement learning for traffic signal control problems. Although state-of-the-art reinforcement learning approaches have yielded great success in a variety of domains, directly applying them to alleviate traffic congestion is challenging, given the requirement of high sample efficiency and the way training data is gathered. In this work, we address several challenges that we encountered when attempting to mitigate serious traffic congestion in a metropolitan area. Specifically, we are required to provide a solution that can (1) handle traffic signal control when some of the surveillance cameras that supply information to the reinforcement learner are down, (2) learn from batch data without a traffic simulator, and (3) make control decisions without shared information across intersections. We present a two-stage framework for these situations: an Evolution Strategies stage that produces a fixed-time traffic signal schedule, and a multi-agent off-policy reinforcement learning stage that learns from batch data with the aid of three proposed components: bounded action, batch augmentation, and surrogate reward clipping. We show that the Evolution Strategies method obtains a fixed-time control schedule that outperforms reinforcement learning agents dynamically adjusting traffic light durations, while using far fewer simulator samples. For the multi-agent reinforcement learning part, the bounded action component maintains stability in the multi-agent scenario, where the controller at each intersection must decide independently. Surrogate reward clipping enables multi-agent, off-policy reinforcement learning to learn from batch data without a simulator in a cooperative task. Lastly, batch augmentation mitigates the inefficiency of data collection in traffic signal control problems. Our experiments show that the proposed framework reduces traffic congestion by 36% in terms of waiting time compared with the currently deployed fixed-time signal plan, while requiring only 600 queries to a simulator.
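Since the abstract only names the three stage-two components, the following is a minimal sketch of one plausible reading of each; every function name, threshold, and the Gaussian state jitter is an assumption made for exposition, not the thesis's actual formulation. States are assumed to be NumPy arrays of per-intersection traffic features.

    import numpy as np

    def bounded_action(raw_action, base_phase, max_delta=10.0):
        # Bounded action (one plausible reading): each controller may
        # only shift its phase length a few seconds away from the shared
        # fixed-time plan, so independent agents stay roughly aligned
        # without communicating.
        return np.clip(raw_action, base_phase - max_delta, base_phase + max_delta)

    def clip_surrogate_reward(reward, lo=-1.0, hi=1.0):
        # Surrogate reward clipping (illustrative): bounding the reward
        # signal limits value over-estimation when an off-policy learner
        # trains purely on batch data it cannot re-collect.
        return float(np.clip(reward, lo, hi))

    def augment_batch(transitions, copies=4, noise=0.05, rng=None):
        # Batch augmentation (illustrative): jitter logged states to
        # densify a sparse traffic batch; originals are always kept.
        rng = rng or np.random.default_rng(0)
        out = list(transitions)
        for s, a, r, s2 in transitions:
            for _ in range(copies):
                out.append((s + rng.normal(0.0, noise, s.shape),
                            a,
                            clip_surrogate_reward(r),
                            s2 + rng.normal(0.0, noise, s2.shape)))
        return out

The design intuition shared by all three pieces is to constrain what the learner can do or see, so that a policy trained purely from logged data cannot drift far from the regime the data actually covers.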

