
基於強化學習PPO之全自動化無人機研究

A Study of Fully Autonomous UAV Based on PPO

Advisor: 衛信文

Abstract


Mountaineering is popular in Taiwan. According to statistics, the number of mountain accidents rose sharply from 206 in 2019 to 453 in 2020, so providing rescue quickly when an accident occurs has become an important issue. However, the location of the victims is often unknown when an accident happens, so searching the mountains consumes considerable manpower and resources, and timely rescue is often impossible. This thesis therefore proposes using reinforcement learning to let a UAV automatically search unexplored blocks so that rescue can begin promptly. When a search is needed, the proposed UAV system can search mountainous areas quickly and extensively without human control, and the progress of the search can be monitored in real time, which both speeds up rescue and lowers search cost.

To obtain the best training results for the reinforcement-learning UAV, this thesis adopts Proximal Policy Optimization (PPO), an algorithm known for stable and fast training. Using the sensors on the drone, the UAV can automatically avoid obstacles and completely cover the designated search blocks. To simulate real-world physics as closely as possible, a PID controller is used to balance the flight attitude of the drone, and a continuous action space is used instead of a discrete one. The original PPO architecture, which processes only one neural network, is modified into an architecture that computes two neural networks simultaneously, making training more stable and giving the drone more comprehensive information about the environment for the mountain search task. The results are also compared with another reinforcement-learning algorithm, SAC (Soft Actor-Critic).

The experimental results show that, after PPO training, the UAV achieves its goals in all three designed scenarios. The first scenario shows that the UAV can fly stably with the help of the PID controller. The second scenario shows that the drone can avoid obstacles using its onboard camera and lidar sensors. In the more complex third scenario, an open mountain area, two search strategies are proposed, a fixed route and a non-fixed route; the results show that, after training, the drone completely searches the whole block under both strategies, and it can also completely search unfamiliar mountain areas it was never trained on. This demonstrates that the trained UAV can be applied to autonomous flight, obstacle avoidance, and area search, and can thus carry out search tasks for mountain rescue.
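As background for the algorithm named above, the standard clipped surrogate objective that defines PPO is reproduced below; the abstract does not spell out the thesis's two-network modification, so only the original single-objective form is shown, with probability ratio r_t(θ), clipping parameter ε, and advantage estimate Â_t.

    L^{\mathrm{CLIP}}(\theta)
      = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\Big],
    \qquad
    r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

Clipping the probability ratio keeps each policy update close to the previous policy, which is the main reason PPO trains stably, as the abstract emphasizes.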

Abstract (English)


Taiwan is a mountainous island, and many people in Taiwan love mountaineering. According to statistics, mountain accidents rose rapidly from 206 in 2019 to 453 in 2020, so providing mountain rescue in time is an important issue. However, the location of the people awaiting rescue is sometimes unknown when an accident happens, and it then takes a great deal of manpower, effort, and resources to search for them in the mountains. Therefore, this thesis proposes a fully autonomous UAV (unmanned aerial vehicle) based on PPO to automatically search unknown mountain areas or blocks in time. When there is a search demand, the proposed UAV (also known as a drone) can quickly and widely search for people or targets in the mountains without human control, which speeds up the rescue process. In addition, removing the need for human control also reduces the cost of rescue.

To achieve this goal, the proposed UAV uses a stable and fast reinforcement-learning algorithm called Proximal Policy Optimization (PPO) to obtain the best training results. Through the sensors on the drone, the drone can fly automatically while avoiding obstacles and completely search the designated block. To simulate the physical environment of the real world as closely as possible, this thesis uses a PID controller to balance the flight attitude of the drone and adopts a continuous action space instead of a discrete one. Moreover, the original PPO architecture, which can only take one neural network as input, is modified to accept two neural networks at the same time. After this modification, training is more stable and the drone obtains more comprehensive information about the environment to complete the search task in mountainous areas. A comparison between the proposed UAV system and another reinforcement-learning algorithm, SAC (Soft Actor-Critic), is also discussed in this thesis.

The experimental results show that, through PPO training, the UAV can complete the goals in the three designed scenarios. The first scenario proves that the UAV can fly stably with the help of the PID controller. The second scenario proves that the drone can avoid obstacles using the camera sensor and lidar sensor on the fuselage. Two search methods, a fixed-route method and a non-fixed-route method, are proposed and applied in the third scenario, an emulated open mountain domain that is more challenging for the drone to search completely. The results show that even in unfamiliar and untrained mountainous areas, the proposed UAV system can completely search the area and fulfill the rescue mission.
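To make the two-input modification described above concrete, the following is a minimal sketch, assuming a PyTorch implementation, of an actor whose observation is fused from a camera-image branch and a lidar range-vector branch. The input shapes, layer sizes, and fusion strategy (simple concatenation) are illustrative assumptions, not details taken from the thesis.

    import torch
    import torch.nn as nn

    class TwoBranchActor(nn.Module):
        """Actor that fuses a camera image and a lidar scan (illustrative only)."""

        def __init__(self, lidar_dim=16, action_dim=4):
            super().__init__()
            # Branch 1: small CNN over an assumed 1x64x64 camera/depth image.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * 13 * 13, 128), nn.ReLU(),
            )
            # Branch 2: MLP over an assumed flat lidar range vector.
            self.mlp = nn.Sequential(nn.Linear(lidar_dim, 64), nn.ReLU())
            # Fusion head: concatenated features -> mean of a continuous action.
            self.head = nn.Sequential(
                nn.Linear(128 + 64, 64), nn.ReLU(),
                nn.Linear(64, action_dim), nn.Tanh(),
            )

        def forward(self, image, lidar):
            features = torch.cat([self.cnn(image), self.mlp(lidar)], dim=-1)
            return self.head(features)  # action mean scaled to [-1, 1]

    # Example: one dummy observation pair -> a 4-dimensional continuous action.
    actor = TwoBranchActor()
    action = actor(torch.zeros(1, 1, 64, 64), torch.zeros(1, 16))

A Tanh-squashed continuous output matches the abstract's choice of a continuous rather than discrete action space; in a full PPO agent this mean would parameterize a Gaussian policy, with a critic network trained alongside it.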

