In this work, we applied deep reinforcement learning to control mobile robots and UAVs in simulation, targeting problems that are hard to solve with traditional controllers, such as mapless motion planning in environments that contain both stationary obstacles and moving obstacles whose motions are unknown and random. In such environments, model-free reinforcement learning allows a robot with an unknown motion model to sample from the environment continuously and distill a policy that plans collision-free paths. We further introduced a simple, sub-optimal numerical solution as an expert: guided by the expert's demonstrations and trained with the policy gradient, the agent found a policy that outperformed the expert.

The control of the UAVs was designed from two aspects. On one hand, we used imitation learning, taking a traditional controller as the target and dataset aggregation (DAgger) as an aid, to train a quadrotor controller that achieves position control, i.e., hovering at a specified position, under different sampling and control rates. On the other hand, we leveraged the model-free strength of deep reinforcement learning to learn abstract feature representations and to act as a high-level velocity controller. In this task, the environment contained gates of different sizes, the observations were first-person-view camera images, and the goal was to fly the quadrotor through the gates using only these images. To reduce the size of the state space, we pre-trained a variational auto-encoder (VAE) that encodes each image into a low-dimensional vector. However, the full-state assumption of the Markov decision process (MDP) is violated by the ambiguity of the forward-looking images. We approached this ambiguity from two directions: one is to provide as much information in the observation as possible, so that the MDP assumption is approximately satisfied; the other is to replace the MDP with a partially observable Markov decision process (POMDP) and take the whole trajectory, rather than the observation of a single time step, as the input. Finally, we conducted simulations under different observation settings and discussed the applicability of the MDP and the POMDP formulations.
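To make the expert-guided training concrete, the following is a minimal sketch of a behavior-cloning warm start on the expert's demonstrations followed by a vanilla policy-gradient (REINFORCE) update. The network sizes, state and action dimensions, and function names are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: a goal-relative robot state, discretized headings.
STATE_DIM, N_ACTIONS = 4, 8

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                       nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def warm_start(expert_states, expert_actions):
    # Behavior cloning: imitate the sub-optimal numerical expert first.
    loss = nn.functional.cross_entropy(policy(expert_states), expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()

def policy_gradient_step(states, actions, returns):
    # REINFORCE: raise the log-probability of sampled actions in proportion
    # to their returns, which can push the policy past the expert's level.
    dist = torch.distributions.Categorical(logits=policy(states))
    loss = -(dist.log_prob(actions) * returns).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```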
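The imitation-learning part for position control follows the standard DAgger recipe: roll out the learner, relabel the states it actually visits with the traditional controller, and refit on the aggregated dataset. The `env`, `expert_action`, `policy`, and `fit` interfaces below are assumed for illustration and do not correspond to any specific library.

```python
import numpy as np

def dagger(env, expert_action, policy, fit, n_iters=10, horizon=500):
    """Dataset aggregation (DAgger): the learner chooses the actions, the
    traditional controller labels the visited states, and the policy is
    refit on everything collected so far. All callables are assumed APIs."""
    states, labels = [], []
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(horizon):
            states.append(s)
            labels.append(expert_action(s))   # query the expert controller...
            s, done = env.step(policy(s))     # ...but follow the learner
            if done:
                break
        fit(policy, np.array(states), np.array(labels))
    return policy
```

Because the expert labels the learner's own state distribution, the trained policy stays reliable even when its control rate differs from the expert's.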
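One plausible shape for the pre-trained image encoder is a small convolutional VAE; the layer sizes and the 64x64 RGB input below are assumptions, but the reparameterization trick and the low-dimensional latent vector are the essential parts.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Convolutional VAE encoder: camera image -> low-dimensional latent.
    Training would minimize reconstruction error plus the KL term
    -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum()."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31 -> 14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14 -> 6
            nn.Flatten())
        self.mu = nn.Linear(128 * 6 * 6, z_dim)
        self.logvar = nn.Linear(128 * 6 * 6, z_dim)

    def forward(self, x):
        h = self.conv(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar
```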
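For the POMDP formulation, a common realization (assumed here, since the abstract does not fix one) is a recurrent policy that digests the trajectory of VAE latents, so the hidden state summarizes the history that a single ambiguous forward-looking frame cannot provide.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """POMDP variant: an LSTM over the observation trajectory stands in
    for the unobservable true state. Sizes are illustrative."""
    def __init__(self, z_dim=32, hidden=128, act_dim=4):
        super().__init__()
        self.lstm = nn.LSTM(z_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, z_seq, hc=None):
        # z_seq: (batch, time, z_dim) -- the trajectory so far, not one frame.
        out, hc = self.lstm(z_seq, hc)
        return self.head(out[:, -1]), hc  # velocity command at the last step
```

The MDP baseline would be the same network without the LSTM, fed a single latent plus whatever extra information helps restore the Markov property.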