Obstacle avoidance and environment sensing are at the core of autonomous driving and robotics applications. Cameras, owing to their low cost and the rich environmental information their images provide, are widely used in these applications, and depth prediction from single-camera images has therefore become one of the mainstream research directions. However, most existing methods rely on complicated computation and expensive equipment; in view of this, this thesis focuses on developing a lightweight neural network for real-time depth prediction. Based on the encoder-decoder design of deep learning networks, we propose a supervised lightweight neural network architecture that employs efficient bottleneck designs and detachable decoder blocks to predict depth maps at multiple scales. In addition, we design a multi-task loss function for the decoder blocks at different scales; by incorporating a semantic segmentation task into training, the model achieves better convergence and faster training. To train on the KITTI dataset, we generate ground-truth depth maps and semantic segmentation labels using PSMNet, DeepLabV3, and several pre-processing methods. We also collect a synthetic dataset in AirSim with diverse camera viewpoints to further test the robustness of our method. Through controlled-variable experiments, we arrive at an efficient neural network that performs real-time depth prediction with very few model parameters and very low computation. Trained on the KITTI dataset, our depth prediction method outperforms existing methods on several KITTI evaluation metrics. Tested on the AirSim dataset, we demonstrate the accuracy of our depth prediction on images captured from different camera viewpoints.
Obstacle avoidance and environment sensing are crucial applications in autonomous driving and robotics. Among all types of sensors, cameras are widely used in these applications because they offer rich visual content at relatively low cost. Thus, using images from a single camera to perform depth estimation has become one of the main focuses of recent research. However, prior works usually rely on highly complicated computation and power-hungry equipment to achieve this task; therefore, this thesis focuses on developing a lightweight system for real-time depth prediction. Based on the well-known encoder-decoder architecture, we propose a supervised learning-based CNN with detachable decoders that outputs predicted depth maps at multiple resolutions. We also formulate a novel multi-task loss function for each decoder block, which considers both depth estimation and semantic segmentation simultaneously to encourage model convergence as well as to speed up the training process. To train our model on the KITTI dataset, we generate ground-truth depth maps and semantic segmentation labels via PSMNet and DeepLabV3, respectively, and test various pre-processing methods. We also collect a synthetic dataset in AirSim covering a wide range of camera views to evaluate the robustness of the proposed depth estimation approach. Through a series of ablation studies and experiments, we validate that our model efficiently performs real-time depth prediction with few parameters and fairly low computation cost, and that the best trained model outperforms previous works on the KITTI dataset under various evaluation metrics. Trained and tested on our AirSim dataset, our model is also shown to handle images captured from quite different camera poses and altitudes.
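As an illustrative sketch only (the abstract does not give the exact formulation; the per-scale weights \alpha_s and the task-balancing weight \lambda below are hypothetical), a multi-scale multi-task objective of the kind described above can take the form

    L_total = \sum_{s=1}^{S} \alpha_s \left( L_{depth}^{(s)} + \lambda \, L_{seg}^{(s)} \right),

where s indexes the S decoder scales, L_{depth}^{(s)} is a regression loss (e.g., an L1 term) between the predicted and ground-truth depth maps at scale s, and L_{seg}^{(s)} is a cross-entropy loss against the segmentation labels at the same scale. Summing both task losses over every decoder block gives each scale a direct supervision signal, which is consistent with the faster convergence reported above.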