
基於深度學習之高解析度畫面內插演算法及即時性積體電路架構設計

Deep Learning-based High-Resolution Frame Interpolation Algorithm and Real-Time Hardware Architecture

Advisor: 簡韶逸

Abstract


Video frame interpolation (VFI) increases the temporal resolution of a video by synthesizing smooth transition frames between two input frames. Recent work based on convolutional neural networks (CNNs) achieves impressive results. However, when processing high-resolution video, these methods demand large amounts of memory and long computation times, leading to high hardware requirements. In this thesis, we propose a light-computation network architecture that combines channel-number optimization, optical-flow sharing, and pixel shuffling. Compared with existing VFI networks, it reduces the convolution-layer computation by 91x and 364x with 2x2 and 4x4 flow sharing, respectively. In addition, knowledge distillation is applied to improve interpolation quality, and the human visual system's insensitivity to chrominance is exploited to further reduce data access when sampling the U and V channels. We also implement a hardware system based on the proposed network architecture. The neural-network computation is accelerated by VectorMesh, a neural network processor developed in our lab, together with an access-efficient grid-sample engine designed in this work. The sampling engine avoids redundant data access and losslessly reconstructs the sampling results. Finally, we integrate two MERIT processors (the basic units of VectorMesh) and one grid-sample engine into a physical IP to realize the VFI system, which processes Ultra-HD video in real time.
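The abstract above attributes the computation reduction to flow sharing and pixel shuffling. The sketch below is an illustration of the general idea, not the thesis network: pixel unshuffling rearranges spatial positions into channels so subsequent convolutions run on a spatially smaller tensor, and sharing one flow per s x s block shrinks the spatial term of a convolution's multiply-accumulate (MAC) count by s*s. The layer sizes and channel counts here are arbitrary assumptions.

```python
# Illustrative sketch (not the thesis architecture): pixel unshuffle and the
# spatial MAC savings from running flow-prediction layers at reduced resolution.

def pixel_unshuffle(img, s):
    """Rearrange an H x W grid (list of lists) into s*s sub-grids of size
    (H//s) x (W//s), one per phase of the s x s sampling lattice."""
    h, w = len(img), len(img[0])
    assert h % s == 0 and w % s == 0
    return [
        [[img[y * s + dy][x * s + dx] for x in range(w // s)]
         for y in range(h // s)]
        for dy in range(s) for dx in range(s)
    ]

def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulate count of one k x k convolution layer."""
    return h * w * c_in * c_out * k * k

# With one flow per 2 x 2 block, the flow layers can operate on the unshuffled
# (half-resolution) grid, cutting the spatial term by s*s = 4. The thesis's
# reported 91x / 364x figures come from combining this with channel-number
# optimization and the other techniques, so they are much larger than 4.
full = conv_macs(1080, 1920, 32, 32, 3)    # Full-HD feature map (assumed sizes)
shared = conv_macs(540, 960, 32, 32, 3)    # same layer on the 2x2-unshuffled grid
print(full // shared)  # 4
```

The same arithmetic with 4x4 sharing yields a 16x spatial reduction per layer, which is why block size trades quality against computation.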

Parallel Abstract


Video frame interpolation (VFI) achieves temporal super-resolution by generating smooth transitions between frames. Recent work has shown remarkable results with convolutional neural networks (CNNs). However, these methods demand huge amounts of memory and run time for high-resolution videos, which results in expensive hardware requirements and an unpleasant visual experience. In this thesis, we propose a light-computation network architecture with channel-number optimization, flow sharing, and pixel shuffling. Compared with an existing VFI network, this architecture reduces the computation in convolution layers by 91x and 364x with 2x2 and 4x4 flow sharing, respectively. Moreover, with a knowledge distillation technique, the video quality is well maintained. Because the human visual system is insensitive to chrominance, we can further reduce the data access when sampling the U and V channels. Furthermore, we implement a hardware system based on the proposed network architecture. We employ VectorMesh, an in-house neural processor silicon intellectual property (IP) developed by our lab, as the CNN accelerator, and design an access-efficient Grid-Sample engine to execute the proposed network. The Grid-Sample engine avoids unnecessary data access and reconstructs the sampling results with near-lossless quality. Finally, we combine two MERIT processors and one Grid-Sample unit into a physical IP to realize the VFI system, which runs inference on Full-HD frames in real time.
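The core operation the Grid-Sample engine accelerates is backward warping: each output pixel is fetched from a flow-displaced, fractional location in the source frame via bilinear interpolation. The following is a minimal functional sketch of that sampling, not the thesis's hardware design; the border-clamping policy and function names are assumptions.

```python
# Illustrative sketch of grid sampling: bilinear interpolation at a
# fractional location, and backward warping of a frame by a flow field.

def bilinear_sample(img, x, y):
    """Sample img (H x W list of lists) at fractional location (x, y)
    with bilinear interpolation and border clamping."""
    h, w = len(img), len(img[0])
    x0, y0 = int(x), int(y)                       # top-left integer corner
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0                       # fractional offsets
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy

def warp(img, flow):
    """Backward-warp img with a per-pixel flow field flow[y][x] = (dx, dy),
    clamping sample coordinates to the image borders."""
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img,
                             min(max(x + flow[y][x][0], 0), w - 1),
                             min(max(y + flow[y][x][1], 0), h - 1))
             for x in range(w)] for y in range(h)]

img = [[0.0, 1.0],
       [2.0, 3.0]]
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]
print(warp(img, zero) == img)  # True: zero flow is the identity warp
```

Because each bilinear sample touches a 2 x 2 pixel neighborhood and neighboring flow vectors usually point to overlapping neighborhoods, a hardware engine can reuse fetched pixels across adjacent samples; avoiding those repeated fetches is the kind of redundant data access the abstract refers to.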

