

Optimizing Immersive Video Coding Configurations Using Deep Learning Approaches

Advisor: 徐正炘

Abstract


Immersive video streaming technologies improve the Virtual Reality (VR) user experience by providing users with more intuitive ways to move in simulated worlds, e.g., through the 6 Degrees-of-Freedom (6DoF) interaction mode. A naive method to achieve 6DoF is to deploy cameras at the numerous positions and orientations that users' movements may require, which unfortunately is expensive, tedious, and inefficient. A better solution for realizing 6DoF interactions is to synthesize target views on the fly from a limited number of source views. While such view synthesis is enabled by the recent Test Model for Immersive Video (TMIV) codec, TMIV relies on manually composed configurations, which cannot exploit the tradeoff among video quality, decoding time, and bandwidth consumption. In this thesis, we study the limitations of TMIV and solve its configuration optimization problem by searching for the optimal configuration in a huge configuration space. We first identify the critical parameters of the TMIV configurations. Then, we introduce two Neural Network (NN)-based algorithms that take two heterogeneous approaches: (i) a Convolutional Neural Network (CNN) algorithm that solves a regression problem and (ii) a Deep Reinforcement Learning (DRL) algorithm that solves a decision-making problem. We conduct both objective and subjective experiments to evaluate the CNN and DRL algorithms on two diverse datasets: a perspective and an equirectangular projection dataset. The objective evaluations show that both algorithms significantly outperform the default configurations. In particular, with the perspective (equirectangular) projection dataset, the proposed algorithms require only 23% (95%) of the decoding time, stream 23% (79%) of the views, and improve the utility by 73% (6%) on average. The subjective evaluations confirm that the proposed algorithms consume fewer resources while achieving Quality of Experience (QoE) comparable to that of the default and optimal TMIV configurations.
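
To make the regression-based idea above concrete, the following is a minimal, hypothetical Python sketch: a small neural network predicts quality, decoding time, and bandwidth for each candidate TMIV configuration, and the configuration with the highest weighted utility is selected. The parameter count NUM_PARAMS, the network shape, the utility weights, and the random candidate set are illustrative assumptions, not the thesis's actual design (the thesis uses a CNN for the regression).

import torch
import torch.nn as nn

NUM_PARAMS = 6  # assumed number of critical TMIV configuration parameters

# Stand-in predictor: configuration vector -> (quality, decoding time, bandwidth).
# A tiny fully connected network is used here purely to keep the sketch short.
predictor = nn.Sequential(
    nn.Linear(NUM_PARAMS, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
)

def utility(pred, w_quality=1.0, w_time=0.5, w_bandwidth=0.5):
    # Weighted utility: reward predicted quality, penalize decoding time and bandwidth.
    quality, decode_time, bandwidth = pred[:, 0], pred[:, 1], pred[:, 2]
    return w_quality * quality - w_time * decode_time - w_bandwidth * bandwidth

# Enumerate (here: randomly sample) candidate configurations from the search
# space and pick the one with the highest predicted utility.
candidates = torch.rand(100, NUM_PARAMS)  # 100 hypothetical configurations
with torch.no_grad():
    scores = utility(predictor(candidates))
best_config = candidates[torch.argmax(scores)]
print("selected configuration:", best_config.tolist())

The DRL algorithm mentioned above would instead frame the same selection as a sequential decision-making problem, with a utility-like quantity acting as the reward; that variant is omitted here for brevity.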

