Design and Implementation of a Real-Time H.265 Video Encoder on Multi-Core DSP and GPU

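High Efficiency Video Coding (HEVC), also known as H.265, is the latest generation of video coding standards. To improve coding performance, H.265 allows multiple reference frames (MRF) in motion estimation (ME) for more accurate prediction. As a result, the MRF-ME module requires an enormous amount of computation, which prevents real-time video applications. To accelerate this computation, many researchers have recently used graphics processing units (GPUs) developed by NVIDIA to process it in parallel and shorten the MRF-ME run time. If the MRF-ME module of the H.265 test platform (HM) is run on the GPU directly, however, the GPU is called frequently and data are transferred to GPU memory frequently, so the H.265 encoding speed cannot be improved effectively. Khemiri et al. recently proposed GPU-based parallel algorithms for the sum of absolute differences (SAD) and the sum of squared differences (SSD) to accelerate the SAD and SSD calculations in ME [14], but these still do not solve the problem of frequent GPU calls. Lin et al. directly applied the architecture of the fast FMF-ME (fast merge full search ME) algorithm provided by the H.264 test platform (JM) [6] to H.265 video coding and used the GPU to accelerate SAD calculation and merging. They also found that the GPU had to access global memory frequently, and therefore proposed a frame-based approach that uses GPU shared memory for SAD calculation and merging to speed up the MRF-ME module.

However, Lin et al. did not consider three problems: (1) for 4K ultra-high-definition video, the GPU global memory requirement becomes so large that cost rises substantially; (2) frame-based SAD calculation makes the H.265 coding architecture inflexible, so it cannot be applied directly to CTU-based fast coding methods; (3) when the search window is enlarged, GPU shared memory becomes insufficient, which greatly reduces practicality. To solve these problems, this thesis proposes a CTU-based fast MRF-ME module architecture on the GPU. It addresses the excessive global memory requirement and the lack of flexibility, and it can be applied directly to CTU-based fast coding algorithms to further reduce the overall H.265 encoding time. We first organize the GPU work into three kernel functions for parallel processing: Kernel 1 computes the SAD of the smallest blocks (8×8) of the CTU, Kernel 2 merges blocks of various sizes (8×8 to 64×64), and Kernel 3 finds the best matching block for each block. This is combined with our previously proposed priority-based reference frame selection algorithm (PRFSA) to further reduce the overall H.265 encoding time.

To allow direct use in consumer electronics, this thesis implements the proposed fast H.265 encoder on the dual-core ADSP-BF609 development board. We first optimize the DSP memory allocation by moving the modules with higher computational complexity from L3 to L1 and L2. To raise encoder execution efficiency, we use four ADSP-BF609 devices to emulate 8-core operation, completing an embedded multi-core DSP and GPU real-time H.265 encoder. Experimental results show that, with an NVIDIA GeForce GTX 1060 GPU, Lin's method cannot run when the search window range exceeds 32 because shared memory is insufficient, whereas the proposed fast MRF-ME module runs successfully. For MRF=4 and MRF=8, the proposed method combined with PRFSA achieves a time improvement ratio (TIR) of more than 89% and 91%, respectively, compared with HM16.7 [3]. The proposed multi-core DSP and GPU fast encoder therefore not only accelerates the H.265 encoding process but also obtains image quality that differs little from that of HM16.7.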
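The abstract reports TIR values without spelling out the formula. A common definition in fast-encoding studies, assumed here rather than taken from the thesis, is the relative reduction in encoding time against the HM16.7 anchor, where $T_{\mathrm{HM}}$ and $T_{\mathrm{prop}}$ (symbols introduced only for this illustration) are the encoding times of HM16.7 and of the proposed encoder:

```latex
\mathrm{TIR} = \frac{T_{\mathrm{HM}} - T_{\mathrm{prop}}}{T_{\mathrm{HM}}} \times 100\%
```

Under this assumed definition, a TIR of 89% means the proposed encoder needs about 11% of the HM16.7 encoding time, that is, roughly a 9-fold speed-up.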

Keywords

none

Parallel Abstract

H.265/HEVC achieves higher coding performance than previous video coding standards such as H.264/AVC and MPEG-4. To obtain this improvement, H.265 adopts multiple reference frame motion estimation (MRF-ME), whose computational cost grows in proportion to the number of reference frames. To reduce this cost, several recent studies have used graphics processing units (GPUs) developed by NVIDIA to accelerate the MRF-ME module in parallel. Khemiri et al. [14] proposed a fast GPU-based parallel ME method that accelerates the sum of absolute differences (SAD) and sum of squared differences (SSD) calculations, having observed that SAD, used in the rate-distortion cost (RDcost) of ME, and SSD, used in the RDcost of mode decision (RDmode), share a common calculation term. They therefore proposed parallel difference (PD) and parallel reduction (PR) algorithms to accelerate the SAD and SSD calculations on a combined CPU and GPU architecture. However, Khemiri et al. did not consider that, within the HEVC coding structure, the CPU calls GPU kernel functions and transfers data to the GPU very frequently, which hinders further acceleration of the MRF-ME module. Lin et al. [5], on the other hand, directly applied the default fast ME algorithm of the H.264 test platform (JM) [6] to H.265 video encoding and used the GPU to compute and merge the SAD values for the different prediction unit (PU) modes. However, their method runs out of GPU shared memory when the search window range is larger than 32, and its frame-based structure makes it inflexible for an H.265 encoder. To solve these problems, we propose a fast CTU-level MRF-ME algorithm, embedded on four dual-core ADSP-BF609 processors that emulate a multi-core DSP, and complete a real-time H.265 encoder based on the NVIDIA GeForce GTX 1060.
We re-allocate the functions of the time-consuming modules from L3 DDR-RAM to L1 and L2 SRAM to shorten the encoding time. The proposed method can be combined with fast CTU-level MRF-ME [8] to further reduce the encoding time. First, we decompose the ME algorithm into three kernels to achieve highly parallel computation with a low external memory requirement on the GPU. Kernel 1 computes the SAD of each small coding unit (SCU, 8×8). Kernel 2 merges the SAD values across the variable block sizes, from SCU (8×8) to large coding unit (LCU, 64×64). Finally, kernel 3 compares the SAD values to find the minimum and thus the best matching block. Simulation results show that the proposed method achieves an average time improvement ratio (TIR) for the H.265 encoder of about 89.44% and 91.82% compared with HM16.7 under MRF=4 and MRF=8, respectively.
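The three-kernel decomposition described above can be sketched on the CPU as follows. This is a minimal, sequential Python reference under stated assumptions, not the GPU implementation of the thesis: the function names (`sad_8x8_grid`, `merge_level`, `search_ctu`) are hypothetical, frames are plain lists of luma rows, only the square block sizes 8×8, 16×16, 32×32 and 64×64 are merged, a single reference frame is searched exhaustively, and the cost is SAD alone with no motion-vector rate term. On a GPU, each stage would presumably map to one kernel with these loops distributed across threads.

```python
CTU = 64  # CTU (LCU) width and height in pixels
SCU = 8   # smallest block width and height in pixels


def sad_8x8_grid(cur, ref, cx, cy, mvx, mvy):
    """Stage 1 (kernel 1 analogue): SAD of every 8x8 block of the CTU whose
    top-left corner is (cx, cy), for one candidate motion vector."""
    n = CTU // SCU
    grid = [[0] * n for _ in range(n)]
    for by in range(n):
        for bx in range(n):
            total = 0
            for y in range(SCU):
                for x in range(SCU):
                    py = cy + by * SCU + y
                    px = cx + bx * SCU + x
                    total += abs(cur[py][px] - ref[py + mvy][px + mvx])
            grid[by][bx] = total
    return grid


def merge_level(grid):
    """Stage 2 (kernel 2 analogue): the SAD of a parent block is the sum of
    the SADs of its four children, so no pixel is read again."""
    n = len(grid) // 2
    return [[grid[2 * y][2 * x] + grid[2 * y][2 * x + 1]
             + grid[2 * y + 1][2 * x] + grid[2 * y + 1][2 * x + 1]
             for x in range(n)] for y in range(n)]


def search_ctu(cur, ref, cx, cy, search_range):
    """Stage 3 (kernel 3 analogue): keep, for every block of every size, the
    candidate with the minimum SAD.
    Returns {block_size: {(bx, by): (sad, (mvx, mvy))}}."""
    best = {}
    for mvy in range(-search_range, search_range + 1):
        for mvx in range(-search_range, search_range + 1):
            grid = sad_8x8_grid(cur, ref, cx, cy, mvx, mvy)
            size = SCU
            while True:
                level = best.setdefault(size, {})
                for by, row in enumerate(grid):
                    for bx, sad in enumerate(row):
                        if (bx, by) not in level or sad < level[(bx, by)][0]:
                            level[(bx, by)] = (sad, (mvx, mvy))
                if size == CTU:
                    break
                grid = merge_level(grid)  # 8x8 -> 16x16 -> 32x32 -> 64x64
                size *= 2
    return best
```

Calling `search_ctu(cur, ref, cx, cy, search_range)` returns, for each block size, the minimum SAD and its motion vector per block position. The reference frame must extend at least `search_range` pixels beyond the CTU on every side so that each candidate stays inside the frame. Because the merge stage reuses the 8×8 sums, the pixel differences are computed once per candidate regardless of how many block sizes are evaluated.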

Parallel Keywords

none

References

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, July 2003.

[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.

[3] HEVC Test Model documentation. https://hevc.hhi.fraunhofer.de/HM-doc/

[4] J. Kim, D. S. Jun, S. Jeong, et al., “An SAD-based selective bi-prediction method for fast motion estimation in high efficiency video coding,” ETRI Journal, vol. 34, no. 5, pp. 753-758, Oct. 2012.

[5] Y. C. Lin and S. C. Wu, “Parallel motion estimation and GPU-based fast coding unit splitting mechanism for HEVC,” IEEE High Performance Extreme Computing Conference, pp. 1-7, Dec. 2016.
