

Analysis and VLSI Architecture of Temporal Prediction Methods in Video Coding Standards

Advisor: Liang-Gee Chen

Abstract


Temporal prediction is the most important module in video coding standards. It has advanced substantially from the earliest standard, H.261, to the Scalable Video Coding standard currently under development, and the progress is comprehensive: improvements to existing prediction tools, new prediction tools, and new generations of prediction structures. In this dissertation, we classify the temporal prediction methods of video coding standards into three categories for discussion and analysis: local motion estimation, global motion estimation, and motion-compensated temporal filtering (MCTF). The discussion of each category covers the two core problems of temporal prediction hardware design: the architecture of the processing elements and the data reuse strategy.

In local motion estimation, variable block-size motion estimation (VBSME) is an important prediction tool, so we first discuss its impact on hardware architectures. We support VBSME by deriving the distortion of large blocks from the distortions of small blocks, which minimizes the impact on the hardware architecture. Under this implementation, by examining different data flows, we find that the overhead of VBSME depends on how the partial distortions are accumulated and stored. For example, if the partial distortions are stored in the processing elements and accumulated once per cycle, the number of registers grows sharply because more small-block distortions must be stored; if the partial distortions are propagated and accumulated stage by stage through propagation registers, then besides additional propagation registers, delay registers are needed to synchronize the generated small-block distortions; if no partial distortions are produced or stored, the required hardware is the smallest of the three.

The most commonly used data reuse strategy in local motion estimation is block-level data reuse. Here, we break with previous scan orders and develop a new block-level data reuse strategy, Level C+. Level C+ retains the horizontal data reuse capability of Level C while adding partial vertical data reuse, and by adjusting its parameters it offers a range of tradeoffs between memory bandwidth and on-chip memory size, filling the gap between the traditional Level C and Level D schemes. With our proposed scan order, for HDTV 720p with a search range of [-128, 128), Level C+ requires only half the bandwidth of Level C with only a 12% increase in on-chip memory. The last part of local motion estimation is a computation-aware motion estimation algorithm designed for the varying system resources of portable devices. The proposed algorithm has two key properties: one-pass scanning and an adaptive search strategy. The former avoids several drawbacks of earlier algorithms, such as random access of all blocks in a frame, extensive sorting, and heavy external memory usage; the latter makes the algorithm suitable for video sequences with a variety of motion characteristics.

The second part of the dissertation analyzes global motion estimation and designs its architecture. At the algorithm level, we analyze three typical global motion estimation algorithms and find that the differential (gradient-based) approach is the most suitable for hardware implementation in both complexity and quality. We then accelerate the original gradient descent algorithm by exploiting motion continuity and detecting unnecessary computations, saving 91% of the memory access bandwidth and 80% of the computation. In the architecture design, the proposed algorithm resolves the huge computation and memory bandwidth requirements; a scheduling scheme based on the reference frame resolves the irregular memory access of global motion estimation and raises hardware utilization; finally, an interleaved data arrangement lets us read four neighboring reference pixels in one cycle, satisfying the memory reads required by interpolation. Implemented in a UMC 0.18 um process, the proposed architecture occupies 1.45 mm x 1.45 mm and processes 30 CIF frames per second at 30 MHz. Compared with previous architectures, it requires only 25% of the on-chip memory and 10% of the external memory bandwidth, which favors integration into a complete video coding system.
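The gradient-descent loop at the core of the differential GME approach can be illustrated with a small software model. This is only a sketch under strong assumptions: a purely translational two-parameter model, a fixed step size, and none of the speed-ups described above (motion continuity, skipping unnecessary computations); the function names are ours.

```python
import numpy as np

def bilinear(img, x, y):
    """Sample img at real-valued coordinates (x, y) with bilinear interpolation."""
    x0 = np.clip(np.floor(x).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, img.shape[0] - 2)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * img[y0, x0] + fx * (1 - fy) * img[y0, x0 + 1]
            + (1 - fx) * fy * img[y0 + 1, x0] + fx * fy * img[y0 + 1, x0 + 1])

def estimate_translation(cur, ref, iters=100, lr=1.0):
    """Gradient descent on a 2-parameter (translation-only) global motion
    model, minimizing the mean squared error between cur and the warped ref.
    The step size lr depends on the pixel-value scale of the input."""
    h, w = cur.shape
    ys, xs = np.mgrid[4:h - 4, 4:w - 4].astype(float)  # margin avoids border effects
    dx = dy = 0.0
    for _ in range(iters):
        err = bilinear(ref, xs + dx, ys + dy) - cur[4:h - 4, 4:w - 4]
        # spatial gradients of the warped reference (chain rule for the MSE)
        gx = bilinear(ref, xs + dx + 0.5, ys + dy) - bilinear(ref, xs + dx - 0.5, ys + dy)
        gy = bilinear(ref, xs + dx, ys + dy + 0.5) - bilinear(ref, xs + dx, ys + dy - 0.5)
        dx -= lr * np.mean(err * gx)
        dy -= lr * np.mean(err * gy)
    return dx, dy
```

A full GME kernel extends the parameter vector to scaling and rotation (e.g. an affine model); those extra parameters are exactly what make the reference-frame access pattern irregular and motivate the reference-based scheduling above.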
The last part of the dissertation addresses two topics extending from MCTF: frame-level data reuse and the architecture design of MCTF. Because MCTF is an open-loop prediction structure, frame-level data reuse strategies become possible. We not only develop frame-level data reuse strategies suited to MCTF and organize the corresponding analysis methodology, but also extend frame-level data reuse to the traditional closed-loop prediction structure. For either prediction structure, frame-level data reuse strategies fall into two classes: the first minimizes reads of the current frame; the second minimizes reads of the reference frames. The former has weaker reuse capability but needs less external memory; the latter trades external memory capacity for lower memory bandwidth, and its bandwidth savings grow as the search range increases. Other possible frame-level data reuse strategies can be analyzed in the same way with our proposed methodology.

For the MCTF architecture, we combine MCTF and the traditional closed-loop prediction structure on a single chip to provide computation awareness. The main design difficulties are as follows. First, for the prediction stage, frame-level data reuse must be realized efficiently in hardware; we propose a new macroblock-pipelining schedule that resolves the increase in data registers caused by frame-level data reuse. Second, for the update stage, an efficient schedule or data reuse strategy is needed for its large and irregular memory bandwidth; by reusing hardware resources and adopting a new schedule, we save nearly half of the memory bandwidth and convert irregular memory reads into regular ones, greatly reducing the computational resources required by the update stage. Third, the hardware utilization of the various temporal prediction structures must be balanced while the memory bandwidth of each is minimized; the solutions above first reduce the memory bandwidth of every prediction structure and minimize the hardware of the update stage, and the new schedule then equalizes the hardware resources required by each prediction structure, balancing their utilization. Finally, the proposed computation-aware temporal prediction module is implemented in a TSMC 0.18 um process. It occupies 3.82 mm x 3.57 mm, runs at up to 60 MHz, and supports six temporal prediction structures. By executing different structures, the chip provides computation awareness: it selects an executable prediction structure according to the available system resources, so the video coding system keeps running under any system condition.
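The prediction (high-pass) and update (low-pass) stages of MCTF follow the 5/3 lifting structure. Below is a minimal software sketch assuming zero motion; real MCTF applies the lifting taps along motion-compensated trajectories, which is what creates the irregular update-stage memory access discussed above.

```python
import numpy as np

def mctf_53(frames):
    """One temporal decomposition level of 5/3 MCTF (zero motion).
    Prediction: H[k] = odd[k] - (even[k] + even[k+1]) / 2
    Update:     L[k] = even[k] + (H[k-1] + H[k]) / 4
    Frame indices are clamped at the sequence boundaries."""
    even, odd = frames[0::2], frames[1::2]
    n = len(odd)
    H = [odd[k] - 0.5 * (even[k] + even[min(k + 1, len(even) - 1)]) for k in range(n)]
    L = [even[k] + 0.25 * (H[max(k - 1, 0)] + H[min(k, n - 1)]) for k in range(len(even))]
    return L, H

def imctf_53(L, H):
    """Inverse lifting: open-loop MCTF reconstructs the input exactly."""
    n = len(H)
    even = [L[k] - 0.25 * (H[max(k - 1, 0)] + H[min(k, n - 1)]) for k in range(len(L))]
    odd = [H[k] + 0.5 * (even[k] + even[min(k + 1, len(even) - 1)]) for k in range(n)]
    frames = [None] * (len(even) + len(odd))
    frames[0::2], frames[1::2] = even, odd
    return frames
```

Because the update step only adds a function of already-computed H frames, the lifting transform is invertible regardless of the motion model; this is what makes the structure open-loop and enables the frame-level data reuse discussed above.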

English Abstract


Temporal prediction is the most critical component of video coding systems: it not only significantly improves coding performance but also dominates the computational resources of a video coding system. Because of its huge computational complexity and large memory bandwidth, hardware acceleration is a must for temporal prediction, and it is the core of this dissertation. There are four major hardware design challenges in temporal prediction. The first is the architecture design of the processing elements (PEs), due to the huge computational complexity. The second is the data reuse strategy, because of the large memory bandwidth. On-chip memory arrangement is a third challenge, needed to satisfy the memory requirements of the PEs and the data reuse strategy. Scheduling is the last: irregular memory access reduces the utilization of memory bandwidth, so a schedule that guarantees regular memory access is important. These four design challenges relate to three system issues: hardware area, system memory bandwidth, and system memory size. Different systems impose different constraints and weightings on these issues, so different design strategies are required. In the following, we classify temporal prediction into three categories for discussion: local motion estimation (LME), global motion estimation (GME), and motion-compensated temporal filtering (MCTF). We not only overcome the four design challenges but also provide different design strategies for different systems.

In the first part of LME, we target the architecture design of VBSME. Among the many methods to support VBSME, the most efficient is to use the SADs of the smallest blocks to derive those of larger blocks. With this method, the overhead of VBSME in different architectures depends on the data flow of the partial sums of absolute differences (SADs).
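As a software model of this SAD-reuse technique (block labels follow the H.264 width x height convention; where the partial sums live, in registers or propagation stages, is a hardware concern this sketch ignores):

```python
import numpy as np

def vbsme_sads(cur_mb, ref_mb):
    """Compute the 4x4 SADs of a 16x16 macroblock once, then derive the SADs
    of all larger H.264 partitions by summing them (no pixel is re-read)."""
    d = np.abs(cur_mb.astype(int) - ref_mb.astype(int))
    # 16 smallest-block SADs, indexed [block_row, block_col]
    s4x4 = d.reshape(4, 4, 4, 4).sum(axis=(1, 3))
    s8x4 = s4x4[:, 0::2] + s4x4[:, 1::2]        # merge horizontal neighbours
    s4x8 = s4x4[0::2, :] + s4x4[1::2, :]        # merge vertical neighbours
    s8x8 = s8x4[0::2, :] + s8x4[1::2, :]
    return {
        "4x4": s4x4, "8x4": s8x4, "4x8": s4x8, "8x8": s8x8,
        "16x8": s8x8.sum(axis=1), "8x16": s8x8.sum(axis=0),
        "16x16": s8x8.sum(),
    }
```

In hardware, the question is where these partial sums are held while they are merged; the three data flows classified next differ exactly in that choice.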
We classify the data flows of partial SADs into three types: stored in registers of the PEs, propagated through propagation registers, and no partial SADs. Among the three, the first requires the largest VBSME overhead and the last the smallest. In the second part of LME, we discuss data reuse strategies and propose a macroblock-level data reuse scheme, Level C+, in which the overlapped search regions in the horizontal and vertical directions are fully and partially reused, respectively. Compared with Level C, Level C+ with its corresponding scan order saves 46% of the memory bandwidth with only a 12% increase in on-chip memory size, for HDTV 720p with a search range of [-128, 128).

In the GME part, we use the architecture of GME to discuss memory arrangement and scheduling. The major design challenges of GME are the irregular memory access caused by scaling and rotation, and the memory access requirements of the interpolated and differential values. We propose reference-based scheduling to eliminate the irregular memory access and adopt an interleaved memory arrangement to satisfy the access requirements. Finally, a hardware accelerator of GME is implemented in 131 K gates with 7.9 Kbits of memory; it processes MPEG-4 ASP@L3 in real time at 30 MHz. Compared with the previous work, the proposed architecture requires much less on-chip memory and memory bandwidth.

In the MCTF part, frame-level data reuse and the hardware architecture of MCTF are the two issues: the former focuses on data reuse strategies, and the latter involves all four design challenges. Frame-level data reuse means spending system memory size to further reduce the required system memory bandwidth. We develop a methodology for frame-level data reuse analysis and estimate the tradeoffs between on-chip memory size and system memory usage.
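The two classes of frame-level data reuse can be compared with a first-order traffic model. Everything below is an assumption for illustration, not the dissertation's figures: Level C block-level reuse inside a frame, a search range of [-sr, sr), 16x16 macroblocks, and partial-result traffic ignored.

```python
def level_c_redundancy(sr, n=16):
    """Level C loads a vertical stripe of the search window per macroblock,
    so each reference pixel is fetched about (2*sr + n) / n times per frame."""
    return (2 * sr + n) / n

def traffic_min_current_reads(w, h, sr, refs, n=16):
    """Class 1: the current frame is read once; every reference frame is
    re-fetched with the block-level redundancy factor."""
    return w * h * (1 + refs * level_c_redundancy(sr, n))

def traffic_min_reference_reads(w, h, sr, refs, n=16):
    """Class 2: each reference frame is streamed once (its overlap is kept in
    extra external/on-chip buffers), while the current frame is re-read once
    per reference; to first order this is independent of the search range."""
    return w * h * (refs + refs)
```

For CIF (352x288) with one reference frame, class 2 costs a fixed two frame-reads in this model, while class 1 grows from four frame-reads at sr = 16 to eighteen at sr = 128, matching the observation that the bandwidth saving of the second class grows with the search range.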
In the second part, we present the first hardware accelerator of MCTF, which is also a computation-aware engine. By adopting frame-level data reuse schemes, 20%-42% of the memory bandwidth of the prediction stages is saved across the different coding schemes. A new MB-pipelining scheme is developed to remove the data-buffer overhead of frame-level data reuse. For the update stage, the proposed techniques save 50% of the memory bandwidth and 75% of the hardware cost. A reconfigurable on-chip memory carries out the different data reuse strategies. The proposed accelerator processes CIF format with a search range of [-32, 32) in real time at 60 MHz. In total, six coding schemes are supported, so the engine adapts itself to dynamic system resource constraints by executing a suitable coding scheme. The implementation measures 3.82 mm x 3.57 mm in TSMC 0.18 um technology.

In brief, this dissertation studies the analysis and VLSI architecture of the temporal prediction methods in video coding standards. By classifying data flows and taking various data-processing viewpoints, new schedules, architectures, and data reuse schemes are developed for different systems. For architecture design, we analyze the impact of VBSME on hardware architectures and propose a new MB pipeline to eliminate the overhead of frame-level data reuse. For data reuse strategies, we discuss schemes from the MB level to the frame level and provide various tradeoffs between on-chip memory size and system memory usage. For on-chip memory arrangement, we adopt interleaved and reconfigurable memories to satisfy the memory requirements of the PEs and the data reuse strategies. For scheduling, reference-based scheduling is developed to solve the irregular memory access problem.

