
針對移動估計與H.264/AVC標準以及智慧型視訊信號處理之演算法和架構設計

Algorithm and Architecture Design for Motion Estimation, H.264/AVC Standard, and Intelligent Video Signal Processing

Advisor: 陳良基 (Liang-Gee Chen)

Abstract


Digital video technology has come to play an essential role in our daily life, mainly through applications in entertainment, communication, surveillance, and intelligent human-machine interfaces. In this dissertation, the algorithms and architectures of the core techniques for current and future video applications are discussed in three parts: block-matching motion estimation, the H.264/AVC encoding system, and intelligent video signal processing.

Motion estimation is the heart of a video coding system; it is the most important module and demands the most computation and memory access in the encoder. In Part I of this dissertation, we first survey extensively the motion estimation algorithms and architectures of the last two decades (1981-2004). All fast block-matching algorithms can be classified into six categories; we compare many of them in terms of video quality and computational complexity, providing practical guidelines for software applications, and we introduce many full-search and fast-search architectures, comparing representative designs in six aspects with hexagonal plots for clear evaluation. Second, we propose the global elimination algorithm for fast block matching. The main idea is to divide block matching into a coarse initial scan over all search positions followed by detailed matching of the promising candidates found in that scan; the global elimination algorithm preserves video quality comparable to full search while requiring only 10% of its computation. The corresponding architecture we developed comprises a systolic module that extracts the coarse features, a parallel sum-of-absolute-differences tree that performs the matching, and a parallel comparator tree that locates the promising candidates. In addition, we further propose a parallel global elimination algorithm and architecture for higher specifications; in combined area saving and speed, our design is ten times better than full-search architectures. Third, we propose a computation-aware block-matching algorithm that obtains better motion vectors under real-time constraints in environments where the available computation is limited and variable. Unlike previous computation-aware block-matching algorithms, which require random access to macroblocks, our one-pass flow not only greatly reduces the memory requirement but also effectively exploits the context information of neighboring macroblocks for higher speed and better quality; in addition, an adaptive search strategy is applied to further improve video quality. Compared with previous algorithms, our method reduces the processing time by 70% at the same quality.

H.264/AVC is the latest international video coding standard; compared with MPEG-4, H.263, and MPEG-2, it saves 39%, 49%, and 64% of the bitrate, respectively. In Part II of this dissertation, we first propose a context-based adaptive method to speed up multi-frame motion estimation, the most computationally demanding part of an H.264/AVC encoder. From statistical analysis of the information available after intra prediction and after block matching on the previous reference frame, context-based adaptive criteria are derived to decide whether it is worth searching more reference frames; while maintaining full-search quality, 76%-96% of the unnecessary reference frames can be skipped. Second, we propose a fast algorithm for H.264/AVC intra-frame coding and an H.264/AVC intra coder architecture. The software implementation adopts context-based removal of unlikely candidate modes, subsampling of the matching operations, and an interleaved full-search and fast-search strategy, saving 45% of the total computation while keeping the PSNR loss within 0.3 dB. For hardware acceleration, after extensive analysis we designed a system architecture with a parallelism of four; the prototype chip, fabricated in a 0.25 μm CMOS process with a core area of 1.855×1.885 mm², can process 16 million pixels per second. Third, we present the world's first single-chip H.264/AVC encoder, with a core size of 7.68×4.13 mm² in a 0.18 μm process. The new four-stage macroblock pipeline architecture encodes HDTV720p (1280×720) video at 30 frames per second in real time at an operating frequency of 108 MHz; compared with the conventional two-stage pipeline, it doubles the throughput and hardware utilization. The encoder also contains five engines designed for integer motion estimation, fractional motion estimation, intra prediction, entropy coding, and deblocking, and we contribute many novel ideas to overcome the tough design challenges (3.6 tera operations per second and 5.6 terabytes per second of memory access if run on a processor).

Intelligent video processing is the driving force of advanced video applications, and video object segmentation is the most important pre-processing unit for object-based MPEG-4, object tracking, face recognition, sprite generation, MPEG-7 multimedia description, and so on. In Part III of this dissertation, we first review an efficient video segmentation algorithm whose main idea is background registration, which easily solves the still-object problem and the uncovered-background problem encountered by conventional change detection; with an optimized implementation, a 450 MHz Pentium III CPU can process 25 QCIF (176×144) frames per second. We also consider removal of shadow effects, combination with predictive watersheds for more accurate object boundaries, and global motion compensation for slight camera vibration as enhancements to this baseline mode. Second, we propose a simple but effective algorithm that enables a camera with pan, tilt, and zoom capability to automatically track a moving object. The proposed tracking algorithm collects background information at grid points of camera positions and determines the next target grid point by comparing the frame captured at a grid point with the background frame, so that the moving object is kept at the center of the image; block-based processing reduces the computation, and skin-color detection makes it easier for the camera to track human faces. We tested many practical situations and successfully integrated our tracking algorithm into a commercial surveillance IP camera. Third, we propose descriptor-based face recognition with low computational cost. Descriptors that are invariant to translation, rotation, and zooming, rather than raster-scanned pixels, are used as the input vectors to the recognition kernel, making our method more reliable than conventional pixel-based algorithms; moreover, because the dimensionality of the input vectors and of the covariance matrix is reduced, the computation and memory requirements drop by millions of times, and the time needed to compute the projection directions falls from tens of hours to a few seconds.

In brief, our contributions to digital video technology lie in three directions. The proposed motion estimation techniques can be applied in all video coding standards; the proposed H.264/AVC encoding system is the leading design in the world and provides many novel concepts; and the proposed video segmentation, object tracking, and face recognition will be key to structured video and intelligent surveillance systems. We sincerely hope that our research results can help make human life more convenient.
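To make the coarse-then-fine idea of the global elimination algorithm in Part I concrete, the following sketch uses 4×4 sub-block sums as the coarse feature and keeps a fixed number of promising candidates for exact SAD matching. The feature size, the candidate count, and all function names are illustrative assumptions for this sketch, not the exact configuration described in the dissertation.

```python
import numpy as np

def subblock_sums(block, s=4):
    """Coarse feature: sums over s-by-s sub-blocks (block sides assumed divisible by s)."""
    h, w = block.shape
    return block.astype(np.int64).reshape(h // s, s, w // s, s).sum(axis=(1, 3))

def gea_search(cur_block, ref_frame, center, search_range=16, num_candidates=8):
    """Two-stage matching: coarse scan of all positions, then exact SAD on the survivors."""
    n = cur_block.shape[0]
    cy, cx = center
    cur_feat = subblock_sums(cur_block)
    coarse_costs = []
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = cy + dy, cx + dx
            if y < 0 or x < 0 or y + n > ref_frame.shape[0] or x + n > ref_frame.shape[1]:
                continue
            ref_block = ref_frame[y:y + n, x:x + n]
            # Coarse cost on sub-block sums; a real implementation would precompute these
            # sums over the reference frame so that this stage stays genuinely cheap.
            cost = int(np.abs(cur_feat - subblock_sums(ref_block)).sum())
            coarse_costs.append((cost, dy, dx))
    # Keep only the most promising candidates from the initial scan.
    coarse_costs.sort(key=lambda c: c[0])
    best_mv, best_sad = (0, 0), None
    for _, dy, dx in coarse_costs[:num_candidates]:
        ref_block = ref_frame[cy + dy:cy + dy + n, cx + dx:cx + dx + n]
        sad = int(np.abs(cur_block.astype(np.int64) - ref_block.astype(np.int64)).sum())
        if best_sad is None or sad < best_sad:
            best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

Because every search position is still visited in the coarse scan, the result stays close to full-search quality, while the expensive exact SAD is evaluated for only a handful of candidates.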

Parallel Abstract (English)


Digital video technology has played an essential role in our daily life for entertainment, communication, surveillance, and intelligent human-machine interfaces. In this dissertation, algorithms and architectures of core techniques for both current and future video applications are discussed in three parts: block-matching motion estimation, the H.264/AVC encoding system, and intelligent video signal processing.

Motion estimation (ME) is the heart of video coding systems. It is the most important module and demands the most computing power and memory access in a video encoder. In Part I of this dissertation, we first made a comprehensive survey of ME algorithms and architectures of the last two decades (1981-2004). All fast block-matching algorithms (BMAs) are classified into six categories, and many of them are compared in terms of video quality and computational complexity, which provides useful guidelines for software applications. Many architectures supporting full search or fast search are introduced, and representative designs are compared in six aspects using hexagonal plots for clear evaluation. Second, we proposed a global elimination algorithm (GEA) for fast block matching. The main concept of GEA is to divide block matching into an initial coarse scan over all search positions, followed by fine matching of the promising candidates identified in that scan. While preserving the same quality as full search, GEA requires less than 10% of the full-search computation. The corresponding GEA architecture, which comprises a systolic part to extract coarse features, a parallel sum-of-absolute-differences (SAD) tree to perform the matching operations, and a parallel comparator tree to find the promising candidates, is also developed. Moreover, we proposed a parallel global elimination algorithm (PGEA) and its corresponding architecture for higher specifications; our design is ten times more area-speed efficient than full-search architectures. Third, we proposed a computation-aware (CA) BMA to obtain better motion vectors under real-time constraints in a computation-limited and computation-variant environment. Unlike prior CA BMAs, in which random access of macroblocks is inevitable, our one-pass flow not only significantly reduces the memory size but also effectively utilizes the context information of neighboring macroblocks to achieve faster speed and better quality. Moreover, video quality can be further improved with an adaptive search strategy. In comparison with prior CA BMAs, our one-pass algorithm saves 70% of the processing time while obtaining the same quality.

H.264/AVC is the latest international video coding standard. It saves 39%, 49%, and 64% of the bitrate in comparison with MPEG-4, H.263, and MPEG-2, respectively. In Part II of this dissertation, we first proposed a context-based adaptive method to speed up multi-frame ME, which is the most computationally intensive part of an H.264/AVC encoder. Statistical analysis is applied to the information available after intra prediction and after the block matching process for the previous reference frame, and context-based adaptive criteria are then derived to determine whether it is worth searching more reference frames. Full-search quality can be maintained while 76%-96% of the unnecessary reference frames are omitted.
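To picture how such criteria can be applied, here is a minimal sketch of reference-frame skipping. The specific rule (stop after the nearest frame when its matching cost is already well below the intra-prediction cost) and all names are hypothetical stand-ins for the context-based criteria actually derived in the dissertation.

```python
def multi_frame_me(cur_mb, ref_frames, intra_cost, search_fn, margin=1.1):
    """Search reference frames in order, skipping older frames when they look unprofitable.

    `search_fn(cur_mb, ref)` returns (motion_vector, matching_cost) for one reference frame.
    `margin` is a hypothetical threshold, not the dissertation's actual criterion.
    """
    best = None  # (reference index, motion vector, cost)
    for idx, ref in enumerate(ref_frames):  # ref_frames[0] is the nearest frame
        mv, cost = search_fn(cur_mb, ref)
        if best is None or cost < best[2]:
            best = (idx, mv, cost)
        # Illustrative context-based early termination: if the nearest frame already
        # beats the intra-prediction cost by a clear margin, older frames rarely help.
        if idx == 0 and cost * margin < intra_cost:
            break
    return best
```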
Second, we proposed a fast algorithm for H.264/AVC intra-frame coding and an H.264/AVC intra coder architecture. Context-based decimation of unlikely candidate modes, subsampling of the matching operations, and an interleaved full-search/partial-search strategy are adopted in the software implementation, which reduces the total computation by 45% while keeping the PSNR degradation within 0.3 dB. As for the hardware accelerator, a four-parallel system architecture is designed after comprehensive analysis. A prototype chip with a core size of 1.855×1.885 mm², which can process 16 mega-pixels per second at 54 MHz, is fabricated in 0.25 μm CMOS technology. Third, we proposed the first single-chip H.264/AVC encoder in the world. The core size is 7.68×4.13 mm² in 0.18 μm CMOS technology. A new four-stage macroblock pipelining architecture encodes HDTV720p (1280×720) 30 frames/s video in real time at 108 MHz; the new pipelining doubles the throughput and hardware utilization of the conventional two-stage macroblock pipelining. The encoder contains five engines for integer motion estimation (IME), fractional motion estimation (FME), intra prediction (IP), entropy coding (EC), and deblocking (DB). We contributed many novel ideas to overcome the tough design challenges (3.6 TOPS of computing power and 5.6 TB/s of memory access on a processor).

Intelligent video signal processing is the driving force of advanced video applications, and video object segmentation is the most important pre-processing unit for object-based MPEG-4, object tracking, face recognition, sprite generation, MPEG-7 multimedia description, and so on. In Part III of this dissertation, we first reviewed an efficient video object segmentation algorithm. Its main idea is background registration, which easily solves the still-object problem and the uncovered-background problem encountered by conventional change detection. With an optimized implementation, a 450 MHz Pentium III CPU can process 25 QCIF (176×144) frames per second. Moreover, the elimination of shadow effects, combination with predictive watersheds for more accurate object boundaries, and global motion compensation for slight camera motion are also considered as enhancements of the baseline mode. Second, we proposed a simple but effective algorithm for a pan-tilt camera to automatically track one moving object. The proposed tracking algorithm collects background information at the grid points of camera positions and then compares the frame captured at a grid point with its background to determine the next grid point, so that the moving object is kept in the middle of the image. Block-based processing and skin-color detection are used to reduce computation and to favor human faces, respectively. Many practical situations were tested, and our tracking algorithm has been successfully integrated into a commercial surveillance IP camera. Third, we proposed a low-complexity descriptor-based face recognition method. Descriptors with translation-, rotation-, and scaling-invariant properties, rather than raster-scanned image pixels, are used as the input vectors to the feature extraction kernel, making our method much more reliable than conventional pixel-based algorithms. Furthermore, the computational complexity and the memory requirement are reduced by millions of times owing to the dimension reduction of the input vectors and the covariance matrix, and the processing time for calculating the projection directions drops from tens of hours to a few seconds.
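As an illustration of why descriptor-based input shrinks the problem, the sketch below computes projection directions with a PCA-style kernel. PCA is used here only as a stand-in for the dissertation's feature extraction kernel, and the dimensions mentioned in the comments are assumed values: with descriptors of a few dozen elements the covariance matrix stays tiny, whereas raster-scanned face pixels would make it enormous.

```python
import numpy as np

def pca_projection_directions(features, num_directions=8):
    """Compute projection directions from an (N, d) array of training feature vectors.

    With short descriptors (d on the order of tens), the d x d covariance matrix is tiny;
    raster-scanned 100x100 face images would instead give d = 10000 and a 10000 x 10000 matrix.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / (len(features) - 1)  # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_directions]
    return mean, eigvecs[:, order]                     # mean and the top projection directions

def project(descriptor, mean, directions):
    """Map one descriptor into the low-dimensional recognition space."""
    return (descriptor - mean) @ directions
```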
In brief, our contributions to digital video technology lie in three directions. The proposed motion estimation techniques can be applied in all video coding standards. The proposed H.264/AVC encoding system is the leading design in the world and introduces many new concepts. The proposed video segmentation, object tracking, and face recognition will play key roles in structured video and intelligent surveillance systems. We sincerely hope that our research results can help make human life more convenient.
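Finally, the background-registration step at the core of the proposed video segmentation in Part III can be sketched as follows: pixels that stay still for several consecutive frames are registered into a background buffer, and the object mask is taken against that buffer rather than against the previous frame. The thresholds, the frame count, and the class name are illustrative assumptions, not the dissertation's parameters.

```python
import numpy as np

class BackgroundRegistration:
    """Toy background-registration segmenter for 8-bit grayscale frames."""

    def __init__(self, shape, diff_thr=15, still_frames=30):
        self.prev = None
        self.background = np.zeros(shape, dtype=np.uint8)
        self.bg_valid = np.zeros(shape, dtype=bool)      # where a background value is registered
        self.still_count = np.zeros(shape, dtype=np.int32)
        self.diff_thr = diff_thr                         # assumed difference threshold
        self.still_frames = still_frames                 # assumed registration delay

    def segment(self, frame):
        if self.prev is None:
            self.prev = frame.copy()
            return np.zeros(frame.shape, dtype=bool)
        # Frame difference: track how long each pixel has stayed still.
        still = np.abs(frame.astype(np.int16) - self.prev.astype(np.int16)) < self.diff_thr
        self.still_count = np.where(still, self.still_count + 1, 0)
        # Register pixels that have been still long enough into the background buffer.
        register = self.still_count >= self.still_frames
        self.background[register] = frame[register]
        self.bg_valid |= register
        # Object mask: compare against the registered background where it exists,
        # falling back to plain frame differencing elsewhere.
        bg_diff = np.abs(frame.astype(np.int16) - self.background.astype(np.int16)) > self.diff_thr
        mask = np.where(self.bg_valid, bg_diff, ~still)
        self.prev = frame.copy()
        return mask
```

Because the mask is taken against the registered background, an object that stops moving stays in the foreground until it has been still long enough to be registered, and background newly uncovered by a departing object matches the registered values instead of being flagged as change; this is how the still-object and uncovered-background problems of plain change detection are avoided in this toy version.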
