With a growing number of applications employing video compression, compression standards such as H.264 have attracted increasing attention. H.264 currently offers the best compression efficiency for high-definition video, but this high compression ratio imposes a very heavy computational load, so processors must be enhanced to meet H.264's performance requirements. Since processor enhancement is now moving in the direction of multicore designs, applications must be designed to exploit the characteristics of multicore platforms in order to achieve the best performance. In addition, the core algorithms of H.264 contain many complex dependencies, which must be carefully analyzed to achieve good parallelization. The Cell Broadband Engine is a high-performance chip multicore processor designed for multimedia computing, consisting of one main Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). It also provides a rich set of libraries and interfaces to support application development; with the high performance of the Cell Broadband Engine, the heavy load that video compression imposes on H.264 can be reduced. In this thesis, we examine the feasibility of data parallelism and task parallelism, and then propose a parallelization method that combines the two approaches.
With the growing number of applications involving video compression and decompression, video CODECs such as H.264 play an important role in the modern market. H.264 achieves the highest compression efficiency available at present, targeting the requirements of High Definition (HD) video content, but at the cost of high computational complexity. As a result, processors must advance to deliver good performance for H.264. However, due to the well-known power consumption and heat dissipation issues, multicore platforms have become the main trend in computer architecture. To approach peak performance, the characteristics of the multicore platform must be taken into consideration. Also, when parallelizing the H.264 algorithm, the CODEC must be analyzed and evaluated to resolve its complex internal dependencies. One popular multicore platform is the IBM Cell Broadband Engine (Cell B.E.), a heterogeneous chip multicore processor composed of one Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). The Cell B.E. is specially designed to meet the high performance requirements of multimedia applications, with Single Instruction Multiple Data (SIMD) and Direct Memory Access (DMA) units inside. It also provides a rich set of libraries and APIs for application development. With the strength of the Cell B.E., we should be able to reduce the computational burden introduced by H.264. In this thesis, data parallelism and task parallelism are exploited to build a combined parallel decoder based on the JM open-source H.264 decoder. The PPE distributes two slices at a time to two pipelined decoding flows, each composed of 4 SPEs, and double buffering is employed so that the slices are processed independently. The theoretical speedup is 9 times compared to sequential execution on the PPE. In our experiment, the deblocking module is offloaded to an SPE with double buffering, and the measured speedup is 1.17 times.