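Intra prediction is the first stage of intra coding in the H.264/AVC video coding standard, and intra coding achieves compression by removing spatially redundant information. To support high-definition video coding applications, we propose a highly parallel intra predictor architecture that shortens the total prediction time. A system requirement analysis shows that the intra predictor must produce 32 luma 4x4 predicted pixels per clock cycle, or 16 luma 16x16 or chroma 8x8 predicted pixels per clock cycle. In addition, we propose a semi-reconfigurable hardware architecture, a processing element optimization method, and a prediction-mode-level scheduling scheme to reduce the hardware requirement. We implement the proposed architecture in the Verilog hardware description language. At an operating clock of 200MHz, the required gate count is 20.4K. Compared with a direct implementation, the proposed architecture reduces hardware area by 90.2%. Furthermore, the proposed predictor needs 4 clock cycles to complete the prediction of a luma 4x4 mode, at a cost of 3.5K gates, and 96 clock cycles to complete the prediction of the luma 16x16 and chroma 8x8 modes, at a cost of 11.7K gates. Finally, the proposed intra predictor has been successfully integrated into an H.264/AVC intra encoder, which achieves real-time encoding of 1080p full-HD video when running at 138MHz.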
Intra prediction is the first stage of H.264/AVC intra encoding, which compresses video by removing spatial redundancy. For high-resolution applications, we propose a highly parallel architecture for an Intra Prediction Generation Engine (IPGE) that shortens the prediction time. Our analysis shows that the required degree of Pixel-Level Parallelism (PLP) is 32 pixels/cycle for luma 4x4 and 16 pixels/cycle for luma 16x16 and chroma 8x8. In addition, we propose a semi-reconfigurable architecture, a Processing Element Optimization Method (PEOM), and a Mode-Level Scheduling Scheme (MLSS) to reduce hardware usage. The proposed design has been implemented in Verilog RTL and synthesized with a TSMC 0.13μm CMOS cell library. Its gate count is 20.4K when running at 200MHz. Compared with a direct implementation, the proposed architecture reduces the gate count by 90.2%. The engine for all luma 4x4 modes consumes 3.5K gates and takes 4 cycles to predict a 4x4 block; the engine for the luma 16x16 and chroma 8x8 modes consumes 11.7K gates and takes a total of 96 cycles to generate the predictions for one 16x16 macroblock. We have integrated the proposed design into an H.264/AVC intra encoder, which can process 1080p HD video at 30fps when running at 138MHz.
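The throughput figures above can be sanity-checked with a little arithmetic. The following Python sketch is only an illustration and is not part of the proposed design: the function names are hypothetical, the frame is assumed to be padded to whole 16x16 macroblocks (1920x1088 for 1080p), chroma is assumed to be 4:2:0, and the engine is assumed to generate every one of the four luma 16x16 modes and four chroma 8x8 modes defined by the H.264/AVC standard.

```python
import math

MB_SIZE = 16               # a macroblock covers 16x16 luma pixels
LUMA_16X16_MODES = 4       # vertical, horizontal, DC, plane (H.264/AVC)
CHROMA_8X8_MODES = 4       # DC, horizontal, vertical, plane (H.264/AVC)
CHROMA_PLANES = 2          # Cb and Cr, assuming 4:2:0 sampling


def macroblocks_per_frame(width, height):
    """Macroblocks per frame, padding each dimension up to a multiple of 16."""
    return math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)


def cycle_budget_per_macroblock(clock_hz, width, height, fps):
    """Average clock cycles available per macroblock for real-time encoding."""
    return clock_hz / (macroblocks_per_frame(width, height) * fps)


def pixels_generated(cycles, pixels_per_cycle):
    """Predicted pixels produced at a fixed degree of pixel-level parallelism."""
    return cycles * pixels_per_cycle


def pixels_for_all_16x16_and_chroma_modes():
    """Predicted pixels needed if every luma 16x16 and chroma 8x8 mode is generated."""
    luma = LUMA_16X16_MODES * 16 * 16
    chroma = CHROMA_8X8_MODES * CHROMA_PLANES * 8 * 8
    return luma + chroma


if __name__ == "__main__":
    budget = cycle_budget_per_macroblock(138e6, 1920, 1080, 30)
    print(f"macroblocks per frame: {macroblocks_per_frame(1920, 1080)}")
    print(f"cycle budget per macroblock at 138MHz: {budget:.1f}")
    print(f"pixels in 96 cycles at 16 pixels/cycle: {pixels_generated(96, 16)}")
    print(f"pixels for all 16x16 and chroma modes: {pixels_for_all_16x16_and_chroma_modes()}")
```

Under these assumptions, 96 cycles at 16 pixels/cycle yield exactly the 1536 predicted pixels needed for all luma 16x16 and chroma 8x8 modes of one macroblock, and the 96-cycle figure is well below the average budget of roughly 564 cycles per macroblock at 138MHz. The sketch does not model the luma 4x4 path or the other encoder stages that share this budget.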