Pyramid Architecture Design and Implementation of Image-Processing Algorithms for Multimedia Applications

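Image-sequence processing plays an important role in video communication, information appliances, and computer vision. Its main purpose is to combine several algorithms of different natures into what is effectively a super filter that extracts meaningful information from images. Because the volume of video data is enormous, many VLSI architectures have been proposed for real-time image processing. Parallel processing is one of the principal design techniques for achieving real-time performance: the algorithm is partitioned into sub-processes, each implemented by an independent module, and their outputs are then merged into the final result. In this process, delivering data to every module in time while preserving the dependencies among modules often becomes the system bottleneck. In this dissertation, we discuss how to implement image-processing algorithms in units of tiles and analyze the resulting benefits and the new problems this introduces. We then propose a pyramid architecture that provides a high degree of parallelism for tile-based image-processing algorithms, and we apply it to two different systems.
For CMOS image sensors, the image-processing pipeline is the key to producing high-quality images. To supply every filter with all pixels of its filter window in each cycle, the required frame-line buffers must be integrated on chip; they usually account for most of the chip area and power consumption, and both grow as the image resolution or the filter support increases. We propose a pyramid architecture for an image-processing pipeline that sits between an image sensor and a video/image encoder. The image is first partitioned into hierarchical tiles. We then propose two computing schemes, intermediate result reuse and vertical snake scan, to reduce the redundant computation that accompanies tile-based processing. By exploiting the pyramid structure and the block-based video/image encoder, the proposed architecture scales to different image resolutions and filter sizes. A 90nm CMOS chip supporting a 7×5 filter for 3840×2160 Quad Full High Definition (QFHD) video at 30 frames/s is designed to demonstrate power and area efficiency. Compared with the frame-line-buffer architecture, the proposed design reduces power consumption by 25%, from 145mW to 108mW, and hardware area by 65%, from 888K to 309K logic gates. The external memory bandwidth increases from 5972Mbits/s to 8286Mbits/s for the YUV4:2:0 video format and from 7963Mbits/s to 8286Mbits/s for YUV4:2:2, and it decreases by 30%, from 11944Mbits/s to 8286Mbits/s, for YUV4:4:4.
For computer-vision applications that use the scale-invariant feature transform (SIFT), a sufficiently large filter size is a key requirement for building the Gaussian pyramid from which scale-invariant feature points are extracted. In the literature, SIFT has been implemented on high-performance general-purpose processors to obtain high-quality results, but such implementations fall short of real-time performance. In resource-constrained embedded applications, the algorithm is first simplified to a 3×3 or 7×7 filter size, and the resulting ASIC or FPGA implementations operate at low resolution with weak feature-extraction capability. We examine the algorithm, divide it into low-level and high-level processing, and extend the single-pyramid architecture to a multiple-pyramid architecture to implement a low-level processing pipeline with 3 scales and a 15×11 filter size. The proposed design uses a 90nm CMOS process and integrates 791K logic gates and 204K SRAM bits. Synthesis results show that the design runs at 270MHz and processes 1280×960 octaves at 204 frames/s. Compared with a software implementation on a single-core general-purpose processor, it is 48.5 times faster with a 4.3% repeatability error. Compared with the frame-line-buffer architecture, it uses 89.23% fewer SRAM bits, 204K instead of 1894K. Compared with previous embedded hardware designs, its repeatability is 34.8% higher and its area efficiency is 7.3 times higher. The proposed architecture requires additional external memory bandwidth, and the amount of on-chip SRAM can be traded against the external memory bandwidth according to system constraints. The implemented design uses a 16×32 tile size and requires 312 Mbytes/s for 640×480 video; increasing the on-chip memory to 1597K bits reduces this to 66 Mbytes/s.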
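
The tile-based processing described above can be illustrated with a minimal Python sketch. It is not the pyramid architecture itself and does not model the hierarchical tiles or the two reuse schemes. Under simplifying assumptions (a box filter with odd kernel dimensions, border clamping, one flat level of tiles, and hypothetical function names), it only shows that filtering tile by tile with a halo of neighbouring pixels reproduces the whole-frame result, and it counts the extra pixel fetches that the overlap between tiles causes.

```python
def clamp(v, lo, hi):
    """Clamp v to the inclusive range [lo, hi]."""
    return min(max(v, lo), hi)


def box_filter(img, kw, kh):
    """Whole-frame kw x kh box sum (kw, kh odd), clamping at the borders."""
    h, w = len(img), len(img[0])
    rx, ry = kw // 2, kh // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(
                img[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
                for dy in range(-ry, ry + 1)
                for dx in range(-rx, rx + 1)
            )
    return out


def tiled_box_filter(img, kw, kh, tw, th):
    """Same filter computed one tw x th tile at a time.

    Returns (output, fetched), where fetched is the total number of
    pixels copied into the per-tile local buffers, halo included.
    """
    h, w = len(img), len(img[0])
    rx, ry = kw // 2, kh // 2
    out = [[0] * w for _ in range(h)]
    fetched = 0
    for ty in range(0, h, th):
        for tx in range(0, w, tw):
            y1, x1 = min(ty + th, h), min(tx + tw, w)
            # Local buffer: the tile plus a halo of rx columns and ry rows
            # on each side (assumption: out-of-frame pixels are clamped).
            local = [
                [img[clamp(yy, 0, h - 1)][clamp(xx, 0, w - 1)]
                 for xx in range(tx - rx, x1 + rx)]
                for yy in range(ty - ry, y1 + ry)
            ]
            fetched += len(local) * len(local[0])
            # Every output pixel of the tile reads only the local buffer.
            for y in range(ty, y1):
                for x in range(tx, x1):
                    out[y][x] = sum(
                        local[y - ty + j][x - tx + i]
                        for j in range(kh)
                        for i in range(kw)
                    )
    return out, fetched


if __name__ == "__main__":
    # Small synthetic frame; the sizes are illustrative only.
    frame = [[(7 * x + 13 * y) % 251 for x in range(32)] for y in range(16)]
    tiled, fetched = tiled_box_filter(frame, kw=7, kh=5, tw=16, th=8)
    print("matches whole-frame result:", tiled == box_filter(frame, 7, 5))
    print("fetch overhead:", fetched / (32 * 16))
```

The local buffer here holds (tw + kw - 1) × (th + kh - 1) pixels regardless of the frame width, whereas a line-buffer design must hold kh - 1 full image lines. The price is that halo pixels are fetched more than once, which is the redundancy the abstract's reuse and scan schemes are meant to reduce.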

Keywords

Image-processing algorithm architecture; image-processing pipeline; feature detection; VLSI architecture design

Parallel Abstract

Image-processing algorithms play an essential role in daily life, in entertainment, video communication, and computer vision. Several kinds of algorithms are chained together to act as a super filter that retrieves meaningful information from 2-D images. To run these algorithms in real time on modern VLSI designs, parallelism is a major architectural technique: it maps the computation onto a framework of many computing units and separate memory modules. A large problem is divided into several tasks, each operating on its own piece of the data, and the tasks are solved simultaneously on this architecture. During parallelization, getting data ready for the computing units while preserving dependencies is usually the bottleneck. In this dissertation, we address the design of image-processing algorithms in tiles and discuss the benefits and issues. We then propose a pyramid architecture that lets tile-based image-processing algorithms be applied efficiently in various kinds of systems. In applications that use CMOS image sensors, the image-processing pipeline is crucial to generating high-quality images. The on-chip line buffer normally dominates the total area and power dissipation because each filter window must be buffered. As the image resolution and filter support increase, the area and power requirements increase accordingly. We propose a pyramid architecture for systems in which the image pipeline sits between an image sensor and a video coding engine. By exploiting the features of the pyramid structure and the block-based video/image encoder, the proposed architecture scales from low to high resolutions and filter sizes. The input image is partitioned into floors of tiles to reduce the frame-line buffers. Two computing schemes, intermediate result reuse (IRR) and vertical snake scan (VSS), are used to reduce the overlapping redundant computation.
A 90nm CMOS chip with 7×5 filter support for 3840×2160 Quad Full High Definition (QFHD) video at 30 frames/s is designed to demonstrate power and area efficiency. Compared with the traditional architecture with frame-line buffers, the proposed design reduces power consumption by 25%, from 145mW to 108mW, and chip area by 65%, from 888K to 309K logic gates. The external memory bandwidth increases to 8286Mbits/s from 5972Mbits/s for YUV4:2:0 and from 7963Mbits/s for YUV4:2:2, and it is reduced by 30% from 11944Mbits/s for YUV4:4:4 video. In computer-vision applications that use the scale-invariant feature transform (SIFT), the kernel size is key to building the Gaussian pyramid used to extract features in scale-space representations. SIFT implementations typically rely on a high-power, general-purpose processor to maintain high-quality results, but they fall short of real-time performance. For resource-limited embedded systems, the algorithm is simplified to a 3×3 or 7×7 kernel size so that a small 320×240 resolution, with weakened feature-extraction capability, is feasible on a modern ASIC or an FPGA platform. We carefully examine the algorithm and separate it into low-level and feature-level processes. A 90nm CMOS SIFT accelerator with 15×11 filter support for 3 scales in an octave is designed by extending the single-pyramid architecture to a multiple-pyramid architecture. The design integrates 791K logic gates and 204K SRAM bits. Synthesis results show that it runs at 270MHz and processes 1280×960 octaves at 204 frames/s. Compared with a software implementation on a single-core processor, the speedup is 48.5 times and the algorithm quality degrades by 4.3% in repeatability.
Compared with the frame-line-buffer architecture, the SRAM usage is reduced by 89.23%, from 1894 to 204 Kbits. Compared with previous embedded hardware designs, the area efficiency is improved by 7.3 times and the algorithm quality is improved by 34.8% in repeatability for a single-object test. The proposed design requires an additional 312 Mbytes/s of bandwidth to process 640×480 video at 30 frames/s. The architecture makes it feasible to trade off global bandwidth against local SRAM usage according to system constraints. The global bandwidth can be reduced by 79% to 66 Mbytes/s while the SRAM usage increases by 7.82 times to 1597 Kbits.
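
The baseline bandwidth figures and the percentages quoted above can be checked with simple arithmetic. The sketch below assumes 8-bit samples, 1 Mbit = 10^6 bits, and that every frame crosses the external memory interface twice (one write and one read). These assumptions are not stated in the abstract, but they reproduce the three baseline numbers. The 8286 Mbits/s figure for the proposed design is taken from the abstract as given and is not derived here. Function and variable names are illustrative.

```python
def frame_traffic_mbits(width, height, fps, samples_per_pixel,
                        bits_per_sample=8, passes=2):
    """External-memory traffic in Mbit/s (1 Mbit = 10**6 bits).

    passes=2 is an assumption: each frame is written once and read once.
    """
    bits = width * height * samples_per_pixel * bits_per_sample * fps * passes
    return bits / 1e6


def reduction_percent(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


# Average number of samples per pixel for each chroma format.
SAMPLES_PER_PIXEL = {"YUV4:2:0": 1.5, "YUV4:2:2": 2.0, "YUV4:4:4": 3.0}

# Baseline traffic for 3840x2160 video at 30 frames/s.
BASELINE_MBITS = {
    fmt: round(frame_traffic_mbits(3840, 2160, 30, spp))
    for fmt, spp in SAMPLES_PER_PIXEL.items()
}

PROPOSED_MBITS = 8286  # quoted from the abstract, not derived

if __name__ == "__main__":
    for fmt, base in BASELINE_MBITS.items():
        change = reduction_percent(base, PROPOSED_MBITS)
        print(f"{fmt}: baseline {base} Mbits/s, change {-change:+.1f}%")
    print(f"power:     {reduction_percent(145, 108):.2f}% lower")
    print(f"gates:     {reduction_percent(888, 309):.2f}% lower")
    print(f"SRAM:      {reduction_percent(1894, 204):.2f}% lower")
    print(f"bandwidth: {reduction_percent(312, 66):.2f}% lower")
    print(f"SRAM growth for the low-bandwidth option: {1597 / 204:.2f}x")
```

The computed values agree with the quoted percentages to within one percentage point; the abstract appears to round some figures and truncate others.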

Parallel Keywords

Image Signal Processing; Image-Processing Pipeline; Feature Detection; VLSI Architecture
