
Reconfigurable Cache Memory Mechanism for Integral Image and Integral Histogram Applications

Advisor: Shao-Yi Chien (簡韶逸)

Abstract


Semiconductor technology advances rapidly: according to Moore's Law, microprocessor performance doubles roughly every 18 months, while memory speed improves by only about 7% per year, so a huge gap has opened between processor and memory speed. A cache is a very small, high-speed memory that serves as a buffer between the processor and off-chip dynamic random access memory (DRAM); by keeping frequently used data on chip, it reduces the memory latency of processor-DRAM accesses. The read/write mechanism of a conventional cache, however, is poorly suited to algorithms with stream-processing behavior, such as the well-known integral image and integral histogram. With an integral image or integral histogram, the sum or histogram of a region of arbitrary size and shape can be obtained easily, which accelerates many computations; but because these algorithms access data as a stream, our analysis shows that the cache hit rate reaches a bottleneck on such data flows and performance cannot be improved further.

In this thesis we propose a reconfigurable cache memory mechanism, called the RBSP memory, that supports both general data access and stream processing dedicated to integral images and integral histograms. It has two operating modes: cache mode and RBSP mode. In cache mode it behaves like a set-associative cache. In RBSP mode it is dedicated to integral-image and integral-histogram applications: the integral-image or integral-histogram data that the algorithm will need next is first fetched from DRAM into the RBSP memory, and the processor then accesses the RBSP memory directly, avoiding unnecessary memory latency.

We apply this memory to two integral-image and integral-histogram applications: Speeded-Up Robust Features (SURF) and the center-surround histogram. We analyze how the image rows read into the RBSP memory are reused, and, to bring into the memory the row elements that the subsequent steps of the algorithm will use, we propose a mapping algorithm to assist memory access; it is implemented in both hardware and software, and its performance and impact on chip area are discussed. Finally, a Memory Dividing Technique (MDT), which partitions the fetched blocks, is presented to reduce the word length that the memory must store.

The hardware is implemented in an electronic system level (ESL) simulation environment. With a VGA 640x480 input image, the RBSP memory outperforms a conventional cache of the same size by 38.31% on SURF and by 48.29% on the center-surround histogram. The design is synthesized with Synopsys Design Compiler in a TSMC 180 nm process, yielding a gate count of 514.6K; the control circuits required to run the mapping algorithm in RBSP mode account for only 7.61% (hardware implementation) or 5.28% (software implementation) of the total.
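To make the property relied on above concrete, here is a minimal, self-contained sketch (not taken from the thesis) of how an integral image is built in one streaming pass and how the sum of an arbitrary rectangle is then obtained with only four reads. The row-major layout, element widths, and function names are illustrative assumptions.

// Minimal sketch, not the thesis implementation: integral image construction
// and constant-time block summation. ii(x, y) stores the sum of all pixels
// p(i, j) with i <= x and j <= y (row-major layout assumed for illustration).
#include <cstdint>
#include <vector>

std::vector<uint32_t> build_integral_image(const std::vector<uint8_t>& img,
                                           int width, int height) {
    std::vector<uint32_t> ii(width * height, 0);
    for (int y = 0; y < height; ++y) {
        uint32_t row_sum = 0;                       // running sum of the current row
        for (int x = 0; x < width; ++x) {
            row_sum += img[y * width + x];
            ii[y * width + x] = row_sum + (y > 0 ? ii[(y - 1) * width + x] : 0);
        }
    }
    return ii;
}

// Sum of the block with top-left (x0, y0) and bottom-right (x1, y1), inclusive.
// Only four reads are needed regardless of the block size.
uint32_t block_sum(const std::vector<uint32_t>& ii, int width,
                   int x0, int y0, int x1, int y1) {
    uint32_t a = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * width + (x0 - 1)] : 0;
    uint32_t b = (y0 > 0) ? ii[(y0 - 1) * width + x1] : 0;
    uint32_t c = (x0 > 0) ? ii[y1 * width + (x0 - 1)] : 0;
    return ii[y1 * width + x1] - b - c + a;
}

Box filters of many sizes touch the same integral-image rows, which is the intuition behind fetching whole rows once and reusing them, as the RBSP mode does.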

Abstract (English)


With the development of semiconductor technology, the capability of microprocessors doubles almost every 18 months, following Moore's Law, while the speed of off-chip DRAM grows by only 7% every year, leaving a huge gap between the speed of the CPU and DRAM. Cache memory is a high-speed memory that reduces the memory access latency between the processor and off-chip DRAM, and it usually occupies a large area of the whole system. However, for operations on integral images and integral histograms, which are widely used in many applications because they provide the summation or histogram of an arbitrary-sized block in constant time, the read and write mechanisms of a cache are not suitable: these algorithms have a stream-processing characteristic. A larger cache does reduce the cycle count, but the analysis results show that the cache hit rate reaches a bottleneck and the cycle-count reduction meets a limit.

For these reasons, this thesis proposes a reconfigurable cache memory mechanism that supports both general data access and stream processing. The proposed memory has two modes: a normal cache mode and a Row-Based Stream Processing (RBSP) mode, a mechanism dedicated to the data access of integral images and integral histograms. The RBSP mode reduces the cycle count because all the subsequently needed data are precisely prefetched with an image row as the basic access unit. Two integral-image and integral-histogram applications, the SURF algorithm and the center-surround histogram of the saliency map, are implemented to verify the proposed mechanism. Moreover, the data reuse schemes of intra-filter-size sharing and inter-filter-size sharing, between different filter sizes and different filtering stripes, are taken into consideration to further reduce the data access to off-chip DRAM. A mapping algorithm, implemented in both hardware and software versions, is proposed to help the RBSP memory read and write the data. In addition, a Memory Dividing Technique (MDT) is proposed to further reduce the word length.

The whole system is built in the CoWare Platform Architect to verify the design. The target image size is VGA 640x480, and the experimental results show that the proposed reconfigurable RBSP memory saves 38.31% and 48.29% of the memory cycle count for these two applications compared with a traditional data cache of the same size. The hardware is implemented in Verilog-HDL and synthesized with Synopsys Design Compiler in TSMC 180 nm technology. The total gate count of the RBSP memory is 557.0K, and the overhead of the proposed RBSP memory is small: only 7.61% (hardware-based mapping) or 5.28% (software-based mapping) compared with the set-associative cache.
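As a companion to the sketch after the first abstract, the following illustrates the integral histogram referenced here: one cumulative histogram per pixel position, so that the histogram of any rectangle, as needed by the center-surround histogram, costs four add/subtract operations per bin. The bin count, quantization, and layout are assumptions for illustration, not the thesis's design.

// Minimal sketch, not the thesis implementation: integral histogram construction
// and O(BINS) region-histogram queries independent of the region size.
#include <array>
#include <cstdint>
#include <vector>

constexpr int BINS = 16;                   // assumed number of intensity bins
using Hist = std::array<uint32_t, BINS>;

std::vector<Hist> build_integral_histogram(const std::vector<uint8_t>& img,
                                           int width, int height) {
    std::vector<Hist> ih(width * height, Hist{});
    for (int y = 0; y < height; ++y) {
        Hist row{};                         // histogram of the current row prefix
        for (int x = 0; x < width; ++x) {
            ++row[img[y * width + x] * BINS / 256];
            Hist& cur = ih[y * width + x];
            for (int b = 0; b < BINS; ++b)
                cur[b] = row[b] + (y > 0 ? ih[(y - 1) * width + x][b] : 0);
        }
    }
    return ih;
}

// Histogram of the rectangle [x0, x1] x [y0, y1], inclusive.
Hist region_histogram(const std::vector<Hist>& ih, int width,
                      int x0, int y0, int x1, int y1) {
    Hist h{};
    for (int b = 0; b < BINS; ++b) {
        uint32_t a = (x0 > 0 && y0 > 0) ? ih[(y0 - 1) * width + (x0 - 1)][b] : 0;
        uint32_t t = (y0 > 0) ? ih[(y0 - 1) * width + x1][b] : 0;
        uint32_t l = (x0 > 0) ? ih[y1 * width + (x0 - 1)][b] : 0;
        h[b] = ih[y1 * width + x1][b] - t - l + a;
    }
    return h;
}

Each pixel position now carries BINS counters rather than a single sum, which is why the stored word length becomes a concern and motivates techniques such as the proposed MDT.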

