透過您的圖書館登入
IP:3.142.200.226
  • 學位論文

應用於GPU之蝶形演算法資料排列

Index Conversion on GPU for Butterfly Based Algorithm

指導教授 : 許雅三

摘要


現今訊號處理演算法在許多領域扮演重要的角色,例如: 無線通訊、影像處理、頻譜分析…等。隨著運算需求增大,許多研究針對各種訊號處理演算法提出快速演算法 (fast algorithm),而GPU的應用更促使許多平行化的快速演算法的研究蓬勃發展。許多快速演算法都有資料排列的潛在問題,例如:傅立葉轉換、沃爾什-阿達瑪轉換、哈特利轉換、哈爾轉換…等,而許多研究都是在轉換前後額外處理。值得一提的是[2]實作出令人驚艷的函式庫,[2]利用定義的運算子以及排程將資料排列的工作並入轉換的演算法部分。 [1]利用硬體的方式解決異質系統中資料排列的問題,而本篇論文也應用[1]的概念解決上述快速演算法的潛在問題。在位址轉換過程中我們使用[1]的ping-pong傳輸模式以盡可能縮短位址轉換造成的延遲。資料排列方面我們實做出三種模式: bit-reversal order、natural Hadamard order、快速哈爾轉換的資料次序,並模擬使用硬體方式解決資料排列問題後對於六種演算法在GPU上的加速。 在我們的實驗中,訊號處理演算法得到的加速和資料排列的複雜度成正比,但是如果演算法轉換部分運算量較大,得到的加速則會下降。在挑選的演算法裡面最差的加速是快速傅立葉轉換,落在7.12%~11.2%;最佳的加速則是快速沃爾什-阿達瑪轉換,落在15.28%~40.78%。

關鍵字

資料排列

並列摘要


Signal processing algorithm plays an important role in many fields nowadays, such as wireless communication, image processing, spectrum analysis and so on. As computational demands increase, many researches presented fast algorithm about signal processing algorithms. Besides, evolving GPU technology further boosts researchers parallelize existing fast algorithm. However, fast algorithms such as Fast Fourier Transform (FFT), Fast Walsh-Hadamard Transform (FWHT), Fast Hartley Transform, Fast Haar Transform, suffered from index permutation issue, which are often solved by programmer before or after transformation. However, researchers proposed impressive library solution in [2]. Work in [2] utilizes defined operators and scheduling to achieve index permutation and merges it into transformation algorithm. [1] adopts hardware method to tackle with data layout issue between heterogeneous processors, and we solve index permutation issue with concept in [1]. We adopt ping-pong transpose mechanism to hide the latency due to index conversion. We implement three types of index permutation, such as bit-reversal order, natural Hadamard order, and index order in Fast Haar Transform. Finally, we show the speedup of six chosen algorithms on GPU after eliminating index permutation problem with our index converter. In our experiment, the speedup of signal processing algorithm is proportional to the complexity of index distribution, and in inverse proportion with the execution time of fast transform. In selected benchmarks, we obtain worst case in FFT, its speedup ranges from 7.12% to 11.2%, and best case in FWHT, ranging from 15.28% to 40.78%.

並列關鍵字

GPU data layout butterfly index permutation

參考文獻


Lee, “Optimized Memory Access Support for Data Layout Conversion on Heterogeneous Multi-core Systems,” in Embedded Systems for Real-time Multimedia (ESTIMedia), 2014 IEEE 12th Symposium.
[2] Jacobo Lobeiras, Margarita Amor, and Ramón Doallo, “Designing Efficient Index-Digit Algorithms for CUDA GPU Architectures,” in IEEE Transactions on Parallel and Distributed Systems, 2016.
[3] I-Jui Sung, Geng Daniel Liu, and Wen-Mei W. Hwu, “DL: A data layout transformation system for heterogeneous computing,” in Innovative Parallel Computing (InPar), 2012.
[5] Peter R. Roeser, and M. E. Jernigan, “Fast Haar Transform Algorithms,” in IEEE Transactions on Computers, 1982
[6] Yanwei Pang, Xuelong Li, Yuan Yuan, Dacheng Tao, and Jing Pan, “Fast Haar Transform Based Feature Extraction for Face Representation and Recognition,” in IEEE Transactions on Information Forensics and Security, 2009.

延伸閱讀