在GPU上增進排序演算法的效能

Sorting演算法在許多的應用程式所使用的技術中皆是相當重要一環，而在Graphics Processing Unit (GPU)上也不例外，多年來已有各樣高效能的sorting演算法實作在GPU上，然而隨著GPU計算性能和資料量的增加，計算和通訊的效能差距會呈現指數成長，這會導致host (CPU) 和device (GPU) 之間的資料傳輸時間會成為效能的瓶頸；以sorting algorithm為例，當資料量大於 2^20 時，花在資料搬移的時間比例將會超過整體執行時間的60%。本文中提出一個framework，利用streams concurrency技術使communication和computation的時間能夠重疊，藉此增進GPU sorting演算法的效能。首先將資料分割成數個buckets，每個bucket的資料數量大致相同，而且每個bucket中的資料會大於或等於前一個bucket中的資料。第二步，使用任何一種GPU sorting演算法將各bucket中的資料分別排序好。最後將排序以及資料輸出重疊，藉此隱藏communication的時間。這個framework主要的挑戰在步驟一的時候如何將資料分割成有順序的bucket並且每個bucket的資料數要差不多，這裡使用sample sort的演算法來解決這個問題。這裡將此framework實作在三個演算法上來驗證此framework的效能，分別是radix sort、merge sort、bitonic sort。實驗結果顯示當資料數 n=2^28 時，radix sort上有接近25%的效能增進，在merge sort上則是8%，在bitonic sort上最多可達8.32%的效能增進。

關鍵字

通用GPU計算；排序；重疊計算及傳輸時間

並列摘要

Efficient implementations of various sorting algorithms on Graphics Processing Unit (GPU) had been studied for years, owing to their technical importance in many applications. However, as the computational power of GPU and data size increases, the performance gap of computation and communication enlarges exponentially. As the result, the data movement between host (CPU) and device (GPU) becomes the performance bottleneck. For sorting algorithms, the time of data movement can take over 60% of total execution time when the data size is larger than 220 on Fermi C2070. In this thesis, we propose a framework to enhance the performance of GPU sorting algorithms, which utilizes the streams concurrency technique to overlap the communication and computation time. First, data are partitioned into buckets. Each bucket has roughly the same size and data in each bucket has partial order to the data in other buckets. Second, data in each bucket are sorted separately on GPU using preferred sorting algorithms. Last, the sorting and data output are overlapped to hide the communication time. The major challenge of this framework is in the first step: to partition the data into ordered buckets of roughly equal size. The sample sort algorithm is employed to resolve this problem. Three sorting algorithms were implemented to justify the effectiveness of this framework: radix sort, merge sort, and bitonic sort. Experiments show that nearly 25% time performance improvement of radix sort can be obtained when n=2^28. For merge sort, the improvement is 8% ; and for bitonic sort, up to 8.32% performance improvement can be achieved.

並列關鍵字

GPGPU ； sorting ； CUDA ； stream ； concurrency ； communication time ； computation time

參考文獻

[9] Intel TBB. [Online] http://threadingbuildingblocks.org/

[12] Flynn M. "Some Computer Organizations and Their Effectiveness", 1972, IEEE Trans. Comput. C-21: 948.

[13] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. “Introduction to Algorithms, Second Edition.” MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 8.1: Lower bounds for sorting, pp. 165–168.

[15] Donald Knuth. “The Art of Computer Programming, Volume 3: Sorting and Searching,” Second Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0.

[18] D. Cederman and P. Tsigas. “A practical quicksort algorithm for graphics processors.” In Proc. European Symposium on Algorithms (ESA), volume 5193 of LNCS, pages 246–258, 2008.

國際替代計量

在GPU上增進排序演算法的效能

全文下載

主題瀏覽