Analysis and Tuning of OpenCL Workgroup Size on Heterogeneous Computing Platforms

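The performance of OpenCL is strongly affected by the characteristics of both the program and the computing platform. To achieve good performance, users often have to search among many possible program parameters for a good setting on each platform. As heterogeneous computing devices become increasingly diverse, an efficient and easy-to-use automatic parameter tuning technique is needed. Because workgroup size has a large impact on performance, it is widely chosen as a target for automatic tuning. However, current automatic tuning techniques that are applicable across different heterogeneous computing devices can only run on the target platform. In this thesis, we analyze the root causes of the performance impact of workgroup size and propose a model dedicated to tuning it. By abstracting the architectural differences between devices and estimating only the root causes of the performance impact, the model can quickly find suitable workgroup sizes for different computing devices across platforms. We validate the model with seven benchmark programs and five different computing devices. The experimental data show that the model quickly eliminates an average of 95.1% of the possible workgroup settings. Compared with the best workgroup setting, the best of the candidate settings found by the model reaches an average of 95.7% of its performance, and the worst reaches an average of 92.2%.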

Keywords

OpenCL; workgroup size; micro-benchmark; auto-tuning

Parallel Abstract

The performance of an OpenCL kernel is significantly influenced by both hardware and software attributes. To attain good performance, users need to search a huge tuning space to determine proper parameters. With the growing variety and heterogeneity of the underlying computing devices, efficient and easy-to-apply automatic tuning techniques have become essential. Among all possible tuning knobs, workgroup size, which largely affects performance, is commonly used for general OpenCL programs. However, existing portable tuning approaches can only be applied once the target device is available. In this thesis, we analyze the key factors that cause performance discrepancies under different workgroup sizes and present a dedicated workgroup size selection model. By abstracting the hardware details and modeling only the key factors, our approach provides a portable and efficient way to determine a suitable workgroup size without requiring access to the target device. Across seven benchmarks and five distinct devices, our model filters out an average of 95.1% of the possible workgroup sizes with negligible overhead, while achieving an average of 95.7% of the best-known performance with the best candidate and 92.2% of the best-known performance with the worst candidate.

Parallel Keywords

OpenCL; Workgroup Size; Micro-benchmark; Auto-tuning

