
Dynamic Kernel Consolidation and Scheduling: Improving Performance of Irregular Applications with Dynamic Parallelism on GPUs

Advisor: Chia-Lin Yang (楊佳玲)

Abstract


Graphics processing units (GPUs) have been among the most prominent processing units of the past decade owing to their high throughput, high parallelism, and high energy efficiency. Many studies have shown that applications with abundant data parallelism can achieve speedups of tens or even hundreds of times over CPUs when run on GPUs. In recent years, however, demand has grown for running irregular applications, such as big-data and machine-learning workloads, on GPUs. These irregular applications typically contain data-dependent control flow or memory accesses, along with dynamically generated thread-level parallelism (dynamic parallelism), which prevents them from executing efficiently under the programming models currently supported by GPUs. Recent GPU architectures parallelize this dynamic parallelism by allowing each thread to spawn new child kernels during execution, a feature known as device-side kernel launch. However, many studies have found that the current software and hardware implementations of device-side kernel launch fail to exploit the GPU's high throughput because of the invocation overhead and low execution efficiency of child kernels; in many cases they even degrade performance.

This dissertation examines the limitations of device-side kernel launch at their root: its programming model. It observes that programmers must set the important launch parameters by hand, yet the nature of dynamically parallel applications makes it very difficult to choose parameters that yield efficient execution. It also finds that many dynamic tasks that could run in parallel are instead forced to execute serially. Rather than reducing invocation overhead by tuning parameters or through other software/hardware techniques, the goal of this dissertation is to fully unleash GPU performance when executing dynamically parallel applications: to run all dynamically generated tasks in parallel as child kernels, to keep execution efficient throughout, and to spare programmers excessive effort in parameter selection. Through analysis, I identify three challenges to fully exploiting the parallelism potential of child kernels: (1) although ideal execution parameters can be obtained through off-line profiling, applying them directly to child kernels with varying thread counts rarely delivers the expected benefit; (2) when too many child kernels are generated at once, thread-block dispatching must read their metadata from DRAM, slowing down overall execution; and (3) child kernels generally contain few threads, making their kernel-parameter accesses inefficient.

To address these challenges, this dissertation proposes a holistic solution called Dynamic Kernel Consolidation and Scheduling (DKCS), which aims to fully exploit the GPU's high parallelism and high throughput when executing applications with device-side kernel launch. At its core is Dynamic Kernel Consolidation (DKC), a mechanism with compiler and hardware support that breaks up and regroups large numbers of child kernels with differing thread counts and selects better execution parameters to improve their efficiency. To handle the excessive number of concurrently generated child kernels, DKCS introduces Dynamic Parallelism-aware GPU Scheduling (DPS), which uses simultaneous multikernel (SMK) execution on GPU SMs to run parent and child kernels concurrently and prioritizes child kernels when needed to avoid extra DRAM access latency during thread-block dispatching. To handle child kernels with too few threads, DKCS accesses kernel parameters through an alternative data path, the read-only cache, instead of the constant cache, improving access efficiency. DKCS also makes programming much simpler: programmers no longer need to laboriously tune child-kernel launch conditions or execution parameters to ensure overall efficiency. Experimental results show that the proposed DKCS mechanism improves performance by an average of 64% over dynamic thread-block launch (DTBL), a prior mechanism that reduces the invocation overhead of individual kernel launches, and by an average of 44% over SPAWN, a mechanism that dynamically adjusts the division of work between parent and child kernels.
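The tension the abstract describes, hand-picked launch parameters and serialized small tasks, can be illustrated with a minimal CDP-style sketch. This is a hypothetical example, not code from the dissertation: `process_node`, `THRESHOLD`, and `BLOCK_SIZE` are illustrative names, and the point is that both constants must be fixed before run-time even though the parallelism degree is data-dependent.

```cuda
#include <cuda_runtime.h>

// Illustrative constants the programmer must pick up front, even though the
// amount of child work per thread is only known at run-time.
#define THRESHOLD  32
#define BLOCK_SIZE 128

__device__ void process_node(int node) { /* application-specific work */ }

// Child kernel: processes n dynamically discovered work items.
__global__ void child_kernel(const int *items, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) process_node(items[tid]);
}

// Parent kernel over a CSR-like structure: each thread may spawn a child
// kernel whose size depends on its node's degree (device-side kernel launch).
__global__ void parent_kernel(const int *row_offsets, const int *neighbors,
                              int num_nodes) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_nodes) return;

    int begin = row_offsets[tid];
    int work  = row_offsets[tid + 1] - begin;  // data-dependent parallelism degree
    if (work > THRESHOLD) {
        int blocks = (work + BLOCK_SIZE - 1) / BLOCK_SIZE;
        child_kernel<<<blocks, BLOCK_SIZE>>>(neighbors + begin, work);
    } else {
        // Below the threshold, parallelizable tasks are serialized instead.
        for (int i = 0; i < work; ++i)
            process_node(neighbors[begin + i]);
    }
}
```

Device-side launches like this require compiling with relocatable device code (`nvcc -rdc=true`) on a device of compute capability 3.5 or higher; no single `THRESHOLD`/`BLOCK_SIZE` pair works well when `work` varies by orders of magnitude across threads.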

English Abstract


Demand for accelerating irregular applications on GPUs has increased recently. Such applications often operate on unstructured data sets such as trees, graphs, or sparse matrices, and therefore exhibit dynamic, workload-dependent nested thread-level parallelism whose degree is hard to predict. Modern GPUs support device-side kernel launch (CUDA dynamic parallelism, or CDP) to improve the programmability of irregular GPGPU applications with dynamic parallelism. However, research has shown that current GPU implementations of device-side kernel launch fail to make good use of the GPU's high throughput because of non-trivial kernel invocation overhead and inefficient execution of device-launched kernels. This dissertation aims at fully exploiting the potential of device-side kernel launch. I analyze its limitations through the CDP programming model and find that device-launched kernels are generally executed with low utilization: the programming model requires the programmer to determine important launch parameters, yet the very nature of dynamic parallelism makes selecting parameters that maximize execution efficiency extremely difficult. Furthermore, a significant fraction of parallelizable tasks are still forced to execute serially, which further limits the potential of device-side kernel launch. To fully unleash the benefit of device kernels, I propose an ideal version of CDP code and summarize the challenges of executing it efficiently as follows. First, child kernels should be executed with tuned execution parameters. Although favorable execution parameters can be calculated statically, applying them to child kernels is non-trivial, because the number of threads in a child kernel is set dynamically at run-time and may not fit the selected parameters.
Second, in the ideal CDP code, the number of device kernels is often an order of magnitude larger than the number of host kernels (from thousands to hundreds of thousands), but with far fewer threads per kernel (55 on average for the tested workloads). Massive numbers of device kernels with few threads each lead to inefficient metadata and kernel-parameter accesses during execution. Based on these analyses, I propose the Dynamic Kernel Consolidation and Scheduling (DKCS) framework as the first attempt to fully exploit the potential of device-side kernel launch. The key innovation of DKCS is a run-time layer that performs kernel fusing/splitting and thread-block regrouping to form consolidated kernels and thread blocks to which the selected grid and block sizes can be applied. DKCS also includes two features that resolve the inefficiency incurred by massive numbers of small child kernels: a block and warp scheduling policy for efficient metadata access, and the use of an alternative data path, the read-only cache, for efficient access to child-kernel parameters. With DKCS, programming for dynamic parallelism becomes simple, with no cumbersome tuning of threshold values for launching device-side kernels or of kernel execution parameters. Results show that the proposed DKCS framework outperforms dynamic thread-block launch (DTBL), a mechanism that reduces the invocation overhead of a single child kernel, by 64%, and outperforms SPAWN, a mechanism that throttles device-side kernel launches to keep their costs from outweighing their benefits, by 44%.
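The effect of consolidating many small child kernels can be pictured with the classic prefix-sum flattening pattern below. This is only an analogy written as ordinary CUDA code, with all names illustrative: the dissertation implements consolidation in a run-time layer with compiler and hardware support, not in application code. Here `task_offsets` is an exclusive prefix sum of the per-task thread counts, so a single kernel with a tuned block size covers all pending tasks at once.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: one consolidated kernel replaces many tiny child
// kernels. Each thread binary-searches task_offsets to find which original
// child task it belongs to, then does that task's work for its lane.
__global__ void consolidated_kernel(const int *task_offsets, const int *task_ids,
                                    int num_tasks, int total_threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total_threads) return;

    // Find the largest task index whose offset is <= tid.
    int lo = 0, hi = num_tasks - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (task_offsets[mid] <= tid) lo = mid;
        else                          hi = mid - 1;
    }
    int task = task_ids[lo];
    int lane = tid - task_offsets[lo];  // thread index within the original task
    // ... perform the original child kernel's work for (task, lane) ...
    (void)task; (void)lane;
}
```

Because every thread now lives in one large grid, a single statically tuned block size applies to all of the dynamic work, and the per-launch invocation overhead and per-kernel metadata reads that plague thousands of tiny child kernels are paid only once.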
