
An OpenACC-Like Programming Model Supporting Irregular Nested Parallelism

PFACC: an OpenACC-Like Programming Model for Irregular Nested Parallelism

Advisors: 楊武, 黃世昆

Abstract


OpenACC is a directive-based programming model. By simply annotating loops that can execute in parallel, programmers can exploit the computing resources offered by GPUs. Nested parallel algorithms such as quicksort can also be implemented with nested parallel loops; however, OpenACC provides only limited support for nested parallel loops. We therefore propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to mark parallel loops or data movement between different levels of the memory hierarchy. Parallel loops can be nested arbitrarily or placed inside functions that are called from other parallel loops. The PFACC translator inserts load-balancing and data-movement code into C programs annotated with PFACC directives and translates them into CUDA programs. PFACC's load balancing is a two-level mechanism: each thread block is treated as a flat SIMT processor that dynamically organizes loop iterations into batches and executes the batches in depth-first order, and different thread blocks share work with one another through a work-stealing mechanism.

Because of the depth-first execution order, the CUDA programs generated by PFACC have reasonable memory usage. PFACC's two-level load-balancing mechanism requires no special hardware support and fits the CUDA thread hierarchy well. Experimental results show that PFACC outperforms NESL on most benchmarks and is more than 100 times faster than CUDA dynamic parallelism on some benchmarks.
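As a concrete illustration of the programming model described above, the following is a minimal sketch in C with an assumed directive spelling (#pragma pfacc parallel loop) that is not taken from the thesis: quicksort written with a parallel loop over its two recursive sub-calls, so that the parallel loop also appears inside a function called from another parallel loop and the parallelism nests to an input-dependent, irregular depth. A plain C compiler simply ignores the unknown pragma.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void quicksort(int *a, int n)
    {
        if (n <= 1)
            return;

        int pivot = a[n / 2];
        int *part[2];
        int cnt[2] = {0, 0};
        int eq = 0;                            /* elements equal to the pivot      */
        part[0] = malloc(n * sizeof(int));     /* elements smaller than the pivot  */
        part[1] = malloc(n * sizeof(int));     /* elements larger than the pivot   */

        for (int i = 0; i < n; i++) {          /* sequential partitioning step     */
            if (a[i] < pivot)      part[0][cnt[0]++] = a[i];
            else if (a[i] > pivot) part[1][cnt[1]++] = a[i];
            else                   eq++;
        }

        /* The two sub-problems touch disjoint data, so the loop over them can be
         * marked parallel; each iteration calls quicksort(), which contains the
         * same parallel loop, giving irregular nested parallelism.
         * (The directive spelling below is assumed, not taken from the thesis.) */
        #pragma pfacc parallel loop
        for (int s = 0; s < 2; s++)
            quicksort(part[s], cnt[s]);

        memcpy(a, part[0], cnt[0] * sizeof(int));                /* smaller part */
        for (int i = 0; i < eq; i++)
            a[cnt[0] + i] = pivot;                               /* pivots       */
        memcpy(a + cnt[0] + eq, part[1], cnt[1] * sizeof(int));  /* larger part  */

        free(part[0]);
        free(part[1]);
    }

    int main(void)
    {
        int a[] = {9, 3, 7, 1, 8, 2, 7, 5};
        quicksort(a, 8);
        for (int i = 0; i < 8; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }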

Abstract (English)


OpenACC is a directive-based programming model which allows programmers to enjoy the computing power of GPUs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives are introduced to annotate parallel loops and to guide data movement between different levels of memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines and performing necessary code transformations. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks are treated as flat SIMT processors, which dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms NESL significantly in most benchmarks, and obtains more than 100x speedup over CUDA dynamic parallelism on certain benchmarks.
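The runtime iteration-sharing mechanism itself is not spelled out in this abstract, so the CUDA sketch below only illustrates its general flavor under stated assumptions: a whole thread block acts as one flat SIMT worker that claims batches of iterations from a shared pool, and blocks that finish early keep claiming iterations that busier blocks have not taken. The names (shared_loop, next_iter) are made up for the example; PFACC's depth-first batching of nested loops and its per-block stealing queues are not modeled here.

    // A minimal CUDA sketch of dynamic iteration sharing among thread blocks
    // (a simplified stand-in for PFACC's two-level mechanism, not its runtime).
    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ unsigned int next_iter;      // next unclaimed loop iteration

    __global__ void shared_loop(const float *in, float *out, unsigned int n)
    {
        __shared__ unsigned int base;       // start of the batch owned by this block

        for (;;) {
            // The block acts as one flat SIMT worker: one thread claims a
            // batch of blockDim.x iterations on behalf of the whole block.
            if (threadIdx.x == 0)
                base = atomicAdd(&next_iter, blockDim.x);
            __syncthreads();

            if (base >= n)                  // no iterations left for this block
                break;

            unsigned int i = base + threadIdx.x;
            if (i < n)
                out[i] = 2.0f * in[i];      // the parallel-loop body
            __syncthreads();                // done with 'base'; safe to overwrite
        }
    }

    int main()
    {
        const unsigned int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (unsigned int i = 0; i < n; i++)
            in[i] = (float)i;

        unsigned int zero = 0;
        cudaMemcpyToSymbol(next_iter, &zero, sizeof(zero));

        // Launch fewer blocks than there are batches: blocks keep pulling
        // iterations until the shared pool is exhausted.
        shared_loop<<<32, 256>>>(in, out, n);
        cudaDeviceSynchronize();

        printf("out[123456] = %f\n", out[123456]);
        cudaFree(in);
        cudaFree(out);
        return 0;
    }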
