
CPU-GPU Hybrid Approaches in Multifrontal Methods for Large and Sparse Linear Systems

Advisor: 王偉仲 (Weichung Wang)

Abstract


Large sparse linear systems have long played a critical role in scientific computing and engineering. Among the many direct methods, we focus on the multifrontal method. The multifrontal method uses an elimination tree (a column elimination tree for unsymmetric problems) to decompose a large sparse linear system into many smaller dense matrix computations, making it well suited to hybrid CPU-GPU computing environments. We analyze both the algorithmic and the implementation aspects to investigate how one or more GPUs can be used to accelerate the computation, and we review the development of the multifrontal method from symmetric positive definite and symmetric indefinite to unsymmetric problems.

We successfully implement the ideal algorithm for symmetric positive definite problems; this sparse factorization achieves performance close to that of dense matrix computations. The implementation for symmetric indefinite problems is similar, and we expect its efficiency to be only slightly lower. This ideal approach, however, cannot be applied to state-of-the-art unsymmetric solvers, which perform dynamic column reordering during factorization to reduce memory usage and computation. Such reordering lowers the extra memory required after factorization, but sacrifices two O(n^3) operations; the resulting lower-order computations severely degrade GPU performance and increase the volume of CPU-GPU data transfer, hurting overall efficiency. In our unsymmetric implementation we reduce the impact of this communication as much as possible, employing techniques such as high-performance memory. The same techniques are applied to the symmetric positive definite case, together with a theoretical performance analysis. After deriving a model of the total execution time, we propose methods to search for an optimal schedule that minimizes it. Theorems, proofs, computable error bounds, and numerical timings are all provided in the thesis.

Finally, for each class of methods we propose adaptive variants so that computing power can be multiplied in larger computing environments while the analytical models proposed here remain applicable. We target the specification of today's most popular tower servers, which hold up to four GPUs, thereby avoiding the discussion of transfer speeds of a different order of magnitude (network communication). For symmetric problems, we further propose a new method that avoids GPU-to-GPU communication to improve overall efficiency.
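The core dense kernel described above can be sketched as a partial Cholesky factorization of one frontal matrix: the fully-summed variables are eliminated with dense BLAS3 operations, and the Schur complement is passed up the elimination tree to the parent front. This is a minimal NumPy illustration of that kernel, not the thesis's actual implementation; the function name and the toy SPD matrix are assumptions for the example.

```python
import numpy as np

def factor_front(F, k):
    """Partial Cholesky of a dense SPD frontal matrix F.

    Eliminates the first k fully-summed variables and returns the factor
    panels plus the Schur complement contributed to the parent front.
    """
    L11 = np.linalg.cholesky(F[:k, :k])          # potrf (dense BLAS3)
    L21 = np.linalg.solve(L11, F[:k, k:]).T      # trsm: L21 = F21 * L11^{-T}
    schur = F[k:, k:] - L21 @ L21.T              # syrk: update for the parent
    return L11, L21, schur

# Toy SPD frontal matrix (illustrative only).
rng = np.random.default_rng(0)
n, k = 6, 3
M = rng.standard_normal((n, n))
F = M @ M.T + n * np.eye(n)

L11, L21, schur = factor_front(F, k)
L = np.linalg.cholesky(F)  # reference: full dense factorization
assert np.allclose(L[:k, :k], L11)
assert np.allclose(L[k:, :k], L21)
assert np.allclose(np.linalg.cholesky(schur), L[k:, k:])
```

Because all three steps are dense BLAS3 operations, each front can be offloaded to a GPU kernel, which is why the sparse factorization can approach dense efficiency.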

English Abstract


Solving large-scale sparse linear systems is at the heart of many scientific and engineering computations. Among the various direct methods, we focus on the multifrontal method in particular. The multifrontal method uses an elimination tree (a column elimination tree for unsymmetric problems) to transform a large sparse linear system into many smaller dense frontal operations, which are well suited to hybrid CPU-GPU systems. We analyze the method from both algorithmic and implementation perspectives to see how one or more GPUs can accelerate the computation, and we review the multifrontal method for symmetric positive definite (SPD), symmetric indefinite, and unsymmetric problems.

We successfully implement the ideal SPD multifrontal algorithm, which attains nearly the peak performance of dense BLAS3 routines on the GPU (MAGMA's Cholesky); the shared symmetric structure suggests a similar implementation and comparable performance for symmetric indefinite problems. Unsymmetric problems, however, are harder to implement because runtime column pivoting splits two BLAS3 routines into several BLAS2 routines, and extra communication is inevitable. Since CPU-GPU communication can easily dominate the run time, we provide several strategies, for all classes of multifrontal problems, that reduce communication and accelerate the process. Furthermore, we analyze the total execution time for SPD problems and derive a nearly optimal workload distribution for hybrid CPU-GPU cooperation.

For all problem classes, we provide scalable algorithms that adapt to multiple GPUs, up to four. By restricting ourselves to a single box server, we avoid analyzing cluster-network communication, whose speed is of a different order of magnitude from PCI-E, and focus on adapting the optimization strategies. The extension of the optimization model to the new parallel scheme closely follows the single-GPU model: new factorization and communication costs for multiple GPUs are added, and the performance bounds and properties still hold.
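The workload-distribution idea can be illustrated with a schematic timing model: a fraction of each front's work is offloaded to the GPU (paying a transfer cost), the rest stays on the CPU, and the two run concurrently, so the total time is the maximum of the two and is minimized where both finish together. The rates, the closed form, and the function below are illustrative assumptions for this sketch, not the thesis's actual model.

```python
def best_split(work, r_cpu, r_gpu, comm):
    """Schematic CPU/GPU workload split (illustrative model).

    work  : total flops of the front
    r_cpu : CPU rate (flops/s), r_gpu: GPU rate (flops/s)
    comm  : fixed CPU-GPU transfer time for the offloaded part

    t_gpu(a) = comm + a*work/r_gpu and t_cpu(a) = (1-a)*work/r_cpu;
    the total max(t_gpu, t_cpu) is minimized where the two are equal.
    """
    alpha = (work / r_cpu - comm) / (work / r_cpu + work / r_gpu)
    alpha = min(1.0, max(0.0, alpha))  # clamp: tiny fronts stay on the CPU
    total = max(comm + alpha * work / r_gpu, (1 - alpha) * work / r_cpu)
    return alpha, total

alpha, total = best_split(work=1e9, r_cpu=1e8, r_gpu=1e9, comm=0.5)
# At the optimum both devices finish at the same time.
assert abs((0.5 + alpha * 1e9 / 1e9) - (1 - alpha) * 1e9 / 1e8) < 1e-9
```

The same balancing principle extends to multiple GPUs by adding one factorization-cost and one communication-cost term per device, which is how the abstract describes extending the single-GPU model.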

