
CPU-GPU Hybrid Approaches in Multifrontal Methods for Large and Sparse Linear Systems

Advisor: 王偉仲 (Weichung Wang)

Abstract


Large sparse linear systems have long played a critical role in scientific computing and engineering. Among the many direct methods, we focus on the multifrontal method. The multifrontal method uses an elimination tree (a column elimination tree for unsymmetric problems) to decompose a large sparse linear system into many smaller dense matrix computations, making it well suited to hybrid CPU-GPU computing environments. We analyze both the algorithmic and the implementation aspects to investigate how one or more GPUs can be used to accelerate the computation, and we review the development of the multifrontal method from symmetric positive definite and symmetric indefinite to unsymmetric problems.

We successfully implement the ideal algorithm for symmetric positive definite problems; this sparse factorization achieves performance close to that of dense matrix computations. The implementation for symmetric indefinite problems is similar, and we expect its efficiency to be only slightly lower. This ideal approach, however, cannot be applied to state-of-the-art unsymmetric solvers, which perform dynamic column reordering during factorization to reduce memory usage and computation. Such reordering lowers the extra memory required after factorization, but sacrifices two O(n^3) operations; the resulting lower-order computations severely degrade GPU performance and increase the volume of CPU-GPU data transfer, hurting overall efficiency. In our unsymmetric implementation we reduce the impact of this communication as much as possible, employing techniques such as high-performance memory. The same techniques are applied to the symmetric positive definite case, together with a theoretical performance analysis. After deriving a model of the total execution time, we propose methods to search for an optimal schedule that minimizes it. Theorems, proofs, computable error bounds, and numerical timings are all provided in the thesis.

Finally, for each class of methods we propose adaptive variants so that computing power can be multiplied in larger computing environments while the analytical models proposed here remain applicable. We target the specification of today's most popular tower servers, which hold up to four GPUs, thereby avoiding the discussion of transfer speeds of a different order of magnitude (network communication). For symmetric problems, we further propose a new method that avoids GPU-to-GPU communication to improve overall efficiency.
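The core dense kernel described above can be sketched as a partial Cholesky factorization of one frontal matrix: the fully-summed variables are eliminated with dense BLAS3 operations, and the Schur complement is passed up the elimination tree to the parent front. This is a minimal NumPy illustration of that kernel, not the thesis's actual implementation; the function name and the toy SPD matrix are assumptions for the example.

```python
import numpy as np

def factor_front(F, k):
    """Partial Cholesky of a dense SPD frontal matrix F.

    Eliminates the first k fully-summed variables and returns the factor
    panels plus the Schur complement contributed to the parent front.
    """
    L11 = np.linalg.cholesky(F[:k, :k])          # potrf (dense BLAS3)
    L21 = np.linalg.solve(L11, F[:k, k:]).T      # trsm: L21 = F21 * L11^{-T}
    schur = F[k:, k:] - L21 @ L21.T              # syrk: update for the parent
    return L11, L21, schur

# Toy SPD frontal matrix (illustrative only).
rng = np.random.default_rng(0)
n, k = 6, 3
M = rng.standard_normal((n, n))
F = M @ M.T + n * np.eye(n)

L11, L21, schur = factor_front(F, k)
L = np.linalg.cholesky(F)  # reference: full dense factorization
assert np.allclose(L[:k, :k], L11)
assert np.allclose(L[k:, :k], L21)
assert np.allclose(np.linalg.cholesky(schur), L[k:, k:])
```

Because all three steps are dense BLAS3 operations, each front can be offloaded to a GPU kernel, which is why the sparse factorization can approach dense efficiency.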

English Abstract


Solving large-scale sparse linear systems is at the heart of many scientific and engineering computations. Among the various direct methods, we focus on the multifrontal method in particular. The multifrontal method uses an elimination tree (a column elimination tree for unsymmetric problems) to transform a large sparse linear system into many smaller dense frontal operations, which are well suited to hybrid CPU-GPU systems. We analyze the method from both algorithmic and implementation perspectives to see how one or more GPUs can accelerate the computation, and we review the multifrontal method for symmetric positive definite (SPD), symmetric indefinite, and unsymmetric problems.

We successfully implement the ideal SPD multifrontal algorithm, which attains nearly the peak performance of dense BLAS3 routines on the GPU (MAGMA's Cholesky); the shared symmetric structure suggests a similar implementation and comparable performance for symmetric indefinite problems. Unsymmetric problems, however, are harder to implement because runtime column pivoting splits two BLAS3 routines into several BLAS2 routines, and extra communication is inevitable. Since CPU-GPU communication can easily dominate the run time, we provide several strategies, for all classes of multifrontal problems, that reduce communication and accelerate the process. Furthermore, we analyze the total execution time for SPD problems and derive a nearly optimal workload distribution for hybrid CPU-GPU cooperation.

For all problem classes, we provide scalable algorithms that adapt to multiple GPUs, up to four. By restricting ourselves to a single box server, we avoid analyzing cluster-network communication, whose speed is of a different order of magnitude from PCI-E, and focus on adapting the optimization strategies. The extension of the optimization model to the new parallel scheme closely follows the single-GPU model: new factorization and communication costs for multiple GPUs are added, and the performance bounds and properties still hold.
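The workload-distribution idea can be illustrated with a schematic timing model: a fraction of each front's work is offloaded to the GPU (paying a transfer cost), the rest stays on the CPU, and the two run concurrently, so the total time is the maximum of the two and is minimized where both finish together. The rates, the closed form, and the function below are illustrative assumptions for this sketch, not the thesis's actual model.

```python
def best_split(work, r_cpu, r_gpu, comm):
    """Schematic CPU/GPU workload split (illustrative model).

    work  : total flops of the front
    r_cpu : CPU rate (flops/s), r_gpu: GPU rate (flops/s)
    comm  : fixed CPU-GPU transfer time for the offloaded part

    t_gpu(a) = comm + a*work/r_gpu and t_cpu(a) = (1-a)*work/r_cpu;
    the total max(t_gpu, t_cpu) is minimized where the two are equal.
    """
    alpha = (work / r_cpu - comm) / (work / r_cpu + work / r_gpu)
    alpha = min(1.0, max(0.0, alpha))  # clamp: tiny fronts stay on the CPU
    total = max(comm + alpha * work / r_gpu, (1 - alpha) * work / r_cpu)
    return alpha, total

alpha, total = best_split(work=1e9, r_cpu=1e8, r_gpu=1e9, comm=0.5)
# At the optimum both devices finish at the same time.
assert abs((0.5 + alpha * 1e9 / 1e9) - (1 - alpha) * 1e9 / 1e8) < 1e-9
```

The same balancing principle extends to multiple GPUs by adding one factorization-cost and one communication-cost term per device, which is how the abstract describes extending the single-GPU model.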

