解三對角矩陣在現在自然科學及工程學裡的問題,例如微分方程、偏微分方程、或是流體動力學等問題,已成為不可或缺的一環。因為三對角矩陣本身的特殊稀疏架構,有很多針對此架構的演算法被提出,在這之前,主流是運用對角旋轉(Diagonal Pivoting)的方式解決準確性的問題。然而對角旋轉有它的限制,在某些矩陣的測試下會得到錯誤的答案。此時,找尋不同於主流或改良主流演算法的必要性與日俱增,而增廣塊狀西門諾分散式演算法(Augmented Block Cimmino Distributed Algorithm)提供了一個不同於主流的演算法。 在這篇論文中,我們研究並將增廣塊狀西門諾分散式演算法實作在GPU上。其中針對GPU的計算架構,我們應用了CUDA程式設計實作時需遵循的基本原則,例如透過轉換資料儲存格式達成GPU連續記憶體存取來增進效能,並提出了邊界填補修正方法,此方法能夠弭平大幅降低GPU效能的計算分歧,或是降低計算分歧的影響,進而使得GPU的效能獲得顯著提升。在實驗部分,我們比較了ABCD在Matlab上的準確性、ABCD在GPU上的準確性以及主流演算法的準確性,並比較了不同演算法的計算時間。
The tridiagonal solver nowadays appears as a fundamental component in scientific and engi-neering problems, such as Alternating Direction Implicit methods (ADI), fluid Simulation, and Poisson’s equation. Due to the particular sparse format of tridiagonal matrix, many algorithms of solving the system are conceived. Previously, the main stream of solving the system is by using Diagonal Pivoting to reduce the accuracy issue. But, Diagonal Pivoting has its limits and will lead to error solution while the condition number increases. Augmented Block Cimmino Distributed (ABCD) algorithm serves as another option when trying to resolve the problem accurately. In this thesis, we study and implement the ABCD algorithm on GPU. Because of the spe-cial structure of tridiagonal matrices, we investigate the boundary padding technique to eliminate the execution branches on GPU for better performance. In addition, our implementation incorpo-rates various performance optimization techniques, such as memory coalesce, to further enhance the performance. In the experiments, we evaluate the accuracy and performance of our GPU im-plementation against CPU implementation, and analyze the effectiveness of each performance op-timization technique. The performance of GPU version is about 15 times faster than that of the CPU version.