
建構於可程式化邏輯板實現硬體加速之Hadoop 叢集用於資料探勘演算法

Hadoop Cluster with FPGA-based Hardware Accelerators for Data Mining Algorithms

Advisor: 鍾菁哲

Abstract


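With the rise of the Internet of Things (IoT), the volume of data that people exchange with servers over the network keeps growing. As big data evolves, extracting valuable information from massive data sets has become an important problem, and data mining algorithms are therefore widely used in many fields. Handling such volumes of data and analyzing different data types are the challenges that big data faces.

To overcome the limits of storage devices and computing power, distributed systems and cloud computing have become increasingly common in recent years. Running computation in parallel across a cluster of servers overcomes the bottleneck of CPU speed, and linking multiple servers increases storage capacity, compensating for the limited space of a single device. To improve performance further, a hardware acceleration platform can take on part of the computational load when tasks are processed. The most common hardware accelerators are graphics processing units (GPUs) and field-programmable gate arrays (FPGAs), which typically contain a large number of arithmetic units for performing dense, independent operations in parallel.

This thesis proposes an integrated hardware and software solution that improves both the storage platform and the computing capability for massive data. FPGA-based hardware accelerators are attached to a Hadoop system, which exploits the Hadoop distributed file system (HDFS) and the parallel computing advantages of MapReduce, and a network router is used to improve scalability. The result is an integrated Hadoop and FPGA acceleration platform for data mining algorithms. We use two of the most common data mining algorithms, K-means clustering and K-nearest neighbor (KNN) classification, to demonstrate the advantages of this integrated platform.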

Parallel Abstract


With the growing popularity of the Internet of Things (IoT), the amount of data that people exchange via web servers is increasing enormously. As big data evolves, it is important to extract valuable information from massive data sets. Therefore, data mining algorithms are widely used in various fields. The "5Vs" (volume, velocity, variety, veracity, and value) are the challenges of big data processing and analysis. To overcome the limitations of storage devices and computing capability, distributed systems and cloud computing have become popular in recent years. A parallel computing cluster built from multiple servers can overcome the bottleneck of CPU computing capability. In addition, distributed systems provide greater storage capacity, which compensates for the limited disk space of a single device. Graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) are potential hardware accelerators; they usually have a large number of arithmetic units for performing dense, independent operations in parallel to improve performance. In this thesis, an implementation of the K-means clustering algorithm and the K-nearest neighbor algorithm on a Hadoop cluster with FPGA-based hardware accelerators is presented. The proposed design follows the MapReduce programming model and uses the Hadoop Distributed File System (HDFS) to store large data sets. The proposed FPGA-based hardware accelerator, which speeds up these algorithms, is implemented on Xilinx VC707 evaluation boards (EVBs).
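
To make the MapReduce formulation of K-means concrete, the following is a minimal single-process sketch in Python. It is not the thesis implementation: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical, no Hadoop or FPGA code is involved, and the shuffle phase is simulated with a dictionary. The distance computation in the map step is the kind of dense, independent arithmetic that the abstract says accelerators are suited to; that the thesis offloads exactly this step is an assumption.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One K-means iteration: map, group by key (simulated shuffle), reduce."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    start = [(1.0, 1.0), (9.0, 9.0)]
    print(kmeans_iteration(data, start))
```

In a real Hadoop job the map calls would run on different nodes over HDFS blocks, and the iteration would repeat until the centroids stop moving; this sketch shows only how the algorithm splits into independent map work and a per-cluster reduce.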

