
A Scalable Distributed Data Storage System

A Scalable Storage System for Big Data Analysis

Advisor: 李鎮宜

Abstract


In recent years, machine learning has been applied across many domains. Because machine learning typically analyzes big data, both computation capability and storage must be optimized through a suitable architecture before the run time of machine learning algorithms can be reduced effectively. To address the problems of power consumption and speed, we propose a hardware/software co-design platform with machine learning capability and scalable storage, and we build a scalable data storage platform that achieves acceleration and lower power consumption, combined with a heterogeneous analysis acceleration solution. This thesis describes in detail the platform built on this architecture and its parameter settings, ports the well-known clustering algorithm K-means onto the platform, and finally demonstrates the platform's effectiveness by comparing the measured time and energy of the CPU cluster against the CPU+FPGA cluster. Unlike the CPU+GPU multi-core architecture proposed by Microsoft in 2009 [1], the solution proposed here is a CPU+FPGA cluster acceleration architecture. It achieves comparable acceleration, and a further clear advantage is that it can serve as a development prototype for an ASIC and can predict the chip's actual speedup more accurately. The FPGA on this platform runs at 100 MHz, but simulation under the UMC 90 nm process shows that the design can reach 200 MHz, which gives us a reference figure. Finally, compared with the original, unaccelerated CPU cluster, the platform achieves a speedup of roughly 25x.

English Abstract


Recently, machine learning has been widely used in various areas. Since machine learning fundamentally targets big data analysis, which requires a large amount of computation and storage, machine learning can be efficiently accelerated only if both computation ability and storage equipment are properly optimized. We explore a hardware/software co-design platform for big data analysis with machine learning capability and storage scalability to solve the two major problems in machine learning: power and speed. To verify this concept, we built a scalable storage system on Hadoop which adopts a heterogeneous architecture (CPU+FPGA) for acceleration and power reduction. This thesis introduces the platform's parameter settings in detail, ports the well-known clustering algorithm K-means onto the platform, and finally presents a profiled comparison between the CPU cluster and the CPU+FPGA cluster in speed and power. Based on these profiling results, we can claim that the architecture works. This architecture differs from the CPU+GPU multi-core cluster solution proposed by Microsoft in 2009. The proposed solution achieves comparable acceleration; a further advantage is that the architecture can serve as a prototype for an ASIC and offers a rather accurate prediction of the acceleration after tape-out. On this platform we implement the circuit on the FPGA at 120 MHz, yet the same circuit passes a 200 MHz test when simulated in UMC 90 nm technology, which gives us a prediction of the achievable speed. The final speedup is around 25 times over the unaccelerated A9 CPU cluster.
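The thesis ports K-means onto the platform. For reference, below is a minimal NumPy sketch of plain Lloyd's K-means; this is not the thesis's Hadoop/FPGA implementation, and the function name, farthest-point seeding, and parameters are illustrative assumptions. The distance computation in the assignment step is the kind of data-parallel kernel that a CPU+FPGA cluster would typically offload to hardware.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means with farthest-point seeding (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialization: start from one random point, then
    # repeatedly add the point farthest from all centers chosen so far.
    centers = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dist.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # Assignment step: nearest center per point. This all-pairs
        # distance computation is the hot loop suited to FPGA offload.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels
```

On two well-separated blobs this converges in one or two iterations; in a distributed setting (e.g. MapReduce on Hadoop, as in the thesis), the assignment step maps over data partitions and the update step reduces partial sums per cluster.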

References


[17] Daniel D. Gajski, Embedded System Design: Modeling, Synthesis and Verification.
[3] A. Coates, Honglak Lee, Andrew Y. Ng (Computer Science Department, Stanford), "Learning Feature Representations with K-means," in G. Montavon, G. B. Orr, K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, 2nd edn., Springer LNCS 7700, 2012.
[5] Thomas A. Louis, "Finding the Observed Information Matrix when Using the EM Algorithm," Journal of the Royal Statistical Society, Series B, 1982.
