Improving Hadoop System Optimization with Machine Learning

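With the rise of big data analytics, the systems that support such large-scale data processing, such as distributed systems, have drawn increasing attention. As these systems are built on ever-larger machine clusters, system administrators must devote more effort to managing them. Beyond keeping the system stable enough to support a wide variety of data analysis applications, administrators also need to optimize it so that performance improves effectively, raising system utilization and reducing the running time of these applications. For a large-scale cluster, however, tuning system parameters is complex: administrators must handle the interactions among machines and must also understand the computational characteristics of each application in order to tune the parameters accordingly. Existing parameter-tuning methods suffer from drawbacks such as low usability and restrictions on which parameters can be tuned. Building on these existing methods, this study uses machine learning to address these problems and lift these restrictions, further improving system performance.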

Keywords

big data; distributed system; machine learning; global optimization; random sampling

Parallel Abstract

Big data has emerged in recent years, and systems able to support such large-scale data analysis are receiving more attention. Distributed systems such as Hadoop are the most commonly used for this analysis. However, as the cluster scales out, it becomes increasingly difficult for system administrators to manage the whole system. Administrators must keep the system stable enough to execute applications reliably. They also need to optimize the system to improve performance, increase utilization, and reduce application execution latency. Configuration is the most important issue in system optimization, and tuning configuration parameters involves many complicated factors: it requires understanding the interactions between physical machines and the behavior of each application. The current methods, rule-based and cost-based optimization, have drawbacks such as low usability and a restricted configuration parameter space. Our work exploits machine learning to address these drawbacks and improve performance.

Parallel Keywords

big data; distributed system; machine learning; global optimization; random sampling
