Large-Scale Logistic Regression and Linear Support Vector Machines on Spark

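Logistic regression and linear support vector machines are useful methods for learning on large-scale classification problems. However, distributed implementations of these two models have not been studied thoroughly. In addition, because the typical MapReduce framework suffers from a computational-efficiency bottleneck when running the iterative methods used in machine learning, Spark, an in-memory cluster-computing platform, has gained prominence in recent years. Its capabilities for data processing and analytics have made it a widely used framework. In this thesis, we propose a distributed Newton method and implement it on Spark. We identify and analyze the implementation details that strongly affect computation and communication time, and we propose solutions to these issues. Finally, after careful consideration and study, we implement the proposed algorithm as an efficient, publicly available tool.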

Keywords

large-scale learning; distributed computing; logistic regression; support vector machine; Newton method

Parallel Abstract

Logistic regression and linear SVM are useful methods for large-scale classification. However, their distributed implementations have not been well studied. Recently, because of the inefficiency of the MapReduce framework on iterative algorithms, Spark, an in-memory cluster-computing platform, has been proposed. It has emerged as a popular framework for large-scale data processing and analytics. In this work, we consider a distributed Newton method for solving logistic regression as well as linear SVM and implement it on Spark. We carefully examine many implementation issues that significantly affect running time and propose our solutions. After conducting thorough empirical investigations, we release an efficient and easy-to-use tool for the Spark community.

Parallel Keywords

large-scale learning; distributed computing; logistic regression; support vector machine; Newton method
