
Newton Methods for Machine Learning

Advisor: Chih-Jen Lin

Abstract


Newton methods can be applied to many supervised learning problems, but for large-scale data, optimizing with the full Hessian matrix takes a great deal of time. In this thesis, our goal is to make Newton methods practical for large-scale problems. The first part of this thesis concerns subsampled Hessian Newton methods. Recent work has proposed subsampled Hessian methods, which reduce the computational time by using only a subset of the data to compute the Hessian matrix. Unfortunately, we find that in some situations the subsampled Hessian method converges more slowly than the standard Newton method, because the update direction obtained from the subsampled Hessian is far less accurate than that obtained from the full Hessian. In this thesis, we propose techniques to improve the existing subsampled Hessian method. The main idea is that, to better minimize the objective function, we adjust the update direction obtained from the subsampled Hessian by additionally solving a two-variable sub-problem, and we provide a theoretical convergence proof. Experiments on logistic regression, linear support vector machines, maximum entropy, and deep neural networks confirm that the proposed techniques significantly reduce the training time when the subsampled Hessian is used. For large-scale data classification, the method proposed in this part becomes an effective alternative to the standard Newton method.

In the second part of this thesis, we study Newton methods for training deep neural networks on distributed systems. Because a large number of weights lie between any two adjacent layers, deep learning involves a difficult non-convex optimization problem. For training on large-scale data, distributed training is necessary, but it makes the evaluation of the objective function, gradient, and Hessian expensive; in particular, communication and synchronization become the bottleneck. In this thesis, we propose a novel distributed Newton method for deep learning. First, we store the Jacobian matrix in a distributed environment and apply a diagonalization technique to approximate the Hessian, so that no communication between machines is needed. Second, to reduce the synchronization cost across machines, we terminate the procedure for finding the Newton direction even if some machines have not finished their own conjugate gradient iterations. Third, we apply the subsampled Hessian technique from the first part to reduce each machine's computation and communication time. Experiments in this part confirm that the proposed distributed Newton method can be applied efficiently to deep learning.
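To make the adjustment step concrete, the following is a minimal LaTeX sketch of the kind of two-variable sub-problem the abstract describes; the notation (current subsampled direction d, previous direction \bar{d}, coefficients beta_1 and beta_2) is ours and is not taken from the thesis body:

    \min_{\beta_1,\,\beta_2}\;
        \nabla f(w)^\top (\beta_1 d + \beta_2 \bar{d})
        + \frac{1}{2}\,(\beta_1 d + \beta_2 \bar{d})^\top
          \nabla^2 f(w)\,(\beta_1 d + \beta_2 \bar{d})

Because the objective is quadratic in (beta_1, beta_2), the sub-problem reduces to a 2x2 linear system whose coefficients require only two Hessian-vector products, so the adjustment is cheap relative to a full Newton solve.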

Parallel Abstract


Newton methods can be applied in many supervised learning approaches. However, for large-scale data, the use of the whole Hessian matrix can be time-consuming. In this thesis, we aim to make Newton methods practically viable for various large-scale scenarios. The first part of this thesis is about subsampled Newton methods. Recently, subsampled Newton methods have been proposed to reduce the computational time by using only a subset of the data to calculate an approximation of the Hessian matrix. Unfortunately, we find that in some situations the running speed is worse than that of the standard Newton method because cheaper but less accurate search directions are used. In this thesis, we propose some novel techniques to improve the existing subsampled Hessian Newton method. The main idea is to solve a two-dimensional sub-problem per iteration to adjust the search direction so that it better minimizes the second-order approximation of the function value. We prove the theoretical convergence of the proposed method. Experiments on logistic regression, linear SVM, maximum entropy, and deep networks indicate that our techniques significantly reduce the running time of the subsampled Hessian Newton method. The resulting algorithm becomes a compelling alternative to the standard Newton method for large-scale data classification.

The second part of this thesis studies Newton methods for the distributed training of deep neural networks. Deep learning involves a difficult non-convex optimization problem because of the large number of weights between any two adjacent layers of a deep structure. In large-scale scenarios, distributed training is needed, but the calculation of the function, gradient, and Hessian is expensive. In particular, the communication and synchronization cost becomes the bottleneck. In this thesis, we propose a novel distributed Newton method for deep learning. First, to reduce the communication cost, we consider storing the Jacobian matrix in a distributed environment, and then propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Third, we consider the subsampled Hessian for reducing the running time as well as the communication cost. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks.
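As an illustration of where the subsample enters in the first part, below is a minimal sketch (not the thesis code) of one Newton-CG step for L2-regularized logistic regression: the gradient uses all data, while Hessian-vector products use only a random subset. All names and defaults (subsampled_newton_step, sample_frac, cg_iters) are our own assumptions, and labels y are assumed to be in {-1, +1}:

    import numpy as np

    def subsampled_newton_step(w, X, y, C=1.0, sample_frac=0.1,
                               cg_iters=20, seed=0):
        """One Newton-CG step: full-data gradient, subsampled Hessian."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))     # P(correct margin)
        grad = w + C * (X.T @ ((sigma - 1.0) * y))     # full gradient

        # Subsampled Hessian: H ~ I + C (n/|S|) X_S^T D_S X_S
        S = rng.choice(n, size=max(1, int(sample_frac * n)), replace=False)
        Xs, Ds = X[S], sigma[S] * (1.0 - sigma[S])
        scale = C * n / len(S)

        def hess_vec(v):                               # H v without forming H
            return v + scale * (Xs.T @ (Ds * (Xs @ v)))

        # Truncated conjugate gradient for H d = -grad
        d = np.zeros_like(w)
        r = -grad
        p = r.copy()
        rs = r @ r
        for _ in range(cg_iters):
            Hp = hess_vec(p)
            alpha = rs / (p @ Hp)
            d += alpha * p
            r -= alpha * Hp
            rs_new = r @ r
            if np.sqrt(rs_new) < 1e-8:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return w + d                                   # line search omitted

A practical implementation would add a line search or trust region and the two-variable direction adjustment sketched earlier; this fragment only shows how the gradient stays exact while the Hessian-vector products are restricted to the subsample.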

