
Artificial Neural Networks Nonlinear Least Squares Learning

類神經網路非線性最小平方學習法

Advisor: 張智星

Abstract


Machine learning is a complicated function of many elements. Our ultimate goal is to realize an intelligent agent that can learn to do the right things through interaction with its environment. As a dissertation theme, we narrow the vast scope of agent learning down to one suitable subsidiary topic, namely artificial neural-network (NN) learning. Although its fundamental spirit comes from biology, the methods adopted here are those of modern computers and the various branches of information science. The most important part of the dissertation lies in algorithmic and computational results applicable to NN learning, especially supervised learning, in which the NN model is optimized to produce designated output responses to given inputs. A straightforward formulation generally leads to what we call the ``neural networks nonlinear least squares problem'' (formalized in the sketch below). We attack the posed problem with modern numerical linear algebra methods, especially those with a clear connection to neural networks, and characterize the problem in terms of data sparsity, symmetric stagewise architecture, and parameter separability; these key features are often neglected in the NN literature. We explain why sparsity matters in multiple-response problems, which in general involve two typical sparse matrices, and show that exploiting data sparsity yields efficient learning algorithms applicable to machine learning and optimization problems. We next turn to the symmetric stagewise architecture embedded in a multi-layer perceptron: a general multi-layer feed-forward NN forms a (symmetric) cone in the parameter space, and the theory of discrete-stage optimal control is used to derive advanced learning strategies. Applied to a classical two-class classification problem, the new learning strategy markedly reduces sensitivity to the choice of initial parameter values, which is a noteworthy result.
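The ``neural networks nonlinear least squares problem'' referred to above is not spelled out on this record page. The following is a minimal sketch of the standard formulation under our own notation (training pairs (x_p, d_p), network map f, residual vector r, Jacobian J), not the dissertation's text:

```latex
% Minimal sketch (our notation): supervised NN learning posed as a
% nonlinear least squares problem over the parameter vector w.
\[
  \min_{\mathbf{w}}\; E(\mathbf{w})
    \;=\; \tfrac{1}{2}\,\lVert \mathbf{r}(\mathbf{w}) \rVert^{2}
    \;=\; \tfrac{1}{2}\sum_{p=1}^{P}
          \bigl\lVert \mathbf{f}(\mathbf{x}_{p};\mathbf{w}) - \mathbf{d}_{p} \bigr\rVert^{2},
\]
\[
  \nabla E = J^{\mathsf{T}}\mathbf{r}, \qquad
  \nabla^{2} E = \underbrace{J^{\mathsf{T}}J}_{\text{Gauss--Newton term}}
               + \underbrace{\textstyle\sum_{i} r_{i}\,\nabla^{2} r_{i}}_{\text{residual curvature}},
  \qquad J = \frac{\partial \mathbf{r}}{\partial \mathbf{w}}.
\]
% f(x_p; w) is the network output for input x_p; d_p is the designated target.
```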

Parallel Abstract (English)


Machine learning is a complicated function of many elements. Our ultimate goal is to realize an intelligent agent that can learn to do the right things from reinforcement through interactions with the environment. As a dissertation theme, we have narrowed down the vast scope of ``agent learning'' to a small yet indispensable ``brain modeling'' subsidiary, called ``artificial'' neural-network (NN) learning. It is quite artificial because, although its fundamental concept is biologically inspired, our approaches (described in this dissertation) are tied to efficient implementations on modern computers, designed from engineering and computer science perspectives.

Our primary interest in this dissertation resides in the development of algorithmic and computational results applicable to NN learning, especially to supervised learning, where our NN model is optimized to learn the designated outputs in response to certain input stimuli. A straightforward formulation often gives rise to what we call ``neural networks nonlinear least squares problems.'' We attack the posed problems in conjunction with modern numerical linear algebra techniques specially geared to conspicuous characteristics arising in NN learning; specifically, we identify and exploit data sparsity, symmetric stagewise architecture, and parameter separability. These key features are often neglected in the NN literature.

We begin by explaining how sparsity issues arise in multiple-response problems, which commonly entail two typical sparse matrix formats: a block-arrow Hessian matrix and a block-angular Jacobian matrix of the residual vector (see the illustrative sketch below). Exploiting this data sparsity leads to very efficient learning algorithms suitable for a wide variety of machine learning and optimization problems: in small- or medium-scale problems, sparsity exploitation enables an efficient factorization of the Hessian matrix, while in large-scale problems it provides a sparse matrix-vector multiply for extracting Hessian information in the Krylov subspace. The latter method comes into play as a new learning mode, ``iterative batch learning,'' implementable in either full-batch or mini-batch (i.e., block) mode.

We next direct our special attention to a symmetric ``stagewise'' structure embedded in a so-called multi-layer perceptron (MLP), a popular feed-forward NN model with multiple layers (or stages); geometrically, an MLP forms a (symmetric) cone in the parameter space. The theory of discrete-stage optimal control dictates advanced learning strategies such as the introduction of ``stage costs'' in addition to the terminal cost, leading to what we call ``hidden-node teaching.'' A remarkable result obtained by this new learning scheme is that it can develop insensitivity to initial parameters in a classical two-class classification parity benchmark problem. More significantly, the theory serves to exploit the nice multi-stage symmetric structure for evaluating the Hessian matrix, just as the well-known (first-order) backpropagation computes the gradient vector in a stagewise fashion. Our newly developed ``stagewise'' second-order backpropagation algorithm, derived from second-order optimal control theory, can evaluate the full Hessian matrix faster than ``standard'' methods that obtain only the Gauss-Newton Hessian matrix (e.g., see the Matlab NN-toolbox for such a procedure); this is a significant breakthrough in the nonlinear least squares sense.
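To make the sparsity claim above concrete, here is a small NumPy illustration of our own (not code from the dissertation): a hypothetical multiple-response least squares problem in which each response block owns private parameters and all blocks share a common set of hidden-layer parameters. The Jacobian of the residual vector is then block-angular, and the Gauss-Newton Hessian J^T J exhibits the block-arrow pattern. The sizes m, q, p_out, and p_hid are arbitrary assumptions for the sketch.

```python
import numpy as np

# Hypothetical sizes: m residuals per response, q responses,
# p_out private weights per response, p_hid shared hidden weights.
m, q, p_out, p_hid = 30, 3, 4, 5
rng = np.random.default_rng(0)

# Block-angular Jacobian of the residual vector: each response's residuals
# touch only its own weights (diagonal blocks) plus the shared hidden
# weights (border block on the right).
blocks = [rng.standard_normal((m, p_out)) for _ in range(q)]
border = rng.standard_normal((q * m, p_hid))

J = np.zeros((q * m, q * p_out + p_hid))
for k, B in enumerate(blocks):
    J[k * m:(k + 1) * m, k * p_out:(k + 1) * p_out] = B
J[:, q * p_out:] = border

# Gauss-Newton Hessian J^T J is block-arrow: a block-diagonal core
# bordered by dense rows/columns for the shared weights.
H = J.T @ J
print((np.abs(H) > 1e-12).astype(int))  # visualize the sparsity pattern
```

The zero off-diagonal blocks between different responses' private weights are what a sparse factorization or a sparse matrix-vector multiply can exploit, in the spirit of the algorithms described above.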
In reality, the full Hessian matrix may not be positive (semi-)definite during the learning phase, but the widely employed trust-region nonlinear optimization method can deal excellently with an indefinite Hessian, since the underlying theory has thrived on exploiting ``negative curvature'' over the last two decades. The trust-region approach based on the full Hessian matrix is of immense value in solving real-world ``large-residual'' nonlinear least squares problems, because the matrix of second derivatives is important to efficiency. Consequently, our stagewise second-order backpropagation approach should prove practically useful for general nonlinear optimization in a broader sense, as long as the posed problem possesses a stagewise constitution.

Furthermore, a model of mixed linear and nonlinear parameters may become of great concern in various contexts of machine learning. In numerical linear algebra, the variable projection (VP) algorithm has been the standard approach to ``separable'' nonlinear (i.e., mixed linear and nonlinear) least squares problems since the early 1970s (a generic sketch follows at the end of this abstract). For the sake of second-order algorithms, we desire to use as much Hessian information as possible while exploiting the structural properties associated with a given NN model. Looking in this spirit toward further exploitation of parameter separability, we have endeavored to devise an extension of VP algorithms that employs the full Hessian matrix. The resulting method aims at solving large-residual machine learning problems in which linear and nonlinear parameters co-exist in a given learning model. Although this approach still needs further investigation, it may also help in optimizing other machine learning models such as generalized linear discriminant functions.

Special structure should always be exploited when it arises. Multi-stage NN learning is an excellent challenge, for it exhibits a great deal of structure; its principal ingredients turn out to be sparsity, symmetry, stagewise architecture, and separability. Guided by this principle of structure exploitation, we emphasize the rigorous mathematical theory of optimal control as well as the practical use of modern numerical linear algebra and nonlinear numerical optimization for algorithmic design. Our proposed learning methods could apply broadly to learning machines in yet unexplored domains and therefore have enormous potential for diverse future extensions.
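As a hedged illustration of the classical variable projection idea referred to above, the sketch below fits a generic separable model (a sum of exponentials, our own example rather than the dissertation's extended algorithm): the linear coefficients are eliminated by an inner linear least squares solve, so the outer optimizer sees only the nonlinear parameters. The names `basis` and `projected_residual` are hypothetical helpers for this sketch.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical separable model: y ~ sum_k a_k * exp(-b_k * t),
# with linear parameters a and nonlinear parameters b.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 4.0, 200)
y = 2.0 * np.exp(-1.3 * t) + 0.5 * np.exp(-0.4 * t) + 0.01 * rng.standard_normal(t.size)

def basis(b):
    # Columns of the design matrix depend only on the nonlinear parameters b.
    return np.exp(-np.outer(t, b))            # shape (len(t), len(b))

def projected_residual(b):
    # Variable projection: eliminate the linear parameters by solving a
    # linear least-squares subproblem for the current nonlinear parameters.
    Phi = basis(b)
    a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi @ a - y

# Optimize only the nonlinear parameters; recover the linear ones afterwards.
sol = least_squares(projected_residual, x0=[1.0, 0.1])
a_hat, *_ = np.linalg.lstsq(basis(sol.x), y, rcond=None)
print("nonlinear:", sol.x, "linear:", a_hat)
```

The dissertation's proposed extension additionally brings full Hessian information into this separable framework for large-residual problems, which the plain first-order sketch above does not attempt.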

