{"title": "On the Optimality of Incremental Neural Network Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 295, "page_last": 301, "abstract": null, "full_text": "On the optimality of incremental neural network algorithms \n\nRon Meir* \nDepartment of Electrical Engineering \nTechnion, Haifa 32000, Israel \nrmeir@dumbo.technion.ac.il \n\nVitaly Maiorov\u2020 \nDepartment of Mathematics \nTechnion, Haifa 32000, Israel \nmaiorov@tx.technion.ac.il \n\nAbstract \n\nWe study the approximation of functions by two-layer feedforward neural networks, focusing on incremental algorithms which greedily add units, estimating single unit parameters at each stage. As opposed to standard algorithms for fixed architectures, the optimization at each stage is performed over a small number of parameters, mitigating many of the difficult numerical problems inherent in high-dimensional non-linear optimization. We establish upper bounds on the error incurred by the algorithm, when approximating functions from the Sobolev class, thereby extending previous results which only provided rates of convergence for functions in certain convex hulls of functional spaces. By comparing our results to recently derived lower bounds, we show that the greedy algorithms are nearly optimal. Combined with estimation error results for greedy algorithms, a strong case can be made for this type of approach. \n\n1 Introduction and background \n\nA major problem in the application of neural networks to real world problems is the excessively long time required for training large networks of a fixed architecture. Moreover, theoretical results establish the intractability of such training in the worst case [9][4]. Additionally, the problem of determining the architecture and size of the network required to solve a certain task is left open. 
Due to these problems, several authors have considered incremental algorithms for constructing the network by the addition of hidden units, and estimation of each unit's parameters incrementally. These approaches possess two desirable attributes: first, the optimization is done step-wise, so that only a small number of parameters need to be optimized at each stage; and second, the structure of the network is established concomitantly with the learning, rather than being specified in advance. However, until recently these algorithms have been rather heuristic in nature, as no guaranteed performance bounds had been established. Note that while there has been a recent surge of interest in these types of algorithms, they in fact date back to work done in the early seventies (see [3] for a historical survey). \n\n*This work was supported in part by a grant from the Israel Science Foundation. \n\u2020The author was partially supported by the Center for Absorption in Science, Ministry of Immigrant Absorption, State of Israel. \n\n\f296 \n\nR. Meir and V. Maiorov \n\nThe first theoretical result establishing performance bounds for incremental approximations in Hilbert space was given by Jones [8]. This work was later extended by Barron [2], and applied to neural network approximation of functions characterized by certain conditions on their Fourier coefficients. The work of Barron has been extended in two main directions. First, Lee et al. [10] have considered approximating general functions using Hilbert space techniques, while Donahue et al. [7] have provided powerful extensions of Jones' and Barron's results to general Banach spaces. One of the most impressive results of the latter work is the demonstration that iterative algorithms can, in many cases, achieve nearly optimal rates of convergence, when approximating convex hulls. 
\n\nWhile this paper is concerned mainly with issues of approximation, we comment that it is highly relevant to the statistical problem of learning from data in neural networks. First, Lee et al. [10] give estimation error bounds for algorithms performing incremental optimization with respect to the training error. Under certain regularity conditions, they are able to achieve rates of convergence comparable to those obtained by the much more computationally demanding algorithm of empirical error minimization. Moreover, it is well known that upper bounds on the approximation error are needed in order to obtain performance bounds, both for parametric and nonparametric estimation, where the latter is achieved using the method of complexity regularization. Finally, as pointed out by Donahue et al. [7], lower bounds on the approximation error are crucial in establishing worst case speed limitations for learning. \n\nThe main contribution of this paper is as follows. For functions belonging to the Sobolev class (see definition below), we establish, under appropriate conditions, near-optimal rates of convergence for the incremental approach, and obtain explicit bounds on the parameter values of the network. The latter bounds are often crucial for establishing estimation error rates. In contrast to the work in [10] and [7], we characterize approximation rates for functions belonging to standard smoothness classes, such as the Sobolev class. The former work establishes rates of convergence with respect to the convex hulls of certain subsets of functions, which do not relate in any simple way to standard functional classes (such as Lipschitz, Sobolev, H\u00f6lder, etc.). As far as we are aware, the results reported here are the first such bounds for incremental neural network procedures. A detailed version of this work, complete with detailed proofs, is available in [13]. 
\n\n2 Problem statement \n\nWe make use of the nomenclature and definitions from [7]. Let $\mathcal{H}$ be a Banach space of functions with norm $\|\cdot\|$. For concreteness we assume henceforth that the norm is given by the $L_q$ norm, $1 < q < \infty$, denoted by $\|\cdot\|_q$. Let $\mathrm{lin}_n\mathcal{H}$ consist of all sums of the form $\sum_{i=1}^n a_i g_i$, $g_i \in \mathcal{H}$ and arbitrary $a_i$, and $\mathrm{co}_n\mathcal{H}$ is the set of such sums with $a_i \in [0,1]$ and $\sum_{i=1}^n a_i = 1$. The distances, measured in the $L_q$ norm, from a function $f$ are given by \n\n$\mathrm{dist}(\mathrm{lin}_n\mathcal{H}, f) = \inf\{\|h - f\|_q : h \in \mathrm{lin}_n\mathcal{H}\}$, \n$\mathrm{dist}(\mathrm{co}_n\mathcal{H}, f) = \inf\{\|h - f\|_q : h \in \mathrm{co}_n\mathcal{H}\}$. \n\nThe linear span of $\mathcal{H}$ is given by $\mathrm{lin}\,\mathcal{H} = \cup_n \mathrm{lin}_n\mathcal{H}$, while the convex hull of $\mathcal{H}$ is $\mathrm{co}\,\mathcal{H} = \cup_n \mathrm{co}_n\mathcal{H}$. We follow standard notation and denote closures of sets by a bar, e.g. $\overline{\mathrm{co}}\,\mathcal{H}$ is the closure of the convex hull of $\mathcal{H}$. In this work we focus on the special case where \n\n$\mathcal{H}_\eta \triangleq \{g : g(x) = c\,\sigma(a^T x + b),\ |c| \le \eta,\ \|\sigma(\cdot)\|_q \le 1\}, \qquad (1)$ \n\ncorresponding to the basic building blocks of multilayer neural networks. The restriction $\|\sigma(\cdot)\|_q \le 1$ is not very demanding, as many sigmoidal functions can be expressed as a sum of functions of bounded norm. It should be obvious that $\mathrm{lin}_n\mathcal{H}_\eta$ corresponds to a two-layer neural network with a linear output unit and $\sigma$-activation functions in the single hidden layer, while $\mathrm{co}_n\mathcal{H}_\eta$ is equivalent to a restricted form of such a network, where restrictions are placed on the hidden-to-output weights. In terms of the definitions introduced above, the by now well known property of universal function approximation over compacta can be stated as $\overline{\mathrm{lin}}\,\mathcal{H} = C(M)$, where $C(M)$ is the class of continuous real valued functions defined over $M$, a compact subset of $\mathbb{R}^d$. A necessary and sufficient condition for this has been established by Leshno et al. [11], and essentially requires that $\sigma(\cdot)$ be locally integrable and non-polynomial. 
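As a concrete toy illustration of the classes just defined (with the assumed choices $\sigma = \tanh$ and $\eta = 1$; any bounded activation would do, and the unit parameters below are arbitrary): an element of $\mathrm{co}_n\mathcal{H}$ is a convex combination of dictionary units, whereas $\mathrm{lin}_n\mathcal{H}$ permits arbitrary output weights.

```python
import numpy as np

# Dictionary units g_i(x) = tanh(a_i * x + b_i) on a 1-d grid;
# the parameters (a_i, b_i) are arbitrary illustrative choices.
x = np.linspace(-1.0, 1.0, 101)
units = np.stack([np.tanh(a * x + b)
                  for a, b in [(1.0, 0.0), (2.0, 0.5), (-1.5, 0.2)]])

a_convex = np.array([0.2, 0.5, 0.3])           # a_i in [0, 1], sum a_i = 1
h_convex = a_convex @ units                    # an element of co_3 H
h_linear = np.array([4.0, -2.5, 0.0]) @ units  # in lin_3 H, not in co_3 H

# A convex combination of functions bounded by 1 stays bounded by 1,
# while a general linear combination need not.
max_convex = np.max(np.abs(h_convex))
max_linear = np.max(np.abs(h_linear))
```

This bounded-output property of $\mathrm{co}_n\mathcal{H}_\eta$ is precisely the restriction on the hidden-to-output weights mentioned above.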
We comment that if $\eta = \infty$ in (1), and $c$ is unrestricted in sign, then $\overline{\mathrm{co}}\,\mathcal{H}_\infty = \overline{\mathrm{lin}}\,\mathcal{H}_\infty$. The distinction becomes important only if $\eta < \infty$, in which case $\mathrm{co}\,\mathcal{H}_\eta \subset \mathrm{lin}\,\mathcal{H}_\eta$. \n\nFor the purpose of incremental approximation, it turns out to be useful to consider the convex hull $\mathrm{co}\,\mathcal{H}$, rather than the usual linear span, as powerful algorithms and performance bounds can be developed in this case. In this context several authors have considered bounds for the approximation of a function $f$ belonging to $\overline{\mathrm{co}}\,\mathcal{H}$ by sequences of functions belonging to $\mathrm{co}_n\mathcal{H}$. However, it is not clear in general how well convex hulls of bounded functions approximate general functions. One contribution of this work is to show how one may control the rate of growth of the bound $\eta$ in (1), so that general functions, belonging to certain smoothness classes (e.g. Sobolev), may be well approximated. In fact, we show that the incremental approximation scheme described below achieves nearly optimal approximation error for functions in the Sobolev space. \n\nFollowing Donahue et al. [7], we consider $\varepsilon$-greedy algorithms. Let $\varepsilon = (\varepsilon_1, \varepsilon_2, \dots)$ be a positive sequence, and similarly for $(\alpha_1, \alpha_2, \dots)$, $0 < \alpha_n < 1$. A sequence of functions $h_1, h_2, \dots$ is $\varepsilon$-greedy with respect to $f$ if for $n = 0, 1, 2, \dots$, \n\n$\|h_{n+1} - f\|_q < \inf\{\|\alpha_n h_n + (1 - \alpha_n) g - f\|_q : g \in \mathcal{H}_\eta\} + \varepsilon_n, \qquad (2)$ \n\nwhere we set $h_0 = 0$. For simplicity we set $\alpha_n = (n-1)/n$, although other schemes are also possible. It should be clear that at each stage $n$, the function $h_n$ belongs to $\mathrm{co}_n\mathcal{H}_\eta$. Observe also that at each step, the infimum is taken with respect to $g \in \mathcal{H}_\eta$, the function $h_n$ being fixed. In terms of neural networks, this implies that the optimization over each hidden unit's parameters $(a, b, c)$ is performed independently of the others. 
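The greedy step (2) can be sketched in code. The following is a minimal illustration, not the authors' construction: the per-stage minimization over $(a, b, c)$ is replaced by a crude random candidate search, standing in for the approximate minimization permitted by the slack $\varepsilon_n$, and the target $f$, the $\tanh$ activation, and all sampling distributions are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_fit(x, f, n_units=8, n_candidates=500):
    # Epsilon-greedy incremental approximation in the spirit of Eq. (2):
    # at stage n the current approximant h_{n-1} is frozen and only the
    # new unit's parameters (a, b, c) are searched.
    h = np.zeros_like(f)  # h_0 = 0
    for n in range(1, n_units + 1):
        alpha = (n - 1) / n  # alpha_n = (n - 1)/n, as in the text
        best_err, best_g = np.inf, None
        for _ in range(n_candidates):
            a = rng.normal(size=x.shape[1])
            b = rng.normal()
            c = rng.uniform(-1.0, 1.0)   # |c| <= eta, with eta = 1
            g = c * np.tanh(x @ a + b)   # one candidate unit from H_eta
            err = np.mean((alpha * h + (1 - alpha) * g - f) ** 2)
            if err < best_err:
                best_err, best_g = err, g
        h = alpha * h + (1 - alpha) * best_g  # h_n lies in co_n H_eta
    return h

# Toy target: a smooth function sampled on [0, 1]^2.
x = rng.uniform(size=(200, 2))
f = np.sin(2.0 * x[:, 0]) * np.cos(x[:, 1])
h = greedy_fit(x, f)
final_err = np.mean((h - f) ** 2)
```

Each stage touches only a handful of parameters, which is the computational attraction discussed above; a practical implementation would replace the random search with a local optimizer over $(a, b, c)$.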
We note in passing that, while this greatly facilitates the optimization process in practice, no theoretical guarantee can be made as to the convexity of the single-node error function (see [1] for counter-examples). The variables $\varepsilon_n$ are slack variables, allowing the extra freedom of only approximate minimization. In this paper we do not optimize over $\alpha_n$, but rather fix the sequence in advance, forfeiting some generality at the price of a simpler presentation. In any event, the rates we obtain are unchanged by such a restriction. \n\nIn the sequel we consider $\varepsilon$-greedy approximations of smooth functions belonging to the Sobolev class of functions, \n\n$W_2^r = \{f : \max_{0 \le |k| \le r} \|D^k f\|_2 \le 1\},$ \n\nwhere $k = (k_1, \dots, k_d)$, $k_i \ge 0$ and $|k| = k_1 + \cdots + k_d$. Here $D^k$ is the partial derivative operator of order $k$. All functions are defined over a compact domain $K \subset \mathbb{R}^d$. \n\n3 Upper bound for the $L_2$ norm \n\nFirst, we consider the approximation of functions from $W_2^r$ using the $L_2$ norm. In distinction with other $L_q$ norms, there exists an inner product in this case, defined through $(f, f) = \|f\|_2^2$. This simplification is essential to the proof in this case. \n\nWe begin by recalling a result from [12], demonstrating that any function in $L_2$ may be exactly expressed as a convex integral representation of the form \n\n$f(x) = Q \int h(x, \theta)\, w(\theta)\, d\theta, \qquad (3)$ \n\nwhere $0 < Q < \infty$ depends on $f$, and $w(\theta)$ is a probability density function (pdf) with respect to the multi-dimensional variable $\theta$. Thus, we may write $f(x) = Q\, E_w\{h(x, \Theta)\}$, where $E_w$ denotes the expectation operator with respect to the pdf $w$. Moreover, it was shown in [12], using the Radon and wavelet transforms, that the function $h(x, \theta)$ can be taken to be a ridge function with $\theta = (a, b, c)$ and $h(x, \theta) = c\,\sigma(a^T x + b)$. 
\nIn the case of neural networks, this type of convex representation was first exploited by Barron in [2], assuming $f$ belongs to a class of functions characterized by certain moment conditions on their Fourier transforms. Later, Delyon et al. [6] and Maiorov and Meir [12] extended Barron's results to the case of wavelets and neural networks, respectively, obtaining rates of convergence for functions in the Sobolev class. \n\nThe basic idea at this point is to generate an approximation, $h_n(x)$, based on $n$ draws of random variables $\Theta^n = \{\Theta_1, \Theta_2, \dots, \Theta_n\}$, $\Theta_i \sim w(\cdot)$, resulting in the random function \n\n$h_n(x; \Theta^n) = \frac{Q}{n} \sum_{i=1}^{n} h(x, \Theta_i). \qquad (4)$ \n\nThroughout the paper we conform to standard notation, and denote random variables by uppercase letters, as in $\Theta$, and their realizations by lowercase letters, as in $\theta$. Let $w^n = \prod_{i=1}^n w_i$ represent the product pdf for $\{\Theta_1, \dots, \Theta_n\}$. Our first result demonstrates that, on the average, the above procedure leads to good approximation of functions belonging to $W_2^r$. \n\nTheorem 3.1 Let $K \subset \mathbb{R}^d$ be a compact set. Then for any $f \in W_2^r$, $n > 0$ and $\epsilon > 0$ there exists a constant $c > 0$, such that \n\n$E_{w^n} \|f - h_n(\cdot\,; \Theta^n)\|_2 \le c\, n^{-r/d + \epsilon}, \qquad (5)$ \n\nwhere $Q \le c\, n^{(1/2 - r/d)_+}$, and $(x)_+ = \max(0, x)$. \n\nThe implication of the upper bound on the expected value is that there exists a set of values $\theta^{*,n} = \{\theta_1^*, \dots, \theta_n^*\}$ for which the rate (5) can be achieved. Moreover, as long as the functions $h(x, \theta_i)$ in (4) are bounded in the $L_2$ norm, a bound on $Q$ implies a bound on the size of the function $h_n$ itself. \n\nProof sketch The proof proceeds by expressing $f$ as the sum of two functions, $f_1$ and $f_2$. The function $f_1$ is the best approximation to $f$ from the class of multi-variate splines of degree $r$. From [12] we know that there exist parameters $\theta^n$ such that $\|f_1(\cdot) - h_n(\cdot, \theta^n)\|_2 \le c\, n^{-r/d}$. Moreover, using the results of [5] it can be shown that $\|f_2\|_2 \le c\, n^{-r/d}$. 
Using these two observations, together with the triangle inequality $\|f - h_n\|_2 \le \|f_1 - h_n\|_2 + \|f_2\|_2$, yields the desired result. \u25a1 \n\nNext, we show that given the approximation rates attained in Theorem 3.1, the same rates may be obtained using an $\varepsilon$-greedy algorithm. Moreover, since in [12] we have established the optimality of the upper bound (up to a logarithmic factor in $n$), we conclude that greedy approximations can indeed yield near-optimal performance, while at the same time being much more attractive computationally. In fact, in this section we use a weaker algorithm, which does not perform a full minimization at each stage. \n\nIncremental algorithm ($q = 2$): Let $\alpha_n = 1 - 1/n$ and $\bar\alpha_n = 1 - \alpha_n = 1/n$. \n\n1. Let $\theta_1^*$ be chosen to satisfy \n\n$\|f(x) - Q h(x, \theta_1^*)\|_2^2 \le E_{w_1}\{\|f(x) - Q h(x, \Theta_1)\|_2^2\}.$ \n\n2. Assume that $\theta_1^*, \theta_2^*, \dots, \theta_{n-1}^*$ have been generated. Select $\theta_n^*$ to obey \n\n$\Big\|f(x) - \frac{Q\alpha_n}{n-1}\sum_{i=1}^{n-1} h(x, \theta_i^*) - \bar\alpha_n Q h(x, \theta_n^*)\Big\|_2^2 \le E_{w_n}\Big\{\Big\|f(x) - \frac{Q\alpha_n}{n-1}\sum_{i=1}^{n-1} h(x, \theta_i^*) - \bar\alpha_n Q h(x, \Theta_n)\Big\|_2^2\Big\}.$ \n\nDefine \n\n$e_n = \Big\|f - \frac{Q}{n}\sum_{i=1}^{n} h(\cdot, \theta_i^*)\Big\|_2,$ \n\nwhich measures the error incurred at the $n$-th stage by this incremental procedure. The main result of this section then follows. \n\nTheorem 3.2 For any $f \in W_2^r$ and $\epsilon > 0$, the error of the incremental algorithm above is bounded as \n\n$e_n \le c\, n^{-r/d + \epsilon},$ \n\nfor some finite constant $c$. \n\nProof sketch The claim will be established upon showing that \n\n$e_n^2 \le E_{w^n}\{\|f - h_n(\cdot\,; \Theta^n)\|_2^2\}, \qquad (6)$ \n\nnamely, the error incurred by the incremental procedure is no larger than that of the non-incremental one described preceding Theorem 3.1. The result will then follow upon using H\u00f6lder's inequality and the upper bound (5) for the r.h.s. of (6). The remaining details are straightforward, but tedious, and can be found in the full paper [13]. \u25a1 
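The random-sampling construction of (3)-(4) is easy to simulate. In the sketch below the parameter density $w$, the ridge units, and hence the target $f$ are toy assumptions (with $Q = 1$), not the Radon/wavelet construction of [12]; it only illustrates that the empirical average of $n$ sampled ridge functions approaches its expectation, with the $L_2$ error shrinking as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 50)  # evaluation grid

def h_n(n):
    # h_n(x; theta^n) = (Q/n) * sum_i c_i * tanh(a_i x + b_i), Q = 1.
    # The density w below is an arbitrary choice, made asymmetric
    # (b centered at 1, c >= 0) so that the expectation is not zero.
    a = rng.normal(size=n)
    b = rng.normal(loc=1.0, size=n)
    c = rng.uniform(0.0, 1.0, size=n)
    units = c[:, None] * np.tanh(a[:, None] * x[None, :] + b[:, None])
    return units.mean(axis=0)

# "Ground truth" f(x) = E_w[h(x, Theta)], estimated with a large sample.
f = h_n(50_000)

err_small_n = np.sqrt(np.mean((h_n(100) - f) ** 2))
err_large_n = np.sqrt(np.mean((h_n(10_000) - f) ** 2))
```

Generic Monte-Carlo averaging like this only gives an $n^{-1/2}$-type decay; Theorem 3.1 obtains the rate $n^{-r/d+\epsilon}$ for Sobolev targets through the specific choice of $w$, $Q$, and the ridge units in [12].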
\n\n4 Upper bound for general $L_q$ norms \n\nHaving established rates of convergence for incremental approximation of $W_2^r$ in the $L_2$ norm, we move now to general $L_q$ norms. First, note that the proof of Theorem 3.2 relies heavily on the existence of an inner product. This useful tool is no longer available in the case of general Banach spaces such as $L_q$. In order to extend the results to the latter norm, we need to use more advanced ideas from the theory of the geometry of Banach spaces. In particular, we will make use of recent results from the work of Donahue et al. [7]. Second, we must keep in mind that the approximation of the Sobolev space $W_2^r$ using the $L_q$ norm only makes sense if the embedding condition $r/d > (1/2 - 1/q)_+$ holds, since otherwise the $L_q$ norm may be infinite (the embedding condition guarantees its finiteness; see [14] for details). \n\nWe first present the main result of this section, followed by a sketch of the proof. The full details of the rather technical proof can be found in [13]. Note that in this case we need to use the greedy algorithm (2) rather than the algorithm of Section 3. \n\nTheorem 4.1 Let the embedding condition $r/d > (1/2 - 1/q)_+$ hold for $1 < q < \infty$, $0 < r < r^*$, $r^* = \frac{d}{2} + \big(\frac{d}{2} - \frac{d}{q}\big)_+$, and assume that $\|h(\cdot, \theta)\|_q \le 1$ for all $\theta$. Then for any $f \in W_2^r$ and $\epsilon > 0$ \n\n$\|f - h_n(\cdot\,; \theta^n)\|_q \le c\, n^{-\gamma + \epsilon},$ \n\nwhere \n\n$\gamma = \begin{cases} \frac{r}{d} - \big(\frac{1}{2} - \frac{1}{q}\big) & q > 2, \\ \frac{r}{d} & q \le 2, \end{cases} \qquad (7)$ \n\n$c = c(r, d, K)$, and $h_n(\cdot\,; \theta^n)$ is obtained via the incremental greedy algorithm (2) with $\varepsilon_n = 0$. \n\nProof sketch The main idea in the proof of Theorem 4.1 is a two-part approximation scheme. First, based on [13], we show that any $f \in W_2^r$ may be well approximated by functions in the convex class $\mathrm{co}_n(\mathcal{H}_\eta)$ for an appropriate value of $\eta$ (see Lemma 5.2 in [13]), where $\mathcal{H}_\eta$ is defined in (1). 
Then, it is argued, making use of results from [7] (in particular, Corollary 3.6), that an incremental greedy algorithm can be used to approximate the closure of the class $\mathrm{co}(\mathcal{H}_\eta)$ by the class $\mathrm{co}_n(\mathcal{H}_\eta)$. The proof is completed using the triangle inequality. The proof along the above lines is done for the case $q > 2$. In the case $q \le 2$, a simple use of the H\u00f6lder inequality in the form $\|f\|_q \le |K|^{1/q - 1/2}\|f\|_2$, where $|K|$ is the volume of the region $K$, yields the desired result, which, given the lower bounds in [12], is nearly optimal. \u25a1 \n\n5 Discussion \n\nWe have presented a theoretical analysis of an increasingly popular approach to incremental learning in neural networks. Extending previous results, we have shown that near-optimal rates of convergence may be obtained for approximating functions in the Sobolev class $W_2^r$. These results extend and clarify previous work dealing solely with the approximation of functions belonging to the closure of convex hulls of certain sets of functions. Moreover, we have given explicit bounds on the parameters used in the algorithm, and shown that the restriction to $\mathrm{co}_n\mathcal{H}_\eta$ is not too stringent. In the case $q \le 2$ the rates obtained are as good (up to logarithmic factors) as the rates obtained for general spline functions, which are known to be optimal for approximating Sobolev spaces. The rates obtained in the case $q > 2$ are sub-optimal as compared to spline functions, but can be shown to be provably better than any linear approach. In any event, we have shown that the rates obtained are equal, up to logarithmic factors, to those of approximation from $\mathrm{lin}_n\mathcal{H}_\eta$, when the size of $\eta$ is chosen appropriately, implying that positive input-to-output weights suffice for approximation. An open problem remaining at this point is to demonstrate whether incremental algorithms for neural network construction can be shown to be optimal for every value of $q$. 
In fact, this is not even known at this stage for neural network approximation in general. \n\nReferences \n\n[1] P. Auer, M. Herbster, and M. Warmuth. Exponentially many local minima for single neurons. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 316-322. MIT Press, 1996. \n\n[2] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory, 39:930-945, 1993. \n\n[3] A.R. Barron and R.L. Barron. Statistical learning networks: a unifying view. In E. Wegman, editor, Computing Science and Statistics: Proceedings 20th Symposium Interface, pages 192-203, Washington D.C., 1988. Amer. Statist. Assoc. \n\n[4] A. Blum and R. Rivest. Training a 3-node neural net is NP-complete. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 494-501. Morgan Kaufmann, 1989. \n\n[5] C. de Boor and G. Fix. Spline approximation by quasi-interpolation. J. Approx. Theory, 7:19-45, 1973. \n\n[6] B. Delyon, A. Juditsky, and A. Benveniste. Accuracy analysis for wavelet approximations. IEEE Transactions on Neural Networks, 6:332-348, 1995. \n\n[7] M.J. Donahue, L. Gurvits, C. Darken, and E. Sontag. Rates of convex approximation in non-Hilbert spaces. Constructive Approx., 13:187-220, 1997. \n\n[8] L. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20:608-613, 1992. \n\n[9] S. Judd. Neural Network Design and the Complexity of Learning. MIT Press, Boston, USA, 1990. \n\n[10] W.S. Lee, P.L. Bartlett, and R.C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inf. Theory, 42(6):2118-2132, 1996. \n\n[11] M. Leshno, V. Lin, A. Pinkus, and S. 
Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6:861-867, 1993. \n\n[12] V.E. Maiorov and R. Meir. On the near optimality of the stochastic approximation of smooth functions by neural networks. Technical Report CC-223, Department of Electrical Engineering, Technion, November 1997. Submitted to Advances in Computational Mathematics. \n\n[13] R. Meir and V. Maiorov. On the optimality of neural network approximation using incremental algorithms. Submitted for publication, October 1998. ftp://dumbo.technion.ac.il/pub/PAPERS/incremental.pdf. \n\n[14] H. Triebel. Theory of Function Spaces. Birkh\u00e4user, Basel, 1983. \n", "award": [], "sourceid": 1517, "authors": [{"given_name": "Ron", "family_name": "Meir", "institution": null}, {"given_name": "Vitaly", "family_name": "Maiorov", "institution": null}]}