DERIVATIVE-FREE OPTIMIZATION OF NOISY FUNCTIONS
by
Jeffrey M. Larson
B.A., Carroll College, 2005
M.S., University of Colorado Denver, 2008
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Applied Mathematics
2012
This thesis for the Doctor of Philosophy degree by Jeffrey M. Larson has been approved by
Stephen Billups, Advisor and Chair Alexander Engau Burt Simon Michael Jacobson Fred Glover
Date
Larson, Jeffrey M. (Ph.D., Applied Mathematics)
Derivative-free Optimization of Noisy Functions
Thesis directed by Associate Professor Stephen Billups
ABSTRACT
Derivative-free optimization (DFO) problems with noisy functions are increasingly prevalent. In this thesis, we propose two algorithms for noisy DFO, as well as termination criteria for general DFO algorithms. Both proposed algorithms are based on the methods of Conn, Scheinberg, and Vicente [9], which use regression models in a trust region framework. The first algorithm utilizes weighted regression to obtain more accurate model functions at each trust region iteration. A weighting scheme is proposed which simultaneously handles differing levels of uncertainty in function evaluations and errors induced by poor model fidelity. To prove convergence of this algorithm, we extend the theory of Λ-poisedness and strong Λ-poisedness to weighted regression. The second algorithm modifies the first for functions with stochastic noise. We prove our algorithm generates a subsequence of iterates which converges almost surely to a first-order stationary point, provided the number of successful steps is finite and the noise in each function evaluation is independent and identically normally distributed. Lastly, we address termination of DFO algorithms on functions with noise-corrupted evaluations. If the function is computationally expensive, stopping well before traditional criteria (e.g., after a budget of function evaluations is exhausted) are satisfied can yield significant savings. Early termination is especially desirable when the function being optimized is noisy and the solver proceeds for an extended period while only seeing changes which are insignificant relative to the noise in the function. We develop techniques for comparing the quality of termination tests, propose families of tests to be used on general DFO algorithms, and then compare
the tests in terms of both accuracy and efficiency.
The form and content of this abstract are approved. I recommend its publication.
Approved: Stephen Billups
ACKNOWLEDGMENT
I would like to thank my advisor, Steve Billups, for his years of research assistance and advice. His guidance was instrumental in obtaining the results in this thesis. I would also like to thank Peter Graf and Stefan Wild for their assistance in researching and writing parts of this thesis. The research in this thesis was partially supported by National Science Foundation Grant GK120742434 and partially supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DEAC0206CH11357. Lastly, I'd like to thank my wife, Jessica, for years of patience while the research for this thesis was performed.
TABLE OF CONTENTS
Figures ................................................................... ix
Tables..................................................................... xi
Chapter
1. Introduction............................................................. 1
1.1 Review of Methods ................................................. 3
1.2 Outline............................................................ 5
1.3 Notation........................................................... 8
2. Background.............................................................. 10
2.1 Model-based Trust Region Methods.................................. 10
2.1.1 Model Construction Without Derivatives...................... 11
2.1.2 CSV2-framework.............................................. 11
2.1.3 Poisedness.................................................. 18
2.2 Performance Profiles.............................................. 23
2.3 Probabilistic Convergence......................................... 24
3. Derivative-free Optimization of Expensive Functions with Computational
Error Using Weighted Regression......................................... 26
3.1 Introduction...................................................... 26
3.2 Model Construction................................................ 27
3.3 Error Analysis and the Geometry of the Sample Set................. 29
3.3.1 Weighted Regression Lagrange Polynomials.................... 29
3.3.2 Error Analysis ............................................. 30
3.3.3 Λ-poisedness (in the Weighted Regression Sense)............. 35
3.4 Model Improvement Algorithm....................................... 39
3.5 Computational Results............................................. 50
3.5.1 Using Error Bounds to Choose Weights .................... 50
3.5.2 Benchmark Performance....................................... 52
3.6 Summary and Conclusions........................................... 57
4. Stochastic Derivative-free Optimization using a Trust Region Framework . 59
4.1 Preliminary Results and Definitions............................. 62
4.1.1 Models which are κ-fully Quadratic with Confidence 1 − α ... 65
4.1.2 Models which are κ-fully Linear with Confidence 1 − α ...... 69
4.2 Stochastic Optimization Algorithm................................. 70
4.3 Convergence....................................................... 73
4.3.1 Convergence to a First-order Stationary Point............... 73
4.3.2 Infinitely Many Successful Steps ........................... 78
4.4 Computational Example ............................................ 80
4.4.1 Deterministic Trust Region Method........................... 80
4.4.2 Stochastic Trust Region Method.............................. 81
4.5 Conclusion........................................................ 83
5. Non-intrusive Termination of Noisy Optimization......................... 84
5.1 Introduction and Motivation....................................... 84
5.2 Background ....................................................... 87
5.3 Stopping Tests.................................................... 90
5.3.1 Δf Test..................................................... 92
5.3.2 Max-Difference f Test....................................... 92
5.3.3 Max-Distance x Test......................................... 93
5.3.4 Max-Distance x* Test........................................ 93
5.3.5 Max-Budget Test............................................. 94
5.3.6 Tests Based on Estimates of the Noise....................... 94
5.3.7 Relationship to Loss Functions.............................. 96
5.4 Numerical Experiments............................................. 98
5.4.1 Accuracy Profiles for the ϕ1 Family........................ 100
5.4.2 Performance Profiles for the ϕ1 Family..................... 102
5.4.3 Accuracy and Performance Plots for the ϕ2 Family............ 104
5.4.4 Across-family Comparisons................................... 106
5.4.5 Deterministic Noise ........................................ 107
5.4.6 Validation for Individual Solvers .......................... 109
5.5 Discussion......................................................... 110
6. Concluding Remarks....................................................... 113
References.................................................................. 119
FIGURES
Figure
2.1 An example of a performance profile..................................... 24
3.1 Performance (left) and data (right) profiles: Interpolation vs. regression
vs. weighted regression................................................... 55
3.2 Performance (left) and data (right) profiles: weighted regression vs.
NEWUOA vs. DFO (Problems with Stochastic Noise).................... 56
4.1 Several iterations of a traditional trust region method assuming deterministic function evaluations. The trust region center is never moved........ 81
4.2 Several iterations of a traditional trust region method assuming stochastic
function evaluations...................................................... 82
5.1 Part of a noisy trajectory of function values for an expensive nuclear
physics problem. After more significant decreases in the first 70 evaluations, progress begins to stall...................................... 85
5.2 First terms in ϕ1 (top, with k = 100) and the noise................ 96
5.3 Number of evaluations i* for a termination test based on (5.3) with other parameters held fixed, but using a μ parameterized by c. The plots show remarkably similar behavior to the number of evaluations that minimize L(·, c) in (5.8). 98
5.4 Accuracy profiles for members of the ϕ1 family on problems (5.9) with two different magnitudes of (known) stochastic relative noise σ. In the top plots, k is held fixed and the shown members have different μ values. In the bottom plots, μ is held fixed and the shown members have different k values................................................................ 101
5.5 Performance profiles for the most accurate ϕ1 tests on problems (5.9) with two different magnitudes of (known) stochastic relative noise σ. Note that the horizontal axis has been truncated for each plot; ϕ5 eventually terminates all of the problems and thus has a profile that will reach the value 1; all other tests change by less than 0.01....................................... 103
5.6 Accuracy (top) and performance (bottom) profiles for the ϕ2 family on problems (5.9) with two different magnitudes of stochastic relative noise σ as k and μ are varied.................................................. 105
5.7 Accuracy (top) and performance (bottom) profiles for the best tests on problems (5.9) with two different magnitudes of stochastic relative noise σ. The horizontal axes on the performance profiles are truncated for clarity; ϕ5 eventually achieves a value of 1; all other tests change by less than 0.03... 106
5.8 Accuracy (top) and performance (bottom) profiles for the best tests on problems (5.9) with two different magnitudes of deterministic noise. The horizontal axes on the performance profiles are truncated for clarity; ϕ5 eventually achieves a value of 1; all other tests change by less than 0.03... 108
5.9 Performance profiles for more conservative tests on problems (5.9) with two different magnitudes of deterministic noise. The horizontal axes on the performance profiles are truncated for clarity; ϕ5 eventually achieves a value of 1; all other tests change by less than 0.03...................... 109
5.10 Accuracy profiles for the individual algorithms on the recommended tests. 110
TABLES
Table
5.1 Recommendations for termination tests for noisy optimization........... 111
1. Introduction
Traditional unconstrained optimization is inherently tied to derivatives; the necessary condition for a first-order minimum is characterized by the derivative being equal to zero. Nevertheless, there exists a variety of functions which must be minimized when derivatives are unavailable. For example, consider an engineer in a lab who wants to maximize the strength of a metal bar by adjusting various factors of production. After the bar is constructed, it is broken to determine its strength. There is no closed-form expression for the bar's strength; each function evaluation comes from an expensive procedure. Also, the process of breaking the bar provides no information about how to change the factors of production to increase the bar's strength. In addition to the optimization of systems which must be physically evaluated, function evaluations by complex computer simulations often provide no (or unreliable) derivatives. Such simulations of complex phenomena (sometimes called black-box functions) are becoming increasingly common as computational modeling and computer hardware continue to advance. Whereas traditional techniques are concerned with the efficiency of the algorithm, such concerns are secondary throughout this thesis. Explicitly, we assume that the cost of evaluating the function overwhelms the computational requirements of the algorithm.
In addition to unavailable derivatives, noise of various forms is often present in these functions. Throughout this thesis we will divide this noise, that is, the difference between the true value and the computed value, into two categories: deterministic and stochastic. Deterministic noise (e.g., arising from finite-precision calculations or iterative methods) is often present if the function being optimized is a simulation of a physical system. For example, if evaluating the function involves solving a system of nonlinear partial differential equations or computing the eigenvalues of a large matrix, a small perturbation in the parameters can yield a large jump in the difference between the true and computed values. Though the computed value and true
value may differ, reevaluating the function with the same choice of parameters will provide no further information. In contrast, reevaluating a function with stochastic noise will provide additional information about the true value of the function. Two common sources of stochastic noise are functions whose evaluation requires a large-scale Monte Carlo simulation of an actual system and functions whose evaluation requires physically measuring a property of some system.
The thesis consists of three distinct but connected chapters addressing the problem:

min_x f(x),   f : ℝ^n → ℝ,

when the algorithm only has access to noise-corrupted values

f̂(x) := f(x) + ε(x),

where ε(x) denotes the noise.
Each chapter makes different assumptions about the noise ε(x). Chapter 3 assumes that the accuracy with which f̂ approximates f can vary, and that this accuracy can be quantified. For example, if f̂(x) is calculated using a Monte Carlo simulation, increasing the number of trials will decrease the magnitude of ε(x). Similarly, if f̂ is calculated by a finite element method, increased accuracy can be obtained with a finer grid. Of course, this greater accuracy comes at the cost of increased computational time, so it makes sense to vary the accuracy requirements over the course of the optimization. With this in mind, Chapter 3 asks: How can knowledge of the accuracy of different function evaluations be exploited to design a better algorithm?
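As a concrete illustration of this accuracy-versus-cost tradeoff (a hypothetical example, not one from the thesis), consider estimating f(x) = E[(x + Z)²] by Monte Carlo: the empirical spread of f̂(x) shrinks roughly like 1/√(trials), so accuracy can be purchased with computation.

```python
import random
import statistics

def f_hat(x, trials, seed=None):
    """Monte Carlo estimate of f(x) = E[(x + Z)^2], Z ~ N(0,1); true value x^2 + 1.
    The noise e(x) in the estimate shrinks like 1/sqrt(trials)."""
    rng = random.Random(seed)
    return statistics.fmean((x + rng.gauss(0.0, 1.0)) ** 2 for _ in range(trials))

# Empirical magnitude of e(x) at x = 2 for two accuracy levels: the spread of
# repeated estimates shrinks as the number of trials grows.
for trials in (100, 10_000):
    estimates = [f_hat(2.0, trials, seed=s) for s in range(30)]
    print(trials, round(statistics.stdev(estimates), 3))
```

Raising `trials` by a factor of 100 shrinks the spread by roughly a factor of 10, at 100 times the cost; this is exactly the tradeoff Chapter 3 exploits.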
Chapter 4 assumes that the noise for each function evaluation is independent of x and can be modeled as a normally distributed random variable with mean zero and a fixed, finite standard deviation. Though many other algorithms have been designed to optimize such a function, they often resort to repeated sampling of points. This provides information about the noise at a point, but no information about the
function nearby. This motivates the question addressed in Chapter 4: How can greater accuracy be efficiently achieved by oversampling without necessarily repeating function evaluations?
Chapter 5 assumes that a reasonably accurate estimate of the magnitude of the noise can be obtained, and that this estimate remains relatively constant throughout the domain. Though there are many algorithms in the literature designed to optimize noisy functions, very few use estimates of the noise in their termination criteria. When function evaluations are cheap, termination can be determined by common tests (e.g., small step size or gradient approximation). But when function evaluations are expensive, determining when to stop becomes an important multiobjective optimization problem: the optimizer wants to find the best solution possible while minimizing computational effort. As this is a difficult problem to explicitly formulate, practitioners frequently terminate algorithms when (i) a predefined number of iterations has elapsed, (ii) no decrease in the optimal function value has been detected for a number of iterations, or (iii) the distance between a number of successive iterates is below some threshold. Chapter 5 attempts to answer the question: When should an algorithm optimizing an expensive, noisy function be terminated?
1.1 Review of Methods
Before discussing our algorithms further, we first discuss previous DFO techniques.
Heuristics are perhaps the first recourse when attempting to optimize a function without derivatives. Simulated annealing [36, 63], genetic algorithms [28], random search and its variants [55, 66, 39, 52], tabu search [20], scatter search [21], particle swarm optimization [34], and Nelder-Mead [45] are just a few of the heuristics developed to solve problems where only function evaluations are available. Though most of these algorithms lack formal convergence results (aside from results dependent on the algorithm producing iterates which are dense in the domain), they remain popular
due to their (1) ease of implementation, (2) flexibility on a variety of problem classes, and (3) frequent success in practice.
Other techniques attempt to approximate the unavailable derivative. Classical finite-difference methods approximate the derivative by adjusting each variable and noting the change in the function value. Other techniques, such as pattern search methods [62, 2] and implicit filtering [5], evaluate a changing pattern of points around the best known solution. Also of note is the DIRECT algorithm [30], a global optimization method based on dividing hyperrectangles using only function values.
An increasingly popular class of algorithms for derivative-free optimization (DFO) is model-based trust region methods [31, 11]. Local models approximating the function are constructed and minimized to generate successive iterates. These models are commonly low-order polynomials interpolating function values close to the best known value, for example Powell's UOBYQA algorithm [48]. Other examples include [49], where Powell introduces a minimum Frobenius norm condition on underdetermined quadratic models, and ORBIT by Wild et al. [64], which constructs models using interpolating radial basis functions. (These local models should not be confused with kriging [59] or response surface methodologies [44], which build global models of the function.) Though implementation of these techniques is not as simple as some other techniques, a well-developed convergence theory exists. As this thesis focuses on noisy DFO problems, we considered trust-region methods with regression models most appropriate (since, in many cases, regression models through enough points can approximate the true function).
There are also a variety of existing DFO algorithms for optimizing functions with noise. For functions with stochastic noise, replication of function evaluations can be a simple way to modify existing algorithms. For example, [14] modifies Powell's UOBYQA [48], [15] modifies DIRECT [30], and [61] modifies Nelder-Mead by repeated sampling of the function at points of interest. For deterministic noise, Kelley
[33] proposes a technique to detect and restart Nelder-Mead methods. Neumaier's SNOBFIT [29] algorithm accounts for noise by not requiring the surrogate functions to interpolate function values, but rather to fit a stochastic model. Similarly, [10] proposes using least-squares regression models instead of interpolating models when noise is present in the function evaluations.
Lastly, Stochastic Approximation algorithms are also designed to minimize functions with stochastic noise. These algorithms were developed by statisticians to solve

min_x f(x) = E[f̂(x)],

when only f̂ can be computed. Two of the more famous algorithms, the Kiefer-Wolfowitz and Simultaneous Perturbation methods, take predefined step lengths in a direction approximating −∇f. These techniques have strong theoretical convergence results, but can be difficult to implement in practice. For further discussion of these algorithms, see the beginning of Chapter 4.
1.2 Outline
The work in this thesis focuses on modifications to model-based trust region methods in order to handle noise. Throughout the thesis we assume that only noisy, expensive function evaluations f̂ are available, but that there is some smooth underlying function f which is twice continuously differentiable with a Lipschitz continuous Hessian. We also assume that the noise in the evaluation of f̂ is unbiased with bounded variance.
Chapter 3 (joint work with Stephen Billups and Peter Graf) proposes a DFO algorithm to optimize functions which are expensive to evaluate and contain computational noise. The algorithm is based on the trust region methods of [9, 10] which build interpolation or regression models around the best known solution. We propose using weighted regression models in a trust region framework, and prove convergence of such methods provided the weighting scheme satisfies some basic conditions.
The algorithm fits into a general framework for derivative-free trust region methods using quadratic models, which was described by Conn, Scheinberg, and Vicente in [12, 11]. We shall refer to this framework as the CSV2-framework. This framework constructs a quadratic model function m_k(·), which approximates the objective function f on a set of sample points Y_k ⊂ ℝ^n at each iteration k. The next iterate is then determined by minimizing m_k over a trust region. In order to guarantee global convergence, the CSV2-framework monitors the distribution of points in the sample set, and occasionally invokes a model-improvement algorithm that modifies the sample set to ensure m_k accurately approximates f. The CSV2-framework is the basis of the well-known DFO algorithm which is freely available on the COIN-OR website [38].
To fit our algorithm into the CSV2-framework we extend the theory of poisedness, as described in [12], to weighted regression. We show (Proposition 3.12) that a sample set that is strongly Λ-poised in the regression sense is also strongly cΛ-poised in the weighted regression sense for some constant c, provided that no weight is too small relative to the other weights. Thus, any model improvement scheme that ensures strong Λ-poisedness in the regression sense can be used in the weighted regression framework.
The convergence proof in Chapter 3 requires that the computational error decreases as the trust region decreases; such an assumption can be satisfied if the user has some control of the accuracy in the function evaluation. Since Chapter 3 is centered on exploiting differing levels of accuracy in different function evaluations, such an assumption is reasonable for that chapter. In Chapter 4, we remove this assumption, but add the assumption that f̂ has the form

f̂(x) = f(x) + ε    (1.1)

where ε ~ N(0, σ²). Chapter 4 modifies the algorithm from Chapter 3 to converge when the error does not decrease with the trust region radius. With some light assumptions on the noise and underlying function, we prove the algorithm generates a subsequence of iterates which converges almost surely to a first-order stationary point in the case where the number of successful iterates is finite.
At a given point of interest x⁰, the algorithm does not repeatedly sample f̂(x⁰) in order to glean information about f(x⁰). Rather, m_k(x⁰), the value of the trust region model at x⁰, is used to estimate f(x⁰). We derive bounds on the error between f and m_k, provided the set of points used to construct m_k satisfies certain geometric conditions, called strong Λ-poisedness (see Definition 2.14), and contains a sufficient number of points. Also, the step size is controlled by the algorithm, increasing and decreasing as the algorithm progresses and stagnates. This contrasts with many of the methods in the Stochastic Approximation literature, where the user must provide a predefined set of steps to be taken by the algorithm.
The results in Section 4.3 prove the algorithm will generate a subsequence of iterates converging almost surely to a firstorder stationary point when the number of successful iterates is finite, and makes progress in the infinite case. Such results are not common for most DFO algorithms on problems with stochastic noise. Both the simplicial direct search method [1] and the trust region method in [4] prove similar convergence results, but both reduce the variance at a point by repeated sampling. In addition to our strong convergence result, we are able to directly quantify the probability of the success of some iterates (see Lemma 4.15 for one such example). We are unaware of any other similar theoretical results for algorithms minimizing stochastic functions.
Chapter 5 (joint work with Stefan Wild) addresses termination criteria, the choice of which is a common problem when optimizing noisy functions. We propose objective measures to compare the quality of termination rules. Families of termination tests are then proposed and their performance is analyzed across a broad range of DFO
algorithms. Recommendations for tests which work for many algorithms are also provided. Lastly, Chapter 6 contains concluding remarks and directions for future research.
1.3 Notation
The following notation will be used: ℝ^n denotes the Euclidean space of real n-vectors. ‖·‖_p denotes the standard ℓ_p vector norm, and ‖·‖ (without the subscript) denotes the Euclidean norm. ‖·‖_F denotes the Frobenius norm of a matrix. C^k denotes the set of functions on ℝ^n with k continuous derivatives. D^j f denotes the jth derivative of a function f ∈ C^k, j ≤ k. Given an open set Ω ⊂ ℝ^n, LC^k(Ω) denotes the set of C^k functions with Lipschitz continuous kth derivatives. That is, for f ∈ LC^k(Ω), there exists a Lipschitz constant L such that

‖D^k f(y) − D^k f(x)‖ ≤ L ‖y − x‖ for all x, y ∈ Ω.

P_n^d denotes the space of polynomials of degree less than or equal to d on ℝ^n; q + 1 denotes the dimension of P_n^d (specifically, q + 1 = (n + 1)(n + 2)/2 when d = 2). We use standard "big-Oh" notation (written O(·)) to state, for example, that for two functions on the same domain, f(x) = O(g(x)) if there exists a constant M such that |f(x)| ≤ M |g(x)| for all x with sufficiently small norm. Given a set Y, |Y| denotes the cardinality and conv(Y) denotes the convex hull. For a real number a, ⌊a⌋ denotes the greatest integer less than or equal to a. For a matrix A, A⁺ denotes the Moore-Penrose generalized inverse [22]. e^j denotes the jth column of the identity matrix. The ball of radius Δ centered at x ∈ ℝ^n is denoted B(x; Δ). Given a vector w, diag(w) denotes the diagonal matrix W with diagonal entries W_ii = w_i. For a square matrix A, cond(A) denotes the condition number, λ_min(A) denotes the smallest eigenvalue, and σ_min(A) denotes the smallest singular value. For a set Y := {y⁰, ..., y^p} ⊂ ℝ^n, the quadratic design matrix X has rows

[ 1   y^i_1   ...   y^i_n   ½(y^i_1)²   y^i_1 y^i_2   ...   ½(y^i_n)² ],   i = 0, ..., p.   (1.2)
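The construction of the design matrix in (1.2) can be sketched as follows (a small illustration; the exact ordering of the cross terms is an assumption on my part, since the extracted equation is ambiguous):

```python
import numpy as np

def quadratic_design_matrix(Y):
    """Row i is (1, y^i_1, ..., y^i_n, (y^i_1)^2/2, y^i_1 y^i_2, ..., (y^i_n)^2/2):
    the constant, linear, and quadratic terms, with the squares halved as in (1.2)."""
    Y = np.asarray(Y, dtype=float)
    rows = []
    for y in Y:
        n = y.size
        quad = [y[i] * y[j] * (0.5 if i == j else 1.0)
                for i in range(n) for j in range(i, n)]
        rows.append(np.concatenate(([1.0], y, quad)))
    return np.array(rows)   # shape (p+1) x (n+1)(n+2)/2

# Two points in R^2 give a 2 x 6 design matrix.
X = quadratic_design_matrix([[1.0, 2.0], [0.0, -1.0]])
```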
Let m_k denote the kth trust region model (as defined in Chapter 2). Let g_k = ∇m_k(x_k) and H_k = ∇²m_k(x_k). Define

s_k(x) = max { ‖∇m_k(x)‖, −λ_min(∇²m_k(x)) },
s(x) = max { ‖∇f(x)‖, −λ_min(∇²f(x)) }.

These variables measure how close x is to a first- and second-order stationary point of f and m_k (i.e., the gradient is zero and all eigenvalues are nonnegative). If X is a random variable, the notation X ≤_α γ denotes P(X ≤ γ) ≥ 1 − α. Note that the relation ≤_α is not transitive.
2. Background
Before continuing, we introduce the background material on which the thesis is constructed.
2.1 Modelbased Trust Region Methods
A trust region algorithm is a numerical technique for minimizing a sufficiently smooth function f. At each iteration k, a model function m_k(x) is constructed to approximate f near the best point x_k. When derivatives are available, m_k is usually a truncated first- or second-order Taylor series approximation of f at x_k. For example, if ∇f and ∇²f are easy to calculate,

m_k(x) = f(x_k) + ∇f(x_k)ᵀ(x − x_k) + ½ (x − x_k)ᵀ ∇²f(x_k) (x − x_k).

m_k is minimized over the trust region B(x_k; Δ_k) by solving the problem

min_{s : ‖s‖ ≤ Δ_k} m_k(x_k + s)

to generate a potential next trust region center x_k + s_k. f(x_k + s_k) is evaluated and the ratio

ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(x_k) − m_k(x_k + s_k))

is calculated, which compares the actual decrease in f with the decrease predicted by the model m_k. This ratio quantifies the success of iteration k and also how well the model function approximates the true function f. If ρ_k is larger than a prescribed threshold η₁, it indicates that the iteration was successful and the model is a good approximation of the function. In this case, x_{k+1} is set to x_k + s_k and the trust region radius Δ_k is increased. If ρ_k is less than another parameter η₀, the model function is considered unreliable, so the trust region radius Δ_k is decreased and the iterate is not updated (i.e., x_{k+1} = x_k). Lastly, k is incremented and the process repeats.
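The iteration just described can be sketched as follows (a minimal illustration using derivative-based models and an approximate Cauchy-point solve of the subproblem; the parameter values and update factors are assumptions, not those of the algorithms studied later):

```python
import numpy as np

def trust_region_minimize(f, grad, hess, x0, delta0=1.0, delta_max=100.0,
                          eta0=0.1, eta1=0.75, max_iter=100, tol=1e-8):
    """Basic trust-region iteration following the rho-based update rules above.

    The model m_k is the second-order Taylor expansion of f at x_k; the
    subproblem is solved approximately along the steepest-descent (Cauchy) arc.
    """
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        # Cauchy point: minimize the model along -g within the trust region.
        gHg = g @ H @ g
        tau = 1.0 if gHg <= 0 else min(gnorm**3 / (delta * gHg), 1.0)
        s = -tau * delta / gnorm * g
        predicted = -(g @ s + 0.5 * s @ H @ s)   # m_k(x_k) - m_k(x_k + s_k)
        actual = f(x) - f(x + s)                 # f(x_k) - f(x_k + s_k)
        rho = actual / predicted if predicted > 0 else -np.inf
        if rho >= eta1:                 # very successful: accept, grow radius
            x, delta = x + s, min(2 * delta, delta_max)
        elif rho >= eta0:               # successful: accept, keep radius
            x = x + s
        else:                           # unsuccessful: reject, shrink radius
            delta *= 0.5
    return x
```

On a convex quadratic such as f(x) = ‖x − c‖², the ratio ρ_k equals 1 at every iteration, so every step is accepted and the radius grows until the minimizer is reached.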
2.1.1 Model Construction Without Derivatives
When derivatives are unavailable, the models m_k are constructed using points where f has been evaluated. For example, the Conn, Scheinberg, and Vicente framework (which we refer to as the CSV2-framework) builds models m_k from a specified class of models M using a sample set of points Y_k = {y⁰, ..., y^p} ⊂ B(x_k; Δ_k) on which the function has been evaluated.
Given Y_k and a vector of corresponding function values f = (f(y⁰), ..., f(y^p)), an interpolating model is a model m(x) such that m(y^i) = f(y^i) for i = 0, ..., p. Given a basis {φ_0(x), ..., φ_q(x)} of the class of models M, we can calculate the coefficients β_i in the basis representation of the interpolating model m(x) = Σ_{i=0}^{q} β_i φ_i(x) by solving the linear system

M(φ, Y) β = f    (2.1)

where

M(φ, Y) =
  [ φ_0(y⁰)   φ_1(y⁰)   ...   φ_q(y⁰)  ]
  [ φ_0(y¹)   φ_1(y¹)   ...   φ_q(y¹)  ]
  [    ⋮          ⋮              ⋮      ]
  [ φ_0(y^p)  φ_1(y^p)  ...   φ_q(y^p) ].
Note that for this equation to have a unique solution, the number of sample points p + 1 must equal the size of the basis q + 1 and the matrix M(φ, Y) must be invertible.

Regression models have also been investigated [10], in which the number of sample points p + 1 is greater than the size of the basis. In this case, the matrix M(φ, Y) has more rows than columns, so equation (2.1) is solved in a least-squares sense.

Lastly, if M(φ, Y) has more columns than rows, the system (2.1) is underdetermined. Nevertheless, bounds between the function and an underdetermined model can be obtained in certain cases; see, for example, [13], which considers the minimum Frobenius norm method.
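As a small sketch of these regimes (the helper names and the monomial basis ordering are mine, not from the thesis): with more sample points than basis functions, (2.1) is solved by least squares; because the test function below is itself a quadratic, the regression model reproduces it exactly.

```python
import numpy as np

def quadratic_basis(y):
    """Monomial basis {1, y_1, ..., y_n, y_i*y_j (i <= j)} for quadratics on R^n."""
    y = np.asarray(y, dtype=float)
    n = y.size
    cross = [y[i] * y[j] for i in range(n) for j in range(i, n)]
    return np.concatenate(([1.0], y, cross))

def fit_model(Y, fvals):
    """Solve M(phi, Y) beta = f.  If M is square and invertible this is
    interpolation; with more rows than columns it is least-squares regression."""
    M = np.array([quadratic_basis(y) for y in Y])     # (p+1) x (q+1)
    beta, *_ = np.linalg.lstsq(M, np.asarray(fvals, dtype=float), rcond=None)
    return beta, M

# Ten sample points in R^2 against a basis of size q + 1 = 6: system (2.1)
# is overdetermined and solved in the least-squares sense.
rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 2))
true_f = lambda y: 1 + 2 * y[0] - y[1] + 0.5 * y[0] * y[1]   # itself quadratic
beta, M = fit_model(Y, [true_f(y) for y in Y])
model = lambda y: quadratic_basis(y) @ beta                  # regression model
```

With exactly six well-placed points the same call performs interpolation; with fewer, `lstsq` returns the minimum-norm solution of the underdetermined system.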
2.1.2 CSV2framework
We next outline the CSV2-framework for derivative-free trust region methods described by Conn, Scheinberg, and Vicente [12, Algorithm 10.3]. Algorithms in the framework construct a model function m_k(·) at iteration k, which approximates the objective function f on a set of sample points Y_k = {y⁰, ..., y^{p_k}} ⊂ ℝ^n. The next iterate is then determined by minimizing m_k. Specifically, given the iterate x_k, a putative next iterate is given by x_k + s_k, where the step s_k solves the trust region subproblem

min_{s : ‖s‖ ≤ Δ_k} m_k(x_k + s),

where the scalar Δ_k > 0 denotes the trust region radius, which may vary from iteration to iteration. If x_k + s_k produces sufficient descent in the model function, then f(x_k + s_k) is evaluated, and the iterate is accepted if f(x_k + s_k) < f(x_k); otherwise, the step is not accepted. In either case, the trust region radius may be adjusted, and a model-improvement algorithm may be called to obtain a more accurate model.
To establish convergence properties, the following smoothness assumption is made on f:

Assumption 2.1 Suppose that a set of points S ⊂ ℝ^n and a radius Δ_max are given. Let Ω be an open domain containing the Δ_max neighborhood

∪_{x ∈ S} B(x; Δ_max)

of the set S. Assume f ∈ LC²(Ω) with Lipschitz constant L.
The CSV2framework does not specify how the model functions are constructed. However, it does require that the model functions be selected from a fully quadratic class of model functions M.:
Definition 2.2 Let f satisfy Assumption 2.1. Let k = be a giuen
uector of constants, and let A > 0. A model function m G C2 is Kfully quadratic on B(x; A) if m has a Lipschitz continuous Hessian with corresponding Lipschitz constant bounded by uf1 and
• the error between the Hessian of the model and the Hessian of the function satisfies
$$\|\nabla^2 f(y) - \nabla^2 m(y)\| \le \kappa_{eh}\Delta \quad \text{for all } y \in B(x; \Delta),$$
• the error between the gradient of the model and the gradient of the function satisfies
$$\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\Delta^2 \quad \text{for all } y \in B(x; \Delta),$$
• the error between the model and the function satisfies
$$|f(y) - m(y)| \le \kappa_{ef}\Delta^3 \quad \text{for all } y \in B(x; \Delta).$$
Definition 2.3 Let $f$ satisfy Assumption 2.1. A set of model functions $\mathcal{M} = \{m : \mathbb{R}^n \to \mathbb{R},\ m \in C^2\}$ is called a fully quadratic class of models if there exist positive constants $\kappa = (\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, \nu_m)$ such that the following hold:
1. For any $x \in S$ and $\Delta \in (0, \Delta_{\max}]$, there exists a model function $m$ in $\mathcal{M}$ which is $\kappa$-fully quadratic on $B(x; \Delta)$.
2. For this class $\mathcal{M}$, there exists an algorithm, called a "model-improvement" algorithm, that in a finite, uniformly bounded (with respect to $x$ and $\Delta$) number of steps can
• either certify that a given model $m \in \mathcal{M}$ is $\kappa$-fully quadratic on $B(x; \Delta)$,
• or find a model $\tilde{m} \in \mathcal{M}$ that is $\kappa$-fully quadratic on $B(x; \Delta)$.
Note that this definition of a fully quadratic class of models is equivalent to [12, Definition 6.2], but we have given a separate definition of a $\kappa$-fully quadratic model (Definition 2.2) that includes the use of $\kappa$ to stress the fixed nature of the bounding constants. This change simplifies some analysis by allowing us to discuss $\kappa$-fully quadratic models independent of the class of models they belong to. It is important to note that $\kappa$ does not need to be known explicitly. Instead, it can be defined implicitly by the model-improvement algorithm. All that is required is for $\kappa$ to be fixed (that is, independent of $x$ and $\Delta$). We also note that the set $\mathcal{M}$ may include non-quadratic functions, but when the model functions are quadratic, the Hessian is fixed, so $\nu_m$ can be chosen to be zero. For the algorithms presented in Chapter 3 and Chapter 4, we focus on model functions that are quadratic; that is, $\mathcal{M} = \mathcal{P}_n^2$.
As a side note, $\kappa$-fully quadratic models may be too difficult to construct or may be undesired in some situations. If that is the case, $\kappa$-fully linear models might provide a useful alternative.
Definition 2.4 Let $f \in LC^2$, let $\kappa = (\kappa_{ef}, \kappa_{eg}, \nu_m)$ be a given vector of constants, and let $\Delta > 0$. A model function $m \in C^2$ is $\kappa$-fully linear on $B(x; \Delta)$ if $m$ has a Lipschitz continuous gradient with corresponding Lipschitz constant bounded by $\nu_m$ and
â€¢ the error between the gradient of the model and the gradient of the function satisfies
$$\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\Delta \quad \forall y \in B(x; \Delta),$$
â€¢ the error between the model and the function satisfies
$$|f(y) - m(y)| \le \kappa_{ef}\Delta^2 \quad \forall y \in B(x; \Delta).$$
If $m$ is $\kappa$-fully linear, it approximates $f$ in a fashion similar to the first-order Taylor model of $f$. If $m$ is $\kappa$-fully quadratic, then it approximates $f$ in a fashion similar to the second-order Taylor model of $f$. If $\kappa$-fully linear (or quadratic) models are used within the CSV2-framework, we can guarantee convergence of the algorithm to a first- (or second-) order stationary point of $f$.
A critical distinction between the CSV2-framework and classical trust region methods lies in the optimality criteria. In classical trust region methods, $m_k$ is the second-order Taylor approximation of $f$ at $x^k$; so if $x^k$ is optimal for $m_k$, it satisfies the first- and second-order necessary conditions for an optimum of $f$. In the CSV2-framework, $x^k$ must be optimal for $m_k$, but $m_k$ must also be an accurate approximation of $f$ near $x^k$. This requires that the trust region radius is small and that $m_k$ is $\kappa$-fully quadratic on the trust region for some fixed $\kappa$.
To explicitly outline the CSV2-framework, we provide pseudocode below. In the algorithm, $g_k^{icb}$ and $H_k^{icb}$ denote the gradient and Hessian of the incumbent model $m_k^{icb}$. (We use the superscript $icb$ to stress that incumbent parameters from the previous iterates may be changed before they are used in the trust region step.) The optimality of $x^k$ with respect to $m_k$ is tested by calculating $\sigma_k^{icb} = \max\{\|g_k^{icb}\|, -\lambda_{\min}(H_k^{icb})\}$. This quantity is zero if and only if the first- and second-order optimality conditions for $m_k$ are satisfied. The algorithm enters the criticality step when $\sigma_k^{icb}$ is close to zero. This routine builds a (possibly) new $\kappa$-fully quadratic model for the current $\Delta_k^{icb}$, and tests if $\sigma_k^{icb}$ for this model is sufficiently large. If so, a descent direction has been determined, and the algorithm can proceed. If not, the criticality step reduces $\Delta_k^{icb}$ and updates the sample set to improve the accuracy of the model function near $x^k$. The criticality step ends when $\sigma_k^{icb}$ is large enough (and the algorithm proceeds) or when both $\sigma_k^{icb}$ and $\Delta_k^{icb}$ are smaller than given threshold values $\epsilon_c$ and $\Delta_{\min}$ (in which case the algorithm has identified a second-order stationary point). We refer the reader to [12] for a more detailed discussion of the algorithm, including explanations of the parameters $\eta_0$, $\eta_1$, $\gamma$, $\gamma_{inc}$, $\beta$, $\mu$ and $\omega$.
Algorithm CSV2 ([12, Algorithm 10.3])
Step 0 (initialization): Choose a fully quadratic class of models $\mathcal{M}$ and a corresponding model-improvement algorithm, with associated $\kappa$ defined by Definition 2.3. Choose an initial point $x^0$ and maximum trust region radius $\Delta_{\max} > 0$. We assume that the following are given: an initial model $m_0^{icb}(x)$ (with gradient and Hessian at $x = x^0$ denoted by $g_0^{icb}$ and $H_0^{icb}$, respectively), $\sigma_0^{icb} = \max\{\|g_0^{icb}\|, -\lambda_{\min}(H_0^{icb})\}$, and a trust region radius $\Delta_0^{icb} \in (0, \Delta_{\max}]$.
The constants $\eta_0$, $\eta_1$, $\gamma$, $\gamma_{inc}$, $\epsilon_c$, $\beta$, $\mu$, $\omega$ are given and satisfy the conditions $0 \le \eta_0 \le \eta_1 < 1$ (with $\eta_1 \ne 0$), $0 < \gamma < 1 < \gamma_{inc}$, $\epsilon_c > 0$, $\mu > \beta > 0$, $\omega \in (0, 1)$. Set $k = 0$.
Step 1 (criticality step): If $\sigma_k^{icb} > \epsilon_c$, then $m_k = m_k^{icb}$ and $\Delta_k = \Delta_k^{icb}$.
If $\sigma_k^{icb} \le \epsilon_c$, then proceed as follows. Call the model-improvement algorithm to attempt to certify if the model $m_k^{icb}$ is $\kappa$-fully quadratic on $B(x^k; \Delta_k^{icb})$. If at least one of the following conditions holds,
• the model $m_k^{icb}$ is not certifiably $\kappa$-fully quadratic on $B(x^k; \Delta_k^{icb})$, or
• $\Delta_k^{icb} > \mu\sigma_k^{icb}$,
then apply Algorithm CriticalityStep (described below) to construct a model $\tilde{m}_k(x)$ (with gradient and Hessian at $x = x^k$ denoted by $\tilde{g}_k$ and $\tilde{H}_k$, respectively), with $\tilde{\sigma}_k = \max\{\|\tilde{g}_k\|, -\lambda_{\min}(\tilde{H}_k)\}$, which is $\kappa$-fully quadratic on the ball $B(x^k; \tilde{\Delta}_k)$ for some $\tilde{\Delta}_k \in (0, \mu\tilde{\sigma}_k]$ given by [12, Algorithm 10.4]. In such a case set
$$m_k = \tilde{m}_k \quad \text{and} \quad \Delta_k = \min\left\{\max\left\{\tilde{\Delta}_k, \beta\tilde{\sigma}_k\right\}, \Delta_k^{icb}\right\}.$$
Otherwise, set $m_k = m_k^{icb}$ and $\Delta_k = \Delta_k^{icb}$. For a more complete discussion of trust region management, see [26].
Step 2 (step calculation): Compute a step $s^k$ that sufficiently reduces the model $m_k$ (in the sense of [12, (10.13)]) such that $x^k + s^k \in B(x^k; \Delta_k)$.
Step 3 (acceptance of the trial point): Compute $f(x^k + s^k)$ and define
$$\rho_k = \frac{f(x^k) - f(x^k + s^k)}{m_k(x^k) - m_k(x^k + s^k)}.$$
If $\rho_k \ge \eta_1$, or if both $\rho_k > \eta_0$ and the model is $\kappa$-fully quadratic on $B(x^k; \Delta_k)$, then $x^{k+1} = x^k + s^k$ and the model is updated to include the new iterate into the sample set, resulting in a new model $m_{k+1}^{icb}(x)$ (with gradient and Hessian at $x = x^{k+1}$ denoted by $g_{k+1}^{icb}$ and $H_{k+1}^{icb}$, respectively), with $\sigma_{k+1}^{icb} = \max\{\|g_{k+1}^{icb}\|, -\lambda_{\min}(H_{k+1}^{icb})\}$; otherwise, the model and the iterate remain unchanged ($m_{k+1}^{icb} = m_k$ and $x^{k+1} = x^k$).
Step 4 (model improvement): If $\rho_k < \eta_1$, use the model-improvement algorithm to
• attempt to certify that $m_k$ is $\kappa$-fully quadratic on $B(x^k; \Delta_k)$,
• if such a certificate is not obtained, we say that $m_k$ is not certifiably $\kappa$-fully quadratic and make one or more suitable improvement steps.
Define $m_{k+1}^{icb}$ to be the (possibly improved) model.
Step 5 (trust region update): Set
$$\Delta_{k+1}^{icb} \in \begin{cases} \{\min\{\gamma_{inc}\Delta_k, \Delta_{\max}\}\} & \text{if } \rho_k \ge \eta_1 \text{ and } \Delta_k < \beta\sigma_k^{icb}, \\ [\Delta_k, \min\{\gamma_{inc}\Delta_k, \Delta_{\max}\}] & \text{if } \rho_k \ge \eta_1 \text{ and } \Delta_k \ge \beta\sigma_k^{icb}, \\ \{\gamma\Delta_k\} & \text{if } \rho_k < \eta_1 \text{ and } m_k \text{ is } \kappa\text{-fully quadratic}, \\ \{\Delta_k\} & \text{if } \rho_k < \eta_1 \text{ and } m_k \text{ is not certifiably } \kappa\text{-fully quadratic}. \end{cases}$$
Increment k by 1 and go to Step 1.
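The Step 5 radius update can be sketched as follows (a hypothetical Python rendering that picks one representative value from each case; the actual algorithm allows any value in the interval of the second case, and `sigma` stands for $\sigma_k^{icb}$):

```python
# Hypothetical rendering of the Step 5 trust region update.
def update_radius(rho, delta, sigma, fully_quadratic,
                  eta1, beta, gamma, gamma_inc, delta_max):
    if rho >= eta1:
        if delta < beta * sigma:
            return min(gamma_inc * delta, delta_max)   # successful: expand
        return delta                                   # successful: may keep
    if fully_quadratic:
        return gamma * delta            # unsuccessful with a good model: shrink
    return delta                        # improve the model before shrinking

d1 = update_radius(0.9, 1.0, 2.0, True, 0.75, 1.0, 0.5, 2.0, 10.0)
d2 = update_radius(0.1, 1.0, 2.0, True, 0.75, 1.0, 0.5, 2.0, 10.0)
```

With the parameter values shown, a very successful step expands the radius while an unsuccessful step with a certified model shrinks it.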
Algorithm CriticalityStep ([12, Algorithm 10.4]) This algorithm is applied only if $\sigma_k^{icb} \le \epsilon_c$ and at least one of the following holds: the model $m_k^{icb}$ is not certifiably $\kappa$-fully quadratic on $B(x^k; \Delta_k^{icb})$ or $\Delta_k^{icb} > \mu\sigma_k^{icb}$.
Initialization: Set $i = 0$. Set $m_k^{(0)} = m_k^{icb}$.
Repeat: Increment $i$ by one. Use the model-improvement algorithm to improve the previous model until it is $\kappa$-fully quadratic on $B(x^k; \omega^{i-1}\Delta_k^{icb})$. Denote the new model by $m_k^{(i)}$. Set $\Delta_k = \omega^{i-1}\Delta_k^{icb}$ and $m_k = m_k^{(i)}$.
Until $\Delta_k \le \mu\sigma_k^{(i)}$.
Global Convergence If the following assumptions are satisfied, it has been shown that the CSV2-framework iterates will converge to a stationary point of $f$. Define the set $L(x^0) = \{x \in \mathbb{R}^n : f(x) \le f(x^0)\}$.
Assumption 2.5 Assume that $f$ is bounded from below on $L(x^0)$.
Assumption 2.6 There exists a constant $\kappa_{bhm} > 0$ such that, for all $x^k$ generated by the algorithm,
$$\|H_k\| \le \kappa_{bhm}.$$
Theorem 2.7 ([12, Theorem 10.23]) Let Assumptions 2.1, 2.5 and 2.6 hold with $S = L(x^0)$. Then, if the models used in the CSV2-framework are $\kappa$-fully quadratic, the iterates $x^k$ generated by the CSV2-framework satisfy
$$\lim_{k \to +\infty} \max\left\{\|\nabla f(x^k)\|, -\lambda_{\min}(\nabla^2 f(x^k))\right\} = 0.$$
It follows from this theorem that any accumulation point of $\{x^k\}$ satisfies the first- and second-order necessary conditions for a minimum of $f$.
2.1.3 Poisedness
Having outlined the CSV2-framework, we can discuss conditions on the sample set used to build $m_k$ which guarantee the model sufficiently approximates $f$. Consider the set of polynomials in $\mathbb{R}^n$ of degree less than or equal to $d$, denoted $\mathcal{P}_n^d$. A basis $\phi = \{\phi_0(x), \dots, \phi_q(x)\}$ of $\mathcal{P}_n^d$ is a set of polynomials from $\mathcal{P}_n^d$ which span $\mathcal{P}_n^d$. That is, for any $P(x) \in \mathcal{P}_n^d$ there exist coefficients $\beta_i$ such that $P(x) = \sum_{i=0}^{q} \beta_i\phi_i(x)$. We let $|\mathcal{P}_n^d|$ denote the number of elements in any basis $\phi$ of $\mathcal{P}_n^d$. For example, $|\mathcal{P}_n^1| = n + 1$ and $|\mathcal{P}_n^2| = (n+1)(n+2)/2$.
Definition 2.8 A set of points $X = \{x^0, \dots, x^p\} \subset \mathbb{R}^n$ with $|X| = |\mathcal{P}_n^d|$ is poised for interpolation if the matrix $M(\phi, X)$ is nonsingular for some basis $\phi$ in $\mathcal{P}_n^d$.
If $|X| > |\mathcal{P}_n^d|$, we can construct the least squares regression model by solving (2.1) as well. We extend the definition of poisedness to the regression case.
Definition 2.9 A set of points $X = \{x^0, \dots, x^p\} \subset \mathbb{R}^n$ with $|X| > |\mathcal{P}_n^d|$ is poised for regression if the matrix $M(\phi, X)$ has full column rank for some basis $\phi$ in $\mathcal{P}_n^d$.
Since we have limited information about the function $f$, we want the interpolating or regressing $m(x)$ to be an accurate approximation in a region of interest. This requires that $X$ consist of points spread out within said region. Since $M(\phi, X)$ can be arbitrarily poorly conditioned while $X$ is still poised, simply being poised is not enough to measure the quality of a set $X$. Looking at the condition number of $M(\phi, X)$ alone is also inadequate, since scaling the sample set $X$ or choosing a different basis can arbitrarily adjust this quantity. Nevertheless, there are methods for measuring the quality of a sample set, one of the most common of which is based on Lagrange polynomials.
Definition 2.10 For a set $X = \{x^0, \dots, x^p\} \subset \mathbb{R}^n$ with $|X| = |\mathcal{P}_n^d|$, the set of interpolating Lagrange polynomials $\ell = \{\ell_0, \dots, \ell_p\} \subset \mathcal{P}_n^d$ are the polynomials satisfying
$$\ell_i(x^j) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise}. \end{cases}$$
If the set $X$ is poised, then this set of polynomials is guaranteed to exist and be uniquely defined.
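As a concrete illustration, interpolation Lagrange polynomials can be computed numerically from the columns of $M(\phi, X)^{-1}$, since $\ell_j(x) = \phi(x)^T a^j$ with $M(\phi, X)\,a^j = e_j$. A minimal sketch for a one-dimensional quadratic basis (the set $X$ and basis here are illustrative choices):

```python
import numpy as np

# Interpolation Lagrange polynomials for the 1-D quadratic basis
# phi = {1, x, x^2/2} on an illustrative poised set X (|X| = |P_1^2| = 3).
phi = lambda x: np.array([1.0, x, x**2 / 2.0])
X = np.array([0.0, 0.5, 1.0])
M = np.array([phi(x) for x in X])          # M(phi, X)
A = np.linalg.inv(M)                       # column j: coefficients of ell_j

def lagrange(j, x):
    return phi(x) @ A[:, j]

# ell_j(x^i) = 1 when i == j and 0 otherwise:
L = np.array([[lagrange(j, xi) for j in range(3)] for xi in X])
```

Evaluating each $\ell_j$ at the sample points reproduces the identity matrix, which is exactly the defining property above.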
We can extend the definition of Lagrange polynomials to the regression case in a natural fashion.
Definition 2.11 For a set $X = \{x^0, \dots, x^p\} \subset \mathbb{R}^n$ with $|X| > |\mathcal{P}_n^d|$, the set of regression Lagrange polynomials $\ell = \{\ell_0, \dots, \ell_p\} \subset \mathcal{P}_n^d$ are the polynomials satisfying, in the least-squares sense,
$$\ell_i(x^j) \overset{\text{l.s.}}{=} \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise}. \end{cases}$$
Though these polynomials are no longer linearly independent, if X is poised, then the set of regression Lagrange polynomials exists and is uniquely defined.
We can now use these Lagrange polynomials to extend the definition of poisedness to $\Lambda$-poisedness. This bounds the magnitude of the Lagrange polynomials on a set $B \subset \mathbb{R}^n$, which will provide a method for measuring the quality of a sample set.
Definition 2.12 Let $\Lambda > 0$ and a set $B \subset \mathbb{R}^n$ be given. For a basis $\phi$ of $\mathcal{P}_n^d$, a poised set $X = \{x^0, \dots, x^p\}$ is said to be $\Lambda$-poised in $B$ (in the interpolating sense) if and only if
1. for the basis of Lagrange polynomials associated with $X$,
$$\Lambda \ge \max_{0 \le i \le p} \max_{x \in B} |\ell_i(x)|,$$
or, equivalently,
2. for any $x \in B$ there exists $\lambda(x) \in \mathbb{R}^{p+1}$ such that
$$\sum_{i=0}^{p} \lambda_i(x)\phi(x^i) = \phi(x) \quad \text{with} \quad \|\lambda(x)\|_\infty \le \Lambda.$$
We can again extend this definition to the regression case.
Definition 2.13 Let $\Lambda > 0$ and a set $B \subset \mathbb{R}^n$ be given. For a basis $\phi$ of $\mathcal{P}_n^d$, a poised set $X = \{x^0, \dots, x^p\}$ with $|X| > |\mathcal{P}_n^d|$ is said to be $\Lambda$-poised in $B$ (in the regression sense) if and only if
1. for the basis of Lagrange polynomials associated with $X$,
$$\Lambda \ge \max_{0 \le i \le p} \max_{x \in B} |\ell_i(x)|,$$
or, equivalently,
2. for any $x \in B$ there exists $\lambda(x) \in \mathbb{R}^{p+1}$ such that
$$\sum_{i=0}^{p} \lambda_i(x)\phi(x^i) = \phi(x) \quad \text{with} \quad \|\lambda(x)\|_\infty \le \Lambda.$$
Note that if $|X| = |\mathcal{P}_n^d|$, the definitions of $\Lambda$-poised in the interpolation and regression senses coincide.
We can now examine the following bound (from [7]):
$$\|D^r f(x) - D^r m(x)\| \le \frac{\nu_d}{(d+1)!} \sum_{i=0}^{p} \|y^i - x\|^{d+1}\,\|D^r \ell_i(x)\|,$$
where $D^r f(x)$ is the $r$th derivative of $f$ and $\nu_d$ is an upper bound on $\|D^{d+1} f(x)\|$. If $r = 0$, this bound reduces to
$$|f(x) - m(x)| \le \frac{\nu_d}{(d+1)!}\,(p+1)\,\Lambda_B\,\hat{\Delta}^{d+1}, \qquad (2.2)$$
where
$$\Lambda_B = \max_{0 \le i \le p}\max_{x \in B}|\ell_i(x)|,$$
and $\hat{\Delta}$ is the diameter of the smallest ball containing $X$. Therefore, if the number of points in $X$ is bounded, then $\Lambda$-poisedness is sufficient to derive bounds on the error between the regression or interpolating model and the function. That is, decreasing the radius of the sample set will provide bounds similar to Taylor bounds when derivatives are available. If using regression models with arbitrarily many points, $\Lambda$-poisedness is not enough to construct similar bounds. Strong $\Lambda$-poisedness can help in this case.
Definition 2.14 Let $\ell(x) = (\ell_0(x), \dots, \ell_p(x))^T$ be the regression Lagrange polynomials associated with the set $Y = \{y^0, \dots, y^p\}$. Let $\Lambda > 0$ and let $B$ be a set in $\mathbb{R}^n$. The set $Y$ is said to be strongly $\Lambda$-poised in $B$ (in the regression sense) if and only if
$$\frac{q_1}{\sqrt{p_1}}\,\Lambda \ge \max_{x \in B}\|\ell(x)\|,$$
where $q_1 = |\mathcal{P}_n^d|$ and $p_1 = p + 1$.
Since we can rewrite (2.2) as
$$|f(x) - m(x)| \le \frac{\nu_d}{(d+1)!}\,\sqrt{p+1}\,\Lambda_{B,2}\,\hat{\Delta}^{d+1},$$
where
$$\Lambda_{B,2} = \max_{x \in B}\|\ell(x)\|,$$
strong $\Lambda$-poisedness provides Taylor-like error bounds between the regression model $m$ and the function $f$, even when the number of points in $X$ is unbounded.
As a final note, explicitly calculating the value of $\Lambda$ is computationally expensive, but not necessary. It is possible to use the condition number of the design matrix $M(\phi, X)$ to bound the constant $\Lambda$. Since it is possible to change the condition number of $M(\phi, X)$ by shifting and scaling $X$, or by choosing a different basis, conditions must be placed on $M(\phi, X)$ before using its condition number: we 1) use the standard basis $\bar{\phi}$ and 2) shift and scale $X$ so that it lies within the unit ball. The next two theorems address the interpolation and regression cases, respectively.
Theorem 2.15 Let $\hat{X}$ denote the shifted and scaled version of $X$, so every point lies within the unit ball and at least one point has norm 1. Let $\hat{M} = M(\bar{\phi}, \hat{X})$, where $\bar{\phi}$ is the standard basis. If $\hat{M}$ is nonsingular and $\|\hat{M}^{-1}\| \le \Lambda$, then the set $X$ is $\sqrt{p_1}\Lambda$-poised in the unit ball. Conversely, if the set $X$ is $\Lambda$-poised in the unit ball, then
$$\|\hat{M}^{-1}\| \le \theta\sqrt{p_1}\,\Lambda,$$
where $\theta > 0$ depends on $n$ and $d$ but is independent of $X$ and $\Lambda$.
Theorem 2.16 Let $\hat{X}$ denote the shifted and scaled version of $X$, so every point lies within the unit ball and at least one point has norm 1. Let $\hat{M} = M(\bar{\phi}, \hat{X})$, where $\bar{\phi}$ is the standard basis. If $\|\hat{M}^{+}\| \le \sqrt{q_1/p_1}\,\Lambda$, then the set $X$ is strongly $\Lambda$-poised in the unit ball. Conversely, if the set $X$ is strongly $\Lambda$-poised in the unit ball, then
$$\|\hat{M}^{+}\| \le \frac{\theta q_1}{\sqrt{p_1}}\,\Lambda,$$
where $\theta$ depends on $n$ and $d$ but is independent of $X$ and $\Lambda$.
2.2 Performance Profiles
Next, we explain the content of performance profiles, which are a compact method for comparing the performance of various algorithms on a set of problems. We will use Figure 2.1 as an example. Algorithms A, B, and C have been run on an identical set of problems for the same number of function evaluations. The left axis shows the percentage of problems each algorithm solved first, where "solved" is user-defined. (Often, an algorithm is considered to solve a problem when it first finds a function value within a tolerance of the best known solution.) In the example, A solves 20% of the problems first, B solves 55% of the problems first, and C solves 30% of the problems first. As these percentages total over 100%, there is an overlap in the set of problems the algorithms solve first.
The right axis shows the percentage of problems solved by a given algorithm in the number of function evaluations given. All algorithms in Figure 2.1 solve over 90% of the problems in the testing set. Values between the left and right axes denote the percentage of problems solved as a multiple of the number of evaluations required by the fastest algorithm. For example, given 6 times as many function evaluations as the fastest algorithm on each problem, A solves 80% of the problems in the testing set.
Formally, an algorithm is considered to solve a problem when it first finds a function value satisfying
$$f(x^0) - f(x) \ge (1 - \tau)\left(f(x^0) - f_L\right),$$
where $\tau > 0$ is a small tolerance, $f_L$ is the smallest function value found by any solver in a specified number of iterations, and $x^0$ is the initial point given to each algorithm.
Figure 2.1: An example of a performance profile
If $t_{p,a}$ is the number of function evaluations required by solver $a$ to solve problem $p$, define the performance ratio
$$r_{p,a} = \frac{t_{p,a}}{\min_a t_{p,a}}.$$
Then the performance profile of a solver $a$ is the fraction of problems where the performance ratio is at most $\alpha$. That is,
$$\rho_a(\alpha) = \frac{1}{|P|}\left|\left\{p \in P : r_{p,a} \le \alpha\right\}\right|,$$
where P is the set of benchmark problems.
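A minimal computation of these quantities, assuming a matrix of evaluation counts with `np.inf` marking unsolved problems (the data are hypothetical):

```python
import numpy as np

# Performance ratios and profiles from a hypothetical table T of
# evaluation counts: T[p, a] = evaluations solver a needed on problem p.
T = np.array([[10.0,  5.0, 20.0],
              [40.0, 80.0, 30.0],
              [12.0, 12.0, np.inf],
              [ 9.0,  3.0,  6.0]])
r = T / T.min(axis=1, keepdims=True)       # performance ratios r_{p,a}

def profile(alpha, a):
    """rho_a(alpha): fraction of problems with r_{p,a} <= alpha."""
    return np.mean(r[:, a] <= alpha)
```

For instance, `profile(1.0, a)` is the fraction of problems solver `a` solves fastest (ties counted for both solvers), and the curve `profile(alpha, a)` over increasing `alpha` is what Figure 2.1 plots.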
2.3 Probabilistic Convergence
Lastly, we define various forms of probabilistic convergence. A sequence $\{X_n\}$ of random variables is said to converge in distribution (or converge weakly, or converge in law) to a random variable $X$ if
$$\lim_{n \to \infty} F_n(x) = F(x)$$
for every number $x \in \mathbb{R}$ at which $F$ is continuous. Here $F_n$ and $F$ are the cumulative distribution functions of the random variables $X_n$ and $X$, respectively.
A sequence $\{X_n\}$ of random variables converges in probability to $X$ if, for all $\epsilon > 0$,
$$\lim_{n \to \infty} P\left(|X_n - X| \ge \epsilon\right) = 0.$$
A sequence $\{X_n\}$ of random variables converges almost surely (or almost everywhere, or with probability 1, or strongly) towards $X$ if
$$P\left(\lim_{n \to \infty} X_n = X\right) = 1.$$
3. Derivative-free Optimization of Expensive Functions with Computational Error Using Weighted Regression
3.1 Introduction
In this chapter, we construct an algorithm designed to optimize functions evaluated by large computational codes, taking minutes, hours, or even days for a single function call, for which derivative information is unavailable, and for which function evaluations are subject to computational error. Such error may be deterministic (arising, for example, from discretization error) or stochastic (for example, from Monte Carlo simulation). Because function evaluations are extremely expensive, it is sensible to perform substantial work at each iteration to reduce the number of function evaluations required to obtain an optimum.
We assume that the accuracy of the function evaluation can vary from point to point, and this variation can be quantified. In this chapter, we will use knowledge of this varying error to improve the performance of the algorithm by using weighted regression models in a trust region framework. By giving more accurate points more weight when constructing the trust region model, we hope that the models will more closely approximate the function being optimized. This leads to a better performing algorithm.
Our algorithm fits within the CSV2-framework, which is outlined in Chapter 2. To specify an algorithm within this framework, three things are required:
1. Define the class of model functions $\mathcal{M}$. This is determined by the method for constructing models from the sample set. In [10], models were constructed using interpolation, least squares regression, and minimum Frobenius norm methods. We describe the general form of our weighted regression models in §3.2 and propose a specific weighting scheme in §3.5.
2. Define a model-improvement algorithm. §3.4 describes our model improvement algorithm, which tests the geometry of the sample set and, if necessary, adds and/or deletes points to ensure that the model function constructed from the sample set satisfies the error bounds in Definition 2.2 (i.e., it is $\kappa$-fully quadratic).
3. Demonstrate that the model-improvement algorithm satisfies the requirements of the definition of a class of fully quadratic models. For our algorithm, this is discussed in §3.4.
The chapter is organized as follows. We place our algorithm in the CSV2-framework by describing 1) how model functions are constructed (§3.2) and 2) a model improvement algorithm (§3.4). Before describing the model improvement algorithm, we first extend the theory of $\Lambda$-poisedness to the weighted regression framework (§3.3). Computational results are presented in §3.5 using a heuristic weighting scheme, which is described in that section. §3.6 concludes the chapter.
3.2 Model Construction
This section describes how we construct the model function $m_k$ at the $k$th iteration. For simplicity, we drop the subscript $k$ for the remainder of this section. Let $f = (f_0, \dots, f_p)^T$, where $f_i$ denotes the computed function value at $y^i$, and let $\varepsilon_i$ denote the associated computational error. That is,
$$f_i = f(y^i) + \varepsilon_i. \qquad (3.1)$$
Let $w = (w_0, \dots, w_p)^T$ be a vector of positive weights for the set of points $Y = \{y^0, \dots, y^p\}$. A quadratic polynomial $m$ is said to be a weighted least squares approximation of $f$ (with respect to $w$) if it minimizes
$$\sum_{i=0}^{p} w_i^2\left(m(y^i) - f_i\right)^2 = \left\|W\left(m(Y) - f\right)\right\|^2,$$
where $m(Y)$ denotes the vector $(m(y^0), m(y^1), \dots, m(y^p))^T$ and $W = \operatorname{diag}(w)$. In this case, we write
$$W m(Y) \overset{\text{l.s.}}{=} W f. \qquad (3.2)$$
Let $\phi = \{\phi_0, \phi_1, \dots, \phi_q\}$ be a basis for the quadratic polynomials in $\mathbb{R}^n$. For example, $\phi$ might be the monomial basis
$$\phi = \left\{1, x_1, x_2, \dots, x_n, x_1^2/2, x_1 x_2, \dots, x_{n-1}x_n, x_n^2/2\right\}. \qquad (3.3)$$
Define
$$M(\phi, Y) = \begin{bmatrix} \phi_0(y^0) & \phi_1(y^0) & \cdots & \phi_q(y^0) \\ \phi_0(y^1) & \phi_1(y^1) & \cdots & \phi_q(y^1) \\ \vdots & \vdots & & \vdots \\ \phi_0(y^p) & \phi_1(y^p) & \cdots & \phi_q(y^p) \end{bmatrix}.$$
Since $\phi$ is a basis for the quadratic polynomials, the model function can be written $m(x) = \sum_{i=0}^{q} \alpha_i\phi_i(x)$. The coefficients $\alpha = (\alpha_0, \dots, \alpha_q)^T$ solve the weighted least squares regression problem
$$W M(\phi, Y)\,\alpha \overset{\text{l.s.}}{=} W f. \qquad (3.4)$$
If $M(\phi, Y)$ has full column rank, the sample set $Y$ is said to be poised for quadratic regression. The following lemma is a straightforward generalization of [12, Lemma 4.3]:
Lemma 3.1 If $Y$ is poised for quadratic regression, then the weighted least squares regression polynomial (with respect to positive weights $w = (w_0, \dots, w_p)$) exists, is unique, and is given by $m(x) = \phi(x)^T\alpha$, where
$$\alpha = (WM)^{+}Wf = (M^T W^2 M)^{-1} M^T W^2 f, \qquad (3.5)$$
where $W = \operatorname{diag}(w)$ and $M = M(\phi, Y)$.
Proof: Since $W$ and $M$ both have full column rank, so does $WM$. Thus, the least squares problem (3.4) has a unique solution given by $(WM)^{+}Wf$. Moreover, since $WM$ has full column rank, $(WM)^{+} = \left((WM)^T(WM)\right)^{-1}M^TW$. ■
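Formula (3.5) translates directly into code. The sketch below fits a one-dimensional quadratic in the basis $\{1, x, x^2/2\}$ to exact quadratic data, so the known coefficients are recovered; the sample set and weights are illustrative:

```python
import numpy as np

# Weighted least squares model coefficients via (3.5):
# alpha = (M^T W^2 M)^{-1} M^T W^2 f, 1-D monomial quadratic basis.
def phi(x):
    return np.array([1.0, x, x**2 / 2.0])

Y = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
fvals = np.array([3.0 + 2.0 * x + 0.5 * x**2 for x in Y])  # exact quadratic
w = np.array([1.0, 1.0, 2.0, 2.0, 1.0, 1.0])               # positive weights
M = np.array([phi(x) for x in Y])
W2 = np.diag(w**2)
alpha = np.linalg.solve(M.T @ W2 @ M, M.T @ W2 @ fvals)
```

Because the data come from $3 + 2x + x^2/2 \cdot 1$ in this basis, the solve returns $\alpha = (3, 2, 1)$ for any choice of positive weights.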
3.3 Error Analysis and the Geometry of the Sample Set
Throughout this section, $Y = \{y^0, \dots, y^p\}$ denotes the sample set, $p_1 = p + 1$, $w \in \mathbb{R}^{p_1}$ is a vector of positive weights, $W = \operatorname{diag}(w)$, and $M = M(\phi, Y)$. $f$ denotes the vector of computed function values at the points in $Y$, as defined by (3.1).
The accuracy of the model function $m_k$ relies critically on the geometry of the sample set. In this section, we generalize the theory of poisedness from [12] to the weighted regression framework. This section also includes error analysis which extends results from [12] to weighted regression, as well as considering the impact of computational error on the error bounds. We start by defining weighted regression Lagrange polynomials.
3.3.1 Weighted Regression Lagrange Polynomials
Definition 3.2 A set of polynomials $\ell_j(x)$, $j = 0, \dots, p$ in $\mathcal{P}_n^2$ are called weighted regression Lagrange polynomials with respect to the weights $w$ and sample set $Y$ if, for each $j$,
$$W\bar{\ell}_j \overset{\text{l.s.}}{=} W e_j,$$
where $\bar{\ell}_j = [\ell_j(y^0), \dots, \ell_j(y^p)]^T$ and $e_j$ is the $j$th coordinate vector.
The following lemma is a direct application of Lemma 3.1.
Lemma 3.3 Let $\phi(x) = (\phi_0(x), \dots, \phi_q(x))^T$. If $Y$ is poised, then the set of weighted regression Lagrange polynomials exists and is unique, and is given by $\ell_j(x) = \phi(x)^T a^j$, $j = 0, \dots, p$, where $a^j$ is the $j$th column of the matrix
$$A = (WM)^{+}W. \qquad (3.6)$$
Proof: Note that $m(\cdot) = \ell_j(\cdot)$ satisfies (3.2) with $f = e_j$. By Lemma 3.1, $\ell_j(x) = \phi(x)^T a^j$, where $a^j = (WM)^{+}We_j$, which is the $j$th column of $(WM)^{+}W$. ■
The following lemma is based on [12, Lemma 4.6].
Lemma 3.4 If $Y$ is poised, then the model function defined by (3.2) satisfies
$$m(x) = \sum_{i=0}^{p} f_i\,\ell_i(x),$$
where $\ell_i(x)$, $i = 0, \dots, p$ denote the weighted regression Lagrange polynomials corresponding to $Y$ and $W$.
Proof: By Lemma 3.1, $m(x) = \phi(x)^T\alpha$ with
$$\alpha = (WM)^{+}Wf = Af$$
for $A$ defined by (3.6). Let $\ell(x) = [\ell_0(x), \dots, \ell_p(x)]^T$. Then by Lemma 3.3,
$$m(x) = \phi(x)^T A f = \ell(x)^T f = \sum_{i=0}^{p} f_i\,\ell_i(x). \qquad \blacksquare$$
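Lemma 3.4 is easy to verify numerically: evaluating the model via its coefficients and via $\sum_i f_i\,\ell_i(x)$ gives the same value, and the weighted regression Lagrange polynomials sum to one (since constants are reproduced exactly). A one-dimensional sketch with illustrative data:

```python
import numpy as np

# Check of Lemma 3.4 (1-D quadratic basis; sample set, weights, and data
# are illustrative).
def phi(x):
    return np.array([1.0, x, x**2 / 2.0])

Y = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
f = np.sin(Y)                                  # any data vector
W = np.diag([1.0, 2.0, 1.0, 3.0, 1.0])         # positive weights
M = np.array([phi(y) for y in Y])
A = np.linalg.pinv(W @ M) @ W                  # A = (WM)^+ W, as in (3.6)
alpha = A @ f                                  # model coefficients (3.5)
x = 0.3
model = phi(x) @ alpha                         # m(x)
lagrange_sum = f @ (A.T @ phi(x))              # sum_i f_i ell_i(x)
```

Here `A.T @ phi(x)` evaluates the vector $\ell(x)$ at once, since $\ell_j(x) = \phi(x)^T a^j$ by Lemma 3.3.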
3.3.2 Error Analysis
For the remainder of this chapter, let $\hat{Y} = \{\hat{y}^0, \dots, \hat{y}^p\}$ denote the shifted and scaled sample set, where $\hat{y}^i = (y^i - y^0)/R$ and $R = \max_i \|y^i - y^0\|$. Note that $\hat{y}^0 = 0$ and $\max_i \|\hat{y}^i\| = 1$. Any analysis of $\hat{Y}$ can be directly related to $Y$ by the following lemma:
Lemma 3.5 Define the basis $\hat{\phi} = \{\hat{\phi}_0, \dots, \hat{\phi}_q\}$ by $\hat{\phi}_i(x) = \phi_i\left((x - y^0)/R\right)$, where $\phi$ is the monomial basis. Let $\{\ell_0(x), \dots, \ell_p(x)\}$ be weighted regression Lagrange polynomials for $Y$, and let $\{\hat{\ell}_0(x), \dots, \hat{\ell}_p(x)\}$ be weighted regression Lagrange polynomials for $\hat{Y}$. Then $M(\hat{\phi}, Y) = M(\phi, \hat{Y})$. If $Y$ is poised, then
$$\ell(x) = \hat{\ell}\left(\frac{x - y^0}{R}\right).$$
Proof: Observe that
$$M(\hat{\phi}, Y) = \begin{bmatrix} \hat{\phi}_0(y^0) & \cdots & \hat{\phi}_q(y^0) \\ \hat{\phi}_0(y^1) & \cdots & \hat{\phi}_q(y^1) \\ \vdots & & \vdots \\ \hat{\phi}_0(y^p) & \cdots & \hat{\phi}_q(y^p) \end{bmatrix} = \begin{bmatrix} \phi_0(\hat{y}^0) & \cdots & \phi_q(\hat{y}^0) \\ \phi_0(\hat{y}^1) & \cdots & \phi_q(\hat{y}^1) \\ \vdots & & \vdots \\ \phi_0(\hat{y}^p) & \cdots & \phi_q(\hat{y}^p) \end{bmatrix} = M(\phi, \hat{Y}).$$
By the definition of poisedness, $Y$ is poised if and only if $\hat{Y}$ is poised. Let $\phi(x) = (\phi_0(x), \dots, \phi_q(x))^T$ and $\hat{\phi}(x) = (\hat{\phi}_0(x), \dots, \hat{\phi}_q(x))^T$. Then
$$\hat{\phi}(x) = \phi\left(\frac{x - y^0}{R}\right).$$
By Lemma 3.3, if $Y$ is poised, then
$$\ell(x)^T = \hat{\phi}(x)^T\left(W M(\hat{\phi}, Y)\right)^{+}W = \phi\left(\frac{x - y^0}{R}\right)^T\left(W M(\phi, \hat{Y})\right)^{+}W = \hat{\ell}\left(\frac{x - y^0}{R}\right)^T. \qquad \blacksquare$$
Let $f_i$ be defined by (3.1) and let $\Omega$ be an open convex set containing $Y$. If $f$ is $C^2$ on $\Omega$, then by Taylor's theorem, for each sample point $y^i \in Y$ and a fixed $x \in \operatorname{conv}(Y)$, there exists a point $\eta_i(x)$ on the line segment connecting $x$ to $y^i$ such that
$$\begin{aligned} f_i &= f(y^i) + \varepsilon_i \\ &= f(x) + \nabla f(x)^T(y^i - x) + \tfrac{1}{2}(y^i - x)^T\nabla^2 f(\eta_i(x))(y^i - x) + \varepsilon_i \\ &= f(x) + \nabla f(x)^T(y^i - x) + \tfrac{1}{2}(y^i - x)^T\nabla^2 f(x)(y^i - x) \\ &\quad + \tfrac{1}{2}(y^i - x)^T H_i(x)(y^i - x) + \varepsilon_i, \end{aligned} \qquad (3.7)$$
where $H_i(x) = \nabla^2 f(\eta_i(x)) - \nabla^2 f(x)$.
Let $\{\ell_i(x)\}$ denote the weighted regression Lagrange polynomials associated with $Y$. The following lemma and proof are inspired by [7, Theorem 1]:
Lemma 3.6 Let $f$ be twice continuously differentiable on $\Omega$ and let $m(x)$ denote the quadratic function determined by weighted regression. Then, for any $x \in \Omega$, the following identities hold:
• $m(x) = f(x) + \frac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,\ell_i(x) + \sum_{i=0}^{p}\varepsilon_i\,\ell_i(x)$,
• $\nabla m(x) = \nabla f(x) + \frac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,\nabla\ell_i(x) + \sum_{i=0}^{p}\varepsilon_i\,\nabla\ell_i(x)$,
• $\nabla^2 m(x) = \nabla^2 f(x) + \frac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,\nabla^2\ell_i(x) + \sum_{i=0}^{p}\varepsilon_i\,\nabla^2\ell_i(x)$,
where $H_i(x) = \nabla^2 f(\eta_i(x)) - \nabla^2 f(x)$ for some point $\eta_i(x) = \theta x + (1 - \theta)y^i$, $0 \le \theta \le 1$, on the line segment connecting $x$ to $y^i$.
Proof: Let $D$ denote the differential operator as defined in [7], where $D^j$ is the $j$th derivative of a function in $C^i$ with $i \ge j$. In particular, $D^0 f(x) = f(x)$, $D^1 f(x)z = \nabla f(x)^Tz$, and $D^2 f(x)z^2 = z^T\nabla^2 f(x)z$. By Lemma 3.4, $m(x) = \sum_{i=0}^{p} f_i\,\ell_i(x)$; so for $h = 0, 1$ or $2$,
$$D^h m(x) = \sum_{i=0}^{p} f_i\,D^h\ell_i(x).$$
Substituting (3.7) for $f_i$ in the above equation yields
$$\begin{aligned} D^h m(x) = {}&\sum_{i=0}^{p}\sum_{j=0}^{2} \frac{1}{j!}\,D^j f(x)(y^i - x)^j\,D^h\ell_i(x) \\ &+ \frac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,D^h\ell_i(x) + \sum_{i=0}^{p}\varepsilon_i\,D^h\ell_i(x), \end{aligned} \qquad (3.8)$$
where $H_i(x) = \nabla^2 f(\eta_i(x)) - \nabla^2 f(x)$ for some point $\eta_i(x)$ on the line segment connecting $x$ to $y^i$. Consider the first term on the right-hand side above. We shall show that
$$\frac{1}{j!}\sum_{i=0}^{p} D^j f(x)(y^i - x)^j\,D^h\ell_i(x) = \begin{cases} D^h f(x) & \text{for } j = h, \\ 0 & \text{for } j \ne h, \end{cases} \qquad (3.9)$$
for $j = 0, 1, 2$.
Let $B_j = D^j f(x)$, and let $g_j : \mathbb{R}^n \to \mathbb{R}$ be the polynomial defined by $g_j(z) = \frac{1}{j!}B_j(z - x)^j$. Observe that $D^j g_j(x) = B_j$ and $D^h g_j(x) = 0$ for $h \ne j$. Since $g_j$ has degree $j \le 2$, the weighted least squares approximation of $g_j$ by a quadratic polynomial is $g_j$ itself. Thus, by Lemma 3.4 and the definition of $g_j$,
$$g_j(z) = \sum_{i=0}^{p} g_j(y^i)\,\ell_i(z) = \frac{1}{j!}\sum_{i=0}^{p} B_j(y^i - x)^j\,\ell_i(z). \qquad (3.10)$$
Applying the differential operator $D^h$ with respect to $z$ yields
$$D^h g_j(z) = \frac{1}{j!}\sum_{i=0}^{p} B_j(y^i - x)^j\,D^h\ell_i(z).$$
Letting $z = x$, the expression on the right is identical to the left side of (3.9). This proves (3.9), since $D^h g_j(x) = 0$ for $j \ne h$ and $D^j g_j(x) = B_j$ for $j = h$. By (3.9), (3.8) reduces to
$$D^h m(x) = D^h f(x) + \frac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,D^h\ell_i(x) + \sum_{i=0}^{p}\varepsilon_i\,D^h\ell_i(x).$$
Applying this with $h = 0, 1, 2$ proves the lemma. ■
Since $\|H_i(x)\| \le L\|y^i - x\|$ by the Lipschitz continuity of $\nabla^2 f$, the following is a direct consequence of Lemma 3.6.
Corollary 3.7 Let $f$ satisfy Assumption 2.1 for some convex set $\Omega$, and let $m(x)$ denote the quadratic function determined by weighted regression. Then, for any $x \in \Omega$, the following error bounds hold:
• $|f(x) - m(x)| \le \sum_{i=0}^{p}\left(\frac{L}{2}\|y^i - x\|^3 + |\varepsilon_i|\right)|\ell_i(x)|$,
• $\|\nabla f(x) - \nabla m(x)\| \le \sum_{i=0}^{p}\left(\frac{L}{2}\|y^i - x\|^3 + |\varepsilon_i|\right)\|\nabla\ell_i(x)\|$,
• $\|\nabla^2 f(x) - \nabla^2 m(x)\| \le \sum_{i=0}^{p}\left(\frac{L}{2}\|y^i - x\|^3 + |\varepsilon_i|\right)\|\nabla^2\ell_i(x)\|$.
Using this corollary, the following result provides error bounds between the function and the model in terms of the sample set radius.
Corollary 3.8 Let $Y$ be poised, and let $R = \max_i \|y^i - y^0\|$. Suppose $|\varepsilon_i| \le \varepsilon$ for $i = 0, \dots, p$. If $f$ satisfies Assumption 2.1 with Lipschitz constant $L$, then there exist constants $A_1$, $A_2$, and $A_3$, independent of $R$, such that for all $x \in B(y^0; R)$,
• $|f(x) - m(x)| \le A_1\sqrt{p_1}\left(4LR^3 + \varepsilon\right)$,
• $\|\nabla f(x) - \nabla m(x)\| \le A_2\sqrt{p_1}\left(4LR^2 + \varepsilon/R\right)$,
• $\|\nabla^2 f(x) - \nabla^2 m(x)\| \le A_3\sqrt{p_1}\left(4LR + \varepsilon/R^2\right)$.
Proof: Let $\{\hat{\ell}_0(x), \dots, \hat{\ell}_p(x)\}$ be the Lagrange polynomials generated by the shifted and scaled set $\hat{Y}$, and let $\{\ell_0(x), \dots, \ell_p(x)\}$ be the Lagrange polynomials generated by the set $Y$. By Lemma 3.5, for each $x \in B(y^0; R)$, $\ell_i(x) = \hat{\ell}_i(\hat{x})$ for all $i$, where $\hat{x} = (x - y^0)/R$. Thus, $\nabla\ell_i(x) = \nabla\hat{\ell}_i(\hat{x})/R$ and $\nabla^2\ell_i(x) = \nabla^2\hat{\ell}_i(\hat{x})/R^2$.
Let $\hat{\ell}(x) = \left[\hat{\ell}_0(x), \dots, \hat{\ell}_p(x)\right]^T$, let $g(x)$ be the vector with entries $\|\nabla\hat{\ell}_i(x)\|$, and let $h(x)$ be the vector with entries $\|\nabla^2\hat{\ell}_i(x)\|$. By Corollary 3.7,
$$\begin{aligned} |f(x) - m(x)| &\le \sum_{i=0}^{p}\left(\frac{L}{2}\|y^i - x\|^3 + |\varepsilon_i|\right)|\ell_i(x)| \\ &\le \left(4LR^3 + \varepsilon\right)\sum_{i=0}^{p}\left|\hat{\ell}_i(\hat{x})\right| \qquad (\text{since } \|y^i - x\| \le 2R \text{ and } |\varepsilon_i| \le \varepsilon) \\ &= \left(4LR^3 + \varepsilon\right)\left\|\hat{\ell}(\hat{x})\right\|_1 \\ &\le \sqrt{p_1}\left(4LR^3 + \varepsilon\right)\left\|\hat{\ell}(\hat{x})\right\| \qquad (\text{since } \|z\|_1 \le \sqrt{p_1}\,\|z\|_2 \text{ for } z \in \mathbb{R}^{p_1}). \end{aligned}$$
Similarly, $\|\nabla f(x) - \nabla m(x)\| \le \sqrt{p_1}\left(4LR^2 + \frac{\varepsilon}{R}\right)\|g(\hat{x})\|$ and $\|\nabla^2 f(x) - \nabla^2 m(x)\| \le \sqrt{p_1}\left(4LR + \frac{\varepsilon}{R^2}\right)\|h(\hat{x})\|$. Setting
$$A_1 = \max_{x \in B(0;1)}\left\|\hat{\ell}(x)\right\|, \quad A_2 = \max_{x \in B(0;1)}\|g(x)\|, \quad A_3 = \max_{x \in B(0;1)}\|h(x)\|$$
yields the desired result. ■
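The $O(R^3)$ behavior of the first bound can be observed empirically: with exact evaluations ($\varepsilon = 0$), the (unweighted) quadratic regression model error shrinks roughly by a factor of 8 when the sample radius is halved. A one-dimensional sketch with an illustrative sample pattern and test function:

```python
import numpy as np

# Empirical look at the first bound of Corollary 3.8 with eps = 0: the
# quadratic regression model error near the center shrinks roughly like R^3.
def phi(x):
    return np.array([1.0, x, x**2 / 2.0])

def model_error(R, x0=0.0):
    Y = x0 + R * np.array([-1.0, -0.6, -0.2, 0.2, 0.6, 1.0])
    M = np.array([phi(y) for y in Y])
    alpha = np.linalg.lstsq(M, np.exp(Y), rcond=None)[0]
    x = x0 + 0.5 * R                       # a point inside B(x0; R)
    return abs(np.exp(x) - phi(x) @ alpha)

errs = [model_error(R) for R in (0.4, 0.2, 0.1)]
```

Each halving of $R$ reduces the observed error by roughly the cubic factor predicted by the bound (up to higher-order terms).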
Note the similarity between these error bounds and those in the definition of $\kappa$-fully quadratic models. If there is no computational error (or if the error is $O(\Delta^3)$), $\kappa$-fully quadratic models (for some fixed $\kappa$) can be obtained by controlling the geometry of the sample set so that $A_i\sqrt{p_1}$, $i = 1, 2, 3$, are bounded by fixed constants, and by controlling the trust region radius $\Delta$ so that the ratio $\Delta/R$ is bounded. This motivates the definitions of $\Lambda$-poised and strongly $\Lambda$-poised in the weighted regression sense in the next section.
3.3.3 $\Lambda$-poisedness (in the Weighted Regression Sense)
In this section, we restrict our attention to the monomial basis $\bar{\phi}$ defined in (3.3). In order to produce accurate model functions, the points in the sample set need to be distributed in such a way that the matrix $M = M(\bar{\phi}, Y)$ is sufficiently well-conditioned. This is the motivation behind the following definitions of $\Lambda$-poised and strongly $\Lambda$-poised sets. These definitions are identical to [12, Definitions 4.7, 4.10] except that the Lagrange polynomials in the definitions are weighted regression Lagrange polynomials.
Definition 3.9 Let $\Lambda > 0$ and let $B$ be a set in $\mathbb{R}^n$. Let $w = (w_0, \dots, w_p)$ be a vector of positive weights, let $Y = \{y^0, \dots, y^p\}$ be a poised set, and let $\{\ell_0, \dots, \ell_p\}$ be the associated weighted regression Lagrange polynomials. Let $\ell(x) = (\ell_0(x), \dots, \ell_p(x))^T$ and $q_1 = |\mathcal{P}_n^2|$.
• $Y$ is said to be $\Lambda$-poised in $B$ (in the weighted regression sense) if and only if
$$\Lambda \ge \max_{0 \le i \le p}\max_{x \in B}|\ell_i(x)|.$$
• $Y$ is said to be strongly $\Lambda$-poised in $B$ (in the weighted regression sense) if and only if
$$\frac{q_1}{\sqrt{p_1}}\,\Lambda \ge \max_{x \in B}\|\ell(x)\|.$$
Note that if the weights are all equal, the above definitions are equivalent to those for $\Lambda$-poised and strongly $\Lambda$-poised given in [12].
We are naturally interested in using these weighted regression Lagrange polynomials to form models that are guaranteed to sufficiently approximate $f$. Let $Y_k$, $\Delta_k$, and $R_k$ denote the sample set, trust region radius, and sample set radius at iteration $k$, as defined at the beginning of §3.3.2. Assume that the ratio $\Delta_k/R_k$ is bounded. If the number of sample points is bounded, it can be shown, using Corollary 3.8, that if $Y_k$ is $\Lambda$-poised for all $k$, then the corresponding model functions are $\kappa$-fully quadratic (assuming no computational error, or that the computational error is $O(\Delta^3)$). When the number of sample points is not bounded, $\Lambda$-poisedness is not enough. In the following, we show that if $Y_k$ is strongly $\Lambda$-poised for all $k$, then the corresponding models are $\kappa$-fully quadratic, regardless of the number of points in $Y_k$.
Lemma 3.10 Let M = M(φ, Y). If ‖W(M^T W)^+‖ ≤ √(q_1/p_1) Λ, then Y is strongly Λ-poised in B(0;1) in the weighted regression sense, with respect to the weights w. Conversely, if Y is strongly Λ-poised in B(0;1) in the weighted regression sense, then

    ‖W(M^T W)^+‖ ≤ (θ q_1/√p_1) Λ,

where θ > 0 is a fixed constant dependent only on n (but independent of Y and Λ).
Proof: Let A = (WM)^+ W and ℓ(x) = (ℓ_0(x), …, ℓ_p(x))^T. By Lemma 3.3, ℓ(x) = A^T φ(x). Note that ‖A‖ = ‖W(M^T W)^+‖, since A^T = W(M^T W)^+. It follows that for any x ∈ B(0;1),

    ‖ℓ(x)‖ = ‖A^T φ(x)‖ ≤ ‖A‖ ‖φ(x)‖ ≤ √(q_1/p_1) Λ (√q_1 ‖φ(x)‖_∞) ≤ (q_1/√p_1) Λ.

(For the last inequality, we used the fact that max_{x ∈ B(0;1)} ‖φ(x)‖_∞ = 1.)
To prove the converse, let UΣV^T = A^T be the reduced singular value decomposition of A^T. Note that U and V are p_1 × q_1 and q_1 × q_1 matrices, respectively, with orthonormal columns; Σ is a q_1 × q_1 diagonal matrix whose diagonal entries are the singular values of A^T. Let σ_1 be the largest singular value, with v^1 the corresponding column of V. As shown in the proof of [10, Theorem 2.9], there exists a constant γ > 0 such that for any unit vector v, there exists an x ∈ B(0;1) such that |v^T φ(x)| ≥ γ. Therefore, since ‖v^1‖ = 1, there is an x ∈ B(0;1) such that |(v^1)^T φ(x)| ≥ γ. Let v^⊥ be the orthogonal projection of φ(x) onto the subspace orthogonal to v^1, so that φ(x) = ((v^1)^T φ(x)) v^1 + v^⊥. Note that ΣV^T v^1 and ΣV^T v^⊥ are orthogonal vectors. Note also that for any vector z, ‖UΣV^T z‖ = ‖ΣV^T z‖ (since U has orthonormal columns). It follows that

    ‖ℓ(x)‖ = ‖A^T φ(x)‖ = ‖ΣV^T φ(x)‖ = (‖ΣV^T v^⊥‖² + ‖ΣV^T ((v^1)^T φ(x)) v^1‖²)^{1/2}
           ≥ |(v^1)^T φ(x)| ‖ΣV^T v^1‖ ≥ γ ‖ΣV^T v^1‖ = γ ‖Σ e^1‖ = γ σ_1 = γ ‖A‖.

Thus, ‖A‖ ≤ (1/γ) max_{x ∈ B(0;1)} ‖ℓ(x)‖ ≤ (q_1/(γ √p_1)) Λ, which proves the result with θ = 1/γ. ∎
We can now prove that models generated by weighted regression Lagrange polynomials are κ-fully quadratic.
Proposition 3.11 Let f satisfy Assumption 2.1 and let Λ > 0 be fixed. There exists a vector κ = (κ_ef, κ_eg, κ_eh, 0) such that for any y^0 ∈ S and Δ ≤ Δ_max, if

1. Y = {y^0, …, y^p} ⊂ B(y^0; Δ) is strongly Λ-poised in B(y^0; Δ) in the weighted regression sense with respect to positive weights w = (w_0, …, w_p), and

2. the computational error ε_i is bounded by CΔ³, where C is a fixed constant,

then the corresponding model function m is κ-fully quadratic.
Proof: Let x̂, ℓ(·), g(·), h(·), Λ_1, Λ_2, and Λ_3 be as defined in the proof of Corollary 3.8. Let M = M(φ, Y) and W = diag(w). By Lemma 3.3, ℓ(x) = A^T φ(x), where A = (WM)^+ W. By Lemma 3.10, ‖A‖ ≤ (θ q_1/√p_1) Λ, where θ is a fixed constant. It follows that

    Λ_1 = max_{x ∈ B(0;1)} ‖ℓ(x)‖ ≤ max_{x ∈ B(0;1)} ‖A‖ ‖φ(x)‖ ≤ c_1 (θ q_1/√p_1) Λ,

where c_1 = max_{x ∈ B(0;1)} ‖φ(x)‖ is a constant independent of Y. Similarly,
    Λ_2 = max_{x ∈ B(0;1)} ‖g(x)‖ = max_{x ∈ B(0;1)} ‖(∇ℓ_0(x), …, ∇ℓ_p(x))^T‖ = max_{x ∈ B(0;1)} ‖A^T ∇φ(x)‖
        ≤ √q_1 ‖A‖ max_{x ∈ B(0;1)} ‖∇φ(x)‖_∞ ≤ c_2 (θ q_1/√p_1) Λ,

where c_2 = √q_1 max_{x ∈ B(0;1)} ‖∇φ(x)‖_∞ is independent of Y.
To bound Λ_3, let J_{s,t} denote the unique index j such that x_s and x_t both appear in the quadratic monomial φ_j(x). For example, J_{1,1} = n + 2, J_{1,2} = J_{2,1} = n + 3, etc. Observe that

    [∇²φ_j(x)]_{s,t} = 1 if j = J_{s,t}, and 0 otherwise.

It follows that

    [∇²ℓ_i(x)]_{s,t} = Σ_{j=0}^{q} A_{j,i} [∇²φ_j(x)]_{s,t} = A_{J_{s,t}, i};

that is, ∇²ℓ_i(x) is the symmetric matrix whose (s,t) entry is A_{J_{s,t}, i}. Since each entry of the column A_i = (A_{0,i}, …, A_{q,i})^T appears at most twice in this matrix, we conclude that

    ‖∇²ℓ_i(x)‖ ≤ ‖∇²ℓ_i(x)‖_F ≤ √2 ‖A_i‖.

Thus,

    Λ_3 = max_{x ∈ B(0;1)} ‖h(x)‖ = max_{x ∈ B(0;1)} ‖(∇²ℓ_0(x), …, ∇²ℓ_p(x))‖
        ≤ √(2 Σ_{i=0}^{p} ‖A_i‖²) = √2 ‖A‖_F ≤ √(2 q_1) ‖A‖ ≤ √2 (θ q_1^{3/2}/√p_1) Λ.
By assumption, the computational error ε_i is bounded by CΔ³. So, by Corollary 3.8, for all x ∈ B(y^0; Δ),

• |f(x) − m(x)| ≤ (4L + C) √p_1 Λ_1 Δ³ ≤ c_1 θ q_1 Λ (4L + C) Δ³ = κ_ef Δ³,

• ‖∇f(x) − ∇m(x)‖ ≤ (4L + C) √p_1 Λ_2 Δ² ≤ c_2 θ q_1 Λ (4L + C) Δ² = κ_eg Δ²,

• ‖∇²f(x) − ∇²m(x)‖ ≤ (4L + C) √p_1 Λ_3 Δ ≤ √2 θ q_1^{3/2} Λ (4L + C) Δ = κ_eh Δ,

where κ_ef = c_1 θ q_1 Λ (4L + C), κ_eg = c_2 θ q_1 Λ (4L + C), and κ_eh = √2 θ q_1^{3/2} Λ (4L + C). Thus, m(x) is (κ_ef, κ_eg, κ_eh, 0)-fully quadratic, and since these constants are independent of y^0 and Δ, the result is proven. ∎
The final step in establishing that we have a fully quadratic class of models is to define an algorithm that produces a strongly Λ-poised sample set in a finite number of steps.
Proposition 3.12 Let Y = {y^0, …, y^p} ⊂ ℝ^n be a set of points in the unit ball B(0;1) such that ‖y^j‖ = 1 for at least one j. Let w = (w_0, …, w_p)^T be a vector of positive weights. If Y is strongly Λ-poised in B(0;1) in the sense of unweighted regression, then there exists a constant θ̂ > 0, independent of Y, Λ, and w, such that Y is strongly (cond(W) θ̂ Λ)-poised in the weighted regression sense.
Proof: Let M = M(φ, Y), where φ is the monomial basis. By Lemma 3.10 (applied with unit weights), ‖M^+‖ ≤ θ q_1 Λ/√p_1, where θ is a constant independent of Y and Λ. Thus,

    ‖W(M^T W)^+‖ ≤ cond(W) ‖M^+‖ ≤ cond(W) θ q_1 Λ/√p_1,

where the first inequality results from

    ‖(M^T W)^+‖ = 1/σ_min(M^T W) ≤ 1/(σ_min(M^T) σ_min(W)) = ‖M^+‖ ‖W^{-1}‖,

so that ‖W(M^T W)^+‖ ≤ ‖W‖ ‖M^+‖ ‖W^{-1}‖ = cond(W) ‖M^+‖. Writing this bound as √(q_1/p_1) (cond(W) θ √q_1 Λ), the first part of Lemma 3.10 implies that Y is strongly (cond(W) θ √q_1 Λ)-poised in the weighted regression sense. The result follows with θ̂ = θ √q_1. ∎
The significance of this proposition is that any model improvement algorithm for unweighted regression can be used for weighted regression to ensure the same global convergence properties, provided cond(W) is bounded. For the model improvement algorithm described in the following section, this requirement is satisfied by bounding the weights away from zero while keeping the largest weight equal to 1.
In practice, we need not ensure Λ-poisedness of Y_k at every iterate to guarantee that the algorithm converges to a second-order minimum. Rather, Λ-poisedness only needs to be enforced as the algorithm stagnates.
3.4 Model Improvement Algorithm
This section describes a model improvement algorithm (MIA) for regression which, by the preceding section, can also be used for weighted regression to ensure that the sample sets are strongly Λ-poised for some fixed Λ (which is not necessarily known). The algorithm is based on the following observation, which is a straightforward extension of [12, Theorem 4.11]. The MIA presented in [12] makes assumptions (such as requiring all points to lie within B(y^0; Δ)) to simplify the theory. We avoid such assumptions to account for practical concerns (points which lie outside of B(y^0; Δ)) that arise in the algorithm.

Proposition 3.13 If the shifted and scaled sample set Y of p_1 points contains ℓ = ⌊p_1/q_1⌋ disjoint subsets of q_1 points, each of which is Λ-poised in B(0;1) (in the interpolation sense), then Y is strongly √((ℓ+1)/ℓ) Λ-poised in B(0;1) (in the regression sense).
Proof: Let Y_j = {y_j^0, y_j^1, …, y_j^q}, j = 1, …, ℓ, be the disjoint Λ-poised subsets of Y, and let Y_r be the remaining points in Y. Let λ_i^j, i = 0, …, q, be the (interpolation) Lagrange polynomials for the set Y_j. As noted in [12], for any x ∈ ℝ^n,

    Σ_{i=0}^{q} λ_i^j(x) φ(y_j^i) = φ(x),  j = 1, …, ℓ.

Dividing each of these equations by ℓ and summing yields

    (1/ℓ) Σ_{j=1}^{ℓ} Σ_{i=0}^{q} λ_i^j(x) φ(y_j^i) = φ(x).   (3.11)

Let λ^j(x) = (λ_0^j(x), …, λ_q^j(x))^T, and let λ ∈ ℝ^{p_1} be formed by concatenating the λ^j(x), j = 1, …, ℓ, and a zero vector of length |Y_r|, and then dividing every entry by ℓ. By (3.11), λ is a solution to the equation

    Σ_{i=0}^{p} λ_i φ(y^i) = φ(x).   (3.12)
Since Y_j is Λ-poised in B(0;1), for any x ∈ B(0;1),

    ‖λ^j(x)‖ ≤ √q_1 ‖λ^j(x)‖_∞ ≤ √q_1 Λ.

Thus, since p_1 ≤ (ℓ+1) q_1,

    ‖λ‖ ≤ (√ℓ/ℓ) max_j ‖λ^j(x)‖ ≤ √(q_1/ℓ) Λ ≤ √((ℓ+1) q_1²/(ℓ p_1)) Λ = √((ℓ+1)/ℓ) (q_1/√p_1) Λ.

Let ℓ_i(x), i = 0, …, p, be the regression Lagrange polynomials for the complete set Y. As observed in [12], ℓ(x) = (ℓ_0(x), …, ℓ_p(x))^T is the minimum norm solution to (3.12). Thus,

    ‖ℓ(x)‖ ≤ ‖λ‖ ≤ √((ℓ+1)/ℓ) (q_1/√p_1) Λ.

Since this holds for all x ∈ B(0;1), Y is strongly √((ℓ+1)/ℓ) Λ-poised in B(0;1). ∎
Based on this observation, and noting that √((ℓ+1)/ℓ) ≤ √2 for ℓ ≥ 1, we adopt the following strategy for improving a shifted and scaled regression sample set Y ⊂ B(0;1):

1. If Y contains ℓ ≥ 1 Λ-poised subsets with at most q_1 points left over, Y is strongly √2 Λ-poised.

2. Otherwise, if Y contains at least one Λ-poised subset, save as many Λ-poised subsets as possible, plus at most q_1 additional points from Y, discarding the rest.

3. Otherwise, add additional points to Y in order to create a Λ-poised subset. Keep this subset, plus at most q_1 additional points from Y.
To implement this strategy, we first describe an algorithm that attempts to find a Λ-poised subset of Y. To discuss the algorithm, we introduce the following definition:

Definition 3.14 A set Y ⊂ B is said to be Λ-subpoised in a set B if there exists a superset Z ⊇ Y that is Λ-poised in B with |Z| = q_1.

Given a sample set Y ⊂ B(0;1) (not necessarily shifted and scaled) and a radius Δ, the algorithm below selects a Λ-subpoised subset Y_new ⊆ Y containing as many points as possible. If |Y_new| = q_1, then Y_new is Λ-poised in B(0;Δ) for some fixed Λ. Otherwise, the algorithm determines a new point y_new ∈ B(0;Δ) such that Y_new ∪ {y_new} is Λ-subpoised in B(0;Δ).
Algorithm FindSet (Finds a Λ-subpoised set)

Input: A sample set Y ⊂ B(0;1) and a trust region radius Δ ∈ [√ξ_acc, 1], for fixed parameter ξ_acc > 0.

Output: A set Y_new ⊆ Y that is Λ-poised in B(0;Δ); or a Λ-subpoised set Y_new ⊂ B(0;Δ) and a new point y_new ∈ B(0;Δ) such that Y_new ∪ {y_new} is Λ-subpoised in B(0;Δ).

Step 0 (Initialization): Initialize the pivot polynomial basis to the monomial basis: u_i(x) = φ_i(x), i = 0, …, q. Set Y_new = ∅. Set i = 0.

Step 1 (Point Selection): If possible, choose j_i ∈ {i, …, p_1 − 1} such that |u_i(y^{j_i})| ≥ ξ_acc (threshold test).

If such an index is found, add the corresponding point to Y_new and swap the positions of points y^i and y^{j_i} in Y.

Otherwise, compute y_new = argmax_{x ∈ B(0;Δ)} |u_i(x)|, and exit, returning Y_new and y_new.

Step 2 (Gaussian Elimination): For j = i + 1, …, q,

    u_j(x) = u_j(x) − (u_j(y^i)/u_i(y^i)) u_i(x).

If i < q, set i = i + 1 and go to Step 1.

Exit, returning Y_new.
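A simplified sketch of the pivot-polynomial loop follows (illustrative only: pivot polynomials are stored as coefficient vectors over φ, the remaining point maximizing |u_i| is always selected, and the exit branch that computes y_new by maximizing |u_i(x)| over the ball is omitted, so the function simply returns the subset found so far):

```python
import numpy as np

def phi(x):
    # 1-D quadratic monomial basis: phi(x) = (1, x, x^2/2), so q1 = 3
    return np.array([1.0, x, 0.5 * x**2])

def find_subset(Y, xi_acc=1e-4):
    """Greedy Gaussian elimination with a threshold test (cf. Algorithm FindSet).

    Returns indices into Y of a subset whose pivot values all pass the
    threshold test; a full set of q1 indices corresponds to a poised subset.
    """
    Y = list(Y)
    q1 = len(phi(Y[0]))
    U = np.eye(q1)            # row i holds coefficients of pivot polynomial u_i
    chosen = []
    order = list(range(len(Y)))
    for i in range(q1):
        # Point selection: pick the remaining point maximizing |u_i(y)|
        vals = [abs(U[i] @ phi(Y[j])) for j in order[i:]]
        best = int(np.argmax(vals))
        if vals[best] < xi_acc:
            return chosen     # here FindSet would compute y_new = argmax |u_i|
        order[i], order[i + best] = order[i + best], order[i]
        chosen.append(order[i])
        yi = Y[order[i]]
        # Gaussian elimination: zero out u_j at y^i for the later pivots
        for j in range(i + 1, q1):
            U[j] -= (U[j] @ phi(yi)) / (U[i] @ phi(yi)) * U[i]
    return chosen

idx = find_subset([-0.9, -0.9, 0.1, 0.8])   # one duplicated point
print(len(idx))                              # 3: a poised subset was found
```

Note how the duplicated point is automatically skipped: after the first copy is selected, the eliminated pivot polynomial vanishes at the second copy, so it can never pass the threshold test.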
The algorithm, which is modeled after Algorithms 6.5 and 6.6 in [12], applies Gaussian elimination with a threshold test to form a basis of pivot polynomials {u_i(x)}. As discussed in [12], at the completion of the algorithm, the values u_i(y^i), y^i ∈ Y_new, are exactly the diagonal entries of the diagonal matrix D in the LDU factorization of M = M(φ, Y_new). If |Y_new| = q_1, M is a square matrix. In this case, since |u_i(y^i)| ≥ ξ_acc,

    ‖M^{-1}‖ ≤ √q_1 ξ_growth/ξ_acc,   (3.13)

where ξ_growth is the growth factor for the factorization (see [27]).
Point Selection The point selection rule allows flexibility in how an acceptable point is chosen. For example, to keep the growth factor down, one could choose the index j_i that maximizes |u_i(y^j)| (which corresponds to Gaussian elimination with partial pivoting). But in practice, it is often better to select points according to their proximity to the current iterate. In our implementation, we balance these two criteria by choosing the index that maximizes |u_i(y^j)|/d_j³ over j ≥ i, where d_j = max{1, ‖y^j‖/Δ}. If all sample points are contained in B(0;Δ), then d_j = 1 for all j. In this case, the point selection rule is identical to the one used in Algorithm 6.6 of [12] (with the addition of the threshold test). When Y contains points outside B(0;Δ), the corresponding values of d_j are greater than 1, so the point selection rule gives preference to points that are within B(0;Δ).

The theoretical justification for our revised point selection rule comes from examining the error bounds in Corollary 3.7. For a given point x in B(0;Δ), each sample point y^j makes a contribution to the error bound that is proportional to ‖y^j − x‖³ (assuming the computational error is relatively small). Since x can be anywhere in the trust region, this suggests modifying the point selection rule to maximize |u_i(y^j)|/d_j³, where d_j = max_{x ∈ B(0;Δ)} ‖y^j − x‖/Δ = ‖y^j‖/Δ + 1. To simplify analysis, we modify this formula so that all points inside the trust region are treated equally, resulting in the formula d_j = max{1, ‖y^j‖/Δ}.
Lemma 3.15 Suppose Algorithm FindSet returns a set Y_new with |Y_new| = q_1. Then Y_new is Λ-poised in B(0;Δ) for some Λ which is proportional to (ξ_growth/ξ_acc) max{1, Δ²/2, Δ}, where ξ_growth is the growth factor for the Gaussian elimination.
Proof: Let M = M(φ, Y_new). By (3.13), ‖M^{-1}‖ ≤ √q_1 ξ_growth/ξ_acc. Let ℓ(x) = (ℓ_0(x), …, ℓ_q(x))^T be the vector of interpolation Lagrange polynomials for the sample set Y_new. For any x ∈ B(0;Δ),

    ‖ℓ(x)‖_∞ ≤ ‖M^{-T} φ(x)‖ ≤ ‖M^{-1}‖ ‖φ(x)‖ ≤ √q_1 ‖M^{-1}‖ ‖φ(x)‖_∞ ≤ (q_1 ξ_growth/ξ_acc) max{1, Δ²/2, Δ}.

Since this inequality holds for all x ∈ B(0;Δ), Y_new is Λ-poised for Λ = (q_1 ξ_growth/ξ_acc) max{1, Δ²/2, Δ}, which establishes the result. ∎
In general, the growth factor in the above lemma depends on the matrix M and the threshold ξ_acc. Because of the threshold test, it is possible to establish a bound on the growth factor that is independent of M, so we can claim that the algorithm selects a Λ-poised set for a fixed Λ that is independent of Y. However, the bound is extremely large, so it is not very useful. Nevertheless, in practice ξ_growth is quite reasonable, so Λ tends to be proportional to max{1, Δ²/2, Δ}/ξ_acc.

In the case where the threshold test is not satisfied, Algorithm FindSet determines a new point y_new by maximizing |u_i(x)| over B(0;Δ). In this case, we need to show that the new point will satisfy the threshold test. The following lemma shows that this is possible, provided ξ_acc is small enough. The proof is modeled after the proof of [12, Lemma 6.7].
Lemma 3.16 Let v^T φ(x) be a polynomial of degree at most 2, where ‖v‖_∞ = 1. Then

    max_{x ∈ B(0;Δ)} |v^T φ(x)| ≥ min{1, Δ²/4}.
Proof: Since ‖v‖_∞ = 1, at least one of the coefficients of q(x) = v^T φ(x) is 1, −1, 1/2, or −1/2. Consider the case where this coefficient is 1 or 1/2 (the cases −1 and −1/2 are similarly proven); it corresponds either to the constant term, to a linear term x_i, or to a quadratic term x_i²/2 or x_i x_j. Restrict all variables not appearing in the term corresponding to this coefficient to zero.
• If q(x) = 1, then the lemma trivially holds.

• If q(x) = x_i²/2 + a x_i + b, let x = Δ e^i ∈ B(0;Δ), so that

    q(x) = Δ²/2 + Δa + b,  q(−x) = Δ²/2 − Δa + b,  and  q(0) = b.

If |q(−x)| ≥ Δ²/4 or |q(x)| ≥ Δ²/4, the result is shown. Otherwise, −Δ²/4 < q(x) < Δ²/4 and −Δ²/4 < q(−x) < Δ²/4. Adding these together yields −Δ²/2 < Δ² + 2b < Δ²/2. Therefore b < Δ²/4 − Δ²/2 = −Δ²/4, and therefore |q(0)| > Δ²/4.

• If q(x) = a x_i²/2 + x_i + b, let x = Δ e^i, yielding q(x) = Δ + aΔ²/2 + b and q(−x) = −Δ + aΔ²/2 + b. Then

    max{|q(−x)|, |q(x)|} = max{|−Δ + α|, |Δ + α|} ≥ Δ ≥ min{1, Δ²/4},

where α = aΔ²/2 + b.

• If q(x) = a x_i²/2 + b x_j²/2 + x_i x_j + c x_i + d x_j + e, we consider the 4 points on the boundary of B(0;Δ) given by

    y_1 = (Δ/√2)(e^i + e^j),  y_2 = (Δ/√2)(e^i − e^j),  y_3 = (Δ/√2)(−e^i + e^j),  y_4 = (Δ/√2)(−e^i − e^j).

Note that q(y_1) − q(y_2) = Δ² + √2 dΔ and q(y_3) − q(y_4) = −Δ² + √2 dΔ. There are two cases:

1. If d ≥ 0, then q(y_1) − q(y_2) ≥ Δ², so either |q(y_1)| ≥ Δ²/2 ≥ min{1, Δ²/4} or |q(y_2)| ≥ Δ²/2 ≥ min{1, Δ²/4}.

2. If d < 0, then a similar study of q(y_3) − q(y_4) proves the result. ∎
Lemma 3.17 Suppose in Algorithm FindSet that ξ_acc ≤ min{1, Δ²/4}. If Algorithm FindSet exits during the point selection step, then Y_new ∪ {y_new} is Λ-subpoised in B(0;Δ) for some fixed Λ, which is proportional to (ξ_growth/ξ_acc) max{1, Δ²/2, Δ}, where ξ_growth is the growth parameter for the Gaussian elimination.
Proof: Consider a modified version of Algorithm FindSet that does not exit in the point selection step. Instead, y^i is replaced by y_new and y_new is added to Y_new. This modified algorithm will always return a set consisting of q_1 points. Call this set Z. Let Y_new and y_new be the output of the unmodified algorithm, and observe that Y_new ∪ {y_new} ⊆ Z.

To show that Y_new ∪ {y_new} is Λ-subpoised, we show that Z is Λ-poised in B(0;Δ). From the Gaussian elimination, after k iterations of the algorithm, the (k+1)st pivot polynomial u_k(x) can be expressed as (v^k)^T φ(x) for some v^k = (v_0, …, v_{k−1}, 1, 0, …, 0)^T. (That is, the v_i are the coefficients for the basis expansion of the polynomial u_k.) Observe that ‖v^k‖_∞ ≥ 1, and let v = v^k/‖v^k‖_∞. By Lemma 3.16,

    max_{x ∈ B(0;Δ)} |u_k(x)| = max_{x ∈ B(0;Δ)} |(v^k)^T φ(x)| = ‖v^k‖_∞ (max_{x ∈ B(0;Δ)} |v^T φ(x)|)
        ≥ min{1, Δ²/4} ‖v^k‖_∞ ≥ min{1, Δ²/4} ≥ ξ_acc.

It follows that each time a new point is chosen in the point selection step, that point will satisfy the threshold test. Thus, the set Z returned by the modified algorithm will include q_1 points, all of which satisfy the threshold test. By Lemma 3.15, Z is Λ-poised, with Λ proportional to (ξ_growth/ξ_acc) max{1, Δ²/2, Δ}. It follows that Y_new ∪ {y_new} is Λ-subpoised. ∎
We are now ready to state our model improvement algorithm for regression. Prior to calling this algorithm, we discard all points in Y with distance greater than Δ/√ξ_acc from y^0. We then form the shifted and scaled set Ŷ by the transformation ŷ^j = (y^j − y^0)/d, where d = max_{y^j ∈ Y} ‖y^j − y^0‖, and scale the trust region radius accordingly (i.e., Δ̂ = Δ/d). This ensures that Δ̂ ≥ √ξ_acc. After calling the algorithm, we reverse the shift and scale transformation.
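The pre- and post-processing around the call can be sketched as follows (shift_and_scale is a hypothetical helper name):

```python
import numpy as np

def shift_and_scale(Y, y0, delta, xi_acc=1e-4):
    """Discard far points, then map the sample set into the unit ball.

    Points farther than delta/sqrt(xi_acc) from y0 are dropped; the rest are
    shifted by y0 and scaled by d = max_j ||y^j - y0||, so Y_hat lies in
    B(0;1) and the scaled radius satisfies delta_hat >= sqrt(xi_acc).
    """
    Y = [y for y in Y if np.linalg.norm(y - y0) <= delta / np.sqrt(xi_acc)]
    d = max(np.linalg.norm(y - y0) for y in Y)
    Y_hat = [(y - y0) / d for y in Y]
    return Y_hat, delta / d, d

y0 = np.array([1.0, 2.0])
Y = [y0 + v for v in (np.array([0.5, 0.0]), np.array([0.0, -2.0]),
                      np.array([300.0, 0.0]))]      # last point is too far
Y_hat, delta_hat, d = shift_and_scale(Y, y0, delta=1.0)
print(len(Y_hat), delta_hat >= np.sqrt(1e-4))        # 2 True
```

Reversing the transformation after the call is just y^j = y0 + d * ŷ^j.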
Algorithm MIA (Model Improvement for Regression)

Input: A shifted and scaled sample set Y ⊂ B(0;1) and a trust region radius Δ ≥ √ξ_acc for fixed ξ_acc ∈ (0, 1), where r > 1 is a fixed parameter.

Output: A modified set Y′ with improved poisedness on B(0;Δ).

Step 0 (Initialization): Remove the point in Y farthest from y^0 = 0 if it is outside B(0; rΔ). Set Y′ = ∅.

Step 1 (Find Poised Subset): Use Algorithm FindSet either to identify a Λ-poised subset Y_new ⊆ Y, or to identify a Λ-subpoised subset Y_new ⊆ Y and one additional point y_new ∈ B(0;Δ) such that Y_new ∪ {y_new} is Λ-subpoised on B(0;Δ).

Step 2 (Update Set):

If Y_new is Λ-poised, add it to Y′ and remove Y_new from Y. Remove all points from Y that are outside of B(0; rΔ).

Otherwise:

    If |Y′| = 0, set Y′ = Y_new ∪ {y_new} plus q_1 − |Y_new| − 1 additional points from Y.

    Otherwise, set Y′ = Y′ ∪ Y_new plus q_1 − |Y_new| additional points from Y.

    Set Y = ∅.

Step 3: If |Y| ≥ q_1, go to Step 1.
In Algorithm MIA, if every call to Algorithm FindSet yields a Λ-poised set Y_new, then eventually all points in Y will be included in Y′. In this case, the algorithm has verified that Y contains ℓ = ⌊p_1/q_1⌋ Λ-poised sets. By Proposition 3.13, Y is strongly √((ℓ+1)/ℓ) Λ-poised in B(0;1).

If the first call to FindSet fails to identify a Λ-poised subset, the algorithm improves the sample set by adding a new point y_new and by removing points so that the output set Y′ contains at most q_1 points. In this case the output set contains the Λ-subpoised set Y_new ∪ {y_new}. Thus, if the algorithm is called repeatedly, with Y replaced by Y′ after each call, eventually Y′ will contain a Λ-poised subset and will be strongly √2 Λ-poised, by Proposition 3.13.

If Y fails to be Λ-poised after the second or later call to FindSet, no new points are added. Instead, the sample set is improved by removing points from Y so that the output set Y′ consists of all the Λ-poised subsets identified by FindSet, plus up to q_1 additional points. The resulting set is then strongly √((ℓ+1)/ℓ) Λ-poised, where ℓ = ⌊p_1/q_1⌋.
Trust region scale factor The trust region scale factor r was suggested in [12, Section 11.2], although implementation details were omitted. The scale factor determines what points are allowed to remain in the sample set. Each call to Algorithm MIA removes a point from outside B(0; rΔ) if such a point exists. Thus, if the algorithm is called repeatedly with Y replaced by Y′ each time, eventually all points in the sample set will be in the region B(0; rΔ). Using a scale factor r > 1 can improve the efficiency of the algorithm. To see this, observe that if r = 1, a slight movement of the trust region center may result in previously "good" points lying just outside of B(y^0; Δ). These points would then be unnecessarily removed from Y.
To justify this approach, suppose that Y is strongly Λ-poised in B(0;Δ). By Proposition 3.11, the associated model function m is κ-fully quadratic for some fixed vector κ, which depends on Λ. If instead Y has points outside of B(0;Δ), we can show (by a simple modification to the proof of Proposition 3.11) that the model function is R³κ-fully quadratic, where R = max_j ‖y^j − y^0‖/Δ. Thus, if Y ⊂ B(0; rΔ) for some fixed r > 1, then calling the model improvement algorithm will result in a model function m that is κ̂-fully quadratic with respect to a different (but still fixed) κ̂ = r³κ. In this case, however, whenever new points are added during the model improvement algorithm, they are always chosen within the original trust region B(0;Δ).
The discussion above demonstrates that Algorithm MIA satisfies the requirements of a model improvement algorithm specified in Definition 2.2. This algorithm is used in the CSV2 framework described in Chapter 2 as follows:

• In Step 1 of Algorithm CSV2, Algorithm MIA is called once. If no change is made to the sample set, the model is certified to be κ-fully quadratic.

• In Step 4 of Algorithm CSV2, Algorithm MIA is called once. If no change is made to the sample set, the model is κ-fully quadratic. Otherwise, the sample set is modified to improve the model.

• In Algorithm CriticalityStep, Algorithm MIA is called repeatedly to improve the model until it is κ-fully quadratic.
In our implementation, we modified Algorithm CriticalityStep to improve efficiency by introducing an additional exit criterion. Specifically, after each call to the model improvement algorithm, the model criticality measure σ_k^m = max{‖g_k^m‖, −λ_min(H_k^m)} is tested. If σ_k^m > ε_c, then x_k is no longer close to a second-order stationary point of the model function, so we exit the criticality step.
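This exit test is inexpensive to compute; a sketch of the model criticality measure (our notation for the quantity tested):

```python
import numpy as np

def criticality_measure(g, H):
    # sigma^m = max(||g||, -lambda_min(H)): zero only at a point satisfying
    # the second-order stationarity conditions for the model
    return max(np.linalg.norm(g), -np.linalg.eigvalsh(H).min())

g = np.array([1e-8, 0.0])                 # tiny model gradient
H = np.array([[2.0, 0.0], [0.0, -0.5]])   # but indefinite model Hessian
eps_c = 0.01
print(criticality_measure(g, H) > eps_c)  # True: exit the criticality step
```

The example shows why the Hessian term matters: the gradient alone is nearly zero, yet the negative curvature direction makes the measure large, so the criticality step can be abandoned.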
3.5 Computational Results
As shown in the previous section, the CSV2 framework using weighted quadratic regression converges to a second-order stationary point provided the ratio between the largest and smallest weight is bounded. This leaves much leeway in the derivation of the weights. We now describe a heuristic strategy based on the error bounds derived earlier in this chapter.
3.5.1 Using Error Bounds to Choose Weights
Intuitively, the models used throughout our algorithm will be most effective if the weights are chosen so that m(x) is as accurate as possible, in the sense that it agrees with the second-order Taylor approximation of f(x) around the current trust region center y^0. That is, we want to estimate the quadratic function

    q(x) = f(y^0) + ∇f(y^0)^T (x − y^0) + ½ (x − y^0)^T ∇²f(y^0) (x − y^0).

If f(x) happens to be a quadratic polynomial, then the computed function values satisfy

    f̂_i = q(y^i) + ε_i.

If the errors ε_i are uncorrelated random variables with zero mean and finite variances σ_i², i = 0, …, p, then the best linear unbiased estimator of the polynomial q(x) is given by m(x) = φ(x)^T α, where α solves (3.4) with the ith weight proportional to 1/σ_i [51, Theorem 4.4]. This is intuitively appealing, since each sample point will then have the same expected contribution to the weighted sum of squared errors.

When f(x) is not a quadratic function, the errors depend not just on the computational error, but also on the distances from each point to y^0. In the particular case when x = y^0, the first three terms of (3.7) give the quadratic function q(y^i). Thus, the error between the computed function value and q(y^i) is given by

    f̂_i − q(y^i) = ½ (y^i − y^0)^T H_i(y^0) (y^i − y^0) + ε_i,   (3.14)
where H_i(y^0) = ∇²f(η_i(y^0)) − ∇²f(y^0) for some point η_i(y^0) on the line segment connecting y^0 and y^i.
We shall refer to the first term on the right as the Taylor error and the second term on the right as the computational error. By Assumption 2.1, ‖H_i(y^0)‖ ≤ L ‖y^i − y^0‖. This leads us to the following heuristic argument for choosing the weights. Suppose that H_i(y^0) is a random symmetric matrix such that the standard deviation of ‖H_i(y^0)‖ is proportional to L ‖y^i − y^0‖; in other words, the standard deviation of ‖H_i(y^0)‖ is ζ L ‖y^i − y^0‖ for some constant ζ. Then the Taylor error will have standard deviation proportional to L ‖y^i − y^0‖³. Assuming the computational error is independent of the Taylor error, the total error f̂_i − q(y^i) will have standard deviation √((ζ L ‖y^i − y^0‖³)² + σ_i²), where σ_i is the standard deviation of ε_i. This leads to the following formula for the weights:

    w_i ∝ 1/√(ζ² L² ‖y^i − y^0‖⁶ + σ_i²).
Of course, this formula depends on knowing ζ, L, and σ_i. If L, σ_i, and/or ζ are not known, this formula could still be used in conjunction with some strategy for estimating them (for example, based upon the accuracy of the model functions at known points). Alternatively, ζ and L can be combined into a single parameter C̄ = ζ²L² that could be chosen using some type of adaptive strategy:

    w_i ∝ 1/√(C̄ ‖y^i − y^0‖⁶ + σ_i²).
If the computational errors have equal variances σ², the formula can be further simplified to

    w_i ∝ 1/√(C ‖y^i − y^0‖⁶ + 1),   (3.15)

where C = C̄/σ².
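A minimal sketch of the weighting scheme (3.15), normalized so that the largest weight equals 1 as required by the model improvement theory of §3.4:

```python
import numpy as np

def regression_weights(Y, y0, C=100.0):
    """Weights from (3.15): w_i proportional to 1/sqrt(C ||y^i - y0||^6 + 1).

    Normalized so the largest weight is 1; then cond(W) = 1/min(w).
    """
    w = np.array([1.0 / np.sqrt(C * np.linalg.norm(y - y0)**6 + 1.0)
                  for y in Y])
    return w / w.max()

y0 = np.zeros(2)
Y = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([1.0, 0.0])]
w = regression_weights(Y, y0)
print(w[0])            # 1.0: the trust region center gets full weight
print(w[1] > w[2])     # True: closer points are weighted more heavily
```

The sixth-power growth in the denominator means weights decay very quickly once points leave the trust region, which is exactly the behavior the Taylor error term suggests.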
An obvious flaw in the above development is that the errors f̂_i − q(y^i) are not uncorrelated. Additionally, the assumption that the standard deviation of ‖H_i(y^0)‖ is proportional to L ‖y^i − y^0‖ is valid only for limited classes of functions. Nevertheless, based on our computational experiments, (3.15) appears to provide a sensible strategy for balancing differing levels of computational uncertainty with the Taylor error.
3.5.2 Benchmark Performance
To study the impact of weighted regression, we developed MATLAB implementations of three quadratic model-based trust region algorithms using interpolation, regression, and weighted regression, respectively, to construct the quadratic model functions. To the extent possible, the differences between these algorithms were minimized, with code shared whenever possible. Obviously, all three methods use different strategies for constructing the model from the sample set. Beyond that, the only difference is that the two regression methods use the model improvement algorithm described in Section 3.4, whereas the interpolation algorithm uses the model improvement algorithm described in [12, Algorithm 6.6].
We compared the three algorithms using the suite of test problems for benchmarking derivative-free optimization algorithms made available by Moré and Wild [41]. We ran our tests on the four types of problems from this test suite: smooth problems (with no noise), piecewise smooth functions, functions with deterministic noise, and functions with stochastic noise. We do not consider the algorithm presented in this chapter to be ideal for handling stochastically noisy functions. For example, if the initial point happens to be evaluated with large negative noise, the algorithm will never reevaluate this point and possibly never move the trust region center. We are actively attempting to construct a more robust algorithm; we consider such modifications nontrivial and outside the scope of the current work. The problems were run with the following parameter settings:

    Δ_max = 100, Δ_0 = 1, η_0 = 10⁻⁶, η_1 = 0.5, γ = 0.5, γ_inc = 2, ε_c = 0.01, μ = 2, β = 0.5, ω = 0.5, r = 3, ξ_acc = 10⁻⁴.

For the interpolation algorithm, we used ξ_imp = 1.01 for the calls to [12, Algorithm 6.6].
As described in [41], the smooth problems are derived from 22 nonlinear least squares functions defined in the CUTEr [23] collection. For each problem, the objective function f(x) is defined by

    f(x) = Σ_{k=1}^{m} g_k(x)²,

where g: ℝ^n → ℝ^m represents one of the CUTEr test functions. The piecewise smooth problems are defined using the same CUTEr test functions by

    f(x) = Σ_{k=1}^{m} |g_k(x)|.

The noisy problems are derived from the smooth problems by multiplying by a noise function as follows:

    f(x) = (1 + ε_f Γ(x)) Σ_{k=1}^{m} g_k(x)²,

where ε_f defines the relative noise level. For the stochastically noisy problems, Γ(x) is a random variable drawn from the uniform distribution U[−1, 1]. To simulate deterministic noise, Γ(x) is a function that oscillates between −1 and 1, with both high-frequency and low-frequency oscillations. (For an equation for the deterministic Γ, see [41, Eqns. (4.2)-(4.3)].) Using multiple starting points for some of the test functions, a total of 53 different problems are specified in the test suite for each of these 3 types of problems.
For the weighted regression algorithm, the weights were determined by the weighting scheme (3.15) with C = 100.
The relative performances of the algorithms were compared using performance profiles and data profiles [17, 41]. If S is the set of solvers to be compared on a suite of problems P, let t_{p,s} be the number of iterates required for solver s ∈ S on a problem p ∈ P to find a function value satisfying

    f(x) ≤ f_L + τ (f(x_0) − f_L),   (3.16)

where f_L is the best function value achieved by any s ∈ S. Then the performance profile of a solver s ∈ S is the fraction

    ρ_s(α) = (1/|P|) |{p ∈ P : t_{p,s}/min{t_{p,s̄} : s̄ ∈ S} ≤ α}|.

The data profile of a solver s ∈ S is

    d_s(α) = (1/|P|) |{p ∈ P : t_{p,s}/(n_p + 1) ≤ α}|,

where n_p is the dimension of problem p ∈ P. For more information on these profiles, including their relative merits and faults, see [41].
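Given the matrix of counts t_{p,s}, both profiles are short computations; a sketch, with np.inf marking a solver that never satisfies (3.16) on a problem:

```python
import numpy as np

def performance_profile(T, alpha):
    # T[p, s] = t_{p,s}; rho_s(alpha) = fraction of problems on which the
    # ratio to the best solver for that problem is at most alpha
    ratios = T / T.min(axis=1, keepdims=True)
    return (ratios <= alpha).mean(axis=0)

def data_profile(T, n_p, alpha):
    # d_s(alpha) = fraction of problems solved within alpha "simplex
    # gradients", i.e. t_{p,s} <= alpha * (n_p + 1)
    budget = alpha * (np.asarray(n_p)[:, None] + 1)
    return (T <= budget).mean(axis=0)

# 3 problems x 2 solvers; solver 1 fails on the last problem
T = np.array([[10.0, 20.0],
              [30.0, 15.0],
              [40.0, np.inf]])
print(performance_profile(T, alpha=1.0))   # fraction of "wins" per solver
print(data_profile(T, n_p=[4, 4, 9], alpha=4.0))
```

At α = 1 the performance profile reports the fraction of problems on which each solver was (weakly) fastest, while large α reveals robustness; the data profile instead normalizes effort by problem dimension.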
Performance profiles comparing the three algorithms are shown in Figure 3.1 for an accuracy of τ = 10⁻⁵. We observe that on the smooth problems, the weighted and unweighted regression methods had similar performance, and both performed slightly better than interpolation. For the deterministically noisy problems, we see slightly better performance from the weighted regression method, and this improvement is even more pronounced for the benchmark problems with stochastic noise. For the piecewise differentiable functions, the performance of the weighted regression method is significantly better. This mirrors the findings in [13], where SID-PSM using regression models shows considerable improvement over interpolation models.
We also compared our weighted regression algorithm with the DFO algorithm [8] as well as NEWUOA [50] (which had the best performance of the three solvers compared in [41]). We obtained the DFO code from the COIN-OR website [38]; this code calls IPOPT, which we also obtained from COIN-OR. We obtained NEWUOA from [40]. We ran both algorithms on the benchmark problems with a stopping criterion of Δ_min = 10⁻⁸, where Δ_min denotes the minimum trust region radius. For NEWUOA, the number of interpolation conditions was set to NPT = 2n + 1.

The performance profiles are shown in Figure 3.2, with an accuracy of τ = 10⁻⁵. NEWUOA outperforms both our algorithm and DFO on the smooth problems. This is not surprising, since NEWUOA is a mature code that has been refined over several years, whereas our code is a relatively unsophisticated implementation. In contrast, on the noisy problems and the piecewise differentiable problems, our weighted regression algorithm achieves superior performance.

[Figure 3.1: Performance (left) and data (right) profiles for the smooth, nondifferentiable, deterministically noisy, and stochastically noisy problems (τ = 10⁻⁵): interpolation vs. regression vs. weighted regression.]
[Figure 3.2: Performance (left) and data (right) profiles for the smooth, nondifferentiable, deterministically noisy, and stochastically noisy problems (τ = 10⁻⁵): weighted regression vs. NEWUOA (NPT = 2n+1) vs. DFO.]
3.6 Summary and Conclusions
Our computational results indicate that using weighted regression to construct more accurate model functions can reduce the number of function evaluations required to reach a stationary point. Encouraged by these results, we believe that further study of weighted regression methods is warranted. This chapter provides a theoretical foundation for such study. In particular, we have extended the concepts of Λ-poisedness and strong Λ-poisedness to the weighted regression framework, and we demonstrated that any scheme that maintains strongly Λ-poised sample sets for (unweighted) regression can also be used to maintain strongly Λ-poised sample sets for weighted regression, provided that no weight is too small relative to the other weights. Using these results, we showed that, when the computational error is sufficiently small relative to the trust region radius, the algorithm converges to a stationary point under standard assumptions.
This investigation began with a goal of more efficiently dealing with computational error in derivative-free optimization, particularly under varying levels of uncertainty. Surprisingly, we discovered that regression-based methods can be advantageous even in the absence of computational error. Regression methods produce quadratic approximations that better fit the objective function close to the trust region center. This is due partly to the fact that interpolation methods throw out points that are close together in order to maintain a well-poised sample set. In contrast, regression models keep these points in the sample set, thereby putting greater weight on points close to the trust region center.
The question of how to choose weights needs further study. In this chapter, we proposed a heuristic that balances uncertainties arising from computational error with uncertainties arising from poor model fidelity (i.e., Taylor error as described in §3.5.1). This weighting scheme appears to provide a benefit for noisy or nondifferentiable problems. We believe better schemes can be devised based on more rigorous analysis.
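To make the weighted-regression idea concrete, the sketch below fits a quadratic model by weighted least squares. The monomial basis, the square-root-of-weight row scaling, and the test data are generic illustrations, not the exact weighting scheme analyzed in this chapter:

```python
import numpy as np

def fit_weighted_quadratic(Y, fvals, w):
    """Fit a quadratic model m(x) = c + g.x + 0.5 x'Hx to noisy values
    by weighted least squares.  Y is (p, n); w holds per-point weights
    (a larger weight means a more trusted evaluation).  This is a
    generic sketch, not the chapter's specific weighting heuristic."""
    p, n = Y.shape
    # Quadratic design matrix: [1, x_i, 0.5*x_i^2, x_i*x_j] columns.
    cols = [np.ones(p)]
    cols += [Y[:, i] for i in range(n)]
    for i in range(n):
        for j in range(i, n):
            if i == j:
                cols.append(0.5 * Y[:, i] ** 2)
            else:
                cols.append(Y[:, i] * Y[:, j])
    X = np.column_stack(cols)
    # Weighted least squares: scale rows by sqrt(w), then solve.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], fvals * sw, rcond=None)
    return beta, X
```

Setting all weights equal recovers ordinary regression; down-weighting a point mimics distrusting a noisy evaluation far from the trust region center.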
Finally, we note that the advantage of regression-based methods is not without cost in terms of computational efficiency. In the regression methods, quadratic models are constructed from scratch every iteration, requiring O(n^6) operations. In contrast, interpolation-based methods are able to use an efficient scheme developed by Powell [50] to update the quadratic models at each iteration. It is not clear whether such a scheme can be devised for regression methods. Nevertheless, when function evaluations are extremely expensive, and when the number of variables is not too large, this cost is outweighed by the reduction in function evaluations realized by regression-based methods.
4. Stochastic Derivative-free Optimization using a Trust Region Framework
In this chapter, we propose and analyze the convergence of an algorithm which finds a local minimizer of an unconstrained function f : ℝⁿ → ℝ. The value of f at a given point x cannot be observed directly; rather, the optimization routine only has access to noise-corrupted function values f̃. Such noise may be deterministic, due to round-off error from finite-precision arithmetic or iterative methods, or stochastic, arising from variability (or randomness) in some observed process. We focus our attention in this chapter on minimizing f when f̃ has the form:
f̃(x) = f(x) + ε,   (4.1)
where ε ∼ 𝒩(0, σ²).
Noisy functions of this nature arise in a variety of settings: for example, in almost any problem where physical system measurements are being optimized. Consider a city planner wanting to maximize traffic flow on a major thoroughfare by adjusting the timing of traffic lights. For each timing pattern x, the traffic flow f̃(x) is physically measured to provide information about the expected traffic flow f(x). Stochastic approximation algorithms, built to solve
min_x f(x) = 𝔼[f̃(x)],
have existed in the literature since Robbins and Monro's algorithm for finding roots of an expected value function [53]. The Kiefer-Wolfowitz (KW) algorithm [35] generalized this algorithm to minimize the expected value of a function. Its iterates have the form
x^{k+1} = x^k − a_k G(x^k),
where G is a finite difference estimate for the gradient of f. The ith component of G is found by
G_i(x^k) = ( f̃(x^k + c_k e_i) − f̃(x^k − c_k e_i) ) / (2c_k),
where e_i is the ith unit vector. While KW spawned many generalizations, most forms require a predetermined decaying sequence for both the step size parameter a_k and the finite difference parameter c_k. As opposed to the 2n evaluations of f̃ required at each iterate of KW, Spall's simultaneous perturbation stochastic approximation (SPSA) [56] requires only 2 function evaluations per iterate, independent of n. SPSA estimates G_i by
G_i(x^k) = ( f̃(x^k + c_k δ^k) − f̃(x^k − c_k δ^k) ) / (2c_k δ_{k,i}),
where δ^k ∈ ℝⁿ is a random perturbation vector with entries δ_{k,i} which are independent and identically distributed (i.i.d.) from a distribution with bounded inverse moments, symmetrically distributed around zero, and uniformly bounded in magnitude for all k and i. Though SPSA greatly reduces the number of evaluations of f̃, the choice of sequences a_k and c_k is critical to algorithmic performance. Nevertheless, if f has a unique minimum x*, both KW and SPSA have almost sure convergence (which implies convergence in probability and convergence in distribution) of x^k → x* as k → ∞. There is also a version of SPSA which uses four function evaluations to estimate the value, gradient, and Hessian of f [57].
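A minimal sketch of the SPSA iteration described above. The ±1 perturbation distribution is one valid choice with bounded inverse moments, and the gain sequences and decay exponents are illustrative choices, not prescriptions from the text:

```python
import numpy as np

def spsa_gradient(f_noisy, x, ck, rng):
    """Simultaneous-perturbation gradient estimate: two noisy
    evaluations of f regardless of the dimension n."""
    delta = rng.choice([-1.0, 1.0], size=x.size)
    diff = f_noisy(x + ck * delta) - f_noisy(x - ck * delta)
    return diff / (2.0 * ck * delta)

def spsa_minimize(f_noisy, x0, iters=2000, a=0.1, c=0.1, rng=None):
    """Basic SPSA loop x_{k+1} = x_k - a_k * G(x_k) with decaying
    gain sequences a_k and c_k.  The gains here are illustrative;
    as the text notes, tuning them is critical in practice."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        ak = a / k ** 0.602      # commonly used decay exponents
        ck = c / k ** 0.101
        x = x - ak * spsa_gradient(f_noisy, x, ck, rng)
    return x
```

Even on a simple quadratic with additive Gaussian noise, the iterates drift toward the minimizer using only two evaluations per step.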
The algorithm which follows differs from the work in Chapter 3 in a few ways. First, the analysis presented in Chapter 3 gives very conservative error bounds; to tighten these bounds, we must consider specific probability distributions for the error. Second, in Chapter 3, ρ_k was defined using f̃(x^k) as an estimate for f(x^k). The work that follows evaluates model functions to provide better estimates of the true function values. It is possible to estimate f(x^k) by repeated evaluation of f̃(x^k), but we desire an algorithm which avoids repeatedly sampling points to reduce the variance at a point x. Such a technique only gains information about the noise ε in the stochastic case, and no information about ε if f̃ is deterministic. But if many points sufficiently close to x are sampled, information about both f and ε can be gleaned. As is often the case, the point x is the likely next iterate, and the information gathered about f near x will be used immediately. Also, if the noise in f̃ is deterministic but the optimizer has imperfect control of x, it may be possible to consider the problem a stochastic optimization problem.
The analysis of the algorithm is complicated by the presence of noise. Since there is noise in each function evaluation, it is impossible to be certain the model matches the function. For example, if f(x) = x², there is a nonzero (but tiny) probability that f̃(x) = 0 at every point evaluated. Therefore, we can only have confidence (which we denote 1 − α_k, for α_k small) that the model and function agree. The quantity α_k can be adjusted as the algorithm stagnates to ensure increasingly accurate models (at the expense of more function evaluations). A key requirement of the convergence analysis is that as Δ_k → 0, α_k does as well. For example, we can choose a simple rule such as α_k = min{Δ_k, 0.05} to prove results about our algorithm. There are many other equally valid rules for handling α_k to ensure increasingly accurate models as Δ_k → 0.
Our ultimate goal is to prove that the algorithm converges to a stationary point of f almost surely (with probability 1), but this is a daunting task. This is to be expected considering the following two quotes (both from [58]):
There is a fundamental tradeoff between algorithm efficiency and algorithm robustness (reliability and stability in a broad range of problems).
In essence, algorithms that are designed to be very efficient on one type of problem tend to be "brittle" in the sense that they do not reliably transfer to problems of a different type.
and
Unfortunately, for general nonlinear problems, there is no known finite-sample (k < ∞) distribution for the SA [stochastic approximation] iterate. Further, the theory governing the asymptotic (k → ∞) distribution is rather difficult.
Despite these pessimistic views, we are able to make progress in the work that follows. Since we are attempting to construct a robust algorithm, with a measure of confidence in our solution after a finite number of iterations, our theoretical requirements may not all be implemented in the algorithm. Relaxing requirements may yield a more suitable algorithm for a specific problem instance.
Our algorithm is a derivative-free trust region method using quadratic regression models, chosen for their perceived ability to handle noisy function evaluations. We outline the modifications required for convergence when minimizing a function with stochastic noise. For example, when there is no noise in the function being optimized, we can measure the accuracy of the kth model m_k with the ratio
ρ_k = ( f(x^k) − f(x^k + s^k) ) / ( m_k(x^k) − m_k(x^k + s^k) ).
This measures the actual decrease observed in f versus the decrease predicted by the model m_k. Since f cannot be evaluated directly, we propose a modified ratio ρ̂_k in Section 4.2 which we believe is more appropriate for noisy functions. We also propose a modified form of κ-fully quadratic models for stochastically noisy functions.
An overview of the chapter follows: in Section 4.1 we define κ-fully quadratic (and linear) models with confidence 1 − α_k on B(x; Δ) and show that quadratic and linear regression models satisfy these new definitions, provided there are a sufficient number of poised points in B(x; Δ). We outline the algorithm in Section 4.2 and show that it converges to a first-order stationary point in Section 4.3. We provide suggestions for implementing our algorithm and compare one implementation against other algorithms for minimizing (4.1) in Section 4.4. Lastly, we discuss the results in Section 4.5 and outline some future avenues for research.
4.1 Preliminary Results and Definitions
We make the following assumptions:
Assumption 4.1 The noise ε ∼ 𝒩(0, σ²), independent for each evaluation of f̃.
Assumption 4.2 The function f ∈ LC²(Ω) with Lipschitz constant L on
Ω = ∪_k B(x^k, Δ_max) ⊂ ℝⁿ.
Assumption 4.3 The function f is bounded on L_{f(x⁰)} (where L_a = {x : f(x) ≤ a}).
In solving the trust region subproblem, we do not require an exact solution; instead, it is sufficient to find an approximate solution, provided it satisfies the following assumption.
Assumption 4.4 If m_k and Δ_k are the model and trust region radius at iterate k, x^k + s^k is chosen by the trust region solver to solve min_{x ∈ B(x^k; Δ_k)} m_k(x), and s_k^C is the Cauchy step (the minimizer of m_k along −g_k within the trust region), then for all k,
m_k(x^k) − m_k(x^k + s^k) ≥ κ_fcd [ m_k(x^k) − m_k(x^k + s_k^C) ]
for some constant κ_fcd ∈ (0, 1].
This assumption merely states that every trust region subproblem solution achieves at least a fixed fraction of the decrease obtained by the Cauchy step, with this fraction bounded positively away from zero. The assumption also allows us to avoid solving the trust region subproblem exactly.
Assumption 4.5 There exists a constant κ_bhf > 0 such that, for all x^k generated by the algorithm,
‖∇²f(x^k)‖ ≤ κ_bhf.
We prove the following three claims used in this chapter.
Lemma 4.6 If X ≤_{1−α} Y and Y ≤_{1−α} Z, then X ≤_{1−2α} Z, where X ≤_{1−α} Y denotes P(X ≤ Y) ≥ 1 − α.
Proof:
P(X ≤ Z) ≥ P(X ≤ Y ∧ Y ≤ Z)
= 1 − P(X > Y ∨ Y > Z)
= 1 − P(X > Y) − P(Y > Z) + P(X > Y ∧ Y > Z)
≥ 1 − P(X > Y) − P(Y > Z)
≥ 1 − α − α = 1 − 2α.
So X ≤_{1−2α} Z. ∎
Lemma 4.7 If P(a_i ≤ ε/n) ≥ 1 − α/n for i = 1, …, n, then
P( Σ_{i=1}^n a_i ≤ ε ) ≥ 1 − α.
Proof:
P( Σ_{i=1}^n a_i ≤ ε ) ≥ P( a_1 ≤ ε/n ∧ a_2 ≤ ε/n ∧ ⋯ ∧ a_n ≤ ε/n )
= 1 − P( a_1 > ε/n ∨ ⋯ ∨ a_n > ε/n )
≥ 1 − Σ_{i=1}^n P( a_i > ε/n )
≥ 1 − n(α/n) = 1 − α. ∎
Lemma 4.8 Let Y ⊂ B(0; 1) be a strongly Λ-poised (Definition 2.14) sample set with p₁ points, and let X be the quadratic design matrix defined by (1.2). Then
[(XᵀX)⁻¹]_{ii} ≤ (q₁/p₁) Λ²,
where [(XᵀX)⁻¹]_{ii} is the ith diagonal entry of (XᵀX)⁻¹.
Proof: Since (XᵀX)⁻¹ is symmetric and positive definite, its ith eigenvalue λ_i equals its ith singular value σ_i. By [12, Theorem 4.11], the inverse of the smallest singular value of X is bounded by √(q₁/p₁) Λ. That is,
1/σ_min(X) ≤ √(q₁/p₁) Λ,
or
(q₁/p₁) Λ² ≥ 1/(σ_min(X))² = 1/λ_min(XᵀX) = σ_max((XᵀX)⁻¹) = ‖(XᵀX)⁻¹‖
≥ e_iᵀ (XᵀX)⁻¹ e_i = [(XᵀX)⁻¹]_{ii},
where e_i is the ith unit vector. ∎
4.1.1 Models which are κ-fully Quadratic with Confidence 1 − α_k
To prove convergence of the algorithm presented in Section 4.2, we first propose a modified version of κ-fully quadratic models.
Definition 4.9 Let f satisfy Assumption 4.2. Let κ = (κ_ef, κ_eg, κ_eh, ν_m) be a given vector of constants, and let Δ > 0. A model function m ∈ C² is κ-fully quadratic with confidence 1 − α_k on B(x; Δ) for α_k ∈ (0,1) if m has a Lipschitz continuous Hessian with corresponding Lipschitz constant bounded by ν_m and
• the error between the Hessian of the model and the Hessian of the function satisfies
P( ‖∇²f(y) − ∇²m(y)‖ ≤ κ_eh Δ ∀y ∈ B(x; Δ) ) ≥ 1 − α_k,
• the error between the gradient of the model and the gradient of the function satisfies
P( ‖∇f(y) − ∇m(y)‖ ≤ κ_eg Δ² ∀y ∈ B(x; Δ) ) ≥ 1 − α_k,
• the error between the model and the function satisfies
P( |f(y) − m(y)| ≤ κ_ef Δ³ ∀y ∈ B(x; Δ) ) ≥ 1 − α_k.
This is occasionally abbreviated κ-f.q.w.c. 1 − α_k.
These definitions are only useful if model functions satisfying them can be (easily) constructed; the models must also be easy to minimize over a trust region. In the following theorem, we show that quadratic regression models satisfy the requirements of Definition 4.9, provided there are enough poised points within the trust region.
Theorem 4.10 If the function f satisfies Assumption 4.2 and the noise ε satisfies Assumption 4.1, then for a given α_k ∈ (0,1), there exists a κ = (κ_ef, κ_eg, κ_eh, ν_m) such that for any x⁰ ∈ ℝⁿ, Δ > 0, if Y ⊂ B(x⁰, Δ) is strongly Λ-poised and
|Y| ≥ (z_{1−α_k/(2q₁)})² σ² q₁³ Λ² / Δ⁶,
then the quadratic regression model is κ-fully quadratic with confidence 1 − α_k (where z_p is the number of standard deviations away from zero on a standard normal distribution such that the area to the left of z_p is p) [46].
Proof: By Taylor's theorem, for any point x ∈ B(x⁰; Δ) there exists a point η(x) on the line segment connecting x to x⁰ such that
f(x) = f(x⁰) + ∇f(x⁰)ᵀ(x − x⁰) + ½(x − x⁰)ᵀ∇²f(η(x))(x − x⁰)
= f(x⁰) + ∇f(x⁰)ᵀ(x − x⁰) + ½(x − x⁰)ᵀ∇²f(x⁰)(x − x⁰) + ½(x − x⁰)ᵀH(x)(x − x⁰),   (4.2)
where H(x) = ∇²f(η(x)) − ∇²f(x⁰).
Let m(x) be the quadratic least squares model regressing Y. Since m is quadratic, Taylor's theorem says for any x,
m(x) = m(x⁰) + ∇m(x⁰)ᵀ(x − x⁰) + ½(x − x⁰)ᵀ∇²m(x⁰)(x − x⁰).
Let β be the true parameters of the quadratic part of f (defined by the first three terms of (4.2)) and let β̂ be the least-squares estimate for β (i.e., if X is the design matrix for the set Y and f̃ is the vector with ith entry f̃(yⁱ), then β̂ = (XᵀX)⁻¹Xᵀf̃). Define the mapping V : ℝⁿ → ℝ^{q₁} where
V(x) = V([x₁, ⋯, x_n]ᵀ) = [1, x₁, ⋯, x_n, ½x₁², x₁x₂, ⋯, ½x_n²]ᵀ.
Then the ith row of X is V(yⁱ)ᵀ, and the parameters β̂ define the model m; that is, m(x) = β̂ᵀV(x).
Without loss of generality, assume Δ ≤ 1. Then for any x ∈ B(x⁰; Δ),
|f(x) − m(x)| = | βᵀV(x − x⁰) + ½(x − x⁰)ᵀH(x)(x − x⁰) − β̂ᵀV(x − x⁰) |
≤ Σ_{i=0}^{q} |β_i − β̂_i| |V(x − x⁰)_i| + ½‖H(x)‖ ‖x − x⁰‖²
≤ |β₀ − β̂₀| + Σ_{i=1}^{n} |β_i − β̂_i| Δ + Σ_{i=n+1}^{q} |β_i − β̂_i| Δ² + (L/2) Δ³
≤ Σ_{i=0}^{q} |β_i − β̂_i| + (L/2) Δ³,   (4.3)
where ‖H(x)‖ ≤ LΔ follows from the Lipschitz continuity of ∇²f.
Since the noise is uncorrelated with mean zero, constant variance, and is normally distributed, it is known that β̂ ∼ 𝒩(β, σ²(XᵀX)⁻¹) [46]. If [A]_{ii} denotes the ith diagonal entry of a matrix A, then the 1 − α_k/q₁ confidence interval for each of the β_i has the form [46]:
1 − α_k/q₁ = P( β̂_i − z_{1−α_k/(2q₁)} √(σ²[(XᵀX)⁻¹]_{ii}) ≤ β_i ≤ β̂_i + z_{1−α_k/(2q₁)} √(σ²[(XᵀX)⁻¹]_{ii}) )
= P( |β_i − β̂_i| ≤ z_{1−α_k/(2q₁)} σ √([(XᵀX)⁻¹]_{ii}) )
≤ P( |β_i − β̂_i| ≤ z_{1−α_k/(2q₁)} σ √(q₁Λ²/p₁) )   by Lemma 4.8
≤ P( |β_i − β̂_i| ≤ Δ³/q₁ )   by the bound on |Y|,
since p₁ = |Y| ≥ (z_{1−α_k/(2q₁)})² σ² q₁³ Λ²/Δ⁶ implies z_{1−α_k/(2q₁)} σ √(q₁Λ²/p₁) ≤ Δ³/q₁.
Therefore, applying Lemma 4.7 to the q₁ coefficients (indexed i = 0, …, q = q₁ − 1),
P( Σ_{i=0}^{q} |β_i − β̂_i| ≤ Δ³ ) ≥ 1 − α_k.
Substituting into (4.3), we know
|f(x) − m(x)| ≤_{1−α_k} Δ³ + (L/2)Δ³ = κ_ef Δ³,
where κ_ef = 1 + L/2. Similarly,
‖∇f(x) − ∇m(x)‖ = ‖∇f(x⁰) + ∇²f(x⁰)(x − x⁰) + H(x)(x − x⁰) − (∇m(x⁰) + ∇²m(x⁰)(x − x⁰))‖
≤ ‖∇f(x⁰) − ∇m(x⁰)‖ + ‖(∇²f(x⁰) − ∇²m(x⁰))(x − x⁰)‖ + ‖H(x)(x − x⁰)‖
≤ Σ_{i=1}^{n} |β_i − β̂_i| + Σ_{i=n+1}^{q} |β_i − β̂_i| Δ + LΔ²
≤ Σ_{i=0}^{q} |β_i − β̂_i| + LΔ²
≤_{1−α_k} Δ³ + LΔ² ≤ Δ² + LΔ² = κ_eg Δ²,
where κ_eg = 1 + L. A similar argument for ‖∇²f(x) − ∇²m(x)‖ with κ_eh = 1 + L proves the theorem. ∎
To certify that a model satisfies Definition 4.9, or to improve a least squares regression model into one that is κ-fully quadratic with confidence 1 − α_k, is straightforward: we must ensure there are enough poised points within B(x; Δ) to satisfy the bound given in Theorem 4.10. Otherwise, add enough strongly Λ-poised points to Y. For a technique to generate strongly Λ-poised sets, see Chapter 3 of this thesis or [12].
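As a rough numerical sketch of this certification requirement, the helper below computes the number of strongly Λ-poised points demanded by a bound of the form |Y| ≥ z² σ² q₁³ Λ² / Δ⁶. The exact exponents in Theorem 4.10 are partially obscured in this copy, so the q₁³ factor (and q₁ = (n+1)(n+2)/2) should be treated as assumptions:

```python
import math
from statistics import NormalDist

def min_sample_size_quadratic(n, alpha, sigma, Lambda, Delta):
    """Minimum number of strongly Lambda-poised points suggested by a
    Theorem-4.10-style bound, so that the quadratic regression model is
    kappa-fully quadratic with confidence 1 - alpha on B(x; Delta).
    The |Y| ~ sigma^2 Lambda^2 / Delta^6 growth is the key feature; the
    polynomial factor in q1 is an assumption of this sketch."""
    q1 = (n + 1) * (n + 2) // 2            # number of quadratic parameters
    z = NormalDist().inv_cdf(1 - alpha / (2 * q1))
    bound = (z ** 2) * (sigma ** 2) * (q1 ** 3) * (Lambda ** 2) / Delta ** 6
    return max(q1, math.ceil(bound))       # never fewer than q1 points
```

The Δ⁻⁶ growth is what makes certifying models expensive as the trust region shrinks, and is one reason the confidence parameter α_k is tied to Δ_k.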
4.1.2 Models which are κ̂-fully Linear with Confidence 1 − α_k
While the models m_k used in the main algorithm are quadratic, linear models m̂_k(x) can approximate f near B(x^k + s^k; Δ̂_k) to sufficient accuracy. Therefore, if we have enough points within B(x^k + s^k; Δ̂_k), we can bound the error between f(x^k + s^k) and m̂_k(x^k + s^k). We quantify that accuracy in the following definition.
Definition 4.11 Let f satisfy Assumption 4.2. Let κ̂ = (κ̂_ef, κ̂_eg, ν̂_m) be a given vector of constants, and let Δ̂ > 0. A model function m̂ ∈ C¹ is κ̂-fully linear with confidence 1 − α_k on B(x; Δ̂) for α_k ∈ (0,1) if m̂ has a Lipschitz continuous gradient with corresponding Lipschitz constant bounded by ν̂_m and
• the error between the gradient of the model and the gradient of the function satisfies
P( ‖∇f(y) − ∇m̂(y)‖ ≤ κ̂_eg Δ̂ ∀y ∈ B(x; Δ̂) ) ≥ 1 − α_k,
• the error between the model and the function satisfies
P( |f(y) − m̂(y)| ≤ κ̂_ef Δ̂² ∀y ∈ B(x; Δ̂) ) ≥ 1 − α_k.
This is occasionally abbreviated κ̂-f.l.w.c. 1 − α_k.
Theorem 4.12 If the function f satisfies Assumption 4.2 and the noise ε satisfies Assumption 4.1, then for a given α_k ∈ (0,1), there exists a κ̂ = (κ̂_ef, κ̂_eg, ν̂_m) such that for any x⁰ ∈ ℝⁿ, Δ̂ > 0, if Y ⊂ B(x⁰, Δ̂) is strongly Λ-poised and
|Y| ≥ (z_{1−α_k/(2(n+1))})² σ² (n+1)³ Λ² / Δ̂⁴,
then the linear regression model is κ̂-fully linear with confidence 1 − α_k.
Proof: The proof is nearly identical to that of Theorem 4.10. ∎
4.2 Stochastic Optimization Algorithm
Below is an outline of our proposed stochastic algorithm. For x^k + s^k, the solution to the trust region subproblem, and a radius Δ_k ≥ Δ̂_k > 0, define Ŷ_k = {y ∈ Y_tot : ‖x^k + s^k − y‖ ≤ Δ̂_k}.
Let Y_tot = {y¹, ⋯, y^m} be the set of points where f̃ has been evaluated, with f̃_i := f̃(yⁱ). Define a null model m₀, an initial trust region radius Δ₀, and an initial trust region center x⁰. Choose constants satisfying 0 < γ < 1 < γ_inc, ε_c > 0, 0 < η₀ ≤ η₁ < 1 (η₁ ≠ 0), μ > β > 0, and ω ∈ (0,1). Choose r ∈ (0,1) and define Δ̂_k = rΔ_k.
Algorithm 1: A trust-region algorithm to minimize a stochastic function.
Let k = 0;
Start;
Set α_k = min{Δ_k, 0.05};
if σ_k^m = max{‖g_k‖, −λ_min(H_k)} ≤ ε_c and either (i) m_k is not certifiably κ-f.q.w.c. 1 − α_k on B(x^k; Δ_k) or (ii) Δ_k > μ σ_k^m then
Apply Algorithm 2 to update Y_k, Δ_k, and m_k;
Set α_k = min{Δ_k, 0.05};
else
Select (or generate) a strongly Λ-poised set of points Y_k ⊂ B(x^k; Δ_k) from Y_tot such that Y_k has enough points to ensure m_k is κ-f.q.w.c. 1 − α_k;
Build a regression quadratic model m_k(x) through Y_k. Solve
s^k = arg min_{s : ‖s‖ ≤ Δ_k} m_k(x^k + s).
Build a κ̂-f.l.w.c. 1 − α_k model m̂_k on Ŷ_k (possibly adding points to Y_tot) and compute
ρ̂_k = ( m_k(x^k) − m̂_k(x^k + s^k) ) / ( m_k(x^k) − m_k(x^k + s^k) );
if ρ̂_k ≥ η₁ or (ρ̂_k ≥ η₀ and m_k is κ-f.q.w.c. 1 − α_k on B(x^k; Δ_k)) then
x^{k+1} = x^k + s^k;
else x^{k+1} = x^k;
if ρ̂_k ≥ η₁ then
Δ_{k+1} = min{γ_inc Δ_k, Δ_max};
else if m_k is κ-f.q.w.c. 1 − α_k on B(x^k; Δ_k) then Δ_{k+1} = γΔ_k;
else Δ_{k+1} = Δ_k;
Let m_{k+1} be the (possibly improved) model;
Set k = k + 1 and go to Start;
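The overall structure can be illustrated with a deliberately simplified 1-D sketch. The criticality step, Λ-poisedness management, and the confidence parameter α_k are all omitted, and every constant and helper below is an illustrative assumption rather than part of the algorithm as stated:

```python
import numpy as np

def quad_fit(xs, fs):
    """Least-squares quadratic c0 + c1*x + c2*x^2 through noisy 1-D data."""
    X = np.column_stack([np.ones_like(xs), xs, xs ** 2])
    c, *_ = np.linalg.lstsq(X, fs, rcond=None)
    return c

def quad_val(c, x):
    return c[0] + c[1] * x + c[2] * x ** 2

def stochastic_tr_1d(f_noisy, x0, delta0=1.0, iters=40, pts=30, eta1=0.25,
                     gamma=0.5, gamma_inc=2.0, delta_max=2.0, r=0.5, seed=0):
    """Simplified sketch of the trust-region loop: regression model m_k
    on the region, step s_k to its minimizer, and acceptance ratio
    rho_hat = (m_k(x_k) - mhat_k(x_k+s_k)) / (m_k(x_k) - m_k(x_k+s_k)),
    where mhat_k is a second regression model built around x_k + s_k."""
    rng = np.random.default_rng(seed)
    x, delta = float(x0), float(delta0)
    for _ in range(iters):
        xs = x + delta * rng.uniform(-1, 1, pts)
        m = quad_fit(xs, np.array([f_noisy(t) for t in xs]))
        # Minimize the model over [x - delta, x + delta] by enumeration.
        cand = np.linspace(x - delta, x + delta, 201)
        xtrial = cand[np.argmin(quad_val(m, cand))]
        pred = quad_val(m, x) - quad_val(m, xtrial)
        if pred <= 1e-12:          # no predicted decrease: shrink and retry
            delta *= gamma
            continue
        # Second model around the trial point, on the smaller radius r*delta.
        ys = xtrial + r * delta * rng.uniform(-1, 1, pts)
        mhat = quad_fit(ys, np.array([f_noisy(t) for t in ys]))
        rho = (quad_val(m, x) - quad_val(mhat, xtrial)) / pred
        if rho >= eta1:
            x, delta = float(xtrial), min(delta * gamma_inc, delta_max)
        else:
            delta *= gamma
    return x
```

The key feature carried over from Algorithm 1 is that ρ̂_k compares two model values rather than raw noisy evaluations, so a lucky realization at the incumbent cannot permanently block progress.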
Note that we are approximating f(x^k + s^k) using a second model m̂_k in a different trust region Δ̂_k around x^k + s^k. Formal convergence of the algorithm, specifically Lemma 4.15, requires the ability to approximate f(x^k + s^k) with increasing accuracy as the algorithm progresses. Such accuracy is not available from a single realization of the noisy function value f̃(x^k + s^k). While it is possible to obtain increasingly accurate approximations of f(x^k + s^k) by repeated sampling, we hope the theory generated in this chapter can be easily transferred to the case where deterministic noise is present in f̃. With deterministic noise, Var(f̃(x)) = 0, and therefore repeated sampling will provide no further information.
Also, if we eventually shrink our trust region around a given point, points generated in B(x^k + s^k; Δ̂_k) to make an accurate model m̂_k can be used in the construction of an accurate model m_j(x) at some later iterate j.
Algorithm 2: Criticality Step
Initialization: Set i = 0 and m̃_k^(0) = m_k.
Repeat: Increment i by one. Improve the previous model by adding points to Y_tot until it is κ-fully quadratic with confidence 1 − α_k on B(x^k; ω^{i−1}Δ_k). (This can be done in finitely many steps, by Theorem 4.10 and the model improvement algorithm from [3] which builds a strongly Λ-poised set Y, provided the models satisfy Definition 4.9.) Denote the new model m̃_k^(i). Set Δ_k^(i) = ω^{i−1}Δ_k and g̃_k^(i) = ∇m̃_k^(i)(x^k).
Until: Δ_k^(i) ≤ μ‖g̃_k^(i)‖.
Return m_k = m̃_k^(i), Δ_k = min{max{Δ_k^(i), β‖g̃_k^(i)‖}, Δ_k}, and Y_tot.
We adopt the naming of iterates from Conn, Scheinberg, and Vicente:
1. ρ̂_k ≥ η₁ (x^k + s^k is accepted and the trust region is increased). We call these iterations successful.
2. η₁ > ρ̂_k ≥ η₀ and m_k is κ-fully quadratic with confidence 1 − α_k (x^k + s^k is accepted but Δ_k is decreased). We call these iterations acceptable.
3. η₁ > ρ̂_k and m_k is not κ-fully quadratic with confidence 1 − α_k (x^k + s^k is not accepted and the model is improved). We call these iterations model improving.
4. η₀ > ρ̂_k and m_k is κ-fully quadratic with confidence 1 − α_k (x^k + s^k is not accepted and Δ_k is reduced). We call these iterations unsuccessful.
4.3 Convergence
4.3.1 Convergence to a Firstorder Stationary Point
The use of quadratic m_k might suggest convergence to a second-order stationary point. Such a proof would require a quadratic m̂_k as well, and since Δ̂_k ≤ Δ_k, this would require more points in B(x^k + s^k; Δ̂_k) than in B(x^k; Δ_k). Since it is frequently the case that f̃(x^k + s^k) > f̃(x^k) (even when m_k is κ-f.q.w.c. 1 − α_k), we find it wasteful to build a quadratic m̂_k around x^k + s^k. This is one of the motivations for using κ̂-fully linear models for m̂_k; with these, we can prove convergence to a first-order stationary point.
We first show that if xk is not a stationary point for /, then Algorithm 2 will exit with probability 1.
Theorem 4.13 Given α_k ∈ (0,1), if f satisfies Assumption 4.2 and ‖∇f(x^k)‖ ≥ (2/μ)Δ_k, there is probability at least 1 − α_k that Algorithm 2 will correctly exit on each iterate (i) after (j), where (j) is such that ω^{j−1} < 1/(μκ_eg) (and where μ and ω are declared in Algorithm 1).
Proof: Assume ω^{i−1} ≤ ω^{j−1} < 1/(μκ_eg), Δ_k ≤ 1, and that Algorithm 2 cycles infinitely. After sufficiently many iterations of the criticality step, m̃_k^(i) will be κ-fully quadratic with confidence 1 − α_k on B(x^k; ω^{i−1}Δ_k). Therefore,
μ‖g̃_k^(i)‖ ≥ μ( ‖∇f(x^k)‖ − ‖∇f(x^k) − g̃_k^(i)‖ )
≥_{1−α_k} μ( ‖∇f(x^k)‖ − κ_eg (ω^{i−1}Δ_k)² )   by Definition 4.9
≥ 2ω^{i−1}Δ_k − μκ_eg ω^{2(i−1)}Δ_k²   by the assumption on ‖∇f(x^k)‖
≥ 2ω^{i−1}Δ_k − ω^{i−1}Δ_k   since ω^{i−1} < 1/(μκ_eg), Δ_k ≤ 1, and i ≥ j
= ω^{i−1}Δ_k = Δ_k^(i),
so the termination test Δ_k^(i) ≤ μ‖g̃_k^(i)‖ is satisfied. Hence, for each (i) after (j) such that ω^{j−1} < 1/(μκ_eg), we have confidence 1 − α_k that Algorithm 2 will exit. ∎
Since we require α_k → 0 as Δ_k → 0, for any ᾱ > 0, eventually Δ_k will be small enough that Algorithm 2 exits with probability at least 1 − ᾱ. In other words, this theorem ensures that the algorithm exits with probability 1.
Lemma 4.14 If f satisfies Assumption 4.5 and m_k is κ-fully quadratic with confidence 1 − α_k, there exists a constant κ_bhm > 0 such that
‖H_k‖ ≤_{1−α_k} κ_bhm
for all k, where H_k is the Hessian of m_k.
Proof:
‖H_k‖ ≤ ‖H_k − ∇²f(x^k)‖ + ‖∇²f(x^k)‖
≤_{1−α_k} κ_eh Δ_k + ‖∇²f(x^k)‖   by Definition 4.9
≤ κ_eh Δ_k + κ_bhf   by Assumption 4.5
≤ κ_eh Δ_max + κ_bhf =: κ_bhm. ∎
The following lemma shows that if x^k is not a stationary point of f, then, provided Δ_k is small enough, there is a high probability that a successful step will be taken.
Lemma 4.15 Let f satisfy Assumption 4.5 and let the trust region subproblem solution satisfy Assumption 4.4. Let κ = (κ_ef, κ_eg, κ_eh, ν_m) and κ̂ = (κ̂_ef, κ̂_eg, ν̂_m). Let the constants κ_fcd, κ_bhm, κ_ef, κ̂_ef, η₁ be as specified in Assumption 4.4, Lemma 4.14, Definition 4.9, Definition 4.11, and the beginning of Algorithm 1, respectively. If m_k is κ-f.q.w.c. 1 − α_k on B(x^k; Δ_k), m̂_k is κ̂-f.l.w.c. 1 − α_k on B(x^k + s^k; Δ̂_k), and
Δ_k ≤ min{ ‖g_k‖/κ_bhm , κ_fcd ‖g_k‖ (1 − η₁) / (2κ_ef Δ_max + 2κ̂_ef) },   (4.4)
then we have confidence 1 − 3α_k that ρ̂_k ≥ η₁ on the kth iteration.
Proof: Using Lemma 4.14, the fact that x^k + s^k is no worse than the Cauchy step (Assumption 4.4), and Δ_k ≤ ‖g_k‖/κ_bhm from (4.4) yields
m_k(x^k) − m_k(x^k + s^k) ≥_{1−α_k} (κ_fcd/2) ‖g_k‖ min{ ‖g_k‖/κ_bhm , Δ_k } = (κ_fcd/2) ‖g_k‖ Δ_k.   (4.5)
Using this,
|ρ̂_k − 1| = | ( m_k(x^k) − m̂_k(x^k + s^k) ) / ( m_k(x^k) − m_k(x^k + s^k) ) − ( m_k(x^k) − m_k(x^k + s^k) ) / ( m_k(x^k) − m_k(x^k + s^k) ) |
= | m_k(x^k + s^k) − m̂_k(x^k + s^k) | / | m_k(x^k) − m_k(x^k + s^k) |
≤ | m_k(x^k + s^k) − f(x^k + s^k) | / | m_k(x^k) − m_k(x^k + s^k) | + | f(x^k + s^k) − m̂_k(x^k + s^k) | / | m_k(x^k) − m_k(x^k + s^k) |
≤_{1−α_k} κ_ef Δ_k³ / | m_k(x^k) − m_k(x^k + s^k) | + | f(x^k + s^k) − m̂_k(x^k + s^k) | / | m_k(x^k) − m_k(x^k + s^k) |   by Definition 4.9
≤_{1−α_k} κ_ef Δ_k³ / | m_k(x^k) − m_k(x^k + s^k) | + κ̂_ef Δ̂_k² / | m_k(x^k) − m_k(x^k + s^k) |   by Definition 4.11
≤_{1−α_k} ( 2κ_ef Δ_k³ + 2κ̂_ef Δ̂_k² ) / ( κ_fcd ‖g_k‖ Δ_k )   by (4.5)
≤ ( 2κ_ef Δ_max + 2κ̂_ef ) Δ_k / ( κ_fcd ‖g_k‖ )   since Δ_k ≥ Δ̂_k and Δ_k ≤ Δ_max
≤ 1 − η₁   by (4.4).
Since we have confidence 1 − α_k that the second, third, and fourth inequalities hold, we have confidence 1 − 3α_k that all three hold simultaneously. ∎
Lemma 4.16 For all k, assume the trust region subproblem solution satisfies Assumption 4.4. Let f satisfy Assumption 4.5. If there exists a constant κ₁ such that ‖g_k‖ ≥ κ₁ for all k, then there exists another constant κ₂ such that, for every iteration k where
Δ_k ≤ κ₂,
we have confidence 1 − 3α_k that iteration k will be successful and Δ_k will increase if m_k is κ-f.q.w.c. 1 − α_k.
Proof: This proof is similar to [12, Lemma 10.7]. Whether Algorithm 2 has been called or not,
Δ_k ≥ min{ β‖g_k‖, Δ_{k−1} } ≥ min{ βκ₁, Δ_{k−1} }.
Since ‖g_k‖ ≥ κ₁ for all k, Lemma 4.15 implies that whenever Δ_k is less than
κ₃ = min{ κ₁/κ_bhm , κ_fcd κ₁ (1 − η₁) / (2κ_ef Δ_max + 2κ̂_ef) },
we have confidence 1 − 3α_k that iteration k will be successful (Δ_{k+1} = γ_inc Δ_k) or model improving (Δ_{k+1} = Δ_k). In either case Δ_{k+1} ≥ Δ_k, so we have confidence 1 − 3α_k that Δ_{k+1} ≥ Δ_k will hold whenever Δ_k ≤ min{ βκ₁, γκ₃ } = κ₂. ∎
Theorem 4.17 Let Assumptions 4.1-4.5 be satisfied. If the number of successful iterations is finite, then
lim inf_{k→∞} ‖∇f(x^k)‖ = 0
with probability 1.
Proof: Consider iterations after the last successful iteration, denoted k_last. For every k > k_last, the iteration is unsuccessful (ρ̂_k < η₁) or the model improvement algorithm is called. It takes a finite number of model improvement steps for the model to become κ-fully quadratic with confidence 1 − α_k on a given B(x^k; Δ_k); hence there are an infinite number of iterations that are either acceptable or unsuccessful. Given Δ_k, we can guarantee that the trust region radius must decrease by at least one multiple of γ ∈ (0,1) after C/Δ_k⁶ iterations (for a fixed constant C). For any ε > 0, there exists an integer N such that γ^N Δ_{k_last} < ε; after
C/Δ_{k_last}⁶ + C/(γΔ_{k_last})⁶ + ⋯ + C/(γ^{N−1}Δ_{k_last})⁶ = ( C/Δ_{k_last}⁶ ) ( γ^{−6N} − 1 ) / ( γ^{−6} − 1 )
iterations, the trust region radius will be less than ε. Therefore, lim_{k→∞} Δ_k = 0, which implies α_k → 0. Therefore, there exists an infinite sequence of iterates {k_i} where m_{k_i} is κ-f.q.w.c. 1 − α_{k_i} and
‖∇f(x^{k_i})‖ ≤ ‖∇f(x^{k_i}) − g_{k_i}‖ + ‖g_{k_i}‖ ≤_{1−α_{k_i}} κ_eg Δ_{k_i}² + ‖g_{k_i}‖.
The second term converges to zero with probability 1. To see this, assume ‖g_{k_i}‖ is bounded away from zero; we can then derive a contradiction using Lemma 4.15 and the fact that lim_{k→∞} Δ_k = 0. Since Δ_{k_i} → 0 and α_{k_i} → 0, then for k_i sufficiently large, Δ_{k_i} ≤ κ₂, so there is probability 1 − 3α_{k_i} that iteration k_i will be successful. Thus, for any ᾱ > 0 and K > 0, there exists k_i > K such that the probability of step k_i being successful is greater than 1 − ᾱ. Therefore, with probability 1, there are infinitely many successful iterates, contradicting the definition of k_last. ∎
4.3.2 Infinitely Many Successful Steps
The results that follow outline parts of a proof for the case when Algorithm 1 generates infinitely many successful iterates. While the previous theorem proves Δ_k → 0, its proof is not valid when there are infinitely many successful iterates. We have made considerable effort to prove Δ_k → 0 in this case, but have been unable to do so. To progress, we assume it for the time being.
Assumption 4.18
lim_{k→∞} Δ_k = 0.
It should be noted that it may be possible to ensure Algorithm 1 satisfies this assumption, perhaps by slowly decreasing Δ_max. The details would need to be worked out, but this assumption is not as strong as it might appear.
Conjecture 4.19 If Assumption 4.18 and Assumption 4.5 hold and the trust region subproblem solution satisfies Assumption 4.4 for all k, then
lim inf_{k→∞} ‖g_k‖ = 0
with probability 1.
Discussion: If ‖g_k‖ ≥ κ₁ for some κ₁ > 0, then by Lemma 4.16, there exists a κ₂ such that whenever Δ_k ≤ κ₂, we will have 1 − 3α_k confidence of increasing the trust region. Using Assumption 4.18 and the fact that α_k = min{Δ_k, 0.05}, we will increase the trust region with probability approaching 1 as k gets large. This would appear to contradict Δ_k → 0, but to prove almost sure convergence (assuming each iteration is independent) we must show the product of the 1 − α_k approaches 1. And even the assumption that each iteration is independent is difficult, as many of the points used to build m_k will be used to build m_{k+1}. If the events are dependent, then we must consider conditional probabilities, such as the probability of one step being a success given that the last step was not.
Conjecture 4.20 If Assumptions 4.1-4.5 and Assumption 4.18 hold, then for any subsequence {k_i} such that
lim_{i→∞} ‖g_{k_i}‖ = 0,   (4.6)
we have, with probability 1,
lim_{i→∞} ‖∇f(x^{k_i})‖ = 0.
Discussion: By (4.6), for i sufficiently large, ‖g_{k_i}‖ ≤ ε_c. Thus, by Theorem 4.13, Algorithm 2 ensures that the model m_{k_i} is κ-f.q.w.c. 1 − α_{k_i} on the ball B(x^{k_i}; Δ_{k_i}) with Δ_{k_i} ≤ μ‖g_{k_i}‖ for all i sufficiently large (provided ∇f(x^{k_i}) ≠ 0). By Definition 4.9 (and taking Δ_{k_i} ≤ 1),
‖∇f(x^{k_i}) − g_{k_i}‖ ≤_{1−α_{k_i}} κ_eg Δ_{k_i}² ≤ κ_eg μ ‖g_{k_i}‖.
Therefore,
‖∇f(x^{k_i})‖ ≤ ‖∇f(x^{k_i}) − g_{k_i}‖ + ‖g_{k_i}‖ ≤_{1−α_{k_i}} (κ_eg μ + 1) ‖g_{k_i}‖.
Since ‖g_{k_i}‖ → 0 with probability 1, so does ‖∇f(x^{k_i})‖.
Conjecture 4.21 If Assumptions 4.1-4.5 and Assumption 4.18 hold, then
lim inf_{k→∞} ‖∇f(x^k)‖ = 0
with probability 1.
Discussion: By Conjecture 4.19, there must exist a sequence {k_i} such that lim_{i→∞} ‖g_{k_i}‖ = 0. By Conjecture 4.20, this same sequence {k_i} has lim_{i→∞} ‖∇f(x^{k_i})‖ = 0 with probability 1. This proves the result.
4.4 Computational Example
In this section, we highlight some of the advantages of using Algorithm 1 over a traditional trust region method (which assumes deterministic function evaluations). While both algorithms have much in common, the slight differences become significant in the presence of stochastic noise. For example, deterministic algorithms are susceptible to negative noise, as we see in Figure 4.1. In that figure, the solid line is the true function f which we want to minimize, and the dashed black lines show the 95% confidence interval of the noise. The black squares mark the noisy function values which determine the quadratic trust region model m_k, and the trust region radius Δ_k is represented by the dashed lines. The trust region center x^k has a red box around it.
4.4.1 Deterministic Trust Region Method
Figure 4.1 shows a traditional trust region method after moving to a new trust region center at x^k = 2.5. Each image shows the progress of the algorithm, and we describe what occurred in the previous iterate to yield the present situation:
Figure 4.1, top left By chance, the realization of f̃(x^k + s^k) was much less than f̃ at any point near x^k + s^k. It is now the new trust region center.
Figure 4.1, top right The minimum of the quadratic model was not accepted since
ρ_k = ( f̃(x^k) − f̃(x^k + s^k) ) / ( m_k(x^k) − m_k(x^k + s^k) ) < 0.
The trust region radius has also been shrunk since the sample set is strongly Λ-poised.
Figure 4.1, bottom left A point outside of the trust region radius has been removed. Since ρ_k < 0, the trust region radius will shrink again.
Figure 4.1, bottom right Another point outside of the trust region has been removed, and a new model has been built.
Figure 4.1: Several iterations of a traditional trust region method assuming deterministic function evaluations. The trust region center is never moved.
The deterministic algorithm will accept a new trust region center only when ρ_k is sufficiently positive, i.e., only if a realization f̃(x^k + s^k) is much less than the (already unusually low) stored value f̃(x^k). If this does not happen, the algorithm will not find a successful step and the trust region radius will be repeatedly decreased. Since f̃(x^k) is never reevaluated, it is likely that the algorithm will terminate without ever taking a further step.
4.4.2 Stochastic Trust Region Method
In contrast, using the ratio ρ̂_k introduced in Section 4.2,
ρ̂_k = ( m_k(x^k) − m̂_k(x^k + s^k) ) / ( m_k(x^k) − m_k(x^k + s^k) ),
and increasing the number of points in the trust region as Δ_k decreases allows the algorithm to move off of a trust region center with negative noise, as seen in Figure 4.2.
Figure 4.2: Several iterations of a traditional trust region method assuming stochastic function evaluations.
Figure 4.2, top left: Again, the realization of f̃(x_k + s_k) was much less than f at any point near x_k + s_k. It is now the new trust region center.
Figure 4.2, top right: The minimum of the quadratic model was not accepted since ρ_k < 0, but the trust region radius is not decreased. Though the sample set is strongly Λ-poised, there are not enough points to ensure the model is fully quadratic with confidence 1 − α_k.
Figure 4.2, bottom left: More points have been added to the sample set and the model has been reconstructed.
Figure 4.2, bottom right: The minimum of the quadratic model is accepted since ρ_k > η_1 (even though f̃(x_k + s_k) > f̃(x_k)).
Using the model value at x_k instead of f̃(x_k) in the calculation of ρ_k allows the estimate of f(x_k) to adjust without wastefully reevaluating f̃(x_k). In this fashion, Algorithm 1 can avoid stagnating at points with negative noise.
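A minimal sketch of the difference, using invented numbers: the regression model smooths out the negative noise at the center, so the same kind of step that the deterministic ratio rejects is accepted once the model value at the center replaces the stored noisy realization, as described above.

```python
def rho_hat(m, x_k, x_trial, f_trial):
    """Modified acceptance ratio: the model value m(x_k) stands in for
    the stored noisy realization of f at the center x_k."""
    return (m(x_k) - f_trial) / (m(x_k) - m(x_trial))

# Hypothetical example: true function f(x) = x^2, and a regression model
# fitted through enough noisy samples that it sits near the truth.
m_k = lambda x: x * x + 0.02

x_k, s_k = 0.5, -0.4
f_trial = 0.03               # one noisy realization of f(0.1) = 0.01

r = rho_hat(m_k, x_k, x_k + s_k, f_trial)
print(r)  # close to 1: the step off the noisy center is accepted
```

The negative noise that corrupted the old center value no longer enters the numerator, so a genuinely good step is recognized as such.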
4.5 Conclusion
In this chapter we presented an algorithm using quadratic trust region models m_k to minimize a function f which cannot be evaluated exactly. Even though the algorithm only has access to noise-corrupted function evaluations f̃, we proved almost sure convergence of a subsequence of iterates to a first-order stationary point of f (when the number of successful steps is finite). We also outlined a proof for the case when the number of successful steps is infinite. These results were accomplished not by repeatedly sampling f̃ at points of interest x_k, but rather by constructing models m_k which are increasingly accurate approximations of f near x_k. Since it is often the case that x_k is the candidate for the new trust region center, this information is immediately useful in constructing m_{k+1}. We then highlighted how this algorithm remedies a common problem with using traditional trust region methods on functions with stochastic noise.
5. Non-intrusive Termination of Noisy Optimization
5.1 Introduction and Motivation
The optimization of real-world, computationally expensive functions invariably leads to the difficult question of when an optimization procedure should be terminated. Algorithm developers and the mathematical optimization community at large typically assume that the optimization is terminated when either a measure of criticality (gradient norm, mesh size, etc.) is satisfied or a user's computational budget (number of evaluations, wall clock time, etc.) is exhausted.
For a large class of problems, however, the user may not have a well-defined computational budget and instead demand a termination test t solving

min_t  Computational expense(t)
s.t.   Acceptable accuracy of the solution(t),          (5.1)
with the criticality measure of the solver employed typically chosen with the accuracy constraint in mind. Examples of such accuracybased criticality tests are discussed in detail by Gill, Murray, and Wright [19, Section 8.2.3].
The main difficulties arising from this approach are a result of (5.1) possibly being poorly formulated. The computational expense could be unbounded because an a priori user-defined accuracy is unrealistic for the problem/solver pair. Furthermore, a user may have difficulty translating the criticality measures provided by a solver, which are generally based on assumptions of smoothness and infinite-precision calculations, into practical metrics on the solution accuracy.
In Figure 5.1 we illustrate the challenges in this area with an example from nuclear physics, similar to the minimization problems considered in [37]. Each of the function values shown is obtained from running a deterministic simulation for one minute on a 640-core cluster. Stopping the optimization sooner than 200 function evaluations would not only return a solution faster but would also free the cluster for
Figure 5.1: Part of a noisy trajectory of function values for an expensive nuclear physics problem. After more significant decreases in the first 70 evaluations, progress begins to stall.
other applications and/or result in a savings in energy, an increasingly crucial factor in high-performance computing.
If we assume that the optimization (partially) shown in Figure 5.1 has not been terminated by a solver's criticality measures or a user's computational budget, the question is then whether termination should occur for other reasons. For example, if only the first three digits of the simulation output were computed stably, one may want to terminate the optimization sooner than if computational noise corrupted only the eighth digit of the output. Alternatively, the behavior shown could mean the solver in question has stagnated (because of noise, errors in the simulation, a limitation of the solver, etc.), and hence examining the solution and/or restarting the optimization could be a more effective use of the remaining computational budget. Wright [65] refers to this stalled progress as perseveration and notes that there is "no fully general way to define 'insufficient progress.'" Even so, it may be advantageous to use knowledge of the uncertainty or accuracy of a given function evaluation when making such a decision.
In the remainder of this chapter we explore these issues and propose termination criteria that can be easily incorporated on top of a user's solver of choice. In [18], Fletcher summarizes the challenges at hand (in the case of round-off errors alone):

Some consideration has to be given to the effects of round-off near the solution, and to terminate when it is judged that these effects are preventing further progress. It is difficult to be certain what strategy is best in this respect.
Moreover, Gill, Murray, and Wright [19] stress that
no set of termination criteria is suitable for all optimization problems and all methods.
This sentiment is shared by Powell [47], who says
it is believed that it is impossible to choose such a convergence criterion which is effective for the most general function ... so a compromise has to be made between stopping the iterative procedure too soon and calculating f an unnecessarily large number of times.
Consequently, we will consider tests that allow for the use of estimates of the noise particular to a problem. Furthermore, our criteria are not intended as substitutes for a computational budget or a solver's built-in criticality tests, which we consider to be important safeguards. Likewise, the termination problem can be viewed as a real-time control problem depending on complete knowledge of the solver's decisions, but we resist this urge for purposes of portability and applicability.
We provide background on previous work and introduce notation in Section 5.2. The families of stopping tests we propose in Section 5.3 do not provide guarantees on the quality of the solution, although doing so may be the role of a solver's built-in criteria. Instead, the proposed tests are parameterized in order to quantify a
user's tradeoff between the benefit of achieving additional decrease and the cost of additional evaluations, while requiring a minimal amount of information from the solver. Equally important are our results in Section 5.4, which compare the quality of these families of stopping tests on a collection of local optimization algorithms. We first consider all solvers as a single routine, later validating this approach by demonstrating equal performance for the best tests on individual algorithms. While our results can be incorporated in a local subroutine of any global search algorithm, the tests proposed in Section 5.3 are unable to distinguish between exploration and refinement phases in their current form. We summarize our results in Section 5.5 and provide recommendations for implementing these tests.
5.2 Background
Our discussion and analysis are limited to optimization methods that do not explicitly require derivative information. However, other algorithms could readily employ the tests proposed here in addition to their derivative-based stopping criteria. While our work can be further extended to incorporate noisy gradient information, the derivatives of noisy functions are typically even noisier than the function.
Derivative-free optimization methods are often favored for their perceived ability to handle noisy functions. Although asymptotic convergence of these methods is generally proved assuming a smooth function, adjustments are frequently made to accommodate noise. In the case of stochastic functions, where noise results from a random distribution with Var(f̃(x)) > 0, replications of function evaluations can be used to modify existing methods (e.g., [14] modifying UOBYQA [48], [15, 1] modifying DIRECT [30], and [61] modifying Nelder-Mead (see, e.g., [12])). However, stopping criteria for these methods involve limited knowledge of the noise and indicate the wide variety of stopping tests used in practice. In [1], optimization is stopped when adjacent points are within 10⁻⁴ of each other, whereas [15] allows stopping when the best function value has not been improved after some number of consecutive
iterations. To limit the number of stochastic replications, the authors of [14] and [61] adjust the maximum number of allowed replications at a particular point based on the variance of the noise.
Deterministic noise, that is, noise that results from a deterministic process such as finite-precision arithmetic, iterative methods, and adaptive procedures, is far less understood than its stochastic counterpart [42]. Not surprisingly, even less knowledge of the magnitude of the noise is used for problems with deterministic objectives. When low-amplitude noise is present, Kelley [33] proposes a restart technique for Nelder-Mead but terminates when sufficiently small differences exist in the simplicial function values, independent of the magnitude of the noise. Implicit filtering [32] has numerous termination possibilities (small function value differences on a stencil, a small change in the best function value from one iteration to the next, etc.) but none that are explicitly related by the author to the magnitude of the noise. A similar implicit relationship to noise can be seen in [24], where treed Gaussian process models for optimization are terminated when a maximum improvement statistic is sufficiently small. The authors of SNOBFIT [29] suggest stopping when the best point has not changed for a number of consecutive calls to the main SNOBFIT algorithm.
Our work more closely follows that of Gill et al. [19], where Section 8.2 is devoted to properties of the computed solution. The authors there recommend terminating Nelder-Mead-like algorithms when the maximum difference between function values on the simplex is less than a demanded accuracy weighted by the best function value on the simplex.
The only other direct relationships between stopping criteria and a measure of noise that we are aware of are in [42, Section 9] and [25]. In [42], a stochastic model of the noise is used to estimate the noise level of a function value f̃(x) by difference-table-based approximations of the standard deviation (Var(f̃(x)))^{1/2}. Results are validated for deterministic f̃. As an example application, the authors terminate a Nelder-Mead method on an ODE-based problem when consecutive decreases are less than a factor of the noise level. The authors of [25] perturb bound-constrained problems so the incumbent iterate is the exact solution to this new problem. An algorithm can then be terminated when the size of this perturbation first decreases below the error in the problem. Natural extensions to gradient/derivative-based tests are also enabled by the recent work in [43], where near-optimal finite-difference estimates are provided as a function of the noise level.
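In the spirit of the noise-aware tests just surveyed, a termination rule that consults a noise-level estimate might look like the following sketch. The window length and the multiplier mu are illustrative parameters, not values taken from any of the cited methods:

```python
def noise_aware_stop(best_values, noise_level, mu=5.0, window=5):
    """Stop when the decrease in the best-so-far function value over the
    last `window` evaluations falls below mu times the noise level."""
    if len(best_values) <= window:
        return False
    return best_values[-1 - window] - best_values[-1] < mu * noise_level

# Nonincreasing best values from a hypothetical noisy run:
best = [10.0, 8.0, 7.5, 7.4, 7.39, 7.389, 7.3889, 7.3888,
        7.3888, 7.3888, 7.3888, 7.3888]

print(noise_aware_stop(best[:6], noise_level=1e-3))  # False: still progressing
print(noise_aware_stop(best, noise_level=1e-3))      # True: progress below noise
```

The same run with a smaller noise level (say 10⁻⁸) would be allowed to continue, reflecting the earlier observation that noise corrupting the eighth digit warrants a later stop than noise corrupting the third.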
Before proceeding, we define the notation employed throughout. We let ℝ₊ denote the nonnegative reals and ℕ denote the natural numbers. We let {x_1, ..., x_m} ⊂ ℝ^n and {f_1, ..., f_m} ⊂ ℝ be a sequence of points and corresponding function values produced by a local minimization solver, and we collect the data from the first i evaluations in F_i = {(x_1, f_1), ..., (x_i, f_i)}. The best function value in the first i evaluations is given by f_i^* = min_{j ≤ i} {f_j}, with x_i^* denoting the point corresponding to f_i^*. Accordingly, the sequence {f_i^*} is nonincreasing. Unless otherwise stated, ‖·‖ denotes the standard Euclidean distance.
We let ε_i^r be an estimate of the relative noise at f_i (i.e., the noise at x_i scaled by the magnitude of f(x_i)). This estimate may come from experience, numerical analysis of the underlying processes in computing f_i, or appropriate scaling (by 1/f_i) of the noise-level estimates from the method proposed in [42]. In the case of stochastic functions with nonzero mean at x_i, ε_i^r is the standard deviation of f̃(x_i) relative to the expected value E[f̃(x_i)].
Favorable properties of a termination test include scale and shift invariance, so that the test would terminate after the same number of evaluations for any affine transformation of the objective function. Specifically, a test is scale invariant in f if it terminates optimization runs defined by F_i and αF_i = {(x_1, αf_1), ..., (x_i, αf_i)} at an identical evaluation number for any α > 0. Similarly, a test is shift invariant in f if it terminates F_i and F_i + β = {(x_1, f_1 + β), ..., (x_i, f_i + β)} after an identical evaluation number for any β ∈ ℝ.
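These invariance properties are easy to check numerically. The relative-decrease rule below is a hypothetical example (not one of the tests proposed in Section 5.3): it stops once the decrease in the best value falls below tol times the magnitude of the current best, which makes it scale invariant but not shift invariant.

```python
def stop_index(fvals, tol):
    """Return the first evaluation index at which the decrease in the
    best-so-far value drops below tol * |current best| (or len(fvals))."""
    best = fvals[0]
    for i, fi in enumerate(fvals[1:], start=2):
        new_best = min(best, fi)
        if best - new_best <= tol * abs(new_best):
            return i
        best = new_best
    return len(fvals)

hist = [5.0, 3.0, 2.0, 1.5, 1.4, 1.39, 1.389]
i0 = stop_index(hist, 0.05)
print(i0 == stop_index([10.0 * f for f in hist], 0.05))   # True: scale invariant
print(i0 == stop_index([f + 100.0 for f in hist], 0.05))  # False: not shift invariant
```

Scaling all values by α > 0 multiplies both the decreases and the threshold by α, so the stopping index is unchanged; shifting by β changes |current best| but not the decreases, so the test fires at a different evaluation.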
PAGE 12
1.Introduction Traditionalunconstrainedoptimizationisinherentlytiedtoderivatives;thenecessaryconditionsforarstorderminimumarecharacterizedbythederivativebeing equaltozero.Nevertheless,thereexistsavarietyoffunctionswhichmustbeminimizedwhenderivativesareunavailable.Forexample,consideranengineerinalab whowantstomaximizethestrengthofametalbarbyadjustingvariousfactors ofproduction.Afterthebarisconstructed,itisbrokentodetermineitsstrength. Thereisnoclosedformedsolutionforthebar'sstrength;eachfunctionevaluation comesfromanexpensiveprocedure.Also,theprocessofbreakingthebarprovides noinformationabouthowtochangethefactorsofproductiontoincreasethebar's strength.Inadditiontotheoptimizationofsystemswhichmustbephysicallyevaluated,functionevaluationsbycomplexcomputersimulationsoftenprovidenoor unreliablederivatives.Suchsimulationsofcomplexphenomenasometimescalled blackboxfunctionsarebecomingincreasinglycommonascomputationalmodeling andcomputerhardwarecontinuetoadvance.Whereastraditionaltechniquesareconcernedwitheciencyofthealgorithm,suchconcernsaresecondarythroughoutthis thesis.Explicitly,weassumethatthecostofevaluatingthefunctionoverwhelmsthe computationalrequirementsofthealgorithm. Inadditiontounavailablederivatives,noiseofvariousformsisoftenpresentin thesefunctions.Throughoutthisthesiswewillcategorizethisnoise{ordierence betweenthetruevalueandcomputedvalue{intotwocategories:deterministicand stochastic.Deterministicnoisee.g.,arisingfromniteprecisioncalculationsoriterativemethodsisoftenpresentifthefunctionbeingoptimizedisasimulationof aphysicalsystem.Forexample,ifevaluatingthefunctioninvolvessolvingasystemofnonlinearpartialdierentialequationsorcomputingtheeigenvaluesofalarge matrix,asmallperturbationintheparameterscanyieldalargejumpinthedierencebetweenthetrueandcomputedvalues.Thoughthecomputedvalueandtrue 1
PAGE 13
valuemaydier,reevaluatingthefunctionwiththesamechoiceofparameterswill providenofurtherinformation.Incontrast,reevaluatingafunctionwithstochastic noisewillprovideadditionalinformationaboutthetruevalueofthefunction.Two commonsourcesofstochasticnoisearefoundinfunctionswhoseevaluationrequires alargescaleMonteCarlosimulationofanactualsystemorifafunctionevaluation requiresphysicallymeasuringapropertyinsomesystem. Thethesisconsistsofthreedistinctbutconnectedchaptersaddressingtheproblem: min x f x : R n ! R whenthealgorithmonlyhasaccesstonoisecorruptedvalues f x := f x + x ; where x denotesthenoise. Eachchaptermakesdierentassumptionsaboutthenoise x .Chapter3assumesthattheaccuracyatwhich f approximates f canvary,andthatthisaccuracy canbequantied.Forexample,if f x iscalculatedusingaMonteCarlosimulation, increasingthenumberoftrialswilldecreasethemagnitudeof x .Similarly,if f is calculatedbyaniteelementmethod,increasedaccuracycanbeobtainedbyaner grid.Ofcourse,thisgreateraccuracycomesatthecostofincreasedcomputational time;soitmakessensetovarytheaccuracyrequirementsoverthecourseoftheoptimization.Withthisinmind,Chapter3asks:Howcanknowledgeoftheaccuracy ofdierentfunctionevaluationsbeexploitedtodesignabetteralgorithm? Chapter4assumesthatthenoiseforeachfunctionevaluationisindependentof x andcanbemodeledasanormallydistributedrandomvariablewithmeanzero andaxed,nitestandarddeviation.Thoughmanyotheralgorithmshavebeen designedtooptimizesuchafunction,theyoftenresorttorepeatedsamplingofpoints. Thisprovidesinformationaboutthenoiseatapoint,butnoinformationaboutthe 2
PAGE 14
functionnearby.ThismotivatesthequestionaddressedinChapter4:Howcan greateraccuracybeecientlyachievedbyoversamplingwithoutnecessarilyrepeating functionevaluations? Chapter5assumesthatareasonablyaccurateestimateofthemagnitudeofthe noisecanbeobtained,andthatthisestimateremainsrelativelyconstantthroughout thedomain.Thoughtherearemanyalgorithmsintheliteraturedesignedtooptimizenoisyfunctions,veryfewuseestimatesofthenoiseintheirterminationcriteria. Whenfunctionevaluationsarecheap,terminationcanbedeterminedbycommon testse.g.,smallstepsizeorgradientapproximation.Butwhenfunctionevaluationsareexpensive,determiningwhentostopbecomesanimportantmultiobjective optimizationproblem.Theoptimizerwantstondthebestsolutionpossiblewhile minimizingcomputationaleort.Asthisisadicultproblemtoexplicitlyformulate, practitionersfrequentlyterminatealgorithmswheniapredenednumberofiterationshaselapsed,iinodecreaseintheoptimalfunctionvaluehasbeendetectedfor anumberofiterations,oriiithedistancebetweenanumberofsuccessiveiterates isbelowsomethreshold.Chapter5attemptstoanswerthequestion:Whenshould analgorithmoptimizinganexpensive,noisyfunctionbeterminated? 1.1ReviewofMethods Beforediscussingouralgorithmsfurther,werstdiscusspreviousDFOtechniques. Heuristicsareperhapstherstrecoursewhenattemptingtooptimizeafunction withoutderivatives.Simulatedannealing[36,63],geneticalgorithms[28],random searchanditsvariants[55,66,39,52],tabusearch[20],scattersearch[21],particle swarmoptimization[34],andNelderMead[45]arejustafewoftheheuristicsdevelopedtosolveproblemswhereonlyfunctionevaluationsareavailable.Thoughmostof thesealgorithmslackformalconvergenceresultsasidefromresultsdependentonthe algorithmproducingiterateswhicharedenseinthedomain,theyremainpopular 3
PAGE 15
duetotheireaseofimplementation,exibilityonavarietyofproblemclasses, andfrequentsuccessinpractice. Othertechniquesattempttoapproximatetheunavailablederivative.Classical nitedierencemethodsapproximatethederivativebyadjustingeachvariableand notingthechangeinthefunctionvalue.Othertechniques,suchasthepatternsearch methods[62,2]andimplicitltering[5],evaluateachangingpatternofpointsaround thebestknownsolution.AlsoofnoteistheDIRECTalgorithm[30],aglobaloptimizationmethodbasedondividinghyperrectanglesusingonlyfunctionvalues. AnincreasinglypopularclassofalgorithmsforderivativefreeoptimizationDFO aremodelbasedtrustregionmethods[31,11].Localmodelsapproximatingthefunctionareconstructedandminimizedtogeneratesuccessiveiterates.Thesemodels arecommonlyloworderpolynomialsinterpolatingfunctionvaluesclosetothebest knownvalue,forexamplePowell'sUOBYQAalgorithm[48].Otherexamplesinclude [49],wherePowellintroducesaminimumFrobeniusnormconditiononunderdeterminedquadraticmodels,andORBITbyWildetal.[64],whichconstructsmodels usinginterpolatingradialbasisfunctions.Theselocalmodelsshouldnotbeconfused withkriging[59]orresponsesurfacemethodologies[44]whichbuildglobalmodelsof thefunction.Thoughimplementationofthesetechniquesisnotassimpleassome othertechniques,awelldevelopedconvergencetheoryexists.Asthisthesisfocuses onnoisyDFOproblems,weconsideredtrustregionmethodswithregressionmodels mostappropriatesince,inmanycases,regressionmodelsthroughenoughpointscan approximatethetruefunction. TherearealsoavarietyofexistingDFOalgorithmsforoptimizingfunctionswith noise.Forfunctionswithstochasticnoise,replicationsoffunctionevaluationscan beasimplewaytomodifyexistingalgorithms.Forexample,[14]modiesPowell's UOBYQA[48],[15]modiesDIRECT[30],and[61]modiesNelderMeadbyrepeatedsamplingofthefunctionatpointsofinterest.Fordeterministicnoise,Kelley 4
PAGE 16
[33]proposesatechniquetodetectandrestartNelderMeadmethods.Neumaier's SNOBFIT[29]algorithmaccountsfornoisebynotrequiringthesurrogatefunctions tointerpolatefunctionvalues,butrathertastochasticmodel.Similarly,[10]proposesusingleastsquaresregressionmodelsinsteadofinterpolatingmodelswhennoise ispresentinthefunctionevaluations. Lastly,StochasticApproximationalgorithmsarealsodesignedtominimizefunctionswithstochasticnoise.Thesealgorithmsaredevelopedbystatisticianstosolve min f x = E [ f x ] ; whenonly f canbecomputed.Twoofthemorefamousalgorithms,theKieferWolfowitzandSimultaneousPerturbationmethods,takepredenedsteplengthsina directionapproximating r f .Thesetechniqueshavestrongtheoreticalconvergence results,butcanbediculttoimplementinpractice.Forfurtherdiscussionofthese algorithms,seethebeginningofChapter4. 1.2Outline Theworkinthisthesisfocusesonmodicationstomodelbasedtrustregion methodsinordertohandlenoise.Throughoutthethesisweassumethatonlynoisy, expensivefunctionevaluations f areavailable,butthereissomesmoothunderlying function f whichistwicecontinuouslydierentiablewithaLipschitzcontinuousHessian.Wealsoassumethatthenoiseintheevaluationof f isunbiasedwithbounded variance. Chapter3jointworkwithStephenBillupsandPeterGrafproposesaDFO algorithmtooptimizefunctionswhichareexpensivetoevaluateandcontaincomputationalnoise.Thealgorithmisbasedonthetrustregionmethodsof[9,10]which buildinterpolationorregressionmodelsaroundthebestknownsolution.Wepropose using weighted regressionmodelsinatrustregionframework,andproveconvergence ofsuchmethodsprovidedtheweightingschemesatisessomebasicconditions. 5
PAGE 17
Thealgorithmtsintoageneralframeworkforderivativefreetrustregionmethodsusingquadraticmodels,whichwasdescribedbyConn,Scheinberg,andVicente in[12,11].Weshallrefertothisframeworkasthe CSV2framework .Thisframeworkconstructsaquadraticmodelfunction m k ,whichapproximatestheobjective function f onasetofsamplepoints Y k R n ateachiteration k .Thenextiterateis thendeterminedbyminimizing m k overatrustregion.Inordertoguaranteeglobal convergence,theCSV2frameworkmonitorsthedistributionofpointsinthesample set,andoccasionallyinvokesamodelimprovementalgorithmthatmodiesthesamplesettoensure m k accuratelyapproximates f .TheCSV2frameworkisthebasis ofthewellknownDFOalgorithmwhichisfreelyavailableontheCOINORwebsite [38]. TotouralgorithmintotheCSV2frameworkweextendthetheoryofpoisedness, asdescribedin[12],toweightedregression.WeshowProposition3.12thatasample setthatisstronglypoisedintheregressionsenseisalsostrongly c poisedinthe weighted regressionsenseforsomeconstant c ,providedthatnoweightistoosmall relativetotheotherweights.Thus,anymodelimprovementschemethatensures strongpoisednessintheregressionsensecanbeusedintheweightedregression framework. TheconvergenceproofinChapter3requiresthatthecomputationalerrordecreasesasthetrustregiondecreases;suchanassumptioncanbesatisediftheuser hassomecontroloftheaccuracyinthefunctionevaluation.SinceChapter3is centeredonexploitingdieringlevelsindierentfunctionevaluations,suchanassumptionisreasonableforthatchapter.InChapter4,weremovethisassumption, butaddtheassumptionthat f hastheform f x = f x + .1 where N ; 2 .ThecontentofChapter4jointworkwithStephenBillups modiesthealgorithmfromChapter3toconvergewhentheerrordoesnotdecrease 6
with the trust region radius. With some light assumptions on the noise and underlying function, we prove the algorithm generates a subsequence of iterates which converge almost surely to a first-order stationary point in the case where the number of successful iterates is finite.

At a given point of interest x_0, the algorithm does not repeatedly sample f̃(x_0) in order to glean information about f(x_0). Rather, m_k(x_0), the value of the trust region model at x_0, is used to estimate f(x_0). We derive bounds on the error between f and m, provided the set of points used to construct m satisfies certain geometric conditions, called strongly Λ-poised (see Definition 2.14), and contains a sufficient number of points. Also, the step size is controlled by the algorithm, increasing and decreasing as the algorithm progresses and stagnates. This contrasts with many of the methods in the Stochastic Approximation literature, where the user must provide a predefined set of steps to be taken by the algorithm.

The results in Section 4.3 prove the algorithm will generate a subsequence of iterates converging almost surely to a first-order stationary point when the number of successful iterates is finite, and makes progress in the infinite case. Such results are not common for most DFO algorithms on problems with stochastic noise. Both the simplicial direct search method [1] and the trust region method in [4] prove similar convergence results, but both reduce the variance at a point by repeated sampling. In addition to our strong convergence result, we are able to directly quantify the probability of the success of some iterates (see Lemma 4.15 for one such example). We are unaware of any other similar theoretical results for algorithms minimizing stochastic functions.

Chapter 5 (joint work with Stefan Wild) addresses termination criteria, the choice of which is a common problem when optimizing noisy functions. We propose objective measures to compare the quality of termination rules. Families of termination tests are then proposed and their performance is analyzed across a broad range of DFO
algorithms. Recommendations for tests which work for many algorithms are also provided. Lastly, Chapter 6 contains concluding remarks and directions for future research.

1.3 Notation

The following notation will be used: R^n denotes the Euclidean space of real n-vectors. ‖·‖_p denotes the standard ℓ_p vector norm, and ‖·‖ without the subscript denotes the Euclidean norm. ‖·‖_F denotes the Frobenius norm of a matrix. C^k denotes the set of functions on R^n with k continuous derivatives. D^j f denotes the jth derivative of a function f ∈ C^k, j ≤ k. Given an open set Ω ⊆ R^n, LC^k(Ω) denotes the set of C^k functions with Lipschitz continuous kth derivatives. That is, for f ∈ LC^k(Ω), there exists a Lipschitz constant L such that

    ‖D^k f(y) − D^k f(x)‖ ≤ L ‖y − x‖ for all x, y ∈ Ω.

P^d_n denotes the space of polynomials of degree less than or equal to d in R^n; q_1 denotes the dimension of P^2_n (specifically q_1 = (n+1)(n+2)/2). We use standard "big-Oh" notation (written O(·)) to state, for example, that for two functions on the same domain, f(x) = O(g(x)) if there exists a constant M such that |f(x)| ≤ M |g(x)| for all x with sufficiently small norm. Given a set Y, |Y| denotes the cardinality and conv(Y) denotes the convex hull. For a real number α, ⌊α⌋ denotes the greatest integer less than or equal to α. For a matrix A, A⁺ denotes the Moore-Penrose generalized inverse [22]. e_j denotes the jth column of the identity matrix. The ball of radius Δ centered at x ∈ R^n is denoted B(x; Δ). Given a vector w, diag(w) denotes the diagonal matrix W with diagonal entries W_ii = w_i. For a square matrix A, cond(A) denotes the condition number, λ_min(A) denotes the smallest eigenvalue, and σ_min(A) denotes the smallest singular value. For a set Y := {y^0, …, y^p} ⊂ R^n, the quadratic design matrix X has rows

    [ 1   y^j_1 ⋯ y^j_n   (1/2)(y^j_1)²   y^j_1 y^j_2 ⋯ y^j_{n−1} y^j_n   (1/2)(y^j_n)² ].    (1.2)
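The row layout of the quadratic design matrix in (1.2) can be sketched in Python. This helper is illustrative only (the function name is ours, not the thesis's), and assumes the monomial ordering shown above, with the pure squared terms halved:

```python
def quadratic_design_row(y):
    """Row of the quadratic design matrix for a point y in R^n:
    [1, y_1, ..., y_n, y_1^2/2, y_1*y_2, ..., y_{n-1}*y_n, y_n^2/2]."""
    n = len(y)
    row = [1.0] + [float(yi) for yi in y]
    for i in range(n):
        for j in range(i, n):
            # halve the pure squares, keep the cross terms as-is
            row.append(0.5 * y[i] * y[i] if i == j else y[i] * y[j])
    return row
```

For n variables the row has 1 + n + n(n+1)/2 = (n+1)(n+2)/2 = q_1 entries, matching the dimension of P^2_n given above.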
Let m_k denote the kth trust region model as defined in Chapter 2. Let g_k = ∇m_k(x_k) and H_k = ∇²m_k(x_k).

    σ^m_k(x) = max{‖∇m_k(x)‖, −λ_min(∇²m_k(x))}
    σ(x) = max{‖∇f(x)‖, −λ_min(∇²f(x))}

These variables measure how close x is to a first- and second-order stationary point of f and m_k (i.e. the gradient is zero and all eigenvalues are positive). If X is a random variable, the notation X ≥_ε δ denotes P(X ≥ δ) ≥ 1 − ε. Note that the relation ≥_ε is not transitive.
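As an illustration of the measures σ^m_k(x) and σ(x), the sketch below (an assumption-laden example of ours, not code from the thesis) evaluates max{‖∇f(x)‖, −λ_min(∇²f(x))} in two dimensions, where the eigenvalues of a symmetric 2×2 matrix have a closed form:

```python
import math

def sigma(grad, hess2x2):
    """Criticality measure max{||grad||, -lambda_min(H)} for n = 2.
    For a symmetric 2x2 matrix [[a, b], [b, c]] the eigenvalues are
    (a + c)/2 +/- sqrt(((a - c)/2)^2 + b^2)."""
    (a, b), (_, c) = hess2x2
    lam_min = 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return max(math.hypot(*grad), -lam_min)
```

At a strict local minimizer (zero gradient, positive definite Hessian) the measure is 0; at a saddle point the negative eigenvalue makes it positive, so a small σ(x) certifies approximate first- and second-order stationarity.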
2. Background

Before continuing, we introduce the background material on which the thesis is constructed.

2.1 Model-based Trust Region Methods

A trust region algorithm is a numerical technique for minimizing a sufficiently smooth function f. At each iteration k, a model function m_k(x) is constructed to approximate f near the best point x_k. When derivatives are available, m_k is usually a truncated first- or second-order Taylor series approximation of f at x_k. For example, if ∇f and ∇²f are easy to calculate,

    m_k(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2)(x − x_k)^T ∇²f(x_k)(x − x_k).

m_k is minimized over the trust region B(x_k; Δ_k) by solving the problem

    min_{s : ‖s‖ ≤ Δ_k} m_k(x_k + s)

to generate a potential next trust region center x_k + s_k. f(x_k + s_k) is evaluated and the ratio

    ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(x_k) − m_k(x_k + s_k))

is calculated, which compares the actual decrease in f versus the decrease predicted by the model m_k. This ratio quantifies the success of iteration k and also how well the model function approximates the true function f. If ρ_k is larger than a prescribed threshold η_1, it indicates that the iteration was successful and the model is a good approximation of the function. In this case, x_{k+1} is set to x_k + s_k and the trust region radius Δ_k is increased. If ρ_k is less than another parameter η_0, the model function is considered unreliable, so the trust region radius Δ_k is decreased and the iterate is not updated (i.e. x_{k+1} = x_k). Lastly, k is incremented and the process repeats.

2.1.1 Model Construction Without Derivatives
When derivatives are unavailable, the models m_k are constructed using points where f has been evaluated. For example, the Conn, Scheinberg, and Vicente framework (which we refer to as the CSV2 framework) builds models m_k from a specified class of models M using a sample set of points Y_k = {y^0, …, y^p} ⊂ B(x_k; Δ_k) on which the function has been evaluated.

Given Y_k and a vector of corresponding function values f = (f(y^0), …, f(y^p)), an interpolating model is a model m(x) such that m(y^i) = f(y^i) for i = 0, …, p. Given a basis φ = {φ_0(x), …, φ_q(x)} of the class of models M, we can calculate the coefficients α_i in the basis representation of the interpolating model m(x) = Σ_{i=0}^q α_i φ_i(x) by solving the linear system

    M(φ, Y) α = f,    (2.1)

where

    M(φ, Y) =
        [ φ_0(y^0)  φ_1(y^0)  ⋯  φ_q(y^0) ]
        [ φ_0(y^1)  φ_1(y^1)  ⋯  φ_q(y^1) ]
        [    ⋮          ⋮       ⋱      ⋮   ]
        [ φ_0(y^p)  φ_1(y^p)  ⋯  φ_q(y^p) ].

Note that for this equation to have a unique solution, the number of sample points p + 1 must equal the size of the basis q + 1, and the matrix M(φ, Y) must be invertible.

Regression models have also been investigated [10], in which the number of sample points p + 1 is greater than the size of the basis. In this case, the matrix M(φ, Y) has more rows than columns, so the equation (2.1) is solved in a least squares sense. Lastly, if M(φ, Y) has more columns than rows, the system (2.1) is underdetermined. Nevertheless, bounds between the function and an underdetermined model can be obtained in certain cases. For example, see [13] concerning the minimum Frobenius norm method.

2.1.2 CSV2 framework
We next outline the CSV2 framework for derivative-free trust region methods described by Conn, Scheinberg, and Vicente [12, Algorithm 10.3]. Algorithms in the framework construct a model function m_k at iteration k, which approximates the objective function f on a set of sample points Y_k = {y^0, …, y^{p_k}} ⊂ R^n. The next iterate is then determined by minimizing m_k. Specifically, given the iterate x_k, a putative next iterate is given by x_k + s_k, where the step s_k solves the trust region subproblem

    min_{s : ‖s‖ ≤ Δ_k} m_k(x_k + s),

where the scalar Δ_k > 0 denotes the trust region radius, which may vary from iteration to iteration. If x_k + s_k produces sufficient descent in the model function, then f(x_k + s_k) is evaluated, and the iterate is accepted if f(x_k + s_k) is sufficiently smaller than f(x_k).

Definition 2.2 Let f satisfy Assumption 2.1, let κ = (κ_ef, κ_eg, κ_eh, ν_m2) be a given vector of constants, and let Δ > 0. A model function m ∈ C² is (κ)-fully quadratic on B(x; Δ) if m has a Lipschitz continuous Hessian with corresponding Lipschitz constant bounded by ν_m2 and
- the error between the Hessian of the model and the Hessian of the function satisfies ‖∇²f(y) − ∇²m(y)‖ ≤ κ_eh Δ for all y ∈ B(x; Δ);
- the error between the gradient of the model and the gradient of the function satisfies ‖∇f(y) − ∇m(y)‖ ≤ κ_eg Δ² for all y ∈ B(x; Δ);
- the error between the model and the function satisfies |f(y) − m(y)| ≤ κ_ef Δ³ for all y ∈ B(x; Δ).

Definition 2.3 Let f satisfy Assumption 2.1. A set of model functions M = {m : R^n → R, m ∈ C²} is called a fully quadratic class of models if there exist positive constants κ = (κ_ef, κ_eg, κ_eh, ν_m2) such that the following hold:

1. for any x ∈ S and Δ ∈ (0, Δ_max], there exists a model function m in M which is (κ)-fully quadratic on B(x; Δ).
2. For this class M, there exists an algorithm, called a "model improvement" algorithm, that in a finite, uniformly bounded (with respect to x and Δ) number of steps can
- either certify that a given model m ∈ M is (κ)-fully quadratic on B(x; Δ),
- or find a model m ∈ M that is (κ)-fully quadratic on B(x; Δ).

Note that this definition of a fully quadratic class of models is equivalent to [12, Definition 6.2]; but we have given a separate definition of a (κ)-fully quadratic model (Definition 2.2) that includes the use of κ to stress the fixed nature of the bounding constants. This change simplifies some analysis by allowing us to discuss (κ)-fully quadratic models independent of the class of models they belong to. It is important
to note that κ does not need to be known explicitly. Instead, it can be defined implicitly by the model improvement algorithm. All that is required is for κ to be fixed (that is, independent of x and Δ). We also note that the set M may include non-quadratic functions, but when the model functions are quadratic, the Hessian is fixed, so ν_m2 can be chosen to be zero. For the algorithms presented in Chapter 3 and Chapter 4, we focus on model functions that are quadratic. That is, M = P^2_n.

As a side note, (κ)-fully quadratic models may be too difficult to construct or may be undesired in some situations. If that is the case, (κ)-fully linear models might provide a useful alternative.

Definition 2.4 Let f ∈ LC² and let κ = (κ_ef, κ_eg, ν_m1) be a given vector of constants, and let Δ > 0. A model function m ∈ C² is (κ)-fully linear on B(x; Δ) if m has a Lipschitz continuous gradient with corresponding Lipschitz constant bounded by ν_m1 and

- the error between the gradient of the model and the gradient of the function satisfies ‖∇f(y) − ∇m(y)‖ ≤ κ_eg Δ for all y ∈ B(x; Δ);
- the error between the model and the function satisfies |f(y) − m(y)| ≤ κ_ef Δ² for all y ∈ B(x; Δ).

If m_k is (κ)-fully linear, it approximates f in a fashion similar to the first-order Taylor model of f. If m_k is (κ)-fully quadratic, then it approximates f in a fashion similar to the second-order Taylor model of f. If (κ)-fully linear or quadratic models are used within the CSV2 framework, we can guarantee convergence of the algorithm to a first- or second-order stationary point of f.

A critical distinction between the CSV2 framework and classical trust region methods lies in the optimality criteria. In classical trust region methods, m_k is the
second-order Taylor approximation of f at x_k; so if x_k is optimal for m_k, it satisfies the first- and second-order necessary conditions for an optimum of f. In the CSV2 framework, x_k must be optimal for m_k, but m_k must also be an accurate approximation of f near x_k. This requires that the trust region radius is small and that m_k is (κ)-fully quadratic on the trust region for some fixed κ.

To explicitly outline the CSV2 framework, we provide pseudocode below. In the algorithm, g^icb_k and H^icb_k denote the gradient and Hessian of the incumbent model m^icb_k. We use the superscript icb to stress that incumbent parameters from the previous iterates may be changed before they are used in the trust region step. The optimality of x_k with respect to m_k is tested by calculating σ^icb_k = max{‖g^icb_k‖, −λ_min(H^icb_k)}. This quantity is zero if and only if the first- and second-order optimality conditions for m_k are satisfied. The algorithm enters the criticality step when σ^icb_k is close to zero. This routine builds a (possibly new) (κ)-fully quadratic model for the current Δ^icb_k, and tests if σ^icb_k for this model is sufficiently large. If so, a descent direction has been determined, and the algorithm can proceed. If not, the criticality step reduces Δ^icb_k and updates the sample set to improve the accuracy of the model function near x_k. The criticality step ends when σ^icb_k is large enough (and the algorithm proceeds) or when both σ^icb_k and Δ^icb_k are smaller than given threshold values ε_c and Δ_min (in which case the algorithm has identified a second-order stationary point). We refer the reader to [12] for a more detailed discussion of the algorithm, including explanations of the parameters η_0, η_1, γ, γ_inc, β, μ and ω.

Algorithm CSV2 [12, Algorithm 10.3]

Step 0 (initialization): Choose a fully quadratic class of models M and a corresponding model improvement algorithm, with associated κ defined by Definition 2.3. Choose an initial point x_0 and maximum trust region radius Δ_max > 0. We assume that the following are given: an initial model m^icb_0(x) (with gradient and Hessian at
x = x_0 denoted by g^icb_0 and H^icb_0, respectively), σ^icb_0 = max{‖g^icb_0‖, −λ_min(H^icb_0)}, and a trust region radius Δ^icb_0 ∈ (0, Δ_max].

The constants η_0, η_1, γ, γ_inc, ε_c, β, μ, ω are given and satisfy the conditions 0 ≤ η_0 ≤ η_1 < 1 (with η_1 ≠ 0), 0 < γ < 1 < γ_inc, ε_c > 0, μ > β > 0, ω ∈ (0, 1). Set k = 0.

Step 1 (criticality step): If σ^icb_k > ε_c, then m_k = m^icb_k and Δ_k = Δ^icb_k.

If σ^icb_k ≤ ε_c, then proceed as follows. Call the model improvement algorithm to attempt to certify if the model m^icb_k is (κ)-fully quadratic on B(x_k; Δ^icb_k). If at least one of the following conditions hold:

- the model m^icb_k is not certifiably (κ)-fully quadratic on B(x_k; Δ^icb_k), or
- Δ^icb_k > μ σ^icb_k,

then apply Algorithm CriticalityStep (described below) to construct a model m̃_k(x) (with gradient and Hessian at x = x_k denoted by g̃_k and H̃_k, respectively, and with σ̃^m_k = max{‖g̃_k‖, −λ_min(H̃_k)}), which is (κ)-fully quadratic on the ball B(x_k; Δ̃_k) for some Δ̃_k ∈ (0, μ σ̃^m_k] given by [12, Algorithm 10.4]. In such a case set

    m_k = m̃_k and Δ_k = min{max{Δ̃_k, β σ̃^m_k}, Δ^icb_k}.

Otherwise, set m_k = m^icb_k and Δ_k = Δ^icb_k. For a more complete discussion of trust region management, see [26].

Step 2 (step calculation): Compute a step s_k that sufficiently reduces the model m_k (in the sense of [12, (10.13)]) such that x_k + s_k ∈ B(x_k; Δ_k).

Step 3 (acceptance of the trial point): Compute f(x_k + s_k) and define

    ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(x_k) − m_k(x_k + s_k)).

If ρ_k ≥ η_1, or if both ρ_k ≥ η_0 and the model is (κ)-fully quadratic on B(x_k; Δ_k), then x_{k+1} = x_k + s_k and the model is updated to include the new iterate into the sample set, resulting in a new model m^icb_{k+1}(x) (with gradient and Hessian at x = x_{k+1} denoted
by g^icb_{k+1} and H^icb_{k+1}, respectively), with σ^icb_{k+1} = max{‖g^icb_{k+1}‖, −λ_min(H^icb_{k+1})}; otherwise, the model and the iterate remain unchanged (m^icb_{k+1} = m_k and x_{k+1} = x_k).

Step 4 (model improvement): If ρ_k < η_1, use the model improvement algorithm to

- attempt to certify that m_k is (κ)-fully quadratic on B(x_k; Δ_k);
- if such a certificate is not obtained, we say that m_k is not certifiably (κ)-fully quadratic and make one or more suitable improvement steps.

Define m^icb_{k+1} to be the (possibly improved) model.

Step 5 (trust region update): Set

    Δ^icb_{k+1} ∈
        {min{γ_inc Δ_k, Δ_max}}          if ρ_k ≥ η_1 and Δ_k < β σ^m_k,
        [Δ_k, min{γ_inc Δ_k, Δ_max}]     if ρ_k ≥ η_1 and Δ_k ≥ β σ^m_k,
        {γ Δ_k}                          if ρ_k < η_1 and m_k is (κ)-fully quadratic,
        {Δ_k}                            if ρ_k < η_1 and m_k is not certifiably (κ)-fully quadratic.

Increment k by 1 and go to Step 1.

Algorithm CriticalityStep [12, Algorithm 10.4] (This algorithm is applied only if σ^icb_k ≤ ε_c and at least one of the following holds: the model m^icb_k is not certifiably (κ)-fully quadratic on B(x_k; Δ^icb_k) or Δ^icb_k > μ σ^icb_k.)

Initialization: Set i = 0. Set m^0_k = m^icb_k.

Repeat: Increment i by one. Use the model improvement algorithm to improve the previous model m^{i−1}_k until it is (κ)-fully quadratic on B(x_k; ω^{i−1} Δ^icb_k). Denote the new model by m^i_k. Set Δ̃_k = ω^{i−1} Δ^icb_k and m̃_k = m^i_k.

Until Δ̃_k ≤ μ σ̃^m_k (where σ̃^m_k is computed for the current model m^i_k).
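To make Steps 3 and 5 concrete, the following sketch (ours, not the thesis's) implements the ρ-test and one admissible radius update. The parameter values η_1 = 0.5, γ = 0.5, γ_inc = 2, Δ_max = 10 are example choices satisfying the conditions of Step 0; on success, the update always takes the largest value the rules allow, which lies in the prescribed interval in both successful cases.

```python
def rho(f_x, f_trial, m_x, m_trial):
    """Ratio of actual to predicted decrease (Step 3)."""
    return (f_x - f_trial) / (m_x - m_trial)

def update_radius(delta, rho_k, fully_quadratic, eta1=0.5,
                  gamma=0.5, gamma_inc=2.0, delta_max=10.0):
    """One admissible choice within the Step 5 update rules."""
    if rho_k >= eta1:
        return min(gamma_inc * delta, delta_max)  # successful: expand
    if fully_quadratic:
        return gamma * delta  # unsuccessful with an accurate model: shrink
    return delta              # blame the model, not the radius: improve it first
```

The last branch reflects the framework's key idea: an unsuccessful step with an uncertified model triggers model improvement (Step 4) rather than a radius reduction.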
Global Convergence. If the following assumptions are satisfied, it has been shown that the CSV2 framework iterates will converge to a stationary point of f. Define the set L(x_0) = {x ∈ R^n : f(x) ≤ f(x_0)}.

Assumption 2.5 Assume that f is bounded from below on L(x_0).

Assumption 2.6 There exists a constant κ_bhm > 0 such that, for all x_k generated by the algorithm, ‖H_k‖ ≤ κ_bhm.

Theorem 2.7 [12, Theorem 10.23] Let Assumptions 2.1, 2.5 and 2.6 hold with S = L(x_0). Then, if the models used in the CSV2 framework are (κ)-fully quadratic, the iterates x_k generated by the CSV2 framework satisfy

    lim_{k→+∞} max{‖∇f(x_k)‖, −λ_min(∇²f(x_k))} = 0.

It follows from this theorem that any accumulation point of {x_k} satisfies the first- and second-order necessary conditions for a minimum of f.

2.1.3 Poisedness

Having outlined the CSV2 framework, we can discuss conditions on the sample set used to build m_k which guarantee that the model sufficiently approximates f. Consider the set of polynomials in R^n of degree less than or equal to d, denoted P^d_n. A basis φ = {φ_0(x), …, φ_q(x)} of P^d_n is a set of polynomials from P^d_n which span P^d_n. That is, for any P(x) ∈ P^d_n, there exist coefficients α_i such that P(x) = Σ_{i=0}^q α_i φ_i(x). We let |P^d_n| denote the number of elements in any basis φ of P^d_n. For example, |P^1_n| = n + 1 and |P^2_n| = (n+1)(n+2)/2.

Definition 2.8 A set of points X = {x^0, …, x^p} ⊂ R^n with |X| = |P^d_n| is poised for interpolation if the matrix M(φ, X) is nonsingular for some basis φ in P^d_n.
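Definition 2.8 can be checked numerically: build M(φ, X) and test nonsingularity. The pure-Python sketch below (illustrative; the names are ours) uses the linear basis {1, x_1, …, x_n} (d = 1), for which |X| must equal n + 1, and tests singularity by Gaussian elimination with partial pivoting:

```python
def linear_design_matrix(points):
    """M(phi, X) for the linear basis {1, x_1, ..., x_n} (d = 1)."""
    return [[1.0] + [float(c) for c in p] for p in points]

def is_poised_for_interpolation(points, tol=1e-12):
    """X with |X| = n + 1 is poised iff M(phi, X) is nonsingular (Definition 2.8)."""
    m = linear_design_matrix(points)
    n = len(m)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            return False  # M(phi, X) is (numerically) singular
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= f * m[col][c]
    return True
```

For example, three affinely independent points in R² are poised for linear interpolation, while three collinear points are not.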
If |X| > |P^d_n|, we can construct the least squares regression model by solving (2.1) as well. We extend the definition of poisedness to the regression case.

Definition 2.9 A set of points X = {x^0, …, x^p} ⊂ R^n with |X| ≥ |P^d_n| is poised for regression if the matrix M(φ, X) has full column rank for some basis φ in P^d_n.

Since we have limited information about the function f, we want the interpolating or regressing m(x) to be an accurate approximation in a region of interest. This requires that X consists of points spread out within said region. Since M(φ, X) can be arbitrarily poorly conditioned while X is still poised, simply being poised is not enough to measure the quality of a set X. Also, looking at the condition number of M(φ, X) is inadequate, since scaling the sample set X or choosing a different basis φ can arbitrarily adjust this quantity. Nevertheless, there are methods for measuring the quality of a sample set, one of the most common of which is based on Lagrange polynomials.

Definition 2.10 For a set X = {x^0, …, x^p} ⊂ R^n, with |X| = |P^d_n|, the set of interpolating Lagrange polynomials ℓ = {ℓ_0, …, ℓ_p} ⊂ P^d_n are the polynomials satisfying

    ℓ_i(y^j) = 1 if i = j, and 0 otherwise.

If the set X is poised, then the set of polynomials is guaranteed to exist and be uniquely defined.

We can extend the definition of Lagrange polynomials to the regression case in a natural fashion.

Definition 2.11 For a set X = {x^0, …, x^p} ⊂ R^n, with |X| > |P^d_n|, the set of regression Lagrange polynomials ℓ = {ℓ_0, …, ℓ_p} ⊂ P^d_n are the polynomials satisfying

    ℓ_i(y^j) =_{l.s.} 1 if i = j, and 0 otherwise.
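The regression Lagrange polynomials of Definition 2.11 can be computed by solving M(φ, X) a_j =_{l.s.} e_j for each j. The sketch below (illustrative, pure Python, using a linear basis in one variable; the names are ours) forms the normal equations (MᵀM) a_j = Mᵀ e_j; the test exploits the fact that the regression model Σ_j f(y^j) ℓ_j(x) reproduces any function in the span of the basis exactly.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    m = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def regression_lagrange_coeffs(M):
    """Coefficient vectors a_j of the regression Lagrange polynomials:
    each solves M a_j =_{l.s.} e_j via the normal equations (M^T M) a_j = M^T e_j.
    Note that M^T e_j is simply row j of M."""
    p1, q1 = len(M), len(M[0])
    MtM = [[sum(M[k][i] * M[k][j] for k in range(p1)) for j in range(q1)]
           for i in range(q1)]
    return [solve(MtM, list(M[j])) for j in range(p1)]
```

With M built from the basis {1, x} at four points, the model coefficients Σ_j f̃_j a_j coincide with the ordinary least squares fit, so data generated by a linear function are recovered exactly.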
Though these polynomials are no longer linearly independent, if X is poised, then the set of regression Lagrange polynomials exists and is uniquely defined.

We can now use these Lagrange polynomials to extend the definition of poisedness to Λ-poisedness. This relates the magnitude of the Lagrange polynomials on a set B ⊂ R^n, which will allow a method for measuring the quality of a sample set.

Definition 2.12 Let Λ > 0 and a set B ⊂ R^n be given. For a basis φ of P^d_n, a poised set X = {x^0, …, x^p} is said to be Λ-poised in B (in the interpolating sense) if and only if

1. for the basis of Lagrange polynomials associated with X,

    Λ ≥ max_{0≤i≤p} max_{x∈B} |ℓ_i(x)|;

or, equivalently,

2. for any x ∈ B there exists λ(x) such that

    Σ_{i=0}^p λ_i(x) φ(y^i) = φ(x) with ‖λ(x)‖_∞ ≤ Λ.

And we again can extend this definition to the regression case.

Definition 2.13 Let Λ > 0 and a set B ⊂ R^n be given. For a basis φ of P^d_n, a poised set X = {x^0, …, x^p} with |X| ≥ |P^d_n| is said to be Λ-poised in B (in the regression sense) if and only if

1. for the basis of Lagrange polynomials associated with X,

    Λ ≥ max_{0≤i≤p} max_{x∈B} |ℓ_i(x)|;

or, equivalently,

2. for any x ∈ B there exists λ(x) such that

    Σ_{i=0}^p λ_i(x) φ(y^i) = φ(x) with ‖λ(x)‖_∞ ≤ Λ.
Note that if |X| = |P^d_n|, the definitions for Λ-poised in the interpolation and regression sense coincide.

We can now examine the following bound from [7]:

    ‖D^r f(x) − D^r m(x)‖ ≤ (1/(d+1)!) ν_d Σ_{i=0}^p ‖x^i − x‖^{d+1} ‖D^r ℓ_i(x)‖,

where D^r f(x) is the rth derivative of f,

    D^r f(x)(z^1, …, z^r) = Σ_{i_1,…,i_r} (∂^r f(x) / (∂x_{i_1} ⋯ ∂x_{i_r})) z^1_{i_1} ⋯ z^r_{i_r},

and ν_d is an upper bound on ‖D^{d+1} f(x)‖. If r = 0, this bound reduces to

    |f(x) − m(x)| ≤ (1/(d+1)!) (p+1) Λ_b ν_d Δ^{d+1},    (2.2)

where Λ_b = max_{0≤i≤p} max_{x∈B} |ℓ_i(x)| and Δ is the diameter of the smallest ball containing X. Therefore, if the number of points in X is bounded, then Λ-poisedness is sufficient to derive bounds on the error between the regression or interpolating model and the function. That is, decreasing the radius of the sample set will provide bounds similar to Taylor bounds when derivatives are available. If using regression models with arbitrarily many points, Λ-poisedness is not enough to construct similar bounds. Strong Λ-poisedness can help in this case.

Definition 2.14 Let ℓ(x) = (ℓ_0(x), …, ℓ_p(x))^T be the regression Lagrange polynomials associated with the set Y = {y^0, …, y^p}. Let Λ > 0 and let B be a set in R^n. The set Y is said to be strongly Λ-poised in B (in the regression sense) if and only if

    (q_1/√p_1) Λ ≥ max_{x∈B} ‖ℓ(x)‖,

where q_1 = |P^2_n| and p_1 = |Y|.
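To see how Λ measures sample-set quality, the sketch below (ours, not from the thesis) estimates Λ = max_i max_{x∈B} |ℓ_i(x)| of Definitions 2.10 and 2.12 for one-dimensional interpolation by gridding B; the closed form ℓ_i(x) = Π_{j≠i} (x − x_j)/(x_i − x_j) is available in 1-D. Well-spread nodes give a small Λ, while nodes clustered at one end of B give a huge Λ.

```python
def lagrange_basis_1d(nodes):
    """Interpolating Lagrange polynomials for 1-D nodes (Definition 2.10)."""
    def l(i, x):
        v = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                v *= (x - xj) / (nodes[i] - xj)
        return v
    return l

def poisedness_constant(nodes, lo, hi, grid=1001):
    """Estimate Lambda = max_i max_{x in B} |l_i(x)| on B = [lo, hi] by gridding."""
    l = lagrange_basis_1d(nodes)
    xs = [lo + (hi - lo) * k / (grid - 1) for k in range(grid)]
    return max(abs(l(i, x)) for i in range(len(nodes)) for x in xs)
```

For the nodes {−1, 0, 1} on B = [−1, 1] the constant is 1 (a well-poised set), whereas the clustered nodes {0.9, 0.95, 1.0} on the same B give a constant in the hundreds.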
Since we can rewrite (2.2) as

    |f(x) − m(x)| ≤ (1/(d+1)!) √(p+1) ν_d Λ_{b,2} Δ^{d+1},

where Λ_{b,2} = max_{x∈B} ‖ℓ(x)‖, strong Λ-poisedness provides Taylor-like error bounds between the regression model m and the function f, even when the number of points in X is unbounded.

As a final note, explicitly calculating the value of Λ is computationally expensive, but not necessary. It is possible to use the condition number of the design matrix M(φ̄, X) to bound the constant Λ. Since it is possible to scale the condition number of M(φ̄, X) by shifting and scaling X, or choosing a different basis, conditions must be placed on M(φ̄, X) before using its condition number. If we (1) use the standard basis (e.g., φ̄ = {1, x_1, …, x_n, (1/2)x_1², x_1 x_2, …, (1/2)x_n²}) for P^d_n, (2) shift the sample set X so every point lies within the unit ball, and (3) require that at least one point has norm 1, then it is possible to bound Λ using the condition number of M(φ̄, X). The next two theorems are for the interpolation and regression case, respectively.

Theorem 2.15 Let X̂ denote the shifted and scaled version of X, so every point lies within the unit ball and at least one point has norm 1. Let M̂ = M(φ̄, X̂), where φ̄ is the standard basis. If M̂ is nonsingular and ‖M̂^{−1}‖ ≤ Λ, then the set X̂ is √(p_1) Λ-poised in the unit ball. Conversely, if the set X̂ is Λ-poised in the unit ball, then ‖M̂^{−1}‖ ≤ θ √(p_1) Λ, where θ > 0 depends on n and d but is independent of X̂ and Λ.

Theorem 2.16 Let X̂ denote the shifted and scaled version of X, so every point lies within the unit ball and at least one point has norm 1. Let M̂ = M(φ̄, X̂), where φ̄ is the standard basis, and let Û Σ̂ V̂^T be the reduced singular value decomposition of M̂. That is, Σ̂ is the q_1 × q_1 diagonal matrix of singular values. If Σ̂ is nonsingular and
‖Σ̂^{−1}‖ ≤ √(q_1/p_1) Λ, then the set X̂ is strongly Λ-poised in the unit ball. Conversely, if the set X̂ is strongly Λ-poised in the unit ball, then ‖Σ̂^{−1}‖ ≤ θ (q_1/√p_1) Λ, where θ depends on n and d but is independent of X̂ and Λ.

2.2 Performance Profiles

Next, we explain the content of performance profiles, which are a compact method for comparing the performance of various algorithms on a set of problems. We will use Figure 2.1 as an example. Algorithms A, B, and C have been run on an identical set of problems for the same number of function evaluations. The left axis shows the percentage of problems each algorithm solved first, where "solved" is user-defined. Often, an algorithm is considered to solve a problem when it first finds a function value within a tolerance of the best known solution. In the example, A solves 20% of the problems first, B solves 55% of the problems first, and C solves 30% of the problems first. As these percentages total over 100%, there is an overlap in the set of problems the algorithms solve first.

The right axis shows the percentage of problems solved by a given algorithm in the number of function evaluations given. All algorithms in Figure 2.1 solve over 90% of the problems in the testing set. Values between the left and right axes denote the percentage of problems solved as a multiple of the number of evaluations required for the fastest algorithm. For example, given 6 times as many iterations as the fastest algorithm on a problem, A solves 80% of the problems in the testing set.

Formally, an algorithm is considered to solve a problem when it first finds a function value satisfying

    f(x_0) − f(x) ≥ (1 − τ)(f(x_0) − f_L),

where τ > 0 is a small tolerance, f_L is the smallest found function value for any solver in a specified number of iterations, and x_0 is the initial point given to each algorithm.
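The convergence test above is straightforward to implement; the sketch below (illustrative; the function name is ours) returns the first evaluation index at which a solver's history of best-found values satisfies f(x_0) − f(x) ≥ (1 − τ)(f(x_0) − f_L):

```python
def solved(f_hist, f0, f_L, tau=1e-3):
    """First evaluation index at which f0 - f_hist[k] >= (1 - tau)(f0 - f_L),
    i.e. f_hist[k] <= f0 - (1 - tau)(f0 - f_L); None if never satisfied."""
    target = f0 - (1.0 - tau) * (f0 - f_L)
    for k, fk in enumerate(f_hist):
        if fk <= target:
            return k
    return None
```

Collecting these indices over a problem set yields the t_{p,a} values from which performance ratios and profiles are computed.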
Figure 2.1: An example of a performance profile

If t_{p,a} is the number of iterations needed for solver a to solve problem p, then the performance ratio is defined by

    r_{p,a} = t_{p,a} / min_a{t_{p,a}}.

Then the performance profile of a solver a is the fraction of problems where the performance ratio is at most α. That is,

    ρ_a(α) = |{p ∈ P : r_{p,a} ≤ α}| / |P|,

where P is the set of benchmark problems.

2.3 Probabilistic Convergence

Lastly, we define various forms of probabilistic convergence. A sequence {X_n} of random variables is said to converge in distribution (or converge weakly, or
converge in law) to a random variable X if

    lim_{n→∞} F_n(x) = F(x)

for every number x ∈ R at which F is continuous. Here F_n and F are the cumulative distribution functions of the random variables X_n and X, respectively.

A sequence {X_n} of random variables converges in probability to X if for all ε > 0,

    lim_{n→∞} P(|X_n − X| ≥ ε) = 0.

A sequence {X_n} of random variables converges almost surely (or almost everywhere, or with probability 1, or strongly) towards X if

    P(lim_{n→∞} X_n = X) = 1.
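A quick Monte Carlo illustration of convergence in probability (a sketch of ours, not from the thesis): the sample mean X̄_n of n Uniform(0, 1) draws converges in probability to 1/2, so the estimated P(|X̄_n − 1/2| ≥ ε) shrinks as n grows.

```python
import random

def prob_deviation(n, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P(|mean of n Uniform(0,1) draws - 1/2| >= eps).
    By the weak law of large numbers this probability tends to 0 as n grows."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= eps:
            bad += 1
    return bad / trials
```

With ε = 0.1, the deviation probability is large for n = 4 (the standard deviation of the mean is about 0.14) and essentially zero by n = 400, matching the definition above.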
3. Derivative-free Optimization of Expensive Functions with Computational Error Using Weighted Regression

3.1 Introduction

In this chapter, we construct an algorithm designed to optimize functions evaluated by large computational codes, taking minutes, hours or even days for a single function call, for which derivative information is unavailable, and for which function evaluations are subject to computational error. Such error may be deterministic (arising, for example, from discretization error) or stochastic (for example, from Monte Carlo simulation). Because function evaluations are extremely expensive, it is sensible to perform substantial work at each iteration to reduce the number of function evaluations required to obtain an optimum.

We assume that the accuracy of the function evaluation can vary from point to point, and that this variation can be quantified. In this chapter, we will use knowledge of this varying error to improve the performance of the algorithm by using weighted regression models in a trust region framework. By giving more accurate points more weight when constructing the trust region model, we hope that the models will more closely approximate the function being optimized. This leads to a better performing algorithm.

Our algorithm fits within the CSV2 framework, which is outlined in Chapter 2. To specify an algorithm within this framework, three things are required:

1. Define the class of model functions M. This is determined by the method for constructing models from the sample set. In [10] models were constructed using interpolation, least squares regression, and minimum Frobenius norm methods. We describe the general form of our weighted regression models in §3.2 and propose a specific weighting scheme in §3.5.

2. Define a model improvement algorithm. §3.4 describes our model improvement algorithm, which tests the geometry of the sample set, and if necessary, adds
and/or deletes points to ensure that the model function constructed from the sample set satisfies the error bounds in Definition 2.2 (i.e. it is (κ)-fully quadratic).

3. Demonstrate that the model improvement algorithm satisfies the requirements of the definition of a class of fully quadratic models. For our algorithm, this is discussed in §3.4.

The chapter is organized as follows. We place our algorithm in the CSV2 framework by describing (1) how model functions are constructed (§3.2), and (2) a model improvement algorithm (§3.4). Before describing the model improvement algorithm, we first extend the theory of Λ-poisedness to the weighted regression framework (§3.3). Computational results are presented in §3.5 using a heuristic weighting scheme, which is described in that section. §3.6 concludes the chapter.

3.2 Model Construction

This section describes how we construct the model function m_k at the kth iteration. For simplicity, we drop the subscript k for the remainder of this section. Let f̃ = (f̃_0, …, f̃_p)^T, where f̃_i denotes the computed function value at y^i, and let ε_i denote the associated computational error. That is,

    f̃_i = f(y^i) + ε_i.    (3.1)

Let w = (w_0, …, w_p)^T be a vector of positive weights for the set of points Y = {y^0, …, y^p}. A quadratic polynomial m is said to be a weighted least squares approximation of f with respect to w if it minimizes

    Σ_{i=0}^p w_i² (m(y^i) − f̃_i)² = ‖W(m(Y) − f̃)‖²,

where m(Y) denotes the vector (m(y^0), m(y^1), …, m(y^p))^T and W = diag(w). In this case, we write

    W m(Y) =_{l.s.} W f̃.    (3.2)
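For intuition about (3.2), consider the degenerate case of a constant model m(x) ≡ c: setting the derivative of Σ_i w_i²(c − f̃_i)² to zero gives c = Σ_i w_i² f̃_i / Σ_i w_i², the weighted mean with weights w_i². Larger weights thus pull the model toward the evaluations trusted most. The sketch below is an illustration of ours (the helper name is not from the thesis):

```python
def weighted_constant_model(f_vals, w):
    """Minimizer of sum_i w_i^2 (c - f_i)^2: the weighted mean with weights
    w_i^2. A degenerate (constant-model) instance of (3.2)."""
    s = sum(wi * wi for wi in w)
    return sum(wi * wi * fi for wi, fi in zip(w, f_vals)) / s
```

With equal weights this reduces to the ordinary mean, i.e. unweighted least squares; with one weight 10 times larger, the model sits almost on top of that evaluation.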
Let φ = {φ_0, φ_1, …, φ_q} be a basis for the quadratic polynomials in R^n. For example, φ might be the monomial basis

    φ̄ = {1, x_1, x_2, …, x_n, x_1²/2, x_1 x_2, …, x_{n−1} x_n, x_n²/2}.    (3.3)

Define

    M(φ, Y) =
        [ φ_0(y^0)  φ_1(y^0)  ⋯  φ_q(y^0) ]
        [ φ_0(y^1)  φ_1(y^1)  ⋯  φ_q(y^1) ]
        [    ⋮          ⋮       ⋱      ⋮   ]
        [ φ_0(y^p)  φ_1(y^p)  ⋯  φ_q(y^p) ].

Since φ is a basis for the quadratic polynomials, the model function can be written m(x) = Σ_{i=0}^q α_i φ_i(x). The coefficients α = (α_0, …, α_q)^T solve the weighted least squares regression problem

    W M(φ, Y) α =_{l.s.} W f̃.    (3.4)

If M(φ, Y) has full column rank, the sample set Y is said to be poised for quadratic regression. The following lemma is a straightforward generalization of [12, Lemma 4.3]:

Lemma 3.1 If Y is poised for quadratic regression, then the weighted least squares regression polynomial with respect to positive weights w = (w_0, …, w_p) exists, is unique, and is given by m(x) = φ(x)^T α, where

    α = (WM)⁺ W f̃ = (M^T W² M)^{−1} M^T W² f̃,    (3.5)

where W = diag(w) and M = M(φ, Y).

Proof: Since W and M both have full column rank, so does WM. Thus, the least squares problem (3.4) has a unique solution given by (WM)⁺ W f̃. Moreover, since WM has full column rank, (WM)⁺ = ((WM)^T (WM))^{−1} M^T W. □

3.3 Error Analysis and the Geometry of the Sample Set
Throughout this section, Y = {y^0, …, y^p} denotes the sample set, p_1 = p + 1, w ∈ R^{p_1}_+ is a vector of positive weights, W = diag(w), and M = M(φ, Y). f̃ denotes the vector of computed function values at the points in Y, as defined by (3.1).

The accuracy of the model function m_k relies critically on the geometry of the sample set. In this section, we generalize the theory of Λ-poisedness from [12] to the weighted regression framework. This section also includes error analysis which extends results from [12] to weighted regression, as well as considering the impact of computational error on the error bounds. We start by defining weighted regression Lagrange polynomials.

3.3.1 Weighted Regression Lagrange Polynomials

Definition 3.2 A set of polynomials ℓ_j(x), j = 0, …, p in P^d_n are called weighted regression Lagrange polynomials with respect to the weights w and sample set Y if for each j,

    W ℓ_j(Y) =_{l.s.} W e_j,

where ℓ_j(Y) = [ℓ_j(y^0), …, ℓ_j(y^p)]^T.

The following lemma is a direct application of Lemma 3.1.

Lemma 3.3 Let φ(x) = (φ_0(x), …, φ_q(x))^T. If Y is poised, then the set of weighted regression Lagrange polynomials exists and is unique, and is given by ℓ_j(x) = φ(x)^T a_j, j = 0, …, p, where a_j is the jth column of the matrix

    A = (WM)⁺ W.    (3.6)

Proof: Note that m = ℓ_j satisfies (3.2) with f̃ = e_j. By Lemma 3.1, ℓ_j(x) = φ(x)^T a_j where a_j = (WM)⁺ W e_j, which is the jth column of (WM)⁺ W. □

The following lemma is based on [12, Lemma 4.6].
Lemma 3.4 If Y is poised, then the model function defined by (3.2) satisfies

    m(x) = Σ_{i=0}^p f̃_i ℓ_i(x),

where ℓ_j(x), j = 0, …, p denote the weighted regression Lagrange polynomials corresponding to Y and W.

Proof: By Lemma 3.1, m(x) = φ(x)^T α where α = (WM)⁺ W f̃ = A f̃ for A defined by (3.6). Let ℓ(x) = [ℓ_0(x), …, ℓ_p(x)]^T. Then by Lemma 3.3,

    m(x) = φ(x)^T A f̃ = f̃^T ℓ(x) = Σ_{i=0}^p f̃_i ℓ_i(x). □

3.3.2 Error Analysis

For the remainder of this chapter, let Ŷ = {ŷ^0, …, ŷ^p} denote the shifted and scaled sample set, where ŷ^i = (y^i − y^0)/R and R = max_i ‖y^i − y^0‖. Note that ŷ^0 = 0 and max_i ‖ŷ^i‖ = 1. Any analysis of Y can be directly related to Ŷ by the following lemma:

Lemma 3.5 Define the basis φ̂ = {φ̂_0(x), …, φ̂_q(x)}, where φ̂_i(x) = φ_i(Rx + y^0), i = 0, …, q, and φ is the monomial basis. Let {ℓ_0(x), …, ℓ_p(x)} be weighted regression Lagrange polynomials for Y and {ℓ̂_0(x), …, ℓ̂_p(x)} be weighted regression Lagrange polynomials for Ŷ. Then M(φ, Y) = M(φ̂, Ŷ). If Y is poised, then

    ℓ(x) = ℓ̂((x − y^0)/R).

Proof: Observe that

    M(φ, Y) =
        [ φ_0(y^0)  ⋯  φ_q(y^0) ]        [ φ̂_0(ŷ^0)  ⋯  φ̂_q(ŷ^0) ]
        [ φ_0(y^1)  ⋯  φ_q(y^1) ]   =    [ φ̂_0(ŷ^1)  ⋯  φ̂_q(ŷ^1) ]   =  M(φ̂, Ŷ).
        [    ⋮       ⋱      ⋮    ]        [    ⋮        ⋱      ⋮     ]
        [ φ_0(y^p)  ⋯  φ_q(y^p) ]        [ φ̂_0(ŷ^p)  ⋯  φ̂_q(ŷ^p) ]
By the definition of poisedness, Y is poised if and only if Ŷ is poised. Let φ(x) = (φ_0(x), …, φ_q(x))^T and φ̂(x) = (φ̂_0(x), …, φ̂_q(x))^T. Then

    φ̂((x − y^0)/R) = (φ_0(x), …, φ_q(x))^T = φ(x).

By Lemma 3.3, if Y is poised, then

    ℓ(x)^T = φ(x)^T (W M(φ, Y))^+ W = φ̂((x − y^0)/R)^T (W M(φ̂, Ŷ))^+ W = ℓ̂((x − y^0)/R)^T. □

Let f̄_i be defined by (3.1) and let Ω be an open convex set containing Y. If f is C² on Ω, then by Taylor's theorem, for each sample point y^i ∈ Y and a fixed x ∈ conv(Y), there exists a point η^i(x) on the line segment connecting x to y^i such that

    f̄_i = f(y^i) + ε_i
        = f(x) + ∇f(x)^T (y^i − x) + (1/2)(y^i − x)^T ∇²f(η^i(x)) (y^i − x) + ε_i
        = f(x) + ∇f(x)^T (y^i − x) + (1/2)(y^i − x)^T ∇²f(x) (y^i − x)
          + (1/2)(y^i − x)^T H^i(x) (y^i − x) + ε_i,   (3.7)

where H^i(x) = ∇²f(η^i(x)) − ∇²f(x).

Let {ℓ_i(x)} denote the weighted regression Lagrange polynomials associated with Y. The following lemma and proof are inspired by [7, Theorem 1]:

Lemma 3.6  Let f be twice continuously differentiable on Ω and let m(x) denote the quadratic function determined by weighted regression. Then, for any x ∈ Ω, the following identities hold:

    m(x) = f(x) + (1/2) Σ_{i=0}^p [(y^i − x)^T H^i(x) (y^i − x)] ℓ_i(x) + Σ_{i=0}^p ε_i ℓ_i(x),
    ∇m(x) = ∇f(x) + (1/2) Σ_{i=0}^p [(y^i − x)^T H^i(x) (y^i − x)] ∇ℓ_i(x) + Σ_{i=0}^p ε_i ∇ℓ_i(x),

    ∇²m(x) = ∇²f(x) + (1/2) Σ_{i=0}^p [(y^i − x)^T H^i(x) (y^i − x)] ∇²ℓ_i(x) + Σ_{i=0}^p ε_i ∇²ℓ_i(x),

where H^i(x) = ∇²f(η^i(x)) − ∇²f(x) for some point η^i(x) = θx + (1 − θ)y^i, 0 ≤ θ ≤ 1, on the line segment connecting x to y^i.

Proof: Let D denote the differential operator as defined in [7], where D^j is the j-th derivative of a function in C^i (i ≥ j). In particular, D^0 f(x) = f(x), D^1 f(x)(z) = ∇f(x)^T z, and D^2 f(x)(z)² = z^T ∇²f(x) z. By Lemma 3.4, m(x) = Σ_{i=0}^p f̄_i ℓ_i(x); so for h = 0, 1, or 2,

    D^h m(x) = Σ_{i=0}^p f̄_i D^h ℓ_i(x).

Substituting (3.7) for f̄_i in the above equation yields

    D^h m(x) = Σ_{j=0}^2 (1/j!) Σ_{i=0}^p D^j f(x)(y^i − x)^j D^h ℓ_i(x)
               + (1/2) Σ_{i=0}^p [(y^i − x)^T H^i(x) (y^i − x)] D^h ℓ_i(x)
               + Σ_{i=0}^p ε_i D^h ℓ_i(x),   (3.8)

where H^i(x) = ∇²f(η^i(x)) − ∇²f(x) for some point η^i(x) on the line segment connecting x to y^i. Consider the first term on the right hand side above. We shall show that

    (1/j!) Σ_{i=0}^p D^j f(x)(y^i − x)^j D^h ℓ_i(x) = D^h f(x) for j = h, and 0 for j ≠ h,   (3.9)

for j = 0, 1, 2, …. Let B_j = D^j f(x), and let g_j : R^n → R be the polynomial defined by g_j(z) = (1/j!) B_j (z − x)^j. Observe that D^j g_j(x) = B_j and D^h g_j(x) = 0 for h ≠ j. Since g_j has degree j ≤ 2, the weighted least squares approximation of g_j by a quadratic polynomial is g_j itself. Thus, by Lemma 3.4 and the definition of g_j,

    g_j(z) = Σ_{i=0}^p g_j(y^i) ℓ_i(z) = (1/j!) Σ_{i=0}^p B_j (y^i − x)^j ℓ_i(z).   (3.10)
Applying the differential operator D^h with respect to z yields

    D^h g_j(z) = (1/j!) Σ_{i=0}^p B_j (y^i − x)^j D^h ℓ_i(z) = (1/j!) Σ_{i=0}^p D^j f(x)(y^i − x)^j D^h ℓ_i(z).

Letting z = x, the expression on the right is identical to the left side of (3.9). This proves (3.9), since D^h g_j(x) = 0 for j ≠ h and D^j g_j(x) = B_j for j = h. By (3.9), (3.8) reduces to

    D^h m(x) = D^h f(x) + (1/2) Σ_{i=0}^p [(y^i − x)^T H^i(x) (y^i − x)] D^h ℓ_i(x) + Σ_{i=0}^p ε_i D^h ℓ_i(x).

Applying this with h = 0, 1, 2 proves the lemma. □

Since ‖H^i(x)‖ ≤ L‖y^i − x‖ by the Lipschitz continuity of ∇²f, the following is a direct consequence of Lemma 3.6.

Corollary 3.7  Let f satisfy Assumption 2.1 for some convex set Ω, and let m(x) denote the quadratic function determined by weighted regression. Then, for any x ∈ Ω, the following error bounds hold:

    |f(x) − m(x)| ≤ Σ_{i=0}^p ((L/2)‖y^i − x‖³ + |ε_i|) |ℓ_i(x)|,
    ‖∇f(x) − ∇m(x)‖ ≤ Σ_{i=0}^p ((L/2)‖y^i − x‖³ + |ε_i|) ‖∇ℓ_i(x)‖,
    ‖∇²f(x) − ∇²m(x)‖ ≤ Σ_{i=0}^p ((L/2)‖y^i − x‖³ + |ε_i|) ‖∇²ℓ_i(x)‖.

Using this corollary, the following result provides error bounds between the function and the model in terms of the sample set radius.

Corollary 3.8  Let Y be poised, and let R = max_i ‖y^i − y^0‖. Suppose |ε_i| ≤ ε̄ for i = 0, …, p. If f satisfies Assumption 2.1 with Lipschitz constant L, then there exist constants κ_1, κ_2, and κ_3, independent of R, such that for all x ∈ B(y^0; R),

    |f(x) − m(x)| ≤ κ_1 √p_1 (LR³ + ε̄),
    ‖∇f(x) − ∇m(x)‖ ≤ κ_2 √p_1 (LR² + ε̄/R),
    ‖∇²f(x) − ∇²m(x)‖ ≤ κ_3 √p_1 (LR + ε̄/R²).

Proof: Let {ℓ̂_0(x), …, ℓ̂_p(x)} be the Lagrange polynomials generated by the shifted and scaled set Ŷ, and let {ℓ_0(x), …, ℓ_p(x)} be the Lagrange polynomials generated by the set Y. By Lemma 3.5, for each x ∈ B(y^0; R), ℓ_i(x) = ℓ̂_i(x̂) for all i, where x̂ = (x − y^0)/R. Thus, ∇ℓ_i(x) = ∇ℓ̂_i(x̂)/R and ∇²ℓ_i(x) = ∇²ℓ̂_i(x̂)/R².

Let ℓ̂(x) = [ℓ̂_0(x), …, ℓ̂_p(x)]^T, ĝ(x) = [∇ℓ̂_0(x), …, ∇ℓ̂_p(x)]^T, and ĥ(x) = [∇²ℓ̂_0(x), …, ∇²ℓ̂_p(x)]^T. By Corollary 3.7,

    |f(x) − m(x)| ≤ Σ_{i=0}^p ((L/2)‖y^i − x‖³ + |ε_i|) |ℓ_i(x)|
                  ≤ Σ_{i=0}^p (4LR³ + ε̄) |ℓ_i(x)|   (since ‖y^i − x‖ ≤ 2R and |ε_i| ≤ ε̄)
                  = (4LR³ + ε̄) ‖ℓ̂(x̂)‖_1
                  ≤ √p_1 (4LR³ + ε̄) ‖ℓ̂(x̂)‖,

since ‖z‖_1 ≤ √n ‖z‖_2 for z ∈ R^n. Similarly,

    ‖∇f(x) − ∇m(x)‖ ≤ √p_1 (4LR² + ε̄/R) ‖ĝ(x̂)‖  and  ‖∇²f(x) − ∇²m(x)‖ ≤ √p_1 (4LR + ε̄/R²) ‖ĥ(x̂)‖.

Setting κ_1 = max_{x∈B(0;1)} ‖ℓ̂(x)‖, κ_2 = max_{x∈B(0;1)} ‖ĝ(x)‖, and κ_3 = max_{x∈B(0;1)} ‖ĥ(x)‖ yields the desired result (after absorbing the factor 4 into the constants). □

Note the similarity between these error bounds and those in the definition of fully quadratic models. If there is no computational error, or if the error is O(Δ³), fully quadratic models (for some fixed κ) can be obtained by controlling the geometry of the sample set so that κ_i √p_1, i = 1, 2, 3, are bounded by fixed constants, and by controlling the trust region radius so that R/Δ is bounded. This motivates the definitions of Λ-poised and strongly Λ-poised in the weighted regression sense in the next section.
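The shift-and-scale transformation used throughout this error analysis is simple but worth pinning down. A small pure-Python sketch (the sample points are illustrative, not from the thesis):

```python
import math

def shift_and_scale(Y):
    # Ŷ = {(y^i - y^0)/R}, with R = max_i ||y^i - y^0||,
    # so that ŷ^0 = 0 and max_i ||ŷ^i|| = 1.
    y0 = Y[0]
    R = max(math.dist(y, y0) for y in Y)
    Yhat = [[(yi - y0i) / R for yi, y0i in zip(y, y0)] for y in Y]
    return Yhat, R

Y = [[1.0, 2.0], [1.5, 2.0], [1.0, 4.0], [0.0, 2.0]]
Yhat, R = shift_and_scale(Y)
print(R)                                  # 2.0 (distance from [1, 2] to [1, 4])
print(Yhat[0])                            # [0.0, 0.0]
print(max(math.hypot(*y) for y in Yhat))  # 1.0
```

Lemma 3.5 guarantees that the Lagrange polynomials of Y are recovered from those of Ŷ by the substitution x ↦ (x − y^0)/R, which is why the bounds above can be stated on the unit ball and then rescaled.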
3.3.3 Λ-poisedness in the Weighted Regression Sense

In this section, we restrict our attention to the monomial basis φ defined in (3.3). In order to produce accurate model functions, the points in the sample set need to be distributed in such a way that the matrix M = M(φ, Y) is sufficiently well conditioned. This is the motivation behind the following definitions of Λ-poised and strongly Λ-poised sets. These definitions are identical to [12, Definitions 4.7, 4.10], except that the Lagrange polynomials in the definitions are weighted regression Lagrange polynomials.

Definition 3.9  Let Λ > 0 and let B be a set in R^n. Let w = (w_0, …, w_p) be a vector of positive weights, Y = {y^0, …, y^p} be a poised set, and let {ℓ_0, …, ℓ_p} be the associated weighted regression Lagrange polynomials. Let ℓ(x) = (ℓ_0(x), …, ℓ_p(x))^T and q_1 = |P_n^2|.

- Y is said to be Λ-poised in B (in the weighted regression sense) if and only if
      Λ ≥ max_{x∈B} max_{0≤i≤p} |ℓ_i(x)|.
- Y is said to be strongly Λ-poised in B (in the weighted regression sense) if and only if
      (q_1/√p_1) Λ ≥ max_{x∈B} ‖ℓ(x)‖.

Note that if the weights are all equal, the above definitions are equivalent to those for Λ-poised and strongly Λ-poised given in [12].

We are naturally interested in using these weighted regression Lagrange polynomials to form models that are guaranteed to sufficiently approximate f. Let Y_k, Δ_k, and R_k denote the sample set, trust region radius, and sample set radius at iteration k, as defined at the beginning of §3.3.2. Assume that R_k/Δ_k is bounded. If the number of sample points is bounded, it can be shown, using Corollary 3.8, that if Y_k is Λ-poised for all k, then the corresponding model functions are κ-fully quadratic, assuming no
computationalerror,orthatthecomputationalerroris O 3 .Whenthenumberof samplepointsisnotbounded,poisednessisnotenough.Inthefollowing,weshow thatif Y k is strongly poisedforall k ,thenthecorrespondingmodelsare fully quadratic,regardlessofthenumberofpointsin Y k . Lemma3.10 Let ^ M = M ; ^ Y .If W ^ M T W + p q 1 =p 1 ,then ^ Y isstrongly poisedin B ;1 intheweightedregressionsense,withrespecttotheweights w . Conversely,if ^ Y isstrongly poisedin B ;1 intheweightedregressionsense,then W ^ M T W + q 1 p p 1 ; where > 0 isaxedconstantdependentonlyon n butindependentof Y and . Proof: Let A = W ^ M + W and ` x = ` 0 x ;:::;` p x T .ByLemma3.3, ` x = A T x .Itfollowsthatforany x 2 B ;1, k ` x k = A T x k A k x p q 1 =p 1 )]TJ/F22 11.9552 Tf 5.48 1.651 Td [(p q 1 x 1 q 1 = p p 1 : Forthelastinequality,weusedthefactthatmax x 2 B ;1 x 1 1. Toprovetheconverse,let U V T = A T bethereducedsingularvaluedecompositionof A T .Notethat U and V are p 1 q 1 and q 1 q 1 matrices,respectively, withorthonormalcolumns;isa q 1 q 1 diagonalmatrix,whosediagonalentries arethesingularvaluesof A T .Let 1 bethelargestsingularvaluewith v 1 thecorrespondingcolumnof V .Asshownintheproofof[10,Theorem2.9],thereexists aconstant > 0suchthatforanyunitvector v ,thereexistsan x 2 B ;1such that v T x .Therefore,since k v 1 k =1,thereisan x 2 B ;1suchthat v 1 T x .Let v ? betheorthogonalprojectionof x ontothesubspace orthogonalto v 1 ;so x = )]TJ/F15 11.9552 Tf 5.48 9.684 Td [( v 1 T x v 1 + v ? .Notethat V T v 1 and V T v ? are orthogonalvectors.Notealsothatforanyvector z , U V T z = V T z since U 36
has orthonormal columns. It follows that

    ‖ℓ(x)‖ = ‖A^T φ(x)‖ = ‖ΣV^T φ(x)‖
           = (‖ΣV^T v_⊥‖² + ‖ΣV^T (v_1^T φ(x)) v_1‖²)^{1/2}
           ≥ |v_1^T φ(x)| ‖ΣV^T v_1‖ ≥ γ ‖Σe_1‖ = γ σ_1 = γ ‖A‖.

Thus, ‖A‖ ≤ (1/γ) max_{x∈B(0;1)} ‖ℓ(x)‖ ≤ (1/γ) Λ q_1/√p_1, which proves the result with ν = 1/γ. □

We can now prove that models generated by weighted regression Lagrange polynomials are κ-fully quadratic.

Proposition 3.11  Let f satisfy Assumption 2.1 and let Λ > 0 be fixed. There exists a vector κ = (κ_ef, κ_eg, κ_eh, 0) such that for any y^0 ∈ S and Δ ≤ Δ_max, if

1. Y = {y^0, …, y^p} ⊂ B(y^0; Δ) is strongly Λ-poised in B(y^0; Δ) in the weighted regression sense with respect to positive weights w = {w_0, …, w_p}, and
2. the computational error |ε_i| is bounded by C̄Δ³, where C̄ is a fixed constant,

then the corresponding model function m is κ-fully quadratic.

Proof: Let x̂, ℓ̂(·), ĝ(·), ĥ(·), κ_1, κ_2, and κ_3 be as defined in the proof of Corollary 3.8. Let M̂ = M(φ, Ŷ) and W = diag(w). By Lemma 3.3, ℓ̂(x) = A^T φ(x), where A = (WM̂)^+ W. By Lemma 3.10, ‖A‖ ≤ ν Λ q_1/√p_1, where ν is a fixed constant. It follows that

    κ_1 = max_{x∈B(0;1)} ‖ℓ̂(x)‖ ≤ max_{x∈B(0;1)} ‖A‖ ‖φ(x)‖ ≤ c_1 ν Λ q_1/√p_1,

where c_1 = max_{x∈B(0;1)} ‖φ(x)‖ is a constant independent of Y. Similarly,

    κ_2 = max_{x∈B(0;1)} ‖ĝ(x)‖ = max_{x∈B(0;1)} ‖(∇ℓ̂_0(x), …, ∇ℓ̂_p(x))‖ = max_{x∈B(0;1)} ‖A^T ∇φ(x)‖_F
        ≤ √q_1 max_{x∈B(0;1)} ‖A^T ∇φ(x)‖ ≤ √q_1 max_{x∈B(0;1)} ‖A‖ ‖∇φ(x)‖ ≤ c_2 ν Λ q_1^{3/2}/√p_1,
where c_2 = max_{x∈B(0;1)} ‖∇φ(x)‖ is independent of Y.

To bound κ_3, let J(s, t) denote the unique index j such that x_s and x_t both appear in the quadratic monomial φ_j(x). For example, J(1, 1) = n + 2, J(1, 2) = J(2, 1) = n + 3, etc. Observe that

    (∇²φ_j(x))_{s,t} = 1 if j = J(s, t), and 0 otherwise.

It follows that

    ∇²ℓ̂_i(x) = Σ_{j=0}^q (A^T)_{i,j} ∇²φ_j(x) = [(A^T)_{i,J(s,t)}]_{s,t=1}^n.

We conclude that ‖∇²ℓ̂_i(x)‖ ≤ ‖∇²ℓ̂_i(x)‖_F ≤ √2 ‖(A^T)_{i,·}‖. Thus,

    κ_3 = max_{x∈B(0;1)} ‖ĥ(x)‖ = max_{x∈B(0;1)} ‖(∇²ℓ̂_0(x), …, ∇²ℓ̂_p(x))‖
        ≤ √(2 Σ_{i=0}^p ‖(A^T)_{i,·}‖²) = √2 ‖A‖_F ≤ √2 √q_1 ‖A‖ ≤ √2 ν Λ q_1^{3/2}/√p_1.

By assumption, the computational error |ε_i| is bounded by ε̄ = C̄Δ³. So, by Corollary 3.8, for all x ∈ B(y^0; Δ),

    |f(x) − m(x)| ≤ κ_1 √p_1 (L + C̄)Δ³ ≤ c_1 ν Λ q_1 (L + C̄)Δ³ = κ_ef Δ³,
    ‖∇f(x) − ∇m(x)‖ ≤ κ_2 √p_1 (L + C̄)Δ² ≤ c_2 ν Λ q_1^{3/2} (L + C̄)Δ² = κ_eg Δ²,
    ‖∇²f(x) − ∇²m(x)‖ ≤ κ_3 √p_1 (L + C̄)Δ ≤ √2 ν Λ q_1^{3/2} (L + C̄)Δ = κ_eh Δ,

where

    κ_ef = c_1 ν Λ q_1 (L + C̄),  κ_eg = c_2 ν Λ q_1^{3/2} (L + C̄),  κ_eh = √2 ν Λ q_1^{3/2} (L + C̄).

Thus, m(x) is (κ_ef, κ_eg, κ_eh, 0)-fully quadratic, and since these constants are independent of y^0 and Δ, the result is proven. □

The final step in establishing that we have a fully quadratic class of models is to define an algorithm that produces a strongly Λ-poised sample set in a finite number of steps.
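For reference, the monomial basis φ of P_n^2 and the count q_1 = (n+1)(n+2)/2 can be generated programmatically. A small sketch (the ordering of the quadratic terms is one consistent choice, matching the indexing J(1, 1) = n + 2, J(1, 2) = n + 3 used in the proof above):

```python
def monomial_basis(n):
    # Natural basis of P_n^2: 1, x_1, ..., x_n, x_1^2/2, x_1 x_2, ..., x_n^2/2,
    # with q1 = |P_n^2| = (n+1)(n+2)/2 components.
    funcs = [lambda x: 1.0]
    for i in range(n):
        funcs.append(lambda x, i=i: x[i])        # linear terms
    for i in range(n):
        funcs.append(lambda x, i=i: 0.5 * x[i] * x[i])   # x_i^2 / 2
        for j in range(i + 1, n):
            funcs.append(lambda x, i=i, j=j: x[i] * x[j])  # cross terms
    return funcs

n = 3
phi = monomial_basis(n)
print(len(phi) == (n + 1) * (n + 2) // 2)      # True (q1 = 10)
print([f([1.0, 2.0, 3.0]) for f in phi[:4]])   # [1.0, 1.0, 2.0, 3.0]
```

Note also that on the unit ball every component satisfies |φ_i(x)| ≤ 1, the fact used in the proof of Lemma 3.10.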
Proposition 3.12  Let Ŷ = {y^0, …, y^p} ⊂ R^n be a set of points in the unit ball B(0; 1) such that ‖y^j‖ = 1 for at least one j. Let w = (w_0, …, w_p)^T be a vector of positive weights. If Ŷ is strongly Λ-poised in B(0; 1) in the sense of unweighted regression, then there exists a constant β > 0, independent of Ŷ, Λ, and w, such that Ŷ is strongly (β cond(W) Λ)-poised in the weighted regression sense.

Proof: Let M̂ = M(φ, Ŷ), where φ is the monomial basis. By Lemma 3.10 applied with unit weights, ‖M̂^+‖ ≤ ν Λ q_1/√p_1, where ν is a constant independent of Ŷ and Λ. Thus,

    ‖W(M̂^T W)^+‖ ≤ ‖W‖ ‖(M̂^T W)^+‖ ≤ cond(W) ‖M̂^+‖ ≤ ν cond(W) Λ q_1/√p_1,

where the second inequality results from

    ‖(M̂^T W)^+‖ = 1/σ_min(M̂^T W) ≤ 1/(σ_min(M̂^T) σ_min(W)) = ‖M̂^+‖ ‖W^{−1}‖.

Applying the first part of Lemma 3.10, the result follows with β = ν√q_1. □

The significance of this proposition is that any model improvement algorithm for unweighted regression can be used for weighted regression to ensure the same global convergence properties, provided cond(W) is bounded. For the model improvement algorithm described in the following section, this requirement is satisfied by bounding the weights away from zero while keeping the largest weight equal to 1.

In practice, we need not ensure Λ-poisedness of Y_k at every iterate to guarantee that the algorithm converges to a second order minimum. Rather, Λ-poisedness only needs to be enforced as the algorithm stagnates.

3.4 Model Improvement Algorithm

This section describes a model improvement algorithm (MIA) for regression which, by the preceding section, can also be used for weighted regression to ensure that the sample sets are strongly Λ-poised for some fixed Λ which is not necessarily
known. The algorithm is based on the following observation, which is a straightforward extension of [12, Theorem 4.11].

The MIA presented in [12] makes assumptions (such as requiring all points to lie within B(y^0; Δ)) to simplify the theory. We resist such assumptions to account for practical concerns (points which lie outside of B(y^0; Δ)) that arise in the algorithm.

Proposition 3.13  If the shifted and scaled sample set Ŷ of p_1 points contains l = ⌊p_1/q_1⌋ disjoint subsets of q_1 points, each of which is Λ-poised in B(0; 1) in the interpolation sense, then Ŷ is strongly √((l+1)/l)·Λ-poised in B(0; 1) in the regression sense.

Proof: Let Y_j = {y^0_j, y^1_j, …, y^q_j}, j = 1, …, l, be the disjoint Λ-poised subsets of Ŷ, and let Y_r be the remaining points in Ŷ. Let λ^j_i(·), i = 0, …, q, be the interpolation Lagrange polynomials for the set Y_j. As noted in [12], for any x ∈ R^n,

    Σ_{i=0}^q λ^j_i(x) φ(y^i_j) = φ(x),  j = 1, …, l.

Dividing each of these equations by l and summing yields

    Σ_{j=1}^l Σ_{i=0}^q (1/l) λ^j_i(x) φ(y^i_j) = φ(x).   (3.11)

Let λ^j(x) = (λ^j_0(x), …, λ^j_q(x))^T, and let θ ∈ R^{p_1} be formed by concatenating the λ^j(x), j = 1, …, l, together with a zero vector of length |Y_r|, and then dividing every entry by l. By (3.11), θ is a solution to the equation

    Σ_{i=0}^p θ_i φ(y^i) = φ(x).   (3.12)

Since Y_j is Λ-poised in B(0; 1), for any x ∈ B(0; 1),

    ‖λ^j(x)‖ ≤ √q_1 ‖λ^j(x)‖_∞ ≤ √q_1 Λ.

Thus, since l = ⌊p_1/q_1⌋ implies 1/(l+1) ≤ q_1/p_1,

    ‖θ‖ ≤ (√l/l) max_j ‖λ^j(x)‖ ≤ Λ √(q_1/l) ≤ Λ √((l+1)/l) √(q_1²/p_1) = √((l+1)/l) Λ q_1/√p_1.
Let ℓ_i(x), i = 0, …, p, be the regression Lagrange polynomials for the complete set Ŷ. As observed in [12], ℓ(x) = (ℓ_0(x), …, ℓ_p(x))^T is the minimum norm solution to (3.12). Thus,

    ‖ℓ(x)‖ ≤ ‖θ‖ ≤ √((l+1)/l) Λ q_1/√p_1.

Since this holds for all x ∈ B(0; 1), Ŷ is strongly √((l+1)/l)·Λ-poised in B(0; 1). □

Based on this observation, and noting that (l+1)/l ≤ 2 for l ≥ 1, we adopt the following strategy for improving a shifted and scaled regression sample set Ŷ ⊂ B(0; 1):

1. If Ŷ contains l ≥ 1 Λ-poised subsets with at most q_1 points left over, then Ŷ is strongly √2·Λ-poised.
2. Otherwise, if Ŷ contains at least one Λ-poised subset, save as many Λ-poised subsets as possible, plus at most q_1 additional points from Ŷ, discarding the rest.
3. Otherwise, add additional points to Ŷ in order to create a Λ-poised subset. Keep this subset, plus at most q_1 additional points from Ŷ.

To implement this strategy, we first describe an algorithm that attempts to find a Λ-poised subset of Ŷ. To discuss the algorithm, we introduce the following definition:

Definition 3.14  A set Y ⊂ B is said to be Λ-subpoised in a set B if there exists a superset Z ⊇ Y that is Λ-poised in B with |Z| = q_1.

Given a sample set Y ⊂ B(0; 1) (not necessarily shifted and scaled) and a radius Δ̃, the algorithm below selects a Λ-subpoised subset Y_new ⊂ Y containing as many points as possible. If |Y_new| = q_1, then Y_new is Λ-poised in B(0; Δ̃) for some fixed Λ. Otherwise, the algorithm determines a new point y_new ∈ B(0; Δ̃) such that Y_new ∪ {y_new} is Λ-subpoised in B(0; Δ̃).
Algorithm FindSet (Finds a Λ-subpoised set)

Input: A sample set Y ⊂ B(0; 1) and a trust region radius Δ̃ ∈ [√ξ_acc, 1], for fixed parameter ξ_acc > 0.

Output: A set Y_new ⊆ Y that is Λ-poised in B(0; Δ̃); or a Λ-subpoised set Y_new ⊂ B(0; Δ̃) together with a new point y_new ∈ B(0; Δ̃) such that Y_new ∪ {y_new} is Λ-subpoised in B(0; Δ̃).

Step 0 (Initialization): Initialize the pivot polynomial basis to the monomial basis: u_i(x) = φ_i(x), i = 0, …, q. Set Y_new = ∅. Set i = 0.

Step 1 (Point Selection): If possible, choose j_i ∈ {i, …, |Y| − 1} such that |u_i(y^{j_i})| ≥ ξ_acc (the threshold test). If such an index is found, add the corresponding point to Y_new and swap the positions of points y^i and y^{j_i} in Y. Otherwise, compute y_new = argmax_{x∈B(0;Δ̃)} |u_i(x)|, and exit, returning Y_new and y_new.

Step 2 (Gaussian Elimination): For j = i + 1, …, |Y| − 1, set

    u_j(x) = u_j(x) − (u_j(y^i)/u_i(y^i)) u_i(x).

If i < q, set i = i + 1 and go to Step 1.
When the algorithm returns a set Y_new of q_1 points, the threshold test in the Gaussian elimination ensures that the matrix M̃ = M(φ, Y_new) satisfies

    ‖M̃^{−1}‖ ≤ √q_1 γ_growth/ξ_acc,   (3.13)

where γ_growth is the growth factor for the factorization (see [27]).

Point Selection. The point selection rule allows flexibility in how an acceptable point is chosen. For example, to keep the growth factor down, one could choose the index j_i that maximizes |u_i(y^j)|, which corresponds to Gaussian elimination with partial pivoting. But in practice, it is often better to select points according to their proximity to the current iterate. In our implementation, we balance these two criteria by choosing the index that maximizes |u_i(y^j)|/d_j³ over j ≥ i, where d_j = max{1, ‖y^j‖/Δ̃}. If all sample points are contained in B(0; Δ̃), then d_j = 1 for all j; in this case, the point selection rule is identical to the one used in Algorithm 6.6 of [12], with the addition of the threshold test. When Y contains points outside B(0; Δ̃), the corresponding values of d_j are greater than 1, so the point selection rule gives preference to points that are within B(0; Δ̃).

The theoretical justification for our revised point selection rule comes from examining the error bounds in Corollary 3.7. For a given point x in B(0; Δ̃), each sample point y^i makes a contribution to the error bound that is proportional to ‖y^i − x‖³ (assuming the computational error is relatively small). Since x can be anywhere in the trust region, this suggests modifying the point selection rule to maximize |u_i(y^{j_i})|/d̂³_{j_i}, where d̂_j = max_{x∈B(0;Δ̃)} ‖y^j − x‖/Δ̃ = ‖y^j‖/Δ̃ + 1. To simplify analysis, we modify this formula so that all points inside the trust region are treated equally, resulting in the formula d_j = max{1, ‖y^j‖/Δ̃}.

Lemma 3.15  Suppose Algorithm FindSet returns a set Y_new with |Y_new| = q_1. Then Y_new is Λ-poised in B(0; Δ̃) for some Λ which is proportional to (γ_growth/ξ_acc) max{1, Δ̃²/2, Δ̃}, where γ_growth is the growth factor for the Gaussian elimination.
Proof: Let M̃ = M(φ, Y_new). By (3.13), ‖M̃^{−1}‖ ≤ √q_1 γ_growth/ξ_acc. Let ℓ(x) = (ℓ_0(x), …, ℓ_q(x))^T be the vector of interpolation Lagrange polynomials for the sample set Y_new. For any x ∈ B(0; Δ̃),

    ‖ℓ(x)‖_∞ = ‖M̃^{−T} φ(x)‖_∞ ≤ ‖M̃^{−1}‖_1 ‖φ(x)‖_∞ ≤ √q_1 ‖M̃^{−1}‖ ‖φ(x)‖_∞
             ≤ (q_1 γ_growth/ξ_acc) ‖φ(x)‖_∞ ≤ (q_1 γ_growth/ξ_acc) max{1, Δ̃²/2, Δ̃}.

Since this inequality holds for all x ∈ B(0; Δ̃), Y_new is Λ-poised for Λ = (q_1 γ_growth/ξ_acc) max{1, Δ̃²/2, Δ̃}, which establishes the result. □

In general, the growth factor in the above lemma depends on the matrix M̃ and the threshold ξ_acc. Because of the threshold test, it is possible to establish a bound on the growth factor that is independent of M̃, so we can claim that the algorithm selects a Λ-poised set for a fixed Λ that is independent of Y. However, that bound is extremely large, and so is not very useful. Nevertheless, in practice γ_growth is quite reasonable, so Λ tends to be proportional to max{1, Δ̃²/2, Δ̃}/ξ_acc.

In the case where the threshold test is not satisfied, Algorithm FindSet determines a new point y_new by maximizing |u_i(x)| over B(0; Δ̃). In this case, we need to show that the new point will satisfy the threshold test. The following lemma shows that this is possible, provided ξ_acc is small enough. The proof is modeled after the proof of [12, Lemma 6.7].

Lemma 3.16  Let v^T φ(x) be a quadratic polynomial of degree at most 2, where ‖v‖_∞ = 1. Then

    max_{x∈B(0;Δ̃)} |v^T φ(x)| ≥ min{1, Δ̃²/4}.

Proof: Since ‖v‖_∞ = 1, at least one of the coefficients of q(x) = v^T φ(x) is 1, −1, 1/2, or −1/2. We consider the case where this largest coefficient is 1 or 1/2 (the cases −1 and −1/2 are proven similarly); it corresponds either to the constant term, a linear term x_i, or a quadratic term x_i²/2 or x_i x_j. Restrict all variables not appearing in the term corresponding to the largest coefficient to zero.
If q(x) = 1, then the lemma trivially holds.

If q(x) = x_i²/2 + ax_i + b, let x̃ = Δ̃e_i ∈ B(0; Δ̃). Then

    q(x̃) = Δ̃²/2 + Δ̃a + b,  q(−x̃) = Δ̃²/2 − Δ̃a + b,  and q(0) = b.

If |q(−x̃)| ≥ Δ̃²/4 or |q(x̃)| ≥ Δ̃²/4, the result is shown. Otherwise, |q(x̃) + q(−x̃)| = |Δ̃² + 2b| < Δ̃²/2 forces b < −Δ̃²/4, so |q(0)| = |b| > Δ̃²/4.

If q(x) = ax_i²/2 + x_i + b, then let x̃ = Δ̃e_i, yielding q(x̃) = Δ̃ + aΔ̃²/2 + b and q(−x̃) = −Δ̃ + aΔ̃²/2 + b. Then

    max{|q(−x̃)|, |q(x̃)|} = max{|−Δ̃ + μ|, |Δ̃ + μ|} ≥ Δ̃ ≥ min{1, Δ̃²/4},

where μ = aΔ̃²/2 + b (the bound holds regardless of μ).

If q(x) = ax_i²/2 + bx_j²/2 + x_i x_j + cx_i + dx_j + e, we consider four points on B(0; Δ̃):

    y_1 = √(Δ̃²/2)(e_i + e_j),  y_2 = √(Δ̃²/2)(e_i − e_j),  y_3 = √(Δ̃²/2)(−e_i + e_j),  y_4 = −√(Δ̃²/2)(e_i + e_j),

for which

    q(y_1) = aΔ̃²/4 + bΔ̃²/4 + Δ̃²/2 + c√(Δ̃²/2) + d√(Δ̃²/2) + e,
    q(y_2) = aΔ̃²/4 + bΔ̃²/4 − Δ̃²/2 + c√(Δ̃²/2) − d√(Δ̃²/2) + e,
    q(y_3) = aΔ̃²/4 + bΔ̃²/4 − Δ̃²/2 − c√(Δ̃²/2) + d√(Δ̃²/2) + e,
    q(y_4) = aΔ̃²/4 + bΔ̃²/4 + Δ̃²/2 − c√(Δ̃²/2) − d√(Δ̃²/2) + e.

Note that q(y_1) − q(y_2) = Δ̃² + √2 dΔ̃ and q(y_3) − q(y_4) = −Δ̃² + √2 dΔ̃. There are two cases:
1. If d ≥ 0, then q(y_1) − q(y_2) ≥ Δ̃², so either |q(y_1)| ≥ Δ̃²/2 ≥ min{1, Δ̃²/4} or |q(y_2)| ≥ Δ̃²/2 ≥ min{1, Δ̃²/4}.
2. If d < 0, then a similar study of q(y_3) − q(y_4) proves the result. □

Lemma 3.17  Suppose that in Algorithm FindSet, ξ_acc ≤ min{1, Δ̃²/4}. If Algorithm FindSet exits during the point selection step, then Y_new ∪ {y_new} is Λ-subpoised in B(0; Δ̃) for some fixed Λ, which is proportional to (γ_growth/ξ_acc) max{1, Δ̃²/2, Δ̃}, where γ_growth is the growth parameter for the Gaussian elimination.

Proof: Consider a modified version of Algorithm FindSet that does not exit in the point selection step. Instead, y^i is replaced by y_new, and y_new is added to Y_new. This modified algorithm always returns a set consisting of q_1 points; call this set Z. Let Y_new and y_new be the output of the unmodified algorithm, and observe that Y_new ∪ {y_new} ⊆ Z.

To show that Y_new ∪ {y_new} is Λ-subpoised, we show that Z is Λ-poised in B(0; Δ̃). From the Gaussian elimination, after k iterations of the algorithm, the (k+1)-st pivot polynomial u_k(x) can be expressed as (v^k)^T φ(x) for some v^k = (v_0, …, v_{k−1}, 1, 0, …, 0)^T; that is, the v_i are the coefficients of the basis expansion of the polynomial u_k. Observe that ‖v^k‖_∞ ≥ 1, and let ṽ = v^k/‖v^k‖_∞. By Lemma 3.16,

    max_{x∈B(0;Δ̃)} |u_k(x)| = max_{x∈B(0;Δ̃)} |(v^k)^T φ(x)| = ‖v^k‖_∞ max_{x∈B(0;Δ̃)} |ṽ^T φ(x)|
                             ≥ min{1, Δ̃²/4} ‖v^k‖_∞ ≥ min{1, Δ̃²/4} ≥ ξ_acc.

It follows that each time a new point is chosen in the point selection step, that point will satisfy the threshold test. Thus, the set Z returned by the modified algorithm will include q_1 points, all of which satisfy the threshold test. By
Lemma 3.15, Z is Λ-poised, with Λ proportional to (γ_growth/ξ_acc) max{1, Δ̃²/2, Δ̃}. It follows that Y_new ∪ {y_new} is Λ-subpoised. □

We are now ready to state our model improvement algorithm for regression. Prior to calling this algorithm, we discard all points in Y with distance greater than Δ/√ξ_acc from y^0. We then form the shifted and scaled set Ŷ by the transformation ŷ^j = (y^j − y^0)/d, where d = max_{y^j∈Y} ‖y^j − y^0‖, and scale the trust region radius accordingly (i.e., Δ̃ = Δ/d). This ensures that Δ̃ = Δ/d ≥ √ξ_acc. After calling the algorithm, we reverse the shift and scale transformation.

Algorithm MIA (Model Improvement for Regression)

Input: A shifted and scaled sample set Ŷ ⊂ B(0; 1) and a trust region radius Δ̃ ≥ √ξ_acc, for fixed ξ_acc ∈ (0, 1/(4r²)), where r ≥ 1 is a fixed parameter.

Output: A modified set Y′ with improved poisedness on B(0; Δ̃).

Step 0 (Initialization): Remove the point in Ŷ farthest from y^0 = 0 if it is outside B(0; rΔ̃). Set Y′ = ∅.

Step 1 (Find Poised Subset): Use Algorithm FindSet either to identify a Λ-poised subset Y_new ⊆ Ŷ, or to identify a Λ-subpoised subset Y_new ⊆ Ŷ and one additional point y_new ∈ B(0; Δ̃) such that Y_new ∪ {y_new} is Λ-subpoised on B(0; Δ̃).

Step 2 (Update Set):
- If Y_new is Λ-poised, add it to Y′ and remove Y_new from Ŷ. Remove all points from Ŷ that are outside of B(0; rΔ̃).
- Otherwise, if Y′ = ∅, set Y′ = Y_new ∪ {y_new} plus q_1 − |Y_new| − 1 additional points from Ŷ.
- Otherwise, set Y′ = Y′ ∪ Y_new plus q_1 − |Y_new| additional points from Ŷ.
In the latter two cases, set Ŷ = ∅.

Step 3: If |Ŷ| ≥ q_1, go to Step 1.

In Algorithm MIA, if every call to Algorithm FindSet yields a Λ-poised set Y_new, then eventually all points in Ŷ will be included in Y′. In this case, the algorithm has verified that Ŷ contains l = ⌊p_1/q_1⌋ Λ-poised sets. By Proposition 3.13, Ŷ is strongly √((l+1)/l)·Λ-poised in B(0; 1).

If the first call to FindSet fails to identify a Λ-poised subset, the algorithm improves the sample set by adding a new point y_new and by removing points so that the output set Y′ contains at most q_1 points. In this case, the output set contains the Λ-subpoised set Y_new ∪ {y_new}. Thus, if the algorithm is called repeatedly, with Ŷ replaced by Y′ after each call, eventually Y′ will contain a Λ-poised subset and will be strongly √2·Λ-poised, by Proposition 3.13.

If Ŷ fails to be Λ-poised after the second (or later) call to FindSet, no new points are added. Instead, the sample set is improved by removing points from Ŷ so that the output set Y′ consists of all the Λ-poised subsets identified by FindSet, plus up to q_1 additional points. The resulting set is then strongly √((l̂+1)/l̂)·Λ-poised, where l̂ = ⌊|Y′|/q_1⌋.

Trust region scale factor. The trust region scale factor r was suggested in [12, Section 11.2], although implementation details were omitted. The scale factor determines which points are allowed to remain in the sample set. Each call to Algorithm MIA removes a point from outside B(0; rΔ̃) if such a point exists. Thus, if the algorithm is called repeatedly, with Ŷ replaced by Y′ each time, eventually all points in the sample set will be in the region B(0; rΔ̃). Using a scale factor r > 1 can improve the efficiency of the algorithm. To see this, observe that if r = 1, a slight movement of the trust region center may result in previously "good" points lying just outside of B(y^0; Δ). These points would then be unnecessarily removed from Ŷ.
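To make the mechanics of FindSet concrete, here is a toy pure-Python sketch of its pivoting loop for the univariate quadratic basis (1, y, y²/2). It is a deliberately simplified illustration, not the thesis implementation: it uses plain partial pivoting rather than the distance-weighted selection rule, and it returns the failing pivot polynomial's coefficients instead of maximizing it over the ball to produce y_new:

```python
def u_val(c, y):
    # Evaluate a pivot polynomial c0 + c1*y + c2*(y^2/2)
    return c[0] + c[1] * y + c[2] * 0.5 * y * y

def find_poised_subset(Y, xi_acc=1e-4):
    # Gaussian elimination on pivot polynomials initialized to the monomial
    # basis; a pivot below the threshold xi_acc means no acceptable point.
    Y = list(Y)
    U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    chosen = []
    for i in range(3):
        j = max(range(i, len(Y)), key=lambda j: abs(u_val(U[i], Y[j])))
        if abs(u_val(U[i], Y[j])) < xi_acc:
            # Threshold test failed: caller should compute y_new = argmax |u_i|
            return chosen, U[i]
        Y[i], Y[j] = Y[j], Y[i]          # swap the accepted point into place
        chosen.append(Y[i])
        for k in range(i + 1, 3):        # eliminate later pivot polynomials
            t = u_val(U[k], Y[i]) / u_val(U[i], Y[i])
            U[k] = [a - t * b for a, b in zip(U[k], U[i])]
    return chosen, None

print(find_poised_subset([0.0, 1.0, 2.0, 0.5])[0])  # [0.0, 2.0, 1.0]
# A degenerate set (duplicated points) stalls after two pivots:
partial, u = find_poised_subset([0.0, 1.0, 1.0, 0.0])
print(partial)                                      # [0.0, 1.0]
```

In the degenerate case the returned coefficients describe the pivot polynomial u_i whose maximizer over B(0; Δ̃) would become y_new, exactly the exit branch of Step 1.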
To justify retaining points within B(0; rΔ̃), suppose first that Ŷ ⊂ B(0; Δ̃) is strongly Λ-poised. By Proposition 3.11, the associated model function m is κ-fully quadratic for some fixed vector κ, which depends on Λ. If instead Ŷ has points outside of B(0; Δ̃), we can show, by a simple modification to the proof of Proposition 3.11, that the model function is κ(R/Δ̃)³-fully quadratic, where R = max_i ‖y^i − y^0‖. Thus, if Ŷ ⊂ B(0; rΔ̃) for some fixed r ≥ 1, then calling the model improvement algorithm will result in a model function m that is κ̂-fully quadratic with respect to a different, but still fixed, κ̂ = r³κ. In this case, however, whenever new points are added during the model improvement algorithm, they are always chosen within the original trust region B(0; Δ̃).

The discussion above demonstrates that Algorithm MIA satisfies the requirements of a model improvement algorithm specified in Definition 2.2. This algorithm is used in the CSV2 framework described in Chapter 2 as follows:

- In Step 1 of Algorithm CSV2, Algorithm MIA is called once. If no change is made to the sample set, the model is certified to be κ-fully quadratic.
- In Step 4 of Algorithm CSV2, Algorithm MIA is called once. If no change is made to the sample set, the model is κ-fully quadratic. Otherwise, the sample set is modified to improve the model.
- In Algorithm CriticalityStep, Algorithm MIA is called repeatedly to improve the model until it is κ-fully quadratic.

In our implementation, we modified Algorithm CriticalityStep to improve efficiency by introducing an additional exit criterion. Specifically, after each call to the model improvement algorithm, the measure ς^i_k = max{‖g^i_k‖, −λ_min(H^i_k)} is tested. If ς^i_k > ε_c, then x_k is no longer a second order stationary point of the model function, so we exit the criticality step.
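This extra exit test is inexpensive: for a model with gradient g and Hessian H, the measure ς = max{‖g‖, −λ_min(H)} needs only the smallest eigenvalue, which for a 2×2 symmetric Hessian is available in closed form. A sketch with illustrative values (not the thesis' MATLAB implementation):

```python
import math

def criticality_measure(g, H):
    # sigma = max(||g||, -lambda_min(H)) for a 2x2 symmetric H = [[a, b], [b, c]];
    # a small value means x is nearly a second order stationary point of the model.
    a, b, c = H[0][0], H[0][1], H[1][1]
    lam_min = 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return max(math.hypot(g[0], g[1]), -lam_min)

# Zero gradient and positive definite Hessian: second order stationary.
print(criticality_measure([0.0, 0.0], [[2.0, 0.0], [0.0, 1.0]]))   # 0.0
# Nonzero gradient dominates the negative-curvature term here.
print(criticality_measure([3.0, 4.0], [[1.0, 0.0], [0.0, -2.0]]))  # 5.0
```

If the computed value exceeds ε_c, the criticality step can stop improving the model, as described above.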
3.5 Computational Results

As shown in the previous section, the CSV2 framework using weighted quadratic regression converges to a second order stationary point, provided the ratio between the largest and smallest weight is bounded. This leaves much leeway in the derivation of the weights. We now describe a heuristic strategy based on the error bounds derived in §3.3.

3.5.1 Using Error Bounds to Choose Weights

Intuitively, the models used throughout our algorithm will be most effective if the weights are chosen so that m(x) is as accurate as possible, in the sense that it agrees with the second order Taylor approximation of f(x) around the current trust region center y^0. That is, we want to estimate the quadratic function

    q(x) = f(y^0) + ∇f(y^0)^T (x − y^0) + (1/2)(x − y^0)^T ∇²f(y^0)(x − y^0).

If f(x) happens to be a quadratic polynomial, then f̄_i = q(y^i) + ε_i. If the errors ε_i are uncorrelated random variables with zero mean and finite variances σ_i², i = 0, …, p, then the best linear unbiased estimator of the polynomial q(x) is given by m(x) = φ(x)^T α, where α solves (3.4) with the i-th weight proportional to 1/σ_i [51, Theorem 4.4]. This is intuitively appealing, since each sample point will then have the same expected contribution to the weighted sum of square errors.

When f(x) is not a quadratic function, the errors depend not just on the computational error, but also on the distances from each point to y^0. In the particular case when x = y^0, the first three terms of (3.7) are the quadratic function q(y^i). Thus, the error between the computed function value and q(y^i) is given by

    f̄_i − q(y^i) = (1/2)(y^i − y^0)^T H^i(y^0)(y^i − y^0) + ε_i,   (3.14)
where H^i(y^0) = ∇²f(η^i(y^0)) − ∇²f(y^0) for some point η^i(y^0) on the line segment connecting y^0 and y^i.

We shall refer to the first term on the right as the Taylor error, and to the second term on the right as the computational error. By Assumption 2.1, ‖H^i(y^0)‖ ≤ L‖y^i − y^0‖. This leads us to the following heuristic argument for choosing the weights. Suppose that H^i(y^0) is a random symmetric matrix such that the standard deviation of ‖H^i(y^0)‖ is proportional to L‖y^i − y^0‖; in other words, σ(‖H^i(y^0)‖) = θL‖y^i − y^0‖ for some constant θ. Then the Taylor error will have standard deviation proportional to L‖y^i − y^0‖³. Assuming the computational error is independent of the Taylor error, the total error f̄_i − q(y^i) will have standard deviation √((θL‖y^i − y^0‖³)² + σ_i²), where σ_i is the standard deviation of ε_i. This leads to the following formula for the weights:

    w_i ∝ 1/√(θ²L²‖y^i − y^0‖⁶ + σ_i²).

Of course, this formula depends on knowing θ, L, and σ_i. If L, σ_i, and/or θ are not known, this formula could still be used in conjunction with some strategy for estimating them (for example, based upon the accuracy of the model functions at known points). Alternatively, θ and L can be combined into a single parameter C that could be chosen using some type of adaptive strategy:

    w_i ∝ 1/√(C‖y^i − y^0‖⁶ + σ_i²).

If the computational errors have equal variances, the formula can be further simplified as

    w_i ∝ 1/√(C̃‖y^i − y^0‖⁶ + 1),   (3.15)

where C̃ = C/σ_i².

An obvious flaw in the above development is that the errors f̄_i − q(y^i) are not uncorrelated. Additionally, the assumption that ‖H^i(y^0)‖ is proportional to L‖y^i − y^0‖ is valid only for limited classes of functions. Nevertheless, based on our
PAGE 63
computational experiments, (3.15) appears to provide a sensible strategy for balancing differing levels of computational uncertainty with the Taylor error.

3.5.2 Benchmark Performance

To study the impact of weighted regression, we developed MATLAB implementations of three quadratic model-based trust region algorithms using interpolation, regression, and weighted regression, respectively, to construct the quadratic model functions. To the extent possible, the differences between these algorithms were minimized, with code shared whenever possible. Obviously, all three methods use different strategies for constructing the model from the sample set. Beyond that, the only difference is that the two regression methods use the model improvement algorithm described in Section 3.4, whereas the interpolation algorithm uses the model improvement algorithm described in [12, Algorithm 6.6].

We compared the three algorithms using the suite of test problems for benchmarking derivative-free optimization algorithms made available by Moré and Wild [41]. We ran our tests on the four types of problems from this test suite: smooth problems with no noise, piecewise smooth functions, functions with deterministic noise, and functions with stochastic noise. We do not consider the algorithm presented in this chapter to be ideal for handling stochastically noisy functions. For example, if the initial point happens to be evaluated with large negative noise, the algorithm will never reevaluate this point and possibly never move the trust region center. We are actively attempting to construct a more robust algorithm; we consider such modifications nontrivial and outside the scope of the current work. The problems were run with the following parameter settings:
$$\Delta_{\max} = 100,\;\; \Delta_0^{icb} = 1,\;\; \eta_0 = 10^{-6},\;\; \eta_1 = 0.5,\;\; \gamma = 0.5,\;\; \gamma_{inc} = 2,\;\; \varepsilon_c = 0.01,\;\; \mu = 2,\;\; \beta = 0.5,\;\; \omega = 0.5,\;\; r = 3,\;\; \varepsilon_{acc} = 10^{-4}.$$
For the interpolation algorithm, we used $\xi_{imp} = 1.01$ for the calls to [12, Algorithm 6.6].
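As a concrete illustration, the simplified weighting scheme (3.15) is straightforward to compute from a sample set. The sketch below is in Python rather than the MATLAB of our experiments, and the normalization to the trust-region center is a presentational choice, not part of the scheme:

```python
import numpy as np

def regression_weights(Y, y0, C_bar=100.0):
    """Weights from (3.15): w_i proportional to 1/sqrt(C_bar*||y^i - y^0||^6 + 1).

    Points far from the trust-region center y^0 (large Taylor error) receive
    small weights; the center point itself receives the largest weight.
    """
    dist = np.linalg.norm(np.asarray(Y, dtype=float) - np.asarray(y0, dtype=float), axis=1)
    w = 1.0 / np.sqrt(C_bar * dist**6 + 1.0)
    return w / w.max()  # scale so the center point has weight 1

# Center plus points at distances 0.5 and 1.0 from y^0
Y = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
w = regression_weights(Y, [0.0, 0.0])
```

With $\bar C = 100$ (the value used for the benchmarks below), a point at distance 1 from the center receives weight $1/\sqrt{101} \approx 0.1$ relative to the center point, so nearby points dominate the weighted fit.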
As described in [41], the smooth problems are derived from 22 nonlinear least squares functions defined in the CUTEr [23] collection. For each problem, the objective function $f(x)$ is defined by
$$f(x) = \sum_{k=1}^{m} g_k(x)^2,$$
where $g : \mathbb{R}^n \to \mathbb{R}^m$ represents one of the CUTEr test functions. The piecewise-smooth problems are defined using the same CUTEr test functions by
$$f(x) = \sum_{k=1}^{m} |g_k(x)|.$$
The noisy problems are derived from the smooth problems by multiplying by a noise function as follows:
$$f(x) = \left(1 + \varepsilon_f \Phi(x)\right) \sum_{k=1}^{m} g_k(x)^2,$$
where $\varepsilon_f$ defines the relative noise level. For the stochastically noisy problems, $\Phi(x)$ is a random variable drawn from the uniform distribution $U[-1, 1]$. To simulate deterministic noise, $\Phi(x)$ is a function that oscillates between $-1$ and $1$, with both high-frequency and low-frequency oscillations. For an equation for the deterministic noise, see [41, Eqns. (4.2)-(4.3)]. Using multiple starting points for some of the test functions, a total of 53 different problems are specified in the test suite for each of these 3 types of problems.

For the weighted regression algorithm, the weights were determined by the weighting scheme (3.15) with $\bar C = 100$.

The relative performances of the algorithms were compared using performance profiles and data profiles [17, 41]. If $S$ is the set of solvers to be compared on a suite of problems $P$, let $t_{p,s}$ be the number of iterates required for solver $s \in S$ on a problem $p \in P$ to find a function value satisfying
$$f(x) \le f_L + \tau\left(f(x^0) - f_L\right), \tag{3.16}$$
where $f_L$ is the best function value achieved by any $s \in S$. Then the performance profile of a solver $s \in S$ is the following fraction:
$$\rho_s(\alpha) = \frac{1}{|P|} \left|\left\{ p \in P : t_{p,s} \le \alpha \min\{t_{p,\bar s} : \bar s \in S\} \right\}\right|.$$
The data profile of a solver $s \in S$ is
$$d_s(\alpha) = \frac{1}{|P|} \left|\left\{ p \in P : t_{p,s} \le \alpha \left(n_p + 1\right) \right\}\right|,$$
where $n_p$ is the dimension of problem $p \in P$. For more information on these profiles, including their relative merits and faults, see [41].

Performance profiles comparing the three algorithms are shown in Figure 3.1 for an accuracy of $\tau = 10^{-5}$. We observe that on the smooth problems, the weighted and unweighted regression methods had similar performance, and both performed slightly better than interpolation. For the deterministically noisy problems, we see slightly better performance from the weighted regression method, and this improvement is even more pronounced for the benchmark problems with stochastic noise. For the piecewise differentiable functions, the performance of the weighted regression method is significantly better. This mirrors the findings in [13], where SID-PSM using regression models shows considerable improvement over interpolation models.

We also compared our weighted regression algorithm with the DFO algorithm [8] as well as NEWUOA [50], which had the best performance of the three solvers compared in [41]. We obtained the DFO code from the COIN-OR website [38]; this code calls IPOPT, which we also obtained from COIN-OR. We obtained NEWUOA from [40]. We ran both algorithms on the benchmark problems with a stopping criterion of $\Delta_{\min} = 10^{-8}$, where $\Delta_{\min}$ denotes the minimum trust region radius. For NEWUOA, the number of interpolation conditions was set to NPT $= 2n + 1$.

The performance profiles are shown in Figure 3.2, with an accuracy of $\tau = 10^{-5}$. NEWUOA outperforms both our algorithm and DFO on the smooth problems. This is not surprising, since NEWUOA is a mature code that has been refined over several
Figure 3.1: Performance (left) and data (right) profiles: interpolation vs. regression vs. weighted regression.

years, whereas our code is a relatively unsophisticated implementation. In contrast, on the noisy problems and the piecewise differentiable problems, our weighted regression
algorithm achieves superior performance.

Figure 3.2: Performance (left) and data (right) profiles: weighted regression vs. NEWUOA vs. DFO (problems with stochastic noise).

3.6 Summary and Conclusions
Our computational results indicate that using weighted regression to construct more accurate model functions can reduce the number of function evaluations required to reach a stationary point. Encouraged by these results, we believe that further study of weighted regression methods is warranted. This chapter provides a theoretical foundation for such study. In particular, we have extended the concepts of $\Lambda$-poisedness and strong $\Lambda$-poisedness to the weighted regression framework, and we demonstrated that any scheme that maintains strongly $\Lambda$-poised sample sets for unweighted regression can also be used to maintain strongly $\Lambda$-poised sample sets for weighted regression, provided that no weight is too small relative to the other weights. Using these results, we showed that, when the computational error is sufficiently small relative to the trust region radius, the algorithm converges to a stationary point under standard assumptions.

This investigation began with a goal of more efficiently dealing with computational error in derivative-free optimization, particularly under varying levels of uncertainty. Surprisingly, we discovered that regression-based methods can be advantageous even in the absence of computational error. Regression methods produce quadratic approximations that better fit the objective function close to the trust region center. This is due partly to the fact that interpolation methods throw out points that are close together in order to maintain a well-poised sample set. In contrast, regression models keep these points in the sample set, thereby putting greater weight on points close to the trust region center.

The question of how to choose weights needs further study. In this chapter, we proposed a heuristic that balances uncertainties arising from computational error with uncertainties arising from poor model fidelity (i.e., Taylor error), as described in §3.5.1. This weighting scheme appears to provide a benefit for noisy problems and nondifferentiable problems. We believe better schemes can be devised based on more rigorous analysis.
Finally, we note that the advantage of regression-based methods is not without cost in terms of computational efficiency. In the regression methods, quadratic models are constructed from scratch every iteration, requiring $O(n^6)$ operations. In contrast, interpolation-based methods are able to use an efficient scheme developed by Powell [50] to update the quadratic models at each iteration. It is not clear whether such a scheme can be devised for regression methods. Nevertheless, when function evaluations are extremely expensive, and when the number of variables is not too large, this disadvantage is outweighed by the reduction in function evaluations realized by regression-based methods.
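To make the cost comparison concrete: building a quadratic regression model from scratch is a least squares solve in $q_1 = (n+1)(n+2)/2$ coefficients, whence the $O(n^6)$ operation count noted above. A minimal unweighted sketch (the basis ordering and helper names are illustrative, not from the thesis code):

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_basis(x):
    """Monomial basis (1, x_1..x_n, all degree-2 products) at a point x."""
    terms = [1.0] + list(x)
    terms += [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.array(terms)

def fit_quadratic_model(Y, fvals):
    """Least squares fit of a quadratic model to (possibly noisy) values.

    Solving this q1-column least squares problem afresh each iteration costs
    O(q1^3) = O(n^6) operations, as discussed in the text.
    """
    X = np.array([quadratic_basis(y) for y in Y])  # design matrix, one row per point
    alpha, *_ = np.linalg.lstsq(X, np.asarray(fvals, dtype=float), rcond=None)
    return lambda x: quadratic_basis(x) @ alpha

# With noiseless samples, the fit recovers f(x) = x_1^2 + 2*x_2 exactly
rng = np.random.default_rng(1)
Y = rng.uniform(-1.0, 1.0, size=(12, 2))
f = lambda x: x[0]**2 + 2.0 * x[1]
m = fit_quadratic_model(Y, [f(y) for y in Y])
```

Weighted regression would simply scale each row of the design matrix and each entry of the right-hand side by $\sqrt{w_i}$ before the solve.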
4. Stochastic Derivative-free Optimization using a Trust Region Framework

In this chapter, we propose and analyze the convergence of an algorithm which finds a local minimizer of the unconstrained function $f : \mathbb{R}^n \to \mathbb{R}$. The value of $f$ at a given point $x$ cannot be observed directly; rather, the optimization routine only has access to noise-corrupted function values $\bar f$. Such noise may be deterministic, due to round-off error from finite precision arithmetic or iterative methods, or stochastic, arising from variability or randomness in some observed process. We focus our attention in this chapter on minimizing $f$ when $\bar f$ has the form
$$\bar f(x) = f(x) + \varepsilon, \tag{4.1}$$
where $\varepsilon \sim N(0, \sigma^2)$.

The need to minimize noisy functions of this nature arises in a variety of settings, for example, in almost any problem where physical system measurements are being optimized. Consider a city planner wanting to maximize traffic flow on a major thoroughfare by adjusting the timing of traffic lights. For each timing pattern $x$, the traffic flow $\bar f(x)$ is physically measured to provide information about the expected traffic flow $f(x)$.

Stochastic approximation algorithms, built to solve
$$\min f(x) = E\left[\bar f(x)\right],$$
have existed in the literature since Robbins–Monro's algorithm for finding roots of an expected value function [53]. The Kiefer–Wolfowitz (KW) algorithm [35] generalized this algorithm to minimize the expected value of a function. Its iterates have the form
$$x_{k+1} = x_k + a_k G(x_k),$$
where $G$ is a finite difference estimate for the gradient of $f$. The $i$th component of $G$ is found by
$$G_i = \frac{\bar f(x_k + c_k e_i) - \bar f(x_k - c_k e_i)}{2 c_k},$$
where $e_i$ is the $i$th unit vector. While KW spawned many generalizations, most forms require a predetermined decaying sequence for both the step size parameter $a_k$ and the finite difference parameter $c_k$. As opposed to the $2n$ evaluations of $\bar f$ required at each iterate of KW, Spall's simultaneous perturbation stochastic approximation (SPSA) [56] requires only 2 function evaluations per iterate, independent of $n$. SPSA estimates $G_i$ by
$$G_i = \frac{\bar f(x_k + c_k \Delta_k) - \bar f(x_k - c_k \Delta_k)}{2 c_k \Delta_{k_i}},$$
where $\Delta_k \in \mathbb{R}^n$ is a random perturbation vector with entries $\Delta_{k_i}$ which are independent and identically distributed (i.i.d.) from a distribution with bounded inverse moments, symmetrically distributed around zero, and uniformly bounded in magnitude for all $k$ and $i$. Though SPSA greatly reduces the number of evaluations of $\bar f$, the choice of the sequences $a_k$ and $c_k$ is critical to algorithmic performance. Nevertheless, if $f$ has a unique minimum $x^*$, both KW and SPSA converge almost surely (which implies convergence in probability and convergence in distribution), with $x_k \to x^*$ as $k \to \infty$. There is also a version of SPSA which uses four function evaluations to estimate the value, gradient, and Hessian of $f$ [57].

The algorithm which follows differs from the work in Chapter 3 in a few ways. First, the analysis presented in Chapter 3 gives very conservative error bounds; to tighten these bounds, we must consider specific probability distributions for the error. Second, in Chapter 3, $\rho_k$ was defined using $\bar f(x_k)$ as an estimate for $f(x_k)$. The work that follows evaluates model functions $m_k(x_k)$ and $\hat m_k(x_k)$ to provide better estimates of the true function value. It is possible to estimate $f(x_k)$ by repeated evaluation of $\bar f(x_k)$, but we desire an algorithm which avoids repeatedly sampling points to reduce the variance at a point $x$. Such a technique only gains information about the noise in the stochastic case, and no information about $f$ if $\bar f$ is deterministic. But if many points sufficiently close to $x$ are sampled, information about $f$ and $\varepsilon$ can be gleaned. As is often the case, the point $x$ is the likely next iterate, and the information gathered
about $f$ near $x$ will be used immediately. Also, if the noise in $\bar f$ is deterministic but the optimizer has imperfect control of $x$, it may be possible to treat the problem as a stochastic optimization problem.

The analysis of the algorithm is complicated by the presence of noise. Since there is noise in each function evaluation, it is impossible to be certain the model matches the function. For example, if $f(x) = x^2$, there is a nonzero (but tiny) probability that $\bar f(x) < -M$ for any $M > 0$ at every point evaluated. Therefore, we can only have confidence (which we denote $1 - \delta_k$, for $\delta_k$ small) that the model and function agree. The quantity $\delta_k$ can be adjusted as the algorithm stagnates to ensure increasingly accurate models at the expense of more function evaluations. A key requirement of the convergence analysis is that as $\Delta_k \to 0$, $\delta_k \to 0$ as well. For example, we can choose a simple rule such as $\delta_k = \min\{\Delta_k, 0.05\}$ to prove results about our algorithm. There are many other equally valid rules for handling $\delta_k$ which ensure increasingly accurate models as $\Delta_k \to 0$.

Our ultimate goal is to prove that the algorithm converges to a stationary point of $f$ almost surely (with probability 1), but this is a daunting task. This is to be expected considering the following two quotes, both from [58]:

"There is a fundamental tradeoff between algorithm efficiency and algorithm robustness (reliability and stability in a broad range of problems). In essence, algorithms that are designed to be very efficient on one type of problem tend to be 'brittle' in the sense that they do not reliably transfer to problems of a different type."

and

"Unfortunately, for general nonlinear problems, there is no known finite-sample ($k < \infty$) distribution for the SA [stochastic approximation] iterate. Further, the theory governing the asymptotic ($k \to \infty$) distribution is rather difficult."
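The KW and SPSA gradient estimates described above can be sketched as follows (the test function and step size are illustrative choices; $\pm 1$ Bernoulli perturbation entries satisfy SPSA's distributional requirements):

```python
import numpy as np

def kw_gradient(f_bar, x, c):
    """Kiefer-Wolfowitz estimate: 2n evaluations of the noisy function."""
    n = len(x)
    g = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        g[i] = (f_bar(x + c * e) - f_bar(x - c * e)) / (2.0 * c)
    return g

def spsa_gradient(f_bar, x, c, rng):
    """SPSA estimate: 2 evaluations regardless of n, via a random +/-1
    perturbation (bounded inverse moments, symmetric, bounded)."""
    delta = rng.choice([-1.0, 1.0], size=len(x))
    diff = f_bar(x + c * delta) - f_bar(x - c * delta)
    return diff / (2.0 * c * delta)

rng = np.random.default_rng(0)
f_bar = lambda x: float(np.sum(x**2))      # noiseless here, for checking
x = np.array([1.0, -2.0])
g_kw = kw_gradient(f_bar, x, 1e-4)         # central differences: ~[2, -4]
g_spsa = spsa_gradient(f_bar, x, 1e-4, rng)  # unbiased, but noisy per sample
```

Per iterate, KW costs $2n$ evaluations while SPSA costs 2; the price is that a single SPSA estimate is much noisier, with only its expectation matching the gradient.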
Despite these pessimistic views, we are able to make progress in the work that follows. Since we are attempting to construct a robust algorithm, with a measure of confidence in our solution after a finite number of iterations, our theoretical requirements may not all be implemented in the algorithm; relaxing requirements may yield a more suitable algorithm for a specific problem instance.

Our algorithm is a derivative-free trust region method using regression quadratic models for their perceived ability to handle noisy function evaluations. We outline the modifications required for convergence when minimizing a function with stochastic noise. For example, when there is no noise in the function being optimized, we can measure the accuracy of the $k$th model $m_k$ with the ratio
$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.$$
This measures the actual decrease observed in $f$ versus the decrease predicted by the model $m_k$. Since $f$ cannot be evaluated directly, we propose a modified ratio $\hat\rho_k$ in Section 4.2 which we believe is more appropriate for noisy functions. We also propose a modified form of fully quadratic for stochastically noisy functions.

An overview of the chapter follows: in Section 4.1 we define $\kappa$-fully quadratic and $\hat\kappa$-fully linear models with confidence $1 - \delta_k$ on $B(x; \Delta)$, and show that quadratic and linear regression models satisfy these new definitions, provided there are a sufficient number of poised points in $B(x; \Delta)$. We outline the algorithm in Section 4.2 and show that it converges to a first-order stationary point in Section 4.3. We provide suggestions for implementing our algorithm and compare one implementation against other algorithms for minimizing (4.1) in Section 4.4. Lastly, we discuss the results in Section 4.5 and outline some of the future avenues for research.

4.1 Preliminary Results and Definitions

We make the following assumptions:

Assumption 4.1. The noise $\varepsilon \sim N(0, \sigma^2)$.
Assumption 4.2. The function $f \in LC^2$ (with Lipschitz constant $L$) on $\Omega = \bigcup_k B(x_k; \Delta_{\max}) \subset \mathbb{R}^n$.

Assumption 4.3. The function $f$ is bounded on $L(f(x_0))$, where $L(\gamma) = \{x \mid f(x) \le \gamma\}$.

In solving the trust region subproblem, we do not require an exact solution; instead it is sufficient to find an approximate solution, but it must satisfy the following assumption.

Assumption 4.4. If $m_k$ and $\Delta_k$ are the model and trust region radius at iterate $k$, $x_k + s_k$ is chosen by the trust region solver to solve $\min_{x \in B(x_k; \Delta_k)} m_k(x)$, and $s_k^C = -\frac{\Delta_k}{\|g_k\|} g_k$ is the Cauchy step, then for all $k$,
$$m_k(x_k) - m_k(x_k + s_k) \ge \kappa_{fcd}\left[m_k(x_k) - m_k(x_k + s_k^C)\right]$$
for some constant $\kappa_{fcd} \in (0, 1]$.

This assumption merely states that every trust region subproblem solution achieves a fraction of the decrease possible from taking the Cauchy step, and this fraction is bounded positively away from zero. Also, the assumption allows us to not solve the trust region subproblem exactly.

Assumption 4.5. There exists a constant $\kappa_{bhf} > 0$ such that, for all $x_k$ generated in the algorithm,
$$\left\|\nabla^2 f(x_k)\right\| \le \kappa_{bhf}.$$

We prove the following three claims used in this chapter.

Lemma 4.6. If $X \le_{1-\delta} Y$ and $Y \le_{1-\delta} Z$, then $X \le_{1-2\delta} Z$.
Proof:
$$P(X \le Z) \ge P(X \le Y \wedge Y \le Z) = 1 - P(X > Y \vee Y > Z)$$
$$= 1 - P(X > Y) - P(Y > Z) + P(X > Y \wedge Y > Z)$$
$$\ge 1 - P(X > Y) - P(Y > Z) \ge 1 - \delta - \delta = 1 - 2\delta.$$
So $X \le_{1-2\delta} Z$. $\square$

Lemma 4.7.
$$1 - \sum_{i=1}^{n} P\left(a_i \ge \frac{\epsilon}{n}\right) \le P\left(\sum_{i=1}^{n} a_i < \epsilon\right).$$

Proof:
$$P\left(\sum_{i=1}^{n} a_i < \epsilon\right) \ge P\left(a_1 < \frac{\epsilon}{n} \wedge a_2 < \frac{\epsilon}{n} \wedge \cdots \wedge a_n < \frac{\epsilon}{n}\right)$$
$$= 1 - P\left(a_1 \ge \frac{\epsilon}{n} \vee \cdots \vee a_n \ge \frac{\epsilon}{n}\right) \ge 1 - \sum_{i=1}^{n} P\left(a_i \ge \frac{\epsilon}{n}\right). \;\square$$

Lemma 4.8. Let $Y \subset B(0; 1)$ be a strongly $\Lambda$-poised (Definition 2.14) sample set with $p_1$ points. Let $X$ be the quadratic design matrix defined by (3.2); then
$$\left[(X^T X)^{-1}\right]_{i,i} \le \left(\Lambda\sqrt{\frac{q_1}{p_1}}\right)^2,$$
where $[A]_{i,i}$ is the $i$th diagonal entry of $A$.

Proof: Since $(X^T X)^{-1}$ is symmetric and positive definite, its $i$th eigenvalue $\lambda_i$ equals its $i$th singular value $\sigma_i$. By [12, Theorem 4.11], the inverse of the smallest singular value of $X$ is bounded by $\Lambda\sqrt{q_1/p_1}$. That is,
$$\Lambda\sqrt{\frac{q_1}{p_1}} \ge \frac{1}{\sigma_{\min}(X)},$$
or
$$\left(\Lambda\sqrt{\frac{q_1}{p_1}}\right)^2 \ge \frac{1}{\sigma_{\min}(X)^2} = \frac{1}{\lambda_{\min}(X^T X)} = \lambda_{\max}\left((X^T X)^{-1}\right) = \sigma_{\max}\left((X^T X)^{-1}\right)$$
$$= \left\|(X^T X)^{-1}\right\| = \max_{\|v\|=1}\left\|(X^T X)^{-1} v\right\| \ge \left\|(X^T X)^{-1} e_i\right\| \ge \left[(X^T X)^{-1}\right]_{i,i},$$
where $e_i$ is the $i$th unit vector. $\square$

4.1.1 Models which are $\kappa$-fully Quadratic with Confidence $1 - \delta_k$

To prove convergence of the algorithm presented in Section 4.2, we first propose a modified version of $\kappa$-fully quadratic models.

Definition 4.9. Let $f$ satisfy Assumption 4.2. Let $\kappa = (\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, \nu_2^m)$ be a given vector of constants, and let $\Delta > 0$. A model function $m \in C^2$ is $\kappa$-fully quadratic with confidence $1 - \delta_k$ on $B(x; \Delta)$ for $\delta_k \in (0, 1)$ if $m$ has a Lipschitz continuous Hessian with corresponding Lipschitz constant bounded by $\nu_2^m$ and

the error between the Hessian of the model and the Hessian of the function satisfies
$$P\left(\left\|\nabla^2 f(y) - \nabla^2 m(y)\right\| \le \kappa_{eh}\Delta \;\;\forall y \in B(x; \Delta)\right) \ge 1 - \delta_k;$$

the error between the gradient of the model and the gradient of the function satisfies
$$P\left(\left\|\nabla f(y) - \nabla m(y)\right\| \le \kappa_{eg}\Delta^2 \;\;\forall y \in B(x; \Delta)\right) \ge 1 - \delta_k;$$

the error between the model and the function satisfies
$$P\left(\left|f(y) - m(y)\right| \le \kappa_{ef}\Delta^3 \;\;\forall y \in B(x; \Delta)\right) \ge 1 - \delta_k.$$

This is occasionally abbreviated f.q.w.c. $1 - \delta_k$.
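The role of the confidence parameter can be illustrated with a small Monte Carlo experiment (illustrative only; the constants below are not those of Definition 4.9): when a quadratic model is fit by least squares to noisy values, its worst-case error over the trust region is itself random, and it concentrates as more sample points are used.

```python
import numpy as np

def avg_max_model_error(p, sigma, Delta, trials, seed):
    """Average (over trials) of max |f - m| on [-Delta, Delta] for f(x) = x^2,
    where m is a quadratic least squares fit to p noisy samples."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-Delta, Delta, 101)
    errs = []
    for _ in range(trials):
        x = rng.uniform(-Delta, Delta, size=p)
        y = x**2 + rng.normal(0.0, sigma, size=p)   # noisy evaluations
        X = np.column_stack([np.ones(p), x, x**2])  # quadratic design matrix
        alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
        m = alpha[0] + alpha[1] * grid + alpha[2] * grid**2
        errs.append(np.max(np.abs(grid**2 - m)))
    return float(np.mean(errs))

err_small = avg_max_model_error(p=10,  sigma=0.01, Delta=0.5, trials=200, seed=2)
err_large = avg_max_model_error(p=200, sigma=0.01, Delta=0.5, trials=200, seed=2)
```

With twenty times as many samples the typical worst-case model error shrinks noticeably, mirroring the way a larger poised sample set raises the probability that the error bounds of Definition 4.9 hold.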
These definitions are only useful if model functions can be easily constructed which satisfy them; the models must also be easy to minimize over a trust region. In the following theorem, we show that quadratic regression models satisfy the requirements of Definition 4.9, provided there are enough poised points within the trust region.

Theorem 4.10. If the function $f$ satisfies Assumption 4.2 and the noise $\varepsilon$ satisfies Assumption 4.1, then for a given $\delta_k \in (0, 1)$, there exists a $\kappa = (\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, \nu_2^m)$ such that for any $x^0 \in \mathbb{R}^n$, $\Delta > 0$, if $Y \subset B(x^0; \Delta)$ is strongly $\Lambda$-poised and
$$|Y| \ge \frac{z_{1-\frac{\delta_k}{2 q_1}}^2\, \sigma^2\, q_1^3\, \Lambda^2}{\Delta^6},$$
then the quadratic regression model is $\kappa$-fully quadratic with confidence $1 - \delta_k$, where $z_{\gamma}$ is the number of standard deviations away from zero on a standard normal distribution such that the area to the left of $z_{\gamma}$ is $\gamma$ [46].

Proof: By Taylor's theorem, for any point $x \in B(x^0; \Delta)$ there exists a point $\xi(x)$ on the line segment connecting $x$ to $x^0$ such that
$$f(x) = f(x^0) + \nabla f(x^0)^T (x - x^0) + \frac{1}{2}(x - x^0)^T \nabla^2 f(\xi(x))(x - x^0)$$
$$= f(x^0) + \nabla f(x^0)^T (x - x^0) + \frac{1}{2}(x - x^0)^T \nabla^2 f(x^0)(x - x^0) + \frac{1}{2}(x - x^0)^T H(x)(x - x^0), \tag{4.2}$$
where $H(x) = \nabla^2 f(\xi(x)) - \nabla^2 f(x^0)$.

Let $m(x)$ be the quadratic least squares model regressing $Y$. Since $m$ is quadratic, Taylor's theorem says for any $x$,
$$m(x) = m(x^0) + \nabla m(x^0)^T (x - x^0) + \frac{1}{2}(x - x^0)^T \nabla^2 m(x^0)(x - x^0).$$
Let $\alpha$ be the true parameters of the quadratic part of $f$ (defined by the first three terms of (4.2)) and let $\hat\alpha$ be the least squares estimate for $\alpha$; i.e., if $X$ is the design matrix for the set $Y$ and $\bar f$ is the vector with $i$th entry $\bar f(y^i)$, then $\hat\alpha =$
$(X^T X)^{-1} X^T \bar f$. Define the mapping $V(x) : \mathbb{R}^n \to \mathbb{R}^{q_1}$ where $V(x) = V([x_1, \ldots, x_n]^T) = \left(1, x_1, \ldots, x_n, \frac{1}{2}x_1^2, x_1 x_2, \ldots, \frac{1}{2}x_n^2\right)^T$. Then the $i$th row of $X$ is $V(y^i)^T$. The parameters $\hat\alpha$ define the model $m$; that is, $m(x) = \hat\alpha^T V(x)$.

Without loss of generality, assume $\Delta \le 1$. Then for any $x \in B(x^0; \Delta)$,
$$|f(x) - m(x)| = \Big| f(x^0) + \nabla f(x^0)^T(x - x^0) + \frac{1}{2}(x - x^0)^T \nabla^2 f(x^0)(x - x^0) + \frac{1}{2}(x - x^0)^T H(x)(x - x^0)$$
$$\qquad - m(x^0) - \nabla m(x^0)^T(x - x^0) - \frac{1}{2}(x - x^0)^T \nabla^2 m(x^0)(x - x^0) \Big|$$
$$= \left| \alpha^T V(x - x^0) - \hat\alpha^T V(x - x^0) + \frac{1}{2}(x - x^0)^T H(x)(x - x^0) \right|$$
$$\le \sum_{i=0}^{q} \left|\alpha_i - \hat\alpha_i\right| \left|V(x - x^0)_i\right| + \frac{L}{2}\left\|x - x^0\right\|^3$$
$$\le \sum_{i=0}^{q} \left|\alpha_i - \hat\alpha_i\right| \left|V(x - x^0)_i\right| + \frac{L}{2}\Delta^3$$
$$\le \left|\alpha_0 - \hat\alpha_0\right| + \sum_{i=1}^{n} \left|\alpha_i - \hat\alpha_i\right|\Delta + \sum_{i=n+1}^{q} \left|\alpha_i - \hat\alpha_i\right|\Delta^2 + \frac{L}{2}\Delta^3$$
$$\le \sum_{i=0}^{q} \left|\alpha_i - \hat\alpha_i\right| + \frac{L}{2}\Delta^3. \tag{4.3}$$

Since the noise is uncorrelated with mean zero and constant variance, and is normally distributed, it is known that $\hat\alpha \sim N\left(\alpha, \sigma^2 (X^T X)^{-1}\right)$ [46]. If $[A]_{i,i}$ denotes the $i$th diagonal entry of a matrix $A$, then the $1 - \frac{\delta_k}{q_1}$ confidence interval for each of the $\alpha_i$ has the form [46]:
$$1 - \frac{\delta_k}{q_1} = P\left(\hat\alpha_i - z_{1-\frac{\delta_k}{2 q_1}}\sqrt{\sigma^2 \left[(X^T X)^{-1}\right]_{i,i}} < \alpha_i < \hat\alpha_i + z_{1-\frac{\delta_k}{2 q_1}}\sqrt{\sigma^2 \left[(X^T X)^{-1}\right]_{i,i}}\right)$$
$$= P\left(\left|\alpha_i - \hat\alpha_i\right| < z_{1-\frac{\delta_k}{2 q_1}}\sqrt{\sigma^2 \left[(X^T X)^{-1}\right]_{i,i}}\right).$$
By Lemma 4.8 and the lower bound on $|Y| = p_1$,
$$z_{1-\frac{\delta_k}{2 q_1}}\sqrt{\sigma^2 \left[(X^T X)^{-1}\right]_{i,i}} \le z_{1-\frac{\delta_k}{2 q_1}}\,\sigma\,\Lambda\sqrt{\frac{q_1}{p_1}} \le \frac{\Delta^3}{q_1},$$
so $P\left(|\alpha_i - \hat\alpha_i| < \frac{\Delta^3}{q_1}\right) \ge 1 - \frac{\delta_k}{q_1}$ for each $i$. Applying Lemma 4.7 then gives
$$P\left(\sum_{i=0}^{q} \left|\alpha_i - \hat\alpha_i\right| < \Delta^3\right) \ge 1 - \delta_k,$$
which, combined with (4.3), shows that the third condition of Definition 4.9 holds with $\kappa_{ef} = 1 + \frac{L}{2}$. Similarly,
$$\left\|\nabla f(x) - \nabla m(x)\right\| = \left\| \nabla f(x^0) + \nabla^2 f(x^0)^T(x - x^0) + H(x)(x - x^0) - \left(\nabla m(x^0) + \nabla^2 m(x^0)(x - x^0)\right) \right\|$$
$$\le \left\|\nabla f(x^0) - \nabla m(x^0)\right\| + \left\|\left(\nabla^2 f(x^0) - \nabla^2 m(x^0)\right)(x - x^0)\right\| + \left\|H(x)(x - x^0)\right\|$$
$$\le \sum_{i=1}^{n} \left|\alpha_i - \hat\alpha_i\right| + \sum_{i=n+1}^{q} \left|\alpha_i - \hat\alpha_i\right|\Delta + L\left\|x - x^0\right\|^2$$
$$\le \sum_{i=1}^{n} \left|\alpha_i - \hat\alpha_i\right| + \sum_{i=n+1}^{q} \left|\alpha_i - \hat\alpha_i\right| + L\Delta^2 \le \sum_{i=1}^{q} \left|\alpha_i - \hat\alpha_i\right| + L\Delta^2$$
$$\le_{1-\delta_k} \Delta^3 + L\Delta^2 \le \Delta^2 + L\Delta^2 = \kappa_{eg}\Delta^2,$$
where $\kappa_{eg} = 1 + L$. A similar argument for $\|\nabla^2 f(x) - \nabla^2 m(x)\|$ with $\kappa_{eh} = 1 + L$ proves the theorem. $\square$

To certify that a model satisfies Definition 4.9, or to improve a least squares regression model into one that is $\kappa$-fully quadratic with confidence $1 - \delta_k$, is straightforward: we must ensure there are enough poised points within $B(x; \Delta)$ to satisfy the bound given in Theorem 4.10, and otherwise add enough strongly poised points to $Y$. For a technique to generate strongly poised sets, see Chapter 3 of this thesis or [12].

4.1.2 Models which are $\hat\kappa$-fully Linear with Confidence $1 - \delta_k$

While the models $m_k$ used in the main algorithm are quadratic, linear models $\hat m(x)$ can approximate $f$ near $B(x_k + s_k; \hat\Delta_k)$ to sufficient accuracy. Therefore, if we have enough points within $B(x_k + s_k; \hat\Delta)$, we can bound the error between $f(x_k + s_k)$ and $\hat m_k(x_k + s_k)$. We quantify that accuracy in the following definition.
Definition 4.11. Let $f$ satisfy Assumption 4.2. Let $\hat\kappa = (\hat\kappa_{ef}, \hat\kappa_{eg}, \nu_1^m)$ be a given vector of constants, and let $\Delta > 0$. A model function $m \in C^1$ is $\hat\kappa$-fully linear with confidence $1 - \delta_k$ on $B(x; \Delta)$ for $\delta_k \in (0, 1)$ if $m$ has a Lipschitz continuous gradient with corresponding Lipschitz constant bounded by $\nu_1^m$ and

the error between the gradient of the model and the gradient of the function satisfies
$$P\left(\left\|\nabla f(y) - \nabla m(y)\right\| \le \hat\kappa_{eg}\Delta \;\;\forall y \in B(x; \Delta)\right) \ge 1 - \delta_k;$$

the error between the model and the function satisfies
$$P\left(\left|f(y) - m(y)\right| \le \hat\kappa_{ef}\Delta^2 \;\;\forall y \in B(x; \Delta)\right) \ge 1 - \delta_k.$$

This is occasionally abbreviated f.l.w.c. $1 - \delta_k$.

Theorem 4.12. If the function $f$ satisfies Assumption 4.2 and the noise $\varepsilon$ satisfies Assumption 4.1, then for a given $\delta_k \in (0, 1)$, there exists a $\hat\kappa = (\hat\kappa_{ef}, \hat\kappa_{eg}, \nu_1^m)$ such that for any $x^0 \in \mathbb{R}^n$, $\Delta > 0$, if $Y \subset B(x^0; \Delta)$ is strongly $\Lambda$-poised and
$$|Y| \ge \frac{z_{1-\frac{\delta_k}{2(n+2)}}^2\, \sigma^2\, (n+1)^3\, \Lambda^2}{\Delta^4},$$
then the linear regression model is $\hat\kappa$-fully linear with confidence $1 - \delta_k$.

Proof: The proof is nearly identical to that of Theorem 4.10. $\square$

4.2 Stochastic Optimization Algorithm

Below is an outline of our proposed stochastic algorithm. For $x_k + s_k$, the solution to the trust region subproblem, and a radius $\Delta_k > \hat\Delta_k > 0$, define $\hat Y_k = \left\{ y \in Y^{tot} \mid \|x_k + s_k - y\| \le \hat\Delta_k \right\}$.

Let $Y^{tot} = \{y^1, \ldots, y^m\}$ be the set of points where $\bar f$ has been evaluated, with $\bar f_i := \bar f(y^i)$. Define a null model $m_0$, an initial trust region radius $\Delta_0$, and an initial TR center
$x_0$. Choose constants satisfying $0 < \gamma < 1 < \gamma_{inc}$, $\varepsilon_c > 0$, $0 \le \eta_0 \le \eta_1 < 1$ (with $\eta_1 \ne 0$), $0 < \beta < \mu$, and $\omega \in (0, 1)$. Choose $r \in (0, 1)$ and define $\hat\Delta_k = r\Delta_k$.

Algorithm 1: A trust region algorithm to minimize a stochastic function.

    Let $k = 0$;
    Start:
    Set $\delta_k = \min\{\Delta_k, 0.05\}$;
    if $\sigma_k := \max\{\|g_k\|, -\lambda_{\min}(H_k)\} < \varepsilon_c$ and either (i) $m_k$ is not certifiably f.q.w.c. $1 - \delta_k$ on $B(x_k; \Delta_k)$ or (ii) $\Delta_k > \mu\sigma_k$ then
        Apply Algorithm 2 to update $Y_k$, $\Delta_k$, and $m_k$;
        Set $\delta_k = \min\{\Delta_k, 0.05\}$;
    else
        Select or generate a strongly poised set of points $Y_k \subset B(x_k; \Delta_k)$ from $Y^{tot}$ such that $Y_k$ has enough points to ensure $m_k$ is f.q.w.c. $1 - \delta_k$;
    Build a regression quadratic model $m_k(x)$ through $Y_k$. Solve $s_k \in \operatorname{argmin}_{s : \|s\| < \Delta_k} m_k(x_k + s)$. Build a f.l.w.c. $1 - \delta_k$ model $\hat m_k$ on $\hat Y_k$ (possibly adding points to $Y^{tot}$) and compute
    $$\hat\rho_k = \frac{m_k(x_k) - \hat m_k(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)};$$
    if $\hat\rho_k \ge \eta_1$, or $\hat\rho_k \ge \eta_0$ and $m_k$ is f.q.w.c. $1 - \delta_k$ on $B(x_k; \Delta_k)$, then
        $x_{k+1} = x_k + s_k$;
    else
        $x_{k+1} = x_k$;
    if $\hat\rho_k \ge \eta_1$ then
        $\Delta_{k+1} = \min\{\gamma_{inc}\Delta_k, \Delta_{\max}\}$;
    else
        $\Delta_{k+1} = \gamma\Delta_k$;
    Let $m_{k+1}$ be the (possibly improved) model;
    Set $k = k + 1$ and go to Start;
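The acceptance and radius-update decision of Algorithm 1 can be summarized in a short sketch (a simplified reading that omits the criticality step and model construction; parameter names follow the algorithm):

```python
def tr_decision(rho_hat, Delta, eta0, eta1, gamma, gamma_inc, Delta_max,
                fully_quadratic):
    """Return (accept_step, next_Delta) for one iteration's decision.

    rho_hat compares the estimated decrease m_k(x_k) - mhat_k(x_k + s_k)
    against the decrease m_k(x_k) - m_k(x_k + s_k) predicted by m_k.
    """
    accept = rho_hat >= eta1 or (rho_hat >= eta0 and fully_quadratic)
    if rho_hat >= eta1:   # successful iteration: expand the trust region
        next_Delta = min(gamma_inc * Delta, Delta_max)
    else:                 # otherwise the radius shrinks by the factor gamma
        next_Delta = gamma * Delta
    return accept, next_Delta

# A successful step (rho_hat >= eta1) is accepted and doubles the radius
ok, D = tr_decision(0.9, 1.0, eta0=1e-6, eta1=0.5,
                    gamma=0.5, gamma_inc=2.0, Delta_max=100.0,
                    fully_quadratic=False)
```

A step with $\eta_0 \le \hat\rho_k < \eta_1$ is accepted only when $m_k$ is certifiably fully quadratic with confidence $1-\delta_k$, which is exactly the "acceptable" iteration type defined below.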
Note that we are approximating $f(x_k + s_k)$ using a second model $\hat m_k$ in a different trust region, of radius $\hat\Delta_k$, around $x_k + s_k$. Formal convergence of the algorithm, specifically Lemma 4.15, requires the ability to approximate $f(x_k + s_k)$ with increasing accuracy as the algorithm progresses. Such accuracy is not available from a single realization of the noisy function value, namely $\bar f(x_k + s_k)$. While it is possible to obtain increasingly accurate approximations of $f(x_k + s_k)$ by repeated sampling, we are hoping the theory generated in this chapter can be easily transferred to the case where deterministic noise is present in $\bar f$. With deterministic noise, $\mathrm{Var}(\bar f) = 0$, and therefore repeated sampling will provide no further information.

Also, if we eventually shrink our trust region around a given point, points generated in $B(x_k + s_k; \hat\Delta_k)$ to make an accurate model $\hat m_k(x_k + s_k)$ can be used in the construction of an accurate model $m_j(x)$ at some later iterate $j$.

Algorithm 2: Criticality Step

    Initialization: Set $i = 0$. Set $m_k^{(0)} = m_k$.
    Repeat: Increment $i$ by one. Improve the previous model by adding points to $Y^{tot}$ until it is $\kappa$-fully quadratic with confidence $1 - \delta_k$ on $B(x_k; \omega^{i-1}\Delta_k)$. (This can be done by Theorem 4.10 and the model improvement algorithm from [3], which builds a strongly poised set $Y$ in $O\left(\frac{1}{\Delta^6}\right)$ steps if the models satisfy Definition 4.9.) Denote the new model $m_k^{(i)}$. Set $\tilde\Delta_k = \omega^{i-1}\Delta_k$ and $\tilde m_k = m_k^{(i)}$.
    Until $\tilde\Delta_k \le \mu\,\sigma_k^{(i)}(x_k)$.
    Return $m_k = \tilde m_k$, $\Delta_k = \min\left\{\max\left\{\tilde\Delta_k,\; \beta\,\sigma_k^{(i)}(x_k)\right\},\; \Delta_k\right\}$, and $Y^{tot}$.

We adopt the naming of iterates from Conn, Scheinberg, and Vicente:

1. $\hat\rho_k \ge \eta_1$: $x_k + s_k$ is accepted and the trust region is increased. We call these iterations successful.

2. $\eta_1 > \hat\rho_k \ge \eta_0$ and $m_k$ is $\kappa$-fully quadratic with confidence $1 - \delta_k$: $x_k + s_k$ is accepted but $\Delta_k$ is decreased. We call these iterations acceptable.
3. $\eta_1 > \hat\rho_k$ and $m_k$ is not $\kappa$-fully quadratic with confidence $1 - \delta_k$: $x_k + s_k$ is not accepted and the model is improved. We call these iterations model improving.

4. $\eta_0 > \hat\rho_k$ and $m_k$ is $\kappa$-fully quadratic with confidence $1 - \delta_k$: $x_k + s_k$ is not accepted and $\Delta_k$ is reduced. We call these iterations unsuccessful.

4.3 Convergence

4.3.1 Convergence to a First-order Stationary Point

The use of quadratic $m_k$ might suggest convergence to a second-order stationary point. Such a proof would require a quadratic $\hat m_k$ as well, and since $\hat\Delta < \Delta$, this would require more points in $B(x_k + s_k; \hat\Delta_k)$ than in $B(x_k; \Delta_k)$. Since it is frequently the case that $\bar f(x_k + s_k) > f(x_k)$ even when $m_k$ is f.q.w.c. $1 - \delta_k$, we find it wasteful to build a quadratic $\hat m_k$ around $x_k + s_k$. This is one of the motivations for $\hat\kappa$-fully linear models for $\hat m_k$; with this, we can prove convergence to a first-order stationary point.

We first show that if $x_k$ is not a stationary point of $f$, then Algorithm 2 will exit with probability 1.

Theorem 4.13. Given $\delta_k \in (0, 1)$, if $f$ satisfies Assumption 4.2 and $\|\nabla f(x_k)\| > \frac{2}{\mu}\omega^{j-1}\Delta_k$, there is probability at least $1 - \delta_k$ that Algorithm 2 will correctly exit on each iterate $i$ after $j$ such that $\omega^{i-1} < \sqrt{\frac{\omega^{j-1}}{\mu\kappa_{eg}}}$, where $\mu$ and $\omega$ are declared in Algorithm 1.

Proof: Assume $\omega^{i-1} < \sqrt{\frac{\omega^{j-1}}{\mu\kappa_{eg}}}$, $\Delta_k \le 1$, and Algorithm 2 cycles infinitely. After sufficiently many iterations of the criticality step, $m_k^{(i)}$ will be $\kappa$-fully quadratic with
confidence $1 - \delta_k$ on $B\left(x_k; \sqrt{\frac{\omega^{j-1}}{\mu\kappa_{eg}}}\,\Delta_k\right)$. Therefore
$$\sigma_k^{(i)} \ge \left\|g_k^{(i)}\right\| \ge \left\|\nabla f(x_k)\right\| - \left\|\nabla f(x_k) - g_k^{(i)}\right\|$$
$$> \frac{2}{\mu}\omega^{j-1}\Delta_k - \left\|\nabla f(x_k) - g_k^{(i)}\right\| \quad \text{(by assumption)}$$
$$\ge_{1-\delta_k} \frac{2}{\mu}\omega^{j-1}\Delta_k - \kappa_{eg}\left(\sqrt{\frac{\omega^{j-1}}{\mu\kappa_{eg}}}\,\Delta_k\right)^2 \quad \text{(by Definition 4.9)}$$
$$= \frac{2}{\mu}\omega^{j-1}\Delta_k - \frac{1}{\mu}\omega^{j-1}\Delta_k^2 \;\ge\; \frac{1}{\mu}\omega^{i-1}\Delta_k \quad \text{(since } \Delta_k \le 1 \text{ and } i \ge j\text{)}.$$
So for each $i$ after $j$ such that $\omega^{i-1} < \sqrt{\frac{\omega^{j-1}}{\mu\kappa_{eg}}}$, we have $1 - \delta_k$ confidence that Algorithm 2 will exit.

Since we require $\delta_k \to 0$ as $\Delta_k \to 0$, for any $\Delta_k > 0$, eventually $\delta_k$ will be small enough so that Algorithm 2 will exit with probability at least $1 - \delta_k$. In other words, this theorem ensures that the algorithm exits with probability 1. $\square$

Lemma 4.14. If $f$ satisfies Assumption 4.5 and $m_k$ is $\kappa$-fully quadratic with confidence $1 - \delta$, there exists a constant $\kappa_{bhm} > 0$ such that
$$\|H_k\| \le_{1-\delta} \kappa_{bhm}$$
for all $k$, where $H_k$ is the Hessian of $m_k$.

Proof:
$$\|H_k\| \le \left\|H_k - \nabla^2 f(x_k)\right\| + \left\|\nabla^2 f(x_k)\right\|$$
$$\le_{1-\delta} \kappa_{eh}\Delta_k + \left\|\nabla^2 f(x_k)\right\| \quad \text{(by Definition 4.9)}$$
$$\le \kappa_{eh}\Delta_k + \kappa_{bhf} \quad \text{(by Assumption 4.5)}$$
$$\le \kappa_{eh}\Delta_{\max} + \kappa_{bhf} =: \kappa_{bhm}. \;\square$$
Thefollowinglemmashowsthat,if x k isnotastationarypointof f ,thenif k issmallenough,thereisahighprobabilitythatasuccessfulstepwillbetaken. Lemma4.15 Let f satisfyAssumption4.5andletthetrustregionsubproblemsolutionsatisfyAssumption4.4.Let = ef ; eg ; eh ; m 2 and ^ =^ ef ; ^ eg ; ^ eh ; m 1 . Lettheconstants fcd , bhm , ef , ^ ef , 1 beasspeciedinAssumption4.4, Lemma4.14,Denition4.9,Denition4.11,anddeclaredatthebeginningofAlgorithm1respectively.If m k is f.q.w.c. 1 )]TJ/F19 11.9552 Tf 10.896 0 Td [( k on B x k ; k , ^ m k is ^ f.l.w.c. 1 )]TJ/F19 11.9552 Tf 10.897 0 Td [( k on B x k + s k ; ~ k ,and k min k g k k bhm ; fcd k g k k )]TJ/F19 11.9552 Tf 11.955 0 Td [( 1 2 ef max +2^ ef ; .4 thenwehavecondence 1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(3 k that k 1 onthe k thiteration. Proof: UsingLemma4.14,thefactthat x k + s k isnoworsethantheCauchy stepAssumption4.4,and k k g k k bhm .4yields m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k 1 )]TJ/F20 7.9701 Tf 6.586 0 Td [( fcd 2 k g k k min k g k k bhm ; k = fcd 2 k g k k k : .5 75
Using this,

$$|\hat\rho_k - 1| = \left|\frac{m_k(x_k) - \hat m_k(x_k+s_k)}{m_k(x_k) - m_k(x_k+s_k)} - \frac{m_k(x_k) - m_k(x_k+s_k)}{m_k(x_k) - m_k(x_k+s_k)}\right| = \left|\frac{m_k(x_k+s_k) - \hat m_k(x_k+s_k)}{m_k(x_k) - m_k(x_k+s_k)}\right|$$
$$\le \frac{|m_k(x_k+s_k) - f(x_k+s_k)|}{|m_k(x_k) - m_k(x_k+s_k)|} + \frac{|f(x_k+s_k) - \hat m_k(x_k+s_k)|}{|m_k(x_k) - m_k(x_k+s_k)|}$$
$$\le \frac{\kappa_{ef}\Delta_k^3}{|m_k(x_k) - m_k(x_k+s_k)|} + \frac{|f(x_k+s_k) - \hat m_k(x_k+s_k)|}{|m_k(x_k) - m_k(x_k+s_k)|} \quad \text{(by Definition 4.9)}$$
$$\le \frac{\kappa_{ef}\Delta_k^3 + 2\hat\kappa_{ef}\tilde\Delta_k^2}{|m_k(x_k) - m_k(x_k+s_k)|} \quad \text{(by Definition 4.11)}$$
$$\le \frac{2\big(\kappa_{ef}\Delta_k^3 + 2\hat\kappa_{ef}\tilde\Delta_k^2\big)}{\kappa_{fcd}\|g_k\|\Delta_k} \quad \text{(by (4.5))}$$
$$\le \frac{2(\kappa_{ef}\Delta_{\max} + 2\hat\kappa_{ef})\Delta_k}{\kappa_{fcd}\|g_k\|} \quad \text{(since } \tilde\Delta_k \le \Delta_k \le \Delta_{\max}\text{)}$$
$$\le 1 - \eta_1. \quad \text{(by (4.4))}$$

Since we have confidence $1-\alpha_k$ that each of the second, third, and fourth inequalities holds, we have confidence $1-3\alpha_k$ that all three hold simultaneously.

Lemma 4.16 For all $k$, assume the trust region subproblem solution satisfies Assumption 4.4. Let $f$ satisfy Assumption 4.5. If there exists a constant $\kappa_1$ such that $\|g_k\| \ge \kappa_1$ for all $k$, then there exists another constant $\kappa_2$ such that, for every iteration $k$ where $\Delta_k \le \kappa_2$, we have confidence $1-3\alpha_k$ that iteration $k$ will be successful and $\Delta_k$ will increase, if $m_k$ is $\kappa$-f.q.w.c. $1-\alpha_k$.
Proof: This proof is similar to [12, Lemma 10.7]. Whether Algorithm 2 has been called or not,
$$\Delta_k \ge \min\{\|g_k\|, \Delta_{k-1}\} \ge \min\{\kappa_1, \Delta_{k-1}\}.$$
Since $\|g_k\| \ge \kappa_1$ for all $k$, Lemma 4.15 implies that whenever $\Delta_k$ is less than
$$\kappa_3 = \min\left\{ \frac{\kappa_1}{\kappa_{bhm}},\; \frac{\kappa_{fcd}\kappa_1(1-\eta_1)}{2(\kappa_{ef}\Delta_{\max} + 2\hat\kappa_{ef})} \right\},$$
we have confidence $1-3\alpha_k$ that iteration $k$ will be successful ($\Delta_{k+1} = \gamma_{inc}\Delta_k$) or model improving ($\Delta_{k+1} = \Delta_k$). In either case $\Delta_{k+1} \ge \Delta_k$, so we have confidence $1-3\alpha_k$ that $\Delta_{k+1} \ge \Delta_k$ will hold whenever $\Delta_k \le \min\{\Delta_0, \kappa_1, \kappa_3\} = \kappa_2$.

Theorem 4.17 Let Assumptions 4.1–4.5 be satisfied. If the number of successful iterations is finite, then
$$\liminf_{k\to\infty} \|\nabla f(x_k)\| = 0$$
with probability 1.

Proof: Consider iterations after the last successful iteration, denoted $k_{last}$. For every $k > k_{last}$, the iteration is unsuccessful ($\rho_k < \eta_1$) and the model improvement algorithm is called. It takes a finite number, $O(1/\alpha_k^6)$, of model improvement steps for the model to become fully quadratic with confidence $1-\alpha_k$ on a given $B(x_k; \Delta)$; there are an infinite number of iterations that are either acceptable or unsuccessful. Given $\alpha_k$, we can guarantee that the trust region radius must decrease by at least one multiple of $\gamma \in (0, 1)$ after $C/\alpha_k^6$ iterations for a fixed constant $C$. For any $\bar\Delta > 0$, there exists an integer $N$ such that $\gamma^N \Delta_{k_{last}} < \bar\Delta$. After
$$\frac{C}{\alpha_{k_{last}}^6} + \cdots + \frac{C}{\big(\gamma^{N-1}\alpha_{k_{last}}\big)^6} = \sum_{i=1}^{N} \frac{C}{\big(\gamma^{i-1}\alpha_{k_{last}}\big)^6} = \frac{C}{\alpha_{k_{last}}^6}\cdot\frac{\gamma^{-6N} - 1}{\gamma^{-6} - 1}$$
iterations, the trust region radius will be less than $\bar\Delta$. Therefore, $\lim_{k\to\infty}\Delta_k = 0$, which implies $\alpha_k \to 0$. Therefore, there exists an infinite sequence of iterates $\{k_i\}$ where $m_{k_i}$ is $\kappa$-f.q.w.c. $1-\alpha_{k_i}$ and, with confidence $1-\alpha_{k_i}$,
$$\|\nabla f(x_{k_i})\| \le \|\nabla f(x_{k_i}) - g_{k_i}\| + \|g_{k_i}\| \le \kappa_{eg}\Delta_{k_i}^2 + \|g_{k_i}\|.$$
The second term converges to zero with probability 1. To see this, assume $\|g_{k_i}\|$ is bounded away from zero; we can then derive a contradiction using Lemma 4.15 and
the fact that $\lim_{k\to\infty}\Delta_k = 0$. Since $\Delta_{k_i} \to 0$ and $\alpha_{k_i} \to 0$, for $k_i$ sufficiently large, $\Delta_{k_i} \le \kappa_2$, so there is probability $1-3\alpha_{k_i}$ that iteration $k_i$ will be successful. Thus, for any $\bar\alpha_k > 0$ and $K > 0$, there exists $k_i > K$ such that the probability of step $k_i$ being successful is greater than $1-\bar\alpha_k$. Therefore, with probability 1, there are infinitely many successful iterates, contradicting the definition of $k_{last}$.

4.3.2 Infinitely Many Successful Steps

The results that follow outline parts of a proof for the case when Algorithm 1 generates infinitely many successful iterates. While the previous theorem proves $\Delta_k \to 0$, the proof is not valid when there are infinitely many successful iterates. We have made considerable effort to prove $\Delta_k \to 0$, but have been unable to do so. To progress, we assume it for the time being.

Assumption 4.18 $\lim_{k\to\infty}\Delta_k = 0$.

It should be noted that it may be possible to ensure Algorithm 1 satisfies this assumption, perhaps by slowly decreasing $\Delta_{\max}$. The details would need to be worked out, but this assumption is not as strong as it might appear.

Conjecture 4.19 If Assumption 4.18 and Assumption 4.5 hold and the trust region subproblem solution satisfies Assumption 4.4 for all $k$, then
$$\liminf_{k\to\infty}\|g_k\| = 0$$
with probability 1.

Discussion: If $\|g_k\| \ge \kappa_1$ for some $\kappa_1 > 0$, by Lemma 4.16, there exists a $\kappa_2$ such that whenever $\Delta_k < \kappa_2$, we will have a $1-3\alpha_k$ confidence of increasing the trust region. Using Assumption 4.18 and the fact that $\alpha_k = \min\{\Delta_k, 0.05\}$, we will increase the trust region with probability approaching 1 as $k$ gets large. This would appear to
contradict $\Delta_k \to 0$, but to prove almost sure convergence (assuming each iteration is independent) we must show that the product of the $1-\alpha_k$ approaches 1. And even the assumption that each iteration is independent is difficult, as many of the points used to build $m_k$ will be used to build $m_{k+1}$. If the events are dependent, then we must consider conditional probabilities, such as the probability of one step being a success given that the last step was not.

Conjecture 4.20 If Assumptions 4.1–4.5 and Assumption 4.18 hold, then for any subsequence $\{k_i\}$ such that
$$\lim_{i\to\infty}\|g_{k_i}\| = 0, \qquad (4.6)$$
we have, with probability 1,
$$\lim_{i\to\infty}\|\nabla f(x_{k_i})\| = 0.$$

Discussion: By (4.6), for $i$ sufficiently large, $\|g_{k_i}\| \le \epsilon_c$. Thus, by Theorem 4.13, Algorithm 2 ensures that the model $m_{k_i}$ is $\kappa$-f.q.w.c. $1-\alpha_k$ on the ball $B(x_{k_i}; \Delta_{k_i})$ with $\Delta_{k_i} \le \|g_{k_i}\|$ for all $i$ sufficiently large, provided $\nabla f(x_{k_i}) \ne 0$. By Definition 4.9, with confidence $1-\alpha_{k_i}$,
$$\|\nabla f(x_{k_i}) - g_{k_i}\| \le \kappa_{eg}\Delta_{k_i} \le \kappa_{eg}\|g_{k_i}\|.$$
Therefore,
$$\|\nabla f(x_{k_i})\| \le \|\nabla f(x_{k_i}) - g_{k_i}\| + \|g_{k_i}\| \le (\kappa_{eg} + 1)\|g_{k_i}\|.$$
Since $\|g_{k_i}\| \to 0$ with probability 1, so does $\|\nabla f(x_{k_i})\|$.

Conjecture 4.21 If Assumptions 4.1–4.5 and Assumption 4.18 hold, then
$$\liminf_{k\to\infty}\|\nabla f(x_k)\| = 0$$
with probability 1.

Discussion: By Conjecture 4.19, we know there must exist a sequence $\{k_i\}$ such that $\lim_{i\to\infty}\|g_{k_i}\| = 0$. By Conjecture 4.20, this same sequence $\{k_i\}$ has $\lim_{i\to\infty}\|\nabla f(x_{k_i})\| = 0$. This proves the result.
4.4 Computational Example

In this section, we highlight some of the advantages of using Algorithm 1 over a traditional trust region method which assumes deterministic function evaluations. While both algorithms have much in common, the slight differences become significant in the presence of stochastic noise. For example, the deterministic algorithms are susceptible to negative noise, as we see in Figure 4.1. In that figure, the solid line is the true function $f$ which we want to minimize, and the dashed black lines show the 95% confidence interval of the noise. The black squares mark the noisy function values which determine the quadratic trust region model $m_k$, and the trust region radius $\Delta_k$ is represented by the dashed lines. The trust region center $x_k$ has a red box around it.

4.4.1 Deterministic Trust Region Method

Figure 4.1 shows a traditional trust region method after moving to a new trust region center at $x_k = 2.5$. Each image shows the progress of the algorithm, and we describe what occurred in the previous iterate to yield the present situation:

Figure 4.1, top left: By chance, the realization of $\tilde f(x_k + s_k)$ was much less than $f$ at any point near $x_k + s_k$. It is now the new trust region center.

Figure 4.1, top right: The minimum of the quadratic model was not accepted since
$$\rho_k = \frac{\tilde f(x_k) - \tilde f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} < 0.$$
The trust region radius has also been shrunk since the sample set is strongly poised.

Figure 4.1, bottom left: A point outside of the trust region radius has been removed. Since $\rho_k < 0$, the trust region radius will shrink again.

Figure 4.1, bottom right: Another point outside of the trust region has been removed, and a new model has been built.
Figure 4.1: Several iterations of a traditional trust region method assuming deterministic function evaluations. The trust region center is never moved.

The deterministic algorithm will accept a new trust region center when $\rho_k$ is sufficiently positive, i.e., if $\tilde f(x_k + s_k)$ is also much less than $f(x_k + s_k)$. If this does not happen, the algorithm will not find a successful step and the trust region radius will be repeatedly decreased. Since $\tilde f(x_k)$ is never reevaluated, it is likely that the algorithm will terminate without ever taking a further step.

4.4.2 Stochastic Trust Region Method

In contrast, using $\hat\rho_k$ introduced in Section 4.2,
$$\hat\rho_k = \frac{m_k(x_k) - \hat m_k(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)},$$
and increasing the number of points in the trust region as $\Delta_k$ decreases allow the algorithm to move off of a trust region center with negative noise, as seen in Figure 4.2.
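The contrast between $\rho_k$ and $\hat\rho_k$ can be made concrete with a small numerical sketch. The quadratic $f(x) = x^2$, the step, and the noise realizations below are invented for illustration; they are not the values behind Figures 4.1 and 4.2.

```python
def m(x):
    """Quadratic trust-region model; here it coincides with the true f(x) = x^2."""
    return x * x

x_k, s_k = 2.5, -1.0
pred = m(x_k) - m(x_k + s_k)      # predicted decrease: 6.25 - 2.25 = 4.0

# Stored noisy realization at the center: "negative noise" made x_k look far
# better than its true value f(x_k) = 6.25.
f_tilde_center = 1.25
# Fresh noisy realization at the trial point (true value 2.25).
f_tilde_trial = 2.55

# Deterministic ratio: the stale, lucky value at x_k is never reevaluated, so
# even a genuinely good step is rejected.
rho = (f_tilde_center - f_tilde_trial) / pred        # negative -> step rejected

# Ratio from Section 4.2: model values replace the noisy realizations, letting
# the estimate of f(x_k) adjust as the models are refined.
m_hat_trial = 2.25                                   # value of m_hat_k(x_k + s_k)
rho_hat = (m(x_k) - m_hat_trial) / pred              # 1.0 -> step accepted

print(rho, rho_hat)
```

With these numbers $\rho_k < 0$ while $\hat\rho_k = 1$, which is exactly the stagnation-versus-progress behavior contrasted in Figures 4.1 and 4.2.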
Figure 4.2: Several iterations of a trust region method assuming stochastic function evaluations.

Figure 4.2, top left: Again, the realization of $\tilde f(x_k + s_k)$ was much less than $f$ at any point near $x_k + s_k$. It is now the new trust region center.

Figure 4.2, top right: The minimum of the quadratic model was not accepted since $\hat\rho_k < 0$, but the trust region radius is not decreased. Though the sample set is strongly poised, there are not enough points to ensure the model is $\kappa$-f.q.w.c. $1-\alpha_k$.

Figure 4.2, bottom left: More points have been added to the sample set and the model has been reconstructed.

Figure 4.2, bottom right: The minimum of the quadratic model is accepted since $\hat\rho_k > \eta_1$, even though $\tilde f(x_k + s_k) > \tilde f(x_k)$.
Using the model value at $x_k$ instead of $\tilde f(x_k)$ in the calculation of $\hat\rho_k$ allows the estimate of $f(x_k)$ to adjust without wastefully reevaluating $\tilde f(x_k)$. In this fashion, Algorithm 1 can avoid stagnating at points with negative noise.

4.5 Conclusion

In this chapter we presented an algorithm using quadratic trust region models $m_k$ to minimize a function $f$ which cannot be evaluated exactly. Even though the algorithm only has access to noise corrupted function evaluations $\tilde f$, we proved almost sure convergence of a subsequence of iterates to a first-order stationary point of $f$ when the number of successful steps is finite. We also have outlined a proof for the case when the number of successful steps is infinite. These results were accomplished, not by repeatedly sampling $\tilde f$ at points of interest $x_k$, but rather by constructing models $\hat m_k$ which are increasingly accurate approximations of $f$ near $x_k + s_k$. Since it is often the case that $x_k + s_k$ is the candidate for the new trust region center, this information is immediately useful in constructing $m_{k+1}$. We then highlighted how this algorithm remedies a common problem with using traditional trust region methods on functions with stochastic noise.
5. Non-intrusive Termination of Noisy Optimization

5.1 Introduction and Motivation

The optimization of real-world, computationally expensive functions invariably leads to the difficult question of when an optimization procedure should be terminated. Algorithm developers and the mathematical optimization community at large typically assume that the optimization is terminated when either a measure of criticality (gradient norm, mesh size, etc.) is satisfied or a user's computational budget (number of evaluations, wall clock time, etc.) is exhausted.

For a large class of problems, however, the user may not have a well-defined computational budget and instead demand a termination test $t$ solving
$$\min_t \;\text{Computational expense}(t) \quad \text{s.t.} \;\text{Acceptable accuracy of the solution}(t), \qquad (5.1)$$
with the criticality measure of the solver employed typically chosen with the accuracy constraint in mind. Examples of such accuracy-based criticality tests are discussed in detail by Gill, Murray, and Wright [19, Section 8.2.3].

The main difficulties arising from this approach are a result of (5.1) possibly being poorly formulated. The computational expense could be unbounded because an a priori user-defined accuracy is unrealistic for the problem/solver pair. Furthermore, a user may have difficulty translating the criticality measures provided by a solver, which are generally based on assumptions of smoothness and infinite precision calculations, into practical metrics on the solution accuracy.

In Figure 5.1 we illustrate the challenges in this area with an example from nuclear physics, similar to the minimization problems considered in [37]. Each of the function values shown is obtained from running a deterministic simulation for one minute on a 640-core cluster. Stopping the optimization sooner than 200 function evaluations would not only return a solution faster but would also free the cluster for
other applications and/or result in a savings in energy, an increasingly crucial factor in high-performance computing.

Figure 5.1: Part of a noisy trajectory of function values for an expensive nuclear physics problem. After more significant decreases in the first 70 evaluations, progress begins to stall.

If we assume that the optimization partially shown in Figure 5.1 has not been terminated by a solver's criticality measures or a user's computational budget, the question is then whether termination should occur for other reasons. For example, if only the first three digits of the simulation output were computed stably, one may want to terminate the optimization sooner than if computational noise corrupted only the eighth digit of the output. Alternatively, the behavior shown could mean the solver in question has stagnated (because of noise, errors in the simulation, a limitation of the solver, etc.), and hence examining the solution and/or restarting the optimization could be a more effective use of the remaining computational budget. Wright [65] refers to this stalled progress as perseveration and notes that there is "no fully general way to define 'insufficient progress.'" Even so, it may be advantageous to use knowledge of the uncertainty or accuracy of a given function evaluation when making such a decision.
In the remainder of this chapter we explore these issues and propose termination criteria that can be easily incorporated on top of a user's solver of choice. In [18], Fletcher summarizes the challenges at hand in the case of round-off errors alone:

    Some consideration has to be given to the effects of round-off near the solution, and to terminate when it is judged that these effects are preventing further progress. It is difficult to be certain what strategy is best in this respect.

Moreover, Gill, Murray, and Wright [19] stress that

    no set of termination criteria is suitable for all optimization problems and all methods.

This sentiment is shared by Powell [47] who says

    it is believed that it is impossible to choose such a convergence criterion which is effective for the most general function ... so a compromise has to be made between stopping the iterative procedure too soon and calculating f an unnecessarily large number of times.

Consequently, we will consider tests that allow for the use of estimates of the noise particular to a problem. Furthermore, our criteria are not intended as substitutes for a computational budget or a solver's built-in criticality tests, which we consider to be important safeguards. Likewise, the termination problem can be viewed as a real-time control problem depending on complete knowledge of the solver's decisions, but we resist this urge for purposes of portability and applicability.

We provide background on previous work and introduce notation in Section 5.2. The families of stopping tests we propose in Section 5.3 do not provide guarantees on the quality of the solution, although doing so may be the role of a solver's built-in criteria. Instead, the proposed tests are parameterized in order to quantify a
user's tradeoff between the benefit of achieving additional decrease and the cost of additional evaluations, while requiring a minimal amount of information from the solver. Equally important, our results in Section 5.4 compare the quality of these families of stopping tests on a collection of local optimization algorithms. We first consider all solvers as a single routine, later validating this approach by demonstrating equal performance for the best tests on individual algorithms. While our results can be incorporated in a local subroutine of any global search algorithm, the tests proposed in Section 5.3 are unable to distinguish between exploration and refinement phases in their current form. We summarize our results in Section 5.5 and provide recommendations when implementing these tests.

5.2 Background

Our discussion and analysis are limited to optimization methods that do not explicitly require derivative information. However, other algorithms could readily employ the tests proposed here in addition to their derivative-based stopping criteria. While our work can be further extended to incorporate noisy gradient information, the derivatives of noisy functions are typically even noisier than the function.

Derivative-free optimization methods are often favored for their perceived ability to handle noisy functions. Although asymptotic convergence of these methods is generally proved assuming a smooth function, adjustments are frequently made to accommodate noise. In the case of stochastic functions, where noise results from a random distribution with $\mathrm{Var}(\tilde f(x)) > 0$, replications of function evaluations can be used to modify existing methods (e.g., [14] modifying UOBYQA [48], [15, 1] modifying DIRECT [30], and [61] modifying Nelder-Mead (see, e.g., [12])). However, stopping criteria for these methods involve limited knowledge of the noise and indicate the wide variety of stopping tests used in practice. In [1], optimization is stopped when adjacent points are within $10^{-4}$ of each other, whereas [15] allows stopping when the best function value has not been improved after some number of consecutive
iterations. To limit the number of stochastic replications, the authors of [14] and [61] adjust the maximum number of allowed replications at a particular point based on the variance of the noise.

Deterministic noise (that is, noise that results from a deterministic process, such as finite-precision arithmetic, iterative methods, and adaptive procedures) is far less understood than its stochastic counterpart [42]. Not surprisingly, even less knowledge of the magnitude of noise is used for problems with deterministic objectives. When low-amplitude noise is present, Kelley [33] proposes a restart technique for Nelder-Mead but terminates when sufficiently small differences exist in the simplicial function values, independent of the magnitude of the noise. Implicit filtering [32] has numerous termination possibilities (small function value differences on a stencil, a small change in the best function value from one iteration to the next, etc.) but none that are explicitly related by the author to the magnitude of the noise. A similar implicit relationship to noise can be seen in [24], where treed Gaussian process models for optimization are terminated when a maximum improvement statistic is sufficiently small. The authors of SNOBFIT [29] suggest stopping when the best point has not changed for a number of consecutive calls to the main SNOBFIT algorithm.

Our work more closely follows that of Gill et al. [19], where Section 8.2 is devoted to properties of the computed solution. The authors there recommend terminating Nelder-Mead-like algorithms when the maximum difference between function values on the simplex is less than a demanded accuracy weighted by the best function value on the simplex.
methodonanODEbasedproblemwhenconsecutivedecreasesarelessthanafactor ofthenoiselevel.Theauthorsof[25]perturbboundconstrainedproblemssothe incumbentiterateistheexactsolutiontothisnewproblem.Analgorithmcanthen beterminatedwhenthesizeofthisperturbationrstdecreasesbelowtheerrorin theproblem.Naturalextensionstogradient/derivativebasedtestsarealsoenabled bytherecentworkin[43]wherenearoptimalnitedierenceestimatesareprovided asafunctionofthenoiselevel. Beforeproceeding,wedenethenotationemployedthroughout.Welet R + denotethenonnegativerealsand N denotethenaturalnumbers.Welet f x 1 ; ;x m g R n and f f 1 ; ;f m g2 R beasequenceofpointsandcorrespondingfunctionvalues producedbyalocalminimizationsolver,andwecollectthedatafromtherst i evaluationsin F i = f x 1 ;f 1 ;:::; x i ;f i g .Thebestfunctionvalueintherst i evaluationsisgivenby f i =min 1 j i f f j g ,with x i denotingthepointcorrespondingto f i . Accordingly,thesequence f f i g isnonincreasing.Unlessotherwisestated, kk denotes thestandardEuclideandistance. Welet^ " i r beanestimateoftherelativenoiseat f i i.e.,thenoiseat x i scaledby themagnitudeof f x i .Thisestimatemaycomefromexperience,numericalanalysis oftheunderlyingprocessesincomputing f i ,orappropriatescalingby1 = j f i j ofthe noiselevelestimatesfromthemethodproposedin[42].Inthecaseofstochastic functionswithnonzeromeanat x i ,^ " i r isthestandarddeviationof f x i relativeto theexpectedvalue E [ j f x i j ]. Favorablepropertiesofaterminationtestincludescaleandshiftinvariance,so thatthetestwouldterminateafterthesamenumberofevaluationsforanyane transformationoftheobjectivefunction.Specically,atestisscaleinvariantin f if itterminatesoptimizationrunsdenedby F i and F i f x 1 ;f 1 ;:::; x i ;f i g at anidenticalevaluationnumberforany > 0.Similarly,atestisshiftinvariantin f ifitterminates F i and F i + f x 1 ;f 1 + ;:::; x i ;f i + g afteranidentical 89
number of evaluations regardless of $\lambda$. We use the following proposition to aid in the subsequent analysis of scale and shift invariance.

Proposition 5.1 For stochastic $f_i$ of finite, nonzero variance, the estimate of the relative noise, $\hat\varepsilon^i_r$, is scale invariant in $f$, and the absolute noise, $\hat\varepsilon^i_r\,\mathrm{E}[|\tilde f(x_i)|] = \sqrt{\mathrm{Var}(\tilde f(x_i))}$, is shift invariant in $f$.

Proof: Both results follow directly from properties of $\mathrm{Var}(\cdot)$. Since $\mathrm{Var}(\tilde f(x_i)) > 0$, it follows that $\mathrm{E}[|\tilde f(x_i)|] > 0$ and hence, for $\lambda > 0$,
$$\hat\varepsilon^i_r = \frac{\sqrt{\mathrm{Var}(\lambda\tilde f(x_i))}}{\mathrm{E}[|\lambda\tilde f(x_i)|]} = \frac{\sqrt{\lambda^2\,\mathrm{Var}(\tilde f(x_i))}}{\lambda\,\mathrm{E}[|\tilde f(x_i)|]} = \frac{\sqrt{\mathrm{Var}(\tilde f(x_i))}}{\mathrm{E}[|\tilde f(x_i)|]}.$$
When defined by the standard deviation, the absolute noise is shift invariant in $f$ because $\sqrt{\mathrm{Var}(\tilde f(x_i))} = \sqrt{\mathrm{Var}(\tilde f(x_i) + \lambda)}$. In the case of deterministic noise, invariance depends on the methods used to obtain the estimates $\hat\varepsilon^i_r$ and $\hat\varepsilon^i_r|f_i|$.

5.3 Stopping Tests

In this section we define families of termination tests and provide motivation for their use. Each family can be defined through an extended value function $\phi$ mapping to $\mathbb{R}\cup\{+\infty\}$. Given $\mathcal{F}$, the problem data from an optimization run, the associated termination test stops after $i^*_{\phi(\mathcal{F})}$ evaluations, where $i^*_{\phi(\mathcal{F})}$ is the solution to a hitting problem,
$$i^*_{\phi(\mathcal{F})} = \min_i\,\{\,i : \phi(\mathcal{F}_i; \mu(\mathcal{F}_i), \theta) \le 0\,\}, \qquad (5.2)$$
where $\mu(\mathcal{F}_i)$ and $\theta$ denote parameters that are possibly dependent on $\mathcal{F}_i$ and independent of $\mathcal{F}_i$, respectively. Hence, a test stops an optimization run, producing the history $\mathcal{F}_i$, at the first $i \in \mathbb{N}$ where $\phi(\mathcal{F}_i; \mu(\mathcal{F}_i), \theta) \le 0$. Members of a family of tests are determined by different values of the parameter vector $(\mu(\mathcal{F}_i), \theta)$. Since $\phi$ quantifies the progress of an algorithm through the history of function values and points, each family of tests is designed to determine when continuing with the present course is likely wasteful as measured by the parameters in $(\mu(\mathcal{F}_i), \theta)$.
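As a minimal sketch (the function and variable names are ours, not the thesis's), the hitting problem (5.2) is simply a first-crossing search over prefixes of the history $\mathcal{F}$:

```python
import math

def stopping_index(history, phi):
    """Return the smallest i with phi(F_i) <= 0, or None if the test never fires.

    history is the full run F = [(x_1, f_1), ..., (x_m, f_m)]; phi maps a
    prefix F_i to a value in R union {+inf}.
    """
    for i in range(1, len(history) + 1):
        if phi(history[:i]) <= 0.0:
            return i
    return None

# Example: the max-budget test phi_5 of Section 5.3.5 with n = 4 evaluations.
def phi5(prefix, n=4):
    return 0.0 if len(prefix) >= n else math.inf

run = [((float(j),), 1.0 / j) for j in range(1, 10)]  # a toy history F
print(stopping_index(run, phi5))                      # -> 4
```

Any of the families defined below can be plugged in as `phi`; only the history of points and function values is required, matching the non-intrusive design goal.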
It is often useful to consider how a test will change if the underlying function undergoes an affine change. Following Section 5.2, we will say that a test is scale invariant if
$$i^*_{\phi(\mathcal{F})} = \min_i\,\{\,i : \phi(\lambda\mathcal{F}_i; \mu(\lambda\mathcal{F}_i), \theta) \le 0\,\} \quad \forall\lambda > 0,$$
where $\lambda\mathcal{F}_i \equiv \{(x_1, \lambda f_1), \ldots, (x_i, \lambda f_i)\}$, and shift invariant if
$$i^*_{\phi(\mathcal{F})} = \min_i\,\{\,i : \phi(\mathcal{F}_i + \lambda; \mu(\mathcal{F}_i + \lambda), \theta) \le 0\,\} \quad \forall\lambda,$$
where $\mathcal{F}_i + \lambda \equiv \{(x_1, f_1 + \lambda), \ldots, (x_i, f_i + \lambda)\}$. Similar affine changes to $\{x_1, \ldots, x_i\}$ could be considered but are not central to the present discussion, and hence all notions of invariance here are relative to the function $f$. We drop the subscript from $i^*_{\phi(\mathcal{F})}$ if $\mathcal{F}$ is implied.

Similarly, it is useful to consider whether $\phi$ is monotone in some of its parameters. Monotonicity of $\phi$ is desirable because it results in the same form of monotonicity for the corresponding number of evaluations $i^*$. For example, if $\phi$ is monotonically increasing in a scalar parameter $\theta$, then increasing $\theta$ results in a more conservative test because the solution to (5.2) is at least as large. As a consequence, if $\phi$ is monotonically increasing in $\theta$ and the test $\phi(\cdot;\cdot,\theta_1)$ is never satisfied on a set of problems, it is not necessary to consider $\theta > \theta_1$ values on that set of problems because the test will also never be satisfied.

We now define several families of termination tests and discuss their properties and underlying motivation. All of these tests assume no knowledge of the inner workings of the algorithm they are terminating, but such knowledge might lead to appropriate modifications. For example, if the method uses a simplex, rather than stopping when the last $n$ function evaluations are within a factor of the noise, one could stop when the last $n$ simplex vertices are within a factor of the noise (essentially a modification of the rule proposed in [19]).

5.3.1 $f^*_i$ Test
$$\phi_1(\mathcal{F}_i; \mu(\mathcal{F}_i), \alpha, n) = \begin{cases} f^*_{i-n+1} - f^*_i - \alpha|f^*_i|\mu(\mathcal{F}_i) & \text{if } i \ge n,\\ +\infty & \text{else,}\end{cases} \quad \text{with } \mu(\mathcal{F}_i), \alpha \in \mathbb{R}_+,\; n \in \mathbb{N}. \qquad (5.3)$$

This family of tests is designed to stop when the change in the best function value over the last $n$ evaluations is less than $\alpha|f^*_i|\mu(\mathcal{F}_i)$. The integer $n$ can be thought of as a backward difference parameter for estimating the change in the best function value with respect to the number of evaluations.

We note that $\phi_1$ is monotonically decreasing in $\alpha$ since, for fixed $n$, $\mu(\mathcal{F}_i)$, and $\mathcal{F}_i$,
$$\alpha_1 \le \alpha_2 \implies \phi_1(\mathcal{F}_i; \mu(\mathcal{F}_i), \alpha_1, n) \ge \phi_1(\mathcal{F}_i; \mu(\mathcal{F}_i), \alpha_2, n).$$
$\phi_1$ is also monotonically decreasing in $\mu(\mathcal{F}_i)$ but is not monotone in $n$. Members of this family are scale invariant provided that $\mu(\mathcal{F}_i)$ is, and shift invariant provided that $|f^*_i|\mu(\mathcal{F}_i)$ is (proven in Theorem 5.2 in Section 5.3.6).

We consider two special cases. When $\mu(\mathcal{F}_i) = 1$ (or any constant), we obtain tests that are scale invariant but not shift invariant and stop if the average relative change in the best function value drops below $\alpha$. If $\mu(\mathcal{F}_i) = \hat\varepsilon^i_r$, the tests are scale and shift invariant by Proposition 5.1 and stop an algorithm if the average relative change in $f^*$ becomes less than a factor $\alpha$ times the relative noise.

5.3.2 Max Difference f Test

With $\mu(\mathcal{F}_i), \alpha \in \mathbb{R}_+$ and $n \in \mathbb{N}$, define
$$\phi_2(\mathcal{F}_i; \mu(\mathcal{F}_i), \alpha, n) = \begin{cases} \displaystyle\max_{i-n+1\le j\le i} |f_j - f^*_i| - \alpha|f^*_i|\mu(\mathcal{F}_i) & \text{if } i \ge n,\\ +\infty & \text{else.}\end{cases} \qquad (5.4)$$

This family of tests stops when $n$ consecutive function values are within $\alpha|f^*_i|\mu(\mathcal{F}_i)$ of $f^*_i$.
One can show that $\phi_2$ is monotonically decreasing in both $\alpha$ and $\mu(\mathcal{F}_i)$ and monotonically increasing in $n$ since
$$n_1 \le n_2 \implies \max_{i-n_1+1\le j\le i}|f_j - f^*_i| \le \max_{i-n_2+1\le j\le i}|f_j - f^*_i|.$$
We also note that if $\phi_2$ is modified so that $f_j$ is replaced by $f^*_j$, we obtain a test equivalent to $\phi_1(\mathcal{F}_i; \mu(\mathcal{F}_i), \alpha, n)$. The invariance properties of this test are identical to those for $\phi_1$ and are formally proven in Theorem 5.2 in Section 5.3.6.

We examine two special cases. If $\mu(\mathcal{F}_i) = 1$ (or any constant), $\phi_2$ is scale invariant but not shift invariant; this family, $\phi_2(\mathcal{F}_i; 1, \alpha, n)$, terminates when the last $n$ function values differ by less than a factor $\alpha$ relative to the best function value so far. If $\mu(\mathcal{F}_i) = \hat\varepsilon^i_r$, the resulting tests are scale and shift invariant by Proposition 5.1 and terminate when the absolute change in the last $n$ function values is within a factor $\alpha$ of the noise of $\tilde f$.

5.3.3 Max Distance x Test

$$\phi_3(\mathcal{F}_i; \alpha, n) = \begin{cases} \displaystyle\max_{i-n+1\le j,k\le i} \|x_j - x_k\| - \alpha & \text{if } i \ge n,\\ +\infty & \text{else,}\end{cases} \quad \text{with } \alpha \in \mathbb{R}_+,\; n \in \mathbb{N}. \qquad (5.5)$$

This family stops when $n$ consecutive $x$ values are within a distance $\alpha$ of each other and is analyzed with $\phi_4$ below.

5.3.4 Max Distance $x^*_i$ Test

$$\phi_4(\mathcal{F}_i; \alpha, n) = \begin{cases} \displaystyle\max_{i-n+1\le j\le i} \|x_j - x^*_i\| - \alpha & \text{if } i \ge n,\\ +\infty & \text{else,}\end{cases} \quad \text{with } \alpha \in \mathbb{R}_+,\; n \in \mathbb{N}. \qquad (5.6)$$

This family stops when the last $n$ points are within a distance $\alpha$ of $x^*_i$. In general, members of both of the families defined by $\phi_3$ and $\phi_4$ are not scale (shift)
invariant unless the procedure generating $\{x_i\}_i$ is scale (shift) invariant in $f$. Both $\phi_3$ and $\phi_4$ are monotonically decreasing in $\alpha$ and monotonically increasing in $n$. We examined a test using $\max_{i-n+1\le j\le i}\|x_j - x_i\|$ but found the performance to be similar to that of $\phi_3$.

5.3.5 Max Budget Test

$$\phi_5(\mathcal{F}_i; n) = \begin{cases} 0 & \text{if } i \ge n,\\ +\infty & \text{else,}\end{cases} \quad \text{with } n \in \mathbb{N}. \qquad (5.7)$$

As a point of reference, we include the family corresponding to stopping after a budget of $n$ evaluations.

5.3.6 Tests Based on Estimates of the Noise

The families of tests introduced above have been broadly parameterized to capture a wide range of behaviors. The following theorem summarizes the relationship between some of these parameters and invariance properties of the tests.

Theorem 5.2
(a) If $\mu(\mathcal{F}_i)$ is scale invariant, then all members of the families $\phi_1$ and $\phi_2$ are scale invariant.
(b) If $|f^*_i|\mu(\mathcal{F}_i)$ is shift invariant, then all members of the families $\phi_1$ and $\phi_2$ are shift invariant.
(c) If the procedure generating the data $\{x_i\}$ is scale (shift) invariant, then all members of the families $\phi_3$ and $\phi_4$ are scale (shift) invariant.
(d) Members of the $\phi_5$ family are scale and shift invariant.

Proof: (a) Since $\alpha$ and $n$ are independent of $\mathcal{F}$, the definitions in (5.3) and (5.4) show that, for any $\lambda > 0$,
$$\phi_1(\lambda\mathcal{F}_i; \mu(\lambda\mathcal{F}_i), \alpha, n) = \lambda f^*_{i-n+1} - \lambda f^*_i - \alpha\lambda|f^*_i|\mu(\lambda\mathcal{F}_i)$$
$$\phi_2(\lambda\mathcal{F}_i; \mu(\lambda\mathcal{F}_i), \alpha, n) = \max_{i-n+1\le j\le i}|\lambda f_j - \lambda f^*_i| - \alpha\lambda|f^*_i|\mu(\lambda\mathcal{F}_i),$$
provided that $i \ge n$. Scale invariance of $\mu(\mathcal{F}_i)$ implies that $\mu(\lambda\mathcal{F}_i) = \mu(\mathcal{F}_i)$ and hence that $\phi_j(\lambda\mathcal{F}_i; \mu(\lambda\mathcal{F}_i), \alpha, n) = \lambda\phi_j(\mathcal{F}_i; \mu(\mathcal{F}_i), \alpha, n)$, for $j = 1, 2$. Since $\lambda$ is positive, the values of $i$ for which $\phi_j$ and $\lambda\phi_j$ change sign are identical, showing that $i^*_{\phi(\lambda\mathcal{F})} = i^*_{\phi(\mathcal{F})}$.

(b) Similarly, for any $\lambda$ and $i \ge n$, we have that
$$\phi_1(\mathcal{F}_i+\lambda; \mu(\mathcal{F}_i+\lambda), \alpha, n) = f^*_{i-n+1} - f^*_i - \alpha|f^*_i+\lambda|\mu(\mathcal{F}_i+\lambda)$$
$$\phi_2(\mathcal{F}_i+\lambda; \mu(\mathcal{F}_i+\lambda), \alpha, n) = \max_{i-n+1\le j\le i}|f_j - f^*_i| - \alpha|f^*_i+\lambda|\mu(\mathcal{F}_i+\lambda),$$
and hence if $|f^*_i+\lambda|\mu(\mathcal{F}_i+\lambda) = |f^*_i|\mu(\mathcal{F}_i)$, then $\phi_j(\mathcal{F}_i+\lambda; \mu(\mathcal{F}_i+\lambda), \alpha, n) = \phi_j(\mathcal{F}_i; \mu(\mathcal{F}_i), \alpha, n)$ for $j = 1, 2$. As a result, the functions in the hitting problem (5.2) remain unchanged and $i^*_{\phi(\mathcal{F}+\lambda)} = i^*_{\phi(\mathcal{F})}$.

(c) Both $\phi_3$ and $\phi_4$ depend on $\mathcal{F}$ only through the locations of the evaluated points, $x_i$. Hence, if the procedure for generating $x_i$ produces the same points for $\lambda\mathcal{F}$ (resp. $\mathcal{F}+\lambda$) as it would for $\mathcal{F}$, the tests are scale (resp. shift) invariant.

(d) The function $\phi_5$ is independent of $\mathcal{F}$, and hence the hitting problem (5.2) is unaffected by changes in $\mathcal{F}$.

As a consequence of Theorem 5.2 and Proposition 5.1, using $\mu(\mathcal{F}_i) = \hat\varepsilon^i_r$ as an estimate of the noise in $\phi_1$ and $\phi_2$ results in tests that are both scale and shift invariant. Furthermore, we see that the first term in the definitions of $\phi_1$ and $\phi_2$ has a strong association with the magnitude of the noise. This feature is illustrated in Figure 5.2, where each line represents one instance of a Nelder-Mead method minimizing a 10-dimensional convex quadratic for increasing levels of noise. Figure 5.2 (top) shows the first term of $\phi_1$, $f^*_{i-n+1} - f^*_i$, as the algorithm progresses. Here we see that the quantity generally flattens out at increments separated by the same order of magnitude as the seven noise levels. This behavior is even more evident in Figure 5.2 (bottom) when the first term of $\phi_2$, $\max_{i-n+1\le j\le i}|f_j - f^*_i|$, is considered.
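Under the definitions above, the scale- and shift-invariant members of $\phi_1$ and $\phi_2$ with $\mu(\mathcal{F}_i) = \hat\varepsilon^i_r$ can be sketched as below. The relative-noise estimate is simply passed in as a number, standing in for an estimate obtained as in [42], and the sample run is fabricated for illustration.

```python
import math

def phi1(fvals, alpha, n, eps_rel):
    """f*-test (5.3): change in the best value over the last n evaluations,
    minus alpha * |f*_i| * eps_rel.  fvals = [f_1, ..., f_i]."""
    i = len(fvals)
    if i < n:
        return math.inf
    best = [min(fvals[:j + 1]) for j in range(i)]   # running best values f*_1..f*_i
    # best[i - n] is f*_{i-n+1} in the 1-based indexing of the text.
    return best[i - n] - best[i - 1] - alpha * abs(best[i - 1]) * eps_rel

def phi2(fvals, alpha, n, eps_rel):
    """Max-difference-f test (5.4): nonpositive once the last n function
    values are all within alpha * |f*_i| * eps_rel of f*_i."""
    i = len(fvals)
    if i < n:
        return math.inf
    f_star = min(fvals)
    spread = max(abs(fj - f_star) for fj in fvals[i - n:])
    return spread - alpha * abs(f_star) * eps_rel

# A run that stalls once changes reach the noise level (~1e-3 relative to f* = 1):
fvals = [10.0, 4.0, 2.0, 1.001, 1.002, 1.0, 1.001, 1.002]
print(phi1(fvals, alpha=5.0, n=3, eps_rel=1e-3) <= 0)   # True: progress stalled
print(phi2(fvals, alpha=5.0, n=3, eps_rel=1e-3) <= 0)   # True: within the noise
```

Plugging either function into the hitting problem (5.2) yields the stopping index; on the fabricated run above both tests fire once the trailing values differ only at the noise level.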
Figure 5.2: First terms in $\phi_1$ (top, with $n = 100$) and $\phi_2$ (bottom, with $n = 10$) on a $\log_{10}$ scale when minimizing a 10-dimensional convex quadratic with stochastic relative noise of different magnitudes. The asymptotes of the quantities shown tend to be separated by the differences in magnitudes of the noise.

Consequently, in the numerical tests in Section 5.4, we restrict our attention to tests based on $\phi_1$ and $\phi_2$ for which $\mu(\mathcal{F}_i) = \hat\varepsilon^i_r$. We note that a larger $n$ is required in Figure 5.2 (top) to prevent the first term in $\phi_1$ from prematurely taking a zero value; dependence on parameters like $n$ is discussed further in Section 5.4. We examined plots similar to those in Figure 5.2 for the first terms of $\phi_3$ and $\phi_4$ but found no such relationship with the noise level. As a result, we have chosen not to include constants of the form $\mu(\mathcal{F}_i)$ in the definitions of $\phi_3$ and $\phi_4$.

5.3.7 Relationship to Loss Functions
Ideally, an algorithm should stop when the cost of performing additional function evaluations outweighs additional improvements in the function value. When such a tradeoff can be quantified, this problem becomes one of optimal stopping [54]. Results in the literature typically focus on cases when the distribution of the stochastic improvement is known. We briefly illustrate a connection between our tests and a simple loss function employed in optimal stopping.

We focus on the case when the cost of an additional evaluation is constant. This can be viewed as treating the computational expense per function evaluation as constant, but the cost and the tests proposed here could be suitably modified as an algorithm enters a subdomain where the cost of an evaluation changes. Given a sequence $\{f_j\}$, the loss function
$$L(i; c) = \min_{1\le j\le i}\{f_j\} + c\,i \qquad (5.8)$$
provides a measure of the success of stopping after $i$ evaluations when the cost per evaluation relative to $f$ is $c$. This loss function appears in the optimal stopping literature as the house selling problem [6], where $\{f_j\}$ are assumed to be independent and identically distributed random variables.

Figure 5.3 shows the minimizer of $L(\cdot; c)$ for a variety of $c$ values on a sequence $\{f_j\}_{j=1}^{3000}$ output by a direct search solver on a nonlinear function with deterministic (left) and stochastic (right) noise. We compare this minimizer with the number of evaluations $i^*$ defined by (5.2) for the family $\phi_1$ when $c$ is used as a linear multiplier for the parameter $\alpha$. Figure 5.3 shows a strong connection between the behavior of $\arg\min_i L(i; c)$ and the termination test defined by $\phi_1$ using an estimate of the noise and an appropriate choice of the parameters $(\alpha, n)$. This illustrates how varying the parameters in the proposed families can be closely related to the cost of performing an evaluation.

5.4 Numerical Experiments
PAGE 109
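To make (5.8) concrete, the loss and its minimizer can be computed directly from any solver output sequence. The sketch below uses a made-up sequence {f_j} and cost c purely for illustration.

```python
# A minimal sketch of the loss L(i; c) in (5.8): the best value seen in
# the first i evaluations plus a per-evaluation cost c. The sequence
# {f_j} and the cost below are illustrative only.

def stopping_loss(f_seq, c):
    """Return [L(1; c), ..., L(len(f_seq); c)]."""
    losses, best = [], float("inf")
    for i, f in enumerate(f_seq, start=1):
        best = min(best, f)           # min_{1 <= j <= i} f_j
        losses.append(best + c * i)   # plus the cumulative cost c * i
    return losses

f_seq = [10.0, 6.0, 3.0, 2.5, 2.4, 2.4, 2.4, 2.4]
losses = stopping_loss(f_seq, c=0.5)
# argmin_i L(i; c): the loss-optimal number of evaluations (1-based)
i_star = min(range(len(losses)), key=losses.__getitem__) + 1
```

Once the improvement in the best value falls below the per-evaluation cost, the loss begins to grow, so its minimizer marks the point where further evaluations are no longer worthwhile.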
Figure 5.3: Number of evaluations i for a termination test based on (5.3) with fixed γ_F^i and ν, but using a μ parameterized by c. The plots show remarkably similar behavior to the number of evaluations that minimize L(·; c) in (5.8).

We now demonstrate the merits of the proposed tests and explore the effect of changing the associated parameter values by considering outputs generated by a set of derivative-free optimization solvers on a collection of noisy test problems. We present illustrative results here; a comprehensive set of plots may be found at http://www.mcs.anl.gov/~wild/tnoise/ .

We consider the collection of unconstrained least-squares problems used in [41], with each function taking the form

  f(x) = (1 + ε g(x)) (1 + Σ_{i=1}^m F_i^s(x)²),    (5.9)

where each F_i^s is a smooth, deterministic function and ε ≪ 1 is a positive scalar used to control the amplitude of the noise being added to f^s(x) = Σ_{i=1}^m F_i^s(x)². We begin our study by considering stochastic noise, so that g(x) represents independent and identically distributed (iid) random variables with variance Var(g(x)) = 1. As a result, the relative noise of these test functions is simply ε and, hence, independent of x. The constant 1 was added in (5.9) so that the relative noise is consistently defined
even if F_i^s(x) = 0 for all i; such shifts are commonly performed in accuracy measures (see, e.g., [16]).

To examine the tests on a diverse set of local methods, we consider sequences {f_j} produced by different derivative-free optimization solvers. Since the relative merits of these solvers are not the focus of this study, we do not explicitly list them, but we note that they come from a variety of classes, including model-based methods, implementations of Nelder–Mead, pattern search methods, and methods that cross these classes. After examining the termination tests on the entire group of solvers, we analyze the success of the recommended tests on each solver individually as a means of validation.

To more accurately study the effect of our tests, we have made the built-in termination criteria of these solvers as ambitious as possible in an attempt to remove their influence. Hence, we ran each solver until either it crashed (e.g., for a numerical reason, such as the simplex sides being dropped sufficiently below machine precision) or a maximum budget of 5,000 function evaluations was achieved. This budget of evaluations is significantly larger than the one considered in [41], and we consider it to be more than sufficient for the problems in this set, which range in dimension from n = 2 to n = 12. We denote the maximum number of function evaluations (either 5,000, or fewer if the solver crashed) by i_max.

We then have a set of 318 nonnegative sequences {f_j}_{j=1}^{i_max}, which constitute our set of problems P. We use these problems to examine the performance of a set of tests T, defined as members of the families proposed in Section 5.3. For a test t ∈ T and problem p ∈ P, we denote by i_{p,t} the number of function values after which test t would stop on problem p. If the test is not satisfied before the maximum number of evaluations i_max of problem p, we let i_{p,t} = i_max to mirror what would be done in practice.

5.4.1 Accuracy Profiles for the τ₁ Family
Termination criteria that are too easily satisfied have limited practicality since they could stop with a function value far from the minimum. We will measure this ability by considering the relative difference between f_{i_{p,t}} and f_{i_max},

  e_{p,t} = 1                                          if i_{p,t} = i_max,
  e_{p,t} = (f_{i_{p,t}} − f_{i_max}) / f_{i_{p,t}}    if i_{p,t} < i_max.    (5.10)
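A minimal sketch of computing e_{p,t}, together with the accuracy profile ω_t, taken here to be the fraction of problems with e_{p,t} ≤ a. That reading of the profile is an assumption consistent with, but not verbatim from, the surrounding text; the histories and stop indices below are illustrative and treated as best-value-so-far sequences.

```python
# Sketch of the accuracy measure e_{p,t} in (5.10) and an assumed
# accuracy-profile summary: the fraction of problems with e_{p,t} <= a.
# f_hist is a (nonincreasing, nonnegative) best-value history; i_stop
# is i_{p,t}. All data are illustrative.

def accuracy_error(f_hist, i_stop):
    """e_{p,t}: 1 if the test never stopped early, else the relative gap."""
    i_max = len(f_hist)
    if i_stop >= i_max:
        return 1.0
    return (f_hist[i_stop - 1] - f_hist[-1]) / f_hist[i_stop - 1]

def accuracy_profile(problems, a):
    errs = [accuracy_error(f, i) for f, i in problems]
    return sum(e <= a for e in errs) / len(errs)

problems = [([4.0, 2.0, 1.0, 1.0], 2),   # stopped early: e = (2-1)/2 = 0.5
            ([4.0, 2.0, 1.0, 1.0], 4)]   # ran to i_max: e = 1
```

Because the early-stop branch of (5.10) always lies in [0, 1) for nonnegative, nonincreasing histories, the sentinel value 1 ranks a never-satisfied test worse than any early stop.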
Figure 5.4: Accuracy profiles for members of the τ₁ family on problems (5.9) with two different magnitudes of known stochastic relative noise ε. In the top plots, ν is held fixed and the shown members have different μ values. In the bottom plots, μ is held fixed and the shown members have different ν values.

The left asymptote of ω_t shows the fraction of problems on which the test stopped after it reached the minimum value f_{i_max}, while the right asymptote is the fraction of problems for which the test was satisfied with i_{p,t} < i_max.
As μ grows the tests become less conservative, and a number of problems are terminated well before the relative error is on the order of the noise. On the other hand, not much is gained by setting μ less than 10⁻¹ or 10⁻².

The bottom two plots in Figure 5.4 show τ₁ family members for fixed μ = 10⁻² and various values of ν, which can be thought of as a backward difference parameter. Little improvement is seen for ν > 20n, but a marked decrease in accuracy occurs when ν < 10n. Many problems are stopping with a large relative error, in part because f_j can remain unchanged for many consecutive j. For example, the lower left plot shows that for noise affecting the third digit, on all 318 problems f_j remained unchanged for 3n consecutive evaluations before j = i_max was reached.

5.4.2 Performance Profiles for the τ₁ Family

While accuracy profiles can quantify when a test stops too soon, they may not reveal which tests require excessive function evaluations to achieve high accuracies. For example, the maximum-budget test τ₅ trivially achieves ideal accuracy but can make many more evaluations than are required to get sufficient accuracy.

We use performance profiles [17] to compare different stopping rules in terms of both accuracy and the number of function evaluations required. A performance profile requires a convergence test as well as a performance measure r_{p,t} for each problem p ∈ P and test t ∈ T. We use the number of evaluations i_{p,t} as our performance measure and a convergence test requiring that the solution obtained is within the absolute noise level of the best function value,

  f_{i_{p,t}} − f_{i_max} ≤ |f_{i_{p,t}}| ε̂_r^i.    (5.11)

The convergence test has the effect of setting the performance measure i_{p,t} = ∞ whenever the original i_{p,t} does not satisfy (5.11). The performance ratio

  r_{p,t} = i_{p,t} / min{ i_{p,t̃} : t̃ ∈ T, (5.11) satisfied for (p, t̃) }    if (5.11) is satisfied for (p, t),
  r_{p,t} = ∞                                                                  otherwise,
Figure 5.5: Performance profiles for the most accurate τ₁ tests on problems (5.9) with two different magnitudes of known stochastic relative noise ε. Note that the axis has been truncated for each plot; τ₅ eventually terminates all of the problems and thus has a profile that will reach the value 1; all other tests change by less than .01.

measures the relative performance on problem p of test t when compared with the other tests in T.

The performance profile

  ρ_t(α) = |{ p ∈ P : r_{p,t} ≤ α }| / |P|

then represents the fraction of problems where test t satisfied the accuracy requirement (5.11) with a number of evaluations within a factor α of the best-performing, sufficiently accurate test. Larger values of ρ_t(α) are hence better, with ρ_t(1) being the fraction of problems where t has successfully terminated first among all tests in T and lim_{α→∞} ρ_t(α) being the fraction of problems for which t satisfied (5.11).

Figure 5.5 shows the performance profiles for the most accurate τ₁ family members for two levels of noise ε. We include τ₅(·, i_max) in T as a point of reference to indicate an upper bound on the fraction of problems that all other tests may not have terminated with i_{p,t} < i_max yet still
satisfy (5.11).

These performance profiles illustrate that some members of the τ₁ family of tests require a fraction of the full i_max evaluations. This is the case especially for larger magnitudes of noise, where less accurate solutions are demanded. Likewise, as the noise decreases, (5.11) demands more accurate solutions, and it becomes necessary to perform i_max evaluations on a larger share of the problems.

These performance profiles also demonstrate how more liberal stopping rules can be more successful than the accuracy profiles reveal. For example, in Figure 5.4, τ₁(·, ·, 20n, 0) and τ₁(·, ·, 20n, 10⁻²) appear nearly identical in terms of their accuracy, but in Figure 5.5 we see a marked difference in the performance measures. The right asymptotes of their performance profiles are nearly identical, a reflection of their accuracy profiles at a = ε, but the rest of the profiles show that τ₁(·, ·, 20n, 10⁻²) uses considerably fewer function evaluations to satisfy this accuracy requirement. Because of this high accuracy and performance, we consider τ₁(·, ·, 20n, 10⁻²) to be the best of the stopping rules considered in its family.

5.4.3 Accuracy and Performance Plots for the τ₂ Family

Having outlined our procedure for determining what constitutes good members of the τ₁ family, we can now quickly do so for the family based on τ₂ in (5.4).

The accuracy profiles in the upper left plot of Figure 5.6 show that the τ₂(·, ·, 10n, 1) test was satisfied on less than 5% of the problems, and so μ ≤ 1 has little relevance for this family. In our experience, decreasing ν did not alleviate this problem for small μ. In general, τ₂ tends to be much more sensitive to the μ value than are tests based on τ₁. We also see that τ₂ is more accurate at smaller values of ν than τ₁ was; ν = 3n is now a more competitive parameter choice. This tradeoff in accuracy comes at the cost of the τ₂ tests being more conservative and, hence, satisfied on fewer of the problems.
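As an aside, the performance ratio r_{p,t} and profile ρ_t(α) defined in Section 5.4.2 can be sketched directly; the evaluation counts and (5.11) outcomes below are illustrative only.

```python
import math

# Sketch of the performance ratio r_{p,t} and profile rho_t(alpha).
# evals[p][t] holds i_{p,t} and conv[p][t] records whether the
# convergence test (5.11) held at that stop; all data are illustrative.

def perf_ratios(evals, conv):
    ratios = []
    for i_row, c_row in zip(evals, conv):
        feas = [i for i, ok in zip(i_row, c_row) if ok]
        ratios.append([i / min(feas) if ok else math.inf
                       for i, ok in zip(i_row, c_row)])
    return ratios

def perf_profile(ratios, t, alpha):
    """rho_t(alpha): fraction of problems with r_{p,t} <= alpha."""
    return sum(r[t] <= alpha for r in ratios) / len(ratios)

evals = [[10, 20], [30, 15]]            # i_{p,t} for two problems, two tests
conv = [[True, True], [False, True]]    # whether (5.11) was satisfied
ratios = perf_ratios(evals, conv)
```

Assigning r_{p,t} = ∞ to stops failing (5.11) is what makes lim_{α→∞} ρ_t(α) the fraction of problems on which t ever satisfied the convergence test.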
Figure 5.6: Accuracy (top) and performance (bottom) profiles for the τ₂ family on problems (5.9) with two different magnitudes of stochastic relative noise ε as μ and ν are varied.

We again use performance profiles to measure whether tests are overly conservative. As indicated by the larger range for α in the bottom two plots of Figure 5.6, the τ₂ family of tests is more difficult to satisfy overall, and the number of function evaluations required compares slightly less favorably with i_max than for the τ₁ family. We also see that τ₂(·, ·, 3n, 10) tends to be the most liberal test, in part because it requires fewer consecutive evaluations than the other members shown (as indicated by the value of ν), but that τ₂(·, ·, 10n, 10) requires just a small increase in the number of function evaluations while solving a greater fraction of problems overall. Based on our computational experience, we consider τ₂(·, ·, 10n, 10) to be the best test of those
Figure 5.7: Accuracy (top) and performance (bottom) profiles for the best tests on problems (5.9) with two different magnitudes of stochastic relative noise ε. The horizontal axes on the performance profiles are truncated for clarity; τ₅ eventually achieves a value of 1; all other tests change by less than .03.

considered in this family for these problems.

5.4.4 Across-family Comparisons

We performed similar comparisons for the members of the τ₃ and τ₄ families, but the analysis is identical to what has been presented above. For the benchmark problems P, we found that tests τ₃(·, n, 10⁻⁷) and τ₄(·, 20n, 10⁻⁷) performed best among those considered in their respective families. See http://mcs.anl.gov/~wild/tnoise for a complete study of the tests in these families.

Having identified the best members of each family of tests, we compare them head-to-head in Figure 5.7. The top two plots of Figure 5.7 demonstrate that when
the four tests considered stop with fewer than i_max evaluations, they all tend to have obtained a solution within the level of the noise, though τ₁ and τ₃ are slightly worse due to a 1% jump past a = ε. The x-value-based tests τ₃ and τ₄ have nearly identical performance, are highly accurate, and are satisfied for a large fraction of the problems. For the function-value-based tests, τ₁ applies to more of the problems we examined than τ₂.

On the other hand, the lower two plots of Figure 5.7 show that the test based on τ₂ generally requires fewer evaluations to be satisfied on a significant number of problems with larger noise. The tests based on τ₃ and τ₄ are more favorable when the noise level is low. Since the performance and accuracy of the τ₃ and τ₄ tests are nearly identical, we remove τ₄ from further consideration.

5.4.5 Deterministic Noise

We now consider how these tests perform in the presence of deterministic noise by using functions of the form (5.9), with a deterministic g. To model deterministic noise, we use the same g combining high-frequency and lower-frequency nonsmooth oscillations as used in [41], with g : Rⁿ → [−1, 1] defined by the cubic Chebyshev polynomial g(x) = φ(x)(4φ(x)² − 3), where

  φ(x) = 0.9 sin(‖x‖₁) cos(‖x‖₁) + 0.1 cos(‖x‖₂).

Using the technique in [42], we consistently estimated the relative noise in the 318 resulting problems to be of the order 0.6ε, provided that the sampling distance is appropriately chosen.

To study the various families of termination tests in the deterministic case, an analysis identical to the stochastic case was performed. While there were slight differences in the performance of tests between these cases, for the most part, the conclusions were similar. For ease of presentation, we proceed by discussing the best stochastic tests in the deterministic case, acknowledging that slightly more
Figure 5.8: Accuracy (top) and performance (bottom) profiles for the best tests on problems (5.9) with two different magnitudes of deterministic noise. The horizontal axes on the performance profiles are truncated for clarity; τ₅ eventually achieves a value of 1; all other tests change by less than .03.

conservative tests are needed. For a complete study of the deterministic noise see http://www.mcs.anl.gov/~wild/tnoise/ .

The accuracy profiles in Figure 5.8 show a mild decrease in accuracy for the best tests compared with stochastic noise in Figure 5.7. As a result, we see that on just over 10% of the problems, the tests based on τ₁ and τ₂ now terminate while not satisfying the convergence test (5.11) when ε = 10⁻³. In practice, one would also need an estimate of the relative noise ε̂_r^i, but our results for τ₁ and τ₂ varying the linear multiplier of the noise, μ, show that the tests remain relatively stable if ε̂_r^i is estimated within an order of magnitude. Also, adjusting these tests to be slightly
Figure 5.9: Performance profiles for more conservative tests on problems (5.9) with two different magnitudes of deterministic noise. The horizontal axes on the performance profiles are truncated for clarity; τ₅ eventually achieves a value of 1; all other tests change by less than .03.

more conservative improves their efficacy. Results for this change are presented in Figure 5.9.

5.4.6 Validation for Individual Solvers

The results above show favorable accuracy and performance for our recommended tests for a collection of solvers on a suite of problems. For these termination tests to be of practical use for each individual solver in the collection, one must ensure that the tests do not prematurely terminate one solver more often than another solver. We present accuracy profiles in Figure 5.10 broken down by the six solvers examined and note that no single solver accounts for a disproportionate number of incorrect terminations. It is noteworthy that neither recommended test is particularly relevant to solver D, as it is rarely satisfied, and would therefore have no effect (positive or negative) on this solver on these problems. We remark that the behavior seen is more an indication of solver D's performance on these noisy problems than of the tests' performance. This is verified by the τ₃ plot in Figure 5.10, showing that solver D never has n consecutive points within a Euclidean distance of 10⁻⁷ of each other.

5.5 Discussion
Figure 5.10: Accuracy profiles for the individual algorithms on the recommended tests.

In this chapter we considered parameterized families of termination tests that require solely a history of evaluations (points and function values) from an optimization solver. Our analysis and experiments show how values for these parameters can be changed to reflect a user's view of the expense of an additional function evaluation and the accuracy demanded, the two characteristics that form the basis for (5.1). When used in conjunction with performance profiles, the accuracy profiles introduced here are valuable for guiding termination decisions based on this tradeoff between accuracy and function evaluations.

In our study of stochastic noise we encountered a number of problems where a solver produced no change in f for 300+ evaluations but then found change in the
Table 5.1: Recommendations for termination tests for noisy optimization.

  Test  Parameters    Interpretation of Stopping Rule
  τ₁    (20n, 0.01)   Stop when the average relative decrease in the best function value over the last 20n function evaluations is less than one hundredth of the relative noise level.
  τ₂    (10n, 10)     Stop when the last 10n function evaluations are within 10 times the absolute noise level of the best function value.
  τ₃    (n, 10⁻⁷)     Stop when the last n points evaluated are within a distance of 10⁻⁷ of each other.

first digit of f. This was sufficiently remediated in the tests based on function values (τ₁ and τ₂) by ν, a parameter determining how far into the past function values are remembered. We recommend a baseline value of ν = 20n, with a corresponding μ less than or equal to 0.1 for τ₁. Good performance for τ₂ can still be seen for ν less than 10n, but this test is more sensitive to μ values. For μ ≪ 10, τ₂ tests are rarely satisfied, whereas for μ ≫ 10, termination can occur with an inaccurate solution. This result is important as previous work focused on successive decreases [41] or values on a simplex or stencil [19]. Our recommendations are summarized in Table 5.1.

We have also seen that fewer problems are terminated before the budget constraint as noise (measured by ε) becomes small. In these cases, however, a solver's built-in termination criteria should be satisfied more easily. Tests based on x values, rather than function values, perform surprisingly well. When the noise level is more significant, tests utilizing knowledge of this noise can lead to improved performance. In any case, termination tests using knowledge of the points evaluated and their function values outperform (using the metrics presented in this chapter) simply exhausting a budget of function evaluations.
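The three recommended tests can be sketched directly from their plain-English interpretations in Table 5.1. The formal definitions in Section 5.3 carry additional constants and arguments omitted here, and all argument names (the best-value history, noise estimates) are illustrative.

```python
import math

# Sketches of the recommended tests from Table 5.1, coded from their
# plain-English interpretations only (not the formal definitions).
# best is a best-value-so-far history, fvals raw function values, pts
# the evaluated points; eps_rel/eps_abs are noise estimates.

def tau1(best, n, eps_rel, mu=0.01):
    """Average relative decrease over the last 20n evals below mu * eps_rel."""
    nu = 20 * n
    if len(best) <= nu:
        return False
    avg_rel_dec = (best[-1 - nu] - best[-1]) / (nu * abs(best[-1]))
    return avg_rel_dec < mu * eps_rel

def tau2(fvals, n, eps_abs, mu=10.0):
    """Last 10n values within mu times the absolute noise of the best value."""
    nu = 10 * n
    if len(fvals) < nu:
        return False
    fbest = min(fvals)
    return all(f - fbest <= mu * eps_abs for f in fvals[-nu:])

def tau3(pts, n, tol=1e-7):
    """Last n evaluated points within distance tol of each other."""
    if len(pts) < n:
        return False
    last = pts[-n:]
    return all(math.dist(p, q) <= tol for p in last for q in last)
```

Each test consumes only the history a solver already produces, so it can be layered on any method without modifying the solver itself.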
For deterministic noise, we found that the recommended tests used considerably fewer function evaluations than the maximum budget. This result was obtained at the cost of lower accuracy. We recommend slightly modifying these tests to be more conservative for deterministic noise and again allowing the solver's built-in tests to stop a run when necessary. This effect is shown in Figure 5.9, where we see that tests based on τ₁ and τ₂ perform better when ν is increased by 10n.

Although our numerical experiments focused on local derivative-free solvers, the proposed tests can also be used in conjunction with the termination tests of derivative-based algorithms or the refinement stage of global algorithms. Incorporating additional information, such as noisy derivative values or the number of local minima, is left as future work.
6. Concluding Remarks

The results in this thesis naturally lead into further research. Although Chapter 3 proved convergence of a trust-region algorithm using weighted regression models (assuming the condition number of the weighting matrix is bounded), considerable freedom remains in deciding on a weighting scheme. The weighting proposed in Section 3.5 has some heuristic intuition, but nothing more. We have experimented with various other weighting schemes, some of which we thought would speed convergence, though that has not happened yet. While it seems intuitive to pick the weights which minimize the error between the model and the function, instituting such a plan on test problems has not improved the convergence of the algorithm.

Each time a model is built, we considered the calculation of the weights as a separate problem to be solved. We attempted to choose weights which solve

  min_w  ‖∇m_k(x_k) − ∇f(x_k)‖
  s.t.   cond(W) ≤ c₁,
         w_i ≥ c₂,

where c₁ and c₂ are small positive constants and W = diag(w). In words, we are attempting to minimize the difference between the gradients of the model and the function at the trust-region center. While a practical implementation would only be able to minimize bounds on the objective function as a proxy for minimizing the desired objective, initial results suggest against such avenues of research. For a proof of concept, we constructed DFO problems with known derivatives (to be used by the weighting scheme but not by the algorithm). On these test problems, the algorithm using the heuristic weighting scheme presented in Section 3.5.1 outperformed the optimized weights. We also selected weights solving
  min_w  max_{x ∈ B(x_k; Δ_k)} ‖∇m_k(x) − ∇f(x)‖
  s.t.   cond(W) ≤ c₁,
         w_i ≥ c₂,

at considerable computational effort. Again, these weights performed no better than heuristic schemes. We are still searching for a weighting scheme which outperforms our heuristic.

We believe the weights introduced in Chapter 3 should allow for more accurate models when the accuracy of function evaluations is different at different points (i.e., weight more accurate function values higher). This could extend the work of [60], where increasingly accurate function evaluations can be achieved at increasing cost. The algorithm in Chapter 3 could be modified to call the function with various degrees of accuracy as it converges to a minimum.

In addition to proving that a subsequence of iterates from our stochastic algorithm converges almost surely to a first-order stationary point when there are infinitely many successful iterates, we are also generalizing the results from Chapter 4 to apply to all noise with mean zero and bounded variance, rather than normally distributed errors.

Both of the algorithms in Chapter 3 and Chapter 4 are based on quadratic models, which require O(n²) function evaluations for a function in n-dimensional space. If the function evaluations are even moderately expensive, then merely constructing an initial quadratic model can exhaust a computational budget (for n = 100, for example). We are exploring ways to start the algorithm with underdetermined models and slowly build up towards linear and quadratic models. We have asked whether it is better, when there are more than n + 1 points in the trust region, to build an underdetermined quadratic or an overdetermined linear model; such an answer might not exist.

One of the largest contributions of Chapter 5 is the framework to objectively compare the quality of stopping criteria. The tests presented in this paper were directed at stopping noisy function optimization, and since DFO algorithms have purported deftness in handling noise, only DFO algorithms were compared. Nonetheless, the tools developed to compare stopping criteria translate directly to algorithms where reliable derivatives are available, but evaluating them is expensive.

Lastly, during the course of research, we studied the convergence rate of DFO algorithms. What follows is the only proof of quadratic convergence for a class of DFO methods that we are aware of, assuming f is smooth and provided m_k is a sufficient approximation of f.

Assumption 6.1  f ∈ LC², and x* satisfies second-order sufficient conditions for f. Specifically, ∇f(x*) = 0 and ∇²f(x*) is positive definite.

Assumption 6.2  m_k ∈ LC² is a model which approximates f with errors ‖∇f(x) − ∇m_k(x)‖ = O(‖x − x_k‖²) and ‖∇²f(x) − ∇²m_k(x)‖ = O(‖x − x_k‖) for x sufficiently close to x*.

For ease of notation, define the following:
1. m_k^k := m_k(x_k)
2. m_k^* := m_k(x*)
3. ∇m_k^k := ∇m_k(x_k)
4. ∇²m_k^k := ∇²m_k(x_k)
5. f^k := f(x_k)
6. f^* := f(x*)

Theorem 6.3  If f, x*, and m_k satisfy the conditions in Assumptions 6.1 and 6.2, then for x₀ sufficiently close to x*, iterates generated by Newton steps, x_{k+1} = x_k − (∇²m_k^k)⁻¹ ∇m_k^k, converge quadratically to x*.
Proof: Using the integral form of Taylor's theorem,

  ∇h(x + p) = ∇h(x) + ∫₀¹ ∇²h(x + tp) p dt,    (6.1)

with h = m_k, x = x_k, and p = x* − x_k yields

  ∇m_k(x_k + (x* − x_k)) = ∇m_k(x_k) + ∫₀¹ ∇²m_k(x_k + t(x* − x_k)) (x* − x_k) dt,
  ∇m_k(x_k) − ∇m_k(x*) = −∫₀¹ ∇²m_k(x_k + t(x* − x_k)) (x* − x_k) dt,
  ∇m_k^k − ∇m_k^* = ∫₀¹ ∇²m_k(x_k + t(x* − x_k)) (x_k − x*) dt.    (6.2)
Therefore,

  ‖x_{k+1} − x*‖ = ‖x_k − (∇²m_k^k)⁻¹ ∇m_k^k − x*‖.

Factor:

  = ‖(∇²m_k^k)⁻¹ (∇²m_k^k (x_k − x*) − ∇m_k^k)‖.

Add and subtract the same terms:

  = ‖(∇²m_k^k)⁻¹ (∇²m_k^k (x_k − x*) − ∇m_k^k + ∇m_k^* − ∇m_k^*)‖.

Add ∇f^* = 0 and group:

  = ‖(∇²m_k^k)⁻¹ (∇²m_k^k (x_k − x*) − (∇m_k^k − ∇m_k^*) − (∇m_k^* − ∇f^*))‖.

Use the integral form of Taylor's theorem (6.2):

  = ‖(∇²m_k^k)⁻¹ (∇²m_k^k (x_k − x*) − ∫₀¹ ∇²m_k(x_k + t(x* − x_k)) (x_k − x*) dt − (∇m_k^* − ∇f^*))‖.

Since c = ∫₀¹ c dt:

  = ‖(∇²m_k^k)⁻¹ (∫₀¹ ∇²m_k^k (x_k − x*) dt − ∫₀¹ ∇²m_k(x_k + t(x* − x_k)) (x_k − x*) dt − (∇m_k^* − ∇f^*))‖.

Combining integrals with identical limits and factoring the constant:

  = ‖(∇²m_k^k)⁻¹ (∫₀¹ [∇²m_k^k − ∇²m_k(x_k + t(x* − x_k))] dt (x_k − x*) − (∇m_k^* − ∇f^*))‖.

Because ∇²m_k is Lipschitz (with constant L), ‖∇²m_k^k − ∇²m_k(x_k + t(x* − x_k))‖ ≤ L t ‖x_k − x*‖, so

  ≤ ‖(∇²m_k^k)⁻¹ (∫₀¹ L t ‖x_k − x*‖ dt (x_k − x*) − (∇m_k^* − ∇f^*))‖.
Evaluate the integral:

  = ‖(∇²m_k^k)⁻¹ ((L/2) ‖x_k − x*‖ (x_k − x*) − (∇m_k^* − ∇f^*))‖.

Use the triangle inequality:

  ≤ ‖(∇²m_k^k)⁻¹ (L/2) ‖x_k − x*‖ (x_k − x*)‖ + ‖(∇²m_k^k)⁻¹ (∇m_k^* − ∇f^*)‖.

By the submultiplicativity of the norm:

  ≤ ‖(∇²m_k^k)⁻¹‖ (L/2) ‖x_k − x*‖² + ‖(∇²m_k^k)⁻¹‖ ‖∇m_k^* − ∇f^*‖.

Since ∇²f(x*) is positive definite, we can pick x_k close enough to x* so that ∇²m_k(x_k) is also positive definite. We can therefore bound ‖(∇²m_k(x_k))⁻¹‖ ≤ c₀ ‖(∇²f(x*))⁻¹‖ for some constant c₀. This results in

  ‖x_{k+1} − x*‖ ≤ (L/2) c₀ ‖(∇²f^*)⁻¹‖ ‖x_k − x*‖² + c₀ ‖(∇²f^*)⁻¹‖ c₁ ‖x_k − x*‖² ≤ C ‖x_k − x*‖²

for some constants c₀, c₁, C, where c₁ comes from Assumption 6.2. Therefore, the sequence of x_k converges quadratically to x*.

As mentioned in the introduction, problems with expensive and noisy function evaluations are prevalent in a variety of fields. As computational resources and modeling mature, these problems arise with increasing frequency. Therefore, the work in this thesis should lay the groundwork for a variety of pertinent contributions to the optimization community.
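The quadratic rate in Theorem 6.3 can be observed numerically in the special case m_k = f, for which Assumption 6.2 holds trivially. For the illustrative choice f(x) = x − log(1 + x), with x* = 0, the Newton step simplifies to x_{k+1} = −x_k², so each error is exactly the square of the previous one.

```python
# Numerical illustration of Theorem 6.3 for the special case m_k = f.
# With f(x) = x - log(1 + x), f'(x) = x/(1 + x) and f''(x) = 1/(1 + x)^2,
# so the Newton step x - f'(x)/f''(x) reduces to -x^2.

def newton(grad, hess, x0, iters):
    xs = [x0]
    for _ in range(iters):
        xs.append(xs[-1] - grad(xs[-1]) / hess(xs[-1]))
    return xs

grad = lambda x: x / (1.0 + x)           # f'(x)
hess = lambda x: 1.0 / (1.0 + x) ** 2    # f''(x)
xs = newton(grad, hess, x0=0.5, iters=4)
# |x_k| = 0.5, 0.25, 0.0625, ~3.9e-3, ~1.5e-5: each error is the
# square of the previous one
```

The doubling of correct digits per iteration is the practical signature of the quadratic rate proved above.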
REFERENCES

[1] E. J. Anderson and M. C. Ferris. A direct search algorithm for optimization with noisy function evaluations. SIAM Journal on Optimization, 11:837–857, 2001.

[2] Charles Audet and J. E. Dennis Jr. A pattern search filter method for nonlinear programming without derivatives. Technical report, SIAM Journal on Optimization, 2000.

[3] S. C. Billups, J. Larson, and P. Graf. Derivative-free optimization of expensive functions with computational noise using weighted regression. Technical report, University of Colorado Denver, 2010. SIAM Journal on Optimization, submitted.

[4] K. H. Chang, L. Jeff Hong, and H. Wan. Stochastic trust-region response-surface method (STRONG) – a new response-surface framework for simulation optimization. INFORMS Journal on Computing, pages 1–14, April 2012.

[5] T. D. Choi and C. T. Kelley. Superlinear convergence and implicit filtering. SIAM Journal on Optimization, 10:1149–1162, 2000.

[6] Y. S. Chow and H. Robbins. On optimal stopping rules. Probability Theory and Related Fields, 2:33–49, 1963.

[7] P. G. Ciarlet and P. A. Raviart. General Lagrange and Hermite interpolation in Rⁿ with applications to finite element methods. Archive for Rational Mechanics and Analysis, 46:177–199, 1972.

[8] A. R. Conn, K. Scheinberg, and P. L. Toint. Recent progress in unconstrained nonlinear optimization without derivatives. Mathematical Programming, 79:397–414, 1997.

[9] A. R. Conn, K. Scheinberg, and L. N. Vicente. Geometry of interpolation sets in derivative-free optimization. Mathematical Programming, 111:141–172, 2008.

[10] A. R. Conn, K. Scheinberg, and L. N. Vicente. Geometry of sample sets in derivative-free optimization: Polynomial regression and underdetermined interpolation. IMA Journal on Numerical Analysis, 28:721–748, 2008.

[11] A. R. Conn, K. Scheinberg, and L. N. Vicente. Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM Journal on Optimization, 20:387–415, 2009.

[12] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. MPS-SIAM Series on Optimization. SIAM, Philadelphia, 2009.

[13] A. L. Custódio, H. Rocha, and L. N. Vicente. Incorporating minimum Frobenius norm models in direct search. Computational Optimization and Applications, 46:265–278, 2010.

[14] G. Deng and M. C. Ferris. Adaptation of the UOBYQA algorithm for noisy functions. In Proceedings of the Winter Simulation Conference, pages 312–319, 2006.

[15] G. Deng and M. C. Ferris. Extension of the DIRECT optimization algorithm for noisy functions. In Proceedings of the Winter Simulation Conference, pages 497–504, 2007.

[16] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, Philadelphia, 1996.

[17] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91:201–213, 2002.

[18] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, 2nd edition, 1987.

[19] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, London, 1981.

[20] F. Glover. Tabu search – Part I. INFORMS Journal on Computing, 1:190–206, January 1989.

[21] F. Glover, M. Laguna, and R. Martí. Fundamentals of scatter search and path relinking. Control and Cybernetics, 39, 2000.

[22] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996.

[23] N. I. M. Gould, D. Orban, and P. L. Toint. CUTEr and SifDec: A constrained and unconstrained testing environment, revisited. ACM Transactions on Mathematical Software, 29:373–394, 2003.

[24] R. B. Gramacy and M. A. Taddy. Categorical inputs, sensitivity analysis, optimization and importance tempering with tgp version 2, an R package for treed Gaussian process models. J. Statistical Software, 33:1–48, 2010.

[25] S. Gratton, M. Mouffe, and P. Toint. Stopping rules and backward error analysis for bound-constrained optimization. Numerische Mathematik, 119:163–187, 2011.

[26] S. Gratton and L. N. Vicente. A surrogate management framework using rigorous trust-region steps. Preprint 11-11, Univ. of Coimbra, March 2011.

[27] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, 2nd edition, 2002.

[28] J. H. Holland. Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, MA, USA, 1992.

[29] W. Huyer and A. Neumaier. SNOBFIT – stable noisy optimization by branch and fit. ACM Transactions on Mathematical Software, 35:1–25, 2008.

[30] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79:157–181, 1993.

[31] B. Karasözen. Survey of trust-region derivative-free optimization methods. Journal of Industrial and Management Optimization, 3:321–334, 2007.

[32] C. T. Kelley. Users Guide for imfil version 1. Available at www4.ncsu.edu/~ctk/imfil.html .

[33] C. T. Kelley. Detection and remediation of stagnation in the Nelder–Mead algorithm using a sufficient decrease condition. SIAM Journal on Optimization, 10:43–55, 1999.

[34] J. Kennedy and R. Eberhart. Particle swarm optimization. Proceedings of ICNN'95 – International Conference on Neural Networks, 4:1942–1948, 1995.

[35] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23:462–466, 1952.

[36] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.

[37] M. Kortelainen, T. Lesinski, J. Moré, W. Nazarewicz, J. Sarich, N. Schunck, M. V. Stoitsov, and S. Wild. Nuclear energy density optimization. Phys. Rev. C, 82:024313, August 2010.

[38] R. Lougee-Heimer. The Common Optimization INterface for Operations Research: Promoting open-source software in the operations research community. IBM Journal of Research and Development, 47:57–66, 2003.

[39] J. Matyas. Random optimization. Automation and Remote Control, 26:246–253, 1965.

[40] H. D. Mittelmann. Decision tree for optimization software. http://plato.asu.edu/guide.html, 2010.

[41] J. J. Moré and S. M. Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20:172–191, 2009.

[42] J. J. Moré and S. M. Wild. Estimating computational noise. SIAM J. Scientific Computing, 33:1292–1314, 2011.

[43] J. J. Moré and S. M. Wild. Estimating derivatives of noisy simulations. ACM Trans. Math. Softw., 38, 2011. To appear.

[44] R. H. Myers, D. C. Montgomery, and C. M. Anderson-Cook. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley and Sons, 2008.

[45] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308–313, 1965.

[46] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied Linear Statistical Models. McGraw-Hill, 4th edition, 1996.

[47] M. J. D. Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7:155–162, 1964.

[48] M. J. D. Powell. UOBYQA: Unconstrained optimization by quadratic approximation. Mathematical Programming, 92:555–582, 2002.

[49] M. J. D. Powell. Least Frobenius norm updating of quadratic models that satisfy interpolation conditions. Mathematical Programming, 100:183–215, 2004.

[50] M. J. D. Powell. Developments of NEWUOA for minimization without derivatives. IMA Journal on Numerical Analysis, 28:649–664, 2008.

[51] C. R. Rao and H. Toutenburg. Linear Models: Least Squares and Alternatives. Springer Series in Statistics. Springer-Verlag, 2nd edition, 1999.

[52] L. A. Rastrigin. The convergence of the random search method in the extremal control of a many-parameter system. Automation and Remote Control, 24:1337–1342, 1963.

[53] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.

[54] A. N. Shiryayev. Optimal Stopping Rules. Springer-Verlag, New York, 1978.

[55] F. J. Solis and R. Wets. Minimization by random search techniques. Mathematics of Operations Research, 6:19–30, 1981.

[56] J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. Automatic Control, IEEE Transactions on, 37:332–341, 1992.

[57] J. C. Spall. Accelerated second-order stochastic optimization using only function measurements. In Decision and Control, 1997. Proceedings of the 36th IEEE Conference on, volume 2, pages 1417–1424. IEEE, 1997.

[58] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. John Wiley and Sons, 2003.

[59] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York, 1999.

[60] J. Takaki and N. Yamashita. A derivative-free trust-region algorithm for unconstrained optimization with controlled error. Numerical Algebra, Control and Optimization, 1:117–145, February 2011.

[61] J. J. Tomick, S. F. Arnold, and R. R. Barton. Sample size selection for improved Nelder–Mead performance. In Proceedings of the Winter Simulation Conference, pages 341–345, 1995.

[62] V. Torczon. On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7:1–25, 1997.

[63] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41–51, 1985.

[64] S. M. Wild, R. G. Regis, and C. A. Shoemaker. ORBIT: Optimization by radial basis function interpolation in trust-regions. SIAM Journal on Scientific Computing, 30:3197–3219, 2008.

[65] M. H. Wright. Using randomness to avoid perseveration in direct search methods. Presentation at the International Symposium on Mathematical Programming, 2009.

[66] A. A. Zhigljavsky. Theory of Global Random Search, volume 65 of Mathematics and its Applications (Soviet Series). Kluwer Academic Publishers Group, Dordrecht, 1991. Translated and revised from the 1985 Russian original by the author, with a foreword by Serge M. Ermakov.
