Citation
Derivative-free optimization of noisy functions

Material Information

Title:
Derivative-free optimization of noisy functions
Creator:
Larson, Jeffrey M.
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
2012
Language:
English

Thesis/Dissertation Information

Degree:
Doctorate (Doctor of Philosophy)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Mathematical and Statistical Sciences, CU Denver
Degree Disciplines:
Applied mathematics
Committee Chair:
Billups, Stephen
Committee Members:
Engau, Alexander
Simon, Burt
Jacobson, Michael
Glover, Fred

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
Copyright Jeffrey M. Larson. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.

Full Text
DERIVATIVE-FREE OPTIMIZATION OF NOISY FUNCTIONS
by
Jeffrey M. Larson
B.A., Carroll College, 2005
M.S., University of Colorado Denver, 2008
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Applied Mathematics
2012


This thesis for the Doctor of Philosophy degree by Jeffrey M. Larson has been approved by
Stephen Billups, Advisor and Chair
Alexander Engau
Burt Simon
Michael Jacobson
Fred Glover
Date


Larson, Jeffrey M. (Ph.D., Applied Mathematics) Derivative-free Optimization of Noisy Functions Thesis directed by Associate Professor Stephen Billups
ABSTRACT
Derivative-free optimization (DFO) problems with noisy functions are increasingly prevalent. In this thesis, we propose two algorithms for noisy DFO, as well as termination criteria for general DFO algorithms. Both proposed algorithms are based on the methods of Conn, Scheinberg, and Vicente [9] which use regression models in a trust region framework. The first algorithm utilizes weighted regression to obtain more accurate model functions at each trust region iteration. A weighting scheme is proposed which simultaneously handles differing levels of uncertainty in function evaluations and errors induced by poor model fidelity. To prove convergence of this algorithm, we extend the theory of $\Lambda$-poisedness and strong $\Lambda$-poisedness to weighted regression. The second algorithm modifies the first for functions with stochastic noise. We prove our algorithm generates a subsequence of iterates which converge almost surely to a first-order stationary point, provided the number of successful steps is finite and the noise for each function evaluation is independent and identically normally distributed. Lastly, we address termination of DFO algorithms on functions with noise corrupted evaluations. If the function is computationally expensive, stopping well before traditional criteria (e.g., after a budget of function evaluations is exhausted) are satisfied can yield significant savings. Early termination is especially desirable when the function being optimized is noisy, and the solver proceeds for an extended period while only seeing changes which are insignificant relative to the noise in the function. We develop techniques for comparing the quality of termination tests, propose families of tests to be used on general DFO algorithms, and then compare


the tests in terms of both accuracy and efficiency.
The form and content of this abstract are approved. I recommend its publication.
Approved: Stephen Billups


ACKNOWLEDGMENT
I would like to thank my advisor, Steve Billups, for his years of research assistance and advice. His guidance was instrumental in obtaining the results in this thesis. I would also like to thank Peter Graf and Stefan Wild for their assistance in researching and writing parts of this thesis. The research in this thesis was partially supported by National Science Foundation Grant GK-12-0742434 and partially supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357. Lastly, I’d like to thank my wife, Jessica, for years of patience while the research for this thesis was performed.


TABLE OF CONTENTS
Figures
Tables

Chapter
1. Introduction
   1.1 Review of Methods
   1.2 Outline
   1.3 Notation
2. Background
   2.1 Model-based Trust Region Methods
       2.1.1 Model Construction Without Derivatives
       2.1.2 CSV2-framework
       2.1.3 Poisedness
   2.2 Performance Profiles
   2.3 Probabilistic Convergence
3. Derivative-free Optimization of Expensive Functions with Computational Error Using Weighted Regression
   3.1 Introduction
   3.2 Model Construction
   3.3 Error Analysis and the Geometry of the Sample Set
       3.3.1 Weighted Regression Lagrange Polynomials
       3.3.2 Error Analysis
       3.3.3 $\Lambda$-poisedness (in the Weighted Regression Sense)
   3.4 Model Improvement Algorithm
   3.5 Computational Results
       3.5.1 Using Error Bounds to Choose Weights
       3.5.2 Benchmark Performance
   3.6 Summary and Conclusions
4. Stochastic Derivative-free Optimization using a Trust Region Framework
   4.1 Preliminary Results and Definitions
       4.1.1 Models which are $\kappa$-fully Quadratic with Confidence $1 - \alpha$
       4.1.2 Models which are $\kappa$-fully Linear with Confidence $1 - \alpha$
   4.2 Stochastic Optimization Algorithm
   4.3 Convergence
       4.3.1 Convergence to a First-order Stationary Point
       4.3.2 Infinitely Many Successful Steps
   4.4 Computational Example
       4.4.1 Deterministic Trust Region Method
       4.4.2 Stochastic Trust Region Method
   4.5 Conclusion
5. Non-intrusive Termination of Noisy Optimization
   5.1 Introduction and Motivation
   5.2 Background
   5.3 Stopping Tests
       5.3.1 f Test
       5.3.2 Max-Difference-f Test
       5.3.3 Max-Distance-x Test
       5.3.4 Max-Distance-x* Test
       5.3.5 Max-Budget Test
       5.3.6 Tests Based on Estimates of the Noise
       5.3.7 Relationship to Loss Functions
   5.4 Numerical Experiments
       5.4.1 Accuracy Profiles for the $\phi_1$ Family
       5.4.2 Performance Profiles for the $\phi_1$ Family
       5.4.3 Accuracy and Performance Plots for the $\phi_2$ Family
       5.4.4 Across-family Comparisons
       5.4.5 Deterministic Noise
       5.4.6 Validation for Individual Solvers
   5.5 Discussion
6. Concluding Remarks
References


FIGURES
Figure
2.1 An example of a performance profile
3.1 Performance (left) and data (right) profiles: interpolation vs. regression vs. weighted regression
3.2 Performance (left) and data (right) profiles: weighted regression vs. NEWUOA vs. DFO (problems with stochastic noise)
4.1 Several iterations of a traditional trust region method assuming deterministic function evaluations. The trust region center is never moved.
4.2 Several iterations of a traditional trust region method assuming stochastic function evaluations.
5.1 Part of a noisy trajectory of function values for an expensive nuclear physics problem. After more significant decreases in the first 70 evaluations, progress begins to stall.
5.2 First terms in $\phi_1$ (top, with $\kappa = 100$) and the noise
5.3 Number of evaluations $i^*$ for a termination test based on (5.3) with fixed $\omega$ and $\kappa$, but using a $\mu$ parameterized by $c$. The plots show remarkably similar behavior to the number of evaluations that minimize $L(\cdot, c)$ in (5.8).
5.4 Accuracy profiles for members of the $\phi_1$ family on problems (5.9) with two different magnitudes of (known) stochastic relative noise $\sigma$. In the top plots, $\kappa$ is held fixed and the shown members have different $\mu$ values. In the bottom plots, $\mu$ is held fixed and the shown members have different $\kappa$ values.
5.5 Performance profiles for the most accurate $\phi_1$ tests on problems (5.9) with two different magnitudes of (known) stochastic relative noise $\sigma$. Note that the $\alpha$-axis has been truncated for each plot; $\phi_5$ eventually terminates all of the problems and thus has a profile that will reach the value 1; all other tests change by less than .01.
5.6 Accuracy (top) and performance (bottom) profiles for the $\phi_2$ family on problems (5.9) with two different magnitudes of stochastic relative noise $\sigma$ as $\kappa$ and $\mu$ are varied.
5.7 Accuracy (top) and performance (bottom) profiles for the best tests on problems (5.9) with two different magnitudes of stochastic relative noise $\sigma$. The horizontal axes on the performance profiles are truncated for clarity; $\phi_5$ eventually achieves a value of 1; all other tests change by less than .03.
5.8 Accuracy (top) and performance (bottom) profiles for the best tests on problems (5.9) with two different magnitudes of deterministic noise. The horizontal axes on the performance profiles are truncated for clarity; $\phi_5$ eventually achieves a value of 1; all other tests change by less than .03.
5.9 Performance profiles for more conservative tests on problems (5.9) with two different magnitudes of deterministic noise. The horizontal axes on the performance profiles are truncated for clarity; $\phi_5$ eventually achieves a value of 1; all other tests change by less than .03.
5.10 Accuracy profiles for the individual algorithms on the recommended tests.


TABLES
Table
5.1 Recommendations for termination tests for noisy optimization


1. Introduction
Traditional unconstrained optimization is inherently tied to derivatives; the necessary conditions for a first-order minimum are characterized by the derivative being equal to zero. Nevertheless, there exists a variety of functions which must be minimized when derivatives are unavailable. For example, consider an engineer in a lab who wants to maximize the strength of a metal bar by adjusting various factors of production. After the bar is constructed, it is broken to determine its strength. There is no closed-form expression for the bar's strength; each function evaluation comes from an expensive procedure. Also, the process of breaking the bar provides no information about how to change the factors of production to increase the bar's strength. In addition to the optimization of systems which must be physically evaluated, function evaluations by complex computer simulations often provide no (or unreliable) derivatives. Such simulations of complex phenomena (sometimes called black-box functions) are becoming increasingly common as computational modeling and computer hardware continue to advance. Whereas traditional techniques are concerned with efficiency of the algorithm, such concerns are secondary throughout this thesis. Explicitly, we assume that the cost of evaluating the function overwhelms the computational requirements of the algorithm.
In addition to unavailable derivatives, noise of various forms is often present in these functions. Throughout this thesis we will categorize this noise - or difference between the true value and computed value - into two categories: deterministic and stochastic. Deterministic noise (e.g., arising from finite-precision calculations or iterative methods) is often present if the function being optimized is a simulation of a physical system. For example, if evaluating the function involves solving a system of nonlinear partial differential equations or computing the eigenvalues of a large matrix, a small perturbation in the parameters can yield a large jump in the difference between the true and computed values. Though the computed value and true


value may differ, re-evaluating the function with the same choice of parameters will provide no further information. In contrast, re-evaluating a function with stochastic noise will provide additional information about the true value of the function. Two common sources of stochastic noise are found in functions whose evaluation requires a large-scale Monte-Carlo simulation of an actual system or if a function evaluation requires physically measuring a property in some system.
The thesis consists of three distinct but connected chapters addressing the problem:
$$\min_{x} f(x), \qquad f : \mathbb{R}^n \to \mathbb{R},$$
when the algorithm only has access to noise-corrupted values
$$\hat{f}(x) := f(x) + \epsilon(x),$$
where $\epsilon(x)$ denotes the noise.
Each chapter makes different assumptions about the noise $\epsilon(x)$. Chapter 3 assumes that the accuracy at which $\hat{f}$ approximates $f$ can vary, and that this accuracy can be quantified. For example, if $\hat{f}(x)$ is calculated using a Monte-Carlo simulation, increasing the number of trials will decrease the magnitude of $\epsilon(x)$. Similarly, if $\hat{f}$ is calculated by a finite element method, increased accuracy can be obtained by a finer grid. Of course, this greater accuracy comes at the cost of increased computational time; so it makes sense to vary the accuracy requirements over the course of the optimization. With this in mind, Chapter 3 asks: How can knowledge of the accuracy of different function evaluations be exploited to design a better algorithm?
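To make the varying-accuracy setting concrete, the following sketch (illustrative only, not code from this thesis; the quadratic objective and noise level are assumptions) evaluates a function by Monte-Carlo averaging, so the error magnitude can be reduced, at greater cost, by increasing the number of replications:

import numpy as np

def f_true(x):
    """Smooth underlying objective (illustrative choice)."""
    return np.sum(x**2)

def f_noisy(x, n_samples, rng):
    """Monte-Carlo style evaluation of f_true: each replication is corrupted by
    zero-mean noise, so averaging n_samples replications gives an estimate whose
    standard error shrinks like 1/sqrt(n_samples)."""
    replications = f_true(x) + rng.normal(0.0, 1.0, size=n_samples)
    estimate = replications.mean()
    std_error = replications.std(ddof=1) / np.sqrt(n_samples)
    return estimate, std_error

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
for n in (10, 1000, 100000):
    est, err = f_noisy(x, n, rng)
    print(f"n = {n:6d}: estimate = {est: .4f}, estimated error = {err:.4f}")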
Chapter 4 assumes that the noise for each function evaluation is independent of x and can be modeled as a normally distributed random variable with mean zero and a fixed, finite standard deviation. Though many other algorithms have been designed to optimize such a function, they often resort to repeated sampling of points. This provides information about the noise at a point, but no information about the


function nearby. This motivates the question addressed in Chapter 4: How can greater accuracy be efficiently achieved by oversampling without necessarily repeating function evaluations?
Chapter 5 assumes that a reasonably accurate estimate of the magnitude of the noise can be obtained, and that this estimate remains relatively constant throughout the domain. Though there are many algorithms in the literature designed to optimize noisy functions, very few use estimates of the noise in their termination criteria. When function evaluations are cheap, termination can be determined by common tests (e.g., small step size or gradient approximation). But when function evaluations are expensive, determining when to stop becomes an important multi-objective optimization problem. The optimizer wants to find the best solution possible while minimizing computational effort. As this is a difficult problem to explicitly formulate, practitioners frequently terminate algorithms when (i) a predefined number of iterations has elapsed, (ii) no decrease in the optimal function value has been detected for a number of iterations, or (iii) the distance between a number of successive iterates is below some threshold. Chapter 5 attempts to answer the question: When should an algorithm optimizing an expensive, noisy function be terminated?
1.1 Review of Methods
Before discussing our algorithms further, we first discuss previous DFO techniques.
Heuristics are perhaps the first recourse when attempting to optimize a function without derivatives. Simulated annealing [36, 63], genetic algorithms [28], random search and its variants [55, 66, 39, 52], tabu search [20], scatter search [21], particle swarm optimization [34], and Nelder-Mead [45] are just a few of the heuristics developed to solve problems where only function evaluations are available. Though most of these algorithms lack formal convergence results (aside from results dependent on the algorithm producing iterates which are dense in the domain), they remain popular


due to their (1) ease of implementation, (2) flexibility on a variety of problem classes, and (3) frequent success in practice.
Other techniques attempt to approximate the unavailable derivative. Classical finite-difference methods approximate the derivative by adjusting each variable and noting the change in the function value. Other techniques, such as the pattern search methods [62, 2] and implicit filtering [5], evaluate a changing pattern of points around the best known solution. Also of note is the DIRECT algorithm [30], a global optimization method based on dividing hyper-rectangles using only function values.
An increasingly popular class of algorithms for derivative-free optimization (DFO) is model-based trust region methods [31, 11]. Local models approximating the function are constructed and minimized to generate successive iterates. These models are commonly low-order polynomials interpolating function values close to the best known value, for example Powell's UOBYQA algorithm [48]. Other examples include [49], where Powell introduces a minimum Frobenius norm condition on underdetermined quadratic models, and ORBIT by Wild et al. [64], which constructs models using interpolating radial basis functions. (These local models should not be confused with kriging [59] or response surface methodologies [44], which build global models of the function.) Though implementation of these techniques is not as simple as some other techniques, a well-developed convergence theory exists. As this thesis focuses on noisy DFO problems, we considered trust-region methods with regression models most appropriate (since, in many cases, regression models through enough points can approximate the true function).
There are also a variety of existing DFO algorithms for optimizing functions with noise. For functions with stochastic noise, replications of function evaluations can be a simple way to modify existing algorithms. For example, [14] modifies Powell’s UOBYQA [48], [15] modifies DIRECT [30], and [61] modifies Nelder-Mead by repeated sampling of the function at points of interest. For deterministic noise, Kelley


[33] proposes a technique to detect and restart Nelder-Mead methods. Neumaier’s SNOBFIT [29] algorithm accounts for noise by not requiring the surrogate functions to interpolate function values, but rather fit a stochastic model. Similarly, [10] proposes using least-squares regression models instead of interpolating models when noise is present in the function evaluations.
Lastly, Stochastic Approximation algorithms are also designed to minimize functions with stochastic noise. These algorithms are developed by statisticians to solve
$$\min_x f(x) = \mathbb{E}\left[\hat{f}(x)\right],$$
when only $\hat{f}$ can be computed. Two of the more famous algorithms, the Kiefer-Wolfowitz and Simultaneous Perturbation methods, take predefined step lengths in a direction approximating $-\nabla f$. These techniques have strong theoretical convergence results, but can be difficult to implement in practice. For further discussion of these algorithms, see the beginning of Chapter 4.
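For illustration, the sketch below (a simplified, hypothetical implementation, not one of the algorithms studied in this thesis) carries out a Kiefer-Wolfowitz style iteration: central finite differences of the noisy function approximate $-\nabla f$, and the step lengths $a_k$ and difference widths $c_k$ follow predefined schedules, which is exactly the feature contrasted with the trust region approach of Chapter 4:

import numpy as np

def kiefer_wolfowitz(f_noisy, x0, iters=200, a0=0.5, c0=0.5, rng=None):
    """Minimize a noisy function with a Kiefer-Wolfowitz style iteration.
    a_k = a0/k and c_k = c0/k**(1/3) are typical (predefined) gain sequences."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    n = x.size
    for k in range(1, iters + 1):
        a_k = a0 / k
        c_k = c0 / k ** (1.0 / 3.0)
        grad_est = np.zeros(n)
        for i in range(n):
            e = np.zeros(n); e[i] = c_k
            grad_est[i] = (f_noisy(x + e, rng) - f_noisy(x - e, rng)) / (2.0 * c_k)
        x = x - a_k * grad_est   # step along the estimated negative gradient
    return x

# Noisy quadratic used purely for illustration.
def f_noisy(x, rng):
    return np.sum((x - 1.0) ** 2) + rng.normal(0.0, 0.1)

print(kiefer_wolfowitz(f_noisy, x0=[4.0, -3.0]))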
1.2 Outline
The work in this thesis focuses on modifications to model-based trust region methods in order to handle noise. Throughout the thesis we assume that only noisy, expensive function evaluations $\hat{f}$ are available, but there is some smooth underlying function $f$ which is twice continuously differentiable with a Lipschitz continuous Hessian. We also assume that the noise in the evaluation of $f$ is unbiased with bounded variance.
Chapter 3 (joint work with Stephen Billups and Peter Graf) proposes a DFO algorithm to optimize functions which are expensive to evaluate and contain computational noise. The algorithm is based on the trust region methods of [9, 10] which build interpolation or regression models around the best known solution. We propose using weighted regression models in a trust region framework, and prove convergence of such methods provided the weighting scheme satisfies some basic conditions.


The algorithm fits into a general framework for derivative-free trust region methods using quadratic models, which was described by Conn, Scheinberg, and Vicente in [12, 11]. We shall refer to this framework as the CSV2-framework. This framework constructs a quadratic model function $m_k(\cdot)$, which approximates the objective function $f$ on a set of sample points $Y_k \subset \mathbb{R}^n$ at each iteration $k$. The next iterate is then determined by minimizing $m_k$ over a trust region. In order to guarantee global convergence, the CSV2-framework monitors the distribution of points in the sample set, and occasionally invokes a model-improvement algorithm that modifies the sample set to ensure $m_k$ accurately approximates $f$. The CSV2-framework is the basis of the well-known DFO algorithm which is freely available on the COIN-OR website [38].
To fit our algorithm into the CSV2-framework we extend the theory of poisedness, as described in [12], to weighted regression. We show (Proposition 3.12) that a sample set that is strongly $\Lambda$-poised in the regression sense is also strongly $c\Lambda$-poised in the weighted regression sense for some constant $c$, provided that no weight is too small relative to the other weights. Thus, any model improvement scheme that ensures strong $\Lambda$-poisedness in the regression sense can be used in the weighted regression framework.
The convergence proof in Chapter 3 requires that the computational error decreases as the trust region decreases; such an assumption can be satisfied if the user has some control of the accuracy in the function evaluation. Since Chapter 3 is centered on exploiting differing levels of accuracy in different function evaluations, such an assumption is reasonable for that chapter. In Chapter 4, we remove this assumption, but add the assumption that $\hat{f}$ has the form
$$\hat{f}(x) = f(x) + \epsilon \qquad (1.1)$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ for some fixed, finite variance $\sigma^2$. Chapter 4 modifies the algorithm from Chapter 3 to converge when the error does not decrease


with the trust region radius. With some light assumptions on the noise and underlying function, we prove the algorithm generates a subsequence of iterates which converge almost surely to a first-order stationary point in the case where the number of successful iterates is finite.
At a given point of interest $x^0$, the algorithm does not repeatedly sample $\hat{f}(x^0)$ in order to glean information about $f(x^0)$. Rather, $m_k(x^0)$, the value of the trust region model at $x^0$, is used to estimate $f(x^0)$. We derive bounds on the error between $f$ and $m$, provided the set of points used to construct $m$ satisfies certain geometric conditions, called strongly $\Lambda$-poised (see Definition 2.14), and contains a sufficient number of points. Also, the step size is controlled by the algorithm, increasing and decreasing as the algorithm progresses and stagnates. This contrasts many of the methods in the Stochastic Approximation literature where the user must provide a predefined set of steps to be taken by the algorithm.
The results in Section 4.3 prove the algorithm will generate a subsequence of iterates converging almost surely to a first-order stationary point when the number of successful iterates is finite, and makes progress in the infinite case. Such results are not common for most DFO algorithms on problems with stochastic noise. Both the simplicial direct search method [1] and the trust region method in [4] prove similar convergence results, but both reduce the variance at a point by repeated sampling. In addition to our strong convergence result, we are able to directly quantify the probability of the success of some iterates (see Lemma 4.15 for one such example). We are unaware of any other similar theoretical results for algorithms minimizing stochastic functions.
Chapter 5 (joint work with Stefan Wild) addresses termination criteria, the choice of which is a common problem when optimizing noisy functions. We propose objective measures to compare the quality of termination rules. Families of termination tests are then proposed and their performance is analyzed across a broad range of DFO


algorithms. Recommendations for tests which work for many algorithms are also provided. Lastly Chapter 6 contains concluding remarks and directions for future research.
1.3 Notation
The following notation will be used: $\mathbb{R}^n$ denotes the Euclidean space of real $n$-vectors. $\|\cdot\|_p$ denotes the standard $\ell_p$ vector norm, and $\|\cdot\|$ (without the subscript) denotes the Euclidean norm. $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. $C^k$ denotes the set of functions on $\mathbb{R}^n$ with $k$ continuous derivatives. $D^j f$ denotes the $j$th derivative of a function $f \in C^k$, $j \le k$. Given an open set $\Omega \subset \mathbb{R}^n$, $LC^k(\Omega)$ denotes the set of $C^k$ functions with Lipschitz continuous $k$th derivatives. That is, for $f \in LC^k(\Omega)$, there exists a Lipschitz constant $L$ such that
$$\|D^k f(y) - D^k f(x)\| \le L\,\|y - x\| \quad \text{for all } x, y \in \Omega.$$
$\mathcal{P}_n^d$ denotes the space of polynomials of degree less than or equal to $d$ in $\mathbb{R}^n$; $q_1$ denotes the dimension of $\mathcal{P}_n^2$ (specifically $q_1 = (n+1)(n+2)/2$). We use standard "big-Oh" notation (written $O(\cdot)$) to state, for example, that for two functions on the same domain, $f(x) = O(g(x))$ if there exists a constant $M$ such that $|f(x)| \le M|g(x)|$ for all $x$ with sufficiently small norm. Given a set $Y$, $|Y|$ denotes the cardinality and $\operatorname{conv}(Y)$ denotes the convex hull. For a real number $a$, $\lfloor a \rfloor$ denotes the greatest integer less than or equal to $a$. For a matrix $A$, $A^+$ denotes the Moore-Penrose generalized inverse [22]. $e^j$ denotes the $j$th column of the identity matrix. The ball of radius $\Delta$ centered at $x \in \mathbb{R}^n$ is denoted $B(x; \Delta)$. Given a vector $w$, $\operatorname{diag}(w)$ denotes the diagonal matrix $W$ with diagonal entries $W_{ii} = w_i$. For a square matrix $A$, $\operatorname{cond}(A)$ denotes the condition number, $\lambda_{\min}(A)$ denotes the smallest eigenvalue, and $\sigma_{\min}(A)$ denotes the smallest singular value. For a set $Y := \{y^0, \ldots, y^p\} \subset \mathbb{R}^n$, the quadratic design matrix $X$ has rows
$$\left[\,1,\; y_1^j,\; \ldots,\; y_n^j,\; \tfrac{1}{2}(y_1^j)^2,\; y_1^j y_2^j,\; \ldots,\; \tfrac{1}{2}(y_n^j)^2\,\right], \qquad j = 0, \ldots, p. \qquad (1.2)$$


Let $m_k$ denote the $k$th trust region model (as defined in Chapter 2). Let $g_k = \nabla m_k(x^k)$ and $H_k = \nabla^2 m_k(x^k)$. Define
$$s_k(x) = \max\{\|\nabla m_k(x)\|,\, -\lambda_{\min}(\nabla^2 m_k(x))\}, \qquad s(x) = \max\{\|\nabla f(x)\|,\, -\lambda_{\min}(\nabla^2 f(x))\}.$$
These variables measure how close $x$ is to a first- and second-order stationary point of $f$ and $m_k$ (i.e., the gradient is zero and all eigenvalues are positive). If $X$ is a random variable, the notation $X \underset{\alpha}{\le} \gamma$ denotes $P(X \le \gamma) \ge 1 - \alpha$. Note that the relation $\underset{\alpha}{\le}$ is not transitive.


2. Background
Before continuing, we introduce the background material on which the thesis is constructed.
2.1 Model-based Trust Region Methods
A trust region algorithm is a numerical technique for minimizing a sufficiently smooth function $f$. At each iteration $k$, a model function $m_k(x)$ is constructed to approximate $f$ near the best point $x^k$. When derivatives are available, $m_k$ is usually a truncated first- or second-order Taylor series approximation of $f$ at $x^k$. For example, if $\nabla f$ and $\nabla^2 f$ are easy to calculate,
$$m_k(x) = f(x^k) + \nabla f(x^k)^T(x - x^k) + \tfrac{1}{2}(x - x^k)^T \nabla^2 f(x^k)(x - x^k).$$
$m_k$ is minimized over the trust region $B(x^k; \Delta_k)$ by solving the problem
$$\min_{s : \|s\| \le \Delta_k} m_k(x^k + s)$$
to generate a potential next trust region center $x^k + s^k$. $f(x^k + s^k)$ is evaluated and the ratio
$$\rho_k = \frac{f(x^k) - f(x^k + s^k)}{m_k(x^k) - m_k(x^k + s^k)}$$
is calculated, which compares the actual decrease in $f$ versus the decrease predicted by the model $m_k$. This ratio quantifies the success of iteration $k$ and also how well the model function approximates the true function $f$. If $\rho_k$ is larger than a prescribed threshold $\eta_1$, it indicates that the iteration was successful and the model is a good approximation of the function. In this case, $x^{k+1}$ is set to $x^k + s^k$ and the trust region radius $\Delta_k$ is increased. If $\rho_k$ is less than another parameter $\eta_0$, the model function is considered unreliable, so the trust region radius $\Delta_k$ is decreased and the iterate is not updated (i.e., $x^{k+1} = x^k$). Lastly, $k$ is incremented and the process repeats.
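The sketch below (an illustrative toy implementation; the parameter values and the Cauchy-point subproblem solver are assumptions, not choices made in this thesis) carries out one such trust region iteration with a quadratic Taylor model:

import numpy as np

def cauchy_point(g, H, delta):
    """Cauchy point for min g^T s + 0.5 s^T H s subject to ||s|| <= delta:
    minimize the model along -g, clipped to the trust region boundary."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    curv = g @ H @ g
    tau = 1.0 if curv <= 0.0 else min(gnorm**3 / (delta * curv), 1.0)
    return -tau * (delta / gnorm) * g

def trust_region_iteration(f, grad, hess, x_k, delta_k,
                           eta0=0.25, eta1=0.75, gamma=0.5, gamma_inc=2.0,
                           delta_max=10.0):
    """One iteration of a classical trust region method with a quadratic
    Taylor model and a Cauchy-point step (parameter values are illustrative)."""
    g, H = grad(x_k), hess(x_k)
    if np.linalg.norm(g) < 1e-12:                    # (approximately) stationary
        return x_k, delta_k
    s_k = cauchy_point(g, H, delta_k)
    pred = -(g @ s_k + 0.5 * s_k @ H @ s_k)          # decrease predicted by the model
    rho_k = (f(x_k) - f(x_k + s_k)) / pred           # actual vs. predicted decrease
    if rho_k >= eta1:
        return x_k + s_k, min(gamma_inc * delta_k, delta_max)
    elif rho_k >= eta0:
        return x_k + s_k, delta_k
    else:
        return x_k, gamma * delta_k

# Illustrative use on a quadratic bowl.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
hess = lambda x: np.eye(x.size)
x, delta = np.array([3.0, -4.0]), 1.0
for _ in range(5):
    x, delta = trust_region_iteration(f, grad, hess, x, delta)
print(x, delta)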
2.1.1 Model Construction Without Derivatives


When derivatives are unavailable, the models $m_k$ are constructed using points where $f$ has been evaluated. For example, the Conn, Scheinberg, and Vicente framework (which we refer to as the CSV2-framework) builds models $m_k$ from a specified class of models $\mathcal{M}$ using a sample set of points $Y_k = \{y^0, \ldots, y^p\} \subset B(x^k; \Delta_k)$ on which the function has been evaluated.
Given $Y_k$ and a vector of corresponding function values $f = (f(y^0), \ldots, f(y^p))$, an interpolating model is a model $m(x)$ such that $m(y^i) = f(y^i)$ for $i = 0, \ldots, p$. Given a basis $\phi = \{\phi_0(x), \ldots, \phi_q(x)\}$ of the class of models $\mathcal{M}$, we can calculate the coefficients $\beta_i$ in the basis representation of the interpolating model $m(x) = \sum_{i=0}^{q} \beta_i \phi_i(x)$ by solving the linear system
$$M(\phi, Y)\,\beta = f, \qquad (2.1)$$
where
$$M(\phi, Y) = \begin{bmatrix} \phi_0(y^0) & \phi_1(y^0) & \cdots & \phi_q(y^0) \\ \phi_0(y^1) & \phi_1(y^1) & \cdots & \phi_q(y^1) \\ \vdots & \vdots & & \vdots \\ \phi_0(y^p) & \phi_1(y^p) & \cdots & \phi_q(y^p) \end{bmatrix}.$$
Note that for this equation to have a unique solution, the number of sample points $p+1$ must equal the size of the basis $q+1$ and the matrix $M(\phi, Y)$ must be invertible.
Regression models have also been investigated [10], in which the number of sample points $p+1$ is greater than the size of the basis. In this case, the matrix $M(\phi, Y)$ has more rows than columns, so equation (2.1) is solved in a least squares sense.
Lastly, if $M(\phi, Y)$ has more columns than rows, the system (2.1) is underdetermined. Nevertheless, bounds between the function and an underdetermined model can be obtained in certain cases. For example, see [13] concerning the minimum Frobenius norm method.
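As a concrete illustration of (2.1) (a sketch under the assumption of the quadratic monomial basis used later in (3.3); not code from the thesis), the design matrix can be assembled row by row and the system solved by least squares, which covers the square interpolation case and the overdetermined regression case uniformly:

import numpy as np

def quadratic_basis(y):
    """Monomial basis phi(y) = (1, y_1, ..., y_n, y_1^2/2, y_1 y_2, ..., y_n^2/2)."""
    y = np.asarray(y, dtype=float)
    quad = [y[i] * y[j] * (0.5 if i == j else 1.0)
            for i in range(len(y)) for j in range(i, len(y))]
    return np.concatenate(([1.0], y, quad))

def build_design_matrix(Y):
    """Rows of M(phi, Y) are the basis evaluated at each sample point."""
    return np.array([quadratic_basis(y) for y in Y])

def fit_model(Y, fvals):
    """Solve M(phi, Y) beta = f in the least squares sense (exact interpolation
    when the matrix is square and nonsingular, regression when overdetermined)."""
    M = build_design_matrix(Y)
    beta, *_ = np.linalg.lstsq(M, np.asarray(fvals, dtype=float), rcond=None)
    return beta

def evaluate_model(beta, x):
    return quadratic_basis(x) @ beta

# Illustrative use: fit f(x) = x_1^2 + x_2 from sampled values.
rng = np.random.default_rng(1)
Y = rng.uniform(-1, 1, size=(10, 2))                # 10 points, n = 2, q1 = 6
fvals = [y[0] ** 2 + y[1] for y in Y]
beta = fit_model(Y, fvals)
print(evaluate_model(beta, np.array([0.3, -0.2])))   # approximately 0.09 - 0.2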
2.1.2 CSV2-framework


We next outline the CSV2-framework for derivative-free trust region methods described by Conn, Scheinberg, and Vicente [12, Algorithm 10.3]. Algorithms in the framework construct a model function $m_k(\cdot)$ at iteration $k$, which approximates the objective function $f$ on a set of sample points $Y_k = \{y^0, \ldots, y^{p_k}\} \subset \mathbb{R}^n$. The next iterate is then determined by minimizing $m_k$. Specifically, given the iterate $x^k$, a putative next iterate is given by $x^k + s^k$ where the step $s^k$ solves the trust region subproblem
$$\min_{s : \|s\| \le \Delta_k} m_k(x^k + s),$$
where the scalar $\Delta_k > 0$ denotes the trust region radius, which may vary from iteration to iteration. If $x^k + s^k$ produces sufficient descent in the model function, then $f(x^k + s^k)$ is evaluated, and the iterate is accepted if $f(x^k + s^k) < f(x^k)$; otherwise, the step is not accepted. In either case, the trust region radius may be adjusted, and a model-improvement algorithm may be called to obtain a more accurate model.
To establish convergence properties, the following smoothness assumption is made on $f$:
Assumption 2.1 Suppose that a set of points $S \subset \mathbb{R}^n$ and a radius $\Delta_{\max}$ are given. Let $\Omega$ be an open domain containing the $\Delta_{\max}$ neighborhood
$$\bigcup_{x \in S} B(x; \Delta_{\max})$$
of the set $S$. Assume $f \in LC^2(\Omega)$ with Lipschitz constant $L$.
The CSV2-framework does not specify how the model functions are constructed. However, it does require that the model functions be selected from a fully quadratic class of model functions $\mathcal{M}$:
Definition 2.2 Let $f$ satisfy Assumption 2.1. Let $\kappa = (\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, \nu_2^m)$ be a given vector of constants, and let $\Delta > 0$. A model function $m \in C^2$ is $\kappa$-fully quadratic on $B(x; \Delta)$ if $m$ has a Lipschitz continuous Hessian with corresponding Lipschitz constant bounded by $\nu_2^m$ and
• the error between the Hessian of the model and the Hessian of the function satisfies
$$\|\nabla^2 f(y) - \nabla^2 m(y)\| \le \kappa_{eh}\,\Delta \quad \text{for all } y \in B(x; \Delta),$$
• the error between the gradient of the model and the gradient of the function satisfies
$$\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\,\Delta^2 \quad \text{for all } y \in B(x; \Delta),$$
• the error between the model and the function satisfies
$$|f(y) - m(y)| \le \kappa_{ef}\,\Delta^3 \quad \text{for all } y \in B(x; \Delta).$$
Definition 2.3 Let $f$ satisfy Assumption 2.1. A set of model functions $\mathcal{M} = \{m : \mathbb{R}^n \to \mathbb{R},\; m \in C^2\}$ is called a fully quadratic class of models if there exist positive constants $\kappa = (\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, \nu_2^m)$ such that the following hold:
1. For any $x \in S$ and $\Delta \in (0, \Delta_{\max}]$, there exists a model function $m$ in $\mathcal{M}$ which is $\kappa$-fully quadratic on $B(x; \Delta)$.
2. For this class $\mathcal{M}$, there exists an algorithm, called a "model-improvement" algorithm, that in a finite, uniformly bounded (with respect to $x$ and $\Delta$) number of steps can
• either certify that a given model $m \in \mathcal{M}$ is $\kappa$-fully quadratic on $B(x; \Delta)$,
• or find a model $\tilde{m} \in \mathcal{M}$ that is $\kappa$-fully quadratic on $B(x; \Delta)$.
Note that this definition of a fully quadratic class of models is equivalent to [12, Definition 6.2]; but we have given a separate definition of a $\kappa$-fully quadratic model (Definition 2.2) that includes the use of $\kappa$ to stress the fixed nature of the bounding constants. This change simplifies some analysis by allowing us to discuss $\kappa$-fully quadratic models independent of the class of models they belong to. It is important


to note that $\kappa$ does not need to be known explicitly. Instead, it can be defined implicitly by the model improvement algorithm. All that is required is for $\kappa$ to be fixed (that is, independent of $x$ and $\Delta$). We also note that the set $\mathcal{M}$ may include non-quadratic functions, but when the model functions are quadratic, the Hessian is fixed, so $\nu_2^m$ can be chosen to be zero. For the algorithms presented in Chapter 3 and Chapter 4, we focus on model functions that are quadratic. That is, $\mathcal{M} = \mathcal{P}_n^2$.
As a side note, $\kappa$-fully quadratic models may be too difficult to construct or may be undesired in some situations. If that is the case, $\kappa$-fully linear models might provide a useful alternative.
Definition 2.4 Let $f \in LC^2$, let $\kappa = (\kappa_{ef}, \kappa_{eg}, \nu_1^m)$ be a given vector of constants, and let $\Delta > 0$. A model function $m \in C^2$ is $\kappa$-fully linear on $B(x; \Delta)$ if $m$ has a Lipschitz continuous gradient with corresponding Lipschitz constant bounded by $\nu_1^m$ and
• the error between the gradient of the model and the gradient of the function satisfies
$$\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\,\Delta \quad \text{for all } y \in B(x; \Delta),$$
• the error between the model and the function satisfies
$$|f(y) - m(y)| \le \kappa_{ef}\,\Delta^2 \quad \text{for all } y \in B(x; \Delta).$$
If $m_k$ is $\kappa$-fully linear, it approximates $f$ in a fashion similar to the first-order Taylor model of $f$. If $m_k$ is $\kappa$-fully quadratic, then it approximates $f$ in a fashion similar to the second-order Taylor model of $f$. If $\kappa$-fully linear (or quadratic) models are used within the CSV2-framework, we can guarantee convergence of the algorithm to a first- (or second-) order stationary point of $f$.
A critical distinction between the CSV2-framework and classical trust region
methods lies in the optimality criteria. In classical trust region methods, $m_k$ is the


second-order Taylor approximation of $f$ at $x^k$; so if $x^k$ is optimal for $m_k$, it satisfies the first- and second-order necessary conditions for an optimum of $f$. In the CSV2-framework, $x^k$ must be optimal for $m_k$, but $m_k$ must also be an accurate approximation of $f$ near $x^k$. This requires that the trust region radius is small and that $m_k$ is $\kappa$-fully quadratic on the trust region for some fixed $\kappa$.
To explicitly outline the CSV2-framework, we provide pseudocode below. In the algorithm, $g_k^{icb}$ and $H_k^{icb}$ denote the gradient and Hessian of the incumbent model $m_k^{icb}$. (We use the superscript $icb$ to stress that incumbent parameters from the previous iterates may be changed before they are used in the trust region step.) The optimality of $x^k$ with respect to $m_k$ is tested by calculating $\sigma_k^{icb} = \max\{\|g_k^{icb}\|,\, -\lambda_{\min}(H_k^{icb})\}$. This quantity is zero if and only if the first- and second-order optimality conditions for $m_k$ are satisfied. The algorithm enters the criticality step when $\sigma_k^{icb}$ is close to zero. This routine builds a (possibly) new $\kappa$-fully quadratic model for the current $\Delta_k^{icb}$, and tests if $\sigma_k^{icb}$ for this model is sufficiently large. If so, a descent direction has been determined, and the algorithm can proceed. If not, the criticality step reduces $\Delta_k^{icb}$ and updates the sample set to improve the accuracy of the model function near $x^k$. The criticality step ends when $\sigma_k^{icb}$ is large enough (and the algorithm proceeds) or when both $\sigma_k^{icb}$ and $\Delta_k^{icb}$ are smaller than given threshold values $\epsilon_c$ and $\Delta_{\min}$ (in which case the algorithm has identified a second-order stationary point). We refer the reader to [12] for a more detailed discussion of the algorithm, including explanations of the parameters $\eta_0$, $\eta_1$, $\gamma$, $\gamma_{inc}$, $\beta$, $\mu$ and $\omega$.
Algorithm CSV2 ([12, Algorithm 10.3])
Step 0 (initialization): Choose a fully quadratic class of models $\mathcal{M}$ and a corresponding model-improvement algorithm, with associated $\kappa$ defined by Definition 2.3. Choose an initial point $x^0$ and maximum trust region radius $\Delta_{\max} > 0$. We assume that the following are given: an initial model $m_0^{icb}(x)$ (with gradient and Hessian at $x = x^0$ denoted by $g_0^{icb}$ and $H_0^{icb}$, respectively), $\sigma_0^{icb} = \max\{\|g_0^{icb}\|,\, -\lambda_{\min}(H_0^{icb})\}$, and a trust region radius $\Delta_0^{icb} \in (0, \Delta_{\max}]$.
The constants $\eta_0, \eta_1, \gamma, \gamma_{inc}, \epsilon_c, \beta, \mu, \omega$ are given and satisfy the conditions $0 \le \eta_0 < \eta_1 < 1$ (with $\eta_1 \ne 0$), $0 < \gamma < 1 < \gamma_{inc}$, $\epsilon_c > 0$, $\mu > \beta > 0$, $\omega \in (0,1)$. Set $k = 0$.
Step 1 (criticality step): If $\sigma_k^{icb} > \epsilon_c$, then $m_k = m_k^{icb}$ and $\Delta_k = \Delta_k^{icb}$.
If $\sigma_k^{icb} \le \epsilon_c$, then proceed as follows. Call the model-improvement algorithm to attempt to certify if the model $m_k^{icb}$ is $\kappa$-fully quadratic on $B(x^k; \Delta_k^{icb})$. If at least one of the following conditions hold,
• the model $m_k^{icb}$ is not certifiably $\kappa$-fully quadratic on $B(x^k; \Delta_k^{icb})$, or
• $\Delta_k^{icb} > \mu\,\sigma_k^{icb}$,
then apply Algorithm CriticalityStep (described below) to construct a model $\tilde{m}_k(x)$ (with gradient and Hessian at $x = x^k$ denoted by $\tilde{g}_k$ and $\tilde{H}_k$, respectively), with $\tilde{\sigma}_k = \max\{\|\tilde{g}_k\|,\, -\lambda_{\min}(\tilde{H}_k)\}$, which is $\kappa$-fully quadratic on the ball $B(x^k; \tilde{\Delta}_k)$ for some $\tilde{\Delta}_k \in (0, \mu\tilde{\sigma}_k]$ given by [12, Algorithm 10.4]. In such a case set
$$m_k = \tilde{m}_k \quad \text{and} \quad \Delta_k = \min\left\{\max\left\{\tilde{\Delta}_k,\, \beta\tilde{\sigma}_k\right\},\, \Delta_k^{icb}\right\}.$$
Otherwise, set $m_k = m_k^{icb}$ and $\Delta_k = \Delta_k^{icb}$. For a more complete discussion of trust region management, see [26].
Step 2 (step calculation): Compute a step $s^k$ that sufficiently reduces the model $m_k$ (in the sense of [12, (10.13)]) such that $x^k + s^k \in B(x^k; \Delta_k)$.
Step 3 (acceptance of the trial point): Compute $f(x^k + s^k)$ and define
$$\rho_k = \frac{f(x^k) - f(x^k + s^k)}{m_k(x^k) - m_k(x^k + s^k)}.$$
If $\rho_k \ge \eta_1$, or if both $\rho_k \ge \eta_0$ and the model is $\kappa$-fully quadratic on $B(x^k; \Delta_k)$, then $x^{k+1} = x^k + s^k$ and the model is updated to include the new iterate into the sample set, resulting in a new model $m_{k+1}^{icb}(x)$ (with gradient and Hessian at $x = x^{k+1}$ denoted by $g_{k+1}^{icb}$ and $H_{k+1}^{icb}$, respectively), with $\sigma_{k+1}^{icb} = \max\{\|g_{k+1}^{icb}\|,\, -\lambda_{\min}(H_{k+1}^{icb})\}$; otherwise, the model and the iterate remain unchanged ($m_{k+1}^{icb} = m_k$ and $x^{k+1} = x^k$).
Step 4 (model improvement): If $\rho_k < \eta_1$, use the model-improvement algorithm to
• attempt to certify that $m_k$ is $\kappa$-fully quadratic on $B(x^k; \Delta_k)$,
• if such a certificate is not obtained, we say that $m_k$ is not certifiably $\kappa$-fully quadratic and make one or more suitable improvement steps.
Define $m_{k+1}^{icb}$ to be the (possibly improved) model.
Step 5 (trust region update): Set
$$\Delta_{k+1}^{icb} \in \begin{cases}
\{\min\{\gamma_{inc}\Delta_k,\, \Delta_{\max}\}\} & \text{if } \rho_k \ge \eta_1 \text{ and } \Delta_k < \beta\sigma_k, \\
[\Delta_k,\, \min\{\gamma_{inc}\Delta_k,\, \Delta_{\max}\}] & \text{if } \rho_k \ge \eta_1 \text{ and } \Delta_k \ge \beta\sigma_k, \\
\{\gamma\Delta_k\} & \text{if } \rho_k < \eta_1 \text{ and } m_k \text{ is } \kappa\text{-fully quadratic}, \\
\{\Delta_k\} & \text{if } \rho_k < \eta_1 \text{ and } m_k \text{ is not certifiably } \kappa\text{-fully quadratic}.
\end{cases}$$
Increment $k$ by 1 and go to Step 1.
Algorithm CriticalityStep ([12, Algorithm 10.4]) This algorithm is applied only if $\sigma_k^{icb} \le \epsilon_c$ and at least one of the following holds: the model $m_k^{icb}$ is not certifiably $\kappa$-fully quadratic on $B(x^k; \Delta_k^{icb})$, or $\Delta_k^{icb} > \mu\,\sigma_k^{icb}$.
Initialization: Set $i = 0$. Set $m_k^{(0)} = m_k^{icb}$.
Repeat: Increment $i$ by one. Use the model improvement algorithm to improve the previous model $m_k^{(i-1)}$ until it is $\kappa$-fully quadratic on $B(x^k; \omega^{i-1}\Delta_k^{icb})$. Denote the new model by $m_k^{(i)}$. Set $\tilde{\Delta}_k = \omega^{i-1}\Delta_k^{icb}$ and $\tilde{m}_k = m_k^{(i)}$.
Until $\tilde{\Delta}_k \le \mu\,\sigma_k^{(i)}$.


Global Convergence If the following assumptions are satisfied, it has been shown that the CSV2-framework iterates will converge to a stationary point of $f$. Define the set $L(x^0) = \{x \in \mathbb{R}^n : f(x) \le f(x^0)\}$.
Assumption 2.5 Assume that $f$ is bounded from below on $L(x^0)$.
Assumption 2.6 There exists a constant $\kappa_{bhm} > 0$ such that, for all $x^k$ generated by the algorithm,
$$\|H_k\| \le \kappa_{bhm}.$$
Theorem 2.7 ([12, Theorem 10.23]) Let Assumptions 2.1, 2.5 and 2.6 hold with $S = L(x^0)$. Then, if the models used in the CSV2-framework are $\kappa$-fully quadratic, the iterates $x^k$ generated by the CSV2-framework satisfy
$$\lim_{k \to +\infty} \max\left\{\|\nabla f(x^k)\|,\, -\lambda_{\min}(\nabla^2 f(x^k))\right\} = 0.$$
It follows from this theorem that any accumulation point of {xk} satisfies the first- and second-order necessary conditions for a minimum of /.
2.1.3 Poisedness
Having outlined the CSV2-framework, we can discuss conditions on the sample set used to build $m_k$ which guarantee the model sufficiently approximates $f$. Consider the set of polynomials in $\mathbb{R}^n$ of degree less than or equal to $d$, denoted $\mathcal{P}_n^d$. A basis $\phi = \{\phi_0(x), \ldots, \phi_q(x)\}$ of $\mathcal{P}_n^d$ is a set of polynomials from $\mathcal{P}_n^d$ which span $\mathcal{P}_n^d$. That is, for any $P(x) \in \mathcal{P}_n^d$ there exist coefficients $\beta_i$ such that $P(x) = \sum_{i=0}^{q} \beta_i \phi_i(x)$. We let $|\mathcal{P}_n^d|$ denote the number of elements in any basis $\phi$ of $\mathcal{P}_n^d$. For example, $|\mathcal{P}_n^1| = n+1$ and $|\mathcal{P}_n^2| = (n+1)(n+2)/2$.
Definition 2.8 A set of points $X = \{x^0, \ldots, x^p\} \subset \mathbb{R}^n$ with $|X| = |\mathcal{P}_n^d|$ is poised for interpolation if the matrix $M(\phi, X)$ is nonsingular for some basis $\phi$ of $\mathcal{P}_n^d$.


If $|X| > |\mathcal{P}_n^d|$, we can construct the least squares regression model by solving (2.1) as well. We extend the definition of poisedness to the regression case.
Definition 2.9 A set of points $X = \{x^0, \ldots, x^p\} \subset \mathbb{R}^n$ with $|X| \ge |\mathcal{P}_n^d|$ is poised for regression if the matrix $M(\phi, X)$ has full column rank for some basis $\phi$ of $\mathcal{P}_n^d$.
Since we have limited information about the function $f$, we want the interpolating or regression model $m(x)$ to be an accurate approximation in a region of interest. This requires that $X$ consists of points spread out within said region. Since $M(\phi, X)$ can be arbitrarily poorly conditioned while $X$ is still poised, simply being poised is not enough to measure the quality of a set $X$. Also, looking at the condition number of $M(\phi, X)$ is inadequate, since scaling the sample set $X$ or choosing a different basis can arbitrarily adjust this quantity. Nevertheless, there are methods for measuring the quality of a sample set, one of the most common of which is based on Lagrange polynomials.
Definition 2.10 For a set $X = \{x^0, \ldots, x^p\} \subset \mathbb{R}^n$ with $|X| = |\mathcal{P}_n^d|$, the set of interpolating Lagrange polynomials $\ell = \{\ell_0, \ldots, \ell_p\} \subset \mathcal{P}_n^d$ are the polynomials satisfying
$$\ell_i(x^j) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
If the set $X$ is poised, then the set of polynomials is guaranteed to exist and be uniquely defined.
We can extend the definition of Lagrange polynomials to the regression case in a natural fashion.
Definition 2.11 For a set $X = \{x^0, \ldots, x^p\} \subset \mathbb{R}^n$ with $|X| > |\mathcal{P}_n^d|$, the set of regression Lagrange polynomials $\ell = \{\ell_0, \ldots, \ell_p\} \subset \mathcal{P}_n^d$ are the polynomials satisfying
$$\ell_i(x^j) \overset{\text{l.s.}}{=} \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}$$
where $\overset{\text{l.s.}}{=}$ denotes equality in the least squares sense.


Though these polynomials are no longer linearly independent, if X is poised, then the set of regression Lagrange polynomials exists and is uniquely defined.
We can now use these Lagrange polynomials to extend the definition of poisedness to $\Lambda$-poisedness. This bounds the magnitude of the Lagrange polynomials on a set $B \subset \mathbb{R}^n$, which provides a method for measuring the quality of a sample set.
Definition 2.12 Let $\Lambda > 0$ and a set $B \subset \mathbb{R}^n$ be given. For a basis $\phi$ of $\mathcal{P}_n^d$, a poised set $X = \{x^0, \ldots, x^p\}$ is said to be $\Lambda$-poised in $B$ (in the interpolating sense) if and only if
1. for the basis of Lagrange polynomials associated with $X$,
$$\Lambda \ge \max_{0 \le i \le p}\max_{x \in B}|\ell_i(x)|,$$
or, equivalently,
2. for any $x \in B$ there exists $\lambda(x)$ such that
$$\sum_{i=0}^{p}\lambda_i(x)\,\phi(x^i) = \phi(x) \quad \text{with} \quad \|\lambda(x)\|_\infty \le \Lambda.$$
And we again can extend this definition to the regression case.
Definition 2.13 Let $\Lambda > 0$ and a set $B \subset \mathbb{R}^n$ be given. For a basis $\phi$ of $\mathcal{P}_n^d$, a poised set $X = \{x^0, \ldots, x^p\}$ with $|X| \ge |\mathcal{P}_n^d|$ is said to be $\Lambda$-poised in $B$ (in the regression sense) if and only if
1. for the basis of Lagrange polynomials associated with $X$,
$$\Lambda \ge \max_{0 \le i \le p}\max_{x \in B}|\ell_i(x)|,$$
or, equivalently,
2. for any $x \in B$ there exists $\lambda(x)$ such that
$$\sum_{i=0}^{p}\lambda_i(x)\,\phi(x^i) = \phi(x) \quad \text{with} \quad \|\lambda(x)\|_\infty \le \Lambda.$$


Note that if $|X| = |\mathcal{P}_n^d|$, the definitions of $\Lambda$-poised in the interpolation and regression sense coincide.
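To make these definitions concrete, the sketch below (illustrative; it assumes the quadratic monomial basis and estimates the maximum by random sampling rather than exact maximization) computes the regression Lagrange polynomial coefficients as columns of $M(\phi, X)^+$ and estimates $\Lambda$ over the unit ball:

import numpy as np

def quadratic_basis(x):
    """Monomial basis (1, x_1, ..., x_n, x_1^2/2, x_1 x_2, ..., x_n^2/2)."""
    x = np.asarray(x, dtype=float)
    quad = [x[i] * x[j] * (0.5 if i == j else 1.0)
            for i in range(len(x)) for j in range(i, len(x))]
    return np.concatenate(([1.0], x, quad))

def lagrange_coefficients(X):
    """Column j of the pseudo-inverse of M holds the coefficients of the
    (regression) Lagrange polynomial ell_j, since a_j = M^+ e_j is the least
    squares solution of M a = e_j."""
    M = np.array([quadratic_basis(x) for x in X])
    return np.linalg.pinv(M)              # shape (q1, p1)

def estimate_lambda(X, n_trial=20000, rng=None):
    """Monte-Carlo estimate of Lambda = max_i max_{x in B(0;1)} |ell_i(x)|."""
    rng = rng or np.random.default_rng(0)
    A = lagrange_coefficients(X)
    n = len(X[0])
    best = 0.0
    for _ in range(n_trial):
        z = rng.normal(size=n)
        z *= rng.uniform() ** (1.0 / n) / np.linalg.norm(z)   # uniform in unit ball
        ell = quadratic_basis(z) @ A      # values of all Lagrange polynomials at z
        best = max(best, np.abs(ell).max())
    return best

# Illustrative use: a well-spread sample set in the unit ball (n = 2, p1 = 6).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [-1.0, 0.0], [0.0, -1.0], [0.7, 0.7]])
print(estimate_lambda(X))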
We can now examine the following bound (from [7]):
$$\|D^r f(x) - D^r m(x)\| \le \frac{\nu_d}{(d+1)!} \sum_{i=0}^{p} \|x^i - x\|^{d+1}\,\|D^r \ell_i(x)\|,$$
where $D^r f(x)$ is the $r$th derivative of $f$ and $\nu_d$ is an upper bound on $D^{d+1} f(x)$. If $r = 0$, this bound reduces to
$$|f(x) - m(x)| \le \frac{1}{(d+1)!}\,(p+1)\,\nu_d\,\Lambda_b\,\Delta^{d+1}, \qquad (2.2)$$
where
$$\Lambda_b = \max_{0 \le i \le p}\max_{x \in B}|\ell_i(x)|,$$
and $\Delta$ is the diameter of the smallest ball containing $X$. Therefore, if the number of points in $X$ is bounded, then $\Lambda$-poisedness is sufficient to derive bounds on the error between the regression or interpolating model and the function. That is, decreasing the radius of the sample set will provide bounds similar to Taylor bounds when derivatives are available. If using regression models with arbitrarily many points, $\Lambda$-poisedness is not enough to construct similar bounds. Strong $\Lambda$-poisedness can help in this case.
Definition 2.14 Let $\ell(x) = (\ell_0(x), \ldots, \ell_p(x))^T$ be the regression Lagrange polynomials associated with the set $Y = \{y^0, \ldots, y^p\}$. Let $\Lambda > 0$ and let $B$ be a set in $\mathbb{R}^n$. The set $Y$ is said to be strongly $\Lambda$-poised in $B$ (in the regression sense) if and only if
$$\frac{q_1}{\sqrt{p_1}}\,\Lambda \ge \max_{x \in B}\|\ell(x)\|,$$
where $q_1 = |\mathcal{P}_n^2|$ and $p_1 = |Y|$.


Since we can rewrite (2.2) as
$$|f(x) - m(x)| \le \frac{1}{(d+1)!}\,\sqrt{p+1}\,\nu_d\,\Lambda_{b,2}\,\Delta^{d+1},$$
where
$$\Lambda_{b,2} = \max_{x \in B}\|\ell(x)\|,$$
strong $\Lambda$-poisedness provides Taylor-like error bounds between the regression model $m$ and the function $f$, even when the number of points in $X$ is unbounded.
As a final note, explicitly calculating the value of $\Lambda$ is computationally expensive, but not necessary. It is possible to use the condition number of the design matrix $M(\phi, X)$ to bound the constant $\Lambda$. Since it is possible to scale the condition number of $M(\phi, X)$ by shifting and scaling $X$, or choosing a different basis, conditions must be placed on $M(\phi, X)$ before using its condition number. If we 1) use the standard (monomial) basis and 2) shift and scale $X$ so that it lies within the unit ball, then the condition number of the resulting matrix can be used to bound $\Lambda$. The next two theorems are for the interpolation and regression case, respectively.
Theorem 2.15 Let $\hat{X}$ denote the shifted and scaled version of $X$ so every point lies within the unit ball and at least one point has norm 1. Let $\hat{M} = M(\phi, \hat{X})$ where $\phi$ is the standard basis. If $\hat{M}$ is nonsingular and $\|\hat{M}^{-1}\| \le \Lambda$, then the set $X$ is $\sqrt{p_1}\,\Lambda$-poised in the unit ball. Conversely, if the set $X$ is $\Lambda$-poised in the unit ball, then
$$\|\hat{M}^{-1}\| \le \theta\,\sqrt{p_1}\,\Lambda,$$
where $\theta > 0$ depends on $n$ and $d$ but is independent of $X$ and $\Lambda$.
Theorem 2.16 Let $\hat{X}$ denote the shifted and scaled version of $X$ so every point lies within the unit ball and at least one point has norm 1. Let $\hat{M} = M(\phi, \hat{X})$ where $\phi$ is the standard basis. If $\hat{M}$ has full column rank and
$$\|\hat{M}^{+}\| \le \sqrt{q_1/p_1}\,\Lambda,$$
then the set $X$ is strongly $\Lambda$-poised in the unit ball. Conversely, if the set $X$ is $\Lambda$-poised in the unit ball, then
$$\|\hat{M}^{+}\| \le \theta\,\frac{q_1}{\sqrt{p_1}}\,\Lambda,$$
where $\theta$ depends on $n$ and $d$ but is independent of $X$ and $\Lambda$.
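A sketch of this condition-number check (illustrative only; the particular sample sets are assumptions chosen to show a well-poised and a degenerate configuration):

import numpy as np

def shift_and_scale(X):
    """Shift by the first point and scale so all points lie in the unit ball,
    with at least one point of norm 1."""
    X = np.asarray(X, dtype=float)
    shifted = X - X[0]
    R = np.max(np.linalg.norm(shifted, axis=1))
    return shifted / R

def poisedness_proxy(X):
    """Return the smallest singular value and condition number of the scaled
    design matrix; a tiny smallest singular value (huge condition number)
    signals a poorly poised sample set."""
    def quadratic_basis(x):
        quad = [x[i] * x[j] * (0.5 if i == j else 1.0)
                for i in range(len(x)) for j in range(i, len(x))]
        return np.concatenate(([1.0], x, quad))
    M = np.array([quadratic_basis(x) for x in shift_and_scale(X)])
    s = np.linalg.svd(M, compute_uv=False)
    return s[-1], (s[0] / s[-1] if s[-1] > 0 else np.inf)

good = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1], [0.7, 0.7]], float)
bad = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0], [5, 0]], float)  # collinear
print(poisedness_proxy(good))   # moderate condition number
print(poisedness_proxy(bad))    # sigma_min ~ 0: not poised for quadratics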
2.2 Performance Profiles
Next, we explain the content of performance profiles, which are a compact method for comparing the performance of various algorithms on a set of problems. We will use Figure 2.1 as an example. Algorithms A, B, and C have been run on an identical set of problems for the same number of function evaluations. The left axis shows the percentage of problems each algorithm solved first, where solved is user-defined. (Often, an algorithm is considered to solve a problem when it first finds a function value within tolerance of the best known solution.) In the example, A solves 20% of the problems first, B solves 55% of the problems first, and C solves 30% of the problems first. As these percentages total to over 100%, there is an overlap in the set of problems the algorithms solve first.
The right axis shows the percentage of problems solved by a given algorithm in the number of function evaluations given. All algorithms in Figure 2.1 solve over 90% of the problems in the testing set. Values between the left and right axes denote the percentage of problems solved as a multiple of the number of evaluations required for the fastest algorithm. For example, given 6 times as many iterations as the fastest algorithm on a problem, A solves 80% of the problems in the testing set.
Formally, an algorithm is considered to solve a problem when it first finds a function value satisfying
$$f(x^0) - f(x) \ge (1 - \tau)\left(f(x^0) - f_L\right),$$
where $\tau > 0$ is a small tolerance, $f_L$ is the smallest function value found by any solver in a specified number of iterations, and $x^0$ is the initial point given to each algorithm.


Figure 2.1: An example of a performance profile
If $t_{p,a}$ denotes the number of function evaluations required for solver $a$ to solve problem $p$, the performance ratio is defined as
$$r_{p,a} = \frac{t_{p,a}}{\min_{a} t_{p,a}}.$$
Then the performance profile of a solver $a$ is the fraction of problems where the performance ratio is at most $\alpha$. That is,
$$\rho_a(\alpha) = \frac{1}{|P|}\left|\left\{p \in P : r_{p,a} \le \alpha\right\}\right|,$$
where $P$ is the set of benchmark problems.
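A minimal sketch of computing performance profiles from a table of required evaluation counts (the data here are invented for illustration and are not from the thesis's benchmarks):

import numpy as np

def performance_profiles(t, alphas):
    """t[p, a] = evaluations solver a needed on problem p (np.inf if unsolved).
    Returns rho[a, i] = fraction of problems with performance ratio <= alphas[i]."""
    t = np.asarray(t, dtype=float)
    best = t.min(axis=1, keepdims=True)          # fastest solver per problem
    ratios = t / best                            # r_{p,a}
    return np.array([[np.mean(ratios[:, a] <= alpha) for alpha in alphas]
                     for a in range(t.shape[1])])

# Illustrative data: 4 problems, 3 solvers (np.inf marks a failure).
t = [[10, 12, 30],
     [50, 40, 45],
     [np.inf, 100, 90],
     [20, 60, 25]]
print(performance_profiles(t, alphas=[1, 2, 4, 8]))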
2.3 Probabilistic Convergence
Lastly, we define various forms of probabilistic convergence. A sequence $\{X_n\}$ of random variables is said to converge in distribution, or converge weakly, or converge in law to a random variable $X$ if
$$\lim_{n \to \infty} F_n(x) = F(x)$$
for every number $x \in \mathbb{R}$ at which $F$ is continuous. Here $F_n$ and $F$ are the cumulative distribution functions of the random variables $X_n$ and $X$, respectively.
A sequence $\{X_n\}$ of random variables converges in probability to $X$ if for all $\epsilon > 0$,
$$\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0.$$
A sequence $\{X_n\}$ of random variables converges almost surely (or almost everywhere, or with probability 1, or strongly) towards $X$ if
$$P\left(\lim_{n \to \infty} X_n = X\right) = 1.$$
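A standard textbook example (not from this thesis) shows why almost sure convergence is strictly stronger than convergence in probability: let $\{X_n\}$ be independent random variables with
$$P(X_n = 1) = \tfrac{1}{n}, \qquad P(X_n = 0) = 1 - \tfrac{1}{n}.$$
For any $\epsilon \in (0, 1)$, $P(|X_n - 0| > \epsilon) = \tfrac{1}{n} \to 0$, so $X_n \to 0$ in probability. However, $\sum_n P(X_n = 1) = \sum_n \tfrac{1}{n} = \infty$, so by the second Borel-Cantelli lemma $X_n = 1$ occurs infinitely often with probability 1; hence $X_n$ does not converge to $0$ almost surely.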


3. Derivative-free Optimization of Expensive Functions with Computational Error Using Weighted Regression
3.1 Introduction
In this chapter, we construct an algorithm designed to optimize functions evaluated by large computational codes, taking minutes, hours or even days for a single function call, for which derivative information is unavailable, and for which function evaluations are subject to computational error. Such error may be deterministic (arising, for example, from discretization error), or stochastic (for example, from Monte-Carlo simulation). Because function evaluations are extremely expensive, it is sensible to perform substantial work at each iteration to reduce the number of function evaluations required to obtain an optimum.
We assume that the accuracy of the function evaluation can vary from point to point, and this variation can be quantified. In this chapter, we will use knowledge of this varying error to improve the performance of the algorithm by using weighted regression models in a trust region framework. By giving more accurate points more weight when constructing the trust region model, we hope that the models will more closely approximate the function being optimized. This leads to a better performing algorithm.
Our algorithm fits within the CSV2-framework, which is outlined in Chapter 2. To specify an algorithm within this framework, three things are required:
1. Define the class of model functions $\mathcal{M}$. This is determined by the method for constructing models from the sample set. In [10] models were constructed using interpolation, least squares regression, and minimum Frobenius norm methods. We describe the general form of our weighted regression models in §3.2 and propose a specific weighting scheme in §3.5.
2. Define a model-improvement algorithm. §3.4 describes our model improvement
algorithm which tests the geometry of the sample set, and if necessary, adds


and/or deletes points to ensure that the model function constructed from the sample set satisfies the error bounds in Definition 2.2 (i.e., it is $\kappa$-fully quadratic).
3. Demonstrate that the model-improvement algorithm satisfies the requirements for the definition of a class of fully quadratic models. For our algorithm, this is discussed in §3.4.
The chapter is organized as follows. We place our algorithm in the CSV2-framework by describing 1) how model functions are constructed (§3.2), and 2) a model improvement algorithm (§3.4). Before describing the model improvement algorithm, we first extend the theory of $\Lambda$-poisedness to the weighted regression framework (§3.3). Computational results are presented in §3.5 using a heuristic weighting scheme, which is described in that section. §3.6 concludes the chapter.
3.2 Model Construction
This section describes how we construct the model function $m_k$ at the $k$th iteration. For simplicity, we drop the subscript $k$ for the remainder of this section. Let $f = (f_0, \ldots, f_p)^T$ where $f_i$ denotes the computed function value at $y^i$, and let $\epsilon_i$ denote the associated computational error. That is,
$$f_i = f(y^i) + \epsilon_i. \qquad (3.1)$$
Let $w = (w_0, \ldots, w_p)^T$ be a vector of positive weights for the set of points $Y = \{y^0, \ldots, y^p\}$. A quadratic polynomial $m$ is said to be a weighted least squares approximation of $f$ (with respect to $w$) if it minimizes
$$\sum_{i=0}^{p} w_i^2 \left(m(y^i) - f_i\right)^2 = \left\|W\left(m(Y) - f\right)\right\|^2,$$
where $m(Y)$ denotes the vector $(m(y^0), m(y^1), \ldots, m(y^p))^T$ and $W = \operatorname{diag}(w)$. In this case, we write
$$W m(Y) \overset{\text{l.s.}}{=} W f. \qquad (3.2)$$


Let $\phi = \{\phi_0, \phi_1, \ldots, \phi_q\}$ be a basis for the quadratic polynomials in $\mathbb{R}^n$. For example, $\phi$ might be the monomial basis
$$\phi = \{1,\, x_1,\, x_2,\, \ldots,\, x_n,\, x_1^2/2,\, x_1 x_2,\, \ldots,\, x_{n-1}x_n,\, x_n^2/2\}. \qquad (3.3)$$
Define
$$M(\phi, Y) = \begin{bmatrix} \phi_0(y^0) & \phi_1(y^0) & \cdots & \phi_q(y^0) \\ \phi_0(y^1) & \phi_1(y^1) & \cdots & \phi_q(y^1) \\ \vdots & \vdots & & \vdots \\ \phi_0(y^p) & \phi_1(y^p) & \cdots & \phi_q(y^p) \end{bmatrix}.$$
Since $\phi$ is a basis for the quadratic polynomials, the model function can be written $m(x) = \sum_{i=0}^{q} \alpha_i \phi_i(x)$. The coefficients $\alpha = (\alpha_0, \ldots, \alpha_q)^T$ solve the weighted least squares regression problem
$$W M(\phi, Y)\,\alpha \overset{\text{l.s.}}{=} W f. \qquad (3.4)$$
If $M(\phi, Y)$ has full column rank, the sample set $Y$ is said to be poised for quadratic regression. The following lemma is a straightforward generalization of [12, Lemma 4.3]:
Lemma 3.1 If $Y$ is poised for quadratic regression, then the weighted least squares regression polynomial (with respect to positive weights $w = (w_0, \ldots, w_p)$) exists, is unique, and is given by $m(x) = \phi(x)^T \alpha$, where
$$\alpha = (WM)^+ W f = (M^T W^2 M)^{-1} M^T W^2 f, \qquad (3.5)$$
where $W = \operatorname{diag}(w)$ and $M = M(\phi, Y)$.
Proof: Since $W$ and $M$ both have full column rank, so does $WM$. Thus, the least squares problem (3.4) has a unique solution given by $(WM)^+ W f$. Moreover, since $WM$ has full column rank, $(WM)^+ = \left((WM)^T(WM)\right)^{-1} M^T W$. ∎
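As an illustration of (3.5) (a sketch, not the implementation used in this thesis; the per-point error estimates and the choice of weights as their reciprocals are assumptions), the weighted coefficients can be obtained from a standard least squares solve on the row-scaled system $WM\alpha \approx Wf$:

import numpy as np

def quadratic_basis(y):
    quad = [y[i] * y[j] * (0.5 if i == j else 1.0)
            for i in range(len(y)) for j in range(i, len(y))]
    return np.concatenate(([1.0], y, quad))

def weighted_regression_model(Y, fvals, weights):
    """Solve W M alpha = W f in the least squares sense, i.e.
    alpha = (W M)^+ W f as in (3.5); returns the coefficient vector alpha."""
    M = np.array([quadratic_basis(y) for y in Y])
    W = np.diag(weights)
    alpha, *_ = np.linalg.lstsq(W @ M, W @ np.asarray(fvals, float), rcond=None)
    return alpha

# Illustrative use: points with larger error estimates receive smaller weights.
rng = np.random.default_rng(2)
Y = rng.uniform(-1, 1, size=(12, 2))
err_est = rng.uniform(0.01, 0.2, size=12)            # assumed per-point error levels
fvals = [y[0] ** 2 + 2 * y[1] + e * rng.normal() for y, e in zip(Y, err_est)]
alpha = weighted_regression_model(Y, fvals, weights=1.0 / err_est)
print(quadratic_basis(np.array([0.2, 0.1])) @ alpha)  # approximately 0.04 + 0.2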
3.3 Error Analysis and the Geometry of the Sample Set


Throughout this section, $Y = \{y^0, \ldots, y^p\}$ denotes the sample set, $p_1 = p + 1$, $w \in \mathbb{R}^{p_1}$ is a vector of positive weights, $W = \operatorname{diag}(w)$, and $M = M(\phi, Y)$. $f$ denotes the vector of computed function values at the points in $Y$ as defined by (3.1).
The accuracy of the model function $m_k$ relies critically on the geometry of the sample set. In this section, we generalize the theory of poisedness from [12] to the weighted regression framework. This section also includes error analysis which extends results from [12] to weighted regression, as well as considering the impact of computational error on the error bounds. We start by defining weighted regression Lagrange polynomials.
3.3.1 Weighted Regression Lagrange Polynomials
Definition 3.2 A set of polynomials $\ell_j(x)$, $j = 0, \ldots, p$, in $\mathcal{P}_n^2$ are called weighted regression Lagrange polynomials with respect to the weights $w$ and sample set $Y$ if for each $j$,
$$W \hat{\ell}_j \overset{\text{l.s.}}{=} W e^j,$$
where $\hat{\ell}_j = [\ell_j(y^0), \ldots, \ell_j(y^p)]^T$.
The following lemma is a direct application of Lemma 3.1.
Lemma 3.3 Let $\phi(x) = (\phi_0(x), \ldots, \phi_q(x))^T$. If $Y$ is poised, then the set of weighted regression Lagrange polynomials exists and is unique, and is given by $\ell_j(x) = \phi(x)^T a^j$, $j = 0, \ldots, p$, where $a^j$ is the $j$th column of the matrix
$$A = (WM)^+ W. \qquad (3.6)$$
Proof: Note that $m(\cdot) = \ell_j(\cdot)$ satisfies (3.2) with $f = e^j$. By Lemma 3.1, $\ell_j(x) = \phi(x)^T a^j$ where $a^j = (WM)^+ W e^j$, which is the $j$th column of $(WM)^+ W$. ∎
The following lemma is based on [12, Lemma 4.6].
Lemma 3.4 If Y is poised, then the model function defined by (3.2) satisfies

  m(x) = Σ_{i=0}^p f_i ℓ_i(x),

where ℓ_i(x), i = 0, ..., p, denote the weighted regression Lagrange polynomials corresponding to Y and W.

Proof: By Lemma 3.1, m(x) = φ(x)^T a, where a = (WM)^+ W f = A f for A defined by (3.6). Let ℓ(x) = [ℓ_0(x), ..., ℓ_p(x)]^T. Then, by Lemma 3.3,

  m(x) = φ(x)^T A f = ℓ(x)^T f = Σ_{i=0}^p f_i ℓ_i(x).  ■
3.3.2 Error Analysis
For the remainder of this chapter, let Ŷ = {ŷ^0, ..., ŷ^p} denote the shifted and scaled sample set, where ŷ^i = (y^i − y^0)/R and R = max_i ||y^i − y^0||. Note that ŷ^0 = 0 and max_i ||ŷ^i|| = 1. Any analysis of Ŷ can be directly related to Y by the following lemma:
Lemma 3.5 Define the basis φ̂ = {φ̂_0, ..., φ̂_q} by φ̂_i(x) = φ_i((x − y^0)/R), where φ is the monomial basis. Let {ℓ_0(x), ..., ℓ_p(x)} be weighted regression Lagrange polynomials for Y and {ℓ̂_0(x), ..., ℓ̂_p(x)} be weighted regression Lagrange polynomials for Ŷ. Then M(φ̂, Y) = M(φ, Ŷ). If Y is poised, then

  ℓ(x) = ℓ̂((x − y^0)/R).

Proof: Observe that

  M(φ̂, Y) = [ φ̂_0(y^0) ··· φ̂_q(y^0) ; ⋮ ; φ̂_0(y^p) ··· φ̂_q(y^p) ]
           = [ φ_0(ŷ^0) ··· φ_q(ŷ^0) ; ⋮ ; φ_0(ŷ^p) ··· φ_q(ŷ^p) ] = M(φ, Ŷ).

By the definition of poisedness, Y is poised if and only if Ŷ is poised. Let φ(x) = (φ_0(x), ..., φ_q(x))^T and φ̂(x) = (φ̂_0(x), ..., φ̂_q(x))^T, so that φ̂(x) = φ((x − y^0)/R). By Lemma 3.3, if Y is poised, then

  ℓ(x)^T = φ̂(x)^T (W M(φ̂, Y))^+ W = φ((x − y^0)/R)^T (W M(φ, Ŷ))^+ W = ℓ̂((x − y^0)/R)^T.  ■
Let f_i be defined by (3.1) and let Ω be an open convex set containing Y. If f is C² on Ω, then by Taylor's theorem, for each sample point y^i ∈ Y and a fixed x ∈ conv(Y), there exists a point η_i(x) on the line segment connecting x to y^i such that

  f_i = f(y^i) + ε_i
      = f(x) + ∇f(x)^T (y^i − x) + (1/2)(y^i − x)^T ∇²f(η_i(x))(y^i − x) + ε_i
      = f(x) + ∇f(x)^T (y^i − x) + (1/2)(y^i − x)^T ∇²f(x)(y^i − x)
        + (1/2)(y^i − x)^T H_i(x)(y^i − x) + ε_i,   (3.7)

where H_i(x) = ∇²f(η_i(x)) − ∇²f(x).
Let {ℓ_i(x)} denote the weighted regression Lagrange polynomials associated with Y. The following lemma and proof are inspired by [7, Theorem 1]:

Lemma 3.6 Let f be twice continuously differentiable on Ω and let m(x) denote the quadratic function determined by weighted regression. Then, for any x ∈ Ω the following identities hold:

• m(x) = f(x) + (1/2) Σ_{i=0}^p (y^i − x)^T H_i(x)(y^i − x) ℓ_i(x) + Σ_{i=0}^p ε_i ℓ_i(x),
• ∇m(x) = ∇f(x) + (1/2) Σ_{i=0}^p (y^i − x)^T H_i(x)(y^i − x) ∇ℓ_i(x) + Σ_{i=0}^p ε_i ∇ℓ_i(x),
• ∇²m(x) = ∇²f(x) + (1/2) Σ_{i=0}^p (y^i − x)^T H_i(x)(y^i − x) ∇²ℓ_i(x) + Σ_{i=0}^p ε_i ∇²ℓ_i(x),

where H_i(x) = ∇²f(η_i(x)) − ∇²f(x) for some point η_i(x) = θx + (1 − θ)y^i, 0 ≤ θ ≤ 1, on the line segment connecting x to y^i.
Proof: Let D denote the differential operator as defined in [7], where D^j is the jth derivative of a function in C^i with i ≥ j. In particular, D^0 f(x) = f(x), D^1 f(x) z = ∇f(x)^T z, and D^2 f(x) z² = z^T ∇²f(x) z. By Lemma 3.4, m(x) = Σ_{i=0}^p f_i ℓ_i(x); so for h = 0, 1, or 2,

  D^h m(x) = Σ_{i=0}^p f_i D^h ℓ_i(x).

Substituting (3.7) for f_i in the above equation yields

  D^h m(x) = Σ_{i=0}^p Σ_{j=0}^2 (1/j!) D^j f(x)(y^i − x)^j D^h ℓ_i(x)
             + (1/2) Σ_{i=0}^p (y^i − x)^T H_i(x)(y^i − x) D^h ℓ_i(x) + Σ_{i=0}^p ε_i D^h ℓ_i(x),   (3.8)

where H_i(x) = ∇²f(η_i(x)) − ∇²f(x) for some point η_i(x) on the line segment connecting x to y^i. Consider the first term on the right hand side above. We shall show that, for j = 0, 1, 2,

  (1/j!) Σ_{i=0}^p D^j f(x)(y^i − x)^j D^h ℓ_i(x) = D^h f(x) if j = h, and 0 if j ≠ h.   (3.9)

Let B_j = D^j f(x), and let g_j : ℝⁿ → ℝ be the polynomial defined by g_j(z) = (1/j!) B_j (z − x)^j. Observe that D^j g_j(x) = B_j and D^h g_j(x) = 0 for h ≠ j. Since g_j has degree j ≤ 2, the weighted least squares approximation of g_j by a quadratic polynomial is g_j itself. Thus, by Lemma 3.4 and the definition of g_j,

  g_j(z) = Σ_{i=0}^p g_j(y^i) ℓ_i(z) = (1/j!) Σ_{i=0}^p B_j (y^i − x)^j ℓ_i(z).   (3.10)

Applying the differential operator D^h with respect to z yields

  D^h g_j(z) = (1/j!) Σ_{i=0}^p B_j (y^i − x)^j D^h ℓ_i(z) = (1/j!) Σ_{i=0}^p D^j f(x)(y^i − x)^j D^h ℓ_i(z).

Letting z = x, the expression on the right is identical to the left side of (3.9). This proves (3.9), since D^h g_j(x) = 0 for j ≠ h and D^j g_j(x) = B_j for j = h. By (3.9), (3.8) reduces to

  D^h m(x) = D^h f(x) + (1/2) Σ_{i=0}^p (y^i − x)^T H_i(x)(y^i − x) D^h ℓ_i(x) + Σ_{i=0}^p ε_i D^h ℓ_i(x).

Applying this with h = 0, 1, 2 proves the lemma. ■
Since ||H_i(x)|| ≤ L ||y^i − x|| by the Lipschitz continuity of ∇²f, the following is a direct consequence of Lemma 3.6.

Corollary 3.7 Let f satisfy Assumption 2.1 for some convex set Ω and let m(x) denote the quadratic function determined by weighted regression. Then, for any x ∈ Ω, the following error bounds hold:

• |f(x) − m(x)| ≤ Σ_{i=0}^p ( (L/2) ||y^i − x||³ + |ε_i| ) |ℓ_i(x)|,
• ||∇f(x) − ∇m(x)|| ≤ Σ_{i=0}^p ( (L/2) ||y^i − x||³ + |ε_i| ) ||∇ℓ_i(x)||,
• ||∇²f(x) − ∇²m(x)|| ≤ Σ_{i=0}^p ( (L/2) ||y^i − x||³ + |ε_i| ) ||∇²ℓ_i(x)||.
Using this corollary, the following result provides error bounds between the function and the model in terms of the sample set radius.

Corollary 3.8 Let Y be poised, and let R = max_i ||y^i − y^0||. Suppose |ε_i| ≤ ε for i = 0, ..., p. If f satisfies Assumption 2.1 with Lipschitz constant L, then there exist constants Λ_1, Λ_2, and Λ_3, independent of R, such that for all x ∈ B(y^0; R),

• |f(x) − m(x)| ≤ Λ_1 √p_1 (4LR³ + ε),
• ||∇f(x) − ∇m(x)|| ≤ Λ_2 √p_1 (4LR² + ε/R),
• ||∇²f(x) − ∇²m(x)|| ≤ Λ_3 √p_1 (4LR + ε/R²).
Proof: Let {ℓ̂_0(x), ..., ℓ̂_p(x)} be the Lagrange polynomials generated by the shifted and scaled set Ŷ, and let {ℓ_0(x), ..., ℓ_p(x)} be the Lagrange polynomials generated by the set Y. By Lemma 3.5, for each x ∈ B(y^0; R), ℓ_i(x) = ℓ̂_i(x̂) for all i, where x̂ = (x − y^0)/R. Thus, ∇ℓ_i(x) = ∇ℓ̂_i(x̂)/R, and ∇²ℓ_i(x) = ∇²ℓ̂_i(x̂)/R².

Let ℓ̂(x) = (ℓ̂_0(x), ..., ℓ̂_p(x))^T, let g(x) be the vector with components ||∇ℓ̂_i(x)||, and let h(x) be the vector with components ||∇²ℓ̂_i(x)||. By Corollary 3.7,

  |f(x) − m(x)| ≤ Σ_{i=0}^p ( (L/2) ||y^i − x||³ + |ε_i| ) |ℓ_i(x)|
               ≤ (4LR³ + ε) Σ_{i=0}^p |ℓ̂_i(x̂)|   (since ||y^i − x|| ≤ 2R and |ε_i| ≤ ε)
               = (4LR³ + ε) ||ℓ̂(x̂)||_1
               ≤ √p_1 (4LR³ + ε) ||ℓ̂(x̂)||   (since, for v ∈ ℝ^{p_1}, ||v||_1 ≤ √p_1 ||v||_2).

Similarly, ||∇f(x) − ∇m(x)|| ≤ √p_1 (4LR² + ε/R) ||g(x̂)||, and ||∇²f(x) − ∇²m(x)|| ≤ √p_1 (4LR + ε/R²) ||h(x̂)||. Setting

  Λ_1 = max_{x ∈ B(0;1)} ||ℓ̂(x)||, Λ_2 = max_{x ∈ B(0;1)} ||g(x)||, and Λ_3 = max_{x ∈ B(0;1)} ||h(x)||

yields the desired result. ■
Note the similarity between these error bounds and those in the definition of κ-fully quadratic models. If there is no computational error (or if the error is O(Δ³)), κ-fully quadratic models (for some fixed κ) can be obtained by controlling the geometry of the sample set so that Λ_i √p_1, i = 1, 2, 3, are bounded by fixed constants, and by controlling the trust region radius Δ so that R/Δ is bounded. This motivates the definitions of Λ-poised and strongly Λ-poised in the weighted regression sense in the next section.
3.3.3 Λ-poisedness (in the Weighted Regression Sense)
In this section, we restrict our attention to the monomial basis defined in (3.3). In order to produce accurate model functions, the points in the sample set need to be distributed in such a way that the matrix M = M(φ, Y) is sufficiently well-conditioned. This is the motivation behind the following definitions of Λ-poised and strongly Λ-poised sets. These definitions are identical to [12, Definitions 4.7, 4.10], except that the Lagrange polynomials in the definitions are weighted regression Lagrange polynomials.
Definition 3.9 Let Λ > 0 and let B be a set in ℝⁿ. Let w = (w_0, ..., w_p)^T be a vector of positive weights, Y = {y^0, ..., y^p} be a poised set, and let {ℓ_0, ..., ℓ_p} be the associated weighted regression Lagrange polynomials. Let ℓ(x) = (ℓ_0(x), ..., ℓ_p(x))^T and q_1 = |φ|.

• Y is said to be Λ-poised in B (in the weighted regression sense) if and only if

  Λ ≥ max_{x ∈ B} max_{0 ≤ i ≤ p} |ℓ_i(x)|.

• Y is said to be strongly Λ-poised in B (in the weighted regression sense) if and only if

  (q_1/√p_1) Λ ≥ max_{x ∈ B} ||ℓ(x)||.

Note that if the weights are all equal, the above definitions are equivalent to those for Λ-poised and strongly Λ-poised given in [12].
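These conditions can be checked numerically for a given sample set. The sketch below is illustrative Python (not the thesis's MATLAB code): it forms the coefficient matrix A = (WM)^+ W of Lemma 3.3 and estimates max_{x ∈ B} ||ℓ(x)|| by random sampling over a ball B; the argument `basis` is assumed to be a callable returning φ(x) (for example, the monomial-basis helper from the earlier sketch), and the Monte Carlo maximization only approximates the true maximum.

```python
import numpy as np

def lagrange_matrix(Y, w, basis):
    """A = (W M)^+ W; column j holds the coefficients of ell_j (Lemma 3.3)."""
    M = np.array([basis(y) for y in Y])          # p1 x q1 design matrix
    W = np.diag(w)
    return np.linalg.pinv(W @ M) @ W             # q1 x p1

def strong_poisedness_constant(Y, w, basis, center, radius, samples=2000, seed=0):
    """Estimate max_{x in B(center; radius)} ||ell(x)||, rescaled so the return
    value is a lower bound on any Lambda for which Y is strongly Lambda-poised
    (Definition 3.9): (q1/sqrt(p1)) Lambda >= max ||ell(x)||."""
    rng = np.random.default_rng(seed)
    A = lagrange_matrix(Y, w, basis)
    q1, p1 = A.shape
    n = len(center)
    best = 0.0
    for _ in range(samples):
        z = rng.normal(size=n)
        x = center + radius * (z / np.linalg.norm(z)) * rng.uniform() ** (1.0 / n)
        best = max(best, np.linalg.norm(A.T @ basis(x)))   # ||ell(x)|| = ||A^T phi(x)||
    return best * np.sqrt(p1) / q1
```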
We are naturally interested in using these weighted regression Lagrange polynomials to form models that are guaranteed to sufficiently approximate f. Let Y_k, Δ_k, and R_k denote the sample set, trust region radius, and sample set radius at iteration k, as defined at the beginning of §3.3.2. Assume that R_k/Δ_k is bounded. If the number of sample points is bounded, it can be shown, using Corollary 3.8, that if Y_k is Λ-poised for all k, then the corresponding model functions are κ-fully quadratic (assuming no computational error, or that the computational error is O(Δ³)). When the number of sample points is not bounded, Λ-poisedness is not enough. In the following, we show that if Y_k is strongly Λ-poised for all k, then the corresponding models are κ-fully quadratic, regardless of the number of points in Y_k.
Lemma 3.10 Let M = M(φ̂, Ŷ). If ||W (M^T W)^+|| ≤ √(q_1/p_1) Λ, then Ŷ is strongly Λ-poised in B(0; 1) in the weighted regression sense, with respect to the weights w. Conversely, if Ŷ is strongly Λ-poised in B(0; 1) in the weighted regression sense, then

  ||W (M^T W)^+|| ≤ (θ q_1/√p_1) Λ,

where θ > 0 is a fixed constant dependent only on n (but independent of Ŷ and Λ).

Proof: Let A = (WM)^+ W and ℓ(x) = (ℓ_0(x), ..., ℓ_p(x))^T. By Lemma 3.3, ℓ(x) = A^T φ̂(x). It follows that for any x ∈ B(0; 1),

  ||ℓ(x)|| = ||A^T φ̂(x)|| ≤ ||A|| ||φ̂(x)|| ≤ √(q_1/p_1) Λ (√q_1 ||φ̂(x)||_∞) ≤ (q_1/√p_1) Λ.

(For the last inequality, we used the fact that max_{x ∈ B(0;1)} ||φ̂(x)||_∞ ≤ 1.)

To prove the converse, let UΣV^T = A^T be the reduced singular value decomposition of A^T. Note that U and V are p_1 × q_1 and q_1 × q_1 matrices, respectively, with orthonormal columns; Σ is a q_1 × q_1 diagonal matrix whose diagonal entries are the singular values of A^T. Let σ_1 be the largest singular value, with v^1 the corresponding column of V. As shown in the proof of [10, Theorem 2.9], there exists a constant γ > 0 such that for any unit vector v, there exists an x ∈ B(0; 1) such that |v^T φ̂(x)| ≥ γ. Therefore, since ||v^1|| = 1, there is an x ∈ B(0; 1) such that |(v^1)^T φ̂(x)| ≥ γ. Let v^⊥ be the orthogonal projection of φ̂(x) onto the subspace orthogonal to v^1, so φ̂(x) = ((v^1)^T φ̂(x)) v^1 + v^⊥. Note that ΣV^T v^1 and ΣV^T v^⊥ are orthogonal vectors. Note also that for any vector z, ||UΣV^T z|| = ||ΣV^T z|| (since U has orthonormal columns). It follows that

  ||ℓ(x)|| = ||A^T φ̂(x)|| = ||ΣV^T φ̂(x)|| = ( ||ΣV^T v^⊥||² + ||ΣV^T ((v^1)^T φ̂(x)) v^1||² )^{1/2}
           ≥ |(v^1)^T φ̂(x)| ||ΣV^T v^1|| ≥ γ ||ΣV^T v^1|| = γ ||Σ e^1|| = γ σ_1 = γ ||A||.

Thus, ||A|| ≤ max_{x ∈ B(0;1)} ||ℓ(x)|| / γ ≤ (q_1/(γ √p_1)) Λ, which proves the result with θ = 1/γ. ■
We can now prove that models generated by weighted regression Lagrange polynomials are κ-fully quadratic.

Proposition 3.11 Let f satisfy Assumption 2.1 and let Λ > 0 be fixed. There exists a vector κ = (κ_ef, κ_eg, κ_eh, 0) such that for any y^0 ∈ S and Δ ≤ Δ_max, if

1. Y = {y^0, ..., y^p} ⊂ B(y^0; Δ) is strongly Λ-poised in B(y^0; Δ) in the weighted regression sense with respect to positive weights w = (w_0, ..., w_p), and

2. the computational error |ε_i| is bounded by CΔ³, where C is a fixed constant,

then the corresponding model function m is κ-fully quadratic.
Proof: Let x̂, ℓ̂(·), g(·), h(·), Λ_1, Λ_2, and Λ_3 be as defined in the proof of Corollary 3.8. Let M = M(φ̂, Ŷ) and W = diag(w). By Lemma 3.3, ℓ̂(x) = A^T φ̂(x), where A = (WM)^+ W. By Lemma 3.10, ||A|| ≤ (θ q_1/√p_1) Λ, where θ is a fixed constant. It follows that

  Λ_1 = max_{x ∈ B(0;1)} ||ℓ̂(x)|| ≤ max_{x ∈ B(0;1)} ||A|| ||φ̂(x)|| ≤ c_1 (θ q_1/√p_1) Λ,

where c_1 = max_{x ∈ B(0;1)} ||φ̂(x)|| is a constant independent of Ŷ. Similarly,

  Λ_2 = max_{x ∈ B(0;1)} ||g(x)|| = max_{x ∈ B(0;1)} || ( ||∇ℓ̂_0(x)||, ..., ||∇ℓ̂_p(x)|| )^T ||
      = max_{x ∈ B(0;1)} ||A^T ∇φ̂(x)||_F ≤ √q_1 ( max_{x ∈ B(0;1)} max_i ||∇φ̂_i(x)|| ) ||A|| ≤ c_2 (θ q_1^{3/2}/√p_1) Λ,

where c_2 = max_{x ∈ B(0;1)} max_i ||∇φ̂_i(x)|| is independent of Ŷ.

To bound Λ_3, let J_{s,t} denote the unique index j such that x_s and x_t both appear in the quadratic monomial φ̂_j(x). For example, J_{1,1} = n + 2, J_{1,2} = J_{2,1} = n + 3, etc. Observe that

  [∇²φ̂_j(x)]_{s,t} = 1 if j = J_{s,t}, and 0 otherwise.

It follows that

  [∇²ℓ̂_i(x)]_{s,t} = Σ_{j=0}^q A_{j,i} [∇²φ̂_j(x)]_{s,t} = A_{J_{s,t}, i},

so ∇²ℓ̂_i(x) is the symmetric matrix whose (s, t) entry is A_{J_{s,t}, i}. We conclude that

  ||∇²ℓ̂_i(x)|| ≤ ||∇²ℓ̂_i(x)||_F ≤ √2 ||A_{·,i}||,

where A_{·,i} denotes the ith column of A. Thus,

  Λ_3 = max_{x ∈ B(0;1)} ||h(x)|| ≤ √( 2 Σ_{i=0}^p ||A_{·,i}||² ) = √2 ||A||_F ≤ √(2 q_1) ||A|| ≤ (√2 θ q_1^{3/2}/√p_1) Λ.

By assumption, the computational error |ε_i| is bounded by ε = CΔ³. So, by Corollary 3.8, for all x ∈ B(y^0; Δ),

• |f(x) − m(x)| ≤ (4L + C) Λ_1 √p_1 Δ³ ≤ c_1 θ q_1 Λ (4L + C) Δ³ = κ_ef Δ³,
• ||∇f(x) − ∇m(x)|| ≤ (4L + C) Λ_2 √p_1 Δ² ≤ c_2 θ q_1^{3/2} Λ (4L + C) Δ² = κ_eg Δ²,
• ||∇²f(x) − ∇²m(x)|| ≤ (4L + C) Λ_3 √p_1 Δ ≤ √2 θ q_1^{3/2} Λ (4L + C) Δ = κ_eh Δ,

where κ_ef = c_1 θ q_1 Λ (4L + C), κ_eg = c_2 θ q_1^{3/2} Λ (4L + C), and κ_eh = √2 θ q_1^{3/2} Λ (4L + C). Thus, m(x) is (κ_ef, κ_eg, κ_eh, 0)-fully quadratic, and since these constants are independent of y^0 and Δ, the result is proven. ■
The final step in establishing that we have a fully quadratic class of models is to define an algorithm that produces a strongly Λ-poised sample set in a finite number of steps.
Proposition 3.12 Let Y = {y^0, ..., y^p} ⊂ ℝⁿ be a set of points in the unit ball B(0; 1) such that ||y^j|| = 1 for at least one j. Let w = (w_0, ..., w_p)^T be a vector of positive weights. If Y is strongly Λ-poised in B(0; 1) in the sense of unweighted regression, then there exists a constant θ̂ > 0, independent of Y, Λ, and w, such that Y is strongly (cond(W) θ̂ Λ)-poised in the weighted regression sense.

Proof: Let M = M(φ, Y), where φ is the monomial basis. By Lemma 3.10 (applied with unit weights), ||M^+|| ≤ θ q_1 Λ/√p_1, where θ is a constant independent of Y and Λ. Thus,

  ||W (M^T W)^+|| ≤ cond(W) ||M^+|| ≤ cond(W) θ q_1 Λ / √p_1,

where the first inequality results from

  ||(M^T W)^+|| ≤ 1/σ_min(M^T W) ≤ 1/(σ_min(M^T) σ_min(W)) = ||M^+|| ||W^{−1}||.

The result follows with θ̂ = θ √q_1. ■
The significance of this proposition is that any model improvement algorithm for unweighted regression can be used for weighted regression to ensure the same global convergence properties, provided cond(W) is bounded. For the model improvement algorithm described in the following section, this requirement is satisfied by bounding the weights away from zero while keeping the largest weight equal to 1.
In practice, we need not ensure Λ-poisedness of Y_k at every iterate to guarantee that the algorithm converges to a second-order minimum. Rather, Λ-poisedness only needs to be enforced as the algorithm stagnates.
3.4 Model Improvement Algorithm
This section describes a model improvement algorithm (MIA) for regression which, by the preceding section, can also be used for weighted regression to ensure that the sample sets are strongly Λ-poised for some fixed Λ (which is not necessarily known). The algorithm is based on the following observation, which is a straightforward extension of [12, Theorem 4.11].

The MIA presented in [12] makes assumptions (such as all points must lie within B(y^0; Δ)) to simplify the theory. We resist such assumptions to account for practical concerns (points which lie outside of B(y^0; Δ)) that arise in the algorithm.

Proposition 3.13 If the shifted and scaled sample set Ŷ of p_1 points contains ℓ = ⌊p_1/q_1⌋ disjoint subsets of q_1 points, each of which is Λ-poised in B(0; 1) (in the interpolation sense), then Ŷ is strongly √((ℓ+1)/ℓ) Λ-poised in B(0; 1) (in the regression sense).
Proof: Let Y_j = {y^j_0, y^j_1, ..., y^j_q}, j = 1, ..., ℓ, be the disjoint Λ-poised subsets of Ŷ, and let Y_r be the remaining points in Ŷ. Let λ^j_i, i = 0, ..., q, be the (interpolation) Lagrange polynomials for the set Y_j. As noted in [12], for any x ∈ ℝⁿ,

  Σ_{i=0}^q φ̂(y^j_i) λ^j_i(x) = φ̂(x), j = 1, ..., ℓ.

Dividing each of these equations by ℓ and summing yields

  (1/ℓ) Σ_{j=1}^ℓ Σ_{i=0}^q φ̂(y^j_i) λ^j_i(x) = φ̂(x).   (3.11)

Let λ^j(x) = (λ^j_0(x), ..., λ^j_q(x))^T, and let λ ∈ ℝ^{p_1} be formed by concatenating the λ^j(x), j = 1, ..., ℓ, and a zero vector of length |Y_r|, and then dividing every entry by ℓ. By (3.11), λ is a solution to the equation

  Σ_{i=0}^p φ̂(ŷ^i) z_i = φ̂(x).   (3.12)

Since Y_j is Λ-poised in B(0; 1), for any x ∈ B(0; 1),

  ||λ^j(x)|| ≤ √q_1 ||λ^j(x)||_∞ ≤ √q_1 Λ.

Thus, since p_1 ≤ (ℓ + 1) q_1,

  ||λ|| ≤ (√ℓ/ℓ) max_j ||λ^j(x)|| ≤ √(q_1/ℓ) Λ ≤ √((ℓ+1)/ℓ) (q_1/√p_1) Λ.

Let ℓ_i(x), i = 0, ..., p, be the regression Lagrange polynomials for the complete set Ŷ. As observed in [12], ℓ(x) = (ℓ_0(x), ..., ℓ_p(x))^T is the minimum norm solution to (3.12). Thus,

  ||ℓ(x)|| ≤ ||λ|| ≤ √((ℓ+1)/ℓ) (q_1/√p_1) Λ.

Since this holds for all x ∈ B(0; 1), Ŷ is strongly √((ℓ+1)/ℓ) Λ-poised in B(0; 1). ■
Based on this observation, and noting that (ℓ+1)/ℓ ≤ 2 for ℓ ≥ 1, we adopt the following strategy for improving a shifted and scaled regression sample set Ŷ ⊂ B(0; 1):

1. If Ŷ contains ℓ ≥ 1 Λ-poised subsets with at most q_1 points left over, Ŷ is strongly √2 Λ-poised.

2. Otherwise, if Ŷ contains at least one Λ-poised subset, save as many Λ-poised subsets as possible, plus at most q_1 additional points from Ŷ, discarding the rest.

3. Otherwise, add additional points to Ŷ in order to create a Λ-poised subset. Keep this subset, plus at most q_1 additional points from Ŷ.
To implement this strategy, we first describe an algorithm that attempts to find a Λ-poised subset of Ŷ. To discuss the algorithm we introduce the following definition:

Definition 3.14 A set Y ⊂ B is said to be Λ-subpoised in a set B if there exists a superset Z ⊇ Y that is Λ-poised in B with |Z| = q_1.

Given a sample set Y ⊂ B(0; 1) (not necessarily shifted and scaled) and a radius Δ, the algorithm below selects a Λ-subpoised subset Y_new ⊆ Y containing as many points as possible. If |Y_new| = q_1, then Y_new is Λ-poised in B(0; Δ) for some fixed Λ. Otherwise, the algorithm determines a new point y_new ∈ B(0; Δ) such that Y_new ∪ {y_new} is Λ-subpoised in B(0; Δ).
Algorithm FindSet (Finds a Λ-subpoised set)

Input: A sample set Y ⊂ B(0; 1) and a trust region radius Δ ∈ [√ξ_acc, 1], for a fixed parameter ξ_acc > 0.

Output: A set Y_new ⊆ Y that is Λ-poised in B(0; Δ); or a Λ-subpoised set Y_new ⊂ B(0; Δ) and a new point y_new ∈ B(0; Δ) such that Y_new ∪ {y_new} is Λ-subpoised in B(0; Δ).

Step 0 (Initialization): Initialize the pivot polynomial basis to the monomial basis: u_i(x) = φ_i(x), i = 0, ..., q. Set Y_new = ∅. Set i = 0.

Step 1 (Point Selection): If possible, choose j_i ∈ {i, ..., |Y| − 1} such that |u_i(y^{j_i})| ≥ ξ_acc (threshold test).
  If such an index is found, add the corresponding point to Y_new and swap the positions of points y^i and y^{j_i} in Y.
  Otherwise, compute y_new = argmax_{x ∈ B(0;Δ)} |u_i(x)|, and exit, returning Y_new and y_new.

Step 2 (Gaussian Elimination): For j = i + 1, ..., |Y| − 1,

  u_j(x) = u_j(x) − (u_j(y^i)/u_i(y^i)) u_i(x).

If i < q, set i = i + 1 and go to Step 1. Exit, returning Y_new.
The algorithm, which is modeled after Algorithms 6.5 and 6.6 in [12], applies Gaussian elimination with a threshold test to form a basis of pivot polynomials {u_i(x)}. As discussed in [12], at the completion of the algorithm, the values u_i(y^i), y^i ∈ Y_new, are exactly the diagonal entries of the diagonal matrix D in the LDU factorization of M = M(φ, Y_new). If |Y_new| = q_1, M is a square matrix. In this case, since |u_i(y^i)| ≥ ξ_acc,

  ||M^{−1}|| ≤ √q_1 ξ_growth / ξ_acc,   (3.13)

where ξ_growth is the growth factor for the factorization (see [27]).
Point Selection. The point selection rule allows flexibility in how an acceptable point is chosen. For example, to keep the growth factor down, one could choose the index j_i that maximizes |u_i(y^j)| (which corresponds to Gaussian elimination with partial pivoting). But in practice, it is often better to select points according to their proximity to the current iterate. In our implementation, we balance these two criteria by choosing the index that maximizes |u_i(y^j)|/d_j³ over j ≥ i, where d_j = max{1, ||y^j||/Δ}. If all sample points are contained in B(0; Δ), then d_j = 1 for all j. In this case, the point selection rule is identical to the one used in Algorithm 6.6 of [12] (with the addition of the threshold test). When Y contains points outside B(0; Δ), the corresponding values of d_j are greater than 1, so the point selection rule gives preference to points that are within B(0; Δ).
The theoretical justification for our revised point selection rule comes from examining the error bounds in Corollary 3.7. For a given point x in B(0; Δ), each sample point y^i makes a contribution to the error bound that is proportional to ||y^i − x||³ (assuming the computational error is relatively small). Since x can be anywhere in the trust region, this suggests modifying the point selection rule to maximize |u_i(y^j)|/d_j³, where d_j = max_{x ∈ B(0;Δ)} ||y^j − x||/Δ = ||y^j||/Δ + 1. To simplify the analysis, we modify this formula so that all points inside the trust region are treated equally, resulting in the formula d_j = max{1, ||y^j||/Δ}.
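In code, the selection rule amounts to a short scoring loop over the remaining candidate points. The following sketch is illustrative Python with hypothetical names (not the thesis implementation); `u_i` is assumed to be a callable evaluating the current pivot polynomial.

```python
import numpy as np

def select_pivot_index(u_i, Y, i, Delta, xi_acc):
    """Return the index j >= i maximizing |u_i(y^j)| / d_j^3 among points that pass
    the threshold test |u_i(y^j)| >= xi_acc, or None if no point passes."""
    best_j, best_score = None, -np.inf
    for j in range(i, len(Y)):
        val = abs(u_i(Y[j]))
        if val < xi_acc:                       # threshold test from Algorithm FindSet
            continue
        d_j = max(1.0, np.linalg.norm(Y[j]) / Delta)
        score = val / d_j ** 3                 # prefer points inside B(0; Delta)
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```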
Lemma 3.15 Suppose Algorithm FindSet returns a set Y_new with |Y_new| = q_1. Then Y_new is Λ-poised in B(0; Δ) for some Λ which is proportional to (ξ_growth/ξ_acc) max{1, Δ²/2, Δ}, where ξ_growth is the growth factor for the Gaussian elimination.
Proof: Let M = M(φ, Y_new). By (3.13), ||M^{−1}|| ≤ √q_1 ξ_growth/ξ_acc. Let ℓ(x) = (ℓ_0(x), ..., ℓ_q(x))^T be the vector of interpolation Lagrange polynomials for the sample set Y_new. For any x ∈ B(0; Δ),

  |ℓ_i(x)| ≤ ||M^{−T} φ(x)|| ≤ ||M^{−1}|| ||φ(x)|| ≤ √q_1 ||M^{−1}|| ||φ(x)||_∞ ≤ (q_1 ξ_growth/ξ_acc) max{1, Δ²/2, Δ}.

Since this inequality holds for all x ∈ B(0; Δ), Y_new is Λ-poised for Λ = (q_1 ξ_growth/ξ_acc) max{1, Δ²/2, Δ}, which establishes the result. ■
In general, the growth factor in the above lemma depends on the matrix M and the threshold ξ_acc. Because of the threshold test, it is possible to establish a bound on the growth factor that is independent of M. So we can claim that the algorithm selects a Λ-poised set for a fixed Λ that is independent of Y. However, the bound is extremely large, so it is not very useful. Nevertheless, in practice ξ_growth is quite reasonable, so Λ tends to be proportional to max{1, Δ²/2, Δ}/ξ_acc.
In the case where the threshold test is not satisfied, Algorithm FindSet determines a new point y_new by maximizing |u_i(x)| over B(0; Δ). In this case, we need to show that the new point would satisfy the threshold test. The following lemma shows that this is possible, provided ξ_acc is small enough. The proof is modeled after the proof of [12, Lemma 6.7].

Lemma 3.16 Let v^T φ(x) be a polynomial of degree at most 2, where ||v||_∞ = 1. Then

  max_{x ∈ B(0;Δ)} |v^T φ(x)| ≥ min{1, Δ²/4}.
Proof: Since ||v||_∞ = 1, at least one of the coefficients of q(x) = v^T φ(x) is 1, −1, 1/2, or −1/2. Consider the case where this largest coefficient is 1 or 1/2 (the cases −1 and −1/2 are proven similarly); it corresponds either to the constant term, to a linear term x_i, or to a quadratic term x_i²/2 or x_i x_j. Restrict all variables not appearing in the term corresponding to the largest coefficient to zero.

• If q(x) = 1, then the lemma trivially holds.

• If q(x) = x_i²/2 + a x_i + b, let x̄ = Δ e^i ∈ B(0; Δ), so that

  q(x̄) = Δ²/2 + Δa + b, q(−x̄) = Δ²/2 − Δa + b, and q(0) = b.

If |q(−x̄)| ≥ Δ²/4 or |q(x̄)| ≥ Δ²/4, the result is shown. Otherwise, −Δ²/4 < q(x̄) < Δ²/4 and −Δ²/4 < q(−x̄) < Δ²/4. Adding these together yields −Δ²/2 < Δ² + 2b < Δ²/2. Therefore b < −Δ²/4, and therefore |q(0)| > Δ²/4.

• If q(x) = a x_i²/2 + x_i + b, then let x̄ = Δ e^i, yielding q(x̄) = Δ + aΔ²/2 + b and q(−x̄) = −Δ + aΔ²/2 + b. Then

  max{|q(−x̄)|, |q(x̄)|} = max{|−Δ + α|, |Δ + α|} ≥ Δ ≥ min{1, Δ²/4},

where α = aΔ²/2 + b.

• If q(x) = a x_i²/2 + b x_j²/2 + x_i x_j + c x_i + d x_j + e, we consider four points on B(0; Δ):

  y_1 = (Δ/√2)(e^i + e^j), y_2 = (Δ/√2)(e^i − e^j), y_3 = −(Δ/√2)(e^i − e^j), y_4 = −(Δ/√2)(e^i + e^j).

Note that q(y_1) − q(y_2) = Δ² + √2 dΔ and q(y_3) − q(y_4) = −Δ² + √2 dΔ. There are two cases:

1. If d ≥ 0, then q(y_1) − q(y_2) ≥ Δ², so either |q(y_1)| ≥ Δ²/2 ≥ min{1, Δ²/4} or |q(y_2)| ≥ Δ²/2 ≥ min{1, Δ²/4}.

2. If d < 0, then a similar study of q(y_3) − q(y_4) proves the result. ■
Lemma 3.17 Suppose in Algorithm FindSet ξ_acc ≤ min{1, Δ²/4}. If Algorithm FindSet exits during the point selection step, then Y_new ∪ {y_new} is Λ-subpoised in B(0; Δ) for some fixed Λ which is proportional to (ξ_growth/ξ_acc) max{1, Δ²/2, Δ}, where ξ_growth is the growth factor for the Gaussian elimination.
Proof: Consider a modified version of Algorithm FindSet that does not exit in the point selection step. Instead, y^i is replaced by y_new and y_new is added to Y_new. This modified algorithm will always return a set consisting of q_1 points. Call this set Z. Let Y_new and y_new be the output of the unmodified algorithm, and observe that Y_new ∪ {y_new} ⊆ Z.

To show that Y_new ∪ {y_new} is Λ-subpoised, we show that Z is Λ-poised in B(0; Δ). From the Gaussian elimination, after k iterations of the algorithm, the (k+1)st pivot polynomial u_k(x) can be expressed as (v^k)^T φ(x) for some v^k = (v_0, ..., v_{k−1}, 1, 0, ..., 0)^T. (That is, the v_i are the coefficients of the basis expansion of the polynomial u_k.) Observe that ||v^k||_∞ ≥ 1, and let v = v^k/||v^k||_∞. By Lemma 3.16,

  max_{x ∈ B(0;Δ)} |u_k(x)| = max_{x ∈ B(0;Δ)} |(v^k)^T φ(x)| = ||v^k||_∞ ( max_{x ∈ B(0;Δ)} |v^T φ(x)| )
                            ≥ min{1, Δ²/4} ||v^k||_∞ ≥ min{1, Δ²/4} ≥ ξ_acc.

It follows that each time a new point is chosen in the point selection step, that point will satisfy the threshold test. Thus, the set Z returned by the modified algorithm will include q_1 points, all of which satisfy the threshold test. By Lemma 3.15, Z is Λ-poised, with Λ proportional to (ξ_growth/ξ_acc) max{1, Δ²/2, Δ}. It follows that Y_new ∪ {y_new} is Λ-subpoised. ■
We are now ready to state our model improvement algorithm for regression. Prior to calling this algorithm, we discard all points in Y with distance greater than Δ/√ξ_acc from y^0. We then form the shifted and scaled set Ŷ by the transformation ŷ^j = (y^j − y^0)/d, where d = max_{y^j ∈ Y} ||y^j − y^0||, and scale the trust region radius accordingly (i.e., Δ̂ = Δ/d). This ensures that Δ̂ = Δ/d ≥ √ξ_acc. After calling the algorithm, we reverse the shift and scale transformation.
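This preprocessing is summarized in the sketch below (illustrative Python with hypothetical names, not the thesis implementation); it assumes y^0 belongs to Y, so the kept set is nonempty.

```python
import numpy as np

def preprocess_for_mia(Y, y0, Delta, xi_acc):
    """Discard points farther than Delta/sqrt(xi_acc) from y0, then shift and
    scale so the kept points lie in B(0; 1); return the scaled set, the scaled
    radius, and the scale d needed to reverse the transformation."""
    kept = [y for y in Y if np.linalg.norm(y - y0) <= Delta / np.sqrt(xi_acc)]
    d = max(np.linalg.norm(y - y0) for y in kept)
    Y_hat = [(y - y0) / d for y in kept]
    return Y_hat, Delta / d, d        # Delta/d >= sqrt(xi_acc) by construction

def undo_shift_scale(Y_hat, y0, d):
    """Map points of the shifted and scaled set back to the original space."""
    return [y0 + d * y for y in Y_hat]
```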
Algorithm MIA (Model Improvement for Regression)

Input: A shifted and scaled sample set Y ⊂ B(0; 1), and a trust region radius Δ ≥ √ξ_acc for a fixed ξ_acc ∈ (0, Δ²), where r > 1 is a fixed parameter.

Output: A modified set Y′ with improved poisedness on B(0; Δ).

Step 0 (Initialization): Remove the point in Y farthest from y^0 = 0 if it is outside B(0; rΔ). Set Y′ = ∅.

Step 1 (Find Poised Subset): Use Algorithm FindSet either to identify a Λ-poised subset Y_new ⊆ Y, or to identify a Λ-subpoised subset Y_new ⊆ Y and one additional point y_new ∈ B(0; Δ) such that Y_new ∪ {y_new} is Λ-subpoised on B(0; Δ).

Step 2 (Update Set):
  If Y_new is Λ-poised, add it to Y′ and remove Y_new from Y. Remove all points from Y that are outside of B(0; rΔ).
  Otherwise:
    If |Y′| = 0, set Y′ = Y_new ∪ {y_new} plus q_1 − |Y_new| − 1 additional points from Y.
    Otherwise, set Y′ = Y′ ∪ Y_new plus q_1 − |Y_new| additional points from Y.
    Set Y = ∅.

Step 3: If |Y| ≥ q_1, go to Step 1.
In Algorithm MIA, if every call to Algorithm FindSet yields a Λ-poised set Y_new, then eventually all points in Y will be included in Y′. In this case, the algorithm has verified that Y contains ℓ = ⌊p_1/q_1⌋ Λ-poised sets. By Proposition 3.13, Y is strongly √2 Λ-poised in B(0; 1).

If the first call to FindSet fails to identify a Λ-poised subset, the algorithm improves the sample set by adding a new point y_new and by removing points so that the output set Y′ contains at most q_1 points. In this case the output set contains the Λ-subpoised set Y_new ∪ {y_new}. Thus, if the algorithm is called repeatedly, with Y replaced by Y′ after each call, eventually Y′ will contain a Λ-poised subset and will be strongly √2 Λ-poised, by Proposition 3.13.

If Y fails to be Λ-poised after the second or later call to FindSet, no new points are added. Instead, the sample set is improved by removing points from Y so that the output set Y′ consists of all the Λ-poised subsets identified by FindSet, plus up to q_1 additional points. The resulting set is then strongly √((ℓ+1)/ℓ) Λ-poised, where ℓ = ⌊p_1/q_1⌋.
Trust region scale factor. The trust region scale factor r was suggested in [12, Section 11.2], although implementation details were omitted. The scale factor determines what points are allowed to remain in the sample set. Each call to Algorithm MIA removes a point from outside B(0; rΔ) if such a point exists. Thus, if the algorithm is called repeatedly with Y replaced by Y′ each time, eventually all points in the sample set will be in the region B(0; rΔ). Using a scale factor r > 1 can improve the efficiency of the algorithm. To see this, observe that if r = 1, a slight movement of the trust region center may result in previously "good" points lying just outside of B(y^0; Δ). These points would then be unnecessarily removed from Y.
To justify this approach, suppose that Y is strongly Λ-poised in B(0; Δ). By Proposition 3.11, the associated model function m is κ-fully quadratic for some fixed vector κ, which depends on Λ. If instead Y has points outside of B(0; Δ), we can show (by a simple modification to the proof of Proposition 3.11) that the model function is (R/Δ)³κ-fully quadratic, where R = max_i {||y^i − y^0||}. Thus, if Y ⊂ B(0; rΔ) for some fixed r > 1, then calling the model improvement algorithm will result in a model function m that is κ̄-fully quadratic with respect to a different (but still fixed) κ̄ = r³κ. In this case, however, whenever new points are added during the model improvement algorithm, they are always chosen within the original trust region B(0; Δ).
The discussion above demonstrates that Algorithm MIA satisfies the requirements of a model improvement algorithm specified in Definition 2.2. This algorithm is used in the CSV2-framework described in Chapter 2 as follows:

• In Step 1 of Algorithm CSV2, Algorithm MIA is called once. If no change is made to the sample set, the model is certified to be κ-fully quadratic.

• In Step 4 of Algorithm CSV2, Algorithm MIA is called once. If no change is made to the sample set, the model is κ-fully quadratic. Otherwise, the sample set is modified to improve the model.

• In Algorithm CriticalityStep, Algorithm MIA is called repeatedly to improve the model until it is κ-fully quadratic.

In our implementation, we modified Algorithm CriticalityStep to improve efficiency by introducing an additional exit criterion. Specifically, after each call to the model improvement algorithm, σ_k^i = max{||g_k^i||, −λ_min(H_k^i)} is tested. If σ_k^i > ε_c, then x_k is no longer a second-order stationary point of the model function, so we exit the criticality step.
3.5 Computational Results
As shown in the previous section, the CSV2-framework using weighted quadratic regression converges to a second-order stationary point provided the ratio between the largest and smallest weight is bounded. This leaves much leeway in the derivation of the weights. We now describe a heuristic strategy based on the error bounds derived in §3.3.
3.5.1 Using Error Bounds to Choose Weights
Intuitively, the models used throughout our algorithm will be most effective if the weights are chosen so that m(x) is as accurate as possible in the sense that it agrees with the second-order Taylor approximation of f(x) around the current trust region center y^0. That is, we want to estimate the quadratic function

  q(x) = f(y^0) + ∇f(y^0)^T (x − y^0) + (1/2)(x − y^0)^T ∇²f(y^0)(x − y^0).

If f(x) happens to be a quadratic polynomial, then

  f_i = q(y^i) + ε_i.

If the errors ε_i are uncorrelated random variables with zero mean and finite variances σ_i², i = 0, ..., p, then the best linear unbiased estimator of the polynomial q(x) is given by m(x) = φ(x)^T a, where a solves (3.4) with the ith weight proportional to 1/σ_i [51, Theorem 4.4]. This is intuitively appealing since each sample point will have the same expected contribution to the weighted sum of square errors.

When f(x) is not a quadratic function, the errors depend not just on the computational error, but also on the distances from each point to y^0. In the particular case when x = y^0, the first three terms of (3.7) are the quadratic function q(y^i). Thus, the error between the computed function value and q(y^i) is given by:

  f_i − q(y^i) = (1/2)(y^i − y^0)^T H_i(y^0)(y^i − y^0) + ε_i,   (3.14)
where H_i(y^0) = ∇²f(η_i(y^0)) − ∇²f(y^0) for some point η_i(y^0) on the line segment connecting y^0 and y^i.

We shall refer to the first term on the right as the Taylor error and the second term on the right as the computational error. By Assumption 2.1, ||H_i(y^0)|| ≤ L ||y^i − y^0||. This leads us to the following heuristic argument for choosing the weights: Suppose that H_i(y^0) is a random symmetric matrix such that the standard deviation of ||H_i(y^0)|| is proportional to L ||y^i − y^0||; in other words, ||H_i(y^0)|| = ζ L ||y^i − y^0|| for some constant ζ. Then the Taylor error will have standard deviation proportional to L ||y^i − y^0||³. Assuming the computational error is independent of the Taylor error, the total error f_i − q(y^i) will have standard deviation √( (ζ L ||y^i − y^0||³)² + σ_i² ), where σ_i is the standard deviation of ε_i. This leads to the following formula for the weights:

  w_i ∝ 1 / √( ζ² L² ||y^i − y^0||⁶ + σ_i² ).

Of course, this formula depends on knowing ζ, L, and σ_i. If L, σ_i, and/or ζ are not known, this formula could still be used in conjunction with some strategy for estimating L, σ_i, and ζ (for example, based upon the accuracy of the model functions at known points). Alternatively, ζ and L can be combined into a single parameter C̃ that could be chosen using some type of adaptive strategy:

  w_i ∝ 1 / √( C̃² ||y^i − y^0||⁶ + σ_i² ).

If the computational errors have equal variances, the formula can be further simplified as

  w_i ∝ 1 / √( C ||y^i − y^0||⁶ + 1 ),   (3.15)

where C = C̃²/σ².
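In code, the weighting scheme (3.15) is a one-line formula. The sketch below is illustrative Python (not the thesis's MATLAB code); it normalizes so that the largest weight equals 1, as required in §3.3.3 to keep cond(W) bounded, and assumes the constant C is supplied (C = 100 is used in the experiments of §3.5.2).

```python
import numpy as np

def heuristic_weights(Y, y0, C=100.0):
    """Weights from (3.15): w_i proportional to 1/sqrt(C ||y^i - y0||^6 + 1),
    rescaled so that max_i w_i = 1 (keeping cond(W) bounded)."""
    dists = np.array([np.linalg.norm(y - y0) for y in Y])
    w = 1.0 / np.sqrt(C * dists ** 6 + 1.0)
    return w / w.max()
```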
An obvious flaw in the above development is that the errors f_i − q(y^i) are not uncorrelated. Additionally, the assumption that ||H_i(y^0)|| is proportional to L ||y^i − y^0|| is valid only for limited classes of functions. Nevertheless, based on our computational experiments, (3.15) appears to provide a sensible strategy for balancing differing levels of computational uncertainty with the Taylor error.
3.5.2 Benchmark Performance
To study the impact of weighted regression, we developed MATLAB implementations of three quadratic model-based trust region algorithms using interpolation, regression, and weighted regression, respectively, to construct the quadratic model functions. To the extent possible, the differences between these algorithms were minimized, with code shared whenever possible. Obviously, all three methods use different strategies for constructing the model from the sample set. Beyond that, the only difference is that the two regression methods use the model improvement algorithm described in Section 3.4, whereas the interpolation algorithm uses the model improvement algorithm described in [12, Algorithm 6.6].
We compared the three algorithms using the suite of test problems for benchmarking derivative-free optimization algorithms made available by Moré and Wild [41]. We ran our tests on the four types of problems from this test suite: smooth problems (with no noise), piecewise smooth functions, functions with deterministic noise, and functions with stochastic noise. We do not consider the algorithm presented in this chapter to be ideal for handling stochastically noisy functions. For example, if the initial point happens to be evaluated with large negative noise, the algorithm will never re-evaluate this point and possibly never move the trust region center. We are actively attempting to construct a more robust algorithm. We consider such modifications non-trivial and outside the scope of the current work. The problems were run with the following parameter settings:

Δ_max = 100, Δ_0^icb = 1, η_0 = 10⁻⁶, η_1 = 0.5, γ = 0.5, γ_inc = 2, ε_c = 0.01, μ = 2, β = 0.5, ω = 0.5, r = 3, ξ_acc = 10⁻⁴. For the interpolation algorithm, we used ξ_imp = 1.01 for the calls to [12, Algorithm 6.6].
As described in [41], the smooth problems are derived from 22 nonlinear least squares functions defined in the CUTEr [23] collection. For each problem, the objective function f(x) is defined by

  f(x) = Σ_{k=1}^m g_k(x)²,

where g : ℝⁿ → ℝ^m represents one of the CUTEr test functions. The piecewise-smooth problems are defined using the same CUTEr test functions by

  f(x) = Σ_{k=1}^m |g_k(x)|.

The noisy problems are derived from the smooth problems by multiplying by a noise function as follows:

  f(x) = (1 + ε_f Γ(x)) Σ_{k=1}^m g_k(x)²,

where ε_f defines the relative noise level. For the stochastically noisy problems, Γ(x) is a random variable drawn from the uniform distribution U[−1, 1]. To simulate deterministic noise, Γ(x) is a function that oscillates between −1 and 1, with both high-frequency and low-frequency oscillations. (For an equation for the deterministic Γ, see [41, Eqns. (4.2)–(4.3)].) Using multiple starting points for some of the test functions, a total of 53 different problems are specified in the test suite for each of these 3 types of problems.
For the weighted regression algorithm, the weights were determined by the weighting scheme (3.15) with C = 100.
The relative performances of the algorithms were compared using performance profiles and data profiles [17, 41]. If S is the set of solvers to be compared on a suite of problems P, let t_{p,s} be the number of iterates required for solver s ∈ S on a problem p ∈ P to find a function value satisfying:

  f(x) ≤ f_L + τ (f(x_0) − f_L),   (3.16)

where f_L is the best function value achieved by any s ∈ S. Then the performance profile of a solver s ∈ S is the following fraction:

  ρ_s(α) = (1/|P|) | { p ∈ P : t_{p,s} / min{t_{p,s̄} : s̄ ∈ S} ≤ α } |.

The data profile of a solver s ∈ S is:

  d_s(α) = (1/|P|) | { p ∈ P : t_{p,s} / (n_p + 1) ≤ α } |,

where n_p is the dimension of problem p ∈ P. For more information on these profiles, including their relative merits and faults, see [41].
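Both profiles can be computed directly from the table of t_{p,s} values. The following sketch is illustrative Python (not the code used to produce the figures below); it assumes t[p, s] holds t_{p,s} as defined above, with infinity recorded when (3.16) is never satisfied for that solver/problem pair.

```python
import numpy as np

def profiles(t, dims, alphas):
    """t: array of shape (|P|, |S|) of t_{p,s} values (np.inf if (3.16) is never met);
    dims[p] = n_p. Returns (performance, data) profiles, one column per solver,
    evaluated at each value in alphas."""
    t = np.asarray(t, dtype=float)
    with np.errstate(invalid="ignore"):
        perf_ratio = t / np.min(t, axis=1, keepdims=True)   # ratio to best solver
    data_ratio = t / (np.asarray(dims)[:, None] + 1.0)
    num_problems = t.shape[0]
    perf = np.array([(perf_ratio <= a).sum(axis=0) / num_problems for a in alphas])
    data = np.array([(data_ratio <= a).sum(axis=0) / num_problems for a in alphas])
    return perf, data
```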
Performance profiles comparing the three algorithms are shown in Figure 3.1 for an accuracy of τ = 10⁻⁵. We observe that on the smooth problems, the weighted and unweighted regression methods had similar performance and both performed slightly better than interpolation. For the deterministically noisy problems, we see slightly better performance from the weighted regression method; and this improvement is even more pronounced for the benchmark problems with stochastic noise. And for the piecewise differentiable functions, the performance of the weighted regression method is significantly better. This mirrors the findings in [13], where SID-PSM using regression models shows considerable improvement over interpolation models.
We also compared our weighted regression algorithm with the DFO algorithm [8] as well as NEWUOA [50] (which had the best performance of the three solvers compared in [41]). We obtained the DFO code from the COIN-OR website [38]. This code calls IPOPT, which we also obtained from COIN-OR. We obtained NEWUOA from [40]. We ran both algorithms on the benchmark problems with a stopping criterion of Δ_min = 10⁻⁸, where Δ_min denotes the minimum trust region radius. For NEWUOA, the number of interpolation conditions was set to NPT = 2n + 1.

The performance profiles are shown in Figure 3.2, with an accuracy of τ = 10⁻⁵.
NEWUOA outperforms both our algorithm and DFO on the smooth problems. This is not surprising since NEWUOA is a mature code that has been refined over several years, whereas our code is a relatively unsophisticated implementation. In contrast, on the noisy problems and the piecewise differentiable problems, our weighted regression algorithm achieves superior performance.

[Figure 3.1: Performance (left) and data (right) profiles on smooth, nondifferentiable, deterministically noisy, and stochastically noisy problems (τ = 10⁻⁵): interpolation vs. regression vs. weighted regression.]
[Figure 3.2: Performance (left) and data (right) profiles on smooth, nondifferentiable, deterministically noisy, and stochastically noisy problems (τ = 10⁻⁵): weighted regression vs. NEWUOA (NPT = 2n + 1) vs. DFO.]
3.6 Summary and Conclusions
Our computational results indicate that using weighted regression to construct more accurate model functions can reduce the number of function evaluations required to reach a stationary point. Encouraged by these results, we believe that further study of weighted regression methods is warranted. This chapter provides a theoretical foundation for such study. In particular, we have extended the concepts of Λ-poisedness and strong Λ-poisedness to the weighted regression framework, and we demonstrated that any scheme that maintains strongly Λ-poised sample sets for (unweighted) regression can also be used to maintain strongly Λ-poised sample sets for weighted regression, provided that no weight is too small relative to the other weights. Using these results, we showed that, when the computational error is sufficiently small relative to the trust region radius, the algorithm converges to a stationary point under standard assumptions.
This investigation began with a goal of more efficiently dealing with computational error in derivative-free optimization, particularly under varying levels of uncertainty. Surprisingly, we discovered that regression based methods can be advantageous even in the absence of computational error. Regression methods produce quadratic approximations that better fit the objective function close to the trust region center. This is due partly to the fact that interpolation methods throw out points that are close together in order to maintain a well-poised sample set. In contrast, regression models keep these points in the sample set, thereby putting greater weight on points close to the trust region center.
The question of how to choose weights needs further study. In this chapter, we proposed a heuristic that balances uncertainties arising from computational error with uncertainties arising from poor model fidelity (i.e., Taylor error as described in §3.5.1). This weighting scheme appears to provide a benefit for noisy problems or non-differentiable problems. We believe better schemes can be devised based on more rigorous analysis.
Finally, we note that the advantage of regression-based methods is not without cost in terms of computational efficiency. In the regression methods, quadratic models are constructed from scratch every iteration, requiring O(n⁶) operations. In contrast, interpolation-based methods are able to use an efficient scheme developed by Powell [50] to update the quadratic models at each iteration. It is not clear whether such a scheme can be devised for regression methods. Nevertheless, when function evaluations are extremely expensive, and when the number of variables is not too large, this advantage is outweighed by the reduction in function evaluations realized by regression-based methods.
4. Stochastic Derivative-free Optimization using a Trust Region Framework
In this chapter, we propose and analyze the convergence of an algorithm which finds a local minimizer of the unconstrained function f : ℝⁿ → ℝ. The value of f at a given point x cannot be observed directly; rather, the optimization routine only has access to noise-corrupted function values f̃. Such noise may be deterministic, due to round-off error from finite-precision arithmetic or iterative methods, or stochastic, arising from variability (or randomness) in some observed process. We focus our attention in this chapter on minimizing f when f̃ has the form:

  f̃(x) = f(x) + ε,   (4.1)

where ε ∼ N(0, σ²).
Minimizing noisy functions of this nature arises in a variety of settings, for example, in almost any problem where physical system measurements are being optimized. Consider a city planner wanting to maximize traffic flow on a major thoroughfare by adjusting the timing of traffic lights. For each timing pattern x, the traffic flow f̃(x) is physically measured to provide information about the expected traffic flow f(x). Stochastic approximation algorithms, built to solve

  min_x f(x) = E[f̃(x)],

have existed in the literature since the Robbins–Monro algorithm for finding roots of an expected value function [53]. The Kiefer–Wolfowitz (KW) algorithm [35] generalized this algorithm to minimize the expected value of a function. Their iterates have the form

  x^{k+1} = x^k + a_k G(x^k),

where G is a finite difference estimate for the gradient of f. The ith component of G is found by

  G_i = ( f̃(x^k + c_k e^i) − f̃(x^k − c_k e^i) ) / (2 c_k),

where e^i is the ith unit vector. While KW spawned many generalizations, most forms require a predetermined decaying sequence for both the step size parameter a_k and the finite difference parameter c_k. As opposed to the 2n evaluations of f̃ required at each iterate of KW, Spall's simultaneous perturbation stochastic approximation (SPSA) [56] requires only 2 function evaluations per iterate, independent of n. SPSA estimates G_i by

  G_i = ( f̃(x^k + c_k δ^k) − f̃(x^k − c_k δ^k) ) / (2 c_k δ^k_i),

where δ^k ∈ ℝⁿ is a random perturbation vector with entries δ^k_i which are independent and identically distributed (i.i.d.) from a distribution with bounded inverse moments, symmetrically distributed around zero, and uniformly bounded in magnitude for all k and i. Though SPSA greatly reduces the number of evaluations of f̃, the choice of sequences a_k and c_k is critical to algorithmic performance. Nevertheless, if f has a unique minimum x*, both KW and SPSA have almost sure convergence (which implies convergence in probability and convergence in distribution) of x^k → x* as k → ∞. There is also a version of SPSA which uses four function evaluations to estimate the value, gradient, and Hessian of f [57].
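As a concrete illustration of the SPSA estimate above, the following sketch is hypothetical Python (not part of the thesis): `f_noisy`, `a_k`, and `c_k` are assumed names, and the ±1 (Rademacher) perturbation is one common choice satisfying the stated conditions on δ^k.

```python
import numpy as np

def spsa_gradient(f_noisy, x, c_k, rng):
    """One SPSA gradient estimate from two evaluations of the noisy function,
    using a random +/-1 perturbation direction delta."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    diff = f_noisy(x + c_k * delta) - f_noisy(x - c_k * delta)
    return diff / (2.0 * c_k * delta)   # elementwise: G_i = diff / (2 c_k delta_i)

# For minimization, one common update then steps against the gradient estimate:
#   x_next = x - a_k * spsa_gradient(f_noisy, x, c_k, rng)
```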
The algorithm which follows differs from the work in Chapter 3 in a few ways. First, the analysis presented in Chapter 3 gives very conservative error bounds; to tighten these bounds, we must consider specific probability distributions for the error. Second, in Chapter 3, ρ_k was defined using f̃(x^k) as an estimate for f(x^k). The work that follows evaluates model functions m_k(x^k) and m̂_k(x^k) to provide better estimates of the true function value. It is possible to estimate f(x^k) by repeated evaluation of f̃(x^k), but we desire an algorithm which avoids repeatedly sampling points to reduce the variance at a point x. Such a technique only gains information about the noise ε in the stochastic case, and no information about ε if f̃ is deterministic. But if many points sufficiently close to x are sampled, information about both f and ε can be gleaned. As is often the case, the point x is the likely next iterate, and the information gathered about f near x will be used immediately. Also, if the noise in f̃ is deterministic but the optimizer has imperfect control of x, it may be possible to consider the problem a stochastic optimization problem.

The analysis of the algorithm is complicated by the presence of noise. Since there is noise in each function evaluation, it is impossible to be certain the model matches the function. For example, if f(x) = x², there is a nonzero (but tiny) probability that f̃(x) < 0 at every point evaluated. Therefore, we can only have confidence (which we denote 1 − α_k, for α_k small) that the model and function agree. The quantity α_k can be adjusted as the algorithm stagnates to ensure increasingly accurate models (at the expense of more function evaluations). A key requirement of the convergence analysis is that as Δ_k → 0, α_k does as well. For example, we can choose a simple rule such as α_k = min{Δ_k, 0.05} to prove results about our algorithm. There are many other equally valid rules for handling α_k to ensure increasingly accurate models as Δ_k → 0.
Our ultimate goal is to prove that the algorithm converges to a stationary point of / almost surely (with probability 1), but this is a daunting task. This is to be expected considering the following two quotes (both from [58])
There is a fundamental trade-off between algorithm efficiency and algorithm robustness (reliability and stability in a broad range of problems).
In essence, algorithms that are designed to be very efficient on one type of problem tend to be "brittle" in the sense that they do not reliably transfer to problems of a different type.
and
Unfortunately, for general nonlinear problems, there is no known finite-sample (k < oo) distribution for the SA [stochastic approximation] iterate. Further, the theory governing the asymptotic (k —> oo) distribution is rather difficult.
Despite these pessimistic views, we are able to make progress in the work that follows. Since we are attempting to construct a robust algorithm, with a measure of confidence in our solution after a finite number of iterations, our theoretical requirements may not be implemented in the algorithm. Relaxing requirements may yield a more suitable algorithm for a specific problem instance.
Our algorithm is a derivative-free trust region method using regression quadratic models for their perceived ability to handle noisy function evaluations. We outline the modifications required for convergence when minimizing a function with stochastic noise. For example, when there is no noise in the function being optimized, we can measure the accuracy of the kth model m_k with the ratio

  ρ_k = ( f(x^k) − f(x^k + s^k) ) / ( m_k(x^k) − m_k(x^k + s^k) ).

This measures the actual decrease observed in f versus the decrease predicted by the model m_k. Since f cannot be evaluated directly, we propose a modified ratio ρ̂_k in Section 4.2 which we believe is more appropriate for noisy functions. We also propose a modified form of κ-fully quadratic for stochastically noisy functions.

An overview of the chapter follows: in Section 4.1 we define κ-fully quadratic (and linear) models with confidence 1 − α_k on B(x; Δ) and show that quadratic and linear regression models satisfy these new definitions, provided there are a sufficient number of poised points in B(x; Δ). We outline the algorithm in Section 4.2 and show that it converges to a first-order stationary point in Section 4.3. We provide suggestions for implementing our algorithm and compare one implementation against other algorithms for minimizing (4.1) in Section 4.4. Lastly, we discuss the results in Section 4.5 and outline some of the future avenues for research.
4.1 Preliminary Results and Definitions
We make the following assumptions:
Assumption 4.1 The noise ε ∼ N(0, σ²), independently for each evaluation of f̃.

Assumption 4.2 The function f ∈ LC²(Ω) with Lipschitz constant L on

  Ω = ∪_k B(x^k; Δ_max) ⊆ ℝⁿ.

Assumption 4.3 The function f is bounded on L_{f(x^0)} (where L_a = {x : f(x) ≤ a}).
In solving the trust region subproblem, we do not require an exact solution; instead, it is sufficient to find an approximate solution, but it must satisfy the following assumption.

Assumption 4.4 If m_k and Δ_k are the model and trust region radius at iterate k, x^k + s^k is chosen by the trust region solver to solve min_{x ∈ B(x^k; Δ_k)} m_k(x), and s^k_C is the Cauchy step, then for all k

  m_k(x^k) − m_k(x^k + s^k) ≥ κ_fcd ( m_k(x^k) − m_k(x^k + s^k_C) )

for some constant κ_fcd ∈ (0, 1].

This assumption merely states that every trust region subproblem solution achieves a fraction of the decrease possible from taking the Cauchy step, and that this fraction is bounded away from zero. Also, the assumption allows us to not solve the trust region subproblem exactly.
Assumption 4.5 There exists a constant κ_bhf > 0 such that, for all x^k generated by the algorithm,

  ||∇²f(x^k)|| ≤ κ_bhf.

We prove the following three claims used in this chapter.
Lemma 4.6 If X ≤ Y with probability at least 1 − α and Y ≤ Z with probability at least 1 − α, then X ≤ Z with probability at least 1 − 2α.

Proof:

  P(X ≤ Z) ≥ P(X ≤ Y ∧ Y ≤ Z) = 1 − P(X > Y ∨ Y > Z)
           = 1 − P(X > Y) − P(Y > Z) + P(X > Y ∧ Y > Z)
           ≥ 1 − P(X > Y) − P(Y > Z) ≥ 1 − α − α = 1 − 2α.

So X ≤ Z with probability at least 1 − 2α. ■
Lemma 4.7 Let a_1, ..., a_n be random variables. If P(a_i ≤ ε/n) ≥ 1 − α/n for i = 1, ..., n, then

  P( Σ_{i=1}^n a_i ≤ ε ) ≥ 1 − α.

Proof:

  P( Σ_{i=1}^n a_i ≤ ε ) ≥ P( a_1 ≤ ε/n ∧ a_2 ≤ ε/n ∧ ... ∧ a_n ≤ ε/n )
                         = 1 − P( a_1 > ε/n ∨ ... ∨ a_n > ε/n )
                         ≥ 1 − Σ_{i=1}^n P( a_i > ε/n ) ≥ 1 − α.  ■
Lemma 4.8 Let Y ⊂ B(0; 1) be a strongly Λ-poised (Definition 2.14) sample set with p_1 points. Let X be the quadratic design matrix defined by (1.2). Then

  [(X^T X)^{−1}]_{i,i} ≤ (q_1²/p_1) Λ²,

where [·]_{i,i} denotes the ith diagonal entry.

Proof: Since (X^T X)^{−1} is symmetric and positive definite, the ith eigenvalue λ_i equals the ith singular value σ_i. By [12, Theorem 4.11], the inverse of the smallest singular value of X is bounded by (q_1/√p_1) Λ. That is,

  1/σ_min(X) ≤ (q_1/√p_1) Λ,

or

  (q_1²/p_1) Λ² ≥ 1/(σ_min(X))² = 1/λ_min(X^T X) = σ_max((X^T X)^{−1}) = ||(X^T X)^{−1}||
               = max_{||v||=1} ||(X^T X)^{−1} v|| ≥ ||(X^T X)^{−1} e_i|| ≥ [(X^T X)^{−1}]_{i,i},

where e_i is the ith unit vector. ■
4.1.1 Models which are κ-fully Quadratic with Confidence 1 − α_k
To prove convergence of the algorithm presented in Section 4.2, we first propose a modified version of κ-fully quadratic models.

Definition 4.9 Let f satisfy Assumption 4.2. Let κ = (κ_ef, κ_eg, κ_eh, ν_m) be a given vector of constants, and let Δ > 0. A model function m ∈ C² is κ-fully quadratic with confidence 1 − α_k on B(x; Δ) for α_k ∈ (0, 1) if m has a Lipschitz continuous Hessian with corresponding Lipschitz constant bounded by ν_m and

• the error between the Hessian of the model and the Hessian of the function satisfies

  P( ||∇²f(y) − ∇²m(y)|| ≤ κ_eh Δ  ∀ y ∈ B(x; Δ) ) ≥ 1 − α_k,

• the error between the gradient of the model and the gradient of the function satisfies

  P( ||∇f(y) − ∇m(y)|| ≤ κ_eg Δ²  ∀ y ∈ B(x; Δ) ) ≥ 1 − α_k,

• the error between the model and the function satisfies

  P( |f(y) − m(y)| ≤ κ_ef Δ³  ∀ y ∈ B(x; Δ) ) ≥ 1 − α_k.

This is occasionally abbreviated κ-f.q.w.c. 1 − α_k.
These definitions are only useful if model functions can be (easily) constructed which satisfy them; the models must also be easy to minimize over a trust region. In the following theorem, we show that quadratic regression models satisfy the requirements of Definition 4.9, provided there are enough poised points within the trust region.

Theorem 4.10 If the function f satisfies Assumption 4.2 and the noise ε satisfies Assumption 4.1, then for a given α_k ∈ (0, 1), there exists a κ = (κ_ef, κ_eg, κ_eh, ν_m) such that for any x^0 ∈ ℝⁿ and Δ > 0, if Y ⊂ B(x^0; Δ) is strongly Λ-poised and

  |Y| ≥ (z_{1 − α_k/(2 q_1)})² σ² q_1⁴ Λ² / Δ⁶,

then the quadratic regression model is κ-fully quadratic with confidence 1 − α_k (where z_{α/2} is the number of standard deviations away from zero on a standard normal distribution such that the area to the left of z_{α/2} is α/2) [46].
Proof: By Taylor's theorem, for any point x ∈ B(x^0; Δ) there exists a point η(x) on the line segment connecting x to x^0 such that

  f(x) = f(x^0) + ∇f(x^0)^T (x − x^0) + (1/2)(x − x^0)^T ∇²f(η(x))(x − x^0)
       = f(x^0) + ∇f(x^0)^T (x − x^0) + (1/2)(x − x^0)^T ∇²f(x^0)(x − x^0)
         + (1/2)(x − x^0)^T H(x)(x − x^0),   (4.2)

where H(x) = ∇²f(η(x)) − ∇²f(x^0).

Let m(x) be the quadratic least squares model regressing Y. Since m is quadratic, Taylor's theorem says that for any x,

  m(x) = m(x^0) + ∇m(x^0)^T (x − x^0) + (1/2)(x − x^0)^T ∇²m(x^0)(x − x^0).

Let β be the true parameters of the quadratic part of f (defined by the first three terms of (4.2)) and let β̂ be the least squares estimate for β (i.e., if X is the design matrix for the set Y and f̃ is the vector with ith entry f̃(y^i), then β̂ = (X^T X)^{−1} X^T f̃). Define the mapping V(x) : ℝⁿ → ℝ^{q_1} by V(x) = V([x_1, ..., x_n]^T) = [1, x_1, ..., x_n, x_1²/2, x_1 x_2, ..., x_n²/2]^T. Then the ith row of X is V(y^i)^T. The parameters β̂ define the model m; that is, m(x) = β̂^T V(x).

Without loss of generality, assume Δ ≤ 1. Then, for any x ∈ B(x^0; Δ),

  |f(x) − m(x)| = | f(x^0) + ∇f(x^0)^T (x − x^0) + (1/2)(x − x^0)^T ∇²f(x^0)(x − x^0)
                   + (1/2)(x − x^0)^T H(x)(x − x^0) − m(x^0) − ∇m(x^0)^T (x − x^0)
                   − (1/2)(x − x^0)^T ∇²m(x^0)(x − x^0) |
                ≤ | (β − β̂)^T V(x − x^0) | + | (1/2)(x − x^0)^T H(x)(x − x^0) |
                ≤ Σ_{i=0}^q |β_i − β̂_i| |V(x − x^0)_i| + (L/2) ||x − x^0||³
                ≤ |β_0 − β̂_0| + Σ_{i=1}^n |β_i − β̂_i| Δ + Σ_{i=n+1}^q |β_i − β̂_i| Δ² + (L/2) Δ³
                ≤ Σ_{i=0}^q |β_i − β̂_i| + (L/2) Δ³.   (4.3)

Since the noise is uncorrelated with mean zero and constant variance, and is normally distributed, it is known that β̂ ∼ N(β, σ²(X^T X)^{−1}) [46]. If [A]_{i,i} denotes the ith diagonal entry of a matrix A, then the 1 − α_k/q_1 confidence interval for each of the β_i has the form [46]:
  1 − α_k/q_1 = P( β̂_i − z_{1 − α_k/(2 q_1)} √(σ² [(X^T X)^{−1}]_{i,i}) ≤ β_i ≤ β̂_i + z_{1 − α_k/(2 q_1)} √(σ² [(X^T X)^{−1}]_{i,i}) )
             = P( |β_i − β̂_i| ≤ z_{1 − α_k/(2 q_1)} √(σ² [(X^T X)^{−1}]_{i,i}) )
             ≤ P( |β_i − β̂_i| ≤ z_{1 − α_k/(2 q_1)} σ q_1 Λ / √|Y| )   (by Lemma 4.8)
             ≤ P( |β_i − β̂_i| ≤ Δ³/q_1 )   (by the bound on |Y|).

Therefore, by Lemma 4.7,

  P( Σ_{i=0}^q |β_i − β̂_i| ≤ Δ³ ) ≥ 1 − Σ_{i=0}^q α_k/q_1 = 1 − α_k.

Substituting into (4.3), we know

  |f(x) − m(x)| ≤ Δ³ + (L/2) Δ³ = κ_ef Δ³

with probability at least 1 − α_k, where κ_ef = 1 + L/2. Similarly,


  ||∇f(x) − ∇m(x)|| = || ∇f(x^0) + ∇²f(x^0)^T (x − x^0) + H(x)(x − x^0)
                        − ( ∇m(x^0) + ∇²m(x^0)(x − x^0) ) ||
                    ≤ ||∇f(x^0) − ∇m(x^0)||
                      + ||∇²f(x^0)^T (x − x^0) − ∇²m(x^0)(x − x^0)||
                      + ||H(x)(x − x^0)||
                    ≤ Σ_{i=1}^n |β_i − β̂_i| + Σ_{i=n+1}^q |β_i − β̂_i| Δ + L Δ²
                    ≤ Σ_{i=1}^q |β_i − β̂_i| + L Δ²
                    ≤ Δ³ + L Δ² ≤ Δ² + L Δ² = κ_eg Δ²

with probability at least 1 − α_k, where κ_eg = 1 + L. A similar argument for ||∇²f(x) − ∇²m(x)|| with κ_eh = 1 + L proves the theorem. ■
To certify that a model satisfies Definition 4.9, or to improve a least squares regression model into one that is κ-fully quadratic with confidence 1 − α_k, is straightforward: we must ensure there are enough poised points within B(x; Δ) to satisfy the bound given in Theorem 4.10. Otherwise, we add enough strongly Λ-poised points to Y. For a technique to generate strongly Λ-poised sets, see Chapter 3 of this thesis or [12].
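The bound in Theorem 4.10 translates directly into a sample-size requirement. The following sketch is illustrative Python using SciPy's normal quantile; the constants follow the bound as stated above (reconstructed here) and should be treated as indicative rather than as the thesis's implementation.

```python
import math
from scipy.stats import norm

def min_points_fully_quadratic(alpha_k, sigma, n, Lambda, Delta):
    """Minimum |Y| from Theorem 4.10 so that a quadratic regression model on a
    strongly Lambda-poised set Y in B(x; Delta) is kappa-fully quadratic with
    confidence 1 - alpha_k (noise standard deviation sigma, dimension n)."""
    q1 = (n + 1) * (n + 2) // 2                   # number of quadratic basis terms
    z = norm.ppf(1.0 - alpha_k / (2.0 * q1))      # z_{1 - alpha_k/(2 q1)}
    return math.ceil((z * sigma * q1 ** 2 * Lambda / Delta ** 3) ** 2)
```

Note that the requirement grows like Δ⁻⁶, so certifying models with this confidence becomes expensive as the trust region shrinks; this is the cost of increasingly accurate models discussed earlier.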
4.1.2 Models which are κ-fully Linear with Confidence 1 − α_k
While the models m_k used in the main algorithm are quadratic, linear models m̂(x) can approximate f near B(x^k + s^k; Δ̃_k) to sufficient accuracy. Therefore, if we have enough points within B(x^k + s^k; Δ̃_k), we can bound the error between f(x^k + s^k) and m̂_k(x^k + s^k). We quantify that accuracy in the following definition.
Definition 4.11 Let $f$ satisfy Assumption 4.2. Let $\hat{\kappa} = (\hat{\kappa}_{ef}, \hat{\kappa}_{eg}, \hat{\nu}_1^m)$ be a given vector of constants, and let $\Delta > 0$. A model function $\hat{m} \in C^2$ is $\hat{\kappa}$-fully linear with confidence $1-\alpha_k$ on $B(x; \Delta)$ for $\alpha_k \in (0,1)$ if $\hat{m}$ has a Lipschitz continuous gradient with corresponding Lipschitz constant bounded by $\hat{\nu}_1^m$ and
• the error between the gradient of the model and the gradient of the function satisfies
\[ P\left( \|\nabla f(y) - \nabla \hat{m}(y)\| \leq \hat{\kappa}_{eg}\Delta \quad \forall y \in B(x; \Delta) \right) \geq 1 - \alpha_k, \]
• the error between the model and the function satisfies
\[ P\left( |f(y) - \hat{m}(y)| \leq \hat{\kappa}_{ef}\Delta^2 \quad \forall y \in B(x; \Delta) \right) \geq 1 - \alpha_k. \]
This is occasionally abbreviated $\hat{\kappa}$-f.l.w.c. $1-\alpha_k$.
Theorem 4.12 If the function $f$ satisfies Assumption 4.2 and the noise $\epsilon$ satisfies Assumption 4.1, then for a given $\alpha_k \in (0,1)$, there exists a $\hat{\kappa} = (\hat{\kappa}_{ef}, \hat{\kappa}_{eg}, \hat{\nu}_1^m)$ such that for any $x^0 \in \mathbb{R}^n$, $\Delta > 0$, if $Y \subset B(x^0; \Delta)$ is strongly $\Lambda$-poised and
\[ |Y| \geq \frac{\left(z_{1-\frac{\alpha_k}{2(n+1)}}\right)^2 \sigma^2 (n+1)^2 \Lambda^2}{\Delta^4}, \]
then the linear regression model is $\hat{\kappa}$-fully linear with confidence $1-\alpha_k$.
Proof: The proof is nearly identical to that of Theorem 4.10. $\blacksquare$
4.2 Stochastic Optimization Algorithm
Below is an outline of our proposed stochastic algorithm. For $x^k + s^k$, the solution to the trust region subproblem, and a radius $\Delta_k \geq \hat{\Delta}_k > 0$, define $\hat{Y}_k = \{ y \in Y_{tot} : \| x^k + s^k - y \| \leq \hat{\Delta}_k \}$.
Let $Y_{tot} = \{y^1, \cdots, y^m\}$ be the set of points where $\tilde{f}$ has been evaluated, with $\tilde{f}_i := \tilde{f}(y^i)$. Define a null model $m_0$, an initial trust region radius $\Delta_0$, and an initial trust region center $x^0$. Choose constants satisfying $0 < \gamma < 1 < \gamma_{inc}$, $\epsilon_c > 0$, $0 \leq \eta_0 \leq \eta_1 < 1$ ($\eta_1 \neq 0$), $\mu > 0$, and $\omega \in (0,1)$. Choose $r \in (0,1)$ and define $\hat{\Delta}_k = r\Delta_k$.
Algorithm 1: A trust-region algorithm to minimize a stochastic function.
Let $k = 0$;
Start;
Set $\alpha_k = \min\{\Delta_k, 0.05\}$;
if $\varsigma_k = \max\{\|g_k\|, -\lambda_{\min}(H_k)\} \leq \epsilon_c$ and either i) $m_k$ is not certifiably $\kappa$-f.q.w.c. $1-\alpha_k$ on $B(x^k; \Delta_k)$ or ii) $\Delta_k > \mu\varsigma_k$ then
    Apply Algorithm 2 to update $Y_k$, $\Delta_k$, and $m_k$;
    Set $\alpha_k = \min\{\Delta_k, 0.05\}$;
else
    Select (or generate) a strongly $\Lambda$-poised set of points $Y_k \subset B(x^k; \Delta_k)$ from $Y_{tot}$ such that $Y_k$ has enough points to ensure $m_k$ is $\kappa$-f.q.w.c. $1-\alpha_k$;
Build a regression quadratic model $m_k(x)$ through $Y_k$. Solve $s^k = \arg\min_{s:\|s\| \leq \Delta_k} m_k(x^k + s)$. Build a $\hat{\kappa}$-f.l.w.c. $1-\alpha_k$ model $\hat{m}_k$ on $\hat{Y}_k$ (possibly adding points to $Y_{tot}$) and compute
\[ \hat{\rho}_k = \frac{m_k(x^k) - \hat{m}_k(x^k + s^k)}{m_k(x^k) - m_k(x^k + s^k)}; \]
if $\hat{\rho}_k \geq \eta_1$ or ($\hat{\rho}_k \geq \eta_0$ and $m_k$ is $\kappa$-f.q.w.c. $1-\alpha_k$ on $B(x^k; \Delta_k)$) then
    $x^{k+1} = x^k + s^k$;
else $x^{k+1} = x^k$;
if $\hat{\rho}_k \geq \eta_1$ then
    $\Delta_{k+1} = \min\{\gamma_{inc}\Delta_k, \Delta_{\max}\}$;
else if $m_k$ is $\kappa$-f.q.w.c. $1-\alpha_k$ on $B(x^k; \Delta_k)$ then $\Delta_{k+1} = \gamma\Delta_k$;
else $\Delta_{k+1} = \Delta_k$;
Let $m_{k+1}$ be the (possibly improved) model;
Set $k = k + 1$ and go to Start;
Note that we are approximating $f(x^k + s^k)$ using a second model $\hat{m}_k$ in a different trust region $\hat{\Delta}_k$ around $x^k + s^k$. Formal convergence of the algorithm, specifically Lemma 4.15, requires the ability to approximate $f(x^k + s^k)$ with increasing accuracy as the algorithm progresses. Such accuracy is not available from a single realization of the noisy function value $\tilde{f}(x^k + s^k)$. While it is possible to obtain increasingly accurate approximations of $f(x^k + s^k)$ by repeated sampling, we hope the theory developed in this chapter can be easily transferred to the case where deterministic noise is present in $\tilde{f}$. With deterministic noise, $\mathrm{Var}(\tilde{f}) = 0$, and therefore repeated sampling provides no further information.
Also, if we eventually shrink the trust region around a given point, points generated in $B(x^k + s^k; \hat{\Delta}_k)$ to build an accurate model $\hat{m}_k(x^k + s^k)$ can be reused in the construction of an accurate model $m_j(x)$ at some later iterate $j$.
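To make the acceptance logic concrete, here is a minimal Python sketch of one iteration in the spirit of Algorithm 1. It is schematic only: the model builders, the subproblem solver, and the fully-quadratic certificate are placeholders passed in by the caller, r is fixed at 0.5 purely for illustration, and the default parameter values are arbitrary rather than recommended.

def trust_region_iteration(x, Delta, build_quadratic, build_linear, solve_subproblem,
                           eta0=0.05, eta1=0.5, gamma=0.5, gamma_inc=2.0,
                           Delta_max=10.0, fully_quadratic=False):
    """One schematic iteration in the spirit of Algorithm 1.

    build_quadratic(x, Delta) -> m_k, a callable quadratic model on B(x; Delta)
    build_linear(center, Delta_hat) -> mhat_k, a callable linear model near x + s
    solve_subproblem(m, x, Delta) -> step s minimizing m over the trust region
    (All three are placeholders; the thesis builds them from strongly
    Lambda-poised regression sets, which is omitted here.)
    """
    m = build_quadratic(x, Delta)
    s = solve_subproblem(m, x, Delta)
    mhat = build_linear(x + s, 0.5 * Delta)          # Delta_hat = r * Delta with r = 0.5

    # rho_hat uses model values at both points; f(x) is never re-evaluated
    rho_hat = (m(x) - mhat(x + s)) / (m(x) - m(x + s))

    accept = rho_hat >= eta1 or (rho_hat >= eta0 and fully_quadratic)
    x_next = x + s if accept else x

    if rho_hat >= eta1:
        Delta_next = min(gamma_inc * Delta, Delta_max)
    elif fully_quadratic:
        Delta_next = gamma * Delta
    else:
        Delta_next = Delta      # model-improving iteration: radius unchanged
    return x_next, Delta_next, rho_hat

The point of the sketch is that $\hat{\rho}_k$ is computed entirely from model values, so a stale noisy realization of $\tilde{f}(x^k)$ never enters the acceptance decision.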
Algorithm 2: Criticality Step
Initialization: Set $i = 0$ and $\tilde{m}_k^{(0)} = m_k$.
Repeat: Increment $i$ by one. Improve the previous model by adding points to $Y_{tot}$ until it is $\kappa$-fully quadratic with confidence $1-\alpha_k$ on $B(x^k; \omega^{i-1}\Delta_k)$. (By Theorem 4.10 and the model improvement algorithm from [3], which builds a strongly $\Lambda$-poised set $Y$, this can be done in a finite, uniformly bounded number of steps if the models satisfy Definition 4.9.) Denote the new model $\tilde{m}_k^{(i)}$. Set $\tilde{\Delta}_k = \omega^{i-1}\Delta_k$ and $\tilde{m}_k = \tilde{m}_k^{(i)}$;
Until $\tilde{\Delta}_k \leq \mu\,\tilde{\varsigma}_k^{(i)}(x^k)$.
Return $m_k = \tilde{m}_k$, $\Delta_k = \min\left\{\max\left\{\tilde{\Delta}_k, \mu\,\tilde{\varsigma}_k(x^k)\right\}, \Delta_k\right\}$, and $Y_{tot}$.
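A compact Python rendering of the criticality loop may help clarify the interplay between $\omega$, $\mu$, and the certification radius. It is a sketch under stated assumptions: improve_model and sigma_of are hypothetical placeholders, and termination of the loop is only guaranteed with high probability when $x^k$ is not a stationary point (Theorem 4.13).

def criticality_step(x, Delta, m, improve_model, sigma_of, mu=0.1, omega=0.5):
    """Schematic version of the criticality step (Algorithm 2).

    improve_model(m, x, radius) -> a model certified fully quadratic (with the
        desired confidence) on B(x; radius); here it is a placeholder.
    sigma_of(m, x) -> criticality measure max{||grad m(x)||, -lambda_min(Hess m(x))}.
    """
    i = 0
    m_tilde, Delta_tilde = m, Delta
    while True:
        i += 1
        radius = omega ** (i - 1) * Delta        # shrink the certification radius
        m_tilde = improve_model(m_tilde, x, radius)
        Delta_tilde = radius
        # exit once the radius is comparable to the (certified) criticality measure
        if Delta_tilde <= mu * sigma_of(m_tilde, x):
            break
    Delta_new = min(max(Delta_tilde, mu * sigma_of(m_tilde, x)), Delta)
    return m_tilde, Delta_new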
We adopt the naming of iterates from Conn, Scheinberg, and Vicente:
1. $\hat{\rho}_k \geq \eta_1$ ($x^k + s^k$ is accepted and the trust region is increased). We call these iterations successful.
2. $\eta_1 > \hat{\rho}_k \geq \eta_0$ and $m_k$ is $\kappa$-fully quadratic with confidence $1-\alpha_k$ ($x^k + s^k$ is accepted but $\Delta_k$ is decreased). We call these iterations acceptable.
3. $\eta_1 > \hat{\rho}_k$ and $m_k$ is not $\kappa$-fully quadratic with confidence $1-\alpha_k$ ($x^k + s^k$ is not accepted and the model is improved). We call these iterations model improving.
4. $\eta_0 > \hat{\rho}_k$ and $m_k$ is $\kappa$-fully quadratic with confidence $1-\alpha_k$ ($x^k + s^k$ is not accepted and $\Delta_k$ is reduced). We call these iterations unsuccessful.
4.3 Convergence
4.3.1 Convergence to a First-order Stationary Point
The use of quadratic models $m_k$ might suggest convergence to a second-order stationary point. Such a proof would require a quadratic $\hat{m}_k$ as well, and since $\Delta_k > \hat{\Delta}_k$, this would require more points in $B(x^k + s^k; \hat{\Delta}_k)$ than in $B(x^k; \Delta_k)$. Since it is frequently the case that $f(x^k + s^k) > f(x^k)$ (even when $m_k$ is $\kappa$-f.q.w.c. $1-\alpha_k$), we find it wasteful to build a quadratic $\hat{m}_k$ around $x^k + s^k$. This is one of the motivations for $\hat{\kappa}$-fully linear models for $\hat{m}_k$; with these, we can still prove convergence to a first-order stationary point.
We first show that if $x^k$ is not a stationary point of $f$, then Algorithm 2 will exit with probability 1.
Theorem 4.13 Given $\alpha_k \in (0,1)$, if $f$ satisfies Assumption 4.2 and $\|\nabla f(x^k)\| \geq \frac{2}{\mu}\omega^{j-1}\Delta_k$, there is probability at least $1-\alpha_k$ that Algorithm 2 will correctly exit on each iterate $(i)$ after $(j)$ such that $\omega^{i-1} \leq \frac{1}{\mu\kappa_{eg}}$ (where $\mu$ and $\omega$ are declared in Algorithm 1).
Proof: Assume $\omega^{i-1} \leq \frac{1}{\mu\kappa_{eg}}$, $\Delta_k \leq 1$, and that Algorithm 2 cycles infinitely. After sufficiently many iterations of the criticality step, $\tilde{m}_k^{(i)}$ will be $\kappa$-fully quadratic with confidence $1-\alpha_k$ on $B(x^k; \omega^{i-1}\Delta_k)$. Therefore
\[
\begin{aligned}
\tilde{\varsigma}_k^{(i)}(x^k) \geq \left\|\tilde{g}_k^{(i)}\right\| &\geq \left\|\nabla f(x^k)\right\| - \left\|\nabla f(x^k) - \tilde{g}_k^{(i)}\right\| \\
&\underset{1-\alpha_k}{\geq} \frac{2}{\mu}\omega^{j-1}\Delta_k - \kappa_{eg}\left(\omega^{i-1}\Delta_k\right)^2 \qquad \text{by assumption and Definition 4.9} \\
&\geq \frac{2}{\mu}\omega^{i-1}\Delta_k - \frac{1}{\mu}\omega^{i-1}\Delta_k = \frac{1}{\mu}\omega^{i-1}\Delta_k \qquad \text{since } \Delta_k \leq 1 \text{ and } i > j,
\end{aligned}
\]
so that $\omega^{i-1}\Delta_k \leq \mu\,\tilde{\varsigma}_k^{(i)}(x^k)$ and the exit condition of Algorithm 2 is satisfied. So for each $(i)$ after $(j)$ such that $\omega^{i-1} \leq \frac{1}{\mu\kappa_{eg}}$, we have $1-\alpha_k$ confidence that Algorithm 2 will exit. $\blacksquare$
Since we require $\alpha_k \to 0$ as $\Delta_k \to 0$, for any $\alpha_k > 0$, eventually $\Delta_k$ will be small enough so that Algorithm 2 exits with probability at least $1-\alpha_k$. In other words, this theorem ensures that the algorithm exits with probability 1.
Lemma 4.14 If $f$ satisfies Assumption 4.5 and $m_k$ is $\kappa$-fully quadratic with confidence $1-\alpha_k$, there exists a constant $\kappa_{bhm} > 0$ such that
\[ \|H_k\| \underset{1-\alpha_k}{\leq} \kappa_{bhm} \]
for all $k$, where $H_k$ is the Hessian of $m_k$.
Proof:
\[
\begin{aligned}
\|H_k\| &\leq \left\|H_k - \nabla^2 f(x^k)\right\| + \left\|\nabla^2 f(x^k)\right\| \\
&\underset{1-\alpha_k}{\leq} \kappa_{eh}\Delta_k + \left\|\nabla^2 f(x^k)\right\| \qquad \text{by Definition 4.9} \\
&\leq \kappa_{eh}\Delta_k + \kappa_{bhf} \qquad \text{by Assumption 4.5} \\
&\leq \kappa_{eh}\Delta_{\max} + \kappa_{bhf} =: \kappa_{bhm}. \qquad \blacksquare
\end{aligned}
\]


The following lemma shows that if $x^k$ is not a stationary point of $f$ and $\Delta_k$ is small enough, then there is a high probability that a successful step will be taken.
Lemma 4.15 Let $f$ satisfy Assumption 4.5 and let the trust region subproblem solution satisfy Assumption 4.4. Let $\kappa = (\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, \nu_2^m)$ and $\hat{\kappa} = (\hat{\kappa}_{ef}, \hat{\kappa}_{eg}, \hat{\nu}_1^m)$. Let the constants $\kappa_{fcd}$, $\kappa_{bhm}$, $\kappa_{ef}$, $\hat{\kappa}_{ef}$, $\eta_1$ be as specified in Assumption 4.4, Lemma 4.14, Definition 4.9, Definition 4.11, and the beginning of Algorithm 1, respectively. If $m_k$ is $\kappa$-f.q.w.c. $1-\alpha_k$ on $B(x^k; \Delta_k)$, $\hat{m}_k$ is $\hat{\kappa}$-f.l.w.c. $1-\alpha_k$ on $B(x^k + s^k; \hat{\Delta}_k)$, and
\[ \Delta_k \leq \min\left\{ \frac{\|g_k\|}{\kappa_{bhm}},\ \frac{\kappa_{fcd}\|g_k\|(1-\eta_1)}{2\kappa_{ef}\Delta_{\max} + 2\hat{\kappa}_{ef}} \right\}, \tag{4.4} \]
then we have confidence $1-3\alpha_k$ that $\hat{\rho}_k \geq \eta_1$ on the $k$th iteration.
Proof: Using Lemma 4.14, the fact that $x^k + s^k$ is no worse than the Cauchy step (Assumption 4.4), and $\Delta_k \leq \frac{\|g_k\|}{\kappa_{bhm}}$ from (4.4) yields
\[ m_k(x^k) - m_k(x^k + s^k) \underset{1-\alpha_k}{\geq} \frac{\kappa_{fcd}}{2}\,\|g_k\| \min\left\{ \frac{\|g_k\|}{\kappa_{bhm}},\ \Delta_k \right\} = \frac{\kappa_{fcd}}{2}\,\|g_k\|\,\Delta_k. \tag{4.5} \]


Using this,
\[
\begin{aligned}
\left| \hat{\rho}_k - 1 \right| &= \left| \frac{m_k(x^k) - \hat{m}_k(x^k+s^k)}{m_k(x^k) - m_k(x^k+s^k)} - \frac{m_k(x^k) - m_k(x^k+s^k)}{m_k(x^k) - m_k(x^k+s^k)} \right| = \left| \frac{m_k(x^k+s^k) - \hat{m}_k(x^k+s^k)}{m_k(x^k) - m_k(x^k+s^k)} \right| \\
&\leq \left| \frac{m_k(x^k+s^k) - f(x^k+s^k)}{m_k(x^k) - m_k(x^k+s^k)} \right| + \left| \frac{f(x^k+s^k) - \hat{m}_k(x^k+s^k)}{m_k(x^k) - m_k(x^k+s^k)} \right| \\
&\underset{1-\alpha_k}{\leq} \frac{\kappa_{ef}\Delta_k^3}{\left| m_k(x^k) - m_k(x^k+s^k) \right|} + \left| \frac{f(x^k+s^k) - \hat{m}_k(x^k+s^k)}{m_k(x^k) - m_k(x^k+s^k)} \right| \qquad \text{by Definition 4.9} \\
&\underset{1-\alpha_k}{\leq} \frac{\kappa_{ef}\Delta_k^3}{\left| m_k(x^k) - m_k(x^k+s^k) \right|} + \frac{\hat{\kappa}_{ef}\hat{\Delta}_k^2}{\left| m_k(x^k) - m_k(x^k+s^k) \right|} \qquad \text{by Definition 4.11} \\
&\underset{1-\alpha_k}{\leq} \frac{2\kappa_{ef}\Delta_k^3 + 2\hat{\kappa}_{ef}\Delta_k^2}{\kappa_{fcd}\|g_k\|\Delta_k} \qquad \text{by (4.5), since } \Delta_k \geq \hat{\Delta}_k \\
&\leq \frac{2\kappa_{ef}\Delta_{\max} + 2\hat{\kappa}_{ef}}{\kappa_{fcd}\|g_k\|}\,\Delta_k \\
&\leq 1 - \eta_1 \qquad \text{by (4.4)}.
\end{aligned}
\]
Since we have confidence $1-\alpha_k$ that the second, third, and fourth inequalities each hold, we have confidence $1-3\alpha_k$ that all three hold simultaneously. $\blacksquare$
Lemma 4.16 For all $k$, assume the trust region subproblem solution satisfies Assumption 4.4. Let $f$ satisfy Assumption 4.5. If there exists a constant $\kappa_1$ such that $\|g_k\| \geq \kappa_1$ for all $k$, then there exists another constant $\kappa_2$ such that, for every iteration $k$ where
\[ \Delta_k \leq \kappa_2, \]
we have confidence $1-3\alpha_k$ that iteration $k$ will be successful and $\Delta_k$ will increase if $m_k$ is $\kappa$-f.q.w.c. $1-\alpha_k$.
Proof: This proof is similar to [12, Lemma 10.7]. Whether Algorithm 2 has been called or not,
\[ \Delta_k \geq \min\left\{ \mu\,\varsigma_k(x^k),\ \Delta_{k-1} \right\} \geq \min\left\{ \mu\|g_k\|,\ \Delta_{k-1} \right\} \geq \min\left\{ \mu\kappa_1,\ \Delta_{k-1} \right\}. \]


Since $\|g_k\| \geq \kappa_1$ for all $k$, Lemma 4.15 implies that whenever $\Delta_k$ is less than
\[ \kappa_3 = \min\left\{ \frac{\kappa_1}{\kappa_{bhm}},\ \frac{\kappa_{fcd}\kappa_1(1-\eta_1)}{2\kappa_{ef}\Delta_{\max} + 2\hat{\kappa}_{ef}} \right\}, \]
we have confidence $1-3\alpha_k$ that iteration $k$ will be successful ($\Delta_{k+1} = \gamma_{inc}\Delta_k$) or model improving ($\Delta_{k+1} = \Delta_k$). In either case $\Delta_{k+1} \geq \Delta_k$, so we have confidence $1-3\alpha_k$ that $\Delta_{k+1} \geq \Delta_k$ will hold whenever $\Delta_k \leq \min\{\mu\kappa_1, \gamma\kappa_3\} = \kappa_2$. $\blacksquare$
Theorem 4.17 Let Assumptions 4.1-4.5 be satisfied. If the number of successful iterations is finite, then
\[ \liminf_{k\to\infty} \|\nabla f(x^k)\| = 0 \]
with probability 1.
Proof: Consider iterations after the last successful iteration, denoted $k_{last}$. For every $k > k_{last}$, the iteration is not successful ($\hat{\rho}_k < \eta_1$) and the model improvement algorithm may be called. It takes a finite number of model improvement steps for the model to become $\kappa$-fully quadratic with confidence $1-\alpha_k$ on a given $B(x^k; \Delta_k)$; hence there are an infinite number of iterations that are either acceptable or unsuccessful. Given $\Delta_k$, we can guarantee that the trust region radius must decrease by at least one multiple of $\gamma \in (0,1)$ after at most $\frac{C}{\Delta_k^6}$ iterations (for a fixed constant $C$). For any $\epsilon > 0$, there exists an integer $N$ such that $\gamma^N\Delta_{k_{last}} < \epsilon$. After
\[ \frac{C}{(\Delta_{k_{last}})^6} + \frac{C}{(\gamma\Delta_{k_{last}})^6} + \cdots + \frac{C}{(\gamma^{N-1}\Delta_{k_{last}})^6} \leq \frac{C}{(\gamma^{N}\Delta_{k_{last}})^6}\cdot\frac{\gamma^6}{1-\gamma^6} \]
iterations, the trust region radius will be less than $\epsilon$. Therefore, $\lim_{k\to\infty}\Delta_k = 0$, which implies $\alpha_k \to 0$. Therefore, there exists an infinite sequence of iterates $\{k_i\}$ where $m_{k_i}$ is $\kappa$-f.q.w.c. $1-\alpha_{k_i}$ and
\[ \left\|\nabla f(x^{k_i})\right\| \leq \left\|\nabla f(x^{k_i}) - g_{k_i}\right\| + \left\|g_{k_i}\right\| \underset{1-\alpha_{k_i}}{\leq} \kappa_{eg}\Delta_{k_i}^2 + \left\|g_{k_i}\right\|. \]
The second term converges to zero with probability 1. To see this, assume $\|g_{k_i}\|$ is bounded away from zero; we can then derive a contradiction using Lemma 4.15 and


the fact that $\lim_{k\to\infty}\Delta_k = 0$. Since $\Delta_{k_i} \to 0$ and $\alpha_{k_i} \to 0$, for $k_i$ sufficiently large we have $\Delta_{k_i} \leq \kappa_2$, so there is probability $1-3\alpha_{k_i}$ that iteration $k_i$ will be successful. Thus, for any $\alpha > 0$ and $K > 0$, there exists $k_i > K$ such that the probability of step $k_i$ being successful is greater than $1-\alpha$. Therefore, with probability 1, there are infinitely many successful iterates, contradicting the definition of $k_{last}$. $\blacksquare$
4.3.2 Infinitely Many Successful Steps
The results that follow outline parts of a proof for the case when Algorithm 1 generates infinitely many successful iterates. While the previous theorem proves $\Delta_k \to 0$, that proof is not valid when there are infinitely many successful iterates. We have made considerable effort to prove $\Delta_k \to 0$ in this case, but have been unable to do so. To progress, we assume it for the time being.
Assumption 4.18
\[ \lim_{k\to\infty} \Delta_k = 0. \]
It should be noted that it may be possible to ensure Algorithm 1 satisfies this assumption, perhaps by slowly decreasing $\Delta_{\max}$. The details would need to be worked out, but this assumption is not as strong as it might appear.
Conjecture 4.19 If Assumption 4.18 and Assumption 4.5 hold and the trust region subproblem solution satisfies Assumption 4.4 for all $k$, then
\[ \liminf_{k\to\infty} \|g_k\| = 0 \]
with probability 1.
Discussion: If $\|g_k\| \geq \kappa_1$ for some $\kappa_1 > 0$, by Lemma 4.16 there exists a $\kappa_2$ such that whenever $\Delta_k \leq \kappa_2$, we will have $1-3\alpha_k$ confidence of increasing the trust region. Using Assumption 4.18 and the fact that $\alpha_k = \min\{\Delta_k, 0.05\}$, we will increase the trust region with probability approaching 1 as $k$ gets large. This would appear to contradict $\Delta_k \to 0$, but to prove almost sure convergence (assuming each iteration is independent) we must show that the product of the $1-\alpha_k$ terms approaches 1. And even the assumption that each iteration is independent is difficult to justify, as many of the points used to build $m_k$ will be used to build $m_{k+1}$. If the events are dependent, then we must consider conditional probabilities, such as the probability of one step being a success given that the last step was not.
Conjecture 4.20 If Assumptions 4.1-4.5 and Assumption 4.18 hold, then for any subsequence $\{k_i\}$ such that
\[ \lim_{i\to\infty} \|g_{k_i}\| = 0, \tag{4.6} \]
it holds with probability 1 that
\[ \lim_{i\to\infty} \|\nabla f(x^{k_i})\| = 0. \]
Discussion: By (4.6), for $i$ sufficiently large, $\|g_{k_i}\| \leq \epsilon_c$. Thus, by Theorem 4.13, Algorithm 2 ensures that the model $m_{k_i}$ is $\kappa$-f.q.w.c. $1-\alpha_{k_i}$ on the ball $B(x^{k_i}; \Delta_{k_i})$ with $\Delta_{k_i} \leq \mu\|g_{k_i}\|$ for all $i$ sufficiently large (provided $\|\nabla f(x^{k_i})\| \neq 0$). By Definition 4.9,
\[ \left\|\nabla f(x^{k_i}) - g_{k_i}\right\| \underset{1-\alpha_{k_i}}{\leq} \kappa_{eg}\Delta_{k_i} \leq \kappa_{eg}\mu\|g_{k_i}\|. \]
Therefore,
\[ \left\|\nabla f(x^{k_i})\right\| \leq \left\|\nabla f(x^{k_i}) - g_{k_i}\right\| + \left\|g_{k_i}\right\| \underset{1-\alpha_{k_i}}{\leq} \left(\kappa_{eg}\mu + 1\right)\|g_{k_i}\|. \]
Since $\|g_{k_i}\| \to 0$ with probability 1, so does $\|\nabla f(x^{k_i})\|$.
Conjecture 4.21 If Assumptions 4.1-4.5 and Assumption 4.18 hold, then
\[ \liminf_{k\to\infty} \|\nabla f(x^k)\| = 0 \]
with probability 1.
Discussion: By Conjecture 4.19, there must exist a sequence $\{k_i\}$ such that $\lim_{i\to\infty}\|g_{k_i}\| = 0$. By Conjecture 4.20, this same sequence $\{k_i\}$ has $\lim_{i\to\infty}\|\nabla f(x^{k_i})\| = 0$, which proves the result.
4.4 Computational Example
In this section, we highlight some of the advantages of using Algorithm 1 over a traditional trust region method (which assumes deterministic function evaluations). While both algorithms have much in common, the slight differences become significant in the presence of stochastic noise. For example, the deterministic algorithms are susceptible to negative noise, as we see in Figure 4.1. In that figure, the solid line is the true function $f$ which we want to minimize, and the dashed black lines show the 95% confidence interval of the noise. The black squares mark the noisy function values which determine the quadratic trust region model $m_k$, and the trust region radius $\Delta_k$ is represented by the dashed lines. The trust region center $x^k$ has a red box around it.
4.4.1 Deterministic Trust Region Method
Figure 4.1 shows a traditional trust region method after moving to a new trust region center at $x^k = 2.5$. Each image shows the progress of the algorithm, and we describe what occurred in the previous iterate to yield the present situation:
Figure 4.1, top left By chance, the realization of $\tilde{f}(x^k + s^k)$ was much less than $f$ at any point near $x^k + s^k$. It is now the new trust region center.
Figure 4.1, top right The minimum of the quadratic model was not accepted since
\[ \rho_k = \frac{\tilde{f}(x^k) - \tilde{f}(x^k + s^k)}{m_k(x^k) - m_k(x^k + s^k)} < \eta_0. \]
The trust region radius has also been shrunk since the sample set is strongly $\Lambda$-poised.
Figure 4.1, bottom left A point outside of the trust region radius has been removed. Since $\rho_k < 0$, the trust region radius will shrink again.
Figure 4.1, bottom right Another point outside of the trust region has been removed, and a new model has been built.
Figure 4.1: Several iterations of a traditional trust region method assuming deterministic function evaluations. The trust region center is never moved.
The deterministic algorithm will accept a new trust region center only when $\rho_k$ is sufficiently positive (i.e., if $\tilde{f}(x^k + s^k)$ is also much less than $\tilde{f}(x^k)$). If this does not happen, the algorithm will not find a successful step and the trust region radius will be repeatedly decreased. Since $\tilde{f}(x^k)$ is never re-evaluated, it is likely that the algorithm will terminate without ever taking a further step.
4.4.2 Stochastic Trust Region Method
In contrast, using $\hat{\rho}_k$ introduced in Section 4.2,
\[ \hat{\rho}_k = \frac{m_k(x^k) - \hat{m}_k(x^k + s^k)}{m_k(x^k) - m_k(x^k + s^k)}, \]
and increasing the number of points in the trust region as $\Delta_k$ decreases allows the algorithm to move off of a trust region center with negative noise, as seen in Figure 4.2.
Figure 4.2: Several iterations of a traditional trust region method assuming stochastic function evaluations.
Figure 4.2, top left Again, the realization of $\tilde{f}(x^k + s^k)$ was much less than $f$ at any point near $x^k + s^k$. It is now the new trust region center.
Figure 4.2, top right The minimum of the quadratic model was not accepted since $\hat{\rho}_k < 0$, but the trust region radius is not decreased. Though the sample set is strongly $\Lambda$-poised, there are not enough points to ensure the model is $\kappa$-f.q.w.c. $1-\alpha_k$.
Figure 4.2, bottom left More points have been added to the sample set and the model has been reconstructed.
Figure 4.2, bottom right The minimum of the quadratic model is accepted since $\hat{\rho}_k \geq \eta_1$ (even though $\tilde{f}(x^k + s^k) > \tilde{f}(x^k)$).
Using the model value at $x^k$ instead of $\tilde{f}(x^k)$ in the calculation of $\hat{\rho}_k$ allows the estimate of $f(x^k)$ to adjust without wastefully re-evaluating $\tilde{f}(x^k)$. In this fashion, Algorithm 1 can avoid stagnating at points with negative noise.
4.5 Conclusion
In this chapter we presented an algorithm using quadratic trust region models $m_k$ to minimize a function $f$ which cannot be evaluated exactly. Even though the algorithm only has access to noise corrupted function evaluations $\tilde{f}$, we proved almost sure convergence of a subsequence of iterates to a first-order stationary point of $f$ (when the number of successful steps is finite). We also outlined a proof for the case when the number of successful steps is infinite. These results were accomplished not by repeatedly sampling $\tilde{f}$ at points of interest, but rather by constructing models $\hat{m}_k$ which are increasingly accurate approximations of $f$ near $x^k + s^k$. Since it is often the case that $x^k + s^k$ is the candidate for the new trust region center, this information is immediately useful in constructing $m_{k+1}$. We then highlighted how this algorithm remedies a common problem with using traditional trust region methods on functions with stochastic noise.
5. Non-intrusive Termination of Noisy Optimization
5.1 Introduction and Motivation
The optimization of real-world, computationally expensive functions invariably leads to the difficult question of when an optimization procedure should be terminated. Algorithm developers and the mathematical optimization community at large typically assume that the optimization is terminated when either a measure of criticality (gradient norm, mesh size, etc.) is satisfied or a user’s computational budget (number of evaluations, wall clock time, etc.) is exhausted.
For a large class of problems, however, the user may not have a well-defined computational budget and instead demand a termination test $t$ solving
\[ \min_t \ \text{Computational expense}(t) \quad \text{s.t.} \quad \text{Acceptable accuracy of the solution}(t), \tag{5.1} \]
with the criticality measure of the solver employed typically chosen with the accuracy constraint in mind. Examples of such accuracy-based criticality tests are discussed in detail by Gill, Murray, and Wright [19, Section 8.2.3].
The main difficulties arising from this approach are a result of (5.1) possibly being poorly formulated. The computational expense could be unbounded because an a priori user-defined accuracy is unrealistic for the problem/solver pair. Furthermore, a user may have difficulty translating the criticality measures provided by a solver, which are generally based on assumptions of smoothness and infinite-precision calculations, into practical metrics on the solution accuracy.
In Figure 5.1 we illustrate the challenges in this area with an example from nuclear physics, similar to the minimization problems considered in [37]. Each of the function values shown is obtained from running a deterministic simulation for one minute on a 640-core cluster. Stopping the optimization sooner than 200 function evaluations would not only return a solution faster but would also free the cluster for other applications and/or result in a savings in energy, an increasingly crucial factor in high-performance computing.
Figure 5.1: Part of a noisy trajectory of function values for an expensive nuclear physics problem. After more significant decreases in the first 70 evaluations, progress begins to stall.
If we assume that the optimization (partially) shown in Figure 5.1 has not been terminated by a solver’s criticality measures or a user’s computational budget, the question is then whether termination should occur for other reasons. For example, if only the first three digits of the simulation output were computed stably, one may want to terminate the optimization sooner than if computational noise corrupted only the eighth digit of the output. Alternatively, the behavior shown could mean the solver in question has stagnated (because of noise, errors in the simulation, a limitation of the solver, etc.), and hence examining the solution and/or restarting the optimization could be a more effective use of the remaining computational budget. Wright [65] refers to this stalled progress as perseveration and notes that there is “no fully general way to define ‘insufficient progress.’ ” Even so, it may be advantageous to use knowledge of the uncertainty or accuracy of a given function evaluation when making such a decision.
In the remainder of this chapter we explore these issues and propose termination criteria that can be easily incorporated on top of a user's solver of choice. In [18], Fletcher summarizes the challenges at hand (in the case of round-off errors alone):
Some consideration has to be given to the effects of round-off near the solution, and to terminate when it is judged that these effects are preventing further progress. It is difficult to be certain what strategy is best in this respect.
Moreover, Gill, Murray, and Wright [19] stress that
no set of termination criteria is suitable for all optimization problems and all methods.
This sentiment is shared by Powell [47] who says
it is believed that it is impossible to choose such a convergence criterion which is effective for the most general function ... so a compromise has to be made between stopping the iterative procedure too soon and calculating f an unnecessarily large number of times.
Consequently, we will consider tests that allow for the use of estimates of the noise particular to a problem. Furthermore, our criteria are not intended as substitutes for a computational budget or a solver’s built-in criticality tests, which we consider to be important safeguards. Likewise, the termination problem can be viewed as a real-time control problem depending on complete knowledge of the solver’s decisions, but we resist this urge for purposes of portability and applicability.
We provide background on previous work and introduce notation in Section 5.2. The families of stopping tests we propose in Section 5.3 do not provide guarantees on the quality of the solution, although doing so may be the role of a solver's built-in criteria. Instead, the proposed tests are parameterized in order to quantify a user's trade-off between the benefit of achieving additional decrease and the cost of additional evaluations, while requiring a minimal amount of information from the solver. Equally important are our results in Section 5.4 comparing the quality of these families of stopping tests on a collection of local optimization algorithms. We first consider all solvers as a single routine, later validating this approach by demonstrating equal performance for the best tests on individual algorithms. While our results can be incorporated in a local subroutine of any global search algorithm, the tests proposed in Section 5.3 are unable to distinguish between exploration and refinement phases in their current form. We summarize our results in Section 5.5 and provide recommendations when implementing these tests.
5.2 Background
Our discussion and analysis are limited to optimization methods that do not explicitly require derivative information. However, other algorithms could readily employ the tests proposed here in addition to their derivative-based stopping criteria. While our work can be further extended to incorporate noisy gradient information, the derivatives of noisy functions are typically even noisier than the function.
Derivative-free optimization methods are often favored for their perceived ability
to handle noisy functions. Although asymptotic convergence of these methods is
generally proved assuming a smooth function, adjustments are frequently made to
accommodate noise. In the case of stochastic functions, where noise results from a
random distribution with Var (f(x)) > 0, replications of function evaluations can be
used to modify existing methods (e.g., [14] modifying UOBYQA [48], [15, 1] modifying
DIRECT [30], and [61] modifying Nelder-Mead (see, e.g., [12])). However, stopping
criteria for these methods involve limited knowledge of the noise and indicate the
wide variety of stopping tests used in practice. In [1], optimization is stopped when
adjacent points are within $10^{-4}$ of each other, whereas [15] allows stopping when
the best function value has not been improved after some number of consecutive
iterations. To limit the number of stochastic replications, the authors of [14] and [61] adjust the maximum number of allowed replications at a particular point based on the variance of the noise.
Deterministic noise (that is, noise that results from a deterministic process, such as finite-precision arithmetic, iterative methods, and adaptive procedures) is far less understood than its stochastic counterpart [42]. Not surprisingly, even less knowledge of the magnitude of noise is used for problems with deterministic objectives. When low-amplitude noise is present, Kelley [33] proposes a restart technique for Nelder-Mead but terminates when sufficiently small differences exist in the simplicial function values, independent of the magnitude of the noise. Implicit filtering [32] has numerous termination possibilities (small function value differences on a stencil, a small change in the best function value from one iteration to the next, etc.) but none that are explicitly related by the author to the magnitude of the noise. A similar implicit relationship to noise can be seen in [24], where treed Gaussian process models for optimization are terminated when a maximum improvement statistic is sufficiently small. The authors of SNOBFIT [29] suggest stopping when the best point has not changed for a number of consecutive calls to the main SNOBFIT algorithm.
Our work more closely follows that of Gill et al. [19], where Section 8.2 is devoted to properties of the computed solution. The authors there recommend terminating Nelder-Mead-like algorithms when the maximum difference between function values on the simplex is less than a demanded accuracy weighted by the best function value on the simplex.
The only other direct relationships between stopping criteria and a measure of noise that we are aware of are in [42, Section 9] and [25]. In [42], a stochastic model of the noise is used to estimate the noise level of a function value $f(x)$ by difference table-based approximations of the standard deviation $(\mathrm{Var}(f(x)))^{1/2}$. Results are validated for deterministic $f$. As an example application, the authors terminate a Nelder-Mead method on an ODE-based problem when consecutive decreases are less than a factor of the noise level. The authors of [25] perturb bound-constrained problems so the incumbent iterate is the exact solution to this new problem. An algorithm can then be terminated when the size of this perturbation first decreases below the error in the problem. Natural extensions to gradient/derivative-based tests are also enabled by the recent work in [43], where near-optimal finite difference estimates are provided as a function of the noise level.
Before proceeding, we define the notation employed throughout. We let $\mathbb{R}_+$ denote the nonnegative reals and $\mathbb{N}$ denote the natural numbers. We let $\{x_1, \cdots, x_m\} \subset \mathbb{R}^n$ and $\{f_1, \cdots, f_m\} \subset \mathbb{R}$ be a sequence of points and corresponding function values produced by a local minimization solver, and we collect the data from the first $i$ evaluations in $\mathcal{F}_i = \{(x_1, f_1), \ldots, (x_i, f_i)\}$. The best function value in the first $i$ evaluations is given by $f_i^* = \min_{j \leq i}\{f_j\}$, with $x_i^*$ denoting the point corresponding to $f_i^*$. Accordingly, the sequence $\{f_i^*\}$ is nonincreasing. Unless otherwise stated, $\|\cdot\|$ denotes the standard Euclidean distance.
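As a small illustration of this bookkeeping, the following Python fragment computes the best-so-far values $f_i^*$ and points $x_i^*$ from an evaluation history; it assumes nothing beyond the notation above, and best_so_far is a hypothetical helper name used only here.

import numpy as np

def best_so_far(points, fvals):
    """Return the running best values f_i^* and corresponding points x_i^*
    from a solver's evaluation history (an illustration of the notation)."""
    f_star, x_star = [], []
    best, best_x = np.inf, None
    for x, f in zip(points, fvals):
        if f < best:
            best, best_x = f, x
        f_star.append(best)
        x_star.append(best_x)
    return f_star, x_star

# Example history: the running minimum {f_i^*} is nonincreasing by construction
pts = [np.array([1.0, 2.0]), np.array([0.5, 1.5]), np.array([0.8, 1.9])]
vals = [3.2, 2.7, 2.9]
print(best_so_far(pts, vals)[0])   # [3.2, 2.7, 2.7]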
We let $\epsilon_i^r$ be an estimate of the relative noise at $f_i$ (i.e., the noise at $x_i$ scaled by the magnitude of $f(x_i)$). This estimate may come from experience, numerical analysis of the underlying processes in computing $f_i$, or appropriate scaling (by $1/|f_i|$) of the noise-level estimates from the method proposed in [42]. In the case of stochastic functions with nonzero mean at $x_i$, $\epsilon_i^r$ is the standard deviation of $f(x_i)$ relative to the expected value $\mathrm{E}[|f(x_i)|]$.
Favorable properties of a termination test include scale and shift invariance, so that the test would terminate after the same number of evaluations for any affine transformation of the objective function. Specifically, a test is scale invariant in $f$ if it terminates optimization runs defined by $\mathcal{F}_i$ and $\alpha\mathcal{F}_i = \{(x_1, \alpha f_1), \ldots, (x_i, \alpha f_i)\}$ at an identical evaluation number for any $\alpha > 0$. Similarly, a test is shift invariant in $f$ if it terminates $\mathcal{F}_i$ and $\mathcal{F}_i + \beta = \{(x_1, f_1 + \beta), \ldots, (x_i, f_i + \beta)\}$ after an identical evaluation number for any $\beta \in \mathbb{R}$.
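These invariance properties are easy to probe numerically. The Python sketch below defines a generic illustrative stopping rule (not one of the named tests proposed later in this chapter) and reports the evaluation number at which it first fires on a history, its scaling, and its shift; the helper names, parameters, and sample history are made up for the example.

def flat_history_test(fvals, tau, n_recent=5):
    """Illustrative rule: stop when the range of the last n_recent function
    values falls below tau times the magnitude of the most recent value."""
    recent = fvals[-n_recent:]
    return len(fvals) >= n_recent and (max(recent) - min(recent)) <= tau * abs(fvals[-1])

def first_termination(test, fvals):
    """Evaluation number at which the test first signals termination (None if never)."""
    for i in range(1, len(fvals) + 1):
        if test(fvals[:i]):
            return i
    return None

hist = [10.0, 6.0, 5.50, 5.49, 5.485, 5.483, 5.482]
test = lambda fv: flat_history_test(fv, tau=1e-2)
print(first_termination(test, hist))                          # fires at evaluation 7
print(first_termination(test, [2.0 * f for f in hist]))       # same index: scale invariant
print(first_termination(test, [f + 100.0 for f in hist]))     # may differ: not shift invariant

The rule is scale invariant because both sides of its inequality scale with $\alpha$, but it is not shift invariant because the $|f_i|$ weighting changes under $f \to f + \beta$.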
Full Text

PAGE 1

DERIVATIVE-FREEOPTIMIZATIONOFNOISYFUNCTIONS by JereyM.Larson B.A.,CarrollCollege,2005 M.S.,UniversityofColoradoDenver,2008 Athesissubmittedtothe FacultyoftheGraduateSchoolofthe UniversityofColoradoinpartialfulllment oftherequirementsforthedegreeof DoctorofPhilosophy AppliedMathematics 2012

PAGE 2

ThisthesisfortheDoctorofPhilosophydegreeby JereyM.Larson hasbeenapproved by StephenBillups,AdvisorandChair AlexanderEngau BurtSimon MichaelJacobson FredGlover Date ii

PAGE 3

Larson,JereyM.Ph.D.,AppliedMathematics Derivative-freeOptimizationofNoisyFunctions ThesisdirectedbyAssociateProfessorStephenBillups ABSTRACT Derivative-freeoptimizationDFOproblemswithnoisyfunctionsareincreasinglyprevalent.Inthisthesis,weproposetwoalgorithmsfornoisyDFO,aswellas terminationcriteriaforgeneralDFOalgorithms.Bothproposedalgorithmsarebased onthemethodsofConn,Scheinberg,andVicente[9]whichuseregressionmodelsin atrustregionframework.Therstalgorithmutilizesweightedregressiontoobtain moreaccuratemodelfunctionsateachtrustregioniteration.Aweightingscheme isproposedwhichsimultaneouslyhandlesdieringlevelsofuncertaintyinfunction evaluationsanderrorsinducedbypoormodeldelity.Toproveconvergenceofthis algorithm,weextendthetheoryof-poisednessandstrong-poisednesstoweighted regression.Thesecondalgorithmmodiestherstforfunctionswithstochasticnoise. Weproveouralgorithmgeneratesasubsequenceofiterateswhichconvergealmost surelytoarst-orderstationarypoint,providedthenumberofsuccessfulstepsis niteandthenoiseforeachfunctionevaluationisindependentandidenticallynormallydistributed.Lastly,weaddressterminationofDFOalgorithmsonfunctions withnoisecorruptedevaluations.Ifthefunctioniscomputationallyexpensive,stoppingwellbeforetraditionalcriteriae.g.,afterabudgetoffunctionevaluationsis exhaustedaresatisedcanyieldsignicantsavings.Earlyterminationisespecially desirablewhenthefunctionbeingoptimizedisnoisy,andthesolverproceedsforan extendedperiodwhileonlyseeingchangeswhichareinsignicantrelativetothenoise inthefunction.Wedeveloptechniquesforcomparingthequalityofterminationtests, proposefamiliesofteststobeusedongeneralDFOalgorithms,andthencompare iii

PAGE 4

thetestsintermsofbothaccuracyandeciency. Theformandcontentofthisabstractareapproved.Irecommenditspublication. Approved:StephenBillups iv

PAGE 5

ACKNOWLEDGMENT Iwouldliketothankmyadvisor,SteveBillups,forhisyearsofresearchassistance andadvice.Hisguidancewasinstrumentalinobtainingtheresultsinthisthesis.I wouldalsoliketothankPeterGrafandStefanWildfortheirassistanceinresearching andwritingpartsofthisthesis.Theresearchinthisthesiswaspartiallysupportedby NationalScienceFoundationGrantGK-12-0742434andpartiallysupportedbythe OceofAdvancedScienticComputingResearch,OceofScience,U.S.Department ofEnergy,underContractDE-AC02-06CH11357.Lastly,I'dliketothankmywife, Jessica,foryearsofpatiencewhiletheresearchforthisthesiswasperformed. v

PAGE 6

TABLEOFCONTENTS Figures.......................................ix Tables........................................xi Chapter 1.Introduction...................................1 1.1ReviewofMethods...........................3 1.2Outline..................................5 1.3Notation.................................8 2.Background...................................10 2.1Model-basedTrustRegionMethods..................10 2.1.1ModelConstructionWithoutDerivatives............11 2.1.2CSV2-framework.........................11 2.1.3Poisedness.............................18 2.2PerformanceProles...........................23 2.3ProbabilisticConvergence........................24 3.Derivative-freeOptimizationofExpensiveFunctionswithComputational ErrorUsingWeightedRegression.......................26 3.1Introduction...............................26 3.2ModelConstruction...........................27 3.3ErrorAnalysisandtheGeometryoftheSampleSet.........29 3.3.1WeightedRegressionLagrangePolynomials..........29 3.3.2ErrorAnalysis..........................30 3.3.3-poisednessintheWeightedRegressionSense.......35 3.4ModelImprovementAlgorithm.....................39 3.5ComputationalResults.........................50 3.5.1UsingErrorBoundstoChooseWeights............50 3.5.2BenchmarkPerformance.....................52 vi

PAGE 7

3.6SummaryandConclusions.......................57 4.StochasticDerivative-freeOptimizationusingaTrustRegionFramework.59 4.1PreliminaryResultsandDenitions..................62 4.1.1Modelswhichare -fullyQuadraticwithCondence1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k .65 4.1.2Modelswhichare^ -fullyLinearwithCondence1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k ...69 4.2StochasticOptimizationAlgorithm...................70 4.3Convergence...............................73 4.3.1ConvergencetoaFirst-orderStationaryPoint.........73 4.3.2InnitelyManySuccessfulSteps................78 4.4ComputationalExample........................80 4.4.1DeterministicTrustRegionMethod...............80 4.4.2StochasticTrustRegionMethod................81 4.5Conclusion................................83 5.Non-intrusiveTerminationofNoisyOptimization..............84 5.1IntroductionandMotivation......................84 5.2Background...............................87 5.3StoppingTests..............................90 5.3.1 f i 0 Test..............................92 5.3.2Max-Dierencef Test......................92 5.3.3Max-Distancex Test.......................93 5.3.4Max-Distancex i Test......................93 5.3.5Max-BudgetTest.........................94 5.3.6TestsBasedonEstimatesoftheNoise.............94 5.3.7RelationshiptoLossFunctions.................96 5.4NumericalExperiments.........................98 5.4.1AccuracyProlesforthe 1 Family...............100 5.4.2PerformanceProlesforthe 1 Family.............102 vii

PAGE 8

5.4.3AccuracyandPerformancePlotsforthe 2 Family......104 5.4.4Across-familyComparisons...................106 5.4.5DeterministicNoise.......................107 5.4.6ValidationforIndividualSolvers................109 5.5Discussion................................110 6.ConcludingRemarks..............................113 References ......................................119 viii

PAGE 9

FIGURES Figure 2.1Anexampleofaperformanceprole....................24 3.1Performanceleftanddatarightproles:Interpolationvs.regression vs.weightedregression............................55 3.2Performanceleftanddatarightproles:weightedregressionvs. NEWUOAvs.DFOProblemswithStochasticNoise..........56 4.1Severaliterationsofatraditionaltrustregionmethodassumingdeterministicfunctionevaluations.Thetrustregioncenterisnevermoved.....81 4.2Severaliterationsofatraditionaltrustregionmethodassumingstochastic functionevaluations..............................82 5.1Partofanoisytrajectoryoffunctionvaluesforanexpensivenuclear physicsproblem.Aftermoresignicantdecreasesintherst70evaluations,progressbeginstostall........................85 5.2Firsttermsin 1 top,with =100and 2 bottom,with =10on alog 10 scalewhenminimizinga10-dimensionalconvexquadraticwith stochasticrelativenoiseofdierentmagnitudes.Theasymptotesofthe quantitiesshowntendtobeseparatedbythedierencesinmagnitudesof thenoise....................................96 5.3Numberofevaluations i foraterminationtestbasedon.3withxed F i and ,butusinga parameterizedby c .Theplotsshowremarkably similarbehaviortothenumberofevaluationsthatminimize L ;c in.8.98 5.4Accuracyprolesformembersofthe 1 familyonproblems.9with twodierentmagnitudesofknownstochasticrelativenoise .Inthe topplots, isheldxedandtheshownmembershavedierent values. Inthebottomplots, isheldxedandtheshownmembershavedierent values.....................................101 ix

PAGE 10

5.5Performanceprolesforthemostaccurate 1 testsonproblems.9with twodierentmagnitudesofknownstochasticrelativenoise .Notethat the -axishasbeentruncatedforeachplot; 5 eventuallyterminatesall oftheproblemsandthushasaprolethatwillreachthevalue1;allother testschangebylessthan.01.........................103 5.6Accuracytopandperformancebottomprolesforthe 2 familyon problems.9withtwodierentmagnitudesofstochasticrelativenoise as and arevaried............................105 5.7Accuracytopandperformancebottomprolesforthebesttestson problems.9withtwodierentmagnitudesofstochasticrelativenoise . Thehorizontalaxesontheperformanceprolesaretruncatedforclarity; 5 eventuallyachievesavalueof1;allothertestschangebylessthan.03.106 5.8Accuracytopandperformancebottomprolesforthebesttestson problems.9withtwodierentmagnitudesofdeterministicnoise.The horizontalaxesontheperformanceprolesaretruncatedforclarity; 5 eventuallyachievesavalueof1;allothertestschangebylessthan.03..108 5.9Performanceprolesformoreconservativetestsonproblems.9with twodierentmagnitudesofdeterministicnoise.Thehorizontalaxeson theperformanceprolesaretruncatedforclarity; 5 eventuallyachieves avalueof1;allothertestschangebylessthan.03.............109 5.10Accuracyprolesfortheindividualalgorithmsontherecommendedtests.110 x

PAGE 11

TABLES Table 5.1Recommendationsforterminationtestsfornoisyoptimization.......111 xi

PAGE 12

1.Introduction Traditionalunconstrainedoptimizationisinherentlytiedtoderivatives;thenecessaryconditionsforarst-orderminimumarecharacterizedbythederivativebeing equaltozero.Nevertheless,thereexistsavarietyoffunctionswhichmustbeminimizedwhenderivativesareunavailable.Forexample,consideranengineerinalab whowantstomaximizethestrengthofametalbarbyadjustingvariousfactors ofproduction.Afterthebarisconstructed,itisbrokentodetermineitsstrength. Thereisnoclosedformedsolutionforthebar'sstrength;eachfunctionevaluation comesfromanexpensiveprocedure.Also,theprocessofbreakingthebarprovides noinformationabouthowtochangethefactorsofproductiontoincreasethebar's strength.Inadditiontotheoptimizationofsystemswhichmustbephysicallyevaluated,functionevaluationsbycomplexcomputersimulationsoftenprovidenoor unreliablederivatives.Suchsimulationsofcomplexphenomenasometimescalled black-boxfunctionsarebecomingincreasinglycommonascomputationalmodeling andcomputerhardwarecontinuetoadvance.Whereastraditionaltechniquesareconcernedwitheciencyofthealgorithm,suchconcernsaresecondarythroughoutthis thesis.Explicitly,weassumethatthecostofevaluatingthefunctionoverwhelmsthe computationalrequirementsofthealgorithm. Inadditiontounavailablederivatives,noiseofvariousformsisoftenpresentin thesefunctions.Throughoutthisthesiswewillcategorizethisnoise{ordierence betweenthetruevalueandcomputedvalue{intotwocategories:deterministicand stochastic.Deterministicnoisee.g.,arisingfromnite-precisioncalculationsoriterativemethodsisoftenpresentifthefunctionbeingoptimizedisasimulationof aphysicalsystem.Forexample,ifevaluatingthefunctioninvolvessolvingasystemofnonlinearpartialdierentialequationsorcomputingtheeigenvaluesofalarge matrix,asmallperturbationintheparameterscanyieldalargejumpinthedierencebetweenthetrueandcomputedvalues.Thoughthecomputedvalueandtrue 1

PAGE 13

valuemaydier,re-evaluatingthefunctionwiththesamechoiceofparameterswill providenofurtherinformation.Incontrast,re-evaluatingafunctionwithstochastic noisewillprovideadditionalinformationaboutthetruevalueofthefunction.Two commonsourcesofstochasticnoisearefoundinfunctionswhoseevaluationrequires alarge-scaleMonte-Carlosimulationofanactualsystemorifafunctionevaluation requiresphysicallymeasuringapropertyinsomesystem. Thethesisconsistsofthreedistinctbutconnectedchaptersaddressingtheproblem: min x f x : R n ! R whenthealgorithmonlyhasaccesstonoisecorruptedvalues f x := f x + x ; where x denotesthenoise. Eachchaptermakesdierentassumptionsaboutthenoise x .Chapter3assumesthattheaccuracyatwhich f approximates f canvary,andthatthisaccuracy canbequantied.Forexample,if f x iscalculatedusingaMonte-Carlosimulation, increasingthenumberoftrialswilldecreasethemagnitudeof x .Similarly,if f is calculatedbyaniteelementmethod,increasedaccuracycanbeobtainedbyaner grid.Ofcourse,thisgreateraccuracycomesatthecostofincreasedcomputational time;soitmakessensetovarytheaccuracyrequirementsoverthecourseoftheoptimization.Withthisinmind,Chapter3asks:Howcanknowledgeoftheaccuracy ofdierentfunctionevaluationsbeexploitedtodesignabetteralgorithm? Chapter4assumesthatthenoiseforeachfunctionevaluationisindependentof x andcanbemodeledasanormallydistributedrandomvariablewithmeanzero andaxed,nitestandarddeviation.Thoughmanyotheralgorithmshavebeen designedtooptimizesuchafunction,theyoftenresorttorepeatedsamplingofpoints. Thisprovidesinformationaboutthenoiseatapoint,butnoinformationaboutthe 2

PAGE 14

functionnearby.ThismotivatesthequestionaddressedinChapter4:Howcan greateraccuracybeecientlyachievedbyoversamplingwithoutnecessarilyrepeating functionevaluations? Chapter5assumesthatareasonablyaccurateestimateofthemagnitudeofthe noisecanbeobtained,andthatthisestimateremainsrelativelyconstantthroughout thedomain.Thoughtherearemanyalgorithmsintheliteraturedesignedtooptimizenoisyfunctions,veryfewuseestimatesofthenoiseintheirterminationcriteria. Whenfunctionevaluationsarecheap,terminationcanbedeterminedbycommon testse.g.,smallstepsizeorgradientapproximation.Butwhenfunctionevaluationsareexpensive,determiningwhentostopbecomesanimportantmulti-objective optimizationproblem.Theoptimizerwantstondthebestsolutionpossiblewhile minimizingcomputationaleort.Asthisisadicultproblemtoexplicitlyformulate, practitionersfrequentlyterminatealgorithmswheniapredenednumberofiterationshaselapsed,iinodecreaseintheoptimalfunctionvaluehasbeendetectedfor anumberofiterations,oriiithedistancebetweenanumberofsuccessiveiterates isbelowsomethreshold.Chapter5attemptstoanswerthequestion:Whenshould analgorithmoptimizinganexpensive,noisyfunctionbeterminated? 1.1ReviewofMethods Beforediscussingouralgorithmsfurther,werstdiscusspreviousDFOtechniques. Heuristicsareperhapstherstrecoursewhenattemptingtooptimizeafunction withoutderivatives.Simulatedannealing[36,63],geneticalgorithms[28],random searchanditsvariants[55,66,39,52],tabusearch[20],scattersearch[21],particle swarmoptimization[34],andNelder-Mead[45]arejustafewoftheheuristicsdevelopedtosolveproblemswhereonlyfunctionevaluationsareavailable.Thoughmostof thesealgorithmslackformalconvergenceresultsasidefromresultsdependentonthe algorithmproducingiterateswhicharedenseinthedomain,theyremainpopular 3

PAGE 15

duetotheireaseofimplementation,exibilityonavarietyofproblemclasses, andfrequentsuccessinpractice. Othertechniquesattempttoapproximatetheunavailablederivative.Classical nite-dierencemethodsapproximatethederivativebyadjustingeachvariableand notingthechangeinthefunctionvalue.Othertechniques,suchasthepatternsearch methods[62,2]andimplicitltering[5],evaluateachangingpatternofpointsaround thebestknownsolution.AlsoofnoteistheDIRECTalgorithm[30],aglobaloptimizationmethodbasedondividinghyper-rectanglesusingonlyfunctionvalues. Anincreasinglypopularclassofalgorithmsforderivative-freeoptimizationDFO aremodel-basedtrustregionmethods[31,11].Localmodelsapproximatingthefunctionareconstructedandminimizedtogeneratesuccessiveiterates.Thesemodels arecommonlylow-orderpolynomialsinterpolatingfunctionvaluesclosetothebest knownvalue,forexamplePowell'sUOBYQAalgorithm[48].Otherexamplesinclude [49],wherePowellintroducesaminimumFrobeniusnormconditiononunderdeterminedquadraticmodels,andORBITbyWildetal.[64],whichconstructsmodels usinginterpolatingradialbasisfunctions.Theselocalmodelsshouldnotbeconfused withkriging[59]orresponsesurfacemethodologies[44]whichbuildglobalmodelsof thefunction.Thoughimplementationofthesetechniquesisnotassimpleassome othertechniques,awell-developedconvergencetheoryexists.Asthisthesisfocuses onnoisyDFOproblems,weconsideredtrust-regionmethodswithregressionmodels mostappropriatesince,inmanycases,regressionmodelsthroughenoughpointscan approximatethetruefunction. TherearealsoavarietyofexistingDFOalgorithmsforoptimizingfunctionswith noise.Forfunctionswithstochasticnoise,replicationsoffunctionevaluationscan beasimplewaytomodifyexistingalgorithms.Forexample,[14]modiesPowell's UOBYQA[48],[15]modiesDIRECT[30],and[61]modiesNelder-Meadbyrepeatedsamplingofthefunctionatpointsofinterest.Fordeterministicnoise,Kelley 4

PAGE 16

[33]proposesatechniquetodetectandrestartNelder-Meadmethods.Neumaier's SNOBFIT[29]algorithmaccountsfornoisebynotrequiringthesurrogatefunctions tointerpolatefunctionvalues,butrathertastochasticmodel.Similarly,[10]proposesusingleast-squaresregressionmodelsinsteadofinterpolatingmodelswhennoise ispresentinthefunctionevaluations. Lastly,StochasticApproximationalgorithmsarealsodesignedtominimizefunctionswithstochasticnoise.Thesealgorithmsaredevelopedbystatisticianstosolve min f x = E [ f x ] ; whenonly f canbecomputed.Twoofthemorefamousalgorithms,theKieferWolfowitzandSimultaneousPerturbationmethods,takepredenedsteplengthsina directionapproximating r f .Thesetechniqueshavestrongtheoreticalconvergence results,butcanbediculttoimplementinpractice.Forfurtherdiscussionofthese algorithms,seethebeginningofChapter4. 1.2Outline Theworkinthisthesisfocusesonmodicationstomodel-basedtrustregion methodsinordertohandlenoise.Throughoutthethesisweassumethatonlynoisy, expensivefunctionevaluations f areavailable,butthereissomesmoothunderlying function f whichistwicecontinuouslydierentiablewithaLipschitzcontinuousHessian.Wealsoassumethatthenoiseintheevaluationof f isunbiasedwithbounded variance. Chapter3jointworkwithStephenBillupsandPeterGrafproposesaDFO algorithmtooptimizefunctionswhichareexpensivetoevaluateandcontaincomputationalnoise.Thealgorithmisbasedonthetrustregionmethodsof[9,10]which buildinterpolationorregressionmodelsaroundthebestknownsolution.Wepropose using weighted regressionmodelsinatrustregionframework,andproveconvergence ofsuchmethodsprovidedtheweightingschemesatisessomebasicconditions. 5

PAGE 17

Thealgorithmtsintoageneralframeworkforderivative-freetrustregionmethodsusingquadraticmodels,whichwasdescribedbyConn,Scheinberg,andVicente in[12,11].Weshallrefertothisframeworkasthe CSV2-framework .Thisframeworkconstructsaquadraticmodelfunction m k ,whichapproximatestheobjective function f onasetofsamplepoints Y k R n ateachiteration k .Thenextiterateis thendeterminedbyminimizing m k overatrustregion.Inordertoguaranteeglobal convergence,theCSV2-frameworkmonitorsthedistributionofpointsinthesample set,andoccasionallyinvokesamodel-improvementalgorithmthatmodiesthesamplesettoensure m k accuratelyapproximates f .TheCSV2-frameworkisthebasis ofthewell-knownDFOalgorithmwhichisfreelyavailableontheCOIN-ORwebsite [38]. TotouralgorithmintotheCSV2-frameworkweextendthetheoryofpoisedness, asdescribedin[12],toweightedregression.WeshowProposition3.12thatasample setthatisstrongly-poisedintheregressionsenseisalsostrongly c -poisedinthe weighted regressionsenseforsomeconstant c ,providedthatnoweightistoosmall relativetotheotherweights.Thus,anymodelimprovementschemethatensures strong-poisednessintheregressionsensecanbeusedintheweightedregression framework. TheconvergenceproofinChapter3requiresthatthecomputationalerrordecreasesasthetrustregiondecreases;suchanassumptioncanbesatisediftheuser hassomecontroloftheaccuracyinthefunctionevaluation.SinceChapter3is centeredonexploitingdieringlevelsindierentfunctionevaluations,suchanassumptionisreasonableforthatchapter.InChapter4,weremovethisassumption, butaddtheassumptionthat f hastheform f x = f x + .1 where N ; 2 .ThecontentofChapter4jointworkwithStephenBillups modiesthealgorithmfromChapter3toconvergewhentheerrordoesnotdecrease 6

PAGE 18

withthetrustregionradius.Withsomelightassumptionsonthenoiseandunderlyingfunction,weprovethealgorithmgeneratesasubsequenceofiterateswhich convergealmostsurelytoarst-orderstationarypointinthecasewherethenumber ofsuccessfuliteratesisnite. Atagivenpointofinterest x 0 ,thealgorithmdoesnotrepeatedlysample f x 0 inordertogleaninformationabout f x 0 .Rather m k x 0 ,thevalueofthetrust regionmodelat x 0 isusedtoestimate f x 0 .Wederiveboundsontheerrorbetween f and m ,providedthesetofpointsusedtoconstruct m satisescertaingeometric conditions,calledstrongly-poisedseeDenition2.14,andcontainsasucient numberofpoints.Also,thestepsizeiscontrolledbythealgorithm,increasingand decreasingasthealgorithmprogressesandstagnates.Thiscontrastsmanyofthe methodsintheStochasticApproximationliteraturewheretheusermustprovidea predenedsetofstepstobetakenbythealgorithm. TheresultsinSection4.3provethealgorithmwillgenerateasubsequenceof iteratesconvergingalmostsurelytoarst-orderstationarypointwhenthenumberof successfuliteratesisnite,andmakesprogressintheinnitecase.Suchresultsare notcommonformostDFOalgorithmsonproblemswithstochasticnoise.Boththe simplicialdirectsearchmethod[1]andthetrustregionmethodin[4]provesimilar convergenceresults,butbothreducethevarianceatapointbyrepeatedsampling. Inadditiontoourstrongconvergenceresult,weareabletodirectlyquantifythe probabilityofthesuccessofsomeiteratesseeLemma4.15foronesuchexample. Weareunawareofanyothersimilartheoreticalresultsforalgorithmsminimizing stochasticfunctions. Chapter5jointworkwithStefanWildaddressesterminationcriteria,thechoice ofwhichisacommonproblemwhenoptimizingnoisyfunctions.Weproposeobjective measurestocomparethequalityofterminationrules.Familiesofterminationtests arethenproposedandtheirperformanceisanalyzedacrossabroadrangeofDFO 7

PAGE 19

algorithms.Recommendationsfortestswhichworkformanyalgorithmsarealso provided.LastlyChapter6containsconcludingremarksanddirectionsforfuture research. 1.3Notation Thefollowingnotationwillbeused: R n denotestheEuclideanspaceofreal n vectors. kk p denotesthestandard ` p vectornorm,and kk withoutthesubscript denotestheEuclideannorm. kk F denotestheFrobeniusnormofamatrix. C k denotesthesetoffunctionson R n with k continuousderivatives. D j f denotesthe j thderivativeofafunction f 2 C k , j k .Givenanopenset R n , LC k denotesthesetof C k functionswithLipschitzcontinuous k thderivatives.Thatis, for f 2 LC k ,thereexistsaLipschitzconstant L suchthat D k f y )]TJ/F19 11.9552 Tf 11.955 0 Td [(D k f x L k y )]TJ/F19 11.9552 Tf 11.955 0 Td [(x k forall x;y 2 : P d n denotesthespaceofpolynomialsofdegreelessthanorequalto d in R n ; q 1 denotes thedimensionof P 2 n specically q 1 = n +1 n +2 = 2.Weusestandardbig-Oh" notationwritten O tostate,forexample,thatfortwofunctionsonthesame domain, f x = O g x ifthereexistsaconstant M suchthat j f x j M j g x j forall x withsucientlysmallnorm.Givenaset Y , j Y j denotesthecardinality andconv Y denotestheconvexhull.Forarealnumber , b c denotesthegreatest integerlessthanorequalto .Foramatrix A , A + denotestheMoore-Penrose generalizedinverse[22]. e j denotesthe j thcolumnoftheidentitymatrix.Theball ofradiuscenteredat x 2 R n isdenoted B x ;.Givenavector w ,diag w denotesthediagonalmatrix W withdiagonalentries W ii = w i .Forasquarematrix A ,cond A denotestheconditionnumber, min A denotesthesmallesteigenvalue, and min denotesthesmallestsingularvalue.Foraset Y := f y 0 ; ;y p g R n ,the quadraticdesignmatrix X hasrows 1 y j 1 y j n 1 2 y j 1 2 y j 1 y j 2 y j n )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 y j n 1 2 y j n 2 .2 8

PAGE 20

Let m k denotethe k thtrustregionmodelasdenedinChapter2.Let g k = r m k x k and H k = r 2 m k x k . & k x =max kr m k x k ; )]TJ/F19 11.9552 Tf 9.299 0 Td [( min r 2 m k x & x =max kr f x k ; )]TJ/F19 11.9552 Tf 9.298 0 Td [( min r 2 f x Thesevariablesmeasurehowclose x istoarst-andsecond-orderstationarypointof f and m k i.e.thegradientiszeroandalleigenvaluesarepositive.If X isarandom variable,thenotation X denotes P X .Notethattherelation 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( is nottransitive. 9

PAGE 21

2.Background Beforecontinuing,weintroducethebackgroundmaterialonwhichthethesisis constructed. 2.1Model-basedTrustRegionMethods Atrustregionalgorithmisanumericaltechniqueforminimizingasuciently smoothfunction f .Ateachiteration k ,amodelfunction m k x isconstructedto approximate f nearthebestpoint x k .Whenderivativesareavailable, m k isusuallya truncatedrst-orsecond-orderTaylorseriesapproximationof f at x k .Forexample, if r f and r 2 f areeasytocalculate, m k x = f x k + r f x k x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x k + 1 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x k T r 2 f x k x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x k : m k isminimizedoverthetrustregion B x k ; k bysolvingtheproblem min s : k s k k m k x k + s togenerateapotentialnexttrustregioncenter x k + s k . f x k + s k isevaluatedand theratio k = f x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(f x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k iscalculated,whichcomparestheactualdecreasein f versusthedecreasepredicted bythemodel m k .Thisratioquantiesthesuccessofiteration k andalsohowwell themodelfunctionapproximatesthetruefunction f .If k islargerthanaprescribed threshold 1 ,itindicatesthattheiterationwassuccessfulandthemodelisagood approximationofthefunction.Inthiscase, x k +1 issetto x k + s k andthetrustregion radius, k isincreased.If k islessthananotherparameter 0 ,themodelfunction isconsideredunreliablesothetrustregionradius k isdecreasedandtheiterateis notupdatedi.e. x k +1 = x k .Lastly, k isincrementedandtheprocessrepeats. 2.1.1ModelConstructionWithoutDerivatives 10

PAGE 22

Whenderivativesareunavailable,themodels m k areconstructedusingpoints where f hasbeenevaluated.Forexample,theConn,Scheinberg,andVicenteframeworkwhichwerefertoasthe CSV2-framework buildsmodels m k fromaspecied classofmodels M usingasamplesetofpoints Y k = f y 0 ; ;y p g B x k ; k on whichthefunctionhasbeenevaluated. Given Y k andavectorofcorrespondingfunctionvalues f = f y 0 ; ;f y p ,an interpolating modelisamodel m x suchthat m y i = f y i for i =0 ; ;p .Givena basis = f 0 x ;:::; q x g oftheclassofmodels M ,wecancalculatethecoecients i inthebasisrepresentationoftheinterpolatingmodel m x = P p i =0 i i x by solvingthelinearsystem M ;Y = f .1 where M ;Y = 2 6 6 6 6 6 6 6 4 0 y 0 1 y 0 q y 0 0 y 1 1 y 1 q y 1 . . . . . . . . . . . . 0 y p 1 y p q y p 3 7 7 7 7 7 7 7 5 Notethatforthisequationtohaveauniquesolution,thenumberofsamplepoints p +1mustequalthesizeofthebasis q +1andthematrix M ;Y mustbeinvertible. Regressionmodelshavealsobeeninvestigated,[10]inwhichthenumberofsample points p +1isgreaterthanthesizeofthebasis.Inthiscase,thematrix M ;Y hasmorerowsthancolumnssotheequation.1issolvedinaleastsquaressense. Lastly,if M ;Y hasmorecolumnsthanrows,thesystem.1isunderdetermined.Nevertheless,boundsbetweenthefunctionandanunderdeterminedmodel canbeobtainedincertaincases.Forexample,see[13]consideringtheminiumFrobeniusnormmethod. 2.1.2CSV2-framework 11

PAGE 23

WenextoutlinetheCSV2-frameworkforderivative-freetrustregionmethods describedbyConn,Scheinberg,andVicente[12,Algorithm10.3].Algorithmsinthe frameworkconstructamodelfunction m k atiteration k ,whichapproximatesthe objectivefunction f onasetofsamplepoints Y k = f y 0 ;:::;y p k g R n .Thenext iterateisthendeterminedbyminimizing m k .Specically,giventheiterate x k ,a putativenextiterateisgivenby x k + s k wherethestep s k solvesthetrustregion subproblem min s : k s k k m k x k + s wherethescalar k > 0denotesthetrustregionradius,whichmayvaryfromiteration toiteration.If x k + s k producessucientdescentinthemodelfunction,then f x k + s k isevaluated,andtheiterateisacceptedif f x k + s k 0 .Amodelfunction m 2 C 2 is -fullyquadratic on B x ; if m hasaLipschitzcontinuousHessianwithcorrespondingLipschitz constantboundedby m 2 and 12

PAGE 24

- the error between the Hessian of the model and the Hessian of the function satisfies $\|\nabla^2 f(y) - \nabla^2 m(y)\| \le \kappa_{eh}\Delta$ for all $y \in B(x;\Delta)$;
- the error between the gradient of the model and the gradient of the function satisfies $\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\Delta^2$ for all $y \in B(x;\Delta)$;
- the error between the model and the function satisfies $|f(y) - m(y)| \le \kappa_{ef}\Delta^3$ for all $y \in B(x;\Delta)$.

Definition 2.3  Let $f$ satisfy Assumption 2.1. A set of model functions $\mathcal{M} = \{m: \mathbb{R}^n \to \mathbb{R},\ m \in C^2\}$ is called a fully quadratic class of models if there exist positive constants $\kappa = (\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, \nu_2^m)$ such that the following hold:

1. For any $x \in S$ and $\Delta \in (0, \Delta_{max}]$, there exists a model function $m$ in $\mathcal{M}$ which is $\kappa$-fully quadratic on $B(x;\Delta)$.
2. For this class $\mathcal{M}$, there exists an algorithm, called a "model-improvement" algorithm, that in a finite, uniformly bounded (with respect to $x$ and $\Delta$) number of steps can
   - either certify that a given model $m \in \mathcal{M}$ is $\kappa$-fully quadratic on $B(x;\Delta)$,
   - or find a model $m \in \mathcal{M}$ that is $\kappa$-fully quadratic on $B(x;\Delta)$.

Note that this definition of a fully quadratic class of models is equivalent to [12, Definition 6.2]; but we have given a separate definition of a $\kappa$-fully quadratic model (Definition 2.2) that includes the use of $\kappa$ to stress the fixed nature of the bounding constants. This change simplifies some analysis by allowing us to discuss $\kappa$-fully quadratic models independent of the class of models they belong to. It is important
to note that $\kappa$ does not need to be known explicitly. Instead, it can be defined implicitly by the model improvement algorithm. All that is required is for $\kappa$ to be fixed (that is, independent of $x$ and $\Delta$). We also note that the set $\mathcal{M}$ may include non-quadratic functions, but when the model functions are quadratic, the Hessian is fixed, so $\nu_2^m$ can be chosen to be zero. For the algorithms presented in Chapter 3 and Chapter 4, we focus on model functions that are quadratic. That is, $\mathcal{M} = \mathcal{P}^2_n$.

As a side note, $\kappa$-fully quadratic models may be too difficult to construct or may be undesired in some situations. If that is the case, $\kappa$-fully linear models might provide a useful alternative.

Definition 2.4  Let $f \in LC^2$, let $\kappa = (\kappa_{ef}, \kappa_{eg}, \nu_1^m)$ be a given vector of constants, and let $\Delta > 0$. A model function $m \in C^2$ is $\kappa$-fully linear on $B(x;\Delta)$ if $m$ has a Lipschitz continuous gradient with corresponding Lipschitz constant bounded by $\nu_1^m$ and

- the error between the gradient of the model and the gradient of the function satisfies $\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\Delta$ for all $y \in B(x;\Delta)$;
- the error between the model and the function satisfies $|f(y) - m(y)| \le \kappa_{ef}\Delta^2$ for all $y \in B(x;\Delta)$.

If $m_k$ is $\kappa$-fully linear, it approximates $f$ in a fashion similar to the first-order Taylor model of $f$. If $m_k$ is $\kappa$-fully quadratic, then it approximates $f$ in a fashion similar to the second-order Taylor model of $f$. If $\kappa$-fully linear or quadratic models are used within the CSV2-framework, we can guarantee convergence of the algorithm to a first- or second-order stationary point of $f$.

A critical distinction between the CSV2-framework and classical trust region methods lies in the optimality criteria. In classical trust region methods, $m_k$ is the
second-order Taylor approximation of $f$ at $x_k$; so if $x_k$ is optimal for $m_k$, it satisfies the first- and second-order necessary conditions for an optimum of $f$. In the CSV2-framework, $x_k$ must be optimal for $m_k$, but $m_k$ must also be an accurate approximation of $f$ near $x_k$. This requires that the trust region radius is small and that $m_k$ is $\kappa$-fully quadratic on the trust region for some fixed $\kappa$.

To explicitly outline the CSV2-framework, we provide pseudocode below. In the algorithm, $g_k^{icb}$ and $H_k^{icb}$ denote the gradient and Hessian of the incumbent model $m_k^{icb}$. We use the superscript $icb$ to stress that incumbent parameters from the previous iterates may be changed before they are used in the trust region step. The optimality of $x_k$ with respect to $m_k$ is tested by calculating $\sigma_k^{icb} = \max\{\|g_k^{icb}\|, -\lambda_{\min}(H_k^{icb})\}$. This quantity is zero if and only if the first- and second-order optimality conditions for $m_k$ are satisfied. The algorithm enters the criticality step when $\sigma_k^{icb}$ is close to zero. This routine builds a (possibly new) $\kappa$-fully quadratic model for the current $\Delta_k^{icb}$, and tests if $\sigma_k^{icb}$ for this model is sufficiently large. If so, a descent direction has been determined, and the algorithm can proceed. If not, the criticality step reduces $\Delta_k^{icb}$ and updates the sample set to improve the accuracy of the model function near $x_k$. The criticality step ends when $\sigma_k^{icb}$ is large enough (and the algorithm proceeds) or when both $\sigma_k^{icb}$ and $\Delta_k^{icb}$ are smaller than given threshold values $\epsilon_c$ and $\Delta_{\min}$ (in which case the algorithm has identified a second-order stationary point). We refer the reader to [12] for a more detailed discussion of the algorithm, including explanations of the parameters $\eta_0, \eta_1, \gamma, \gamma_{inc}, \epsilon_c, \mu, \beta$, and $\omega$.

Algorithm CSV2 [12, Algorithm 10.3]

Step 0 (initialization): Choose a fully quadratic class of models $\mathcal{M}$ and a corresponding model-improvement algorithm, with associated $\kappa$ defined by Definition 2.3. Choose an initial point $x_0$ and maximum trust region radius $\Delta_{max} > 0$. We assume that the following are given: an initial model $m_0^{icb}(x)$ (with gradient and Hessian at
$x = x_0$ denoted by $g_0^{icb}$ and $H_0^{icb}$, respectively), $\sigma_0^{icb} = \max\{\|g_0^{icb}\|, -\lambda_{\min}(H_0^{icb})\}$, and a trust region radius $\Delta_0^{icb} \in (0, \Delta_{max}]$.

The constants $\eta_0, \eta_1, \gamma, \gamma_{inc}, \epsilon_c, \beta, \mu, \omega$ are given and satisfy the conditions $0 \le \eta_0 \le \eta_1 < 1$ (with $\eta_1 \ne 0$), $0 < \gamma < 1 < \gamma_{inc}$, $\epsilon_c > 0$, $\mu > \beta > 0$, $\omega \in (0,1)$. Set $k = 0$.

Step 1 (criticality step): If $\sigma_k^{icb} > \epsilon_c$, then $m_k = m_k^{icb}$ and $\Delta_k = \Delta_k^{icb}$.

If $\sigma_k^{icb} \le \epsilon_c$, then proceed as follows. Call the model-improvement algorithm to attempt to certify if the model $m_k^{icb}$ is $\kappa$-fully quadratic on $B(x_k;\Delta_k^{icb})$. If at least one of the following conditions holds,

- the model $m_k^{icb}$ is not certifiably $\kappa$-fully quadratic on $B(x_k;\Delta_k^{icb})$, or
- $\Delta_k^{icb} > \mu\,\sigma_k^{icb}$,

then apply Algorithm CriticalityStep (described below) to construct a model $\tilde m_k(x)$ (with gradient and Hessian at $x = x_k$ denoted by $\tilde g_k$ and $\tilde H_k$, respectively), with $\tilde\sigma_k^m = \max\{\|\tilde g_k\|, -\lambda_{\min}(\tilde H_k)\}$, which is $\kappa$-fully quadratic on the ball $B(x_k;\tilde\Delta_k)$ for some $\tilde\Delta_k \in (0, \mu\tilde\sigma_k^m]$ given by [12, Algorithm 10.4]. In such a case set
\[
m_k = \tilde m_k \quad\text{and}\quad \Delta_k = \min\{\max\{\tilde\Delta_k, \beta\tilde\sigma_k^m\}, \Delta_k^{icb}\}.
\]
Otherwise, set $m_k = m_k^{icb}$ and $\Delta_k = \Delta_k^{icb}$. For a more complete discussion of trust region management, see [26].

Step 2 (step calculation): Compute a step $s_k$ that sufficiently reduces the model $m_k$ (in the sense of [12, (10.13)]) such that $x_k + s_k \in B(x_k;\Delta_k)$.

Step 3 (acceptance of the trial point): Compute $f(x_k + s_k)$ and define
\[
\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.
\]
If $\rho_k \ge \eta_1$, or if both $\rho_k \ge \eta_0$ and the model is $\kappa$-fully quadratic on $B(x_k;\Delta_k)$, then $x_{k+1} = x_k + s_k$ and the model is updated to include the new iterate into the sample set, resulting in a new model $m_{k+1}^{icb}(x)$ (with gradient and Hessian at $x = x_{k+1}$ denoted
by $g_{k+1}^{icb}$ and $H_{k+1}^{icb}$, respectively), with $\sigma_{k+1}^{icb} = \max\{\|g_{k+1}^{icb}\|, -\lambda_{\min}(H_{k+1}^{icb})\}$; otherwise, the model and the iterate remain unchanged ($m_{k+1}^{icb} = m_k$ and $x_{k+1} = x_k$).

Step 4 (model improvement): If $\rho_k < \eta_1$, use the model-improvement algorithm to

- attempt to certify that $m_k$ is $\kappa$-fully quadratic on $B(x_k;\Delta_k)$;
- if such a certificate is not obtained, we say that $m_k$ is not certifiably $\kappa$-fully quadratic and make one or more suitable improvement steps.

Define $m_{k+1}^{icb}$ to be the (possibly improved) model.

Step 5 (trust region update): Set
\[
\Delta_{k+1}^{icb} \in
\begin{cases}
\{\min\{\gamma_{inc}\Delta_k, \Delta_{max}\}\} & \text{if } \rho_k \ge \eta_1 \text{ and } \Delta_k < \beta\sigma_k^m,\\
[\Delta_k, \min\{\gamma_{inc}\Delta_k, \Delta_{max}\}] & \text{if } \rho_k \ge \eta_1 \text{ and } \Delta_k \ge \beta\sigma_k^m,\\
\{\gamma\Delta_k\} & \text{if } \rho_k < \eta_1 \text{ and } m_k \text{ is } \kappa\text{-fully quadratic},\\
\{\Delta_k\} & \text{if } \rho_k < \eta_1 \text{ and } m_k \text{ is not certifiably } \kappa\text{-fully quadratic}.
\end{cases}
\]
Increment $k$ by 1 and go to Step 1.

Algorithm CriticalityStep [12, Algorithm 10.4]  (This algorithm is applied only if $\sigma_k^{icb} \le \epsilon_c$ and at least one of the following holds: the model $m_k^{icb}$ is not certifiably $\kappa$-fully quadratic on $B(x_k;\Delta_k^{icb})$, or $\Delta_k^{icb} > \mu\sigma_k^{icb}$.)

Initialization: Set $i = 0$. Set $m_k^{(0)} = m_k^{icb}$.

Repeat: Increment $i$ by one. Use the model improvement algorithm to improve the previous model $m_k^{(i-1)}$ until it is $\kappa$-fully quadratic on $B(x_k;\omega^{i-1}\Delta_k^{icb})$. Denote the new model by $m_k^{(i)}$. Set $\tilde\Delta_k = \omega^{i-1}\Delta_k^{icb}$ and $\tilde m_k = m_k^{(i)}$.

Until $\tilde\Delta_k \le \mu\,\sigma_k^{m^{(i)}}$.
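A compact sketch of the criticality loop just described is given below, with the model-improvement machinery abstracted behind callables; the callable names, the `max_iter` safeguard, and the default parameter values are assumptions of this sketch only.

```python
def criticality_step(model_improve, sigma_of, delta_icb, mu=2.0, omega=0.5, max_iter=50):
    """Shrink the radius by omega and rebuild a fully quadratic model until the
    radius is at most mu * sigma of that model.

    model_improve(radius) -> a model that is kappa-fully quadratic on B(x_k, radius)
    sigma_of(m)           -> max(||g||, -lambda_min(H)) for model m
    Returns the final model and the radius on which it was certified.
    """
    for i in range(1, max_iter + 1):
        delta = omega ** (i - 1) * delta_icb
        m = model_improve(delta)
        if delta <= mu * sigma_of(m):
            break
    return m, delta
```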
Global Convergence.  If the following assumptions are satisfied, it has been shown that the CSV2-framework iterates will converge to a stationary point of $f$. Define the set $L(x_0) = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$.

Assumption 2.5  Assume that $f$ is bounded from below on $L(x_0)$.

Assumption 2.6  There exists a constant $\kappa_{bhm} > 0$ such that, for all $x_k$ generated by the algorithm, $\|H_k\| \le \kappa_{bhm}$.

Theorem 2.7 [12, Theorem 10.23]  Let Assumptions 2.1, 2.5 and 2.6 hold with $S = L(x_0)$. Then, if the models used in the CSV2-framework are $\kappa$-fully quadratic, the iterates $x_k$ generated by the CSV2-framework satisfy
\[
\lim_{k\to+\infty} \max\{\|\nabla f(x_k)\|, -\lambda_{\min}(\nabla^2 f(x_k))\} = 0.
\]
It follows from this theorem that any accumulation point of $\{x_k\}$ satisfies the first- and second-order necessary conditions for a minimum of $f$.

2.1.3 Poisedness

Having outlined the CSV2-framework, we can discuss conditions on the sample set used to build $m_k$ which guarantee the model sufficiently approximates $f$. Consider the set of polynomials in $\mathbb{R}^n$ of degree less than or equal to $d$, denoted $\mathcal{P}^d_n$. A basis $\phi = \{\phi_0(x), \ldots, \phi_q(x)\}$ of $\mathcal{P}^d_n$ is a set of polynomials from $\mathcal{P}^d_n$ which span $\mathcal{P}^d_n$. That is, for any $P(x) \in \mathcal{P}^d_n$, there exist coefficients $\alpha_i$ such that $P(x) = \sum_{i=0}^{p} \alpha_i \phi_i(x)$. We let $|\mathcal{P}^d_n|$ denote the number of elements in any basis $\phi$ of $\mathcal{P}^d_n$. For example, $|\mathcal{P}^1_n| = n+1$ and $|\mathcal{P}^2_n| = (n+1)(n+2)/2$.

Definition 2.8  A set of points $X = \{x^0, \ldots, x^p\} \subset \mathbb{R}^n$ with $|X| = |\mathcal{P}^d_n|$ is poised for interpolation if the matrix $M(\phi, X)$ is nonsingular for some basis $\phi$ in $\mathcal{P}^d_n$.
If $|X| > |\mathcal{P}^d_n|$, we can construct the least squares regression model by solving (2.1) as well. We extend the definition of poisedness to the regression case.

Definition 2.9  A set of points $X = \{x^0, \ldots, x^p\} \subset \mathbb{R}^n$ with $|X| \ge |\mathcal{P}^d_n|$ is poised for regression if the matrix $M(\phi, X)$ has full column rank for some basis $\phi$ in $\mathcal{P}^d_n$.

Since we have limited information about the function $f$, we want the interpolating or regressing $m(x)$ to be an accurate approximation in a region of interest. This requires that $X$ consists of points spread out within said region. Since $M(\phi, X)$ can be arbitrarily poorly conditioned while $X$ is still poised, simply being poised is not enough to measure the quality of a set $X$. Looking at the condition number of $M(\phi, X)$ is also inadequate, since scaling the sample set $X$ or choosing a different basis $\phi$ can arbitrarily adjust this quantity. Nevertheless, there are methods for measuring the quality of a sample set, one of the most common of which is based on Lagrange polynomials.

Definition 2.10  For a set $X = \{x^0, \ldots, x^p\} \subset \mathbb{R}^n$ with $|X| = |\mathcal{P}^d_n|$, the set of interpolating Lagrange polynomials $\ell = \{\ell_0, \ldots, \ell_p\} \subset \mathcal{P}^d_n$ are the polynomials satisfying
\[
\ell_i(x^j) = \begin{cases} 1 & \text{if } i = j,\\ 0 & \text{otherwise.}\end{cases}
\]
If the set $X$ is poised, then this set of polynomials is guaranteed to exist and be uniquely defined.

We can extend the definition of Lagrange polynomials to the regression case in a natural fashion.

Definition 2.11  For a set $X = \{x^0, \ldots, x^p\} \subset \mathbb{R}^n$ with $|X| > |\mathcal{P}^d_n|$, the set of regression Lagrange polynomials $\ell = \{\ell_0, \ldots, \ell_p\} \subset \mathcal{P}^d_n$ are the polynomials satisfying
\[
\ell_i(x^j) \stackrel{\ell.s.}{=} \begin{cases} 1 & \text{if } i = j,\\ 0 & \text{otherwise.}\end{cases}
\]
Though these polynomials are no longer linearly independent, if $X$ is poised, then the set of regression Lagrange polynomials exists and is uniquely defined.

We can now use these Lagrange polynomials to extend the definition of poisedness to $\Lambda$-poisedness. This relates the magnitude of the Lagrange polynomials on a set $B \subset \mathbb{R}^n$, which will allow a method for measuring the quality of a sample set.

Definition 2.12  Let $\Lambda > 0$ and a set $B \subset \mathbb{R}^n$ be given. For a basis $\phi$ of $\mathcal{P}^d_n$, a poised set $X = \{x^0, \ldots, x^p\}$ is said to be $\Lambda$-poised in $B$ in the interpolating sense if and only if

1. for the basis of Lagrange polynomials associated with $X$,
\[
\max_{0\le i\le p}\ \max_{x\in B} |\ell_i(x)| \le \Lambda;
\]
or, equivalently,

2. for any $x \in B$ there exists $\lambda(x)$ such that
\[
\sum_{i=0}^{p} \lambda_i(x)\,\phi(x^i) = \phi(x) \quad\text{with}\quad \|\lambda(x)\|_\infty \le \Lambda.
\]

And we again can extend this definition to the regression case.

Definition 2.13  Let $\Lambda > 0$ and a set $B \subset \mathbb{R}^n$ be given. For a basis $\phi$ of $\mathcal{P}^d_n$, a poised set $X = \{x^0, \ldots, x^p\}$ with $|X| \ge |\mathcal{P}^d_n|$ is said to be $\Lambda$-poised in $B$ in the regression sense if and only if

1. for the basis of Lagrange polynomials associated with $X$,
\[
\max_{0\le i\le p}\ \max_{x\in B} |\ell_i(x)| \le \Lambda;
\]
or, equivalently,

2. for any $x \in B$ there exists $\lambda(x)$ such that
\[
\sum_{i=0}^{p} \lambda_i(x)\,\phi(x^i) = \phi(x) \quad\text{with}\quad \|\lambda(x)\|_\infty \le \Lambda.
\]
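As a rough numerical illustration of these definitions (a sketch only, not part of the thesis code), $\Lambda$ and the strong-poisedness quantity defined next can be estimated by evaluating the regression Lagrange polynomials, obtained from the pseudoinverse of $M(\phi, X)$, on a dense sample of the region $B$. The basis map `phi` is assumed to be supplied (for instance, the `phi_quad` of the earlier sketch), and the random-sampling approximation of the maximum is an assumption of this sketch.

```python
import numpy as np

def estimate_poisedness(X, phi, radius=1.0, n_samples=5000, seed=0):
    """Estimate Lambda (max_i max_x |ell_i(x)|) and the strong-poisedness quantity
    (q1/sqrt(p1)) * max_x ||ell(x)|| for a regression sample set X over B(0, radius)."""
    rng = np.random.default_rng(seed)
    n = len(X[0])
    M = np.vstack([phi(x) for x in X])                 # M(phi, X), shape (p1, q1)
    A = np.linalg.pinv(M)                              # ell(x) = A^T phi(x)
    p1, q1 = M.shape
    Z = rng.standard_normal((n_samples, n))
    Z = radius * Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)
    L = np.vstack([A.T @ phi(z) for z in Z])           # row i holds ell(z_i)
    lam = np.abs(L).max()
    strong = (q1 / np.sqrt(p1)) * np.linalg.norm(L, axis=1).max()
    return lam, strong
```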
Note that if $|X| = |\mathcal{P}^d_n|$, the definitions for $\Lambda$-poised in the interpolation and regression sense coincide.

We can now examine the following bound from [7]:
\[
\|D^r f(x) - D^r m(x)\| \le \frac{1}{(d+1)!}\,\nu_d \sum_{i=0}^{p} \|x^i - x\|^{d+1}\,\|D^r \ell_i(x)\|,
\]
where $D^r f(x)$ is the $r$th derivative of $f$,
\[
D^r f(x)(z^1, \ldots, z^r) = \sum_{i_1,\ldots,i_r} \frac{\partial^r f(x)}{\partial x_{i_1}\cdots\partial x_{i_r}}\, z^1_{i_1}\cdots z^r_{i_r},
\]
and $\nu_d$ is an upper bound on $\|D^{d+1} f(x)\|$. If $r = 0$, this bound reduces to
\[
|f(x) - m(x)| \le \frac{1}{(d+1)!}\,(p+1)\,\nu_d\,\Lambda_b\,\Delta^{d+1}, \tag{2.2}
\]
where $\Lambda_b = \max_{0\le i\le p}\max_{x\in B}|\ell_i(x)|$, and $\Delta$ is the diameter of the smallest ball containing $X$. Therefore, if the number of points in $X$ is bounded, then $\Lambda$-poisedness is sufficient to derive bounds on the error between the regression or interpolating model and the function. That is, decreasing the radius of the sample set will provide bounds similar to Taylor bounds when derivatives are available. If using regression models with arbitrarily many points, $\Lambda$-poisedness is not enough to construct similar bounds. Strong $\Lambda$-poisedness can help in this case.

Definition 2.14  Let $\ell(x) = (\ell_0(x), \ldots, \ell_p(x))^T$ be the regression Lagrange polynomials associated with the set $Y = \{y^0, \ldots, y^p\}$. Let $\Lambda > 0$ and let $B$ be a set in $\mathbb{R}^n$. The set $Y$ is said to be strongly $\Lambda$-poised in $B$ in the regression sense if and only if
\[
\max_{x\in B}\|\ell(x)\| \le \frac{q_1}{\sqrt{p_1}}\,\Lambda,
\]
where $q_1 = |\mathcal{P}^2_n|$ and $p_1 = |Y|$.
Since we can rewrite (2.2) as
\[
|f(x) - m(x)| \le \frac{1}{(d+1)!}\,\sqrt{p+1}\,\nu_d\,\Lambda_{b,2}\,\Delta^{d+1},
\quad\text{where } \Lambda_{b,2} = \max_{x\in B}\|\ell(x)\|,
\]
strong $\Lambda$-poisedness provides Taylor-like error bounds between the regression model $m$ and the function $f$, even when the number of points in $X$ is unbounded.

As a final note, explicitly calculating the value of $\Lambda$ is computationally expensive, but not necessary. It is possible to use the condition number of the design matrix $M(\phi, X)$ to bound the constant $\Lambda$. Since it is possible to scale the condition number of $M(\phi, X)$ by shifting and scaling $X$, or by choosing a different basis, conditions must be placed on $M(\phi, X)$ before using its condition number. If we (1) use the standard basis (e.g., $\bar\phi = \{1, x_1, \ldots, x_n, \tfrac{1}{2}x_1^2, x_1x_2, \ldots, \tfrac{1}{2}x_n^2\}$ for $\mathcal{P}^d_n$), (2) shift the sample set $X$ so every point lies within the unit ball and (3) at least one point has norm 1, then it is possible to bound $\Lambda$ using the condition number of $M(\bar\phi, X)$. The next two theorems are for the interpolation and regression case, respectively.

Theorem 2.15  Let $\hat X$ denote the shifted and scaled version of $X$, so every point lies within the unit ball and at least one point has norm 1. Let $\hat M = M(\bar\phi, \hat X)$, where $\bar\phi$ is the standard basis. If $\hat M$ is nonsingular and $\|\hat M^{-1}\| \le \Lambda$, then the set $\hat X$ is $\sqrt{p_1}\,\Lambda$-poised in the unit ball. Conversely, if the set $\hat X$ is $\Lambda$-poised in the unit ball, then $\|\hat M^{-1}\| \le \theta\sqrt{p_1}\,\Lambda$, where $\theta > 0$ depends on $n$ and $d$ but is independent of $\hat X$ and $\Lambda$.

Theorem 2.16  Let $\hat X$ denote the shifted and scaled version of $X$, so every point lies within the unit ball and at least one point has norm 1. Let $\hat M = M(\bar\phi, \hat X)$, where $\bar\phi$ is the standard basis, and let $\hat U \hat\Sigma \hat V^T$ be the reduced singular value decomposition of $\hat M$. That is, $\hat\Sigma$ is the $q_1\times q_1$ diagonal matrix of singular values. If $\hat\Sigma$ is nonsingular and
$\|\hat\Sigma^{-1}\| \le \Lambda\sqrt{q_1/p_1}$, then the set $\hat X$ is strongly $\Lambda$-poised in the unit ball. Conversely, if the set $\hat X$ is $\Lambda$-poised in the unit ball, then $\|\hat\Sigma^{-1}\| \le \theta\,q_1\sqrt{p_1}\,\Lambda$, where $\theta$ depends on $n$ and $d$ but is independent of $\hat X$ and $\Lambda$.

2.2 Performance Profiles

Next, we explain the content of performance profiles, which are a compact method for comparing the performance of various algorithms on a set of problems. We will use Figure 2.1 as an example. Algorithms $A$, $B$, and $C$ have been run on an identical set of problems for the same number of function evaluations. The left axis shows the percentage of problems each algorithm solved first, where "solved" is user-defined. Often, an algorithm is considered to solve a problem when it first finds a function value within a tolerance of the best known solution. In the example, $A$ solves 20% of the problems first, $B$ solves 55% of the problems first, and $C$ solves 30% of the problems first. As these percentages total to over 100%, there is an overlap in the set of problems the algorithms solve first.

The right axis shows the percentage of problems solved by a given algorithm in the number of function evaluations given. All algorithms in Figure 2.1 solve over 90% of the problems in the testing set. Values between the left and right axes denote the percentage of problems solved as a multiple of the number of evaluations required for the fastest algorithm. For example, given 6 times as many iterations as the fastest algorithm on a problem, $A$ solves 80% of the problems in the testing set.

Formally, an algorithm is considered to solve a problem when it first finds a function value satisfying
\[
f(x_0) - f(x) \ge (1 - \tau)\bigl(f(x_0) - f_L\bigr),
\]
where $\tau > 0$ is a small tolerance, $f_L$ is the smallest found function value for any solver in a specified number of iterations, and $x_0$ is the initial point given to each algorithm.
Figure 2.1: An example of a performance profile.

If $t_{p,a}$ is the number of iterations needed for solver $a$ to solve problem $p$, then the performance ratio is defined by
\[
r_{p,a} = \frac{t_{p,a}}{\min_a\{t_{p,a}\}}.
\]
Then the performance profile of a solver $a$ is the fraction of problems where the performance ratio is at most $\alpha$. That is,
\[
\rho_a(\alpha) = \frac{\bigl|\{p \in P : r_{p,a} \le \alpha\}\bigr|}{|P|},
\]
where $P$ is the set of benchmark problems.
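A short sketch of how such a profile can be computed from a table of solver results is given below; the array layout and the treatment of unsolved problems (via `np.inf`) are assumptions of this sketch, not a description of any particular benchmarking code.

```python
import numpy as np

def performance_profile(t, alphas):
    """Fraction of problems each solver finishes within alpha times the best solver.

    t      : array of shape (n_problems, n_solvers); t[p, a] is the number of
             evaluations solver a needs to satisfy the convergence test on
             problem p (np.inf if it never does).
    alphas : 1-D array of ratio thresholds.
    Returns an array of shape (len(alphas), n_solvers) with rho_a(alpha).
    """
    t = np.asarray(t, float)
    best = np.min(t, axis=1, keepdims=True)   # fastest solver per problem
    ratios = t / best                         # r_{p,a}
    return np.array([(ratios <= a).mean(axis=0) for a in alphas])
```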
2.3 Probabilistic Convergence

Lastly, we define various forms of probabilistic convergence. A sequence $\{X_n\}$ of random variables is said to converge in distribution (or converge weakly, or converge in law) to a random variable $X$ if
\[
\lim_{n\to\infty} F_n(x) = F(x)
\]
for every number $x \in \mathbb{R}$ at which $F$ is continuous. Here $F_n$ and $F$ are the cumulative distribution functions of the random variables $X_n$ and $X$, respectively.

A sequence $\{X_n\}$ of random variables converges in probability to $X$ if for all $\epsilon > 0$,
\[
\lim_{n\to\infty} P\bigl(|X_n - X| \ge \epsilon\bigr) = 0.
\]
A sequence $\{X_n\}$ of random variables converges almost surely (or almost everywhere, or with probability 1, or strongly) towards $X$ if
\[
P\Bigl(\lim_{n\to\infty} X_n = X\Bigr) = 1.
\]
3. Derivative-free Optimization of Expensive Functions with Computational Error Using Weighted Regression

3.1 Introduction

In this chapter, we construct an algorithm designed to optimize functions evaluated by large computational codes, taking minutes, hours or even days for a single function call, for which derivative information is unavailable, and for which function evaluations are subject to computational error. Such error may be deterministic (arising, for example, from discretization error) or stochastic (for example, from Monte-Carlo simulation). Because function evaluations are extremely expensive, it is sensible to perform substantial work at each iteration to reduce the number of function evaluations required to obtain an optimum.

We assume that the accuracy of the function evaluation can vary from point to point, and that this variation can be quantified. In this chapter, we will use knowledge of this varying error to improve the performance of the algorithm by using weighted regression models in a trust region framework. By giving more accurate points more weight when constructing the trust region model, we hope that the models will more closely approximate the function being optimized. This leads to a better performing algorithm.

Our algorithm fits within the CSV2-framework, which is outlined in Chapter 2. To specify an algorithm within this framework, three things are required:

1. Define the class of model functions $\mathcal{M}$. This is determined by the method for constructing models from the sample set. In [10] models were constructed using interpolation, least squares regression, and minimum Frobenius norm methods. We describe the general form of our weighted regression models in §3.2 and propose a specific weighting scheme in §3.5.

2. Define a model-improvement algorithm. §3.4 describes our model improvement algorithm which tests the geometry of the sample set, and if necessary, adds
and/or deletes points to ensure that the model function constructed from the sample set satisfies the error bounds in Definition 2.2 (i.e., it is $\kappa$-fully quadratic).

3. Demonstrate that the model-improvement algorithm satisfies the requirements for the definition of a class of fully quadratic models. For our algorithm, this is discussed in §3.4.

The chapter is organized as follows. We place our algorithm in the CSV2-framework by describing (1) how model functions are constructed (§3.2), and (2) a model improvement algorithm (§3.4). Before describing the model improvement algorithm, we first extend the theory of $\Lambda$-poisedness to the weighted regression framework (§3.3). Computational results are presented in §3.5 using a heuristic weighting scheme, which is described in that section. §3.6 concludes the chapter.

3.2 Model Construction

This section describes how we construct the model function $m_k$ at the $k$th iteration. For simplicity, we drop the subscript $k$ for the remainder of this section. Let $\tilde f = (\tilde f_0, \ldots, \tilde f_p)^T$, where $\tilde f_i$ denotes the computed function value at $y^i$, and let $\epsilon_i$ denote the associated computational error. That is,
\[
\tilde f_i = f(y^i) + \epsilon_i. \tag{3.1}
\]
Let $w = (w_0, \ldots, w_p)^T$ be a vector of positive weights for the set of points $Y = \{y^0, \ldots, y^p\}$. A quadratic polynomial $m$ is said to be a weighted least squares approximation of $\tilde f$ with respect to $w$ if it minimizes
\[
\sum_{i=0}^{p} w_i^2 \bigl(m(y^i) - \tilde f_i\bigr)^2 = \bigl\|W\bigl(m(Y) - \tilde f\bigr)\bigr\|^2,
\]
where $m(Y)$ denotes the vector $(m(y^0), m(y^1), \ldots, m(y^p))^T$ and $W = \mathrm{diag}(w)$. In this case, we write
\[
W\,m(Y) \stackrel{\ell.s.}{=} W\tilde f. \tag{3.2}
\]
Let $\phi = \{\phi_0, \phi_1, \ldots, \phi_q\}$ be a basis for the quadratic polynomials in $\mathbb{R}^n$. For example, $\phi$ might be the monomial basis
\[
\phi = \{1, x_1, x_2, \ldots, x_n, x_1^2/2, x_1x_2, \ldots, x_{n-1}x_n, x_n^2/2\}. \tag{3.3}
\]
Define
\[
M(\phi, Y) =
\begin{bmatrix}
\phi_0(y^0) & \phi_1(y^0) & \cdots & \phi_q(y^0)\\
\phi_0(y^1) & \phi_1(y^1) & \cdots & \phi_q(y^1)\\
\vdots & \vdots & \ddots & \vdots\\
\phi_0(y^p) & \phi_1(y^p) & \cdots & \phi_q(y^p)
\end{bmatrix}.
\]
Since $\phi$ is a basis for the quadratic polynomials, the model function can be written $m(x) = \sum_{i=0}^{q} \alpha_i \phi_i(x)$. The coefficients $\alpha = (\alpha_0, \ldots, \alpha_q)^T$ solve the weighted least squares regression problem
\[
W M(\phi, Y)\,\alpha \stackrel{\ell.s.}{=} W\tilde f. \tag{3.4}
\]
If $M(\phi, Y)$ has full column rank, the sample set $Y$ is said to be poised for quadratic regression. The following lemma is a straightforward generalization of [12, Lemma 4.3]:

Lemma 3.1  If $Y$ is poised for quadratic regression, then the weighted least squares regression polynomial with respect to positive weights $w = (w_0, \ldots, w_p)$ exists, is unique, and is given by $m(x) = \phi(x)^T\alpha$, where
\[
\alpha = (WM)^{+} W\tilde f = (M^T W^2 M)^{-1} M^T W^2 \tilde f, \tag{3.5}
\]
where $W = \mathrm{diag}(w)$ and $M = M(\phi, Y)$.

Proof: Since $W$ and $M$ both have full column rank, so does $WM$. Thus, the least squares problem (3.4) has a unique solution given by $(WM)^{+}W\tilde f$. Moreover, since $WM$ has full column rank, $(WM)^{+} = \bigl((WM)^T WM\bigr)^{-1} M^T W$.
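A minimal numerical sketch of Lemma 3.1 follows; `phi` is an assumed basis map (such as the `phi_quad` defined earlier), and the use of `pinv` to form $(WM)^{+}$ is a simplification for illustration, not the thesis's implementation.

```python
import numpy as np

def weighted_regression_model(Y, f_tilde, w, phi):
    """alpha = (W M)^+ W f_tilde, as in (3.5); returns alpha and the model m(x)."""
    M = np.vstack([phi(y) for y in Y])
    W = np.diag(np.asarray(w, float))
    alpha = np.linalg.pinv(W @ M) @ (W @ np.asarray(f_tilde, float))
    return alpha, (lambda x: phi(x) @ alpha)
```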
3.3 Error Analysis and the Geometry of the Sample Set

Throughout this section, $Y = \{y^0, \ldots, y^p\}$ denotes the sample set, $p_1 = p+1$, $w \in \mathbb{R}^{p_1}_{+}$ is a vector of positive weights, $W = \mathrm{diag}(w)$, and $M = M(\phi, Y)$. $\tilde f$ denotes the vector of computed function values at the points in $Y$, as defined by (3.1).

The accuracy of the model function $m_k$ relies critically on the geometry of the sample set. In this section, we generalize the theory of poisedness from [12] to the weighted regression framework. This section also includes error analysis which extends results from [12] to weighted regression, as well as considering the impact of computational error on the error bounds. We start by defining weighted regression Lagrange polynomials.

3.3.1 Weighted Regression Lagrange Polynomials

Definition 3.2  A set of polynomials $\ell_j(x)$, $j = 0, \ldots, p$, in $\mathcal{P}^d_n$ are called weighted regression Lagrange polynomials with respect to the weights $w$ and sample set $Y$ if for each $j$,
\[
W\,\ell_j(Y) \stackrel{\ell.s.}{=} W e_j,
\]
where $\ell_j(Y) = [\ell_j(y^0), \ldots, \ell_j(y^p)]^T$.

The following lemma is a direct application of Lemma 3.1.

Lemma 3.3  Let $\phi(x) = (\phi_0(x), \ldots, \phi_q(x))^T$. If $Y$ is poised, then the set of weighted regression Lagrange polynomials exists and is unique, and is given by $\ell_j(x) = \phi(x)^T a_j$, $j = 0, \ldots, p$, where $a_j$ is the $j$th column of the matrix
\[
A = (WM)^{+} W. \tag{3.6}
\]
Proof: Note that $m = \ell_j$ satisfies (3.2) with $\tilde f = e_j$. By Lemma 3.1, $\ell_j(x) = \phi(x)^T a_j$ where $a_j = (WM)^{+}W e_j$, which is the $j$th column of $(WM)^{+}W$.
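Complementing the previous sketch, the matrix $A$ of (3.6) can be formed directly, and evaluating all of the weighted regression Lagrange polynomials at a point then amounts to one matrix-vector product; again, `phi` and the helper names are assumptions of this sketch.

```python
import numpy as np

def weighted_lagrange_matrix(Y, w, phi):
    """A = (W M)^+ W; column j holds the basis coefficients of ell_j."""
    M = np.vstack([phi(y) for y in Y])
    W = np.diag(np.asarray(w, float))
    return np.linalg.pinv(W @ M) @ W

def lagrange_values(A, phi, x):
    """Return the vector (ell_0(x), ..., ell_p(x)) = A^T phi(x)."""
    return A.T @ phi(x)
```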
The following lemma is based on [12, Lemma 4.6].

Lemma 3.4  If $Y$ is poised, then the model function defined by (3.2) satisfies
\[
m(x) = \sum_{i=0}^{p} \tilde f_i\,\ell_i(x),
\]
where $\ell_j(x)$, $j = 0, \ldots, p$, denote the weighted regression Lagrange polynomials corresponding to $Y$ and $W$.

Proof: By Lemma 3.1, $m(x) = \phi(x)^T\alpha$, where $\alpha = (WM)^{+}W\tilde f = A\tilde f$ for $A$ defined by (3.6). Let $\ell(x) = [\ell_0(x), \ldots, \ell_p(x)]^T$. Then by Lemma 3.3,
\[
m(x) = \phi(x)^T A\tilde f = \tilde f^T \ell(x) = \sum_{i=0}^{p} \tilde f_i\,\ell_i(x).
\]

3.3.2 Error Analysis

For the remainder of this chapter, let $\hat Y = \{\hat y^0, \ldots, \hat y^p\}$ denote the shifted and scaled sample set, where $\hat y^i = (y^i - y^0)/R$ and $R = \max_i \|y^i - y^0\|$. Note that $\hat y^0 = 0$ and $\max_i\|\hat y^i\| = 1$. Any analysis of $Y$ can be directly related to $\hat Y$ by the following lemma:

Lemma 3.5  Define the basis $\hat\phi = \{\hat\phi_0(x), \ldots, \hat\phi_q(x)\}$, where $\hat\phi_i(x) = \phi_i(Rx + y^0)$, $i = 0, \ldots, q$, and $\phi$ is the monomial basis. Let $\{\ell_0(x), \ldots, \ell_p(x)\}$ be weighted regression Lagrange polynomials for $Y$ and $\{\hat\ell_0(x), \ldots, \hat\ell_p(x)\}$ be weighted regression Lagrange polynomials for $\hat Y$. Then $M(\phi, Y) = M(\hat\phi, \hat Y)$. If $Y$ is poised, then
\[
\ell(x) = \hat\ell\Bigl(\frac{x - y^0}{R}\Bigr).
\]
Proof: Observe that
\[
M(\phi, Y) =
\begin{bmatrix}
\phi_0(y^0) & \cdots & \phi_q(y^0)\\
\vdots & \ddots & \vdots\\
\phi_0(y^p) & \cdots & \phi_q(y^p)
\end{bmatrix}
=
\begin{bmatrix}
\hat\phi_0(\hat y^0) & \cdots & \hat\phi_q(\hat y^0)\\
\vdots & \ddots & \vdots\\
\hat\phi_0(\hat y^p) & \cdots & \hat\phi_q(\hat y^p)
\end{bmatrix}
= M(\hat\phi, \hat Y).
\]
By the definition of poisedness, $Y$ is poised if and only if $\hat Y$ is poised. Let $\phi(x) = (\phi_0(x), \ldots, \phi_q(x))^T$ and $\hat\phi(x) = (\hat\phi_0(x), \ldots, \hat\phi_q(x))^T$. Then
\[
\hat\phi\Bigl(\frac{x - y^0}{R}\Bigr)
= \begin{bmatrix}\hat\phi_0\bigl(\frac{x-y^0}{R}\bigr)\\ \vdots\\ \hat\phi_q\bigl(\frac{x-y^0}{R}\bigr)\end{bmatrix}
= \begin{bmatrix}\phi_0(x)\\ \vdots\\ \phi_q(x)\end{bmatrix}
= \phi(x).
\]
By Lemma 3.3, if $Y$ is poised, then
\[
\ell(x)^T = \phi(x)^T\bigl(W M(\phi,Y)\bigr)^{+}W = \hat\phi(\hat x)^T\bigl(W M(\hat\phi,\hat Y)\bigr)^{+}W = \hat\ell(\hat x)^T,
\]
where $\hat x = (x - y^0)/R$.

Let $\tilde f_i$ be defined by (3.1) and let $\Omega$ be an open convex set containing $Y$. If $f$ is $C^2$ on $\Omega$, then by Taylor's theorem, for each sample point $y^i \in Y$ and a fixed $x \in \mathrm{conv}(Y)$, there exists a point $\eta_i(x)$ on the line segment connecting $x$ to $y^i$ such that
\[
\begin{aligned}
\tilde f_i &= f(y^i) + \epsilon_i\\
&= f(x) + \nabla f(x)^T(y^i - x) + \tfrac{1}{2}(y^i - x)^T\nabla^2 f(\eta_i(x))(y^i - x) + \epsilon_i\\
&= f(x) + \nabla f(x)^T(y^i - x) + \tfrac{1}{2}(y^i - x)^T\nabla^2 f(x)(y^i - x)\\
&\qquad + \tfrac{1}{2}(y^i - x)^T H_i(x)(y^i - x) + \epsilon_i,
\end{aligned} \tag{3.7}
\]
where $H_i(x) = \nabla^2 f(\eta_i(x)) - \nabla^2 f(x)$.

Let $\{\ell_i(x)\}$ denote the weighted regression Lagrange polynomials associated with $Y$. The following lemma and proof are inspired by [7, Theorem 1]:

Lemma 3.6  Let $f$ be twice continuously differentiable on $\Omega$ and let $m(x)$ denote the quadratic function determined by weighted regression. Then, for any $x \in \Omega$ the following identities hold:

- $m(x) = f(x) + \tfrac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,\ell_i(x) + \sum_{i=0}^{p}\epsilon_i\,\ell_i(x)$,
- $\nabla m(x) = \nabla f(x) + \tfrac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,\nabla\ell_i(x) + \sum_{i=0}^{p}\epsilon_i\,\nabla\ell_i(x)$,
- $\nabla^2 m(x) = \nabla^2 f(x) + \tfrac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,\nabla^2\ell_i(x) + \sum_{i=0}^{p}\epsilon_i\,\nabla^2\ell_i(x)$,

where $H_i(x) = \nabla^2 f(\eta_i(x)) - \nabla^2 f(x)$ for some point $\eta_i(x) = \theta x + (1-\theta)y^i$, $0 \le \theta \le 1$, on the line segment connecting $x$ to $y^i$.

Proof: Let $D$ denote the differential operator as defined in [7], where $D^j$ is the $j$th derivative of a function in $C^i$ with $i \ge j$. In particular, $D^0 f(x) = f(x)$, $D^1 f(x)(z) = \nabla f(x)^T z$, and $D^2 f(x)(z)^2 = z^T\nabla^2 f(x)z$. By Lemma 3.4, $m(x) = \sum_{i=0}^{p}\tilde f_i\,\ell_i(x)$; so for $h = 0, 1$, or 2,
\[
D^h m(x) = \sum_{i=0}^{p}\tilde f_i\,D^h\ell_i(x).
\]
Substituting (3.7) for $\tilde f_i$ in the above equation yields
\[
D^h m(x) = \sum_{j=0}^{2}\frac{1}{j!}\sum_{i=0}^{p}D^j f(x)(y^i - x)^j\,D^h\ell_i(x)
+ \frac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,D^h\ell_i(x)
+ \sum_{i=0}^{p}\epsilon_i\,D^h\ell_i(x), \tag{3.8}
\]
where $H_i(x) = \nabla^2 f(\eta_i(x)) - \nabla^2 f(x)$ for some point $\eta_i(x)$ on the line segment connecting $x$ to $y^i$. Consider the first term on the right hand side above. We shall show that
\[
\frac{1}{j!}\sum_{i=0}^{p}D^j f(x)(y^i - x)^j\,D^h\ell_i(x) =
\begin{cases}
D^h f(x) & \text{for } j = h,\\
0 & \text{for } j \ne h,
\end{cases} \tag{3.9}
\]
for $j = 0, 1, 2, \ldots$. Let $B_j = D^j f(x)$, and let $g_j: \mathbb{R}^n \to \mathbb{R}$ be the polynomial defined by $g_j(z) = \frac{1}{j!}B_j(z - x)^j$. Observe that $D^j g_j(x) = B_j$ and $D^h g_j(x) = 0$ for $h \ne j$. Since $g_j$ has degree $j \le 2$, the weighted least squares approximation of $g_j$ by a quadratic polynomial is $g_j$ itself. Thus, by Lemma 3.4 and the definition of $g_j$,
\[
g_j(z) = \sum_{i=0}^{p}g_j(y^i)\,\ell_i(z) = \frac{1}{j!}\sum_{i=0}^{p}B_j(y^i - x)^j\,\ell_i(z). \tag{3.10}
\]
Applying the differential operator $D^h$ with respect to $z$ yields
\[
D^h g_j(z) = \frac{1}{j!}\sum_{i=0}^{p}B_j(y^i - x)^j\,D^h\ell_i(z) = \frac{1}{j!}\sum_{i=0}^{p}D^j f(x)(y^i - x)^j\,D^h\ell_i(z).
\]
Letting $z = x$, the expression on the right is identical to the left side of (3.9). This proves (3.9), since $D^h g_j(x) = 0$ for $j \ne h$ and $D^j g_j(x) = B_j$ for $j = h$. By (3.9), (3.8) reduces to
\[
D^h m(x) = D^h f(x) + \frac{1}{2}\sum_{i=0}^{p}(y^i - x)^T H_i(x)(y^i - x)\,D^h\ell_i(x) + \sum_{i=0}^{p}\epsilon_i\,D^h\ell_i(x).
\]
Applying this with $h = 0, 1, 2$ proves the lemma.

Since $\|H_i(x)\| \le L\|y^i - x\|$ by the Lipschitz continuity of $\nabla^2 f$, the following is a direct consequence of Lemma 3.6.

Corollary 3.7  Let $f$ satisfy Assumption 2.1 for some convex set $\Omega$, and let $m(x)$ denote the quadratic function determined by weighted regression. Then, for any $x \in \Omega$ the following error bounds hold:

- $|f(x) - m(x)| \le \sum_{i=0}^{p}\bigl(\tfrac{L}{2}\|y^i - x\|^3 + |\epsilon_i|\bigr)\,|\ell_i(x)|$,
- $\|\nabla f(x) - \nabla m(x)\| \le \sum_{i=0}^{p}\bigl(\tfrac{L}{2}\|y^i - x\|^3 + |\epsilon_i|\bigr)\,\|\nabla\ell_i(x)\|$,
- $\|\nabla^2 f(x) - \nabla^2 m(x)\| \le \sum_{i=0}^{p}\bigl(\tfrac{L}{2}\|y^i - x\|^3 + |\epsilon_i|\bigr)\,\|\nabla^2\ell_i(x)\|$.

Using this corollary, the following result provides error bounds between the function and the model in terms of the sample set radius.

Corollary 3.8  Let $Y$ be poised, and let $R = \max_i\|y^i - y^0\|$. Suppose $|\epsilon_i| \le \epsilon$ for $i = 0, \ldots, p$. If $f$ satisfies Assumption 2.1 with Lipschitz constant $L$, then there exist constants $\kappa_1$, $\kappa_2$, and $\kappa_3$, independent of $R$, such that for all $x \in B(y^0; R)$,

- $|f(x) - m(x)| \le \kappa_1\sqrt{p_1}\,(LR^3 + \epsilon)$,
- $\|\nabla f(x) - \nabla m(x)\| \le \kappa_2\sqrt{p_1}\,(LR^2 + \epsilon/R)$,
- $\|\nabla^2 f(x) - \nabla^2 m(x)\| \le \kappa_3\sqrt{p_1}\,(LR + \epsilon/R^2)$.

Proof: Let $\{\hat\ell_0(x), \ldots, \hat\ell_p(x)\}$ be the Lagrange polynomials generated by the shifted and scaled set $\hat Y$, and let $\{\ell_0(x), \ldots, \ell_p(x)\}$ be the Lagrange polynomials generated by the set $Y$. By Lemma 3.5, for each $x \in B(y^0; R)$, $\ell_i(x) = \hat\ell_i(\hat x)$ for all $i$, where $\hat x = (x - y^0)/R$. Thus, $\nabla\ell_i(x) = \nabla\hat\ell_i(\hat x)/R$ and $\nabla^2\ell_i(x) = \nabla^2\hat\ell_i(\hat x)/R^2$. Let $\hat\ell(x) = [\hat\ell_0(x), \ldots, \hat\ell_p(x)]^T$, $\hat g(x) = [\nabla\hat\ell_0(x), \ldots, \nabla\hat\ell_p(x)]^T$, and $\hat h(x) = [\nabla^2\hat\ell_0(x), \ldots, \nabla^2\hat\ell_p(x)]^T$.

By Corollary 3.7,
\[
|f(x) - m(x)| \le \sum_{i=0}^{p}\Bigl(\tfrac{L}{2}\|y^i - x\|^3 + |\epsilon_i|\Bigr)|\ell_i(x)|
\le \sum_{i=0}^{p}\bigl(4LR^3 + \epsilon\bigr)|\ell_i(x)|
\]
(since $\|y^i - x\| \le 2R$ and $|\epsilon_i| \le \epsilon$)
\[
= \bigl(4LR^3 + \epsilon\bigr)\|\hat\ell(\hat x)\|_1 \le \sqrt{p_1}\,\bigl(4LR^3 + \epsilon\bigr)\|\hat\ell(\hat x)\|,
\]
since for $x \in \mathbb{R}^n$, $\|x\|_1 \le \sqrt{n}\,\|x\|_2$. Similarly,
\[
\|\nabla f(x) - \nabla m(x)\| \le \sqrt{p_1}\,\bigl(4LR^2 + \epsilon/R\bigr)\|\hat g(\hat x)\|
\quad\text{and}\quad
\|\nabla^2 f(x) - \nabla^2 m(x)\| \le \sqrt{p_1}\,\bigl(4LR + \epsilon/R^2\bigr)\|\hat h(\hat x)\|.
\]
Setting $\kappa_1 = \max_{x\in B(0;1)}\|\hat\ell(x)\|$, $\kappa_2 = \max_{x\in B(0;1)}\|\hat g(x)\|$, and $\kappa_3 = \max_{x\in B(0;1)}\|\hat h(x)\|$ yields the desired result.

Note the similarity between these error bounds and those in the definition of fully quadratic models. If there is no computational error, or if the error is $O(\Delta^3)$, fully quadratic models (for some fixed $\kappa$) can be obtained by controlling the geometry of the sample set so that $\kappa_i\sqrt{p_1}$, $i = 1, 2, 3$, are bounded by fixed constants, and by controlling the trust region radius so that $R/\Delta$ is bounded. This motivates the definitions of $\Lambda$-poised and strongly $\Lambda$-poised in the weighted regression sense in the next section.
3.3.3 $\Lambda$-poisedness in the Weighted Regression Sense

In this section, we restrict our attention to the monomial basis $\phi$ defined in (3.3). In order to produce accurate model functions, the points in the sample set need to be distributed in such a way that the matrix $M = M(\phi, Y)$ is sufficiently well-conditioned. This is the motivation behind the following definitions of $\Lambda$-poised and strongly $\Lambda$-poised sets. These definitions are identical to [12, Definitions 4.7, 4.10] except that the Lagrange polynomials in the definitions are weighted regression Lagrange polynomials.

Definition 3.9  Let $\Lambda > 0$ and let $B$ be a set in $\mathbb{R}^n$. Let $w = (w_0, \ldots, w_p)$ be a vector of positive weights, $Y = \{y^0, \ldots, y^p\}$ be a poised set, and let $\{\ell_0, \ldots, \ell_p\}$ be the associated weighted regression Lagrange polynomials. Let $\ell(x) = (\ell_0(x), \ldots, \ell_p(x))^T$ and $q_1 = |\mathcal{P}^2_n|$.

- $Y$ is said to be $\Lambda$-poised in $B$ in the weighted regression sense if and only if
\[
\max_{x\in B}\ \max_{0\le i\le p}|\ell_i(x)| \le \Lambda.
\]
- $Y$ is said to be strongly $\Lambda$-poised in $B$ in the weighted regression sense if and only if
\[
\max_{x\in B}\|\ell(x)\| \le \frac{q_1}{\sqrt{p_1}}\,\Lambda.
\]

Note that if the weights are all equal, the above definitions are equivalent to those for $\Lambda$-poised and strongly $\Lambda$-poised given in [12].

We are naturally interested in using these weighted regression Lagrange polynomials to form models that are guaranteed to sufficiently approximate $f$. Let $Y_k$, $\Delta_k$, and $R_k$ denote the sample set, trust region radius, and sample set radius at iteration $k$, as defined at the beginning of §3.3.2. Assume that $R_k/\Delta_k$ is bounded. If the number of sample points is bounded, it can be shown, using Corollary 3.8, that if $Y_k$ is $\Lambda$-poised for all $k$, then the corresponding model functions are $\kappa$-fully quadratic, assuming no
computational error, or that the computational error is $O(\Delta^3)$. When the number of sample points is not bounded, $\Lambda$-poisedness is not enough. In the following, we show that if $Y_k$ is strongly $\Lambda$-poised for all $k$, then the corresponding models are $\kappa$-fully quadratic, regardless of the number of points in $Y_k$.

Lemma 3.10  Let $\hat M = M(\phi, \hat Y)$. If $\|(W\hat M)^{+}W\| \le \Lambda\sqrt{q_1/p_1}$, then $\hat Y$ is strongly $\Lambda$-poised in $B(0;1)$ in the weighted regression sense, with respect to the weights $w$. Conversely, if $\hat Y$ is strongly $\Lambda$-poised in $B(0;1)$ in the weighted regression sense, then
\[
\|(W\hat M)^{+}W\| \le \theta\,\frac{q_1}{\sqrt{p_1}}\,\Lambda,
\]
where $\theta > 0$ is a fixed constant dependent only on $n$ but independent of $\hat Y$ and $\Lambda$.

Proof: Let $A = (W\hat M)^{+}W$ and $\ell(x) = (\ell_0(x), \ldots, \ell_p(x))^T$. By Lemma 3.3, $\ell(x) = A^T\phi(x)$. It follows that for any $x \in B(0;1)$,
\[
\|\ell(x)\| = \|A^T\phi(x)\| \le \|A\|\,\|\phi(x)\| \le \Lambda\sqrt{q_1/p_1}\,\sqrt{q_1}\,\|\phi(x)\|_\infty \le \frac{q_1}{\sqrt{p_1}}\,\Lambda.
\]
For the last inequality, we used the fact that $\max_{x\in B(0;1)}\|\phi(x)\|_\infty \le 1$.

To prove the converse, let $U\Sigma V^T = A^T$ be the reduced singular value decomposition of $A^T$. Note that $U$ and $V$ are $p_1\times q_1$ and $q_1\times q_1$ matrices, respectively, with orthonormal columns; $\Sigma$ is a $q_1\times q_1$ diagonal matrix whose diagonal entries are the singular values of $A^T$. Let $\sigma_1$ be the largest singular value, with $v^1$ the corresponding column of $V$. As shown in the proof of [10, Theorem 2.9], there exists a constant $\gamma > 0$ such that for any unit vector $v$, there exists an $x \in B(0;1)$ such that $|v^T\phi(x)| \ge \gamma$. Therefore, since $\|v^1\| = 1$, there is an $x \in B(0;1)$ such that $|(v^1)^T\phi(x)| \ge \gamma$. Let $v^{\perp}$ be the orthogonal projection of $\phi(x)$ onto the subspace orthogonal to $v^1$; so $\phi(x) = \bigl((v^1)^T\phi(x)\bigr)v^1 + v^{\perp}$. Note that $V^T v^1$ and $V^T v^{\perp}$ are orthogonal vectors. Note also that for any vector $z$, $\|U\Sigma V^T z\| = \|\Sigma V^T z\|$ since $U$
has orthonormal columns. It follows that
\[
\|\ell(x)\| = \|A^T\phi(x)\| = \|\Sigma V^T\phi(x)\|
= \Bigl(\|\Sigma V^T v^{\perp}\|^2 + \bigl\|\Sigma V^T\bigl((v^1)^T\phi(x)\bigr)v^1\bigr\|^2\Bigr)^{1/2}
\ge |(v^1)^T\phi(x)|\,\|\Sigma V^T v^1\| \ge \gamma\,\|\Sigma e_1\| = \gamma\,\sigma_1 = \gamma\,\|A\|.
\]
Thus, $\gamma\|A\| \le \max_{x\in B(0;1)}\|\ell(x)\| \le \frac{q_1}{\sqrt{p_1}}\Lambda$, which proves the result with $\theta = 1/\gamma$.

We can now prove that models generated by weighted regression Lagrange polynomials are $\kappa$-fully quadratic.

Proposition 3.11  Let $f$ satisfy Assumption 2.1 and let $\Lambda > 0$ be fixed. There exists a vector $\kappa = (\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, 0)$ such that for any $y^0 \in S$ and $\Delta \le \Delta_{max}$, if

1. $Y = \{y^0, \ldots, y^p\} \subset B(y^0;\Delta)$ is strongly $\Lambda$-poised in $B(y^0;\Delta)$ in the weighted regression sense with respect to positive weights $w = \{w_0, \ldots, w_p\}$, and
2. the computational error $|\epsilon_i|$ is bounded by $C\Delta^3$, where $C$ is a fixed constant,

then the corresponding model function $m$ is $\kappa$-fully quadratic.

Proof: Let $\hat x$, $\hat\ell$, $\hat g$, $\hat h$, $\kappa_1$, $\kappa_2$, and $\kappa_3$ be as defined in the proof of Corollary 3.8. Let $\hat M = M(\phi, \hat Y)$ and $W = \mathrm{diag}(w)$. By Lemma 3.3, $\hat\ell(x) = A^T\phi(x)$, where $A = (W\hat M)^{+}W$. By Lemma 3.10, $\|A\| \le \theta q_1\Lambda/\sqrt{p_1}$, where $\theta$ is a fixed constant. It follows that
\[
\kappa_1 = \max_{x\in B(0;1)}\|\hat\ell(x)\| \le \max_{x\in B(0;1)}\|A\|\,\|\phi(x)\| \le c_1\theta\,\frac{q_1}{\sqrt{p_1}}\,\Lambda,
\]
where $c_1 = \max_{x\in B(0;1)}\|\phi(x)\|$ is a constant independent of $Y$. Similarly,
\[
\kappa_2 = \max_{x\in B(0;1)}\|\hat g(x)\| = \max_{x\in B(0;1)}\bigl\|\bigl(\|\nabla\hat\ell_0(x)\|, \ldots, \|\nabla\hat\ell_p(x)\|\bigr)\bigr\|
= \max_{x\in B(0;1)}\|A^T\nabla\phi(x)\|_F
\le \sqrt{q_1}\max_{x\in B(0;1)}\|A\|\,\|\nabla\phi(x)\| \le c_2\theta\,\frac{q_1^{3/2}}{\sqrt{p_1}}\,\Lambda,
\]
where $c_2 = \max_{x\in B(0;1)}\|\nabla\phi(x)\|$ is independent of $Y$.

To bound $\kappa_3$, let $J(s,t)$ denote the unique index $j$ such that $x_s$ and $x_t$ both appear in the quadratic monomial $\phi_j(x)$. For example, $J(1,1) = n+2$, $J(1,2) = J(2,1) = n+3$, etc. Observe that
\[
\bigl(\nabla^2\phi_j(x)\bigr)_{s,t} = \begin{cases}1 & \text{if } j = J(s,t),\\ 0 & \text{otherwise.}\end{cases}
\]
It follows that $\nabla^2\hat\ell_i(x) = \sum_{j=0}^{q}(A^T)_{i,j}\nabla^2\phi_j(x)$ is the $n\times n$ matrix whose $(s,t)$ entry is $(A^T)_{i,J(s,t)}$. We conclude that $\|\nabla^2\hat\ell_i(x)\| \le \|\nabla^2\hat\ell_i(x)\|_F \le \sqrt{2}\,\|(A^T)_{i,\cdot}\|$. Thus,
\[
\kappa_3 = \max_{x\in B(0;1)}\|\hat h(x)\| = \max_{x\in B(0;1)}\bigl\|\bigl(\|\nabla^2\hat\ell_0(x)\|, \ldots, \|\nabla^2\hat\ell_p(x)\|\bigr)\bigr\|
\le \sqrt{2\sum_{i=0}^{p}\|(A^T)_{i,\cdot}\|^2} = \sqrt{2}\,\|A\|_F \le \sqrt{2q_1}\,\|A\| \le \sqrt{2}\,\theta\,\frac{q_1^{3/2}}{\sqrt{p_1}}\,\Lambda.
\]
By assumption, the computational error $|\epsilon_i|$ is bounded by $\epsilon = C\Delta^3$. So, by Corollary 3.8, for all $x \in B(y^0;\Delta)$,

- $|f(x) - m(x)| \le \kappa_1\sqrt{p_1}\,(L + C)\Delta^3 \le c_1\theta q_1\Lambda(L + C)\Delta^3 = \kappa_{ef}\Delta^3$,
- $\|\nabla f(x) - \nabla m(x)\| \le \kappa_2\sqrt{p_1}\,(L + C)\Delta^2 \le c_2\theta q_1^{3/2}\Lambda(L + C)\Delta^2 = \kappa_{eg}\Delta^2$,
- $\|\nabla^2 f(x) - \nabla^2 m(x)\| \le \kappa_3\sqrt{p_1}\,(L + C)\Delta \le \sqrt{2}\,\theta q_1^{3/2}\Lambda(L + C)\Delta = \kappa_{eh}\Delta$,

where
\[
\kappa_{ef} = c_1\theta q_1\Lambda(L + C), \qquad
\kappa_{eg} = c_2\theta q_1^{3/2}\Lambda(L + C), \qquad
\kappa_{eh} = \sqrt{2}\,\theta q_1^{3/2}\Lambda(L + C).
\]
Thus, $m(x)$ is $(\kappa_{ef}, \kappa_{eg}, \kappa_{eh}, 0)$-fully quadratic, and since these constants are independent of $y^0$ and $\Delta$, the result is proven.

The final step in establishing that we have a fully quadratic class of models is to define an algorithm that produces a strongly $\Lambda$-poised sample set in a finite number of steps.
Proposition 3.12  Let $\hat Y = \{y^0, \ldots, y^p\} \subset \mathbb{R}^n$ be a set of points in the unit ball $B(0;1)$ such that $\|y^j\| = 1$ for at least one $j$. Let $w = (w_0, \ldots, w_p)^T$ be a vector of positive weights. If $\hat Y$ is strongly $\Lambda$-poised in $B(0;1)$ in the sense of unweighted regression, then there exists a constant $\zeta > 0$, independent of $\hat Y$, $\Lambda$ and $w$, such that $\hat Y$ is strongly $\bigl(\zeta\,\mathrm{cond}(W)\,\Lambda\bigr)$-poised in the weighted regression sense.

Proof: Let $\hat M = M(\phi, \hat Y)$, where $\phi$ is the monomial basis. By Lemma 3.10 applied with unit weights, $\|\hat M^{+}\| \le \theta\Lambda q_1/\sqrt{p_1}$, where $\theta$ is a constant independent of $\hat Y$ and $\Lambda$. Thus,
\[
\|(W\hat M)^{+}W\| \le \mathrm{cond}(W)\,\|\hat M^{+}\| \le \mathrm{cond}(W)\,\theta\Lambda\,\frac{q_1}{\sqrt{p_1}},
\]
where the first inequality results from
\[
\|(W\hat M)^{+}\| = \frac{1}{\sigma_{\min}(W\hat M)} \le \frac{1}{\sigma_{\min}(\hat M)\,\sigma_{\min}(W)} = \|\hat M^{+}\|\,\|W^{-1}\|.
\]
The result follows with $\zeta = \theta\sqrt{q_1}$.

The significance of this proposition is that any model improvement algorithm for unweighted regression can be used for weighted regression to ensure the same global convergence properties, provided $\mathrm{cond}(W)$ is bounded. For the model improvement algorithm described in the following section, this requirement is satisfied by bounding the weights away from zero while keeping the largest weight equal to 1.

In practice, we need not ensure $\Lambda$-poisedness of $Y_k$ at every iterate to guarantee that the algorithm converges to a second-order minimum. Rather, $\Lambda$-poisedness only needs to be enforced as the algorithm stagnates.

3.4 Model Improvement Algorithm

This section describes a model improvement algorithm (MIA) for regression which, by the preceding section, can also be used for weighted regression to ensure that the sample sets are strongly $\Lambda$-poised for some fixed $\Lambda$ (which is not necessarily
known). The algorithm is based on the following observation, which is a straightforward extension of [12, Theorem 4.11].

The MIA presented in [12] makes assumptions (such as requiring all points to lie within $B(y^0;\Delta)$) to simplify the theory. We resist such assumptions to account for practical concerns (points which lie outside of $B(y^0;\Delta)$) that arise in the algorithm.

Proposition 3.13  If the shifted and scaled sample set $\hat Y$ of $p_1$ points contains $l = \lfloor p_1/q_1\rfloor$ disjoint subsets of $q_1$ points, each of which is $\Lambda$-poised in $B(0;1)$ in the interpolation sense, then $\hat Y$ is strongly $\sqrt{\tfrac{l+1}{l}}\,\Lambda$-poised in $B(0;1)$ in the regression sense.

Proof: Let $Y_j = \{y^0_j, y^1_j, \ldots, y^q_j\}$, $j = 1, \ldots, l$, be the disjoint $\Lambda$-poised subsets of $\hat Y$, and let $Y_r$ be the remaining points in $\hat Y$. Let $\lambda^j_i$, $i = 0, \ldots, q$, be the interpolation Lagrange polynomials for the set $Y_j$. As noted in [12], for any $x \in \mathbb{R}^n$,
\[
\sum_{i=0}^{q}\lambda^j_i(x)\,\phi(y^i_j) = \phi(x), \qquad j = 1, \ldots, l.
\]
Dividing each of these equations by $l$ and summing yields
\[
\sum_{j=1}^{l}\sum_{i=0}^{q}\frac{1}{l}\,\lambda^j_i(x)\,\phi(y^i_j) = \phi(x). \tag{3.11}
\]
Let $\lambda^j(x) = (\lambda^j_0(x), \ldots, \lambda^j_q(x))^T$, and let $\lambda \in \mathbb{R}^{p_1}$ be formed by concatenating the $\lambda^j(x)$, $j = 1, \ldots, l$, and a zero vector of length $|Y_r|$, and then dividing every entry by $l$. By (3.11), $\lambda$ is a solution to the equation
\[
\sum_{i=0}^{p}\lambda_i\,\phi(y^i) = \phi(x). \tag{3.12}
\]
Since $Y_j$ is $\Lambda$-poised in $B(0;1)$, for any $x \in B(0;1)$,
\[
\|\lambda^j(x)\| \le \sqrt{q_1}\,\|\lambda^j(x)\|_\infty \le \sqrt{q_1}\,\Lambda.
\]
Thus,
\[
\|\lambda\| \le \frac{\sqrt{l}}{l}\max_j\|\lambda^j(x)\| \le \Lambda\sqrt{\frac{q_1}{l}} \le \sqrt{\frac{l+1}{l}}\,\Lambda\sqrt{\frac{q_1}{p_1/q_1}} = \sqrt{\frac{l+1}{l}}\,\frac{q_1}{\sqrt{p_1}}\,\Lambda.
\]
Let $\ell_i(x)$, $i = 0, \ldots, p$, be the regression Lagrange polynomials for the complete set $\hat Y$. As observed in [12], $\ell(x) = (\ell_0(x), \ldots, \ell_p(x))^T$ is the minimum norm solution to (3.12). Thus,
\[
\|\ell(x)\| \le \sqrt{\frac{l+1}{l}}\,\frac{q_1}{\sqrt{p_1}}\,\Lambda.
\]
Since this holds for all $x \in B(0;1)$, $\hat Y$ is strongly $\sqrt{\tfrac{l+1}{l}}\,\Lambda$-poised in $B(0;1)$.

Based on this observation, and noting that $(l+1)/l \le 2$ for $l \ge 1$, we adopt the following strategy for improving a shifted and scaled regression sample set $\hat Y \subset B(0;1)$:

1. If $\hat Y$ contains $l \ge 1$ $\Lambda$-poised subsets with at most $q_1$ points left over, $\hat Y$ is strongly $\sqrt{2}\Lambda$-poised.
2. Otherwise, if $\hat Y$ contains at least one $\Lambda$-poised subset, save as many $\Lambda$-poised subsets as possible, plus at most $q_1$ additional points from $\hat Y$, discarding the rest.
3. Otherwise, add additional points to $\hat Y$ in order to create a $\Lambda$-poised subset. Keep this subset, plus at most $q_1$ additional points from $\hat Y$.

To implement this strategy, we first describe an algorithm that attempts to find a $\Lambda$-poised subset of $\hat Y$. To discuss the algorithm we introduce the following definition:

Definition 3.14  A set $Y \subset B$ is said to be $\Lambda$-subpoised in a set $B$ if there exists a superset $Z \supseteq Y$ that is $\Lambda$-poised in $B$ with $|Z| = q_1$.

Given a sample set $Y \subset B(0;1)$ (not necessarily shifted and scaled) and a radius $\tilde\Delta$, the algorithm below selects a $\Lambda$-subpoised subset $Y_{\mathrm{new}} \subseteq Y$ containing as many points as possible. If $|Y_{\mathrm{new}}| = q_1$, then $Y_{\mathrm{new}}$ is $\Lambda$-poised in $B(0;\tilde\Delta)$ for some fixed $\Lambda$. Otherwise, the algorithm determines a new point $y_{\mathrm{new}} \in B(0;\tilde\Delta)$ such that $Y_{\mathrm{new}} \cup \{y_{\mathrm{new}}\}$ is $\Lambda$-subpoised in $B(0;\tilde\Delta)$.
Algorithm FindSet (Finds a $\Lambda$-subpoised set)

Input: A sample set $Y \subset B(0;1)$ and a trust region radius $\tilde\Delta \in [\sqrt{\xi_{acc}}, 1]$, for fixed parameter $\xi_{acc} > 0$.

Output: A set $Y_{\mathrm{new}} \subseteq Y$ that is $\Lambda$-poised in $B(0;\tilde\Delta)$; or a $\Lambda$-subpoised set $Y_{\mathrm{new}} \subset B(0;\tilde\Delta)$ and a new point $y_{\mathrm{new}} \in B(0;\tilde\Delta)$ such that $Y_{\mathrm{new}} \cup \{y_{\mathrm{new}}\}$ is $\Lambda$-subpoised in $B(0;\tilde\Delta)$.

Step 0 (Initialization): Initialize the pivot polynomial basis to the monomial basis: $u_i(x) = \phi_i(x)$, $i = 0, \ldots, q$. Set $Y_{\mathrm{new}} = \emptyset$. Set $i = 0$.

Step 1 (Point Selection): If possible, choose $j_i \in \{i, \ldots, |Y|-1\}$ such that $|u_i(y^{j_i})| \ge \xi_{acc}$ (threshold test). If such an index is found, add the corresponding point to $Y_{\mathrm{new}}$ and swap the positions of points $y^i$ and $y^{j_i}$ in $Y$. Otherwise, compute $y_{\mathrm{new}} = \arg\max_{x\in B(0;\tilde\Delta)}|u_i(x)|$, and exit, returning $Y_{\mathrm{new}}$ and $y_{\mathrm{new}}$.

Step 2 (Gaussian Elimination): For $j = i+1, \ldots, |Y|-1$,
\[
u_j(x) = u_j(x) - \frac{u_j(y^i)}{u_i(y^i)}\,u_i(x).
\]
If $i < q$, increment $i$ by one and go to Step 1; otherwise stop, returning $Y_{\mathrm{new}}$.
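A minimal sketch of this point-selection/elimination loop is given below. It assumes the sample points are already shifted into the unit ball, approximates the maximization over $B(0;\tilde\Delta)$ by random candidate points, and replaces the thesis's distance-weighted selection rule ($|u_i(y^j)|/d_j^3$, described next) with plain partial pivoting; all names are illustrative.

```python
import numpy as np

def find_subpoised_set(Y, phi, q1, xi_acc, delta_tilde, n_cand=2000, seed=0):
    """Greedy pivoting on the columns of pivot-polynomial values (FindSet sketch).

    Y : list of points in the unit ball; phi : basis map returning a length-q1 vector.
    Returns (selected_indices, y_new), where y_new is None if q1 points were accepted.
    """
    rng = np.random.default_rng(seed)
    n = len(Y[0])
    P = np.vstack([phi(y) for y in Y])        # P[k, i] = u_i(y^k)
    C = rng.standard_normal((n_cand, n))      # candidates standing in for the argmax
    C = delta_tilde * C / np.maximum(np.linalg.norm(C, axis=1, keepdims=True), 1.0)
    Q = np.vstack([phi(c) for c in C])        # pivot-polynomial values at candidates
    selected = []
    for i in range(q1):
        avail = [j for j in range(len(Y)) if j not in selected]
        if not avail:
            return selected, C[np.argmax(np.abs(Q[:, i]))]
        j_best = max(avail, key=lambda j: abs(P[j, i]))
        if abs(P[j_best, i]) < xi_acc:        # threshold test failed: propose new point
            return selected, C[np.argmax(np.abs(Q[:, i]))]
        selected.append(j_best)
        ratios = P[j_best, i + 1:] / P[j_best, i]
        for M_ in (P, Q):                     # Gaussian elimination on remaining pivots
            M_[:, i + 1:] -= np.outer(M_[:, i], ratios)
    return selected, None
```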

PAGE 55

Proof: Let ~ M = M ;Y new .By.13, ~ M )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 p q 1 growth = acc .Let ` x = ` 0 x ;:::;` q x T bethevectorofinterpolationLagrangepolynomialsforthesample set Y new .Forany x 2 B ; ~ , k ` x k 1 = ~ M )]TJ/F20 7.9701 Tf 6.587 0 Td [(T x 1 ~ M )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 1 x 1 p q 1 ~ M )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 x 1 q 1 growth acc x 1 q 1 growth acc max f 1 ; ~ 2 = 2 ; ~ g : Sincethisinequalityholdsforall x 2 B ; ~ , Y new is-poisedfor= q 1 growth = acc max f 1 ; ~ 2 = 2 ; ~ g ,whichestablishestheresult. Ingeneral,thegrowthfactorintheabovelemmadependsonthematrix M and thethreshold acc .Becauseofthethresholdtest,itispossibletoestablishabound onthegrowthfactorthatisindependentof ~ M .Sowecanclaimthatthealgorithm selectsa-poisedsetforaxedthatisindependentof Y .However,thebound isextremelylarge,soisnotveryuseful.Nevertheless,inpractice growth isquite reasonable,sotendstobeproportionaltomax f 1 ; ~ 2 = 2 ; ~ g = acc . Inthecasewherethethresholdtestisnotsatised,AlgorithmFindSetdetermines anewpoint y new bymaximizing j u i x j over B ; ~ .Inthiscase,weneedtoshow thatthenewpointwouldsatisfythethresholdtest.Thefollowinglemmashowsthat thisispossible,provided acc issmallenough.Theproofismodeledaftertheproof of[12,Lemma6.7]. Lemma3.16 Let v T x beaquadraticpolynomialofdegree2,where k v k 1 =1 . Then max x 2 B ; ~ j v T x j min f 1 ; ~ 2 4 g : Proof: Since k v k 1 =1,atleastoneofthecoecientsof q x = v T x is1,-1, 1/2,or-1/2.Lookingatthecasewherethelargestcoecientis1or1/2-1and -1/2aresimilarlyproven,eitherthiscoecientcorrespondstotheconstantterm,a linearterm x i ,oraquadraticterm x 2 i = 2or x i x j .Restrictallvariablesnotappearing inthetermcorrespondingtothelargestcoecienttozero. 44

PAGE 56

If q x =1thenthelemmatriviallyholds. If q x = x 2 i = 2+ ax i + b ,let~ x = ~ e i 2 B ; ~ q ~ x = ~ 2 = 2+ ~ a + b;q )]TJ/F15 11.9552 Tf 10.024 0 Td [(~ x = ~ 2 = 2 )]TJ/F15 11.9552 Tf 13.906 3.022 Td [(~ a + b; and q = b: If j q )]TJ/F15 11.9552 Tf 10.023 0 Td [(~ x j ~ 2 4 or j q ~ x j ~ 2 4 theresultisshown.Otherwise, )]TJ/F17 7.9701 Tf 7.997 2.014 Td [(~ 4 ~ 2 4 . If q x = ax 2 i = 2+ x i + b ,thenlet~ x = ~ e i ,yielding q ~ x = ~ + a ~ 2 = 2+ b and q )]TJ/F15 11.9552 Tf 10.024 0 Td [(~ x = )]TJ/F15 11.9552 Tf 11.249 3.022 Td [(~ + a ~ 2 = 2+ b then max fj q )]TJ/F15 11.9552 Tf 10.023 0 Td [(~ x j ; j q ~ x jg =max n j)]TJ/F15 11.9552 Tf 19.884 3.022 Td [(~ + j ; j ~ + j o ~ min f 1 ; ~ 2 4 g where = a ~ = 2+ b =0. If q x = ax 2 i = 2+ bx 2 j = 2+ x i x j + cx i + dx j + e ,weconsider4pointson B ; ~ y 1 = 2 6 4 q ~ 2 q ~ 2 3 7 5 ;y 2 = 2 6 4 q ~ 2 )]TJ/F25 11.9552 Tf 9.298 14.792 Td [(q ~ 2 3 7 5 ;y 3 = 2 6 4 )]TJ/F25 11.9552 Tf 9.299 14.792 Td [(q ~ 2 q ~ 2 3 7 5 ;y 4 = 2 6 4 )]TJ/F25 11.9552 Tf 9.299 14.792 Td [(q ~ 2 )]TJ/F25 11.9552 Tf 9.299 14.792 Td [(q ~ 2 3 7 5 : q y 1 = a ~ 4 + b ~ 4 + ~ 2 + c s ~ 2 + d s ~ 2 + e q y 2 = a ~ 4 + b ~ 4 )]TJ/F15 11.9552 Tf 15.102 11.11 Td [(~ 2 + c s ~ 2 )]TJ/F19 11.9552 Tf 11.955 0 Td [(d s ~ 2 + e q y 3 = a ~ 4 + b ~ 4 )]TJ/F15 11.9552 Tf 15.102 11.109 Td [(~ 2 )]TJ/F19 11.9552 Tf 11.955 0 Td [(c s ~ 2 + d s ~ 2 + e q y 4 = a ~ 4 + b ~ 4 + ~ 2 )]TJ/F19 11.9552 Tf 11.955 0 Td [(c s ~ 2 )]TJ/F19 11.9552 Tf 11.955 0 Td [(d s ~ 2 + e Notethat q y 1 )]TJ/F19 11.9552 Tf 11.732 0 Td [(q y 2 = ~ + d p 2 ~ and q y 3 )]TJ/F19 11.9552 Tf 11.733 0 Td [(q y 4 = )]TJ/F15 11.9552 Tf 11.249 3.022 Td [(~ + d p 2 ~ .There aretwocases: 45

PAGE 57

1.If d 0,then q y 1 )]TJ/F19 11.9552 Tf 12.218 0 Td [(q y 2 ~ ,soeither j q y 1 j ~ 2 min n 1 ; ~ 2 4 o or j q y 2 j ~ 2 min n 1 ; ~ 2 4 o . 2.If d< 0,thenasimilarstudyof q y 3 )]TJ/F19 11.9552 Tf 11.955 0 Td [(q y 4 provestheresult. Lemma3.17 SupposeinAlgorithmFindSet acc min f 1 ; ~ 2 = 4 g .IfAlgorithm FindSetexitsduringthepointselectionstep,then Y new S f y new g is -subpoisedin B ; ~ forsomexed ,whichisproportionalto growth acc max f 1 ; ~ 2 = 2 ; ~ g ,where growth isthegrowthparameterfortheGaussianelimination. Proof: ConsideramodiedversionofAlgorithmFindSetthatdoesnotexitin thepointselectionstep.Instead, y i isreplacedby y new and y new isaddedto Y new . Thismodiedalgorithmwillalwaysreturnasetconsistingof q 1 points.Callthisset Z .Let Y new and y new betheoutputoftheunmodiedalgorithm,andobservethat Y new S f y new g Z . Toshowthat Y new S f y new g is-subpoised,weshowthat Z is-poisedin B ; ~ . FromtheGaussianelimination,after k iterationsofthealgorithm,the k +1st pivotpolynomial u k x canbeexpressedas v k T x ,forsome v k = v 0 ;:::;v k )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 ; 1 ; 0 ;:::; 0.Thatis,the v i arethecoecientsforthebasisexpansionofthepolynomial u k .Observethat v k 1 1,andlet~ v = v k k v k k 1 .ByLemma3.16, max x 2 B ; ~ j u k x j =max x 2 B ; ~ v k T x = v k 1 max x 2 B ; ~ j ~ v T x j min f 1 ; ~ 2 4 g v k 1 min f 1 ; ~ 2 4 g acc : Itfollowsthateachtimeanewpointischoseninthepointselectionstep, thatpointwillsatisfythethresholdtest.Thus,theset Z returnedbythemodiedalgorithmwillinclude q 1 points,allofwhichsatisfythethresholdtest.By 46

PAGE 58

Lemma3.15, Z is-poised,withproportionalto growth acc max f 1 ; ~ 2 = 2 ; ~ g .Itfollowsthat Y new S f y new g is-subpoised. Wearenowreadytostateourmodelimprovementalgorithmforregression.Prior tocallingthisalgorithm,wediscardallpointsin Y withdistancegreaterthan = p acc from y 0 .Wethenformtheshiftedandscaledset Y bythetransformation y j = y j )]TJ/F19 11.9552 Tf 9.351 0 Td [(y 0 =d ,where d =max y j 2 Y k y j )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k ,andscalethetrustregionradiusaccordingly i.e., ~ = =d .Thisensuresthat ~ = d = p acc = p acc .Aftercallingthe algorithm,wereversetheshiftandscaletransformation. AlgorithmMIA ModelImprovementforRegression Input: Ashiftedandscaledsampleset ^ Y B ;1,atrustregionradius ~ p acc forxed acc 2 ; 1 4 r 2 ,where r 1isaxedparameter. Output: Amodiedset Y 0 withimprovedpoisednesson B ; ~ . Step0Initialization: Removethepointin ^ Y farthestfrom y 0 =0ifitis outside B ; r ~ .Set Y 0 = ; . Step1FindPoisedSubset: UseAlgorithm FindSet eithertoidentifya -poisedsubset Y new ^ Y ,ortoidentifya-subpoisedsubset Y new ^ Y and oneadditionalpoint y new 2 B ; ~ suchthat Y new S f y new g is-subpoised on B ; ~ . Step2UpdateSet: If Y new is-poised,additto Y 0 andremove Y new from ^ Y .Removeall pointsfrom ^ Y thatareoutsideof B ; r ~ . Otherwise If j Y 0 j = ; ,set Y 0 = Y new S f y new g plus q 1 )-222(j Y new j)]TJ/F15 11.9552 Tf 17.932 0 Td [(1additionalpoints from ^ Y . Otherwise set Y 0 = Y 0 S Y new plus q 1 )-222(j Y new j additionalpointsfrom ^ Y . 47

PAGE 59

Set ^ Y = ; . Step3If j ^ Y j q 1 ,goto Step1 . InAlgorithmMIA,ifeverycalltoAlgorithmFindSetyieldsa-poisedset Y new , theneventuallyallpointsin ^ Y willbeincludedin Y 0 .Inthiscase,thealgorithmhas veriedthat ^ Y contains ` = b p 1 q 1 c -poisedsets.ByProposition3.13, ^ Y isstrongly l +1 l -poisedin B ;1. IftherstcalltoFindSetfailstoidentifya-poisedsubset,thealgorithmimprovesthesamplesetbyaddinganewpoint y new andbyremovingpointssothat theoutputset Y 0 containsatmost q 1 points.Inthiscasetheoutputsetcontainsthe -subpoisedset Y new S f y new g .Thus,ifthealgorithmiscalledrepeatedly,with ^ Y replacedby Y 0 aftereachcall,eventually Y 0 willcontaina-poisedsubsetandwill bestrongly2-poised,byProposition3.13. If ^ Y failstobe-poisedafterthesecondorlatercalltoFindSet,nonewpoints areadded.Instead,thesamplesetisimprovedbyremovingpointsfrom ^ Y sothat theoutputset Y 0 consistsofallthe-poisedsubsetsidentiedbyFindSet,plusupto q 1 additionalpoints.Theresultingsetisthenstrongly ^ ` +1 ^ ` -poised,where ^ ` = b j Y 0 j q 1 c . Trustregionscalefactor Thetrustregionscalefactor r wassuggestedin[12, Section11.2],althoughimplementationdetailswereomitted.Thescalefactordetermineswhatpointsareallowedtoremaininthesampleset.EachcalltoAlgorithm MIAremovesapointfromoutside B ; r ~ ifsuchapointexists.Thus,ifthealgorithmiscalledrepeatedlywith ^ Y replacedby Y 0 eachtime,eventuallyallpointsin thesamplesetwillbeintheregion B ; r ~ .Usingascalefactor r> 1canimprove theeciencyofthealgorithm.Toseethis,observethatif r =1,aslightmovement ofthetrustregioncentermayresultinpreviouslygood"pointslyingjustoutsideof B y 0 ;.Thesepointswouldthenbeunnecessarilyremovedfrom ^ Y . 48

PAGE 60

Tojustifythisapproach,supposethat ^ Y isstrongly-poisedin B ; ~ .By Proposition3.11,theassociatedmodelfunction m is -fullyquadraticforsomexed vector ,whichdependson.Ifinstead ^ Y haspointsoutsideof B ; ~ ,wecan showbyasimplemodicationtotheproofofProposition3.11thatthemodel functionis R 3 -fullyquadratic,where R =max fk y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 kg .Thus,if ^ Y B ; r ~ forsomexed r 1,thencallingthemodelimprovementalgorithmwillresultin amodelfunction m thatis^ -fullyquadraticwithrespecttoadierentbutstill xed^ = r 3 .Inthiscase,however,whenevernewpointsareaddedduringthe modelimprovementalgorithm,theyarealwayschosenwithintheoriginaltrustregion B ; ~ . ThediscussionabovedemonstratesthatAlgorithmMIAsatisestherequirements ofamodelimprovementalgorithmspeciedinDenition2.2.Thisalgorithmisused intheCSV2-frameworkdescribedinChapter2asfollows: InStep1ofAlgorithmCSV2,AlgorithmMIAiscalledonce.Ifnochangeis madetothesampleset,themodeliscertiedtobe -fullyquadratic. InStep4ofAlgorithmCSV2,AlgorithmMIAiscalledonce.Ifnochangeis madetothesampleset,themodelis -fullyquadratic.Otherwise,thesample setismodiedtoimprovethemodel. InAlgorithmCriticalityStep,AlgorithmMIAiscalledrepeatedlytoimprove themodeluntilitis -fullyquadratic. Inourimplementation,wemodiedAlgorithmCriticalitySteptoimproveeciencybyintroducinganadditionalexitcriterion.Specically,aftereachcalltothe modelimprovementalgorithm, & i k =max fk g i k k ; )]TJ/F19 11.9552 Tf 9.298 0 Td [( min H i k g istested.If & i k > c , x k isnolongerasecond-orderstationarypointofthemodelfunction;soweexitthe criticalitystep. 49

PAGE 61

3.5ComputationalResults Asshownintheprevioussection,theCSV2-frameworkusingweightedquadratic regressionconvergestoasecond-orderstationarypointprovidedtheratiobetweenthe largestandsmallestweightisbounded.Thisleavesmuchleewayinthederivationof theweights.Wenowdescribeaheuristicstrategybasedontheerrorboundsderived in x 4. 3.5.1UsingErrorBoundstoChooseWeights Intuitively,themodelsusedthroughoutouralgorithmwillbemosteectiveifthe weightsarechosensothat m x isasaccurateaspossibleinthesensethatitagrees withthesecond-orderTaylorapproximationof f x aroundthecurrenttrustregion center y 0 .Thatis,wewanttoestimatethequadraticfunction q x = f y 0 + r f y 0 T x )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 + 1 2 x )]TJ/F19 11.9552 Tf 11.956 0 Td [(y 0 T r 2 f y 0 x )]TJ/F19 11.9552 Tf 11.956 0 Td [(y 0 : If f x happenstobeaquadraticpolynomial,then f i = q y i + i : Iftheerrors i areuncorrelatedrandomvariableswithzeromeanandnitevariances 2 i ;i =0 ;:::;p ,thenthebestlinearunbiasedestimatorofthepolynomial q x is givenby m x = x T ,where solves.4withthe i th weightproportionalto 1 = i [51,Theorem4.4].Thisisintuitivelyappealingsinceeachsamplepointwillhave thesameexpectedcontributiontotheweightedsumofsquareerrors. When f x isnotaquadraticfunction,theerrorsdependnotjustonthecomputationalerror,butalsoonthedistancesfromeachpointto y 0 .Intheparticularcase when x = y 0 ,therstthreetermsof.7arethequadraticfunction q y i .Thus, theerrorbetweenthecomputedfunctionvalueand q y i isgivenby: f i )]TJ/F19 11.9552 Tf 11.955 0 Td [(q y i = 1 2 y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 T H i y 0 y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 + i ; .14 50

PAGE 62

where H i y 0 = r 2 f i y 0 )-299(r 2 f y 0 forsomepoint i y 0 onthelinesegment connecting y 0 and y i . Weshallrefertothersttermontherightasthe Taylorerror andthesecondterm ontherightasthe computationalerror .ByAssumption2.1, k H i y 0 k L k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k . Thisleadsustothefollowingheuristicargumentforchoosingtheweights:Supposethat H i y 0 isarandomsymmetricmatrixsuchthatthestandarddeviationof k H i y 0 k isproportionalto L k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k .Inotherwords k H i y 0 k = L k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k for someconstant .ThentheTaylorerrorwillhavestandarddeviationproportionalto L k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k 3 .AssumingthecomputationalerrorisindependentoftheTaylorerror, the totalerror f i )]TJ/F19 11.9552 Tf 11.314 0 Td [(q y i willhavestandarddeviation q )]TJ/F19 11.9552 Tf 5.479 -9.683 Td [(L k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k 3 2 + 2 i ,where i isthestandarddeviationof i .Thisleadstothefollowingformulafortheweights: w i / 1 q 2 L 2 k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k 6 + 2 i : Ofcourse,thisformuladependsonknowing ;L and i .If L , i ,and/or are notknown,thisformulacouldstillbeusedinconjunctionwithsomestrategyfor estimating L , i ,and forexample,basedupontheaccuracyofthemodelfunctions atknownpoints.Alternatively, and L canbecombinedintoasingleparameter C thatcouldbechosenusingsometypeofadaptivestrategy: w i / 1 q C k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k 6 + 2 i : Ifthecomputationalerrorshaveequalvariances,theformulacanbefurther simpliedas w i / 1 q C k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k 6 +1 ; .15 where C = C= 2 i . Anobviousawintheabovedevelopmentisthattheerrorsin j f i )]TJ/F19 11.9552 Tf 12.534 0 Td [(q y i j are notuncorrelated.Additionally,theassumptionthat k H i y 0 k isproportionalto L k y i )]TJ/F19 11.9552 Tf 11.955 0 Td [(y 0 k isvalidonlyforlimitedclassesoffunctions.Nevertheless,basedonour 51

PAGE 63

computationalexperiments,.15appearstoprovideasensiblestrategyforbalancingdieringlevelsofcomputationaluncertaintywiththeTaylorerror. 3.5.2BenchmarkPerformance Tostudytheimpactofweightedregression,wedevelopedMATLABimplementationsofthreequadraticmodel-basedtrustregionalgorithmsusinginterpolation, regression,andweightedregression,respectively,toconstructthequadraticmodel functions.Totheextentpossible,thedierencesbetweenthesealgorithmswere minimized,withcodesharedwheneverpossible.Obviously,allthreemethodsuse dierentstrategiesforconstructingthemodelfromthesampleset.Beyondthat,the onlydierenceisthatthetworegressionmethodsusethemodelimprovementalgorithmdescribedinSection3.4,whereastheinterpolationalgorithmusesthemodel improvementalgorithmdescribedin[12,Algorithm6.6]. Wecomparedthethreealgorithmsusingthesuiteoftestproblemsforbenchmarkingderivative-freeoptimizationalgorithmsmadeavailablebyMoreandWild [41].Weranourtestsonthefourtypesofproblemsfromthistestsuite:smooth problemswithnonoise,piecewisesmoothfunctions,functionswithdeterministic noiseandfunctionswithstochasticnoise.Wedonotconsiderthealgorithmpresented inthischaptertobeidealforhandlingstochasticallynoisyfunctions.Forexample,if theinitialpointhappenstobeevaluatedwithlargenegativenoise,thealgorithmwill neverre-evaluatethispointandpossiblynevermovethetrustregioncenter.Weare activelyattemptingtoconstructamorerobustalgorithm.Weconsidersuchmodicationsnon-trivialandoutsidethescopeofthecurrentwork.Theproblemswere runwiththefollowingparametersettings: max =100 ; icb 0 =1 ; 0 =10 )]TJ/F17 7.9701 Tf 6.586 0 Td [(6 ; 1 =0 : 5 ; =0 : 5 ; inc =2 ; c =0 : 01 ; =2 ; =0 : 5 ;! = : 5 ;r =3 ; acc =10 )]TJ/F17 7.9701 Tf 6.586 0 Td [(4 : Fortheinterpolationalgorithm,weused imp =1 : 01 ; forthecallsto[12,Algorithm6.6]. 52
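A small sketch of this weighting heuristic follows; it implements the un-simplified form with per-point standard deviations $\sigma_i$, and rescales so that the largest weight equals 1 (which keeps the weight ratio bounded, as required in §3.3.3). The function name and the default $C = 100$ are illustrative.

```python
import numpy as np

def regression_weights(Y, y0, sigma, C=100.0):
    """w_i ~ 1 / sqrt(C * ||y_i - y0||^6 + sigma_i^2), rescaled so max(w) = 1."""
    Y = np.asarray(Y, float)
    d = np.linalg.norm(Y - np.asarray(y0, float), axis=1)
    w = 1.0 / np.sqrt(C * d**6 + np.asarray(sigma, float)**2)
    return w / w.max()
```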

PAGE 64

Asdescribedin[41],thesmoothproblemsarederivedfrom22nonlinearleast squaresfunctionsdenedintheCUTEr[23]collection.Foreachproblem,theobjectivefunction f x isdenedby f x = m X k =1 g k x 2 ; where g : R n ! R m representsoneoftheCUTErtestfunctions.ThepiecewisesmoothproblemsaredenedusingthesameCUTErtestfunctionsby f x = m X k =1 j g k x j : Thenoisyproblemsarederivedfromthesmoothproblemsbymultiplyingbyanoise functionasfollows: f x =+ " f \050 x m X k =1 g k x 2 ; where " f denestherelativenoiselevel.Forstochasticallynoisyproblem,\050 x isa randomvariabledrawnfromtheuniformdistribution U [ )]TJ/F15 11.9552 Tf 9.298 0 Td [(1 ; 1].Tosimulatedeterministicnoise,\050 x isafunctionthatoscillatesbetween-1and1,withbothhighfrequencyandlow-frequencyoscillations.Foranequationforthedeterministic,see [41,Eqns..2-.3].Usingmultiplestartingpointsforsomeofthetestfunctions, atotalof53dierentproblemsarespeciedinthetestsuiteforeachofthese3types ofproblems. Fortheweightedregressionalgorithm,theweightsweredeterminedbytheweightingscheme.15with C =100. Therelativeperformancesofthealgorithmswerecomparedusingperformance prolesanddataproles[17,41].If S isthesetofsolverstobecomparedonasuiteof problems P ,let t p;s bethenumberofiteratesrequiredforsolver s 2 S onaproblem p 2 P tondafunctionvaluesatisfying: f x f L + f x 0 )]TJ/F19 11.9552 Tf 11.955 0 Td [(f L ; .16 53

PAGE 65

where f L isthebestfunctionvalueachievedbyany s 2 S .Thentheperformance proleofasolver s 2 S isthefollowingfraction: s = 1 j P j p 2 P : t p;s min f t p;s : s 2 S g : Thedataproleofasolver s 2 S is: d s = 1 j P j p 2 P : t p;s n p +1 where n p isthedimensionofproblem p 2 P .Formoreinformationontheseproles, includingtheirrelativemeritsandfaults,see[41]. PerformanceprolescomparingthethreealgorithmsareshowninFigure3.1for anaccuracyof =10 )]TJ/F17 7.9701 Tf 6.587 0 Td [(5 .Weobservethatonthesmoothproblems,theweightedand unweightedregressionmethodshadsimilarperformanceandbothperformedslightly betterthaninterpolation.Forthedeterministicallynoisyproblems,weseeslightly betterperformancefromtheweightedregressionmethod;andthisimprovementis evenmorepronouncedforthebenchmarkproblemswithstochasticnoise.Andfor thepiecewisedierentiablefunctions,theperformanceoftheweightedregression methodissignicantlybetter.Thismirrorsthendingsin[13]whereSID-PSMusing regressionmodelsshowsconsiderableimprovementoverinterpolationmodels. WealsocomparedourweightedregressionalgorithmwiththeDFOalgorithm [8]aswellasNEWUOA[50],whichhadthebestperformanceofthethreesolvers comparedin[41].WeobtainedtheDFOcodefromtheCOIN-ORwebsite[38].This codecallsIPOPT,whichwealsoobtainedfromCOIN-OR.WeobtainedNEWUOA from[40].Weranbothalgorithmsonthebenchmarkproblemswithastopping criterionof min =10 )]TJ/F17 7.9701 Tf 6.586 0 Td [(8 ,where min denotestheminimumtrustregionradius.For NEWUOA,thenumberofinterpolationconditionswassettoNPT=2 n +1. TheperformanceprolesareshowninFigure3.2,withanaccuracyof =10 )]TJ/F17 7.9701 Tf 6.587 0 Td [(5 . NEWUOAoutperformsbothouralgorithmandDFOonthesmoothproblems.This isnotsurprisingsinceNEWUOAisamaturecodethathasbeenrenedoverseveral 54

PAGE 66

Figure3.1: Performanceleftanddatarightproles:Interpolationvs.regression vs.weightedregression years,whereasourcodeisarelativelyunsophisticatedimplementation.Incontrast,on thenoisyproblemsandthepiecewisedierentiableproblems,ourweightedregression 55

PAGE 67

algorithmachievessuperiorperformance. Figure3.2: Performanceleftanddatarightproles:weightedregressionvs. NEWUOAvs.DFOProblemswithStochasticNoise 3.6SummaryandConclusions 56

PAGE 68

Ourcomputationalresultsindicatethatusingweightedregressiontoconstruct moreaccuratemodelfunctionscanreducethenumberoffunctionevaluationsrequiredtoreachastationarypoint.Encouragedbytheseresults,webelievethat furtherstudyofweightedregressionmethodsiswarranted.Thischapterprovidesa theoreticalfoundationforsuchstudy.Inparticular,wehaveextendedtheconceptsof -poisednessandstrong-poisednesstotheweightedregressionframework,andwe demonstratedthatanyschemethatmaintainsstrongly-poisedsamplesetsforunweightedregressioncanalsobeusedtomaintainstrongly-poisedsamplesetsfor weightedregression,providedthatnoweightistoosmallrelativetotheotherweights. Usingtheseresults,weshowedthat,whenthecomputationalerrorissucientlysmall relativetothetrustregionradius,thealgorithmconvergestoastationarypointunder standardassumptions. Thisinvestigationbeganwithagoalofmoreecientlydealingwithcomputationalerrorinderivative-freeoptimization,particularlyundervaryinglevelsofuncertainty.Surprisingly,wediscoveredthatregressionbasedmethodscanbeadvantageousevenintheabsenceofcomputationalerror.Regressionmethodsproduce quadraticapproximationsthatbetterttheobjectivefunctionclosetothetrustregioncenter.Thisisduepartlytothefactthatinterpolationmethodsthrowoutpoints thatareclosetogetherinordertomaintainawell-poisedsampleset.Incontrast, regressionmodelskeepthesepointsinthesampleset,therebyputtinggreaterweight onpointsclosetothetrustregioncenter. Thequestionofhowtochooseweightsneedsfurtherstudy.Inthischapter, weproposedaheuristicthatbalancesuncertaintiesarisingfromcomputationalerror withuncertaintiesarisingfrompoormodeldelityi.e.,Taylorerrorasdescribed in x 3.5.1.Thisweightingschemeappearstoprovideabenetfornoisyproblemsor non-dierentiableproblems.Webelievebetterschemescanbedevisedbasedonmore rigorousanalysis. 57

PAGE 69

Finally,wenotethattheadvantageofregression-basedmethodsisnotwithout costintermsofcomputationaleciency.Intheregressionmethods,quadraticmodels areconstructedfromscratcheveryiteration,requiring O n 6 operations.Incontrast, interpolation-basedmethodsareabletouseanecientschemedevelopedbyPowell[50]toupdatethequadraticmodelsateachiteration.Itisnotclearwhether suchaschemecanbedevisedforregressionmethods.Nevertheless,whenfunction evaluationsareextremelyexpensive,andwhenthenumberofvariablesisnottoo large,thisadvantageisoutweighedbythereductioninfunctionevaluationsrealized byregressionbasedmethods. 58

PAGE 70

4.StochasticDerivative-freeOptimizationusingaTrustRegion Framework Inthischapter,weproposeandanalyzetheconvergenceofanalgorithmwhich ndsalocalminimizeroftheunconstrainedfunction f : R n ! R .Thevalueof f atagivenpoint x cannotbeobserveddirectly;rathertheoptimizationroutineonly hasaccesstonoisecorruptedfunctionvalues f .Suchnoisemaybedeterministic,due toround-oerrorfromnite-precisionarithmeticoriterativemethods,orstochastic, arisingfromvariabilityorrandomnessinsomeobservedprocess.Wefocusour attentioninthischapteronminimizing f when f hastheform: f x = f x + .1 where N ; 2 . Minimizingnoisyfunctionsofthisnatureariseinavarietyofsettings.Forexample,inalmostanyproblemwherephysicalsystemmeasurementsarebeingoptimized. Consideracityplannerwantingtomaximizetracowonamajorthoroughfareby adjustingthetimingoftraclights.Foreachtimingpattern x ,thetracow f x isphysicallymeasuredtoprovideinformationabouttheexpectedtracow f x . Stochasticapproximationalgorithms,builttosolve min f x = E f x haveexistedintheliteraturesinceRobbins-Monro'salgorithmforndingrootsofan expectedvaluefunction[53].TheKiefer-WolfowitzKWalgorithm[35]generalized thisalgorithmtominimizetheexpectedvalueofafunction.Theiriterateshavethe form x k +1 = x k + a k G x k where G isanitedierenceestimateforthegradientof f .The i thcomponentof G isfoundby G i = f x k + c k e i )]TJ/F15 11.9552 Tf 14.502 3.155 Td [( f x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(c k e i 2 c k : 59

PAGE 71

where e i isthe i thunitvector.WhileKWspawnedmanygeneralizations,most formsrequireapredetermineddecayingsequenceforboththestepssizeparameter a k andnitedierenceparameter c k .Asopposedtothe2 n evaluationsof f required ateachiterateofKW,Spall'ssimultaneousperturbationstochasticapproximation SPSA[56]requiresonly2functionevaluationsperiterate,independentof n .SPSA estimates G i by G i = f x k + c k k )]TJ/F15 11.9552 Tf 14.503 3.155 Td [( f x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(c k k 2 c k k i where k 2 R n isarandomperturbationvectorwithentries k i whichareindependent andidenticallydistributedi.i.d.fromadistributionwithboundedinversemoments, symmetricallydistributedaroundzero,anduniformlyboundedinmagnitudeforall k and i .ThoughSPSAgreatlyreducesthenumberofevaluationsof f ,thechoice ofsequences a k and c k arecriticaltoalgorithmicperformance.Nevertheless,if f has auniqueminimum x ,bothKWandSPSAhavealmostsureconvergencewhich impliesconvergenceinprobabilityandconvergenceindistributionof x k ! x as k !1 .ThereisalsoaversionofSPSAwhichusesfourfunctionevaluationsto estimatethevalue,gradient,andHessianof f [57]. ThealgorithmwhichfollowsdiersfromtheworkinChapter3inafewways. First,theanalysispresentedinChapter3givesveryconservativeerrorbounds;to tightenthesebounds,wemustconsiderspecicprobabilitydistributionsfortheerror. Second,inChapter3, k wasdenedusing f x k asanestimatefor f x k .Thework thatfollowsevaluatesmodelfunctions m k x k and^ m k x k toprovidebetterestimates ofthetruefunctionvalue.Itispossibletoestimate f x k byrepeatedevaluationof f x k ,butwedesireanalgorithmwhichavoidsrepeatedlysamplingpointstoreduce thevarianceatapoint x .Suchatechniqueonlygainsinformationaboutnoise inthe stochasticcase,andnoinformationabout if f isdeterministic.Butifmanypoints sucientlycloseto x aresampled,informationabout f and canbegleaned.Asis oftenthecase,thepoint x isthelikelynextiterate,andtheinformationgathered 60

PAGE 72

about f near x willbeusedimmediately.Also,ifthenoisein f isdeterministicbut theoptimizerhasimperfectcontrolof x ,itmaybepossibletoconsidertheproblem astochasticoptimizationproblem. Theanalysisofthealgorithmiscomplicatedbythepresenceofnoise.Sincethere isnoiseineachfunctionevaluation,itisimpossibletobecertainthemodelmatches thefunction.Forexample,if f x = x 2 ,thereisanonzerobuttinyprobability that f x < forany > 0ateverypointevaluated.Therefore,wecanonlyhave condencewhichwedenote1 )]TJ/F19 11.9552 Tf 11.61 0 Td [( k for k smallthatthemodelandfunctionagree. Thequantity k canbeadjustedasthealgorithmstagnatestoensureincreasingly accuratemodelsattheexpenseofmorefunctionevaluations.Akeyrequirementof theconvergenceanalysisisthatas k ! 0, k doesaswell.Forexample,wecan chooseasimplerulesuchas k =min f k ; 0 : 05 g toproveresultsaboutouralgorithm. Therearemanyotherequallyvalidrulesforhandling k toensureincreasingaccurate modelsas k ! 0. Ourultimategoalistoprovethatthealgorithmconvergestoastationarypoint of f almostsurelywithprobability1,butthisisadauntingtask.Thisistobe expectedconsideringthefollowingtwoquotesbothfrom[58] Thereisafundamentaltrade-obetweenalgorithmeciencyandalgorithmrobustnessreliabilityandstabilityinabroadrangeofproblems. Inessence,algorithmsthataredesignedtobeveryecientononetypeof problemtendtobebrittle"inthesensethattheydonotreliablytransfer toproblemsofadierenttype. and Unfortunately,forgeneralnonlinearproblems,thereisnoknownnitesample k< 1 distributionfortheSA[stochasticapproximation]iterate. Further,thetheorygoverningtheasymptotic k !1 distributionis ratherdicult. 61

PAGE 73

Despitethesepessimisticviews,weareabletomakeprogressintheworkthatfollows. Sinceweareattemptingtoconstructarobustalgorithm,withameasureofcondence inoursolutionafteranitenumberofiterations,ourtheoreticalrequirementsmaynot beimplementedinthealgorithm.Relaxingrequirementsmayyieldamoresuitable algorithmforaspecicprobleminstance. Ouralgorithmisaderivative-freetrustregionmethodusingregressionquadratic modelsfortheirperceivedabilitytohandlenoisyfunctionevaluations.Weoutlinethe modicationsrequiredforconvergencewhenminimizingafunctionwithstochastic noise.Forexample,whenthereisnonoiseinfunctionbeingoptimized,wecan measuretheaccuracyofthe k themodel m k withtheratio k = f x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(f x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k : Thismeasurestheactualdecreaseobservedin f versusthedecreasepredictedbythe model m k .Since f cannotbeevaluateddirectly,weproposeamodiedratio^ k in Section4.2whichwebelieveismoreappropriatefornoisyfunctions.Wealsopropose amodiedformof -fullyquadraticforstochasticallynoisyfunctions. Anoverviewofthechapterfollows:inSection4.1wedene -fullyquadratic andlinearmodelswithcondence1 )]TJ/F19 11.9552 Tf 11.863 0 Td [( k on B x ;andshowthatquadraticand linearregressionmodelssatisfythesenewdenitions,providedthereareasucient numberofpoisedpointsin B x ;.WeoutlinethealgorithminSection4.2and showthatitconvergestoarst-orderstationarypointinSection4.3.Weprovide suggestionsforimplementingouralgorithmandcompareoneimplementationagainst otheralgorithmsforminimizing.1inSection4.4.Lastly,wediscusstheresultsin Section4.5andoutlinesomeofthefutureavenuesforresearch. 4.1PreliminaryResultsandDenitions Wemakethefollowingassumptions: Assumption4.1 Thenoise N ; 2 . 62

PAGE 74

Assumption4.2 Thefunction f 2 LC 2 withLipschitzconstant L on = [ k B x k ; max R n : Assumption4.3 Thefunction f isboundedon L f x 0 where L = f x j f x g . Insolvingthetrustregionsubproblem,wedonotrequireanexactsolution-instead itissucienttondanapproximatesolution,butitmustsatisfythefollowing assumption. Assumption4.4 If m k and k arethemodelandtrustregionradiusatiterate k ,and x k + s k ischosenbythetrustregionsolvertosolve min x 2 B x k ; k m k x ,and s k C = )]TJ/F17 7.9701 Tf 13.215 5.112 Td [( k k g k k g k istheCauchystep,thenforall k m k x k )]TJ/F19 11.9552 Tf 11.956 0 Td [(m k x k + s k fcd h m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k C i forsomeconstant fcd 2 ; 1] ThisassumptionmerelystatesthateverytrustregionsubproblemsolutionisafractionofthedecreasepossiblefromtakingtheCauchystep,andthisfractionisbounded positivelyawayfromzero.Also,theassumptionallowsustonotsolvethetrustregion subproblemexactly. Assumption4.5 Thereexistsaconstant bhf > 0 suchthat,forall x k generatedin thealgorithm r 2 f x k bhf : Weprovethefollowingthreeclaimsusedinthischapter. Lemma4.6 If X 1 )]TJ/F20 7.9701 Tf 6.586 0 Td [( Y and Y 1 )]TJ/F20 7.9701 Tf 6.586 0 Td [( Z ,then X 1 )]TJ/F17 7.9701 Tf 6.587 0 Td [(2 Z . 63

PAGE 75

Proof: P X Z P X Y ^ Y Z =1 )]TJ/F19 11.9552 Tf 11.955 0 Td [(P X>Y _ Y>Z =1 )]TJ/F19 11.9552 Tf 11.955 0 Td [(P X>Y )]TJ/F19 11.9552 Tf 11.955 0 Td [(P Y>Z + P X>Y ^ Y>Z 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [(P X>Y )]TJ/F19 11.9552 Tf 11.955 0 Td [(P Y>Z =1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( )]TJ/F19 11.9552 Tf 11.955 0 Td [( =1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(2 : So X 1 )]TJ/F17 7.9701 Tf 6.587 0 Td [(2 Z . Lemma4.7 1 )]TJ/F20 7.9701 Tf 18.02 14.944 Td [(n X i =1 P a i n P n X i =1 a i < ! Proof: P n X i =1 a i < ! P a 1 < n ^ a 2 < n ^^ a n < n =1 )]TJ/F19 11.9552 Tf 11.955 0 Td [(P a 1 n __ a n n Lemma4.8 Let Y B ;1 beastrongly -poisedDenition2.14samplesetwith p 1 points.Let X bethequadraticdesignmatrixdenedby .2 ,then [ X T X )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 ] i;i q 1 p 1 2 where [ A ] i;i isthe i thdiagonalentryof A . Proof: Since X T X )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 issymmetricandpositive-denite,the i theigenvalue i equalsthe i thsingularvalue i .By[12,Theorem4.11],theinverseofthe smallestsingularvalueof X isboundedby q q 1 p 1 .Thatis, r q 1 p 1 1 min X 64

PAGE 76

or q 1 p 1 2 1 min X 2 = 1 min X T X = max )]TJ/F15 11.9552 Tf 5.479 -9.683 Td [( X T X )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 = max )]TJ/F15 11.9552 Tf 5.48 -9.683 Td [( X T X )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 = X T X )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 =max k v k =1 X T X )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 v X T X )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 e i [ X T X )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 ] i;i where e i isthe i thunitvector. 4.1.1Modelswhichare -fullyQuadraticwithCondence 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k ToproveconvergenceofthealgorithmpresentedinSection4.2,werstpropose amodiedversionof -fullyquadraticmodels. Denition4.9 Let f satisfyAssumption4.2.Let = ef ; eg ; eh ; m 2 beagiven vectorofconstants,andlet > 0 .Amodelfunction m 2 C 2 is -fullyquadratic withcondence1 )]TJ/F19 11.9552 Tf 12.52 0 Td [( k on B x ; for k 2 ; 1 if m hasaLipschitzcontinuous HessianwithcorrespondingLipschitzconstantboundedby m 2 and theerrorbetweentheHessianofthemodelandtheHessianofthefunction satises P )]TJ 5.479 0.478 Td [( r 2 f y )-222(r 2 m y eh 8 y 2 B x ; 1 )]TJ/F19 11.9552 Tf 11.956 0 Td [( k ; theerrorbetweenthegradientofthemodelandthegradientofthefunction satises P )]TJ/F22 11.9552 Tf 5.48 -9.684 Td [(kr f y )-222(r m y k eg 2 8 y 2 B x ; 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k ; theerrorbetweenthemodelandthefunctionsatises P )]TJ/F22 11.9552 Tf 5.48 -9.684 Td [(j f y )]TJ/F19 11.9552 Tf 11.955 0 Td [(m y j ef 3 8 y 2 B x ; 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k : Thisisoccasionallyabbreviated -f.q.w.c. 1 )]TJ/F19 11.9552 Tf 11.956 0 Td [( k . 65

PAGE 77

Thesedenitionsareonlyusefulifmodelfunctionscanbeeasilyconstructed whichsatisfythem;themodelsmustalsobeeasytominimizeoveratrustregion. Inthefollowingtheorem,weshowthatquadraticregressionmodelssatisfytherequirementsofDenition4.9,providedthereareenoughpoisedpointswithinthetrust region. Theorem4.10 Ifthefunction f satisesAssumption4.2andthenoise satises Assumption4.1,thenforagiven k 2 ; 1 ,thereexistsa = ef ; eg ; eh ; m 2 suchthatforany x 0 2 R n , > 0 ,if Y B x 0 ; isstrongly -poisedand j Y j z 1 )]TJ/F21 5.9776 Tf 8.699 4.623 Td [( k 2 q 1 2 2 q 3 1 2 6 ; thenthequadraticregressionmodelis -fullyquadraticwithcondence 1 )]TJ/F19 11.9552 Tf 10.684 0 Td [( k ,where z = 2 isthenumberofstandarddeviationsawayfromzeroonastandardnormaldistribution,suchthattheareatotheleftof z = 2 is = 2 [46]. Proof: ByTaylor'stheorem,foranypoint x 2 B x 0 ;thereexistsapoint x onthelinesegmentconnecting x to x 0 suchthat f x = f x 0 + r f x 0 T x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 + 1 2 x )]TJ/F19 11.9552 Tf 11.956 0 Td [(x 0 T r 2 f x x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 = f x 0 + r f x 0 T x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 + 1 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 T r 2 f x 0 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 + 1 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 T H x x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 ; .2 where H x = r 2 f x )-222(r 2 f x 0 . Let m x bethequadraticleastsquaresmodelregressing Y .Since m isquadratic, Taylor'stheoremsaysforany x , m x = m x 0 + r m x 0 T x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 + 1 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 T r 2 m x 0 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 : Let bethetrueparametersofthequadraticpartof f denedbytherst threetermsof.2andlet ^ betheleast-squaresestimatefor .i.e.if X is thedesignmatrixfortheset Y and f isthevectorwith i thentry f y i ,then ^ = 66

PAGE 78

)]TJ/F19 11.9552 Tf 5.479 -9.684 Td [(X T X X T f .Denethemapping V x : R n ! R q 1 where V x = V [ x 1 ; x n ] T = 1 ;x 1 ; ;x n ; 1 2 x 2 1 ;x 1 x 2 ; 1 2 x 2 n T .Thenthe i throwof X is V y i T .Theparameters ^ denethemodel m .Thatis, m x = ^ T V x . Withoutlossofgenerality,assume 1.Thenforany x 2 B x 0 ;, j f x )]TJ/F19 11.9552 Tf 11.955 0 Td [(m x j = f x 0 + r f x 0 T x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 + 1 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 T r 2 f x 0 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 + 1 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 T H x x )]TJ/F19 11.9552 Tf 11.956 0 Td [(x 0 )]TJ/F19 11.9552 Tf 11.955 0 Td [(m x 0 )-222(r m x 0 T x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 )]TJ/F15 11.9552 Tf 10.494 8.088 Td [(1 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 T r 2 m x 0 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 T V x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 )]TJ/F15 11.9552 Tf 13.64 3.154 Td [(^ T V x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 + 1 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 T H x x )]TJ/F19 11.9552 Tf 11.956 0 Td [(x 0 q X i =0 i )]TJ/F15 11.9552 Tf 13.64 3.155 Td [(^ i j V x )]TJ/F19 11.9552 Tf 11.956 0 Td [(x 0 i j + L 2 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 3 q X i =0 i )]TJ/F15 11.9552 Tf 13.64 3.155 Td [(^ i j V x )]TJ/F19 11.9552 Tf 11.956 0 Td [(x 0 i j + L 2 3 0 )]TJ/F15 11.9552 Tf 13.64 3.154 Td [(^ 0 + n X i =1 i )]TJ/F15 11.9552 Tf 13.64 3.154 Td [(^ i + q X i = n +1 i )]TJ/F15 11.9552 Tf 13.64 3.154 Td [(^ i 2 + L 2 3 q X i =0 i )]TJ/F15 11.9552 Tf 13.64 3.155 Td [(^ i + L 2 3 : .3 Sincethenoiseisuncorrelatedwithmeanzero,constantvariance,andisnormally distributed,itisknownthat ^ N ; 2 X T X )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 [46].If[ A ] i;i denotesthe i th diagonalentryofamatrix A ,thenthe1 )]TJ/F20 7.9701 Tf 13.454 5.112 Td [( k q 1 condenceintervalforeachofthe i hastheform[46]: 67

PAGE 79

1 )]TJ/F19 11.9552 Tf 13.15 8.087 Td [( k q 1 = P ^ i )]TJ/F19 11.9552 Tf 11.955 0 Td [(z 1 )]TJ/F21 5.9776 Tf 8.699 4.623 Td [( k 2 q 1 q 2 [ X T X )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 ] i;i < i < ^ i + z 1 )]TJ/F21 5.9776 Tf 8.699 4.623 Td [( k 2 q 1 q 2 [ X T X )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 ] i;i = P i )]TJ/F15 11.9552 Tf 13.64 3.155 Td [(^ i
PAGE 80

kr f x )-222(r m x k = r f x 0 + r 2 f x 0 T x )]TJ/F19 11.9552 Tf 11.956 0 Td [(x 0 + H x x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 )]TJ/F25 11.9552 Tf 11.291 9.683 Td [()]TJ/F22 11.9552 Tf 5.479 -9.683 Td [(r m x 0 + r 2 m x 0 x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 r f x 0 )-222(r m x 0 + r 2 f x 0 T x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 )-222(r 2 m x 0 x )]TJ/F19 11.9552 Tf 11.956 0 Td [(x 0 + H x x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 n X i =1 i )]TJ/F15 11.9552 Tf 13.64 3.155 Td [(^ i + q X i = n +1 i )]TJ/F15 11.9552 Tf 13.639 3.155 Td [(^ i + L x )]TJ/F19 11.9552 Tf 11.955 0 Td [(x 0 2 n X i =1 i )]TJ/F15 11.9552 Tf 13.64 3.154 Td [(^ i + q X i = n +1 i )]TJ/F15 11.9552 Tf 13.639 3.154 Td [(^ i + L 2 q X i =1 i )]TJ/F15 11.9552 Tf 13.64 3.155 Td [(^ i + L 2 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( 3 + L 2 2 + L 2 = eg 2 : where eg =1+ L .Asimilarargumentfor kr 2 f x )-222(r 2 m x k with eh =1+ L provesthetheorem. TocertifyamodelsatisesDenition4.9ortoimprovealeastsquaresregression modelintoonethatis -fullyquadraticwithcondence1 )]TJ/F19 11.9552 Tf 12.593 0 Td [( k isstraightforward: wemustensurethereareenoughpoisedpointswithin B x ;tosatisfythebound giveninTheorem4.10.Otherwise,addenoughstrongly-poisedpointsto Y .Fora techniquetogeneratestrongly-poisedsets,seeChapter3ofthisthesisor[12]. 4.1.2Modelswhichare ^ -fullyLinearwithCondence 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k Whilethemodels m k usedinthemainalgorithmarequadratic,linearmodels ^ m x canapproximate f near B x k + s k ; ^ k tosucientaccuracy.Therefore,ifwe haveenoughpointswithin B x k + s k ; ^ ,wecanboundtheerrorbetween f x k + s k and^ m k x k + s k .Wequantifythataccuracyinthefollowingdenition. 69

PAGE 81

Denition4.11 Let f satisfyAssumption4.2.Let ^ =^ ef ; ^ eg ; m 1 beagiven vectorofconstants,andlet > 0 .Amodelfunction m 2 C is ^ -fullylinearwith condence1 )]TJ/F19 11.9552 Tf 11.232 0 Td [( k on B x ; for k 2 ; 1 if m hasaLipschitzcontinuousgradient withcorrespondingLipschitzconstantboundedby m 1 and theerrorbetweenthegradientofthemodelandthegradientofthefunction satises P kr f y )-222(r m y k ^ eg 8 y 2 B x ; 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k ; theerrorbetweenthemodelandthefunctionsatises P )]TJ/F22 11.9552 Tf 5.479 -9.684 Td [(j f y )]TJ/F19 11.9552 Tf 11.956 0 Td [(m y j ^ ef 2 8 y 2 B x ; 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k : Thisisoccasionallyabbreviated -f.l.w.c. 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k . Theorem4.12 Ifthefunction f satisesAssumption4.2andthenoise satises Assumption4.1,thenforagiven k 2 ; 1 ,thereexistsa = ef ; eg ; m 1 such thatforany x 0 2 R n , > 0 ,if Y B x 0 ; isstrongly -poisedand j Y j z 1 )]TJ/F21 5.9776 Tf 11.839 4.623 Td [( k 2 n +2 2 2 n +1 3 2 4 : thenthelinearregressionmodelis fullylinearwithcondence 1 )]TJ/F19 11.9552 Tf 11.956 0 Td [( k . Proof: TheproofisnearlyidenticalthatofTheorem4.10. 4.2StochasticOptimizationAlgorithm Belowisanoutlineofourproposedstochasticalgorithm.For x k + s k ,the solutiontothetrustregionsubproblem,andaradius k > ^ k > 0,dene ^ Y k = n y 2 Y tot j x k + s k )]TJ/F19 11.9552 Tf 11.955 0 Td [(y ^ k o . Let Y tot = f y 1 ; ;y m g bethesetofpointswhere f hasbeenevaluated. f i := f y i .Deneanullmodel m 0 ,initialtrustregionradius 0 ,andaninitialTRcenter 70

PAGE 82

x 0 .Chooseconstantssatisfying0 << 1 < inc , c > 0,0 0 1 < 1 1 6 =0, where 0 0 < and ! 2 ; 1.Choose r 2 ; 1anddene ^ k = r k . Algorithm1: Atrust-regionalgorithmtominimizeastochasticfunction. Let k =0; Start ; Set k =min f k ; 0 : 05 g ; if & k max fk g k k ; )]TJ/F19 11.9552 Tf 9.298 0 Td [( min H k g < c andeitheri m k isnotcertiably -f.q.w.c. 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k on B x k ; k orii k >& k then ApplyAlgorithm2toupdate Y k , k ,and m k ; Set k =min f k ; 0 : 05 g ; else Selectorgenerateastrongly-poisedsetofpoints Y k B x k ; k from Y tot suchthat Y k hasenoughpointstoensure m k is -f.q.w.c.1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k . Buildaregressionquadraticmodel m k x through Y k .Solve s k argmin s : k s k < k m k x k + s .Builda -f.l.w.c.1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k model^ m k on ^ Y k possiblyaddingpointsto Y tot andcompute ^ k = m k x k )]TJ/F15 11.9552 Tf 14.148 0 Td [(^ m k x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k if ^ k 1 or ^ k 0 ^ m k is -f.q.w.c. 1 )]TJ/F19 11.9552 Tf 11.956 0 Td [( k on B x k ; k then x k +1 = x k + s k ; else x k +1 = x k ; if ^ k 1 then k +1 =min f inc k ; max g ; else k +1 = k ; Let m k +1 bethepossiblyimprovedmodel; Set k = k +1andgoto Start ; 71

PAGE 83

Notethatweareapproximating f x k + s k usingasecondmodel^ m k inadierent trustregion ^ k around x k + s k .Formalconvergenceofthealgorithm,specically Lemma4.15,requirestheabilitytoapproximate f x k + s k withincreasingaccuracy asthealgorithmprogresses.Suchaccuracyisnotavailablefromarealizationofthe noisefunctionvalue,namely f x k + s k .Whileitispossibletoobtainincreasingly accurateapproximationsof f x k + s k byrepeatedsampling,wearehopingthetheory generatedinthischaptercanbeeasilytransferedtothecasewheredeterministic noiseispresentin f .Withdeterministicnoise,Var f =0,andthereforerepeated samplingwillprovidenofurtherinformation. Also,ifweeventuallyshrinkourtrustregionaroundagivenpoint,pointsgeneratedin B x k + s k ; ^ k tomakeanaccuratemodel^ m k x k + s k canbeusedinthe constructionofanaccuratemodel m j x atsomelateriterate j . Algorithm2: CriticalityStep Initialization Set i =0.Set m k = m k . Repeat Increment i byone.Improvethepreviousmodelbyaddingpointsto Y tot untilitis -fullyquadraticwithcondence1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k on B x k ; ! i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 k .ThiscanbedonebyTheorem4.10andthemodel improvementalgorithmfrom[3]whichbuildsastrongly-poisedset Y in O )]TJ/F17 7.9701 Tf 10.162 -4.977 Td [(1 6 stepsifthemodelssatisfyDenition4.9.Denotethe newmodel m i k .Set ~ k = ! i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 k and~ m i = m i k .; Until ~ k & i k x k . Return m k =~ m k , k =min n max n ~ k ;& i k x k o ; k o ,and Y tot . WeadoptthenamingofiteratesfromConn,Scheinberg,Vicente: 1. k 1 ; x k + s k isacceptedandthetrustregionisincreased.Wecallthese iterations successful . 2. 1 > k 0 and m k is -fullyquadraticwithcondence1 )]TJ/F19 11.9552 Tf 12.463 0 Td [( k ; x k + s k is acceptedbut k isdecreased.Wecalltheseiterations acceptable . 72

PAGE 84

3. 1 > k and m k isnot -fullyquadraticwithcondence1 )]TJ/F19 11.9552 Tf 12.751 0 Td [( k ; x k + s k is notacceptedandthemodelisimproved.Wecalltheseiterations model improving . 4. 0 > k and m k is -fullyquadraticwithcondence1 )]TJ/F19 11.9552 Tf 12.75 0 Td [( k ; x k + s k isnot acceptedand k isreduced.Wecalltheseiterations unsuccessful . 4.3Convergence 4.3.1ConvergencetoaFirst-orderStationaryPoint Theuseofquadratic m k mightsuggestconvergencetoasecond-orderstationary point.Suchaproofwouldrequireaquadratic^ m k aswell,andsince ^ ,this wouldrequiremorepointsin B x k + s k ; ^ k thanin B x k ; k .Sinceitisfrequently thecasethat f x k + s k >f x k evenwhen m k is -f.q.w.c.1 )]TJ/F19 11.9552 Tf 9.871 0 Td [( k ,wenditwasteful tobuildaquadratic^ m k around x k + s k .Thisisoneofthemotivationsfor -fully linearmodelsfor^ m k ;withthis,wecanproveconvergencetoarst-orderstationary point. Werstshowthatif x k isnotastationarypointfor f ,thenAlgorithm2willexit withprobability1. Theorem4.13 Given k 2 ; 1 ,if f satisesAssumption4.2and r f x k > 2 ! j )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 k ,thereisprobabilityatleast 1 )]TJ/F19 11.9552 Tf 12.19 0 Td [( k thatAlgorithm2willcorrectlyexitoneachiterate i after j suchthat ! i )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 < q ! j )]TJ/F18 5.9776 Tf 5.757 0 Td [(1 eg ,where and ! are declaredinAlgorithm1. Proof: Assume ! i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 < q ! j )]TJ/F18 5.9776 Tf 5.756 0 Td [(1 eg , k 1,andAlgorithm2cyclesinnitely.After sucientlymanyiterationsofthecriticalitystep, m i k willbe -fullyquadraticwith 73

PAGE 85

condence1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k on B x k ; q ! j )]TJ/F18 5.9776 Tf 5.756 0 Td [(1 eg k .Therefore & i k g i k r f x k )]TJ/F25 11.9552 Tf 11.955 13.748 Td [( r f x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(g i k > 2 ! j )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 k )]TJ/F25 11.9552 Tf 11.955 13.748 Td [( r f x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(g i k byassumption 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( 2 ! j )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 k )]TJ/F19 11.9552 Tf 11.955 0 Td [( eg s ! j )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 eg ! 2 2 k byDenition4.9 = 2 ! j )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 k )]TJ/F15 11.9552 Tf 13.746 8.088 Td [(1 ! j )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 2 k 1 ! i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 k : since k 1and i j Soforeach i after j suchthat ! i )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 < q ! j )]TJ/F18 5.9776 Tf 5.757 0 Td [(1 eg ,wehave1 )]TJ/F19 11.9552 Tf 12.687 0 Td [( k condencethat Algorithm2willexit. Sincewerequire k ! 0as k ! 0,thenforany k > 0,eventually k willbe smallenoughsothatAlgorithm2withprobabilityatleast1 )]TJ/F19 11.9552 Tf 12.269 0 Td [( k .Inotherwords, thistheoremensuresthatthealgorithmexitswithprobability1. Lemma4.14 If f satisesAssumption4.5and m k is -fullyquadraticwithcondence 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( ,thereexistsaconstant bhm > 0 suchthat k H k k 1 )]TJ/F20 7.9701 Tf 6.586 0 Td [( bhm ; forall k ,where H k istheHessianof m k . Proof: k H k k H k )-222(r 2 f x k + r 2 f x k 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( eh k + r 2 f x k byDenition4.9 eh k + bhf byAssumption4.5 eh max + bhf =: bhm 74

PAGE 86

Thefollowinglemmashowsthat,if x k isnotastationarypointof f ,thenif k issmallenough,thereisahighprobabilitythatasuccessfulstepwillbetaken. Lemma4.15 Let f satisfyAssumption4.5andletthetrustregionsubproblemsolutionsatisfyAssumption4.4.Let = ef ; eg ; eh ; m 2 and ^ =^ ef ; ^ eg ; ^ eh ; m 1 . Lettheconstants fcd , bhm , ef , ^ ef , 1 beasspeciedinAssumption4.4, Lemma4.14,Denition4.9,Denition4.11,anddeclaredatthebeginningofAlgorithm1respectively.If m k is -f.q.w.c. 1 )]TJ/F19 11.9552 Tf 10.896 0 Td [( k on B x k ; k , ^ m k is ^ -f.l.w.c. 1 )]TJ/F19 11.9552 Tf 10.897 0 Td [( k on B x k + s k ; ~ k ,and k min k g k k bhm ; fcd k g k k )]TJ/F19 11.9552 Tf 11.955 0 Td [( 1 2 ef max +2^ ef ; .4 thenwehavecondence 1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(3 k that k 1 onthe k thiteration. Proof: UsingLemma4.14,thefactthat x k + s k isnoworsethantheCauchy stepAssumption4.4,and k k g k k bhm .4yields m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k 1 )]TJ/F20 7.9701 Tf 6.586 0 Td [( fcd 2 k g k k min k g k k bhm ; k = fcd 2 k g k k k : .5 75

PAGE 87

Usingthis, j k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 j = m k x k )]TJ/F15 11.9552 Tf 14.148 0 Td [(^ m k x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k )]TJ/F19 11.9552 Tf 13.15 8.087 Td [(m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k = m k x k + s k )]TJ/F15 11.9552 Tf 14.148 0 Td [(^ m k x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k m k x k + s k )]TJ/F19 11.9552 Tf 11.955 0 Td [(f x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k + f x k + s k )]TJ/F15 11.9552 Tf 14.148 0 Td [(^ m k x k + s k m k x k )]TJ/F19 11.9552 Tf 11.956 0 Td [(m k x k + s k 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( ef 3 k j m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k j + f x k + s k )]TJ/F15 11.9552 Tf 14.148 0 Td [(^ m k x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k byDenition4.9 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( ef 3 k j m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k j + ^ ef ^ 2 k j m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k j byDenition4.11 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( 2 ef 3 k +2^ ef ^ 2 k fcd k g k k k by.5 2 ef max +2^ ef fcd k g k k k since k ^ k 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( 1 : by.4 Sincewehavecondence1 )]TJ/F19 11.9552 Tf 12.095 0 Td [( k thatthesecond,thirdandfourthinequalitieshold, wehavecondence1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(3 k thatallthreeholdsimultaneously. Lemma4.16 Forall k ,assumethetrustregionsubproblemsolutionsatisesAssumption4.4.Let f satisfyAssumption4.5.Ifthereexistsaconstant 1 suchthat k g k k 1 forall k ,thenthereexistsanotherconstant 2 suchthat,foreveryiteration k where k 2 wehavecondence 1 )]TJ/F15 11.9552 Tf 12.024 0 Td [(3 k thatiteration k willbesuccessfuland k willincreaseif m k is -f.q.w.c. 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k . Proof: Thisproofissimilarto[12,Lemma10.7].WhetherAlgorithm2has beencalledornot, k min & k x k ; k )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 min fk g k k ; k )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 g min f 1 ; k )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 g 76

PAGE 88

Since k g k k 1 forall k ,Lemma4.15impliesthatwhenever k islessthan 3 =min 1 bhm ; fcd 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( 1 2 ef max +2^ ef ; wehavecondence1 )]TJ/F15 11.9552 Tf 12.644 0 Td [(3 k thatiteration k willbesuccessful k +1 = inc k or modelimproving k +1 = k .Ineithercase k +1 k sowehavecondence 1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(3 k that k +1 k willholdwhenever k min f 0 ; 1 ; 3 g = 2 . Theorem4.17 LetAssumptions4.1{4.5besatised.Ifthenumberofsuccessful iterationsisnite,then liminf k !1 r f x k =0 withprobability1. Proof: Consideriterationsafterthelastsuccessfuliteration,denoted k last .For every k>k last ,theiterationisunsuccessful k < 1 andthemodelimprovement algorithmiscalled.Ittakesanitenumber O 1 6 k ofmodelimprovementstepsfor themodeltobecome -fullyquadraticwithcondence1 )]TJ/F19 11.9552 Tf 12.249 0 Td [( k onagiven B x k ;; thereareaninnitenumberofiterationsthatareeitheracceptableorunsuccessful. Given k ,wecanguaranteethatthetrustregionradiusmustdecreasebyatleast onemultipleof 2 ; 1after C 6 k iterationsforaxedconstant C .Sinceforany > 0,thereexistsaninteger N suchthat N k last < .After C k last 6 + + C N )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 k last 6 = N X i =1 C i )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 k last 6 = C 6 )]TJ/F17 7.9701 Tf 6.587 0 Td [(6 N )]TJ/F19 11.9552 Tf 5.479 -9.683 Td [( 6 N )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 6 k last 6 )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 iterations,thetrustregionradiuswillbelessthan .Therefore,lim k !1 k =0, whichimplies k ! 0.Therefore,thereexistsaninnitesequenceofiterates f k i g where m k i is -f.q.w.c.1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k i and r f x k i r f x k i )]TJ/F19 11.9552 Tf 11.955 0 Td [(g k i + k g k i k 1 )]TJ/F20 7.9701 Tf 6.586 0 Td [( k i eg 2 k i + k g k i k : Thesecondtermconvergestozerowithprobability1.Toseethis,assume k g k i k is boundedawayfromzeroandwecanderiveacontradictionusingLemma4.15and 77

PAGE 89

thefactthatlim k !1 k =0.Since k i ! 0and k i ! 0,thenfor k i suciently large, k i 2 ,sothereisprobability1 )]TJ/F15 11.9552 Tf 12.323 0 Td [(3 k i thatiteration k i willbesuccessful. Thus,forany k > 0and K> 0,thereexists k i >K suchthattheprobabilityof step k i beingsuccessfulisgreaterthan1 )]TJ/F19 11.9552 Tf 12.26 0 Td [( k .Therefore,withprobability1,there areinnitelymanysuccessfuliterates,contradictingthedenitionof k last . 4.3.2InnitelyManySuccessfulSteps TheresultsthatfollowoutlinepartsofaproofforthecasewhenAlgorithm1 generatesinnitelymanysuccessfuliterates.Whiletheprevioustheoremproves k ! 0,theproofisnotvalidwhenthereareinnitelymanysuccessfuliterates. Wehavemadeconsiderableeorttoprove k ! 0,buthavebeenunabletodoso. Toprogress,weassumeitforthetimebeing. Assumption4.18 lim k !1 k =0 ItshouldbenotedthatitmaybepossibletoensureAlgorithm1satisesthisassumption,perhapsbyslowlydecreasing max .Thedetailswouldneedtobeworked out,butthisassumptionisnotasstrongasitmightappear. Conjecture4.19 IfAssumption4.18andAssumption4.5holdandthetrustregion subproblemsolutionsatisesAssumption4.4forall k ,then liminf k !1 k g k k =0 withprobability1. Discussion: If k g k k 1 forsome 1 0,byLemma4.16,thereexistsa 2 suchthat whenever k < 2 ,wewillhavea1 )]TJ/F15 11.9552 Tf 12.139 0 Td [(3 k condenceofincreasingthetrustregion. UsingAssumption4.18andthefactthat k =min f k ; 0 : 05 g ,wewillincreasethe trustregionwithprobabilityapproaching1as k getslarge.Thiswouldappearto 78

PAGE 90

contradict k ! 0,buttoprovealmostsureconvergenceassumingeachiterationis independentwemustshowtheproductofthe1 )]TJ/F19 11.9552 Tf 12.298 0 Td [( k approaches1.Andeventhe assumptionthateachiterationisindependentisdicult,asmanyofthepointsused tobuild m k willbeusedtobuild m k +1 .Iftheeventsaredependent,thenwemust considerconditionalprobabilitiessuchastheprobabilityonestepbeingasuccess giventhelaststepwasnot. Conjecture4.20 IfAssumptions4.1{4.5andAssumption4.18hold,foranysubsequence f k i g suchthat lim i !1 k g k i k =0.6 then,withprobability1 lim i !1 r f x k i =0 : Discussion: By.6,for i sucientlylarge, k g k i k c .Thus,byTheorem4.13, Algorithm2ensuresthatthemodel m k i is -f.q.w.c.1 )]TJ/F19 11.9552 Tf 9.749 0 Td [( k ontheball B x k i ; k i with k i k g k i k forall i sucientlylargeprovided r f x k i 6 =0.ByDenition4.9 r f x k i )]TJ/F19 11.9552 Tf 11.955 0 Td [(g k i 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( eg k i eg k g k i k : Therefore, r f x k i r f x k i )]TJ/F19 11.9552 Tf 11.955 0 Td [(g k i + k g k i k 1 )]TJ/F20 7.9701 Tf 6.587 0 Td [( eg +1 k g k i k : Since k g k i k! 0withprobability1,sodoes r f x k i . Conjecture4.21 IfAssumptions4.1{4.5andAssumption4.18hold,then liminf k !1 r f x k =0 withprobability1. Discussion: ByConjecture4.19,weknowtheremustexistasequenceof f k i g suchthatlim i !1 k g k i k =0.ByConjecture4.20,thissamesequence f k i g has lim i !1 r f x k i =0.Thisprovestheresult. 79

PAGE 91

4.4ComputationalExample Inthissection,wehighlightsomeoftheadvantagesofusingAlgorithm1overa traditionaltrustregionmethodwhichassumesdeterministicfunctionevaluations. Whilebothalgorithmshavemuchincommon,theslightdierencesbecomesignicant inthepresenceofstochasticnoise.Forexample,thedeterministicalgorithmsare susceptibletonegativenoise,asweseeintheFigure4.1.Inthatgure,thesolidline isthetruefunction f whichwewanttominimize,andthedashedblacklinesshow the95%condenceintervalofthenoise.Theblacksquaresmarkthenoisyfunctions whichdeterminethequadratictrustregionmodel m k andthetrustregionradius k isrepresentedbythedashedlines.Thetrustregioncenter x k hasaredboxaround it. 4.4.1DeterministicTrustRegionMethod Figure4.1showsatraditionaltrustregionmethodaftermovingtoanewtrust regioncenterat x k =2 : 5.Eachimageshowstheprogressofthealgorithm,andwe describewhatoccurredinthepreviousiteratetoyieldthepresentsituation: Figure4.1,topleft Bychance,therealizationof f x k + s k wasmuchlessthan f atanypointnear x k + s k .Itisnowthenewtrustregioncenter. Figure4.1,topright Theminimumofthequadraticmodelwasnotacceptedsince, k = f x k )]TJ/F15 11.9552 Tf 14.502 3.154 Td [( f x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k < 0 Thetrustregionradiushasalsobeenshrunksincethesamplesetisstrongly -poised. Figure4.1,bottomleft Apointoutsideofthetrustregionradiushasbeenremoved.Since k < 0,thetrustregionradiuswillshrinkagain. Figure4.1,bottomright Anotherpointoutsideofthetrustregionhasbeenremoved,andanewmodelhasbeenbuilt. 80

PAGE 92

Figure4.1: Severaliterationsofatraditionaltrustregionmethodassumingdeterministicfunctionevaluations.Thetrustregioncenterisnevermoved. Thedeterministicalgorithmwillacceptanewtrustregioncenterwhen k issucientlypositive,i.e.if f x k + s k isalsomuchlessthan f x k + s k .Ifthisdoes nothappen,thealgorithmwillnotndasuccessfulstepandthetrustregionradius willberepeatedlydecreased.Since f x k isneverre-evaluated,itislikelythatthe algorithmwillterminatewithoutevertakingafurtherstep. 4.4.2StochasticTrustRegionMethod Incontrast,byusing^ k introducedinSection4.2, ^ k = m k x k )]TJ/F15 11.9552 Tf 14.148 0 Td [(^ m k x k + s k m k x k )]TJ/F19 11.9552 Tf 11.955 0 Td [(m k x k + s k ; andincreasingthenumberofpointsinthetrustregionas k decreasesallowsthe algorithmtoproceedtomoveoofatrustregioncenterwithnegativenoise,seenin Figure4.2. 81

PAGE 93

Figure4.2: Severaliterationsofatraditionaltrustregionmethodassumingstochasticfunctionevaluations. Figure4.2,topleft Again,therealizationof f x k + s k wasmuchlessthan f at anypointnear x k + s k .Itisnowthenewtrustregioncenter. Figure4.2,topright Theminimumofthequadraticmodelwasnotacceptedsince ^ k < 0,butthetrustregionradiusisnotdecreased.Thoughthesampleset isstrongly-poised,therearenotenoughpointstoensurethemodelis f.q.w.c.1 )]TJ/F19 11.9552 Tf 11.955 0 Td [( k . Figure4.2,bottomleft Morepointshavebeenaddedtothesamplesetandthe modelhasbeenreconstructed. Figure4.2,bottomright Theminimumofthequadraticmodelisacceptedsince ^ k > 1 ,eventhough f x k + s k > f x k . 82

PAGE 94

Byusingthemodelvalueat x k insteadof f x k inthecalculationof^ k allowsthe estimateof f x k toadjustwithoutwastefullyreevaluating f x k .Inthisfashion, Algorithm1canavoidstagnatingatpointswithnegativenoise. 4.5Conclusion Inthischapterwepresentedanalgorithmusingquadratictrustregionmodels m k tominimizeafunction f whichcannotbeevaluatedexactly.Eventhoughthe algorithmonlyhasaccesstonoisecorruptedfunctionevaluations f ,weprovedalmost sureconvergenceofasubsequenceofiteratestoarst-orderstationarypointof f whenthenumberofsuccessfulstepsisnite.Wealsohaveoutlinedaproofforthe casewhenthenumberofsuccessfulstepsisinnite.Theseresultswereaccomplished, notbyrepeatedlysampling f atpointsofinterest x k ,butratherbyconstructing models^ m k whichareincreasinglyaccurateapproximationsof f near x k .Sinceitis oftenthecasethat x k isthecandidateforthenewtrustregioncenter,thisinformation isimmediatelyusefulinconstructing m k +1 .Wethenhighlightedhowthisalgorithm remediesacommonproblemwithusingtraditionaltrustregionmethodsonfunctions withstochasticnoise. 83

PAGE 95

5.Non-intrusiveTerminationofNoisyOptimization 5.1IntroductionandMotivation Theoptimizationofreal-world,computationallyexpensivefunctionsinvariably leadstothedicultquestionofwhenanoptimizationprocedureshouldbeterminated.Algorithmdevelopersandthemathematicaloptimizationcommunityatlarge typicallyassumethattheoptimizationisterminatedwheneitherameasureofcriticalitygradientnorm,meshsize,etc.issatisedorauser'scomputationalbudget numberofevaluations,wallclocktime,etc.isexhausted. Foralargeclassofproblems,however,theusermaynothaveawell-dened computationalbudgetandinsteaddemandaterminationtest t solving min t Computationalexpense t s.t.Acceptableaccuracyofthesolution t ; .1 withthecriticalitymeasureofthesolveremployedtypicallychosenwiththeaccuracy constraintinmind.Examplesofsuchaccuracy-basedcriticalitytestsarediscussed indetailbyGill,Murray,andWright[19,Section8.2.3]. Themaindicultiesarisingfromthisapproacharearesultof.1possibly beingpoorlyformulated.Thecomputationalexpensecouldbeunboundedbecause anaprioriuser-denedaccuracyisunrealisticfortheproblem/solverpair.Furthermore,ausermayhavedicultytranslatingthecriticalitymeasuresprovidedbya solver,whicharegenerallybasedonassumptionsofsmoothnessandinnite-precision calculations,intopracticalmetricsonthesolutionaccuracy. InFigure5.1weillustratethechallengesinthisareawithanexamplefrom nuclearphysics,similartotheminimizationproblemsconsideredin[37].Eachofthe functionvaluesshownisobtainedfromrunningadeterministicsimulationforone minuteona640-corecluster.Stoppingtheoptimizationsoonerthan200function evaluationswouldnotonlyreturnasolutionfasterbutwouldalsofreetheclusterfor 84

PAGE 96

Figure5.1: Partofanoisytrajectoryoffunctionvaluesforanexpensivenuclear physicsproblem.Aftermoresignicantdecreasesintherst70evaluations,progress beginstostall. otherapplicationsand/orresultinasavingsinenergy,anincreasinglycrucialfactor inhigh-performancecomputing. IfweassumethattheoptimizationpartiallyshowninFigure5.1hasnotbeen terminatedbyasolver'scriticalitymeasuresorauser'scomputationalbudget,the questionisthenwhetherterminationshouldoccurforotherreasons.Forexample, ifonlytherstthreedigitsofthesimulationoutputwerecomputedstably,onemay wanttoterminatetheoptimizationsoonerthanifcomputationalnoisecorrupted onlytheeighthdigitoftheoutput.Alternatively,thebehaviorshowncouldmean thesolverinquestionhasstagnatedbecauseofnoise,errorsinthesimulation,a limitationofthesolver,etc.,andhenceexaminingthesolutionand/orrestartingthe optimizationcouldbeamoreeectiveuseoftheremainingcomputationalbudget. Wright[65]referstothisstalledprogressas perseveration andnotesthatthereisno fullygeneralwaytodene`insucientprogress.'"Evenso,itmaybeadvantageous touseknowledgeoftheuncertaintyoraccuracyofagivenfunctionevaluationwhen makingsuchadecision. 85

PAGE 97

Intheremainderofthischapterweexploretheseissuesandproposetermination criteriathatcanbeeasilyincorporatedontopofauser'ssolverofchoice.In[18], Fletchersummarizesthechallengesathandinthecaseofround-oerrorsalone: Someconsiderationhastobegiventotheeectsofround-onearthesolution,andtoterminatewhenitisjudgedthattheseeectsarepreventing furtherprogress.Itisdiculttobecertainwhatstrategyisbestinthis respect. Moreover,Gill,Murray,andWright[19]stressthat nosetofterminationcriteriaissuitableforalloptimizationproblemsand allmethods. ThissentimentissharedbyPowell[47]whosays itisbelievedthatitisimpossibletochoosesuchaconvergencecriterion whichiseectiveforthemostgeneralfunction...soacompromisehasto bemadebetweenstoppingtheiterativeproceduretoosoonandcalculating f anunnecessarilylargenumberoftimes. Consequently,wewillconsiderteststhatallowfortheuseofestimatesofthenoise particulartoaproblem.Furthermore,ourcriteriaarenotintendedassubstitutes foracomputationalbudgetorasolver'sbuilt-incriticalitytests,whichweconsider tobeimportantsafeguards.Likewise,theterminationproblemcanbeviewedasa real-timecontrolproblemdependingoncompleteknowledgeofthesolver'sdecisions, butweresistthisurgeforpurposesofportabilityandapplicability. WeprovidebackgroundonpreviousworkandintroducenotationinSection5.2. ThefamiliesofstoppingtestsweproposeinSection5.3donotprovideguarantees onthequalityofthesolution,althoughdoingsomaybetheroleofasolver'sbuiltincriteria.Instead,theproposedtestsareparameterizedinordertoquantifya 86

PAGE 98

user'strade-obetweenthebenetofachievingadditionaldecreaseandthecostof additionalevaluations,whilerequiringaminimalamountofinformationfromthe solver.Equallyimportant,ourresultsinSection5.4comparingthequalityofthese familiesofstoppingtestsonacollectionoflocaloptimizationalgorithms.Werst considerallsolversasasingleroutine,latervalidatingthisapproachbydemonstrating equalperformanceforthebesttestsonindividualalgorithms.Whileourresults canbeincorporatedinalocalsubroutineofanyglobalsearchalgorithm,thetests proposedinSection5.3areunabletodistinguishbetweenexplorationandrenement phasesintheircurrentform.WesummarizeourresultsinSection5.5andprovide recommendationswhenimplementingthesetests. 5.2Background Ourdiscussionandanalysisarelimitedtooptimizationmethodsthatdonot explicitlyrequirederivativeinformation.However,otheralgorithmscouldreadily employthetestsproposedhereinadditiontotheirderivative-basedstoppingcriteria. Whileourworkcanbefurtherextendedtoincorporatenoisygradientinformation, thederivativesofnoisyfunctionsaretypicallyevennoisierthanthefunction. Derivative-freeoptimizationmethodsareoftenfavoredfortheirperceivedability tohandlenoisyfunctions.Althoughasymptoticconvergenceofthesemethodsis generallyprovedassumingasmoothfunction,adjustmentsarefrequentlymadeto accommodatenoise.Inthecaseofstochasticfunctions,wherenoiseresultsfroma randomdistributionwithVar f x > 0,replicationsoffunctionevaluationscanbe usedtomodifyexistingmethodse.g.,[14]modifyingUOBYQA[48],[15,1]modifying DIRECT[30],and[61]modifyingNelder-Meadsee,e.g.,[12].However,stopping criteriaforthesemethodsinvolvelimitedknowledgeofthenoiseandindicatethe widevarietyofstoppingtestsusedinpractice.In[1],optimizationisstoppedwhen adjacentpointsarewithin10 )]TJ/F17 7.9701 Tf 6.587 0 Td [(4 ofeachother,whereas[15]allowsstoppingwhen thebestfunctionvaluehasnotbeenimprovedaftersomenumberofconsecutive 87

PAGE 99

iterations.Tolimitthenumberofstochasticreplications,theauthorsof[14]and[61] adjustthemaximumnumberofallowedreplicationsataparticularpointbasedon thevarianceofthenoise. Deterministicnoise{thatis,noisethatresultsfromadeterministicprocess,such asnite-precisionarithmetic,iterativemethods,andadaptiveprocedures{isfarless understoodthanitsstochasticcounterpart[42].Notsurprising,evenlessknowledge ofthemagnitudeofnoiseisusedforproblemswithdeterministicobjectives.When low-amplitudenoiseispresent,Kelley[33]proposesarestarttechniqueforNelderMeadbutterminateswhensucientlysmalldierencesexistinthesimplicialfunction values,independentofthemagnitudeofthenoise.Implicitltering[32]hasnumerous terminationpossibilitiessmallfunctionvaluedierencesonastencil,asmallchange inthebestfunctionvaluefromoneiterationtothenext,etc.butnonethatare explicitlyrelatedbytheauthortothemagnitudeofthenoise.Asimilarimplicit relationshiptonoisecanbeseenin[24],wheretreedGaussianprocessmodelsfor optimizationareterminatedwhenamaximumimprovementstatisticissuciently small.TheauthorsofSNOBFIT[29]suggeststoppingwhenthebestpointhasnot changedforanumberofconsecutivecallstothemainSNOBFITalgorithm. OurworkmorecloselyfollowsthatofGillet.al[19],wheresection.2isdevoted topropertiesofthecomputedsolution.Theauthorsthererecommendterminating Nelder-Mead{likealgorithmswhenthemaximumdierencebetweenfunctionvalues onthesimplexislessthanademandedaccuracyweightedbythebestfunctionvalue onthesimplex. Theonlyotherdirectrelationshipbetweenstoppingcriteriaandameasureof noisethatweareawareofarein[42,Section9]and[25].In[42],astochasticmodelof thenoiseisusedtoestimatethe noiselevel ofafunctionvalue f x bydierencetablebasedapproximationsofthestandarddeviationVar f x 1 = 2 .Resultsarevalidated fordeterministic f .Asanexampleapplication,theauthorsterminateaNelder-Mead 88

PAGE 100

methodonanODE-basedproblemwhenconsecutivedecreasesarelessthanafactor ofthenoiselevel.Theauthorsof[25]perturbbound-constrainedproblemssothe incumbentiterateistheexactsolutiontothisnewproblem.Analgorithmcanthen beterminatedwhenthesizeofthisperturbationrstdecreasesbelowtheerrorin theproblem.Naturalextensionstogradient/derivative-basedtestsarealsoenabled bytherecentworkin[43]wherenear-optimalnitedierenceestimatesareprovided asafunctionofthenoiselevel. Beforeproceeding,wedenethenotationemployedthroughout.Welet R + denotethenonnegativerealsand N denotethenaturalnumbers.Welet f x 1 ; ;x m g R n and f f 1 ; ;f m g2 R beasequenceofpointsandcorrespondingfunctionvalues producedbyalocalminimizationsolver,andwecollectthedatafromtherst i evaluationsin F i = f x 1 ;f 1 ;:::; x i ;f i g .Thebestfunctionvalueintherst i evaluationsisgivenby f i =min 1 j i f f j g ,with x i denotingthepointcorrespondingto f i . Accordingly,thesequence f f i g isnonincreasing.Unlessotherwisestated, kk denotes thestandardEuclideandistance. Welet^ " i r beanestimateoftherelativenoiseat f i i.e.,thenoiseat x i scaledby themagnitudeof f x i .Thisestimatemaycomefromexperience,numericalanalysis oftheunderlyingprocessesincomputing f i ,orappropriatescalingby1 = j f i j ofthe noise-levelestimatesfromthemethodproposedin[42].Inthecaseofstochastic functionswithnonzeromeanat x i ,^ " i r isthestandarddeviationof f x i relativeto theexpectedvalue E [ j f x i j ]. Favorablepropertiesofaterminationtestincludescaleandshiftinvariance,so thatthetestwouldterminateafterthesamenumberofevaluationsforanyane transformationoftheobjectivefunction.Specically,atestisscaleinvariantin f if itterminatesoptimizationrunsdenedby F i and F i f x 1 ;f 1 ;:::; x i ;f i g at anidenticalevaluationnumberforany > 0.Similarly,atestisshiftinvariantin f ifitterminates F i and F i + f x 1 ;f 1 + ;:::; x i ;f i + g afteranidentical 89

PAGE 101

numberofevaluationsregardlessof .Weusethefollowingpropositiontoaidinthe subsequentanalysisofscaleandshiftinvariance. Proposition5.1 Forstochastic f i ofnite,nonzerovariance,theestimateoftherelativenoise, ^ " i r ,isscaleinvariantin f andtheabsolutenoise, ^ " i r E [ j f x i j ] j p Var f x i ,isshiftinvariantin f . Proof: BothresultsfollowdirectlyfrompropertiesofVar .Since Var f x i > 0,itfollowsthat E [ j f x i j ] > 0andhence,for > 0 ^ " i r = p Var f x i E [ j f x i j ] = p 2 Var f x i E [ j f x i j ] = p Var f x i E [ j f x i j ] : Whendenedbythestandarddeviation,theabsolutenoiseisshiftinvariantin f because p Var f x i = p Var f x i + . Inthecaseofdeterministicnoise, invariancedependsonthemethodsusedtoobtaintheestimates^ " i r and^ " i r j f i j . 5.3StoppingTests Inthissectionwedenefamiliesofterminationtestsandprovidemotivationfor theiruse.Eachfamilycanbedenedthroughanextended-valuefunction mapping to R [f + 1g .Given F ,theproblemdatafromanoptimizationrun,theassociated terminationteststopsafter i F evaluations,where i F isthesolutiontoahitting problem, i F =min i f i : F i ; F i ; 0 g : .2 where F i and denoteparametersthatarepossiblydependenton F i andindependentof F i ,respectively.Hence,ateststopsanoptimizationrun,producingthe history F i ,attherst i 2 N where F i ; F i ; 0.Membersofafamilyoftests aredeterminedbydierentvaluesoftheparametervector F i ; .Since quanties theprogressofanalgorithmthroughthehistoryoffunctionvaluesandpoints,each familyoftestsisdesignedtodeterminewhencontinuingwiththepresentcourseis likelywastefulasmeasuredbytheparametersin F i ; . 90

PAGE 102

Itisoftenusefultoconsiderhowatestwillchangeiftheunderlyingfunction undergoesananechange.FollowingSection5.2,wewillsaythatatestis scale invariant if i F =min i f i : F i ; F i ; 0 g8 > 0 ; where F i f x 1 ;f 1 ;:::; x i ;f i g ,and shiftinvariant if i F =min i f i : F i + ; F i + ; 0 g8 ; where F i + f x 1 ;f 1 + ;:::; x i ;f i + g .Similaranechangesto f x 1 ;:::;x i g couldbeconsideredbutarenotcentraltothepresentdiscussion,andhenceallnotions ofinvarianceherearerelativetothefunction f .Wedropthesubscriptfrom i F if F isimplied. Similarly,itisusefultoconsiderwhether ismonotoneinsomeofitsparameters. Monotonicityof isdesirablebecauseitresultsinthesameformofmonotonicity forthecorrespondingnumberofevaluations i .Forexample,if ismonotonically increasinginascalarparameter ,thenincreasing resultsinamoreconservative testbecausethesolutionto.2isatleastaslarge.Asaconsequence,if is monotonicallyincreasingin andthetest ; ; 1 isneversatisedonasetof problems,itisnotnecessarytoconsider > 1 valuesonthatsetofproblems becauseitwillalsoneverbesatised. Wenowdeneseveralfamiliesofterminationtestsanddiscusstheirproperties andunderlyingmotivation.Allofthesetestsassumenoknowledgeoftheinner workingsofthealgorithmtheyareterminating,butsuchknowledgemightleadto appropriatemodications.Forexample,ifthemethodusesasimplex,ratherthan stoppingwhenthelast functionevaluationsarewithinafactorofthenoise,one couldstopwhenthelast simplexverticesarewithinafactorofthenoiseessentially amodicationoftheproposedrulein[19]. 5.3.1 f i 0 Test 91

PAGE 103

1 F i ; F i ;; 8 > > < > > : f i )]TJ/F21 5.9776 Tf 5.756 0 Td [( +1 )]TJ/F20 7.9701 Tf 6.586 0 Td [(f i )]TJ/F19 11.9552 Tf 11.955 0 Td [( j f i j F i if i ; 1 else, with F i ; 2 R + ; 2 N : .3 Thisfamilyoftestsisdesignedtostopwhentheaveragerelativechangein f overthe last evaluationsislessthan F i .Theinteger canbethoughtofasabackward dierenceparameterforestimatingthechangeinthebestfunctionvaluewithrespect tothenumberofevaluations. Wenotethat 1 ismonotonicallydecreasingin since,forxed , F i ,and F i , 1 2 = 1 F i ; F i ;; 1 1 F i ; F i ;; 2 : 1 isalsomonotonicallydecreasingin F i butisnotmonotonein .Membersof thisfamilyarescaleinvariantprovidedthat F i is,andshiftinvariantprovidedthat j f i j F i isproveninTheorem5.2inSection5.3.6. Weconsidertwospecialcases.When F i =1oranyconstant,weobtaintests thatarescaleinvariantbutnotshiftinvariantandstopiftheaveragerelativechange inthebestfunctionvaluedropsbelow .If F i =^ " i r ,thetestsarescaleandshift invariantbyProposition5.1andstopanalgorithmiftheaveragerelativechangein f becomeslessthanafactor timestherelativenoise. 5.3.2Max-Dierencef Test With F i ; 2 R + ; 2 N ,dene 2 F i ; F i ;; 8 > > < > > : max i )]TJ/F20 7.9701 Tf 6.586 0 Td [( +1 j i j f j )]TJ/F19 11.9552 Tf 11.956 0 Td [(f i j)]TJ/F19 11.9552 Tf 17.933 0 Td [( j f i j F i if i ; 1 else. .4 Thisfamilyoftestsstopswhen consecutivefunctionvaluesarewithin j f i j F i of f i . 92

PAGE 104

Onecanshowthat 2 ismonotonicallydecreasinginboth and F i andmonotonicallyincreasingin since 1 2 = max i )]TJ/F20 7.9701 Tf 6.587 0 Td [( 1 j i j f j )]TJ/F19 11.9552 Tf 11.955 0 Td [(f i j max i )]TJ/F20 7.9701 Tf 6.587 0 Td [( 2 j i j f j )]TJ/F19 11.9552 Tf 11.955 0 Td [(f i j : Wealsonotethatif 2 ismodiedsothat f j isreplacedby f j ,weobtainatest equivalentto 1 F i ; F i ;; .Theinvariancepropertiesofthistestareidenticalto thosefor 1 andformallyproveninTheorem5.2inSection5.3.6. Weexaminetwospecialcases.If F i =1oranyconstant, 2 isscaleinvariant butnotshiftinvariant;thisfamily, 2 F i ;1 ;; ,terminateswhenthelast function valuesdierbylessthanafactor relativetothebestfunctionvaluesofar.If F i =^ " i r ,theresultingtestsarescaleandshiftinvariantbyProposition5.1and terminatewhentheabsolutechangeinthelast functionvaluesiswithinafactor ofthenoiseof f . 5.3.3Max-Distancex Test 3 F i ; ; 8 > > < > > : max i )]TJ/F20 7.9701 Tf 6.586 0 Td [( +1 j;k i k x j )]TJ/F19 11.9552 Tf 11.956 0 Td [(x k k)]TJ/F19 11.9552 Tf 20.59 0 Td [( if i 1 else, with 2 R + ; 2 N : .5 Thisfamilystopswhen consecutive x -valuesarewithinadistance ofeachother andisanalyzedwith 4 below. 5.3.4Max-Distancex i Test 4 F i ; ; 8 > > < > > : max i )]TJ/F20 7.9701 Tf 6.587 0 Td [( +1 j i x j )]TJ/F19 11.9552 Tf 11.955 0 Td [(x i )]TJ/F19 11.9552 Tf 11.955 0 Td [( if i ; 1 else, with 2 R + ; 2 N : .6 Thisfamilystopswhen consecutive x i -valuesarewithinadistance ofeachother. Ingeneral,membersofbothofthefamiliesdenedby 3 and 4 arenotscaleshift 93


Both $\tau_3$ and $\tau_4$ are monotonically decreasing in $\theta$ and monotonically increasing in $\nu$. We also examined a test using $\max_{i-\nu+1 \le j \le i} \|x_j - x_i\|$ but found the performance to be similar to that of $\tau_3$.

5.3.5 Max-Budget Test

\[
\tau_5(F_i, \theta) \equiv
\begin{cases}
0 & \text{if } i \ge \theta, \\
1 & \text{else,}
\end{cases}
\qquad \text{with } \theta \in \mathbb{N}. \qquad (5.7)
\]
As a point of reference, we include this family, corresponding to stopping after a budget of $\theta$ evaluations.

5.3.6 Tests Based on Estimates of the Noise

The families of tests introduced above have been broadly parameterized to capture a wide range of behaviors. The following theorem summarizes the relationship between some of these parameters and the invariance properties of the tests.

Theorem 5.2
(a) If $\mu(F_i)$ is scale invariant, then all members of the families $\tau_1$ and $\tau_2$ are scale invariant.
(b) If $|f_i|\,\mu(F_i)$ is shift invariant, then all members of the families $\tau_1$ and $\tau_2$ are shift invariant.
(c) If the procedure generating the data $\{x_i\}$ is scale (shift) invariant, then all members of the families $\tau_3$ and $\tau_4$ are scale (shift) invariant.
(d) Members of the $\tau_5$ family are scale and shift invariant.

Proof: (a) Since $\nu$ and $\theta$ are independent of $F$, the definitions in (5.3) and (5.4) show that, for any $\alpha > 0$,
\[
\tau_1(\alpha F_i, \mu(\alpha F_i), \nu, \theta) = \alpha\bigl(f_{i-\nu+1} - f_i\bigr) - \theta\,\alpha|f_i|\,\mu(\alpha F_i),
\]
\[
\tau_2(\alpha F_i, \mu(\alpha F_i), \nu, \theta) = \max_{i-\nu+1 \le j \le i} \alpha|f_j - f_i| - \theta\,\alpha|f_i|\,\mu(\alpha F_i),
\]
provided that $i \ge \nu$.


Scale invariance of $\mu(F_i)$ implies that $\mu(\alpha F_i) = \mu(F_i)$ and hence implies that $\tau_j(\alpha F_i, \mu(\alpha F_i), \nu, \theta) = \alpha\,\tau_j(F_i, \mu(F_i), \nu, \theta)$ for $j = 1, 2$. Since $\alpha$ is positive, the values of $i$ for which $\tau_j$ and $\alpha\tau_j$ change sign are identical, showing that $i_\tau(\alpha F) = i_\tau(F)$.

(b) Similarly, for any $\beta$ and $i \ge \nu$, we have that
\[
\tau_1(F_i + \beta, \mu(F_i + \beta), \nu, \theta) = \bigl(f_{i-\nu+1} - f_i\bigr) - \theta\,|f_i + \beta|\,\mu(F_i + \beta),
\]
\[
\tau_2(F_i + \beta, \mu(F_i + \beta), \nu, \theta) = \max_{i-\nu+1 \le j \le i} |f_j - f_i| - \theta\,|f_i + \beta|\,\mu(F_i + \beta),
\]
and hence if $|f_i + \beta|\,\mu(F_i + \beta) = |f_i|\,\mu(F_i)$, then $\tau_j(F_i + \beta, \mu(F_i + \beta), \nu, \theta) = \tau_j(F_i, \mu(F_i), \nu, \theta)$ for $j = 1, 2$. As a result, the functions in the hitting problem (5.2) remain unchanged and $i_\tau(F + \beta) = i_\tau(F)$.

(c) Both $\tau_3$ and $\tau_4$ depend on $F$ only through the locations of the evaluated points, $x_i$. Hence, if the procedure for generating the $x_i$ produces the same points for $\alpha F$ (resp. $F + \beta$) as it would for $F$, the tests are scale (resp. shift) invariant.

(d) The function $\tau_5$ is independent of $F$, and hence the hitting problem (5.2) is unaffected by changes in $F$. $\square$

As a consequence of Theorem 5.2 and Proposition 5.1, using $\mu(F_i) = \hat{\varepsilon}_r^i$ as an estimate of the noise in $\tau_1$ and $\tau_2$ results in tests that are both scale and shift invariant. Furthermore, we see that the first term in the definitions of $\tau_1$ and $\tau_2$ has a strong association with the magnitude of the noise. This feature is illustrated in Figure 5.2, where each line represents one instance of a Nelder-Mead method minimizing a 10-dimensional convex quadratic for increasing levels of noise. Figure 5.2 (top) shows the first term of $\tau_1$, $f_{i-\nu+1} - f_i$, as the algorithm progresses. Here we see that the quantity generally flattens out at increments separated by the same order of magnitude as the seven noise levels. This behavior is even more evident in Figure 5.2 (bottom) when the first term of $\tau_2$, $\max_{i-\nu+1 \le j \le i} |f_j - f_i|$, is considered.
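The relative-noise estimate $\hat{\varepsilon}_r^i$ used above is the one referenced in Proposition 5.1 and, in the experiments below, is either known or computed with the techniques of [42]. For purely stochastic noise, a simple stand-in (shown only to fix ideas and not the estimator used in this thesis) is to resample the function at a fixed point:

```python
import numpy as np

def simple_relative_noise(func, x, samples=30):
    # Repeatedly evaluate a stochastically noisy function at a fixed point;
    # the sample standard deviation relative to |mean| estimates the relative noise.
    vals = np.array([func(x) for _ in range(samples)])
    return np.std(vals, ddof=1) / max(abs(np.mean(vals)), np.finfo(float).tiny)
```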


Figure 5.2: First terms in $\tau_1$ (top, with $\nu = 100$) and $\tau_2$ (bottom, with $\nu = 10$) on a $\log_{10}$ scale when minimizing a 10-dimensional convex quadratic with stochastic relative noise of different magnitudes. The asymptotes of the quantities shown tend to be separated by the differences in magnitudes of the noise.

Consequently, in the numerical tests in Section 5.4, we restrict our attention to tests based on $\tau_1$ and $\tau_2$ for which $\mu(F_i) = \hat{\varepsilon}_r^i$. We note that a larger $\nu$ is required in Figure 5.2 (top) to prevent the first term in $\tau_1$ from prematurely taking a zero value; dependence on parameters like $\nu$ is discussed further in Section 5.4. We examined plots similar to those in Figure 5.2 for the first terms of $\tau_3$ and $\tau_4$ but found no such relationship with the noise level. As a result, we have chosen not to include constants of the form $\mu(F_i)$ in the definitions of $\tau_3$ and $\tau_4$.

5.3.7 Relationship to Loss Functions


Ideally, an algorithm should stop when the cost of performing additional function evaluations outweighs additional improvements in the function value. When such a trade-off can be quantified, this problem becomes one of optimal stopping [54]. Results in the literature typically focus on cases when the distribution of the stochastic improvement is known. We briefly illustrate a connection between our tests and a simple loss function employed in optimal stopping.

We focus on the case when the cost of an additional evaluation is constant. This can be viewed as treating the computational expense per function evaluation as constant, but the cost and the tests proposed here could be suitably modified as an algorithm enters a subdomain where the cost of an evaluation changes. Given a sequence $\{f_j\}$, the loss function
\[
L(i, c) = \min_{1 \le j \le i} \{f_j\} + c\, i \qquad (5.8)
\]
provides a measure of the success of stopping after $i$ evaluations when the cost per evaluation relative to $f$ is $c$. This loss function appears in the optimal stopping literature as the house-selling problem [6], where the $\{f_j\}$ are assumed to be independent and identically distributed random variables.

Figure 5.3 shows the minimizer of $L(\cdot, c)$ for a variety of $c$ values on a sequence $\{f_j\}_{j=1}^{3000}$ output by a direct search solver on a nonlinear function with deterministic (left) and stochastic (right) noise. We compare this minimizer with the number of evaluations $i_\tau$ defined by (5.2) for the family $\tau_1$ when $c$ is used as a linear multiplier for the parameter $\theta$. Figure 5.3 shows a strong connection between the behavior of $\mathrm{argmin}_i\, L(i, c)$ and the termination test defined by $\tau_1$ using an estimate of the noise and an appropriate choice of the parameters $(\nu, \theta)$. This illustrates how varying the parameters in the proposed families can be closely related to the cost of performing an evaluation.
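The comparison just described can be reproduced in a few lines: given a recorded sequence of function values, the minimizer of $L(i, c)$ in (5.8) follows from a running minimum, and a $\tau_1$-type stopping index with $\theta$ proportional to $c$ can be computed alongside it. The proportionality of $\theta$ to $c$ and the names below are illustrative assumptions rather than the exact choices behind Figure 5.3.

```python
import numpy as np

def argmin_loss(f, c):
    """1-based index minimizing L(i, c) = min_{j <= i} f_j + c*i from (5.8)."""
    f = np.asarray(f, dtype=float)
    losses = np.minimum.accumulate(f) + c * np.arange(1, len(f) + 1)
    return int(np.argmin(losses)) + 1

def tau1_stop_index(f, mu, nu, c):
    """First 1-based index at which a tau_1 test with theta = c is satisfied,
    as in the hitting problem (5.2); falls back to the history length when
    the test never fires."""
    f = np.asarray(f, dtype=float)
    for i in range(nu, len(f) + 1):
        if (f[i - nu] - f[i - 1]) <= c * abs(f[i - 1]) * mu:
            return i
    return len(f)
```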


Figure 5.3: Number of evaluations $i_\tau$ for a termination test based on (5.3) with fixed $\mu(F_i)$ and $\nu$, but using a $\theta$ parameterized by $c$. The plots show remarkably similar behavior to the number of evaluations that minimize $L(\cdot, c)$ in (5.8).

5.4 Numerical Experiments

We now demonstrate the merits of the proposed tests and explore the effect of changing the associated parameter values by considering outputs generated by a set of derivative-free optimization solvers on a collection of noisy test problems. We present illustrative results here; a comprehensive set of plots may be found at http://www.mcs.anl.gov/~wild/tnoise/ .

We consider the collection of unconstrained least-squares problems used in [41], with each function taking the form
\[
f(x) = \bigl(1 + \varepsilon\, g(x)\bigr) \sum_{i=1}^{m} F^s_i(x)^2, \qquad (5.9)
\]
where each $F^s_i$ is a smooth, deterministic function and $\varepsilon \ll 1$ is a positive scalar used to control the amplitude of the noise being added to $f^s(x) = \sum_{i=1}^{m} F^s_i(x)^2$. We begin our study by considering stochastic noise, so that $g(x)$ represents independent and identically distributed (iid) random variables with variance $\mathrm{Var}(g(x)) = 1$. As a result, the relative noise of these test functions is simply $\varepsilon$ and, hence, independent of $x$. The constant 1 was added in (5.9) so that the relative noise is consistently defined even if $F^s_i(x) = 0$ for all $i$; such shifts are commonly performed in accuracy measures (see, e.g., [16]).


To examine the tests on a diverse set of local methods, we consider sequences $\{f_j\}$ produced by different derivative-free optimization solvers. Since the relative merits of these solvers are not the focus of this study, we do not explicitly list them, but we note that they come from a variety of classes, including model-based methods, implementations of Nelder-Mead, pattern search methods, and methods that cross these classes. After examining the termination tests on the entire group of solvers, we analyze the success of the recommended tests on each solver individually as a means of validation.

To more accurately study the effect of our tests, we have made the built-in termination criteria of these solvers as ambitious as possible in an attempt to remove their influence. Hence, we ran each solver until either it crashed (e.g., for a numerical reason, such as the simplex sides being dropped sufficiently below machine precision) or a maximum budget of 5,000 function evaluations was achieved. This budget of evaluations is significantly larger than the one considered in [41], and we consider it to be more than sufficient for the problems in this set, which range in dimension from $n = 2$ to $n = 12$. We denote the maximum number of function evaluations (either 5,000, or fewer if the solver crashed) by $i_{\max}$.

We then have a set of 318 nonnegative sequences $\{f_j\}_{j=1}^{i_{\max}}$, which constitute our set of problems $\mathcal{P}$. We use these problems to examine the performance of a set of tests $\mathcal{T}$, defined as members of the families proposed in Section 5.3. For a test $t \in \mathcal{T}$ and problem $p \in \mathcal{P}$, we denote by $i_{p,t}$ the number of function values after which test $t$ would stop on problem $p$. If the test is not satisfied before the maximum number of evaluations $i_{\max}$ of problem $p$, we let $i_{p,t} = i_{\max}$ to mirror what would be done in practice.
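A sketch of how test functions of the form (5.9) can be generated for the stochastic-noise case follows. The smooth residuals $F^s_i$ in the experiments come from the benchmark set of [41]; the residual mapping passed in below is a placeholder for any such vector-valued function, and a standard normal $g$ is one convenient choice of an iid variate with unit variance.

```python
import numpy as np

def make_noisy_least_squares(residuals, eps, seed=0):
    """Return f(x) = (1 + eps * g(x)) * sum_i F_i(x)^2 as in (5.9),
    with g drawn i.i.d. with unit variance at every evaluation."""
    rng = np.random.default_rng(seed)

    def f(x):
        smooth = float(np.sum(np.asarray(residuals(x), dtype=float) ** 2))
        return (1.0 + eps * rng.standard_normal()) * smooth

    return f

# Illustration with a stand-in residual vector (not a problem from [41]):
f = make_noisy_least_squares(lambda x: np.asarray(x) - 1.0, eps=1e-3)
```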


5.4.1 Accuracy Profiles for the $\tau_1$ Family

Termination criteria that are too easily satisfied have limited practicality since they could stop with a function value far from the minimum. We will measure this ability by considering the relative difference between $f_{i_{p,t}}$ and $f_{i_{\max}}$,
\[
e_{p,t} =
\begin{cases}
1 & \text{if } i_{p,t} = i_{\max}, \\
\dfrac{f_{i_{p,t}} - f_{i_{\max}}}{f_{i_{p,t}}} & \text{if } i_{p,t} < i_{\max}.
\end{cases}
\qquad (5.10)
\]
The accuracy profile $\omega_t(a)$ then reports, for each accuracy level $a$, the fraction of the problems in $\mathcal{P}$ on which test $t$ attains $e_{p,t} \le a$.
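For completeness, the accuracy measure (5.10) and the corresponding profile value can be tabulated as below; the profile computation assumes the reading of $\omega_t(a)$ given above (the fraction of problems with $e_{p,t}$ at most $a$), and the helper names are ours.

```python
def accuracy_measure(f, i_pt):
    """e_{p,t} from (5.10) for a history f of length i_max and a 1-based
    stopping index i_pt; the benchmark histories are positive, so the
    division is well defined."""
    i_max = len(f)
    if i_pt == i_max:
        return 1.0
    return (f[i_pt - 1] - f[i_max - 1]) / f[i_pt - 1]

def accuracy_profile_value(errors, a):
    # Fraction of problems whose accuracy measure is at most a.
    return sum(e <= a for e in errors) / len(errors)
```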

Figure 5.4: Accuracy profiles for members of the $\tau_1$ family on problems (5.9) with two different magnitudes of known stochastic relative noise $\varepsilon$. In the top plots, $\nu$ is held fixed and the shown members have different $\theta$ values. In the bottom plots, $\theta$ is held fixed and the shown members have different $\nu$ values.

The left asymptote of $\omega_t(a)$ shows the fraction of problems on which the test stopped after it reached the minimum value $f_{i_{\max}}$, while the right asymptote is the fraction of problems for which the test was satisfied with $i_{p,t} < i_{\max}$.

As $\theta$ grows and the tests become less conservative, a number of problems are terminated well before the relative error is on the order of the noise. On the other hand, not much is gained by setting $\theta$ less than $10^{-1}$ or $10^{-2}$.

The bottom two plots in Figure 5.4 show $\tau_1$ family members for fixed $\theta = 10^{-2}$ and various values of $\nu$, which can be thought of as a backward difference parameter. Little improvement is seen for $\nu > 20n$, but a marked decrease in accuracy occurs when $\nu < 10n$. Many problems are stopping with a large relative error, in part because $f_j$ can remain unchanged for many consecutive $j$. For example, the lower left plot shows that for noise affecting the third digit, on all 318 problems $f_j$ remained unchanged for $3n$ consecutive evaluations before $j = i_{\max}$ was reached.

5.4.2 Performance Profiles for the $\tau_1$ Family

While accuracy profiles can quantify when a test stops too soon, they may not reveal which tests require excessive function evaluations to achieve high accuracies. For example, the maximum budget test $\tau_5$ trivially achieves ideal accuracy but can make many more evaluations than are required to get sufficient accuracy.

We use performance profiles [17] to compare different stopping rules in terms of both accuracy and the number of function evaluations required. A performance profile requires a convergence test as well as a performance measure $r_{p,t}$ for each problem $p \in \mathcal{P}$ and test $t \in \mathcal{T}$. We use the number of evaluations $i_{p,t}$ as our performance measure and a convergence test requiring that the solution obtained is within the absolute noise level of the best function value,
\[
f_{i_{p,t}} - f_{i_{\max}} \le |f_{i_{p,t}}|\, \hat{\varepsilon}_r^i. \qquad (5.11)
\]
The convergence test has the effect of setting the performance measure $i_{p,t} = \infty$ whenever the original $i_{p,t}$ does not satisfy (5.11). The performance ratio
\[
r_{p,t} =
\begin{cases}
\dfrac{i_{p,t}}{\min\bigl\{ i_{p,\tilde{t}} : \tilde{t} \in \mathcal{T},\ \text{(5.11) satisfied for } (p, \tilde{t}) \bigr\}} & \text{if (5.11) is satisfied for } (p, t), \\
\infty & \text{else}
\end{cases}
\]
measures the relative performance on problem $p$ of test $t$ when compared with the other tests in $\mathcal{T}$.


Figure 5.5: Performance profiles for the most accurate $\tau_1$ tests on problems (5.9) with two different magnitudes of known stochastic relative noise $\varepsilon$. Note that the horizontal axis has been truncated for each plot; $\tau_5$ eventually terminates all of the problems and thus has a profile that will reach the value 1; all other tests change by less than 0.01.

The performance profile
\[
\rho_t(\pi) = \frac{\bigl|\{ p \in \mathcal{P} : r_{p,t} \le \pi \}\bigr|}{|\mathcal{P}|}
\]
then represents the fraction of problems where test $t$ satisfied the accuracy requirement (5.11) with a number of evaluations within a factor $\pi$ of the best-performing, sufficiently accurate test. Larger values of $\rho_t(\pi)$ are hence better, with $\rho_t(1)$ being the fraction of problems where $t$ has successfully terminated first among all tests in $\mathcal{T}$ and $\lim_{\pi \to \infty} \rho_t(\pi)$ being the fraction of problems for which $t$ satisfied (5.11).

Figure 5.5 shows the performance profiles for the most accurate $\tau_1$ family members for two levels of noise $\varepsilon$. We include $\tau_5(\cdot, i_{\max})$ in $\mathcal{T}$ as a point of reference to indicate an upper bound on the fraction of problems that all other tests, which may not have terminated with $i_{p,t} < i_{\max}$, can

satisfy (5.11).

These performance profiles illustrate that some members of the $\tau_1$ family of tests require a fraction of the full $i_{\max}$ evaluations. This is the case especially for larger magnitudes of noise, where less accurate solutions are demanded. Likewise, as the noise decreases, (5.11) demands more accurate solutions, and it becomes necessary to perform $i_{\max}$ evaluations on a larger share of the problems.

These performance profiles also demonstrate how more liberal stopping rules can be more successful than the accuracy profiles reveal. For example, in Figure 5.4, $\tau_1(\cdot, \cdot, 20n, 0)$ and $\tau_1(\cdot, \cdot, 20n, 10^{-2})$ appear nearly identical in terms of their accuracy, but in Figure 5.5 we see a marked difference in the performance measures. The right asymptotes of their performance profiles are nearly identical, a reflection of their accuracy profiles at $a = \varepsilon$, but the rest of the profiles show that $\tau_1(\cdot, \cdot, 20n, 10^{-2})$ uses considerably fewer function evaluations to satisfy this accuracy requirement. Because of this high accuracy and performance, we consider $\tau_1(\cdot, \cdot, 20n, 10^{-2})$ to be the best of the stopping rules considered in its family.

5.4.3 Accuracy and Performance Plots for the $\tau_2$ Family

Having outlined our procedure for determining what constitutes good members of the $\tau_1$ family, we can now quickly do so for the family based on $\tau_2$ in (5.4). The accuracy profiles in the upper left plot of Figure 5.6 show that the $\tau_2(\cdot, \cdot, 10n, 1)$ test was satisfied on less than 5% of the problems, and so $\theta \le 1$ has little relevance for this family. In our experience, decreasing $\nu$ did not alleviate this problem for small $\theta$. In general, $\tau_2$ tends to be much more sensitive to the $\theta$ value than are tests based on $\tau_1$. We also see that $\tau_2$ is more accurate at smaller values of $\nu$ than $\tau_1$ was; $\nu = 3n$ is now a more competitive parameter choice. This trade-off in accuracy comes at the cost of the $\tau_2$ tests being more conservative and, hence, satisfied on fewer of the problems.


Figure 5.6: Accuracy (top) and performance (bottom) profiles for the $\tau_2$ family on problems (5.9) with two different magnitudes of stochastic relative noise $\varepsilon$ as $\nu$ and $\theta$ are varied.

We again use performance profiles to measure whether tests are overly conservative. As indicated by the larger range for $\pi$ in the bottom two plots of Figure 5.6, the $\tau_2$ family of tests is more difficult to satisfy overall, and the number of function evaluations required compares slightly less favorably with $i_{\max}$ than for the $\tau_1$ family. We also see that $\tau_2(\cdot, \cdot, 3n, 10)$ tends to be the most liberal test, in part because it requires fewer consecutive evaluations than the other members shown (as indicated by the value of $\nu$), but that $\tau_2(\cdot, \cdot, 10n, 10)$ requires just a small increase in the number of function evaluations while solving a greater fraction of problems overall. Based on our computational experience, we consider $\tau_2(\cdot, \cdot, 10n, 10)$ to be the best test of those considered in this family for these problems.


Figure 5.7: Accuracy (top) and performance (bottom) profiles for the best tests on problems (5.9) with two different magnitudes of stochastic relative noise $\varepsilon$. The horizontal axes on the performance profiles are truncated for clarity; $\tau_5$ eventually achieves a value of 1; all other tests change by less than 0.03.

5.4.4 Across-family Comparisons

We performed similar comparisons for the members of the $\tau_3$ and $\tau_4$ families, but the analysis is identical to what has been presented above. For the benchmark problems $\mathcal{P}$, we found that the tests $\tau_3(\cdot, n, 10^{-7})$ and $\tau_4(\cdot, 20n, 10^{-7})$ performed best among those considered in their respective families. See http://mcs.anl.gov/~wild/tnoise for a complete study of the tests in these families.

Having identified the best members of each family of tests, we compare them head-to-head in Figure 5.7.


The top two plots of Figure 5.7 demonstrate that when the four tests considered stop with fewer than $i_{\max}$ evaluations, they all tend to have obtained a solution within the level of the noise, though $\tau_1$ and $\tau_3$ are slightly worse due to a 1% jump past $a = \varepsilon$. The $x$-value based tests $\tau_3$ and $\tau_4$ have nearly identical performance, are highly accurate, and are satisfied for a large fraction of the problems. For the function value based tests, $\tau_1$ applies to more of the problems we examined than $\tau_2$.

On the other hand, the lower two plots of Figure 5.7 show that the test based on $\tau_2$ generally requires fewer evaluations to be satisfied on a significant number of problems with larger noise. The tests based on $\tau_3$ and $\tau_4$ are more favorable when the noise level is low. Since the performance and accuracy of the $\tau_3$ and $\tau_4$ tests are nearly identical, we remove $\tau_4$ from further consideration.

5.4.5 Deterministic Noise

We now consider how these tests perform in the presence of deterministic noise by using functions of the form (5.9) with a deterministic $g$. To model deterministic noise, we use the same $g$ (combining high-frequency and lower-frequency nonsmooth oscillations) as used in [41], with $g : \mathbb{R}^n \to [-1, 1]$ defined by the cubic Chebyshev polynomial
\[
g(x) = \phi(x)\bigl(4\phi(x)^2 - 3\bigr), \qquad \text{where } \phi(x) = 0.9\sin(\|x\|_1)\cos(\|x\|_\infty) + 0.1\cos(\|x\|_2).
\]
Using the technique in [42], we consistently estimated the relative noise in the 318 resulting problems to be of the order $0.6\varepsilon$, provided that the sampling distance is appropriately chosen.

To study the various families of termination tests in the deterministic case, an analysis identical to the stochastic case was performed. While there were slight differences in the performance of tests between these cases, for the most part the conclusions were similar. For ease of presentation, we proceed by discussing the best stochastic tests in the deterministic case, acknowledging that slightly more conservative tests are needed.


Figure 5.8: Accuracy (top) and performance (bottom) profiles for the best tests on problems (5.9) with two different magnitudes of deterministic noise. The horizontal axes on the performance profiles are truncated for clarity; $\tau_5$ eventually achieves a value of 1; all other tests change by less than 0.03.

For a complete study of the deterministic noise see http://www.mcs.anl.gov/~wild/tnoise/ .

The accuracy profiles in Figure 5.8 show a mild decrease in accuracy for the best tests compared with stochastic noise in Figure 5.7. As a result, we see that on just over 10% of the problems, the tests based on $\tau_1$ and $\tau_2$ now terminate while not satisfying the convergence test (5.11) when $\varepsilon = 10^{-3}$. In practice, one would also need an estimate of the relative noise $\hat{\varepsilon}_r^i$, but our results for $\tau_1$ and $\tau_2$ varying the linear multiplier of the noise, $\theta$, show that the tests remain relatively stable if $\hat{\varepsilon}_r^i$ is estimated within an order of magnitude. Also, adjusting these tests to be slightly more conservative improves their efficacy.


Figure 5.9: Performance profiles for more conservative tests on problems (5.9) with two different magnitudes of deterministic noise. The horizontal axes on the performance profiles are truncated for clarity; $\tau_5$ eventually achieves a value of 1; all other tests change by less than 0.03.

Results for this change are presented in Figure 5.9.

5.4.6 Validation for Individual Solvers

The results above show favorable accuracy and performance for our recommended tests for a collection of solvers on a suite of problems. For these termination tests to be of practical use for each individual solver in the collection, one must ensure that the tests do not prematurely terminate one solver more often than another solver. We present accuracy profiles in Figure 5.10 broken down by the six solvers examined and note that no single solver accounts for a disproportionate number of incorrect terminations. It is noteworthy that neither recommended test is particularly relevant to solver D, as each is rarely satisfied, and would therefore have no effect (positive or negative) on this solver on these problems. We remark that the behavior seen is more an indication of solver D's performance on these noisy problems than of the tests' performance. This is verified by the $\tau_3$ plot in Figure 5.10, showing that solver D never has $n$ consecutive points within a Euclidean distance of $10^{-7}$ of each other.


Figure 5.10: Accuracy profiles for the individual algorithms on the recommended tests.

5.5 Discussion

In this chapter we considered parameterized families of termination tests that require solely a history of evaluations (points and function values) from an optimization solver. Our analysis and experiments show how values for these parameters can be changed to reflect a user's view of the expense of an additional function evaluation and the accuracy demanded, the two characteristics that form the basis for (5.1). When used in conjunction with performance profiles, the accuracy profiles introduced here are valuable for guiding termination decisions based on this trade-off between accuracy and function evaluations.

In our study of stochastic noise we encountered a number of problems where a solver produced no change in $f$ for 300+ evaluations but then found a change in the first digit of $f$.


Table 5.1: Recommendations for termination tests for noisy optimization.

  $\tau_1$ with $\nu = 20n$, $\theta = 0.01$: Stop when the average relative decrease in the best function value over the last $20n$ function evaluations is less than one-hundredth of the relative noise level.
  $\tau_2$ with $\nu = 10n$, $\theta = 10$: Stop when the last $10n$ function evaluations are within 10 times the absolute noise level of the best function value.
  $\tau_3$ with $\nu = n$, $\theta = 10^{-7}$: Stop when the last $n$ points evaluated are within a distance of $10^{-7}$ of each other.

This was sufficiently remediated in the tests based on function values ($\tau_1$ and $\tau_2$) by $\nu$, a parameter determining how far into the past function values are remembered. We recommend a baseline value of $\nu = 20n$, with a corresponding $\theta$ less than or equal to 0.1 for $\tau_1$. Good performance for $\tau_2$ can still be seen for $\nu$ less than $10n$, but this test is more sensitive to $\theta$ values. For $\theta$ much smaller than 10, $\tau_2$ tests are rarely satisfied, whereas for $\theta$ much larger than 10, termination can occur with an inaccurate solution. This result is important as previous work focused on successive decreases [41] or values on a simplex or stencil [19]. Our recommendations are summarized in Table 5.1.

We have also seen that fewer problems are terminated before the budget constraint as the noise (measured by $\varepsilon$) becomes small. In these cases, however, a solver's built-in termination criteria should be satisfied more easily. Tests based on $x$ values, rather than function values, perform surprisingly well. When the noise level is more significant, tests utilizing knowledge of this noise can lead to improved performance. In any case, termination tests using knowledge of the points evaluated and their function values outperform, by the metrics presented in this chapter, simply exhausting a budget of function evaluations.
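As a usage illustration of Table 5.1, the three recommended tests can be checked together on a history of evaluated points and function values. The particular combination below, the assumption that f_hist stores the best function value after each evaluation, and the externally supplied relative-noise estimate eps_r are choices made for this sketch; they are not a prescription from the experiments.

```python
import numpy as np

def recommended_stop(x_hist, f_hist, n, eps_r):
    """Return True if any of the Table 5.1 recommendations is satisfied."""
    i = len(f_hist)
    f = np.asarray(f_hist, dtype=float)

    # tau_1, nu = 20n, theta = 0.01: relative decrease over the last 20n
    # evaluations below one-hundredth of the relative noise level.
    nu = 20 * n
    if i >= nu and (f[i - nu] - f[i - 1]) <= 0.01 * abs(f[i - 1]) * eps_r:
        return True

    # tau_2, nu = 10n, theta = 10: last 10n values within 10 times the
    # absolute noise level of the best (most recent best) value.
    nu = 10 * n
    if i >= nu and np.max(np.abs(f[i - nu:] - f[i - 1])) <= 10.0 * abs(f[i - 1]) * eps_r:
        return True

    # tau_3, nu = n, theta = 1e-7: last n points within 1e-7 of each other.
    nu = n
    if i >= nu:
        X = np.asarray(x_hist[i - nu:], dtype=float)
        pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        if np.max(pairwise) <= 1e-7:
            return True

    return False
```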


For deterministic noise, we found that the recommended tests used considerably fewer function evaluations than the maximum budget. This result was obtained at the cost of lower accuracy. We recommend slightly modifying these tests to be more conservative for deterministic noise and again allowing the solver's built-in tests to stop a run when necessary. This effect is shown in Figure 5.9, where we see that tests based on $\tau_1$ and $\tau_2$ perform better when $\nu$ is increased by $10n$.

Although our numerical experiments focused on local derivative-free solvers, the proposed tests can also be used in conjunction with the termination tests of derivative-based algorithms or the refinement stage of global algorithms. Incorporating additional information, such as noisy derivative values or the number of local minima, is left as future work.


6. Concluding Remarks

The results in this thesis naturally lead into further research. While Chapter 3 proved convergence of a trust region algorithm using weighted regression models (assuming the condition number of the weighting matrix is bounded), that leaves considerable freedom in deciding on a weighting scheme. The weighting proposed in Section 3.5 has some heuristic intuition, but nothing more. We have experimented with various other weighting schemes, some of which we thought would speed convergence, though that has not happened yet. While it seems intuitive to pick the weights which minimize the error between the model and the function, instituting such a plan on test problems has not improved the convergence of the algorithm.

Each time a model is built, we considered the calculation of the weights as a separate problem to be solved. We attempted to choose weights which solve
\[
\min_w \; \bigl\| \nabla m_k(x_k) - \nabla f(x_k) \bigr\| \quad \text{s.t.} \quad \mathrm{cond}(W) \le c_1, \; w_i \ge c_2,
\]
where $c_1$ and $c_2$ are small positive constants and $W = \mathrm{diag}(w)$. In words, we are attempting to minimize the difference between the gradients of the model and the function at the trust region center. While a practical implementation would only be able to minimize bounds on this objective as a proxy for minimizing the desired quantity, initial results suggest against such avenues of research. For a proof of concept, we constructed DFO problems with derivatives that are known and used by the weighting scheme but not by the algorithm. On these test problems, the algorithm using the heuristic weighting scheme presented in Section 3.5.1 outperformed the optimized weights. We also selected weights solving


\[
\min_w \; \max_{x \in B(x_k, \Delta_k)} \bigl\| \nabla m_k(x) - \nabla f(x) \bigr\| \quad \text{s.t.} \quad \mathrm{cond}(W) \le c_1, \; w_i \ge c_2,
\]
at considerable computational effort. Again, these weights performed no better than heuristic schemes. We are still searching for a weighting scheme which outperforms our heuristic.

We believe the weights introduced in Chapter 3 should allow for more accurate models when the accuracy of function evaluations differs from point to point (i.e., more accurate function values should be weighted more heavily). This could extend the work of [60], where increasingly accurate function evaluations can be achieved at increasing cost. The algorithm in Chapter 3 could be modified to call the function with various degrees of accuracy as it converges to a minimum.

In addition to proving that a subsequence of iterates from our stochastic algorithm converges almost surely to a first-order stationary point when there are infinitely many successful iterates, we are also generalizing the results from Chapter 4 to apply to all noise with mean zero and bounded variance rather than normally distributed errors.

Both of the algorithms in Chapter 3 and Chapter 4 are based on quadratic models, which require $O(n^2)$ function evaluations for a function in $n$-dimensional space. If the function evaluations are even moderately expensive, then merely constructing an initial quadratic model can exhaust a computational budget (for $n = 100$, for example). We are exploring ways to start the algorithm with under-determined models and slowly build up towards linear and quadratic models. We have asked whether it is better, when there are more than $n + 1$ points in the trust region, to build an under-determined quadratic or an overdetermined linear model; such an answer might not exist.

One of the largest contributions of Chapter 5 is the framework to


objectively compare the quality of stopping criteria. The tests presented in this chapter were directed at stopping noisy function optimization, and since DFO algorithms have purported deftness in handling noise, only DFO algorithms were compared. Nonetheless, the tools developed to compare stopping criteria translate directly to algorithms where reliable derivatives are available but evaluating them is expensive.

Lastly, during the course of research, we studied the convergence rate of DFO algorithms. What follows is the only proof of quadratic convergence for a class of DFO methods that we are aware of, assuming $f$ is smooth and provided $m_k$ is a sufficient approximation of $f$.

Assumption 6.1  $f \in LC^2$, and $x^*$ satisfies second-order sufficient conditions for $f$. Specifically, $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite.

Assumption 6.2  $m_k \in LC^2$ is a model which approximates $f$ with an error
\[
\|\nabla f(x) - \nabla m_k(x)\| = O\bigl(\|x - x^*\|^2\bigr) \quad \text{and} \quad \|\nabla^2 f(x) - \nabla^2 m_k(x)\| = O\bigl(\|x - x^*\|\bigr)
\]
for $x$ sufficiently close to $x^*$.

For ease of notation, define the following:
1. $m_k^k := m_k(x_k)$
2. $m_k^* := m_k(x^*)$
3. $\nabla m_k^k := \nabla m_k(x_k)$
4. $\nabla^2 m_k^k := \nabla^2 m_k(x_k)$
5. $f_k := f(x_k)$
6. $f^* := f(x^*)$

Theorem 6.3  If $f$, $x^*$, and $m_k$ satisfy the conditions in Assumptions 6.1 and 6.2, then for $x_0$ sufficiently close to $x^*$, the iterates generated by Newton steps, $x_{k+1} = x_k - (\nabla^2 m_k^k)^{-1} \nabla m_k^k$, converge quadratically to $x^*$.


Proof: Using the integral form of Taylor's theorem,
\[
\nabla h(x + p) = \nabla h(x) + \int_0^1 \nabla^2 h(x + tp)\, p \, dt, \qquad (6.1)
\]
with $h = m_k$, $x = x_k$, and $p = x^* - x_k$ yields
\[
\nabla m_k\bigl(x_k + (x^* - x_k)\bigr) = \nabla m_k(x_k) + \int_0^1 \nabla^2 m_k\bigl(x_k + t(x^* - x_k)\bigr)\,(x^* - x_k)\, dt,
\]
so that
\[
\nabla m_k(x_k) - \nabla m_k(x^*) = -\int_0^1 \nabla^2 m_k\bigl(x_k + t(x^* - x_k)\bigr)\,(x^* - x_k)\, dt,
\]
that is,
\[
\nabla m_k^k - \nabla m_k^* = \int_0^1 \nabla^2 m_k\bigl(x_k + t(x^* - x_k)\bigr)\,(x_k - x^*)\, dt. \qquad (6.2)
\]


Therefore,
\[
\|x_{k+1} - x^*\| = \bigl\| x_k - (\nabla^2 m_k^k)^{-1} \nabla m_k^k - x^* \bigr\|.
\]
Factor:
\[
= \bigl\| (\nabla^2 m_k^k)^{-1} \bigl[ \nabla^2 m_k^k (x_k - x^*) - \nabla m_k^k \bigr] \bigr\|.
\]
Add and subtract the same terms:
\[
= \bigl\| (\nabla^2 m_k^k)^{-1} \bigl[ \nabla^2 m_k^k (x_k - x^*) - (\nabla m_k^k - \nabla m_k^*) - \nabla m_k^* \bigr] \bigr\|.
\]
Add $\nabla f(x^*) = 0$ and group:
\[
= \bigl\| (\nabla^2 m_k^k)^{-1} \bigl[ \nabla^2 m_k^k (x_k - x^*) - (\nabla m_k^k - \nabla m_k^*) - (\nabla m_k^* - \nabla f(x^*)) \bigr] \bigr\|.
\]
Use the integral form of Taylor's theorem (6.2):
\[
= \Bigl\| (\nabla^2 m_k^k)^{-1} \Bigl[ \nabla^2 m_k^k (x_k - x^*) - \int_0^1 \nabla^2 m_k\bigl(x_k + t(x^* - x_k)\bigr)(x_k - x^*)\, dt - (\nabla m_k^* - \nabla f(x^*)) \Bigr] \Bigr\|.
\]
Since $c = \int_0^1 c\, dt$ for any constant $c$,
\[
= \Bigl\| (\nabla^2 m_k^k)^{-1} \Bigl[ \int_0^1 \nabla^2 m_k^k (x_k - x^*)\, dt - \int_0^1 \nabla^2 m_k\bigl(x_k + t(x^* - x_k)\bigr)(x_k - x^*)\, dt - (\nabla m_k^* - \nabla f(x^*)) \Bigr] \Bigr\|.
\]
Combining the integrals with identical limits and factoring the constant vector:
\[
= \Bigl\| (\nabla^2 m_k^k)^{-1} \Bigl[ \int_0^1 \bigl( \nabla^2 m_k^k - \nabla^2 m_k(x_k + t(x^* - x_k)) \bigr)\, dt \,(x_k - x^*) - (\nabla m_k^* - \nabla f(x^*)) \Bigr] \Bigr\|.
\]
Because $\nabla^2 m_k$ is Lipschitz (with constant $L$),
\[
\le \Bigl\| (\nabla^2 m_k^k)^{-1} \Bigl[ \|x_k - x^*\| \int_0^1 L\, t\, \|x_k - x^*\|\, dt - (\nabla m_k^* - \nabla f(x^*)) \Bigr] \Bigr\|.
\]


Evaluate the integral:
\[
= \Bigl\| (\nabla^2 m_k^k)^{-1} \Bigl[ \frac{L}{2} \|x_k - x^*\|^2 - (\nabla m_k^* - \nabla f(x^*)) \Bigr] \Bigr\|.
\]
Use the triangle inequality:
\[
\le \bigl\| (\nabla^2 m_k^k)^{-1} \bigr\| \frac{L}{2} \|x_k - x^*\|^2 + \bigl\| (\nabla^2 m_k^k)^{-1} (\nabla m_k^* - \nabla f(x^*)) \bigr\|.
\]
By the Cauchy-Schwarz inequality,
\[
\le \bigl\| (\nabla^2 m_k^k)^{-1} \bigr\| \frac{L}{2} \|x_k - x^*\|^2 + \bigl\| (\nabla^2 m_k^k)^{-1} \bigr\| \, \|\nabla m_k^* - \nabla f(x^*)\|.
\]
Since $\nabla^2 f(x^*)$ is positive definite, we can pick $x$ close enough to $x^*$ so that $\nabla^2 m_k(x)$ is also positive definite. We can therefore bound $\|(\nabla^2 m_k(x))^{-1}\| \le c_0 \|(\nabla^2 f(x^*))^{-1}\|$ for some constant $c_0$. This results in
\[
\|x_{k+1} - x^*\| \le \frac{L}{2}\, c_0 \bigl\| (\nabla^2 f(x^*))^{-1} \bigr\| \, \|x_k - x^*\|^2 + c_0 \bigl\| (\nabla^2 f(x^*))^{-1} \bigr\| \, c_1 \|x_k - x^*\|^2 \le C \|x_k - x^*\|^2
\]
for some constants $c_0$, $c_1$, $C$, where $c_1$ comes from Assumption 6.2. Therefore, the sequence of $x_k$ converges quadratically to $x^*$. $\square$

As mentioned in the introduction, problems with expensive and noisy function evaluations are prevalent in a variety of fields. As computational resources and modeling mature, these problems arise with increasing frequency. Therefore, the work in this thesis should lay the groundwork for a variety of pertinent contributions to the optimization community.


REFERENCES

[1] E. J. Anderson and M. C. Ferris. A direct search algorithm for optimization with noisy function evaluations. SIAM Journal on Optimization, 11:837-857, 2001.
[2] Charles Audet and J. E. Dennis Jr. A pattern search filter method for nonlinear programming without derivatives. Technical report, SIAM Journal on Optimization, 2000.
[3] S. C. Billups, J. Larson, and P. Graf. Derivative-free optimization of expensive functions with computational noise using weighted regression. Technical report, University of Colorado Denver, 2010. SIAM Journal on Optimization, submitted.
[4] K. H. Chang, L. Jeff Hong, and H. Wan. Stochastic trust-region response-surface method (STRONG) - a new response-surface framework for simulation optimization. INFORMS Journal on Computing, pages 1-14, April 2012.
[5] T. D. Choi and C. T. Kelley. Superlinear convergence and implicit filtering. SIAM Journal on Optimization, 10:1149-1162, 2000.
[6] Y. S. Chow and H. Robbins. On optimal stopping rules. Probability Theory and Related Fields, 2:33-49, 1963.
[7] P. G. Ciarlet and P. A. Raviart. General Lagrange and Hermite interpolation in R^n with applications to finite element methods. Archive for Rational Mechanics and Analysis, 46:177-199, 1972.
[8] A. R. Conn, K. Scheinberg, and P. L. Toint. Recent progress in unconstrained nonlinear optimization without derivatives. Mathematical Programming, 79:397-414, 1997.
[9] A. R. Conn, K. Scheinberg, and L. N. Vicente. Geometry of interpolation sets in derivative-free optimization. Mathematical Programming, 111:141-172, 2008.
[10] A. R. Conn, K. Scheinberg, and L. N. Vicente. Geometry of sample sets in derivative free optimization: Polynomial regression and underdetermined interpolation. IMA Journal on Numerical Analysis, 28:721-748, 2008.
[11] A. R. Conn, K. Scheinberg, and L. N. Vicente. Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM Journal on Optimization, 20:387-415, 2009.
[12] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. MPS-SIAM Series on Optimization. SIAM, Philadelphia, 2009.


[13] A. L. Custodio, H. Rocha, and L. N. Vicente. Incorporating minimum Frobenius norm models in direct search. Computational Optimization and Applications, 46:265-278, 2010.
[14] G. Deng and M. C. Ferris. Adaptation of the UOBYQA algorithm for noisy functions. In Proceedings of the Winter Simulation Conference, pages 312-319, 2006.
[15] G. Deng and M. C. Ferris. Extension of the DIRECT optimization algorithm for noisy functions. In Proceedings of the Winter Simulation Conference, pages 497-504, 2007.
[16] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, Philadelphia, 1996.
[17] E. D. Dolan and J. J. More. Benchmarking optimization software with performance profiles. Mathematical Programming, 91:201-213, 2002.
[18] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, 2nd edition, 1987.
[19] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, London, 1981.
[20] F. Glover. Tabu search - Part I. INFORMS Journal on Computing, 1:190-206, January 1989.
[21] F. Glover, M. Laguna, and R. Marti. Fundamentals of scatter search and path relinking. Control and Cybernetics, 39, 2000.
[22] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996.
[23] N. I. M. Gould, D. Orban, and P. L. Toint. CUTEr and SifDec: A constrained and unconstrained testing environment, revisited. ACM Transactions on Mathematical Software, 29:373-394, 2003.
[24] R. B. Gramacy and M. A. Taddy. Categorical inputs, sensitivity analysis, optimization and importance tempering with tgp version 2, an R package for treed Gaussian process models. J. Statistical Software, 33:1-48, 2010.
[25] S. Gratton, M. Mouffe, and P. Toint. Stopping rules and backward error analysis for bound-constrained optimization. Numerische Mathematik, 119:163-187, 2011.
[26] S. Gratton and L. N. Vicente. A surrogate management framework using rigorous trust-region steps. Preprint 11-11, Univ. of Coimbra, March 2011.
[27] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, 2nd edition, 2002.


[28] J. H. Holland. Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, MA, USA, 1992.
[29] W. Huyer and A. Neumaier. SNOBFIT - stable noisy optimization by branch and fit. ACM Transactions on Mathematical Software, 35:1-25, 2008.
[30] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79:157-181, 1993.
[31] B. Karasozen. Survey of trust-region derivative-free optimization methods. Journal of Industrial and Management Optimization, 3:321-334, 2007.
[32] C. T. Kelley. Users Guide for imfil version 1. Available at www4.ncsu.edu/~ctk/imfil.html .
[33] C. T. Kelley. Detection and remediation of stagnation in the Nelder-Mead algorithm using a sufficient decrease condition. SIAM Journal on Optimization, 10:43-55, 1999.
[34] J. Kennedy and R. Eberhart. Particle swarm optimization. Proceedings of ICNN'95 - International Conference on Neural Networks, 4:1942-1948, 1995.
[35] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23:462-466, 1952.
[36] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983.
[37] M. Kortelainen, T. Lesinski, J. More, W. Nazarewicz, J. Sarich, N. Schunck, M. V. Stoitsov, and S. Wild. Nuclear energy density optimization. Phys. Rev. C, 82:024313, August 2010.
[38] R. Lougee-Heimer. The Common Optimization INterface for Operations Research: Promoting open-source software in the operations research community. IBM Journal of Research and Development, 47:57-66, 2003.
[39] J. Matyas. Random optimization. Automation and Remote Control, 26:246-253, 1965.
[40] H. D. Mittelmann. Decision tree for optimization software. http://plato.asu.edu/guide.html, 2010.
[41] J. J. More and S. M. Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20:172-191, 2009.
[42] J. J. More and S. M. Wild. Estimating computational noise. SIAM J. Scientific Computing, 33:1292-1314, 2011.


[43] J. J. More and S. M. Wild. Estimating derivatives of noisy simulations. ACM Trans. Math. Softw., 38, 2011. To appear.
[44] R. H. Myers, D. C. Montgomery, and C. M. Anderson-Cook. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley and Sons, 2008.
[45] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308-313, 1965.
[46] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied Linear Statistical Models. McGraw Hill, 4th edition, 1996.
[47] M. J. D. Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7:155-162, 1964.
[48] M. J. D. Powell. UOBYQA: Unconstrained optimization by quadratic approximation. Mathematical Programming, 92:555-582, 2002.
[49] M. J. D. Powell. Least Frobenius norm updating of quadratic models that satisfy interpolation conditions. Mathematical Programming, 100:183-215, 2004.
[50] M. J. D. Powell. Developments of NEWUOA for minimization without derivatives. IMA Journal on Numerical Analysis, 28:649-664, 2008.
[51] C. R. Rao and H. Toutenburg. Linear Models: Least Squares and Alternatives. Springer Series in Statistics. Springer-Verlag, 2nd edition, 1999.
[52] L. A. Rastrigin. The convergence of the random search method in the extremal control of a many parameter system. Automation and Remote Control, 24:1337-1342, 1963.
[53] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
[54] A. N. Shiryayev. Optimal Stopping Rules. Springer-Verlag, New York, 1978.
[55] F. J. Solis and R. Wets. Minimization by random search techniques. Mathematics of Operations Research, 6:19-30, 1981.
[56] J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37:332-341, 1992.
[57] J. C. Spall. Accelerated second-order stochastic optimization using only function measurements. In Proceedings of the 36th IEEE Conference on Decision and Control, volume 2, pages 1417-1424. IEEE, 1997.


[58] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. John Wiley and Sons, 2003.
[59] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York, 1999.
[60] J. Takaki and N. Yamashita. A derivative-free trust-region algorithm for unconstrained optimization with controlled error. Numerical Algebra, Control and Optimization, 1:117-145, February 2011.
[61] J. J. Tomick, S. F. Arnold, and R. R. Barton. Sample size selection for improved Nelder-Mead performance. In Proceedings of the Winter Simulation Conference, pages 341-345, 1995.
[62] V. Torczon. On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7:1-25, 1997.
[63] V. Cerny. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41-51, 1985.
[64] S. M. Wild, R. G. Regis, and C. A. Shoemaker. ORBIT: Optimization by radial basis function interpolation in trust-regions. SIAM Journal on Scientific Computing, 30:3197-3219, 2008.
[65] M. H. Wright. Using randomness to avoid perseveration in direct search methods. Presentation at The International Symposium on Mathematical Programming, 2009.
[66] A. A. Zhigljavsky. Theory of Global Random Search, volume 65 of Mathematics and its Applications (Soviet Series). Kluwer Academic Publishers Group, Dordrecht, 1991. Translated and revised from the 1985 Russian original by the author, with a foreword by Serge M. Ermakov.