RELAXING CONVERGENCE CONDITIONS TO
IMPROVE THE CONVERGENCE RATE
by
Daniel MacMillan
B.S., University of Colorado, 1976
M.S., Colorado School of Mines, 1985
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
1999
This thesis for the Doctor of Philosophy
degree by
Daniel MacMillan
has been approved
by
Richard H. Byrd
Weldon A. Lodwick
/H q&W
Date
Thomas F. Russell
MacMillan, Daniel (Ph.D., Applied Mathematics)
RELAXING CONVERGENCE CONDITIONS TO IMPROVE THE CON
VERGENCE RATE
Thesis directed by Professor Harvey J. Greenberg
ABSTRACT
Standard global convergence proofs are examined to determine why some al
gorithms perform better than other algorithms. We show that relaxing the
conditions required to prove global convergence can improve an algorithms
performance. Further analysis indicates that minimizing an estimate of the
distance to the minimum relaxes the convergence conditions in such a way as
to improve an algorithms convergence rate.
A new linesearch algorithm based on these ideas is presented that
does not force a reduction in the objective function at each iteration, yet it
allows the objective function to increase during an iteration only if this will
result in faster convergence. Unlike the nonmonotone algorithms in the litera
ture, these new functions dynamically adjust to account for changes between
the influence of curvature and descent. The result is an optimal algorithm in
the sense that an estimate of the distance to the minimum is minimized at
each iteration.
The algorithm is shown to be well defined and we prove global con
vergence of the algorithm. Performance of the algorithm on some standard test
functions is presented to illustrate the features of the algorithm.
in
This abstract accurately represents the content of the candidates thesis. I
recommend its publication.
IV
ACKNOWLEDGMENTS
I would like to acknowledge Marathon Oil Company for its generous
support throughout my pursuit of this degree. I would also like to express
my sincerest gratitude to my wife Lois and daughter Michelle for their endless
patience and support. Without them this thesis would not have been possible.
I am indebted to Dr. H. J. Greenberg whose enthusiasm and knowledge of
mathematics in general and optimization in particular have been truly inspir
ing. I would also like to thank Richard Byrd, Gary Kochenberger, Weldon
Lodwick, Sylvia Lu and Thomas Russell for their many valuable comments
and help with this thesis. I would also like to acknowledge Allen Holder whose
DTgXtemplate and assistance with DT^X where invaluable. Finally, I wish to
thank my M.S. thesis advisor, Dr. R. G. Underwood, who introduced me to
the wonderful world of optimization.
CONTENTS
Figures ........................................................... viii
Tables............................................................... ix
Chapter
1. Introduction....................................................... 1
2. Optimization Preliminaries......................................... 3
3. Convergence Theory for Optimization............................... 11
3.1 Wolfes Framework............................................... 11
3.2 Zangwills Framework............................................ 15
3.3 Convergence Rate................................................ 18
4. Relaxing the Conditions for Convergence............................ 22
4.1 Relaxing the Sufficient Decrease Condition...................... 22
4.2 Relaxing the Bounded Away Condition............................. 24
4.3 Relaxing Both Conditions........................................ 27
5. A New Approach to Algorithm Convergence........................... 30
5.1 Examining the Bounded Away Condition............................ 30
5.2 Examining the Sufficient Decrease Condition..................... 33
5.3 Combining the Results........................................... 37
6. A MinimumDistance Function Algorithm............................. 40
6.1 Line Search Algorithm........................................... 47
7. Examples of Algorithm Performance ................................ 58
vi
7.1 Rosenbrocks Function............................................ 66
7.2 Freudenstein and Roths Function................................. 76
7.3 Powells Singular Function....................................... 79
8. Extensions and Future Research..................................... 81
Appendix
A. Standard Test Functions............................................ 86
References............................................................ 96
vii
9
FIGURES
Figure
2.1 Typical trustregion path................................. 6
3.1 Region of acceptable points (W) for Wolfes framework....... 15
3.2 Region of acceptable points (Z) for Zangwills framework.... 18
4.1 Trustregion algorithm for Rosenbrocks function.......... 27
5.1 Example of when increasing f(x) is beneficial............... 34
5.2 Example of when increasing f(x) is not beneficial........... 37
5.3 Regions defined by tfie minimumdistance function......... 39
6.1 How A affects the minimumdistance function............... 44
6.2 Comparison of nonmonotone and minimum distance regions. . 45
7.1 Minimumdistance function path for Ai = 0.95................ 68
7.2 Convergence of A* for Ai = 0.95............................. 69
7.3 Objective function values for Ax = 0.95..................... 70
7.4 Minimumdistance function path for Ai = 0.8................. 71
7.5 Convergence of Xk for Ax = 0.8.............................. 72
7.6 Objective function values for Ai = 0.8...................... 73
7.7 Minimumdistance function path for Ai = 0.5................. 74
7.8 Convergence of A*. for Ax = 0.5............................. 75
7.9 Objective function values for Ax = 0.5...................... 76
Vlll
TABLES
Table
7.1 Minimum distance algorithm performance on standard test prob
lems.............................................................. 62
7.2 Minimum distance algorithm performance on standard test prob
lems continued.............................................. 63
7.3 Mores Marquardt algorithm performance on standard test prob
lems.............................................................. 64
7.4 Mores Marquardt algorithm performance on standard test prob
lems, continued.............................................. 65
IX
1. Introduction
This thesis is the result of research on relaxing the standard con
vergence conditions in order to improve the convergence rate of optimization
algorithms. The standard convergence proofs used to prove an optimization
algorithm will converge to a local minimum from an arbitrary starting point
are examined. This examination, along with a review of the nonmonotone
algorithms in the literature, indicate that relaxing these standard conditions
could increase the convergence rate of algorithms in many cases.
Two convergence conditions that are central to this thesis are the
bounded away from orthogonality condition and the sufficient decrease
condition. An analysis of these two conditions indicates that minimizing a
function different than the objective function relaxes the convergence condi
tions in such a way as to improve an algorithms convergence rate. Unlike the
nonmonotone algorithms in the literature, these new functions dynamically
adjust to account for changes between the influence of curvature and descent.
The result is an optimal algorithm in the sense that an estimate of the distance
to the minimum is minimized at each iteration.
1
A new linesearch algorithm based on these ideas is presented that
does not require the objective function to decrease at any iteration. This
algorithm is shown to be well defined and we prove global convergence of the
algorithm. Performance of the algorithm on some standard test functions is
presented to illustrate the features of the algorithm.
The thesis is organized as follows. The first chapter gives a general
discussion of optimization along with definitions needed later in the text. The
second chapter gives a detailed description of two common global convergence
proofs which are expanded upon in the next chapter. In chapter 4, nonmono
tone algorithms from the literature that relax the convergence conditions are
shown to obtain faster convergence. Reasons why relaxing the conditions help
convergence are studied in detail in the subsequent chapter, which identifies
why relaxing the convergence conditions improves algorithm performance and
determines when relaxing these conditions is beneficial. The results of these
analyses are combined to obtain a new function to minimize that improves
algorithm performance. In Chapter 6, a linesearch algorithm based on this
function is presented and global convergence to a local minimum is proved.
The performance of this algorithm is analyzed in the next chapter where the
features of the algorithm are illustrated using standard test functions. Finally,
the results are summarized and avenues for future research are presented.
2
2. Optimization Preliminaries
Consider the optimization problem of minimizing a real valued ob
jective function f(x) with continuous second derivatives where x 6 and /
has a minimum at x*. A level set for f(x) is defined as:
Definition 2.1 The level set for a given function f(x) and a point Xk
is defined to be
L/{xk) = {z : f(x) < f(xk)} (2.1)
The gradient of this function denoted by V/(x) is an mdimensional vector.
The Hessian of this function denoted by V2/(x) is an m by m matrix. One
method of minimizing f(x) is to search negative gradient directions successively
for the minimum value of /. It has been shown that this steepest descent
algorithm exhibits a linear asymptotic convergence rate [73]. In practice, the
convergence rate of this method can be quite slow due to the face effects of
the eigenspaces associated with the minimum and maximum eigenvalues [1].
In order to obtain a faster convergence rate, other algorithms have
been considered. One such algorithm is Newtons method where the next
3
iterate is obtained from the previous iterate using the formula
As* = V2/(^)1V/(xfc) , (2.2)
where Xk+i = X& + Axk and Xk is the point reached at iteration k. Newtons
method can be shown to exhibit a quadratic convergence rate on many prob
lems in the neighborhood of a solution provided V2/(x*) is positive definite
[73]. However, the simplified Newtons method presented here will not work
on most optimization problems without some modifications. Often, a point
that is close enough to the minimum so that Newtons method would yield
quadratic convergence may not be available. A robust algorithm needs to be
able to converge to a local minimum from any arbitrary starting point. An
other problem is that the Hessian may not be positive definite, even at the
minimum. If the Hessian is not of full rank, then the inverse will not exist and
the algorithm, as presented here, becomes undefined. If the Hessian contains
negative eigenvalues, the direction obtained could be an ascent, rather than a
descent, direction. Many modifications of Newtons algorithm have been sug
gested to overcome these difficulties. A line search may be used to ensure that
f(xk+1) < f(xk) in an attempt to guarantee convergence to a local minimum.
The Newton direction is often modified to ensure that it is a descent direction.
Yet many of these modifications can prevent the desired asymptotic quadratic
4
convergence of the algorithm.
One method of modifying Newtons method that enables it to con
verge from arbitrary starting points is the model trust region algorithm, where
a model of f(x) near Xk is obtained. A typical model is the quadratic approx
imation:
m(x) = (x xk)T^2f{xk)(x xk) + Wf(xk)T(x xk) + f{xk) . (2.3)
Usually, the model needs to be only a first order approximation of f(x) to
prove global convergence. In a typical trust region algorithm, the model is
minimized subject to the trust region constraint jas Xk\\ < A*, using an
appropriate norm, where A* is the current trust region size. A step is then
taken using the x obtained by minimizing the model subject to the trust region
constraint. Before the next iteration, the trust region size can be expanded or
contracted depending on how well the model predicts the reduction obtained
in the objective function f(x). An excellent review of trust region algorithms
was done by More [81]. Shultz et al. [110] obtained a family of trust region
algorithms that also account for directions of negative curvature. More recently,
Dennis et al. [29] extended the trust region ideas to nonsmooth optimization.
Trust region algorithms have the advantage of yielding a negative
gradient step for a small trust region size, ensuring a reduction in f(x), while
5
allowing a full Newton step to be taken when the trust region size is large.
This, is achieved by a locus of possible points obtained for various trust region
sizes that is nonlinear. An example of a typical trustregion path, illustrated
using a quadratic function, is given in Figure 2.1. When a full Newton step
results in an increase in f(x), smaller trust region sizes are used ensuring that
a sufficient reduction in / will be obtained guaranteeing convergence to a local
minimum. When the full Newton step reduces the objective function the trust
region increases allowing the algorithm to take full Newton steps resulting in
the desired local quadratic convergence rate.
Figure 2.1. Typical trustregion path.
6
It is often possible to take advantage of special structure that exists in
certain types of optimization problems. One example of this is the nonlinear
leastsquares problem:
f(x) = lR(xfR(x) = iH(i)2 , (2.4)
where the norm used here is the Euclidean norm with R(x) Â£ 5R", x Â£ and
m < n. Let ri(x) be the ith element of R(x). The functions r^(x) are often
called residuals. The gradient of / is
V/(x) = J(xfR(x) , (2.5)
where J(x) is the Jacobian matrix of first derivatives of R(x) and we use Ji(x)
to denote the ith column of J{x). The Hessian of / is
V2f(x) = J(x)TJ(x) + 'jTri(x)V2ri(x) .
>.6)
1=1
If V2/(x) is positive definite, Newtons method becomes
A xk = 
J(xk)TJ(xk) + ^Tlri(xk)V2ri(xk)
i=i
J{xkfR{xk) . (2.7)
We can take advantage of the special structure of the problem by ignoring the
second derivative term
S(x) = ^n(x)V2rt(x) .
(2.8)
1=1
7
Assuming that J(xk) is of full column rank yields the GaussNewton method
equation
J{%k) R{j^k)
(2.9)
where [J(xk)T J(xfc)]1 J(xk)T is the MoorePenrose generalized inverseof J(xk).
More [80] has found a method due to Levenberg [71] and Marquardt [77] to be
useful for minimizing the nonlinear leastsquares problem using a trust region
algorithm. The LevenbergMarquardt algorithm is
= [jf:c;jr./f,ct) + hi .Hxk f Rlxk)
(2.10)
where A^ > 0. More solves the trust region problem by choosing A so that
z Xk ~ Ak
The concept of a generalized matrix inverse [114] is useful in obtaining
leastsquares solutions for any general matrix J, even if J is not of full column
rank.
Definition 2.2 The MoorePenrose generalized inverse of an n by m
matrix J is defined to be the unique mbyn matrix J+ satisifying the
following four properties:
J+JJ+ = J+ (2.11)
JJ+J = J (2.12)
8
(JJ+)T = JJ+
(2.13)
(J+J)T = J+J . (2.14)
We also have the relations
(J+)+ = J (2.15)
(JTY = (J+f (216)
(.JJT)+ = (J+)TJ+ . (2.17)
In the sequel, the following definitions for projection matrices will be found
useful
Pj = JJ+ (2.18)
Rj = J+J . (2.19)
Using the MoorePenrose generalized inverse, the GaussNewton method be
comes
Axk J(xk)+R(xk) (2.20)
This definition makes the GaussNewton method well defined, even when J(xk)
is not of full column rank.
The MoorePenrose generalized inverse can easily be obtained using
the singular value decomposition
J = USVT , (2.21)
9
where U is an n by n unitary orthogonal matrix, V is an m by m unitary
orthogonal matrix and S is an n by m matrix with the positive singular values,
Si, along the diagonal. The MoorePenrose inverse of J is obtained with the
singular value decomposition using
J+ = VS+UT , (2.22)
where the pseudo inverse of S, S+, is an m by n matrix with 1/s, on the
diagonal. If J is not of full column rank, then one or more of the singular
values are zero. In this case, S+ will have a zero on the diagonal whenever Si
is zero, and 1/s, otherwise. We will find the singular value decomposition to
be a useful analysis tool in the sequel.
10
3. Convergence Theory for Optimization
The above discussion identified two of the numerous major consider
ations in nonlinear optimization algorithm design. One consideration, global
convergence, is ensuring that the algorithm converges to a local minimum,
given an arbitrary starting point. The other major consideration is obtain
ing a fast rate of convergence. Two methods of proving that an algorithm is
globally convergent have been used extensively. One is a method attributed
%
primarily to Wolfe [124, 125] the other is a method attributed primarily to
Zangwill [128]. Convergence rate and these two global convergence proofs are
discussed in the following sections.
3.1 Wolfes Framework
Two independent conditions have been used together for years to
prove global convergence of optimization algorithms. The first condition is
 Vf(xk)Tdk > CV/(x*)4 , (3.1)
where dk is the search direction and c is some small positive constant. This
condition is often referred to as bounded away from orthogonality or the
11
bounded away condition. The second condition is
/(**) /(**) > o (' V/^} 4 ') . (52)
where cr() is a forcing function. Forcing functions are any nondecreasing
function on [0, oo) with the property that a(t) > 0 when t > 0 and a(t) = 0
when t = 0. An example of a forcing function is cr(t) = et2. This condition is
often referred to as sufficient decrease or the sufficient decrease condition.
If an algorithm satisfies both of the above conditions, then global
convergence of the algorithm can be proven for a Gdifferentiable function [89].
Definition 3.1 A function / : fl C SR 4 is Gateaux or (G)
differentiable at an interior point x of D if there exists a linear operator
A : 9ft > such that, for any h 5Rn,
+ f(x) tAh\\ = 0. (3.3)
We use the following theorem taken from Ortega and Rheinboldt [89] to prove
global convergence.
Theorem 3.2 Suppose that f(x) : D C > 1R is Gdifferentiable and
bounded below on some set Do C D, the iterates {a:*,} remain in D0, and the
sufficient decrease condition is satisfied for every iteration k > 0. Then
limv/Ml4=0
**  <41
(3.4)
12
This theorem, along with the bounded away from orthogonality condition,
ensure the first order convergence result that V/(xfc) > 0.
To design a practical globally convergent algorithm, one first picks a
search direction algorithm that satisfies the bounded away from orthogonality
condition. Often, the method of generating the search direction automatically
ensures convergence without explicitly enforcing the bounded away from or
thogonality condition. QuasiNewton algorithms and Newton methods using
a modified Cholesky decomposition are some examples. Next, a line search
method is chosen that satisfies the sufficient decrease condition. Many practi
cal line search algorithms, such as the GoldsteinArmijo method, satisfy this
condition [89]. This yields a very powerful theory. A large number of practical
globally convergent algorithms are possible by using different search directions
with different line search techniques [89].
Let W denote the region of acceptable points that satisfies these two
conditions. The forcing function can be chosen so that only a very small
reduction in f(x) is needed to ensure convergence. The constant c can be
chosen so small that directions nearly perpendicular to the negative gradient
direction are acceptable. An illustration of this in two dimensions (m = 2) is
given in Figure 3.1. The current point is xk with objective function value f(xk)
Since the function value is required to decrease, W C L/(xk) as illustrated in
13
the figure. The region of acceptable new points is just slightly inside L/(xk),
and it nowhere touches the boundary due to the sufficient decrease condition.
There is a small excluded region toward the bottom right where L/(xk) is
outside the region defined by the bounded away from orthogonality condition.
Points in this region satisfy the sufficient decrease condition but cannot be
reached because the search direction required would violate the bounded away
from orthogonality condition.
14
Figure 3.1. Region of acceptable points (W) for Wolfes framework.
3.2 Zangwills Framework
Zangwill proposed a global convergence proof based on the idea of
pointtoset maps [128]. He defined an algorithm as a pointtoset mapping,
where the algorithm uses the current point, Xf., to generate a set of possible
15
new iterates. For example, when performing a line search, it is impractical to
obtain the exact minimum. Hence, a range of points along the line that are
within some tolerance of the minimum are considered acceptable. This range of
points becomes the set of points generated by the algorithm in the pointtoset
map.
An important property of pointtoset mappings needed to prove
global convergence is the idea of a closed mapping [73].
Definition 3.3 A pointtoset mapping A{x) from a set X to subsets
of a set Y is said to be closed at x Â£ X if the conditions
lim xk = x, xk X (3.5)
ktoo
and
lim yk = y, yk A(xk) (3.6)
kÂ¥oo
together imply that
y A(x) (3.7)
Using the idea of a pointtoset map and closed mappings, we have the following
global convergence theorem [128].
Theorem 3.4 Let A{x) be an algorithm on a set X. Let T C X be the
solution set for the optimization problem. Assume that, given x0, the sequence
{xfe} is generated satisfying xk+i Â£ A(xk), and also satisfying the following.
16
All Xk are contained in a compact set S C X.
There is a continuous function g on X such that
(1) if x 0 T then g(y) < g(x) V y E A(x)
(2) if x T then g(y) < g(x) V y G A(x).
The mapping A is closed at points outside of I\
Then, every limit point of is a solution.
To use this theorem to prove global convergence for some algorithm,
the function g is usually chosen as the objective function and a solution is
usually defined as a stationary point, i.e. V/ = 0. Then, the line search is
shown to be a closed pointtoset mapping that reduces the objective function
(when not at a solution).
The region Z of acceptable points that satisfies the conditions of this
theorem is again quite large. The conditions on g(x) = f(x) ensure that
Z C Lj{xk). An illustration of this for the same two dimensional problem
shown in Figure 3.1 is given in Figure 3.2. As can be seen, the region of
acceptable new points is just slightly inside Lj(xk), and it nowhere touches the
boundary because the mapping is closed and a strict decrease in g is obtained
outside the solution set. Also note that there are no excluded regions like
that obtained for Wolfes proof due to the bounded away from orthogonality
condition. However, most practical algorithms will have V/(xfc)T<4 < 0 for
17
Vf(xk) i1 0 so that f(x) can be reduced during the line search. Hence, the two
convergence proofs produce essentially the same region of acceptable points.
Figure 3.2. Region of acceptable points (Z) for Zangwills framework.
3.3 Convergence Rate
Consider the following two definitions commonly used in rate analysis
[51]
18
Definition 3.5 Let the sequence {a;*} converge to x*. The order of
convergence denoted by a is defined to be
a
 sup{q
lim sup
ky oo
lkfc+i z*
\Xk
< 00 }
(3.8)
The case a 1 is called linear convergence, a = 2 is called quadratic conver
gence, and a = 3 is called cubic convergence.
Definition 3.6 The convergence ratio denoted'by f3 for a sequence {xfc}
of order a is defined to be
lim sup
k>oo
ls*+i ~g*ll
Ikfc x*a
(3.9)
The case a = 1 and j3 = 0 is called superlinear convergence. These two
definitions are useful for algorithm acceleration and the design of convergence
criteria. Also, they have often been used as a method of recommending one
algorithm over another. However, this use can lead to erroneous conclusions.
Larger orders and smaller ratios do not necessarily imply that one algorithm
will converge to the solution of a given problem faster [51]. Another problem
with using these definitions to compare algorithms is that they are only useful
in a region near the solution, x*. While this asymptotic convergence rate
is important, the analysis of algorithm performance at points far from the
solution, where the asymptotic convergence rate does not apply, is also very
19
important.
. There have been many attempts to come up with a good definition
for global convergence rate. For steepest descent, the relation
/(*fc+i) <
has been obtained for a quadratic objective function where Xmax and Amtn are
the largest and smallest eigenvalues of the (constant) Hessian [73]. Others have
obtained bounds on f(xk) f(x*) for certain classes of objective functions.
However, an estimate for global convergence rate for more general objective
functions remains elusive. A summary of this work is presented in Nocedal
[88]
The success obtained in proving global convergence leads us to exam
ine the global convergence proofs given above to see if they contain any insight
into algorithm performance. However, such insight is not obtained. Consider
an algorithm that always obtains its next point on the boundary of W or Z.
For Wolfe, this algorithm will always satisfy the sufficient decrease condition
with equality. If the forcing function is a(t) et2, then the objective func
tion decreases by only e\Vf(xk)Tdk\/\\dk\\2 at every iteration. This yields a
linear convergence rate with a convergence ratio near one for e near zero. For
Zangwill, the point obtained will be on the boundary of the closed mapping.
Xmax At
^max + A r
f{*k)
(3.10)
20
Since g = f(x) is required only to be monotonically decreasing, again very slow
convergence is obtained. Thus, we see that the above two global convergence
proofs bound the algorithm in question by another algorithm that converges,
however, very slowly. This results in a proof of global convergence but yields
little insight into algorithm performance at points far from the solution. These
global convergence proofs do, however, say something about the reliability of
the algorithm. They guarantee that the algorithm is reliable in the sense that
it will eventually converge to a local minimum.
21
4. Relaxing the Conditions for Convergence
In the previous chapter, we saw that global convergence theorems
do not provide much insight into algorithm performance. Further, because
rate analysis is definitionally asymptotic, it is not useful for analyzing the per
formance of algorithms at points far from the solution. The large regions of
acceptance, W and Z, illustrated in Figures 3.1 and 3.2, respectively, and the
extremely slow convergence of an algorithm limited to the boundary of these re
gions,indicates that almost any efficient algorithm would fit into the framework
provided in the proofs. However, in this chapter we examine evidence that re
laxing these convergence conditions can actually lead to algorithms exhibiting
faster convergence rates. This evidence is divided into three sections that cover
relaxing the sufficient decrease condition, relaxing the bounded away from or
thogonality condition and relaxing both the sufficient decrease and bounded
away conditions.
4.1 Relaxing the Sufficient Decrease Condition
In the mid 1980s, Grippo et al. began a series of papers proposing
the advantages of using a nonmonotone linesearch method to speed up the
22
convergence rate. The first paper appeared in 1986 [53], the next in 1989
[54], and another in 1990 [55]. This work was culminated in 1991 with a
paper covering an entire class of nonmonotone methods [56]. These non
monotone methods were summarized, tested and improved upon by Toint,
confirming the improved performance of the proposed algorithms [115, 116,
119]. Panier and Tits [91] along with Bonnans et al. [8] immediately saw
the advantages of these techniques for constrained optimization problems and
applied the nonmonotone linesearch idea to help alleviate the Maratos effect
in their program FSQP. The Maratos effect arises in constrained optimization,
where a reduction in a merit function is often enforced rather than using the
objective function because the objective function does not account for the
presence of the constraints. Often, the merit function used will not allow a full
Newton step to be taken, even at points very close to the solution. This effect,
called the Maratos effect [76], prevents Newtons method from obtaining the
asymptotic quadratic convergence rate.
More recently, nonmonotone methods have been applied to a wide
variety of algorithms and problems [26, 34, 36, 35, 117, 118, 120, 126, 130]. For
other nonmonotone methods see [4] and [74].
These nonmonotone algorithms relax the sufficient decrease condition
23
by replacing it with the weaker condition
_ !{xm) 17 (' V/*ii 4') (41)
where p is on the order of 10 or less (the optimization program FSQP currently
uses p = 3 [8, 91, 129]). While this condition allows for the relaxing of the
sufficient decrease condition, resulting in faster convergence in many cases,
it does not explain why relaxing this condition helps. It also does not give
a method or algorithm for determining when to relax the sufficient decrease
condition and when relaxing the sufficient decrease condition is not a good
idea.
4.2 Relaxing the Bounded Away Condition
The bounded away from orthogonality condition can be quite restric
tive if it is explicitly enforced. For example, consider Newtons method applied
to a quadratic problem, where Newtons method converges in just one itera
tion. However, for a given constant, c, there exist quadratic problems with
an eigenvalue small enough that the full Newton step will violate the bounded
away from orthogonality condition. If the Newton step is modified to satisfy
the bounded away from orthogonality condition, this will result in destroying
the Newton algorithms performance on this quadratic problem.
Allowing an unmodified Newton step to be taken has proved to be
24
very computationally efficient in practice. Even though convergence has not
been proven for such an algorithm, the inclusion of the bounded away condition
has been found to be quite detrimental to the efficient solution of real problems.
Hence, relaxing the bounded away from orthogonality condition could, in many
cases, improve algorithm performance.
Trust region methods relax the bounded away from orthogonality
condition. Rather than checking the step direction to ensure it is a descent
direction, trust region methods check how well the model fits the function. At
the end of each iteration, the relative function reduction is computed:
_ f(xk+1)f(xk)
Pk / \ / \
m[xk+1) m(xk)
If 0.75 < pk, the model fit is considered to be good and the trust region size can
be increased. If pk < 0.25, the model fit is not good, and the trust region size
is reduced. As the trust region size goes to zero, the step direction approaches
the negative gradient direction, so that a sufficient decrease will eventually be
obtained ensuring the (theoretical) convergence of the algorithm [81]. In the
region where Newtons method is quadratically convergent, the model fit is
good, so the trust region size increases, and the model subproblem allows full
Newton steps to be taken. Thus, as long as the model is a good approximation
to f(x) near xk the step direction need not be a descent direction. However, it
25
is possible to show that the step to the minimizer of the trust region problem
is alwaysl a descent direction. Thus, explicit inclusion of the bounded away
from orthogonality condition is not needed for trust region algorithms.
However, trust region algorithms could also benefit from relaxing the
sufficient decrease condition. Figure 4.1 illustrates one step of a trust region
algorithm taken from two different starting points using Rosenbrocks function
[102]. One point, x, is near the valley floor while the other point, x, is farther off
the valley floor, yet both points are in close proximity to each other. The locus
of acceptable points for a trust region algorithm initiating from both points
are also illustrated in the figure. Note that the point near the valley floor, x,
has a shorter maximum step length than that obtained for the other point. It
would be good if points that are in such close proximity could have the same
trust region size. However, if the algorithm were allowed a longer step length
from x, then the objective function would have to increase. This illustrates the
utility of relaxing the sufficient decrease condition for trust region algorithms
in certain cases. Some authors have applied the above nonmonotone ideas to
trust region algorithms with some success [26, 117, 118, 120, 126].
26
3 i
Figure 4.1. Trustregion algorithm for Rosenbrocks function.
4.3 Relaxing Both Conditions
The above discussion indicates that relaxing both the bounded away
and sufficient decrease conditions could be helpful. Here we discuss work that
has been done to relax both conditions.
The watchdog method due to Chamberlain et al. [14] is an attempt
to avoid the Maratos effect in constrained optimization. The basic idea behind
27
the watchdog method is to take several Newton steps without checking any
conditions until after the last step. If the merit function has sufficiently de
creased, then the Newton steps will be accepted. If not, the Newton steps are
discarded, the algorithm goes back to the point where the Newton steps were
started, and the step size is reduced. Thus, this method relaxes both conditions
since no conditions are checked during the Newton steps. This method can be
quite effective. It can also increase the computer time required, and there is
the risk of taking a bad step, exponential overflow or other difficulties. This
idea has been incorporated by Grippo et al. [56] in their class of nonmonotone
methods.
Solodov [75, 111] weakens both conditions by picking a direction <4
using
Vf(zk)Tdt > o'(V/(it)) Xt ,
(4.3)
where Ak > 0,
/(*) /(zhi) > nv/(ii)T4 n
(4.4)
with 77* > 0 and Uk > 0. He proves convergence when the following conditions
hold:
0O
CO
CO
Y r)k = oo, Y ^kVk < oo, Y < co .
(4.5)
These conditions relax both the original bounded away from orthogonality and
28
sufficient decrease conditions. Solodov was able to show that some of the
algorithms used to train neural networks fit within this structure.
29
5. A New Approach to Algorithm Convergence
In this section we examine the two convergence conditions more closely
to determine when relaxing each condition helps speed convergence and when it
does not. The results are then combined to yield a new approach to algorithm
convergence.
5.1 Examining the Bounded Away Condition
The bounded away from orthogonality condition becomes a problem
when the search direction denoted by dk = d{xk) is nearly orthogonal to the
negative gradient direction. The objective function will not decrease very fast
along such a direction, and will likely increase within a short distance of Xk
This suggests minimizing an alternate function, h(x), with the following prop
erties.
(1) V/i(x) = d(x) so that Vh(x^) = dk.
(2) x* minimizes h(x).
(3) h(x) has no local minimum other than x*.
A function that satisfies the first condition is called a gradient mapping [89].
30
Definition 5.1 A mapping d(x) : D C > is a gradient mapping
on a subset Do C D if there exists a Gdifferentiable function g : D0 C
5?1 such that d(x) = V^(aj) for all x Do.
Such a function, if it exists, would solve the bounded away from orthogonality
problem because the desired search direction would be the negative gradient
direction of the new function being minimized. The nonmonotone idea of
allowing f(x) to increase also becomes less of an issue because h{x) decreases
fastest along its negative gradient direction.
Ortega and Rheinboldt state the following Theorem [89].
Theorem 5.2 Let d(x) : D C be continuously differentiable on an
open convex set Do C D. Then, d(x) is a gradient mapping on Do if, and only
if, Vd(x) is symmetric for all x 6 Do.
Thus, the desired function exists if, and only if, Vd(x) is symmet
ric. This condition obviously holds for (4 = V/(x*). But, this condition is
very restrictive so that there are very few search directions that satisfy this
condition.
However, there are functions that relax the first condition while sat
isfying the other two conditions. Consider the function
/>(*) = 5llV7M1v/(*)2, (5.1)
31
where the norm used is the Euclidean norm. The gradient of this function is
Vh(x) = [v2/(*)_1v2/(*)j +
,lT
V2f(x)~'Vf(x) , (5.2)
Evfvvwr1)^
Li=l
where V2f(x)i 1 is the ith column of V2/(x) h At Xk we have
Vh(xk) = 4 +
,t=l
V2/(n)V/( Xt). (5.3)
The second higher order term is the error obtained when using this h(x) to
approximate the first condition. Minimizing this h(x) is like minimizing the
square of an estimate of the distance to the minimum. Hence, we will call all
such functions minimum distance functions.
This particular minimum distance function has several problems. The
derivatives of h(x) needed for a steepest descent or Newton algorithm are
difficult to obtain. The gradient and Hessian of the objective function, f(x),
are needed at every point in a line search. This function could also find a
maximum rather than a minimum. To alleviate some of these problems, one
could minimize a sequence of functions
M*) = jlIvVM'v/MH2
(5.4)
32
The gradient of this function is
Vhk(x)
^2f(xk)^2f(x)
V/(x*)V/(z)
(5.5)
At xk we now obtain Vh{xk) = dk as desired. However, we must ensure that
minimizing a sequence of such functions will converge, and that it will converge
to x* and not some other point. The idea behind this definition is to have a
trust region for how far the current Hessian approximates the eigenvectors and
eigenvalues of the problem. Here again, the gradient of the function is needed
at each point during a line search.
Now, we extend this idea to leastsquares problems where the Gauss
Newton algorithm can be used to take advantage of the special structure of the
problem, see page 7. This yields the equation
h(x) = i./(a*)+fl(*)l1 (5.6)
In this case, when f(x) is evaluated, R(x) is also obtained because f(x) =
^R(x)TR(x). Thus, there is no need to evaluate the gradient at each point
in the line search. Hence, this minimum distance function could be useful for
relaxing the bounded away from orthogonality condition.
5.2 Examining the Sufficient Decrease Condition
In this section we address the question, When is allowing f(x) to in
crease a good idea? The goal is to obtain an algorithm that can automatically
33
determine the best point along the search direction.
An obvious case where increasing f(x) along the search direction is
beneficial is illustrated in Figure 5.1 using a Rosenbrock type function. Two
possible next iterates, Xk+i and ik+i are contrasted in the figure. The point
xjt+i is obtained by minimizing the objective function along the typical Newton
or GaussNewton search direction. The point Xk+i is some arbitrary point
farther along the search direction, where f(xk+1) > f{%k)
Figure 5.1. Example of when increasing f(x) is beneficial.
34
Note that Â£*+1 is not necessarily the point obtained by taking a full
Newton or GaussNewton step. This point could be much farther out due to
near rank deficiency in the Hessian and may cause an exponential overflow or
other numerical problems.
The idea illustrated in Figure 5.1 is that it is easier to minimize f(x)
along eigenvector directions with a large associated eigenvalue (ei in the figure)
than it is to minimize f(x) along eigenvector directions with a very small as
sociated eigenvalue (e2 in the figure) [74]. The objective function is allowed to
increase in order to make progress along the eigenvector directions with small
associated eigenvalues. Thus, instead of minimizing f{x), a new function,
hk(x), is minimized that weights the eigenvector directions with small asso
ciated eigenvalues more than the eigenvector directions with large associated
eigenvalues. Here we include the subscript k to indicate that the eigenvectors
and eigenvalues are obtained at Xk.
Let us design a new function for nonlinear leastsquares objective
functions that has this property. Let J(xk) = USVT be the singular value
decomposition of the current Jacobian matrix. Also, assume that J is of full
column rank. Next, we rotate the residual vector by forming UTR(x). We now
35
define our new function to be
k(x) = \\DUtR(x)\\2 , (5.7)
where D is a diagonal matrix of scale factors that we shall use to scale the
eigenvector directions. Note that if D = /, then hk(x) = f(x). We now pick
the diagonal elements of D denoted here by d{ so that they give a small weight
to the large S{ and a large weight to the small s;. One easy way to do this
is to set di = 1 jS{. This yields exactly the same minimum distance function
obtained above for nonlinear leastsquares problems because
^DU7,R(x)\\2 = ~]\VS+UtR(x)\\2 = iU(xi)+fi(x)2 . (5.8)
Now, consider another case illustrated in Figure 5.2, a plot of arctan(x),
which is an example of when it is not beneficial to increase f(x). Assume that
the objective function is f(x) = arctan(a;)2, with minimum at the origin. If
Newtons method is started at some point far from the origin, it will take
increasingly larger steps zigzagging across the origin yielding ever increasing
objective function values until it eventually diverges to oo. If this example is
minimized using a minimum distance function with constant eigenvectors and
eigenvalues, the minimum will still be obtained. Since i G R1, J(xk)+ = 1/si
is a constant and hk(x) = ri(a:)/sx2. Thus, hf{x) is just the objective
function modified by a constant factor, so that the minimum is not affected.
36
Figure 5.2. Example of when increasing f(x) is not beneficial.
5.3 Combining the Results
The above results of examining the two convergence criteria have
both amazingly suggested minimizing a minimum distance function along the
search direction. It is instructive to examine what this idea does to a quadratic
function where, L/(xk) is an ellipse. In this case, Lh(xk) will be a circle, as
shown in Figure 5.3, because Newton solves a quadratic function exactly. For
37
the sufficient decrease condition, the new point Xk+i must lie inside Lj(xk),
regions 1 and 3 in the figure. If we require h(x) to decrease along the line
search direction, then the new point Â£jfc+1 must lie inside Lk(xk), regions 1 and
2 in the figure. Region 1 is ideal because both the objective function and the
Newton step length are decreasing. Region 2 is the region we seek. Points
in region 2 have f(xk+i) > f{xk) but the Newton step length is decreasing
so convergence can still be obtained. Points where f(xk+1) > f(xk) that are
outside of Region 2 are not acceptable because the Newton step length is
increasing. If the algorithm accepts points in this region, convergence may
not be obtained. Region 3is very interesting. Points in this region have a
lower objective function value but have an increased distance to the minimum.
Requiring hk(x) to decrease along the search direction will exclude these points
from consideration.
38
Figure 5.3. Regions defined by the minimumdistance function.
If hk(x) is a good approximation for the distance to the minimum,
minimizing h(x) makes sense. It allows each step of the algorithm to make the
greatest possible progress toward the minimum and is, at least in this sense,
optimal.
39
6. A MinimumDistance Function Algorithm
In this section we consider minimizing a sequence of functions
*>(*) = iJ(**)+fl(x)a , (6.1)
to take advantage of the special structure, see page 7. We often denote f(xk)
by fk, R(xk) by Rk and J(xk) by Jk to simplify the notation when no confusion
will result.
These functions have several nice properties. For example, consider
the gradient
Vhk(x) = (^J(xk)+J{x)Y J(xk)+R{x) (6.2)
and Hessian
V2hk(x) = (^J(xk)+J{x)Y (j(xk)+J(xf) + J^j/i(x)Vari(x) (6.3)
*'=i
where yi(x) is the ith element of Yk(x) defined by:
Yk(x) = (J(xk)+)TJ(xk)+R(x) . (6.4)
Here we see that Yk(xk) 0 as Xk > x* if J(x)+ is continuous and bounded
in a neighborhood of x*. It would be nice if this led to superlinear convergence
40
for the associated GaussNewton Method. However, such is not the case. The
negative gradient at Xk is
Vhk(xk) = (J(xk)+J(xk))TJ(xk)+R(xk) = J(xk)+R{xk) . (6.5)
Ignoring the higher order terms in the Hessian yields the GaussNewton step
Ax* = [(J*+J*)T(J+J*)] (4Jt)TJtRt
= JtB*
(6.6)
(6.7)
Thus, the negative gradient direction and the GaussNewton direction for hk(x)
coincide. These directions coincide because the first term in the Hessian is
R^Rj = Rj, a unitary orthogonal matrix, that projects onto the row space
of J(xk). This gives the Hessian of hk{x) a much better condition number
than the Hessian of f(x) when the higher order terms, S(x) on page 7, are
small compared to the first order terms. In this case, V2hk(x) Rj while
V2/(x) J(x)TJ(x).
Another nice feature of hk(x) can be seen when considering the case
where J(x) is square, of full rank, and the residual at the solution is zero. This
type of problem is often encountered in the solution of simultaneous nonlinear
equations. Here, we can rescale the residual with a nonsingular scaling matrix,
D, as R = DR. At the minimum, R = 0 so R = 0 as well, and R 0 implies
that R ^ 0. Thus, the two problems are equivalent. Now we have J = DJ
41
and, since both D and J are square matrices of full rank, J+ = J lD 1 [99].
This leads to
J+R = J~xD~lDR = J+R . (6.8)
Hence, the hk(x) functions are scale invariant with respect to scale changes in
R for this special case. However, the result for scale changes in x is different.
Let
x = Dx , (6.9)
1 G) II ""s (6.10)
J+ = DJ+ . (6.11)
Then,
hk(x) = i7+ii2 . (6.12)
= IWD'DJ+Rf , (6.13)
= ilp'J+flll2 , (6.14)
^ i]V+!l2 (6.15)
Thus, the hk(x) functions are not scale invariant with respect to scale changes
in x.
There are two problems with these functions that must be solved
before global convergence can be proved. First, hk(x) does not always decrease
42
as x > x*. Some problems have a singular value that goes to zero as x > x*
causing hk(x) > oo. The second problem can be seen with the rearrangement
hk(x) = ~R(x)T (J(xk)+)T J(xk)+R(x) . (6.16)
This makes hk(x) look like the square of an alternate norm induced by the
matrix
Ak = (J(xk)+)TJ(xk)+ . (6.17)
This definition of Ak is not of full rank; hence, this is not a norm. Herein lies
the difficulty: components of R{x) that are orthogonal to the column space
of J are not considered. An algorithm that minimizes this function could not
prevent these components increasing without bound. Hence, these orthogonal
directions must also be considered in order to obtain global convergence.
Both of these problems are overcome by defining Ak as
Ak = (1 \k)(JÂ£)TJt + XkI, 0 < Afc < 1 . (6.18)
Then define
hk(x) = R(x?AkR{x) = iB(x)^ . (6.19)
The parameter \k interpolates between a minimum distance function obtained
at = 0 and the objective function obtained at A*, = 1. Changing \k allows
us to adjust the amount of allowable function increase during the line search,
Figure 6.1.
43
Figure 6.1. How A affects the minimumdistance function.
It is also instructive to compare this with the nonmonotone method of
Grippo et al. [53]. Suppose ma,Xkp
tion value obtainable in Lh{xk) Then, the regions of acceptable next iterates
for a quadratic objective function are as shown in Figure 6.2. The minimum
distance function will reject points in regions 3 and 4 while the nonmonotone
method of Grippo et al. accepts such points. In these regions, x* is
increasing so that the algorithm is diverging. The nonmonotone algorithm
44
obtains convergence when maxfc_p<;
mum distance function obtains convergence by decreasing either  jÂ£i?(m) or
the objective function.
Figure 6.2. Comparison of nonmonotone and minimum distance regions.
In the next section, we consider minimizing a sequence of these hk{x)
functions using a line search method. However, we first present two lemmas.
Lemma 6.1 For the GaussNewton direction, the above definition for hk(x)
relaxes the bounded away from orthogonality condition as decreases from
one. That is, the parameter \k decreases the minimum allowable angle between
the negative gradient and the GaussNewton direction as A*, decreases to zero.
45
Proof: Using the GaussNewton direction, dk = J^Rk,
cos($) = _RlAkJkJÂ£Rk___
Noting that
Afc (JtfJt
(6.20)
(6.21)
as Xk approaches zero we obtain the limit
cos(0) =
ll^ail2
= i
(6.22)
while, for Ajt = 1 we get the cos(0) obtained for the GaussNewton method.
Lemma 6.2 The above definition for hk(x) relaxes the sufficient decrease con
dition as Xk decreases from one if a decrease in \\\Ftkjr\{Jk)T Rk+\\[2 b
tained.
Proof: Take the sufficient decrease condition for f(x) and multiply by Xk to
obtain
yll^+ill2 < yll^ll2 + tXkRlJkdk (6.23)
Also consider
+ t(l \t) JtRk)T dt (624)
46
The sum of equations (6.23) and (6.24) gives the sufficient decrease condition
for hk(x). If Afc = 1, the sufficient decrease condition for f(x) is obtained.
However, if a reduction in ^\\R7e+i{Jk)T Jjt Rh+i\\2 is obtained, this term can
compensate for an increase in f(x). Hence, the standard sufficient decrease
condition is relaxed if a decrease in is obtained. That
an increase in f(x) does happen in some cases is illustrated in the next section
using Rosenbrocks function.
These lemmas show that the features the minimum distance func
tions had of relaxing the bounded away and sufficient decrease conditions are
retained by this modification.
6.1 Line Search Algorithm
We now discuss the algorithm in detail and present the global con
vergence proof. The algorithm presented here uses the steepest descent search
direction. While this is normally not a good choice because of the slow linear
rate of convergence often observed, it is quite satisfactory in this case because
the steepest descent direction for hk{x) is close to the GaussNewton direc
tion when \k is close to zero. Obviously, the algorithm could be improved by
considering other search directions.
47
Algorithm 6.3 The minimum distance linesearch algorithm is:
Compute R(xi), f(xi), J(xi), J+(xi);
Set f(xi);
Set 0 < e < 1.
(1) IF converged THEN quit
(2) Pick Ak so that
Rl(jÂ£)TJtRk
2(t fk) + RTk(Jt)TJtRk ~ 
At = (1  + A*/ ,
AkRk
(3) Obtain the search direction:
JjAkRk
k '
(4) Pick ak to minimize hk(xk + ctkdk), then set Xk+i = &* + otkdk This line
search is designed to ensure that the following sufficient decrease condition
is satisfied:
h,k{3'k+1) hki^Xk)
(5) Pick qk+i so that:
f(xk+i) < qk+i < <7fc + e[hk(xk+i) hk{xk)} .
48
(6) Compute J(xk+\) and increment k.
(7) GOTO step 1
Note that if A& = 1 at any iteration, our equation for choosing qk+i
leads to the possibility that A^ < 1 for the next iteration. Also, if A*, is close to
zero, but R%(Jk)TJkRk suddenly becomes large, our equation for choosing Ak
yields a value for Ak closer to one. Thus, the algorithm can dynamically adjust
Xk as needed to switch between the influence of curvature and descent. When
Ak is small, the objective function can increase to allow for the curvature of
the problem. When Xk is close to one, the steepest descent direction for / is
searched emphasizing descent.
We now show that the algorithm is well defined. The sufficient de
crease condition for the line search in step 4 can be satisfied by many standard
line search algorithms [89]. We also need to show that our choice for Ak keeps
f(xk+i) small enough that a new value for qk+i can be obtained from the
equation
f(xk+i) < qk+i < Qk + t[hk(xk+i) hk(xk)] . (6.25)
Theorem 6.4 Suppose Xk is chosen to satisfy
RiytfJtRk . ,
2(A) + RTt(Jt)TJiRk k 
(6.26)
49
and the sufficient decrease condition:
h/: (^fcJl) ^ h'ki^Xk') ^"( I ^ b'ki.Xk') d^)
is satisfied by the line search. Then
f(xk+1) < qk + e[hk(xk+i) hk(xk)\ .
Proof: Starting from the sufficient decrease condition
hk(xk^.\) ^ hk(xk)
*
and 0 < e < 1 we have
hk{xk^.\) ^ T c[h'k(xk+i') ^(a^)]
Using the definition of Ak
Ak = (1 Xk)(Jk)TJk + AkI
and the definition of hk(xk)
h'k(xk') = Rk AkRk ,
we have
l;Rl+1((l\k)(jÂ£)TjÂ£ + XkI)Rk+1 < hk(xk) + e[hk(xk+1)hk(xk)}
(6.27)
(6.28)
(6.29)
(6.30)
(6.31)
(6.32)
. (6.33)
50
Expanding, and noting the definition for f(xk+1) we obtain
'\^k{Xk +l)
Again, expanding A* in hk(xk), we get
Vttzt+i) < \rI ((1  + At/) ft
Rearranging and dividing by A*, gives
/(**+.) < /(**) + ^^Rl(JÂ£)rJtRt
l^Rl(Jtf4Rk+i +
\flk (^fc + l ) ^fc(>Efc)]
Here we are justified in dividing by A^ since
Rl(Jt)TJtRk
0 <
< A*;
(6.34)
(6.35)
(6.36)
(6.37)
2(g* /*) + RiiJtyjtRk
because qk > /& was enforced during the previous iteration of the algorithm,
and if RKJtfJtRt > 0 ever becomes zero, the algorithm terminates. Rear
ranging the inequality for A*, gives
Rl{Jt)TJtRk < 2Ak{qkh) + XkRTk(J+)TJ+Rk ,
(6.38)
51
or
/M + 1^Rl(Ji)TJtRk < 9* (6.39)
Observing that
> o , (6.40)
we obtain
f{Xi) + 1^rR*(Jt'>TjtRt ^2ArC.(#)T4+ft+1 Â£ qt (641)
Inserting this relation into (6.36) we get
f(xk+i) < qk + ^[hk{xk+i) ~ hk{xk)] . (6.42)
Noting that 0 < A* < 1 with [/ifc(a:yfc+1) hk(xk)] < 0 we obtain the final result
f{xk+1) < qk + e[hk{xk+1) hk{xk)\ (6.43)
thus completing the proof.
Next, we need to show that minimizing hk(x) also minimizes f(x).
The following theorem characterizes the stationary point convergence of the
algorithm.
Theorem 6.5 Suppose Vhk(xk) = 0. Then R(xk) is orthogonal to the column
space of J(xk).
52
Proof: We have
V/ifc(xfc) =
jjAkRk
Jl
(lXk)(J+)TJ+ + XkI\ Rk
(1 Ak)Jl{J+)TjÂ£Rk + AkJ*Rk
(1 AO [JtJk\ JkRk + AkJlRk
(6.44)
(6.45)
(6.46)
(6.47)
Using Equation 2.14 (noting that Rj = Jk Jk is symmetric), we have
Vhk(xk) = (1 Ak)J(xk)+J(xk)J(xk)+R(xk) + AkJ(xk)TR(xk) , (6.48)
and from Equation 2.11 we obtain the simplified equation
Vhk(xk) = (lXk)J(xk)+R(kk) + AkJjR{xk) . (6.49)
Using the singular value decomposition for J(xk), we obtain
Vhk(xk) = 4 AkS{ if s, > 0, 0 otherwise UTR(xk) (6.50)
Since 0 < A* < 1, if Vhk(xk) = 0 then R(xk) is orthogonal to the column space
of J(xk).
The definition for Ak implies that Xk > 1 is a sufficient condition for
convergence of the algorithm to a stationary point of f(x) because hk(x) >
f(x) as Xk > 1. However, Theorem 6.5 shows that this condition is not
necessary. We now give our global convergence theorem.
53
Theorem 6.6 Suppose at every iteration, A*, is chosen using the equation
2(tA) + RKJtfJtRk k 
(6.51)
the sufficient decrease condition
hk{xk+1) < hk{xk) cr(\Vhk(xk)Tdk\) (6.52)
is satisfied by the linesearch, and qk is chosen using
/(zjt+i) < qk+i < qk + e[hk(xk+i) hk(xk)] . (6.53)
Then {V/(a:fc)} > 0 and any limit point of {Xk} is a stationary point of f(x).
Proof: Since e > 0 and cr() is a forcing function, the sufficient decrease
condition
hk{xk+1) < hk(xk) a(\\/hk(xk)Tdk\) (6.54)
ensures that e[hk(xk+1) hk(xk)] < 0. Thus,
qk+1 Qk T t\h'k{xk+i) hk{xk)\ (6.55)
ensures that qk+i < qk and the sequence {qk} is a decreasing sequence. Noting
that f(xk) < qk and that f(x) > 0 for leastsquares problems we see that the
qk are bounded below by 0. Thus, the sequence {qk} is a convergent sequence
and the sequence {qk qk+i} converges to zero. Rearranging
0 < f(xk+i) < qk+i < qk + e[hk{xk+i) hk(xk)\ (6.56)
54
we have
0 ^ 5;
Rearranging the sufficient decrease condition gives
[hk(xk+i) hk(xk)\ > a(\Vhk(xk)Tdk\)
so that
(6.57)
(6.58)
0 < ecr(\Vhk(xk)Tdk\) < e[hk(xk+i) hk(xk)] < qk qk+l (6.59)
Since e > 0 is a constant, this implies that \Vhk(xk)Tdk\ approaches zero in
the limit. We are using steepest descent direction for hk. Thus, the bounded
away from orthogonality condition for our algorithm is
\Vhk(xk)Tdk\ = VAfc(a:fc) (6.60)
so that  V/i*,(Â£*;)  goes to zero in the limit. This along with Theorem 6.5 im
plies that {V/(xfc)} > 0 so that any limit point of {xk} is a stationary point
of f(x).
The above theorem ensures that the algorithm converges to a point
where the gradient of f(x) is zero. Stronger results can be obtained using
additional assumptions on f(x). For example, if the minimum is unique, then
55
the sequence {a:&} also converges and it converges to the minimum. See also
Ortega and Rheinboldt [89] for more details.
However, first order stationary point convergence does not preclude
the possibility of the algorithm converging to a local maximum of f(x). Grippo
et al. [53] proved that Newtons method will not converge to a local maximum.
A variation of their proof is given here that applies to more general situations,
including Algorithm 6.3.
Theorem 6.7 Suppose that the sequence {a;*,}, generated using
xk+i = xk + akdk , (6.61)
is well defined and remains in a set S where the function f(x) : S > 3? has
continuous second derivatives. Also, suppose that {xk} converges to x* G S
where x* is a local maximum such that there exists an open sphere B(x*) where
yT^2f{x)y < 0 for all x G B(x*), y / 0. If akVf(xk)Tdk < 0 for all k, then x*
cannot be a local maximum.
Proof: Let rj > 0 be such that Amaa,[V2/(a:)] < 77 for all x G B{x*), where
^max[H] is the maximum eigenvalue of H. Then, we can write:
f(xk+1) = f(xk) + Vf(xk)T[xk+ixk] + ^[xk+ixk]TS72f(xk)[xk+lxk} +
r([xfc+i sjt])
= f{xk) + akVf(xk)Tdk + ^[zfc+i xk]TV2f(xk)[xk+1 xk] +
56
Kfrfc+l Xk])
< fixk) + ockVf{xk)Tdk ^rj\\xk+i ~ Xfcll2 + r([xk+1 xk])
for all k, where liraÂ£^.0+(r'(^)/e2) = 0.
It follows from akV f(xk)Tdk < 0, that for sufficiently large k, f(xk+1) <
f(xk) contradicting the assumption that x* is a local maximum.
The strice inequality for the condition ak Vf(xk)Tdk < 0 in the above
proof ensures that convergence is not obtained in a finite number of iterations.
If convergence were obtained in a finite number of iterations, then first order
convergence to a local maximum could be obtained. That Algorithm 6.3 does
not converge to a local maximum is easily obtained using the above theorem
and
V/(zfc)r4 = 
Vf(xk)TJ?AkRk
WJlAkRkW
RTkJkJlAkRk
\\JtAkRk\\
(6.62)
so that akVf(xk)Tdk < 0 because ak > 0 and JkJk Ak is positive semidefinite.
57
7. Examples of Algorithm Performance
Algorithm 6.3 was coded using FORTRAN 77. The singularvalue
decomposition was used to obtain the required pseudo inverses [48]. While
this is a rather computationally intensive method, it is fast on test problems
and it adds valuable insight into the performance of the algorithm. Since the
minimumdistance functions are sensitive to scale changes in x, the ith element
of xk was scaled using the scale factor
Si = max(J,(xj))
30, k
(7.1)
This scaling option is used quite often; see, for example, Mores Marquardt
algorithm [80] and NL2SOL [27, 28]. A method of choosing the two parameters
Ak and qk is also needed to implement the algorithm. The smallest possible
value of Xk,
2( ft) + RU4)TJifRt
(7.2)
and the largest possible value of qk,
qk+i = qk + e[hk(xk+i) hk(xk)] , (7.3)
with e = 104 were always used in this implementation to encourage the algo
rithm to use the minimum distance idea as much as possible. The initial value
58
for qi was chosen to yield a starting value of Ax = 0.5.
The line search performs backtracking or extrapolation to obtain a
three point pattern. The derivative at xk is used to perform backtracking. Once
a three point pattern is obtained, quadratic interpolation is used to obtain an
improved point. The line search is terminated at the first point that satisfies
hk(xk+1) < hk{xk) + eakS7hk(xk)Tdk (7.4)
where ak is the step length.
The termination criteria used in the algorithm are:
(1) f(xk) < 1013, INFO = 1.
(2) V/(zjO < 1012, INFO = 2.
(3) The step length < max(l, xfc)107, INFO = 3.
(4) The number of function evaluations exceeds 200(m + 1), INFO = 4
[82].
(5) Ajfc > 0.9999, INFO = 9999.
The last termination test is used to stop the algorithm if it is essentially steepest
descent. This was done to prevent the algorithm from consuming too much
computer time. When A* is too close to one, the minimum distance idea is not
working and neither is the steepest descent method.
The algorithm was tested using the More, Garbow, Hillstrom test
59
problems [84, 83]. The performance of the algorithm on these standard prob
lems is given in Tables 7.1 and 7.2. The acronyms used by More et al. are given
in parentheses for reference. Any missing entries in one of the tables indicate
that the algorithm ran into some numerical difficulty and could not continue.
A listing of the objective functions is given in Appendix A. The scale factor
listed is used to scale the starting point. More et al. designed these test prob
lems specifically to test, among other things, how well an algorithm performs
under scale changes in r. In the test set, many problems are repeated with
the starting point scaled by 10 and then 100. One example is Rosenbrocks
function, problem number 4; where the first row in the table uses the standard
initial point (1.2,1). The two succeeding rows use (12,10) and (120,100)
respectively.
For comparison, Tables 7.3 and 7.4 give the performance of an old
(mid 1980s) version of Mores Marquardt algorithm on the same test set [80].
Mores convergence criteria are:
(1) Both actual and predicted relative reductions in the sumofsquares are
at most 108, INFO = 1.
(2) Relative error between two consecutive iterates is at most 108, INFO
= 2.
(3) Conditions for both 1 and 2 hold, INFO = 3.
60
(4) The cosine of the angle between R(x) and any column of the Jacobian
is zero, INFO = 4.
(5) Maximum number of function evaluations has been exceeded, INFO =
5.
61
Problem Scale Function
Number Factor Calls
(NPROB) M N (NFEV)
1 5 10 1 2
1 5 50 1 2
2 5 10 1 2
2 5 50 1 3
3 5 10 1 2
3 5 50 1 3
4 2 2 1 6
4 2 2 10 5
4 2 2 100 5
5 3 3 1 10
5 3 3 10 26
5 3 3 100 181
6 4 4 1 13
6 4 4 10 16
6 4 4 100 , 20
7 2 2 1 16
7 2 2 10 22
7 2 2 100
8 3 15 1 7
8 3 15 10 29
8 3 15 100 3
9 4 11 1 42
9 4 11 10 8
9 4 11 100 5
10 3 16 1 16
10 3 16 10 16
Jacobian Objective
Calls Function Norm
(NJEF) INFO (FINAL L2 NORM)
2 2 2.2360680D+00
2 2 6.7082039D+00
2 2 1.4638501D+00
3 2 3.4826302D+00
2 2 1.9097274D+00
3 2 3.6917294D+00
6 1 4.3299800D08
5 1 6.9715345D10
5 1 6.5081163D10
9 1 1.9467384D12
17 1 1.8631818D13
64 1 5.4240424D10
13 1 2.7555867D07
16 1 3.3882791D07
20 1 1.3249207D07
9 9999 6.9988753D+00
12 9999 7.8965477D+00
7 2 9.0635960D02
13 3 9.0635960D02
3 9999 4.1747279D+00
37 3 1.7535838D02
7 9999 1.1500400D01
3 9999 2.2209595D01
12 3 9.3779451D+00
12 3 9.3779451D+00
Table 7.1. Minimum distance algorithm performance on standard test prob
lems.
62
Problem Scale Function
Number Factor Calls
(NPROB) M N (NFEV)
11 6 31 1 86
11 6 31 10 17
11 6 31 100 21
11 9 31 1 426
11 9 31 10 22
11 9 31 100 19
11 12 31 1 2600
11 12 31 10 13
11 12 31 100 69
12 3 10 1 219
13 2 10 1 9
14 4 20 1 1001
14 4 20 10 1001
14 4 20 100 1000
15 1 8 1 1
15 1 8 10 28
15 1 8 100 36
15 8 8 1 73
15 9 9 1 17
15 10 10 1 137
16 10 10 10 17
16 10 10 100 11
16 30 30 1 11
16 40 40 1 11
17 5 33 1 17
18 11 65 1 25
Jacobian Objective
Calls Function Norm
(NJEF) INFO (FINAL L2 NORM)
45 3 4.7829594D02
17 2 4.7829594D02
20 2 4.7829594D02
181 2 1.1831146D03
15 2 1.1831146D03
17 2 1.1831146D03
1116 . 4 2.3497279D02
11 2 2.1731040D05
33 2 2.1731040D05
80 1 2.7028459D12
5 9999 1.2257774D+01
395 4 3.2677102D+02
416 4 3.3237608D+02
411 4 3.2367148D+02
1 2 1.8862380D+00
28 2 1.8842482D+00
36 2 5.8315069D+02
25 9999 8.4654539D02
11 1 1.9295463D09
39 9999 1.4342183D01
4 9999 3.6390336D01
10 2 6.1315453D07
7 1 3.5793833D08
7 1 3.7277429D08
13 2 7.3924926D03
19 3 2.0034404D01
Table 7.2. Minimum distance algorithm performance on standard test prob
lems continued.
Problem Scale Function Jacobian Objective
Number Factor Calls Calls Function Norm
(NPROB) M N (NFEV) (NJEF) INFO (FINAL L2 NORM)
1 5 10 1 3 2 3 2.2360680D+00
1 5 50 1 3 2 2 6.7082039D+00
2 5 10 1 3 2 1 1.4638501D+00
2 5 50 1 3 2 1 3.4826302D+00
3 5 10 1 3 2 1 1.9097274D+00
3 5 50 1 3 2 1 3.6917294D+00
4 2 2 1 21 16 4 0.0000000D+00
4 2 2 10 8 5 2 O.OOOOOOOD+OO
4 2 2 100 6 4 2 0.0000000D+00
5 3 3 1 11 8 2 9.9365231D17
5 3 3 10 20 15 2 1.0446809D19
5 3 3 100 19 16 2 3.7665334D29
6 4 4 1 500 499 5 0.0000000D+00
6 4 4 10 500 499 5 O.OOOOOOODfOO
6 4 4 100 66 65 4 9.3220945D35
7 2 2 1 14 8 1 6.9988752D+00
7 2 2 10 19 12 1 6.9988752D+00
7 2 2 100 24 17 1 6.9988752D+00
8 3 15 1 6 5 1 9.0635960D02
8 3 15 10 37 36 1 4.1747687D+00
8 3 15 100 14 13 1 4.1747687D+00
9 4 11 1 18 16 1 1.7535838D02
9 4 11 10 78 70 1 3.2052193D02
9 4 11 100 500 369 5 2.2569255D02
10 3 16 1 126 116 2 9.3779451D+00
10 3 16 10 400 346 5 7.9607763D+02
Table 7.3. Mores Marquardt algorithm performance on standard test prob
lems.
64
Problem
Number
(NPROB) M N
11 6 31
11 6 31
11 6 31
11 9 31
11 9 31
11 9 31
11 12 31
11 12 31
11 12 31
12 3 10
13 2 10
14 4 20
14 4 20
14 4 20
15 1 8
15 1 8
15 1 8
15 8 8
15 9 9
15 10 10
16 10 10
16 10 10
16 10 10
16 30 30
16 40 40
17 5 33
18 11 65
Scale Function
Factor Calls
(NFEV)
1 8
10 14
100 15
1 8
10 19
100 18
1 10
10 13
100 34
1 7
1 21
1 254
10 53
100 238
1 1
10 29
100 47
1 39
1 12
1 25
1 14
10 13
100 22
1 19
1 19
1 18
1 16
Jacobian
Calls
(NJEF)
7 1
13 1
14 1
7 2
15 1
15 1
9 3
12 2
28 3
6 2
12 1
236 1
42 1
222 1
1 4
28 1
46 1
20 1
9 2
12 1
12 2
8 2
20 2
14 2
14 2
15 1
12 1
Objective
Function Norm
(FINAL L2
NORM)
4.7829594D02
4.7829594D02
4.7829594D02
1.1831146D03
1.1831146D03
1.1831146D03
2.1731040D05
2.1731040D05
2.1731040D05
1.5700924D16
1.1151779D+01
2.9295429D+02
2.9295429D+02
2.9295429D+02
1.8862380D+00
1.8842482D+00
1.8842482D+00
5.9303235D02
1.9859084D16
8.0647100D02
3.0967064D15
3.0303191D15
2.1868857D15
2.2480051D13
4.7146716D14
7.3924926D03
2.0034404D01
INFO
Table 7.4. Mores Marquardt algorithm performance on standard test prob
lems, continued.
65
The algorithm performs well on problems where  J(xk)+R(xk)\\ 0
in some neighborhood of the minimum. When this condition does not exist,
A*, > 1 and the algorithm becomes steepest descent. Hence, the algorithm
performs poorly on such problems.
While these results are not spectacular they are encouraging. We
did not expect to get exceptional results using the steepest descent direction.
Incorporating a better search direction into the algorithm should help to greatly
improve algorithm performance. The results indicate that more work could be
done to improve the scaling of our algorithm. However, many of the desired
objectives of the above analysis were observable in the test cases. Specific
examples taken from some of the test problems are listed in the following
sections.
7.1 Rosenbrocks Function
The residuals for Rosenbrocks function [102], problem number 4, are
r i(x) = 10(x2 x\) (7.5)
r2(x) = lxx . (7.6)
Rosenbrocks function is a classic zeroresidual leastsquares problem where
 J(a;fc)+i?(xfc) > 0 and GaussNewton exhibits a quadratic convergence rate.
This problem illustrates several nice properties of the minimumdistance
66
algorithm. First, if the algorithm is initialized using a value of Ai near one, then
A*, * 0 as Xk > x*. To illustrate this, we ran the algorithm with Ai = 0.95.
The path taken by the algorithm in this case is given in Figure 7.1. As can be
seen, the first direction is essentially the negativegradient direction. However,
succeeding directions quickly move away from the negativegradient direction
and the algorithm quickly converges to the minimum. Also observable in the
early iterations of Figure 7.1 are the classic oscillations characteristic of using
a steepest descent algorithm. These oscillations dampen as A*, gets smaller and
the GaussNewton direction is approached. That A & quickly becomes small is
evident from Figure 7.2, a semilog plot of A^ versus the iteration number.
67
Figure 7.1. Minimumdistance function path for Ai = 0.95.
68
Figure 7.2. Convergence of A for Ai = 0.95.
This problem also illustrates that values of A* that are not too close to
one allow the objective function to increase, yet convergence to the minimum
is still obtained. For this Ai = 0.95 case, this occurs at iteration two as shown
in Figure 7.3.
69
Figure 7.3. Objective function values for Ai = 0.95.
Next, the algorithm was run with Xi = 0.8. This yields a starting
direction that is closer to the GaussNewton direction as seen in Figure 7.4.
This allows A*, to decrease more quickly, Figure 7.5. This quicker drop in
A*, allows the algorithm to increase the objective function, Figure 7.6. These
combined effects help speed up the algorithm resulting in fewer iterations being
required than in the Xi = 0.95 case.
70
Figure 7.4. Minimumdistance function path for Ai = 0.8.
71
Lambda
Figure 7.5. Convergence of A for Ai = 0.8.
72
16
Figure 7.6. Objective function values for Ax = 0.8.
As Ai gets smaller, greater increases in f(x) are allowed. This is illus
trated with the same series of plots where the algorithm is run with Ai = 0.5.
Here, a starting direction very near the GaussNewton direction is obtained,
Figure 7.7, and very quickly gets small, Figure 7.8. As in the previous ex
amples, the objective function is allowed to increase yet eventually converges
to zero, Figure 7.9. This final case obtains the minimum in fewer iterations
than required for the previous two cases.
73
Figure 7.7. Minimumdistance function path for Ai = 0.5.
Lambda
Figure 7.8. Convergence of Ak for Ax = 0.5.
75
Objective Function Value
1.E+04
0 1 2 3 4 5 6
Iteration Number
Figure 7.9. Objective function values for Ai = 0.5.
7.2 Freudenstein and Roths Function
The residuals for the Freudenstein and Roth function [44], problem
number 7, are
n() =  13 + Xi + ((5 x2)x2  2)x2 (7.7)
r2(x) = 29 + 2?1 + ((X2 + l)x2  14)x2 (7.8)
76
While the minimumdistance function algorithm does very well on Rosenbrocks
function, the Freudenstein and Roth function causes it quite some difficulty.
This problem has a nonzero residual for one of its local minimums where one
of the singular values approaches zero as {xk} x*. The Jacobian for this
problem is
J{x)
1 3x2 + 10x2 2
(7.9)
1 3x2 + 2x2 14
This Jacobian becomes singular when the first column is a scalar multiple of
the second column. That is, the Jacobian becomes singular when
3x2 + 102 2 = 3x2 + 2x2 14 . (7.10)
This is a quadratic equation in x2 that can be easily solved to obtain X2 =
0.8968... and X2 = 2.230138.... The local minimum that causes the dif
ficulty is at (11.41..., 0.8968 ...). The minimumdistance algorithm has
trouble when jJ(xk)+R{xk)\\ > 00. In this case, this happens when X2 >
0.8968____ Once this happens, \k > 1, the steepestdescent algorithm is
obtained and the algorithm terminates when A* > 0.9999.
As seen in Table 7.1, the first instance of problem number 7 termi
nated, at a point very near the local minimum where the norm of the objective
function obtained is very near the optimal value of 6.99887.... For this case,
the minimumdistance algorithm took only a few iterations to get this close.
77
Once this point is reached, the resulting steepestdescent algorithm is unable
to successfully find the minimum in a reasonable number of iterations.
The results for the other two cases, where the algorithm is started
at 10 and 100 times farther away from the minimum, were not as good. Dur
ing the first few iterations large decreases in the objective function and also
in  J(xk)+R(xk)\\ were obtained causing Ajt to get very small. Then, the al
gorithm starts to get close to x2 = 0.8968... where the associated singular
value is close to zero. This causes J(xk)+R(xk)\\ to increase as expected.
However, at this point, an interesting scaling problem is observed. The drop in
the singular value is enough to dominate the GaussNewton direction but not
enough to affect AThe drop obtained in the objective function during the
first few iterations caused the 2(<& fk) term in the equation for obtaining A*,,
* 2(qt fk) + Rl(Jt)TJtRk 1' '
to become very large. Once the singular value started to become small, this
equation should have caused A*, to approach one causing the algorithm to turn
toward the minimum. This is what happened in the first case. In the other
cases, the 2(g* fk) term dominates causing Ak to remain small. This causes
the algorithm to search primarily along x2 making very little progress along
x\. Eventually, the singular value gets small enough that Ak approaches one.
78
However, no progress is made toward the minimum along the other singular
direction characterized by x2 = 0.8968... and steepestdescent cannot find
the minimum.
Thus, it is observed that the 2(qk fk) term must be scaled properly
for the algorithm to work well. If 2(qk fk) is too large, then A*, will remain
small. If 2(qk fk) is too small, then A*, will stay close to one. Picking Ai to be
some reasonable value, e.g. 0.5, appears to scale the 2(qk fk) term for many
problems. However, some dynamic scaling is needed to improve the minimum
distance algorithms performance. Such a scaling algorithm needs to find an
effective balance between allowing Afc > 0 for fast convergence and A^ > 1
when a singular value is approaching zero.
7.3 Powells Singular Function
The residuals for Powells singular function [92], problem number 6,
are
n(z) = + 10:r2 (7.12)
T2(x) = V5(x3 X4) (7.13)
nix) = (x2 2x3)2 (7.14)
r4(x) = \/l0(xi x4) (7.15)
79
with a minimum objective function value of zero at (0,0, 0,0). The Jacobian
for this problem,
1 10 0 0
0 0 75 75
2(x2 2x3) 4(x2 2x3) 0 0
y/lO 0 0 TTo
(7.16)
is obviously singular at the minimum by simply examining the third row.
Hence, this problem is a mixture of the above two problems. It is a zero
residual problem like Rosenbrocks function but it also has a singular value
that approaches zero as Xk approaches the minimum. The minimum distance
function performs very well on this problem since  J(xk)+R(xk)\\ > 0 as Xk ap
proaches the minimum. This allows Ak to approach zero so that the algorithm
searches a direction approaching the GaussNewton direction.
80
8. Extensions and Future Research
Two convergence conditions, the bounded away from orthogonality
and sufficient decrease conditions, were identified as keys to algorithm perfor
mance. At first, these conditions appeared to be general enough to cover most
any efficient algorithm. However, a detailed study of the nonmonotone algo
rithms in the literature indicated that relaxing these two convergence condi
tions could yield faster convergence. Further analysis revealed that an estimate
of the distance to the minimum relaxes both the bounded away and sufficient
decrease conditions. These minimum distance functions also pick an optimal
point for a linesearch in the following sense. If the estimate of the distance to
the minimum is accurate, the distance to the minimum is minimized at each
iteration.
We developed a linesearch algorithm, based on minimizing a minimum
distance function along its steepest descent direction. Unlike the nonmonotone
algorithms in the literature, this minimum distance function algorithm dynam
ically adjusts to account for changes between the influence of curvature and
descent. The algorithm does not force a reduction in the objective function
81
at each iteration, yet it allows f(x) to increase during an iteration only if this
will result in faster convergence. A variant of this algorithm was tested using
some standard test functions. While good performance of the algorithm was
observed, faster convergence could be obtained if a better search direction is
used. One such enhancement would be to include trust region logic. Trust
region algorithms have proven to be very effective in practice and were shown
in this thesis to relax the bounded away from orthogonality condition. Thus,
if A*, > 1, then an efficient algorithm that relaxes the bounded away from
orthogonality condition will result.
The nonmonotone idea of Grippo et. al. [53] has been used for con
strained optimization to help eliminate the Maratos effect [8]. This suggests
that extending the minimum distance function idea to constrained optimization
problems could be beneficial. First, let us consider linear equality constraints,
e(x) = 0. Let Jrepresent the pseudo inverse of J restricted to the linear
equality constraints, e(x) = 0 [99]. Then the nonlinear leastsquares minimum
distance function becomes
hk(x) = iJ<+>(zt)+fl(x)2 . (8.1)
This is equivalent to the problem of minimizing
hk(x) = ix xlf (8.2)
82
where
x*k argmin^W J(xk)[x xk\ + /E(ccfc)2 S.T. e(x) = 0 . (8.3)
In our minimumdistance function approach, we minimized an estimate of
the distance to the minimum of an unconstrained problem. Here, we use a
quadratic approximation to obtain an estimate of the distance to a constrained
minimum and we minimize this distance. Such a minimumdistance function
would allow feasible and nonfeasible iterates.
This idea is very similar to the original idea of allowing f(x) to in
crease along the linesearch direction in order to make progress along an eigen
vector direction with small associated eigenvalue. Here, the algorithm is al
lowed to go infeasible in order to make progress along a constraint. Consider
modifying our leastsquares objective function to obtain the penalty function
f(x) = Me(x)2 + , (8.4)
where M is a large positive constant. This yields an illconditioned uncon
strained nonlinear leastsquares problem with an eigenvector (with a small
associated singular value) oriented along the constraint e(x). If the minimum
distance function idea is applied to this penalty function, the algorithm will
allow f(x) to increase in order to move along the small singularvalue eigen
vector direction that points along the constraint. Thus, the idea of increasing
83
f{x) to make progress along a small singularvalue direction and the idea of
allowing the iterates to go infeasible in order to make progress along a con
straint, in this case, coincide. Comparing the performance of the minimum
distance idea using the penalty function approach to the performance of the
above constrained minimumdistance function could be quite enlightening.
The above example used a penalty function to convert a constrained
problem to an unconstrained problem. The minimumdistance function idea
also converts a constrained problem into an unconstrained problem. At each
function evaluation of this new hk(x) we solve a constrainedquadratic subprob
lem. This idea can easily be extended to include general nonlinear constraints
by minimizing
M*) = \\\x  > (85)
where
x\ argmin\\\J{xk)[x Xk] + )[2 S.T. the constraints . (8.6)
If these constraints are linearized, a quadratic problem with linear constraints is
obtained that can be solved iteratively to yield the solution to the subproblem.
This decouples the nonlinearity of the objective function from the nonlinearity
of the constraints. The resulting algorithm should exhibit a fast convergence
84
rate and the nonmonotone idea incorporated into the minimumdistance func
tion should eliminate the Maratos effect.
I
85
APPENDIX A Standard Test Functions
(1) Linear function, full rank [84].
m is variable, n >m
ri(x) = Xi \ (ELi xj) 1, 1 < i < m
n(x) = \ (E,m=i xj) 1, m < i < n
x0 = (1,...,1)
/ = n m at (1,..., 1)
(2) Linear function, rank 1 [84].
m is variable, n > m
ri(x) = i (EjLi i^') 1
x0 = (1,...,1)
f = at an^ Pint where EjLi3*3 =
(3) Linear function, rank 1 with zero columns and rows [84].
m is variable, n > m
ri(x) = 1
r<(:e) = (* 1) (E^l1  1, 2 < i < n
r,r(x) = 1
Xo = (1,,1)
f = ^(2nbf at any Pint where ^7=2 JXj =
(4) Rosenbrocks Function [102].
86
m = 2, n = 2
ri(i) = 10(a:2 a;?)
r2(a:) = 1 xx
X* = (1.2,1)
/ = 0 at (1,1)
(5) Helical valley funciton [43].
m = 3, n = 3
ri(a;) = 10[a:3 10^(xi,x2)\
r2(o:) = 10[a/o:? + x\ 1]
r3(z) = x3
where
8(xi,x2)
xq (1,0,0)
2ir arctan (^) ,
arctan (Â£2') + 0.5,
2tt \xi ) '
/ = Oat (1,0,0)
(6) Powell singular function [92].
m = A, n = 4
Tx(x) = xi + 10o:2
r2(o;) = y/E(x3 x4)
r3(x) = (x2 2o:3)2
if rex > 0
if x\ < 0
87
r4(s) = \/IO(si *4)
so = (3,1,0,1)
/ = 0 at (0,0,0,0)
(7) Freudenstein and Roth Function [44].
m = 2, n = 2
ri(x) = 13 + si + ((5 s2)s2 2)s2
r2(x) = 29 + si + ((s2 + l)s2 14)s2
so = (0.5, 2)
/ = 0 at (5,4)
/ = 48.9842... at (11.41..., 0.8968...)
(8) Bard Function [2].
m = 3, n = 15
r,(s) = Vi ( Xi + (16i)a;2+min(i,16t)a;3
where
i Vi i Vi i Vi
1 0.14 6 0.32 11 0.73
2 0.18 7 0.35 12 0.96
3 0.22 8 0.39 13 1.34
4 0.25 9 0.37 14 2.10
5 0.29 10 0.58 15 4.39
88
*0 = (1,1,1)
/ = 8.21487 103
/ = 17.4286 ... at (0.8406 ... oo, oo)
(9) Kowalik and Osborne Function [68].
m = 4, n = 11
ri{x) yi uf+uiXi+Xi
where
i Vi U{ i Vi Ui
1 0.1957 4.0000 7 0.0456 0.1250
2 0.1947 2.0000 8 * 0.0342 0.1000
3 0.1735 1.0000 9 0.0323 0.0833
4 0.1600 0.5000 10 0.0235 0.0714
5 0.0844 0.2500 11 0.0246 0.0625
6 0.0627 0.1670
x0 = (0.25,0.39,0.415,0.39)
/ = 3.0750510"4
/ = 1.02734 10~3 at (+oo, 14.07..., oo, oo)
(10) Meyer Function [79].
m = 3, n = 16
n(x)
X\ exp
X'l
45+5i+a?3
 Vi
89
where
1 Vi 1 Vi
1 34780 9 8261
2 28610 10 7030
3 23650 11 6005
4 19630 12 5147
5 16370 13 4427
6 13720 14 3820
7 11540 15 3307
8 9744 16 2872
x0 = O II 4000. ,250)
/ = 87.9458 ...
(11) Watson Function [68].
2 < m < 31, m 31
n(x) = ejuo' 1  fo", ^(A)'1) 1, i <; < 29
r30(x) = xi
r3i(x) = x2x\l
x0 = (0,...,0)
/ = 2.28767 103 for m = 6
/ = 1.39976  106 for m = 9
90
