Citation
Relaxing convergence conditions to improve the convergence rate

Material Information

Title:
Relaxing convergence conditions to improve the convergence rate
Creator:
MacMillan, Daniel
Publication Date:
1999
Language:
English
Physical Description:
ix, 107 leaves : illustrations ; 28 cm

Subjects

Subjects / Keywords:
Convergence ( lcsh )
Mathematical optimization ( lcsh )
Convergence ( fast )
Mathematical optimization ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 96-107).
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Daniel MacMillan.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
42618258 ( OCLC )
ocm42618258
Classification:
LD1190.L622 1999d .M33 ( lcc )

Full Text
RELAXING CONVERGENCE CONDITIONS TO
IMPROVE THE CONVERGENCE RATE
by
Daniel MacMillan
B.S., University of Colorado, 1976
M.S., Colorado School of Mines, 1985
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
1999


This thesis for the Doctor of Philosophy
degree by
Daniel MacMillan
has been approved
by
Richard H. Byrd
Weldon A. Lodwick
Thomas F. Russell


MacMillan, Daniel (Ph.D., Applied Mathematics)
RELAXING CONVERGENCE CONDITIONS TO IMPROVE THE CONVERGENCE RATE
Thesis directed by Professor Harvey J. Greenberg
ABSTRACT
Standard global convergence proofs are examined to determine why some algorithms perform better than other algorithms. We show that relaxing the conditions required to prove global convergence can improve an algorithm's performance. Further analysis indicates that minimizing an estimate of the distance to the minimum relaxes the convergence conditions in such a way as to improve an algorithm's convergence rate.
A new line-search algorithm based on these ideas is presented that does not force a reduction in the objective function at each iteration, yet it allows the objective function to increase during an iteration only if this will result in faster convergence. Unlike the nonmonotone algorithms in the literature, these new functions dynamically adjust to account for changes between the influence of curvature and descent. The result is an optimal algorithm in the sense that an estimate of the distance to the minimum is minimized at each iteration.
The algorithm is shown to be well defined and we prove global convergence of the algorithm. Performance of the algorithm on some standard test functions is presented to illustrate the features of the algorithm.


This abstract accurately represents the content of the candidate's thesis. I recommend its publication.


ACKNOWLEDGMENTS

I would like to acknowledge Marathon Oil Company for its generous support throughout my pursuit of this degree. I would also like to express my sincerest gratitude to my wife Lois and daughter Michelle for their endless patience and support. Without them this thesis would not have been possible. I am indebted to Dr. H. J. Greenberg, whose enthusiasm and knowledge of mathematics in general and optimization in particular have been truly inspiring. I would also like to thank Richard Byrd, Gary Kochenberger, Weldon Lodwick, Sylvia Lu and Thomas Russell for their many valuable comments and help with this thesis. I would also like to acknowledge Allen Holder, whose LaTeX template and assistance with LaTeX were invaluable. Finally, I wish to thank my M.S. thesis advisor, Dr. R. G. Underwood, who introduced me to the wonderful world of optimization.


CONTENTS
Figures ........................................................... viii
Tables............................................................... ix
Chapter
1. Introduction....................................................... 1
2. Optimization Preliminaries......................................... 3
3. Convergence Theory for Optimization............................... 11
3.1 Wolfe's Framework .............................................. 11
3.2 Zangwill's Framework ........................................... 15
3.3 Convergence Rate................................................ 18
4. Relaxing the Conditions for Convergence............................ 22
4.1 Relaxing the Sufficient Decrease Condition...................... 22
4.2 Relaxing the Bounded Away Condition............................. 24
4.3 Relaxing Both Conditions........................................ 27
5. A New Approach to Algorithm Convergence........................... 30
5.1 Examining the Bounded Away Condition............................ 30
5.2 Examining the Sufficient Decrease Condition..................... 33
5.3 Combining the Results........................................... 37
6. A Minimum-Distance Function Algorithm............................. 40
6.1 Line Search Algorithm........................................... 47
7. Examples of Algorithm Performance ................................ 58


7.1 Rosenbrock's Function .......................................... 66
7.2 Freudenstein and Roth's Function ............................... 76
7.3 Powell's Singular Function ..................................... 79
8. Extensions and Future Research..................................... 81
Appendix
A. Standard Test Functions............................................ 86
References............................................................ 96


FIGURES
Figure
2.1 Typical trust-region path ...................................... 6
3.1 Region of acceptable points (W) for Wolfe's framework .......... 15
3.2 Region of acceptable points (Z) for Zangwill's framework ....... 18
4.1 Trust-region algorithm for Rosenbrock's function ............... 27
5.1 Example of when increasing f(x) is beneficial .................. 34
5.2 Example of when increasing f(x) is not beneficial .............. 37
5.3 Regions defined by the minimum-distance function ............... 39
6.1 How $\lambda$ affects the minimum-distance function ............ 44
6.2 Comparison of nonmonotone and minimum distance regions ......... 45
7.1 Minimum-distance function path for $\lambda_1 = 0.95$ .......... 68
7.2 Convergence of $\lambda_k$ for $\lambda_1 = 0.95$ .............. 69
7.3 Objective function values for $\lambda_1 = 0.95$ ............... 70
7.4 Minimum-distance function path for $\lambda_1 = 0.8$ ........... 71
7.5 Convergence of $\lambda_k$ for $\lambda_1 = 0.8$ ............... 72
7.6 Objective function values for $\lambda_1 = 0.8$ ................ 73
7.7 Minimum-distance function path for $\lambda_1 = 0.5$ ........... 74
7.8 Convergence of $\lambda_k$ for $\lambda_1 = 0.5$ ............... 75
7.9 Objective function values for $\lambda_1 = 0.5$ ................ 76


TABLES
Table
7.1 Minimum distance algorithm performance on standard test problems ........ 62
7.2 Minimum distance algorithm performance on standard test problems, continued ........ 63
7.3 Moré's Marquardt algorithm performance on standard test problems ........ 64
7.4 Moré's Marquardt algorithm performance on standard test problems, continued ........ 65


1. Introduction

This thesis is the result of research on relaxing the standard convergence conditions in order to improve the convergence rate of optimization algorithms. The standard convergence proofs used to prove an optimization algorithm will converge to a local minimum from an arbitrary starting point are examined. This examination, along with a review of the nonmonotone algorithms in the literature, indicates that relaxing these standard conditions could increase the convergence rate of algorithms in many cases.

Two convergence conditions that are central to this thesis are the "bounded away from orthogonality" condition and the "sufficient decrease" condition. An analysis of these two conditions indicates that minimizing a function different from the objective function relaxes the convergence conditions in such a way as to improve an algorithm's convergence rate. Unlike the nonmonotone algorithms in the literature, these new functions dynamically adjust to account for changes between the influence of curvature and descent. The result is an optimal algorithm in the sense that an estimate of the distance to the minimum is minimized at each iteration.


A new line-search algorithm based on these ideas is presented that does not require the objective function to decrease at any iteration. This algorithm is shown to be well defined and we prove global convergence of the algorithm. Performance of the algorithm on some standard test functions is presented to illustrate the features of the algorithm.

The thesis is organized as follows. The first chapter gives a general discussion of optimization along with definitions needed later in the text. The second chapter gives a detailed description of two common global convergence proofs, which are expanded upon in the next chapter. In Chapter 4, nonmonotone algorithms from the literature that relax the convergence conditions are shown to obtain faster convergence. Reasons why relaxing the conditions helps convergence are studied in detail in the subsequent chapter, which identifies why relaxing the convergence conditions improves algorithm performance and determines when relaxing these conditions is beneficial. The results of these analyses are combined to obtain a new function to minimize that improves algorithm performance. In Chapter 6, a line-search algorithm based on this function is presented and global convergence to a local minimum is proved. The performance of this algorithm is analyzed in the next chapter, where the features of the algorithm are illustrated using standard test functions. Finally, the results are summarized and avenues for future research are presented.


2. Optimization Preliminaries

Consider the optimization problem of minimizing a real valued objective function f(x) with continuous second derivatives, where $x \in \mathbb{R}^m$ and f has a minimum at $x^*$. A level set for f(x) is defined as:

Definition 2.1 The level set for a given function f(x) and a point $x_k$ is defined to be

$L_f(x_k) = \{ x : f(x) \le f(x_k) \}$ .   (2.1)

The gradient of this function, denoted by $\nabla f(x)$, is an m-dimensional vector. The Hessian of this function, denoted by $\nabla^2 f(x)$, is an m by m matrix. One method of minimizing f(x) is to search negative gradient directions successively for the minimum value of f. It has been shown that this steepest descent algorithm exhibits a linear asymptotic convergence rate [73]. In practice, the convergence rate of this method can be quite slow due to the effects of the eigenspaces associated with the minimum and maximum eigenvalues [1].
In order to obtain a faster convergence rate, other algorithms have been considered. One such algorithm is Newton's method, where the next iterate is obtained from the previous iterate using the formula

$\Delta x_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$ ,   (2.2)

where $x_{k+1} = x_k + \Delta x_k$ and $x_k$ is the point reached at iteration k. Newton's method can be shown to exhibit a quadratic convergence rate on many problems in the neighborhood of a solution provided $\nabla^2 f(x^*)$ is positive definite [73]. However, the simplified Newton's method presented here will not work on most optimization problems without some modifications. Often, a point that is close enough to the minimum so that Newton's method would yield quadratic convergence may not be available. A robust algorithm needs to be able to converge to a local minimum from any arbitrary starting point. Another problem is that the Hessian may not be positive definite, even at the minimum. If the Hessian is not of full rank, then the inverse will not exist and the algorithm, as presented here, becomes undefined. If the Hessian contains negative eigenvalues, the direction obtained could be an ascent, rather than a descent, direction. Many modifications of Newton's algorithm have been suggested to overcome these difficulties. A line search may be used to ensure that $f(x_{k+1}) < f(x_k)$ in an attempt to guarantee convergence to a local minimum. The Newton direction is often modified to ensure that it is a descent direction. Yet many of these modifications can prevent the desired asymptotic quadratic convergence of the algorithm.
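As an illustrative aside (a minimal Python sketch, not the thesis's implementation), the safeguards just described, namely falling back to the negative gradient and backtracking so that f decreases, can be combined as follows:

    import numpy as np

    def damped_newton(f, grad, hess, x, tol=1e-8, max_iter=100):
        """Sketch of a safeguarded Newton iteration: uses -gradient when
        the Hessian is singular or the Newton step is not a descent
        direction, and backtracks until f decreases."""
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            try:
                d = -np.linalg.solve(hess(x), g)   # Newton direction, per (2.2)
            except np.linalg.LinAlgError:
                d = -g                             # singular Hessian: steepest descent
            if g @ d >= 0:                         # ascent direction: steepest descent
                d = -g
            alpha = 1.0
            while f(x + alpha * d) >= f(x) and alpha > 1e-12:
                alpha *= 0.5                       # backtrack until f(x_{k+1}) < f(x_k)
            x = x + alpha * d
        return x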
One method of modifying Newton's method that enables it to converge from arbitrary starting points is the model trust region algorithm, where a model of f(x) near $x_k$ is obtained. A typical model is the quadratic approximation:

$m(x) = \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k) + \nabla f(x_k)^T (x - x_k) + f(x_k)$ .   (2.3)

Usually, the model needs to be only a first order approximation of f(x) to prove global convergence. In a typical trust region algorithm, the model is minimized subject to the trust region constraint $\|x - x_k\| \le \Delta_k$, using an appropriate norm, where $\Delta_k$ is the current trust region size. A step is then taken using the x obtained by minimizing the model subject to the trust region constraint. Before the next iteration, the trust region size can be expanded or contracted depending on how well the model predicts the reduction obtained in the objective function f(x). An excellent review of trust region algorithms was done by Moré [81]. Shultz et al. [110] obtained a family of trust region algorithms that also account for directions of negative curvature. More recently, Dennis et al. [29] extended the trust region ideas to nonsmooth optimization.

Trust region algorithms have the advantage of yielding a negative gradient step for a small trust region size, ensuring a reduction in f(x), while allowing a full Newton step to be taken when the trust region size is large. This is achieved by a locus of possible points obtained for various trust region sizes that is nonlinear. An example of a typical trust-region path, illustrated using a quadratic function, is given in Figure 2.1. When a full Newton step results in an increase in f(x), smaller trust region sizes are used, ensuring that a sufficient reduction in f will be obtained, guaranteeing convergence to a local minimum. When the full Newton step reduces the objective function, the trust region increases, allowing the algorithm to take full Newton steps, resulting in the desired local quadratic convergence rate.
Figure 2.1. Typical trust-region path.


It is often possible to take advantage of special structure that exists in certain types of optimization problems. One example of this is the non-linear least-squares problem:

$f(x) = \frac{1}{2} R(x)^T R(x) = \frac{1}{2} \| R(x) \|^2$ ,   (2.4)

where the norm used here is the Euclidean norm with $R(x) \in \mathbb{R}^n$, $x \in \mathbb{R}^m$ and $m \le n$. Let $r_i(x)$ be the ith element of R(x). The functions $r_i(x)$ are often called residuals. The gradient of f is

$\nabla f(x) = J(x)^T R(x)$ ,   (2.5)

where J(x) is the Jacobian matrix of first derivatives of R(x) and we use $J_i(x)$ to denote the ith column of J(x). The Hessian of f is

$\nabla^2 f(x) = J(x)^T J(x) + \sum_{i=1}^{n} r_i(x) \nabla^2 r_i(x)$ .   (2.6)

If $\nabla^2 f(x)$ is positive definite, Newton's method becomes

$\Delta x_k = -\left[ J(x_k)^T J(x_k) + \sum_{i=1}^{n} r_i(x_k) \nabla^2 r_i(x_k) \right]^{-1} J(x_k)^T R(x_k)$ .   (2.7)

We can take advantage of the special structure of the problem by ignoring the second derivative term

$S(x) = \sum_{i=1}^{n} r_i(x) \nabla^2 r_i(x)$ .   (2.8)




Assuming that $J(x_k)$ is of full column rank yields the Gauss-Newton method equation

$\Delta x_k = -\left[ J(x_k)^T J(x_k) \right]^{-1} J(x_k)^T R(x_k)$ ,   (2.9)

where $[J(x_k)^T J(x_k)]^{-1} J(x_k)^T$ is the Moore-Penrose generalized inverse of $J(x_k)$. Moré [80] has found a method due to Levenberg [71] and Marquardt [77] to be useful for minimizing the non-linear least-squares problem using a trust region algorithm. The Levenberg-Marquardt algorithm is

$\Delta x_k = -\left[ J(x_k)^T J(x_k) + \lambda_k I \right]^{-1} J(x_k)^T R(x_k)$ ,   (2.10)

where $\lambda_k \ge 0$. Moré solves the trust region problem by choosing $\lambda_k$ so that $\|x - x_k\| \approx \Delta_k$.
The concept of a generalized matrix inverse [114] is useful in obtaining least-squares solutions for any general matrix J, even if J is not of full column rank.

Definition 2.2 The Moore-Penrose generalized inverse of an n by m matrix J is defined to be the unique m by n matrix $J^+$ satisfying the following four properties:

$J^+ J J^+ = J^+$   (2.11)

$J J^+ J = J$   (2.12)

$(J J^+)^T = J J^+$   (2.13)

$(J^+ J)^T = J^+ J$ .   (2.14)

We also have the relations

$(J^+)^+ = J$   (2.15)

$(J^T)^+ = (J^+)^T$   (2.16)

$(J J^T)^+ = (J^+)^T J^+$ .   (2.17)

In the sequel, the following definitions for projection matrices will be found useful:

$P_J = J J^+$   (2.18)

$R_J = J^+ J$ .   (2.19)

Using the Moore-Penrose generalized inverse, the Gauss-Newton method becomes

$\Delta x_k = -J(x_k)^+ R(x_k)$ .   (2.20)

This definition makes the Gauss-Newton method well defined, even when $J(x_k)$ is not of full column rank.
The Moore-Penrose generalized inverse can easily be obtained using the singular value decomposition

$J = U S V^T$ ,   (2.21)

where U is an n by n orthogonal matrix, V is an m by m orthogonal matrix and S is an n by m matrix with the positive singular values, $s_i$, along the diagonal. The Moore-Penrose inverse of J is obtained with the singular value decomposition using

$J^+ = V S^+ U^T$ ,   (2.22)

where the pseudo inverse of S, $S^+$, is an m by n matrix with $1/s_i$ on the diagonal. If J is not of full column rank, then one or more of the singular values are zero. In this case, $S^+$ will have a zero on the diagonal wherever $s_i$ is zero, and $1/s_i$ otherwise. We will find the singular value decomposition to be a useful analysis tool in the sequel.
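As an illustrative aside (a minimal Python sketch; the tolerance rcond is an assumed cutoff for deciding which singular values count as zero), equation (2.22) translates directly into code:

    import numpy as np

    def pinv_svd(J, rcond=1e-12):
        """Moore-Penrose inverse J+ = V S+ U^T, per equation (2.22)."""
        U, s, Vt = np.linalg.svd(J, full_matrices=False)
        # Invert only singular values judged nonzero; zero out the rest.
        s_plus = np.where(s > rcond * s.max(), 1.0 / s, 0.0)
        return Vt.T @ (s_plus[:, None] * U.T)

    J = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 1.0]])  # example matrix
    print(np.allclose(J @ pinv_svd(J) @ J, J))           # checks property (2.12)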


3. Convergence Theory for Optimization

The above discussion identified two of the major considerations in non-linear optimization algorithm design. One consideration, global convergence, is ensuring that the algorithm converges to a local minimum, given an arbitrary starting point. The other major consideration is obtaining a fast rate of convergence. Two methods of proving that an algorithm is globally convergent have been used extensively. One is a method attributed primarily to Wolfe [124, 125]; the other is a method attributed primarily to Zangwill [128]. Convergence rate and these two global convergence proofs are discussed in the following sections.
3.1 Wolfe's Framework
Two independent conditions have been used together for years to prove global convergence of optimization algorithms. The first condition is

$-\nabla f(x_k)^T d_k \ge c \, \|\nabla f(x_k)\| \, \|d_k\|$ ,   (3.1)

where $d_k$ is the search direction and c is some small positive constant. This condition is often referred to as "bounded away from orthogonality" or the "bounded away" condition. The second condition is

$f(x_k) - f(x_{k+1}) \ge \sigma\left( \frac{|\nabla f(x_k)^T d_k|}{\|d_k\|} \right)$ ,   (3.2)

where $\sigma(\cdot)$ is a forcing function. Forcing functions are any non-decreasing functions on $[0, \infty)$ with the property that $\sigma(t) > 0$ when $t > 0$ and $\sigma(t) = 0$ when $t = 0$. An example of a forcing function is $\sigma(t) = \epsilon t^2$. This condition is often referred to as "sufficient decrease" or the "sufficient decrease" condition.
If an algorithm satisfies both of the above conditions, then global convergence of the algorithm can be proven for a G-differentiable function [89].

Definition 3.1 A function $f : D \subset \mathbb{R}^n \to \mathbb{R}$ is Gateaux- or (G-)differentiable at an interior point x of D if there exists a linear operator $A : \mathbb{R}^n \to \mathbb{R}$ such that, for any $h \in \mathbb{R}^n$,

$\lim_{t \to 0} \frac{1}{t} \, | f(x + th) - f(x) - tAh | = 0$ .   (3.3)

We use the following theorem taken from Ortega and Rheinboldt [89] to prove global convergence.

Theorem 3.2 Suppose that $f : D \subset \mathbb{R}^n \to \mathbb{R}$ is G-differentiable and bounded below on some set $D_0 \subset D$, the iterates $\{x_k\}$ remain in $D_0$, and the sufficient decrease condition is satisfied for every iteration $k \ge 0$. Then

$\lim_{k \to \infty} \frac{|\nabla f(x_k)^T d_k|}{\|d_k\|} = 0$ .   (3.4)

This theorem, along with the bounded away from orthogonality condition, ensures the first order convergence result that $\|\nabla f(x_k)\| \to 0$.
To design a practical globally convergent algorithm, one first picks a search direction algorithm that satisfies the bounded away from orthogonality condition. Often, the method of generating the search direction automatically ensures convergence without explicitly enforcing the bounded away from orthogonality condition. Quasi-Newton algorithms and Newton methods using a modified Cholesky decomposition are some examples. Next, a line search method is chosen that satisfies the sufficient decrease condition. Many practical line search algorithms, such as the Goldstein-Armijo method, satisfy this condition [89]. This yields a very powerful theory. A large number of practical globally convergent algorithms are possible by using different search directions with different line search techniques [89].
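As an illustrative aside (a minimal Python sketch of standard Armijo backtracking, one of the Goldstein-Armijo style methods mentioned above; the constants c1 and shrink are conventional assumptions):

    import numpy as np

    def armijo_backtrack(f, x_k, f_k, g_k, d_k, c1=1e-4, shrink=0.5):
        """Backtrack until f(x_k + alpha*d_k) achieves a sufficient
        decrease relative to the directional derivative g_k @ d_k."""
        alpha, slope = 1.0, g_k @ d_k          # slope < 0 for a descent direction
        while f(x_k + alpha * d_k) > f_k + c1 * alpha * slope:
            alpha *= shrink                    # shrink step until the test passes
            if alpha < 1e-16:
                raise RuntimeError("d_k may not be a descent direction")
        return alpha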
Let W denote the region of acceptable points that satisfies these two conditions. The forcing function can be chosen so that only a very small reduction in f(x) is needed to ensure convergence. The constant c can be chosen so small that directions nearly perpendicular to the negative gradient direction are acceptable. An illustration of this in two dimensions (m = 2) is given in Figure 3.1. The current point is $x_k$ with objective function value $f(x_k)$. Since the function value is required to decrease, $W \subset L_f(x_k)$ as illustrated in the figure. The region of acceptable new points is just slightly inside $L_f(x_k)$, and it nowhere touches the boundary due to the sufficient decrease condition. There is a small excluded region toward the bottom right where $L_f(x_k)$ is outside the region defined by the bounded away from orthogonality condition. Points in this region satisfy the sufficient decrease condition but cannot be reached because the search direction required would violate the bounded away from orthogonality condition.


Figure 3.1. Region of acceptable points (W) for Wolfe's framework.
3.2 Zangwill's Framework
Zangwill proposed a global convergence proof based on the idea of point-to-set maps [128]. He defined an algorithm as a point-to-set mapping, where the algorithm uses the current point, $x_k$, to generate a set of possible new iterates. For example, when performing a line search, it is impractical to obtain the exact minimum. Hence, a range of points along the line that are within some tolerance of the minimum are considered acceptable. This range of points becomes the set of points generated by the algorithm in the point-to-set map.
An important property of point-to-set mappings needed to prove global convergence is the idea of a closed mapping [73].

Definition 3.3 A point-to-set mapping A(x) from a set X to subsets of a set Y is said to be closed at $x \in X$ if the conditions

$\lim_{k \to \infty} x_k = x , \quad x_k \in X$   (3.5)

and

$\lim_{k \to \infty} y_k = y , \quad y_k \in A(x_k)$   (3.6)

together imply that

$y \in A(x)$ .   (3.7)
Using the idea of a point-to-set map and closed mappings, we have the following global convergence theorem [128].

Theorem 3.4 Let A(x) be an algorithm on a set X. Let $\Gamma \subset X$ be the solution set for the optimization problem. Assume that, given $x_0$, the sequence $\{x_k\}$ is generated satisfying $x_{k+1} \in A(x_k)$, and also satisfying the following.

- All $x_k$ are contained in a compact set $S \subset X$.
- There is a continuous function g on X such that
  (1) if $x \notin \Gamma$ then $g(y) < g(x)$ for all $y \in A(x)$;
  (2) if $x \in \Gamma$ then $g(y) \le g(x)$ for all $y \in A(x)$.
- The mapping A is closed at points outside of $\Gamma$.

Then, every limit point of $\{x_k\}$ is a solution.
To use this theorem to prove global convergence for some algorithm, the function g is usually chosen as the objective function and a solution is usually defined as a stationary point, i.e., $\nabla f = 0$. Then, the line search is shown to be a closed point-to-set mapping that reduces the objective function (when not at a solution).

The region Z of acceptable points that satisfies the conditions of this theorem is again quite large. The conditions on $g(x) = f(x)$ ensure that $Z \subset L_f(x_k)$. An illustration of this for the same two dimensional problem shown in Figure 3.1 is given in Figure 3.2. As can be seen, the region of acceptable new points is just slightly inside $L_f(x_k)$, and it nowhere touches the boundary because the mapping is closed and a strict decrease in g is obtained outside the solution set. Also note that there are no excluded regions like that obtained for Wolfe's proof due to the bounded away from orthogonality condition. However, most practical algorithms will have $\nabla f(x_k)^T d_k < 0$ for $\nabla f(x_k) \ne 0$ so that f(x) can be reduced during the line search. Hence, the two convergence proofs produce essentially the same region of acceptable points.
Figure 3.2. Region of acceptable points (Z) for Zangwill's framework.
3.3 Convergence Rate

Consider the following two definitions commonly used in rate analysis [51].

Definition 3.5 Let the sequence $\{x_k\}$ converge to $x^*$. The order of convergence, denoted by $a$, is defined to be

$a = \sup \left\{ q : \limsup_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^q} < \infty \right\}$ .   (3.8)

The case $a = 1$ is called linear convergence, $a = 2$ is called quadratic convergence, and $a = 3$ is called cubic convergence.

Definition 3.6 The convergence ratio, denoted by $\beta$, for a sequence $\{x_k\}$ of order $a$ is defined to be

$\beta = \limsup_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^a}$ .   (3.9)
The case $a = 1$ and $\beta = 0$ is called superlinear convergence. These two definitions are useful for algorithm acceleration and the design of convergence criteria. Also, they have often been used as a method of recommending one algorithm over another. However, this use can lead to erroneous conclusions. Larger orders and smaller ratios do not necessarily imply that one algorithm will converge to the solution of a given problem faster [51]. Another problem with using these definitions to compare algorithms is that they are only useful in a region near the solution, $x^*$. While this asymptotic convergence rate is important, the analysis of algorithm performance at points far from the solution, where the asymptotic convergence rate does not apply, is also very important.
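As an illustrative aside (a toy example, not from the thesis), the sequence $x_{k+1} = x_k^2$ converges to $x^* = 0$ with order $a = 2$ and ratio $\beta = 1$, which the error quotients of Definition 3.6 detect directly:

    x, xs = 0.5, []
    for _ in range(5):
        xs.append(x)
        x = x * x                      # quadratically convergent toy sequence
    for k in range(len(xs) - 1):
        print(xs[k + 1] / xs[k] ** 2)  # ||x_{k+1} - x*|| / ||x_k - x*||^2 -> beta = 1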
There have been many attempts to come up with a good definition for global convergence rate. For steepest descent, the relation

$f(x_{k+1}) \le \left( \frac{\lambda_{max} - \lambda_{min}}{\lambda_{max} + \lambda_{min}} \right)^2 f(x_k)$   (3.10)

has been obtained for a quadratic objective function, where $\lambda_{max}$ and $\lambda_{min}$ are the largest and smallest eigenvalues of the (constant) Hessian [73]. Others have obtained bounds on $f(x_k) - f(x^*)$ for certain classes of objective functions. However, an estimate for global convergence rate for more general objective functions remains elusive. A summary of this work is presented in Nocedal [88].

The success obtained in proving global convergence leads us to examine the global convergence proofs given above to see if they contain any insight into algorithm performance. However, such insight is not obtained. Consider an algorithm that always obtains its next point on the boundary of W or Z. For Wolfe, this algorithm will always satisfy the sufficient decrease condition with equality. If the forcing function is $\sigma(t) = \epsilon t^2$, then the objective function decreases by only $\epsilon \, |\nabla f(x_k)^T d_k|^2 / \|d_k\|^2$ at every iteration. This yields a linear convergence rate with a convergence ratio near one for $\epsilon$ near zero. For Zangwill, the point obtained will be on the boundary of the closed mapping.
Since g = f(x) is required only to be monotonically decreasing, again very slow
convergence is obtained. Thus, we see that the above two global convergence
proofs bound the algorithm in question by another algorithm that converges,
however, very slowly. This results in a proof of global convergence but yields
little insight into algorithm performance at points far from the solution. These
global convergence proofs do, however, say something about the reliability of
the algorithm. They guarantee that the algorithm is reliable in the sense that
it will eventually converge to a local minimum.


4. Relaxing the Conditions for Convergence

In the previous chapter, we saw that global convergence theorems do not provide much insight into algorithm performance. Further, because rate analysis is definitionally asymptotic, it is not useful for analyzing the performance of algorithms at points far from the solution. The large regions of acceptance, W and Z, illustrated in Figures 3.1 and 3.2, respectively, and the extremely slow convergence of an algorithm limited to the boundary of these regions, indicate that almost any efficient algorithm would fit into the framework provided in the proofs. However, in this chapter we examine evidence that relaxing these convergence conditions can actually lead to algorithms exhibiting faster convergence rates. This evidence is divided into three sections that cover relaxing the sufficient decrease condition, relaxing the bounded away from orthogonality condition, and relaxing both the sufficient decrease and bounded away conditions.
4.1 Relaxing the Sufficient Decrease Condition

In the mid 1980s, Grippo et al. began a series of papers proposing the advantages of using a nonmonotone line-search method to speed up the convergence rate. The first paper appeared in 1986 [53], the next in 1989 [54], and another in 1990 [55]. This work was culminated in 1991 with a paper covering an entire class of non-monotone methods [56]. These non-monotone methods were summarized, tested and improved upon by Toint, confirming the improved performance of the proposed algorithms [115, 116, 119]. Panier and Tits [91], along with Bonnans et al. [8], immediately saw the advantages of these techniques for constrained optimization problems and applied the nonmonotone line-search idea to help alleviate the Maratos effect in their program FSQP. The Maratos effect arises in constrained optimization, where a reduction in a merit function is often enforced rather than using the objective function because the objective function does not account for the presence of the constraints. Often, the merit function used will not allow a full Newton step to be taken, even at points very close to the solution. This effect, called the Maratos effect [76], prevents Newton's method from obtaining the asymptotic quadratic convergence rate.

More recently, nonmonotone methods have been applied to a wide variety of algorithms and problems [26, 34, 36, 35, 117, 118, 120, 126, 130]. For other nonmonotone methods see [4] and [74].
These nonmonotone algorithms relax the sufficient decrease condition by replacing it with the weaker condition

$\max_{k-p \le i \le k} f(x_i) - f(x_{k+1}) \ge \sigma\left( \frac{|\nabla f(x_k)^T d_k|}{\|d_k\|} \right)$ ,   (4.1)

where p is on the order of 10 or less (the optimization program FSQP currently uses p = 3 [8, 91, 129]). While this condition allows for the relaxing of the sufficient decrease condition, resulting in faster convergence in many cases, it does not explain why relaxing this condition helps. It also does not give a method or algorithm for determining when to relax the sufficient decrease condition and when relaxing the sufficient decrease condition is not a good idea.
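As an illustrative aside (a minimal Python sketch of an acceptance test in the spirit of condition (4.1), not code from any of the cited papers; the forcing function is an assumption), the nonmonotone test compares the trial value against the worst of the recent values rather than $f(x_k)$ alone:

    import numpy as np

    def accepts_nonmonotone(f_new, recent_f, g_k, d_k,
                            sigma=lambda t: 1e-4 * t**2):
        """Nonmonotone test in the spirit of (4.1): recent_f holds the
        last p+1 objective values f(x_{k-p}), ..., f(x_k)."""
        t = abs(g_k @ d_k) / np.linalg.norm(d_k)
        return max(recent_f) - f_new >= sigma(t)

The window recent_f would typically be maintained as, e.g., collections.deque(maxlen=p + 1).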
4.2 Relaxing the Bounded Away Condition

The bounded away from orthogonality condition can be quite restrictive if it is explicitly enforced. For example, consider Newton's method applied to a quadratic problem, where Newton's method converges in just one iteration. However, for a given constant, c, there exist quadratic problems with an eigenvalue small enough that the full Newton step will violate the bounded away from orthogonality condition. If the Newton step is modified to satisfy the bounded away from orthogonality condition, this will destroy the Newton algorithm's performance on this quadratic problem.

Allowing an unmodified Newton step to be taken has proved to be very computationally efficient in practice. Even though convergence has not been proven for such an algorithm, the inclusion of the bounded away condition has been found to be quite detrimental to the efficient solution of real problems. Hence, relaxing the bounded away from orthogonality condition could, in many cases, improve algorithm performance.
Trust region methods relax the bounded away from orthogonality condition. Rather than checking the step direction to ensure it is a descent direction, trust region methods check how well the model fits the function. At the end of each iteration, the relative function reduction is computed:

$\rho_k = \frac{f(x_{k+1}) - f(x_k)}{m(x_{k+1}) - m(x_k)}$ .   (4.2)

If $0.75 < \rho_k$, the model fit is considered to be good and the trust region size can be increased. If $\rho_k < 0.25$, the model fit is not good, and the trust region size is reduced. As the trust region size goes to zero, the step direction approaches the negative gradient direction, so that a sufficient decrease will eventually be obtained, ensuring the (theoretical) convergence of the algorithm [81]. In the region where Newton's method is quadratically convergent, the model fit is good, so the trust region size increases, and the model subproblem allows full Newton steps to be taken. Thus, as long as the model is a good approximation to f(x) near $x_k$, the step direction need not be a descent direction. However, it is possible to show that the step to the minimizer of the trust region problem is always a descent direction. Thus, explicit inclusion of the bounded away from orthogonality condition is not needed for trust region algorithms.
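As an illustrative aside (a minimal Python sketch; the 0.25 and 0.75 thresholds are from the text, while the contraction and expansion factors of 0.5 and 2.0 are assumptions the thesis does not specify), the radius update reads:

    def update_radius(rho_k, delta_k, shrink=0.5, grow=2.0):
        """Trust-region size update driven by the fit ratio rho_k of (4.2)."""
        if rho_k < 0.25:        # poor model fit: contract the region
            return shrink * delta_k
        if rho_k > 0.75:        # good fit: allow longer (e.g., full Newton) steps
            return grow * delta_k
        return delta_k          # adequate fit: keep the current size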
However, trust region algorithms could also benefit from relaxing the sufficient decrease condition. Figure 4.1 illustrates one step of a trust region algorithm taken from two different starting points using Rosenbrock's function [102]. One point, $\bar{x}$, is near the valley floor while the other point, $\hat{x}$, is farther off the valley floor, yet both points are in close proximity to each other. The loci of acceptable points for a trust region algorithm initiating from both points are also illustrated in the figure. Note that the point near the valley floor, $\bar{x}$, has a shorter maximum step length than that obtained for the other point. It would be good if points that are in such close proximity could have the same trust region size. However, if the algorithm were allowed a longer step length from $\bar{x}$, then the objective function would have to increase. This illustrates the utility of relaxing the sufficient decrease condition for trust region algorithms in certain cases. Some authors have applied the above nonmonotone ideas to trust region algorithms with some success [26, 117, 118, 120, 126].


Figure 4.1. Trust-region algorithm for Rosenbrock's function.
4.3 Relaxing Both Conditions

The above discussion indicates that relaxing both the bounded away and sufficient decrease conditions could be helpful. Here we discuss work that has been done to relax both conditions.

The watchdog method due to Chamberlain et al. [14] is an attempt to avoid the Maratos effect in constrained optimization. The basic idea behind the watchdog method is to take several Newton steps without checking any conditions until after the last step. If the merit function has sufficiently decreased, then the Newton steps will be accepted. If not, the Newton steps are discarded, the algorithm goes back to the point where the Newton steps were started, and the step size is reduced. Thus, this method relaxes both conditions since no conditions are checked during the Newton steps. This method can be quite effective. It can also increase the computer time required, and there is the risk of taking a bad step, exponential overflow or other difficulties. This idea has been incorporated by Grippo et al. [56] in their class of non-monotone methods.
Solodov [75, 111] weakens both conditions by picking a direction $d_k$ using

$-\nabla f(x_k)^T d_k \ge \sigma(\|\nabla f(x_k)\|) - \lambda_k$ ,   (4.3)

where $\lambda_k \ge 0$, and requiring

$f(x_k) - f(x_{k+1}) \ge -\eta_k \nabla f(x_k)^T d_k - \nu_k$ ,   (4.4)

with $\eta_k > 0$ and $\nu_k \ge 0$. He proves convergence when the following conditions hold:

$\sum_k \eta_k = \infty , \quad \sum_k \lambda_k \eta_k < \infty , \quad \sum_k \nu_k < \infty$ .   (4.5)

These conditions relax both the original bounded away from orthogonality and sufficient decrease conditions. Solodov was able to show that some of the algorithms used to train neural networks fit within this structure.


5. A New Approach to Algorithm Convergence

In this section we examine the two convergence conditions more closely to determine when relaxing each condition helps speed convergence and when it does not. The results are then combined to yield a new approach to algorithm convergence.

5.1 Examining the Bounded Away Condition

The bounded away from orthogonality condition becomes a problem when the search direction, denoted by $d_k = d(x_k)$, is nearly orthogonal to the negative gradient direction. The objective function will not decrease very fast along such a direction, and will likely increase within a short distance of $x_k$. This suggests minimizing an alternate function, h(x), with the following properties.

(1) $\nabla h(x) = -d(x)$ so that $\nabla h(x_k) = -d_k$.
(2) $x^*$ minimizes h(x).
(3) h(x) has no local minimum other than $x^*$.

A function that satisfies the first condition is called a gradient mapping [89].


Definition 5.1 A mapping $d(x) : D \subset \mathbb{R}^m \to \mathbb{R}^m$ is a gradient mapping on a subset $D_0 \subset D$ if there exists a G-differentiable function $g : D_0 \subset \mathbb{R}^m \to \mathbb{R}$ such that $d(x) = \nabla g(x)$ for all $x \in D_0$.

Such a function, if it exists, would solve the bounded away from orthogonality problem because the desired search direction would be the negative gradient direction of the new function being minimized. The nonmonotone idea of allowing f(x) to increase also becomes less of an issue because h(x) decreases fastest along its negative gradient direction.

Ortega and Rheinboldt state the following theorem [89].

Theorem 5.2 Let $d(x) : D \subset \mathbb{R}^m \to \mathbb{R}^m$ be continuously differentiable on an open convex set $D_0 \subset D$. Then, d(x) is a gradient mapping on $D_0$ if, and only if, $\nabla d(x)$ is symmetric for all $x \in D_0$.

Thus, the desired function exists if, and only if, $\nabla d(x)$ is symmetric. This condition obviously holds for $d_k = -\nabla f(x_k)$. But this condition is very restrictive, so that there are very few search directions that satisfy it.

However, there are functions that relax the first condition while satisfying the other two conditions. Consider the function

$h(x) = \frac{1}{2} \| \nabla^2 f(x)^{-1} \nabla f(x) \|^2$ ,   (5.1)


where the norm used is the Euclidean norm. The gradient of this function is

$\nabla h(x) = \left[ \nabla^2 f(x)^{-1} \nabla^2 f(x) + \sum_{i=1}^{m} \nabla f(x)_i \, \nabla\left( \nabla^2 f(x)^{-1}_i \right) \right]^T \nabla^2 f(x)^{-1} \nabla f(x)$ ,   (5.2)

where $\nabla^2 f(x)^{-1}_i$ is the ith column of $\nabla^2 f(x)^{-1}$. At $x_k$ we have

$\nabla h(x_k) = -d_k + \left[ \sum_{i=1}^{m} \nabla f(x_k)_i \, \nabla\left( \nabla^2 f(x_k)^{-1}_i \right) \right]^T \nabla^2 f(x_k)^{-1} \nabla f(x_k)$ .   (5.3)

The second, higher order term is the error obtained when using this h(x) to approximate the first condition. Minimizing this h(x) is like minimizing the square of an estimate of the distance to the minimum. Hence, we will call all such functions minimum distance functions.

This particular minimum distance function has several problems. The derivatives of h(x) needed for a steepest descent or Newton algorithm are difficult to obtain. The gradient and Hessian of the objective function, f(x), are needed at every point in a line search. This function could also find a maximum rather than a minimum. To alleviate some of these problems, one could minimize a sequence of functions

$h_k(x) = \frac{1}{2} \| \nabla^2 f(x_k)^{-1} \nabla f(x) \|^2$ .   (5.4)


The gradient of this function is

$\nabla h_k(x) = \left[ \nabla^2 f(x_k)^{-1} \nabla^2 f(x) \right]^T \nabla^2 f(x_k)^{-1} \nabla f(x)$ .   (5.5)

At $x_k$ we now obtain $\nabla h_k(x_k) = -d_k$ as desired. However, we must ensure that minimizing a sequence of such functions will converge, and that it will converge to $x^*$ and not some other point. The idea behind this definition is to have a trust region for how far the current Hessian approximates the eigenvectors and eigenvalues of the problem. Here again, the gradient of the function is needed at each point during a line search.

Now, we extend this idea to least-squares problems where the Gauss-Newton algorithm can be used to take advantage of the special structure of the problem (see page 7). This yields the equation

$h(x) = \frac{1}{2} \| J(x_k)^+ R(x) \|^2$ .   (5.6)

In this case, when f(x) is evaluated, R(x) is also obtained because $f(x) = \frac{1}{2} R(x)^T R(x)$. Thus, there is no need to evaluate the gradient at each point in the line search. Hence, this minimum distance function could be useful for relaxing the bounded away from orthogonality condition.
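As an illustrative aside (a minimal Python sketch under assumed names), the practical advantage of (5.6) is that $J(x_k)^+$ can be factored once per outer iteration, after which each trial point in the line search costs only a residual evaluation:

    import numpy as np

    def make_h_k(J_k, R):
        """Given the Jacobian J_k at the current iterate, return the
        minimum-distance function h_k(x) = 0.5 * ||J_k^+ R(x)||^2 of (5.6)."""
        Jk_pinv = np.linalg.pinv(J_k)      # factored once per outer iteration
        def h_k(x):
            return 0.5 * np.linalg.norm(Jk_pinv @ R(x)) ** 2
        return h_k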
5.2 Examining the Sufficient Decrease Condition

In this section we address the question: when is allowing f(x) to increase a good idea? The goal is to obtain an algorithm that can automatically determine the best point along the search direction.
An obvious case where increasing f(x) along the search direction is beneficial is illustrated in Figure 5.1 using a Rosenbrock type function. Two possible next iterates, $x_{k+1}$ and $\hat{x}_{k+1}$, are contrasted in the figure. The point $x_{k+1}$ is obtained by minimizing the objective function along the typical Newton or Gauss-Newton search direction. The point $\hat{x}_{k+1}$ is some arbitrary point farther along the search direction, where $f(\hat{x}_{k+1}) > f(x_k)$.
Figure 5.1. Example of when increasing f(x) is beneficial.


Note that $\hat{x}_{k+1}$ is not necessarily the point obtained by taking a full Newton or Gauss-Newton step. This point could be much farther out due to near rank deficiency in the Hessian and may cause an exponential overflow or other numerical problems.

The idea illustrated in Figure 5.1 is that it is easier to minimize f(x) along eigenvector directions with a large associated eigenvalue ($e_1$ in the figure) than it is to minimize f(x) along eigenvector directions with a very small associated eigenvalue ($e_2$ in the figure) [74]. The objective function is allowed to increase in order to make progress along the eigenvector directions with small associated eigenvalues. Thus, instead of minimizing f(x), a new function, $h_k(x)$, is minimized that weights the eigenvector directions with small associated eigenvalues more than the eigenvector directions with large associated eigenvalues. Here we include the subscript k to indicate that the eigenvectors and eigenvalues are obtained at $x_k$.
Let us design a new function for nonlinear least-squares objective functions that has this property. Let $J(x_k) = U S V^T$ be the singular value decomposition of the current Jacobian matrix. Also, assume that J is of full column rank. Next, we rotate the residual vector by forming $U^T R(x)$. We now define our new function to be

$h_k(x) = \frac{1}{2} \| D U^T R(x) \|^2$ ,   (5.7)

where D is a diagonal matrix of scale factors that we shall use to scale the eigenvector directions. Note that if D = I, then $h_k(x) = f(x)$. We now pick the diagonal elements of D, denoted here by $d_i$, so that they give a small weight to the large $s_i$ and a large weight to the small $s_i$. One easy way to do this is to set $d_i = 1/s_i$. This yields exactly the same minimum distance function obtained above for nonlinear least-squares problems because

$\frac{1}{2} \| D U^T R(x) \|^2 = \frac{1}{2} \| V S^+ U^T R(x) \|^2 = \frac{1}{2} \| J(x_k)^+ R(x) \|^2$ .   (5.8)
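As an illustrative aside (a quick numeric check, not from the thesis), the identity (5.8) can be verified directly:

    import numpy as np

    rng = np.random.default_rng(0)
    J = rng.standard_normal((5, 3))   # example Jacobian, full column rank
    R = rng.standard_normal(5)        # example residual vector

    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    D = np.diag(1.0 / s)              # d_i = 1/s_i: heavy weight on small s_i
    lhs = 0.5 * np.linalg.norm(D @ U.T @ R) ** 2
    rhs = 0.5 * np.linalg.norm(np.linalg.pinv(J) @ R) ** 2
    print(np.isclose(lhs, rhs))       # verifies the identity (5.8)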
Now, consider another case, illustrated in Figure 5.2 by a plot of arctan(x), which is an example of when it is not beneficial to increase f(x). Assume that the objective function is $f(x) = \frac{1}{2} \arctan(x)^2$, with minimum at the origin. If Newton's method is started at some point far from the origin, it will take increasingly larger steps, zig-zagging across the origin and yielding ever increasing objective function values until it eventually diverges to $\infty$. If this example is minimized using a minimum distance function with constant eigenvectors and eigenvalues, the minimum will still be obtained. Since $x \in \mathbb{R}^1$, $J(x_k)^+ = 1/s_1$ is a constant and $h_k(x) = \frac{1}{2} \| r_1(x)/s_1 \|^2$. Thus, $h_k(x)$ is just the objective function modified by a constant factor, so that the minimum is not affected.


Figure 5.2. Example of when increasing f(x) is not beneficial.
5.3 Combining the Results

The above results of examining the two convergence criteria have both, remarkably, suggested minimizing a minimum distance function along the search direction. It is instructive to examine what this idea does to a quadratic function, where $L_f(x_k)$ is an ellipse. In this case, $L_h(x_k)$ will be a circle, as shown in Figure 5.3, because Newton solves a quadratic function exactly. For the sufficient decrease condition, the new point $x_{k+1}$ must lie inside $L_f(x_k)$, regions 1 and 3 in the figure. If we require h(x) to decrease along the line-search direction, then the new point $x_{k+1}$ must lie inside $L_h(x_k)$, regions 1 and 2 in the figure. Region 1 is ideal because both the objective function and the Newton step length are decreasing. Region 2 is the region we seek. Points in region 2 have $f(x_{k+1}) > f(x_k)$, but the Newton step length is decreasing, so convergence can still be obtained. Points where $f(x_{k+1}) > f(x_k)$ that are outside of region 2 are not acceptable because the Newton step length is increasing. If the algorithm accepts points in this region, convergence may not be obtained. Region 3 is very interesting. Points in this region have a lower objective function value but an increased distance to the minimum. Requiring $h_k(x)$ to decrease along the search direction will exclude these points from consideration.


Figure 5.3. Regions defined by the minimum-distance function.
If $h_k(x)$ is a good approximation for the distance to the minimum, minimizing $h_k(x)$ makes sense. It allows each step of the algorithm to make the greatest possible progress toward the minimum and is, at least in this sense, optimal.


6. A Minimum-Distance Function Algorithm

In this section we consider minimizing a sequence of functions

$h_k(x) = \frac{1}{2} \| J(x_k)^+ R(x) \|^2$ ,   (6.1)

to take advantage of the special structure (see page 7). We often denote $f(x_k)$ by $f_k$, $R(x_k)$ by $R_k$ and $J(x_k)$ by $J_k$ to simplify the notation when no confusion will result.

These functions have several nice properties. For example, consider the gradient

$\nabla h_k(x) = \left( J(x_k)^+ J(x) \right)^T J(x_k)^+ R(x)$   (6.2)

and Hessian

$\nabla^2 h_k(x) = \left( J(x_k)^+ J(x) \right)^T \left( J(x_k)^+ J(x) \right) + \sum_{i=1}^{n} y_i(x) \nabla^2 r_i(x)$ ,   (6.3)

where $y_i(x)$ is the ith element of $Y_k(x)$ defined by:

$Y_k(x) = (J(x_k)^+)^T J(x_k)^+ R(x)$ .   (6.4)

Here we see that $Y_k(x_k) \to 0$ as $x_k \to x^*$ if $J(x)^+$ is continuous and bounded in a neighborhood of $x^*$. It would be nice if this led to superlinear convergence


for the associated Gauss-Newton method. However, such is not the case. The negative gradient at $x_k$ is

$-\nabla h_k(x_k) = -\left( J(x_k)^+ J(x_k) \right)^T J(x_k)^+ R(x_k) = -J(x_k)^+ R(x_k)$ .   (6.5)

Ignoring the higher order terms in the Hessian yields the Gauss-Newton step

$\Delta x_k = -\left[ (J_k^+ J_k)^T (J_k^+ J_k) \right]^+ (J_k^+ J_k)^T J_k^+ R_k$   (6.6)

$= -J_k^+ R_k$ .   (6.7)

Thus, the negative gradient direction and the Gauss-Newton direction for $h_k(x)$ coincide. These directions coincide because the first term in the Hessian is $R_J^T R_J = R_J$, an orthogonal projection matrix that projects onto the row space of $J(x_k)$. This gives the Hessian of $h_k(x)$ a much better condition number than the Hessian of f(x) when the higher order terms, S(x) on page 7, are small compared to the first order terms. In this case, $\nabla^2 h_k(x) \approx R_J$ while $\nabla^2 f(x) \approx J(x)^T J(x)$.
Another nice feature of $h_k(x)$ can be seen when considering the case where J(x) is square, of full rank, and the residual at the solution is zero. This type of problem is often encountered in the solution of simultaneous non-linear equations. Here, we can rescale the residual with a non-singular scaling matrix, D, as $\tilde{R} = D R$. At the minimum, R = 0 so $\tilde{R} = 0$ as well, and $R \ne 0$ implies that $\tilde{R} \ne 0$. Thus, the two problems are equivalent. Now we have $\tilde{J} = D J$ and, since both D and J are square matrices of full rank, $\tilde{J}^+ = J^{-1} D^{-1}$ [99]. This leads to

$\tilde{J}^+ \tilde{R} = J^{-1} D^{-1} D R = J^+ R$ .   (6.8)

Hence, the $h_k(x)$ functions are scale invariant with respect to scale changes in R for this special case. However, the result for scale changes in x is different. Let

$\tilde{x} = D x$ ,   (6.9)

$\tilde{J} = J D^{-1}$ ,   (6.10)

$\tilde{J}^+ = D J^+$ .   (6.11)

Then,

$h_k(\tilde{x}) = \frac{1}{2} \| \tilde{J}^+ \tilde{R} \|^2$   (6.12)

$= \frac{1}{2} \| D J^+ \tilde{R} \|^2$   (6.13)

$= \frac{1}{2} \| D J^+ R \|^2$   (6.14)

$\ne \frac{1}{2} \| J^+ R \|^2$ .   (6.15)

Thus, the $h_k(x)$ functions are not scale invariant with respect to scale changes in x.
There are two problems with these functions that must be solved before global convergence can be proved. First, $h_k(x)$ does not always decrease as $x \to x^*$. Some problems have a singular value that goes to zero as $x \to x^*$, causing $h_k(x) \to \infty$. The second problem can be seen with the rearrangement

$h_k(x) = \frac{1}{2} R(x)^T (J(x_k)^+)^T J(x_k)^+ R(x)$ .   (6.16)

This makes $h_k(x)$ look like the square of an alternate norm induced by the matrix

$A_k = (J(x_k)^+)^T J(x_k)^+$ .   (6.17)

This definition of $A_k$ is not of full rank; hence, this is not a norm. Herein lies the difficulty: components of R(x) that are orthogonal to the column space of J are not considered. An algorithm that minimizes this function could not prevent these components from increasing without bound. Hence, these orthogonal directions must also be considered in order to obtain global convergence.

Both of these problems are overcome by defining $A_k$ as

$A_k = (1 - \lambda_k)(J_k^+)^T J_k^+ + \lambda_k I , \quad 0 < \lambda_k \le 1$ .   (6.18)

Then define

$h_k(x) = \frac{1}{2} R(x)^T A_k R(x) = \frac{1}{2} \| R(x) \|_{A_k}^2$ .   (6.19)

The parameter $\lambda_k$ interpolates between a minimum distance function, obtained at $\lambda_k = 0$, and the objective function, obtained at $\lambda_k = 1$. Changing $\lambda_k$ allows us to adjust the amount of allowable function increase during the line search (Figure 6.1).


Figure 6.1. How $\lambda$ affects the minimum-distance function.
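As an illustrative aside (a minimal Python sketch), the interpolated function (6.18)-(6.19) can be evaluated without forming $A_k$ explicitly, since $R^T A_k R = (1-\lambda_k)\|J_k^+ R\|^2 + \lambda_k \|R\|^2$:

    import numpy as np

    def h_k(R_x, Jk_pinv, lam):
        """Evaluate (6.19): h_k = 0.5*(1-lam)*||J_k^+ R||^2 + 0.5*lam*||R||^2."""
        w = Jk_pinv @ R_x
        return 0.5 * ((1.0 - lam) * (w @ w) + lam * (R_x @ R_x))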
It is also instructive to compare this with the nonmonotone method of Grippo et al. [53]. Suppose $\max_{k-p \le i \le k} f(x_i)$ is larger than the maximum function value obtainable in $L_h(x_k)$. Then, the regions of acceptable next iterates for a quadratic objective function are as shown in Figure 6.2. The minimum distance function will reject points in regions 3 and 4 while the nonmonotone method of Grippo et al. accepts such points. In these regions, $\|x - x^*\|$ is increasing, so that the algorithm is diverging. The nonmonotone algorithm obtains convergence when $\max_{k-p \le i \le k} f(x_i)$ decreases, while the minimum distance function obtains convergence by decreasing either $\|J_k^+ R(x)\|$ or the objective function.
Figure 6.2. Comparison of nonmonotone and minimum distance regions.
In the next section, we consider minimizing a sequence of these $h_k(x)$ functions using a line search method. However, we first present two lemmas.

Lemma 6.1 For the Gauss-Newton direction, the above definition for $h_k(x)$ relaxes the bounded away from orthogonality condition as $\lambda_k$ decreases from one. That is, the parameter $\lambda_k$ decreases the minimum allowable angle between the negative gradient and the Gauss-Newton direction as $\lambda_k$ decreases to zero.

Proof: Using the Gauss-Newton direction, $d_k = -J_k^+ R_k$,

$\cos(\theta) = \frac{R_k^T A_k J_k J_k^+ R_k}{\| J_k^T A_k R_k \| \, \| J_k^+ R_k \|}$ .   (6.20)

Noting that

$A_k \to (J_k^+)^T J_k^+$   (6.21)

as $\lambda_k$ approaches zero, we obtain the limit

$\cos(\theta) = \frac{\| J_k^+ R_k \|^2}{\| J_k^+ R_k \|^2} = 1$ ,   (6.22)

while, for $\lambda_k = 1$, we get the $\cos(\theta)$ obtained for the Gauss-Newton method.
Lemma 6.2 The above definition for $h_k(x)$ relaxes the sufficient decrease condition as $\lambda_k$ decreases from one if a decrease in $\frac{1}{2} \| J_k^+ R_{k+1} \|^2$ is obtained.

Proof: Take the sufficient decrease condition for f(x) and multiply by $\lambda_k$ to obtain

$\frac{\lambda_k}{2} \| R_{k+1} \|^2 \le \frac{\lambda_k}{2} \| R_k \|^2 + t \, \lambda_k R_k^T J_k d_k$ .   (6.23)

Also consider

$\frac{1-\lambda_k}{2} \| J_k^+ R_{k+1} \|^2 \le \frac{1-\lambda_k}{2} \| J_k^+ R_k \|^2 + t (1-\lambda_k) \left( J_k^T (J_k^+)^T J_k^+ R_k \right)^T d_k$ .   (6.24)

The sum of equations (6.23) and (6.24) gives the sufficient decrease condition for $h_k(x)$. If $\lambda_k = 1$, the sufficient decrease condition for f(x) is obtained. However, if a reduction in $\frac{1}{2} \| J_k^+ R_{k+1} \|^2$ is obtained, this term can compensate for an increase in f(x). Hence, the standard sufficient decrease condition is relaxed if a decrease in $\frac{1}{2} \| J_k^+ R_{k+1} \|^2$ is obtained. That an increase in f(x) does happen in some cases is illustrated in the next section using Rosenbrock's function.

These lemmas show that the features the minimum distance functions had of relaxing the bounded away and sufficient decrease conditions are retained by this modification.
6.1 Line Search Algorithm

We now discuss the algorithm in detail and present the global convergence proof. The algorithm presented here uses the steepest descent search direction. While this is normally not a good choice because of the slow linear rate of convergence often observed, it is quite satisfactory in this case because the steepest descent direction for $h_k(x)$ is close to the Gauss-Newton direction when $\lambda_k$ is close to zero. Obviously, the algorithm could be improved by considering other search directions.


Algorithm 6.3 The minimum distance line-search algorithm is:

Compute $R(x_1)$, $f(x_1)$, $J(x_1)$, $J^+(x_1)$;
Set $q_1 > f(x_1)$;
Set $0 < \epsilon < 1$.

(1) IF converged THEN quit.

(2) Pick $\lambda_k$ so that

$\frac{R_k^T (J_k^+)^T J_k^+ R_k}{2(q_k - f_k) + R_k^T (J_k^+)^T J_k^+ R_k} \le \lambda_k \le 1$ ,

$A_k = (1 - \lambda_k)(J_k^+)^T J_k^+ + \lambda_k I$ .

(3) Obtain the search direction:

$d_k = -\frac{J_k^T A_k R_k}{\| J_k^T A_k R_k \|}$ .

(4) Pick $\alpha_k$ to minimize $h_k(x_k + \alpha_k d_k)$, then set $x_{k+1} = x_k + \alpha_k d_k$. This line-search is designed to ensure that the following sufficient decrease condition is satisfied:

$h_k(x_{k+1}) \le h_k(x_k) - \sigma(|\nabla h_k(x_k)^T d_k|)$ .

(5) Pick $q_{k+1}$ so that:

$f(x_{k+1}) \le q_{k+1} \le q_k + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ .

(6) Compute $J(x_{k+1})$ and increment k.

(7) GOTO step 1.
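As an illustrative aside (a minimal Python sketch of the flow of Algorithm 6.3, not the FORTRAN 77 implementation of Chapter 7; the convergence test, the initial choice $q_1 > f(x_1)$, and the coarse-sampling line search are simplifying assumptions):

    import numpy as np

    def min_distance_search(R, J, x, q=None, eps=1e-4, max_iter=100, tol=1e-8):
        """Sketch of Algorithm 6.3; the line search samples alpha values
        rather than minimizing h_k exactly."""
        Rk, Jk = R(x), J(x)
        q = 0.5 * Rk @ Rk + 1.0 if q is None else q        # q_1 > f(x_1)
        for _ in range(max_iter):
            Jp = np.linalg.pinv(Jk)
            fk = 0.5 * Rk @ Rk
            w = Jp @ Rk                                    # J_k^+ R_k
            num = w @ w                                    # R_k^T (J_k^+)^T J_k^+ R_k
            if num < tol:                                  # step (1): terminate
                break
            lam = num / (2.0 * (q - fk) + num)             # step (2): smallest allowed lambda_k

            def h(z):                                      # h_k of (6.19)
                Rz = R(z)
                u = Jp @ Rz
                return 0.5 * ((1 - lam) * (u @ u) + lam * (Rz @ Rz))

            g = Jk.T @ ((1 - lam) * (Jp.T @ w) + lam * Rk) # gradient of h_k at x_k
            d = -g / np.linalg.norm(g)                     # step (3): unit steepest descent
            h_old = h(x)
            alpha = min(np.logspace(-6, 1, 40), key=lambda a: h(x + a * d))
            x = x + alpha * d                              # step (4)
            q = q + eps * (h(x) - h_old)                   # step (5): largest allowed q_{k+1}
            Rk, Jk = R(x), J(x)                            # step (6)
        return x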
Note that if $\lambda_k = 1$ at any iteration, our equation for choosing $q_{k+1}$ leads to the possibility that $\lambda_k < 1$ for the next iteration. Also, if $\lambda_k$ is close to zero, but $R_k^T (J_k^+)^T J_k^+ R_k$ suddenly becomes large, our equation for choosing $\lambda_k$ yields a value for $\lambda_k$ closer to one. Thus, the algorithm can dynamically adjust $\lambda_k$ as needed to switch between the influence of curvature and descent. When $\lambda_k$ is small, the objective function can increase to allow for the curvature of the problem. When $\lambda_k$ is close to one, the steepest descent direction for f is searched, emphasizing descent.

We now show that the algorithm is well defined. The sufficient decrease condition for the line search in step 4 can be satisfied by many standard line search algorithms [89]. We also need to show that our choice for $\lambda_k$ keeps $f(x_{k+1})$ small enough that a new value for $q_{k+1}$ can be obtained from the equation

$f(x_{k+1}) \le q_{k+1} \le q_k + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.25)
Theorem 6.4 Suppose $\lambda_k$ is chosen to satisfy

$\frac{R_k^T (J_k^+)^T J_k^+ R_k}{2(q_k - f_k) + R_k^T (J_k^+)^T J_k^+ R_k} \le \lambda_k \le 1$   (6.26)

and the sufficient decrease condition

$h_k(x_{k+1}) \le h_k(x_k) - \sigma(|\nabla h_k(x_k)^T d_k|)$   (6.27)

is satisfied by the line search. Then

$f(x_{k+1}) \le q_k + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.28)

Proof: Starting from the sufficient decrease condition

$h_k(x_{k+1}) \le h_k(x_k) - \sigma(|\nabla h_k(x_k)^T d_k|)$   (6.29)

and $0 < \epsilon < 1$, we have

$h_k(x_{k+1}) \le h_k(x_k) + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.30)

Using the definition of $A_k$

$A_k = (1 - \lambda_k)(J_k^+)^T J_k^+ + \lambda_k I$   (6.31)

and the definition of $h_k(x_k)$

$h_k(x_k) = \frac{1}{2} R_k^T A_k R_k$ ,   (6.32)

we have

$\frac{1}{2} R_{k+1}^T \left( (1-\lambda_k)(J_k^+)^T J_k^+ + \lambda_k I \right) R_{k+1} \le h_k(x_k) + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.33)

Expanding, and noting the definition for $f(x_{k+1})$, we obtain

$\lambda_k f(x_{k+1}) \le h_k(x_k) - \frac{1-\lambda_k}{2} R_{k+1}^T (J_k^+)^T J_k^+ R_{k+1} + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.34)

Again, expanding $A_k$ in $h_k(x_k)$, we get

$\lambda_k f(x_{k+1}) \le \frac{1}{2} R_k^T \left( (1-\lambda_k)(J_k^+)^T J_k^+ + \lambda_k I \right) R_k - \frac{1-\lambda_k}{2} R_{k+1}^T (J_k^+)^T J_k^+ R_{k+1} + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.35)

Rearranging and dividing by $\lambda_k$ gives

$f(x_{k+1}) \le f(x_k) + \frac{1-\lambda_k}{2\lambda_k} R_k^T (J_k^+)^T J_k^+ R_k - \frac{1-\lambda_k}{2\lambda_k} R_{k+1}^T (J_k^+)^T J_k^+ R_{k+1} + \frac{\epsilon}{\lambda_k} [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.36)

Here we are justified in dividing by $\lambda_k$ since

$0 < \frac{R_k^T (J_k^+)^T J_k^+ R_k}{2(q_k - f_k) + R_k^T (J_k^+)^T J_k^+ R_k} \le \lambda_k$ ,   (6.37)

because $q_k > f_k$ was enforced during the previous iteration of the algorithm, and if $R_k^T (J_k^+)^T J_k^+ R_k \ge 0$ ever becomes zero, the algorithm terminates. Rearranging the inequality for $\lambda_k$ gives

$R_k^T (J_k^+)^T J_k^+ R_k \le 2 \lambda_k (q_k - f_k) + \lambda_k R_k^T (J_k^+)^T J_k^+ R_k$ ,   (6.38)

or

$f(x_k) + \frac{1-\lambda_k}{2\lambda_k} R_k^T (J_k^+)^T J_k^+ R_k \le q_k$ .   (6.39)

Observing that

$\frac{1-\lambda_k}{2\lambda_k} R_{k+1}^T (J_k^+)^T J_k^+ R_{k+1} \ge 0$ ,   (6.40)

we obtain

$f(x_k) + \frac{1-\lambda_k}{2\lambda_k} R_k^T (J_k^+)^T J_k^+ R_k - \frac{1-\lambda_k}{2\lambda_k} R_{k+1}^T (J_k^+)^T J_k^+ R_{k+1} \le q_k$ .   (6.41)

Inserting this relation into (6.36) we get

$f(x_{k+1}) \le q_k + \frac{\epsilon}{\lambda_k} [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.42)

Noting that $0 < \lambda_k \le 1$ with $[h_k(x_{k+1}) - h_k(x_k)] \le 0$, we obtain the final result

$f(x_{k+1}) \le q_k + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ ,   (6.43)

thus completing the proof.
Next, we need to show that minimizing $h_k(x)$ also minimizes f(x). The following theorem characterizes the stationary point convergence of the algorithm.

Theorem 6.5 Suppose $\nabla h_k(x_k) = 0$. Then $R(x_k)$ is orthogonal to the column space of $J(x_k)$.

Proof: We have

$\nabla h_k(x_k) = J_k^T A_k R_k$   (6.44)

$= J_k^T \left[ (1-\lambda_k)(J_k^+)^T J_k^+ + \lambda_k I \right] R_k$   (6.45)

$= (1-\lambda_k) J_k^T (J_k^+)^T J_k^+ R_k + \lambda_k J_k^T R_k$   (6.46)

$= (1-\lambda_k) \left[ J_k^+ J_k \right]^T J_k^+ R_k + \lambda_k J_k^T R_k$ .   (6.47)

Using Equation (2.14) (noting that $R_J = J_k^+ J_k$ is symmetric), we have

$\nabla h_k(x_k) = (1-\lambda_k) J(x_k)^+ J(x_k) J(x_k)^+ R(x_k) + \lambda_k J(x_k)^T R(x_k)$ ,   (6.48)

and from Equation (2.11) we obtain the simplified equation

$\nabla h_k(x_k) = (1-\lambda_k) J(x_k)^+ R(x_k) + \lambda_k J(x_k)^T R(x_k)$ .   (6.49)

Using the singular value decomposition for $J(x_k)$, we obtain

$\nabla h_k(x_k) = V \, \mathrm{diag}\left( \frac{1-\lambda_k}{s_i} + \lambda_k s_i \ \text{if} \ s_i > 0 , \ 0 \ \text{otherwise} \right) U^T R(x_k)$ .   (6.50)

Since $0 < \lambda_k \le 1$, if $\nabla h_k(x_k) = 0$ then $R(x_k)$ is orthogonal to the column space of $J(x_k)$.

The definition for $\lambda_k$ implies that $\lambda_k \to 1$ is a sufficient condition for convergence of the algorithm to a stationary point of f(x), because $h_k(x) \to f(x)$ as $\lambda_k \to 1$. However, Theorem 6.5 shows that this condition is not necessary. We now give our global convergence theorem.


Theorem 6.6 Suppose at every iteration, $\lambda_k$ is chosen using the equation

$\frac{R_k^T (J_k^+)^T J_k^+ R_k}{2(q_k - f_k) + R_k^T (J_k^+)^T J_k^+ R_k} \le \lambda_k \le 1$ ,   (6.51)

the sufficient decrease condition

$h_k(x_{k+1}) \le h_k(x_k) - \sigma(|\nabla h_k(x_k)^T d_k|)$   (6.52)

is satisfied by the line-search, and $q_k$ is chosen using

$f(x_{k+1}) \le q_{k+1} \le q_k + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$ .   (6.53)

Then $\{\nabla f(x_k)\} \to 0$ and any limit point of $\{x_k\}$ is a stationary point of f(x).

Proof: Since $\epsilon > 0$ and $\sigma(\cdot)$ is a forcing function, the sufficient decrease condition

$h_k(x_{k+1}) \le h_k(x_k) - \sigma(|\nabla h_k(x_k)^T d_k|)$   (6.54)

ensures that $\epsilon [h_k(x_{k+1}) - h_k(x_k)] \le 0$. Thus,

$q_{k+1} \le q_k + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$   (6.55)

ensures that $q_{k+1} \le q_k$ and the sequence $\{q_k\}$ is a decreasing sequence. Noting that $f(x_k) \le q_k$ and that $f(x) \ge 0$ for least-squares problems, we see that the $q_k$ are bounded below by 0. Thus, the sequence $\{q_k\}$ is a convergent sequence and the sequence $\{q_k - q_{k+1}\}$ converges to zero. Rearranging

$0 \le f(x_{k+1}) \le q_{k+1} \le q_k + \epsilon [h_k(x_{k+1}) - h_k(x_k)]$   (6.56)

we have

$0 \le -\epsilon [h_k(x_{k+1}) - h_k(x_k)] \le q_k - q_{k+1}$ .   (6.57)

Rearranging the sufficient decrease condition gives

$-[h_k(x_{k+1}) - h_k(x_k)] \ge \sigma(|\nabla h_k(x_k)^T d_k|)$ ,   (6.58)

so that

$0 \le \epsilon \, \sigma(|\nabla h_k(x_k)^T d_k|) \le -\epsilon [h_k(x_{k+1}) - h_k(x_k)] \le q_k - q_{k+1}$ .   (6.59)

Since $\epsilon > 0$ is a constant, this implies that $|\nabla h_k(x_k)^T d_k|$ approaches zero in the limit. We are using the steepest descent direction for $h_k$. Thus, the bounded away from orthogonality condition for our algorithm is

$|\nabla h_k(x_k)^T d_k| = \|\nabla h_k(x_k)\|$ ,   (6.60)

so that $\|\nabla h_k(x_k)\|$ goes to zero in the limit. This, along with Theorem 6.5, implies that $\{\nabla f(x_k)\} \to 0$, so that any limit point of $\{x_k\}$ is a stationary point of f(x).
The above theorem ensures that the algorithm converges to a point
where the gradient of f(x) is zero. Stronger results can be obtained using
additional assumptions on f(x). For example, if the minimum is unique, then
55


the sequence $\{x_k\}$ also converges and it converges to the minimum. See also
Ortega and Rheinboldt [89] for more details.
However, first order stationary point convergence does not preclude
the possibility of the algorithm converging to a local maximum of f(x). Grippo
et al. [53] proved that Newton's method will not converge to a local maximum.
A variation of their proof is given here that applies to more general situations,
including Algorithm 6.3.
Theorem 6.7 Suppose that the sequence $\{x_k\}$, generated using

$x_{k+1} = x_k + \alpha_k d_k , \qquad (6.61)$

is well defined and remains in a set $S$ on which the function $f : S \to \mathbb{R}$ has continuous second derivatives. Also, suppose that $\{x_k\}$ converges to a point $x^* \in S$ for which there exists an open sphere $B(x^*)$ with $y^T \nabla^2 f(x) y < 0$ for all $x \in B(x^*)$ and all $y \ne 0$. If $\alpha_k \nabla f(x_k)^T d_k < 0$ for all $k$, then $x^*$ cannot be a local maximum.
Proof: Let $\eta > 0$ be such that $\lambda_{\max}[\nabla^2 f(x)] \le -\eta$ for all $x \in B(x^*)$, where $\lambda_{\max}[H]$ is the maximum eigenvalue of $H$. Then, we can write:

$f(x_{k+1}) = f(x_k) + \nabla f(x_k)^T [x_{k+1} - x_k] + \tfrac{1}{2} [x_{k+1} - x_k]^T \nabla^2 f(x_k) [x_{k+1} - x_k] + r([x_{k+1} - x_k])$
$= f(x_k) + \alpha_k \nabla f(x_k)^T d_k + \tfrac{1}{2} [x_{k+1} - x_k]^T \nabla^2 f(x_k) [x_{k+1} - x_k] + r([x_{k+1} - x_k])$
$\le f(x_k) + \alpha_k \nabla f(x_k)^T d_k - \tfrac{1}{2} \eta \|x_{k+1} - x_k\|^2 + r([x_{k+1} - x_k])$

for all $k$, where $\lim_{\epsilon \to 0^+} r(\epsilon)/\epsilon^2 = 0$.

It follows from $\alpha_k \nabla f(x_k)^T d_k < 0$ that, for sufficiently large $k$, $f(x_{k+1}) < f(x_k)$, contradicting the assumption that $x^*$ is a local maximum.
The strict inequality in the condition $\alpha_k \nabla f(x_k)^T d_k < 0$ in the above
proof ensures that convergence is not obtained in a finite number of iterations.
If convergence were obtained in a finite number of iterations, then first order
convergence to a local maximum could be obtained. That Algorithm 6.3 does
not converge to a local maximum is easily obtained using the above theorem
and

$\nabla f(x_k)^T d_k = -\frac{\nabla f(x_k)^T J_k^T A_k R_k}{\|J_k^T A_k R_k\|} = -\frac{R_k^T J_k J_k^T A_k R_k}{\|J_k^T A_k R_k\|} , \qquad (6.62)$

so that $\alpha_k \nabla f(x_k)^T d_k < 0$ because $\alpha_k > 0$ and $J_k J_k^T A_k$ is positive semi-definite.
7. Examples of Algorithm Performance
Algorithm 6.3 was coded using FORTRAN 77. The singular-value
decomposition was used to obtain the required pseudo inverses [48]. While
this is a rather computationally intensive method, it is fast on test problems
and it adds valuable insight into the performance of the algorithm. Since the
minimum-distance functions are sensitive to scale changes in x, the ith element
of xk was scaled using the scale factor
$s_i = \max_{j=0,\ldots,k} \|J_i(x_j)\| , \qquad (7.1)$

where $J_i$ denotes the $i$th column of the Jacobian.
This scaling option is used quite often; see, for example, Moré's Marquardt algorithm [80] and NL2SOL [27, 28]. A method of choosing the two parameters $\lambda_k$ and $q_k$ is also needed to implement the algorithm. The smallest possible value of $\lambda_k$,

$\lambda_k = \frac{R_k^T (J_k^+)^T J_k^+ R_k}{2(q_k - f_k) + R_k^T (J_k^+)^T J_k^+ R_k} , \qquad (7.2)$

and the largest possible value of $q_k$,

$q_{k+1} = q_k + \epsilon [h_k(x_{k+1}) - h_k(x_k)] , \qquad (7.3)$

with $\epsilon = 10^{-4}$, were always used in this implementation to encourage the algorithm to use the minimum distance idea as much as possible. The initial value
for $q_1$ was chosen to yield a starting value of $\lambda_1 = 0.5$.
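In code, these two choices look roughly as follows (our sketch; the thesis implementation is FORTRAN 77, and the helper names here are illustrative):

```python
import numpy as np

def smallest_lambda(R, J, f, q):
    """Smallest admissible lambda_k, equation (7.2).
    Requires q > f, which the algorithm enforces each iteration."""
    g = np.linalg.pinv(J) @ R       # J_k^+ R_k, the Gauss-Newton step
    gTg = float(g @ g)              # R_k^T (J_k^+)^T J_k^+ R_k
    return gTg / (2.0 * (q - f) + gTg)

def largest_q(q, h_new, h_old, eps=1e-4):
    """Largest admissible q_{k+1}, equation (7.3).  After a successful
    line search h_new < h_old, so the sequence {q_k} strictly decreases."""
    return q + eps * (h_new - h_old)
```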
The line search performs backtracking or extrapolation to obtain a
three point pattern. The derivative at xk is used to perform backtracking. Once
a three point pattern is obtained, quadratic interpolation is used to obtain an
improved point. The line search is terminated at the first point that satisfies
$h_k(x_{k+1}) \le h_k(x_k) + \epsilon \alpha_k \nabla h_k(x_k)^T d_k , \qquad (7.4)$

where $\alpha_k$ is the step length.
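A stripped-down backtracking sketch of the acceptance rule (7.4) (ours; it omits the extrapolation to a three-point pattern and the quadratic interpolation described above, and assumes d is a descent direction for h):

```python
def backtrack(h, h0, slope, x, d, eps=1e-4, shrink=0.5, alpha=1.0):
    """Return the first step length satisfying condition (7.4):
    h(x + a*d) <= h(x) + eps * a * grad_h(x)^T d.
    h0 = h(x) and slope = grad_h(x)^T d (assumed < 0) are supplied
    by the caller; x and d are NumPy arrays."""
    while h(x + alpha * d) > h0 + eps * alpha * slope:
        alpha *= shrink  # halve the step until (7.4) holds
    return alpha
```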
The termination criteria used in the algorithm are:
(1) $f(x_k) < 10^{-13}$, INFO = 1.
(2) $\|\nabla f(x_k)\| < 10^{-12}$, INFO = 2.
(3) The step length $< \max(1, \|x_k\|) \cdot 10^{-7}$, INFO = 3.
(4) The number of function evaluations exceeds 200(m + 1), INFO = 4 [82].
(5) $\lambda_k > 0.9999$, INFO = 9999.
The last termination test is used to stop the algorithm if it is essentially steepest
descent. This was done to prevent the algorithm from consuming too much
computer time. When $\lambda_k$ is too close to one, the minimum distance idea is not
working and neither is the steepest descent method.
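These tests translate directly into code; a sketch (ours), with the tolerances quoted above:

```python
import numpy as np

def check_termination(f_val, grad_norm, step_len, x, nfev, m, lam):
    """Return the INFO code for the first stopping test that fires,
    or None to continue iterating."""
    if f_val < 1e-13:
        return 1
    if grad_norm < 1e-12:
        return 2
    if step_len < max(1.0, float(np.linalg.norm(x))) * 1e-7:
        return 3
    if nfev > 200 * (m + 1):
        return 4
    if lam > 0.9999:
        return 9999  # essentially steepest descent; stop early
    return None
```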
The algorithm was tested using the Moré, Garbow, and Hillstrom test
problems [84, 83]. The performance of the algorithm on these standard problems is given in Tables 7.1 and 7.2. The acronyms used by Moré et al. are given in parentheses for reference. Any missing entries in one of the tables indicate that the algorithm ran into some numerical difficulty and could not continue. A listing of the objective functions is given in Appendix A. The scale factor listed is used to scale the starting point. Moré et al. designed these test problems specifically to test, among other things, how well an algorithm performs under scale changes in x. In the test set, many problems are repeated with the starting point scaled by 10 and then 100. One example is Rosenbrock's function, problem number 4, where the first row in the table uses the standard initial point (-1.2, 1). The two succeeding rows use (-12, 10) and (-120, 100), respectively.
For comparison, Tables 7.3 and 7.4 give the performance of an old (mid-1980s) version of Moré's Marquardt algorithm on the same test set [80]. Moré's convergence criteria are:
(1) Both actual and predicted relative reductions in the sum-of-squares are at most $10^{-8}$, INFO = 1.
(2) Relative error between two consecutive iterates is at most $10^{-8}$, INFO = 2.
(3) Conditions for both 1 and 2 hold, INFO = 3.
(4) The cosine of the angle between R(x) and any column of the Jacobian
is zero, INFO = 4.
(5) Maximum number of function evaluations has been exceeded, INFO =
5.
Problem                 Scale   Function  Jacobian          Objective
Number                  Factor  Calls     Calls             Function Norm
(NPROB)   M    N                (NFEV)    (NJEV)    INFO    (FINAL L2 NORM)
   1      5   10     1      2       2        2      2.2360680D+00
   1      5   50     1      2       2        2      6.7082039D+00
   2      5   10     1      2       2        2      1.4638501D+00
   2      5   50     1      3       3        2      3.4826302D+00
   3      5   10     1      2       2        2      1.9097274D+00
   3      5   50     1      3       3        2      3.6917294D+00
   4      2    2     1      6       6        1      4.3299800D-08
   4      2    2    10      5       5        1      6.9715345D-10
   4      2    2   100      5       5        1      6.5081163D-10
   5      3    3     1     10       9        1      1.9467384D-12
   5      3    3    10     26      17        1      1.8631818D-13
   5      3    3   100    181      64        1      5.4240424D-10
   6      4    4     1     13      13        1      2.7555867D-07
   6      4    4    10     16      16        1      3.3882791D-07
   6      4    4   100     20      20        1      1.3249207D-07
   7      2    2     1     16       9     9999      6.9988753D+00
   7      2    2    10     22      12     9999      7.8965477D+00
   7      2    2   100
   8      3   15     1      7       7        2      9.0635960D-02
   8      3   15    10     29      13        3      9.0635960D-02
   8      3   15   100      3       3     9999      4.1747279D+00
   9      4   11     1     42      37        3      1.7535838D-02
   9      4   11    10      8       7     9999      1.1500400D-01
   9      4   11   100      5       3     9999      2.2209595D-01
  10      3   16     1     16      12        3      9.3779451D+00
  10      3   16    10     16      12        3      9.3779451D+00
Table 7.1. Minimum distance algorithm performance on standard test problems.
Problem                 Scale   Function  Jacobian          Objective
Number                  Factor  Calls     Calls             Function Norm
(NPROB)   M    N                (NFEV)    (NJEV)    INFO    (FINAL L2 NORM)
  11      6   31     1     86      45        3      4.7829594D-02
  11      6   31    10     17      17        2      4.7829594D-02
  11      6   31   100     21      20        2      4.7829594D-02
  11      9   31     1    426     181        2      1.1831146D-03
  11      9   31    10     22      15        2      1.1831146D-03
  11      9   31   100     19      17        2      1.1831146D-03
  11     12   31     1   2600    1116        4      2.3497279D-02
  11     12   31    10     13      11        2      2.1731040D-05
  11     12   31   100     69      33        2      2.1731040D-05
  12      3   10     1    219      80        1      2.7028459D-12
  13      2   10     1      9       5     9999      1.2257774D+01
  14      4   20     1   1001     395        4      3.2677102D+02
  14      4   20    10   1001     416        4      3.3237608D+02
  14      4   20   100   1000     411        4      3.2367148D+02
  15      1    8     1      1       1        2      1.8862380D+00
  15      1    8    10     28      28        2      1.8842482D+00
  15      1    8   100     36      36        2      5.8315069D+02
  15      8    8     1     73      25     9999      8.4654539D-02
  15      9    9     1     17      11        1      1.9295463D-09
  15     10   10     1    137      39     9999      1.4342183D-01
  16     10   10    10     17       4     9999      3.6390336D-01
  16     10   10   100     11      10        2      6.1315453D-07
  16     30   30     1     11       7        1      3.5793833D-08
  16     40   40     1     11       7        1      3.7277429D-08
  17      5   33     1     17      13        2      7.3924926D-03
  18     11   65     1     25      19        3      2.0034404D-01
Table 7.2. Minimum distance algorithm performance on standard test problems, continued.
Problem                 Scale   Function  Jacobian          Objective
Number                  Factor  Calls     Calls             Function Norm
(NPROB)   M    N                (NFEV)    (NJEV)    INFO    (FINAL L2 NORM)
   1      5   10     1      3       2        3      2.2360680D+00
   1      5   50     1      3       2        2      6.7082039D+00
   2      5   10     1      3       2        1      1.4638501D+00
   2      5   50     1      3       2        1      3.4826302D+00
   3      5   10     1      3       2        1      1.9097274D+00
   3      5   50     1      3       2        1      3.6917294D+00
   4      2    2     1     21      16        4      0.0000000D+00
   4      2    2    10      8       5        2      0.0000000D+00
   4      2    2   100      6       4        2      0.0000000D+00
   5      3    3     1     11       8        2      9.9365231D-17
   5      3    3    10     20      15        2      1.0446809D-19
   5      3    3   100     19      16        2      3.7665334D-29
   6      4    4     1    500     499        5      0.0000000D+00
   6      4    4    10    500     499        5      0.0000000D+00
   6      4    4   100     66      65        4      9.3220945D-35
   7      2    2     1     14       8        1      6.9988752D+00
   7      2    2    10     19      12        1      6.9988752D+00
   7      2    2   100     24      17        1      6.9988752D+00
   8      3   15     1      6       5        1      9.0635960D-02
   8      3   15    10     37      36        1      4.1747687D+00
   8      3   15   100     14      13        1      4.1747687D+00
   9      4   11     1     18      16        1      1.7535838D-02
   9      4   11    10     78      70        1      3.2052193D-02
   9      4   11   100    500     369        5      2.2569255D-02
  10      3   16     1    126     116        2      9.3779451D+00
  10      3   16    10    400     346        5      7.9607763D+02
Table 7.3. Moré's Marquardt algorithm performance on standard test problems.
Problem                 Scale   Function  Jacobian          Objective
Number                  Factor  Calls     Calls             Function Norm
(NPROB)   M    N                (NFEV)    (NJEV)    INFO    (FINAL L2 NORM)
  11      6   31     1      8       7        1      4.7829594D-02
  11      6   31    10     14      13        1      4.7829594D-02
  11      6   31   100     15      14        1      4.7829594D-02
  11      9   31     1      8       7        2      1.1831146D-03
  11      9   31    10     19      15        1      1.1831146D-03
  11      9   31   100     18      15        1      1.1831146D-03
  11     12   31     1     10       9        3      2.1731040D-05
  11     12   31    10     13      12        2      2.1731040D-05
  11     12   31   100     34      28        3      2.1731040D-05
  12      3   10     1      7       6        2      1.5700924D-16
  13      2   10     1     21      12        1      1.1151779D+01
  14      4   20     1    254     236        1      2.9295429D+02
  14      4   20    10     53      42        1      2.9295429D+02
  14      4   20   100    238     222        1      2.9295429D+02
  15      1    8     1      1       1        4      1.8862380D+00
  15      1    8    10     29      28        1      1.8842482D+00
  15      1    8   100     47      46        1      1.8842482D+00
  15      8    8     1     39      20        1      5.9303235D-02
  15      9    9     1     12       9        2      1.9859084D-16
  15     10   10     1     25      12        1      8.0647100D-02
  16     10   10     1     14      12        2      3.0967064D-15
  16     10   10    10     13       8        2      3.0303191D-15
  16     10   10   100     22      20        2      2.1868857D-15
  16     30   30     1     19      14        2      2.2480051D-13
  16     40   40     1     19      14        2      4.7146716D-14
  17      5   33     1     18      15        1      7.3924926D-03
  18     11   65     1     16      12        1      2.0034404D-01
Table 7.4. Moré's Marquardt algorithm performance on standard test problems, continued.
The algorithm performs well on problems where $\|J(x_k)^+ R(x_k)\| \to 0$ in some neighborhood of the minimum. When this condition does not hold, $\lambda_k \to 1$ and the algorithm becomes steepest descent. Hence, the algorithm performs poorly on such problems.
While these results are not spectacular they are encouraging. We
did not expect to get exceptional results using the steepest descent direction.
Incorporating a better search direction into the algorithm should help to greatly
improve algorithm performance. The results indicate that more work could be
done to improve the scaling of our algorithm. However, many of the desired
objectives of the above analysis were observable in the test cases. Specific
examples taken from some of the test problems are listed in the following
sections.
7.1 Rosenbrock's Function
The residuals for Rosenbrock's function [102], problem number 4, are

$r_1(x) = 10(x_2 - x_1^2) \qquad (7.5)$
$r_2(x) = 1 - x_1 . \qquad (7.6)$

Rosenbrock's function is a classic zero-residual least-squares problem where $\|J(x_k)^+ R(x_k)\| \to 0$ and Gauss-Newton exhibits a quadratic convergence rate.
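For reference, a small sketch (ours) of the residual vector and Jacobian for this problem:

```python
import numpy as np

def rosenbrock_residuals(x):
    """Residuals (7.5)-(7.6); f(x) = (1/2)||R(x)||^2 vanishes at (1, 1)."""
    return np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

def rosenbrock_jacobian(x):
    """Jacobian of the residuals.  det J = 10 everywhere, so J is never
    singular and ||J(x_k)^+ R(x_k)|| -> 0 as x_k -> (1, 1)."""
    return np.array([[-20.0 * x[0], 10.0],
                     [-1.0, 0.0]])
```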
This problem illustrates several nice properties of the minimum-distance
algorithm. First, if the algorithm is initialized using a value of $\lambda_1$ near one, then $\lambda_k \to 0$ as $x_k \to x^*$. To illustrate this, we ran the algorithm with $\lambda_1 = 0.95$. The path taken by the algorithm in this case is given in Figure 7.1. As can be seen, the first direction is essentially the negative-gradient direction. However, succeeding directions quickly move away from the negative-gradient direction and the algorithm quickly converges to the minimum. Also observable in the early iterations of Figure 7.1 are the classic oscillations characteristic of using a steepest descent algorithm. These oscillations dampen as $\lambda_k$ gets smaller and the Gauss-Newton direction is approached. That $\lambda_k$ quickly becomes small is evident from Figure 7.2, a semi-log plot of $\lambda_k$ versus the iteration number.
Figure 7.1. Minimum-distance function path for $\lambda_1 = 0.95$.
Figure 7.2. Convergence of $\lambda_k$ for $\lambda_1 = 0.95$.
This problem also illustrates that values of $\lambda_k$ that are not too close to one allow the objective function to increase, yet convergence to the minimum is still obtained. For this $\lambda_1 = 0.95$ case, this occurs at iteration two, as shown in Figure 7.3.
Figure 7.3. Objective function values for $\lambda_1 = 0.95$.
Next, the algorithm was run with $\lambda_1 = 0.8$. This yields a starting direction that is closer to the Gauss-Newton direction, as seen in Figure 7.4. This allows $\lambda_k$ to decrease more quickly (Figure 7.5). This quicker drop in $\lambda_k$ allows the algorithm to increase the objective function (Figure 7.6). These combined effects help speed up the algorithm, resulting in fewer iterations being required than in the $\lambda_1 = 0.95$ case.
Figure 7.4. Minimum-distance function path for $\lambda_1 = 0.8$.
Figure 7.5. Convergence of $\lambda_k$ for $\lambda_1 = 0.8$.
Figure 7.6. Objective function values for $\lambda_1 = 0.8$.
As $\lambda_1$ gets smaller, greater increases in $f(x)$ are allowed. This is illustrated with the same series of plots where the algorithm is run with $\lambda_1 = 0.5$. Here, a starting direction very near the Gauss-Newton direction is obtained (Figure 7.7), and $\lambda_k$ very quickly gets small (Figure 7.8). As in the previous examples, the objective function is allowed to increase, yet eventually converges to zero (Figure 7.9). This final case obtains the minimum in fewer iterations than required for the previous two cases.
Figure 7.7. Minimum-distance function path for $\lambda_1 = 0.5$.

Figure 7.8. Convergence of $\lambda_k$ for $\lambda_1 = 0.5$.
75


Objective Function Value
1.E+04
0 1 2 3 4 5 6
Iteration Number
Figure 7.9. Objective function values for Ai = 0.5.
7.2 Freudenstein and Roth's Function
The residuals for the Freudenstein and Roth function [44], problem number 7, are

$r_1(x) = -13 + x_1 + ((5 - x_2)x_2 - 2)x_2 \qquad (7.7)$
$r_2(x) = -29 + x_1 + ((x_2 + 1)x_2 - 14)x_2 . \qquad (7.8)$
While the minimum-distance function algorithm does very well on Rosenbrock's function, the Freudenstein and Roth function causes it quite some difficulty. This problem has a non-zero residual at one of its local minima, where one of the singular values approaches zero as $\{x_k\} \to x^*$. The Jacobian for this problem is

$J(x) = \begin{pmatrix} 1 & -3x_2^2 + 10x_2 - 2 \\ 1 & 3x_2^2 + 2x_2 - 14 \end{pmatrix} . \qquad (7.9)$

This Jacobian becomes singular when the first column is a scalar multiple of the second column. That is, the Jacobian becomes singular when

$-3x_2^2 + 10x_2 - 2 = 3x_2^2 + 2x_2 - 14 . \qquad (7.10)$

This is a quadratic equation in $x_2$ that can be easily solved to obtain $x_2 = -0.8968\ldots$ and $x_2 = 2.230138\ldots$. The local minimum that causes the difficulty is at $(11.41\ldots, -0.8968\ldots)$. The minimum-distance algorithm has trouble when $\|J(x_k)^+ R(x_k)\| \to \infty$. In this case, this happens when $x_2 \to -0.8968\ldots$. Once this happens, $\lambda_k \to 1$, the steepest-descent algorithm is obtained, and the algorithm terminates when $\lambda_k > 0.9999$.
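Equation (7.10) rearranges to $3x_2^2 - 4x_2 - 6 = 0$; a two-line check (ours) reproduces the quoted roots:

```python
import numpy as np

# (7.10) rearranged: 6*x2^2 - 8*x2 - 12 = 0, i.e. 3*x2^2 - 4*x2 - 6 = 0
roots = np.roots([3.0, -4.0, -6.0])
print(sorted(roots))  # approximately [-0.8968, 2.2301]
```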
As seen in Table 7.1, the first instance of problem number 7 terminated at a point very near the local minimum, where the norm of the objective
function obtained is very near the optimal value of 6.99887.... For this case,
the minimum-distance algorithm took only a few iterations to get this close.
Once this point is reached, the resulting steepest-descent algorithm is unable to successfully find the minimum in a reasonable number of iterations.

The results for the other two cases, where the algorithm is started at 10 and 100 times farther away from the minimum, were not as good. During the first few iterations, large decreases in the objective function and also in $\|J(x_k)^+ R(x_k)\|$ were obtained, causing $\lambda_k$ to get very small. Then the algorithm starts to get close to $x_2 = -0.8968\ldots$, where the associated singular value is close to zero. This causes $\|J(x_k)^+ R(x_k)\|$ to increase, as expected. However, at this point, an interesting scaling problem is observed. The drop in the singular value is enough to dominate the Gauss-Newton direction but not enough to affect $\lambda_k$. The drop obtained in the objective function during the first few iterations caused the $2(q_k - f_k)$ term in the equation for obtaining $\lambda_k$,

$\lambda_k = \frac{R_k^T (J_k^+)^T J_k^+ R_k}{2(q_k - f_k) + R_k^T (J_k^+)^T J_k^+ R_k} , \qquad (7.11)$

to become very large. Once the singular value started to become small, this equation should have caused $\lambda_k$ to approach one, turning the algorithm toward the minimum. This is what happened in the first case. In the other cases, the $2(q_k - f_k)$ term dominates, causing $\lambda_k$ to remain small. This causes the algorithm to search primarily along $x_2$, making very little progress along $x_1$. Eventually, the singular value gets small enough that $\lambda_k$ approaches one.
However, no progress is made toward the minimum along the other singular direction, characterized by $x_2 = -0.8968\ldots$, and steepest descent cannot find the minimum.

Thus, it is observed that the $2(q_k - f_k)$ term must be scaled properly for the algorithm to work well. If $2(q_k - f_k)$ is too large, then $\lambda_k$ will remain small. If $2(q_k - f_k)$ is too small, then $\lambda_k$ will stay close to one. Picking $\lambda_1$ to be some reasonable value, e.g. 0.5, appears to scale the $2(q_k - f_k)$ term for many problems. However, some dynamic scaling is needed to improve the minimum-distance algorithm's performance. Such a scaling algorithm needs to find an effective balance between allowing $\lambda_k \to 0$ for fast convergence and $\lambda_k \to 1$ when a singular value is approaching zero.
7.3 Powell's Singular Function
The residuals for Powell's singular function [92], problem number 6, are

$r_1(x) = x_1 + 10x_2 \qquad (7.12)$
$r_2(x) = \sqrt{5}(x_3 - x_4) \qquad (7.13)$
$r_3(x) = (x_2 - 2x_3)^2 \qquad (7.14)$
$r_4(x) = \sqrt{10}(x_1 - x_4) \qquad (7.15)$
with a minimum objective function value of zero at (0, 0, 0, 0). The Jacobian for this problem,

$J(x) = \begin{pmatrix} 1 & 10 & 0 & 0 \\ 0 & 0 & \sqrt{5} & -\sqrt{5} \\ 0 & 2(x_2 - 2x_3) & -4(x_2 - 2x_3) & 0 \\ \sqrt{10} & 0 & 0 & -\sqrt{10} \end{pmatrix} , \qquad (7.16)$

is obviously singular at the minimum, by simply examining the third row. Hence, this problem is a mixture of the above two problems. It is a zero-residual problem like Rosenbrock's function, but it also has a singular value that approaches zero as $x_k$ approaches the minimum. The minimum distance function performs very well on this problem since $\|J(x_k)^+ R(x_k)\| \to 0$ as $x_k$ approaches the minimum. This allows $\lambda_k$ to approach zero so that the algorithm searches a direction approaching the Gauss-Newton direction.
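A short sketch (ours) makes the decaying singular value visible as the iterates shrink toward the origin:

```python
import numpy as np

def powell_jacobian(x):
    """Jacobian (7.16); its third row vanishes at the minimum (0,0,0,0)."""
    s5, s10 = np.sqrt(5.0), np.sqrt(10.0)
    t = x[1] - 2.0 * x[2]
    return np.array([[1.0, 10.0, 0.0, 0.0],
                     [0.0, 0.0, s5, -s5],
                     [0.0, 2.0 * t, -4.0 * t, 0.0],
                     [s10, 0.0, 0.0, -s10]])

# The smallest singular value scales with x and goes to zero:
for scale in (1.0, 1e-2, 1e-4):
    s = np.linalg.svd(powell_jacobian(scale * np.ones(4)), compute_uv=False)
    print(scale, s.min())
```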
8. Extensions and Future Research
Two convergence conditions, the bounded away from orthogonality
and sufficient decrease conditions, were identified as keys to algorithm perfor-
mance. At first, these conditions appeared to be general enough to cover most
any efficient algorithm. However, a detailed study of the nonmonotone algo-
rithms in the literature indicated that relaxing these two convergence condi-
tions could yield faster convergence. Further analysis revealed that minimizing an estimate of the distance to the minimum relaxes both the bounded away from orthogonality and sufficient decrease conditions. These minimum distance functions also pick an optimal
point for a line-search in the following sense. If the estimate of the distance to
the minimum is accurate, the distance to the minimum is minimized at each
iteration.
We developed a line-search algorithm, based on minimizing a minimum-
distance function along its steepest descent direction. Unlike the nonmonotone
algorithms in the literature, this minimum distance function algorithm dynam-
ically adjusts to account for changes between the influence of curvature and
descent. The algorithm does not force a reduction in the objective function
at each iteration, yet it allows f(x) to increase during an iteration only if this
will result in faster convergence. A variant of this algorithm was tested using
some standard test functions. While good performance of the algorithm was
observed, faster convergence could be obtained if a better search direction is
used. One such enhancement would be to include trust region logic. Trust
region algorithms have proven to be very effective in practice and were shown
in this thesis to relax the bounded away from orthogonality condition. Thus,
if $\lambda_k \to 1$, then an efficient algorithm that relaxes the bounded away from
orthogonality condition will result.
The nonmonotone idea of Grippo et al. [53] has been used for constrained optimization to help eliminate the Maratos effect [8]. This suggests that extending the minimum distance function idea to constrained optimization problems could be beneficial. First, let us consider linear equality constraints, $e(x) = 0$. Let $J^{(+)}$ represent the pseudo inverse of $J$ restricted to the linear equality constraints, $e(x) = 0$ [99]. Then the nonlinear least-squares minimum-distance function becomes

$h_k(x) = \tfrac{1}{2} \|J^{(+)}(x_k) R(x)\|^2 . \qquad (8.1)$
This is equivalent to the problem of minimizing

$h_k(x) = \tfrac{1}{2} \|x - x_k^*\|^2 , \qquad (8.2)$
where

$x_k^* = \operatorname{argmin}_x \tfrac{1}{2} \|J(x_k)[x - x_k] + R(x_k)\|^2 \ \text{ s.t. } e(x) = 0 . \qquad (8.3)$
In our minimum-distance function approach, we minimized an estimate of
the distance to the minimum of an unconstrained problem. Here, we use a
quadratic approximation to obtain an estimate of the distance to a constrained
minimum and we minimize this distance. Such a minimum-distance function
would allow feasible and non-feasible iterates.
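As a sketch of how the subproblem (8.3) could be solved for linear equality constraints, suppose they are written as $Cx = b$ (our notation; the thesis leaves $e(x) = 0$ abstract). A null-space method then reduces (8.3) to an ordinary least-squares problem, assuming SciPy is available:

```python
import numpy as np
from scipy.linalg import null_space

def constrained_subproblem(J, R, xk, C, b):
    """Sketch (ours) of (8.3) for linear equality constraints C x = b:
    parametrize the feasible set as x = x_p + Z w with C x_p = b and
    C Z = 0, then minimize (1/2)||J (x - x_k) + R||^2 over w, an
    ordinary linear least-squares problem in w."""
    x_p = np.linalg.lstsq(C, b, rcond=None)[0]  # one feasible point
    Z = null_space(C)                           # basis for ker(C)
    # J (x_p + Z w - x_k) + R  =  (J Z) w + (J (x_p - x_k) + R)
    w = np.linalg.lstsq(J @ Z, -(J @ (x_p - xk) + R), rcond=None)[0]
    return x_p + Z @ w
```

Because the feasible set is parametrized exactly, the returned point $x_k^*$ is feasible even when the current iterate $x_k$ is not, which is consistent with the remark above that the iterates themselves may be infeasible.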
This idea is very similar to the original idea of allowing f(x) to in-
crease along the line-search direction in order to make progress along an eigen-
vector direction with small associated eigenvalue. Here, the algorithm is al-
lowed to go infeasible in order to make progress along a constraint. Consider
modifying our least-squares objective function to obtain the penalty function

$\tilde{f}(x) = M e(x)^2 + f(x) , \qquad (8.4)$

where $M$ is a large positive constant. This yields an ill-conditioned uncon-
strained nonlinear least-squares problem with an eigenvector (with a small
associated singular value) oriented along the constraint e(x). If the minimum-
distance function idea is applied to this penalty function, the algorithm will
allow f(x) to increase in order to move along the small singular-value eigen-
vector direction that points along the constraint. Thus, the idea of increasing
f(x) to make progress along a small singular-value direction and the idea of
allowing the iterates to go infeasible in order to make progress along a con-
straint, in this case, coincide. Comparing the performance of the minimum
distance idea using the penalty function approach to the performance of the
above constrained minimum-distance function could be quite enlightening.
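Concretely, the penalty formulation (8.4) stays inside the least-squares framework by appending a scaled constraint residual (a sketch, ours; the $\sqrt{2M}$ factor follows from $f = \tfrac{1}{2}\|R\|^2$):

```python
import numpy as np

def penalized_residuals(R, e, M):
    """Augmented residual vector for (8.4): appending sqrt(2*M)*e(x) to
    R(x) gives (1/2)||R_aug||^2 = (1/2)||R||^2 + M*e(x)^2 = f(x) + M*e(x)^2.
    R is the residual vector and e the (scalar) constraint value at x."""
    return np.append(R, np.sqrt(2.0 * M) * e)
```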
The above example used a penalty function to convert a constrained
problem to an unconstrained problem. The minimum-distance function idea
also converts a constrained problem into an unconstrained problem. At each
function evaluation of this new hk(x) we solve a constrained-quadratic subprob-
lem. This idea can easily be extended to include general nonlinear constraints
by minimizing
$h_k(x) = \tfrac{1}{2} \|x - x_k^*\|^2 , \qquad (8.5)$

where

$x_k^* = \operatorname{argmin}_x \tfrac{1}{2} \|J(x_k)[x - x_k] + R(x_k)\|^2 \ \text{ s.t. the constraints} . \qquad (8.6)$
If these constraints are linearized, a quadratic problem with linear constraints is
obtained that can be solved iteratively to yield the solution to the subproblem.
This decouples the nonlinearity of the objective function from the nonlinearity
of the constraints. The resulting algorithm should exhibit a fast convergence
rate and the nonmonotone idea incorporated into the minimum-distance func-
tion should eliminate the Maratos effect.
APPENDIX A Standard Test Functions
(1) Linear function, full rank [84].
m is variable, n ≥ m
$r_i(x) = x_i - \frac{2}{n} \big( \sum_{j=1}^{m} x_j \big) - 1$, 1 ≤ i ≤ m
$r_i(x) = -\frac{2}{n} \big( \sum_{j=1}^{m} x_j \big) - 1$, m < i ≤ n
$x_0 = (1, \ldots, 1)$
f = n - m at $(-1, \ldots, -1)$
(2) Linear function, rank 1 [84].
m is variable, n ≥ m
$r_i(x) = i \big( \sum_{j=1}^{m} j x_j \big) - 1$
$x_0 = (1, \ldots, 1)$
$f = \frac{n(n-1)}{2(2n+1)}$ at any point where $\sum_{j=1}^{m} j x_j = \frac{3}{2n+1}$
(3) Linear function, rank 1 with zero columns and rows [84].
m is variable, n ≥ m
$r_1(x) = -1$
$r_i(x) = (i - 1) \big( \sum_{j=2}^{m-1} j x_j \big) - 1$, 2 ≤ i < n
$r_n(x) = -1$
$x_0 = (1, \ldots, 1)$
$f = \frac{n^2 + 3n - 6}{2(2n - 3)}$ at any point where $\sum_{j=2}^{m-1} j x_j = \frac{3}{2n-3}$
(4) Rosenbrock's function [102].
m = 2, n = 2
$r_1(x) = 10(x_2 - x_1^2)$
$r_2(x) = 1 - x_1$
$x_0 = (-1.2, 1)$
f = 0 at (1, 1)
(5) Helical valley function [43].
m = 3, n = 3
$r_1(x) = 10[x_3 - 10\,\theta(x_1, x_2)]$
$r_2(x) = 10\big[\sqrt{x_1^2 + x_2^2} - 1\big]$
$r_3(x) = x_3$
where
$\theta(x_1, x_2) = \begin{cases} \frac{1}{2\pi} \arctan(x_2/x_1), & \text{if } x_1 > 0 \\ \frac{1}{2\pi} \arctan(x_2/x_1) + 0.5, & \text{if } x_1 < 0 \end{cases}$
$x_0 = (-1, 0, 0)$
f = 0 at (1, 0, 0)
(6) Powell singular function [92].
m = 4, n = 4
$r_1(x) = x_1 + 10x_2$
$r_2(x) = \sqrt{5}(x_3 - x_4)$
$r_3(x) = (x_2 - 2x_3)^2$
$r_4(x) = \sqrt{10}(x_1 - x_4)$
$x_0 = (3, -1, 0, 1)$
f = 0 at (0, 0, 0, 0)
(7) Freudenstein and Roth function [44].
m = 2, n = 2
$r_1(x) = -13 + x_1 + ((5 - x_2)x_2 - 2)x_2$
$r_2(x) = -29 + x_1 + ((x_2 + 1)x_2 - 14)x_2$
$x_0 = (0.5, -2)$
f = 0 at (5, 4)
f = 48.9842... at (11.41..., -0.8968...)
(8) Bard function [2].
m = 3, n = 15
$r_i(x) = y_i - \Big( x_1 + \frac{i}{(16 - i) x_2 + \min(i, 16 - i)\, x_3} \Big)$
where

 i   y_i     i    y_i     i    y_i
 1   0.14    6    0.32   11    0.73
 2   0.18    7    0.35   12    0.96
 3   0.22    8    0.39   13    1.34
 4   0.25    9    0.37   14    2.10
 5   0.29   10    0.58   15    4.39

$x_0 = (1, 1, 1)$
f = 8.21487...·10^-3
f = 17.4286... at (0.8406..., -∞, -∞)
(9) Kowalik and Osborne function [68].
m = 4, n = 11
$r_i(x) = y_i - \frac{x_1 (u_i^2 + u_i x_2)}{u_i^2 + u_i x_3 + x_4}$
where

 i   y_i      u_i       i    y_i      u_i
 1   0.1957   4.0000    7    0.0456   0.1250
 2   0.1947   2.0000    8    0.0342   0.1000
 3   0.1735   1.0000    9    0.0323   0.0833
 4   0.1600   0.5000   10    0.0235   0.0714
 5   0.0844   0.2500   11    0.0246   0.0625
 6   0.0627   0.1670

$x_0 = (0.25, 0.39, 0.415, 0.39)$
f = 3.07505...·10^-4
f = 1.02734...·10^-3 at (+∞, -14.07..., -∞, -∞)
(10) Meyer function [79].
m = 3, n = 16
$r_i(x) = x_1 \exp\Big[ \frac{x_2}{45 + 5i + x_3} \Big] - y_i$
where

 i   y_i      i    y_i
 1   34780    9    8261
 2   28610   10    7030
 3   23650   11    6005
 4   19630   12    5147
 5   16370   13    4427
 6   13720   14    3820
 7   11540   15    3307
 8    9744   16    2872

$x_0 = (0.02, 4000, 250)$
f = 87.9458...
(11) Watson function [68].
2 ≤ m ≤ 31, n = 31
$r_i(x) = \sum_{j=2}^{m} (j - 1) x_j t_i^{j-2} - \Big( \sum_{j=1}^{m} x_j t_i^{j-1} \Big)^2 - 1$, 1 ≤ i ≤ 29, where $t_i = i/29$
$r_{30}(x) = x_1$
$r_{31}(x) = x_2 - x_1^2 - 1$
$x_0 = (0, \ldots, 0)$
f = 2.28767...·10^-3 for m = 6
f = 1.39976...·10^-6 for m = 9