AUTOMATIC DIFFERENTIATION AS
APPLIED TO THE TRUNCATED NEWTONS
METHOD FOR UNCONSTRAINED
NONLINEAR OPTIMIZATION
by
RONALD JOHN VAN IWAARDEN
B. A., University of Colorado at Denver, 1989
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Applied Mathematics
1992
This thesis for the Master of Science
degree by
Ronald John Van Iwaarden
has been approved for the
Department of
Applied Mathematics
by
Weldon Lodwick
Date
1.4 a i
Van Iwaarden, Ronald John (M. S., Applied Mathematics)
Automatic Differentiation as Applied to the Truncated Newtons Method
for Unconstrained Nonlinear Optimization
Thesis directed by Associate Professor Weldon Lodwick
This paper explores using both forward and reverse automatic
differentiation (AD) to solve the standard unconstrained optimization
problem. A sparse version of forward AD is also implemented for com
parison. The three methods are compared when the dimensionality of the
problem is increased. As the dimension of the problem increases, two fac
tors of importance are both memory requirements and the time involved
to solve a problem of given size. The research will show that reverse AD
is the superior method when time is the largest constraint and forward
AD (either sparse or non sparse) is the superior method when memory
requirements are of the greatest concern.
The form and content of this abstract are approved. I recommend its
publication. Signed_
Weldon Lodwick
in
CONTENTS
CHAPTER
1 INTRODUCTION: NEWTONS METHOD................. 1
2 TRUNCATEDNEWTONS METHODS.................... 4
2.1 Deeper Analysis of the Conjugate Gradient Method
for Solving the Truncated Newtons Method. 5
3 FORWARD AUTOMATIC DIFFERENTIATION .... 6
3.1 Example.................................. 8
4 REVERSE AUTOMATIC DIFFERENTIATION....... 11
5 COMPARISON OF THE FORWARD AND REVERSE
METHODS.................................... 15
6 IMPLEMENTATION AND RESULTS................... 19
APPENDIX
A TEST PROBLEMS............................... 26
BIBLIOGRAPHY...................................... 27
CHAPTER 1
INTRODUCTION: NEWTONS METHOD
We 'will consider the unconstrained optimization problem
min /(sc)
where f is twice continuously differentiable and its level sets are bounded.
Our goal will be to find some a;* such that
/(*)
This may not always be possible so we seek a local minimum which, in
general, is a much easier problem. That is, we seek x* for which there
exists S > 0 such that
/(x*) < f{x) Vsc such that sc sc* < S.
If x* is a local minimizer than the first order necessary condition
is:
V/(x*) = 0.
If the Hessian matrix at i, denoted Hf(x"), is positive definite, then the
inequalities above hold with strict inequality when x ^ x* and we refer to
this as a strong global or local minimum.
A classical method for solving this problem is Newtons method.
To develop Newtons method, we approximate /(x) by Taylors theorem,
and assume that / is given exactly by the first three terms of its Taylors
series expansion. That is,
f(x + p) = f(x) + VTf(x)p + ^pTHf(x)p.
If we then minimize this function over all p, the first order necessary
conditions give
V/(s) + Hf{x)p = 0,
and Newtons method uses the solution
j P = ~Hj\x)Vf{x).
So, to find the search direction, an NxN system of equations must be
solved. This yields the following algorithm:
Step 1. Solve Hf(x)p = Vf(x)
Step 2. Set x < x + p
Step 3. If V/(s) < e, stop; else, go to step 1
The main advantage and star attraction of Newtons method is that if the
function is sufficiently smooth and x is close enough to x*, where x~ is
a strong local minimizer, then Newtons method is quadratically conver
gent. This advantage, however, is not generally enough to counteract the
following disadvantages of Newtons method.
(1) The method is not globally convergent. That is, it may not con
verge to a stationary point for an arbitrary Xq and may even di
verge.
(2) If the Hessian Hj{x) is singular, the method is not defined.
(3) If the function is not convex, the set of directions given may not
define a set of descent directions.
2
(4) An NxN system of equations must be solved at each iteration.
With the exception of (4), all of the above can be solved by a
slight modification of Newtons method. Define (a) = f(x f ctp) and let
0 < a < 1. Then the modified Newtons method is as follows.
Step 1. Solve Hf(x)p = V/()
Step 2. Find a so that the following are satisfied:.
a) (a) < '(0)a
b) (a) > (f>(0) + (1 e)^'(0)a
Step 3. Set x < x + ap
Step 4. If V/() < e, stop; else, go to step 1
The conditions in step 2 are known as the Goldstein test. These
conditions help to control the global behavior of the method. Early in the
calculation, they play a significant role, forcing the method to make a step
that minimizes the function at each step. Later in the calculations, they
have less of a role, as the function becomes asymptotically approximated
by the quadratic that we assumed in the development of Newtons method
and, therefore, the step size (ct) will approach 1.
3
CHAPTER 2
TRUNCATEDNEWTONS METHODS
Newtons method does not in general show the quadratic be
haviour that is the only advantage of using it when the present value for
x is far from the solution. For this reason, it may not be wise to solve the
Newtons equations completely when the present value for x is thought
to be far from the solution. Dembo and Steinhaug [1] proposed using
the method of conjugate gradients for solving Newtons equations. If we
let the present solution to Newtons equations be given by Xk then the
residual is given by r* = HfXk V/(x). The iterations are then trun
cated when the error, given by t,a/V/(), is small enough, which
gives the name TruncatedNewtons method. They proceed to show that
this method is globally convergent to a local minimum, and, by specifying
size of ][rjt/V/(), a prespecified convergence rate between linear and
quadratic can be achieved.
Using the conjugate gradient method to solve the Newtons equa
tions has a very important property that makes it especially convenient
for automatic differentiation. That property is that the conjugate gra
dient method never requires the Hessian matrix to be known explicitly,
rather, only a Hessian vector product Hf(x)p is required at any point in
the conjugate gradient algorithm. It is assumed that the function is in C2,
which implies that the Hessian is symmetric, and so the product pJ Hf[x)
is also acceptable. This allows for a space savings when this algorithm is
encoded for a computer program.
2.1 Deeper Analysis of the Conjugate Gradient Method for
Solving the Truncated Newtons Method.
For,the purpose of analyzing the direction that is found by using
the conjugate gradient method to solve the truncated Newtons equations,
the research assumes that the function F(y) = yTAy yTb which, by
letting x = yA~1b has the equivalent form /(*) = xTAx^bTA~1b which
is a pure quadratic plus a constant. The linear conjugate gradient method
is designed to solve the problem Ax = b where A is symmetric and positive
definite. If k < N iterations are performed by the linear solver, we will
have found a new Xk that is closer to the solution a;when the norm induced
by the matrix A is used. However, if k steps of the nonlinear conjugate
gradient method had been applied to solving the nonlinear problem, the
method would arrive at the same Xk. This is due to the fact that the
optimum search direction and step size are given algebraically when the
problem at hand is a pure quadratic.
The reason for the similarity between the k truncated Newtons
method and using k steps of the nonlinear conjugate gradient method
is that the linear conjugate gradient method is developed by trying to
minimize the function xT Ax which is the quadratic model we assumed. For
these reasons, the truncated Newtons method using the linear conjugate
gradient solver can be seen as interpolating between the conjugate gradient
direction and the Newton direction.
5
CHAPTER 3
FORWARD AUTOMATIC DIFFERENTIATION
There are two types of automatic differentiation. The forward
method has been around the longest with the pioneering works by R. E.
Wengert in 1964[12] and is used by Dixon and Price [2] in their implemen
tation of the truncated Newtons method.
To implement forward AD, we define an algebra on the space
T" of ordered (nfl)tuples (x,x',... ,x^) where is an jdimensional
tensor. For simplicity and since we require only second derivatives for
the optimization method, we work with n = 2. In this case, any element
a: Â£ T2 can be represented by (,s', x") where a: is a scalar, x' is an N
vector and x" is an NxN matrix. If we let U, V Â£ T2 and let 4>{x) E C2,
the following are all operations defined in the space T2.
U = (u, u', u") and V = (v,v',v").
This gives
U + V = (u + v, u' + v', u" + v")
UV = (uv,u'~ v',u" v")
U V = (u v, uv' + vu', uv" + u'v,T + v'u'T + vu")
UJV = (u/v, (vu1 uv')/v2, (v2u" v(v'u'T + u'v,T) + 2uv'v,T uvv")/v3)
(u), '(u)u', '(u)u" + (j>"(u)u'u'T)
These are the standard operations given by the algebra of Taylors series
which is just the implementation of the chain rule for the given functions.
Given this notation, a variable x: becomes (xj,e,,0) where e, is the Â£th
unit vector and 0 is the NxN matrix of zeros.
This requires a matrix store and is not what is needed for the
conjugate gradient method since a matrix vector product is used. Dixon
and Price [2] adapted forward AD to retain only a vector Hessian product
as follows. Define the space S2 where any element u E S has the form
u = (u,u',u"p) where u is a scalar, uf,p are Ndimensional vectors, and u"
is an NxN matrix. Then, given U,V E S and
on this space are given by:
U = (u,u',u"p) and V = (v,v',v"p)
This gives
U + V (u + v, v! + v', u"p + v"p)
U V = (u v, u' v, u"p v"p)
U *V = (u*v, uv1 + vuuv"p + u'vlTp + v'u'Tp + vu"p)
U/V = (u/v, (yu'uv^/v2, (v2u"pv{v'u'TpJru'v'Tp)\2uv'v'Tpuvv"p) jvz)
4>{U) = , 4>'(u)u"p + <})"{u)v!u'Tp)
With this new notation, an independent variable x, is represented by
(x,,e,,0) where x, is a scalar, e, is again, the ith unit vector and here,
0 is the Ndimensional zero vector. It should also be noted that in all
of the above operations, at most dot products are required and that the
Hessian is never explicitly required. This adaptation results in both vector
7
storage and faster operation for each. Hessian vector product as the entire
Hessian need not be computed.
Masao Iri [6] has shown that computing the gradient and Hessian
by the forward AD method requires w(f)0(N) operations, where w(f) is
the work required to compute the function f(x). To compute just the
gradient and Hessian vector product requires at most u>(/)6N operations
or w(f)0(N) operations.
3.1 Example
Let f(x,y) = sin(x3 + y). and consider the point (1,1). Then
the function is differentiated as follows. The independent variables x and
y become
* = (1,(1,0)T.0)
y = (l,(0,l)T,0)
Then we define the following temporary variables.
T\ = x* x
T2 = x T\
T3 = I2 + y
F = sin(T3)
The computations yield:
1 CM 1 2 1 O
1, ?
. 0 0 0 .
' CO 1 6 1 O
11
1 0 1 l O 1 O
8
3 6 r O
II E? 0, 1 3 0 0
F
sm(0), cos(O) *
3
1
1 CO 1 1 o
3
1 0 1 o
cos(O) 9sin(0) 3sin(0)
3sin(0) sin(O)
Now, F = (f(x),Vf(x),Hf(x)), which is what was desired. The reader
should note that, while this example was done using symbolic manipula
tion, on the computer, rather than using the symbols, the actual values of
those symbols would be used.
Now we will rework the same example where only vector stores
are used. Let the vector p be given by the first unit vector and consider
the same point (1,1). The independent variables x and y become
* = (1,(1, Of, (0, Of)
y =(1,(0, if, (0, Of)
Then we define the following temporary variables.
The computations yield:
II 1, 2 3 N3 i
0 0 .
II 1, i CO 3 i i
1 o i O i
II E? 0, 1 CO 3 6
tH 1 1 O i
9
3
F = szn(O), cos(0) * 1 5
cos(O) 9sin(0)
3 sin(O)
Note that the end result gave the first column of the hessian as expected.
10
CHAPTER 4
REVERSE AUTOMATIC DIFFERENTIATION
Reverse AD has been around for quite a while, but it did not
exist in its present form until about 1980 when B. Speelpenning [11] wrote
his doctoral thesis. The earliest known paper that explicitly recognizes
the potential savings that arise in reverse AD is by G. M. Ostrowski et
al. [10]. Similar to forward AD, it is based on the chain rule but in a
slightly different form than we are used to. Since the main application of
reverse AD is to compute the derivatives of functions on the computer,
we restrict our discussion to the computer functions 4>{x) = {B, U} where
B = {+,,*,/} and U = {sin,cos,exp,log,y/}. Every function
that can be written in a computer program is expressed as a composition
of these functions and then can be expressed as a computational graph
with at most two input arcs for every function node since the functions s
are either binary or unary.
Consider again the simple example used above where,
F(x,y) = sin(x3 + y).
For this example, one possible computational graph is shown in figure 4.1
where Vi = x3, v% = x + y and V3 = sini^. So, this computational graph is
essentially a graphical representation of the computational process that is
used in forward AD. A computational graph is by no means unique. Even
in this small example, node Vi could have been replaced by a u v node
Figure 4.1: Computational graph for F(x, y) = sin(x3 + y)
with two input arcs that both originate from the node x and a second u*v
node with input arcs from both x and iq.
Once, the computational graph has been created, the technique
of reverse AD can be applied to the computational graph in order to
compute the needed derivatives. For this process, we need to compute
the partial derivatives of any arc a (Vi,Vj), where u, is the node where
the arc originates and Vj is the terminal node. The computational graph
is then extended to make a new computational graph that computes the
partial derivatives. The new graph is created by placing a mirror image of
the original graph next to the original. For this mirror image, the direction
of each arc is reversed and, in the middle of each arc, a multiplication node
is placed. All of the original nodes in the mirror image become addition
nodes with the exception of the node which corresponded to the function
value F. This node receives the value 1. Between the two graphs, place
one intermediate node for every node in the original graph. Fin all}', the
value of 0 is assigned to every + node in the mirror graph.
12
For each node ut in the original graph that inputs to node Vj.
where node Vj also has inputs from node u*;, three arcs are made that
connect these three nodes with the intermediate node Wij This node
contains the function that corresponds to This node then has an
arc that connects to node Uij in the new computational graph. The node
Ui'j is the multiplication node between nodes u, and Uj.
To illustrate this, consider the previous example.
dvi q /
& = Sv'/x
dv2
dy
dv2
dvi
dvz
dvr.
= 1
= 1
= cosu2
So, node 102,3 contains the function cosi>2. Since this function
depends only on inputs from node v2, the arc from V3 does not need to
be included in the computational graph. If we continue in this fashion,
Wi'2 = 1 and has no inputs. Node wVti = 1 and similarly has no inputs.
Node iox,i = 3vi/x, where it has its first input from node Vi and its second
input from node y. Putting these into the computational graph, the graph
for VF(x,y) appears in figure 4.2.
For this graph, 102,3 = cosu2, 101,2 = 1, wx,i = 3 Vj/x, wx^2 1,
and wp,3 = 1. For the mirror graph, u, = +, and = *. This is
a new computational graph that will give the gradient of the function
F(x,y) = sin(3 + y). Once this computational graph has been traversed,
= ux and = This graph has a similar geometry to the original
computational graph and so it is rarely computed in practice.
13
If the Hessian is needed, the new computational graph can be
differentiated N times since the Hessian is just the gradient of the gra
dient. Again, this computational graph is not significantly different from
the original graph and so it does not to be explicitly computed. Also, the
Hessian vector product is easily computed using only the original compu
tational graph. For a discussion on the algorithm and the computational
complexity, the reader is referred to Iri [6]
14
CHAPTER 5
COMPARISON OF THE FORWARD AND REVERSE METHODS
l
It was mentioned earlier that both the reverse and forward meth
ods of AD can be viewed simply as different interpretations of the chain
rule. When the forward method is implemented, the total derivatives of
node Vi with respect to X{ are known when the computation of / reaches
node i. This can be represented as follows.
(1) & = 2*
(2)
dv?
Qy
dvi
dx
= o
= 2x
(3)
= y
= Zx\ cos(x3 + y)
= cos(x3 + y)
Since V3 = f, we are done. This is how the forward or top up
method of AD is implemented. Note that the symbol x3 + y is used twice
and was already computed when the function f(x,y) sin(x3 + y) was
evaluated.
The total number of additional computations that must be
done in order to get the derivative are 6 multiplications, one addition,
and (assuming that xn is evaluated by eln;r) five transendental function
evaluations.
In contrast to this, the reverse method computes the entire func
tion at first and saves much of the intermediate values along with the ele
mentary partial derivatives of each node with respect to its inputs. Then,
the graph is;traversed in the reverse direction (top down). The partial
derivative of / with respect to node Vi is known when the computation
reaches that point. When this is finished, the partial derivative of f with
respect to each of the original inputs X\,..., xn is known for each a:,. For
the example
that was used, this can be represented as follows.
C1) fe ^cosv2
(2)
dva
dy
COS V2
(3) IS = COSV2
(4) Ite =F Zvi/x COS v2
Again, since V3 = /, we have computed the gradient of the func
tion. Note that in this case, the value v2 = x3 + y has already been
computed and stored so it does not need to be computed again. Also, in
stead of computing ^ = 3a:2 which requires transendental function eval
uations, we compute ^ = 3v\/x cos v2 which saves computational work.
For the reverse method, the total amount of additional computational ef
fort expended is one multiplication, one division, and one transcendental
function evaluation.
If we assume that a division costs approximately two multiplica
tions, a transcendental function costs approximately three multiplications,
and addition and subtraction cost the same as a multiplication, then we
get the following approximations for our function and derivative evalua
tion costs for the given function. To evaluate the function alone requires
approximately 11 multiplications. For the gradient by means of forward
AD, we need an additional 22 multiplications. For the gradient by means
of reverse AD, we need an additional 6 multiplications.
16
So, even with this small example, the advantages given by re
verse AD in terms of computational time are quite evident. If a larger
problem is solved, then the difference is even more striking than what is
shown here. The most obvious disadvantage for reverse AD is the need
to store the computational graph and all of the intermediate variables.
For forward AD, only a vector store is required to compute the gradient.
For reverse AD, the author is not aware of any bounds on the growth
of the computational graph. However, in some recent work by Andreas
Griewank [3], it has been shown that it is possible to achieve logarithmic
growth in the spatial complexity of the computational graph with a trade
off of logarithmic growth in the temporal complexity.
The primary advantage of reverse AD has been the speed at
which derivatives are computed. Iri [6] has shown that the work required
to compute the function and its gradient requires at most 5w(f) evalua
tions. He also shows that the work required to compute the function, its
first derivative, and the Hessian vector product does not exceed 21u>(/).
In other words, the time requirement for evaluating the product of the
Hessian and a vector requires fewer than a constant times the amount of
work required to evaluate the function itself. To compute the full Hessian
requires less than (7+14N)u;(/) operations. It is very surprising to note
that the reverse method is of an order of magnitude faster than the for
ward method. This remarkable fact will become even more evident when
both methods are used for nonlinear optimization.
The reason for reverse AD being so much faster is that the
amount of work required to compute the gradient is dependant on the
17
size of the computational graph. Since the computational graph for the
gradient is about three times the size of the graph for the function alone,
this suggests that the amount of work to compute the gradient is on the
order of three times the work to evaluate the function. This result has
been borne out in practice although the constant can be greater than 3
since the time required to access the values from memory has not been
considered.
18
CHAPTER 6
IMPLEMENTATION AND RESULTS
Many implementations of AD have been written for C or Cf+ [4]
[8] on unix systems that allow for paging main memory out to disk. The
computational graph is built each time that the function is evaluated, and
then traversed in order to get the desired derivatives. The advantages of
this method is that this allows for differentiation of conditional branches.
Its disadvantage is that the building of the computational graph can be
time consuming since it must be in a form that allows for traversal both
up and down. Also, this requires a large amount of memory that must be
allocated dynamically, which can slow down the procedure even more if
memory is paged out to disk.
A second implementation is to write a program that will differ
entiate another function. That is, the function will read in a procedure
that describes the function f(x), and then outputs another procedure that
will compute f(x) and a specified number of derivatives. The advantage
of this method is speed because this procedure can then be linked into an
optimization algorithm. Another advantage is that if a large amount of
memory is required, the paging can be handled much more efficiently. A
disadvantage is that conditional branches are not as easily handled in this
case.
A third method is to have the computational graph computed
once and then retained in main memory. This has the advantage of speed
since the computational graph does not need to be recomputed. The dis
advantage is again that memory can not be handled very efficiently. An
other disadvantage is that conditional evaluations in the function cannot
be handled very easily.
For this research, the third method was used. The test problems
that were used by Dixon [2] did not involve any conditional branches, and
the intent was to compare forward and reverse AD on some classic test
problems. The programs were written in Turbo Pascal version 6.0 for use
on an IBM PC operating under MSDOS version 5.0. For this reason,
paging out to disk was not allowed and the maximum number of standard
variables was limited to a total of 64k, while dynamic memory had a
maximum of approximately 500k. The computational graph was created
using a combination of linked lists and arrays, which allow for direct access
to only limited areas of the computational graph, but that was all that
was needed. The function was stored in reverse polish notation that was
in an external ASCII file. This file was then read into memory and the
computational graph was built when the program was first started. The
time to compute the computational graph was not included in the times
later on. However, the time to compute the computational graph was not
more than about 8 seconds, even when the number of variables was close
the the maximum. The maximum number of variables was restricted to
of approximately 800 since paging memory was not possible.
The forward AD method was implemented exactly as described,
storing only a vector, Hessian product. Unfortunately, the 64k restriction
20
limited the maximum number of variables to approximately 400. If this
restriction was lifted and all of main memory was available, the maximum
number of variables would be about 3500. For purposes of comparison,
a sparse version was also implemented. Instead of using a linked list
implementation, the sparsity was exploited by using an array of values
and of positions. Only the nonzero values of the vector are modified and
those are identified through the array of positions. The problem of copying
values when passing vectors to procedures was overcome by passing the
addresses of those vectors. This makes the modification similar in speed
as a linked list implementation.
In order to directly compare the reverse, forward, and sparse for
ward AD implementations for the truncated Newtons method, all meth
ods used the exact same optimization code for comparison. The line
search is done by means of first checking to see if a 1 is an acceptable
step size according to the Wolf Test [7]. If it is not acceptable, the new
point is found by fitting a cubic interpolant to 4>(x), ^>'(x), ^>(1), 0'(1) where
= f(x + aP) and P is the search direction. If this again does not
find an acceptable step size, an exact line search is implemented to find
the first zero of '{x) to within a given tolerance.
The tests were done on an IBM compatible running a 80386 at
33 MHZ with no math coprocessor. All methods were terminated when
V/ < IO75. For the first three tables, Code refers to which code was
used (REV for reverse AD, FOR for forward AD, and SPA for sparse
forward AD). Time refers to the amount of time in seconds required to
solve the problem with the given method. The time was obtained by asking
21
for the system time just before the start of the optimization algorithm and
then asked for again just after the end of the algorithm. These two values
were then subtracted appropriately to obtain the figures that are given.
The second three tables concern themselves with the growth ratio in time
for a given increase in problem size. In these tables, rather than time to
solve, the columns have the value
time to solve a problem of size 2n
ratio = ;.
time to solve a problem of size n
This should show the amount of growth in time to solve a larger problem.
That is, if the algorithm is linear in time, this column should have 21
and if it is quadratic in time, the column should show 22 = 4. This ratio
cannot be expected to give the exact value for a given method but it can
give a general impression of the computational complexity of each problem
and can then (hopefully) show the advantages or disadvantages of a given
method when time to solve is concerned.
The results in the tables suggest when the size of the problem is
doubled, the time to solve when using the reverse method is approximately
doubled. This is due to the fact that the reverse method is able to compute
the gradient and the Hessian vector product for a constant times the work
to compute the function alone.
When the size of problem is doubled for the forward AD method
with no sparsity taken into account, the time to solve approximately
quadruples. This is indicative of the fact that gradients and Hessian vector
products are computed linearly in time.
It should be noticed that there are advantages to considering
any sparsity in the problem. When the size of the problem is doubled,
22
Table 6.1. Time results of three
methods with Rosenbrocks func
tion________.__________________
AD method
n SPA FOR REV
8 3.57 8.46 3.13
16 10.66 26.14 5.5
32 21.2 97.44 10.88
64 72.17 400.57 22.24
128 204.1 1554.52 41.52
256 640.66 6567.11 155.18
512 230.09
Table 6.2. Time results of three
methods with Powells function
AD method
n SPA FOR REV
8 4.28 15.78 6.88
16 14.28 41.36 12.79
32 21.09 172.58 28.17
64 89.80 594.08 47.47
128 271.22 1784.23 84.92
256 684.31 8817.59 153.24
512 433.21
Table 6.3. Time results of three
methods with Problem 3_________________
AD method
n SPA FOR REV
8 4.91 43.90 19.60
16 13.73 60.86 10.88
32 55.31 201.74 34.28
64 330.28 2216.91 58.27
128 545.36 4440.12 190.71
256 512 2605.17 21086.65 312.53 1013.53
Table 6.4. Ratio of time results
of three methods with Rosenbrocks
function_________________
AD method
n SPA FOR REV
8 2.99 3.09 1.76
16 1.99 3.73 1.98
32 3.40 4.11 2.04
64 2.83 3.88 1.87
128 3.14 4.22 3.74
256 1.48
Table 6.5. Ratio of time results of
three methods with Powells func
tion
AD method
n SPA FOR REV
8 3.34 12,62 1.86
16 1.48 4.17 2.20
32 4.26 3.44 1.69
64 3.02 2.92 1.79
128 2.52 5.08 1.80
256 2.83
Table 6.6. Ratio of time results
three methods with Problem 3
AD method
n SPA FOR REV
8 2.98 1.38 0.56
16 4.03 3.31 3.15
32 5.97 10.99 1.70
64 1.65 2.00 3.27
128 4.78 4.75 1.38
256 3.24
of
23
the time to solve goes up at a rate that is faster than linear but slower
than quadratic. It is not linear which causes it to take more time than
the reverse method in general but it is considerably faster than the for
ward method. It should also be noted that for small numbers of variables,
forward AD is comparable to reverse AD and sparse forward AD is some
times faster. This result has been found to be true in practice and, for
some problems such as Taylor methods for solving ordinary differential
equations and problems of low dimension, the forward method of AD is
generally felt to be superior.
Problem 3 turned out to be difficult for all methods to solve. It
differs from the other two in that as the size of the problem is increased,
the Hessian matrix becomes more ill conditioned. With Rosenbrocks and
Powells functions, the condition number of the Hessian matrix is constant
for any number of variables. This is what causes all of the methods to take
considerably longer to solve this problem and have erratic computation
times.
In Dixon and Prices [2] paper, they compare the truncated New
tons method with forward AD to E04KDF, the NAG modified Newton
code, OPCG, the Hatfield Polytechnic OPTIMA library conjugate gra
dient subroutine [9], and OPVM, the OPTIMA variable metric code [9].
They found that in all problems considered, the truncated Newtons method
with forward AD took approximately three times as long to solve as
OPCG, was comparable to OPVM and was slower than E04KDF when
the dimension of the problem was small. When the size of the problem
was increased, OPVM and E04KDF were both slower than the truncated
24
Newtons method.
Since the reverse method was significantly faster for all problems
that were considered, the research is fairly confidant is recommending the
truncated Newtons method with reverse automatic differentiation as an
optimization technique. Using any other optimization method, reverse AD
should give superior results whenever gradients or Hessians are required.
The major drawback to reverse AD is the memory requirements.
25
APPENDIX A
TEST PROBLEMS
The test problems that were used to compare the forward and
reverse methods are as follows.
Problem 1. Extended Rosenbrocks function:
n/2
?(*) = Â£ 100(*2i ~ Ai\f + (1 2i 1 )2
i=l
X0 = (1.2,1,1.2,1,..., 1.2,1)T
X* = (l,l,l,...,lf
Problem 2. Powells function:
F(x) = E"Â£i (*
f* (4j2 ) 10(x4j'_3
X0 = (3,1,0,1,3,1,0,1,... ,3,1,0,If
= (0,0,0,. ..,0)r
Problem 3.
F(x) = Â£*(2*;
i=2
i_l)2 + (i l)2
X0 = (l,
X* = (1,
111
21/2 23/4 27/8 2(2
1
1 l)/2r1
)
26
BIBLIOGRAPHY
[1] R. Dembo and T. Steinhaug. Truncatednewton algorithms for large
scale unconstrained optimization. Mathematical Programming,
26:1904212, 1982.
[2] L. Dixon and R. Price. Truncated newton method for sparse uncon
strained optimization using automatic differentiation. Journal of
Optimization Theory and Applications, 60(2):261275, 1989.
[3] A. Griewank. Achieving logarithmic growth of temporal and spatial
complixity in reverse automatic differentiation. Preprint MCSP228
0491, Argonne National Laboratory, Math and Computer Science Di
vision, Argonne, IL, May 1991.
[4] D. Juedes A. Griewank and J. Srinivasan. Adolc, a package for the
automatic differentiation of algorithms written in c/c++. Preprint
MCSP1801190, Argonne National Laboratory, Math and Computer
Science Division, Argonne, IL, February 1991.
[5] M. Iri. Simultaneous Computation of Functions, Partial Derivatives,
and Estimates of Rounding Errors Complexity and Practicality.
Japan Journal of Applied Mathematics, 1(2):223252, 1984.
[6] M. Iri and K. Kubota. Methods of fast automatic differentiation and
applications. In Extended English translation of the Japanese
paper presented at the 7th Mathematical Programming Sym
posium, Nagoya, Japan, November 1986.
[7] D. Luenberger. Linear and Nonlinear Programming. Addison
Wesley, Reading MA, 2nd edition, 1989.
[8] L. Micholotti. Mxyzptlk: A practical, userfriendly ctf implimenta
tion of differentiation algebra: Users guide. Technical Memorandum
FN535, Fermi National Accelerator Laboratory, Batavia, IL, January
1990.
[9] Optima manual. Numerical Optimization Center, Hatfield Polytech
nic Hatfield, Hertfordshire, England, 1984.
[10] J. Ostrovsoii, G. Wolin and Borisov W. Uber die brechnung von
ableitungen. Wissenschaftliche Zeitschrift der Technischen
Hochschule fur Chemie, LeunaMerseburg, 13:382384, 1971.
[11] B. Speelpenning. Compiling Fast Partial Derivatives of Func
tions Given by Algorithms. PhD thesis, Department of Com
puter Science, University of Illinois at UrbanaChampaign, Urbana
Champaign, IL, 1980.
[12] R. E.^Wengert. A simple automatic derivative evaluation program.
COMM. ACM, 7:463464, 1964.
28

PAGE 1
AUTOMATIC DIFFERENTIATION AS APPLIED TO THE TRUNCATED NEWTON'S METHOD FOR UNCONSTRAINED NONLINEAR OPTIMIZATION by RONALD JOHN VAN IWAARDEN B. A., University of Colorado at Denver, 1989 A thesis submitted to the Faculty of the Graduate School of the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Master of Science Applied Mathematics 1992
PAGE 2
This thesis for the Master of Science degree by Ronald John Van Iwaarden has been approved for the Department of Applied Mathematics by Weldon Lodwick Date
PAGE 3
Van Iwaarden, Ronald John (M.S., Applied Mathematics) Automatic Differentiation as Applied to the Truncated Newton's Method for Unconstrained Nonlinear Optimization Thesis directed by Associate Professor Weldon Lodwick This paper explores using both forward and reverse automatic differentiation (AD) to solve the standard unconstrained optimization problem. A sparse version of forward AD is also implemented for com parison. The three methods are compared when the dimensionality of the problem is increased. As the dimension of the problem increases, two fac tors of importance are both memory requirements and the time involved to solve a problem of given size. The research will show that reverse AD is the superior method when time is the largest constraint and forward AD (either sparse or non sparse) is the superior method when memory requirements are of the greatest concern. The form and content of this abstract are aupr
PAGE 4
CONTENTS CHAPTER 1 INTRODUCTION: NEWTON'S METHOD 2 TRUNCATEDNEWTON'S METHODS ... 2.1 Deeper Analysis of the Conjugate Gradient Method for Solving the Truncated Newton's Method. 3 FORWARD AUTOMATIC DIFFERENTIATION 3.1 Example ................... 4 REVERSE AUTOMATIC DIFFERENTIATION 5 COMPARISON OF THE FORWARD AND REVERSE METHODS ................ 6 IMPLEMENTATION AND RESULTS APPENDIX A TEST PROBLEMS BIBLIOGRAPHY ..... 1 4 5 6 8 11 15 19 26 27
PAGE 5
I I I I CHAPTER 1 INTRODUCTION: NEWTON'S METHOD We will consider the unconstrained optimization problem minf(x) xEi.RN where f is twice continuously differentiable and its level sets are bounded. Our goal will be to find some x* such that f(x*)s;f(x) This may not always be possible so we seek a local minimum which, in general, is a much easier problem. That is, we seek x* for which there exists 5 > 0 such that f(x*)::; f(x) 1:/x such that llxx*ll::; 5. If x* is a local minimizer than the first order necessary condition IS: \1 f(x*) = 0. If the Hessian matrix at x*, denoted HJ(x*), is positive definite, then the inequalities above hold with strict inequality when x oJ x* and we refer to this as a strong global or local minimum. A classical method for solving this problem is Newton's method. To develop Newton's method, we approximate f(x) by Taylor's theorem,
PAGE 6
I I and assume that f is given exactly by the first three terms of its Taylor's series expansion. That is, If we then minimize this function over all p, the first order necessary conditions give \lf(x) + HJ(x)p = 0, and Newton's method uses the solution p = H"/(x)\lf(x). So, to find the search direction, an NxN system of equations must be solved. This yields the following algorithm: Step 1. Solve Ht(x)p = \lf(x) Step 2. Set x <x + p Step 3. If jjv.f(x)jj < E, stop; else, go to step 1 The main advantage and star attraction of Newton's method is that if the function is sufficiently smooth and x is close enough to x", where x" is a strong local minimizer, then Newton's method is quadratically convergent. This advantage, however, is not generally enough to counteract the following disadvantages of Newton's method. (1) The method is not globally convergent. That is, it may not converge to a stationary point for an arbitrary x0 and may even diverge. (2) If the Hessian H 1(x) is singular, the method is not defined. (3) If the function is not convex, the set of directions given may not define a set of descent directions. 2
PAGE 7
I I ( 4) An l>fxN system of equations must be solved at each iteration. With the exception of ( 4), all of the above can be solved by a slight modification of Newton's method. Define (a)= f(x + ap) and let 0 < a :::; 1. Then the modified Newton's method is as follows. Step 1. Solve Ht(x)p = \lf(x) Step 2. Find a so that the following are satisfied:. a) (a)< (0) + c'(0)a b) (a) > (0) + (1c)'(0)a Step 3. Set x <x + ap Step 4. If li'V f(x )\\ < c, stop; else, go to step 1 The conditions in step 2 are known as the Goldstein test. These conditions help to control the global behavior of the method. Early in the calculation, they play a significant role, forcing the method to make a. step that minimizes the function at each step. Later in the calculations, they have less of a role, as the function becomes asymptotically approximated by the quadratic that we assumed in the development of Newton's method and, therefore, the step size (a) will approach 1. 3
PAGE 8
CHAPTER 2 TRUNCATEDNEWTON'S METHODS Newton's method does not in general show the quadratic be haviour that is the only advantage of using it when the present value for x is far from the solution. For this reason, it may not be wise to solve the Newton's equations completely when the present value for x is thought to be far from the solution. Dembo and Steinhaug [1] proposed usmg the method of conjugate gradients for solving Newton's equations. If we let the present solution to Newton's equations be given by Xk then the residual is given by rk = Hfxk 'V f(x ). The iterations are then truncated when the error, given by lh II /!!'V f( x) II, is 'small enough,' which gives the name TruncatedNewton's method. They proceed to show that this method is globally convergent to a local minimum, and, by specifying size of lhll/ II'V f( x) II, a prespecified convergence rate between linear and quadratic can be achieved. Using the conjugate gradient method to solve the Newton's equations has a very important property that makes it especially convenient for automatic differentiation. That property is that the conjugate graclient method never requires the Hessian matrix to be known explicitly, rather, only a Hessian vector product H 1(x)p is required at any point in the conjugate gradient algorithm. It is assumed that the function is in C2 which implies that the Hessian is symmetric, and so the product pT l!1(x)
PAGE 9
I is also acceptable. This allows for a space 1>avings when this algorithm is encoded for a computer program. 2.1 Deeper Analysis of the Conjugate Gradient Method for Solving the Truncated Newton's Method. For the purpose of analyzing the direction that is found by using the conjugate gradient method to solve the truncated Newton's equations, the research assumes that the function F(y) = AyyTb which, by letting x = yA Jb has the equivalent form f( x) = xT Ax bT A J b which is a pure quadratic plus a constant. The linear conjugate gradient method is designed to solve the problem Ax = b where A is symmetric and positive definite. If k < N iterations are performed by the linear solver, we will have found a new Xk that is closer to the solution x"when the norm induced by the matrix A is used. However, if k steps of the nonlinear conjugate gradient method had been applied to solving the nonlinear problem, the method would arrive at the same Xk. This is due to the fact that the optimum search direction and step size are given algebraically when the problem at hand is a pure quadratic. The reason for the similarity between the k truncated Newton's method and using k steps of the nonlinear conjugate gradient method is that the linear conjugate gradient method is developed by trying to minimize the function xT Ax which is the quadratic model we assumed. For these reasons, the truncated Newton's method using the linear conjugate gradient solver can be seen as interpolating between the conjugate gradient direction and the Newton direction. 5
PAGE 10
I I CHAPTER 3 FORWARD AUTOMATIC DIFFERENTIATION There are two types of automatic differentiation. The forward method has been around the longest with the pioneering works by R. E. Wengert in 1964[12] and is used by Dixon and Price [2] in their implementation of the truncated Newton's method. To implement forward AD, we define an algebra on the space Tn of ordered (n+l)tuples (x, x1 x(n)) where x(j) is an jdimensional tensor. For simplicity and since we require only second derivatives for the optimization method, we work with n = 2. In this case, any element x E T2 can be represented by (x,x1,x") where xis a scalar, X1 is anNvector and x" is an N xN matrix. If we let U, V E T2 and let ( x) E= C2 the following are all operations defined in the space T2 U=(u,u1,u") and 'V=(v,v1,v11). This gives U + V = ( u + v, u1 + v1 u11 + v") U V ( I I /1 II) = uv,u v,u v U V = (u v, uv' + vu', uv" + u'v'7 + v'u'7 + vu") (U) = ( ( u ), 1 ( u )u1 1 ( u )u" + "( u )u1u1T)
PAGE 11
These are the standard operations given by the algebra of Taylor's series which is just the implementation of the chain rule for the given functions. Given this notation, a variable x1 becon1es \vhere eJ is the ith unit vector and 0 is the NxN matrix of zeros. This requires a matrix store and is not what is needed for the conjugate gradient method since a matrix vector product is used. Dixon and Price [2] adapted forward AD to retain only a vector Hessian product as follows. Define the space 52 where any element u E S has the form u = ( u, u1 u11p) where u is a scalar, u1 p are N dimensionaJ vectors, and u11 is an NxN matrix. Then, given U, V E Sand (x) E C2 the operations on this space are given by: This gives U = (u,u1,u11p) and V = (v,v1,v11p) U L TT ( I + I fl 1 II ) ,v=u,v,u v,up,vp U V ( I I II II) =uv,uv,upvp U V = (u v,uv' + vu',v,1/'p + u'v'T p + v'u1 T p + vu"p) U /V = ( ujv, ( vu1uv1 ) fv2 ( v2u"pv( v1u1T p+u1v1T p)+2uv1v11puvv"p) / v")
PAGE 12
storage and faster operation for each Hessian vector product as the entire Hessian need not be computed. Masao Iri [6] has shown that computing the gradient and Hessian by the forward AD method requires w(f)O( N) operations, where w(f) is the work required to compute the function f( x ). To compute just the gradient and Hessian vector product requires at most w(f)6N operations or w(f)O( N) operations. 3.1 Example Let f(x,y) = sin(x3 + y). and consider the point (1,1). Then the function is differentiated as follows. The independent variables a; and y become x = (l,(l,ol,o) y = (1,(0,1?,0) Then we define the following temporary variables. The computations yield: T3 = T2 + y F = sin(T3 ) 8
PAGE 13
I .I [ f3l [6 oll T:, = 0, j I Ll 001_ F [ [ 3] [ cos(O)9sin(O) sm(O), cos(O) 1 3 sin(O) 3 sin(O) l j" sin(O) H: l, [: : ll Now, F = (f(x), vf(x),H1(x)), which is what was desired. The reader should note that, while this example was done using symbolic manipulation, on the computer, rather than using the symbols, the actual values of those symbols would be used. Now we will rework the same example where only vector stores are used. Let the vector p be given by the first unit vector and consider the same point (1,1). The independent variables x andy become ( T T X= 1,(1,0) ,(0,0) ) y = (1,(0,1?,(0,0)') Then we define the following temporary variables. The computations yield: T, [1, [ : l [ : ]] T, [ 1, [ : l [ : ]] T, [ 0, [ : l [ : ]] 9
PAGE 14
[ [ 3] [ cos( 0) 9 sin( 0) l 11 sm(O), cos(O) , 1 3sin(O) JJ F = H: l, [: ll Note that the end result gave the first column of the hessian as expected. 10
PAGE 15
I I I I I CHAPTER4 REVERSE AUTOMATIC DIFFERENTIATION Reverse AD has been around for quite a while, but it did not exist in its present form until about 1980 when B. Speelpenning [11] wrote his doctoral thesis. The earliest known paper that explici ily recognizes the potential savings that arise in reverse AD is by G. M. Ostrowski et al. [10]. Similar to forward AD, it is based on the chain rule but in a slightly different form than we are used to. Since the main application of reverse AD is to compute the derivatives of functions on the computer, we restrict our discussion to the computer functions
PAGE 16
X y Figure 4.1: Computational graph for F(x,y) == sin(x3 + y) with two input arcs that both originate from the node x and a second 1t v node with input arcs from both x and v1 Once, the computational graph has been created, the technique of reverse AD can be applied to the computational graph m order to compute the needed derivatives. For this process, we need to compute the partial derivatives of any arc a == ( v, Vj ), where v, is the node where the arc originates and Vj is the terminal node. The computational graph is then extended to make a new computational graph that computes the partial derivatives. The new graph is created by placing a mirror image of the original graph next to the original. For this mirror image, the direction of each arc is reversed and, in the middle of each arc, a multiplication node is placed. All of the original nodes in the mirror image become addition nodes with the exception of the node which corresponded to the function value F. This node receives the value 1. Between the two graphs, place one intermediate node for every node in the original graph. Finally, the value of 0 is assigned to every + node in the mirror graph. 12
PAGE 17
I I For each node v, in the original graph that inputs to node v1 where node vi also has inputs from node vk, three arcs are made that connect these three nodes with the intermediate node w;_J. This node contains the function that corresponds to This node then has an arc that connects to node u;..; in the new computational graph. The node u;,j is the multiplication node between nodes u; and Uj. To illustrate this, consider the previous example. OVj ox = 3v,jx 8v2 =1 [)y fJv2 =l 8v1 fJv3 = COSV'> fJv2 So, node w2 3 contains the function cos v2 Since this function depends only on inputs from node v2 the arc from v3 does not need to be included in the computational graph. If we continue in this fashion, w1,2 = 1 and has no inputs. Node wy,J = 1 and similarly has no inputs. Node Wx,l = 3v,jx, where it has its first input from node v1 and its second input from node y. Putting these into the computational graph, the gra.ph for v F( x, y) appears in figure 4.2. For this graph, w2,3 = cosv2, w1,2 = 1, Wx,I = 3 vifx, Wx.2 = 1, I and WF,:J = 1. For the mirror graph, u; = +, and u,,1 = This is I a new computational graph that will give the gradient of the function F(x,y) = sin(x3 +y). Once this computational graph has been traversed, = Ux and = uy. This graph has a similar 'geometry' to the original computational graph and so it is rarely computed in practice. 13
PAGE 18
sri Figure 4.2. Computational graph for the gradient ofF( x, y) = sin( x'l + y) If the Hessian is needed, the new computational graph can be 'differentiated' N times since the Hessian is just the gradient of the gradient. Again, this computational graph is not significantly different from the original graph and so it does not to be explicitly computed. Also, the Hessian vector product is easily computed using only the original compu tational graph. For a discussion on the algorithm and the computational complexity, the reader is referred to Iri [6] 14
PAGE 19
CHAPTER 5 COMPARISON OF THE FORWARD AND REVERSE METHODS It was mentioned earlier that both the reverse and forward methods of AD can be viewed simply as different interpretations of the chain rule. When the forward method is implemented, the total derivatiYes of node v; with respect to x; are known when the computation of f reaches node i. (1) (2) (3) This can be represented as follows. = 2x ax Bv2 0 Oy  t = 2x ox =y &y = 3xi cos(x3 + y) = cos(x3 +y) Since V3 = f, we are done. This is how the forward or top up method of AD is implemented. Note that the symbol x3 + y is used twice and was already computed when the function f(x,y) = sin(x3 + y) was evaluated. The total number of additional computations that must be done in order to get the derivative are 6 multiplications, one addition, and (assuming that xn is evaluated by e" lnx) five transendental function evaluations. In contrast to this, the reverse method computes the entire function at first and saves much of the intermediate values along with the,elementary partial derivatives of each node with respect to its inputs. Then,
PAGE 20
the graph is traversed in the reverse direction (top down). The partial derivative of f with respect to node v; is known when the computation reaches that point. When this is finished, the partial derivative off with respect to each of the original inputs x1 Xn is known for each X;. For the example that was used, this can be represented as follows. (1) (2) (3) !'a= COSVz Ov2 = COSV2 = COSVz av, ( 4) '7J; = 3vd x cos Vz Again, since v3 = f, we have computed the gradient of the function. Note that in this case, the value v2 = x3 + y has already been computed and stored so it does not need to be computed again. Also, instead of computing 'Jr: = 3x2 which requires transendental function eval uations, we compute = 3v1 / x cos v2 which saves computational work. For the reverse method, the total amount of additional computational effort expended is one multiplication, one division, and one transcendental function evaluation. If we assume that a division costs approximately two multi plications, a transcendental function costs approximately three multiplications, and addition and subtraction cost the same as a multiplication, then we get the following approximations for our function and derivative evaluation costs for the given function. To evaluate the function alone requires approximately 11 multiplications. For the gradient by means of forward AD, we need an additional 22 multiplications. For the gradient by means of reverse AD, we need an additional 6 multiplications. 16
PAGE 21
So, even with this small example, the advantages given by re verse AD in terms of computational time are quite evident. If a larger problem is solved, then the difference is even more striking than what is shown here. The most obvious disadvantage for reverse AD is the need to store the computational graph and all of the intermediate variables. For forward AD, only a vector store is required to compute the gradient. For reverse AD, the author is not aware of any bounds on the growth of the computational graph. However, in some recent work by Andreas Griewank [3], it has been shown that it is possible to achieve logarithmic growth in the spatial complexity of the computational graph with a trade off of logarithmic growth in the temporal complexity. The primary advantage of reverse AD has been the speed at which derivatives are computed. Iri [6] has shown that the work required to compute the function and its gradient requires at most 5w(f) evalua tions. He also shows that the work required to compute the function, its first derivative, and the Hessian vector product does not exceed 21w(f). In other words, the time requirement for evaluating the product of the Hessian and a vector requires fewer than a constant times the amount of work required to evaluate the function itself. To compute the full Hessian requires less than (7+14N)w(f) operations. It is very surprising to note that the reverse method is of an order of magnitude faster than the forward method. This remarkable fact will become even more evident when both methods are used for nonlinear optimization. The reason for reverse AD being so much faster is that the amount of work required to compute the gradient is dependant oi1 the 17
PAGE 22
I I size of the computational graph. Since the computational graph for the gradient is about three times the size of the graph for the function alone, this suggests that the amount of work to compute the gradient is on the order of three times the work to evaluate the function. This result has been borne out in practice although the constant can be greater than 3 since the ti.me required to access the values from memory has not been considered. 18
PAGE 23
CHAPTER 6 IMPLEMENTATION AND RESULTS Many implementations of AD have been written for Cor C++ [4] [8] on unix systems that allow for paging main memory out to disk. The computational graph is built each time that the function is evaluated, and then traversed in order to get the desired derivatives. The advantages of this method is that this allows for differentiation of conditional branches. Its disadvantage is that the building of the computational graph can be time consuming since it must be in a form that allows for traversal both up and down. Also, this requires a large amount of memory that must be allocated dynamically, which can slow down the procedure even more if memory is paged out to disk. A second implementation is to write a program that will differ entiate another function. That is, the function will read in a procedure that describes the function f( x ), and then outputs another procedure that will compute f(x) and a specitled number of derivatives. The advantage of this method is speed because this procedure can then be linked into an optimization algorithm. Another advantage is that if a large amount of memory is required, the paging can be handled much more efficiently. A disadvantage is that conditional branches are not as easily handled in this case. A third method is to have the computational graph computed
PAGE 24
I once and then retained in main memory. This has the advantage of speed since the computational graph does not need to be recomputed. The disadvantage is again that memory can not be handled very efficiently. Another disadvantage is that conditional evaluations in the function cannot be handled very easily. For this research, the third method was used. The test problems that were used by Dixon [2] did not involve any conditional branches, and the intent was to compare forward and reverse AD on some classic test problems. The programs were written in Turbo Pascal version 6.0 for use on an IBM PC operating under MSDOS version 5.0. For this reason, paging out to disk was not allowed and the maximum number of standard variables was limited to a total of 64k, while dynamic memory had a maximum of approximately 500k. The computational graph was created using a combination of linked lists and arrays, which allow for direct access to only limited areas of the computational graph, but that was all that was needed. The function was stored in reverse polish notation that was in an external ASCII file. This file was then read into memory and the computational graph was built when the program was first started. The time to compute the computational graph was not included in the times later on. However, the time to compute the computational graph was not more than about 8 seconds, even when the number of variables was close the the maximum. The maximum number of variables was restricted to of approximately 800 since paging memory was not possible. The forward AD method was implemented exactly as described, storing only a vector, Hessian product. Unfortunately, the 64k restriction 20
PAGE 25
I limited the maximum number of variables to approximately 400. If this restriction was lifted and all of main memory was available, the maximum number of variables would be about 3500. For purposes of comparison, a sparse version was also implemented. Instead of using a linked list implementation, the sparsity was exploited by using an array of values and of posjtions. Only the nonzero values of the vector are modified and those are identified through the array of positions. The problem of copying values when passing vectors to procedures was overcome by passing the addresses of those vectors. This makes the modification similar in speed as a linked list implementation. In order to directly compare the reverse, forward, and sparse for ward AD implementations for the truncated Newton's method, all meth ods used the exact same optimization code for comparison. The line search is done by means of first checking to see if a = 1 is an acceptable step size according to the Wolf Test [7]. If it is not acceptable, the new point is found by fitting a cubic interpolant to ( x), '( x), ( 1 ), '( 1) where (a) = f(x + aP) and Pis the search direction. If this again does not find an acceptable step size, an exact line search is implemented to find the first zero of'( x) to within a given tolerance. The tests were done on an IBM compatible running a 80386 at 33 MHZ with no math coprocessor. All methods were terminated when II \7 fll < 105 For the first three tables, Code refers to which code was used (REV for reverse AD, FOR for forward AD, and SPA for sparse forward AD). Time refers to the amount of time in seconds required to solve the problem with the given method. The time was obtained by asking 21
PAGE 26
'I I for the system time just before the start of the optimization algorithm and then asked for again just after the end of the algorithm. These two values were then subtracted appropriately to obtain the figures that are given. The second three tables concern themselves with the growth ratio in time for a given increase in problem size. In these tables, rather than time to solve, the columns have the value time to solve a problem of size 2n ratw = time to solve a problem of size n This should show the amount of growth in time to solve a larger problem. That is, if the algorithm is linear in time, this column should have 21 and if it is quadratic in time, the column should show 22 = 4. This ratio cannot be expected to give the exact value for a given method but it can give a general impression of the computational complexity of each problem and can then (hopefully) show the advantages or disadvantages of a given method when time to solve is concerned. The results in the tables suggest when the size of the problem is doubled, the time to solve when using the reverse method is approximately doubled. This is due to the fact that the reverse method is able to compute the gradient and the Hessian vector product for a constant times the work to compute the function alone. When the size of problem is doubled for the forward AD method with no sparsity taken into account, the time to solve approximately quadruples. This is indicative of the fact that gradients and Hessian vector products are computed linearly in time. It should be noticed that there are advantages to considering any sparsity in the problem. When the size of the problem is doubled, 22
PAGE 27
Table 6.1. Time results of three methods with Rosenbrock's func tion AD method n SPA FOR REV 8 3.57 8.46 3.13 16 10.66 26.14 5.5 I 32 21.2 97.44 I 10.88 64 72.17 400.57 22.24 I 128 204.1 1554.52 41.52 256 640.66 6567.11 155.18 512 230.09 Table 6.2. Time results of three ;;: I s 4.28 15.78 6.88 I 161 14.28 41.36 I 12.79 I 32 21.o9 172.58 1 28.17 64[ 89.80 594.08 47.47 128 I 271.22 I 1784 .. 23 84.92 i 256 1 684.31 I 8817.59 153.241 I s12 I ' 433 21 L I I Table 6.3. Time results of three Table 6.4. Ratio of time results methods with Problem 3 I AD method I n SPA FOR I 8 4.!)1 43.90 I 16 13.73 60.86 32 55.31 201.74 64 330.28 2216.91 128 545.36 4440.12 I I 2605.17 21086.65 REV 19.60 10.88 34.28 58.27 19o.n 1 312.53 I 1013.53 i of three methods with Rosen brock's function method n 1 SPA I FOR I REV I 8 2.99 I 3.091 1.76 'I 16 1.99 I 3. 73 l.9s 1 32 3.40 1 4.11 2.04 I 64 2.83 1 3.88 1.s1 128 3.14 4.22 3.74 256 1.48 Table 6.5. Ratio of time results of . th th d h p ll' f Table 6.6. Ratio of time results of ree me o s Wlt owe s uncth h d h p bl 3 ree met o s w1t ro em tion r F:fi AD method AD method n SPA I FOR I REV I 8 2.98 1.38 0.56 16 j 4.03 3.31 I 3.15 n SPA FORi REV 8 3.34 12.6211.86 I 16 1.48 32 4.26 64 3.02 128 2.52 256 4.17 2.20 3.44 1.69 2.92 1.79 5.08 1.80 2.83 I W2 I 5.97 10.991' 1.70 I 411.65 2.00 3.27 4.78 i 4.751' 1.38 j 6 I I I 3.24 I 23
PAGE 28
the time to solve goes up at a rate that is faster than linear but slower than quadratic. It is not linear which causes it to take more time than the reverse method in general but it is considerably faster than the forward method. It should also be noted that for small numbers of variables, forward AD is comparable to reverse AD and sparse forward AD is some times This result has been found to be true in practice and, for some problems such as Taylor methods for solving ordinary differential equations and problems of low dimension, the forward method of AD is generally felt to be superior. Problem 3 turned out to be difficult for all methods to solve. It differs from the other two in that as the size of the problem is increased, the Hessian matrix becomes more ill conditioned. With Rosenbrock's and Powell's functions, the condition number of the Hessian matrix is constant for any number of variables. This is what causes all of the methods to take considerably longer to solve this problem and have erratic computation times. In Dixon and Price's [2] paper, they compare the truncated Newton's method with forward AD to E04KDF, the NAG modified Newton code, OPCG, the Hatfield Polytechnic OPTIMA library conjugate gra dient subroutine [9], and OPVM, the OPTIMA variable metric code [9]. They found that in all problems considered, the truncated Newton's method with forward AD took approximately three times as long to solve as OPCG, was comparable to OPVM and was slower than E04KDF when the dimension of the problem was small. When the size of the problem was increased, OPVM and E04KDF were both slower than the truncated 24
PAGE 29
Newton's method. Since the reverse method was significantly faster for all problems that were considered, the research is fairly confidant is recommending the truncated Newton's method with reverse automatic differentiation as an optimization technique. Using any other optimization method, reverse AD should give superior results whenever gradients or Hessians are required. The major drawback to reverse AD is the memory requirements. 25
PAGE 30
APPENDIX A TEST PROBLEMS The test problems that were used to compare the forward and reverse methods are as follows. Problem 1. Extended Rosenbrock's function: n/2 F(x) = L lOO(xzi x3i_1 ) 2 + (1Xzid X0 = ( 1.2, 1, 1.2, 1, ... 1.2, 1? X*= (1, 1, 1, ... 1jY Problem 2. Powell's function: T Xo = (3,1,0,1,3,1,0,1, ... ,3,1,0,1) X*= (O,O,O, ... ,O)T Problem 3. n F(x) = I;i(2xfXi_1 ) 2 + (x1 1)2 i=2 X0 = (1,1, ... ,1) X* = (l _1 __ 1 __ 1_ 1 ) '21/2' 23/4' 27/8''' '' 2(zn'l)/2nl 26
PAGE 31
BIBLIOGRAPHY [1] R. Dembo and T. Steinhaug. Truncatednewton algorithms for large scale unconstrained optimization. lVIathematical Programming, 26:190212, 1982. [2] L. Dixon and R. Price. Truncated newton method for sparse uncon strained optimization using automatic differentiation. Journal of Optimization Theory and Applications, 60(2):261275, 1989. [3] A. Griewank. Achieving logarithmic growth of temporal and spatial complixity in reverse automatic differentiation. Preprint MCSP2280491, Argonne National Laboratory, Math and Computer Science Di vision, Argonne, IL, May 1991. [4] D. Juedes A. Griewank and J. Srinivasan. Adolc, a package for the automatic differentiation of algorithms written in c/c++Preprint MCSPlS01190, Argonne National Laboratory, Math and Computer Science Division, Argonne, IL, February 1991. [5] M. Iri. Simultaneous Computation of Functions, Partial Derivatives, and Estimates of Rounding Errors Complexity and Practicality. Japan .Journal of Applied Mathematics, 1(2):223252, 1984. [6] M. Iri and K. Kubota. Methods of fast automatic differentiation and applications. In Extended English translation of the Japanese paper presented at the 7'h Mathematical Programming Symposium, Nagoya, Japan, November 1986. [7] D. Luenberger. Linear and Nonlinear Programming. Addison Wesley, Reading MA, 2nd edition, 1989. [8] 1. Micholotti. Mxyzptlk: A practical, userfriendly c++ implimenta tion of differentiation algebra: Users guide. Technical Memorandum FN535, Fermi National Accelerator Laboratory, Batavia, IL, January 1990. [9] Optima manual. Numerical Optimization Center, Hatfield Polytechnic Hatfield, Hertfordshire, England, 1984.
PAGE 32
I l [10} J. Ostrovsoii, G. Wolin and Borisov W. Uber die brechnung von ableitungen. Wissenschaftliche Zeitschrift der Technischen Hochschule fur Chemie, LeunaMerseburg, 13:382384, 1971. [11] B. Speelpenning. Compiling Fast Partial Derivatives of Func tions Given by Algorithms. PhD thesis, Department of Com puter Science, University of Illinois at UrbanaChampaign, Urbana Champaign, IL, 1980. [12] R. E. Wengert. A simple automatic derivative evaluation program. COMM. ACM, 7:463464, 1964. 28
