
Citation 
 Permanent Link:
 http://digital.auraria.edu/AA00001807/00001
Material Information
 Title:
 Smart simulated annealing
 Creator:
 Heine, George Winfield
 Publication Date:
 1994
 Language:
 English
 Physical Description:
 vii, 75 leaves : ; 29 cm
Subjects
 Subjects / Keywords:
 Simulated annealing (Mathematics) ( lcsh )
Simulated annealing (Mathematics) ( fast )
 Genre:
 bibliography ( marcgt )
theses ( marcgt ) nonfiction ( marcgt )
Notes
 Bibliography:
 Includes bibliographical references (leaves 7275).
 General Note:
 Department of Mathematical and Statistical Sciences
 Statement of Responsibility:
 by George Winfield Heine III.
Record Information
 Source Institution:
 University of Colorado Denver
 Holding Location:
 Auraria Library
 Rights Management:
 All applicable rights reserved by the source institution and holding location.
 Resource Identifier:
 35059443 ( OCLC )
ocm35059443
 Classification:
 QA402.5 .H45 1994a ( lcc )

Downloads 
This item has the following downloads:

Full Text 
SMART SIMULATED ANNEALING
by
George Winfield Heine, III
B. A., Reed College, 1971
M. S., University of Colorado, 1989
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
1994
This thesis for the Doctor of Philosophy
 degree by
George Winfield Heine, III
has been approved for the
Department of
Mathematics
by
iw Arl I ^
1994 by George Winfield Heine, III
All rights reserved.
Heine, III, George Winfield. (Ph.D., Applied Mathematics)
Smart Simulated Annealing
Thesis directed by Professor Bennett L. Fox
A naive implementation of simulated annealing is inefficient; at low
temperatures it spends almost all its time proposing moves and then rejecting
them. We present an algorithm, due to Fox, which reduces time at selfloops
while preserving convergence properties. We show that the computer time
spent in each distinct state by Foxs algorithm is bounded (a.s.) by a constant
which can be determined in advance.
This result suggests a refinement of Foxs algorithm: when computer
time at a given state exceeds the predetermined constant, exit to another state.
A technical lemma needed to prove the refinement works has surprising corol
laries. For example, we use the lemma to analyze simulated annealing when
the objective function is not precisely known and must be estimated.
Another way of speeding up simulated annealing keeps temperature
constant while in a state, updating when the state changes. Thus sojourn time
in a state is easily simulated by sampling from a geometric distribution. Using
methods of Connors and Kumar, we show that in this scenario the proportion
of time spent in an optimal state converges to one almost surely.
It might be thought that the various convergence results in simulated
annealing are an artifact, since at low temperatures the algorithm spends large
blocks of time at optimal states. However, we look at the embedded chain of
pairwise distinct states and find that it too exhibits convergence, though in
general to a slightly different set of states.
IV
This abstract accurately represents the content of the candidates thesis. I
recommend its publication.
Signed
Bennett L. Fox
v
ACKNOWLEDGEMENTS
Many people have contributed, directly or indirectly, to the comple
tion of this thesis. Many U.C.D. faculty members have given me inspiration,
technical advice, and moral support; I would like to especially thank Bill Briggs,
Stan Payne, and Rich Lundgren, as well as the members of my thesis commit
tee. Of course, the greatest thanks are due my advisor, Ben Fox. This project
would not have been possible without his knowledge and insight, as well as his
constant support.
Special thanks goes to Arlan Ramsay at C.U.Boulder, both for having
suggested I investigate the U.C.D. graduate program in the first place, and,
together with his son, Bruce, for having written the software program which
produced this thesis.
Thanks also to management and coworkers at the Bureau of Land
Management, who have cheerfully put up with so many irregularities in my
schedule.
I also gratefully acknowledge the unwavering faith in my ability to
succeed given by friends and family over the years, especially my parents,
George and Mary Heine, my sister Nancy and her family, and my inlaws,
Charles and Audrey Barnett. Most of all, I am indebted to my wife, Janet
Barnett, whose unswerving love has been put to the test in many difficult
moments. It is to her that I dedicate this thesis.
vi
CONTENTS
CHAPTER
1 INTRODUCTION.......................................... 1
1.1 The Simulated Annealing Algorithm................. 2
1.2 Previous Work..................................... 4
2 CONVERGENCE OF A LOOPSKIPPING ALGORITHM . 7
2.1 Summary of Results................................ 9
2.2 Proofs of Theorems............................... 12
2.2.1 Proof of Theorem 2.1....................... 15
2.2.2 Proof of Theorem 2.2....................... 16
2.2.3 Proof of Theorem 2.3 . .............. 17
2.3 Efficiency of Foxs Algorithm.................... 20
3 SIMULATED ANNEALING WITH OVERRIDES............. 25
3.1 Simple Examples................................. 27
3.1.1 Random Restart............................. 27
3.1.2 Greedy Overrides .......................... 27
3.1.3 Perverse Overrides......................... 28
3.2 Dynamic Neighborhoods ........................... 28
3.3 Overrides: The Main Result....................... 30
4 SIMULATED ANNEALING WITH A NOISY OBJECTIVE
FUNCTION............................................. 34
4.1 Formal Setup..................................... 35
vii
4.2 Consistent and Stable Estimators..................... 37
4.3 Asymptotically Stable Estimators: First Example...... 39
4.4 Asymptotically Stable Estimators: Second Example ... 42
5 PATHWISE CONVERGENCE WITH ADAPTIVE COOLING 45
5.1 Main Results ........................................ 46
5.2 SamplePath Convergence for Deterministic Cooling ... 48
5.3 SamplePath Convergence for Adaptive Cooling......... 50
5.4 An Example Where Pathwise Convergence Fails ......... 53
6 CONVERGENCE OF THE PRUNED CHAIN...................... 57
6.1 Convergence in Probability........................... 58
6.2 Pathwise Convergence for the Pruned Chain ........... 60
6.3 Si Compared to S0 .................................. 64
7 DIRECTIONS FOR FUTURE WORK .............................. 67
7.1 Hitting Time........................................ 67
7.2 Tabu Penalties and Simulated Annealing .............. 68
7.3 Noisy Objective Function Revisited ................. 69
7.4 Refinement of Algorithm QUICKER1 ................. 70
BIBLIOGRAPHY.................................................... 72
vm
CHAPTER 1
INTRODUCTION
Simulated annealing is attracting increasing attention as a method
of combinatorial optimization. As knowledge of the algorithm spreads, re
searchers find ever more diverse uses, reporting applications such as refinement
of protein structures [35], redrawing of legislative districts [5], incore nuclear
fuel management [33], and contaminated groundwater remediation [34].
In certain senses and under certain conditions, made explicit below,
the Markov chain generated by simulated annealing is convergent. However,
the algorithm has been criticized for being slow and expensive. Perhaps this is
partly because most theoretical studies have given little attention to practical
ways to implement the algorithm.
The results in this thesis were originally motivated by attempts to find
an efficient algorithm which maintains the desirable convergence properties.
In the process, we have discovered additional theoretical results, interesting in
their own right.
To probabilists, simulated annealing presents an interesting special
case of a timeinhomogeneous Markov chain. The properties of stationary
Markov chains have long been wellunderstood, but theory for the general
nonstationary case still appears far off. Nonstationary Markov chains are better
understood today than a few years ago, partly because of research in simulated
annealing. We hope that the results outlined in this thesis will contribute to
the growth of knowledge in this important area.
1.1 The Simulated Annealing Algorithm
The goal of simulated annealing is to minimize an objective function
U on a set of states S. We take S to be finite unless the contrary is explic
itly stated. The algorithm generates a sequence of moves between successive
neighboring states. At each step of the algorithm, a candidate move is pro
posed. If the move is to a state with lower or equal objectivefunction value,
it is accepted; if the move is to a state with a higher objectivefunction value,
it is accepted with a small probability.
We now formalize these notions. By problem we mean a triple
(S,U,R), where S' is a (finite) set of states, U is a realvalued function on
S, and R is the IS^ x Sj stochastic matrix of tentativemove probabilities.
We assume without loss of generality that the minimum of U over S' is 0.
Sometimes we use energy level as shorthand for objectivefunction value.
Simulated annealing is a (generally nonstationary) Markov chain {Xt :
t = 1,2,...} on a set of states S. The transition probabilities are given by
Pr(At+1 = y\Xt = x) = R(x,y)e(t)uW+
for x y, and
Pv(Xt+1 =x\Xt = x) = lJ2 Pr(*<+i = y\Xt = x),
where e() is implied by the cooling schedule, described below.
Denote by So the set of states on which U is minimized. The neigh
borhood N(x) of a state x consists of those states y for which R(x,y) > 0. A
2
path is a sequence of states x0,...,xp such that Xi G N(xii) for 1 < i < p.
As is customary, we assume that R is irreducible, so that any two
states are connected by a path. Following Chiang and Chow [7], we say
that state y is reachable from x at height h if there exists a path x =
Â£0, xi,..., xp = y such that maxo<;
mum if no state y with U(y) < U(x) can be reached from x at height U{x).
For each pair (x, y) of distinct states, let h{x, y) be the smallest number h such
that y is reachable from x at height h. The depth d(x) of a state is defined as
{min/&(*,?/), U(y)
.
mmh(x,y), y Â£ So if x G S0.
The problem (S, U, R) has the property of neighborhood symmetry (TVS') if
R(x, y) > 0 whenever R(y, x) > 0. In most cases this property is unnecessarily
strong and can be replaced with weak reversibility (WR), which requires that
if a state y is reachable from state x at height h, then x is also reachable from y
at height h. Assuming WR, as we do throughout, rather than NS has important
practical implications: Fox [14] shows how to integrate simulated annealing and
tabu penalties by redefining S, U, and R] with additional structure, given in
[14], the redefined problem satisfies WR, but it is generally not possible to
satisfy NS.
The cooling schedule is a sequence (Tk),k = 1,2,.... of positive
numbers, usually converging to zero, such that the probability of accepting an
uphill move of height one at step k is
e(k) = exp(1/Tfc).
We find it more convenient to deal directly with the sequence (e(&)); by abuse
3
of terminology, we also call this sequence the cooling schedule; this causes no
ambiguity, because the e(k) and Tk determine each other.
In most of what follows, we restrict our attention to schedules of the
form
e(Jfe) = k~1/c, (1.1)
for c > 0, or more generally, to schedules satisfying
Ci(k + a)1/,c < e(k) < c2(k + a)_1^c (1.2)
for positive constants a, c, c\ and c2. These are the schedules that have at
tracted the most attention historically, at least among probabilists. See [6] and
[21] for discussions of other kinds of schedules.
In the work that follows, we are often concerned with the dynamics
of exit from a single state x\ the most significant quantity in this analysis is
how high the chain must climb to leave x. Denote this quantity by
S{x) = min (U(y)U{x))+,
yÂ£N(x)
note that S(x) > 0 only if every neighbor of x has an objectivefunction value
strictly larger than U{x). We call such a state an isolated local minimum.
1.2 Previous Work
A central problem of simulated annealing is how slowly the cooling
schedule must go to zero in order to guarantee convergence to So. Geman and
Geman [20] showed that, for problems satisfying NS, a constant c exists such
that
CO
Â£ e(t)c = oo =} lim Pr(Xt 6 S0) + 1.
4=1
4
Hajek [25] identifies the constant as
d* = maxd(x),
xÂ£S0 v n
where d() is the depth function defined in Section 1.1. He requires (S,U,R)
to satisfy WR but not necessarily NS. Hajek shows that his conditions are
necessary as well as sufficient. Hwang and Sheu [30] prove essentially the
same theorem by an application of the cycles of Freidlin and Wentzell [18].
Chiang and Chow [7],[8] repeat the result in continuous time, and (with certain
restrictions) give the asymptotic probability distribution for large t. Tsitsiklis
[37] obtains similar results by working with small perturbations of probability
matrices. Connors and Kumar [11], [12] show that, under similar conditions,
the lim sup of the Cesaro average of Pr(Xfc So) is one; we return to this point
in Section 5.1.
All of these results assume a finite state space. Although there is a
large body of literature about simulated annealing on continuous state spaces,
we are not aware of any exact results relating the speed of cooling to the
existence of convergence. Belisle [2] shows that convergence occurs on finite or
infinite state spaces provided So is reachable in a single step from anywhere in
the state space. It is easy to modify his proofs to show that a slightly weaker
condition suffices: namely, from anywhere in S there is a path with no uphill
moves leading to 5b Belisles condition is strong enough that his result holds
for any cooling schedule that approaches zero, or even a random schedule that
approaches zero in probability.
The rest of this thesis focuses on efficiently implementing simulated
5
annealing without losing the convergence properties just discussed. In Chap
ter 2, we describe Foxs [14] algorithm QUICKER. Using it eliminates explicitly
rejected moves while leaving the sample path stochastically unchanged. We
show that the computer time to execute QUICKER is asymptotically bounded
(a.s.) by a constant that can be prescribed in advance. This complements Foxs
analysis of quadraticmean convergence. In Chapter 3, we use these results to
propose an improved algorithm QUICKERj. A technical lemma needed to
prove the convergence of QUICKERj turns out to be useful in an important
and littlestudied problem: simulated annealing when the objective function
is not precisely known; we analyze this situation in Chapter 4. In Chapter 5,
we analyze an algorithm, equivalent to QUICKER1, which amounts to using a
random cooling schedule. It exhibits a different sort of convergence: the pro
portion of observed optimal states tends almost surely to one. In Chapter 6, we
consider the embedded chain of pairwise distinct states (i.e., the chain (Xk),
pruned of its selfloops) and show that converges to a set Si. We relate So
and Si. Chapter 7 indicates directions for future research.
6
CHAPTER 2
CONVERGENCE OF A LOOPSKIPPING ALGORITHM
Implementing simulated annealing in a naive but straightforward fash
ion, the processing at step t looks like:
Algorithm SLOw(a;,t)
Select y from N(x)
If U(y) < U(x) then
Set Xt+1 < y
.Else
Generate a uniform random variate V
If V < then
Set Xi+i < y
Else
Set. x
End
Implemented on a computer, Algorithm SLOW is inefficient;. the computer
spends almost all of its time proposing moves and then rejecting them. More
over, the longer the simulation runs, the worse the inefficiency becomes. Fox
(preprint of [14]) notes that the algorithm is a victim of its own success:
when the probability of accepting an uphill move gets small, the algorithm
stays, with high probability, at local minima for very long periods of time.
Algorithm SLOW, as well as its modifications discussed in this thesis, may also
consume a lot of time in oscillating between adjacent states. Fox [14] inhibits
this, while preserving convergence in probability, by integrating tabu penalties.
We discuss this further in Section 7.2.
Greene and Supowit [24] eliminate rejections by forcing the algorithm
to change states at every step. The new state is chosen from the neighbors of
the current state via a distribution conditioned on an accepted move. However,
their method fails to update temperatures; thus, it destroys the convergence
properties described in the previous chapter.
Fox [14, 15] proposes an algorithm QUICKER which greatly reduces
the computer time in selfloops, yet (because simulated time is unaffected)
updates temperatures in the right way to preserve convergence to the optimal
states. Suppose that the algorithm enters state x at time k. The point of
QUICKER is to generate
L = min{/ > k : X\ ^ Xk},
the transition number on which a move to a different state next occurs. Let
a(x, k) be the probability that the algorithm exits state x at time k + 1; that
is,
a{x, k) = Pr (Xk+1 ^ x \ Xk = x).
We assume a(x, k) is nonincreasing in k for any fixed x. QUICKER is based on
the observation that, to a first approximation, L has the same distribution as
a geometric variate with parameter a(x, k) (i.e., the trial number of the first
success in iid Bernoulli trials with success probability a(x, k)). The algorithm
follows.
Algorithm QuicKER(x,t)
Set k < t
Until exit, repeat
8
Generate a geometric r.v. G with parameter a(x, k)
Set L
Generate a uniform r.v. V
If V < a(x,L)/a(x,k) then
Execute NEXT(a:,.L)
Exit
Else
Set k < L
End
Above, Exit means go to the next state. The subroutine NEXT (a;, k) selects
a state from N{x), conditioned on a move away from x at time k.
Fox [15] shows that the number of repetitions of the Until... repeat
loop required by each invocation of QUICKER converges in quadratic mean to 1,
provided e(t) = t~x!c and c is strictly larger than a constant J, to be introduced
shortly. He shows by example that in general no convergence occurs when
c = d. We give an almostsure counterpart, with interesting consequences.
Fox also shows that at isolated local minima QUICKER reduces com
puter time by an asymptotically infinite factor. We sharpen this observation
by giving upper and lower bounds for this time reduction.
2.1 Summary of Results
As in the previous section, a(x, k) denotes the probability of accepting
a move out of state x at time k. We often suppress dependence on x, writing
simply a.k
9
On any given sample path lo, let
Q(x, k,uj)
the number of geometric variates (number of inner
loop iterations) generated by QUICKER if state x
is entered at time k;
0, otherwise.
(2.1)
Again, we often suppress dependence on w or x and write Q(x, k) or just Qk
Recall the function 8(x) defined on page 4. Let
d max^(r) = max min (U(y) U(x))+.
xÂ£S xÂ£S yÂ£N(x)
There is in general no simple relationship between d and Hajeks constant d*\
they are maxima of different functions over different sets. However, the case of
greatest interest is probably that of a problem with a local, nonglobal minimum
x whose depth d{x) is at least as great as the depth of all global optimizers.
In this case, d < d*, and a necessary condition for Pr(AT So) > 1 is that
e{ky = oo. When e(k) fc_1/c, this is equivalent to d/c < 1.
In contrast to d*, the constant d depends for its value only on the
immediate neighborhood of each local minimum. This reveals something of
the flavor of QUICKER; its improvement over SLOW is greatest at those local
minima x where S(x) is large.
As discussed above, d < d* in the cases of greatest interest. There
8{x)/c < 1 is a necessary condition for Pr{Afc G So} * 1 We shall see shortly
that it is undesirable to have 8{x)/c = 1. On the other hand, when 8{x)jc < 1,
QUICKER acquires some attractive properties.
Foxs result in [15] is that, for c > ^(),
lim Ef(Q(a;,A0l))2  Q(x,k) > o) =0.
kHX> ' '
10
Actually, it is routine to show that all moments converge; that is, the above
expression holds with any positive integer exponent.
The next two theorems constitute an almostsure counterpart to Foxs
result.
Theorem 2.1 When a state x is not an isolated local minimum
(8(x) = 0), then with probability one, Q(x, k) < 1 for all but finitely many k.
Theorem 2.2 Let j be a positive integer. If x is an isolated local
minimum, and 8{x)/c < (j 1)/j, then with probability one, Q(x,k) < j for
all but finitely many k.
We do not have a converse to Theorem 2.2, but the following may
provide some insight. Although the conditions of Theorem 2.3 would never be
realized in a lone sample path, they might apply, for example, to massively
parallel runs of QUICKER.
Theorem 2.3 Assume that Q(x, k) > 0 for all &, and that Q(x, k)
and Q(x, l) are independent when k ^ l. If 6(x)/c > (j l)/j, then Q(x, k) > j
infinitely often on almost all sample paths.
The quantities Q(x, k) are certainly not independent, but Theorem 2.3
may indicate something about the worstcase behavior of QUICKER.
The following example (see Figure 2.1) shows what can go wrong in
the critical case 8(x)/c 1. Let S = {x, y, z} with U(x) = 1, U(y) = 2, U(z)
0, N(x) = N(z) = {y} and N(y) {x,z}, and take e(k) = &1. A routine
calculation shows that the expected time to exit from state x, given Xf. = x,
is infinite for all k. Since the sum of a finite number of geometric variates with
positive parameters is finite a.s., it follows that Pr((Q(x, k) > j) > 0 for all
11
u
positive integers j and k.
Adaptive Cooling. Consider the following modification: upon
entry to a state at step k, keep the temperature constant at e(k) until a move
out of the state is finally accepted at time m; then change the temperature to
e(m). In effect, we modify the cooling schedule (e(t)) in a certain random way.
The sojourn time in state x beginning at time k has a geometric distribution
with parameter a(x, k).
Suppose (S, U, R) contains no isolated local minima, and e(t) = f_1/,c,
with c > d*. Theorem 2.1 makes it plausible that Pr(A* G So) * 1 when e(t) is
replaced by the random schedule described above. We show in Chapter 3 that
this is in fact the case. We give further convergence results for such schedules
in Chapter 5.
2.2 Proofs of Theorems
All three proofs follow the same pattern: we write down an exact
expression for Pr(Qfc > j \ Qk > 0), then bound it by suitable inequalities
to show that the sum over k either converges or diverges, then invoke the
BorelCantelli lemmas to assert that the event Qk > j does or does not occur
12
infinitely often. We begin with some relationships that will be useful in all the
proofs.
The acceptance probability at a state x is of the form
n
a(x,k) A0 + ^2Ai(k + a)~di/c, (2.2)
2=1
where n > 0 and we assume without loss of generality that 0 < d\ < ... < dn
and all the A; except possibly Ao are strictly positive. The case n = 0 occurs
when and only when x has no uphill neighbors; i.e., it is a local maximum.
The case A0 = 0 occurs when and only when all of the neighbors of x are
strictly uphillthat is, recalling the definition on page 4, x is an isolated local
minimum. In that case, d\ = 8{x).
The inner loop of QUICKER will be executed more than once if and
only if the first Vtest fails. Conditioned on the first geometric variate be
ing /, the probability of this is
1 (ak+ii/ak) = a*1 (a* ak+i1).
Thus
Pr(Qk > 1) < Pr(Qfc > 1  Qk > 0)
oo
= (afc oik+ii) (1 afc)'1 ak
i=i
OO
=  ak+i)(i 
1=0
OO
=  a*+0(l akf (23)
;=i
Likewise, if the first geometric variate is l and the first Vtest fails, the
remaining number of geometric variates is Qk+iThus we obtain the recursion
Pr((?fc > j + 1) < Pr(Qfc > j + 1  Qk > 0)
13
= Y,(ak ak+ii)(l aky 1 P^Qk+i! > j)
1=1
OO
= YXak fe+0(1 ak)1 Pr(Qk+I > j)
1=0 .
OO
= XXa* ajfc+iX1 ak)1 Pr(Qk+I > j) (2.4)
i=i
In the case j = 0, (2.4) reduces to (2.3).
Now we obtain a bound for the factor ak ctk+i, which is the (possibly
null) sum of differences of the form
Ardi/c Ai{k + l)~di/c.
The exponent d;/c is strictly negative. We expand Ai(k + l)~d'!c in a Taylor
series around Aik~d'/C and note that the series alternates in sign, implying
Ai{k + l)^c > Aik*'0 {dilcykW*'*1, (2.5)
or equivalently,
Aik~di^c Ai(k + l)~d'!c < (di/c)lkAdi/*)~\ (2.6)
Therefore,
akak+l = (A0A0) + it(Aik~di/cMk + l)~di/c)
2=1
< 'ZMdi/^lk^1. (2.7)
2=1
Equality occurs in (2.7) in the case n = 0 (a local maximum), since then both
sides vanish.
14
By keeping the second order term in the Taylor series, we get the
lower bounds corresponding to (2.6) and (2.7):
Aik~di/c Ai{k + l)~dilc >
(di/c) l kMM1 l(di/c)( 1 + di/c) l2 k
and
akak+l >
i1
lJ2Mdi/c)( 1 + di/c) l2 k~^~2. (2.8)
i=l
Finally, the expressions for the first and second moments of a geometric distri
bution imply
OO
E 1 (1 *)' = 2( 1 *k) < (2.9)
ii
and
OO
ii;/2(lai),"1 = K3(2*)(l)<^3 (210)
/=1
2.2.1 Proof of Theorem 2.1 Fix a state x. If n = 0 (i.e., x
has no uphill neighbors) then (2.3) and the remark following (2.7) imply that
Pr(<3fc > 1) = 0 for all k, and the theorem is trivial. Assume for the rest of the
proof that n > 0. We have from (2.2) that
a(x, k) > Ao;
By hypothesis, x is not an isolated local minimum. This means that Ao > 0
and therefore,
ak2 < V (2.H)
15
From (2.3) and (2.7),
Pr{Qk > 1) = Y,(ak ~ ak+i)(l ak)1
i=i
n oo
i=1 1=1
Applying (2.9) and (2.11),
Pr(Qfc>l) <
i=1
< 2(di/c)k(d^)\
2=1
and summing over positive A:,
' EPrM* > 4 < E A,Ao2(di/c)(E < oo. (2.12)
k i=1
From the BorelCantelli lemmas, (2.12) is sufficient to prove that, almost surely,
Qk > 1 only finitely often.
2.2.2 Proof of Theorem 2.2 By hypothesis, x is an isolated
local minimum, so Ao = 0 and a(x, k) is of the form
ak = Â£,Aikd<1',
2=1
where n > 0. As before, assume that 0 < dx < ... < dn < 1 and all the A{ are
strictly positive. Note that d\ = 8{x)\ for the rest of this proof we suppress
the reference to x and write just 8. There is a positive constant Cs such that
for all k
Atks/c < ak < C3k~^c
or equivalently,
C3V/c < a*1 < A^ks/c. (2.13)
We now prove by induction the following statement for nonnegative integers j:
16
(Vj) Assume 8/c < 1. Then there exists a constant Dj, not dependent on
k, such that Pr(Qfc > j) < Dj
By the BorelCantelli lemmas, proving statement Vj for positive j is sufficient
to prove Theorem 2.2, since if j >0 and 8/c < (j l)/j, then 8/c < 1; also, the
quantity j(8/c) j is strictly less than 1, so that > j) is bounded
by the convergent series Yk^jk^6!^^.
Statement V0 is trivial. Assume statement Vj and 8/c < 1. Then
j(8/c)j< 0, so
(Jfc + 0j(fi/c)_i < Jfe,Wc)_j (2.14)
Writing (2.4) and using (2.7), the induction hypothesis, and (2.14):
CO
Pr(Qk > j + 1) = EPr(Q*+* > j)(ak ~ a*+f)(l ak)l
i=i
n oo
< E AiDj{di/cE /(l ctk)1.
i=1 1=1
It now follows from (2.9) and the right inequality of (2.13) that
Pr(Qfc >j +1) < ^E AiDjdk/^ . a2
< ^E AiA^2Djdi/cj k(S/c)i+3(S/c)j+2(S/c).
this proves statement "Pj+1 and finishes the proof of Theorem 2.2.
2.2.3 Proof of Theorem 2.3 By hypothesis, Qk > 0 for all k\
therefore,
Pr(Qfc > j) = Pr(Qfc > j  Qk > 0), a.s. (2.15)
In the remainder of the proof, we use the the two sides of (2.15) interchangeably.
Since a*, is strictly less than one (at least for large k), and is nonin
creasing in k, statement (2.13) implies that there is a constant C4 such that
*l2(l ak) > C4ks/C.
(2.16)
17
We prove the following statement by induction for nonnegative integers j:
("Pj) Assume 8/c < 1. Then there exist positive constants D' and not
dependent on k, such that
Pr(Qfc > j) > D'j k3S/c~j D'! fctf+1)(/0(i+1).
Theorem 2.3 will then follow from the BorelCantelli lemmas and the indepen
dence of the Qk
Statement Vo is trivial. Assume V. Then from (2.4), (2.8), and the
induction hypothesis,
Pr(Qk > j + 1)
CO
= ]C(a* ak+i)(l ~ ak)1 Pr(Qk+i > j)
i=i
CO
1=1
 Â£ Mdi/cXi + di/c)k^~2{i2/2))
i=1 2=1
x (D'jik +  D'j(k + j)U+i)(/<0Cj+i)) . (2.17)
Now
_j_ /)(i+1Kff/c)(J+1) < jcU+1)(s/c)~U+1)^
(2.18)
while a Taylorseries argument similar to that used to derive (2.5) shows that
(Jfe + lyW) > k^5/c^j jl( 1  (2.19)
Applying (2.18) and (2.19) to the last factor of (2.17), expanding the product,
and dropping some positive terms,
Pr( j + 1)
CO
l=i
18
X ('/t1()rs'=*j Â£ + )r(*w2d
V c i=i c c
x  T>'i(l )pWc)ji/ ^^0+i)(J/c)0+i)j
> A1Dlj(8/c)k^~1+^^ ^ /(I afc)^
A^S/c^i 1 8/c)kW)'+WMii (f ;r(i fc)A
W=1
(oo
/2
 E Aityidi/cKl + [Y: '(1 aky J .
i=l
Finally, we use (2.9) and (2.10), and then (2.13) and (2.16) to conclude that
Pr(Qk >j + 1) > AiD'jiS/c)^1^^)^)
2AiDj(8/c)j(l 8lc)k^~^slc^+2\af)
AiDp/ckWMAj+V^f)
 Y AiD^di/c)^ + dilc)k^dil^+2\af)
2=1
> Dj+ilfc(J+1h5/c)~(j+1) Â£),!+^}~U+2)(s/c)U+2)^
where
D'j+i = PjArC^S/c)
and
D" T)' A~3
L/j+1
2A1(8/c){l 8/c)j + YAi(di/c)( 1 + di/c)
2=1
+ D'!A^8/c.
This finishes the induction proof of V+l and the proof of Theorem 2.3.
19
2.3 Efficiency of Foxs Algorithm
We compare the expected efficiency, work per unit cost, of algorithms
SLOW and QUICKER. Say that computing the states X\,..., Xk represents k
units of work, and that each iteration of SLOW represents one unit of cost, so
that the efficiency of SLOW is 1. To find the relative efficiency of QUICKER
we find its cost for computing Xi,..., Xk
We can break each invocation of QUICKER into three phases:
(i) Computation of the acceptance probability.
(ii) Iteration of the inner loop.
(iii) The call to the algorithm NEXT.
Steps (i) and (iii) involve scanning the neighborhood of state x and thus their
cost depends on the size of N(x). It is reasonable to assume that this cost
is bounded above and below by constant multiples m and M, respectively,
of the cost of one iteration of SLOW. The main work in the inner loop is
the generation of two random numbers, (and, possibly, another neighborhood
scan), so the cost of each iteration of the inner loop is bounded above and
below by constants m' and M'.
In QUICKER, the clock is advanced only during iterations of the inner
loop. Thus
H(k) m + G(k) m! < cost (Xi,... ,Xk) < H(k) M f G(k) M',
where H{k) is the number of iterations of QUICKER up to time k and G(k) is
the number of iterations of the inner loop up to time k.
The import of the theorems in this chapter is that for all large k,
G(k)/H(k) is between 1 and j, for some j depending only on the cooling
20
schedule. We use this to get
(m + m') H{k) < cost (Xx,... Xk) < (M + jM') H(k), (2.20)
with probability arbitrarily close to one for large values of k.
We now estimate E(H(k)). For this purpose, consider the sequence
of random variables /3(h), k = 0,1,..., defined by
P(k) = (**_!, k) = Pi(Xk Xh._i).
Suppose we know H(k) for some k. If the algorithm moves to a different state
at step k + 1, then H(k + 1) = H(k) + 1. Otherwise, H{k + 1) = H(k). Thus
we have the recursive relationship
E(H(k + l)\H(k),/3(k + 1)) = (1 + H(k))/3(k + 1) + H(k)(l /3(k + 1))
= H(k) + /3(k + 1). (2.21)
Taking expectations in (2.21) and using the initial condition H(0) = 0, we
conclude by induction that
E(ff(*)) = XW(*)) (222)
i=i
Next, we estimate the quantities /3(1),..., /3(k). Trivially /3(j) < 1 for all j.
Also, if d > 0, and Xj is an isolated local minimum, then
DjVc
for a constant D. If Xj is not an isolated local minimum, then /3(j) is bounded
away from zero. Thus in any case we have
21
(2.23)
for some constant D'. Taking sums, we have
k k
k > E^(i) > rd7c > c0 (fc1^)
3=i i=i
for another constant Co; applying (2.22),
Co fc1_((J/c) < E(H(k)) < k. (2.24)
Recall that our measure for efficiency is work per unit cost, or
k/(cost (Xi,...,Xfc)).
Applying (2.24) to (2.20), taking multiplicative inverses, and then multiplying
through by &, this means that there exist constants C\ and C2 such that
Ci < ^(efficiency of QUICKER at time k) < C^^. (2.25)
The efficiency of QUICKER relative to SLOW increases as the parameter c in the
exponent of the cooling schedule gets smaller. This should not be interpreted
as an argument in favor of faster schedules. It reflects the fact that SLOW is
becoming less efficient; as c gets close, to d, the algorithm spends more and
more of its time in self loops at local minima.
Efficiency is highest when many of the states Xi,..., Xf. are isolated
0
local minima. When there are no isolated local minima, the lower bound in
(2.7), and thus the upper bound in (2.25), can be replaced by a constant.
In this case, the relative efficiency of QUICKER is bounded both above and
below by a constant, and there is no orderofmagnitude gain in speed from its
use. On the other hand, if all local minima are isolated, then (2.7) is sharp.
Heuristically, QUICKER tends to be an efficient algorithm to the degree that
direct transitions between local minima are rare or difficult.
22
Some examples from the literature show that isolated or nearisolated
local minima can actually occur. In the maximum matching problem of [25],
all local minima are isolated. In the first formulation of the graph coloring
problem in [31], direct transitions between local minima are not impossible,
but would be extremely rare. Our results indicate that QUICKER would attain
its maximum efficiency in such situations and would be worth trying there.
The Distribution of H(k) Another approach to computing the
expected value of H(k) may give additional insight. Consider the following
modification: fix a state xq and start the algorithm there. Run QUICKER
as before, but every time there is a move to a state other than xo, return im
mediately to xo.before continuing. Let H(k) denote the number of innerloop
iterations up to time k in the modified process. Clearly, H() is an inhomo
geneous discretetime counting process. Letting Pi(k) denote the probability
that H(k) = i, the kstep transition vector p(k) satisfies the forward equations
p0(k + l) = (1 a(x0,k + l))p0(fc);
Pi{k + 1) = (1 a(x0,k + l))p,(&) + a(x0,k + l)p;_i(A;), i = 1,2,...
The generating function .((z, k) =f zlPi(k) then satisfies the firstorder
difference equation
C{z, k + 1) ((z, k) = a(a:o, k + 1)C(^, k) + za(x0, k + k),
with initial condition ((z, 0) = 1. It is easily verified by induction that the
solution is
C(z, k) = (1 + (z l)a(x0,j)); (2.26)
3=1
23
taking the first derivative at z = 1 shows that
E(H(k)) = 'Â£/a(x0,j), (2.27)
j=i
exactly. Moreover, for 2 near one, the right side of (2.26) is arbitrarily close to
exp
i=1
?
which is the probability generating function of a Poisson distribution with the
same mean.
Observing that H(k) < H(k) stochastically, we obtain (2.24) as be
fore.
24
CHAPTER 3
SIMULATED ANNEALING WITH OVERRIDES
The results of Chapter 2 show that there is an integer,;, determinable
in advance, which bounds the number of innerloop iterations of QUICKER(a;)
for all x after a random but finite number of moves. This suggest that we
modify QUICKER to always force an exit after j iterations. Formally:
Algorithm Quicker^', a;, t)
Set k < t
Set i < 0
Until exit, repeat
Set i < i + 1
Generate a geometric r.v. G with parameter a(x, k)
Set L < k + G 1
Generate a uniform r.v. V
If V < a(x, L)/a{x, k)
or i> ]
then
Execute NEXT(s, L)
Exit
Else
Set k < L
End
The boxes above outline the additional steps needed to convert QUICKER into
QUICKER.;.
With QUICKERj, more moves are accepted in fixed computer time
than with QUICKER. We cannot say that QUICKER,; shortens the time to first
hit S0. However, QUICKERj lets us synchronize parallel processors with SIMD
architecture executing stochastically independent runs of simulated annealing.
Each move at a given computer clock tick executes the same instruction, on
processordependent data. When QUICKERj exits after h < j iterations on
some processor, then that processor executes j h additional dummy iterations
in order to maintain synchronization. Such synchronization is not possible with
either QUICKER or the naive move mechanism.
Stochastically, QUICKER1 generates the same sample path as SLOW
with the adaptive cooling schedule described in Section 2.1.
It seems intuitively plausible that convergence in probability would
continue to hold under algorithm QUICKERj for the right choice of j. For
mally, we have
Theorem 3.1 If j chosen so that the number of innerloop iterations
of QUICKER exceeds j only finitely often, and if
lim Pr(X* G So) = 1 (3.1)
ktoo
under algorithm QUICKER, then (3.1) also holds under algorithm QUICKERj.
Theorem 3.1 is a corollary of a much more general result which we
prove in this chapter. Broadly, the convergence properties of a probabilistic
search algorithm are unaffected by any modification of its initial steps, provided
the total number of modified steps is finite with probability one. That is,
suppose the search strategy can have its standard move mechanism overridden
up to a random but finite time JV; just after N, the standard move mechanism
takes over. We show that the current state of the Markov chain associated with
the modified process converges in probability to the set So of optimal states
26
whenever this is true in the unmodified process.
We do not require the state space S to be finite. Moreover, So can be
an arbitrary measurable subset of S, not just the set of global optimizers.
In the next two sections, we give examples of applications of this
general result, followed by a formal statement and proof in Section 3.3.
3.1 Simple Examples
3.1.1 Random Restart Begin the simulated annealing process
in the usual fashion. Observe the process. If it appears to stall at a particular
state or set of states, jump at random to a different part of the state space.
Repeat this until a fixed number n of restarts have been made. With any
reasonable definition of stall, the restart mechanism is in fact triggered n
times. In this case, N is the transition number at which the nth restart
is made. Our results do not say that modifying simulated annealing in this
fashion makes the algorithm either better or worse. However, we are assured
the modified algorithm retains the convergence property (3.1).
In this example, N is a stopping time (Markov time), because we
know its value after N transitions have occurred, but later we give important
examples where this is not true. Perhaps surprisingly, our basic result continues
to hold under certain natural conditions even when N is not a Markov time.
3.1.2 Greedy Overrides Upon arrival to a state, scan the en
ergy levels of all its neighbors. If a neighbor is encountered with a lower energy
level than any state yet seen, move immediately to the lowest such neighbor.
On a finite state space, N is obviously finite. Under weak conditions, N is the
transition number on which Sq is first visited, and so is a Markov time.
27
Such a greedy override strategy is widely used in the tabusearch
community (see, for example, Glover [22],[23]). Under certain restrictions, Fox
[14] recasts tabu search as simulated annealing on a more elaborate state space.
It is easy to give instances where a greedy override strategy increases the time
to first visit a global optimum. Heuristically, however, it seems attractive unless
the state space is contrived.
3.1.3 Perverse Overrides Upon arrival to a state, scan the
neighbors. If one of the neighbors has higher energy level than any state
yet encountered, move immediately to the highest such neighbor. The analysis
of the previous subsection goes through, where this time N is the transition
number on which the set of global maximizers is first encountered. We do not
claim this strategy is worthwhile, except perhaps on very unusual state spaces,
but it does show the robustness of the property (3.1).
3.2 Dynamic Neighborhoods
Consider a generic search algorithm of the form
Algorithm Search(:e, t)
Generate a set F(x,t) of candidates for the next state
Optional: if F(x,t) contains an element at least e
better than any yet seen, move to it
Until exit, repeat
Select an element of F(x,t)
If it passes the standard acceptance test, then
Exit
End
In ordinary simulated annealing, F(x,t) is the fixed set of neighbors of x.
Here we allow the possiblity of using problemspecific information to choose a
28
smaller or larger F at each step of the algorithm. Of course, we have to prove
that convergence in probability still occurs; this is true in the two examples of
this section.
Fox [14] shows that algorithm QUICKER, now depending on the triple
(x,t,F(x,t)), can replace the Until.. .repeat loop in SEARCH. We want to
further streamline SEARCH using the techniques of this chapter.
When F is a small finite subset of S, it is practical to compute the
acceptance probability a(x, k) needed for QUICKER or QUICKERj, and also
for their subroutine NEXT. If the method to pick an element of F depends
on the energy levels of all its elements, then a(x,k) can be computed with
essentially no additional work.
In the setting of Fox ([14] and [15]), S is finite. The set F(x,t) is
formed by supplementing the nominal fixed neighborhood of the current state
x with randomly generated elements from S. To show that the convergence the
orems apply, Fox recasts this as standard annealing with fixed neighborhoods,
but on a more elaborate state space.
To ensure a set F(x,t) of reasonable size, the nominal neighborhood
can be chosen relatively small, while the added elements are chosen to guar
antee weak reversibility and irreducibility. Heuristically, the added elements
diversify the search; [15] shows that they have theoretical value as well. We
modify SEARCH in two ways: first, we perform the Optional step with e = 0
(this is the same as the greedy strategy in Section 3.1); second, we further speed
up the Until... repeat loop by using QUICKERj instead of QUICKER. Our
theorem applies to both modifications. To cool as rapidly as possible and still
29
take advantage of the convergence properties of Theorems 2.1 and 2.2, we want
to ensure that the constant d is not too large. This is avoided by making re
quiring the nominal neighborhood of each state to contain a state with at most
a slightly higher energy level.
When S' is a compact subset of Belisle [2] implicitly performs
SEARCH. He takes F = S', a natural choice when NEXT is not used. Under
the strong assumptions of [2], there are no restrictions on the rate of cooling;
in particular, convergence occurs with any adaptive schedule, so long as the
temperatures decrease to zero. Moreover, any state can be reached from any
other state, so there are no isolated local minima (unless there is only one
optimal state; this special case is easily disposed of). Then we use QUICKER1
(corresponding to an adaptive schedule) to speed up the Until.. .repeat loop.
We also perform the Optional step here, this time setting e > 0.
It is straightforward to extend Belisles results to the case where the
neighborhoods F are dynamically chosen, so long as there is a path with no
uphill moves from any state to So That is, every set F(x,t) must be forced
to include a state with the same or lower energy level than x. This works also
when S is finite, and it offers a way of proving convergence that is different
from Foxs recasting scheme, mentioned earlier in this section. Once again,
there are no isolated local minima, so we use QUICKER1 as before.
3.3 Overrides: The Main Result
Let So be an arbitrary measurable subset of S, not necessarily finite.
We require that the unmodified chain (Xk) converge in probability to So from
30
any starting state at any initial time. More precisely,
lim inf Pr{Jffc 6 5oX = s} = 1, (3.2)
for all positive integers n. When S is finite, we can replace (3.2) by the simpler
condition
lim Pr{Xfc G SulXi = x} = 1, (3.3)
ktoo
The epoch of the last override must be almost surely finite, but we do not
require it to be a Markov time. It is enough that the override mechanism
stops eventually and has no further influence on the destiny of the process.
More precisely, denote by Y(k), k = 1,2,,..., the Markov chain of moves with
overrides, and by N the epoch of the last override. We assume that there exists
a family {Z(n, ) : n > 0} of Markov chains such that:
(i) The chains Z(n, ) and T() coincide exactly up to time n. That is,
Z(n,k,uj) = Y(k,u>)
for k < n and every uj in the probability space.
(ii) After time n, the transition probabilities of Z(n, ) and X are the same;
that is,
Pr(Z(n, k + l) = y\ Z(n, k) = x) = Pr(Xfc+i = y \ Xk = x)
for k > n and x,y G S.
(iii) There exists a random time N, a.s. finite, after which the transition
probabilities of T() and X are the same.
Here is the intuition for this structure. We start with the unmodified process
and define a sequence of processes in which random overrides are allowed, but
only up to a deterministic time n. Each process Z(n, ) couples with y() up to
31
time n and behaves stochastically like (Xk) after time n. We can characterize
Y(j) as lim^^oo Z(n,j). The limit is attained for some finite n, which can be
chosen uniformly in j.
Theorem 3.2 Assume (3.2) and conditions (i)(iii). Then (3.2) is
also true for the chain K().
Proof: Let e > 0 be arbitrary. For each positive integer m, let (m) be the
smallest integer j > m such that
inf Pr{Xfc 6 S0\Xm = x) > 1 e,
xÂ£S
for all k > j. Condition (3.2) guarantees that (m) is welldefined and finite
for every m.
Moreover, the function (j>{) is nondecreasing, as we now show. Select
n < (f)(m). Then there exist k > n and x S such that
Pr(Wfc e ^0  Xm = x) < 1 e.
Applying the ChapmanKolmogorov equations, we have
inf ?v(Xk G S0  Xm+1 = y)
yes
< Ys Pr(^fc e S'o  xm+1 = y) Pr(Wm+i = y \ Xm = x)
yes
= Pr(Wfc So I xm = x)
< 1 e,
which implies that n < (j){m +1). Because n was an arbitrary number between
0 and {m) < {m + 1).
Since (f>() is nondecreasing and grows without limit, there is, for
k > <^(0), a unique integer *(k) such that
(*(k)) {*{k) + 1). (3.4)
32
limit.
(3.5)
To finish the proof, we use a variant of the coupling inequality (see Asmussen[l],
p. 143). By conditions (i) and (iii), the chains Z(v, ) and F() coincide exactly
for all time on the set {N < u}. Thus
Pr(Z(i/,fc)eSo)Pr(y(fc)GÂ£,o)
=  Pr(Z(i/, k) Â£ So, N > v) Pr(y(k) E So, N > v)\ (3.6)
< Pt(N>u).
We have proved that, for arbitrary e, for all large k,
Pr(Y(k) eS0)>lePv{N> *(k)).
As remarked above, the function is nondecreasing and grows without
limit; by condition (iii), this completes the proof.
Note that the function *() is also nondecreasing and grows without
Take any k > ^(0); write v for *(k). By condition (ii),
Pr (Z{(j)*{k), k) G S0) > inf Pr (Z{v, k) G  Z(v, u) = x)
xÂ£S
= inf Pr (Xk G .So  Xv = a;)
xÂ£S
> 1 e.
33
CHAPTER 4
SIMULATED ANNEALING WITH A NOISY OBJECTIVE
FUNCTION
We now discuss a significant application of the results in Chapter 3.
Suppose we are interested in a state space where objectivefunction values
are unknown and must be estimated by measurements subject to error. For
example, each state might be a configuration of a queuing network. To run
simulated annealing in the most naive fashion, pretend that the estimated
objectivefunction values are the true values; calculate move probabilities and
move based on the estimates. Meanwhile, every time a state is seen, measure
its objectivefunction value and use the measurement to refine the estimates.
In this chapter, we show that under the right conditions, this simple procedure
works.
Yan and Mukai [38] have previously considered such situations. They
assume that the state space satisfies NS, and that we know a finite interval
(a, b) surrounding all the objectivefunction values. In their algorithm, the
probability of accepting a proposed move is itself a random quantity, typically
estimated by multiple measurements of the tentative new state. A proposed
move is rejected unless the result of every simulation is smaller than a uniform
variate on (a, b) drawn at the beginning of the move. The number of simula
tions required at each move is large and grows without limit as the algorithm
proceeds.
Gelfand and Mitter [19] propose a different method. They formulate
the transition probability at time t as
R(x,y)e(t)~uW+wW+,
where W(t) represents a noise term which may depend on past history. This
general formulation includes ours as a special case. However, their convergence
proofs assume that W(t) is Gaussian with variance asymptotically smaller than
any linear multiple of l/( log e(t)). Since some states may be visited only
rarely, it is unclear how to check the variance condition.
For technical reasons, we assume that the unknown objective function
takes on values in a finite set, such as a subset of the computerrepresentable
numbers. Neither [38] nor [19] need to do this. But since our treatment allows
arbitrary precision, we do not consider this restriction important.
In the next section, we give a formal description and show that our
conclusions follow from the results of Chapter 3. It turns out that we need
estimators which are asymptotically stable; the remainder of the chapter
discusses these and presents two examples.
4.1 Formal Setup
At the jth entry to a state x, make an observation T(x,j). Each
of the observations T(x, 1), T(x, 2),..., has expected value h(x). The vector
h = (h(x) : x G S) is unknown. Perform simulated annealing with acceptance
probabilities formed from some estimate Hn = (H(x,nx) : x 6 S), where
each component H(x,nx) is based on history of observed values of T(x,j),
j = 1,..., nx, and nx is the number of times x has been observed up to time n.
35
Assume that there are only a finite number of possible values for
h(x) at any state x (for example, integer multiples of 0.001 between 0 and 1).
In effect, we accept finite precision in our knowledge of h(x). We say that
the estimator H{x, ) is asymptotically stable if there exists a random time
N(x), almost surely finite, such that H(x,n) = h(x) for all n > N(x). We
show later that such estimators exist.
In our setting, the xth component of Hn takes account only of the
xth component of the observations (T(The observations of the other
components and the current observation number are ignored. For every com
ponent, the observations are thus iid. Since there are only a finite number of
states, we can consider asymptotic stability of each component separately. To
initialize, for each component we might set H(x, 0) = oo.
If H(x, ) is asymptotically stable for each state x, and if every state
is visited infinitely often, then from some finite time N onwards, the vector H
of Hestimates coincides with the vector h of true /ivalues. In this setting,
the standard algorithm is simulated annealing using h in the computation of
the move probabilities. The modification uses the estimates Hn. To make this
more precise, in both chains the original state space S is enlarged to include
enough historical information to compute H at time n. Only the moves in the
second chain take account of that additional information. When h is unknown,
the first chain is hypothetical.
Theorem 3.1 implies that if the first chain converges in probability to
So with h, then so does the second chain.
36
This setup generalizes easily to any situation where the objective func
tion h(x) depends on a finitedimensional vector 6 of unknown parameters, and
0 takes on only a finite number v of values. The idea is to exploit structure in
the state space so that v is much smaller than Sj, although that is not neces
sary for our proofs. As above, we can interpret this as knowing the elements
of 6 to finite precision.
4.2 Consistent and Stable Estimators
Generalizing the previous section, we want to run Monte Carlo simu
lation in an optimal way according to some criterion. The choice of simulation
strategy depends on a parameter with an optimal value h. We do not know h.
However, at step n of the simulation we make an observation T(n),
and then use the observations T(l),..., T(n) to compute an estimate H(n) of
h. At time n, the simulation strategy depends on H{n). Recall the definition of
asymptotic stability in the previous section: there is a random time N, almost
surely finite, such that H(n) = h for all n > N. Generally, to find such an H
requires that we know a finite set F containing h, and that the observations
fall in a some set G, generally a superset of F. Thus, H{n) maps the nfold
Cartesian product of G to F. We give below two examples of asymptotically
stable estimators.
For simplicity, assume that h and T(j) are scalars and by rescaling
that F is a set of consecutive integers. The set G is either a subset of the
real numbers or a subset of the integer multiples of 1/d, where d is an integer
greater than 1. We can view the latter as a finiteprecision approximation to
the former. Let (x) denote x rounded to the nearest integer, rounding upward
37
in case of ties (i.e., (x) = [a; + (1/2)J). Set t = E(T(j)) and let h = (t).
An example shows what can go wrong. Let the sequence of sample
means be T(n) = re1 Â£)"=1 T(j). The estimator H(n) = (T(n)) is not asymp
totically stable when t is a halfinteger; for large n, it oscillates unpredictably
between [ij and [i]. If t is near a halfinteger, then damped oscillation occurs;
although there is eventual stability, arguably the rate of damping is very slow.
This is the practical reason to consider the estimators in Sections 4.3 and 4.4.
Suppose we try to correct the situation by observing the entire se
quence of sample means T(l),..., T(n) (computed to double precision, per
haps), counting how many are above or below the halfinteger line, and round
ing up or down accordingly. This actually makes matters worse. For large n,
the sequence of sample means closely approximates Brownian motion, and the
proportion of the T(j) which lie above t has an arc sine distribution (see Feller
[13], pp. 39799.) when t is exactly a halfinteger. When t is approximately
a halfinteger, this proportion forms an approximate arc sine distribution with
large probabilities of being near 0 or 1. The refined estimator still oscillates,
but the intervals between changes grow longer and longer (in fact the growth is
exponential). To the careless observer, it may falsely appear to be stabilizing.
It is important to be clear about the difference between an estimator
which is asymptotically stable and one which is merely consistent. If we had
wanted, say, a static point estimator, restricted to the integers, of a parameter
with true value 1.5, then we would be indifferent between an estimate of 1.0
and 2.0. Asymptotic stability is needed when we are interested, not in the
parameter per se, but in a simulation process dependent on an estimate of the
38
parameter.
4.3 Asymptotically Stable Estimators: First Example
Let Mk(n) be the number of observations equal to k up to step n;
that is,
Mk(n) = I(T(j) = k).
3=1
Each observation T(n) is formed by adding a random error to t, and then
rounding to the nearest integer:
T(n) = (t + Hn)
Suppose that the errors Â£n are iid. Then the T(n) have a common discrete dis
tribution function with mass points ..., p_i,po,pi,..on the integers. Sup
pose also that the distribution of the is symmetric about zero and has a
density with bounded support, strictly decreasing on positive numbers. Then
when t is not a halfinteger the set {pk > 0 : k Z} is finite and has ph = P(t)
for its unique largest element. In the exceptional case where t is a halfinteger,
there will be maxima at both p*j and pp]. For large n, we expect Mk(n)/n to
approximately equal pk, at least for those k not in the tail of the distribution.
These remarks form the intutive basis for our estimator.
Let /(n) be a function on the positive integers which is nondecreasing
in n and which satisfies both
f(n) = o(n) and log(n) = o(/(n)). (4.1)
(For example, f{n) = y/n.) Below, break ties for the secondlargest value
arbitrarily.
39
ALGORITHM STABLEl(z, f)
For each step n:
Make an observation T(n);
Set k < T(n);
Set Mk(n) < Mk(n) + 1;
If the set N = {Mk(n) : k G F} has a single maximum at m,
and arg maxiV \ {m} = m + 1,
and \Mm(n) Mm+1{n)\ < f(n), then
Set H(n) < m + 1.
Else if the set {Mk{n) : k G F} has a single maximum at m,
and arg maxiV \ {m} = m 1,
and Mm(n) Mm_i(n) < /(n), then
Set H{n) < m.
Else
Set ^(n) < argmax{Mfc(n) : k G F}.
End
The following theorem shows that H{n) has the desired properties.
Theorem 4.1 Assume that / satisfies (4.1) and that the error terms
Â£n satisfy the conditions stated above. Then, almost surely, the estimator H(n)
of algorithm STABLE1 exactly equals h for all but finitely many n.
Proof: Suppose first that t is not a halfinteger. Assume that m <
t < m + (1/2); the other case is symmetric. By the conditions imposed on the
distribution of the set {pk : k G G} has its unique largest value at m and its
second largest value at m + 1. The strong law of large numbers and G < oo
imply that, a.s. for large n, the set {Mk(n) : k G G} attains its unique largest
maximum at m and its unique second largest value at m + 1. For arbitrary
6 > 0, another application of the strong law shows that for all large n, on
almost all sample paths,
Mm(n) Mm+1(n)\ > n(pm pm+1) 6,
40
and the expression on the right is bounded below by f(n) for all large n. This
completes the proof if t is not a half integer.
Now suppose that t = m + (1/2). Then pm = pm+1 and these two
are the only maxima in the set {pk : k Â£ Z}. Using the strong law as in the
previous paragraph, the two largest elements of the set {Mk(n) : k Â£ Z} are
Mm(n) and Mm+i(n) a.s. for all large n. Let j(0) = 0 and
j(i + 1) = min {T(n) = m or T{n) = m + 1},
n>j(i)
i.e., {/(l),j(2),...} is the sequence of hitting times to the set {m,m + 1}.
We claim that
IMm(j(i))  < f(j(i))
a.s. for all but finitely many i. Consider the Bernoulli process {T(j(i)) : i
Z+}. It is well known, and a simple consequence of the BorelCantelli lem
mas (see, for example, Billingsley [3], pp. 5354) that there exists a positive
constant K such that the events
{T(j(i)) = T(j(i) + !) = = T(j(i) + K Ini) = m}
and
{:T(j(i)) = T(j(i) + 1) = = T(j(i) + Klni)=m + 1}
each occur for only finitely many values of i. Let io be the largest i for which
either of these events occurs.
Since zero is a recurrent state in the random walk on the space of
possible values of Mm Mm+1, Mm(j(i)) = Mm+l(j(i)) for arbitrarily large
41
values of i. Let i\ be the smallest integer larger than i0 for which i)) =
Mm+i(j(*i)). Then for all i>i\,
I Mm+i(j(i))\ < K\ni < K\nj(i) < f(j(i)); (4.2)
where the last inequality is true for all large enough i.
We finish by showing that (4.2) is still true when j(i) is replaced by
an arbitrary time n. Observing that j() is a nondecreasing function on the
positive integers, we define an inverse. If n < j(l), set j*(n) 0; otherwise,
let j*(n) be the largest integer i such that j(i) < n. The function j*() has the
following properties:
j*(n)
Mm(j(j*(n)) Mm(n)\
Mm+i(j{j*{n)) = Mm+1(n).
Thus, a.s. for all large n,
\Mm+i(n) Mm(n)\ = \Mm+1(j(j*(n)) Mm(j(j*(n))\
< Kln(j*(n)) .
< K In n
< f(n).
4.4 Asymptotically Stable Estimators: Second Example
We drop previous assumptions about the error terms and instead
assume that they are iid with a zero mean and finite variance cr2. We com
pute the sample means Tn to a higher precision d than needed for our final
estimator Hn.
42
Let g(n) be a function which satisfies
J2{a2/n) In In cr2n
lim<1 and lim gin) = 0,
g(n) ~ noo^ '
almost surely. For example, we could take g(n) = n_1\/2s2 lnlns2ra, where s
is the sample variance (this requires that the have a finite fourth moment).
More simply, we could take g(n) = In n/n.
ALGORITHM STABLE2(x,t)
For each step n.
Make an observation T(n)
Compute the sample mean T(n)
Set_J(n) < [T(n)\ + (1/2)
If Tn J(n) < g(n), then
Set H(n) < \T(n)]
Else
Set H(n) (T(n))
End
By the law of the iterated logarithm,
IT{n) t\ < (n)y/2/n < g{n) (4.3)
for all but finitely many n, where 4>{n) \ju2 lnln(ncr2). When t = m + 1/2
for some integer m, then J(n) = t for all large n, so that STABLE2 is stable
by (4.3). When t is not a halfinteger,
2g(n) <  J(n) h\
for all large n; combining this with (4.3),
\J(n)T(n)\ > \J(n)t\\T(n)t\
>  J(n) h\ \T(n) t\
43
> 2g(n) ~ ff(n)
=
for all large enough n, almost surely. We have proved
Theorem 4.2 Algorithm 2 is stable and consistent.
44
CHAPTER 5
PATHWISE CONVERGENCE WITH ADAPTIVE COOLING
Up to this point, we have measured success of an algorithm by con
vergence in probability to the set of optimal states So This standard may
seem to say more than it does. Since there is no strong law of large numbers
for the general nonstationary Markov chain, there is a priori no guarantee that
optimal states will dominate others in any sample run of the algorithm. In this
chapter, we study pathwise convergence directly and show that under certain
circumstances the proportion of optimal states visited does indeed go to one,
almost surely. (However, in general the sequence of states visited has no almost
sure limitsee, for example, [2].) We prove that this is true for the algorithms
SLOW and QUICKER under Hajeks conditions, and, under additional condi
tions different from those in previous chapters, for algorithm QUICKER1. An
example shows that the additional conditions cannot in general be discarded.
It is not clear that the pathwiseconvergence results of this chapter
either imply or are implied by convergence in probability; see, however, the
discussion at the end of Section 5.1.
Our work was stimulated by that of Borkar [4], who applies the idea
of recurrence orders first developed by Connors and Kumar [11] to obtain a
samplepath result similar to ours on a deterministic cooling schedule satisfying
J^e(t)c = oo
t
when c is not less than a problemspecific constant d possibly different from,
but no smaller than, Hajeks d*. We slightly improve Borkars result, showing
that it holds for c> d*.
Next, we exhibit conditions under which pathwise convergence occurs
with an adaptive schedule (i.e., with QUICKER1), and then present an example
of a state space which fails to meet those conditions and in which pathwise
convergence fails.
5.1 Main Results
Theorem 5.1 In simulated annealing with the deterministic sched
ule e(t) = i1/c, we have
lim Y' I(Xt e S0) = 1 a.s., (5.1)
71KX)
fb t=1
for any c> d*.
Remark Borkar [4] proves the same result, except that in place of
d* he uses the constant
d = max d(x),
x Â£S
where d() is the depth function of Section 1.1. Since d* is the same maximum
over the smaller set S \ So, the two constants are related by d* < d, and our
result is a slight improvement over Borkars.
Now we describe what happens for algorithm QUICKER1. Recall
from Section 2.1 that d is the maximum over x G S of S(x), where
S(x) = min{[C/(y) U(x)]+ : y G N(x)}.
Let H = min{[/(x) : x Â£ S'o} be the smallest objectivefunction value outside
the set of global optimizers. Let T{k) be the epoch of the kth state change;
46
i.e., T( 1) = 1 and
T(k + 1) = min{t > T(k) : Xt ^ XT(5.2)
Finally, let t be the epoch of entry to the state Xt; i.e., t = T(k), where k is
the unique integer such that T(k)
Theorem 5.2 In simulated annealing with the adaptive schedule
e(t) = t 1/c, (5.1) holds for any c > d*, provided d < H.
We show in Section 5.3 that the condition d < H cannot in general be omitted.
Remark Hajek [25] shows how to modify (S, U, R), without chang
ing its convergence properties, by inserting intermediate states so that all jumps
are of a desired small size. He calls this the continuous increase property.
In principle, we could use this construction to force d to be smaller than H.
However, this expansion of the state space tends to degrade the performance
of an adaptive algorithm, since it results in more frequent state changes and
thus lower temperatures.
The results described in Theorems 2.1 and 2.2 of Section 1.2 imply
an expectedvalue version of Theorem 5.1, since if Pr{W E .So} > 1, then
limEE S0}) = lim^^Pr^ E So}) = 1. (5.3)
n t=i n t=i
For deterministic cooling schedules, Connors and Kumar [11, 12] prove a result
like (5.3) directly. (Their result involves a lim sup rather than a limit, but with
cooling schedules satisfying (1.2), it is easy to generalize their proof.)
47
P{x) = <
P(x,y)
5.2 SamplePath Convergence for Deterministic Cooling
Following Borkar [4], we associate with each state x a random quantity
the (pathwise) recurrence order of state x, defined as follows:
oo if Yt I[Xt = x] < oo
p~ if p = sup{c > 0  Yt e{t)cI[Xt = x] = oo}
and Yt t{t)pI[Xt = x] < oo
p if p = max{c > 0  Yt e(f)cI[Xt = x] = oo}.
Similarly, we can define the pathwise recurrence order of a transition from
state x to state y:
oo if Yt I[Xt x, Xt+1 = y] < oo
P~ if P = sup{c > 0  Yt e(t)cI[Xt = x, Xt+1 = y] = oo}
and Yt e{t)pI[Xt = x,Xt+i = y) < oo
p if p = max{c > 0  Yt e(t)cI[Xt x, Xt+i = y] = oo}.
The recurrence order measures how often a state is visited, or a transition oc
curs, during a sample run. A state which is transient (in the ordinary sense) has
recurrence order oo. Connors and Kumar ([11],[12]) work with the expected
values of pathwise recurrence orders, quantities which we shall call mean re
currence orders. By replacing recurrence orders by mean recurrence orders in
what follows, we get a result like (5.3) for QUICKER1.
The following remarkable theorem shows that, to a certain extent,
recurrence orders constitute a potential for objectivefunction values.
Theorem 5.3 Assume that S is finite and that weak reversibility
holds. If /3(x) > 0 and there is a path x = x0,xi,...,xp = y such that
maxi
48
Connors and Kumar prove Theorem 5.3 in [11] under the assumption
of neighborhood symmetry and in [12] under the assumption of weak reversibil
ity. Their result is stated in terms of mean recurrence orders, but the proof is
algebraic, not probabilistic, and carries over without difficulty to samplepath
recurrence orders.
The next three results are adapted from Borkar [4]. We sketch their
proofs. The final theorem is slightly stronger than the corresponding result in
[4], because we use d* rather than d.
Lemma 5.4 In simulated annealing with a deterministic cooling
schedule implying C\t~x!c < e(t) < Cit~x^c for C\ > 0, Ci > 0, c > 0, we
have maXjcgs/i^a;) = c, a.s.
Proof: For b > c, J2te(t)b < < oo; thus maxx/3(x) < c.
In addition,
E E 4tmx, = *} = E ft E<_1 =
so t t t
therefore Ylt e(t)cI{Xt = a:} = oo for some x on each sample path, and so
maxj;/^) > c, a.s.
Lemma 5.5 If maxx fl(x) > d*, then the set {x : fi(x) = max^/i^a:)}
is a subset of So, almost surely.
Proof: Let s be a state such that /3(x) = /3max Suppose x Â£ Sq.
By the definition of d*, there is a path x = x0, x\,..., xv y from x to some
state y Â£ So such that max; {/(a;;) < U(x) + d* < U(x) + /3(x). Applying
Theorem 5.3, 0 < U(x) U(y) = /3(y) {3(x) a.s., which contradicts the choice
of x. Thus, except on a set of probability 0, x E So
Proof of Theorem 5.1 From Lemma 5.4, maxj^s) = c > d*. But
49
from Lemma 5.5, the only states x satisfying f3(x) = c are in Sq. Thus
^2e(t)cI(x i S0) = Â£ So) < oo a.s.,
t t
from which (5.1) follows by Kroneckers lemma (see, for example, Chung [10],
pp. 12324).
5.3 SamplePath Convergence for Adaptive Cooling
In this section we prove Theorem 5.2. Assume that e(t) = f_1/c,
c> d*, and that d < H. To avoid repetition, we adopt the convention that all
the statements in this proof are true in an almostsure sense. The following
assertion is proved below:
c < ma,x/3(x) < c + d (54)
Statement (5.4) and d < H together imply that maxj;^(s) H < c, so that
Theorem 5.3 implies that (5{x) < c for any state x not in So. Thus
Â£rlJ(*t i So) < 'Â£i1I{Xt i So) = Y,mcKXt So) < oo,
t t t
and (5.1) follows from Kroneckers lemma as before.
To prove (5.4), observe first that
Â£?(*)c = ^ X*"1 = 00 >
t t t
so that c < max* 0(x). Pick b> c + d. We need to show that
X g(t)6 = X*_t/C <  (55)
t t
Under QUICKER1, the sojourn time in each state has a geometric distribution
and so is a.s. finite; thus, there are infinitely many state changes. Recalling
50
that T(k) is the transition number of the kth state change, we rewrite (5.5)
as
E mb = E (Td= + 1) T(k)) T(k)bl < oo; (5.6)
t k
then choose ^ > 0 such that b > (c + J)(l + Â£). We claim that
Ak = (T(k + 1) T(k)) T{k)b'c < k~(1+V (5.7)
for all but finitely many k, almost surely. This is enough to establish (5.6) and
thus (5.4).
Fix k, and say that Xx(k) = x Let N*(x) be the (nonempty) set of
neighbors of x at height S(x) above x, i.e.
N*(x) = {ye N(x) : U(y) = U(x) + S(x)}. (5.8)
The sojourn time in x ends at or before a move to N*(x) is proposed and
accepted. Thus, it is stochastically bounded by a geometric variate with pa
rameter
pT(k)~s^c > pT(k)'i/c,
where
p= X) R(x>y)>0
yN*(x)
is the probability of proposing a move to N*(x).
Since T(k) > k, we have
logTW_logÂ£_s n_
c c + d \c c + dj
for k > 3; exponentiating gives
c + d
> 0
T(k)1/ck~1/ eJ/(c+^ > 1.
51
This implies that
T{k)b/ck^ > T{k)blck~bl(c+d") > ew7(C+d) > 1;
so that there is a constant 77 such that for all large k,
[r(it)l/cife(1+Â£>J > 7, (:m)4/tr[1+<:))
(5.9)
Thus,
Pr (At > r(I+Â£) I f (*)) = Pr (T(k + 1) T(k) >  T(k))
Since (1 a)r < e ar for any real a and positive real r, this implies
Pr [Ak > k~(1+V  T(k)) < exp (p [k~^T(k)b/cj T{k)~1>C)
< exp ( pr}k~^1+^T{k)^b~^^c).
Then the trivial bound T(k) > k implies
Pr [Ak > k~W>  T{k)) < exp ( 11Pk^b~^ "(1+^). (5.10)
The choice of Â£ implies that the exponent of k on the right is strictly positive.
Now for any a > 0,
00 roo 1 / 1 \
V)exp( ja) < / exp(xa) dx = r ( ) < oo. (511)
Jo a \aj
By the BorelCantelli lemma for adapted sequences (see Loeve [36], pp. 398
99), the event Ak > k~^+0 occurs only finitely often, almost surely. This
proves (5.6) and finishes the proof of Theorem 5.2.
52
5.4 An Example Where Pathwise Convergence Fails
Consider the state space S {x,y,z}, with U{x) = 1, U(y) = 2,
U{z) = 0, and tentativemove matrix
/
1
\
R =
r 1 r ,
where r > 0. Thus, d* = 1, d 2, and H = 1. We show that pathwise Cesaro
convergence fails for the adaptive schedule e(t) = Â£1/d* = Â£1. In fact, given
any a < 1, there are infinitely many transition numbers t at which
t
t"1 YJ(XS = x) > a, a.s.;
S=1
that is, the total proportion of visits to the strict local minimum x gets arbi
trarily close to 1, infinitely often.
Assume for simplicity that Xi y. Since we are using an adaptive
schedule, there are infinitely many state changes, almost surely. In particular,
there are infinitely many visits to state y. Let S(j) denote the transition
number immediately following the jth visit to y; that is, .S'(l) = 2 and S(k +
1) = 1 + min{Â£ > S(k) : Xt = y}. Also, let M(t) the remaining lifetime, at
step t, of the current visit to state x; that is, M(t) = min{s > 0 : Xs+i ^ a;} if
Xt = x, and M(t) = 0 if Xt ^ x.
Since every proposed move out of state y is accepted, we have
Pr (XS(j) = x) = r
for all j. Let Bj be the event
{*5W) = *, M(S(j)) >
ln2/ln(l S(j) J)j}.
53
Then
Pr (Bj) = r(lSO')1)
In 2
= rexp
> r/2.
LIn2/ln(15(i) J)j
mi sur1)
IHi
(5.12)
The events Bj are not independent. However, Hi,..., Bj1 influence Bj only
through S(j); since (5.12) is true regardless of the value of S(j), we have
Pr(Hj  A) > r/2 > 0
for any A Â£ er(Ri,..., Bji). From Lemma 5.6 (see below) it now follows that
given any integer l, sequences of l consecutive events Bj, Bj+1,..., Bj+i1 occur
infinitely often. Since for any integer k,
.MTTFijJ s L^2J > i/2
it follows that if Bj,...,Bj+2ii all occur, then at time S(j + 21) 1, there
will have occurred at least l S(j) visits to state x and at most S(j) visits to
state z.
Remark A similar argument shows that (a.s.) the total proportion
of visits to the global minimum 2 also gets arbitrarily close to one infinitely
often. Thus, the Cesaro sums t1 Ss/(Xs = x) and t_1 Z)S/(AS = z) oscillate
forever without converging.
Lemma 5.6 Let Ii,l2,... be a sequence of zeroone Bernoulli ran
dom variables adapted to an increasing sequence T\,T2, of crfields. Sup
pose that for all i,
inf E(/i+i  A) > a,
AFi
54
for some positive constant a not depending on i. Then for any positive integer
l, the sequence (Ij) contains infinitely many strings of ones of length l.
Proof: Let (Ij : j = 1,2,...), (Tj : j = 1,2,...), and a be as in the statement
of the lemma. We need to show that for any given integer / > 0, there is
an infinite subsequence of indices j = 1,2,..., such that for each j,
Ii(j)+i = Ii(j)+2 = = Ii(j)+ii = 1. The proof will show that we can take the
i(j) to be multiples of l.
Let Ijm = nr=i Ij+k We establish by induction on m that
E(Ijm  A) > am (5.13)
for A G Jj. When m = 1, statement (5.13) is true by hypothesis. Assume
(5.13). Then
E(/j,m+l  A) = E(/j+i/j+2 Ij+m+l  A)
= E(/j+2 Ij+m+l  Ij+l = 1, A)E(//+1  A)
= E(/j+i>m  Ij+i = l,A)E(ij+i  A)
> am~1a,
since {Ij+i = 1} fl A G Fj+i
Now let Jjm Ijm,j = Ijm+iIjm+2 Ijm+j If j < k, then the event
{Jjm = 0} is in IF(j+i)m C Tkm, so that (5.13) implies that for any k > 0 and
n > 0,
Pr ( f) (Jjm = 0)) = Pr(^fcm = 0, . Jnm = 0)
Kj=k '
= P^(Jkm = 0) X Pv(Jk+l,m = 0  Jkm 0) X
X Pr(JnTO = 0  Jkm = Jk+l,m = Jnl,m 0)
< (1 a)n~k
55
letting n > oo and then taking the union over all positive k shows that
Pr( Jjm = 0 all but finitely often )
0.
56
CHAPTER 6
CONVERGENCE OF THE PRUNED CHAIN
It might be thought that the convergence results of [7], [11], [25],
and [37], as well as those in previous chapters of this thesis, are an artifact,
resulting from the algorithm spending long periods of time in optimal states.
In this chapter we show that convergence occurs even if we do not take into
account the length of time that the chain occupies a state, i.e., if we consider
only the embedded chain of distinct states.
As in the previous chapter (see (5.2) on page 47), let the sequence
{T(k) : fc = 1,2,...} denote the epochs of transitions between distinct states.
Then (Xx(k) ' k = 1,2,...) is the induced chain of pairwise distinct states vis
ited. Let Si be the set of global minimizers of the function U + 8. At the end
of this chapter we describe the set Si for some typical problems. It often is a
superset of So and usually its component states have energy levels very close
to the minimum. If there are no isolated local minima, as in the setting of [14],
then Si So
Theorem 6.1 (Convergence in Probability) If (S', U, R) is weakly
reversible and if the cooling schedule satisfies the conditions in [25], then
lim Pr (AT(fe) G 5i) = 1.
K* OO V
Theorem 6.2 (Pathwise Convergence) In simulated annealing us
ing the adaptive schedule e(t) = t 1/c for c > d* + d,
a.s.
6.1 Convergence in Probability
Our procedure in both Theorem 6.1 and 6.2 is to show that the dy
namics of the pruned chain are the same as those of ordinary simulated an
nealing on a triple (S,U + 7,R), where R(x,y) > 0 iff R(x,y) > 0. The first
task is to show that the modified problem satisfies WR.
Lemma 6.3 (S, U + 8, R) has property WR if (S, U, R) does.
Proof: The change from RtoR makes no difference, since any path
A
under R is also a path under R and vice versa. Suppose that state y is reachable
from state x at (U+^)height h. This means in particular that C/(a:)+5(a:) < h.
Since the function Â£() is nonnegative, it also means that y is reachable from x at
17height h. Since (5, U, R) satisfies WR, there exists a path y = y0,..., yp = x
such that maxo<;
(U + S)(yi) < h. At the beginning of the proof we remarked that this is true for
the state x. Therefore, assume that 0 < < p 1; i.e., yi is not the last state
on the path. If Â£%,) > U(yi+i), then %;) = 0, so (U + S)(y,) = U(yi) < h,
and if U{yi) < U(yi+1), then U(yi) + S(yi) < f7(y;+i) < h. The claim is proved.
We have exhibited a path from y to x of (U + <5)height not exceeding h, which
proves the lemma.
58
Lemma 6.4 If (S, U, R) is weakly reversible, then
[(tf + S)(v) ~(U + *)M]+ = [Vim) U(X) Â£(;r)]+
for all states x,y with y E N(x).
Proof: If U(y) > U{x), then there must be a path y = yo,yi,... ,yp =
x leading from y to x at height not exceeding U(y). In particular, y has a
neighbor not strictly uphill from it, namely yi, so that S(y) = 0, and the
conclusion of the lemma holds in this case. Suppose on the other hand that
U(y) < U(x). Then there must be a path y = y0,yi,...,yp = x leading
from y to x at a height not exceeding U(x). In particular, yx E N(y) and
U(yi) < U(x), so that S(y) < U(x) U(y). Together with S(x) > 0, this
implies
(U + S)(y) ~(U + S)(x) < U(y) + 6(y) U(x) < 0,
so that the conclusion of the lemma also holds in this case.
Proof of Theorem 6.1: It is sufficient to demonstrate that the
condition (A.2) of [25] is satisfied; that is, that there exist constants cx, C2 such
that
de(*)[
where Q(x,y,t,e) denotes the transition probability, conditioned on accep
tance, from state x to state y at time t under cooling schedule e().
As in Section 2.2, we write the acceptance probability at time k from
state x as
a(x, k) = A0 + Y^A{k~di/c.
i1
59
If x is an isolated local minimum, then the probability of moving to y, condi
tioned on acceptance, is
Q(x,y,t,e)
ce(t)Mv)uW+
TZ=i Mt)diixy
For some constant K not dependent on t, the denominator satisfies the double
inequality
Aie(<)ff(ar) < < Ke(t)s^
i=1
since the exponent in the numerator is strictly positive,
c (f)tr(l))_I7(I)_f(I) < Q(x,y,t,e) < S.x^v)u(x)6(x\
A A\
from which (6.1) follows by Lemma 6.4. This finishes the proof for the case
where x is an isolated local minimum. If x is not an isolated local minimum,
then
Q(x,y,t,e) =
ce{t)VMuW+
Ao + e=i Mty^r
Here, the denominator is bounded below by A0, which is strictly positive, and
above by some (different) constant K, again not dependent on t. Therefore
c^uiv)^) < Q(x,y,f,e) <
K Ao
Since S(x) = 0 in this case, we also get (6.1).
6.2 Pathwise Convergence for the Pruned Chain
In this section we prove Theorem 6.2. As in the proof of Theorem 6.1,
we find expressions involving powers of k to bound the transition probabilities
of the fcth step in the pruned chain. The greatest difficulty lies in finding an
asymptotic upper bound for T(k), the time of the fcth transition; this is the
60
purpose of Lemmas 6.5 and 6.6. A modified version of the techniques used to
prove Theorem 5.2 then establishes pathwise convergence.
Lemma 6.5 On almost all sample paths, T(k + 1) T(k) < T(k)Wc)
for all except finitely many k.
Proof: Let Â£ > 0 be arbitrary. We proceed as in the proof of in
equality (5.10) on page 52. As in that proof, the exit time from any state is
stochastically no greater than a geometric variate with parameter T(k)~dlc\ so
for all large k,
Pr {T(k + 1) T(k) > T{kfH  T(k)} < (1 T(/fe)J/c)LT(fc)(,,/e)+
< exp ( rjT(k)^)
< exp ( i]k^),
where rj is a positive constant small enough to compensate for the integer
rounding. Applying (5.11), then the BorelCantelli lemma for adapted se
quences as before, we have
Pr {T(k + 1) T(Jfe) > T(kf+t i.o.} = 0,
for Â£ > 0. Now for j = 1,2,..., let
% = {T(k + 1) T(Jfe) > T(A:)J+2_i i.o.}.
Then
Pr {T(k + 1) T(k) > T(k)d+2~j i.o.}
oo oo
= Pr{U%}sÂ£Pr{%}=0.
j=l j=1
Lemma 6.6 T{k) 0(kc^c~d^), almost surely.
61
Proof: For convenience, write d = d/c. Inductively define a function
fk on the nonnegative integers by
A(0) = i; Mi +1) = AO) + (AO))"
Now consider the related solution to the initialvalue problem
= k) 9k(x) = (9k(x))d; (6.2)
for 0 < d < 1. It is easy to verify that this has the unique solution
9k{x) = ((1 d)x +
on nonnegative real numbers x. Since the function gk is nondecreasing,
{9k(j)) < I (9k(x)) dx = gk(j + 1) gk(j),
>3
and from this it easily follows by induction that
fk(j) < gk{j) = 0(j)1/{1d)
for all nonnegative integers j.
On any sample path u>, there is, by Lemma 6.5, a constant K = K{u)
such that T{k + 1) T(k) < T(k)d^c for all k> K. Then
T{k)
Proof of Theorem 6.2: As in Section 5.3, let N*(x) be the set of
neighbors at height 8(x) above a:. The transition probabilities for the pruned
chain are
Pr (Xr(k+1) = V  XT(k) = x)
R{x,y)e{T{k))VWuWP
ZzeN(*)R(x,z)e(T(k))M*)uW
62
rsj
R{x,y) e{T{k))^y)~u^
J2zeN^)R(x,z) ' e(T(k)y(*)
= R(x, y) e(T(fe))[t/^(C7+5)^+
= R(x, y) e(T{k))W+5WV+s^,
where
Given Â£ > 0, we have
R(x,y)
R(x,y)
^2zÂ£N*(x) Rfai Z)
(1 t)R{x, y) e(T(k))W+5M(u+sW+ < Pr (xT(fc+1) = y \ XT(k) = x)
and
Pr (Xnk+1) = y  XT{k) = x)< R{x,y) e(T(k)fu+^^u+^\
for all large k. Applying Lemma 6.6 and the trivial inequality T(k) > k,
ki/(c3) < e(T(k)) < k~1/c, a.s.
Therefore, the transition probabilities for the pruned chain are bounded below
by
and above by
This is equivalent to bounding the cooling schedule for the pruned chain by
(1 Â£)(ifc1/(cd)) and kVc.
To finish the proof, we need one more lemma. Recall the recurrence
orders fl(x) defined on page 48.
63
Lemma 6.7 If there exist constants b, B, c, C such that
brl!c < e{t) < Bt~xtc,
then c < max^gs /?(x) < C.
This is proved in exactly the same way as Lemma 5.4.
Applying Lemma 6.7, c d < maxxl=s ]3(x) < c, where ft(x) is the
recurrence order of the pruned chain. An application of Lemma 5.5 completes
the proof.
6.3 Si Compared to So
We briefly discuss the set jSi of global minimizers of the function U + S
and its relationship to the set So. The following two examples show that this
relationship can attain either extremethe sets can be completely coincident
or completely disjoint.
Example 6.1 If there are no isolated global minima then Sq Si.
This is not surprising: in the unpruned chain, the expected waiting time in
any individual state is 0(1); its asymptotic behavior should not differ greatly
from the pruned chain, where we force the waiting time to be exactly 1.
Example 6;2 Let S = {w:x,y,z}, N(w) = {x}, N(x) {u>,?/},
N(y) = {y,z}, N(z) = {y}, with U(w) = 1.5, U{x) = 1, U{y) = 2, and
U(z) = 0. (See Figure 6.1).
In this case, the probability distribution of Xx(k) converges to a unit
mass on {w, x}, even though neither of these states is a global minimizer.
Again, this is reasonable, since a visit to z is always followed by a visit to y,
but the chain is very likely to contain long oscillations between states x and w.
64
u
2
1
Despite Example 6.2, we believe that convergence to the set S\ would
in most settings be an acceptable outcome. Some typical applications of sim
ulated annealing to real world optimization problems illustrate this.
Example 6.3 In Hajeks [25] formulation of the maximummatching
problem, a situation like Example 6.2 could not occur, since every move changes
the objective function (number of edges) by one. In fact, any problem formula
tion which satisfies the continuous increase assumption of [25] has So C Si]
moreover, the energy level of every state in Si is either minimal or one step
away from minimal.
Example 6.4 One of the earliest papers to propose simulated an
nealing as an optimization technique was [32]. In its formulation of the travel
ling salesman problem, a move between states (tours) consists of reversing one
section of the tour. A situation like Example 6.2 could occur only if there were
two tours such that
(i) each could be reached from the other by a single move;
(ii) no shorter tours could be reached from either by a single move; and
65
(iii) the length of each tour is shorter than the length of any tour one move
away from an optimum.
Unless the arrangement of cities is very regular, or the number of cities very
few, this situation would be rare. Ordinarily, elements of Si would be tours at
most one move away from optimal, from which a true optimum could be found
by a short postprocessing routine.
66
CHAPTER 7
DIRECTIONS FOR FUTURE WORK
The nonstationary Markov chains of simulated annealing are a rich
source of unanswered questions; we present a small selection.
7.1 Hitting Time
A useful complement to the convergence results of this thesis would
be information about the first hitting time T to So In a recent paper, Chiang
and Chow [7] used Wentzell and Freidlins theory of Wgraphs to show that
the second eigenvalue of Pt, the transition matrix at time t, is asymptotically
m = i 
where d* is Hajeks constant, defined in Chapter 1. This is suggestive for the
following reason. We know that
t
maxPr(T > t \ Xk = x) II JJ Pj
cc^S N t oo
3=k
where Pj denotes the matrix Pj with all rows and columns corresponding to
So deleted. For large j, it is well known (see, for example, Section 1 of [25])
that the stationary distribution (i.e., the principal eigenspace) of Pj is almost
entirely concentrated on So. Therefore, X(j) is nearly the same as the principal
eigenvalue of Pj. If there were a constant A
at least for large k, then we could write
Pr(r > t  Xk = x) S A exp (  e(j)d*),
3=1
and in principle estimate all the moments of T. (Of course, we would also like
to compute or find good bounds for A.) However, a relationship like (7.1),
which must depend on the special structure of the matrices Pj, has thus far
eluded us.
7.2 Tabu Penalties and Simulated Annealing
Fox [14] incorporates kstep tabu penalties into the simulated anneal
ing algorithm. Roughly, he contructs a new space with states having multiple
components; candidates for the latter are ^tuples in Sk which correspond to
possible paths, without selfloops, of length k in S. Unfortunately, the mod
ified problem does not in general satisfy WR. To avoid this difficulty, Fox
imposes additional structure. Among other things, he requires that from any
state there is a path with no uphill moves leading to So] this is achieved by
incorporating a generalization of the neighborhood enrichment mentioned in
Section 3.2.
Neighborhood enrichment has heuristic appeal as a way to diversify
the search. Also, an adaptation of the proof of Belisle [2] shows that Foxs pro
cedure guarantees convergence in probability to So with any cooling schedule
that goes to zero, no matter how quicklyeven in the extreme case e(t) = 0
for all t. Nevertheless, we feel that insight could be gained by reconsidering the
general problem without neighborhood enrichment. Are there other, weaker
conditions which guarantee WR in the expanded state space? An answer to
this question would likely increase our understanding of the property of weak
68
reversiblity. Moreover, it might allow us to quantify in some sense the benefits
of tabu search; this would have both theoretical and practical interest.
7.3 Noisy Objective Function Revisited
In Chapter 4, we assumed that the objective function values were
unknown but confined to a known finite set. Even though we can force an
arbitrary degree of precision, the assumption is still somewhat unsatisfying,
at least from the point of view of traditional mathematics. Dropping the
assumption would allow us to replace the asymptotically stable estimators of
Chapter 4 by estimators which are merely consistent.
Recall the procedure: every time we visit a state, we observe its energy
level with a measurement subject to error. Past observations are combined to
form a vector of estimates of the objective function U(x) at each state x. If
the estimator(s) are consistent, then after a random but a.s. finite time, the
estimated value of every U(x) will be within any prescribed Â£ > 0 of its true
value.
The first difficulty is that this is not a stopping time. We could get
around this with a stronger version of Theorem 3.2. The theorem would need
to hold when conditions (ii) and (iii) imply merely approximate equality of the
respective transition probabilities. Under those weakened conditions, equalities
(3.5) and (3.6) are no longer valid, and it is not immediately clear that they
could be replaced with appropriate inequalities.
The next step would be to show that knowing the energy levels to
within an arbitrary Â£ suffices. We are looking at a Markov chain where the
69
transition probabilities at time t satisfy some inequality
Ae(t)Z+VMuW+ < P(x, y, t) < Be(t)*+luMuW+. (7.2)
It might turn out that (7.2) is enough to guarantee convergence. Alternatively,
assuming, as do Gelfand and Mitter [19], that the error term Â£ does not de
pend on x or y, a system satisfying (7.2) can be viewed as ordinary simulated
annealing with a random cooling schedule e(). Assuming the original cooling
schedule satisfies (1.2),
e(t) = (7.3)
where Â£ is a random variable in the interval (a, b). In effect, we move the
randomness from the objective function to the cooling schedule. To get con
vergence in probability, we need to show that convergence results such as those
in [7] or [25] continue to be true if the cooling schedule under the weaker
condition (7.3). It is not yet clear whether this is possible.
7.4 Refinement of Algorithm QUICKER1
Fox has suggested replacing the geometric variate with its expecta
tion (rounded upward to the nearest integer) in algorithm QUICKER1. A
proof like that in in Section 5.3 (but easier) shows that the modified algorithm
exhibits the same pathwise convergence discussed there. We do not know if
Foxs modified QUICKER1 gives convergence in probability; this would be
worth investigating.
Foxs modified QUICKER1 would be slightly simpler to implement.
It seems to reduce variance in the. random cooling schedule. However, more
work is needed to determine whether this translates into reduced variance in the
70
sequence of states generated. The modified algorithm would prevent anomalous
rapid cooling. Heuristically, it can be argued that this is desirable. Once
again, though, more work is needed to determine whether this improves the
performance of the algorithm.
71
BIBLIOGRAPHY
[1] Asmussen, S. (1987). Applied Probability and Queues. New York: Wiley.
[2] Belisle, C. J. P. (1992). Convergence theorems for a class of simulated
annealing algorithms on Journal of Applied Probability 29:88595.
[3] Billingsley, P. (1986). Probability and Measure. New York: Wiley.
[4] Borkar, V. (1992). Pathwise Recurrence Orders and Simulated Anneal
ing. Journal of Applied Probability 29:47276.
[5] Browdy, M. H. (1990). Simulated Annealing: An Improved Computer
Model for Political Redistricting. Yale Law Sz Policy Review 8:163ff.
[6] Catoni, 0. (1992). Rough Large Deviations Estimates for Simulated An
nealing: Application to Exponential Schedules. The Annals of Probabil
ity 20:110946.
[7] Chiang, T.S. and Chow, Y. (1988). On the Convergence Rate of Anneal
ing Processes. SIAM Journal of Control and Optimization 26:145470.
[8] Chiang, T.S. and Chow, Y. (1989). A Limit Theorem for a Class of
Inhomogeneous Markov Processes. The Annals of Probability 17:1483
1502.
[9] Chiang, T.S. and Chow, Y. (1993). Asymptotic Behavior of Eigenvalues
and Random Updating Schemes. Applied Mathematics and Optimization
28:25975.
[10] Chung, K. L. (1974). A Course in Probability Theory. 2d ed. San Diego:
Academic Press.
[11] Connors, D. P. and Kumar, P. R. (1988). Balance of Recurrence Orders
in TimeInhomogeneous Markov Chains with Application to Simulated
Annealing. Probability in the Engineering and Informational Sciences
2:15784.
[12] Connors, D. P. and Kumar, P. R. (1988). Simulated Annealing Type
Markov Chains and Their Order Balance Equations. SIAM Journal of
Control and Optimization 27:144061.
[13] Feller, W. (1968). An Introduction to Probability Theory and Its Appli
cations. Vol. 2, 2d ed. New York: Wiley.
[14] Fox, B. L. (1993). Integrating and Accelerating Tabu Search, Simu
lated Annealing, and Genetic Algorithms. Annals of Operations Research
41:4767.
[15] Fox, B. L. (1994). Faster Simulated Annealing. To appear in SIAM Jour
nal of Optimization.
[16] Fox, B. L. and Heine, G. W. (1993). Probabilistic Search with Overrides.
Technical Report, University of Colorado, Denver.
[17] Fox, B. L. and Heine, G. W. (1993). Stable Estimators for SelfAdjusting
Simulations. Technical Report, University of Colorado, Denver.
[18] Freidlin, M. I. and Wentzell, A. D. (1984). Random Perturbations of
Dynamical Systems. Translated by J. Sziics. New York: SpringerVerlag.
[19] Gelfand, S. B. and Mitter, S. K. (1989). Simulated Annealing with Noisy
or Imprecise Measurements. Journal of Optimization Theory and Appli
cations 62:4962.
[20] Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distri
bution, and the Bayesian Restoration of Images. IEEE Transactions in
Pattern Analysis and Machine Intelhgence 6:72141.
[21] Gidas, B. (1984). NonStationary Markov Chains and the Convergence
of Annealing Algorithms. Journal of Statistical Physics 39:75131.
[22] Glover, F. and Laguna, M. (1993). Tabu Search. In Modern Heuristic
Techniques for Combinatorial Problems, ed. C. R. Reeves, pp. 70141.
Blackwell Scientific Publications.
[23] Glover, F., Taillard, E., and de Werra, D. (1993). A Users Guide to Tabu
Search. Annals of Operations Research 41:330.
73
[24] Greene, J. W. and Supowit, K. J. (1986). Simulated Annealing Without
Rejected Moves. IEEE Transactions on ComputerAided Design, Vol.
CAD5, pp. 22128.
[25] Hajek, B. (1988). Cooling Schedules for Optimal Annealing. Mathematics
of Operations Research 13:31129.
[26] Heine, G. W. (1993). Convergence Properties of a LoopSkipping Algo
rithm for Simulated Annealing. Technical Report, University of Colorado,
Denver.
[27] Heine, G. W. (1994). Pathwise Convergence for Simulated Annealing
with an Adaptive Schedule. Technical Report, University of Colorado,
Denver.
[28] Hwang, C.R. and Sheu, S.J. (1989). On the Weak Reversibility Condi
tion in Simulated Annealing. Soochow Journal of Mathematics 15:159
70.
[29] Hwang, C.R. and Sheu, S.J. (1990). LargeTime Behavior of Perturbed
Diffusion Markov Processes with Applications to the Second Eigenvalue
Problem for FokkerPlanck Operators and Simulated Annealing. Acta
Applicandae Mathematicae 19:25395.
[30] Hwang, C.R. and Sheu, S.J. (1991). Singular Perturbed Markov Chains
and Exact Behaviors of Simulated Annealing Processes. Journal of The
oretical Probability 5:22349.
[31] Johnson, D., Aragon, C., McGeoch, L. and Schevon, C. (1991). Opti
mization by Simulated Annealing: An Experimental Evaluation; Part II,
Graph Coloring and Number Partitioning. Operations Research 37:378
406.
[32] Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983). Optimization by
Simulated Annealing. Science 220:62180.
[33] Kropaczek, D. J. and Turinsky, P. J. (1991). InCore Nuclear Fuel Man
agmenet Optimization for Pressurized Water Reactors Utilizing Simu
lated Annealing. Nuclear Technology 95:9ff
[34] Kuo, C.H., Michel, A. N. and Gray, W. G. (1992). Design of Optimal
PumpandTreat Strategies for Contaminated Groundwater Remediation
74
Using the Simulated Annealing Algorithm. Advances in Water Resources
15:95ff.
[35] Kuriyan, J., Brunger, A. T. and Karplus, M. (1989). XRay Refinement
of Protein Structures by Simulated Annealing: Test of the Method on
Myohemerythrin. Acta Crystallographica (Section A) 45:396
[36] Loeve, M. (1963). Probability Theory. 3d ed. Princeton, New Jersey: Van
Nostrand.
[37] Tsitsiklis, J. (1989). Markov Chains with Rare Transitions and Simulated
Annealing. Mathematics of Operations Research 14:7090.
[38] Yan, D. I. and Mukai, H. (1992). Stochastic Discrete Optimization. SIAM
Journal of Control and Optimization 30:594612.
75

Full Text 
PAGE 1
SMART SIMULATED ANNEALING by George Winfield Heine, III B. A., Reed College, 1971 M. S., University of Colorado, 1989 A thesis submitted to the Faculty of the Graduate School of the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Doctor of Philosophy Applied Mathematics 1994
PAGE 2
1994 by George Winfield Heine, III All rights reserved.
PAGE 3
This thesis for the Doctor of Philosophy degree by George Winfield Heine, III has been approved for the Department of Mathematics by Bennett L. Fox r James Koehler Burton Simon Richard Davis Thomas Russell Richard olley Date
PAGE 4
Heine, III, George Winfield (Ph.D., Applied Mathematics) Smart Simulated Annealing Thesis directed by Professor Bennett L. Fox A naive implementation of simulated annealing is inefficient; at low temperatures it spends almost all its time proposing moves and then rejecting them. We present an algorithm, due to Fox, which reduces time at selfloops while preserving convergence properties. We show that the computer time spent in each distinct state by Fox's algorithm is bounded (a.s.) by a constant which can be determined in advance. This result suggests a refinement of Fox's algorithm: when computer time at a given state exceeds the predetermined constant, exit to another state. A technical lemma needed to prove the refinement works has surprising corol laries. For example, we use the lemma to analyze simulated annealing when the objective function is not precisely known and must be estimated. Another way of speeding up simulated annealing keeps temperature constant while in a. state, updating when the state changes. Thus sojourn time in a. state is easily simulated by sampling from a geometric distribution. Using methods of Connors and Kumar, we show that in this scenario the proportion of time spent in a.n optimal state converges to one almost surely. It might be thought that the various convergence results in simulated annealing a.re an artifact, since a.t low temperatures the algorithm spends large blocks of time at optima.! states. However, we look at the embedded chain of pairwise distinct states and find that it too exhibits convergence, though in general to a slightly different set of states. !V
PAGE 5
This abstract accurately represents the content of the candidate's thesis. I recommend its publication. Signed v Bennett L. Fox
PAGE 6
ACKNOWLEDGEMENTS Many people have contributed, directly or indirectly, to the comple tion of this thesis. Many U.C.D. faculty members have given me inspiration, technical advice, and moral support; I would like to especially thank Bill Briggs, Stan Payne, and Rich Lundgren, as well as the members of my thesis commit tee. Of course, the greatest thanks are due my advisor, Ben Fox. This project would not have been possible without his knowledge and insight, as well as his constant support. Special thanks goes to Arlan Ramsay at C.U.Boulder, both for having suggested I investigate the U.C.D. graduate program in the first place, and, together with his son, Bruce, for having written the software program which produced this thesis. Thanks also to management and coworkers at the Bureau of Land Management, who have cheerfully put up with so many irregularities in my schedule. I also gratefully acknowledge the unwavering faith in my ability to succeed given by friends and family over the years, especially my parents, George and Mary Heine, my sister Nancy and her family, and my inlaws, Charles and Audrey Barnett. Most of all, I am indebted to my wife, Janet Barnett, whose unswerving love has been put to the test in many difficult moments. It is to her that I dedicate this thesis. VI
PAGE 7
CONTENTS CHAPTER 1 INTRODUCTION 1 1.1 The Simulated Annealing Algorithm 2 1.2 Previous Work . . . . . . 4 2 CONVERGENCE OF A LOOPSKIPPING ALGORITHM 7 2.1 Summary of Results 9 2.2 Proofs of Theorems 12 2.2.1 Proof of Theorem 2.1 2.2.2 Proof of Theorem 2.2 2.2.3 Proof of Theorem 2.3 2.3 Efficiency of Fox's Algorithm 3 SIMULATED ANNEALING WITH OVERRIDES 3.1 Simple Examples .... 3.1.1 Random Restart 3.1.2 Greedy Overrides 3.1.3 Perverse Overrides 3.2 Dynamic Neighborhoods 3.3 Overrides: The Main Result 4 SIMULATED ANNEALING WITH A NOISY OBJECTIVE FUNCTION .... 4.1 Formal Setup VII 15 16 17 20 25 27 27 27 28 28 30 34 35
PAGE 8
4.2 Consistent and Stable Estimators . . . . . 4.3 Asymptotically Stable Estimators: First Example 37 39 4.4 Asymptotically Stable Estimators: Second Example 42 5 PATHWISE CONVERGENCE WITH ADAPTIVE COOLING 45 5.1 Main Results . . . . . . . . . . . 46 5.2 SamplePath Convergence for Deterministic Cooling 48 5.3 SamplePath Convergence for Adaptive Cooling 50 5.4 An Example Where Pathwise Convergence Fails 6 CONVERGENCE OF THE PRUNED CHAIN 6.1 Convergence in Probability ....... 6.2 Pathwise Convergence for the Pruned Chain 6.3 S1 Compared to So ....... 7 DIRECTIONS FOR FUTURE WORK 7.1 Hitting Time . . . . . . 7.2 Tabu Penalties and Simulated Annealing 7.3 Noisy Objective Function Revisited 7.4 Refinement of Algorithm QUICKER1 BIBLIOGRAPHY ................. Vlll 53 57 58 60 64 67 67 68 69 70 72
PAGE 9
CHAPTER 1 INTRODUCTION Simulated annealing is attracting increasing attention as a method of combinatorial optimization. As knowledge of the algorithm spreads, re searchers find ever more diverse uses, reporting applications such as refinement of protein structures [35], redrawing of legislative districts [5], incore nuclear fuel management [33], and contaminated groundwater remediation [34]. In certain senses and under certain conditions, made explicit below, the Ma.rkov chain generated by simulated annealing is convergent. However, the algorithm ha.s been criticized for being slow and expensive. Perhaps this is pa.rtly because most theoretical studies have given little attention to pra.ctica.l wa.ys to implement the algorithm. The results in this thesis were originally motivated by attempts to find an efficient algorithm which maintains the desirable convergence properties. In the process, we ha.ve discovered additional theoretical results, interesting in their own right. To probabilists, simulated annealing presents an interesting special ca.se of a. timeinhomogeneous Ma.rkov chain. The properties of sta.tiona.ry Ma.rkov chains ha.ve long been wellunderstood, but theory for the genera.! nonsta.tiona.ry case still a.ppea.rs fa.r off. Nonsta.tiona.ry Ma.rkov cha.ins a.re better understood toda.y tha.n a. few years a.go, partly because of research in simulated
PAGE 10
annealing. We hope that the results outlined in this thesis will contribute to the growth of knowledge in this important area. 1.1 The Simulated Annealing Algorithm The goal of simulated annealing is to minimize an objective function U on a set of states S. We take S to be finite unless the contrary is explicitly stated. The algorithm generates a sequence of moves between successive neighboring states. At each step of the algorithm, a candidate move is proposed. If the move is to a state with lower or equal objectivefunction value, it is accepted; if the move is to a state with a higher objectivefunction value, it is accepted with a small probability. We now formalize these notions. By problem we mean a triple (S, U, R), where S is a (finite) set of states, U is a realvalued function on S, and R is the lSI x lSI stochastic matrix of tentativemove probabilities. We assume without loss of generality that the minimum of U over S is 0. Sometimes we use energy level as shorthand for objectivefunction value. Simulated annealing is a (generally nonstationary) Markov chain {X, : t = 1, 2, ... } on a set of states S. The transition probabilities are given by Pr(Xt+l = yiXt = x) = R(x,y)c(t)(U(y)U(x))+ for x # y, and Pr(Xt+1 = xiX, = x) = 12:: Pr(Xt+l = yiXt = x), y'l'x where E( ) is implied by the cooling schedule, described below. Denote by S0 the set of states on which U is minimized. The neighborhood N(x) of a state x consists of those states y for which R(x, y) > 0. A 2
PAGE 11
path is a sequence of states x0 . Xp such that Xi E N(xil) for 1 :S i :S p. As is customary, we assume that R is irreducible, so that any two states are connected by a path. Following Chiang and Chow [7], we say that state y is reachable from x at height h if there exists a path x = x0 x1 ... Xp = y such that maxo::;i::;p U(xi) :S h. A state x is a local minimum if no state y with U(y) < U(x) can be reached from x at height U(x). For each pair ( x, y) of distinct states, let h( x, y) be the smallest number h such that y is reachable from x at height h. The depth d( x) of a state is defined as { minh(x,y), U(y) < U(x) d(x) = minh(x,y), y E So if x rf. So ; if x E So. The problem ( S, U, R) has the property of neighborhood symmetry( NS) if R(x,y) > 0 whenever R(y,x) > 0. In most cases this property is unnecessarily strong and can be replaced with weak reversibility( WR), which requires that if a state y is reachable from state x at height h, then x is also reachable from y at height h. Assuming WR, as we do throughout, rather than NS has important practical implications: Fox [14] shows how to integrate simulated annealing and tabu penalties by redefining S, U, and R; with additional structure, given in [14], the redefined problem satisfies WR, but it is generally not possible to satisfy NS. The cooling schedule is a sequence (Tk), k = 1, 2, .... of positive numbers, usually converging to zero, such that the probability of accepting an uphill move of height one at step k is c(k) = exp(1/Tk) We find it more convenient to deal directly with the sequence ( c( k) ); by abuse 3
PAGE 12
of terminology, we also call this sequence the cooling schedule; this causes no ambiguity, because the c( k) and Tk determine each other. In most of what follows, we restrict our attention to schedules of the form (1.1) for c > 0, or more generally, to schedules satisfying (1.2) for positive constants a, c, c1 and c2 These are the schedules that have attracted the most attention historically, at least among probabilists. See [6] and [21 J for discussions of other kinds of schedules. In the work that follows, we are often concerned with the dynamics of exit from a single state x; the most significant quantity in this analysis is how high the chain must climb to leave x. Denote this quantity by b(x) = min (U(y)U(x))+; yEN(x) note that b( x) > 0 only if every neighbor of x has an objectivefunction value strictly larger than U ( x). We call such a state an isolated local minimum. 1.2 Previous Work A central problem of simulated annealing is how slowly the cooling schedule must go to zero in order to guarantee convergence to S0 Geman and Gernan [20] showed that, for problems satisfying NS, a constant c exists such that 00 2.:: E(i)' = oo t=l lim Pr(X, E So) > 1. 4
PAGE 13
Hajek [25] identifies the constant as d* = maxd(x), xrf.So where d() is the depth function defined in Section 1.1. He requires (S, U, R) to satisfy WR but not necessarily NS. Hajek shows that his conditions are necessary as well as sufficient. Hwang and Sheu [30] prove essentially the same theorem by an application of the "cycles" of Freidlin and Wentzell [18]. Chiang and Chow [7],[8] repeat the result in continuous time, and (with certain restrictions) give the asymptotic probability distribution for large t. Tsitsiklis [37] obtains similar results by working with small perturbations of probability matrices. Connors and Kumar [11], [12] show that, under similar conditions, "' h tX n \ L L .J L .J the 11m sup ott e \Jesaro average or k E Oo) IS one; we reLUrll G.uls pmnc in Section 5.1. All of these results assume a finite state space. Although there is a large body of literature about simulated annea.ling on continuous state spaces, we are not aware of any exact results relating the speed of cooling to the existence of convergence. Belisle [2] shows that convergence occurs on finite or infinite state spaces provided So is reachable in a single step from anywhere in the state space. It is easy to modify his proofs to show that a slightly weaker condition suffices: namely, from anywhere in S there is a path with no uphill moves leading to S0 Belisle's condition is strong enough that his result holds for any cooling schedule that approaches zero, or even a random schedule that approaches zero in probability. The rest of this thesis focuses on efficiently implementing simulated 5
PAGE 14
annealing without losing the convergence properties just discussed. In Chap ter 2, we describe Fox's [14] algorithm QUICKER. Using it eliminates explicitly rejected moves while leaving the sample path stochastically unchanged. We show that the computer time to execute QUICKER is asymptotically bounded ( a.s.) by a constant that can be prescribed in advance. This complements Fox's analysis of quadraticmean convergence. In Chapter 3, we use these results to propose an improved algorithm QUICKERj. A technical lemma needed to prove the convergence of QUICKERj turns out to be useful in an important and littlestudied problem: simulated annealing when the objective function is not precisely known; we analyze this situation in Chapter 4. In Chapter 5, we analyze an algorithm, equivalent to QUICKER1, which amounts to using a random cooling schedule. It exhibits a different sort of convergence: the pro portion of observed optimal states tends almost surely to one. In Chapter 6, we consider the embedded chain of pairwise distinct states (i.e., the chain (Xk), pruned of its selfloops) and show that converges to a set S1 We relate So and S1 Chapter 7 indicates directions for future research. 6
PAGE 15
CHAPTER 2 CONVERGENCE OF A LOOPSKIPPING ALGORITHM Implementing simulated annealing in a naive but straightforward fashion, the processing at step t looks like: ALGORITHM SLOW(x, t) Select y from N ( x) If U(y) :":: U(x) then Set Xt+, +y Else Generate a uniform random variate V If V < E(t)IU(y)U(x)J, then Set Xt+, +y Else Set xt+l +X End Implemented on a computer, Algorithm SLOW is inefficient; the computer spends almost all of its time proposing moves and then rejecting them. Moreover, the longer the simulation runs, the worse the inefficiency becomes. Fox (preprint of [14]) notes that the algorithm is a "victim of its own success": when the probability of accepting an uphill move gets small, the algorithm stays, with high probability, at local minima for very long periods of time. Algorithm SLOW, as well as its modifications discussed in this thesis, may also consume a lot of time in oscillating between adjacent states. Fox [14] inhibits this, while preserving convergence in probability, by integrating tabu penalties.
PAGE 16
We discuss this further in Section 7.2. Greene and Supowit [24] eliminate rejections by forcing the algorithm to change states at every step. The new state is chosen from the neighbors of the current state via a distribution conditioned on an accepted move. However, their method fails to update temperatures; thus, it destroys the convergence properties described in the previous chapter. F'ox [14, 15] proposes an algorithm QUICKER which greatly reduces the computer time in selfloops, yet (because simulated time is unaffected) updates temperatures in the right way to preserve convergence to the optimal states. Suppose that the algorithm enters state x at time k. The point of QUICKER is to generate the transition number on which a move to a different state next occurs. Let a(x, k) be the probability that the algorithm exits state x at time k + 1; that IS, a(x, k) = Pr (Xk+l f X I xk = x). We assume a(x, k) is nonincreasing ink for any fixed x. QUICKER is based on the observation that, to a first approximation, L has the same distribution as a geometric variate with parameter a(x, k) (i.e., the trial number of the first success in iid Bernoulli trials with success probability a( x, k) ). The algorithm follows. ALGORITHM QUICKER(x, t) Set k <t Until exit, repeat 8
PAGE 17
Generate a geometric r.v. G with parameter a(x, k) Set L +k + G 1 Generate a uniform r.v. V If V < a(x,L)ja(x,k) then Execute NEXT( x, L) Exit Else Set k +L End Above, "Exit" means go to the next state. The subroutine NEXT( x, k) selects a state from N(x), conditioned on a move away from .Tat time k. Fox [15] shows that the number of repetitions of the "Until. .. repeat" loop required by each invocation of QUICKER converges in quadratic mean to 1, provided c(t) = rrfc and cis strictly larger than a constant J, to be introduced shortly. He shows by example that in general no convergence occurs when c = J. We give an almostsure counterpart, with interesting consequences. Fox also shows that at isolated local minima QUICKER reduces cornputer time by an asymptotically infinite factor. We sharpen this observation by giving upper and lower bounds for this time reduction. 2.1 Summary of Results As in the previous section, a( x, k) denotes the probability of accepting a move out of state x at time k. We often suppress dependence on x, writing simply ak. 9
PAGE 18
On any given sample path w, let Q(x,k,w) = the number of geometric variates (number of inner loop iterations) generated by QUICKER if state x is entered at time k; 0, otherwise. (2.1) Again, we often suppress dependence on w or J; and write Q(x, k) or just Qk. Recall the function 8( x) defined on page 4. Let iJ = maxh(x) =max min (U(y)U(x))+ xES xES yEN(x) There is in general no simple relationship between iJ and Hajek's constant d*; they are maxima of different functions over different sets. However, the case of greatest interest is probably that of a problem with a local, nonglobal minimum x whose depth d(x) is at least as great as the depth of all global optimizers. In this case, iJ S d*, and a necessary condition for Pr(Xk E So) t 1 is that E( k )J = oo. When E( k) = kI(c, this is equivalent to ilj c S 1. In contrast to d*, the constant d depends for its value only on the immediate neighborhood of each local minimum. This reveals something of the flavor of QUICKER; its improvement over SLOW is greatest at those local minima x where 0 ( x) is large. As discussed above, iJ S d* in the cases of greatest interest. There 8(x)jc S 1 is a necessary condition for Pr{Xk E S0 } t 1. We shall see shortly that it is undesirable to have 8(x)jc = 1. On the other hand, when o(x)jc < 1, QUICKER acquires some attractive properties. Fox's result in [15] is that, for c > 8(x), lim E((Q(x,k)1))21 Q(x,k) > o) = 0. 10
PAGE 19
Actually, it is routine to show that all moments converge; that is, the above expression holds with any positive integer exponent. The next two theorems constitute an almostsure counterpart to Fox's result. Theorem 2.1 When a state x is not an isolated local minimum (8(x) = 0), then with probability one, Q(x, k):::; 1 for all but finitely many k. Theorem 2.2 Let j be a positive integer. If x is an isolated local minimum, and 8(x)/c < (j1)/j, then with probability one, Q(x, k) :::; j for all but finitely many k. We do not have a converse to Theorem 2.2, but the following may provide some insight. Although the conditions of Theorem 2.3 would never be realized in a lone sample path, they might apply, for example, to massively parallel runs of QUICKER. Theorem 2.3 Assume that Q(x, k) > 0 for all k, and that Q(x, k) and Q(x, l) are independent when k # l. If 8(x )/c (j 1)/j, then Q(x, k) > j infinitely often on almost all sample paths. The quantities Q( x, k) are certainly not independent, but Theorem 2.3 may indicate something about the worstcase behavior of QUICKER. The following example (see Figure 2.1) shows what can go wrong in thecriticalcase8(x)/c = 1. LetS= {x,y,z} with U(x) = 1, U(y) = 2, U(z) = 0, N(x) = N(z) = {y} and N(y) = {x,z}, and take E(k) = k1 A routine calculation shows that the expected time to exit from state x, given xk = x, is infinite for all k. Since the sum of a finite number of geometric variates with positive parameters is finite a.s., it follows that Pr(Q(x, k) > j) > 0 for all 11
PAGE 20
u 2 y 1 X 0 z Figure 2.1. positive integers j and k. Adaptive Cooling. Consider the following modification: upon entry to a state at step k, keep the temperature constant at c( k) until a move out of the state is finally accepted at time m; then change the temperature to E(m). In effect, we modify the cooling schedule (c(t)) in a certain random way. The sojourn time in state x beginning at time k has a geometric distribution with parameter a(x, k). Suppose (S, U, R) contains no isolated local minima, and c(t) = r11', with c;::: d*. Theorem 2.1 makes it plausible that Pr(Xk E So)> 1 when c(t) is replaced by the random schedule described above. We show in Chapter 3 that this is in fact the case. We give further convergence results for such schedules in Chapter 5. 2.2 Proofs of Theorems All three proofs follow the same pattern: we write down an exact expression for Pr(Qk > .i [ Qk > 0), then bound it by suitable inequalities to show that the sum over k either converges or diverges, then invoke the BorelCantelli lemmas to assert that the event Q k > j does or does not occur 12
PAGE 21
infinitely often. We begin with some relationships that will be useful in all the proofs. The acceptance probability at a state x is of the form n a(x, k) = Ao + + atdf', (2.2) i=l where n 2: 0 and we assume without loss of generality that 0 < d1 < ... < dn and all the A,: except possibly A0 are strictly positive. The case n = 0 occurs when and only when x has no uphill neighbors; i.e., it is a local maximum. The case Ao = 0 occurs when and only when all of the neighbors of x are strictly uphillthat is, recalling the definition on page 4, x is an isolated local minimum. In that case, d1 = 8( x ). The inner loop of QUICKER will be executed more than once if and only if the first "V test" fails. Conditioned on the first geometric variate being l, the probability of this is Thus Pr(Qk > 1) < Pr(Qk > 1 I Qk > 0) co (akak+ltl (1ad1 ak 1=1 co ak+l)(lak) 1 [::;:;;0 co ak+1)(1ak( (2.3) 1=1 Likewise, if the first geometric variate is l and the first "Vtest" fails, the remaining number of geometric variates is Qk+l1 Thus we obtain the recursion Pr(Qk>.i+l) ::; Pr(Qk>.i+11Qk>0) 13
PAGE 22
00 I;(ak j) 1=1 00 I;(ak j) 1=0 00 I;(akak+l)(ladPr(Qk+l > j). (2.4) 1=1 In the case j = 0, (2.4) reduces to (2.3). Now we obtain a bound for the factor A;kd;fc(d;jc)/k(d;fc)1, (2.5) or equivalently, (2.6) Therefore, n (AoAo) +I; (A;kd;/cA;(k + l)d;/c) i=l n < I; A;( d;j c) lk(d;jc)1 (2. 7) i=l Equality occurs in (2.7) in the case n = 0 (a local maximum), since then both sides vanish. 14
PAGE 23
By keeping the second order term in the Taylor series, we get the lower bounds corresponding to (2.6) and (2.7): Aikd;jcAi(k + l)d;jc > (d;jc) l k(d;jc)l+ d;jc) 2 k(d;jc)z, and n CXkCXk+l > L A;(d;jc) l k(d;/c)1 i::::l n } L A;(d;jc)(l + d;jc) 2 k(d;jc)2. (2.8) i=l Finally, the expressions for the first and second moments of a geometric distribution imply and 00 co (1 ak)1 = a;2(1ak) ::; a;2 1=1 L 12(1ak)1 1 = ak)(1ak) < a;3 1=1 (2.9) (2.10) 2.2.1 Proof of Theorem 2.1 Fix a state x. If n = 0 (i.e., x has no uphill neighbors) then (2.3) and the remark following (2. 7) imply that Pr( Qk > 1) = 0 for all k, and the theorem is trivial. Assume for the rest of the proof that n > 0. We have from (2.2) that a(x,k) 2: Ao; By hypothesis, x is not an isolated local minimum. This means that A0 > 0 and therefore, 2 < Az ak o 15 (2.11)
PAGE 24
From (2.3) and (2. 7), 00 Pr(Qk > 1) l:(akak+1)(1ad 1=1 n oo < 2::: Ai(di/c)k(di/c)1 2::: 1(1ad. i=l 1=1 Applying (2.9) and (2.11), n Pr(Qk > 1) < 2::: Ai(dt/c)k(di/c)loi;2 i=1 n < 2::: AiAo'(dt/c)k(d,fc)1' i=l and summing over positive k, n l:Pr(Qk > 1)::; l:AiAo'(dt/c)(l:k(di/c)1) < oo. (2.12) k i=l k From the BorelCantelli lemmas, (2.12) is sufficient to prove that, almost surely, Qk > 1 only finitely often. D 2.2.2 Proof of Theorem 2.2 By hypothesis, x is an isolated local minimum, so A0 = 0 and a(x, k) is of the form n ak "'Akd,jc ..... i=l where n > 0. As before, assume that 0 < d1 < ... < dn < 1 and all the Ai are strictly positive. Note that d1 = 8(x); for the rest of this proof we suppress the reference to x and write just 8. There is a positive constant C3 such that for all k or equivalently, (2.13) We now prove by induction the following statement for nonnegative integers j: 16
PAGE 25
(Pj) Assume 8/c < 1. Then there exists a constant Dj, not dependent on k, such that Pr( Qk > j) :::; Dj kj(S/c)j. By the BorelCantelli lemmas, proving statement Pj for positive j is sufficient to prove Theorem 2.2, since if j > 0 and 8/ c < (j 1) fj, then 8/ c < 1; also, the quantity j(8/c)j is strictly less than 1, so that LkPr(Qk > j) is bounded by the convergent series Lk Djkj(&/c)j. Statement Po is trivial. Assume statement Pj and 8/ c < 1. Then j(8jc)j < o, so ( k + l)j(8/c)j < kj(ojc).i Writing (2.4) and using (2.7), the induction hypothesis, and (2.14): 00 Pr(Qk > j + 1) = I: Pr(Qk+l > j)(akak+t)(1ak) 1 l=l n oo < L AiDj( d;j c)k(d;/c)l kj(ofc)j L 1(1 ad. l=l It now follows from (2.9) and the right inequality of (2.13) that Pr( Qk > j + 1) < AiD,di/ c) k(8/c)1 kJ(&/c)i a;;2 < (.2. A A 2 D djc) k(S/c)l+j(5/c)j+2(5/c). L 1 .1 t i=l (2.14) this proves statement Pj+> and finishes the proof of Theorem 2.2. D 2.2.3 Proof of Theorem 2.3 By hypothesis, Qk > 0 for all k; therefore, Pr(Qk > j) = Pr(Qk > j I Qk > 0), a.s. (2.15) In the remainder of the proof, we use the the two sides of (2.15) interchangeably. Since ak is strictly less than one (at least for large k), and is nonincreasing in k, statement (2.13) implies that there is a constant C4 such that (2.16) 17
PAGE 26
We prove the following statement by induction for nonnegative integers j: ('Pj) Assume 8/c < 1. Then there exist positive constants Dj and D'j, not dependent on k, such that Pr( Qk > j) 2: Dj kjojcj D'j. k(j+l)(o/c)(j+l). Theorem 2.3 will then follow from the BorelCantelli lemmas and the indepen dence of the Q k. Statement is trivial. Assume 'Pj. Then from (2.4), (2.8), and the induction hypothesis, Pr(Qk>j+l) 00 2] O:k O:k+i)(1 o:d Pr( Qk+l > j) l=l l=l n n x(l:::Ai(d;fc)k(d,jc)lz_ 2::Ai(d;fc)(1 + d;/c)k(d,/c)2(12/2)) i=l i=l X (Dj(k + l)j(S/c)jD'j(k + l)(j+l)(S/c)(j+l)). (2.17) Now (k + l)(j+1)(6/c)(j+l) :<:; k(j+l)(S/c)(j+l)' (2.18) while a Taylorseries argument similar to that used to derive (2.5) shows that (k + l)i(S/c)j 2: kj(S/c)jj/(18/c)kj(S/c)j1. (2.19) Applying (2.18) and (2.19) to the last factor of (2.17), expanding the product, and dropping some positive terms, Pr(Qk>j+1) 00 > 2:::(1o:d l=l 18
PAGE 27
X )k&/c1zt A;( d; )(1 + d; )k(d,jc)2( l) c i=1 c c 2 x (n'ki(&/c)iD'j(1D"k(j+l)(&/c)(j+1)) J .1 c .7 > A1Dj(6/c)k(5/c)1+J(5/c)j (f /(1ak)l) 1=1 A1Dj(6/c)j(16/c)k(S/c)l+i(S/c)j1 (t, /2(1ad) A1Dj'(8/c)k(S/c)1+(j+1)(6/c)(j+1) (t, /(1ad) t, A;Dj(d;/c)(1 + d;/c)kd,jc2+j(S/c)j (t, (1ak)1). Finally, we use (2.9) and (2.10), and then (2.13) and (2.16) to conclude that Pr(Qk > j + 1) 2: A1Dj(6/c)k(iI)(S/c)(i+1 )(ak"2 ) 2A1Dj(8/c)j(lfi/c)k(j1)(Sjc)(j+2)(ak"3) n L A;Dj(d;/c)(1 + d;/c)k(j!)(di/c)(i+2)(ak"3 ) i=l > D' k(i+1)(5/c)(i+1) 1)'! k(j+2)(5/c)(j+2) J+l J+l where and This finishes the induction proof of Pj+, and the proof of Theorem 2.3. D 19
PAGE 28
2.3 Efficiency of Fox's Algorithm We compare the expected efficiency, work per unit cost, of algorithms SLOW and QUICKER. Say that computing the states X1 ... Xk represents k units of work, and that each iteration of SLOW represents one unit of cost, so that the efficiency of SLOW is 1. To find the relative efficiency of QUICKER we find its cost for computing X1 ... Xk. We can break each invocation of QUICKER into three phases: (i) Computation of the acceptance probability. (ii) Iteration of the inner loop. (iii) The call to the algorithm NEXT. Steps (i) and (iii) involve scanning the neighborhood of state x and thus their cost depends on the size of N(x). It is reasonable to assume that this cost is bounded above and below by constant multiples m and M, respectively, of the cost of one iteration of SLOW. The main work in the inner loop is the generation of two random numbers, (and, possibly, another neighborhood scan), so the cost of each iteration of the inner loop is bounded above and below by constants m' and M'. In QUICKER, the clock is advanced only during iterations of the inner loop. Thus H(k) m + G(k) m'::; cost (Xr, ... ,Xk)::; H(k) M + G(k) M', where H(k) is the number of iterations of QUICKER up to time k and G(k) is the number of iterations of the inner loop up to time k. The import of the theorems in this chapter is that for all large k, G(k)/ H(k) is between 1 and j, for some j depending only on the cooling 20
PAGE 29
schedule. We use this to get (m + m') H(k) :C::: cost (X1, ... Xk) :C::: (M + jM') H(k), (2.20) with probability arbitrarily close to one for large values of k. We now estimate E(H(k)). For this purpose, consider the sequence of random variables f3(k), k = 0, 1, ... defined by Suppose we know H(k) for some k. If the algorithm moves to a different state at step k + 1, then H(k + 1) = H(k) + 1. Otherwise, H(k + 1) = H(k). Thus we have the recursive relationship E(H(k + 1)IH(k),(3(k + 1)) (1 + H(k))j3(k + 1) + H(k)(1f3(k + 1)) H(k) + f3(k + 1). (2.21) Taking expectations in (2.21) and using the initial condition H(O) 0, we conclude by induction that k E(H(k)) = LE(f3(k)). (2.22) j=1. Next, we estimate the quantities (3(1), ... ,(3(k). Trivially j3(j) :C::: 1 for all j. Also, if d > 0, and Xj is an isolated local minimum, then D. jd/c :C::: (3(j) for a constant D. If Xj is not an isolated local minimum, then (3(j) is bounded away from zero. Thus in any case we have D' yJ/c :C::: (3(j) :C::: 1, 21
PAGE 30
for some constant D'. Taking sums, we have k k k ?: I: j3(j) ?: 2:: ])' jd/c > Co ( kldfc) (2.23) j=l j=:l for another constant C0 ; applying (2.22), Co k'(dfc)::; E(H(k))::; k. (2.24) Recall that our measure for efficiency is work per unit cost, or k/(cost (X1 ... ,Xk)). Applying (2.24) to (2.20), taking multiplicative inverses, and then multiplying through by k, this means that there exist constants C1 and C2 such that C1 ::; E(efficiency of QUICKER at time k)::; C2kdfc. (2.25) The efficiency of QUICKER relative to SLOVV increases as the parameter c in the exponent of the cooling schedule gets smaller. This should not be interpreted as an argument in favor of faster schedules. It reflects the fact that SLOW is becoming less efficient; as c gets close to d, the algorithm spends more and more of its time in self loops at local minima. Efficiency is highest when many of the states X1 ... Xk are isolated local minima. When there are no isolated local minima, the lower bound in (2.7), and thus the upper bound in (2.25), can be replaced by a constant. In this case, the relative efficiency of QUICKER is bounded both above and below by a constant, and there is no orderofmagnitude gain in speed from its use. On the other hand, if all local minima are isolated, then (2.7) is sharp. Heuristically, QUICKER tends to be an efficient algorithm to the degree that direct transitions between local minima are rare or difiicult. 22
PAGE 31
Some examples from the literature show that isolated or nearisolated local minima can actnally occnr. In the "maximnm m.atching problem" of [25], all local minima are isolated. In the first formnlation of the graph coloring problem in [31], direct transitions between local minima are not impossible, bnt would be extremely rare. Our results indicate that QUICKER would attain its maximum efficiency in such situations and would be worth trying there. The Distribution of H(k) Another approach to computing the expected value of H( k) may give additional insight. Consider the following modification: fix a state x0 and start the algorithm there. Run QUICKER as before, but every time there is a move to a state other than x0 return im mediately to x0 before continuing. Let fi ( k) denote the number of innerloop iterations up to time k in the modified process. Clearly, H( ) is an inhomogeneous discretetime counting process. Letting p;(k) denote the probability that lf(k) = i, the kstep transition vector p(k) satisfies the forward equations (1a(xo, k + l))Po(k); Po(k + 1) p;(k+1) (1a(xo, k + l))Pi(k) + a(xo, k + l)p;_1(k), i = 1, 2, ... clef ( The generating function ((z, k) z:o z'p; k) then satisfies the firstorder difference equation ((z, k + 1) ((z, k) = a(x0 k + l)((z, k) + za(x0 k + l)((z, k), with initial condition ((z, 0) = 1. It is easily verified by induction that the solution is k ((z, k) =IT (1 + (z1)a(xo,j)); (2.26) j=l 23
PAGE 32
taking the first derivative at z = 1 shows that k E(H(k)) = 2.:: a(x0,j), (2.27) j;;;;:;l exactly. Moreover, for z near one, the right side of (2.26) is arbitrarily close to which is the probability generating function of a Poisson distribution with the same mean. Observing that H(k) S: H(k) stochastically, we obtain (2.24) as before. 24
PAGE 33
CHAPTER 3 SIMULATED ANNEALING WITH OVERRIDES The results of Chapter 2 show that there is an integer j, determinable in advance, which bounds the number of innerloop iterations of QUICKER(x) for all x after a random but finite number of moves. This suggest that we modify QUICKER to always force an exit after .i iterations. formally: ALGORITHM QUICKERj(j, x, t) Set k _71j !then Execute NEXT(x,L) Exit Else Set k
PAGE 34
architecture executing stochastically independent runs of simulated annealing. Each move at a given computer clock tick executes the same instruction, on processordependent data. When QUICKERj exits after h :; j iterations on some processor, then that processor executes jh additional dummy iterations in order to maintain synchronization. Such synchronization is not possible with either QUICKER or the naive move mechanism. Stochastically, QUICKER1 generates the same sample path as SLOW with the adaptive cooling schedule described in Section 2.1. It seems intuitively plausible that convergence in probability would continue to hold under algorithm QUICKERj for the right choice of j. Formally, we have Theorem 3.1 If j chosen so that the number of innerloop iterations of QUICKER exceeds j only finitely often, and if lim Pr(Xk E So) = 1 (3.1) under algorithm QUICKER, then (3.1) also holds under algorithm QUICKERj. Theorem 3.1 is a corollary of a much more general result which we prove in this chapter. Broadly, the convergence properties of a probabilistic search algorithm are unaffected by any modification of its initial steps, provided the total number of modified steps is finite with probability one. That is, suppose the search strategy can have its standard move mechanism overridden up to a random but finite time N; just after N, the standard move mechanism takes over. We show that the current state of the Markov chain associated with the modified process converges in probability to the set So of optimal states 26
PAGE 35
whenever this is true in the unmodified process. We do not require the state spaceS to be finite. Moreover, So can be an arbitrary measurable subset of S, not just the set of global optimizers. In the next two sections, we give examples of applications of this general result, followed by a formal statement and proof in Section 3.3. 3.1 Simple Examples 3.1.1 Random Restart Begin the simulated annealing process in the usual fashion. Observe the process. If it appears to stall at a particular state or set of states, jump at random to a different part of the state space. Repeat this until a fixed number n of restarts have been made. With any reasonable definition of "stall", the restart mechanism is in fact triggered n times. In this case, N is the transition number at which the nth restart is made. Our results do not say that modifying simulated annealing in this fashion makes the algorithm either better or worse. However, we are assured the modified algorithm retains the convergence property (3.1). In this example, N is a. stopping time (Markov time), because we know its value after N transitions have occurred, but later we give important examples where this is not true. Perhaps surprisingly, our basic result continues to hold under certain natural conditions even when N is not a Markov time. 3.1.2 Greedy Overrides Upon arrival to a state, scan the en ergy levels of all its neighbors. If a. neighbor is encountered with a. lower energy level than any state yet seen, move immediately to the lowest such neighbor. On a. finite state space, N is obviously finite. Under weak conditions, N is the transition number on which S0 is first visited, and so is a Markov time. 27
PAGE 36
Such a greedy override strategy is widely used in the tabusearch community (see, for example, Glover [22],[23]). Under certain restrictions, Fox [14] recasts tabu search as simulated annealing on a more elaborate state space. It. is easy to give instances where a greedy override strategy increases the time to first visit a global optimum. Heuristically, however, it seems attractive unless the state space is contrived. 3.1.3 Perverse Overrides Upon arrival to a state, scan the neighbors. If one of the neighbors has higher energy level than any state yet encountered, move immediately to the highest such neighbor. The analysis of the previous subsection goes through, where this time N is the transition number on which the set of global maximizers is first encountered. Vve do not claim this strategy is worthwhile, except perhaps on very unusual state spaces, but it does show the robustness of the property (3.1). 3.2 Dynamic Neighborhoods Consider a generic search algorithm of the form ALGORITHM SEARCH(x,t) Generate a set F( x, t) of candidates for the next state Optional: if F(x, t) contains an element at least E better than any yet seen, move to it Until exit, repeat Select an element of F ( x, t) If it passes the standard acceptance test, then Exit End In ordinary simulated annealing, F(x, t) is the fixed set of neighbors of x. Here we allow the possiblity of using problemspecific information to choose a 28
PAGE 37
smaller or larger F' at each step of the algorithm. Of course, we have to prove that convergence in probability sti1l occurs; this is true in the two examples of this section. Fox [14] shows that algorithm QUICKER, now depending on the triple (x, t, F'(x, t)), can replace the "Until. .. repeat" loop in SEARCH. We want to further streamline SEARCH using the techniques of this chapter. When F is a small finite subset of S, it is practical to compute the acceptance probability a(x, k) needed for QUICKER or QUICKERj, and also for their subroutine NEXT. If the method to pick an element of F' depends on the energy levels of all its elements, then a(x, k) can be computed with essentially no additional work. In the setting of Fox ([14] and [15]), S is finite. The set F(x, t) is formed by supplementing the nominal fixed neighborhood of the current state x with randomly generated elements from S. To show that the convergence the orems apply, Fox recasts this as standard annealing with fixed neighborhoods, but on a more elaborate state space. To ensure a set F(x, t) of reasonable size, the nominal neighborhood can be chosen relatively small, while the added elements are chosen to guar antee weak reversibility and irreducibility. Heuristically, the added elements diversify the search; [15] shows that they have theoretical value as well. We modify SEARCH in two ways: first, we perform the "Optional" step with E = 0 (this is the same as the greedy strategy in Section 3.1); second, we further speed up the "Until. .. repeat" loop hy using QUICKERj instead of QUICKER. Our theorem applies to both modifications. To cool as rapidly as possible and still 29
PAGE 38
take advantage of the convergence properties of Theorems 2.1 and 2.2, we want to ensure that the constant J is not too large. This is avoided by making re quiring the nominal neighborhood of each state to contain a state with at most a slightly higher energy level. When S is a compact subset of Belisle [2] implicitly performs SEARCH. He takes F = S, a natural choice when NEXT is not used. Under the strong assumptions of [2], there are no restrictions on the rate of cooling; in particular, convergence occurs with any adaptive schedule, so long as the temperatures decrease to zero. Moreover, any state can be reached from any other state, so there are no isolated local minima (unless there is only one optimal state; this special case is easily disposed of). Then we use QUICKER1 (corresponding to an adaptive schedule) to speed up the "Until. .. repeat" loop. We also perform the "Optional" step here, this time setting E > 0. It is straightforward to extend Belisle's results to the case where the neighborhoods F are dynamically chosen, so long as there is a path with no uphill moves from any state to S0 That is, every set F(x, t) must be forced to include a state with the same or lower energy level than x. This works a.lso when S is finite, a.nd it offers a way of proving convergence that is different from Fox's recasting scheme, mentioned earlier in this section. Once again, there are no isolated local minima, so we use QUICKER1 a.s before. 3.3 Overrides: The Main Result Let So be an arbitrary measurable subset of S, not necessarily finite. We require that the unmodified chain (Xk) converge in probability to 80 from 30
PAGE 39
any starting state at any initial time. More precisely, lim infPr{Xk E SoiXn = x} = 1, k+co .TES (3.2) for all positive integers n. When Sis finite, we can replace (3.2) by the simpler condition lim Pr{Xk E SoiXn = x} = 1, (3.3) The epoch of the last override must be almost surely finite, but we do not require it to be a Markov time. It is enough that the override mechanism stops eventually and has no further influence on the destiny of the process. More precisely, denote by Y(k), k = 1, 2,, ... the Markov chain of moves with overrides, and by N the epocb of the last override. We assume that there exists a family {Z(n, ): n ::C 0} of Markov chains such that: (i) The chains Z(n, )andY() coincide exactly up to time n. That is, Z(n, k,w) = Y(k,w) for k :::; n and every w in the probability space. (ii) After time n, the transition probabilities of Z(n, )and X are the same; that is, Pr(Z(n, k + 1) = y I Z(n, k) =X) = Pr(Xk+1 = y I xk =X) for k > n and x, y E S. (iii) There exists a random time N, a.s. finite, after which the transition probabilities of Y() and X are the same. Here is the intuition for this structure. We start with the unmodified process and define a sequence of processes in which random overrides are allowed, but only up to a deterministic time n. Each process Z(n, )couples withY() up to 31
PAGE 40
time n and behaves stochastically like (Xk) after time n. We can characterize Y(j) as Z(n,j). The limit is attained for some finite n, which can be chosen uniformly in j. Theorem 3.2 Assume (3.2) and conditions (i)(iii). Then (:).2) is also true for the chain Y (). Proof: Let E > 0 be arbitrary. For each positive integer rn, let cf( rn) be the smallest integer j > rn such that inf Pr{Xk E SoiXm = x} > 1E, xES for all k :2: j. Condition (3.2) guarantees that (m) is welldefined and finite for every rn. Moreover, the function() is nondecreasing, as we now show. Select n <::; ( rn ). Then there exist k :2: n and x E S such that Pr(Xk E So I Xm = x) <::; 1E. Applying the ChapmanKolmogorov equations, we have inf Pr(Xk E So I Xm+! = y) yES < L Pr(Xk E So I Xm+! = y) Pr(Xm+! = Y I Xm = x) yES Pr(XkESoiXm=x) < 1E, which implies that n <::; (m+ 1). Because n was an arbitrary number between 0 and (m), (rn) <::; (rn + 1). Since cf() is nondecreasing and grows without limit, there is, for k :2: (0), a unique integer *(k) such that
PAGE 41
Note that the function t/J*() is also nondecreasing and grows without limit. Take any k 2': tjJ(O); write v for t/J*(k). By condition (ii), Pr(Z(t/J*(k),k) E So) > inf Pr (Z(v, k) E S0 I Z(v, v) = x) xES inf Pr(Xk E So I Xv = x) xES (3.5) > 1 E. To finish the proof, we use a variant of the coupling inequality (see Asmussen[l], p. 143). By conditions (i) and (iii), the chains Z(v, )andY() coincide exactly for all time on the set { N < v}. Thus I Pr(Z(v, k) E So)Pr(Y(k) E So)l = I Pr(Z(v, k) E So, N 2: v)Pr(Y(k) E S0 N 2: v)l (3.6) < Pr(N2':v). Vl/e have proved that, for arbitrary E, for all large k, Pr (Y(k) E So)> 1EPr (N 2: tjJ*(k)). As remarked above, the function t/J*() is nondecreasing and grows without limit; by condition (iii), this completes the proof. 0 33
PAGE 42
CHAPTER 4 SIMULATED ANNEALING WITH A NOISY OBJECTIVE FUNCTION We now discuss a significant application of the results in Chapter 3. Suppose we are interested in a state space where objectivefunction values are unknown and must be estimated by measurements subject to error. For example, each state might be a configuration of a queuing network. To run simulated annealing in the most naive fashion, pretend that the estimated objectivefunction values are the true values; calculate move probabilities and move based on the estimates. Meanwhile, every time a state is seen, measure its objectivefunction value and use the measurement to refine the estimates. In this chapter, we show that under the right conditions, this simple procedure works. Yan and Mukai [38] have previously considered such situations. They assume that the state space satisfies NS, and that we know a. finite interval (a, b) surrounding all the objectivefunction values. In their algorithm, the probability of accepting a proposed move is itself a random quantity, typically estimated by multiple measurements of the tentative new state. A proposed move is rejected unless the result of every simulation is smaller than a uniform va.ria.te on (a, b) drawn at the beginning of the move. The number of simula tions required a.t each move is large and grows without limit as the algorithm
PAGE 43
proceeds. Gelfand and Mitter [19] propose a different method. They formulate the transition probability at time t as R( X, Y )E( t)[U(y)U(x)+ W(t)J+, where W(t) represents a noise term which may depend on past history. This general formulation includes ours as a special case. However, their convergence proofs assume that W ( t) is Gaussian with variance asymptotically smaller than any linear multiple of 1/( log c(t)). Since some states may be visited only rarely, it is unclear how to check the variance condition. For technical reasons, we assume that the unknown objective function takes on values in a finite set, such as a subset of the computerrepresentable numbers. Neither [38] nor [19] need to do this. But since our treatment allows arbitrary precision, we do not consider this restriction important. In the next section, we give a formal description and show that our conclusions follow from the results of Chapter 3. It turns out that we need estimators which are "asymptotically stable"; the remainder of the chapter discusses these and presents two examples. 4.1 Formal Setup At the jth entry to a state x, make an observation T'(x,j). Each of the observations T(x,1),T'(x,2), ... has expected value h(x). The vector h = ( h( x) : x E S) is unknown. Perform simulated annealing with acceptance probabilities formed from some estimate Hn = (H(x, nx) : x E S), where each component H(x, n.x) is based on history of observed values of T(x,.i), j = 1, ... nx, and nx is the number of times x has been observed up to time n. 35
PAGE 44
Assume that there are only a finite number of possible values for h(x) at any state x (for example, integer multiples of 0.001 between 0 and 1). In effect, we accept finite precision in our knowledge of h( x ). We say that the estimator H(x, )is asymptotically stable if there exists a random time N(x), almost surely finite, such that H(x,n) = h(x;) for all n 2': N(x). We show later that such estimators exist. In our setting, the xth component of Hn takes account only of the xth component of the observations (T(, j) ). The observations of the other components and the current observation number are ignored. For every component, the observations are thus iid. Since there are only a finite number of states, we can consider asymptotic stability of each component separately. To initialize, for each component we might set H(x, 0) = oo. If H(x, )is asymptotically stable for each state x, and if every state is visited infinitely often, then from some finite time N onwards, the vector H of Hestimates coincides with the vector h of true hvalues. In this setting, the "standard" algorithm is simulated annealing using h in the computation of the move probabilities. The modification uses the estimates Hn. To make this more precise, in both chains the original state space S is enlarged to include enough historical information to compute Hn at time n. Only the moves in the second chain take account of that additional information. When h is unknown, the first chain is hypothetical. Theorem 3.1 implies that if the first chain converges in probability to 50 with h, then so does the second chain. 36
PAGE 45
This setup generalizes easily to any situation where the objective func tion h( x) depends on a finitedimensional vector 8 of unknown parameters, and 8 takes on only a finite number v of values. The idea is to exploit structure in the state space so that v is much smaller than lSI, although that is not neces sary for our proofs. As above, we can interpret this as knowing the elements of e to finite precision. 4.2 Consistent and Stable Estimators Generalizing the previous section, we want to run Monte Carlo simu lation in an optimal way according to some criterion. The choice of simulation strategy depends on a parameter with an optimal value h. We do not kuow h. However, at step n of the simulation we make an observation T(n), and then use the observations T(l), ... T(n) to compute an estimate H(n) of h. At time n, the simulation strategy depends on H(n). Recall the definition of asymptotic stability in the previous section: there is a random time N, almost surely finite, such that H( n) = h for all n 2: N. Generally, to find such an H requires that we know a finite set F containing h, and that the observations fall in a some set G, generally a superset of F. Thus, H(n) maps thenfold Cartesian product of G to F. We give below two examples of asymptotically stable estimators. For simplicity, assume that h and T(j) are scalars and by rescaling that F is a set of consecutive integers. The set G is either a subset of the real numbers or a subset of the integer multiples of 1/ d, where d is an integer greater than 1. We can view the latter as a finiteprecision approximation to the former. Let (x) denote x rounded to the nearest integer, rounding upward 37
PAGE 46
in case of ties (i.e., (x) = lx + (1/2)J). Set t = E(T(j)) and let h = (t). An example shows what can go wrong. Let the sequence of sample means be T(n) = n1 L:j=1 T(j). The estimator H(n) = (T(n)) is not asymp totically stable when t is a halfinteger; for large n, it oscillates unpredictably between lt J and !tl. If t is near a halfinteger, then damped oscillation occurs; although there is eventual stability, arguably the rate of damping is very slow. This is the practical reason to consider the estimators in Sections 4.3 and 4.4. Suppose we try to correct the situation by observing the entire sequence of sample means T(l), ... T(n) (computed to double precision, per haps), counting how many are above or below the halfinteger line, and round ing up or down accordingly. This actually makes matters worse. For large n, the sequence of sample means closely approximates Brownian motion, and the proportion of the T(j) which lie above t has an arc sine distribution (see Feller [13], pp. 39799.) when t is exactly a halfinteger. When t is approximately a halfinteger, this proportion forms an approximate arc sine distribution with large probabilities of being near 0 or 1. The refined estimator still oscillates, but the intervals between changes grow longer and longer (in fact the growth is exponential). To the careless observer, it may falsely appear to be stabilizing. It is important to be clear about the difference between an estimator which is asymptotically stable and one which is merely consistent. If we had wanted, say, a static point estimator, restricted to the integers, of a parameter with true value 1.5, then we would be indifferent between an estimate of 1.0 and 2.0. Asymptotic stability is needed when we are interested, not in the parameter per se, but in a simulation process dependent on an estimate of the 38
PAGE 47
parameter. 4.3 Asymptotically Stable Estimators: First Example Let lYh ( n) be the number of observations equal to k up to step n; that is, n Mk(n) = = k). j=l Each observation T(n) is formed by adding a random error en tot, and then rounding to the nearest integer: T(n) = (t +en) Suppose that the errors en are iid. Then the T( n) have a common discrete distribution function with mass points ... ,p_1,p0,p1 . on the integers. Supdensity with bounded support, strictly decreasing on positive numbers. Then when t is not a halfinteger the set {Pk > 0 : k E Z} is finite and has Ph = P(t) for its unique largest element. In the exceptional case where t is a halfinteger, there will be maxima at both PLtJ and Pftl For large n, we expect Mk(n)/n to approximately equal Pk, at least for those knot in the tail of the distribution. These remarks form the intutive basis for our estimator. Let f( n) be a function on the positive integers which is nondecreasing in n and which satisfies both f(n) = o(n) and log(n) = o(f(n)). ( 4.1) (For example, f(n) yin.) Below, break ties for the secondlargest value arbitrarily. 39
PAGE 48
ALGORITHM STABLEl(x, t) For each step n: Make an observation T(n); Set k 0, another application of the strong law shows tha.t for all large n, on almost all sample paths, 40
PAGE 49
and the expression on the right is bounded below by f( n) for all large n. This completes the proof if t is not a balf integer. Now suppose that t = m + (1/2). Then Pm = Pm+l and these two are the only maxima in the set {Pk : k E Z}. Using the strong law as in the previous paragraph, the two largest elem.ents of the set { Mk( n) : k E Z} are Mm(n) and Mm+l(n) a.s. for all large n. Let j(O) = 0 and j(i + 1) = min {T(n) = m or T(n) = m + 1}, n>J(t) i.e., {j (1 ), j (2), ... } is the sequence of hitting times to the set { m, m + 1 }. We claim that IMm(j(i))Mm+l(j(i))l S: J(j(i)) a.s. for all but finitely many i. Consider the Bernouili process {T(j(i)) : i E z+}. It is well known, and a simple consequence of the BorelCantelli lemmas (see, for example, Billingsley [3], pp. 5354) that there exists a positive constant K such that the events {T(j(i)) = T(j(i) + 1) = .. = T(j(i) + Klni) = m} and {T(j(i)) = T(j(i) + 1) = = T(j(i) + Klni) = m + 1} each occur for only finitely many values of i. Let i0 be the largest i for which either of these events occurs. Since zero is a recurrent state in the random walk on the space of possible values of MmMm+l, Mm(j(i)) = Mm+l(j(i)) for arbitrarily large 41
PAGE 50
values of i. Let i1 be the smallest integer larger than i0 for which Mm(j(i1)) = Mm+1(j(i1)). Then for all i 2: IMm(j(i))Mm+i(j(i))l < Klni Klnj(i) < f(j(i)); (4.2) where the last inequality is true for all large enough i. We finish by showing that ( 4.2) is still true when .i ( i) is replaced by an arbitrary time n. Observing that j () is a nondecreasing function on the positive integers, we define an inverse. If n < j(l), set j*(n) = 0; otherwise, let j*(n) be the largest integer i such that j(i) n. The function j*() has the following properties: Thus, a.s. for all large n, j*(n) j(j*(n)) n; Mm(j(j*(n)) = Mm(n); Mm+l(j(j*(n)) = Mm+l(n). IMm+J(j(j*(n))Mm(j(j*(n))l < Kln(j*(n)) < Klnn < f(n). D 4.4 Asymptotically Stable Estimators: Second Example We drop previous assumptions about the error terms en, and instead assume that they are iid with a zero mean and finite variance CY2 We compute the sample means T n to a higher precision d than needed for our final estimator Hn 42
PAGE 51
Let g( n) be a function which satisfies and lim g(n) = 0, almost surely. For example, we could take g(n) = n1v'2s2lnlns2n, where s is the sample variance (this requires that the have a finite fourth moment). More simply, we could take g(n) = lnn/n. ALGORITHM STABLE2(x,t) For each step n Make an observation T( n) Compute the sample mean T(n) Set J(n) +lT(n)J + (1/2) If ITnJ(n)l :<:: g(n), then Set H(n) <rT(n)l Else Set H(n) <(T(n)) End By the law of the iterated logarithm, !T(n)t! :<:: (n)rz;;; :<:: g(n) ( 4.3) for all but finitely many n, where (n) = V0'2lnln(n0'2). When t = m + 1/2 for some integer m, then J(n) = t for all large n, so that STABLE2 is stable by (4.3). When tis not a halfinteger, 2g(n) < !J(n)hi for all large n; combining this with ( 4.3), !J(n)T(n)l > !J(n)t!IT(n)t! > IJ(n)hiIT(n)t! 43
PAGE 52
> 2g(n)g(n) = g(n), for all large enough n, almost surely. We have proved Theorem 4.2 Algorithm 2 is stable and consistent. 44
PAGE 53
CHAPTER 5 PATHWISE CONVERGENCE WITH ADAPTIVE COOLING Up to this point, we have measured success of an algorithm by convergence in probability to the set of optimal states 50 This standard may seem to say more than it does. Since there is no strong law of large numbers for the general nonstationary Markov chain, there is a priori no guarantee that optimal states will dominate others in any sample run of the algorithm. In this chapter, we study pathwise convergence directly and show that under certain circumstances the proportion of optimal states visited does indeed go to one, almost surely. (However, in general the sequence of states visited has no almost sure limitsee, for example, [2].) We prove that this is true for the algorithms SLOW and QUICKER under Hajek's conditions, and, under additional conditions different from those in previous chapters, for algorithm QUICKER1. An example shows that the additional conditions cannot in general be discarded. It is not clear that the pathwiseconvergence results of this chapter either imply or are implied by convergence in probability; see, however, the discussion at the end of Section 5.1. Our work was stimulated by that of Borkar [4], who applies the idea of "recurrence orders" first developed by Connors and Kumar [11] to obtain a samplepath result similar to ours on a deterministic cooling schedule satisfying
PAGE 54
when c is not less than a problemspecific constant d possibly different from, but no smaller than, Hajek's d*. vVe slightly improve Borkar's result, showing that it holds for c ::?: d*. Next, we exhibit. conditions under which pathwise convergence occurs with an adaptive schedule (i.e., with QUICKER1), and then present an example of a state space which fails to meet those conditions and in which pathwise convergence fails. 5.1 Main Results Theorem 5.1 In simulated annealing with the deterministic sched ule c(t) = r'l', we have lim n+oo 1 n L I(X, E So) = 1 n t=l a.s., (5.1) for any c ::?: d*. Remark Borkar [4] proves the same result, except that in place of d* he uses the constant d =max d(x), xES where d() is the depth function of Section 1.1. Since d* is the same maximum over the smaller set S \ S0 the two constants are related by d* :S d, and our result is a slight improvement over Borkar's. Now we describe what happens for algorithm QUICKER1. Recall from Section 2.1 that J is the maximum over xES of !5(x), where !5(x) = min{[U(y)U(x)J+: y E N(x)}. Let H = rnin{U(x): x :. S0 } be the smallest objectivefunction value outside the set of global optirnizers. Let T( k) be the epoch of the kth state change; 46
PAGE 55
i.e., T(1) = 1 and T(k + 1) = min{t > T(k): X, f XT(k)} (5.2) Finally, lett be the epoch of entry to the state X,; i.e., l = T(k), where k is the unique integer such that T(k)::; t < T(k + 1). Theorem 5.2 In simulated annealing with the adaptive schedule E(t) = [I/c, (5.1) holds for any c;:::: d*, provided d
PAGE 56
5.2 SamplePath Convergence for Deterministic Cooling Following Borkar [4], we associate with each state x a random quantity j3(x), the (pathwise) recurrence order of state x, defined as follows: j3(x) = = if I:, I[X, = x] < = p if p = sup{c 2:0 I Lt c(t)'I[X, = x] = =} and Lt c(t)P I[X, = x] < = p if p = max{c 2:0 I I;,c(t)'J[X, = xJ = =}. Similarly, we can define the pathwise recurrence order of a transition from state x to state y: = if I:, I[X, = x, X,+t = y] < = j3(x,y)= p if p =sup{ c::: 0 I Lt c(t)' I[X, = x, xt+l = y] =(X)} onrl" At\PT[X X X y] / ti.LL\...L L.Jt '\ j L i ) 'l+l .... V'V p if p =max{ c 2: 0 I Lt c(t)' I[X, = x, Xwt = y] = = }. The recurrence order measures how often a state is visited, or a transition occurs, during a sample run. A state which is transient (in the ordinary sense) has recurrence order oo. Connors and Kumar ([11],[12]) work with the expected values of pathwise recurrence orders, quantities which we shall call mean recurrence orders. By replacing recurrence orders by mean recurrence orders in what follows, we get a result like (5.3) for QUICKER1. The following remarkable theorem shows that, to a certain extent, recurrence orders constitute a potential for objectivefunction values. Theorem 5.3 Assume that S is finite and that weak reversibility holds. If j3(x) 2: 0 and there is a path x = x0 .cz:1 ... Xp = y such that max19,sp(U(x;)U(x)) :<; ;J(x), then j3(y) = j3(x) + U(x)U(y), a.s. 48
PAGE 57
Connors and Kumar prove Theorem 5.3 in [11] under the assumption of neighborhood symmetry and in [12] under the assumption of weak reversibility. Their result is stated in terms of mean recurrence orders, but the proof is algebraic, not probabilistic, and carries over without difficulty to samplepath recurrence orders. The next three results are adapted from Borkar [4]. We sketch their proofs. The final theorem is slightly stronger than the corresponding result in [4], because we use d* rather than d. Lemma 5.4 In simulated annealing with a deterministic cooling schedule implying C1r11' :::; c(t) :::; C2 tI/c for C1 > 0, C2 > 0, c > 0, we have maxxES (3( x) = c, a.s. Proof: For b > c, Lt c(t)6 :::; Cz Lt tb/c < oo; thus maxx (J(x) :::; c. In addition, 2.:: I:c(IYI{X, = x} = I:c(t)c 2': C1 I:r' = oo; X f t f therefore Lt c(t)c I {X1 = x} = oo for some x on each sample path, and so maxx (3( x) 2': c, a.s. D Lemma 5.5 Ifmaxxf3(x) 2': d*, then the set {x: (J(x) = maxxf3(x)} is a subset of S0 almost surely. Proof: Let x be a state such that (J(x) = f3max Suppose x tt S0 By the definition of d*, there is a path x = x0 x1 .. Xp = y from x to some state y E So such that maxiU(xi):::; U(x) + d*:::; U(x) + (J(x). Applying Theorem 5.3, 0 < U(x)U(y) = (J(y)(J(x) a.s., which contradicts the choice of x. Thus, except on a set of probability 0, x E S0 D Proof of Theorem 5.1 From Lemma 5.4, maxx (3( x) = c 2': d*. But 49
PAGE 58
from Lemma 5.5, the only states x satisfying ;J(x) =care in S0 Thus I>(t)'I(x !f. So)= "L,r1I(x tt So)< oo a.s., t t from which (5.1) follows by Kronecker's lemma (see, for example, Chung [10], pp. 12324). 0 5.3 SamplePath Convergence for Adaptive Cooling In this section we prove Theorem 5.2. Assume that ii(t) = [ 11', c :::0: d*, and that d < H. To avoid repetition, we adopt the convention that all the statements in this proof are true in an almostsure sense. The following assertion is proved below: c:; maxfJ(x):; c + J X (5.4) Statement (5.4) and J < H together imply that maxxfJ(x)H < c, so that Theorem 5.3 implies that fJ(x) < c for any state x not in S0 Thus "L,r1I(X, !f. So):; "L,t1I(X, !f. So)= 'i:,c(t)"I(X, E So)< oo, t t t and ( 5.1) follows from Kronecker's lemma as before. To prove (5.4), observe first that I:, ii(t)' =I:, l1 :::0: I:, r1 = oo, t t t so that c:; maxx fJ(x ). Pick b > c + J. We need to show that 'i:,<'(t)b = I:,fb/c < oo. (5.5) t t Under QUICKERI, the sojourn time in each state has a geometric distribution and so is a.s. finite; thus, there are infinitely many state changes. Recalling 50
PAGE 59
that T(k) is the transition number of the kth state change, we rewrite (5.5) as L (t)b = L (T(k + 1)T(k)) T(ktbjc < oo; (5.6) t k then 0 such that b > (c + d)(l + 0We claim that Ak (T(k + 1) T(k)) T(k)bfc < k(IH) (5.7) for all but finitely many k, almost surely. This is enough to establish (5.6) and thus (5.4). Fix k, and say that XT(k) = x. Let N*(x) be the (nonempty) set of neighbors of x at height 8(x) above x, i.e. N*(x) = {y E N(x): U(y) = U(.r) H(x)}. (5.8) The sojourn time in x ends at or before a move to N*( x) is proposed and accepted. Thus, it is stochastically bounded by a geometric variate with parameter where p = L R( X, y) > 0 yEN'(x) is the probability of proposing a move to N*(x). Since T( k) ::0: k, we have for k ::0: 3; exponentiating gives 51
PAGE 60
This implies that so that there is a constant 1) such that for all large k, (5.9) Thus, Pr (Ak > k(1+() I T(k)) Pr (T(k + 1)'T(k) > k(l+()T(k)bfo I T(k)) ( ) lk(1+0T(k)bfj < 1pT(ktdfc Since (1a)'::; ear for any real a and positive real r, this implies Pr (Ak > k(lH) I T(k)) < exp(p lk(l+()T(k)bfcj T(ktdfc) < exp (pryk(1H)T(k)(bJ)fc). Then the trivial bound T( k) ::C k implies Pr (Ak > k(H() I 'T'(k))::; exp( rypk((bd)fc)(1+0). (5.10) The choice of e implies that the exponent of k on the right is strictly positive. Now for any a > 0, f exp( j") < l"" exp( x") dx = _!:_r (_1:_) < oo. (5.11) j=l 0 0: a: By the BorelCantelli lemma for adapted sequences (see Loeve [36], pp. 39899), the event Ak > k(lH) occurs only finitely often, almost surely. This proves (5.6) and finishes the proof of Theorem 5.2. D 52
PAGE 61
5.4 An Example Where Pathwise Convergence Fails Consider the state spaceS= {x,y,z}, with U(x) = 1, U(y) 2, U(z) = 0, and tentativemove matrix 1 R= r 1r 1 where r > 0. Thus, d* = 1, J = 2, and H = 1. We show that pathwise Cesaro convergence fails for the adaptive schedule i'(t) = ll/d* = l1 In fact, given any a< 1, there are infinitely many transition numbers t at which t r1 ,E I(X., = x) > a, a.s.; s=l that is, the total proportion of visits to the strict local minimum x gets arbitrarily close to 1, infinitely often. Assume for simplicity that X1 = y. Since we are using an adaptive schedule, there are infinitely many state changes, almost surely. In particular, there are infinitely many visits to state y. Let S(j) denote the transition number immediately following the jth visit to y; that is, S(l) = 2 and S(k + 1) = 1 + min{t > S(k): X,= y}. Also, let M(t) the remaining lifetime, at step t, of the current visit to state x; that is, M(t) = min{s > 0: Xs+t f x} if X, = x, and M(t) = 0 if X, f x. Since every proposed move out of state y is accepted, we have Pr(Xs(j) = x) = r for all j. Let Bj be the event {xs(j) = x, M(S(j)) > lln2/ln(lS(jt1)j}. 53
PAGE 62
Then Pr(Bj) = r. (1S(jt') L ln2/ln(IS(W')j = r exp 1j ln(1S(jt')} > r/2. (5.12) The events B; are not independent. However, B1 B;_1 influence Bj only through S(j); since (5.12) is true regardless of the value of S(j), we have Pr(B; I A)?:': r/2 > 0 for any A E a(B1 ... ,Bj_1). From Lemma. 5.6 (see below) it now follows that given any integer I, sequences of I consecutive events Bj, B;+1 Bj+ll occur infinitely often. Since for any integer k, it follows that if B;, ... B;+21l all occur, then at time S(j + 21)1, there will have occurred at least I S(j) visits to state x and at most S(j) visits to state z. D Remark A similar argument shows that ( a.s.) the total proportion of visits to the global minimum z also gets arbitrarily close to one infinitely often. Thus, the Cesaro sums r1 2::, I(Xs = x) and r1 2::, I(X, = z) oscillate forever without converging. Le1n1na 5.6 Let I1, !2 .. be a sequence of zerooue Bernoulli random variables adapted to an increasing sequence :F1 :F2 of afields. Suppose that for all i, 54
PAGE 63
for some positive constant a not depending on i. Then for any positive integer I, the sequence (Ij) contains infinitely many strings of ones of length I. Proof: Let (Ij: j = 1, 2, ... ), (F;: j = 1, 2, ... ), and a be as in the statement of the lemma. We need to show that for any given integer I > 0, there is an infinite subsequence of indices (i(j)), j = 1, 2, ... such that for each j, li(j)+l = Ii(i)+2 = = li(j)+l1 = 1. The proof will show that we can take the i(j) to be multiples of I. Let ljm = Tik=l lj+k We establish by induction on m that (5.13) for A E Fj. When m = 1, statement (5.13) is true by hypothesis. Assume (5.13). Then E(Ij+2 lj+m+l I /;+1 = 1, A)E(Ij+1 I A) E(I;+1,m I /j+l = 11 A )E( lj+l I A) since {Ii+t = 1} n A E F;+l Now let Jjm = ljm,j = ljm+1ljm+2 Ijm+j If j < k, then the event { Jjm = 0} is in F(j+l)m <;;; Fkm, so that (5.13) implies that for any k > 0 and n > 0, Pr(rl(Jjm=O)) = Pr(Jkm=O, ... ,Jnm=O) y=k Pr(Am = 0) X Pr(Jk+l,m = 0 I Jkm = 0) X X Pr(Jnm = 0 I Jkm = Jk+l,m = = Jnl,m = 0) 55
PAGE 64
letting n > oo and then taking the union over all positive k shows that Pr (lim inf(l;m = o)) J m?:.J Pr( ljm = 0 all but finitely often ) 0. D 56
PAGE 65
CHAPTER 6 CONVERGENCE OF THE PRUNED CHAIN It might be thought that the convergence results of [7], [11], [25], and [37], as well as those in previous chapters of this thesis, are an artifact, resulting from the algorithm spending long periods of time in optimal states. In this chapter we show that convergence occurs even if we do not take into account the length of time that the chain occupies a state, i.e., if we consider only the embedded chain of distinct states. As in the previous chapter (see (5.2) on page 47), let the sequence {T'(k) : k = 1, 2, ... } denote the epochs of transitions between distinct states. Then (Xr(k): k = 1, 2, ... ) is the induced chain of pairwise distinct states vis ited. Let S1 be the set of global minimizers of the function U + 8. At the end of this chapter we describe the set S1 for some typical problems. It often is a superset of So and usually its component states have energy levels very close to the minimum. If there are no isolated local minima, as in the setting of [14], then S1 = So. Theorem 6.1 (Convergence in Probability) If (S, U, R) is weakly reversible and if the cooling schedule satisfies the conditions in [25], then lim Pr (Xr(k) E S!) = 1.
PAGE 66
Theorem 6.2 (Pathwise Convergence) In simulated annealing us ing the adaptive schedule i:(t) =I/o for c 2: d* + d, 1 n lim L I(XT(k) E s,) = 1, a.s. n+oo n k=l 6.1 Convergence in Probability Our procedure in both Theorem 6.1 and 6.2 is to show that the dynamics of the pruned chain are the same as those of ordinary simulated an nealing on a triple (S,U + fi,R), where R(x,y) > 0 iff R(x,y) > 0. The first task is to show that the modified problem satisfies WR. Lemma 6.3 (S, U + 8, R) has property WR if (S, U, R) does. Proof: The change from R to R makes no difference, since any path under R is also a path under Rand vice versa. Suppose that state y is reachable from state x a.t ( U + .5)height h. This means in particular that U ( x) +.5( x) :S: h. Since the function .5() is nonnegative, it also means that y is reachable from x at Uheight h. Since (S, U, R) satisfies WR, there exists a. path y = y0 . YP = x such that maXo U(Yi+!), then 8(yi) = 0, so (U + 8)(yi) = U(yi) :S: h, and if U(yi) :S: U(Yi+,), then U(yi) + 8(yi) :S: U(Yi+l) :S: h. The claim is proved. We have exhibited a. path from y to x of (U + .5)height not exceeding h, which proves the lemma.. D 58
PAGE 67
Lemma 6.4 If (S, U, R) is weakly reversible, then [(U + c5)(y)(U + c5)(x)]+ = [U(y)U(x)8(x)j+ for all states x, y withy E N(x). Proof: If U(y) > U(x ), then there must be apathy = y0 Yh ... YP = x leading from y to x at height not exceeding U(y). In particular, y has a neighbor not strictly uphill from it, namely so that c5(y) = 0, and the conclusion of the lemma holds in this case. Suppose on the other hand that U(y) :<:=; U(x). Then there must be apathy= Yo,Yl,,Yp = x leading from y to x at a height not exceeding U(x). In particular, y1 E N(y) and U(y1 ) :<:=; U(x), so that 8(y) :<:=; U(x)U(y). Together with 8(x) :::: 0, this implies (U + 8)(y)(U H)(x) :<:=; U(y) H(y)U(x) :<:=; 0, so that the conclusion of the lemma also holds in this case. 0 Proof of Theorem 6.1: It is sufficient to demonstrate that the condition (A.2) of [25] is satisfied; that is, that there exist constants c1 c2 such that clc(t)l(U+&)(y)(U+&)(x)]+ :<:=; Q(x, y, t, c) :s; c2c(t)[(U+&)(y)(U+&)(x)l+, (6.1) where Q(x, y, t, c) denotes the transition probability, conditioned on acceptance, from state x to state y at time t under cooling schedule c( ). As in Section 2.2, we write the acceptance probability at time k from state x as n a(x, k) = Ao + 'EAikd;fc_ i:::::l 59
PAGE 68
If x is an isolated local minimum, then the probability of moving to y, conditioned on acceptance, is cE(i)[U(y)U(x)J+ Q(.x, y, i, E)= A ()di(x). L.,.,z:::::l tE l For some constant I< not dependent on t, the denominator satisfies the double inequality n A1 E( i)&(x) :C: L AiE( i )di(x) :':: J< E( i )&(x); i=l since the exponent in the numerator is strictly positive, (t)U(y)U(x)6(x) < Q( i ) < ..':__ ,U(y)U(x)&(.x) J(E .T,y, ,E Al At from which (6.1) follows by Lemma 6.4. This finishes the proof for the case where x is an isolated local minimum. If x is not an isolated local minimum, then cc(t)[U(y)U(x)]+ Q(x, y, t, c)= Ao + 2::;'=1 AiE(t)di(x). Here, the denominator is bounded below by A0 which is strictly positive, and above by some (different) constant ](, again not dependent on t. Therefore Since 8(x) = 0 in this case, we also get (6.1). D 6.2 Pathwise Convergence for the Pruned Chain In this section we prove Theorem 6.2. As in the proof of Theorem 6.1, we find expressions involving powers of k to bound the transition probabilities of the kth step in the pruned chain. The greatest difficulty lies in finding an asymptotic upper bound for T( k ), the time of the kth transition; this is the 60
PAGE 69
purpose of Lemmas 6.5 and 6.6. A modified version of the techniques used to prove Theorem 5.2 then establishes pathwise convergence. Lemma 6.5 On almost all sample paths, T(k+1)T(k)::; T(k)(J/c) for all except finitely many k. Proof: Let e > 0 be arbitrary. We proceed as in the proof of in equality (5.10) on page 52. As in that proof, the exit time from any state is stochastically no greater than a geometric variate with parameter T( k tJ.fc; so for all large k, Pr{T(k + 1)T(k) > T(k)JH J T(k)} < (1T(ktd/c)LT(k)(d/c)HJ < exp(1JT(k)f.) < exp ( J]kf.), where 1J is a positive constant small enough to compensate for the integer rounding. Applying (5.11), then the BorelCantelli lemma for adapted sequences as before, we have Pr {T(k + 1)T(k) > T(k)JH i.o.} = 0, for e > 0. Now for .i = 1, 2, ... let fl; = {T(k + 1)T(k) > T(k)J+2' i.o.}. Then Pr {T(k + 1)T(k) > T(k)J+T' i.o.} 00 00 = Pr{ U fl;} ::0: 2::Pr{fl;} = 0. o j:::::l j=l Lemma 6.6 T(k) = O(kc/(cJ)), almost surely. 61
PAGE 70
Proof: For convenience, write d = d/ c. Inductively define a function ]k on the nonnegative integers by !k(O) = k; !k(j + 1) = .fk(j) + (fk(j) t Now consider the related solution to the initialvalue problem (6.2) for 0 :S: d < 1. It is easy to verify that this has the unique solution on nonnegative real numbers x. Since the function 9k is nondecreasing, and from this it easily follows by induction that for all nonnegative integers j. On any sample path w, there is, by Lemma 6.5, a constant K = K(w) such that T(k + 1)T(k) :S: T(k)dfc for all k 2: I<. Then T(k) :S: fr<(kK) = O(k)l/(1d) = O(kt/(cJJ. D Proof of Theorem 6.2: As in Section 5.a, let N*(x) be the set of neighbors at height 8(x) above x. The transition probabilities for the pruned chain are Pr ( XT(k+l) = y I XT(k) = x) R( X, y )E(T( k) )[U(y)U(x)J+ LzEN(x) R( x, Z )c(T( k) )[U(z)U(x)]+ 62
PAGE 71
R(:c, y) E(T(k))(U(y)U(x))+ LzEN'(x) R(x, z) c(T(k))6(x) R( X, y) c(T( k) )[U(y)(UH)(x)]+ R( X' Y) c(T( k) )[(UH)(y)(UH)(x)]+' where R(x,y) = R(x,y) LzEN(x) R(x, z) Given > 0, we have (1 OR( x, y) c(T( k) )[(UH)(y)(UH)(x)]+ ::; Pr ( Xr(k+l) = y I Xr(k) = x) and Pr ( XT(k+l) = y I Xr(k) = x) ::; R(x, y) c(T( k) )[(UH)(y)(UH)(x)]+, for all large k. Applying Lemma 6.6 and the trivial ineqnality T( k) ::0: k, Therefore, the transition probabilities for the pruned chain are bounded below by (1 OR( X, Y) ( k1/(cd/)[(U H)(y)(UH)(x)]+ and above by R( X, Y) ( k1fc)[(U+6)(y)(U H)(x)]+. This is equivalent to bounding the cooling schedule for the pruned chain by To finish the proof, we need one more lemma. Recall the recurrence orders j3(x) defined on page 48. 63
PAGE 72
Lemma 6. 7 If there exist constants b, B, c, C such that then c :0: maxxES (3( x) :0: C. This is proved in exactly the same way as Lemma 5.4. Applying Lemma 6.7, cJ::; :0: c, where is the recurrence order of the pruned chain. An application of Lemma 5.5 completes the proof. D 6.3 S1 Compared to So We briefly discuss the set S1 of global minimizers of the function U + b and its relationship to the set S0 The following two examples show that this relationship can attain either extremethe sets can be completely coincident or completely disjoint. Example 6.1 If there are no isolated global minima then S0 = S1 This is not surprising: in the unpruned chain, the expected waiting time in any individual state is 0(1); its asymptotic behavior should not differ greatly from the pruned chain, where we force the waiting time to be exactly 1. Example 6.2 LetS= {w,x,y,z}, N(w) = {x}, N(x) = {w,y}, N(y) = {y,z}, N(z) = {y}, with U(w) = 1.5, U(x) = 1, U(y) = 2, and U(z) = 0. (See Figure 6.1). In this case, the probability distribution of Xr(k) converges to a unit mass on { w, x }, even though neither of these states is a global minimizer. Again, this is reasonable, since a visit to z is always followed by a visit to y, but the chain is very likely to contain long oscillations between states x and w. 64
PAGE 73
u 2 y w 1 s, 0 Figure 6.1. Despite Example 6.2, we believe that convergence to the set S1 would in most settings be an acceptable outcome. Some typical applications of simulated annealing to "real world" optimization problems illustrate this. Example 6.3 In Hajek's [25] formulation of the maximummatching problem, a situation like Example 6.2 could not occur, since every move changes the objective function (number of edges) by oue. In fact, any problem formulation which satisfies the "continuous increase assumption" of [25] has S0 (';:;; S1 ; moreover, the energy level of every state in S1 is either minimal or one step away from minimal. Example 6.4 One of the earliest papers to propose simulated an nealing as an optimization technique was [32]. In its formulation of the travel ling salesman problem, a move between states (tours) consists of reversing one section of the tour. A situation like Example 6.2 could occur only if there were two tours such that (i) each could be reached from the other by a single move; (ii) no shorter tours could be reached from either by a single move; and 65
PAGE 74
(iii) the length of each tour is shorter than the length of any tour one move away from an optimum. Unless the arrangement of cities is very regular, or the number of cities very few, this situation would be rare. Ordinarily, elements of 81 would be tours at most one move away from optimal, from which a true optimum could be found by a short postprocessing routine. 66
PAGE 75
CHAPTER 7 DIRECTIONS FOR FUTURE WORK The nonstationary Markov chains of simulated annealing are a rich source of unanswered questions; we present a small selection. 7.1 Hitting Time A useful complement to the convergence results of this thesis would be information about the first hitting timeT to 80 In a recent paper, Chiang and Chow [7] used Wentzell and Freidlin's theory of Wgraphs to show that the second eigenvalue of P1 the transition matrix at timet, is asymptotically where d* is Hajek's constant, defined in Chapter 1. This is suggestive for the following reason. We know that t maxPr(T > t I Xk = x) =II IT Pill xES j=k oo where Pj denotes the matrix P; with all rows and columns corresponding to 80 deleted. For large j, it is well known (see, for example, Section 1 of [25]) that the stationary distribution (i.e., the principal eigenspace) of P1 is almost entirely concentrated on 80 Therefore, >.(j) is nearly the same as the principal eigenvalue of Pj. If there were a constant A t t II IT Pjllc,, ""A IT >.(.i), j=k j=k (7.1)
PAGE 76
at least for large k, then we could write t Pr(T > t I xk = X) A. exp (L c(j)d')' j=l and in principle estimate all the moments of T. (Of course, we would also like to compute or find good bounds for A.) However, a relationship like (7.1), which must depend on the special structure of the matrices P;, has thus far eluded us. 7.2 Tabu Penalties and Simulated Annealing Fox [14] incorporates kstep tabu penalties into the simulated annealing algorithm. Roughly, he contructs a new space with states having multiple components; candidates for the latter are ktuples in Sk which correspond to possible paths, without selfloops, of length k in S. Unfortunately, the modified problem does not in general satisfy WR. To avoid this diJli.culty, Fox imposes additional structure. Among other things, he requires that from any state there is a path with no uphill moves leading to S0 ; this is achieved by incorporating a generalization of the "neighborhood enrichment" mentioned in Section 3.2. Neighborhood enrichment has heuristic appeal as a way to diversify the search. Also, an adaptation of the proof of Belisle [2] shows that Fox's procedure guarantees convergence in probability to So with any cooling schedule that goes to zero, no matter how quicklyeven in the extreme case c( t) = 0 for all t. Nevertheless, we feel that insight could he gained by reconsidering the general problem without neighborhood enrichment. Are there other, weaker conditions which guarantee WR in the expanded state space? An answer to this question would likely increase our nnderstanding of the property of weak 68
PAGE 77
reversiblity. Moreover, it might allow us to quantify in some sense the benefits of tabu search; this would have both theoretical and practical interest. 7.3 Noisy Objective Function Revisited In Chapter 4, we assumed that the objective function values were unknown but confined to a known finite set. Even though we can force an arbitrary degree of precision, the assumption is still somewhat unsatisfying, at least from the point of view of traditional mathematics. Dropping the assumption would allow us to replace the asymptotically stable estimators of Chapter 4 by estimators which are merely consistent. Recall the procedure: every time we visit a state, we observe its energy level with a measurement subject to error. Past observations are combined to form a vector of estimates of the objective function U(x) at each state x. If the estimator(s) are consistent, then after a random but a.s. finite time, tbe estimated value of every U(x) will be within any > 0 of its true value. The first difficulty is that this is not a stopping time. We could get around this with a stronger version of Theorem 3.2. The theorem would need to hold when conditions (ii) and (iii) imply merely approximate equality of the respective transition probabilities. Under those weakened conditions, equalities (3.5) and (3.6) are no longer valid, and it is not immediately clear that they could be replaced with appropriate inequalities. The next step would be to show that knowing the energy levels to within an arbitrary suffices. We are looking at a Markov chain where the 69
PAGE 78
transition probabilities at timet satisfy some inequality Ac(ftU[U(y)U(x)j+ :<;; P(x,y, t) :<;; Bc(t)<+[U(y)U(x)]+ (7.2) It might turn out that (7.2) is enough to guarantee convergence. Alternatively, assuming, as do Gelfand and Mitter [19], that the error term does not de pend on x or y, a system satisfying (7.2) can be viewed as ordinary simulated annealing with a random cooling schedule i{). Assuming the original cooling schedule satisfies (1.2), (7.3) where is a random variable in the interval (a, b). In effect, we move the randomness from the objective function to the cooling schedule. To get con vergence in probability, we need to show that convergence results such as those in [7] or [25] continue to be true if the cooling schedule under the weaker condition (7.3). It is not yet clear whether this is possible. 7.4 Refinement of Algorithm QUICKER1 Fox has suggested replacing the geometric variate with its expectation (rounded upward to the nearest integer) in algorithm QUICKER1. A proof like that in in Section 5.3 (but easier) shows that the modified algorithm exhibits the same pathwise convergence discussed there. We do not know if Fox's modified QUICKER1 gives convergence in probability; this would be worth investigating. Fox's modified QUICKER1 would be slightly simpler to implement. It seems to reduce variance in the random cooling schedule. However, more work is needed to determine whether this translates into reduced variance in the 70
PAGE 79
sequence of states generated. The modified algorithm would prevent anomalous rapid cooling. Heuristically, it can be argued that this is desirable. Once again, though, more work is needed to determine whether this improves the performance of the algorithm. 71
PAGE 80
BIBLIOGRAPHY [1] Asmussen, S. (1987). Applied Probability and Queues. New York: Wiley. [2] Belisle, C. J. P. (1992). Convergence theorems for a class of simulated annealing algorithms on 3?d. Journal of Applied Probability 29:88595. [3] Billingsley, P. (1986). Probability and Measure. New York: Wiley. [4] Borkar, V. (1992). Pathwise Recurrence Orders and Simulated Anneal ing. Journal of Applied Probability 29:4 7276. [5] Browdy, M. H. (1990). Simulated Annealing: An Improved Computer Model for Political Redistricting. Yale Law & Policy Review 8:163ff. [6] Catoni, 0. (1992). Rough Large Deviations Estimates for Simulated An nealing: Application to Exponential Schedules. The Annals of Probability 20:110946. [7] Chiang, T.S. and Chow, Y. (1988). On the Convergence Rate of Anneal ing Processes. SIAM Journal of Control and Optimization 26:145470. [8] Chiang, T.S. and Chow, Y. (1989). A Limit Theorem for a Class of Inhomogeneous Markov Processes. The Annals of Probability 17:14831502. [9] Chiang, T.S. and Chow, Y. (1993). Asymptotic Behavior of Eigenvalues and Random Updating Schemes. Applied MathemaUcs and Optimization 28:25975. [10] Chung, K. L. (1974). A Course in Probability Theory. 2d ed. San Diego: Academic Press. [11] Connors, D.P. and Kumar, P. R. (1988). Balance of Recurrence Orders in TimeInhomogeneous Markov Chains with Application to Simulated Annealing. Probability in the Engineering and Informational Sciences 2:15784.
PAGE 81
[12] Connors, D. P. and Kumar, P. R. (1988). Simulated Annealing Type Markov Chains and Their Order Balance Equations. SIAM Journal of Control and Optimization 27:144061. [13] Feller, W. (1968). An Introduction to Probability Theory and Its Appli cations. Vol. 2, 2d ed. New York: Wiley. [14] Fox, B. L. (1993). Integrating and Accelerating Tabu Search, Simu lated Annealing, and Genetic Algorithms. Annals of Operations Researcl1 41:4767. [15] Fox, B. L. (1994). Faster Simulated Annealing. To appear in SIAM Jour nal of Optimization. [16] Fox, B. L. and Heine, G. W. (1993). Probabilistic Search with Overrides. Technical Report, University of Colorado, Denver. [17] Fox, B. L. and Heine, G. W. (1993). Stable Estimators for SelfAdjusting Simulations. Technical Report, University of Colorado, Denver. [18] Freidlin, M. I. and Wentzell, A. D. (1984). Random Perturbations of Dynamical Systems. Translated by J. Sziics. New York: SpringerVerlag. [19] Gelfand, S. B. and Mitter, S. K. (1989). Simulated Annealing with Noisy or Imprecise Measurements. Journal of Optimization Theory and Appli cations 62:4962. [20] Gem.an, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distri bution, and the Bayesian Restoration of Images. IEEE Transactions in Pattern Analysis and Machine Intelligence 6:72141. [21] Gidas, B. (1984). NonStationary Markov Chains and the Convergence of Annealing Algorithms. Journal of Statistical Physics 39:75131. [22] Glover, F. and Laguna, M. (1993). Tabu Search. In Modern Heuristic Techniques for Combinatorial Problems, ed. C. R. Reeves, pp. 70141. Blackwell Scientific Publications. [23] Glover, F., Taillard, E., and de Werra, D. (1993). A User's Guide to Tabu Search. Annals of Operations Research 41:330. 73
PAGE 82
[24] Greene, J. W. and Supowit, K. J. (1986). Simulated Annealing Without Rejected Moves. IEEE Transactions on ComputerAided Design, Vol. CAD5, pp. 22128. [25] Hajek, B. (1988). Cooling Schedules for Optimal Annealing. Mathematics of Operations Research 13:31129. [26] Heine, G. W. (1993). Convergence Properties of a LoopSkipping Algorithm for Simulated Annealing. Technical Report, University of Colorado, Denver. [27] Heine, G. W. (1994). Pathwise Convergence for Simulated Annealing with an Adaptive Schedule. Technical Report, University of Colorado, Denver. [28] Hwang, C.R. and Sheu, S.J. (1989). On the Weak Reversibility Concli tion in Simulated Annealing. Soochow Journal of Mathemat.ics 15:15970. [29] Hwang, C.R. and Sheu, S.J. (1990). LargeTime Behavior of Perturbed Diffusion Markov Processes with Applications to the Second Eigenvalue Problem for FokkerPlanck Operators and Simulated Annealing. Acta Applicandae Mathematicae 19:25395. [30] Hwang, C.R. and Sheu, S.J. (1991). Singnlar Perturbed Markov Chains and Exact Behaviors of Simulated Annealing Processes. Journa.l of Theoretical Probability 5:22349. [31] Johnson, D., Aragon, C., McGeoch, L. and Schevon, C. (1991). Opti mization by Simulated Annealing: An Experimental Evaluation; Pa.rt II, Graph Coloring and Number Partitioning. Operations Research 37:378406. [32] Kirkpatrick, S., Gelatt, C. D. and Vecchi, M.P. (1983). Optimization by Simulated Annealing. Science 220:62180. [33] Kropaczek, D. J. and Turinsky, P. J. (1991). InCore Nnclear Fuel Man agmenet Optimization for Pressurized Water Reactors Utilizing Simu lated Annealing. Nuclear Technology 95 :91I [34] Kuo, C.II., Michel, A. N. a.nd Gra.y, W. G. (1992). Design of Optima.! PumpandTreat Strategies for Contaminated Groundwater Remediation 74
PAGE 83
Using the Simulated Annealing Algorithm. Advances in Water Resources 15:95ff. [35] Kuriyan, J., Brunger, A. T. and Karplus, M. (1989). XRay Refinement of Protein Structures by Simulated Annealing: Test of the Method on Myohemerythrin. Acta Crystallographica (Section A) 45:396 [36] Loeve, M. (1963). Probability Theory. 3d ed. Princeton, New Jersey: Van Nostrand. [37] Tsitsiklis, J. (1989). Markov Chains with Rare Transitions and Simulated Annealing. Mathematics of Operations Research 14:7090. [38] Yan, D. I. and Mukai, H. (1992). Stochastic Discrete Optimization. SIAM Journal of Control and Optimization 30:594612. 75

