SMART SIMULATED ANNEALING
by
George Winfield Heine, III
B. A., Reed College, 1971
M. S., University of Colorado, 1989
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
1994
This thesis for the Doctor of Philosophy
degree by
George Winfield Heine, III
has been approved for the
Department of
Mathematics
by
1994 by George Winfield Heine, III
All rights reserved.
Heine, III, George Winfield. (Ph.D., Applied Mathematics)
Smart Simulated Annealing
Thesis directed by Professor Bennett L. Fox
A naive implementation of simulated annealing is inefficient; at low
temperatures it spends almost all its time proposing moves and then rejecting
them. We present an algorithm, due to Fox, which reduces time at self-loops
while preserving convergence properties. We show that the computer time
spent in each distinct state by Fox's algorithm is bounded (a.s.) by a constant
which can be determined in advance.
This result suggests a refinement of Fox's algorithm: when computer
time at a given state exceeds the predetermined constant, exit to another state.
A technical lemma needed to prove the refinement works has surprising corol-
laries. For example, we use the lemma to analyze simulated annealing when
the objective function is not precisely known and must be estimated.
Another way of speeding up simulated annealing keeps temperature
constant while in a state, updating when the state changes. Thus sojourn time
in a state is easily simulated by sampling from a geometric distribution. Using
methods of Connors and Kumar, we show that in this scenario the proportion
of time spent in an optimal state converges to one almost surely.
It might be thought that the various convergence results in simulated
annealing are an artifact, since at low temperatures the algorithm spends large
blocks of time at optimal states. However, we look at the embedded chain of
pairwise distinct states and find that it too exhibits convergence, though in
general to a slightly different set of states.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Bennett L. Fox
ACKNOWLEDGEMENTS
Many people have contributed, directly or indirectly, to the comple-
tion of this thesis. Many U.C.D. faculty members have given me inspiration,
technical advice, and moral support; I would like to especially thank Bill Briggs,
Stan Payne, and Rich Lundgren, as well as the members of my thesis commit-
tee. Of course, the greatest thanks are due my advisor, Ben Fox. This project
would not have been possible without his knowledge and insight, as well as his
constant support.
Special thanks goes to Arlan Ramsay at C.U.-Boulder, both for having
suggested I investigate the U.C.D. graduate program in the first place, and,
together with his son, Bruce, for having written the software program which
produced this thesis.
Thanks also to management and co-workers at the Bureau of Land
Management, who have cheerfully put up with so many irregularities in my
schedule.
I also gratefully acknowledge the unwavering faith in my ability to
succeed given by friends and family over the years, especially my parents,
George and Mary Heine, my sister Nancy and her family, and my in-laws,
Charles and Audrey Barnett. Most of all, I am indebted to my wife, Janet
Barnett, whose unswerving love has been put to the test in many difficult
moments. It is to her that I dedicate this thesis.
CONTENTS
CHAPTER
1 INTRODUCTION.......................................... 1
1.1 The Simulated Annealing Algorithm................. 2
1.2 Previous Work..................................... 4
2 CONVERGENCE OF A LOOP-SKIPPING ALGORITHM . 7
2.1 Summary of Results................................ 9
2.2 Proofs of Theorems............................... 12
2.2.1 Proof of Theorem 2.1....................... 15
2.2.2 Proof of Theorem 2.2....................... 16
2.2.3 Proof of Theorem 2.3....................... 17
2.3 Efficiency of Fox's Algorithm.................... 20
3 SIMULATED ANNEALING WITH OVERRIDES............. 25
3.1 Simple Examples................................. 27
3.1.1 Random Restart............................. 27
3.1.2 Greedy Overrides .......................... 27
3.1.3 Perverse Overrides......................... 28
3.2 Dynamic Neighborhoods ........................... 28
3.3 Overrides: The Main Result....................... 30
4 SIMULATED ANNEALING WITH A NOISY OBJECTIVE
FUNCTION............................................. 34
4.1 Formal Setup..................................... 35
4.2 Consistent and Stable Estimators..................... 37
4.3 Asymptotically Stable Estimators: First Example...... 39
4.4 Asymptotically Stable Estimators: Second Example ... 42
5 PATHWISE CONVERGENCE WITH ADAPTIVE COOLING 45
5.1 Main Results ........................................ 46
5.2 Sample-Path Convergence for Deterministic Cooling ... 48
5.3 Sample-Path Convergence for Adaptive Cooling......... 50
5.4 An Example Where Pathwise Convergence Fails ......... 53
6 CONVERGENCE OF THE PRUNED CHAIN...................... 57
6.1 Convergence in Probability........................... 58
6.2 Pathwise Convergence for the Pruned Chain ........... 60
6.3 S_1 Compared to S_0 .................................. 64
7 DIRECTIONS FOR FUTURE WORK .............................. 67
7.1 Hitting Time........................................ 67
7.2 Tabu Penalties and Simulated Annealing .............. 68
7.3 Noisy Objective Function Revisited ................. 69
7.4 Refinement of Algorithm QUICKER-1 ................. 70
BIBLIOGRAPHY.................................................... 72
CHAPTER 1
INTRODUCTION
Simulated annealing is attracting increasing attention as a method
of combinatorial optimization. As knowledge of the algorithm spreads, re-
searchers find ever more diverse uses, reporting applications such as refinement
of protein structures [35], redrawing of legislative districts [5], in-core nuclear
fuel management [33], and contaminated groundwater remediation [34].
In certain senses and under certain conditions, made explicit below,
the Markov chain generated by simulated annealing is convergent. However,
the algorithm has been criticized for being slow and expensive. Perhaps this is
partly because most theoretical studies have given little attention to practical
ways to implement the algorithm.
The results in this thesis were originally motivated by attempts to find
an efficient algorithm which maintains the desirable convergence properties.
In the process, we have discovered additional theoretical results, interesting in
their own right.
To probabilists, simulated annealing presents an interesting special
case of a time-inhomogeneous Markov chain. The properties of stationary
Markov chains have long been well-understood, but theory for the general
nonstationary case still appears far off. Nonstationary Markov chains are better
understood today than a few years ago, partly because of research in simulated
annealing. We hope that the results outlined in this thesis will contribute to
the growth of knowledge in this important area.
1.1 The Simulated Annealing Algorithm
The goal of simulated annealing is to minimize an objective function
U on a set of states S. We take S to be finite unless the contrary is explic-
itly stated. The algorithm generates a sequence of moves between successive
neighboring states. At each step of the algorithm, a candidate move is pro-
posed. If the move is to a state with lower or equal objective-function value,
it is accepted; if the move is to a state with a higher objective-function value,
it is accepted with a small probability.
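In Python, a single step of this accept/reject scheme might be sketched as follows; the dictionaries `U` and `neighbors`, and the uniform choice among neighbors, are illustrative assumptions rather than the thesis's formal setup, which allows an arbitrary proposal matrix R:

```python
import random

def anneal_step(x, t, U, neighbors, eps):
    """One step of naive simulated annealing (an illustrative sketch).

    x         -- current state
    t         -- current step number (enters only through eps = epsilon(t))
    U         -- dict mapping each state to its objective-function value
    neighbors -- dict mapping each state to the states proposable from it
    eps       -- epsilon(t), the current acceptance parameter in [0, 1)
    """
    y = random.choice(neighbors[x])      # propose a tentative move
    dU = U[y] - U[x]
    if dU <= 0:                          # downhill or level: always accept
        return y
    # uphill: accept with probability eps**dU, which vanishes as eps -> 0
    return y if random.random() < eps ** dU else x
```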
We now formalize these notions. By problem we mean a triple
(S, U, R), where S is a (finite) set of states, U is a real-valued function on
S, and R is the |S| × |S| stochastic matrix of tentative-move probabilities.
We assume without loss of generality that the minimum of U over S is 0.
Sometimes we use energy level as shorthand for objective-function value.
Simulated annealing is a (generally nonstationary) Markov chain {X_t :
t = 1, 2, ...} on a set of states S. The transition probabilities are given by

Pr(X_{t+1} = y | X_t = x) = R(x, y) ε(t)^{(U(y) − U(x))^+}

for x ≠ y, and

Pr(X_{t+1} = x | X_t = x) = 1 − Σ_{y ≠ x} Pr(X_{t+1} = y | X_t = x),

where ε(·) is implied by the cooling schedule, described below.
Denote by S_0 the set of states on which U is minimized. The neighborhood
N(x) of a state x consists of those states y for which R(x, y) > 0. A
path is a sequence of states x_0, ..., x_p such that x_i ∈ N(x_{i−1}) for 1 ≤ i ≤ p.
As is customary, we assume that R is irreducible, so that any two
states are connected by a path. Following Chiang and Chow [7], we say
that state y is reachable from x at height h if there exists a path x =
x_0, x_1, ..., x_p = y such that max_{0 ≤ i ≤ p} U(x_i) ≤ h. A state x is a local minimum
if no state y with U(y) < U(x) can be reached from x at height U(x).
For each pair (x, y) of distinct states, let h(x, y) be the smallest number h such
that y is reachable from x at height h. The depth d(x) of a state is defined as

d(x) = min{ h(x, y) − U(x) : U(y) < U(x) }  if x ∉ S_0;
d(x) = min{ h(x, y) − U(x) : y ∈ S_0, y ≠ x }  if x ∈ S_0.
The problem (S, U, R) has the property of neighborhood symmetry (NS) if
R(x, y) > 0 whenever R(y, x) > 0. In most cases this property is unnecessarily
strong and can be replaced with weak reversibility (WR), which requires that
if a state y is reachable from state x at height h, then x is also reachable from y
at height h. Assuming WR, as we do throughout, rather than NS has important
practical implications: Fox [14] shows how to integrate simulated annealing and
tabu penalties by redefining S, U, and R; with additional structure, given in
[14], the redefined problem satisfies WR, but it is generally not possible to
satisfy NS.
The cooling schedule is a sequence (T_k), k = 1, 2, ..., of positive
numbers, usually converging to zero, such that the probability of accepting an
uphill move of height one at step k is

ε(k) = exp(−1/T_k).
We find it more convenient to deal directly with the sequence (ε(k)); by abuse
of terminology, we also call this sequence the cooling schedule; this causes no
ambiguity, because the ε(k) and T_k determine each other.
In most of what follows, we restrict our attention to schedules of the
form
ε(k) = k^{−1/c},   (1.1)
for c > 0, or more generally, to schedules satisfying
c_1 (k + a)^{−1/c} ≤ ε(k) ≤ c_2 (k + a)^{−1/c}   (1.2)

for positive constants a, c, c_1, and c_2. These are the schedules that have
attracted the most attention historically, at least among probabilists. See [6] and
[21] for discussions of other kinds of schedules.
In the work that follows, we are often concerned with the dynamics
of exit from a single state x; the most significant quantity in this analysis is
how high the chain must climb to leave x. Denote this quantity by

δ(x) = min_{y ∈ N(x)} (U(y) − U(x))^+;

note that δ(x) > 0 only if every neighbor of x has an objective-function value
strictly larger than U(x). We call such a state an isolated local minimum.
1.2 Previous Work
A central problem of simulated annealing is how slowly the cooling
schedule must go to zero in order to guarantee convergence to S_0. Geman and
Geman [20] showed that, for problems satisfying NS, a constant c exists such
that

Σ_{t=1}^∞ ε(t)^c = ∞  ⇒  Pr(X_t ∈ S_0) → 1.
Hajek [25] identifies the constant as

d* = max_{x ∉ S_0} d(x),

where d(·) is the depth function defined in Section 1.1. He requires (S, U, R)
to satisfy WR but not necessarily NS. Hajek shows that his conditions are
necessary as well as sufficient. Hwang and Sheu [30] prove essentially the
same theorem by an application of the cycles of Freidlin and Wentzell [18].
Chiang and Chow [7],[8] repeat the result in continuous time, and (with certain
restrictions) give the asymptotic probability distribution for large t. Tsitsiklis
[37] obtains similar results by working with small perturbations of probability
matrices. Connors and Kumar [11], [12] show that, under similar conditions,
the lim sup of the Cesàro average of Pr(X_k ∈ S_0) is one; we return to this point
in Section 5.1.
All of these results assume a finite state space. Although there is a
large body of literature about simulated annealing on continuous state spaces,
we are not aware of any exact results relating the speed of cooling to the
existence of convergence. Bélisle [2] shows that convergence occurs on finite or
infinite state spaces provided S_0 is reachable in a single step from anywhere in
the state space. It is easy to modify his proofs to show that a slightly weaker
condition suffices: namely, from anywhere in S there is a path with no uphill
moves leading to S_0. Bélisle's condition is strong enough that his result holds
for any cooling schedule that approaches zero, or even a random schedule that
approaches zero in probability.
The rest of this thesis focuses on efficiently implementing simulated
annealing without losing the convergence properties just discussed. In Chapter 2,
we describe Fox's [14] algorithm QUICKER. Using it eliminates explicitly
rejected moves while leaving the sample path stochastically unchanged. We
show that the computer time to execute QUICKER is asymptotically bounded
(a.s.) by a constant that can be prescribed in advance. This complements Fox's
analysis of quadratic-mean convergence. In Chapter 3, we use these results to
propose an improved algorithm QUICKER-j. A technical lemma needed to
prove the convergence of QUICKER-j turns out to be useful in an important
and little-studied problem: simulated annealing when the objective function
is not precisely known; we analyze this situation in Chapter 4. In Chapter 5,
we analyze an algorithm, equivalent to QUICKER-1, which amounts to using a
random cooling schedule. It exhibits a different sort of convergence: the pro-
portion of observed optimal states tends almost surely to one. In Chapter 6, we
consider the embedded chain of pairwise distinct states (i.e., the chain (X_k)
pruned of its self-loops) and show that it too converges to a set S_1. We relate S_0
and S_1. Chapter 7 indicates directions for future research.
CHAPTER 2
CONVERGENCE OF A LOOP-SKIPPING ALGORITHM
Implementing simulated annealing in a naive but straightforward fashion,
the processing at step t looks like:

Algorithm SLOW(x, t)
    Select y from N(x)
    If U(y) ≤ U(x) then
        Set X_{t+1} ← y
    Else
        Generate a uniform random variate V
        If V < ε(t)^{U(y) − U(x)} then
            Set X_{t+1} ← y
        Else
            Set X_{t+1} ← x
End
Implemented on a computer, Algorithm SLOW is inefficient; the computer
spends almost all of its time proposing moves and then rejecting them. More-
over, the longer the simulation runs, the worse the inefficiency becomes. Fox
(preprint of [14]) notes that the algorithm is a victim of its own success:
when the probability of accepting an uphill move gets small, the algorithm
stays, with high probability, at local minima for very long periods of time.
Algorithm SLOW, as well as its modifications discussed in this thesis, may also
consume a lot of time in oscillating between adjacent states. Fox [14] inhibits
this, while preserving convergence in probability, by integrating tabu penalties.
We discuss this further in Section 7.2.
Greene and Supowit [24] eliminate rejections by forcing the algorithm
to change states at every step. The new state is chosen from the neighbors of
the current state via a distribution conditioned on an accepted move. However,
their method fails to update temperatures; thus, it destroys the convergence
properties described in the previous chapter.
Fox [14, 15] proposes an algorithm QUICKER which greatly reduces
the computer time in self-loops, yet (because simulated time is unaffected)
updates temperatures in the right way to preserve convergence to the optimal
states. Suppose that the algorithm enters state x at time k. The point of
QUICKER is to generate
L = min{ l > k : X_l ≠ X_k },
the transition number on which a move to a different state next occurs. Let
a(x, k) be the probability that the algorithm exits state x at time k + 1; that
is,
a(x, k) = Pr(X_{k+1} ≠ x | X_k = x).
We assume a(x, k) is nonincreasing in k for any fixed x. QUICKER is based on
the observation that, to a first approximation, L has the same distribution as
a geometric variate with parameter a(x, k) (i.e., the trial number of the first
success in iid Bernoulli trials with success probability a(x, k)). The algorithm
follows.
Algorithm QUICKER(x, t)
    Set k ← t
    Until exit, repeat
        Generate a geometric r.v. G with parameter a(x, k)
        Set L ← k + G - 1
        Generate a uniform r.v. V
        If V < a(x, L)/a(x, k) then
            Execute NEXT(x, L)
            Exit
        Else
            Set k ← L
End
Above, Exit means go to the next state. The subroutine NEXT(x, k) selects
a state from N(x), conditioned on a move away from x at time k.
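The inner loop can be rendered as executable Python. In this sketch the exit probability a(x, ·) for the fixed state x is represented as a callable `a(k)`, the function name is ours, and the choice of the next state (subroutine NEXT) is not modeled:

```python
import random

def quicker_exit_time(a, k, rng=random):
    """Sample the step L at which the chain next leaves the current state,
    entered at step k, via the inner loop of QUICKER (a sketch).

    a -- callable giving the exit probability at step k for the fixed
         state; assumed nonincreasing in k, so that a(L)/a(k) <= 1.
    """
    while True:
        # Geometric variate with parameter a(k): the trial number of the
        # first success in iid Bernoulli(a(k)) trials.
        G = 1
        while rng.random() >= a(k):
            G += 1
        L = k + G - 1
        # Accept L with probability a(L)/a(k); otherwise restart the
        # geometric clock at L, as in the Else branch of the pseudocode.
        if rng.random() < a(L) / a(k):
            return L
        k = L
```

When a(·) is constant the ratio a(L)/a(k) is 1 and the first geometric draw is always accepted, so the loop runs exactly once; this is the easy case of Fox's observation that the number of repetitions tends to 1.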
Fox [15] shows that the number of repetitions of the Until ... repeat
loop required by each invocation of QUICKER converges in quadratic mean to 1,
provided ε(t) = t^{−1/c} and c is strictly larger than a constant d, to be introduced
shortly. He shows by example that in general no convergence occurs when
c = d. We give an almost-sure counterpart, with interesting consequences.
Fox also shows that at isolated local minima QUICKER reduces com-
puter time by an asymptotically infinite factor. We sharpen this observation
by giving upper and lower bounds for this time reduction.
2.1 Summary of Results
As in the previous section, a(x, k) denotes the probability of accepting
a move out of state x at time k. We often suppress dependence on x, writing
simply a_k.
On any given sample path ω, let

Q(x, k, ω) =
    the number of geometric variates (number of inner-loop
    iterations) generated by QUICKER if state x is entered
    at time k; 0, otherwise.   (2.1)
Again, we often suppress dependence on ω or x and write Q(x, k) or just Q_k.
Recall the function δ(x) defined on page 4. Let

d = max_{x ∈ S} δ(x) = max_{x ∈ S} min_{y ∈ N(x)} (U(y) − U(x))^+.
There is in general no simple relationship between d and Hajek's constant d*;
they are maxima of different functions over different sets. However, the case of
greatest interest is probably that of a problem with a local, nonglobal minimum
x whose depth d(x) is at least as great as the depth of all global optimizers.
In this case, d ≤ d*, and a necessary condition for Pr(X_t ∈ S_0) → 1 is that
Σ_k ε(k)^{d*} = ∞. When ε(k) = k^{−1/c}, this is equivalent to d*/c ≤ 1.
In contrast to d*, the constant d depends for its value only on the
immediate neighborhood of each local minimum. This reveals something of
the flavor of QUICKER; its improvement over SLOW is greatest at those local
minima x where δ(x) is large.
As discussed above, d ≤ d* in the cases of greatest interest. Then
δ(x)/c ≤ 1 is a necessary condition for Pr(X_k ∈ S_0) → 1. We shall see shortly
that it is undesirable to have δ(x)/c = 1. On the other hand, when δ(x)/c < 1,
QUICKER acquires some attractive properties.
Fox's result in [15] is that, for c > δ(x),

lim_{k → ∞} E( (Q(x, k) − 1)^2 | Q(x, k) > 0 ) = 0.

Actually, it is routine to show that all moments converge; that is, the above
expression holds with any positive integer exponent.
The next two theorems constitute an almost-sure counterpart to Fox's
result.
Theorem 2.1  When a state x is not an isolated local minimum
(δ(x) = 0), then with probability one, Q(x, k) ≤ 1 for all but finitely many k.
Theorem 2.2  Let j be a positive integer. If x is an isolated local
minimum, and δ(x)/c < (j − 1)/j, then with probability one, Q(x, k) ≤ j for
all but finitely many k.
We do not have a converse to Theorem 2.2, but the following may
provide some insight. Although the conditions of Theorem 2.3 would never be
realized in a lone sample path, they might apply, for example, to massively
parallel runs of QUICKER.
Theorem 2.3  Assume that Q(x, k) > 0 for all k, and that Q(x, k)
and Q(x, l) are independent when k ≠ l. If δ(x)/c > (j − 1)/j, then Q(x, k) > j
infinitely often on almost all sample paths.
The quantities Q(x, k) are certainly not independent, but Theorem 2.3
may indicate something about the worst-case behavior of QUICKER.
The following example (see Figure 2.1) shows what can go wrong in
the critical case δ(x)/c = 1. Let S = {x, y, z} with U(x) = 1, U(y) = 2, U(z) = 0,
N(x) = N(z) = {y} and N(y) = {x, z}, and take ε(k) = k^{−1}. A routine
calculation shows that the expected time to exit from state x, given X_k = x,
is infinite for all k. Since the sum of a finite number of geometric variates with
positive parameters is finite a.s., it follows that Pr(Q(x, k) > j) > 0 for all
positive integers j and k.
Adaptive Cooling.  Consider the following modification: upon
entry to a state at step k, keep the temperature constant at ε(k) until a move
out of the state is finally accepted at time m; then change the temperature to
ε(m). In effect, we modify the cooling schedule (ε(t)) in a certain random way.
The sojourn time in state x beginning at time k has a geometric distribution
with parameter a(x, k).
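Under this modification the whole sojourn can therefore be drawn in one shot. A sketch (all names ours; `a` again stands in for a(x, ·) at the fixed state), using the standard inverse-transform sampler for a geometric variate:

```python
import math
import random

def sojourn_exit_step(a, k, rng=random):
    """Adaptive cooling: on entering a state at step k, freeze the schedule
    at eps(k), draw the sojourn length as a single geometric variate with
    parameter a(k), and return the exit step m = k + G; the temperature
    would then be updated to eps(m)."""
    p = a(k)                      # exit probability, frozen at entry
    if p >= 1.0:
        return k + 1              # the chain leaves immediately
    u = rng.random()              # inverse transform: G ~ Geometric(p), G >= 1
    G = int(math.log(1.0 - u) / math.log(1.0 - p)) + 1
    return k + G
```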
Suppose (S, U, R) contains no isolated local minima, and ε(t) = t^{−1/c},
with c > d*. Theorem 2.1 makes it plausible that Pr(X_t ∈ S_0) → 1 when ε(t) is
replaced by the random schedule described above. We show in Chapter 3 that
this is in fact the case. We give further convergence results for such schedules
in Chapter 5.
2.2 Proofs of Theorems
All three proofs follow the same pattern: we write down an exact
expression for Pr(Q_k > j | Q_k > 0), then bound it by suitable inequalities
to show that the sum over k either converges or diverges, then invoke the
Borel-Cantelli lemmas to assert that the event Q_k > j does or does not occur
infinitely often. We begin with some relationships that will be useful in all the
proofs.
The acceptance probability at a state x is of the form

a(x, k) = A_0 + Σ_{i=1}^n A_i k^{−d_i/c},   (2.2)

where n ≥ 0 and we assume without loss of generality that 0 < d_1 < ... < d_n
and all the A_i except possibly A_0 are strictly positive. The case n = 0 occurs
when and only when x has no uphill neighbors; i.e., it is a local maximum.
The case A_0 = 0 occurs when and only when all of the neighbors of x are
strictly uphill; that is, recalling the definition on page 4, x is an isolated local
minimum. In that case, d_1 = δ(x).
The inner loop of QUICKER will be executed more than once if and
only if the first V-test fails. Conditioned on the first geometric variate being l,
the probability of this is

1 − (a_{k+l−1}/a_k) = a_k^{−1} (a_k − a_{k+l−1}).

Thus

Pr(Q_k > 1) ≤ Pr(Q_k > 1 | Q_k > 0)
  = Σ_{l=1}^∞ (a_k − a_{k+l−1})(1 − a_k)^{l−1}
  = Σ_{l=0}^∞ (a_k − a_{k+l})(1 − a_k)^l
  = Σ_{l=1}^∞ (a_k − a_{k+l})(1 − a_k)^l.   (2.3)
Likewise, if the first geometric variate is l and the first V-test fails, the
remaining number of geometric variates is Q_{k+l−1}. Thus we obtain the recursion

Pr(Q_k > j+1) ≤ Pr(Q_k > j+1 | Q_k > 0)
  = Σ_{l=1}^∞ (a_k − a_{k+l−1})(1 − a_k)^{l−1} Pr(Q_{k+l−1} > j)
  = Σ_{l=0}^∞ (a_k − a_{k+l})(1 − a_k)^l Pr(Q_{k+l} > j)
  = Σ_{l=1}^∞ (a_k − a_{k+l})(1 − a_k)^l Pr(Q_{k+l} > j).   (2.4)

In the case j = 0, (2.4) reduces to (2.3).
Now we obtain a bound for the factor a_k − a_{k+l}, which is the (possibly
null) sum of differences of the form

A_i k^{−d_i/c} − A_i (k+l)^{−d_i/c}.

The exponent −d_i/c is strictly negative. We expand A_i (k+l)^{−d_i/c} in a Taylor
series around A_i k^{−d_i/c} and note that the series alternates in sign, implying

A_i (k+l)^{−d_i/c} ≥ A_i k^{−d_i/c} − A_i (d_i/c) l k^{−(d_i/c)−1},   (2.5)

or equivalently,

A_i k^{−d_i/c} − A_i (k+l)^{−d_i/c} ≤ A_i (d_i/c) l k^{−(d_i/c)−1}.   (2.6)

Therefore,

a_k − a_{k+l} = (A_0 − A_0) + Σ_{i=1}^n ( A_i k^{−d_i/c} − A_i (k+l)^{−d_i/c} )
  ≤ l Σ_{i=1}^n A_i (d_i/c) k^{−(d_i/c)−1}.   (2.7)

Equality occurs in (2.7) in the case n = 0 (a local maximum), since then both
sides vanish.
By keeping the second-order term in the Taylor series, we get the
lower bounds corresponding to (2.6) and (2.7):

A_i k^{−d_i/c} − A_i (k+l)^{−d_i/c} ≥ A_i (d_i/c) l k^{−(d_i/c)−1} − (A_i/2)(d_i/c)(1 + d_i/c) l^2 k^{−(d_i/c)−2}

and

a_k − a_{k+l} ≥ l Σ_{i=1}^n A_i (d_i/c) k^{−(d_i/c)−1} − (l^2/2) Σ_{i=1}^n A_i (d_i/c)(1 + d_i/c) k^{−(d_i/c)−2}.   (2.8)
Finally, the expressions for the first and second moments of a geometric
distribution imply

Σ_{l=1}^∞ l (1 − a_k)^l = a_k^{−2} (1 − a_k) ≤ a_k^{−2}   (2.9)

and

Σ_{l=1}^∞ l^2 (1 − a_k)^l = a_k^{−3} (2 − a_k)(1 − a_k) ≤ 2 a_k^{−3}.   (2.10)
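Both closed forms are easy to verify numerically by truncating the series; the sample value of a_k, the truncation point, and the tolerance in the sketch below are arbitrary:

```python
def check_geometric_sums(a, terms=5000):
    """Truncated series versus the closed forms in (2.9) and (2.10):
       sum_{l>=1} l   (1-a)^l = (1-a)/a^2
       sum_{l>=1} l^2 (1-a)^l = (1-a)(2-a)/a^3
    """
    s1 = sum(l * (1 - a) ** l for l in range(1, terms))
    s2 = sum(l * l * (1 - a) ** l for l in range(1, terms))
    return s1, (1 - a) / a ** 2, s2, (1 - a) * (2 - a) / a ** 3

s1, c1, s2, c2 = check_geometric_sums(0.3)
assert abs(s1 - c1) < 1e-9 and abs(s2 - c2) < 1e-9
```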
2.2.1 Proof of Theorem 2.1  Fix a state x. If n = 0 (i.e., x
has no uphill neighbors) then (2.3) and the remark following (2.7) imply that
Pr(Q_k > 1) = 0 for all k, and the theorem is trivial. Assume for the rest of the
proof that n > 0. We have from (2.2) that

a(x, k) ≥ A_0.

By hypothesis, x is not an isolated local minimum. This means that A_0 > 0
and therefore

a_k^{−2} ≤ A_0^{−2}.   (2.11)
From (2.3) and (2.7),

Pr(Q_k > 1) ≤ Σ_{l=1}^∞ (a_k − a_{k+l})(1 − a_k)^l
  ≤ Σ_{i=1}^n A_i (d_i/c) k^{−(d_i/c)−1} Σ_{l=1}^∞ l (1 − a_k)^l.

Applying (2.9) and (2.11),

Pr(Q_k > 1) ≤ Σ_{i=1}^n A_i (d_i/c) k^{−(d_i/c)−1} a_k^{−2}
  ≤ Σ_{i=1}^n A_i A_0^{−2} (d_i/c) k^{−(d_i/c)−1},

and summing over positive k,

Σ_k Pr(Q_k > 1) ≤ Σ_{i=1}^n A_i A_0^{−2} (d_i/c) ( Σ_k k^{−(d_i/c)−1} ) < ∞.   (2.12)

From the Borel-Cantelli lemmas, (2.12) is sufficient to prove that, almost surely,
Q_k > 1 only finitely often.
2.2.2 Proof of Theorem 2.2  By hypothesis, x is an isolated
local minimum, so A_0 = 0 and a(x, k) is of the form

a_k = Σ_{i=1}^n A_i k^{−d_i/c},

where n > 0. As before, assume that 0 < d_1 < ... < d_n and all the A_i are
strictly positive. Note that d_1 = δ(x); for the rest of this proof we suppress
the reference to x and write just δ. There is a positive constant C_3 such that
for all k

A_1 k^{−δ/c} ≤ a_k ≤ C_3 k^{−δ/c},

or equivalently,

C_3^{−1} k^{δ/c} ≤ a_k^{−1} ≤ A_1^{−1} k^{δ/c}.   (2.13)

We now prove by induction the following statement for nonnegative integers j:

(P_j)  Assume δ/c < 1. Then there exists a constant D_j, not dependent on
k, such that Pr(Q_k > j) ≤ D_j k^{j(δ/c)−j}.

By the Borel-Cantelli lemmas, proving statement P_j for positive j is sufficient
to prove Theorem 2.2, since if j > 0 and δ/c < (j−1)/j, then δ/c < 1; also, the
quantity j(δ/c) − j is then strictly less than −1, so that Σ_k Pr(Q_k > j) is bounded
by the convergent series Σ_k D_j k^{j(δ/c)−j}.
Statement P_0 is trivial. Assume statement P_j and δ/c < 1. Then
j(δ/c) − j ≤ 0, so

(k + l)^{j(δ/c)−j} ≤ k^{j(δ/c)−j}.   (2.14)

Writing (2.4) and using (2.7), the induction hypothesis, and (2.14):

Pr(Q_k > j+1) ≤ Σ_{l=1}^∞ Pr(Q_{k+l} > j) (a_k − a_{k+l})(1 − a_k)^l
  ≤ Σ_{i=1}^n A_i D_j (d_i/c) k^{j(δ/c)−j−(d_i/c)−1} Σ_{l=1}^∞ l (1 − a_k)^l.

It now follows from (2.9) and the right inequality of (2.13) that

Pr(Q_k > j+1) ≤ Σ_{i=1}^n A_i D_j (d_i/c) k^{j(δ/c)−j−(d_i/c)−1} a_k^{−2}
  ≤ Σ_{i=1}^n A_i A_1^{−2} D_j (d_i/c) k^{j(δ/c)−j−(d_i/c)−1+2(δ/c)}
  ≤ D_{j+1} k^{(j+1)(δ/c)−(j+1)},

since d_i ≥ δ for each i; this proves statement P_{j+1} and finishes the proof of
Theorem 2.2.
2.2.3 Proof of Theorem 2.3  By hypothesis, Q_k > 0 for all k;
therefore,

Pr(Q_k > j) = Pr(Q_k > j | Q_k > 0), a.s.   (2.15)

In the remainder of the proof, we use the two sides of (2.15) interchangeably.
Since a_k is strictly less than one (at least for large k), and is nonincreasing
in k, statement (2.13) implies that there is a constant C_4 such that

a_k^{−2} (1 − a_k) ≥ C_4 k^{2δ/c}.   (2.16)
We prove the following statement by induction for nonnegative integers j:

(P'_j)  Assume δ/c < 1. Then there exist positive constants D'_j and D''_j, not
dependent on k, such that

Pr(Q_k > j) ≥ D'_j k^{j(δ/c)−j} − D''_j k^{(j+1)(δ/c)−(j+1)}.

Theorem 2.3 will then follow from the Borel-Cantelli lemmas and the independence
of the Q_k.
Statement P'_0 is trivial. Assume P'_j. Then from (2.4), (2.8), and the
induction hypothesis,

Pr(Q_k > j+1)
  = Σ_{l=1}^∞ (a_k − a_{k+l})(1 − a_k)^l Pr(Q_{k+l} > j)
  ≥ Σ_{l=1}^∞ (1 − a_k)^l ( l Σ_{i=1}^n A_i (d_i/c) k^{−(d_i/c)−1}
      − (l^2/2) Σ_{i=1}^n A_i (d_i/c)(1 + d_i/c) k^{−(d_i/c)−2} )
    × ( D'_j (k+l)^{j(δ/c)−j} − D''_j (k+l)^{(j+1)(δ/c)−(j+1)} ).   (2.17)

Now

D''_j (k+l)^{(j+1)(δ/c)−(j+1)} ≤ D''_j k^{(j+1)(δ/c)−(j+1)},   (2.18)

while a Taylor-series argument similar to that used to derive (2.5) shows that

(k+l)^{j(δ/c)−j} ≥ k^{j(δ/c)−j} − j(1 − δ/c) l k^{j(δ/c)−j−1}.   (2.19)

Applying (2.18) and (2.19) to the last factor of (2.17), expanding the product,
and dropping some positive terms,

Pr(Q_k > j+1)
  ≥ A_1 D'_j (δ/c) k^{j(δ/c)−j−(δ/c)−1} Σ_{l=1}^∞ l (1 − a_k)^l
  − A_1 D'_j (δ/c) j(1 − δ/c) k^{j(δ/c)−j−(δ/c)−2} Σ_{l=1}^∞ l^2 (1 − a_k)^l
  − A_1 D''_j (δ/c) k^{(j+1)(δ/c)−(j+1)−(δ/c)−1} Σ_{l=1}^∞ l (1 − a_k)^l
  − Σ_{i=1}^n A_i D'_j (d_i/c)(1 + d_i/c) k^{j(δ/c)−j−(d_i/c)−2} Σ_{l=1}^∞ (l^2/2)(1 − a_k)^l.

Finally, we use (2.9) and (2.10), and then (2.13) and (2.16), to conclude that

Pr(Q_k > j+1) ≥ A_1 C_4 D'_j (δ/c) k^{(j+1)(δ/c)−(j+1)}
  − 2 A_1^{−2} D'_j (δ/c) j(1 − δ/c) k^{(j+2)(δ/c)−(j+2)}
  − A_1^{−1} D''_j (δ/c) k^{(j+2)(δ/c)−(j+2)}
  − Σ_{i=1}^n A_i A_1^{−3} D'_j (d_i/c)(1 + d_i/c) k^{(j+2)(δ/c)−(j+2)}
  ≥ D'_{j+1} k^{(j+1)(δ/c)−(j+1)} − D''_{j+1} k^{(j+2)(δ/c)−(j+2)},

where

D'_{j+1} = D'_j A_1 C_4 (δ/c)

and

D''_{j+1} = D'_j A_1^{−3} ( 2 A_1 (δ/c)(1 − δ/c) j + Σ_{i=1}^n A_i (d_i/c)(1 + d_i/c) ) + D''_j A_1^{−1} (δ/c).

This finishes the induction proof of P'_{j+1} and the proof of Theorem 2.3.
2.3 Efficiency of Fox's Algorithm
We compare the expected efficiency, work per unit cost, of algorithms
SLOW and QUICKER. Say that computing the states X_1, ..., X_k represents k
units of work, and that each iteration of SLOW represents one unit of cost, so
that the efficiency of SLOW is 1. To find the relative efficiency of QUICKER
we find its cost for computing X_1, ..., X_k.
We can break each invocation of QUICKER into three phases:
(i) Computation of the acceptance probability.
(ii) Iteration of the inner loop.
(iii) The call to the algorithm NEXT.
Steps (i) and (iii) involve scanning the neighborhood of state x and thus their
cost depends on the size of N(x). It is reasonable to assume that this cost
is bounded below and above by constant multiples m and M, respectively,
of the cost of one iteration of SLOW. The main work in the inner loop is
the generation of two random numbers (and, possibly, another neighborhood
scan), so the cost of each iteration of the inner loop is bounded below and
above by constants m' and M'.
In QUICKER, the clock is advanced only during iterations of the inner
loop. Thus

H(k) m + G(k) m' ≤ cost(X_1, ..., X_k) ≤ H(k) M + G(k) M',

where H(k) is the number of invocations of QUICKER up to time k and G(k) is
the number of iterations of the inner loop up to time k.
The import of the theorems in this chapter is that for all large k,
G(k)/H(k) is between 1 and j, for some j depending only on the cooling
schedule. We use this to get

(m + m') H(k) ≤ cost(X_1, ..., X_k) ≤ (M + jM') H(k),   (2.20)
with probability arbitrarily close to one for large values of k.
We now estimate E(H(k)). For this purpose, consider the sequence
of random variables β(k), k = 0, 1, ..., defined by

β(k) = a(X_{k−1}, k) = Pr(X_k ≠ X_{k−1} | X_{k−1}).

Suppose we know H(k) for some k. If the algorithm moves to a different state
at step k + 1, then H(k+1) = H(k) + 1. Otherwise, H(k+1) = H(k). Thus
we have the recursive relationship

E(H(k+1) | H(k), β(k+1)) = (1 + H(k)) β(k+1) + H(k)(1 − β(k+1))
  = H(k) + β(k+1).   (2.21)

Taking expectations in (2.21) and using the initial condition H(0) = 0, we
conclude by induction that

E(H(k)) = Σ_{j=1}^k E(β(j)).   (2.22)
Next, we estimate the quantities β(1), ..., β(k). Trivially β(j) ≤ 1 for all j.
Also, if d > 0 and X_j is an isolated local minimum, then

β(j) ≥ D j^{−d/c}

for a constant D. If X_j is not an isolated local minimum, then β(j) is bounded
away from zero. Thus in any case we have

β(j) ≥ D' j^{−d/c}   (2.23)

for some constant D'. Taking sums, we have

k ≥ Σ_{j=1}^k β(j) ≥ D' Σ_{j=1}^k j^{−d/c} ≥ C_0 k^{1−(d/c)}

for another constant C_0; applying (2.22),

C_0 k^{1−(d/c)} ≤ E(H(k)) ≤ k.   (2.24)
Recall that our measure for efficiency is work per unit cost, or

k / cost(X_1, ..., X_k).

Applying (2.24) to (2.20), taking multiplicative inverses, and then multiplying
through by k, this means that there exist constants C_1 and C_2 such that

C_1 ≤ (efficiency of QUICKER at time k) ≤ C_2 k^{d/c}.   (2.25)
The efficiency of QUICKER relative to SLOW increases as the parameter c in the
exponent of the cooling schedule gets smaller. This should not be interpreted
as an argument in favor of faster schedules. It reflects the fact that SLOW is
becoming less efficient; as c gets close to d, the algorithm spends more and
more of its time in self-loops at local minima.
Efficiency is highest when many of the states X_1, ..., X_k are isolated
local minima. When there are no isolated local minima, the lower bound in
(2.23), and thus the upper bound in (2.25), can be replaced by a constant.
In this case, the relative efficiency of QUICKER is bounded both above and
below by a constant, and there is no order-of-magnitude gain in speed from its
use. On the other hand, if all local minima are isolated, then (2.23) is sharp.
Heuristically, QUICKER tends to be an efficient algorithm to the degree that
direct transitions between local minima are rare or difficult.
Some examples from the literature show that isolated or near-isolated
local minima can actually occur. In the maximum matching problem of [25],
all local minima are isolated. In the first formulation of the graph coloring
problem in [31], direct transitions between local minima are not impossible,
but would be extremely rare. Our results indicate that QUICKER would attain
its maximum efficiency in such situations and would be worth trying there.
The Distribution of H(k). Another approach to computing the
expected value of H(k) may give additional insight. Consider the following
modification: fix a state x_0 and start the algorithm there. Run QUICKER
as before, but every time there is a move to a state other than x_0, return
immediately to x_0 before continuing. Let H̃(k) denote the number of inner-loop
iterations up to time k in the modified process. Clearly, H̃(·) is an inhomogeneous
discrete-time counting process. Letting p_i(k) denote the probability
that H̃(k) = i, the k-step transition vector p(k) satisfies the forward equations
p_0(k+1) = (1 − a(x_0, k+1)) p_0(k);
p_i(k+1) = (1 − a(x_0, k+1)) p_i(k) + a(x_0, k+1) p_{i−1}(k),   i = 1, 2, ...
The generating function ζ(z, k) = Σ_i z^i p_i(k) then satisfies the first-order
difference equation

ζ(z, k+1) − ζ(z, k) = −a(x_0, k+1) ζ(z, k) + z a(x_0, k+1) ζ(z, k),

with initial condition ζ(z, 0) = 1. It is easily verified by induction that the
solution is

ζ(z, k) = Π_{j=1}^k (1 + (z − 1) a(x_0, j));   (2.26)
taking the first derivative at z = 1 shows that

E(H̃(k)) = Σ_{j=1}^k a(x_0, j),   (2.27)

exactly. Moreover, for z near one, the right side of (2.26) is arbitrarily close to

exp((z − 1) Σ_{j=1}^k a(x_0, j)),

which is the probability generating function of a Poisson distribution with the
same mean.
Observing that H(k) ≤ H̃(k) stochastically, we obtain (2.24) as before.
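The forward equations can be iterated directly to confirm (2.27). A minimal sketch; the acceptance-probability sequence standing in for a(x_0, j) is an arbitrary illustrative choice:

```python
def mean_via_forward_equations(a_vals):
    """Iterate the forward equations for p_i(k) = Pr(H~(k) = i) and return
    E(H~(k)), where a_vals[j-1] plays the role of a(x0, j)."""
    p = [1.0]                        # H~(0) = 0 with probability one
    for a in a_vals:
        q = [0.0] * (len(p) + 1)
        for i, pi in enumerate(p):
            q[i] += (1.0 - a) * pi   # no accepted move at this step
            q[i + 1] += a * pi       # one more accepted move
        p = q
    return sum(i * pi for i, pi in enumerate(p))

# (2.27): the mean equals the sum of the acceptance probabilities exactly.
a_vals = [(j + 1) ** -0.5 for j in range(1, 21)]   # illustrative a(x0, j)
assert abs(mean_via_forward_equations(a_vals) - sum(a_vals)) < 1e-12
```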
CHAPTER 3
SIMULATED ANNEALING WITH OVERRIDES
The results of Chapter 2 show that there is an integer j, determinable
in advance, which bounds the number of inner-loop iterations of QUICKER(x)
for all x after a random but finite number of moves. This suggests that we
modify QUICKER to always force an exit after j iterations. Formally:
Algorithm QUICKER-j(x, t)
    Set k ← t
    Set i ← 0
    Until exit, repeat
        Set i ← i + 1
        Generate a geometric r.v. G with parameter a(x, k)
        Set L ← k + G − 1
        Generate a uniform r.v. V
        If V ≤ a(x, L)/a(x, k)
            or i ≥ j
        then
            Execute NEXT(x, L)
            Exit
        Else
            Set k ← L
    End
The additional steps needed to convert QUICKER into QUICKER-j, namely the
iteration counter i and the forced-exit test i ≥ j, are boxed in the original.
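A runnable sketch of QUICKER-j in Python, with the geometric variate generated by inversion (the point of the algorithm). The acceptance-probability function a(x, k), assumed non-increasing in k, and the move routine next_move are hypothetical stand-ins for the thesis's a(·, ·) and NEXT:

```python
import math
import random

def sample_geometric(p, rng):
    """Geometric on {1, 2, ...} with success probability p, by inversion."""
    if p >= 1.0:
        return 1
    u = rng.random()
    return max(1, math.ceil(math.log1p(-u) / math.log1p(-p)))

def quicker_j(x, t, j, a, next_move, rng):
    """One call of QUICKER-j(x, t): returns (next state, exit clock)."""
    k, i = t, 0
    while True:
        i += 1
        G = sample_geometric(a(x, k), rng)    # length of a self-loop block
        L = k + G - 1
        V = rng.random()
        if V <= a(x, L) / a(x, k) or i >= j:  # forced exit after j iterations
            return next_move(x, L), L
        k = L
```

With a constant acceptance probability the ratio a(x, L)/a(x, k) equals one and the first iteration always exits; with a decreasing a, the forced exit caps the inner loop at j iterations.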
With QUICKER-j, more moves are accepted in fixed computer time
than with QUICKER. We cannot say that QUICKER-j shortens the time to first
hit S0. However, QUICKER-j lets us synchronize parallel processors with SIMD
architecture executing stochastically independent runs of simulated annealing.
Each move at a given computer clock tick executes the same instruction, on
processor-dependent data. When QUICKER-j exits after h < j iterations on
some processor, that processor executes j − h additional dummy iterations
in order to maintain synchronization. Such synchronization is not possible with
either QUICKER or the naive move mechanism.
Stochastically, QUICKER-1 generates the same sample path as SLOW
with the adaptive cooling schedule described in Section 2.1.
It seems intuitively plausible that convergence in probability would
continue to hold under algorithm QUICKER-j for the right choice of j. Formally,
we have

Theorem 3.1 If j is chosen so that the number of inner-loop iterations
of QUICKER exceeds j only finitely often, and if

lim_{k→∞} Pr(X_k ∈ S0) = 1   (3.1)

under algorithm QUICKER, then (3.1) also holds under algorithm QUICKER-j.
Theorem 3.1 is a corollary of a much more general result which we
prove in this chapter. Broadly, the convergence properties of a probabilistic
search algorithm are unaffected by any modification of its initial steps, provided
the total number of modified steps is finite with probability one. That is,
suppose the search strategy can have its standard move mechanism overridden
up to a random but finite time N; just after N, the standard move mechanism
takes over. We show that the current state of the Markov chain associated with
the modified process converges in probability to the set So of optimal states
whenever this is true in the unmodified process.
We do not require the state space S to be finite. Moreover, So can be
an arbitrary measurable subset of S, not just the set of global optimizers.
In the next two sections, we give examples of applications of this
general result, followed by a formal statement and proof in Section 3.3.
3.1 Simple Examples
3.1.1 Random Restart Begin the simulated annealing process
in the usual fashion. Observe the process. If it appears to stall at a particular
state or set of states, jump at random to a different part of the state space.
Repeat this until a fixed number n of restarts have been made. With any
reasonable definition of "stall," the restart mechanism is in fact triggered n
times. In this case, N is the transition number at which the n-th restart
is made. Our results do not say that modifying simulated annealing in this
fashion makes the algorithm either better or worse. However, we are assured
the modified algorithm retains the convergence property (3.1).
In this example, N is a stopping time (Markov time), because we
know its value after N transitions have occurred, but later we give important
examples where this is not true. Perhaps surprisingly, our basic result continues
to hold under certain natural conditions even when N is not a Markov time.
3.1.2 Greedy Overrides Upon arrival to a state, scan the en-
ergy levels of all its neighbors. If a neighbor is encountered with a lower energy
level than any state yet seen, move immediately to the lowest such neighbor.
On a finite state space, N is obviously finite. Under weak conditions, N is the
transition number on which S0 is first visited, and so is a Markov time.
Such a greedy override strategy is widely used in the tabu-search
community (see, for example, Glover [22],[23]). Under certain restrictions, Fox
[14] recasts tabu search as simulated annealing on a more elaborate state space.
It is easy to give instances where a greedy override strategy increases the time
to first visit a global optimum. Heuristically, however, it seems attractive unless
the state space is contrived.
3.1.3 Perverse Overrides Upon arrival to a state, scan the
neighbors. If one of the neighbors has higher energy level than any state
yet encountered, move immediately to the highest such neighbor. The analysis
of the previous subsection goes through, where this time N is the transition
number on which the set of global maximizers is first encountered. We do not
claim this strategy is worthwhile, except perhaps on very unusual state spaces,
but it does show the robustness of the property (3.1).
3.2 Dynamic Neighborhoods
Consider a generic search algorithm of the form
Algorithm SEARCH(x, t)
    Generate a set F(x, t) of candidates for the next state
    Optional: if F(x, t) contains an element at least ε
        better than any yet seen, move to it
    Until exit, repeat
        Select an element of F(x, t)
        If it passes the standard acceptance test, then
            Exit
    End
In ordinary simulated annealing, F(x,t) is the fixed set of neighbors of x.
Here we allow the possibility of using problem-specific information to choose a
smaller or larger F at each step of the algorithm. Of course, we have to prove
that convergence in probability still occurs; this is true in the two examples of
this section.
Fox [14] shows that algorithm QUICKER, now depending on the triple
(x, t, F(x, t)), can replace the Until...repeat loop in SEARCH. We want to
further streamline SEARCH using the techniques of this chapter.
When F is a small finite subset of S, it is practical to compute the
acceptance probability a(x, k) needed for QUICKER or QUICKER-j, and also
for their subroutine NEXT. If the method to pick an element of F depends
on the energy levels of all its elements, then a(x,k) can be computed with
essentially no additional work.
In the setting of Fox ([14] and [15]), S is finite. The set F(x,t) is
formed by supplementing the nominal fixed neighborhood of the current state
x with randomly generated elements from S. To show that the convergence the-
orems apply, Fox recasts this as standard annealing with fixed neighborhoods,
but on a more elaborate state space.
To ensure a set F(x,t) of reasonable size, the nominal neighborhood
can be chosen relatively small, while the added elements are chosen to guar-
antee weak reversibility and irreducibility. Heuristically, the added elements
diversify the search; [15] shows that they have theoretical value as well. We
modify SEARCH in two ways: first, we perform the Optional step with e = 0
(this is the same as the greedy strategy in Section 3.1); second, we further speed
up the Until... repeat loop by using QUICKER-j instead of QUICKER. Our
theorem applies to both modifications. To cool as rapidly as possible and still
take advantage of the convergence properties of Theorems 2.1 and 2.2, we want
to ensure that the constant d is not too large. This is achieved by requiring
the nominal neighborhood of each state to contain a state with at most
a slightly higher energy level.
When S is a compact subset of Euclidean space, Belisle [2] implicitly performs
SEARCH. He takes F = S, a natural choice when NEXT is not used. Under
the strong assumptions of [2], there are no restrictions on the rate of cooling;
in particular, convergence occurs with any adaptive schedule, so long as the
temperatures decrease to zero. Moreover, any state can be reached from any
other state, so there are no isolated local minima (unless there is only one
optimal state; this special case is easily disposed of). Then we use QUICKER-1
(corresponding to an adaptive schedule) to speed up the Until...repeat loop.
We also perform the Optional step here, this time setting e > 0.
It is straightforward to extend Belisle's results to the case where the
neighborhoods F are dynamically chosen, so long as there is a path with no
uphill moves from any state to S0. That is, every set F(x, t) must be forced
to include a state with the same or lower energy level than x. This works also
when S is finite, and it offers a way of proving convergence that is different
from Fox's recasting scheme, mentioned earlier in this section. Once again,
there are no isolated local minima, so we use QUICKER-1 as before.
3.3 Overrides: The Main Result
Let So be an arbitrary measurable subset of S, not necessarily finite.
We require that the unmodified chain (Xk) converge in probability to So from
any starting state at any initial time. More precisely,

lim_{k→∞} inf_{x∈S} Pr{X_k ∈ S0 | X_n = x} = 1,   (3.2)

for all positive integers n. When S is finite, we can replace (3.2) by the simpler
condition

lim_{k→∞} Pr{X_k ∈ S0 | X_1 = x} = 1.   (3.3)
The epoch of the last override must be almost surely finite, but we do not
require it to be a Markov time. It is enough that the override mechanism
stops eventually and has no further influence on the destiny of the process.
More precisely, denote by Y(k), k = 1, 2, ..., the Markov chain of moves with
overrides, and by N the epoch of the last override. We assume that there exists
a family {Z(n, ·) : n ≥ 0} of Markov chains such that:

(i) The chains Z(n, ·) and Y(·) coincide exactly up to time n. That is,

Z(n, k, ω) = Y(k, ω)

for k ≤ n and every ω in the probability space.

(ii) After time n, the transition probabilities of Z(n, ·) and X are the same;
that is,

Pr(Z(n, k+1) = y | Z(n, k) = x) = Pr(X_{k+1} = y | X_k = x)

for k ≥ n and x, y ∈ S.

(iii) There exists a random time N, a.s. finite, after which the transition
probabilities of Y(·) and X are the same.
Here is the intuition for this structure. We start with the unmodified process
and define a sequence of processes in which random overrides are allowed, but
only up to a deterministic time n. Each process Z(n, ·) couples with Y(·) up to
time n and behaves stochastically like (X_k) after time n. We can characterize
Y(j) as lim_{n→∞} Z(n, j). The limit is attained for some finite n, which can be
chosen uniformly in j.
Theorem 3.2 Assume (3.2) and conditions (i)-(iii). Then (3.2) is
also true for the chain Y(·).
Proof: Let ε > 0 be arbitrary. For each positive integer m, let φ(m) be the
smallest integer j ≥ m such that

inf_{x∈S} Pr{X_k ∈ S0 | X_m = x} ≥ 1 − ε

for all k ≥ j. Condition (3.2) guarantees that φ(m) is well-defined and finite
for every m.

Moreover, the function φ(·) is non-decreasing, as we now show. Select
n < φ(m). Then there exist k ≥ n and x ∈ S such that

Pr(X_k ∈ S0 | X_m = x) < 1 − ε.

Applying the Chapman-Kolmogorov equations, we have

inf_{y∈S} Pr(X_k ∈ S0 | X_{m+1} = y)
    ≤ Σ_{y∈S} Pr(X_k ∈ S0 | X_{m+1} = y) Pr(X_{m+1} = y | X_m = x)
    = Pr(X_k ∈ S0 | X_m = x)
    < 1 − ε,

which implies that n < φ(m+1). Because n was an arbitrary number between
0 and φ(m), we conclude that φ(m) ≤ φ(m+1).

Since φ(·) is non-decreasing and grows without limit, there is, for
k ≥ φ(0), a unique integer φ*(k) such that

φ(φ*(k)) ≤ k < φ(φ*(k) + 1).   (3.4)
Note that the function φ*(·) is also non-decreasing and grows without
limit. Take any k ≥ φ(0); write ν for φ*(k). By condition (ii),

Pr(Z(φ*(k), k) ∈ S0) ≥ inf_{x∈S} Pr(Z(ν, k) ∈ S0 | Z(ν, ν) = x)
    = inf_{x∈S} Pr(X_k ∈ S0 | X_ν = x)
    ≥ 1 − ε.   (3.5)

To finish the proof, we use a variant of the coupling inequality (see Asmussen [1],
p. 143). By conditions (i) and (iii), the chains Z(ν, ·) and Y(·) coincide exactly
for all time on the set {N ≤ ν}. Thus

|Pr(Z(ν, k) ∈ S0) − Pr(Y(k) ∈ S0)|
    = |Pr(Z(ν, k) ∈ S0, N > ν) − Pr(Y(k) ∈ S0, N > ν)|   (3.6)
    ≤ Pr(N > ν).

We have proved that, for arbitrary ε, for all large k,

Pr(Y(k) ∈ S0) ≥ 1 − ε − Pr(N > φ*(k)).

As remarked above, the function φ*(·) is non-decreasing and grows without
limit; by condition (iii), this completes the proof.
CHAPTER 4
SIMULATED ANNEALING WITH A NOISY OBJECTIVE
FUNCTION
We now discuss a significant application of the results in Chapter 3.
Suppose we are interested in a state space where objective-function values
are unknown and must be estimated by measurements subject to error. For
example, each state might be a configuration of a queuing network. To run
simulated annealing in the most naive fashion, pretend that the estimated
objective-function values are the true values; calculate move probabilities and
move based on the estimates. Meanwhile, every time a state is seen, measure
its objective-function value and use the measurement to refine the estimates.
In this chapter, we show that under the right conditions, this simple procedure
works.
Yan and Mukai [38] have previously considered such situations. They
assume that the state space satisfies NS, and that we know a finite interval
(a, b) surrounding all the objective-function values. In their algorithm, the
probability of accepting a proposed move is itself a random quantity, typically
estimated by multiple measurements of the tentative new state. A proposed
move is rejected unless the result of every simulation is smaller than a uniform
variate on (a, b) drawn at the beginning of the move. The number of simula-
tions required at each move is large and grows without limit as the algorithm
proceeds.
Gelfand and Mitter [19] propose a different method. They formulate
the transition probability at time t as

R(x, y) ε(t)^([U(y) − U(x) + W(t)]⁺),

where W(t) represents a noise term which may depend on past history. This
general formulation includes ours as a special case. However, their convergence
proofs assume that W(t) is Gaussian with variance asymptotically smaller than
any linear multiple of 1/(−log ε(t)). Since some states may be visited only
rarely, it is unclear how to check the variance condition.
For technical reasons, we assume that the unknown objective function
takes on values in a finite set, such as a subset of the computer-representable
numbers. Neither [38] nor [19] needs to do this. But since our treatment allows
arbitrary precision, we do not consider this restriction important.
In the next section, we give a formal description and show that our
conclusions follow from the results of Chapter 3. It turns out that we need
estimators which are asymptotically stable; the remainder of the chapter
discusses these and presents two examples.
4.1 Formal Setup
At the j-th entry to a state x, make an observation T(x, j). Each
of the observations T(x, 1), T(x, 2), ..., has expected value h(x). The vector
h = (h(x) : x ∈ S) is unknown. Perform simulated annealing with acceptance
probabilities formed from some estimate H_n = (H(x, n_x) : x ∈ S), where
each component H(x, n_x) is based on the history of observed values of T(x, j),
j = 1, ..., n_x, and n_x is the number of times x has been observed up to time n.
Assume that there are only a finite number of possible values for
h(x) at any state x (for example, integer multiples of 0.001 between 0 and 1).
In effect, we accept finite precision in our knowledge of h(x). We say that
the estimator H(x, ·) is asymptotically stable if there exists a random time
N(x), almost surely finite, such that H(x, n) = h(x) for all n ≥ N(x). We
show later that such estimators exist.

In our setting, the x-th component of H_n takes account only of the
x-th components of the observations T(x, ·). The observations of the other
components and the current observation number are ignored. For every component,
the observations are thus iid. Since there are only a finite number of
states, we can consider asymptotic stability of each component separately. To
initialize, for each component we might set H(x, 0) = ∞.
If H(x, ·) is asymptotically stable for each state x, and if every state
is visited infinitely often, then from some finite time N onwards, the vector H
of H-estimates coincides with the vector h of true h-values. In this setting,
the standard algorithm is simulated annealing using h in the computation of
the move probabilities. The modification uses the estimates H_n. To make this
more precise, in both chains the original state space S is enlarged to include
enough historical information to compute H at time n. Only the moves in the
second chain take account of that additional information. When h is unknown,
the first chain is hypothetical.
Theorem 3.1 implies that if the first chain converges in probability to
So with h, then so does the second chain.
This setup generalizes easily to any situation where the objective function
h(x) depends on a finite-dimensional vector θ of unknown parameters, and
θ takes on only a finite number v of values. The idea is to exploit structure in
the state space so that v is much smaller than |S|, although that is not necessary
for our proofs. As above, we can interpret this as knowing the elements
of θ to finite precision.
4.2 Consistent and Stable Estimators
Generalizing the previous section, we want to run Monte Carlo simu-
lation in an optimal way according to some criterion. The choice of simulation
strategy depends on a parameter with an optimal value h. We do not know h.
However, at step n of the simulation we make an observation T(n),
and then use the observations T(l),..., T(n) to compute an estimate H(n) of
h. At time n, the simulation strategy depends on H(n). Recall the definition of
asymptotic stability in the previous section: there is a random time N, almost
surely finite, such that H(n) = h for all n ≥ N. Generally, to find such an H
requires that we know a finite set F containing h, and that the observations
fall in some set G, generally a superset of F. Thus, H(n) maps the n-fold
Cartesian product of G to F. We give below two examples of asymptotically
stable estimators.
For simplicity, assume that h and T(j) are scalars and, by rescaling,
that F is a set of consecutive integers. The set G is either a subset of the
real numbers or a subset of the integer multiples of 1/d, where d is an integer
greater than 1. We can view the latter as a finite-precision approximation to
the former. Let ⟨x⟩ denote x rounded to the nearest integer, rounding upward
in case of ties (i.e., ⟨x⟩ = ⌊x + (1/2)⌋). Set t = E(T(j)) and let h = ⟨t⟩.
An example shows what can go wrong. Let the sequence of sample
means be T̄(n) = n⁻¹ Σ_{j=1}^n T(j). The estimator H(n) = ⟨T̄(n)⟩ is not asymptotically
stable when t is a half-integer; for large n, it oscillates unpredictably
between ⌊t⌋ and ⌈t⌉. If t is near a half-integer, then damped oscillation occurs;
although there is eventual stability, arguably the rate of damping is very slow.
This is the practical reason to consider the estimators in Sections 4.3 and 4.4.
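A seeded simulation makes the instability visible: with t = 1/2 and symmetric errors, the rounded running mean keeps flipping between 0 and 1 (the flip count depends on the seed, so we only print it):

```python
import math
import random

rng = random.Random(0)
t = 0.5                                  # exactly a half-integer
mean, flips, prev = 0.0, 0, None
for n in range(1, 100001):
    obs = t + rng.uniform(-1.0, 1.0)     # symmetric zero-mean error
    mean += (obs - mean) / n             # running sample mean T-bar(n)
    est = math.floor(mean + 0.5)         # <T-bar(n)>: round half up
    if prev is not None and est != prev:
        flips += 1
    prev = est
print("flips of the rounded estimate:", flips, "final estimate:", est)
```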
Suppose we try to correct the situation by observing the entire sequence
of sample means T̄(1), ..., T̄(n) (computed to double precision, perhaps),
counting how many are above or below the half-integer line, and rounding
up or down accordingly. This actually makes matters worse. For large n,
the sequence of sample means closely approximates Brownian motion, and the
proportion of the T̄(j) which lie above t has an arc sine distribution (see Feller
[13], pp. 397-99) when t is exactly a half-integer. When t is approximately
a half-integer, this proportion forms an approximate arc sine distribution with
large probabilities of being near 0 or 1. The refined estimator still oscillates,
but the intervals between changes grow longer and longer (in fact the growth is
exponential). To the careless observer, it may falsely appear to be stabilizing.
It is important to be clear about the difference between an estimator
which is asymptotically stable and one which is merely consistent. If we had
wanted, say, a static point estimator, restricted to the integers, of a parameter
with true value 1.5, then we would be indifferent between an estimate of 1.0
and 2.0. Asymptotic stability is needed when we are interested, not in the
parameter per se, but in a simulation process dependent on an estimate of the
parameter.
4.3 Asymptotically Stable Estimators: First Example
Let M_k(n) be the number of observations equal to k up to step n;
that is,

M_k(n) = Σ_{j=1}^n I(T(j) = k).

Each observation T(n) is formed by adding a random error to t, and then
rounding to the nearest integer:

T(n) = ⟨t + ξ_n⟩.
Suppose that the errors ξ_n are iid. Then the T(n) have a common discrete
distribution function with mass points ..., p_{−1}, p_0, p_1, ... on the integers. Suppose
also that the distribution of the ξ_n is symmetric about zero and has a
density with bounded support, strictly decreasing on the positive numbers. Then
when t is not a half-integer the set {p_k > 0 : k ∈ Z} is finite and has p_h = p_{⟨t⟩}
for its unique largest element. In the exceptional case where t is a half-integer,
there will be maxima at both p_{⌊t⌋} and p_{⌈t⌉}. For large n, we expect M_k(n)/n to
approximately equal p_k, at least for those k not in the tail of the distribution.
These remarks form the intuitive basis for our estimator.
Let f(n) be a function on the positive integers which is non-decreasing
in n and which satisfies both

f(n) = o(n) and log(n) = o(f(n)).   (4.1)

(For example, f(n) = √n.) Below, break ties for the second-largest value
arbitrarily.
ALGORITHM STABLE-1(x, t)
    For each step n:
        Make an observation T(n);
        Set k ← T(n);
        Set M_k(n) ← M_k(n) + 1;
        If the set N = {M_k(n) : k ∈ F} has a single maximum at m,
            and arg max N \ {m} = m + 1,
            and |M_m(n) − M_{m+1}(n)| < f(n), then
                Set H(n) ← m + 1.
        Else if the set {M_k(n) : k ∈ F} has a single maximum at m,
            and arg max N \ {m} = m − 1,
            and |M_m(n) − M_{m−1}(n)| < f(n), then
                Set H(n) ← m.
        Else
            Set H(n) ← arg max{M_k(n) : k ∈ F}.
    End
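A compact Python rendering of STABLE-1; the error distribution in the demonstration (a symmetric triangular density, which is strictly decreasing on the positive half-line) is our illustrative choice:

```python
import math
import random

def stable1(observations, F, f):
    """Yield the STABLE-1 estimate H(n) after each observation.
    observations: integer-valued T(n); F: candidate values; f: as in (4.1)."""
    M = {k: 0 for k in F}
    for n, obs in enumerate(observations, start=1):
        M[obs] += 1
        ranked = sorted(F, key=lambda k: M[k], reverse=True)
        m, runner = ranked[0], ranked[1]
        unique_max = M[m] > M[runner]
        if unique_max and runner == m + 1 and M[m] - M[runner] < f(n):
            yield m + 1          # counts at m, m+1 too close to separate
        elif unique_max and runner == m - 1 and M[m] - M[runner] < f(n):
            yield m              # counts at m-1, m too close: round up to m
        else:
            yield m              # ordinary arg max of the counts

# Demonstration: t = 0.3, so h = <t> = 0; errors are triangular on (-1, 1).
rng = random.Random(3)
obs = (math.floor(0.3 + rng.triangular(-1.0, 1.0, 0.0) + 0.5)
       for _ in range(5000))
for H in stable1(obs, range(-2, 3), math.sqrt):
    pass
print("final estimate:", H)
```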
The following theorem shows that H(n) has the desired properties.

Theorem 4.1 Assume that f satisfies (4.1) and that the error terms
ξ_n satisfy the conditions stated above. Then, almost surely, the estimator H(n)
of algorithm STABLE-1 exactly equals h for all but finitely many n.
Proof: Suppose first that t is not a half-integer. Assume that m <
t < m + (1/2); the other case is symmetric. By the conditions imposed on the
distribution of the ξ_n, the set {p_k : k ∈ G} has its unique largest value at m and its
second largest value at m + 1. The strong law of large numbers and |G| < ∞
imply that, a.s. for large n, the set {M_k(n) : k ∈ G} attains its unique largest
value at m and its unique second largest value at m + 1. For arbitrary
δ > 0, another application of the strong law shows that for all large n, on
almost all sample paths,

|M_m(n) − M_{m+1}(n)| ≥ n(p_m − p_{m+1} − δ),

and the expression on the right is bounded below by f(n) for all large n. This
completes the proof if t is not a half-integer.
Now suppose that t = m + (1/2). Then p_m = p_{m+1} and these two
are the only maxima in the set {p_k : k ∈ Z}. Using the strong law as in the
previous paragraph, the two largest elements of the set {M_k(n) : k ∈ Z} are
M_m(n) and M_{m+1}(n) a.s. for all large n. Let j(0) = 0 and

j(i + 1) = min{n > j(i) : T(n) = m or T(n) = m + 1},

i.e., {j(1), j(2), ...} is the sequence of hitting times to the set {m, m + 1}.
We claim that

|M_m(j(i)) − M_{m+1}(j(i))| < f(j(i))

a.s. for all but finitely many i. Consider the Bernoulli process {T(j(i)) : i ∈
Z⁺}. It is well known, and a simple consequence of the Borel-Cantelli lemmas
(see, for example, Billingsley [3], pp. 53-54), that there exists a positive
constant K such that the events

{T(j(i)) = T(j(i+1)) = ··· = T(j(i + K ln i)) = m}

and

{T(j(i)) = T(j(i+1)) = ··· = T(j(i + K ln i)) = m + 1}

each occur for only finitely many values of i. Let i_0 be the largest i for which
either of these events occurs.

Since zero is a recurrent state in the random walk on the space of
possible values of M_m − M_{m+1}, M_m(j(i)) = M_{m+1}(j(i)) for arbitrarily large
values of i. Let i_1 be the smallest integer larger than i_0 for which M_m(j(i_1)) =
M_{m+1}(j(i_1)). Then for all i ≥ i_1,

|M_m(j(i)) − M_{m+1}(j(i))| ≤ K ln i ≤ K ln j(i) < f(j(i)),   (4.2)

where the last inequality is true for all large enough i.
We finish by showing that (4.2) is still true when j(i) is replaced by
an arbitrary time n. Observing that j(·) is a non-decreasing function on the
positive integers, we define an inverse. If n < j(1), set j*(n) = 0; otherwise,
let j*(n) be the largest integer i such that j(i) ≤ n. The function j*(·) has the
following properties:

j*(n) ≤ n;
M_m(j(j*(n))) = M_m(n);
M_{m+1}(j(j*(n))) = M_{m+1}(n).

Thus, a.s. for all large n,

|M_{m+1}(n) − M_m(n)| = |M_{m+1}(j(j*(n))) − M_m(j(j*(n)))|
    ≤ K ln(j*(n))
    ≤ K ln n
    < f(n).
4.4 Asymptotically Stable Estimators: Second Example
We drop previous assumptions about the error terms and instead
assume that they are iid with a zero mean and finite variance σ². We compute
the sample means T̄(n) to a higher precision d than needed for our final
estimator H(n).

Let g(n) be a function which satisfies

lim_{n→∞} √(2(σ²/n) ln ln(σ²n)) / g(n) ≤ 1 and lim_{n→∞} g(n) = 0,

almost surely. For example, we could take g(n) = n^(−1/2) √(2s² ln ln(s²n)), where s²
is the sample variance (this requires that the ξ_n have a finite fourth moment).
More simply, we could take g(n) = √(ln n / n).
ALGORITHM STABLE-2(x, t)
    For each step n:
        Make an observation T(n)
        Compute the sample mean T̄(n)
        Set J(n) ← ⌊T̄(n)⌋ + (1/2)
        If |T̄(n) − J(n)| < g(n), then
            Set H(n) ← ⌈T̄(n)⌉
        Else
            Set H(n) ← ⟨T̄(n)⟩
    End
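One step of STABLE-2 is a pure function of the sample mean, the step count, and g. A sketch, using g(n) = √(ln n / n) as suggested above:

```python
import math

def stable2_estimate(mean, n, g):
    """STABLE-2's estimate H(n) from the sample mean T-bar(n) = mean."""
    J = math.floor(mean) + 0.5            # nearest half-integer from below
    if abs(mean - J) < g(n):
        return math.ceil(mean)            # near a half-integer: round up
    return math.floor(mean + 0.5)         # otherwise ordinary rounding <x>

g = lambda n: math.sqrt(math.log(n) / n)
print(stable2_estimate(0.501, 10000, g))  # near 1/2: rounds up to 1
print(stable2_estimate(0.2, 10000, g))    # clearly below 1/2: rounds to 0
```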
By the law of the iterated logarithm,

|T̄(n) − t| ≤ φ(n)√(2/n) ≤ g(n)   (4.3)

for all but finitely many n, where φ(n) = √(σ² ln ln(nσ²)). When t = m + 1/2
for some integer m, then J(n) = t for all large n, so that STABLE-2 is stable
by (4.3). When t is not a half-integer,

2g(n) ≤ |J(n) − t|

for all large n; combining this with (4.3),

|J(n) − T̄(n)| ≥ |J(n) − t| − |T̄(n) − t|
    ≥ 2g(n) − g(n)
    = g(n)

for all large enough n, almost surely. We have proved

Theorem 4.2 Algorithm STABLE-2 is asymptotically stable and consistent.
CHAPTER 5
PATHWISE CONVERGENCE WITH ADAPTIVE COOLING
Up to this point, we have measured success of an algorithm by convergence
in probability to the set of optimal states S0. This standard may
seem to say more than it does. Since there is no strong law of large numbers
for the general nonstationary Markov chain, there is a priori no guarantee that
optimal states will dominate others in any sample run of the algorithm. In this
chapter, we study pathwise convergence directly and show that under certain
circumstances the proportion of optimal states visited does indeed go to one,
almost surely. (However, in general the sequence of states visited has no almost
sure limit; see, for example, [2].) We prove that this is true for the algorithms
SLOW and QUICKER under Hajek's conditions, and, under additional conditions
different from those in previous chapters, for algorithm QUICKER-1. An
example shows that the additional conditions cannot in general be discarded.

It is not clear that the pathwise-convergence results of this chapter
either imply or are implied by convergence in probability; see, however, the
discussion at the end of Section 5.1.
Our work was stimulated by that of Borkar [4], who applies the idea
of recurrence orders first developed by Connors and Kumar [11] to obtain a
sample-path result similar to ours on a deterministic cooling schedule satisfying

Σ_t ε(t)^c = ∞

when c is not less than a problem-specific constant d possibly different from,
but no smaller than, Hajek's d*. We slightly improve Borkar's result, showing
that it holds for c > d*.
Next, we exhibit conditions under which pathwise convergence occurs
with an adaptive schedule (i.e., with QUICKER-1), and then present an example
of a state space which fails to meet those conditions and in which pathwise
convergence fails.
5.1 Main Results
Theorem 5.1 In simulated annealing with the deterministic schedule
ε(t) = t^(−1/c), we have

lim_{n→∞} (1/n) Σ_{t=1}^n I(X_t ∈ S0) = 1 a.s.,   (5.1)

for any c > d*.
Remark Borkar [4] proves the same result, except that in place of
d* he uses the constant

d = max_{x∈S} d(x),

where d(·) is the depth function of Section 1.1. Since d* is the same maximum
over the smaller set S \ S0, the two constants are related by d* ≤ d, and our
result is a slight improvement over Borkar's.
Now we describe what happens for algorithm QUICKER-1. Recall
from Section 2.1 that d is the maximum over x ∈ S of δ(x), where

δ(x) = min{[U(y) − U(x)]⁺ : y ∈ N(x)}.

Let H = min{U(x) : x ∉ S0} be the smallest objective-function value outside
the set of global optimizers. Let T(k) be the epoch of the k-th state change;
i.e., T(1) = 1 and

T(k + 1) = min{t > T(k) : X_t ≠ X_{T(k)}}.   (5.2)

Finally, let t̃ be the epoch of entry to the state X_t; i.e., t̃ = T(k), where k is
the unique integer such that T(k) ≤ t < T(k + 1).
Theorem 5.2 In simulated annealing with the adaptive schedule
ε(t) = t̃^(−1/c), (5.1) holds for any c > d*, provided d < H.

We show in Section 5.3 that the condition d < H cannot in general be omitted.
Remark Hajek [25] shows how to modify (S, U, R), without changing
its convergence properties, by inserting intermediate states so that all jumps
are of a desired small size. He calls this the "continuous increase" property.
In principle, we could use this construction to force d to be smaller than H.
However, this expansion of the state space tends to degrade the performance
of an adaptive algorithm, since it results in more frequent state changes and
thus lower temperatures.
The results described in Theorems 2.1 and 2.2 of Section 1.2 imply
an expected-value version of Theorem 5.1, since if Pr{X_n ∈ S0} → 1, then

lim_{n→∞} E((1/n) Σ_{t=1}^n I{X_t ∈ S0}) = lim_{n→∞} (1/n) Σ_{t=1}^n Pr{X_t ∈ S0} = 1.   (5.3)

For deterministic cooling schedules, Connors and Kumar [11, 12] prove a result
like (5.3) directly. (Their result involves a lim sup rather than a limit, but with
cooling schedules satisfying (1.2), it is easy to generalize their proof.)
5.2 Sample-Path Convergence for Deterministic Cooling
Following Borkar [4], we associate with each state x a random quantity
β(x), the (pathwise) recurrence order of state x, defined as follows:

β(x) = −∞ if Σ_t I[X_t = x] < ∞;
β(x) = p⁻ if p = sup{c > 0 : Σ_t ε(t)^c I[X_t = x] = ∞}
           and Σ_t ε(t)^p I[X_t = x] < ∞;
β(x) = p  if p = max{c > 0 : Σ_t ε(t)^c I[X_t = x] = ∞}.

Similarly, we can define the pathwise recurrence order β(x, y) of a transition
from state x to state y:

β(x, y) = −∞ if Σ_t I[X_t = x, X_{t+1} = y] < ∞;
β(x, y) = p⁻ if p = sup{c > 0 : Σ_t ε(t)^c I[X_t = x, X_{t+1} = y] = ∞}
           and Σ_t ε(t)^p I[X_t = x, X_{t+1} = y] < ∞;
β(x, y) = p  if p = max{c > 0 : Σ_t ε(t)^c I[X_t = x, X_{t+1} = y] = ∞}.
The recurrence order measures how often a state is visited, or a transition
occurs, during a sample run. A state which is transient (in the ordinary sense) has
recurrence order −∞. Connors and Kumar ([11], [12]) work with the expected
values of pathwise recurrence orders, quantities which we shall call mean
recurrence orders. By replacing recurrence orders by mean recurrence orders in
what follows, we get a result like (5.3) for QUICKER-1.
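For intuition, the defining sums can be examined numerically. The sketch below is our own illustration, not part of the formal development; it assumes a state visited at every step under the schedule e(t) = t^{−1/c}, so that the partial sums of e(t)^b stall for b > c and keep growing for b = c, which is exactly the dichotomy that β(x) captures.

```python
def partial_sum(b, c, T):
    # Partial sum of e(t)^b for t = 1..T, with e(t) = t^(-1/c) and the
    # indicator I[X_t = x] identically 1 (the state is visited every step).
    return sum(t ** (-b / c) for t in range(1, T + 1))

# With c = 1: the exponent b = c gives the harmonic series (divergent),
# while b = 2c gives a convergent series.  Compare tail growth:
tail_at_c = partial_sum(1.0, 1.0, 100_000) - partial_sum(1.0, 1.0, 10_000)
tail_above_c = partial_sum(2.0, 1.0, 100_000) - partial_sum(2.0, 1.0, 10_000)
```

The first tail keeps growing like log T, while the second is already negligible; the recurrence order of this always-visited state is therefore c.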
The following remarkable theorem shows that, to a certain extent,
recurrence orders constitute a potential for objective-function values.
Theorem 5.3 Assume that S is finite and that weak reversibility
holds. If β(x) > 0 and there is a path x = x0, x1, ..., xp = y such that
max_i U(x_i) < U(x) + β(x), then β(y) − β(x) = U(x) − U(y), almost surely.
Connors and Kumar prove Theorem 5.3 in [11] under the assumption
of neighborhood symmetry and in [12] under the assumption of weak reversibil-
ity. Their result is stated in terms of mean recurrence orders, but the proof is
algebraic, not probabilistic, and carries over without difficulty to sample-path
recurrence orders.
The next three results are adapted from Borkar [4]. We sketch their
proofs. The final theorem is slightly stronger than the corresponding result in
[4], because we use d* rather than d.
Lemma 5.4 In simulated annealing with a deterministic cooling
schedule satisfying C1 t^{−1/c} ≤ e(t) ≤ C2 t^{−1/c} for C1 > 0, C2 > 0, c > 0, we
have max_{x∈S} β(x) = c, a.s.

Proof: For b > c, Σ_t e(t)^b ≤ C2^b Σ_t t^{−b/c} < ∞; thus max_x β(x) ≤ c.
In addition,

    Σ_{x∈S} Σ_t e(t)^c I{X_t = x} = Σ_t e(t)^c ≥ C1^c Σ_t t^{−1} = ∞;

therefore Σ_t e(t)^c I{X_t = x} = ∞ for some x on each sample path, and so
max_x β(x) ≥ c, a.s.
Lemma 5.5 If max_x β(x) > d*, then the set {x : β(x) = max_y β(y)}
is a subset of S0, almost surely.

Proof: Let x be a state such that β(x) = β_max. Suppose x ∉ S0.
By the definition of d*, there is a path x = x0, x1, ..., xp = y from x to some
state y ∈ S0 such that max_i U(x_i) ≤ U(x) + d* < U(x) + β(x). Applying
Theorem 5.3, 0 < U(x) − U(y) = β(y) − β(x) a.s., which contradicts the choice
of x. Thus, except on a set of probability 0, x ∈ S0.

Proof of Theorem 5.1 From Lemma 5.4, max_x β(x) = c > d*. But
from Lemma 5.5, the only states x satisfying β(x) = c are in S0. Thus

    Σ_t t^{−1} I(X_t ∉ S0) ≤ C1^{−c} Σ_t e(t)^c I(X_t ∉ S0) < ∞  a.s.,

from which (5.1) follows by Kronecker's lemma (see, for example, Chung [10],
pp. 123-24).
5.3 Sample-Path Convergence for Adaptive Cooling

In this section we prove Theorem 5.2. Assume that e(t) = t^{−1/c},
c > d*, and that d < H. To avoid repetition, we adopt the convention that all
the statements in this proof are true in an almost-sure sense. The following
assertion is proved below:

    c ≤ max_x β(x) ≤ c + d.    (5.4)

Statement (5.4) and d < H together imply that max_x β(x) − H < c, so that
Theorem 5.3 implies that β(x) < c for any state x not in S0. Thus

    Σ_t t^{−1} I(X_t ∉ S0) ≤ Σ_t e(t)^c I(X_t ∉ S0) < ∞,

and (5.1) follows from Kronecker's lemma as before.

To prove (5.4), observe first that

    Σ_t e(t)^c ≥ Σ_t t^{−1} = ∞,

so that c ≤ max_x β(x). Pick b > c + d. We need to show that

    Σ_t e(t)^b < ∞.    (5.5)
Under QUICKER-1, the sojourn time in each state has a geometric distribution
and so is a.s. finite; thus, there are infinitely many state changes. Recalling
that T(k) is the transition number of the k-th state change, we rewrite (5.5)
as

    Σ_t e(t)^b = Σ_k (T(k + 1) − T(k)) T(k)^{−b/c} < ∞;    (5.6)

then choose ξ > 0 such that b > (c + d)(1 + ξ). We claim that

    A_k = (T(k + 1) − T(k)) T(k)^{−b/c} ≤ k^{−(1+ξ)}    (5.7)

for all but finitely many k, almost surely. This is enough to establish (5.6) and
thus (5.4).
Fix k, and say that X_{T(k)} = x. Let N*(x) be the (nonempty) set of
neighbors of x at height δ(x) above x, i.e.,

    N*(x) = {y ∈ N(x) : U(y) = U(x) + δ(x)}.    (5.8)

The sojourn time in x ends at or before a move to N*(x) is proposed and
accepted. Thus, it is stochastically bounded by a geometric variate with
parameter

    p T(k)^{−δ(x)/c} ≥ p T(k)^{−d/c},

where

    p = Σ_{y∈N*(x)} R(x, y) > 0

is the probability of proposing a move to N*(x).
Since T(k) ≥ k, we have

    (log T(k))/c − (log k)/(c + d) ≥ (1/c − 1/(c + d)) log k = (d/(c(c + d))) log k > 0

for k ≥ 3; exponentiating gives

    T(k)^{1/c} k^{−1/(c+d)} ≥ e^{d/(c(c+d))} > 1.
This implies that

    T(k)^{b/c} k^{−(1+ξ)} ≥ T(k)^{b/c} k^{−b/(c+d)} ≥ e^{bd/(c(c+d))} > 1,

so that there is a constant η > 0 such that for all large k,

    ⌊T(k)^{b/c} k^{−(1+ξ)}⌋ ≥ η T(k)^{b/c} k^{−(1+ξ)}.    (5.9)
Thus,

    Pr(A_k > k^{−(1+ξ)} | T(k)) = Pr(T(k + 1) − T(k) > k^{−(1+ξ)} T(k)^{b/c} | T(k))
                                ≤ (1 − p T(k)^{−d/c})^{⌊k^{−(1+ξ)} T(k)^{b/c}⌋}.

Since (1 − a)^r ≤ e^{−ar} for any real a ∈ [0, 1) and positive real r, this implies

    Pr(A_k > k^{−(1+ξ)} | T(k)) ≤ exp(−p ⌊k^{−(1+ξ)} T(k)^{b/c}⌋ T(k)^{−d/c})
                                ≤ exp(−pη k^{−(1+ξ)} T(k)^{(b−d)/c}).

Then the trivial bound T(k) ≥ k implies

    Pr(A_k > k^{−(1+ξ)} | T(k)) ≤ exp(−ηp k^{(b−d)/c − (1+ξ)}).    (5.10)

The choice of ξ implies that the exponent of k on the right is strictly positive.
Now for any a > 0,

    Σ_{j=1}^{∞} exp(−j^a) ≤ ∫_0^{∞} exp(−x^a) dx = (1/a) Γ(1/a) < ∞.    (5.11)

By the Borel-Cantelli lemma for adapted sequences (see Loève [36], pp. 398-99),
the event A_k > k^{−(1+ξ)} occurs only finitely often, almost surely. This
proves (5.6) and finishes the proof of Theorem 5.2.
5.4 An Example Where Pathwise Convergence Fails

Consider the state space S = {x, y, z}, with U(x) = 1, U(y) = 2,
U(z) = 0, and tentative-move matrix

    R = ( 0    1     0   )
        ( r    0   1 − r )
        ( 0    1     0   ),

where r > 0 and the rows and columns are indexed in the order x, y, z. Thus,
d* = 1, d = 2, and H = 1. We show that pathwise Cesàro convergence fails for
the adaptive schedule e(t) = t^{−1/d*} = t^{−1}. In fact, given any a < 1, there
are infinitely many transition numbers t at which

    t^{−1} Σ_{s=1}^{t} I(X_s = x) > a, a.s.;

that is, the total proportion of visits to the strict local minimum x gets
arbitrarily close to 1, infinitely often.
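The oscillation can be watched in a short simulation. The sketch below is our own illustration: moves x→y, y→x, y→z, and z→y are proposed with probabilities 1, r, 1 − r, and 1 respectively, an uphill move of height Δ is accepted with probability e^Δ, and the temperature e is frozen until the state changes and then reset to (current transition number)^{−1}.

```python
import random

def visit_fractions(steps, r=0.5, seed=1):
    U = {'x': 1.0, 'y': 2.0, 'z': 0.0}
    # Tentative-move probabilities out of each state.
    R = {'x': [('y', 1.0)], 'y': [('x', r), ('z', 1.0 - r)], 'z': [('y', 1.0)]}
    rng = random.Random(seed)
    state, eps = 'y', 1.0
    visits = {'x': 0, 'y': 0, 'z': 0}
    for t in range(1, steps + 1):
        visits[state] += 1
        # Propose a neighbor, then accept with probability eps^([dU]+).
        u, cum = rng.random(), 0.0
        for nxt, p in R[state]:
            cum += p
            if u < cum:
                if rng.random() < eps ** max(U[nxt] - U[state], 0.0):
                    state = nxt
                    eps = 1.0 / t     # adaptive: update only on a state change
                break
    return {s: v / steps for s, v in visits.items()}
```

Running this for increasing horizons shows the empirical fraction of time at x (and at z) swinging back and forth rather than settling, as the argument below makes precise.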
Assume for simplicity that X_1 = y. Since we are using an adaptive
schedule, there are infinitely many state changes, almost surely. In particular,
there are infinitely many visits to state y. Let S(j) denote the transition
number immediately following the j-th visit to y; that is, S(1) = 2 and
S(k + 1) = 1 + min{t ≥ S(k) : X_t = y}. Also, let M(t) denote the remaining
lifetime, at step t, of the current visit to state x; that is,
M(t) = min{s > 0 : X_{t+s} ≠ x} if X_t = x, and M(t) = 0 if X_t ≠ x.

Since every proposed move out of state y is accepted, we have

    Pr(X_{S(j)} = x) = r

for all j. Let B_j be the event

    {X_{S(j)} = x,  M(S(j)) ≥ ⌊−ln 2 / ln(1 − S(j)^{−1})⌋}.
53
Then

    Pr(B_j) = r (1 − S(j)^{−1})^{⌊−ln 2 / ln(1 − S(j)^{−1})⌋}
            = r exp( ⌊−ln 2 / ln(1 − S(j)^{−1})⌋ · ln(1 − S(j)^{−1}) )
            ≥ r e^{−ln 2} = r/2.    (5.12)
The events B_j are not independent. However, B_1, ..., B_{j−1} influence B_j only
through S(j); since (5.12) is true regardless of the value of S(j), we have

    Pr(B_j | A) ≥ r/2 > 0

for any A ∈ σ(B_1, ..., B_{j−1}). From Lemma 5.6 (see below) it now follows that
given any integer l, sequences of l consecutive events B_j, B_{j+1}, ..., B_{j+l−1} occur
infinitely often. Since for any large integer k,

    ⌊−ln 2 / ln(1 − k^{−1})⌋ ≥ ⌊(k − 1) ln 2⌋ ≥ k/2,

it follows that if B_j, ..., B_{j+2l−1} all occur, then at time S(j + 2l) − 1, there
will have occurred at least l·S(j) visits to state x and at most S(j) visits to
state z.
Remark A similar argument shows that (a.s.) the total proportion
of visits to the global minimum z also gets arbitrarily close to one infinitely
often. Thus, the Cesàro sums t^{−1} Σ_s I(X_s = x) and t^{−1} Σ_s I(X_s = z) oscillate
forever without converging.
Lemma 5.6 Let I_1, I_2, ... be a sequence of zero-one Bernoulli random
variables adapted to an increasing sequence F_1, F_2, ... of σ-fields. Suppose
that for all i,

    inf_{A∈F_i} E(I_{i+1} | A) ≥ a,

for some positive constant a not depending on i. Then for any positive integer
l, the sequence (I_j) contains infinitely many strings of ones of length l.
Proof: Let (I_j : j = 1, 2, ...), (F_j : j = 1, 2, ...), and a be as in the statement
of the lemma. We need to show that for any given integer l > 0, there is
an infinite subsequence of indices i(1) < i(2) < ... such that for each j,
I_{i(j)+1} = I_{i(j)+2} = ... = I_{i(j)+l} = 1. The proof will show that we can take the
i(j) to be multiples of l.

Let I_{j,m} = Π_{k=1}^{m} I_{j+k}. We establish by induction on m that

    E(I_{j,m} | A) ≥ a^m    (5.13)

for A ∈ F_j. When m = 1, statement (5.13) is true by hypothesis. Assume
(5.13). Then

    E(I_{j,m+1} | A) = E(I_{j+1} I_{j+2} ⋯ I_{j+m+1} | A)
                     = E(I_{j+2} ⋯ I_{j+m+1} | I_{j+1} = 1, A) E(I_{j+1} | A)
                     = E(I_{j+1,m} | I_{j+1} = 1, A) E(I_{j+1} | A)
                     ≥ a^m · a,

since {I_{j+1} = 1} ∩ A ∈ F_{j+1}.
Now let J_{jm} = I_{jm,m} = I_{jm+1} I_{jm+2} ⋯ I_{jm+m}. If j < k, then the event
{J_{jm} = 0} is in F_{(j+1)m} ⊆ F_{km}, so that (5.13) implies that for any k > 0 and
n > k,

    Pr( ∩_{j=k}^{n} {J_{jm} = 0} )
        = Pr(J_{km} = 0) × Pr(J_{k+1,m} = 0 | J_{km} = 0) × ⋯
          × Pr(J_{nm} = 0 | J_{km} = J_{k+1,m} = ⋯ = J_{n−1,m} = 0)
        ≤ (1 − a^m)^{n−k}.

Letting n → ∞ and then taking the union over all positive k shows that

    Pr( J_{jm} = 0 for all but finitely many j ) = 0.

Taking m = l gives infinitely many strings of l consecutive ones, which proves
the lemma.
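Lemma 5.6 is a statement about runs of ones in a 0-1 sequence; when experimenting with such sequences, a run-length scan is the natural diagnostic. A small sketch (our own helper, not part of the proof):

```python
def longest_run_of_ones(bits):
    # Length of the longest consecutive block of ones in a 0-1 sequence.
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b == 1 else 0
        best = max(best, cur)
    return best
```

Under the hypothesis of the lemma, `longest_run_of_ones` applied to ever-longer prefixes of (I_j) grows without bound, almost surely.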
CHAPTER 6
CONVERGENCE OF THE PRUNED CHAIN
It might be thought that the convergence results of [7], [11], [25],
and [37], as well as those in previous chapters of this thesis, are an artifact,
resulting from the algorithm spending long periods of time in optimal states.
In this chapter we show that convergence occurs even if we do not take into
account the length of time that the chain occupies a state, i.e., if we consider
only the embedded chain of distinct states.
As in the previous chapter (see (5.2) on page 47), let the sequence
{T(k) : k = 1, 2, ...} denote the epochs of transitions between distinct states.
Then (X_{T(k)} : k = 1, 2, ...) is the induced chain of pairwise distinct states
visited. Let S1 be the set of global minimizers of the function U + δ. At the end
of this chapter we describe the set S1 for some typical problems. It often is a
superset of S0 and usually its component states have energy levels very close
to the minimum. If there are no isolated local minima, as in the setting of [14],
then S1 = S0.
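Extracting the embedded chain from a sample path is a one-pass operation; the following sketch (our own notation) keeps X_{T(1)}, X_{T(2)}, ... by dropping every self-repeat:

```python
def pruned_chain(path):
    # Collapse consecutive repeats: the result lists the pairwise-distinct
    # states visited, i.e. the embedded chain X_{T(1)}, X_{T(2)}, ...
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return out
```

The convergence results of this chapter concern the law and the Cesàro averages of this pruned sequence, with all sojourn-time information discarded.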
Theorem 6.1 (Convergence in Probability) If (S, U, R) is weakly
reversible and if the cooling schedule satisfies the conditions in [25], then

    lim_{k→∞} Pr(X_{T(k)} ∈ S1) = 1.

Theorem 6.2 (Pathwise Convergence) In simulated annealing using
the adaptive schedule e(t) = t^{−1/c} for c > d* + d,

    lim_{n→∞} n^{−1} Σ_{k=1}^{n} I(X_{T(k)} ∈ S1) = 1,  a.s.
6.1 Convergence in Probability

Our procedure in both Theorems 6.1 and 6.2 is to show that the
dynamics of the pruned chain are the same as those of ordinary simulated
annealing on a triple (S, U + δ, R̂), where R̂(x, y) > 0 iff R(x, y) > 0. The first
task is to show that the modified problem satisfies WR.

Lemma 6.3 (S, U + δ, R̂) has property WR if (S, U, R) does.

Proof: The change from R to R̂ makes no difference, since any path
under R̂ is also a path under R and vice versa. Suppose that state y is reachable
from state x at (U + δ)-height h. This means in particular that U(x) + δ(x) ≤ h.
Since the function δ(·) is nonnegative, it also means that y is reachable from x at
U-height h. Since (S, U, R) satisfies WR, there exists a path y = y0, ..., yp = x
from y to x at U-height not exceeding h. We claim that for each i,
(U + δ)(yi) ≤ h. At the beginning of the proof we remarked that this is true for
the state x. Therefore, assume that 0 ≤ i ≤ p − 1; i.e., yi is not the last state
on the path. If U(yi) ≥ U(yi+1), then δ(yi) = 0, so (U + δ)(yi) = U(yi) ≤ h,
and if U(yi) < U(yi+1), then U(yi) + δ(yi) ≤ U(yi+1) ≤ h. The claim is proved.
We have exhibited a path from y to x of (U + δ)-height not exceeding h, which
proves the lemma.
Lemma 6.4 If (S, U, R) is weakly reversible, then

    [(U + δ)(y) − (U + δ)(x)]+ = [U(y) − U(x) − δ(x)]+

for all states x, y with y ∈ N(x).

Proof: If U(y) ≥ U(x), then there must be a path y = y0, y1, ..., yp =
x leading from y to x at height not exceeding U(y). In particular, y has a
neighbor not strictly uphill from it, namely y1, so that δ(y) = 0, and the
conclusion of the lemma holds in this case. Suppose on the other hand that
U(y) < U(x). Then there must be a path y = y0, y1, ..., yp = x leading
from y to x at a height not exceeding U(x). In particular, y1 ∈ N(y) and
U(y1) ≤ U(x), so that δ(y) ≤ U(x) − U(y). Together with δ(x) ≥ 0, this
implies

    (U + δ)(y) − (U + δ)(x) ≤ U(y) + δ(y) − U(x) ≤ 0,

so that the conclusion of the lemma also holds in this case.
Proof of Theorem 6.1: It is sufficient to demonstrate that the
condition (A.2) of [25] is satisfied; that is, that there exist constants c1, c2 such
that

    c1 e(t)^{[(U+δ)(y)−(U+δ)(x)]+} ≤ Q(x, y, t, e) ≤ c2 e(t)^{[(U+δ)(y)−(U+δ)(x)]+},    (6.1)

where Q(x, y, t, e) denotes the transition probability, conditioned on
acceptance, from state x to state y at time t under cooling schedule e(·).

As in Section 2.2, we write the acceptance probability at time k from
state x as

    a(x, k) = A_0 + Σ_{i≥1} A_i k^{−d_i/c}.
If x is an isolated local minimum, then the probability of moving to y,
conditioned on acceptance, is

    Q(x, y, t, e) = R(x, y) e(t)^{[U(y)−U(x)]+} / Σ_{i≥1} A_i e(t)^{d_i}.

For some constant K not dependent on t, the denominator satisfies the double
inequality

    A_1 e(t)^{δ(x)} ≤ Σ_{i≥1} A_i e(t)^{d_i} ≤ K e(t)^{δ(x)};

since the exponent in the numerator is strictly positive,

    (R(x, y)/K) e(t)^{U(y)−U(x)−δ(x)} ≤ Q(x, y, t, e) ≤ (R(x, y)/A_1) e(t)^{U(y)−U(x)−δ(x)},

from which (6.1) follows by Lemma 6.4. This finishes the proof for the case
where x is an isolated local minimum. If x is not an isolated local minimum,
then

    Q(x, y, t, e) = R(x, y) e(t)^{[U(y)−U(x)]+} / (A_0 + Σ_{i≥1} A_i e(t)^{d_i}).

Here, the denominator is bounded below by A_0, which is strictly positive, and
above by some (different) constant K, again not dependent on t. Therefore

    (R(x, y)/K) e(t)^{[U(y)−U(x)]+} ≤ Q(x, y, t, e) ≤ (R(x, y)/A_0) e(t)^{[U(y)−U(x)]+}.

Since δ(x) = 0 in this case, we also get (6.1).
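The quantity Q(x, ·, t, e) can be computed mechanically from the proposal weights and acceptance probabilities; the sketch below (illustrative only; the dictionary-based interface is ours) renormalizes R(x, y)·e^{[U(y)−U(x)]+} over the neighbors of x.

```python
def q_conditional(U, R, x, eps):
    # Transition law out of x conditioned on acceptance: proposal weight
    # R(x,y) times Metropolis acceptance eps^([U(y)-U(x)]+), renormalized.
    w = {y: p * eps ** max(U[y] - U[x], 0.0) for y, p in R[x].items()}
    total = sum(w.values())
    return {y: v / total for y, v in w.items()}
```

As eps shrinks, the conditional law concentrates on the least-uphill neighbors, which is why the exponent δ(x) is the one that survives in the denominator bounds.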
6.2 Pathwise Convergence for the Pruned Chain

In this section we prove Theorem 6.2. As in the proof of Theorem 6.1,
we find expressions involving powers of k to bound the transition probabilities
of the k-th step in the pruned chain. The greatest difficulty lies in finding an
asymptotic upper bound for T(k), the time of the k-th transition; this is the
purpose of Lemmas 6.5 and 6.6. A modified version of the techniques used to
prove Theorem 5.2 then establishes pathwise convergence.
Lemma 6.5 On almost all sample paths, T(k + 1) − T(k) ≤ T(k)^{d/c}
for all except finitely many k.

Proof: Let ξ > 0 be arbitrary. We proceed as in the proof of
inequality (5.10) on page 52. As in that proof, the exit time from any state is
stochastically no greater than a geometric variate with parameter T(k)^{−d/c}; so
for all large k,

    Pr{T(k + 1) − T(k) > T(k)^{(d/c)+ξ} | T(k)} ≤ (1 − T(k)^{−d/c})^{⌊T(k)^{(d/c)+ξ}⌋}
                                               ≤ exp(−η T(k)^{ξ})
                                               ≤ exp(−η k^{ξ}),

where η is a positive constant small enough to compensate for the integer
rounding. Applying (5.11), then the Borel-Cantelli lemma for adapted
sequences as before, we have

    Pr{T(k + 1) − T(k) > T(k)^{(d/c)+ξ} i.o.} = 0,

for each ξ > 0. Now for j = 1, 2, ..., let

    B_j = {T(k + 1) − T(k) > T(k)^{(d/c)+2^{−j}} i.o.}.

Then

    Pr{T(k + 1) − T(k) > T(k)^{(d/c)+2^{−j}} i.o. for some j}
        = Pr{∪_{j=1}^{∞} B_j} ≤ Σ_{j=1}^{∞} Pr{B_j} = 0.
Lemma 6.6 T(k) = O(k^{c/(c−d)}), almost surely.

Proof: For convenience, write d̄ = d/c. Inductively define a function
f_k on the nonnegative integers by

    f_k(0) = k;    f_k(j + 1) = f_k(j) + (f_k(j))^{d̄}.

Now consider the related solution to the initial-value problem

    g_k(0) = k,    g_k′(x) = (g_k(x))^{d̄},    (6.2)

for 0 < d̄ < 1. It is easy to verify that this has the unique solution

    g_k(x) = ((1 − d̄)x + k^{1−d̄})^{1/(1−d̄)}

on nonnegative real numbers x. Since the function g_k′ is nondecreasing,

    (g_k(j))^{d̄} = g_k′(j) ≤ ∫_j^{j+1} g_k′(x) dx = g_k(j + 1) − g_k(j),

and from this it easily follows by induction that

    f_k(j) ≤ g_k(j) = O(j^{1/(1−d̄)})

for all nonnegative integers j.

On any sample path ω, there is, by Lemma 6.5, a constant K = K(ω)
such that T(k + 1) − T(k) ≤ T(k)^{d/c} for all k ≥ K. Then

    T(k) ≤ f_{T(K)}(k − K) = O(k^{1/(1−d̄)}) = O(k^{c/(c−d)}).
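The comparison f_k(j) ≤ g_k(j) is easy to check numerically; the sketch below (our own illustration of the recursion and the ODE solution used in the proof) iterates the recursion and evaluates the closed form side by side.

```python
def f(k, j, dbar):
    # Recursion f_k(0) = k, f_k(j+1) = f_k(j) + f_k(j)^dbar.
    v = float(k)
    for _ in range(j):
        v += v ** dbar
    return v

def g(k, j, dbar):
    # Solution of g' = g^dbar, g(0) = k:
    #   g(x) = ((1 - dbar) x + k^(1 - dbar))^(1 / (1 - dbar)).
    e = 1.0 - dbar
    return (e * j + k ** e) ** (1.0 / e)
```

Since g grows like j^{1/(1−d̄)}, so does f, which is the growth rate O(k^{c/(c−d)}) claimed for T(k).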
Proof of Theorem 6.2: As in Section 5.3, let N*(x) be the set of
neighbors at height δ(x) above x. The transition probabilities for the pruned
chain are

    Pr(X_{T(k+1)} = y | X_{T(k)} = x)
        = R(x, y) e(T(k))^{[U(y)−U(x)]+} / Σ_{z∈N(x)} R(x, z) e(T(k))^{[U(z)−U(x)]+}
        ~ ( R(x, y) / Σ_{z∈N*(x)} R(x, z) ) · e(T(k))^{[U(y)−U(x)]+} / e(T(k))^{δ(x)}
        = R̂(x, y) e(T(k))^{[U(y)−(U+δ)(x)]+}
        = R̂(x, y) e(T(k))^{[(U+δ)(y)−(U+δ)(x)]+},

where

    R̂(x, y) = R(x, y) / Σ_{z∈N*(x)} R(x, z).

Given ξ > 0, we have

    (1 − ξ) R̂(x, y) e(T(k))^{[(U+δ)(y)−(U+δ)(x)]+} ≤ Pr(X_{T(k+1)} = y | X_{T(k)} = x)

and

    Pr(X_{T(k+1)} = y | X_{T(k)} = x) ≤ R̂(x, y) e(T(k))^{[(U+δ)(y)−(U+δ)(x)]+},

for all large k. Applying Lemma 6.6 and the trivial inequality T(k) ≥ k,

    k^{−1/(c−d)} ≲ e(T(k)) ≤ k^{−1/c},  a.s.

Therefore, the transition probabilities for the pruned chain are bounded below
by

    (1 − ξ) R̂(x, y) k^{−[(U+δ)(y)−(U+δ)(x)]+/(c−d)}

and above by

    R̂(x, y) k^{−[(U+δ)(y)−(U+δ)(x)]+/c}.

This is equivalent to bounding the cooling schedule for the pruned chain below
by (1 − ξ) k^{−1/(c−d)} and above by k^{−1/c}.

To finish the proof, we need one more lemma. Recall the recurrence
orders β(x) defined on page 48.
Lemma 6.7 If there exist constants b, B, c, C such that

    b t^{−1/c} ≤ e(t) ≤ B t^{−1/C},

then c ≤ max_{x∈S} β(x) ≤ C.

This is proved in exactly the same way as Lemma 5.4.

Applying Lemma 6.7, c − d ≤ max_{x∈S} β̂(x) ≤ c, where β̂(x) is the
recurrence order of the pruned chain. Since c > d* + d, we have
max_x β̂(x) ≥ c − d > d*, and an application of Lemma 5.5 completes the proof.
6.3 S1 Compared to S0

We briefly discuss the set S1 of global minimizers of the function U + δ
and its relationship to the set S0. The following two examples show that this
relationship can attain either extreme: the sets can be completely coincident
or completely disjoint.

Example 6.1 If there are no isolated global minima then S0 = S1.
This is not surprising: in the unpruned chain, the expected waiting time in
any individual state is O(1); its asymptotic behavior should not differ greatly
from the pruned chain, where we force the waiting time to be exactly 1.

Example 6.2 Let S = {w, x, y, z}, N(w) = {x}, N(x) = {w, y},
N(y) = {x, z}, N(z) = {y}, with U(w) = 1.5, U(x) = 1, U(y) = 2, and
U(z) = 0. (See Figure 6.1.)

In this case, the probability distribution of X_{T(k)} converges to a unit
mass on {w, x}, even though neither of these states is a global minimizer.
Again, this is reasonable, since a visit to z is always followed by a visit to y,
but the chain is very likely to contain long oscillations between states x and w.
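The sets S0 and S1 for Example 6.2 can be computed mechanically; the sketch below (our own helper, reading N(y) as {x, z}) forms δ from the neighborhood structure and minimizes U and U + δ.

```python
def minimizers(values):
    # States attaining the global minimum of the given function.
    m = min(values.values())
    return {s for s, v in values.items() if abs(v - m) < 1e-12}

U = {'w': 1.5, 'x': 1.0, 'y': 2.0, 'z': 0.0}
N = {'w': ['x'], 'x': ['w', 'y'], 'y': ['x', 'z'], 'z': ['y']}

# delta(s): height of the lowest neighbor above s (zero if some neighbor
# is not strictly uphill from s).
delta = {s: max(0.0, min(U[n] for n in N[s]) - U[s]) for s in U}

S0 = minimizers(U)                                # global minima of U
S1 = minimizers({s: U[s] + delta[s] for s in U})  # global minima of U + delta
```

Here δ(x) = 0.5 and δ(z) = 2, so U + δ equals 1.5 on both w and x and 2 on both y and z: the sets S1 = {w, x} and S0 = {z} are disjoint, matching the behavior described above.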
[Figure 6.1: the energy landscape of Example 6.2, with levels U(z) = 0, U(x) = 1, U(w) = 1.5, U(y) = 2.]
Despite Example 6.2, we believe that convergence to the set S1 would
in most settings be an acceptable outcome. Some typical applications of
simulated annealing to real-world optimization problems illustrate this.

Example 6.3 In Hajek's [25] formulation of the maximum-matching
problem, a situation like Example 6.2 could not occur, since every move changes
the objective function (number of edges) by one. In fact, any problem
formulation which satisfies the continuous increase assumption of [25] has
S0 ⊆ S1; moreover, the energy level of every state in S1 is either minimal or
one step away from minimal.
Example 6.4 One of the earliest papers to propose simulated
annealing as an optimization technique was [32]. In its formulation of the
travelling salesman problem, a move between states (tours) consists of reversing one
section of the tour. A situation like Example 6.2 could occur only if there were
two tours such that

(i) each could be reached from the other by a single move;

(ii) no shorter tours could be reached from either by a single move; and

(iii) the length of each tour is shorter than the length of any tour one move
away from an optimum.

Unless the arrangement of cities is very regular, or the number of cities very
few, this situation would be rare. Ordinarily, elements of S1 would be tours at
most one move away from optimal, from which a true optimum could be found
by a short post-processing routine.
CHAPTER 7
DIRECTIONS FOR FUTURE WORK
The nonstationary Markov chains of simulated annealing are a rich
source of unanswered questions; we present a small selection.
7.1 Hitting Time

A useful complement to the convergence results of this thesis would
be information about the first hitting time T of S0. In a recent paper, Chiang
and Chow [7] used Wentzell and Freidlin's theory of W-graphs to show that
the second eigenvalue of P_t, the transition matrix at time t, is asymptotically

    λ(t) = 1 − e(t)^{d*},

where d* is Hajek's constant, defined in Chapter 1. This is suggestive for the
following reason. We know that

    max_{x∉S0} Pr(T > t | X_k = x) ≤ ‖ ∏_{j=k}^{t} P̄_j ‖,

where P̄_j denotes the matrix P_j with all rows and columns corresponding to
S0 deleted. For large j, it is well known (see, for example, Section 1 of [25])
that the stationary distribution (i.e., the principal eigenspace) of P_j is almost
entirely concentrated on S0. Therefore, λ(j) is nearly the same as the principal
eigenvalue of P̄_j. If there were a constant A such that

    ‖ ∏_{j=k}^{t} P̄_j ‖ ≤ A ∏_{j=k}^{t} λ(j),    (7.1)

at least for large k, then we could write

    Pr(T > t | X_k = x) ≤ A exp( − Σ_{j=k}^{t} e(j)^{d*} ),

and in principle estimate all the moments of T. (Of course, we would also like
to compute or find good bounds for A.) However, a relationship like (7.1),
which must depend on the special structure of the matrices P̄_j, has thus far
eluded us.
7.2 Tabu Penalties and Simulated Annealing

Fox [14] incorporates k-step tabu penalties into the simulated
annealing algorithm. Roughly, he constructs a new space with states having multiple
components; candidates for the latter are k-tuples in S^k which correspond to
possible paths, without self-loops, of length k in S. Unfortunately, the
modified problem does not in general satisfy WR. To avoid this difficulty, Fox
imposes additional structure. Among other things, he requires that from any
state there is a path with no uphill moves leading to S0; this is achieved by
incorporating a generalization of the neighborhood enrichment mentioned in
Section 3.2.
Neighborhood enrichment has heuristic appeal as a way to diversify
the search. Also, an adaptation of the proof of Belisle [2] shows that Fox's
procedure guarantees convergence in probability to S0 with any cooling schedule
that goes to zero, no matter how quickly, even in the extreme case e(t) = 0
for all t. Nevertheless, we feel that insight could be gained by reconsidering the
general problem without neighborhood enrichment. Are there other, weaker
conditions which guarantee WR in the expanded state space? An answer to
this question would likely increase our understanding of the property of weak
reversibility. Moreover, it might allow us to quantify in some sense the benefits
of tabu search; this would have both theoretical and practical interest.
7.3 Noisy Objective Function Revisited

In Chapter 4, we assumed that the objective-function values were
unknown but confined to a known finite set. Even though we can force an
arbitrary degree of precision, the assumption is still somewhat unsatisfying,
at least from the point of view of traditional mathematics. Dropping the
assumption would allow us to replace the asymptotically stable estimators of
Chapter 4 by estimators which are merely consistent.

Recall the procedure: every time we visit a state, we observe its energy
level with a measurement subject to error. Past observations are combined to
form a vector of estimates of the objective function U(x) at each state x. If
the estimator(s) are consistent, then after a random but a.s. finite time, the
estimated value of every U(x) will be within any prescribed ε > 0 of its true
value.
The first difficulty is that this is not a stopping time. We could get
around this with a stronger version of Theorem 3.2. The theorem would need
to hold when conditions (ii) and (iii) imply merely approximate equality of the
respective transition probabilities. Under those weakened conditions, equalities
(3.5) and (3.6) are no longer valid, and it is not immediately clear that they
could be replaced with appropriate inequalities.
The next step would be to show that knowing the energy levels to
within an arbitrary ε suffices. We are looking at a Markov chain where the
transition probabilities at time t satisfy some inequality

    A e(t)^{[U(y)−U(x)]+ + ε} ≤ P(x, y, t) ≤ B e(t)^{[U(y)−U(x)]+ − ε}.    (7.2)

It might turn out that (7.2) is enough to guarantee convergence. Alternatively,
assuming, as do Gelfand and Mitter [19], that the error term ε does not
depend on x or y, a system satisfying (7.2) can be viewed as ordinary simulated
annealing with a random cooling schedule ẽ(·). Assuming the original cooling
schedule satisfies (1.2),

    ẽ(t) = e(t)^{ζ},    (7.3)

where ζ is a random variable in the interval (a, b). In effect, we move the
randomness from the objective function to the cooling schedule. To get
convergence in probability, we need to show that convergence results such as those
in [7] or [25] continue to be true if the cooling schedule satisfies only the weaker
condition (7.3). It is not yet clear whether this is possible.
7.4 Refinement of Algorithm QUICKER-1

Fox has suggested replacing the geometric variate with its
expectation (rounded upward to the nearest integer) in algorithm QUICKER-1. A
proof like that in Section 5.3 (but easier) shows that the modified algorithm
exhibits the same pathwise convergence discussed there. We do not know if
Fox's modified QUICKER-1 gives convergence in probability; this would be
worth investigating.

Fox's modified QUICKER-1 would be slightly simpler to implement.
It seems to reduce variance in the random cooling schedule. However, more
work is needed to determine whether this translates into reduced variance in the
sequence of states generated. The modified algorithm would prevent anomalously
rapid cooling. Heuristically, it can be argued that this is desirable. Once
again, though, more work is needed to determine whether this improves the
performance of the algorithm.
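The modification is a one-line change in a QUICKER-1-style implementation: the sketch below (our own rendering of the suggestion, not Fox's code) replaces the geometric draw by its expectation, rounded upward.

```python
import math

def sojourn_rounded_mean(exit_prob):
    # Deterministic sojourn: the expectation 1/exit_prob of the geometric
    # variate, rounded up to the nearest integer.
    return math.ceil(1.0 / exit_prob)
```

The deterministic sojourn has the same order of magnitude as the geometric one but zero variance, which is the source of both the potential variance reduction and the prevention of anomalously rapid cooling noted above.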
BIBLIOGRAPHY
[1] Asmussen, S. (1987). Applied Probability and Queues. New York: Wiley.
[2] Belisle, C. J. P. (1992). Convergence Theorems for a Class of Simulated
Annealing Algorithms on R^d. Journal of Applied Probability 29:885-95.
[3] Billingsley, P. (1986). Probability and Measure. New York: Wiley.
[4] Borkar, V. (1992). Pathwise Recurrence Orders and Simulated Anneal-
ing. Journal of Applied Probability 29:472-76.
[5] Browdy, M. H. (1990). Simulated Annealing: An Improved Computer
Model for Political Redistricting. Yale Law & Policy Review 8:163ff.
[6] Catoni, O. (1992). Rough Large Deviations Estimates for Simulated An-
nealing: Application to Exponential Schedules. The Annals of Probabil-
ity 20:1109-46.
[7] Chiang, T.-S. and Chow, Y. (1988). On the Convergence Rate of Anneal-
ing Processes. SIAM Journal of Control and Optimization 26:1454-70.
[8] Chiang, T.-S. and Chow, Y. (1989). A Limit Theorem for a Class of
Inhomogeneous Markov Processes. The Annals of Probability 17:1483-
1502.
[9] Chiang, T.-S. and Chow, Y. (1993). Asymptotic Behavior of Eigenvalues
and Random Updating Schemes. Applied Mathematics and Optimization
28:259-75.
[10] Chung, K. L. (1974). A Course in Probability Theory. 2d ed. San Diego:
Academic Press.
[11] Connors, D. P. and Kumar, P. R. (1988). Balance of Recurrence Orders
in Time-Inhomogeneous Markov Chains with Application to Simulated
Annealing. Probability in the Engineering and Informational Sciences
2:157-84.
[12] Connors, D. P. and Kumar, P. R. (1988). Simulated Annealing Type
Markov Chains and Their Order Balance Equations. SIAM Journal of
Control and Optimization 27:1440-61.
[13] Feller, W. (1968). An Introduction to Probability Theory and Its Appli-
cations. Vol. 2, 2d ed. New York: Wiley.
[14] Fox, B. L. (1993). Integrating and Accelerating Tabu Search, Simu-
lated Annealing, and Genetic Algorithms. Annals of Operations Research
41:47-67.
[15] Fox, B. L. (1994). Faster Simulated Annealing. To appear in SIAM Jour-
nal of Optimization.
[16] Fox, B. L. and Heine, G. W. (1993). Probabilistic Search with Overrides.
Technical Report, University of Colorado, Denver.
[17] Fox, B. L. and Heine, G. W. (1993). Stable Estimators for Self-Adjusting
Simulations. Technical Report, University of Colorado, Denver.
[18] Freidlin, M. I. and Wentzell, A. D. (1984). Random Perturbations of
Dynamical Systems. Translated by J. Szücs. New York: Springer-Verlag.
[19] Gelfand, S. B. and Mitter, S. K. (1989). Simulated Annealing with Noisy
or Imprecise Measurements. Journal of Optimization Theory and Appli-
cations 62:49-62.
[20] Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distri-
bution, and the Bayesian Restoration of Images. IEEE Transactions in
Pattern Analysis and Machine Intelligence 6:721-41.
[21] Gidas, B. (1984). Non-Stationary Markov Chains and the Convergence
of Annealing Algorithms. Journal of Statistical Physics 39:75-131.
[22] Glover, F. and Laguna, M. (1993). Tabu Search. In Modern Heuristic
Techniques for Combinatorial Problems, ed. C. R. Reeves, pp. 70-141.
Blackwell Scientific Publications.
[23] Glover, F., Taillard, E., and de Werra, D. (1993). A User's Guide to Tabu
Search. Annals of Operations Research 41:3-30.
[24] Greene, J. W. and Supowit, K. J. (1986). Simulated Annealing Without
Rejected Moves. IEEE Transactions on Computer-Aided Design, Vol.
CAD-5, pp. 221-28.
[25] Hajek, B. (1988). Cooling Schedules for Optimal Annealing. Mathematics
of Operations Research 13:311-29.
[26] Heine, G. W. (1993). Convergence Properties of a Loop-Skipping Algo-
rithm for Simulated Annealing. Technical Report, University of Colorado,
Denver.
[27] Heine, G. W. (1994). Pathwise Convergence for Simulated Annealing
with an Adaptive Schedule. Technical Report, University of Colorado,
Denver.
[28] Hwang, C.-R. and Sheu, S.-J. (1989). On the Weak Reversibility Condi-
tion in Simulated Annealing. Soochow Journal of Mathematics 15:159-70.
[29] Hwang, C.-R. and Sheu, S.-J. (1990). Large-Time Behavior of Perturbed
Diffusion Markov Processes with Applications to the Second Eigenvalue
Problem for Fokker-Planck Operators and Simulated Annealing. Acta
Applicandae Mathematicae 19:253-95.
[30] Hwang, C.-R. and Sheu, S.-J. (1991). Singular Perturbed Markov Chains
and Exact Behaviors of Simulated Annealing Processes. Journal of The-
oretical Probability 5:223-49.
[31] Johnson, D., Aragon, C., McGeoch, L. and Schevon, C. (1991). Opti-
mization by Simulated Annealing: An Experimental Evaluation; Part II,
Graph Coloring and Number Partitioning. Operations Research 37:378-
406.
[32] Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983). Optimization by
Simulated Annealing. Science 220:671-80.
[33] Kropaczek, D. J. and Turinsky, P. J. (1991). In-Core Nuclear Fuel Man-
agement Optimization for Pressurized Water Reactors Utilizing Simu-
lated Annealing. Nuclear Technology 95:9ff.
[34] Kuo, C.-H., Michel, A. N. and Gray, W. G. (1992). Design of Optimal
Pump-and-Treat Strategies for Contaminated Groundwater Remediation
Using the Simulated Annealing Algorithm. Advances in Water Resources
15:95ff.
[35] Kuriyan, J., Brunger, A. T. and Karplus, M. (1989). X-Ray Refinement
of Protein Structures by Simulated Annealing: Test of the Method on
Myohemerythrin. Acta Crystallographica (Section A) 45:396ff.
[36] Loeve, M. (1963). Probability Theory. 3d ed. Princeton, New Jersey: Van
Nostrand.
[37] Tsitsiklis, J. (1989). Markov Chains with Rare Transitions and Simulated
Annealing. Mathematics of Operations Research 14:70-90.
[38] Yan, D. I. and Mukai, H. (1992). Stochastic Discrete Optimization. SIAM
Journal of Control and Optimization 30:594-612.