Mutual Information: Mutual Information-Based Registration of Digitally Reconstructed Radiographs and Electronic Portal Images

Material Information

Bachman, Katherine Anne
Physical Description: xiii, 116 leaves ; 28 cm


Subjects / Keywords:
Computer vision -- Mathematical models (lcsh)
Diagnostic imaging -- Digital techniques -- Mathematical models (lcsh)
Image processing -- Digital techniques -- Mathematical models (lcsh)
Image analysis -- Data processing -- Mathematical models (lcsh)
bibliography (marcgt)
theses (marcgt)
non-fiction (marcgt)


Includes bibliographical references (leaves 114-116).
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Katherine Anne Bachman.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
66459954 ( OCLC )
LD1193.L622 2005m B32 ( lcc )

Full Text
Katherine Anne Bachman
Master of Basic Science, Mathematics, University of Colorado at Denver, 2002
Bachelor of Science, Chemistry, University of Colorado at Denver, 2000
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Applied Mathematics

This thesis for the Master of Science
degree by
Katherine Anne Bachman
has been approved
Weldon A. Lodwick


Bachman, Katherine Anne (M.S., Applied Mathematics)
Mutual Information
Mutual Information-Based Registration of
Digitally Reconstructed Radiographs and
Electronic Portal Images
Thesis directed by Professor Weldon A. Lodwick
This study regards discrete mutual information and demonstrates the use of
the theory with an example of radiological image registration with in-plane,
two-dimensional images using various search strategies.
Image registration is the process of finding an optimal geometric transforma-
tion between corresponding image data. Although it has applications in many
fields, the one that is addressed in this thesis is medical image registration.
Medical image registration has a wide range of potential applications, but the
emphasis is on radiological imaging.
In addition to having application in communication and computer vision,
mutual information has proven robust and has resulted in fully automated reg-
istration algorithms that are currently in use. Mutual information, which is
given by the difference between the sum of the entropies of individual over-
lapping images and the joint entropy of the combined images, is a measure of
the reduction in uncertainty about one image due to knowledge of the other.
This non-parametric technique makes no assumption of the functional form or

relationship between image intensities in the two images.
The thesis is organized as follows. Chapter 1 gives a broad overview of
medical image registration as the context in which to present the subject of
mutual information-based medical image registration. Chapter 2 regards the
theory and mathematics of discrete mutual information and its origin in infor-
mation theory. Chapter 3 looks at the implementation of the theory applied
to image registration in general. Chapter 4 looks at an application of mu-
tual information-based image registration in radiological imaging registration
of Digitally Reconstructed Radiographs (DRRs) and Electronic Portal Images
(EPIs). Chapter 5 consists of concluding remarks. The Appendix includes in-
formation that is relevant, but not critical, to the understanding of the material
presented in this thesis. Because probability theory is a major part of infor-
mation theory and, consequently, mutual information theory, a brief overview
of discrete probability theory is included in the Appendix as a quick reference.
Examples are provided in the Appendix as well as throughout the body of the
thesis.

This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Weldon A. Lodwick

To my parents, Dr. and Mrs. Miles MacGran, and to my husband, Randy.

I would like to thank my advisor, Weldon A. Lodwick, for the means of support
obtained for me, without which this paper might not have been possible. I
would also like to thank Dr. Lodwick and Francis Newman for their meticulous
review and critique of the thesis, and Francis Newman for supplying the images
used for the DRR/EPI registration experiments and other related images. I
would like to thank Weldon Lodwick, Bill Briggs, and Bill Cherowitzo for acting
as my Graduate Committee, and also Mike Jacobson for attending the thesis
defense.

Figures
Tables
1. Medical Image Registration Overview
1.1 Taxonomy of Medical Image Registration Methodology
1.1.1 Prospective and Retrospective Methods
1.2 Types of Transformation
1.3 Image Registration as an Optimization Problem
2. Mutual Information Theory
2.1 Information
2.2 Entropy (uncertainty or complexity)
2.2.1 Joint Entropy and Conditional Entropy
2.2.2 Relative Entropy
2.2.3 Concavity of Entropy
2.2.4 Entropy of a Signaling System
2.2.5 Entropy and Kolmogorov Complexity
2.3 Mutual Information
3. Mutual Information Theory Applied to Image Registration
4. Mutual Information-Based Registration of Digitally Reconstructed Radiographs (DRRs) and Electronic Portal Images (EPIs)
4.1 Experiments and Results
4.1.1 Greedy Algorithm
4.1.2 Genetic Algorithm
4.2 Nelder-Mead (MATLAB's fminsearch)
4.3 Simulated Annealing
4.4 Other Experiments
5. Conclusion
A. Brief Discrete Probability Theory
B. Properties of Convex Functions
C. Miscellaneous
D. Sample Output
References

1.1 Registration example with multiple solutions. The image B is to be transformed to image A. The rightmost image shows B registered to A by two translations and a rotation of 45 degrees.
2.1 Discrete noiseless binary channel
2.2 Discrete binary symmetric channel
2.3 The entropy function for base 2, e, and 10 logs
2.4 Decomposition of a choice from three possibilities
2.5 The entropy function (base 2 log) of two probabilities
2.6 Bound on log_e(x) function
2.7 Noisy communication channel, Example 2
2.8 Relationship between entropy and mutual information. The mutual information I(A, B) corresponds to the intersection of the information in A with the information in B.
3.1 The plot on the left is the joint histogram of two randomly generated 150 x 150 matrices taking on values from 1 to 256. The plot on the right is the plot of the joint histogram of one matrix with itself.
3.2 Relationship between entropy and mutual information. The mutual information I(A, B) corresponds to the intersection of the information in A with the information in B.
3.3 Venn diagram of the relationship between joint entropy and mutual information of totally unrelated images
3.4 Venn diagram of the relationship between joint entropy and mutual information of totally related images
3.5 The image data sets, A and B
3.6 The joint probability distribution of Example 4 as a surface
3.7 The image data sets A, rotated -90 degrees, and B
3.8 The joint probability distribution as a surface of Example 4 with image A rotated 90 degrees as in Figure 3.7
4.1 A typical radiation treatment plan [12]
4.2 Typical aligned pelvis DRR [12]
4.3 Typical aligned pelvis EPI [12]
4.4 A misaligned EPI and the DRR automatically rotated to the EPI position [12]
4.5 Joint histogram of misaligned images. Referring to the rightmost plot, the x- and y-axes represent the range of intensities in the images. The z-axis represents the probability that an intensity in one image will occur with an intensity in the other. It is a surface plot of the probability values represented in the plot on the left. [12]
4.6 The result after the automated registration process [12]
4.7 Greedy algorithm, run 1, 256x256 images
4.8 Greedy algorithm, run 2, 256x256 images
4.9 64x64 DRR and EPI
4.10 128x128 DRR and EPI
4.11 256x256 DRR and EPI
4.12 Plot of MI versus Angle of Rotation from simulated annealing data
4.13 Detail of Figure 4.12, plot of MI versus Angle of Rotation from simulated annealing data
A.1 Example 9

4.1 Sample run (Figure 4.7) converges to the optimum transformation in 8.712 minutes
4.2 Sample run (Figure 4.8) converges to a suboptimal solution in 5.0823 minutes
4.3 Convergence data for 22 runs, 256x256 images
4.4 64x64 run converges in 0.6504 minutes
4.5 128x128 run converges in 1.2058 minutes
4.6 256x256 run converges in 4.3367 minutes
4.7 Summary of Tables 4.3, 4.4, and 4.5
4.8 Genetic algorithm, 256x256 images
4.9 Parameter list for genetic run pairs
4.10 MATLAB's fminsearch (Nelder-Mead) algorithm, 256x256 images

1. Medical Image Registration Overview
Image registration is the process of finding an optimal geometric transformation
between corresponding image data. In other words, given a reference or model,
image A, and a test, or floating image, image B, find a suitable transformation,
T, such that the transformed test image becomes similar to the reference. The
image registration problem typically occurs when two images represent essen-
tially the same object, but there is no direct spatial correspondence between
them. The images might be acquired with different sensors, or the same sensor
at different times or from different perspectives.
Modern three-dimensional treatment radiation planning is based on se-
quences of tomographic images. Computed tomography (CT) has the potential
to quantitatively characterize the physical properties of heterogeneous tissue in
terms of electron densities. Magnetic resonance (MR), which looks at hydrogen
atom densities, is very often superior to CT, especially for the task of differ-
entiating between healthy tissue and tumor tissue. Positron emission tomogra-
phy (PET), single photo emission tomography (SPECT), and MRS (magnetic
resonance spectroscopy) imaging have the potential to include information on
tumor metabolism. The various image modalities are non-competing. They
have specific properties and deliver complementary information. The images
supply important information for delineation of tumor and target volume, and
for therapy monitoring.

There is usually a high degree of shared information between images of dif-
ferent modalities of the same structures. This is certainly the case with images
of the same modality which have been acquired at different times or from dif-
ferent perspectives as in the example presented in Chapter 4. In this case, the
modality is CT, or CT-derived, and the problem is rectification of orientation,
or pose, of one CT image with respect the other by in-plane rotations and/or
In order to use image sequences of the same, or various, modalities simulta-
neously, a definite relation between the picture elements (pixels), or volume ele-
ments (voxels) of the various image sequences needs to be established. Methods
which ar e able to calculate and establish these relations are called registration,
matching, or image correlation techniques.
1.1 Taxonomy of Medical Image Registration Methodology
Image registration methods can be categorized by
User interaction
- The transformation is determined directly by user interaction.
- The transformation is calculated automatically, but the user deter-
mines the image features used and the start-up parameters.
- No user interaction is required.

Scope and elasticity of transformation
- Scope
* Global
Global transformations modify the data set as a whole, that
is, not just the therapy relevant area, by applying a single
function to all elements.
* Local
In local procedures the function can vary. Single voxels or pixels,
single image slices, or single organs could be affected.
Elasticity or plasticity (geometrical properties) of transformation
* Rigid transformations
The distance between any two points in the first image is pre-
served when these two points are mapped onto the second
image. Rigid transformations can be decomposed into trans-
lation, rotation, and reflection. Rigid transformations are
special cases of affine transformations. A transformation
is called affine when any straight line in the first image is
mapped onto a straight line in the second image while paral-
lelism is preserved.
* Non-rigid affine transformations
Examples of non-rigid affine transformations are both uniform
and non-uniform scaling and shearing.
* Curved, or elastic, transformations

A curved or elastic transformation can map a straight line onto a
curve, for example, transformations described by polynomial
functions. Rigid transformations can be described as curved
transformations with zero elasticity. Elastic transformations
can be approximated using local affine, that is, piecewise lin-
ear, algorithms when using a small granularity, that is, a fine
level of detail.
Usage of auxiliary means
Methods are distinguished by the amount of auxiliary information which
must be used before, or during, image acquisition to allow subsequent
registration (for example, the use of external or internal markers).
1.1.1 Prospective and Retrospective Methods
The methods listed in the previous section can also be categorized under what
are called prospective and retrospective image registration methods [14].
Prospective methods always register artificial, as opposed to anatomical,
landmarks (fiducials), for example, objects attached to the patient that are
detectable in all pertinent modalities. If the position of the landmarks relative
to the patient remains constant, a high degree of accuracy is possible. The
disadvantage is that, since the calculation of the transformation is based only
on a restricted number of landmarks, the resulting transformations are always
rigid. If organ movements occur between the acquisition of different image series,
these shifts cannot be considered correctly. Also, this procedure is invasive and
inconvenient for the patient and staff.

Retrospective methods do not need any setup steps and are not subject to
the conditions of image acquisition. To derive the necessary transformation, only
patient-related references (anatomical landmarks, surfaces of structures, etc.) are
used.
Retrospective methods include the following:
Feature-based methods
Point matching (Procrustes method)
Identification of corresponding points in the image series requires
user interaction. An advantage of this method is that land-
marks can be defined just in the therapy relevant image area,
or locally. Therefore, the registration results will not be influ-
enced by possible distortions in other regions of the image. A
disadvantage is that it is difficult to locate corresponding land-
marks in complementary image modalities precisely. Thus, a
large amount of anatomical knowledge is required of the user.
The precision of the resulting transformation depends on the pre-
cision with which these corresponding landmarks can be identi-
fied and depends therefore on the resolution (pixel size) and slice
distance/thickness of the images.
Surface matching
Rather than single points, this method uses the entire surface of
a structure, for example, the patient's contour. A transformation
is used which minimizes the distance between the surface of

corresponding structures. An example is the head/hat analogy,
where the contour of the head, segmented in the first image data
set, is put over the head contour in the second data set. In this
way, uncertainties in the definition of single points do not carry
much weight. The disadvantage is that the structures must be
segmented in an initial step. Since segmentation is a non-trivial
task and normally only semi-automatic segmentation procedures
deliver acceptable results, these methods are time-consuming.
Metrics of similarity
This retrospective approach is a method where a value for the similarity of
the two data sets is calculated. The image data set, an area or volume,
to be registered is transformed by an optimization procedure (Section
1.3) until the similarity with the first data set achieves a maximum.
Voxel similarity measure techniques can be fully automatic and have
an accuracy comparable to bone-implanted markers, but they can
also fail [7, 26]. A common cause of failure can be that the images
are poorly aligned at the start.
Sum of squared intensity differences, a correlation-based distance
Mutual information (an information theoretic technique)
Mutual Information measurements consider the intensity distribu-
tion of both image data sets and are, therefore, well suited to the

registration of multimodal image sequences, since no presump-
tions about the intensities have to be made (for example, it is
possible that regions with a high intensity in the reference image
correspond to regions with low, medium or high intensity in the
template image). The mutual information reaches a maximum
if, for a given intensity in image A, a distinct intensity in image
B is found. (In other words, two different intensities in image
A don't correspond to the same intensity in image B.) There
are no restrictions concerning the intensity combinations. An
advantage of this method is that it does not depend on the lo-
calization of single landmarks. No user interaction is required.
It can be fully automatic in that it makes no assumption of the
functional form or relationship between image intensities in the
image to be registered. A disadvantage is that the calculation ef-
fort is much higher compared to point matching methods. Also,
since this method considers the global intensity distribution and
looks for the optimum over the image sequence on the whole, it
is possible that the precision of the superposition is suboptimal
in the area of interest. The theory and an application of mu-
tual information-based medical image registration are presented
in the chapters that follow.
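In discrete form this measure can be estimated directly from intensity histograms as I(A, B) = H(A) + H(B) - H(A, B). The following is a small illustrative sketch in Python (not the implementation used in this thesis); the toy "images" are simply lists of intensities:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    """I(A, B) = H(A) + H(B) - H(A, B), the joint entropy being taken
    over co-occurring intensity pairs (the joint histogram)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

a = [1, 1, 2, 2, 3, 3, 4, 4]   # a toy "image" as an intensity list
b = [9, 9, 8, 8, 7, 7, 6, 6]   # a one-to-one remapping of a's intensities
print(mutual_information(a, b))  # equals H(A): the intensities correspond
```

Note that b shares full information with a even though the intensity values themselves differ, which is exactly why no presumption about intensity correspondence is needed.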
1.2 Types of Transformation
Since we are three-dimensional, moving beings, registration should be four-
dimensional, that is, it should include the three spatial dimensions as well as the

temporal dimension. However, in practice, assumptions and approximations are
made so that the body can be represented in fewer dimensions.
2D images may be registered simply by a rotation and two orthogonal
translations. Correction for differences in scaling from the real object to
each of the images to be registered may also be necessary. Controlling the
geometry of image acquisition is usually very difficult, so clinically relevant
examples of 2D-2D registration are rare. For this study, 2D images are used
for the sake of simplicity and the fact that these were the images that were
available. These images are registered by a rotation and vertical and horizontal
translations.
In the case of 3D-3D image registration, three translations and three ro-
tations bring the images into registration. It is assumed that the imaged
part of the body behaves as a rigid body, that is, the internal anatomy of
the patient has not distorted or changed. Careful calibration of scanning
devices is required to determine image scaling.
When establishing correspondence between 3D and projection images, 2D-
3D registration may be required. The main application of these methods
is in image-guided interventions.
This class of registration problem concerns image sequences that follow

some process that changes with time.
1.3 Image Registration as an Optimization Problem
Algorithms that directly calculate the transformation to establish correspon-
dence between two images can be used where the images to be registered have
very similar intensities and the transformation required to establish correspon-
dence is small. This is the case with the point matching, or Procrustes, method
described above. In other cases, a process of optimization is required. Opti-
mization algorithms compute a cost, or similarity, function relating to how well
two images are registered. The goal is to minimize, or maximize, the associated
cost function. The cost function can be expressed as
C = C_transformation + C_similarity,
where the first term characterizes the cost associated with particular trans-
formations (lateral translations, rotations, nonlinear, etc.) and the second term
characterizes the similarity between the reference and test images. Mutual infor-
mation and sum of squared intensity differences are examples of cost functions.
For mutual information-based image registration, as will be seen in the following
chapters, the cost function increases as the images to be registered come into
alignment. Conversely, when the sum of squared intensity differences algorithm
is used, the cost function decreases as alignment increases.
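As a toy illustration of these opposing sign conventions, the following sketch (illustrative only, using a 1-D intensity list in place of a 2-D image) shows the sum-of-squared-differences cost reaching its minimum, zero, at the correct alignment:

```python
def ssd(a, b):
    """Sum of squared intensity differences: smaller means better aligned."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def shift(img, t):
    """Shift a 1-D 'image' right by t pixels, zero-padding on the left."""
    return [0] * t + img[: len(img) - t]

reference = [0, 0, 5, 9, 5, 0, 0]
test = shift(reference, 2)  # the test image is misaligned by 2 pixels

# Search over candidate shifts of the reference: the SSD cost function
# decreases toward zero as the images come into alignment.
costs = {t: ssd(shift(reference, t), test) for t in range(5)}
best_t = min(costs, key=costs.get)
```

A mutual-information cost function would instead be maximized at the same offset.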
Image registration as an optimization problem is hard to solve in that the
problem is ill-posed [10]. Small changes of the input images can lead to com-
pletely different registration results. Optimization algorithms can converge to

Figure 1.1: Registration example with multiple solutions. The image B is to
be transformed to image A. The rightmost image shows B registered to A by
two translations and a rotation of 45 degrees.
a suboptimal solution, a local optimum, which can cause registration to fail.
Moreover, the solution may not be unique. This is illustrated in the simple ex-
ample of Figure 1.1. In order to register the image labeled B to the one labeled
A, several solutions are found: two pure translations followed by a rotation
of 45 degrees, or equivalently, a rotation of 135 degrees followed by two pure
translations, etc. The objective value at optimality is the same, but the solution
path is different. Clearly, more information is required in order to determine
which transformation should be used.
A property of the medical images that are registered is that they are discrete.
That is, the object is sampled at a finite number of voxels or pixels. In this
study, four discrete search algorithms are used with mutual information to find a
solution to a medical image registration problem: a greedy algorithm, a genetic
algorithm, simulated annealing, and MATLAB's fminsearch algorithm, which
uses the Nelder-Mead simplex (direct search) method.

Greedy algorithms are algorithms which follow the problem solving meta-
heuristic of making the locally optimum choice at each stage with the hope of
finding the global optimum. Greedy algorithms find the overall, or globally,
optimal solution for some optimization problems, but may find suboptimal solu-
tions for some instances of other problems. Greedy algorithms work in phases.
In each phase, a decision is made that appears to be good, without regard for
future consequences. Generally, this means that some local optimum is chosen.
This "take what you can get now" strategy is the source of the name for this class
of algorithms. Hopefully, when the algorithm terminates, the local optimum is
the global optimum. If this is the case, then the algorithm is correct. Otherwise,
the algorithm has produced a suboptimal solution. In the case of medical image
registration, it is probably safe to state that the global optimum is the required
solution.
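A minimal sketch of this strategy for registration (illustrative Python, not the thesis's MATLAB code): a single hypothetical rotation parameter is adjusted one step at a time against an assumed similarity surface.

```python
def greedy_register(score, start, step=1.0):
    """Hill-climb: in each phase, move to the neighboring transformation
    that most improves the similarity score; stop at a local optimum."""
    current = start
    while True:
        best = max((current + step, current - step), key=score)
        if score(best) <= score(current):
            return current  # no neighbor improves: a (possibly local) optimum
        current = best

# Hypothetical similarity surface: mutual information peaking at 10 degrees.
score = lambda angle: -(angle - 10.0) ** 2
print(greedy_register(score, start=0.0))  # climbs to the peak at 10.0
```

On a surface with several peaks, the same loop would stop at whichever local optimum it reaches first, which is exactly the failure mode described above.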
Genetic algorithms comprise a particular class of evolutionary algorithms
inspired by the mechanisms of genetics, which have been applied to global
optimization problems. Genetic algorithms use biologically-derived
techniques such as inheritance, mutation, natural selection, and recombination
(or crossover). With genetic algorithms, each of a population consisting of a
number of trial solutions, in this case image transformations, is evaluated to
yield a fitness. A new generation is created from the better of them, crossover
being the dominant means of creating new members. The process is continued
through a number of generations with the objective that the population should
evolve to contain a global optimum or at least an acceptable solution.
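The selection-crossover-mutation loop can be sketched as follows (an illustrative toy in Python, evolving a single transformation parameter against an assumed fitness surface; the thesis experiments instead evaluate mutual information over image transformations):

```python
import random

def genetic_register(fitness, bounds, pop_size=20, generations=40, seed=1):
    """Toy genetic search: selection keeps the fitter half of the population,
    crossover blends two parents, and mutation randomly nudges a child."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]  # trial transformations
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # selection: fitter half survives
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2.0                # crossover: blend two parents
            if rng.random() < 0.2:               # mutation: small random nudge
                child += rng.uniform(-1.0, 1.0)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical fitness surface: mutual information peaking at a 10-degree rotation.
best = genetic_register(lambda angle: -(angle - 10.0) ** 2, bounds=(-45.0, 45.0))
```

Because the parents are carried over unchanged, the best member of the population never worsens from one generation to the next.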

The MATLAB program fminsearch uses the Nelder-Mead simplex (direct
search) method of [8]. This is a direct search method that does not use numerical
or analytic gradients. It is based on evaluating a function at the vertices of a
simplex, then iteratively shrinking the simplex as better points are found until
some desired bound is obtained [11].
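A stripped-down sketch of the simplex idea in Python (reflection, expansion, and shrink steps only; the actual fminsearch routine also uses contraction steps and different termination tests):

```python
def nelder_mead(f, x0, step=0.5, tol=1e-8, max_iter=500):
    """Minimal Nelder-Mead sketch: reflect the worst vertex through the
    centroid of the others, expand if that helps, otherwise shrink the
    simplex toward the best vertex."""
    n = len(x0)
    simplex = [list(x0)] + [
        [x0[j] + (step if j == i else 0.0) for j in range(n)] for i in range(n)
    ]
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        if abs(f(worst) - f(best)) < tol:
            break
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [2.0 * centroid[j] - worst[j] for j in range(n)]
        if f(reflected) < f(best):
            expanded = [3.0 * centroid[j] - 2.0 * worst[j] for j in range(n)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            simplex = [best] + [
                [(p[j] + best[j]) / 2.0 for j in range(n)] for p in simplex[1:]
            ]
    simplex.sort(key=f)
    return simplex[0]

# Minimize a simple quadratic with its optimum at (1, -2); for registration,
# f would be the negative of the mutual information of the transformed images.
xmin = nelder_mead(lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2, [0.0, 0.0])
```

No gradients are evaluated anywhere: only function values at the simplex vertices drive the search, which is what makes the method attractive for similarity measures that are noisy or non-differentiable.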

Figure 2.1: Discrete noiseless binary channel.
Figure 2.2: Discrete binary symmetric channel.
2. Mutual Information Theory
The derivation of mutual information from information theory and the associ-
ated mathematics are considered in this chapter.
2.1 Information
In 1928, R. V. L. Hartley [6] was the first to suggest the use of logarithms for
the measure of information. The definition of information in the context of

information theory is an engineering definition based on probabilities, not on
the meaning of symbols communicated to a human receiver. Therefore, it is
somewhat counterintuitive.
Consider an alphabet of n symbols, a_1, a_2, ..., a_n, each with its probability
p(a_1), p(a_2), ..., p(a_n) of occurrence, communicated via some channel. If p(a_1) =
1, then a_1 will be received with certainty. This is the case of the noiseless binary
communication channel pictured in Figure 2.1. Any transmission, 0 or 1, is
received without error. There is no surprise in the occurrence of a_1 given p(a_1),
so no information is obtained. If a symbol with a low probability occurs, there is
more surprise, more information. This might be the case for the binary channel
pictured in Figure 2.2, where, for example, a 0 is sent and 1 is received. Thus,
information is somewhat inversely related to the probability of occurrence.
The information from two independent symbols is the sum of the information
from each separately (See Appendix A, Example 6). Since the probabilities of
two independent choices are multiplied together to get the probability of the
compound event, it is natural to define the amount of information as
Definition 2.1 Information
I(a_i) = log(1/p(a_i)).
Remark 2.2 Information is inversely proportional to probability. Symbols with
the least probability of occurring will provide the most information.

Figure 2.3: The entropy function for base 2, e, and 10 logs.
As a result,
I(a_1) + I(a_2) = log(1/(p(a_1) p(a_2))) = log(1/p(a_1)) + log(1/p(a_2)).
Thus, given the two probabilities of events, the probability of both occurring,
assuming independence, is a product. This product yields the amount of infor-
mation as a sum.
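The definition and its additivity over independent events can be checked numerically (an illustrative sketch, using base 2 so that information is measured in bits):

```python
import math

def information(p):
    """I(a) = log2(1 / p(a)), the information of an event of probability p, in bits."""
    return math.log2(1.0 / p)

print(information(0.5))   # a fair coin flip carries one bit
print(information(1.0))   # a certain event carries no information
# Independent events: probabilities multiply, information adds.
print(information(0.5 * 0.25), information(0.5) + information(0.25))
```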
2.2 Entropy (uncertainty or complexity)
The entropy function measures the amount of uncertainty, surprise, or informa-
tion in the outcome of a situation, for example, the reception of a message or the

Figure 2.4: Decomposition of a choice from three possibilities.
outcome of some experiment. It involves only the distribution of probabilities.
The Shannon-Wiener entropy measure H, originally developed as part of com-
munication theory in the 1940s [15, 17], is the most commonly used measure of
information in signal and image processing. This formula is derived from three
conditions that a measure of uncertainty in a communication channel should
satisfy:
1. The functional should be continuous in the p_i.
2. If all p_i equal 1/n, where n is the number of symbols, then H should be
monotonically increasing in n.
3. If a choice is broken down into a sequence of choices, then the original
value of H should be the weighted sum of the constituent H values. That is,
H(p_1, p_2, p_3) = H(p_1, p_2 + p_3) + (p_2 + p_3) H(p_2/(p_2 + p_3), p_3/(p_2 + p_3)).
The meaning of this condition is illustrated in Figure 2.4. On the left there are three
possibilities: p_1 = 1/2, p_2 = 1/3, and p_3 = 1/6. On the right a choice is made
first between two possibilities, each with probability 1/2. If the second occurs,
then a choice is made between probabilities 2/3 and 1/3. The final results have
the same probabilities as before. In this special case it is required that
H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3).
The coefficient 1/2 is the weighting factor introduced because this second
choice occurs only half the time.
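This grouping identity, H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3), can be verified numerically (a quick illustrative check in Python):

```python
import math

def H(*probs):
    """Entropy (base 2) of a finite probability distribution."""
    return -sum(p * math.log2(p) for p in probs)

lhs = H(1/2, 1/3, 1/6)                   # the direct three-way choice
rhs = H(1/2, 1/2) + 0.5 * H(2/3, 1/3)    # the two-stage decomposition
print(lhs, rhs)                          # the two values agree
```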
Shannon proved that the -Σ p_i log p_i form was the only functional form
satisfying all three conditions.
Theorem 2.3 The only H satisfying all three assumptions is of the form
H = -K Σ p_i log p_i,
where K is a positive constant.
Proof: Let H(1/n, 1/n, ..., 1/n) = A(n). From condition (3), a choice from s^m
equally likely possibilities can be decomposed into a series of m choices, each from s
equally likely possibilities, from which
A(s^m) = m A(s)
is obtained. Similarly,
A(t^n) = n A(t).
n can be chosen arbitrarily large and m found to satisfy
s^m ≤ t^n < s^(m+1).
Taking logarithms and dividing by n log s yields
m/n ≤ (log t)/(log s) < m/n + 1/n, or |m/n - (log t)/(log s)| < ε,
where ε is arbitrarily small. From the monotonic property of A(n),
A(s^m) ≤ A(t^n) ≤ A(s^(m+1)), so
m A(s) ≤ n A(t) ≤ (m + 1) A(s).
Dividing by n A(s),
m/n ≤ A(t)/A(s) ≤ m/n + 1/n, or |m/n - A(t)/A(s)| < ε.
Combining the two inequalities gives |A(t)/A(s) - (log t)/(log s)| ≤ 2ε, so that
A(t) = K log t,
where K must be positive to satisfy (2).
Now suppose there is a choice from n possibilities with commeasurable prob-
abilities p_i = n_i / Σ n_j, where the n_i are integers. A choice from Σ n_i equally
likely possibilities can be broken down into a choice from n possibilities with
probabilities p_1, ..., p_n and then, if the ith were chosen, a choice from n_i equally
likely possibilities. Using condition (3) again, the total choice from Σ n_i as
computed by the two methods is equated:
K log Σ n_i = H(p_1, ..., p_n) + K Σ p_i log n_i.
Hence
H = K [Σ p_i log Σ n_j - Σ p_i log n_i] = -K Σ p_i log(n_i / Σ n_j) = -K Σ p_i log p_i.
If the p_i are incommeasurable, they may be approximated by rationals and the
same expression must hold by the continuity assumption. Thus the expression
holds in general. The choice of coefficient K is a matter of convenience and
amounts to the choice of a unit of measure.
Entropy has its maximum value when all symbols are equally probable (that is, p_i = 1/n for all i), and has its minimum value of zero when the probability of one symbol occurring is one and the probability of all the others occurring is zero. [7]
Definition 2.4 The entropy H(X) of a discrete random variable X (Appendix
A, Definition A.l) is a measure of the uncertainty of the random variable, and
is defined by
H(X) = -\sum_{x \in X} p(x) \log p(x).
The above quantity can also be expressed as H(p). [2]
Remark 2.5 The entropy of X can also be interpreted as the expected value (Appendix A, Definition A.3) of \log(1/p(X)), where X is drawn according to the probability mass function p(x). Thus

H(X) = E_p \log \frac{1}{p(X)}.
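Definition 2.4 and the expectation form of Remark 2.5 are easy to check numerically. The following sketch computes the same value both ways; the distribution `p` is an arbitrary illustrative choice, not one from the text:

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def entropy_as_expectation(p):
    """Remark 2.5: H(X) as the expected value of log2(1/p(X))."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

p = [0.5, 0.25, 0.125, 0.125]      # illustrative distribution
print(entropy(p))                  # 1.75
print(entropy_as_expectation(p))   # 1.75, the same value
```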

The following lemmas are immediate consequences of Definition 2.4.
Lemma 2.6 H(X) > 0.
Proof: 0 \le p(x) \le 1 implies \log(1/p(x)) \ge 0.
Lemma 2.7
Hb(X) = (logba)Ha(X).
(Please refer to Appendix C.)

Proof:

\log_b p = \frac{\log_a p}{\log_a b} = (\log_b a) \log_a p.
Example 1. Entropy as a function of two probabilities (Figure 2.5). Let

X = 1 with probability p, 0 with probability 1 - p.

Then

H(X) = -p \log p - (1 - p) \log(1 - p) = H(p).

Using 2 as the base of the logarithm, H(X) = 1 when p = 1/2. Figure 2.5 illustrates
some of the basic properties of entropy. Entropy is a concave (Appendix B,
Definition B.2) function of the distribution and equals 0 when p equals 0 or 1.

Figure 2.5: The entropy function (base 2 log) of two probabilities.

This makes sense since there is no uncertainty when p equals 0 or 1. Uncertainty
is maximum when p equals 1/2, which also corresponds to the maximum value of
the entropy using base 2 logarithm.
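The properties just described (maximum of 1 bit at p = 1/2, zero at p = 0 and p = 1, symmetry of the curve) can be confirmed with a few lines; `binary_entropy` is our own illustrative name:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))                          # 1.0, the maximum
print(binary_entropy(0.0), binary_entropy(1.0))     # 0.0 0.0, no uncertainty
print(binary_entropy(0.25) == binary_entropy(0.75)) # True, the curve is symmetric
```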
Using Lemma 2.7 for the function p \log_2(1/p) (Figure 2.3, r = 2) and differentiating,

\frac{d}{dp} \left( p \log_2 \frac{1}{p} \right) = \frac{d}{dp} \left( p \log_e \frac{1}{p} \right) \log_2 e = \left( \log_e \frac{1}{p} - 1 \right) \log_2 e
= \log_2 \frac{1}{p} - \log_2 e.

From the last equation, it can be seen that the slope at p = 0 is infinite and that the maximum of the function p \log_2(1/p) occurs at p = 1/e.
Remark 2.8

\lim_{x \to 0} x \log_e \frac{1}{x} = 0.

Proof:

\lim_{x \to 0} x \log_e \frac{1}{x} = -\lim_{x \to 0} \frac{\log_e x}{1/x}.

Application of l'Hospital's rule yields

-\lim_{x \to 0} \frac{1/x}{-1/x^2} = \lim_{x \to 0} x = 0.
Theorem 2.9 (Fundamental Inequality)

\sum_{i=1}^{k} x_i \log_2 \frac{y_i}{x_i} \le 0,

where (x_i) and (y_i) are probability distributions, with equality if and only if x_i = y_i for all i.
Proof: Fitting the tangent line at the point (1, 0) on the \log_e x function (Figure 2.6), the slope is

\frac{d(\log_e x)}{dx} \Big|_{x=1} = 1,

Figure 2.6: Bound on the log_e(x) function.

so that the tangent line is

y - 0 = 1 (x - 1),
y = x - 1.

Therefore, for all x greater than 0,

\log_e x \le x - 1. (2.1)

Let x_i and y_i be two probability distribution functions, where, of course, \sum x_i = 1 and \sum y_i = 1, with x_i, y_i \ge 0. Now consider the following expression, using Lemma 2.7, involving both distributions:

\sum_i x_i \log_e \frac{y_i}{x_i}.

Using the relationship of Equation 2.1,

\sum_i x_i \log_e \frac{y_i}{x_i} \le \sum_i x_i \left( \frac{y_i}{x_i} - 1 \right) = \sum_i y_i - \sum_i x_i = 1 - 1 = 0.

Converting back to log base 2 yields the Fundamental Inequality,

\sum_{i=1}^{k} x_i \log_2 \frac{y_i}{x_i} \le 0,

or, equivalently,

\sum_{i=1}^{k} x_i \log_2 \frac{x_i}{y_i} \ge 0.

For the last two equations, equality holds only when all the x_i = y_i. The lefthand side of the last equation is also called the relative entropy or Kullback-Leibler distance (see also Section 2.2.2).
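The inequality can be spot-checked numerically; a minimal sketch, where the two distributions are arbitrary illustrative choices:

```python
import math

def kl_divergence(x, y):
    """D(x||y) = sum x_i log2(x_i / y_i); assumes y_i > 0 wherever x_i > 0."""
    return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

x = [0.8, 0.2]
y = [0.5, 0.5]
print(kl_divergence(x, y) > 0)   # True: the distributions differ
print(kl_divergence(x, x))       # 0.0: equality only when x_i = y_i
print(kl_divergence(y, x))       # a different positive value: D is not symmetric
```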
2.2.1 Joint Entropy and Conditional Entropy
Definition 2.10 The joint entropy H(X,Y) of a pair of discrete random
variables (X,Y) with a joint distribution p(x,y) is defined as
H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y). (2.2)
For the special case when X and Y are statistically independent (Appendix A, Definition A.5) for all i and j, the independence condition is expressed by

p(x_i, y_j) = p(x_i) p(y_j).

The joint entropy becomes

H(X, Y) = -\sum_{i=1}^{q} \sum_{j=1}^{s} p(x_i) p(y_j) \left[ \log p(x_i) + \log p(y_j) \right].

Since

\sum_{i=1}^{q} p(x_i) = 1 and \sum_{j=1}^{s} p(y_j) = 1,

H(X, Y) = H(X) + H(Y).
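The additivity H(X, Y) = H(X) + H(Y) under independence can be verified directly, since the joint distribution is then the outer product of the marginals; the two marginal distributions below are illustrative:

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

px = [0.5, 0.5]
py = [0.25, 0.75]
joint = [a * b for a in px for b in py]   # p(x, y) = p(x) p(y) under independence
print(abs(entropy(joint) - (entropy(px) + entropy(py))) < 1e-12)   # True
```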

Definition 2.11 If (X, Y) ~ p(x, y), then the conditional entropy H(Y|X) is defined as

H(Y|X) = \sum_{x \in X} p(x) H(Y|X = x)
       = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log p(y|x)
       = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y|x). (2.3)
Remark 2.12 If X and Y are independent, then

H(Y|X) = -\sum_{x \in X} \sum_{y \in Y} p(x) p(y) \log p(y)
       = -\sum_{y \in Y} p(y) \log p(y)
       = H(Y).
Theorem 2.13 (Chain Ride for Entropy)
Let Xi, X2,..., Xn be drawn according to p(xi,x2,.... xn). Then
H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1).

Proof: By repeated application of the two-variable expansion rule for entropies,

H(X_1, X_2) = H(X_1) + H(X_2|X_1),
H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3|X_1)
                = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1),
\vdots
H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_{n-1}, ..., X_1)
                = \sum_{i=1}^{n} H(X_i|X_{i-1}, ..., X_1).

Alternative proof: Write p(x_1, ..., x_n) = \prod_{i=1}^{n} p(x_i|x_{i-1}, ..., x_1), and evaluate

H(X_1, X_2, ..., X_n) = -\sum_{x_1, ..., x_n} p(x_1, ..., x_n) \log p(x_1, ..., x_n)
= -\sum_{x_1, ..., x_n} p(x_1, ..., x_n) \sum_{i=1}^{n} \log p(x_i|x_{i-1}, ..., x_1)
= -\sum_{i=1}^{n} \sum_{x_1, ..., x_n} p(x_1, ..., x_n) \log p(x_i|x_{i-1}, ..., x_1)
= \sum_{i=1}^{n} H(X_i|X_{i-1}, ..., X_1).
Theorem 2.14 (Chain Rule for Conditional Entropy)

H(X, Y) = H(X) + H(Y|X).

Proof:

H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y)
        = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x) p(y|x)
        = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x) - \sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y|x)
        = -\sum_{x \in X} p(x) \log p(x) - \sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y|x)
        = H(X) + H(Y|X).

Equivalently, \log p(X, Y) = \log p(X) + \log p(Y|X); taking the expectation of both sides of the equation yields the theorem.

Figure 2.7: Noisy communication channel, Example 2.
Example 2. Noisy communication channel.
The input source to a noisy communication channel, Figure 2.7, is a random

variable X (Appendix A, Definition A.l) over the four symbols a, b, c, d. The
output from the channel is a random variable Y over the same four symbols.
The joint distribution (Appendix A, Definition A.4) of X and Y is given by
p(y, x)   x=a     x=b     x=c     x=d
y=a       1/8     1/16    1/16    1/4
y=b       1/16    1/8     1/16    0
y=c       1/32    1/32    1/16    0
y=d       1/32    1/32    1/16    0
The marginal distribution (Appendix A, Definition A.4) for X is (1/4, 1/4, 1/4, 1/4).
The marginal entropy of X, H(X), is

H(X) = -\sum_i p(x_i) \log_2 p(x_i) = -4 \left( \frac{1}{4} \right) (\log_2 1 - \log_2 4)
     = 2 bits.
Since base 2 log is used, the resulting unit of information is called a bit (binary
digit). If base e is used, the unit of information is called a nat. If base 10 is used,
the unit is called a Hartley, after R. V. L. Hartley, the first person to propose
the use of the logarithmic measure of information.

The marginal distribution for Y is (1/2, 1/4, 1/8, 1/8).
The marginal entropy of Y, H(Y), is
H(Y) = -\sum_i p(y_i) \log_2 p(y_i)
     = -\left[ \frac{1}{2} (\log_2 1 - \log_2 2) + \frac{1}{4} (\log_2 1 - \log_2 4) + 2 \cdot \frac{1}{8} (\log_2 1 - \log_2 8) \right]
     = \frac{1}{2} + \frac{1}{2} + \frac{3}{4}
     = \frac{7}{4} bits.
The joint entropy, H(X, Y), of X and Y in bits is the sum of -p \log p (Section 2.2.1, Equation 2.2) over all 16 probabilities in the joint distribution:

H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)
        = -\left[ \frac{1}{4} (\log_2 1 - \log_2 4) + 2 \cdot \frac{1}{8} (\log_2 1 - \log_2 8) + 6 \cdot \frac{1}{16} (\log_2 1 - \log_2 16) + 4 \cdot \frac{1}{32} (\log_2 1 - \log_2 32) \right]
        = \frac{1}{2} + \frac{3}{4} + \frac{3}{2} + \frac{5}{8}
        = \frac{27}{8} bits.
The conditional entropy H(Y\X) (Section 2.2.1, Equation 2.3) is
H(Y|X) = \sum_{x \in X} p(x) H(Y|X = x)
       = \frac{1}{4} H(1/2, 1/4, 1/8, 1/8) + \frac{1}{4} H(1/4, 1/2, 1/8, 1/8) + \frac{1}{4} H(1/4, 1/4, 1/4, 1/4) + \frac{1}{4} H(1, 0, 0, 0)
       = \frac{1}{4} \cdot \frac{7}{4} + \frac{1}{4} \cdot \frac{7}{4} + \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 0
       = \frac{11}{8} bits.
This example is continued in Section 2.3, where the mutual information for this
example is calculated.
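All of the quantities computed above can be verified mechanically from the joint distribution table; a short sketch:

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

# Joint distribution of Example 2: rows are y = a, b, c, d; columns are x = a, b, c, d.
joint = [[1/8,  1/16, 1/16, 1/4],
         [1/16, 1/8,  1/16, 0],
         [1/32, 1/32, 1/16, 0],
         [1/32, 1/32, 1/16, 0]]

px = [sum(row[j] for row in joint) for j in range(4)]   # marginal of X: (1/4, 1/4, 1/4, 1/4)
py = [sum(row) for row in joint]                        # marginal of Y: (1/2, 1/4, 1/8, 1/8)
H_X = entropy(px)
H_Y = entropy(py)
H_XY = entropy([v for row in joint for v in row])
print(H_X)          # 2.0 bits
print(H_Y)          # 1.75 bits, i.e. 7/4
print(H_XY)         # 3.375 bits, i.e. 27/8
print(H_XY - H_X)   # 1.375 bits, i.e. 11/8 = H(Y|X)
```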
2.2.2 Relative Entropy
Definition 2.15 The relative entropy or Kullback-Leibler distance be-
tween two probability mass functions p(x) and q(x) is defined as
D(p||q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)}, (2.4)

where E denotes expectation.
Definition 2.16 The conditional relative entropy, D(p(y|x)||q(y|x)), is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x). More precisely,

D(p(y|x)||q(y|x)) = \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)} = E_{p(x,y)} \log \frac{p(Y|X)}{q(Y|X)},

where E denotes expectation.
Theorem 2.17 (Chain Rule for Relative Entropy)
D{p(x,y)\\q{x,y)) = D(p(x)\\q{x)) + D{p(y\x)\\q(y\x)).
Proof:

D(p(x, y)||q(x, y)) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{q(x, y)}
= \sum_x \sum_y p(x, y) \log \frac{p(x) p(y|x)}{q(x) q(y|x)}
= \sum_x \sum_y p(x, y) \log \frac{p(x)}{q(x)} + \sum_x \sum_y p(x, y) \log \frac{p(y|x)}{q(y|x)}
= D(p(x)||q(x)) + D(p(y|x)||q(y|x)).
Theorem 2.18 (Jensen's Inequality)
If f is a convex function (Appendix B, Definition B.1) and X is a random variable, then

E f(X) \ge f(E X),

where E denotes expectation. If f is strictly convex, then equality implies that X = EX with probability 1, that is, X is a constant.

Proof: For a two mass point distribution, the inequality becomes

p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2),

which follows directly from the definition of convex functions. Suppose that the theorem is true for distributions with k - 1 mass points. Then writing p_i' = p_i / (1 - p_k) for i = 1, 2, ..., k - 1,

\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i)
\ge p_k f(x_k) + (1 - p_k) f\left( \sum_{i=1}^{k-1} p_i' x_i \right)
\ge f\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \right)
= f\left( \sum_{i=1}^{k} p_i x_i \right),

where the first inequality follows from the induction hypothesis and the second follows from the definition of convexity.
Theorem 2.19 (Log Sum Inequality)
For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},

with equality if and only if a_i / b_i = constant.

Proof: Assume without loss of generality that a_i > 0 and b_i > 0 (if a_i = b_i = 0, the term contributes zero to the sum). The function f(t) = t \log t is strictly convex, since f''(t) = \frac{1}{t} \log e > 0 for all positive t. Hence, by Jensen's inequality (Theorem 2.18),

\sum_i \alpha_i f(t_i) \ge f\left( \sum_i \alpha_i t_i \right)

for \alpha_i \ge 0, \sum_i \alpha_i = 1. Setting \alpha_i = b_i / \sum_j b_j and t_i = a_i / b_i,

\sum_i \frac{a_i}{\sum_j b_j} \log \frac{a_i}{b_i} \ge \frac{\sum_i a_i}{\sum_j b_j} \log \frac{\sum_i a_i}{\sum_j b_j}

is obtained, which is the log sum inequality.
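A quick numerical spot check of the log sum inequality and its equality condition; the numbers a_i and b_i are arbitrary illustrative choices:

```python
import math

def lhs(a, b):
    """Left-hand side: sum a_i log2(a_i / b_i)."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def rhs(a, b):
    """Right-hand side: (sum a_i) log2(sum a_i / sum b_i)."""
    return sum(a) * math.log2(sum(a) / sum(b))

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]
print(lhs(a, b) >= rhs(a, b))   # True

# Equality holds when a_i / b_i is constant:
b_prop = [0.5, 1.0, 1.5]        # a_i / b_prop_i = 2 for every i
print(abs(lhs(a, b_prop) - rhs(a, b_prop)) < 1e-12)   # True
```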
Theorem 2.20 D(p||q) is convex in the pair (p, q); that is, if (p_1, q_1) and (p_2, q_2) are two pairs of probability mass functions, then

D(\lambda p_1 + (1 - \lambda) p_2 || \lambda q_1 + (1 - \lambda) q_2) \le \lambda D(p_1||q_1) + (1 - \lambda) D(p_2||q_2) (2.5)

for all 0 \le \lambda \le 1.

Proof: The log sum inequality is applied to a term on the lefthand side of 2.5, that is,

(\lambda p_1(x) + (1 - \lambda) p_2(x)) \log \frac{\lambda p_1(x) + (1 - \lambda) p_2(x)}{\lambda q_1(x) + (1 - \lambda) q_2(x)}
\le \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)} + (1 - \lambda) p_2(x) \log \frac{(1 - \lambda) p_2(x)}{(1 - \lambda) q_2(x)}
= \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1 - \lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}.

Summing this over all x, the desired property is obtained.
Theorem 2.21 (Information Inequality)
Let p(x), q(x), x \in X, be two probability distribution functions. Then

D(p||q) \ge 0,

with equality if and only if

p(x) = q(x) for all x.

Proof: Let A = \{x : p(x) > 0\} be the support of p(x). Then

-D(p||q) = -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}
         = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)}
         \le \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} (2.6)
         = \log \sum_{x \in A} q(x)
         \le \log \sum_{x \in X} q(x)
         = \log 1
         = 0,

where 2.6 follows from Theorem 2.18. Since \log t is a strictly concave function of t, there is equality in 2.6 if and only if q(x)/p(x) = 1 everywhere, that is, p(x) = q(x). Hence D(p||q) = 0 if and only if p(x) = q(x) for all x.
Theorem 2.22 H(X) \le \log |X|, where |X| denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution over the range.

Proof: Let u(x) = 1/|X| be the uniform probability mass function over the range of X, and let p(x) be the probability mass function (Definition A.2, Appendix A) for X. Then

D(p||u) = \sum p(x) \log \frac{p(x)}{u(x)} = \log |X| - H(X).

By the non-negativity of relative entropy,

0 \le D(p||u) = \log |X| - H(X).
Corollary 2.23

D(p(y|x)||q(y|x)) \ge 0,

with equality if and only if p(y|x) = q(y|x) for all y and x with p(x) > 0.
2.2.3 Concavity of Entropy
Theorem 2.24 (Concavity of entropy)
H(p) is a concave (Appendix B, Theorem B.l) function of p.
Proof:

H(p) = \log |X| - D(p||u),

where u is the uniform distribution on |X| outcomes. The concavity of H then follows directly from the convexity of D.
Alternative Proof: Let X_1 be a random variable with distribution p_1 taking on values in a set A. Let X_2 be another random variable with distribution p_2 on the same set. Let

\theta = 1 with probability \lambda, \theta = 2 with probability 1 - \lambda.

Let Z = X_\theta. Then the distribution of Z is \lambda p_1 + (1 - \lambda) p_2. Since conditioning reduces entropy (Theorem 2.31), we have

H(Z) \ge H(Z|\theta),

or equivalently,

H(\lambda p_1 + (1 - \lambda) p_2) \ge \lambda H(p_1) + (1 - \lambda) H(p_2),

which proves the concavity of the entropy as a function of the distribution.
2.2.4 Entropy of a Signaling System
Consider again the communication channel of Section 2.1. If I(a_i) units of information are obtained when the symbol a_i is received then, on the average, since p(a_i) is the probability of getting the information I(a_i), for each symbol a_i

p(a_i) I(a_i) = p(a_i) \log \frac{1}{p(a_i)}

units of information will be received. From this it follows that, on the average, over the entire alphabet of symbols a_i,

\sum_i p(a_i) \log \frac{1}{p(a_i)}

will be received. This important quantity is the entropy of the signaling system, A, having symbols a_i with probabilities p(a_i):

H(A) = \sum_i p(a_i) \log \frac{1}{p(a_i)}.

2.2.5 Entropy and Kolmogorov Complexity
The Kolmogorov complexity is approximately equal to the Shannon entropy H. Entropy is often called information in that it is a bound on the length of the code that is required to transmit a message.
Kolmogorov complexity (also called minimum description length principle,
Chaitin complexity, descriptive complexity, shortest program length, and al-
gorithmic entropy) is related to communication, an application of information
theory. If the sender and receiver agree upon a specification method, such as
an encoding or compression technique, then a message a can be transmitted as
b and decoded given some fixed method L, denoted L(b) = a. The cost of the
transmission of a is the length of the transmitted message, b. The least cost is
the minimum length of such a message. The minimum length is the entropy of a
given the method of transmission. The complexity of the minimum description
length is universal, that is, computer independent.
Definition 2.25 The Kolmogorov complexity of a binary string a, K(a), is defined as the length of the shortest binary program b for computing the string:

K(a) = \min \{ \mathrm{length}(b) : U(b) = a \},

where U represents the abstract universal Turing computer that can implement any algorithm and compute any computable function.
Remark 2.26 Kolmogorov complexity is reminiscent of Ockham's Razor, a principle formulated by William of Ockham (1300s) stating that terms, concepts, assumptions, etc., must not be multiplied beyond necessity. According to [2], Einstein allegedly said, "Everything should be made as simple as possible, but no simpler."
Theorem 2.27 (The Fundamental Theorem for a Noiseless Channel) Let a source have entropy H (bits per symbol) and a channel have a capacity C (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate C/H - \epsilon symbols per second over the channel, where \epsilon is arbitrarily small. It is not possible to transmit at an average rate greater than C/H.
Proof: The converse part of the theorem, that C/H cannot be exceeded, may be proven by noting that the entropy of the channel input per second is equal to that of the source, since the transmitter must be non-singular, and also this entropy cannot exceed the channel capacity. Hence H' \le C and the number of symbols per second equals H'/H \le C/H.

To prove the first part of the theorem, consider the set of all sequences of N symbols produced by the source. For large N, the sequences can be divided into two groups, one containing less than 2^{(H+\eta)N} members and the second containing less than 2^{RN} members (where R is the logarithm of the number of different symbols) and having a total probability less than \mu. As N increases, \eta and \mu approach zero. The number of signals of duration T in the channel is greater than 2^{(C-\theta)T}, with \theta small when T is large. If

T = \left( \frac{H}{C} + \lambda \right) N

is chosen, then there will be a sufficient number of sequences of channel symbols for the high probability group when N and T are sufficiently large (however small

\lambda) and also some additional ones. The high probability group is coded in an arbitrary one-to-one way into this set. The remaining sequences are represented by larger sequences, starting and ending with one of the sequences not used for the high probability group. This special sequence acts as a start and stop signal for a different code. In between, a sufficient time is allowed to give enough different sequences for all the low probability messages. This will require

T_1 = \left( \frac{R}{C} + \varphi \right) N,

where \varphi is small. The mean rate of transmission in message symbols per second will then be greater than

\left[ (1 - \delta) \frac{T}{N} + \delta \frac{T_1}{N} \right]^{-1} = \left[ (1 - \delta) \left( \frac{H}{C} + \lambda \right) + \delta \left( \frac{R}{C} + \varphi \right) \right]^{-1}.

As N increases, \delta, \lambda and \varphi approach zero and the rate approaches C/H.
Remark 2.28 If a source can only produce one particular message, its entropy
is zero and no channel is required.
2.3 Mutual Information
Mutual information is a measure of the amount of information that one random
variable contains about another random variable. It is the reduction in the
uncertainty of one random variable due to the knowledge of the other.
C. E. Shannon first presented the functional form of mutual information
in 1948. He defined it as the rate of transmission in a noisy communication
channel between source and receiver. [15]

Consider a communication channel where the input symbols are a_i and the output symbols are b_j. The channel is defined by the conditional probabilities P(b_j|a_i). Prior to reception, the probability of the input symbol a_i was p(a_i). This is the a priori probability of a_i. After reception of b_j, the probability that the input symbol was a_i becomes P(a_i|b_j), the conditional probability that we sent a_i given that we received b_j. This is an a posteriori probability of a_i. The change in the probability measures how much the receiver learned from the reception of the b_j. In an ideal channel with no noise, the a posteriori probability is 1, since we are certain from the received b_j exactly what was sent. In practical systems there are finite nonzero probabilities that errors will occur and the receiver cannot be absolutely sure what was sent. The difference between the information uncertainty before (the a priori probabilities) and after (the a posteriori probabilities) reception of a b_j measures the gain in information due to the reception of the b_j. This information is called the mutual information and is defined as
Definition 2.29 (Mutual Information)

I(a_i, b_j) = \log \frac{P(a_i|b_j)}{p(a_i)}.

If p(a_i) equals P(a_i|b_j), then no information has been gained and the mutual information is zero, that is, no information has been transmitted. Only when something new has been learned about the probabilities of the a_i from the received b_j is the mutual information positive.

Multiplying the numerator and denominator of the rightmost log term by p(b_j) yields

\frac{P(a_i|b_j)}{p(a_i)} = \frac{P(a_i|b_j) p(b_j)}{p(a_i) p(b_j)} = \frac{P(a_i, b_j)}{p(a_i) p(b_j)} = \frac{P(b_j|a_i) p(a_i)}{p(a_i) p(b_j)} = \frac{P(b_j|a_i)}{p(b_j)},

so that

I(a_i, b_j) = \log \frac{P(a_i, b_j)}{p(a_i) p(b_j)}. (2.7)

From the symmetry of the a_i and b_j,

I(a_i, b_j) = \log \frac{P(a_i, b_j)}{p(a_i) p(b_j)} = \log \frac{P(b_j|a_i)}{p(b_j)} = I(b_j, a_i),

that is,

I(a_i, b_j) = I(b_j, a_i).
Also,

I(a_i, b_j) \le I(a_i).

This follows from Definition 2.1,

I(a_i) = \log \frac{1}{p(a_i)},

since

I(a_i, b_j) = \log P(a_i|b_j) + I(a_i)
            = \log P(a_i|b_j) + \log \frac{1}{p(a_i)}
            = \log P(a_i|b_j) - \log p(a_i)
            = \log \frac{P(a_i|b_j)}{p(a_i)}. (2.8)

Since the maximum for the P(a_i|b_j) is 1,

I(a_i, b_j) \le I(a_i),

where equality occurs when P(a_i|b_j) = 1.
If a_i and b_j are independent, that is, if

P(a_i|b_j) = p(a_i), or equivalently
P(a_i, b_j) = p(a_i) p(b_j), then

I(a_i, b_j) = 0. (2.9)

This follows from 2.7 and 2.8 above.
Because of noise, the behavior of a channel can be understood only on
the average. Therefore, the mutual information must be averaged over the
alphabets, A and B, using the appropriate probabilities.
I(A, b_j) = \sum_i P(a_i|b_j) \log \frac{P(a_i|b_j)}{p(a_i)},

I(a_i, B) = \sum_j P(b_j|a_i) \log \frac{P(b_j|a_i)}{p(b_j)}.

These two equations are called the conditional mutual information. I(A, b_j) measures the information gain about the alphabet A provided by the reception of b_j. I(a_i, B) is the information gain about the alphabet B given that a_i was sent. Finally,

I(A, B) = \sum_i p(a_i) I(a_i, B) = \sum_i \sum_j P(a_i, b_j) \log \frac{P(a_i, b_j)}{p(a_i) p(b_j)} = \sum_j p(b_j) I(A, b_j) (2.10)
by symmetry. This equation, the system mutual information, provides a mea-
sure of the information gain of the whole system and does not depend on the
individual input and output symbols but only on their frequencies. From Equa-
tion 2.10, it can be seen that the mutual information I(B,A) is the relative
entropy (Equation 2.4) between the joint distribution, P(a.;,6j), and the prod-
uct p(a,i)p(bj).
The system mutual information has the properties

I(A, B) \ge 0, with equality if and only if A and B are independent (Equation 2.9); (2.11)

I(A, B) = I(B, A) (Equation 2.10).
The proof of Equation 2.11 lies in the following corollary to Theorem 2.21.
Corollary 2.30 (Non-negativity of mutual information)
For any two random variables (alphabets, for example), A and B,

I(A, B) \ge 0,

with equality if and only if A and B are independent.

Proof: I(A, B) = D(p(a, b)||p(a) p(b)) \ge 0, with equality if and only if p(a, b) = p(a) p(b), that is, A and B are independent.
Theorem 2.31 (Conditioning reduces entropy)
H(A|B) \le H(A),

with equality if and only if A and B are independent.

Proof:

0 \le I(A, B) = H(A) - H(A|B).
Remark 2.32 Intuitively, the theorem says that knowing another random variable B can only reduce the uncertainty in A. This is true only on average. Specifically, H(A|B = b) may be greater than or less than or equal to H(A), but on the average H(A|B) = \sum_b p(b) H(A|B = b) \le H(A). For example, in a court case, specific new evidence might increase uncertainty, but on the average evidence decreases uncertainty.
The various entropies can be related to each other by the following algebraic manipulations.

Figure 2.8: Relationship between entropy and mutual information. The mutual information I(A, B) corresponds to the intersection of the information in A with the information in B.
I(A, B) = \sum_{i=1}^{q} \sum_{j=1}^{s} P(a_i, b_j) \log \frac{P(a_i, b_j)}{p(a_i) p(b_j)}
= \sum_{i=1}^{q} \sum_{j=1}^{s} P(a_i, b_j) \left[ \log P(a_i, b_j) - \log p(a_i) - \log p(b_j) \right]
= -\sum_i \sum_j P(a_i, b_j) \log \frac{1}{P(a_i, b_j)} + \sum_i p(a_i) \log \frac{1}{p(a_i)} + \sum_j p(b_j) \log \frac{1}{p(b_j)}
= H(A) + H(B) - H(A, B) \ge 0. (2.12)
From Theorem 2.14,
H(A,B) = H(A) + H{B\A)
= H(B) + H{A\B),

and the following results are obtained:

I(A, B) = H(A) + H(B) - H(A, B) (2.13)
        = H(A) - H(A|B) \ge 0 (2.14)
        = H(B) - H(B|A) \ge 0, (2.15)

from which

0 \le H(A|B) \le H(A) and 0 \le H(B|A) \le H(B),

so that I(A, B) \le \min\{H(A), H(B)\} \le H(A, B). [22]
Theorem 2.33 (Concavity of Mutual Information)
Let (A, B) ~ p(a, b) = p(a) p(b|a). The mutual information I(A, B) is a concave function of p(a) for fixed p(b|a) and a convex function of p(b|a) for fixed p(a).

Proof: To prove the first part, expand the mutual information

I(A, B) = H(B) - H(B|A) = H(B) - \sum_a p(a) H(B|A = a).
If p(b\a) is fixed, then p(b) is a linear function of p(a) (Appendix A, Equation
A.3). Hence H(B), which is a concave function of p(b), is a concave function
of p(a). The second term is a linear function of p(a). Hence the difference is a
concave function of p(a).

To prove the second part, fix p(a) and consider two different conditional distributions p_1(b|a) and p_2(b|a). The corresponding joint distributions are p_1(a, b) = p(a) p_1(b|a) and p_2(a, b) = p(a) p_2(b|a), and their respective marginals are p(a), p_1(b), and p(a), p_2(b). Consider a conditional distribution

p_\lambda(b|a) = \lambda p_1(b|a) + (1 - \lambda) p_2(b|a)

that is a mixture of p_1(b|a) and p_2(b|a). The corresponding joint distribution is also a mixture of the corresponding joint distributions,

p_\lambda(a, b) = \lambda p_1(a, b) + (1 - \lambda) p_2(a, b),

and the distribution of B is also a mixture,

p_\lambda(b) = \lambda p_1(b) + (1 - \lambda) p_2(b).

Hence, if q_\lambda(a, b) = p(a) p_\lambda(b) denotes the product of the marginal distributions, then

q_\lambda(a, b) = \lambda q_1(a, b) + (1 - \lambda) q_2(a, b).

Since the mutual information is the relative entropy between the joint distribution and the product of the marginals, that is,

I(A, B) = D(p_\lambda||q_\lambda),

and relative entropy D(p||q) is a convex function of (p, q), it follows that the mutual information is a convex function of the conditional distribution.

Definition 2.34 The conditional mutual information of random variables X and Y given Z is defined by

I(X, Y|Z) = H(X|Z) - H(X|Y, Z) = E_{p(x, y, z)} \log \frac{p(X, Y|Z)}{p(X|Z) p(Y|Z)}.
Mutual information also satisfies a chain rule.
Theorem 2.35 (Chain Rule for Information)

I(X_1, X_2, ..., X_n, Y) = \sum_{i=1}^{n} I(X_i, Y | X_{i-1}, X_{i-2}, ..., X_1).

Proof:

I(X_1, X_2, ..., X_n, Y) = H(X_1, X_2, ..., X_n) - H(X_1, X_2, ..., X_n|Y)
= \sum_{i=1}^{n} H(X_i|X_{i-1}, ..., X_1) - \sum_{i=1}^{n} H(X_i|X_{i-1}, ..., X_1, Y)
= \sum_{i=1}^{n} I(X_i, Y|X_1, X_2, ..., X_{i-1}).
Example 3. Example 2, Section 2.2.1, continued.
The mutual information between the two random variables, X and Y, in bits
(Appendix C), using some of the answers obtained in Example 2, can be found
several ways (Equations 2.13, 2.14, and 2.15):
I(X, Y) = H(Y) - H(Y|X) = \frac{7}{4} - \frac{11}{8} = \frac{3}{8} bits.

I(X, Y) = H(X) - H(X|Y) = 2 - \frac{13}{8} = \frac{3}{8} bits,

where H(X|Y) = H(X, Y) - H(Y) = \frac{27}{8} - \frac{14}{8} = \frac{13}{8} bits.

I(X, Y) = H(X) + H(Y) - H(X, Y) = 2 + \frac{7}{4} - \frac{27}{8} = \frac{3}{8} bits.
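The three equivalent computations can be confirmed from the joint distribution table of Example 2; a minimal sketch:

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

# Joint distribution of Example 2 (rows y, columns x).
joint = [[1/8,  1/16, 1/16, 1/4],
         [1/16, 1/8,  1/16, 0],
         [1/32, 1/32, 1/16, 0],
         [1/32, 1/32, 1/16, 0]]
H_X = entropy([sum(row[j] for row in joint) for j in range(4)])
H_Y = entropy([sum(row) for row in joint])
H_XY = entropy([v for row in joint for v in row])

i1 = H_Y - (H_XY - H_X)    # H(Y) - H(Y|X)
i2 = H_X - (H_XY - H_Y)    # H(X) - H(X|Y)
i3 = H_X + H_Y - H_XY      # H(X) + H(Y) - H(X,Y)
print(i1, i2, i3)          # 0.375 0.375 0.375, i.e. 3/8 bits each way
```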
3. Mutual Information Theory Applied to Image Registration
The mutual information measure for image registration has proven to be very
robust and has resulted in fully automated 3D-to-3D rigid-body registration
algorithms that are now in use [7, 19, 20, 21].
The use of mutual information as a basis for a registration metric was pro-
posed independently by Collignon et al [1] and the MIT group [25] in 1995.
Mutual information for image registration is defined (Equations 2.13, 2.14, and
2.15) as:
I(A(x), T(B(x))) = H(A(x)) + H(T(B(x))) - H(A(x), T(B(x)))
                 = H(A(x)) - H(A(x)|T(B(x)))
                 = H(T(B(x))) - H(T(B(x))|A(x)),
where A(x) is a model, or reference, image and T(B(x)) is a transformation of
the test image. The images, A and B, are matrices of pixel (picture element)
values, x.
Mutual information measures the joint entropy, H(A(x),T(B(x))), of the
model image, A, and test image, B. At optimal alignment, the joint entropy is
minimized with respect to the entropy of the overlapping part of the individual
images, so that the mutual information is maximized. Minimizing the joint
entropy minimizes the conditional entropy,
H(A(x)|T(B(x))) = H(A(x), T(B(x))) - H(T(B(x))),

since the entropy of the model is fixed.
The image B can be considered as having two parts: the part that is similar to the model A, denoted B_m, and the part that cannot be related to the model, B'.

B_m = \{B(x) : T^{-1}(A(x)) is defined\},    B' = B - B_m,

I(A, B) = H(A) + H(B) - H(A, B)
        = H(A) + H(B_m, B') - H(A, B_m, B')
        = H(A) + H(B_m) + H(B') - H(A, B_m) - H(B')
        = H(A) + H(B_m) - H(A, B_m). (3.1)

The assumption is that B' is independent of A and B_m. Therefore, discarding
pixels such as those representing background, removes them from the analysis
and maximizes the mutual information. Equation 3.1 shows that maximizing
the entropy of the modeled part of the image, H(Bm), and minimizing the joint
entropy, H(A,Bm), maximizes the mutual information I(A,B). [23]
For image registration, A and B are two image data sets. Entropy is calcu-
lated using the images' histograms or probability density functions (Appendix
A, Definition A.2). Each point in one image, and its associated intensity, will
correspond to a point, with its respective intensity, in the other. Scatter plots
of these image intensities can be generated point by point. These are two-
dimensional plots of image intensity of one image against corresponding image

Figure 3.1: The plot on the left is the joint histogram of two randomly gener-
ated 150 x 150 matrices taking on values from 1 to 256. The plot on the right
is the plot of the joint histogram of one matrix with itself.
intensity of the other. The resulting plot, a two-dimensional histogram, is called
a joint intensity histogram. When divided by the number of contributing pixels,
it is equal to the joint probability distribution (Appendix A, Definition A.4).
For each pair of image intensities in the two images, the joint probability distri-
bution provides a number equal to the probability that those intensities occur
together at corresponding locations in the two images. To illustrate the idea of
a joint histogram, refer to Figure 3.1. The plot on the left is the joint histogram
of two randomly generated 150 x 150 matrices taking on values from 1 to 256.
The plot on the right is the plot of the joint histogram of one matrix with itself.
Therefore, it is diagonal as should be expected for identical matrices.
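The joint histograms of Figure 3.1 can be reproduced in a few lines with NumPy's standard `histogram2d` routine; the matrix size and intensity range follow the figure, while everything else is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(1, 257, size=(150, 150))   # random image, intensities 1..256
b = rng.integers(1, 257, size=(150, 150))   # a second, unrelated random image

def joint_histogram(img1, img2, bins=256):
    """2D histogram of corresponding pixel intensities in two images."""
    h, _, _ = np.histogram2d(img1.ravel(), img2.ravel(),
                             bins=bins, range=[[1, 257], [1, 257]])
    return h

h_ab = joint_histogram(a, b)   # diffuse: the two images are unrelated
h_aa = joint_histogram(a, a)   # concentrated on the diagonal: identical images
off_diagonal = h_aa - np.diag(np.diag(h_aa))
print(np.count_nonzero(off_diagonal))   # 0: all mass lies on the diagonal
```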
Joint entropy measures the amount of information in the combined images.
If A and B are totally unrelated, then the joint entropy will be the sum of the
entropies of the individual images (refer to Equation 2.12 and Figures 3.2 and

Figure 3.2: Relationship between entropy and mutual information. The mutual information I(A, B) corresponds to the intersection of the information in A with the information in B.

Figure 3.3: Venn diagram of the relationship between joint entropy and mutual information of totally unrelated images (I(A, B) = 0).


Figure 3.4: Venn diagram of the relationship between joint entropy and mutual information of totally related images (I(A, B) = H(B) or I(A, B) = H(A)).
3.3). Therefore, there will be no mutual information. Conversely, if A and B
are completely dependent, that is, knowledge of the outcome of an event in A
gives exact knowledge of the outcome of an event in B, then the joint entropy
of the two images is equivalent to the entropy of the image A, so the mutual
information is H(B). If knowledge of the outcome of an event in B gives exact
knowledge of the outcome of an event in A, then the joint entropy of the two
images is equivalent to the entropy of the image B, so the mutual information is
H(A) (refer to Figure 3.4). The more closely related, that is, the less independent, the images are, the smaller the joint entropy is compared to the sum of the individual entropies. That is,
I(A, B) = H(A) + H(B) - H(A, B).
This equation shows that maximizing the entropy of the test image, H(B), and minimizing the joint entropy, H(A, B), maximizes the mutual information, I(A, B).

Figure 3.5: The image data sets, A and B.
Minimizing the joint entropy, calculated from the joint intensity histogram, was proposed by Studholme et al [18] and Collignon [1] as a basis for a registration method. However, joint entropy on its own does not provide a robust measure of image alignment, as it is often possible to find alternative alignments that result in much lower joint entropy. For example, mutual information, which is given by the difference between the sum of the entropies of the individual images at overlap and the joint entropy of the combined images, works better than simply joint entropy in regions of image background (low contrast of neighboring pixels) where there will be low joint entropy, but this is offset by low individual entropies as well, so that the overall mutual information will be low.
Example 4. A sample calculation of the mutual information between two matrices (image data sets), Figure 3.5. Let

A = | 3 1 |    and    B = | 2 3 |
    | 2 2 |               | 2 2 |.

Figure 3.6: The joint probability distribution of Example 4 as a surface.

Determine the mutual information between the matrices A and B. That is, find

I(A, B) = \sum_i \sum_j P(a_i, b_j) \log \frac{P(a_i, b_j)}{p(a_i) p(b_j)} (Equation 2.10).

The table

p(a, b)   a=1    a=2    a=3    p(b_j)
b=1       0      0      0      0
b=2       0      1/2    1/4    3/4
b=3       1/4    0      0      1/4
p(a_i)    1/4    1/2    1/4

is the joint probability distribution, where the rightmost column and bottom row consist of the marginal probabilities, p(b_j) and p(a_i), for the values b and a, respectively, and the nonzero values are the P(a_i, b_j). Calculation of the mutual
information yields
I(A, B) = \frac{1}{2} \log \frac{1/2}{3/8} + \frac{1}{4} \log \frac{1/4}{3/16} + \frac{1}{4} \log \frac{1/4}{1/16}
        = \frac{1}{2} \log \frac{4}{3} + \frac{1}{4} \log \frac{4}{3} + \frac{1}{4} \log 4
        \approx .5(.2877) + .25(.2877) + .25(1.3863)
        = .5623 nats.
It might be tempting to think that if image A were rotated 90 degrees as in Figure 3.7, then a greater value for mutual information would result, since 3 out of 4 pixel values would match in intensity. However, that would not be the case. The
mutual information would be the same. This technique makes no assumption
about the relationship between image intensities in the images to be registered.

Figure 3.7: The image data sets A, rotated -90 degrees, and B.
Rotating A by 90 degrees yields

A = | 2 3 |    and    B (unchanged) = | 2 3 |
    | 2 1 |                           | 2 2 |.
The joint probability distribution becomes
p(a, b)   a=1    a=2    a=3    p(b_j)
b=1       0      0      0      0
b=2       1/4    1/2    0      3/4
b=3       0      0      1/4    1/4
p(a_i)    1/4    1/2    1/4
and the calculation of the mutual information yields .5623 nats as in the case
prior to the rotation of the image A.
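Both calculations in Example 4 can be reproduced with a small routine that tabulates the joint and marginal distributions from corresponding pixels and sums P log(P/(p_A p_B)); the natural logarithm gives the answer in nats. The helper `mutual_information` is our own illustrative name:

```python
import math
from collections import Counter

def mutual_information(A, B):
    """MI in nats between two equally sized matrices of discrete intensities."""
    pairs = list(zip(sum(A, []), sum(B, [])))   # corresponding pixel pairs
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in joint.items())

A = [[3, 1], [2, 2]]
B = [[2, 3], [2, 2]]
A_rot = [[2, 3], [2, 1]]   # A after the rotation of Figure 3.7
print(round(mutual_information(A, B), 4))       # 0.5623 nats
print(round(mutual_information(A_rot, B), 4))   # 0.5623 nats, unchanged
```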

Figure 3.8: The joint probability distribution as a surface of Example 4 with image A rotated 90 degrees as in Figure 3.7.

4. Mutual Information-Based Registration of Digitally
Reconstructed Radiographs (DRRs) and Electronic Portal Images
Electronic portal imaging devices (EPIDs) provide real-time digital images that
allow for a time-efficient patient repositioning before radiation treatment. A

Figure 4.2: Typical aligned pelvis DRR [12].
Figure 4.3: Typical aligned pelvis EPI [12].

Figure 4.4: A misaligned EPI and the DRR automatically rotated to the EPI
position [12].
Figure 4.5: Joint histogram of misaligned images. Referring to the rightmost
plot, the x- and y- axes represent the range of intensities in the images. The
z-axis represents the probability that an intensity in one image will occur with
an intensity in the other. It is a surface plot of the probability values represented
in the plot on the left. [12]

Figure 4.6: The result after the automated registration process [12].
physician can decide, based on these portal images, if the treatment field is placed correctly relative to the patient. A preliminary patient positioning can be carried out based on external markers, either on the patient's skin or on a mask system. The position of the target region (tumor) and the organs at risk (spinal column, brain, etc.) is not fixed relative to these markers. Therefore, an image of the patient's anatomy prior to treatment is needed. A possible misalignment can then be corrected by repositioning the patient on the treatment couch. For example, portal images can be compared to digitally reconstructed radiographs (DRRs), generated from the planning CT (computed tomography).
DRRs are computed by integrating (summing intensities) along rays through the
CT volume to simulate the process of perspective projection in x-ray imaging.
To formulate DRR/EPI registration as a mutual information-based image
registration problem, the joint image histogram is used as the joint probabil-
ity density function. The transformation, T, representing in-plane rotations

and shifts, or some other type of transformation, plays the role of the communication
channel. The model for DRR/EPI registration becomes:
I(A(x), T(B(x))) = H(A(x)) + H(T(B(x))) - H(A(x), T(B(x))),
where A(x) is the model, or reference, image, and T(B(x)) is the transformation
of a test image. The objective is to find the transformation, T, that maximizes
the mutual information of the system. The problem can be stated:
Find T such that I(A(x),T(B(x))) is maximized.
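As a concrete illustration of this objective, the following Python sketch (the thesis's implementation is in MATLAB, so this is only an analogue) estimates I(A, B) = H(A) + H(B) - H(A, B) from the joint image histogram; the bin count and the random test images are arbitrary assumptions.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Estimate I(A;B) = H(A) + H(B) - H(A,B) from the joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()      # joint probability estimate
    p_a = p_ab.sum(axis=1)          # marginal distribution of A
    p_b = p_ab.sum(axis=0)          # marginal distribution of B

    def entropy(p):
        p = p[p > 0]                # 0 log 0 is taken as 0
        return -np.sum(p * np.log2(p))

    return entropy(p_a) + entropy(p_b) - entropy(p_ab)

rng = np.random.default_rng(0)
img = rng.random((64, 64))
self_mi = mutual_information(img, img)                    # image vs. itself
noise_mi = mutual_information(img, rng.random((64, 64)))  # vs. unrelated noise
```

An image shares maximal information with itself and little with unrelated noise, so `self_mi` exceeds `noise_mi`; a good transformation T moves the test image toward the high-MI situation.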
Figure 4.1 shows a typical treatment plan, where the parallel lines on the
right two images represent the leaves of a multileaf beam collimator, a beam-
shaping device. Figures 4.2 and 4.3 show a pelvic region aligned in the DRR
and the EPI. Figure 4.4 shows the pelvic region misaligned in the EPI (left) and
the DRR (right) automatically rotated, using mutual information-based image
registration, to the EPI position. Figures 4.5 and 4.6 each show 2D and 3D
representations of the joint histogram, before and after alignment respectively.
4.1 Experiments and Results
Registration was performed on a set of images provided by [12] consisting of
one DRR and nineteen EPIs. Setup errors were unknown. Based on registra-
tion results and visual inspection of the results, the setup errors ranged from
translations of 0 to 7 pixels and rotations of 0 to 10 degrees.
4.1.1 Greedy Algorithm
For this part of the study, a greedy algorithm [13] developed in MATLAB
(The MathWorks, Inc.) was used. It is a greedy algorithm in that it
follows the problem solving metaheuristic of making a locally optimum choice

at each stage with the hope of finding the global optimum. The algorithm is
like the heuristic of a tabu search in that moves are recorded, so that a move
won't be unnecessarily repeated. In this respect, the record functions as long
term memory in that the history of the entire exploration process is maintained.
In some applications of a tabu search, repeating a move can be advantageous
so that such a list functions as short term memory in recording the most re-
cent movements. However, in the current application, a particular move always
results in the same outcome so that repeating a move is inefficient.
The transformation, T, for this example is a series of translations and ro-
tations for two-dimensional images of resolution 256 x 256 pixels. An initial
three-dimensional vector is given, consisting of an x-shift and a y-shift, in pixels
(the extent of a move left, right, up, or down), and an angle of rotation, in
degrees. The test image is transformed according to these parameters. The initial
x- and y-shifts and angle of rotation are added componentwise to each of 52 rows
in a preset matrix. The first two elements of each row of the matrix consist
of two elements from the set {-2, -1, 0, 1, 2}, to be added to the initial values for
the x-shift and y-shift. The third and final element in each row is added to the
initial value for the angle of rotation and is from the set {-1, -.5, 0, .5, 1}. This results
in the generation of 52 new transformation vectors. The mutual information is
then calculated and stored for each of these row vectors. The set of parameters,
x- and y-shifts and angle of rotation, from the solution set that corresponds to
the maximum value of mutual information is used as the starting vector for the
next iteration. If there is a return to a previous transformation, the program terminates.

The capture range of the correct optimum is the portion of the parameter
space in which an algorithm is more likely to converge to the correct optimum [7].
The greedy algorithm does not work well if the initial transformation vector is
not in the capture range of the correct optimum in that there is divergence from
the optimal solution. A local optimum is eventually attained and the program
terminates. For the following examples, the transformation that results in the
optimal solution (for integer or half-integer rotation values in the 256x256 case)
is an x-shift of 1 pixel, a y-shift of 0 pixels, and a rotation of -10 degrees, denoted
(1 0 -10).
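The move generation and record keeping described above can be sketched as follows. This is a Python analogue of the MATLAB program; it enumerates the full 5 x 5 x 5 product of offsets rather than the thesis's preset 52-row matrix, and the quadratic `toy_mi` objective, peaked at the known optimum (1 0 -10), merely stands in for the image-based mutual information.

```python
import itertools

SHIFTS = [-2, -1, 0, 1, 2]          # offsets for x- and y-shift, in pixels
ROTS = [-1, -0.5, 0, 0.5, 1]        # offsets for rotation, in degrees

def greedy_search(score, start, max_iter=100):
    """Greedy search over (x-shift, y-shift, rotation) vectors, keeping a
    record of visited vectors so that a move is never repeated."""
    current = start
    visited = {start}
    for _ in range(max_iter):
        moves = [(current[0] + dx, current[1] + dy, current[2] + dr)
                 for dx, dy, dr in itertools.product(SHIFTS, SHIFTS, ROTS)]
        best = max(moves, key=score)            # locally optimal choice
        if best in visited or score(best) <= score(current):
            break       # returned to a previous vector or no improvement
        visited.add(best)
        current = best
    return current

# Toy unimodal objective peaked at (1, 0, -10).
def toy_mi(t):
    x, y, rot = t
    return -((x - 1) ** 2 + y ** 2 + (rot + 10) ** 2)

result = greedy_search(toy_mi, (11, 11, 10))    # → (1, 0, -10)
```

Starting from (11 11 10), as in run 1, the rotation can change by at most one degree per iteration, so roughly twenty iterations are needed to reach -10; the shifts settle earlier.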
Table 4.1 and Figure 4.7 show convergence to the approximate optimum
resulting from a starting transformation of (11 11 10). Table 4.2 and Figure 4.8
show convergence to an incorrect solution from the starting transformation of
(12 12 12). A rough estimate of the capture range for this algorithm can be
derived from Table 4.3 which gives the starting and final transformation vectors
for 22 trials, where there is convergence to the correct transformation, (1 0 -10),
for initial transformations of (-10 -10 -18) and (25 25 11). Therefore, the
capture range is roughly within 35 pixels in the horizontal direction, 36 pixels
in the vertical direction, and 20 degrees of rotation. Unfortunately, the size of
the capture range depends on the features in the images and cannot be known
a priori [7]. However, visual inspection of the registered images can reveal
convergence outside the capture range.
The problem of patient positioning has to be solved using an algorithm
that is not only sufficiently accurate, but one that is also fast enough to fulfill the
requirements of daily clinical use. For this application, the requirement that the

Figure 4.7: Greedy algorithm, run 1, 256x256 images.

iteration x-shift y-shift rotation MI
1 11 11 10 0.7638
2 13 9 9 0.7780
3 11 7 8 0.7871
4 9 5 7 0.7969
5 9 3 6 0.8043
6 9 3 5 0.8087
7 9 3 4 0.8149
8 9 3 3 0.8177
9 7 1 2 0.8203
10 5 1 1 0.8253
11 4 1 0.5 0.8311
12 2 1 -0.5 0.8349
13 2 -1 -1.5 0.8419
14 0 -1 -2.5 0.8542
15 -2 1 -3.5 0.8845
16 -4 -1 -4.5 0.9112
17 -4 1 -5.5 0.9466
18 -2 1 -6.5 0.9869
19 -2 1 -7.5 1.0420
20 0 1 -8.5 1.1204
21 0 1 -9.5 1.1916
22 1 0 -10 1.2176
23 0 0 -9.5 1.2021
24 1 0 -10 1.2176
Table 4.1: Sample run (Figure 4.7) converges to the optimum transformation
in 8.712 minutes.

Figure 4.8: Greedy algorithm, run 2, 256x256 images.

iteration x-shift y-shift rotation MI
1 12 12 12 0.7663
2 14 10 13 0.7734
3 14 8 14 0.7821
4 14 6 14 0.7890
5 12 4 14 0.7922
6 10 2 15 0.7982
7 8 2 16 0.8029
8 8 2 16.5 0.8052
9 8 0 16.5 0.8058
10 6 -2 16.5 0.8067
11 6 -4 17.5 0.8086
12 6 -6 17.5 0.8135
13 7 -6 18 0.8150
14 8 -6 18 0.8152
15 7 -6 18 0.8150
Table 4.2: Sample run (Figure 4.8) converges to a suboptimal solution in
5.0828 minutes.

T0(x, y, angle) Tfinal(x, y, angle) MI time(min)
(11 11 10) (1 0 -10) 1.217601 8.788317
(12 12 10) (1 0 -10) 1.217601 8.709550
(13 13 10) (1 0 -10) 1.217601 8.498200
(16 16 10) (1 0 -10) 1.217601 9.150317
(20 20 10) (1 0 -10) 1.217601 10.609733
(30 30 10) (42 46 7.5) 0.750654 4.397467
(11 11 11) (8 -6 18) 0.815214 4.548517
(20 20 11) (1 0 -10) 1.217601 10.912367
(25 25 11) (1 0 -10) 1.217601 11.816650
(26 26 11) (8 -6 18) 0.815214 8.016017
(29 29 11) (51 2 15.5) 0.829537 7.151950
(0 0 12) (-14 -9 18) 0.833644 4.187017
(11 11 12) (8 -6 18) 0.815214 4.644167
(-10 -10 -19) (6 -25 -15.5) 0.883105 5.377883
(-10 -10 -18) (1 0 -10) 1.217601 4.953967
(-11 -11 -18) (6 -25 -15.5) 0.883105 5.203483
(-12 -12 -18) (6 -25 -15.5) 0.883105 5.360033
(-13 -13 -18) (4 -50 -20.5) 0.916436 9.069867
(-17 -17 -18) (-11 -33 -26) 0.879511 4.321567
(-18 -18 -18) (-3 -35 -27) 0.888810 5.214000
(-19 -19 -18) (-23 -44 -19) 0.909333 6.070050
(-20 -20 -18) (-23 -44 -19) 0.909333 5.909833
Table 4.3: Convergence data for 22 runs, 256x256 images.

Figure 4.9: 64 x 64 DRR and EPI.
Figure 4.10: 128x128 DRR and EPI.

Figure 4.11: 256x256 DRR and EPI.
initial set of parameters be in a relatively narrow capture range of the optimum
does not seem unreasonable assuming that external markers are used for prelim-
inary patient positioning. However, the speed of convergence can be improved
by finding the optimal transformation for low resolution versions of the images
to be registered. Figures 4.9, 4.10, and 4.11 show the sets of images used in these
examples at resolutions of 64 x 64, 128 x 128, and 256 x 256, respectively. Tables
4.4, 4.5, and 4.6 respectively, show the progression toward convergence for these
sets. Table 4.7 summarizes the data. As expected, it requires the least amount
of time for the lowest, 64 x 64, resolution. Also, the 64 x 64 and 128 x 128 reso-
lutions converge to a slightly different transformation solution, (0 0 -9.5), than
that of the 256 x 256 resolution. However, it is very close and can be used as the
starting transformation for a higher resolution case. From Table 4.6, iterations
11 and 12, it can be seen that this would require only two iterations compared to

12, a significant reduction in computation time. So, a multi-resolution approach,
that is, starting with a series of relatively coarse resolutions with the objective
being to find one or more approximate solutions, or candidate local maxima, as
suitable starting solutions for refinement at progressively higher resolutions, is
one way to speed convergence to the global optimum. Of course, whether or not
the global optimum, or something reasonably close to it, has been attained, has
to be determined, for the type of registration problem described here, by visual inspection.
The multi-resolution method was implemented using first the 64 x 64 res-
olution, starting with (0 0 0), which converged to the solution (0 0 -9.5). This
solution was then used as the starting transformation vector with the 256 x 256
resolution. The algorithm converged to (1 0 -10) in two iterations with total
time for both resolutions equal to 1.5311 minutes.
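The coarse-to-fine procedure can be sketched as below. This Python fragment is only an analogue of the MATLAB implementation: `register` stands for any single-resolution search (such as the greedy algorithm), and the stub used here simply returns its starting vector.

```python
import numpy as np

def downsample(img, factor):
    """Block-average an image by an integer factor (e.g. 256x256 -> 64x64)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def multires_register(register, fixed, moving, factors=(4, 1)):
    """Register coarse-to-fine: the solution at each resolution seeds the next.

    `register(fixed, moving, start)` is any single-resolution search (for
    example the greedy algorithm) returning a transformation vector.  Note
    that shifts kept in pixels would need rescaling between levels; the
    thesis's seed (0 0 -9.5) had zero shifts, so that step is omitted here.
    """
    t = (0, 0, 0)
    for f in factors:
        t = register(downsample(fixed, f) if f > 1 else fixed,
                     downsample(moving, f) if f > 1 else moving, t)
    return t

# With a stub search that simply returns its starting vector, the pipeline
# threads the seed through both levels unchanged.
stub = lambda fixed, moving, start: start
seed = multires_register(stub, np.ones((8, 8)), np.ones((8, 8)), factors=(2, 1))
```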
The program can be improved for the case of a poorly chosen initial
starting point. For example, the algorithm can be called repeatedly
while a record of the transformation vectors that have been checked is main-
tained. A new, not previously checked, transformation vector can be created
for a new iteration of the algorithm. The maximum mutual information might
then be retrieved from the stored list of values. This is basically multistart
optimization; it was implemented and apparently works well. Using a set of
64 x 64 resolution images, with an initial transformation vector of (-50 50 -50),
and the specification that the greedy algorithm be called 10 times, the program
converged to the expected optimum of (0 0 -9.5) for the 64 x 64 resolution. This
occurred at iteration 63 of 104 total iterations and in approximately 5.7 minutes.
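A minimal sketch of this multistart scheme, again in Python rather than the original MATLAB: `search` stands for one call of the greedy algorithm, and both toy functions below are assumptions made so the fragment is self-contained (the score is peaked at the 64 x 64 optimum (0 0 -9.5)).

```python
import random

def multistart(search, score, n_starts=10, seed=0):
    """Multistart optimization: run a local search from several random,
    never-repeated starting vectors and keep the best result found."""
    rng = random.Random(seed)
    tried, best = set(), None
    while len(tried) < n_starts:
        start = (rng.randint(-50, 50), rng.randint(-50, 50), rng.randint(-50, 50))
        if start in tried:
            continue                  # never repeat a starting vector
        tried.add(start)
        result = search(start)
        if best is None or score(result) > score(best):
            best = result
    return best

# Toy stand-ins: a "search" with two basins of attraction, and a score
# peaked at (0, 0, -9.5).
toy_score = lambda t: -(t[0] ** 2 + t[1] ** 2 + (t[2] + 9.5) ** 2)
toy_search = lambda s: (0, 0, -9.5) if s[2] < 0 else (0, 0, 50)
best = multistart(toy_search, toy_score, n_starts=20)
```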

iteration x-shift y-shift rot (degrees) MI
1 0 0 0 1.904037
2 -2 0 -1 1.932668
3 -2 0 -2 1.946724
4 -2 0 -3 1.947779
5 -2 0 -4 1.974409
6 0 0 -5 1.981745
7 0 0 -6 2.010966
8 0 0 -7 2.027811
9 0 0 -8 2.059447
10 0 0 -9 2.101459
11 0 0 -9.5 2.110144
Table 4.4: 64x64 run converges in 0.6504 minutes.
4.1.2 Genetic Algorithm
Unlike the greedy algorithm, the genetic algorithm, created in MATLAB for
this part of the study, begins with a random set of solutions, or a population,
consisting of a number of trial solutions. A trial solution consists of a trans-
formation vector as described in the previous section, that is, a lateral shift, a
vertical shift, and a rotation. The mutual information of each is evaluated. The
transformations that yield mutual information values that meet or exceed a

iteration x-shift y-shift rot (degrees) MI
1 0 0 0 1.200664
2 -2 -2 -1 1.236596
3 -4 0 -2 1.252961
4 -4 0 -3 1.272571
5 -4 0 -4 1.291002
6 -2 0 -5 1.314651
7 -2 0 -6 1.353761
8 -2 0 -7 1.393065
9 0 0 -8 1.440195
10 0 0 -9 1.527838
11 0 0 -9.5 1.549610
Table 4.5: 128x128 run converges in 1.2058 minutes.

iteration x-shift y-shift rot (degrees) MI
1 0 0 0 0.795863
2 2 -2 -1 0.842388
3 0 -2 -2 0.850167
4 -2 -2 -3 0.872867
5 -4 -2 -4 0.899841
6 -4 0 -5 0.929021
7 -4 0 -6 0.965123
8 -2 0 -7 1.013159
9 -2 0 -8 1.076260
10 0 0 -9 1.170099
11 0 0 -9.5 1.202078
12 1 0 -10 1.217601
Table 4.6: 256x256 run converges in 4.3367 minutes.
resolution T0(x, y, angle) Tfinal(x, y, angle) MI time(min)
64x64 (0 0 0) (0 0 -9.5) 2.1101 0.6504
128x128 (0 0 0) (0 0 -9.5) 1.5496 1.2058
256x256 (0 0 0) (1 0 -10) 1.2176 4.3367
Table 4.7: Summary of Tables 4.4, 4.5, and 4.6.

user-specified value above the average value are selected. Each of these is ran-
domly paired with another member in this set. A new set of vectors is created
by randomly exchanging vector elements between the pairs. Additionally, for
each iteration, another random set of trials half the size of the original set is
added, an immigration step. Every trial is recorded so that none is repeated.
After a user-specified number of iterations, the program terminates and the vec-
tor that yields the maximum value for mutual information is considered to be
the solution. Obviously, the larger the population and the larger the number of
iterations, the greater the chance that this will be the optimal solution.
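The generation loop just described can be sketched in Python (the thesis's code is MATLAB). The toy objective, peaked at (1, 0, -5), is an assumption standing in for mutual information, and details such as the exact crossover rule are simplified: here one randomly chosen vector element is exchanged between each pair of fit parents.

```python
import random

def genetic_search(score, pop_size=100, iterations=40,
                   max_shift=2, max_rot=10, factor=1.1, seed=0):
    """Genetic search: above-average selection, random pairwise element
    exchange, and an 'immigration' of fresh random vectors per generation."""
    rng = random.Random(seed)

    def random_vec():
        return (rng.randint(-max_shift, max_shift),
                rng.randint(-max_shift, max_shift),
                rng.uniform(-max_rot, max_rot))    # continuous rotation

    seen = {}
    def evaluate(v):
        if v not in seen:                          # record every trial once
            seen[v] = score(v)
        return seen[v]

    population = [random_vec() for _ in range(pop_size)]
    for _ in range(iterations):
        avg = sum(evaluate(v) for v in population) / len(population)
        fit = [v for v in population if evaluate(v) >= factor * avg]
        rng.shuffle(fit)
        children = []
        for a, b in zip(fit[::2], fit[1::2]):      # random pairing
            cut = rng.randrange(3)                 # exchange one element
            children.append(a[:cut] + (b[cut],) + a[cut + 1:])
            children.append(b[:cut] + (a[cut],) + b[cut + 1:])
        immigrants = [random_vec() for _ in range(pop_size // 2)]
        population = fit + children + immigrants
    return max(seen, key=seen.get)                 # best trial ever seen

# Toy objective peaked at (1, 0, -5), standing in for mutual information.
toy_score = lambda v: 1.0 / (1.0 + (v[0] - 1) ** 2 + v[1] ** 2 + (v[2] + 5) ** 2)
best = genetic_search(toy_score)
```

Because every trial is recorded, the returned vector is the best of all trials ever evaluated, not merely the best member of the final population.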
An experiment was run two times with the 256 x 256 DRR and EPI images
shown in Figure 4.11 and yielded the results shown in Table 4.8. The approxi-
mate optimal solution was found in run 2. Each run began with a population of
100 and an initial transformation vector (0 0 0). Forty iterations were specified
as well as a maximum shift, in absolute value, of 11 pixels, and maximum rota-
tion, also in absolute value, of 10 degrees. Clearly, more iterations are required
to guarantee an approximately optimal solution.
Repeating the same experiment, but using maximum shifts, in absolute
value, of 2 pixels, and maximum rotation, also in absolute value, of 10 degrees
yielded the results shown in Table 4.8. The logic here is to assume that the actual
transformation lies within, in absolute value, the range of values determined
by these maxima. Note the shorter runtime and that both runs yielded an
approximately optimal solution. This is no doubt due to the fact that the
respective vector values were from a relatively narrow range and close to the optimum.

If, say, 10 ± 2 had been designated as the range of values to be considered for
runs 3 and 4, then one could expect a shorter runtime. For these runs, rotation
values from 0 to 10 were possible, so the wider range of values generated resulted
in longer runtime.
The algorithm was adjusted to account for a range of values between user-
specified minimum and maximum values for each parameter. The results for
runs 5 and 6 are displayed in Table 4.8. The only changes in the experimental
parameters (Table 4.9) were the addition of a minimum transformation vector
of values, (0 0 9), and a change from 10 to 11 for the upper bound on rotation
giving a maximum transformation vector of (2 2 11). Not only was the run-
time significantly reduced, but mutual information values closer to optimal were
achieved. Note that where the greedy search allows additions to integer rotation
values from the set {-1, -.5, 0, .5, 1}, the genetic algorithm allows a continuous
range of rotation values, that is, randomly generated decimal values plus or
minus an integer or zero.
In all of these experiments, fitness was defined as having a mutual information
value at least 1.1 times the average value for an iteration. Keeping
all parameters the same and increasing this factor to 1.2 yielded the results
shown in Table 4.8. As expected, mutual information values near optimum were
achieved and there was a significant reduction in runtime, apparently because
fewer vectors are deemed fit and these are drawn from a narrower range of
values that also lies within the capture range of the optimum.
The parameters used for these experiments are tabulated in Table 4.9.
4.2 Nelder-Mead (MATLAB's fminsearch)
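MATLAB's fminsearch implements the Nelder-Mead simplex method. A comparable call in Python, sketched here with SciPy (an assumption; the thesis used MATLAB) and with a toy quadratic standing in for the negated mutual information, since fminsearch minimizes:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the negated mutual information, with its minimum at
# the transformation (1, 0, -10) found by the other searches.
def neg_mi(t):
    x, y, rot = t
    return (x - 1) ** 2 + y ** 2 + (rot + 10) ** 2

# method="Nelder-Mead" is SciPy's analogue of MATLAB's fminsearch.
res = minimize(neg_mi, x0=np.zeros(3), method="Nelder-Mead")
```

Like fminsearch, this simplex search needs no gradients, which suits a histogram-based objective that is not smooth in the transformation parameters.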

parameters runs 1,2 runs 3,4 runs 5,6 runs 7,8
T0(x, y, angle) (0 0 0) (0 0 0) (0 0 0) (0 0 0)
initial population 100 100 100 100
iterations 40 40 40 40
above average factor 1.1 1.1 1.1 1.2
min x-shift - - 0 0
min y-shift - - 0 0
minimum rotation - - 9 9
max x-shift 11 2 2 2
max y-shift 11 2 2 2
max rot 10 10 11 11
Table 4.9: Parameter list for genetic run pairs.
x-shift y-shift rot (degrees) MI time(min)
1 0 -9.8673 1.2213 1.1455
Table 4.10: MATLAB's fminsearch (Nelder-Mead) algorithm, 256x256 images.

of the angle of rotation of -9. The integral part of the value was not changed,
since, from the previously described experiments and visual inspection, it was
established that the solution for rotation is -9 minus a decimal value. The results
from over 7500 decimal values generated are shown in Figures 4.12 and 4.13. A
value of -9.867333 as the angle of rotation with a mutual information value of
1.221330 was obtained. Figure 4.13 shows that a range of angle values gives the
same mutual information value.
A major advantage of simulated annealing is its ability to avoid becoming
trapped at local minima. The algorithm employs a random search that can
accept changes that decrease the objective function as well as those that increase it.
However, for this application, acceptance of a transformation that results in a
smaller value for mutual information makes no sense, so only those that result
in a higher mutual information value are accepted.
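With that modification, the search over the decimal part of the rotation reduces to accepting only improvements, which can be sketched as follows; the Python fragment and its toy objective, peaked near the reported angle of -9.87, are illustrative assumptions rather than the thesis's MATLAB code.

```python
import random

def anneal_rotation(score, lo=-10.0, hi=-9.0, trials=7500, seed=0):
    """Random search over the decimal part of the rotation angle that,
    per the modification above, accepts only increases in the objective."""
    rng = random.Random(seed)
    best_angle = lo
    best_score = score(best_angle)
    for _ in range(trials):
        angle = rng.uniform(lo, hi)
        s = score(angle)
        if s > best_score:              # accept only improvements
            best_angle, best_score = angle, s
    return best_angle, best_score

# Toy objective peaked near -9.87, echoing the angle reported in the text.
toy = lambda a: -(a + 9.87) ** 2
angle, value = anneal_rotation(toy)
```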
4.4 Other Experiments
In addition to the EPI example used in this chapter, 18 other EPIs of resolution
64x64 were tested using the greedy algorithm which was modified to limit the
search space. In each case, the search space included x- and y-shifts of 0 to 8
pixels, and rotations of 0 to 10 degrees. Additionally, in every case, the greedy
algorithm was called 20 times with 50 iterations maximum specified per run. It
was determined, by the data and visual inspection of the output graphics, that
the resultant transformations were successful in all cases.

Figure 4.12: Plot of MI versus Angle of Rotation from simulated annealing data.

Figure 4.13: Detail of Figure 4.12, plot of MI versus Angle of Rotation from
simulated annealing data.

5. Conclusion
In the experiments described in Chapter 4, mutual information-based image reg-
istration was found to be a robust and easily implemented technique. In every
case the entire, unpreprocessed image was registered. Because the setup errors
were unknown, the registrations were judged to be successful or not based on
visual inspection. In all nineteen cases, the use of the mutual information tech-
nique resulted in successful transformations as determined by visual inspection.
It appears that the greedy and genetic algorithms perform well, and in a rea-
sonable amount of time from a clinical perspective, given that there is assumed
to be a good estimate of the general location of the optimum. If such an esti-
mate could not be made, then the genetic algorithm would probably outperform
the greedy algorithm since the greedy algorithm tends to terminate at a local
optimum. Of course, this problem is resolved by multistart optimization, that
is, doing multiple searches starting with a variety of transformations resulting
in multiple solutions, and then choosing the solution which yields the highest
value of mutual information.
Given a full range of possible transformation values, the genetic algorithm
can find the global optimum given enough time. The time required may be
more than is practicable, but the problem can be made practicable, as was
demonstrated in this experiment, by constraining the search area to within the
capture range of the optimum. As mentioned in Section 4.1.1, the size of the
capture range depends on the features in the images and cannot be known a

priori, but visual inspection of the registered images can reveal convergence
outside of the capture range.
The advantage of mutual information-based image registration is that it
can be fully automatic in that it makes no assumption about the functional form
of the relationship between image intensities in the images to be registered. However,
it is clearly important that a method of quality assurance be used to ensure that
only well registered images are used for clinical decision making.

Appendix A. Brief Discrete Probability Theory
Definition A.1 Given an experiment whose outcome is unpredictable, such an
outcome, say X, is called a random variable, or trial. The sample space
of the experiment is the set of all possible trials. If the sample space is either
finite or countably infinite, the random variable is said to be discrete.
Example 5. Roll of a die.
Suppose that a die is rolled once, and let X denote the outcome of the experi-
ment. The sample space for this experiment is the 6-element set
Ω = {1, 2, 3, 4, 5, 6},
where each outcome i, for i = 1, ..., 6, corresponds to the number of dots on
the face that turns up. The event
E = {1, 3, 5}
corresponds to the statement that the result of the roll is an odd number. As-
suming that the die is fair, or unloaded, the assumption is that every outcome
is equally likely. Therefore, a probability of 1/6 is assigned to each of the six
outcomes in Ω, that is, m(i) = 1/6 for 1 ≤ i ≤ 6.
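The example can be checked mechanically; this small Python fragment is illustrative only and not part of the thesis.

```python
from fractions import Fraction

# Uniform distribution m on the six faces of a fair die.
omega = range(1, 7)
m = {i: Fraction(1, 6) for i in omega}

# Probability of the event E = {1, 3, 5}, "the roll is odd".
p_odd = sum(m[i] for i in (1, 3, 5))   # → 1/2
```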
Definition A.2 Let X be a random variable which denotes the outcome, of
finitely many possible outcomes, of an experiment. Let Ω be the sample space of
the experiment. A probability distribution function, or probability mass