## Citation

- Permanent Link:
- http://digital.auraria.edu/AA00003000/00001
## Material Information
- Title:
- Mutual information mutual information-based registration of digitally reconstructed radiographs and electronic portal images
- Creator:
- Bachman, Katherine Anne
- Publication Date:
- 2005
- Language:
- English
- Physical Description:
- xiii, 116 leaves : ; 28 cm
## Subjects
- Subjects / Keywords:
- Computer vision -- Mathematical models ( lcsh )
- Diagnostic imaging -- Digital techniques -- Mathematical models ( lcsh )
- Image processing -- Digital techniques -- Mathematical models ( lcsh )
- Image analysis -- Data processing -- Mathematical models ( lcsh )
- Genre:
- bibliography ( marcgt )
- theses ( marcgt )
- non-fiction ( marcgt )
## Notes
- Bibliography:
- Includes bibliographical references (leaves 114-116).
- General Note:
- Department of Mathematical and Statistical Sciences
- Statement of Responsibility:
- by Katherine Anne Bachman.
## Record Information
- Source Institution:
- University of Colorado Denver
- Holding Location:
- Auraria Library
- Rights Management:
- All applicable rights reserved by the source institution and holding location.
- Resource Identifier:
- 66459954 ( OCLC )
- ocm66459954
- Classification:
- LD1193.L622 2005m B32 ( lcc )
## Auraria Membership

## Full Text

MUTUAL INFORMATION: MUTUAL INFORMATION-BASED REGISTRATION OF DIGITALLY RECONSTRUCTED RADIOGRAPHS AND ELECTRONIC PORTAL IMAGES

by Katherine Anne Bachman

Master of Basic Science, Mathematics, University of Colorado at Denver, 2002
Bachelor of Science, Chemistry, University of Colorado at Denver, 2000

A thesis submitted to the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Master of Science, Applied Mathematics, 2005.

This thesis for the Master of Science degree by Katherine Anne Bachman has been approved by Weldon A. Lodwick (Date).

Bachman, Katherine Anne (M.S., Applied Mathematics)
Mutual Information: Mutual Information-Based Registration of Digitally Reconstructed Radiographs and Electronic Portal Images
Thesis directed by Professor Weldon A. Lodwick

ABSTRACT

This study regards discrete mutual information and demonstrates the use of the theory with an example of radiological image registration of in-plane, two-dimensional images using various search strategies.

Image registration is the process of finding an optimal geometric transformation between corresponding image data. Although it has applications in many fields, the one addressed in this thesis is medical image registration. Medical image registration has a wide range of potential applications, but the emphasis here is on radiological imaging.

In addition to having applications in communication and computer vision, mutual information has proven robust and has resulted in fully automated registration algorithms that are currently in use. Mutual information, which is given by the difference between the sum of the entropies of individual overlapping images and the joint entropy of the combined images, is a measure of the reduction in uncertainty about one image due to knowledge of the other. This non-parametric technique makes no assumption of the functional form or relationship between image intensities in the two images.
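The abstract's working definition, I(A, B) = H(A) + H(B) - H(A, B) estimated from a joint intensity histogram, can be sketched in a few lines. This is an illustrative Python/NumPy sketch, not the thesis's MATLAB code; the function name and the 32-bin histogram are assumptions:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Mutual information I(A, B) = H(A) + H(B) - H(A, B),
    estimated from a joint intensity histogram of two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()        # joint probability p(a, b)
    p_a = p_ab.sum(axis=1)            # marginal p(a)
    p_b = p_ab.sum(axis=0)            # marginal p(b)

    def entropy(p):
        p = p[p > 0]                  # 0 log 0 is taken as 0
        return -np.sum(p * np.log2(p))

    return entropy(p_a) + entropy(p_b) - entropy(p_ab.ravel())
```

An image paired with itself gives I(A, A) = H(A), while two unrelated images give a value near zero, which is the behavior exploited as a registration criterion in Chapter 3.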
The thesis is organized as follows. Chapter 1 gives a broad overview of medical image registration as the context in which to present the subject of mutual information-based medical image registration. Chapter 2 regards the theory and mathematics of discrete mutual information and its origin in information theory. Chapter 3 looks at the implementation of the theory applied to image registration in general. Chapter 4 looks at an application of mutual information-based image registration in radiological imaging: registration of Digitally Reconstructed Radiographs (DRRs) and Electronic Portal Images (EPIs). Chapter 5 consists of concluding remarks. The Appendix includes information that is relevant, but not critical, to the understanding of the material presented in this thesis. Because probability theory is a major part of information theory and, consequently, mutual information theory, a brief overview of discrete probability theory is included in the Appendix as a quick reference. Examples are provided in the Appendix as well as throughout the body of the thesis.

This abstract accurately represents the content of the candidate's thesis. I recommend its publication. Signed, Weldon A. Lodwick.

DEDICATION

To my parents, Dr. and Mrs. Miles MacGran, and to my husband, Randy.

ACKNOWLEDGMENT

I would like to thank my advisor, Weldon A. Lodwick, for the means of support obtained for me, without which this paper might not have been possible. I would also like to thank Dr. Lodwick and Francis Newman for their meticulous review and critique of the thesis, and Francis Newman for supplying the images used for the DRR/EPI registration experiments and other related images. I would like to thank Weldon Lodwick, Bill Briggs, and Bill Cherowitzo for acting as my Graduate Committee, and also Mike Jacobson for attending the thesis presentation.

CONTENTS

- Figures
- Tables
- 1. Medical Image Registration Overview
  - 1.1 Taxonomy of Medical Image Registration Methodology
    - 1.1.1 Prospective and Retrospective Methods
  - 1.2 Types of Transformation
  - 1.3 Image Registration as an Optimization Problem
- 2. Mutual Information Theory
  - 2.1 Information
  - 2.2 Entropy (uncertainty or complexity)
    - 2.2.1 Joint Entropy and Conditional Entropy
    - 2.2.2 Relative Entropy
    - 2.2.3 Concavity of Entropy
    - 2.2.4 Entropy of a Signaling System
    - 2.2.5 Entropy and Kolmogorov Complexity
  - 2.3 Mutual Information
- 3. Mutual Information Theory Applied to Image Registration
- 4. Mutual Information-Based Registration of Digitally Reconstructed Radiographs (DRRs) and Electronic Portal Images (EPIs)
  - 4.1 Experiments and Results
    - 4.1.1 Greedy Algorithm
    - 4.1.2 Genetic Algorithm
  - 4.2 Nelder-Mead (MATLAB's fminsearch)
  - 4.3 Simulated Annealing
  - 4.4 Other Experiments
- 5. Conclusion
- Appendix
  - A. Brief Discrete Probability Theory
  - B. Properties of Convex Functions
  - C. Miscellaneous
  - D. Sample Output
- References

FIGURES

- 1.1 Registration example with multiple solutions. The image B is to be transformed to image A. The rightmost image shows B registered to A by two translations and a rotation of 45 degrees.
- 2.1 Discrete noiseless binary channel.
- 2.2 Discrete binary symmetric channel.
- 2.3 The entropy function for base 2, e, and 10 logs.
- 2.4 Decomposition of a choice from three possibilities.
- 2.5 The entropy function (base 2 log) of two probabilities.
- 2.6 Bound on the log_e(x) function.
- 2.7 Noisy communication channel, Example 2.
- 2.8 Relationship between entropy and mutual information. The mutual information I(A, B) corresponds to the intersection of the information in A with the information in B.
- 3.1 The plot on the left is the joint histogram of two randomly generated 150 x 150 matrices taking on values from 1 to 256. The plot on the right is the joint histogram of one matrix with itself.
- 3.2 Relationship between entropy and mutual information. The mutual information I(A, B) corresponds to the intersection of the information in A with the information in B.
- 3.3 Venn diagram of the relationship between joint entropy and mutual information of totally unrelated images.
- 3.4 Venn diagram of the relationship between joint entropy and mutual information of totally related images.
- 3.5 The image data sets, A and B.
- 3.6 The joint probability distribution of Example 4 as a surface.
- 3.7 The image data sets A, rotated -90 degrees, and B.
- 3.8 The joint probability distribution as a surface of Example 4 with image A rotated -90 degrees as in Figure 3.7.
- 4.1 A typical radiation treatment plan [12].
- 4.2 Typical aligned pelvis DRR [12].
- 4.3 Typical aligned pelvis EPI [12].
- 4.4 A misaligned EPI and the DRR automatically rotated to the EPI position [12].
- 4.5 Joint histogram of misaligned images. Referring to the rightmost plot, the x- and y-axes represent the range of intensities in the images. The z-axis represents the probability that an intensity in one image will occur with an intensity in the other. It is a surface plot of the probability values represented in the plot on the left. [12]
- 4.6 The result after the automated registration process [12].
- 4.7 Greedy algorithm, run 1, 256x256 images.
- 4.8 Greedy algorithm, run 2, 256x256 images.
- 4.9 64x64 DRR and EPI.
- 4.10 128x128 DRR and EPI.
- 4.11 256x256 DRR and EPI.
- 4.12 Plot of MI versus Angle of Rotation from simulated annealing data.
- 4.13 Detail of Figure 4.12, plot of MI versus Angle of Rotation from simulated annealing data.
- A.1 Example 9.

TABLES

- 4.1 Sample run (Figure 4.7) converges to the optimum transformation in 8.712 minutes.
- 4.2 Sample run (Figure 4.8) converges to a suboptimal solution in 5.0823 minutes.
- 4.3 Convergence data for 22 runs, 256x256 images.
- 4.4 64x64 run converges in 0.6504 minutes.
- 4.5 128x128 run converges in 1.2058 minutes.
- 4.6 256x256 run converges in 4.3367 minutes.
- 4.7 Summary of Tables 4.3, 4.4, and 4.5.
- 4.8 Genetic algorithm, 256x256 images.
- 4.9 Parameter list for genetic run pairs.
- 4.10 MATLAB's fminsearch (Nelder-Mead) algorithm, 256x256 images.

1. Medical Image Registration Overview

Image registration is the process of finding an optimal geometric transformation between corresponding image data. In other words, given a reference, or model, image A, and a test, or floating, image B, find a suitable transformation T such that the transformed test image becomes similar to the reference. The image registration problem typically occurs when two images represent essentially the same object, but there is no direct spatial correspondence between them. The images might be acquired with different sensors, or with the same sensor at different times or from different perspectives.

Modern three-dimensional radiation treatment planning is based on sequences of tomographic images. Computed tomography (CT) has the potential to quantitatively characterize the physical properties of heterogeneous tissue in terms of electron densities. Magnetic resonance (MR), which looks at hydrogen atom densities, is very often superior to CT, especially for the task of differentiating between healthy tissue and tumor tissue. Positron emission tomography (PET), single photon emission computed tomography (SPECT), and magnetic resonance spectroscopy (MRS) imaging have the potential to include information on tumor metabolism.
The various image modalities are non-competing. They have specific properties and deliver complementary information. The images supply important information for delineation of tumor and target volume, and for therapy monitoring.

There is usually a high degree of shared information between images of different modalities of the same structures. This is certainly the case with images of the same modality which have been acquired at different times or from different perspectives, as in the example presented in Chapter 4. In this case, the modality is CT, or CT-derived, and the problem is rectification of orientation, or pose, of one CT image with respect to the other by in-plane rotations and/or shifts.

In order to use image sequences of the same, or various, modalities simultaneously, a definite relation between the picture elements (pixels), or volume elements (voxels), of the various image sequences needs to be established. Methods which are able to calculate and establish these relations are called registration, matching, or image correlation techniques.

1.1 Taxonomy of Medical Image Registration Methodology

Image registration methods can be categorized by:

- User interaction
  - Manual: The transformation is determined directly by user interaction.
  - Semi-automatic: The transformation is calculated automatically, but the user determines the image features used and the start-up parameters.
  - Automatic: No user interaction is required.
- Scope and elasticity of transformation
  - Scope
    - Global: Global transformations modify the data set as a whole, that is, not just the therapy-relevant area, by applying a single function to all elements.
    - Local: In local procedures the function can vary. Single voxels or pixels, single image slices, or single organs could be affected.
  - Elasticity or plasticity (geometrical properties) of transformation
    - Rigid transformations: The distance between any two points in the first image is preserved when these two points are mapped onto the second image. Rigid transformations can be decomposed into translation, rotation, and reflection. Rigid transformations are special cases of affine transformations. A transformation is called affine when any straight line in the first image is mapped onto a straight line in the second image while parallelism is preserved.
    - Non-rigid affine transformations: Examples of non-rigid affine transformations are both uniform and non-uniform scaling and shearing.
    - Curved, or elastic, transformations: A curved or elastic transformation can map a straight line onto a curve, for example, transformations described by polynomial functions. Rigid transformations can be described as curved transformations with zero elasticity. Elastic transformations can be approximated using local affine, that is, piecewise linear, algorithms when using a small granularity, that is, a fine level of detail.
- Usage of auxiliary means
  - Methods are distinguished by the amount of auxiliary information which must be used before, or during, image acquisition to allow subsequent registration (for example, the use of external or internal markers).

1.1.1 Prospective and Retrospective Methods

The methods listed in the previous section can also be categorized under what are called prospective and retrospective image registration methods [14].

Prospective methods always register artificial, as opposed to anatomical, landmarks (fiducials), for example, objects attached to the patient that are detectable in all pertinent modalities. If the position of the landmarks relative to the patient remains constant, a high degree of accuracy is possible.
The disadvantage is that, since the calculation of the transformation is based only on a restricted number of landmarks, the resulting transformations are always rigid. If organ movements occur between the acquisition of different image series, these shifts cannot be considered correctly. Also, this procedure is invasive and inconvenient for the patient and staff.

Retrospective methods do not need any setup steps and are not subject to the conditions of image acquisition. To derive the necessary transformation, only patient-related references (anatomical landmarks, surfaces of structures, etc.) are used. Retrospective methods include the following:

- Feature-based methods
  - Point matching (Procrustes method): Identification of corresponding points in the image series requires user interaction. An advantage of this method is that landmarks can be defined just in the therapy-relevant image area, or locally. Therefore, the registration results will not be influenced by possible distortions in other regions of the image. A disadvantage is that it is difficult to locate corresponding landmarks in complementary image modalities precisely. Thus, a large amount of anatomical knowledge is required of the user. The precision of the resulting transformation depends on the precision with which these corresponding landmarks can be identified, and depends therefore on the resolution (pixel size) and slice distance/thickness of the images.
  - Surface matching: Rather than single points, this method uses the entire surface of a structure, for example, the patient's contour. A transformation is used which minimizes the distance between the surfaces of corresponding structures. An example is the head/hat analogy, where the contour of the head, segmented in the first image data set, is put over the head contour in the second data set. In this way, uncertainties in the definition of single points do not carry much weight.
    The disadvantage is that the structures must be segmented in an initial step. Since segmentation is a non-trivial task and normally only semi-automatic segmentation procedures deliver acceptable results, these methods are time-consuming.
- Metrics of similarity: This retrospective approach is a method where a value for the similarity of the two data sets is calculated. The image data set, an area or volume, to be registered is transformed by an optimization procedure (Section 1.3) until the similarity with the first data set achieves a maximum. Voxel similarity measure techniques can be fully automatic and have an accuracy comparable to bone-implanted markers, but they can also fail [7, 26]. A common cause of failure can be that the images are poorly aligned at the start.
  - Sum of squared intensity differences, a correlation-based distance measure.
  - Mutual information (an information-theoretic technique): Mutual information measurements consider the intensity distribution of both image data sets and are, therefore, well suited to the registration of multimodal image sequences, since no presumptions about the intensities have to be made (for example, it is possible that regions with a high intensity in the reference image correspond to regions with low, medium, or high intensity in the template image). The mutual information reaches a maximum if, for a given intensity in image A, a distinct intensity in image B is found. (In other words, two different intensities in image A don't correspond to the same intensity in image B.) There are no restrictions concerning the intensity combinations. An advantage of this method is that it does not depend on the localization of single landmarks. No user interaction is required. It can be fully automatic in that it makes no assumption of the functional form or relationship between image intensities in the image to be registered. A disadvantage is that the calculation effort is much higher compared to point matching methods.
Also, since this method considers the global intensity distribution and looks for the optimum over the image sequence as a whole, it is possible that the precision of the superposition is suboptimal in the area of interest. The theory and an application of mutual information-based medical image registration are presented in the chapters that follow.

1.2 Types of Transformation

Since we are three-dimensional, moving beings, registration should be four-dimensional, that is, it should include the three spatial dimensions as well as the temporal dimension. However, in practice, assumptions and approximations are made so that the body can be represented in fewer dimensions.

- 2D-2D: 2D images may be registered simply by a rotation and two orthogonal translations. Corrections for differences in scaling from the real object to each of the images to be registered may also be necessary. Controlling the geometry of image acquisition is usually very difficult, so clinically relevant examples of 2D-2D registration are rare. For this study, 2D images are used for the sake of simplicity and the fact that these were the images that were available. These images are registered by a rotation and vertical and horizontal translations.
- 3D-3D: In the case of 3D-3D image registration, three translations and three rotations bring the images into registration. It is assumed that the imaged part of the body behaves as a rigid body, that is, the internal anatomy of the patient has not distorted or changed. Careful calibration of scanning devices is required to determine image scaling.
- 2D-3D: When establishing correspondence between 3D and projection images, 2D-3D registration may be required. The main application of these methods is in image-guided interventions.
- Time: This class of registration problem concerns image sequences that follow some process that changes with time.
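The 2D-2D rigid case above (one in-plane rotation plus two orthogonal translations) can be written down directly. The following is an illustrative Python/NumPy fragment, with assumed names, not code from the thesis:

```python
import numpy as np

def rigid_transform_2d(points, theta_deg, tx, ty):
    """Apply an in-plane rigid transformation: rotation by theta_deg
    about the origin, followed by translations (tx, ty).
    `points` is an (N, 2) array of (x, y) coordinates."""
    t = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    return points @ rot.T + np.array([tx, ty])
```

Because the transformation is rigid, the distance between any two points is preserved, which is exactly the defining property listed in the taxonomy of Section 1.1.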
[7]

1.3 Image Registration as an Optimization Problem

Algorithms that directly calculate the transformation to establish correspondence between two images can be used where the images to be registered have very similar intensities and the transformation required to establish correspondence is small. This is the case with the point matching, or Procrustes, method described above. In other cases, a process of optimization is required. Optimization algorithms compute a cost, or similarity, function relating to how well two images are registered. The goal is to minimize, or maximize, the associated cost function. The cost function can be expressed as

C = C_transformation + C_similarity,

where the first term characterizes the cost associated with particular transformations (lateral translations, rotations, nonlinear, etc.) and the second term characterizes the similarity between the reference and test images. Mutual information and sum of squared intensity differences are examples of cost functions. For mutual information-based image registration, as will be seen in the following chapters, the cost function increases as the images to be registered come into alignment. Conversely, when the sum of squared intensity differences algorithm is used, the cost function decreases as alignment increases.

Image registration as an optimization problem is hard to solve in that the problem is ill-posed [10]. Small changes of the input images can lead to completely different registration results. Optimization algorithms can converge to a suboptimal solution, a local optimum.

Figure 1.1: Registration example with multiple solutions. The image B is to be transformed to image A. The rightmost image shows B registered to A by two translations and a rotation of 45 degrees.
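The optimization framing above can be made concrete with a coarse grid search over the transform parameters, maximizing the similarity term. This brute-force sketch is for illustration only (it is not one of the search strategies the thesis evaluates), and the names are assumptions:

```python
import itertools

def register_exhaustive(similarity, angles, shifts_x, shifts_y):
    """Evaluate the similarity (cost) function over a grid of
    (theta, tx, ty) parameters and keep the maximizer. For MI-based
    registration the cost increases as the images come into
    alignment, so the goal here is maximization."""
    best_params, best_score = None, float("-inf")
    for theta, tx, ty in itertools.product(angles, shifts_x, shifts_y):
        score = similarity(theta, tx, ty)
        if score > best_score:
            best_params, best_score = (theta, tx, ty), score
    return best_params, best_score
```

Exhaustive search avoids local optima on the grid but scales poorly, which is why the thesis turns to greedy, genetic, simulated-annealing, and simplex strategies.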
In order to register the image labeled B to the one labeled A, several solutions are found: two pure translations followed by a rotation of 45 degrees, or equivalently, a rotation of 135 degrees followed by two pure translations, etc. The objective value at optimality is the same, but the solution path is different. Clearly, more information is required in order to determine which transformation should be used.

A property of the medical images that are registered is that they are discrete. That is, the object is sampled at a finite number of voxels or pixels. In this study, four discrete search algorithms are used with mutual information to find a solution to a medical image registration problem: a greedy algorithm, a genetic algorithm, simulated annealing, and MATLAB's fminsearch algorithm that uses the Nelder-Mead simplex (direct search) method.

Greedy algorithms are algorithms which follow the problem-solving metaheuristic of making the locally optimal choice at each stage with the hope of finding the global optimum. Greedy algorithms find the overall, or globally, optimal solution for some optimization problems, but may find suboptimal solutions for some instances of other problems. Greedy algorithms work in phases. In each phase, a decision is made that appears to be good, without regard for future consequences. Generally, this means that some local optimum is chosen. This "take what you can get now" strategy is the source of the name for this class of algorithms. Hopefully, when the algorithm terminates, the local optimum is the global optimum. If this is the case, then the algorithm is correct. Otherwise, the algorithm has produced a suboptimal solution. In the case of medical image registration, it is probably safe to state that the global optimum is the required solution.

Genetic algorithms comprise a particular class of evolutionary algorithms inspired by the mechanisms of genetics, which has been applied to global optimization problems. Genetic algorithms use biologically derived techniques such as inheritance, mutation, natural selection, and recombination (or crossover). With genetic algorithms, each of a population consisting of a number of trial solutions, in this case image transformations, is evaluated to yield fitness. A new generation is created from the better of them, crossover being the dominant means of creating new members. The process is continued through a number of generations with the objective that the population should evolve to contain a global optimum, or at least an acceptable solution.
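The greedy strategy described above (move to the best improving neighbor, stop at a local optimum) can be sketched as follows. This is an illustrative Python fragment; the fixed parameter stepping and the names are assumptions, not the thesis's implementation:

```python
def greedy_register(score, start, step=1.0):
    """Greedy local search over (theta, tx, ty): in each phase, move
    to the best neighboring transform if it improves the score;
    otherwise stop. The result is a local optimum, which may or may
    not be the global optimum."""
    current = tuple(start)
    current_score = score(*current)
    while True:
        # all one-step perturbations of the three parameters
        neighbors = [
            tuple(c + (d if i == j else 0.0) for j, c in enumerate(current))
            for i in range(3) for d in (-step, step)
        ]
        best = max(neighbors, key=lambda p: score(*p))
        best_score = score(*best)
        if best_score <= current_score:
            return current, current_score
        current, current_score = best, best_score
```

On a unimodal score the phases march straight to the optimum; on the multimodal MI surfaces of Chapter 4, the stopping point depends on the starting pose, which is why some runs in Table 4.3 converge to suboptimal solutions.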
Genetic algorithms use biologically-derived techniques such as inheritance, mutation, natural selection, and recombination (or crossover). With genetic algorithms, each of a population consisting of a number of trial solutions, in this case image transformations, is evaluated to yield fitness. A new generation is created, crossover being the dominate means of creating new members, from the better of them. The process is continued through a number of generations with the objective that the population should evolve to contain a global optimum or at least an acceptable solution. 11 The MATLAB program fminsearch uses the Nelder-Mead simplex (direct search) method of [8]. This is a direct search method that does not use numerical or analytic gradients. It is based on evaluating a function at the vertices of a simplex, then iteratively shrinking the simplex as better points are found until some desired bound is obtained [11], 12 0 0 1 1 Figure 2.1: Discrete noiseless binary channel. Figure 2.2: Discrete binary symmetric channel. 2. Mutual Information Theory The derivation of mutual information from information theory and the associ- ated mathematics are considered in this chapter. 2.1 Information In 1928, R. V. L. Hartley [6] was the first to suggest the use of logarithms for the measure of information. The definition of information in the context of 13 information theory is an engineering definition based on probabilities, not on the meaning of symbols communicated to a human receiver. Therefore, it is somewhat counterintuitive. Consider an alphabet of n symbols, ai,a2,... ,an, each with its probability p(ai),p(a,2), .. ,p{an) of occurrence, communicated via some channel. If p(ai) = 1, then ai will be received with certainty. This is the case of the noiseless binary communication channel pictured in Figure 2.1. Any transmission, 0 or 1, is received without error. There is no surprise in the occurrence of ai given p(o.i), so no information is obtained. 
If a symbol with a low probability occurs, there is more surprise, more information. This might be the case for the binary channel pictured in Figure 2.2, where, for example, a 0 is sent and a 1 is received. Thus, information is somewhat inversely related to the probability of occurrence. The information from two independent symbols is the sum of the information from each separately (see Appendix A, Example 6). Since the probabilities of two independent choices are multiplied together to get the probability of the compound event, it is natural to define the amount of information as

Definition 2.1 (Information) I(a_i) = log (1 / p(a_i)). [22]

Remark 2.2 Information is inversely proportional to probability. Symbols with the least probability of occurring will provide the most information.

Figure 2.3: The entropy function for base 2, e, and 10 logs.

As a result,

I(a_1) + I(a_2) = log (1 / (p(a_1) p(a_2))) = I(a_1, a_2).

Thus, given the two probabilities of events, the probability of both occurring, assuming independence, is a product. This product yields the amount of information as a sum.

2.2 Entropy (uncertainty or complexity)

The entropy function measures the amount of uncertainty, surprise, or information in the outcome of a situation, for example, the reception of a message or the outcome of some experiment. It involves only the distribution of probabilities. The Shannon-Wiener entropy measure H, originally developed as part of communication theory in the 1940s [15, 17], is the most commonly used measure of information in signal and image processing.

Figure 2.4: Decomposition of a choice from three possibilities.

The formula is derived from three conditions that a measure of uncertainty in a communication channel should satisfy:

1. The functional should be continuous in the p_i.
2. If all p_i equal 1/n, where n is the number of symbols, then H should be monotonically increasing in n.
3.
If a choice is broken down into a sequence of choices, then the original value of H should be the weighted sum of the constituent values of H. That is,

H(p_1, p_2, p_3) = H(p_1, p_2 + p_3) + (p_2 + p_3) H(p_2 / (p_2 + p_3), p_3 / (p_2 + p_3)).

The meaning of this condition is illustrated in Figure 2.4. On the left there are three possibilities: p_1 = 1/2, p_2 = 1/3, and p_3 = 1/6. On the right a choice is made first between two possibilities, each with probability 1/2. If the second occurs, another choice is made with probabilities 2/3 and 1/3. The final results have the same probabilities as before. In this special case it is required that

H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3).

The coefficient 1/2 is the weighting factor introduced because this second choice occurs only half the time. Shannon proved that the Σ p_i log p_i form was the only functional form satisfying all three conditions.

Theorem 2.3 The only H satisfying all three assumptions is of the form

H = -K Σ_i p_i log p_i,

where K is a positive constant.

Proof: Let A(n) = H(1/n, 1/n, ..., 1/n). From condition (3), a choice from s^m equally likely possibilities can be decomposed into a series of m choices, each from s equally likely possibilities, from which

A(s^m) = m A(s)

is obtained. Similarly,

A(t^n) = n A(t).

n can be chosen arbitrarily large and m found to satisfy

s^m ≤ t^n < s^(m+1).

Taking logarithms and dividing by n log s yields

m/n ≤ (log t)/(log s) < m/n + 1/n, or |m/n - (log t)/(log s)| < ε,

where ε is arbitrarily small. From the monotonic property of A(n),

A(s^m) ≤ A(t^n) ≤ A(s^(m+1)), so m A(s) ≤ n A(t) ≤ (m + 1) A(s).

Dividing by n A(s),

m/n ≤ A(t)/A(s) ≤ m/n + 1/n, or |m/n - A(t)/A(s)| < ε.

Hence

|A(t)/A(s) - (log t)/(log s)| < 2ε,

so that A(t) = K log t, where K must be positive to satisfy (2).

Now suppose there is a choice from n possibilities with commeasurable probabilities p_i = n_i / Σ n_i, where the n_i are integers. A choice from Σ n_i possibilities can be broken down into a choice from n possibilities with probabilities p_1, ..., p_n and then, if the i-th were chosen, a choice from n_i with equal probabilities.
Using condition (3) again, the total choice from Σ n_i possibilities, as computed by the two methods, is equated:

K log Σ n_i = H(p_1, ..., p_n) + K Σ p_i log n_i.

Hence

H = K [Σ p_i log Σ n_j - Σ p_i log n_i] = -K Σ p_i log (n_i / Σ n_j) = -K Σ p_i log p_i.

If the p_i are incommensurable, they may be approximated by rationals, and the same expression must hold by the continuity assumption. Thus the expression holds in general. The choice of coefficient K is a matter of convenience and amounts to the choice of a unit of measure. [16]

Entropy has a maximum value if all symbols have equal probability of occurring (that is, p_i = 1/n for all i), and has a minimum value of zero if the probability of one symbol occurring is one and the probability of all the others occurring is zero. [7]

Definition 2.4 The entropy H(X) of a discrete random variable X (Appendix A, Definition A.1) is a measure of the uncertainty of the random variable, and is defined by

H(X) = -Σ_{x∈X} p(x) log p(x).

The above quantity can also be expressed as H(p). [2]

Remark 2.5 The entropy of X can also be interpreted as the expected value (Appendix A, Definition A.3) of log(1/p(X)), where X is drawn according to the probability distribution function p(x). Thus

H(X) = E_p log(1 / p(X)). [2]

The following lemmas are immediate consequences of Definition 2.4.

Lemma 2.6 H(X) ≥ 0.

Proof: 0 ≤ p(x) ≤ 1 implies log(1/p(x)) ≥ 0.

Lemma 2.7 H_b(X) = (log_b a) H_a(X). (Please refer to Appendix C.)

Proof: log_b p = (log_b a)(log_a p). [2]

Example 1. Entropy as a function of two probabilities (Figure 2.5). Let

X = 1 with probability p, 0 with probability 1 - p.

Then

H(X) = -p log p - (1 - p) log(1 - p) = H(p).

Using 2 as the base of the logarithm, H(X) = 1 when p = 1/2. Figure 2.5 illustrates some of the basic properties of entropy. Entropy is a concave (Appendix B, Definition B.2) function of the distribution and equals 0 when p equals 0 or 1.

Figure 2.5: The entropy function (base 2 log) of two probabilities.

This makes sense, since there is no uncertainty when p equals 0 or 1.
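Definition 2.4 and Example 1 are straightforward to compute. The following is a minimal sketch (the function names are mine, not from the text) that implements the entropy of a discrete distribution and the two-probability case:

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum p(x) log p(x) (Definition 2.4); p(x) = 0 terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def binary_entropy(p):
    """Entropy of the two-point distribution {p, 1 - p} of Example 1."""
    return entropy([p, 1.0 - p])
```

As in Figure 2.5, binary_entropy(0.5) evaluates to 1 bit, the maximum, while binary_entropy(0) and binary_entropy(1) evaluate to 0; Lemma 2.7 can be checked by comparing entropy(p, base=2) with entropy(p, base=math.e) / math.log(2).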
Uncertainty is maximum when p equals 1/2, which also corresponds to the maximum value of the entropy using the base 2 logarithm. Using Lemma 2.7 for the single term p log2(1/p) (Figure 2.3, the r = 2 curve) and differentiating,

d/dp (p log2(1/p)) = (d/dp (p loge(1/p))) log2 e = log2(1/p) - log2 e.

From the last equation, it can be seen that the slope at p = 0 is infinite and that the maximum of the term p log2(1/p) occurs at p = 1/e.

Remark 2.8

lim_{x→0+} (x loge x) = 0.

Proof:

lim_{x→0+} (x loge x) = lim_{x→0+} (loge x) / (1/x).

Application of l'Hospital's rule yields

lim_{x→0+} (1/x) / (-1/x^2) = lim_{x→0+} (-x) = 0.

Theorem 2.9 (Fundamental Inequality)

Σ_{i=1}^{n} x_i log2 (y_i / x_i) ≤ 0.

Proof: Fitting the tangent line at the point (1, 0) on the loge x function (Figure 2.6), the slope is

d(loge x)/dx |_{x=1} = 1,

so that the tangent line is y - 0 = 1 (x - 1), or y = x - 1. Therefore, for all x greater than 0,

loge x ≤ x - 1. (2.1)

Figure 2.6: Bound on the loge(x) function.

Let x_i and y_i be two probability distribution functions, where, of course, Σ x_i = 1 and Σ y_i = 1, with x_i, y_i ≥ 0. Now consider the following expression, using Lemma 2.7, involving both distributions:

Σ x_i log2 (y_i / x_i) = (log2 e) Σ x_i loge (y_i / x_i).

Using the relationship of Equation 2.1,

(log2 e) Σ x_i loge (y_i / x_i) ≤ (log2 e) Σ x_i (y_i / x_i - 1) = (log2 e) (Σ y_i - Σ x_i) = 0.

Converting back to log base 2 yields the Fundamental Inequality,

Σ_{i=1}^{n} x_i log2 (y_i / x_i) ≤ 0,

or, equivalently,

Σ x_i log2 (1 / x_i) ≤ Σ x_i log2 (1 / y_i).

For the last two equations, equality holds only when all x_i = y_i. The negative of the lefthand side of the first equation, Σ x_i log2 (x_i / y_i), is also called the relative entropy or Kullback-Leibler distance (see also Section 2.2.2). [22]

2.2.1 Joint Entropy and Conditional Entropy

Definition 2.10 The joint entropy H(X,Y) of a pair of discrete random variables (X,Y) with a joint distribution p(x,y) is defined as

H(X,Y) = -Σ_{x∈X} Σ_{y∈Y} p(x,y) log p(x,y). (2.2) [2]

For the special case when X and Y are statistically independent (Appendix A, Definition A.5) for all i and j, the independence condition is expressed by

p(x_i, y_j) = p(x_i) p(y_j).

The joint entropy becomes

H(X,Y) = -Σ_{i=1}^{n} Σ_{j=1}^{m} p(x_i) p(y_j) [log p(x_i)
+ log p(y_j)].

Since Σ_{i=1}^{n} p(x_i) = 1 and Σ_{j=1}^{m} p(y_j) = 1, then

H(X,Y) = H(X) + H(Y).

Definition 2.11 If (X,Y) ~ p(x,y), then the conditional entropy H(Y|X) is defined as

H(Y|X) = Σ_{x∈X} p(x) H(Y | X = x)
       = -Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x)
       = -Σ_{x∈X} Σ_{y∈Y} p(x,y) log p(y|x)
       = -Σ_{x∈X} Σ_{y∈Y} p(x,y) log (p(x,y) / p(x)). (2.3) [2]

Remark 2.12 If X and Y are independent, then

H(Y|X) = -Σ_{x∈X} Σ_{y∈Y} p(x) p(y) log p(y)
       = Σ_{y∈Y} p(y) log (1 / p(y))
       = H(Y).

Theorem 2.13 (Chain Rule for Entropy) Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) = Σ_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1).

Proof: By repeated application of the two-variable expansion rule for entropies,

H(X_1, X_2) = H(X_1) + H(X_2 | X_1),
H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1) = H(X_1) + H(X_2 | X_1) + H(X_3 | X_2, X_1),
...
H(X_1, ..., X_n) = H(X_1) + H(X_2 | X_1) + ... + H(X_n | X_{n-1}, ..., X_1) = Σ_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1).

Alternative proof: Write p(x_1, ..., x_n) = Π_{i=1}^{n} p(x_i | x_{i-1}, ..., x_1), and evaluate

H(X_1, X_2, ..., X_n) = -Σ_{x_1,...,x_n} p(x_1, ..., x_n) log p(x_1, ..., x_n)
  = -Σ_{x_1,...,x_n} p(x_1, ..., x_n) Σ_{i=1}^{n} log p(x_i | x_{i-1}, ..., x_1)
  = -Σ_{i=1}^{n} Σ_{x_1,...,x_n} p(x_1, ..., x_n) log p(x_i | x_{i-1}, ..., x_1)
  = -Σ_{i=1}^{n} Σ_{x_1,...,x_i} p(x_1, ..., x_i) log p(x_i | x_{i-1}, ..., x_1)
  = Σ_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1). [2]

Theorem 2.14 (Chain Rule for Conditional Entropy)

H(X,Y) = H(X) + H(Y|X).

Proof:

H(X,Y) = -Σ_{x∈X} Σ_{y∈Y} p(x,y) log p(x,y)
       = -Σ_{x∈X} Σ_{y∈Y} p(x,y) log p(x) p(y|x)
       = -Σ_{x∈X} Σ_{y∈Y} p(x,y) log p(x) - Σ_{x∈X} Σ_{y∈Y} p(x,y) log p(y|x)
       = -Σ_{x∈X} p(x) log p(x) - Σ_{x∈X} Σ_{y∈Y} p(x,y) log p(y|x)
       = H(X) + H(Y|X).

Equivalently, log p(X,Y) = log p(X) + log p(Y|X); taking the expectation of both sides of the equation yields the theorem. [2]

Example 2. Noisy communication channel.

Figure 2.7: Noisy communication channel, Example 2.

The input source to a noisy communication channel, Figure 2.7, is a random variable X (Appendix A, Definition A.1) over the four symbols a, b, c, d. The output from the channel is a random variable Y over the same four symbols.
The joint distribution (Appendix A, Definition A.4) of X and Y is given by the following table of p(x,y):

          x=a    x=b    x=c    x=d
   y=a    1/8    1/16   1/16   1/4
   y=b    1/16   1/8    1/16   0
   y=c    1/32   1/32   1/16   0
   y=d    1/32   1/32   1/16   0

The marginal distribution (Appendix A, Definition A.4) for X is (1/4, 1/4, 1/4, 1/4). The marginal entropy of X, H(X), is

H(X) = -Σ_{i=1}^{4} p(x_i) log2 p(x_i) = -4 (1/4)(log2 1 - log2 4) = 2 bits.

Since the base 2 log is used, the resulting unit of information is called a bit (binary digit). If base e is used, the unit of information is called a nat. If base 10 is used, the unit is called a Hartley, after R. V. L. Hartley, the first person to propose the use of the logarithmic measure of information.

The marginal distribution for Y is (1/2, 1/4, 1/8, 1/8). The marginal entropy of Y, H(Y), is

H(Y) = -Σ_{i=1}^{4} p(y_i) log2 p(y_i) = (1/2) log2 2 + (1/4) log2 4 + 2 (1/8) log2 8 = 7/4 bits.

The joint entropy, H(X,Y), of X and Y in bits is the sum of -p log p (Section 2.2.1, Equation 2.2) over all 16 probabilities in the joint distribution:

H(X,Y) = -Σ_{x∈X} Σ_{y∈Y} p(x,y) log2 p(x,y)
       = (1/4) log2 4 + 2 (1/8) log2 8 + 6 (1/16) log2 16 + 4 (1/32) log2 32
       = 1/2 + 3/4 + 3/2 + 5/8 = 27/8 bits.

The conditional entropy H(Y|X) (Section 2.2.1, Equation 2.3) is

H(Y|X) = Σ_{x∈X} p(x) H(Y | X = x)
       = (1/4) H(1/2, 1/4, 1/8, 1/8) + (1/4) H(1/4, 1/2, 1/8, 1/8) + (1/4) H(1/4, 1/4, 1/4, 1/4) + (1/4) H(1, 0, 0, 0)
       = (1/4)(7/4) + (1/4)(7/4) + (1/4)(2) + (1/4)(0)
       = 11/8 bits.

Equivalently, by Theorem 2.14, H(Y|X) = H(X,Y) - H(X) = 27/8 - 2 = 11/8 bits. This example is continued in Section 2.3, where the mutual information for this example is calculated.

2.2.2 Relative Entropy

Definition 2.15 The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

D(p||q) = Σ_{x∈X} p(x) log (p(x) / q(x)) = E_p log (p(X) / q(X)), (2.4)

where E denotes expectation.

Definition 2.16 The conditional relative entropy D(p(y|x)||q(y|x)) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x). More precisely,

D(p(y|x)||q(y|x)) = Σ_x p(x) Σ_y p(y|x) log (p(y|x) / q(y|x)) = E_{p(x,y)} log (p(Y|X) / q(Y|X)),

where E denotes expectation.
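The entropies of Example 2 can be verified numerically. A minimal sketch (the variable names are mine, not from the text), which also exercises the chain rule of Theorem 2.14:

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities; zero entries contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y) of Example 2: rows are y = a, b, c, d,
# columns are x = a, b, c, d.
joint = [
    [1/8,  1/16, 1/16, 1/4],   # y = a
    [1/16, 1/8,  1/16, 0],     # y = b
    [1/32, 1/32, 1/16, 0],     # y = c
    [1/32, 1/32, 1/16, 0],     # y = d
]

p_x = [sum(row[j] for row in joint) for j in range(4)]   # marginal of X
p_y = [sum(row) for row in joint]                        # marginal of Y

h_x = H(p_x)                                 # 2 bits
h_y = H(p_y)                                 # 7/4 bits
h_xy = H([p for row in joint for p in row])  # 27/8 bits
h_y_given_x = h_xy - h_x                     # 11/8 bits, by Theorem 2.14
```

Running this reproduces the values computed above: H(X) = 2, H(Y) = 7/4, H(X,Y) = 27/8, and H(Y|X) = 11/8 bits.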
Theorem 2.17 (Chain Rule for Relative Entropy)

D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x)).

Proof:

D(p(x,y)||q(x,y)) = Σ_x Σ_y p(x,y) log (p(x,y) / q(x,y))
                  = Σ_x Σ_y p(x,y) log (p(x) p(y|x) / (q(x) q(y|x)))
                  = Σ_x Σ_y p(x,y) log (p(x) / q(x)) + Σ_x Σ_y p(x,y) log (p(y|x) / q(y|x))
                  = D(p(x)||q(x)) + D(p(y|x)||q(y|x)). [2]

Theorem 2.18 (Jensen's Inequality) If f is a convex function (Appendix B, Definition B.1) and X is a random variable, then

E f(X) ≥ f(E X),

where E denotes expectation. If f is strictly convex, then equality implies that X = EX with probability 1, that is, X is a constant.

Proof: For a two mass point distribution, the inequality becomes

p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),

which follows directly from the definition of convex functions. Suppose that the theorem is true for distributions with k - 1 mass points. Then, writing p'_i = p_i / (1 - p_k) for i = 1, 2, ..., k - 1,

Σ_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) Σ_{i=1}^{k-1} p'_i f(x_i)
                      ≥ p_k f(x_k) + (1 - p_k) f(Σ_{i=1}^{k-1} p'_i x_i)
                      ≥ f(p_k x_k + (1 - p_k) Σ_{i=1}^{k-1} p'_i x_i)
                      = f(Σ_{i=1}^{k} p_i x_i),

where the first inequality follows from the induction hypothesis and the second follows from the definition of convexity. [2]

Theorem 2.19 (Log sum inequality) For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

Σ_{i=1}^{n} a_i log (a_i / b_i) ≥ (Σ_{i=1}^{n} a_i) log (Σ_{i=1}^{n} a_i / Σ_{i=1}^{n} b_i),

with equality if and only if a_i / b_i = constant.

Proof: Assume without loss of generality that a_i > 0 and b_i > 0 (if a_i = b_i = 0, the corresponding term is 0). The function f(t) = t log t is strictly convex, since

f''(t) = (1/t) log e > 0

for all positive t. Hence, by Jensen's inequality (Theorem 2.18), for α_i ≥ 0 with Σ α_i = 1,

Σ α_i f(t_i) ≥ f(Σ α_i t_i).

Setting α_i = b_i / Σ_{j=1}^{n} b_j and t_i = a_i / b_i,

Σ (a_i / Σ b_j) log (a_i / b_i) ≥ (Σ a_i / Σ b_j) log (Σ a_i / Σ b_j)

is obtained, which is the log sum inequality. [2]

Theorem 2.20 D(p||q) is convex in the pair (p, q); that is, if (p_1, q_1) and (p_2, q_2) are two pairs of probability mass functions, then

D(λ p_1 + (1 - λ) p_2 || λ q_1 + (1 - λ) q_2) ≤ λ D(p_1||q_1) + (1 - λ) D(p_2||q_2) (2.5)

for all 0 ≤ λ ≤ 1.
Proof: The log sum inequality is applied to a term on the lefthand side of 2.5, that is,

(λ p_1(x) + (1 - λ) p_2(x)) log [(λ p_1(x) + (1 - λ) p_2(x)) / (λ q_1(x) + (1 - λ) q_2(x))]
  ≤ λ p_1(x) log (λ p_1(x) / (λ q_1(x))) + (1 - λ) p_2(x) log ((1 - λ) p_2(x) / ((1 - λ) q_2(x))).

Summing this over all x, the desired property is obtained. [2]

Theorem 2.21 (Information Inequality) Let p(x), q(x), x ∈ X, be two probability distribution functions. Then

D(p||q) ≥ 0,

with equality if and only if p(x) = q(x) for all x.

Proof: Let A = {x : p(x) > 0} be the support of p(x). Then

-D(p||q) = -Σ_{x∈A} p(x) log (p(x) / q(x))
         = Σ_{x∈A} p(x) log (q(x) / p(x))
         ≤ log Σ_{x∈A} p(x) (q(x) / p(x))   (2.6)
         = log Σ_{x∈A} q(x)
         ≤ log Σ_{x∈X} q(x)
         = log 1
         = 0,

where 2.6 follows from Theorem 2.18. Since log t is a strictly concave function of t, there is equality in 2.6 if and only if q(x)/p(x) = 1 everywhere, that is, p(x) = q(x). Hence D(p||q) = 0 if and only if p(x) = q(x) for all x. [2]

Theorem 2.22 H(X) ≤ log |X|, where |X| denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution over X.

Proof: Let u(x) = 1/|X| be the uniform probability mass function over X, and let p(x) be the probability mass function (Definition A.2, Appendix A) for X. Then

D(p||u) = Σ p(x) log (p(x) / u(x)) = log |X| - H(X).

By the non-negativity of relative entropy,

0 ≤ D(p||u) = log |X| - H(X).

Corollary 2.23

D(p(y|x)||q(y|x)) ≥ 0,

with equality if and only if p(y|x) = q(y|x) for all y and x with p(x) > 0. [2]

2.2.3 Concavity of Entropy

Theorem 2.24 (Concavity of entropy) H(p) is a concave (Appendix B, Theorem B.1) function of p.

Proof: H(p) = log |X| - D(p||u), where u is the uniform distribution on |X| outcomes. The concavity of H then follows directly from the convexity of D.

Alternative proof: Let X_1 be a random variable with distribution p_1 taking on values in a set A. Let X_2 be another random variable with distribution p_2 on the same set. Let

θ = 1 with probability λ, 2 with probability 1 - λ.

Let Z = X_θ. Then the distribution of Z is λ p_1 + (1 - λ) p_2.
Since conditioning reduces entropy (Theorem 2.31), we have H(Z) ≥ H(Z|θ), or equivalently,

H(λ p_1 + (1 - λ) p_2) ≥ λ H(p_1) + (1 - λ) H(p_2),

which proves the concavity of the entropy as a function of the distribution. [2]

2.2.4 Entropy of a Signaling System

Consider again the communication channel of Section 2.1. If I(a_i) units of information are obtained when the symbol a_i is received, then, on average, since p(a_i) is the probability of getting the information I(a_i), for each symbol a_i,

p(a_i) I(a_i) = p(a_i) log (1 / p(a_i))

units of information are obtained. From this it follows that, on average, over the entire alphabet of symbols a_i,

Σ_{i=1}^{q} p(a_i) log (1 / p(a_i))

units of information will be received. This important quantity is the entropy of the signaling system A, having symbols a_i with probabilities p(a_i):

H(A) = Σ_{i=1}^{q} p(a_i) log (1 / p(a_i)).

2.2.5 Entropy and Kolmogorov Complexity

The Kolmogorov complexity is approximately equal to the Shannon entropy H. Entropy is often called information in that it is a bound on the length of the code that is required to transmit a message.

Kolmogorov complexity (also called minimum description length, Chaitin complexity, descriptive complexity, shortest program length, and algorithmic entropy) is related to communication, an application of information theory. If the sender and receiver agree upon a specification method, such as an encoding or compression technique, then a message a can be transmitted as b and decoded given some fixed method L, denoted L(b) = a. The cost of the transmission of a is the length of the transmitted message b. The least cost is the minimum length of such a message. The minimum length is the entropy of a given the method of transmission. The complexity of the minimum description length is universal, that is, computer independent.
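The inequalities of Sections 2.2.2 and 2.2.3 are easy to spot-check numerically. The following sketch (distributions and names are illustrative, not from the text) exercises the information inequality (Theorem 2.21), the identity D(p||u) = log|X| - H(X) from the proof of Theorem 2.22, and the concavity of entropy (Theorem 2.24), all in base 2:

```python
import math

def entropy(p):
    """Entropy in bits (Definition 2.4); zero entries contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    """Relative entropy D(p||q) in bits (Definition 2.15); requires q(x) > 0
    wherever p(x) > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def mix(p1, p2, lam):
    """The mixture lam*p1 + (1 - lam)*p2, the distribution of Z = X_theta."""
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

p = [1/2, 1/4, 1/8, 1/8]
u = [1/4] * 4

# Theorem 2.21: D(p||q) >= 0, with equality iff p = q.
assert kl(p, u) >= 0 and kl(p, p) == 0

# Proof of Theorem 2.22: D(p||u) = log|X| - H(X).
assert abs(kl(p, u) - (math.log2(4) - entropy(p))) < 1e-12

# Theorem 2.24: H(lam*p1 + (1-lam)*p2) >= lam*H(p1) + (1-lam)*H(p2).
p1, p2, lam = [0.9, 0.1], [0.2, 0.8], 0.3
assert entropy(mix(p1, p2, lam)) >= lam * entropy(p1) + (1 - lam) * entropy(p2)
```

Such numeric checks are spot tests at particular distributions, not proofs; the theorems above establish the inequalities for all distributions.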
Definition 2.25 The Kolmogorov complexity of a binary string a, K(a), is defined as the length of the shortest binary program b for computing the string:

K(a) = min_{U(b) = a} length(b),

where U represents the abstract universal Turing computer that can implement any algorithm and compute any computable function. [2, 3]

Remark 2.26 Kolmogorov complexity is reminiscent of Ockham's Razor, a principle formulated by William of Ockham (1300s) stating that terms, concepts, assumptions, etc., must not be multiplied beyond necessity. According to [2], Einstein allegedly said, "Everything should be made as simple as possible, but no simpler."

Theorem 2.27 (The Fundamental Theorem for a Noiseless Channel) Let a source have entropy H (bits per symbol) and a channel have a capacity C (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate C/H - ε symbols per second over the channel, where ε is arbitrarily small. It is not possible to transmit at an average rate greater than C/H.

Proof: The converse part of the theorem, that C/H cannot be exceeded, may be proven by noting that the entropy of the channel input per second is equal to that of the source, since the transmitter must be non-singular, and also this entropy cannot exceed the channel capacity. Hence H' ≤ C, and the number of symbols per second = H'/H ≤ C/H.

To prove the first part of the theorem, consider the set of all sequences of N symbols produced by the source. For large N, the sequences can be divided into two groups, one containing fewer than 2^((H+η)N) members and the second containing fewer than 2^(RN) members (where R is the logarithm of the number of different symbols) and having a total probability less than μ. As N increases, η and μ approach zero. The number of signals of duration T in the channel is greater than 2^((C-θ)T), with θ small when T is large.
If

T = (H/C + λ) N

is chosen, then there will be a sufficient number of sequences of channel symbols for the high probability group when N and T are sufficiently large (however small λ), and also some additional ones. The high probability group is coded in an arbitrary one-to-one way into this set. The remaining sequences are represented by larger sequences, starting and ending with one of the sequences not used for the high probability group. This special sequence acts as a start and stop signal for a different code. In between, a sufficient time is allowed to give enough different sequences for all the low probability messages. This will require

T_1 = (R/C + φ) N,

where
φ is small. The mean rate of transmission in message symbols per second will then be greater than

[(1 - μ) T/N + μ T_1/N]^(-1) = [(1 - μ)(H/C + λ) + μ(R/C + φ)]^(-1).

As N increases, μ, λ, and φ approach zero, and the rate approaches C/H. [16]