Citation
High speed multipliers for read channel ASICs

Material Information

Title:
High speed multipliers for read channel ASICs
Creator:
Chen, Michael H
Publication Date:
1998
Language:
English
Physical Description:
ix, 57 leaves ; 28 cm

Subjects

Subjects / Keywords:
Application-specific integrated circuits ( lcsh )
Linear integrated circuits ( lcsh )
Application-specific integrated circuits ( fast )
Linear integrated circuits ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 55-57).
Statement of Responsibility:
by Michael H. Chen.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
41462116 ( OCLC )
ocm41462116
Classification:
LD1190.E54 1998m .C46 ( lcc )



Full Text

PAGE 1

HIGH SPEED MULTIPLIERS FOR READ CHANNEL ASICS by Michael H. Chen B.S.E.E. University of Colorado at Denver, 1997 A thesis submitted to the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Master of Science Electrical Engineering 1998

PAGE 3

Chen, Michael H. (M.S., Electrical Engineering)
High Speed Multipliers for Read Channel ASICs
Thesis directed by Associate Professor Shelly Goggin

ABSTRACT

This thesis explores various multiplication architectures, together with their associated addition architectures, in search of the fastest (lowest-latency) 6-bit two's complement multiplier for use in read channel ASICs, more specifically in equalization. The proposed optimal multiplier is comprised of radix-8 Booth encoding with carry-save addition, which minimizes overall propagation delay by reducing both the number of partial products and the number of partial-product additions. This thesis will show that the performance achievable by the radix-8 Booth multiplier far outweighs its added hardware. The baseline for the multiplication architecture comparison is the Shift-Add multiplier, chosen mainly for performance and cost considerations. An additional high-performance multiplier was selected based on its computational efficiency and was evaluated against the baseline and optimal designs. The comparisons are made at the register transfer level (RTL) using the IBM ASIC CMOS 5S standard cell libraries. These libraries were chosen for their superior performance and low power consumption as compared with other ASIC libraries available on similar technologies.

PAGE 4

This abstract accurately represents the content of the candidate's thesis. I recommend its publication.

PAGE 5

Dedication I dedicate this thesis to my wife for her patience and support over the last couple of years, which made it possible for me to achieve my goals. Her unselfishness is to be recognized.

PAGE 6

Acknowledgements

This thesis would not have materialized had it not been for the vision of a brilliant senior read channel design engineer at the Exabyte Corporation, Chris Watanabe. My most sincere appreciation for his assistance and patience. I would also like to thank Doug Holmquist, ASIC engineer, also at the Exabyte Corporation, for his assistance in obtaining the ASIC databooks, which enabled the comparisons of the multiplication architectures. Special thanks to my original advisor, Dr. Richard Auletta, for his guidance, which kept me on track. His patience and flexibility made it possible for this thesis not to interfere with my duties at work. It was his genuine concern for his students' quest for knowledge that sparked my interest in VLSI systems. I've always been able to seek his assistance even late on Sunday nights! His professionalism is to be applauded.

PAGE 7

Contents

Chapter
1. Introduction ...................................................... 1
2. FIR Equalization in Read Channel ASICs ............................ 3
3. Addition Architectures ............................................ 7
3.1 Serial Adders .................................................... 7
3.2 Parallel Adders .................................................. 7
3.2.1 Carry Completion Adder ......................................... 7
3.2.2 Carry Look-Ahead Adder ......................................... 8
3.2.3 Carry Select Adder ............................................. 9
3.2.4 Carry Skip Adder .............................................. 10
3.2.5 Completion Prediction Adder ................................... 11
3.2.6 Conditional Sum Adder ......................................... 12
3.2.7 Manchester Carry Chain Adder .................................. 12
3.2.8 Ripple Carry Adder ............................................ 13
3.3 Addition Technique Comparison ................................... 14
4. Multiplication Architectures ..................................... 16
4.1 Sequential Multiplier ........................................... 16

PAGE 8

4.1.1 Bit-Slice Multiplier .......................................... 17
4.2 Parallel Multiplier ............................................. 18
4.2.1 Shift-Add Multiplier .......................................... 18
4.2.2 Bit-Slice Multiplier .......................................... 20
4.2.3 Radix-n Multiplier ............................................ 20
4.2.4 Wallace Tree Multiplier ....................................... 23
4.2.5 Pipelined Multiplier .......................................... 24
4.2.6 Dadda Multiplier .............................................. 26
4.3 Array Multiplier ................................................ 28
4.3.1 Braun Multiplier .............................................. 29
4.3.2 Baugh-Wooley Multiplier ....................................... 30
4.4 Multiplier Comparison ........................................... 32
5. Optimal Multiplier Solution ...................................... 34
5.1 Representation Consideration .................................... 34
5.2 Performance Consideration ....................................... 35
5.3 Proposed Multiplier Design ...................................... 35
5.3.1 Recoding Scheme ............................................... 35
5.3.2 Accumulation Scheme ........................................... 36
6. Performance Evaluation ........................................... 37
6.1 Propagation Delay of the Shift-Add Architecture ................. 37
6.2 Propagation Delay of the Modified-Booth Architecture ............ 39

PAGE 9

6.3 Propagation Delay of the Radix-8 Booth Architecture ............. 41
6.4 Performance Comparison of the Selected Architectures ............ 44
7. Conclusions ...................................................... 50

Appendix
Multiplier Performance Predictor in Matlab (multiply.m) ............. 52

References .......................................................... 55

PAGE 10

Chapter 1

Introduction

The demand for high performance read channel integrated circuits is continuously rising in the data storage industry. To meet this demand, numerous semiconductor companies have experimented with various read channel designs. With the advent of Partial Response Maximum Likelihood (PRML) and Extended Partial Response Maximum Likelihood (EPRML) read channel designs, transfer speeds of over 500 Mbit/s are possible. However, there are several bottlenecks associated with these designs that are essential to the performance of read channel ASICs: Trellis tuning algorithms, Viterbi implementation in its true fashion, the phase-locked loop (PLL), and finite impulse response (FIR) equalization. First, the problem with the Trellis tuning algorithm is the accuracy attainable from simultaneous processing of multiple parameters. Second, the Viterbi algorithm is currently implemented at half of the attainable efficiency due to independent processing of even and odd sequences. Third, the speed of the PLL limits the overall operating performance of the read channel ASIC. Lastly, the performance of the FIR equalization is limited by the speed of the coefficient multipliers as well as the tap adders. While all of these bottlenecks contribute to the performance of the read channel ASIC, this thesis concentrates on the FIR equalization. The intent of this paper is to explore various multiplication techniques that would improve the overall performance of a 5-tap FIR equalization filter with 6-bit multipliers as used in SSI's (Silicon Systems Inc.) 32P4101A PRML read channel chip, shown in Figure 1.1 [2].

PAGE 11

Figure 1.1: SSI 32P4101A EPR4 Read Channel ASIC

PAGE 12

Chapter 2

FIR Equalization in Read Channel ASICs

Equalization is used in read channel ASICs to remove distortion in the output signal caused by noise, impedance mismatch, read head dimensions, etc. In order to modify the amplitude and phase responses, equalization filters are used [3-5]. For example, suppose the target and actual channel responses are as shown in Figures 2.1 and 2.2, respectively. Then, to achieve the desired channel response, the shaded regions in Figure 2.3, which represent the differences between the target and the actual responses, must be attenuated or boosted accordingly.

Figure 2.1: Target Channel Response

PAGE 13

Figure 2.2: Actual Channel Response

Figure 2.3: Channel Equalization

PAGE 14

Thus, the equalization filter can be defined as follows:

YE(f) = YT(f) / YA(f) .............................................. (2.1)

where YE(f) is the equalization response, YT(f) is the target response, and YA(f) is the actual response. This equation can be easily realized by a finite impulse response (FIR) filter, sometimes referred to as a transversal filter [6]. The transfer function, H(z), of a causal FIR filter with length M is defined by [7]

H(z) = sum from k=0 to M-1 of h[k] z^(-k) .......................... (2.2)

where z^(-1) represents a unit delay and h is the impulse response of the filter. Equation 2.2 can be rewritten in the time domain as [7-9]

y[n] = sum from k=0 to M-1 of h[k] x[n-k] .......................... (2.3)

where x[n] and y[n] are the input and output sequences, respectively. An FIR filter can be realized in various forms, i.e. direct, cascade, etc. [7]. The direct form implementation of a 5-tap FIR filter is shown in Figure 2.4 as described by equation 2.4.
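As a concrete sketch of equation 2.3, the direct-form computation can be written in a few lines of Python; the tap coefficients here are illustrative placeholders, not the equalizer coefficients used in the thesis.

```python
# Direct-form FIR per equation 2.3: y[n] = sum_{k} h[k] * x[n-k].
# The coefficients below are hypothetical, chosen only for illustration.

def fir_filter(h, x):
    """Filter sequence x with impulse response h (direct form)."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]   # tap multiply, then accumulate
        y.append(acc)
    return y

h = [1, 2, 3, 2, 1]      # hypothetical symmetric 5-tap response
x = [1, 0, 0, 0, 0]      # unit impulse
print(fir_filter(h, x))  # [1, 2, 3, 2, 1] (the impulse response)
```

Feeding a unit impulse through the filter returns the tap vector itself, which is a quick sanity check on any direct-form implementation.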

PAGE 15

As shown in Figure 2.4, the speed of this 5-tap FIR filter is directly dependent on the performance of both the FIR coefficient multipliers and the tap adders. Hence, it is the intent of this paper to explore various multiplication and addition techniques to optimize the multipliers used in the realization of this FIR equalization filter.

PAGE 16

Chapter 3

Addition Architectures

A typical multiplier is often made up of AND gates (to generate partial products), shifters, and adders. Therefore, prior to evaluating several of the most popular multiplication algorithms in use, we will examine the core elements on which these multiplication algorithms are based. Adders in general can be classified by the method in which the inputs are processed, i.e. parallel, serial, and parallel/serial.

3.1 Serial Adders

Serial adders are generally not used in fast applications due to the sequential processing nature of these adders. However, some serial addition algorithms are currently being evaluated; serial adders require numerous clock cycles to compute the sum of two binary numbers, but with a short cycle time as compared to parallel adders [10].

3.2 Parallel Adders

Parallel adders are often used in fast multiplication applications due to their efficiency over serial adders. Several parallel addition architectures are examined in detail in the subsections that follow.

3.2.1 Carry Completion Adder

The carry completion adder (CCA) is constructed from an asynchronous ripple carry adder with carry completion detection circuitry [11]. The carry completion signal is generated whenever the sum output has been stable for a finite length of time. Unlike ordinary full adders, the carry completion adder has two carry-in inputs, Cin0 and Cin1, as well as two carry-out outputs, Cout0 and Cout1 [13]. Each adder stage computes the sum of two input bits based on both carry-in combinations. The first occurrence of a high signal on either of the two carry-out lines activates the

PAGE 17

completion circuit, which indicates that the output value is reliable and can be used. The Cin0 input will be high if the carry-out signal from the previous stage is low; otherwise, Cin1 will be high. The completion signal, CC, is defined by

CC = (Cout0_0 + Cout1_0)(Cout0_1 + Cout1_1) ... (Cout0_n-1 + Cout1_n-1) ... (3.1)

The carry completion adder is shown in Figure 3.1.

Figure 3.1: Carry Completion Adder

Due to its asynchronous nature, the total delay through this type of adder is dependent on the length of the operands.

3.2.2 Carry Look-Ahead Adder

One of the main disadvantages of the ripple carry adder is its carry delay. To reduce the overall carry delay in an adder, the carry look-ahead (CLA) adder generates carries in parallel instead of in series as in the ripple carry adder [10-12]. The carry of each stage is computed from the carry input and the propagate carry (P) and generate carry (G) signals, which are defined by [12]

Pi = Xi ⊕ Yi ....................................................... (3.2)

Gi = Xi Yi ......................................................... (3.3)

Therefore, the carry-out signal, Ci+1, can be expressed as shown in equation 3.4 [13]:

Ci+1 = Gi + Pi Ci .................................................. (3.4)
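The generate/propagate relations can be exercised with a small software model; this is an illustrative bit-level sketch (little-endian bit lists, arbitrary operand width), not RTL from the thesis.

```python
# Carry look-ahead sketch: G_i = X_i AND Y_i, P_i = X_i XOR Y_i, and the
# recurrence C_{i+1} = G_i + P_i*C_i.  In hardware the recurrence is expanded
# so every carry derives from X, Y, and C_0 in parallel; here it is simply
# evaluated in a loop to check the arithmetic.

def cla_add(x_bits, y_bits, c0=0):
    """Add equal-length little-endian bit lists; return (sum_bits, carry_out)."""
    g = [xb & yb for xb, yb in zip(x_bits, y_bits)]   # generate
    p = [xb ^ yb for xb, yb in zip(x_bits, y_bits)]   # propagate
    c = [c0]
    for i in range(len(x_bits)):
        c.append(g[i] | (p[i] & c[i]))                # C_{i+1} = G_i + P_i*C_i
    s = [p[i] ^ c[i] for i in range(len(x_bits))]     # sum uses incoming carry
    return s, c[-1]

# 6 + 7 = 13: 0110 + 0111 = 1101 (bit lists are LSB first)
print(cla_add([0, 1, 1, 0], [1, 1, 1, 0]))  # ([1, 0, 1, 1], 0)
```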

PAGE 18

Equation 3.4 can be rewritten in terms of the Gi's, Pi's, and C0 as shown in equation 3.5 [13], which enables the carries to be generated based only on the original inputs X and Y, hence eliminating the carry propagation delay:

Ci+1 = Gi + Pi Gi-1 + Pi Pi-1 Gi-2 + ... + Pi Pi-1 ... P0 C0 ....... (3.5)

With carry look-ahead addition, an additional cycle is needed for the addition. A 2-bit carry look-ahead adder is shown in Figure 3.2.

Figure 3.2: 2-Bit Carry Look-Ahead Adder

3.2.3 Carry Select Adder

The carry select adder is comprised of adder blocks that perform two additions each [12]. At each stage, one adder block assumes the carry-in input is a zero; the other assumes a one. The actual sum and carry-out of each stage are selected by the carry-out generated from the previous stage. A block diagram of the carry select adder structure is shown in Figure 3.3.

PAGE 19

Figure 3.3: 2-Bit Carry Select Adder

In this adder structure, the additions are performed in parallel for all input bits, which enables the carry select adder to achieve higher performance at the cost of added hardware. The computational efficiency gained outweighs the minimal delay introduced by the multiplexer.

3.2.4 Carry Skip Adder

The carry skip adder is constructed from a ripple carry adder with carry skip circuitry [10-11]. The ripple carry adder is divided into blocks, each with embedded carry skip logic. The carry skip circuitry is used to bypass several consecutive adder stages, which reduces the carry propagation time whenever possible [13]. Unlike the carry look-ahead adder, this type of adder does not generate carries in parallel but rather in a pseudo-parallel manner. An example of a 2-bit carry skip adder is shown in Figure 3.4.

Figure 3.4: 2-Bit Carry Skip Adder
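The carry-select idea can be sketched with integers standing in for the adder blocks; the operand width (6) and block width (2) are arbitrary illustrative choices, and the function names are mine.

```python
# Carry-select sketch: each block precomputes its sum for carry-in = 0 and
# carry-in = 1; the incoming carry then acts only as a multiplexer select.

def block_add(x, y, cin, width):
    total = x + y + cin
    return total & ((1 << width) - 1), total >> width

def carry_select_add(x, y, width=6, block=2):
    result, carry = 0, 0
    mask = (1 << block) - 1
    for shift in range(0, width, block):
        xb, yb = (x >> shift) & mask, (y >> shift) & mask
        sum0, c0 = block_add(xb, yb, 0, block)  # speculative: carry-in = 0
        sum1, c1 = block_add(xb, yb, 1, block)  # speculative: carry-in = 1
        s, carry = (sum1, c1) if carry else (sum0, c0)  # the "multiplexer"
        result |= s << shift
    return result, carry

print(carry_select_add(0b010110, 0b001011))  # (33, 0): 22 + 11 = 33
```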

PAGE 20

If both adder stages in Figure 3.4 receive identical inputs with reference to logic levels, the carry input is passed on directly; otherwise, the carry will be generated through propagation through both adder stages [10].

3.2.5 Completion Prediction Adder

The completion prediction adder as proposed by Lee [14] is a synchronous adder with completion prediction logic generated from evaluation of its inputs. This design enables both delay and hardware reduction. The block diagram of a completion prediction adder is shown in Figure 3.5.

Figure 3.5: Completion Prediction Adder

The completion prediction block shown in Figure 3.5 is comprised of carry propagation (Pi) generation circuitry, an AND tree, and a window detection circuit. The AND tree is used to track the number of successive instances where the Pi's are high. The window detection circuit is used to categorize the values returned by the AND tree in order to generate the prediction signals. This is accomplished by tapping off at various stages of the AND tree and evaluating the input levels through a combinational network. These prediction signals are then evaluated by the completion signal generation block to generate the completion signal used for synchronization [14].

PAGE 21

3.2.6 Conditional Sum Adder

The conditional sum adder was regarded as the fastest adder in the 1960s [12]. It is comprised of conditional cell blocks and multiplexers. Its theory of operation is very similar to that of the carry select adder. A block diagram of the conditional sum adder structure is shown in Figure 3.6.

Figure 3.6: Conditional Sum Adder

As depicted in Figure 3.6, there are two layers of multiplexers which generate the sum and carry-out signals based on both carry-in conditions (zero or one) from the previous stage. Although only two bits are shown in this figure, the structure can be extended to n bits, with the adder's total delay linearly proportional to the total number of multiplexer stages.

3.2.7 Manchester Carry Chain Adder

One way of improving the carry chain of a carry look-ahead adder is to utilize pre-charged circuits. The Manchester carry chain adder is formed by a carry look-ahead adder with pre-charged stages that derives the sum output from two

PAGE 22

signals: propagate (P) and generate (G). The diagram of a two-bit Manchester carry chain is shown in Figure 3.7 [10].

Figure 3.7: 2-Stage Manchester Carry Chain

During the pre-charge phase (clock set low), the storage node at the junction between the PMOS and NMOS transistors, which represents the complement of the carry-in, is charged to a 1 within each stage. During the evaluation phase (clock set high), if the Generate signal is also high, the storage node discharges, which generates a carry-out into the next stage [10].

3.2.8 Ripple Carry Adder

A ripple carry adder (RCA) is comprised of a series of full adders in which the carry-out output from the previous stage drives the carry-in input of the subsequent stage, hence its name [10-12]. Since the sum output of the nth bit cannot be completed until the carry outputs have propagated through all previous stages, as shown in Figure 3.8, the delay of a ripple carry adder for an n-bit addition can be characterized by

tT = (n-1) tc + ts ................................................. (3.7)

PAGE 23

where tT is the total delay of the RCA, tc is the carry delay, and ts is the sum delay.

Figure 3.8: Ripple Carry Adder

3.3 Adder Comparison

The carry completion adder can reduce the amount of carry propagation time at the expense of additional circuitry. However, it is unsuitable for synchronous operations due to the added hardware and delay required by the resynchronization process, a major drawback of its asynchronous nature [11]. The carry look-ahead adder is an ideal candidate for high speed addition if size is not an important factor. The added hardware to compute the carries in parallel accelerates the carry propagation but may render this type of adder unsuitable due to area [13]. The added hardware also equates to larger chip area, which dissipates more power [12]. The carry select adder is often a good compromise between performance and size [12]. The carry skip adder is comparable in speed to the carry look-ahead adder while consuming less power due to its smaller chip area requirement [13]. It is also relatively inexpensive and only slightly slower than the carry look-ahead adder [11]. The completion prediction adder yields high performance with regard to size and length of critical path when used in applications greater than 64 bits [14]. The conditional sum adder operates at roughly the same speed as the carry look-ahead adder, although its design is less modular. For this reason, carry look-ahead adders are used more frequently than conditional sum adders [13]. The Manchester carry chain adder utilizes pre-charged circuits to accelerate the carry chain [9]. The limited fan-out capability of this carry chain renders it impractical for larger adders. The main advantages of the ripple carry adder are simplicity, low cost, and relatively high performance. Its simple and highly regular structure lends itself to cell

PAGE 24

designs. However, this adder is unsuitable for large designs since its critical delay is linearly proportional to its length [12]. A performance summary is presented in Table 3.1, based on the characteristics of these adders along with the performance evaluation of FPGA implementations conducted by Xing [10-15].

Table 3.1: Adder Performance Summary

  Carry Completion         6  8
  Carry Look-Ahead         8
  Carry Select             4  3
  Carry Skip               2  4
  Completion Prediction    7
  Conditional Sum          7  6
  Manchester Carry Chain   3  2
  Ripple Carry

The ratings shown in Table 3.1 are arbitrarily assigned, 8 being the worst and 1 the best, and are based on n (the number of bits) less than 8. All of the above designs can be optimized through various techniques, which would yield results different from those shown in Table 3.1; it is not the intent of this paper to evaluate every possible implementation scheme.
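The linear delay growth that penalizes the ripple carry adder in such comparisons (equation 3.7) can be modelled directly; the unit delays below are arbitrary placeholders, not values from the cell library used in the thesis.

```python
# Ripple-carry sketch per equation 3.7: the carry threads through every full
# adder, so an n-bit add costs (n-1) carry delays plus one final sum delay.

def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_add(x_bits, y_bits):
    """LSB-first addition; returns (sum_bits, carry_out, modelled delay)."""
    t_carry, t_sum = 2, 3              # placeholder gate delays
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    n = len(x_bits)
    critical_delay = (n - 1) * t_carry + t_sum   # equation 3.7
    return out, carry, critical_delay

# 6-bit example: 38 + 19 = 57 -> 111001 (little-endian [1,0,0,1,1,1])
print(ripple_add([0, 1, 1, 0, 0, 1], [1, 1, 0, 0, 1, 0]))
```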

PAGE 25

Chapter 4

Multiplication Architectures

Binary multiplication in its most basic form consists of two steps: evaluation of partial products and summation of the shifted partial products [16]. For example, multiplication of two binary numbers is completed as shown in Figure 4.1. The partial products can be generated with AND gates by ANDing the multiplicand with each multiplier bit. High speed multipliers can be classified according to the manner in which the partial products and their sums are generated. These classifications are sequential multipliers, parallel multipliers, and array multipliers [13]. These three types of multipliers are discussed at length in sections 4.1, 4.2, and 4.3, respectively.

4.1 Sequential Multipliers

The sequential multiplier generates the partial products and sums them sequentially. Due to the sequential nature of serial multipliers, they are generally

PAGE 26

not used in high speed multiplication applications. Therefore, for the purpose of discussion, only one type of sequential multiplier, the bit-slice multiplier, is discussed in this paper.

4.1.1 Bit-Slice Multiplier

The bit-slice multiplier computes the product of two binary numbers by decomposing each n-bit number into multiple M-bit numbers and summing the partial product terms. The decomposition is accomplished with equation 4.1 [17]:

X*Y = (Xu*2^M + Xl)(Yu*2^M + Yl)
    = Xu*Yu*2^(2M) + (Xu*Yl + Xl*Yu)*2^M + Xl*Yl ................... (4.1)

where
Xu = the upper bits of X
Yu = the upper bits of Y
Xl = the lower bits of X
Yl = the lower bits of Y
M = the number of lower bits

For example, the product of 100110₂ (38₁₀) and 001101₂ (13₁₀) can be decomposed as demonstrated in equation 4.2:

38*13 = (100₂*001₂)*2^6 + (100₂*101₂ + 110₂*001₂)*2^3 + 110₂*101₂
      = 256 + 208 + 30 = 494 ....................................... (4.2)

This requires four 3-bit multipliers and three 6-bit adders [17], which leads to two 3-bit (M=3) bit-slice multipliers. Applying equation 4.1 again, this 6-bit multiplier can be further decomposed into three 2-bit (M=2) bit-slice multipliers as shown in equation 4.3:

X*Y = sum over i,j in {0,1,2} of Xi*Yj*2^(2(i+j)) .................. (4.3)

The serial implementation of this two 3-bit bit-slice multiplier is shown in Figure 4.2.
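The decomposition of equation 4.1 can be checked with a short script; the function name is mine, and the slice width defaults to the text's M = 3 example.

```python
# Equation 4.1 as code: split each 6-bit operand into upper/lower M-bit
# slices and rebuild the product from four small multiplies.

def bit_slice_multiply(x, y, m=3):
    mask = (1 << m) - 1
    xu, xl = x >> m, x & mask          # upper / lower slices of X
    yu, yl = y >> m, y & mask          # upper / lower slices of Y
    return (xu * yu << 2 * m) + ((xu * yl + xl * yu) << m) + xl * yl

# the text's example: 100110 (38) x 001101 (13) = 494
print(bit_slice_multiply(0b100110, 0b001101))  # 494
```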

PAGE 27

Each pair of Xi and Yj is fed into the multiplier sequentially. The product is then shifted according to the power of two of each partial product listed in equation 4.3. Next, the shifted partial product is added to the previously stored sum in the accumulator. When all of the Xi's and Yj's have cycled through the adder, the adder returns the final product. This implementation scheme minimizes hardware at the expense of increased propagation delay.

4.2 Parallel Multipliers

The parallel multiplier generates the partial products in parallel. Fast binary multipliers are often implemented in a parallel scheme. There are a number of fast parallel multiplier algorithms used in signal processing, and an attempt will be made here to address the most popular techniques in the sections that follow.

4.2.1 Shift-Add Multiplier

The shift-add multiplier is the simplest and most straightforward form of binary multiplier. The procedure for performing this type of computation in binary is identical to that of ordinary multiplication (base 10 arithmetic). For a 6-bit shift-add multiplier, a total of 36 partial products are generated, as shown by Figure 4.3 [18].

PAGE 28

        100110   (Multiplicand)
      x 010011   (Multiplier)
        100110
       100110
      000000
     000000
    100110
   000000
   01011010010

Figure 4.3: Shift-Add Multiplication

One implementation of this 6-bit multiplier is shown in Figure 4.4 [19].

Figure 4.4: Shift-Add Multiplier
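The shift-and-add procedure of Figure 4.3 can be sketched directly; the operand width is a parameter here, and the function name is my own.

```python
# Shift-add multiplication: for each multiplier bit that is set, add the
# correspondingly shifted multiplicand into an accumulator.

def shift_add_multiply(multiplicand, multiplier, bits=6):
    acc = 0
    for i in range(bits):
        if (multiplier >> i) & 1:      # examine multiplier bit i
            acc += multiplicand << i   # add the shifted multiplicand
    return acc

# the example of Figure 4.3: 100110 (38) x 010011 (19) = 01011010010 (722)
print(shift_add_multiply(0b100110, 0b010011))  # 722
```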

PAGE 29

While the Clk and Multiply_Enable signals are high and the current multiplier bit is high, the accumulator sums the previous partial sum with the shifted multiplicand. This cycle repeats until all 6 multiplier bits have been shifted out, at which point the accumulator returns the final product.

4.2.2 Bit-Slice Multiplier

The two 3-bit (M=3) bit-slice multiplier previously discussed, as defined by equation 4.3, can also be implemented in parallel. The realization of the parallel bit-slice multiplier is shown in Figure 4.5. Although the parallel operation enables the parallel bit-slice multiplier to achieve a higher speed than the serial bit-slice multiplier, the added hardware and cost might outweigh its improvement in performance.

4.2.3 Radix-n Multiplier

An effective means of reducing the number of partial product terms, which reduces the overall computation time, is through simultaneous evaluation of multiple

PAGE 30

bits of the multiplier. Each time the number of bits examined is doubled, the number of partial products is halved. Through the evaluation of these bits, a reduced set of bits with associated operations can be formed. This is so-called recoding. The Booth algorithm [20] is one of the earliest recoding schemes proposed. The original multiplication technique proposed by Booth [20] is designed to facilitate machine multiplication through two's complement arithmetic with correction. Booth suggested that the multiplication between two binary numbers x and y be carried out in the following manner: if x and y are both positive, the result will simply be +xy; if x is negative and y is positive, the result will be 2x-xy; if x is positive and y is negative, the result will be 2y-xy; if x and y are both negative, the result will be 4-2y-2x+xy. It is clear that the corrections needed for the last 3 cases require examination by the machine, which leads to added hardware, not to mention complexity. The proposed two's complement multiplication technique is stated in Table 4.1 [19-20]. The y values in Table 4.1 refer to the multiplier bits, while the operation is performed on the multiplicand. The multiplier, y, when evaluated in groups of 2 bits, can be expressed as

y = sum from i=0 to n-1 of (y_i-1 - y_i)*2^i ....................... (4.4)

The function in parentheses can only take on the values {-1, 0, 1}. The weight of the y terms is expressed in multiples of two and, therefore, the Booth multiplier is also referred to as a radix-2 multiplier [13], [16]. It is apparent from Table 4.1 that the operation can be represented with one binary bit with the Booth recoding scheme. This recoding scheme can be extended to a radix-4 scheme where every 3 bits of the multiplier are recoded into 2 bits. The modified-Booth multiplier [16] is an example of the radix-4 multiplier where 3 bits of the multiplier are evaluated at a time.

PAGE 31

The modified-Booth multiplier evaluates 3 bits of the multiplier at a time and carries out the operation defined by Table 4.2 [10], [16]. In two's complement, the multiplier, y, can be written as [12]

y = -y_n-1*2^(n-1) + sum from i=0 to n-2 of y_i*2^i ................ (4.5)

Equation 4.5 can also be rewritten as [12]

y = sum from i=0 to n/2-1 of (y_2i-1 + y_2i - 2*y_2i+1)*4^i ........ (4.6)

The term within the parentheses takes on the values in the set {-2, -1, 0, 1, 2} [12], which specifies the operations listed in Table 4.2. The operations defined in Table 4.2 are performed on the multiplicand based on the 3 bits of the multiplier under evaluation. Through each iteration, the multiplicand is shifted and incremented by the operation specified by the multiplier bits under examination. For example, the product of 000101₂ (+5) and 101101₂ (-19) can be computed via the modified-Booth algorithm as shown in Table 4.3.

Table 4.3: Modified-Booth Example, (+5) x (-19)

  Bits examined        Value   Partial product
  y1 y0 y-1 = 010₂     +1      P1 = P0 + 000000000101₂ = 000000000101₂
  y3 y2 y1  = 110₂     -1      P2 = P1 - 000000010100₂ = 111111110001₂
  y5 y4 y3  = 101₂     -1      P3 = P2 - 000001010000₂ = 111110100001₂

PAGE 32

Since the most significant bit (MSB) of the final partial product is a one, the product is negative; complementing it (two's complement) gives its magnitude, 95, so the signed product is -95₁₀, the correct answer. The realization of the 6-bit modified-Booth multiplier is shown in Figure 4.6; the Booth recoding blocks are omitted for simplicity.

Figure 4.6: 6-Bit Modified-Booth Multiplier

Each group of 3 bits of the multiplier, as shown in Figure 4.6, selects the proper multiple of the shifted multiplicand to be added to the current partial product.

4.2.4 Wallace Tree Multiplier

The Wallace tree multiplier is a modified tree multiplier that reduces the total number of adders needed through the use of compressors in order to minimize the computation of the partial products [21]. The original design as proposed by Wallace entails 3:2 compressors [21], although other compressors have been proposed, i.e. the 6:2 and 9:2 compressors [22-23]. A 3:2 compressor is made up of carry-save adders, each with three inputs, A, B, and C, and two outputs, X and Y. The outputs X and Y are generated according to equations 4.7 and 4.8 [10]:

X = A ⊕ B ⊕ C ...................................................... (4.7)

Y = AB + BC + AC ................................................... (4.8)

This type of adder tree requires no carry chain, which reduces the total number of summands needed to compute the product of two binary numbers [10]. A possible implementation of a 3-bit Wallace multiplier is shown in Figure 4.7.
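The 3:2 carry-save step of equations 4.7 and 4.8 can be demonstrated on whole integers; the reduction loop below is my own simple grouping for illustration, not Wallace's exact column-by-column wiring.

```python
# 3:2 compressor (carry-save adder): three operands become a sum word and a
# carry word with no carry propagation between bit positions.

def csa_3_2(a, b, c):
    """Carry-save add three values into a (sum, shifted-carry) pair."""
    x = a ^ b ^ c                      # equation 4.7: sum bits
    y = (a & b) | (b & c) | (a & c)    # equation 4.8: carry bits
    return x, y << 1                   # carries weigh one position higher

def wallace_sum(operands):
    """Reduce operands three at a time until two remain, then add."""
    while len(operands) > 2:
        nxt, rem = [], len(operands) % 3
        for i in range(0, len(operands) - rem, 3):
            nxt.extend(csa_3_2(operands[i], operands[i + 1], operands[i + 2]))
        nxt.extend(operands[len(operands) - rem:])  # leftovers pass through
        operands = nxt
    return sum(operands)

print(wallace_sum([13, 25, 9, 4]))  # 51
```

Because x + (y << 1) always equals a + b + c, each pass preserves the total while shrinking the operand count, which is the essence of the tree.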

PAGE 33

Figure 4.7: 3-Bit Wallace Tree Multiplier

The computation of two 3-bit binary numbers requires a total of 6 carry-save adders instead of the 8 carry-propagate adders required by several of the multipliers previously discussed. In addition, this adder tree structure reduces the total number of adder delays. For example, a 16-bit multiplier has at most 16 partial products to be summed. The compression of the 16 partial products through the Wallace adder tree proceeds as follows: 16 → 12 → 8 → 6 → 4 → 3 → 2. Hence, a maximum of 7 adder delays is to be expected. The same multiplication would have resulted in 8 adder delays with a modified-Booth multiplier [16].

4.2.5 Pipelined Multiplier

To increase the throughput (the rate of delivery of successive multiplications) of large multipliers, as used in signal processing applications, pipelining is typically used. Pipelining is a technique used to minimize the execution time of successive identical operations. This is achieved by segmenting a design into several stages. Each stage executes independently on consecutive operands. Latches are added between adjacent stages to enable parallel operation. A block diagram of a 2-stage pipeline is shown in Figure 4.8 [13], [24].

PAGE 34

Figure 4.8: 2-Stage Pipeline

Pipelining can be applied to various multiplication algorithms, i.e. the Booth multiplier, Wallace tree multiplier, Dadda multiplier, array multiplier, etc. [25]. To further increase the throughput of the pipelined multiplier, a special type of pipelined multiplier called the wave-pipelined multiplier can be used; wave-pipelined multipliers achieve higher performance by eliminating the pipelining registers between combinational logic blocks within the multiplier. The balancing of delays throughout the multiplier is assured through careful layout and routing. A 4-bit wave-pipelined multiplier is shown in Figure 4.9 [26].


The partial products are generated by the AND gates as shown in Figure 4.9. The processing elements (PE) are basically AND-gated full adders that are connected in an array fashion. To minimize the latency, a semi-systolic schedule is followed by the PEs; latches are inserted between rows of processing elements. The vector merge adder performs the conversion between the carry-save product representation and normal binary representation [26].

4.2.6 Dadda Multiplier

The Dadda multiplication algorithm is based on parallel (n,m) counters; n refers to the number of inputs and m refers to the number of outputs of the parallel counters (combinational networks) [27-28]. A 4 x 4 Dadda multiplication is shown in Figure 4.10.

Figure 4.10: 4 x 4 Dadda Multiplication

As with typical binary multiplication, the product between a 4-bit multiplicand, X, and a 4-bit multiplier, Y, generates 16 partial products. These partial products are represented by dots as shown in Figure 4.10. The collection of all partial products is termed matrix A. The first step in applying the Dadda parallel multiplier algorithm is to transform matrix A to a reduced-row matrix, B. This is accomplished by first directly transcribing columns 0 and 1 without alteration from matrix A to


matrix B. For the remaining columns, each column will be shortened by the appropriate counters. These counters come in various sizes. For ease of implementation, only 3:2 counters (full adders) and 2:2 counters (half adders) will be used. Therefore, the three partial product terms in column 2 will be reduced by a 3:2 counter. The LSB of the output will be mapped onto the same column in matrix B, while the MSB of the output will be mapped onto column 3, although one row below. In the cases where the number of partial products cannot be adequately dealt with by a single counter, the unused partial products are simply mapped onto matrix B at the next available row in the same column. Upon completion of the first transformation, if the transformed matrix has more than two rows, an additional transformation in the same manner is needed. The reasoning behind the two-row limit is to facilitate two-operand addition through the CPA. The implementation of this multiplier following the algorithm shown in Figure 4.10 is shown in Figure 4.11.

Figure 4.11: 4-Bit Dadda Multiplier

The reduction scheme used to generate the 4-bit Dadda multiplier as shown in Figure 4.11 is not optimal. Dadda suggested several reduction schemes which involve alternate partial reduction rules to minimize the total number of counters [28].

4.3 Array Multiplier


The array multiplier generates partial products and their sums through an array of nearly identical cells, which lends itself to VLSI implementation with high performance at the expense of hardware complexity [13], [29]. For example, a 3 x 3 multiplication will result in the partial products and their sums as shown in Figure 4.12. One possible implementation of this 3 x 3 multiplier in the array multiplication scheme is shown in Figure 4.13.


As shown in Figure 4.13, the partial sums can be easily computed through an array of half adders. The output bits from each stage of AND gates are added in parallel with the sum of partial products from the previous stage. The carry-out signals are treated as the MSB of the second input into the adders.

4.3.1 Braun Multiplier

The Braun multiplier is an array multiplier that computes the product of two unsigned binary numbers according to [12]

P = Σᵢ Σⱼ XᵢYⱼ2^(i+j) .................................................. (4.9)

where X is the multiplicand, Y the multiplier, and P the product. Equation 4.9 is derived from equations 4.10 and 4.11, which represent the two unsigned binary numbers X and Y [12].

X = Σᵢ Xᵢ2^i ........................................................... (4.10)

Y = Σⱼ Yⱼ2^j ........................................................... (4.11)

The partial products of this type of multiplier are generated by AND gates in parallel, and the partial sums of the partial products are generated by array adders as shown in Figure 4.14 [12].
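Equation 4.9 can be checked directly. The sketch below (names are mine) forms each AND-gate partial-product bit XᵢYⱼ with weight 2^(i+j) and sums them:

```python
def braun_multiply(x, y, n=3):
    """Unsigned product per eq. 4.9: P = sum over i, j of Xi*Yj*2^(i+j).
    Each term is one AND gate's output bit, weighted by its position."""
    p = 0
    for i in range(n):          # multiplicand bits Xi
        for j in range(n):      # multiplier bits Yj
            p += (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
    return p

# exhaustive check over all 3-bit unsigned operands
assert all(braun_multiply(a, b) == a * b for a in range(8) for b in range(8))
```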


For an m by n Braun multiplier, a total of m(n-1) adders (CPA) and AND gates are needed. The implementation of a 3-bit Braun multiplier is shown in Figure 4.15. The Braun multiplier shown in Figure 4.15 can be rearranged into a 3 by 3 array, which facilitates placement and routing in an integrated circuit.

4.3.2 Baugh-Wooley Multiplier

Another type of array multiplier is the Baugh-Wooley multiplier [30]. The structure of the Baugh-Wooley multiplier is similar to that of the Braun multiplier, with the exception that it is designed for the computation of two's complement numbers instead of unsigned numbers. Both the multiplier and multiplicand can be expressed as shown by equations 4.12 and 4.13 [30].

x = -x₍n₋₁₎2^(n-1) + Σ(i=0 to n-2) xᵢ2^i ............................... (4.12)

y = -y₍n₋₁₎2^(n-1) + Σ(j=0 to n-2) yⱼ2^j ............................... (4.13)


Therefore, the product of the multiplicand, x, and the multiplier, y, can be expressed as shown by equation 4.14 [30].

x·y = x₍n₋₁₎y₍n₋₁₎2^(2n-2) + Σ(i=0 to n-2)Σ(j=0 to n-2) xᵢyⱼ2^(i+j)
      - x₍n₋₁₎2^(n-1)·Σ(j=0 to n-2) yⱼ2^j - y₍n₋₁₎2^(n-1)·Σ(i=0 to n-2) xᵢ2^i ... (4.14)

Direct implementation of equation 4.14 requires both adders and subtractors. Thus, in order to reduce the complexity and hardware of such a design, the negative terms can be rewritten as shown by [12], [30],

-x₍n₋₁₎2^(n-1)·Σ(j=0 to n-2) yⱼ2^j = x₍n₋₁₎(-2^(2n-2) + 2^(n-1) + 2^(n-1)·Σ(j=0 to n-2) y̅ⱼ2^j) ... (4.15)

This results in all positive partial product terms at the expense of added hardware to complement the multiplicand and multiplier bits, as suggested by equation 4.15. A 3-bit Baugh-Wooley multiplication algorithm is shown in Figure 4.16.
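The four groups of terms in equation 4.14 can be verified numerically for 3-bit operands. This sketch (helper names are mine; the complement rewriting of equation 4.15 is not expanded here) splits each operand into its negatively weighted sign bit and its low bits:

```python
def tc_value(v, n):
    """Two's-complement value of an n-bit pattern (eqs. 4.12/4.13):
    the MSB carries weight -2^(n-1); the remaining bits are ordinary binary."""
    return (v & ((1 << (n - 1)) - 1)) - (((v >> (n - 1)) & 1) << (n - 1))

def bw_product(x, y, n=3):
    """The four groups of eq. 4.14: one positive MSBxMSB term, one positive
    low x low block, and two negative cross terms (the ones eq. 4.15 later
    rewrites with complemented bits so all partial products become positive)."""
    xm, ym = (x >> (n - 1)) & 1, (y >> (n - 1)) & 1          # sign bits
    xl, yl = x & ((1 << (n - 1)) - 1), y & ((1 << (n - 1)) - 1)  # low bits
    return (xm * ym << (2 * n - 2)) + xl * yl - ((xm * yl + ym * xl) << (n - 1))

# exhaustive check over all 3-bit two's-complement patterns
assert all(bw_product(a, b) == tc_value(a, 3) * tc_value(b, 3)
           for a in range(8) for b in range(8))
```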


The partial product terms P0 through P5 are generated based on equation 4.16. A possible implementation (not optimized) of this 3-bit Baugh-Wooley multiplier is shown in Figure 4.17.

4.4 Multiplier Comparison

The sequential bit-slice multiplier is well suited for applications where size is a major factor. However, its compactness is associated with longer cycle times and propagation delays. These issues are overcome with the parallel version of the bit-slice multiplier at the expense of added hardware, chip area, and power consumption. The add-shift multiplier is the simplest design to conceptualize, since this is how ordinary multiplication is carried out. However, the added registers and large adders render it impractical for large designs.


The radix-n multipliers, such as the Booth multiplier and the modified-Booth multiplier, reduce the number of operational cycles by grouping the multiplier bits into multi-bit digits. They are faster than the bit-slice multiplier and the shift-add multiplier [17]. The radix-n multipliers are often used in signal processing, digital filtering, and ALUs, since no correction is needed for two's complement arithmetic. The Booth algorithm is ideal for operands greater than 16 bits [12] due to the partial product savings.

The Wallace tree multiplier reduces the carry delay through the use of carry-save adders. The compressor stages of a Wallace tree multiplier reduce the total number of adders required, which enhances the overall performance of the multiplier. However, the additional hardware required by the Wallace tree translates to more area, which tends to consume more power; therefore, it is not suitable for low-power applications [12]. The pipelined multiplier accelerates the throughput at the expense of higher latency and cost due to the addition of pipelining stages. The Dadda multiplication algorithms are very similar in structure to the Wallace tree multiplication techniques. The advantage of the Dadda multiplier is that in certain applications it can improve the partial product savings considerably over the Wallace tree multiplier [28].

The array multiplier is modular and flexible due to its highly regular structure. An array multiplier of a particular size can be easily extended to a larger application. The Braun multiplier is designed to process unsigned binary numbers, while the Baugh-Wooley multiplier is optimized for two's complement arithmetic. The latter is unsuitable for operands larger than 16 bits due to its area requirements and performance [12]. A summary of the multipliers discussed, based on their structures and characteristics, is presented in Table 4.4.
Type          Cost     Performance   Area     Power    Application
Bit-Slice     Low      Low           Low      Low      < 16 bits
Add-Shift     Low      Low           Low      Low      < 16 bits
Radix-n       Medium   Medium        Medium   Medium   > 16 bits
Wallace Tree  Medium   High          Medium   High     > 16 bits
Pipelined     High     Medium        High     High     > 16 bits
Dadda         High     High          Medium   High     > 16 bits
Array         Medium   Medium        High     High     < 16 bits

Table 4.4: Multiplier Summary


Chapter 5

Optimal Multiplier Solution

5.1 Representation Consideration

The first factor to consider for an optimal 6-bit multiplier is representation. The FIR coefficients can be positive as well as negative. Therefore, a choice has to be made regarding the negative number representation. There are three basic fixed binary number representations that accommodate negative numbers: signed magnitude, one's complement (diminished-radix complement), and two's complement (radix complement) [13]. One of the main disadvantages of the signed-magnitude representation is the operation decisions required based on the signs of both binary numbers, which leads to additional hardware, computation time, and cost. One's complement resolves the added computational sequences and hardware issues of the signed-magnitude representation. However, the representation is not optimal. Both the signed-magnitude and one's complement representations can express binary numbers in the range -(2^(N-1) - 1) to (2^(N-1) - 1), where N is the number of bits. The two's complement representation, on the other hand, can express binary numbers in the range -2^(N-1) to (2^(N-1) - 1), with a single representation for 0₁₀ instead of multiple representations for zero, as with the other two systems. An example of the three types of binary number representation for a 3-bit number is shown in Table 5.1.

(X)₂   Signed-Magnitude   1's Complement   2's Complement
011           3                 3                 3
010           2                 2                 2
001           1                 1                 1
000          +0                +0                 0
111          -3                -0                -1
110          -2                -1                -2
101          -1                -2                -3
100          -0                -3                -4

Table 5.1: 3-Bit Binary Number Representations
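The three readings in Table 5.1 can be spot-checked with a small interpreter (a sketch; the function name and system labels are mine):

```python
def interpret(bits, n, system):
    """Value of an n-bit pattern under the three systems of Table 5.1."""
    msb = (bits >> (n - 1)) & 1
    if system == "signed-magnitude":
        low = bits & ((1 << (n - 1)) - 1)   # magnitude bits below the sign
        return -low if msb else low
    if system == "ones-complement":
        return bits - msb * ((1 << n) - 1)  # MSB weight: -(2^n - 1) + 2^(n-1)
    if system == "twos-complement":
        return bits - msb * (1 << n)        # MSB weight: -2^(n-1)
    raise ValueError(system)

# the pattern 100 reads as -0, -3 and -4 respectively (last row of Table 5.1)
assert interpret(0b100, 3, "signed-magnitude") == 0
assert interpret(0b100, 3, "ones-complement") == -3
assert interpret(0b100, 3, "twos-complement") == -4
```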


As shown in Table 5.1, the 2's complement representation extends the negative range by one and provides a single representation for zero.

5.2 Performance Consideration

It has been shown in the previous chapters that the key to a fast multiplier is the management of partial products. This management can be divided into two tasks: partial product reduction and partial product sum reduction. The first task of partial product reduction can be easily realized with many recoding schemes such as the Booth, modified-Booth, etc. The latter task can also be easily realized through efficient addition architectures such as the Wallace tree or the Dadda parallel techniques.

5.3 Proposed Multiplier Design

Based on the representation and performance considerations discussed previously, the optimal 6-bit multiplier is to be of the radix-8 architecture. Its detailed recoding and accumulation schemes are examined in the following sub-sections.

5.3.1 Recoding Scheme

To reduce the number of partial products, the radix-8 modified-Booth encoding scheme is chosen. As shown in the radix-n multiplier section of this paper, the two's complement representation for the multiplier can be expressed in several ways to yield more efficient computation techniques. One way to rewrite equation 4.13 is as shown in equation 5.1.

Y = Σ(i=0 to n/3-1) (-4y₍3i+2₎ + 2y₍3i+1₎ + y₍3i₎ + y₍3i-1₎)·8^i, with y₍-1₎ = 0 ... (5.1)

The quantity within the parentheses takes on values in the set {0, ±1, ±2, ±3, ±4}. The weight of the multiplier bits is expressed in multiples of 2³ or 8₁₀, and hence radix-8. With a 6-bit multiplier, if the radix-4 modified-Booth encoding [13] is used, a total of three multiplier bit evaluations are needed. The same 6-bit computation with the proposed radix-8 modified-Booth encoding will require one less evaluation cycle. The operations associated with each 4-bit multiplier-group evaluation are cataloged in Table 5.2.
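Equation 5.1 can be exercised for every 6-bit multiplier pattern. This sketch (function name is mine) scans overlapping 4-bit windows, stepping 3 bits at a time:

```python
def radix8_recode(y, n=6):
    """Radix-8 Booth recoding per eq. 5.1: each digit is
    -4*y(3i+2) + 2*y(3i+1) + y(3i) + y(3i-1), with y(-1) = 0, so that the
    two's-complement value of y equals sum of digit_i * 8^i."""
    digits, prev = [], 0
    for i in range(0, n, 3):
        g = (y >> i) & 0b111
        digits.append(-4 * ((g >> 2) & 1) + 2 * ((g >> 1) & 1) + (g & 1) + prev)
        prev = (g >> 2) & 1  # this group's MSB overlaps into the next window
    return digits

# a 6-bit multiplier needs only two radix-8 digits (vs. three radix-4 digits)
for y in range(64):
    value = y - 64 * (y >> 5)        # two's-complement value of the pattern
    d = radix8_recode(y)
    assert len(d) == 2 and all(abs(di) <= 4 for di in d)
    assert value == d[0] + 8 * d[1]
```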


All operations listed in Table 5.2, with the exception of 3x, are easily generated through shifting. In the case of 3x, an additional adder can be used to sum 2x with 1x, both of which are generated in parallel. The proposed recoding scheme for a 6-bit multiplier is discussed in Chapter 6.

5.3.2 Accumulation Scheme

The accumulation of the partial products will be accomplished through the use of carry-save adders. The carry-save adders differ from other addition algorithms in the respect that they are carry-propagation free. In carry-save addition, the carry generated from each full adder is handled independently from the other carries [28]. The full adders used compute the partial sum of three input bits, instead of two bits and a carry as with other addition architectures. Therefore, the total number of adders needed is reduced. This reduction, along with the inherent carry-propagation-free characteristic of the carry-save adder, can accelerate the performance of any multiplier without added hardware and cost. The carry-save adders are widely used in both the Wallace tree multiplier and the Dadda multiplier, as discussed in Chapter 4.
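As a sanity check on the whole scheme, a hypothetical end-to-end sketch (names and structure are mine; Python integers stand in for the 12-bit datapath) recodes the multiplier, forms the easy multiples by shifting plus the one hard multiple 3x = 2x + 1x, and accumulates the partial products in carry-save form with a single carry-propagate addition at the end:

```python
def radix8_multiply(x, y, n=6):
    """Radix-8 Booth multiply: recoded digits select a multiple of x, and
    the running total is kept as a (sum, carry) pair so no carry propagates
    until the final addition."""
    multiples = {0: 0, 1: x, 2: x << 1, 3: (x << 1) + x, 4: x << 2}
    s, c, prev = 0, 0, 0
    for i in range(0, n, 3):
        g = (y >> i) & 0b111
        d = -4 * ((g >> 2) & 1) + 2 * ((g >> 1) & 1) + (g & 1) + prev
        prev = (g >> 2) & 1
        m = multiples[abs(d)]
        pp = (m if d >= 0 else -m) << i          # digit weight 8^(i/3) = 2^i
        # carry-save step: fold the new partial product into (s, c)
        s, c = s ^ c ^ pp, ((s & c) | (pp & (s ^ c))) << 1
    return s + c                                  # final carry-propagate add

# exhaustive check over all 6-bit two's-complement operand pairs
assert all(radix8_multiply(x, y) == x * (y - 64 * (y >> 5))
           for x in range(-32, 32) for y in range(64))
```

The bitwise carry-save identity s + c + pp == (s ^ c ^ pp) + 2·maj(s, c, pp) holds for Python's arbitrary-precision two's-complement integers, which is why negative partial products need no special handling here.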


Chapter 6

Performance Evaluation

This chapter evaluates the performance of the three multipliers that are most suitable for a 5-tap FIR equalization filter: the Shift-Add, the modified-Booth (radix-4), and the radix-8 Booth. The comparisons made in this chapter are at the RTL level, with routing and interconnection delays omitted. The propagation delay values used are based on IBM's 0.36 µm CMOS technology with the 5S process [31]. The propagation delays for the standard cells used are cataloged in Table 6.1. These values are obtained by averaging the worst-case propagation delays from low to high (tPLH) and from high to low (tPHL). The performance level chosen is A, with normal process, low-voltage (V = 3.3 V) operation, and a junction temperature (Tj) of 25°C. The programmable 0-10 bit shift register standard cell will not be used, since the shift operations required by the multipliers are fixed. Therefore, there is no intrinsic delay associated with the shifters used, only routing delays.

Device    Delay (ns)
ANDOR22   0.251
ANDOR33   0.243
XOR2      0.3195
AND2      0.18
MUX21     0.2325
MUX81     0.825
ADDF      0.5335
SHIFT     0.26

Table 6.1: Standard Cell Delays

6.1 Propagation Delay of the Shift-Add Architecture

Based on the discussion of the Shift-Add architecture examined in Section 4.2.1, one possible realization of a 6-bit two's complement multiplier is shown in Figure 6.1.


Figure 6.1: 6-Bit Shift-Add Multiplier

This multiplier has been modified to compute two binary numbers in the two's complement form by first inverting all multiplicand and multiplier bits and then incrementing them by one, as shown in Figure 6.1. From this architecture, a delay flow diagram was constructed and is shown in Figure 6.2.

Figure 6.2: Delay Flow Diagram

The block shaded in gray is repeated for each stage. This is due to the fact that the delay associated with the 12-bit full adder is much greater than that of the previous three blocks combined. Therefore, the overall propagation delay of an n-bit multiplier can be expressed by equation 6.1.


tSA = n·tADDF + tMUX21 + tAND2 + n·(2n·tADDF) .......................... (6.1)

where the standard cell delays refer to the values listed in Table 6.1.

6.2 Propagation Delay of the Modified-Booth Architecture

Based on the discussion of the modified-Booth architecture examined in Section 4.2.3, one possible realization of a 6-bit two's complement multiplier is shown in Figure 6.3.

Figure 6.3: 6-Bit Modified-Booth (Radix-4) Multiplier

Detailed schematics of the radix-4 encoder as well as the function generator are shown in Figures 6.4 and 6.5, respectively.


Figure 6.4: Radix-4 Encoder

Figure 6.5: Radix-4 Function Generator

The exclusive-OR gate depicted in Figure 6.4 is used to determine the operation to be performed by the adder/subtractor shown in Figure 6.3. The rest of the logic is used to encode the multiplier bits. The function generator shown in Figure 6.5 is used to generate zero times, one times, and two times the multiplicand bits, which are fed to the inputs of the 2:1 multiplexers shown in Figure 6.3. Upon examination of the data flow, a model of the overall propagation delay is depicted in Figure 6.6.


Figure 6.6: Delay Flow Diagram

The block shaded in gray is repeated for each stage. The delay associated with an adder/subtractor can be treated as that of a full adder with one additional XOR delay. Therefore, the overall propagation delay of an n-bit multiplier can be expressed by equation 6.2.

tR4 = tXOR2 + tMUX21 + ⌈n/2⌉·(2n·tADDF + tXOR2) ........................ (6.2)

6.3 Propagation Delay of the Radix-8 Booth Architecture

Based on the discussion of the radix-8 Booth architecture examined in Section 5.3.1, one possible realization of a 6-bit two's complement multiplier is shown in Figure 6.7.

Figure 6.7: 6-Bit Radix-8 Booth Multiplier


Detailed schematics of the radix-8 encoder as well as the function generator are shown in Figures 6.8 and 6.9, respectively.


Figure 6.8: Radix-8 Encoder

Figure 6.9: Radix-8 Function Generator

The exclusive-OR gate depicted in Figure 6.8 is used to determine the operation to be performed by the adder/subtractor shown in Figure 6.7. The rest of the logic is used to encode the multiplier bits. The function generator shown in Figure 6.9 is used to generate zero through four times the multiplicand bits, which are fed to the inputs of the 8:1 multiplexers shown in Figure 6.7. Upon examination of the data flow, a model of the overall propagation delay is depicted in Figure 6.10. The block shaded in gray is repeated for each stage. The delay for the function generator (FG) is computed from the propagation delay of an (n+2)-bit full adder; the adder is of (n+2) bits due to the 2x operation and the carry bit resulting from the sum of 2x and 1x. The delay associated with an adder/subtractor can be treated as that of a full


adder with one additional XOR delay. Therefore, the overall propagation delay of an n-bit multiplier can be expressed by equation 6.3.

tR8 = tSHIFT + (n+2)·tADDF + tMUX81 + ⌈n/3⌉·(tXOR2 + 2n·tADDF) ......... (6.3)

6.4 Performance Comparison of the Selected Architectures

Based on equations 6.1-6.3, a plot of the propagation delay versus the number of bits (1-12) is generated. The simulation was conducted in Matlab with the script multiply.m, which is included in Appendix A. The performance comparison does not incorporate the addition computation savings made possible by the carry-save algorithm. The comparisons are based purely on the architecture of each chosen multiplier. Thus, the delay values listed in Table 6.2 for all three multipliers can be further reduced through the use of carry-save adders.

Table 6.2: Results

A plot of the performance of all three multiplication architectures is shown in Figure 6.11.
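Equations 6.1-6.3 can be checked numerically. The sketch below is a direct Python port of the multiply.m delay model from Appendix A, using the Table 6.1 cell delays, and reproduces the ranking reported in this section:

```python
# Standard-cell delays from Table 6.1, in nanoseconds
AND2, XOR2, ADDF = 0.18, 0.3195, 0.5335
MUX21, MUX81, SHIFT = 0.2325, 0.825, 0.26

def t_shift_add(n):
    """Eq. 6.1: one stage per multiplier bit, each dominated by a 2n-bit CPA."""
    return n * ADDF + MUX21 + AND2 + n * (2 * n * ADDF)

def t_radix4(n):
    """Eq. 6.2: ceil(n/2) stages of (2n-bit adder + XOR) after encoder/mux."""
    return XOR2 + MUX21 + -(-n // 2) * (2 * n * ADDF + XOR2)

def t_radix8(n):
    """Eq. 6.3: fixed shift, (n+2)-bit 3x adder and 8:1 mux up front,
    then ceil(n/3) stages."""
    return SHIFT + (n + 2) * ADDF + MUX81 + -(-n // 3) * (XOR2 + 2 * n * ADDF)

# at the 6-bit design point the radix-8 Booth multiplier is the fastest
assert t_radix8(6) < t_radix4(6) < t_shift_add(6)
```

At n = 6 the model gives roughly 18.8 ns, 20.7 ns and 42.0 ns for the radix-8, radix-4 and shift-add architectures respectively, and the gap between radix-4 and radix-8 widens as n grows.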


Figure 6.11: Multiplier Performance Comparison (1-12 Bits)

The Matlab simulation results indicate that the performance advantage of the radix-8 Booth multiplier increases as n becomes larger. A plot of the multiplier performance from 1-128 bits is shown in Figure 6.12.


Figure 6.12: Multiplier Performance Comparison (1-128 Bits)

Referring to Figure 6.12, the shift-add multiplier is the worst performer of the three architectures, as shown by the top curve. The next curve represents the modified-Booth (radix-4) multiplier. Lastly, the bottom curve is the radix-8 Booth multiplier, which outperformed the other two multipliers. The simulation demonstrated the expected performance. The performance results for the chosen multipliers with varied parameters are shown in Tables 6.3-6.6.


The results listed in Table 6.3 are obtained by decreasing the intrinsic propagation delays of the MUX21 and MUX81 cells by 5%, to 0.2209 ns and 0.7838 ns respectively, with all other delay values held constant. The results listed in Table 6.4 are obtained by decreasing the intrinsic propagation delay of the ADDF cell by 5%, to 0.5068 ns, with all other delay values held constant.


The results listed in Table 6.5 are obtained by decreasing the intrinsic propagation delay of the XOR2 cell by 5%, to 0.3035 ns, with all other delay values held constant. The results listed in Table 6.6 are obtained by decreasing the intrinsic propagation delay of the AND2 cell by 5%, to 0.171 ns, with all other delay values held constant. The performance of the radix-8 Booth multiplier with the varied delay values is shown in Figure 6.13. Its results are listed in Table 6.7.


Figure 6.13: Radix-8 Booth Multiplier Performance with Varied Delay Values

As shown in Table 6.7, the intrinsic propagation delay value of the ADDF cell has the greatest effect on the performance of the radix-8 Booth multiplier; a 5% reduction in ADDF propagation delay increases the performance of the 6-bit radix-8 Booth multiplier by 4.61%. The performance increases due to the reductions in the propagation delays of the MUX cells, the XOR2 cell, and the AND2 cell are 0.22%, 0.17%, and 0%, respectively. The reduction in the propagation delay of the AND2 cell produced no change in the performance of the 6-bit multiplier.


Chapter 7

Conclusions

The main objective of this thesis was to derive the fastest 6-bit multiplier for use in FIR equalization in read channel ASICs. To accomplish this goal, the basics behind all multipliers and their associated summing schemes were explored. Next, the inefficiencies of current multiplier architectures were identified. The selection of the most suitable multiplier could then be achieved through the exploration of partial product reduction architectures as well as partial product summing architectures. The first task of partial product reduction was demonstrated through several widely used recoding schemes such as the Booth, modified-Booth, etc. The detailed theory of operation, benefits, and disadvantages of these recoding schemes were discussed in Chapter 4. The latter task of accelerating the summation of partial products was also investigated through various addition architectures as presented in Chapter 3. The proposed multiplier solution was shown to be competitive in both speed and power consumption as compared to the other two designs, especially when n (the number of bits) is high.

It has been argued that the delay associated with the generation of multiples higher than two times the multiplicand far outweighs the performance benefits of higher radix-n partial product reduction algorithms. However, this claim is highly dependent on the intended application. Although it is true that there are additional delays introduced by the generation of multiples greater than two times the multiplicand, this overhead is only incurred once at the first stage and not at the subsequent stages. Hence, as the number of bits increases, so does the performance difference between the radix-4 and radix-8 multipliers, as demonstrated in Chapter 6. The radix-8 recoding scheme, together with the addition savings of the carry-save adder, can achieve the optimal design requirement of this thesis.
This can be easily verified through timing simulations of a CMOS implementation. The main emphasis of this paper was centered on the optimization of the multiplier required by FIR equalization. Perhaps the fastest FIR equalization is not to be achieved by a multiplier design but rather by a multiplier-less design. It has


been suggested by Samadi, Nishihara, Fujii, and Chen that linear phase FIR filters can be realized free of multipliers through the use of array filter structures [32-33].


Appendix A. Multiplier Performance Predictor in Matlab (multiply.m)

%#################################################################
%# multiply.m                                                   ##
%# By: Michael H. Chen                                          ##
%# Date: August 5, 1998                                         ##
%# Description: This program computes and plots the estimated   ##
%#              propagation delay associated with an n-bit      ##
%#              multiplier implemented in three multiplier      ##
%#              algorithms: shift-add, modified Booth           ##
%#              (Radix-4), and Radix-8.                         ##
%#################################################################

%#######################################################
%# Gate Delays Per IBM's ASIC CMOS 5S (0.36um) Process #
%#######################################################

%#############################################
%# Constants (All values are in nanoseconds) #
%#############################################
AND2  = 0.18;    % 2-Input AND gate
XOR2  = 0.3195;  % 2-Input XOR gate
ADDF  = 0.5335;  % 1-Bit Full Adder
MUX21 = 0.2325;  % 2:1 Multiplexer
MUX81 = 0.825;   % 8:1 Multiplexer
SHIFT = 0.26;    % Fixed shifter
N     = 128;     % Number of Bits

%###########################################################
%# Propagation Delay Computations for Shift-Add Multiplier #
%###########################################################
i = 1;
for N_SA = 1:N
    % Number of stages required for the Shift-Add implementation
    S_SA = N_SA;
    % Propagation delay computation
    D_SA = (N_SA*ADDF)+MUX21+AND2+(S_SA*(ADDF*2*S_SA));
    % Construct an array of number of bits versus delay
    P_SA(i,1:2) = [N_SA D_SA];
    i = i + 1;


end

%#########################################################
%# Propagation Delay Computations for Radix-4 Multiplier #
%#########################################################
i = 1;
for N_R4 = 2:N
    % Number of stages required for the Radix-4 implementation
    S_R4 = ceil(N_R4/2);
    % Propagation delay computation
    D_R4 = XOR2+MUX21+(S_R4*((2*N_R4*ADDF)+XOR2));
    % Construct an array of number of bits versus delay
    P_R4(i,1:2) = [N_R4 D_R4];
    i = i + 1;
end

%#########################################################
%# Propagation Delay Computations for Radix-8 Multiplier #
%#########################################################
i = 1;
for N_R8 = 3:N
    % Number of stages required for the Radix-8 implementation
    S_R8 = ceil(N_R8/3);
    % Propagation delay computation
    D_R8 = SHIFT+((2+N_R8)*ADDF)+MUX81+(S_R8*(XOR2+(2*N_R8*ADDF)));
    % Construct an array of number of bits versus delay
    P_R8(i,1:2) = [N_R8 D_R8];
    i = i + 1;
end

%################################
%# Performance Comparison Graph #
%################################
% Plots the performance of the Shift-Add Multiplier
plot(P_SA(:,1), P_SA(:,2), 'r')
hold on
% Plots the performance of the Modified-Booth Multiplier
plot(P_R4(:,1), P_R4(:,2), 'g')
% Plots the performance of the Radix-8 Multiplier
plot(P_R8(:,1), P_R8(:,2), 'b')


xlabel('Number of Bits');
ylabel('Delay (ns)');
title('Multiplier Performance Comparison');

% Turns the "ZOOM" option on
zoom on;


References

[1] SSI. June 1997. http://www.ti.com/sc/docs/storage/news/pce pr4.htm.

[2] SSI. Datasheet. Mar. 1998. http://www.ti.com/sc/docs/storage/products/data/41 0 1 a.pdf.

[3] Ozue, Tadashi, Watanabe, Yukio, Kohno, Takanobu, Hirose, Toshihido, Koseki, Kazuya. Development of High Density Recording For Advanced Tape Streamer. vol. 32, Sep 1996, pp. 3881-3883.

[4] Hirasaka, H., Tanaka, A., Nomura, K., Fukuda, S. An Adaptive Equalizer for the Partial Response Class I Channel Used by DDS3. vol. 32, Sep 1996, pp. 3977-3979.

[5] Mallinson, John C. Academic Press, 1987.

[6] Couch, Leon W. 3rd Ed., Macmillan Publishing, 1990.

[7] Mitra, Sanjit K. McGraw-Hill, 1998.

[8] Rorabaugh, C. Britton. 2nd Ed., McGraw-Hill, 1997.

[9] Choi, Jun Rim, Jeong, Seong Wook, Jang, Lak Hyun, Choi, Jin Ho. Structured Design of a 288-tap FIR Filter by Optimized Partial Product Tree Compression. pp. 79-82.

[10] Wolf, Wayne Hendrix. Prentice Hall, 1994.


[11] Yu, William W.H., Xing, Shanzhen. Performance Evaluation of FPGA Implementation of High-Speed Addition Algorithms. vol. 2914, Nov. 1996, pp. 26-33.

[12] Bellaouar, Abdellatif, Elmasry, Mohamed I. Kluwer Academic Publishers, 1995.

[13] Koren, Israel. Prentice-Hall, 1993.

[14] Lee, Jeehan. A Synchronous Completion Prediction Adder (SCPA). vol. E80-A, Mar 1997, pp. 606-609.

[15] Mano, M. Morris. 3rd Ed., Prentice-Hall, 1993.

[16] Weste, Neil H. E. AT&T, 1993.

[17] Smith, Winthrop W. IEEE, 1995.

[18] Parr, E. Clays Ltd, 1993.

[19] Lin, Wen. 2nd Ed., CRC Press, 1990.

[20] Booth, Andrew D. A Signed Binary Multiplication Technique. Quarterly Journal of Mechanics and Applied Mathematics, vol. IV, Pt. 2, 1951.

[21] Wallace, C. S. A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers, vol. EC-13, Feb 1964, pp. 14-17.

[22] Song, Minkyu. Power Optimization for Data Compressors Based on a Window Detector in a 54x54 Bit Multiplier. vol. E80-C, Jul 1997, pp. 1016-1027.

[23] Fadavi-Ardekani, Jalil. M x N Booth Encoded Multiplier Generator Using Optimized Wallace Trees. vol. 1, Jun 1993, pp. 120-125.

[24] Cappello, Peter R., Steiglitz, Kenneth. A VLSI Layout for a Pipelined Dadda Multiplier. vol. 1, May 1983, pp. 157-174.


[25] Somasekhar, Dinesh, Visvanathan, V. A 230-MHz Half-Bit Level Pipelined Multiplier Using True Single-Phase Clocking. vol. 1, Dec 1993, pp. 415-422.

[26] Ghosh, Debabrata. Design and Realization of High-Performance Wave-Pipelined 8x8 b Multiplier in CMOS Technology. vol. 3, Mar 1995, pp. 36-48.

[27] Song, Paul J., De Micheli, Giovanni. Circuit and Architecture Trade-offs for High-Speed Multiplication. vol. 26, Sep 1991, pp. 1184-1198.

[28] Dadda, Luigi. Some Schemes For Parallel Multipliers. Alta Frequenza, vol. 34, May 1965, pp. 349-356.

[29] He, Shousheng, Torkelson, Mats. A Complex Array Multiplier Using Distributed Arithmetic. pp. 71-74.

[30] Baugh, Charles R., Wooley, Bruce A. A Two's Complement Parallel Array Multiplication Algorithm. IEEE Transactions on Computers, vol. C-22, Dec 1973, pp. 1045-1047.

[31] IBM. 5S IBM ASIC Logic Products, 1997.

[32] Chen, Charng-Kann. A method for synthesizing multiplierless FIR digital filters with narrow transition widths. Jun 1997, pp. 351-360.

[33] Samadi, Saed, Nishihara, Akinori, Fujii, Nobuo. Modular Array Structures for Design and Multiplierless Realization of Two-Dimensional Linear Phase FIR Digital Filters. vol. E80-A, Apr 1997, pp. 722-735.