HIGH SPEED MULTIPLIERS FOR READ CHANNEL ASICS

by

Michael H. Chen
B.S.E.E., University of Colorado at Denver, 1997

A thesis submitted to the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Master of Science, Electrical Engineering

1998
Chen, Michael H. (M.S., Electrical Engineering)
High Speed Multipliers for Read Channel ASICs
Thesis directed by Associate Professor Shelly Goggin

ABSTRACT

This thesis explores various multiplication architectures, as well as their associated addition architectures, in search of the fastest (lowest latency) 6-bit two's complement multiplier for use in read channel ASICs, more specifically in equalization. The proposed optimal multiplier is comprised of radix-8 Booth encoding with carry-save addition, which minimizes the overall propagation delay by reducing both the number of partial products and the number of partial-product summation stages. This thesis will show that the performance achieved by the radix-8 Booth multiplier far outweighs its added hardware. The baseline for the multiplication architecture comparison will be the shift-add multiplier, chosen mainly for its performance and cost considerations. An additional high-performance multiplier was selected based on its computational efficiency and was evaluated against the baseline and the optimal designs. The comparisons will be made at the register transfer level (RTL) using the IBM ASIC CMOS 5S standard cell libraries. These libraries were chosen for their superior performance and low power consumption as compared to other ASIC libraries available on similar technologies.
This abstract accurately represents the content of the candidate's thesis. I recommend its publication.
Dedication I dedicate this thesis to my wife for her patience and support over the last couple of years, which made it possible for me to achieve my goals. Her unselfishness is to be recognized.
Acknowledgements

This thesis would not have materialized had it not been for the vision of a brilliant senior read channel design engineer at the Exabyte Corporation, Chris Watanabe. My most sincere appreciation for his assistance and patience. I would also like to thank Doug Holmquist, ASIC engineer, also at the Exabyte Corporation, for his assistance in obtaining the ASIC databooks, which enabled the comparisons of the multiplication architectures. Special thanks to my original advisor, Dr. Richard Auletta, for his guidance, which kept me on track. His patience and flexibility made it possible for this thesis not to interfere with my duties at work. It was his genuine concern for his students' quest for knowledge that sparked my interest in VLSI systems. I have always been able to seek his assistance, even late on Sunday nights! His professionalism is to be applauded.
Contents

Chapter

1. Introduction

2. FIR Equalization in Read Channel ASICs

3. Addition Architectures
   3.1 Serial Adders
   3.2 Parallel Adders
       3.2.1 Carry Completion Adder
       3.2.2 Carry Look-Ahead Adder
       3.2.3 Carry Select Adder
       3.2.4 Carry Skip Adder
       3.2.5 Completion Prediction Adder
       3.2.6 Conditional Sum Adder
       3.2.7 Manchester Carry Chain Adder
       3.2.8 Ripple Carry Adder
   3.3 Addition Technique Comparison

4. Multiplication Architectures
   4.1 Sequential Multiplier
       4.1.1 Bit-Slice Multiplier
   4.2 Parallel Multiplier
       4.2.1 Shift-Add Multiplier
       4.2.2 Bit-Slice Multiplier
       4.2.3 Radix-n Multiplier
       4.2.4 Wallace Tree Multiplier
       4.2.5 Pipelined Multiplier
       4.2.6 Dadda Multiplier
   4.3 Array Multiplier
       4.3.1 Braun Multiplier
       4.3.2 Baugh-Wooley Multiplier
   4.4 Multiplier Comparison

5. Optimal Multiplier Solution
   5.1 Representation Consideration
   5.2 Performance Consideration
   5.3 Proposed Multiplier Design
       5.3.1 Recoding Scheme
       5.3.2 Accumulation Scheme

6. Performance Evaluation
   6.1 Propagation Delay of the Shift-Add Architecture
   6.2 Propagation Delay of the Modified-Booth Architecture
   6.3 Propagation Delay of the Radix-8 Booth Architecture
   6.4 Performance Comparison of the Selected Architectures

7. Conclusions

Appendix
   Multiplier Performance Predictor in Matlab (multiply.m)

References
Chapter 1

Introduction

The demand for high performance read channel integrated circuits is continuously rising in the data storage industry. To meet this demand, numerous semiconductor companies have experimented with various read channel designs. With the advent of Partial Response Maximum Likelihood (PRML) and Extended Partial Response Maximum Likelihood (EPRML) read channel designs, transfer speeds of over 500 Mbit/s are possible. However, there are several bottlenecks associated with these designs that are essential to the performance of read channel ASICs: Trellis tuning algorithms, Viterbi implementation in its true fashion, the phase-locked loop (PLL), and finite impulse response (FIR) equalization. First, the problem with the Trellis tuning algorithm is the accuracy attainable from simultaneous processing of multiple parameters. Second, the Viterbi algorithm is currently implemented at half of the attainable efficiency due to independent processing of even and odd sequences. Third, the speed of the PLL limits the overall operating performance of read channel ASICs. Lastly, the performance of the FIR equalization is limited by the speed of the coefficient multipliers as well as the tap adders. While all these bottlenecks contribute to the performance of read channel ASICs, this thesis concentrates on the FIR equalization. The intent of this paper is to explore various multiplication techniques that would improve the overall performance of a 5-tap FIR equalization filter with 6-bit multipliers as used in SSI's (Silicon Systems Inc.) 32P4101A PRML read channel chip shown in Figure 1.1 [2].
Figure 1.1: SSI 32P4101A EPR4 Read Channel ASIC
Chapter 2

FIR Equalization in Read Channel ASICs

Equalization is used in read channel ASICs to remove distortion in the output signal caused by noise, impedance mismatch, read head dimensions, etc. In order to modify the amplitude and phase responses, equalization filters are used [3-5]. For example, suppose the target and actual channel responses are as shown in Figures 2.1 and 2.2, respectively. Then, to achieve the desired channel response, the shaded regions in Figure 2.3, which represent the differences between the target and the actual responses, must be attenuated or boosted accordingly.

Figure 2.1: Target Channel Response
Figure 2.2: Actual Channel Response

Figure 2.3: Channel Equalization
Thus, the equalization filter can be defined as follows:

YE(f) = YT(f) / YA(f) .................................................. (2.1)

where YE(f) is the equalization response, YT(f) is the target response, and YA(f) is the actual response. This equation can be easily realized by a finite impulse response (FIR) filter, sometimes referred to as a transversal filter [6]. The transfer function, H(z), of a causal FIR filter with length M is defined by [7]

H(z) = Σ h[k]·z^(-k), for k = 0, 1, ..., M-1 .................................................. (2.2)

where z^(-1) represents a unit delay and h[k] is the impulse response of the filter. Equation 2.2 can be rewritten in the time domain as [7-9]

y[n] = Σ h[k]·x[n-k], for k = 0, 1, ..., M-1 .................................................. (2.3)

where x[n] and y[n] are the input and output sequences, respectively. A FIR filter can be realized in various forms, i.e. direct, cascade, etc. [7]. The direct form implementation of a 5-tap FIR filter is shown in Figure 2.4 as described by equation 2.4.

y[n] = h[0]x[n] + h[1]x[n-1] + h[2]x[n-2] + h[3]x[n-3] + h[4]x[n-4] .................................................. (2.4)
As shown in Figure 2.4, the speed of this 5-tap FIR filter is directly dependent on the performance of both the FIR coefficient multipliers and the tap adders. Hence, it is the intent of this paper to explore various multiplication and addition techniques to optimize the multipliers used in the realization of this FIR equalization filter.
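To make the direct-form structure concrete, the following Python sketch evaluates equation 2.3 for a 5-tap filter. The coefficient values are illustrative placeholders, not taken from the SSI device, and the function name is my own.

```python
def fir_direct(x, h):
    """Direct-form FIR filter: y[n] = sum_k h[k] * x[n-k]  (equation 2.3)."""
    M = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(M):
            if n - k >= 0:          # causal filter: ignore samples before x[0]
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

# 5-tap example with hypothetical coefficients; a unit impulse at the
# input reproduces the impulse response, as equation 2.3 predicts.
h = [1, 2, 3, 2, 1]
x = [1, 0, 0, 0, 0, 0]
print(fir_direct(x, h))   # [1, 2, 3, 2, 1, 0]
```

In hardware, each `h[k] * x[n-k]` term corresponds to one coefficient multiplier and the accumulation to the tap adders, which is why their latency dominates the filter's speed.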
Chapter 3

Addition Architectures

A typical multiplier is often made up of AND gates (to generate partial products), shifters, and adders. Therefore, prior to evaluating several of the most popular multiplication algorithms, we will examine the core elements that these multiplication algorithms are based on. Adders in general can be classified by the method in which the inputs are processed, i.e. parallel, serial, and parallel/serial.

3.1 Serial Adders

Serial adders are generally not used in fast applications due to the sequential processing nature of these adders. However, some serial addition algorithms are currently being evaluated; serial adders require numerous clock cycles to compute the sum of two binary numbers, but with a short cycle time as compared to parallel adders [10].

3.2 Parallel Adders

Parallel adders are often used in fast multiplication applications due to their efficiency over serial adders. Several parallel addition architectures are examined in detail in the subsections that follow.

3.2.1 Carry Completion Adder

The carry completion adder (CCA) is constructed from an asynchronous ripple carry adder with carry completion detection circuitry [11]. The carry completion signal is generated whenever the sum output is stable for a finite length of time. Unlike ordinary full adders, the carry completion adder has two carry-in inputs, Cin0 and Cin1, as well as two carry-out outputs, Cout0 and Cout1 [13]. Each adder stage computes the sum of two input bits based on both carry-in combinations. The first occurrence of a high signal on either of the two carry-out lines activates the
completion circuit, which indicates that the output value is reliable and can be used. The Cin0 input will be high if the carry-out signal from the previous stage is low; otherwise, Cin1 will be high. The completion signal, CC, is defined by

CC = (Cout0_0 + Cout1_0)·(Cout0_1 + Cout1_1)···(Cout0_n-1 + Cout1_n-1) .................................................. (3.1)

The carry completion adder is shown in Figure 3.1. Due to its asynchronous nature, the total delay through this type of adder is dependent on the length of the operands.

3.2.2 Carry Look-Ahead Adder

One of the main disadvantages of the ripple carry adder is its carry delay. To reduce the overall carry delay in an adder, the carry look-ahead adder (CLA) generates carries in parallel instead of in series as in the ripple carry adder [10-12]. The carry of each stage is computed from the carry input and the carry propagate (P) and carry generate (G) signals, which are defined by [12]

Pi = Xi ⊕ Yi .................................................. (3.2)

Gi = Xi·Yi .................................................. (3.3)

Therefore, the carry-out signal, Ci+1, can be expressed as shown in equation 3.4 [13].

Ci+1 = Gi + Pi·Ci .................................................. (3.4)
Equation 3.4 can be rewritten in terms of the Gj's, Pj's, and C0 as shown in equation 3.5 [13], which enables the carries to be generated based only on the original inputs X and Y, eliminating the carry propagation delay.

Ci+1 = Gi + Pi·Gi-1 + Pi·Pi-1·Gi-2 + ··· + Pi·Pi-1···P0·C0 .................................................. (3.5)

With carry look-ahead addition, an additional cycle is needed for the addition. A 2-bit carry look-ahead adder is shown in Figure 3.2.

3.2.3 Carry Select Adder

The carry select adder is comprised of adder blocks that perform two additions each [12]. At each stage, one adder block assumes the carry-in input is a zero, and the other assumes a one. The actual sum and carry-out of each stage are selected by the carry-out generated from the previous stage. A block diagram of the carry select adder structure is shown in Figure 3.3.
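The look-ahead principle of equations 3.2 through 3.5 can be sketched behaviorally in Python. This is a bit-level model rather than an RTL description; the function name and operand values are my own, and the inner loop is the expanded product-of-propagates form of equation 3.5.

```python
def cla_add(x, y, n, c0=0):
    """Carry look-ahead addition: Pi = Xi ^ Yi, Gi = Xi & Yi
    (equations 3.2-3.3), with every carry expanded per equation 3.5
    so it depends only on the G's, P's, and C0."""
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    p = [xb[i] ^ yb[i] for i in range(n)]   # propagate
    g = [xb[i] & yb[i] for i in range(n)]   # generate
    c = [c0]
    for i in range(n):
        # Ci+1 = Gi + Pi*Gi-1 + Pi*Pi-1*Gi-2 + ... + Pi...P0*C0
        carry = g[i]
        prod = p[i]
        for j in range(i - 1, -1, -1):
            carry |= prod & g[j]
            prod &= p[j]
        carry |= prod & c0
        c.append(carry)
    s = sum((p[i] ^ c[i]) << i for i in range(n))   # sum bit: Si = Pi ^ Ci
    return s + (c[n] << n)

print(cla_add(11, 7, 6))   # 18
```

Every carry here is a two-level AND-OR function of the primary inputs, which is exactly what buys the CLA its speed at the cost of the wide gates the text describes.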
Figure 3.3: 2-Bit Carry Select Adder

In this adder structure, the additions are performed in parallel for all input bits, which enables the carry select adder to achieve higher performance at the cost of added hardware. The computational efficiency gained outweighs the minimal delay introduced by the multiplexer.

3.2.4 Carry Skip Adder

The carry skip adder is constructed from a ripple carry adder with carry skip circuitry [10-11]. The ripple carry adder is divided into blocks, each with embedded carry skip logic. The carry skip circuitry is used to bypass several consecutive adder stages, which reduces the carry propagation time whenever possible [13]. Unlike the carry look-ahead adder, this type of adder does not generate carries in parallel but rather in a pseudo-parallel manner. An example of a 2-bit carry skip adder is shown in Figure 3.4.

Figure 3.4: 2-Bit Carry Skip Adder
If both adder stages in Figure 3.4 receive identical inputs with reference to logic levels, the carry input is passed on directly as the carry output; otherwise, the carry output will be generated based on propagation through both adder stages [10].

3.2.5 Completion Prediction Adder

The completion prediction adder, as proposed by Lee [14], is a synchronous adder with completion prediction logic generated from evaluation of its inputs. This design enables both delay and hardware reduction. The block diagram of a completion prediction adder is shown in Figure 3.5.

The completion prediction block shown in Figure 3.5 is comprised of carry propagation (Pi) generation circuitry, an AND tree, and a window detection circuit. The AND tree is used to track the number of successive instances where the Pi's are high. The window detection circuit is used to categorize the values returned by the AND tree in order to generate the prediction signals. This is accomplished by tapping off at various stages of the AND tree and evaluating the input levels through a combinational network. These prediction signals are then evaluated by the completion signal generation block to generate the completion signal used for synchronization [14].
3.2.6 Conditional Sum Adder

The conditional sum adder was regarded as the fastest adder in the 1960s [12]. It is comprised of conditional cell blocks and multiplexers. Its theory of operation is very similar to that of the carry select adder. A block diagram of the conditional sum adder structure is shown in Figure 3.6.

Figure 3.6: Conditional Sum Adder

As depicted in Figure 3.6, there are two layers of multiplexers which generate the sum and carry-out signals based on both carry-in conditions (zero or one) from the previous stage. Although only two bits are shown in this figure, this structure can be extended to n bits, with the adder's total delay linearly proportional to the total number of multiplexer stages.

3.2.7 Manchester Carry Chain Adder

One way of improving the carry chain of a carry look-ahead adder is to utilize precharged circuits. The Manchester carry chain adder is formed by a carry look-ahead adder with precharged stages that derives the sum output from two
signals: propagate (P) and generate (G). The diagram of a two-bit Manchester carry chain is shown in Figure 3.7 [10].

Figure 3.7: 2-Stage Manchester Carry Chain

During the precharge phase (clock set low), the storage node at the junction between the PMOS and NMOS transistors, which represents the complement of the carry-in, is charged to a 1 within each stage. During the evaluation phase (clock set high), if the Generate signal is also high, the storage node will discharge, which generates a carry-out into the next stage [10].

3.2.8 Ripple Carry Adder

A ripple carry adder (RCA) is comprised of a series of full adders in which the carry-out output from the previous stage drives the carry-in input of the subsequent stage, hence its name [10-12]. Since the sum output of the nth bit cannot be completed until the carry outputs have propagated through all previous stages, as shown in Figure 3.8, the delay of a ripple carry adder for an n-bit addition can be characterized by

tT = (n-1)·tc + ts .................................................. (3.7)
where tT is the total delay of the RCA, tc is the carry delay, and ts is the sum delay.

Figure 3.8: Ripple Carry Adder

3.3 Adder Comparison

The carry completion adder can reduce the amount of carry propagation time at the expense of additional circuitry. However, it is unsuitable for synchronous operations due to the added hardware and delay required by the resynchronization process, which is a major drawback of its asynchronous nature [11]. The carry look-ahead adder is an ideal candidate for high speed addition if size is not an important factor. The added hardware to compute the carries in parallel accelerates the carry propagation but may render this type of adder unsuitable due to area [13]. The added hardware also equates to larger chip area, which dissipates more power [12]. The carry select adder is often a good compromise between performance and size [12]. The carry skip adder is comparable in speed to the carry look-ahead adder while consuming less power due to its smaller chip area requirement [13]. It is also relatively inexpensive and only slightly slower than the carry ripple adder [11]. The completion prediction adder yields high performance with regard to size and length of critical path when used in applications greater than 64 bits [14]. The conditional sum adder operates at roughly the same speed as the carry look-ahead adder, although its design is less modular. For this reason, carry look-ahead adders are used more frequently than conditional sum adders [13]. The Manchester carry chain adder utilizes precharged circuits to accelerate the carry chain [9]. The limited fan-out capability of this carry chain renders it impractical for larger adders. The main advantages of the ripple carry adder are simplicity, low cost, and relatively high performance. Its simple and highly regular structure lends itself to cell
designs. However, this adder is unsuitable for large designs, since its critical delay is linearly proportional to its length [12]. A performance summary is presented in Table 3.1, based on the characteristics of these adders along with the performance evaluation of FPGA implementations conducted by Xing [15].

Table 3.1: Adder Performance Comparison

  Carry Completion          6    8
  Carry Look-Ahead          8
  Carry Select              4    3
  Carry Skip                2    4
  Completion Prediction     7
  Conditional Sum           7    6
  Manchester Carry Chain    3    2
  Ripple Carry Adder

The ratings shown in Table 3.1 are arbitrarily assigned, 8 being the worst and 1 being the best, and are based on n (number of bits) less than 8. All of the above designs can be optimized through various techniques, which would yield different results than those shown in Table 3.1; it is not the intent of this paper to evaluate every possible implementation scheme.
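As a concrete reference point for the comparison above, the ripple carry adder of Section 3.2.8 can be modeled in a few lines of Python. The sketch below computes both the sum and the delay model of equation 3.7; the unit delays tc and ts are placeholder values, and the function name is my own.

```python
def ripple_carry_add(x, y, n, tc=1, ts=1):
    """Bit-serial model of an n-bit ripple carry adder.
    Returns (sum, modeled delay) with delay = (n-1)*tc + ts per equation 3.7."""
    c = 0
    s = 0
    for i in range(n):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        s |= (xi ^ yi ^ c) << i            # full-adder sum bit
        c = (xi & yi) | (c & (xi ^ yi))    # full-adder carry out, rippled onward
    return s | (c << n), (n - 1) * tc + ts

print(ripple_carry_add(23, 42, 6))   # (65, 6)
```

The loop makes the linear dependence on n explicit: each stage must wait for the carry of the one before it, which is precisely the critical path the table penalizes.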
Chapter 4

Multiplication Architectures

Binary multiplication in its most basic form consists of two steps: evaluation of partial products and summation of shifted partial products [16]. For example, the multiplication of two binary numbers is completed as shown in Figure 4.1. The partial products can be generated with AND gates by ANDing the multiplicand with each multiplier bit. High speed multipliers can be classified according to the manner in which the partial products and their sums are generated. These classifications are sequential multipliers, parallel multipliers, and array multipliers [13]. These three types of multipliers are discussed at length in Sections 4.1, 4.2, and 4.3, respectively.

4.1 Sequential Multipliers

The sequential multiplier generates the partial products and sums them sequentially. Due to the sequential nature of serial multipliers, they are generally
not used in high speed multiplication applications. Therefore, for the purpose of discussion, only one type of sequential multiplier, the bit-slice multiplier, is discussed in this paper.

4.1.1 Bit-Slice Multiplier

The bit-slice multiplier computes the product of two binary numbers by decomposing each n-bit number into multiple m-bit numbers and summing the partial product terms. The decomposition is accomplished with equation 4.1 [17].

X·Y = Xu·Yu·2^(2M) + (Xu·Yl + Xl·Yu)·2^M + Xl·Yl .................................................. (4.1)

where
Xu = the upper bits of X
Yu = the upper bits of Y
Xl = the lower bits of X
Yl = the lower bits of Y
M = the number of lower bits

For example, the product of 100110 (38_10) and 001101 (13_10) can be decomposed as demonstrated in equation 4.2.

38 x 13 = (4)(1)·2^6 + [(4)(5) + (6)(1)]·2^3 + (6)(5) = 256 + 208 + 30 = 494 .................................................. (4.2)

This requires four 3-bit multipliers and three 6-bit adders [17], which leads to two 3-bit (M=3) bit-slice multipliers. Applying equation 4.1 again, this 6-bit multiplier can be further decomposed into 2-bit (M=2) bit-slice multipliers as shown in equation 4.3.

X·Y = Σi Σj Xi·Yj·2^(2(i+j)), where Xi and Yj are the 2-bit slices of X and Y .................................................. (4.3)

The serial implementation of this bit-slice multiplier is shown in Figure 4.2.
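The decomposition of equation 4.1 can be sketched directly in Python; the slice width m = 3 matches the worked example above, and the function name is my own.

```python
def bitslice_multiply(x, y, n=6, m=3):
    """Bit-slice multiplication per equation 4.1: split each n-bit operand
    into an upper (n-m)-bit half and a lower m-bit half, then combine the
    four small products with the appropriate power-of-two shifts."""
    mask = (1 << m) - 1
    xu, xl = x >> m, x & mask
    yu, yl = y >> m, y & mask
    return (xu * yu << (2 * m)) + ((xu * yl + xl * yu) << m) + xl * yl

# The worked example of equation 4.2: 100110 (38) x 001101 (13)
print(bitslice_multiply(38, 13))   # 494
```

The four `xu*yu`, `xu*yl`, `xl*yu`, `xl*yl` terms are exactly the four 3-bit multiplications the text counts, and the shifts stand in for the three 6-bit adders.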
Each pair of Xi and Yj is fed into the multiplier sequentially. The product is then shifted according to the power of two of each partial product listed in equation 4.3. Next, the shifted partial product is added to the previously stored sum in the accumulator. When all Xi's and Yj's have cycled through the adder, the adder will return the final product. This implementation scheme minimizes hardware at the expense of increased propagation delay.

4.2 Parallel Multipliers

The parallel multiplier generates the partial products in parallel. Fast binary multipliers are often implemented in a parallel scheme. There are a number of fast parallel multiplier algorithms used in signal processing, and an attempt will be made here to address the most popular techniques in the sections that follow.

4.2.1 Shift-Add Multiplier

The shift-add multiplier is the simplest and most straightforward form of binary multiplier. The procedure for performing this type of computation in binary is identical to that of ordinary multiplication (base 10 arithmetic). For a 6-bit shift-add multiplier, a total of 36 partial products are generated, as shown by Figure 4.3 [18].
    100110   (Multiplicand)
  x 010011   (Multiplier)
  --------
    100110
   100110
  000000
 000000
100110
 000000
-----------
01011010010

Figure 4.3: Shift-Add Multiplication

One implementation of this 6-bit multiplier is shown in Figure 4.4 [19].

Figure 4.4: Shift-Add Multiplier
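The shift-add procedure of Figure 4.3 can be sketched behaviorally in Python; this models only the arithmetic, not the clocked shift registers and accumulator of Figure 4.4, and the function name is my own.

```python
def shift_add_multiply(multiplicand, multiplier, n=6):
    """Unsigned shift-add multiplication as in Figure 4.3: for each
    multiplier bit, LSB first, add the correspondingly shifted
    multiplicand into the accumulator when the bit is 1."""
    acc = 0
    for i in range(n):
        if (multiplier >> i) & 1:        # the Add_Enable condition
            acc += multiplicand << i     # shifted multiplicand
    return acc

# The Figure 4.3 example: 100110 (38) x 010011 (19) = 01011010010 (722)
print(shift_add_multiply(0b100110, 0b010011))   # 722
```

Each loop iteration corresponds to one clock cycle of the hardware in Figure 4.4: the multiplier shifts right one bit, the multiplicand shifts left one bit, and the accumulator conditionally adds.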
While the Clk and Multiply_Enable signals are high and the current multiplier bit is high, the accumulator sums the previous partial sum with the shifted multiplicand. This cycle repeats until all 6 multiplier bits have shifted out, and the accumulator then returns the final product.

4.2.2 Bit-Slice Multiplier

The two 3-bit (M=3) bit-slice multiplier previously discussed, as defined by equation 4.3, can also be implemented in parallel. The realization of the parallel bit-slice multiplier is shown in Figure 4.5. Although the parallel operation enables the parallel bit-slice multiplier to achieve a higher speed than the serial bit-slice multiplier, the added hardware and cost might outweigh its improvement in performance.

4.2.3 Radix-n Multiplier

An effective means of reducing the number of partial product terms, which reduces the overall computation time, is through simultaneous evaluation of multiple
bits of the multiplier. Each time the number of bits examined is doubled, the number of partial products is halved. Through the evaluation of these bits, a reduced set of bits with associated operations can be formed. This is the so-called recoding. The Booth algorithm [20] is one of the earliest recoding schemes proposed. The original multiplication technique proposed by Booth [20] is designed to facilitate machine multiplication using two's complement arithmetic with correction. Booth observed that when two two's complement numbers x and y are multiplied as if unsigned, the following results are obtained: if x and y are both positive, the result is simply xy; if x is negative and y is positive, the result is xy + 2y; if x is positive and y is negative, the result is xy + 2x; if x and y are both negative, the result is xy + 2x + 2y + 4. It is clear that the corrections required in the last three cases must be detected by the machine, which leads to added hardware, not to mention complexity. The proposed two's complement multiplication technique is stated in Table 4.1 [19-20]. The y values in Table 4.1 refer to the multiplier bits, while the operation is performed on the multiplicand. The multiplier, y, when evaluated in groups of 2 bits, can be expressed as

y = Σ (y_(i-1) - y_i)·2^i, for i = 0, 1, ..., n-1, with y_(-1) = 0 .................................................. (4.4)

The function in parentheses can only take on the values {-1, 0, +1}. The weights of the y terms are expressed in powers of two, and therefore the Booth multiplier is also referred to as a radix-2 multiplier [13], [16]. It is apparent from Table 4.1 that the operation can be represented with one binary bit under the Booth recoding scheme. This recoding scheme can be extended to a radix-4 scheme, where every 3 bits of the multiplier are recoded into 2 bits. The modified-Booth multiplier [16] is an example of the radix-4 multiplier, where 3 bits of the multiplier are evaluated at a time.
The modified-Booth multiplier evaluates 3 bits of the multiplier at a time and carries out the operation defined by Table 4.2 [10], [16]. In two's complement, the multiplier, y, can be written as [12]

y = -y_(n-1)·2^(n-1) + Σ y_i·2^i, for i = 0, 1, ..., n-2 .................................................. (4.5)

Equation 4.5 can also be rewritten as [12]

y = Σ (y_(i-1) + y_i - 2·y_(i+1))·2^i, for i = 0, 2, 4, ..., n-2, with y_(-1) = 0 .................................................. (4.6)

The term within the parentheses takes on the values in the set {-2, -1, 0, 1, 2} [12], which specifies the operations listed in Table 4.2. The operations defined in Table 4.2 are performed on the multiplicand based on the 3 bits of the multiplier under evaluation. Through each iteration, the multiplicand is shifted by two bit positions and the partial product is incremented by the operation specified by the multiplier bits under examination. For example, the product of 000101_2 (+5) and 101101_2 (-19) can be computed via the modified-Booth algorithm as shown in Table 4.3.

Table 4.3: Modified-Booth Example, 000101_2 (+5) x 101101_2 (-19)

  y1 y0 y-1 = 010_2:  +x       P1 = P0 + 000000000101_2 = 000000000101_2
  y3 y2 y1  = 110_2:  -x·2^2   P2 = P1 - 000000010100_2 = 111111110001_2
  y5 y4 y3  = 101_2:  -x·2^4   P3 = P2 - 000001010000_2 = 111110100001_2
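The radix-4 recoding of equation 4.6, as exercised in Table 4.3, can be sketched behaviorally in Python. The code below reproduces the worked example; the `bit` helper and function name are my own, and Python's arithmetic right shift conveniently sign-extends negative multipliers.

```python
def booth_radix4_multiply(x, y, n=6):
    """Radix-4 (modified Booth) multiplication per equation 4.6: scan the
    multiplier 3 bits at a time, overlapping by one bit, and accumulate
    (y[i-1] + y[i] - 2*y[i+1]) * x * 2^i for i = 0, 2, 4, ...
    x and y are n-bit two's complement values."""
    def bit(v, i):
        return (v >> i) & 1 if i >= 0 else 0   # y[-1] = 0 by definition
    product = 0
    for i in range(0, n, 2):
        digit = bit(y, i - 1) + bit(y, i) - 2 * bit(y, i + 1)  # in {-2..+2}
        product += digit * x << i              # one of the Table 4.2 operations
    return product

print(booth_radix4_multiply(5, -19))   # -95, the Table 4.3 example
```

Only n/2 partial products are generated instead of n, which is the halving the text attributes to doubling the number of multiplier bits examined per step.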
Since the most significant bit (MSB) of P3 is a one, the product is negative; taking the two's complement of 111110100001_2 gives the magnitude 000001011111_2 = 95_10. Therefore, the product in signed arithmetic is -95_10, the correct answer. The realization of the 6-bit modified-Booth multiplier is shown in Figure 4.6. The Booth recoding blocks are omitted for simplicity.

Figure 4.6: 6-Bit Modified-Booth Multiplier

Each group of 3 bits of the multiplier shown in Figure 4.6 selects the proper magnification of the shifted multiplicand to be added to the current partial product.

4.2.4 Wallace Tree Multiplier

The Wallace tree multiplier is a modified tree multiplier that reduces the total number of adders needed through the use of compressors in order to minimize the computation of the partial products [21]. The original design as proposed by Wallace entails 3:2 compressors [21], although other compressors have been proposed, i.e. the 6:2 and 9:2 compressors [22-23]. A 3:2 compressor is made up of carry-save adders, each with three inputs, A, B, and C, and two outputs, X and Y. The outputs X and Y are generated according to equations 4.7 and 4.8 [10].

X = A ⊕ B ⊕ C .................................................. (4.7)

Y = A·B + (A ⊕ B)·C .................................................. (4.8)

This type of adder tree requires no carry chain, which reduces the total number of summands needed to compute the product of two binary numbers [10]. A possible implementation of a 3-bit Wallace multiplier is shown in Figure 4.7.
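The 3:2 compressor of equations 4.7 and 4.8 can be sketched bitwise in Python. Applying it across whole words shows why no carry chain is needed until one final carry-propagate addition; the function name is my own.

```python
def compress_3_2(a, b, c):
    """Carry-save (3:2 compressor) step per equations 4.7-4.8, applied
    bitwise to entire words: three operands in, a sum word and a carry
    word out, with no carry propagation between bit positions."""
    x = a ^ b ^ c                    # equation 4.7: X = A xor B xor C
    y = (a & b) | ((a ^ b) & c)      # equation 4.8: Y = AB + (A xor B)C
    return x, y << 1                 # the carry word carries double weight

# Summing three numbers with one compressor plus a single
# carry-propagate addition at the very end:
s, cy = compress_3_2(13, 25, 7)
print(s + cy)                        # 45
```

A tree of such compressors reduces many partial products to just two words in logarithmic depth, deferring the single slow carry-propagate addition to the final step.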
Figure 4.7: 3-Bit Wallace Tree Multiplier

The computation of two 3-bit binary numbers requires a total of 6 carry-save adders instead of the 8 carry-propagate adders required by several of the multipliers previously discussed. In addition, this adder tree structure reduces the total number of adder delays. For example, a 16-bit multiplier has at most 16 partial products to be summed. The compression of the 16 partial products through the Wallace adder tree proceeds as follows: 16, 12, 8, 6, 4, 3, 2. Hence, a maximum of 7 adder delays is to be expected. The same multiplication would have resulted in 8 adder delays with a modified-Booth multiplier [16].

4.2.5 Pipelined Multiplier

To increase the throughput (the delivery of successive multiplications) of large multipliers, as used in signal processing applications, pipelining is typically used. Pipelining is a technique used to overlap the execution of successive operations that are identical. This is achieved by segmenting a design into several stages. Each stage executes independently on consecutive operands. Latches are added between adjacent stages to enable parallel operation. A block diagram of a 2-stage pipeline is shown in Figure 4.8 [13], [24].
Figure 4.8: 2-Stage Pipeline

Pipelining can be applied to various multiplication algorithms, i.e. the Booth multiplier, Wallace tree multiplier, Dadda multiplier, array multiplier, etc. [25]. To increase the throughput of the pipelined multiplier further, a special type of pipelined multiplier called the wave-pipelined multiplier can be used; wave-pipelined multipliers achieve higher performance by eliminating pipelining registers between combinational logic blocks within the multiplier. The balance of the delay throughout the multiplier is assured through critical layout and routing. A 4-bit wave-pipelined multiplier is shown in Figure 4.9 [26].

Figure 4.9: 4-Bit Wave-Pipelined Multiplier
The partial products are generated by the AND gates as shown in Figure 4.9. The processing elements (PE) are basically AND-gated full adders that are connected in an array fashion. To minimize the latency, a semi-systolic schedule is followed by the PEs; latches are inserted between rows of processing elements. The vector merge adder performs the conversion between the carry-save product representation and the normal binary representation [26].

4.2.6 Dadda Multiplier

The Dadda multiplication algorithm is based on parallel (n,m) counters; n refers to the number of inputs and m refers to the number of outputs of the parallel counters (combinational networks) [27-28]. A 4 x 4 Dadda multiplication is shown in Figure 4.10.

Figure 4.10: 4 x 4 Dadda Multiplication

As with typical binary multiplication, the product between a 4-bit multiplicand, X, and a 4-bit multiplier, Y, generates 16 partial products. These partial products are represented by dots in Figure 4.10. The collection of all partial products is termed matrix A. The first step in applying the Dadda parallel multiplier algorithm is to transform matrix A into a reduced-row matrix, B. This is accomplished by first directly transcribing columns 0 and 1 without alteration from matrix A to
PAGE 36
matrix For the remaining columns, each column will be shortened by the appropriate counters. These counters come in various sizes. For the ease of implementation, only 3:2 counters (full adders) and 2:2 counters (half adders) will be used. Therefore, the three partial product terms in column 2 will be reduced by a 3:2 counter. The LSB of the output will be mapped onto the same column in matrix B, while the MSB of the output will be mapped onto column 3 although at one row below. In the cases where the number of partial products cannot be adequately dealt with one single counter, the unused partial products are simply mapped onto matrix B at the next available row in the same column. Upon completion of the first transformation, if the transformed matrix has more than two rows, additional transformation in the same manner is needed. The reasoning behind the tworow limit is to facilitate twooperand addition through the CPA. The implementation of this multiplier following the algorithm shown in Figure 4.10 is shown in Figure 4.11. 4Bit Dadda Multiplier Y2 2:2 HA Y 2Y2 x1Y3 x3o X 21 3:2 1Y2 Y2o X1Y1 XOY2 IBit CPA Figure 4.11: 4Bit Dadda Multiplier Product The reduction scheme used to generate the 4bit Dadda multiplier as shown in Figure 4.11 is not optimal. Dadda had suggested several reduction schemes which involves alternate partial reduction rules to minimize the total number of counters [28]. 4.3 Array Multiplier 27
The array multiplier generates partial products and their sums through an array of nearly identical cells, which lends itself to VLSI implementation with high performance at the expense of hardware complexity [13], [29]. For example, a 3 x 3 multiplication results in the partial products and their sums as shown in Figure 4.12. One possible implementation of this 3 x 3 multiplier in the array multiplication scheme is shown in Figure 4.13.
As shown in Figure 4.13, the partial sums can be easily computed through an array of half adders. The output bits from each stage of AND gates are added in parallel with the sum of partial products from the previous stage. The carry-out signals are treated as the MSB of the second input into the adders. 4.3.1 Braun Multiplier The Braun multiplier is an array multiplier that computes the product of two unsigned binary numbers according to [12]

P = X·Y = Σ Σ x_i·y_j·2^(i+j), i = 0 to m−1, j = 0 to n−1 .......... (4.9)

where X is the multiplicand, Y the multiplier, and P the product. Equation 4.9 is derived from equations 4.10 and 4.11, which represent the two unsigned binary numbers X and Y [12].

X = Σ x_i·2^i, i = 0 to m−1 .......... (4.10)

Y = Σ y_j·2^j, j = 0 to n−1 .......... (4.11)

The partial products of this type of multiplier are generated by AND gates in parallel and the partial sums of the partial products are generated by array adders as shown in Figure 4.14 [12].

[Figure 4.14: Braun Multiplier Structure]
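The partial-product formulation of equations 4.9 through 4.11 can be checked with a short sketch (Python here, for illustration; the thesis's own simulations use Matlab). Each AND-gated partial product x_i·y_j carries weight 2^(i+j), and their sum reproduces the unsigned product; `braun_product` is a name introduced for this example.

```python
def braun_product(x, y, m=3, n=3):
    """Unsigned m-by-n multiplication per equation 4.9: sum the AND-gate
    partial products x_i * y_j, each weighted by 2^(i+j)."""
    p = 0
    for i in range(m):
        for j in range(n):
            # One AND gate per (i, j) pair; the adder array of
            # Figure 4.14 performs this summation in hardware.
            p += (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
    return p
```

For 3-bit operands, all m·n = 9 partial products are generated in parallel by AND gates, and only the summation network distinguishes one array multiplier from another.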
For an m by n Braun multiplier, a total of m(n−1) adders (CPA) and m·n AND gates are needed. The implementation of a 3-bit Braun multiplier is shown in Figure 4.15. The Braun multiplier shown in Figure 4.15 can be rearranged into a 3 by 3 array, which facilitates placement and routing in an integrated circuit. 4.3.2 Baugh-Wooley Multiplier Another type of array multiplier is the Baugh-Wooley multiplier [30]. The structure of the Baugh-Wooley multiplier is similar to that of the Braun multiplier, with the exception that it is designed for the computation of two's complement numbers instead of unsigned numbers. Both the multiplicand and multiplier can be expressed as shown by equations 4.12 and 4.13 [30].

x = −x_{n−1}·2^(n−1) + Σ x_i·2^i, i = 0 to n−2 .......... (4.12)

y = −y_{n−1}·2^(n−1) + Σ y_j·2^j, j = 0 to n−2 .......... (4.13)
Therefore, the product of the multiplicand, x, and the multiplier, y, can be expressed as shown by equation 4.14 [30].

x·y = x_{n−1}·y_{n−1}·2^(2n−2) + Σ Σ x_i·y_j·2^(i+j) − 2^(n−1)·Σ x_{n−1}·y_j·2^j − 2^(n−1)·Σ x_i·y_{n−1}·2^i,
with i and j running from 0 to n−2 .......... (4.14)

Direct implementation of equation 4.14 requires both adders and subtractors. Thus, in order to reduce the complexity and hardware of such a design, the negative terms can be rewritten as shown by [12], [30],

−2^(n−1)·Σ x_{n−1}·y_j·2^j = 2^(n−1)·x_{n−1}·( Σ y̅_j·2^j + 1 − 2^(n−1) ), j = 0 to n−2 .......... (4.15)

where y̅_j denotes the complement of y_j. This results in all positive partial product terms at the expense of added hardware to complement the multiplicand and multiplier bits, as suggested by equation 4.15. A 3-bit Baugh-Wooley multiplication algorithm is shown in Figure 4.16.
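The rewriting of the negative terms can be verified numerically. The sketch below (Python, for illustration) compares the negative term of equation 4.14 with its complemented form; `ybar` is the bitwise complement of the low n−1 multiplier bits, and both function names are introduced here for the example.

```python
def neg_term_direct(x_msb, y_low, n):
    """The negative term of eq. 4.14: -2^(n-1) * x_{n-1} * sum(y_j * 2^j)."""
    return -(1 << (n - 1)) * x_msb * y_low

def neg_term_recoded(x_msb, y_low, n):
    """Eq. 4.15: the same term rewritten with complemented bits,
    so that the partial products x_{n-1} * ybar_j are all positive."""
    ybar = (~y_low) & ((1 << (n - 1)) - 1)  # complement over n-1 bits
    return (1 << (n - 1)) * x_msb * (ybar + 1 - (1 << (n - 1)))
```

Both forms agree for every value of the sign bit and of the low-order multiplier bits; in hardware, the constant terms fold into a fixed correction row of the array.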
The partial product terms P0 through P5 are generated based on equation 4.16. A possible implementation (not optimized) of this 3-bit Baugh-Wooley multiplier is shown in Figure 4.17. 4.4 Multiplier Comparison The sequential bit-slice multiplier is well suited for applications where size is a major factor. However, its compactness is associated with longer cycle times and propagation delays. These issues are overcome with the parallel version of the bit-slice multiplier at the expense of added hardware, chip area, and power consumption. The add-shift multiplier is the simplest design to conceptualize, since this is how ordinary multiplication is carried out. However, the added registers and large adders render it impractical for large designs.
The radix-n multipliers, such as the Booth multiplier and the modified-Booth multiplier, reduce the number of operational cycles by grouping the multiplier bits into multiples of n. They are faster than the bit-slice multiplier and the shift-add multiplier [17]. The radix-n multipliers are often used in signal processing, digital filtering, and ALUs, since no correction is needed for two's complement arithmetic. The Booth algorithm is ideal for operands greater than 16 bits [12] due to the partial product savings. The Wallace tree multiplier reduces the carry delay through the use of carry-save adders. The compressor stages of a Wallace tree multiplier reduce the total number of adders required, which enhances the overall performance of the multiplier. However, the additional hardware required by the Wallace tree translates to more area, which tends to consume more power; therefore, it is not suitable for low-power applications [12]. The pipelined multiplier accelerates the throughput at the expense of higher latency and cost due to the addition of pipelining stages. The Dadda multiplication algorithms are very similar in structure to the Wallace tree multiplication techniques. The advantage of the Dadda multiplier is that, in certain applications, it can improve the partial product savings considerably over the Wallace tree multiplier [28]. The array multiplier is modular and flexible due to its highly regular structure. An array multiplier of a particular size can be easily extended to a larger application. The Braun multiplier is designed to process unsigned binary numbers, while the Baugh-Wooley multiplier is optimized for two's complement arithmetic. The latter is unsuitable for operands larger than 16 bits due to its area requirements and performance [12]. A summary of the multipliers discussed, based on their structures and characteristics, is presented in Table 4.4.
Type          Cost    Performance  Area    Power   Application (bits)
Bit-Slice     Low     Low          Low     Low     < 16
Add-Shift     Low     Low          Low     Low     < 16
Radix-n       Medium  Medium       Medium  Medium  > 16
Wallace Tree  Medium  High         Medium  High    > 16
Pipelined     High    Medium       High    High    > 16
Dadda         High    High         Medium  High    > 16
Array         Medium  Medium       High    High    < 16

Table 4.4: Multiplier Summary
Chapter 5 Optimal Multiplier Solution 5.1 Representation Consideration The first factor to consider for an optimal 6-bit multiplier is representation. The FIR coefficients can be positive as well as negative. Therefore, a choice has to be made regarding the negative number representation. There are three basic fixed binary number representations that accommodate negative numbers: signed magnitude, one's complement (diminished-radix complement), and two's complement (radix complement) [13]. One of the main disadvantages of the signed-magnitude representation is that operation decisions are required based on the signs of both binary numbers, which leads to additional hardware, computation time, and cost. One's complement resolves the added computational sequences and hardware issues of the signed-magnitude representation. However, the representation is still not optimal. Both the signed-magnitude and one's complement representations can express binary numbers in the range −(2^(N−1) − 1) to (2^(N−1) − 1), where N is the number of bits. The two's complement representation, on the other hand, can express binary numbers in the range −2^(N−1) to (2^(N−1) − 1), with a single representation for zero instead of the multiple representations for zero found in the other two systems. An example of the three types of binary number representation for a 3-bit number is shown in Table 5.1.

(X)2   Signed Magnitude   1's Complement   2's Complement
011          3                  3                3
010          2                  2                2
001          1                  1                1
000         +0                 +0                0
111         −3                 −0               −1
110         −2                 −1               −2
101         −1                 −2               −3
100         −0                 −3               −4

Table 5.1: 3-Bit Binary Number Representations
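Table 5.1 can be reproduced with a small helper (Python, for illustration); `as_signed` is a name introduced for this sketch, not part of the thesis.

```python
def as_signed(v, n, system):
    """Interpret the n-bit pattern v under the three representations."""
    msb = (v >> (n - 1)) & 1
    if system == "sm":                        # signed magnitude
        mag = v & ((1 << (n - 1)) - 1)
        return -mag if msb else mag
    if system == "ones":                      # one's complement
        return v - ((1 << n) - 1) if msb else v
    return v - (1 << n) if msb else v         # two's complement
```

For example, the pattern 100 reads as −0, −3, and −4 under the three systems, matching the last row of Table 5.1; only two's complement reaches −4 and has a single zero.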
As shown in Table 5.1, the 2's complement representation extends the negative range by one and provides a single representation for zero. 5.2 Performance Consideration It has been shown in the previous chapters that the key to a fast multiplier is the management of partial products. This management can be divided into two tasks: partial product reduction and partial product sum reduction. The first task, partial product reduction, can be readily realized with many recoding schemes such as the Booth, modified Booth, etc. The latter task can be realized through efficient addition architectures such as the Wallace tree or the Dadda parallel techniques. 5.3 Proposed Multiplier Design Based on the representation and performance considerations discussed previously, the optimal 6-bit multiplier is to be of the radix-8 architecture. Its detailed recoding and accumulation schemes are examined in the following subsections. 5.3.1 Recoding Scheme To reduce the number of partial products, the radix-8 modified Booth encoding scheme is chosen. As shown in the radix-n multiplier section of this paper, the two's complement representation of the multiplier can be expressed in several ways to yield more efficient computation techniques. One way to rewrite equation 4.13 is as shown in equation 5.1.

y = Σ (y_{3i−1} + y_{3i} + 2·y_{3i+1} − 4·y_{3i+2})·8^i, i = 0 to n/3 − 1, with y_{−1} = 0 .......... (5.1)

The quantity within the parentheses takes on values from the set {0, ±1, ±2, ±3, ±4}. The weight of the multiplier bits is expressed in multiples of 2^3, or 8 (base 10), and hence radix-8. With a 6-bit multiplier, if the modified Booth (radix-4) encoding [13] were used, a total of three multiplier bit evaluations would be needed. The same 6-bit computation with the proposed radix-8 modified Booth encoding requires one less evaluation cycle. The operations associated with the 4-bit multiplier evaluation are cataloged in Table 5.2.
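The recoding of equation 5.1 can be sketched as follows (Python, for illustration); the function name is introduced for this example. It reproduces the digit set {0, ±1, ±2, ±3, ±4} and shows that a 6-bit multiplier needs only two evaluation cycles.

```python
def radix8_digits(y, n=6):
    """Recode an n-bit two's-complement multiplier y into radix-8 Booth
    digits d_i = y[3i-1] + y[3i] + 2*y[3i+1] - 4*y[3i+2], with y[-1] = 0
    (equation 5.1). n is assumed to be a multiple of 3."""
    # Python's arithmetic right shift sign-extends, so this works for y < 0.
    bits = [0] + [(y >> k) & 1 for k in range(n)]   # bits[0] is y[-1]
    digits = []
    for g in range(n // 3):
        b = bits[3 * g : 3 * g + 4]                 # y[3g-1] .. y[3g+2]
        digits.append(b[0] + b[1] + 2 * b[2] - 4 * b[3])
    return digits
```

For instance, y = −13 (110011 in two's complement) recodes into the two digits [3, −2], since 3·1 + (−2)·8 = −13; each digit selects one pre-computed multiple of the multiplicand.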
All operations listed in Table 5.2, with the exception of 3x, are easily generated through shifting. In the case of 3x, an additional adder can be used to sum 1x with 2x, both of which can be generated in parallel. The proposed recoding scheme for a 6-bit multiplier is discussed in Chapter 6. 5.3.2 Accumulation Scheme The accumulation of the partial products will be accomplished through the use of carry-save adders. Carry-save adders differ from other addition algorithms in that they are carry-propagation free. In carry-save addition, the carry generated by each full adder is handled independently of the other carries [28]. The full adders used compute the partial sum of three input bits, instead of two bits and a carry as with other addition architectures. Therefore, the total number of adders needed is reduced. This reduction, along with the inherent carry-propagation-free characteristic of the carry-save adder, can accelerate the performance of any multiplier without added hardware and cost. Carry-save adders are widely used in both the Wallace tree multiplier and the Dadda multiplier, as discussed in Chapter 4.
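The carry-propagation-free behavior described above can be sketched in a few lines (Python, for illustration): three operands are compressed into a sum word and a carry word using only bitwise logic, and only the final conversion back to ordinary binary needs a carry-propagate addition.

```python
def carry_save_add(a, b, c):
    """One 3:2 compression level: three operands in, a (sum, carry) pair out.
    Each bit position is a full adder working independently of its
    neighbors, so no carry ripples through the word."""
    s = a ^ b ^ c                                   # bitwise sum
    carry = ((a & b) | (a & c) | (b & c)) << 1      # majority carries, reweighted
    return s, carry
```

The pair always satisfies s + carry == a + b + c; a tree of such levels reduces any number of partial products down to two words for a single final CPA.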
Chapter 6 Performance Evaluation This chapter evaluates the performance of the three multipliers that are most suitable for a 5-tap FIR equalization filter: Shift-Add, Modified-Booth (Radix-4), and Radix-8 Booth. The comparisons made in this chapter are at the RTL level, with routing and interconnection delays omitted. The propagation delay values used are based on IBM's 0.36 µm CMOS technology with the 5S process [31]. The propagation delays for the standard cells used are cataloged in Table 6.1. These values are obtained by averaging the worst-case propagation delays from low to high (tPLH) and from high to low (tPHL). The performance level chosen is A, with normal process, low-voltage (V = 3.3 V) operation, and a junction temperature of 25°C. The programmable shift register standard cell will not be used, since the shift operations required by the multipliers are fixed. Therefore, there is no intrinsic delay associated with the shifters used, but rather just routing delays.

Device     Delay (ns)
ANDOR22    0.251
ANDOR33    0.243
XOR2       0.3195
AND2       0.18
MUX21      0.2325
MUX81      0.825
ADDF       0.5335
SHIFT      0.26

Table 6.1: Standard Cell Delays

6.1 Propagation Delay of the Shift-Add Architecture Based on the discussion of the Shift-Add architecture examined in Section 4.2.1, one possible realization of a 6-bit two's complement multiplier is shown in Figure 6.1.
[Figure 6.1: 6-Bit Shift-Add Multiplier — multiplexed inputs, right-shift and left-shift registers, accumulator, and product]

This multiplier has been modified to compute two binary numbers in two's complement form by first inverting all multiplicand and multiplier bits and then incrementing them by one, as shown in Figure 6.1. From this architecture, a delay flow diagram was constructed and is shown in Figure 6.2.

[Figure 6.2: Delay Flow Diagram — 6-bit ADDF, 2:1 Mux, AND2, 12-bit ADDF]

The block shaded in gray is repeated for each stage. This is due to the fact that the delay associated with the 12-bit full adder is much greater than that of the previous three blocks combined. Therefore, the overall propagation delay of an n-bit multiplier can be expressed by equation 6.1.
t_SA = (n × t_ADDF) + t_MUX21 + t_AND2 + n × (2n × t_ADDF) .......... (6.1)

where the standard cell delays refer to the values listed in Table 6.1. 6.2 Propagation Delay of the Modified-Booth Architecture Based on the discussion of the modified-Booth architecture examined in Section 4.2.3, one possible realization of a 6-bit two's complement multiplier is shown in Figure 6.3.

[Figure 6.3: 6-Bit Modified-Booth (Radix-4) Multiplier]

Detailed schematics of the radix-4 encoder as well as the function generator are shown in Figures 6.4 and 6.5, respectively.
[Figure 6.4: Radix-4 Encoder]

[Figure 6.5: Radix-4 Function Generator — 1x and 2x (left-shifter) paths]

The exclusive-OR gate depicted in Figure 6.4 is used to determine the operation to be performed by the adder/subtractor shown in Figure 6.3. The rest of the logic is used to encode the multiplier bits. The function generator shown in Figure 6.5 is used to generate zero times, one times, and two times the multiplicand bits, which are fed to the inputs of the 2:1 multiplexers shown in Figure 6.3. Upon examination of the dataflow, a model of the overall propagation delay is depicted in Figure 6.6.
[Figure 6.6: Delay Flow Diagram]

The block shaded in gray is repeated for each stage. The delay associated with an adder/subtractor can be treated as that of a full adder with one additional XOR delay. Therefore, the overall propagation delay of an n-bit multiplier can be expressed by equation 6.2.

t_R4 = t_XOR2 + t_MUX21 + ceil(n/2) × ((2n × t_ADDF) + t_XOR2) .......... (6.2)

6.3 Propagation Delay of the Radix-8 Booth Architecture Based on the discussion of the radix-8 Booth architecture examined in Section 5.3.1, one possible realization of a 6-bit two's complement multiplier is shown in Figure 6.7.

[Figure 6.7: 6-Bit Radix-8 Booth Multiplier — encoder, function generator, 8:1 multiplexers, shifter, and adder/subtractor]
Detailed schematics of the radix-8 encoder as well as the function generator are shown in Figures 6.8 and 6.9, respectively.
[Figure 6.8: Radix-8 Encoder]

[Figure 6.9: Radix-8 Function Generator]

The exclusive-OR gate depicted in Figure 6.8 is used to determine the operation to be performed by the adder/subtractor shown in Figure 6.7. The rest of the logic is used to encode the multiplier bits. The function generator shown in Figure 6.9 is used to generate zero through four times the multiplicand bits, which are fed to the inputs of the 8:1 multiplexers shown in Figure 6.7. Upon examination of the dataflow, a model of the overall propagation delay is depicted in Figure 6.10. The block shaded in gray is repeated for each stage. The delay for the function generator (FG) is computed from the propagation delay of an (n+2)-bit full adder; the adder is of (n+2) bits due to the 2x operation and the carry bit resulting from the sum of 2x and 1x. The delay associated with an adder/subtractor can be treated as that of a full adder with one additional XOR delay. Therefore, the overall propagation delay of an n-bit multiplier can be expressed by equation 6.3.

t_R8 = t_SHIFT + ((n + 2) × t_ADDF) + t_MUX81 + ceil(n/3) × (t_XOR2 + (2n × t_ADDF)) .......... (6.3)

6.4 Performance Comparison of the Selected Architectures Based on equations 6.1 through 6.3, a plot of the propagation delay versus the number of bits (1–12) is generated. The simulation was conducted in Matlab with the script multiply.m, which is included in Appendix A. The performance comparison does not incorporate the addition computation savings made possible by the carry-save algorithm. The comparisons are based purely on the architecture of each chosen multiplier. Thus, the delay values listed in Table 6.2 for all three multipliers can be further reduced through the use of carry-save adders.

[Table 6.2: Results]

A plot of the performance of all three multiplication architectures is shown in Figure 6.11.
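Equations 6.1 through 6.3 can be evaluated directly. The sketch below is a Python transcription of the appendix's Matlab delay model, using the Table 6.1 cell delays, and reproduces the ranking of the three architectures at the target width.

```python
import math

# Standard-cell delays from Table 6.1, in nanoseconds
AND2, XOR2, ADDF = 0.18, 0.3195, 0.5335
MUX21, MUX81, SHIFT = 0.2325, 0.825, 0.26

def t_shift_add(n):
    """Equation 6.1: shift-add multiplier delay (n stages of a 2n-bit CPA)."""
    return n * ADDF + MUX21 + AND2 + n * (2 * n * ADDF)

def t_radix4(n):
    """Equation 6.2: modified-Booth (radix-4) delay, ceil(n/2) stages."""
    return XOR2 + MUX21 + math.ceil(n / 2) * (2 * n * ADDF + XOR2)

def t_radix8(n):
    """Equation 6.3: radix-8 Booth delay; the (n+2)-bit adder forms 3x once."""
    return SHIFT + (n + 2) * ADDF + MUX81 + math.ceil(n / 3) * (XOR2 + 2 * n * ADDF)
```

For n = 6 this gives roughly 42.0 ns, 20.7 ns, and 18.8 ns respectively, so the radix-8 design is the fastest of the three at the target width even before any carry-save savings are applied.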
[Figure 6.11: Multiplier Performance Comparison — delay in ns versus number of bits, 2–12]

The Matlab simulation results indicate that the performance advantage of the radix-8 Booth multiplier increases as n becomes larger. A plot of the multiplier performance from 1 to 128 bits is shown in Figure 6.12.
[Figure 6.12: Multiplier Performance Comparison — delay in ns versus number of bits, 1–128]

Referring to Figure 6.12, the shift-add multiplier is the worst performer of the three architectures, as shown by the top curve. The next curve represents the modified-Booth (radix-4) multiplier. Lastly, the bottom curve is the radix-8 Booth multiplier, which outperformed the other two multipliers. The simulation demonstrated the performance expected. The performance results for the chosen multipliers with varied parameters are shown in Tables 6.3 through 6.6.
The results listed in Table 6.3 are obtained by decreasing the intrinsic propagation delays of the MUX21 and MUX81 cells by 5%, to 0.2209 ns and 0.7838 ns respectively, with all other delay values held constant. The results listed in Table 6.4 are obtained by decreasing the intrinsic propagation delay of the ADDF cell by 5%, to 0.5068 ns, with all other delay values held constant.
The results listed in Table 6.5 are obtained by decreasing the intrinsic propagation delay of the XOR2 cell by 5%, to 0.3035 ns, with all other delay values held constant. The results listed in Table 6.6 are obtained by decreasing the intrinsic propagation delay of the AND2 cell by 5%, to 0.171 ns, with all other delay values held constant. The performance of the radix-8 Booth multiplier with these varied delay values is shown in Figure 6.13. Its results are listed in Table 6.7.
[Figure 6.13: Radix-8 Booth Multiplier Performance with 5% Delay Reductions — delay in ns versus number of bits, 2–10]

As shown in Table 6.7, the intrinsic propagation delay of the ADDF cell has the greatest effect on the performance of the radix-8 Booth multiplier; a 5% reduction in ADDF propagation delay improves the performance of the 6-bit radix-8 Booth multiplier by 4.61%. The performance improvements due to the reductions in propagation delay of the MUX cells, the XOR2 cell, and the AND2 cell are 0.22%, 0.17%, and 0%, respectively. The reduction in the propagation delay of the AND2 cell produced no change in the performance of the 6-bit multiplier.
Chapter 7 Conclusions The main objective of this thesis was to derive the fastest 6-bit multiplier for use in FIR equalization in read channel ASICs. To accomplish this goal, the basics behind all multipliers and their associated summing schemes were explored, and the inefficiencies of current multiplier architectures were identified. The selection of the most suitable multiplier was then achieved through the exploration of partial product reduction architectures as well as partial product summing architectures. The first task, partial product reduction, was demonstrated through several widely used recoding schemes such as the Booth, modified-Booth, etc. The detailed theory of operation, benefits, and disadvantages of these recoding schemes were discussed in Chapter 4. The second task, accelerating the summation of partial products, was investigated through the various addition architectures presented in Chapter 3. The proposed multiplier solution was shown to be competitive in both speed and power consumption as compared to the other two designs, especially when n (the number of bits) is high. It has been argued that the delay associated with the generation of multiples greater than two times the multiplicand far outweighs the performance benefits of higher radix-n partial product reduction algorithms. However, this claim is highly dependent on the intended application. Although it is true that there are additional delays introduced by the generation of greater than two times the multiplicand, this overhead is only incurred once, at the first stage, and not at the subsequent stages. Hence, as the number of bits increases, so does the performance difference between the radix-4 and radix-8 multipliers, as demonstrated in Chapter 6. The radix-8 recoding scheme, together with the addition savings of the carry-save adder, can achieve the optimal design requirement of this thesis. This can be verified through timing simulations of a CMOS implementation.
The main emphasis of this thesis was centered on the optimization of the multiplier required by FIR equalization. Perhaps the fastest FIR equalization is to be achieved not by an optimized multiplier design but rather by a multiplierless design. It has been suggested by Samadi, Nishihara, and Fujii, and by Chen, that linear-phase FIR filters can be realized free of multipliers through the use of array filter structures [32], [33].
Appendix A. Multiplier Performance Predictor in Matlab (multiply.m)

%#################################################################
%# multiply.m                                                   ##
%# By: Michael H. Chen                                          ##
%# Date: August 5, 1998                                         ##
%# Description: This program computes and plots the estimated   ##
%#              propagation delay associated with an n-bit      ##
%#              multiplier implemented in three multiplier      ##
%#              algorithms: shift-add, modified Booth (Radix-4),##
%#              and Radix-8.                                    ##
%#################################################################

%#######################################################
%# Gate Delays Per IBM's ASIC CMOS 5S (0.36um) Process #
%#######################################################
%#############################################
%# Constants (All values are in nanoseconds) #
%#############################################
AND2  = 0.18;    % 2-Input AND gate
XOR2  = 0.3195;  % 2-Input XOR gate
ADDF  = 0.5335;  % 1-Bit Full Adder
MUX21 = 0.2325;  % 2:1 Multiplexer
MUX81 = 0.825;   % 8:1 Multiplexer
SHIFT = 0.26;    % Fixed shifter
N     = 128;     % Number of Bits

%###########################################################
%# Propagation Delay Computations for Shift-Add Multiplier #
%###########################################################
i = 1;
for N_SA = 1:N
    % Number of stages required for the Shift-Add implementation
    S_SA = N_SA;
    % Propagation delay computation
    D_SA = (N_SA*ADDF) + MUX21 + AND2 + (S_SA*(ADDF*2*S_SA));
    % Construct an array of number of bits versus delay
    P_SA(i,1:2) = [N_SA D_SA];
    i = i + 1;
end

%#########################################################
%# Propagation Delay Computations for Radix-4 Multiplier #
%#########################################################
i = 1;
for N_R4 = 2:N
    % Number of stages required for the Radix-4 implementation
    S_R4 = ceil(N_R4/2);
    % Propagation delay computation
    D_R4 = XOR2 + MUX21 + (S_R4*((2*N_R4*ADDF) + XOR2));
    % Construct an array of number of bits versus delay
    P_R4(i,1:2) = [N_R4 D_R4];
    i = i + 1;
end

%#########################################################
%# Propagation Delay Computations for Radix-8 Multiplier #
%#########################################################
i = 1;
for N_R8 = 3:N
    % Number of stages required for the Radix-8 implementation
    S_R8 = ceil(N_R8/3);
    % Propagation delay computation
    D_R8 = SHIFT + ((2+N_R8)*ADDF) + MUX81 + S_R8*(XOR2 + (2*N_R8*ADDF));
    % Construct an array of number of bits versus delay
    P_R8(i,1:2) = [N_R8 D_R8];
    i = i + 1;
end

%################################
%# Performance Comparison Graph #
%################################
% Plots the performance of the Shift-Add Multiplier
plot(P_SA(:,1), P_SA(:,2), 'r')
hold on
% Plots the performance of the Modified-Booth Multiplier
plot(P_R4(:,1), P_R4(:,2), 'g')
% Plots the performance of the Radix-8 Multiplier
plot(P_R8(:,1), P_R8(:,2), 'b')
xlabel('Number of Bits');
ylabel('Delay (ns)');
title('Multiplier Performance Comparison');
% Turns "ZOOM" option on
zoom on;
References

[1] SSI. June 1997. http://www.ti.com/sc/docs/storage/news/pce pr4.htm.

[2] SSI. Datasheet. Mar. 1998. http://www.ti.com/sc/docs/storage/products/data/41 0 1 a.pdf.

[3] Ozue, Tadashi, Watanabe, Yukio, Kohno, Takanobu, Hirose, Toshihido, Koseki, Kazuya. Development of High Density Recording for Advanced Tape Streamer. Vol. 32, Sep 1996, pp. 3881-3883.

[4] Hirasaka, H., Tanaka, A., Nomura, K., Fukuda, S. An Adaptive Equalizer for the Partial Response Class I Channel Used by DDS3. Vol. 32, Sep 1996, pp. 3977-3979.

[5] Mallinson, John C. Academic Press, 1987.

[6] Couch, Leon W. 3rd Ed., Macmillan Publishing, 1990.

[7] Mitra, Sanjit K. McGraw-Hill, 1998.

[8] Rorabaugh, C. Britton. 2nd Ed., McGraw-Hill, 1997.

[9] Choi, Jun Rim, Jeong, Seong Wook, Jang, Lak Hyun, Choi, Jin Ho. Structured Design of a 288-tap FIR Filter by Optimized Partial Product Tree Compression. pp. 79-82.

[10] Wolf, Wayne Hendrix. Prentice Hall, 1994.
[11] Yu, William W. H., Xing, Shanzhen. Performance Evaluation of FPGA Implementation of High-Speed Addition Algorithms. Vol. 2914, Nov. 1996, pp. 26-33.

[12] Bellaouar, Abdellatif, Elmasry, Mohamed I. Kluwer Academic Publishers, 1995.

[13] Koren, Israel. Prentice-Hall, 1993.

[14] Lee, Jeehan. A Synchronous Completion Prediction Adder (SCPA). Vol. E80-A, Mar 1997, pp. 606-609.

[15] Mano, M. Morris. 3rd Ed., Prentice-Hall, 1993.

[16] Weste, Neil H. E. AT&T, 1993.

[17] Smith, Winthrop W. IEEE, 1995.

[18] Parr, E. Clays Ltd, 1993.

[19] Lin, Wen. 2nd Ed., CRC Press, 1990.

[20] Booth, Andrew D. A Signed Binary Multiplication Technique. Vol. IV, Pt. 2, 1951.

[21] Wallace, C. S. A Suggestion for a Fast Multiplier. Vol. EC-13, Feb 1964, pp. 14-17.

[22] Song, Minkyu. Power Optimization for Data Compressors Based on a Window Detector in a 54x54-Bit Multiplier. Vol. E80-C, Jul 1997, pp. 1016-1027.

[23] Fadavi-Ardekani, Jalil. M x N Booth Encoded Multiplier Generator Using Optimized Wallace Trees. Vol. 1, Jun 1993, pp. 120-125.

[24] Cappello, Peter, Steiglitz, Kenneth. A VLSI Layout for a Pipelined Dadda Multiplier. Vol. 1, May 1983, pp. 157-174.
[25] Somasekhar, Dinesh, Visvanathan, V. A 230-MHz Half-Bit Level Pipelined Multiplier Using True Single-Phase Clocking. Vol. 1, Dec 1993, pp. 415-422.

[26] Ghosh, Debabrata. Design and Realization of High-Performance Wave-Pipelined 8x8-b Multiplier in CMOS Technology. Vol. 3, Mar 1995, pp. 36-48.

[27] Song, Paul J., De Micheli, Giovanni. Circuit and Architecture Trade-offs for High-Speed Multiplication. Vol. 26, Sep 1991, pp. 1184-1198.

[28] Dadda, L. Some Schemes for Parallel Multipliers. Vol. 34, May 1965, pp. 349-356.

[29] He, Shousheng, Torkelson, Mats. A Complex Array Multiplier Using Distributed Arithmetic. pp. 71-74.

[30] Baugh, Charles R., Wooley, Bruce A. A Two's Complement Parallel Array Multiplication Algorithm. Vol. C-22, Dec 1973, pp. 1045-1047.

[31] IBM. 5S IBM ASIC Logic Products, 1997.

[32] Chen, Chamg-Kann. A Method for Synthesizing Multiplierless FIR Digital Filters with Narrow Transition Widths. Jun 1997, pp. 351-360.

[33] Samadi, Saed, Nishihara, Akinori, Fujii, Nobuo. Modular Array Structures for Design and Multiplierless Realization of Two-Dimensional Linear Phase FIR Digital Filters. Vol. E80-A, Apr 1997, pp. 722-735.