Online implementation of the fastica algorithm for efficient implementation in hardware

Material Information

Online implementation of the fastica algorithm for efficient implementation in hardware
Casselberry, Rebekah J
Publication Date:
Physical Description:
ix, 41 leaves : ; 28 cm


Subjects / Keywords:
Multivariate analysis ( lcsh )
Principal components analysis ( lcsh )
Computer input-output equipment ( lcsh )
Algorithms ( lcsh )
Algorithms ( fast )
Computer input-output equipment ( fast )
Multivariate analysis ( fast )
Principal components analysis ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaf 41).
Statement of Responsibility:
by Rebekah J. Casselberry.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
318450032 ( OCLC )
LD1193.E54 2008m C37 ( lcc )


This item has the following downloads:

Full Text
Rebekah J. Casselberry B.S., LeTourneau University, 2005
A thesis submitted to the University of Colorado Denver in fulfillment
of the requirements for the degree of Masters of Science Electrical Engineering 2008

This thesis for the Masters of Science degree by
Rebekah J. Casselberry has been approved by

Casselberry, Rebekah J. (M.S., Electrical Engineering)
Online Implementation of the FastICA Algorithm for Efficient Implementation in Hardware
Thesis Directed by Professor Miloje Radenkovic
This thesis presents the theoretical background for the development of different mathematical approaches to Independent Component Analysis (ICA) and details the mathematical derivation of the fixed point algorithm, FastICA introduced to the community by Hyvarinen and Oja. This paper takes a practical approach to the algorithm by examining the performance trades involved in realizing implementations at reasonable computational cost for real-time signal processing systems. Required computations are analyzed, and online algorithms are given for data preprocessing techniques as well as the FastICA implementation. Simulated performance of this algorithm is shown using a set of benchmark speech signals, and is compared to the performance of the batch processing form of the FastICA algorithm. Fundamental implementation design trades are summarized, and key hardware functions and structures are detailed for implementation in custom digital processor circuits, such as ASICs or FPGAs.

This abstract accurately represents the content of the candidates thesis. I recommend its publication.

Miloje Radenkovic

I dedicate this work to my father, the late David Casselberry, and mother, Caroline Casselberry, who have always inspired me to pursue excellence in my work without forsaking the priorities of God and family. I also dedicate this work to Adam Mitchell, whose steadfast support and encouragement made this work possible.

I would like to thank my advisor Dr. Miloje Radenkovic for introducing me to topics in blind signal processing and supplying guidance and support this year as I have prepared this thesis.
I would also like to acknowledge Bob Anderson for heavily influencing the way I explore trade-offs in algorithm development and design to reduce the cost of implementation.

LIST OF FIGURES.................................................ix
1. Introduction...............................................1
2. Theoretical Background.....................................3
2.1 Kurtosis................................................ 4
2.2 Negentropy................................................5
2.3 Mutual Information........................................7
3. Pre-processing Observed Signals for ICA....................9
3.1 Data Centering.............................................9
3.2 Data Whitening...........................................10
4. Development of the FastICA Algorithm......................14
4.1 Algorithm for a Single Processing Unit...................14
4.2 Decorrelation of Single Processing Units
for Multi-Unit ICA.................................... 17
5. Simulated Algorithm Performance...........................19
5.1 Adaptive Whitening...................................... 19
5.2 FastICA................................................ 19
6. Implementation Structures.................................23
6.1 Fundamental Operations................................. 25
6.1.1 Multiplication...........................................25
6.1.2 Squaring.................................................28
6.2 Moving Average Filters...................................29
6.3 CORDIC Algorithms........................................31
- vii -

7. Conclusion...................................................37
Appendix A: Simulation Code...................................38
- viii -

Figure 1-1 - Block Diagram of an ICA System........................1
Figure 3-1 - Block Diagram of an ICA System with Preprocessing.....9
Figure 5-1 - Unprocessed Sensor Signals (X).......................21
Figure 5-2 - Original Sources (S).................................21
Figure 5-3 - Estimated Sources (Y), Real Time Algorithm...........22
Figure 5-4 - Estimated Sources (Y), Batch Algorithm...............22
Figure 6-1 - 8x8 Unsigned Multiplication Operation................26
Figure 6-2 - An n-Element Moving Average Structure...............30
Figure 6-3 - Generalized IIR for Exponential Smoothing............30
- ix -

1. Introduction
As the worldwide communications environment has developed in the last several decades, concerted efforts have been made to discover ways to recover signals of interest from very dense signal environments while avoiding channel capacity overhead (e.g., the use of a training sequences). Approaches that function independently of training sequences are known as blind processing techniques. Independent Component Analysis (ICA) is a blind approach to extracting/separating signals from their environment (other signals plus noise).
The basic ICA problem can be understood by examining the following block diagram:
-sjk) -s2(k)
yjk) ~ s,(k)>
Figure 1-1 Block Diagram of an ICA System
Assuming that both the source vector s and the characteristics of mixing system A are unknown, it is desired to construct a learning
-1 -

algorithm to update the characteristics of our estimated de-mixing system W to operate on the observation vector x, driven only by the characteristics of the estimated source vector y. Thus, ICA is a blind approach to extracting/separating signals. Mathematically, this processing system can be described as follows:
x = As , (1)
where A is an unknown mixing system of dimensions m by n, s = [sj(k),S2(k),.. .,sn(k)]T where each s(k) is a vector row of original source samples over time, and x = [xi(k),X2(k),... ,xm(k)] where each x(k) is a vector row of observed sensor samples over time. Therefore, it follows that
s = y = Wx , (2)
where W is the adaptive demixing system designed to converge to A1, and y = [yi(k),y2(k),...,yn(k)]T.

2. Theoretical Background
In principle, ICA algorithms are based on the statistical property known as the Central Limit Theorem. Fundamentally, the Central Limit Theorem states that as statistically independent1 2 sources are mixed, the distribution of their mixture will approach a Gaussian distribution, regardless of the original distributions of the contributing sources. Qualitatively then, it follows that the success of an attempt to separate a mixture into its original sources could be measured by the degree of non-Gaussianity observed in the recovered source estimates themselves, provided that the original sources are not Gaussian processes In other words, as the estimated sources are observed to be less Gaussian, it follows that they are less mixed with each other.
Learning algorithms for ICA work like many other adaptive learning algorithms: they identify a cost function to be minimized or maximized to produce the desired results. As a direct consequence of the previous discussion, many cost functions for ICA quantify the Gaussianity of the estimated source vector y and are then minimized.
1 Full, statistical independence is required for ICA; lack of statistical correlation is not sufficient. That is: p(si,s2,...,s) = p(si)p(s2),...,p(sn).
2 This is, in fact, an assumption that must be valid for ICA to be successful. More exactly, it is required that no more than one source of interest in a mixture may be Gaussian. This Gaussian source of interest may actually be comprised of multiple Gaussian sources, so long as the algorithm is allowed to treat the Gaussian mixture as a single source to be recovered. In this situation, sequential BSE techniques using deflation are often most effective.

This approach attempts to converge on one of two local minima in the Gaussianity space for each separated source yi, namely, Sj. Thus, the estimated sources produced by an ICA algorithm contain ambiguity of order (sequence), and ambiguity of sign. Additionally, since both s and A are unknown, it is impossible to determine the energy (variances) of each original source signal. Thus, the estimated sources will also be ambiguous in scale.
In the following sections, we will discuss common measures of Gaussianity and examine their benefits and drawbacks as cost functions for ICA learning algorithms.
2.1 Kurtosis
Kurtosis is commonly defined as the fourth cumulant divided by the square of the second cumulant. Mathematically, kurtosis is defined as
fcwrr(y) = £'{y4}-3(£{y2})2. (3)
Tangibly, kurtosis is a measure of how spiky the distribution of a random variable (RV) is. For a Gaussian RV, the kurtosis evaluates to zero. The normal distribution clearly fits this category, and is often referred to as mesokurtic. A negative value of kurtosis indicates that a distribution is platykurtic, sometimes called sub-Gaussian. The uniform distribution and raised-cosine distribution are common examples of platykurtic RVs. Finally, a positive value of kurtosis indicates that a distribution is leptokurtic, or super-Gaussian. The

Laplace distribution and hyperbolic secant distribution are common examples of leptokurtic RVs. Therefore, kurtosis is useful as a cost function for ICA when its absolute value is maximized.
Due to unavoidable ambiguity of scale in the recovered signals, we make a key assumption for ICA processing that the estimated signal vectors contain unit energy, that is, E{yi2} = 1. In fact, preprocessing is usually performed on the observation vectors to ensure that the observed signals are of unit variance as well (described in section 3.2). Therefore, for our applications, we can define kurtosis mathematically as
kurt(y) = £'{y4}-3. (4)
The primary difficulty encountered when using kurtosis as a cost function for ICA is that it requires a very large number of samples to obtain an adequate empirical estimate of its value for a given signal. Empirical estimation techniques are quite sensitive to statistical outliers in the estimated sources, and therefore tend to degrade severely in the presence of noise.
2.2 Negentropy
The quantity known as negentropy can be understood as a modified version of differential entropy as presented in classical information theory. Mathematically, negentropy is defined as

J(y) = H{v)-H(y),
H(y) = j f(y)^og(f(y))dy,
where H(y) is the definition of differential entropy in terms of the probability distribution function (PDF) of a signal y and v is a continuous Gaussian RV with the same variance as each continuous RV y (such as a single component of the estimated source vector y).
Tangibly, entropy is a measure of the randomness of a random variable. Said another way, the more unstructured (or more Gaussian) an RVs distribution f(y), -the greater its entropy. Negentropy is a quantity that references the amount of entropy in an observed RV to the amount of entropy in a reference Gaussian RV with identical variance. Specifically, negentropy is always non-negative, and yields larger values as the PDF of the RV in question becomes less Gaussian in nature. Thus, negentropy is useful as an ICA cost function when its value is maximized.
Despite its theoretical promise, negentropy is not a directly useful cost function for ICA because it requires a-priori knowledge of each f(yO, that is, the PDF of each estimated unknown source. For this reason, approximations of negentropy have been developed that do not require knowledge of an RVs PDF. Additionally, care has been taken to avoid the use of kurtosis in these approximations due to the noise sensitivities discussed previously. In [1], Hyvarinen suggests the following approximation:

where G(y) is a non-quadratic function of the estimated sources, and v is a Gaussian RV of the same mean and variance as the estimated variable y, assumed to be zero and one, respectively. Exemplary functions often used for G are
Gy(u) = -exp(-M2 /2),
G2(m) = log (cosh (m))
The functions in (7) lend themselves nicely to implementation in real time systems, as they can be evaluated by applying the hyperbolic CORDIC method. This will be covered in detail in section 6.3.
2.3 Mutual Information
Again, drawing upon classical information theory, mutual information is a quantity that describes statistical dependence between two RVs. Specifically, it is defined as
Ky^y 2>-, 30

where H(yO is the differential entropy in a single estimated source, and H(y) is the entropy in the vector of estimated sources.
In essence, the quantity calculated in (8) describes the amount of common information in the individual estimated source vectors, yi. Clearly, this quantity would be a useful cost function for ICA if it is

minimized. However, using principles of information theory (as developed in [1]), mutual information is directly related to negentropy as follows:
/O,. >2...y,) = c-~ZJ O,). (9)
where C is an arbitrary constant. Thus, the use of cost functions derived from mutual information is equivalent to the use of cost functions derived from negentropy.

3. Pre-processing Observed Signals for ICA
In order to reduce the complexify of ICA algorithm implementations, we often find it useful to pre-process the observed (mixed) signal vectors before performing ICA. This pre-processing often involves statistical centering and whitening. From here forward, the pre-processed observed signal vector will be referred to as z. Thus, our amended system block diagram is:
-yfk) ~ s,(k)-
Figure 3-1 Block Diagram of an ICA System with Preprocessing
3.1 Data Centering
The process of centering simply removes any DC bias present in the observed signals so that their mean value is moved to zero. Mathematically,
r = x-£{x}. (10)
If recovery of the mean value of the estimated sources is important in a given application, the mean value can be added to the estimated sources after ICA processing has been completed.

In terms of practical implementation, if a signals DC bias is unknown, it can be estimated with an averaging filter. Suggested implementations are discussed in section 6.2.
3.2 Data Whitening
Many receiving systems are designed with the assumption that the interfering noise environment will be white; that is, will have a variance of unity. Especially in MIMO systems or in systems where the effects of multipath are prevalent, this is not a valid assumption. In order to avoid dealing with the complexity of colored environments in the receiving algorithm, it is often desirable to find a linear system Q that will force the variance of the centered received signal(s) to unity, or whiten the data. That is, for a zero-mean received signal r:
z Qr (11)
e\zzr}= I (12)
In general, whitening is a process that uses eigenvalue decomposition techniques to linearly transform the observation vector x so that each transformed observation is decorrelated and of unit variance. Mathematically this whitening system can be described as
Q = A1/2Er , (13)
where A is a diagonal matrix of eigenvalues of E{rrT}, and E is the orthogonal matrix of eigenvectors of E{rrT}. Since whitening is a linear transformation, we can combine the mixing matrix A and the

whitening matrix A~1/2ET to define a new system A directly relating s and z. In addition to whitening, Principle Component Analysis techniques may be used to eliminate eigenvalues of A that are extremely small, thus reducing the dimension of z. This is common practice in ICA systems where the number of sensor signals available is greater than the number of targeted source signals. Dimension reduction controls the number of extracted sources, and thus reduces the computational overhead required for ICA processing.
But while whitened data simplifies the formulation of receiving algorithms, including ICA, it is not trivial to implement. The whitening matrix given in (13) relies on batch eigenvalue decomposition, and is straightforward to implement in processing systems where computation time is not limited to real-time (that is, the actual signal sampling period). Thus, for application in real-time systems, it is desirable to derive an algorithm to calculate the whitening matrix Q adaptively. The development of such a whitening filter is presented in [2], and is summarized here in (14) through (22).
Consider the set of real-valued, zero-mean, colored sensor signals r with covariance matrix K:
2s{rrr} = K . (14)
We can thus rewrite the variance given in (12) and set it equal to the identity matrix:
-11 -

£{zzr}=QKQr =1.
By observation, we note that a solution to (15) is
q=7T (16)
Using the spectral theorem representation of K, it can be shown that the unique positive definite solution to (15) is
q = S^r''2v,v", d7)
where A,j is the ith eigenvalue of the covariance matrix K, and Vi is the i* eigenvector of K [2].
To develop an iterative process to find Q, we will first use Newtons method to find the inverse square root of a real number K, using the function
f{x) = x-\-K (18)
which has zeros at k'1/2. Thus, we have the iteration
*i+l = )> (1^)
1 /9
where Xk+i will converge to k This iteration can be extended to find the inverse square root of the covariance matrix K instead of k:
Q,=|Q.(3I-KqO (20)

The form of (20) is desirable for efficient implementation, as it requires only matrix multiplications, additions, and division by 2.
As developed in [2], (20) will converge to the optimal solution given in (17) as long as the initial condition
Qo=aI, (21)
is met, where
a 2~p P =
If the covariance matrix of the observed sources is unknown (as is often the case), the following estimate may be used:
^k = ^k-1 "*"7"rkri (23)
k k
Simulations have shown that it is important to allow time for this estimate of K to converge before initializing the learning algorithm for Q. This ensures that the initial condition set by (21) is accurate enough to support the convergence of (20).

4. Development of the FastICA Algorithm
The FastICA algorithm developed and presented by Hyvarinen in [1] uses Hyvarinens approximation for negentropy as the cost function for estimating the demixing matrix W, and thus the source estimation vector y.
4.1 Algorithm for a Single Processing Unit
The following mathematical description defines a single processing unit, that is, one neuron for ICA. In the following equations, Wj will represent a single vector in W at iteration k such that Wj = Wj(k) = [Wji(k),Wj2(k),...,Wjn(k)]T, and Zk will represent a column of z at discrete sample time k such that Zk = [zi(k), Z2(k),...,zn(k)]T. Note that there is a direct relationship between discrete sample times and processing iterations.
From (6), we can write Hyvarinens approximation for negentropy in terms of the estimated demixing matrix W:
/(y) = 7(w>,) = [£{G(w>,)}-£{G(v)}]2 (24)
where is they* estimated source value at sample time k, and G is one of the non-quadratic functions defined in (7). In order to obtain a simplified cost function for maximizing negentropy with respect to Wj, we note that only the first expectation term is a function of Wj. Thus, with respect to Wj, maximum negentropy is achieved when the following quantity is maximized:

/(w>i) = fc{G(w^z,)}]2 06
Thus, we wish to maximize the cost function/with respect to Wj, given the constraint ||wj||2 = 1. Using the method of Lagrange multipliers for constrained maximization, we have
A(w>,,/l) = E{G(w>i)}-i(|W ,||2-1) (26)
F(Wj)~^- = ^kG'iyf^k)}-2^ j =o = £{z,G'(w^z,)}-/?w =0
where X is the Lagrange multiplier and the constant [3 = 2X.
Now that we have an expression defining the maxima of f(y_j) within our constrained space, we can evaluate (27) using Newtons method. First, we find the Jacobian matrix of function F:
V = (28)
In order to reduce the computational overhead required to perform matrix inversion operations, we wish for the Jacobian matrix to be diagonal if at all possible. Because z is the whitened signal vector (variance is 1), it is reasonable to adopt the following approximation [3]:
-15 -

kzk G"(y?Tji j)\= E\ikzkT \E\p"{vtTjzk)}
= lE{G"(wTjzk)}
By substituting (29) for the first term in (28), we have VW/F(w,) = I£{G'(w>,)}-/a
= [s{g>>,)}-a]I.
Finally, by Newtons method, we obtain the expression for updating Wj at each iteration k:
= w,-v;;f(w,)
_E{zG'(v/TjZk)}- fivr j
y/j e{g\w>4)}-y?
= £{ztG'(w^)}-£{cr(w^zt
To preserve the constraint ||wj||2 = 1, we normalize this value before updating wj:
w; =
Equations (31) and (32) give us the FastICA algorithm in final form for a single processing unit. Algorithm convergence is achieved when consecutive evaluations of Wj point in the same directionthat is, their dot product approaches one.

4.2 Decorrelation of Single Processing Units for Multi-Unit ICA
This single unit definition of FastICA is easily extended to multiple processing units that perform the same mathematical operations on each observation vector in parallel. However, it is important to decorrelate the row components of the demixing matrix W (the wjs) before they are fed back to the update process for the next update iteration. This prevents any two units (components of W) from converging to the same maxima. This can be accomplished using the standard definition of symmetric decorrelation:
Wt=(W,W;r1/2W, (33)
Alternatively, Hyvarinen has proposed an iterative algorithm for decorrelation as
W0 =
where Wk is formed from the updated row vectors calculated in (32)
- 17-

Once each component of W has converged within acceptable tolerance, we may use it to process the observation vector x to estimate the original source vector at iteration k:
y* =s* =Wjz*. (35)

5. Simulated Algorithm Performance
Simulations were performed using a benchmark set of speech signals to demonstrate the effectiveness of the adaptive algorithms presented in this work. Summarized performance is shown for the adaptive whitening filter as defined in section 3.2 and the FastICA algorithm as defined in chapter 4.
5.1 Adaptive Whitening
In practice, it is necessary to update the whitening filter system only up to the point of convergence. In the simulations run for this performance assessment, if the cumulative change in the elements of Q was less than 1/2 the filter update process was stopped, and the last calculated value of Q was then used to create the whitened input to the FastICA process. In the set of test runs simulated, the average convergence time for the whitening filter was 12 iterations. It was observed that, for filter stability, enough time must be allotted for the empirical covariance estimate to be accurate enough for Qo to lie within the limits required for convergence.
5.2 FastICA
Again, it is often not necessary to update W once the individual components have converged. For these simulations, once the dot product of two successive outputs from each processing unit was equal to one, the final value of W was used as the demixing system.

The following figures show the simulated performance of an implemented FastICA system. Figure 5-1 shows the raw sensor signals (uncentered, unwhitened, and mixed) as seen by the receiving system. Figure 5-2 shows the original sources (uncentered, unwhitened, and unmixed). Finally, Figure 5-3 shows the estimated sources recovered from the raw sensor sources using the real time preprocessing and FastICA algorithms described in this work. Figure 5-4 shows the benchmark performance of the batch-processing FastICA algorithm, which works by examining the statistical properties of the entire block of sensor samples to refine the estimate of W, rather than working sample-by-sample with the observed sensor data. Clearly, performance is gained by the extra statistical accuracy of the batch processing algorithm.
While the performance of the real time algorithm is degraded from the optimized batch algorithm, it is visually apparent that the estimated source vector Y resembles the original source vector X. Note that, in this case, the estimated sources are shown in the same order and sign as the original sources. This is not always the case due to ambiguity of sequence and sign.

Mixed signals
Figure 5-1 Unprocessed Sensor Signals (X)
Figure 5-2 Original Sources (S)
-21 -

Estimates Sources (Independent Components)
yi 0

v2 o
2000 2500 3000 3500
0 500 1000 1500 2000 2500 3000 35i
2000 2500 3000 3500
y4 0

0 500 1000 1500 2000 2500 3000 3500

0 500 1000 1500 2000 2500 3000 3500
Figure 5-3 Estimated Sources (Y), Real Time Algorithm
Estimated sources (independent components)
1 1

.5 I____________________I____________________I_____________________I____________________I____________________1____________________I_____________________
Figure 5-4 Estimated Sources (Y), Batch Algorithm

6. Implementation Structures
To choose an architecture for implementing the presented algorithm, key trade-offs must be considered in the context of system requirements.
If the algorithm is to be executed on a processor based system, the designer must ensure that the number of instructions that must be executed to process each sample in the observation array be completed within a sampling period Tsthat is, the time between x(k) and x(k+l). This requires a careful examination of the clocking rate and instruction set of the specific processor architecture. If not enough clock cycles exist within the period Ts to complete a single iteration of ICA processing, data decimation must be considered for the systems observed signal streams, or the number of observed signals must be reduced to execute the algorithm on that hardware platform. If a processor is being utilized at or near capacity, it is also important to ensure that there is sufficient power and thermal dissipation available to protect the hardware platform.
If the algorithm is to be executed using a custom digital hardware platform (such as an FPGA), the designer must determine the number of parallel computational units that will be required to function within the sampling period. Unlike a processor-based system, parallelization is clearly possible up to the point of maximum hardware resource utilization in the target hardware device. In this case, a trade space

exists between hardware utilization (parallelization), target device clocking rate, sampling period, and the number of observed signals. Assuming that the target device is clocked at a rate faster than the incoming data stream, it may be possible to share hardware computational units to process multiple values. However, the designer must carefully consider the impact of hardware sharing and fast clocking on system power and thermal dissipation requirements.
It is helpful here to consider a concrete example. In order to multiply the n-signal zero-mean observation vector by the adaptive whitening system Qk, n multiplications are required. In a system where n = 3, one could accomplish this with a single multiplier if the bandwidth of the signals in r allow a decimated sampling ratefsr such that the device clocking rate is > 9fsr. This clearly reduces the hardware footprint of this matrix operation3. However, the implemented multiplier circuit will now be actively consuming power for a much greater portion of the sample period. Depending on how this period of power consumption lines up with the rest of the computational elements in the design, it may be desired to accomplish the matrix multiplication operation in less wall-clock time by parallelizing hardware and controlling where it occurs within a sampling period to evenly distribute power consumption. These
3 The hardware savings of such a design choice is not the full factor of 9 that one might expect. Extra hardware logic is required to arbitrate the use of the single multiplier resource, which adds a small amount of overhead.

complex trades are very common in VLSI design, and must be carefully studied in selecting an architecture for algorithm implementation.
The sections in this chapter will deal primarily with different digital structures that can be used to implement the mathematical operations for the FastICA algorithm in the context of VLSI design.
6.1 Fundamental Operations
Depending on the target device for a VLSI design, on-chip multiplier resources may be limited or unavailable. In this case, the designer may be required to build these computational units. If the electrical design is being defined in a hardware descriptive language such as VHDL or Verilog, the synthesis tools used to translate the coded hardware models into schematic designs are sophisticated enough to generate multiplier lattice structures. However, it may be necessary for the designer to build multipliers from scratch if the clocking rate of the design is high or precise control over power consumption is required. For this reason, a basic treatment is given to the design of multiplication and squaring units.
6.1.1 Multiplication
The most straightforward way to consider the implementation of a multiplier is to carefully observe a long-hand multiplication operation:

11010111 x 11101010
Figure 6-1 8x8 Unsigned Multiplication Operation
First, we notice that an 8x8 unsigned multiplication operation results in eight partial products, each of which requires eight two-bit AND gates to generate. In a fully pipelined multiplier architecture4, all of these partial products must be computed in a single clock cycle. Thus, each m by n multiplier requires m-n two-bit AND gates. Once these partial products are obtained, they are combined using an adder tree as represented above. Depending on the depth of the adder tree and the speed of the target hardware device, the propagation delay through the logic elements in this adder tree are often greater than a single clocking period. Thus, registers may need to be added between adder stages to ensure that the design meets timing constraints with adequate margin to avoid metastability and race conditions in the circuit.
4 That is, where the multiplier is capable of accepting an input and producing an output on each activating clock edge.

The example shown above depicts an unsigned multiplication operation. To handle twos complement signed multiplication, extra logic is required to correctly handle the product sign. We will now examine the Baugh-Wooley algorithm, an algorithm for twos complement multiplication that minimizes the extra logic required.
For a twos complement multiplier a of bit-length m and a twos complement multiplicand b of bit length n, the product may be generated as [5]:
P = am .bn ,2m .b,2
m-l n-i m1 /
\ i+m-l
m-2 n-2
+2>A-,2i" +EZaA 2I,J
;=0 i=0 ;'=0
+ 2n_1 +2m_1 + 2m+n~l
where am.j is the most significant bit (MSB) of a, where ao is the least significant bit (LSB), etc. [5]. It can be clearly seen that the term
m-2 n-2
i=0 ;=0
in (36) generates the bits of the partial products as depicted in Figure 6-1 that do not involve the MSB of the multiplicand or multiplier. The other two summation terms in (36) invert the results of the lower bit AND operations involving these MSBs before they are added to the product sum. Finally, the remaining terms manipulate the product to

ensure correctness of sign and magnitude in the resulting twos complement product.
6.1.2 Squaring
Squaring operations are obviously similar to multiplication operations conceptually. However, we can exploit the fact that multiplier and multiplicand are identical in a squaring operation and reduce the logic required to implement these operations. Equations (38) and (39) are specific to hybrid unsigned and twos complement squaring units and are developed in [4], If a design must only work with unsigned or negative numbers, the logic required can be further reduced as shown in [4],
For a given number A with an even bit-width n and sign (sgn), A can be defined as follows:
n2 i-2
where a.i is the MSB of a, and ao is the LSB of a.

Similarly, for a given number A with an odd bit-width n and sign
The actual hardware/area savings of this squaring algorithm will depend on the target device. On a Xilinx Virtex device clocked at 50 MHz, a 12x12 twos compliment hybrid multiplier as defined in (36) utilized 168 look up table (LUT) resources (also called function generators) and 208 flip flops (FFs), while a 12x12 hybrid squaring unit as defined in (38) utilizes 143 LUTs and 99 FFs.
6.2 Moving Average Filters
A moving average is a straight forward computation in hardware because its recursive form only requires two carry adder units and a FIFO for the averaging space. If the averaging span is chosen as a power of two, the final division operation can be implemented with a simple bit shift. For instance, to implement a moving average filter over n samples, where n is a power of two, the following structure can be used:
(sgn), A can be defined as:
+an-iao2" +an-iao sgn2"+1
n-2 , , n2 i-2
;=2 7=0

Figure 6-2 An n-Element Moving Average Structure
If system memory is a concern and an exact moving average is not required, a similar effect can be achieved using exponential smoothing techniques in a basic HR filter structure. The following diagram shows a generalized smoothing structure, where ai and (X2 are static or dynamic values chosen to control the weighting of new data values and to scale the desired output. If ai and 0*2 can be chosen to be powers of two, they can be implemented with simple bit shifts in hardware, making the smoothing process computationally inexpensive.
Figure 6-3 Generalized HR for Exponential Smoothing

6.3 CORDIC Algorithms
CORDIC (Coordinate Rotation Digital Computer) algorithms are frequently implemented in digital systems to allow the rotation of vectors with simple shift-add operations. One of the most frequent applications of CORDIC algorithms is the evaluation of trigonometric functions, as they can be represented as vector rotations on the unit circle.
For application in the FastICA algorithm as presented, the two functions given in (7) and their derivatives can be expressed as a function of hyperbolic trig functions, namely:
Since only the first and second derivatives are required calculations for updating the demixing matrix W, we will focus here on CORDIC methods for evaluating hyperbolic sine and hyperbolic cosine. With these quantities available, it is straightforward to compute the derivatives of Gi and G2.
G2(u) = In (cosh (w)) G'2(u) = tanh(w) G2(u) = (cosh( w))2
-31 -

In principle, CORDIC algorithms work by rotating a vector by a series of fixed angles that are specifically chosen such that
\an{(p)=2~k (42)
where k is an iteration counter. Any arbitrary angle could thus be represented as an additive series of these angles, with precision dependent on the number of iterations performed.
CORDIC algorithms are derived from the mathematical expression rotating a vector (jc,y) by an angle (p:
X = xcos(^)- ysin(07) y' = y cos( (,p)+ x sin( (p).
Factoring cos(cp) out of the expressions above allows us to write (43) in terms of tan(cp), which we have defined to be 2'k. Thus, we can define the rotated jc and y vector coordinates as [6]:
x K x y( 2 k) y'=K\y + x(2-k}
K = cos((p) =
Jl + tan2 (cp)
If vector amplitude is critical in an application, a final multiplication by the cumulative K (the cumulative product of the ICs as defined in (44)) is required. Otherwise, this scaling quantity can be treated as part of a systems over-all gain. As the number of iterations

approaches infinity, the cumulative gain of the CORDIC algorithm approaches 1.647.
After each iteration rotating (x,y), it is also necessary to accumulate the angle of rotation, which we will call z. Since the angles of rotation are fixed and known, z can be written as:
In practice, a look up table is implemented in hardware to associate an angle value (accumulated in z) with a shift value (used to calculate jc and y). This simple hardware structure then enables the iterative CORDIC operation using only shift and add operations.
For each iteration of the CORDIC algorithm, it is necessary to chose the sign of the rotation angle. This decision can be made with reference to two possible quantities: Zk or y*- When the rotation sign is determined by the goal of making Zk go to zero, the algorithm is said to be operating in rotation mode. Alternatively, when the rotation sign is determined by the goal of making yk go to zero, the algorithm is said to be operating in vectoring mode. We will see later that, depending on the mode of operation, different functions can be evaluated by the CORDIC algorithm. Thus, we can write the iterative CORDIC update process in a final form:

**+i =xk-ykdk2~k
yk+1 = yk+xkdk2~k
Z*+1 =Zk-dk tan-1 (2-")
where dk = -1 if z* < 0 (rotation mode) or yk < 0 (vectoring mode), dk = 1 otherwise.
To summarize the results, for an input angle zo and input starting vector (xo,yo) in a circular coordinate system, the CORDIC algorithm will produce the following after k iterations operating in rotation mode
Likewise, for an input angle zo and input starting vector (xo,yo) in a circular coordinate system, the CORDIC algorithm will produce the following after k iterations operating in vectoring mode [7]:
**. = At [*0 cos Z0 ^0 sin Zo ]
yk = Ak bo cos Zo + *0 sin Z0 ]
z* ^ 0
Ak = rT Vi + 221 > 1.647
yk 0
zk = z0 +tan-,(y0/x0)
Ak =Yl^l + 2~2i > 1.647

Thus far, we have seen how the CORDIC algorithm behaves working in a circular coordinate system. It can be shown that the same iterative process works in both the linear and hyperbolic coordinate systems as well. Specifically, if the angle accumulator shown in (46) accumulates an inverse hyperbolic tangent instead of an inverse tangent, the CORDIC algorithm will behave as follows in rotation mode and vectoring mode, respectively [7]:
xk = Ak ixo cosh + y0 sinh *o ]
yk = Ak bocosh z0 + *osinh ]
-> o (49)
Ak = 1 + 2'2' -> 0.8
xk = Ak Vxo y0
zk = z0 + tanh_1(y0/x0) (50)
Ak =P[Vl + 2~2i ->6.8
Therefore, we see that the function Gi from (40) can be evaluated by a single hyperbolic CORDIC function in rotation mode as follows:
Gj(m) = sinh (-w2 /2)-cosh (- u212)
= -7"**h> =-hy0=hzQ=-u2 n A,

Since Gi is an exponential, the derivatives required for the FastICA algorithm are calculated (51) using simple multiplication.
Likewise, the function G2 can be evaluated using a single hyperbolic CORDIC function in rotation mode given inputs of xo = 0, y0= 1, and zo = u:
G'2(u) = tanh(w)
smh( u) xk ,
rrT= K> =0>3;o =1^o = M
cosh() yk
G"(u) = (cosh()) 2

k =>>;o =1 ,z0=u
Note that since the gain Ak approaches 0.8 as the number of CORDIC iterations becomes large, it is reasonable to replace (A*)1 in (51) and (52) with the scaling factor 1.25, which can also be implemented in hardware as a simple shift and add operation.

7. Conclusion
Using the techniques presented in this paper, we see that ICA algorithms can be implemented at a reasonable computational cost. Using empirical estimates and adaptive algorithms in ICA processing does cause the performance of the algorithm to degrade. However, these degradations may be acceptable for specific applications, especially in situations where batch processing of sensor data is not feasible.

Appendix A: Simulation Code
function W_final = user_alg7( X ) ;
% Input:
% X Observed signal vector, observed signals in rows % It is assumed that there is one observed signal per IC for
% this implementation (dimension reduction is not performed).
% Output:
% W Separation matrix
% Returned matrix incorporates whitening system
% References:
% [1] Independent Component Analysis: Algorithms and Applications
% Aapo Hyvarinen and Erkki Oja
% [2] An Iterative Algorithm for Computing a Spatial Whitening Filter.
% Venkatesan, Mailaender, Salz. 2004 IEEE 5th Workshop on
% Signal Processing Advances in Wireless Communications.
% [3] Independent Component Analysis
% Aapo Hyvarinen, Juha Karhunen, and Erikki Oja. John Wiley & .
% Sons, Inc. 2001.
% Control Variables % meanConverge = 500; covConverge = 3000; mu = 1/128;
% Set some dimension constants [numSensors,numSamples] = size(X);
% Variable declarations: Preallocate for speed Xcentered = zeros(size(X));
Xmean = zeros(size(X));
Xwhite = zeros (size (X)) ,-Q = zeros(numSensors,numSensors,100);
C = zeros(numSensors,numSensors,numSamples);
W = zeros(numSensors,numSensors,2000); y = zeros(numSensors,2000);
Y = size(X);
Argl = zeros(size(X));
Arg2 = zeros(numSamples,1);
% Loop to compute the running mean of X for k=l:numSamples
Xmean(:,k) = MovingVectorAve(X,k);
Xcentered = X Xmean;

% Perform iterations to determine initial covariance matrix estimate C(:,:,l) = zeros(numSensors,numSensors) ; for k=2:numSamples
C(:,:,k) = (k-l)/k C(:,:,k-1)+
(1/k)*Xcentered(:,k)*Xcentered( : k) ;
% Compute whitening matrix for X
% Initial conditions for adaptive whitening filter presented in [2] : COV = C(:,:,covConverge); p = ceil(0.5*log(trace(COV))/log(2)); a = 2 ^ ~p;
Q(:,:,l) = a*eye(numSensors); for k=2:numSamples
Q(:,:,k) = 0.5*Q(:,:,k-1)*(3*eye(numSensors)-COV*Q(:,:,k-1)~2); iff k < numSamples covConverge )
COV = C(:, :,covConverge + k-1);
iff sum(sum(Q(:,:,k)-Q(:,:,k-1))) < (1/(2A32)) ) disp('Whitening Filter converged!');
QConvCt = k % Print this break;
Xwhite = Q(:,:,QConvCt)*Xcentered;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Main FastICA Loop (algorithm presented in [1]
% Random initial condition for W:
WConvCt = numSamples;
W(:,:,l) = 0.25*eye(numSensors,numSensors);
for k=l;numSamples
% Iterate over the FastICA neurons for j =1:numSensors
w_j = W(j,;,k)'; % jth row of W transposed
y(j,k) = w_j'*Xwhite(:,k); % The kth output of the jth
% neuron
u = y(j,k) ;
Argl(:,k) = Xwhite(:,k)*cordicG(u,1) ;
Arg2(k,l) = cordicGlu,2);
w_jtemp = MovingVectorAve(Argl,k) ...
MovingScalarAve(Arg2(:,1) k)*w_j;
W(j,:,k+1) = (w_jteirp/norm(w_jtenp) ) ;
% Decorrelate the new W if k == 1
W(:,:,k+1) = W{:,:,k+l)/sqrt(norm(W(:,: ,k+l)*W(:, :,k+l) '));
W(:,:,k+1) = 1.5*W( :, :,k+1)- ...
. 0.5*W( :, :, k+1)*W(:, :,k+l) '*W(:, :,k+l);

1.0 Sc&
% Check for separating matrix convergence if( k+1 > 1000 && abs(dot(W(1,:,k+1),w(1,:,k))) == abs(dot(W(2,:,k+l),W(2,:,k))) == 1.0 && ... abs(dot(W(3,:,k+1),W(3,:,k))) == 1.0 ) disp('Unmixing matrix converged!1);
WConvCt = k+1 break;
% Plot the estimated source vector Y
Y = W(:,:,WConvCt)*Xwhite;
xplot = [1inumSanples];
title('Estimated Sources (Independent Components)') for j=l mumSensors
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Return the final demixing matrix to the calling GUI W_final = Q(: :,QConvCt)*W(:, :,WConvCt);

[1] A. Hyvarinen and E. Oja. Independent Component Analysis: Algorithms and Applications. Unpublished ms. , October 2008.
[2] S. Venkatesan, L. Mailaender, J. Salz. An Iterative Algorithm for Computing a Spatial Whitening Filter. 2004 IEEE 5th Workshop on Signal Processing Advances in Wireless Communications, 11-14 July 2004. 338- 342
[3] A. Chichocki and S. Amari. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. West Sussex, England: Wiley, 2002. 130-132; 184-190.
[4] K. Wires, M. Schulte, L. Marquette, and P. Balzola. Combined Unsigned and Twos Complement Squarers. Record of the Thirty-Third Asilomar Conference on Signals, Systems, and Computers, Vol 2, 1999. 1215- 1219
[5] J. Birkner, J. Jian, K. Smith. High Performance Multipliers in QuickLogic FPGAs. Unpublished ms., Mar. 4, 2002.
[6] P. Pirsch. Architectures for Digital Signal Processing. West Sussex, England: Wiley, 1998. 135-148.
[7] R. Andraka. A Survey of CORDIC Algorithms for FPGA Based Computers. Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, 1998. 191-200.
-41 -

Full Text


log ,It,


-slk) xlk) ylk) -slk) xik ) ylk) -sik ) xm(k) Ym(k) slk) ..


m n,


kurt (y) E{y4}_ 3(E{y 2}y.


ley) H(v) H(y), H (y) fey) (y) )dy, Y


ley) [E{G(y)}E{G(V)}]2, G(y) y, I(y .. y" ... y.) H(YJ] H(y),


x l(k) zl(k) Ylk) -slk)-. -slk) xlk) zlk) ylk) slk)-. -sn(k) xm(k) z/k) y,(k) slk) ...


f(x) Xk+l I12




E {C k F:


= k k = ] ] Wj ] E{zO'(w ][P E{O"(w ] E{O"'(W>k)}-P =


k k k k


n2 n 9!s,.


a m n, n-2 P a b 2mn L..J m-2 m-2 n-2 L..J L..J L..J +2m -2mn -am a, ao


n n/2-1( ) ao L a; ai_I +ai a i _2 + ;=n/2+1 + (an/2 an12_ I}2n an/2"t\an a n 2 a n 1 a n 2 a n n-2 n-2 ;-2 EBajpn+j IIaiaji+j + an-l a, ao a.


n n-2 n-2 +"(a! L..J n-! L..J L..J n n


u u 2


-k x' y x' y('2-k)] y' K[y x( 2-k)] K T2k




= Zo (xo,yo) Ak [xo Zo Yo zo] Ak .J1 k Zo (xo,Yo) k -1(yO/XO) .J1 k


X k Ak [xo Zo Yo zo] Yk Ak [Yo Zo Xo zo] Zk Ak .J1 k X k Y; Yk Zk Zo -1(yO/XO) Ak .J1 k (u) u u =-Xk Xo =-l,yo =l,zo =-u /2 Ak


yo Zo O'Yo l,zo u 2 d,zo =u