Citation
Speaker verification with acoustic parameters

Material Information

Title:
Speaker verification with acoustic parameters
Creator:
Matthews, Michael F
Publication Date:
1996
Language:
English
Physical Description:
xii, 103 leaves : illustrations ; 29 cm

Subjects

Subjects / Keywords:
Hearing ( lcsh )
Sound ( lcsh )
Automatic speech recognition ( lcsh )
Speech processing systems ( lcsh )
Automatic speech recognition ( fast )
Hearing ( fast )
Sound ( fast )
Speech processing systems ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 102-103).
General Note:
Submitted in partial fulfillment of the requirements for the degree, Master of Science, Department of Computer Science and Engineering
Statement of Responsibility:
by Michael F. Matthews.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
37357442 ( OCLC )
ocm37357442
Classification:
LD1190.E52 1996m .M38 ( lcc )

Full Text
SPEAKER VERIFICATION WITH ACOUSTIC
PARAMETERS
by
Michael F. Matthews
B.S., University of Colorado at Denver, 1995
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
1996


This thesis for the Master of Science
degree by
Michael F. Matthews
has been approved
by
Jody Paul
Date


Matthews, Michael F. (M.S., Computer Science)
Speaker Verification With Acoustic Parameters
thesis directed by Professor Jody Paul
ABSTRACT
Security in today's society has taken on new interest and importance.
Much emphasis has been placed on securing remote access to proprietary data,
safely conducting commerce on the Internet, and reducing credit card fraud at
point-of-purchase locations. One question has not been fully answered: how
can we verify people are who they say they are? Our current methods of
verification are unfriendly, costly, and unreliable. Speaker verification is a
cost-effective, reliable, and user-friendly technique. Advancing computer
technology has enabled speaker verification to become an effective security
tool.
Much of the previous research in speaker verification has focused on
matching one instance of speech data with another. Because of the amount of
unique information obtained by extracting acoustic parameters from speech,
we can explore alternate methods of using these parameters to improve
speaker verification. Through the use of artificial intelligence approaches such
as fuzzy rules and Bayesian networks, acoustic parameters of speech are used in
a method newly developed for this research. This method, called Adaptive
Forward Planning (AFP), provides a decision-making mechanism with which
speaker verification can be implemented with promising results.
This thesis surveys existing speaker verification technologies and
implementations. It points out shortfalls and proposes how to address them. It
then introduces the concept of Adaptive Forward Planning, and details its
implementation. Finally, experimental results of this implementation are
discussed, and directions for further research are outlined.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Jody Paul


CONTENTS

Figures
Tables
Acknowledgements

CHAPTER
1. INTRODUCTION
Security In Today's Society
Speaker Verification: A Viable Solution
How Can Speaker Verification Be Improved?
Cost
User Interface
Training The System
Speaker Population
Using Acoustic Parameters Of The Speech Signal
Organization Of Thesis

CHAPTER
2. SPEAKER VERIFICATION TECHNOLOGY AND SYSTEMS
Speaker Verification Technologies
Predictive Models
The Hybrid MLP-RBF-Based System
The Self Segmenting Linear Predictor Model
The Neural Prediction Model
The Hidden Control Neural Network Model
Gaussian Mixture Models
Statistical Features For Speaker Verification
Extraction Of Cepstral Coefficients
Speaker Verification Systems
ITT Defense Communications Division (ITTDCD)
AT&T Information Systems
AT&T Bell Laboratories
The ARPA Speech Understanding Project: Continuous-Word Recognition Techniques
Systems Development Corporation (SDC)
Hear What I Mean (HWIM)
Carnegie-Mellon University Hearsay-II
Carnegie-Mellon University Harpy
What Is Missing?

CHAPTER
3. APPROACH
Adaptive Forward Planning
Password Models
The AFP Architecture
Fuzzy Rule Evaluation Of Parameters (FREP)
Bayesian Network For Password Analysis (BANPA)
Phonetic Acoustic Parameter Evaluation Using Fuzzy Rules
Fuzzy Rules For Acoustic Parameter Value Determination
Linguistic Variables
Correlation Coefficient
Regression Line Slope Difference
Fuzzy Rule Process (Defuzzification)
Subjective Bayesian Inference Networks For Password Analysis
Acoustic Feature Inputs
Fundamental Frequency
Spectral Shape
Low-Frequency Energy, Mid-Frequency Energy, High-Frequency Energy
First Formant Frequency, Second Formant Frequency, Third Formant Frequency
Phonetic Classification
Intensity
Stress
Intonation
Articulatory Configuration
Phonetic Quality Attributes
Vowel Presence
Consonant Presence
Prosodic Quality
Articulation Quality
Total Energy
Probability Relationships For Phonetic Attributes
Odds Formulation
Likelihood Ratio
Measurement Of The Evidence (E)
Multiple Evidence
Phonetic Analysis Relationships
Summary

CHAPTER
4. IMPLEMENTATION
System Overview
BANPA Implementation
FREP Implementation
Password Model Implementation
Vowel Password Selection
Stop Consonant Password Selection
Liquids/Glides Password Selection
Nasal Consonant Password Selection
The HpW Works Analyzer For Acoustic Parameter Extraction
AFP System Integration
AFP System Process
Speaker Verification Tests
Speaker Password Selection Results
Speaker Verification
One Versus Two Acoustic Parameters
Matching Algorithms
Algorithm 1: Correlation
Algorithm 2: Closest Match
Algorithm 3: Slope
Algorithm 4: FREP
Algorithm 5: FREP With BANPA (AFP)
System Verification Criteria 1: Comparison Against The Same Passwords
System Verification Criteria 2: Comparison Against The Baseline Passwords
System Verification Criteria 3: Spoofing: Determining A Successful Rejection Rate

CHAPTER
5. CONCLUSIONS
Speaker Verification: A Technology Waiting
Where To Go From Here

APPENDIX
A. VALUES USED FOR BANPA SYSTEM
B. RESULTS OF BANPA RUNS
C. RESULTS OF FREP RUNS

REFERENCES


FIGURES

Figure
3.1 Adaptive Forward Planning Steps
3.2 Password Model Tree
3.3 The AFP Architecture
3.4 Establishing A Baseline For The Acoustic Parameter FF
3.5 Plot Comparison Of Two Fundamental Frequency Parameters
3.6 Fuzzy Rule Distribution
3.7 Bayesian Network For Phonetic Analysis
3.8 Sufficiency Relationship Between Fundamental Frequency And Intonation
3.9 Necessity Relationship Between Low Frequency Energy And Intensity
3.10 Approximation Model For Evidence Of An Acoustic Or Phonetic Attribute
4.1 The BANPA User Interface
4.2 The FREP User Interface
4.3 Password Models For Study
4.4 HpW Works System Author And Version
4.5 User Interface Screen For AFP Password Selection
4.6 AFP System Process
4.7 Distribution Of Password Selection
4.8 Spectral Shape For Speaker 1 Versus Average Spectral Shape For Password "animated"
4.9 Spectral Shape For Speaker 2 Versus Average Spectral Shape For Password "animated"
4.10 Spectral Shape For Speaker 3 Versus Average Spectral Shape For Password "animated"
4.11 Spectral Shape For Speaker 4 Versus Average Spectral Shape For Password "animated"
4.12 Spectral Shape For Speaker 5 Versus Average Spectral Shape For Password "animated"
4.13 Spectral Shape For Speaker 6 Versus Average Spectral Shape For Password "animated"
4.14 Spectral Shape For Speaker 7 Versus Average Spectral Shape For Password "animated"
4.15 Spectral Shape For Speaker 8 Versus Average Spectral Shape For Password "animated"
4.16 Spectral Shape For Speaker 9 Versus Average Spectral Shape For Password "animated"
4.17 Successful Verifications: Against Same Passwords (One Parameter)
4.18 Successful Verifications: Against Same Passwords (Two Parameters)
4.19 Successful Verifications: Against Baseline Password (One Parameter)
4.20 Successful Verifications: Against Baseline Password (Two Parameters)
4.21 Successful Rejections: Against Spoofing Baseline Password (One Parameter)
4.22 Successful Rejections: Against Spoofing Baseline Password (Two Parameters)


TABLES

Table
3.1 Fuzzy Rules For Determining The Quality Of An Acoustic Parameter
3.2 Values Computed By Fuzzy Rules
4.1 Results Of Password Selection
4.2 Phonetic Class Selection Of Speakers


ACKNOWLEDGMENTS
The author wishes to thank the following people for their support,
encouragement and assistance.
Carol Conway Matthews
Kevin Matthews
Daniel Matthews
Dr. Jody Paul
Dr. W. J. Wolfe
Dr. John Clark
Bernie and Dede Conway
AT&T Bell Labs/Lucent Technologies (Denver)


CHAPTER 1
INTRODUCTION
Throughout the last twenty years, speaker verification has been used
for computer system access, building access, credit card verification, and crime
lab forensics, all with differing amounts of success. As compared to other
security techniques such as fingerprints, palm-prints, hand writing scans, retinal,
and facial scans, speaker verification is comparatively inexpensive. Each
individual's speech patterns are unique. Because of this, individual speech
samples are as unique as fingerprints, facial scans or retinal prints. This makes
speaker verification an excellent tool for security control. Recent developments
in artificial intelligence such as neural networks and fuzzy logic have extended
research in many existing technologies. Speaker verification is one of these
technologies. The cost-effectiveness, ease of use, and reliability of speaker
verification make it a practical technology to pursue.
Security In Today's Society
Despite recent advances in computer technology, one security problem
remains that has not been fully solved: how can we verify the claimed identity
of a person? Security access has become a more integral part of today's society
in many ways. With an increasingly competitive high tech industry, the need for
protection of proprietary material has become critical. Systems used for
security access to buildings and computer labs have become regular fare. The
Internet has become a viable means of world-wide communication. The ability
to conduct commerce safely on the Internet is about to become a reality.
There are basically three methods by which a person can be identified.
One is by an object in their possession such as a badge, a key, or credit card.
The second method is by having something memorized such as a user ID, a
password, or personal identification number. The third method is by a physical
characteristic unique to the person such as a fingerprint, facial scan, signature,
or a voiceprint. The first two methods are transferable between one person
and another, making them vulnerable to impostors. The third method, a
unique physical characteristic, offers the most promise for secure personal
identification1.
Despite the need for secure personal identification, there are not many
devices available on the market today. The only effective devices are those
based on biometrics such as fingerprints, palm-prints, and retinal scans. These
devices are used at point-of-access locations, and their use in other locations
such as point-of-sale terminals, or for remote transactions, is not likely due to
the expense. Many situations where access to a secure area must be controlled
involve the use of guards. Examples include areas such as computer rooms,
bank vaults, aircraft maintenance areas, and drug storage areas. If guards are
employed seven days a week, 24 hours a day, the cost could exceed $50,000 a
year.
Fraudulent use of credit cards and bank checks is becoming an
increasing problem for merchants, banks, and credit companies. As we move
closer to a cash-less society, the amount of money lost due to insecure
transactions will become enormous. This type of security problem differs
from that of building access in that the number of places where protection is
needed is staggering. In essence, all gas stations, restaurants, stores, and banks
are potential targets for fraud. Because of the large number of places where
transactions can be performed, the costs of a security solution must be low and
its performance must be very reliable. In addition, any method of security of
this type must be easy to use and widely acceptable to customers.
Today's society is moving toward long-distance working relationships,
telecommuting, and a growing reliance on remote access to proprietary data.
The fears of unauthorized physical access have continued to grow in recent
years. Society is demanding more effective ways of providing security to a
growing number of people and a variety of different needs. The solutions for
these types of transactions must be easy to use, inexpensive, and widely
accepted among all who must interact with it.
Speaker Verification: A Viable Solution
In everyday life it is possible to recognize people by their voices. This
attribute makes the human voice a natural candidate for automated
identification. One person's voice differs from another's because the relative
amplitudes of different frequency components of their speech are different. By
extracting acoustic features such as frequency components from the speech
signal, we can further the reliable identification of a voice.
The technique of speaker verification is one of the few reliable speech
recognition technologies available today. As compared to continuous-word
recognition, speaker verification is constrained to single words or phrases.
Because of this, the complexity of recognition is reduced tremendously. In
addition, speaker verification works with a known user, meaning the system
has previous information stored about the user. Other speech recognition
technologies must generally work with an arbitrary user. The fact that speaker
verification is free of many constraints inherent in other speech recognition
systems allows it to be one of the most reliable speech technologies available.
Typically, the application of speaker verification technology is a device
that customers can be comfortable with. They know what to expect and
therefore don't have unrealistic expectations. These unrealistic expectations
may very well lead to a customer being unhappy or even frustrated by a
product if any rough edges are encountered. Speaker verification can address
probably the single most important issue concerning consumers: will people
accept it as a security method? Because little more than a microphone is
required, and most people find speaking natural, the answer is most likely yes.
To address the need for remote access to proprietary data, speaker
verification can play a key role in securing safe transactions. Speaker
verification technology can be implemented over long distances through
telephone or computer networks. In particular, as telephone technology
improves, speech signal quality becomes increasingly reliable. Finer
signal quality will allow for more dependable speaker
verification. Similar to the way systems today verify credit card purchases,
speaker verification can be used for securing other more critical money related
transactions.
In the case of private criminal investigation of individuals, speaker
verification has been very effectively used to reliably authenticate speakers.
With our knowledge of existing verification processes, we have found that
voice patterns reveal more invariant properties than many existing verification
techniques. Another main advantage of speaker verification is that it does not
require much additional or specialized equipment at the point-of-use.
Equipment for hand scanners or retinal readers can be very large and
cumbersome. A speaker verification application requires little more than a
microphone or handset.
Approaches such as palm prints, eye and body scans, fingerprint or
signature analysis can be very costly. One fingerprint method costs $53,000 for
a central unit, and $4000 for each station2. Another example is the cost of a
hand print scanner. Devices such as this typically cost around $3000.
Comparatively, a speaker verification system costs much less. A complete
system can be implemented with a personal computer, sound card, and
microphone. A system such as this costs as little as $500.
How Can Speaker Verification Be Improved?
Speaker verification technology for commercial applications is still fairly
new. Even though much research has taken place, there are a number of ways
this technology can still be improved. The following sections point out the
potential improvements that are addressed in this thesis.
Cost
Most speaker verification systems require specialized hardware to
accomplish the processing needed. Not only is most of this hardware
proprietary, but it is typically very expensive. An improvement can be made by
developing a system on a common platform such as an IBM-compatible
machine using a standard sound card. By doing this, the system can be
duplicated easily and inexpensively. In addition, since the majority of
computer users use this type of platform, users of a speaker verification system
developed on it will instantly be comfortable with using it.
User Interface
A speaker verification system must be easy to use and its interface must
be appealing and friendly to interact with. Many of the speaker verification
systems in use are specialized laboratory applications, and are not designed for
typical commercial use. By simply looking at a computer screen with a mouse
in hand, and talking into a microphone, users should be very comfortable
interacting with a system. In reality, all that is needed is a microphone or
handset, and a simple visual or audio feedback mechanism that alerts the user
to the verification results. The system developed and described in chapter 3 has
attempted to capture these attributes.
Training The System
When a user first encounters a speaker verification system, it typically
carries out a training process that results in a voice print that is stored and later
used for matching the voice of that individual. Most systems employ a
technique called speaker adaptation to refine matching during subsequent
encounters with the verification system. The speaker's voice print is adapted to
include new acoustic information the first few times a speaker uses the system
successfully. Because of this, the system may be unstable and less secure in its
initial use.
By emphasizing speaker adaptation at the earliest stages of a user's first
encounter, the initial training of a system can be improved. Instead of allowing
the system to adapt over a prolonged period of time, collection of critical
information is carried out at the beginning of a system's life cycle. Not only will
the system perform more reliably at earlier stages, it will be easier to use in the
long run. This process, which is developed in this thesis, is called Adaptive
Forward Planning. It is described in chapter 3.
Speaker Population
Most speech recognition systems can readily handle single speakers by
specifically tailoring the system to the nuances of the individual speaker. Most
realistic applications demand the ability to handle more than a single talker.
Systems such as these are designed for computer system access, credit card
verification or building access. These systems must handle hundreds or even
thousands of different speakers. In addition to handling such a multitude of
different speakers, these systems must be able to deal with uncooperative
speakers. These types of speakers may not say exactly what the system has
asked for or may even try to fool the system on purpose.
By focusing on techniques and processes that can generically be
adapted to a diverse speaker population, the performance of a speaker
verification system can be improved. This can be accomplished by designing a
set of parameters that are invariant among all speakers, and relying on speech
characteristics that apply to the way people produce speech. Uncooperative
speakers can be dealt with by utilizing these parameters to detect potential
impostors.
Using Acoustic Parameters Of The Speech Signal
Probably the most popular approach to speaker verification is to focus
on speech signal parameters. Many studies have utilized a technique of
extracting acoustic parameters that can later be used for identification. Most of
the processes used for this technique have encompassed very specialized
hardware and complicated algorithms. While most of the systems developed
have yielded significant results, not many have explored alternate methods of
acoustic parameter extraction, and how these acoustic parameters can be used
for speaker verification. This is probably the most open area for speaker
verification exploration. This is the core focus of this thesis.
Organization Of Thesis
In the remaining chapters of this thesis, we give an overview of speaker
verification technologies and systems that have been developed over the last
twenty years. We point out what's missing, offer solutions, describe our
approach in detail, and discuss experimental results.
In Chapter 2, Speaker Verification Technology And Systems, speaker
verification methods and techniques are discussed, followed by details of full
implementations. The purpose of this chapter is to provide the interested
reader with background and history, and to point out what's missing and hasn't
been addressed in previous speaker verification research.
Chapter 3, Approach, is the core of this thesis. The method of
Adaptive Forward Planning is introduced. This new method explores alternate
ways of utilizing acoustic features extracted from the speech signal. When
features are extracted, they are evaluated to determine how much they
contribute or do not contribute to the verification process. A new approach to
evaluation using Bayesian networks and a fuzzy rule process is explained. These
new approaches have the potential to improve the verification process.
Chapter 4, Implementation, discusses how the research was put into
practice, including the system that was built. The discussion covers
system integration in detail, along with the tradeoffs and compromises made.
results of several speaker verification experiments are provided as well.
Finally, in Chapter 5, Conclusions, a summary and conclusions of this
research are presented. In addition, this chapter summarizes why this research
is important, and what future types of speaker verification research may
provide further successes and rewards.

CHAPTER 2
SPEAKER VERIFICATION TECHNOLOGY AND SYSTEMS
Research in speech recognition has traversed a multitude of directions
over the past twenty years. The types of systems developed have been very
diverse: from reliable 24 hour-a-day isolated word recognizers to extremely
versatile and complex systems designed to interpret sentences or continuous-
word speech. There are a number of commercial systems currently available.
Most of these systems are used for simple desktop assistance, or for games and
entertainment. There are also a number of private or industrial systems that
have been designed for much more complicated and scientific uses. These uses
include real-time language interpreters, human-computer interaction systems,
and implementations used for security and verification purposes. It is this last
use that we are particularly interested in.
Even though computer technology has dramatically improved over the
last twenty years, the practical uses of speech recognition technology have
remained limited. The two most applicable speech recognition technologies are
isolated-word recognition and speaker verification. These two are closely
related in a number of ways. Each involves a cooperative speaker who is willing
to respond to the system's needs in order to achieve success. Both
technologies compare incoming speech with prerecorded templates of various
speech data. Also, both use similar types of matching algorithms to accomplish
their goals. Of these two, speaker verification has proven to be a more useful
technology in the sense that it has successfully been used for a variety of
applications.
In this chapter, a brief overview of speaker verification technology is
given. The focus is on the most promising current technologies that have
been researched and/or implemented within the last twenty years. In addition
to the focus on technology, complete systems that have been implemented are
looked at. The purpose of this second focus is to get a feel for the successes
and failures in speaker verification.
The following overviews are a representative survey and are by no means
exhaustive.
Speaker Verification Technologies
This section will provide an overview of popular techniques and
methodologies used for speaker verification. Most of these techniques have
two things in common: first, there is an initial speech data collection, usually
implemented by a training session; second, verification is carried out by
comparing this initial data against data collected from an incoming speech signal.
Predictive Models3
Predictive models have been successfully used in speaker verification.
In this section, we will briefly discuss four that have been researched. In each,
the incoming speech signal is transformed into a model which is then used to
verify the speaker. The first two models do not require pre-processing of the
incoming signal; the second two do.
The Hybrid MLP-RBF-Based System. The hybrid MLP-RBF model4 is
a two-stage connectionist model designed to operate in the time domain alone
and performs well without any time-warping. The first stage is a Multi-Layer
Perceptron (MLP) neural network and is used to extract speech parameters
which are used in the verification stage. The MLP is trained to act as a
nonlinear speech predictor for the utterance spoken by the person who claims
an identity. The second stage implements the speaker verification process
based on a Radial Basis Function (RBF) classifier using the weights of the MLP
as its inputs. The RBF classifier is previously trained to accept the weights
produced by the true speaker utterance applied to stage one, and to reject all
other weights produced by other speakers.
Several previous studies have been carried out on the use of neural
architectures for the purpose of time series and speech prediction. The MLP-
RBF system is based on the fact that the MLP model is capable of learning the
underlying speaker-dependent trends of a speech utterance. It was shown that
this connectionist model could be trained to predict the speech waveform in a
non-recursive mode. However, after the training process was completed, the
model behaved chaotically when operated in a recursive manner, though its output
did reveal a relationship to the original speech waveform. It was found that a
connectionist model was not sufficient to model the time-varying parameters
of speech and therefore could not work well in a recursive mode. But the
similarities between the chaotic series produced by the recursive prediction and
the original speech signal showed that the connectionist model was learning the
operation of the underlying speech production mechanism. The MLP-RBF
system is based on this knowledge.
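To make the two-stage idea concrete, the stage-two decision can be pictured as a small radial basis function classifier applied to the weight vector learned by the stage-one MLP. The Python sketch below is illustrative only; the parameter names, shapes, and threshold rule are my assumptions, not details of the published system.

    import numpy as np

    def rbf_verify(mlp_weights, centers, widths, out_weights, threshold):
        # Gaussian radial basis activations around previously trained centers.
        dists = np.sum((centers - mlp_weights) ** 2, axis=1)
        activations = np.exp(-dists / (2.0 * widths ** 2))
        # Linear output layer; accept when the score clears the threshold.
        score = np.dot(out_weights, activations)
        return score > threshold

Here `centers` would be trained to fire on weight vectors produced by the true speaker's utterances and to stay quiet on all others.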
The Self Segmenting Linear Predictor Model. This model uses an array
of linear predictors to model the true speaker where each predictor is
12


associated with a particular sub-unit of the speech utterance. Linear predictors
have proven successful in speaker verification applications where the speech is
divided up into frames of equal length. Because the speech signal has a slow
time varying property, the vocal tract shape is considered to stay constant
during the duration of a frame. Each frame can be considered a stationary
signal allowing each to be represented with Linear Predictive Coefficients
(LPC). In this model, Linear Predictors are used to represent the temporal
structures of speech. An iterative training process uses Dynamic Programming
(DP) to segment speech into sub-units during which the vocal tract stays
constant, and then trains a set of LPs for each of these speech segments.
The segmentation and training are done on a sample-by-sample basis in the
time domain. No other pre-processing is required.
Both the LP coefficients and the segmentation units of the training
utterances are stored and used for verification. Verification involves DP of the
test utterance with the LP coefficients of the claimed true speaker. The
normalized mean square prediction residual is calculated by dividing the
accumulated squared prediction error by the accumulated squared sample
values over the entire utterance. The normalized
mean squared prediction error is then compared to a threshold to determine
the success of verification.
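As a concrete illustration, the residual test just described might look like the following Python sketch, assuming the claimed speaker's linear predictor coefficients for one segment are already trained; the predictor order and threshold are placeholders, not values from the cited work.

    import numpy as np

    def lpc_residual_score(samples, lpc_coeffs):
        # Predict each sample from the p preceding samples, accumulate the
        # squared prediction error, and normalize by the signal energy.
        p = len(lpc_coeffs)
        error = 0.0
        for n in range(p, len(samples)):
            predicted = np.dot(lpc_coeffs, samples[n - p:n][::-1])
            error += (samples[n] - predicted) ** 2
        return error / np.sum(samples ** 2)

    # Verification succeeds when the normalized residual for the claimed
    # speaker's predictors falls below that speaker's stored threshold.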
The Neural Prediction Model. The Neural Prediction Model (NPM)
consists of an array of MLP predictors and is constructed as a state transition
network. Each state has a particular MLP associated with it. Each MLP
predictor has one hidden layer and one output layer consisting of eight nodes.
The hidden layer nodes have a sigmoidal function, while the output nodes are
linear. Eight Mel-frequency cepstral coefficients are used as the frame feature
vectors. The MLP outputs a predicted frame feature vector based on the
preceding frame feature vectors. The difference between the predicted feature
vector and the actual feature vector is defined as the prediction residual.
The goal of this type of system is to find a set of MLP predictor
weights which minimize the accumulated prediction residual for a training set.
A training data set is created by collecting a
series of password utterances. Verification requires the application of the test
utterance to the NPM associated with the speaker who claims identity. The
accumulated prediction residual divided by the sum of the squares of each
feature component in the utterance is used to determine verification success.
The Hidden Control Neural Network Model. This model utilizes a
single MLP predictor. Like the NPM, this model is constructed as a state
transition network and also uses eight Mel-frequency cepstral coefficients as
frame feature vectors. The single MLP outputs a frame feature vector
prediction. The model attempts to find a set of MLP predictor weights that
minimize the accumulated prediction residual for the true speaker utterance
training set. Verification of a claimed speaker involves application of the
incoming utterance along with the MLP weights set to those associated with
the claimed speaker. As in NPM, the accumulated prediction residual divided
by the sum of the squares of each feature component in the utterance is used
to determine verification success.
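The scoring rule shared by the NPM and HCNN can be sketched as follows; `frames` holds the per-frame Mel cepstral vectors and `predict` stands in for the trained MLP predictor(s), both of which are illustrative assumptions on my part.

    import numpy as np

    def neural_prediction_score(frames, predict):
        # frames: (T, 8) array of Mel cepstral feature vectors.
        # Accumulate the squared error between each predicted frame and the
        # frame actually observed, then normalize by the feature energy.
        residual = 0.0
        for t in range(1, len(frames)):
            residual += np.sum((frames[t] - predict(frames[t - 1])) ** 2)
        return residual / np.sum(frames[1:] ** 2)

A low score for the claimed speaker's predictor weights indicates a successful verification.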
Gaussian Mixture Models5
The Gaussian mixture speaker model was first introduced in 1990 and
has demonstrated very accurate verification for text-independent speaker
utterances. In the Gaussian mixture model (GMM), the distribution of feature
vectors extracted from speech is modeled by a Gaussian mixture density. The
density is a weighted linear combination of uni-modal Gaussian densities, each
parameterized by a mean vector and covariance matrix. Maximum likelihood
speaker model parameters are estimated using the iterative Expectation-
Maximization (EM) algorithm. Generally, 10 iterations are sufficient for
parameter convergence. The GMM can be viewed as a hybrid between two
effective models for speaker recognition: a uni-modal Gaussian classifier and a
vector quantizer codebook. The GMM combines the robustness and
smoothness of the parametric Gaussian model with the arbitrary density
modeling of the non-parametric VQ model.
The speech signal is first segmented into frames by a 20 ms window
progressing at a 10 ms frame rate. Silence and noise frames are discarded using
a speech activity detector (SAD). This is important in text-independent speaker
recognition because by removing silence and noise frames, modeling and
detection are based solely on the speaker, not the environment in which the
speaker is speaking. After the SAD processing, Mel cepstral feature vectors are
then extracted from the speech frames and cepstral coefficients are derived.
Finally, the feature vectors are channel equalized via blind deconvolution. The
deconvolution is implemented by subtracting the average cepstral vector from
each input utterance. It is critical to collect training samples and test samples
from the same microphones or channels to achieve good recognition accuracy.
The speaker verification process is a straightforward maximum
likelihood classifier. For a group of speakers, each is represented by a GMM.
The objective then is to find the speaker model which has the maximum
posterior probability for the input feature vector sequence. The minimum error
Bayes decision rule is used to determine the accuracy of verification.
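To make the scoring concrete, the following Python sketch evaluates the average log-likelihood of an utterance's feature vectors under one speaker's GMM. Diagonal covariances and the parameter layout are simplifying assumptions on my part, not details from the cited work.

    import numpy as np

    def gmm_avg_log_likelihood(X, weights, means, variances):
        # X: (T, D) feature vectors; weights: (K,); means, variances: (K, D).
        component_logs = []
        for w, mu, var in zip(weights, means, variances):
            # Log of one weighted diagonal-covariance Gaussian component.
            lp = (np.log(w)
                  - 0.5 * np.sum(np.log(2.0 * np.pi * var))
                  - 0.5 * np.sum((X - mu) ** 2 / var, axis=1))
            component_logs.append(lp)
        stacked = np.vstack(component_logs)
        # Stable log-sum-exp over components, averaged over frames.
        m = stacked.max(axis=0)
        return np.mean(m + np.log(np.sum(np.exp(stacked - m), axis=0)))

Verification then amounts to checking whether the claimed speaker's model yields the highest score among the candidate speaker models.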
Statistical Features For Speaker Verification6
Speaker verification systems have also been implemented by using
statistical features of speech. One such system used statistical features
extracted from speech samples for an automatic verification of a claimed
identity. The analysis of a prescribed code sentence to be repeated for
verification is performed by a real-time hardware processor consisting of a
multi-channel filter bank covering the frequency range from 100 Hz to 6.2
kHz. The incoming speech signal is scanned every 20 ms, and a multiplex
integrator is used to compute the long-term averaged spectrum (LTS) over the
entire utterance.
The system is trained with a number of sample utterances of the code
sequence used for verification. From these samples a speaker-specific reference
is calculated and stored on the computer. In addition, a verification threshold is
calculated and stored for each speaker. For successful verification, the distance
between the test speech input and the closest stored reference must fall below
this threshold.
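A minimal sketch of this scheme follows, assuming the filter-bank outputs for each 20 ms scan are already available as rows of a matrix; the Euclidean distance is my assumption, since the exact metric is not specified here.

    import numpy as np

    def long_term_spectrum(scans):
        # Average the multi-channel filter-bank outputs over the utterance.
        return np.mean(scans, axis=0)

    def lts_verify(test_scans, stored_references, threshold):
        # Accept when the distance to the closest stored reference for the
        # claimed speaker falls below that speaker's stored threshold.
        lts = long_term_spectrum(test_scans)
        distance = min(np.linalg.norm(lts - ref) for ref in stored_references)
        return distance < threshold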
Extraction Of Cepstral Coefficients7
In this implementation, cepstral coefficients are extracted from the
incoming speech signal. These coefficients are used to build a codebook
containing a set of codevectors (coefficient vectors) representative of the
person to be identified. Verification is carried out by extracting cepstral
coefficients from the new speech of the person to be identified, and a
distortion metric is used to measure the distance from the new vectors and
those in the codebook.
Frames of approximately 30 ms of speech are digitally sampled and
then processed through a pre-emphasis network to boost the high-frequency
components of the speech. Cepstral components are extracted and a
probability-of-voicing factor is computed for each frame. A probability of voicing
at or above a given threshold indicates the presence of formants, which
provide a correlation with an individual's vocal tract. All the frames are then
presented to a Linde-Buzo-Gray clustering algorithm, which is used to
transform the initial coefficient vectors into a set of 64 codevectors that is
then used for verification.
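The verification step can be pictured as in the sketch below. Euclidean distance stands in for the unspecified distortion metric, and the acceptance threshold is a placeholder.

    import numpy as np

    def average_distortion(cepstra, codebook):
        # Match each incoming cepstral vector to its nearest codevector and
        # average the resulting distances over the utterance.
        total = sum(min(np.linalg.norm(v - c) for c in codebook) for v in cepstra)
        return total / len(cepstra)

    # Verification succeeds when the average distortion against the claimed
    # identity's 64-entry codebook falls below a chosen threshold.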
Speaker Verification Systems
Several speaker verification systems have been implemented, each
particular to a specific application. Even though the applications are specific,
the goal remains the same: to determine, as accurately as possible, whether the
speaker claiming an identity is who they say they are. The following
sections provide an overview of the systems that have been developed.
ITT Defense Communications Division (ITTDCD)8
The United States government has historically been interested in
speech recognition for as long as the technology has been around. The major
areas of interest of the government are word-spotting, talker identification,
language identification, command-and-control, and secure (encrypted)
speech transmission. ITTDCD has worked on many of these government
applications. ITTDCD has investigated methods for speaker verification by
attempting to recognize the identity of a speaker even if the text of the
analyzed utterance is unknown. ITTDCD has also worked on improving
speaker verification by making such systems more accurate and less expensive.
Most of the research at ITTDCD has been directed at the fundamental
problems of recognizing speech in noisy environments and/or over telephone
lines. Of considerable interest is the selection of acoustic features that are
immune to degradation over phone lines or from background noise. These
features must be insensitive to noise and also to the identity of the speaker.
Because of these constraints, the feature selection techniques designed
by ITTDCD have been centered on determining optimal methods for reducing
the effects of noise on the accuracy of the incoming speech signal. Among the
approaches taken include comparative evaluation of different feature sets. The
feature sets that were examined include linear predictive coefficients (LPC),
vocal tract are functions, autocorrelation coefficients, cepstral coefficients, and
LPC-derived pseudo formants. Another approach involves the use of a least
mean square (LMS) adaptive filtering method for the removal of additive noise
from the speech signal. And finally, a third approach involves the investigation
of a noise-reduced LPC parameter set.
ITTDCD is well established in the speech compression (vocoding)
area. This technology is well suited as a front-end to many speech recognition
systems. In its research involving speech recognition, ITTDCD has
emphasized low-cost solutions and has developed several systems used for
isolated word recognition, word-spotting, and speaker verification.
AT&T Information Systems9
AT&T and its Bell Laboratories have developed and researched many
different speech recognition technologies. Although much research has been in
continuous-word recognition, many word-spotting and speaker verification
systems have also been developed. One of particular interest was developed by
AT&T's I.S. division in the mid-1980s. This application is a voice password
system for security access using speaker verification designed for use over dial-
up telephone lines. The voice password system (VPS) can be used for secure
access to telephone networks, computers, rooms, and buildings. The VPS
system works by allowing a user to call into the system, enter his or her
identification number, and then speak a password that is usually a phrase or
short sentence. On initial encounter, the VPS system creates a model of the
users voice and stores a reference template.
Incoming speech is processed in the VPS system on a frame-by-frame
basis. Frames are 45 ms in duration and spaced at 15-millisecond intervals, so
adjacent frames overlap. For each frame a set of features is extracted that characterizes
aspects of the signal such as short-term energy and spectrum. For each feature
extracted, the autocorrelation of the incoming signal is computed, creating
autocorrelation coefficients. The coefficients are then modified by simulating
the addition of white noise to reduce differences between noisy long-distance
telephone lines and clear local lines. These modified coefficients are then
transformed into linear predictive coefficients (LPC) which represent, in bits, a
spectrum of the voice. The LPC coefficients are then transformed to cepstral
coefficients and are normalized by subtracting the mean cepstral values over
the utterance from each 15-ms frame of speech.
Finally, the beginning and end frames of the password are located. Once
this feature set is computed, it is matched with a previously generated reference
pattern using a method called dynamic time warping (DTW) which accounts for
timing differences among repeated utterances of the same phrase. The DTW
match yields an absolute distance score and is used to evaluate the identity of
the speaker.
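The DTW match can be sketched with the textbook recurrence below; this is a generic formulation, not AT&T's exact implementation, and the local Euclidean cost is an assumption.

    import numpy as np

    def dtw_distance(test, reference):
        # D[i, j] holds the best accumulated distance aligning the first i
        # test frames with the first j reference frames.
        n, m = len(test), len(reference)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(test[i - 1] - reference[j - 1])
                # Allow a match, an insertion, or a deletion at each step.
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # The absolute distance score D[n, m] is compared against a threshold
    # to evaluate the claimed identity.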
AT&T Bell Laboratories10
Many methods of speaker verification have been studied at Bell
Laboratories. One of those methods experimented with comparing certain
characteristics of a speaker's voice with the same characteristics of the voice of
the person who the speaker claims to be. Allowances are made for normal
variations in speech rate, pitch, volume, and other factors. The belief is that a
system using this method can be just as fast as a human listener and detect
impostors much more accurately.
A file of prototype utterances of a single phrase spoken several times is
collected and averaged by a computer. This average forms a prototype that is
stored along with measurements of the variability among individual utterances.
The variability data is collected because no one can speak the same phrase
twice in exactly the same way. When verification is desired, the computer
fetches the stored prototype for the claimed identity, analyzes the incoming
speech sample and determines if it is close enough to the prototype version.
Five characteristics or features are extracted from the speech signal.
The first three are the three lowest resonant frequencies, known as formants one, two,
and three. The fourth characteristic is voice pitch and its variation with time.
The fifth characteristic is the variation of the intensity (or loudness) of the
speech with time. Before a voice sample can be compared, the prototype and
the incoming voice sample are brought into temporal registration by time-
warping the voice sample. This is done by speeding up or slowing down
various portions of the utterance.
Once the characteristics have been extracted and the two samples to be
compared are brought into registration, measurements are taken to determine
how similar the prototype and the sample are. To accomplish this, the
computer divides each of the five characteristics into 20 equal time segments.
For each segment, several measures of dissimilarity are computed, such as
the mean squared difference and the squared difference of the average rate
of change. After these distance measurements are computed
for each separate segment, each distance is averaged over the 20 segments.
Finally, a sixth distance measure is taken that reflects the degree of time-
warping that was necessary to achieve registration. The computer then
combines all six distance measures and computes an overall final distance
measure of dissimilarity.
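One of the per-segment distances might be computed as in the sketch below, assuming the prototype and sample characteristics have already been time-warped into registration and stored as equal-length arrays; the dict-of-arrays layout is my assumption.

    import numpy as np

    def segment_mean_squared_distance(prototype, sample, segments=20):
        # Divide each time-aligned characteristic into 20 equal segments,
        # compute the mean squared difference per segment, then average.
        distances = {}
        for name in prototype:
            p_parts = np.array_split(prototype[name], segments)
            s_parts = np.array_split(sample[name], segments)
            d = [np.mean((p - s) ** 2) for p, s in zip(p_parts, s_parts)]
            distances[name] = np.mean(d)
        return distances  # later combined with the time-warping penalty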
The ARPA Speech Understanding Project:
Continuous-Word Recognition Techniques
The focus of this thesis is on speaker verification, which is considered
an isolated-word recognition technique. Even though isolated-word
recognition is the most applicable to speaker verification, there is much
correlation to continuous-word techniques. Because of this, it is hard to ignore
the most important research that has occurred in continuous-word speech
recognition. This section will briefly hig^iligjit the major systems that were
developed for The Advanced Research Projects Agency (ARPA) Speech
Understanding Project in the 1970s11. ARPA is an agency of the Department of
Defense.
Systems Development Corporation (SDC)12
The SDC system was developed to process sentences. When a digitized
waveform enters the system, formant frequencies and other parameters are
extracted. From this a phonetic transcription is obtained, including several
alternative labels for each 10-ms segment of the waveform. All of this data is
then placed into an array for later examination by top-end routines. The
utterance is processed from left to right. First, a list of all possible
sentence-beginning words is generated. Then an abstract phoneme representation is
extracted for each lexical hypothesis and a graph of expected acoustic variants
is created. Each of these graphs is then sent to a mapper to determine how
good an acoustic match can be obtained. The mapper includes techniques for
estimating the probability that the expected word is present given the phonetic
and acoustic data collected.
The mapper constitutes a verification strategy based on syllables. This is
an attractive strategy for predicting phonetic segments.
Hear What I Mean (HWIM)13
As in the SDC system, when a digitized waveform enters the
system, formant frequencies and other parameters are extracted. This
information is then used to derive a set of phonetic transcription alternatives
that are arranged in a phonetic segment lattice. The advantage of the lattice
structure is that it can represent segmentation ambiguity in those cases where
decisions are most difficult. Identification of words is carried out by searching
through the segmental representation of the utterance for the closest lexical
matching words. These matches are used as seeds that are later used to build
up partial sentence hypotheses. The best-scoring word is then sent to a word-
verification component that utilizes parametric data to get a quasi-independent
measure of the quality of the match.
The method of verification is analysis by synthesis. The verification
score is combined with the lexical matching score, and if this score is high
enough, the word hypothesis is sent to a syntactic predictor which, using
grammatical constraints, proposes which words can appear on the left and
right of the seed word. The word proposals eventually build a lexical decoding
network that produces hypotheses of two words, three words, four words, and
so forth until a final sentence is obtained.
Carnegie-Mellon University Hearsay-II14
The process of recognition is similar to the HWIM and SDC systems
described above. The Hearsay-II system employs a set of parallel asynchronous
processes that simulate each of the component knowledge sources of a speech
understanding system. The knowledge sources communicate via a global
blackboard database. When any one of the knowledge source components
is activated by the blackboard, it tries to extend the current state of analysis.
The blackboard is divided into several major categories: sequences of
segment labels, syllables, lexical items proposed, accepted words, and partial
phrase theories. Initially, amplitude and zero-crossing parameters are used to
divide an utterance up into segments that are categorized by manner-of-
articulation features. A word hypothesizer then lists all words having a syllable
structure compatible with the partial phonetic segments. A word verification
component scores each lexical hypothesis by comparing an expected sequence
of spectra with observed linear-prediction spectra. High-scoring words activate
a syntactic component which attempts to piece words together into partial
sentence theories. This process continues until a complete sentence is found.
Carnegie-Mellon University Harpy15
The Harpy system is an extension of a Markov model of sentence
decoding originally employed by a sentence recognition system called Dragon16.
In Dragon, a breadth-first dynamic programming strategy was used to find
the optimal path through the network. In Harpy, a beam-search technique is
used in which a restricted beam of near-miss alternatives around the best-
scoring path is considered. Dragon also used a priori probabilities in choosing
the most likely path, whereas Harpy considers only spectral distance.
The Harpy finite-state machine has 15,000 states. The state transition
network includes all possible paths, alternate representations of all lexical items
in terms of acoustic segments, and a set of rules that define expected acoustic
segment sequence changes across word boundaries. The input utterance is
divided up into brief acoustic segments. Each segment is compared with 98
talker-specific linear-prediction spectral templates to obtain a set of 98 spectral
distances. Each state in the network has an associated spectral template. The
strategy is to find the best-scoring path through the state transition
network by comparing the distance between the observed spectra and template
sequences given in the network.
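In the spirit of Harpy's beam search, the sketch below keeps, at each acoustic segment, only the paths whose accumulated spectral distance lies within a fixed beam of the best path; the network representation and distance function are illustrative assumptions, not Harpy's actual data structures.

    def beam_search(start, successors, template, observations, distance, beam):
        # paths maps a network state to the best accumulated distance so far.
        paths = {start: 0.0}
        for obs in observations:
            candidates = {}
            for state, acc in paths.items():
                for nxt in successors[state]:
                    d = acc + distance(obs, template[nxt])
                    if d < candidates.get(nxt, float("inf")):
                        candidates[nxt] = d
            best = min(candidates.values())
            # Retain only near-miss alternatives around the best-scoring path.
            paths = {s: d for s, d in candidates.items() if d <= best + beam}
        return min(paths.items(), key=lambda kv: kv[1])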
What Is Missing?
The preceding sections have provided a brief overview of speaker
verification technologies, methods, and implementations. There are two
concepts common to all of them. First, the method used for verification
involves decision making based on matching one set of acoustic, or other
speech derived data, against another set of similar data. The second concept
involves using a limited set of acoustic features, without really
exploring the impact that additional features may have.
By simply matching one set of data against another, there is an
increased potential for lost information. Because the acoustics of speech vary
so much from one speaker to another, decision making algorithms must
involve more than a simple matching process.
In this thesis, the method of determining verification is taken a step
further than simply matching data. Because of the nature of speech, there is
more information available to us than an individual set of acoustic features.
Phonetic relationships exist that allow us to study feature interactions. By
modeling these relationships and interactions, new and alternate methods of
decision making algorithms are possible. By focusing on a definitive set of
acoustic features, phonetic relationships can be derived which should help to
distinguish one voice from another.
Acoustic features are pivotal to the speaker verification process. It is
within certain combinations of these features that verification can be improved.
One feature for verification is most likely not enough, nor are two. But by
combining more and more features, we can create a finer granularity of
decision making that should improve the accuracy of speaker verification. In
this thesis, we will also show that substantial improvement can be realized by
combining and utilizing additional acoustic features in the verification process.

CHAPTER 3
APPROACH
The focus of this study is to determine how acoustic features and
phonetic relationships can aid us in making decisions in the speaker verification
process. We are interested in the role acoustic parameters play in improving the
accuracy of speaker verification. Extracting acoustic parameters or features
from the speech signal is not a new concept. The potential methods by which
parameters are evaluated and utilized for speaker verification are conceptually
new and have not been fully tapped. In this chapter, a new method of using
acoustic parameters of speech is presented.
Decision making used in speaker verification is taken further than
simply matching one speech-derived data set with another. A new technique
called Adaptive Forward Planning is introduced that attempts to take advantage
of unique acoustic features inherent to a speaker. New procedures for acoustic
feature evaluation, and how these evaluations can aid speaker verification, are
presented in this chapter. In addition, this thesis goes further and explores the
impact of combining additional acoustic parameters.
Adaptive Forward Planning
The technique of Adaptive Forward Planning takes advantage of how
the human vocal system produces speech. By capturing the essential aspects of
the human vocal system, speaker verification can be improved by identifying
those individual speaker characteristics that distinguish an individual from the
average user or group of users. Characteristics such as vocal tract resonance,
characteristics of articulation, and the rate of vibration of the vocal cords are
utilized to improve verification. Techniques of speaker normalization or
channel normalization can be used so that certain features of the speech signal
can be detected and used to capture specific characteristics of the speech
signal. An example would be detecting the fundamental frequency or pitch of
the signal. Another would be the use of complex processes that would
determine the speaker's vocal tract length. Using criteria such as these, a
specific acoustic feature profile may be created and stored for later use.
Adaptive Forward Planning (AFP) uses a predetermined set of acoustic
information criteria. This set can be thought of as a boilerplate of specific
acoustic information. By focusing on a definitive set of acoustic features up
front, AFP can formulate a unique profile for every user. With AFP, the task of
speaker verification is to determine the information carrying features common
to repeated utterances of the same phrase or word. When a user encounters
the system for the first time, rather than simply storing initial voice information
for that user, the adaptive process is carried out interactively. The system
proceeds through refinement learning steps and updates a unique profile of
acoustical feature information until it meets a pre-specified set of criteria. As
the system works its way toward refining this set of criteria, it makes use of
different phrases or words whose selection is determined by the refinement
process. Acoustic features may be selected based on trial and error heuristics
which capture the distinct individual speaker characteristics that pertain to a
particular phrase.
Once this set of criteria has been obtained, the system then selects a
final appropriate phrase that the user will be asked to use for verification. The
phrase will be selected from a set of stored templates that have been collected
from the user population. The phrase selected will be the phrase that best
matches the collection of acoustic information that has been obtained. In
addition to selecting this final phrase, this system will store the reusable
acoustic information obtained from the speaker. The illustration below depicts
this process. The progression proceeds from top to bottom:
[Figure: voice verification with Adaptive Forward Planning. The original voice characteristics boilerplate is refined step by step (the refining process continues, with speaker characteristics further refined at each step) until the final voice characteristics information is produced.]
Figure 3.1 Adaptive Forward Planning Steps
By creating this unique acoustic information profile, additional in-depth
verification may be carried out in later system encounters that would otherwise
not be possible. For example, if a user attempts to gain access to the system at a later
date, a number of pre-determined thresholds could be used to determine the
accuracy of verification. The system would use this unique acoustic profile to
further the verification analysis if a number of the thresholds were not met.
The advantage of this is that the probability of false rejection (i.e. a valid user
not being accepted) could be lowered.
Adaptive Forward Planning will improve speaker verification in a
number of other ways as well. For instance, when a user wishes or is required
to choose a new password, they will have an existing specific feature set
with which to interact. Rather than randomly choosing new phrases for
passwords, specific phrases can be derived from the acoustic information
available. By doing this, the user's acoustic feature profile will remain more or
less intact. This will ensure that the unique features of their voice remain
embedded in the system and are continually taken advantage of. Another
advantage is the use of similar passwords among different speakers. Because
the acoustic profile will be unique to each speaker, if two or more speakers use a similar,
or even the same, password, verification analysis should reveal unique speakers
because of the depth of information held in each profile. The ability to use
similar or equivalent passwords will reduce the overhead required to maintain
the password model database.
Password Models
To collect a unique acoustic profile, AFP uses password
models. Each password model is a single phrase that has been analyzed in
depth and determined to model a set of phonetic qualities that yield unique
acoustic features when spoken. AFP first presents the user with an initial set of
password models that they will be asked to speak in sequence. As each word or
phrase is spoken, the system will capture those acoustic features that pertain to
each individual model. AFP will then determine which spoken phrase or word
best matches the acoustic features of the speaker and will then proceed to the
next appropriate set of related password models.
The password model system will be hierarchical in nature and can be
represented in a tree-like structure. Contained in the top level of the tree (level
0) will be the initial set of password models. Each of their children will
represent a subset that will contain similar, but more detailed, phonetic
properties. When a user encounters the system, they start at the top of the tree
and work their way down, passing through several internal nodes until they get
to the lowest level of the tree. The lowest level will represent the actual final
phrases or words that AFP determines best suit the user. AFP will then
choose among these for the password the user will be required to use. The
illustration below depicts the password model tree.
[Figure: the password model hierarchy.]
Figure 3.2 Password Model Tree
The lowest level of the hierarchy represents the actual passwords that
will be used in the system. Initially, a finite number of passwords is used in the
system. AFP will require ongoing maintenance of the password model
hierarchy. When the lowest-level passwords are eventually exhausted, they may
simply be replaced by sets of phrases that fit the acoustic feature properties of
their predecessors. It will be several years, if ever, before the system runs out of
available passwords to use.
By proceeding through the password model hierarchy, AFP will
accurately narrow down those acoustic features that are unique to an individual
speaker and map these to a specific phrase. As these acoustic features are
discovered, AFP will dynamically build an acoustic feature profile for each user.
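As a rough sketch of this descent through the hierarchy, the following assumes each node holds a password and its children, plus a scoring function standing in for the evaluation machinery described later in this chapter; all names and structures are illustrative, not taken from the actual AFP implementation:

    # Sketch of descending the password model hierarchy (illustrative only).
    # Each node is a dict: {"password": str, "children": [child nodes...]}.

    def select_password(root_level, score):
        """root_level: list of level-0 password model nodes.
        score: callable(password) -> acoustic worth of the speaker's utterance.
        Descends the tree, keeping the best-scoring node at each level."""
        level = root_level
        best = None
        while level:
            # The speaker utters every password at this level; score each one.
            best = max(level, key=lambda node: score(node["password"]))
            level = best["children"]      # descend toward the leaves
        return best["password"]           # leaf = the final selected password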
The AFP Architecture
To accomplish the goals of AFP, two main architectural components
were developed: a sub-system for evaluating acoustic parameters when
extracted, and a second sub-system for combining the various acoustic
parameter evaluations and determining the overall value or acoustic worth of
a particular password model. These two sub-systems along with the password
models described above make up the AFP architecture.
Fuzzy Rule Evaluation Of Parameters (FREP)
The sub-system developed for evaluating acoustic parameters is the
Fuzzy Rule Evaluation of Parameters (FREP). Because acoustic parameters
extracted from the speech signal vary so greatly, we are often dealing with
vagueness and ambiguity when trying to determine how "good" a parameter is.
For this reason, this sub-system is modeled using fuzzy rules.
Bayesian Network for Password Analysis (BANPA)
The sub-system developed for evaluating the overall value of a
particular password model is the Bayesian Network for Password Analysis
(BANPA). Due to our limited knowledge of phonetic features that correspond
to the human voice, we are working with incomplete or uncertain data when
trying to evaluate the overall value of a spoken phrase. A Bayesian network was
chosen for this sub-system because it allows us to combine several sources of
data (acoustic parameter evaluations) each with varying amounts of useful
information. We can then analyze and determine probabilistic relationships
among this data, combine these relationships and come to a conclusion or
overall value.
The illustration below depicts the AFP architecture.
Figure 3.3 The AFP Architecture
Note in the figure above that the speaker first reads and speaks each
available password at each level in the hierarchy. For each password spoken,
the acoustic parameters are extracted and sent to FREP for parameter
evaluation. Each evaluated parameter is then sent (in tandem) to BANPA
where the overall phonetic value of the password is determined. This is done
for each password on the same level of the password hierarchy. These values
are then returned to the password model hierarchy, where they determine the
next level of passwords to be selected.
The remaining sections of this chapter describe these two main sub-
systems in detail.
Phonetic Acoustic Parameter Evaluation Using Fuzzy Rules
Through much previous research in phonetics, linguists have noted the
importance of phonetic features in describing the sound structure of language
and in postulating how the sound systems of language change. It has been
proposed that speech sounds occurring in the languages of the world could be
described in terms of a finite set of phonetic features [17]. These phonetic features
can be defined in reference to acoustic patterns or acoustic properties derivable
from the speech signal. The acoustic properties can be defined in terms of
spectral patterns, changes in overall amplitude, or fluctuations in energy
extracted from the incoming speech signal [18].
In Adaptive Forward Planning, a number of acoustic parameters are
extracted from the speech signal so that the overall value of a phrase can be
determined. For example, the acoustic features fundamental frequency, gross spectral
shape, energy frequencies, and formant frequencies provide a consistent set of features
whose values differ between speakers. Not only do these features differ between
individual speakers, they also differ between repetitions of the same utterance
spoken by the same speaker. The acoustic features of a speech signal allow us to identify
phonetic features such as consonant stops or intonation. Because the production of
human voiced sounds varies so dramatically between speakers, we are working
with a considerable amount of ambiguity when computing the value of an acoustic
parameter. This is significant when using these parameters for speaker
verification.
The challenge is to find a way to determine the value of an acoustic
parameter based on acoustic features extracted from an incoming speech
signal. Each acoustic parameter contributes to a phonetic property in varying
ways. For instance, a small amount of low frequency energy contributes very
little to the phonetic property of intensity, but a greater amount contributes
enormously. It therefore becomes important to detect the different levels that
each acoustic feature can contribute to a phonetic property. We cannot assume
that an acoustic parameter either contributes or it does not. Rather, we must
determine how much it contributes. Even though the measurement or graph
of one instance of an acoustic parameter differs from another instance, it still
may be acceptable under certain criteria. A fuzzy rule system provides an
excellent vehicle in which to approach our problem. The criteria in which we
accept one measurement over another can only be defined in terms of fuzzy
rules.
In order to determine the value of an acoustic parameter, we must
establish a measure of quality or baseline against which we can measure. For
each password model in our system, we can establish a baseline measurement
for each acoustic feature associated with that password. To accomplish this, a
database of spoken phrases is collected, and a baseline acoustic feature set is
established for each phrase. For each spoken phrase, the set of acoustic
parameters is extracted. Then, all measurements of an extracted acoustic
parameter are averaged. This average establishes a baseline or measure of
quality for an individual acoustic parameter associated with a phrase. The
illustration below depicts this. Speaker A through speaker X speak the same
password, "hello." The acoustic parameter fundamental frequency (FF) is extracted
for each spoken instance of "hello" for each speaker. The X occurrences of FF
are then averaged to form the FF baseline for the password "hello."
[Figure: average fundamental frequency plot for the phrase "hello".]
Figure 3.4 Establishing A Baseline For The Acoustic Parameter FF
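A minimal sketch of this averaging step, assuming (purely for illustration) that the extracted FF contours have already been aligned to a common length, which the text above does not detail:

    import numpy as np

    def build_baseline(ff_contours):
        """ff_contours: list of equal-length fundamental-frequency contours,
        one per speaker, all for the same spoken password.
        Returns the point-wise average contour used as the baseline."""
        stacked = np.vstack(ff_contours)   # shape: (num_speakers, num_frames)
        return stacked.mean(axis=0)        # average across speakers

    # e.g. baseline_hello_ff = build_baseline([ff_speaker_a, ..., ff_speaker_x])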
Once the acoustic parameter baseline has been established, we can
compare subsequent acoustic parameter extractions against it. To determine
the quality or amount of information we gain from this comparison, fuzzy rules
will be used.
Fuzzy Rules For Acoustic Parameter Value Determination
In comparing one acoustic parameter with another, we cannot simply
claim that the two parameters match or they do not. Rather, we must define
the comparison in terms of a possibility distribution. For instance, when
studying the acoustic parameter fundamental frequency, many variances are
detected. The varying rate of fundamental frequency is determined by the
shape and mass of the moving vocal cords, the tension of the laryngeal
muscles, and the air pressure generated by the lungs [19]. When two different
speakers articulate a phrase, the shape of the vocal cords may be similar, but
the air pressure generated by the lungs could be very different. In addition,
attempting to measure the difference of tension in the laryngeal muscles is
nearly impossible.
Rather than focus on these types of differences, we can utilize our
general knowledge of how the fundamental frequency rate and other acoustic
parameters vary among individual phrases. Because of the way a speaker
produces sounds, we can establish commonalities among individual phrases. If
we plot two different occurrences of an acoustic parameter extracted from a
spoken phrase, we can calculate the difference in these commonalities. In this
research, two general criteria for determining how two acoustic parameters
differ have been selected: correlation coefficient and regression line slope difference.
Using fuzzy rules, these two criteria become our linguistic variables.
Linguistic Variables
For each acoustic parameter, two linguistic variables are created:
correlation, and regression line slope difference. The values of these two variables are
computed by comparing the baseline measurement with the incoming speech
signal. The linguistic variables are defined as follows:
Correlation Coefficient. The correlation coefficient is computed from
the graph of the baseline measurement and the graph of the incoming acoustic
parameter. By using the correlation coefficient, we can determine the similarity
between the two acoustic parameters. The equation for the correlation
coefficient is:
Cov (x,y)
cc ----------
Ox (Ty n
where Covariance : Cov(x,y) = (1/n)5^ (xj Bx) (Yi By)
i=l
and pv = the mean of y, px = the mean of x
Equation 1 Correlation
The value would be high if there is a strong relationship, and low if
there is a weak relationship.
Regression Line Slope Difference. For each acoustic parameter's plot,
the slope of the linear regression line through the data points (known y's and
known x's) is calculated. The slope is the vertical distance divided by the
horizontal distance between any two points on the line, which is the rate of
change along the regression line. The equation for the slope of the regression
line is:
b = [n Σxy - (Σx)(Σy)] / [n Σx² - (Σx)²]

Equation 2 Slope Of The Regression Line
This is calculated for the baseline measurement and for the incoming
speech signal. Then the difference between the two is calculated, and a value
for the variable is determined. The value would be high if there is very little
difference, and low if there is a great amount of difference.
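The two criteria can be computed directly from Equations 1 and 2. The sketch below follows those equations; the function names are mine, the contours are assumed to be equal-length sequences, and treating the sample index as the x coordinate of each plot is my own simplifying assumption:

    import numpy as np

    def correlation_coefficient(baseline, incoming):
        """Equation 1: Cov(x, y) / (sigma_x * sigma_y)."""
        x, y = np.asarray(baseline, float), np.asarray(incoming, float)
        cov = ((x - x.mean()) * (y - y.mean())).mean()
        return cov / (x.std() * y.std())

    def regression_slope(values):
        """Equation 2: slope of the least-squares regression line through
        the points (i, values[i]), using the index i as the x coordinate."""
        y = np.asarray(values, float)
        x = np.arange(len(y), dtype=float)
        n = len(y)
        return (n * (x * y).sum() - x.sum() * y.sum()) / \
               (n * (x * x).sum() - x.sum() ** 2)

    def slope_difference(baseline, incoming):
        """Absolute difference between the two regression-line slopes."""
        return abs(regression_slope(baseline) - regression_slope(incoming))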
The illustration below depicts two different fundamental frequency
parameter extractions for the phrase hello:
[Figure: FF plot of baseline versus incoming speech signal (lighter = incoming signal, darker = baseline); horizontal axis: time (msec).]
Figure 3.5 Plot Comparison Of Two Fundamental Frequency Parameters
The two linguistic variables were designed to take on one of five values:
very low (VL), low (L), medium (M), high (H), and very high (VH). These five
values allow each variable to be separated into five measurement qualities. A
total of five values was used for simplicity and as a starting point for this
research. They were adapted from [20]. Many more values could be used if so
desired. In addition, by keeping this number low, a simple set of consistent
fuzzy rules was easier to derive.
We have now defined two fuzzy rule determination sets each with a
possibility distribution (VL .. VH). By defining our sets in this way, we can
define a fuzzy relationship between these two linguistic variables and the
possibility distribution for the value of an acoustic parameter. By relating the
values of our linguistic variables to the value of an acoustic parameter, we can
then define a set of fuzzy rules used for determining the value or quality of an
acoustic parameter. The table below defines the rules.
                  Slope diff.
             VL   L    M    H    VH
Correlation
  VL         VL   L    L    M    H
  L          L    L    M    M    H
  M          L    M    M    H    VH
  H          M    M    H    VH   VH
  VH         H    H    H    VH   VH

Table 3.1 Fuzzy Rules For Determining The Quality Of An Acoustic Parameter
Again, these rules were defined for simplicity and also to serve as a
starting point for this research. They were also adapted from [21]. A total of
twenty-five rules were defined. We can now define fuzzy rule functions that
will allow us to mathematically compute numerical values that map to the
distributions shown in the figure above. The illustration below depicts how the
fuzzy rule functions are distributed.
Figure 3.6 Fuzzy Rule Distribution
The mapping functions depicted above were chosen, again, as a starting
point for this research. They were adapted from [22, 23].
Fuzzy Rule Process (Defuzzification). Each of the fuzzy rule functions
can be coded as if-then-else rules that output an acoustic
parameter value. For instance, referring to the figures above: if the correlation
between parameters is medium (M), and there is very little difference in slope
between the parameters (a smaller difference means a higher value, H), then the
acoustic parameter value is high. Each of our rules can be implemented this
way. The ranges for each of the values can be set according to the following
table.
VL   0.00 - 0.15
L    0.15 - 0.30
M    0.30 - 0.60
H    0.60 - 0.85
VH   0.85 - 1.00

Table 3.2 Values Computed By Fuzzy Rules
The ranges in the table above were selected because they represent a
distribution of five values ranging from 0 to 1, and they correlate to the
mappings above. Because five values were selected, five ranges were
necessary. The middle value, M, was defined to be twice as large as the other
four so that the other four would have equal ranges. These values could be
distributed in many other ways as well. Different value distributions could be
explored in future research.
In order to determine how much value the two linguistic variables
contribute, each variable's value is calculated in terms of percentages.
Then, both percentages are averaged, which results in an overall percentage
contribution or "worth." Once this percentage contribution is calculated, it
is then mapped to the acoustic parameter value according to the table above.
This can be implemented for each of the 25 rules. For example, if the slope
difference value was 0.45 and the correlation value was 0.75, this translates to:

slope difference value = M (0.45), correlation value = H (0.75).

The percentage of M's window for the slope difference = 0.45 - 0.30 = 0.15. Then, 0.15/0.30 =
0.5. Here, 0.5 corresponds to halfway through the window, or 50% of the window.
Then, the percentage of H's window for correlation = 0.75 - 0.60 = 0.15.
Then, 0.15/0.25 = 0.6. Here, 0.6 corresponds to 10% past halfway through the
window, or 60% of the window.

These two percentages are then averaged: (0.5 + 0.6)/2 = 0.55. This
percentage, 0.55, is then used to calculate the amount of the output value that this
particular rule maps to. Mapped rule: when correlation is H and slope is M,
the parameter value is H. How much H? This is calculated with the above
percentage contribution, 0.55: H ranges from 0.60 to 0.85, a total of 0.25,
and 55% of this range is (0.55)(0.25) = 0.1375. So, 0.60 + 0.1375 = 0.7375;
therefore, the amount of H is 0.7375. All 25 rules are computed in this way. All
rules were chosen to be computed the same way so that a starting point could
be established. This is simply one interpretation of the fuzzy rule defuzzification
process described above.
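The walk-through above can be condensed into a short sketch. The range and rule tables are transcribed from Tables 3.2 and 3.1; the function names are mine, and the handling of values that fall exactly on a range boundary is an arbitrary choice the thesis does not specify:

    # Sketch of the FREP defuzzification walk-through above. Inputs in [0, 1]
    # are assumed to be the already-normalized linguistic-variable values.

    RANGES = {               # Table 3.2
        "VL": (0.00, 0.15), "L": (0.15, 0.30), "M": (0.30, 0.60),
        "H": (0.60, 0.85),  "VH": (0.85, 1.00),
    }

    RULES = {                # Table 3.1: (correlation, slope) -> output value
        ("VL","VL"):"VL", ("VL","L"):"L", ("VL","M"):"L", ("VL","H"):"M", ("VL","VH"):"H",
        ("L","VL"):"L",   ("L","L"):"L",  ("L","M"):"M",  ("L","H"):"M",  ("L","VH"):"H",
        ("M","VL"):"L",   ("M","L"):"M",  ("M","M"):"M",  ("M","H"):"H",  ("M","VH"):"VH",
        ("H","VL"):"M",   ("H","L"):"M",  ("H","M"):"H",  ("H","H"):"VH", ("H","VH"):"VH",
        ("VH","VL"):"H",  ("VH","L"):"H", ("VH","M"):"H", ("VH","H"):"VH",("VH","VH"):"VH",
    }

    def label_of(v):
        for label, (lo, hi) in RANGES.items():
            if lo <= v <= hi:
                return label
        raise ValueError("value outside [0, 1]")

    def window_fraction(v):
        lo, hi = RANGES[label_of(v)]
        return (v - lo) / (hi - lo)          # how far into the label's window

    def parameter_value(correlation, slope):
        out_label = RULES[(label_of(correlation), label_of(slope))]
        fraction = (window_fraction(correlation) + window_fraction(slope)) / 2
        lo, hi = RANGES[out_label]
        return lo + fraction * (hi - lo)     # map the averaged fraction into
                                             # the output label's window

    # Worked example from the text: correlation 0.75 (H), slope value 0.45 (M)
    # -> fractions 0.6 and 0.5, average 0.55, output window H -> 0.7375.
    print(parameter_value(0.75, 0.45))       # 0.7375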
In future research, several different variations of linguistic variable
values, fuzzy rule definitions, and defuzzification processes could be
experimented with.
Subjective Bayesian Inference Networks For Password Analysis
For each password model described above, AFP must determine the
value or quality of each separate model in terms of its inherent acoustic
features. By determining these properties, AFP can make the best possible
decisions in building an individual user's unique acoustic feature set and
determine the most effective password available. Previous work in the acoustic
theory of speech production has shown that a number of phonetic parameters
may be used to characterize different speech sounds. These parameters may be
used for phonetic processing of the speech sounds and vary according to
individual speaker and techniques of extraction.
The task is to choose a representative set of parameters that can be
used to characterize different speech sounds in terms of phonetic qualities.
Unfortunately there is not, at this time, a complete, specific and definitive set
of phonetic characterizations that correspond to the human voice. By
capturing those phonetic characteristics that we can, we can at best only
capture a particular set or subset of all the possible phonetic features of the
human voice. In addition, the inherent complexity and fuzziness of acoustic
parameters forces us to make, at best, estimated conclusions about the
meaning or relevance of phonetic parameter data. Because of these
constraints, we are working with incomplete or uncertain data, yet we still must
draw inferences and make useful judgments.
Our goal is to determine the value or quality of a particular phrase. To
accomplish this, we must use what acoustic information we have and draw
conclusions. In dealing with incomplete or uncertain data, many techniques
have been developed to aid in forming judgments or conclusions [24]. Probability
theory provides a powerful mechanism for dealing with inference type
problems such as the one AFP has in determining the value or quality of a
spoken phrase. Not only must we determine the value or quality of a particular
phrase, but we must also capture those phonetic features that are qualitative to
a unique human voice. In this way we capture two types of important
information that we can use in speaker verification: a specific password model
that the user can use for verification, and a unique acoustic feature set that
corresponds to an individual user that can later be used to aid in the
verification process.
Subjective Bayesian inference networks provide a powerful model
which can provide a measure of confidence for the overall phonetic quality of a
particular spoken phrase. By translating values we place on acoustic features to
phonetic qualities, we can determine the overall value of a particular password.
The Bayesian network model allows us to do this. First, we will discuss
potential acoustic features that can be used as inputs to the network.
Acoustic Feature Inputs
A set of acoustic parameters can be extracted from the incoming
speech signal so that phonetic analysis may be carried out. Each of the acoustic
parameter inputs used for phonetic analysis will be computed in terms of their
total informational value. Acoustic parameters that can be used in this way are
discussed below.
Fundamental Frequency. During the production of voiced sounds, the
vocal cords are set into vibration. The fundamental frequency of voicing can be
used as a voicing indicator. Voicing distinction may be captured indicating the
presence of prosodic information such as stress, intonation, and intensity.
Spectral Shape. Characteristics of speech events, such as the production
of fricatives and the onset of plosive releases, are best characterized in terms of the
gross spectral shape of the speech signal. The spectral energies may be derived
from the gross spectral shape, thus providing spectral information as input.
Low-Frequency Energy, Mid-Frequency Energy, High-Frequency
Energy. One of the most important characteristics of the speech signal, if not
the most important, is the fact that its intensity varies as a function of time. For example, sharp
intensity changes in different frequency regions (i.e. low, medium, or high)
often signify boundaries between speech sounds. One example of this is that low
overall intensity typically signifies a weak fricative, whereas a drop in mid-
frequency intensity usually indicates the presence of an intervocalic consonant.
First Formant Frequency, Second Formant Frequency, Third Formant
Frequency. Previous studies have found that the first three formants of vowels
and sonorants carry important information about the articulatory configuration in the
production of speech sounds. These frequencies can be used to classify vowels
and consonants.
By placing values on each of these parameters, we can supply our
Bayesian network with the necessary input values. The diagram shown below
indicates the relationships between incoming acoustic parameters (input nodes)
and phonetic classifications (lumped nodes) which serve as lumped evidential
variables that summarize the incoming acoustic parameters. Also, the
relationships between phonetic classifications and the attributes that contribute
to the phonetic quality of a phrase (predictor nodes) are depicted.
[Figure: Bayesian network for acoustic phonetic analysis. Input nodes: fundamental frequency, spectral shape, low-frequency energy, mid-frequency energy, high-frequency energy, first formant frequency, second formant frequency, and third formant frequency. The input nodes feed the lumped nodes, which feed the predictor nodes, which combine into a single output node: the phonetic quality of the phrase.]
Figure 3.7 Bayesian Network For Phonetic Analysis
Each of the acoustic parameters is extracted and analyzed separately
to produce an input value that enters the left side of the network. These values
are then fed through the network to produce phonetic classification values on
the lumped nodes. Then these values are fed to the predictor nodes, and their
values are updated. Finally, the predictor node values are combined to produce
an output on the far right-hand side of the network. This final output
represents the overall phonetic value or quality of the phrase from which the
acoustic parameters were extracted.
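A structural sketch of this left-to-right flow is given below. The particular parent links shown are illustrative only; the thesis selects the actual links (and their strengths) from expert knowledge, as described in the following sections, and the update function here is merely a stand-in for the subjective Bayesian update developed later in this chapter:

    # Structural sketch of the BANPA flow (input -> lumped -> predictor -> output).
    # The wiring below is illustrative, not the thesis's actual link set.

    LUMPED_PARENTS = {
        "intensity": ["low_freq_energy", "mid_freq_energy", "high_freq_energy"],
        "stress": ["fundamental_frequency", "spectral_shape"],
        "intonation": ["fundamental_frequency"],
        "articulatory_configuration": ["formant_1", "formant_2", "formant_3"],
    }
    PREDICTOR_PARENTS = {
        "vowel_presence": ["intensity", "articulatory_configuration"],
        "prosodic_quality": ["stress", "intonation"],
    }

    def propagate(inputs, update):
        """inputs: acoustic parameter name -> value in [0, 1].
        update: callable(parent_values) -> node value (Bayesian update stand-in)."""
        lumped = {name: update([inputs[p] for p in parents])
                  for name, parents in LUMPED_PARENTS.items()}
        predictors = {name: update([lumped[p] for p in parents])
                      for name, parents in PREDICTOR_PARENTS.items()}
        return update(list(predictors.values()))   # overall phonetic quality

    # e.g. propagate(inputs, update=lambda vals: sum(vals) / len(vals))
    # (a plain average as a placeholder update rule)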
Phonetic Classification
In order to define relationships between acoustic parameters and
phonetic qualities, the following phonetic classifications are necessary. These
classifications are a representative set and are by no means exhaustive or
complete.
Intensity. The overall intensity of a speech signal may be used to detect
the presence of vowel and consonant strengths.
Stress. This is known as the most basic abstract prosodic feature. Stress
patterns provide information about intonation and phonological phrases and can
aid in classifying the vocal effort of the speaker.
Intonation. Intonation, known as the pitch contour of an utterance,
provides vital clues to linguistic structures. Also, intonation may be studied to
derive length characteristics so that differences may be obtained between low
vowels, tense vowels, and lax vowels.
Articulatory Configuration. By analyzing the articulation of speech,
acoustic distinctions may be made. For example, there are definite rates of
respiratory airflow below which airflow is laminar and above which airflow is
turbulent. This yields sharp distinctions between sonorant sounds and fricatives. In
addition, rather abrupt changes in acoustic features may be detected
corresponding to the opening of the velic valve for nasalisation.
Phonetic Quality Attributes
These attributes constitute the overall phonetic quality of a particular
phrase. By combining the values or contributions of the phonetic
classifications previously discussed, we can define phonetic quality attributes.
Vowel Presence. Vowels can be detected by the presence of substantial
energy in the low and mid frequency regions of the speech signal. They may be
characterized by the steady state values of the first three formant frequencies.
Consonant Presence. Consonants are usually divided into several
groups depending on the manner in which they are articulated; there are five
such groups in English. Plosives are characterized acoustically by a period of
prolonged silence, followed by an abrupt increase in amplitude at the
consonantal release. Fricatives are detectable by the presence of turbulent
noise. Nasals are invariably adjacent to a vowel, and are marked by a
sharp change in intensity and spectrum. Glides occur only next to a
vowel, as formant transitions into or out of the vowel; these transitions are fairly smooth and
much slower than those of other consonants. Affricates are characterized
as a plosive followed by a fricative.
Prosodic Quality. One of the areas of speech understanding that has
not been fully tapped is that of extracting prosodic information. Prosodic
information of speech consists of stress patterns, intonation, pauses, accent,
and timing structures. By computing an overall prosodic quality of a phrase, we
can capture unique characteristics of the incoming speech signal.
Articulation Quality. This represents the overall quality of detectable
articulation parameters.
Total Energy. This represents the total speech signal energy obtained
and may be used to evaluate overall intensity changes which indicate the
presence or lack of presence of phonetic parameters.
Probability Relationships For Phonetic Attributes
The model for determining the quality of a phrase that AFP uses must
account for degrees of evidence and degrees of confidence in the hypothesis
that a particular spoken phrase is of a certain quality. Probability theory
provides the mathematical formalism in which we need to work. Each node in
our network can be thought of a model in which evidence (E) can provide
support for, or against, a hypothesis (H). Using basic probability theory we can
view the nodes in our Bayesian network as events that have been derived from
prior knowledge about the relationships that exist between it and a node that
led to it. It is now necessary to define relationships.
Odds Formulation. The odds of an event A is defined as:

O(A) = p(A) / p(~A)

Likelihood Ratio. The likelihood ratio λ for events A and B is given by:

λ = p(A | B) / p(A | ~B)

This is useful because it can be viewed as a way to update the odds of a
hypothesis H based on the evidence E:

if λ = p(E | H) / p(E | ~H), then O(H | E) = λ O(H).
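In code, this odds bookkeeping is only a few lines. A minimal sketch (the function names are mine):

    def odds(p):
        """O(A) = p(A) / p(~A)."""
        return p / (1.0 - p)

    def prob(o):
        """Invert the odds back to a probability: p = O / (1 + O)."""
        return o / (1.0 + o)

    def update_odds(prior_h, lam):
        """O(H | E) = lambda * O(H): update the odds of H given evidence E,
        with likelihood ratio lambda = p(E | H) / p(E | ~H)."""
        return lam * odds(prior_h)

    # e.g. prior p(H) = 0.2 and lambda = 4 give posterior p(H | E) = 0.5:
    print(prob(update_odds(0.2, 4.0)))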
Our overall goal throughout the network is to estimate the conditional
probability of a hypothesis (H) that a phonetic classification or attribute exists
given the evidence (E) of acoustic parameters extracted from the speech signal.
Two more terms are necessary to support our understanding of how E relates
to H: sufficiency λ and necessity λ'. These two numbers are used to capture
the prior knowledge of the probabilistic relationship between E and H.
Depending on the relationship between E and H, these limiting cases will vary
somewhat. For example if the acoustic parameter of fundamental frequency
and the phonetic classification of intonation were related as in the sketch:
[Figure: sketch relating fundamental frequency to intonation.]
Figure 3.8 Sufficiency Relationship Between Fundamental Frequency And Intonation
then whenever fundamental frequency is present, the phonetic
classification intonation occurs; thus, fundamental frequency is
sufficient for intonation to occur.
On the other hand, if the relationship between the acoustic parameter
low frequency energy and the phonetic classification of intensity were depicted
as:
[Figure: sketch relating low frequency energy to intensity.]
Figure 3.9 Necessity Relationship Between Low Frequency Energy And Intensity
then whenever the intensity characteristic occurs, the acoustic
parameter type of low frequency energy is certain to occur. Thus, the acoustic
parameter type of low frequency energy is necessary for the intensity
characteristic to occur.
The more general cases are handled by a mix of the values λ
and λ', where λ usually runs from 1 up to large values based on the degree of
sufficiency between acoustic parameter types and phonetic classification types, and λ'
runs between 0 and 1 based on the degree of necessity between acoustic parameter
types and phonetic classification types. This mix of values is also used in
determining the relationships between phonetic classification types and the
attributes that constitute the quality of a particular phonetic phrase. In our
subjective Bayesian network, the challenge lies in coming up with values of λ
and λ' for each link in the inference network that best represent expert
knowledge about each relationship.
Measurement of the Evidence (E).
In our model, evidence (E) of an acoustic parameter or the presence of
a phonetic type arrives in an uncertain fashion. Each parameter input to the
system is really a measurement of the degree of existence of that acoustic
parameter or phonetic type. Assume a measurement is made, call it E', that
tells us how sure we are that the acoustic parameter or phonetic type is actually
present. The value t = p(E | E'), with 0 ≤ t ≤ 1, represents the certainty
that E has or has not occurred. A value of 1 means that E has absolutely
occurred, and a value of 0 means that E has absolutely not occurred. We can
now estimate p(H | E') based on the measurement of E and our prior
knowledge. To carry out this estimation, an approximation based on linear
interpolation is used [25]. The following diagram depicts the approximation model:
[Figure: vertical axis: p(H | E'), the estimated conditional probability of the phonetic classification type or phonetic quality attribute given the measurement of the existence of an acoustic parameter or phonetic type; horizontal axis: t = p(E | E'), the measurement that tells us how sure we are that a given acoustic parameter or phonetic type is present.]
Figure 3.10 Approximation Model For Evidence Of An Acoustic Or Phonetic Attribute
The graph shown above is used to calculate an estimate of the chances
of a phonetic type or phonetic attribute being present, p(H | E'). Implicitly, the
graph uses λ and λ' to get p(H | E) and p(H | ~E) from:

p(H | E) = λ O(H) / (1 + λ O(H))
p(H | ~E) = λ' O(H) / (1 + λ' O(H))

where O(H) = p(H) / p(~H)
We apply a graph such as this to every node in our Bayesian network
and update our hypotheses based on our prior knowledge and evidence
collected. We are assuming we have prior knowledge of the probability of a
phonetic type or phonetic attribute being present (p(H) above; note that a
phonetic type may also play the role of E, in which case it contributes to the conditional
probability of a phonetic attribute). Also, we assume we have prior knowledge
of the probability of an acoustic parameter being present (p(E) above). The
graph ensures that p(E) maps to p(H), meaning that if the measurement is
inconclusive (i.e. we have no measurement at all), then the estimate of the
chances of H should simply be the prior probability of H. If the measurement is
less than p(E), then we have evidence against the presence of E and therefore
reduce the chance of H. The shaded area depicted in the graph is said to
support the presence of H, because if the measurement is greater than our
prior knowledge of E, this maps to evidence of H beyond our prior knowledge
of H.
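A sketch of this interpolation, assuming the standard two-segment piecewise-linear form with endpoints fixed by λ and λ' (the shape matches the description above; the function name and argument order are mine, and 0 < p(E) < 1 is assumed):

    def p_h_given_e_prime(t, p_e, p_h, lam, lam_prime):
        """Piecewise-linear estimate of p(H | E') from the measurement
        t = p(E | E'), following the approximation model in Figure 3.10.
        lam (sufficiency) and lam_prime (necessity) fix the endpoints."""
        o_h = p_h / (1.0 - p_h)                                # prior odds O(H)
        p_h_e = lam * o_h / (1.0 + lam * o_h)                  # p(H | E)
        p_h_not_e = lam_prime * o_h / (1.0 + lam_prime * o_h)  # p(H | ~E)
        if t <= p_e:
            # Evidence against E: interpolate from p(H | ~E) at t = 0 up to
            # the prior p(H) at t = p(E), the inconclusive measurement.
            return p_h_not_e + (p_h - p_h_not_e) * t / p_e
        # Evidence for E: interpolate from the prior p(H) up to p(H | E) at t = 1.
        return p_h + (p_h_e - p_h) * (t - p_e) / (1.0 - p_e)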
Multiple Evidence
In our subjective Bayesian network, there are several places where
multiple evidence is required to update the probability of H. To deal with this,
a new term is defined:

λ_eff = O(H | E') / O(H), which is computed from p(H | E'):

O(H | E') = p(H | E') / (1 - p(H | E'))

For the purposes of this research, we make the assumption that the
multiple evidence is independent [26], and compute λ_eff for each link. Now, we
can define:

λ_EFF = λ_eff1 λ_eff2 ... λ_effn

This allows us to update the odds of H:
O(H | E1, E2, ..., En) = λ_EFF O(H).
It is probable that multiple evidence is not always independent. In this
case, we must make modifications that deal with these scenarios. This could be
dealt with in future research.
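Under that independence assumption, the combination step can be sketched as follows (the names are mine; each posterior is the interpolated p(H | E') for one parent link):

    def effective_lambda(posterior, p_h):
        """lambda_eff = O(H | E') / O(H), computed from the interpolated
        posterior p(H | E') for one parent link."""
        o_post = posterior / (1.0 - posterior)
        o_prior = p_h / (1.0 - p_h)
        return o_post / o_prior

    def combine_evidence(p_h, posteriors):
        """Assuming independent evidence, multiply per-link lambda_eff values
        and update the odds: O(H | E1..En) = LAMBDA_EFF * O(H)."""
        lam_eff = 1.0
        for post in posteriors:
            lam_eff *= effective_lambda(post, p_h)
        o_post = lam_eff * (p_h / (1.0 - p_h))
        return o_post / (1.0 + o_post)        # back to a probability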
Phonetic Analysis Relationships
For each node in our network, we must derive prior knowledge of the
probability of E, our evidence, and of H, our hypothesis. For H, p(H) is the
chance that H is present before any evidence E is presented. For E, p(E) is the
chance that the evidence is present or not. In addition, values must be selected
for sufficiency λ and necessity λ' by carefully looking at each separate link in
the network and determining how these values best describe each relationship.
These values can be selected based on the expert knowledge available about each
relationship in our network.
Summary
In this chapter we have introduced a new method called Adaptive
Forward Planning (AFP) which evaluates and utilizes acoustic parameters of
speech for speaker verification. By utilizing a fuzzy rule system (FREP), we
have shown how acoustic parameters extracted from speech can be evaluated
by measuring their values against a baseline (average) measurement. We then
use these values as inputs to a Bayesian inference network (BANPA). The
Bayesian network was developed by analyzing relationships between phonetic
attributes and phonetic classifications inherent to the human voice.
There were many potential variations to the components of the AFP
architecture that were discussed. The components of most interest were
linguistic variable values, fuzzy rules, defuzzification processes, values chosen
for necessity and sufficiency, the process and methods of updating the
network, and the handling of multiple evidence. The variations available to
these components are almost endless. Each should serve as a focal point for
future research.
CHAPTER 4
IMPLEMENTATION
Now that we have defined the AFP architecture, we describe the
implementation of certain components and evaluate their performance. In this
chapter, we will discuss the implementation details and some of the
experimental results.
System Overview
Adhering very closely to the model described in chapter 3, a prototype
system was built. There were limitations to the implementation due to lack of
resources and time. These limitations will be pointed out where applicable
throughout this chapter. The prototype was built in Microsoft Windows 95,
using Asymetrix's Multimedia ToolBook 4.0. The system was built in three
phases. The first phase implemented was the BANPA sub-system. The second
was the FREP sub-system. The third phase consisted of tying together the
BANPA, FREP, Password Models, and acoustic extraction functionality. Each
of the phases is briefly discussed below.
BANPA Implementation
The BANPA sub-system described in chapter 3 was constructed as
depicted. The sub-system was built so that a user can enter values for the
acoustic parameters, and then compute an overall value for a phrase that the
acoustic parameters were extracted from. The values for necessity and
sufficiency for each of the relationships in the network were entered into an
Excel spreadsheet. The spreadsheet was saved as a text file. When the sub-
system was run, the data from the spreadsheet was read into the network and
assigned to all interior nodes on the network. The data used for the BANPA
system is shown in appendix A. These values were used for BANPA simulation
runs, and for the actual finished AFP system. The interface for inputting
acoustic parameter values and computing the overall value of a phrase is shown
below.
[Figure: screen capture of the BANPA interface in Multimedia ToolBook (AFP.TBK).]
Figure 4.1 The BANPA User Interface
A series of simulations was run to determine the overall sensitivity and
behavior of the Bayesian network. The results of these runs can be found in
appendix B. The source code used to implement the sub-system is available
upon request.
FREP Implementation
The fuzzy rule sub-system FREP as described in chapter 3 was
constructed as depicted. The sub-system was built so that a user can enter
values for the correlation and slope values as would be derived when
comparing an acoustic feature extraction against the baseline (or average) of
that feature. The interface that was built for FREP is shown below.
[Figure: screen capture of the FREP interface in Multimedia ToolBook (AFP.TBK).]
Figure 4.2 The FREP User Interface
As was done for the BANPA simulation, a series of simulations was
run to determine the overall sensitivity and correctness of the fuzzy rule
system. The results of these runs can be found in appendix C. The source code
used to implement the sub-system is available upon request.
Password Model Implementation
A set of 21 passwords was chosen for this initial study. Each word was
chosen based on its acoustic correlation to various phonetic properties
inherent in the English language. By studying these properties, the words were
partitioned into four separate phonetic classifications: vowels, stop consonants,
liquids and glides, and nasal consonants. Four initial words were required to be
spoken up front. Each one of these correlated to one of the four phonetic
classifications mentioned above and served as a starting point for password
selection. These four password models are the initial set of passwords
mentioned in chapter 3. Each of the four phonetic classifications and the
words chosen for each are briefly described below. For a more in-depth
discussion of acoustic phonetic theory, see [17], [27], [28].
Vowel Password Selection. Vowel characterized words were chosen
based on four properties: long vowel duration, short vowel duration, vowel nasalisation,
and oral vowels. The words chosen for these vowel characteristics are as follows:
"animated". This word serves as the initial starting point or initial password for
vowel classification. It was chosen in an attempt to capture all four properties
mentioned above. If AFP chose this as the best initial password,
then the speaker would be asked to speak the following four vowel-
characterized passwords.
"heed". This was chosen to represent the characteristic of long vowel duration.
"can". This was chosen to represent the characteristic of vowel nasalisation.
"high". This was chosen to represent the characteristic of an oral vowel.
"hid". This was chosen to represent the characteristic of short vowel duration.
Stop Consonant Password Selection. The stop consonants studied, [p t
k b d g], were chosen based on five articulation characteristics: labials, alveolars,
velars, voiced, and non-voiced.
"powderkeg". This word serves as the starting point or initial password for stop
consonant classification. It was chosen in an attempt to capture all five
characteristics mentioned above. If AFP chose this as the best
initial password, then the speaker would be asked to speak the following five
stop-consonant-characterized passwords.
"people". This was chosen to represent the characteristics of labials [p b].
"date". This was chosen to represent the characteristics of alveolars [t d].
"keg". This was chosen to represent the characteristics of velars [k g].
"dogbone". This was chosen to represent the characteristics of voiced stops [b d g].
"ketchup". This was chosen to represent the characteristics of non-voiced stops [p t k].
Liquids/Glides Password Selection. The acoustic sounds [l r] are referred
to as liquids, and [w y] as glides. The duration of liquids and glides has been
found to produce identifiable acoustic characteristics. The passwords chosen
for this classification were broken into four characteristics: short liquid, long
liquid, short glide, and long glide.
"lawyer". This word serves as the starting point or initial password for the
liquids/glides classification. It was chosen in an attempt to capture all four
characteristics mentioned above. If AFP chose this as the best
initial password, then the speaker would be asked to speak the following four
liquids/glides-characterized passwords.
"love". This was chosen to represent the characteristics of a short duration liquid.
"room". This was chosen to represent the characteristics of a long duration liquid.
"weather". This was chosen to represent the characteristics of a short duration glide.
"yams". This was chosen to represent the characteristics of a long duration glide.
Nasal Consonant Password Selection. The nasal consonants [m n] were
chosen based on the variety of acoustic consequences created by the opening
of the nasal cavity when sound is propagated through both the nose and mouth.
"midnight". This word serves as the starting point or initial password for nasal
consonant classification. It was chosen in an attempt to capture the
characteristics mentioned above. If AFP chose this as the best
initial password, then the speaker would be asked to speak the following four
nasal-consonant-characterized passwords.
"moon". This was chosen to represent the characteristics of a long duration nasal
consonant.
"nose". This was chosen to represent the characteristics of a short duration nasal
consonant.
"m&n". This was chosen to represent the characteristics of a combination of nasal
consonants.
"animal". This was chosen to represent the characteristics of a nasal consonant
combined with vowel nasalisation.
These 21 selected passwords can be viewed as a tree-like structure that
represents the relationships among them. The following figure depicts this.
[Figure: the password selection classification tree. The level-0 passwords "animated", "powderkeg", "lawyer", and "midnight" sit at the top; beneath them are their related passwords: "heed", "can", "high", "hid"; "people", "date", "keg", "dogbone", "ketchup"; "love", "room", "weather", "yams"; and "moon", "nose", "m&n", "animal".]
Figure 4.3 Password Models For Study
The HpW Works Analyzer For Acoustic Parameter Extraction
The main limitation of implementation was in the model for extracting
acoustic parameters. Due to the platform chosen for development, and
method of integration, AFP required an IBM-PC Windows 3.1/95-based
component. Only one usable system was found after a long, exhaustive
search. By far, this was the most difficult part of implementation. A system was
needed that executed the extraction of acoustic parameters from an incoming
speech signal. Most of the systems discovered were either very expensive and
proprietary, ran on incompatible platforms and/or hardware, or did not offer a
way to capture useful acoustic data. The only system found that provided
useful functionality and met the constraints listed above, was the HpW Works
Analyzer.
[Figure: the HpW Works Analyzer splash screen: Version 1.00.014 Demo, e-mail 100350.630@compuserve.com, Copyright 1995-1996 by HpW.]
Figure 4.4 HpW Works System Author And Version
Using the HpW Works Analyzer, Fast Fourier Transform (FFT) data,
computed with a full Hamming window, and spectral data were extracted from digital
audio files. The HpW Works system allows digital audio files to be processed,
and then allows data dumps to text files of both the FFT data and the spectral data.
The system has an easy-to-use interface and runs on Windows 3.1 and 95.
For all speaker subjects used in this research, digital audio files were first
recorded in the WAV format and pre-processed. Each was then individually
submitted to the HpW Works system, and two files, containing FFT data and
spectral data respectively, were created using the system. Even though the
HpW Works system only extracted these two types of data, it was very useful
and allowed the research to continue.
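HpW Works itself is proprietary, but a comparable extraction can be sketched with standard numerical libraries. The file name and frame length below are hypothetical and not taken from the original study; only the general recipe, windowed FFT magnitudes dumped to a text file, mirrors what was done:

    import numpy as np
    from scipy.io import wavfile

    # Comparable FFT extraction to what HpW Works produced (a sketch; the
    # file name and frame length are hypothetical, not from the original study).
    rate, samples = wavfile.read("hello_speaker1.wav")   # 22.050 kHz, 16-bit mono
    frame = samples[:1024].astype(float)
    windowed = frame * np.hamming(len(frame))            # full Hamming window
    spectrum = np.abs(np.fft.rfft(windowed))             # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)    # bin frequencies in Hz
    np.savetxt("fft_dump.txt", np.column_stack([freqs, spectrum]))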
AFP System Integration
Now that all of the sub-system implementations have been described,
the overall AFP system used in this research is described next. The goal was to
integrate all, or as many as possible, of the components depicted in chapter 3.
This made up the final AFP system used for research and testing. The most
painstaking part of integration was creating the acoustic feature data for each
sample recorded. This had to be done manually and was very time consuming.
Once the acoustic data files had been created for all of the samples used in the
research, the rest of the system integration went fairly quickly and easily.
The BANPA and FREP sub-systems were tied together under a
common interface. Then, new user interfaces were created so a speaker could
interact with the system, and carry out speaker verification processes. The
original goal was to allow a speaker to record his or her voice in real-time while
interacting with the AFP system. For experimental purposes, the functionality
to load acoustic feature data files was added. Each of the acoustic feature data
files was created as described above. Once these files had been created, they
could then be loaded into the system via the interface. An example of one of
the user interface screens is shown below.
[Figure: screen capture of the AFP stop consonant password selection screen. The passwords "people", "date", "keg", "dogbone", and "ketchup" are each listed with "Submit from file" and "Record" buttons, along with "Finished" and "To Main Menu" buttons and a field showing the user's selected password.]
Figure 4.5 User Interface Screen For AFP Password Selection
In the screen shown above, the user is requested to record or load
("Submit from file" button) data for each of the passwords. The system
interacts with the user exactly as described in chapter 3, and eventually finds
the password for the speaker to use. For the screen above, after all passwords
have been recorded or loaded, the user hits the "Finished" button; the
system then analyzes the password data and determines the next group of
passwords based on the analysis. The user is then presented with the next
interface screen.
AFP System Process
First, to carry out the research, speakers were recorded to tape. The
recording system used to record speaker subjects consisted of a super-VHS
recording deck with a built-in microphone. The microphone quality was very
similar to what might be used in a computer lab or office environment. Voice
samples were first recorded on this deck and then individually sampled from
the deck to a Pentium 166 MHz PC using a 16-bit Sound Blaster-compatible
sound card. All voice recordings were sampled at 22.050 kHz, in
mono, at 16 bits, and saved in the WAV file format.
Next, each sample was processed by truncating out useless artifacts
found at the beginning and end of each sample. Once all of the WAV files
were processed in this fashion, each was submitted to the HpW Works system
described above. From this system, the two acoustic parameters of FFT and
spectrum were extracted and dumped to separate data files. The two data files
were merged, and this represented the acoustic parameter data set of each
spoken password. The data set files were
then named and organized by separate speakers and stored in a common
directory. The following illustration depicts the AFP process.
Figure 4.6 AFP System Process
Speaker Verification Tests
In this phase of the research, a group of 9 speakers was studied. Each
speaker was asked to speak all 21 password models to simulate an initial user
training session. Each word was recorded and processed as explained above,
resulting in acoustic feature profiles containing FFT and spectral data for each
word spoken. In total, there were 9 speakers, each with 21 passwords recorded.
This resulted in 189 acoustic data files that were stored in a central directory on
the host computer. The next step was to record each speaker speaking each
password 5 separate times. This was done so that subsequent verification tests
could be carried out. In total, there were 6 × 9 × 21 = 1,134 recordings made.
The speaker data was used as is. That is, there was no
additional processing, such as time-warping or other types of filtering, done on
the digital audio files. The reason for this was to find out if verification could
be implemented without this type of additional processing. The HpW Works
system did a small amount of initial processing to execute a discrete fast Fourier
transform, but that was all.
There were two main goals of the speaker verification tests. The first
was to determine how well the system selected speaker passwords based on the
complete feature set (two in this case). The second was to determine how well
the system performed speaker verification using one, and then two acoustic
feature parameters.
Speaker Password Selection Results
Once all passwords had been recorded and processed, the acoustic data
collected for each was averaged among the 9 speakers as illustrated in chapter
3, and a "base" acoustic data file was created. This resulted in 21 base
password files, one for each password. Then, a user interaction was simulated
for each speaker by submitting their acoustic data files to the AFP system. As
explained in chapter 3, the correlation and slope difference were computed
against the base password data files and scored for the best match. The
resulting decisions made by the AFP system determined a particular speaker's
password. The results of the 9 speaker interactions and password selections are
shown below.
SPEAKER   SEX   AGE   PASSWORD SELECTED
Mike      m     35    weather
Carol     f     35    yams
Dean      m     42    people
Chris     m     39    nose
Linda     f     38    yams
Peggy     f     34    high
Sue       f     36    yams
Steve     m     32    ketchup
John      m     40    weather

Table 4.1 Results Of Password Selection
Using 9 speakers and 21 passwords, the AFP system illustrated a fairly
reasonable distribution of password selection. As the table above shows, only
two passwords were selected for more than one person. The figure below
illustrates this distribution.
[Figure: bar chart showing the distribution of password selections across the 9 speakers.]
Figure 4.7 Distribution Of Password Selection
Note that even though two passwords were each selected for more
than one speaker, the passwords were selected for uniqueness of speaker, not
for uniqueness of words. Because of this, it is irrelevant whether the passwords are
the same or different. What the previous two figures illustrate is that the way
people articulate certain words is similar, and these similarities can be grouped
together. For instance, the password "yams" was selected for three speakers,
all females of approximately the same age.
As was mentioned above, each speaker's acoustic data set was measured
against the baseline data set, which is the average of all 9 acoustic profiles for
each password. To illustrate this, the following set of graphs shows the results
for each speaker's spectral shape generated from the initial password
"animated".
[Figures 4.8 through 4.16: spectral shape for each speaker (speakers 1 through 9) versus the average spectral shape for the password "animated".]
As the graphs above illustrate, each speaker's rendition of the same spoken
word varies significantly. Because of this variance, speakers can be
successfully identified using acoustic data such as this. As mentioned above,
four initial passwords were used to differentiate between four phonetic
classifications. The table below depicts the speaker-phonetic classification
relationships that AFP determined.
SPEAKER    SEX   AGE   INITIAL PASSWORD   PHONETIC CLASSIFICATION
                       SELECTED           DETERMINED
1. Mike    m     35    lawyer             Liquids/Glides
2. Carol   f     35    lawyer             Liquids/Glides
3. Dean    m     42    powderkeg          Stop Consonants
4. Chris   m     39    midnight           Nasal Consonants
5. Linda   f     38    lawyer             Liquids/Glides
6. Peggy   f     34    animated           Vowels
7. Sue     f     36    lawyer             Liquids/Glides
8. Steve   m     32    powderkeg          Stop Consonants
9. John    m     40    lawyer             Liquids/Glides

Table 4.2 Phonetic Class Selection Of Speakers
For the word "animated" shown above, the acoustic parameters of speaker 6
matched the average spectral shape for "animated" very closely. Referring to
the graphs above for all speakers, speaker 6 is the closest match among all 9
speakers. The criterion for selecting a password could also be to find the
speaker who is furthest from the baseline average. This would distinguish the
speaker from all others by capturing how one speaker's articulation differs
from the other speakers in the system. This criterion should be examined in
future research.
The results of this experiment have shown a reasonable distribution of
acoustic feature differences among a small set of 9 people. This is
encouraging in the sense that, even with a small speaker population, the AFP
system was able to distribute passwords and distinguish each speaker against
the averaged profiles of all speakers.
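To make the selection procedure concrete, the following sketch outlines the
baseline-averaging and best-match scoring described above. It is a minimal
illustration, not the actual AFP implementation: the profile representation
(one vector per speaker), the combined score, and all function names are
assumptions made for this sketch.

import numpy as np

def baseline_profile(profiles):
    # Average the acoustic profiles of all speakers for one password
    # to form the base acoustic data file (21 such files were built,
    # one per password).
    return np.mean(np.asarray(profiles), axis=0)

def match_score(profile, baseline):
    # Score one speaker's profile against a baseline using the two
    # measures from chapter 3: correlation (higher is better) and
    # regression-line slope difference (lower is better). Combining
    # them by subtraction is a placeholder assumption.
    corr = np.corrcoef(profile, baseline)[0, 1]
    x = np.arange(len(profile))
    slope_diff = abs(np.polyfit(x, profile, 1)[0] -
                     np.polyfit(x, baseline, 1)[0])
    return corr - slope_diff

def select_password(speaker_profiles, baselines):
    # Pick the password whose baseline this speaker matches best.
    # Both arguments map password -> profile vector.
    return max(baselines, key=lambda pw: match_score(speaker_profiles[pw],
                                                     baselines[pw]))

The furthest-from-baseline criterion suggested above would simply replace max
with min in select_password.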
Speaker Verification
Once user passwords had been determined, the AFP system was challenged by
simulating subsequent access encounters to the system. The goal in this
portion of the study was to determine the accuracy of the AFP speaker
verification system: that is, how well the system verifies a speaker when he
or she encounters the system at a later date. At this point the system has
information stored about each user: the user's ID, the user's access
password, and the acoustic data profile of that password derived from the
user's first encounter with the system.
As mentioned before, 5 additional occurrences of each password were recorded
for each speaker in the study. Once the AFP system had determined a user's
password in the previous experiment, the 5 additional occurrences of that
password were processed as before and used to simulate 5 separate system
verification encounters. This gave a total of 45 attempted verifications to
the AFP system (9 users x 5 attempts each).
One Versus Two Acoustic Parameters. Each verification experiment was carried
out in two ways: first using only one acoustic parameter, and then using two.
The goal was to determine what difference, if any, occurred when using one
versus two acoustic parameters for verification.
Matching Algorithms
A variety of simple matching algorithms were employed in the verification
experiments. The goal was to determine what impact, if any, different matching
algorithms have on the verification process. The algorithms used in this study
are described below.
Algorithm 1: Correlation. Correlation, as defined in chapter 3, was computed
on two data sets. Out of all paired matches, the one with maximum value (i.e.,
best correlation) was identified. When two parameters are used, the maximum of
the two is taken.
Algorithm 2: Closest Match. Data points are compared between two similar data
sets, and the absolute difference between each pair of points is computed. Out
of all paired matches, the one with minimum value (i.e., the closest match)
was identified. When two parameters are used, the minimum of the two is taken.
Algorithm 3: Slope. Slope of the regression line, as defined in chapter 3,
was computed on two data sets. Out of all paired matches, the one with
minimum difference was identified. When two parameters are used, the
minimum of the two is taken.
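A minimal sketch of these three algorithms follows, assuming each data set is
a one-dimensional NumPy array of acoustic parameter values; the function names
are illustrative, not taken from the actual implementation.

import numpy as np

def best_by_correlation(test, candidates):
    # Algorithm 1: of all paired matches, keep the candidate with
    # maximum correlation against the test data set.
    return max(candidates, key=lambda c: np.corrcoef(test, c)[0, 1])

def best_by_closest(test, candidates):
    # Algorithm 2: keep the candidate with the minimum summed
    # absolute point-wise difference (the closest match).
    return min(candidates, key=lambda c: np.sum(np.abs(test - c)))

def slope(y):
    # Slope of the least-squares regression line through the data.
    return np.polyfit(np.arange(len(y)), y, 1)[0]

def best_by_slope(test, candidates):
    # Algorithm 3: keep the candidate whose regression-line slope
    # differs least from that of the test data set.
    return min(candidates, key=lambda c: abs(slope(test) - slope(c)))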
Algorithm 4: FREP. The FREP (Fuzzy Rule Evaluation of Parameters) sub-system
was used to determine how the two data sets compared. Correlation and slope
were computed for the pair of data sets, and the fuzzy rule system described
in chapter 3 was used to compute a final value. Out of all matches, the one
with maximum value was identified. When two parameters are used, the maximum
is taken.
Algorithm 5: FREP With BANPA (AFP). The FREP sub-system was combined with the
BANPA sub-system. Correlation and slope were computed for the pair, the fuzzy
rule system was applied, and the results were then sent to the Bayesian
network to compute a final value. Out of all matches, the one with maximum
value was identified. With one parameter, the second input to the Bayesian
network was held at 0.5. With two parameters, the two results were fed as
inputs to the network, as was done in the selection of passwords described
above. This is the complete AFP system implementation described in chapter 3.
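The composition of the two sub-systems in Algorithm 5 can be sketched as
follows. Both functions are stand-ins: the real FREP membership functions and
the BANPA network probabilities are those defined in chapter 3, which this toy
version does not reproduce.

def frep_score(corr, slope_diff):
    # Placeholder fuzzy evaluation: fold correlation (higher is
    # better) and slope difference (lower is better) into a single
    # value clamped to [0, 1].
    return max(0.0, min(1.0, 0.5 * (corr + (1.0 - slope_diff))))

def banpa_combine(frep_param1, frep_param2=0.5):
    # Placeholder Bayesian combination of the two FREP results. With
    # one parameter, the second input is held at 0.5 as described
    # above; a simple odds product stands in for the BANPA network.
    odds1 = frep_param1 / (1.0 - frep_param1 + 1e-9)
    odds2 = frep_param2 / (1.0 - frep_param2 + 1e-9)
    odds = odds1 * odds2
    return odds / (1.0 + odds)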
System Verification Criterion 1: Comparison Against The Same Passwords
To determine whether an incoming speech signal accurately verifies a speaker,
the acoustic parameters (FFT and spectrum) were extracted as before to form a
new acoustic data profile. This new profile was then compared against all
profiles of the same password from all users of the system. Using the matching
algorithms described above, the closest match was found and reported to the
system. If the match was not who the user claimed to be, verification failed;
otherwise verification was successful. The results of the 45 system
verification encounters are illustrated in the figures below.
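Before turning to the results, note that this criterion reduces to an
identification step followed by an identity check, as in the sketch below; the
user-ID scheme, the stored-profile mapping, and the higher-is-better score
function are assumptions made for illustration.

def verify_same_password(claimed_id, new_profile, stored_profiles, score_fn):
    # stored_profiles maps user_id -> that user's profile for the
    # same password. Find the closest match over all users with one
    # of the matching algorithms (score_fn, higher is better), then
    # succeed only if the best match is the claimed user.
    best_id = max(stored_profiles,
                  key=lambda uid: score_fn(new_profile, stored_profiles[uid]))
    return best_id == claimed_id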
[Figure 4.17 Successful Verifications: Against Same Passwords (One Parameter)
- bar chart of successful verifications (scale 0 to 45) for the algorithms
Corr, Closest, Slope, FREP, and FREP/BANPA]
As the graph above shows, using a single parameter for speaker verification
does not yield strong results. This was somewhat expected. The best matching
algorithm was Algorithm 4 (FREP).
[Figure 4.18 Successful Verifications: Against Same Passwords (Two Parameters)
- bar chart comparing one- and two-parameter successful verifications for
Corr, Closest, Slope, FREP, and FREP/BANPA]
As the graph above illustrates, there was a significant improvement when a
second parameter was added for verification. Also, Algorithm 5, the FREP and
BANPA combination, fared best. This algorithm is the complete AFP system
implementation, here applied to speaker verification. The AFP methods of
acoustic parameter evaluation and utilization yielded the best verification
results.
System Verification Criterion 2: Comparison Against The Baseline Passwords
In this experiment, the new profiles created from a user's verification
attempt were compared against the baseline profile (the average of all users)
of the same password. A confidence threshold was created by defining how close
a match needed to be; it was set at 90% for this research. If a match was 90%
or better, verification succeeded; if the attempt fell below the threshold,
verification failed. Further research may be carried out by defining a variety
of thresholds and observing the change in algorithm performance. The same
matching algorithms described above were used. The results of the 45 system
verification encounters are illustrated in the figures below.
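Under this criterion the decision is a simple threshold test against the
baseline profile, as sketched below; the assumption is that match scores are
normalized to [0, 1], so the 90% threshold becomes 0.9.

THRESHOLD = 0.90  # confidence threshold used in this research

def verify_against_baseline(new_profile, baseline, score_fn):
    # Succeed only if the match score meets or exceeds the threshold.
    return score_fn(new_profile, baseline) >= THRESHOLD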
[Figure 4.19 Successful Verifications: Against Baseline Password (One
Parameter) - bar chart of successful verifications (scale 0 to 45) for Corr,
Closest, Slope, FREP, and FREP/BANPA]
Overall, the number of successful verifications grew. Surprisingly, Algorithm
5 did not fare well when a threshold had to be met. Algorithm 4 (FREP), which
takes the maximum of the correlation and slope difference values, did best.
[Figure 4.20 Successful Verifications: Against Baseline Password (Two
Parameters) - bar chart comparing one- and two-parameter successful
verifications for Corr, Closest, Slope, FREP, and FREP/BANPA]
Again, adding a second acoustic parameter improved verification. Algorithm 4
(FREP) improved significantly, from 19 to 38 successful verifications. In this
case, the fuzzy rule system defined in chapter 3, using two acoustic
parameters, yielded the best results.
System Verification Criterion 3: Spoofing. Determining A Successful Rejection
Rate.
In this experiment, each user attempted to verify with a password other than
their designated password. There are 21 baseline passwords in the system; each
user attempted 5 other passwords, all of which were baseline passwords but not
their own. Again, a confidence threshold was defined. With this criterion, if
the verification attempt fell below the threshold, it was rejected and deemed
a successful rejection. Otherwise it was deemed a failure, in that it allowed
verification with an unauthorized password. The same matching algorithms
described above were used. The results of the 45 system verification
encounters are illustrated in the figures below.
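Scoring this experiment is the complement of Criterion 2: an attempt that
falls below the threshold counts as a successful rejection. A sketch of the
tally over the 45 encounters, with names and data layout assumed:

def rejection_rate(attempts, score_fn, threshold=0.90):
    # attempts is a list of (spoof_profile, baseline_profile) pairs.
    # Count the attempts correctly rejected, i.e. those whose match
    # score fell below the confidence threshold.
    rejected = sum(1 for profile, baseline in attempts
                   if score_fn(profile, baseline) < threshold)
    return rejected / len(attempts)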
[Figure 4.21 Successful Rejections: Against Spoofing Baseline Password (One
Parameter) - bar chart of successful rejections (scale 0 to 45) for Corr,
Closest, Slope, FREP, and FREP/BANPA]

[Figure 4.22 Successful Rejections: Against Spoofing Baseline Password (Two
Parameters) - bar chart comparing one- and two-parameter successful rejections
for Corr, Closest, Slope, FREP, and FREP/BANPA]
There was very little difference between one and two parameters. The exception
was Algorithm 5, the AFP system.
The last two experiments show that by setting a pre-determined threshold, one
can control the behavior of a speaker verification system, provided enough
data is available to the system's administrator. The problem is that any
pre-set threshold is vulnerable to system changes and must be constantly
updated to accommodate them. One must therefore be careful when using
thresholds and allow for the increased maintenance.
Although the experiments above showed promising results, the number of
variations explored was limited. Future research should explore additional
variations, such as alternative matching algorithms and matching thresholds.
CHAPTER 5
CONCLUSIONS
The research carried out in this study found new ways in which speaker
verification can be implemented using acoustic parameters. By focusing on key
attributes of the human voice, we have found that parameterizing speech holds
much promise for developing an invariant set of acoustic features that
differentiates one speaker from the next. Even though much work has been done
on extracting and using acoustic features for speaker verification, we have
found new implementation techniques that appear to be effective.
This study demonstrates that one can explore many different avenues that could
well lead to new successes in speaker verification. Much of the work
accomplished in this study draws on a multitude of disciplines, including
signal processing, artificial intelligence, and computer science. Our main
focus has been the extraction and use of acoustic features to approach speaker
verification from a new direction. We have demonstrated the important role
acoustic features play in the overall effectiveness of speaker verification.
In this study, we have also shown that one acoustic feature is not enough for
successful verification, and that two alone are not sufficient either. But
more importantly, we have shown that the difference between one and two
acoustic features is significant: adding a second feature measurably improved
verification performance.
Full Text

PAGE 1

SPEAKER VERIFICATION WITH ACOUSTIC PARAMETERS by B.S., University of Colorado at Denver, 1995 A thesis submitted to the University of Colorado at Denver in partial fulfillrnen t of the requirements for the degree of 11asters of Science Computer Science 1996

PAGE 2

This thesis for the Master of Science degree by Michael F. Matthews has been approved b y Jody Paul -

PAGE 3

Matthews, Michael F . (M.S. , Computer Scie nc e ) Speaker Verification With coustic Parameters thesis directed b y Professor Jody Paul ABSTRACT Security in today's society has take n on new interest and importance. Much emphasis has been placed on securing remote access to proprietary d a ta, safely trading c omme rc e o n the Internet, and reducin g credit card fraud at point-of-purchase locations. One question has not been fully answered: how can we verify people are who they say they are? Our current methods of verificatio n are unfriendly, c ostly, and not reliable. Speaker verifi c ation is a cost-effective, reliable, and user friendly technique. Advancing computer technology has enab l e d speaker verific ation to bec ome an effe cti ve security tool. 111

PAGE 4

Much of the previous r esear ch in speaker verification has focused on matching one instance of speech data with another. Because of the amount of unique inf o rmati o n o btained b y extracting acoustic parameters fr o m speech, we can explore alternate meth ods of using these parameters t o improve speaker verification . Through the use of artificial intelligence approaches such as fuzzy rules and Bayesian netw o rks, acoustic parameters of speech are used in a newl y developed method f o r this research. This meth od, called Adaptive Forward Plannin g (AFP), provi des a decisi o n making mechanism in which speaker verificati on can be implemented with pro mising results . This thesis surveys existing speaker verification technol ogies and implementations . It points out shortfalls and proposes h ow to address them. It th e n introduces the concept of Adaptive Forward Planning, and details its implementation. Finally, experime ntal results of this implementation are discussed , and directions for furth e r research are outlined. This abstract accurately represents the content of the candidate's thesis. I rec o mmend its publicati o n. -Signed Jody Paul t v

PAGE 5

C ONTENTS Figures ...................... . ............................................. ......................... .. .i x Tables ....................................... ... ......... . ............................................ xi Acknowledgements ................................................................................ xii CHAPTER 1 . INTRODUCTION ... ......................... . ...... ..... . ... . ........ ....... . . ... . .......................... . ...... 1 Securi ty In Today's Society .................... .... ..... ............... ......... . . ....... . ...... ..... ........ 1 Speaker Verification : A Viable Solution .... ..................... ... ........ ..... .......... . . ... .. . . ... 3 Ho w Can Speaker Verification Be Impro ved? .......... ....... ....... .......... ..... . . .............. 5 Cost . . ...................................... ........... ...... ................................ ........ ................ 5 User Int e rface ....... . ...... ...... ................ ............ . ............................ ... . . ......... . ...... 6 Training The System ............ . ..... ......... . ....... ..... . ... . .............. . ... .. . ...... . .......... . ... 6 Speaker Population ...................................... .. ........................ ... ...................... . 7 U sing Acoustic Parameters Of The Speech Signal .... ....................... . . .............. 7 Organization Of Thesis ............ ................................................................. . .......... 8 CHAPTER 2 . SPEAKER VERFICATION TECHNOLOGY AND SYSTEMS ....................... . .... 10 Speaker Verification Technologies .......... . .............. . ....................... . ................... 11 Predicti v e Models ...... . ........ .... . .................... . ... . ....... . ............. . . . .... . ................ 11 The H y brid MLP-RBF-Based S y stem ........................................................ 12 The S elf Segmenting Linear Predictor Model... ....... . .... . ............ . ... . ..... . .... . 12 The Neural Prediction Model... ................................................................ . 13 The Hidden Control Neural Network Model... ...... ... . . ........... . ... . ....... . ....... . 14 Gaussian Mixture Models .... . ... .......... . ...................... . ... . ........... . .................... 14 v

PAGE 6

Statistical Features For Speaker Verification ...... ........ 00 00 00 ••• 00 •••••• 00 •••••••••••••••• 16 Extraction Of Cepstral Coefficients ......... .... ... . . ... . ... ... ........ ... ....... . ... . ... .... ..... 16 Speaker Verification Systems ......................... . . . .... .............. ..... .... ...................... 17 ITI Defense Communications Division (ITTDCD) .............................. . ........ 17 AT&T Information Systems ...... ........ ..... ....................................................... 18 AT&T Bell Laboratories ......... .. ............................................... ...................... 20 The ARPA Speech Understanding Project: Continuous-Word Recognition Techniques ...... . ........ . ........... . . . ...... ... ............................................... ................... 21 Systems De ve lopment Corporation (SDC) ................ . ............................. ....... 21 Hear What I Mean (HWIM) ........... ....... .............. .... ............. . . .......... . ... . ........ 22 Carnegie-Mellon University Hearsay-IT ... ............. . ........................................ 23 Carnegie-Mellon University Harpy . . .......................... ...................... ..... . .... . ... 2 3 What Is Missing ? ..................................... . ................. . ...... ...... . ........ . ... . ... ......... . 24 CHAPTER 3. APPROACH ........................................................ . .......... . . ... ... .. ................... ......... 26 Adaptive Forward Planning .......................... . ............. . . ........ . ............................ 27 Password Models ............................ . . . ........... . ................. . ................................ .. 30 The AFP Architecture ......... . ........... . ............................. . .................................... 32 Fuzzy Rul e Evaluation Of Parameters (FREP) ............................................... 32 Bayesian Network for Password Anal y sis (BANPA) ...................... . ........... . ... 32 Phonetic Acoustic Parameter Evaluation Using Fuzzy Rules ......... . ....... . ...... . . . ... 34 Fuzzy Rules For Acoustic Parameter Value Determination ........ ... ...... ... ... .......... 37 Linguistic Variables ...... ..... . .................... ..... ... ... .... ................. . . .. . .... . ........... . 38 Correlation Coefficient .. ... . ..... . ............ . . . ..... . ... . . . ......... . ... . ....... . ................ 38 Regression Line Slope Difference ... . ...... ..... ... ..... . ... . ... . ........... . ................. 38 Fuzzy Rule Process (Defuzzification) ...................... ............ ...................... 41 Subjective Ba yesian Inference Networks For Password Anal y sis ......................... 43 Acoustic Feature Inputs ... . . . ......... . ..... ....... ....... . ... . ... . ........ . ............. ... ...... . ..... 45 Fundamental Frequency .................. . ........ . ......... .. . ................................ . ... 45 Spectral Shape ..................... .... ... ..... ...... . ............. ... . . ...................... . ..... . ... 45 V I

PAGE 7

Low-Frequency Energy , Mid-Frequency Energy , High-Frequency Energy . 45 First Formant Frequency , Second Formant Frequency . Third Formant Frequency ... .... .................................. . ........... .... ........... . ...... . ... . ........ . ........ 46 Phonetic Classification . ...................... . . ....... . . ...... . ............ . ....... . ... . ... . ... . ... . .... 48 Intensity ... ... ..... ....... . ................................ . ...... . ............... . ... . ... . ................. 48 Stress ................... . ...... . .... ... ... ..... ... . . . . .... . . ... . ........ ........... . ............. . ....... . ... 48 Intonation ...... ... . .......... ...... ....... .... . . ............ . . . . . ...... . ......... .... ... . ... . . . . . ..... . . . . 48 Articulary Configuration .... ..... ... . .... ... ... ...... . ......................... . ... . .............. 48 Phonetic Quality Attributes .... ... ........... . ............ . ....... .............. ... ... . . . ....... . ...... 48 Vowel Presence . . .... ...... .............. ....... ......... . ... . ....... ................... . ............... 49 Consonant Presence .............. . ....... . . . .... . ........ .......... ..... . . . ...... .... ................ 49 Prosodic Quality .... . ................... . ................... . . . .......... .......... .... . ............... 49 Articulation Quality ....... . ..... ... ..................... . .... . .... . . . ... ..... . . . . ... .... .... .... ..... 49 Total Energy .............................. . ...... .... ....... . . .... . . ........ ... . ... . . . ... . . .... .... ...... 49 Probability Relationships For Phonetic Attributes ........................................ .. 50 Odds Formulation . ... ..... . ... . ....... .... ................ . ....... . ............... . ..... . . ........... . 50 Likelihood Ratio ... ......................... . ..................... . ..................... . ............ . . 50 Measurement of the Evidence (E) . ................................................................. 53 Multiple Evidence . . ... .... ... . ........... . . . ..... . .... . . ...... . ... . ........... 00.00. 00 ...... 0000 000 00 000 • • 55 Phonetic Anal y sis Relationships 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .. 00 00 00 00 00 00 .. 00 .. 00 00 00 00 00 00 00 00 56 Summary .......................... . .... ....... . . ...... . . ......... . . ...... . ......................... . ............... 56 CHAPTER 4. IMPLEMENTATION ...... . ... . ... . .... ..................... ....... . ........ . . . ..... ... . . ...... . .... ....... . ... 58 S y stem Overview .......................... . .................... .... ....... ............. . . ......... ..... . . . ..... 58 BANPA Implementation oo oo 58 FREP Implementation ................... .... . ... . ...... .... . ....... . .... ....... . . ...... . ................ 60 Password Model Implementation .. ... . ... . . . . . . . 00 ••••••••••••• 00 • • • 00 •• 00 ......... 00 ••• 00 ......... 61 Vowel Password Selection 000000000000000000000000000000000000000000000000000000000000000000000000 61 Stop Consonant Password Selection 00 00 ... 00 .. 00 00 00 00 00 .. 00 .. 00 .. 00 .. 00 .... 00 00 00 00 00 00 00 ... 62 Liquids/Glides Password Selection .............. . ... 00 ••••••••••••• 00 00. oo. 00 ••• 00 •••••••••••• 62 00 Vll

PAGE 8

Nasal Consonant Password Selection ... ....... ..... ................. .......... ... . ...... .... 63 The HpW Works Analyzer For Acoustic Parameter E"1:raction ...................... 65 AFP S y stem Integration . . ..... . . ......... .... ......................... . . . ....... . .......... .... . . ........... 66 AFP S y stem Process ................... ........ .... ...... ........... ................................ ... ... 68 Speaker Verification Tests ............................ . ................ .... ............ ...... ........ ...... 69 CHAPTER Speaker Password Selection Results .... ....... .... ........................ ...... ...... ............ 70 Speaker Verification ............ . ........ ........... ....... ...... .......... . ... . ...... ....... .. ........... 79 One Verses Two Acoustic Parameters ....................... ...... .......................... 79 Matching Algorithms .............................. ......................... . ..... .... ...... . ............ 80 Algorithm 1: Correlation ...... .......................... .............. .... ...... .................. 80 Algorithm 2: Closest Match ................ .... ........................ .......................... 80 Algorithm 3 : Slope ........................ . ........................................... . .... . ......... 80 Algorithm 4 : FREP ............................. . ................. ... ... . .............. .............. . 80 Algorithm 5 : FREP With BANPA (AFP) .......... .... .. .................................. 80 System Verification Criteria 1: Comparison Against The Same Passwords .... 81 S y stem Verification Criteria 2: Comparison Against The Baseline Passwords 8 3 S y stem Verification Criteria 3 : Spoofing . Determining A Successful Rejection Rate ..... ............ .. ........ . . ............ ................................................... ... ... . ... ......... 85 5 . CONCLUSIONS ......... ... .... . .................. . ............... ... ...... ... ............ ...... . ........... . ..... 88 Speaker Verification : A Technology Waiting .... ...... ................ ........................... 90 Where To Go From Here .................... .... ................................ ........................ .... 91 APPENDIX A. VALUES USED FORBANPA SYSTEM ............................................................. 94 B . RESULTSOFBANPARUNS .......... . ........... .................. .... . ..... ........... .......... .... ... 95 C . RESULTS OF FREP RUNS ................... ........ ...................................................... 99 REFERENCES .. . ... .................................... ... ... ....................................... 102 Vlll

PAGE 9

FIGURES FIGURE 3 . 1 Adaptive Forward Planning Steps ............................ . .......................... . ............. 28 3 . 2 Password Model Tree ....... ......... ........................... . ......... ....... ....... ..... ...... .... . . . ... 31 3.3 The AFP Architecture ... ............. . .... . ............... . ...... ... . ........... ......... . ..... . . . .......... 33 3.4 Establishing A Baseline For The Acoustic Parameter FF . . ... ... . ........... ............... 36 3.5 Plot Comparison Of Two Fundamental Frequency Parameters ..... ... ....... . ........... 39 3.6 Fuzzy Rule Distribution ................. . ............... ..... .... ..... ...................... ................ 41 3. 7 Bayesian Network For Phonetic Anal y sis .......................................... . ............... 4 7 3 . 8 Sufficiency Relationship Between Fundamental Frequency And Intonation ....... 51 3.9 Necessity Relationship Between Low Frequency Energy And Intensity .............. 52 3.10 Approximation Model For Evidence Of An Acoustic Or Phonetic Attribute ...... 54 4 . 1 The BANPA User Interface ...... . ........................................ . ............................... 59 4 . 2 The FREP User Interface .......... ..... ................... . ..... . ................................. ......... 60 4 . 3 Password Models For Study .............. . ................... ... ..................... . ...... . ............. 64 4 . 4 HpW Works S y stem Author And Version ........... . ........ . ...................... . ..... . . . ...... 65 4 . 5 User Interface Screen For AFP Password Selection .................. . .... . ................... . 67 4 . 6 AFP System Process .............. . ..... . ...................... .......... .................................. ... 69 4 . 7 Distribution OfPassword Selection ........................................ ......................... .. 72 4.8 Spectral Shape For Speaker 1 Verses Average Spectral Shape For Password " animated " .......... . ...... .... ........ . ................................................... ...... ... ....... ....... 73 4 .9 Spectral Shape For Speaker 2 Verses Average Spectral Shape For Password " animated " ........ . ..................................................... .......................................... 73 4 .10 Spectral Shape For Speaker 3 Verses Average Spectral Shape For Password " animated " ............ . ........... ............................................................ .................. .. 74 4 .11 Spectral Shape For Speaker 4 Verses Average Spectral Shape For Password ' animated " . ........... . ........................ . ......... .... . . ... ... ............... ........... . ......... . ....... . 74 IX

PAGE 10

4 .12 Spectral Shape For Speaker 5 Verses Average Spectral Shape For Password " animated " ... .. ... .................................. . ............. ..... ................... . .... . ... . . . ............ 7 5 4 .13 Spectral Shape For Speaker 6 Verses Average Spectral Shape For Password " animated " .......................................... ... . . ...... . ......... . ............ .... . . . ............. . ...... . 75 4 .14 Spectral Shape For Speaker 7 Verses Average Spectral Shape For Password " animated " . . ... . . . ... . . . ............ . ...... ............ . . . . .... ........................ .. ... ...... . . ........ . ... . . 7 6 4 .15 Spectral Shape For Speaker 8 Verses Average Spectral Shape For Password " animated " . ..... .......... . .................. . .... ...... . . . . . . .... . ......... ......... ...... ...... . . ......... . ... . . 76 4 .16 Spectral Shape For Speaker 9 Verses Average Spectra l Shape For Password " animated " .... . .... .... .... ... . ... . . ........... ..... ....... . ........ .... . .... . . . . . .......... . . . . . ....... . ......... 77 4 .17 Suc ce ssful Verifications : Against Same Pass w ords ( One Parameter ) . ...... .. ... .... .... .... ..... . . . ..... . . . . . ...... ..... .. ...... . . . . ....... . . . . . .... . ... ... ........ 82 4 .18 Successful Verifications : Against Same Pass w ords ( T w o Parameters ) .. ....... . ................................ . . ....... . . ............. 0 ••• 0 ••• 0 0 ••• 0 •• 0. 0 •••••• o•. 82 4 .19 Successful V e rifications : Aga i nst Baseline Pass w ord ( One Parameter ) ....................... . ....... . ........... . ........... . ....................... .... ....... . ..... 84 4 . 20 Successful Verifications : Against Baseline Pass w ord (Tw o Parameters ) . ..... ........ . . ................... . .............. ..... o • • • ••••••••••••••••••••••••••••••••••• 84 4 .21 Successful Rejections : Agains t Spoofing Baseline Password ( One Parameters ) o o o . . . . o.•o o 8 6 4 . 2 2 Successful Rejections : Against Spoofing Baselin e Pass w ord (Tw o Parameters ) . . . 00 • • o .... . . . o ................ o ........... o .......... . . . ................ 0 • ••• ••••••••••••••• 86 X

PAGE 11

TABL E S Tabl e 3.1 Fuzzy Rules For Determining The Quality Of An Acoustic Parameter.. 00 00 00 00 00 00 40 3 . 2 Values Computed B y Fuzzy Rules ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooOOOOOO 42 4 . 1 Results of Password selection ........................................ . .................................. 71 4 . 2 Phonetic Class Selection Of Speakers ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooOOO 78 X l

PAGE 12

ACKNOWLEDGMENTS The author wishes to thank the following people for their support, encouragement and assistance. Carol Conway Matthews Kevin Matthews Daniel Matthews Dr. Jody Paul Dr. W. J. Wolfe Dr. John Clark Bernie and Dede Conwa y AT&T Bell Labs/Lucent Technologies (Denver) X11

PAGE 13

CHAPTER 1 I TRODUCTIO J Throughout the last twenty years, speaker verification has been used for computer system access, building access, credit card verification, and crime lab forensics, all with differing amounts of success . As compared to other security techniques such as fingerprints, palm-prints, hand writing scans, retinal , and facial scans, speaker verification is comparatively inexpensive. Each individual's speech patterns are unique. Because of this, individual speech samples are as unique as fingerprints, facial scans or retinal prints. This makes speaker verification an excellent tool for security control. Recent developments in artificial intelligence such as neural networks and fuzzy logic have extended research in many existing technologies. Speaker ve rific ation is one of these technologies . The cost-effectiveness, ease of use, and reliability of speech verification make it a practical technology to pursue . Security In Today's Society Despite recent advances in computer technology, one security problem remains that has not been fully solved : how can we verify the claimed identity of a person? Security access has become a more integral part of today's society in many ways. With an increasingly competitive high tech industry, the need for protection of proprietary material has become critical. Systems used for security access to buildings and computer labs have become regular fare. The 1

PAGE 14

Internet has become a viable means of world-wide communication. The ability to trade commerce safely on the Internet is about to become a r eality . There are basically three methods by which a person can be identified. One is by an object in their possession such as a badge, a key, or credit card. The second method is by having something memorized such as a user id, a password, or personal identification number. The third method is by a physical characteristic unique to th e person such as a fmgerprint, facial scan, signature, or a voiceprint The first two methods are transferable between one person and another making them vulnerable to impostors. The third method of a unique physical characteristic offers the most promise for secure personal identification 1 • Despite the need for secure personal identification, there are not many devices available on the market today. The only effective devices are those based on biometrics such as fmgerprints, palm-prints, and retinal scans. These devices are used at point-of-access locations and there use in other locations such as point of sale terminals, or for remote transactions is not likely due to the expense . Many situations where access to a secure area must be controlled involve the use of guards. Examples include areas such as computer rooms, bank vaults, aircraft maintenance areas, and drug storage areas. If guards are employed seven days a week, 24 hours-a day, the cost could exceed $50,000 a year. Fraudulent use of credit cards and bank checks is becoming an increasing problem to merchants, banks, and credit companies . As we move closer to a cash-less society, the amount of money lost due to insecure transactions will become enormous. This type of security problem is different than that of building access in that the number of places where protection is 2

PAGE 15

needed is staggering. In essence, all gas stations, restaurants, stores, and banks are potential targets to fraud. Because of such a larg e number of places where transactions can be performed, the costs of a security solution must be low and its performance must be very reliable. In addition, any method of security of this type must be easy to use and widely acceptable to customers. Today's society is moving towards long distance working relationships, telecommuting, and a growing reliance on remote access to proprietary data. The fears of unauthorized physical access have continued to grow in recent years. Society is demanding more effective ways of providing security to a growing number of people and a variety of different needs. The solutions for these types of transactions must be easy to use, inexpensive, and widely accepted among all who must interact with it. Speaker Verification : A Viable Solution In everyday life it is possible to recognize people by their voices. This attribute makes the human voice a natural candidate for automated identification. One person's voice is different from another because the relative amplitudes of different frequency components of their speech are different. By extracting acoustic features such as frequency components from the speech signal, we can further the reliable identification of a voice. The technique of speaker verification is one of the few reliable speech recognition technologies available today. As compared to continuous-word recognition, speaker verification is constrained to single words or phrases. Because of this, the complexity of recognition is reduced tremendously. In addition, speaker verification works with a "known user", meaning the system has previous information stored about the user. Other speech recognition 3

PAGE 16

technologies must generally work with an arbitrary user. The fact that speaker verification is free of many constraints inherent in other speech recognition systems allow it to be one of the most reliable speech technologies available. Typically the application of speaker verification technology is a device that customers can be comfortable with. They know what to expect and therefore don't have unrealistic expectations. These unrealistic expectations may very well lead to a customer being unhappy or even frustrated by a product if any rough edges are encountered. Speaker verification can address probably the single most important issue concerning consumers: will people accept it as a security method? Because little more than a microphone is required, and most people fmd speaking natural, the answer is most likely yes. To address the need for remote access of proprietary data, speaker verification can play a key role for securing safe transactions. Speaker verification technology can be implemented over long distances through telephone or computer networks . Particularly, as telephone technology improves, more and more reliable speech signal quality becomes a reality. Obviously, fmer speech signal quality will allow for more dependable speaker verification. Similar to the way systems today verify credit card purchases, speaker verification can be used for securing other more critical money related transactions. In the case of private criminal investigation of individuals, speaker verification has been very effectively used to reliably authenticate speakers. With our knowledge of existing verification processes, we have found that voice patterns reveal more invariant properties than many existing verification techniques. Another main advantage of speaker verification is that it does not require much additional or specialized equipment at the point-of-use. 4

PAGE 17

Equipment for hand scanners or retinal readers can be very large and cumbersome. A speaker verification application requires little more than a microphone or handset. Approaches such as palm prints, eye and body scans, fingerprint or signature analysis can be very costly. One fmgerprint method costs $53,000 for a central unit, and $4000 for each station2 • Another example, is the cost of a hand print scanner. Devices such as this typically cost around $3000. Comparatively, a speaker verification system costs much less. A complete system can be implemented with a personal computer, sound card, and microphone. A system such as this cost as little as $500. How Can Speaker Verification Be Improved? Speaker verification technology for commercial applications is still fairly new . Even though much research has taken place, there are a number of ways this technology can still be improved. The following sections point out the potential improvements that are addressed in this thesis. Most speaker verification systems require specialized hardware to accomplish the processing needed . Not only is most of this hardware proprietary, but is typically very expensive. An improvement can be made by developing a system on a common platform such as an IBM compatible machine using a standard sound card. By doing this, the system should be able to be duplicated very easily and inexpensively. In addition, since the majority of computer users use this type of platform, users of a speaker verification system developed on it will instantly be comfortable with using it. 5

PAGE 18

User Interface A speaker verification system must be easy to use and its interface must be appealing and friendly to interact with . Much of the speaker verification systems in use are specialized laboratory applications, and are not designed for typical commercial use. By simply looking at a computer screen with a mouse in hand , and talking into a microphone, users shou ld be very comfortable interacting with a system. In reality, all that is needed is a microphone or handset, and a simple visual o r audio feedback mechanism that alerts the user to the verification results. The system developed and described in chapter 3 has attempted to capture these attributes . Training The System When a user first encounters a speaker verification system, it typically carries out a training process that results in a voice print that is stored and later used for matching the voice of that individual. Most systems employ a technique called speaker adaptation to refme matching during subsequent encounters with the verification system. The speakers voice print is adapted to include new acoustic information the first few times a speaker uses the system successfully. Because of this, the system may be unstable and less secure in its initial use. By emphasizing speaker adaptation at the earliest stages of a user's first encounter, the initial training of a system can be improved. Instead of allowing the system to adapt ove r a prolonged period o f time, collection of critical information is carried out at the beginning of a system's life cycle. Not only will the system perform m o re reliably at earlier stages, it will be easier to use in the long run. This process, which is de ve loped in this thesis, is called Adaptive Forward Planning. It is described in chapter 3. 6

PAGE 19

Speaker Population Most speech recognition systems can readily handle single speakers by specifically tailoring the system to the nuances of the individual speaker. Most realistic applications demand the ability to handle more than a single talker. Systems such as these are designed for computer system access, credit card verification or building access. These systems must handle hundreds or even thousands of different speakers. In addition to handling such a multitude of different speakers, these systems must be able to deal with uncooperative speakers. These types of speakers may not say exactly what the system has asked for or may even try to fool the system on purpose. By focusing on techniques and processes that can generically be adapted to a diverse speaker population, the performance of a speaker verification system can be improved. This can be accomplished by designing a set of parameters that are invariant among all speakers, and relying on speech characteristics that apply to the way people produce speech. Uncooperativ e speakers can be dealt with by utilizing these parameters to detect potential impostors. Using Acoustic Parameters of The Speech Signal Probably the most popular approach to speaker verification is to focus on speech signal parameters. Many studies have utilized a technique of extracting acoustic parameters that can later be used for identification. M ost of the processes used for this technique have encompassed very specialized hardware and complicated algorithms. While most of the systems developed have yielded significant results, not many have explored alternate methods of acoustic parameter extraction, and how these acoustic parameters can be used 7

PAGE 20

for speaker ve rificati o n. This is pro babl y the m os t "open" area f o r speaker ve rificati o n ex pl o rati o n. This i s th e core f o cu s o f thi s the sis. Organi z ati o n Of Thesis In the remaining chapters o f this th esis, we give an overv iew of spee ch verifi cati o n technologies and systems that h ave bee n de ve loped ove r th e last twenty years . We p oint out what's missing, offe r s o luti o ns , describe our app roach in detail , and discuss experimental results. In Chapter 2 , " Speaker Verification T e chnol ogy And Systems", speake r ve rific atio n meth o ds and techniques are dis cussed foll owed b y d e tails of full impl e mentatio ns. The purp ose of this chapte r is t o provide th e inter ested reader with backgr ound and hi s t ory, and t o p oin t out w h a t' s missing and hasn't b ee n addresse d in previous speaker ve rification research. Chapter 3, "Approach", is the c o r e of this the sis. The method of A daptiv e Forward Plannin g is intr o duced. This new meth o d exp lore s alternate ways o f utili z ing a c ous tic features ex tracted fr o m the speech s ignal. Wh e n f ea tures are extracted, the y are eval uated t o d e termin e how much they c o ntribut e o r d o not c o ntribut e t o the ve rific atio n pro ce ss . A n ew approach t o evaluation u s ing Bayesian n etwo rks and a fuzzy rule pro c ess is ex plained. These n ew approaches have the p o tential to improve th e ve rificati o n pro cess. In Chapter 4, " Impl e mentation" , how the res e arch was put int o pra ctic e, including the system that was built, is discussed. The discussi o n includes detailed system integrati o n and the trade offs and c ompro mi ses. The results of seve ral speake r ve rificati o n experim ents are provide d as well. 8

PAGE 21

Finally, in Chapter 5, "Conclusions", a re-cap and conclusions of this research are presented. In addition , this chapter summarizes why this research is important, and what future types of speaker verification research may provide further successes and rewards. 9

PAGE 22

CHAPTER 2 SPEAKER VERFICATION TECHNOLOGY AND SYSTEMS Research in speech recognition has tra ve rsed a multitude o f directions over the past twenty years . The types of systems developed have been very dive rse: from reliable 24 hour-a-da y isolated word recognizers to extremely versatile and comple x systems designed to interpret sentences or continu o us wo rd speech. There are a number o f commercial systems currently available. Most of the se systems are used for simple desktop assistance, o r for games and entertainment. There are also a number of pri va te or industrial systems that have been designed for much more complicat e d and scientific uses . These use s include r e al-time language interpreters , human -computer interaction systems, and implementati o ns used for security and ve rification purposes . It is this last use that we are particularly interested in. Even th o ugh c omputer techno l ogy has dramaticall y improved ove r th e las t twenty years, the practical uses o f speech recognition technology have r e mained limited. The two most applicable speech recogniti o n techn o l ogies are isolated -wo rd r ecognition, and speaker verification . These two are closely related in a number of ways . Each involve a cooperative speaker who is willing t o respond to the system's needs in order t o achieve success. Both techn o l ogies compare incoming speech with prerecorded templates of vario us spee ch d a ta. Also, b o th use similar types o f matching alg o rithms to accomplish their goals. Of these two, speaker ve rification has prove n to be a m o re useful 10

PAGE 23

technol ogy in the sense that it has s uccessfully been used for a variety of applications. In this chapter , a brief ove rview o f s peech v erificati o n technology is given. The focus is on the most pro mising, and current techn o l ogies that have b ee n researched and o r implemented within the last twenty years. In addition t o the f o cus o n t e chn o l ogy, complete systems that have been implemented are l ooke d at. The purpose o f this second f o cus is t o get a feel for the successes and lack of successes in speech ve rificati o n. The following ove rviews are a representati ve survey and is by no means exhaustive and c o mplet e . Speaker Verification Technologies This section will provi de an overview of p o pular techniques and methodol ogies u se d for speaker ve rification. Most o f these techniques have two things in common: first, there is an initial speech data c o llecti o n usuall y implemented b y a training session. Sec o ndly, ve rification is carried out b y c o mparing this initial data against an incoming speech signal's c o llected data. Predicti ve Models3 Pre dicti ve m o dels have been successfully used in speaker ve rificati o n. In this section, we will briefly discuss four that have been r ese arched. In ea ch , the inc o ming speech signal is transformed int o a model which is then used to ve rify the speaker. The first two m o dels do n o t r e quire pre -p r o cessing of the inc o ming signal, the second two d o. 11

PAGE 24

The Hybrid .MLP-RBF-Based System. The hybrid .MLP-RBF model4 is a two-stage connectionist model designed to operate in the time-domain alone and performs well without any time-warping. The first stage is a Multi-Layer Perceptron (.MLP) neural network and is used to extract speech parameters which are used in the verification stage . The .MLP is trained to act as a nonlinear speech predictor for the utterance spoken by the person who claims an identity. The second stage implements the speaker verification process based on a Radial Basis Function (RBF) classifier using the weights of the .MLP as its inputs. The RBF classifier is previously trained to accept the weights produced by the true speaker utterance applied to stage one, and to reject all other weights produced by other speakers . Several previous studies have been carried out on the use of neural architectures for the purpose of time series and speech prediction . The .MLP RBF system is based on the fact that the .MLP model is capable of learning the underlying speaker-dependent trends of a speech utterance. This connectionist model was shown that it could be trained to predict the speech waveform in a non-recursive mode. However, after the training process was completed, attempting to operate the model in a recursive manner behaved chaotically but did reveal a relationship to the original speech waveform. It was found that a connectionist model was not sufficient to model the time varying parameters of speech and therefore could not work well in a recursive mode. But, the similarities between the chaotic series produced by the recursive prediction and the original speech signal proved that the connectionist model was learning the operation of the underlying speech production mechanism. The .MLP-RBF system is based on this knowledge. The Self Segmenting Linear Predictor Model. This model uses an array of linear predictors to model the true speaker where each predictor is 12

PAGE 25

associated with a particular sub-unit of the speech utterance. Linear predictors have proven successful in speaker verification applications where the speech is divided up into frames of equal length. Because the speech signal has a slow time varying property, the vocal tract shape is considered to stay constant during the duration of a frame. Each frame can be considered a stationary signal allowing each to be represented with Linear Predictive Coefficients (LPC). In this model, Linear Predictors are used to represent the temporal structures of speech. An iterative training process uses Dynamic Programming (DP) to segment speech into sub-units during which the vocal tract stays constant and then trains to a set of LPs for each of these speech segments. The segmentation and training are done on a sample-by-sample basis in the time domain. No other pre-processing is required. Both the LP coefficients and the segmentation units of the training utterances are stored and used for verification. Verification involves DP of the test utterance with the LP coefficients of the claimed true speaker . The normalized mean square prediction residual is calculated by dividing the accumulated squared sample values over the entire utterance. The normalized mean squared prediction error is then compared to a threshold to determine the success of verification. The Neural Prediction Model. The Neural Prediction Model (NPM) consists of an array of MLP predictors and is constructed as a state transition network. Each state has a particular MLP associated with it. Each MLP predictor has one hidden layer and one output layer consisting of eight nodes. The hidden layer nodes have a sigmoidal function, while the output nodes are linear. The 8 Mel frequency cepstral coefficients are used as the frame feature vectors. The MLP outputs a predicted frame feature vector based on the 13

PAGE 26

preceding frame feature vec t ors. The difference between th e predicted feature vector and the actual feature vector is defined as the predicti o n residual. The goal of this type of system is t o fmd a set of MLP predictor weights which minimize the accumulated prediction residual for a trainin g set. Speaker verificati o n is carri e d out b y creating a training data set by c o llecting a series of password utterances. Verification requires the application of the test utte ranc e t o the NPM associated with the speaker who claims identity. The accumulated prediction residual div ided b y th e sum of the squares of each feature c omponent in the utterance is used t o determine verification success. The Hidden Contro l Neural Network Model. This m odel utilize s a single MLP predictor. Like the NPM, this m ode l is c o nstruct e d as a state transiti o n netw o rk and also uses the 8 Mel frequency cepstral coefficient s as frame feature vectors. The single MLP o utputs a frame feature vecto r prediction. The model attempts to fmd a set of MLP predictor weights that minimize the accumulated predicti o n residual for the true speaker utterance training set. Verification of a claim e d speaker involves application of th e incoming utterance along with the MLP weights set t o th ose associated with the claim e d speaker . As in NPM, the accumulated predicti o n residual divided b y the sum of the squares of each feature c o mp onent in the utterance is used to determine verification success. Gaussian Mixture Models5 The Gaussian mixture speaker model was first in traduced in 1990 and has dem o nstrated very accurate veri fication f o r text-independent speaker utterances. In th e Gaussian mixture m o del ( GMM), the distribution of feature vectors extracted fr o m speech is m o deled by a Gaussian mixture density. The 14

PAGE 27

density is a weighted linear combination of uni-modal Gaussian densities, each parameterized by a mean vector, and covariance matrix. Maximum likelihood speaker model parameters are estimated using the iterative Expectation Maximi za tion (EM) algorithm. Generally, 10 iterations are sufficient for parameter convergence. The GMM can be viewed as a hybrid between two effective models for speaker recognition: a uni modal Gaussian classifier and a vector quantizer codebook. The GMM combines the robustness and smoothness of the parametric Gaussian model with the arbitrary density modeling of the non-parametric VQ model. The speech signal is first segmented into frames b y a 20 ms window progress ing at a 10 ms frame rate. Silence and noise frames are discarded using a speech activity detector (SAD). This is important in te.1:-independent speaker recognition because by removing silence and noise frames, modeling and detection is based solely on the speaker, not the environment in which the speaker is speaking. After the SAD processing, Mel cepstral feature vectors are then extracted from the speech frames and cepstral coefficients are deri ved. Finally, the feature ve ctors are channel equalized via blind deconvolution. The decon vo lution is implemented by subtracting the average cepstral vector from each input utterance. It is critical to collect training samples and test samples from the same microphones o r channels to achieve good recognition accuracy. The speaker verification process is a straight-forward maximum likelihood classifier . For a group of speakers, each is represented by a GMM. The objective then is to find the speaker model which has the maximum posterior probability for the input feature ve ctor sequence. The minimum error Bayes' decision rule is used to determine the accuracy of verification. 15

PAGE 28

Statistical Features For Speaker Verification6 Speaker verification systems have also been implemented by using statistical features of speech. On such system studied used statistical features extracted from speech samples for an automatic verification of a claimed identity. The analysis of a prescribed code sentence to be repeated for verification is performed by a real-time hardware processor consisting of a multi-channel filter bank covering the frequency range from 100Hz to 6.2 kHz. The incoming speech signal is scanned every 20 ms, and a multiplex integrator is used to compute the long-term averaged spectrum (LTS) over the en tire utterance. The system is trained with a number of sample utterances of the code sequence used for verification. From these samples a speaker-specific reference is calculated and stored on the computer. In addition, a verification threshold is calculated and stored for each speaker. For successful verification, the distance between the test speech input and the closest stored reference must fall below this threshold. Extraction Of Cepstral Coefficients7 In this implementation, cepstral coefficients are extracted from the incoming speech signal. These coefficients are used to build a codebook containing a set of codevectors (coefficient vectors) representative of the person to be identified. Verification is carried out by extracting cepstral coefficients from the new speech of the person to be identified, and a distortion metric is used to measure the distance from the new vectors and those in the codebook. 16
Extraction Of Cepstral Coefficients7

In this implementation, cepstral coefficients are extracted from the incoming speech signal. These coefficients are used to build a codebook containing a set of codevectors (coefficient vectors) representative of the person to be identified. Verification is carried out by extracting cepstral coefficients from the new speech of the person to be identified, and a distortion metric is used to measure the distance between the new vectors and those in the codebook.

Frames of approximately 30 ms of speech are digitally sampled and then processed through a pre-emphasis network to boost the high-frequency components of the speech. Cepstral components are extracted and a probability-of-voicing factor is computed for each frame. A probability of voicing at or above a given threshold indicates the presence of formants, which provide a correlation with an individual's vocal tract. All the frames are then presented to a Linde-Buzo-Gray clustering algorithm, which transforms the initial coefficient vectors into a set of 64 codevectors that is then used for verification.
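A minimal sketch of the distortion measurement, assuming a Euclidean metric and an already-trained 64-codevector codebook (the LBG training itself is omitted):

```python
import numpy as np

def codebook_distortion(vectors, codebook):
    """Average distance from each incoming cepstral vector to its nearest
    codevector; small distortion supports the identity claim.

    vectors:  (T, D) cepstral vectors from the new utterance
    codebook: (64, D) codevectors trained for the claimed speaker
    """
    # pairwise distances, shape (T, 64)
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()
```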
Speaker Verification Systems

Several speaker verification systems have been implemented, each particular to a specific application. Even though the applications are specific, the goal remains the same: how accurately can a system determine whether the speaker claiming an identity is who they say they really are? The following sections provide an overview of the systems that have been developed.

ITT Defense Communications Division (ITTDCD)8

The United States government has historically been interested in speech recognition for as long as the technology has been around. The major areas of interest to the government are word-spotting, talker identification, language identification, "command-and-control", and secure (encrypted) speech transmission. ITTDCD has worked on many of these government applications. ITTDCD has investigated methods for speaker verification by attempting to recognize the identity of a speaker even if the text of the analyzed utterance is unknown. ITTDCD has also worked on improving speaker verification by making such systems more accurate and less expensive.

Most of the research at ITTDCD has been directed at the fundamental problems of recognizing speech in noisy environments and/or over telephone lines. Of considerable interest is the selection of acoustic features that are immune to degradation over phone lines or from background noise. These features must be insensitive to noise and also to the identity of the speaker. Because of these constraints, the feature selection techniques designed by ITTDCD have centered on determining optimal methods for reducing the effects of noise on the accuracy of the incoming speech signal. The approaches taken include a comparative evaluation of different feature sets; the sets examined include linear predictive coefficients (LPC), vocal tract area functions, autocorrelation coefficients, cepstral coefficients, and LPC-derived pseudo-formants. Another approach involves the use of a least mean square (LMS) adaptive filtering method for the removal of additive noise from the speech signal. Finally, a third approach involves the investigation of a noise-reduced LPC parameter set.

ITTDCD is well established in the speech compression (vocoding) area. This technology is well suited as a front end to many speech recognition systems. In its research involving speech recognition, ITTDCD has emphasized low-cost solutions and has developed several systems used for isolated word recognition, word-spotting, and speaker verification.
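The LMS method can be sketched as a standard adaptive noise canceller, assuming a reference input correlated with the additive noise; the step size and filter length below are illustrative:

```python
import numpy as np

def lms_denoise(noisy, noise_ref, taps=16, mu=0.01):
    """Least mean square adaptive noise cancellation.

    noisy:     speech plus additive noise (1-D array)
    noise_ref: reference signal correlated with the noise alone
    Returns the error signal, which approximates the clean speech.
    """
    w = np.zeros(taps)                    # adaptive filter weights
    out = np.zeros(len(noisy))
    for n in range(taps, len(noisy)):
        x = noise_ref[n - taps:n][::-1]   # most recent reference samples
        y = w @ x                         # filter's estimate of the noise
        e = noisy[n] - y                  # residual = speech estimate
        w += 2 * mu * e * x               # LMS weight update
        out[n] = e
    return out
```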
AT&T Information Systems9

AT&T and its Bell Laboratories have developed and researched many different speech recognition technologies. Although much research has been in continuous-word recognition, many word-spotting and speaker verification systems have also been developed. One of particular interest was developed by AT&T's I.S. division in the mid 1980s. This application is a voice password system for security access using speaker verification, designed for use over dial-up telephone lines. The voice password system (VPS) can be used for secure access to telephone networks, computers, rooms, and buildings. The VPS works by allowing a user to call into the system, enter his or her identification number, and then speak a password that is usually a phrase or short sentence. On the initial encounter, the VPS creates a model of the user's voice and stores a reference template.

Incoming speech is processed in the VPS on a frame-by-frame basis. Frames are spaced at 15-millisecond intervals and overlap with a 45-ms duration. For each frame a set of features is extracted that characterizes aspects of the signal such as short-term energy and spectrum. For each feature extracted, the autocorrelation of the incoming signal is computed, creating autocorrelation coefficients. The coefficients are then modified by simulating the addition of white noise to reduce differences between noisy long-distance telephone lines and clear local lines. These modified coefficients are then transformed into linear predictive coefficients (LPC) which represent, in bits, a spectrum of the voice. The LPC coefficients are then transformed to cepstral coefficients and are normalized by subtracting the mean cepstral values over the utterance from each 15-ms frame of speech. Finally, the beginning and end frames of the password are located. Once this feature set is computed, it is matched with a previously generated reference pattern using a method called dynamic time warping (DTW), which accounts for timing differences among repeated utterances of the same phrase. The DTW match yields an absolute distance score that is used to evaluate the identity of the speaker.
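A minimal sketch of the DTW template match, using the classic dynamic-programming recurrence with simplified path constraints (the thesis does not specify VPS's exact local constraints):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a: (Ta, D) test-utterance frames;  b: (Tb, D) reference-template frames.
    Returns the accumulated distance of the best alignment path.
    """
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])    # local frame distance
            D[i, j] = cost + min(D[i - 1, j],             # insertion
                                 D[i, j - 1],             # deletion
                                 D[i - 1, j - 1])         # match
    return D[Ta, Tb]

# Accept the claim when dtw_distance(test, template) falls below a tuned threshold.
```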
AT&T Bell Laboratories10

Many methods of speaker verification have been studied at Bell Laboratories. One of those methods experimented with comparing certain characteristics of a speaker's voice with the same characteristics of the voice of the person who the speaker claims to be. Allowances are made for normal variations in speech rate, pitch, volume, and other factors. The belief is that a system using this method can be just as fast as a human listener and detect impostors much more accurately.

A file of prototype utterances of a single phrase spoken several times is collected and averaged by a computer. This average forms a prototype that is stored along with measurements of the variability among individual utterances. The variability data is collected because no one can speak the same phrase twice in exactly the same way. When verification is desired, the computer fetches the stored prototype for the claimed identity, analyzes the incoming speech sample, and determines whether it is close enough to the prototype version.

Five characteristics or features are extracted from the speech signal. The first three are the three lowest resonant frequencies, known as formants one, two, and three. The fourth characteristic is voice pitch and its variation with time. The fifth characteristic is the variation of the intensity (or loudness) of the speech with time. Before a voice sample can be compared, the prototype and the incoming voice sample are brought into temporal registration by "time warping" the voice sample. This is done by speeding up or slowing down various portions of the utterance.

Once the characteristics have been extracted and the two samples to be compared are brought into registration, measurements are taken to determine how similar the prototype and the sample are. To accomplish this, the computer divides each of the five characteristics into 20 equal time segments. For each segment, several measures of dissimilarity are computed, such as the mean squared difference and the squared difference of the average rate of change. After these distance measurements are computed for each separate segment, each distance is averaged over the 20 segments. Finally, a sixth distance measure is taken that reflects the degree of time warping that was necessary to achieve registration. The computer then combines all six distance measures and computes an overall final distance measure of dissimilarity.
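How the six measures are grouped is not fully specified in this account, so the following sketch is an assumption: it averages two per-segment dissimilarity measures over the 20 segments and folds in a warp-penalty term as the final measure.

```python
import numpy as np

def dissimilarity(sample, prototype, warp_penalty, segments=20):
    """Overall dissimilarity between a time-registered sample and prototype.

    sample, prototype: (5, T) arrays, one row per characteristic
                       (formants 1-3, pitch, intensity), already time-warped;
                       T is assumed large enough that every segment holds
                       several samples.
    warp_penalty:      measure reflecting how much warping was needed.
    """
    ms, slope = [], []
    for s, p in zip(np.array_split(sample, segments, axis=1),
                    np.array_split(prototype, segments, axis=1)):
        ms.append(np.mean((s - p) ** 2))                            # mean squared difference
        slope.append((np.diff(s).mean() - np.diff(p).mean()) ** 2)  # rate-of-change difference
    # average each measure over the 20 segments, then combine with the warp term
    return float(np.mean([np.mean(ms), np.mean(slope), warp_penalty]))
```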
The ARPA Speech Understanding Project: Continuous-Word Recognition Techniques

The focus of this thesis is on speaker verification, which is considered an isolated-word recognition technique. Even though isolated-word recognition is the most applicable to speaker verification, there is much correlation with continuous-word techniques. Because of this, it is hard to ignore the most important research that has occurred in continuous-word speech recognition. This section briefly highlights the major systems that were developed for The Advanced Research Projects Agency (ARPA) Speech Understanding Project in the 1970s11. ARPA is an agency of the Department of Defense.

Systems Development Corporation (SDC)12

The SDC system was developed to process sentences. When a digitized waveform enters the system, formant frequencies and other parameters are extracted. From this a phonetic transcription is obtained, including several alternative labels for each 10-ms segment of the waveform. All of this data is
then placed into an array for later examination by top-end routines. The utterance is processed from left to right. First, a list of all possible sentence-beginning words is generated. Then an abstract phoneme representation is extracted for each lexical hypothesis, and a graph of expected acoustic variants is created. Each of these graphs is then sent to a mapper to determine how good an acoustic match can be obtained. The mapper includes techniques for estimating the probability that the expected word is present given the phonetic and acoustic data collected. The mapper constitutes a verification strategy based on syllables, which is an attractive strategy for predicting phonetic segments.

Hear What I Mean (HWIM)13

Similarly to the SDC system, when a digitized waveform enters the system, formant frequencies and other parameters are extracted. This information is then used to derive a set of phonetic transcription alternatives that are arranged in a phonetic "segment lattice". The advantage of the lattice structure is that it can represent segmentation ambiguity in those cases where decisions are most difficult. Identification of words is carried out by searching through the segmental representation of the utterance for the closest lexical matching words. These matches are used as "seeds" that are later used to build up partial sentence hypotheses. The best-scoring word is then sent to a word verification component that utilizes parametric data to get a quasi-independent measure of the quality of the match. The method of verification is analysis by synthesis. The verification score is combined with the lexical matching score, and if this score is high enough, the word hypothesis is sent to a syntactic predictor which, using
grammatical constraints, proposes which words can appear on the left and right of the seed word. The word proposals eventually build a lexical decoding network that produces hypotheses of two words, three words, four words, and so forth until a final sentence is obtained.

Carnegie-Mellon University Hearsay-II14

The process of recognition is similar to the HWIM and SDC systems described above. The Hearsay-II system employs a set of parallel asynchronous processes that simulate each of the component knowledge sources of a speech understanding system. The knowledge sources communicate via a global "blackboard" database. When any one of the knowledge source components is activated by the blackboard, it tries to extend the current state of analysis. The blackboard is divided into several major categories: sequences of segment labels, syllables, lexical items proposed, accepted words, and partial phrase theories. Initially, amplitude and zero-crossing parameters are used to divide an utterance up into segments that are categorized by manner-of-articulation features. A word hypothesizer then lists all words having a syllable structure compatible with the partial phonetic segments. A word verification component scores each lexical hypothesis by comparing an expected sequence of spectra with observed linear-prediction spectra. High-scoring words activate a syntactic component which attempts to piece words together into partial sentence theories. This process continues until a complete sentence is found.

Carnegie-Mellon University Harpy15

The Harpy system is an extension of a Markov model of sentence decoding originally employed by a sentence recognition system called Dragon16.
In Dragon, a "breadth-first" dynamic programming strategy was used to find the optimal path through the network. In Harpy, a beam-search technique is used in which a restricted beam of near-miss alternatives around the best-scoring path is considered. Dragon also used a-priori probabilities in choosing the most likely path, whereas Harpy considers only spectral distance. The Harpy finite-state machine has 15,000 states. The state transition network includes all possible paths, alternate representations of all lexical items in terms of acoustic segments, and a set of rules that define expected acoustic segment sequence changes across word boundaries. The input utterance is divided up into brief acoustic segments. Each segment is compared with 98 talker-specific linear-prediction spectral templates to obtain a set of 98 spectral distances. Each state in the network has an associated spectral template. The strategy is to try to find the best-scoring path through the state transition network by comparing the distance between the observed spectra and the template sequences given in the network.
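A minimal sketch of the beam-search idea, with a generic state network and an illustrative pruning rule (not Harpy's actual data structures):

```python
def beam_search(network, start, segments, distance, beam_width):
    """Keep only near-miss alternatives around the best path at each step.

    network:   dict mapping state -> list of successor states
    segments:  list of observed acoustic segments
    distance:  callable giving the spectral distance between a state's
               template and a segment
    """
    hypotheses = {start: 0.0}                       # state -> accumulated distance
    for seg in segments:
        expanded = {}
        for state, score in hypotheses.items():
            for nxt in network[state]:
                d = score + distance(nxt, seg)      # distance to the state's template
                if nxt not in expanded or d < expanded[nxt]:
                    expanded[nxt] = d
        best = min(expanded.values())
        # prune: retain only states within the beam of the best score
        hypotheses = {s: d for s, d in expanded.items() if d <= best + beam_width}
    return min(hypotheses, key=hypotheses.get)
```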
What Is Missing?

The preceding sections have provided a brief overview of speaker verification technologies, methods, and implementations. There are two concepts common to all of them. First, the method used for verification involves decision making based on matching one set of acoustic, or other speech-derived, data against another set of similar data. The second concept involves using a limited set or number of acoustic features, and not really exploring the impact that additional features may have.

By simply matching one set of data against another, there is an increased potential for lost information. Because the acoustics of speech vary so much from one speaker to another, decision-making algorithms must involve more than a simple matching process. In this thesis, the method of determining verification is taken a step further than simply matching data. Because of the nature of speech, there is more information available to us than an individual set of acoustic features. Phonetic relationships exist that allow us to study feature interactions. By modeling these relationships and interactions, new and alternate methods of decision making are possible. By focusing on a definitive set of acoustic features, phonetic relationships can be derived which should help to distinguish one voice from another.

Acoustic features are pivotal to the speaker verification process. It is within certain combinations of these features that verification can be improved. One feature for verification is most likely not enough, nor is two. But by combining more and more features, we can create a finer granularity of decision making that should improve the accuracy of speaker verification. In this thesis, we will also show that substantial improvement can be realized by combining and utilizing additional acoustic features in the verification process.
CHAPTER 3

APPROACH

The focus of this study is to determine how acoustic features and phonetic relationships can aid us in making decisions in the speaker verification process. We are interested in the role acoustic parameters play in improving the accuracy of speaker verification. Extracting acoustic parameters or features from the speech signal is not a new concept. The potential methods by which parameters are evaluated and utilized for speaker verification are conceptually new and have not been fully tapped.

In this chapter, a new method of using acoustic parameters of speech is presented. Decision making used in speaker verification is taken further than simply matching one speech-derived data set with another. A new technique called Adaptive Forward Planning is introduced that attempts to take advantage of unique acoustic features inherent to a speaker. New procedures for acoustic feature evaluation, and how these evaluations can aid speaker verification, are presented in this chapter. In addition, this thesis goes further and explores the impact of combining additional acoustic parameters.
Adaptive Forward Planning

The technique of Adaptive Forward Planning takes advantage of how the human vocal system produces speech. By capturing the essential aspects of the human vocal system, speaker verification can be improved by identifying those individual speaker characteristics that distinguish an individual from the average user or group of users. Characteristics such as vocal tract resonance, characteristics of articulation, and the rate of vibration of the vocal cords are utilized to improve verification. Techniques of speaker normalization or channel normalization can be used so that certain features of the speech signal can be detected and used to capture specific characteristics of the speech signal. An example would be detecting the fundamental frequency or pitch of the signal. Another would be the use of complex processes that would determine the speaker's vocal tract length. Using criteria such as these, a specific acoustic feature profile may be created and stored for later use.

Adaptive Forward Planning (AFP) uses a pre-meditated set of acoustic information criteria. This set can be thought of as a boiler plate of specific acoustic information. By focusing on a definitive set of acoustic features up front, AFP can formulate a unique profile for every user. With AFP, the task of speaker verification is to determine the information-carrying features common to repeated utterances of the same phrase or word. When a user encounters the system for the first time, rather than simply storing initial voice information for that user, the adaptive process is carried out interactively. The system proceeds through refinement learning steps and updates a unique profile of acoustical feature information until it meets a pre-specified set of criteria. As the system works its way toward refining this set of criteria, it makes use of different phrases or words whose selection is determined by the refinement process. Acoustic features may be selected based on trial-and-error heuristics
which capture the distinct individual speaker characteristics that pertain to a particular phrase. Once this set of criteria has been obtained, the system then selects a final appropriate phrase that the user will be asked to use for verification. The phrase will be selected from a set of stored templates that have been collected from the user population. The phrase selected will be the phrase that best matches the collection of acoustic information that has been obtained. In addition to selecting this final phrase, the system will store the reusable acoustic information obtained from the speaker. The illustration below depicts this process; the progression proceeds from top to bottom, from the original voice characteristics boiler plate through successive refinements of the speaker characteristics to the final voice characteristics information.

Figure 3.1 Adaptive Forward Planning Steps
By creating this unique acoustic information profile, additional in-depth verification may be carried out in later system encounters that otherwise would not exist. For example, if a user attempts to gain access to the system at a later date, a number of pre-determined thresholds could be used to determine the accuracy of verification. The system would use this unique acoustic profile to further the verification analysis if a number of the thresholds were not met. The advantage of this is that the probability of false rejection (i.e., a valid user not being accepted) could be lowered.

Adaptive Forward Planning will improve speaker verification in a number of other ways as well. For instance, when users wish or are required to choose a new password, they will have an existing specific feature set with which to interact. Rather than randomly choosing new phrases for passwords, specific phrases can be derived from the acoustic information available. By doing this, the user's acoustic feature profile will remain more or less intact. This will ensure that the unique features of their voice remain embedded in the system and are continually taken advantage of. Another advantage is the use of similar passwords among different speakers. Because the acoustic profile will be unique to each speaker, if two or more speakers use a similar, or even the same, password, verification analysis should still reveal unique speakers because of the depth of information held in each profile. The ability to use similar or equivalent passwords will reduce the overhead required to maintain the password model database.
Password Models

To collect a unique acoustic profile, AFP uses password models. Each password model is a single phrase that has been analyzed in depth and determined to model a set of phonetic qualities that yield unique acoustic features when spoken. AFP first presents the user with an initial set of password models that they will be asked to speak in sequence. As each word or phrase is spoken, the system will capture those acoustic features that pertain to each individual model. AFP will then determine which phrase or word spoken best matches the acoustic features of the speaker and will then proceed to the next appropriate set of related password models.

The password model system will be hierarchical in nature and can be represented in a tree-like structure. Contained in the top level of the tree (level 0) will be the initial set of password models. Each of their children will represent a subset that will contain similar, but more detailed, phonetic properties. When a user encounters the system, they start at the top of the tree and work their way down, passing through several internal nodes until they get to the lowest level of the tree. The lowest level will represent the actual final phrases or words that AFP determines best suit the user. AFP will then choose among these for the password the user will be required to use. The illustration below depicts the password model tree.
Figure 3.2 Password Model Tree (levels 0 through n; the phrases at the lowest level represent the final password list that AFP chooses from)

The lowest level of the hierarchy represents the actual passwords that will be used in the system. Initially, a finite number of passwords is used in the system. AFP will require ongoing maintenance of the password model hierarchy. When the lowest-level passwords are eventually exhausted, they may simply be replaced by sets of phrases that fit the acoustic feature properties of their predecessors. It will be several years, if ever, before the system runs out of available passwords to use.

By proceeding through the password model hierarchy, AFP will accurately narrow down those acoustic features that are unique to an individual speaker and map these to a specific phrase. As these acoustic features are discovered, AFP will dynamically build an acoustic feature profile for each user.
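A minimal sketch of how the hierarchy descent could be represented; the node structure and the scoring callback are illustrative assumptions, since the thesis does not prescribe a data structure:

```python
from dataclasses import dataclass, field

@dataclass
class PasswordNode:
    phrase: str
    children: list["PasswordNode"] = field(default_factory=list)

def select_password(root, score):
    """Walk the password model tree, at each level keeping the child whose
    spoken phrase scored best, until a leaf (a final password) is reached.

    score: callable returning the overall phonetic value of a spoken phrase
           (in AFP this value comes from the FREP/BANPA evaluation).
    """
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(child.phrase))
    return node.phrase
```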
The AFP Architecture

To accomplish the goals of AFP, two main architectural components were developed: a sub-system for evaluating acoustic parameters as they are extracted, and a second sub-system for combining the various acoustic parameter evaluations and determining the overall value or "acoustic worth" of a particular password model. These two sub-systems, along with the password models described above, make up the AFP architecture.

Fuzzy Rule Evaluation Of Parameters (FREP)

The sub-system developed for evaluating acoustic parameters is the Fuzzy Rule Evaluation of Parameters (FREP). Because acoustic parameters extracted from the speech signal vary so greatly, we often are dealing with vagueness and ambiguity when trying to determine "how good" a parameter is. For this reason, this sub-system is modeled using fuzzy rules.

Bayesian Network for Password Analysis (BANPA)

The sub-system developed for evaluating the overall value of a particular password model is the Bayesian Network for Password Analysis (BANPA). Due to our limited knowledge of the phonetic features that correspond to the human voice, we are working with incomplete or uncertain data when trying to evaluate the overall value of a spoken phrase. A Bayesian network was chosen for this sub-system because it allows us to combine several sources of data (acoustic parameter evaluations), each with varying amounts of useful information. We can then analyze and determine probabilistic relationships among this data, combine these relationships, and come to a conclusion or overall value.
The illustration below depicts the AFP architecture.

Figure 3.3 The AFP Architecture (password models feeding FREP, which feeds BANPA)

Note in the figure above that the speaker first reads and speaks each available password at each level in the hierarchy. For each password spoken, the acoustic parameters are extracted and sent to FREP for parameter evaluation. Each evaluated parameter is then sent (in tandem) to BANPA, where the overall phonetic value of the password is determined. This is done for each password on the same level of the password hierarchy. These values are then returned to the password model hierarchy, where the next level of passwords is selected based on these values.
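A minimal sketch of this per-level flow; the three stage functions are passed in as callables because they stand in for the extraction, FREP, and BANPA components described elsewhere in this chapter:

```python
def evaluate_level(passwords, utterances, extract_parameters, frep_evaluate, banpa_value):
    """Score every password offered at one level of the hierarchy.

    passwords:  phrases offered at this level
    utterances: the corresponding recorded speech samples
    """
    values = {}
    for phrase, speech in zip(passwords, utterances):
        params = extract_parameters(speech)                       # acoustic parameters
        evaluations = [frep_evaluate(p, phrase) for p in params]  # fuzzy evaluation
        values[phrase] = banpa_value(evaluations)                 # overall phonetic value
    return values  # the hierarchy selects the next level from these values
```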
The remaining sections of this chapter describe these two main sub-systems in detail.

Phonetic Acoustic Parameter Evaluation Using Fuzzy Rules

Through much previous research in phonetics, linguists have noted the importance of phonetic features in describing the sound structure of language and in postulating how the sound systems of language change. It has been proposed that speech sounds occurring in the languages of the world could be described in terms of a finite set of phonetic features.17 These phonetic features can be defined in reference to acoustic patterns or acoustic properties derivable from the speech signal. The acoustic properties can be defined in terms of spectral patterns, changes in overall amplitude, or fluctuations in energy extracted from the incoming speech signal18.

In Adaptive Forward Planning, a number of acoustic parameters are extracted from the speech signal so that the overall value of a phrase can be determined. For example, the acoustic features fundamental frequency, gross spectral shape, energy frequencies, and formant frequencies provide an invariant set of features that will differ between speakers. Not only do these features differ between individual speakers, but they also differ between repetitions of the same utterance spoken by the same speaker. The acoustic features of a speech signal allow us to identify phonetic features such as consonant stops or intonation. Because the production of human voiced sounds varies so dramatically between speakers, we are working with a considerable amount of ambiguity when computing the value of an acoustic parameter. This is significant when using these parameters for speaker verification.
The challenge is to find a way to determine the value of an acoustic parameter based on acoustic features extracted from an incoming speech signal. Each acoustic parameter contributes to a phonetic property in varying ways. For instance, a small amount of low-frequency energy contributes very little to the phonetic property of intensity, but a greater amount contributes enormously. It therefore becomes important to detect the different levels at which each acoustic feature can contribute to a phonetic property. We cannot assume that an acoustic parameter either contributes or it does not; rather, we must determine how much it contributes. Even though the measurement or graph of one instance of an acoustic parameter differs from another instance, it still may be acceptable under certain criteria. A fuzzy rule system provides an excellent vehicle with which to approach our problem. The criteria by which we accept one measurement over another can only be defined in terms of fuzzy rules.

In order to determine the value of an acoustic parameter, we must establish a "measure of quality" or baseline against which we can measure. For each password model in our system, we can establish a baseline measurement for each acoustic feature associated with that password. To accomplish this, a database of spoken phrases is collected, and a baseline acoustic feature set is established for each phrase. For each spoken phrase, the set of acoustic parameters is extracted. Then, all measurements of an extracted acoustic parameter are averaged. This average establishes a baseline or "measure of quality" for an individual acoustic parameter associated with a phrase. The illustration below depicts this: speakers A through X speak the same password "hello"; the acoustic parameter fundamental frequency (FF) is extracted for each spoken instance of "hello", and the X occurrences of FF are then averaged to form the FF baseline for the password "hello".
Figure 3.4 Forming The Fundamental Frequency Baseline For The Password "hello"
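A minimal sketch of the baseline computation, assuming the per-speaker parameter tracks have already been aligned to a common length (real tracks would need alignment or resampling first):

```python
import numpy as np

def build_baseline(extractions):
    """Average parameter tracks from many speakers into one baseline.

    extractions: list of 1-D arrays, one fundamental-frequency track per
                 speaker for the same password, all of equal length.
    """
    return np.mean(np.stack(extractions), axis=0)

# baseline_ff = build_baseline([ff_speaker_a, ff_speaker_b, ff_speaker_x])
```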
To determine the quality or amount of information we gain from this comparison, fuzzy rules will be used.

Fuzzy Rules For Acoustic Parameter Value Determination

In comparing one acoustic parameter with another, we cannot simply claim that the two parameters match or they do not. Rather, we must define the comparison in terms of a possibility distribution. For instance, when studying the acoustic parameter fundamental frequency, many variances are detected. The varying rate of fundamental frequency is determined by the shape and mass of the moving vocal cords, the tension of the laryngeal muscles, and the air pressure generated by the lungs19. When two different speakers articulate a phrase, the shape of the vocal cords may be similar, but the air pressure generated by the lungs could be very different. In addition, attempting to measure the difference in tension of the laryngeal muscles is nearly impossible. Rather than focus on these types of differences, we can utilize our general knowledge of how the fundamental frequency rate and other acoustic parameters vary among individual phrases. Because of the way a speaker produces sounds, we can establish commonalities among individual phrases. If we plot two different occurrences of an acoustic parameter extracted from a spoken phrase, we can calculate the difference in these commonalities. In this research, two general criteria for determining how two acoustic parameters differ have been selected: correlation coefficient and regression line slope difference. Using fuzzy rules, these two criteria become our linguistic variables.
Linguistic Variables

For each acoustic parameter, two linguistic variables are created: correlation and regression line slope difference. The values of these two variables are computed from the difference between the baseline measurement and the incoming speech signal. The linguistic variables are defined as follows:

Correlation Coefficient. The correlation coefficient is computed from the graph of the baseline measurement and the graph of the incoming acoustic parameter. By using the correlation coefficient, we can determine the similarity between the two acoustic parameters. The equation for the correlation coefficient is:

    CC = Cov(x, y) / (σx · σy)

where the covariance is

    Cov(x, y) = (1/n) Σ(i=1..n) (xi − μx)(yi − μy)

and μx is the mean of x, μy is the mean of y.

Equation 1 Correlation

The value will be high if there is a strong relationship, and low if there is a weak relationship.

Regression Line Slope Difference. For each acoustic parameter's plot, the slope of the linear regression line through the data points in known y's and known x's is calculated. The slope is the vertical distance divided by the horizontal distance between any two points on the line, which is the rate of
change along the regression line. The equation for the slope of the regression line is:

    b = (n Σxy − (Σx)(Σy)) / (n Σx² − (Σx)²)

Equation 2 Slope Of The Regression Line

This is calculated for the baseline measurement and for the incoming speech signal. Then the difference between the two is calculated, and a value for the variable is determined. The value will be high if there is very little difference, and low if there is a great amount of difference. The illustration below depicts two different fundamental frequency parameter extractions for the phrase "hello":

Figure 3.5 Plot Comparison Of Two Fundamental Frequency Parameters (FF plotted as frequency versus time in ms; lighter curve = incoming signal, darker curve = baseline)
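A minimal sketch computing both criteria for an incoming parameter track against its baseline, assuming equal-length tracks sampled on a common time axis:

```python
import numpy as np

def comparison_criteria(baseline, incoming):
    """Correlation coefficient and regression-line slope difference
    between a baseline parameter track and an incoming one.
    """
    cc = np.corrcoef(baseline, incoming)[0, 1]       # Equation 1

    def slope(y):                                    # Equation 2, x = 0..n-1
        x = np.arange(len(y))
        n = len(y)
        return ((n * (x * y).sum() - x.sum() * y.sum())
                / (n * (x * x).sum() - x.sum() ** 2))

    slope_diff = abs(slope(baseline) - slope(incoming))
    return cc, slope_diff
```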
The two linguistic variables were designed to take on one of five values: very low (VL), low (L), medium (M), high (H), and very high (VH). These five values allow each variable to be separated into five "measurement qualities". A total of five values was used for simplicity and as a starting point for this research; they were adapted from20. Many more values could be used if so desired. In addition, by keeping this number low, a simple set of consistent fuzzy rules was easier to derive.

We have now defined two fuzzy rule determination sets, each with a possibility distribution (VL .. VH). By defining our sets in this way, we can define a fuzzy relationship between these two linguistic variables and the possibility distribution for the value of an acoustic parameter. By relating the values of our linguistic variables to the value of an acoustic parameter, we can then define a set of fuzzy rules used for determining the value or quality of an acoustic parameter. The table below defines the rules.

Slope diff. \ Correlation:   VL    L     M     H     VH
VL                           VL    L     L     L     L
L                            M     L     M     M     M
M                            M     H     H     H     H
H                            M     H     M     H     H
VH                           VH    VH    VH    VH    VH

Table 3.1 Fuzzy Rules For Determining The Quality Of An Acoustic Parameter

Again, these rules were defined for simplicity and also to serve as a starting point for this research. They were also adapted from21.
A total of twenty-five rules was defined. We can now define fuzzy rule functions that will allow us to mathematically compute numerical values that map to the distributions shown in the table above. The illustration below depicts how the fuzzy rule functions are distributed over the [0, 1] ranges of the correlation value, the slope difference value, and the acoustic parameter value.

Figure 3.6 Fuzzy Rule Distribution

The mapping functions depicted above were chosen, again, as a starting point for this research. They were adapted from22,23.

Fuzzy Rule Process (Defuzzification). Each of the fuzzy rule functions can be coded as "if then else" rules which will allow output of an acoustic parameter value. For instance, referring to the figures above: if the correlation between parameters is medium (M), and there is very little difference (a smaller difference means a higher value, "H") in slope between the parameters, then the
acoustic parameter value is high. Each of our rules can be implemented this way. The ranges for each of the values can be set according to the following table.

VL   0.00-0.15
L    0.15-0.30
M    0.30-0.60
H    0.60-0.85
VH   0.85-1.00

Table 3.2 Values Computed By Fuzzy Rules

The ranges in the table above were selected because they represent a distribution of five values ranging from 0 to 1, and they correlate to the mappings above. Because five values were selected, five distributions were necessary. The middle value, M, was defined to be twice as large as the other four so that the other four would have equal ranges. These values could be distributed in many other ranges as well; different value distributions could be explored in future research.

In order to determine how much value the two linguistic variables contribute, each variable's value is calculated in terms of percentages. Then both percentages are averaged, which results in an overall "percentage contribution" or worth. Once this percentage contribution is calculated, it is then mapped to the acoustic parameter value according to the table above. This can be implemented for each of the 25 rules. For example, if the slope difference value was 0.45 and the correlation value was 0.75, this translates to: slope difference value = M (0.45), correlation value = H (0.75); percentage of M's window for slope difference = 0.45 − 0.30 = 0.15. Then, 0.15/0.30 =
0.5. Here, 0.5 corresponds to halfway in the window, or 50% of the window. Then, the percentage of H's window for correlation = 0.75 − 0.60 = 0.15. Then, 0.15/0.25 = 0.6. Here, 0.6 corresponds to 10% past halfway in the window, or 60% of the window. These two percentages are then averaged: (0.5 + 0.6)/2 = 0.55. This percentage, 0.55, is then used to calculate the "amount of" the value that this particular rule maps to.

Mapped rule: "When correlation is H and slope is M, the parameter value is H". How much H? This is calculated with the above percentage contribution, 0.55: H ranges from 0.60 to 0.85, a total of 0.25. 55% of this range is (0.55)(0.25) = 0.1375. So, 0.60 + 0.1375 = 0.7375. Therefore, the amount of H is 0.7375.

All 25 rules are computed in this way. All rules were chosen to be computed the same way so that a starting point could be established. This is simply one interpretation of the fuzzy rule defuzzification process described above. In future research, several different variations of linguistic variable values, fuzzy rule definitions, and defuzzification processes could be experimented with.
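A minimal sketch of this defuzzification step; the ranges mirror Table 3.2, the rule table is supplied as a mapping as in Table 3.1, and only the worked H/M example is exercised below:

```python
RANGES = {"VL": (0.00, 0.15), "L": (0.15, 0.30), "M": (0.30, 0.60),
          "H": (0.60, 0.85), "VH": (0.85, 1.00)}

def label(v):
    """Map a numeric value in [0, 1] to its linguistic label."""
    return next(name for name, (lo, hi) in RANGES.items() if lo <= v <= hi)

def defuzzify(corr, slope_value, rule_table):
    """Acoustic parameter value for one rule firing.

    rule_table maps (correlation label, slope label) -> output label.
    """
    c_lab, s_lab = label(corr), label(slope_value)
    pct = lambda v, lab: (v - RANGES[lab][0]) / (RANGES[lab][1] - RANGES[lab][0])
    contribution = (pct(corr, c_lab) + pct(slope_value, s_lab)) / 2
    lo, hi = RANGES[rule_table[(c_lab, s_lab)]]
    return lo + contribution * (hi - lo)

# Worked example from the text: correlation 0.75 (H), slope value 0.45 (M):
# defuzzify(0.75, 0.45, {("H", "M"): "H"})  ->  0.7375
```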
Subjective Bayesian Inference Networks For Password Analysis

For each password model described above, AFP must determine the value or quality of each separate model in terms of its inherent acoustic features. By determining these properties, AFP can make the best possible decisions in building an individual user's unique acoustic feature set and determine the most effective password available. Previous work in the acoustic theory of speech production has shown that a number of phonetic parameters may be used to characterize different speech sounds. These parameters may be used for phonetic processing of the speech sounds and vary according to the individual speaker and the techniques of extraction. The task is to choose a representative set of parameters that can be used to characterize different speech sounds in terms of phonetic qualities.

Unfortunately there is not, at this time, a complete, specific, and definitive set of phonetic characterizations that correspond to the human voice. By capturing those phonetic characteristics that we can, we can at best only capture a particular set or subset of all the possible phonetic features of the human voice. In addition, the inherent complexity and fuzziness of acoustic parameters forces us to make, at best, estimated conclusions about the meaning or relevance of phonetic parameter data. Because of these constraints, we are working with incomplete or uncertain data, yet we still must draw inferences and make useful judgments. Our goal is to determine the value or quality of a particular phrase. To accomplish this, we must use what acoustic information we have and draw conclusions. In dealing with incomplete or uncertain data, many techniques have been developed to aid in forming judgments or conclusions24. Probability theory provides a powerful mechanism for dealing with inference-type problems such as the one AFP faces in determining the value or quality of a spoken phrase.

Not only must we determine the value or quality of a particular phrase, but we must also capture those phonetic features that are qualitative to a unique human voice. In this way we capture two types of important information that we can use in speaker verification: a specific password model that the user can use for verification, and a unique acoustic feature set that corresponds to an individual user and can later be used to aid in the verification process.
Subjective Bayesian inference networks provide a powerful model which can provide a measure of confidence for the overall phonetic quality of a particular spoken phrase. By translating the values we place on acoustic features to phonetic qualities, we can determine the overall value of a particular password. The Bayesian network model allows us to do this. First, we will discuss potential acoustic features that can be used as inputs to the network.

Acoustic Feature Inputs

A set of acoustic parameters can be extracted from the incoming speech signal so that phonetic analysis may be carried out. Each of the acoustic parameter inputs used for phonetic analysis will be computed in terms of its total informational value. Acoustic parameters that can be used in this way are discussed below.

Fundamental Frequency. During the production of voiced sounds, the vocal cords are set into vibration. The fundamental frequency of voicing can be used as a voicing indicator. Voicing distinction may be captured, indicating the presence of prosodic information such as stress, intonation, and intensity.

Spectral Shape. Characteristics of speech events, such as the production of fricatives and the onset of plosive releases, are best characterized in terms of the gross spectral shape of the speech signal. The spectral energies may be derived from the gross spectral shape, thus providing spectral information as input.

Low-Frequency Energy, Mid-Frequency Energy, High-Frequency Energy. One of the most important characteristics of the speech signal is the fact that its intensity varies as a function of time. For example, sharp intensity changes in different frequency regions (i.e., low, medium, or high) often signify boundaries between speech sounds. One example of this is that low
overall intensity typically signifies a weak fricative, whereas a drop in mid-frequency intensity usually indicates the presence of an intervocalic consonant.

First Formant Frequency, Second Formant Frequency, Third Formant Frequency. Previous studies have found that the first three formants for vowels and sonorants carry important information about the articulatory configuration in the production of speech sounds. These frequencies can be used to classify vowels and consonants.

By placing values on each of these parameters, we can supply our Bayesian network with the necessary input values. The diagram shown below indicates the relationships between incoming acoustic parameters (input nodes) and phonetic classifications (lumped nodes), which serve as "lumped evidential variables" that summarize the incoming acoustic parameters. Also depicted are the relationships between phonetic classifications and the attributes that contribute to the phonetic quality of a phrase (predictor nodes).
Figure 3.7 Bayesian Network For Phonetic Analysis (input nodes: fundamental frequency, spectral shape, low-, mid-, and high-frequency energy, and the first, second, and third formant frequencies; lumped nodes: phonetic classifications; predictor nodes: phonetic quality attributes)

Each of the acoustic parameters is extracted and analyzed separately to produce an input value that enters the left side of the network. These values are then fed through the network to produce phonetic classification values on the lumped nodes. Then these values are fed to the predictor nodes, and their values are updated. Finally, the predictor node values are combined to produce an output on the far right-hand side of the network. This final output represents the overall phonetic value or quality of the phrase from which the acoustic parameters were extracted.
Phonetic Classification

In order to define relationships between acoustic parameters and phonetic qualities, the following phonetic classifications are necessary. These classifications are a representative set, and are by no means exhaustive and complete.

Intensity. The overall intensity of a speech signal may be used to detect the presence of vowel and consonant strengths.

Stress. This is known as the most basic abstract prosodic feature. Stress patterns provide information about intonation and phonological phrases and can aid in classifying the vocal effort of the speaker.

Intonation. Intonation, known as the pitch contour of an utterance, provides vital clues to linguistic structures. Also, intonation may be studied to derive length characteristics so that differences may be obtained between low vowels, tense vowels, and lax vowels.

Articulatory Configuration. By analyzing the articulation of speech, acoustic distinctions may be made. For example, there are definite rates of respiratory airflow below which airflow is laminar and above which airflow is turbulent. This yields sharp distinctions between sonorant sounds and fricatives. In addition, rather abrupt changes in acoustic features may be detected corresponding to the opening of the velic valve for nasalization.

Phonetic Quality Attributes

These attributes constitute the overall phonetic quality of a particular phrase. By combining the values or contributions of the phonetic classifications previously discussed, we can define phonetic quality attributes.
Vowel Presence. Vowels can be detected by the presence of substantial energy in the low and mid frequency regions of the speech signal. They may be characterized by the steady-state values of the first three formant frequencies.

Consonant Presence. Consonants are usually divided into several groups depending on the manner in which they are articulated. There are five groups in English. Plosives are characterized acoustically by a period of prolonged silence, followed by an abrupt increase in amplitude at the consonantal release. Fricatives are detectable by the presence of turbulent noise. Nasals are invariably adjacent to a vowel, and are marked by a sharp change in intensity and spectrum. Glides occur only next to a vowel, as formant transitions into or out of a vowel, and are fairly smooth and much slower than those of other consonants. Affricates are characterized as a plosive followed by a fricative.

Prosodic Quality. One of the areas of speech understanding that has not been fully tapped is that of extracting prosodic information. Prosodic information of speech consists of stress patterns, intonation, pauses, accent, and timing structures. By computing an overall prosodic quality of a phrase, we can capture unique characteristics of the incoming speech signal.

Articulation Quality. This represents the overall quality of detectable articulation parameters.

Total Energy. This represents the total speech signal energy obtained and may be used to evaluate overall intensity changes, which indicate the presence or lack of presence of phonetic parameters.
Probability Relationships For Phonetic Attributes

The model that AFP uses for determining the quality of a phrase must account for degrees of evidence and degrees of confidence in the hypothesis that a particular spoken phrase is of a certain quality. Probability theory provides the mathematical formalism in which we need to work. Each node in our network can be thought of as a model in which evidence (E) can provide support for, or against, a hypothesis (H). Using basic probability theory, we can view the nodes in our Bayesian network as events whose probabilities are derived from prior knowledge about the relationships that exist between each node and the nodes that lead to it. It is now necessary to define these relationships.

Odds Formulation. The "odds" of an event A is defined as:

    O(A) = p(A) / p(¬A)

Likelihood Ratio. The "likelihood ratio" λ for events A and B is given by:

    λ = p(A|B) / p(A|¬B)

This is useful because it can be viewed as a way to update the odds of a hypothesis H based on the evidence E: if

    λ = p(E|H) / p(E|¬H)

then

    O(H|E) = λ · O(H)

Our overall goal throughout the network is to estimate the conditional probability of a hypothesis (H) that a phonetic classification or attribute exists, given the evidence (E) of acoustic parameters extracted from the speech signal. Two more terms are necessary to support our understanding of how E relates to H: sufficiency λ and necessity λ'. These two numbers are used to capture
the prior knowledge of the probabilistic relationship between E and H. Depending on the relationship between E and H, these limiting cases will vary somewhat. For example, if the acoustic parameter of fundamental frequency and the phonetic classification of intonation were related as in the sketch below, with the fundamental frequency event contained within the intonation event:

Figure 3.8 Sufficiency Relationship Between Fundamental Frequency And Intonation

then whenever fundamental frequency was present, the phonetic classification intonation would occur; thus fundamental frequency is "sufficient" for intonation to occur. On the other hand, if the relationship between the acoustic parameter low-frequency energy and the phonetic classification of intensity were depicted as:
Figure 3.9 Necessity Relationship Between Low Frequency Energy And Intensity

then whenever the intensity characteristic occurred, the acoustic parameter type of low-frequency energy would be certain to occur. Thus, the acoustic parameter type of low-frequency energy is "necessary" for the intensity characteristic to occur. The more general cases of limiting are handled by a mix of the values λ and λ', where λ usually runs from 1 to large values based on the degree of sufficiency between acoustic parameter types and phonetic classification types, and λ' runs between 0 and 1 based on the degree of necessity between acoustic parameter types and phonetic classification types. This mix of values is also used in determining the relationships between phonetic classification types and the attributes that constitute the quality of a particular phonetic phrase. In our subjective Bayesian network, the challenge lies in coming up with values of λ
and λ' for each link in the inference network that best represent "expert knowledge" about each relationship.

Measurement of the Evidence (E). In our model, evidence (E) of an acoustic parameter or the presence of a phonetic type arrives in an uncertain fashion. Each parameter input to the system is really a measurement of the degree of existence of that acoustic parameter or phonetic type. Assume a measurement is made, call it E', that tells us how sure we are that the acoustic parameter or phonetic type is actually present. The value t = p(E|E'), with 0 ≤ t ≤ 1, represents the certainty that E has or has not occurred. A value of 1 means that E has absolutely occurred, and a value of 0 means that E has absolutely not occurred. We can now estimate p(H|E') based on the measurement of E and our prior knowledge. To carry out this estimation, an approximation based on linear interpolation is used25. The following diagram depicts the approximation model:
Figure 3.10 Approximation Model For Evidence Of An Acoustic Or Phonetic Attribute (p(H|E'), the estimated conditional probability of the phonetic classification type or phonetic quality attribute, is plotted against the measurement t = p(E|E'); the interpolation passes through p(H|¬E) at t = 0, p(H) at t = p(E), and p(H|E) at t = 1)

The graph shown above is used to calculate an estimate of the chances of a phonetic type or phonetic attribute being present, p(H|E'). Implicitly the graph uses λ and λ' to get p(H|E) and p(H|¬E) from:

    p(H|E) = λ·O(H) / (1 + λ·O(H))
    p(H|¬E) = λ'·O(H) / (1 + λ'·O(H))

where O(H) = p(H) / p(¬H).

We apply a graph such as this to every node in our Bayesian network and update our hypotheses based on our prior knowledge and the evidence collected. We are assuming we have prior knowledge of the probability of a
phonetic type or phonetic attribute being present (p(H) above; note a phonetic type may also play the role of p(E), in which case it contributes to the conditional probability of a phonetic attribute). Also, we assume we have prior knowledge of the probability of an acoustic parameter being present (p(E) above). The graph ensures that p(E) maps to p(H), meaning that if the measurement is inconclusive (i.e., we have no measurement at all), then the estimate of the chances of H should also be the prior probability of H. If the measurement is less than p(E), then we have evidence against the presence of E and therefore reduce the chance of H. The shaded area depicted in the graph is said to "support" the presence of H, because if the measurement is greater than our prior knowledge of E, this maps to evidence of H beyond our prior knowledge of H.

Multiple Evidence

In our subjective Bayesian network, there are several places where multiple evidence is required to update the probability of H. To deal with this, a new term is defined:

    λeff = O(H|E') / O(H)

which is computed from p(H|E'):

    O(H|E') = p(H|E') / (1 − p(H|E'))

For the purposes of this research, we make the assumption that the multiple evidence is independent26, and compute λeff for each link. Now we can define:

    λEFF = λeff1 · λeff2 · ... · λeffn

This allows us to update the odds of H:

    O(H | E'1, ..., E'n) = λEFF · O(H)
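A minimal sketch of this update scheme for a single hypothesis node, combining the linear interpolation of Figure 3.10 with the independent-evidence product above (the priors and the λ, λ' values per link are assumed given, with 0 < p(E) < 1):

```python
def interpolate_p_h(t, p_e, p_h, p_h_given_e, p_h_given_not_e):
    """Linear interpolation of p(H|E') from the measurement t = p(E|E'),
    as in Figure 3.10: t = 0 -> p(H|not E), t = p(E) -> p(H), t = 1 -> p(H|E).
    """
    if t <= p_e:
        return p_h_given_not_e + (p_h - p_h_given_not_e) * t / p_e
    return p_h + (p_h_given_e - p_h) * (t - p_e) / (1 - p_e)

def update_node(p_h, links):
    """Combine several pieces of uncertain evidence on one hypothesis node.

    links: list of (t, p_e, lam, lam_prime) tuples, one per incoming link,
           assuming the pieces of evidence are independent.
    Returns the updated probability of the hypothesis.
    """
    odds_h = p_h / (1 - p_h)
    lam_eff_total = 1.0
    for t, p_e, lam, lam_prime in links:
        p_h_e = lam * odds_h / (1 + lam * odds_h)                  # p(H|E)
        p_h_not_e = lam_prime * odds_h / (1 + lam_prime * odds_h)  # p(H|not E)
        p_h_prime = interpolate_p_h(t, p_e, p_h, p_h_e, p_h_not_e)
        lam_eff_total *= (p_h_prime / (1 - p_h_prime)) / odds_h    # lambda_eff per link
    odds_updated = lam_eff_total * odds_h                          # O(H|E'1..E'n)
    return odds_updated / (1 + odds_updated)                       # back to a probability
```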
It is probable that multiple evidence is not always independent. In that case, we must make modifications that deal with these scenarios; this could be addressed in future research.

Phonetic Analysis Relationships

For each node in our network, we must derive prior knowledge of the probability of E, our evidence, and of H, our hypotheses. For H, p(H) is the chance that H is present before any evidence E is presented. For E, p(E) is the chance that the evidence is present or not. In addition, values must be selected for sufficiency λ and necessity λ' by carefully looking at each separate link in the network and determining how these values best describe each relationship. These values can be selected based on the "expert knowledge" available about each relationship in our network.

Summary

In this chapter we have introduced a new method called Adaptive Forward Planning (AFP) which evaluates and utilizes acoustic parameters of speech for speaker verification. By utilizing a fuzzy rule system (FREP), we have shown how acoustic parameters extracted from speech can be evaluated by measuring their values against a baseline (average) measurement. We then use these values as inputs to a Bayesian inference network (BANPA). The Bayesian network was developed by analyzing relationships between phonetic attributes and phonetic classifications inherent to the human voice.
There were many potential variations to the components of the AFP architecture that were discussed. The components of most interest were the linguistic variable values, the fuzzy rules, the defuzzification processes, the values chosen for necessity and sufficiency, the process and methods of updating the network, and the handling of multiple evidence. The variations available to these components are almost endless. Each should serve as a focal point for future research.
CHAPTER 4

IMPLEMENTATION

Now that we have defined the AFP architecture, we describe the implementation of certain components and evaluate their performance. In this chapter, we will discuss the implementation details and some of the experimental results.

System Overview

Adhering very closely to the model described in chapter 3, a prototype system was built. There were limitations to the implementation due to lack of resources and time. These limitations will be pointed out where applicable throughout this chapter. The prototype was built in Microsoft Windows 95, using Asymetrix's Multimedia ToolBook 4.0. The system was built in three phases. The first phase implemented was the BANPA sub-system. The second was the FREP sub-system. The third phase consisted of tying together the BANPA, FREP, Password Models, and acoustic extraction functionality. Each of the phases is briefly discussed below.

BANPA Implementation

The BANPA sub-system described in chapter 3 was constructed as depicted. The sub-system was built so that a user can enter values for the


The sub-system was built so that a user can enter values for the acoustic parameters, and then compute an overall value for the phrase from which the acoustic parameters were extracted. The values for necessity and sufficiency for each of the relationships in the network were entered into an Excel spreadsheet, and the spreadsheet was saved as a text file. When the sub-system was run, the data from the spreadsheet was read into the network and assigned to all interior nodes of the network. The data used for the BANPA system is shown in appendix A. These values were used for the BANPA simulation runs, and for the actual finished AFP system. The interface for inputting acoustic parameter values and computing the overall value of a phrase is shown below.

Figure 4.1 The BANPA User Interface


A series of simulations was run to determine the overall sensitivity and behavior of the Bayesian network. The results of these runs can be found in appendix B. The source code used to implement the sub-system is available upon request.

FREP Implementation

The fuzzy rule sub-system FREP, as described in chapter 3, was constructed as depicted. The sub-system was built so that a user can enter the correlation and slope values as would be derived when comparing an acoustic feature extraction against the baseline (or average) of that feature. The interface that was built for FREP is shown below.

Figure 4.2 The FREP User Interface
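To illustrate the kind of evaluation FREP performs, the sketch below fuzzifies a correlation and slope-difference pair and defuzzifies the fired rules into a single match value. The membership functions, linguistic values, and rule weights shown here are illustrative assumptions only; the actual linguistic variables and rules are those defined in chapter 3.

    def tri(x, a, b, c):
        # Triangular membership function peaking at b
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def frep_score(correlation, slope_diff):
        # Fuzzify the inputs (assumed linguistic values)
        corr_high = tri(correlation, 0.5, 1.0, 1.5)
        corr_low = tri(correlation, -0.5, 0.0, 0.5)
        slope_close = tri(slope_diff, -0.5, 0.0, 0.5)
        slope_far = tri(slope_diff, 0.5, 1.0, 1.5)

        # Fire the rules (min for AND); each rule points at an output level
        rules = [
            (min(corr_high, slope_close), 0.9),  # strong match
            (min(corr_high, slope_far), 0.5),    # mixed evidence
            (min(corr_low, slope_close), 0.5),   # mixed evidence
            (min(corr_low, slope_far), 0.1),     # weak match
        ]
        # Centroid-style defuzzification over the fired rules
        total = sum(w for w, _ in rules)
        return sum(w * v for w, v in rules) / total if total else 0.0

    print(frep_score(0.95, 0.05))  # high correlation, similar slope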


As was done for the BANPA simulation, a series of simulations was run to determine the overall sensitivity and correctness of the fuzzy rule system. The results of these runs can be found in appendix C. The source code used to implement the sub-system is available upon request.

Password Model Implementation

A set of 21 passwords was chosen for this initial study. Each word was chosen based on its acoustic correlation to various phonetic properties inherent in the English language. By studying these properties, the words were partitioned into four separate phonetic classifications: vowels, stop consonants, liquids and glides, and nasal consonants. Four initial words were required to be spoken up front. Each of these correlated to one of the four phonetic classifications mentioned above and served as a starting point for password selection. These four password models are the initial set of passwords mentioned in chapter 3. Each of the four phonetic classifications and the words chosen for each are briefly described below. For a more in-depth discussion of acoustic phonetic theory see [27, 17, 28].

Vowel Password Selection. Vowel-characterized words were chosen based on four properties: long vowel duration, short vowel duration, vowel nasalization, and oral vowels. The words chosen for these vowel characteristics are as follows:

"animated" This word serves as the initial starting point or initial password for vowel classification. It was chosen in an attempt to capture all four properties mentioned above. By doing this, if AFP chose this as the best initial password, then the speaker would be asked to speak the following four vowel-characterized passwords.

"heed" This was chosen to represent the characteristic of long vowel duration.

"can" This was chosen to represent the characteristic of vowel nasalization.

"high" This was chosen to represent the characteristic of an oral vowel.

"hid" This was chosen to represent the characteristic of short vowel duration.

Stop Consonant Password Selection. The stop consonants studied, [p t k b d g], were chosen based on five articulation characteristics: labials, alveolars, velars, voiced, and non-voiced.

"powderkeg" This word serves as the starting point or initial password for stop consonant classification. It was chosen in an attempt to capture all five characteristics mentioned above. By doing this, if AFP chose this as the best initial password, then the speaker would be asked to speak the following five stop-consonant-characterized passwords.

"people" This was chosen to represent the characteristics of labials [p b].

"date" This was chosen to represent the characteristics of alveolars [t d].

"keg" This was chosen to represent the characteristics of velars [k g].

"dogbone" This was chosen to represent the characteristics of voiced stops [b d g].

"ketchup" This was chosen to represent the characteristics of non-voiced stops [p t k].

Liquids/Glides Password Selection. Acoustic sounds of [l r] are referred to as liquids, and [w y] as glides. The durations of liquids and glides have been found to produce identifiable acoustic characteristics.


The passwords chosen for this classification were broken into four characteristics: short liquid, long liquid, short glide, and long glide.

"lawyer" This word serves as the starting point or initial password for the liquids/glides classification. It was chosen in an attempt to capture all four characteristics mentioned above. By doing this, if AFP chose this as the best initial password, then the speaker would be asked to speak the following four liquids/glides-characterized passwords.

"love" This was chosen to represent the characteristics of a short duration liquid.

"room" This was chosen to represent the characteristics of a long duration liquid.

"weather" This was chosen to represent the characteristics of a short duration glide.

"yams" This was chosen to represent the characteristics of a long duration glide.

Nasal Consonant Password Selection. The nasal consonants [m n] were chosen based on the variety of acoustic consequences created by the opening of the nasal cavity when sound is propagated through both the nose and mouth.

"midnight" This word serves as the starting point or initial password for nasal consonant classification. It was chosen in an attempt to capture the characteristics mentioned above. By doing this, if AFP chose this as the best initial password, then the speaker would be asked to speak the following four nasal-consonant-characterized passwords.

"moon" This was chosen to represent the characteristics of a long duration nasal consonant.

"nose" This was chosen to represent the characteristics of a short duration nasal consonant.

"m&n" This was chosen to represent the characteristics of a combination of nasal consonants.

"animal" This was chosen to represent the characteristics of a nasal consonant combined with vowel nasalization.

The 21 selected passwords can be viewed as a tree-like structure that represents the relationships among them: each of the four initial passwords (animated, powderkeg, lawyer, midnight) heads one of the classifications (vowels, stop consonants, liquids/glides, nasal consonants), with its four characteristic passwords beneath it. The following figure depicts this.

Figure 4.3 Password Models For Study


The HpW Works Analyzer For Acoustic Parameter Extraction

The main limitation of the implementation was in the model for extracting acoustic parameters. Due to the platform chosen for development, and the method of integration, AFP required an IBM-PC Windows 3.1/95 based component. Only one usable system was found after a long, exhaustive search; by far, this was the most difficult part of implementation. A system was needed that executed the extraction of acoustic parameters from an incoming speech signal. Most of the systems discovered were either very expensive and proprietary, ran on incompatible platforms and/or hardware, or did not offer a way to capture useful acoustic data. The only system found that provided useful functionality and met the constraints listed above was the HpW Works Analyzer.

Figure 4.4 HpW Works System Author And Version


Using the HpW Works Analyzer, Fast Fourier Transform (FFT) data, computed using a full Hamming window, and Spectral data were extracted from digital audio files. The HpW Works system allows digital audio files to be processed, and then allows data dumps of both the FFT data and the Spectral data to text files. The system has an easy-to-use interface and runs on Windows 3.1 and 95. For all speaker subjects used in this research, digital audio files were first recorded in the WAV format and pre-processed. Each was then individually submitted to the HpW Works system, and two files, containing FFT data and Spectral data respectively, were created using the system. Even though the HpW Works system only extracted these two types of data, it was very useful and allowed the research to continue.

AFP System Integration

Now that all of the sub-system implementations have been described, the overall AFP system used in this research is described next. The goal was to integrate all, or as many as possible, of the components depicted in chapter 3. This made up the final AFP system used for research and testing. The most painstaking part of integration was creating the acoustic feature data for each sample recorded. This had to be done manually and was very time consuming. Once the acoustic data files had been created for all of the samples used in the research, the rest of the system integration went fairly quickly and easily. The BANPA and FREP sub-systems were tied together under a common interface. Then, new user interfaces were created so a speaker could interact with the system and carry out speaker verification processes. The original goal was to allow a speaker to record his or her voice in real time while interacting with the AFP system. For experimental purposes, the functionality to load acoustic feature data files was added.


Each of the acoustic feature data files was created as described above. Once these files had been created, they could be loaded into the system via the interface. An example of one of the user interface screens is shown below.

Figure 4.5 User Interface Screen For AFP Password Selection

In the screen shown above, the user is requested to record or load ("submit from file" button) data for each of the passwords. The system interacts with the user exactly as described in chapter 3, and eventually finds the password for the speaker to use. For the screen above, after all passwords have been recorded or loaded, the user hits the "Finished" button; the system then analyzes the password data and determines the next group of passwords based on that analysis.


The user is then presented with the next interface screen.

AFP System Process

First, to carry out the research, speakers were recorded to tape. The recording system used to record speaker subjects consisted of a super-VHS recording deck with a built-in microphone. The microphone quality was very similar to what might be used in a computer lab or office environment. Voice samples were first recorded on this deck and then individually sampled from the deck to a Pentium 166 MHz PC using a 16-bit Sound Blaster compatible sound card. All voice recordings were sampled at 22.050 kHz, in mono, at 16 bits, and saved in the WAV file format. Next, each sample was processed by truncating the useless artifacts found at the beginning and end of each sample. Once all of the WAV files were processed in this fashion, each was submitted to the HpW Works system described above. From this system, the two acoustic parameters of FFT and Spectrum were extracted and dumped to separate data files. The two data files were then merged, and this merged file represented the acoustic parameter data set of each spoken password. The data set files were then named, organized by speaker, and stored in a common directory. A sketch of this per-sample processing is given below; the illustration that follows then depicts the overall AFP process.
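The following is a minimal sketch of this per-sample pipeline, using NumPy/SciPy in place of the HpW Works system. The silence threshold, the band count, and the profile layout are assumptions for illustration, not details of the actual implementation.

    import numpy as np
    from scipy.io import wavfile

    def make_profile(path, silence_threshold=0.01):
        rate, signal = wavfile.read(path)          # 22.050 kHz, 16-bit mono WAV
        x = signal.astype(np.float64) / 32768.0    # normalize 16-bit samples

        # Truncate the low-amplitude artifacts at the beginning and end
        active = np.where(np.abs(x) > silence_threshold)[0]
        if active.size:
            x = x[active[0]:active[-1] + 1]

        # FFT magnitudes taken over a full Hamming window
        fft_mag = np.abs(np.fft.rfft(x * np.hamming(len(x))))

        # A coarse spectral-shape summary: mean magnitude per band
        spectrum = np.array([b.mean() for b in np.array_split(fft_mag, 32)])

        # Merge the two parameter sets into one acoustic data profile
        return {"fft": fft_mag, "spectrum": spectrum}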


Figure 4.6 AFP System Process

Speaker Verification Tests

In this phase of the research, a group of 9 speakers was studied. Each speaker was asked to speak all 21 password models to simulate an initial user training session. Each word was recorded and processed as explained above, resulting in acoustic feature profiles containing FFT and Spectral data for each word spoken. In total, there were 9 speakers, each with 21 passwords recorded. This resulted in 189 acoustic data files stored in a central directory on the host computer. The next step was to record each speaker speaking each password 5 separate times, so that subsequent verification tests could be carried out. In total there were 6 x 9 x 21 = 1134 recordings made. The speaker data was collected "as is"; that is, there was no additional processing, such as time-warping or other types of filtering, done on the digital audio files.


The reason for this was to find out if verification could be implemented without this type of additional processing. The HpW Works system did a small amount of initial processing to execute a discrete fast Fourier transform, but that was all. There were two main goals of the speaker verification tests. The first was to determine how well the system selected speaker passwords based on the complete feature set (two features in this case). The second was to determine how well the system performed speaker verification using one, and then two, acoustic feature parameters.

Speaker Password Selection Results

Once all passwords had been recorded and processed, the acoustic data collected for each was averaged among the 9 speakers as illustrated in chapter 3, and a base acoustic data file was created. This resulted in 21 "base password" files, one for each password. Then, a user interaction was simulated for each speaker by submitting their acoustic data files to the AFP system. As explained in chapter 3, the correlation and slope difference were computed against the base password data files and scored for best match. The resulting decisions made by the AFP system determined a particular speaker's password. A sketch of this scoring is given below; the results of the 9 speaker interactions and password selections then follow.
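The sketch below illustrates the base-profile averaging and the correlation and slope-difference measures described above. The way the two measures are folded into one score here is a simplifying assumption; the actual combination is done by the FREP and BANPA sub-systems.

    import numpy as np

    def base_profile(profiles):
        # Average the acoustic profiles of all speakers for one password
        return np.mean(np.stack(profiles), axis=0)

    def correlation(a, b):
        # Pearson correlation between two equal-length data sets
        return float(np.corrcoef(a, b)[0, 1])

    def slope_difference(a, b):
        # Difference in regression-line slopes of the two data sets
        x = np.arange(len(a))
        return abs(np.polyfit(x, a, 1)[0] - np.polyfit(x, b, 1)[0])

    def score_password(speaker_profile, base):
        # Higher is better: strong correlation and a similar slope
        return correlation(speaker_profile, base) - slope_difference(speaker_profile, base)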


Speaker   Sex  Age  Password Selected
Mike      m    35   weather
Carol     f    35   yams
Dean      m    42   people
Chris     m    39   nose
Linda     f    38   yams
Peggy     f    34   high
Sue       f    36   yams
Steve     m    32   ketchup
John      m    40   weather

Table 4.1 Results of Password Selection

Using 9 speakers and 21 passwords, the AFP system illustrated a fairly reasonable distribution of password selection. As the table above shows, only two passwords were selected for more than one person. The figure below illustrates this distribution.


Figure 4.7 Distribution Of Password Selection

Note that even though two passwords were selected for more than one speaker, the passwords were selected for uniqueness of speaker, not for uniqueness of words. Because of this, it is irrelevant whether the passwords are the same or different. What the previous two figures illustrate is that the ways people articulate certain words are similar, and these similarities can be grouped together. For instance, the password "yams" was selected for three speakers, all females of approximately the same age. As was mentioned above, each speaker's acoustic data set was measured against the baseline data set, which is the average of all 9 acoustic profiles for each password. To illustrate this, the following set of graphs shows the results for each speaker's spectral shape generated from the initial password "animated".


Figure 4.8 Spectral Shape For Speaker 1 Versus Average Spectral Shape For Password "animated" (each of these graphs plots Amp. (dBFS) against Frequency (KHz), comparing the base average with the individual speaker)

Figure 4.9 Spectral Shape For Speaker 2 Versus Average Spectral Shape For Password "animated"


Figure 4.10 Spectral Shape For Speaker 3 Versus Average Spectral Shape For Password "animated"

Figure 4.11 Spectral Shape For Speaker 4 Versus Average Spectral Shape For Password "animated"


Figure 4.12 Spectral Shape For Speaker 5 Versus Average Spectral Shape For Password "animated"

Figure 4.13 Spectral Shape For Speaker 6 Versus Average Spectral Shape For Password "animated"

Figure 4.14 Spectral Shape For Speaker 7 Versus Average Spectral Shape For Password "animated"

Figure 4.15 Spectral Shape For Speaker 8 Versus Average Spectral Shape For Password "animated"


Figure 4.16 Spectral Shape For Speaker 9 Versus Average Spectral Shape For Password "animated"

As the graphs above illustrate, each occurrence of a spoken word varies significantly. Because of this variance, speakers can be successfully identified using acoustic data such as this. As mentioned above, four initial passwords were used to differentiate between four phonetic classifications. The figure below depicts the speaker-phonetic classification relationships that AFP determined.


Speaker    Sex  Age  Initial Password  Phonetic Classification
1. Mike    m    35   lawyer            Liquids/Glides
2. Carol   f    35   lawyer            Liquids/Glides
3. Dean    m    42   powderkeg         Stop Consonants
4. Chris   m    39   midnight          Nasal Consonants
5. Linda   f    38   lawyer            Liquids/Glides
6. Peggy   f    34   animated          Vowels
7. Sue     f    36   lawyer            Liquids/Glides
8. Steve   m    32   powderkeg         Stop Consonants
9. John    m    40   lawyer            Liquids/Glides

Table 4.2 Phonetic Class Selection Of Speakers

For the case of the word "animated" shown above, the acoustic parameters of speaker 6 matched the average spectral shape for "animated" very closely. Referring to the graphs above for all speakers, speaker 6 is the closest match among all 9 speakers. The criterion for selecting a password could also be to find the speaker who is furthest away from the baseline average. This would distinguish the speaker from all others by depicting how one speaker's articulation differs from the other speakers in the system. This criterion should be examined in future research.

The results of this experiment have shown a reasonable distribution of acoustic feature differences among a small set of 9 people.


This is encouraging in the sense that with a small speaker population, the AFP system was able to distribute passwords and distinguish each speaker among the averages of all speakers.

Speaker Verification

Once user passwords had been determined, the AFP system was challenged by simulating subsequent access encounters to the system. The goal in this portion of the study was to determine the accuracy of the AFP speaker verification system; that is, how well the system verifies a speaker when he or she encounters the system at a later date. The system stores information about each user: the user's id, the user's password used for access, and the acoustic data profile of that password, which was derived from the user's first encounter with the system. As mentioned before, for each speaker used in the study, 5 additional occurrences of each password were recorded. Once the AFP system had determined what the user's password would be from the previous experiment, the 5 additional occurrences of the same password were processed as before and then used to simulate 5 separate system verification encounters. This was a total of 45 (9 users x 5 passwords) attempted verifications to the AFP system.

One Versus Two Acoustic Parameters. Several variations of speaker verification were carried out in two ways: first by utilizing only one acoustic parameter, and second by using two acoustic parameters. The goal was to determine what difference, if any, occurred using one versus two acoustic parameters for verification.


Matching Algorithms

A variety of simple matching algorithms was employed in the verification experiments. The goal was to determine what impact, if any, different matching algorithms have on the verification process. The algorithms used in this study are discussed below.

Algorithm 1: Correlation. Correlation, as defined in chapter 3, was computed on two data sets. Out of all paired matches, the one with maximum value (i.e. best correlation) was identified. When two parameters are used, the maximum is taken.

Algorithm 2: Closest Match. Data points are compared between two similar data sets, and the absolute difference between each point is computed. Out of all paired matches, the one with minimum value (i.e. the closest match) was identified. When two parameters are used, the minimum is taken.

Algorithm 3: Slope. The slope of the regression line, as defined in chapter 3, was computed on two data sets. Out of all paired matches, the one with minimum difference was identified. When two parameters are used, the minimum of the two is taken.

Algorithm 4: FREP. The FREP (Fuzzy Rule Evaluation of Parameters) sub-system was used to determine how the two data sets compared. Correlation and slope were computed for the pair of data sets, and the fuzzy rule system described in chapter 3 was used to compute a final value. Out of all matches, the one with maximum value was identified. When two parameters are used, the maximum is taken.

Algorithm 5: FREP With BANPA (AFP). The FREP sub-system was combined with the BANPA sub-system.


Correlation and slope were computed for the pair, the fuzzy rule system was used, and then the results were sent to the Bayesian network to compute a final value. Out of all matches, the one with maximum value was identified. With one parameter, the second input to the Bayesian network was held at 0.5. With two parameters, the two results were fed as inputs to the network, as was done in the selection of passwords described above. This is the complete AFP system implementation described in chapter 3.

System Verification Criteria 1: Comparison Against The Same Passwords

To determine whether an incoming speech signal accurately verifies a speaker, the acoustic parameters FFT and Spectrum were extracted as before to form a new acoustic data profile. This new profile was then compared against all profiles of the same password from all users of the system. Using the matching algorithms described above, the closest match was found and reported to the system. If the match was not who the user said he or she was, then verification failed; otherwise verification was successful. A sketch of this decision rule is given below; the results of the 45 system verification encounters are then illustrated in the figures that follow.
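A minimal sketch of this criterion, assuming a dictionary of enrolled profiles and any of the scoring functions sketched earlier, is as follows. The names are illustrative only.

    def verify_same_password(claimed_id, new_profile, enrolled, score):
        # enrolled: dict mapping user id -> stored profile for this password.
        # score: any matching function where higher means a better match.
        best_id = max(enrolled, key=lambda uid: score(new_profile, enrolled[uid]))
        # Verification succeeds only if the best match is the claimed user
        return best_id == claimed_id

    # Example usage with the correlation scorer sketched earlier:
    # ok = verify_same_password("mike", profile, enrolled_profiles, correlation)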


Figure 4.17 Successful Verifications: Against Same Passwords (One Parameter)

As the graph above shows, using a single parameter for speaker verification does not yield great results. This was somewhat expected. The best matching algorithm was Algorithm 4 (FREP).

Figure 4.18 Successful Verifications: Against Same Passwords (Two Parameters)


As the graph above illustrates, there was a significant improvement when adding a second parameter for verification. Also, Algorithm 5, the FREP and BANPA combination, fared the best. This algorithm is the complete AFP system implementation, and in this case it has been used for speaker verification. The AFP methods of acoustic parameter evaluation and utilization yielded the best verification results.

System Verification Criteria 2: Comparison Against The Baseline Passwords

In this experiment, the new profiles created from a user's attempt at verification were compared against the baseline profile (the average of all users) of the same password. A "confidence threshold" was created by defining how close a match needed to be. The threshold was set at 90% for this research, meaning that if a match was 90% or better, verification succeeded. With this criterion, if the verification attempt fell below the threshold, verification failed; otherwise it was successful. Further research may be carried out by defining a variety of thresholds and observing the change in algorithm performance. The same matching algorithms described above were used. The results of the 45 system verification encounters are illustrated in the figures below.


Figure 4.19 Successful Verifications: Against Baseline Password (One Parameter)

Overall, the number of successful verifications grew. Surprisingly, Algorithm 5 did not fare well when meeting a threshold. Algorithm 4, FREP, did the best. This algorithm takes the maximum of the correlation and slope difference values.

Figure 4.20 Successful Verifications: Against Baseline Password (Two Parameters)


Again, when adding a second acoustic parameter, verification improved. Algorithm 4, FREP, improved significantly (from 19 to 38 successful verifications). In this case, the fuzzy rule system defined in chapter 3, using two acoustic parameters, yielded the best results.

System Verification Criteria 3: Spoofing - Determining A Successful Rejection Rate

In this experiment, each of the users attempted to use a password other than their designated password for verification. There are 21 baseline passwords in the system. Each user attempted 5 other passwords, all of which are baseline passwords, but not their own. Again, a "confidence threshold" was defined. With this criterion, if the verification attempt fell below the threshold, verification was rejected and deemed a successful rejection. Otherwise it was deemed a failure, in that it allowed verification with an unauthorized password. The same matching algorithms described above were used. The results of the 45 system verification encounters are illustrated in the figures below.


Figure 4.21 Successful Rejections: Against Spoofing Baseline Password (One Parameter)

Figure 4.22 Successful Rejections: Against Spoofing Baseline Password (Two Parameters)


There was very little difference between one and two parameters; the exception to this was Algorithm 5, the AFP system. The last two experiments point out that by setting a pre-determined threshold, one can control the behavior of a speaker verification system, provided enough data is available to the administrator of the system. The problem is that any pre-set threshold is vulnerable to system changes and must be constantly updated to accommodate those changes. The point is that one must be extremely careful when using thresholds, and allow for the increase in maintenance. Although the experiments above showed promising results, the number of variations explored was limited. Future research should explore additional variations of matching algorithms and matching thresholds.
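The threshold logic used in criteria 2 and 3 can be sketched as follows; the single threshold constant is exactly the value that, as noted above, an administrator must keep maintained as the system changes. The assumption that match scores are normalized so that 1.0 is a perfect match is ours.

    CONFIDENCE_THRESHOLD = 0.90  # the 90% threshold used in this research

    def verify_against_baseline(new_profile, baseline_profile, score):
        # Criterion 2: accept only if the match meets the threshold
        return score(new_profile, baseline_profile) >= CONFIDENCE_THRESHOLD

    def successful_rejection(new_profile, baseline_profile, score):
        # Criterion 3: a spoofing attempt that falls below the threshold
        # counts as a successful rejection
        return not verify_against_baseline(new_profile, baseline_profile, score)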


CHAPTER 5

CONCLUSIONS

The research carried out in this study found that there are new ways in which speaker verification can be implemented using acoustic parameters. By focusing on the key attributes of the human voice, we have found that by parameterizing speech there is much promise in developing an invariant set of acoustic features that will differentiate one speaker from the next. Even though much work has been done in extracting and using acoustic features for speaker verification, we have found new ways and techniques of implementation that appear to be effective. This study demonstrates that with ingenuity and thought, one can explore many different avenues that could very well lead to new successes in speaker verification.

Much of the work accomplished in this study has been derived from a multitude of disciplines, including signal processing, artificial intelligence, and computer science. Our main focus has been on the extraction and use of acoustic features to improve or approach speaker verification from a new direction. We have discovered the important role acoustic features play in the overall effectiveness of speaker verification.

In this study, we have also shown that one acoustic feature is not enough for successful verification. We have shown that two is not enough either.


But more importantly, we have shown that the difference between one and two acoustic parameters used for speaker verification yields significant results. That is, with two parameters we improved verification from where it stood with one parameter. This shows that acoustic parameterization of speech is probably one of the more important technologies available for implementing speaker verification. If two parameters improve over one by as much as we have shown, then we would expect three to improve our results even further. The same could be said for four, five, six, or additional parameters, until the returns diminish. In fact, it is our belief that the more acoustic parameters introduced, the more accurate speaker verification would become. Of course, without further investigation we cannot confirm this, but our results thus far appear promising.

We have shown that speaker verification can be implemented cost-effectively and reliably, while integrating an easy-to-use, appealing user interface. We have developed a new method for evaluating and utilizing acoustic parameters that can be used effectively for speaker verification. This method, Adaptive Forward Planning (AFP), was built with experimentation and further research in mind. Because of this, many areas within the AFP architecture may be altered and combined in different ways that will potentially yield a variety of different results. We have modeled the first implementation of AFP in a way that has yielded significant speaker verification results. Based on this success, further research using the AFP architecture should continue to lead to new and exciting results.


Speaker Verification: A Technology Waiting

In this study, we have looked at a variety of speaker verification technologies that have been developed or researched over the last twenty years. Most of the previous research has focused on laboratory experiments, not really designed for commercial use. Although much very useful technology has resulted from this work, more effort needs to be put into taking this technology to fruition. It was found that the methods used for verification based on acoustic parameters have mostly been limited to matching data sets against each other. This has left open alternate possibilities for how acoustic parameters can be used in decision-making algorithms for speaker verification. This thesis has explored a few of those possibilities.

Speaker verification is a technology that will continue to evolve. Due to the heightened awareness of, and desire for, more economical and portable security systems, speaker verification technology can and will play a pivotal role in the future. The explosion of the Internet demands a common, reliable, and easy-to-use method for securing transactions. Speaker verification can fill this void because it can be made reliable, it is simple to use, and it can be developed in a way that makes it common to the many different platforms used for accessing the Internet. To secure point-of-purchase transactions at grocery stores, gas stations, banks, and restaurants, speaker verification can easily be implemented on the simplest computers. Most systems would require no more than a 486 PC with a Sound Blaster compatible sound card and a reasonable medium-grade microphone. A system such as this can be purchased for a few hundred dollars. Considering the savings in preventing credit-card fraud, this could be a very beneficial addition to many businesses.


Probably the biggest concern is this question: will consumers accept such a form of security? The answer is: most likely. Many people today are comfortable with and accustomed to computers, particularly those who are apt to be making credit card purchases or perusing the Internet. But more importantly, speaker verification doesn't need to run from a personal computer at all. Much of the technology discussed in this paper can be implemented as an embedded system on a much smaller computer, perhaps the size of a shoe box. This will not only allow for maximum portability, but such a device can be packaged in a very consumer-friendly way.

If one were to think for a moment about alternative methods of security, they would not be very appealing. Try to imagine, for example, stopping what you're doing and positioning your hand directly on a hand scanner. Would this be simpler than speaking into a microphone? Would you be more comfortable placing your hand on something, or doing what we naturally do every day of our lives: speak? Another alternative would be allowing your face to be photographed by a strange, unfriendly-looking, high-tech infrared device with a lens on it. Again, which would you be more comfortable with?

Where To Go From Here

Many of the concepts and theories that have surfaced in this research have not been fully realized. The details of AFP described in chapter 3 have many variations available to them. For example, the values that were determined for sufficiency (λ) and necessity (λ') and used in the Bayesian network could be further improved and even derived statistically. For this study, these values were derived by carefully looking at each separate link in the network and determining how they best describe each phonetic relationship.


As our understanding of phonetic relationships improves, these values could be improved as well.

Much of the previous research in speaker verification made greater use of processing the speech signal. For instance, it is very common to use the method of time-warping to bring speech data sets into parity before comparing the two. Many researchers have underlined the importance of this, due to the fact that the length of every spoken phrase is different. Had we included this type of processing in our work, we probably would have yielded even better verification results. Another important factor of speech processing is controlling the recording environment. Unfortunately, this approach does not have much practical use in the real world. This fact should lead our implementations to be less environment-dependent, which our approach attempts.

Even though the use of matching algorithms is a main focus of speaker verification, we have tried to approach matching less critically and have relied more on the acoustic features themselves. The matching algorithms used were straightforward and efficient. Much could be done in this area: the more that is understood about the features we extract, the more detailed and fine-tuned matching algorithms would become. One could focus on just this aspect and most likely uncover many more effective algorithms than were used in this study.

Even though there have been many approaches to speaker verification, such as vocal tract modeling with neural networks, statistical methods, and probabilistic approaches, we have concentrated our efforts on one thing: acoustic features of speech.


We have done this because we believe that the acoustic features, if extracted appropriately, really hold the information we are after. The focus of continuing this research should be on three things: what acoustic features we are extracting and their relationships to one another, how we are extracting and processing them, and how we will use this extracted information to improve speaker verification.
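As an illustration of the time-warping processing mentioned above, which was deliberately left out of this study, a minimal dynamic time warping (DTW) distance computation might look as follows. This is a generic sketch, not part of the AFP prototype.

    import numpy as np

    def dtw_distance(a, b):
        # Alignment cost between two 1-D sequences of different lengths
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return cost[n, m]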


APPENDIX A - VALUES USED FOR BANPA SYSTEM

PRIOR PROBABILITIES

Feature                          p(E)   p(H)
Fundamental Frequency            0.5    0.7
Spectral Energy                  0.4    0.8
Low Frequency Energy             0.5    0.8
Mid Frequency Energy             0.5    0.7
High Frequency Energy            0.8    0.8
First Formant Frequency          0.5    0.7
Second Formant Frequency         0.4    0.5
Third Formant Frequency          0.4    0.8

Interior nodes:
Intensity                        0.5    0.8
Stress                           0.8    0.9
Intonation                       0.8    0.7
Articulatory Configuration       0.5    0.8

Phonetic quality nodes:
Vowel Presence                   0.5    0.7
Consonant Presence               0.5    0.8
Prosodic Quality                 0.8    0.8
Articulation Quality             0.5    0.7
Total Energy                     0.4    0.8

RELATIONSHIPS (sufficiency λ, necessity λ')

Note on relationships: a value of zero for λ means the relationship does not exist.

Acoustic parameters feeding Intensity:
Fundamental Frequency      λ = 15   λ' = 0.8
Spectral Energy            λ = 0
Low Frequency Energy       λ = 10   λ' = 0.5
Mid Frequency Energy       λ = 10   λ' = 0.4
High Frequency Energy
First Formant Frequency    λ = 10   λ' = 0.5
Second Formant Frequency   λ = 14   λ' = 0.5
Third Formant Frequency    λ = 12   λ' = 0.8

Acoustic parameters feeding Stress:
Fundamental Frequency      λ = 15   λ' = 0.8
Spectral Energy            λ = 10   λ' = 0.7
Low Frequency Energy
Mid Frequency Energy       λ' = 0.4
High Frequency Energy      λ' = 0.5
First Formant Frequency
Second Formant Frequency   λ = 10   λ' = 0.5
Third Formant Frequency    λ = 10   λ' = 0.5

Acoustic parameters feeding Intonation:
Fundamental Frequency      λ = 15
Spectral Energy            λ = 0
Low Frequency Energy
Mid Frequency Energy
High Frequency Energy      λ = 14   λ' = 0.4
First Formant Frequency    λ = 12   λ' = 0.8
Second Formant Frequency   λ = 11   λ' = 0.5
Third Formant Frequency    λ = 10   λ' = 0.5

Acoustic parameters feeding Articulatory Configuration:
Fundamental Frequency
Spectral Energy            λ = 14   λ' = 0.8
Low Frequency Energy       λ = 0
Mid Frequency Energy
High Frequency Energy
First Formant Frequency    λ = 10   λ' = 0.7
Second Formant Frequency   λ = 8    λ' = 0.6
Third Formant Frequency    λ = 10   λ' = 0.7

Interior nodes feeding Vowel Presence:
Intensity                  λ = 15   λ' = 0.7
Stress                     λ = 15   λ' = 0.5
Intonation                 λ = 0
Articulatory Configuration λ = 10   λ' = 0.8

Interior nodes feeding Consonant Presence:
Intensity                  λ = 14   λ' = 0.7
Stress                     λ = 12   λ' = 0.4
Intonation                 λ = 10   λ' = 0.5
Articulatory Configuration λ' = 0.5

Interior nodes feeding Prosodic Quality:
Intensity                  λ = 15   λ' = 0.7
Stress                     λ = 0
Intonation                 λ = 18   λ' = 0.5
Articulatory Configuration λ = 0

Interior nodes feeding Articulation Quality:
Intensity                  λ = 0
Stress                     λ = 12   λ' = 0.4
Intonation                 λ = 10   λ' = 0.5
Articulatory Configuration λ = 18   λ' = 0.8

Interior nodes feeding Total Energy:
Intensity                  λ = 15   λ' = 0.5
Stress                     λ = 10   λ' = 0.8
Intonation                 λ = 8    λ' = 0.8
Articulatory Configuration λ = 15   λ' = 0.4

Nodes feeding PHONETIC QUALITY:
Vowel Presence             λ = 18   λ' = 0.7
Consonant Presence         λ = 12   λ' = 0.8


APPENDIX B - RESULTS OF BANPA RUNS

The data shown below resulted from a series of simulation runs of the BANPA sub-system. For each run, the user inputted values that represented the acoustic parameter value (or worth). These values were then fed to the BANPA system, and an overall phonetic value was computed. Below are the values input and the resulting BANPA output. The source code for the BANPA sub-system is available upon request.

RESULTS OF BAYESIAN NETWORK FOR PHONETIC ANALYSIS

In the graph below, all inputs were run simultaneously from 0 to 1.

Grouped Acoustic Measurements
Input   Output
0.0     0.14899
0.1     0.154287
0.2     0.165926
0.3     0.195896
0.4     0.329512
0.5     0.687049
0.6     0.833083
0.7     0.844935
0.8     0.846209
0.9     0.846369
1.0     0.846381


In the graphs below, all inputs were held at 0.5, and an individual input was run from 0 to 1.

Fundamental Frequency
Input   Output
0.0     0.43493
0.1     0.477531
0.2     0.528322
0.3     0.582453
0.4     0.63549
0.5     0.687049
0.6     0.750227
0.7     0.79347
0.8     0.820653
0.9     0.836261
1.0     0.84424

Spectral Energy
Input   Output
0.0     0.649427
0.1     0.655322
0.2     0.661157
0.3     0.666934
0.4     0.672653
0.5     0.687049
0.6     0.701085
0.7     0.714775
0.8     0.72813
0.9     0.741164
1.0     0.753888

Low Frequency Energy
Input   Output
0.0     0.637279
0.1     0.647819
0.2     0.658054
0.3     0.667996
0.4     0.677657
0.5     0.687049
0.6     0.704793
0.7     0.72161
0.8     0.73757
0.9     0.752736
1.0     0.767164


Mid Frequency Energy
Input   Output
0.0     0.558675
0.1     0.57951
0.2     0.60623
0.3     0.634232
0.4     0.661221
0.5     0.687049
0.6     0.715932
0.7     0.742888
0.8     0.767833
0.9     0.790732
1.0     0.811598

High Frequency Energy
Input   Output
0.0     0.512167
0.1     0.542544
0.2     0.57777
0.3     0.612254
0.4     0.649811
0.5     0.687049
0.6     0.72088
0.7     0.766355
0.8     0.801482
0.9     0.826232
1.0     0.841269

First Formant Frequency
Input   Output
0.0     0.527932
0.1     0.560428
0.2     0.591942
0.3     0.624674
0.4     0.657363
0.5     0.687049
0.6     0.737612
0.7     0.775293
0.8     0.801927
0.9     0.820075
1.0     0.832336


Second Formant Frequency
Input   Output
0.0     0.350842
0.1     0.396255
0.2     0.455871
0.3     0.52637
0.4     0.589956
0.5     0.687049
0.6     0.753294
0.7     0.794929
0.8     0.819777
0.9     0.833995
1.0     0.841778

Third Formant Frequency
Input   Output
0.0     0.374111
0.1     0.420268
0.2     0.477221
0.3     0.542892
0.4     0.604674
0.5     0.687049
0.6     0.747998
0.7     0.789758
0.8     0.816528
0.9     0.832613
1.0     0.841587

In the graph below, each input was set at 1, while all others were held at 0.

Acoustic Measurement Effect
1 FF    0.409138
2 SE    0.156839
3 LFE   0.160258
4 MFE   0.177489
5 HFE   0.317836
6 FFF   0.238552
7 SFF   0.313378
8 TFF   0.288949


APPENDIX C - RESULTS OF FREP RUNS

The data shown below resulted from a series of simulation runs of the FREP sub-system. For each run, the user inputted values representing the correlation and slope difference of two separate acoustic parameter data sets. Below are the values input and the resulting FREP output. The source code for the FREP sub-system is available upon request.

Correlation with Slope held at 0.11 (FREP parameter value versus correlation input)


Slope with Correlation held at 0.11 (FREP parameter value versus slope input)

Correlation = Slope (FREP parameter value versus correlation/slope input)

Correlation with Slope held at 0.5 (FREP parameter value versus correlation input)


Slope with Correlation held at 0.5 (FREP parameter value versus slope input)

Correlation with Slope held at 1.0 (FREP parameter value versus correlation input)

Slope with Correlation held at 1.0 (FREP parameter value versus slope input)


REFERENCES

1 Sykes, D. J., Positive Personal Identification, Datamation, Nov. 1, 1978. 179-186
2 Sykes, D. J., Positive Personal Identification, Datamation, Nov. 1, 1978. 179-186
3 Ambikairajah, E.; Keane, M.; Kelly, A.; Kilmartin, L.; Tattersall, G., Predictive Models For Speaker Verification, Speech Communication, Dec. 1993, Vol. 13, No. 3-4. 417-425
4 Ambikairajah, E.; Kilmartin, L.; Tattersall, G., Predictive Models For Speaker Verification, Speech Communication, Dec. 1993, Vol. 13, No. 3-4. 418-420
5 Reynolds, Douglas A., Speaker Identification And Verification Using Gaussian Mixture Speaker Models, Speech Communication, Aug. 1995, Vol. 17, No. 1-2. 91-108
6 Kuhn, Michael H., Access Control By Means Of Automatic Speaker Verification, Journal Of Physics E (Scientific Instruments), Vol. 13, No. 1. 85-86
7 Keller, J. G.; Rogers, S. K.; Ruck, D. W.; Oxley, M. E., Identity Verification Through Fusion Of Face And Speaker Data, Proceedings Of The Society Of Photo-Optical Instrumentation Engineers, April 1994, Vol. 2243, No. 1. 607-615
8 Lea, W. A., Trends In Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1980. 469-482
9 Birnbaum, Martha; Cohen, Larry A.; Welsh, Frank X., A Voice Password System For Access Security, AT&T Technical Journal, September/October, Vol. 65, Issue 5. 68-74
10 Lummis, R. C., Speaker Verification: A Step Towards The "Checkless" Society, Bell Laboratories Record, Sept. 1972, Vol. 50, No. 8. 254-259
11 Lea, W. A., Trends In Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1980. Chp. 11-17
12 Lea, W. A., Trends In Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1980. Chp. 12
13 Lea, W. A., Trends In Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1980. Chp. 14
14 Lea, W. A., Trends In Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1980. Chp. 16
15 Lea, W. A., Trends In Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1980. 69-70, Chp. 15
16 Lea, W. A., Trends In Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1980. 14, 56, 261, 333-335
17 Lieberman, P.; Blumstein, S. E., Speech Physiology, Speech Perception, And Acoustic Phonetics, Cambridge University Press, New York, 1988. Chp. 8
18 Lieberman, P.; Blumstein, S. E., Speech Physiology, Speech Perception, And Acoustic Phonetics, Cambridge University Press, New York, 1988. Chp. 8-10
19 Lieberman, P.; Blumstein, S. E., Speech Physiology, Speech Perception, And Acoustic Phonetics, Cambridge University Press, New York, 1988. Chp. 8-10


20 Wolfe, W. J., Expert Systems Lecture Notes, Department Of Computer Science and Engineering, University Of Colorado at Denver, 1995. Chp. "Fuzzy Systems"
21 Wolfe, W. J., Expert Systems Lecture Notes, Department Of Computer Science and Engineering, University Of Colorado at Denver, 1995. Chp. "Fuzzy Systems"
22 Tanimoto, Steven, The Elements Of AI Using Common Lisp
23 Wolfe, W. J., Expert Systems Lecture Notes, Department Of Computer Science and Engineering, University Of Colorado at Denver, 1995. Chp. "Fuzzy Systems"
24 Tanimoto, Steven, The Elements Of AI Using Common Lisp
25 Duda, Richard O.; Hart, Peter E.; Nilsson, Nils J., Subjective Bayesian Methods For Rule-Based Inference Systems, Proceedings National Computer Conference (AFIPS), Vol. 15, 1976. 274-281
26 Duda, Richard O.; Hart, Peter E.; Nilsson, Nils J., Subjective Bayesian Methods For Rule-Based Inference Systems, Proceedings National Computer Conference (AFIPS), Vol. 15, 1976. 279
27 Frauenfelder, U. H.; Tyler, L. K., Spoken Word Recognition, The MIT Press, Cambridge, Massachusetts; London, England, 1987
28 Lea, W. A., Trends In Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1980. 69-70, Chp. 6