Citation
Speaker verification with acoustic parameters

Material Information

Title:
Speaker verification with acoustic parameters
Creator:
Matthews, Michael F
Publication Date:
1996
Language:
English
Physical Description:
xii, 103 leaves : illustrations ; 29 cm

Subjects

Subjects / Keywords:
Hearing ( lcsh )
Sound ( lcsh )
Automatic speech recognition ( lcsh )
Speech processing systems ( lcsh )
Automatic speech recognition ( fast )
Hearing ( fast )
Sound ( fast )
Speech processing systems ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 102-103).
General Note:
Submitted in partial fulfillment of the requirements for the degree, Master of Science, Department of Computer Science and Engineering
Statement of Responsibility:
by Michael F. Matthews.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
37357442 ( OCLC )
ocm37357442
Classification:
LD1190.E52 1996m .M38 ( lcc )

Full Text
SPEAKER VERIFICATION WITH ACOUSTIC
PARAMETERS
by
Michael F. Matthews
B.S., University of Colorado at Denver, 1995
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
1996


This thesis for the Master of Science
degree by
Michael F. Matthews
has been approved
by
Jody Paul
Date


Matthews, Michael F. (M.S., Computer Science)
Speaker Verification With Acoustic Parameters
thesis directed by Professor Jody Paul
ABSTRACT
Security in today's society has taken on new interest and importance.
Much emphasis has been placed on securing remote access to proprietary data,
conducting commerce safely on the Internet, and reducing credit card fraud at
point-of-purchase locations. One question has not been fully answered: how
can we verify people are who they say they are? Our current methods of
verification are unfriendly, costly, and unreliable. Speaker verification is a
cost-effective, reliable, and user-friendly technique. Advancing computer
technology has enabled speaker verification to become an effective security
tool.


Much of the previous research in speaker verification has focused on
matching one instance of speech data with another. Because of the amount of
unique information obtained by extracting acoustic parameters from speech,
we can explore alternate methods of using these parameters to improve
speaker verification. Through the use of artificial intelligence approaches such
as fuzzy rules and Bayesian networks, acoustic parameters of speech are used in
a newly developed method for this research. This method, called Adaptive
Forward Planning (AFP), provides a decision making mechanism in which
speaker verification can be implemented with promising results.
This thesis surveys existing speaker verification technologies and
implementations. It points out shortfalls and proposes how to address them. It
then introduces the concept of Adaptive Forward Planning, and details its
implementation. Finally, experimental results of this implementation are
discussed, and directions for further research are outlined.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Jody Paul


CONTENTS
Figures....................................................................ix
Tables.....................................................................xi
Acknowledgements..........................................................xii
CHAPTER
1. INTRODUCTION.............................................................1
Security In Today's Society...........................................1
Speaker Verification: A Viable Solution................................3
How Can Speaker Verification Be Improved?..............................5
Cost................................................................5
User Interface......................................................6
Training The System.................................................6
Speaker Population..................................................7
Using Acoustic Parameters Of The Speech Signal......................7
Organization Of Thesis.................................................8
CHAPTER
2. SPEAKER VERIFICATION TECHNOLOGY AND SYSTEMS...........................10
Speaker Verification Technologies....................................11
Predictive Models.................................................11
The Hybrid MLP-RBF-Based System..................................12
The Self Segmenting Linear Predictor Model.......................12
The Neural Prediction Model......................................13
The Hidden Control Neural Network Model..........................14
Gaussian Mixture Models............................................14


Statistical Features For Speaker Verification.......................16
Extraction Of Cepstral Coefficients..................................16
Speaker Verification Systems............................................17
ITT Defense Communications Division (ITTDCD).........................17
AT&T Information Systems.............................................18
AT&T Bell Laboratories...............................................20
The ARPA Speech Understanding Project: Continuous-Word Recognition
Techniques..............................................................21
Systems Development Corporation (SDC)................................21
Hear What I Mean (HWIM)..............................................22
Carnegie-Mellon University Hearsay-II................................23
Carnegie-Mellon University Harpy.....................................23
What Is Missing?........................................................24
CHAPTER
3. APPROACH.................................................................26
Adaptive Forward Planning...............................................27
Password Models.........................................................30
The AFP Architecture....................................................32
Fuzzy Rule Evaluation Of Parameters (FREP)...........................32
Bayesian Network for Password Analysis (BANPA).......................32
Phonetic Acoustic Parameter Evaluation Using Fuzzy Rules................34
Fuzzy Rules For Acoustic Parameter Value Determination..................37
Linguistic Variables.................................................38
Correlation Coefficient...........................................38
Regression Line Slope Difference..................................38
Fuzzy Rule Process (Defuzzification)..............................41
Subjective Bayesian Inference Networks For Password Analysis............43
Acoustic Feature Inputs..............................................45
Fundamental Frequency.............................................45
Spectral Shape....................................................45


Low-Frequency Energy, Mid-Frequency Energy, High-Frequency Energy....45
First Formant Frequency, Second Formant Frequency, Third Formant
Frequency.............................................................46
Phonetic Classification..................................................48
Intensity.............................................................48
Stress................................................................48
Intonation............................................................48
Articulatory Configuration............................................48
Phonetic Quality Attributes..............................................48
Vowel Presence........................................................49
Consonant Presence....................................................49
Prosodic Quality......................................................49
Articulation Quality..................................................49
Total Energy..........................................................49
Probability Relationships For Phonetic Attributes........................50
Odds Formulation......................................................50
Likelihood Ratio......................................................50
Measurement of the Evidence (E)..........................................53
Multiple Evidence........................................................55
Phonetic Analysis Relationships..........................................56
Summary....................................................................56
CHAPTER
4. IMPLEMENTATION........................................................58
System Overview.....................................................58
BANPA Implementation..............................................58
FREP Implementation...............................................60
Password Model Implementation.....................................61
Vowel Password Selection.......................................61
Stop Consonant Password Selection..............................62
Liquids/Glides Password Selection..............................62


Nasal Consonant Password Selection.............................63
The HpW Works Analyzer For Acoustic Parameter Extraction..........65
AFP System Integration..............................................66
AFP System Process................................................68
Speaker Verification Tests..........................................69
Speaker Password Selection Results................................70
Speaker Verification..............................................79
One Versus Two Acoustic Parameters.............................79
Matching Algorithms...............................................80
Algorithm 1: Correlation.......................................80
Algorithm 2: Closest Match.....................................80
Algorithm 3: Slope.............................................80
Algorithm 4: FREP..............................................80
Algorithm 5: FREP With BANPA (AFP).............................80
System Verification Criteria 1: Comparison Against The Same Passwords ... 81
System Verification Criteria 2: Comparison Against The Baseline Passwords 83
System Verification Criteria 3: Spoofing. Determining A Successful Rejection
Rate..............................................................85
CHAPTER
5. CONCLUSIONS...........................................................88
Speaker Verification: A Technology Waiting..........................90
Where To Go From Here...............................................91
APPENDIX
A. VALUES USED FOR BANPA SYSTEM.........................................94
B. RESULTS OF BANPA RUNS................................................95
C. RESULTS OF FREP RUNS.................................................99
REFERENCES................................................................102


FIGURES
Figure
3.1 Adaptive Forward Planning Steps............................................28
3.2 Password Model Tree........................................................31
3.3 The AFP Architecture.......................................................33
3.4 Establishing A Baseline For The Acoustic Parameter FF......................36
3.5 Plot Comparison Of Two Fundamental Frequency Parameters....................39
3.6 Fuzzy Rule Distribution....................................................41
3.7 Bayesian Network For Phonetic Analysis.....................................47
3.8 Sufficiency Relationship Between Fundamental Frequency And Intonation.....51
3.9 Necessity Relationship Between Low Frequency Energy And Intensity........52
3.10 Approximation Model For Evidence Of An Acoustic Or Phonetic Attribute....54
4.1 The BANPA User Interface...................................................59
4.2 The FREP User Interface....................................................60
4.3 Password Models For Study..................................................64
4.4 HpW Works System Author And Version........................................65
4.5 User Interface Screen For AFP Password Selection...........................67
4.6 AFP System Process.........................................................69
4.7 Distribution Of Password Selection.........................................72
4.8 Spectral Shape For Speaker 1 Versus Average Spectral Shape For Password
"animated"...............................................................73
4.9 Spectral Shape For Speaker 2 Versus Average Spectral Shape For Password
"animated"...............................................................73
4.10 Spectral Shape For Speaker 3 Versus Average Spectral Shape For Password
"animated"...............................................................74
4.11 Spectral Shape For Speaker 4 Versus Average Spectral Shape For Password
"animated"...............................................................74
4.12 Spectral Shape For Speaker 5 Versus Average Spectral Shape For Password
"animated"...............................................................75
4.13 Spectral Shape For Speaker 6 Versus Average Spectral Shape For Password
"animated"...............................................................75
4.14 Spectral Shape For Speaker 7 Versus Average Spectral Shape For Password
"animated"...............................................................76
4.15 Spectral Shape For Speaker 8 Versus Average Spectral Shape For Password
"animated"...............................................................76
4.16 Spectral Shape For Speaker 9 Versus Average Spectral Shape For Password
"animated"...............................................................77
4.17 Successful Verifications: Against Same Passwords
(One Parameter)..............................................................82
4.18 Successful Verifications: Against Same Passwords
(Two Parameters).............................................................82
4.19 Successful Verifications: Against Baseline Password
(One Parameter)..............................................................84
4.20 Successful Verifications: Against Baseline Password
(Two Parameters).............................................................84
4.21 Successful Rejections: Against Spoofing Baseline Password
(One Parameter)..............................................................86
4.22 Successful Rejections: Against Spoofing Baseline Password
(Two Parameters).............................................................86


TABLES
Table
3.1 Fuzzy Rules For Determining The Quality Of An Acoustic Parameter.......40
3.2 Values Computed By Fuzzy Rules..........................................42
4.1 Results of Password selection...........................................71
4.2 Phonetic Class Selection Of Speakers...................................78


ACKNOWLEDGMENTS
The author wishes to thank the following people for their support,
encouragement and assistance.
Carol Conway Matthews
Kevin Matthews
Daniel Matthews
Dr. Jody Paul
Dr. W. J. Wolfe
Dr. John Clark
Bernie and Dede Conway
AT&T Bell Labs/Lucent Technologies (Denver)


CHAPTER 1
INTRODUCTION
Throughout the last twenty years, speaker verification has been used
for computer system access, building access, credit card verification, and crime
lab forensics, all with differing amounts of success. As compared to other
security techniques such as fingerprints, palm-prints, hand writing scans, retinal,
and facial scans, speaker verification is comparatively inexpensive. Each
individual's speech patterns are unique. Because of this, individual speech
samples are as unique as fingerprints, facial scans or retinal prints. This makes
speaker verification an excellent tool for security control. Recent developments
in artificial intelligence such as neural networks and fuzzy logic have extended
research in many existing technologies. Speaker verification is one of these
technologies. The cost-effectiveness, ease of use, and reliability of speaker
verification make it a practical technology to pursue.
Security In Today's Society
Despite recent advances in computer technology, one security problem
remains that has not been fully solved: how can we verify the claimed identity
of a person? Security access has become a more integral part of today's society
in many ways. With an increasingly competitive high tech industry, the need for
protection of proprietary material has become critical. Systems used for
security access to buildings and computer labs have become regular fare. The


Internet has become a viable means of world-wide communication. The ability
to conduct commerce safely on the Internet is about to become a reality.
There are basically three methods by which a person can be identified.
One is by an object in their possession, such as a badge, a key, or a credit card.
The second method is by having something memorized, such as a user id, a
password, or a personal identification number. The third method is by a physical
characteristic unique to the person such as a fingerprint, facial scan, signature,
or a voiceprint. The first two methods are transferable between one person
and another, making them vulnerable to impostors. The third method of a
unique physical characteristic offers the most promise for secure personal
identification1.
Despite the need for secure personal identification, there are not many
devices available on the market today. The only effective devices are those
based on biometrics such as fingerprints, palm-prints, and retinal scans. These
devices are used at point-of-access locations; their use in other locations,
such as point-of-sale terminals or for remote transactions, is unlikely due to
the expense. Many situations where access to a secure area must be controlled
involve the use of guards. Examples include areas such as computer rooms,
bank vaults, aircraft maintenance areas, and drug storage areas. If guards are
employed seven days a week, 24 hours-a-day, the cost could exceed $50,000 a
year.
Fraudulent use of credit cards and bank checks is becoming an
increasing problem to merchants, banks, and credit companies. As we move
closer to a cash-less society, the amount of money lost due to insecure
transactions will become enormous. This type of security problem is different
from that of building access in that the number of places where protection is


needed is staggering. In essence, all gas stations, restaurants, stores, and banks
are potential targets of fraud. Because of the large number of places where
transactions can be performed, the costs of a security solution must be low and
its performance must be very reliable. In addition, any method of security of
this type must be easy to use and widely acceptable to customers.
Today's society is moving towards long-distance working relationships,
telecommuting, and a growing reliance on remote access to proprietary data.
The fears of unauthorized physical access have continued to grow in recent
years. Society is demanding more effective ways of providing security to a
growing number of people and a variety of different needs. The solutions for
these types of transactions must be easy to use, inexpensive, and widely
accepted among all who must interact with it.
Speaker Verification: A Viable Solution
In everyday life it is possible to recognize people by their voices. This
attribute makes the human voice a natural candidate for automated
identification. One person's voice is different from another's because the relative
amplitudes of different frequency components of their speech are different. By
extracting acoustic features such as frequency components from the speech
signal, we can further the reliable identification of a voice.
The technique of speaker verification is one of the few reliable speech
recognition technologies available today. As compared to continuous-word
recognition, speaker verification is constrained to single words or phrases.
Because of this, the complexity of recognition is reduced tremendously. In
addition, speaker verification works with a known user, meaning the system
has previous information stored about the user. Other speech recognition


technologies must generally work with an arbitrary user. The fact that speaker
verification is free of many constraints inherent in other speech recognition
systems allows it to be one of the most reliable speech technologies available.
Typically, the application of speaker verification technology is a device
that customers can be comfortable with. They know what to expect and
therefore don't have unrealistic expectations. Unrealistic expectations
may very well lead to a customer being unhappy or even frustrated by a
product if any rough edges are encountered. Speaker verification can address
probably the single most important issue concerning consumers: will people
accept it as a security method? Because little more than a microphone is
required, and most people find speaking natural, the answer is most likely yes.
To address the need for remote access of proprietary data, speaker
verification can play a key role for securing safe transactions. Speaker
verification technology can be implemented over long distances through
telephone or computer networks. In particular, as telephone technology
improves, increasingly reliable speech signal quality becomes a reality.
Obviously, finer speech signal quality will allow for more dependable speaker
verification. Similar to the way systems today verify credit card purchases,
speaker verification can be used for securing other more critical money related
transactions.
In the case of private criminal investigation of individuals, speaker
verification has been very effectively used to reliably authenticate speakers.
With our knowledge of existing verification processes, we have found that
voice patterns reveal more invariant properties than many existing verification
techniques. Another main advantage of speaker verification is that it does not
require much additional or specialized equipment at the point-of-use.


Equipment for hand scanners or retinal readers can be very large and
cumbersome. A speaker verification application requires little more than a
microphone or handset.
Approaches such as palm prints, eye and body scans, fingerprint or
signature analysis can be very costly. One fingerprint method costs $53,000 for
a central unit, and $4000 for each station2. Another example is the cost of a
hand-print scanner. Devices such as this typically cost around $3000.
Comparatively, a speaker verification system costs much less. A complete
system can be implemented with a personal computer, sound card, and
microphone. A system such as this costs as little as $500.
How Can Speaker Verification Be Improved?
Speaker verification technology for commercial applications is still fairly
new. Even though much research has taken place, there are a number of ways
this technology can still be improved. The following sections point out the
potential improvements that are addressed in this thesis.
Cost
Most speaker verification systems require specialized hardware to
accomplish the processing needed. Not only is most of this hardware
proprietary, but it is typically very expensive. An improvement can be made by
developing a system on a common platform such as an IBM compatible
machine using a standard sound card. By doing this, the system can be
duplicated easily and inexpensively. In addition, since the majority of
computer users use this type of platform, users of a speaker verification system
developed on it will instantly be comfortable with using it.


User Interface
A speaker verification system must be easy to use and its interface must
be appealing and friendly to interact with. Many of the speaker verification
systems in use are specialized laboratory applications, and are not designed for
typical commercial use. By simply looking at a computer screen with a mouse
in hand, and talking into a microphone, users should be very comfortable
interacting with a system. In reality, all that is needed is a microphone or
handset, and a simple visual or audio feedback mechanism that alerts the user
to the verification results. The system developed and described in chapter 3 has
attempted to capture these attributes.
Training The System
When a user first encounters a speaker verification system, it typically
carries out a training process that results in a voice print that is stored and later
used for matching the voice of that individual. Most systems employ a
technique called speaker adaptation to refine matching during subsequent
encounters with the verification system. The speaker's voice print is adapted to
include new acoustic information the first few times a speaker uses the system
successfully. Because of this, the system may be unstable and less secure in its
initial use.
By emphasizing speaker adaptation at the earliest stages of a user's first
encounter, the initial training of a system can be improved. Instead of allowing
the system to adapt over a prolonged period of time, collection of critical
information is carried out at the beginning of a system's life cycle. Not only will
the system perform more reliably at earlier stages, it will be easier to use in the
long run. This process, which is developed in this thesis, is called Adaptive
Forward Planning. It is described in chapter 3.


Speaker Population
Most speech recognition systems can readily handle single speakers by
specifically tailoring the system to the nuances of the individual speaker. Most
realistic applications demand the ability to handle more than a single talker.
Systems such as these are designed for computer system access, credit card
verification or building access. These systems must handle hundreds or even
thousands of different speakers. In addition to handling such a multitude of
different speakers, these systems must be able to deal with uncooperative
speakers. These types of speakers may not say exactly what the system has
asked for or may even try to fool the system on purpose.
By focusing on techniques and processes that can generically be
adapted to a diverse speaker population, the performance of a speaker
verification system can be improved. This can be accomplished by designing a
set of parameters that are invariant among all speakers, and relying on speech
characteristics that apply to the way people produce speech. Uncooperative
speakers can be dealt with by utilizing these parameters to detect potential
impostors.
Using Acoustic Parameters of The Speech Signal
Probably the most popular approach to speaker verification is to focus
on speech signal parameters. Many studies have utilized a technique of
extracting acoustic parameters that can later be used for identification. Most of
the processes used for this technique have encompassed very specialized
hardware and complicated algorithms. While most of the systems developed
have yielded significant results, not many have explored alternate methods of
acoustic parameter extraction, and how these acoustic parameters can be used


for speaker verification. This is probably the most open area for speaker
verification exploration. This is the core focus of this thesis.
Organization Of Thesis
In the remaining chapters of this thesis, we give an overview of speech
verification technologies and systems that have been developed over the last
twenty years. We point out what's missing, offer solutions, describe our
approach in detail, and discuss experimental results.
In Chapter 2, Speaker Verification Technology And Systems, speaker
verification methods and techniques are discussed, followed by details of full
implementations. The purpose of this chapter is to provide the interested
reader with background and history, and to point out what's missing and hasn't
been addressed in previous speaker verification research.
Chapter 3, Approach, is the core of this thesis. The method of
Adaptive Forward Planning is introduced. This new method explores alternate
ways of utilizing acoustic features extracted from the speech signal. When
features are extracted, they are evaluated to determine how much they
contribute or do not contribute to the verification process. A new approach to
evaluation using Bayesian networks and a fuzzy rule process is explained. These
new approaches have the potential to improve the verification process.
Chapter 4, Implementation, discusses how the research was put into
practice, including the system that was built. The discussion covers
system integration in detail, along with the tradeoffs and compromises made. The
results of several speaker verification experiments are provided as well.


Finally, in Chapter 5, Conclusions, a recap and the conclusions of this
research are presented. In addition, this chapter summarizes why this research
is important, and what future types of speaker verification research may
provide further successes and rewards.


CHAPTER 2
SPEAKER VERIFICATION TECHNOLOGY AND SYSTEMS
Research in speech recognition has traversed a multitude of directions
over the past twenty years. The types of systems developed have been very
diverse: from reliable 24-hour-a-day isolated-word recognizers to extremely
versatile and complex systems designed to interpret sentences or continuous-
word speech. There are a number of commercial systems currently available.
Most of these systems are used for simple desktop assistance, or for games and
entertainment. There are also a number of private or industrial systems that
have been designed for much more complicated and scientific uses. These uses
include real-time language interpreters, human-computer interaction systems,
and implementations used for security and verification purposes. It is this last
use that we are particularly interested in.
Even though computer technology has dramatically improved over the
last twenty years, the practical uses of speech recognition technology have
remained limited. The two most applicable speech recognition technologies are
isolated-word recognition, and speaker verification. These two are closely
related in a number of ways. Each involves a cooperative speaker who is willing
to respond to the system's needs in order to achieve success. Both
technologies compare incoming speech with prerecorded templates of various
speech data. Also, both use similar types of matching algorithms to accomplish
their goals. Of these two, speaker verification has proven to be a more useful


technology in the sense that it has successfully been used for a variety of
applications.
In this chapter, a brief overview of speaker verification technology is
given. The focus is on the most promising, and current technologies that have
been researched and or implemented within the last twenty years. In addition
to the focus on technology, complete systems that have been implemented are
looked at. The purpose of this second focus is to get a feel for the successes
and shortcomings of speaker verification.
The following overviews form a representative survey and are by no means
exhaustive.
Speaker Verification Technologies
This section will provide an overview of popular techniques and
methodologies used for speaker verification. Most of these techniques have
two things in common: first, there is an initial speech data collection, usually
implemented as a training session. Second, verification is carried out by
comparing this initial data against data collected from an incoming speech signal.
Predictive Models3
Predictive models have been successfully used in speaker verification.
In this section, we will briefly discuss four that have been researched. In each,
the incoming speech signal is transformed into a model which is then used to
verify the speaker. The first two models do not require pre-processing of the
incoming signal, the second two do.


The Hybrid MLP-RBF-Based System. The hybrid MLP-RBF model4 is
a two-stage connectionist model designed to operate in the time domain alone
and performs well without any time-warping. The first stage is a Multi-Layer
Perceptron (MLP) neural network and is used to extract speech parameters
which are used in the verification stage. The MLP is trained to act as a
nonlinear speech predictor for the utterance spoken by the person who claims
an identity. The second stage implements the speaker verification process
based on a Radial Basis Function (RBF) classifier using the weights of the MLP
as its inputs. The RBF classifier is previously trained to accept the weights
produced by the true speaker utterance applied to stage one, and to reject all
other weights produced by other speakers.
Several previous studies have been carried out on the use of neural
architectures for the purpose of time series and speech prediction. The MLP-
RBF system is based on the fact that the MLP model is capable of learning the
underlying speaker-dependent trends of a speech utterance. It was shown that
this connectionist model could be trained to predict the speech waveform in a
non-recursive mode. However, after the training process was completed,
operating the model in a recursive manner produced chaotic behavior that
nonetheless revealed a relationship to the original speech waveform. It was found that a
connectionist model was not sufficient to model the time varying parameters
of speech and therefore could not work well in a recursive mode. But, the
similarities between the chaotic series produced by the recursive prediction and
the original speech signal proved that the connectionist model was learning the
operation of the underlying speech production mechanism. The MLP-RBF
system is based on this knowledge.
The Self Segmenting Linear Predictor Model. This model uses an array
of linear predictors to model the true speaker where each predictor is


associated with a particular sub-unit of the speech utterance. Linear predictors
have proven successful in speaker verification applications where the speech is
divided up into frames of equal length. Because the speech signal has a slow
time varying property, the vocal tract shape is considered to stay constant
during the duration of a frame. Each frame can be considered a stationary
signal allowing each to be represented with Linear Predictive Coefficients
(LPC). In this model, Linear Predictors are used to represent the temporal
structures of speech. An iterative training process uses Dynamic Programming
(DP) to segment speech into sub-units during which the vocal tract stays
constant, and then trains a set of linear predictors for each of these speech
segments.
The segmentation and training are done on a sample-by-sample basis in the
time domain. No other pre-processing is required.
Both the LP coefficients and the segmentation units of the training
utterances are stored and used for verification. Verification involves DP of the
test utterance with the LP coefficients of the claimed true speaker. The
normalized mean square prediction residual is calculated by dividing the
accumulated squared prediction residual by the accumulated squared sample
values over the entire utterance. The normalized mean squared prediction
error is then compared to a threshold to determine
the success of verification.
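
As a self-contained illustration of this decision rule, the Python sketch below fits linear-predictor coefficients to a single frame by least squares and computes the normalized squared prediction error. The model order, threshold, and synthetic test frame are illustrative assumptions, not values taken from the model described above.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Fit linear-predictor coefficients to one frame by least squares."""
    # Predict frame[n] from the `order` samples that precede it.
    X = np.array([frame[n - order:n][::-1] for n in range(order, len(frame))])
    y = frame[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def normalized_prediction_error(frame, coeffs):
    """Accumulated squared residual divided by accumulated squared samples."""
    order = len(coeffs)
    pred = np.array([coeffs @ frame[n - order:n][::-1]
                     for n in range(order, len(frame))])
    residual = frame[order:] - pred
    return np.sum(residual ** 2) / np.sum(frame ** 2)

# Hypothetical usage: accept the identity claim if the error is small enough.
frame = np.sin(0.1 * np.arange(200))        # stand-in for one speech segment
coeffs = lpc_coefficients(frame, order=10)
accepted = normalized_prediction_error(frame, coeffs) < 0.05  # illustrative threshold
```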
The Neural Prediction Model. The Neural Prediction Model (NPM)
consists of an array of MLP predictors and is constructed as a state transition
network. Each state has a particular MLP associated with it. Each MLP
predictor has one hidden layer and one output layer consisting of eight nodes.
The hidden layer nodes have a sigmoidal function, while the output nodes are
linear. The 8 Mel frequency cepstral coefficients are used as the frame feature
vectors. The MLP outputs a predicted frame feature vector based on the


preceding frame feature vectors. The difference between the predicted feature
vector and the actual feature vector is defined as the prediction residual.
The goal of this type of system is to find a set of MLP predictor
weights which minimize the accumulated prediction residual for a training set.
For speaker verification, a training data set is created by collecting a
series of password utterances. Verification requires the application of the test
utterance to the NPM associated with the speaker who claims identity. The
accumulated prediction residual divided by the sum of the squares of each
feature component in the utterance is used to determine verification success.
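
The verification statistic just described can be sketched independently of the networks themselves. In the Python fragment below, predict_frame is a hypothetical stand-in for the per-state MLP predictor, and the acceptance threshold is illustrative.

```python
import numpy as np

def npm_score(features, predict_frame):
    """Accumulated prediction residual normalized by total feature energy.

    features: (frames x 8) array of Mel cepstral vectors; predict_frame maps
    the preceding feature vector to a prediction of the current one (in the
    NPM this is a per-state MLP; here it is any callable stand-in).
    """
    residual = sum(np.sum((features[t] - predict_frame(features[t - 1])) ** 2)
                   for t in range(1, len(features)))
    return residual / np.sum(features ** 2)

# Hypothetical usage with a trivial "repeat the previous frame" predictor.
feats = np.random.default_rng(1).standard_normal((50, 8))
verified = npm_score(feats, predict_frame=lambda prev: prev) < 1.0  # illustrative
```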
The Hidden Control Neural Network Model. This model utilizes a
single MLP predictor. Like the NPM, this model is constructed as a state
transition network and also uses the 8 Mel frequency cepstral coefficients as
frame feature vectors. The single MLP outputs a frame feature vector
prediction. The model attempts to find a set of MLP predictor weights that
minimize the accumulated prediction residual for the true speaker utterance
training set. Verification of a claimed speaker involves application of the
incoming utterance along with the MLP weights set to those associated with
the claimed speaker. As in NPM, the accumulated prediction residual divided
by the sum of the squares of each feature component in the utterance is used
to determine verification success.
Gaussian Mixture Models5
The Gaussian mixture speaker model was first introduced in 1990 and
has demonstrated very accurate verification for text-independent speaker
utterances. In the Gaussian mixture model (GMM), the distribution of feature
vectors extracted from speech is modeled by a Gaussian mixture density. The


density is a weighted linear combination of uni-modal Gaussian densities, each
parameterized by a mean vector and covariance matrix. Maximum likelihood
speaker model parameters are estimated using the iterative Expectation-
Maximization (EM) algorithm. Generally, 10 iterations are sufficient for
parameter convergence. The GMM can be viewed as a hybrid between two
effective models for speaker recognition: a uni-modal Gaussian classifier and a
vector quantizer codebook. The GMM combines the robustness and
smoothness of the parametric Gaussian model with the arbitrary density
modeling of the non-parametric VQ model.
The speech signal is first segmented into frames by a 20 ms window
progressing at a 10 ms frame rate. Silence and noise frames are discarded using
a speech activity detector (SAD). This is important in text-independent speaker
recognition because by removing silence and noise frames, modeling and
detection is based solely on the speaker, not the environment in which the
speaker is speaking. After the SAD processing, Mel cepstral feature vectors
are then extracted from the speech frames.
Finally, the feature vectors are channel equalized via blind deconvolution. The
deconvolution is implemented by subtracting the average cepstral vector from
each input utterance. It is critical to collect training samples and test samples
from the same microphones or channels to achieve good recognition accuracy.
The speaker verification process is a straightforward maximum
likelihood classifier. For a group of speakers, each is represented by a GMM.
The objective then is to find the speaker model which has the maximum
posterior probability for the input feature vector sequence. The minimum error
Bayes decision rule is used to determine the accuracy of verification.
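
A minimal sketch of this maximum-likelihood decision, assuming diagonal covariance matrices (a common simplification, not necessarily what the system above used), might look like the following.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of feature vectors X under a diagonal-covariance GMM.

    X: (frames, dim); weights: (M,); means, variances: (M, dim).
    """
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum((X[:, None, :] - means[None, :, :]) ** 2
                               / variances[None, :, :], axis=2))
    # Log-sum-exp over the M mixture components, then sum over frames.
    m = log_comp.max(axis=1, keepdims=True)
    return np.sum(m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1)))

def identify_speaker(X, speaker_models):
    """Pick the speaker whose GMM gives the utterance maximum likelihood."""
    # speaker_models: {name: (weights, means, variances)} -- hypothetical layout
    return max(speaker_models,
               key=lambda s: gmm_log_likelihood(X, *speaker_models[s]))
```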


Statistical Features For Speaker Verification6
Speaker verification systems have also been implemented by using
statistical features of speech. One such system used statistical features
extracted from speech samples for automatic verification of a claimed
identity. The analysis of a prescribed code sentence to be repeated for
verification is performed by a real-time hardware processor consisting of a
multi-channel filter bank covering the frequency range from 100 Hz to 6.2
kHz. The incoming speech signal is scanned every 20 ms, and a multiplex
integrator is used to compute the long-term averaged spectrum (LTS) over the
entire utterance.
The system is trained with a number of sample utterances of the code
sequence used for verification. From these samples a speaker-specific reference
is calculated and stored on the computer. In addition, a verification threshold is
calculated and stored for each speaker. For successful verification, the distance
between the test speech input and the closest stored reference must fall below
this threshold.
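
The LTS computation and threshold test lend themselves to a short sketch. The original system used a real-time multi-channel filter bank; the FFT-based average and Euclidean distance below are substitutions made purely for illustration.

```python
import numpy as np

def long_term_spectrum(signal, rate, frame_ms=20):
    """Average magnitude spectrum over successive 20-ms frames of an utterance."""
    n = int(rate * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    return np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)

def verify(test_signal, rate, reference_lts, threshold):
    """Accept if the distance to the stored speaker reference is below threshold."""
    lts = long_term_spectrum(test_signal, rate)
    return np.linalg.norm(lts - reference_lts) < threshold
```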
Extraction Of Cepstral Coefficients7
In this implementation, cepstral coefficients are extracted from the
incoming speech signal. These coefficients are used to build a codebook
containing a set of codevectors (coefficient vectors) representative of the
person to be identified. Verification is carried out by extracting cepstral
coefficients from the new speech of the person to be identified, and a
distortion metric is used to measure the distance from the new vectors and
those in the codebook.


Frames of approximately 30 ms of speech are digitally sampled and
then processed through a pre-emphasis network to boost the high-frequency
components of the speech. Cepstral components are extracted and a
probability-of-voicing factor is computed for each frame. A probability of voicing
at or above a given threshold indicates the presence of formants, which
correlate with an individual's vocal tract. All the frames are then
presented to a Linde-Buzo-Gray clustering algorithm, which is used to
transform the initial coefficient vectors into a set of 64 codevectors which is
then used for verification.
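
The scoring step, comparing new cepstral vectors against the stored codebook, can be sketched as follows; the Euclidean distortion metric and the threshold are illustrative assumptions rather than details taken from the system above.

```python
import numpy as np

def codebook_distortion(vectors, codebook):
    """Average distance from each cepstral vector to its nearest codevector."""
    # vectors: (frames, dim); codebook: (64, dim)
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.mean(dists.min(axis=1))

def verify_claim(vectors, codebook, threshold):
    """Accept the identity claim if the average distortion is small enough."""
    return codebook_distortion(vectors, codebook) < threshold
```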
Speaker Verification Systems
Several speaker verification systems have been implemented, each
particular to a specific application. Even though the applications are specific,
the goal remains the same: to determine, as accurately as possible, whether the
speaker claiming an identity really is who they say they are. The following
sections provide an overview of the systems that have been developed.
ITT Defense Communications Division (ITTDCD)8
The United States government has historically been interested in
speech recognition for as long as the technology has been around. The major
areas of interest of the government are word-spotting, talker identification,
language identification, command-and-control, and secure (encrypted)
speech transmission. ITTDCD has worked on many of these government
applications. ITTDCD has investigated methods for speaker verification by
attempting to recognize the identity of a speaker even if the text of the
analyzed utterance is unknown. ITTDCD has also worked on improving
speaker verification by making such systems more accurate and less expensive.


Most of the research at ITTDCD has been directed at the fundamental
problems of recognizing speech in noisy environments and/or over telephone
lines. Of considerable interest is the selection of acoustic features that are
immune to degradation over phone lines or from background noise. These
features must be insensitive to noise and also to the identity of the speaker.
Because of these constraints, the feature selection techniques designed
by ITTDCD have been centered on determining optimal methods for reducing
the effects of noise on the accuracy of the incoming speech signal. Among the
approaches taken include comparative evaluation of different feature sets. The
feature sets that were examined include linear predictive coefficients (LPC),
vocal tract are functions, autocorrelation coefficients, cepstral coefficients, and
LPC derived pseudo formants. Another approach involves the use of a linear
mean square (LMS) adaptive filtering method for the removal of additive noise
from the speech signal. And finally, a third approach involves the investigation
of a noise-reduced LPC parameter set.
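
As an illustration of the LMS idea mentioned above (a generic adaptive noise canceller, not ITTDCD's actual implementation), the sketch below subtracts an adaptively filtered noise reference from the primary speech-plus-noise signal. The tap count and step size are hypothetical, and a separately captured noise reference channel is assumed.

```python
import numpy as np

def lms_noise_canceller(primary, noise_ref, taps=16, mu=0.01):
    """Cancel additive noise using an LMS-adapted filter on a noise reference."""
    w = np.zeros(taps)                      # adaptive filter weights
    cleaned = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        x = noise_ref[n - taps:n][::-1]     # most recent reference samples
        e = primary[n] - w @ x              # error = cleaned output sample
        w = w + 2 * mu * e * x              # LMS weight update
        cleaned[n] = e
    return cleaned
```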
ITTDCD is well established in the speech compression (vocoding)
area. This technology is well suited as a front-end to many speech recognition
systems. In its research involving speech recognition, ITTDCD has
emphasized low-cost solutions and has developed several systems used for
isolated word recognition, word-spotting, and speaker verification.
AT&T Information Systems9
AT&T and its Bell Laboratories have developed and researched many
different speech recognition technologies. Although much research has been in
continuous-word recognition, many word-spotting and speaker verification
systems have also been developed. One of particular interest was developed by


AT&T's I.S. division in the mid-1980s. This application is a voice password
system for security access using speaker verification designed for use over dial-
up telephone lines. The voice password system (VPS), can be used for secure
access to telephone networks, computers, rooms, and buildings. The VPS
system works by allowing a user to call into the system, enter his or her
identification number, and then speak a password that is usually a phrase or
short sentence. On initial encounter, the VPS system creates a model of the
user's voice and stores a reference template.
Incoming speech is processed in the VPS system on a frame-by-frame
basis. Frames are spaced at 15-millisecond intervals and overlap, each spanning
a 45-ms duration. For each frame, a set of features is extracted that characterizes
aspects of the signal such as short-term energy and spectrum. For each feature
extracted, the autocorrelation of the incoming signal is computed creating
autocorrelation coefficients. The coefficients are then modified by simulating
the addition of white noise to reduce differences between noisy long-distance
telephone lines and clear local lines. These modified coefficients are then
transformed into linear predictive coefficients (LPC) which represent, in bits, a
spectrum of the voice. The LPC coefficients are then transformed to cepstral
coefficients and are normalized by subtracting the mean cepstral values over
the utterance from each 15-ms frame of speech.
Finally the beginning and end frames of the password are located. Once
this feature set is computed, it is matched with a previously generated reference
pattern using a method called dynamic time warping (DTW) which accounts for
timing differences among repeated utterances of the same phrase. The DTW
match yields an absolute distance score and is used to evaluate the identity of
the speaker.
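
A bare-bones version of the DTW match follows. The Euclidean frame distance and the length normalization are common conventions assumed here, not details quoted from the VPS description.

```python
import numpy as np

def dtw_distance(test, reference):
    """Dynamic-time-warping distance between two cepstral feature sequences."""
    n, m = len(test), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(test[i - 1] - reference[j - 1])
            # Extend the cheapest of the three permitted warping moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)   # length-normalized distance score
```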


AT&T Bell Laboratories10
Many methods of speech verification have been studied at Bell
Laboratories. One of those methods experimented with comparing certain
characteristics of a speaker's voice with the same characteristics of the voice of
the person who the speaker claims to be. Allowances are made for normal
variations in speech rate, pitch, volume, and other factors. The belief is that a
system using this method can be just as fast as a human listener and detect
impostors much more accurately.
A file of prototype utterances of a single phrase spoken several times is
collected and averaged by a computer. This average forms a prototype that is
stored along with measurements of the variability among individual utterances.
The variability data is collected because no one can speak the same phrase
twice in exactly the same way. When verification is desired, the computer
fetches the stored prototype for the claimed identity, analyzes the incoming
speech sample and determines if it is close enough to the prototype version.
Five characteristics or features are extracted from the speech signal.
The first three are the three lowest resonant frequencies, known as formants one, two,
and three. The fourth characteristic is voice pitch and its variation with time.
The fifth characteristic is the variation of the intensity (or loudness) of the
speech with time. Before a voice sample can be compared, the prototype and
the incoming voice sample are brought into temporal registration by time-
warping the voice sample. This is done by speeding up or slowing down
various portions of the utterance.
Once the characteristics have been extracted and the two samples to be
compared are brought into registration, measurements are taken to determine
how similar the prototype and the sample are. To accomplish this, the


computer divides each of the five characteristics into 20 equal time segments.
For each segment, several measures of dissimilarity are computed, such as the
mean squared difference and the squared difference of the average rate of
change. After these distance measurements are computed
for each separate segment, each distance is averaged over the 20 segments.
Finally, a sixth distance measure is taken that reflects the degree of time-
warping that was necessary to achieve registration. The computer then
combines all six distance measures and computes an overall final distance
measure of dissimilarity.
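
The per-segment distance computation for one characteristic can be sketched as below, assuming both contours have already been time-warped into registration and are long enough that each of the 20 segments holds several samples.

```python
import numpy as np

def segment_distances(prototype, sample, segments=20):
    """Average per-segment dissimilarities for one aligned characteristic contour."""
    proto_parts = np.array_split(prototype, segments)
    sample_parts = np.array_split(sample, segments)
    msd, slope = [], []
    for p, s in zip(proto_parts, sample_parts):
        msd.append(np.mean((p - s) ** 2))   # mean squared difference
        # squared difference of the average rate of change within the segment
        slope.append((np.mean(np.diff(p)) - np.mean(np.diff(s))) ** 2)
    return np.mean(msd), np.mean(slope)
```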
The ARPA Speech Understanding Project:
Continuous-Word Recognition Techniques
The focus of this thesis is on speaker verification, which is considered
an isolated-word recognition technique. Even though isolated-word
recognition is the most applicable to speaker verification, there is much
correlation to continuous-word techniques. Because of this, it is hard to ignore
the most important research that has occurred in continuous-word speech
recognition. This section will briefly highlight the major systems that were
developed for The Advanced Research Projects Agency (ARPA) Speech
Understanding Project in the 1970s11. ARPA is an agency of the Department of
Defense.
Systems Development Corporation (SDC)12
The SDC system was developed to process sentences. When a digitized
waveform enters the system, formant frequencies and other parameters are
extracted. From this a phonetic transcription is obtained, including several
alternative labels for each 10-ms segment of the waveform. All of this data is


then placed into an array for later examination by top-end routines. The
utterance is processed from left to right. First, a list of all possible sentence-
beginning words is generated. Then an abstract phoneme representation is
extracted for each lexical hypothesis and a graph of expected acoustic variants
is created. Each of these graphs is then sent to a mapper to determine how
good an acoustic match can be obtained. The mapper includes techniques for
estimating the probability that the expected word is present given the phonetic
and acoustic data collected.
The mapper constitutes a verification strategy based on syllables. This is
an attractive strategy for predicting phonetic segments.
Hear What I Mean (HWIM)13
As in the SDC system, when a digitized waveform enters the
system, formant frequencies and other parameters are extracted. This
information is then used to derive a set of phonetic transcription alternatives
that are arranged in a phonetic segment lattice. The advantage of the lattice
structure is that it can represent segmentation ambiguity in those cases where
decisions are most difficult. Identification of words is carried out by searching
through the segmental representation of the utterance for the closest lexical
matching words. These matches are used as seeds that are later used to build
up partial sentence hypotheses. The best-scoring word is then sent to a word-
verification component that utilizes parametric data to get a quasi-independent
measure of the quality of the match.
The method of verification is analysis by synthesis. The verification
score is combined with the lexical matching score, and if this score is high
enough, the word hypothesis is sent to a syntactic predictor which, using


grammatical constraints, proposes which words can appear on the left and
right of the seed word. The word proposals eventually build a lexical decoding
network that produces hypotheses of two words, three words, four words, and
so forth until a final sentence is obtained.
Carnegie-Mellon University Hearsay-II14
The process of recognition is similar to the HWIM and SDC systems
described above. The Hearsay-II system employs a set of parallel asynchronous
processes that simulate each of the component knowledge sources of a speech
understanding system. The knowledge sources communicate via a global
blackboard database. When any one of the knowledge source components
is activated by the blackboard, it tries to extend the current state of analysis.
The blackboard is divided into several major categories: sequences of
segment labels, syllables, lexical items proposed, accepted words, and partial
phrase theories. Initially, amplitude and zero-crossing parameters are used to
divide an utterance up into segments that are categorized by manner-of-
articulation features. A word hypothesizer then lists all words having a syllable
structure compatible with the partial phonetic segments. A word verification
component scores each lexical hypothesis by comparing an expected sequence
of spectra with observed linear-prediction spectra. High-scoring words activate
a syntactic component which attempts to piece words together into partial
sentence theories. This process continues until a complete sentence is found.
Carnegie-Mellon University Harpy15
The Harpy system is an extension of a Markov model of sentence
decoding originally employed by a sentence recognition system called Dragon16.


In Dragon, a breadth-first dynamic programming strategy was used to find
the optimal path through the network. In Harpy, a beam-search technique is
used in which a restricted beam of near-miss alternatives around the best-
scoring path is considered. Dragon also used a priori probabilities in choosing
the most likely path, whereas Harpy considers only spectral distance.
The Harpy finite-state machine has 15,000 states. The state transition
network includes all possible paths, alternate representations of all lexical items
in terms of acoustic segments, and a set of rules that define expected acoustic
segment sequence changes across word boundaries. The input utterance is
divided up into brief acoustic segments. Each segment is compared with 98
talker-specific linear-prediction spectral templates to obtain a set of 98 spectral
distances. Each state in the network has an associated spectral template. The
strategy is to find the best-scoring path through the state transition
network by comparing the distance between the observed spectra and template
sequences given in the network.
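
The beam-search strategy can be illustrated with a small sketch. The network and distances structures below are hypothetical stand-ins for Harpy's 15,000-state transition network and its per-segment spectral distances, and the beam width is arbitrary.

```python
def beam_search(network, start, distances, beam_width=50):
    """Keep only the best-scoring partial paths at each acoustic segment.

    network[state] lists successor states; distances[t][state] is the spectral
    distance between segment t and the template attached to that state.
    """
    beam = [(0.0, start)]
    for t in range(len(distances)):
        best_into = {}
        for score, state in beam:
            for nxt in network[state]:
                s = score + distances[t][nxt]
                if nxt not in best_into or s < best_into[nxt]:
                    best_into[nxt] = s        # cheapest path into each state
        beam = sorted((s, st) for st, s in best_into.items())[:beam_width]
    return min(beam)                          # best-scoring final (score, state)
```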
What Is Missing?
The preceding sections have provided a brief overview of speaker
verification technologies, methods, and implementations. There are two
concepts in common with all of them. First, the method used for verification
involves decision making based on matching one set of acoustic, or other
speech-derived, data against another set of similar data. The second concept
involves using a limited set or number of acoustic features, and not really
exploring the impact that additional features may have.
By simply matching one set of data against another, there is an
increased potential for lost information. Because the acoustics of speech vary


so much from one speaker to another, decision making algorithms must
involve more than a simple matching process.
In this thesis, the method of determining verification is taken a step
further than simply matching data. Because of the nature of speech, there is
more information available to us than an individual set of acoustic features.
Phonetic relationships exist that allow us to study feature interactions. By
modeling these relationships and interactions, new and alternate methods of
decision making algorithms are possible. By focusing on a definitive set of
acoustic features, phonetic relationships can be derived which should help to
distinguish one voice from another.
Acoustic features are pivotal to the speaker verification process. It is
within certain combinations of these features that verification can be improved.
One feature for verification is most likely not enough, nor is two. But, by
combining more and more features, we can create a finer granularity of
decision making that should improve the accuracy of speaker verification. In
this thesis, we will also show that substantial improvement can be realized by
combining and utilizing additional acoustic features in the verification process.


CHAPTER 3
APPROACH
The focus of this study is to determine how acoustic features and
phonetic relationships can aid us in making decisions in the speaker verification
process. We are interested in the role acoustic parameters play in improving the
accuracy of speaker verification. Extracting acoustic parameters or features
from the speech signal is not a new concept. The potential methods by which
parameters are evaluated and utilized for speaker verification, however, are
conceptually new and have not been fully tapped. In this chapter, a new method of using
acoustic parameters of speech is presented.
Decision making used in speaker verification is taken further than
simply matching one speech derived data set with another. A new technique
called Adaptive Forward Planning is introduced that attempts to take advantage
of unique acoustic features inherent to a speaker. New procedures for acoustic
feature evaluation, and how these evaluations can aid speaker verification, are presented in this chapter. In addition, this thesis goes further and explores the
impact of combining additional acoustic parameters.


Adaptive Forward Planning
The technique of Adaptive Forward Planning takes advantage of how
the human vocal system produces speech. By capturing the essential aspects of
the human vocal system, speaker verification can be improved by identifying
those individual speaker characteristics that distinguish an individual from the
average user or group of users. Characteristics such as vocal tract resonance,
characteristics of articulation, and the rate of vibration of the vocal cords are
utilized to improve verification. Techniques of speaker normalization or
channel normalization can be used so that certain features of the speech signal
can be detected and used to capture specific characteristics of the speech
signal. An example would be detecting the fundamental frequency or pitch of
the signal. Another would be the use of complex processes that would
determine the speaker's vocal tract length. Using criteria such as these, a
specific acoustic feature profile may be created and stored for later use.
Adaptive Forward Planning (AFP) uses a premeditated set of acoustic information criteria. This set can be thought of as a boilerplate of specific
acoustic information. By focusing on a definitive set of acoustic features up
front, AFP can formulate a unique profile for every user. With AFP, the task of
speaker verification is to determine the information carrying features common
to repeated utterances of the same phrase or word. When a user encounters
the system for the first time, rather than simply storing initial voice information
for that user, the adaptive process is carried out interactively. The system
proceeds through refinement learning steps and updates a unique profile of
acoustical feature information until it meets a pre-specified set of criteria. As
the system works its way toward refining this set of criteria, it makes use of
different phrases or words whose selection is determined by the refinement
process. Acoustic features may be selected based on trial and error heuristics
which capture the distinct individual speaker characteristics that pertain to a
particular phrase.
Once this set of criteria has been obtained, the system then selects a
final appropriate phrase that the user will be asked to use for verification. The
phrase will be selected from a set of stored templates that have been collected
from the user population. The phrase selected will be the phrase that best
matches the collection of acoustic information that has been obtained. In
addition to selecting this final phrase, this system will store the reusable
acoustic information obtained from the speaker. The illustration below depicts
this process. The progression proceeds from top to bottom:
Figure 3.1 Adaptive Forward Planning Steps (the original voice characteristics boilerplate is refined repeatedly, from top to bottom, until the final voice characteristics information is produced)


By creating this unique acoustic information profile, additional in-depth
verification may be carried out in later system encounters that otherwise would
not exist. For example, if a user attempts to gain access to the system at a later
date, a number of pre-determined thresholds could be used to determine the
accuracy of verification. The system would use this unique acoustic profile to
further the verification analysis if a number of the thresholds were not met.
The advantage of this is that the probability of false rejection (i.e. a valid user
not being accepted) could be lowered.
Adaptive Forward Planning will improve speaker verification in a
number of other ways as well. For instance, when a user wishes or is required to choose a new password, they will have an existing specific feature set with which to interact. Rather than randomly choosing new phrases for passwords, specific phrases can be derived from the acoustic information available. By doing this, the user's acoustic feature profile will remain more or less intact. This will ensure that the unique features of their voice remain embedded in the system and are continually taken advantage of. Another advantage is the use of similar passwords among different speakers. Because the acoustic profile will be unique to each speaker, if two or more users choose a similar, or even the same, password, verification analysis should reveal unique speakers because of the depth of information held in each profile. The ability to use similar or equivalent passwords will reduce the overhead required to maintain the password model database.


Password Models
To accomplish collecting a unique acoustic profile, AFP uses password
models. Each password model is a single phrase that has been analyzed in
depth and determined to model a set of phonetic qualities that yield unique
acoustic features when spoken. AFP first presents the user with an initial set of
password models that they will be asked to speak in sequence. As each word or
phrase is spoken, the system will capture those acoustic features that pertain to
each individual model. AFP will then determine which spoken phrase or word best matches the acoustic features of the speaker and will then proceed to the
next appropriate set of related password models.
The password model system will be hierarchical in nature and can be represented in a tree-like structure. Contained in the top level of the tree (level 0) will be the initial set of password models. Each of their children will represent a subset that will contain similar, but more detailed, phonetic properties. When a user encounters the system, they start at the top of the tree and work their way down, passing through several internal nodes until they get to the lowest level of the tree. The lowest level will represent the actual final phrases or words that AFP determines best suit the user. AFP will then choose among these for the password the user will be required to use. The
illustration below depicts the password model tree.


Figure 3.2 Password Model Tree (the password model hierarchy)
The lowest level of the hierarchy represents the actual passwords that will be used in the system. Initially, a finite number of passwords is used in the system. AFP will require ongoing maintenance of the password model hierarchy. When the lowest level passwords are eventually exhausted, they may simply be replaced by sets of phrases that fit the acoustic feature properties of their predecessors. It will be several years, if ever, before the system runs out of available passwords.
By proceeding through the password model hierarchy, AFP will
accurately narrow down those acoustic features that are unique to an individual
speaker and map these to a specific phrase. As these acoustic features are
discovered, AFP will dynamically build an acoustic feature profile for each user.
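The hierarchy and the level-by-level selection it supports can be sketched as a simple tree walk. The Python sketch below is a hypothetical rendering of the process just described; the speak_and_score callback stands in for the full acoustic evaluation performed by the system.

```python
# Sketch of the password model hierarchy and the level-by-level descent.
# speak_and_score(model) is a hypothetical callback that prompts the user to
# speak the model's phrase and returns its evaluated acoustic worth.

class PasswordModel:
    def __init__(self, phrase, children=None):
        self.phrase = phrase
        self.children = children or []   # more detailed phonetic subsets

def select_password(level0_models, speak_and_score):
    candidates = level0_models           # the initial set at level 0
    best = None
    while candidates:
        # The best-matching model at this level determines the next,
        # more detailed set of related password models.
        best = max(candidates, key=speak_and_score)
        candidates = best.children
    return best.phrase                   # a leaf: the user's final password
```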


The AFP Architecture
To accomplish the goals of AFP, two main architectural components
were developed: a sub-system for evaluating acoustic parameters when
extracted, and a second sub-system for combining the various acoustic
parameter evaluations and determining the overall value or acoustic worth of
a particular password model. These two sub-systems along with the password
models described above make up the AFP architecture.
Fuzzy Rule Evaluation Of Parameters (FREP)
The sub-system developed for evaluating acoustic parameters is the
Fuzzy Rule Evaluation of Parameters (FREP). Because acoustic parameters
extracted from the speech signal vary so greatly, we often are dealing with
vagueness and ambiguity when trying to determine how good a parameter is.
For this reason, this sub-system is modeled using fuzzy rules.
Bayesian Network for Password Analysis (BANPA)
The sub-system developed for evaluating the overall value of a
particular password model is the Bayesian Network for Password Analysis
(BANPA). Due to our limited knowledge of phonetic features that correspond
to the human voice, we are working with incomplete or uncertain data when
trying to evaluate the overall value of a spoken phrase. A Bayesian network was
chosen for this sub-system because it allows us to combine several sources of
data (acoustic parameter evaluations) each with varying amounts of useful
information. We can then analyze and determine probabilistic relationships
among this data, combine these relationships and come to a conclusion or
overall value.


The illustration below depicts the AFP architecture.
Figure 3.3 The AFP Architecture
Note in the figure above that the speaker first reads and speaks each
available password at each level in the hierarchy. For each password spoken,
the acoustic parameters are extracted and sent to FREP for parameter
evaluation. Each evaluated parameter is then sent (in tandem) to BANPA
where the overall phonetic value of the password is determined. This is done
for each password on the same level of the password hierarchy. These values
are then returned to the password model hierarchy where the next level of
passwords is selected based on these values.


The remaining sections of this chapter describe these two main sub-
systems in detail.
Phonetic Acoustic Parameter Evaluation Using Fuzzy Rules
Through much previous research in phonetics, linguists have noted the
importance of phonetic features in describing the sound structure of language
and in postulating how the sound systems of language change. It has been
proposed that speech sounds occurring in the languages of the world could be
described in terms of a finite set of phonetic features.17 These phonetic features
can be defined in reference to acoustic patterns or acoustic properties derivable
from the speech signal. The acoustic properties can be defined in terms of
spectral patterns, changes in overall amplitude, or fluctuations in energy
extracted from the incoming speech signal18.
In Adaptive Forward Planning, a number of acoustic parameters are extracted from the speech signal so that the overall value of a phrase can be determined. For example, the acoustic features fundamental frequency, gross spectral shape, energy frequencies, and formant frequencies provide an invariant set of features that will differ between speakers. Not only do these features differ between individual speakers, but they also differ between repeated utterances of the same phrase by the same speaker. The acoustic features of a speech signal allow us to identify phonetic features such as consonant stops or intonation. Because the production of human voiced sounds varies so dramatically between speakers, we are working with a minimal amount of ambiguity when computing the value of an acoustic parameter. This is significant when using these parameters for speaker verification.


The challenge is to find a way to determine the value of an acoustic
parameter based on acoustic features extracted from an incoming speech
signal. Each acoustic parameter contributes to a phonetic property in varying
ways. For instance, a small amount of low frequency energy contributes very
little to the phonetic property of intensity, but a greater amount contributes
enormously. It therefore becomes important to detect the different levels that
each acoustic feature can contribute to a phonetic property. We cannot assume
that an acoustic parameter either contributes or it does not. Rather, we must
determine how much it contributes. Even though the measurement or graph
of one instance of an acoustic parameter differs from another instance, it still
may be acceptable under certain criteria. A fuzzy rule system provides an excellent vehicle with which to approach our problem. The criteria by which we accept one measurement over another can only be defined in terms of fuzzy rules.
In order to determine the value of an acoustic parameter, we must establish a measure of quality, or baseline, against which we can compare. For each password model in our system, we can establish a baseline measurement for each acoustic feature associated with that password. To accomplish this, a database of spoken phrases is collected, and a baseline acoustic feature set is established for each phrase. For each spoken phrase, the set of acoustic parameters is extracted. Then, all measurements of an extracted acoustic parameter are averaged. This average establishes a baseline or measure of quality for an individual acoustic parameter associated with a phrase. The illustration below depicts this. Speaker A through speaker X speak the same password "hello". The acoustic parameter fundamental frequency (FF) is extracted for each spoken instance of "hello" for each speaker. The X occurrences of FF are then averaged to form the FF baseline for the password "hello".
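In code, the averaging step is straightforward. The sketch below assumes each speaker's FF extraction has already been reduced to an equal-length sequence of measurements; the variable names are hypothetical.

```python
import numpy as np

def build_baseline(ff_tracks):
    # ff_tracks: a list of equal-length arrays, one FF contour per speaker.
    # The pointwise mean serves as the baseline for the phrase.
    return np.mean(np.stack(ff_tracks), axis=0)

# e.g. baseline = build_baseline([ff_speaker_a, ff_speaker_b, ff_speaker_x])
```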


Figure 3.4 Establishing A Baseline For The Acoustic Parameter FF (average fundamental frequency plot for the phrase "hello")
Once the acoustic parameter baseline has been established, we can
compare subsequent acoustic parameter extractions against it. To determine
the quality or amount of information we gain from this comparison, fuzzy rules
will be used.
Fuzzy Rules For Acoustic Parameter Value Determination
In comparing one acoustic parameter with another, we cannot simply
claim that the two parameters match or they do not. Rather, we must define
the comparison in terms of a possibility distribution. For instance, when
studying the acoustic parameter fundamental frequency, many variances are
detected. The varying rate of fundamental frequency is determined by the
shape and mass of the moving vocal cords, the tension of the laryngeal
muscles, and the air pressure generated by the lungs19. When two different
speakers articulate a phrase, the shape of the vocal cords may be similar, but
the air pressure generated by the lungs could be very different. In addition,
attempting to measure the difference of tension in the laryngeal muscles is
nearly impossible.
Rather than focus on these types of differences, we can utilize our
general knowledge of how the fundamental frequency rate and other acoustic
parameters vary among individual phrases. Because of the way a speaker
produces sounds, we can establish commonalities among individual phrases. If we plot two different occurrences of an acoustic parameter extracted from a spoken phrase, we can calculate the difference in these commonalities. In this research, two general criteria for determining how two acoustic parameters differ have been selected: correlation coefficient and regression line slope difference.
Using fuzzy rules, these two criteria become our linguistic variables.


Linguistic Variables
For each acoustic parameter, two linguistic variables are created:
correlation, and regression line slope difference. The values of these two variables are
computed from extracting the difference between the baseline measurement
and the incoming speech signal. The linguistic variables are defined as follows:
Correlation Coefficient. The correlation coefficient is computed from
the graph of the baseline measurement and the graph of the incoming acoustic
parameter. By using the correlation coefficient, we can determine the similarity
between the two acoustic parameters. The equation for the correlation
coefficient is:
cc = Cov(x,y) / (σx σy)

where Covariance: Cov(x,y) = (1/n) Σ(i=1 to n) (xi - μx)(yi - μy),
and μy = the mean of y, μx = the mean of x

Equation 1 Correlation
The value would be high if there is a strong relationship, and low if
there is a weak relationship.
Regression Line Slope Difference. For each acoustic parameters plot,
the slope of the linear regression line through data points in known y's and
known x's is calculated. The slope is the vertical distance divided by the
horizontal distance between any two points on the line, which is the rate of
change along the regression line. The equation for the slope of the regression
line is:
b = (n Σxy - (Σx)(Σy)) / (n Σx² - (Σx)²)

Equation 2 Slope Of The Regression Line
This is calculated for the baseline measurement and for the incoming
speech signal. Then the difference between the two is calculated, and a value
for the variable is determined. The value would be high if there is very little
difference, and low if there is a great amount of difference.
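Both criteria are simple to compute once the baseline and incoming plots have been sampled at the same points. The following sketch implements Equations 1 and 2 under that assumption, with the sample index standing in for the x values; it is an illustration, not the prototype's actual code.

```python
import numpy as np

def correlation_coefficient(x, y):
    # cc = Cov(x, y) / (sigma_x * sigma_y), as in Equation 1
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

def regression_slope(y):
    # b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2), Equation 2,
    # with x taken to be the sample index of each data point
    y = np.asarray(y, float)
    x = np.arange(len(y), dtype=float)
    n = float(len(y))
    return (n * (x * y).sum() - x.sum() * y.sum()) / \
           (n * (x * x).sum() - x.sum() ** 2)

def slope_difference(baseline, incoming):
    # The smaller this difference, the higher the variable's value.
    return abs(regression_slope(baseline) - regression_slope(incoming))
```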
The illustration below depicts two different fundamental frequency
parameter extractions for the phrase "hello":
Figure 3.5 Plot Comparison Of Two Fundamental Frequency Parameters (FF plot of the baseline versus the incoming speech signal over time in msec; lighter = incoming signal, darker = baseline)


The two linguistic variables were designed to take on one of five values:
very low (VL), low (L), medium (M), high (H), and very high (VH). These five
values allow each variable to be separated into five measurement qualities. A
total of five values was used for simplicity and as a starting point for this
research. They were adapted from 20. Many more values could be used if so
desired. In addition, by keeping this number low, a simple set of consistent
fuzzy rules was easier to derive.
We have now defined two fuzzy rule determination sets each with a
possibility distribution (VL .. VH). By defining our sets in this way, we can
define a fuzzy relationship between these two linguistic variables and the
possibility distribution for the value of an acoustic parameter. By relating the
values of our linguistic variables to the value of an acoustic parameter, we can
then define a set of fuzzy rules used for determining the value or quality of an
acoustic parameter. The figure below defines the rules.
                        Slope diff.
Correlation    VL    L     M     H     VH
VL             VL    L     L     M     H
L              L     L     M     M     H
M              L     M     M     H     VH
H              M     M     H     VH    VH
VH             H     H     H     VH    VH

Table 3.1 Fuzzy Rules For Determining The Quality Of An Acoustic Parameter
Again, these rules were defined for simplicity and also to serve as a
starting point for this research. They also were adapted from 21. A total of
twenty-five rules were defined. We can now define fuzzy rule functions that
will allow us to mathematically compute numerical values that map to the
distributions shown in the figure above. The illustration below depicts how the
fuzzy rule functions are distributed.
Figure 3.6 Fuzzy Rule Distribution
The mapping functions depicted above were chosen, again, as a starting point for this research. They were adapted from 22, 23.
Fuzzy Rule Process (Defuzzification). Each of the fuzzy rule functions can be coded as if-then-else rules which will allow output of an acoustic parameter value. For instance, referring to the figures above: if the correlation between parameters is medium (M), and there is very little difference (a smaller difference means a higher value, H) in slope between the parameters, then the
acoustic parameter value is high. Each of our rules can be implemented this
way. The ranges for each of the values can be set according to the following
figure.
VL   0.00 - 0.15
L    0.15 - 0.30
M    0.30 - 0.60
H    0.60 - 0.85
VH   0.85 - 1.00
Table 3.2 Values Computed By Fuzzy Rules
The ranges in the figure above were selected because they represent a distribution of five values ranging from 0 to 1, and they correlate to the mappings above. Because five rules were selected, five distributions were necessary. The middle value, M, was defined to be twice as large as the other four so that the other four would have equal ranges. These values could be distributed in many other ranges as well. Different value distributions could be explored in future research.
In order to determine how much value the two linguistic variables contribute, each variable's value will be calculated in terms of percentages. Then, both percentages are averaged, which will result in an overall percentage contribution or worth. Once this percentage contribution is calculated, it is then mapped to the acoustic parameter value according to the figure above. This can be implemented for each of the 25 rules. For example, if the slope difference was 0.45 and the correlation value was 0.75, this translates to: slope difference value = M (0.45), correlation value = H (0.75), percentage of M's window for slope difference = 0.45 - 0.30 = 0.15. Then, 0.15/0.30 =
0.5. Here, 0.5 corresponds to halfway in the window, or 50% of the window.
Then, the percentage of H's window for correlation = 0.75 - 0.60 = 0.15. Then, 0.15/0.25 = 0.6. Here, 0.6 corresponds to 10% past halfway in the window, or 60% of the window.
These two percentages are then averaged: (0.5+0.6) / 2 = 0.55. This
percentage, 0.55, is then used to calculate the amount of that value that this
particular rule maps to: Mapped rule: When correlation is H and slope is M,
the parameter value is H. How much H? This is calculated with the above
percentage contribution, 0.55: H ranges from 0.60 to 0.85, a total of 0.25.
55% of this range is: (0.55)(0.25) = 0.1375. So, 0.60 + 0.1375 = 0.7375. Therefore, the amount of H is 0.7375. All 25 rules are computed in this way. All
rules were chosen to be computed the same way so that a starting point could
be established. This is simply an interpretation of the fuzzy rule defuzzification
process described above.
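The entire evaluation just described fits in a few lines of code. The sketch below encodes Table 3.1 and Table 3.2 directly and reproduces the worked example; it is one possible Python rendering of the defuzzification process, not the prototype's actual ToolBook code.

```python
# Ranges from Table 3.2 and rules from Table 3.1.
RANGES = {"VL": (0.0, 0.15), "L": (0.15, 0.30), "M": (0.30, 0.60),
          "H": (0.60, 0.85), "VH": (0.85, 1.0)}

RULES = {  # RULES[correlation][slope difference] -> parameter-value label
    "VL": {"VL": "VL", "L": "L", "M": "L", "H": "M", "VH": "H"},
    "L":  {"VL": "L", "L": "L", "M": "M", "H": "M", "VH": "H"},
    "M":  {"VL": "L", "L": "M", "M": "M", "H": "H", "VH": "VH"},
    "H":  {"VL": "M", "L": "M", "M": "H", "H": "VH", "VH": "VH"},
    "VH": {"VL": "H", "L": "H", "M": "H", "H": "VH", "VH": "VH"},
}

def label_and_fraction(v):
    # Which window does v fall in, and how far into that window is it?
    for label, (lo, hi) in RANGES.items():
        if lo <= v <= hi:
            return label, (v - lo) / (hi - lo)
    raise ValueError("value must lie in 0..1")

def parameter_value(correlation, slope_value):
    c_label, c_frac = label_and_fraction(correlation)
    s_label, s_frac = label_and_fraction(slope_value)
    out_label = RULES[c_label][s_label]
    frac = (c_frac + s_frac) / 2.0       # averaged percentage contribution
    lo, hi = RANGES[out_label]
    return lo + frac * (hi - lo)         # map into the output value's window

# The worked example: slope difference 0.45 (M), correlation 0.75 (H)
print(parameter_value(0.75, 0.45))       # -> 0.7375 (up to rounding)
```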
In future research, several different variations of linguistic variable
values, fuzzy rule definitions, and defuzzification processes could be
experimented with.
Subjective Bayesian Inference Networks For Password Analysis
For each password model described above, AFP must determine the
value or quality of each separate model in terms of its inherent acoustic
features. By determining these properties, AFP can make the best possible
decisions in building an individual user's unique acoustic feature set and
determine the most effective password available. Previous work in acoustic
theory of speech production has shown that a number of phonetic parameters
may be used to characterize different speech sounds. These parameters may be
used for phonetic processing of the speech sounds and vary according to
individual speaker and techniques of extraction.
The task is to choose a representative set of parameters that can be
used to characterize different speech sounds in terms of phonetic qualities.
Unfortunately there is not, at this time, a complete, specific and definitive set
of phonetic characterizations that correspond to the human voice. By
capturing those phonetic characteristics that we can, we can at best only
capture a particular set or subset of all the possible phonetic features of the
human voice. In addition, the inherent complexity and fuzziness of acoustic
parameters forces us to make at best, estimated conclusions about the
meanings or relevance of phonetic parameter data. Because of these
constraints, we are working with incomplete or uncertain data yet we still must
draw inferences and make useful judgments.
Our goal is to determine the value or quality of a particular phrase. To
accomplish this, we must use what acoustic information we have and draw
conclusions. In dealing with incomplete or uncertain data, many techniques
have been developed to aid in forming judgments or conclusions24. Probability
theory provides a powerful mechanism for dealing with inference type
problems such as the one AFP has in determining the value or quality of a
spoken phrase. Not only must we determine the value or quality of a particular
phrase, but we must also capture those phonetic features that are qualitative to
a unique human voice. In this way we capture two types of important
information that we can use in speaker verification: a specific password model
that the user can use for verification, and a unique acoustic feature set that
corresponds to an individual user that can later be used to aid in the
verification process.


Subjective Bayesian inference networks provide a powerful model
which can provide a measure of confidence for the overall phonetic quality of a
particular spoken phrase. By translating values we place on acoustic features to
phonetic qualities, we can determine the overall value of a particular password.
The Bayesian network model allows us to do this. First, we will discuss
potential acoustic features that can be used as inputs to the network.
Acoustic Feature Inputs
A set of acoustic parameters can be extracted from the incoming
speech signal so that phonetic analysis may be carried out. Each of the acoustic
parameter inputs used for phonetic analysis will be computed in terms of their
total informational value. Acoustic parameters that can be used in this way are
discussed below.
Fundamental Frequency. During the production of voiced sounds, the
vocal cords are set into vibration. The fundamental frequency of voicing can be
used as a voicing indicator. Voicing distinction may be captured indicating the
presence of prosodic information such as stress, intonation, and intensity.
Spectral Shape. Characteristics of speech events, such as the production of fricatives and the onset of plosive releases, are best characterized in terms of the gross spectral shape of the speech signal. The spectral energies may be derived from the gross spectral shape, thus providing spectral information as input.
Low-Frequency Energy, Mid-Frequency Energy, High-Frequency Energy. One of, if not the, most important characteristics of the speech signal is the fact that the intensity varies as a function of time. For example, sharp intensity changes in different frequency regions (i.e. low, medium, or high) often signify boundaries between speech sounds. One example of this is that low
overall intensity typically signifies a weak fricative, whereas a drop in mid-
frequency intensity usually indicates the presence of an intervocalic consonant.
First Formant Frequency, Second Formant Frequency, Third Formant Frequency. Previous studies have found that the first three formants for vowels and sonorants carry important information about the articulatory configuration in the production of speech sounds. These frequencies can be used to classify vowels and consonants.
By placing values on each of these parameters, we can supply our
Bayesian network with the necessary input values. The diagram shown below
indicates the relationships between incoming acoustic parameters (input nodes)
and phonetic classifications (lumped nodes) which serve as lumped evidential
variables that summarize the incoming acoustic parameters. Also, the
relationships between phonetic classifications and the attributes that contribute
to the phonetic quality of a phrase (predictor nodes) are depicted.


Figure 3.7 Bayesian Network For Acoustic Phonetic Analysis (input nodes: fundamental frequency, spectral shape, low-frequency energy, mid-frequency energy, high-frequency energy, first formant frequency, second formant frequency, third formant frequency; these feed lumped nodes and then predictor nodes, leading to an output node for the phonetic quality of the phrase)
Each of the acoustic parameters is extracted and analyzed separately
to produce an input value that enters the left side of the network. These values
are then fed through the network to produce phonetic classification values on
the lumped nodes. Then these values are fed to the predictor nodes, and their
values are updated. Finally, the predictor node values are combined to produce
an output on the far right-hand side of the network. This final output
represents the overall phonetic value or quality of the phrase from which the
acoustic parameters were extracted.


Phonetic Classification
In order to define relationships between acoustic parameters and
phonetic qualities, the following phonetic classifications are necessary. These classifications are a representative set, and are by no means exhaustive or complete.
Intensity. The overall intensity of a speech signal may be used to detect
the presence of vowel and consonant strengths.
Stress. This is known as the most basic abstract prosodic feature. Stress
patterns provide information about intonation and phonological phrases and can
aid in classifying the vocal effort of the speaker.
Intonation. Intonation, known as the pitch contour of an utterance,
provides vital clues to linguistic structures. Also, intonation may be studied to
derive length characteristics so that differences may be obtained between low
vowels, tense vowels, and lax vowels.
Articulatory Configuration. By analyzing the articulation of speech,
acoustic distinctions may be made. For example, there are definite rates of
respiratory airflow below which airflow is laminar and above which airflow is
turbulent. This yields sharp distinctions between sonorant sounds and fricatives. In
addition, rather abrupt changes in acoustic features may be detected
corresponding to the opening of the velic valve for nasalisation.
Phonetic Quality Attributes
These attributes constitute the overall phonetic quality of a particular
phrase. By combining the values or contributions of the phonetic
classifications previously discussed, we can define phonetic quality attributes.


Vowel Presence. Vowels can be detected by the presence of substantial
energy in the low and mid frequency regions of the speech signal. They may be
characterized by the steady state values of the first three formant frequencies.
Consonant Presence. Consonants are usually divided into several groups depending on the manner in which they are articulated. There are five groups in English: plosives, which are characterized acoustically by a period of prolonged silence, followed by an abrupt increase in amplitude at the consonantal release; fricatives, which are detectable by the presence of turbulent noise; nasals, which are invariably adjacent to a vowel, and are marked by a sharp change in intensity and spectrum; glides, which occur only next to a vowel as formant transitions into or out of a vowel, and are fairly smooth and much slower than those of other consonants; and affricates, which are characterized as a plosive followed by a fricative.
Prosodic Quality. One of the areas of speech understanding that has
not been fully tapped is that of extracting prosodic information. Prosodic
information of speech consists of stress patterns, intonation, pauses, accent,
and timing structures. By computing an overall prosodic quality of a phrase, we
can capture unique characteristics of the incoming speech signal.
Articulation Quality. This represents the overall quality of detectable
articulation parameters.
Total Energy. This represents the total speech signal energy obtained
and may be used to evaluate overall intensity changes which indicate the
presence or lack of presence of phonetic parameters.


Probability Relationships For Phonetic Attributes
The model for determining the quality of a phrase that AFP uses must
account for degrees of evidence and degrees of confidence in the hypothesis
that a particular spoken phrase is of a certain quality. Probability theory
provides the mathematical formalism in which we need to work. Each node in our network can be thought of as a model in which evidence (E) can provide support for, or against, a hypothesis (H). Using basic probability theory, we can view the nodes in our Bayesian network as events that have been derived from prior knowledge about the relationships that exist between each node and the node that led to it. It is now necessary to define relationships.
Odds Formulation. The odds of an event is defined as:
O(A) = p(A) / p(~A)
Likelihood Ratio. The likelihood ratio λ for events A and B is given by:
λ = p(A|B) / p(A|~B)
This is useful because it can be viewed as a way to update the odds of a hypothesis H based on the evidence E:
if λ = p(E|H) / p(E|~H), then O(H|E) = λ O(H).
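A minimal Python sketch of this odds-update arithmetic, for illustration only:

```python
def odds(p):
    # O(A) = p(A) / p(~A)
    return p / (1.0 - p)

def probability(o):
    # invert the odds formulation: p = O / (1 + O)
    return o / (1.0 + o)

def update(prior_h, lam):
    # O(H|E) = lambda * O(H), where lambda = p(E|H) / p(E|~H)
    return probability(lam * odds(prior_h))

# e.g. a prior p(H) = 0.2 updated with lambda = 4 gives p(H|E) = 0.5
print(update(0.2, 4.0))
```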
Our overall goal throughout the network is to estimate the conditional probability of a hypothesis (H), that a phonetic classification or attribute exists, given the evidence (E) of acoustic parameters extracted from the speech signal. Two more terms are necessary to support our understanding of how E relates to H: sufficiency λ, and necessity λ′. These two numbers are used to capture
the prior knowledge of the probabilistic relationship between E and H.
Depending on the relationship between E and H, these limiting cases will vary
somewhat. For example if the acoustic parameter of fundamental frequency
and the phonetic classification of intonation were related as in the sketch:
Figure 3.8 Sufficiency Relationship Between Fundamental Frequency And Intonation
then whenever fundamental frequency was present, the phonetic
classification intonation would occur, thus fundamental frequency is
sufficient for intonation to occur.
On the other hand, if the relationship between the acoustic parameter
low frequency energy and the phonetic classification of intensity were depicted
as:
Figure 3.9 Necessity Relationship Between Low Frequency Energy And Intensity
then whenever the intensity characteristic occurred, the acoustic
parameter type of low frequency energy is certain to occur. Thus, the acoustic
parameter type of low frequency energy is necessary for the intensity
characteristic to occur.
The more general cases of limiting are handled by a mix of the values λ and λ′, where λ usually runs from 1 to large values based on the degree of sufficiency of acoustic parameter types and phonetic classification types, and λ′ runs between 0 and 1 based on the degree of necessity of acoustic parameter type and phonetic classification type. This mix of values is also used in determining the relationships between phonetic classification types and the attributes that constitute the quality of a particular phonetic phrase. In our subjective Bayesian network, the challenge lies in coming up with values of λ
and λ′ for each link in the inference network that best represent expert knowledge about each relationship.
Measurement of the Evidence (E).
In our model, evidence (E) of an acoustic parameter or the presence of a phonetic type arrives in an uncertain fashion. Each parameter input to the system is really a measurement of the degree of existence of that acoustic parameter or phonetic type. Assume a measurement is made, call it E′, that tells us how sure we are that the acoustic parameter or phonetic type is actually present. The value t = p(E|E′), with 0 ≤ t ≤ 1, represents the certainty that E has or has not occurred. A value of 1 means that E has absolutely occurred, and a value of 0 means that E has absolutely not occurred. We can now estimate p(H|E′) based on the measurement of E and our prior knowledge. To carry out this estimation, an approximation based on linear interpolation is used25. The following diagram depicts the approximation model:
Figure 3.10 Approximation Model For Evidence Of An Acoustic Or Phonetic Attribute (vertical axis: p(H|E′), the estimated conditional probability of the phonetic classification type or phonetic quality attribute given the measurement of the existence of an acoustic parameter or phonetic type; horizontal axis: t = p(E|E′), the measurement that tells us how sure we are that a given acoustic parameter or phonetic type is present)
The graph shown above is used to calculate an estimate of the chances of a phonetic type or phonetic attribute being present, p(H|E′). Implicitly, the graph uses λ and λ′ to get p(H|E) and p(H|~E) from:
p(H|E) = λ O(H) / (1 + λ O(H))
p(H|~E) = λ′ O(H) / (1 + λ′ O(H))
where O(H) = p(H)/p(~H)
We apply a graph such as this to every node in our Bayesian network and update our hypotheses based on our prior knowledge and the evidence collected. We are assuming we have prior knowledge of the probability of a phonetic type or phonetic attribute being present (p(H) above; note a phonetic type may also be p(E), in which case it contributes to the conditional probability of a phonetic attribute). Also, we assume we have prior knowledge
of the probability of an acoustic parameter being present (p(E) above). The
graph ensures that p(E) maps to p(H) meaning that if the measurement is
inconclusive (i.e. we have no measurement at all), then the estimate of the
chances of H should also be the prior probability of H. If the measurement is
less than p(E), then we have evidence against the presence of E and therefore
reduce the chance of H. The shaded area depicted in the graph is said to support the presence of H because if the measurement is greater than our prior knowledge of E, this maps to evidence of H beyond our prior knowledge of H.
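One way to realize this piecewise-linear estimate in code is sketched below. It anchors the line at the three points just described: t = 0 maps to p(H|~E), t = p(E) maps to the prior p(H), and t = 1 maps to p(H|E). This is an illustrative reading of the approximation model under those assumptions, not the prototype's actual code.

```python
def conditionals(p_h, lam, lam_prime):
    # p(H|E) and p(H|~E) from the sufficiency and necessity values
    o_h = p_h / (1.0 - p_h)
    p_h_e = lam * o_h / (1.0 + lam * o_h)
    p_h_not_e = lam_prime * o_h / (1.0 + lam_prime * o_h)
    return p_h_e, p_h_not_e

def estimate(t, p_e, p_h, lam, lam_prime):
    # Linear interpolation from the measurement t = p(E|E') to p(H|E')
    p_h_e, p_h_not_e = conditionals(p_h, lam, lam_prime)
    if t <= p_e:
        # evidence against E: move from p(H|~E) up toward the prior p(H)
        return p_h_not_e + (t / p_e) * (p_h - p_h_not_e)
    # evidence for E beyond prior knowledge: move from p(H) toward p(H|E)
    return p_h + ((t - p_e) / (1.0 - p_e)) * (p_h_e - p_h)
```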
Multiple Evidence
In our subjective Bayesian network, there are several places where
multiple evidence is required to update the probability of H. To deal with this,
a new term is defined:
λeff = O(H|E′) / O(H), which is computed from p(H|E′):
O(H|E′) = p(H|E′) / (1 - p(H|E′))
For the purposes of this research, we make the assumption that the multiple evidence is independent26, and compute λeff for each link. Now, we can define:
λEFF = λeff1 λeff2 ... λeffn
This allows us to update the odds of H:
O(H | E1 E2 ... En) = λEFF O(H).
It is probable that multiple evidence is not always independent. In this
case, we must make modifications that deal with these scenarios. This could be
dealt with in future research.
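Under the independence assumption, the combination step can be sketched as follows; each posterior here is the p(H|E′) produced for one incoming link.

```python
def lambda_eff(p_h, p_h_given_e):
    # lambda_eff = O(H|E') / O(H), computed from the estimated p(H|E')
    o_h = p_h / (1.0 - p_h)
    o_h_e = p_h_given_e / (1.0 - p_h_given_e)
    return o_h_e / o_h

def combine(p_h, posteriors):
    # LAMBDA_EFF = product of the per-link effective lambdas, then
    # O(H|E1..En) = LAMBDA_EFF * O(H), converted back to a probability
    lam = 1.0
    for p in posteriors:
        lam *= lambda_eff(p_h, p)
    o = lam * (p_h / (1.0 - p_h))
    return o / (1.0 + o)
```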
Phonetic Analysis Relationships
For each node in our network, we must derive prior knowledge of the probability of E, our evidence, and of H, our hypothesis. For H, p(H) is the chance that H is present before any evidence E is presented. For E, p(E) is the chance that the evidence is present or not. In addition, values must be selected for sufficiency λ and necessity λ′ by carefully looking at each separate link in the network and determining how these values best describe each relationship. These values can be selected based on expert knowledge available about each relationship in our network.
Summary
In this chapter we have introduced a new method called Adaptive
Forward Planning (AFP) which evaluates and utilizes acoustic parameters of
speech for speaker verification. By utilizing a fuzzy rule system (FREP), we
have shown how acoustic parameters extracted from speech can be evaluated
by measuring their values against a baseline (average) measurement. We then
use these values as inputs to a Bayesian inference network (BANPA). The
Bayesian network was developed by analyzing relationships between phonetic
attributes and phonetic classifications inherent to the human voice.


There were many potential variations to the components of the AFP
architecture that were discussed. The components of most interest were
linguistic variable values, fuzzy rules, defuzzification processes, values chosen
for necessity and sufficiency, the process and methods of updating the
network, and the handling of multiple evidence. The variations available to
these components are almost endless. Each should serve as a focal point for
future research.


CHAPTER 4
IMPLEMENTATION
Now that we have defined the AFP architecture, we describe the
implementation of certain components and evaluate their performance. In this
chapter, we will discuss the implementation details and some of the
experimental results.
System Overview
Adhering very closely to the model described in chapter 3, a prototype system was built. There were limitations to the implementation due to lack of resources and time. These limitations will be pointed out where applicable throughout this chapter. The prototype was built in Microsoft Windows 95, using Asymetrix's Multimedia ToolBook 4.0. The system was built in three phases. The first phase implemented was the BANPA sub-system. The second was the FREP sub-system. The third phase consisted of tying together the BANPA, FREP, password model, and acoustic extraction functionality. Each of the phases is briefly discussed below.
BANPA Implementation
The BANPA sub-system described in chapter 3 was constructed as
depicted. The sub-system was built so that a user can enter values for the
acoustic parameters, and then compute an overall value for a phrase that the
acoustic parameters were extracted from. The values for necessity and
sufficiency for each of the relationships in the network were entered into an
Excel spreadsheet. The spreadsheet was saved as a text file. When the sub-
system was run, the data from the spreadsheet was read into the network and
assigned to all interior nodes on the network. The data used for the BANPA
system is shown in appendix A. These values were used for BANPA simulation
runs, and for the actual finished AFP system. The interface for inputting
acoustic parameter values and computing the overall value of a phrase is shown
below.
Figure 4.1 The BANPA User Interface


A series of simulations was run to determine the overall sensitivity and
behavior of the Bayesian network. The results of these runs can be found in
appendix B. The source code used to implement the sub-system is available
upon request.
FREP Implementation
The fuzzy rule sub-system FREP as described in chapter 3 was
constructed as depicted. The sub-system was built so that a user can enter
values for the correlation and slope values as would be derived when
comparing an acoustic feature extraction against the baseline (or average) of
that feature. The interface that was built for FREP is shown below.
Figure 4.2 The FREP User Interface


As was done for the BANPA simulation, a series of simulations was run to determine the overall sensitivity and correctness of the fuzzy rule system. The results of these runs can be found in appendix C. The source code
used to implement the sub-system is available upon request.
Password Model Implementation
A set of 21 passwords was chosen for this initial study. Each word was
chosen based on its acoustic correlation to various phonetic properties
inherent in the English language. By studying these properties, the words were
partitioned into four separate phonetic classifications: vowels, stop consonants,
liquids and glides, and nasal consonants. Four initial words were required to be
spoken up front. Each one of these correlated to one of the four phonetic
classifications mentioned above and served as a starting point for password
selection. These four password models are the initial set of passwords
mentioned in chapter 3. Each of the four phonetic classifications and the
words chosen for each are briefly described below. For a more in-depth discussion of acoustic phonetic theory, see 27, 17, 28.
Vowel Password Selection. Vowel-characterized words were chosen based on four properties: long vowel duration, short vowel duration, vowel nasalisation, and oral vowels. The words chosen for these vowel characteristics are as follows:
"animated": This word serves as the initial starting point, or initial password, for vowel classification. It was chosen in an attempt to capture all four properties mentioned above. If AFP chose this as the best initial password, then the speaker would be asked to speak the following four vowel-characterized passwords.


"heed": This was chosen to represent the characteristic of long vowel duration.
"can": This was chosen to represent the characteristic of vowel nasalisation.
"high": This was chosen to represent the characteristic of an oral vowel.
"hid": This was chosen to represent the characteristic of short vowel duration.
Stop Consonant Password Selection. The stop consonants studied, [p t k b d g], were chosen based on five articulation characteristics: labials, alveolars, velars, voiced, and non-voiced.
"powderkeg": This word serves as the starting point, or initial password, for stop consonant classification. It was chosen in an attempt to capture all five characteristics mentioned above. If AFP chose this as the best initial password, then the speaker would be asked to speak the following five stop-consonant-characterized passwords.
"people": This was chosen to represent the characteristics of labials [p b].
"date": This was chosen to represent the characteristics of alveolars [t d].
"keg": This was chosen to represent the characteristics of velars [k g].
"dogbone": This was chosen to represent the characteristics of voiced [b d g].
"ketchup": This was chosen to represent the characteristics of non-voiced [p t k].
Liquids/Glides Password Selection. Acoustic sounds of [l r] are referred to as liquids, and [w y] as glides. The durations of liquids and glides have been found to produce identifiable acoustic characteristics. The passwords chosen
for this classification were broken into four characteristics: short liquid, long
liquid, short glide, and long glide.
"lawyer": This word serves as the starting point, or initial password, for the liquids/glides classification. It was chosen in an attempt to capture all four characteristics mentioned above. If AFP chose this as the best initial password, then the speaker would be asked to speak the following four liquids/glides-characterized passwords.
"love": This was chosen to represent the characteristics of a short duration liquid.
"room": This was chosen to represent the characteristics of a long duration liquid.
"weather": This was chosen to represent the characteristics of a short duration glide.
"yams": This was chosen to represent the characteristics of a long duration glide.
Nasal Consonant Password Selection. The nasal consonants [m n] were
chosen based on the variety of acoustic consequences created by the opening
of the nasal cavity when sound is propagated through both the nose and mouth.
"midnight": This word serves as the starting point, or initial password, for nasal consonant classification. It was chosen in an attempt to capture the characteristics mentioned above. If AFP chose this as the best initial password, then the speaker would be asked to speak the following four nasal-consonant-characterized passwords.
"moon": This was chosen to represent the characteristics of a long duration nasal consonant.
"nose": This was chosen to represent the characteristics of a short duration nasal consonant.
"m&n": This was chosen to represent the characteristics of a combination of nasal consonants.
"animal": This was chosen to represent the characteristics of a nasal consonant combined with vowel nasalisation.
The 21 passwords selected can be viewed as a tree-like structure that represents the relationships among them. The following figure depicts this.
Figure 4.3 Password Models For Study (the four initial passwords animated, powderkeg, lawyer, and midnight each branch to their related passwords: heed, can, high, and hid; people, date, keg, dogbone, and ketchup; love, room, weather, and yams; and moon, nose, m&n, and animal)


The HpW Works Analyzer For Acoustic Parameter Extraction
The main limitation of implementation was in the model for extracting
acoustic parameters. Due to the platform chosen for development, and
method of integration, AFP required an IBM-PC Windows 3.1/95 based
component. There was only one usable system found after a long exhaustive
search. By far, this was the most difficult part of implementation. A system was
needed that executed the extraction of acoustic parameters from an incoming
speech signal. Most of the systems discovered were either very expensive and
proprietary, ran on incompatible platforms and/or hardware, or did not offer a
way to capture useful acoustic data. The only system found that provided useful functionality and met the constraints listed above was the HpW Works Analyzer.
Figure 4.4 HpW Works System Author And Version (HpW Works Analyzer, Version 1.00.014 Demo, e-mail: 100350.630@compuserve.com, Copyright 1995-1996 by HpW)


Using the HpW Works Analyzer, Fast Fourier Transform (FFT) data, using a full Hamming window, and spectral data were extracted from digital audio files. The HpW Works system allows digital audio files to be processed, and then allows data dumps of both FFT data and spectral data to text files. The system has an easy-to-use interface and runs on Windows 3.1 and 95.
For all speaker subjects used in this research, digital audio files were first recorded in the WAV format and pre-processed. Then each was individually submitted to the HpW Works system, and two files, containing FFT data and spectral data respectively, were created using the system. Even though the HpW Works system only extracted these two types of data, it was very useful and allowed the research to continue.
AFP System Integration
Now that all of the sub-system implementations have been described,
the overall AFP system used in this research is described next. The goal was to
integrate all, or as many as possible, of the components depicted in chapter 3.
This made up the final AFP system used for research and testing. The most
painstaking part of integration was creating the acoustic feature data for each
sample recorded. This had to be done manually and was very time consuming.
Once the acoustic data files had been created for all of the samples used in the
research, the rest of the system integration went fairly quickly and easily.
The BANPA and FREP sub-systems were tied together under a
common interface. Then, new user interfaces were created so a speaker could
interact with the system, and carry out speaker verification processes. The
original goal was to allow a speaker to record his or her voice in real-time while
interacting with the AFP system. For experimental purposes, the functionality
to load acoustic feature data files was added. Each of the acoustic feature data
files was created as described above. Once these files had been created, they
could then be loaded into the system via the interface. An example of one of
the user interface screens is shown below.
Figure 4.5 User Interface Screen For AFP Password Selection (the stop consonant screen lists the password selections people, date, keg, dogbone, and ketchup, each with Submit from file and Record buttons, along with a Finished button, a To Main Menu button, and the user's selected password)
In the screen shown above, the user is requested to record or load (via the Submit from file button) data for each of the passwords. The system interacts with the user exactly as described in chapter 3, and eventually finds the password for the speaker to use. For the screen above, after all passwords have been recorded or loaded, the user hits the Finished button, and the system analyzes the password data and determines the next group of
passwords based on the analysis. The user is then presented with the next
interface screen.
AFP System Process
First, to carry out the research, speakers were recorded to tape. The
recording system used to record speaker subjects consisted of a super-VHS
recording deck with a built-in microphone. The microphone quality was very
similar to what might be used in a computer lab or office environment. Voice
samples were first recorded on this deck and then individually sampled from
the deck to a Pentium 166 Mhz PC using a 16-bit Sound Blaster
compatible sound card. All voice recordings were sampled at 22.050 Khz, in
mono, at 16-bits, and saved in the WAV file format.
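For illustration, the recording format can be checked with Python's standard wave module; the file name here is hypothetical.

```python
import wave

# Sanity-check that a sample matches the format used in this study:
# 22.050 kHz sampling rate, mono, 16 bits per sample.
with wave.open("speaker1_animated.wav", "rb") as w:
    assert w.getframerate() == 22050
    assert w.getnchannels() == 1
    assert w.getsampwidth() == 2   # 16 bits = 2 bytes
```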
Next, each sample was processed by truncating out useless artifacts found at the beginning and end of each sample. Once all of the WAV files were processed in this fashion, each was submitted to the HpW Works system described above. From this system, the two acoustic parameters of FFT and spectrum were extracted and dumped to separate data files. The two data files were merged, and this represented the acoustic parameter data set of each spoken password. The data set files were then named and organized by separate speakers and stored in a common directory. The following illustration depicts the AFP process.


Figure 4.6 AFP System Process
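The merge step at the end of this pipeline might look like the following sketch; the file names and the one-header-per-section layout are hypothetical, since the actual dump formats were dictated by HpW Works.

```python
# Concatenate the FFT dump and the spectral dump for one recording into a
# single acoustic parameter data set file.

def merge_dumps(fft_path, spectral_path, out_path):
    with open(out_path, "w") as out:
        for path, label in ((fft_path, "FFT"), (spectral_path, "SPECTRUM")):
            out.write("# %s\n" % label)
            with open(path) as src:
                out.write(src.read())

# e.g. merge_dumps("mike_hello_fft.txt", "mike_hello_spec.txt",
#                  "profiles/mike_hello.dat")
```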
Speaker Verification Tests
In this phase of the research, a group of 9 speakers was studied. Each speaker was asked to speak all 21 password models to simulate an initial user training session. Each word was recorded and processed as explained above, resulting in acoustic feature profiles containing FFT and spectral data for each word spoken. In total, there were 9 speakers, each with 21 passwords recorded. This resulted in 189 acoustic data files that were stored in a central directory on the host computer. The next step was to record each speaker speaking each password 5 separate times. This was done so that subsequent verification tests could be carried out. In total, there were 6 x 9 x 21 = 1134 recordings made.
The speaker data was used as collected. That is, there was no additional processing, such as time-warping or other types of filtering, done on
be implemented without this type of additional processing. The HpW Works
system did a small amount of initial processing to execute a discrete fast Fourier transform, but that was all.
There were two main goals of the speaker verification tests. The first
was to determine how well the system selected speaker passwords based on the
complete feature set (two in this case). The second was to determine how well
the system performed speaker verification using one, and then two acoustic
feature parameters.
Speaker Password Selection Results
Once all passwords had been recorded and processed, the acoustic data
collected for each was averaged among the 9 speakers as illustrated in chapter
3, and a base acoustic data file was created. This resulted in 21 base
password files, one for each password. Then, a user interaction was simulated
for each speaker by submitting their acoustic data files to the AFP system. As
explained in chapter 3, the correlation and slope difference were computed against the base password data files and scored for best match. The resulting decisions made by the AFP system determined a particular speaker's password. The results of the 9 speaker interactions and password selections are
shown below.


SPEAKER   SEX   AGE   PASSWORD SELECTED
Mike      m     35    weather
Carol     f     35    yams
Dean      m     42    people
Chris     m     39    nose
Linda     f     38    yams
Peggy     f     34    high
Sue       f     36    yams
Steve     m     32    ketchup
John      m     40    weather
Table 4.1 Results of Password Selection
Using 9 speakers and 21 passwords, the AFP system illustrated a fairly
reasonable distribution of password selection. As the table above shows, only
two passwords were selected for more than one person. The figure below
illustrates this distribution.


Figure 4.7 Distribution Of Password Selection (the number of speakers assigned to each selected password)
Note that even though there were two passwords selected for more
than one speaker, the passwords were selected for uniqueness of speaker, not
for uniqueness of words. Because of this, it is irrelevant if the passwords are
the same or different. What the previous two figures illustrate is that the ways people articulate certain words are similar, and these similarities can be grouped together. For instance, the password "yams" was selected for three speakers, all females of approximately the same age.
As was mentioned above, each speaker's acoustic data set was measured against the baseline data set, which is the average of all 9 acoustic profiles for each password. To illustrate this, the following set of graphs shows the results for each speaker's spectral shape generated from the initial password "animated".


Figures 4.8 through 4.16: Spectral Shape For Speakers 1 Through 9 Versus Average Spectral Shape For Password "animated"
As the graphs above illustrate, each occurrence of a spoken word varies
significantly from speaker to speaker. Because of this variance, speakers can be
successfully identified using acoustic data such as this. As mentioned above, four
initial passwords were used to differentiate between four phonetic classifications.
The table below depicts the speaker-phonetic classification relationships that AFP
determined.
SPEAKER     SEX   AGE   INITIAL PASSWORD   PHONETIC CLASSIFICATION
                        SELECTED           DETERMINED
1. Mike     m     35    lawyer             Liquids/Glides
2. Carol    f     35    lawyer             Liquids/Glides
3. Dean     m     42    powderkeg          Stop Consonants
4. Chris    m     39    midnight           Nasal Consonants
5. Linda    f     38    lawyer             Liquids/Glides
6. Peggy    f     34    animated           Vowels
7. Sue      f     36    lawyer             Liquids/Glides
8. Steve    m     32    powderkeg          Stop Consonants
9. John     m     40    lawyer             Liquids/Glides

Table 4.2 Phonetic Class Selection Of Speakers
For the case of the word "animated" shown above, the acoustic
parameters of speaker 6 matched the average spectral shape for "animated"
very closely. Referring to the graphs above for all speakers, speaker 6 is the
closest match among all 9 speakers. The criterion for selecting a password could
also be to find the speaker who is furthest from the baseline average. This
would distinguish the speaker from all others by depicting how one speaker's
articulation differs from the other speakers in the system. This criterion should
be examined in future research.
The results of this experiment have shown a reasonable distribution of
acoustic feature differences among a small set of 9 people. This is encouraging
in the sense that with a small speaker population, the AFP system was able to
distribute passwords and distinguish each speaker among the averages of all
speakers.
Speaker Verification
Once user passwords had been determined, the AFP system was
challenged by simulating subsequent access encounters to the system. The goal
in this portion of the study was to determine the accuracy of the AFP speaker
verification system. That is, how well does the system verify a speaker when he
or she encounters the system at a later date? The system stores
information about each user: the user's ID, the user's password used for access,
and the acoustic data profile of that password, which was derived from the
user's first encounter with the system.
As mentioned before, 5 additional occurrences of each password were
recorded for each speaker used in the study. Once the AFP system had
determined what the user's password would be from the previous experiment,
the 5 additional occurrences of the same password were processed as before
and then used to simulate 5 separate system verification encounters. This gave a
total of 45 attempted verifications to the AFP system (9 users × 5 attempts each).
One Versus Two Acoustic Parameters. Speaker verification was carried
out in two ways: first by utilizing only one acoustic parameter, and then by
using two acoustic parameters. The goal was to determine what difference, if
any, occurred using one versus two acoustic parameters for verification.
Matching Algorithms
A variety of simple matching algorithms were employed in the
verification experiments. The goal was to determine what impact, if any,
different matching algorithms have on the verification process. The algorithms
used in this study are discussed below.
Algorithm 1: Correlation. Correlation as defined in chapter 3 was
computed on two data sets. Out of all paired matches, the one with maximum
value (i.e. best correlation) was identified. When two parameters are used, the
maximum is taken.
Algorithm 2: Closest Match. Data points are compared between two
similar data sets, and the absolute difference between each pair of points is
computed. Out of all paired matches, the one with minimum value (i.e., the
closest match) was identified. When two parameters are used, the minimum is taken.
Algorithm 3: Slope. Slope of the regression line, as defined in chapter 3,
was computed on two data sets. Out of all paired matches, the one with
minimum difference was identified. When two parameters are used, the
minimum of the two is taken.
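For concreteness, the three simple matching rules above could be expressed as follows; the function names and the array representation of the data sets are assumptions of this sketch, not part of the thesis implementation.

    import numpy as np

    def match_correlation(a, b):
        # Algorithm 1: Pearson correlation; the maximum over all
        # candidate pairings is the best match.
        return np.corrcoef(a, b)[0, 1]

    def match_closest(a, b):
        # Algorithm 2: total absolute point-wise difference; the
        # minimum over all candidate pairings is the closest match.
        return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

    def match_slope(a, b):
        # Algorithm 3: absolute difference of regression-line slopes;
        # again the minimum over all candidate pairings is taken.
        slope_a = np.polyfit(np.arange(len(a)), np.asarray(a), 1)[0]
        slope_b = np.polyfit(np.arange(len(b)), np.asarray(b), 1)[0]
        return abs(slope_a - slope_b)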
Algorithm 4: FREP. The FREP (Fuzzy Rule Evaluation of Parameters)
sub-system was used to determine how the two data sets compared.
Correlation and Slope were computed for the pair of data sets, and the fuzzy
rule system described in chapter 3 was used to compute a final value. Out of all
matches, the one with maximum value was identified. When two parameters
are used, the maximum is taken.
Algorithm 5: FREP With BANPA (AFP). The FREP sub-system was
combined with the BANPA sub-system. Correlation and Slope were computed
for the pair, the fuzzy rule system was used, and then the results were sent to
the Bayesian network to compute a final value. Out of all matches, the one
with maximum value was identified. With one parameter, the second input to
the Bayesian network was held at 0.5. With two parameters, the two results
were fed as inputs to the network, as was done in the selection of passwords
described above. This is the complete AFP system implementation described in
chapter 3.
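The fuzzy rule base and the Bayesian network themselves are defined in chapter 3 and are not reproduced here; the sketch below uses simple stand-in functions only to show how Algorithms 4 and 5 chain the pieces together, including the 0.5 default for the second network input. Both stand-ins are illustrative assumptions, not the actual FREP rules or BANPA network.

    def frep_score(corr, slope_diff):
        # Stand-in for the chapter 3 fuzzy rule evaluation (FREP):
        # reward high correlation, penalize slope difference, and
        # clamp the result to [0, 1].
        return max(0.0, min(1.0, corr - slope_diff))

    def banpa_combine(p1, p2):
        # Stand-in for the chapter 3 Bayesian network (BANPA): treats
        # the two parameter scores as independent evidence of a match.
        # An input of 0.5 leaves the other score unchanged, which is
        # consistent with the one-parameter case described above.
        num = p1 * p2
        return num / (num + (1.0 - p1) * (1.0 - p2) + 1e-12)

    def afp_score(parameter_results):
        # Algorithm 5: one FREP score per acoustic parameter, fused by
        # the network; parameter_results is a list of (corr, slope_diff)
        # pairs, one per acoustic parameter used.
        scores = [frep_score(c, s) for c, s in parameter_results]
        if len(scores) == 1:
            scores.append(0.5)  # second network input held at 0.5
        return banpa_combine(scores[0], scores[1])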
System Verification Criteria 1: Comparison Against The Same Passwords
To determine whether an incoming speech signal accurately verifies a
speaker, the acoustic parameters FFT and Spectrum were extracted as before
to form a new acoustic data profile. This new profile was then compared
against all profiles of the same password from all users of the system. Using
the matching algorithms described above, the closest match was found and
reported to the system. If the match was not who the user claimed to be,
verification failed; otherwise verification was successful. The results of the
45 system verification encounters are illustrated in the figures below.
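A minimal sketch of this closed-set check follows, assuming stored_profiles maps each enrolled user to that user's stored profile for the claimed password and that match returns larger values for better matches (distance-style scores such as Algorithms 2 and 3 would need to be negated first).

    def verify_same_password(probe, claimed_user, stored_profiles, match):
        # Find the enrolled profile closest to the probe; verification
        # succeeds only if the best match is the claimed user.
        best_user = max(stored_profiles, key=lambda u: match(probe, stored_profiles[u]))
        return best_user == claimed_user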
Figure 4.17 Successful Verifications: Against Same Passwords (One Parameter)
As the graph above shows, using a single parameter for speaker
verification does not yield great results. This was somewhat expected. The best
matching algorithm was Algo 4 (FREP).
Figure 4.18 Successful Verifications: Against Same Passwords (Two Parameters)
As the graph above illustrates, there was a significant improvement
when adding a second parameter for verification. Also, Algo 5, the FREP and
BANPA combination, fared the best. This algorithm is the complete AFP
system implementation and, in this case, has been used for speaker verification.
The AFP theoretical methods of acoustic parameter evaluation and utilization
have yielded the best verification results.
System Verification Criteria 2: Comparison Against The Baseline Passwords
In this experiment, the new profiles created from a user's attempt at
verification were compared against the baseline profile (average of all users) of
the same password. A confidence threshold was created by defining how
close a match needed to be. The threshold was set at 90% for this research.
This meant that if a match was 90% or better, verification succeeded. With this
criterion, if the verification attempt fell below the threshold, verification failed.
Otherwise it was successful. Further research may be carried out by defining a
variety of thresholds and observing the change in algorithm performance. The
same matching algorithms described above were used. The results of the 45
system verification encounters are illustrated in the figures below.
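Before those figures, here is a sketch of the threshold test itself; the 90% value is the one used in this research, while the assumption that match scores are normalized so that 1.0 is a perfect match is mine.

    def verify_against_baseline(probe, baseline, match, threshold=0.90):
        # Accept only if the match score against the baseline profile
        # meets the confidence threshold.
        return match(probe, baseline) >= threshold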
Figure 4.19 Successful Verifications: Against Baseline Password (One Parameter)
Overall, the number of successful verifications grew. Surprisingly, Algo
5 did not fare well when required to meet a threshold. Algo 4, FREP, did the best.
This algorithm takes the maximum of the correlation and slope difference values.
Figure 4.20 Successful Verifications: Against Baseline Password (Two Parameters)
Again, when adding a second acoustic parameter, verification improved.
Algo 4, FREP, improved significantly (from 19 to 38 successful verifications).
In this case, the fuzzy rule system defined in chapter 3, using two acoustic
parameters, yielded the best results.
System Verification Criteria 3: Spoofing, Determining A Successful Rejection Rate
In this experiment, each of the users attempted to use a password other
than their designated password for verification. There are 21 baseline
passwords in the system, and each user attempted 5 other passwords, all of which
are baseline passwords but not their own. Again, a confidence threshold was
defined. With this criterion, if the verification attempt fell below the threshold,
the attempt was rejected and deemed a successful rejection. Otherwise it was
deemed a failure, in that it allowed verification with an unauthorized password.
The same matching algorithms described above were used. The results of the
45 system verification encounters are illustrated in the figures below.
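Reusing the threshold test above, the rejection tally for this experiment could be computed as in the following sketch; the data structures and names are illustrative assumptions.

    def count_successful_rejections(spoof_attempts, baselines, match, threshold=0.90):
        # spoof_attempts is a list of (probe_profile, password) pairs in
        # which the password is not the speaker's own; a score below the
        # threshold counts as a successful rejection.
        return sum(1 for probe, password in spoof_attempts
                   if match(probe, baselines[password]) < threshold)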
Figure 4.21 Successful Rejections: Against Spoofing Baseline Password (One Parameter)

Figure 4.22 Successful Rejections: Against Spoofing Baseline Password (Two Parameters)
There was very little difference between one and two parameters. The
exception to this was Algo 5, the AFP system.
The last two experiments point out that by setting a pre-determined
threshold, one can control the behavior of a speaker verification system,
provided enough data is available to the administrator of the system. The problem
is that any pre-set threshold is vulnerable to system changes and must be
constantly updated to accommodate them. In short, one must be
extremely careful when using thresholds and must allow for the increased
maintenance they require.
Although the experiments above showed promising results, the
number of variations explored was limited. Future research should examine
additional variations, such as other matching algorithms and matching thresholds.
CHAPTER 5
CONCLUSIONS
The research carried out in this study found that there are new
ways in which speaker verification can be implemented using acoustic
parameters. By focusing on the key attributes of the human voice, we have
found that, by parameterizing speech, there is much promise in developing an
invariant set of acoustic features that will differentiate one speaker from the
next. Even though much work has been done in extracting and using acoustic
features for speaker verification, we have found new techniques of
implementation that appear to be effective.
This study demonstrates that with ingenuity and thought, one can
explore many different avenues that could very well lead to new successes in
speaker verification. Much of the work accomplished in this study has been
derived from a multitude of disciplines, including signal processing, artificial
intelligence, and computer science. Our main focus has been on the extraction
and use of acoustic features to improve speaker verification and to approach it
from a new direction. We have discovered the important role acoustic features
play in the overall effectiveness of speaker verification.
In this study, we have also shown that one acoustic feature is not
enough for successful verification. We have shown that two is not enough
either. But more importantly, we have shown that the difference between one