Citation
Proposed analytical framework for electronically frequency-modified voices

Material Information

Title:
Proposed analytical framework for electronically frequency-modified voices
Creator:
Bonilla, Eliud ( author )
Language:
English
Physical Description:
1 electronic file (80 pages)

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Music and Entertainment Industry Studies, CU Denver
Degree Disciplines:
Recording arts
Committee Chair:
Grigoras, Catalin
Committee Members:
Smith, Jeff M.
Lewis, Jason R.

Subjects

Subjects / Keywords:
Voiceprints ( lcsh )
Forensic audiology ( lcsh )
Spectrum analysis ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Abstract:
The concept of the human voice as a reliable biometric signal is rapidly being accepted and implemented in today’s society. State-of-the-art call centers are increasingly incorporating automated speaker recognition (ASR) technologies in an effort to enhance customer service and minimize identity theft. In the forensics realm, ASR has been accepted in many European courts and may also be accepted in US federal courts in the not too distant future. However, ASR is relatively fragile due to its dependency on the frequency dimension of the voice while not incorporating higher layers of information such as prosody and accents. It can be degraded by a number of inter/intra-speaker characteristics, in addition to multiple variables along the signal chain.
Abstract:
Purposely modifying the frequency/pitch of a voice by electronic means is an effective counter-forensic measure. It is most often implemented to mask the identity of an individual. Its use can be considered legitimate when used to protect the identity of a witness in a television interview or as part of a law enforcement investigation. However, it can also be used to protect the identity of individuals committing crimes ranging from classic scenarios such as phone calls for ransom requests to recording video/audio messages inciting violence.
Abstract:
Electronic frequency/pitch modification impacts various common forensic analysis methods. Vowel spaces are distorted, thus neutralizing their use by phoneticians. Moderate changes degrade likelihood ratio (LR) scores in ASR systems, while aggressive changes induce additional errors by distorting the formant-frequency relationships outside of normal expected ranges.
Abstract:
The proposed Electronic Voice-Frequency Modification Analysis (ElVo-FMA) framework covers key concepts, questions and procedures to enable practical real-world forensic speaker comparison and identification efforts.
Thesis:
Thesis (M.S.)--University of Colorado Denver
Bibliography:
Includes bibliographical references.
System Details:
System requirements: Adobe Reader.
Restriction:
Embargo ended 05/17/2019
General Note:
n3p
Statement of Responsibility:
by Eliud Bonilla.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
on10071 ( NOTIS )
1007155032 ( OCLC )
on1007155032

Downloads



Full Text
PROPOSED ANALYTICAL FRAMEWORK FOR
ELECTRONICALLY FREQUENCY-MODIFIED VOICES
by
ELIUD BONILLA
B.S., University of Puerto Rico Mayagüez, 1998
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science, Recording Arts Program
2017


©2017
ELIUD BONILLA ALL RIGHTS RESERVED
ii


This thesis for the Master of Science degree by Eliud Bonilla has been approved for the Recording Arts Program by
Catalin Grigoras, Chair
Jeff M. Smith
Jason R. Lewis
Date: May 13, 2017
iii


Bonilla, Eliud (M.S., Recording Arts Program)
Proposed Analytical Framework for Electronically Frequency-Modified Voices Thesis directed by Assistant Professor Catalin Grigoras
ABSTRACT
The concept of the human voice as a reliable biometric signal is rapidly being accepted and implemented in today’s society. State-of-the-art call centers are increasingly incorporating automated speaker recognition (ASR) technologies in an effort to enhance customer service and minimize identity theft. In the forensics realm, ASR has been accepted in many European courts and may also be accepted in US federal courts in the not too distant future. However, ASR is relatively fragile due to its dependency on the frequency dimension of the voice while not incorporating higher layers of information such as prosody and accents. It can be degraded by a number of inter/intra-speaker characteristics, in addition to multiple variables along the signal chain.
Purposely modifying the frequency/pitch of a voice by electronic means is an effective counter-forensic measure. It is most often implemented to mask the identity of an individual. Its use can be considered legitimate when used to protect the identity of a witness in a television interview or as part of a law enforcement investigation. However, it can also be used to protect the identity of individuals committing crimes ranging from classic scenarios such as phone calls for ransom requests to recording video/audio messages inciting violence.
Electronic frequency/pitch modification impacts various common forensic analysis methods. Vowel spaces are distorted, thus neutralizing their use by phoneticians. Moderate changes degrade likelihood ratio (LR) scores in ASR systems, while aggressive changes induce additional errors by distorting the formant-frequency relationships outside of normal expected ranges.
The proposed Electronic Voice-Frequency Modification Analysis (ElVo-FMA) framework covers key concepts, questions and procedures to enable practical real-world forensic speaker comparison and identification efforts.
iv


The form and content of this abstract are approved. I recommend its publication.
Approved: Catalin Grigoras
v


DEDICATION
I wish to dedicate this to my family who mean so much to me. To my parents Rafael and Lucy who instilled in me the values of honest hard work and the love of learning. To my little sister Arlene who has always lovingly teased her “nerdy” big brother. To Ana and Alex whom I wish all the happiness and love in the world. And last, but not least, to my wife Kathleen whose love, support and patience allowed me to complete this endeavor.
vi


ACKNOWLEDGEMENTS
I want to thank my advisor Catalin Grigoras for all of his support during the academic program and thesis process, passionate teaching, stimulating discussions, professional respect and friendship. I also want to thank Jeff Smith for his enthusiastic sharing of knowledge and support.
I want to thank Leah Haloin for helping keep me on schedule with all the logistics for class, registration, thesis, and graduation paperwork. I also want to thank my fellow cohort students who made the courses, labs, and the whole program so stimulating and enjoyable.
vii


TABLE OF CONTENTS
CHAPTER
I. INTRODUCTION..........................................................................1
1.1 Voice as a Biometric Signal..................................................1
1.2 Forensic Automatic Speaker Recognition.......................................5
1.4 United States Law............................................................8
1.5 Electronic Frequency/Pitch Changers.........................................11
1.6 Scope.......................................................................12
1.7 Motivation and Goals........................................................13
II. LITERATURE REVIEW....................................................................15
2.1 Non-Electronic Voice Disguise vs Manual SID.................................15
2.2 Non-Electronic Voice Disguising vs Automated SID............................17
2.3 Electronic Voice Disguise vs Manual SID.....................................18
2.4 Electronic Voice Disguise vs Automated SID..................................19
III. ELECTRONIC FREQUENCY/PITCH MODIFICATION OF THE VOICE.................................20
3.1 Modification of Frequency Bandwidth Content.................................20
3.2 Impact on Formant-Based Analysis............................................26
3.4 Impact on Automated Speaker Identification..................................32
IV. ELVO-FMA ANALYTICAL FRAMEWORK PROPOSAL...............................................37
4.1 Obtaining Speech Evidence...................................................39
4.2 ElVo-FMA Key Assessments - Four Questions...................................39
4.3 ElVo-FMA Stage 1 - Has Electronic Frequency Modification Occurred?..........39
4.4 ElVo-FMA Stage 2 - Is the Method Known?.....................................42
4.5 ElVo-FMA Stage 3 - Are the Settings/Values of the Modification Parameters Known?.44
4.6 ElVo-FMA Stage 4 - Is the Modification Reversible?..........................45
V. PATH FORWARD.........................................................................49
5.1 Principal Recommendations...................................................49
viii


5.2 Deficiencies.......................................................49
BIBLIOGRAPHY.....................................................................51
APPENDIX.........................................................................56
ix


LIST OF TABLES
TABLE
Table 1 Log Likelihood Ratio scores for male voice samples
x


LIST OF FIGURES
Figure 1 Sources of variability in speech recordings................................3
Figure 2 General processing chain for automatic speaker recognition.................6
Figure 3 LR Estimation with probability density functions...........................7
Figure 4 Exemplar of a suggested LR Verbal Scoring Table............................8
Figure 5 Examples of Voice Changer Options..........................................12
Figure 6 Illustration of PSOLA algorithm waveform modification......................13
Figure 7 Bandwidth of Male Voice; Original plus lowered by 1,2 and 4 Semi-Tones.....21
Figure 8 Bandwidth of Male Voice: Original plus lowered by 1,2 and 3 Semi-Tones.....22
Figure 9 Zoom of Bandwidth of Male Voice; Original plus lowered by 1, 2 and 3 Semi-Tones . 22
Figure 10 Bandwidth of Male Voice; Original plus raised by 1,2 and 3 Semi-Tones.....23
Figure 11 Zoom of Bandwidth of Male Voice; Original plus raised by 1, 2 and 3 Semi-Tones . 24
Figure 12 Male Voice, Voice Changer App for “Normal”, “Chipmunk” and “Old Man”......25
Figure 13 Zoom View: Male Voice, Voice Changer Android App for “Normal”, “Chipmunk”
and “Old Man”.......................................................................25
Figure 14 International Phonetic Association Chart for Vowels.......................27
Figure 15 Formant Space Plot - Unmodified...........................................27
Figure 16 Formant Space Plot - Reduced Centroid View.................................28
Figure 17 Formant Space Plot - Unmodified Male Voice................................29
Figure 18 Formant Space Plot - Male voice lowered by three semi-tones...............29
Figure 19 Formant Frequency View: Unmodified male voice.............................30
Figure 20 Formant Frequency View: Modified male voice down three semi-tones.........31
Figure 21 Formant Frequency View: Modified male voice up three semi-tones...........31
Figure 22 Default Vowel Frequency Range in Catalina®................................32
Figure 23 Unmodified Voice - Speaker ID Assessment..................................35
xi


Figure 24 Modified Voice 3 Semi-Tones Down - Speaker ID Assessment................35
Figure 25 Modified Voice 3 Semi-Tones Up - Speaker ID Assessment..................36
Figure 26 ElVo-FMA Flowchart Diagram..............................................38
Figure 27 ElVo-FMA Stage 1 Flowchart Diagram......................................40
Figure 28 ElVo-FMA Stage 2 Flowchart Diagram......................................42
Figure 29 ElVo-FMA Stage 3 Flowchart Diagram......................................44
Figure 30 ElVo-FMA Stage 4 Flowchart “No” Branch Diagram..........................46
Figure 31 ElVo-FMA Stage 4 Flowchart “Yes” Branch Diagram.........................47
xii


LIST OF ABBREVIATIONS
ABBREVIATION
1. AIR Average Identification Rate
2. ALISP Automatic Language Independent Speech Processing
3. ASR Automated Speaker Recognition
4. CRR Correct Recognition Rate
5. DAW Digital Audio Workstation
6. ElVo-FMA Electronic Voice-Frequency Modification Analysis
7. F0 Fundamental Frequency
8. FASR Forensic Automatic Speaker Recognition
9. FRE Federal Rules of Evidence
10. FSC Forensic Speaker Comparison
11. FSID Forensic Speaker Identification
12. LLR Logarithmic Likelihood Ratio
13. LPC Linear Predictive Coding
14. LR Likelihood Ratio
15. MFCC Mel-Frequency Cepstral Coefficients
16. PSOLA Pitch Synchronous Overlap and Add
17. RASTA RelAtive SpecTrAl
18. SID Speaker Identification
19. SME Subject Matter Expert
20. VIN Voice Inspector® by Phonexia
xiii


CHAPTER 1
INTRODUCTION
This chapter will cover some basic concepts necessary to have the proper context and background for understanding the impact of electronic frequency/pitch modification on Forensic Automatic Speaker Recognition and the need for a coherent analytical framework.
1.1 Voice as a Biometric Signal
Biometrics is the science of establishing the identity of individuals based on their biological and behavioral characteristics [1]. Identifying a person based on their speech can be both extremely effortless and extremely difficult. For most people who have normal hearing capabilities, cognitive aural processes are extremely efficient at learning and remembering the sound characteristics of people they know [2]. This ability is recognized to the point of having earwitness1 testimony allowed in military, federal and some state courts [3]. But when it must be done with scientific rigor and mathematical analysis in formal settings, such as the field of forensics, identifying individuals by their speech is a very fragile process.
Human speech is an activity/performance-based signal in contrast to other biological-based biometrics such as DNA, fingerprints and facial features [4] [5]. It is not known whether or not every person in the world produces utterances which are unique to them and different from those of all others. We do not know if intraspeaker variability is always less than interspeaker variability, and if this relationship is true for all situations and under all conditions [2].
The human voice is very flexible and a person is susceptible to many influences that, in turn, will affect the variability of his/her speech recording [6] [7] [8] [9] (see Figure 1). Some factors include:
1 Testimony based on recall of auditory events, especially spoken messages uttered at the scene of a crime.
1


• Situational task stress: Speaking while doing a task that may require physical or cognitive effort such as lifting, walking, operating a vehicle.
• Vocal effort/style: Altering normal speech to account for social or environmental context, such as whispering, shouting, speaking to an audience, singing, or compensating for speaking in the presence of noise (Lombard effect).
• Emotional state: Speaking while experiencing deep emotional states such as sadness, happiness, fear, or anger.
• Physiological factors: Effects include illness, aging, exhaustion, and being under the influence of medication or alcohol.
• Interaction-based scenarios: Human-to-Human (one-on-one or in a group setting), read/scripted vs. spontaneous speech, and Human-to-Machine scenarios.
• Circadian rhythm-based: “Morning” vs. “evening” voice or “alert” vs “sleepy”.
• Technological influences including:
o Electromechanical: Transmission channels, devices (cellphone, landline, computer), microphones.
o Environmental: Background noise, room acoustics, microphone distance/placement.
o Data Quality: Sampling rate, audio codec and compression.
2


Figure 1 Sources of variability in speech recordings
In the same manner that we use voice as a biometric identifier in our daily lives it should not come as a surprise that disguising our voices is a natural human counter-identification tactic.
Disguising a voice is not a modern phenomenon; it could be reasonably argued that it probably began in pre-historic times as a natural result of human social evolution. From the earliest writings we find stories with examples of the use of voice disguises. For example, in the biblical book of Genesis, Chapter 27, we read a story about inheritance and deceit. Jacob, aided by his mother Rebekah, tricks his elderly and blind father Isaac into pronouncing the socially significant patriarchal blessing upon him instead of the legitimate heir, his brother Esau. In the critical moment when Isaac is identifying his son we read:
3


And Jacob went near unto Isaac his father; and he felt him, and said, “The voice is Jacob's voice, but the hands are the hands of Esau. ” [10]
If we analyze this story as a forensic speaker identification (FSID) case, we could state that Jacob used a “multimode” disguise approach, changing his physical appearance as well as his voice. Under normal circumstances Isaac could have been considered a subject matter expert (SME) in FSID when it came to his sons, but his failing health produced a “false positive” identity assessment.
We can proceed to classify the intent for a person to disguise his/her voice into two general categories: protecting an identity for illegal acts and protecting an identity for legal acts.
Protecting an individual’s identity while planning or committing an illegal act can cover a wide range of activities including phone calls, recordings, or masked individuals: demanding money in a robbery, extortion, threats of violence, stating demands in hostage situations, or planning among collaborators. Usually this is perceived as a countermeasure against law enforcement or intelligence authorities, situations where society usually sees it as “the bad guys trying to neutralize the good guys”.
The second category is to protect an individual’s identity while conducting a lawful act. This is a common scenario when a person needs to protect their identity from retaliation: whistleblowers in workplace or corruption cases, eyewitnesses to a crime testifying to authorities, or individuals providing an interview to the media. This is usually seen as a countermeasure against criminal or corrupt organizations and as a pro-society and pro-law-enforcement effort.
4


Electronic frequency voice disguise is rapidly becoming more feasible with technology. Non-electronic voice disguise requires ability, focus and stamina from the speaker that are not required when using electronic means.
The approach towards applying digital electronic voice frequency/pitch disguise can be divided into two general categories: voices that are clearly identified as disguised by naive2 listeners and voices that are disguised in subtle ways in an attempt to not be discovered as manipulated [11].
Voices that are clearly identified as digitally manipulated are usually used in scenarios where it is of the utmost importance to not give up the identity of the speaker. The resulting aural characteristics of the process are usually severe, pushing the frequency content higher or lower to an unnatural state where it is relatively easy for the vast majority of naive listeners to know the voice is manipulated. On the other hand, the manipulations cannot be so severe as to affect the intelligibility of the speech, since the message content must be understood by the listener.
Voices that are manipulated in a subtle manner aim to be imperceptible to aural analysis, maintaining enough natural-sounding characteristics while still modifying the features used in SID methods. This is usually harder to attain due to the need for more sophisticated algorithms and the limitations of audio bandwidth in scenarios such as telephone circuits versus fuller-spectrum audio channels.
1.2 Forensic Automatic Speaker Recognition
Speaker recognition is the general term used to describe the task of discriminating one person from another based on the sound of their voices. But the term itself encompasses two subtasks: speaker identification and speaker verification. For speaker identification the goal is to identify an unknown speaker from a set of known speakers. For speaker verification an unknown speaker is
2 A listener who lacks specialized phonetic or linguistic training.
5


claiming an identity (such as in the case of security systems) and the goal is to verify whether this claim is true. Unfortunately, in the forensic community there is no general agreement on the use of all these definitions and acronyms, and it is common to hear/read other terms with similar meanings. As a result it is common to find similar terminology in the literature such as Automated Speaker Recognition (ASR), although Automated Speech Recognition uses the same acronym, Forensic Automatic Speaker Recognition (FASR), Speaker Identification (SID) as well as Forensic Speaker Identification (FSID), and Forensic Speaker Comparison (FSC) [12].
The general process for FASR (see Figure 2) goes from obtaining the speech signals, extracting features, and creating models, to implementing a comparison against the extracted features of a questioned recording. The comparison algorithms usually provide a “similarity score” which is then interpreted.
Feature extraction converts raw speech data into feature vectors. The most popular methods include Mel-frequency cepstral coefficients (MFCC), Relative Spectral (RASTA) [13], and formant (F0-F3) extraction [14].
Figure 2 General processing chain for automatic speaker recognition
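For readers unfamiliar with this front-end step, the following is a minimal sketch of MFCC extraction in Python, assuming the open-source librosa library is available (it is not one of the tools used in this thesis); the function name, normalization choice, and parameters are illustrative only.

import numpy as np
import librosa  # assumed available; not one of the tools used in this thesis

def extract_mfcc_features(wav_path, n_mfcc=20):
    """Load a recording and return a (frames x n_mfcc) matrix of MFCC vectors."""
    y, sr = librosa.load(wav_path, sr=None)                # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Simple per-coefficient mean/variance normalization, a common channel compensation.
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-9)
    return mfcc.T

Real FASR front-ends add voice activity detection, channel compensation such as RASTA filtering, and statistical modeling on top of such vectors.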
6


The comparative analysis and score interpretation is usually based on a Bayesian inference and the LR [15]. The LR is defined as a ratio of two conditional probabilities: the probability (p) of the evidence (E) given (|) the prosecution hypothesis (Hp), to the probability of the evidence given the defense hypothesis (Hd). The odds form of the LR is defined in the equation below:

LR = p(E|Hp) / p(E|Hd)
The probabilistic nature of the LR allows for visualization, thus providing insight into the probability density functions of intra- and inter-speaker variability.
Figure 3 LR Estimation with probability density functions
At the conceptual level the LR is an assessment of the similarity between the suspect and offender samples regarding a given parameter, and the typicality of those values within a wider, relevant population [16]. The outcome is a value that pivots on one, such that LRs >1 offer support for Hp and LRs <1 offer support for Hd. If LR=1 then the evidence is equally probable under Hp and Hd. The magnitude of the LR determines how much more likely the evidence would be under the competing hypotheses. The range can be significant, sometimes requiring the use of a logarithmic scale to produce a logarithmic likelihood ratio (LLR). It may be necessary for an SME to translate an LR/LLR score to a verbal scale when trying to communicate with triers-of-fact or the general public. This should be done carefully since numerical results vary according to the quality of the suspect, questioned, and population recordings.
LR from   LR to    Proposed verbal ratio
1000      >1000    Strong evidence to support prosecution hypothesis
100       1000     Moderately strong evidence to support prosecution hypothesis
10        100      Moderate evidence to support prosecution hypothesis
1         10       Limited evidence to support prosecution hypothesis
0.1       1        Limited evidence against prosecution hypothesis
0.01      0.1      Moderate evidence against prosecution hypothesis
0.001     0.01     Moderately strong evidence against prosecution hypothesis
<0.001    0.001    Strong evidence against prosecution hypothesis
Figure 4 Exemplar of a suggested LR Verbal Scoring Table3
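As a purely illustrative sketch of the calculation described above, the following Python fragment computes an LR from two assumed Gaussian score distributions (one for same-speaker scores under Hp, one for different-speaker scores under Hd) and maps the result onto the verbal scale of Figure 4; the distributions, function names, and numbers are hypothetical and are not taken from any system used in this thesis.

from scipy.stats import norm

def likelihood_ratio(evidence_score, hp_mean, hp_std, hd_mean, hd_std):
    """LR = p(E|Hp) / p(E|Hd) for a single similarity score E."""
    p_e_hp = norm.pdf(evidence_score, loc=hp_mean, scale=hp_std)
    p_e_hd = norm.pdf(evidence_score, loc=hd_mean, scale=hd_std)
    return p_e_hp / p_e_hd

def verbal_scale(lr):
    """Map an LR value onto the verbal scale of Figure 4."""
    bands = [(1000, "Strong evidence to support prosecution hypothesis"),
             (100, "Moderately strong evidence to support prosecution hypothesis"),
             (10, "Moderate evidence to support prosecution hypothesis"),
             (1, "Limited evidence to support prosecution hypothesis"),
             (0.1, "Limited evidence against prosecution hypothesis"),
             (0.01, "Moderate evidence against prosecution hypothesis"),
             (0.001, "Moderately strong evidence against prosecution hypothesis")]
    for threshold, label in bands:
        if lr >= threshold:
            return label
    return "Strong evidence against prosecution hypothesis"

# Hypothetical example: a score that is typical for same-speaker pairs but
# rare for different-speaker pairs yields an LR well above 1 (support for Hp).
lr = likelihood_ratio(evidence_score=7.0, hp_mean=8.0, hp_std=1.5, hd_mean=0.0, hd_std=2.0)
print(lr, verbal_scale(lr))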
A recent poll conducted by Interpol [17] revealed that, out of 96 responding countries, forty-four had speaker identification capabilities and twenty-six had either forensic voice comparison systems or automatic speaker recognition systems.
1.4 United States Law
The acceptance of normal, non-modified speech as part of FSID expert testimony in a court case is not a trivial matter. If we add the burden that the speech evidence has been
3 Phonexia Voice Inspector (VIN) training class notes
8


modified/tampered with, and yet we intend to still present expert testimony tying it to the identity of a person, we must admit that it would be a very difficult legal hurdle to overcome.
In the United States federal courts the admissibility of expert testimony is subject to the Federal Rules of Evidence (FRE), with special emphasis on Articles IV, VII, and IX [18]. In 1993 the Daubert4 standard superseded the previous Frye5 standard for the interpretation of the FRE. Under Frye the criterion for accepting expert testimony was “general acceptance in the particular field in which it belongs,” while Daubert assigns the trial judge the role of a “gatekeeper” to ensure that:
• The testimony is based upon sufficient facts or data
• It is the product of reliable principles and methods
• The witness has applied the principles and methods reliably to the facts of the case
In pretrial hearings to assess whether an expert witness may be allowed to testify, commonly referred to as a Daubert hearing, the judge has a non-exclusive checklist to evaluate the witness and the proposed evidence:
• Whether the theory or technique has been tested
• Whether it has been subject to peer review and publication
• Whether the technique or theory produces results with a known potential error rate
• Whether it has attracted a widespread acceptance within a relevant scientific community
• Whether standards controlling its operation exist and are maintained
• Other factors to include:
4 Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 125 L. Ed. 2d 469, 113 S. Ct. 2786 (1993)
5 Frye v. United States, 54 App. D.C. 46, 293F.1013, DC Ct App (1923)
9


o Whether experts are proposing to testify about matters growing naturally and directly out of research they have conducted independent of the litigation, or whether they have developed their opinions expressly for purposes of testifying.
o Whether the expert has unjustifiably extrapolated from an accepted premise to an unfounded conclusion.
o Whether the expert has adequately accounted for obvious alternative explanations.
o Whether the expert is being as careful as he would be in his regular professional work outside his paid litigation consulting.
o Whether the field of expertise claimed by the expert is known to reach reliable results for the type of opinion the expert would give.
State courts do not have to accept the FRE with the Daubert standard, but thirty have accepted it or have laws consistent with it, fourteen reject it6, and seven neither accept nor reject Daubert [19]. Puerto Rico, which is a territory of the United States, has also adopted the FRE and Daubert [20].
When it comes to speech evidence for FSID the US courts have been very conservative in their interpretation of Daubert. As of this date FSID testimony has not been accepted as evidence in an appellate federal court. The closest that speaker identification evidence has come to being accepted, and establishing a precedent, was in United States v. Ahmed, et al. 12 CR 661 (SI) (SLT) [21]. Mr. Mohammed Yusuf, together with Madhi Hashi and Ali Yasin Ahmed, was accused of providing support to al-Shabaab, a terrorist organization based in Somalia [22]. The case had speech evidence that had been analyzed by acoustic, aural-phonetic, and automated speaker recognition
6 Including Washington DC
10


methods [23]. The Daubert hearing lasted three days, but since the accused pleaded guilty before the beginning of the trial, the judge’s determinations were not made public and the opportunity to establish precedent was gone.
The proposed ElVo-FMA framework could be very helpful in allowing electronically modified speech to be accepted in a US federal court in the future.
1.5 Electronic Frequency/Pitch Changers
To state that there is a plethora of tools to electronically alter the frequency/pitch of speech recordings/transmissions is an understatement. An informal online market survey quickly provides a wide variety of products that can be grouped into different categories including:
• Professional audio production systems:
o Broadcast: Adjust voice quality for commercials and protect interviewee identity
o Digital Audio Workstation (DAW): for music and audio productions in studio settings
• Phone-related
o Android and iOS Apps: Multiple voice effects
o External hardware: Microphone and settings
• Entertainment
o Hobbyist kits
o First-person gaming applications
Electronic voice changers can also be classified by the speed of their algorithms which can result in real-time, near real-time, and post-processing scenarios.
11


Figure 5 Examples of Voice Changer Options
1.6 Scope
The ElVo-FMA framework is described at a technology-agnostic level. It provides criteria and a workflow at a higher level than specific signal processing algorithms. It is presented in a holistic manner to reach technical, research, and investigative readers.
The electronic frequency/pitch changes conducted in this research were limited to Pitch Synchronous Overlap and Add (PSOLA) based algorithms within Adobe Audition 3.0. The algorithm determines a fundamental pitch period, segments audio into frames, and then reassembles those frames with an overlap-add method at a different rate [24].
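To make the frame/overlap-add idea concrete, the following is a heavily simplified, TD-PSOLA-style sketch in Python. It assumes a single, roughly constant pitch period estimated over the whole signal and uses only numpy; Adobe Audition's actual Pitch Shifter implementation is proprietary and certainly more refined, so this is an illustration of the principle rather than a reproduction of the tool.

import numpy as np

def estimate_period(x, sr, fmin=60.0, fmax=400.0):
    """Crude global pitch-period estimate (in samples) via autocorrelation."""
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    return lo + int(np.argmax(ac[lo:hi]))

def psola_pitch_shift(x, sr, semitones):
    """Shift pitch by `semitones` while roughly preserving duration (signal must span many periods)."""
    factor = 2.0 ** (semitones / 12.0)            # one semi-tone is a ratio of about 5.9%
    period = estimate_period(x, sr)
    window = np.hanning(2 * period)               # two-period analysis frames
    analysis_marks = np.arange(period, len(x) - period, period)
    out = np.zeros(len(x) + 2 * period)
    # Synthesis marks are spaced period/factor apart; each reuses the nearest
    # analysis frame, so frames are repeated or skipped to keep the duration.
    t = period
    while t < len(x) - period:
        m = analysis_marks[np.argmin(np.abs(analysis_marks - t))]
        out[t - period:t + period] += x[m - period:m + period] * window
        t += max(1, int(round(period / factor)))
    return out[:len(x)]

Raising the pitch (factor > 1) packs the reused frames closer together, shortening the local periodicity; lowering it spreads them apart.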
12


Figure 6 Illustration of PSOLA algorithm waveform modification
The software tools used to evaluate formant and LLR scores of modified speech samples are treated as “black boxes”. No further effort is made to “look under the hood” of these tools beyond the description provided in their documentation. This keeps the environment very compatible with “real-world” scenarios where investigators are not given access to algorithms that are protected by intellectual property concerns of the tool developers.
1.7 Motivation and Goals
As I conducted my research and literature search it became more and more clear that there is a gap in the academic, technical, and forensic communities regarding how to deal with electronically disguised voices.
13


The main goal of this research is to develop ElVo-FMA as an analytical framework for working with electronically modified frequency/pitch speech that serves academic, technical, and forensic professionals. It can provide a common workflow that:
• Facilitates communication between academic researchers to identify areas that could benefit from collaboration.
• Provides a clear map for forensic professionals, investigators, attorneys and triers-of-fact to evaluate speech evidence that may be presented.
• Assists all participants in admissibility hearings by providing a common framework to discuss and argue for or against the admission of expert testimony regarding electronically frequency/pitch-modified speech.
14


CHAPTER 2
LITERATURE REVIEW
In this chapter an overview is presented of the most pertinent literature surrounding the issue of voice disguise and its impact on forensic SID. The discussion will be organized into four major groups:
• Non-Electronic Voice Disguise vs Manual SID
• Non-Electronic Voice Disguising vs Automated SID
• Electronic Voice Disguising vs Manual SID
• Electronic Voice Disguising vs Automatic SID
2.1 Non-Electronic Voice Disguise vs Manual SID
The majority of the current literature is focused on non-electronic disguises and their impact on manual auditory and acoustic-phonetic analyses. The authors have a wide range of backgrounds and address different perspectives including academic, law enforcement, and practitioner viewpoints. As expected, the studies cover a wide range of methodologies, results, and recommendations for future work.
Kirchhübel and Howard [25] conducted a study on detecting suspicious behavior using speech and searching for correlation in acoustic cues. The work intended to explore changes, or the lack of changes, in the speech signal when people were being deceptive. A group of ten speakers provided elicited truthful and deceptive speech in an interview setting. Acoustic analyses were performed on various speech parameters including fundamental frequency (F0), overall amplitude and mean vowel formants F1, F2 and F3. No significant correlation between deceptive and truthful speech and any of the acoustic features examined could be established. The researchers recommend follow-on studies with a larger test subject group and refinements to the test procedures. Kirchhübel would later have
15


the opportunity to do further testing and publish the results in her doctoral dissertation [26] where she concludes: “For the majority of acoustic parameters no significant differences or trends can be discerned that would allow for a reliable differentiation between deceptive and non-deceptive speech.”
Solan and Tiersma [27] studied the field of earwitness testimony in a forensic setting. One of their findings was that the basic cognitive identification performance of humans, who can identify familiar voices with high accuracy but are poor performers in matching unfamiliar voices, gets amplified by voice disguises. In one study the results showed 79% accuracy in identifying familiar disguised voices (whispering, mimicking accents, raising/lowering pitch, imitation of other speakers), while the success rate for unfamiliar disguised voices was no better than 20.7%. They express concern about the vulnerability of both expert and naive listeners in the courts throughout the complete investigative and judicial process.
Amino, Osanai, and Kamada conducted an overview of forensic speaker recognition from a historical and procedural perspective [13]. Their research led them to conclude that: “Robustness against noise or voice disguise is higher in human-based methods compared to computer-based automatic methods. Therefore, aural identification is a suitable method, or at least can be a good aid for other methods, when voice disguise exists.”
They point out that the most common voice disguises include changes in phonation such as falsetto, whisper, creaky voice, mimicking a foreign accent and pinching one’s nose.
Rhodes, through a literature survey in his doctoral dissertation [28], finds that the reported use of non-electronic disguises in cases varies from less than 5% in the UK to up to 25% of cases in Germany when the speaker is expecting his/her voice to be recorded. Human-based forensic SID can
16


also be severely impacted by other factors that can have a secondary disguise-effect such as extreme distress, intoxication, drug addiction, multilingualism/L2 speakers or comparison between languages.
2.2 Non-Electronic Voice Disguising vs Automated SID
There have been relatively few published papers on the impact of non-electronic voice disguise on automated or forensic SID systems.
Researchers at the China Criminal Police University [29] [30] have studied the effects of non-electronic voice disguise efforts against a forensic ASR system from d-EAR Technologies [31] which was developed at the Department of Computer Science & Technology in Tsinghua University7.
The participants were asked to try different disguising techniques including:
• Raising and lowering pitch
• Faster and slower speech
• Whispering
• Pinched nostril
• Masking of mouth with hand
• Bite blocking with a pencil
• Chewing gum
• Mimicking foreign accents
Their results showed significant impact from some of the techniques. Consciously raising and lowering the pitch of speech affected the ASR tool, showing a significant degradation of the tool’s Correct Recognition Rate (CRR) scores. Raising the voice pitch resulted in a CRR score of only 10% while the score for lowered pitch was 55%.
7 The tool was described as capable of speaker identification and verification. No further technical details on its classifiers or internal algorithms were provided.
17


Researchers at the Speech Technology and Research Laboratory, SRI International, [32] investigated the effects of non-electronic voice disguise efforts against an ASR system based on GMM and cepstral features. Participants were encouraged to use disguise techniques of their choosing including changing pitch or speaking rate, or mimicking an accent.
The results confirmed significant degradation when evaluating pitch-modified voice samples against a data set that had been trained on unmodified voices. It was also observed that combining human evaluation with ASR showed promising improvement.
2.3 Electronic Voice Disguise vs Manual SID
Brixen [33] explored the effects of commercially available DSP-based voice processing algorithms on samples of a male speaker with the purpose of providing disguise. He used Voice Pro, a commercial stand-alone voice processor by TC Helicon, as well as Audition 3.0 from Adobe for the tests. He selected a variety of algorithms and presets chosen to alter the voice without sounding too unnatural. His analysis showed that in most cases the formants had shifted in their frequency space. In some cases there were processing artifacts that caused linear predictive coding (LPC) prediction of formants to produce erroneous results. He concluded that this would make correct SID difficult.
Jessica Clark and Paul Foulkes [34] conducted a test where voices had their F0 modified in a range of ±8 semitones. This was done with the commercial DAW software Sony Sound Forge (version 8.0). The disguised voices were evaluated by a total of 36 listeners, ten males and twenty-six females.
Their results showed that the listeners performed best with unmodified voices, with an average identification rate (AIR) of 59.7%. Identification rates fell in all disguised conditions, with
18


the ±8 semitone conditions yielding the lowest rates. Rates were worse for the lowered F0 values (AIR=31%) than for the corresponding raised F0 voices (AIR=41%).
2.4 Electronic Voice Disguise vs Automated SID
Perrot, Aversano and Chollet [35] [36] have conducted tests in support of the Forensic Research Institute of the French Gendarmerie. Their work included a survey of non-electronic and electronic methods to disguise voices and the impact on manual and ASR techniques. For electronic frequency modification they are exploring their own technique called Automatic Language Independent Speech Processing (ALISP), which shares some commonality with PSOLA techniques. For detection they focused on using MFCC as well as GMM and SVM algorithms. They also discuss the topic of reversing electronically-disguised voices. The published results were very limited and mostly preliminary.
19


CHAPTER 3
ELECTRONIC FREQUENCY/PITCH MODIFICATION OF THE VOICE
This chapter contains an overview of the impact that modifying the frequency/pitch of voice samples produces on the basic signal and on some of the basic processing steps that are common in FSID.
3.1 Modification of Frequency Bandwidth Content
Automated forensic SID is based on analyzing the lower-level attributes of human speech, mostly frequency characteristics. It is therefore reasonable to expect that modifications of the frequency-spectral content will have a negative impact on the fundamental analyses and identification processes. For example, in Figure 7 the frequency-bandwidth views of four versions of the same speech recording, the original plus three versions lowered by a PSOLA algorithm, have been superimposed (male voice, audio bandwidth of 22,050 Hz, 16 bits)8:
• The green plot shows the spectral content of the unmodified sample.
• The yellow plot shows the spectral content after being lowered one Semi-Tone.
• The orange plot shows the spectral content after being lowered two Semi-Tones.
• The purple plot shows the spectral content after being lowered four Semi-Tones.
8 Processed with Adobe Audition 3.0 VST “Pitch Shifter” Plugin
20


Figure 7 Bandwidth of Male Voice; Original plus lowered by 1, 2 and 4 Semi-Tones
The pattern shifts are clearly seen, confirming that the lower pitch is not an “aural illusion” but indeed a lowering of the energy related directly with the frequency by a discrete amount.9 We can also observe in the figure that there seems to be a change in the spectral noise floor and in the appearance of certain artifacts seen in the frequency dimension as nulls, peaks and valleys. These artifacts are common byproducts of the different signal processing algorithms and may hold the key to detection and identification of frequency-modified voice samples.
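The kind of long-term spectrum comparison shown in the following figures can be reproduced, in outline, with a few lines of Python; this sketch assumes numpy, scipy, and matplotlib are available and that both recordings are mono arrays at the same sample rate. It only illustrates the visualization and is not a detection algorithm.

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import welch

def long_term_spectrum(x, sr, nfft=4096):
    """Welch estimate of the long-term average spectrum, in dB."""
    freqs, pxx = welch(x, fs=sr, nperseg=nfft)
    return freqs, 10.0 * np.log10(pxx + 1e-12)

def compare_spectra(original, modified, sr):
    """Overlay the long-term spectra of an unmodified and a pitch-shifted recording."""
    f0, s0 = long_term_spectrum(original, sr)
    f1, s1 = long_term_spectrum(modified, sr)
    plt.plot(f0, s0, label="unmodified")
    plt.plot(f1, s1, label="pitch-shifted")
    plt.xlabel("Frequency [Hz]")
    plt.ylabel("Level [dB]")
    plt.legend()
    plt.show()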
The next two figures show a male voice recording, the first one (Figure 8) at the full original recording bandwidth of 24 kHz (sample rate of 48 kHz). The second one (Figure 9) is zoomed into the first 4000 Hz of the voice baseband. On both figures we have superimposed the frequency-bandwidth views of four versions of the same speech recording that have been modified by a PSOLA algorithm (male voice, audio bandwidth of 24 kHz, 16 bits)10:
9 Depending on the specific western musical scale, a semi-tone step corresponds to an increment or reduction of roughly 5-6% from the fundamental frequency.
10 Processed with Adobe Audition 3.0’s VST “Pitch Shifter” Plugin
21


• The green plot shows the spectral content of the unmodified sample.
• The orange plot shows the spectral content after being lowered one Semi-Tone.
• The purple plot shows the spectral content after being lowered two Semi-Tones.
• The yellow plot shows the spectral content after being lowered three Semi-Tones.
Figure 8 Bandwidth of Male Voice: Original plus lowered by 1, 2 and 3 Semi-Tones
Figure 9 Zoom of Bandwidth of Male Voice; Original plus lowered by 1, 2 and 3 Semi-Tones
22


The next two figures show the same male voice recording but in this case the frequency has been raised instead of lowered. Figure 10 is the full original recording bandwidth of 24 kHz while Figure 11 is zoomed into the first 4000 Hz voice baseband. On both figures we have superimposed the frequency-bandwidth view of four versions of the same speech recording that have been modified by a PSOLA algorithm (Male voice, audio bandwidth of 24 kHz, 16 bits)11:
• The green plot shows the spectral content of the unmodified sample.
• The orange plot shows the spectral content after being raised one Semi-Tone.
• The purple plot shows the spectral content after being raised two Semi-Tones.
• The yellow plot shows the spectral content after being raised three Semi-Tones.
Figure 10 Bandwidth of Male Voice; Original plus raised by 1,2 and 3 Semi-Tones
11 Processed with Adobe Audition 3.0’s VST “Pitch Shifter” Plugin
23


Figure 11 Zoom of Bandwidth of Male Voice; Original plus raised by 1, 2 and 3 Semi-Tones
The next figure (Figure 12) is of a male voice that has been recorded and modified on an Android phone with a “voice changing app”12, while Figure 13 is zoomed into the first 7200 Hz of the voice baseband. These types of apps usually apply more aggressive frequency changes in an effort to make it clearly obvious to any naive listener that the voice has been modified. The modified voice samples were extracted as attachments to emails since this application does not change voices in “real-time” for telephone conversations. The frequency-bandwidth views of three versions of the same speech recording have been superimposed (audio bandwidth of 12 kHz, 16 bits, file format: MP3):
• The green plot shows the spectral content of the unmodified sample.
• The red plot shows the spectral content after the voice being converted with the "Chipmunk" settings.
12 Voice Changer (1.0.60), Author: Android Rocker, email: androidrocker@163.com
24


• The purple plot shows the spectral content after the voice being converted with the “Old Man” settings.

Figure 12 Male Voice, Voice Changer App for “Normal”, “Chipmunk” and “Old Man”
Figure 13 Zoom View: Male Voice, Voice Changer Android App for “Normal”, “Chipmunk”
and “Old Man”
25


We can clearly observe that there are changes in the frequency-bandwidth content of the processed recordings that range from subtle to harsh and dramatic.
3.2 Impact on Formant-Based Analysis
Formant-based analysis is a foundational method of forensic SID. It is traditionally implemented in an acoustic-phonetic methodology [14] [37] [38] [39] and usually requires highly skilled SMEs. Some common tools used by the linguistic practitioner communities for formant analysis include PRAAT [40], WaveSurfer [41] and Catalina® [42].
In general, a three-step process is followed to characterize a speaker and conduct a comparison assessment. The first step is to transcribe recorded speech using a phonetic alphabet13, with most attention focused on the vowels (see Figure 14). The second step is to map the vowel sounds onto a graph, where the X and Y axes are the formant pair frequencies. Since human vowel utterances are rich in harmonics, each vowel can be represented by energy content at the F0 through F3 formants. The most common formant pairings for graphical representation are F1 vs. F2 and F2 vs. F3. The resulting plots provide a pattern that can be reduced to a centroid diagram to improve clarity (see Figures 15 and 16) [42]. These reduced plots are then used for visual comparison and SID determinations.
13 The most common phonetic alphabet used for the English language is defined by the International Phonetic Association, https://www.internationalphoneticassociation.org/
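A minimal sketch of steps two and three, assuming the formant values have already been measured by hand or by a tool: given per-vowel (F1, F2) tokens, compute the centroids that would be plotted in a reduced formant-space diagram such as Figure 16. The token values below are made up, chosen only so that the two resulting centroids reproduce the [a] and [i] values quoted in Figure 16.

from collections import defaultdict

def vowel_centroids(measurements):
    """measurements: iterable of (vowel_label, f1_hz, f2_hz) tuples."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for vowel, f1, f2 in measurements:
        t = totals[vowel]
        t[0] += f1
        t[1] += f2
        t[2] += 1
    return {vowel: (t[0] / t[2], t[1] / t[2]) for vowel, t in totals.items()}

# Hypothetical hand-measured tokens for two vowels of one speaker:
tokens = [("a", 740, 1310), ("a", 756, 1302), ("i", 290, 2095), ("i", 296, 2085)]
print(vowel_centroids(tokens))   # {'a': (748.0, 1306.0), 'i': (293.0, 2090.0)}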
26


Figure 14 International Phonetic Association Chart for Vowels
Figure 15 Formant Space Plot - Unmodified
27


Formant space centroids: A=748/1306; E=446/1641; I=293/2090; O=407/1044 [Hz]
Figure 16 Formant Space Plot - Reduced Centroid View
The F1 vs. F2 formant-space plot of an unmodified recording of a male voice is shown in Figure 17, while the plot of the same recording lowered by three semi-tones is shown in Figure 18. The analysis was done manually and five main vowel sounds were mapped.
Visual inspection clearly shows that the F1/F2 frequency values for each vowel have changed. It is also evident that the rate of change is different for each vowel, and as a result the shape of the pattern or “constellation” has also changed. Should an investigator have to make an assessment on these two plots, it would most likely result in a “does not match” determination. This shows the clear anti-forensic effect of voice frequency/pitch manipulation against this form of analysis.
28


Figure 17 Formant Space Plot - Unmodified Male Voice
Figure 18 Formant Space Plot - Male voice lowered by three semi-tones
29


Formant-based SID can also be done automatically, with the software extracting, evaluating and matching formant patterns. But these types of tools are also susceptible to voice-frequency manipulation, which produces erroneous results.
The automatic formant analysis feature of Catalina® was used on three versions of a male voice recording. The first recording was unmodified (see Figure 19), the second recording was lowered by three semi-tones from the original (see Figure 20), and the third recording had the frequency raised by three semi-tones from the original (see Figure 21). Notice how the mean values of the formants F0 through F3 as well as their histogram distributions change.
F0: mean=139.4459 Hz, std=27.4626; F1=426/166, F2=1369/448, F3=2512/311
Figure 19 Formant Frequency View: Unmodified male voice
30


F0: mean=129.6089 Hz, std=24.1419; sample length=135.57 sec; F1=366/133, F2=1279/424, F3=2256/299
Figure 20 Formant Frequency View: Modified male voice down three semi-tones
F0: mean=132.216 Hz, std=31.7016; sample length=142.43 sec
Figure 21 Formant Frequency View: Modified male voice up three semi-tones
31


There is also a trend in the F0 values that seems counterintuitive. The calculated mean F0 values for both the lowered (129.6 Hz) and raised (132.2 Hz) frequencies are below the unmodified F0 value of 139.4 Hz. Since we have seen in previous figures that the raw frequency-bandwidth spectrum correlates directly with the decrease or increase in semi-tone steps, we can presume that the electronic modifications have induced errors in the statistical computations. Based on conversations with the author of Catalina®14, we suspect that the primary source of error most likely comes from the user-defined settings for the vowel frequency range (see Figure 22). By default the tool detects the vowels [a], [e], [i], and [o] from the following settings:
Vowel   F1 low-high [Hz]   F2 low-high [Hz]   F3 low-high [Hz]
[a]     601-850            1100-1600          2200-2800
[e]     401-600            1500-2000          2100-2800
[i]     220-400            2000-2400          2400-2900
[o]     370-600            700-1200           2200-2600
Figure 22 Default Vowel Frequency Range in Catalina®
As voice recording frequencies are modified electronically and pushed to ranges not expected in normal operation, it is reasonable to expect that other automated analysis tools may suffer the same type of degradation and produce erroneous results as well.
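The failure mode can be illustrated with a small Python sketch that hard-codes the Figure 22 default ranges and applies a uniform three-semi-tone scaling to a measured formant triple; Catalina®'s internal detection logic is not public, and a real pitch shifter does not scale formants perfectly uniformly, so this only illustrates values being pushed out of the expected search ranges.

VOWEL_RANGES = {  # vowel: ((F1 low, high), (F2 low, high), (F3 low, high)) in Hz, per Figure 22
    "a": ((601, 850), (1100, 1600), (2200, 2800)),
    "e": ((401, 600), (1500, 2000), (2100, 2800)),
    "i": ((220, 400), (2000, 2400), (2400, 2900)),
    "o": ((370, 600), (700, 1200), (2200, 2600)),
}

def classify_vowel(f1, f2, f3):
    """Return the first vowel whose F1-F3 ranges all contain the measurement, else None."""
    for vowel, ((l1, h1), (l2, h2), (l3, h3)) in VOWEL_RANGES.items():
        if l1 <= f1 <= h1 and l2 <= f2 <= h2 and l3 <= f3 <= h3:
            return vowel
    return None

shift = 2 ** (3 / 12)                      # raise by three semi-tones (about +19%)
f1, f2, f3 = 293, 2090, 2650               # an [i]-like measurement from an unmodified voice
print(classify_vowel(f1, f2, f3))                          # 'i'
print(classify_vowel(f1 * shift, f2 * shift, f3 * shift))  # None: pushed out of range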
3.4 Impact on Automated Speaker Identification
The last step of our evaluation of the potential impact of electronic frequency modification was to test unmodified and modified voice recordings on an automated SID tool. Voice Inspector
14 Dr. Catalin Grigoras
32


(VIN) by Phonexia was selected due to its feature set. VIN is a commercial, state-of-the-art forensic SID software package that uses various algorithms, including i-Vector processing and Bayesian likelihood ratio calculations, to evaluate voice evidence for law enforcement and intelligence scenarios.
All the recordings used for the population database, suspect reference, and questioned voice were sourced from the same podcast producer15. This promoted minimum variability in the source/channel for the voice recordings. The general population database was composed of a total of fifteen male voice recordings. There were three suspect reference voice recordings and a total of thirteen “questioned” voice recordings to be evaluated. The questioned voice recordings and the suspect recordings were from the same individual. Of the questioned recordings, one was unmodified, six were frequency-lowered sequentially in one semi-tone increments, and six were frequency-raised sequentially in one semi-tone increments.
Each questioned voice recording was processed automatically for SID assessment. By using the same original recording for the unmodified and all the versions of the frequency-modified recording, we can establish a correlation between the direction and amount of frequency modification and the degradation in the LLR-based score. The results are shown in Table 1.
The unmodified voice recording had an LLR score of 9.53, which has an equivalent verbal score of “Strong evidence to support the prosecution hypothesis”, which in layman’s terms could be expressed as “strong evidence for a match between questioned and suspect voice”. The only other recording that supported the “prosecution hypothesis” was the one that had been lowered by one semi-tone. All other recording comparisons produced an LLR score corresponding to “Strong evidence against the prosecution hypothesis”, which in layman’s terms could be expressed as “strong evidence for no match between questioned and suspect voice” (see Table 1). A visual representation for a subset of comparisons, the unmodified voice (see Figure 23), the modified voice down three semi-tones (see Figure 24), and the modified voice up three semi-tones (see Figure 25), provides insight into the Bayesian framework being used for the assessment16.
15 CNN’s weekly “Fareed Zakaria Global Public Square” podcasts acquired via iTunes.
33
Table 1 Log Likelihood Ratio scores for male voice samples
Name                                        LLR
FZ_GPS_February_15th_8kHz.wav               9.53
FZ_GPS_February_15th_8kHz_Dn1_Semi.wav      9.53
FZ_GPS_February_15th_8kHz_Dn2_Semi.wav      -23.3
FZ_GPS_February_15th_8kHz_Dn3_Semi.wav      -23.3
FZ_GPS_February_15th_8kHz_Dn4_Semi.wav      -113
FZ_GPS_February_15th_8kHz_Dn5_Semi.wav      -153
FZ_GPS_February_15th_8kHz_Dn6_Semi.wav      -104
FZ_GPS_February_15th_8kHz_Up1_Semi.wav      -32.2
FZ_GPS_February_15th_8kHz_Up2_Semi.wav      -68.8
FZ_GPS_February_15th_8kHz_Up3_Semi.wav      -124
FZ_GPS_February_15th_8kHz_Up4_Semi.wav      -124
FZ_GPS_February_15th_8kHz_Up5_Semi.wav      -142
FZ_GPS_February_15th_8kHz_Up6_Semi.wav      -176
16 The full forensic report generated by VIN, including LLR, Verbal Scores, and Bayesian graphics for all thirteen questioned recordings are included in the Appendix.
34


E: 16.96, LR: 3.414e+09, LLR: 9.533 (non-target vs. target score histograms, H0/H1)
Figure 23 Unmodified Voice - Speaker ID Assessment
E: 3.057, LR: 5.084e-24, LLR: -23.29 (non-target vs. target score histograms, H0/H1)
Figure 24 Modified Voice 3 Semi-Tones Down - Speaker ID Assessment
35


E: -13.62, LR: 1.158e-124, LLR: -123.9 (non-target vs. target score histograms, H0/H1)
Figure 25 Modified Voice 3 Semi-Tones Up - Speaker ID Assessment
The LLR scores clearly show that, with the exception of the recording lowered by one semi-tone, all the other electronically frequency-modified voice recordings degraded the effectiveness of Phonexia’s VIN tool in a dramatic manner. Since other state-of-the-art automated SID software packages also use i-Vector algorithms and similar processes, it is reasonable to expect that this may be a problem for this whole class of automated SID tools.
36


CHAPTER 4
ELVO-FMA ANALYTICAL FRAMEWORK PROPOSAL
The proposed Electronic Voice-Frequency Modification Analysis (ElVo-FMA) framework provides scientific, technical, and legal professionals a clear analytical pathway for the assessment of the maturity level of tests, experiments, and methods for FSID. It provides “big picture” context for the specific tasks along the path from having evidence to making a forensic determination. Key benefits of ElVo-FMA include:
• A unifying framework for researchers and scientists to place tests and experiments within the proper context of related, and possibly interdependent, efforts within their communities.
• A workflow that guides investigators and forensic professionals through the process of determining the utility of voice evidence in a specific case. Specifically, it should help assess whether the questioned voice recordings can be used directly for forensic SID and presented to the courts, or only as an aid to help sort through other pertinent case evidence.
• A clear progression path for assessing the maturity of technical methods as viable forensic methods.
ElVo-FMA is focused on answering four key questions and, depending on a binary “Yes/No” answer to each one, guides the user towards the next logical step. An answer of “Do not know” or “Inconclusive” provides a natural stopping point in the workflow and highlights a potential gap in current capabilities and a potential area of needed research.
37


Figure 26 ElVo-FMA Flowchart Diagram
38


4.1 Obtaining Speech Evidence
The ElVo-FMA process begins when voice evidence (in the case of an investigation) or voice test files (in the case of research) have been obtained. In the case of voice evidence it is imperative that this be done in a forensically sound manner. In today’s digital environment this means that it should be treated and handled the same way as any other digital evidence, and thus it is appropriate to apply comparable computer forensics methods. Proper documentation is necessary in order to comply with “chain of custody” requirements. Assurance of “non-contamination”, such as the use of write-blocking technology, hash calculations and safe archival of original files, should be observed. At this stage we assign the voice evidence to “Stage 0”, which means that no ElVo-FMA analysis has yet begun. For this proposed framework the voice evidence can be considered to be in its raw stage, ready to begin analysis.
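As a minimal sketch of the non-contamination step mentioned above, the hash of every original evidence file can be computed and recorded in the case notes before any analysis begins; the file name used below is a placeholder.

import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example (placeholder file name): record this value in the chain-of-custody documentation.
# print(sha256_of_file("questioned_recording.wav"))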
4.2 ElVo-FMA Key Assessments - Four Questions
Once the evidence file(s) have been obtained in a forensically sound manner, the evaluation can begin. ElVo-FMA's workflow is directed by the answers to four key questions (a minimal sketch of the resulting decision flow follows the list):
• Has electronic frequency modification been applied to the voice evidence?
• Is the method of electronic voice modification known?
• Are the settings/values of the electronic modification method known?
• Is the electronic modification reversible from a forensic SID perspective?
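A minimal sketch of how these four questions drive the workflow, using a simple three-way answer type and hypothetical disposition labels; the authoritative definition is the flowchart in Figure 26, and this code only mirrors its branching logic:

    from enum import Enum

    class Answer(Enum):
        YES = "yes"
        NO = "no"
        INCONCLUSIVE = "inconclusive"   # natural stopping point; flags a capability gap

    def elvo_fma(modified: Answer, method_known: Answer,
                 settings_known: Answer, reversible: Answer) -> str:
        """Walk the four ElVo-FMA questions and return a recommended disposition."""
        if modified is Answer.INCONCLUSIVE:
            return "stop: research gap (cannot establish whether modification occurred)"
        if modified is Answer.NO:
            return "conventional forensic SID process (Daubert/Frye-compliant)"
        if method_known is not Answer.YES:
            return "do not present via conventional SID; add to modified-voice corpus"
        if settings_known is not Answer.YES:
            return "do not present via conventional SID; add to modified-voice corpus"
        if reversible is Answer.YES:
            return "pre-process (reverse the modification), then conventional SID"
        return "consider a modified SID process, or add to modified-voice corpus"

    # Example: modification detected, method and settings known, but not reversible.
    print(elvo_fma(Answer.YES, Answer.YES, Answer.YES, Answer.NO))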
4.3 ElVo-FMA Stage 1 - Has Electronic Frequency Modification Occurred?
Determining whether electronic modification has occurred can be more complex than it first appears. A multimodal approach is often more effective and robust. In an investigation, a determination can be reached by a combination of methods, including investigative leads, acoustic methods, and technical analyses.
Figure 27 ElVo-FMA Stage 1 Flowchart Diagram
Investigative leads and resources can be used to determine if electronic frequency modification has been used. These can include:
• Confession by the suspect or person of interest
• Corroboration or confession by an associate of the suspect or person of interest
• Written documentation stating the use of electronic frequency modification techniques
• Intercepted communications where frequency modification has been stated or suggested
• Seized pertinent hardware and/or software by law enforcement authorities
Aural methods are the most natural for human beings and remain the most common means today of determining whether electronic frequency modification has occurred. These can include:
• Aural analysis by naive listeners: most naive listeners can only detect frequency modification when it is extreme and obvious, as in "chipmunk-like" or "deep monster-like" effects; slight modifications may be extremely difficult to detect. Naive listeners are also highly susceptible to biases and cognitive limitations.
• Aural analysis by forensic experts, which may include a variety of approaches such as linguistic, language, and acoustic techniques.
Technical methods use a variety of signal processing, statistical, and forensic techniques (a minimal screening sketch follows this list). These may include:
• Forensic analysis of file metadata [43]: signature of software or methods known to modify the voice frequencies
• Signal Processing Methods: Voice analysis based on a variety of acoustical measurements
• Signal Processing Methods: Acoustic voice and environment analysis
• Statistical analyses including Machine Learning algorithms
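As an illustration of the signal-processing bullet above, a minimal F0-screening sketch using only NumPy; the autocorrelation-based pitch estimator, the synthetic test signal, and the "typical adult" F0 ranges quoted in the comment are illustrative assumptions, not part of the framework, and an out-of-range F0 is at most a weak indicator that frequency modification may have occurred:

    import numpy as np

    def estimate_f0(frame: np.ndarray, sr: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
        """Crude autocorrelation-based F0 estimate for one voiced frame (Hz)."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)      # candidate lag range
        lag = lo + int(np.argmax(ac[lo:hi]))
        return sr / lag

    def median_f0(signal: np.ndarray, sr: int, frame_ms: float = 40.0) -> float:
        """Median F0 over fixed-length frames with a simple energy-based voicing gate."""
        n = int(sr * frame_ms / 1000)
        frames = [signal[i:i + n] for i in range(0, len(signal) - n, n)]
        energies = np.array([np.sum(f ** 2) for f in frames])
        voiced = [f for f, e in zip(frames, energies) if e > 0.1 * np.median(energies)]
        return float(np.median([estimate_f0(f, sr) for f in voiced]))

    # Example with a synthetic 120 Hz "voice" at 8 kHz; a real screening check would
    # load the questioned recording and compare its median F0 against the ranges
    # expected for the claimed speaker (roughly 85-180 Hz for adult male speech,
    # 165-255 Hz for adult female speech).
    sr = 8000
    t = np.arange(sr * 2) / sr
    fake_voice = np.sin(2 * np.pi * 120 * t)
    print(round(median_f0(fake_voice, sr), 1))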
The first decision point is based on a “Yes” or “No” binary decision. If the determination of frequency modification is “No” then the voice evidence can proceed to be evaluated, analyzed and interpreted by a conventional forensic SID process. This should be implemented in a manner that has been accepted by the courts, satisfying Daubert, Frye, or other accepted legal standards.
If the determination of frequency modification is “Yes” then the voice evidence can proceed to enter the second phase of the framework.
4.4 ElVo-FMA Stage 2 - Is the Method Known?
Investigative leads, acoustic methods, and technical analyses can also be used to assess if the modification method is, or can be, known.
Figure 28 ElVo-FMA Stage 2 Flowchart Diagram
Investigative leads and resources can include:
• Confession by the suspect or person of interest
• Corroboration or confession by an associate of the suspect or person of interest
• Written documentation stating the use of electronic frequency modification techniques
• Intercepted communications where frequency modification has been stated or suggested
• Seized pertinent hardware and/or software by law enforcement authorities
Aural methods can be useful but may be limited to specialized forensic experts. Identifying the type of frequency modification method will most probably be beyond the capability of most naive listeners. The forensic expert should have training and proven proficiency to make basic assessments, such as recognizing fundamental frequency modification.
Technical methods for identifying the modification method are an area of much interest to researchers today (a minimal metadata-inspection sketch follows this list) and can include:
• Forensic analysis of file metadata: signature of software or methods known to modify the voice frequencies
• Signal Processing Methods: Voice analysis based on a variety of acoustical measurements
• Signal Processing Methods: Acoustic voice and environment analysis
• Statistical analyses including Machine Learning algorithms
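As one illustration of the file-metadata bullet above, a minimal sketch that scans a file's raw bytes for plain-text encoder/software tags; the tag list and file name are hypothetical examples, and real metadata analysis, as in [43], is far more thorough and format-aware:

    # Naive scan of a media file's raw bytes for software signatures sometimes left
    # behind by editing tools. Absence of a tag proves nothing; presence is only an
    # investigative lead to be confirmed by proper metadata and structure analysis.
    KNOWN_TAGS = [b"Adobe Audition", b"LAME", b"Audacity", b"ffmpeg"]  # illustrative list

    def find_software_tags(path: str) -> list[str]:
        with open(path, "rb") as f:
            data = f.read()
        return [tag.decode() for tag in KNOWN_TAGS if tag in data]

    print(find_software_tags("questioned_recording.wav"))  # hypothetical file name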
The second decision point is also binary: "Yes" or "No". If the method of frequency modification cannot be identified ("No"), then the voice evidence should not be presented to the courts as part of a conventional SID process. This does not mean, however, that the usefulness of the voice evidence is gone. It would be very beneficial to add it to an "Electronically Modified Voice" corpus and set it aside for further technical analysis. While the use of such evidence for scientific research may vary from jurisdiction to jurisdiction, the establishment and care of such a corpus can be highly beneficial for the development and discovery of future methods.
If the identification of the method of frequency modification is “Yes” then the voice evidence can proceed to enter the third phase of the framework.
4.5 ElVo-FMA Stage 3 - Are the Settings/Values of the Modification Parameters Known?
Investigative leads, acoustic methods, and technical analyses can be used to assess if the modification settings/values are, or can be, known. These depend on the implementation, ranging from many adjustable parameters in more sophisticated methods to simple presets in others.
Figure 29 ElVo-FMA Stage 3 Flowchart Diagram
Investigative leads and resources can include:
• Confession by the suspect or person of interest
• Corroboration or confession by an associate of the suspect or person of interest
• Written documentation stating the settings used for electronic frequency modification
• Intercepted communications where the settings for frequency modification have been stated or suggested
• Seized pertinent hardware and/or software by law enforcement authorities with the settings still preserved
Aural methods can be useful but may be limited to specialized forensic experts. The identification of setting values will most probably be beyond the capability of most naive and expert listeners.
Technical methods for identifying the modification settings are an area of much interest to researchers today (a short worked example follows this list) and can include:
• Forensic analysis of file metadata: signature of software or methods known to modify the voice frequencies
• Signal Processing Methods: Voice analysis based on a variety of acoustical measurements
• Signal Processing Methods: Acoustic voice and environment analysis
• Statistical analyses including Machine Learning algorithms
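For example, if Stage 2 establishes that the modification was a uniform pitch/frequency shift, the applied setting can be estimated by comparing a frequency measurement from the questioned recording (f_mod) with the corresponding measurement from reference material of the same speaker (f_ref); the shift in semi-tones is then n = 12 * log2(f_mod / f_ref). A worked example with hypothetical values: 12 * log2(100 Hz / 119 Hz) ≈ -3.0, i.e., roughly a three semi-tone downward shift. This is standard pitch arithmetic offered for illustration, not a procedure prescribed by ElVo-FMA.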
The third decision point is also binary: "Yes" or "No". If the settings/values of the frequency modification cannot be identified ("No"), then the voice evidence should not be presented to the courts as part of a conventional SID process. Just as in the previous step (ElVo-FMA Stage 2), it would be very beneficial to add it to an "Electronically Modified Voice" corpus and set it aside for future scientific analysis.
If the settings/values of the frequency modification are identified ("Yes"), then the voice evidence can proceed to the fourth phase of the framework.
4.6 ElVo-FMA Stage 4 - Is the Modification Reversible?
Reversing the effect of electronic frequency modification could be considered the ultimate forensic SID goal. By "reversible" we mean that the voice evidence can be pre-processed in a manner that allows it to then be input into a conventional forensic SID process and evaluated against a corpus of unmodified suspect and general-population recordings. This capability would counter attempts by criminals to disguise their voices and eliminate another technique used to elude identification.
At the time of writing, the literature search does not indicate that this is currently possible for most, if not all, frequency modification techniques. If such methods were to become available, they would need to meet scientific and technical standards, as well as legal standards such as Daubert/Frye, to be acceptable in a court of law. As a result, the question "Is the process reversible?" highlights ElVo-FMA's effectiveness in identifying possible avenues of research for addressing this capability gap within the forensic SID community in the search for an acceptable forensic solution.
Figure 30 ElVo-FMA Stage 4 Flowchart “No” Branch Diagram
As with the previous key decision points, this is a binary "Yes" or "No" decision. If the answer to the "reversibility" question is "No", this leads to another question: can a modified SID process be applied that would be scientifically and legally sound? Another paradigm could open for forensic SID. What if there is sufficient inter/intra-speaker variability definition not only in the evidence voices, but also in a corpus of suspect and general-population recordings that have been modified with the same process? Could it be feasible to apply a modified SID process, under the same currently accepted Bayesian philosophy, to conduct a rigorous assessment of "Match/No-Match"? This ElVo-FMA stage highlights a potentially new and exciting field of research. A "Yes" to a modified forensic SID process would be of much interest to the community. On the other hand, a "No" would result in the recommended path of adding the voice data to a modified-voice corpus for further scientific or technical analysis.
Answering "Yes" to the "reversibility" question also leads to a potentially exciting path. What are these techniques? Which methods are the most promising?
Figure 31 ElVo-FMA Stage 4 Flowchart “Yes” Branch Diagram
The application of a conventional forensic SID method to "converted" voice data must satisfy scientific, technical, and legal standards. At the time of writing, the literature search suggests this area of endeavor is in its infancy.
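Purely to illustrate the concept of a candidate pre-processing step, and not as a validated forensic reversal (as noted earlier in this section, the literature does not indicate that such inversion reliably restores SID-relevant features), a sketch assuming Stages 2 and 3 established a uniform shift of a known number of semi-tones; librosa's general-purpose pitch shifter is used here as a stand-in for the unknown inverse of the PSOLA-based modification:

    import librosa
    import soundfile as sf

    def counter_shift(in_path: str, out_path: str, applied_semitones: float) -> None:
        """Apply the opposite pitch shift to a questioned recording.

        applied_semitones is the shift established in ElVo-FMA Stages 2-3
        (e.g. -3.0 if the voice was lowered by three semi-tones). This is only a
        conceptual illustration; it does not claim to restore SID-relevant features.
        """
        y, sr = librosa.load(in_path, sr=None)  # keep the native sample rate
        y_rev = librosa.effects.pitch_shift(y, sr=sr, n_steps=-applied_semitones)
        sf.write(out_path, y_rev, sr)

    # Hypothetical file names, for illustration only.
    counter_shift("FZ_GPS_February_15th_8kHz_Dn3_Semi.wav", "candidate_reversed.wav", -3.0)

Any such "converted" output would still have to pass the scientific, technical, and legal scrutiny described above before it could feed a conventional SID process.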
CHAPTER 5
PATH FORWARD
The recommendations in this section are organized into two groups: those directly tied to the research topic, and deficiencies discovered during this endeavor.
5.1 Principal Recommendations
The ElVo-FMA framework should be disseminated throughout the academic and forensic communities for adoption. It provides a clear workflow that promotes holistic, multidisciplinary collaboration. It enables clear communication that will facilitate technological advances as well as accelerate compliance with the US FRE and the Daubert standard.
Another recommendation is to develop a taxonomy of electronic frequency/pitch modification methods in which hardware and software tools can be mapped, classified, and disseminated. This would be most beneficial for law enforcement and investigative partners. Such a taxonomy can be leveraged to map out each method and, if possible, a counter-method for analyzing and neutralizing voice disguise tools that use electronic frequency/pitch modification.
A third recommendation is to conduct research on how to optimize feature extraction methods, such as MFCC and FPC, for frequency/pitch-modified speech. The results of this research concur with the literature: the reliability of FSID tools degrades rapidly in the presence of frequency-modified speech, which can therefore be considered an effective anti-forensic method today.
5.2 Deficiencies
During this research, deficiencies were identified in current practices that merit revisiting. These include:
• Explore expanding the voice channel beyond the 4 kHz bandwidth paradigm. Today there are many options for transferring voice through channels with much wider bandwidth, yet most, if not all, FASR systems can only process 4 kHz-wide speech signals. While this is understandable when dealing with data that has travelled through telephone telecommunications infrastructure, the landscape of audio communications is changing at an accelerated pace.
• Increase research and testing with female voices. The vast majority of published research focuses on male voices; a more balanced corpus and research focus is needed.
BIBLIOGRAPHY
[1] A. Drygajlo, “Chapter 2: Automatic Speaker Recognition for Forensic Case Assessment and Interpretation,” in Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism, First., A. Neustein and H. A. Patil, Eds. New York, NY: Springer, 2012, pp. 21-39.
[2] H. Hollien, Forensic Voice Identification, First. London: ACADEMIC PRESS, 2002.
[3] D. L. Faigman, D. H. Kaye, M. J. Saks, and J. Sanders, “Modern Scientific Evidence -The Law and Science of Expert Testimony Volume 3,” in Modern Scientific Evidence - The Law and Science of Expert Testimony Volume 3, W. P. Co, Ed. St. Paul, 2002, p. 46.
[4] P. Rose, Forensic Speaker Identification, First. New York, NY: Taylor & Francis Inc., 2002.
[5] W. J. Hardcastle, J. Laver, and F. E. Gibbon, Eds., The Handbook of Phonetic Sciences, Second. Malden: Wiley-Blackwell, 2013.
[6] A. Alexander, “Forensic automatic speaker recognition using Bayesian interpretation and statistical compensation for mismatched conditions,” ECOLE POLYTECHNIQUE FEDERALE DE LAUSANNE, 2005.
[7] J. H. L. Hansen and T. Hasan, “Speaker Recognition by Machines and Humans: A tutorial review,” IEEE Signal Process. Mag., vol. 32, no. 6, pp. 74-99, Nov. 2015.
[8] C. (NCMF) Grigoras, “Forensic audio analysis,” 2013.
[9] G. Senthil Raja and S. Dandapat, "Speaker recognition under stressed condition," Int. J. Speech Technol., vol. 13, no. 3, pp. 141-161, 2010.
[10] "Genesis 27:22; And Jacob went near unto Isaac his father; and he felt him, and said, The voice is Jacob's voice, but the hands are the hands of Esau." [Online]. Available: https://www.bible.com/bible/1/GEN.27.22. [Accessed: 03-Apr-2017].
[11] E. Brixen, "Digitally Disguised Voices," Audio Engineering Society 39th International Conference, 2010. [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=15485.
[12] E. Gold and V. Hughes, "Issues and opportunities: The application of the numerical likelihood ratio framework to forensic speaker comparison," Sci. Justice, vol. 54, no. 4, pp. 292-299, 2014.
[13] K. Amino, T. Osanai, T. Kamada, and H. Makinae, “Historical and Procedural Overview of Forensic Speaker Recognition as a Science,” in Forensic Speaker Recognition, First., A. Neustein and H. A. Patil, Eds. Springer, 2012, pp. 3-20.
[14] F. Nolan and C. Grigoras, "A case for formant analysis in forensic speaker identification," Int. J. Speech Lang. Law, vol. 12, no. 2, pp. 143-173, 2005.
[15] C. G. G. Aitken, F. Taroni, and A. Biedermann, "Statistical Interpretation of Evidence: Bayesian Analysis," in Encyclopedia of Forensic Sciences, 2nd ed., Elsevier Ltd., 2013, pp. 292-297.
[16] G. S. Morrison, “Forensic voice comparison and the paradigm shift,” Sci. Justice, vol. 49, no. 4, pp. 298-308, 2009.
[17] G. S. Morrison, F. H. Sahito, G. Jardine, D. Djokic, S. Clavet, S. Berghs, and C. G. Dorny, "INTERPOL survey of the use of speaker identification by law enforcement agencies," Forensic Sci. Int., pp. 92-100, 2016.
[18] Federal Rules of Evidence. The Committee on the Judiciary, House of Representatives, 2010, pp. 1-41.
[19] M. S. Kaufman (Atlantic Legal Foundation), "The Status of Daubert in State Courts," 2006.
[20] Tribunal Supremo de Puerto Rico, Reglas de Evidencia de Puerto Rico. 2013, pp. 1-91.
[21] S. L. Townes (United States District Judge), Government request to introduce audio evidence approved. 2015, pp. 1324-1327.
[22] "Three Members of al Shabaab Plead Guilty to Conspiring to Provide Material Support to the Terrorist Organization — FBI." [Online]. Available: https://www.fbi.gov/contact-us/field-offices/newyork/news/press-releases/three-members-of-al-shabaab-plead-guilty-to-conspiring-to-provide-material-support-to-the-terrorist-organization. [Accessed: 05-Apr-2017].
[23] J. S. Smith and D. Stern, "United States v. Ahmed," 2015.
[24] I. McLoughlin, Applied Speech and Audio Processing with MATLAB Examples, First. New York, NY: Cambridge University Press, 2009.
[25] C. Kirchhübel and D. M. Howard, "Detecting suspicious behaviour using speech: Acoustic correlates of deceptive speech - An exploratory investigation," Appl. Ergon., vol. 44, no. 5, pp. 694-702, 2013.
[26] C. Kirchhübel, "The acoustic and temporal characteristics of deceptive speech," 2013.
[27] L. M. Solan and P. M. Tiersma, "Hearing voices: Speaker identification in court," Hastings Law J., vol. 54, no. 2, p. 373, 2003.
[28] R. W. Rhodes, "Assessing the Strength of Evidence," 2013.
[29] C. Zhang and T. Tan, "Voice disguise and automatic speaker recognition," Forensic Sci. Int., vol. 175, no. 2-3, pp. 118-122, 2008.
[30] T. Tan, "The effect of voice disguise on automatic speaker recognition," Proc. 2010 3rd Int. Congr. Image Signal Process. (CISP 2010), vol. 8, pp. 3538-3541, 2010.
[31] "The Tsinghua University - d-Ear Technologies Joint Laboratory for Voiceprint Processing Established." [Online]. Available: http://www.d-ear.com/english1/newsview.asp?id=251. [Accessed: 09-Apr-2017].
[32] S. S. Kajarekar, H. Bratt, E. Shriberg, and R. De Leon, "A study of intentional voice modifications for evading automatic speaker recognition," in IEEE Odyssey 2006: Workshop on Speaker and Language Recognition, 2006, pp. 1-6.
[33] M. Colosimo and M. Peterson, "State of the Art Biometrics Excellence Roadmap - Technology Assessment: Volume 3 (of 3)," vol. 3, pp. 1-87, 2009.
[34] J. Clark and P. Foulkes, "Identification of familiar voices in disguised speech," vol. 14, 2006.
[35] P. Perrot and G. Chollet, Forensic Speaker Recognition. 2012.
[36] P. Perrot, G. Aversano, and G. Chollet, "Voice Disguise and Automatic Detection: Review and Perspectives," Lect. Notes Comput. Sci., pp. 101-117, 2007.
[37] E. Gold, P. French, and P. Harrison, "Examining long-term formant distributions as a discriminant in forensic speaker comparisons under a likelihood ratio framework," vol. 19, 060041, 2013.
[38] P. T. Harrison, "Making Accurate Formant Measurements: An Empirical Investigation of the Influence of the Measurement Tool, Analysis Settings and Speaker on Formant Measurements," 2013.
[39] J. H. Esling, "Phonetic Notation," in The Handbook of Phonetic Sciences, Second., W. J. Hardcastle, J. Laver, and F. E. Gibbon, Eds. Chichester, England: Wiley-Blackwell, 2013, p. 870.
[40] W. Styler, "Using Praat for Linguistic Research," pp. 1-70, 2013.
[41] "WaveSurfer User Manual." [Online]. Available: https://www.speech.kth.se/wavesurfer/man.html. [Accessed: 04-Apr-2017].
[42] C. Grigoras, "Catalina Toolbox User's Manual," 2007.
[43] B. E. Koenig and D. S. Lacey, "Forensic Authentication of Digital Audio and Video Files," in Handbook of Digital Forensics of Multimedia Data and Devices, First., A. T. S. Ho and S. Li, Eds. Chichester, England: Wiley - IEEE Press, 2015, pp. 133-181.
APPENDIX
CASE REPORT EXEMPLAR
National Center for Media Forensics
Center of Excellence for Multimedia Forensics
EXPERT OPINION
Case name: Electronic Freq Change Exemplar
Case description: Impact of frequency modification on FZ voice sample
Suspected speaker: FZ
Description of the act
Evaluate the impact of frequency/pitch modification on automated forensic speaker matching analysis. An unmodified sample together with modified samples in discrete semi-tone steps are evaluated against a population database.
Questioned recordings
Name Type SNR Speech length Audio length
FZ_GPS_February_15th_8kHz.wav            wav  35  00:02:50.139  00:03:33.689
FZ_GPS_February_15th_8kHz_Dn1_Semi.wav   wav  74  00:02:36.569  00:03:33.689
FZ_GPS_February_15th_8kHz_Dn2_Semi.wav   wav  72  00:02:35.879  00:03:33.689
FZ_GPS_February_15th_8kHz_Dn3_Semi.wav   wav  72  00:02:35.689  00:03:33.689
FZ_GPS_February_15th_8kHz_Dn4_Semi.wav   wav  71  00:02:33.750  00:03:33.689
FZ_GPS_February_15th_8kHz_Dn5_Semi.wav   wav  72  00:02:25.379  00:03:33.689
FZ_GPS_February_15th_8kHz_Dn6_Semi.wav   wav  73  00:02:14.509  00:03:33.689
FZ_GPS_February_15th_8kHz_Up1_Semi.wav   wav  31  00:02:48.969  00:03:33.689
FZ_GPS_February_15th_8kHz_Up2_Semi.wav   wav  34  00:02:48.650  00:03:33.689
FZ_GPS_February_15th_8kHz_Up3_Semi.wav   wav  31  00:02:46.889  00:03:33.689
FZ_GPS_February_15th_8kHz_Up4_Semi.wav   wav  31  00:02:46.689  00:03:33.689
FZ_GPS_February_15th_8kHz_Up5_Semi.wav   wav  31  00:02:44.810  00:03:33.689
FZ_GPS_February_15th_8kHz_Up6_Semi.wav   wav  34  00:02:41.860  00:03:33.689
Suspected reference speaker's recordings
Name Type SNR Speech length Audio length
FZ1b.wav  wav  84   00:01:11.799  00:01:30.000
FZ1c.wav  wav  100  00:01:04.659  00:01:25.000
FZ1a.wav  wav  100  00:00:46.590  00:01:00.000
Population database
Name: EB English-Male
Description: Male Population Set created by EB
Recording count: 15
Total speech length: 00:40:54.500
Min. speech length: 00:01:04.689
Max. speech length: 00:04:02.770
Average speech length: 00:02:43.629
Name Type SNR Speech length Audio length
GPS May 3rd Male3.wav     wav  45  00:04:02.770  00:05:02.269
GPS May 3rd Male1.wav     wav  27  00:03:25.909  00:04:51.870
GPS April 26th Male2.wav  wav  36  00:03:40.340  00:04:47.649
GPS June 7th Male2.wav    wav  28  00:03:29.349  00:04:18.509
GPS May 10th Male3.wav    wav  31  00:03:17.199  00:03:51.189
GPS April 26th Male3.wav  wav  40  00:02:59.419  00:03:51.120
GPS May 31st Male1.wav    wav  27  00:02:55.710  00:03:44.689
GPS May 31st Male2.wav    wav  36  00:02:39.449  00:03:42.539
GPS May 10th Male1.wav    wav  30  00:02:36.189  00:03:25.000
GPS May 3rd Male2.wav     wav  27  00:02:31.449  00:03:23.469
GPS June 7th Male3.wav    wav  24  00:02:26.180  00:02:52.490
GPS May 24th Male2.wav    wav  31  00:02:16.710  00:02:50.500
GPS May 24th Male3.wav    wav  43  00:02:13.210  00:02:49.090
GPS June 7th Male1.wav    wav  34  00:01:15.920  00:01:32.579
GPS May 24th Male1.wav    wav  35  00:01:04.689  00:01:22.489
Results
Name LR LLR Verbal ratio
FZ_GPS_February_15th_8kHz.wav           3.41e+09   9.53   Strong evidence to support prosecution hypothesis
FZ_GPS_February_15th_8kHz_Dn1_Semi.wav  3.41e+09   9.53   Strong evidence to support prosecution hypothesis
FZ_GPS_February_15th_8kHz_Dn2_Semi.wav  5.25e-24   -23.3  Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Dn3_Semi.wav  5.08e-24   -23.3  Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Dn4_Semi.wav  7.23e-114  -113   Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Dn5_Semi.wav  2.96e-153  -153   Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Dn6_Semi.wav  1.21e-104  -104   Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Up1_Semi.wav  5.65e-33   -32.2  Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Up2_Semi.wav  1.61e-69   -68.8  Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Up3_Semi.wav  1.16e-124  -124   Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Up4_Semi.wav  2.92e-124  -124   Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Up5_Semi.wav  6.51e-143  -142   Strong evidence against prosecution hypothesis
FZ_GPS_February_15th_8kHz_Up6_Semi.wav  2.48e-176  -176   Strong evidence against prosecution hypothesis
Probability density function graphs
FZ_GPS_February_15th_8kHz.wav            E: 16.96,  LR: 3.414e+09,  LLR: 9.533
FZ_GPS_February_15th_8kHz_Dn1_Semi.wav   E: 16.61,  LR: 3.414e+09,  LLR: 9.533
FZ_GPS_February_15th_8kHz_Dn2_Semi.wav   E: 3.029,  LR: 5.25e-24,   LLR: -23.28
FZ_GPS_February_15th_8kHz_Dn3_Semi.wav   E: 3.057,  LR: 5.084e-24,  LLR: -23.29
FZ_GPS_February_15th_8kHz_Dn4_Semi.wav   E: -12.23, LR: 7.234e-114, LLR: -113.1
FZ_GPS_February_15th_8kHz_Dn5_Semi.wav   E: -17.00, LR: 2.955e-153, LLR: -152.5
FZ_GPS_February_15th_8kHz_Dn6_Semi.wav   E: -11.34, LR: 1.211e-104, LLR: -103.9
FZ_GPS_February_15th_8kHz_Up1_Semi.wav   E: 1.066,  LR: 5.653e-33,  LLR: -32.25
FZ_GPS_February_15th_8kHz_Up2_Semi.wav   E: -5.764, LR: 1.607e-69,  LLR: -68.79
FZ_GPS_February_15th_8kHz_Up3_Semi.wav   E: -13.62, LR: 1.158e-124, LLR: -123.9
FZ_GPS_February_15th_8kHz_Up4_Semi.wav   E: -13.57, LR: 2.924e-124, LLR: -123.5
FZ_GPS_February_15th_8kHz_Up5_Semi.wav   E: -15.63, LR: 6.509e-143, LLR: -142.2
FZ_GPS_February_15th_8kHz_Up6_Semi.wav   E: -19.55, LR: 2.48e-176,  LLR: -175.6
(Each graph shows the non-target and target score histograms with the H0 and H1 probability density curves; graphs not reproduced here.)
Scale of the probability
LR from   LR to       Verbal ratio
0         0.001       Strong evidence against prosecution hypothesis
0.001     0.01        Moderately strong evidence against prosecution hypothesis
0.01      0.1         Moderate evidence against prosecution hypothesis
0.1       1           Limited evidence against prosecution hypothesis
1         10          Limited evidence to support prosecution hypothesis
10        100         Moderate evidence to support prosecution hypothesis
100       1000        Moderately strong evidence to support prosecution hypothesis
1000      +infinity   Strong evidence to support prosecution hypothesis
Conclusion
Expert opinion elaborated by: Eliud Bonilla
THE INTERNATIONAL PHONETIC ALPHABET (revised to 2015), © 2015 IPA
[IPA chart reproduced in the original appendix; sections: Consonants (Pulmonic), Consonants (Non-Pulmonic), Vowels, Other Symbols, Diacritics, Suprasegmentals, and Tones and Word Accents. Symbol tables not reproduced here.]
INTERPOL SURVEY OF THE USE OF SPEAKER IDENTIFICATION BY LAW ENFORCEMENT AGENCIES
Geoffrey Stewart Morrison, Farhan Hyder Sahito, Gaëlle Jardine, Djordje Djokic, Sophie Clavet, Sabine Berghs, Caroline Goemans Dorny (Office of Legal Affairs, INTERPOL General Secretariat)
METHODOLOGY: Questionnaire in English, French, Spanish, and Arabic circulated to law enforcement agencies in the 190 INTERPOL member countries.
RESULTS: 91 responses from 69 countries; 44 respondents had speaker identification capabilities, in-house or via external laboratories.
[Poster graphics summarizing the named forensic voice comparison and automatic speaker recognition systems in use (e.g., BATVOX by Agnitio, Phonexia Speaker Identification), approaches to speaker identification, and frameworks for reporting conclusions are not reproduced here.]
Full Text

PAGE 1

PROPOSED ANALYTICAL FRAMEWORK FOR ELECTRONICALLY FREQUENCY MODIFIED VOICES by ELIUD BONILLA B.S., University of Puerto Rico Mayagüez, 1998 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Recording Arts Program 2017

PAGE 2

ii © 2017 ELIUD BONILLA ALL RIGHTS RESERVED

PAGE 3

iii This thesis for the Master of Science degree by Eliud Bonilla has been approved for the Recording Arts Program by Catalin Grigoras, Chair Jeff M. Smith Jason R. Lewis Date: May 13 , 201 7

PAGE 4

iv Bonilla , Eliud (M.S., Recording Arts Program ) Proposed Analytical Framework for Electronically Frequency Modified Voices Thesis directed by Assistant Profes sor Catalin Grigoras ABSTRACT The concept of the human voice as a reliable biometric signal is rapidly being accepted and of the art call centers are increasingly incorporating automated speaker recognition (ASR) tech nologies in an effort to enhance customer service and minimize identity theft. In the forensics realm, ASR has been accepted in many European courts and may also be accepted in US federal courts in the not too distant future. However ASR is relatively frag ile due to its dependency on the frequency dimension of the voice while not incorporating higher layers of information such as prosody and accents. It can be degraded by a number of inter/intra speaker characteristics, in addition to multiple variables alo ng the signal chain. Purposely modifying the frequency/pitch of a voice by electronic means is an effective counter forensic measure. It is most often implemented to mask the identity of an individual. Its use can be considered legitimate when used to prot ect the identity of a witness in a television interview or as part of a law enforcement investigation. However it can also be used to protect the identity of individuals committing crimes ranging from classic scenarios such as phone calls for ransom reques ts to recording video/audio messages inciting violence. Electronic frequency/pitch modification impacts various common forensic analysis methods. Vowel spaces are distorted thus neutralizing their use by phoneticians. Moderate changes degrade likelihood ra tio (LR) scores in ASR systems while aggressive changes induce additional errors by distorting the format frequency relationships outside of normal expected ranges. The proposed Electronic Voice Frequency Modification Analysis (ElVo FMA) framework covers k ey concepts, questions and procedures to enable practical real world forensic speaker comparison and identification efforts.

PAGE 5

v The form and content of this abstract are approved. I recommend its publication. Approved: Catalin Grigoras

PAGE 6

vi DED ICATION I wish to dedicate this to my family who mean so much to me . To my parents Rafael and Lucy who instilled in me the values of honest hard work and the love of learning . To my little sister r. To Ana and Alex who m I wish all the happiness and love in the world . And last, but not least, to my wife Kathleen whose love, support and patience allowed me to complete this endeavor.

PAGE 7

vii ACKNOWLEDGEMENT S I want to thank my advisor Catalin Grigoras for al l of his support during the academic program and thesis process , passionate teaching, stimulating discussion s , professional respect and friendship . I also want to thank Jeff Smith for his enthusiastic sharing of knowledge and support. I want to thank Leah Haloin for helping keep me on schedule with all the logistics for class, registration, thesis , and graduation paperwork. I also want to thank my fellow cohort students who made the courses, labs, and the whole program so stimulating and enjoyable.

PAGE 8

viii TABLE OF CONTENTS CHAPTER I. INTRODUCTION ................................ ................................ ................................ ................................ .... 1 1.1 Voice as a Biometric Signal ................................ ................................ ................................ ....... 1 1.2 Forensic Automatic Speaker Recognition ................................ ................................ .................. 5 1.4 United States Law ................................ ................................ ................................ ...................... 8 1.5 Electronic Frequency/Pitch Changers ................................ ................................ ...................... 11 1.6 Scope ................................ ................................ ................................ ................................ ........ 12 1.7 Motivation and Goals ................................ ................................ ................................ ............... 13 II. LITERATURE REVIEW ................................ ................................ ................................ ....................... 15 2.1 Non Electronic Voice Disguise vs Manual SID ................................ ................................ ...... 15 2.2 Non Electronic Voice Disguising vs Automated SID ................................ .............................. 17 2.3 Electronic Voice Disguise vs Manual SID ................................ ................................ ............... 18 2.4 Electronic Voice Disguise vs Automated SID ................................ ................................ ......... 19 III. ELECT RONIC FREQUENCY/PITCH MODIFICATION OF THE VOICE ................................ ....... 20 3.1 Modification of Frequency Bandwidth Content ................................ ................................ ....... 20 3.2 Impact on Formant Base d Analysis ................................ ................................ ......................... 26 3.4 Impact on Automated Speaker Identification ................................ ................................ .......... 32 IV. ELVO FMA ANALYTICAL FRAMEWORK PROPOSAL ................................ ................................ 37 4.1 Obtaining Speech Evidence ................................ ................................ ................................ ..... 39 4.2 ElVo FMA Key Assessments Four Questions ................................ ................................ ...... 39 4.3 ElVo FMA Stage 1 Has Electronic Frequency Modification Occurred? .............................. 39 4.4 ElVo FMA Stage 2 Is the Method Known? ................................ ................................ .......... 42 4.5 ElVo F MA Stage 3 Are the Settings/Values of the Modification Parameters Known? ........ 44 4.6 ElVo FMA Stage 4 Is the Modification Reversible? ................................ ............................. 45 V. PATH FORWARD ................................ ................................ ................................ ................................ 49 5.1 Principal Recommendations ................................ ................................ ................................ .... 49

PAGE 9

ix 5.2 Deficiencies ................................ ................................ ................................ ............................. 49 BIBLIOGRAPHY ................................ ................................ ................................ ................................ ................ 51 APPENDIX ................................ ................................ ................................ ................................ .......................... 56

PAGE 10

x LIST OF TABLES TABLE Table 1 Log Likelihood Ratio scores for male voice samples ................................ ......................... 34

PAGE 11

xi LIST OF FIGURES Figure 1 Sources of variability in speech recordings ................................ ................................ ......... 3 F igure 2 General processing chain for automatic speaker recognition ................................ .......... 6 Figure 3 LR Estimation with probability density functions ................................ ............................. 7 Figure 4 Exemplar of a suggested LR Verbal Scoring Table ................................ .......................... 8 Figure 5 Examples of Voice Changer Options ................................ ................................ ................. 12 Figure 6 Illustrati on of PSOLA algorithm waveform modification ................................ ............. 13 Figure 7 Bandwidth of Male Voice; Original plus lowered by 1, 2 and 4 Semi Tones ................. 21 Figure 8 Bandwidth of Male Voice: Original plus lowered by 1, 2 and 3 Semi Tones ................. 22 Figure 9 Zoom of Bandwidth of Male Voice; Original plus lowered by 1, 2 and 3 Semi Tones . 22 Figure 10 Bandwidth of Male Voice; Original plus raised by 1, 2 and 3 Semi Tones .................. 23 Figure 11 Zoom of Bandwidth of Male Voice; Original plus raised by 1, 2 and 3 Semi Tones .. 24 .......... 25 Figure 13 ................................ ................................ ................................ ................................ ... 25 Figure 14 International Phonetic Association Chart for Vowels ................................ ................... 27 Figure 15 Formant Space Plot Unmodified ................................ ................................ ................... 27 Figure 16 Format Space Plot Reduced Centroid View ................................ ................................ 28 Figure 17 Formant Space Plot Unmodified Male Voice ................................ ............................... 29 Figure 18 Formant Space Plot Male voice lowered by three semi tones ................................ .... 29 Figure 19 Formant Frequency View: Unmodified male voice ................................ ........................ 30 Figure 20 Formant Frequency View: Modified male voice down three semi tones ..................... 31 Figure 21 Formant Frequency View: Modified male voice up three semi tones .......................... 31 Figure 22 Default Vowel Frequency Range in Catalina © ................................ ................................ 32 Figure 23 Unmodified Voice Speaker ID Assessment ................................ ................................ . 35

PAGE 12

xii Figure 24 Modified Voice 3 Semi Tones Down Speaker ID Assessment ................................ ... 35 Figure 25 Modified Voice 3 Semi Tones Up Speaker ID Assessment ................................ ........ 36 Figure 26 ElVo FMA Flowchart Diagram ................................ ................................ ....................... 38 Figur e 27 ElVo FMA Stage 1 Flowchart Diagram ................................ ................................ .......... 40 Figure 28 ElVo FMA Stage 2 Flowchart Diagram ................................ ................................ .......... 42 Figure 29 ElVo FMA Stage 3 Flowchart Diagram ................................ ................................ ......... 44 Figure 30 ElVo ................................ .................. 46 Figure 31 ElVo am ................................ ................ 47

PAGE 13

xiii LIST OF ABBREVIATIONS ABBREVIATION 1. AIR Average I dentific ation Rate 2. ALISP Automatic Language Independent Speech Processing 3. ASR Automated Speaker Recognition 4. CRR Correct Recognition Rate 5. DAW Digital Audio Workstat ion 6. ElVo FMA Electronic Voice Frequency Modification Analysis 7. F0 Fundamental Frequency 8. FASR Forensic Automatic Speaker Recognition 9. FRE F ederal Rules of Evidence 10. FSC Forensic Speaker Comparison 11. FSID Forensic Speaker I dentification 12. LLR Logarithmic Likelihood R atio 13. LPC Linear Predictive Coding 14. LR Likelihood R atio 15. MFCC Mel Frequency Cepstral C oefficients 16. PSOLA Pitch Synchronous O verlap and Add 17. RASTA RelAtive SpecTrAl 18. SID Speaker Identification 19. SME Subject Matter Expert 20. VIN Voice Inspector © by Phonexia

PAGE 14

1 CHAPTER 1 INTRODUCTION This chapter will cover some basic concepts necessary to have the proper context and background for understanding the impact of electronic frequency/pitch modification on Forensic Automatic Speaker Recognition and t he need for a coherent analytical framework. 1.1 Voice as a Biometric Signal Biometrics is the science of establishing the identity of individuals based on their biological and behavioral characteristics [1] . Identif ying a person based on their speech can be both extremely effortless as well as extremely difficult. For most people who have normal hearing capabil ities cognitive aural processes are extremely efficient at learning and remembering the sound characteristi cs of people they know [2] . This ability is recognized to the point of having earwitness 1 testimony allowed in military, federal and some state courts [3] . But to do so with scientific rigor, with mathematical analysis in formal settings such as in the field of forensics, identifying individuals by their speech is a very fragile process. Human speech is an activity/performance based signal in contrast to other biological based biometric s such as DNA, fingerprin ts and facial features [4] [5] . It is not known whether or not every person in the wor ld produces utterances which are unique to them and different from those of all other. We do not know if intraspeaker variability is always less than interspeaker variability and if this relationship is true for all situations and under all condition s [2] . The human voice is very flexible and a person is susceptible to many influences that, in turn, will affect the variability of his/her speech recording [6] [7] [8] [9] . (see Figure 1) Some factors include: 1 T estimony based on recall of auditory events, especially spoken messages uttered at the scene of a crime .

PAGE 15

2 Situational task stress : Speaking while doing a task that may require physical or cognitive effort such as lifting, wal king, operating a vehicle. Vocal effort /style : Altering normal speech to account for social or environment such as whispering, shouting, speaking to an audience, singing, or compensating for speaking in the presence of noise (Lombard effect). Emotional sta te : Speaking while experiencing deep emotional states such as sadness, happiness, fear, or anger. Physiological factors : Effects include illness, aging, exhaustion, age, under the influence of medication/alcohol. Interaction based scenarios: Human to Human (one on one or in a group setting), read/scripted vs. spontaneous speech, and Human to Machine scenarios. Circadian rhythm Technological influences including : o Electromechanical : Transmission cha nnels, devices (cellphone, landline, computer), microphones . o Environmental : Background noise, room acoustics, microphone distance/placement . o Data Quality: Sampling rate, audio codec and compression .

PAGE 16

3 Figure 1 Sources of variabili ty in speech recordings In the same manner that we use voice as a biometric identifier in our daily lives it should not come as a surprise that disguising our voices is a natural human counter identification tactic. Disguising a voic e is not a modern phen omenon ; it could be reasonably argued that it probably began in pre historic times as a natural result of the human social evolution. From the earliest writings we find stories with examples of the use of voice disguises. For example in the biblical book o f Genesis, Chapter 27, we read a story about inheritance and deceit. Jacob, aided by his mother Rebekah, tricks his elderly and blind father Isaac into pronouncing the socially significant patriarchal blessing upon him instead of the legitimate heir, his b rother Esau. In the critical moment when Isaac is identifying his son we read:

PAGE 17

4 And Jacob went near unto Isaac his father; and he felt him, and [10] If we analyzed this story as a forensic speaker identification (F SID ) case we could state that Jacob Under normal circumstances Isaac could have been considered a subject matter expe rt ( SME ) in F We can proceed to classify the intent for a person to disguise his/her voice into two general categories: protecting an identity for illegal acts and protecting an identity for legal acts. wide range of activities including phone calls, recordings, or masked individuals: demanding money in a robbery, ex tortion, threats of violence, stating demands in hostage situations, or planning among collaborators. Usually this is perceived as a countermeasure against law enforcement or intelligence ys trying to neutralize the good is a common scenario when a person is an eye witness to a crime but needs to protect its identity from retaliation, like in the case of whistleblowers in workplace or corruption cases; witnesses to a crime while testifying to authorities, or providing an interview to the media. This is usually seen as a countermeasure against criminal or corrupt organizations and a pro soci ety and law enforcement effort.

PAGE 18

5 Electronic frequency voice disguise is rapidly becoming more feasible with technology. Non electronic voice disguise requires ability, focus and stamina from the speaker that is not required when using electronic means. The approach towards applying digital electronic voice frequency/pitch disguise can be divided into two general categories: voices that are clearly identified as disguised by naïve 2 listeners and voices that are disguised in subtle ways in an attempt to n ot b e discovered as manipulated [11] . Voices that are clearly identified as digitally manipulated usually are used in scenarios where it is of the utmost importance to not give up the identity of the speaker. The resulting aural characteristics of the process are usually severe ; pushing the frequency content higher or lower to an unnatural state where it is relatively easy for the vast majority of naïve listeners to kno w the voice is manipulated. On the other hand the manipulations cannot be so severe as to affect the intelligibility of the speech since the message content must be understood by the listener. Voices that are manipulated in a subtle manner have the goal to be imperceptible to aural analysis, maintain enough natural sounding characteristics while modifying the features used in SID methods. This is usually harder to attain due to the need of more sophisticated algorithms and the limitations of audio bandwidth in scenarios such as telephones circuits versus fuller spectrum audio channels. 1.2 Forensic Automatic Speaker Recognition Speaker recognition is the general term used to describe the task of discriminating one person from another based on the sound of th eir voices. But the term itself encompasses two subtasks: speaker identification and speaker validation. For speaker identification the goal is to identify an unknown speaker from a set of known speakers. For speaker verification an unknown speaker is 2 A listener who lack s specialized phonetic or linguistic training .

PAGE 19

6 clai ming an identity (such as in the case of security systems) and the goal is to verify if this claim is true. Unfortunately on the forensic community there is not a general agreement on the use of all these definitions and acronyms and it will be common to h ear/read other terms with similar meaning . As a result it will be common to find similar terminology in the literature such as Automated Speaker Recognition (ASR) although Automated Speech Recognition used the same acronym, Forensic Automati c Speaker Recog nition (FASR), Speaker Identification (SID) a s well as Forensic Speaker Identification (FSID) , and Forensic Speaker Comparison (FSC) [12] . The general process for FASR (see Figure 2) goes from obtaining the speech signal s , extracting features, creating models and then implementing a comparison to the extracted features of a questioned rec interpreted. Feature extraction converts raw spe ech data into feature vectors. The m ost popular methods include Mel frequency cepstral coefficients ( MFCC ) , Relative Spectr a l ( RASTA) [13] , and formant (F 0 F3) extraction [14] . Figure 2 General p rocessing chain for automatic speaker recognition

PAGE 20

7 The comparative analysis and score interpretation is usually based on a Bayesian inference and the LR [15] . LR is defined by a ratio of two conditional probabilities; the probability (p) of the evidence (E) given (|) the prosecution hypothesis ( H p ) to the probability of evidence given the defense hypothesis ( H d ). The odds form o f the LR is defined in the equation below: LR p (E|H p ) p (E| H d ) The probabilistic nature of the LR allows for visualization , thus providing insight in to the probability density functions of intra and inter variability. Figure 3 LR Estimation with probability density functions At the conceptual level LR is an assessment of the similarity between the suspect and offender samples regarding a given parameter and the typicality of those values within a wider, relevant pop ulation [ 16] . The outcome is a value that pivots on one, such that LRs > 1 offer sup port for H p , and LRs <1 offer support for H d . If LR=1 then there is the same probability between H p and

PAGE 21

8 H d . The magnitude of the LR determines how much more likely the evidence woul d be und er the competing hypotheses . The range can be significant, sometimes requiring the use of logarithmic scale to produce a logarithmic likelihood ratio (LLR). It may be necessary for a SME to translate a LR/LLR score to a verbal scale when trying to communicate with triers of fact or the general public. This should be done carefully since numerical results vary according to the quality of the suspect, questioned, and population recordings. Figure 4 Exemplar of a suggested L R Verbal Scoring Table 3 A recent poll conducted by Interpol [17] revealed that out of 96 responding countries forty four has speaker identification capabilities and t wenty six forensic had either voice comparison systems or automatic speaker recognition systems . 1.4 United States Law The acceptance of normal, non modifi ed speech as part of a F SID expert testimony into a court case is not a trivial matter. If we add the burden that the speech evidence has been 3 Phonexia Voice Inspector (VIN) training class notes

PAGE 22

9 modified/tampered, and yet we intend to still present expert testimony tying it to the identity of a person, we must admit that it would be a very difficult legal hurdle to overcome. In the U nited S tates federal co urts the admissibility of expert testimony is subject to the F ederal Rules of Evidence (FRE), with special emphasis on Articles IV, VII, and IX [18] . In 1993 t he Daubert 4 standard superseded the previous Frye 5 standard for the interpretation of the FRE . Under Frye the criteria for accepting expert witness was in the particular field in which it belongs. while Daubert assigns the trial judge the role that: The testimony is based u pon sufficient facts or data It is t he product of reliable principles a nd methods T he witness has applied the principles and methods re liably to the facts of the case In pretrial hearings to assess whether an expert witness may be allowed to testify, commonly referred to as a Daubert hearing, the judge has a non exclusive che cklist to evaluate the witness and the proposed evidence: Whether the theory or technique has been tested Whether it has been subject to peer review and publication Whether technique or theory produces results with a known pote ntial error rate Whether it h as attracted a widespread acceptance within a relevant scientific community Whether it has the existence and maintenance of standards controlling its operation Other factors to include: 4 Daubert v. Merrell Dow Pharmaceuticals, Inc ., 509 U.S. 579, 125 L. Ed. 2d 469, 113 S. Ct. 2786 (1993) 5 Frye v. United States , 54 App. D.C. 46, 293F.1013, DC Ct App (1923)

PAGE 23

10 o Whether experts are proposing to testify about matters growing natural ly and directly out of research they have conducted independent of the litigation, or whether they have developed their opinions expressly for purposes of testifying. o Whether the expert has unjustifiably extrapolated from an accepted premise to an unfounde d conclusion. o Whether the expert has adequately accounted for obvious alternative explanations. o Whether the expert is being as careful as he would be in his regular professional work outside his paid litigation consulting. o Whether the field of expertise cl aimed by the expert is known to reach reliable results for the type of opinion the expert would give. State courts do not have to accept FRE with the Daubert standard but thirty ha ve accepted it or have laws consistent with it, fourteen reject it 6 , and sev en neither accept nor reject Daubert [19] . Puerto Rico, which is a territory of the United States has also adopted FRE and Daubert [20] . When it comes to speech evidence for F SID the US courts have been very conservative in their interpret ation of Daubert . As of this date F SID testimony has not been accepted as evidence in an appellate federal court . The closest that speaker identification evidence has come to being accepted, and establishing a precedent, was in United States v. Ahmed , et a l. 12 CR 661 (S1) (SLT) [21] . Mr. Mohammed Yusuf, together with Madhi Hashi and Ali Yasin Ahmed were acc used of providing support to al Shabaab, a terrorist organization based in Somalia [22] . The case had speech evidence that had been analyzed by acoustic, aural phonetic, and automated speaker recognition 6 Including Washington DC

PAGE 24

11 methods [23] . The Daubert hearing lasted three days but since the accused pleaded guilty before the beginning of the trial , were not made p ublic a nd the opportunity to establish precedence was gone. The proposed ElVo FMA framework could be very helpful in allowing electronically modified speech to be accepted in a US federal court in the future it. 1.5 Electronic Frequency/Pitch Changers To state t hat there is a plethora of tools to electronically alter the frequency/pitch of speech recording s/transmissions is an understatement. A n informal online market surve y quickly provides a wide variety of products that can be grouped into different categories including: Professional audio production systems: o Broadcast: Adjust voice quality for commercials and protect interviewee identity o Digital Audio Workstation (DAW): for music and audio productions in studio settings Phone related o Android and iOS Apps : Mul tiple voice effects o External hardware : Microphone and settings Entertainment o Hobbyist kits o First person gaming applications Electronic voice changers can also be classified by the speed of their algorithms which can result in real time, near real time, an d post processing scenarios.

PAGE 25

12 Figure 5 Examples of Voice Changer Options 1. 6 Scope The ElVo FMA framework is described at a technology agnostic level. It provides criterion and a workflow based on a higher level than of specific signal processing algorithms. It is presented i n a holistic manner to reach technical, research, and investigative readers. The electro nic frequency/pitch changes conducted in this research were limited to Pitch Synchronous Overlap and Add ( PSOLA ) based al gorithms within Adobe Audition 3.0. The algorithm determines a fundamental pitch period, segments audio into frames, and then reassembles those frames with an overlap add method at a different rate [24] .


Figure 6 Illustration of PSOLA algorithm waveform modification

The software tools used to evaluate formant and LLR scores of the modified speech samples are treated here as black boxes; their internal algorithms are protected by the intellectual property concerns of the tool developers.

1.7 Motivation and Goals

As I conducted my research and literature search it became more and more clear that there is a gap in the academic, technical, and forensic communities regarding how to deal with electronically disguised voices.


The main goal of this research is to develop ElVo-FMA as an analytical framework for working with electronically frequency/pitch-modified speech that serves academic, technical, and forensic professionals. It can provide a common workflow that:

Facilitates communication between academic researchers to identify areas that could benefit from collaboration.
Provides a clear map for forensic professionals, investigators, attorneys and triers of fact to evaluate speech evidence that may be presented.
Assists all participants in admissibility hearings by providing a common framework to discuss and argue, in favor or against, the admission of expert testimony regarding electronically frequency/pitch-modified speech.


CHAPTER 2
LITERATURE REVIEW

In this chapter an overview is presented of the most pertinent literature surrounding the issue of voice disguise and its impact on forensic SID. The discussion is organized into four major groups:

Non-Electronic Voice Disguise vs Manual SID
Non-Electronic Voice Disguise vs Automated SID
Electronic Voice Disguise vs Manual SID
Electronic Voice Disguise vs Automated SID

2.1 Non-Electronic Voice Disguise vs Manual SID

The majority of the current literature is focused on non-electronic disguises and their impact on manual auditory and acoustic-phonetic analyses. The authors have a wide range of backgrounds and address different perspectives, including academic, law enforcement, and practitioner viewpoints. As expected, the studies cover a wide range of methodologies, results, and recommendations for future work.

Kirchhübel and Howard [25] conducted a study on detecting suspicious behavior using speech and searching for correlations in acoustic cues. The work intended to explore changes, or the lack of changes, in the speech signal when people were being deceptive. A group of ten speakers provided elicited truthful and deceptive speech in an interview setting. Acoustic analyses were done on various speech parameters including fundamental frequency (F0), overall amplitude, and the mean vowel formants F1, F2 and F3. No significant correlation between deceptive and truthful speech and any of the acoustic features examined could be established. The researchers recommend follow-on studies with a larger test subject group and refinements to the test procedures.


Kirchhübel would later have the opportunity to do further testing and publish the results in her doctoral dissertation [26], where she concludes that, for the majority of acoustic parameters, no significant differences or trends could be discerned that would allow for a reliable differentiation between deceptive and non-deceptive speech.

Solan and Tiersma [27] studied the field of earwitness testimony in a forensic setting. One of their findings was that the basic cognitive identification performance of humans, who can identify familiar voices with high accuracy but are poor performers in matching unfamiliar voices, is amplified by voice disguises. In one study the results showed 79% accuracy in identifying familiar disguised voices (whispering, mimicking accents, raising/lowering pitch, imitation of other speakers) but no better than a 20.7% success rate for unfamiliar disguised voices. They express concern about the vulnerability of both expert and naïve listeners throughout the complete investigative and judicial process.

Amino, Osanai, and Kamada conducted an overview of forensic speaker recognition from a historical and procedural perspective [13]. Their research led them to conclude that aural identification holds up better against voice disguise than computer-based automatic methods, and is therefore a suitable method, or at least a good aid to other methods, when voice disguise is present. They point out that the most common voice disguises include changes in phonation.

Rhodes, through a literature survey in his doctoral dissertation [28], finds that the reported use of non-electronic disguises in cases varies from less than 5% in the UK to up to 25% of cases in Germany when the speaker is expecting his or her voice to be recorded. Human-based forensic SID can also be severely impacted by other factors that can have a secondary disguise effect, such as extreme distress, intoxication, drug addiction, multilingualism/L2 speakers, or comparison between languages.


2.2 Non-Electronic Voice Disguise vs Automated SID

There have been relatively few published papers on the impact of non-electronic voice disguise on automated or forensic SID systems. Researchers at the China Criminal Police University [29] [30] have studied the effects of non-electronic voice disguise against a forensic ASR system from d-Ear Technologies [31], which was developed at the Department of Computer Science & Technology of Tsinghua University. (The tool was described as capable of speaker identification and verification; no further technical details on its classifiers or internal algorithms were provided.) The participants were asked to try different disguising techniques including:

Raising and lowering pitch
Faster and slower speech
Whispering
Pinched nostrils
Masking of the mouth with a hand
Bite blocking with a pencil
Chewing gum
Mimicking foreign accents

Their results showed significant impact from some of the techniques. Consciously raising and lowering the pitch of speech affected the ASR tool, showing a significant degradation of the correct classification rate (CCR). Raising the voice pitch resulted in a CCR score of only 10%, while the score for lowered pitch was 55%.


Researchers at the Speech Technology and Research Laboratory, SRI International [32], investigated the effects of non-electronic voice disguise against an ASR system based on GMM and cepstral features. Participants were encouraged to use disguise techniques of their choosing, including changing pitch or speaking rate, or mimicking an accent. The results confirmed significant degradation when evaluating pitch-modified voice samples against a data set that had been trained on unmodified voices. It was also observed that combining human evaluation with ASR showed promising improvement.

2.3 Electronic Voice Disguise vs Manual SID

Brixen [33] explored the effects of commercially available DSP-based voice processing algorithms on samples of a male speaker, with the purpose of providing disguise. He used VoicePro, a commercial stand-alone voice processor by TC-Helicon, as well as Adobe Audition 3.0 for the tests. He selected a variety of algorithms and presets chosen to alter the voice without being considered too unnatural sounding. His analysis showed that in most cases the formants had shifted in their frequency space. In some cases there were processing artifacts that caused linear predictive coding (LPC) estimation of formants to produce erroneous results. He summarized that this would make correct SID difficult.

Jessica Clark and Paul Foulkes [34] conducted a test where voices had their F0 modified in a range of ±8 semitones. This was done with the commercial DAW software Sony Sound Forge (version 8.0). The disguised voices were evaluated by a total of 36 listeners, ten males and twenty-six females. Their results showed that the listeners performed best with unmodified voices, with an average identification rate (AIR) of 59.7%.


Identification rates fell in all disguised conditions, with the ±8 semitone conditions yielding the lowest rates. Rates were worse for the lowered F0 values (AIR = 31%) than for the corresponding raised F0 voices (AIR = 41%).

2.4 Electronic Voice Disguise vs Automated SID

Perrot, Aversano and Chollet [35] [36] have led tests in support of the Forensic Research Institute of the French Gendarmerie. Their work included a survey of non-electronic and electronic methods of disguising voices and their impact on manual and ASR techniques. For electronic frequency modification they explore their own technique, called Automatic Language Independent Speech Processing (ALISP), which shares some commonality with PSOLA techniques. For detection they focused on using MFCC features together with GMM and SVM algorithms. They also discuss the topic of reversing electronically disguised voices. The published results were very limited and mostly of a preliminary nature.


CHAPTER 3
ELECTRONIC FREQUENCY/PITCH MODIFICATION OF THE VOICE

This chapter contains an overview of the impact that modifying the frequency/pitch of voice samples produces on the basic signal and on some of the basic processing steps that are common in FSID.

3.1 Modification of Frequency Bandwidth Content

Automated forensic SID is based on analyzing the lower-level attributes of human speech, mostly frequency characteristics. It is therefore reasonable to expect that modifications of the spectral content will have a negative impact on the fundamental analyses and identification processes. For example, in Figure 7 the frequency bandwidth views of four versions of the same speech recording, lowered by a PSOLA algorithm (male voice, audio bandwidth of 22,050 Hz, 16 bits, processed with Adobe Audition 3.0), have been superimposed:

The green plot shows the spectral content of the unmodified sample.
The yellow plot shows the spectral content after being lowered one semitone.
The orange plot shows the spectral content after being lowered two semitones.
The purple plot shows the spectral content after being lowered four semitones.


Figure 7 Bandwidth of Male Voice; Original plus lowered by 1, 2 and 4 Semitones

The pattern shifts show a lowering of the energy related directly to the frequency by a discrete amount; depending on the specific western musical scale, a semitone step corresponds to roughly a 5-6% increase or reduction of the fundamental frequency. We can also observe in the figure that there seems to be a change in the spectral noise floor and the appearance of certain artifacts seen in the frequency dimension as nulls, peaks and valleys. These artifacts are common byproducts of the different signal processing algorithms and may hold the key to the detection and identification of frequency-modified voice samples.

The next two figures show a male voice recording, the first one (Figure 8) at the full original recording bandwidth of 24 kHz (sample rate of 48 kHz), and the second one (Figure 9) zoomed into the first 4,000 Hz of the voice baseband. On both figures the frequency bandwidth views of four versions of the same speech recording, modified by a PSOLA algorithm (male voice, audio bandwidth of 24 kHz, 16 bits, processed with Adobe Audition 3.0), have been superimposed:


The green plot shows the spectral content of the unmodified sample.
The orange plot shows the spectral content after being lowered one semitone.
The purple plot shows the spectral content after being lowered two semitones.
The yellow plot shows the spectral content after being lowered three semitones.

Figure 8 Bandwidth of Male Voice: Original plus lowered by 1, 2 and 3 Semitones

Figure 9 Zoom of Bandwidth of Male Voice; Original plus lowered by 1, 2 and 3 Semitones
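Overlaid long-term spectra like those in Figures 7 through 11 can be approximated with standard tools. The sketch below uses Welch averaging from SciPy rather than Audition's frequency analysis window, and the file names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import welch

# Hypothetical file names: one unmodified recording and PSOLA-shifted versions.
files = {
    "original":    "male_voice.wav",
    "down 1 semi": "male_voice_dn1.wav",
    "down 2 semi": "male_voice_dn2.wav",
    "down 4 semi": "male_voice_dn4.wav",
}

plt.figure()
for label, path in files.items():
    sr, x = wavfile.read(path)
    x = x.astype(np.float64)
    # Long-term average spectrum via Welch averaging over 8192-sample segments.
    f, pxx = welch(x, fs=sr, nperseg=8192)
    plt.plot(f, 10 * np.log10(pxx + 1e-12), label=label)

plt.xlabel("Frequency (Hz)")
plt.ylabel("Power spectral density (dB)")
plt.legend()
plt.title("Overlaid long-term spectra: original vs. pitch-lowered versions")
plt.show()
```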


The next two figures show the same male voice recording, but in this case the frequency has been raised instead of lowered. Figure 10 shows the full original recording bandwidth of 24 kHz, while Figure 11 is zoomed into the first 4,000 Hz of the voice baseband. On both figures the frequency bandwidth views of four versions of the same speech recording, modified by a PSOLA algorithm (male voice, audio bandwidth of 24 kHz, 16 bits, processed with Adobe Audition 3.0), have been superimposed:

The green plot shows the spectral content of the unmodified sample.
The orange plot shows the spectral content after being raised one semitone.
The purple plot shows the spectral content after being raised two semitones.
The yellow plot shows the spectral content after being raised three semitones.

Figure 10 Bandwidth of Male Voice; Original plus raised by 1, 2 and 3 Semitones


Figure 11 Zoom of Bandwidth of Male Voice; Original plus raised by 1, 2 and 3 Semitones

The next figure (Figure 12) shows a male voice that has been recorded and modified with an Android voice changer app (Voice Changer, version 1.0.60, by Android Rocker, androidrocker@163.com), while Figure 13 is zoomed into the first 7,200 Hz of the voice baseband. These types of apps usually apply more aggressive frequency changes in an effort to make it clearly obvious to any naïve listener that the voices have been modified. The modified voice samples were extracted as attachments to emails, since this application does not otherwise export the modified voice recordings. The frequency bandwidth views of three versions of the same speech recording (audio bandwidth of 12 kHz, 16 bits, file format MP3) have been superimposed:

The green plot shows the spectral content of the unmodified sample.
The red plot shows the spectral content after the voice was converted with one of the app's voice-effect presets.


The purple plot shows the spectral content after the voice was converted with a second app preset.

Figure 12 Male Voice, Voice Changer App

Figure 13 Zoom of Male Voice, Voice Changer App


We can clearly observe that there are changes in the frequency bandwidth content of the processed recordings that range from subtle to harsh and dramatic.

3.2 Impact on Formant-Based Analysis

Formant-based analysis is a foundational method of forensic SID. It is traditionally implemented within an acoustic-phonetic methodology [14] [37] [38] [39] and usually requires highly skilled SMEs. Some common tools used by the linguistic practitioner communities for formant analysis include PRAAT [40], WaveSurfer [41] and Catalina© [42].

In general, a three-step process is followed to characterize a speaker and conduct a comparison assessment. The first step is to transcribe the recorded speech using a phonetic alphabet (the most common phonetic alphabet for the English language is defined by the International Phonetic Association, https://www.internationalphoneticassociation.org/), with most attention focused on the vowels (see Figure 14). The second step is to map the vowel sounds onto a graph where the X and Y axes are a pair of formant frequencies. Since human vowel utterances are rich in harmonics, each vowel can be represented by energy content at the F0 through F3 formants. The most common formant pairings for graphical representation are F1 vs. F2 and F2 vs. F3. The resulting plots provide a pattern that can be reduced to a centroid diagram to improve clarity (see Figures 15 and 16) [42]. These reduced plots are then used for visual comparison and SID determinations.
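The following sketch illustrates the kind of formant estimation that underlies such plots, using the classic LPC root-finding approach rather than the internal algorithms of PRAAT, WaveSurfer, or Catalina©. The file names and frame position are hypothetical, and the selection of vowel segments is assumed to be done manually by the analyst.

```python
import numpy as np
import librosa

def lpc_formants(frame, sr, order=12):
    """Estimate the first formants of a voiced frame from LPC pole angles."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    frame = frame * np.hamming(len(frame))
    a = librosa.lpc(frame, order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                # one of each conjugate pair
    freqs = np.angle(roots) * sr / (2.0 * np.pi)
    bws = -0.5 * (sr / (2.0 * np.pi)) * np.log(np.abs(roots))
    # Keep sharp resonances in the speech range; return the lowest three.
    cands = sorted(f for f, b in zip(freqs, bws) if 90.0 < f < 5000.0 and b < 400.0)
    return cands[:3]

# Usage sketch: compare a manually located vowel in the original and in a
# pitch-shifted version of the same recording.
y1, sr = librosa.load("male_voice.wav", sr=None)
y2, _ = librosa.load("male_voice_dn3.wav", sr=sr)
t0 = int(1.20 * sr)                                  # centre of a sustained vowel
frame1 = y1[t0:t0 + int(0.03 * sr)]                  # 30 ms analysis frame
frame2 = y2[t0:t0 + int(0.03 * sr)]
print("original F1-F3:", lpc_formants(frame1, sr))
print("shifted  F1-F3:", lpc_formants(frame2, sr))
```

Repeating this for several vowels and plotting the F1/F2 pairs reproduces the kind of vowel-space comparison discussed in the following figures.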


Figure 14 International Phonetic Association Chart for Vowels

Figure 15 Formant Space Plot, Unmodified


Figure 16 Formant Space Plot, Reduced Centroid View

The F1 vs. F2 formant space plot of an unmodified recording of a male voice is shown in Figure 17, while the plot of the same recording lowered by three semitones is shown in Figure 18. The analysis was done manually and five main vowel sounds were mapped. Visual inspection clearly shows that the F1/F2 frequency values for each vowel have changed. It is also evident that the rate of change is different for each vowel, and as a result the shape of the overall vowel-space pattern has also changed. Should an investigator have to make an assessment based on these two plots, the most likely outcome would be an erroneous determination that the samples come from different speakers. This shows the clear anti-forensic effect of voice frequency/pitch manipulation against this form of analysis.


Figure 17 Formant Space Plot, Unmodified Male Voice

Figure 18 Formant Space Plot, Male Voice Lowered by Three Semitones


Formant-based SID can also be done automatically, with the software extracting, evaluating and matching formant patterns. These types of tools are also susceptible to manipulation of the voice frequency, resulting in erroneous results. The automatic formant analysis feature of Catalina© was used on three versions of a male voice recording. The first recording was unmodified (see Figure 19), the second recording was lowered by three semitones from the original (see Figure 20), and the third recording had the frequency raised by three semitones from the original (see Figure 21). Notice how the mean values of the formants F0 through F3, as well as their histogram distributions, change.

Figure 19 Formant Frequency View: Unmodified male voice


Figure 20 Formant Frequency View: Modified male voice, down three semitones

Figure 21 Formant Frequency View: Modified male voice, up three semitones


There is also a trend in the F0 values that seems counterintuitive. The calculated mean F0 values for both the lowered (129.6 Hz) and the raised (132.2 Hz) recordings are below the unmodified F0 value of 139.4 Hz. Since we have seen in the previous figures that the raw frequency bandwidth spectrum correlates directly with the number of semitone steps lowered or raised, we can presume that the electronic modifications have induced errors in the statistical computations. Based on conversations with the author of Catalina©, Dr. Catalin Grigoras, we suspect that the primary source of error most likely comes from the user-defined settings for the vowel frequency ranges (see Figure 22). By default the tool detects the vowels [a], [e], [i], and [o] using fixed expected frequency ranges.

Figure 22 Default Vowel Frequency Range in Catalina©

As voice recordings are modified electronically and their frequencies are pushed into ranges not expected in normal operation, it is reasonable to expect that other automated analysis tools may suffer the same type of degradation and produce erroneous results as well.
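To illustrate the suspected mechanism, a fixed expected frequency range biasing the computed statistics, the following minimal sketch tracks F0 with librosa's pYIN estimator restricted to a deliberately narrow search band. It is not Catalina's algorithm, and the file names are hypothetical; the point is only that frames falling outside the configured band drop out of the statistics.

```python
import numpy as np
import librosa

def mean_f0_with_search_band(path, fmin, fmax):
    """Mean F0 as seen by a tracker restricted to a fixed search band."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    f0 = f0[~np.isnan(f0)]          # frames outside the band are simply dropped
    return float(np.mean(f0)) if len(f0) else float("nan")

# Hypothetical files; the narrow band (100-160 Hz) mimics a tool whose
# expected-range settings were chosen for unmodified male speech.
for name in ["male_voice.wav", "male_voice_dn3.wav", "male_voice_up3.wav"]:
    print(name, mean_f0_with_search_band(name, fmin=100.0, fmax=160.0))
```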


3.4 Impact on Automated Speaker Identification

The last step of our evaluation of the potential impact of electronic frequency modification was to test unmodified and modified voice recordings on an automated SID tool. Voice Inspector (VIN) by Phonexia was used as the tool due to its many desirable features. VIN is a commercial, state-of-the-art forensic SID software package that uses various algorithms, including i-vector processing and Bayesian likelihood ratio calculations, to evaluate voice evidence for law enforcement and intelligence scenarios.

All the recordings used for the population database, the suspect reference, and the questioned voices were sourced from the same podcast producer. This promoted minimum variability in the source/channel of the voice recordings. The general population database was composed of a total of fifteen male voice recordings. There were three suspect reference voice recordings and a total of thirteen voice recordings to be evaluated. The questioned voice recordings and the suspect recordings were from the same individual. Of the questioned recordings, one was unmodified, six were frequency-lowered sequentially at one-semitone intervals, and six were frequency-raised sequentially at one-semitone intervals. Each questioned voice recording was processed automatically for SID assessment. By using the same original recording for the unmodified version and all the frequency-modified versions, we can establish a correlation between the direction and amount of frequency modification and the degradation in the LLR-based score.

The results are shown in Table 1. The unmodified voice recording had an LLR score of 9.53, whose equivalent verbal statement supports the same-speaker proposition. The only other recording whose comparison produced a comparable result was the one that had been lowered by one semitone. All other recording comparisons produced LLR scores whose verbal equivalents no longer supported the same-speaker proposition (see Table 1).


A visual representation for a subset of the comparisons, the unmodified voice (see Figure 23), the voice lowered by three semitones (see Figure 24), and the voice raised by three semitones (see Figure 25), provides insight into the Bayesian framework being used for the assessment. The full forensic report generated by VIN, including the LLR scores, verbal scores, and Bayesian graphics for all thirteen questioned recordings, is included in the Appendix.

Table 1 Log Likelihood Ratio scores for male voice samples

Name                                        LLR
FZ_GPS_February_15th_8kHz.wav               9.53
FZ_GPS_February_15th_8kHz_Dn1_Semi.wav      9.53
FZ_GPS_February_15th_8kHz_Dn2_Semi.wav      23.3
FZ_GPS_February_15th_8kHz_Dn3_Semi.wav      23.3
FZ_GPS_February_15th_8kHz_Dn4_Semi.wav      113
FZ_GPS_February_15th_8kHz_Dn5_Semi.wav      153
FZ_GPS_February_15th_8kHz_Dn6_Semi.wav      104
FZ_GPS_February_15th_8kHz_Up1_Semi.wav      32.2
FZ_GPS_February_15th_8kHz_Up2_Semi.wav      68.8
FZ_GPS_February_15th_8kHz_Up3_Semi.wav      124
FZ_GPS_February_15th_8kHz_Up4_Semi.wav      124
FZ_GPS_February_15th_8kHz_Up5_Semi.wav      142
FZ_GPS_February_15th_8kHz_Up6_Semi.wav      176


Figure 23 Unmodified Voice, Speaker ID Assessment

Figure 24 Modified Voice, 3 Semitones Down, Speaker ID Assessment


Figure 25 Modified Voice, 3 Semitones Up, Speaker ID Assessment

The LLR scores clearly show that, with the exception of the recording lowered by one semitone, all the other electronically frequency-modified recordings degraded the effectiveness of the VIN tool in a dramatic manner. Since other state-of-the-art automated SID software also uses i-vector algorithms and similar processes, it is reasonable to expect that this may be a problem for this whole class of automated SID tools.
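The LLR scores above are meant to be interpreted within a Bayesian framework. The sketch below is a minimal illustration of that framework, not Phonexia's internal computation: it combines a log-likelihood-ratio with prior odds and assumes the LLR is a base-10 logarithm, an assumption that should be confirmed for any specific tool.

```python
def posterior_probability(llr, prior_same=0.5, log_base=10.0):
    """Combine an LLR score with prior odds using Bayes' rule.

    Assumes the reported LLR is log base `log_base`; some tools report
    natural logarithms, so the base must be verified per system.
    """
    lr = log_base ** llr                       # likelihood ratio
    prior_odds = prior_same / (1.0 - prior_same)
    posterior_odds = prior_odds * lr           # Bayes: posterior odds = prior odds x LR
    return posterior_odds / (1.0 + posterior_odds)

# With equal prior odds, a large positive LLR yields near-certain support for
# the same-speaker proposition, while a large negative LLR supports the
# different-speaker proposition.
for llr in [9.53, 0.0, -9.53]:
    print(llr, round(posterior_probability(llr), 6))
```

The prior odds are the province of the trier of fact; the forensic tool only supplies the likelihood ratio.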


CHAPTER 4
ELVO-FMA ANALYTICAL FRAMEWORK PROPOSAL

The proposed Electronic Voice-Frequency Modification Analysis (ElVo-FMA) framework provides scientific, technical, and legal professionals with a clear analytical pathway for assessing the maturity level of tests, experiments, and methods for FSID, covering the full path from obtaining the voice evidence to making a forensic determination.

Key benefits of ElVo-FMA include:

A unifying framework for researchers and scientists to place tests and experiments within the proper context of related, and possibly interdependent, efforts within their communities.
A workflow that guides investigators and forensic professionals through the process of determining the utility of voice evidence in a specific case. Specifically, it should help assess whether the questioned voice recordings can be used directly for forensic SID and presented to the courts, or only as an aid to help sort through other pertinent case evidence.
A clear progression path for assessing the maturity of technical methods as viable forensic methods.

ElVo-FMA is focused on answering four key questions that, depending on the answer to each one, guide the user towards the next logical step, highlighting current capabilities and potential areas of needed research.


Figure 26 ElVo-FMA Flowchart Diagram


4.1 Obtaining Speech Evidence

The ElVo-FMA process begins when voice evidence (in the case of an investigation) or voice test files (in the case of research) have been obtained. In the case of voice evidence, it is imperative that it be treated and handled the same way as any other digital evidence, and it is thus appropriate to apply comparable computer forensics methods: proper documentation, the use of write-blocking technology, hash calculations, and the safe archival of the original files should all be observed (a minimal file-integrity sketch follows the list of key questions below).

At this stage the voice evidence is assigned to Stage 0, which means that no ElVo-FMA analysis has yet begun. For this proposed framework the voice evidence can be considered to be in its raw state, ready to begin analysis.

4.2 ElVo-FMA Key Assessments: Four Questions

Once the evidence file or files have been obtained in a forensically sound manner, it is time to begin the evaluation. ElVo-FMA seeks the answer to four key questions:

Has electronic frequency modification been applied to the voice evidence?
Is the method of electronic voice modification known?
Are the settings/values of the electronic modification method known?
Is the electronic modification reversible from a forensic SID perspective?
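As a minimal illustration of the evidence-handling step in Section 4.1, the following sketch computes and logs a SHA-256 digest of an acquired file. The file name is hypothetical, and the hash complements, rather than replaces, write blocking and proper archival.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of an evidence file without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical acquisition-log entry for a questioned recording.
evidence = Path("questioned_call_001.wav")
print(f"{evidence.name}  SHA-256: {sha256_of_file(evidence)}")
```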


4.3 ElVo-FMA Stage 1: Has Electronic Frequency Modification Occurred?

Determining whether electronic modification has been done can be more complex than it first appears. A multimodal approach is often a more effective and robust way to proceed. In an investigation, a determination can be reached by a combination of methods that include investigative leads, acoustic methods, and technical analyses.

Figure 27 ElVo-FMA Stage 1 Flowchart Diagram

Investigative leads and resources can be used to determine if electronic frequency modification has been used. These can include:

Confession by the suspect or person of interest
Corroboration or confession by an associate of the suspect or person of interest
Written documentation stating the use of electronic frequency modification techniques
Intercepted communications where frequency modification has been stated or suggested
Pertinent hardware and/or software seized by law enforcement authorities

Aural methods are the most natural for human beings and are common today for determining if there has been electronic frequency modification. These can include:


Aural analysis by naïve listeners: It should be noted that most naïve listeners can only detect frequency modification when it is extreme and obvious; slight modifications may be extremely difficult to detect. Naïve listeners are also highly susceptible to biases and cognitive limitations.
Aural analysis by forensic experts, which may include a variety of approaches such as linguistic, language, and acoustic techniques.

Technical methods use a variety of signal processing, statistical, and forensic techniques. These may include:

Forensic analysis of file metadata [43]: signatures of software or methods known to modify voice frequencies (a minimal signature-scan sketch is given at the end of this subsection)
Signal processing methods: voice analysis based on a variety of acoustical measurements
Signal processing methods: acoustic voice and environment analysis
Statistical analyses, including machine learning algorithms

The first decision point is a binary decision: Yes or No. If the determination is that no frequency modification has occurred, then the voice evidence can proceed to be evaluated, analyzed and interpreted by a conventional forensic SID process. This should be implemented in a manner that has been accepted by the courts, satisfying Daubert, Frye, or other accepted legal standards. If the determination is that frequency modification has occurred, the voice evidence proceeds to enter the second phase of the framework.
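As a minimal illustration of the metadata-signature idea listed above, the sketch below scans the head of a file for byte patterns associated with known software. The signature table is a hypothetical example; a real signature library would be built and validated from reference recordings produced with known tools and encoders.

```python
SIGNATURES = {
    # Hypothetical example patterns; entries must be validated per tool/version.
    b"Lavf": "FFmpeg/libav muxer string",
    b"LAME": "LAME MP3 encoder tag",
    b"Adobe Audition": "Adobe Audition metadata string",
}

def scan_for_signatures(path, max_bytes=1 << 20):
    """Scan the head of a file for byte patterns left by known software."""
    with open(path, "rb") as fh:
        head = fh.read(max_bytes)
    return [desc for pattern, desc in SIGNATURES.items() if pattern in head]

print(scan_for_signatures("questioned_call_001.wav"))   # hypothetical file name
```

A hit only indicates that a particular tool touched the file at some point; it does not by itself prove that frequency modification was applied.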


4.4 ElVo-FMA Stage 2: Is the Method Known?

Investigative leads, acoustic methods, and technical analyses can also be used to assess whether the modification method is, or can be, known.

Figure 28 ElVo-FMA Stage 2 Flowchart Diagram

Investigative leads and resources can include:

Confession by the suspect or person of interest
Corroboration or confession by an associate of the suspect or person of interest
Written documentation stating the use of electronic frequency modification techniques
Intercepted communications where frequency modification has been stated or suggested
Pertinent hardware and/or software seized by law enforcement authorities


Aural methods can be useful but may be limited to specialized forensic experts. The identification of the type of method used for frequency modification will most probably be beyond the capability of most naïve listeners. The forensic expert should have training and proven proficiency to make basic assessments such as recognizing fundamental frequency modification.

Technical methods for identifying the modification method are an area of much interest to researchers today and can include:

Forensic analysis of file metadata: signatures of software or methods known to modify voice frequencies
Signal processing methods: voice analysis based on a variety of acoustical measurements
Signal processing methods: acoustic voice and environment analysis
Statistical analyses, including machine learning algorithms

The second decision point is also a binary decision: Yes or No. If the method of frequency modification cannot be identified, then the evidence should not be presented to the courts as part of a conventional SID process. This does not mean, however, that the evidence has no value: it can be added to a corpus of modified voice recordings for further technical analysis. While the use of such evidence for scientific research may vary from jurisdiction to jurisdiction, the establishment and care of such a corpus can be highly beneficial for the development and discovery of future methods. If the method of frequency modification can be identified, the voice evidence can proceed to enter the third phase of the framework.


4.5 ElVo-FMA Stage 3: Are the Settings/Values of the Modification Parameters Known?

Investigative leads, acoustic methods, and technical analyses can be used to assess whether the modification settings/values are, or can be, known. These can vary depending on the implementation, ranging from many adjustable parameters for more sophisticated methods to simple presets on others.

Figure 29 ElVo-FMA Stage 3 Flowchart Diagram

Investigative leads and resources can include:

Confession by the suspect or person of interest
Corroboration or confession by an associate of the suspect or person of interest
Written documentation stating the settings used for electronic frequency modification
Intercepted communications where the settings for frequency modification have been stated or suggested
Pertinent hardware and/or software seized by law enforcement authorities with the settings still preserved


Aural methods can be useful but may be limited to specialized forensic experts. The identification of specific setting values will most probably be beyond the capability of most naïve and expert listeners.

Technical methods for identifying the modification settings are an area of much interest to researchers today and can include:

Forensic analysis of file metadata: signatures of software or methods known to modify voice frequencies
Signal processing methods: voice analysis based on a variety of acoustical measurements
Signal processing methods: acoustic voice and environment analysis
Statistical analyses, including machine learning algorithms

The third decision point is a binary decision: Yes or No. If the settings/values cannot be identified, then the evidence should not be presented to the courts as part of a conventional SID process. Just as in the previous step (ElVo-FMA Stage 2), it would be very beneficial to add the evidence to a corpus of modified voice recordings for future scientific analysis. If the settings/values can be identified, the voice evidence can proceed to enter the fourth phase of the framework.

4.6 ElVo-FMA Stage 4: Is the Modification Reversible?

Reversing the effect of electronic frequency modification could be considered the ultimate forensic SID goal: the questioned voice evidence would be processed in a manner that allows it to then be input into a conventional forensic SID process, to be evaluated against a corpus of unmodified suspect and general population recordings.


This capability would counter the attempt by criminals to disguise their voices and eliminate another technique used to elude identification. At the writing of this thesis, the literature search does not indicate that this is currently possible for most, if not all, frequency modification techniques. If such methods were to become available, they would need to meet scientific and technical as well as legal standards, such as Daubert or Frye, to be acceptable in a court of law. ElVo-FMA Stage 4 highlights areas of research for addressing the capability gaps within the forensic SID community and potential ways to address them in the search for an acceptable forensic solution.

Figure 30 ElVo-FMA Stage 4 Flowchart Diagram


A negative answer to the reversibility question does not end the analysis. If the modification cannot be reversed, could a modified SID process be applied that would still be scientifically and legally sound? Another paradigm could open for forensic SID: what if there is sufficient inter/intra-speaker variability definition not only in the evidence voices, but also in a corpus of suspect and general population recordings that have been modified with the same process? Could it be feasible to apply a modified SID process with the same level of rigor as currently accepted methods? This ElVo-FMA stage highlights a potentially new and exciting field of research. A positive answer regarding such a modified forensic SID process would be of much interest to the community. On the other hand, a negative answer would result in the recommended path of adding the voice data to a database of modified voice recordings for further scientific or technical analysis. What are those techniques? What methods are the most promising?

Figure 31 ElVo-FMA


The application of any such modified process would need to satisfy scientific, technical, and legal standards. At the writing of this thesis, the literature search suggests that this area of endeavor is in its infancy.
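For readers who prefer an executable summary of this chapter, the following minimal sketch encodes the four ElVo-FMA decision points as a small routine. The class and function names are illustrative only; the recommendations mirror the stage descriptions above.

```python
from dataclasses import dataclass

@dataclass
class ElvoFmaAnswers:
    """Answers to the four ElVo-FMA key questions for a piece of voice evidence."""
    modified: bool        # Stage 1: has electronic frequency modification occurred?
    method_known: bool    # Stage 2: is the modification method known?
    settings_known: bool  # Stage 3: are the settings/values known?
    reversible: bool      # Stage 4: is the modification forensically reversible?

def elvo_fma_route(a: ElvoFmaAnswers) -> str:
    """Return the recommended path implied by the four decision points."""
    if not a.modified:
        return "Proceed with conventional forensic SID (Daubert/Frye-compliant)."
    if not a.method_known or not a.settings_known:
        return ("Do not present as conventional SID; add to a modified-voice "
                "corpus for research.")
    if a.reversible:
        return "Reverse the modification, then apply conventional forensic SID."
    return ("Consider a modified SID process against a similarly modified corpus, "
            "or add to a modified-voice corpus for research.")

print(elvo_fma_route(ElvoFmaAnswers(True, True, True, False)))
```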


CHAPTER 5
PATH FORWARD

The recommendations in this section can be organized into two groups: those consistent with the topic of research, and deficiencies discovered during this endeavor.

5.1 Principal Recommendations

The ElVo-FMA framework should be disseminated throughout the academic and forensic communities for adoption. It provides a clear workflow that promotes holistic, multidisciplinary collaboration. It empowers clear communication that will facilitate technological advancements as well as accelerate compliance with the US FRE and Daubert.

Another recommendation is to develop a taxonomy of electronic frequency/pitch modification methods in which hardware and software tools can be mapped, classified and disseminated. This would be most beneficial for law enforcement and investigative partners. Such a taxonomy can be leveraged to map out each method and, if possible, a counter-method for analyzing and neutralizing voice disguise tools that use electronic frequency/pitch modification.

A third recommendation is to conduct research on how to optimize feature extraction methods, such as MFCC and LPC, for frequency/pitch-modified speech. The results of this research concur with the literature: the reliability of FSID tools degrades rapidly in the presence of frequency-modified speech, which can be considered today an effective anti-forensic method.
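As a starting point for such feature-extraction research, the sketch below shows where typical front-end parameters sit in a standard MFCC extraction (librosa is assumed as the library and the file name is hypothetical). It only illustrates which parameters could be re-tuned once the direction and amount of a pitch shift are known; it is not a validated counter-measure.

```python
import librosa

y, sr = librosa.load("questioned_call_001.wav", sr=None)   # hypothetical file

# Default-style extraction, roughly what a stock front end might use.
mfcc_default = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# The mel-filterbank limits and resolution are the kind of parameters that
# could be re-tuned for frequency/pitch-modified speech.
mfcc_adapted = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                    n_mels=40, fmin=50.0, fmax=sr / 2)

print(mfcc_default.shape, mfcc_adapted.shape)
```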


5.2 Deficiencies

During this research, deficiencies were identified in current practices that merit revisiting. These include:

Exploring the expansion of the voice channel beyond the 4 kHz bandwidth paradigm. Today there are many options for transferring voice through channels with much wider bandwidth, yet most, if not all, FASR systems can only process 4 kHz-wide speech signals. While this is understandable when dealing with data that has travelled through phone telecommunications infrastructure, the landscape of audio communications is changing at an accelerated pace (a minimal bandwidth-reduction sketch is given after this list).
Increasing research and testing with female voices. The vast majority of published research focuses on male voices; a more balanced corpus and research focus is needed.
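For experiments that still need to respect the current 4 kHz constraint, the following minimal sketch reduces a wideband recording to an 8 kHz sample rate (and therefore a roughly 4 kHz bandwidth). The librosa and soundfile libraries are assumed, and the file names are hypothetical.

```python
import librosa
import soundfile as sf

# Hypothetical wideband recording (e.g., 48 kHz) reduced to the 8 kHz sample
# rate / ~4 kHz bandwidth that most current FASR systems expect.
y, sr = librosa.load("wideband_interview.wav", sr=None)
y_nb = librosa.resample(y, orig_sr=sr, target_sr=8000)
sf.write("wideband_interview_8kHz.wav", y_nb, 8000)
```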


BIBLIOGRAPHY

[1] Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism, 1st ed., A. Neustein and H. A. Patil, Eds. New York, NY: Springer, 2012, pp. 21-39.
[2] H. Hollien, Forensic Voice Identification, 1st ed. London: Academic Press, 2002.
[3] Modern Scientific Evidence: The Law and Science of Expert Testimony, Volume 3, W. P. Co., Ed. St. Paul, 2002, p. 46.
[4] P. Rose, Forensic Speaker Identification, 1st ed. New York, NY: Taylor & Francis Inc., 2002.
[5] W. J. Hardcastle, J. Laver, and F. E. Gibbon, Eds., The Handbook of Phonetic Sciences, 2nd ed. Malden: Wiley Blackwell, 2013.
[6] and statistical compensation, École Polytechnique Fédérale de Lausanne, 2005.
[7] IEEE Signal Process. Mag., vol. 32, no. 6, pp. 74-99, Nov. 2015.
[8] C. (NCMF)


[9] Int. J. Speech Technol., vol. 13, no. 3, pp. 141-161, 2010.
[10] "he felt him, and said," Genesis 27:22, https://www.bible.com/bible/1/GEN.27.22. [Accessed: 03 Apr 2017].
[11] Audio Engineering Society 39th International Conference, 2010. [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=15485.
[12] Sci. Justice, vol. 54, no. 4, pp. 292-299, 2014.
[13] Forensic Speaker Recognition, 1st ed., A. Neustein and H. A. Patil, Eds. Springer, 2012, pp. 3-20.
[14] Int. J. Speech, Lang. Law, vol. 12, no. 2, pp. 143-173, 2005.
[15] ion of Evidence, in Encyclopedia of Forensic Sciences, 2nd ed., Elsevier Ltd., 2013, pp. 292-297.
[16] Sci. Justice, vol. 49, no. 4, pp. 298-308, 2009.


[17] G. S. Morrison, F. H. Sahito, G. Lle Jardine, D. Djokic, S. Clavet, S. Berghs, and C., Forensic Sci. Int., pp. 92-100, 2016.
[18] Federal Rules of Evidence. The Committee on the Judiciary, House of Representatives, 2010, pp. 1-41.
[19]
[20] Tribunal Supremo de Puerto Rico, Reglas de Evidencia de Puerto Rico. 2013, pp. 1-91.
[21] S. L. Townes (United States District Judge), Government request to introduce audio evidence approved. 2015, pp. 1324-1327.
[22] "Three Members of Al Shabaab Plead Guilty to Conspiring to Provide Material Support to the Terrorist Organization," FBI, https://www.fbi.gov/contact-us/field-offices/newyork/news/press-releases/three-members-of-al-shabaab-plead-guilty-to-conspiring-to-provide-material-support-to-the-terrorist-organization. [Accessed: 05 Apr 2017].
[23]
[24] I. McLoughlin, Applied Speech and Audio Processing with MATLAB Examples, 1st ed. New York, NY: Cambridge University Press, 2009.
[25] Kirchhübel and Howard, Acoustic correlates of deceptive speech, Appl. Ergon., vol. 44, no. 5, pp. 694-702, 2013.


[26]
[27] Solan and Tiersma, Hastings Law J., vol. 54, no. 2, pp. 373+, 2003.
[28]
[29] Forensic Sci. Int., vol. 175, no. 2-3, pp. 118-122, 2008.
[30] Proc. 2010 3rd Int. Congr. Image Signal Process. (CISP 2010), vol. 8, pp. 3538-3541, 2010.
[31] d-Ear Technologies Joint Laboratory, ear.com/english1/newsview.asp?id=251. [Accessed: 09 Apr 2017].
[32] modifications for evading au, IEEE Odyssey 2006: Workshop on Speaker and Language Recognition, 2006, vol. 0, pp. 1-6.
[33] Roadmap, vol. 3, no. JUNE 1987, pp. 1-87, 2009.
[34] 14, no. 2001, p. 2006, 2006.
[35] P. Perrot and G. Chollet, in Forensic Speaker Recognition. 2012.


[36] P. Perrot, G. Aversano, and G. Chollet, Lect. Notes Comput. Sci., pp. 101-117, 2007.
[37] term formant distributions as a discriminant in forensic speaker comparison, vol. 19, no. 2012, pp. 060041-060041, 2013.
[38] of the Influence of the Measurement Tool, Analysis Settings and Speaker on Formant
[39] in The Handbook of Phonetic Sciences, 2nd ed., W. J. Hardcastle, J. Laver, and F. E. Gibbon, Eds. Chichester, England: Wiley Blackwell, 2013, p. 870.
[40] 70, 2013.
[41] https://www.speech.kth.se/wavesurfer/man.html. [Accessed: 04 Apr 2017].
[42]
[43] Forensic Authentication of Digital Audio and Video, in Handbook of Digital Forensics of Multimedia Data and Devices, 1st ed., A. T. S. Ho and S. Li, Eds. Chichester, England: Wiley-IEEE Press, 2015, pp. 133-181.


APPENDIX
CASE REPORT EXEMPLAR
