Citation
Understanding the relationship between real and artificial deception

Material Information

Title:
Understanding the relationship between real and artificial deception
Creator:
Yang, Yanjuan
Publication Date:
2009
Language:
English
Physical Description:
xi, 103 leaves ; 28 cm

Subjects

Subjects / Keywords:
Data mining ( lcsh )
Deception ( lcsh )
Automatic data collection systems ( lcsh )
Automatic data collection systems ( fast )
Data mining ( fast )
Deception ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 98-103).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Yanjuan Yang.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
527639108 ( OCLC )
ocn527639108
Classification:
LD1193.E52 2009d Y36 ( lcc )

Full Text
UNDERSTANDING THE RELATIONSHIP BETWEEN REAL AND
ARTIFICIAL DECEPTION
by
Yanjuan Yang
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Computer Science and Information Systems
2009


© 2009 by Yanjuan Yang
All rights reserved.


This thesis for the Doctor of Philosophy
degree by
Yanjuan Yang
has been approved
by
Michael Mannino
Tom Altman
Ronald Ramirez
Date


Yang, Yanjuan (Ph.D., Computer Science and Information Systems)
Understanding the Relationship between Real and Artificial Deception
Thesis directed by Associate Professor Michael Mannino
ABSTRACT
Deception has become an important area in data mining with many recent
studies on terrorism threats, intrusion detection, and fraud prevention. To develop a
data mining approach for a deception application, data collection costs can be
prohibitive because both the deceptive data and the truthful data without deception
must be collected. In order to lower the cost of data collection, artificially
generated deception data can be used to train the data mining program, but the
impact of using artificially generated deception data is not known. This
project aims to investigate the relationship between real and artificial deception.
The deception and truth data were collected from financial aid applications, a
document centric area with limited resources for verification. The data collection
provides a unique data set containing truth, natural deception, and deliberate
deception. The data collection was augmented by randomly generated artificial
deception. To better simulate deception behavior, a new noise model, called the
application deception model, is proposed and implemented to generate artificial
deception in the context of different deception scenarios.
Two experimental studies are proposed to analyze the relationship between
real and artificial deception. The first study investigates the fit between data mining
noise models and deception data to determine if artificially generated deception can
be used to reduce data collection costs. Outlier score and directed distance


percentage change are used as outcome variables. The second study investigates the
impact of real and artificial deception on screening policy performance. The
performance of the screening method is evaluated using an information theoretic
measure and a cost model that is built in the context of the financial aid application.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Michael Mannino


ACKNOWLEDGMENT
It is with my special appreciation that I acknowledge my advisor, Dr. Michael
Mannino, for his support, encouragement, and invaluable guidance throughout the
course of this work. His knowledge and dedication have been a constant source of
inspiration. Without the direction of Dr. Mannino, this project could have never
been completed.
I would also like to thank the other members of my dissertation committee, Dr.
Tom Altman, Dr. Peter Bryant, Dr. Dawn Gregg and Dr. Ronald Ramirez for their
invaluable comments and useful suggestions.
I gratefully acknowledge the support from the UCD Business School for my
graduate study.


TABLE OF CONTENTS
Figures....................................................................ix
Tables ....................................................................x
Chapter
1. Introduction ..........................................................1
1.1 Motivation and Overview............................................1
1.2 Thesis Contribution................................................5
1.3 Thesis Outline.....................................................6
2. Theoretical Analysis...................................................7
2.1 Deception Literature Review .......................................7
2.2 Data Collection Issue ............................................11
2.3 Noise Literature..................................................13
2.4 Artificial Data Generation........................................18
3. Analysis and Modeling of Real Deception ..............................20
3.1 Research Model and Hypotheses ....................................20
3.2 Experimental Methodology..........................................23
3.3 Dependent Variables and Measures..................................25
3.3.1 Directed Distance..............................................25
3.3.2 Outlier Score .................................................28
3.4 Real Deception Data Collection....................................31
3.5 Artificial Noise and Deception Models.............................32
3.5.1 Variable Noise Model ..........................................32
3.5.2 Application Deception Model....................................33
4. Analysis of Impact of Real and Artificial Deception on
Screening Policies .............................................43
4.1 Research Framework and Hypotheses ...............................43
4.2 Research Methodology.............................................45
4.3 Performance Measures.............................................48
4.3.1 Information Theoretic Measure................................49
4.3.2 Cost Model...................................................51
5. Experimental Results and Analysis...................................56
5.1 Simulation of Real Deception.....................................56
5.1.1 Analysis Based on Distance Measure ..........................57
5.1.2 Analysis Based on Outlier Score..............................59
5.1.3 Discussion ..................................................62
5.2 Impact of Real and Artificial Deception on Screening Policies....64
5.2.1 Impact on Top Policy.........................................65
5.2.2 Impact on Random Policy......................................78
5.2.3 Discussion ..................................................87
6. Conclusions and Future Work.........................................90
6.1 Conclusions..................................................90
6.2 Implications.................................................91
6.3 Limitations and Future Work..................................92
Appendix
A. Financial Aid Application Form .....................................94
Bibliography ............................................................98


LIST OF FIGURES
Figure
3.1 Research Model............................................................21
3.2 The Distance from Truth to Real and Artificial Deception..................26
3.3 Application Deception Model...............................................36
4.1 Comparison of Screening Method Performance on Real and Artificial
Deception.................................................................44
4.2 Labeling Natural Deception Data...........................................47
4.3 Labeling Deliberate Deception Data........................................48
4.4 Method to Calculate HM....................................................51
4.5 Award Allocation Based on Two Models of Budget..........................53
4.6 The Meaning of Award Difference...........................................54
5.1 The Impacts of Real and Artificial Deception on Top Policy at Different
Percentage of Screened Case with Harmonic Mean............................69
5.2 The Impacts of Real and Artificial Deception on Top Policy at Different
Percentage of Screened Case with Cost (Fixed Budget)...................75
5.3 The Impacts of Real and Artificial Deception on Top Policy at Different
Percentage of Screened Case with Cost (Variable Budget)...................77
5.4 The Impacts of Real and Artificial Deception on Random Policy at Different
Percentage of Screened Case with Harmonic Mean............................82
5.5 The Impacts of Real and Artificial Deception on Random Policy at Different
Percentage of Screened Case with Cost (Fixed Budget)......................84
5.6 The Impacts of Real and Artificial Deception on Random Policy at Different
Percentage of Screened Case with Cost (Variable Budget)...................86


LIST OF TABLES
Table
2.1 Summary of theoretical noise models.....................................14
2.2 Summary of noise models used in previous studies........................16
3.1 Experimental comparisons between real and artificial deception..........24
3.2 Sample sizes............................................................24
3.3 The sign for directed distance..........................................26
3.4 Original deception scenarios in financial aid application...............38
3.5 Deception scenarios in financial aid application (after combination)....39
3.6 Features included in each group of variables............................39
3.7 Data selection for deception scenarios in financial aid application.....40
3.8 Data perturbation for deception scenarios in financial aid application..41
3.9 Threshold and ideal values for parents' financial variables.............42
3.10 Threshold and ideal values for merit-based variables..................42
4.1 Comparison of screening method performance on real and artificial deception46
4.2 Confusion matrix of a true-deception prediction.........................49
4.3 Cost matrix for deception detection.....................................54
4.4 Cost model for financial aid application deception detection............55
5.1 Hypotheses based on directed distance measure............................57
5.2 Normality tests for distance variable...................................58
5.3 Sample mean results based on distance measure...........................59
5.4 Summary of statistical test results based on distance measure...........59
5.5 Hypotheses based on outlier score.......................................60
5.6 Normality tests for outlier scores......................................60
5.7 Sample mean results based on outlier score calculated by D algorithm..61


5.8 Sample mean results based on outlier score calculated by LOF algorithm....62
5.9 Summary of findings........................................................64
5.10 Relative HM percentage differences for the top policy....................67
5.11 Hypotheses based on cost measure.........................................70
5.12 Normality tests for cost variable........................................70
5.13 Cost (fixed budget) sample mean results..................................72
5.14 Cost (variable budget) sample mean results...............................72
5.15 Summary of statistical test results based on cost measure.................72
5.16 Relative HM percentage differences for the random policy..................79
5.17 Cost (fixed budget) sample mean results for random policy................79
5.18 Cost (variable budget) sample mean results for random policy............79
5.19 Summary of statistical test results based on cost measure.................80
5.20 Summary of comparison results for top policy..............................88
5.21 Summary of comparison results for random policy...........................88
5.22 Summary of findings.......................................................89


1. Introduction
1.1 Motivation and Overview
Deception is an everyday occurrence across all communication media.
Deception can be manifested in many forms, from the simple white lies that are
often told for the purpose of facilitating social interaction, to more serious lies that
involve crime or infidelity. Whether lies are small or serious, they involve a
conscious attempt to mislead another person by either concealing or giving false
information, along with willful manipulation of another individual's ability to
accurately assess the truthfulness of a statement or situation.
There has been an increasing interest in learning about deception and its
detection for many years. Various topics related to deception have been studied,
including deception in business practices. People tell more lies when they want to
appear likeable or competent, both important aspects for success in business
(Feldman et al. 2002). Since deception in business tends to hurt productivity and
profitability, any insights that researchers obtain from the study of deception may
have beneficial consequences. For example, auditors can develop a set of heuristics
to help them detect financial fraud (Johnson et al. 2001).
Detecting lies is often difficult. Many people overestimate their natural
ability to catch lies. In reality, the accuracy of detecting lies is at chance (around
50%) or lower (Feeley & deTurck, 1995). Another factor that contributes to the
generally low level of lie detection is the inability of people to detect reliable cues of
deception.
Because it is so difficult to detect the majority of lies people tell, efforts have
been made to improve deception detection. Research has discovered some reliable


indicators of deception (Zuckerman & Driver, 1985). Training to detect these cues
can be given to people for improving detection accuracy.
The majority of information exchanged on a daily basis normally involves
some level of deceit and is done using rich media (e.g., face-to-face, voice).
Therefore, research has mainly focused on richly mediated communication
channels. Research regarding the ability to identify deception in textual information
has been sparse at best. The ability to identify deceptive information in textual
forms can reduce revenue losses and decrease time spent following deceptive leads
or information. Therefore, it is important to expand the knowledge about deception
over text-based systems.
This research is concerned with a specific type of text-based deception:
document-centric deception. In document-centric deception an individual falsifies
a portion of her application for a benefit such as a position, financial aid, loan, or
admission to a university. Document-centric deception has emerged as an
important and costly deception area. Existing deception detection techniques
developed for applications in communication and physiology are not suitable for
discovering deception in these kinds of applications which have few or no
linguistic patterns.
Welfare fraud is a prominent form of document-centric deception. Welfare
fraud refers to various intentional misuses of state welfare systems by withholding
information or giving false or inaccurate information. This may be done in small,
uncoordinated efforts, or in larger, organized criminal rings. Some common types
of welfare fraud are failing to report a household member, failing to report income,
and providing false information. Welfare fraud creates a burden for taxpayers by
increasing the cost of programs. The U.S. Department of Agriculture (USDA)
estimates that about 8 percent of the food stamp benefit expenditures are


overpayments or payments to ineligible households. According to the statistics
from the United Council on Welfare Fraud (UCOWF), fraud was discovered in
upwards of 69% of the investigations conducted with total annual discovered fraud
amounts ranging from $10,000 to $1 million (United Council on Welfare Fraud,
2003).
Financial aid deception is another prominent form of document-centric
deception. Federal, state and private financial aid programs target their assistance
toward students with the least ability to pay for college. This targeting of aid is
based on student and parental self-reports about their financial condition.
Therefore, ensuring the accuracy of the information plays an important role in
equalizing the educational opportunities available to all students. Colleges and
universities routinely verify the accuracy of a subset of aid applications. A US
Department of Education audit of 2.3 million 1995-96 Pell grant recipients [L2000]
found that 4.4% (about 100,000) had reported income figures on their financial aid
applications that were lower than the figures reported to the IRS. According to the
report prepared by Rhodes and Tuccillo (Rhodes & Tuccillo, 2008), forty percent
of records in the random Quality Assurance sample data for Federal Student Aid
contain false information. They found that 30 percent of dependent student records
and 20 percent of independent student records have false data fields when schools
verified the information as part of the random sample process. This results in $2
billion (15.9%) of Pell dollars in 2005-06 at risk for an improper payment. To
determine false applications, a financial aid office usually adopts a simple policy
such as a 100% verification or random verification of a small number of
applications. Based on the report, on average schools chose to verify 50.7% of the
records. However, school verification was not very effective in terms of targeting


problematic records (Rhodes & Tuccillo, 2008). Also these verifications are labor
intensive involving substantial time and cost.
This research is motivated by the fact that data collection for data mining
methods involving document deception is costly because the true data without
deception and the deceptive data both need to be obtained. To lower the cost of data
collection, artificially generated deception data can be used to train the data mining
program, but the impact of using artificially generated deception data is not known.
A number of studies (Mannino & Koushik 2000; Kearns & Li, 1993; Mannino et
al. 2009; Zhu & Wu, 2004; Angluin & Laird, 1988) have used artificially generated
noise to study the sensitivity of classification algorithm performance to noise.
These studies have used conceptual noise models that provide a method to perturb
or change data from its true state to an incorrect state to generate noise using noise-
free training data. In contrast, other studies (Burgoon et al. 2003; Zhou, 2003) have
collected primary data containing deception data paired with ground truth data.
Previous work has not examined the fit between artificially generated deception
data and real deception data to understand the impact of using artificially generated
deception data in training of data mining algorithms.
This study investigates the fit between real deception data and artificial
deception data generated by data mining noise models and the impact of real and
artificial deception on screening method performance. Deception data and the
ground truth data are collected from financial aid applications, a document-centric
area with limited resources for verification. The data collection provides naturally
occurring deception in which subjects have some incentive to falsify applications
and deliberate deception in which subjects deliberately falsify their applications.
Two different experimental studies are conducted to investigate the relationship
between real and artificial deception. The first set of experiments compare the real


deception data (natural and deliberate) to the artificial deception data with outlier
score percentage change and directed distance percentage change as outcome
variables. The second set of experiments compare the performance of screening
policies on the real and artificial deception data. The performance of screening
policies is evaluated using an information theoretic measure and a cost model.
The results of this study will extend existing literature and provide guidance
to data mining researchers studying deception. If the experimental results indicate a
reasonable fit between artificial and real deception data, researchers will have some
justification for relying on artificial deception data especially when developing
application-independent techniques. If the experimental results indicate a poor fit,
researchers should reduce usage of artificial deception data when training their data
mining applications.
1.2 Thesis Contribution
The contributions of this dissertation are:
It is the first systematic study investigating the relationship between real
deception and artificial deception generated by a noise model.
The deception data and the ground truth are specifically collected from
financial aid applications. The data collection provides a unique data set
containing truth, natural deception, and deliberate deception.
A novel noise model that is called the application deception model is
proposed and developed. The proposed model considers the generation of
artificial deception in different application-based deception contexts, which
is the focus of this study.


An experimental design is developed to compare real deception with
artificial deception. The experiment involves two measures to compare
differences between real and artificial deception.
An experiment is designed to compare performance differences of screening
policies between real and artificial deception. The experiment provides
evidence about bias in using artificially generated deception to evaluate
performance of typical screening policies used in financial aid decision
making.
1.3 Thesis Outline
This dissertation consists of six chapters. After a brief introduction and
deception overview in Chapter 1, Chapter 2 systematically reviews the deception
and noise background and related research work. The issues associated with
gathering deception data are also discussed. Chapter 3 develops a novel method of
artificial deception generation as part of a comparison of real and artificial
deception. An experimental design is developed to depict the relationship between
real deception and artificial deception generated by a noise model. Chapter 4
describes an experiment designed to investigate the impact of real and artificial
deception on screening policies. In Chapter 5, the experiment results are presented
and analyzed. Chapter 6 summarizes the dissertation and proposes future research
directions.


2. Theoretical Analysis
The theoretical foundation for this research is drawn from a combination of
theories of deception and noise. To provide a context for this study, the literature
about deception and its successful detection is reviewed in this chapter. The issues
associated with gathering deception data are discussed. Also, a brief review of the
noise literature, specifically the impact of artificially generated noise on
classification algorithm performance is included.
2.1 Deception Literature Review
As Ekman (Ekman, 1992) and others have implied, everyone lies to some
extent, and lies can occur in any social situation and modality. An example is when
people are asked, "How are you?" and they reply "Good." In many cases these
people do not really feel good but give this answer because it is the socially
accepted and expected answer. Another example of a common lie is the white lie.
Someone may say that they like a co-worker's new haircut, when they really think
it looks less than flattering. Another common type of lie is when someone is asked
what he or she did today and the person responds with only part of what happened
to him or her during that day. This is an example of a lie of omission. It would
often be tedious and boring to describe and listen to every detail of a person's day.
The above examples of socially accepted and relatively harmless lies serve a
function in people's lives and are not included in most of the academic research on
deception and its detection.
As a multidisciplinary concept, deception has been defined in many ways.
For a lie to be considered an act of deception, the communicative exchanges
between people must involve perceptions by one or more of the people involved


that there is an intent to deceive (Miller & Stiff, 1993). A widely-used definition
for the term deception, and the one that will be used for this study, is "a message
knowingly transmitted by a sender to foster a false belief or conclusion by the
receiver" (Buller & Burgoon, 1996). Thus, deceptive communication consists of
messages and information knowingly transmitted to create a false conclusion
(Miller & Stiff, 1993). Messages that are unknowingly sent by a sender are not
considered deceptive, as there is no intention to deceive. Deception not only
includes outright lies. Evasions of the truth, equivocations, exaggerations,
misdirections, deflections, and concealments are also considered deception. These
forms of deceit are more common than outright lies (DePaulo et al. 1996). Thus,
deception can be conducted in many ways with the purpose of and motivation for
personal gain.
Deception detection aims to determine whether a piece of information is
deceptive. Whether governments protect their citizens from terrorists or
corporations protect their assets from fraud, many organizations are interested in
finding and exposing deception. The problem is that most people are poor at
detecting deception even when presented with all of the verbal and non-verbal
information conveyed in a face-to-face discussion. Numerous studies have noted
that the accuracy with which people typically identify deception is only slightly
better than chance (approximately 54%) (Bond & DePaulo, 2006). This poor
performance is not limited to laypersons, but is also found in professional lie-
catchers such as police officers and federal law enforcement officers. This issue is
more intense when the deception is conveyed in text because of the lack of
nonverbal cues. Furthermore, deception strategies may change from situation to
situation as the deceiver anticipates the interactions and attempts to fool possible
detectors.


To improve low deception detection accuracy, researchers have investigated
methods to assist in detection. These methods take advantage of physiological or
behavioral traits that appear in conjunction with deception. Perhaps the most
familiar tool used in deception detection is the polygraph. Other methods of
deception detection include criteria-based content analysis (Steller & Kohnken
1989) and scientific content analysis (Sporer, 1997). Each of these techniques
includes a set of criteria against which a suspect statement is compared.
While many methods exist to differentiate deception from truth, all methods
are tied together by one feature: they rely on a human operator to make the final
judgment. A potentially more promising approach is to integrate improved human
efforts with automated tools. Compared with the manual approach, the automatic
approach is more efficient and easier to use. Moreover, the enormous amount of
information generated in many deception environments makes it infeasible to
process it manually. The automatic prediction of deception can be achieved through
three steps: (1) identify significant cues to deception, (2) automatically derive the
cues from various media, and (3) build classification models for predicting
deception from new messages. Briefly speaking, in data mining, the goal is to
classify the data (message) into one of two categories (truth or deception) based on
its attributes (for example, linguistic cues). Research on cues and classification
methods makes these deception detection objectives possible.
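As a small, hedged illustration of step (3), the sketch below trains a decision tree on a handful of made-up linguistic cue values; the feature names, numbers, and labels are hypothetical and are not taken from the studies cited in this chapter.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # columns: word count, lexical diversity, informality; label: 0 = truth, 1 = deception
    X = np.array([[120, 0.62, 0.10], [45, 0.40, 0.35],
                  [150, 0.70, 0.05], [60, 0.38, 0.40]])
    y = np.array([0, 1, 0, 1])

    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    new_message = np.array([[50, 0.42, 0.30]])   # cue values derived from a new message
    print(clf.predict(new_message))              # predicted label for the new message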
Previous deception research has identified a rich set of cues to deception that
have been tested in lab or field environments (DePaulo et al. 2003). Researchers at
the University of Arizona have been developing computer-based methods of
deception detection which analyze linguistic and kinesic behavior in search of
deceptive cues. The methods analyze the movements and linguistic properties of
communication from one person engaged in a recorded face-to-face interaction.


The methods utilize supervised learning techniques from manually prepared
training sets to detect patterns in the linguistic and kinesic channels.
Linguistic methods are based on language features identified as promising
indicators of deceit in previous research (Burgoon et al. 2003). For example,
deceptive messages have been found to include higher informality and expressivity,
and lower wording diversity and complexity. Focusing on language behaviors,
rather than specific content, has the advantage that indicators derived from
language behaviors may be relatively independent of context and are more
conformable to simple parsing approaches. Moreover, deceivers may have control
over the content of their messages, but deceptive intent may still be delivered
through one's language use. Some progress has been made in identifying and
automatically deriving deception indicators from text by integrating findings and
methods from multiple relevant disciplines, including natural language processing,
criminal justice and linguistics (Zhou et al. 2004).
Kinesic methods seek to detect behavioral cues automatically. Empirical
evidence suggests that deceivers' heads and hands move differently than truth-
tellers'. For example, it has been observed that deceivers display significantly more
chin raises than truth-tellers.
and face regions using the color distribution from a digital image sequence. The
extracted features are summarized and are then used for classification.
Some investigations have also been focused on the third objective of building
classification models for predicting deceit by evaluating classification approaches
ability to discriminate truthful from deceptive messages. Many common machine
learning approaches, such as neural networks and decision trees, can automatically
build classification models from the existing data and then predict the outcome for
the new data. Neural networks have been found to provide good prediction in some


applications (Zahedi, 1996). There has also been at least one attempt at applying
decision trees in grouping messages into deceptive and truthful classes (Burgoon et
al. 2003). The work conducted by Zhou et al. (Zhou et al. 2004) extends prior work
on cues to deception by investigating four classification methods (discriminant
analysis, logistic regression, decision trees, and neural networks) for their
predictive power in discriminating truth from deception. Their results suggest that
all four methods were promising for predicting deception with cues to
deception. Among them, neural networks exhibited consistent performance and
were robust across test settings.
2.2 Data Collection Issue
With empirical methods such as statistical models and data mining, data
collection is crucial for improved decision making. However, data collection is
costly. A lot of research in marketing has been conducted to reduce data collection
costs. The costs of various data collection methods including traditional telephone,
postal and email surveys, and web-based surveys are investigated in the literature
(Wiseman et al. 1983; McDonald & Adam 2003). Acquiring deceptive data is even
more expensive since the ground truth data also needs to be obtained. In this case,
subjects may have to be given costly incentives to reveal the truth.
In data mining, supervised classifier learning requires data with class labels.
In many applications, collecting class labels can also be costly. For example, many
historical cases are needed when diagnostic models are trained. To train document
classifiers experts may need to read many documents and assign them labels.
Active learning is an important approach to reducing data-collection costs in
machine learning. The active learning literature (Cohn et al. 1994; Saar-Tsechansky
& Provost 2001) offers several algorithms for cost-effective label acquisitions.


Active learners acquire training data incrementally, using the model induced from
the available labeled examples to identify helpful additional training examples for
labeling. A number of utility measures have been proposed to indicate the
information value of acquiring labels for unlabeled cases. Active learning methods
have been empirically demonstrated to reduce the cost of label acquisition to
achieve a specified level of classifier performance.
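A minimal uncertainty-sampling loop in this spirit is sketched below; the pool, the oracle that supplies labels on request, and the label budget are assumptions made for illustration rather than any specific algorithm from the cited literature.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_learning(X_labeled, y_labeled, X_pool, oracle, budget):
        # assumes y_labeled already contains at least one example of each class
        model = LogisticRegression()
        for _ in range(budget):
            model.fit(X_labeled, y_labeled)
            # utility measure: label the pool case whose predicted class is least certain
            probs = model.predict_proba(X_pool)[:, 1]
            i = int(np.argmin(np.abs(probs - 0.5)))
            X_labeled = np.vstack([X_labeled, X_pool[i]])
            y_labeled = np.append(y_labeled, oracle(X_pool[i]))
            X_pool = np.delete(X_pool, i, axis=0)
        return model.fit(X_labeled, y_labeled)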
Data collection is costly and time consuming. Also, there are many problems
involved in ensuring that data collected is accurate. The quality of the data acquired
can be affected by the statistical problems of sampling and measurement errors.
Sampling error: Sampling error occurs during the process of selecting a sample
from the frame population. It arises from the fact that not all members of the frame
population are measured. The sample used for a particular survey is only one of a
large number of possible samples of the same size and design that could have been
selected. Even if the same questionnaire and instructions were used, the estimates
from each sample would differ from the others.
Measurement error: Measurement error is the deviation of the answers of
respondents from their true values on the measure. In both self-administered and
interviewer-administered surveys, measurement errors could arise from the
respondent or from the instrument. Terms that are unclear to respondents, lack of
motivation, concerns about the confidentiality of their answers, or deliberate distortion may
cause errors. The question wording and the design of the survey instrument (such
as the placement of questions, flow of instrument, typographical feature, etc.) also
affect the accuracy of data collected.


2.3 Noise Literature
Noise is defined by Quinlan as non-systematic errors in either the values of
attributes or class information. A number of theoretical noise models have been
proposed as extensions to the theory of Probably Approximately Correct (PAC)
learning (Valiant, 1984). Valiants PAC model of learning is one of the most
important models for learning from examples: a system in which the learning
algorithm develops classification rules that can be used to determine the class of an
object from its attributes. PAC learning provides a nice formalism for deciding how
much data you need to collect in order for a given classifier to achieve a given
probability of correct predictions on a given fraction of future test data.
Although the PAC model better reflects the requirements of learning in the
real world and thus has been widely adopted, one drawback of the PAC model is
that the data used for learning is assumed to be noise free. In many environments,
however, there is always some chance that an erroneous example is given to the
learning algorithm. In a training session for a learning algorithm, this might be due
to incorrectly measuring an input, wrongly reporting the state of an input,
relying on stale values, or using imprecise measurement devices. Input errors in a
training set can cause a learning algorithm to form a rule with an incorrect state for
an input, while input errors in cases to be classified can cause the wrong rule to be
used. To combat this deficiency, a number of theoretical noise models have been
introduced into the theory of PAC learning. These models have been classified by
source (random or adversary) and target (attribute versus class).
The first noise model, called the Random Classification Noise model, was
introduced in (Angluin & Laird, 1988). In this model, the adversary flips a biased
coin before providing each example to the learning algorithm; whenever the coins


shows H, which happens with probability g, the classification of the example is
flipped and so the algorithm is provided with the wrongly classified example.
Another model is the Malicious Noise model introduced in (Kearns & Li, 1993). In
this model, the adversary can replace the example whenever the g-biased coin
shows H. This gives the adversary the power to distort the distribution D. There
are also two noise models in which the examples are corrupted by purely random
noise affecting only the instances (not the labels). They are called Uniform Random
Attribute noise model and Product Random Attribute noise model (Goldman &
Stone, 1995). For uniform attribute noise, each attribute is flipped independently at
random with the same probability. In contrast, in the product attribute
noise model, each attribute is flipped randomly and independently with its own
probability. These noise models are summarized in Table 2.1.
Table 2.1: Summary of theoretical noise models
Reference | Noise Model | Description
Goldman & Stone, 1995 | Uniform random attribute noise | Each attribute is flipped independently at random with the same probability
Goldman & Stone, 1995 | Product random attribute noise | Each attribute is flipped randomly and independently with its own probability
Angluin & Laird, 1988 | Random classification noise | Class noise. The label is inverted.
Kearns & Li, 1993 | Malicious attribute noise | The example may be maliciously selected by an adversary who has infinite computing power and knowledge of the target concept. The nature of the noise is unknown or unpredictable.
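As a rough illustration, the attribute and class noise models in Table 2.1 can be sketched in a few lines of Python. The sketch below assumes categorical attributes with known domains and a binary class label; the function names and data are illustrative, not taken from the cited studies.

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt_attributes(X, domains, rates):
        # Attribute noise: with probability rates[a], replace attribute a of a record
        # by a different value drawn uniformly from its domain.  Uniform random
        # attribute noise uses the same rate for every attribute; product random
        # attribute noise uses one rate per attribute.
        X_noisy = X.copy()
        n, m = X.shape
        for a in range(m):
            hit = rng.random(n) < rates[a]
            for i in np.where(hit)[0]:
                alternatives = [v for v in domains[a] if v != X[i, a]]
                X_noisy[i, a] = rng.choice(alternatives)
        return X_noisy

    def flip_labels(y, eta):
        # Random classification noise: invert the 0/1 label with probability eta.
        flip = rng.random(len(y)) < eta
        return np.where(flip, 1 - y, y)

    X = np.array([[0, 2], [1, 0], [2, 1]])
    domains = [[0, 1, 2], [0, 1, 2]]
    X_uniform = corrupt_attributes(X, domains, rates=[0.10, 0.10])   # uniform noise
    X_product = corrupt_attributes(X, domains, rates=[0.05, 0.30])   # product noise
    y_noisy = flip_labels(np.array([0, 1, 1]), eta=0.10)             # class noise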


Noise models have been used in a number of studies to investigate the
sensitivity of noise on classification algorithm performance. Using the uniform
attribute noise model, Quinlan (Quinlan, 1986a, b) found sensitivity of the ID3
decision tree induction algorithm to attribute noise. He demonstrated that the
classification accuracy of ID3 was worse when trained on a noise-free training set if
the level of field noise was high (45% or greater). Nolan (Nolan, 2001) applied the uniform
random attribute noise model and empirically studied the effect of attribute noise
on several prominent classification algorithms (C5.0, back propagation neural
network, and linear discriminant analysis). He found that the neural network
performed significantly better than the other algorithms when noise levels exceeded
10%. Zhu and Wu (Zhu & Wu, 2004) studied the effects of data cleaning on the
performance of C4.5 using the uniform attribute noise model and the class noise
model. They compared predictive accuracy of C4.5 on combinations of clean and
noisy training and test sets. The emphasis in their study is the value of data
cleaning either in training data or field data. They reached a number of conclusions
that are counter to the conventional wisdom established by Quinlan (Quinlan
1986a, b) that training data should contain noise representative of field data. To
extend previous studies on classification algorithms' sensitivity to noise, Mannino
and Yang (Mannino et al. 2009) emphasize asymmetric levels of noise (under and
over representation of attribute noise) and use three noise models: uniform attribute
noise, product attribute noise, and importance attribute noise. Their results
contradict conventional wisdom indicating that investments to achieve
representative noise levels may not be worthwhile. In other studies, artificially
generated noise has also been used in testing intrusion detection systems (McHugh,
2000) because of privacy and the sensitivity of actual intrusion data. The noise
models used in these studies are summarized in Table 2.2.


Table 2.2: Summary of noise models used in previous studies
Reference | Noise Model Used
Quinlan, 1986a,b | Uniform random attribute noise
Mannino & Koushik, 2000 | The level of noise was either implicitly known through a sample of the noise process or explicitly known through an external parameter
Nolan, 2001 | Uniform random attribute noise
Zhu & Wu, 2004 | Uniform random attribute noise; class noise
Jiang et al. 2005 | Uniform attribute noise
Mannino et al. 2007 | Uniform random attribute noise; product random attribute noise; importance attribute noise
In early applied studies, the most popular noise handling technique is
decision tree pruning. Pruning techniques reduce specialization by eliminating rules
in whole or part. They are useful to handle noise because noise in a training set can
lead to extra rules and highly specialized rules. Quinlan (Quinlan, 1986b) found that
pruned decision trees perform better than un-pruned decision trees in the presence
of input data noise. While pruning is usually applied as a post processing technique
(i.e., the tree is pruned after it is induced), other approaches prune the decision tree
during its construction, i.e., the induction process itself is modified to cope with
noise. In addition to pruning, a number of fuzzy learning methods have been used
to derive fuzzy rules that perform well with noise and/or incomplete training data
(Hong & Chen, 2000; Wu et al. 2003).
In contrast to pruning techniques, Mookerjee et al. (Mookerjee et al. 1995)
investigated explicit noise handling using clean training data along with a noise
parameter. Explicit noise handling adds noise to clean training data in a controlled


manner using the noise parameter. The study demonstrated both analytically and
empirically that explicit noise handling has the same expected performance but
lower variance than traditional techniques using noisy training data.
Jiang et al. (Jiang et al. 2005) conducted another study of explicit handling
of noisy input data, this time on the web. Although the term noise is used in the
paper, the study actually deals with deception on the web. A variety of factors
contribute to the presence of web deception. The most significant cause of
deception on the web is the deliberate falsification of input data by web users. Web
users also lie to protect their privacy and to guard against possible misuse by firms of
any personal data they provide. Another factor that contributes to lying is that there is
no face-to-face interaction between the user and the organization's agents in an online
environment. Thus, there are no visual or non-verbal cues that could potentially
help an agent recognize that a user is lying. In their study, a wide range of noise
levels is considered but the same distortion level is applied for all inputs. To cope
with deception, two methods are proposed knowledge base modification (KM)
and input modification (IM). The KM method considers modifying the knowledge
base (a decision tree) to account for distortion in the inputs provided by the user. It
is appropriate when the distortions in inputs are relatively stationary (input noise
levels do not change much over time). The IM method involves a preprocessing
step during which the observed inputs are modified to account for distortions. This
method involves modifying an observed input to the most likely true value of the
input given the observations made by the system. The modified input is then fed
into the existing (unmodified) knowledge base. In the KM method, a revised
decision tree is obtained, which specifies optimal recommendations for all feasible
combinations of observed input values. The IM method does not require any
modification of the original decision tree.
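A minimal sketch of the IM idea is shown below, under one simple assumed distortion model: a user reports the true value with probability 1 - g and otherwise reports some other value uniformly at random. The value set, prior, and g are illustrative and are not the actual model of Jiang et al.

    def most_likely_true_value(observed, values, prior, g):
        # P(observed | true): truthful report with probability 1 - g,
        # otherwise any other value with equal probability
        k = len(values)
        def likelihood(true_value):
            return (1 - g) if observed == true_value else g / (k - 1)
        posterior = {v: likelihood(v) * prior[v] for v in values}
        return max(posterior, key=posterior.get)

    values = ["low", "medium", "high"]
    prior = {"low": 0.2, "medium": 0.5, "high": 0.3}
    # with heavy assumed distortion, an observed "low" is corrected to the more common "medium"
    print(most_likely_true_value("low", values, prior, g=0.6))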


2.4 Artificial Data Generation
It is widely recognized that real data must play a vital role in evaluation. Real
data by definition includes the types of pattern and regularity found buried in data
that reflects the real world. Such evaluations are therefore vital in establishing the
credibility of a data mining procedure.
However, real data also have serious disadvantages for systematic testing.
The most important is the fact that relatively little is known about
the structural regularities of a real data set. The goal of data mining is to discover
such regularities. However, if the patterns to be discovered are not known, how can
the investigator determine a data mining program's success at detecting patterns?
Another problem is the need to alter the degree of difficulty of the data sets since a
particular type of difficulty in the data affects the performance of a data mining
procedure. Furthermore, it may be impossible or very difficult to acquire the
needed amount or type of data for legal and competitive reasons. To circumvent
these data problems and work on a particular data type, one alternative is to create
synthetic data which closely matches actual data.
In the intrusion detection area, some work has been done using synthetic test
data. Puketza et al. (Puketza et al. 1996) describe a software platform for testing
intrusion detection systems where they simulate user sessions using the UNIX
package expect. Using the expect language, they can write scripts that include
intrusive commands. For running the scripts, expect provides a script interpreter
which issues the script commands to the computer system just as if a real user had
typed in the commands. In the fraud detection area, Barse et al. (Barse et al. 2003)
proposed a five-step synthetic data generation methodology using authentic normal


data and fraud as a seed. They argue that synthetic data can be used for training
and testing a fraud detection system.
Although artificially generated data and noise models have been used in the
literature of data mining and information systems, previous work has not examined
the relationship between artificially generated deception and real deception to
understand the impact of using artificially generated deception in training of data
mining algorithms. As will be shown in the following chapter, a set of experiments
will be proposed for studying the relationship between real deception and artificial
deception. In addition, the measures will be defined and hypotheses related to the
experiment will be presented.


3. Analysis and Modeling of Real Deception
This chapter extends traditional studies with a focus on the fit between
deceptive patterns and data mining noise models. If the fit is reasonable, artificially
generated deception data can be used instead of (or in addition to) real deceptive
data to reduce the costs associated with obtaining training data for the data mining
application. A set of experiments are conducted to compare the real deceptive data
(natural and deliberate) to the artificial deceptive data using outlier score and
directed distance percentage change as outcome measures. A new noise model,
which is called the application deception noise model, is proposed. The proposed
model considers the generation of the artificial deception in the different
application-based deception contexts. The detailed experiment design, research
hypotheses, and the new noise model are described next.
3.1 Research Model and Hypotheses
The research model tested in the dissertation is illustrated below in Figure 3.1.
The research model describes the proposed relationships between real deception
and artificial deception. The real deceptive data are collected for a student financial
aid application with two levels of treatments: natural and deliberate deception. The
artificial deception data is generated using a noise model to corrupt feature values
according to a specified noise parameter. Based on the preceding literature review,
it is expected that different noise models will have different impacts on the
relationship between real and artificial deception. Therefore, the relationship
between real and artificial deception is analyzed under two noise models: variable
noise model (Goldman & Stone, 1995) (different noise rates on attributes) and
application deception noise model (different noise manipulations on three groups of


variables: status variables, financial variables, and merit-based variables). The
noise parameter is set to be consistent with the deception level in the real deceptive
data set.
Figure 3.1: Research Model
Under the natural setting, subjects provide the information with few fields
changed because of their awareness of verification. Under the deliberate setting,
applicants are encouraged to appear as competitive as possible in order to increase
their chances to get the aid; therefore, deliberate deception introduces more
deception. To be consistent with the different amounts of deception involved in the
natural and deliberate deception data sets, the artificial deception data sets are
generated using two noise levels: low and high. Low noise rate generated deception


is compared with natural deception, while high noise rate generated deception is
compared with deliberate deception.
As mentioned in the previous chapter, a noise model provides a method to
perturb or change data from its true state to an incorrect state. The noise models
introduced in the literature generate non-systematic noise by corrupting the original
feature values randomly without considering simulating actions. The variable noise
model adopted here assigns the noise rate to each feature randomly and
independently. The variable noise should not make an observation appear more
unusual because perturbations from the truth are symmetric with a mean level of
perturbation near zero. Therefore, the artificial deception pattern modeled by the
variable noise may not fit the real deception pattern since deception involves
systematic noise.
To simulate deception using the noise model, data should be perturbed
purposefully. The deception noise model proposed in this study simulates deceptive
actions by manipulating groups of features based on scenarios of deception
behavior. In this noise model, noise will be on one side: to make an applicant
appear stronger. Thus, it is predicted that the pattern of real deception matches the
deception pattern in the artificial data set modeled by deception noise.
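The contrast between symmetric variable noise and one-sided deception noise can be sketched as follows; the noise rates, scales, and ideal value are illustrative and are not the parameters used in this study.

    import numpy as np

    rng = np.random.default_rng(1)

    def variable_noise(x, rate, scale):
        # symmetric perturbation: selected values move up or down around the truth
        hit = rng.random(len(x)) < rate
        return np.where(hit, x + rng.normal(0.0, scale, len(x)), x)

    def deception_noise(x, rate, ideal):
        # directed perturbation: selected values move only toward the ideal value
        hit = rng.random(len(x)) < rate
        shift = rng.uniform(0.0, 1.0, len(x)) * (ideal - x)
        return np.where(hit, x + shift, x)

    income = np.array([40000.0, 55000.0, 70000.0])         # lower reported income looks needier
    print(variable_noise(income, rate=0.3, scale=5000))     # moves up or down
    print(deception_noise(income, rate=0.3, ideal=0.0))     # only moves down, toward the ideal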
Based on the analysis, four hypotheses are presented, with the first two
dealing with the relationship between natural and artificial deception generated
using low noise level, and with the other two focusing on comparing deliberate and
artificial deception generated using high noise level. The hypotheses are based on
the differences between deception and noise. Noise is symmetric change from the
truth without a goal direction. Deception (real or artificial) is change from the truth
directed towards a goal. Thus, it is expected that variable noise will be significantly


different than real deception but artificial deception will be similar to real
deception.
Natural deception vs. artificial deception
H1a: There will be a significant difference between natural deception and
artificial deception modeled by low variable noise.
H1b: There will be no significant difference between natural deception and
artificial deception modeled by low deception noise.
Deliberate deception vs. artificial deception
H2a: There will be a significant difference between deliberate deception and
artificial deception modeled by high variable noise.
H2b: There will be no significant difference between deliberate deception and
artificial deception modeled by high deception noise.
3.2 Experimental Methodology
The relationship between real and artificial deception is analyzed by two
experiments. The first experiment is set to analyze the relationship between
artificial and natural deception, while the second is to reveal the relationship
between artificial and deliberate deception. Each experiment involves two
comparisons between the real deceptive data and the data perturbed with noise
using a noise model and corresponding noise parameters as shown in Table 3.1. To
manipulate data with noise, a noise rate that is consistent with the deception level in
the real deception data set is applied to the truth data depending on two models of
noise. The detailed descriptions of the noise models and the methods of
manipulation are presented in section 3.6. After the process of generating noise is
completed, a comparison between the real deceptive data and the noise modeled
data is performed using the directed distance and outlier score measures.


Table 3.1: Experimental comparisons between real and artificial deception
Experiments | Real Deception | Artificial Deception
Experiment 1 | Natural | Variable-Low; Deception-Low
Experiment 2 | Deliberate | Variable-High; Deception-High
In order to determine the sample size before the data is collected, the
standard study using α = 0.05 and having a power of 0.80 is considered. The desired
width of a confidence interval is 7%. Therefore, a sample size of 150 per cell in
Table 3.2 is adequate.
Table 3.2: Sample sizes
Experiments | Sample Sizes
Natural vs. Variable + Low | 150
Natural vs. Application + Low | 150
Deliberate vs. Variable + High | 150
Deliberate vs. Application + High | 150
Hypotheses will be tested with ANOVA followed by post hoc t-tests of
group means. A statistically significant outcome only indicates that it is likely that
there is a difference between group means. It does not mean that the difference is
large or important. To judge the size of the difference and describe the strength of
the relationship when the statistical test result is significant, the effect size needs to
be calculated. The most commonly used guideline to interpret effect size is
provided by Cohen (Cohen, 1988): an effect size of 0.1 to 0.3 might be a small
effect, around 0.3 to 0.5 a medium effect and 0.5 to 1.0 a large effect. The


number might be larger than one. In this study, the effect size is computed using
Hedges' g measure, which can be computed from the value of the t test of the
differences between the two groups (Hedges, 1981). The formulas are:

g = t × √((n1 + n2) / (n1 × n2))    (3.1)

g = 2t / √N    (3.2)

The formula with separate n's (equation 3.1) should be used when the group sizes are
not equal. The formula with the overall number of cases, N (equation 3.2), should be
used when the group sizes are equal.
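A small helper corresponding to equations 3.1 and 3.2 as reconstructed above might look like the following sketch; the exact form should be checked against Hedges (1981).

    from math import sqrt

    def hedges_g(t, n1, n2):
        if n1 == n2:
            return 2 * t / sqrt(n1 + n2)           # equation 3.2, with N = n1 + n2
        return t * sqrt((n1 + n2) / (n1 * n2))     # equation 3.1, unequal group sizes

    print(hedges_g(2.5, 150, 150))                 # e.g., t = 2.5 with 150 cases per cell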
3.3 Dependent Variables and Measures
To analyze the fitness between real and artificial deception, two different
measures of relationship were developed for this study, directed distance and
outlier score.
3.3.1 Directed Distance
To measure the relationship between real and artificial deception, one
method is to compare the distances from the truth to both types of deception. Figure
3.2 illustrates the comparison between real and artificial deception with the distance
measure. The distances from the truth to the real and artificial deception are
denoted by D1 and D2. Each observations distance from the truth to both types of
deception is calculated across all attributes.


Figure 3.2: The Distance from Truth to Real and Artificial Deception
Standard distance measures only involve the amount of change because
they are symmetric. However, in terms of deception and noise models, the change in
feature values can happen either toward the goal or away from it. Because both the
direction and magnitude of change are important, directed distance should be
measured. As a similarity measure, directed distance has been used for object
matching in images (Sim et al. 1999) and graph partitioning problems (Charikar et
al. 2006).
The directed distance contains a sign indicating a positive or negative sense.
In this study, the sign of the directed distance is determined using the method
shown in Table 3.3. It is based on the distance from the ideal candidate. Each
attribute has an extreme value indicating the ideal candidate. If deception moves
closer to the ideal candidate than the truth does, the directed distance is positive. If
the change moves away from the ideal candidate, the directed distance is negative.
Table 3.3: The sign for directed distance
Value Comparison | Ideal Goal of Deception | Sign (s)
Value_deception > Value_truth | Maximize the value | s = 1
Value_deception > Value_truth | Minimize the value | s = -1
Value_deception < Value_truth | Maximize the value | s = -1
Value_deception < Value_truth | Minimize the value | s = 1


Based on the directed distance measure, the outcome variable is the percentage
change in directed distance (PD), defined as 100 × (Deception − Truth) / Truth.
Percentage change is easier to interpret than just the directed distance between truth
and deception. The definitions of percentage change in directed distance for real and
artificial deception are:

PD_natural = 100 × (Natural − Truth) / Truth    (3.3)

PD_deliberate = 100 × (Deliberate − Truth) / Truth    (3.4)

PD_variable = 100 × (Variable − Truth) / Truth    (3.5)

PD_application = 100 × (Application − Truth) / Truth    (3.6)
To handle applications with both numeric and non-numeric attributes, a
heterogeneous distance function that uses different attribute distance functions is
used. This study uses the overlap metric for nominal attributes and normalized
distance for linear attributes [WM1997]. The heterogeneous distance function
defines the distance between two values x and y of a given attribute as:
d_a(x, y) = 1, if x or y is unknown; else
            overlap(x, y), if a is nominal; else
            rn_diff_a(x, y)     3.7

The function overlap and the range-normalized difference rn_diff are defined as:

overlap(x, y) = 0, if x = y; 1, otherwise     3.8

rn_diff_a(x, y) = |x - y| / range_a     3.9

The value range_a is used to normalize the attributes and is defined as the difference between the maximum and minimum values of attribute a. The overall directed distance between two input vectors x and y is given in equation 3.10 by the Heterogeneous Directed Distance function HDD(x, y):

HDD(x, y) = sum over a = 1..m of d_a(x_a, y_a) x s     3.10
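A possible implementation of the heterogeneous directed distance is sketched below, assuming a simple per-attribute schema (scale, observed range, and deception goal). The schema layout, attribute names, and the rule that any nominal change counts toward the goal are illustrative assumptions, not the thesis data format.

```python
def attribute_distance(x, y, kind, value_range=None):
    """Per-attribute distance d_a from Eq. 3.7: overlap metric for nominal
    attributes, range-normalized difference for linear (numeric) attributes."""
    if x is None or y is None:
        return 1.0                                   # unknown value
    if kind == "nominal":
        return 0.0 if x == y else 1.0                # overlap, Eq. 3.8
    return abs(x - y) / value_range                  # rn_diff, Eq. 3.9

def hdd(truth, deception, schema):
    """Heterogeneous Directed Distance (Eq. 3.10) between a truth record and a
    deception record, summed over the attributes described in schema."""
    total = 0.0
    for a, meta in schema.items():
        d = attribute_distance(truth[a], deception[a], meta["kind"], meta.get("range"))
        if truth[a] == deception[a]:
            s = 0
        elif meta["kind"] == "nominal":
            s = 1   # assumption: a nominal change is taken as a move toward the goal
        else:
            moved_up = deception[a] > truth[a]
            s = 1 if moved_up == (meta["goal"] == "max") else -1
        total += d * s
    return total

schema = {"gpa": {"kind": "linear", "range": 4.0, "goal": "max"},
          "status": {"kind": "nominal"}}
print(hdd({"gpa": 3.0, "status": "dependent"},
          {"gpa": 3.8, "status": "independent"}, schema))   # 0.2 + 1.0 = 1.2
```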
3.3.2 Outlier Score
An outlier is defined as a data point which is very different from the rest of
the data based on some measure. Outliers arise due to mechanical faults, changes in
system behavior, fraudulent behavior, network intrusions or human errors. As a
fundamental issue in data mining, outlier detection has been used to detect and
remove anomalous objects from data. The outliers themselves may be of particular
interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier analysis has been used to detect fraudulent
patterns that are substantially different from the main characteristics of regular
credit card transactions (Han & Kamber, 2001; Wheeler & Aitken, 2000).
Outlier scores have been applied to detect fraud. An outlier score is a measure of the extent of unusualness. In some work, outlier scores are calculated using the Mahalanobis distance (Barnett & Lewis, 1994) or the quadratic distance (Fawcett & Provost, 1999; Knorr & Ng, 1998). Yamanishi et al. (Yamanishi et al. 2004) demonstrated the unsupervised SmartSifter algorithm on medical insurance data; using the Hellinger distance, it assigns each datum a score, with a high score indicating a high probability of being a statistical outlier. An experimental application to
network intrusion detection shows that the algorithm was able to identify data with
high scores that corresponded to attacks.
In this study, outlier detection algorithms are applied to the data with real
and artificial deception. Each observation is assigned an outlier score that indicates
the unusualness of the observation relative to other observations. An outlier score
can also be considered as a multivariate measure of dispersion of an outlier relative
to other outliers.
The relationship between real and artificial deception is measured by testing
the average outlier score. Noise should have more dispersion than deception
because noise is symmetric and deception is directed. Deception should tend to
increase the density of the observations towards the goal.
Outlier detection methods can be categorized into parametric and non-parametric approaches. Parametric approaches define outliers based on a known distribution. However, for outlier identification in large multidimensional data sets, the underlying distribution is unknown. In contrast to these methods, data-mining related methods are often non-parametric: they do not require assuming an underlying generating model of the data. These methods are designed to manage large databases from high-dimensional spaces. This study adopts two methods in this category: one is the distance-based nearest neighbor algorithm, and the other is the density-based approach.
Distance-based methods were originally proposed by Knorr and Ng (Knorr
et al. 2000). The basic nearest neighbor algorithm defines an observation as a distance-based outlier if at least a fraction p of the observations in the dataset are at a distance greater than X from it. However, as pointed out by Acuna and Rodriguez (Acuna & Rodriguez, 2004), this definition
has certain difficulties such as the determination of X and the lack of a ranking for
the outliers. Furthermore, the two algorithms proposed are either quadratic in the
data set size or exponential in the number of dimensions. Hence it is not an
adequate definition to use with datasets having a large number of instances.
In later work (Ramaswamy et al. 2000), the definition of outlier was modified to address the above drawbacks. Outlier detection is based on the distance to the k-th nearest neighbor of a point p, denoted Dk(p). In the definition of Ramaswamy et al., given integers k and n, a point p is an outlier if no more than n-1 other points in the data set have a higher value for Dk than p. This means that outliers are the top n instances with the largest distance to their k-th nearest neighbor. The main idea of the algorithm is that for each instance in D one keeps track of the closest neighbors found so far. When an instance's closest neighbors reach the specified k value, the instance is removed because it can no longer be an outlier. These outliers are referred to as the Dk outliers of a dataset. The basic nearest neighbor algorithm and the Dk variation both find global outliers.
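A brute-force sketch of the Dk score (the distance to the k-th nearest neighbor) is shown below; it assumes a numeric feature matrix and synthetic data, and it illustrates only the definition, not the pruning algorithm of Ramaswamy et al.

```python
import numpy as np

def dk_outlier_scores(X: np.ndarray, k: int) -> np.ndarray:
    """Distance to the k-th nearest neighbor for every observation (the Dk score).
    The top-n scores would be flagged as Dk outliers."""
    diffs = X[:, None, :] - X[None, :, :]       # pairwise differences
    dist = np.sqrt((diffs ** 2).sum(axis=2))    # Euclidean distance matrix
    np.fill_diagonal(dist, np.inf)              # exclude each point from its own neighbors
    return np.sort(dist, axis=1)[:, k - 1]      # distance to the k-th nearest neighbor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
scores = dk_outlier_scores(X, k=10)
top_n = np.argsort(scores)[-5:]                 # indices of the 5 strongest outliers
```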
Another method is the density-based approach presented in (Breunig et al. 2000), where a new notion of local outlier is introduced that measures the degree to which an object is an outlier with respect to the density of its local neighborhood. This degree is called the Local Outlier Factor (LOF) and is assigned to each object. The LOF algorithm uses local densities to avoid anomalous situations in which many points would be considered global outliers. Although the LOF algorithm does not require explicit clustering, it finds outliers that lie outside of clusters. The local outlier factor for an object is the mean of the ratios of the local densities of the object's neighbors to the object's own local density. This factor is near 1 for objects in dense neighborhoods and grows larger for objects in sparse neighborhoods.
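If scikit-learn is available, LOF scores can be obtained as in the short sketch below; the data are synthetic and the choice of k (n_neighbors) is arbitrary, made only to illustrate the call.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))

lof = LocalOutlierFactor(n_neighbors=10)   # k chosen relative to the data set size
lof.fit(X)
# negative_outlier_factor_ holds the negated LOF; larger positive values below
# indicate observations that are more outlying than their neighbors
lof_scores = -lof.negative_outlier_factor_
print(lof_scores[:5])
```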
Both outlier approaches involve the distance to the k-th nearest neighbor. The outlier score in the Dk approach is the distance to the k-th nearest neighbor itself. The outlier score in the LOF approach is an adjusted k nearest neighbor score. The choice of k should be made relative to the size of the data set.
3.4 Real Deception Data Collection
Participants in the data collection were students enrolled in undergraduate
level courses. Experimental data were collected from the subjects completing a
hypothetical financial aid application. Financial aid can be divided into two main
categories, need-based and merit-based. This study adopts a merit-based scholarship application along with items drawn from a need-based application. The scholarship
application form was created by combining the items from the scholarship
application in UCD Business School with the Free Application for Federal student
aid (FAFSA, 2008). The data collection instrument is presented in Appendix A.
The data were collected in a web-based setting. Subjects were told that the purpose
of the research is to investigate the relationship between actual and artificial
deception in student financial aid applications. Subjects needed to complete a
consent form and were informed that they could terminate their participation at any
time. Instructions to fill out the application were provided. The participation was
limited to completing the same financial aid application three times. In the first
completion, subjects were told to provide their natural responses in completing a
financial aid application in which they perceive little chance for detection of
deception. In the second completion, subjects were told to correct all dishonest
responses in their original application. In the third completion, subjects were
instructed to deliberately respond unethically to make themselves appear to be as
competitive as possible.
Obtaining deception in a natural environment without subject awareness of
this study would help to increase the reliability of data. The original data collection
plan was that the subjects complete the financial aid application without knowing
the real reason for providing the data. The subject would be told about the research
purpose after completing the financial aid application. However, in order to conduct the research in accordance with the human experimentation requirements of the academic human subjects review board, the real research purpose had to be disclosed to the subjects before their participation. As a result of this constraint, the data collection approach adopted in this study is the best feasible alternative.
3.5 Artificial Noise and Deception Models
Noise occurs when the true input state is perturbed by a measurement
process. A noise model provides a method to perturb or change data from its true
state to an incorrect state. A noise model should allow simulation of a real data
generation process with both correct and incorrect data. The Variable Noise Model
(Goldman & Stone, 1995) has been used in studies of noise handling in the data
mining literature. Deception occurs when an individual knowingly perturbs the true
input state to achieve some goal. Since there has not been an artificial deception
model proposed in the literature, the Application Deception Model (ADM) is
developed for this project.
3.5.1 Variable Noise Model
For the variable noise model, each attribute is flipped randomly and
independently with its own probability pi, all of which are less than the given upper
bound for the noise rate. Each attribute is randomly assigned a noise level from a uniform distribution within the specified range, and a random number between 0 and 1 is drawn. If the number is less than or equal to the attribute's noise probability, the value is changed. If the attribute's scale is nominal, the value is randomly changed to any other value. If the attribute's scale is ordinal, the attribute is changed to an adjacent value. If the attribute is numeric (ratio or absolute scale), the value is changed using an equal-height histogram with at most 10 ranges. A smaller number of ranges is used for highly skewed numeric attributes. After randomly selecting an adjacent cell of the equal-height histogram, a value is randomly selected between the end points of the cell.
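A minimal sketch of this perturbation scheme follows, assuming each attribute is described by its scale and either an ordered value domain or a list of histogram-cell boundaries; the schema layout is an assumption introduced for illustration and not the thesis implementation.

```python
import random

def variable_noise(record, schema, upper_bound):
    """One pass of the variable noise model over a single record.

    schema maps attribute -> {'scale': 'nominal' | 'ordinal' | 'numeric',
                              'values': ordered domain, or (low, high) histogram cells}.
    Each attribute gets its own noise level drawn below upper_bound.
    """
    noisy = dict(record)
    for a, meta in schema.items():
        p_i = random.uniform(0, upper_bound)          # per-attribute noise level
        if random.random() > p_i:
            continue                                  # attribute left unchanged
        if meta["scale"] == "nominal":
            others = [v for v in meta["values"] if v != record[a]]
            noisy[a] = random.choice(others)          # any other value
        elif meta["scale"] == "ordinal":
            idx = meta["values"].index(record[a])
            last = len(meta["values"]) - 1
            step = random.choice([-1, 1]) if 0 < idx < last else (1 if idx == 0 else -1)
            noisy[a] = meta["values"][idx + step]     # adjacent value
        else:  # numeric: move to an adjacent cell of the equal-height histogram
            cells = meta["values"]                    # assumed to cover the data range
            idx = next(i for i, (lo, hi) in enumerate(cells) if lo <= record[a] <= hi)
            last = len(cells) - 1
            step = random.choice([-1, 1]) if 0 < idx < last else (1 if idx == 0 else -1)
            lo, hi = cells[idx + step]
            noisy[a] = random.uniform(lo, hi)         # value within the adjacent cell
    return noisy
```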
3.5.2 Application Deception Model
The noise models in the literature have been proposed as extensions to the theory of Probably Approximately Correct (PAC) learning. The noise perturbation corrupts the original feature values randomly without considering causative actions.
Thus, these noise models generate non-systematic noise. However, to simulate
deception, data should be perturbed purposefully since deception involves
systematic noise. Therefore, the previous noise models may not be suitable for
studying deception schemes. It is necessary to develop a new model to structure the
methodology of simulating deceptive actions so that the data are randomly
perturbed according to deception objectives. The Application Deception Model
supports generation of artificial deception in different application-based deception
contexts, which are the major focus of this study.
3.5.2.1 Description of the Application Deception Model
The Application Deception Model involves scenario analysis, feature
grouping, and data perturbation as depicted in Figure 3.3. The starting point of the
model is to analyze deception scenarios. In the Unified Modeling Language (UML), use cases and scenarios are commonly used by designers as a way to understand users' motivations and tasks in an interface. A use case scenario is a specific sequence of actions, as specified in a use case, carried out under certain conditions. Use scenarios are developed for as many functions, features, and user types as possible. The goal is to ensure that the different ways different users try to complete the same tasks do not conflict with each other.
Based on the use case methodology, the possible deceptive behaviors are
identified in the context of the application. Each scenario is defined by conditions
and corresponding actions. It is possible that some scenarios have identical
conditions, but different actions. In this case, if there are not enough data available
to distinguish the corresponding subsets, they should be combined into one
scenario.
Once scenarios have been specified, the next step is to classify important
features based on the scenarios. These features are changed together during the
perturbation process. Unlike other noise models, the deception noise model
supports dependencies among attributes by grouping attributes. Although
independence is usually a reasonable assumption in random noise, deception
generation must allow dependence.
In the final step of data selection, only those observations that have the potential to take deceptive action are chosen for the perturbation subset. These observations are selected if their distances from the ideal candidate are more than
the specified distance threshold. In the perturbation procedure, the actual values of
each attribute in a deception scenario are randomly perturbed between the threshold
and extreme value. The threshold and extreme values (either maximum or
minimum) must be specified for each attribute in a deception scenario.
The use of thresholds has been suggested in the literature for document-driven deception. Mannino and Koushik (Mannino & Koushik 2000) studied the cost-minimizing inverse classification problem with respect to similarity-based classification systems. As part of a sensitivity analysis, they sought to find the minimum required change to a case to reclassify it as a member of the preferred class with an implicit representation of concept boundaries.
Figure 3.3: Application Deception Model
A formal description of the procedure for generating artificial deception in
the application-based context follows:
Input
D(x, n): original dataset where x denotes the number of observations and n
denotes the number of attributes
p: percentile ranking
c: noise level
Output
I(v_Fs, Fs): the set of instances whose value of feature Fs is equal to v_Fs, where s <= n.
Procedure
1. Let Fs be the features corresponding to each scenario
2. For each case i in D do
3.     Compute distance d_i from the maximum candidate based on Fs
4. End For
5. Sort(d_1, d_2, ..., d_x)
6. Top p -> Subset T
7. For each case j in T do
8.     Let r be a random number in [0, 1]
9.     If r <= c then
10.        For all features Fs do
11.            Randomly choose a value v between the extreme and threshold values
12.            Set v_Fs = v
13.        End For
14.    End If
15. End For
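A compact Python sketch of this procedure is given below; the scenario dictionary layout, the distance callable, and the merit example are illustrative assumptions rather than the thesis implementation.

```python
import random

def application_deception(data, scenario, p, c):
    """Sketch of the Application Deception Model procedure above.

    data     : list of dicts, one per applicant
    scenario : {'features': {name: {'threshold': t, 'ideal': v}},
                'distance': callable giving a case's distance from the ideal candidate}
    p        : percentile ranking (fraction of cases eligible for perturbation)
    c        : noise level
    Returns the indices of the perturbed (deceptive) cases.
    """
    ranked = sorted(range(len(data)),
                    key=lambda i: scenario["distance"](data[i]),
                    reverse=True)                   # farthest from the ideal first
    subset = ranked[: int(p * len(data))]           # top p -> subset T
    perturbed = []
    for i in subset:
        if random.random() <= c:                    # deceive with probability c
            for name, meta in scenario["features"].items():
                lo, hi = sorted((meta["threshold"], meta["ideal"]))
                data[i][name] = random.uniform(lo, hi)   # between threshold and extreme
            perturbed.append(i)
    return perturbed

# Example: merit scenario, inflating GPA toward the ideal value of 4.0
merit = {"features": {"gpa": {"threshold": 3.0, "ideal": 4.0}},
         "distance": lambda case: 4.0 - case["gpa"]}
applicants = [{"gpa": 2.4}, {"gpa": 3.6}, {"gpa": 2.9}, {"gpa": 3.9}]
print(application_deception(applicants, merit, p=0.5, c=0.8))
```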
The proposed Application Deception Model provides a foundation for
generating artificial deception data in various application contexts. In the following
subsections, it is applied to the financial aid application.
3.5.2.2 Application to the Financial Aid Application
Financial aid applications may be classified into two types based on the
criteria through which the financial aid is awarded: need-based or merit-based.
Need-based financial aid is awarded on the basis of the financial need of the
student. The Free Application for Federal Student Aid (FAFSA, 2008) is generally
used for determining federal, state, and institutional need-based aid eligibility.
Merit-based financial aid is typically awarded for outstanding academic
achievements. Some merit scholarships can be awarded for special talents,
leadership potential, and other personal characteristics. Merit-based financial aid does not focus on a student's actual financial need. In this study, need-based and
merit-based application items are combined in order to better capture the impact of
different types of deception in financial aid applications. The Application
Deception Model is applied to these two types of financial aid.
The first step of the Application Deception Model involves scenario
analysis. Based on discussions with decision makers in the University of Colorado
financial aid office, three scenarios cover most deceptive behaviors as shown in
Table 3.4. Students falsify their applications to appear eligible for financial aid. In Scenario 1, students change their status from dependent to independent to qualify for financial aid when their parents' financial condition is strong. In Scenario 2, students reduce their parents' financial variables to qualify rather than change their status. Scenarios 1 and 2 have the same conditions but different actions. The financial aid form does not provide information to distinguish additional conditions favorable to either scenario 1 or 2. Therefore, they are combined into one scenario. Scenario 3 involves inflating merit-based features to make a candidate appear more qualified; it is similar to the resume padding performed by job applicants. After combining the original Scenarios 1 and 2, the two final scenarios applied in this study are summarized in Table 3.5.
Table 3.4: Original deception scenarios in financial aid application
Scenarios                                      Conditions                                     Actions
Scenario 1: Deception on status                Dependent; high parents' financial variable    Change status to independent
Scenario 2: Deception on financial variables   Dependent; high parents' financial variable    Reduce parents' financial variable
Scenario 3: Deception on merit-based features  Low merit-based feature value                  Increase merit-based feature value
Table 3.5: Deception scenarios in financial aid application (after combination)
Scenarios                                      Conditions                                     Actions
Scenario 1: Deception on status                Dependent; high parents' financial variable    Change status to independent or reduce parents' financial variable
Scenario 2: Deception on merit-based features  Low merit-based feature value                  Increase merit-based feature value
Based on the specified scenarios as described in the table above, the
features are classified into three groups: status variables, financial variables, and
merit-based variables. Table 3.6 lists the features included in each group.
Table 3.6: Features included in each group of variables
Variables      Attributes
Status         Age; Marital status
Financial      Student's earned income; Student's total bank balance; Student's net worth of real estate; Parents' earned income; Parents' investment income; Parents' total balance; Parents' net worth of real estate
Merit-based    Cumulative GPA; GPA from the most recent semester; GPA from the semester prior to the most recent semester; Previous award; Number of employment positions; Number of management experiences; Number of activities; Number of leadership experiences
Status variables indicate the condition of either an independent student or a dependent student. Determining whether a student is dependent on or independent from their parents is one common factor involved in all federal and state financial aid applications. Students are classified as independent or dependent because federal student aid programs are based on the principle that it is the parents' responsibility to provide for their children's education. Parents' ability to pay is considered when deciding students' eligibility for financial aid. The status is determined on the basis of the information provided on the application. Students are considered to be independent if they are at least 24 years old or married. Otherwise, they are considered dependent on their parents.
Data Selection
Based on two scenarios, the subsets of data that are eligible for perturbation
are selected using the method depicted in Table 3.7. To select data for each subset,
the distances from each observation in the truth dataset to the ideal candidate are
computed using the features corresponding to the variables in each group. Then
they are sorted in descending order. Based on the percentile ranking, the
observations are selected to form the subset.
Table 3.7: Data selection for deception scenarios in financial aid application
Subset 1: Truth -> Status + Financial Features -> Distance from Ideal Candidate
          (Ideal Candidate: independent and lowest parents' financial variables)
Subset 2: Truth -> Merit Features -> Distance from Ideal Candidate
          (Ideal Candidate: highest merit variables)
Data Perturbation
Once the subsets of data are selected, the perturbation process can begin. The method of data perturbation for each scenario is described in Table 3.8.
Table 3.8: Data perturbation for deception scenarios in financial aid application
Subset 1. Conditions: Dependent; high parents' financial variable. Perturb: Change status to independent or reduce parents' financial variables.
Subset 2. Conditions: Low merit-based variables. Perturb: Increase merit-based variables.
Scenario 1: This scenario considers the deceptive actions on status or financial variables that applicants may take to strengthen their need for financial aid when they have dependent status and their parents' financial status is good. Under these conditions, financial variables and status variables are manipulated. For each individual selected in the subset, a random number between 0 and 1 is drawn for each group of variables. If the number is less than or equal to the noise probability assigned for the group of status variables, the status is changed to independent. If the number is less than or equal to the noise probability assigned for the group of financial variables, the original state of each variable in this group is changed by randomly picking a value between the threshold and ideal value. The thresholds are determined according to the guidelines from the federal form. Table 3.9 shows the threshold and ideal values for parents' financial variables.
Table 3.9: Threshold and ideal values for parents financial variables
Parents' Financial Variables          Threshold Value    Ideal Value
Parents' earned income                $100,000           $0
Parents' investment income            $100,000           $0
Parents' total balance                $100,000           $0
Parents' net worth of real estate     $100,000           $0
Scenario 2: This scenario focuses on the merit-based variables. For each
individual selected in the subset, a random number between 0 and 1 is drawn for
the group of merit-based variables. If the number is less than or equal to the noise
probability, the original value for each variable in this group is randomly perturbed
to the level between the threshold and ideal value for each attribute. The threshold
and ideal values for the merit-based variables are listed in Table 3.10. As shown in
the table, some features are interpreted using counts from their original text-based
values for data analysis.
Table 3.10: Threshold and ideal values for merit-based variables
Merit Variables Threshold Value Ideal Value
Cumulative GPA 3.0 4.0
GPA from the most recent semester 3.0 4.0
GPA from the semester prior to the most recent semester 3.0 4.0
Previous award (counts) 0 2
Employment (counts) 1 2
Management experience (counts) 1 2
Activity (counts) 1 2
Leadership experience (counts) 1 2
4. Analysis of Impact of Real and Artificial
Deception on Screening Policies
The relationship between real and artificial deception was investigated in Chapter 3. To extend the study, this chapter analyzes differences in screening policy performance on real deception data and artificial deception data; the focus is therefore on the impact of the type of deception data on screening decisions. The experiments are designed to compare the real deception data with the artificial deception data using information theoretic measures and a cost model.
4.1 Research Framework and Hypotheses
A research framework involving comparisons of screening policy
performance on real and artificial deception as depicted in Figure 4.1 is adopted. To
analyze the relationship, the screening policy performance is analyzed based on two
models of noise with the specified noise parameter. The classification performance
of the screening method is compared on two types of measures: (1) information
theoretic measures and (2) cost.
Figure 4.1: Comparison of Screening Method Performance on Real and Artificial
Deception
Based on the previous analysis, four hypotheses are presented. The first two deal with the impact on screening policy performance of natural deception versus artificial deception generated with a low noise level; the other two compare screening method performance on deliberate deception versus artificial deception generated with a high noise level. These relationships involve comparisons between two types of performance measures for the screening method: information theoretic measures and cost.
Natural deception vs. artificial deception
H3a: There is a significant difference in screening policy performance on natural deceptive data versus artificial deceptive data modeled by low variable noise.
H3b: There is no significant difference in screening policy performance on natural deceptive data versus artificial deceptive data modeled by the application deception model with low noise.
Deliberate deception vs. artificial deception
H4a: There is a significant difference in screening policy performance on deliberate deceptive data versus artificial deceptive data modeled by high variable noise.
H4b: There is no significant difference in screening policy performance on deliberate deceptive data versus artificial deceptive data modeled by the application deception model with high noise.
4.2 Research Methodology
The impact of natural and artificial deception on each screening method is
analyzed by two experiments based on the research framework depicted in Figure
4.1. Each experiment is conducted with two comparisons of performance, as described in Table 4.1. The performance (Pe) of the screening policy is compared between real (natural and deliberate) deception and artificial deception generated using a noise model and corresponding noise parameters.
Table 4.1: Comparison of screening method performance on real and artificial
deception
                                  Deception Type
Experiments      Real             Artificial
Experiment 1     Pe(Natural)      Pe(VariableLow), Pe(DeceptionLow)
Experiment 2     Pe(Deliberate)   Pe(VariableHigh), Pe(DeceptionHigh)
Typically, financial aid offices adopt a top policy of screening the financial
aid applications. The top policy verifies the eligible and nearly eligible
applications. Nearly eligible applications are verified to allow for reassessment of
financial aid if deception is uncovered in eligible applications. Based on
discussions with decision makers in the University of Colorado financial aid office,
the top 30% of observations are considered to be eligible for financial aid. In addition, the next 5% of financial aid applications are considered nearly eligible.
Both the eligible and nearly eligible applications are screened. To determine the
eligible and nearly eligible applications, the distance from the ideal candidate is
computed for each observation in the real and artificial deception data sets. Then
the distances are sorted in ascending order. The top 35% of observations are
flagged to screen.
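A minimal sketch of this flagging step is shown below, assuming the distances from the ideal candidate have already been computed for every application.

```python
import numpy as np

def top_policy_flags(distances_to_ideal: np.ndarray, fraction: float = 0.35) -> np.ndarray:
    """Flag the eligible and nearly eligible applications for screening.

    The closest `fraction` of applications to the ideal candidate
    (30% eligible + 5% nearly eligible = 35%) are flagged.
    """
    order = np.argsort(distances_to_ideal)              # ascending: closest to ideal first
    n_flagged = int(round(fraction * len(distances_to_ideal)))
    flags = np.zeros(len(distances_to_ideal), dtype=bool)
    flags[order[:n_flagged]] = True
    return flags

d = np.array([0.4, 1.2, 0.1, 0.9, 0.3, 2.0])
print(top_policy_flags(d, 0.35))   # flags the two applications closest to the ideal candidate
```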
Although not commonly adopted, random verification is still used by some
Quality Assurance schools. To assist the analysis, this study adopts the random
policy as a reference point. The policy randomly selects 50% of applications to
verify.
To measure the performance of a screening policy, a class label is needed for each observation. Deceptive cases are used as positive samples, and truthful cases are used as negative samples. It is not necessary to label the artificial deception data since the cases in which perturbation was applied are recorded. Since the original data do not have associated labels, cases are labeled by a labeling method. The labeling method differs for natural and deliberate deception.

Figure 4.2 describes the method used to label the natural deception data set. For natural deception, deceptive and true observations are mixed. In order to classify true and deceptive cases, the distance from truth to natural deception is calculated. A non-zero distance indicates that the case contains deception, so these identified cases are labeled deceptive; the other cases in the dataset are labeled true. Since small amounts of deception are involved in natural deception, it is not necessary to split deceptive cases.
Figure 4.2: Labeling Natural Deception Data
The method used to label the deliberate deception data set is described in Figure 4.3. In deliberate deception, all of the cases contain deception. Therefore, it is not necessary to calculate the distance from the truth to the deception to determine the case label. Since it is not realistic to use 100% deceptive cases, cases with truth
and deliberate deception are mixed. Based on the reported percentage of deception in financial aid (Rhodes & Tuccillo, 2008), 20% of the cases are deliberate deception cases labeled deceptive. The other cases are drawn from the truth dataset and labeled true.
Figure 4.3: Labeling Deliberate Deception Data
4.3 Performance Measures
In the machine learning literature, typical metrics for measuring the performance of classification systems are accuracy and information theoretic measures such as precision and recall. Because accuracy assumes that the class priors in the target environment are constant, and is thus sensitive to the class distribution, it is not used in this study. A realistic evaluation should also take misclassification costs into account; this is especially important if cost asymmetries exist. Therefore, an information theoretic measure (the Harmonic Mean) and a cost model are used for evaluating the performance of the screening policies in this study.
4.3.1 Information Theoretic Measure
In a statistical classification task, the performance of a classification
prediction can be evaluated using the data in a confusion matrix or contingency
table (Kohavi and Provost, 1998) which contains information about actual and
predicted classifications done by a classification system. In the two-class case with
classes true and deceptive, a single prediction has the four different possible
outcomes shown in Table 4.2. The true positives (TP) and true negatives (TN) are
correct classifications. A false positive (FP) occurs when the outcome is incorrectly
predicted as deceptive (or positive) when it is actually true (negative). A false
negative (FN) occurs when the outcome is incorrectly predicted as negative when it
is actually positive.
Table 4.2: Confusion matrix of a true-deception prediction
                          Actual
                          Deceptive    True
Predicted   Screening     TP           FP
            No Screening  FN           TN
The Precision for a class is the number of true positives (i.e. the number of
items correctly labeled as belonging to the positive class) divided by the total
number of elements labeled as belonging to the positive class (i.e. the sum of true
positives and false positives). Recall in this context is defined as the number of true
positives divided by the total number of elements that actually belong to the
positive class (i.e. the sum of true positives and false negatives). A Precision score
of 1.0 for class C means that every item labeled as belonging to class C does indeed
belong to class C whereas a Recall of 1.0 means that every item from class C was
labeled as belonging to class C.
Precision = TP / (TP + FP)     4.1

Recall = TP / (TP + FN)     4.2

Usually, Precision and Recall scores are not discussed in isolation. Instead, either values of one measure are compared at a fixed level of the other measure, or both are combined into a single measure. The Harmonic mean combines Precision and Recall into a single number ranging from 0 (worst prediction) to 1 (best prediction).

Harmonic mean = 2 x (Precision x Recall) / (Precision + Recall)     4.3
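The three measures can be computed directly from the confusion-matrix counts, as in the short sketch below; the example counts are made up for illustration.

```python
def harmonic_mean_from_counts(tp: int, fp: int, fn: int) -> float:
    """Precision, Recall and their Harmonic Mean (Eqs. 4.1-4.3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 24 deceptive cases caught, 11 false alarms, 6 deceptive cases missed
print(round(harmonic_mean_from_counts(24, 11, 6), 3))   # 0.738
```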
In summary, Figure 4.4 illustrates how the confusion matrix is constructed
for a screening problem in financial aid application and how performance metrics
are calculated.
Figure 4.4 Method to Calculate HM
4.3.2 Cost Model
In the domain of deception detection, it is important and necessary to place
monetary value on predictions. Therefore, a cost model is proposed in this study to
evaluate the screening policies' performance based on the cost and benefit of
detecting deception.
4.3.2.1 Budget Models
According to the U.S. Department of Education (2009), Federal Pell Grant amounts are directly determined by the expected family contribution (EFC). Pell grant funding never runs out, so an eligible student will always receive it. Schools then use the EFC to determine eligibility for other aid programs (grants, loans, work-study). These aid programs are typically limited. According to this
information, when allocating the award, two models are applied in this study based on two different types of budget:
a) Fixed budget model: In this model, only the top 30% of applicants are awarded aid since financial aid resources are limited. Therefore, the total amount of the award is static. The model calculates the distance from each applicant to the ideal candidate and ranks all these distances in ascending order. There are three levels of award. The applicants corresponding to the top 5% of distances are eligible for the level 1 award; the next 10% and the next 15% qualify for the level 2 and level 3 awards, respectively.
b) Variable budget model: In this model, the total amount of financial aid is flexible because it is determined by a linear function of the EFC. Since the experiment data do not permit EFC calculation, a distance-based method is used instead. The function is formulated based on three points and two distances. The three points are the applicant, the marginal candidate, and the ideal candidate. The marginal candidate holds the lowest values still considered acceptable, while the ideal candidate represents the perfect financial aid application. The two distances are the distance from the applicant to the ideal candidate and the distance from the marginal candidate to the ideal candidate. If an applicant's distance from the ideal candidate is less than the marginal candidate's distance from the ideal candidate, financial aid is awarded. The amount of aid is calculated using the equation described below.
Financial Aid Amount = (1 - Distance(Student, Ideal Candidate) / Distance(Marginal Candidate, Ideal Candidate)) x Full Amount     4.4
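A minimal sketch of this award rule follows; the distances and the full award amount in the example are made-up values.

```python
def variable_budget_award(d_student: float, d_marginal: float, full_amount: float) -> float:
    """Award under the variable budget model (Eq. 4.4).

    d_student  : applicant's distance from the ideal candidate
    d_marginal : marginal candidate's distance from the ideal candidate
    Applicants farther from the ideal than the marginal candidate receive nothing.
    """
    if d_student >= d_marginal:
        return 0.0
    return (1 - d_student / d_marginal) * full_amount

print(variable_budget_award(0.8, 2.0, 5000))   # 3000.0
```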
Figure 4.5 illustrates the method to allocate the financial aid based on two
models of budget that were described above.
Figure 4.5: Award Allocation Based on Two Models of Budget
4.3.2.2 Award Difference
To better describe and understand the cost caused by misallocating financial aid between an application with truthful information and an application with deceptive information, the award difference is used in our cost model. Since the screening policy will be applied to the deceptive data set, the award difference should not be based on the difference between the truth data set and the deceptive data set before screening. The screening policy will address some of the problematic applications and thus adjust the original deceptive data set, and the final award allocation will be based on the adjusted data set. Therefore, the award difference is the difference between the awards in the truth data set and the policy-adjusted data set for each applicant. The meaning of the award difference is illustrated in Figure 4.6.
Figure 4.6: The Meaning of Award Difference
4.3.2.3 Cost Model Structure
The cost matrix for deception detection is shown in Table 4.3. A false
positive error (a false alarm) corresponds to wrongly deciding that an applicant
provides deceptive information. A false negative error (a miss) corresponds to
letting deceptive information go undetected. A true negative prediction corresponds
to correctly classifying an application with true information as non-deceptive. A
true positive prediction corresponds to successfully detecting that an applicant
provides deceptive information.
Table 4.3: Cost matrix for deception detection
Prediction Deception No deception
Alert Hit (TP) False Alarm (FP)
No alert Miss (FN) Normal (TN)
Table 4.4 below illustrates the cost model. This particular cost model has
two assumptions. First, all alerts must be investigated. Second, the deceptive
application can be successfully caught by investigation. Due to the different dollar
amount of each application, the cost varies with each application. Hence, the cost
model for this domain relies on the sum and average of loss caused by deception.
They are defined as:
CumulativeCost = sum over i = 1..n of Cost(i)     4.5

and

AverageCost = CumulativeCost / n     4.6
Where Cost(i) is the cost associated with application i, and n is the total number of
applications.
Table 4.4 shows that false alarms and hits incur investigation costs, and misses pay out the award difference. Hits have the benefit of avoiding payment of the award difference. There is no cost associated with normals. Based on the cost of a deception analyst's time, the average cost per investigation for the financial aid application data set is estimated to be about $25.
Table 4.4: Cost model for financial aid application deception detection
Outcome                               Cost
Misses (false negative, FN)           Award Difference [Deception - Truth]
False alarms (false positive, FP)     Average Cost Per Investigation
Hits (true positive, TP)              Average Cost Per Investigation - Award Difference [Deception - Truth]
Normals (true negative, TN)           0
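Equations 4.5 and 4.6 can be combined with Table 4.4 as in the sketch below; the outcome labels and example award differences are illustrative, and the $25 figure is the estimate quoted above.

```python
INVESTIGATION_COST = 25.0   # estimated average cost per investigation

def case_cost(outcome: str, award_difference: float) -> float:
    """Cost of one application under the cost model of Table 4.4.

    award_difference is the award paid on the deceptive data minus the award
    that would have been paid on the truthful data.
    """
    if outcome == "miss":          # FN: the inflated award is paid out
        return award_difference
    if outcome == "false_alarm":   # FP: investigation with nothing found
        return INVESTIGATION_COST
    if outcome == "hit":           # TP: investigation cost, inflated award avoided
        return INVESTIGATION_COST - award_difference
    return 0.0                     # normal (TN)

def cumulative_and_average_cost(cases):
    """Eqs. 4.5 and 4.6 over a list of (outcome, award_difference) pairs."""
    total = sum(case_cost(o, diff) for o, diff in cases)
    return total, total / len(cases)

print(cumulative_and_average_cost(
    [("hit", 400.0), ("miss", 250.0), ("false_alarm", 0.0), ("normal", 0.0)]))
```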
5. Experimental Results and Analysis
This chapter presents experimental results and analysis to provide evidence
about the research framework and experimental designs presented in previous
chapters. The experiments test research questions involving the fit between
artificial deception and real deception and the impact of artificial deception on
performance of the screening policies.
5.1 Simulation of Real Deception
The relationship between real deception and artificial deception is tested by paired tests of group means, with α = .05 as the level of statistical significance and an N of 150. In order to use paired t-tests to determine whether there is a significant difference between two means, the underlying assumptions need to be satisfied. The paired t-test assumes that the paired differences are independent and normally distributed. Since the subjects are randomly selected and independent of any other subjects, the assumption of independence is met. To determine whether the assumption of normality is valid for the data, normality tests are conducted. If the data follow the normal distribution, the t-test is used; otherwise, the Wilcoxon test, a nonparametric procedure, is used.

If the result is not statistically significant at the chosen alpha, the null hypothesis of no difference in means between the two groups cannot be rejected. On the other hand, if the difference is statistically significant, an effect size should be examined to see whether the difference is also practically significant. A statistically significant outcome only indicates that it is likely that there is a difference between group means; it does not mean that the difference is large or important. In this study, the effect sizes are computed using Hedges' g measure (Hedges, 1981). Cohen
provides the following guidelines of effect size (r): small effect size, r = 0.1;
medium, r = 0.3; large, r = 0.5.
In the following results, the fit between real deception and artificial deception is evaluated by the two measures defined in Chapter 3: directed distance and outlier score.
5.1.1 Analysis Based on Distance Measure
For easy interpretation, the hypotheses, which were proposed in Chapter 3,
are restated with the null hypothesis and alternative hypothesis based on the
directed distance measure, as shown in Table 5.1. The mean of the directed distances from the true data to the deception data is denoted by μ_distance(deception type), where deception type indicates real (Natural or Deliberate) or artificial (VariableLow, VariableHigh, ApplicationLow, or ApplicationHigh) deception.
Table 5.1: Hypotheses based on directed distance measure
Null Hypothesis                                         Alternative Hypothesis
μ_distance(Natural) = μ_distance(VariableLow)           μ_distance(Natural) ≠ μ_distance(VariableLow)
μ_distance(Natural) = μ_distance(ApplicationLow)        μ_distance(Natural) ≠ μ_distance(ApplicationLow)
μ_distance(Deliberate) = μ_distance(VariableHigh)       μ_distance(Deliberate) ≠ μ_distance(VariableHigh)
μ_distance(Deliberate) = μ_distance(ApplicationHigh)    μ_distance(Deliberate) ≠ μ_distance(ApplicationHigh)
To determine whether the data are normally distributed, normality tests are
conducted in SPSS. As shown in Table 5.2, the results of the test for normality
suggest that the null hypothesis of samples being normally distributed is rejected.
Therefore, the Wilcoxon test is conducted to test the mean differences between the
real and artificial data.
Table 5.2: Normality tests for distance variable
Tests of Normality
                 Kolmogorov-Smirnov(a)             Shapiro-Wilk
                 Statistic   df    Sig.            Statistic   df    Sig.
NaturalDist      .384        150   .000            .646        150   .000
DeliberateDist   .469        150   .000            .503        150   .000
a. Lilliefors Significance Correction
Table 5.3 shows the SPSS output for testing the mean distance differences between the real and artificial deception data. Table 5.4 summarizes the results and lists the p-value and the effect size for each significant result. The results, as expected, show no statistically significant difference between natural deception and artificial deception generated by the application deception model with a low noise level. The results likewise show no statistically significant difference in means between deliberate deception and artificial deception generated using the application deception model with a high noise level. In addition, the results reject the null hypothesis and confirm that real deception and artificial deception generated by the variable noise model differ statistically. Furthermore, the effect sizes of 0.30 (moderate) and 0.74 (large) suggest that both statistically significant results are practically significant.
Table 5.3: Sample mean results based on distance measure
Test Statistics(c)
                         VariableLDist -   ApplicationLDist -   VariableHDist -    ApplicationHDist -
                         NaturalDist       NaturalDist          DeliberateDist     DeliberateDist
Z                        -3.724(a)         -1.076(b)            -9.052(a)          -1.528(b)
Asymp. Sig. (2-tailed)   .000              .282                 .000               .127
a. Based on negative ranks.
b. Based on positive ranks.
c. Wilcoxon Signed Ranks Test
Table 5.4: Summary of statistical test results based on distance measure
(The numbers in parentheses are the p-values and effect sizes.)
Comparison Distance Measure
Natural vs. VariableLow Significant (.000*, 0.30)
Natural vs. ApplicationLow Not significant (.689)
Deliberate vs. VariableHigh Significant (.000*, 0.74)
Deliberate vs. ApplicationHigh Not significant (.687)
*. Significant at the .05 level
5.1.2 Analysis Based on Outlier Score
Outlier score indicates the unusualness of the observation relative to other
observations. It can be considered as a multivariate measure of dispersion of an
outlier relative to other outliers. The relationship between real deception and
artificial deception is measured by testing the average outlier score. Table 5.5
restates the hypotheses that were proposed in Chapter 3 with the null hypothesis
and alternative hypothesis based on the outlier score measure. The mean of the
outlier scores in the deception data set is denoted by μ_outlier(deception type), where
deception type indicates real (natural or deliberate) or artificial (VariableLow,
VariableHigh, ApplicationLow or ApplicationHigh) deception.
Table 5.5: Hypotheses based on outlier score
Null Hypothesis                                       Alternative Hypothesis
μ_outlier(Natural) = μ_outlier(VariableLow)           μ_outlier(Natural) ≠ μ_outlier(VariableLow)
μ_outlier(Natural) = μ_outlier(ApplicationLow)        μ_outlier(Natural) ≠ μ_outlier(ApplicationLow)
μ_outlier(Deliberate) = μ_outlier(VariableHigh)       μ_outlier(Deliberate) ≠ μ_outlier(VariableHigh)
μ_outlier(Deliberate) = μ_outlier(ApplicationHigh)    μ_outlier(Deliberate) ≠ μ_outlier(ApplicationHigh)
In order to evaluate whether the outlier scores follow the normal
distribution, the normality tests are performed in SPSS. Table 5.6 shows the
normality tests for the outlier scores. From the table, the p-value from the Kolmogorov-Smirnov test (0.200) suggests that there is insufficient evidence to reject the null hypothesis that the samples are normally distributed. Therefore, the paired t-test can be conducted to test the mean differences in outlier score between
the real and artificial data.
Table 5.6: Normality tests for outlier scores
Tests of Normality
                 Kolmogorov-Smirnov(a)             Shapiro-Wilk
                 Statistic   df    Sig.            Statistic   df    Sig.
DKNNatural       .064        150   .200*           .961        150   .000
DKNDeliberate    .053        150   .200*           .975        150   .008
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
In this study, two outlier detection algorithms are adopted to generate
outlier score for each observation in the real and artificial deception data sets. Table
5.7 shows the analysis results for the mean outlier score based on the Dn algorithm. When comparing μ_outlier(Natural) vs. μ_outlier(ApplicationLow) and μ_outlier(Deliberate) vs. μ_outlier(ApplicationHigh), the non-significant results do not reject the null hypotheses and are thus consistent with real deception having the same mean outlier score as artificial deception generated by the application deception model. When testing the difference between μ_outlier(Deliberate) and μ_outlier(VariableHigh), the significant result, as expected, demonstrates that deliberate deception and artificial deception generated using the variable noise model with a high noise level are different. Moreover, the large effect size of 0.80 indicates that the observed statistical significance is also practically significant. In the comparison of μ_outlier(Natural) with μ_outlier(VariableLow), the p-value of .064 does not reject the null hypothesis, but the marginal significance provides some evidence that the mean outlier scores for the natural deception data and the artificial deception data modeled by low variable noise are different.
Table 5.7: Sample mean results based on outlier score calculated by Dn algorithm
Comparison                                  Outlier Score (Dn)
Real Deception vs. Artificial Deception     t-value     Sig.      Effect Size
Natural vs. VariableLow                     -1.870      .064      Not applicable
Natural vs. ApplicationLow                  -.692       .490      Not applicable
Deliberate vs. VariableHigh                 -4.860      .000*     0.80
Deliberate vs. ApplicationHigh              .946        .346      Not applicable
*. Significant at the .05 level
Table 5.8 shows the analysis results of the average mean of the outlier score
based on the LOF algorithm. The results provide evidence that there is no statistical
difference between real deception and artificial deception generated by the
application deception model. Also, the significant results offer support for the
differences between real deception and artificial deception generated by the
variable noise model. Furthermore, two moderate effect sizes confirm that the
observed differences are practically significant.
Table 5.8: Sample mean results based on outlier score calculated by LOF algorithm
Comparison                                  Outlier Score (LOF)
Real Deception vs. Artificial Deception     t-value     Sig.      Effect Size
Natural vs. VariableLow                     2.926       .004*     0.48
Natural vs. ApplicationLow                  -.282       .778      Not applicable
Deliberate vs. VariableHigh                 3.240       .001*     0.53
Deliberate vs. ApplicationHigh              -.066       .948      Not applicable
*. Significant at the .05 level
5.1.3 Discussion
This part of the study involves the fit between real deception and artificial deception. The experimental results of the hypothesis tests based on the two measures are summarized in Table 5.9.

From the table, it can be seen that Hypothesis 1a, concerning the relationship between natural deception and artificial deception generated by the variable noise model with a low level of noise, is supported by the directed distance measure and by the outlier score produced by the LOF algorithm. When testing the mean difference of the outlier score produced by the Dn algorithm, although the null hypothesis is not rejected, the marginal significance provides some evidence of a mean difference in Dn outlier scores between the two data sets. Hypothesis 1b, focusing on the relationship between natural deception and artificial deception
generated by the application deception model with a low level of noise, is supported. Hypotheses 2a and 2b, concerning the relationships between deliberate deception and artificial deception generated by the variable noise model and the application deception model with a high level of noise, are also supported.

The results obtained provide evidence that artificially generated deception could be used instead of real deceptive data. However, the noise model for generating deception has to be selected carefully. The existing noise models in the literature, which perturb the original state uniformly without considering real deception behavior, are not appropriate. Deception involves intentional and goal-oriented behavior; therefore, data should be perturbed purposefully. The proposed application deception model applies directed corruption to simulate deception. The results further confirm the importance of directed corruption for simulating deception and provide evidence that the model can be adopted to generate deception.
Table 5.9: Summary of findings

Hypothesis: H1a: There is a significant difference between natural deception and artificial deception modeled by low variable noise.
Findings: Support. There are statistically significant differences in directed distances and LOF-produced outlier scores between the two data sets. Exception: marginal significance indicates some, but not conclusive, evidence of a difference in Dn-produced outlier scores between the two data sets.

Hypothesis: H1b: There is no significant difference between natural deception and artificial deception modeled by low deception noise.
Findings: Support. There is no statistically significant difference in directed distances or in Dn- and LOF-produced outlier scores between the two data sets.

Hypothesis: H2a: There is a significant difference between deliberate deception and artificial deception modeled by high variable noise.
Findings: Support. There are statistically significant differences in directed distances and in Dn- and LOF-produced outlier scores between the two data sets.

Hypothesis: H2b: There is no significant difference between deliberate deception and artificial deception modeled by high deception noise.
Findings: Support. There is no statistically significant difference in directed distances or in Dn- and LOF-produced outlier scores between the two data sets.
5.2 Impact of Real and Artificial Deception on Screening
Policies
This section assesses the impact of real deception and artificial deception on
screening policy performance. Typically, schools adopt the top policy to evaluate
financial aid applicants who are eligible or nearly eligible. For schools participating
in the Quality Assurance Program, they also draw random samples of aid applicants
to verify (Rhodes & Tuccillo, 2008).
In the following results, the prediction of the screening policy is evaluated by
two types of performance measures as described in Chapter 4 with respect to the
real and artificial deception data sets: information theoretic measure and cost. The
results of the impacts of real deception and artificial deception on the top policy
and the random policy are presented.
5.2.1 Impact on Top Policy
Typically, the top policy verifies the 30% of applications that are eligible and the 5% that are nearly eligible. Data are initially analyzed based on a 35% verification percentage. Additionally, in order to capture the pattern of the impacts of real and artificial deception on the top policy at different percentages of verification, the percentage of screened cases is randomly varied 200 times between 20% and 40%. For each run, based on the randomly selected percentage of verification, the corresponding number of observations in the ranked list constructed by calculating the distance from the ideal candidate is flagged. The rest of the observations are not screened. In the following two sections, experimental results that measure the top policy's performance in terms of the Harmonic Mean (HM) and cost are presented and analyzed, respectively.
5.2.1.1 Performance Comparison based on Harmonic Mean
HM does not have a probabilistic interpretation; hence, significance tests cannot be applied to its values. To compare the performance difference in HM between the real deception data and the artificial deception data, the relative percentage difference (PD) for HM is defined as:

PD(HM) = (HM_artificial - HM_real) / HM_real x 100%     5.1
Table 5.10 shows the experimental results that measure the top policy's performance on the real deception data sets and the artificial deception data sets with respect to HM. The numbers in parentheses are the HM values. The relative percentage differences in HM for each pair of comparisons are listed in the right column of the table. The impacts of real and artificial deception on the screening policy are compared based on 35% of screened cases.

As these data show, in comparison with the performance on the natural deception data, the policy's performance decreases by 76.3% on the artificial deception data generated by the variable noise model with a low noise level. The artificial deception data generated by the variable noise model with a high noise level yield a 52% performance improvement over the deliberate deception data. These results, as desired, imply that the real deception data and the artificial deception data generated by the variable noise model have different impacts on the policy. In comparison with the performance on the natural deception data, the policy's performance increases by 17.8% on the artificial deception data generated by the application deception model with a low noise level. The 1.9% performance difference for Deliberate vs. ApplicationHigh suggests that the deliberate deception data and the artificial deception data with a high noise level have similar impacts on the policy.
Table 5.10: Relative HM percentage differences for the top policy
(The numbers in parentheses are the HM values.)
Comparison                                        PD(HM)
Natural (0.291) vs. VariableLow (0.069)           76.3% decrease
Natural (0.291) vs. ApplicationLow (0.343)        17.8% increase
Deliberate (0.313) vs. VariableHigh (0.476)       52% increase
Deliberate (0.313) vs. ApplicationHigh (0.319)    1.9% increase
Figure 5.1 shows the impacts of real and artificial deception on the top policy with HM when the percentage of screened cases is varied 200 times from 20% to 40%. In these graphs, the x-axis denotes the percentage of screened cases while the y-axis represents the HM value. For consistency, the values corresponding to the real deception data and the artificial deception data are indicated by the dark-colored line and the light-colored line, respectively.

Figure 5.1 (a) and (c) present the performance comparison between the two types of real deception and artificial deception generated by the variable noise model. As expected, the graphs show an obvious performance difference for the screening policy on the real deception data and the variable-noise-modeled data based on the HM metric. The performance comparisons between real deception and artificial deception generated by the application deception model are shown in (b) and (d) of Figure 5.1. As shown in Figure 5.1 (b), in comparison with the performance on natural deception, the screening policy performs slightly better on artificial deception generated by the application deception model with a low noise level. In Figure 5.1 (d), the two lines overlap, which indicates that the screening policy performs similarly on the deliberate deception data and the artificial deception data generated by the application deception model with a high noise level. These graphs visually describe the performance difference between the real and artificial deception data and further confirm the results in Table 5.10.
[Figure 5.1, panels (a)-(d): HM (y-axis) versus Top Policy Percentage (x-axis, 20%-40%) for Natural vs. VariableLow, Natural vs. ApplicationLow, Deliberate vs. VariableHigh, and Deliberate vs. ApplicationHigh.]
Figure 5.1: The Impacts of Real and Artificial Deception on Top Policy at Different
Percentage of Screened Case with Harmonic Mean
5.2.1.2 Performance Comparison based on Cost Model
Table 5.11 restates the hypotheses that were proposed in Chapter 4 with the
null hypothesis and alternative hypothesis based on the cost measure. The mean
cost of the screening policy on the deception data is denoted by: pCost(decePtion type),
where deception type indicates real (natural or deliberate) or artificial (VariableLow, VariableHigh, ApplicationLow or ApplicationHigh) deception. Table 5.11: Hypotheses based on cost measure
Null Hypothesis                                  Alternative Hypothesis
μCost(natural) = μCost(VariableLow)              μCost(natural) ≠ μCost(VariableLow)
μCost(natural) = μCost(ApplicationLow)           μCost(natural) ≠ μCost(ApplicationLow)
μCost(deliberate) = μCost(VariableHigh)          μCost(deliberate) ≠ μCost(VariableHigh)
μCost(deliberate) = μCost(ApplicationHigh)       μCost(deliberate) ≠ μCost(ApplicationHigh)
To determine whether the assumption of normality is valid for the cost
variable, a normality test is performed. Table 5.12 shows the results of the
normality test in SPSS. Based on the Kolmogorov-Smirnov and
Shapiro-Wilk statistics, the null hypothesis of normality is rejected; therefore the
Wilcoxon signed-rank test is conducted to test the mean cost differences.
Table 5.12: Normality tests for cost variable
Tests of Normality
                 Kolmogorov-Smirnov (a)          Shapiro-Wilk
                 Statistic   df    Sig.          Statistic   df    Sig.
NaturalTF        .477        150   .000          .366        150   .000
DeliberateTF     .473        150   .000          .221        150   .000
a. Lilliefors Significance Correction
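The same check-then-test sequence can be sketched with a standard statistical library; the snippet below is an illustrative Python outline only (the file names and arrays are placeholders, and the study's actual analysis was performed in SPSS):

    import numpy as np
    from scipy import stats

    # Placeholder paired cost samples (150 observations each in the study)
    natural_cost = np.loadtxt("natural_cost.txt")          # hypothetical file
    variablelow_cost = np.loadtxt("variablelow_cost.txt")  # hypothetical file

    # Shapiro-Wilk normality test; a small p-value rejects normality
    shapiro_stat, shapiro_p = stats.shapiro(natural_cost)

    # Since normality is rejected, compare the paired costs with the
    # non-parametric Wilcoxon signed-rank test instead of a paired t-test
    wilcoxon_stat, wilcoxon_p = stats.wilcoxon(natural_cost, variablelow_cost)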
Tables 5.13 and 5.14 show the SPSS outputs of testing the mean cost
differences between the impacts of real and artificial deception on the top policy
based on the fixed and variable budget models that were defined in Chapter 4. To
aid understanding, Table 5.15 summarizes the results and lists the p-value and
effect size for each significant result.
From Table 5.15, it can be seen that deliberate deception and artificial
deception generated by the variable noise model with a high noise level have
significantly different impacts on the policy based on the variable budget model, as
predicted. However, the fixed budget model does not yield the same result.
Similarly, the significant result for testing Natural vs. VariableLow suggests a
performance difference based on the variable budget model, while the non-significant
result fails to reject the null hypothesis based on the fixed budget model.
When comparing Deliberate vs. ApplicationHigh, the non-significant results for
both cost models do not reject the null hypothesis and thus suggest that the policy
has similar performance on the deliberate deception data and the artificial deception
data generated by the application deception model with a high noise level. The results for
testing the performance difference between the natural deception data and the
artificial deception data generated by the application deception model with a low
noise level arrive at the same conclusion.
Table 5.13: Cost (fixed budget) sample mean results
Test Statistics (c)
                         VariableLTF -   ApplicationLTF -   VariableHTF -   ApplicationHTF -
                         NaturalTF       NaturalTF          DeliberateTF    DeliberateTF
Z                        -.907 (a)       -1.802             -1.194 (a)      -.018 (a)
Asymp. Sig. (2-tailed)   .364            .072               .232            .985
a. Based on negative ranks.
b. Based on positive ranks.
c. Wilcoxon Signed Ranks Test
Table 5.14: Cost (variable budget) sample mean results
Test Statistics (c)
                         VariableLTV -   ApplicationLTV -   VariableHTV -   ApplicationHTV -
                         NaturalTV       NaturalTV          DeliberateTV    DeliberateTV
Z                        -2.018 (a)      -1.317             -4.727 (a)      -.233
Asymp. Sig. (2-tailed)   .044            .188               .000            .816
a. Based on negative ranks.
b. Based on positive ranks.
c. Wilcoxon Signed Ranks Test
Table 5.15: Summary of statistical test results based on cost measure
(The numbers in parentheses are the p-values and effect sizes.)
Comparison                        Fixed Budget             Variable Budget
Natural vs. VariableLow           Not significant (.364)   Significant (.044*, 0.16)
Natural vs. ApplicationLow        Not significant (.072)   Not significant (.188)
Deliberate vs. VariableHigh       Not significant (.232)   Significant (.000*, 0.39)
Deliberate vs. ApplicationHigh    Not significant (.985)   Not significant (.816)
*. Significant at the .05 level
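The effect sizes in parentheses are consistent with the common conversion r = |Z| / sqrt(N) for the Wilcoxon signed-rank test, using the Z statistics from Table 5.14 and N = 150 paired observations; this reconstruction is an assumption based on those reported numbers, sketched below.

    import math

    def wilcoxon_effect_size(z, n):
        # Effect size r = |Z| / sqrt(N) for a Wilcoxon signed-rank test
        return abs(z) / math.sqrt(n)

    print(round(wilcoxon_effect_size(-2.018, 150), 2))  # ~0.16 (Natural vs. VariableLow)
    print(round(wilcoxon_effect_size(-4.727, 150), 2))  # ~0.39 (Deliberate vs. VariableHigh)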
In the above tables, the impacts of real and artificial deception on the
screening policy are compared at 35% of screened cases. Figure 5.2 and
Figure 5.3 show the impacts of real and artificial deception on the top policy with
the fixed and variable budget models when the percentage of screened cases is
varied 200 times from 20% to 40%. In these graphs, the x-axis denotes the
percentage of screened cases and the y-axis represents the mean cost in dollars.
In Figures 5.2 and 5.3, panels (a) and (c) represent the cost comparison
between the two types of real deception and artificial deception generated by the variable
noise model. As shown in Figure 5.2 (a), when comparing natural deception with
artificial deception generated by the variable noise model with a low noise level, the
costs are similar. However, the variable budget model in Figure 5.3 (a) exhibits a
significant performance difference and thus suggests that the policy performs better
on the natural deception data. The graphs (c) in Figures 5.2 and 5.3 show an obvious
performance difference of the screening policy on the deliberate deception data and
the variable noise modeled data, as predicted. These graphs also demonstrate that
the variable noise modeled data always cost more than the real deception data.
The cost comparisons between real deception and artificial deception
generated by the application deception model are shown in (b) and (d) of Figures 5.2
and 5.3. As expected, the graphs show similar performance of the screening policy
on the real deception data and the application deception modeled data based on
both budget models. These graphs also show that the application deception data
always cost slightly less than the real deception data.
[Figure 5.2, panels (a) and (b): mean cost (fixed budget, y-axis) versus Top Policy Percentage (x-axis); (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow.]
[Figure 5.2, panels (c) and (d): mean cost (fixed budget, y-axis) versus Top Policy Percentage (x-axis); (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]
Figure 5.2: The Impacts of Real and Artificial Deception on Top Policy at Different
Percentages of Screened Cases with Cost (Fixed Budget)
[Figure 5.3, panels (a) and (b): mean cost (variable budget, y-axis) versus Top Policy Percentage (x-axis); (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow.]
[Figure 5.3, panels (c) and (d): mean cost (variable budget, y-axis) versus Top Policy Percentage (x-axis); (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]
Figure 5.3: The Impacts of Real and Artificial Deception on Top Policy at Different
Percentages of Screened Cases with Cost (Variable Budget)
5.2.2 Impact on Random Policy
The random policy is not commonly adopted by schools. However, it is
helpful to use the random policy as a reference point to assist the analysis. In
this section, experimental results that measure the random policy performance
based on HM and the cost are presented, respectively. The random policy typically
verifies about 50% of applications. Data are first analyzed based on the 50%
verification percentage. Additionally, in order to capture the pattern of the impacts
of real and artificial deception on the random policy at different percentages of
verification, the percentage of screened cases is randomly varied 200 times
between 40% and 60%. For each run, based on the randomly selected percentage of
verification, the corresponding number of observations is randomly selected and
screened, as sketched below. The remaining observations are not screened.
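A minimal sketch of one such run, assuming a simple list of application records (the names and structures here are illustrative, not the study's implementation):

    import random

    def random_policy_run(applications, low=0.40, high=0.60):
        # Draw a verification percentage uniformly between 40% and 60%,
        # then screen that fraction of randomly chosen applications.
        fraction = random.uniform(low, high)
        n_screened = int(round(fraction * len(applications)))
        screened_idx = random.sample(range(len(applications)), n_screened)
        return fraction, set(screened_idx)

    # The experiment repeats this 200 times with independently drawn percentages:
    # runs = [random_policy_run(applications) for _ in range(200)]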
The experimental results analogous to those displayed for the top policy
in Section 5.2.1 are shown in Table 5.16 for relative harmonic mean percentage
differences. Tables 5.17 and 5.18 show the results of statistical tests for the mean
cost based on the fixed and variable budget models, respectively. Table 5.19
summarizes the results and lists the p-value and the effect size for the significant
result.
Figures 5.4 through 5.6 show the performance results for the random policy based on
HM and cost (fixed and variable budget) when the percentage of screened cases is
randomly varied between 40% and 60%. Due to the random selection, it can be
noticed that the curves in the graphs are not as smooth as those in the graphs for the
top policy. Despite the fluctuation, the graphs show similar patterns to those
for the top policy.
Table 5.16: Relative HM percentage differences for the random policy
(The numbers in parentheses are the HM values.)

Comparison                                       PD(HM)
Natural (0.46) vs. VariableLow (0.132)           56.9% decrease
Natural (0.46) vs. ApplicationLow (0.552)        20% increase
Deliberate (0.267) vs. VariableHigh (0.476)      78.2% increase
Deliberate (0.267) vs. ApplicationHigh (0.25)    6.3% decrease
Table 5.17: Cost (fixed budget) sample mean results for random policy
Test Statistics (b)
                         VariableLRF -   ApplicationLRF -   VariableHRF -   ApplicationHRF -
                         NaturalRF       NaturalRF          DeliberateRF    DeliberateRF
Z                        -.977 (a)       -.141 (a)          -1.780 (a)      -.709 (a)
Asymp. Sig. (2-tailed)   .328            .888               .075            .478
a. Based on negative ranks.
b. Wilcoxon Signed Ranks Test
Table 5.18: Cost (variable budget) sample mean results for random policy
Test Statistics (c)
                         VariableLRV -   ApplicationLRV -   VariableHRV -   ApplicationHRV -
                         NaturalRV       NaturalRV          DeliberateRV    DeliberateRV
Z                        -.870 (a)       .742               -3.787 (a)      -.750 (a)
Asymp. Sig. (2-tailed)   .384            .672               .000            .453
a. Based on negative ranks.
b. Based on positive ranks.
c. Wilcoxon Signed Ranks Test
Table 5.19: Summary of statistical test results based on cost measure
(The numbers in parentheses are the p-values and effect sizes.)
Comparison                        Fixed Budget             Variable Budget
Natural vs. VariableLow           Not significant (.328)   Not significant (.384)
Natural vs. ApplicationLow        Not significant (.888)   Not significant (.672)
Deliberate vs. VariableHigh       Not significant (.075)   Significant (.000*, 0.31)
Deliberate vs. ApplicationHigh    Not significant (.478)   Not significant (.453)
*. Significant at the .05 level
These results are consistent with those obtained for the top policy, which
suggests that the real deception data and the artificial deception data generated by
the application deception model have similar impacts on the policy, as expected.
The non-significant statistical results suggest that the natural deception data and the
artificial deception data generated by the variable noise model with a low noise level
have similar impacts on the policy with both the fixed and variable budget models.
When comparing the mean cost difference between the deliberate deception data
and the artificial deception data generated by the variable noise model with a high
noise level, the fixed budget model suggests a non-significant mean cost difference,
while the variable budget model indicates that the data with high variable noise
perform significantly worse than the deliberate deception data. The results of the
statistical tests are further confirmed by the graphs shown in Figures 5.4 through 5.6.
[Figure 5.4, panels (a) and (b): harmonic mean (y-axis) versus Random Policy Percentage (x-axis); (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow.]
[Figure 5.4, panels (c) and (d): harmonic mean (y-axis) versus Random Policy Percentage (x-axis); (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]
Figure 5.4: The Impacts of Real and Artificial Deception on Random Policy at
Different Percentages of Screened Cases with Harmonic Mean
[Figure 5.5, panels (a) and (b): mean cost (fixed budget, y-axis) versus Random Policy Percentage (x-axis); (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow.]
[Figure 5.5, panels (c) and (d): mean cost (fixed budget, y-axis) versus Random Policy Percentage (x-axis); (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]
Figure 5.5: The Impacts of Real and Artificial Deception on Random Policy at
Different Percentages of Screened Cases with Cost (Fixed Budget)
[Figure 5.6, panels (a) and (b): mean cost (variable budget, y-axis) versus Random Policy Percentage (x-axis); (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow.]
[Figure 5.6, panels (c) and (d): mean cost (variable budget, y-axis) versus Random Policy Percentage (x-axis); (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]
Figure 5.6: The Impacts of Real and Artificial Deception on Random Policy at
Different Percentages of Screened Cases with Cost (Variable Budget)
5.2.3 Discussion
This part of the study focuses on investigating the impacts of real and
artificial deception on the screening policy. The results of the performance comparisons
are summarized in Tables 5.20 and 5.21 for the top and random policies, respectively.
In these two tables, Pe(A) denotes the performance on the artificial deception data
and Pe(R) denotes the performance on the real deception data. To summarize, the
experimental results of the hypothesis tests based on all measures are displayed in
Table 5.22.
As indicated in Table 5.22, when comparing the impact of the natural
deception data and the artificial deception data generated by the variable noise
model with a low noise level on the screening policies, Hypothesis 3a is supported
because HM performance is better for the real deception data than for the artificial
deception data. At the same time, there is an exception: no statistically significant
performance difference between the two data sets with either the fixed or variable cost
measure. Hypothesis 3b compares the performance on natural deception and
artificial deception generated by the application deception model with a low noise
level; the results support the hypothesis and show that they have similar impacts on
the screening policies based on the fixed budget cost and variable budget cost
measures. On the other hand, when the performance is measured by the HM metric,
it appears that the screening policies perform better on the artificial data.
Hypothesis 4a focuses on the performance difference between the deliberate
deception data and the artificial deception data generated by the
variable noise model with a high noise level. It is clear that real deception and
artificial deception generated by the variable noise model have different impacts on
the screening policies based on HM and the variable budget model. However, the
fixed budget model does not support this conclusion. The table reveals that
deliberate deception and artificial deception generated by the application deception
model with a high noise level have similar impacts on both screening policies based
on HM, the fixed budget model, and the variable budget model. Therefore, Hypothesis 4b
is supported. These results confirm that the application deception model is capable
of simulating real deception.
In summary, the test results and findings demonstrate that
artificially generated deception data could be used instead of real deception data
for the data mining application. The application deception model proposed in this
study is supported by the experiments and can be used for testing a screening
method.
Table 5.20: Summary of comparison results for top policy
Comparison                        HM                         Cost (Fixed)   Cost (Variable)
Natural vs. VariableLow           Different, Pe(A) < Pe(R)   Similar        Different, Pe(A) < Pe(R)
Natural vs. ApplicationLow        Different, Pe(A) > Pe(R)   Similar        Similar
Deliberate vs. VariableHigh       Different, Pe(A) > Pe(R)   Similar        Different, Pe(A) < Pe(R)
Deliberate vs. ApplicationHigh    Similar                    Similar        Similar
Table 5.21: Summary of comparison results for random policy
Comparison                        HM                         Cost (Fixed)   Cost (Variable)
Natural vs. VariableLow           Different, Pe(A) < Pe(R)   Similar        Similar
Natural vs. ApplicationLow        Different, Pe(A) > Pe(R)   Similar        Similar
Deliberate vs. VariableHigh       Different, Pe(A) > Pe(R)   Similar        Different, Pe(A) < Pe(R)
Deliberate vs. ApplicationHigh    Similar                    Similar        Similar
Table 5.22: Summary of findings

H3a: There is a significant difference in the screening policy performance on the
natural deceptive data versus the artificial deceptive data modeled by low variable
noise.
Findings: Support: The HM performance is better for the real deception data than
for the artificial data. Exception: There is no significant performance difference in
fixed budget cost and variable budget cost between the two data sets.

H3b: There is no significant difference in the screening policy performance on the
natural deceptive data versus the artificial deceptive data modeled by the low
application deception model.
Findings: Support: There is no significant performance difference in fixed budget
cost and variable budget cost between the two data sets. Exception: The HM
performance is better for the artificial data than for the real data.

H4a: There is a significant difference in the screening policy performance on the
deliberate deceptive data versus the artificial deceptive data modeled by high
variable noise.
Findings: Support: The HM performance is better for the artificial deception data
than for the real deception data. The performance is better for the real deception
data than for the artificial deception data with variable budget cost. Exception:
There is no significant performance difference in fixed budget cost between the two
data sets.

H4b: There is no significant difference in the screening policy performance on the
deliberate deceptive data versus the artificial deceptive data modeled by high
application deception noise.
Findings: Support: There is no significant performance difference in HM, fixed
budget cost, and variable budget cost between the two data sets.