UNDERSTANDING THE RELATIONSHIP BETWEEN REAL AND

ARTIFICIAL DECEPTION

by

Yanjuan Yang

A thesis submitted to the

University of Colorado Denver

in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Computer Science and Information Systems

2009

© 2009 by Yanjuan Yang

All rights reserved.

This thesis for the Doctor of Philosophy

degree by

Yanjuan Yang

has been approved

by

Michael Mannino

Tom Altman

Ronald Ramirez

Date

Yang, Yanjuan (Ph.D., Computer Science and Information Systems)

Understanding the Relationship between Real and Artificial Deception

Thesis directed by Associate Professor Michael Mannino

ABSTRACT

Deception has become an important area in data mining with many recent

studies on terrorism threats, intrusion detection, and fraud prevention. To develop a

data mining approach for a deception application, data collection costs can be prohibitive because both the deceptive data and the corresponding truthful data must be collected. To lower the cost of data collection, artificially generated deception data can be used to train the data mining program, but the impact of using artificially generated deception data is not known. This

project aims to investigate the relationship between real and artificial deception.

The deception and truth data were collected from financial aid applications, a

document centric area with limited resources for verification. The data collection

provides a unique data set containing truth, natural deception, and deliberate

deception. The data collection was augmented by randomly generated artificial

deception. To better simulate deception behavior, a new noise model, called the application deception model, is proposed and implemented to generate artificial

deception in the context of different deception scenarios.

Two experimental studies are proposed to analyze the relationship between

real and artificial deception. The first study investigates the fit between data mining

noise models and deception data to determine if artificially generated deception can

be used to reduce data collection costs. Outlier score and directed distance

percentage change are used as outcome variables. The second study investigates the

impact of real and artificial deception on screening policy performance. The

performance of the screening method is evaluated using an information theoretic

measure and a cost model that is built in the context of the financial aid application.

This abstract accurately represents the content of the candidate's thesis. I

recommend its publication.

Signed

Michael Mannino

ACKNOWLEDGMENT

It is with my special appreciation that I acknowledge my advisor, Dr. Michael

Mannino, for his support, encouragement, and invaluable guidance throughout the

course of this work. His knowledge and dedication have been a constant source of

inspiration. Without the direction of Dr. Mannino, this project could have never

been completed.

I would also like to thank the other members of my dissertation committee, Dr.

Tom Altman, Dr. Peter Bryant, Dr. Dawn Gregg and Dr. Ronald Ramirez for their

invaluable comments and useful suggestions.

I gratefully acknowledge the support from the UCD Business School to my

graduate study.

TABLE OF CONTENTS

Figures....................................................................ix

Tables ....................................................................x

Chapter

1. Introduction ..........................................................1

1.1 Motivation and Overview............................................1

1.2 Thesis Contribution................................................5

1.3 Thesis Outline.....................................................6

2. Theoretical Analysis...................................................7

2.1 Deception Literature Review .......................................7

2.2 Data Collection Issue ............................................11

2.3 Noise Literature..................................................13

2.4 Artificial Data Generation........................................18

3. Analysis and Modeling of Real Deception ..............................20

3.1 Research Model and Hypotheses ....................................20

3.2 Experimental Methodology..........................................23

3.3 Dependent Variables and Measures..................................25

3.3.1 Directed Distance..............................................25

3.3.2 Outlier Score .................................................28

3.4 Real Deception Data Collection....................................31

3.5 Artificial Noise and Deception Models.............................32

3.5.1 Variable Noise Model ..........................................32

3.5.2 Application Deception Model....................................33

4. Analysis of Impact of Real and Artificial Deception on


Screening Policies .............................................43

4.1 Research Framework and Hypotheses ...............................43

4.2 Research Methodology.............................................45

4.3 Performance Measures.............................................48

4.3.1 Information Theoretic Measure................................49

4.3.2 Cost Model...................................................51

5. Experimental Results and Analysis...................................56

5.1 Simulation of Real Deception.....................................56

5.1.1 Analysis Based on Distance Measure ..........................57

5.1.2 Analysis Based on Outlier Score..............................59

5.1.3 Discussion ..................................................62

5.2 Impact of Real and Artificial Deception on Screening Policies....64

5.2.1 Impact on Top Policy.........................................65

5.2.2 Impact on Random Policy......................................78

5.2.3 Discussion ..................................................87

6. Conclusions and Future Work.........................................90

6.1 Conclusions..................................................90

6.2 Implications.................................................91

6.3 Limitations and Future Work..................................92

Appendix

A. Financial Aid Application Form .....................................94

Bibliography ............................................................98


LIST OF FIGURES

Figure

3.1 Research Model............................................................21

3.2 The Distance from Truth to Real and Artificial Deception..................26

3.3 Application Deception Model...............................................36

4.1 Comparison of Screening Method Performance on Real and Artificial

Deception.................................................................44

4.2 Labeling Natural Deception Data...........................................47

4.3 Labeling Deliberate Deception Data........................................48

4.4 Method to Calculate HM....................................................51

4.5 Award Allocation Based on Two Models of Budget..........................53

4.6 The Meaning of Award Difference...........................................54

5.1 The Impacts of Real and Artificial Deception on Top Policy at Different

Percentage of Screened Case with Harmonic Mean............................69

5.2 The Impacts of Real and Artificial Deception on Top Policy at Different

Percentage of Screened Case with Cost (Fixed Budget)...................75

5.3 The Impacts of Real and Artificial Deception on Top Policy at Different

Percentage of Screened Case with Cost (Variable Budget)...................77

5.4 The Impacts of Real and Artificial Deception on Random Policy at Different

Percentage of Screened Case with Harmonic Mean............................82

5.5 The Impacts of Real and Artificial Deception on Random Policy at Different

Percentage of Screened Case with Cost (Fixed Budget)......................84

5.6 The Impacts of Real and Artificial Deception on Random Policy at Different

Percentage of Screened Case with Cost (Variable Budget)...................86


LIST OF TABLES

Table

2.1 Summary of theoretical noise models.....................................14

2.2 Summary of noise models used in previous studies........................16

3.1 Experimental comparisons between real and artificial deception..........24

3.2 Sample sizes............................................................24

3.3 The sign for directed distance..........................................26

3.4 Original deception scenarios in financial aid application...............38

3.5 Deception scenarios in financial aid application (after combination)....39

3.6 Features included in each group of variables............................39

3.7 Data selection for deception scenarios in financial aid application.....40

3.8 Data perturbation for deception scenarios in financial aid application..41

3.9 Threshold and ideal values for parents' financial variables.............42

3.10 Threshold and ideal values for merit-based variables..................42

4.1 Comparison of screening method performance on real and artificial deception46

4.2 Confusion matrix of a true-deception prediction.........................49

4.3 Cost matrix for deception detection.....................................54

4.4 Cost model for financial aid application deception detection............55

5.1 Hypotheses based on directed distance measure............................57

5.2 Normality tests for distance variable...................................58

5.3 Sample mean results based on distance measure...........................59

5.4 Summary of statistical test results based on distance measure...........59

5.5 Hypotheses based on outlier score.......................................60

5.6 Normality tests for outlier scores......................................60

5.7 Sample mean results based on outlier score calculated by D algorithm..61


5.8 Sample mean results based on outlier score calculated by LOF algorithm....62

5.9 Summary of findings........................................................64

5.10 Relative HM percentage differences for the top policy....................67

5.11 Hypotheses based on cost measure.........................................70

5.12 Normality tests for cost variable........................................70

5.13 Cost (fixed budget) sample mean results..................................72

5.14 Cost (variable budget) sample mean results...............................72

5.15 Summary of statistical test results based on cost measure.................72

5.16 Relative HM percentage differences for the random policy..................79

5.17 Cost (fixed budget) sample mean results for random policy................79

5.18 Cost (variable budget) sample mean results for random policy............79

5.19 Summary of statistical test results based on cost measure.................80

5.20 Summary of comparison results for top policy..............................88

5.21 Summary of comparison results for random policy...........................88

5.22 Summary of findings.......................................................89


1. Introduction

1.1 Motivation and Overview

Deception is an everyday occurrence across all communication media.

Deception can be manifested in many forms, from the simple white lies that are

often told for the purpose of facilitating social interaction, to more serious lies that

involve crime or infidelity. Whether lies are small or serious, they involve a

conscious attempt to mislead another person by either concealing or giving false

information, along with willful manipulation of another individual's ability to

accurately assess the truthfulness of a statement or situation.

There has been an increasing interest in learning about deception and its

detection for many years. Various topics related to deception have been studied,

including deception in business practices. People tell more lies when they want to

appear likeable or competent, both important aspects for success in business

(Feldman et al. 2002). Since deception in business tends to hurt productivity and

profitability, any insights that researchers obtain from the study of deception may

have beneficial consequences. For example, auditors can develop a set of heuristics

to help them detect financial fraud (Johnson et al. 2001).

Detecting lies is often difficult. Many people overestimate their natural

ability to catch lies. In reality, the accuracy of detecting lies is at chance (around 50%) or lower (Feeley & deTurck, 1995). Another factor that contributes to the generally low level of lie detection is the inability of people to detect reliable cues of

deception.

Because it is so difficult to detect the majority of lies people tell, efforts have

been made to improve deception detection. Research has discovered some reliable


indicators of deception (Zuckerman & Driver, 1985). Training to detect these cues

can be given to people for improving detection accuracy.

The majority of information exchanged on a daily basis normally involves

some level of deceit and is done using rich media (e.g., face-to-face, voice).

Therefore, research has mainly focused on richly mediated communication

channels. Research regarding the ability to identify deception in textual information

has been sparse at best. The ability to identify deceptive information in textual

forms can reduce revenue losses and decrease time spent following deceptive leads

or information. Therefore, it is important to expand the knowledge about deception

over text-based systems.

This research is concerned with a specific type of text-based deception:

document-centric deception. In document-centric deception an individual falsifies

a portion of her application for a benefit such as a position, financial aid, loan, or

admission to a university. Document-centric deception has emerged as an

important and costly deception area. Existing deception detection techniques

developed for applications in communication and physiology are not suitable for

discovering deception in these kinds of applications, which have few or no

linguistic patterns.

Welfare fraud is a prominent form of document-centric deception. Welfare

fraud refers to various intentional misuses of state welfare systems by withholding

information or giving false or inaccurate information. This may be done in small,

uncoordinated efforts, or in larger, organized criminal rings. Some common types

of welfare fraud are failing to report a household member, failing to report income, and providing false information. Welfare fraud creates a burden for taxpayers by

increasing the cost of programs. The U.S. Department of Agriculture (USDA)

estimates that about 8 percent of the food stamp benefit expenditures are


overpayments or payments to ineligible households. According to the statistics

from the United Council on Welfare Fraud (UCOWF), fraud was discovered in

upwards of 69% of the investigations conducted, with total annual discovered fraud

amounts ranging from $10,000 to $1 million (United Council on Welfare Fraud,

2003).

Financial aid deception is another prominent form of document-centric

deception. Federal, state and private financial aid programs target their assistance

toward students with the least ability to pay for college. This targeting of aid is

based on student and parental self-reports about their financial condition.

Therefore, ensuring the accuracy of the information plays an important role in

equalizing the educational opportunities available to all students. Colleges and

universities routinely verify the accuracy of a subset of aid applications. A US

Department of Education audit of 2.3 million 1995-96 Pell grant recipients [L2000]

found that 4.4% (about 100,000) had reported income figures on their financial aid

applications that were lower than the figures reported to the IRS. According to the

report prepared by Rhodes and Tuccillo (Rhodes & Tuccillo, 2008), forty percent

of records in the random Quality Assurance sample data for Federal Student Aid

contain false information. They found that 30 percent of dependent student records

and 20 percent of independent student records have false data fields when schools

verified the information as part of the random sample process. This puts $2 billion (15.9%) of Pell dollars in 2005-06 at risk of improper payment. To

determine false applications, a financial aid office usually adopts a simple policy

such as a 100% verification or random verification of a small number of

applications. Based on the report, on average schools chose to verify 50.7% of the

records. However, school verification was not very effective in terms of targeting


problematic records (Rhodes & Tuccillo, 2008). Also, these verifications are labor intensive, involving substantial time and cost.

This research is motivated by the fact that data collection for data mining

methods involving document deception is costly because the true data without

deception and the deceptive data both need to be obtained. To lower the cost of data

collection, artificially generated deception data can be used to train the data mining

program, but the impact of using artificially generated deception data is not known.

A number of studies (Mannino & Koushik 2000; Kearns & Li, 1993; Mannino et

al. 2009; Zhu & Wu, 2004; Angluin & Laird, 1988) have used artificially generated

noise to study the sensitivity of classification algorithm performance to noise.

These studies have used conceptual noise models that provide a method to perturb

or change data from its true state to an incorrect state, generating noise from noise-free training data. In contrast, other studies (Burgoon et al. 2003; Zhou, 2003) have

collected primary data containing deception data paired with ground truth data.

Previous work has not examined the fit between artificially generated deception

data and real deception data to understand the impact of using artificially generated

deception data in training of data mining algorithms.
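The conceptual noise models referenced above perturb values in noise-free training data with some probability. As a rough illustrative sketch (not the thesis's variable noise or application deception model, whose details appear in Chapter 3), a uniform attribute-noise perturbation might look like:

```python
import random

def perturb(records, noise_rate, value_ranges, rng=None):
    """Simple attribute-noise model: each attribute value is replaced,
    with probability noise_rate, by a value drawn uniformly from that
    attribute's plausible range. Records are dicts of numeric fields."""
    rng = rng or random.Random(0)
    noisy = []
    for rec in records:
        new_rec = {}
        for attr, value in rec.items():
            if rng.random() < noise_rate:
                lo, hi = value_ranges[attr]
                new_rec[attr] = rng.uniform(lo, hi)  # perturb to incorrect state
            else:
                new_rec[attr] = value  # keep true state
        noisy.append(new_rec)
    return noisy
```

With `noise_rate = 0` the data is returned unchanged; with `noise_rate = 1` every field is perturbed, so the parameter directly controls the amount of artificial noise injected into a noise-free training set. The field names and ranges here are hypothetical.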

This study investigates the fit between real deception data and artificial

deception data generated by data mining noise models and the impact of real and

artificial deception on screening method performance. Deception data and the

ground truth data are collected from financial aid applications, a document-centric

area with limited resources for verification. The data collection provides naturally

occurring deception in which subjects have some incentive to falsify applications

and deliberate deception in which subjects deliberately falsify their applications.

Two different experimental studies are conducted to investigate the relationship

between real and artificial deception. The first set of experiments compares the real


deception data (natural and deliberate) to the artificial deception data with outlier

score percentage change and directed distance percentage change as outcome

variables. The second set of experiments compares the performance of screening

policies on the real and artificial deception data. The performance of screening

policies is evaluated using an information theoretic measure and a cost model.

The results of this study will extend existing literature and provide guidance

to data mining researchers studying deception. If the experimental results indicate a

reasonable fit between artificial and real deception data, researchers will have some

justification for relying on artificial deception data especially when developing

application-independent techniques. If the experimental results indicate a poor fit,

researchers should reduce usage of artificial deception data when training their data

mining applications.

1.2 Thesis Contribution

The contributions of this dissertation are:

It is the first systematic study investigating the relationship between real

deception and artificial deception generated by a noise model.

The deception data and the ground truth are specifically collected from

financial aid applications. The data collection provides a unique data set

containing truth, natural deception, and deliberate deception.

A novel noise model, called the application deception model, is

proposed and developed. The proposed model considers the generation of

artificial deception in different application-based deception contexts, which

is the focus of this study.


An experimental design is developed to compare real deception with

artificial deception. The experiment involves two measures to compare

differences between real and artificial deception.

An experiment is designed to compare performance differences of screening

policies between real and artificial deception. The experiment provides

evidence about bias in using artificially generated deception to evaluate

performance of typical screening policies used in financial aid decision

making.

1.3 Thesis Outline

This dissertation consists of six chapters. After a brief introduction and

deception overview in Chapter 1, Chapter 2 systematically reviews the deception

and noise background and related research work. The issues associated with

gathering deception data are also discussed. Chapter 3 develops a novel method of

artificial deception generation as part of a comparison of real and artificial

deception. An experimental design is developed to characterize the relationship between

real deception and artificial deception generated by a noise model. Chapter 4

describes an experiment designed to investigate the impact of real and artificial

deception on screening policies. In Chapter 5, the experimental results are presented

and analyzed. Chapter 6 summarizes the dissertation and proposes future research

directions.


2. Theoretical Analysis

The theoretical foundation for this research is drawn from a combination of

theories of deception and noise. To provide a context for this study, the literature

about deception and its successful detection are reviewed in this chapter. The issues

associated with gathering deception data are discussed. Also, a brief review of the

noise literature, specifically the impact of artificially generated noise on

classification algorithm performance, is included.

2.1 Deception Literature Review

As Ekman (Ekman, 1992) and others have implied, everyone lies to some

extent, and lies can occur in any social situation and modality. An example is when

people are asked, "How are you?" and they reply "Good." In many cases these

people do not really feel good but give this answer because it is the socially

accepted and expected answer. Another example of common lies is the white lie.

Someone may say that they like a co-worker's new haircut, when they really think

it looks less than flattering. Another common type of lie is when someone is asked

what he or she did today and the person responds with only part of what happened

to him or her during that day. This is an example of a lie of omission. It would be

often tedious and boring to describe and listen to every detail of a persons day.

The above examples of socially accepted and relatively harmless lies serve a

function in peoples lives and are not included in most of the academic research on

deception and its detection.

As a multidisciplinary concept, deception has been defined in many ways.

For a lie to be considered an act of deception, the communicative exchanges

between people must involve perceptions by one or more of the people involved


that there is an intent to deceive (Miller & Stiff, 1993). A widely-used definition

for the term deception and the one that will be used for this study is a message

knowingly transmitted by a sender to foster a false belief or conclusion by the

receiver (Buller & Burgoon, 1996). Thus, deceptive communication consists of

messages and information knowingly transmitted to create a false conclusion

(Miller & Stiff, 1993). Messages that are unknowingly sent by a sender are not

considered deceptive, as there is no intention to deceive. Deception is not limited to outright lies. Evasions of the truth, equivocations, exaggerations,

misdirections, deflections, and concealments are also considered deception. These

forms of deceit are more common than outright lies (DePaulo et al. 1996). Thus,

deception can be conducted in many ways with the purpose of and motivation for

personal gain.

Deception detection aims to determine whether a piece of information is

deceptive. Whether governments protect their citizens from terrorists or

corporations protect their assets from fraud, many organizations are interested in

finding and exposing deception. The problem is that most people are poor at

detecting deception even when presented with all of the verbal and non-verbal

information conveyed in a face-to-face discussion. Numerous studies have noted

that the accuracy with which people typically identify deception is only slightly

better than chance (approximately 54%) (Bond & DePaulo, 2006). This poor

performance is not limited to laypersons, but is also found in professional lie-catchers such as police officers and federal law enforcement officers. This issue is more pronounced when the deception is conveyed in text because of the lack of

nonverbal cues. Furthermore, deception strategies may change from situation to

situation as the deceiver anticipates the interactions and attempts to fool possible

detectors.


To improve low deception detection accuracy, researchers have investigated

methods to assist in detection. These methods take advantage of physiological or

behavioral traits that appear in conjunction with deception. Perhaps the most

familiar tool used in deception detection is the polygraph. Other methods of

deception detection include criteria-based content analysis (Steller & Kohnken

1989) and scientific content analysis (Sporer, 1997). Each of these techniques

includes a set of criteria against which a suspect statement is compared.

While many methods exist to differentiate deception from truth, all methods

are tied together by one feature: they rely on a human operator to make the final

judgment. A potentially more promising approach is to integrate improved human

efforts with automated tools. Compared with the manual approach, the automatic

approach is more efficient and easier to use. Moreover, the enormous amount of

information generated in many deception environments makes it infeasible to

process it manually. The automatic prediction of deception can be achieved through

three steps: (1) identify significant cues to deception, (2) automatically derive the

cues from various media, and (3) build classification models for predicting

deception from new messages. Briefly speaking, in data mining, the goal is to

classify the data (message) into one of two categories (truth or deception) based on its

attributes (for example, linguistic cues). Research on cues and classification methods makes these deception detection objectives attainable.
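The third step, classification, can be sketched with a deliberately simple learner. The cue values and the nearest-centroid rule below are illustrative placeholders, not the methods or data used in the studies cited:

```python
def train_centroids(examples):
    """Compute the mean cue vector (centroid) per class label.
    examples: list of (cue_vector, label) pairs, label in {"truth", "deception"}."""
    sums, counts = {}, {}
    for cues, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(cues)
            counts[label] = 0
        sums[label] = [s + c for s, c in zip(sums[label], cues)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def classify(centroids, cues):
    """Assign the class whose centroid is closest in Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda lab: dist(centroids[lab], cues))
```

Here each message would first be reduced to a numeric cue vector (e.g., hypothetical informality and lexical-diversity scores); the model learned from labeled messages then predicts "truth" or "deception" for new ones.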

Previous deception research has identified a rich set of cues to deception that

have been tested in lab or field environments (DePaulo et al. 2003). Researchers at

the University of Arizona have been developing computer-based methods of

deception detection which analyze linguistic and kinesic behavior in search of

deceptive cues. The methods analyze the movements and linguistic properties of

communication from one person engaged in a recorded face-to-face interaction.


The methods utilize supervised learning techniques from manually prepared

training sets to detect patterns in the linguistic and kinesic channels.

Linguistic methods are based on language features identified as promising

indicators of deceit in previous research (Burgoon et al. 2003). For example,

deceptive messages have been found to include higher informality and expressivity,

and lower wording diversity and complexity. Focusing on language behaviors,

rather than specific content, has the advantage that indicators derived from

language behaviors may be relatively independent of context and are more

conformable to simple parsing approaches. Moreover, deceivers may have control

over the content of their messages, but deceptive intent may still be delivered

through ones language use. Some progress has been made in identifying and

automatically deriving deception indicators from text by integrating findings and

methods from multiple relevant disciplines, including natural language processing,

criminal justice and linguistics (Zhou et al. 2004).

Kinesic methods seek to detect behavioral cues automatically. Empirical

evidence suggests that deceivers' heads and hands move differently than truth-tellers'. For example, research has found that deceivers display significantly more chin

raises than truth-tellers. Kinesic analysis utilizes a tracking method to extract hand

and face regions using the color distribution from a digital image sequence. The

extracted features are summarized and are then used for classification.

Some investigations have also been focused on the third objective of building

classification models for predicting deceit by evaluating classification approaches' ability to discriminate truthful from deceptive messages. Many common machine

learning approaches, such as neural networks and decision trees, can automatically

build classification models from the existing data and then predict the outcome for

the new data. Neural networks have been found to provide good prediction in some


applications (Zahedi, 1996). There has also been at least one attempt at applying

decision trees in grouping messages into deceptive and truthful classes (Burgoon et

al. 2003). The work conducted by Zhou et al. (Zhou et al. 2004) extends prior work

on cues to deception by investigating four classification methods (discriminant analysis, logistic regression, decision trees, and neural networks) for their

predictive power in discriminating truth from deception. Their results suggest that

all four methods were promising for predicting deception with cues to

deception. Among them, neural networks exhibited consistent performance and

were robust across test settings.

2.2 Data Collection Issue

With empirical methods such as statistical models and data mining, data

collection is crucial for improved decision making. However, data collection is

costly. Considerable research in marketing has been conducted to reduce data collection

costs. The costs of various data collection methods including traditional telephone,

postal and email surveys, and web-based surveys are investigated in the literature

(Wiseman et al. 1983; McDonald & Adam 2003). Acquiring deceptive data is even

more expensive since the ground truth data also needs to be obtained. In this case,

subjects may have to be given costly incentives to reveal the truth.

In data mining, supervised classifier learning requires data with class labels.

In many applications, collecting class labels can also be costly. For example, many

historical cases are needed when diagnostic models are trained. To train document

classifiers, experts may need to read many documents and assign them labels.

Active learning is an important approach to reducing data-collection costs in

machine learning. The active learning literature (Cohn et al. 1994; Saar-Tsechansky

& Provost 2001) offers several algorithms for cost-effective label acquisitions.


Active learners acquire training data incrementally, using the model induced from

the available labeled examples to identify helpful additional training examples for

labeling. A number of utility measures have been proposed to indicate the

information value of acquiring labels for unlabeled cases. Active learning methods

have been empirically demonstrated to reduce the cost of label acquisition to

achieve a specified level of classifier performance.

Data collection is costly and time-consuming. There are also many problems involved in ensuring that the collected data are accurate. The quality of the acquired data can be affected by two statistical problems: sampling error and measurement error.

Sampling error: Sampling error occurs during the process of selecting a sample

from the frame population. It arises from the fact that not all members of the frame

population are measured. The sample used for a particular survey is only one of a

large number of possible samples of the same size and design that could have been

selected. Even if the same questionnaire and instructions were used, the estimates

from each sample would differ from the others.

Measurement error: Measurement error is the deviation of the answers of

respondents from their true values on the measure. In both self-administered and

interviewer-administered surveys, measurement errors could arise from the

respondent or from the instrument. Errors may be caused by terms that are unclear to respondents, lack of motivation, concerns about the confidentiality of answers, or deliberate distortion. The question wording and the design of the survey instrument (such as the placement of questions, flow of the instrument, typographical features, etc.) also affect the accuracy of the data collected.


2.3 Noise Literature

Noise is defined by Quinlan as non-systematic errors in either the values of

attributes or class information. A number of theoretical noise models have been

proposed as extensions to the theory of Probably Approximately Correct (PAC)

learning (Valiant, 1984). Valiant's PAC model of learning is one of the most important models for learning from examples: a system in which the learning algorithm develops classification rules that can be used to determine the class of an object from its attributes. PAC learning provides a formal basis for deciding how much data must be collected for a given classifier to achieve a given probability of correct predictions on a given fraction of future test data.
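For a finite hypothesis class and a consistent learner, the standard PAC bound makes this concrete: roughly (ln|H| + ln(1/δ))/ε examples suffice. A small sketch (the function name is illustrative):

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Number of examples sufficient for a consistent learner over a finite
    hypothesis class to be, with probability 1 - delta, approximately
    (error at most epsilon) correct."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# Halving the tolerated error roughly doubles the data requirement.
m1 = pac_sample_bound(hypothesis_count=2**20, epsilon=0.10, delta=0.05)
m2 = pac_sample_bound(hypothesis_count=2**20, epsilon=0.05, delta=0.05)
```

The bound scales only logarithmically in the hypothesis count but inversely in the error tolerance, which is why the tolerated error dominates data-collection cost.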

Although the PAC model better reflects the requirements of learning in the

real world and thus has been widely adopted, one drawback of the PAC model is

that the data used for learning is assumed to be noise free. In many environments,

however, there is always some chance that an erroneous example is given to the

learning algorithm. During training, this might be due to incorrectly measuring an input, wrongly reporting the state of an input, relying on stale values, or using imprecise measurement devices. Input errors in a

training set can cause a learning algorithm to form a rule with an incorrect state for

an input, while input errors in cases to be classified can cause the wrong rule to be

used. To combat this deficiency, a number of theoretical noise models have been

introduced into the theory of PAC learning. These models have been classified by

source (random or adversary) and target (attribute versus class).

The first noise model, called the Random Classification Noise model, was

introduced in (Angluin & Laird, 1988). In this model, the adversary flips a biased coin before providing each example to the learning algorithm; whenever the coin shows heads, which happens with probability η, the classification of the example is flipped, so the algorithm is provided with a wrongly classified example.

Another model is the Malicious Noise model introduced in (Kearns & Li, 1993). In

this model, the adversary can replace the example whenever the η-biased coin

shows H. This gives the adversary the power to distort the distribution D. There

are also two noise models in which the examples are corrupted by purely random

noise affecting only the instances (not the labels). They are called Uniform Random

Attribute noise model and Product Random Attribute noise model (Goldman &

Stone, 1995). For uniform attribute noise, each attribute is flipped independently at

random with the same probability. In contrast, under the product attribute noise model, each attribute is flipped randomly and independently with its own

probability. These noise models are summarized in Table 2.1.

Table 2.1: Summary of theoretical noise models

Reference               Noise Model                      Description
Goldman & Stone, 1995   Uniform random attribute noise   Each attribute is flipped independently at random with the same probability.
Goldman & Stone, 1995   Product random attribute noise   Each attribute is flipped randomly and independently with its own probability.
Angluin & Laird, 1988   Random classification noise      Class noise; the label is inverted.
Kearns & Li, 1993       Malicious attribute noise        The example may be maliciously selected by an adversary who has unlimited computing power and knowledge of the target concept; the nature of the noise is unknown or unpredictable.
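The two random attribute noise models from Goldman and Stone differ only in whether a single flip probability is shared across attributes. A minimal sketch for boolean attributes (helper names are illustrative):

```python
import random

def uniform_attribute_noise(example, p, rng):
    # Uniform random attribute noise: every attribute shares one flip rate p.
    return [1 - v if rng.random() < p else v for v in example]

def product_attribute_noise(example, rates, rng):
    # Product random attribute noise: attribute i is flipped with its own rate.
    return [1 - v if rng.random() < p_i else v
            for v, p_i in zip(example, rates)]

rng = random.Random(0)
clean = [0, 1, 1, 0, 1]
noisy_uniform = uniform_attribute_noise(clean, p=0.2, rng=rng)
noisy_product = product_attribute_noise(clean, rates=[0.05, 0.1, 0.2, 0.4, 0.0], rng=rng)
```

Setting an attribute's rate to zero in the product model leaves that attribute untouched, a flexibility the uniform model lacks.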


Noise models have been used in a number of studies to investigate the sensitivity of classification algorithm performance to noise. Using the uniform attribute noise model, Quinlan (Quinlan, 1986a, b) found the ID3 decision tree induction algorithm to be sensitive to attribute noise. He demonstrated that the classification accuracy of ID3 trained on a noise-free training set was worse when the level of field noise was high (45% or greater). Nolan (Nolan, 2001) applied the uniform

random attribute noise model and empirically studied the effect of attribute noise

on several prominent classification algorithms (C5.0, back propagation neural

network, and linear discriminant analysis). He found that the neural network

performed significantly better than the other algorithms when noise levels exceeded

10%. Zhu and Wu (Zhu & Wu, 2004) studied the effects of data cleaning on the

performance of C4.5 using the uniform attribute noise model and the class noise

model. They compared predictive accuracy of C4.5 on combinations of clean and

noisy training and test sets. The emphasis in their study is the value of data

cleaning either in training data or field data. They reached a number of conclusions

that are counter to the conventional wisdom established by Quinlan (Quinlan

1986a, b) that training data should contain noise representative of field data. To

extend previous studies on classification algorithms' sensitivity to noise, Mannino and Yang (Mannino et al. 2009) emphasize asymmetric levels of noise (under- and over-representation of attribute noise) and use three noise models: uniform attribute noise, product attribute noise, and importance attribute noise. Their results

contradict conventional wisdom, indicating that investments to achieve

representative noise levels may not be worthwhile. In other studies, artificially

generated noise has also been used in testing intrusion detection systems (McHugh,

2000) because of privacy and the sensitivity of actual intrusion data. The noise

models used in these studies are summarized in Table 2.2.


Table 2.2: Summary of noise models used in previous studies

Reference                 Noise Model Used
Quinlan, 1986a, b         Uniform random attribute noise
Mannino & Koushik, 2000   Noise level either implicitly known through a sample of the noise process or explicitly known through an external parameter
Nolan, 2001               Uniform random attribute noise
Zhu & Wu, 2004            Uniform random attribute noise; class noise
Jiang et al. 2005         Uniform attribute noise
Mannino et al. 2007       Uniform random attribute noise; product random attribute noise; importance attribute noise

In early applied studies, the most popular noise handling technique is

decision tree pruning. Pruning techniques reduce specialization by eliminating rules

in whole or in part. They are useful for handling noise because noise in a training set can lead to extra and highly specialized rules. Quinlan (Quinlan, 1986b) found that

pruned decision trees perform better than un-pruned decision trees in the presence

of input data noise. While pruning is usually applied as a post processing technique

(i.e., the tree is pruned after it is induced), other approaches prune the decision

during its construction, i.e., the induction process itself is modified to cope with

noise. In addition to pruning, a number of fuzzy learning methods have been used

to derive fuzzy rules that perform well with noise and/or incomplete training data

(Hong & Chen, 2000; Wu et al. 2003).
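As an illustration of how pruning eliminates rules induced by noise, a reduced-error-style post-pruning pass can be sketched as follows (the toy tree representation is an assumption, not drawn from any of the cited systems):

```python
from collections import Counter

# A toy decision tree: internal nodes are dicts splitting on a binary feature;
# leaves are class labels.

def predict(node, x):
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] == 0 else node["right"]
    return node

def majority_class(examples):
    return Counter(y for _, y in examples).most_common(1)[0][0]

def accuracy(node, examples):
    return sum(predict(node, x) == y for x, y in examples) / len(examples)

def prune(node, examples):
    # Reduced-error pruning: collapse a subtree to its majority-class leaf
    # whenever the subtree does not beat that leaf on the pruning set.
    if not isinstance(node, dict) or not examples:
        return node
    left = [(x, y) for x, y in examples if x[node["feature"]] == 0]
    right = [(x, y) for x, y in examples if x[node["feature"]] == 1]
    node["left"] = prune(node["left"], left)
    node["right"] = prune(node["right"], right)
    leaf = majority_class(examples)
    return leaf if accuracy(leaf, examples) >= accuracy(node, examples) else node

# A tree whose inner split on feature 1 was induced by a noisy example:
tree = {"feature": 0, "left": "A",
        "right": {"feature": 1, "left": "B", "right": "A"}}
pruning_set = [([0, 0], "A"), ([1, 0], "A"), ([1, 1], "A")]
pruned = prune(tree, pruning_set)
```

On this pruning set the spurious split adds no accuracy, so the whole tree collapses to the single leaf "A", exactly the specialization-reducing behavior described above.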

In contrast to pruning techniques, Mookerjee et al. (Mookerjee et al. 1995)

investigated explicit noise handling using clean training data along with a noise

parameter. Explicit noise handling adds noise to clean training data in a controlled

16

manner using the noise parameter. The study demonstrated both analytically and

empirically that explicit noise handling has the same expected performance but

lower variance than traditional techniques using noisy training data.

Jiang et al. (Jiang et al. 2005) conducted another study to explicitly handle noisy input data on the web. Although the term noise is used in the

paper, the study actually deals with deception on the web. A variety of factors

contribute to the presence of web deception. The most significant cause of

deception on the web is the deliberate falsification of input data by web users. Web

users also lie to protect their privacy and to guard against possible misuse by firms of any personal

data they provide. Another factor that contributes to lying is that there is no face-to-

face interaction between the user and the organization's agents in an online

environment. Thus, there are no visual or non-verbal cues that could potentially

help an agent recognize that a user is lying. In their study, a wide range of noise

levels is considered but the same distortion level is applied for all inputs. To cope

with deception, two methods are proposed: knowledge base modification (KM)

and input modification (IM). The KM method considers modifying the knowledge

base (a decision tree) to account for distortion in the inputs provided by the user. It

is appropriate when the distortions in inputs are relatively stationary (input noise

levels do not change much over time). The IM method involves a preprocessing

step during which the observed inputs are modified to account for distortions. This

method involves modifying an observed input to the most likely true value of the

input given the observations made by the system. The modified input is then fed

into the existing (unmodified) knowledge base. In the KM method, a revised

decision tree is obtained, which specifies optimal recommendations for all feasible

combinations of observed input values. The IM method does not require any

modification of the original decision tree.


2.4 Artificial Data Generation

It is widely recognized that real data must play a vital role in evaluation. Real

data by definition includes the types of pattern and regularity found buried in data

that reflects the real world. Such evaluations are therefore vital in establishing the

credibility of a data mining procedure.

However, real data also have serious disadvantages for systematic testing. The most important is that relatively little is known about

the structural regularities of a real data set. The goal of data mining is to discover

such regularities. However, if the patterns to be discovered are not known, how can

the investigator determine a data mining program's success at detecting patterns?

Another problem is the need to alter the degree of difficulty of the data sets since a

particular type of difficulty in the data affects the performance of a data mining

procedure. Furthermore, it may be impossible or very difficult to acquire the required amount or type of data for legal and competitive reasons. To circumvent these data problems while working on a particular data type, one alternative is to create synthetic data that closely matches actual data.

In the intrusion detection area, some work has been done using synthetic test

data. Puketza et al. (Puketza rt al. 1996) describe a software platform for testing

intrusion detection systems where they simulate user sessions using the UNIX

package expect. Using the expect language, they can write scripts that include

intrusive commands. For running the scripts, expect provides a script interpreter

which issues the script commands to the computer system just as if a real user had

typed in the commands. In the fraud detection area, Barse et al. (Barse et al. 2003)

proposed a five-step synthetic data generation methodology using authentic normal


data and fraud as a seed. They argue that synthetic data can be used for training

and testing a fraud detection system.

Although artificially generated data and noise models have been used in the

literature of data mining and information systems, previous work has not examined

the relationship between artificially generated deception and real deception to

understand the impact of using artificially generated deception in training of data

mining algorithms. The following chapter proposes a set of experiments for studying the relationship between real deception and artificial deception. In addition, it defines the measures and presents the hypotheses related to the experiments.


3. Analysis and Modeling of Real Deception

This chapter extends traditional studies with a focus on the fit between

deceptive patterns and data mining noise models. If the fit is reasonable, artificially

generated deception data can be used instead of (or in addition to) real deceptive

data to reduce the costs associated with obtaining training data for the data mining

application. A set of experiments is conducted to compare the real deceptive data

(natural and deliberate) to the artificial deceptive data using outlier score and

directed distance percentage change as outcome measures. A new noise model,

which is called the application deception noise model, is proposed. The proposed

model considers the generation of the artificial deception in the different

application-based deception contexts. The detailed experiment design, research

hypotheses, and the new noise model are described next.

3.1 Research Model and Hypotheses

The research model tested in the dissertation is illustrated below in Figure 3.1.

The research model describes the proposed relationships between real deception

and artificial deception. The real deceptive data are collected for a student financial

aid application with two levels of treatments: natural and deliberate deception. The

artificial deception data is generated using a noise model to corrupt feature values

according to a specified noise parameter. Based on the preceding literature review,

it is expected that different noise models will have different impacts on the

relationship between real and artificial deception. Therefore, the relationship

between real and artificial deception is analyzed under two noise models: variable

noise model (Goldman & Stone, 1995) (different noise rates on attributes) and

application deception noise model (different noise manipulations on three groups of


variables: status variables, financial variables, and merit-based variables). The

noise parameter is set to be consistent with the deception level in the real deceptive

data set.

Figure 3.1: Research Model

Under the natural setting, subjects provide the information with few fields

changed because of their awareness of verification. Under the deliberate setting,

applicants are encouraged to appear as competitive as possible in order to increase

their chances of receiving aid; therefore, deliberate deception introduces more

deception. To be consistent with the different amounts of deception involved in the

natural and deliberate deception data sets, the artificial deception data sets are

generated using two noise levels: low and high. Deception generated with the low noise rate is compared with natural deception, while deception generated with the high noise rate is compared with deliberate deception.

As mentioned in the previous chapter, a noise model provides a method to

perturb or change data from its true state to an incorrect state. The noise models

introduced in the literature generate non-systematic noise by corrupting the original

feature values randomly without considering simulating actions. The variable noise

model adopted here assigns the noise rate to each feature randomly and

independently. The variable noise should not make an observation appear more

unusual because perturbations from the truth are symmetric with a mean level of

perturbation near zero. Therefore, the artificial deception pattern modeled by the

variable noise may not fit the real deception pattern since deception involves

systematic noise.

To simulate deception using the noise model, data should be perturbed

purposefully. The deception noise model proposed in this study simulates deceptive

actions by manipulating groups of features based on scenarios of deception

behavior. In this noise model, noise is applied in one direction only: to make an applicant appear stronger. Thus, it is predicted that the pattern of real deception matches the

deception pattern in the artificial data set modeled by deception noise.
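The distinction drawn here, symmetric perturbation with mean change near zero versus one-sided perturbation toward the goal, can be illustrated on a single numeric attribute (the parameter names and uniform perturbation distribution are illustrative):

```python
import random

def variable_noise(value, rate, spread, rng):
    # Symmetric noise: perturb up or down with equal chance, mean change ~ 0.
    if rng.random() < rate:
        return value + rng.uniform(-spread, spread)
    return value

def deception_noise(value, rate, spread, rng):
    # Directed deception: perturb only toward the goal (here, larger = stronger).
    if rng.random() < rate:
        return value + rng.uniform(0, spread)
    return value

rng = random.Random(42)
truth = [3.0] * 10_000
noisy = [variable_noise(v, rate=0.3, spread=1.0, rng=rng) for v in truth]
deceptive = [deception_noise(v, rate=0.3, spread=1.0, rng=rng) for v in truth]

mean = lambda xs: sum(xs) / len(xs)
# mean(noisy) stays near the truth of 3.0; mean(deceptive) shifts upward.
```

The symmetric version leaves the population mean essentially unchanged, while the directed version shifts the whole distribution toward the goal, which is exactly why the two are expected to produce different outlier and distance patterns.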

Based on the analysis, four hypotheses are presented, with the first two

dealing with the relationship between natural and artificial deception generated

using low noise level, and with the other two focusing on comparing deliberate and

artificial deception generated using high noise level. The hypotheses are based on

the differences between deception and noise. Noise is symmetric change from the

truth without a goal direction. Deception (real or artificial) is change from the truth

directed towards a goal. Thus, it is expected that variable noise will be significantly


different than real deception but artificial deception will be similar to real

deception.

Natural deception vs. artificial deception

H1a: There will be a significant difference between natural deception and

artificial deception modeled by low variable noise.

H1b: There will be no significant difference between natural deception and

artificial deception modeled by low deception noise.

Deliberate deception vs. artificial deception

H2a: There will be a significant difference between deliberate deception and

artificial deception modeled by high variable noise.

H2b: There will be no significant difference between deliberate deception and

artificial deception modeled by high deception noise.

3.2 Experimental Methodology

The relationship between real and artificial deception is analyzed by two

experiments. The first experiment is set to analyze the relationship between

artificial and natural deception, while the second is to reveal the relationship

between artificial and deliberate deception. Each experiment involves two

comparisons between the real deceptive data and the data perturbed with noise

using a noise model and corresponding noise parameters as shown in Table 3.1. To

manipulate data with noise, a noise rate that is consistent with the deception level in

the real deception data set is applied to the truth data under each of the two noise models. The detailed descriptions of the noise models and the methods of

manipulation are presented in section 3.6. After the process of generating noise is

completed, a comparison between the real deceptive data and the noise modeled

data is performed using the directed distance and outlier score measures.


Table 3.1: Experimental comparisons between real and artificial deception

                            Deception Type
Experiments      Real          Artificial
Experiment 1     Natural       Variable-Low, Deception-Low
Experiment 2     Deliberate    Variable-High, Deception-High

In order to determine the sample size before the data is collected, the

standard study using α = 0.05 and having a power of 0.80 is considered. The desired width of the confidence interval is 7%. Therefore, a sample size of 150 per cell in Table 3.2 is adequate.

Table 3.2: Sample sizes

Experiments Sample Sizes

Natural vs. Variable + Low 150

Natural vs. Application + Low 150

Deliberate vs. Variable + High 150

Deliberate vs. Application + High 150

Hypotheses will be tested with ANOVA followed by post hoc t-tests of

group means. A statistically significant outcome only indicates that it is likely that

there is a difference between group means. It does not mean that the difference is

large or important. To judge the size of the difference and describe the strength of

the relationship when the statistical test result is significant, the effect size needs to

be calculated. The most commonly used guideline to interpret effect size is

provided by Cohen (Cohen, 1988): an effect size of 0.1 to 0.3 might be a small effect, around 0.3 to 0.5 a medium effect, and 0.5 to 1.0 a large effect. The


number might be larger than one. In this study, the effect size is computed using Hedges' g, which can be computed from the value of the t test of the differences between the two groups (Hedges, 1981). The formula with separate n's (equation 3.1) should be used when the group sizes are not equal. The formula with the overall number of cases, N (equation 3.2), should be used when the group sizes are equal.
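As a sketch, the two forms of the computation, assuming the standard expression of Hedges' g from an independent-samples t value, are:

```python
import math

def hedges_g_unequal(t, n1, n2):
    # Form with separate group sizes (equation 3.1), assuming the standard
    # expression of g from an independent-samples t value.
    return t * math.sqrt((n1 + n2) / (n1 * n2))

def hedges_g_equal(t, n_total):
    # Simplified form when both groups have the same size (equation 3.2).
    return 2 * t / math.sqrt(n_total)

# With equal groups of 150 (as in Table 3.2), the two forms agree:
g1 = hedges_g_unequal(t=2.5, n1=150, n2=150)
g2 = hedges_g_equal(t=2.5, n_total=300)
```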

3.3 Dependent Variables and Measures

To analyze the fitness between real and artificial deception, two different

measures of relationship were developed for this study, directed distance and

outlier score.

3.3.1 Directed Distance

To measure the relationship between real and artificial deception, one

method is to compare the distances from the truth to both types of deception. Figure

3.2 illustrates the comparison between real and artificial deception with the distance

measure. The distances from the truth to the real and artificial deception are

denoted by D1 and D2. Each observation's distance from the truth to both types of

deception is calculated across all attributes.

g = t √((n1 + n2) / (n1 · n2))	3.1

or

g = 2t / √N	3.2


Figure 3.2: The Distance from Truth to Real and Artificial Deception

Standard distance measures only involve the amount of change because

they are symmetric. However, under deception and noise models, feature value changes can occur either toward the goal or away from it. Because both the

direction and magnitude of change are important, directed distance should be

measured. As a similarity measure, directed distance has been used for object

matching in images (Sim et al. 1999) and graph partitioning problems (Charikar et

al. 2006).

The directed distance contains a sign indicating a positive or negative sense.

In this study, the sign of the directed distance is determined using the method

shown in Table 3.3. It is based on the distance from the ideal candidate. Each

attribute has an extreme value indicating the ideal candidate. If deception moves

closer to the ideal candidate than the truth does, the directed distance is positive. If

the change moves away from the ideal candidate, the directed distance is negative.

Table 3.3: The sign for directed distance

Value Comparison                  Ideal Goal of Deception    Sign (s)
Value_deception > Value_truth     Maximize the value         s = 1
Value_deception > Value_truth     Minimize the value         s = -1
Value_deception < Value_truth     Maximize the value         s = -1
Value_deception < Value_truth     Minimize the value         s = 1


Based on the directed distance measure, the outcome variable is the percentage change in directed distance (PD): 100 × (Deception − Truth) / Truth %. Percentage change is easier to interpret than just the directed distance between truth and deception. The definitions of percentage change in directed distance for real and artificial deception are:

PD_natural = 100 × (Natural − Truth) / Truth	3.3

PD_deliberate = 100 × (Deliberate − Truth) / Truth	3.4

PD_variable = 100 × (Variable − Truth) / Truth	3.5

PD_application = 100 × (Application − Truth) / Truth	3.6

To handle applications with both numeric and non-numeric attributes, a heterogeneous distance function that uses different attribute distance functions is used. This study uses the overlap metric for nominal attributes and the range-normalized distance for linear attributes (Wilson & Martinez, 1997). The heterogeneous distance function defines the distance between two values x and y of a given attribute a as:

d_a(x, y) = 1, if x or y is unknown; else
            overlap(x, y), if a is nominal; else
            rn_diff_a(x, y)	3.7

The function overlap and the range-normalized difference rn_diff are defined as:

overlap(x, y) = 0, if x = y; 1, otherwise	3.8

rn_diff_a(x, y) = |x − y| / range_a	3.9

The value range_a is used to normalize the attribute, and is defined as the difference between the maximum and minimum values of attribute a. The overall directed distance between two input vectors x and y is given in equation 3.10 by the Heterogeneous Directed Distance function HDD(x, y):

HDD(x, y) = Σ (a = 1 to m) d_a(x_a, y_a) · s	3.10

where s is the sign determined as in Table 3.3.
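A minimal sketch combining equations 3.7 through 3.10 with the sign rule of Table 3.3 (the attribute metadata layout is an illustrative assumption, and only linear attributes are exercised):

```python
def attr_distance(x, y, attr):
    # Equation 3.7: 1 if either value is unknown, overlap for nominal
    # attributes, range-normalized difference for linear attributes.
    if x is None or y is None:
        return 1.0
    if attr["nominal"]:
        return 0.0 if x == y else 1.0        # overlap metric (eq. 3.8)
    return abs(x - y) / attr["range"]        # rn_diff (eq. 3.9)

def sign(truth, deception, attr):
    # Table 3.3: positive when the change moves toward the ideal candidate.
    # Sketched for linear attributes; equal values contribute zero distance.
    if deception == truth:
        return 0
    moved_up = deception > truth
    return 1 if moved_up == (attr["goal"] == "maximize") else -1

def hdd(truth_vec, deception_vec, attrs):
    # Equation 3.10: signed sum of per-attribute distances.
    return sum(attr_distance(t, d, a) * sign(t, d, a)
               for t, d, a in zip(truth_vec, deception_vec, attrs))

# An applicant inflates GPA and understates income; both moves are toward
# the ideal candidate, so both contribute positively.
attrs = [{"nominal": False, "range": 4.0, "goal": "maximize"},       # GPA
         {"nominal": False, "range": 100000.0, "goal": "minimize"}]  # income
d = hdd([3.0, 50000.0], [3.5, 30000.0], attrs)   # 0.125 + 0.2 = 0.325
```

Symmetric noise would produce per-attribute contributions of mixed sign that tend to cancel, while goal-directed deception accumulates positive contributions, which is what the PD measures above are designed to detect.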

3.3.2 Outlier Score

An outlier is defined as a data point which is very different from the rest of

the data based on some measure. Outliers arise due to mechanical faults, changes in

system behavior, fraudulent behavior, network intrusions or human errors. As a

fundamental issue in data mining, outlier detection has been used to detect and

remove anomalous objects from data. The outliers themselves may be of particular

interest, such as in the case of fraud detection, where outliers may indicate

fraudulent activity. Thus, outlier analysis has been used to detect fraudulent

patterns that are substantially different from the main characteristics of regular

credit card transactions (Han & Kamber, 2001; Wheeler & Aitken, 2000).

Outlier score has been applied to detect fraud. Outlier score is a measure of

the extent of unusualness. An outlier score is calculated using the Mahalanobis

distance (Barnett & Lewis, 1994) and the quadratic distance (Fawcett & Provost,

1999; Knorr & Ng, 1998) in some work. Yamanishi et al. (Yamanishi et al. 2004) demonstrated the unsupervised SmartSifter algorithm on medical insurance data; using the Hellinger distance, it assigns each datum a score, with a high score indicating a high possibility of being a statistical outlier. An experimental application to


network intrusion detection shows that the algorithm was able to identify data with

high scores that corresponded to attacks.

In this study, outlier detection algorithms are applied to the data with real

and artificial deception. Each observation is assigned an outlier score that indicates

the unusualness of the observation relative to other observations. An outlier score

can also be considered as a multivariate measure of dispersion of an outlier relative

to other outliers.

The relationship between real and artificial deception is measured by testing

the average outlier score. Noise should have more dispersion than deception

because noise is symmetric and deception is directed. Deception should tend to

increase the density of the observations towards the goal.

Outlier detection methods can be categorized into parametric and non-parametric approaches. Parametric approaches define outliers based on a known

distribution. However for outlier identification in large multidimensional data sets,

the underlying distribution is unknown. In contrast to these methods, data-mining

related methods are often non-parametric; they do not require assuming an underlying generative model of the data. These methods are designed to manage

large databases from high-dimensional spaces. This study adopts two methods in

this category: one is the distance-based nearest neighbor algorithm, and the other is

the density-based approach.

Distance-based methods were originally proposed by Knorr and Ng (Knorr

et al. 2000). The basic nearest neighbor algorithm is based on the definition that an

observation is defined as a distance-based outlier if at least a fraction p of the

observations in the dataset are at a distance greater than X from it. However, as

pointed out in Acuna and Rodriguez (Acuna & Rodriguez, 2004), this definition

has certain difficulties such as the determination of X and the lack of a ranking for


the outliers. Furthermore, the two algorithms proposed are either quadratic in the

data set size or exponential in the number of dimensions. Hence it is not an

adequate definition to use with datasets having a large number of instances.

In later work (Ramaswamy et al. 2000), the definition of outlier is modified to address the above drawbacks. Outlier detection is based on the distance to the k-th nearest neighbor of a point p, denoted D^k(p). Under the new definition by Ramaswamy et al., given the integers k and n, a point p is an outlier if no more than n−1 other points in the data set have a higher value for D^k than p. This means that outliers are the top n instances with the largest distance to their k-th nearest neighbor. The main idea in the algorithm is that for each instance in D one keeps track of the closest neighbors found so far; as soon as an instance's k closest neighbors found so far show that its D^k is too small, the instance is removed because it can no longer be an outlier. These outliers are referred to as the D^k outliers of a dataset. The basic nearest neighbor algorithm and the D^k variation both find global outliers.
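Stripped of the pruning optimization, the D^k scheme scores each point by the distance to its k-th nearest neighbor and declares the top n scores outliers. A direct, quadratic sketch:

```python
import math

def kth_nn_distance(points, k):
    """Distance from each point to its k-th nearest neighbor (the D^k score)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    # The n points with the largest D^k scores are declared outliers.
    scores = kth_nn_distance(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = top_n_outliers(data, k=2, n=1)   # the isolated point (5.0, 5.0)
```

Unlike the original fraction-and-threshold definition, this version needs no distance cutoff and yields a ranking directly from the scores.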

Another method is the density-based approach presented in (Breunig et al. 2000), where a new notion of local outlier is introduced that measures the degree to which an object is an outlier with respect to the density of its local neighborhood. This degree is called the Local Outlier Factor (LOF) and is assigned to each object. The LOF algorithm uses local densities to avoid anomalous situations in which many points would be considered global outliers. Although the LOF algorithm does not require explicit clustering, it finds outliers that are outside of clusters. The local outlier factor for an object is the mean of the ratios of the local densities of the object's neighbors to the object's own local density. The local density for an object is

near 1 for dense neighborhoods and near 0 for sparse neighborhoods.
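The LOF computation can be sketched directly from this description: a local reachability density per point, then the mean ratio of neighbor densities to the point's own density, with values well above 1 flagging local outliers. A compact, unoptimized version:

```python
import math

def knn(points, i, k):
    # Indices of the k nearest neighbors of point i (brute force).
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]

def k_distance(points, i, k):
    return math.dist(points[i], points[knn(points, i, k)[-1]])

def lrd(points, i, k):
    # Local reachability density: inverse of the mean reachability distance
    # from i to each of its k nearest neighbors.
    neighbors = knn(points, i, k)
    reach = [max(k_distance(points, j, k), math.dist(points[i], points[j]))
             for j in neighbors]
    return len(neighbors) / sum(reach)

def lof(points, i, k):
    # Mean ratio of the neighbors' densities to this point's own density.
    neighbors = knn(points, i, k)
    return (sum(lrd(points, j, k) for j in neighbors)
            / (len(neighbors) * lrd(points, i, k)))

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = [lof(data, i, k=2) for i in range(len(data))]
# Points inside the dense cluster score near 1; the isolated point scores far higher.
```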


Both outlier approaches involve the distance to the k-th nearest neighbor. The outlier score in the D^k approach is the distance to the k-th nearest neighbor itself. The outlier score in the LOF approach is an adjusted k-nearest-neighbor score. The choice of k should be made relative to the size of the data set.

3.4 Real Deception Data Collection

Participants in the data collection were students enrolled in undergraduate

level courses. Experimental data were collected from the subjects completing a

hypothetical financial aid application. Financial aid can be divided into two main

categories: need-based and merit-based. This study adopts a merit-based scholarship application along with items drawn from a need-based application. The scholarship

application form was created by combining the items from the scholarship

application in the UCD Business School with the Free Application for Federal Student Aid (FAFSA, 2008). The data collection instrument is presented in Appendix A.

The data were collected in a web-based setting. Subjects were told that the purpose

of the research is to investigate the relationship between actual and artificial

deception in student financial aid applications. Subjects needed to complete a

consent form and were informed that they could terminate their participation at any

time. Instructions to fill out the application were provided. The participation was

limited to completing the same financial aid application three times. In the first

completion, subjects were told to provide their natural responses in completing a

financial aid application in which they perceived little chance for detection of

deception. In the second completion, subjects were told to correct all dishonest

responses in their original application. In the third completion, subjects were


instructed to deliberately respond unethically to make themselves appear to be as

competitive as possible.

Obtaining deception in a natural environment without subject awareness of

this study would help to increase the reliability of the data. The original data collection

plan was that the subjects complete the financial aid application without knowing

the real reason for providing the data. The subject would be told about the research

purpose after completing the financial aid application. However, in order to

conduct the research in accordance with human experimentation requirements of

the academic human subjects review board, the real research purpose had to be

disclosed to the subjects before their participation. As a result of this constraint,

the data collection approach adopted in this study is the best feasible alternative.

3.5 Artificial Noise and Deception Models

Noise occurs when the true input state is perturbed by a measurement

process. A noise model provides a method to perturb or change data from its true

state to an incorrect state. A noise model should allow simulation of a real data

generation process with both correct and incorrect data. The Variable Noise Model

(Goldman & Stone, 1995) has been used in studies of noise handling in the data

mining literature. Deception occurs when an individual knowingly perturbs the true

input state to achieve some goal. Since there has not been an artificial deception

model proposed in the literature, the Application Deception Model (ADM) is

developed for this project.

3.5.1 Variable Noise Model

For the variable noise model, each attribute is flipped randomly and

independently with its own probability pi, all of which are less than the given upper


bound for the noise rate. Each attribute is randomly assigned a noise level from a

uniform distribution within the specified range. Then a random number between

0 and 1 is drawn. If the number is less than or equal to the attribute's noise

probability, the value is changed. If the attribute's scale is nominal, the value is

randomly changed to any other value. If the attribute's scale is ordinal, the attribute

is changed to an adjacent value. If the attribute is numeric (ratio or absolute scale),

the value is changed using an equal height histogram with at most 10 ranges. A

smaller number of ranges are used for highly skewed numeric attributes. After

randomly selecting an adjacent cell of the equal height histogram, a value is

randomly selected between the end points of the cell.
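The perturbation rules above can be sketched as follows (a minimal illustration, assuming attribute metadata that maps each attribute to its scale and domain; for numeric attributes the domain is taken to be the sorted equal-height histogram boundaries, and all names are illustrative):

```python
import bisect
import random

def variable_noise(record, meta, upper_bound, rng=None):
    """Variable Noise Model sketch: each attribute gets its own noise level
    drawn uniformly below `upper_bound`; a value is perturbed when a uniform
    draw falls at or below that level. `meta` maps attribute -> (scale,
    domain); for numeric attributes the domain lists histogram boundaries."""
    rng = rng or random.Random(0)
    noisy = dict(record)
    for attr, (scale, domain) in meta.items():
        p = rng.uniform(0, upper_bound)          # attribute's own noise level
        if rng.random() > p:
            continue                             # value left unchanged
        v = noisy[attr]
        if scale == "nominal":
            # change to any other value in the domain
            noisy[attr] = rng.choice([x for x in domain if x != v])
        elif scale == "ordinal":
            # move to an adjacent value in the ordered domain
            i = domain.index(v)
            j = rng.choice([k for k in (i - 1, i + 1) if 0 <= k < len(domain)])
            noisy[attr] = domain[j]
        else:
            # numeric: pick an adjacent histogram cell, then a uniform value
            # between that cell's end points
            cell = min(bisect.bisect_right(domain, v) - 1, len(domain) - 2)
            adj = rng.choice([c for c in (cell - 1, cell + 1)
                              if 0 <= c < len(domain) - 1])
            noisy[attr] = rng.uniform(domain[adj], domain[adj + 1])
    return noisy
```

With an upper bound of zero every attribute keeps its original value, which provides a quick sanity check of the model.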

3.5.2 Application Deception Model

The noise models in the literature have been proposed as extensions to the

theory of Probably Approximately Correct (PAC) learning. The noise perturbation

corrupts the original feature values randomly without considering causative actions.

Thus, these noise models generate non-systematic noise. However, to simulate

deception, data should be perturbed purposefully since deception involves

systematic noise. Therefore, the previous noise models may not be suitable for

studying deception schemes. It is necessary to develop a new model to structure the

methodology of simulating deceptive actions so that the data are randomly

perturbed according to deception objectives. The Application Deception Model

supports generation of artificial deception in different application-based deception

contexts, which are the major focus of this study.


3.5.2.1 Description of the Application Deception Model

The Application Deception Model involves scenario analysis, feature

grouping, and data perturbation as depicted in Figure 3.3. The starting point of the

model is to analyze deception scenarios. In Unified Modeling Language (UML),

use-cases and scenarios are commonly used by designers as a way to understand

users' motivation and tasks in an interface. A use case scenario is a specific

sequence of actions, as specified in a use case, carried out under certain conditions.

Use case scenarios are developed for as many functions or features and user types as

possible. The goal is to ensure that the different ways different users try to complete

the same tasks do not conflict with each other.

Based on the use case methodology, the possible deceptive behaviors are

identified in the context of the application. Each scenario is defined by conditions

and corresponding actions. It is possible that some scenarios have identical

conditions, but different actions. In this case, if there are not enough data available

to distinguish the corresponding subsets, they should be combined into one

scenario.

Once scenarios have been specified, the next step is to classify important

features based on the scenarios. These features are changed together during the

perturbation process. Unlike other noise models, the deception noise model

supports dependencies among attributes by grouping attributes. Although

independence is usually a reasonable assumption in random noise, deception

generation must allow dependence.

In the final step of data selection, only those observations that have the

potential to take deceptive action are chosen for the subset to be perturbed. These

observations are selected if their distances from the ideal candidate are more than


the specified distance threshold. In the perturbation procedure, the actual values of

each attribute in a deception scenario are randomly perturbed between the threshold

and extreme value. The threshold and extreme values (either maximum or

minimum) must be specified for each attribute in a deception scenario.

The use of thresholds has been suggested in the literature for document-

driven deception. Mannino and Koushik (2000) studied the

cost minimizing inverse classification problem with respect to similarity-based

classification systems. As part of a sensitivity analysis, they sought to find the

minimum required change to a case to reclassify it as a member of the preferred

class with implicit representation of concept boundaries.



Figure 3.3: Application Deception Model

A formal description of the procedure for generating artificial deception in

the application-based context follows:

Input

D(x, n): original dataset where x denotes the number of observations and n

denotes the number of attributes

p: percentile ranking

c: noise level

Output

I(v_fs, F_s): the set of instances whose value of feature F_s is equal to v_fs, s ≤ n.


Procedure

1. Let F_s be the features corresponding to each scenario

2. For each case i ∈ D do

3.     Compute distance d_i from the maximum candidate based on F_s

4. End For

5. Sort(d_1, d_2, ..., d_x)

6. Top p → Subset T

7. For each case j ∈ T do

8.     Let r be a random number in [0, 1]

9.     If r ≤ c then

10.        For all features F_s do

11.            Randomly choose a value v between the extreme and threshold values

12.            Set v_fs = v

13.        End For

14.    End If

15. End For
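Assuming the "maximum candidate" of step 3 is the ideal candidate assembled from each scenario feature's ideal value, the procedure can be sketched in Python (the dict-based records, feature names, and function names are illustrative, not the thesis's implementation):

```python
import math
import random

def apply_deception(data, scenario, p, c, rng=None):
    """Application Deception Model sketch. `data` is a list of dict records;
    `scenario` maps each scenario feature -> (threshold, ideal) value.
    The p-fraction of cases farthest from the ideal candidate forms subset T;
    each selected case is perturbed with probability c by drawing new values
    between the threshold and ideal value for every scenario feature."""
    rng = rng or random.Random(1)
    features = list(scenario)
    ideal = {f: scenario[f][1] for f in features}

    # Steps 2-5: distance from the ideal candidate, ranked descending
    def dist(case):
        return math.sqrt(sum((case[f] - ideal[f]) ** 2 for f in features))
    ranked = sorted(data, key=dist, reverse=True)

    subset = ranked[: int(p * len(data))]          # step 6: top p -> subset T
    for case in subset:                            # steps 7-15
        if rng.random() <= c:                      # noise level c
            for f in features:
                lo, hi = sorted(scenario[f])
                case[f] = rng.uniform(lo, hi)      # between threshold and ideal
    return data
```

For example, with a GPA scenario whose threshold is 3.0 and ideal is 4.0, the weakest half of the applicants is boosted into the 3.0 to 4.0 range while strong applicants are left untouched.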

The proposed Application Deception Model provides a foundation for

generating artificial deception data in various application contexts. In the following

subsections, it is applied to the financial aid application.

3.5.2.2 Application to the Financial Aid Application

Financial aid applications may be classified into two types based on the

criteria through which the financial aid is awarded: need-based or merit-based.

Need-based financial aid is awarded on the basis of the financial need of the

student. The Free Application for Federal Student Aid (FAFSA, 2008) is generally

used for determining federal, state, and institutional need-based aid eligibility.

Merit-based financial aid is typically awarded for outstanding academic

achievements. Some merit scholarships can be awarded for special talents,

leadership potential and other personal characteristics. Merit-based financial aid

does not focus on a student's actual financial need. In this study, need-based and


merit-based application items are combined in order to better capture the impact of

different types of deception in financial aid applications. The Application

Deception Model is applied to these two types of financial aid.

The first step of the Application Deception Model involves scenario

analysis. Based on discussions with decision makers in the University of Colorado

financial aid office, three scenarios cover most deceptive behaviors as shown in

Table 3.4. Students falsify their applications to appear eligible for financial aid. In

Scenario 1, students change their status from dependent to independent to qualify

for financial aid if their parents' financial condition is strong. In Scenario 2,

students reduce their parents' financial variables to qualify rather than change their

status. Scenarios 1 and 2 have the same conditions but different actions. The

financial aid form does not provide information to distinguish additional conditions

favorable to Scenario 1 or 2. Therefore, they are combined into one scenario.

Scenario 3 involves inflating merit-based features to make a candidate appear more

qualified. Scenario 3 is similar to resume padding performed by job applicants.

After the combination of the original Scenario 1 and 2, two final scenarios that are

applied in this study are summarized in Table 3.5.

Table 3.4: Original deception scenarios in financial aid application

Scenario                                 Conditions                            Actions

Scenario 1                               Dependent;                            Change status to

(Deception on status)                    high parents' financial variables     independent

Scenario 2                               Dependent;                            Reduce parents'

(Deception on financial variables)       high parents' financial variables     financial variables

Scenario 3                               Low merit-based                       Increase merit-based

(Deception on merit-based features)      feature value                         feature value


Table 3.5: Deception scenarios in financial aid application (after combination)

Scenario                                 Conditions                            Actions

Scenario 1                               Dependent;                            Change status to independent or

(Deception on status)                    high parents' financial variables     reduce parents' financial variables

Scenario 2                               Low merit-based                       Increase merit-based

(Deception on merit-based features)      feature value                         feature value

Based on the specified scenarios as described in the table above, the

features are classified into three groups: status variables, financial variables, and

merit-based variables. Table 3.6 lists the features included in each group.

Table 3.6: Features included in each group of variables

Variables       Attributes

Status          Age; marital status

Financial       Student's earned income; student's total bank balance; student's net worth of real estate; parents' earned income; parents' investment income; parents' total balance; parents' net worth of real estate

Merit-based     Cumulative GPA; GPA from the most recent semester; GPA from the semester prior to the most recent semester; previous awards; number of employment positions; number of management experiences; number of activities; number of leadership experiences


Status variables indicate the condition of either an independent student or a

dependent student. Determining the status as dependent or independent from their

parents is one common factor involved in all federal and state financial aid

applications. Students are classified as independent or dependent because federal

student aid programs are based on the principle that it is the parents responsibility

to provide for their childrens education. Parents ability to pay is considered when

deciding students eligibility for financial aid. The status is determined on the basis

of the information provided on the application. Students are considered to be

independent if they are at least 24 years old or married. Otherwise, they are

considered dependent on their parents.

Data Selection

Based on two scenarios, the subsets of data that are eligible for perturbation

are selected using the method depicted in Table 3.7. To select data for each subset,

the distances from each observation in the truth dataset to the ideal candidate are

computed using the features corresponding to the variables in each group. Then

they are sorted in descending order. Based on the percentile ranking, the

observations are selected to form the subset.

Table 3.7: Data selection for deception scenarios in financial aid application

Subset 1: Truth → Status + financial features → Ideal candidate → Distance

          Ideal candidate: independent and lowest parents' financial variables

Subset 2: Truth → Merit features → Ideal candidate → Distance

          Ideal candidate: highest merit variables


Data Perturbation

Once the subsets of data are selected, the perturbation process can begin.

The method of data perturbation for each scenario is described in Table

3.8.

Table 3.8: Data perturbation for deception scenarios in financial aid application

Subset 1. Conditions: dependent; high parents' financial variables.

          Perturb: change status to independent or reduce parents' financial variables.

Subset 2. Conditions: low merit-based variables.

          Perturb: increase merit-based variables.

Scenario 1: This scenario considers the deceptive actions on status or

financial variables that applicants may take to strengthen their need for financial aid

when they have dependent status and their parents' financial status is good. Under

these conditions, financial variables and status variables are manipulated. For each

individual selected in the subset, a random number between 0 and 1 is drawn for

each group of variables. If the number is less than or equal to the noise probability

assigned for the group of status variables, the status is changed to independent. If

the number is less than or equal to the noise probability assigned for the group of

financial variables, the original state for each variable in this group is changed by

randomly picking a value between the threshold and ideal value. The thresholds are

determined according to the guidelines from the federal form. Table 3.9 shows the

threshold and ideal values for parents' financial variables.


Table 3.9: Threshold and ideal values for parents financial variables

Parents' Financial Variables            Threshold Value    Ideal Value

Parents' earned income                  $100,000           $0

Parents' investment income              $100,000           $0

Parents' total balance                  $100,000           $0

Parents' net worth of real estate       $100,000           $0

Scenario 2: This scenario focuses on the merit-based variables. For each

individual selected in the subset, a random number between 0 and 1 is drawn for

the group of merit-based variables. If the number is less than or equal to the noise

probability, the original value for each variable in this group is randomly perturbed

to the level between the threshold and ideal value for each attribute. The threshold

and ideal values for the merit-based variables are listed in Table 3.10. As shown in

the table, some features are interpreted using counts from their original text-based

values for data analysis.

Table 3.10: Threshold and ideal values for merit-based variables

Merit Variables Threshold Value Ideal Value

Cumulative GPA 3.0 4.0

GPA from the most recent semester 3.0 4.0

GPA from the semester prior to the most recent semester 3.0 4.0

Previous award (counts) 0 2

Employment (counts) 1 2

Management experience (counts) 1 2

Activity (counts) 1 2

Leadership experience (counts) 1 2


4. Analysis of Impact of Real and Artificial

Deception on Screening Policies

The relationship between real and artificial deception was investigated

in Chapter 3. To extend the study, this chapter analyzes differences in screening

policy performance on real deception data and artificial deception data; that is, the

study focuses on examining the impact of the data source. The experiments are designed to

compare the real deception data with the artificial deception data using information

theoretic measures and a cost model.

4.1 Research Framework and Hypotheses

A research framework involving comparisons of screening policy

performance on real and artificial deception as depicted in Figure 4.1 is adopted. To

analyze the relationship, the screening policy performance is analyzed based on two

models of noise with the specified noise parameter. The classification performance

of the screening method is compared on two types of measures: (1) information

theoretic measures and (2) cost.


Figure 4.1: Comparison of Screening Method Performance on Real and Artificial

Deception

Based on the previous analysis, four hypotheses are presented, with the first

two dealing with the impacts of natural and artificial deception generated using low

noise level on screening policy performance, and with the other two focusing on

comparing the screening method performance on deliberate and artificial deception

generated using high noise level. These relationships involve comparisons among

two types of performance measures of the screening method: information theoretic

measures and cost.


Natural deception vs. artificial deception

H3a: There is a significant difference in the screening policy performance on

natural deceptive data versus artificial deceptive data modeled by low variable

noise.

H3b: There is no significant difference in the screening policy performance on

natural deceptive data versus artificial deceptive data modeled by low

application deception model.

Deliberate deception vs. artificial deception

H4a: There is a significant difference in the screening policy performance on

deliberate deceptive data versus artificial deceptive data modeled by high

variable noise.

H4b: There is no significant difference in the screening policy performance on

deliberate deceptive data versus artificial deceptive data modeled by high

application deception noise.

4.2 Research Methodology

The impact of natural and artificial deception on each screening method is

analyzed by two experiments based on the research framework depicted in Figure

4.1. Each experiment is conducted by two comparisons of performance as

described in Table 4.1. The performance (Pe) of the screening policy is compared

between real (natural and deliberate) and artificial deception generated using noise

model and corresponding noise parameters.


Table 4.1: Comparison of screening method performance on real and artificial

deception

                                     Deception Type

Experiments       Real                Artificial

Experiment 1      Pe(Natural)         Pe(VariableLow)      Pe(DeceptionLow)

Experiment 2      Pe(Deliberate)      Pe(VariableHigh)     Pe(DeceptionHigh)

Typically, financial aid offices adopt a top policy of screening the financial

aid applications. The top policy verifies the eligible and nearly eligible

applications. Nearly eligible applications are verified to allow for reassessment of

financial aid if deception is uncovered in eligible applications. Based on

discussions with decision makers in the University of Colorado financial aid office,

the top 30% of observations are considered to be eligible for the financial aid. In

addition, the next 5% of financial aid applications are considered nearly eligible.

Both the eligible and nearly eligible applications are screened. To determine the

eligible and nearly eligible applications, the distance from the ideal candidate is

computed for each observation in the real and artificial deception data sets. Then

the distances are sorted in ascending order. The top 35% of observations are

flagged for screening.

Although not commonly adopted, random verification is still used by some

Quality Assurance schools. To assist the analysis, this study adopts the random


policy as a reference point. The policy randomly selects 50% of applications to

verify.

To measure the performance of a screening policy, the class label is needed

for each observation. Deceptive cases are used as positive samples, and truthful

data are selected as negative samples. It is not necessary to label the artificial

deception data since the cases in which perturbation was applied are recorded. Since the

original data do not have associated labels, cases are labeled by a labeling method.

The labeling method differs for natural and deliberate deception.

Figure 4.2 describes the method to label the natural deception data set. For

natural deception, deceptive and true observations are mixed. In order to classify

true and deceptive cases, the distance from truth to natural deception is calculated.

Non-zero distance indicates that the case contains deception. Thus, these identified

cases are labeled as deceptive. The other cases in the dataset are labeled as

true. Since small amounts of deception are involved in the natural deception, it is

not necessary to split deceptive cases.

Figure 4.2: Labeling Natural Deception Data
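The labeling rule for natural deception can be sketched as follows (a minimal illustration, assuming paired truth and natural-deception records that share the same numeric attributes; names are illustrative):

```python
def label_natural(truth, natural):
    """Label each natural-deception case by its Euclidean distance from the
    paired truth record: any nonzero distance marks the case as deceptive."""
    labels = []
    for t, n in zip(truth, natural):
        distance = sum((t[k] - n[k]) ** 2 for k in t) ** 0.5
        labels.append("deceptive" if distance > 0 else "true")
    return labels
```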

The method to label the deliberate deception data set is described in Figure

4.3. In the deliberate deception, all of the cases contain deception. Therefore, it is

not necessary to calculate the distance from the truth to the deception to determine

the case label. Since it is not realistic to use 100% deceptive cases, cases with truth


and deliberate deception are mixed. Based on the reported percentage of deception

in financial aid (Rhodes & Tuccillo, 2008), 20% deliberate deception cases labeled

as deceptive are used. The other cases are collected from the truth dataset and

labeled as true.

Figure 4.3: Labeling Deliberate Deception Data

4.3 Performance Measures

In the machine learning literature, typical metrics for measuring the performance

of classification systems are accuracy and information theoretic measures such as

precision and recall. Since accuracy assumes that the class priors in the target

environment are constant, and is thus sensitive to the class distribution, it is not

used in this study. A realistic evaluation should also take misclassification costs

into account. This is especially important if cost asymmetries exist. Therefore, an

information theoretic measure (Harmonic Mean) and a cost model are used for

evaluating the performance of the screening policies in this study.


4.3.1 Information Theoretic Measure

In a statistical classification task, the performance of a classification

prediction can be evaluated using the data in a confusion matrix or contingency

table (Kohavi and Provost, 1998) which contains information about actual and

predicted classifications done by a classification system. In the two-class case with

classes true and deceptive, a single prediction has the four different possible

outcomes shown in Table 4.2. The true positives (TP) and true negatives (TN) are

correct classifications. A false positive (FP) occurs when the outcome is incorrectly

predicted as deceptive (or positive) when it is actually true (negative). A false

negative (FN) occurs when the outcome is incorrectly predicted as negative when it

is actually positive.

Table 4.2: Confusion matrix of a true-deception prediction

Actual

Deceptive True

Predicted Screening TP FP

No Screening FN TN

The Precision for a class is the number of true positives (i.e. the number of

items correctly labeled as belonging to the positive class) divided by the total

number of elements labeled as belonging to the positive class (i.e. the sum of true

positives and false positives). Recall in this context is defined as the number of true

positives divided by the total number of elements that actually belong to the

positive class (i.e. the sum of true positives and false negatives). A Precision score

of 1.0 for class C means that every item labeled as belonging to class C does indeed


belong to class C whereas a Recall of 1.0 means that every item from class C was

labeled as belonging to class C.

Precision = TP / (TP + FP)                                              (4.1)

Recall = TP / (TP + FN)                                                 (4.2)

Usually, Precision and Recall scores are not discussed in isolation. Instead,

either values for one measure are compared for a fixed level at the other measure or

both are combined into a single measure. The Harmonic mean combines Precision

and Recall into a single number ranging from 0 (worst prediction) to 1 (best

prediction).

Harmonic mean = 2 × (Precision × Recall) / (Precision + Recall)         (4.3)
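Equations 4.1 through 4.3 translate directly into code (a straightforward sketch; the function names are illustrative):

```python
def precision(tp, fp):
    """Equation 4.1: fraction of screened cases that are truly deceptive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation 4.2: fraction of deceptive cases that are screened."""
    return tp / (tp + fn)

def harmonic_mean(p, r):
    """Equation 4.3: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)
```

For instance, with TP = 8, FP = 2, and FN = 8, precision is 0.8 and recall is 0.5, so the harmonic mean is 8/13, closer to the lower of the two as expected of a harmonic mean.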

In summary, Figure 4.4 illustrates how the confusion matrix is constructed

for a screening problem in financial aid application and how performance metrics

are calculated.


Figure 4.4: Method to Calculate HM

4.3.2 Cost Model

In the domain of deception detection, it is important and necessary to place

monetary value on predictions. Therefore, a cost model is proposed in this study to

evaluate the screening policies performance based on the cost and benefit of

detecting deception.

4.3.2.1 Budget Models

According to the U.S. Department of Education (2009), Federal Pell

Grant amounts are directly determined by the expected family contribution (EFC).

Pell Grant funding never runs out, so an eligible student always receives the grant.

Schools then use the EFC to determine eligibility for other aid programs (grant,

loan, work-study). These aid programs are typically limited. According to this


information, when allocating the award, two models are applied in this study based

on two different types of budget:

a) Fixed budget model: In this model, only the top 30% of the applicants are

awarded since the financial aid resources are limited. Therefore, the total

amount of award is static. The model calculates the distance from each

applicant to the ideal candidate and ranks all these distances in ascending

order. There are three levels of award. The applicants corresponding to the

top 5% distances are eligible for the level 1 award, the next 10% and the

next 15% are qualified for the level 2 award and level 3 award respectively.

b) Variable budget model: In this model, the total amount of financial aid is

flexible because it is determined by a linear function of EFC. Since the

experiment data do not permit EFC calculation, a distance-based method is

used instead. The function is formulated based on three points and two

distances. The three points are the applicant, the marginal candidate and the

ideal candidate. The marginal candidate holds the lowest values still considered

eligible (the margin), while the ideal candidate represents the perfect

financial aid application. The two distances are the distance

from the applicant to the ideal candidate and the distance from the marginal

candidate to the ideal candidate. If an applicant's distance from the ideal

candidate is less than the marginal candidate's distance from the ideal

candidate, the financial aid is awarded. The amount of aid is calculated

using the equation described below.

Financial Aid Amount = (1 − Distance(Student, Ideal Candidate) / Distance(Marginal Candidate, Ideal Candidate)) × Full Amount        (4.4)
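A minimal sketch of equation 4.4 (the function and argument names are illustrative; the zero-award case reflects the stated rule that aid is granted only when the applicant's distance is below the marginal candidate's):

```python
def aid_amount(d_student, d_marginal, full_amount):
    """Variable budget model (eq. 4.4): the award shrinks linearly with the
    applicant's distance from the ideal candidate and reaches zero at the
    marginal candidate's distance; beyond that, no aid is awarded."""
    if d_student >= d_marginal:
        return 0.0
    return (1 - d_student / d_marginal) * full_amount
```

An applicant at the ideal candidate receives the full amount, one halfway to the margin receives half, and one past the margin receives nothing.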


Figure 4.5 illustrates the method to allocate the financial aid based on two

models of budget that were described above.

Figure 4.5: Award Allocation Based on Two Models of Budget

4.3.2.2 Award Difference

To better describe and understand the cost caused by misallocating financial

aid between the application with truth and the application with deceptive

information, the award difference is used in our cost model. Since the screening

policy will be applied to the deceptive data set, the award difference should not be

based on the award difference between the truth data set and the deceptive data set

before the screening. The screening policy will address some of the problematic

applications and thus adjust the original deceptive data set. The final award

allocation will be based on the adjusted data set. Therefore, the award difference is

the difference between the awards in the truth data set and the policy adjusted data

set for each applicant. The meaning of award difference is illustrated in Figure 4.6.



Figure 4.6: The Meaning of Award Difference

4.3.2.3 Cost Model Structure

The cost matrix for deception detection is shown in Table 4.3. A false

positive error (a false alarm) corresponds to wrongly deciding that an applicant

provides deceptive information. A false negative error (a miss) corresponds to

letting deceptive information go undetected. A true negative prediction corresponds

to correctly classifying an application with true information as non-deceptive. A

true positive prediction corresponds to successfully detecting that an applicant

provides deceptive information.

Table 4.3: Cost matrix for deception detection

Prediction Deception No deception

Alert Hit (TP) False Alarm (FP)

No alert Miss (FN) Normal (TN)

Table 4.4 below illustrates the cost model. This particular cost model has

two assumptions. First, all alerts must be investigated. Second, the deceptive

application can be successfully caught by investigation. Due to the different dollar


amount of each application, the cost varies with each application. Hence, the cost

model for this domain relies on the sum and average of loss caused by deception.

They are defined as:

n

CumulativeCost = ^ Cost(i) 4.5

and

AverageCost =

CumulativeCost

n

4.6

Where Cost(i) is the cost associated with application i, and n is the total number of

applications.

Table 4.4 shows that false alarms and hits incur investigation costs, and

misses pay out the award difference. Hits have a benefit of avoiding paying the

award difference. There is no cost associated with Normals. Based on the cost of a

deception analysts time, the average cost per investigation for the financial aid

application data set is estimated to be about $25.

Table 4.4: Cost model for financial aid application deception detection

Outcome                                 Cost

Misses (false negative, FN)             Award Difference [Deception − Truth]

False alarms (false positive, FP)       Average Cost Per Investigation

Hits (true positive, TP)                Average Cost Per Investigation − Award Difference [Deception − Truth]

Normals (true negative, TN)             0
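Under the two stated assumptions, the cost model of Table 4.4 and equations 4.5 and 4.6 can be sketched as follows (the $25 investigation cost comes from the text; the outcome codes and function names are illustrative):

```python
INVESTIGATION_COST = 25.0   # estimated average cost per investigation

def application_cost(outcome, award_difference):
    """Cost of one screened application under the Table 4.4 model.
    `award_difference` is award(deception) minus award(truth)."""
    if outcome == "FN":    # miss: the inflated award is paid out
        return award_difference
    if outcome == "FP":    # false alarm: investigation cost only
        return INVESTIGATION_COST
    if outcome == "TP":    # hit: investigate, but avoid the overpayment
        return INVESTIGATION_COST - award_difference
    return 0.0             # TN: no cost

def cumulative_cost(cases):
    """Equations 4.5 and 4.6: total and average cost over all applications;
    `cases` is a list of (outcome, award_difference) pairs."""
    total = sum(application_cost(o, d) for o, d in cases)
    return total, total / len(cases)
```

Note that a hit with a large award difference can yield a negative cost, i.e. a net benefit, since the avoided overpayment outweighs the investigation expense.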


5. Experimental Results and Analysis

This chapter presents experimental results and analysis to provide evidence

about the research framework and experimental designs presented in previous

chapters. The experiments test research questions involving the fit between

artificial deception and real deception and the impact of artificial deception on

performance of the screening policies.

5.1 Simulation of Real Deception

The relationship between real deception and artificial deception are tested by

paired tests of group means, with a < .05 being set as the level of statistical

significance, with an N of 150. In order to use paired t-tests to determine if there is

a significant difference between the two means, the assumptions need to be

satisfied. Paired t-test assumes that the paired differences are independent and are

all normally distributed. Since the subjects are randomly selected and independent

of any other subjects, the assumption of independence has been met. To determine

whether the assumption of normality is valid for the data, the normality tests are

conducted. If the data follow the normal distribution, the t-test will be used.

Otherwise, the Wilcoxon test, which is a nonparametric procedure, is used.
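In this study the tests were run in SPSS; purely as an illustration of the nonparametric alternative, the Wilcoxon signed-rank statistic for paired samples can be computed as follows (a minimal sketch that returns the test statistic only, without the p-value):

```python
def wilcoxon_statistic(x, y):
    """Wilcoxon signed-rank statistic for paired samples: drop zero
    differences, rank the absolute differences (average ranks for ties),
    and return the smaller of the positive- and negative-rank sums."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        while (j + 1 < len(ranked)
               and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]])):
            j += 1                       # extend the run of tied magnitudes
        avg_rank = (i + j) / 2 + 1       # average rank for the tied run
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)
```

When every difference has the same sign the statistic is 0, the most extreme value, which is why small statistics correspond to small p-values.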

If the result is not statistically significant at the chosen alpha, the null

hypothesis of no difference in means between two groups cant be rejected. On the

other hand, if the difference is statistically significant, an effect size should be

examined to see if the difference is also practically significant. A statistically

significant outcome only indicates that it is likely that there is a difference between

group means. It does not mean that the difference is large or important. In this study, the effect sizes are computed using the Hedges' g measure (Hedges, 1981). Cohen


provides the following guidelines for effect size (r): small, r = 0.1; medium, r = 0.3; large, r = 0.5.
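A minimal sketch of the effect-size computation, using the standard bias-corrected standardized mean difference of Hedges (1981); the study's exact SPSS-based procedure is assumed to be equivalent:

```python
import numpy as np

def hedges_g(x, y):
    """Hedges' g: Cohen's d scaled by the small-sample
    correction J = 1 - 3 / (4 * df - 1), with df = nx + ny - 2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # pooled standard deviation across the two groups
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / df)
    d = (x.mean() - y.mean()) / sp
    return d * (1 - 3 / (4 * df - 1))
```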

In the following results, the fit between real deception and artificial deception

is evaluated by two measures defined in chapter 3 as directed distance and outlier

score.

5.1.1 Analysis Based on Distance Measure

For easy interpretation, the hypotheses, which were proposed in Chapter 3,

are restated with the null hypothesis and alternative hypothesis based on the

directed distance measure as shown in Table 5.1. The mean of directed distances

from the true data to the deception data is denoted by μdistance(deception type), where

deception type indicates real (natural or deliberate) or artificial (VariableLow,

VariableHigh, ApplicationLow or ApplicationHigh) deception.

Table 5.1: Hypotheses based on directed distance measure

Null Hypothesis Alternative Hypothesis

μdistance(natural) = μdistance(VariableLow) | μdistance(natural) ≠ μdistance(VariableLow)

μdistance(natural) = μdistance(ApplicationLow) | μdistance(natural) ≠ μdistance(ApplicationLow)

μdistance(deliberate) = μdistance(VariableHigh) | μdistance(deliberate) ≠ μdistance(VariableHigh)

μdistance(deliberate) = μdistance(ApplicationHigh) | μdistance(deliberate) ≠ μdistance(ApplicationHigh)

To determine whether the data are normally distributed, normality tests are

conducted in SPSS. As shown in Table 5.2, the results of the test for normality

suggest that the null hypothesis of samples being normally distributed is rejected.

Therefore, the Wilcoxon test is conducted to test the mean differences between the

real and artificial data.


Table 5.2: Normality tests for distance variable

Tests of Normality

Kolmogorov-Smirnov(a) Shapiro-Wilk

Statistic df Sig. Statistic df Sig.

NaturalDist .384 150 .000 .646 150 .000

DeliberateDist .469 150 .000 .503 150 .000

a. Lilliefors Significance Correction

Table 5.3 shows the SPSS output for testing the mean distance differences

between the real and artificial deception data. Table 5.4 summarizes the results and

lists the p-value and the effect size for the significant result. The results, as

expected, show that there was no statistical difference between natural deception

and artificial deception generated by the application deception model with low

noise level. The results also show no statistically significant difference in means between deliberate deception and artificial deception generated using the application deception model with

high noise level. In addition, the results reject the null hypothesis and confirm that

real deception and artificial deception generated by the variable noise model differ

statistically. Furthermore, the effect sizes of 0.30 (moderate) and 0.74 (large)

suggest that both statistically significant results are practically significant.


Table 5.3: Sample mean results based on distance measure

Test Statistics(c)

VariableLDist - NaturalDist | ApplicationLDist - NaturalDist | VariableHDist - DeliberateDist | ApplicationHDist - DeliberateDist

Z: -3.724(a) | -1.076(b) | -9.052(a) | -1.528(b)

Asymp. Sig. (2-tailed): .000 | .282 | .000 | .127

a. Based on negative ranks.

b. Based on positive ranks.

c. Wilcoxon Signed Ranks Test

Table 5.4: Summary of statistical test results based on distance measure

(The numbers in parentheses are the p-values and effect sizes.)

Comparison Distance Measure

Natural vs. VariableLow Significant (.000*, 0.30)

Natural vs. ApplicationLow Not significant (.689)

Deliberate vs. VariableHigh Significant (.000*, 0.74)

Deliberate vs. ApplicationHigh Not significant (.687)

*. Significant at the .05 level

5.1.2 Analysis Based on Outlier Score

The outlier score indicates the unusualness of an observation relative to other

observations. It can be considered as a multivariate measure of dispersion of an

outlier relative to other outliers. The relationship between real deception and

artificial deception is measured by testing the average outlier score. Table 5.5

restates the hypotheses that were proposed in Chapter 3 with the null hypothesis

and alternative hypothesis based on the outlier score measure. The mean of the

outlier scores in the deception data set is denoted by μoutlier(deception type), where


deception type indicates real (natural or deliberate) or artificial (VariableLow,

VariableHigh, ApplicationLow or ApplicationHigh) deception.

Table 5.5: Hypotheses based on outlier score

Null Hypothesis Alternative Hypothesis

μoutlier(natural) = μoutlier(VariableLow) | μoutlier(natural) ≠ μoutlier(VariableLow)

μoutlier(natural) = μoutlier(ApplicationLow) | μoutlier(natural) ≠ μoutlier(ApplicationLow)

μoutlier(deliberate) = μoutlier(VariableHigh) | μoutlier(deliberate) ≠ μoutlier(VariableHigh)

μoutlier(deliberate) = μoutlier(ApplicationHigh) | μoutlier(deliberate) ≠ μoutlier(ApplicationHigh)

In order to evaluate whether the outlier scores follow the normal

distribution, the normality tests are performed in SPSS. Table 5.6 shows the

normality tests for outlier scores. From the table, the p-value from the

Kolmogorov-Smirnov test (0.200) suggests that there is insufficient evidence to

reject the null hypothesis of samples being normally distributed. Therefore, the

paired t-test can be conducted to test the mean differences of outlier score between

the real and artificial data.

Table 5.6: Normality tests for outlier scores

Tests of Normality

Kolmogorov-Smirnov(a) Shapiro-Wilk

Statistic df Sig. Statistic df Sig.

DKNNatural .064 150 .200* .961 150 .000

DKNDeliberate .053 150 .200* .975 150 .008

*. This is a lower bound of the true significance.

a. Lilliefors Significance Correction

In this study, two outlier detection algorithms are adopted to generate an outlier score for each observation in the real and artificial deception data sets. Table


5.7 shows the analysis results for the mean outlier score based on the Dn algorithm. When comparing μoutlier(natural) vs. μoutlier(ApplicationLow) and μoutlier(deliberate) vs. μoutlier(ApplicationHigh), the non-significant results do not reject the null hypotheses and thus support the hypotheses that real deception has the same mean outlier score as artificial deception generated by the application deception model. When testing the difference between μoutlier(deliberate) and μoutlier(VariableHigh), the significant result, as expected, demonstrates that deliberate deception and artificial deception generated using the variable noise model with high noise level are different. Meanwhile, the large effect size of 0.80 indicates that the observed statistical significance is also practically significant. In the comparison of μoutlier(natural) with μoutlier(VariableLow), the p-value of .064 does not reject the null hypothesis, but the marginal significance provides some evidence that the mean outlier scores for the natural deception data and the artificial deception data modeled by low variable noise are different.

Table 5.7: Sample mean results based on outlier score calculated by Dn algorithm

Comparison Outlier Score ( Dn )

Real Deception vs. Artificial Deception | t-value | Sig. | Effect Size

Natural vs. VariableLow -1.870 .064 Not applicable

Natural vs. ApplicationLow -.692 .490 Not applicable

Deliberate vs. VariableHigh -4.860 .000* 0.80

Deliberate vs. ApplicationHigh .946 .346 Not applicable

*. Significant at the .05 level
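The two outlier scores can be approximated with scikit-learn: LOF directly, and a Dn-style score taken here as the distance to the k-th nearest neighbor, an assumption standing in for the exact Dn definition given in Chapter 3:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

def outlier_scores(X, k=5):
    """Return (knn_dist, lof): two outlier scores per row of X."""
    X = np.asarray(X, float)
    # Dn-style score: distance to the k-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
    dist, _ = nn.kneighbors(X)
    knn_dist = dist[:, -1]
    # LOF: sklearn stores the negated factor, so flip the sign
    lof_model = LocalOutlierFactor(n_neighbors=k).fit(X)
    lof = -lof_model.negative_outlier_factor_
    return knn_dist, lof
```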

Table 5.8 shows the analysis results for the mean outlier score

based on the LOF algorithm. The results provide evidence that there is no statistical

difference between real deception and artificial deception generated by the


application deception model. Also, the significant results offer support for the

differences between real deception and artificial deception generated by the

variable noise model. Furthermore, two moderate effect sizes confirm that the

observed differences are practically significant.

Table 5.8: Sample mean results based on outlier score calculated by LOF algorithm

Comparison Outlier Score ( LOF )

Real Deception vs. Artificial Deception | t-value | Sig. | Effect Size

Natural vs. VariableLow 2.926 .004* 0.48

Natural vs. ApplicationLow -.282 .778 Not applicable

Deliberate vs. VariableHigh 3.240 .001* 0.53

Deliberate vs. ApplicationHigh -.066 .948 Not applicable

*. Significant at the .05 level

5.1.3 Discussion

This part of the study involves the fit between real deception and artificial

deception. The experimental results of the hypothesis tests based on the two measures are

systematically summarized in Table 5.9.

From the table, it can be seen that Hypothesis 1a, considering the

relationship between natural deception and artificial deception generated by

variable noise model with low level of noise, is supported by the directed distance

measure and the outlier score produced by the LOF algorithm. When testing the

mean difference of the outlier score produced by the Dn algorithm, although the null hypothesis is not rejected, the marginal significance provides some evidence of a mean difference in Dn outlier scores between the two data sets. Hypothesis 1b,

focusing on the relationship between natural deception and artificial deception


generated by the application deception model with low level of noise, is supported.

Hypotheses 2a and 2b, concerning the relationships between deliberate deception and artificial deception generated by the variable noise model and the application deception model with high level of noise, are also supported.

These results provide evidence that artificially generated deception could be used instead of real deceptive data. However, the noise model for generating deception has to be selected carefully. The existing noise models in the literature, which perturb the original state uniformly without considering real deception behavior, are not appropriate. Deception involves intentional and goal-oriented behaviors. Therefore, data should be perturbed purposefully. The proposed

application deception model applies directed corruption to simulate deception. The

results further confirm the importance of directed corruption to simulate deception

and provide evidence that the model could be adopted to generate deception.


Table 5.9: Summary of findings

Hypothesis | Findings

H1a: There is a significant difference between natural deception and artificial deception modeled by low variable noise. | Support: There are statistically significant differences in directed distances and LOF-produced outlier scores between the two data sets. Exception: Marginal significance indicates some evidence, but not conclusive evidence, of a difference in Dn-produced outlier scores between the two data sets.

H1b: There is no significant difference between natural deception and artificial deception modeled by low deception noise. | Support: There is no statistically significant difference in directed distances or Dn- and LOF-produced outlier scores between the two data sets.

H2a: There is a significant difference between deliberate deception and artificial deception modeled by high variable noise. | Support: There are statistically significant differences in directed distances and Dn- and LOF-produced outlier scores between the two data sets.

H2b: There is no significant difference between deliberate deception and artificial deception modeled by high deception noise. | Support: There is no statistically significant difference in directed distances or Dn- and LOF-produced outlier scores between the two data sets.

5.2 Impact of Real and Artificial Deception on Screening

Policies

This section assesses the impact of real deception and artificial deception on

screening policy performance. Typically, schools adopt the top policy to evaluate

financial aid applicants who are eligible or nearly eligible. Schools participating in the Quality Assurance Program also draw random samples of aid applicants to verify (Rhodes & Tuccillo, 2008).

In the following results, the prediction of the screening policy is evaluated by

two types of performance measures as described in Chapter 4 with respect to the

real and artificial deception data sets: information theoretic measure and cost. The

results of the impacts of real deception and artificial deception on the top policy

and the random policy are presented.

5.2.1 Impact on Top Policy

Typically, the top policy verifies 30% of the eligible applications and 5% of

the nearly eligible applications. Data are initially analyzed at a 35%

verification percentage. Additionally, in order to capture the pattern of the impacts

of real and artificial deception on the top policy in different percentages of

verification, the percentage of screened cases is randomly varied 200 times

between 20% and 40%. For each run, based on the randomly selected percentage of

verification, the corresponding number of observations in the ranked list

constructed by calculating the distance from the ideal candidate is flagged. The rest

of observations are not screened. In the following two sections, experimental

results that measure the top policy performance in terms of Harmonic Mean (HM)

and cost are presented and analyzed, respectively.
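The simulation loop described above (rank applicants by distance from the ideal candidate, then flag a randomly drawn top percentage in each of 200 runs) might look like the following sketch; the Euclidean distance and the array layout are assumptions:

```python
import numpy as np

def simulate_top_policy(features, ideal, runs=200, lo=0.20, hi=0.40, seed=42):
    """Flag the applicants closest to the ideal candidate; the screened
    fraction is redrawn uniformly from [lo, hi] in each run."""
    rng = np.random.default_rng(seed)
    # rank applicants by distance from the ideal candidate (ascending)
    dist = np.linalg.norm(np.asarray(features, float) - ideal, axis=1)
    order = np.argsort(dist)
    n = len(dist)
    flags_per_run = []
    for _ in range(runs):
        pct = rng.uniform(lo, hi)                 # screened percentage for this run
        flagged = np.zeros(n, dtype=bool)
        flagged[order[:int(round(pct * n))]] = True
        flags_per_run.append(flagged)
    return flags_per_run
```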

5.2.1.1 Performance Comparison based on Harmonic Mean

HM does not have a probabilistic interpretation; hence, significance tests cannot be applied to its values. To compare the performance difference with HM

between the real deception data and the artificial deception data, the relative

percentage difference (PD) for HM is defined as:


PD(HM) = (HMartificial - HMreal) / HMreal × 100%     (5.1)
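Equation 5.1 translates directly into code:

```python
def pd_hm(hm_artificial, hm_real):
    """Relative percentage difference in harmonic mean (Equation 5.1)."""
    return (hm_artificial - hm_real) / hm_real * 100.0
```

For instance, the Natural vs. VariableLow comparison in Table 5.10 gives pd_hm(0.069, 0.291) ≈ -76.3, matching the 76.3% decrease reported there.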

Table 5.10 shows the experimental results that measure the top policy's performance on the real deception data sets and the artificial deception data sets with respect to HM. The numbers in parentheses are the HM values. The relative percentage differences of HM for each pair of comparisons are listed in the right column of the table. The impacts of real and artificial deception on the screening policy are compared based on 35% of screened cases.

As these data show, in comparison with the performance on the natural

deception data, the policy's performance decreases 76.3% on the artificial deception data generated by the variable noise model with low noise level. The artificial deception data generated by the variable noise model with high noise level yield a 52% performance improvement for the policy relative to the deliberate deception data. These results, as desired, imply that the real deception data and the

artificial deception data generated by the variable noise model have different

impacts on the policy. In comparison with the performance on the natural deception

data, the policy's performance increases 17.8% on the artificial deception data generated by the application deception model with low noise level. The 1.9% performance difference for Deliberate vs. ApplicationHigh suggests that the deliberate deception data and the artificial deception data with high noise level have similar impacts on the policy.


Table 5.10: Relative HM percentage differences for the top policy

(The numbers in parentheses are the HM values.)

Comparison | PD(HM)

Natural (0.291), VariableLow (0.069) | 76.3% decrease

Natural (0.291), ApplicationLow (0.343) | 17.8% increase

Deliberate (0.313), VariableHigh (0.476) | 52% increase

Deliberate (0.313), ApplicationHigh (0.319) | 1.9% increase

Figure 5.1 shows the impacts of real and artificial deception on the top

policy with HM when the percentage of screened cases is varied 200 times from 20% to 40%. In these graphs, the x-axis denotes the percentage of screened cases while the y-axis represents the HM value. For consistency, the values corresponding to the real deception data and the artificial deception data are indicated by the dark color line and the light color line, respectively.

Figure 5.1 (a) and (c) represent the performance comparison between two

types of real deception and artificial deception generated by the variable noise

model. As expected, the graphs show obvious performance difference of the

screening policy on the real deception data and the variable noise modeled data

based on the HM metric. The performance comparisons between real deception and

artificial deception generated by application deception model are shown in (b) and

(d) of Figure 5.1. As shown in Figure 5.1 (b), compared with the performance on natural deception, the screening policy performs slightly better on artificial deception generated by the application deception model with low noise level. In Figure 5.1 (d), the two lines overlap, which indicates that the screening policy performs similarly on the deliberate deception data and the artificial deception data generated

by the application deception model with high noise level. These graphs visually


describe the performance difference between the real and artificial deception data

and further confirm the results in Table 5.10.

[Four-panel chart: HM versus top policy percentage. Panels: (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow, (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]

Figure 5.1: The Impacts of Real and Artificial Deception on Top Policy at Different

Percentage of Screened Case with Harmonic Mean


5.2.1.2 Performance Comparison based on Cost Model

Table 5.11 restates the hypotheses that were proposed in Chapter 4 with the

null hypothesis and alternative hypothesis based on the cost measure. The mean

cost of the screening policy on the deception data is denoted by μcost(deception type), where deception type indicates real (natural or deliberate) or artificial (VariableLow, VariableHigh, ApplicationLow or ApplicationHigh) deception.

Table 5.11: Hypotheses based on cost measure

Null Hypothesis Alternative Hypothesis

μcost(natural) = μcost(VariableLow) | μcost(natural) ≠ μcost(VariableLow)

μcost(natural) = μcost(ApplicationLow) | μcost(natural) ≠ μcost(ApplicationLow)

μcost(deliberate) = μcost(VariableHigh) | μcost(deliberate) ≠ μcost(VariableHigh)

μcost(deliberate) = μcost(ApplicationHigh) | μcost(deliberate) ≠ μcost(ApplicationHigh)

To determine whether the assumption of normality is valid for the cost

variable, a normality test is performed. Table 5.12 shows the results of the

normality test in SPSS. Based on the Kolmogorov-Smirnov and Shapiro-Wilk statistics, the null hypothesis of normality is rejected; therefore the Wilcoxon test is conducted to test the mean cost differences.

Table 5.12: Normality tests for cost variable

Tests of Normality

Kolmogorov-Smirnov(a) Shapiro-Wilk

Statistic df Sig. Statistic df Sig.

NaturalTF .477 150 .000 .366 150 .000

DeliberateTF .473 150 .000 .221 150 .000

a. Lilliefors Significance Correction


Table 5.13 and 5.14 show the SPSS outputs of testing the mean cost

differences between the impacts of real and artificial deception on the top policy

based on the fixed and variable budget models that were defined in Chapter 4. To

aid the understanding, Table 5.15 summarizes the results and lists the p-value and

effect size for the significant result.

From Table 5.15, it can be seen that deliberate deception and artificial

deception generated by the variable noise model with high noise level have

significantly different impacts on the policy based on the variable budget model as

predicted. However, the fixed budget model does not produce the same result.

Similarly, the significant result for testing Natural vs. VariableLow suggests a performance difference based on the variable budget model, while the non-significant results fail to reject the null hypothesis based on the fixed budget model.

When comparing Deliberate vs. ApplicationHigh, the non-significant results for both cost models do not reject the null hypothesis and thus suggest that the policy has similar performance on the deliberate deception data and the artificial deception data generated by the application deception model with high noise level. The results for

testing the performance difference between the natural deception data and the

artificial deception data generated by the application deception model with low

noise level arrive at the same conclusion.


Table 5.13: Cost (fixed budget) sample mean results

Test Statistics(c)

VariableLTF - NaturalTF | ApplicationLTF - NaturalTF | VariableHTF - DeliberateTF | ApplicationHTF - DeliberateTF

Z: -.907(a) | -1.802 | -1.194(a) | -.018(a)

Asymp. Sig. (2-tailed): .364 | .072 | .232 | .985

a. Based on negative ranks.

b. Based on positive ranks.

c. Wilcoxon Signed Ranks Test

Table 5.14: Cost (variable budget) sample mean results

Test Statistics(c)

VariableLTV - NaturalTV | ApplicationLTV - NaturalTV | VariableHTV - DeliberateTV | ApplicationHTV - DeliberateTV

Z: -2.018(a) | -1.317 | -4.727(a) | -.233

Asymp. Sig. (2-tailed): .044 | .188 | .000 | .816

a. Based on negative ranks.

b. Based on positive ranks.

c. Wilcoxon Signed Ranks Test

Table 5.15: Summary of statistical test results based on cost measure

(The numbers in parentheses are the p-values and effect sizes.)

Comparison Fixed Budget Variable Budget

Natural vs. VariableLow Not significant (.364) Significant (.044*, 0.16)

Natural vs. ApplicationLow Not significant (.072) Not significant (.188)

Deliberate vs. VariableHigh Not Significant (.232) Significant (.000*, 0.39)

Deliberate vs. ApplicationHigh Not significant (.985) Not significant (.816)

*. Significant at the .05 level


In the above tables, the impacts of real and artificial deception on the

screening policy are compared based on 35% of screened cases. Figure 5.2 and

Figure 5.3 show the impacts of real and artificial deception on the top policy with

the fixed and variable budget models when the percentage of screened cases is varied 200 times from 20% to 40%. In the graphs, the x-axis denotes the percentage of screened cases and the y-axis represents the mean cost in dollars.

In Figures 5.2 and 5.3, panels (a) and (c) represent the cost comparison between the two types of real deception and artificial deception generated by the variable

noise model. As shown in Figure 5.2 (a), when comparing natural deception with

artificial deception generated by the variable noise model with low noise level, the

costs are similar. However, the variable budget model in Figure 5.3 (a) exhibits

significant performance difference and thus suggests that the policy performs better

on the natural deception data. The graphs (c) in Figure 5.2 and 5.3 show obvious

performance difference of the screening policy on the deliberate deception data and

the variable noise modeled data as predicted. Also, these graphs demonstrate that the variable noise modeled data always cost more than the real deception data.

The cost comparisons between real deception and artificial deception

generated by the application deception model are shown in (b) and (d) of Figures 5.2

and 5.3. As expected, the graphs show similar performance of the screening policy

on the real deception data and the application deception modeled data based on

both budget models. These graphs also show that the application deception modeled data always cost slightly less than the real deception data.


[Four-panel chart: mean cost (fixed budget, dollars) versus top policy percentage. Panels: (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow, (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]

Figure 5.2: The Impacts of Real and Artificial Deception on Top Policy at Different

Percentage of Screened Case with Cost (Fixed Budget)


[Four-panel chart: mean cost (variable budget, dollars) versus top policy percentage. Panels: (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow, (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]

Figure 5.3: The Impacts of Real and Artificial Deception on Top Policy at Different

Percentage of Screened Case with Cost (Variable Budget)


5.2.2 Impact on Random Policy

The random policy is not commonly adopted by schools. However, it would

be helpful to use the random policy as a reference point to assist the analysis. In

this section, experimental results that measure the random policy performance

based on HM and the cost are presented, respectively. The random policy typically

verifies about 50% of applications. Data are first analyzed based on the 50%

verification percentage. Additionally, in order to capture the pattern of the impacts

of real and artificial deception on the random policy in different percentages of

verification, the percentage of screened cases is randomly varied 200 times

between 40% and 60%. For each run, based on the randomly selected percentage of

verification, the corresponding numbers of observations are randomly selected and

screened. The rest of the observations are not screened.
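The random policy replaces the distance ranking with uniform sampling; a minimal sketch under the same illustrative assumptions as before:

```python
import numpy as np

def random_policy_flags(n_applicants, runs=200, lo=0.40, hi=0.60, seed=7):
    """Randomly screen a uniformly drawn percentage of applicants per run."""
    rng = np.random.default_rng(seed)
    flags_per_run = []
    for _ in range(runs):
        pct = rng.uniform(lo, hi)                       # screened percentage
        k = int(round(pct * n_applicants))
        flagged = np.zeros(n_applicants, dtype=bool)
        flagged[rng.choice(n_applicants, size=k, replace=False)] = True
        flags_per_run.append(flagged)
    return flags_per_run
```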

The experimental results analogous to the ones displayed for the top policy

in section 5.2.1 are shown in Table 5.16 for relative harmonic mean percentage

differences. Table 5.17 and 5.18 show the results of statistical tests for the mean

cost based on the fixed and variable budget models, respectively. Table 5.19

summarizes the results and lists the p-value and the effect size for the significant

result.

Figures 5.4 through 5.6 show the performance results for the random policy based on HM and costs (fixed and variable budgets) when the percentage of screened cases is randomly varied between 40% and 60%. Due to the random selection, it can be noticed that the curves in these graphs are not as smooth as the ones in the graphs for the top policy. Despite the fluctuation, the graphs show similar patterns to the ones for the top policy.


Table 5.16: Relative HM percentage differences for the random policy

(The numbers in parentheses are the HM values.)

Comparison | PD(HM)

Natural (0.46), VariableLow (0.132) | 56.9% decrease

Natural (0.46), ApplicationLow (0.552) | 20% increase

Deliberate (0.267), VariableHigh (0.476) | 78.2% increase

Deliberate (0.267), ApplicationHigh (0.25) | 6.3% decrease

Table 5.17: Cost (fixed budget) sample mean results for random policy

Test Statistics(b)

VariableLRF - NaturalRF | ApplicationLRF - NaturalRF | VariableHRF - DeliberateRF | ApplicationHRF - DeliberateRF

Z: -.977(a) | -.141(a) | -1.780(a) | -.709(a)

Asymp. Sig. (2-tailed): .328 | .888 | .075 | .478

a. Based on negative ranks.

b. Wilcoxon Signed Ranks Test

Table 5.18: Cost (variable budget) sample mean results for random policy

Test Statistics(c)

VariableLRV - NaturalRV | ApplicationLRV - NaturalRV | VariableHRV - DeliberateRV | ApplicationHRV - DeliberateRV

Z: -.870(a) | .742(b) | -3.787(a) | -.750(a)

Asymp. Sig. (2-tailed): .384 | .672 | .000 | .453

a. Based on negative ranks.

b. Based on positive ranks.

c. Wilcoxon Signed Ranks Test


Table 5.19: Summary of statistical test results based on cost measure

(The numbers in parentheses are the p-values and effect sizes.)

Comparison | Fixed Budget | Variable Budget

Natural vs. VariableLow | Not significant (.328) | Not significant (.384)

Natural vs. ApplicationLow | Not significant (.888) | Not significant (.672)

Deliberate vs. VariableHigh | Not significant (.075) | Significant (.000*, 0.31)

Deliberate vs. ApplicationHigh | Not significant (.478) | Not significant (.453)

*. Significant at the .05 level

These results are consistent with those obtained for the top policy, which suggest that the real deception data and the artificial deception data generated by the application deception model have a similar impact on the policy, as expected. The non-significant statistical results suggest that the natural deception data and the artificial deception data generated by the variable noise model with low noise level have similar impacts on the policy under both the fixed and variable budget models.

When comparing the mean cost difference between the deliberate deception data

and the artificial deception data generated by the variable noise model with high

noise level, the fixed budget model suggests a non-significant mean cost difference, while the variable budget model indicates that the data with high variable noise perform significantly worse than the deliberate deception data. The results of the statistical tests are further confirmed by the graphs shown in Figures 5.4 through 5.6.


[Four-panel chart: HM versus random policy percentage. Panels: (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow, (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]

Figure 5.4: The Impacts of Real and Artificial Deception on Random Policy at

Different Percentage of Screened Case with Harmonic Mean


[Four-panel chart: mean cost (fixed budget, dollars) versus random policy percentage. Panels: (a) Natural vs. VariableLow, (b) Natural vs. ApplicationLow, (c) Deliberate vs. VariableHigh, (d) Deliberate vs. ApplicationHigh.]

Figure 5.5: The Impacts of Real and Artificial Deception on Random Policy at

Different Percentage of Screened Case with Cost (Fixed Budget)

[Figure 5.6 appears here: four panels, (a)–(d), plotting the mean cost under a varied budget (approximately $0 to $2,400) against the random policy percentage (40.0% to 59.3%). The legible legends pair Deliberate with VariableHigh in panel (c), with panel (d) showing the corresponding ApplicationHigh comparison.]

Figure 5.6: The Impacts of Real and Artificial Deception on the Random Policy at Different Percentages of Screened Cases with Cost (Variable Budget)


5.2.3 Discussion

This part of the study investigates the impacts of real and artificial deception on the screening policies. The performance comparison results are summarized in Tables 5.20 and 5.21 for the top and random policies, respectively. In these two tables, Pe(A) denotes the performance on the artificial deception data and Pe(R) denotes the performance on the real deception data. The results of the hypothesis tests across all measures are summarized in Table 5.22.
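The cell labels in Tables 5.20 and 5.21 follow directly from the paired statistical tests: a comparison is "Similar" when the test is not significant, and "Different" with a direction otherwise. The sketch below illustrates this mapping; it is not the code used in this study, and the function name, the sign convention on the test statistic, and the 0.05 threshold are illustrative assumptions.

```python
def classify_comparison(t_stat, p_value, alpha=0.05):
    """Map a paired-test result on Pe(A) vs. Pe(R) to a table label.

    t_stat is assumed to be positive when the artificial deception
    data performed better (Pe(A) > Pe(R)); alpha is an assumed
    significance threshold.
    """
    if p_value >= alpha:
        return "Similar"
    direction = ">" if t_stat > 0 else "<"
    return "Different Pe(A)" + direction + "Pe(R)"
```

For example, a significant test favoring the real data would yield the label "Different Pe(A)<Pe(R)", as reported for the Natural vs. VariableLow comparison under HM.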

As indicated in Table 5.22, when comparing the impacts of the natural deception data and the artificial deception data generated by the variable noise model with a low noise level, Hypothesis 3a is supported because the HM performance is better for the real deception data than for the artificial deception data. There is an exception, however: no statistically significant performance difference was found between the two data sets under either the fixed budget or the variable budget cost measure. Hypothesis 3b compares the performance on natural deception and on artificial deception generated by the application deception model with a low noise level. The results support the hypothesis: the two data sets have similar impacts on the screening policies under the fixed budget and variable budget cost measures. When performance is measured by the HM metric, however, the screening policies perform better on the artificial data.
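The HM metric used in these comparisons is a harmonic mean of two performance components defined earlier in the thesis; assuming the components are rates such as precision and recall, as is common for screening evaluations, it can be computed as:

```python
def harmonic_mean(x, y):
    """Harmonic mean of two nonnegative rates; defined as 0 when
    either rate is 0 (the limit as a component approaches 0)."""
    if x == 0 or y == 0:
        return 0.0
    return 2.0 * x * y / (x + y)
```

Because the harmonic mean is dominated by the smaller component, a policy must do reasonably well on both components at once to score highly, which is why HM can disagree with the cost-based measures in Table 5.22.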

Hypothesis 4a focuses on the performance difference between the deliberate deception data and the artificial deception data generated by the variable noise model with a high noise level. Real deception and artificial deception generated by the variable noise model clearly have different impacts on the screening policies under the HM metric and the variable budget model; the fixed budget model, however, does not support this conclusion. The table also reveals that deliberate deception and artificial deception generated by the application deception model with a high noise level have similar impacts on both screening policies under the HM, fixed budget, and variable budget measures. Therefore, Hypothesis 4b is supported. These results confirm that the application deception model is capable of simulating real deception.

In summary, the test results demonstrate that artificially generated deception data can be used in place of real deception data for this data mining application. The application deception model proposed in this study is confirmed by the experiments and can be used for testing a screening method.
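The variable noise model referred to above perturbs field values at a chosen noise level. A minimal sketch of that idea is shown below; the field names, the ±50% perturbation rule, and the function signature are illustrative assumptions, not the generator actually used in this thesis.

```python
import random

def add_variable_noise(records, fields, noise_level, rng=None):
    """Return a copy of records in which each listed numeric field is
    perturbed independently with probability noise_level.

    This is a crude stand-in for the variable noise model; the real
    generator is defined elsewhere in the thesis.
    """
    rng = rng or random.Random(0)  # seeded for reproducibility
    noisy = []
    for rec in records:
        rec = dict(rec)  # leave the original record untouched
        for f in fields:
            if rng.random() < noise_level:
                # inflate or deflate the value by up to 50%
                rec[f] = rec[f] * (1 + rng.uniform(-0.5, 0.5))
        noisy.append(rec)
    return noisy
```

The application deception model differs from this uniform scheme by conditioning the perturbations on deception scenarios, which is why its output tracks the deliberate deception data more closely in Tables 5.20 and 5.21.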

Table 5.20: Summary of comparison results for top policy

Comparison                        HM                      Cost (Fixed)   Cost (Variable)
Natural vs. VariableLow           Different Pe(A)<Pe(R)   Similar        Similar
Natural vs. ApplicationLow        Different Pe(A)>Pe(R)   Similar        Similar
Deliberate vs. VariableHigh       Different Pe(A)>Pe(R)   Similar        Different Pe(A)<Pe(R)
Deliberate vs. ApplicationHigh    Similar                 Similar        Similar

Table 5.21: Summary of comparison results for random policy

Comparison                        HM                      Cost (Fixed)   Cost (Variable)
Natural vs. VariableLow           Different Pe(A)<Pe(R)   Similar        Similar
Natural vs. ApplicationLow        Different Pe(A)>Pe(R)   Similar        Similar
Deliberate vs. VariableHigh       Different Pe(A)>Pe(R)   Similar        Different Pe(A)<Pe(R)
Deliberate vs. ApplicationHigh    Similar                 Similar        Similar


Table 5.22: Summary of findings

H3a: There is a significant difference in the screening policy performance on the natural deceptive data versus the artificial deceptive data modeled by low variable noise.
    Support: The HM performance is better for the real deception data than for the artificial data.
    Exception: There is no significant performance difference in fixed budget cost and variable budget cost between the two data sets.

H3b: There is no significant difference in the screening policy performance on the natural deceptive data versus the artificial deceptive data modeled by the low application deception model.
    Support: There is no significant performance difference in fixed budget cost and variable budget cost between the two data sets.
    Exception: The HM performance is better for the artificial data than for the real data.

H4a: There is a significant difference in the screening policy performance on the deliberate deceptive data versus the artificial deceptive data modeled by high variable noise.
    Support: The HM performance is better for the artificial deception data than for the real deception data, and the variable budget cost performance is better for the real deception data than for the artificial deception data.
    Exception: There is no significant performance difference in fixed budget cost between the two data sets.

H4b: There is no significant difference in the screening policy performance on the deliberate deceptive data versus the artificial deceptive data modeled by high application deception noise.
    Support: There is no significant performance difference in HM, fixed budget cost, and variable budget cost between the two data sets.
