PAVSS : a framework for scoring data privacy risk

Material Information

Foreman, Zackary D.
Place of Publication:
Denver, CO
University of Colorado Denver
Publication Date:

Thesis/Dissertation Information

Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer science
Committee Chair:
Augustine, Thomas
Committee Members:
Altman, Tom
Jafarian, Haadi


Currently, the collection and use of consumer information from online sources by business entities is guided by the Fair Information Practice Principles set forth by the Federal Trade Commission in the United States. As will be shown throughout this document, these guidelines are inadequate, outdated, and provide no protection for consumers. Through the use of information retrieval techniques, it will be shown that social engineering techniques can be used to turn this information against consumers. There exist many techniques that attempt to anonymize the data that is stored and collected. What does not exist, however, is a framework capable of evaluating and scoring the effects of this information in the event that a system is compromised. In this thesis, a framework for scoring and evaluating data is presented. This framework is designed to be used in parallel with currently adopted frameworks that score and evaluate other areas of deficiency within software, as well as by individual users in an effort to maintain a level of control over and confidence in their information. We created a framework called the Privacy Assessment Vulnerability Scoring System (PAVSS) that provides a standardized score showing the risk of a privacy invasion an individual takes on. We tested hypotheses regarding types and amounts of direct and indirect personally identifiable information (PII) from a large Twitter data set.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
Copyright Zackary D. Foreman. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.


ZACKARY D. FOREMAN B.S., University of Colorado Denver, 2017
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Computer Science Program

This thesis for the Master of Science Computer Science degree by
Zackary D. Foreman has been approved for the Computer Science Program by
Thomas Augustine, Chair
Thomas Augustine, Advisor
Tom Altman, Committee Member
Haadi Jafarian, Committee Member
Date 19 April 2019

Foreman, Zackary D.
PAVSS: A Framework for Scoring Data Privacy Risk
Thesis directed by Thomas Augustine
The form and content of this abstract are approved. I recommend its publication.
Approved: Thomas Augustine

I. INTRODUCTION
I.1 Problem Description
I.2 Structure of Thesis
II. LITERATURE REVIEW
II.1 Research Questions
II.2 Current Assessments of Software Deficiencies
II.2.1 Common Vulnerability Scoring System
II.2.2 Common Weakness Scoring System
II.2.3 Analysis of Common Scoring Techniques
II.3 Privacy Management
II.3.1 Communication Privacy Management Theory
II.3.2 Privacy Preserving Algorithms
II.3.3 Analysis
II.4 Data Collection Practices and Policies
II.4.1 Data Collection
II.4.2 Data Collection Policies
II.4.3 Text
II.4.4 Analysis
II.5 Measurement Techniques
II.5.1 Overview of Delphi
II.5.2 Delphi Characteristics
II.6 Chapter Review

III.1 Hypotheses
III.2 Model Goals
III.3 Model Design
III.3.1 Model Design Iteration I
III.3.2 Delphi Results
III.3.3 Final Model Design
III.4 Chapter Review
IV. TESTING
IV.1 Pretesting Requirements
IV.2 Testing Procedure
IV.2.1 First Phase
IV.2.2 Second Phase
IV.3 Results
IV.3.1 Results from Data Set
IV.3.2 Results from Twitter Data
IV.4 Challenges
IV.5 Chapter Review
V. FUTURE RESEARCH
VI. CONCLUSION
REFERENCES

I. INTRODUCTION
There are currently frameworks approved by the cybersecurity community to determine and score the security vulnerability of various parts of a particular piece of software. A single software program may have many different scores for its susceptibility to a security vulnerability [66]. Frameworks also exist for scoring and sharing information on malware, attack patterns, software weaknesses, incidents, indicators of attacks, etc. [33]. When used together, these frameworks allow the cybersecurity community to evaluate the risks involved with software in a consistent manner. There is, however, no framework that relates data privacy to other vectors of victimization.
This master's thesis proposes a framework that can be used to evaluate and score the various types of data that may be stored. The framework will be shown to be usable not only by the entities that store and consume sensitive data, but also by the originator of that data.
I.1 Problem Description
At any given moment, more data is being generated and stored than was previously available in the history of the internet. This data comes from many different sources. Of all the information being stored, the most concerning is personally valuable information, that is, data directly associated with an individual. Stored information can be used for a variety of reasons, from creating a personalized user experience [58] to running analytics on the data for better health services [65]. These stores of personally valuable information also make themselves a target.
According to [12], from 2016 to 2017 there was a 45% increase in the number of data breaches in the United States. In [45], the authors re-establish that an individual is harmed at the moment of a data breach, whether or not that data is misused. From 2005 to July 2018 there were a total of 9,215 recorded unauthorized breaches of data systems, exposing a total of 1,104,625,430 records [14]. As previously stated, there are national and commercial frameworks for quantifying the impacts of deficiencies within software. These frameworks provide a consistent, repeatable scoring system that helps share deficiency information among developers. However, are these frameworks enough to protect the end user?
As will be shown, there have been many attempts to maintain privacy but few to quantify privacy vulnerabilities. Companies, individuals, and those privy to sensitive data pertaining to another individual who provide data also choose which information is shared. When this information is stored, the accompanying privacy statements are often written in complex legal jargon describing how the data will be used and who will have access to it. The framework described here provides a scoring system that allows the data source to easily understand how much risk they take on when giving an entity their data. The framework will also be useful for entities wishing to obtain sensitive information from individuals, as they will be able to advertise a score showing the privacy risk, whether in the event of a data breach or when data analysis is performed.
I.2 Structure of Thesis
Chapter 2 serves as background to the problem being researched and solved. In this chapter the reader is introduced to the research questions being asked and to the various frameworks and techniques currently used to assess the soundness of software. The chapter also covers the shortcomings of each current technique and concludes by covering current worldwide data protection regulations and acts. Chapter 3 states the hypotheses for this thesis, along with a model that demonstrates a framework for a possible standardized privacy scoring approach. Once the hypotheses being tested are clear, the goals and design of the model are defined, showing how the model will be used to test the proposed hypotheses. Chapter 4 details the implementation of the model, followed by the proposed testing procedure. After the testing procedure and methods, the results of the tests are discussed. Chapter 5 discusses potential future research, and Chapter 6 is the conclusion.

II. LITERATURE REVIEW
This chapter focuses on the background information pertinent to understanding the design of the model. The first section defines the research questions that led to the investigation in this thesis. The following section details the most widely used and accepted scoring systems for software deficiencies, including the uses of these scoring systems, where they have been found to fall short, and areas that have been identified for improvement. The next section presents various data collection sources and models for privacy, including common algorithms currently used for maintaining privacy while performing data analysis. Next, the current policies for data collection are discussed. The final section reviews the chapter and briefly discusses how this material will be used to define and build a framework for scoring data privacy risk.
II.1 Research Questions
As discussed earlier, data breaches occur frequently. They occur within government organizations, businesses, and personal networks. The following research questions must be answered in order to create an effective framework for measuring the risk posed by personally identifying data obtained during a data breach. According to [12], it is not a matter of if a breach will occur, but rather when.
This leads to research question 1:
RQ-1 : What information is needed to impersonate a person?
Because data breaches are virtual in nature, building a proper model requires knowing what information a malicious actor would need in order to impersonate someone virtually. We see in [63] that levels of deception in online social networking sites rise as the sites become more popular. The ability to deceive a person with false facts is congruent with the ability to falsify the identity of an individual.

The most popular online social networks allow users to create an account using what [24] describes as weak identities, that is, unverified accounts that often require only an e-mail address [8]. This enables malicious actors to carry out identity impersonation attacks [24]. In [25], the authors provide a free-to-use site where users can check whether their e-mail addresses or other personal information have been leaked and are readily found online. This ability has led to the proliferation of a type of social engineering attack called a Sybil attack [61]. In a Sybil attack, a malicious actor impersonates someone and then befriends another user under the guise of this false identity. This allows the malicious actor to gain the trust of the other user and extract useful information. Therefore, it can be deduced that simply by knowing a person’s name and e-mail address it is possible to impersonate them, which answers RQ-1.
In [2] it was shown that knowing only a person’s date of birth and place of birth makes it possible to partially determine that person’s social security number. According to [32], 65% of internet users are on some type of online social network, whether it be Facebook, Twitter, Instagram, etc. As answered in RQ-1, all that is needed to impersonate someone is their name and e-mail address. With this little information, various types of attacks, such as Sybil or Reverse Engineering Attacks [61], can be performed. Therefore, research question 2 is proposed.
RQ-2 : What information about a person is more valuable?
A social security number is deemed to be a sensitive authentication device [2]; however, as just stated, it has been shown that a social security number can be determined from just a date of birth and place of birth. This leads to the assumption that more weight should perhaps be placed on the value of a birth date. Although it was possible to derive this information, as shown in [61, 8, 24], impersonation is often used to then gain more sensitive information from other users. From this it can be deduced that simply knowing a person’s name and e-mail address is not necessarily valuable to a malicious actor.
To scope RQ-2, we define valuable information about someone in the same sense that [15] defines private information: any information that makes a person feel a level of vulnerability. This question is also considered within the scope of personally identifiable information, defined in [7] as information that can be used to distinguish or trace an individual’s identity. The Identity Theft Resource Center tracks the data breaches that occur each year. In the 2018 report [13], the attacker was able to obtain personally identifiable information in nearly every breach. In [43], personal information is broken into three categories: static vs. dynamic, unique vs. non-unique, and shared vs. distinct. For instance, an individual’s name is static in that it most likely will not change; however, it is not a unique identifier. The authors of [43] discuss shared vs. distinct as a property of a piece of information that might be shared across many internet platforms. This can be adapted as follows: a distinct piece of personal information that is also static and unique could be an individual’s employee number at their current employer, or their student identification number. Using these same three criteria we can answer RQ-2: the most valuable information about a person is a piece of information that is private, can be used to distinguish someone, and is static, unique, and distinct.
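The answer to RQ-2 can be read as a simple conjunction of criteria. The following is a minimal sketch of that reading, not part of the proposed framework: the class name, field names, and the boolean values assigned to each attribute are illustrative assumptions, except that a name is treated as static but not unique and an employee number as static, unique, and distinct, as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PersonalAttribute:
    """Hypothetical record describing one piece of personal information."""
    name: str
    private: bool       # makes the person feel vulnerable if disclosed
    identifying: bool   # can distinguish or trace an individual
    static: bool        # unlikely to change over time
    unique: bool        # belongs to exactly one person
    distinct: bool      # not shared across many platforms

    def is_most_valuable(self) -> bool:
        # RQ-2 criteria: private, identifying, static, unique, and distinct.
        return all((self.private, self.identifying,
                    self.static, self.unique, self.distinct))


def most_valuable(attributes: List[PersonalAttribute]) -> List[str]:
    """Return the names of the attributes meeting every RQ-2 criterion."""
    return [a.name for a in attributes if a.is_most_valuable()]


# Illustrative values only; flags not stated in the text are assumptions.
ATTRIBUTES = [
    PersonalAttribute("full name", private=False, identifying=True,
                      static=True, unique=False, distinct=False),
    PersonalAttribute("employee number", private=True, identifying=True,
                      static=True, unique=True, distinct=True),
]
```

Calling `most_valuable(ATTRIBUTES)` keeps only the employee number, since the name fails the uniqueness criterion.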
RQ-3 : What is the minimum amount of information needed to be known about someone in order to find more information?
As in [63, 24], social media users are often deceived by fake users. In [24] it is shown that the fake accounts are often those of real people. Spreading fake information in someone’s name can be damaging, and creating a fake account usually requires knowing only the individual’s name and e-mail address. This research question specifically seeks to find what long-term damage can be caused by knowing the smallest amount of information about someone. Such damage can be severe, whether through spreading misinformation while posing as someone with high status in society, or through stating something discriminatory under the guise of someone else, which could lead to that person losing a job or being prevented from attaining another.
RQ-4 : Do current software scoring metrics for deficiencies take into account the importance of the data at risk of being stolen?
As stated previously, existing frameworks do score the risk of a vulnerability or a software weakness being exploited. However, these scores do not take into account the sensitivity of the data that is at risk of being taken if a breach occurs. It is the author’s opinion that not all data breaches are equal.
RQ-5 : Do current data collection techniques used in data analysis follow strict enough guidelines for ensuring the anonymity of the source of the information?
Most of the research questions thus far have pertained to data gathered from data breaches, yet that is only one avenue for obtaining data about a person. As discussed in [6, 35], data collection is now increasing. Governance is needed to guarantee that the data being collected and used by these individuals and companies does not impose unnecessary risk on the originating user.
II.2 Current Assessments of Software Deficiencies
The most commonly used open source scoring systems are the Common Vulnerability Scoring System (CVSS) and the Common Weakness Scoring System (CWSS). From this point forward they are referred to as the CVSS and the CWSS, respectively.
II.2.1 Common Vulnerability Scoring System

The CVSS is most commonly used to score a Common Vulnerabilities and Exposures (CVE) entry in the National Vulnerability Database (NVD), which is hosted by the National Institute of Standards and Technology (NIST). As noted in [33], this scoring system is used in multiple standards, and as mentioned in [5], it has been adopted by the United States Department of Defense (DoD) per Directive 8500.01.
The CVSS was created and is maintained by the Forum of Incident Response and Security Teams (FIRST) [23]. FIRST describes three benefits that come from using the CVSS:
1. A standardized scoring system that uses a common algorithm across all potentially affected technology platforms.
2. An open source framework that provides an objective, repeatable score rather than a subjective score.
3. The ability to create a list of vulnerabilities that deserve more focus on prevention, allowing the most critical vulnerabilities to be identified.
II.2.1.1 CVSS Framework Metrics
The CVSS contains three metric groups, each composed of several attributes that are used to calculate a score. These metrics are both qualitative and quantitative in nature. They consist of the Base, Temporal, and Environmental metric groups. The CVSS is currently at version 3.0, which is the version used when discussing the metrics.
The Base metrics are defined as the intrinsic attributes of the vulnerability, or the way the vulnerability behaves. The Temporal metric is used to determine the current state of the vulnerability, for example whether a patch for the vulnerability exists. This metric may change over time as the state of the vulnerability changes. The final metric, the Environmental metric, allows the score to be customized for the end user.
Table 2.1: CVSS 3.0 Metric Attributes
Base Metric Group: Exploitability Metrics | Base Metric Group: Impact Metrics | Temporal Metric Group | Environmental Metric Group
Attack Vector | Confidentiality Impact | Exploit Code Maturity | Modified Base Metrics
Attack Complexity | Integrity Impact | Remediation Level | Confidentiality Requirement
Privileges Required | Availability Impact | Report Confidence | Integrity Requirement
User Interaction | | | Availability Requirement
As can be seen from Table 2.1, the Base metric group contains the most criteria for determining the score. In [31] it is recommended that security professionals assess a vulnerability against the criteria in the Base and Temporal metrics, and that the end user modify the score based on the Environmental metrics. The intent is to allow prioritization when implementing processes and procedures to prevent or mitigate the vulnerability.
II.2.1.2 CVSS Calculation
The overall score is a value between 0 and 10, where 10 is the most severe. While calculating a score, a vector string is also produced. The advantage of the vector string is that it allows an end user to see which characteristics of the various metrics a particular vulnerability affects. Many of the constant multipliers appear arbitrary, but per [67] they represent the results of formula refinement based on case studies of actual vulnerabilities and systems.
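To illustrate what a vector string conveys, the following is a small sketch that splits one into its metric abbreviations and values. It assumes the slash-separated `CVSS:3.0/AV:N/...` layout used by the CVSS v3.0 specification; the function name is hypothetical and the sketch performs no validation of metric names or values.

```python
def parse_cvss_vector(vector: str) -> dict:
    """Split a CVSS v3 vector string into a {metric: value} mapping."""
    prefix, *pairs = vector.split("/")
    if not prefix.startswith("CVSS:"):
        raise ValueError("vector string must start with a CVSS version prefix")
    # The first component carries the version, e.g. "CVSS:3.0".
    metrics = {"version": prefix.split(":", 1)[1]}
    for pair in pairs:
        # Each remaining component is "ABBREVIATION:VALUE", e.g. "AV:N".
        key, value = pair.split(":", 1)
        metrics[key] = value
    return metrics
```

For example, parsing `CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H` shows at a glance that the Attack Vector is Network (`AV:N`) and that the Scope is Unchanged (`S:U`).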

The Base metric group score calculation depends on the outcome of the Scope metric. If Scope is considered Unchanged, then the Base metric group calculation is as follows:
\[ \text{Base} = \left\lceil \min\left[(\text{Exploitability} + \text{Impact}),\ 10\right] \right\rceil \tag{2.1} \]
If the Scope metric is considered Changed, then the Base metric group calculation is:
\[ \text{Base} = \left\lceil \min\left[1.08 \times (\text{Exploitability} + \text{Impact}),\ 10\right] \right\rceil \tag{2.2} \]
However, if the Impact metric score is less than or equal to zero, then the overall Base metric group is given a score of zero. To calculate the Impact metric when the Scope metric is considered Unchanged, the formula is:
\[ \text{Impact} = 6.42 \times \text{ImpactSubScore} \tag{2.3} \]
If the Scope metric is considered Changed, then the formula becomes:
\[ \text{Impact} = 7.52 \times [\text{ImpactSubScore} - 0.029] - 3.25 \times [\text{ImpactSubScore} - 0.02]^{15} \tag{2.4} \]
Where the formula for determining the ImpactSubScore is:
\[ \text{ImpactSubScore} = 1 - [(1 - \text{ImpactConf}) \times (1 - \text{ImpactInteg}) \times (1 - \text{ImpactAvail})] \tag{2.5} \]
With the Exploitability formula as:
\[ 8.22 \times \text{AttVector} \times \text{AttComplexity} \times \text{PrivRequired} \times \text{UserInteraction} \tag{2.6} \]
As can be observed, each sub-metric of the Base metric group takes one of several sub-values. These values can be found in [23], along with how to calculate the scores for the Temporal and Environmental metric groups. When a CVE entry in the NVD contains a CVSS score, that score is only the Base metric, excluding the Temporal metric [51].
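The Base score formulas above can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function names are hypothetical, the ceiling is taken to mean rounding up to one decimal place as in the CVSS v3.0 specification, and the numeric weights used in the usage note (0.85, 0.77, 0.56) are the published v3.0 values for the chosen metric levels, which the text defers to [23].

```python
import math


def roundup(value: float) -> float:
    """Round up to one decimal place (assumed meaning of the ceiling)."""
    # round() first to suppress floating-point noise before the ceiling.
    return math.ceil(round(value * 10, 6)) / 10


def impact_sub_score(conf: float, integ: float, avail: float) -> float:
    """ImpactSubScore from the three impact weights."""
    return 1 - (1 - conf) * (1 - integ) * (1 - avail)


def impact(iss: float, scope_changed: bool) -> float:
    """Impact for Scope Unchanged or Scope Changed."""
    if scope_changed:
        return 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    return 6.42 * iss


def exploitability(av: float, ac: float, pr: float, ui: float) -> float:
    """Exploitability from the four exploitability weights."""
    return 8.22 * av * ac * pr * ui


def base_score(av, ac, pr, ui, conf, integ, avail, scope_changed=False):
    """Base metric group score on the 0 to 10 scale."""
    imp = impact(impact_sub_score(conf, integ, avail), scope_changed)
    if imp <= 0:
        # A non-positive Impact forces the Base score to zero.
        return 0.0
    total = imp + exploitability(av, ac, pr, ui)
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10))
```

With the weights for the vector `AV:N/AC:L/PR:N/UI:N/C:H/I:H/A:H`, `base_score(0.85, 0.77, 0.85, 0.85, 0.56, 0.56, 0.56)` evaluates to 9.8 when Scope is Unchanged, and the same weights with `scope_changed=True` reach the cap of 10.0.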
II.2.2 Common Weakness Scoring System
The CWSS is most often used to score an entry in the Common Weakness Enumeration (CWE) database. This database is provided and maintained by the MITRE Corporation [48]. Whereas the CVSS is used to score a particular vulnerability, for instance a particular buffer overflow event in a given piece of software, the CWSS is used to score the risk of a weakness being exploited, for instance a buffer overflow being exploited within software as a whole. This allows organizations to create a priority list of weaknesses to monitor and prevent while developing and maintaining software [46]. To clarify, the CWSS is best utilized while developing and maintaining software, and the CVSS is best utilized by those responsible for ensuring that applications running on a system are not open vectors for attack.
Like the CVSS, the CWSS boasts three main advantages. These consist of:
1. A measurable value of a weakness that could be present in software that has not been caught.
2. A common framework that allows organizations to make a list of most critical weaknesses that need to be fixed within software.
3. The ability to customize the most critical weaknesses, since the needs may differ from organization to organization.
II.2.2.1 CWSS Framework Metrics
The CWSS is composed of three metric groups: the Base Finding, Attack Surface, and Environmental metric groups. Each of these groups breaks down into several other metrics, known as factors [48]. The most current version of the CWSS is 1.0.1, which is the version used when discussing the metrics and scoring.
The Base Finding metric group shows the risk associated with the weakness, the confidence that this risk is warranted, and the effectiveness of the controls used to prevent the weakness. The Attack Surface metric group shows what the attacker must overcome in order to exploit the weakness. The mere presence of a weakness does not imply that it will be exploited. The final metric group, the Environmental group, is similar to the Environmental group in the CVSS in that it allows the end user to customize the score for their own use.
Table 2.2: CWSS 1.0.1 Metric Factors
Base Finding | Attack Surface | Environmental
Technical Impact (TI) | Required Privilege (RP) | Business Impact (BI)
Acquired Privilege (AP) | Required Privilege Layer (RPL) | Likelihood of Discovery (LD)
Acquired Privilege Layer (APL) | Access Vector (AV) | Likelihood of Exploit (LE)
Internal Control Effectiveness (ICE) | Authentication Strength (AS) | External Control Effectiveness (ECE)
Finding Confidence (FC) | Level of Interaction (LI) | Prevalence (P)
 | Deployment Scope (DS) |
Table 2.2 shows the factors that make up the CWSS. The three metric group scores are calculated independently and then multiplied together, which yields a score from 0 to 100.
II.2.2.2 CWSS Calculation
As mentioned previously, the overall score returned by the CWSS ranges from 0 to 100 and is calculated by multiplying the sub-scores together. The formula is as follows:
\[ \text{BaseFinding} \times \text{AttackSurface} \times \text{Environmental} \tag{2.7} \]
The Base Finding metric score is calculated as follows:
\[ [(10 \times TI + 5 \times (AP + APL) + 5 \times FC) \times f(TI) \times ICE] \times 4 \tag{2.8} \]
Where f(TI) is zero if TI = 0; otherwise f(TI) = 1. This is similar to how the CVSS handles a negative or zero score for the Impact metric: in that case the Base Finding metric score becomes zero. The Attack Surface formula is then as follows:
\[ [20 \times (RP + RPL + AV) + 20 \times DS + 15 \times LI + 5 \times AS] \div 100 \tag{2.9} \]
The bracketed sum in the Attack Surface score ranges in value between 0 and 100, which is why the division by 100 is necessary [48]. The Environmental formula is then:
\[ [(10 \times BI + 3 \times LD + 4 \times LE + 3 \times P) \times f(BI) \times ECE] \div 20 \tag{2.10} \]
Where the value of f(BI) is calculated in the same way as f(TI): if BI = 0 then f(BI) = 0, else f(BI) = 1. As can be observed, the BI score heavily influences the overall Environmental score.
The scoring of each individual component of the Base Finding, Attack Surface, and Environmental scores is covered in [48], along with use case examples.
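The three CWSS sub-scores and their product can be sketched as follows. This is a minimal illustration using the factor abbreviations from Table 2.2; the function names are hypothetical, and each factor is assumed to be a weight between 0 and 1, with the actual weight for each factor level given in [48].

```python
def f(x: float) -> int:
    """Gate function: 0 when the factor is 0, otherwise 1."""
    return 0 if x == 0 else 1


def base_finding(ti, ap, apl, fc, ice):
    """Base Finding sub-score, ranging from 0 to 100."""
    return (10 * ti + 5 * (ap + apl) + 5 * fc) * f(ti) * ice * 4


def attack_surface(rp, rpl, av, ds, li, as_):
    """Attack Surface sub-score, ranging from 0 to 1."""
    return (20 * (rp + rpl + av) + 20 * ds + 15 * li + 5 * as_) / 100


def environmental(bi, ld, le, p, ece):
    """Environmental sub-score, ranging from 0 to 1."""
    return (10 * bi + 3 * ld + 4 * le + 3 * p) * f(bi) * ece / 20


def cwss_score(base: float, surface: float, env: float) -> float:
    """Overall CWSS score: the product of the three sub-scores."""
    return base * surface * env
```

When every factor takes its maximum weight of 1, the sub-scores are 100, 1, and 1, giving an overall score of 100. Setting TI or BI to 0 drives the overall score to 0 regardless of the other factors, which is the gating behaviour of f described above.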
II.2.3 Analysis of Common Scoring Techniques
As can be seen above, neither of the most common scoring systems factors into the overall score the sensitivity of the data at risk of being stolen in the event of a data breach. Due to the widespread adoption of the CVSS framework, there have been many critiques of the framework. In [66], a new method for scoring vulnerabilities is proposed that builds on previous scoring systems. Another method for improving the CVSS, called VRSS [41], and its improved methodology [42] do not factor in the value of the data that is at risk of being stolen. There have been methods utilizing machine learning, as in [11], to better predict whether a vulnerability will be exploited, yet these do not take into account the data that could be accessed. Likewise, the comprehensive survey provided in [54] makes no mention of the value of the data being protected. In the analysis of bounties in [50], the economic incentive is examined in terms of how much would be paid to find a vulnerability, yet this incentive is not driven by the underlying data at risk.
The CWSS has not received as many critiques, and it has been shown, in connection with the CWE, to be effective in preventing and helping defend against attacks [20, 64]. However, as in the case of the CVSS framework, the CWSS framework also does not factor in the value of the data. From this analysis we can answer RQ-4: current software scoring metrics for deficiencies do not take into account the importance of the data at risk of being stolen.
A common critique of both scoring systems, however, concerns their seemingly arbitrary and unexplained constants. In [31], various experts in software vulnerabilities were asked to calculate a CVSS value, and considerable discrepancy was found between their values. The authors then list possible solutions and modifications to these score calculations, based on the evaluations of the experts used in the study.
II.3 Privacy Management
Each day more data is being created and stored. In [30] it is stated that over 10 zettabytes of data will be stored on cloud computing infrastructures by 2019. With all this data, many algorithms have been proposed to prevent its malicious use. To understand the effectiveness of these algorithms, it is necessary to understand the underlying theory of privacy management. There are two perspectives from which to view data privacy management: that of the user whose data is being collected and analyzed, and that of the entity that is storing or analyzing the data. This section discusses these two viewpoints on privacy management.
II.3.1 Communication Privacy Management Theory
Communication Privacy Management (CPM) theory views privacy from the perspective of the user. CPM theory explores the balance that users attempt to strike between keeping information private and being open with others. As described in [56], this means understanding the role that privacy plays in sharing information. CPM theory is defined by five important principles. As listed in [56], these principles are:
1. Ownership of information.
2. Control of the information.
3. Privacy rules that provide regulation.
4. Guardianship or co-owning another person’s data.
5. Breakdown of regulations for privacy.
In [16], the focus is on the first two principles listed. With reference to principle one, the ownership of the data, [16] tested the level of concern when user information was collected. The results showed that the level of concern was related to the scope of the collection, varying between local and global collection. The results also showed that general concern about the collection of data was more strongly associated with individuals who were already concerned with privacy. This is also supported by the results in [15]. In [37], however, the authors split the data that users were willing to share with others into two separate categories, where certain information was considered more sensitive than other information. The results showed that information classified as not very sensitive, such as hobbies, interests, and lifestyle, was shared more willingly than sensitive information.

While testing the second principle, control of the information, it was found that the level of concern when giving control of information was related to how much general concern the individual initially had.
II.3.2 Privacy Preserving Algorithms
The moment sensitive information is stored, whether it comes from a user forming relationships in an online social network [15, 37, 16] or from a user of a service such as a search engine [3], a web browser [34], or any other personalized service, that information is entrusted to another party. Whether the data is stored to provide a better user experience or to perform data analysis, there has been much research on privacy preserving algorithms. The primary goal of these algorithms is to allow analytics to be performed on sensitive data while assuring the sources of the data that they will not be identifiable [44]. At this time the more common algorithms are differential privacy [22], k-anonymity [59], t-closeness [38], and membership privacy [39].
II.3.2.1 k-anonymity
The authors of k-anonymity [59] state that the goal of the algorithm is not to prevent access to the information. Rather, the goal is to release the information in such a way that a person cannot be identified as its source. The algorithm is useful in protecting against the direct identification of individuals from released data. It defines quasi-identifiers, which are attributes that, when combined, could lead to the identification of the source [59, 38, 35]. Yet as shown in [38, 6], this does not prevent the release of information that is an attribute of an individual. If an adversary already has information about a person, they are capable of relating the data to the source.
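The idea of quasi-identifiers can be illustrated with a minimal sketch that measures how k-anonymous a released table is. The function name `k_anonymity`, the column names, and the sample records are all hypothetical and chosen only for illustration; the sketch checks an existing release and is not the generalization procedure of [59].

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous.

    Records sharing the same quasi-identifier values form one group;
    k is the size of the smallest such group (0 for an empty table).
    """
    groups = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical, already generalized release (illustration only).
released = [
    {"zip": "802**", "age": "20-29", "gender": "F", "diagnosis": "flu"},
    {"zip": "802**", "age": "20-29", "gender": "F", "diagnosis": "asthma"},
    {"zip": "803**", "age": "30-39", "gender": "M", "diagnosis": "flu"},
    {"zip": "803**", "age": "30-39", "gender": "M", "diagnosis": "flu"},
]

k = k_anonymity(released, ["zip", "age", "gender"])  # smallest group has 2 rows
```

In this toy table every record in the second group carries the same diagnosis, so an adversary who knows a person belongs to that group learns the attribute even though the table is 2-anonymous. This is the kind of attribute disclosure that the text attributes to [38, 6].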

II.3.2.2 t-closeness
The t-closeness algorithm attempts to prevent an adversary, who may already have much information about an individual, from linking further released data to that individual. The authors of t-closeness [38] require that the distance between data sets not exceed a threshold t [38, 35]. However, as discussed in [35], t-closeness can only be applied once the data has already been collected. This does not help individuals in the event of a data breach.
Membership Privacy
Membership privacy differs from the k-anonymity and t-closeness algorithms in that the latter must be performed after the data has been collected. The Membership Privacy framework defined in [39] assumes that an adversary already knows every characteristic of a given individual. The authors further state that when data is released, whether through publication or through a data breach, a privacy breach has occurred if a single person can be identified from that data.
Differential Privacy
Differential privacy, defined in [22], has become the standard for maintaining privacy while allowing data analysis to be performed [62, 29, 44, 57, 28]. Differential privacy ensures that a data set is statistically indistinguishable from a similar data set from which one record has been removed [28]. The US Census Bureau has even declared that it will use differential privacy for the 2020 United States census [1]. A variation of this framework called local differential privacy has been adopted by large companies such as Google and Microsoft [17].
Differential privacy, however, is an umbrella term that covers several different algorithms. The number of algorithms that could be considered differential privacy has led to the research in [28], which argues that with so many algorithms available, determining which one best fits a given scenario would be an undue burden.
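As one concrete member of this family, the following minimal sketch applies the Laplace mechanism to a counting query. The source does not say which mechanisms it has in mind, so the choice of mechanism, the function names `laplace_noise` and `private_count`, and the sample data are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with rate
    1/scale follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1, so the
    query has sensitivity 1 and the noise scale is 1 / epsilon.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of 100 users (illustration only).
ages = list(range(18, 118))
noisy = private_count(ages, lambda age: age < 30, epsilon=0.5)
```

A smaller `epsilon` adds more noise, which makes the released count less sensitive to any single record at the cost of accuracy.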
II.3.3 Analysis
There are two ways of viewing who has control of valuable data: from the perspective of the individual, as seen in section II.3.1, or from the perspective of the party that analyzes the data and publishes the results using one of the algorithms discussed in section II.3.2. For storing data, it is assumed that this criterion would be met by following guidelines for ensuring proper software configuration. Preserving privacy is a multifaceted problem, and there is no one-size-fits-all technique. However, the process taken in [28], where various differential privacy techniques are analyzed, can be applied to convey to end users how much risk they take on by using a given platform.
II.4 Data Collection Practices and Policies
In section II.1 the focus of the discussion was on how data is obtained by malicious users. The focus of this section is on how these troves of data are created in the first place. The first area examined is how companies collect data and the standards they should follow to maintain the privacy of that data. The second area covers policies from the United States and Europe regarding the collection and use of personal data. The last topic covers challenges that users and companies alike often encounter regarding the use and collection of personal data.
II.4.1 Data Collection
Entities can obtain personally valuable information from various sources. According to [64], terabytes of user-generated content are created every minute on online social networks. This data comes from both traditional personal computers and mobile technologies. The reasons users decide to disclose such sensitive information on these platforms are outside the scope of this paper and can be read in more detail in [64, 21].
II.4.1.1 Data Sources
As can be seen in [16, 61, 8, 24, 63, 15, 32, 68], online social media plays a major role in day-to-day life. The use of social media ranges from political discourse [68] to gaining employment and maintaining personal relationships [37]. In some cases, third-party content creators on these online social networks have the ability to gain access to personal information [16, 15]. Online social networks often offer access to their platforms in exchange for the ability to use the data created by users. In [43], it was identified that, due to the success of online social networks, many other companies have taken to monetizing user data through their online products. This can be examined further through mobile phone applications and the data they collect, as discussed in [26]. In [3] the authors identified search engines retaining user data for a more personalized experience, and [34] discusses how web browsers are able to collect user information.
Electronic data use is not limited to these sources. For example, [65] examined the limitations on accessing and using medical data in various countries, arguing for the benefits that could be obtained by allowing access to this information. A relatively new avenue for data collection is the “Internet of Things”, which often requires that the user be identified more precisely than by an e-mail address alone [4].
II.4.2 Data Collection Policies
Data collection laws vary by country. For the purpose of this master's thesis, the data collection policies of the United States and the European Union are compared.
European Union
The European Union has recently enacted the General Data Protection Regulation (GDPR). These new regulations declare that a person has a right to the protection of their personal data [47]. The GDPR also acknowledges that when anonymization algorithms have been applied to the data, more burden is placed on the collector to guarantee to the user that their data will remain anonymous [30]. The European Union has found that only six countries outside of the European Union have privacy laws that meet the expectations of the GDPR [27].
United States
Within the United States there are both Federal and various State laws impacting the collection and protection of data [10]. For consumers the enforcing body is the Federal Trade Commission (FTC), which has been tasked by the United States Congress with enforcing user data policies. These policies, however, are mostly concerned with unfair or deceptive practices. The FTC has offered guidelines for companies to follow to ensure data privacy for consumers, but these are only guidelines and not law.
In the United States the most common data protection policy is the Health Insurance Portability and Accountability Act (HIPAA), which is enforced by the Office for Civil Rights. However, this policy applies only to health records and how they must be handled.
For general data there are not many regulations set forth in the United States. In [53], the discussion between the European Union and the United States is presented for use of cloud computing technologies for Government purposes. In terms of the United States policies for cloud computing for data storage, providers are directed to the NIST Cloud Computing Security Requirements. For a complete and comprehensive list of federal data protection policies in the United States view [10].
Challenges
As identified in [60], the privacy policy statements usually presented to individuals are difficult to read and understand. This makes it difficult for users to determine whether they should accept a privacy policy. An even more pressing challenge is that using a service that collects data has become a nearly normal day-to-day event. The risk of sharing information that may be compromising to the individual is sometimes related to the person's personality, as discussed in [21]. In [52], the authors found that users often forgo any control over the valuable personal information they discuss on social media platforms.
II.4.3 Text
It might be assumed that once data is anonymized, it becomes impossible to re-identify the individual whose private information has been leaked. However, data exploitation typically does not rely on a single source of data; it aggregates multiple sources to form a strong vulnerability. As stated in the previous sections, the amount of data held on cloud-based servers is measured in zettabytes. Aggregating this data has become easier than ever before, owing both to the amount of data available and to advances in text processing.
Although it would not be feasible for a malicious user to closely analyze every text they come across, they can first perform regular expression checks or string matching on a block of text to see whether it contains any words or phrases associated with privacy-divulging comments or sentences. The Aho-Corasick algorithm is a popular multi-string matching algorithm that is still used in many applications today. The algorithm uses a Finite State Automaton (FSA) in which every state corresponds to a single letter of a word to be found [55].
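The automaton described above can be sketched compactly. This is a minimal illustration of Aho-Corasick multi-string matching, not the implementation used in this thesis; the function names `build_automaton` and `find_matches` and the example phrases are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links, and output lists for `patterns`."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix
    output = [[]]  # patterns recognized on reaching each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)

    # Breadth-first traversal so shallower failure links are ready first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] += output[fail[nxt]]
    return goto, fail, output


def find_matches(text, automaton):
    """Return (start_index, pattern) for every occurrence in `text`."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


# Hypothetical phrases that might signal a privacy-divulging sentence.
phrases = ["my birthday", "my email", "my address"]
automaton = build_automaton(phrases)
hits = find_matches("today is my birthday and my email is in my bio", automaton)
```

The text is scanned once regardless of how many phrases are searched for, which is what makes this approach attractive as a first-pass filter over large volumes of text.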
Although matching a word in a block of text may indicate a potentially privacy-divulging comment or sentence, matching alone is not enough to automatically crawl through this amount of information. This is due to the ambiguity of the various meanings that a single word in a sentence can have. However, methods of Natural Language Processing have been refined to resolve these ambiguities. One particularly useful method is Word Sense Disambiguation, which attempts to determine the meaning of a word based on the other words used in the same comment or sentence. For the remainder of this thesis we will use the abbreviation WSD for Word Sense Disambiguation.
WSD is the computational identification of the meaning of words in the context in which they are used [36]. It is a method for removing lexical ambiguity, which arises when words used within a sentence can have multiple meanings [19]. The original Lesk algorithm defined in [36] was shown in [49] to be between 50% and 70% accurate.
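To make the idea concrete, the sketch below implements a simplified Lesk variant that picks the sense whose dictionary gloss shares the most words with the surrounding sentence. It is not the original algorithm of [36], which compares the glosses of neighboring words with each other. The glosses, the stopword list, and the function names are hypothetical and hand-written for illustration.

```python
import re

# Small hypothetical stopword list (illustration only).
STOPWORDS = {"a", "an", "and", "as", "in", "is", "my", "of", "or", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simplified_lesk(senses, sentence):
    """Return the sense whose gloss overlaps most with `sentence`.

    `senses` maps a sense label to its gloss. Ties go to the sense
    listed first.
    """
    context = set(tokenize(sentence)) - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & (set(tokenize(gloss)) - STOPWORDS))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses for the ambiguous word "address".
address_senses = {
    "location": "the place where a person lives, such as a street and city",
    "speech": "a formal speech delivered to an audience",
}

sense = simplified_lesk(
    address_senses, "I just moved, my new address is 12 Oak street in the city"
)
```

Only the "location" reading would count as a privacy-divulging use of the word, which is why a disambiguation step is useful after a plain string match.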
In our testing methodologies, these two methods, string matching and word sense disambiguation will be utilized.
II.4.4 Analysis
With respect to the originator as the source of the data, the European Union takes a harder stance on protections than the United States. From section II.4.1 and section II.3, RQ-5 can be answered. RQ-5 asks whether data analysis follows strict enough guidelines for ensuring the anonymity of the source of the information; the answer depends on where the analysis is being performed and on which data. Inside the European Union, analysts must follow the GDPR. In the United States, if the information falls outside the realm of Personal Identifiable Information as defined in the HIPAA regulation, then it is left to the discretion of the analyst. The regulations in the United States have not kept pace with the rise in data analytics or the rise in data breaches.

II.5 Measurement Techniques
As shown in section II.2 the current methods of scoring deficiencies in software use various equations to calculate a score for measuring the risk involved. Since we have not found any previous literature on measuring privacy risk, we will utilize the Delphi Method for creating the equations in order to determine the associated scores and risks for the various metrics.
II.5.1 Overview of Delphi
The Delphi method was originally developed, and utilized by the RAND corporation [18, 40]. The original use of the Delphi Method was to gather expert opinions for determining policies [67, 18].
II.5.2 Delphi Characteristics
Although there are many different variations of the Delphi Method, all maintain the following criteria:
• Selection of Experts
• Anonymity
• Feedback
First, a group of experts who are considered very knowledgeable in a given topic is selected. Next, each expert is sent a questionnaire. Once the questionnaires are returned and the results are analyzed, the questionnaire is either sent back to the experts or re-evaluated. If an expert's original response differs from the consensus of the other experts, feedback is given to allow the expert to re-evaluate their answer. This feedback is completely anonymous, so that the expert is not under pressure to conform to the responses of the others, in an attempt to produce an unbiased judgment.
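One feedback round of this process can be sketched as follows. The source does not specify how consensus is computed or how far a response may deviate before feedback is given, so using the median as the consensus and a fixed `tolerance` are assumptions, as are the function name and the sample ratings.

```python
from statistics import median


def delphi_feedback(responses, tolerance=1):
    """Summarize one Delphi round.

    `responses` maps an anonymous expert id to a numeric rating.
    Returns the consensus (assumed here to be the median) and the ids of
    experts whose rating differs from it by more than `tolerance`; those
    experts would receive anonymous feedback and be asked to re-evaluate.
    """
    consensus = median(responses.values())
    to_revisit = sorted(
        expert
        for expert, rating in responses.items()
        if abs(rating - consensus) > tolerance
    )
    return consensus, to_revisit


# Hypothetical ratings on a one-to-seven scale (illustration only).
round_one = {"E1": 5, "E2": 6, "E3": 5, "E4": 2, "E5": 6}
consensus, to_revisit = delphi_feedback(round_one)
```

Because only anonymous ids are stored, the feedback step never reveals which expert gave which rating.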

II.6 Chapter Review
In this chapter the various degrees to which an individual could be affected by the exposure of their data have been analyzed. Different frameworks for determining software deficiencies have also been reviewed. What can be surmised is that this is a complex area to navigate, and with the growing use of online services that collect data, it has become overly taxing for individuals to work out how their information might be misused. We have credit scores to judge a person's creditworthiness, ratings from the Better Business Bureau to determine the trustworthiness of a business, and frameworks for assessing the soundness of software, as seen in section II.2. What is missing is a framework that determines an individual's risk of having their privacy breached, one that measures this risk not only in terms of how the data is handled and stored, but also in terms of the impact on the user at an interpersonal level. This is the framework posited in this thesis, and it is the subject of the following chapter.

III.1 Hypotheses
In this section the hypotheses that the model will attempt to answer will be listed and discussed. These hypotheses listed are the product from the research questions outlined in Chapter 2 along with feedback from the experts while initiating and performing the survey.
H1: Those accounts that have 1 piece of direct PII available, will also have a minimum of 2 pieces of indirect PII available.
Hypothesis 1 (H1), listed above, states that individuals who share a single piece of direct PII on a social site will also be more likely to divulge indirect PII.
We believe that these individuals will release a minimum of two pieces of indirect PII, if they also release a single piece of direct PII. In section III.3.2, and section III.3.3 we will discuss the value of indirect PII to that of direct PII.
H2: Users that give their birthday or allow it to be determined, are more likely to have more pieces of Indirect or Direct PII than users that do not give their birthday.
In our second hypothesis (H2), we narrow the scope to those we believe to be more relaxed about privacy risk. The hypothesis compares users whose birthday can be found against users whose birthday cannot be found. In our model, presented in section III.3.3, the individuals who fall into the set of users in H2 will have a higher score than those who do not.
H3: Users that have their e-mail addresses easily accessible, are more likely to divulge more PII, and therefore have a higher score from our model, than the set of users from H2.

Based on feedback from the experts who were surveyed, one of the most essential pieces of information that a malicious user can obtain about a target is the target's e-mail address. From this we formulated hypothesis 3 (H3). We also believe the users of H3 will have a higher score in our model than those whose e-mail address we cannot find. As shown in Figure 3.1, the set of users who fall into H3 would be a subset of the users from H1.
Figure 3.1: H1 Users to H3 Users (chart title: Sets of Users Per Hypotheses)
We focus our attention on users with publicly visible birthdays and e-mail addresses because of the expert survey responses discussed in section III.3.2. To us, these two pieces of PII in particular are gateway indicators: once they are obtained, gaining other critical pieces of PII becomes easier.
III.2 Model Goals
In this section the goals of the model will be discussed and highlighted. A primary driver of this research was the total number of privacy breaches reported each year, along with recent reports of social media organizations selling user data. The research discussed in Chapter 2 shows a definite need for a tool that users can use to assess their own vulnerability to a privacy breach.
A major hurdle is placing a quantitative value on an intangible object. Most people could quickly name what they personally consider private, as can be seen in the section pertaining to CPM. To overcome this hurdle we will use the Delphi method. The Delphi method has been used in the literature as well as in real-world applications to find quantitative values for objects that were not previously valued or ranked. The Delphi method relies on input from individuals who are considered experts in the given area of focus.
The research for Chapter 2, which analyzed existing scoring systems for other cyber security issues, showed that it is difficult for a custom-made model to gain community acceptance. Therefore, our model will resemble already defined cyber security industry standards. By using this foundation, along with the results from the Delphi method, to build our model, we will be able to test our hypotheses and present the results. We will also be able to show how this model can be a useful tool for users of online accounts.
III.3 Model Design
In this section we discuss the model and the steps involved in its creation. The model went through several iterations. The first iteration was built using knowledge from the literature review and our hypotheses. Next, a survey was created and sent to experts. After the experts had answered and their results were calculated, the final iteration of the model was designed.
III.3.1 Model Design Iteration I

As a base for the model, we use the format of the CVSS and CWSS, discussed in Section II.2.1 and II.2.2 respectively. We focused more on the CVSS, since it is the most commonly used and most widely known structure for scoring vulnerabilities in software. As shown in Section II.2.1, the CVSS gives an overall score from zero to ten; therefore we want our model to produce a score within the same range. A score of ten indicates that by using a service there is a critically high likelihood of being involved in a breach of privacy, and a score of zero indicates a very low, if not non-existent, threat of succumbing to a privacy breach.
Total_Score = [min[(Exploitability + Accessibility + Harm), 10]] (3.1)
We have determined that the likelihood of being involved in a privacy breach falls under three distinct metrics: Exploitability, Accessibility, and Harm. As with the CVSS and CWSS, each of these primary metrics is further broken down into individual components that are then used in the final calculation. Table 3.1 shows the individual components of each metric.
Table 3.1: PAVSS 0.9 Metric Attributes
Exploitability Metrics       | Accessibility Metrics          | Harm Metrics
Direct PII Metric (DP)       | Data Release Policy (DR)       | Professional Attack (HP)
Indirect PII Metric (IP)     | Public Information Metric (PI) | Financial Attack (HF)
Exploit Value Metric (EV)    | Associativity Metric (AA)      | Personal Expectation Metric (HPE)
Data Expectation Metric (DE) |                                |

The first metric to be examined is the Exploitability Metric. From our research we have determined that this metric holds the most weight in determining whether an individual would be involved in a privacy breach.
Exploitability = DE * (EV * (DP + IP * 0.5)) (3.2)
The equation above was our first iteration of the overall score for the Exploitability metric. Our assumption was that information qualifying as indirect personally identifiable information is only half as valuable as information qualifying as direct personally identifiable information. This sum is multiplied by the Exploit Value, and the result is then multiplied by the Data Expectation metric, the data's expectation to privacy. Both of these metrics are defined in the paragraph below.
The Exploit Value represents how much further exploitation could be performed given this information. It is a value assigned to information that, when combined, could create avenues for gathering more information. The Data Expectation metric, in contrast, raises or lowers the overall value depending on the service being used. We felt this was important in order to make clear that we are not saying a given service is good or bad; that is outside the scope of this model. For instance, users of a social media service should assume that the information they provide is something the general public could see, so the data has a generally low expectation of privacy. With an online banking service, on the other hand, one would assume the information is kept private and secure, which indicates a high expectation of privacy.
Accessibility = 0.2 * (DR + PI + AA) (3.3)

The equation above shows how the Accessibility metric is calculated. As can be observed, the sum of the values is multiplied by 0.2. This is based on the assumption that the user has little to no control over these indicators but should still be made aware of them. These indicators therefore do not have as much influence on the total score. The Data Release value is a score based on whether using a service causes the data to be released to a third party, and on which privacy preserving algorithm is used to release it. Using no privacy preserving algorithm produces a high value, and using a privacy preserving method produces a lower value. Privacy preserving algorithms were discussed in Section II.3.2. The Public Information metric is given a constant value: 1 if the information could be obtained through public information, or 0 if not.
Associativity = Total_Pieces_Info * 0.25 (3.4)
Associativity is a value to indicate the likelihood of a privacy breach given that information is all in one place. The total pieces of information would be a total tally of the number of direct and indirect personally identifiable information that has been stored for access in a single location.
Harm = 0.2 * ((HP + HF) * HPE) (3.5)
Finally, the Harm metric is given above. For the purpose of this thesis, we evaluate whether the user would be susceptible to either a Sybil or a Financial attack if the given information were released. Because the model is modular, other types of attack could be added later. For this reason, both Sybil and Financial are given a value of 1 if an attack of that type is possible with the given information, or 0 if it is not. This sum is then multiplied by the personal expectation to privacy, a value supplied by the user to indicate how they would personally feel if involved in a privacy breach. The result is then multiplied by 0.2, again indicating that the user has little control over this metric.
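Equations 3.1 through 3.5 of this first iteration can be written out as a short sketch. The text does not fix the value ranges of DP, IP, EV, DE, DR, or HPE, so the inputs below are hypothetical; treating DP and IP as counts of pieces of PII, and their sum as Total_Pieces_Info, is likewise an assumption. The outer brackets of equation 3.1 may denote rounding, which is not applied here.

```python
def exploitability(dp, ip, ev, de):
    """Equation 3.2: indirect PII is weighted half as much as direct PII."""
    return de * (ev * (dp + ip * 0.5))


def associativity(total_pieces_info):
    """Equation 3.4: pieces of PII stored in a single location."""
    return total_pieces_info * 0.25


def accessibility(dr, pi, aa):
    """Equation 3.3: data release, public information, associativity."""
    return 0.2 * (dr + pi + aa)


def harm(hp, hf, hpe):
    """Equation 3.5: hp and hf are 1 if the attack is possible, else 0."""
    return 0.2 * ((hp + hf) * hpe)


def total_score(exploitability_score, accessibility_score, harm_score):
    """Equation 3.1: the sum of the three metrics, capped at 10."""
    return min(exploitability_score + accessibility_score + harm_score, 10)


# Hypothetical inputs (illustration only).
dp, ip = 2, 3                                # pieces of direct and indirect PII
e = exploitability(dp, ip, ev=1.0, de=1.0)   # 1.0 * (1.0 * (2 + 1.5)) = 3.5
aa = associativity(dp + ip)                  # 5 * 0.25 = 1.25
a = accessibility(dr=1.0, pi=1, aa=aa)       # 0.2 * 3.25, about 0.65
h = harm(hp=1, hf=1, hpe=2.0)                # 0.2 * (2 * 2.0), about 0.8
score = total_score(e, a, h)                 # about 4.95
```

With these inputs the Exploitability term contributes most of the score, which matches the statement that this metric holds the most weight.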
III.3.2 Delphi Results
III.3.2.1 Survey Background Information
Drawing on the literature review in Chapter 2 and the first iteration of our model design, we created a survey in which we asked cyber security professionals to provide their professional expertise. In keeping with the discussion of the Delphi method in section II.5.2, we maintained the anonymity of the professionals filling out the survey. We did, however, ask these individuals to share their years of experience as well as their specific areas of expertise. As shown in Table 3.2, over half of the participants had more than 10 years of experience. In Table 3.3 it can be observed that about a quarter of the participants had experience in privacy assessment, and over three quarters had experience in cyber security compliance. This expertise, combined with that of the participants who had penetration testing experience, allowed a wide variety of individuals to be questioned.
Table 3.2: Survey Profile
Percentage of Participants | Years of Experience
33.0%                      | 15+
22.0%                      | 10-15
38.9%                      | 5-10
5.6%                       | 1-5
This variety of expertise was essential to ensure that the survey feedback represented both the perspective of one who tries to protect sensitive data and the perspective of one who attempts to exploit personal information. The questions were presented to the participants as scenario-based situations, and also asked which pieces of information would be considered more valuable.

Table 3.3: Participants Profession
Percentage of Participants | Profession
27.8%                      | Privacy Assessment
66.7%                      | Vulnerability Assessment
77.8%                      | Cyber Security Compliance
38.9%                      | Penetration Testing
Following the example in [37], where the authors divided PII into categories, the experts were shown what constitutes direct PII and what constitutes indirect PII; these values can be viewed in Table 3.4. We also defined what is considered publicly available information; these values can be found in Table 3.5. A final source of identifiable information shown to the experts was information that could be found through a Freedom of Information Act (FOIA) request, which can be seen in Table 3.6.
Table 3.4: Personal Identifiable Information
Direct PII       | Indirect PII
SSN              | Gender
Telephone Number | Race
Email Address    | Birth-date
Medical Records  | Geographic Indicator
Name             | Online Username
Address          |
Table 3.5: Publicly Available Information
Name Address
Voter Affiliation Prior Criminal Trouble
Business Ownership Marriage Certificates
Death Certificates Mortgages
We defined Exploitable, Accessibility, and Harm to ensure that every individual completing the survey answered the questions with the same frame of reference and to keep the questions from being ambiguous. The definitions of each term are listed in Table 3.7.
Table 3.6: Freedom of Information Act Request Information
Federal Employment Qualifications Degrees
Technical Training Government Employment
Professional Group Membership Awards and Honors
Table 3.7: Definitions of terms to survey participants
Exploitable   | Could you exploit a person(s) with this information, i.e., would you be able to use this information to impersonate this person? Does this allow you to get more information?
Accessibility | How accessible is this information?
Harm          | A quantitative value for damage done in the event a person(s) data has been exploited
The participants were given scenario-based questions. These questions were presented as if the participants were about to perform either a professional cyber-attack or a financial cyber-attack on a given individual, should that individual display a particular piece of direct or indirect PII. They were asked how exploitable this information would be, given that they were attempting to perform either of the two attacks. We also gave the participants a list of indirect and direct PII and asked which new piece of information they would try to obtain to make the attack more effective. At the end of the survey the participants were given two open-ended questions. The first asked the participants, in their professional opinion, what the most critical factors for exploiting private information are. The second asked what the most crucial areas for securing and protecting personal information are.
III.3.2.2 Survey Results
The participants all answered the survey questions similarly, which helped to strengthen our initial model design. Some answers were surprising; however, because these answers were similar among all participants, they were not dismissed but analyzed further. We will now discuss the results of the survey and the most thought-provoking inputs from the participants.

For the questions asking how exploitable or harmful a given scenario is with the given PII, participants were asked to rate on a scale from one to seven, where one was not exploitable or harmful and seven was very exploitable or harmful. We then generalized the results into the ranges listed below in Table 3.8.
Table 3.8: Ranges Based Upon Expert Input
Low Percentage 0 - 42.85%
Medium Percentage 42.85% - 71.42%
High Percentage 71.42% - 100%
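The boundaries in Table 3.8 coincide with 3/7 and 5/7 of the seven-point scale. The sketch below assumes that a rating is converted to a share of the maximum and that the printed percentages are truncations of those fractions; the function name `rating_range` and the treatment of the boundary values (3 counts as Low, 5 as Medium) are assumptions, since the table lists overlapping endpoints.

```python
from fractions import Fraction

LOW_MAX = Fraction(3, 7)     # printed as 42.85% in Table 3.8
MEDIUM_MAX = Fraction(5, 7)  # printed as 71.42% in Table 3.8


def rating_range(rating, scale_max=7):
    """Map a rating on a 1..scale_max scale to Low, Medium, or High."""
    if not 1 <= rating <= scale_max:
        raise ValueError("rating must lie between 1 and scale_max")
    share = Fraction(rating) / scale_max
    if share <= LOW_MAX:
        return "Low"
    if share <= MEDIUM_MAX:
        return "Medium"
    return "High"


# Range assigned to each whole-number rating from one to seven.
labels = [rating_range(r) for r in range(1, 8)]
```

Under these assumptions ratings of one to three fall in the Low range, four and five in Medium, and six and seven in High.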
Most results from the survey were as expected when it comes to PII. For instance, obtaining just a single piece of direct PII was generally considered to give a medium ability to exploit an individual. The participants also indicated that once they had access to a single piece of direct PII, all they would need to exploit someone with confidence was either the individual's online username or birth-date. The participants were split evenly between these two choices. Figure 3.2 shows the results of one question put to the experts. The experts were told they already had a single piece of direct PII as listed in Table 3.4, and were then asked to pick which other piece of information they would most seek in order to perform a financial attack.
According to the experts, information that could be obtained from a FOIA request was of medium ease of accessibility. The ranges can be seen in Figure 3.3, where a majority of experts were in favor of this assessment.
The experts were then asked to rate how likely it was that, with access to this information, they could gain access to an individual's bank records (Figure 3.4) or open a line of credit (Figure 3.5). As seen in these figures, the experts considered accessing bank records with the given direct PII to be of medium likeliness. This differs from the experts' opinions on trying to open a line of credit in an individual's name, seen in Figure 3.5.

[Figure 3.2: Additional PII needed for financial exploit. Chart title: Chosen Additional Direct PII, Financial Harm; x-axis: Direct PII Options.]
[Figure 3.3: Ease of Access FOIA Request. Chart title: Accessibility to Gain Information from FOIA Request.]
[Figure 3.4: Likeliness of Accessing Bank Records with Direct PII. Chart title: Given Direct PII, Likeliness of Bank Record Access; axis: Score Ranges.]
[Figure 3.5: Likeliness of Opening Line of Credit with Direct PII. Chart title: Given Direct PII, Likeliness of Opening Line of Credit.]

When the experts were presented with a similar situation, except that the goal was a professional harm attack, the information they sought differed vastly from that of the financial harm attack. As seen in Figure 3.6, the most sought information after already having an individual’s name was their e-mail address. Compare this to the information sought for a financial attack once the individual’s name was known (Figure 3.2), where the most sought information was the individual’s SSN.
[Figure 3.6: Additional PII needed for Professional Harm. Chart title: Chosen Additional Direct PII, Professional Harm; x-axis: Direct PII Options (Social Security, Medical Records, Address, Telephone Number, E-mail Address).]
With this information, the experts were asked to rate the severity of professional harm that could be caused with the name and the additional direct PII shown in Figure 3.6. These results were interesting: as seen in Figure 3.7, no expert rated the potential harm as low. The results were nearly split in half between those who believed they could cause only a medium amount of professional harm and those who believed they could cause a high amount.
[Figure 3.7: Severity of Professional Harm Given Direct PII. Chart title: Professional Harm from Direct PII; score ranges shown: Medium, High.]
Next we asked the experts the likeliness of an exploit with this given information if they were able to find it all in the same location, that is, the same site or service. We followed this up with the same question, except that the information was spread out rather than all in one location. Figure 3.8 shows the results side by side for comparison. When all the pieces of direct PII were located on the same site or service, 61% of the experts responded that this information was highly exploitable to cause professional harm. However, as soon as the information was spread out over multiple sites and services, only 22.3% of the experts responded that the information was highly exploitable.
[Figure 3.8: Exploitability by Accessibility. Chart title: Exploitability Based on Accessibility; series: PII All on Same Service, PII on Different Services.]
At the end of the survey we asked the experts for their opinions in two open-ended questions. The first asked what, in their expert opinion, were the most critical factors for exploiting information. The top five results can be viewed in Table 3.9. One of the top five answers was the compound effect, in which previously known information is used to find other information. It appears that from knowing just a user’s name, given what is publicly available as well as what is available through a FOIA request, one could create a rather detailed profile of an individual. This helps us answer research question RQ-3: the minimum information that needs to be known about someone in order to find more information is just their name, a single piece of information.
From the responses shown in Table 3.9, we can also see that user awareness is a big contributing factor in exploiting information. This indicates that we need to keep training users on what constitutes acceptable online behavior for privacy protection.
Table 3.9: Expert Opinion Most Critical Factors for Exploiting Information
Database Protections
Using previous information to find more (Compound Effect)
Aggregation of Data Users
Ease of access to information
Our other open-ended question asked the experts what, in their opinion, were the most critical areas of focus to prevent a privacy breach from occurring. The top answers to this question are listed in Table 3.10. Many of the answers received complemented the responses from Table 3.9. An interesting common response was to have a protected digital identity.
Table 3.10: Expert Opinion Most Critical Areas of Focus
Separation of information storage
Protections to databases
Have a protected digital identity
Restricting information access
Awareness training
III.3.3 Final Model Design
In this section we discuss the final changes to our model, the Privacy Assessment Vulnerability Scoring System (PAVSS). Based upon the first iteration of the model and the survey results discussed in section III.3.2, we arrived at a final iteration of the model. As will be observed, the overall model design has not changed drastically from that described in section III.3.1.
Table 3.11 shows the final design of the model. We still maintain three distinct areas, the Base, Accessibility, and Harm metric groups, which together make up the total score. The Base Metric group is now comprised of the Exploitability Metrics and the Service Metrics. This change incorporates the feedback given by the participants of the survey. The Exploitability Metrics are unchanged and are comprised of the total direct PII, the total indirect PII, and the data expectation to privacy. However, the participants noted in the survey that the information presented is not the only primary factor in the risk of being involved in a privacy breach; the service being used is also a major contributing factor. For this reason, we added the Service Metrics to the Base Metric group. The Service Metrics are a way to quantify the risk that using a particular service presents.

Table 3.11: PAVSS 1.0 Metric Attributes
Base Metric Group: Exploitability Metrics | Base Metric Group: Service Metrics | Accessibility Metric Group | Harm Metric Group
Direct PII (DP) | Storage (ST) | Public Information (PI) | Professional Harm Metric (HP)
Indirect PII (IP) | Sharing (SH) | User Awareness (UA) | Financial Harm Metric (HF)
Data Expectation (DE) | Default Account Setting (DAS) | User Default Account (UDA) | Personal Expectation Metric (HPE)
One modification to the Exploitability Metric group is that of the Data Expectation Metric. In section III.3.1 our initial assumption was that the expectation to privacy from the data’s perspective held higher weight. However, after the survey and further research, we found that this does not hold much weight in the overall exploitability score. Therefore, we give this metric a value of either zero or one. This value depends on which pieces of direct PII are used with a particular service and on the current laws pertaining to the importance of that data. For instance, if the service held a user’s social security number or medical records, and this was in the United States, then these pieces of direct PII have an expectation to privacy. A score of zero for this metric indicates that the data has no expectation of privacy under current laws. The Direct and Indirect PII Metrics are counts of the individual pieces of PII that can be found while parsing through a service looking for an individual.
The Service Metric group is comprised of the Storage, Sharing, and Default Account Setting metrics. Based on feedback from the survey along with information found during the initial research, these are the three most critical areas of concern when trying to ascertain whether using a particular service poses a risk of a privacy breach. In an ideal situation the data owner would know exactly what protections are placed on the data, who internally and externally has access to the data, to whom the service could potentially be selling data, and whether the service uses a privacy preserving algorithm; knowing all of this would help to define a better metric. However, users rarely have all of this information, so we based the metrics on values that can be found either by setting up an initial account (the Default Account Setting Metric) or by parsing through the terms of service agreement (the Sharing and Storage metrics). The Storage metric as we have defined it can have a value ranging from 0 to 2. These values are given based upon whether the service stores any information on the user. The full scope and definitions of these values can be found in Table 3.12.
Table 3.12: Storage Metric possible values
Score Storage Metric
0 Service does not store any information
1 Service stores basic information (Username/Password)
2 Service stores basic information, plus PII
The Sharing Metric is the next metric in the Service Metric group. This metric quantifies the risk imposed by the service sharing and/or selling the data to third parties. Much like the Storage Metric, we have given this metric a value from 0 to 2, because we have identified three separate sharing scenarios, each successive scenario carrying a higher risk of being involved in a privacy breach. In the first scenario the service does not sell or share any information about the user base. In the second scenario the service sells and/or shares information but uses one of many privacy preserving algorithms. In the final scenario the service shares and/or sells the information and uses no privacy preserving algorithms. Table 3.13 defines the three possible scores, depending on which of the three scenarios is present. If the service does not state that it sells and/or shares the data using a privacy preserving algorithm, then it is assumed that it does not.
Table 3.13: Sharing Metric possible values
Score Sharing Metric
0 Service does not share any information
1 Service shares information, but releases using privacy preserving algorithms
2 Service shares information, no privacy preserving algorithms
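The rule in Table 3.13, including the default for a service that is silent about privacy preserving algorithms, can be sketched as follows. The function and parameter names are hypothetical and used only for illustration; the sketch is not part of the PAVSS definition itself.

```python
from typing import Optional


def sharing_score(shares_data: bool,
                  states_privacy_preserving: Optional[bool] = None) -> int:
    """Return the Sharing Metric (SH) value from Table 3.13.

    shares_data: True if the terms of service say data is shared or sold.
    states_privacy_preserving: True only if the service explicitly states
        that it uses a privacy preserving algorithm; None means the terms
        of service are silent on the matter.
    """
    if not shares_data:
        return 0
    # Silence (None) is treated the same as "no privacy preserving
    # algorithm", following the assumption stated in the text.
    return 1 if states_privacy_preserving is True else 2
```

For example, a service whose terms of service say that data is sold but say nothing about how it is released would receive `sharing_score(True)`, which is 2.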
The final metric in the Service Metric group is the Default Account Setting metric. This metric quantifies the PII that is required by default to create an account and that is service-member facing; that is, the information the user must supply in order to create an account and that, by default, can be viewed by other members of the service. As shown in Table 3.14, we give this a score from one to three. The score depends on how much PII must be entered by default and is visible to other members by default.
Table 3.14: Default Account Setting Metric possible values
Score Default Account Setting Metric
1 Zero to Two pieces indirect or direct PII required for a default account
2 Two to Four pieces of indirect or direct PII required for a default account
3 Four or more pieces of PII required for a default account
The other change to the model described in section III.3.1 is that of the Accessibility Metric group. The Accessibility Metric group quantifies how accessible the information is, depending on user actions along with what is considered public information. The Public Information Metric is given a value of either zero or one. This value depends on the pieces of PII counted in the Exploitability Metric group. If one or more of the pieces of PII counted in the Exploitability Metric group is considered to be public information, then the Public Information Metric is given a value of one; if not, the value is zero. The second metric listed in the Accessibility Metric group is the User Awareness Metric. Its values range from zero to two and can be viewed in Table 3.15. Initially this was not considered an important metric; however, with input from the participants of the survey, it was determined to be a contributor to the risk of a privacy breach. This metric scores the user’s overall knowledge of online PII. It was found that even if users are concerned about their data, if they do not have the proper training or knowledge, they are inherently at a higher risk than someone who has the training. The score is zero if the user is well trained and does not share any PII, which includes using a different username for each online service they use.
Table 3.15: User Awareness Metric possible values
Score User Awareness Metric
0 Well trained, doesn’t allow sharing of PII, user name is different from other services
1 Understands basic online PII protections, shares little amounts of PII
2 No online safety training, shares much if not all information
The final metric in the Accessibility Metric group is the User Default Account Metric. This score represents what the user sets as sharing preferences for other members of the service. The values, shown in Table 3.16, range from one to three. Unlike the Default Account Setting Metric, this metric measures how much sharing of their information the user allows, or the amount of control they exercise. A value of one indicates that the user allows sharing of very little to no PII, whereas a value of three indicates that the user allows sharing of multiple pieces of direct and indirect PII.
Table 3.16: User Default Account Metric possible values
Score User Default Account Metric
1 User shares fewer than 2 pieces of indirect PII, 1 or fewer pieces of direct PII
2 User shares more than 1 piece of direct PII, along with 1 or more indirect PII
3 User shares multiple direct and indirect PII
The Harm Metric group has stayed the same as in the first iteration of the model. This metric group scores what harm could be done. It is designed so that it can be expanded to account for other types of cyber security attacks that can be performed with access to an individual’s PII. The Professional and Financial Harm Metrics are each given a value of either one or zero, depending on whether the pieces of PII accounted for in the Exploitability Metric group can be used for the respective attack. The values for the Personal Expectation Metric are shown in Table 3.17 and range from zero to two. These values are a representation of CPM, discussed in section II.3.1. A user is given a value of zero if they have little to low expectations of privacy, whereas someone who has a high degree of privacy expectations is given a value of two.
Table 3.17: Personal Expectation Metric possible values
Score Personal Expectation Metric
0 Low Privacy Expectations
1 General Privacy Expectations
2 High Privacy Expectations
The equations used for calculating the total score have also been updated to reflect the input from the survey participants. Equation 3.6 shows that we want a maximum score no higher than ten, as well as an integer value. The score should be presented to the user in the simplest form, such that the user can make a valid risk assessment quickly.
Total = \lceil \min(Base + Accessibility + Harm, 10) \rceil (3.6)
The new Base Metric group is calculated by equation 3.7. As can be seen, the Base score is the total for the Exploitability Metric group added to the total for the Service Metric group. This was done in response to the input received from the experts during the survey, where they indicated that the service had just as much weight as the pieces of PII in determining the overall risk of being involved in a privacy breach.
Base = Exploitability + Service (3.7)
The final Exploitability Metric group score is calculated by equation 3.8. We know from previous research and from the experts in the survey that indirect PII is worth half as much as direct PII. The total from the direct and indirect PII is then added to the expectation of privacy from the perspective of the data, which was discussed above. This additive value shows that this particular set of captured PII holds more weight than PII captured where the data does not have an expectation to privacy.
Exploitability = DE + (DP + IP * 0.5) (3.8)
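As a worked illustration of equation 3.8, consider a hypothetical user (the values are chosen only for this example and do not come from the test data) with two pieces of direct PII, three pieces of indirect PII, and data that carries an expectation to privacy, so that DP = 2, IP = 3, and DE = 1:

```latex
\text{Exploitability} = DE + (DP + IP \times 0.5)
                      = 1 + (2 + 3 \times 0.5)
                      = 1 + 3.5
                      = 4.5
```

The three pieces of indirect PII together contribute 1.5, less than the two pieces of direct PII, which reflects the half weighting described above.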
The Service Metric group total is given in equation 3.9. None of the vectors that make up the total was considered more important than another, and they are therefore additive in the equation below. There is a constant associated with this metric. The reason for this constant is that the vectors can sum to as much as seven; the constant places the final score for service in the range from zero to three. What the service provides is a large contributor to whether or not an individual is involved in a privacy breach. Yet, ultimately, the biggest contributor is the information that is placed onto the site.
Service = \lfloor 0.568 * (ST + SH + DAS) \rfloor (3.9)
The Accessibility Metric group vectors are each additive as well and are shown in equation 3.10. In the opinion of the experts in the survey, this metric group did not hold as much weight as the Base Metric group, and it therefore carries a weight of only 20% in the overall score.
Accessibility = 0.2 * (PI + UA + UDA) (3.10)
The final equation is that of the Harm Metric group, given in equation 3.11. As in the Accessibility Metric group, each vector is equally weighted within the metric group. However, this group holds a weight of 30% in the overall score. This weight reflects the input from the experts that the greater the harm, the higher the reward for the potential malicious user, and therefore the higher the risk of being involved in a privacy breach. This value does not hold as much weight as the Base Metric group but holds more weight than the Accessibility Metric group; as such, we gave it a weight of 30%.
Harm = 0.3 * (HP + HF + HPE) (3.11)
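Taken together, equations 3.6 through 3.11 can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: it assumes the brackets in equation 3.9 denote a floor and those in equation 3.6 a ceiling (consistent with the stated service range of zero to three and the integer total), it uses exact fractions so that rounding is not affected by floating-point error, and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class PavssInputs:
    dp: int   # Direct PII count
    ip: int   # Indirect PII count
    de: int   # Data Expectation, 0 or 1
    st: int   # Storage, 0-2
    sh: int   # Sharing, 0-2
    das: int  # Default Account Setting, 1-3
    pi: int   # Public Information, 0 or 1
    ua: int   # User Awareness, 0-2
    uda: int  # User Default Account, 1-3
    hp: int   # Professional Harm, 0 or 1
    hf: int   # Financial Harm, 0 or 1
    hpe: int  # Personal Expectation, 0-2


def exploitability(m: PavssInputs) -> Fraction:
    # Equation 3.8: indirect PII counts half as much as direct PII.
    return m.de + m.dp + Fraction(m.ip, 2)


def service(m: PavssInputs) -> int:
    # Equation 3.9, reading the brackets as a floor (assumption).
    return math.floor(Fraction(568, 1000) * (m.st + m.sh + m.das))


def accessibility(m: PavssInputs) -> Fraction:
    # Equation 3.10: 20% weight.
    return Fraction(2, 10) * (m.pi + m.ua + m.uda)


def harm(m: PavssInputs) -> Fraction:
    # Equation 3.11: 30% weight.
    return Fraction(3, 10) * (m.hp + m.hf + m.hpe)


def total(m: PavssInputs) -> int:
    # Equations 3.6 and 3.7: cap at ten, then round up (ceiling assumed).
    base = exploitability(m) + service(m)
    return math.ceil(min(base + accessibility(m) + harm(m), 10))
```

For a hypothetical user with DP = 1, IP = 2, DE = 0, ST = 2, SH = 2, DAS = 1, PI = 1, UA = 1, UDA = 2, HP = 1, HF = 0, and HPE = 1, the sketch gives Exploitability = 2, Service = 2, Accessibility = 0.8, and Harm = 0.6, for a sum of 5.4 and a total score of 6.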
III.4 Chapter Review
In this chapter we introduced the three hypotheses that we plan to test, in section III.1. We will discuss the testing procedure, implementation, and results in the following chapter. Also discussed in this chapter was our method for determining quantitative values for privacy measurements, for which, to our knowledge, no quantitative measurements existed before. In section III.3.2, we highlighted the results from using the Delphi Method that we defined in section II.5. The final major idea presented in this chapter was the final model design, shown in section III.3.3. Utilizing this model, in the next chapter we will test our hypotheses from section III.1.

In this chapter we discuss the testing of the model. The first section discusses the testing plan and the implementation. The next section then discusses the results from testing and whether or not the data support any of the three hypotheses stated in the previous chapter. Finally, the chapter concludes by discussing the challenges that were faced while performing the tests and how further testing may be performed.
IV.1 Pretesting Requirements
For testing, we wanted to match our initial goal of creating a tool to help the average individual assess their risk of being involved in a privacy breach. Therefore, we chose to test with an online service whose users interact and share information, and whose information is easy to access. We also want to show the need for this scoring system, because the ability to sift through large amounts of information and find critical information is no longer limited to those who are well versed in computer science. As such, the data collection tools we use to show the effectiveness of our tool are the same out-of-the-box tools available to the average adversary, the potential malicious user of these same services who is looking for potential targets.
For testing we have chosen to use the online social media platform Twitter.1 Twitter is a popular social media platform that allows easy access to the stored data of its users through its API. Twitter allows its users to make a post, called a tweet, of 280 characters or less. Users also maintain a base of other users that they follow. When a user creates a tweet, other users on the platform can share the original tweet, which is known as retweeting. When a user follows another user, they subscribe to that user’s tweets and are notified when the user they are following makes a new post. To share another user’s post, you do not always need to be a follower of the original poster. Users likewise have followers who are capable of monitoring their posts and information. Users can mention other users in a post by using the @ symbol followed by the user’s username. Signing up for an account is free, and access to the stored data through the API is free as well, although the free access is rate limited depending on the type of access being performed. This makes Twitter an ideal testing platform for the model, since it has a large worldwide user base and easy access to the data.
In order to access the Twitter API, the first step was to create a Twitter account. Creating an account was straightforward, requiring a name and either a phone number or an e-mail address. For the purposes of this test an e-mail address was used. A confirmation e-mail was then sent with a confirmation code, which had to be entered to continue the account creation. Next the service asked whether we would like to upload a photo, with the option to skip this step; for our purposes we skipped it. Next, we were asked for a short bio of less than 160 characters. This step could also be skipped, and for our testing purposes we did so. Next, we were asked about our interests and whether we would like to search for people to follow. This step, like the previous steps, could be skipped, which we did. After this we were presented with our home page. The name we provided on the initial page was turned into our username, through which other users of the service could communicate with us if they found the need.
After the creation of the account we could register with Twitter for a developer account and ask for access to the API with Twitter-generated OAuth tokens. To receive permission, we were required to state the purpose of accessing the API, what we would do with the data, whether individuals would be identified from our data, and who would ultimately see the data we collected. We told Twitter that our purpose was research for a thesis and that no usernames would be released with the data. The turnaround time for Twitter to give us access was less than one week.
IV.2 Testing Procedure
The testing procedure was broken into two phases. The first phase was creating manual entries for users that fit into our three categories of high, medium, and low scores. We did this based upon the values that we, as a user of Twitter, could enter into our profile, as well as common posts that users could make that would identify and disclose pieces of PII. The second phase was creating and testing the algorithms discussed in section IV.2.2, Second Phase. This involved utilizing tools provided by Twitter as well as common Python package libraries.
IV.2.1 First Phase
For the first phase of testing, we created fake users as a way to manually test our model. The fake users had accounts modeled on real Twitter accounts, based on the various pieces of information we were asked to enter, along with possible statements users could post to other users and themselves that would result in a piece of PII being viewable by other users. Table 4.1 lists the various pieces of information a user could enter into their profile and which values are required by default.
Table 4.1: Twitter profile information
User Profile Attributes Required
Photo No
Username Yes
Bio No
Location No
Website No
Birthday No
The next step was to identify the sharing and selling of data, along with the storage of data, so as to fill in the required values for the metrics in the Service Metrics group from the model. According to Twitter’s Terms of Service agreement, they do and will sell user information to third-party vendors. However, the Terms of Service do not specify whether a privacy preserving algorithm is used when the data is sold; because of this, and according to our model, Twitter received a score of two for the Sharing Metric vector. Due to the nature of the service, the Storage Metric vector is rated a score of two. The final vector in the Service Metric group is the Default Account Setting metric; by analyzing the contents of Table 4.1, this vector receives a score of one. Table 4.2 lists the scores for the vectors of the Service Metric group.
Table 4.2: Twitter Service Metric group score
Service Metric Group Score
Storage 2
Sharing 2
Default Account Setting 1
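Substituting the scores from Table 4.2 into equation 3.9 gives the service value used for every Twitter user in the tests that follow. The calculation below assumes the brackets in equation 3.9 denote a floor, which is consistent with the stated service range of zero to three:

```latex
\text{Service} = \lfloor 0.568 \times (ST + SH + DAS) \rfloor
               = \lfloor 0.568 \times (2 + 2 + 1) \rfloor
               = \lfloor 2.84 \rfloor
               = 2
```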
After identifying and scoring the various vectors for the Service Metric group, fake users needed to be created. To create the fake users, we needed to identify what users at each level, High, Medium, and Low, would place into their profile. The values that we believe our fake users would have in their profiles are listed in Table 4.3.
Table 4.3: Twitter Profile Information Manual Entries
User Profile Attributes High Medium Low
Photo Yes Yes No
Username Yes Yes Yes
Bio Yes No No
Location Yes No No
Website Yes Yes No
Birthday Yes Yes No
To assess the account variables listed in Table 4.3, we based the values on what would earn a person a higher or lower rating in the User Default Account Metric and Personal Expectation Metric vectors. This was done by utilizing the work done in [37] and the theory discussed in section II.3.1. Someone who scores high in the Personal Expectation Metric vector is an individual considered to have a high personal expectation to privacy and, as such, is less willing to share information. An individual with a low personal expectation to privacy scores low on the Personal Expectation Metric vector and is therefore willing to share more information.
An assumption was then made concerning the User Awareness Metric vector. This assumption was that users who shared many pieces of PII were typically users who did not have a high level of online safety awareness, whereas users who were more restrictive about the information they posted on their accounts typically had a medium to high degree of online safety awareness. This also correlates with the user’s personal expectation to privacy: we assume that those who have a high level of personal expectation to privacy are the users who typically educate themselves on online safety.
IV.2.2 Second Phase
This phase of testing can be broken into three distinct steps that had to be completed before scores could be calculated. The first step was to access the live stream through Twitter’s API and place a filter to capture results. The second step was to pass the captured text through a regular expression check to verify that the text about to be analyzed contained one of our keywords. The final step was to take the verified text and perform the Lesk algorithm on it, finding the sense of the word we were searching for.
In Table 4.4 there is a list of the primary search criteria filters. However, we will also use variants of these terms for attempting to extract as much information from the streaming services as possible.
Table 4.4: Terms to Search
Gender Race
Birthday Geographic Indicators
Online Usernames Social Security Number
Email Address Medical Records

We searched only for terms that are considered Direct or Indirect PII. Table 4.5 lists all the terms we filtered on during the live tweet capturing. As identified in section II.4.3, determining whether or not a sentence, comment, or block of text could contain private information can be automated to a satisfactory degree of confidence.
Table 4.5: Variations of terms to search
Gender Race birthdate
Birthday Geographic Indicators birth
Online Usernames Social Security Number SSN
Email Address Medical Records disease
email e-mail username
address phone cell
diagnose health hospital
clinic medical name
For our purposes, the information we retained was, username, name, location, and text. The retention of most of this data was only temporary. Once verifying a specific piece of information was found we stored a boolean value to indicate that this particular piece of PII was able to be determined. This information was stored into a SQLite database, on our local machines. The purpose of our testing was not to actually cause harm to the end users, just to verify if information was found.
The live streaming filter provided by Twitter returns the tweet that was caught by the filter in JavaScript Object Notation (JSON) format. In Table 4.6 we specify some useful information that the JSON data returns. A more detailed description of all the information provided in this data format can be found on the Twitter Developer website.2 We first saved each JSON record to a standard text file, with a single JSON record per line of the file.
2https: / /

Table 4.6: Useful information from tweets caught in filter
Created at time Retweeted Coordinates
Username Retweet Count Location
Text Name Media Information
For the regular expression matching, we created regular expressions based upon the words that we used for filtering through the Twitter API. After a user was processed in this stage, we stored them in a SQLite database. We read in the file and began the processing. First we ran the text through a regular expression check. If a match occurred, the next step was to determine whether the message was a retweet or an original tweet. Below is an example of how a retweet is structured.
• RT @user_retweeter: the original message
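A retweet in the layout shown above can be recognized with a short regular expression. The sketch below is illustrative only: the pattern, the function name `parse_tweet`, and the shape of the returned tuple are assumptions, not the expressions used in the tests.

```python
import re

# Matches the retweet layout shown above: "RT @user: message".
RETWEET_RE = re.compile(r"^RT @(\w+):\s*(.*)$", re.DOTALL)
# Matches any other @username mentioned in the text.
MENTION_RE = re.compile(r"@(\w+)")


def parse_tweet(text):
    """Return (is_retweet, rt_user, mentioned_users, message).

    rt_user is the username that follows "RT @"; mentioned_users lists
    any further usernames found in the message itself.
    """
    match = RETWEET_RE.match(text)
    if match:
        rt_user, message = match.group(1), match.group(2)
        return True, rt_user, MENTION_RE.findall(message), message
    return False, None, MENTION_RE.findall(text), text
```

For instance, `parse_tweet("RT @alice: happy birthday @bob")` reports a retweet, the user `alice` after the `RT` marker, and the additional mentioned user `bob`.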
If the message was a retweet, we would attempt to identify whether there was another username in the text. This was done for a more accurate account of who the information in the text was about. We would then look up the user in the database. If the user existed, we would update the column relating to the PII that we had just found and add the text to the previously stored text for that user. If the user did not exist, we would add an entry to the database, setting the column for the PII that we had found to true and storing the text of the tweet. This is shown in the pseudocode in Figure 4.1. Our regular expression testing, would look for whole words,
Before the final step in processing the text, we needed to determine the true sense of the sensitive words that we were searching on. As mentioned in section II.4.3, a word can have many meanings, making the true meaning of the sentence in which the word appears ambiguous. In order for us to use the Lesk algorithm in our final processing step, we first needed to run Lesk’s algorithm on sentences that we knew to show the intended sense of the particular word we were searching on. We created these sentences, ran Lesk’s algorithm over each sentence, and then stored the resulting sense of the word in a SQLite database. The version of Lesk’s algorithm we used was the implementation in the Python package provided by [9]. We did this for each word that had been turned into a regular expression check in the second step.

Data: text file with one JSON entry per line
Result: database of aggregated tweets from text file
while lines in file do
    Regular expression test for different search terms;
    if Regular Expression Match then
        if User exists then
            Update user's entry in database for search term, add text to list of text for user;
        else
            Create new user entry;
        end
    else
        Continue;
    end
end
Figure 4.1: Pseudo code for regular expression testing
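The regular expression pass summarized in Figure 4.1 can be sketched in Python with the standard `re`, `json`, and `sqlite3` modules. This is a minimal sketch under stated assumptions: the two patterns are a hypothetical subset of the search terms, each input line is assumed to be a flat JSON record with `username` and `text` fields (the real Twitter payload is nested differently), and the table layout is invented for illustration.

```python
import json
import re
import sqlite3

# Hypothetical subset of the search terms, matched as whole words.
PATTERNS = {
    "birthday": re.compile(r"\b(birthday|birthdate)\b", re.IGNORECASE),
    "email": re.compile(r"\b(e-?mail)\b", re.IGNORECASE),
}


def aggregate(lines, conn):
    """Flag PII terms per user and accumulate the matching tweet text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "username TEXT PRIMARY KEY, birthday INTEGER DEFAULT 0, "
        "email INTEGER DEFAULT 0, texts TEXT DEFAULT '')"
    )
    for line in lines:
        tweet = json.loads(line)
        username, text = tweet["username"], tweet["text"]
        for term, pattern in PATTERNS.items():
            if not pattern.search(text):
                continue  # no match for this term; try the next one
            row = conn.execute(
                "SELECT texts FROM users WHERE username = ?", (username,)
            ).fetchone()
            if row is None:
                # User not seen before: create a new entry.
                conn.execute(
                    "INSERT INTO users (username) VALUES (?)", (username,)
                )
                stored = ""
            else:
                stored = row[0]
            if text not in stored.split("\n"):
                stored = (stored + "\n" + text).strip("\n")
            # The column name comes from the fixed PATTERNS keys above,
            # never from tweet content, so formatting it in is safe here.
            conn.execute(
                f"UPDATE users SET {term} = 1, texts = ? WHERE username = ?",
                (stored, username),
            )
    conn.commit()
```

Calling `aggregate(open(path), sqlite3.connect(db_path))` with hypothetical `path` and `db_path` values would process a capture file line by line; users whose tweets match no pattern never receive a row.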
With the sense of the different privacy-divulging words known, we then started the final step of processing the data. We accessed each entry in the database for the users that had made it through the second filter. We then determined which particular PII the text that we had stored for that user contained. Some users had been marked for many different pieces of PII, since we had captured tweets over a series of many days. For each particular PII that had been marked, we ran Lesk’s algorithm over the text and attempted to determine the sense for each PII. If the sense of the piece of PII found was the same as the sense we had stored previously, we marked it as a true found piece of PII; otherwise we marked it as false and continued. The pseudocode for this can be viewed in Figure 4.2.
Once all the entries have been stored the last step that needed to be done was determine how many direct and indirect PII each user had. For this we iterated over the new database entries boolean fields that flagged which pieces of PII have made it

Data: Database of users with text flagged with pieces of PII
Result: New Database of users that have been determined true
while Entries in database do
    Determine which piece of PII was flagged;
    foreach Flagged PII do
        Perform Lesk’s Algorithm on text for entry;
        if Sense of PII matches then
            Mark True for Particular PII in New Database;
        else
            Continue;
        end
    end
end
Figure 4.2: Pseudo code for final filtering phase
through the last filtration process. We then stored the number of indirect and direct pieces of PII for each user.
At this stage we also determined whether any of the found pieces of PII, or any combination of them, could be used for a professional or financial attack. We then stored this information per user as well, in order to run it through our equations defined in section III.3.3.
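The final filtering phase of Figure 4.2 and the subsequent counting step could look roughly like the sketch below. It assumes a `disambiguate(text, term)` callable standing in for Lesk's algorithm, treats a piece of PII as confirmed when any stored text yields the reference sense, and takes the set of direct PII terms as a parameter; all three choices are assumptions made for illustration, not details given in the text.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def confirm_pii(
    texts_by_term: Dict[str, List[str]],
    reference_sense: Dict[str, str],
    disambiguate: Callable[[str, str], str],
) -> Dict[str, bool]:
    """Mark each flagged term True when its sense matches the stored one.

    `disambiguate` is a placeholder for a word sense disambiguation
    routine such as Lesk's algorithm.
    """
    confirmed = {}
    for term, texts in texts_by_term.items():
        confirmed[term] = any(
            disambiguate(text, term) == reference_sense[term] for text in texts
        )
    return confirmed


def count_pii(confirmed: Dict[str, bool], direct_terms: Iterable[str]) -> Tuple[int, int]:
    """Return (direct, indirect) counts of confirmed pieces of PII."""
    direct_set: Set[str] = set(direct_terms)
    direct = sum(1 for term, ok in confirmed.items() if ok and term in direct_set)
    indirect = sum(1 for term, ok in confirmed.items() if ok and term not in direct_set)
    return direct, indirect
```

Passing the disambiguation routine in as a parameter keeps the filtering logic independent of any particular Lesk implementation.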
IV.3 Results
We will now discuss the results from performing both phases of testing. We will also determine whether any of the results lend support for or against the three hypotheses we set out to test. First, we will discuss test results from the created data set. This data set was meant to test our model manually, so that we could verify the correctness of the equations and our assumptions. Afterwards we will discuss the results from the phase two tests, where we captured live data from the online social media platform Twitter.
IV.3.1 Results from Data Set
To reiterate section IV.2.1, we created users that fit within the High, Medium, and Low ranges of our model. The profiles for these users can be seen in Table 4.3. For each of the high, medium, and low ranges, we created nine users.

In the following tables, the column titles directly reference Table 3.11, from the previous chapter. Also, when determining the final score, the value for the service metrics was already determined in Table 4.2.
In Table 4.7 we have the total values for users in our high category. These are users that we expect would share many pieces of PII and would have very little, if any, training on online safety. The values that we picked for direct and indirect PII were based upon the default values the users would place on their profiles. These values are listed in Table 4.3, under the column High. The additional values we listed were assumed to come from tweets that other users would make about them, or tweets that they would post about themselves.
Table 4.7: Created user accounts, high scores
User 1 2 4 1 1 2 3 1 1 0 9
User 2 2 4 1 1 2 3 1 1 0 9
User 3 2 3 1 1 2 3 1 1 0 8
User 4 3 5 1 1 2 3 1 1 0 10
User 5 2 5 1 1 2 3 1 1 0 9
User 6 3 4 1 1 2 3 1 1 0 10
User 7 3 5 1 1 2 3 1 1 0 10
User 8 3 3 1 1 2 3 1 1 0 9
User 9 3 4 1 1 2 3 1 1 0 10
In Table 4.8 we have listed out the results from the users that fall into the medium score category. In contrast to the users from Table 4.7, these users post less direct PII, however still more than enough indirect PII that these users could potentially be reidentified if the information was found. These users would be considered to have a medium level of online safety; however, they might still share information that can be used for re-identification purposes.
These users would have a profile filled in similarly to the users in Table 4.3, under the column medium. This column was to give an example of what a medium level user would look like. As can be deduced from Table 4.8, the actual information filled in

can vary. We wanted to show that even users who have an awareness of online safety, could also be caught up and leave information that a malicious user could use.
Table 4.8: Created user accounts, medium scores
User 1 3 1 0 1 1 3 1 0 1 7
User 2 1 3 0 1 2 1 1 0 1 6
User 3 1 2 1 0 1 1 1 1 1 6
User 4 1 3 0 1 2 1 1 0 1 6
User 5 2 1 0 1 1 2 1 0 1 6
User 6 1 1 0 1 1 1 1 0 1 5
User 7 1 3 1 1 2 1 1 1 1 7
User 8 2 2 0 1 1 2 1 0 1 6
User 9 2 1 0 1 1 2 1 0 1 6
Our final user group, those in the low score range, can be viewed in Table 4.9. Comparing these users to those in Tables 4.7 and 4.8, it can be observed that these would be the users who attempt to mask their online identity the most. They would be considered to have a high online safety awareness and share the least amount of PII. One area in which these users would score high is HPE. These would be the users who would feel a high degree of harm if their information was leaked and no longer under their control.
These users would be those labeled in Table 4.3 under the column low. Table 4.9 shows that these users share very little information, yet still receive a score of three. This is due to the service they are using. In this case, that service is Twitter. Just because these users are not sharing much information intentionally, or allowing others to share it, doesn’t mean someone cannot access it. The service itself sells off the information.
These created data sets were used primarily to hone in and demonstrate that our equations would work as expected. Since our created data set produced scores that we found sufficient, we will now move to the results from the second phase, where we captured live information as it was flowing. In this next section we will

Table 4.9: Created user accounts, low scores
User 1 0 1 0 0 0 1 0 0 2 3
User 2 0 1 0 0 0 1 0 0 2 3
User 3 0 1 0 0 0 1 0 0 2 3
User 4 0 1 0 0 0 1 0 0 2 3
User 5 0 1 0 0 0 1 0 0 2 3
User 6 0 1 0 0 0 1 0 0 2 3
User 7 0 1 0 0 0 1 0 0 2 3
User 8 0 1 0 0 0 1 0 0 2 3
User 9 0 1 0 0 0 1 0 0 2 3
be able to compare the captured data to the data we created and view if our assumptions were correct.
IV.3.2 Results from Twitter Data
In this section we will discuss and highlight the results from accessing the live streams from Twitter. As we do, we will also discuss if any of the results we have analyzed help to lend support for or against the hypotheses listed in the previous chapter. First, we will show how many tweets we captured, and how many tweets were filtered out at each stage of our filtering process. Then we will show the overall average scores for all of those users that had made it through all of our filtration processes. Afterwards we will dig deeper into separate categories of users.
An interesting result from the filtration process was the number of tweets eliminated at each step. Our initial data set consisted of 2.5 million tweets, captured over a series of days. As shown in Figure 4.3, after we aggregated all the tweets per user and filtered out the tweets that did not pass the regular expression test, we were left with 1.6 million users. One thing to note is that the first step counted individual tweets rather than users, so some of the tweets caught could have been multiple tweets by the same user. In addition, some of the tweets caught may have matched only because of a URL posted or the title of a piece of media posted.

[Bar chart "Number of Users per Filtration Step". x-axis: Filtration Step (Regular Expression Test, WSD Test); y-axis: number of users, 0 to 1,600,000.]
Figure 4.3: Number of tweets per filtration process
The last filtration test left us with just under 500 thousand users, showing that not every tweet caught contained privacy-divulging information. In total, aggregation and filtering removed 2 million of the tweets first caught in our filter. This can be attributed to a few factors. The first is how the Twitter filtration works. According to the Twitter developer documents, the filter matches not only against the text of the tweet but also against URL names, media names, and screen names.³ This helps explain why there was a decrease between the first and second filtration steps. Our regular expression test checked for whole words in the text of a tweet, and not against usernames, URL strings, or names of media, like photos or videos embedded in posts from other users. The drop in tweets that made it past the WSD test is attributed to the ambiguity of the meaning of the word used in the sentence. For example, one of the phrases we were filtering on was the word address. In Table 4.10 we have listed all the various forms the word address

can take on when it is used as a noun. The word address can also be used as a verb, where it takes on up to ten other meanings as well.
Table 4.10: Various definitions of word "address" as a noun
Address Part of Speech Definition
Address - Noun (computer science) the code that identifies where a piece of information is stored
Address - Noun the place where a person or organization can be found or communicated with
Address - Noun the act of delivering a formal spoken communication to an audience
Address - Noun the manner of speaking to another individual
Address - Noun a sign in front of a house or business carrying the conventional form by which its location is described
Address - Noun written directions for finding some location; written on letters or packages that are to be delivered to that location
Address - Noun the stance assumed by a golfer in preparation for hitting a golf ball
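To illustrate how a dictionary-overlap method separates these senses, the sketch below applies a simplified Lesk-style rule to the glosses of Table 4.10: the sense whose gloss shares the most content words with the surrounding sentence wins. It is a toy variant written for this example, not the package implementation used in our processing; the sense labels and the stopword list are hypothetical.

```python
import re

# Noun glosses for "address", copied from Table 4.10. The keys are
# hypothetical labels introduced only for this example.
GLOSSES = {
    "computer_address": "(computer science) the code that identifies where a piece of information is stored",
    "location": "the place where a person or organization can be found or communicated with",
    "speech": "the act of delivering a formal spoken communication to an audience",
    "manner_of_speaking": "the manner of speaking to another individual",
    "sign": "a sign in front of a house or business carrying the conventional form by which its location is described",
    "written_directions": "written directions for finding some location; written on letters or packages that are to be delivered to that location",
    "golf_stance": "the stance assumed by a golfer in preparation for hitting a golf ball",
}

# Small illustrative stopword list.
STOPWORDS = {
    "a", "an", "the", "of", "to", "in", "on", "by", "for", "is", "are", "be",
    "or", "and", "that", "with", "where", "which", "its", "can", "my", "i", "at",
}


def content_words(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(context, glosses):
    """Return the sense whose gloss overlaps most with the context, or None."""
    context_words = content_words(context)
    best_sense, best_overlap = None, 0
    for sense, gloss in glosses.items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For example, `simplified_lesk("The president delivered a formal address to a large audience", GLOSSES)` selects the `speech` sense, while a short sentence that shares no content words with any gloss returns `None`.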
Since the number of tweets that had been removed due to the filtration steps has been addressed, we can now view the results from analyzing the users. In Figure 4.4, we compare the total users that contain at least one piece of direct PII compared to those that have no pieces of direct PII.
Table 4.11 lists values for the users who made it through the final filtration step. The number of users with direct PII is vastly different from the number with no direct PII. The number of users with no indirect PII compared to those with indirect PII is as expected: for each user whose tweet we captured we also obtained their username, which gave us a minimum of one piece of indirect PII per user.

[Bar chart "Users with Direct PII vs No Direct PII". x-axis: Users PII Sets (Users With Direct PII, Users No Direct PII); y-axis: number of users.]
Figure 4.4: Total users with Direct PII vs No Direct PII
Table 4.11: Total Users in various sets
User Set Total Number of Users
Users with Direct PII 464658
Users with No Direct PII 22018
Users with Indirect PII 486676
Users with No Indirect PII 0
In Table 4.12 we compare the number of users who have at least one piece of direct PII and strictly more than one piece of indirect PII with the number of users who have at least one piece of direct PII and only one piece of indirect PII. Together these make up the total count of 464658 listed in Table 4.11. As shown in the table, the number of users with at least one piece of direct PII and more than one piece of indirect PII is much greater than the number with at least one piece of direct PII and only one piece of indirect PII.
Table 4.12: Total indirect PII for users with at least one direct PII
User Set Total Number of Users
Users with two or more indirect PII 373462
Users with exactly one indirect PII 91196

We further analyzed the results to get the values in Table 4.13. From the table we can see that the mean number of pieces of indirect PII per user was 2.0602, with a standard deviation of 0.6702. We then calculated the Z-score for the probability of finding a user with strictly more than one piece of indirect PII. We use the Z-score to determine how many standard deviations away from the mean a data point is. For this scenario our universe was the total number of users with at least one piece of direct PII and one piece of indirect PII. This probability came to 94.3%, indicating that this is the percentage of the universe who have two or more pieces of indirect PII while also divulging at a minimum one piece of direct PII. This would lend support to H1, that users who have one piece of direct PII available will also have a minimum of two pieces of indirect PII available.
Table 4.13: Users with at least one direct PII, number of indirect PII statistics
Parameter Value
Mean 2.0602034184281774
Standard Deviation 0.6701555985087279
Z-Score, One or less indirect PII -1.582025757581386
p-value, Greater than one indirect PII 0.9431781547517581
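The Z-score and probability in Table 4.13 can be reproduced with a few lines of Python. The sketch assumes, as the reported values suggest, that the probability is the upper tail of a normal distribution parameterized by the sample mean and standard deviation; the variable names are chosen for this example.

```python
from statistics import NormalDist

mean_indirect = 2.0602034184281774  # mean from Table 4.13
sd_indirect = 0.6701555985087279    # standard deviation from Table 4.13

# Z-score of the boundary value "one piece of indirect PII".
z = (1 - mean_indirect) / sd_indirect

# Probability of more than one piece of indirect PII under the normal model.
p_more_than_one = 1 - NormalDist().cdf(z)

print(round(z, 4), round(p_more_than_one, 4))
```

Rounded to four decimal places, the two printed values agree with the Z-score and p-value rows of Table 4.13.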
When we then calculate the PAVSS score for these users, we can see that the users with direct PII available have a higher score than those that do not have direct PII accessible. This can be seen in Table 4.14, where we have the two separate sets of users in comparison.
Table 4.14: Users with Direct PII vs Users without Direct PII PAVSS Score
User Set Average PAVSS Score
With Direct PII 7.551973771433622
Without Direct PII 6.6339090949561506
To test our other hypotheses, we had to further investigate our results. First, we analyzed the users who would fall in and out of the set of users from H2. In Table 4.15, the results from the filtration test are listed. One noticeable observation is that

the mean number of total pieces of PII for the users with an accessible birthday is nearly the same as for those without an accessible birthday. The results show that, across all the users we analyzed, users who don’t show their birthday actually have more PII available than those who do.
Table 4.15: Users total number of PII for users with and without their birthdays accessible
Parameter Accessible Not Accessible
Total Number of Users 222640 250181
Percentage 47.91% 52.09%
Mean 3.5240747394897594 3.6272698566238044
Standard Deviation 0.6240154278586392 0.6093790317448934
Z-Score, Difference in Mean 0.16537270158234338
p-value 0.4343253194025472
If we then analyze the PAVSS score associated with each user, we see that those without a birthday score higher on average than those with a birthday. Table 4.16 shows the average PAVSS score for the two sets of users. From Tables 4.15 and 4.16, we can see that the results do not lend support to H2.
Table 4.16: Users with and without Birthday PAVSS Score
User Set Average PAVSS Score
With Birthday 7.134337944664032
Without Birthday 7.4619215687842
Next, we looked at the results for users with and without an email available. To summarize H3, we hypothesized that those who have their emails available will also show more total pieces of PII than those who don’t. In Table 4.17, we have listed the results. The users who had their emails accessible had, on average, a higher number of pieces of PII accessible than those who did not.
In Table 4.18, we show the average score for users with emails accessible against those without emails accessible. In line with Table 4.17, the users with their emails accessible have, on average, a higher PAVSS score.

Table 4.17: Users total number of PII for users with and without their emails accessible
Parameter Accessible Not Accessible
Total Number of Users 43731 428420
Percentage 9.41% 90.59%
Mean 4.027326153072192 3.532834601559218
Standard Deviation 0.682536387468522 0.5929284576361334
Z-Score, Difference in Mean -0.7244911195826604
p-value 0.7656178614293423
Table 4.18: Users with and without Email PAVSS Score
User Set Average PAVSS Score
With Email 7.834739658365919
Without Email 7.253650623220205
The users who had their e-mails accessible were a small subset of those who ended up passing through all the filters. In Table 4.17, we can see that only 9% of total users had their e-mails accessible, yet those 9% on average had a higher total number of pieces of PII available. The p-value associated with the difference in means was 0.7656, which is not the most desirable; yet considering the small percentage of users who had their emails accessible and the density of those with more PII, the results tend to support H3.
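The "Difference in Mean" rows of Tables 4.15 and 4.17 are consistent with dividing the gap between the two group means by the standard deviation of the accessible group and then taking the upper tail of the standard normal distribution. The sketch below checks that reading; the formula is inferred from the reported numbers rather than stated in the text, and the function name is hypothetical.

```python
from statistics import NormalDist


def z_and_upper_tail(mean_accessible, sd_accessible, mean_not_accessible):
    """Z-score of the not-accessible mean relative to the accessible group,
    together with the upper-tail probability under a standard normal."""
    z = (mean_not_accessible - mean_accessible) / sd_accessible
    return z, 1 - NormalDist().cdf(z)


# Values taken from Table 4.15 (birthday) and Table 4.17 (email).
birthday_z, birthday_p = z_and_upper_tail(
    3.5240747394897594, 0.6240154278586392, 3.6272698566238044
)
email_z, email_p = z_and_upper_tail(
    4.027326153072192, 0.682536387468522, 3.532834601559218
)

print(round(birthday_z, 4), round(birthday_p, 4))
print(round(email_z, 4), round(email_p, 4))
```

Under this reading, a positive Z-score (birthday) means the not-accessible group has the larger mean, and a negative one (email) means the accessible group does.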
IV.4 Challenges
We had to overcome some challenges while performing the testing. One of the first was that Twitter limited the amount of information that could be queried through their API. The amount of live tweets you could siphon, and the length of time you could do so, was not limited; however, queries to their API about specific users were rate limited.
Their API would allow you to get all the tweets a user had posted within the past 30 days. However, you were limited to only 15 users per 15 minutes. Attempting to do this for every user in our database would have become a very time-consuming task. You did have the opportunity to pay and not be rate limited; however, for our testing we wanted to appear as a malicious user. Our assumption was that

a malicious user would attempt to extract enough information without leaving a trail connecting them to the data. If a malicious user paid for access, the payment would leave a series of breadcrumbs leading back to the malicious user.
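A rough calculation shows why per-user queries were impractical under the limit of 15 users per 15 minutes. The sketch below takes the 486,676 users of Table 4.11 as an illustrative database size; which user count to apply is an assumption made for this example, and the function name is hypothetical.

```python
import math


def minutes_to_query_all(num_users, users_per_window=15, window_minutes=15):
    """Minutes needed to query every user once under the rate limit."""
    windows = math.ceil(num_users / users_per_window)
    return windows * window_minutes


minutes = minutes_to_query_all(486_676)
days = minutes / (60 * 24)
print(minutes, round(days))
```

At one user per minute on average, a single pass over that many users would take close to a year of uninterrupted querying.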
Another issue that arose was the number of filtering streams you were allowed to have open at a given instant. For efficiency, the first iteration of the testing program had a separate process filtering the live tweets for each category of word we were filtering on. For instance, when attempting to find a tweet about a birthday, we filtered for words like birth, birthday, and birth-date. The first attempt kept all of a category's terms in one process, as a way to store all the captured tweets for that category in one area and one grouping for easier processing later. We found that after two processes had connected to the live tweet filter, we would be kicked off and denied. As such, we ended up using a single process that filtered on all the terms at once.
IV.5 Chapter Review
In this chapter we set out to test our hypotheses and the model listed in the previous chapter. For testing purposes, we used the online social network Twitter. From Twitter we had free access to tweets as they were being streamed live. To perform the tests we had to sign up for an account, which was discussed in the first section of this chapter.
Our testing was separated into two phases. The first phase was when we manually created users. These manually created users fell into three categories: those scoring High, Medium, and Low on the PAVSS scale. This was our training data set, used to verify that the equations from our model would hold true.
The second phase of testing was accessing the live tweet stream. This process was broken into three filtration steps. The first was the filtering that Twitter's live stream performed based on the terms we provided. The second filtration used regular expression testing on the tweets we found. The third filtration used WSD

to determine the true sense of the privacy critical term that had passed our regular expression testing.
In the second phase we were able to test our hypotheses. The results from our tests lent support to H1 and H3; however, they did not lend support to H2. Further testing can be done to analyze H2, which may provide more accurate results. This further testing could include paying for unrestricted access to Twitter’s database and looking for this information for each user. Another way to test this hypothesis would be to filter live tweets containing a birthday for an entire year. This approach would be time intensive as well as data intensive, considering the number of tweets we captured in only a few days.

In this chapter we will quickly discuss future research that can be performed related to this thesis. These are areas that we determined would have helped us reach better and more concise results, as well as new questions that emerged that were not within the scope of this thesis yet are related to this research.
First, having more experts participate in the study would have helped produce more accurate results within the model. Having many more iterations of the survey would also have helped hone in on some of the more critical privacy-reducing areas. We could apply our model to other online platforms as well, to see how it works on various other platforms.
Future research could be done on lexical analysis, on how many different ways a user could make a statement that produces a privacy vulnerability. Much internet text is composed of slang and shortened, abbreviated words, so that the full meaning of the sentence can be placed within the confines of the allotted character limit. Because of this, finding the true meaning of what a user states becomes more difficult, since many NLP programs and devices have been trained on peer-reviewed texts. These texts have proper grammar and formatting, which is not often the case on social sites where there is a character limit.
Identifying a way a user could have a protected online identity that is disassociated from their offline, real-world identity would be an interesting topic to research as well. This was identified by experts who responded to our survey as an area to focus on for preventing a privacy breach. Since earlier research has shown that individuals release certain private information as a way to gain trust with others online, this would be a rather challenging avenue to research, partially due to needing a way to help users gain trust without breaking the confines of what is private.

A final area of future research would be expanding our model to test against other use cases for PII compromise. An interesting study would be to determine the number of individuals involved in a privacy breach and see how many of those were active on a social networking site. After determining this, one could determine the total PAVSS score those users would produce and compare it to the scores of those who have not been involved in a privacy breach. This could potentially help lead to more accurate ways to inform users on how to be safe in an online environment.

In this chapter we will wrap everything up. We will quickly discuss the reasoning for this thesis, our research questions and answers, the hypotheses that were formed while investigating our research questions, and the model that we designed to test our hypotheses. This will lead to a summary of our testing methodologies and results, which then segues into avenues for future research.
This research was started on the premise that currently in the United States there are no true protections for consumer privacy. The standards that do exist are inadequate, and as such do not provide enough protections to help prevent consumers from being involved in a privacy breach. We showed in section II.2 that there are currently industry-wide methods and procedures for determining the faults, and the severity of these faults, for software, yet these procedures do not address user privacy. Researching this helped to answer one of our research questions, RQ-4, where we set out to find whether current scoring metrics take into account the importance of the data at risk of being stolen.
In chapter II we identified five research questions related to user privacy. Our first research question was RQ-1, which asked what information was needed to impersonate someone. We determined that it was sufficient to know only a person’s name and email address in order to impersonate them. Our second research question, RQ-2, asked what information about someone was more valuable. This was determined to be information that fell into the four categories of: information that is private, unique, and distinct. This information is rightfully called direct personally identifiable information. The third research question, RQ-3, could not be fully answered until after the survey was complete. For this we found that the least amount of information needed to be known in order to find more information was only a single piece of information. This single piece of information was just a person’s name in many cases. We have just discussed RQ-4, which leaves us with our last research

question, RQ-5, which asked whether current data collection techniques used in data analysis follow a strict enough guideline to ensure user anonymity. We were able to answer RQ-5 after analyzing various data anonymization algorithms, as well as laws in the United States and in Europe. We came to the conclusion that the current state of data collection techniques does not follow a strict enough guideline, since many times guidelines do not exist.
After our research questions we determined three hypotheses, which we introduced in chapter III. They were:
H1: Those accounts that have 1 piece of direct PII available will also have a minimum of 2 pieces of indirect PII available.
H2: Users that give their birthday or allow it to be determined are more likely to have more pieces of Indirect or Direct PII than users that do not give their birthday.
H3: Users that have their e-mail addresses easily accessible are more likely to divulge more PII, and therefore have a higher score from our model, than the set of users from H2.
From these hypotheses and the research questions we developed our PAVSS framework. This framework is the first of its kind to quantify the risk a user takes while using an online service. To help develop this framework, we utilized the Delphi method to find quantitative values for various privacy-vulnerable scenarios. We ultimately came to our final version of PAVSS, which was discussed in section III.3.3.
In chapter IV we tested our model as well as tested our hypotheses. For testing we utilized the online social networking site Twitter. A user on Twitter is capable of making an online post called a tweet, that other users on and off the platform can view. By accessing tweets as they were posted, we captured tweets that contained various words that are indicative of speech that contains privacy divulging information.
We filtered our results from this capturing process by using regular expression testing, then WSD testing on the text that made it through the regular expression testing. Our results lent support for H1 and H3. The results did not lend support for H2. However, we identified other testing methodologies that could be used to further test H2.
Finally, in chapter V, we identified possible avenues for further research that were either outside the scope of this thesis or would use this thesis as a base. Utilizing this model for other use cases of PII would help to strengthen the case for our framework. Identifying a way to separate a user’s online identity from their real-world identity, while still allowing users to build trust, is also an interesting topic for further research.

[1] Abowd, J. M. The u.s. census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (New York, NY, USA, 2018), KDD ’18, ACM, pp. 2867-2867.
[2] Acquisti, A., and Gross, R. Predicting social security numbers from public data. Proceedings of the National Academy of Sciences 106, 27 (2009), 10975-10980.
[3] Ahmad, W. U., Rahman, M. M., and Wang, H. Topic model based privacy protection in personalized web search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (New York, NY, USA, 2016), SIGIR ’16, ACM, pp. 1025-1028.
[4] Al-Karkhi, A., Al-Yasiri, A., and Jaseemuddin, M. Non-intrusive user identity provisioning in the internet of things. In Proceedings of the 12th ACM International Symposium on Mobility Management and Wireless Access (New York, NY, USA, 2014), MobiWac ’14, ACM, pp. 83-90.
[5] Allodi, L., Banescu, S., Femmer, H., and Beckers, K. Identifying relevant information cues for vulnerability assessment using cvss. In Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy (New York, NY, USA, 2018), CODASPY ’18, ACM, pp. 119-126.
[6] Anjum, A., and Raschia, G. Banga: An efficient and flexible generalization-based algorithm for privacy preserving data publication. Computers 6, 1 (2017).
[7] Arefi, M. N., Alexander, G., and Crandall, J. R. Piitracker: Automatic tracking of personally identifiable information in windows. In Proceedings of the 11th European Workshop on Systems Security (New York, NY, USA, 2018), EuroSec’18, ACM, pp. 3:1-3:6.
[8] Bahri, L. Identity related threats, vulnerabilities and risk mitigation in online social networks: A tutorial. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2017), CCS ’17, ACM, pp. 2603-2605.
[9] Bird, S., Klein, E., and Loper, E. Natural Language Processing with Python, 1st ed. O’Reilly Media, Inc., 2009.
[10] Boyne, S. M. Data protection in the united states. The American Journal of Comparative Law 66, suppl_1 (2018), 299-343.
[11] Bozorgi, M., Saul, L. K., Savage, S., and Voelker, G. M. Beyond heuristics: Learning to classify vulnerabilities and predict exploits. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2010), KDD ’10, ACM, pp. 105-114.

[12] Center, I. T. R. 2017 annual report. Tech. rep., Identity Theft Resource Center, 2017.
[13] Center, I. T. R. Data breach reports. Tech. rep., Identity Theft Resource Center, 2018.
[14] Center, I. T. R. Data breaches, 2018.
[15] Child, J. T., and Starcher, S. C. Fuzzy facebook privacy boundaries: Exploring mediated lurking, vague-booking, and facebook privacy management. Computers in Human Behavior 54 (2016), 483 - 490.
[16] Choi, B. C., and Land, L. The effects of general privacy concerns and transactional privacy concerns on facebook apps usage. Information & Management 53, 7 (2016), 868-877. Special Issue on Papers Presented at Pacis 2015.
[17] Cormode, G., Jha, S., Kulkarni, T., Li, N., Srivastava, D., and Wang, T. Privacy at scale: Local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data (New York, NY, USA, 2018), SIGMOD ’18, ACM, pp. 1655-1658.
[18] Dalkey, N., and Helmer, O. An experimental application of the delphi method to the use of experts. Management science 9, 3 (1963), 458-467.
[19] De Bruijn, F., and Dekkers, H. L. Ambiguity in natural language software requirements: A case study. In International Working Conference on Requirements Engineering: Foundation for Software Quality (2010), Springer, pp. 233-247.
[20] Douziech, P.-E., and Curtis, B. Cross-technology, cross-layer defect detection in it systems: Challenges and achievements. In Proceedings of the First International Workshop on Complex faUlts and Failures in LargE Software Systems (Piscataway, NJ, USA, 2015), COUFLESS ’15, IEEE Press, pp. 21-26.
[21] Dupuis, M. "wait, do i know you?": A look at personality and preventing one’s personal information from being compromised. In Proceedings of the 5th Annual Conference on Research in Information Technology (New York, NY, USA, 2016), RIIT ’16, ACM, pp. 55-55.
[22] Dwork, C. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation (2008), Springer, pp. 1-19.
[23] FIRST. Cvss v3.0 specification document, 2018.
[24] Goga, O., Venkatadri, G., and Gummadi, K. P. The doppelganger bot attack: Exploring identity impersonation in online social networks. In Proceedings of the 2015 Internet Measurement Conference (New York, NY, USA, 2015), IMC ’15, ACM, pp. 141-153.

[25] Graupner, H., Jaeger, D., Cheng, F., and Meinel, C. Automated parsing and interpretation of identity leaks. In Proceedings of the ACM International Conference on Computing Frontiers (New York, NY, USA, 2016), CF ’16, ACM, pp. 127-134.
[26] Gu, J., Xu, Y. C., Xu, H., Zhang, C., and Ling, H. Privacy concerns for mobile app download: An elaboration likelihood model perspective. Decision Support Systems 9f (2017), 19 - 28.
[27] HALL, H. K. Restoring dignity and harmony to united states-european union data protection regulation. Communication Law and Policy 23, 2 (2018), 125-157.
[28] Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., and Zhang, D. Principled evaluation of differentially private algorithms using dpbench. In Proceedings of the 2016 International Conference on Management of Data (New York, NY, USA, 2016), SIGMOD ’16, ACM, pp. 139-154.
[29] Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., Zhang, D., and Bissias, G. Exploring privacy-accuracy tradeoffs using dpcomp. In Proceedings of the 2016 International Conference on Management of Data (New York, NY, USA, 2016), SIGMOD ’16, ACM, pp. 2101-2104.
[30] Hofman, D., Duranti, L., and How, E. Trust in the balance: Data protection laws as tools for privacy and security in the cloud. Algorithms 10, 47 (2017).
[31] Holm, H., and Afridi, K. K. An expert-based investigation of the common vulnerability scoring system. Computers & Security 53 (2015), 18 - 30.
[32] Jeong, Y., and Kim, Y. Privacy concerns on social networking sites: Interplay among posting types, content, and audiences. Computers in Human Behavior 69 (2017), 302 - 310.
[33] Kampanakis, P. Security automation and threat information-sharing options. IEEE Security Privacy 12, 5 (Sept 2014), 42-51.
[34] Kerschbaumer, C., Crouch, L., Ritter, T., and Vyas, T. Can we build a privacy-preserving web browser we all deserve? XRDS 24, 4 (July 2018), 40-44.
[35] Kim, S., and Chung, Y. D. An anonymization protocol for continuous and dynamic privacy-preserving data collection. Future Generation Computer Systems (2017).
[36] Lesk, M. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation (1986), ACM, pp. 24-26.
[37] Li, K., Lin, Z., and Wang, X. An empirical analysis of users’ privacy disclosure behaviors on social network sites. Information & Management 52, 7 (2015), 882 - 891. Novel applications of social media analytics.

[38] Li, N., Li, T., and Venkatasubramanian, S. t-closeness: Privacy beyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on (2007), IEEE, pp. 106-115.
[39] Li, N., Qardaji, W., Su, D., Wu, Y., and Yang, W. Membership privacy: A unifying framework for privacy definitions. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (New York, NY, USA, 2013), CCS ’13, ACM, pp. 889-900.
[40] Linstone, H. A., and Turoff, M. Delphi: A brief look backward and forward. Technological Forecasting and Social Change 78, 9 (2011), 1712-1719.
[41] Liu, Q., and Zhang, Y. Vrss: A new system for rating and scoring vulnerabilities. Computer Communications 34, 3 (2011), 264 - 273. Special Issue of Computer Communications on Information and Future Communication Security.
[42] Liu, Q., Zhang, Y., Kong, Y., and Wu, Q. Improving vrss-based vulnerability prioritization using analytic hierarchy process. Journal of Systems and Software 85, 8 (2012), 1699 - 1708.
[43] Liu, Y., Song, H. H., Bermudez, I., Mislove, A., Baldi, M., and Tongaonkar, A. Identifying personal information in internet traffic. In Proceedings of the 2015 ACM on Conference on Online Social Networks (New York, NY, USA, 2015), COSN ’15, ACM, pp. 59-70.
[44] Machanavajjhala, A., He, X., and Hay, M. Differential privacy in the wild: A tutorial on current practices & open challenges. In Proceedings of the 2017 ACM International Conference on Management of Data (New York, NY, USA, 2017), SIGMOD ’17, ACM, pp. 1727-1730.
[45] Martin, K. D., Borah, A., and Palmatier, R. W. Data privacy: Effects on customer and firm performance. Journal of Marketing 81, 1 (2017), 36-58.
[46] Martin, R., and Christey, S. The software industry’s "clean water act" alternative. IEEE Security Privacy 10, 3 (May 2012), 24-31.
[47] McDermott, Y. Conceptualising the right to data protection in an era of big data. Big Data & Society 4, 1 (2017), 2053951716686994.
[48] MITRE. Common weakness scoring system.
[49] Moro, A., Raganato, A., and Navigli, R. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics 2 (2014), 231-244.
[50] Munaiah, N., and Meneely, A. Vulnerability severity scoring and bounties: Why the disconnect? In Proceedings of the 2nd International Workshop on Software Analytics (New York, NY, USA, 2016), SWAN 2016, ACM, pp. 8-14.

[51] NIST. Vulnerability metrics, 2018.
[52] Ortiz, J., Chang, S.-H., Chih, W.-H., and Wang, C.-H. The contradiction between self-protection and self-presentation on knowledge sharing behavior. Computers in Human Behavior 76 (2017), 406 - 416.
[53] Park, E. H., Kim, J., and Park, Y. S. The role of information security learning and individual factors in disclosing patients’ health information. Computers & Security 65 (2017), 64 - 76.
[54] Pendleton, M., Garcia-Lebron, R., Cho, J.-H., and Xu, S. A survey on systems security metrics. ACM Comput. Surv. 49, 4 (Dec. 2016), 62:1-62:35.
[55] Rasool, A., Tiwari, A., Singla, G., and Khare, N. String matching methodologies: A comparative analysis. REM (Text) 234 5 67, 11 (2012), 3.
[56] Sandra, P. Communication privacy management theory: What do we know about family privacy regulation? Journal of Family Theory & Review 2, 3 (2010), 175-196.
[57] Song, S., Wang, Y., and Chaudhuri, K. Pufferfish privacy mechanisms for correlated data. In Proceedings of the 2017 ACM International Conference on Management of Data (New York, NY, USA, 2017), SIGMOD ’17, ACM, pp. 1291-1306.
[58] Staite, C. Portable secure identity management for software engineering. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2 (New York, NY, USA, 2010), ICSE ’10, ACM, pp. 325-326.
[59] Sweeney, L. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 05 (2002), 557-570.
[60] Tesfay, W. B., Hofmann, P., Nakamura, T., Kiyomoto, S., and Serna, J. Privacyguide: Towards an implementation of the eu gdpr on internet privacy policy evaluation. In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics (New York, NY, USA, 2018), IWSPA ’18, ACM, pp. 15-21.
[61] Torky, M., Meligy, A., and Ibrahim, H. Recognizing fake identities in online social networks based on a finite automaton approach. In 2016 12th International Computer Engineering Conference (ICENCO) (Dec 2016), pp. 1-7.
[62] Tramer, F., Huang, Z., Hubaux, J.-P., and Ayday, E. Differential privacy with bounded priors: Reconciling utility and privacy in genome-wide association studies. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2015), CCS ’15, ACM, pp. 1286-1297.

[63] Tsikerdekis, M., and Zeadally, S. Online deception in social media. Commun. ACM 57, 9 (Sept. 2014), 72-80.
[64] Ufuktepe, E., and Tuglular, T. Estimating software robustness in relation to input validation vulnerabilities using bayesian networks. Software Quality Journal 26, 2 (Jun 2018), 455-489.
[65] van Velthoven, M. H., Mastellos, N., Majeed, A., O’Donoghue, J., and Car, J. Feasibility of extracting data from electronic medical records for research: an international comparative study. BMC Medical Informatics and Decision Making 16 (2016), 10.
[66] Wang, J. A., Guo, M., Wang, H., Xia, M., and Zhou, L. Ontology-based security assessment for software products. In Proceedings of the 5th Annual Workshop on Cyber Security and Information Intelligence Research: Cyber Security and Information Intelligence Challenges and Strategies (New York, NY, USA, 2009), CSIIRW ’09, ACM, pp. 15:1-15:4.
[67] Woudenberg, F. An evaluation of delphi. Technological forecasting and social change 40, 2 (1991), 131-150.
[68] Yaqub, U., Chun, S. A., Atluri, V., and Vaidya, J. Analysis of political discourse on twitter in the context of the 2016 us presidential elections. Government Information Quarterly 34, 4 (2017), 613 - 626.

Full Text


PAVSS: A FRAMEWORK FOR SCORING DATA PRIVACY RISK
by
ZACKARY D. FOREMAN
B.S., University of Colorado Denver, 2017

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science
Computer Science Program
2019


This thesis for the Master of Science Computer Science degree by Zackary D. Foreman has been approved for the Computer Science Program by
Thomas Augustine, Chair
Thomas Augustine, Advisor
Tom Altman, Committee Member
Haadi Jafarian, Committee Member
Date: 19 April 2019


Foreman, Zackary D.
PAVSS: A Framework for scoring data privacy risk
Thesis directed by Thomas Augustine

ABSTRACT

Currently, the collection and use of consumer information from online sources by business entities is guided by the Fair Information Practice Principles set forth by the Federal Trade Commission in the United States. As will be shown throughout this thesis, these guidelines are inadequate, outdated, and provide no protection for consumers. Through the use of information retrieval techniques, it will be shown that social engineering techniques can be used to turn this information against consumers. There exist many techniques that attempt to anonymize the data that is stored and collected. However, what does not exist is a framework capable of evaluating and scoring the effects of this information in the event that a system is compromised. In this thesis a framework for scoring and evaluating data is presented. This framework is created to be used in parallel with currently adopted frameworks that score and evaluate other areas of deficiency within software, as well as by individual users in an effort to maintain a level of control and confidence over their information. We created a framework called the Privacy Assessment Vulnerability Scoring System (PAVSS) that provides a standardized score showing the risk of a privacy invasion an individual takes on. We tested hypotheses regarding types and amounts of direct and indirect personally identifiable information (PII) from a large Twitter data set.

The form and content of this abstract are approved. I recommend its publication.
Approved: Thomas Augustine


TABLE OF CONTENTS

CHAPTER
I. INTRODUCTION
  I.1 Problem Description
  I.2 Structure of Thesis
II. LITERATURE REVIEW
  II.1 Research Questions
  II.2 Current Assessments of Software Deficiencies
    II.2.1 Common Vulnerability Scoring System
    II.2.2 Common Weakness Scoring System
    II.2.3 Analysis of Common Scoring Techniques
  II.3 Privacy Management
    II.3.1 Communication Privacy Management Theory
    II.3.2 Privacy Preserving Algorithms
    II.3.3 Analysis
  II.4 Data Collection Practices and Policies
    II.4.1 Data Collection
    II.4.2 Data Collection Policies
    II.4.3 Text
    II.4.4 Analysis
  II.5 Measurement Techniques
    II.5.1 Overview of Delphi
    II.5.2 Delphi Characteristics
  II.6 Chapter Review
III. HYPOTHESES AND MODEL DESIGN
  III.1 Hypotheses
  III.2 Model Goals
  III.3 Model Design
    III.3.1 Model Design Iteration I
    III.3.2 Delphi Results
    III.3.3 Final Model Design
  III.4 Chapter Review
IV. TESTING
  IV.1 Pretesting Requirements
  IV.2 Testing Procedure
    IV.2.1 First Phase
    IV.2.2 Second Phase
  IV.3 Results
    IV.3.1 Results from Data Set
    IV.3.2 Results from Twitter Data
  IV.4 Challenges
  IV.5 Chapter Review
V. FUTURE RESEARCH
VI. CONCLUSION
REFERENCES


CHAPTER I
INTRODUCTION

There currently are frameworks approved by the cybersecurity community to determine and score the security vulnerability of various parts of a particular piece of software. A single software program may have many different scores of its susceptibility to a security vulnerability [66]. Frameworks also exist for scoring and sharing information on malware, attack patterns, software weaknesses, incidents, indicators of attacks, etc. [33]. These frameworks, when utilized together, allow the cybersecurity community to evaluate risks involved with software in a consistent manner. There is, however, no framework which relates data privacy to other vectors of victimization.

This master's thesis proposes a framework which can be used to evaluate and score the various types of data that may be stored. This framework will be shown to be usable not only by the various entities that store and consume sensitive data, but also by the originator of the sensitive data.

I.1 Problem Description

At any given moment there is more data being generated and stored than was previously available in the history of the internet. This data comes from many different avenues. Of all the information that is being stored, the most concerning is personally valuable information, that is, data that is directly associated with an individual. Stored information can be used for a variety of purposes, from creating a personalized user experience [58] to running analytics on the data for better health services [65]. These stores of personally valuable information also make themselves a target.

According to [12], from 2016 to 2017 there was a 45% increase in the number of data breaches in the United States. In [45], the authors re-establish the fact that an individual is harmed at the moment of a data breach whether or not that data is misused. From 2005 to July 2018 there were a total of 9,215 recorded unauthorized breaches of data systems, exposing a total of 1,104,625,430 records [14]. As previously stated, there are national and commercial frameworks for quantifying the impacts of deficiencies within software. These frameworks provide a consistent, repeatable scoring system which helps share deficiency information among developers. However, are these frameworks enough to protect the end user?

As will be shown, there have been many attempts to maintain privacy but few to quantify privacy vulnerabilities. Companies, individuals, and those privy to sensitive data pertaining to another individual who provide data also choose which information is shared. Often, when this information is stored, the accompanying privacy statements are written in complex legal jargon describing how the data will be used and who will have access to it. The framework that will be described provides a scoring system that allows the data source to easily understand how at risk they are when giving an entity their data. This framework will also be useful for entities wishing to gain sensitive information from individuals, as they will be able to advertise their score showing the privacy risk, whether in the event of a data breach or when data analysis is performed.

I.2 Structure of Thesis

Chapter 2 will serve as background to the problem being researched and solved. In this chapter the reader will become aware of the research questions being asked and the various frameworks and techniques that are currently used to assess the soundness of software. This chapter will also cover the shortcomings of each current technique. The chapter will then conclude by covering current worldwide data protection regulations and acts. Chapter 3 will then state the hypotheses for this thesis along with a model which will demonstrate a framework to provide a possible standardized privacy scoring approach. After a clear understanding of the hypotheses being tested, the goals and design for the model will be defined, showcasing how the model will be used to test the proposed hypotheses. Chapter 4 will then go into the details of the implementation of the model, followed by the proposed testing procedure of the model. After discussing the testing procedure and methods, the results from the test will be discussed. Chapter 5 will discuss potential future research, with Chapter 6 being the conclusion.


CHAPTER II
LITERATURE REVIEW

This chapter will focus on the background information that is pertinent to understanding the design of the model. The first section will define the research questions that have led to the investigation for this thesis. The following section will go into detail on the current most widely used and accepted scoring systems for software deficiencies. This will include uses for scoring systems, where they have been found to fall short, and areas that have been identified for improvement. The next section will show various data collection sources and models for privacy. This will include current common algorithms used for maintaining privacy while performing data analysis. Next we will discuss the current policies for data collection. The final section will be a review of the chapter and a brief discussion of how this will all be used to define and build a framework for scoring data privacy risk.

II.1 Research Questions

As discussed earlier, data breaches occur frequently. They occur within government organizations, businesses, and personal networks. The following research questions will need to be answered in order to create an effective framework for measuring the risk from attaining personally identifying data during a data breach. From [12] it is said that it's not a matter of if a breach will occur, but rather when that breach will occur. This leads to research question 1:

RQ-1: What information is needed to impersonate a person?

Due to data breaches being virtual in nature, the information that a malicious actor would need in order to impersonate someone virtually must be known in order to create a proper model. We see in [63] that levels of deception in online social networking sites rise as the sites become more popular. The ability to deceive a person with false facts is congruent with the ability to deceive a person about the identity of an individual.


The most popular online social networks allow users to create an account using what is described in [24] as weak identities, that is, unverified accounts, often only requiring an e-mail address [8]. This leads to the ability for malicious actors to carry out identity impersonation attacks [24]. In [25], the authors have given access to a free-to-use site where users can check whether their e-mails or other personal information have been leaked such that they are readily found online. This ability has led to the proliferation of a type of social engineering attack called a Sybil attack [61]. In a Sybil attack, a malicious actor impersonates someone and then befriends another under the guise of this false identity. This then allows the malicious actor to gain the trust of the other user and extract useful information. Therefore, it can be deduced that by simply knowing a person's name and their e-mail address it is possible to impersonate someone, thus answering RQ-1.

In [2] it was shown that with only knowing the date of birth and place of birth they could partially determine a person's social security number. From [32], it is stated that 65% of internet users are on some type of online social network, whether it be Facebook, Twitter, Instagram, etc. As answered in RQ-1, all that is needed to impersonate someone is their name and e-mail. With this little information, various types of attacks such as Sybil or Reverse Engineering Attacks [61] are capable of being performed. Therefore, research question 2 is proposed.

RQ-2: What information about a person is more valuable?

A social security number is deemed to be a sensitive authentication device [2]; however, as just stated, it has been shown that it is possible to determine a social security number from just a date of birth and place of birth. This leads to the assumption that there should perhaps be more weight on the value of a birth date. Although it was possible to derive this information, as has been shown in [61, 8, 24], many times when a person is impersonated this is used to then gain more sensitive information from other users. From this it can be deduced that simply knowing a person's name and e-mail address is not necessarily valuable to a malicious actor.

To scope RQ-2, we define valuable information about someone in the same sense that [15] defines private information, in that it is any information that makes a person feel a level of vulnerability. This question will also be covered in the scope of personally identifiable information, defined in [7] as information that can be used to distinguish or trace an individual's identity. The Identity Theft Resource Center tracks data breaches that occur per year. In the 2018 report [13], for nearly every breach the attacker was able to gain personally identifiable information. In [43], personal information is broken into three categories: static vs. dynamic, unique vs. non-unique, and shared vs. distinct. For instance, an individual's name is static in that it most likely won't change; however, it is not a unique identifier. The authors in [43] discuss shared vs. distinct as an element of a piece of information that might be shared across many internet platforms. This can however be modified to state that a distinct piece of personal information that is also static and unique could be an individual's employee number at their current employment, or their student identification number. By using these same three criteria we could answer RQ-2: the most valuable information about a person is a piece of information that is private, can be used to distinguish someone, and falls into the category of being static, unique, and distinct.

RQ-3: What is the minimum amount of information needed to be known about someone in order to find more information?

As in [63, 24], social media users are often deceived by fake users. In [24] it is shown that often the fake accounts are those of real people. Spreading fake information in the name of someone can be damaging, and usually only requires knowing the individual's name and e-mail address in order to create a fake account. This research question specifically seeks to find what long-term damage can be caused by knowing the smallest amount of information about someone. This can actually cause severe long-term damage, either by spreading misinformation while acting as someone with a high status in society, or by stating something discriminatory under the guise of someone else, which could lead to that person losing a job or prevent them from attaining another job.

RQ-4: Do current software scoring metrics for deficiencies take into account the importance of the data at risk of being stolen?

As stated previously, existing frameworks do score the risk of a vulnerability or a software weakness being exploited. However, these scores do not take into account the sensitivity of the data that is at risk of being taken if a breach occurs. It is the author's opinion that not all data breaches are equal.

RQ-5: Do current data collection techniques used in data analysis follow strict enough guidelines for ensuring the anonymity of the source of the information?

Most of the research questions thus far have pertained to data that is gathered from data breaches, yet that is only one avenue of obtaining data about a person. As discussed in [6, 35], there is now an increase in data collection. There needs to be governance, though, to guarantee that the data being collected and used by these individuals and companies does not impose unnecessary risk on the user from whom the data originates.

II.2 Current Assessments of Software Deficiencies

The most commonly used open source scoring systems are the Common Vulnerability Scoring System (CVSS) and the Common Weakness Scoring System (CWSS). From this point forward the Common Vulnerability Scoring System will be referred to as the CVSS, and the Common Weakness Scoring System will be referred to as the CWSS.

II.2.1 Common Vulnerability Scoring System


The CVSS is most commonly used to score a Common Vulnerabilities and Exposures (CVE) entry in the National Vulnerability Database (NVD), which is hosted by the National Institute of Standards and Technology (NIST). As noted in [33], this scoring system is used in multiple standards, and as mentioned in [5], it has been adopted by the United States Department of Defense (DoD), per directive 8500.01.

The CVSS was created and is maintained by the Forum of Incident Response and Security Teams (FIRST) [23]. FIRST describes three invaluable benefits that come from using the CVSS. These consist of:

1. A standardized scoring system that uses a common algorithm across all possibly infected technological platforms.
2. An open source framework that provides an objective, repeatable score rather than a subjective score.
3. The ability to create a list of vulnerabilities which deserve more focus on prevention, allowing the most critical vulnerabilities to be identified.

II.2.1.1 CVSS Framework Metrics

The CVSS contains three metric groups, where each group is composed of several attributes which are used to eventually calculate a score. These metrics are both qualitative and quantitative in nature. They consist of the Base, Temporal, and Environmental metric groups. The CVSS is currently at version 3.0, which will be the version used when discussing the metrics.

The Base metrics are defined as the intrinsic attributes of the vulnerability, or the way the vulnerability behaves. The Temporal metric is used to determine the current state of the vulnerability, for example whether there is a patch for the vulnerability. This metric may change over time, as the state of the vulnerability changes. The final metric, the Environmental metric, allows for the customization of the score for the end user.

Table 2.1: CVSS 3.0 Metric Attributes

Base Metric Group: Exploitability Metrics | Base Metric Group: Impact Metrics | Temporal Metric Group | Environmental Metric Group
Attack Vector | Confidentiality Impact | Exploit Code Maturity | Modified Base Metrics
Attack Complexity | Integrity Impact | Remediation Level | Confidentiality Requirement
Privileges Required | Availability Impact | Report Confidence | Integrity Requirement
User Interaction | | | Availability Requirement
Scope | | |

As can be seen from Table 2.1, the Base metric group contains the most criteria for determining the score. In [31] it is recommended that security professionals assess a vulnerability against the criteria in the Base and Temporal metrics, and that the end user modify the score based on the Environmental metrics. The intent of this is to allow for prioritization when implementing processes and procedures to prevent or mitigate the vulnerability.

II.2.1.2 CVSS Calculation

An overall score is given a value from 0 to 10, where 10 is considered the absolute worst. While calculating a score, a vector string is also produced. The advantage of the vector string is that it allows an end user to view which characteristics of the various metrics a particular vulnerability affects. Many of the constant multipliers seem quite random in many cases, but per [67] represent the results of formula refinement based on case studies for actual vulnerabilities and systems.
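As a rough illustration of how a Base score between 0 and 10 is produced, the following Python sketch implements the Base metric equations of the CVSS v3.0 specification [23] that this section goes on to present. The numeric weights assigned to each metric value come from that specification rather than from this thesis, and the function and dictionary names are illustrative only. The rounding helper is a simplification: it rounds up to one decimal place, as the specification's round-up function does, but plain floating point arithmetic can misround values that sit exactly on a tenth.

```python
import math

# Metric weights as published in the CVSS v3.0 specification [23];
# they are not listed in this thesis.
ATTACK_VECTOR = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
ATTACK_COMPLEXITY = {"L": 0.77, "H": 0.44}
USER_INTERACTION = {"N": 0.85, "R": 0.62}
CIA_IMPACT = {"H": 0.56, "L": 0.22, "N": 0.0}
# The Privileges Required weight depends on whether Scope is changed.
PRIVILEGES_REQUIRED = {
    False: {"N": 0.85, "L": 0.62, "H": 0.27},
    True: {"N": 0.85, "L": 0.68, "H": 0.5},
}


def round_up(value):
    """Round up to one decimal place (simplified, float-based)."""
    return math.ceil(value * 10) / 10


def base_score(av, ac, pr, ui, scope_changed, conf, integ, avail):
    """Compute a CVSS v3.0 Base score from single-letter metric values."""
    # Impact sub-score built from the three impact metrics.
    iss = 1 - (1 - CIA_IMPACT[conf]) * (1 - CIA_IMPACT[integ]) * (1 - CIA_IMPACT[avail])
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (
        8.22
        * ATTACK_VECTOR[av]
        * ATTACK_COMPLEXITY[ac]
        * PRIVILEGES_REQUIRED[scope_changed][pr]
        * USER_INTERACTION[ui]
    )
    # A non-positive Impact forces the whole Base score to zero.
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return round_up(min(total, 10))


# Network-reachable, low-complexity flaw with high impact on all three properties.
print(base_score("N", "L", "N", "N", False, "H", "H", "H"))  # 9.8
```

The example vector corresponds to AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H. Setting all three impact metrics to "N" yields a score of zero regardless of how easy the vulnerability is to exploit.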


The Base metric group score calculation is based on the outcome of the Scope metric. If Scope is considered unchanged then the Base metric group calculation is as follows:

\[ \text{Base} = \lceil \min[(\text{Exploitability} + \text{Impact}),\ 10] \rceil \tag{2.1} \]

If the Scope metric is considered changed then the Base metric group calculation is:

\[ \text{Base} = \lceil \min[1.08 \times (\text{Exploitability} + \text{Impact}),\ 10] \rceil \tag{2.2} \]

However, if the Impact metric score is less than or equal to zero, then the overall Base metric group is given a score of zero. To calculate the Impact metric, if the Scope metric is considered Unchanged then the formula is:

\[ \text{Impact} = 6.42 \times \text{ImpactSubScore} \tag{2.3} \]

If the Scope metric is considered Changed then the formula becomes:

\[ \text{Impact} = 7.52 \times [\text{ImpactSubScore} - 0.029] - 3.25 \times [\text{ImpactSubScore} - 0.02]^{15} \tag{2.4} \]

Where the formula for determining the ImpactSubScore is:

\[ \text{ImpactSubScore} = 1 - [(1 - \text{Impact}_{\text{Conf}}) \times (1 - \text{Impact}_{\text{Integ}}) \times (1 - \text{Impact}_{\text{Avail}})] \tag{2.5} \]

With the Exploitability formula as:

\[ \text{Exploitability} = 8.22 \times \text{AttVector} \times \text{AttComplexity} \times \text{PrivRequired} \times \text{UserInteraction} \tag{2.6} \]

As can be observed, each submetric of the Base metric group contains various sub-values. These values can be found in [23], along with how to calculate the scores for the Temporal and Environmental metric groups. When a CVE entry in the NVD contains a CVSS score, this score is only the Base metric, excluding the Temporal metric [51].

II.2.2 Common Weakness Scoring System

The CWSS is most often used to score an entry in the Common Weakness Enumeration (CWE) database. This database is provided and maintained by the MITRE Corporation [48]. Whereas the CVSS is used to score a particular vulnerability, for instance a particular buffer overflow event in a given piece of software, the CWSS is used to score the risk of a weakness being exploited, for instance a buffer overflow being exploited within software as a whole. This allows organizations to create a priority list of weaknesses to monitor and prevent while developing and maintaining software [46]. To clarify, the CWSS is best utilized while developing and maintaining software, and the CVSS is best utilized by those whose responsibilities are to ensure that applications running on a system are not open vectors for attacks.

Like the CVSS, the CWSS boasts three main advantages. These consist of:

1. A measurable value of a weakness that could be present in software and has not been caught.
2. A common framework that allows organizations to make a list of the most critical weaknesses that need to be fixed within software.
3. The ability to customize the most critical weaknesses, since the needs may differ from organization to organization.

II.2.2.1 CWSS Framework Metrics

The CWSS is comprised of three different metric groups: the Base Finding, Attack Surface, and Environmental metric groups. Each of these groups breaks down into several other metrics, known as factors [48]. The most current version of the CWSS is 1.0.1, which will be used for discussing the metrics and scoring.

The Base Finding metric is used to show the associated risks of the weakness, the assurance that this risk is warranted, and the effectiveness of the controls used to prevent this weakness. The Attack Surface metric shows what the attacker must be able to overcome in order to exploit this weakness. The mere presence of a weakness does not imply that the weakness will be used. The final metric group, the Environmental group, is similar to the Environmental group in the CVSS, in that it allows the end user to customize the score for their use.

Table 2.2: CWSS 1.0.1 Metric Factors

Base Finding | Attack Surface | Environmental
Technical Impact (TI) | Required Privilege (RP) | Business Impact (BI)
Acquired Privilege (AP) | Required Privilege Layer (RPL) | Likelihood of Discovery (LD)
Acquired Privilege Layer (APL) | Access Vector (AV) | Likelihood of Exploit (LE)
Internal Control Effectiveness (ICE) | Authentication Strength (AS) | External Control Effectiveness (ECE)
Finding Confidence (FC) | Level of Interaction (LI) | Prevalence (P)
 | Deployment Scope (DS) |

Table 2.2 shows the various factors that make up the CWSS. The three metric group scores are all calculated independently, then finally multiplied together. This in turn returns a score from 0 to 100.

II.2.2.2 CWSS Calculation

As mentioned previously, the overall score returned by the CWSS ranges from 0 to 100, and is calculated by multiplying the sub-scores together. The formula is as follows:

\[ \text{BaseFinding} \times \text{AttackSurface} \times \text{Environmental} \tag{2.7} \]

The Base Finding metric score is calculated as follows:

\[ [(10 \times TI + 5 \times (AP + APL) + 5 \times FC) \times f(TI) \times ICE] \times 4 \tag{2.8} \]

Where f(TI) is zero if TI = 0, otherwise f(TI) = 1. This is similar to how the CVSS handles a negative or zero score of the Impact metric; in such a case the Base Finding metric score becomes zero. The Attack Surface formula is then as follows:

\[ [20 \times (RP + RPL + AV) + 20 \times DS + 15 \times LI + 5 \times AS] \div 100 \tag{2.9} \]

The values composing the Attack Surface score will range in value between 0 and 100, which is why the division by 100 is necessary [48]. The Environmental formula is then:

\[ [(10 \times BI + 3 \times LD + 4 \times LE + 3 \times P) \times f(BI) \times ECE] \div 20 \tag{2.10} \]

Where the value for f(BI) is calculated similarly to that of f(TI): if BI = 0 then f(BI) = 0, else f(BI) = 1. As can be observed, the BI score influences the overall Environmental score.

The overall score for each individual component of the Base Finding, Attack Surface, and Environmental scores is covered in [48] along with use case examples.

II.2.3 Analysis of Common Scoring Techniques

As can be seen above, neither of the most common scoring systems factors into the overall score the sensitivity of the data at risk of being stolen in the event of a data breach. Due to the widespread adoption of the CVSS framework there have been many critiques of the framework. In [66], a new method for scoring vulnerabilities is proposed, building off previous scoring systems. Another method to improve the CVSS called VRSS [41] and its improved methodology [42] do not factor in the value of the data that is at risk of being stolen. There have been methods utilizing machine learning, as in [11], to better predict whether a vulnerability will be exploited, yet these do not take into account the data that could be accessed. Likewise, in the comprehensive survey provided in [54], there is no mention of the value of the data being protected. When analyzing the relation between severity scores and bounties in [50], the economic incentive is analyzed, as in how much would be paid to find a vulnerability, yet this incentive is not driven by the underlying data at risk.

The CWSS does not have as many critiques, and has been shown in connection with the CWE to be effective in preventing and helping defend against attacks [20, 64]. However, as in the case of the CVSS framework, the CWSS framework does not factor in the value of the data. From this analysis we can answer RQ-4: current software scoring metrics for deficiencies do not take into account the importance of the data at risk of being stolen.

A common critique of both scores, however, concerns their seemingly random and unexplained constants. In [31], various experts in software vulnerabilities were asked to calculate a value for the CVSS, and much discrepancy was found between the values. The authors then list possible solutions and modifications to these score calculations based on the evaluation of the experts used in the study.

II.3 Privacy Management

Each day there is more data being created and stored. In [30] it is stated that there will be over 10 zettabytes of data stored on cloud computing infrastructures by 2019. With all this data, there exist many algorithms proposed in order to prevent the malicious use of this data. In order to understand the effectiveness of these algorithms it's necessary to understand the underlying theory of privacy management. There are two perspectives from which to view data privacy management: one is that of the user whose data is being collected and analyzed, and the other is that of the entity that is storing or analyzing the data. This section will discuss these two viewpoints on privacy management.

II.3.1 Communication Privacy Management Theory

Communication Privacy Management (CPM) theory views privacy from the perspective of the user. CPM theory explores the balance that users attempt to strike between maintaining private information and being open with others. As described in [56], this is understanding the relationship that privacy plays in sharing information. CPM theory is defined by 5 important principles. As listed in [56], the principles consist of:

1. Ownership of information.
2. Control of the information.
3. Privacy rules that provide regulation.
4. Guardianship or co-owning another person's data.
5. Breakdown of regulations for privacy.

In [16], the focus is on the first two principles listed. In reference to principle one, the ownership of the data, [16] tested the level of concern when user information was collected. The results found the level of concern was related to the scope of the collection, varying between local and global collection. The results also showed that the general concern about collection of the data was more related with individuals who already were concerned with privacy. This is also supported by results found in [15]. However, in [37], the authors split the data that users were willing to share with others into two separate categories, where certain information was considered more sensitive than other information. The results found that information that was classified as not very sensitive, such as hobbies, interests, and lifestyle, was shared more willingly than sensitive information.


While testing the second principle, control of the information, it was found that the level of concern when giving up control of information was related to how much general concern the individual initially had.

II.3.2 Privacy Preserving Algorithms

The moment sensitive information is stored, whether it comes from a user of a service who is attempting to form a relationship in an online social network [15, 37, 16], or to use a service like a search engine [3], a web browser [34], or any other user-personalized service, this information is under the trust of another party. Whether the data is being stored for a better user experience or for performing data analysis, there has been much research into data privacy preserving algorithms. The primary focus of algorithms of this nature is to allow data analytics to be performed on sensitive data while assuring the sources of the data that they will not be identifiable [44]. At this time the more common algorithms are differential privacy [22], k-anonymity [59], t-closeness [38], and membership privacy [39].

II.3.2.1 k-anonymity

The authors of k-anonymity [59] state that the goal of the algorithm is not to prevent access to the information. The goal is rather to release the information such that identifying a person as a source of the information is not possible. This particular algorithm is useful in protecting against disclosures that identify individuals directly from the data released. The algorithm defines what are called quasi-identifiers, which are attributes that when combined together could lead to the identification of the source [59, 38, 35]. Yet as shown in [38, 6], this does not protect against the release of information that is an attribute of an individual. If an adversary already has information about a person, then they are capable of relating the data to the source.
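As an illustration of the quasi-identifier idea described above, the following is a minimal sketch of a k-anonymity check. It assumes that a release is represented as a list of dictionaries and that the quasi-identifier columns are already generalized; the function name `is_k_anonymous`, the column names, and the sample records are hypothetical and are not taken from [59].

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records in the release."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical, already-generalized release (illustrative values only).
records = [
    {"zip": "802**", "age": "20-29", "condition": "flu"},
    {"zip": "802**", "age": "20-29", "condition": "asthma"},
    {"zip": "803**", "age": "30-39", "condition": "flu"},
    {"zip": "803**", "age": "30-39", "condition": "flu"},
]

print(is_k_anonymous(records, ["zip", "age"], k=2))
print(is_k_anonymous(records, ["zip", "age"], k=3))
```

In this sample the second group satisfies k = 2 but both of its records carry the same condition, so an adversary who already knows that a person belongs to that group still learns the attribute. This is the kind of attribute disclosure that k-anonymity alone does not prevent.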


II.3.2.2 t-closeness

The t-closeness algorithm is an attempt at preventing an adversary, who may already have much information on an individual, from being able to further link future data releases to that individual. The authors of t-closeness [38] require that the distance between data sets not exceed a threshold t [38, 35]. However, as discussed in [35], t-closeness can only be performed once the data has already been collected. This does not help individuals in the event of a data breach.

II.3.2.3 Membership Privacy

Membership privacy differs from the k-anonymity and t-closeness algorithms in that the latter must be performed after the data has been collected. In the Membership Privacy framework defined in [39], it is assumed that an adversary already knows every characteristic of a given individual. To expand further, the authors state that in the event of data being released, whether published or through a data breach, if a single person is able to be identified from this data, then a privacy breach has occurred.

II.3.2.4 Differential Privacy

Differential privacy, defined in [22], has become the standard for maintaining privacy while allowing data analysis to be performed [62, 29, 44, 57, 28]. Differential privacy ensures that a data set is statistically indistinguishable from a similar data set from which a record was removed [28]. The US Census Bureau has even declared that it will be using differential privacy during the 2020 United States census [1]. A variation of this framework called local differential privacy has even been adopted by large companies such as Google and Microsoft [17].

Differential privacy, however, is an umbrella term under which several different algorithms fall. The number of algorithms that could be considered differential privacy has led to the research in [28], where it is argued that due to the number of algorithms available it would be an undue burden for someone to try to determine which algorithm best fits their scenario.

II.3.3 Analysis

There are two ways of viewing who has control of valuable data: one from the perspective of the individual, as seen in section II.3.1, and the other from that of the person performing the analysis of data and publishing the information using one of the many algorithms discussed in section II.3.2. For storing data it is assumed that this criterion would be met by following guidelines for ensuring proper software configuration. There is no one-size-fits-all technique for preserving privacy, and it is a multifaceted problem. However, the process taken in [28], where various differential privacy techniques are analyzed, can be performed, thus conveying to the end user how much risk they are taking in using whatever platform they choose.

II.4 Data Collection Practices and Policies

In section II.1 the focus of the discussion was on how data is obtained by malicious users. The focus of this section will be on how these troves of data get created initially. The first area to be observed is how various companies collect data and the standards they should follow in order to maintain the privacy of data. The following area of discussion will show different policies from the United States and Europe regarding the collection and use of personal data. The last topic to be covered in this area is the challenges that are often encountered by users and companies alike regarding the use and collection of personal data.

II.4.1 Data Collection

Entities can obtain personally valuable information from various sources. In [64], it is stated that terabytes of user-made content are created every minute on online social networks. This data comes from both traditional personal computers and mobile technologies. The reasons users decide to disclose such sensitive information on these platforms are outside the scope of this paper and can be read in more detail in [64, 21].

II.4.1.1 Data Sources

As can be seen in [16, 61, 8, 24, 63, 15, 32, 68], the use of online social media plays a major role in day-to-day life. The use of social media ranges from political discourse [68] to gaining employment and maintaining personal relationships [37]. In some cases, third-party content creators on these online social networks have the ability to gain access to personal information [16, 15]. Online social networks often offer access to their platforms in exchange for the ability to use the data created by users. In [43], it was identified that due to the success of online social networks, many other companies have taken to the monetization of user data through the use of their online products. This can further be examined through the use of mobile phone applications and the data they collect, as discussed in [26]. In [3] the authors identified search engines retaining data of users for a more personalized experience, and [34] discusses how web browsers are able to collect user information.

The use of electronic data is not limited to these sources, as in [65], which targeted the limitations of accessing and using medical data from various countries, with the argument of the benefits that could be obtained by allowing access to this information. Also, a relatively new avenue for data sources comes from the Internet of Things, which many times requires that the user be identified more accurately than by just supplying an e-mail address [4].

II.4.2 Data Collection Policies

Data collection laws vary depending on the country. For the purpose of this master's thesis, the data collection policies of the United States and the European Union will be compared.


II.4.2.1 European Union

The European Union has recently enacted the General Data Protection Regulation (GDPR). These new regulations declare that it is now a right of a person to have protection of their personal data [47]. The GDPR also acknowledges that when anonymization algorithms have been performed on the data, it places more burden on the collector to guarantee to the user that their data will remain anonymous [30]. The European Union has found that only six countries outside of the European Union have privacy laws that meet the expectations of the GDPR [27].

II.4.2.2 United States

Within the United States there are both Federal and various State laws impacting the collection and protection of data [10]. For consumers the enforcing body is the Federal Trade Commission (FTC), a legal body that has been tasked by the United States Congress with enforcing user data policies. These data policies, however, are mostly concerned with unfair or deceptive practices. There have been guidelines offered by the FTC for companies to follow to ensure data privacy for consumers; however, these are just guidelines and not law.

In the United States the most common data protection policy is the Health Insurance Portability and Accountability Act (HIPAA), which is handled by the Office for Civil Rights. However, this policy is only directed towards health records and how they must be handled.

For general data there are not many regulations set forth in the United States. In [53], the discussion between the European Union and the United States is presented for the use of cloud computing technologies for Government purposes. In terms of United States policies on cloud computing for data storage, providers are directed to the NIST Cloud Computing Security Requirements. For a complete and comprehensive list of federal data protection policies in the United States see [10].


II.4.2.3 Challenges

As identified in [60], reading and understanding the privacy policy statements that are usually presented to individuals is difficult. This inherently makes it difficult for users to determine if they should accept the privacy policy. An even more pressing challenge is that using a service that collects data is becoming nearly a normal day-to-day event. The risk of sharing information that may be compromising to the individual is sometimes related to the person's personality, as discussed in [21]. In [52], the authors found that many times users have forgone any control over discussing valuable personal information on social media platforms.

II.4.3 Text

It might be assumed that if data is anonymized, it becomes impossible to then re-identify the individual whose private information has been leaked. However, the standard for data exploitation is not to rely on one source of data, but to aggregate multiple sources of data to form a strong vulnerability. As stated in the previous sections, the amount of data that is present on cloud-based servers is in the zettabytes. Aggregating all this data has now become easier than before, owing to the amount of data as well as advances in text processing.

Although it would not be feasible for a malicious user to closely analyze every text that they come across, they can first perform regular expression checks, or string matching, on a block of text to see if the text contains any words or phrases that are associated with privacy-divulging comments or sentences. The Aho-Corasick algorithm is a popular multi-string matching algorithm that is still used in many applications today. This algorithm utilizes a finite state automaton (FSA), where every state is a single letter of the word to be found [55].

Although matching a word in a block of text may be an indicator of a potential privacy-divulging comment or sentence, this is not enough for one to simply automatically crawl through the amount of information. This is due to the ambiguity of the various meanings that a single word in a sentence can possess. However, methods of Natural Language Processing have been refined to disambiguate these ambiguous meanings. One particularly useful method is Word Sense Disambiguation, which attempts to disambiguate the meaning of a word based upon the other words used in the same comment or sentence. For the remainder of this thesis we will use the abbreviation WSD for Word Sense Disambiguation.

WSD is the process of computationally identifying the meaning of words in the context in which they are used [36]. This is a method for removing lexical ambiguity, where lexical ambiguity is the ambiguity that arises when the words used within a sentence can have multiple meanings [19]. The original Lesk algorithm, defined in [36], was shown in [49] to be between 50% and 70% accurate.

In our testing methodologies, these two methods, string matching and word sense disambiguation, will be utilized.

II.4.4 Analysis

For the originator as the source of the data, the European Union takes a harder stance on protections than the United States. From section II.4.1 and section II.3, RQ-5 can be answered. For RQ-5, whether data analysis follows strict enough guidelines for ensuring the anonymity of the source of the information, the answer is dependent upon where the analysis is being performed and on which data. If the analysis is performed inside the European Union, then the analyst must follow the GDPR. If the analyst is in the United States and the information is outside the realm of Personal Identifiable Information as defined in the HIPAA regulation, then it is left to the discretion of the analyst. The regulations in the United States have not kept pace with the rise in data analytics nor the rise in data breaches.
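The multi-string matching step described in section II.4.3 can be illustrated with a minimal sketch of the Aho-Corasick automaton. This is a textbook-style construction written for illustration, not the implementation used in this thesis; the function names `build_automaton` and `find_keywords` and the keyword list are hypothetical. States correspond to keyword prefixes, each transition consumes one letter, and failure links let the scan continue without re-reading the text.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto, failure, and output tables for a set of keywords."""
    goto, fail, out = [{}], [0], [[]]      # state 0 is the root
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: a failure link points to the longest proper
    # suffix of the current prefix that is also a prefix of some keyword.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    """Return (start_index, keyword) pairs for every match in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


# Hypothetical keyword list for flagging potentially privacy-divulging text.
automaton = build_automaton(["birthday", "email", "my address"])
print(find_keywords("happy birthday to me, email me".lower(), automaton))
```

A match found this way only flags a candidate comment; deciding whether the matched word is used in a privacy-divulging sense is the role of the WSD step.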


II.5 Measurement Techniques

As shown in section II.2, the current methods of scoring deficiencies in software use various equations to calculate a score for measuring the risk involved. Since we have not found any previous literature on measuring privacy risk, we will utilize the Delphi Method for creating the equations in order to determine the associated scores and risks for the various metrics.

II.5.1 Overview of Delphi

The Delphi method was originally developed and utilized by the RAND Corporation [18, 40]. The original use of the Delphi Method was to gather expert opinions for determining policies [67, 18].

II.5.2 Delphi Characteristics

Although there are many different variations of the Delphi Method, all maintain the following criteria:

- Selection of Experts
- Anonymity
- Feedback

Firstly, a group of experts is selected who are considered very knowledgeable in a given topic. The next step is to send each expert a questionnaire. Once the questionnaire is returned and the results are analyzed, the questionnaire is returned to the experts or re-evaluated. If an expert's original response differs from the consensus of the other experts, feedback is given so as to allow the expert to re-evaluate their answer. This feedback is completely anonymous; this way the expert is not under pressure to conform to the responses of the others, in an attempt to create an unbiased judgment.
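The feedback step described above can be sketched as follows. This is only an illustration under stated assumptions: the consensus is taken to be the median rating, an expert is asked to re-evaluate when their rating differs from it by more than a chosen tolerance, and experts are referred to by anonymized identifiers. None of these choices (nor the names `delphi_feedback` and `tolerance`) is prescribed by the Delphi literature cited here.

```python
from statistics import median


def delphi_feedback(responses, tolerance=1):
    """Return the consensus rating and the anonymized experts whose
    rating differs from it by more than `tolerance`."""
    consensus = median(responses.values())
    to_revisit = {
        expert_id: rating
        for expert_id, rating in responses.items()
        if abs(rating - consensus) > tolerance
    }
    return consensus, to_revisit


# Hypothetical first-round ratings on a 1-7 scale.
round_one = {"expert_a": 5, "expert_b": 6, "expert_c": 2, "expert_d": 5}
consensus, to_revisit = delphi_feedback(round_one)
print(consensus, to_revisit)
```

Only the consensus value would be fed back to the flagged experts, never the identities behind the other ratings, which keeps the round anonymous.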


II.6 Chapter Review

In this chapter the various degrees to which an individual could be affected by the exposure of their data have been analyzed. Also, different frameworks for determining software deficiencies have been reviewed. What can be surmised is that this is a complex area to navigate, and with the growing trend of utilizing online services that collect data, it has become an excessive burden on individuals to work out how a misuse of their information may occur. We have credit scores to judge a person's creditworthiness, ratings from the Better Business Bureau for determining the trustworthiness of a business, and frameworks for assessing the soundness of software, as seen in section II.2. A comparable framework should determine an individual's risk of their privacy being breached, and should measure this breach not only in terms of how the data is being handled and stored, but also in terms of how the user may be affected on an interpersonal level. This is the framework that is posited in this thesis, and it will be the subject of the next chapter.


CHAPTER III

HYPOTHESES AND MODEL DESIGN

III.1 Hypotheses

In this section the hypotheses that the model will attempt to answer will be listed and discussed. The hypotheses listed are the product of the research questions outlined in Chapter 2 along with feedback from the experts while initiating and performing the survey.

H1: Those accounts that have 1 piece of direct PII available will also have a minimum of 2 pieces of indirect PII available.

Hypothesis 1 (H1), listed above, posits that individuals who share a single piece of direct PII on a social site will also be more likely to divulge indirect PII as well. We believe that these individuals will release a minimum of two pieces of indirect PII if they also release a single piece of direct PII. In section III.3.2 and section III.3.3 we will discuss the value of indirect PII relative to that of direct PII.

H2: Users that give their birthday or allow it to be determined are more likely to have more pieces of Indirect or Direct PII than users that do not give their birthday.

In our second hypothesis (H2), we narrow the scope to those we believe to be more relaxed about privacy risk than others. The hypothesis is a measurement of those whose birthday is able to be found against those whose birthdays are not found. In our model, which will be presented in section III.3.3, the individuals who fall into the set of users in H2 will have a higher score than those who do not.

H3: Users that have their e-mail addresses easily accessible are more likely to divulge more PII, and therefore have a higher score from our model, than the set of users from H2.


Upon feedback from the various experts that were surveyed, it was found that one of the most essential pieces of information that a malicious user can obtain about their target is the target's e-mail. From this we have formulated hypothesis 3 (H3). We also believe the users of H3 will have a higher score in our model than those whose e-mail we cannot find. As shown in Figure 3.1, the set of users who classify into H3 would be a subset of the users from H1.

Figure 3.1: H1 Users to H3 Users

We focus our attention on users with publicly facing birthdays and e-mail addresses due to the responses to the survey from experts, which are discussed in section III.3.2. These two pieces of PII in particular are, to us, gateway indicators: once these pieces of PII are obtained, gaining other critical pieces of PII becomes easier.

III.2 Model Goals

In this section the goals of the model will be discussed and highlighted. A primary driver of this research was the total number of privacy breaches that are reported each year, along with the recent reports of social media organizations reportedly selling user data. Through the research discussed in Chapter 2, it can be seen that there is a definite need for a tool that users can use to assess their own vulnerability to a privacy breach.

A major hurdle is to place a quantitative value on an intangible object. Off the top of their heads, most people could state what they personally feel is private, as can be seen in the section pertaining to CPM. To overcome this hurdle we will be utilizing the Delphi method. The Delphi method has been used in the literature as well as in real-world applications to find quantitative values for objects that were previously not valued or ranked. The Delphi method relies on input from individuals who would be considered experts in the given area of focus.

Based upon the research done for Chapter 2, when analyzing the various existing scoring systems for other cybersecurity-related issues, it was observed that finding a community-accepted model was difficult if the model was custom-made. Therefore, our model will bear a resemblance to already defined cybersecurity industry standards. By using this foundation, along with the results from the Delphi method, to build our model, we will be able to test and present solutions to our hypotheses. We will also be able to show how this model will be a useful tool for users of online accounts.

III.3 Model Design

In this section we will discuss the actual model and the steps involved in the creation of the model. There were various iterations of the model. The first iteration used knowledge from the literature review and our hypotheses to build the model. Next, a survey was created and sent off for experts to answer. After the experts had answered and their results were calculated, a final iteration of the model was designed.

III.3.1 Model Design Iteration I


As a base for the model, we are utilizing the format of the CVSS and CWSS, as discussed in Sections II.2.1 and II.2.2 respectively. We focused more on the CVSS, since this is the most commonly used and most widely known structure for giving a qualitative value to vulnerabilities in software. As shown in Section II.2.1, the CVSS gives an overall score of zero to ten; therefore for our model we wish to produce a score within the same range. A ten would indicate that by using a service there is a critically high likelihood of being involved in a breach of privacy, and a zero would indicate that there is a very low, if not non-existent, threat of succumbing to a privacy breach.

\[
\text{Total\_Score} = \left\lceil \min\left[ (\text{Exploitability} + \text{Accessibility} + \text{Harm}),\ 10 \right] \right\rceil \tag{1}
\]

We have determined that the likelihood of being involved in a privacy breach falls under three distinct metrics: the Exploitability, Accessibility, and Harm metrics. As with the CVSS and CWSS, each of these primary metrics is further broken down into individual components that are then used for this final calculation. Table 3.1 shows each individual component of the various metrics that will be discussed.

Table 3.1: PAVSS 0.9 Metric Attributes

Exploitability Metric        | Accessibility Metrics          | Harm Metrics
-----------------------------|--------------------------------|----------------------------------
Direct PII Metric (DP)       | Data Release Policy (DR)       | Professional Attack (HP)
Indirect PII Metric (IP)     | Public Information Metric (PI) | Financial Attack (HF)
Exploit Value Metric (EV)    | Associativity Metric (AA)      | Personal Expectation Metric (HPE)
Data Expectation Metric (DE) |                                |
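The Total_Score equation above can be sketched in a few lines. The function name `total_score` and the sample inputs are hypothetical; the three arguments stand for the Exploitability, Accessibility, and Harm values, whose own formulas are given in the remainder of this section.

```python
import math


def total_score(exploitability, accessibility, harm):
    """Sum the three metric scores, cap the sum at 10, and round up."""
    return math.ceil(min(exploitability + accessibility + harm, 10))


# Illustrative inputs only; they are not results from the thesis.
print(total_score(3.5, 0.8, 0.4))   # sum below the cap, rounded up
print(total_score(9.0, 1.2, 0.6))   # sum above the cap, limited to 10
```

Because the ceiling is applied after the cap, any fractional sum is rounded up to the next whole score, and the result always stays within the zero-to-ten range used by the CVSS.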


The first metric to be examined is the Exploitability Metric. From our research we have determined that this metric holds the most weight in determining whether an individual would be involved in a privacy breach.

\[
\text{Exploitability} = DE \times EV \times (DP + IP \times 0.5) \tag{2}
\]

The equation above was the first iteration of what we determined to be the value for the overall score of the Exploitability metric. Our assumption was that information that would qualify as indirect personally identifiable information is actually only half as valuable as information that would qualify as direct personally identifiable information. This sum is then multiplied by the overall Exploit Value, and the result is then multiplied by the Data Expectation to privacy. Both of these metrics are defined in the paragraph below.

The Exploit Value is defined as a measure of what further exploits could be performed given this information. This is a value given to information that, when combined, could create avenues for gathering more information. The data expectation to privacy metric, on the other hand, is used to bring the overall value either higher or lower depending on the service being used. We felt this was important in order to give weight to the fact that we are not saying a given service is bad or good; that is outside the scope of this model. For instance, when a user uses a social media service, they should be under the assumption that the information provided is something that the general public could see; therefore the data has a generally low expectation to privacy. Whereas if using an online banking service, one would assume this information is kept private and secure, thus indicating a high level of expectation for privacy.

\[
\text{Accessibility} = 0.2 \times (DR + PI + AA) \tag{3}
\]


The equation listed above indicates how the accessibility metric is assumed to be calculated. As can be observed, the sum of the values is multiplied by 0.2. This is based upon the assumption that the user has little to no control over these indicators, though they should be made aware of them. Thus, these indicators should not have as much of an influence on the total score. The Data Release value is a score based upon whether, by using a service, the data will be released to a third party and which privacy preserving algorithm is being used to release this information. Using no privacy preserving algorithm would create a high value, and utilizing a method for privacy preserving would indicate a lower value. Privacy preserving algorithms were discussed in Section II.3.2. The Public Information metric is given a constant value. This is either a value of 1 if the information could be obtained through public information, or 0 if not.

\[
\text{Associativity} = \text{Total\_Pieces\_Info} \times 0.25 \tag{4}
\]

Associativity is a value that indicates the likelihood of a privacy breach given that the information is all in one place. The total pieces of information is a tally of the number of pieces of direct and indirect personally identifiable information that have been stored for access in a single location.

\[
\text{Harm} = 0.2 \times (HP + HF) \times HPE \tag{5}
\]

Finally, the Harm metric is listed above. For the purpose of this thesis, we are evaluating whether the user would be susceptible to either a Sybil or Financial attack if the given information was released. In the nature of being modular, many other areas of attack could be added later. For this reason, both Sybil and Financial will be given a value of either 1 if an attack of this type is possible with the given information, or 0 if an attack is not possible. This value is then multiplied by the personal expectation to privacy. This value would be given by a user to indicate how they would personally feel if involved in a privacy breach. Overall this score is then multiplied by 0.2, indicating again that the user has little control over this.

III.3.2 Delphi Results

III.3.2.1 Survey Background Information

Drawing from the research performed in the literature review for Chapter 2 and our first iteration of the model design, we were able to create a survey in which we asked cybersecurity professionals to provide their professional expertise. As per the discussion of the Delphi method in section II.5.2, we maintained anonymity for the professionals filling out the survey. We did, however, ask these individuals to share their years of experience as well as their specific area of expertise. As shown in Table 3.2, over half of the participants had more than 10 years of experience. In Table 3.3 it can be observed that a quarter of the participants had experience in privacy assessment, and over three quarters had experience in cybersecurity compliance. This expertise, combined with the participants that had penetration testing experience, allowed for a wide variety of individuals to be questioned.

Table 3.2: Survey Profile

Percentage of Participants | Years Experience
---------------------------|-----------------
33.0%                      | +15
22.0%                      | 10-15
38.9%                      | 5-10
5.6%                       | 1-5

This variety of expertise was essential to ensure that feedback from the survey was well represented from the perspective of one who would try to protect sensitive data as well as one who would attempt to exploit personal information. The questions presented to the individuals covered various scenario-based situations, as well as which piece of information would be considered more valuable.


Table 3.3: Participants Profession

Percentage of Participants | Profession
---------------------------|--------------------------
27.8%                      | Privacy Assessment
66.7%                      | Vulnerability Assessment
77.8%                      | Cyber Security Compliance
38.9%                      | Penetration Testing

Following the example in [37], where the authors divided PII, the experts were shown what constitutes direct PII and what is indirect PII; these values can be viewed in Table 3.4. We also defined what is determined to be publicly available information; these values can be found in Table 3.5. A final source of identifiable information that was identified and then shown to the experts was information that could be found through a Freedom of Information Act (FOIA) request; this information can be seen in Table 3.6.

Table 3.4: Personal Identifiable Information

Direct PII       | Indirect PII
-----------------|---------------------
SSN              | Gender
Telephone Number | Race
Email Address    | Birth-date
Medical Records  | Geographic Indicator
Name             | Online Username
Address          |

Table 3.5: Publicly Available Information

- Name
- Address
- Voter Affiliation
- Prior Criminal Trouble
- Business Ownership
- Marriage Certificates
- Death Certificates
- Mortgages

We defined Exploitable, Accessibility, and Harm to ensure that each individual completing the survey was answering questions with the same frame of reference as the others, as a way to prevent the questions from being ambiguous. The definition of each term is listed in Table 3.7.

Table 3.6: Freedom of Information Act Request Information

- Federal Employment Qualifications
- Degrees
- Technical Training
- Government Employment
- Professional Group Membership
- Awards and Honors

Table 3.7: Definitions of terms to survey participants

Term          | Definition
--------------|-----------------------------------------------------------------
Exploitable   | Could you exploit a person with this information? I.e., would you be able to use this information to impersonate this person? Does this allow you to get more information?
Accessibility | How accessible is this information?
Harm          | A quantitative value for damage done in the event a person's data has been exploited.

The participants were given scenario-based questions. These questions were presented as if the participants were about to perform either a professional cyber-attack or a financial cyber-attack on a given individual, should they display a particular piece of direct or indirect PII. They were asked how exploitable this information would be, given they were attempting to perform either of the two previously listed attacks. We also gave the participants a list of indirect or direct PII and asked which new piece of information they would try to obtain in order to make the attack more effective. At the end of the survey the participants were given two open-ended questions, the first asking the participants what, in their professional opinion, are the most critical factors for exploiting private information. The second question asked what the most crucial areas are for securing personal information and preventing its exposure.

III.3.2.2 Survey Results

The participants of the survey all answered the questions with relative similarity, which helped to strengthen our initial model design. There were some intriguing answers; however, these answers were similar among all participants, and as such they were not dismissed but further analyzed. We will now discuss the results from the survey and the most thought-provoking inputs from the participants.


The questions asking how exploitable or harmful a given scenario is with given PII were to be rated on a scale from one to seven, where one was not exploitable or harmful and seven was very exploitable or harmful. We then generalized the results as listed below in Table 3.8.

Table 3.8: Ranges Based Upon Expert Input

Low Percentage    | 0-42.85%
Medium Percentage | 42.85%-71.42%
High Percentage   | 71.42%-100%

Most results from the survey were as expected when it comes to PII. For instance, it was generally considered to be of medium ability to exploit an individual by just obtaining a single piece of direct PII. The participants also indicated that once they had access to a single piece of direct PII, all they would need was either an individual's online username or birth date in order to exploit someone with confidence. This was evenly split 50/50 between participants. In Figure 3.2 the results from a question to the experts are shown. The experts were told they already had a single piece of direct PII as listed in Table 3.4; they were then asked to pick from the following which other information would be most sought out in order to perform a financial attack.

According to the experts, information that could be obtained from the FOIA request was of medium ease of accessibility. The ranges can be seen in Figure 3.3, where a majority of experts were in favor of this assessment.

The experts were then asked to rate the likelihood, having access to this information, that they could gain access to an individual's bank records (Figure 3.4) or open a line of credit (Figure 3.5). As seen in the figures listed, accessing the bank records was considered of medium likelihood by the experts with the given direct PII. This, however, differs from the experts' opinions on trying to open a line of credit in an individual's name, seen in Figure 3.5.
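The banding in Table 3.8 can be sketched as follows. The sketch assumes that the 42.85% and 71.42% boundaries correspond to ratings of 3 and 5 on the seven-point scale (3/7 and 5/7), and that a rating falling exactly on a boundary belongs to the lower band; neither assumption is stated explicitly in the survey description, and the name `rating_band` is hypothetical.

```python
LOW_MAX = 3 / 7      # about 42.85% of the seven-point scale (assumed)
MEDIUM_MAX = 5 / 7   # about 71.42% of the seven-point scale (assumed)


def rating_band(rating, scale_max=7):
    """Map a 1..scale_max survey rating to the Low/Medium/High bands."""
    if not 1 <= rating <= scale_max:
        raise ValueError("rating must lie on the survey scale")
    share = rating / scale_max
    if share <= LOW_MAX:
        return "Low"
    if share <= MEDIUM_MAX:
        return "Medium"
    return "High"


for rating in range(1, 8):
    print(rating, rating_band(rating))
```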


Figure 3.2: Additional PII needed for financial exploit

Figure 3.3: Ease of Access FOIA Request


Figure 3.4: Likeliness of Accessing Bank Records with Direct PII

Figure 3.5: Likeliness of Opening Line of Credit with Direct PII


When the experts were presented with a situation similar to the one listed previously, except this time the goal was to perform a professional harm attack, the information that they sought differed vastly from that of the financial harm attacks. As seen in Figure 3.6, the most sought information after already having an individual's name was their e-mail address. Compare this to the information sought after knowing an individual's name for a financial attack (Figure 3.2), where the most sought information was the individual's SSN.

Figure 3.6: Additional PII needed for Professional Harm

With this information the experts were asked to rate the severity of professional harm that could be caused with the name and the additional direct PII shown in Figure 3.6. These results were interesting in that no expert rated the achievable harm as low, as seen in Figure 3.7. The results were nearly split in half between those that believed they could only cause a medium amount of professional harm and those that believed they could cause a high amount of professional harm.

Figure 3.7: Severity of Professional Harm Given Direct PII

Next we asked the experts the likelihood of an exploit with this given information if they were able to find it all in the same location, that is, the same site or service. We followed this up with the same question, except that the information was spread out rather than all in one location. In Figure 3.8, we have the results side by side to compare. When all the pieces of direct PII were located on the same site or service, 61% of the experts responded that this information was highly exploitable to cause professional harm. However, as soon as the information was spread out over multiple sites and services, only 22.3% of the experts responded that the information was highly exploitable.

Figure 3.8: Exploitability by Accessibility

At the end of the survey we asked the experts for opinions in two open-ended questions. The first open-ended question was what, in their expert opinion, were the most critical factors for exploiting information. The top five results can be viewed in Table 3.9. One of the top five answers was the compound effect, where previously known information can be used to find other information. It appears that from knowing just a user's name, due to what is publicly available as well as information that is available through a FOIA request, one could create a rather detailed profile of an individual. This information helps us to answer our research question RQ-3: the minimum information that needs to be known about someone in order to find more information is just someone's name, a single piece of information.

From the responses shown in Table 3.9, we can also see that user awareness is a big contributor among the factors for exploiting information. This indicates we need to keep enforcing and training users on what constitutes acceptable online behavior for privacy protection.

Table 3.9: Expert Opinion Most Critical Factors for Exploiting Information

- Database Protections
- Using previous information to find more (Compound Effect)
- Aggregation of Data
- Users
- Ease of access to information

The other open-ended question we asked the experts was what, in their opinion, were the most critical areas of focus to prevent a privacy breach from occurring. The top answers to this question are listed in Table 3.10. Many of the answers received complemented the responses from Table 3.9. An interesting common response was to have a protected digital identity.

Table 3.10: Expert Opinion Most Critical Areas of Focus

- Separation of information storage
- Protections to databases
- Have a protected digital identity
- Restricting information access
- Awareness training

III.3.3 Final Model Design

In this section we will discuss the final changes to the model used to create our model, the Privacy Assessment Vulnerability Scoring System (PAVSS). Based upon our first iteration of the model and the results from the participants in our survey discussed in section III.3.2, we were able to come to a final iteration of our model. As will be observed, the overall model design has not changed drastically from that described in section III.3.1.

In Table 3.11, we can see the final design of the model. As can be observed, we still maintain 3 distinct areas, the Base, Accessibility, and Harm metrics, which comprise the total score value. The Base Metric column is comprised of the Exploitability Metrics and the Service Metrics. This was to incorporate the feedback that was given by the participants of the survey. The Exploitability Metrics column is still the same and comprised of the total direct PII, indirect PII, and data expectation to privacy. However, the participants noted in the survey that not only is the information that is presented a primary factor in the risk of being involved in a privacy breach, but the service being used is also a major contributing factor. For this reason, we also added a Service Metrics column to our Base Metrics. The service metrics are a way to quantify the risk that using a particular service presents.


Table 3.11: PAVSS 1.0 Metric Attributes

Base Metric Group                                        Accessibility Metric Group   Harm Metric Group
Exploitability Metrics   Service Metrics
Direct PII (DP)          Storage (ST)                    Public Information (PI)      Professional Harm Metric (HP)
Indirect PII (IP)        Sharing (SH)                    User Awareness (UA)          Financial Harm Metric (HF)
Data Expectation (DE)    Default Account Setting (DAS)   User Default Account (UDA)   Personal Expectation Metric (HPE)

One modification to the Exploitability Metric group is that of the Data Expectation Metric. In section III.3.1 our initial assumption was that the expectation to privacy from the data's perspective held higher weight. However, after the survey and further research, we have found that this does not hold much weight in the overall exploitability score. Therefore, we give this a value of either zero or one. This value depends upon what pieces of direct PII are used with a particular service and what the current laws say about the importance of that data. For instance, if the service had a user's social security number or medical records, and this was in the United States, then these pieces of direct PII have an expectation to privacy. A score of zero for this metric indicates that the data has no expectation of privacy under current laws. The scores for the Indirect and Direct PII Metrics are a count of the individual pieces of PII that can be found while parsing through a service looking for an individual.

The Service Metric group is comprised of the Storage, Sharing, and Default Account Setting metrics. Based on feedback from the survey along with information found during the initial research, these are the three most critical areas that are of concern


when trying to ascertain whether using a particular service poses a risk of a privacy breach. In an ideal situation the data owner would know exactly what the protections on the data are, who internally and externally has access to the data, who the service could potentially be selling data to, and whether the service is using a privacy preserving algorithm; this knowledge would help to define a better metric. However, users rarely have all of this information, so we based the metrics on values that can be found either from setting up an initial account (the Default Account Setting metric) or by parsing through the terms of service agreement (the Sharing and Storage metrics). The Storage metric as we have defined it can have a value ranging from 0 to 2. These values are given based upon whether the service stores any information on the user. The full scope and definitions of these values can be found in Table 3.12.

Table 3.12: Storage Metric possible values

Score   Storage Metric
0       Service does not store any information
1       Service stores basic information (Username/Password)
2       Service stores basic information, plus PII

The Sharing Metric is the next listed metric in the Service Metric group. This metric quantifies the risk imposed by the service sharing and/or selling the data to third parties. Much like the Storage Metric, we have given this metric a value from 0 to 2, as we have identified 3 separate sharing scenarios, with each successive scenario considered a higher risk of being involved in a privacy breach. In the first scenario the service does not sell or share any information about the user base. In the second scenario the service sells and/or shares information but uses one of many privacy preserving algorithms. In the final scenario the service shares and/or sells the information and uses no privacy preserving algorithms. In Table 3.13 we have defined the 3 possible scores depending on which of the three scenarios is present. If the service does not list or state


they sell and/or share the data using a privacy preserving algorithm, then it is assumed they do not.

Table 3.13: Sharing Metric possible values

Score   Sharing Metric
0       Service does not share any information
1       Service shares information, but releases using privacy preserving algorithms
2       Service shares information, no privacy preserving algorithms

The final metric in the Service Metric group is the Default Account Setting metric. This metric quantifies the PII that is required by default to create an account and that is service member facing. That is, this is the default information required from the user in order to create an account, and by default this information can be viewed by others that are also members of the service. As shown in Table 3.14, we give this a score from one to three. These scores depend upon how much PII must be entered by default that the other members can see by default.

Table 3.14: Default Account Setting Metric possible values

Score   Default Account Setting Metric
1       Zero to two pieces of indirect or direct PII required for a default account
2       Two to four pieces of indirect or direct PII required for a default account
3       Four or more pieces of PII required for a default account

The other change to the model described in section III.3.1 is that of the Accessibility Metric group. The Accessibility Metric group is the quantification of how accessible the information is, dependent upon user actions along with what is considered public information. The Public Information Metric is given a value of either zero or one. This value depends upon the pieces of PII counted in the Exploitability Metric group. If one or more of the pieces of PII counted in the Exploitability Metric group is considered to be public information, then the Public Information Metric is given a value of one; if not, the value is zero. The second metric listed in the Accessibility Metric group is the User Awareness Metric. The values range from zero to two and can be viewed in Table 3.15. Initially this was not considered an important metric; however, with input from the participants of the survey, it was determined to be a contributor to the risk of a privacy breach. This metric scores the user's overall knowledge of online PII. It was found that even if a user is concerned about their data, if they do not have the proper training or knowledge, then they are inherently at a higher risk than someone who has the training. The score is zero if they are well trained and don't share any PII; this includes using a different username for each online service they use.

Table 3.15: User Awareness Metric possible values

Score   User Awareness Metric
0       Well trained, doesn't allow sharing of PII, username is different from other services
1       Understands basic online PII protections, shares little amounts of PII
2       No online safety training, shares much if not all information

The final metric in the Accessibility Metric group is the User Default Account Metric. This score is a representation of what the user sets as sharing preferences for other members of the service. These values can be seen in Table 3.16 and range from one to three. Unlike the Default Account Setting Metric, this metric measures how much the user allows their information to be shared, or the amount of control they possess. A value of one in this metric would indicate the user allows


sharing of very little to no PII data, whereas a value of three would indicate the user allows sharing of multiple pieces of direct and indirect PII.

Table 3.16: User Default Account Metric possible values

Score   User Default Account Metric
1       User shares less than 2 pieces of indirect PII, 1 or less piece of direct PII
2       User shares more than 1 piece of direct PII, along with 1 or more indirect PII
3       User shares multiple direct and indirect PII

The Harm Metric group has stayed the same from the first iteration of the model. This metric group is the score of what harm could be done. It is designed so that it can be expanded upon to account for other types of cybersecurity attacks that can be performed by having access to an individual's PII. For both the Professional and Financial Metrics, the values are either one or zero, depending on whether the pieces of PII accounted for in the Exploitability Metric group can be used for either of the attacks. The values for the Personal Expectation Metric are shown in Table 3.17 and can range from zero to two. These values are a representation of CPM, discussed in section II.3.1. A user would be given a value of zero if they have little to low expectations of privacy, whereas someone who has a high degree of privacy expectations would be given a value of two.

Table 3.17: Personal Expectation Metric possible values

Score   Personal Expectation Metric
0       Low Privacy Expectations
1       General Privacy Expectations
2       High Privacy Expectations

The equations used for calculating the total score have also been updated to reflect the input from the participants of the survey. In equation 3.6, it is shown that


we want a maximum score no higher than ten, as well as an integer value. The score should be presented to the user in the simplest form, such that the user can make a valid risk assessment quickly.

\[ \mathit{Total} = \lceil \min[\mathit{Base} + \mathit{Accessibility} + \mathit{Harm},\ 10] \rceil \tag{3.6} \]

The new Base Metric group is calculated by equation 3.7. As can be seen, the total score is the total for the Exploitability Metric group added to the total of the Service Metric group. This was done in response to the input received during the survey from the experts, where they indicated that the service had just as much weight as the pieces of PII in determining the overall risk of being involved in a privacy breach.

\[ \mathit{Base} = \mathit{Exploitability} + \mathit{Service} \tag{3.7} \]

The final Exploitability Metric group score is calculated by equation 3.8. We know from previous research and from the experts in the survey that indirect PII data is worth half as much as direct PII. The overall total from the direct and indirect PII is then added to the expectation of privacy from the perspective of the data, which has been discussed above. This is an additive value to show that this particular set of PII that has been captured holds more weight than PII captured where the data does not have an expectation to privacy.

\[ \mathit{Exploitability} = DE + DP + (IP \times 0.5) \tag{3.8} \]

The Service Metric group total can be seen in equation 3.9. The vectors that comprise the total score were considered no more important than one another and are therefore additive in the equation below. There is a constant associated with this metric. The reasoning for this constant is the potential of a total score of seven.


The constant value places the final score for service from zero to three. What the service provides is a large contributor to whether or not an individual is involved in a privacy breach. Yet, ultimately, the biggest contributor is the information that is placed onto the site.

\[ \mathit{Service} = \lfloor 0.568 \times (ST + SH + DAS) \rfloor \tag{3.9} \]

The Accessibility Metric group vectors are each additive as well and are shown in equation 3.10. In the opinion of the experts in the survey, this metric group did not hold as much weight as the Base Metric group and therefore is given a weight of only 20%.

\[ \mathit{Accessibility} = 0.2 \times (PI + UA + UDA) \tag{3.10} \]

The final equation is for the Harm Metric group. This calculation is shown in equation 3.11. As in the Accessibility Metric group, each vector is equally weighted within the metric group. However, this group holds a weight of 30%. This 30% is due to the input from the experts that the greater the harm, the higher the reward for the potential malicious user and therefore the higher the risk of being involved in a privacy breach. This value does not hold as much weight as the Base Metric group, but holds more weight than the Accessibility Metric group.

\[ \mathit{Harm} = 0.3 \times (HP + HF + HPE) \tag{3.11} \]

III.4 Chapter Review

In this chapter we introduced the three hypotheses that we plan to test and discuss. This discussion occurred in section III.1, and we will discuss the results in the following chapter, where we cover the testing procedure, implementation, and results. Also discussed in this chapter was our method for determining quantitative values for privacy measurements, where there have been no other quantitative measurements to our knowledge before. In section III.3.2, we highlighted the results from using the Delphi Method that we defined in section II.5. The final major idea presented in this chapter was the final model design, shown in section III.3.3. Utilizing this model, in the next chapter we will test our hypotheses from section III.1.
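The scoring equations 3.6 through 3.11 of section III.3.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class name `Metrics` and the function names are hypothetical, and the rounding follows equation 3.6 as printed (a ceiling). Some of the tabulated scores in Chapter IV appear to match rounding to the nearest integer instead, so the rounding mode should be treated as an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class Metrics:
    """One user's metric values, named after the abbreviations in Table 3.11."""
    dp: int   # Direct PII count (DP)
    ip: int   # Indirect PII count (IP)
    de: int   # Data Expectation, 0 or 1 (DE)
    st: int   # Storage, 0 to 2 (ST)
    sh: int   # Sharing, 0 to 2 (SH)
    das: int  # Default Account Setting, 1 to 3 (DAS)
    pi: int   # Public Information, 0 or 1 (PI)
    ua: int   # User Awareness, 0 to 2 (UA)
    uda: int  # User Default Account, 1 to 3 (UDA)
    hp: int   # Professional Harm, 0 or 1 (HP)
    hf: int   # Financial Harm, 0 or 1 (HF)
    hpe: int  # Personal Expectation, 0 to 2 (HPE)


def exploitability(m: Metrics) -> float:
    # Equation 3.8: indirect PII counts half as much as direct PII.
    return m.de + m.dp + 0.5 * m.ip


def service(m: Metrics) -> int:
    # Equation 3.9: the constant maps a raw total of up to 7 onto 0..3.
    return math.floor(0.568 * (m.st + m.sh + m.das))


def base(m: Metrics) -> float:
    # Equation 3.7.
    return exploitability(m) + service(m)


def accessibility(m: Metrics) -> float:
    # Equation 3.10: weighted at 20%.
    return 0.2 * (m.pi + m.ua + m.uda)


def harm(m: Metrics) -> float:
    # Equation 3.11: weighted at 30%.
    return 0.3 * (m.hp + m.hf + m.hpe)


def total(m: Metrics) -> int:
    # Equation 3.6: cap at ten, then take the ceiling (as printed).
    return math.ceil(min(base(m) + accessibility(m) + harm(m), 10))


# Example input: a user with two direct and four indirect pieces of PII
# on a service scored ST=2, SH=2, DAS=1.
example = Metrics(dp=2, ip=4, de=1, st=2, sh=2, das=1,
                  pi=1, ua=2, uda=3, hp=1, hf=1, hpe=0)
print(total(example))
```

Keeping each metric group in its own function mirrors the structure of Table 3.11, so a change to one group's weighting stays local to that function.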


CHAPTER IV

TESTING

In this chapter we discuss the testing of the model. In the first section we discuss the testing plan and the implementation. The next section then discusses the results from testing and whether or not the data support any of the three hypotheses stated in the previous chapter. Finally, the chapter concludes by discussing the challenges that were faced while performing the tests and how further testing may be performed.

IV.1 Pretesting Requirements

For testing we wanted to match our initial goal of creating a tool to help the average individual assess their risk of being involved in a privacy breach. Therefore, for testing we chose an online service that has users interacting and sharing information, along with ease of access to this information. We also want to show the need for this scoring system, because the ability to sift through large amounts of information and find critical information is no longer limited to those who are well versed in computer science. As such, the data collecting tools we use to show the effectiveness of our tool are the same out-of-the-box tools available to the average adversary, the potential malicious user of these same services, to find potential targets.

For testing we have chosen to use the online social media platform Twitter. Twitter is a popular social media platform that allows ease of access to the stored data of its users through its API. Twitter allows its users to make a post, called a tweet, of 280 characters or less. Users also maintain a base of other users that they can follow. When a user creates a tweet, other users on the platform can share the original tweet, in a method known as retweeting. When a user follows another user,


they subscribe to their tweets and are thus notified when the user they are following makes a new post. To share another user's post, you do not always need to be a follower of the original poster. As such, users also have followers that are then capable of monitoring their posts and information. Users can mention other users in their post by using the @ symbol followed by the user's username. Signing up for an account is free, and access to the stored data through the API is free as well, although the free access is rate limited, depending on what access is being performed. This makes Twitter the ideal testing platform for the model, since it has a large worldwide user base and ease of access to the data.

In order to access the Twitter API, the first step was to create a Twitter account. Creating a Twitter account was straightforward, requiring a name and either a phone number or e-mail address. For the purposes of this test an e-mail address was used. A confirmation e-mail was then sent with a confirmation code. To continue the account creation, this code had to be entered. Next the service asked if we would like to upload a photo; it also gives the ability to skip this step, which we did. Next, we were asked for a short bio of less than 160 characters. This step can also be skipped, and for our testing purposes we did so. Next, we were asked about our interests and whether we would like to search for people to follow. Like the previous steps, this step could be skipped, which we did. After this we were presented with our home page. The name we provided on the initial page was turned into our username, which other users of the service could use to communicate with us if they found the need.

After the creation of the account we could register with Twitter for a developer account and ask for access to the API with Twitter-generated OAuth tokens. To receive permission, we were required to fill out what the purpose of accessing the API was, what we would do with the data, whether individuals would be identified from our data, as well as who would ultimately see the data we collect. We told Twitter our purpose


was research for a thesis and that no usernames would be released with the data. The turnaround time for Twitter to give us access was less than one week.

IV.2 Testing Procedure

The testing procedure was broken into two phases. The first phase was creating manual entries for users that fit into our three categories of high, medium, and low scores. We did this based upon what values we could enter into our profile as a user of Twitter, as well as common posts that users could make that would identify and disclose pieces of PII. The second phase was creating and testing the algorithms discussed in the Second Phase section. This involved utilizing tools provided by Twitter as well as common Python packages.

IV.2.1 First Phase

For the first phase of testing, we created fake users as a way to manually test our model. The fake users would have accounts modeled on real Twitter accounts, based on the various pieces of information we were asked to enter. We also identified possible statements users could post to other users and themselves that would result in a piece of PII being viewed by other users. In Table 4.1, we have listed the various pieces of information a user could enter into their profile and which values are required by default.

Table 4.1: Twitter profile information

User Profile Attributes   Required
Photo                     No
Username                  Yes
Bio                       No
Location                  No
Website                   No
Birthday                  No

The next step was to identify the sharing and selling of data, along with the storage of data, so as to be able to fill in the required values for the metrics in the Service Metrics


group from the model. According to Twitter's Terms of Service agreement, they do and will sell user information to third party vendors. However, the Terms of Service do not specify whether they use a privacy preserving algorithm when they sell; due to this, and according to our model, they received a score of two for the Sharing Metric vector. Due to the nature of the service, the Storage Metric vector is rated a score of two. The final vector in the Service Metric group is the Default Account Setting metric, and by analyzing the contents of Table 4.1, this vector receives a score of one. Table 4.2 lists the scores with the vectors for the Service Metric group.

Table 4.2: Twitter Service Metric group score

Service Metric Group Vector   Score
Storage                       2
Sharing                       2
Default Account Setting       1

After identifying and scoring the various vectors for the Service Metric group, fake users needed to be created. To create the fake users, we needed to identify what users at each level, High, Medium, and Low, would place into their profile. For our fake users, the values that we believe they would have in their profiles are listed in Table 4.3.

Table 4.3: Twitter Profile Information Manual Entries

User Profile Attributes   High   Medium   Low
Photo                     Yes    Yes      No
Username                  Yes    Yes      Yes
Bio                       Yes    No       No
Location                  Yes    No       No
Website                   Yes    Yes      No
Birthday                  Yes    Yes      No

To assess the account variables listed in Table 4.3, we based the values on what would give a person a higher or lower rating in the User Default Account Metric and Personal Expectation Metric vectors. This was done by utilizing the work done in [37] and the theory discussed in section II.3.1. Someone who would score high in the


Personal Expectation Metric vector would be an individual who is considered to have a high personal expectation to privacy and as such would be less willing to share information. An individual with a low personal expectation to privacy, by contrast, would score low on the Personal Expectation Metric vector and therefore be willing to share more information.

An assumption was then made concerning the User Awareness Metric vector. This assumption was that users who shared many pieces of PII were typically users that did not have a high level of online safety awareness, whereas the users who were more restrictive about the information they posted on their accounts typically had a medium to high degree of online safety awareness. This also correlates with the user's personal expectation to privacy: we assume those that have a high level of personal expectation to privacy would be the users typically educating themselves on online safety.

IV.2.2 Second Phase

This phase of testing can be broken into three distinct steps, all of which had to be completed before scores could be calculated. The first was to access the live stream through Twitter's API and place a filter to capture results. The second step was to pass the captured text from users through a regular expression check, to verify that the text about to be analyzed contained one of our keywords. The final step was to take the verified text and perform the Lesk algorithm on it, finding the sense of the word we were searching for.

In Table 4.4 there is a list of the primary search criteria filters. However, we will also use variants of these terms to extract as much information from the streaming services as possible.

Table 4.4: Terms to Search

Gender                   Race
Birthday                 Geographic Indicators
Online Usernames         Social Security Number
Email Address            Medical Records


We will only be searching for terms that are considered Direct or Indirect PII. In Table 4.5, we have listed all the terms we filtered on through the live tweet capturing. We identified in section II.4.3 that finding out whether or not a sentence, comment, or block of text could contain possible private information can be automated to a satisfactory degree of confidence.

Table 4.5: Variations of terms to search

Gender, Race, birthdate, Birthday, Geographic Indicators, birth, Online Usernames,
Social Security Number, SSN, Email Address, Medical Records, disease, email, e-mail,
username, address, phone, cell, diagnose, health, hospital, clinic, medical, name

For our purposes, the information we retained was username, name, location, and text. The retention of most of this data was only temporary. Once we verified that a specific piece of information was found, we stored a boolean value to indicate that this particular piece of PII could be determined. This information was stored in a SQLite database on our local machines. The purpose of our testing was not to actually cause harm to the end users, just to verify whether information was found.

The live streaming filter provided by Twitter returns the tweet that was caught by the filter in JavaScript Object Notation (JSON) format. In Table 4.6 we specify some useful information that the JSON data returns. A more detailed description of all the information provided in this data format can be found on the Twitter Developer website. We first saved each JSON record as a single entry in a standard text file. The files have a single JSON record per line.
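The capture-and-flag step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the JSON keys (`text`, `user.screen_name`, `user.name`, `user.location`) are assumed to follow the tweet object layout, the `TERMS` mapping is a hypothetical subset of the Table 4.5 terms, and the table and column names are invented for the example.

```python
import json
import re
import sqlite3

# Hypothetical subset of the Table 4.5 terms, grouped by the PII flag they set.
TERMS = {
    "birthday": ["birthday", "birthdate", "birth"],
    "email": ["email", "e-mail"],
    "phone": ["phone", "cell"],
    "medical": ["medical", "hospital", "clinic", "diagnose", "disease", "health"],
}

# One whole-word, case-insensitive pattern per PII flag.
PATTERNS = {
    column: re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    for column, words in TERMS.items()
}


def open_db(path=":memory:"):
    """Create the users table with one 0/1 flag column per PII type."""
    db = sqlite3.connect(path)
    flags = ", ".join(f"{c} INTEGER DEFAULT 0" for c in TERMS)
    db.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "username TEXT PRIMARY KEY, name TEXT, location TEXT, texts TEXT, "
        + flags + ")"
    )
    return db


def process_line(db, line):
    """Flag PII terms found in one JSON record; return True if any matched."""
    tweet = json.loads(line)
    text = tweet.get("text", "")
    hits = [c for c, pattern in PATTERNS.items() if pattern.search(text)]
    if not hits:
        return False
    user = tweet.get("user", {})
    username = user.get("screen_name")
    row = db.execute(
        "SELECT texts FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        db.execute(
            "INSERT INTO users (username, name, location, texts) "
            "VALUES (?, ?, ?, ?)",
            (username, user.get("name"), user.get("location"), text),
        )
    else:
        # Append the new text to what was already stored for this user.
        db.execute(
            "UPDATE users SET texts = ? WHERE username = ?",
            (row[0] + "\n" + text, username),
        )
    for column in hits:
        db.execute(
            f"UPDATE users SET {column} = 1 WHERE username = ?", (username,)
        )
    return True
```

Because the patterns are anchored on word boundaries, a term such as `health` does not fire on `healthy`, which keeps the first filter from flagging unrelated words that merely contain a search term.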


Table 4.6: Useful information from tweets caught in filter

Created at time   Retweeted       Coordinates
Username          Retweet Count   Location
Text              Name            Media Information

For the regular expression matching, we created regular expressions based upon the words that we used for filtering through the Twitter API. After a user was processed in this stage we stored them in a SQLite database. We read in the file and began the processing. First we ran the text through a regular expression check. If a match occurred, the next step was to determine whether the message was a retweet or an original tweet. Below is an example of how a retweet is structured.

RT @user_retweeter: the original message

If the message was a retweet we would attempt to identify whether there was another username in the text. This was done for a more accurate account of who the information in the text was about. We would then look up the user in the database. If the user existed, we would update the column relating to the PII that we had just found, as well as adding the text to the previously stored text for that user. If they did not exist, we would add an entry to the database, setting the column for the PII data that we had found to true and storing the text of the tweet. This is shown in the pseudocode in Figure 4.1. Our regular expression testing would look for whole words.

Before being able to perform the final step in processing the text, we needed to determine the true sense of the sensitive words that we were searching on. As mentioned in section II.4.3, a word can have many meanings, making the true meaning of the sentence that the word appears in ambiguous. In order for us to use the Lesk algorithm in our final processing step, we needed to first run Lesk's algorithm on sentences that we knew to be showing the sense of the particular word we were searching on. We created these sentences and ran Lesk's algorithm over each sentence, then stored


in a SQLite database the sense of the word. The version of Lesk's algorithm we used was the implementation in the Python package provided by [9]. We did this for each word that had been made into a regular expression check in the second step.

Data: text file with JSON entry per line
Result: Database of aggregated tweets from text file
while lines in file do
    Regular expression test for different search terms;
    if Regular Expression Match then
        if User exists then
            Update user's entry in database for search term, add text to list of text for user;
        else
            Create new user entry;
        end
    else
        Continue;
    end
end

Figure 4.1: Pseudocode for regular expression testing

With the sense of the different privacy divulging words known, we then started the final step of processing the data. We accessed each member of the database, from the users that had made it through the second filter. We then determined which particular PII the text that we had stored for that user contained. Some users had been marked for many different pieces of PII, since we had captured tweets over a series of many days. For each particular PII that had been marked, we ran Lesk's algorithm over the text, then attempted to determine the sense for each PII. If the sense of the piece of PII found was the same as what we had stored previously, then we marked it as a true found piece of PII; otherwise we marked it as false and continued. The pseudocode for this can be viewed in Figure 4.2.

Once all the entries had been stored, the last step that needed to be done was to determine how many pieces of direct and indirect PII each user had. For this we iterated over the new database entries' boolean fields that flagged which pieces of PII have made it


through the last filtration process. We then stored the number of indirect and direct pieces of PII. At this stage we also determined whether any combination of the found pieces of PII could be used for a professional or financial attack. We then stored this information per user as well, in order to run it through our equations defined in section III.3.3.

Data: Database of users with text flagged with pieces of PII
Result: New Database of users that have been determined true
while Entries in database do
    Determine which piece of PII was flagged;
    foreach Flagged PII do
        Perform Lesk's Algorithm on text for entry;
        if Sense of PII matches then
            Mark True for Particular PII in New Database;
        else
            Continue;
        end
    end
end

Figure 4.2: Pseudocode for final filtering phase

IV.3 Results

We will now discuss the results from performing both phases of testing. We will also determine whether any of the results we processed lend support for or against the three hypotheses we have set out to test. First, we will discuss test results from the created data set. This data set was meant to test our model manually, so as to be able to verify the correctness of the equations and our assumptions. Afterwards we will discuss the results from the phase two tests, where we captured live data from the online social media platform Twitter.

IV.3.1 Results from Data Set

To reiterate section IV.2.1, we created users that fit within the realms of High, Medium, and Low of our model. The profiles for each of these users can be seen in Table 4.3. For each range of high, medium, and low, we created nine users.


In the following tables, the column titles directly reference Table 3.11 from the previous chapter. Also, when determining the final score, the value for the service metrics was already determined in Table 4.2.

In Table 4.7 we have the total values for users in our high category. These are users that we expect would share many pieces of PII and would have very little, if any, training on online safety. The values that we picked for the direct and indirect PII were based upon the default values the users would place on their profiles. These values are listed in Table 4.3, under the column High. The additional values we listed, we assumed, would come from tweets that other users would make about them, or tweets that they would post about themselves.

Table 4.7: Created user accounts, high scores

User     DP   IP   DE   PI   UA   UAS   HP   HF   HPE   Score
User 1   2    4    1    1    2    3     1    1    0     9
User 2   2    4    1    1    2    3     1    1    0     9
User 3   2    3    1    1    2    3     1    1    0     8
User 4   3    5    1    1    2    3     1    1    0     10
User 5   2    5    1    1    2    3     1    1    0     9
User 6   3    4    1    1    2    3     1    1    0     10
User 7   3    5    1    1    2    3     1    1    0     10
User 8   3    3    1    1    2    3     1    1    0     9
User 9   3    4    1    1    2    3     1    1    0     10

In Table 4.8 we have listed the results from the users that fall into the medium score category. In contrast to the users from Table 4.7, these users post less direct PII, but still more than enough indirect PII that they could potentially be re-identified if the information was found. These users would be considered to have a medium level of online safety awareness; however, they might still share information that can be used for re-identification purposes.

These users would have a profile filled in similarly to the users in Table 4.3, under the column Medium. This column was to give an example of what a medium level user would look like. As can be deduced from Table 4.8, the actual information filled in


can vary. We wanted to show that even users who have an awareness of online safety could also be caught up and leave information that a malicious user could use.

Table 4.8: Created user accounts, medium scores

User     DP   IP   DE   PI   UA   UAS   HP   HF   HPE   Score
User 1   3    1    0    1    1    3     1    0    1     7
User 2   1    3    0    1    2    1     1    0    1     6
User 3   1    2    1    0    1    1     1    1    1     6
User 4   1    3    0    1    2    1     1    0    1     6
User 5   2    1    0    1    1    2     1    0    1     6
User 6   1    1    0    1    1    1     1    0    1     5
User 7   1    3    1    1    2    1     1    1    1     7
User 8   2    2    0    1    1    2     1    0    1     6
User 9   2    1    0    1    1    2     1    0    1     6

Our final user group, those in the low score range, can be viewed in Table 4.9. Comparing these users to those in Tables 4.7 and 4.8, it can be observed that these would be the users who attempt to mask their online identity the most. They would be considered to have a high online safety awareness and share the least amount of PII. One area these users would score high in is HPE. These would be the users who would feel a high degree of harm if their information was leaked and no longer under their control.

These users would be those labeled in Table 4.3 under the column Low. Table 4.9 shows that these users share very little information, yet still receive a score of three. This is due to the service they are using. In this case, that service is Twitter. Just because these users are not sharing much information intentionally or allowing others to share it doesn't mean someone cannot access it. The service itself sells off the information.

These created data sets were used primarily to hone in and demonstrate that our equations would work as expected. Since our created data set ended up producing scores that we found sufficient, we will now step to the results from the second phase, where we captured live information as it was flowing. In this next section we will


be able to compare the captured data to the data we created and see whether our assumptions were correct.

Table 4.9: Created user accounts, low scores

User     DP   IP   DE   PI   UA   UAS   HP   HF   HPE   Score
User 1   0    1    0    0    0    1     0    0    2     3
User 2   0    1    0    0    0    1     0    0    2     3
User 3   0    1    0    0    0    1     0    0    2     3
User 4   0    1    0    0    0    1     0    0    2     3
User 5   0    1    0    0    0    1     0    0    2     3
User 6   0    1    0    0    0    1     0    0    2     3
User 7   0    1    0    0    0    1     0    0    2     3
User 8   0    1    0    0    0    1     0    0    2     3
User 9   0    1    0    0    0    1     0    0    2     3

IV.3.2 Results from Twitter Data

In this section we discuss and highlight the results from accessing the live streams from Twitter. As we do, we also discuss whether any of the results we analyzed lend support for or against the hypotheses listed in the previous chapter. First, we show how many tweets we captured and how many tweets were filtered out at each stage of our filtering process. Then we show the overall average scores for all of the users that made it through all of our filtration processes. Afterwards we dig deeper into separate categories of users.

An interesting result from the various filtration processes was the number of tweets we eliminated per step. Our initial data set consisted of 2.5 million tweets. These were captured over a series of days. As shown in Figure 4.3, after we aggregated all the tweets per user and filtered out the tweets that did not pass the regular expression test, we were left with 1.6 million users. One thing to note is that the first step counted each tweet individually rather than each user. Therefore some of the tweets caught could have been multiple tweets by the same user. In addition, some of the tweets caught may have matched due to a URL posted or the title of a piece of media posted.


Figure 4.3: Number of tweets per filtration process

The last filtration test left us with just under 500 thousand users, showing that not every tweet caught was significant with respect to privacy divulging information. In total, aggregating and filtering removed 2 million tweets from those first caught in our filter. This can be attributed to a few factors. The first is how the Twitter filtration works. According to the Twitter developer documents, the filtration matches not only the filtered word in the tweet but also URL names, media names, and screen names. This helps explain why there was a decrease between the first filtration and the second filtration step. Our regular expression test was testing for whole words in the text of a tweet, and not against usernames, URL strings, or names of media, like photos or videos embedded in posts from other users. As for why there was a drop in the tweets that made it past the WSD test, this is attributed to the ambiguity of the meaning of the word used in the sentence. For example, one of the terms we were filtering on was the word address. In Table 4.10 we have listed all the various forms the word address


can take on when it is used as a noun. The word address can also be used as a verb, where it then takes on up to ten other meanings as well.

Table 4.10: Various definitions of word "address" as a noun

Address, Part of Speech   Definition
Address - Noun            (computer science) the code that identifies where a piece of information is stored
Address - Noun            the place where a person or organization can be found or communicated with
Address - Noun            the act of delivering a formal spoken communication to an audience
Address - Noun            the manner of speaking to another individual
Address - Noun            a sign in front of a house or business carrying the conventional form by which its location is described
Address - Noun            written directions for finding some location; written on letters or packages that are to be delivered to that location
Address - Noun            the stance assumed by a golfer in preparation for hitting a golf ball

Since the number of tweets removed by the filtration steps has been addressed, we can now view the results from analyzing the users. In Figure 4.4, we compare the total users that have at least one piece of direct PII to those that have no pieces of direct PII.

Table 4.11 lists values for users who made it through the final filtration step. The total of users with direct PII compared to users with no direct PII is vastly different. The total of users with no indirect PII compared to those with indirect PII is as expected, since we were able to get each user's username from the tweet; therefore, for each user whose tweet we captured we also got their username, giving us a minimum of one piece of indirect PII per user.
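The word-sense ambiguity illustrated by Table 4.10 can be made concrete with a simplified Lesk-style overlap count. This is a self-contained sketch, not the package implementation referenced in [9]: the glosses are the Table 4.10 definitions, while the sense labels, the stopword list, and the function names are hypothetical choices made for the example.

```python
import re

# Noun glosses for "address", taken from Table 4.10.
# The short sense labels used as keys are hypothetical.
GLOSSES = {
    "memory_code": "the code that identifies where a piece of information is stored",
    "location": "the place where a person or organization can be found or communicated with",
    "speech": "the act of delivering a formal spoken communication to an audience",
    "manner": "the manner of speaking to another individual",
    "sign": "a sign in front of a house or business carrying the conventional "
            "form by which its location is described",
    "directions": "written directions for finding some location; written on letters "
                  "or packages that are to be delivered to that location",
    "golf": "the stance assumed by a golfer in preparation for hitting a golf ball",
}

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "a", "an", "the", "of", "to", "in", "on", "for", "by", "or", "and", "is",
    "are", "be", "that", "which", "its", "with", "can", "where", "his", "my",
    "some", "another", "before", "at", "it",
}


def tokens(text):
    """Lowercase word set with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def simplified_lesk(context, glosses=GLOSSES):
    """Return the sense whose gloss shares the most words with the context."""
    ctx = tokens(context)
    return max(glosses, key=lambda sense: len(ctx & tokens(glosses[sense])))


def sense_matches(text, reference_sense):
    """True if the sense chosen for the text equals the stored reference sense."""
    return simplified_lesk(text) == reference_sense


print(simplified_lesk("The golfer took his address before hitting the ball"))
```

A tweet about golf is assigned the golfing sense and so would not be kept as a postal-address disclosure, which is the kind of drop between filtration steps described above.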


Figure 4.4: Total users with Direct PII vs No Direct PII

Table 4.11: Total Users in various sets

User Set                     Total Number of Users
Users with Direct PII        464658
Users with No Direct PII     22018
Users with Indirect PII      486676
Users with No Indirect PII   0

In Table 4.12 we compare the total number of users who have at least one piece of direct PII and strictly more than one piece of indirect PII to those users with at least one piece of direct PII and only one piece of indirect PII. Together these make up the total count of 464658 listed in Table 4.11. As shown in the table, the total of users with at least one piece of direct PII and more than one piece of indirect PII is much greater than the total of users with at least one piece of direct PII and only one piece of indirect PII.

Table 4.12: Total indirect PII for users with at least one direct PII

User Set                             Total Number of Users
Users with two or more indirect PII  373462
Users with exactly one indirect PII  91196


We further analyzed the results to get the values in Table 4.13. From the table we can see that the mean number of pieces of indirect PII for users was 2.0602, with a standard deviation of 0.6702. We then calculated the Z-score for the probability of finding a user with strictly greater than one piece of indirect PII. We use the Z-score to determine how many standard deviations away from the mean a data point is. For this scenario our universe was the total number of users with at least one piece of direct PII and one piece of indirect PII. This probability came to 94.3%, indicating that this is the percentage of the universe who have two or more pieces of indirect PII while also divulging at minimum one piece of direct PII. This lends support to H1, that users who have one piece of direct PII available will also have a minimum of two pieces of indirect PII available.

Table 4.13: Users with at least one direct PII, number of indirect PII statistics

Parameter                                Value
Mean                                     2.0602034184281774
Standard Deviation                       0.6701555985087279
Z-Score, One or less indirect PII        -1.582025757581386
p-value, Greater than one indirect PII   0.9431781547517581

When we then calculate the PAVSS score for these users, we can see that the users with direct PII available have a higher score than those that do not have direct PII accessible. This can be seen in Table 4.14, where we compare the two separate sets of users.

Table 4.14: Users with Direct PII vs Users without Direct PII PAVSS Score

User Set             Average PAVSS Score
With Direct PII      7.551973771433622
Without Direct PII   6.6339090949561506

To test our other hypotheses, we had to further investigate our results. First, we analyzed the users who would fall in and out of the set of users from H2. In Table 4.15, the results from the filtration test are listed. One noticeable observation is that


themeannumberoftotalpiecesofPIIbetweentheuserswithanaccessiblebirthday comparedtothosewithoutanaccessiblebirthday,isnearlythesame.Theresultsshow thatforalltotalusersthatwehaveanalyzed,usersthatdon'tshowtheirbirthdayactuallyhavemorePIIavailablethanthosewhodonot. Table4.15:UserstotalnumberofPIIforuserswithandwithouttheirbirthdaysaccessible Parameter Accessible NotAccessible TotalNumberofUsers 222640 250181 Percentage 47.91% 52.09% Mean 3.5240747394897594 3.6272698566238044 StandardDeviation 0.6240154278586392 0.6093790317448934 Z-Score,DierenceinMean 0.16537270158234338 p-value 0.4343253194025472 IfwethenanalyzetheassociatedPAVSSscorewitheachuserscore,weseethat thosewithoutabirthdayscorehigheronaveragethanthosewithoutabirthday.Table 4.16showstheaveragePAVSSscorebetweenthedierentsetsofusers.FromTable 4.15and4.16,wecanseethattheresultsdonotlendsupportto H2 . Table4.16:UserswithandwithoutBirthdayPAVSSScore UserSet AveragePAVSSScore WithBirthday 7.134337944664032 WithoutBirthday 7.4619215687842 Next,welookedattheresultsforuserswithandwithanemailavailable.To summarize H3 ,wehypothesizedthatthosethathavetheiremailsavailablewillalso showmoretotalpiecesofPII,comparedtothosewhodon't.InTable4.17,wehave listedtheresults.Theuserswhohadtheiremailsaccessible,onaveragehadahigher numberofPIIaccessiblethanthosewhodidnot. InTable4.18,weshowtheaveragescoreforuserswithemailsaccessibleagainst thosewithoutemailsaccessible.AligningwithwhatwehavelistedinTable4.17,the userswiththeiremailsaccessibleonaveragehaveahighPAVSSscore. 65
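The Z-scores and p-values reported in Tables 4.13, 4.15, and 4.17 can be recomputed from the tabulated means and standard deviations. The sketch below is an illustration under stated assumptions, not the original analysis code: it assumes each p-value is the upper-tail probability of a standard normal variable, and that each "difference in mean" Z-score measures the "Not Accessible" mean against the "Accessible" group's mean and standard deviation. Those conventions are inferred only because they reproduce the tabulated values; the function names are hypothetical.

```python
import math


def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


def z_score(x: float, mean: float, std: float) -> float:
    """Number of standard deviations that x lies away from mean."""
    return (x - mean) / std


# Table 4.13: threshold of one piece of indirect PII against the sample mean.
z_h1 = z_score(1.0, 2.0602034184281774, 0.6701555985087279)
p_h1 = upper_tail(z_h1)

# Table 4.15: "Not Accessible" mean measured against the "Accessible"
# group's mean and standard deviation (assumed convention).
z_h2 = z_score(3.6272698566238044, 3.5240747394897594, 0.6240154278586392)
p_h2 = upper_tail(z_h2)

# Table 4.17: the same assumed convention applied to the e-mail groups.
z_h3 = z_score(3.532834601559218, 4.027326153072192, 0.682536387468522)
p_h3 = upper_tail(z_h3)

for name, z, p in [("H1", z_h1, p_h1), ("H2", z_h2, p_h2), ("H3", z_h3, p_h3)]:
    print(f"{name}: z = {z:.4f}, upper-tail probability = {p:.4f}")
```

Under these assumptions the three rows come out at roughly z = -1.5820, 0.1654, and -0.7245, with upper-tail probabilities of roughly 0.9432, 0.4343, and 0.7656, in line with the tables.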


Table 4.17: Users' total number of PII for users with and without their emails accessible

Parameter                      Accessible           Not Accessible
Total Number of Users          43731                428420
Percentage                     9.41%                90.59%
Mean                           4.027326153072192    3.532834601559218
Standard Deviation             0.682536387468522    0.5929284576361334
Z-Score, Difference in Mean    -0.7244911195826604
p-value                        0.7656178614293423

Table 4.18: Users with and without Email PAVSS Score

User Set         Average PAVSS Score
With Email       7.834739658365919
Without Email    7.253650623220205

The users who had their e-mails accessible were a small subset of those who ended up passing through all the filters. In Table 4.17, we can see that only 9% of total users had their e-mails accessible. Yet those 9% had, on average, a higher total number of pieces of PII available. The p-value associated with the difference in means was 0.7656, which is not the most desirable, yet considering the small percentage of users who had their emails accessible and the density of those with more PII, the results tend to support H3.

IV.4 Challenges

We overcame several challenges while performing the testing. One of the first challenges was that Twitter limited the amount of information that you could query through their API. The amount and length of time you could siphon live tweets was not limited; however, if you wished to query their API on specific users, you were rate limited.

Their API would allow you to get all the tweets a user had posted within the past 30 days. However, you were limited to only 15 users per 15 minutes. Attempting to do this for every user in our database would have been a very time-consuming task; at that rate, the 486676 users in Table 4.11 alone would take roughly 338 days of continuous querying. You did have the opportunity to pay and not be rate limited; however, for our testing we wanted to appear as a malicious user. Our assumption was that a malicious user would attempt to extract enough information without leaving a trail connecting them to the data. If a malicious user paid for access, this would leave a series of breadcrumbs leading back to the malicious user.

Another issue that arose was the number of filtering streams you were allowed to have open at a given instant. For efficiency, the first iteration of the testing program had a separate process filtering the live tweets for each category of word we were filtering on. For instance, when attempting to find a tweet about a birthday, we were filtering for words like birth, birthday, and birth-date. The first attempt kept all of a category's terms in one process, as a way to store all the captured tweets in one area and one grouping for easier processing later. We found that after two processes had connected to the live tweets filter, we would be kicked off and denied. As such, we ended up requiring one process that filtered on all the terms at once.

IV.5 Chapter Review

In this chapter we set out to test our hypotheses and the model listed in the previous chapter. For testing purposes, we used the online social network called Twitter. From Twitter we had free access to the tweets that were being streamed live. To perform the tests we had to sign up for an account, which was discussed in the first section of this chapter.

Our testing can be separated into two phases. The first phase was when we manually created users. These manually created users fell into three categories: those scoring High, Medium, and Low on the PAVSS scale. This was our training data set, used to verify that the equations from our model would hold true.

The second phase of testing was actually accessing the live tweet stream. This process was broken into three filtration steps. The first was the filtering that Twitter's live stream performed based on the terms we provided. The second filtration used regular expression testing on the tweets we found. The third filtration used WSD to determine the true sense of the privacy-critical term that had passed our regular expression testing.

In the second phase we were able to test our hypotheses. The results from our tests lent support to H1 and H3; however, our results did not lend support to H2. Further testing can be done to analyze H2 and may provide more accurate results. This further testing can include paying for unrestricted access to Twitter's database and looking for this information for each user. Another way to test this hypothesis would be to filter live tweets containing a birthday for an entire year. This approach would be time intensive as well as data intensive, considering the number of tweets we captured in only a few days' time.
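The three filtration steps summarized in this chapter review can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the program used for testing: it runs on an in-memory list instead of the live stream, the regular expression and the two-entry sense inventory are hypothetical, and a simplified Lesk-style gloss-overlap rule stands in for the WSD step. Only the three birthday terms come from the text above.

```python
import re

# Stage 1: terms handed to the stream filter (birthday terms from the text).
TRACK_TERMS = ("birth", "birthday", "birth-date")

# Stage 2: hypothetical pattern for a first-person birthday statement.
BIRTHDAY_RE = re.compile(
    r"\b(?:my|our)\s+(?:\w+\s+)?(?:birthday|birth-date)\b", re.IGNORECASE
)

# Stage 3: toy sense inventory standing in for a real dictionary (assumption).
SENSES = {
    "personal": "the day on which a person was born celebrated each year "
                "by me my family friends",
    "other": "an anniversary of a company product band or event "
             "celebrated by fans",
}


def term_filter(tweets):
    """Keep tweets that mention any tracked term."""
    return [t for t in tweets if any(term in t.lower() for term in TRACK_TERMS)]


def regex_filter(tweets):
    """Keep tweets whose wording matches the birthday pattern."""
    return [t for t in tweets if BIRTHDAY_RE.search(t)]


def lesk_sense(tweet):
    """Simplified Lesk: choose the sense whose gloss shares most words."""
    context = set(re.findall(r"[a-z']+", tweet.lower()))
    return max(SENSES, key=lambda s: len(context & set(SENSES[s].split())))


def pipeline(tweets):
    """Apply the three filtration steps in order."""
    survivors = regex_filter(term_filter(tweets))
    return [t for t in survivors if lesk_sense(t) == "personal"]


SAMPLE = [
    "Today is my birthday and my family is taking me out",
    "Happy birthday to the band, 20 years of great albums",
    "Celebrating our company birthday with a product launch event",
    "Traffic was terrible this morning",
]

print(pipeline(SAMPLE))
```

On this sample, the term filter drops the traffic tweet, the regular expression drops the tweet addressed to a band, and the sense step drops the company anniversary, leaving only the first-person birthday statement.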


CHAPTER V
FUTURE RESEARCH

In this chapter we briefly discuss future research that can be performed related to this thesis. These are areas that we determined would have helped to reach better and more concise results, as well as new questions that emerged that were not within the scope of this thesis yet are related to this research.

First, having more experts participate in the study would have helped produce more accurate results within the model. Having many more iterations of the survey would also have helped hone in on some of the more critical privacy-reducing areas. We could also apply our model to other online platforms to see how it works there.

Future research could be done on lexical analysis, on how many different ways a user could make a statement that produces a privacy vulnerability. Much internet text is comprised of slang and shortened or abbreviated words, so that the full meaning of the sentence fits within the confines of the allotted character limit. Due to this, finding the true meaning of what a user states becomes more difficult, since many NLP programs and devices have been trained on peer-reviewed texts. These texts have proper grammar and formatting, which does not happen often between users on social sites where there is a character limit.

Identifying a way a user could have a protected online identity that is disassociated from their offline, real-world identity would be an interesting topic to research as well. This was identified by experts who responded to our survey as an area to focus on for preventing a privacy breach. Since earlier research has shown that individuals release certain private information as a way to gain trust with others online, this would be a rather challenging avenue to research, partially due to needing a way to help users gain trust without breaking the confines of what is private.


A final area of future research would be expanding our model to test against other use cases for PII compromise. An interesting study would be to determine the number of individuals involved in a privacy breach and see how many of those were active on a social networking site. After determining this, one could then determine the PAVSS total score those users would produce and compare it to that of users who have not been involved in a privacy breach. This could potentially help lead to more accurate ways to inform users on how to be safe in an online environment.


CHAPTER VI
CONCLUSION

In this chapter we wrap everything up. We briefly discuss the reasoning for this thesis and our research questions and answers, then the hypotheses that were formed while researching our research questions and the model that we designed to test our hypotheses. This leads to a summary of our testing methodologies and results, which then segues into avenues for future research.

This research was started on the premise that currently in the United States there are no true protections for consumer privacy. The standards that do exist are inadequate and, as such, do not provide enough protections to help prevent consumers from being involved in a privacy breach. We showed in section II.2 that there are currently industry-wide methods and procedures for determining the faults and severity of these faults for software, yet these procedures do not address user privacy. Researching this helped to answer one of our research questions, RQ-4, where we set out to find whether current scoring metrics take into account the importance of the data at risk of being stolen.

In chapter II we identified five research questions related to users' privacy. Our first research question was RQ-1, which asked what information was needed to impersonate someone. We determined that it was sufficient to know only a person's name and email address in order to impersonate someone. Our second research question, RQ-2, asked what information about someone was more valuable. This was determined to be information that fell into the four categories of: information that is private, unique, and distinct. This information is rightfully called direct personally identifiable information. The third research question, RQ-3, could not be fully answered until after the survey was complete. For this we found that the least amount of information that needed to be known in order to find more information was only a single piece of information. This single piece of information was just a person's name in many cases. We have just discussed RQ-4, which leaves us with our last research question, RQ-5, which asked if current data collection techniques used in data analysis follow a strict enough guideline to ensure user anonymity. We were able to answer RQ-5 after analyzing various data anonymization algorithms, as well as laws in the United States and in Europe. We came to the conclusion that the current state of data collection techniques does not follow a strict enough guideline, since many times guidelines do not exist.

After our research questions we determined three hypotheses, which we introduced in chapter III. They were:

H1: Those accounts that have 1 piece of direct PII available will also have a minimum of 2 pieces of indirect PII available.

H2: Users that give their birthday or allow it to be determined are more likely to have more pieces of Indirect or Direct PII than users that do not give their birthday.

H3: Users that have their e-mail addresses easily accessible are more likely to divulge more PII, and therefore have a higher score from our model, than the set of users from H2.

From these hypotheses and the research questions we developed our PAVSS framework. This framework is the first of its kind to quantify the risk a user takes while using an online service. To help develop this framework, we utilized the Delphi method to find quantitative values for various privacy-vulnerable scenarios. We ultimately came to our final version of PAVSS, which was discussed in section III.3.3.

In chapter IV we tested our model as well as our hypotheses. For testing we utilized the online social networking site Twitter. A user on Twitter is capable of making an online post called a tweet, which other users on and off the platform can view. By accessing tweets as they were posted, we captured tweets that contained various words that are indicative of speech that contains privacy-divulging information. We filtered our results from this capturing process by using regular expression testing, then WSD testing on the text that made it through the regular expression testing. Our results lent support for H1 and H3. The results did not lend support for H2. However, we identified other testing methodologies that could be used to further test H2.

Finally, in chapter V, we identified possible avenues for further research that were either outside the scope of this thesis or would build on this thesis as a base. Utilizing this model for other use cases of PII would help to strengthen the need for our framework. Also, identifying a way to separate a user's online identity from their real-world identity, while allowing users to build trust, is an interesting topic for further research.


REFERENCES [1] Abowd,J.M. Theu.s.censusbureauadoptsdierentialprivacy.In Proceedings ofthe24thACMSIGKDDInternationalConferenceonKnowledgeDiscovery38; DataMining NewYork,NY,USA,2018,KDD'18,ACM,pp.2867. [2] Acquisti,A.,andGross,R. Predictingsocialsecuritynumbersfrompublic data. ProceedingsoftheNationalAcademyofSciences106 ,2709,10975 10980. [3] Ahmad,W.U.,Rahman,M.M.,andWang,H. Topicmodelbasedprivacy protectioninpersonalizedwebsearch.In Proceedingsofthe39thInternational ACMSIGIRConferenceonResearchandDevelopmentinInformationRetrieval NewYork,NY,USA,2016,SIGIR'16,ACM,pp.1025. [4] Al-Karkhi,A.,Al-Yasiri,A.,andJaseemuddin,M. Non-intrusiveuser identityprovisioningintheinternetofthings.In Proceedingsofthe12thACMInternationalSymposiumonMobilityManagementandWirelessAccess NewYork, NY,USA,2014,MobiWac'14,ACM,pp.830. [5] Allodi,L.,Banescu,S.,Femmer,H.,andBeckers,K. Identifyingrelevantinformationcuesforvulnerabilityassessmentusingcvss.In Proceedingsof theEighthACMConferenceonDataandApplicationSecurityandPrivacy New York,NY,USA,2018,CODASPY'18,ACM,pp.119. [6] Anjum,A.,andRaschia,G. Banga:Anecientandexiblegeneralizationbasedalgorithmforprivacypreservingdatapublication. Computers6 ,1. [7] Arefi,M.N.,Alexander,G.,andCrandall,J.R. Piitracker:Automatic trackingofpersonallyidentiableinformationinwindows.In Proceedingsofthe 11thEuropeanWorkshoponSystemsSecurity NewYork,NY,USA,2018,EuroSec'18,ACM,pp.3:1:6. [8] Bahri,L. Identityrelatedthreats,vulnerabilitiesandriskmitigationinonline socialnetworks:Atutorial.In Proceedingsofthe2017ACMSIGSACConference onComputerandCommunicationsSecurity NewYork,NY,USA,2017,CCS '17,ACM,pp.2603. [9] Bird,S.,Klein,E.,andLoper,E. NaturalLanguageProcessingwithPython , 1sted.O'ReillyMedia,Inc.,2009. [10] Boyne,S.M. Dataprotectionintheunitedstates. TheAmericanJournalof ComparativeLaw66 ,suppl_1,299. [11] Bozorgi,M.,Saul,L.K.,Savage,S.,andVoelker,G.M. 
Beyondheuristics:Learningtoclassifyvulnerabilitiesandpredictexploits.In Proceedingsofthe 16thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandData Mining NewYork,NY,USA,2010,KDD'10,ACM,pp.105. 74


[12] Center,I.T.R. 2017annualreport.Tech.rep.,IdentityTheftResourceCenter, 2017. [13] Center,I.T.R. Databreachreports.Tech.rep.,IdentityTheftResourceCenter,2018. [14] Center,I.T.R. Databreaches,2018. [15] Child,J.T.,andStarcher,S.C. Fuzzyfacebookprivacyboundaries:Exploringmediatedlurking,vague-booking,andfacebookprivacymanagement. ComputersinHumanBehavior54 ,483490. [16] Choi,B.C.,andLand,L. Theeectsofgeneralprivacyconcernsandtransactionalprivacyconcernsonfacebookappsusage. Information&Management53 ,7 ,868.SpecialIssueonPapersPresentedatPacis2015. [17] Cormode,G.,Jha,S.,Kulkarni,T.,Li,N.,Srivastava,D.,andWang, T. Privacyatscale:Localdierentialprivacyinpractice.In Proceedingsofthe 2018InternationalConferenceonManagementofData NewYork,NY,USA, 2018,SIGMOD'18,ACM,pp.1655. [18] Dalkey,N.,andHelmer,O. Anexperimentalapplicationofthedelphi methodtotheuseofexperts. Managementscience9 ,363,458. [19] deBruijn,F.,andDekkers,H.L. Ambiguityinnaturallanguagesoftware requirements:Acasestudy.In InternationalWorkingConferenceonRequirements Engineering:FoundationforSoftwareQuality ,Springer,pp.233. [20] Douziech,P.-E.,andCurtis,B. Cross-technology,cross-layerdefectdetectioninitsystems:Challengesandachievements.In ProceedingsoftheFirstInternationalWorkshoponComplexfaUltsandFailuresinLargESoftwareSystems Piscataway,NJ,USA,2015,COUFLESS'15,IEEEPress,pp.21. [21] Dupuis,M. "wait,doiknowyou?":Alookatpersonalityandpreventingone's personalinformationfrombeingcompromised.In Proceedingsofthe5thAnnual ConferenceonResearchinInformationTechnology NewYork,NY,USA,2016, RIIT'16,ACM,pp.55. [22] Dwork,C. Dierentialprivacy:Asurveyofresults.In InternationalConference onTheoryandApplicationsofModelsofComputation ,Springer,pp.1. [23] FIRST .Cvssv3.0specicationdocument,2018. [24] Goga,O.,Venkatadri,G.,andGummadi,K.P. Thedoppelgangerbotattack:Exploringidentityimpersonationinonlinesocialnetworks.In Proceedingsof the2015InternetMeasurementConference NewYork,NY,USA,2015,IMC'15, ACM,pp.141. 75


[25] Graupner,H.,Jaeger,D.,Cheng,F.,andMeinel,C. Automatedparsingandinterpretationofidentityleaks.In ProceedingsoftheACMInternational ConferenceonComputingFrontiers NewYork,NY,USA,2016,CF'16,ACM, pp.127. [26] Gu,J.,Xu,Y.C.,Xu,H.,Zhang,C.,andLing,H. Privacyconcernsformobileappdownload:Anelaborationlikelihoodmodelperspective. DecisionSupport Systems94 ,1928. [27] Hall,H.K. Restoringdignityandharmonytounitedstates-europeanunion dataprotectionregulation. CommunicationLawandPolicy23 ,28,125. [28] Hay,M.,Machanavajjhala,A.,Miklau,G.,Chen,Y.,andZhang,D. Principledevaluationofdierentiallyprivatealgorithmsusingdpbench.In Proceedingsofthe2016InternationalConferenceonManagementofData NewYork, NY,USA,2016,SIGMOD'16,ACM,pp.139. [29] Hay,M.,Machanavajjhala,A.,Miklau,G.,Chen,Y.,Zhang,D.,and Bissias,G. Exploringprivacy-accuracytradeosusingdpcomp.In Proceedingsof the2016InternationalConferenceonManagementofData NewYork,NY,USA, 2016,SIGMOD'16,ACM,pp.2101. [30] Hofman,D.,Duranti,L.,andHow,E. Trustinthebalance:Dataprotection lawsastoolsforprivacyandsecurityinthecloud. Algorithms10 ,4717. [31] Holm,H.,andAfridi,K.K. Anexpert-basedinvestigationofthecommon vulnerabilityscoringsystem. Computers&Security53 ,1830. [32] Jeong,Y.,andKim,Y. Privacyconcernsonsocialnetworkingsites:Interplay amongpostingtypes,content,andaudiences. ComputersinHumanBehavior69 ,302310. [33] Kampanakis,P. Securityautomationandthreatinformation-sharingoptions. IEEESecurityPrivacy12 ,5Sept2014,42. [34] Kerschbaumer,C.,Crouch,L.,Ritter,T.,andVyas,T. Canwebuilda privacy-preservingwebbrowserwealldeserve? XRDS24 ,4July2018,40. [35] Kim,S.,andChung,Y.D. Ananonymizationprotocolforcontinuousanddynamicprivacy-preservingdatacollection. FutureGenerationComputerSystems . [36] Lesk,M. Automaticsensedisambiguationusingmachinereadabledictionaries: howtotellapineconefromanicecreamcone.In Proceedingsofthe5thannual internationalconferenceonSystemsdocumentation ,ACM,pp.24. [37] Li,K.,Lin,Z.,andWang,X. 
Anempiricalanalysisofusers'privacydisclosure behaviorsonsocialnetworksites. Information&Management52 ,75,882 891.Novelapplicationsofsocialmediaanalytics. 76


[38] Li,N.,Li,T.,andVenkatasubramanian,S. t-closeness:Privacybeyondkanonymityandl-diversity.In DataEngineering,2007.ICDE2007.IEEE23rd InternationalConferenceon ,IEEE,pp.106. [39] Li,N.,Qardaji,W.,Su,D.,Wu,Y.,andYang,W. Membershipprivacy: Aunifyingframeworkforprivacydenitions.In Proceedingsofthe2013ACM SIGSACConferenceonComputer38;CommunicationsSecurity NewYork,NY, USA,2013,CCS'13,ACM,pp.889. [40] Linstone,H.A.,andTuroff,M. Delphi:Abrieflookbackwardandforward. TechnologicalForecastingandSocialChange78 ,911,171219. [41] Liu,Q.,andZhang,Y. Vrss:Anewsystemforratingandscoringvulnerabilities. ComputerCommunications34 ,311,264273.SpecialIssueofComputerCommunicationsonInformationandFutureCommunicationSecurity. [42] Liu,Q.,Zhang,Y.,Kong,Y.,andWu,Q. Improvingvrss-basedvulnerabilityprioritizationusinganalytichierarchyprocess. JournalofSystemsandSoftware85 ,82,16991708. [43] Liu,Y.,Song,H.H.,Bermudez,I.,Mislove,A.,Baldi,M.,andTongaonkar,A. Identifyingpersonalinformationininternettrac.In Proceedings ofthe2015ACMonConferenceonOnlineSocialNetworks NewYork,NY,USA, 2015,COSN'15,ACM,pp.59. [44] Machanavajjhala,A.,He,X.,andHay,M. Dierentialprivacyinthewild: Atutorialoncurrentpractices38;openchallenges.In Proceedingsofthe2017 ACMInternationalConferenceonManagementofData NewYork,NY,USA, 2017,SIGMOD'17,ACM,pp.1727. [45] Martin,K.D.,Borah,A.,andPalmatier,R.W. Dataprivacy:Eectson customerandrmperformance. JournalofMarketing81 ,17,36. [46] Martin,R.,andChristey,S. Thesoftwareindustry's"cleanwateract"alternative. IEEESecurityPrivacy10 ,3May2012,24. [47] McDermott,Y. Conceptualisingtherighttodataprotectioninaneraofbig data. BigData&Society4 ,17,2053951716686994. [48] MITRE .Commonweaknessscoringsystem. [49] Moro,A.,Raganato,A.,andNavigli,R. Entitylinkingmeetswordsense disambiguation:auniedapproach. TransactionsoftheAssociationforComputationalLinguistics2 ,231. [50] Munaiah,N.,andMeneely,A. 
Vulnerabilityseverityscoringandbounties: Whythedisconnect?In Proceedingsofthe2NdInternationalWorkshoponSoftwareAnalytics NewYork,NY,USA,2016,SWAN2016,ACM,pp.8. 77


[51] NIST .Vulnerabilitymetrics,2018. [52] Ortiz,J.,Chang,S.-H.,Chih,W.-H.,andWang,C.-H. Thecontradiction betweenself-protectionandself-presentationonknowledgesharingbehavior. ComputersinHumanBehavior76 ,406416. [53] Park,E.H.,Kim,J.,andPark,Y.S. Theroleofinformationsecuritylearningandindividualfactorsindisclosingpatients'healthinformation. Computers& Security65 ,6476. [54] Pendleton,M.,Garcia-Lebron,R.,Cho,J.-H.,andXu,S. Asurveyon systemssecuritymetrics. ACMComput.Surv.49 ,4Dec.2016,62:1:35. [55] Rasool,A.,Tiwari,A.,Singla,G.,andKhare,N. Stringmatching methodologies:Acomparativeanalysis. REMText234567 ,1112,3. [56] Sandra,P. Communicationprivacymanagementtheory:Whatdoweknow aboutfamilyprivacyregulation? JournalofFamilyTheory&Review2 ,30, 175. [57] Song,S.,Wang,Y.,andChaudhuri,K. Puershprivacymechanismsfor correlateddata.In Proceedingsofthe2017ACMInternationalConferenceon ManagementofData NewYork,NY,USA,2017,SIGMOD'17,ACM,pp.1291 1306. [58] Staite,C. Portablesecureidentitymanagementforsoftwareengineering.In Proceedingsofthe32NdACM/IEEEInternationalConferenceonSoftwareEngineering-Volume2 NewYork,NY,USA,2010,ICSE'10,ACM,pp.325. [59] Sweeney,L. k-anonymity:Amodelforprotectingprivacy. InternationalJournal ofUncertainty,FuzzinessandKnowledge-BasedSystems10 ,0502,557. [60] Tesfay,W.B.,Hofmann,P.,Nakamura,T.,Kiyomoto,S.,andSerna, J. Privacyguide:Towardsanimplementationoftheeugdproninternetprivacy policyevaluation.In ProceedingsoftheFourthACMInternationalWorkshopon SecurityandPrivacyAnalytics NewYork,NY,USA,2018,IWSPA'18,ACM, pp.15. [61] Torky,M.,Meligy,A.,andIbrahim,H. Recognizingfakeidentitiesinonline socialnetworksbasedonaniteautomatonapproach.In 201612thInternational ComputerEngineeringConferenceICENCO Dec2016,pp.1. [62] Tramr,F.,Huang,Z.,Hubaux,J.-P.,andAyday,E. 
Dierentialprivacy withboundedpriors:Reconcilingutilityandprivacyingenome-wideassociation studies.In Proceedingsofthe22NdACMSIGSACConferenceonComputerand CommunicationsSecurity NewYork,NY,USA,2015,CCS'15,ACM,pp.1286 1297. 78


[63] Tsikerdekis,M.,andZeadally,S. Onlinedeceptioninsocialmedia. Commun.ACM57 ,9Sept.2014,72. [64] Ufuktepe,E.,andTuglular,T. Estimatingsoftwarerobustnessinrelation toinputvalidationvulnerabilitiesusingbayesiannetworks. SoftwareQualityJournal26 ,2Jun2018,455. [65] vanVelthoven,M.H.,Mastellos,N.,Majeed,A.,ODonoghue,J.,and Car,J. Feasibilityofextractingdatafromelectronicmedicalrecordsforresearch: aninternationalcomparativestudy. BMCMedicalInformaticsandDecisionMaking16 ,10. [66] Wang,J.A.,Guo,M.,Wang,H.,Xia,M.,andZhou,L. Ontology-based securityassessmentforsoftwareproducts.In Proceedingsofthe5thAnnualWorkshoponCyberSecurityandInformationIntelligenceResearch:CyberSecurityand InformationIntelligenceChallengesandStrategies NewYork,NY,USA,2009, CSIIRW'09,ACM,pp.15:1:4. [67] Woudenberg,F. Anevaluationofdelphi. Technologicalforecastingandsocial change40 ,21,131. [68] Yaqub,U.,Chun,S.A.,Atluri,V.,andVaidya,J. Analysisofpoliticaldiscourseontwitterinthecontextofthe2016uspresidentialelections. Government InformationQuarterly34 ,47,613626. 79