Citation
Cancer prediction in data mining

Material Information

Title:
Cancer prediction in data mining
Creator:
Al Khalaf, Nedhal Abdallah
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
v, 72 leaves : ; 28 cm.

Thesis/Dissertation Information

Degree:
Master's ( Master of Sciences)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer science

Subjects

Subjects / Keywords:
Cancer -- Diagnosis ( lcsh )
Data mining ( lcsh )
Cancer -- Diagnosis ( fast )
Data mining ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (M.S.)--University of Colorado Denver, 2011. Computer science
Bibliography:
Includes bibliographical references (leaves 69-72).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Nedhal Abdallah Al Khalaf.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
785950141 ( OCLC )
ocn785950141

Full Text

CANCER PREDICTION IN DATA MINING
by
Nedhal Abdallah Al Khalaf
B.S., King Saud University, 2007
A thesis submitted to the
University of Colorado Denver
In partial fulfillment
Of the requirements for the degree of
Master of Science
Computer Science
2011


This thesis for the Master of Science
Degree by
Nedhal Al Khalaf
has been approved
For recommendation to the Graduate Committee
Thesis Advisor
Professor Tom Altman
MS Committee
Professor Bogdan Chlebus
Professor Ilkyeun Ra
November 18, 2011
Date


Al Khalaf, Nedhal Abdallah (M.S., Computer Science)
Cancer Prediction in Data Mining
Thesis directed by Professor Tom Altman
ABSTRACT
Predicting a disease before it strikes is better than identifying it too
late. Data mining plays an important role in the medical field because of its
ability to discover useful information that helps in predicting disease risk.
This thesis discusses the role of data mining in the health care field and the
techniques that help doctors predict the chances of a patient developing
cancer. The techniques involve using predictive analysis to build a model
that calculates the likelihood of the disease. To support this topic, an
implementation is provided that builds a prediction model for two types of
cancer: breast cancer and ovarian cancer. The thesis evaluates the
implementation results by comparing them against the Gail algorithm and the
Risk of Ovarian Cancer algorithm. In this way, data mining supports health
care by predicting cancer risk prior to diagnosis.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Tom Altman


DEDICATION
I am proud to dedicate my Master's thesis to the five pillars in my life:
God, my parents, and my best friends. My life would not be amazing and I
would not have stepped to higher levels without all of you.
I might be lost in this life and depressed by its harmful shocks, but
once I stand to pray to you with all my heart, God Allah, I gain strength from
pleading to you to continue my way in this life journey.
Mom, Nabiha, you have always supported me and devoted your life, since I
was a baby, to raising me to learn as much as I can. All my warm wishes to
you.
Daddy, Abdullah, you always tell me "Keep going, there is no
impossible". Thank you for helping me reach higher academic levels and
supporting me with your precious advice and wise outlook on the future.
My best friend, Siham, thank you for looking after me and standing by me
at every step during my studies.
My best friend, Wafa, thank you for sharing all the special moments in my
life.


TABLE OF CONTENTS
Figures
Tables
Chapter
1. Inspiration
2. Introduction
3. Definitions
4. Prediction Analysis in data mining
4.1 Predictive data mining stages
4.2 Predictive Analysis Techniques
4.2.1 Decision trees
4.2.1.1 Decision trees as predictive models
4.2.1.2 Advantages of decision trees
4.2.1.3 Disadvantages of decision trees
4.2.2 Decision rules
4.2.3 Association rule mining
4.3 Why is Predictive Analysis useful?
5. Predictive analysis in cancer prediction
5.1 Predictive analysis steps in cancer prediction
5.2 The goals of cancer prediction
5.3 Predictive analysis techniques in cancer prediction
5.3.1 Decision trees
5.3.1.1 Decision trees in cancer prediction
5.3.1.1.1 Example on how the decision tree is used in cancer prediction
5.2.2 Decision rules
5.2.3 Association rules
6. Prediction models for two types of cancer
6.1 Breast Cancer
6.1.1 Breast Cancer risk factors
6.1.2 Gail breast cancer algorithm
6.1.2.1 How Gail breast cancer works
6.1.2.2 Disadvantages of Gail algorithm
6.1.2.3 How is Gail algorithm related to data mining
6.2 Ovarian Cancer
6.2.1 Ovarian Cancer risk factors
6.2.2 The Risk of Ovarian Cancer Algorithm
6.2.2.1 How the Risk of Ovarian Cancer Works
7. Contribution and Discussion
7.1 Results
7.2 Evaluation
7.3 Advantages and Disadvantages
7.4 Future work
7.5 Conclusion
Appendix
A. CANCER PREDICTOR EXPERT SYSTEM USER GUIDE
Bibliography


LIST OF FIGURES
Figure
3.1 KDD STEPS
3.2 DECISION TREE NODE TYPES
4.1 PREDICTIVE DATA MINING STAGES
4.2 PATIENTS DATABASE DECISION TREE
5.1 DECISION TREE AS PREDICTION MODEL FOR THE CANCER PREDICTOR CALCULATOR
5.2 DECISION TREE AS PREDICTION MODEL FOR UTERINE CANCER
7.1 SOFTWARE FLOWCHART


LIST OF TABLES
Table
4.1 Patients Information Database
5.1 Patient Disease History Table
5.2 Patient Uterine Activity Table
5.3 Patient Drugs Table
5.4 Patient Personal info Table
5.5 Patient Risk Factors Table
5.6 Final Predicted Cancer Risk for Patients
5.7 Prediction Decision Rules from the cancer predictor calculator
7.1 Race and Age
7.2 Age and First Birth Age
7.3 Age and Family History
7.4 Menstruation Age and Age


1. Inspiration
Nothing is more precious than having a healthy body. When I was at
school, I saw a TV program about a lady who had cancer. I felt sorry seeing
her in this situation, especially since she was undergoing chemotherapy and
all her hair was falling out. Also, my sister works in the radiation
department of a hospital, and she always tells me stories about the tragedy of
losing someone to cancer. This created a strong inspiration in my heart to
help people who suffer from this disease. Therefore, I decided to connect my
thesis topic, data mining, to the health care field.
As one of the most valuable fields in computer science, data mining
has played an important role in predicting outcomes based on hidden patterns
in databases. Using the prediction rule, which is one of the data mining
tools, I decided to implement a medical program that predicts the probability
of someone getting cancer. This software will hopefully help people estimate
their percentage risk of cancer so that they can take more precautions, such
as talking to a doctor about their risk factors or taking a medical test.
This should aid in discovering cancer at an earlier stage and thus fighting
it before it attacks their cells.


2. Introduction
Today, thousands of people suffer from the severe pain, noticeable
effects, and uncontrolled spread of cancer. Many hospitals and medical
institutions have established cancer centers in order to provide the best
therapy for cancer patients. Furthermore, many organizations and universities
around the world are conducting research to discover better cancer cures.
When a person consults an oncologist, the oncologist studies the
patient's cancer case and the risk factors that contributed to the disease.
All of the patient's information, including the risk factors, is then kept in
medical databases in order to maintain a medical history for the patient and
to make the case available for research purposes.
Because a massive amount of data is stored in huge databases in
hospitals and medical organizations, it becomes possible to detect hidden data
patterns that help summarize what causes the disease. The databases usually
contain information about diseases, patients, and other disease factors. Each
record in these databases holds sensitive information about the patient at
the time of contracting the disease, such as personal factors (age, family
history), social factors (race), and habits (smoking, drinking alcohol). It is
therefore beneficial to mine these health databases to extract useful
information about the disease.
How can these databases be mined to explore more useful information?
Computer science has served many sciences, and health care has also been able
to benefit from it. In particular, data mining, one of the largest fields in
computer science, has served the health care field well, mainly because of
its capability to detect hidden useful patterns in medical databases. In the
case of cancer, data mining techniques such as prediction rules and
association rules have been effective in predicting cancer risk and
associating its risk factors.
In this thesis, I will address data mining in the health care field and
its techniques that help doctors. I will then narrow the focus to the role of
data mining in a specific disease, cancer, which includes applying some data
mining techniques to build a prediction model that determines the likelihood
of getting cancer.


3. Definitions
In this chapter, a number of definitions are given for terms that will
be used in this thesis.
1. Data mining
Data mining can be defined as a branch of computer science that
involves the process of extracting useful information from large data sets or
databases (14). Data mining is the analysis stage of knowledge discovery in
databases. It is also known by other names such as knowledge extraction,
information discovery, information harvesting, data archeology, knowledge
discovery, and data pattern processing.
In general, data mining analyzes data from datasets and summarizes it
for use in many fields such as security, disease prediction, and stock
markets. Data mining is like gold mining: data is extracted from databases to
obtain useful information.
2. Knowledge Discovery in databases
Knowledge discovery in databases (KDD) is the overall process of
finding useful knowledge from data. Data mining is the step within KDD that
applies algorithms to data to extract patterns from it (4).


2.1 Knowledge Discovery in databases steps
KDD has several steps (14) as shown in Figure 3.1:
Figure 3.1 KDD STEPS
These steps include data preprocessing, data mining, and
post-processing. In the data preprocessing step, raw input data is processed
to make it more suitable for data mining (1). This includes combining data
from various sources, cleaning the data of noise and duplicate observations,
and data reduction, which focuses on selecting the features and records that
are relevant to the data mining technique. The second step is the data mining
step, which uses data classification or prediction in order to detect hidden
patterns in the data. The classification techniques involve clustering, while
the prediction techniques involve decision trees and decision rules. The
third step is data post-processing, which provides the interpretation of, or
inferences from, the data mining analysis of the previous step.
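As a minimal sketch of the three KDD steps described above, the following Python fragment cleans a small set of records, applies a trivial "mining" step, and interprets the result. The record fields, the sample values, and the function names are hypothetical and serve only as an illustration; they are not taken from the thesis software.

```python
from collections import Counter

# Hypothetical raw records: (patient_id, smoker, diagnosed). None marks a missing value.
raw_records = [
    (1, "yes", "yes"),
    (1, "yes", "yes"),   # duplicate observation
    (2, "no", "no"),
    (3, None, "yes"),    # noisy record with a missing value
    (4, "yes", "no"),
    (5, "no", "no"),
]

def preprocess(records):
    """Data preprocessing: drop duplicates and incomplete records, keep relevant features."""
    seen = set()
    cleaned = []
    for record in records:
        if record in seen or None in record:
            continue
        seen.add(record)
        # Data reduction: the patient id is not a useful feature for mining.
        cleaned.append(record[1:])
    return cleaned

def mine(cleaned):
    """Data mining: count how often each (smoker, diagnosed) pattern occurs."""
    return Counter(cleaned)

def postprocess(patterns):
    """Post-processing: turn raw counts into a readable summary."""
    return {f"smoker={s}, diagnosed={d}": n for (s, d), n in patterns.items()}

summary = postprocess(mine(preprocess(raw_records)))
```

A real preprocessing step would also merge several sources and might repair missing values instead of discarding them; the sketch only shows how the three stages hand data to each other.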


3. Prediction Analysis
Prediction analysis in data mining can be defined as the branch of data
mining that emphasizes predicting unknown or future values of some variables
based on other variables (23). Thus, prediction analysis predicts future
outcomes from historical data in databases in order to identify trends and
anticipate similar cases. The main element of predictive analysis is the
predictor, a variable that can be measured for an individual in order to
predict future outcomes. A collection of predictors is combined into a
prediction model. When this collection is analyzed, it can be used to
anticipate possible future outcomes with a reasonable level of reliability.
One example of predictive analysis is its application in cyber security.
By analyzing historical data, investigators were able to derive an important
conclusion: cyber attacks tend to follow physical ones. Following the April 1,
2001 mid-air collision between an American surveillance plane and a Chinese
fighter aircraft, groups of Chinese hackers organized and sustained an
extensive campaign of cyber attacks against American targets for one week.
Thus, using predictive analysis and historical data such as the above about
physical attacks or conflicts, investigators can predict a cyber attack after
a physical one and, therefore, take additional security steps after any
physical attack. In this thesis, predictive analysis, with the help of other
data mining techniques, will be used to aid in the prediction of cancer in
patients.


4. Predictive Modeling
Predictive modeling can be defined as the process used by predictive
analysis to create a statistical model that aids in forecasting future trends
and probabilities. A predictive model contains a number of predictors, which
are variables that affect future outcomes (24).
Predictive modeling amounts to using known data to predict unknown
values. For example, in disease prediction, a predictive model is built from
predictors such as the patient's family history and genes, and the patient's
disease risk is predicted from them. In this thesis, prediction models such
as decision trees, decision rules, and association rules will be used to aid
in the cancer prediction process.
5. Data warehouse
A data warehouse (DW) is defined, with respect to data mining, as a
repository of current and historical data in which a collection of data is
stored in huge databases for data mining and analysis purposes (18).
Data warehouses usually serve multiple subject areas. For instance,
data warehouses are used by search engines such as Google. These data
warehouses hold current and historical data collected from various sources.
Thus, Google's search engine draws on huge data warehouses covering different
subjects such as medicine, business, psychology, and others. An expert
interested in both biology and computer science can use Google's search
engine for both subjects, because a huge amount of information can be found
by mining Google's data warehouses.
6. Data mart
A data mart (DM) is a subset of a data warehouse consisting of smaller
databases that contain related data grouped together and dedicated to one
division or department (21). Data marts usually serve a particular subject
such as sales or medicine. For example, data about patients can be placed
into a data mart to make it accessible to doctors and other medical
researchers.
7. Decision trees
Decision trees in data mining can be defined as a hierarchy of rules that
can be presented graphically as a tree structure, allowing a set of data to
be classified according to these rules (19). In a decision tree, each leaf or
end node is assigned a class label. The internal nodes and the root node, on
the other hand, contain attribute test conditions that separate records of a
dataset that have different characteristics. Once the decision tree is built,
classifying a test record is straightforward. The test starts at the root
node by applying the test condition to the record and then follows the
appropriate branch depending on the outcome of the test. This leads either to
an internal node, where another test condition is applied, or to a leaf node.
The class label associated with the leaf node is then assigned to the record.
Decision trees usually show all the possible outcomes of a decision, and they
are very effective in classification and prediction, since they are
predictive models that make predictions through classification. In addition,
they consist of three types of nodes, which are presented in Figure 3.2 (11):
i. Decision node: these nodes are usually represented by squares
displaying the possible decisions that can be made. Lines that emerge
from these nodes present all the different choices available at a node.
ii. Chance node: these nodes are usually represented by circles
displaying chance outcomes or events that can occur and lead to two
or more outcomes.
iii. Terminal node: these nodes are usually represented by triangles, or
by lines that do not lead to further decision or chance nodes. Terminal
nodes display the final outcomes of the decision making process. These
nodes usually show a prediction for the problem or question that was
asked in the first place at the root node.
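The classification procedure described above (start at the root, apply the test condition, follow the matching branch, stop at a leaf) can be sketched as follows. This is only an illustration: the `Node` class, the attribute names, and the class labels are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class Node:
    """A leaf carries a class label; an internal node carries a test condition."""
    label: Optional[str] = None
    test: Optional[Callable[[dict], str]] = None
    branches: Dict[str, "Node"] = field(default_factory=dict)

def classify(node: Node, record: dict) -> str:
    """Start at the root, apply each test, follow the matching branch, stop at a leaf."""
    while node.label is None:
        outcome = node.test(record)
        node = node.branches[outcome]
    return node.label

# Hypothetical two-level tree over illustrative attributes.
tree = Node(
    test=lambda r: r["family_history"],
    branches={
        "yes": Node(label="high risk"),
        "no": Node(
            test=lambda r: r["gene_mutation"],
            branches={"yes": Node(label="high risk"), "no": Node(label="low risk")},
        ),
    },
)

result = classify(tree, {"family_history": "no", "gene_mutation": "yes"})
```

Each record follows exactly one path, so only the attributes tested along that path are ever inspected.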
Figure 3.2 DECISION TREE NODE TYPES (11)
8. Decision rules
Decision rules can be defined as a set of rules that can be extracted
from a decision tree or a dataset to predict the class to which a record of a
database belongs (14).
Decision rules provide an easily understandable and general
presentation of the knowledge contained in a dataset. In addition, they give
all the possible decisions and predictions that can be made in a specific
case. For example, in the medical field, decision rules can be used by
oncologists when reviewing a patient's case to predict whether he or she has
a cancer risk. Based on the risk factors present in that case, including gene
mutations and any family history, the decision rules predict the cancer risk
for that patient.


4. Prediction Analysis in data mining
Predictive analysis is one of the more popular data mining techniques.
It has been applied in many fields especially the medical one. In this chapter,
a detailed explanation will be provided regarding its stages and techniques.
4.1 Predictive data mining stages
Prediction is one of the most important tasks of data mining. A data
mining model such as a decision tree is constructed from the available data
set, and inferences drawn from it are used to predict the behavior of new
data. Predictive analysis is built on data mining techniques and methods.
These techniques access large databases, extract data from them, and then
process the data with algorithms to discover hidden patterns and predictive
information.
Predictive data mining consists of several stages, as shown in Figure
4.1. These stages give a complete picture of all aspects of data mining (1).
Looking at Figure 4.1, the process starts with collecting data from its
sources, such as a data warehouse or a data mart. Data collection from a data
warehouse includes defining the features relevant to the field where
predictive data mining will be employed and preparing a storage file in order
to document these features. It also involves data preprocessing, which
consists of data cleaning, data transformation, data reduction, and securing
the data to protect it from corruption. The following step is data
exploration, which applies preparatory analysis to the data to prepare it for
mining. The next procedure is feature selection or reduction. The third main
step is mining, or model building for prediction. The final step is data
post-processing and interpretation, which consists of drawing inferences and
conclusions from the previous step.


Figure 4.1 PREDICTIVE DATA MINING STAGES (1)


4.2 Predictive Analysis Techniques
Many techniques help predictive analysis predict outcomes and
identify future trends by analyzing huge amounts of data with various
variables, but only the ones used in this thesis will be explained in depth.
These techniques include the following:
i. Decision trees.
ii. Decision rules.
iii. Association rules.
4.2.1 Decision trees
Decision trees can be used as classifiers to decide on a suitable action
from a group of predefined actions. In addition, they have been widely applied
to different problems such as medical diagnosis, disease prediction, and
credit card risk assessment.
4.2.1.1 Decision trees as predictive models
From the perspective of predictive analysis, decision trees can be used
as a predictive model to reach a conclusion about an item's value based on
observations of it. A predictive model focuses on analyzing patterns in
historical data in order to turn data into actionable decisions.
The following example shows how a decision tree is used as a predictive
model. Suppose that there is a historical patient data set that consists of
patients who were diagnosed with breast cancer in the past. In order to
predict whether a person visiting the doctor is at risk of getting breast
cancer in the future, a decision tree can be built from that historical
dataset. The variables that help predict cancer in the decision tree are the
risk factors, which appear as the predictors in the tree. These predictors
help reach a prediction of whether a patient who visits the doctor is at risk
of cancer in the future or not. Thus, the decision tree is drawn by first
mining the dataset and extracting the useful attributes that help in the
prediction process, which are the risk factors. The patients' database is
represented in Table 4.1.


Table 4.1 Patients Information Database
Patient ID#   Family history    Medical history in    BRCA1 or BRCA2   High disease
              (degree)          one of the breasts    mutation         risk
345679894     First             Yes                   Yes              Yes
356890322     No                No                    Yes              Yes
367345890     Second            Yes                   No               Yes
378436888     First             Yes                   Yes              Yes
388094671     No                No                    No               No
395537120     Third             No                    No               No
334298018     First             No                    Yes              Yes
312475490     Second            No                    Yes              Yes
309456731     Third             Yes                   No               Yes
The database presented in Table 4.1 can be visualized graphically by
the decision tree in Figure 4.2.


Figure 4.2 PATIENTS DATABASE DECISION TREE
Based on the decision tree in Figure 4.2, oncologists can predict
whether any patient who visits them in the future is at risk of cancer. The
prediction starts after taking the patient's information, including family
history of cancer, medical history, and gene mutations. These pieces of
information are the predictors in the decision tree in Figure 4.2 that help
reach a prediction for the patient's case.
The predictive analysis in this example used the decision tree as a
technique to predict the disease risk, and the predictor variables in the
tree led it to a single prediction per patient. Together, the predictor
variables (family history, medical history, and gene mutation) form the
predictive model expressed by the decision tree. Based on these variables, it
is possible to predict the person's risk.
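To make the example concrete, the sketch below encodes the nine rows of Table 4.1 and a small hand-written decision tree, then checks the tree against the table. The tree shown here is an assumption: it is simply one tree that agrees with every row of the table, and it may differ from the tree drawn in Figure 4.2. The function and variable names are illustrative.

```python
# Rows of Table 4.1: (family history degree, medical history, BRCA mutation, high risk).
PATIENTS = [
    ("First", "Yes", "Yes", "Yes"),
    ("No", "No", "Yes", "Yes"),
    ("Second", "Yes", "No", "Yes"),
    ("First", "Yes", "Yes", "Yes"),
    ("No", "No", "No", "No"),
    ("Third", "No", "No", "No"),
    ("First", "No", "Yes", "Yes"),
    ("Second", "No", "Yes", "Yes"),
    ("Third", "Yes", "No", "Yes"),
]

def predict_high_risk(family_history, medical_history, brca_mutation):
    """Hand-written tree (an assumption, not Figure 4.2): test the mutation, then the history."""
    if brca_mutation == "Yes":
        return "Yes"
    if medical_history == "Yes":
        return "Yes"
    # Family history is not needed to reproduce the labels in this small table.
    return "No"

# Rows where the tree disagrees with the recorded "high risk" label.
mismatches = [row for row in PATIENTS if predict_high_risk(*row[:3]) != row[3]]
```

Here `mismatches` is empty, so the sketch reproduces the risk column for all nine patients. With only nine records this says nothing about how well such a tree would generalize to new patients.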
4.2.1.2 Advantages of decision trees:
1. Decision trees are considered to be self-explanatory, which makes
them simple to follow. A decision tree with a reasonable number of
leaves is not difficult for non-experts in data mining to understand.
2. Decision trees can deal easily with both nominal and numeric attributes
of a dataset.
3. Decision trees can deal with databases or datasets that may contain
some errors.
4. Decision trees are rich enough to represent graphically any
discrete-value classifier.
5. Datasets that have missing values can be handled easily by decision
trees.
6. Decision trees are attractive when classification is costly, because
they only ask for the values of the features along one path from the
root node to a leaf.
7. Decision trees are graphic representations showing all decision
alternatives, potential outcomes, and events that might occur by
chance. This graphical presentation aids in understanding complex
sequences of decisions and the dependencies among them.
8. Decision trees are complementary, which means they can be effectively
employed in many fields in conjunction with other tools used in that
field. For example, decision trees applied in cancer prediction can
evaluate the percentage risk.
9. Decision trees are fast methods for both development and prediction (11).
4.2.1.3 Disadvantages of decision trees:
1. Decision trees may suffer from the fragmentation problem when there
are many relevant attributes.
2. Decision trees perform well when there are few relevant attributes,
but their performance degrades when there are complex interactions
between attributes (11).
4.2.2 Decision rules:
Decision rules are a common data mining technique in which each rule
takes the form of a Horn clause such as:
If Condition_1 and Condition_2 ... Then Class
The rules are presented in disjunctive normal form:
$R = (r_1 \lor r_2 \lor \dots \lor r_k)$
where $R$ is the rule set and each $r_i$ is a rule, or disjunct. Each
decision rule can be written as:
$r_i : (\mathit{Condition}_i) \rightarrow y_i$    (4.1)
The left-hand side of the rule in (4.1) is called the precondition or
rule antecedent, and it consists of a conjunction of attribute tests:
$\mathit{Condition}_i = (A_1 \; op \; v_1) \land (A_2 \; op \; v_2) \land \dots \land (A_k \; op \; v_k)$    (4.2)
Each pair $(A_j, v_j)$ is an attribute-value pair, where $A_j$ is an
attribute and $v_j$ is its value, and $op$ is a logical operator chosen from
the set $\{=, \neq, <, >, \leq, \geq\}$. Each attribute test
$(A_j \; op \; v_j)$ is called a conjunct. The right-hand side of the rule is
the rule consequent, which contains the predicted class $y_i$.
Moreover, if the precondition of a rule r is satisfied by the attributes
or columns of a record x in a dataset, then the rule r covers the record x.
In addition, r is said to be triggered or fired when it covers a record. The
variables in the precondition are the predictor variables that together help
reach the right prediction (14).
Decision rules can be used in many fields such as disease prediction,
clinical diagnosis, and academic awards, where they help choose the
candidates with the most outstanding achievements. Decision rules can be
derived or generated from the database or from the decision tree itself,
which is very useful in the prediction process.
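The notions of precondition, conjunct, and coverage described in this section can be illustrated with a short sketch. The `Rule` class, the attribute names, and the threshold values are hypothetical; applying the rules in order and returning the class of the first rule that fires is one possible design choice, assumed here for simplicity.

```python
import operator

# Logical operators allowed in an attribute test (A op v).
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

class Rule:
    """r: (A_1 op v_1) and ... and (A_k op v_k) -> y."""

    def __init__(self, conjuncts, predicted_class):
        self.conjuncts = conjuncts          # list of (attribute, op, value)
        self.predicted_class = predicted_class

    def covers(self, record):
        """A rule covers a record when every conjunct of its precondition holds."""
        return all(OPS[op](record[attr], value) for attr, op, value in self.conjuncts)

def classify(rules, record, default=None):
    """Return the class of the first rule that fires, or a default if none does."""
    for rule in rules:
        if rule.covers(record):
            return rule.predicted_class
    return default

# Hypothetical rule set; attribute names and thresholds are illustrative only.
rules = [
    Rule([("gene_mutation", "=", "yes")], "high risk"),
    Rule([("family_history", "=", "yes"), ("age", ">=", 50)], "high risk"),
]

record = {"gene_mutation": "no", "family_history": "yes", "age": 62}
prediction = classify(rules, record, default="low risk")
```

In this example the first rule does not cover the record, the second one does, and its consequent becomes the prediction.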


4.2.3 Association rule mining:
Association rules are helpful in detecting relations among seemingly
unrelated data in a database or data repository, and in discovering patterns
of relationships between the columns or attributes of the dataset and its
observations (16).
Association rule mining was proposed by Agrawal et al. in 1993. It
can be used in many fields such as market basket analysis,
telecommunication networks, market and risk management, and analyzing
the associations among rules for disease prediction.
An association rule has two main parts: an antecedent (if) and a
consequent (then). Analyzing the relationships expressed by association
rules helps detect patterns that aid in the prediction process (16). An
association rule has the form X=>Y, where X is the antecedent, which combines
a set of predicates obtained from exploring the dataset, and Y is the
consequent, which has only one predicate. The rule shows the relationship
between the antecedent and the consequent.
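A rule X=>Y is usually judged by two standard measures, support and confidence. These measures are not discussed in the text above and are introduced here as an assumption about how such rules would be evaluated in practice. The sketch below computes both on a small set of hypothetical patient records; the items and values are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) = support(X and Y) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical patient records expressed as sets of attribute=value items.
transactions = [
    {"smoker=yes", "family_history=yes", "cancer=yes"},
    {"smoker=yes", "family_history=no", "cancer=yes"},
    {"smoker=no", "family_history=yes", "cancer=no"},
    {"smoker=no", "family_history=no", "cancer=no"},
    {"smoker=yes", "family_history=yes", "cancer=no"},
]

X = {"smoker=yes"}          # antecedent
Y = {"cancer=yes"}          # consequent with a single predicate
rule_support = support(transactions, X | Y)
rule_confidence = confidence(transactions, X, Y)
```

Two of the five records contain both X and Y, and three contain X, so the rule has support 2/5 and confidence 2/3 on this toy data.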
4.3 Why is Predictive Analysis useful?
Predictive analysis is one of the most effective branches of data
mining. Together with its techniques, it can extract information from the
data in databases and use this information to predict future trends in the
particular field where it is applied. Predicting future outcomes helps
providers in that field improve their services by taking advantage of the
predicted outcomes, take precautions against any predicted risks, and avoid
repeating mistakes made before the prediction was available.


5. Predictive analysis in cancer prediction
Data mining techniques have a promising role in predicting diseases,
especially cancer. In this chapter, a number of these techniques will be
applied to the cancer predictor software implemented in this thesis to
clarify their roles in cancer prediction.
5.1 Predictive analysis steps in cancer prediction
Predictive analysis has an important role in the prediction of a disease
outcome. Based on the factors associated with a specific disease, such as
age, family history, and genes, predictive analysis can predict the disease
itself.
Given a historical database of patients who previously had the disease,
disease prediction proceeds through the following steps:
1. Given historical patient data from medical databases that show the
factors that help the patient to get cancer such as gene mutations,
family history, life style, etc, the data mining will first prepare the data
for mining process.
2. The data preparation include cleaning data from any noise and
duplicate observations, and data reduction that focuses on selecting
useful features and records that are relevant to the data mining
technique.
24


3. The next step is the data mining step that consists of using data
prediction in order to detect hidden patterns in the data.
4. The data mining will extract the information from those medical
database to reach to useful observations in the cancer prediction for a
particular patient that have the some risk factors as patients, whose
information had been extracted.
5. Looking at both the current patient information and the extracted
patient record information that had cancer before, if the current patient
has the same risk factors for cancer as the extracted risk factors,
cancer might be predicted for the current patient.
6. Predictive analysis with the support of other techniques such as the
decision tree, neural networks, and other techniques and using
historical medical databases for cancer patients will predict cancer for
the current patient based on other patients who had have the same
factors and as a result have been diagnosed with cancer.
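The data preparation described in steps 1 and 2 can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name `prepare_records`, the field names, and the sample records are all hypothetical, and "noise" is simplified here to mean a missing value.

```python
def prepare_records(records, features, id_field="patient_id"):
    """Clean and reduce raw patient records before mining.

    Cleaning drops records with a missing feature value and repeated
    patient IDs; reduction keeps only the ID and the selected features.
    """
    seen_ids = set()
    prepared = []
    for record in records:
        # Cleaning: skip noisy records that lack a selected feature.
        if any(record.get(name) is None for name in features):
            continue
        # Cleaning: skip duplicate observations of the same patient.
        if record[id_field] in seen_ids:
            continue
        seen_ids.add(record[id_field])
        # Reduction: keep only the attributes relevant to the mining step.
        prepared.append({name: record[name] for name in (id_field, *features)})
    return prepared


# Hypothetical raw data, invented for illustration only.
raw_records = [
    {"patient_id": 1, "age": 50, "family_history": "first", "phone": "n/a"},
    {"patient_id": 1, "age": 50, "family_history": "first", "phone": "n/a"},
    {"patient_id": 2, "age": None, "family_history": "third", "phone": "n/a"},
    {"patient_id": 3, "age": 66, "family_history": "second", "phone": "n/a"},
]
clean_records = prepare_records(raw_records, ["age", "family_history"])
```

In this sketch the repeated record for patient 1 and the incomplete record for patient 2 are dropped, and the irrelevant `phone` attribute is removed from the records that remain.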
5.2 The goals of cancer prediction
The basic goal of cancer prediction differs from the goals of cancer
diagnosis. In cancer prediction and prognosis, the cancer researcher is
mostly concerned with the following three predictive foci:
1. Risk evaluation for cancer, or the prediction of cancer susceptibility.
2. The prediction of cancer recurrence.
3. The prediction of the patient's lifetime after cancer diagnosis, meaning
how many months the patient may survive (2).
In the first case, the cancer researcher attempts to predict the
probability of cancer before the disease occurs in the patient. In the
second case, the researcher attempts to predict the probability of
redeveloping cancer after the apparent resolution of the disease or
after the patient has already been diagnosed with cancer. In the third case,
the researcher attempts to predict an outcome of cancer, such as life
expectancy, survivability, progression, or tumor sensitivity to drugs,
after diagnosis (2). In this thesis, however, the focus is only on the
first predictive focus, which is predicting cancer prior to its occurrence.
5.3 Predictive analysis techniques in cancer prediction
Predictive analysis is a method that relies on the help of other
techniques. The following data mining techniques support predictive
analysis in cancer prediction:
5.3.1 Decision trees
Decision trees have been used as a data mining technique in the medical
field for many years because of their wealth of classification rules and
their clarity. Cancer prediction is one of their uses in that field.
In the past, cancer researchers have applied many techniques, including
decision trees, to cancer prognosis and prediction.
Formally, decision trees are structured graphs or flow charts of a series
of decisions, represented by the nodes in the tree, and all possible results
or consequences of these decisions, represented by leaves or branches in
the tree. Decision trees are mainly used to make a plan for reaching a
goal, and this goal depends on the reason for using the tree. For example,
if the decision tree is used for disease prediction, then the tree is built
primarily to achieve the disease prediction goal.
5.3.1.1 Decision trees in cancer prediction
Given a database that includes medical information about patients, such as
age, date of birth, and medical history, cancer researchers have found
decision trees applicable and efficient for predicting the groups of
patients at high risk of cancer. After the decision tree mines the
information required for cancer prediction from the medical database, its
predictive model classifies the patients in the database into groups by
predicting which group each case belongs to. Some patients will be
predicted to be in the group with high risk of cancer and others in groups
with less risk.
In order to construct the decision tree for cancer prediction, the
following steps are taken:
1. Some patients consult a doctor to check whether they might develop
cancer in the future, so the doctor performs tests and examinations
and asks the patients about factors that contribute to cancer occurrence.
2. The patient information is then stored in a patient database that
also holds data about other patients who consulted a doctor for the same
reason, which is cancer prediction.
3. Depending on the type of cancer, cancer researchers specify the risk
factors that aid in diagnosing or predicting the cancer. The patients'
database is then mined for these risk factors to extract useful
patterns from the attributes that represent them.
4. The decision tree is built from the data mined from the database.
The nodes of the decision tree are the cancer's risk factors, which are
attributes of the database, and the branches of a particular node represent
all the possible values of that node.
5. The leaf nodes are the final groups, and a patient is predicted to
be in one of them, e.g. a group of patients with high cancer risk and other
groups for lower-risk and non-risk patients.
In the decision tree graph, the nodes represent the decisions that predict
the cancer risk for a particular patient. In the prediction process, the
cancer researcher walks through the decision tree, starting from the root
and ending at one leaf node. Based on the decision nodes that a patient
meets along the way, her cancer risk is predicted at the leaf node. Thus,
the leaf node shows whether the patient is at high risk or no risk of
cancer in the future.
Applying the decision tree to the cancer predictor calculator is very
useful in predicting cancer, since decision trees are prediction models
consisting of predictors that contribute to predicting the class of cancer
risk (low, medium, high) to which a person belongs. The predictors in the
decision tree are the main risk factors of the cancer predictor calculator,
which have a strong role and impact in predicting cancer.
Figure 5.1: DECISION TREE AS PREDICTION MODEL
FOR THE CANCER PREDICTOR CALCULATOR
5.3.1.1.1 Example of how the decision tree is used in cancer prediction
Suppose the following patients' database is provided after the patients
visit a medical institution to check whether they are at risk of uterine
cancer. This database consists of four tables that store the patients'
information, such as their medical history, uterine history, drug
consumption, and other personal data. The first table of the database,
Table 5.1, represents the patient's family history. The family history
involves types of cancer that increase the risk of uterine cancer, such as
ovarian cancer, colorectal cancer, and uterine cancer.
Table 5.1 Patient Disease History Table
Patient ID  Ovarian cancer   Uterine cancer family   Number of affected   Colorectal cancer
            family history   history degree          relatives            family history
1458349     Sister           first                   1                    Yes
1478020     Daughter         third                   1                    No
1356780     Mother           first                   2                    Yes
1344673     Mother           second                  1                    No
1267340     Daughter         first                   3                    No
1259875     Sister           third                   1                    Yes
1145689     Mother           third                   2                    No
1175934     Sister           second                  2                    No
The second table of the database, Table 5.2, represents the patient's
uterine activity. Uterine activity here means the activities that the
uterus performs, such as pregnancy history, menstrual periods, and
menopause.
Table 5.2 Patient Uterine Activity Table
Patient ID  No reproductive   Menstrual period   Menopause
            history           before age 12      after age 55
1458349     Yes               Yes                Yes
1478020     Yes               Yes                Yes
1356780     No                Yes                No
1344673     No                No                 No
1267340     No                No                 No
1259875     Yes               No                 Yes
1145689     No                Yes                No
1175934     No                No                 Yes
The third table of the database, Table 5.3, represents the patient's drug
consumption history. These are the drugs and treatments that a patient has
taken for long periods, such as the following:
* Estrogen: a kind of medicine that can be made by the body or in
medical laboratories. It aids in developing female sex characteristics
and also helps bone growth.
* Tamoxifen: a type of medicine used to treat certain kinds of breast
cancer. In addition, it can be used to prevent breast cancer in women who
have had ductal carcinoma in situ, which means abnormal cells in the ducts
of the breast, and it can be taken by women who are at high risk of breast
cancer.
* Radiation therapy: the use of high-energy radiation from x-rays,
neutrons, protons, gamma rays, and other types of radiation to destroy
cancer cells and shrink cancer tumors inside the body (26).
Table 5.3 Patient Drugs Table
Patient ID  Estrogen      Tamoxifen     Radiation
            consumption   consumption   therapy
1458349     No            Yes           Yes
1478020     Yes           Yes           Yes
1356780     No            Yes           No
1344673     Yes           No            Yes
1267340     No            Yes           No
1259875     Yes           No            Yes
1145689     No            Yes           Yes
1175934     Yes           No            Yes
The fourth table of the database, Table 5.4, represents the patients'
personal information, such as age, obesity, and smoking.
Table 5.4 Patient Personal Info Table
Patient ID  Age   Obese   High fat diet   Smoker
1458349     50    Yes     No              Yes
1478020     33    No      Yes             Yes
1356780     66    Yes     Yes             No
1344673     28    No      No              Yes
1267340     18    Yes     No              No
1259875     70    Yes     Yes             Yes
1145689     68    Yes     Yes             Yes
1175934     59    No      No              Yes
Looking at the previous four tables, in order to predict a patient's
uterine cancer risk with the decision tree given her information, such as
medical history, reproductive history, and drugs used, data mining first
mines the patients' database to extract useful information that helps in
the prediction process. In this case, the most helpful information for
predicting uterine cancer risk is the set of factors that contribute most
to developing the cancer. Table 5.5 shows these factors:
Table 5.5 Patient Risk Factors Table
Patient ID  Uterine cancer   Number of affected   Reproductive   Estrogen      Tamoxifen
            family history   relatives            history        consumption   consumption
1458349     first            1                    Yes            No            No
1478020     third            1                    Yes            Yes           Yes
1356780     first            2                    No             No            Yes
1344673     second           1                    No             Yes           No
1267340     first            3                    No             No            Yes
1259875     third            1                    Yes            Yes           No
1145689     third            2                    Yes            No            No
1175934     second           2                    No             Yes           No
Moreover, with the help of historical databases of patients who were
diagnosed with the same type of cancer in the past, it becomes possible to
predict for each patient in Table 5.5 whether she is at high risk of
developing uterine cancer.
Using the decision tree, a prediction model consisting of the following
predictor variables was designed, as shown in Figure 5.2, to predict the
risk class to which the previous patients belong:
* Family history.
* Number of relatives affected by uterine cancer.
* Reproductive history.
* Estrogen consumption.
* Tamoxifen consumption.
Based on these predictors in the decision tree, the patients in the previous
database are classified into low, medium, and high risk of uterine cancer,
as shown in Table 5.6.
Figure 5.2: DECISION TREE AS PREDICTION MODEL FOR
UTERINE CANCER
Table 5.6: Final Predicted Cancer Risk for Patients
Patient ID  Uterine cancer   Number of affected   Reproductive   Estrogen      Tamoxifen     Predicted
            family history   relatives            history        consumption   consumption   risk
1458349     First            1                    Yes            No            No            Medium
1478020     Third            1                    Yes            Yes           Yes           Medium
1356780     First            2                    No             No            Yes           High
1344673     Second           1                    No             Yes           No            High
1267340     First            3                    No             No            Yes           High
1259875     Third            1                    Yes            Yes           No            Medium
1145689     Third            2                    Yes            No            No            Low
1175934     Second           2                    No             Yes           No            High
From the decision tree above, it is straightforward to predict the uterine
cancer risk for any patient who visits the doctor in the future to see
whether she is at high risk of uterine cancer. For example, if a patient X
has two first-degree relatives who have had uterine cancer, then the
decision tree in Figure 5.2 predicts that she is at high risk of uterine
cancer.
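Walking a decision tree from the root to a leaf can be sketched in Python as follows. The tree below is hypothetical: it is not the tree drawn in the thesis figures, only one small tree that happens to reproduce the predicted risk column of Table 5.6. The field names and the bucketing of the number of affected relatives are likewise assumptions made for illustration.

```python
# Hypothetical tree: an internal node is a dict that tests one attribute,
# and a leaf is a risk class string.
TREE = {
    "attribute": "family_history_degree",
    "branches": {
        "First": {
            "attribute": "affected_relatives",
            # Assumed bucketing of the numeric attribute into two branches.
            "bucket": lambda count: ">=2" if count >= 2 else "<2",
            "branches": {">=2": "High", "<2": "Medium"},
        },
        "Second": "High",
        "Third": {
            "attribute": "estrogen",
            "branches": {"Yes": "Medium", "No": "Low"},
        },
    },
}


def predict_risk(tree, patient):
    """Walk from the root to a leaf, following the patient's attribute values."""
    node = tree
    while isinstance(node, dict):
        value = patient[node["attribute"]]
        bucket = node.get("bucket")
        if bucket is not None:
            value = bucket(value)
        node = node["branches"][value]
    return node


# Rows of Table 5.6: (ID, degree, affected relatives, estrogen, predicted risk).
TABLE_5_6 = [
    (1458349, "First", 1, "No", "Medium"),
    (1478020, "Third", 1, "Yes", "Medium"),
    (1356780, "First", 2, "No", "High"),
    (1344673, "Second", 1, "Yes", "High"),
    (1267340, "First", 3, "No", "High"),
    (1259875, "Third", 1, "Yes", "Medium"),
    (1145689, "Third", 2, "No", "Low"),
    (1175934, "Second", 2, "Yes", "High"),
]

predictions = {
    pid: predict_risk(TREE, {
        "family_history_degree": degree,
        "affected_relatives": relatives,
        "estrogen": estrogen,
    })
    for pid, degree, relatives, estrogen, _ in TABLE_5_6
}
```

With this assumed tree, a patient with two first-degree relatives affected by uterine cancer reaches the "High" leaf, which agrees with the example in the text.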
5.3.2 Decision rules
Decision rules have also been used as a data mining technique in the
medical field. In the clinical field, decision rules used for prediction are
usually called clinical prediction rules or clinical decision rules.
Clinical prediction rules, or risk scores, are tools designed to support
health care professionals in making decisions while they are providing
care for their patients. These rules contain predictors, which are variables
obtained from the patient's history, physical tests or examinations, disease
characteristics, and patient characteristics (6). In addition, these rules
have been applied in recent years to aid decision-making relevant to
prognosis.
Applying clinical prediction rules to the cancer predictor calculator
is effective in predicting cancer because decision rules are among the
prediction models that have assisted cancer prediction. The predictors
that contribute to predicting cancer in these rules are mainly the risk
factors of the disease. For instance, the following table shows clinical
prediction rules that can be extracted from the ovarian cancer part of the
cancer predictor calculator:
Table 5.7: Prediction Decision Rules from the cancer predictor calculator
Prediction rule preconditions                                             Ovarian cancer predicted result
If (Talcum_Powder = "Y") ∧ (Fertility_Drugs = "Y")                        Medium Risk
If (Age >= 80) ∧ (Reproductive_History = "Y")                             Low Risk
If (Tubal_Ligation = "Y") ∧ ((Exercise = "Y") ∨ (Family_History = "Y"))   No Risk
If (Period_Age < 12) ∧ (Reproductive_History = "Y")                       Low Risk
If (HRT = "Y") ∧ (Reproductive_History = "Y")                             Low Risk
If (Ovarian_Removed = "Y")                                                No Risk
If (Period_Age < 12) ∧ (HRT = "Y")                                        Medium Risk
If (Overweight = "Y") ∧ (Endometriosis_Issue = "Y")                       Medium Risk
If (Smoker = "Y") ∧ (Exercise = "Y")                                      No Risk
If (Gene_Mutation = "Y") ∧ (Fertility_Drugs = "Y")                        Medium Risk
   (for any genes such as BRCA1, BRCA2, etc.)
Another example comes from the uterine cancer example given earlier in
Chapter 5. The prediction rules that can be extracted from the patients'
database are the following:
* If (Family_History_Degree == "First") ∧ (Affected_Relatives >= 2) Then
Patient is in the class of high risk of cancer.
* If (Family_History_Degree == "Third") ∧ (Estrogen_Consumption == "Yes") Then
Patient is in the class of medium risk of uterine cancer.
* If (Family_History_Degree == "First") ∧ (Reproductive_History == "No") Then
Patient is in the class of high risk of uterine cancer.
* If (Family_History_Degree == "Second") ∧ (Estrogen_Consumption == "Yes") Then
Patient is in the class of medium risk of uterine cancer.
* If (Reproductive_History == "No") ∧ ((Estrogen_Consumption == "Yes")
∨ (Tamoxifen_Consumption == "Yes")) Then
Patient is in the class of medium risk of uterine cancer.
The preconditions of the previous prediction rules consist of
predictors, which are the attributes of the patients' database
described in the earlier section. These predictors form the prediction model
for each prediction rule and thus support predicting uterine cancer. These
prediction decision rules will be used by oncologists in reviewing patients'
cases to predict their uterine cancer risk. For example, suppose there is a
patient X whose uterine cancer risk is to be predicted from the previous
prediction decision rules. This patient has no reproductive history and
consumes both estrogen and Tamoxifen. Consequently, patient X is predicted
to be at medium risk of uterine cancer based on the last decision rule.
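The five uterine cancer rules listed above can be encoded as predicate and result pairs, as in the following Python sketch. The field names and the function `matching_risk_classes` are hypothetical. Several rules can match the same patient, and the text does not say how such cases are resolved, so the sketch simply returns the class of every matching rule in the order the rules are listed.

```python
# Each rule pairs a precondition with the risk class it assigns,
# in the same order as the rules listed in the text.
RULES = [
    (lambda p: p.get("family_history_degree") == "First"
     and p.get("affected_relatives", 0) >= 2, "high"),
    (lambda p: p.get("family_history_degree") == "Third"
     and p.get("estrogen") == "Yes", "medium"),
    (lambda p: p.get("family_history_degree") == "First"
     and p.get("reproductive_history") == "No", "high"),
    (lambda p: p.get("family_history_degree") == "Second"
     and p.get("estrogen") == "Yes", "medium"),
    (lambda p: p.get("reproductive_history") == "No"
     and (p.get("estrogen") == "Yes" or p.get("tamoxifen") == "Yes"), "medium"),
]


def matching_risk_classes(patient):
    """Return the risk class of every rule whose precondition holds."""
    return [risk for condition, risk in RULES if condition(patient)]


# Patient X from the text: no reproductive history, takes both drugs,
# family history not stated (so the family-history rules cannot match).
patient_x = {"reproductive_history": "No", "estrogen": "Yes", "tamoxifen": "Yes"}
classes_for_x = matching_risk_classes(patient_x)
```

For patient X only the last rule matches, because her family history is not given. A patient with a first-degree family history, two affected relatives, and no reproductive history would match the first, third, and fifth rules at once, which is why a real rule set would also need a policy for choosing among conflicting results.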
5.3.3 Association rules
Since association rules find relationships and associations within
large sets of data by mining databases, and display attribute-value
conditions that frequently occur together in a database, they can be used
in predicting cancer. In the process of predicting cancer, association
rules detect the risk factors that are related to each other. For example,
in the cancer predictor calculator, the association rules that could be
found are the following:
* (Menopause_Age < 40) ∧ (HRT_Status = "Yes") =>
Patient is given estrogen doses.
* (Race = "White") ∧ (Family_History = "First") ∧
(Num_of_Affected_Relatives >= 2) =>
Patient has to have regular check-ups for ovarian cancer cells.
* (BRCA1_Mutation = "Yes") ∧ (Reproductive_History = "No") =>
Patient is at high risk of breast cancer.
These association rules show that the presence of one or more elements
in a dataset predicts the occurrence of another. In the first rule above,
if the woman's menopause age was younger than 40 and she had hormone
therapy to treat menopause symptoms, then she will probably receive doses
of estrogen as part of the hormone therapy. As a result of her early
menopause and hormone therapy consisting of estrogen, she will be at high
risk of breast or ovarian cancer.
In the second association rule, if the risk factors of white race and a
first-degree family history of ovarian cancer with two or more affected
relatives appear together in the database, then it is predicted from those
factors that the woman should have regular check-ups to detect cancerous
ovarian cells. From the associations and the result, predictive analysis,
with the help of the association rule, therefore predicts that the woman
might get ovarian cancer in the future.
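How strongly an antecedent predicts its consequent can be measured on a dataset with the standard support and confidence measures for association rules. The following Python sketch applies them to the first rule above. The measures are brought in here as an illustration rather than taken from the calculator, and the records and field names are invented.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = share of all records matching antecedent and consequent
    confidence = share of antecedent-matching records that also match
                 the consequent
    """
    matches_x = [r for r in records if antecedent(r)]
    matches_xy = [r for r in matches_x if consequent(r)]
    support = len(matches_xy) / len(records)
    confidence = len(matches_xy) / len(matches_x) if matches_x else 0.0
    return support, confidence


# Hypothetical patient records, invented for illustration only.
records = [
    {"menopause_age": 38, "hrt": "Yes", "estrogen": "Yes"},
    {"menopause_age": 39, "hrt": "Yes", "estrogen": "Yes"},
    {"menopause_age": 36, "hrt": "Yes", "estrogen": "No"},
    {"menopause_age": 52, "hrt": "No", "estrogen": "No"},
    {"menopause_age": 37, "hrt": "No", "estrogen": "No"},
]

support, confidence = rule_stats(
    records,
    antecedent=lambda r: r["menopause_age"] < 40 and r["hrt"] == "Yes",
    consequent=lambda r: r["estrogen"] == "Yes",
)
```

In this invented dataset three records satisfy the antecedent and two of them also satisfy the consequent, so the rule holds for two of the five records overall.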
6. Prediction models for two types of cancer
This chapter focuses on two types of cancer, breast cancer and ovarian
cancer, which are included in my implementation. In addition, the
implementation's results will be compared with the results of other
well-known algorithms.
6.1 Breast Cancer
Breast cancer is the first type of cancer in this thesis for which
predictive analysis predicts the risk in women, with the help of techniques
such as decision trees, prediction decision rules, and the others specified
previously.
6.1.1 Breast Cancer risk factors
To understand how the prediction works, it is helpful to specify the risk
factors of breast cancer, since the prediction process depends on them as
well as on historical data from case-control studies of the same disease.
The risk factors of breast cancer are the following (27):
1. Age: the most influential risk factor for breast cancer is the woman's
age; the older she is, the higher her risk.
2. Reproductive history: women who have never had children or who had
their first child after the age of 30 are more likely to develop breast
cancer. Conversely, a higher number of pregnancies and pregnancy at a
younger age reduce the risk.
3. Gene mutations: both BRCA1 and BRCA2 are located in breast cells
and produce proteins that prevent breast cells from growing
abnormally. When a mutation in either of these genes is inherited from one
of the parents, there is a high risk of developing breast cancer.
4. Physical exercise: sedentary women who do not exercise regularly
are at higher risk of breast cancer, while those who are physically
active are at lower risk because of the effect of exercise on obesity.
5. Alcohol consumption: even moderate alcohol consumption can increase the
risk of breast cancer. The risk also depends on the number of
glasses consumed per day, so women who do not consume any
alcohol are at lower risk.
6. Family history: a woman's risk of breast cancer is higher if
she has a family history of the disease. The risk increases with the
degree of family history, so if, for example, a first-degree family member
(mother, sister, daughter) has been diagnosed with breast cancer,
the woman's risk of the disease is predicted to grow.
7. Race: white women are more likely to get breast cancer than women of
other races.
8. Menstrual periods: women who started their periods before the age of
12 or reached menopause after the age of 55 are predicted to be at higher
risk of breast cancer. This is because of their longer lifetime exposure
to estrogen and progesterone.
9. Reproductive history: women who have never had children or had
children after the age of 30 are more likely to be at risk of breast cancer.
10. Breastfeeding: women who breastfeed their children have a lower risk
of breast cancer, while those who do not are at higher risk.
11. Hormone therapy: women who take hormone therapy medications that
involve both estrogen and progesterone to treat menopause
symptoms are at higher risk of breast cancer.
12. Medical history of breast cancer: if a woman has had breast
cancer in one breast before, she is predicted to have an increased
risk of the disease.
13. Obesity: obese women are predicted to have a higher risk of breast
cancer than others because fat tissue generates estrogen, which may
help fuel certain cancers.
14. Birth control: women who take birth control pills for more than 10
years are predicted to have a higher risk of breast cancer.
6.1.2 Gail breast cancer algorithm
The Gail breast cancer algorithm predicts a woman's breast cancer risk
over the next five years based on certain predictor variables. To make its
predictions, it uses the woman's personal medical history (previous breast
biopsies and the presence of atypical hyperplasia in a previous breast
biopsy), her family history, especially in first-degree relatives (mother,
sister, and daughter), and her reproductive and menstrual history (her age
at first childbirth and her age at the beginning of menstruation).
The Gail model was developed by Dr. Mitchell Gail, Senior Investigator in
the Biostatistics Branch of NCI's Division of Cancer Epidemiology and
Genetics, and his colleagues (20).
6.1.2.1 How Gail breast cancer algorithm works
The Gail breast cancer algorithm works as follows:
* In his model, Gail emphasized four risk factors for breast cancer:
age at first childbirth, age at menarche, number of first-degree
relatives (mother, sister, daughter) who have had the disease,
and number of breast biopsies.
* The Gail model predicts breast cancer risk by multiplying the relative
risks (RRs) for these four risk factors by a baseline risk specific to the
woman's age. This is because, although a woman might not have the
other risk factors, age by itself plays a role in increasing the risk of
breast cancer (10).
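The multiplicative idea described above can be illustrated with a short Python sketch. It is not the published Gail model: the function name, the baseline risk, and every relative risk value below are made-up placeholders chosen only to show the arithmetic.

```python
import math


def multiplicative_risk(age_baseline_risk, relative_risks):
    """Multiply an age-specific baseline risk by each factor's relative risk."""
    return age_baseline_risk * math.prod(relative_risks.values())


# Placeholder values, not taken from the Gail model or from any study.
relative_risks = {
    "age_at_first_birth": 1.2,
    "age_at_menarche": 1.1,
    "first_degree_relatives": 2.0,
    "breast_biopsies": 1.0,
}
estimated_risk = multiplicative_risk(0.01, relative_risks)
```

A woman with none of the four risk factors would have a relative risk of 1.0 for each of them, so her estimated risk would equal the age-specific baseline alone, which is the point made in the text about age.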
6.1.2.2 Disadvantages of Gail algorithm:
* The Gail model does not take into consideration second-degree relatives
(nieces, aunts, uncles, nephews, and grandparents) or third-degree
relatives (cousins) who have been diagnosed with the disease.
* The Gail model does not take into account other risk factors that play a
role in increasing the risk of breast cancer, such as hormone therapy, age
at menopause, gene mutations, and radiation exposure.
* The Gail model may underestimate a woman's risk of breast cancer, since
it concentrates on only four risk factors.
6.1.2.3 How is Gail algorithm related to data mining:
Prediction is one task of data mining, which predicts future outcomes. In
the case of breast cancer prediction, predictive analysis supports
decision-making during the prediction process.
Since predictive analysis predicts future outcomes from historical events,
the Gail algorithm is based on predictive analysis. The algorithm predicts
the risk of breast cancer for a person given inputs such as family history
and age, so what predictive analysis does here is build a prediction model.
The prediction model consists of variables called predictors that affect
the predicted result. Predictive analysis also depends on historical
databases of patients who had breast cancer with the same predictor
variables that form the input to the Gail algorithm.
As a result, when a woman uses the Gail algorithm to see her predicted
risk, she is asked to provide information about predictor variables such as
age and family history. Based on databases of breast cancer patients who
had the same combination of factors as the woman, a predictive model
consisting of predictor variables is built to predict her risk. Predictive
analysis uses these variables to predict the woman's risk of breast cancer,
and she is given her risk based on these predictors.
6.2 Ovarian Cancer
Ovarian cancer is the second kind of cancer addressed in this thesis,
for which predictive analysis estimates the risk in women with the help
of techniques such as decision trees and prediction decision rules.
6.2.1 Ovarian Cancer risk factors
The risk factors of ovarian cancer are the following (22):
1. Age: the older a woman is, the higher her predicted risk of ovarian
cancer.
2. Reproductive history: women who have never given birth are predicted
to be at higher risk of ovarian cancer. The risk goes down with each
pregnancy.
3. Family history: women with first-degree relatives who have been
affected by ovarian cancer, breast cancer, or colorectal cancer are more
likely to develop ovarian cancer than those with no family history.
4. Hormone therapy: women who take estrogen after menopause are
predicted to be at higher risk of ovarian cancer.
5. Infertility: infertile women are predicted to have a higher risk of
ovarian cancer than fertile women.
6. Gene mutations: women who carry mutations in certain genes, such as
BRCA1, BRCA2, and those associated with HNPCC (hereditary nonpolyposis
colorectal cancer), are predicted to have a higher risk.
7. Obesity: women with a high body mass index are more likely to
be at higher risk of ovarian cancer.
8. Previous cancer history: cancer researchers predict that women with
previous breast cancer are at high risk of the disease.
9. Endometriosis: women with endometriosis, a condition in which
endometrial tissue may grow and reside on one of the ovaries, have a higher
risk of the disease.
6.2.2 The Risk of Ovarian Cancer Algorithm
The Risk of Ovarian Cancer Algorithm (ROCA) is an algorithm that
predicts a woman's risk of ovarian cancer from her age and the trends in
her blood CA125 levels. However, the algorithm is still under study, and
its results are expected to be ready by 2015 (9). The algorithm was
developed by biostatistician Steven Skates, Ph.D.
CA125 is a cancer antigen, or carbohydrate antigen, that can be
found in ovarian tumor cells. CA125 is also the name of a test that can be
performed on patients to measure the level of CA125 protein, because the
CA125 level rises when a person has ovarian cancer.
6.2.2.1 How the Risk of Ovarian Cancer Algorithm works
The Risk of Ovarian Cancer Algorithm works for postmenopausal women in the
following steps (13):
1. A woman has a CA125 test done to measure her level of CA125 protein.
2. Using a mathematical model, the ROCA integrates information about the
woman's age and any changes in her CA125 levels over time.
3. Based on the results of the test, the woman is classified into
one of three groups:
* Low risk: women in the low-risk group are asked to take the CA125
test again in the next year.
* Intermediate risk: women in the intermediate-risk group are
asked to repeat the CA125 test in three months.
* High risk: women in the high-risk group are advised to seek
specialist medical care from a gynecologic oncologist, who
might recommend surgery if necessary.
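The follow-up attached to each of the three groups can be written as a simple lookup, sketched below in Python. The sketch covers only the final step; the way ROCA assigns a woman to a group is not described here, so the group is taken as an input, and the names `FOLLOW_UP` and `follow_up` are hypothetical.

```python
# Follow-up action for each risk group, as described in step 3.
FOLLOW_UP = {
    "low": "repeat the CA125 test in one year",
    "intermediate": "repeat the CA125 test in three months",
    "high": "refer to a gynecologic oncologist",
}


def follow_up(risk_group):
    """Return the recommended follow-up for a risk group name."""
    try:
        return FOLLOW_UP[risk_group.lower()]
    except KeyError:
        raise ValueError(f"unknown risk group: {risk_group!r}") from None


next_step = follow_up("Intermediate")
```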
7. Contribution and Discussion:
Since this thesis discusses the prediction of cancer using predictive
analysis and other data mining techniques, an implementation has been
provided to support the topic.
The software was implemented using Microsoft C# 2010 Express and SQL. It
includes a risk assessment calculator for two types of cancer: breast and
ovarian. The main role of this calculator is to compute the estimated risk
percentage of breast or ovarian cancer. The calculator is intended for
women who would like to estimate their lifetime risk of breast or ovarian
cancer. Figure 7.1 shows a simplified model of how the software operates to
calculate the disease risk percentage.
Figure 7.1: SOFTWARE FLOWCHART
In the calculator's implementation, I developed the following
procedure, which calculates the final cancer risk percentage for either
breast or ovarian cancer.
double Disease_Risk = 0;
int i = 1;
while (i <= N)
{
    // Form_Risk is the sum of the relative risks of the k questions in form i
    double Form_Risk = 0;
    for (int j = 1; j <= k; j++)
        Form_Risk = Form_Risk + Relative_Risk[i, j];
    Disease_Risk = Disease_Risk + Form_Risk;
    i = i + 1;
}
double Risk_Percentage = (Disease_Risk * 100) / 10;
In the above algorithm, which the calculator uses to compute the
risk percentage, the variables are as follows:
i: the form number in the program.
N: the number of forms.
k: the number of questions in form i.
j: the question number within form i.
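The same computation can be sketched in Python as a check on the arithmetic. The function name `risk_percentage` and the relative risk values are hypothetical; only the structure (sum per form, sum over forms, then multiply by 100 and divide by 10) follows the procedure above.

```python
def risk_percentage(forms):
    """Compute the risk percentage from per-question relative risks.

    `forms` is a list of forms; each form is a list holding the relative
    risk assigned to the answer of each question in that form.
    """
    disease_risk = 0.0
    for form in forms:
        form_risk = sum(form)      # sum over the k questions of this form
        disease_risk += form_risk  # accumulate over the N forms
    return (disease_risk * 100) / 10


# Hypothetical relative risks for three forms, invented for illustration.
example_forms = [[0.5, 0.25], [1.0], [0.75, 0.5, 0.25]]
example_percentage = risk_percentage(example_forms)
```

With these invented values the three form risks are 0.75, 1.0, and 1.5, so the accumulated disease risk is 3.25 and the resulting percentage is 32.5.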
In addition, as breastfeeding is one factor that decreases cancer
risk, I developed the following equation, which calculates the decrease in
cancer risk from the number of months a mother breastfed her children and
the number of children she has given birth to, making use of 0.8, which is
the decreased risk percentage for breast cancer based on lactation (8).
Lactation_Factor = (Lactation_Months + (0.8 * Children_Num)); (7.1)
Real databases containing patient information such as age and family
history have been used to validate the software. The first, used to
validate the predicted breast cancer risk percentage, is the WHI (Women's
Health Initiative) database. The second, used to validate the predicted
ovarian cancer risk percentage, is the SEER (Surveillance, Epidemiology,
and End Results) database. Applying real medical records of patients who
have been diagnosed with breast or ovarian cancer to the risk assessment
calculator aids in validating the final predicted risk.
To compare the calculator's risk assessment results with other well-known
algorithms, the Gail algorithm's results are compared with the calculator's
predicted breast cancer risk percentages. Likewise, the Risk of Ovarian
Cancer Algorithm's results are compared with the predicted ovarian cancer
risk percentages.
7.1 Results:
Overall, after applying the record values of the Women's Health Initiative
(WHI) databases to the cancer predictor calculator, the difference between
the calculator's results and the relative risk calculated from the WHI
databases is 7%.
Similarly, after applying the record values of the Surveillance,
Epidemiology, and End Results (SEER) databases to the calculator, the
difference between the calculator's results and the relative risk
calculated from the SEER databases is 8.2%.
The results of the Gail algorithm calculator for breast cancer are
compared with the cancer predictor calculator's results in the following
tables:
Table 7.1: Race and Age
        Cancer Predictor Calculator   Gail Model Calculator
Age     Black     White               Black     White
20-29   6.5%      32.5%               4.7%      4.1%
30-39   11.5%     37.5%               4.3%      6.3%
40-49   21.9%     47.9%               11.2%     13.1%
50-59   23.3%     49.3%               5.9%      8%
60-69   29%       55%                 4.7%      7.7%
70-79   30%       56%                 3.5%      6.3%
80+     19.5%     45.5%               NA        NA
Looking at Table 7.1, and after comparing the Gail Algorithm's results
to my implemented cancer predictor, it is obvious that the latter's results is
higher than Gails Algorithm calculator.

Table 7.2: Age and First Birth Age (all values are percentages)
First birth   20-29         30-39         40-49         50-59         60-69         70-79         80+
age           C      G      C      G      C      G      C      G      C      G      C      G      C
<20           1.5    NA     6.5    NA     16.9   NA     18.3   NA     24     NA     25     NA     14.5
20-24         1.7    2.6    6.7    5      17.1   14.4   18.5   9.2    24.2   9.4    25.2   7.9    14.7
25-29         2.1    4.5    7.1    5.8    17.5   16     18.9   10.6   24.6   11.5   25.6   10.1   15.1
30+           -      -      7.4    8.3    17.    18     19.2   12.4   24.9   14     25.9   12.6   15.4
Nulliparous   2.1    3.2    7.1    5.8    17.5   16     18.9   10.6   24.6   11.5   25.6   10.1   15.1
C is the cancer predictor calculator.
G is the Gail algorithm calculator.
Looking at Table 7.2 and comparing the Gail algorithm's results with those of my implemented cancer
predictor, it can be seen that for ages 20 to 49 the average difference between
the two predictors' results is approximately 2%. On the other hand, this average difference grows for
ages 50 to 80 and above.
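The band-wise comparison above can be reproduced with a short script. The following is a minimal sketch, not the thesis software: it assumes the "average difference" is the mean absolute gap |C - G| in percentage points, it uses only the Table 7.2 cells where both calculators report a value, and the names `TABLE_7_2` and `mean_abs_difference` are illustrative.

```python
# Paired (C, G) risk percentages transcribed from Table 7.2, one list per
# first-birth-age row, ordered by age band 20-29 ... 70-79.
# None marks a cell that is "-", "NA", or truncated in the table.
# The "<20" row is left out because G is NA in every band.
AGE_BANDS = ["20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
TABLE_7_2 = {
    "20-24": [(1.7, 2.6), (6.7, 5.0), (17.1, 14.4), (18.5, 9.2), (24.2, 9.4), (25.2, 7.9)],
    "25-29": [(2.1, 4.5), (7.1, 5.8), (17.5, 16.0), (18.9, 10.6), (24.6, 11.5), (25.6, 10.1)],
    "30+": [(None, None), (7.4, 8.3), (None, 18.0), (19.2, 12.4), (24.9, 14.0), (25.9, 12.6)],
    "Nulliparous": [(2.1, 3.2), (7.1, 5.8), (17.5, 16.0), (18.9, 10.6), (24.6, 11.5), (25.6, 10.1)],
}


def mean_abs_difference(table, bands, all_bands=AGE_BANDS):
    """Mean |C - G| over the requested age bands, skipping incomplete cells."""
    wanted = {all_bands.index(band) for band in bands}
    gaps = [
        abs(c - g)
        for row in table.values()
        for i, (c, g) in enumerate(row)
        if i in wanted and c is not None and g is not None
    ]
    return sum(gaps) / len(gaps)


younger = mean_abs_difference(TABLE_7_2, ["20-29", "30-39", "40-49"])
older = mean_abs_difference(TABLE_7_2, ["50-59", "60-69", "70-79"])
print(f"ages 20-49: {younger:.2f} points; ages 50-79: {older:.2f} points")
```

Under these assumptions the gap is about 1.5 percentage points for ages 20 to 49 and about 12.2 for ages 50 to 79, which agrees with the pattern described above. The 80+ band cannot be compared because Table 7.2 has no Gail values for it.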
Table 7.3: Age and Family History (all values are percentages)
Family history   20-29         30-39         40-49         50-59         60-69         70-79         80+
degree           C      G      C      G      C      G      C      G      C      G      C      G      C      G
1st              2.8    5.3    7.8    9.2    18.2   22.2   19.6   16.3   25.3   19.5   26.3   18.1   15.8   NA
2nd              2      -      7      NA     17.4   NA     18.8   NA     24.5   NA     25.5   NA     15     NA
3rd              16.5   NA     21.5   NA     31.9   NA     33.3   NA     39     NA     40     NA     29.5   NA
C is the cancer predictor calculator.
G is the Gail algorithm calculator.

Since the Gail algorithm calculator considers only first-degree family history, Table 7.3 shows that
algorithm's predicted results for the first degree only. My cancer predictor, however,
considers family history of the first, second, and third degrees. Comparing the first-degree results of
both calculators, it can be seen that the predicted results are fairly close in all
age ranges for which both calculators report a value.
Table 7.4: Menstruation Age and Age (all values are percentages)
Menstruation  20-29         30-39         40-49         50-59         60-69         70-79         80+
age           C      G      C      G      C      G      C      G      C      G      C      G      C      G
<12           20.5   6.4    25.2   9.3    35.9   17.7   37.3   12.2   43     13.7   44     12.3   33.5   NA

Looking at Table 7.4 and comparing the Gail algorithm's results with those of my implemented cancer
predictor, it is clear that the latter's results are higher than the Gail algorithm calculator's.
For ovarian cancer prediction, the Risk of Ovarian Cancer algorithm is the only algorithm that predicts
the ovarian cancer risk percentage, and it is still under study and development. It was therefore not
possible to compare the ovarian cancer risk percentages predicted by the cancer predictor
calculator with those of the Risk of Ovarian Cancer algorithm.

7.2 Evaluation:
In order to evaluate the accuracy of my cancer predictor's predicted
results, I entered the information from about 760 patients' records into both
calculators. The predicted results showed an average difference of 8.7%
between the two calculators.
In addition, in order to evaluate my cancer predictor calculator against real
databases that hold real patients' information, I entered about 760 patients'
records from the WHI databases into both the Gail algorithm calculator and
my cancer predictor calculator. The results showed the following:
The average difference between my cancer predictor calculator and
the WHI databases is approximately 6.5%.
The average difference between the Gail algorithm calculator and the WHI
databases is approximately 15%.
From the above evaluations, it is clear that the Cancer Predictor Calculator's
predicted results are closer to the WHI databases' results than the Gail
Algorithm Calculator's are. In this evaluation, therefore, my cancer predictor
calculator's cancer estimates are more accurate than Gail's.
Furthermore, after many tests on my cancer predictor software to decide
which risk factors are the most indicative in correctly predicting cancer, I
obtained the following risk factors:
Breastfeeding: the calculator's predicted results are the same as the
predicted results in the WHI databases.
First birth age: the calculator's predicted results differ from the
predicted results in the WHI databases by approximately 2% on
average.
Weight at birth: the calculator's predicted results differ from the
predicted results in the WHI databases by approximately 4.8% on
average.
Age: the calculator's predicted results differ from the predicted
results in the SEER databases by approximately 0.64% on
average.
7.3 Advantages and Disadvantages:
The cancer predictor calculator has some advantages that the
Gail Model Calculator does not offer. First, the Cancer Predictor Calculator
combines many breast cancer risk factors that cannot be found in
Gail's. For example, the Cancer Predictor Calculator has the following factors
that are not available in Gail's:
Lifestyle factors such as exercise, alcohol consumption, smoking, bra
wear, birth control pills, and night shift work.
Environmental factors such as radiation and environment type.
Medical history such as abortion, breast cancer history,
the Diethylstilbestrol (DES) drug, and hormone therapy.
Personal factors such as menopause, breastfeeding, the number of
children a woman gave birth to, and ethnicity.
Genetic factors such as BRCA1 and BRCA2 mutations, height, weight,
and family history of all degrees.
Body factors such as weight at birth, height at birth, and head weight at
birth.
Second, the breast cancer risk predicted by the cancer predictor calculator
usually increases as the woman gets older. In the Gail
algorithm calculator, however, the risk percentage sometimes decreases as the woman
gets older. This should not happen, because according to most medical
studies a woman's risk increases as she gets older.
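The age trend described above can be checked directly against Table 7.1. The following is a minimal sketch, not part of the thesis software: it uses the values for Black women transcribed from Table 7.1, and the function name `decreasing_steps` is illustrative.

```python
AGE_GROUPS = ["20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"]

# Risk percentages for Black women, transcribed from Table 7.1.
PREDICTOR_BLACK = [6.5, 11.5, 21.9, 23.3, 29.0, 30.0, 19.5]
GAIL_BLACK = [4.7, 4.3, 11.2, 5.9, 4.7, 3.5]  # Table 7.1 lists NA for 80+


def decreasing_steps(age_groups, risks):
    """Return the (younger, older) age-group pairs where the risk drops."""
    return [
        (age_groups[i], age_groups[i + 1])
        for i in range(len(risks) - 1)
        if risks[i + 1] < risks[i]
    ]


predictor_drops = decreasing_steps(AGE_GROUPS, PREDICTOR_BLACK)
gail_drops = decreasing_steps(AGE_GROUPS, GAIL_BLACK)
print("Cancer predictor drops:", predictor_drops)
print("Gail calculator drops:", gail_drops)
```

With these values the Gail column drops at four of its five age transitions, while the cancer predictor column drops only at the step from 70-79 to 80+, which is consistent with the wording "usually increases".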
On the other hand, the cancer predictor calculator sometimes
overestimates the predicted breast cancer risk percentage with respect to the
race factor when compared to the Gail model calculator.
7.4 Future work:
After analyzing the cancer predictor calculator's results, including the
decision tree and some comparisons between the calculator and other
well-known algorithms, I suggest the following improvements to the
cancer predictor calculator:
1) Drawing the patient's family history tree in the output window.
2) Drawing the decision tree for each patient and highlighting his/her
path in the tree, which shows his/her predicted risk.
3) Combining the calculator with other algorithms that predict the same
type of cancer as the Cancer Predictor Calculator. For example,
combining the calculator with the Gail model calculator would allow the patient
to estimate his/her risk with more than one algorithm.
4) Including the CA125 test in the ovarian cancer risk assessment part.
5) Advising the patient after the output window, for example by connecting the
program to hospitals and making an appointment for the patient with
oncologists to discuss her/his situation.
6) Adding more cancer types, such as brain cancer and leukemia.
7.5 Conclusion
In conclusion, the purpose of this thesis was to shed light on the
ability to apply data mining tools in the process of cancer prediction. Today,
data mining plays a major role in improving disease prediction, since it
extends not only the process of the analysis but also its depth. In fact, its
techniques have a promising future for improving the efficiency of health
care and predicting cancer prior to its occurrence. These techniques include
decision trees, which predict cancer risk based on a series of predictors in the
tree; clinical decision rules, which can mine datasets and extract
rules that aid in cancer prediction; and association rules, which analyze a
patients' database to find relations among its attributes and then predict
cancer by mining these relations. All of these techniques work together
with predictive analysis to get the best possible predicted result. Also, this
thesis was supported with software that calculates the predicted risk for
breast and ovarian cancer. Finally, data mining methods assist oncologists
in saving people's lives by predicting cancer prior to its occurrence.

APPENDIX A
CANCER PREDICTOR EXPERT SYSTEM USER GUIDE
The cancer predictor calculator is an online user interface model that
predicts women's risk percentage for two types of cancer: breast
cancer and ovarian cancer. The calculator provides its users with their
predicted percentage and the class of risk to which they belong: low,
medium, or high risk.
Installation:
The Cancer Predictor Calculator can only operate on a computer that
runs the Windows OS. The installation process of the Cancer Predictor Calculator
takes the following steps:
1. Go to the following website:
http://www.mediafire.com/7m53u9js2xbaphna.
2. Since the Cancer Predictor software is password protected, the user
will be requested to enter the software's password, which is FinallydonE14.
3. After the password is entered correctly, the user will be directed to the
download page, which prompts him/her to start downloading.
4. Once the download is complete, a file directory with the name
CancerPredictorCalculator will open.
5. The user should click setup.exe to install the software on his/her
device.

Using the Cancer Predictor Calculator:
1. After the introductory window, the user will be
directed to the Cancer Predictor Calculator Menu, which prompts her to
choose either breast cancer or ovarian cancer.
2. After the user chooses the type of cancer for which she would like to know her
predicted risk, she will be asked to answer a number of questions in
consecutive windows. In the questions window, she can see her
predicted risk by clicking the "Risk Probability" button.


[Screenshot: a questions window with the answer options "Yes", "No", and "Don't know" and the "Risk Probability" button]
3. If the user forgets to answer one or more questions in
any window, a message window will ask her to answer those
questions before letting her navigate to the next window.

[Screenshot: a lifestyle questions window with a message box reading "You forgot to answer Question 16!" and an "OK" button]
4. The final window shows the user's predicted risk percentage and
the class of risk to which she belongs. After she sees her result, she
has two options: going back to the home page or closing the
software.

BIBLIOGRAPHY
1. Chukwugozie Nsofor, Godswill. "A Comparative Analysis of Predictive
Data-Mining Techniques." Master's thesis, The University of
Tennessee, 2006.
2. Cruz, Joseph A., and David S. Wishart. "Applications of Machine
Learning in Cancer Prediction and Prognosis." Cancer Informatics
2006, 59-78.
3. Doheny, Kathleen. "Menstrual Periods: Clues to Ovarian Cancer."
WebMD Health News: n.p., n.d., July 9, 2009. Web. 16 July 2011.
4. Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth.
"From Data Mining to Knowledge Discovery in Databases." AI
Magazine (1996): 37-54.
5. Hankinson, Susan E., Graham A. Colditz, and Walter C. Willett.
"Towards an integrated model for breast cancer etiology: The lifelong
interplay of genes, lifestyle, and hormones." Breast Cancer Res. 6
(2004): 213-218.
6. Ingui, Bette, and Mary Rogers. "Searching for Clinical Prediction Rules
in Medline." Journal of the American Medical Informatics Association
8 (Jul-Aug 2001): 391-397.
7. Dr. Liu, Yan. Decision Tree. Department of Biomedical, Industrial and
Human Factors Engineering. Wright State University, n.d. Web. 5 Sep.
2011.
8. Marsh, Beezy. "Breast-feeding reduces cancer risk." Dailymail.co.uk.
MailOnline.com, n.d. Web. 22 July 2011.
9. Morton, Carol Cruzan. "Ovarian cancer research takes center stage."
harvard.edu. Dana-Farber/Harvard Cancer Center, n.d. Web. 22 Oct.
2011.
10. Newman, Lisa. "Breast Cancer in African-American Women." The
Oncologist Breast Cancer 10 (2005): 1-14.
11. Olivas, Rafael. "Decision Trees: A Primer for Decision-making
Professionals." stylusandslate.com. Stylus and Slate, 10 Apr. 2007.
Web. 5 Sep. 2011.
12. Sánchez-Zamorano, Luisa Maria, Lourdes Flores-Luna, Angélica
Angeles-Llerenas, Isabelle Romieu, Eduardo Lazcano-Ponce,
Hernando Miranda-Hernández, Fernando Mainero-Ratchelous, and
Gabriela Torres-Mejia. "Healthy Lifestyle on the Risk of Breast
Cancer." Cancer Epidemiology Biomarkers Prev. 20 (2011): 912-922.
13. Sharp, Frank, Tony Blackett, Jonathan Berek, and Robert Bast.
Ovarian Cancer. Oxford: Isis Medical Media Ltd., 1998.
14. Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to
Data Mining. Boston: Pearson Education, 2006.
15. Velickov, Slavco, and Dimitri Solomatine. "Predictive Data Mining:
Practical Examples." unesco-ihe.org. UNESCO-IHE Institute for Water
Education, March 2000. Web. 4 Sep. 2011.
16. "Association rules (in data mining)." Searchcrm.com. SearchCRM, n.d.
Web. 30 Aug. 2011.
17. "Cancer of the Uterus Risk Factors." cancer.gov. National Cancer
Institute, n.d. Web. 7 Sep. 2011.
18. "Data Warehouse Definition and Concept." Tutorial-computer.com.
Tutorial-computer, n.d. Web. 30 Aug. 2011.
19. "Decision Tree: Introduction." ccs.miami.edu. Center for Computational
Science, n.d. Web. 30 Aug. 2011.
20. "Gail model." infoacademy.gr/microcalc. Microcalcification Resource
Site, n.d. Web. 20 Oct. 2011.
21. "Lesson 8: Data Mining, Data Warehousing, and Data Marts." psu.edu.
Penn State University, n.d. Web. 30 Aug. 2011.
22. "Ovarian cancer risk factors." cancerresearchuk.org. Cancer Research
UK, n.d. Web. 25 Aug. 2011.
23. "Predictive analytics." Searchcrm.com. SearchCRM, n.d. Web. 30 Aug.
2011.
24. "Predictive modeling." Searchcrm.com. SearchCRM, n.d. Web. 30
Aug. 2011.
25. "Twelve Basic Predictive Analytics Techniques." oacf.org. OACF, 19
May 2011. Web. 6 Sep. 2011.
26. "Uterine Cancer Risk Factors." cdc.gov. Centers for Disease Control
and Prevention, n.d. Web. 12 Oct. 2011.
27. "What are the risk factors for breast cancer?" cancer.org. American
Cancer Society, 29 Sep. 2011. Web. 21 Oct. 2011.
28. "What are the risk factors for ovarian cancer?" cancer.org. American
Cancer Society, 13 Oct. 2011. Web. 21 Oct. 2011.

Full Text

CANCER PREDICTION IN DATA MINING
by
Nedhal Abdallah Al Khalaf
B.S., King Saud University, 2007
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2011

This thesis for the Master of Science
degree by
Nedhal Al Khalaf
has been approved
for recommendation to the Graduate Committee
Thesis Advisor
Professor Tom Altman
MS Committee
Professor Bogdan Chlebus
Professor Ilkyeun Ra
November 18, 2011
Date

Al Khalaf, Nedhal Abdallah (M.S., Computer Science)
Cancer Prediction in Data Mining
Thesis directed by Professor Tom Altman
ABSTRACT
Predicting a disease before it surprises the person is better than identifying it too late. Data mining plays an important role in the medical field because of its ability to discover useful information that helps in predicting a disease's risk. This thesis discusses the role of data mining in the health care field and its techniques that help doctors predict the chances of a patient developing cancer. The techniques involve using predictive analysis to build a model that calculates the likelihood of the disease. To support this topic, an implementation will be provided that builds a prediction model for two types of cancer: breast cancer and ovarian cancer. This thesis will evaluate the implementation results by comparing them against the Gail algorithm and the Risk of Ovarian Cancer algorithm. In this way, data mining supports health care in predicting cancer risk prior to diagnosis.
This abstract accurately represents the content of the candidate's thesis. I recommend its publication.
Signed
Tom Altman

DEDICATION
I am proud to dedicate my Master's thesis to the five pillars in my life: God, my parents, and my best friends. My life would not be amazing, and I would not step to higher levels, without all of you. I might be lost in this life and depressed by its harming shocks, but once I stand to pray to you with all my heart, God, Allah, I get strength from pleading to you to continue my way in this life journey. Mom Nabiha, you have always supported me and devoted your life, since I was a baby, to raising me to learn as much knowledge as I can. All my warm wishes to you. Daddy Abdullah, you always tell me, "Keep going, there is no impossible." Thank you for helping me reach higher academic levels and for supporting me with your precious advice and wise look to the future. My best friend Siham, thanks for looking after me and standing by me in every step during my study. My best friend Wafa, thanks for sharing with me all the special moments in my life.

TABLE OF CONTENTS
Figures
Tables
Chapter
1. Inspiration
2. Introduction
3. Definitions
4. Prediction Analysis in data mining
4.1 Predictive data mining stages
4.2 Predictive Analysis Techniques
4.2.1 Decision trees
4.2.1.1 Decision trees as predictive models
4.2.1.2 Advantages of decision trees
4.2.1.3 Disadvantages of decision trees
4.2.2 Decision rules
4.2.3 Association rule mining
4.3 Why is Predictive Analysis useful?
5. Predictive analysis in cancer prediction
5.1 Predictive analysis steps in cancer prediction
5.2 The goals of cancer prediction

5.3 Predictive analysis techniques in cancer prediction
5.3.1 Decision trees
5.3.1.1 Decision trees in cancer prediction
5.3.1.1.1 Example on how the decision tree is used in cancer prediction
5.2.2 Decision rules
5.2.3 Association rules
6. Prediction models for two types of cancer
6.1 Breast Cancer
6.1.1 Breast Cancer risk factors
6.1.2 Gail breast cancer algorithm
6.1.2.1 How Gail breast cancer works
6.1.2.2 Disadvantages of Gail algorithm
6.1.2.3 How is Gail algorithm related to data mining
6.2 Ovarian Cancer
6.2.1 Ovarian Cancer risk factors
6.2.2 The Risk of Ovarian Cancer Algorithm
6.2.2.1 How the Risk of Ovarian Cancer Works
7. Contribution and Discussion
7.1 Results

7.2 Evaluation
7.3 Advantages and Disadvantages
7.4 Future work
7.5 Conclusion
Appendix
A. CANCER PREDICTOR EXPERT SYSTEM USER GUIDE
Bibliography

LIST OF FIGURES
Figure
3.1 KDD STEPS
3.2 DECISION TREE NODE TYPES
4.1 PREDICTIVE DATA MINING STAGES
4.2 PATIENTS DATABASE DECISION TREE
5.1 DECISION TREE AS PREDICTION MODEL FOR THE CANCER PREDICTOR CALCULATOR
5.2 DECISION TREE AS PREDICTION MODEL FOR UTERINE CANCER
7.1 SOFTWARE FLOWCHART

LIST OF TABLES
Table
4.1 Patients Information Database
5.1 Patient Disease History Table
5.2 Patient Uterine Activity Table
5.3 Patient Drugs Table
5.4 Patient Personal info Table
5.5 Patient Risk Factors Table
5.6 Final Predicted Cancer Risk for Patients
5.7 Prediction Decision Rules from the cancer predictor calculator
7.1 Race and Age
7.2 Age and First Birth Age
7.3 Age and Family History
7.4 Menstruation Age and Age

1. Inspiration
Nothing is more precious than having a healthy body. When I was at school, I saw a TV program about a lady who had cancer. At that moment I felt sorry seeing her in this situation, especially since she was undergoing chemotherapy and all her hair was falling out. Also, my sister works in the radiation section of a hospital, and she always tells me stories about the tragedy of losing someone to cancer. This created a strong inspiration in my heart to help people who suffer from this disease. Therefore, I decided to connect my thesis topic, data mining, to the health care field. As one of the most valuable fields in computer science, data mining has played an important role in predicting outcomes based on hidden patterns in databases. Using the prediction rule, which is one of the data mining tools, I decided to implement a medical program that predicts the probability of someone getting cancer. This software will hopefully help people anticipate their cancer risk percentage, and based on that they can take more precautions, such as talking to a doctor about their risk factors or taking a medical test. Hence, this will aid in discovering the cancer at an earlier stage and thus fighting it in advance, before it fights their cells.

2. Introduction
Today, thousands of people suffer from the severe pain, noticeable effects, and uncontrolled spread of cancer. Many hospitals and medical institutions have been establishing cancer centers in order to provide the optimum therapy for cancer patients. Furthermore, many organizations and universities around the world are conducting research to discover better cancer cures. When a person consults an oncologist, the oncologist will study the patient's cancer case and the risk factors that led to the disease. Consequently, all of the patient's information, including the risk factors, will be kept in medical databases in order to keep a medical history for the patient and to benefit from the cancer case for research purposes. As massive amounts of data are stored in huge databases in hospitals and medical organizations, it becomes easier to detect hidden data patterns that are useful for summarizing what causes the disease. These databases usually contain information about diseases, patients, and other disease factors. In fact, each record or object in these databases has sensitive information about the patient at the time of getting the disease, such as personal factors (age, family history), social factors (race), and habits (smoking, drinking alcohol). It is therefore beneficial to mine these health databases to extract useful information about the disease. How can these databases be mined to explore more useful

information? Since computer science has served many sciences, health care has also been able to benefit from it. Data mining, as one of the major fields in computer science, has served the health care field a great deal, mainly because of its capability to detect hidden useful patterns in medical databases. In the case of cancer, data mining rules such as prediction rules and association rules have been effective in predicting the cancer risk and associating its risk factors. In this thesis, I will address data mining in the health care field and its techniques that help doctors, and then I will narrow the discussion to the role of data mining in a specific disease, cancer. That includes applying some data mining techniques to build a prediction model that determines the likelihood of getting cancer.

3. Definitions
In this chapter, a number of definitions will be given for terminology used in this thesis.
1. Data mining
Data mining can be defined as the branch of computer science that covers the process of extracting useful information from large data sets or databases (14). Data mining is the analysis stage of knowledge discovery in databases. It is also known by other terms, such as knowledge extraction, information discovery, information harvesting, data archeology, knowledge discovery, and data pattern processing. In general, data mining analyzes data from datasets and summarizes it for use in many fields, such as security, disease prediction, and stock markets. Data mining is like gold mining: we extract data from databases to get useful information.
2. Knowledge Discovery in databases
Knowledge discovery in databases (KDD) is the overall process of finding useful knowledge from data. Data mining is the step of applying algorithms to data to extract patterns from it (4).

2.1 Knowledge Discovery in databases steps
KDD has several steps (14), as shown in Figure 3.1:
Figure 3.1 KDD STEPS (flow: Input Data, Data Preprocessing, Data Mining, Postprocessing, Information)
These steps are data preprocessing, data mining, and postprocessing. In the data preprocessing step, raw input data is processed to make it more suitable for data mining (1). This includes combining data from various sources, cleaning the data of noise and duplicate observations, and data reduction, which focuses on selecting the features and records that are relevant to the data mining technique. The second step is the data mining step, which uses data classification or prediction to detect hidden patterns in the data. The classification techniques involve clustering, while the prediction techniques involve decision trees and decision rules. The third step is data postprocessing, which gives the interpretation of, or inferences from, the data mining analysis in the previous step.
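The three KDD steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the thesis implementation: the patient records are made up for the example, the field names are hypothetical, and the mining step is a simple grouped rate standing in for a real technique such as a decision tree.

```python
from collections import defaultdict

# Made-up records for illustration only; they are not thesis data.
RAW_RECORDS = [
    {"id": 1, "age": 52, "smoker": True, "cancer": True},
    {"id": 1, "age": 52, "smoker": True, "cancer": True},    # duplicate observation
    {"id": 2, "age": 34, "smoker": False, "cancer": False},
    {"id": 3, "age": None, "smoker": True, "cancer": True},  # noisy: missing age
    {"id": 4, "age": 61, "smoker": True, "cancer": False},
    {"id": 5, "age": 47, "smoker": False, "cancer": False},
    {"id": 6, "age": 58, "smoker": True, "cancer": True},
]


def preprocess(records, features):
    """Drop duplicates and incomplete records, then keep only the chosen features."""
    seen, cleaned = set(), []
    for record in records:
        if record["id"] in seen or any(v is None for v in record.values()):
            continue
        seen.add(record["id"])
        cleaned.append({name: record[name] for name in features})
    return cleaned


def mine(records):
    """Find a simple pattern: the cancer rate among smokers and non-smokers."""
    counts = defaultdict(lambda: [0, 0])  # smoker -> [cancer cases, total]
    for record in records:
        counts[record["smoker"]][0] += int(record["cancer"])
        counts[record["smoker"]][1] += 1
    return {smoker: cases / total for smoker, (cases, total) in counts.items()}


def postprocess(rates):
    """Turn the mined pattern into readable statements."""
    return [f"smoker={smoker}: cancer rate {rate:.0%}" for smoker, rate in sorted(rates.items())]


cleaned = preprocess(RAW_RECORDS, ("smoker", "cancer"))
rates = mine(cleaned)
summary = postprocess(rates)
print("\n".join(summary))
```

Each function corresponds to one box in Figure 3.1, so a different mining technique can replace `mine` without touching the other two steps.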

3. Prediction Analysis
Prediction analysis can be defined as the branch of data mining that emphasizes predicting unknown or future values of some variables based on other variables (23). Thus, prediction analysis uses historical data in databases to identify trends, anticipate similar cases, and predict future outcomes. The main element of predictive analysis is the predictor, a variable that can be measured for individuals in order to predict future outcomes. A collection of predictors is integrated into a prediction model. When this collection is analyzed, it can be used to anticipate possible future outcomes at a reliable level. One example of predictive analysis is its application in cyber security. Using predictive analysis on historical data, investigators were able to derive an important conclusion: cyber attacks follow physical ones. After the April 1, 2001 mid-air collision between an American surveillance plane and a Chinese fighter aircraft, a group of Chinese hackers organized and sustained an extensive campaign of cyber attacks against American targets for one week. Thus, using predictive analysis and historical data about physical attacks or conflicts, such as the example above, investigators can predict a cyber attack after a physical one and therefore take additional security steps after any physical attack. In this thesis, predictive analysis, with the help of other data mining techniques, will be used to aid in the prediction of cancer in patients.

4. Predictive Modeling
Predictive modeling can be defined as the process used by predictive analysis to create a statistical model that aids in forecasting future trends and probabilities. A predictive model contains a number of predictors, which are variables that affect the future outcomes (24). Predictive modeling uses known data to predict unknown values. In the disease prediction field, for example, a predictive model consisting of the patient's family history, genes, and so on is built to predict the disease risk for that patient. Based on these predictors, the patient's disease risk is predicted. In this thesis, some prediction models, such as decision trees, decision rules, and association rules, will be used to aid in the cancer prediction process.
5. Data warehouse
A data warehouse (DW) is defined, with respect to data mining, as a repository of current and historical data in which a collection of data is stored in huge databases for data mining and analysis purposes (18). Data warehouses usually serve multiple subject areas. For instance, data warehouses are used by search engines such as Google. These data warehouses hold current and historical data collected from various sources. Thus, Google's search engine draws on huge data warehouses covering different subjects, such as medicine, business, psychology, and others. If an expert is interested in gaining knowledge of both biology and computer science, Google's search engine will aid him or her in

both subjects. This is because a huge amount of information can be found by mining Google's data warehouses.
6. Data mart
A data mart (DM) is a subset of a data warehouse that has smaller databases containing series of related data that are grouped together and dedicated to one division or department (21). Data marts usually serve a particular subject, such as sales or medicine. For example, data about patients can be placed into a data mart to make it accessible to doctors and other medical researchers.
7. Decision trees
Decision trees in data mining can be defined as a hierarchy of rules that can be presented graphically as a tree structure, allowing a set of data to be classified according to these rules (19). In a decision tree, each leaf or end node is assigned a class label. The internal nodes and the root node, on the other hand, contain attribute test conditions that separate the records of a dataset that have different characteristics. Once the decision tree is built, it is very easy to classify a test record. The test starts from the root node by applying the test condition to the record and then follows the suitable branch depending on the outcome of the test. This leads either to an internal node, where another test condition is applied, or to a leaf node. The leaf node has an associated class label, which is then assigned to the record. Decision trees usually show all the possible outcomes of a decision and are very effective in classification and prediction, since they are predictive models that can be

used to make predictions through classification. In addition, they consist of three types of nodes, which are presented in Figure 3.2 (11):

i. Decision node: these nodes are usually represented by squares and display the possible decisions that can be made. Lines that emerge from these nodes present all the different choices available at a node.
ii. Chance node: these nodes are usually represented by circles and display chance outcomes or events that can occur and lead to two or more outcomes.
iii. Terminal node: these nodes are usually represented by triangles or by lines that do not lead to further decision or chance nodes. Terminal nodes display the final outcomes of the decision-making process. These nodes usually show a prediction for the problem or question posed initially at the root node.

Figure 3.2 DECISION TREE NODE TYPES (11)
[Figure legend: outcome, decision, uncertainty (external event)]

8. Decision rules

Decision rules can be defined as a set of rules that can be extracted from a decision tree or a dataset to predict the class to which a record of a database belongs (14).

Decision rules provide an easily understandable and general presentation of the knowledge contained in a dataset. In addition, they give all the possible decisions and predictions that can be made in a specific case. For example, in the medical field, decision rules can be used by oncologists when reviewing a patient's case to predict whether he or she has a cancer risk. Based on the risk factors present in that case, including gene mutations and any family history, the decision rules predict the cancer risk for that patient.
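
The idea that decision rules can be extracted from a decision tree can be illustrated with a minimal sketch: every root-to-leaf path becomes one rule whose precondition is the conjunction of the tests along that path. The tree below, its attribute names, and its class labels are hypothetical and chosen only for illustration; they are not taken from the thesis implementation.

```python
# Hypothetical tree, for illustration only: an internal node is a pair
# (attribute, {value: subtree}); a leaf is a class label string.
TREE = (
    "gene_mutation",
    {
        "Yes": "high risk",
        "No": (
            "family_history",
            {"Yes": "high risk", "No": "low risk"},
        ),
    },
)


def extract_rules(tree, path=()):
    """Yield (conditions, label) for every root-to-leaf path of the tree."""
    if isinstance(tree, str):  # leaf: the path so far is the precondition
        yield list(path), tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from extract_rules(subtree, path + ((attribute, value),))


def format_rule(conditions, label):
    """Render one rule as readable IF ... THEN ... text."""
    tests = " AND ".join(f"({a} = {v})" for a, v in conditions)
    return f"IF {tests} THEN {label}"


RULES = list(extract_rules(TREE))
for conditions, label in RULES:
    print(format_rule(conditions, label))
```

Each printed line is one decision rule, so a tree with three leaves yields three rules. Representing the tree as nested pairs keeps the sketch short; a real implementation would more likely use a dedicated node class.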

4. Prediction Analysis in data mining

Predictive analysis is one of the more popular data mining techniques. It has been applied in many fields, especially the medical one. This chapter explains its stages and techniques in detail.

4.1 Predictive data mining stages

Since prediction is one of the important tasks of data mining, a data mining method such as a decision tree is constructed, and interpretation or inference is made on the available dataset to predict the behavior of a new dataset. Predictive analysis is built on data mining techniques and methods. These techniques are able to access large databases and extract data from them, and then process the data with algorithms to discover hidden patterns and predictive information.

Predictive data mining consists of several stages, as shown in Figure 4.1; these stages cover all aspects of data mining (1). As Figure 4.1 shows, the process starts with collecting data from its sources, such as a data warehouse or a data mart. Data collection from a data warehouse includes defining the features relevant to the field where predictive data mining will be employed and preparing a storage file in order to document these features. It also involves data preprocessing, which consists of data cleaning, data transformation, data reduction, and securing the data to

protect it from corruption. The following step is data exploration, which applies preparatory analysis to the data to prepare it for mining. The next procedure is feature selection or reduction. The third main step is mining, or model building for prediction. The final step is data post-processing and interpretation, which involves making inferences and drawing conclusions from the previous step.
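
The stages described above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the records, the field names, and the one-attribute "model" (a majority-class lookup) are all hypothetical stand-ins, not the thesis implementation.

```python
from collections import Counter

# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"id": 1, "family_history": "First", "gene_mutation": "Yes", "note": "a", "risk": "High"},
    {"id": 1, "family_history": "First", "gene_mutation": "Yes", "note": "a", "risk": "High"},  # duplicate
    {"id": 2, "family_history": "No", "gene_mutation": None, "note": "b", "risk": "Low"},  # missing value
    {"id": 3, "family_history": "Third", "gene_mutation": "No", "note": "c", "risk": "Low"},
]


def preprocess(records):
    """Data cleaning: drop exact duplicates and records with missing values."""
    seen, clean = set(), []
    for rec in records:
        key = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if key in seen or any(v is None for v in rec.values()):
            continue
        seen.add(key)
        clean.append(rec)
    return clean


def select_features(records, features, target):
    """Feature selection: keep only the chosen predictors and the target."""
    return [{k: r[k] for k in (*features, target)} for r in records]


def build_model(records, feature, target):
    """Model building: majority class per value of a single predictor."""
    votes = {}
    for r in records:
        votes.setdefault(r[feature], Counter())[r[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in votes.items()}


def run_pipeline(raw):
    clean = preprocess(raw)
    reduced = select_features(clean, ["family_history", "gene_mutation"], "risk")
    model = build_model(reduced, "gene_mutation", "risk")
    return clean, reduced, model


CLEAN, REDUCED, MODEL = run_pipeline(RAW)
print(MODEL)
```

Post-processing and interpretation would then amount to reading the resulting lookup table and drawing conclusions from it; a real study would replace the lookup with one of the techniques discussed in this chapter.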

[Figure 4.1: stages of predictive data mining. Legible labels include: data transformations; preliminary checks on data quality.]

4.2 Predictive Analysis Techniques

Many techniques help predictive analysis predict outcomes and identify future trends by analyzing huge amounts of data with many variables, but only the ones used in this thesis are explained in depth. These techniques are the following:

i. Decision trees.
ii. Decision rules.
iii. Association rules.

4.2.1 Decision trees

Decision trees can be used as classifiers to decide on a suitable action from a group of predefined actions. In addition, they have been widely applied to different problems such as medical diagnosis, disease prediction, and credit card risk assessment.

4.2.1.1 Decision trees as predictive models

In predictive analysis, decision trees can be used as predictive models to reach a conclusion about an item's value based on observations of it. A predictive model focuses on analyzing patterns in historical data in order to transform data into actionable decisions. The following example shows how a decision tree is used as a predictive model. Suppose there is a historical patient dataset that

consists of patients who were diagnosed with breast cancer in the past. In order to predict whether a person who visits the doctor is at risk of getting breast cancer in the future, a decision tree can be built from that historical dataset. The variables that help predict cancer in the decision tree are the risk factors, which appear as the predictors in the tree. These predictors help derive a prediction of whether a patient who visits the doctor is at risk of cancer in the future. Thus, the decision tree is drawn by first mining the dataset and extracting the attributes that are useful in the prediction process, which are the risk factors. The patients database is represented in Table 4.1.

Table 4.1 Patients Information Database

Patient ID#   Family history (degree)   Medical history (disease in one breast)   BRCA1 or BRCA2 mutation   High disease risk
345679894     First                     Yes                                       Yes                       Yes
356890322     No                        No                                        Yes                       Yes
367345890     Second                    Yes                                       No                        Yes
378436888     First                     Yes                                       Yes                       Yes
388094671     No                        No                                        No                        No
395537120     Third                     No                                        No                        No
334298018     First                     No                                        Yes                       Yes
312475490     Second                    No                                        Yes                       Yes
309456731     Third                     Yes                                       No                        Yes

The database presented in Table 4.1 can be visualized graphically by the decision tree in Figure 4.2.

Figure 4.2 PATIENTS DATABASE DECISION TREE
[Legible node labels: Uterine cancer high risk?; Family history degree; Medical history; Gene mutation]

Based on the decision tree in Figure 4.2, oncologists can predict whether any patient who visits them in the future is at risk of cancer. The prediction starts after taking the patient's information, including family history of cancer, medical history, and gene mutations. This information forms the predictors in the decision tree in Figure 4.2, which help derive a prediction for the patient's case.
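
As a minimal sketch of how such a tree classifies a record, the code below walks a tree from the root to a leaf, following the branch that matches the record at each node. The rows are those of Table 4.1, but the tree itself is an assumption: it is one small tree that happens to be consistent with every row of the table, not a reproduction of the tree in Figure 4.2.

```python
# Rows of Table 4.1:
# (family history degree, medical history, BRCA mutation, high disease risk)
TABLE_4_1 = {
    "345679894": ("First", "Yes", "Yes", "Yes"),
    "356890322": ("No", "No", "Yes", "Yes"),
    "367345890": ("Second", "Yes", "No", "Yes"),
    "378436888": ("First", "Yes", "Yes", "Yes"),
    "388094671": ("No", "No", "No", "No"),
    "395537120": ("Third", "No", "No", "No"),
    "334298018": ("First", "No", "Yes", "Yes"),
    "312475490": ("Second", "No", "Yes", "Yes"),
    "309456731": ("Third", "Yes", "No", "Yes"),
}
ATTRIBUTES = ("family_history", "medical_history", "gene_mutation")

# Assumed tree: internal node = (attribute, {value: subtree}); leaf = label.
TREE = (
    "gene_mutation",
    {
        "Yes": "Yes",
        "No": ("medical_history", {"Yes": "Yes", "No": "No"}),
    },
)


def classify(tree, record):
    """Follow the matching branch at each node until a leaf label is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree


def as_record(row):
    """Turn one table row into an attribute dictionary (label excluded)."""
    return dict(zip(ATTRIBUTES, row[:3]))


PREDICTED = {pid: classify(TREE, as_record(row)) for pid, row in TABLE_4_1.items()}
NEW_PATIENT = {"family_history": "Second", "medical_history": "No", "gene_mutation": "No"}
print(classify(TREE, NEW_PATIENT))
```

Classifying a future patient costs only one test per level of the tree, which is why the walk from root to leaf is cheap even for large historical databases.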

The predictive analysis in this example used the decision tree as a technique to predict the disease risk, and the predictor variables in the tree helped it derive one prediction per patient. Together, these predictor variables formed a predictive model from the decision tree. The predictor variables here are family history, medical history, and gene mutation. Based on these variables in the decision tree, it was possible to predict the person's risk.

4.2.1.2 Advantages of decision trees:

1. Decision trees are considered self-explanatory, which makes them simple to follow. A decision tree is therefore not difficult for people who are not data mining experts to understand, provided it has a reasonable number of leaves.
2. Decision trees can deal easily with both nominal and numeric attributes of a dataset.
3. Decision trees can deal with databases or datasets that may contain some errors.
4. The decision tree representation is rich enough to show graphically any discrete-value classifier.
5. Datasets that have missing values can be handled easily by decision trees.
6. Decision trees are an attractive data mining technique when classification is costly, because they only query the feature values along one path from the root node to a leaf.
7. Decision trees are graphic representations showing all decision alternatives, potential outcomes, and events that might occur by

chance. This graphical presentation aids in understanding complex sequences of decisions and the dependencies between them.
8. Decision trees are complementary, which means they can be employed effectively in many fields in conjunction with the other tools used in that field. For example, decision trees applied in cancer prediction can evaluate the percentage risk.
9. Decision trees are fast methods for prediction and development (11).

4.2.1.3 Disadvantages of decision trees:

1. Decision trees may suffer from a fragmentation problem if there are many relevant attributes.
2. If there are few relevant attributes, the decision tree performs well, whereas if there are complex interactions between attributes, its performance is weaker (11).

4.2.2 Decision rules:

Decision rules are a common data mining technique in which each rule takes the form of a Horn clause, such as:

If $Condition_1 \wedge Condition_2 \wedge \dots$ Then $Class$

The rules are presented in disjunctive normal form such that $R = (r_1 \vee r_2 \vee \dots \vee r_k)$, where $R$ is the rule set and each $r_i$ is a rule or disjunct. Each decision rule can be presented in this way:

$r_i : (Condition_i) \rightarrow y_i$   (4.1)

The left-hand side of the rule in (4.1) is called the precondition or rule antecedent, and it consists of a conjunction of attribute tests such that:

$Condition_i = (A_1 \; op \; v_1) \wedge (A_2 \; op \; v_2) \wedge \dots \wedge (A_k \; op \; v_k)$   (4.2)

Each pair $(A_j, v_j)$ is an attribute-value pair, where $A_j$ is an attribute and $v_j$ is its value, and $op$ is a logical operator chosen from the set $\{=, \neq, <, >, \leq, \geq\}$. Each attribute test $(A_j \; op \; v_j)$ is called a conjunct. The right-hand side of the rule is the rule consequent, which contains the predicted class $y_i$. Moreover, if the precondition of a rule $r$ is satisfied by the attributes or columns of a record $x$ in a dataset, then the rule $r$ covers the record $x$. In addition, $r$ is said to be triggered or fired when it covers a record. The variables in the precondition are the predictor variables, which together help derive the right prediction (14).

Decision rules can be used in many fields, such as disease prediction, clinical diagnosis, and academic awards, where they help choose the candidates with the most outstanding achievements. Decision rules can be derived or generated from the database or from the decision tree itself, which is very useful in the prediction process.
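
The notion of a rule covering a record can be made concrete with a short sketch. It stores each rule as a tuple of conjuncts (attribute, operator, value) plus a class label, and a rule fires when every conjunct holds for the record. The operator table, the attribute names, and the two example rules are assumptions made for illustration; they are not the rule base of the thesis.

```python
import operator
from dataclasses import dataclass

# Assumed operator set for the conjuncts (attribute op value).
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Rule:
    conjuncts: tuple  # tuple of (attribute, op, value) triples
    label: str        # predicted class, the rule consequent

    def covers(self, record):
        """True when every conjunct of the precondition holds for the record."""
        return all(OPS[op](record[attr], value) for attr, op, value in self.conjuncts)


def predict(rules, record, default=None):
    """Return the class of the first rule that fires, or the default."""
    for rule in rules:
        if rule.covers(record):
            return rule.label
    return default


# Hypothetical rule set for illustration.
RULES = [
    Rule((("family_history_degree", "=", "First"), ("affected_relatives", ">=", 2)), "high risk"),
    Rule((("family_history_degree", "=", "Third"),), "medium risk"),
]

PATIENT = {"family_history_degree": "First", "affected_relatives": 2}
print(predict(RULES, PATIENT))
```

Returning the first rule that fires is one possible conflict-resolution choice; an ordered rule list keeps the sketch simple, whereas an unordered rule set would need a voting scheme when several rules cover the same record.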

4.2.3 Association rule mining:

Association rules are helpful in detecting relations among seemingly unrelated data in a database or data repository, and in discovering patterns of relationships between the columns or attributes of the dataset and its observations (16). Association rule mining was proposed by Agrawal et al. in 1993. It can be used in many fields, such as market basket analysis, telecommunication networks, market and risk management, and analyzing the associations among rules for disease prediction. An association rule has two main parts: an antecedent (if) and a consequent (then). Analyzing the relationships expressed by association rules is useful for detecting patterns that aid in the prediction process (16). An association rule has the form $X \Rightarrow Y$, where $X$ is the antecedent, which combines a set of predicates resulting from exploring the dataset, and $Y$ is the consequent, which has only one predicate. The rule shows the relationship between the antecedent and the consequent.

4.3 Why is Predictive Analysis useful?

Predictive analysis is one of the most efficient branches of data mining. Together with its techniques, it is able to extract information from the data in databases and use this information to predict future trends in the field where it is applied. Predicting future outcomes helps providers in that field improve their services by taking advantage of the expected outputs, take precautions against any predicted risks, and

reduce the mistakes that were made in the past, before this prediction procedure was applied.

5. Predictive analysis in cancer prediction

Data mining techniques have a promising role in predicting diseases, especially cancer. In this chapter, a number of these techniques are applied to the cancer predictor software implemented in this thesis to clarify their roles in cancer prediction.

5.1 Predictive analysis steps in cancer prediction

Predictive analysis has an important role in predicting a disease outcome. Based on factors of a specific disease such as age, family history, and genes, predictive analysis is able to predict the disease itself. Given a historical database of patients who had the disease before, disease prediction is achieved by the following steps:

1. Given historical patient data from medical databases showing the factors that contribute to a patient getting cancer, such as gene mutations, family history, and lifestyle, data mining first prepares the data for the mining process.
2. Data preparation includes cleaning the data of noise and duplicate observations, and data reduction, which focuses on selecting the useful features and records that are relevant to the data mining technique.

3. The next step is the data mining step, which uses data prediction to detect hidden patterns in the data.
4. Data mining extracts information from those medical databases to reach observations that are useful in predicting cancer for a particular patient who has the same risk factors as the patients whose information was extracted.
5. Comparing the current patient's information with the extracted records of patients who had cancer before, if the current patient has the same risk factors for cancer as the extracted ones, cancer might be predicted for the current patient.
6. Predictive analysis, with the support of techniques such as decision trees, neural networks, and others, and using historical medical databases of cancer patients, predicts cancer for the current patient based on other patients who had the same factors and, as a result, were diagnosed with cancer.

5.2 The goals of cancer prediction

The basic goal of cancer prediction is different from the goals of cancer diagnosis. In cancer prediction and prognosis, the cancer researcher is mostly concerned with the following three predictive foci:

1. Risk evaluation for cancer, or the prediction of cancer susceptibility.
2. The prediction of cancer recurrence.
3. The prediction of the cancer patient's survival time after diagnosis, meaning how many months the patient may survive (2).

In the first situation, the cancer researcher attempts to predict the probability of cancer before the disease occurs in the patient. In the second situation, the cancer researcher attempts to predict the probability of redeveloping cancer after the apparent resolution of the disease or after the patient has already been diagnosed with cancer. In the third case, the cancer researcher attempts to predict an outcome of cancer after diagnosis, such as life expectancy, survivability, cancer progress, and tumor sensitivity to cancer medicines (2). However, this thesis focuses only on the first predictive focus, which is predicting cancer prior to its occurrence.

5.3 Predictive analysis techniques in cancer prediction

Predictive analysis is a method that requires the help of other techniques. Thus, a number of data mining techniques support predictive analysis in cancer prediction, as follows:

5.3.1 Decision trees.

Decision trees, as a data mining technique, have been in use in the medical field for many years because of their wealth of classification rules and their clarity. Cancer prediction is one of their uses in that field. In the past, cancer researchers have applied many techniques, including decision trees, to cancer prognosis and prediction. Formally, decision trees are structured graphs or flow charts of a series

of decisions, represented by the nodes in the tree, and all possible results or consequences of these decisions, represented by leaves or branches in the tree. These decision trees are mainly used to make a plan in order to reach a goal. This goal depends on the reason for using the decision tree. For example, if the decision tree is being used for disease prediction, then the tree is built primarily to achieve the disease prediction goal.

5.3.1.1 Decision trees in cancer prediction

Given a database that includes medical information about patients, such as age, date of birth, and medical history, cancer researchers have found decision trees applicable and efficient for predicting the groups of patients in the database with a high risk of cancer. After the information required for cancer prediction has been mined from the medical database, the decision tree with its predictive model is able to classify the patients in the database into groups by predicting which group each patient case belongs to. Some patients are predicted to be in the group with a high risk of cancer and others in the group with less risk. In order to construct the decision tree for cancer prediction, the following steps are taken:

1. Some patients might consult a doctor in order to check whether they might have cancer in the future, so the doctor performs some tests and examinations

and asks the patients about factors that contribute to cancer occurrence.
2. The patient information is afterward stored in a patient database that also holds data about other patients who consulted a doctor for the same reason, which is cancer prediction.
3. Depending on the type of cancer, cancer researchers specify the risk factors that contribute to the cancer's occurrence, diagnosis, or prediction. The patients database is then mined for these risk factors to extract useful patterns from the attributes that represent them.
4. The decision tree is built from the data mined from the database. The nodes of the decision tree are the cancer's risk factors, which are the attributes of the database, and the branches of a particular node represent all the possible values of that node.
5. The leaf nodes are the final groups, one of which a patient is predicted to belong to, e.g. a group of patients with high cancer risk and another group for patients with lower or no risk.

In the decision tree graph, the nodes represent the decisions that predict the cancer risk for a particular patient. In the prediction process, the cancer researcher walks through the decision tree, starting from the root and ending at one leaf node. Based on the decision nodes that a patient passes through in the tree, her cancer risk is predicted at the leaf node. The leaf node thus shows whether the patient is at high risk or no risk of cancer in the future. Applying the decision tree to the cancer predictor calculator is very useful in predicting cancer, since decision trees are prediction models consisting of predictors that contribute to predicting to which class of cancer

risk (low, medium, or high) a person belongs. The predictors in the decision tree are the main risk factors in the cancer predictor calculator, which have a strong role and impact in predicting cancer.

Figure 5.1: DECISION TREE AS PREDICTION MODEL FOR THE CANCER PREDICTOR CALCULATOR
[Legible node labels: cancer high risk?; Family history]

5.3.1.1.1 Example on how the decision tree is used in cancer prediction:

Suppose the following patients database is provided after the patients visit a medical institution to check whether they are at risk of uterine cancer. This database consists of four tables that store the patients' information, such as their medical history, uterine history, drug consumption, and other personal data. The first table of the database, Table 5.1, represents the family history of the patient. The family history involves the types of cancer that increase the risk of uterine cancer, such as ovarian cancer, colorectal cancer, and uterine cancer.

Table 5.1 Patient Disease History Table

Patient ID   Ovarian cancer family history   Uterine cancer family history (degree)   Number of affected degree-one relatives   Colorectal cancer family history
1458349      Sister                          first                                    1                                         Yes
1478020      Daughter                        third                                    1                                         No
1356780      Mother                          first                                    2                                         Yes
1344673      Mother                          second                                   1                                         No
1267340      Daughter                        first                                    3                                         No
1259875      Sister                          third                                    1                                         Yes
1145689      Mother                          third                                    2                                         No
1175934      Sister                          second                                   2                                         No

The second table of the database, Table 5.2, represents the uterine activity of the patient. Uterine activity here means the activities of the uterus, such as pregnancy history, menstrual periods, and menopause.

Table 5.2 Patient Uterine Activity Table

Patient ID No   Menstrual period before age 12   Menopause after age 55   Reproductive history
1458349         Yes                              Yes                      Yes
1478020         Yes                              Yes                      Yes
1356780         No                               Yes                      No
1344673         No                               No                       No
1267340         No                               No                       No
1259875         Yes                              No                       Yes
1145689         No                               Yes                      No
1175934         No                               No                       Yes

The third table of the database, Table 5.3, represents the drug consumption history of a patient. It covers drugs that a patient has taken for long periods, such as the following:

Estrogen: a kind of hormone that can be made by the body or in medical laboratories; it aids in developing female sex characteristics and also helps in bone growth.

Tamoxifen: a type of medicine used to treat certain kinds of breast cancer. In addition, it can be used to prevent breast cancer in women who have had ductal carcinoma in situ, which means abnormal cells in the ducts of the breast, and it can be taken by women who are at high risk of breast cancer.

Radiation therapy: the use of high-energy radiation from x-rays, neutrons, protons, gamma rays, and other sources in order to destroy cancer cells and shrink tumors inside the body (26).

Table 5.3 Patient Drugs Table

Patient ID   Estrogen consumption   Tamoxifen consumption   Radiation therapy
1458349      No                     Yes                     Yes
1478020      Yes                    Yes                     Yes
1356780      No                     Yes                     No
1344673      Yes                    No                      Yes
1267340      No                     Yes                     No
1259875      Yes                    No                      Yes
1145689      No                     Yes                     Yes
1175934      Yes                    No                      Yes

The fourth table of the database, Table 5.4, represents the patients' personal information, such as age, obesity, and smoking.

Table 5.4 Patient Personal Info Table

Patient ID   Age   Obese   High fat diet   Smoker
1458349      50    Yes     No              Yes
1478020      33    No      Yes             Yes
1356780      66    Yes     Yes             No
1344673      28    No      No              Yes
1267340      18    Yes     No              No
1259875      70    Yes     Yes             Yes
1145689      68    Yes     Yes             Yes
1175934      59    No      No              Yes

To predict a patient's uterine cancer risk with the decision tree, given her information in the four tables above, such as medical history, reproductive history, and drugs used, data mining first mines the patients database in order to extract useful information that helps in the prediction process. In this case, the most helpful information for predicting uterine cancer risk is the set of factors that contribute most to developing the cancer. Table 5.5 shows these factors:

Table 5.5 Patient Risk Factors Table

Patient ID   Uterine cancer family history   Number of affected relatives   Reproductive history   Estrogen consumption   Tamoxifen consumption
1458349      first                           1                              Yes                    No                     No
1478020      third                           1                              Yes                    Yes                    Yes
1356780      first                           2                              No                     No                     Yes
1344673      second                          1                              No                     Yes                    No
1267340      first                           3                              No                     No                     Yes
1259875      third                           1                              Yes                    Yes                    No
1145689      third                           2                              Yes                    No                     No
1175934      second                          2                              No                     Yes                    No

Moreover, with the help of historical databases of patients who were diagnosed with the same type of cancer in the past, it is easy to predict for each patient in Table 5.5 whether she is at high risk of developing uterine cancer. Using the decision tree, a prediction model consisting of the following predictor variables was designed in Figure 5.2 to predict the risk class to which these patients belong:

Family history
Number of affected relatives by breast cancer
Reproductive history
Estrogen consumption
Tamoxifen consumption

Based on these predictors in the decision tree, the patients in the previous database are classified into low, medium, and high risk of uterine cancer, as shown in Table 5.6.

Figure 5.2: DECISION TREE AS PREDICTION MODEL FOR UTERINE CANCER
[Legible node labels: Uterine cancer high risk?; Number of affected relatives; Estrogen consumption; Tamoxifen consumption; branch and leaf values include Yes and low]

Table 5.6: Final Predicted Cancer Risk for Patients

Patient ID   Uterine cancer family history   Number of affected relatives   Reproductive history   Estrogen consumption   Tamoxifen consumption   Predicted risk
1458349      First                           1                              Yes                    No                     No                      Medium
1478020      Third                           1                              Yes                    Yes                    Yes                     Medium
1356780      First                           2                              No                     No                     Yes                     High
1344673      Second                          1                              No                     Yes                    No                      High
1267340      First                           3                              No                     No                     Yes                     High
1259875      Third                           1                              Yes                    Yes                    No                      Medium
1145689      Third                           2                              Yes                    No                     No                      Low
1175934      Second                          2                              No                     Yes                    No                      High

From the decision tree above, it is easy to predict the uterine cancer risk for any patient who visits the doctor in the future to see whether she is at higher risk of uterine cancer. For example, if a patient X has two first-degree relatives who had uterine cancer, then the decision tree in Figure 5.2 predicts that she is at high risk of uterine cancer.

5.3.2 Decision rules

Decision rules, as a data mining technique, have also been in use in the medical field. In the clinical field, decision rules used for prediction are usually called clinical prediction rules or clinical decision rules. Clinical prediction rules, or risk scores, are tools designed to support health care professionals in making decisions when they provide care for their patients. These rules contain predictors, which are variables obtained from history, such as physical tests or examinations, disease characteristics, and patient characteristics (6). In addition, these rules have been applied in clinical prediction over the last few years to aid decision-making related to prognosis. Applying clinical prediction rules to the cancer predictor calculator is effective in predicting cancer because decision rules are one of the prediction models that have been assisting cancer prediction. The predictors that contribute to predicting cancer in these rules are mainly the risk factors of the disease. For instance, the clinical prediction rules that can be extracted from the ovarian cancer part of the cancer predictor calculator are shown in the following table:

Table 5.7: Prediction Decision Rules from the cancer predictor calculator

Prediction rules (preconditions) → Ovarian cancer predicted result
If (Talcum Powder = Y) ∧ (Fertility Drugs = Y) → Medium Risk
If (Age >= 80) ∧ (Reproductive History = Y) → Low Risk
If (Tubal Ligation = Y) ∧ ((Exercise = Y) ∨ (Family History = Y)) → No Risk
If (Period_Age < 12) ∧ (Reproductive History = Y) → Low Risk
If (HRT = Y) ∧ (Reproductive History = Y) → Low Risk
If (Ovarian Removed = Y) → No Risk
If (Period_Age < 12) ∧ (HRT = Y) → Medium Risk
If (Overweight = Y) ∧ (Endometriosis Issue = Y) → Medium Risk
If (Smoker = Y) ∧ (Exercise = Y) → No Risk
If (Gene Mutation = Y) ∧ (Fertility Drugs = Y) → Medium Risk (for any genes such as BRCA1, BRCA2, etc.)

Another example of prediction rules comes from the uterine cancer example given earlier in Chapter 5. The prediction rules that can be extracted from the patients database are the following:

If (Family History Degree == "First") ∧ (Affected Relatives >= 2) Then Patient is in the class of high risk of cancer.

If (Family History Degree == "Third") ∧ (Estrogen Consumption == "Yes") Then Patient is in the class of medium risk of uterine cancer.
If (Family History Degree == "First") ∧ (Reproductive History == "No") Then Patient is in the class of high risk of uterine cancer.
If (Family History Degree == "Second") ∧ (Estrogen Consumption == "Yes") Then Patient is in the class of medium risk of uterine cancer.
If (Reproductive History == "No") ∧ ((Estrogen Consumption == "Yes") ∨ (Tamoxifen Consumption == "Yes")) Then Patient is in the class of medium risk of uterine cancer.

The preconditions of these prediction rules consist of the predictors, which are the attributes of the patients database specified in the earlier section. These predictors form the prediction model for each prediction rule and thus support predicting uterine cancer. These prediction decision rules will be used by oncologists in reviewing patients' cases to predict their uterine cancer risk. For example, suppose there is a patient X whose uterine cancer risk is to be predicted from the previous prediction decision rules. This patient has no reproductive history and consumes both estrogen and Tamoxifen. Consequently, patient X is predicted to be at medium risk of uterine cancer based on the last decision rule.

5.3.3 Association rules:

Since association rules find relationships and associations within large sets of data by mining databases, and also display the attribute value

conditions that usually occur together in a database, they can be used in predicting cancer. In the process of predicting cancer, association rules detect the risk factors that are related to each other. For example, the association rules that could be found in the cancer predictor calculator are the following:

(Menopause Age < 40) ∧ (HRT_status = Yes) → Patient is given estrogen doses
(Race = White) ∧ (Family History = First) ∧ (Num_of_Affected Relatives >= 2) → Patient has to have regular checks for ovarian cancer cells
(BRCA1 Mutation = Yes) ∧ (Reproductive History = No) → Patient is at high risk of breast cancer

These association rules show that the presence of one or more elements in a dataset predicts the occurrence of another. In the above example, if the woman's menopause age was younger than 40 and she had hormone therapy to treat the menopause, then she will probably have doses of estrogen as part of the hormone therapy. As a result of her early menopause and the hormone therapy consisting of estrogen, she will be at high risk of breast or ovarian cancer. In the second association rule, if both risk factors, white race and a first-degree family history of ovarian cancer with two or more affected relatives, appear together in the database, then it is predicted from those factors that the woman should have regular check-ups to detect cancerous ovarian cells, and therefore, from the associations and the

result, predictive analysis with the help of the association rule predicts that the woman might get ovarian cancer in the future.
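
How strongly an antecedent and a consequent occur together can be quantified. The sketch below computes support and confidence, the two measures commonly used in association rule mining; these measures are not discussed in this chapter and are introduced here as an assumption. The data are the eight rows of Table 5.6, and the two example rules are hypothetical, chosen only to show the calculation.

```python
# Rows of Table 5.6: (family history degree, affected relatives,
# reproductive history, estrogen, tamoxifen, predicted risk)
ROWS = [
    ("First", 1, "Yes", "No", "No", "Medium"),
    ("Third", 1, "Yes", "Yes", "Yes", "Medium"),
    ("First", 2, "No", "No", "Yes", "High"),
    ("Second", 1, "No", "Yes", "No", "High"),
    ("First", 3, "No", "No", "Yes", "High"),
    ("Third", 1, "Yes", "Yes", "No", "Medium"),
    ("Third", 2, "Yes", "No", "No", "Low"),
    ("Second", 2, "No", "Yes", "No", "High"),
]
FIELDS = ("family_degree", "affected_relatives", "reproductive_history",
          "estrogen", "tamoxifen", "risk")
RECORDS = [dict(zip(FIELDS, row)) for row in ROWS]


def matches(record, items):
    """True when the record has every attribute value listed in items."""
    return all(record[k] == v for k, v in items.items())


def support_confidence(records, antecedent, consequent):
    """Support = P(X and Y); confidence = P(Y | X) for the rule X => Y."""
    n_x = sum(matches(r, antecedent) for r in records)
    n_xy = sum(matches(r, antecedent) and matches(r, consequent) for r in records)
    support = n_xy / len(records)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical rule: (Reproductive History = No) => (Risk = High)
RULE_A = support_confidence(RECORDS, {"reproductive_history": "No"}, {"risk": "High"})
# Hypothetical rule: (Estrogen = Yes) => (Risk = Medium)
RULE_B = support_confidence(RECORDS, {"estrogen": "Yes"}, {"risk": "Medium"})
print(RULE_A, RULE_B)
```

On this small table the first rule holds for every record that satisfies its antecedent, while the second holds for only half of them, which is the kind of difference an association rule miner uses to rank candidate rules.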

6. Prediction models for two types of cancer

This chapter focuses on two types of cancer, breast cancer and ovarian cancer, which are included in my implementation. In addition, the implementation's results are compared with the results of other well-known algorithms.

6.1 Breast Cancer

Breast cancer is the first type of cancer in this thesis for which predictive analysis predicts the risk in women, with the help of techniques such as decision trees, prediction decision rules, and the others specified previously.

6.1.1 Breast Cancer risk factors

In order to understand how the prediction works, it is beneficial to specify the risk factors of breast cancer, since the prediction process depends on them, along with the historical data from other case-control studies of the same disease. The risk factors of breast cancer are the following (27):

1. Age: the most significant risk factor for breast cancer is the woman's age; the older she is, the higher her risk.
2. Reproductive history: women who have never had children or who had their first child after the age of 30 are more likely to have breast

PAGE 53

cancer. Thus number of pregnancies and being pregnant at a younger age reduce the risk 3 Gene mutations : both BRAC1 and BRAC2 are located in breast cells and produce proteins that prevent cell in breast from growing abnormally In a case where it happens that there is an inherited mutation of any of the previous genes from one of the parents there will be a high risk of developing breast cancer 4 Body exercise : sedentary women who tend to not exercise normally are at high risk for breast cancer while those who perform physical activities normally are at less risk due to its effect on obesity 5 Alcohol consumption : alcohol moderate consuming can increase the risk of breast cancer The risk also depends on the number of consumed glasses per day so for women who don t consume any types of alcohol are at lower risks 6 Family history : the woman s risk of breast cancer grows more in case she has a family history of the disease. This risk increases more by the degree of family history so if for example a family member of degree one (mother sister daughter) has been diagnosed with breast cancer it is predicted that the woman s risk grows for the disease 7. Race : white women are more likely to get breast cancer than other races 8 Menstrual periods : women who started their period before the age of 12 or had menopause after the age of 55 are more predicted to be at a risk of breast cancer This is because of the long life time of estrogen and progesterone exposure. 9 Reproductive history: women who had never had children or had children after the age of 30 are more likely to have breast cancer risk 10 Breast feed ing: women who breastfeed their children have lower risks 4 4

PAGE 54

of breast cancer while others who don t are at higher risks 11. Hormonetherapy : women who take hormonetherapy medications that involve both estrogen and progesterone to be treated from menopause symptoms are at higher risks of breast cancer 12 Medical history of breast cancer : if a woman experienced a breast cancer in one breast before i t is predicted that she has an increased r i sk of the disease 13 Obesity : obese women are predicted to have higher risks of breast cancer than others because of fat tissues generates estrogen that may aid fuel certain cancers 14 Birth controls : women who take birth control pills for more than 10 years are more predicted to have higher risks of breast cancer 6.1.2 Gail breast cancer algorithm Gail breast cancer algorithm is an algorithm that predicts the breast cancer risk for a woman for five years based on some certain predictor variables It uses woman s personal medical history (previous breast biopsies the presence of atypical hyperplasia in previous breast biopsy) her family history especially first degree (mother sister and daughter) reproductive history, Menstrual periods (her age at first child birth age at the beginning of menstruation) to make its predictions. Gail model had been developed by Dr Mitchell Gail Senior Investigator in the Biostatistics Branch of NCI s Division of Cancer Epidemiology and Genetics and his colleagues (20) 4 5

6.1.2.1 How Gail breast cancer algorithm works

The Gail breast cancer algorithm works as follows. In his model, Gail emphasized four risk factors of breast cancer: age at first childbirth, age at menarche, the number of first-degree relatives (mother, sister, daughter) who have had the disease, and the number of breast biopsies. The Gail model predicts breast cancer risk by multiplying the relative risks (RRs) of these four risk factors by the risk for the woman's specific age. This is because, even if a woman has none of the other risk factors, age by itself plays a role in increasing the risk of breast cancer (10).

6.1.2.2 Disadvantages of Gail algorithm:

The Gail model does not take into consideration second-degree relatives (nieces, aunts, uncles, nephews and grandparents) and third-degree relatives (cousins) who have been diagnosed with the disease.
The Gail model does not take into account other risk factors that play a role in increasing the risk of breast cancer, such as hormone therapy, age at menopause, gene mutations and radiation exposure.
The Gail model may underestimate a woman's risk of breast cancer, since it concentrates on only four risk factors.
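The multiplicative combination described in Section 6.1.2.1 can be sketched as follows. This is a simplified reading of the description above, not the published Gail model: the function names are hypothetical, and the age-specific baseline and the relative-risk values in the example are made-up numbers chosen only to show the arithmetic.

```python
from math import prod


def combined_relative_risk(relative_risks):
    """Multiply the relative risks of the individual risk factors."""
    return prod(relative_risks)


def gail_style_risk(age_specific_baseline, relative_risks):
    """Scale an age-specific baseline risk by the combined relative risk.

    Assumption: the baseline is the risk of a woman of the given age who
    has none of the four risk factors (every relative risk equal to 1.0).
    """
    return age_specific_baseline * combined_relative_risk(relative_risks)


# Made-up illustration: baseline 1% and relative risks for age at first
# childbirth, age at menarche, affected first-degree relatives, biopsies.
example = gail_style_risk(0.01, [1.0, 2.0, 1.5, 1.0])
print(f"{example:.2%}")
```

With every relative risk equal to 1.0 the result is just the age-specific baseline, which mirrors the remark that age alone contributes to the risk.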

6.1.2.3 How is Gail algorithm related to data mining:

Prediction is one task of data mining: it predicts outcomes for the future. In breast cancer prediction, predictive analysis supports decision making during the prediction process. Since predictive analysis predicts future outcomes based on historical events, the Gail algorithm is based on predictive analysis. The algorithm predicts a person's risk of breast cancer given inputs such as family history and age, so what predictive analysis does here is build a prediction model. The prediction model consists of variables, called predictors, that affect the predicted result. Predictive analysis also depends on historical databases of patients who had breast cancer before, with the same predictor variables that form the input of the Gail algorithm.

As a result, when a woman uses the Gail algorithm to see her predicted risk, she is asked to provide information about the predictor variables, such as age, family history and genes. Based on databases of breast cancer patients who had the disease before and shared the same factors as the woman, a predictive model consisting of the predictor variables is built for her. Predictive analysis uses these variables to predict the woman's risk of breast cancer, and she is given her risk depending on these predictors.

6.2 Ovarian Cancer

Ovarian cancer is the second kind of cancer addressed in this thesis, for which predictive analysis estimates the risk in women with the help of techniques such as decision tree prediction and decision rules.

6.2.1 Ovarian Cancer risk factors

The risk factors of ovarian cancer are the following (22):

1. Age: the older the woman is, the higher her predicted risk of ovarian cancer.
2. Reproductive history: women who have never given birth are predicted to be at higher risk of the disease. Thus, the risk goes down with each pregnancy.
3. Family history: women with a family history in which first-degree relatives have been affected by ovarian cancer, breast cancer or colorectal cancer are predicted to be more likely to have ovarian cancer than women with no family history.
4. Hormone therapy: women who take estrogen after menopause are predicted to be at higher risk of ovarian cancer.
5. Infertility: infertile women are predicted to have a higher risk of ovarian cancer than fertile women.
6. Gene mutations: women who have a mutation in certain genes, such as BRCA1, BRCA2 and HNPCC (hereditary nonpolyposis colorectal cancer), are predicted to have a higher risk.
7. Obesity: women with a high body mass index are more likely to have a higher risk of ovarian cancer.
8. Previous cancer history: cancer researchers predict that women with previous breast cancer are at high risk of the disease.
9. Endometriosis: women with endometriosis, a condition in which endometrial tissue may grow and reside in one of the ovaries, have a higher risk of the disease.

6.2.2 The Risk of Ovarian Cancer Algorithm

The Risk of Ovarian Cancer Algorithm (ROCA) predicts ovarian cancer in women depending on the woman's age and trends in her CA 125 blood levels. However, the algorithm is still under study, and its results are expected to be ready by 2015 (9). Biostatistician Steven Skates, Ph.D., developed the Risk of Ovarian Cancer Algorithm. CA 125 is a cancer antigen, or carbohydrate antigen, that can be found in ovarian tumor cells. CA 125 is also the name of a test given to patients to measure the CA 125 protein in ovarian cells, because the CA 125 percentage in ovarian cells rises when the person has ovarian cancer.

6.2.2.1 How The Risk of Ovarian Cancer works

The Risk of Ovarian Cancer Algorithm works for postmenopausal women in the following steps (13):

1. The woman has a CA 125 measurement test done to find out the percentage of CA 125 protein in ovarian cells.
2. Using the ROC algorithm, which relies on a mathematical model, information about the woman's age and any changes in her CA 125 levels is integrated over time.
3. Based on the results of the test, the woman is classified into one of three groups:
   Low risk: women in the low risk group are asked to take the CA 125 test again in the next year.
   Intermediate risk: women in the intermediate risk group are asked to repeat the CA 125 test in three months.
   High risk: women in the high risk group are advised to seek special medical care from a gynecologic oncologist, who might recommend surgery if necessary.
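The follow-up step of the procedure above amounts to a lookup from risk group to recommended action, which can be sketched as below. The `FOLLOW_UP` table and the `follow_up` function are hypothetical names introduced for illustration. The sketch takes the risk group as an input, because the source does not give the mathematical model or the cut-offs that assign a woman to a group.

```python
# Recommended follow-up for each risk group, as listed in step 3.
FOLLOW_UP = {
    "low": "repeat the CA 125 test in the next year",
    "intermediate": "repeat the CA 125 test in three months",
    "high": "refer to a gynecologic oncologist, who may recommend surgery",
}


def follow_up(risk_group):
    """Return the recommended action for a risk group name."""
    key = risk_group.strip().lower()
    if key not in FOLLOW_UP:
        raise ValueError(f"unknown risk group: {risk_group!r}")
    return FOLLOW_UP[key]


print(follow_up("Intermediate"))
```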

7. Contribution and Discussion:

Since this thesis discusses the prediction of cancer using predictive analysis and other data mining techniques, an implementation has been provided to support its topic. The software has been implemented using Microsoft C# 2010 Express and built-in SQL. It includes a risk assessment calculator designed for two types of cancer: breast and ovarian. The main role of this calculator is to compute the estimated risk percentage of breast or ovarian cancer. The calculator is intended only for females who would like to estimate their lifetime risk of breast cancer or ovarian cancer. Figure 7.1 shows a simplified model of how the software operates to calculate the disease risk percentage.

[Figure 7.1: Software flowchart. Recoverable labels: breast, ovarian, Define, Next, Previous, Update values, Compute cumulative risk percentage, Yes.]

In the calculator's implementation, I developed the following mathematical formulas, which aid in calculating the final cancer risk percentage for either breast or ovarian cancer:

    Double Disease_Risk = 0;
    Int i = 0;
    While (i <= N) {
        Form_Risk = Σ(j = 1..k) Relative_Risk_j;
        Disease_Risk = Disease_Risk + Form_Risk;
        i = i + 1;
    }
    Risk_Percentage = (Disease_Risk * 100) / 10;

The calculator uses this algorithm to compute the risk percentage. Its variables are:
i: the form number in the program.
N: the number of forms.
k: the number of questions in form i.

In addition, as breastfeeding is one factor that decreases the cancer risk, I developed the following equation, which calculates the amount of decrease in cancer risk based on the number of months a mother breastfed her children and the number of children she has given birth to. It uses 0.8, which is the decrease in breast cancer risk percentage attributed to lactation (8).

    Lactation_Factor = Lactation_Months + (0.8 * Children_Num)    (7.1)

Real databases containing patient information such as age and family history have been used to validate the software. The first, used to validate the predicted breast cancer risk percentage, is the WHI (Women's Health Initiative) databases. The second, used to validate the predicted ovarian cancer risk percentage, is the SEER (Surveillance, Epidemiology and End Results) databases. Applying real medical records of patients who have been diagnosed with breast or ovarian cancer to the risk assessment calculator aids in validating the final predicted risk.

To compare the calculator's risk assessment results to other well-known algorithms, the Gail algorithm's results are compared to the calculator's predicted breast cancer risk percentages. Likewise, the Risk of Ovarian Cancer algorithm's results are to be compared to the predicted ovarian cancer risk percentages.
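A minimal Python sketch of the risk computation and of equation (7.1) is given below. It is not the thesis's C# code. It assumes that the risk of a form is the sum of the relative risks of its questions, that the operators lost in the printed formulas are multiplications, and that every form is visited exactly once; the function names and the sample numbers are hypothetical.

```python
def form_risk(relative_risks):
    """Risk contributed by one form: the sum of the relative risks
    attached to the answers of its k questions."""
    return sum(relative_risks)


def risk_percentage(forms):
    """Accumulate the risk of every form, then scale it as in the
    last line of the algorithm: (Disease_Risk * 100) / 10."""
    disease_risk = 0.0
    for answers in forms:
        disease_risk += form_risk(answers)
    return (disease_risk * 100) / 10


def lactation_factor(lactation_months, children_num):
    """Equation (7.1), assuming 0.8 multiplies the number of children."""
    return lactation_months + 0.8 * children_num


# Made-up relative risks for two forms with two and one questions.
sample_forms = [[0.1, 0.2], [0.3]]
print(round(risk_percentage(sample_forms), 2))
print(round(lactation_factor(12, 2), 2))
```

How the lactation factor is subtracted from the accumulated risk is not stated in the text, so the sketch leaves the two computations separate.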

7.1 Results:

In general, after applying the record values of the Women's Health Initiative (WHI) databases to the cancer predictor calculator, the difference between the calculator's results and the relative risk calculated from the WHI databases is 7%.

In general, after applying the record values of the Surveillance, Epidemiology and End Results (SEER) databases to the cancer predictor calculator, the difference between the calculator's results and the relative risk calculated from these databases is 8.2%.

The results of the Gail Algorithm calculator for breast cancer are compared with the cancer predictor calculator's results in the following tables:

Table 7.1: Race and Age

         Cancer Predictor Calculator    Gail Model Calculator
Age      Black        White             Black        White
20-29    6.5%         32.5%             4.7%         4.1%
30-39    11.5%        37.5%             4.3%         6.3%
40-49    21.9%        47.9%             11.2%        13.1%
50-59    23.3%        49.3%             5.9%         8%
60-69    29%          55%               4.7%         7.7%
70-79    30%          56%               3.5%         6.3%
80+      19.5%        45.5%             NA           NA

Looking at Table 7.1 and comparing the Gail Algorithm's results to my implemented cancer predictor, it is clear that the latter's results are higher than those of the Gail Algorithm calculator.
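The comparison drawn from Table 7.1 can be checked with a short script. The values below are transcribed from the table, with decimal points assumed where the scan lost them; the 80+ row is left out because the Gail calculator reports NA there. The names `PREDICTOR`, `GAIL` and `differences` are hypothetical.

```python
# Predicted risk (%) by age group, transcribed from Table 7.1.
AGE_GROUPS = ["20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
PREDICTOR = {
    "Black": [6.5, 11.5, 21.9, 23.3, 29.0, 30.0],
    "White": [32.5, 37.5, 47.9, 49.3, 55.0, 56.0],
}
GAIL = {
    "Black": [4.7, 4.3, 11.2, 5.9, 4.7, 3.5],
    "White": [4.1, 6.3, 13.1, 8.0, 7.7, 6.3],
}


def differences(race):
    """Cancer predictor result minus Gail result for each age group."""
    return [round(c - g, 1) for c, g in zip(PREDICTOR[race], GAIL[race])]


for race in ("Black", "White"):
    per_group = dict(zip(AGE_GROUPS, differences(race)))
    print(race, per_group)
```

Every difference is positive for both races, which is the observation made about Table 7.1.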

Table 7.2: Age and First Birth Age (all values in %)

First birth age  20-29       30-39       40-49       50-59       60-69       70-79       80+
                 C     G     C     G     C     G     C     G     C     G     C     G     C
<20              1.5   NA    6.5   NA    16.9  NA    18.3  NA    24    NA    25    NA    14.5
20-24            1.7   2.6   6.7   5     17.1  14.4  18.5  9.2   24.2  9.4   25.2  7.9   14.7
25-29            2.1   4.5   7.1   5.8   17.5  16    18.9  10.6  24.6  11.5  25.6  10.1  15.1
30+              -     -     7.4   8.3   17.   18    19.2  12.4  24.9  14    25.9  12.6  15.4
Nulliparous      2.1   3.2   7.1   5.8   17.5  16    18.9  10.6  24.6  11.5  25.6  10.1  15.1

C is the cancer predictor calculator. G is the Gail Algorithm Calculator.

Looking at Table 7.2 and comparing the Gail Algorithm's results to my implemented cancer predictor, it can be seen that for ages 20 to 49 the average difference between the two predictors' results is approximately 2%. On the other hand, this average gets higher for ages 50 to 80 and above.

Table 7.3: Age and Family History (all values in %)

Family history   20-29       30-39       40-49       50-59       60-69       70-79       80+
degree           C     G     C     G     C     G     C     G     C     G     C     G     C     G
1st              2.8   5.3   7.8   9.2   18.2  22.2  19.6  16.3  25.3  19.5  26.3  18.1  15.8  NA
2nd              2     -     7     NA    17.4  NA    18.8  NA    24.5  NA    25.5  NA    15    NA
3rd              16.5  NA    21.5  NA    31.9  NA    33.3  NA    39    NA    40    NA    29.5  NA

C is the cancer predictor calculator. G is the Gail Algorithm Calculator.

Since the Gail algorithm calculator considers only first-degree family history, Table 7.3 shows its predicted results for the first degree only. My cancer predictor, however, considers all family history, including the first, second and third degrees. Comparing the predicted results of both calculators by age and family history, it can be seen that the predicted results are fairly close in all age ranges between 20 and 80+.

Table 7.4: Menstruation Age and Age (all values in %)

Menstruation     20-29       30-39       40-49       50-59       60-69       70-79       80+
age              C     G     C     G     C     G     C     G     C     G     C     G     C     G
<12              20.5  6.4   25.2  9.3   35.9  17.7  37.3  12.2  43    13.7  44    12.3  33.5  NA

Looking at Table 7.4 and comparing the Gail Algorithm's results to my implemented cancer predictor, it is clear that the latter's results are higher than those of the Gail Algorithm calculator.

For ovarian cancer prediction, the Risk of Ovarian Cancer algorithm is still under study and development, and it is the only algorithm that predicts the ovarian cancer risk percentage. It was therefore not possible to compare the ovarian cancer risk percentages predicted by the cancer predictor calculator with those of the Risk of Ovarian Cancer algorithm.

7.2 Evaluation:

To evaluate the accuracy of my cancer predictor's results, I entered the record information of about 760 patients into both calculators. The predicted results showed an average difference of 8.7% between the two calculators.

In addition, to evaluate my cancer predictor calculator against real databases containing real patient information, I entered about 760 patient records from the WHI databases into both the Gail Algorithm calculator and my cancer predictor calculator. The results showed the following:

The average difference between my cancer predictor calculator and the WHI databases is approximately 6.5%.
The average difference between the Gail Algorithm calculator and the WHI databases is approximately 15%.

From these evaluations, the Cancer Predictor Calculator's predicted results are closer to the WHI databases' results than those of the Gail Algorithm Calculator. On this measure, my cancer predictor calculator's estimates for cancer are more accurate than Gail's.

Furthermore, after many tests on my cancer predictor software to decide which risk factors are most indicative in correctly predicting cancer, I found the following risk factors:

Breastfeeding: the calculator's predicted results match the predicted results in the WHI databases.
First birth age: the average difference between the predicted results in the WHI databases and the Cancer Predictor software is approximately 2%.
Weight at birth: the average difference between the predicted results in the WHI databases and the Cancer Predictor software is approximately 4.8%.
Age: the average difference between the predicted results in the SEER databases and the Cancer Predictor software is approximately 0.64%.

7.3 Advantages and Disadvantages:

The cancer predictor calculator has some advantages that are not available in the Gail Model Calculator. First, the Cancer Predictor Calculator combines many risk factors of breast cancer that are not found in Gail's. For example, the Cancer Predictor Calculator has the following factors that are not available in Gail's:

Lifestyle factors such as exercise, alcohol consumption, smoking, bra wear, birth control pills and night shift work.
Environmental factors such as radiation and environment type.
Medical history such as abortion, breast cancer history, the Diethylstilbestrol (DES) drug and hormone therapy.
Personal factors such as menopause, breastfeeding, the number of children a woman gave birth to, and ethnicity.
Genetic factors such as BRCA1 and BRCA2 mutation, height, weight and family history of all degrees.

Body factors such as weight at birth, height at birth and head weight at birth.

Second, the cancer predictor calculator's predicted risk for breast cancer usually increases as the woman gets older. In the Gail Algorithm calculator, however, the risk percentage sometimes decreases as the woman gets older. This should not happen, because according to most medical studies a woman's risk increases as she gets older.

On the other hand, the cancer predictor calculator sometimes overestimates the predicted risk percentage for breast cancer with respect to the race factor when compared to the Gail model calculator.

7.4 Future work:

After analyzing the cancer predictor calculator's results, including the decision tree and the comparisons between the calculator and other well-known algorithms, I suggest the following improvements to the cancer predictor calculator:

1) Designing the family history tree for the patient in the output window.
2) Designing the decision tree for each patient and highlighting his/her path in the tree that shows the predicted risk for him/her.
3) Combining the calculator with other algorithms that predict the same type of cancer as the Cancer Predictor Calculator. For example, combining the calculator with the Gail model calculator would allow the patient to estimate his/her risk with more than one algorithm.
4) Including the CA 125 test in the ovarian cancer risk assessment part.
5) Advising the patient after the output window, for example by connecting the program to hospitals and making an appointment for the patient with oncologists to discuss her/his situation.
6) Adding more cancer types, such as brain cancer and leukemia.

7.5 Conclusion

In conclusion, the purpose of this thesis was to shed light on the ability to use data mining tools in the process of cancer prediction. Today, data mining plays a major role in improving disease prediction, since it extends not only the scope of the analysis but also its depth. Its techniques have a promising future for improving the efficiency of health care and predicting cancer prior to its occurrence. These techniques include decision trees, which predict cancer risk based on a series of predictors in the tree, and clinical decision rules, which can mine datasets and extract rules that aid in cancer prediction. Finally, association rules analyze a patient database to find relations among its attributes and then predict cancer by mining these relations. All of these techniques work together with predictive analysis to get the best possible predicted result. This thesis has also been supported with software that calculates the predicted risk of breast and ovarian cancer. Finally, data mining methods assist oncologists in saving people's lives by predicting cancer prior to its occurrence.

APPENDIX A

CANCER PREDICTOR EXPERT SYSTEM USER GUIDE

The cancer predictor calculator is an online user interface model that predicts women's risk percentage for two types of cancer: breast cancer and ovarian cancer. The calculator provides its users with their predicted percentage and the class of risk to which they belong: low, medium or high risk.

Installation:

The Cancer Predictor Calculator can only operate on a computer that runs Windows. The installation process takes the following steps:

1. Go to the following website: http://www.mediafire.com/?m53u9js2xbaphna
2. Since the Cancer Predictor software is password protected, the user will be requested to enter the software password, which is FinallydonE14.
3. After the password is entered correctly, the user will be directed to the download page, which prompts him/her to start downloading.
4. Once the download is complete, a file directory with the name CancerPredictorCalculator will open.
5. The user should click setup.exe to install the software on his/her device.

Using the Cancer Predictor Calculator:

1. After the introductory window, the user is directed to the Cancer Predictor Calculator menu, which prompts her to choose either breast cancer or ovarian cancer, as shown below.

[Screenshot: menu window titled "Choose a Cancer Tumor:"]

2. After the user chooses the type of cancer for which she would like to know her predicted risk, she is asked to answer a number of questions in consecutive windows. In a questions window, she can see her predicted risk by clicking the Risk Probability button, as shown below:

[Screenshot: questions window with answer options such as Yes, No and Don't know]

3. If the user forgets to answer one or more questions in any window, a message window asks her to answer those questions before letting her navigate to the next window.

[Screenshot: questions window showing the message "You forgot to answer Question 16!" with Previous and OK buttons]

4. The final window shows the user's predicted risk percentage and the class of risk to which she belongs. After she sees her result, she has two options: going back to the home page or closing the software.


BIBLIOGRAPHY

1. Chukwugozie Nsofor, Godswill. "A Comparative Analysis of Predictive Data-Mining Techniques." Master's thesis. The University of Tennessee, 2006.
2. Cruz, Joseph A., and David S. Wishart. "Applications of Machine Learning in Cancer Prediction and Prognosis." Cancer Informatics (2006): 59-78.
3. Doheny, Kathleen. "Menstrual Periods: Clues to Ovarian Cancer." WebMD Health News. N.p., 9 July 2009. Web. 16 July 2011.
4. Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From Data Mining to Knowledge Discovery in Databases." AI Magazine (1996): 37-54.
5. Hankinson, Susan E., Graham A. Colditz, and Walter C. Willett. "Towards an integrated model for breast cancer etiology: The lifelong interplay of genes, lifestyle, and hormones." Breast Cancer Res 6 (2004): 213-218.
6. Ingui, Bette, and Mary Rogers. "Searching for Clinical Prediction Rules in Medline." Journal of the American Medical Informatics Association 8 (Jul-Aug 2001): 391-397.
7. Liu, Yan. "Decision Tree." Department of Biomedical, Industrial and Human Factors Engineering, Wright State University, n.d. Web. 5 Sep. 2011.

8. Marsh, Beezy. "Breast-feeding reduces cancer risk." Dailymail.co.uk. MailOnline.com, n.d. Web. 22 July 2011.
9. Morton, Carol Cruzan. "Ovarian cancer research takes center stage." harvard.edu. Dana-Farber/Harvard Cancer Center, n.d. Web. 22 Oct. 2011.
10. Newman, Lisa. "Breast Cancer in African American Women." The Oncologist: Breast Cancer 10 (2005): 1-14.
11. Olivas, Rafael. "Decision Trees: A Primer for Decision-making Professionals." stylusandslate.com. Stylus and Slate, 10 Apr. 2007. Web. 5 Sep. 2011.
12. Sánchez-Zamorano, Luisa María, Lourdes Flores-Luna, Angélica Angeles-Llerenas, Isabelle Romieu, Eduardo Lazcano-Ponce, Hernando Miranda-Hernández, Fernando Mainero-Ratchelous, and Gabriela Torres-Mejía. "Healthy Lifestyle on the Risk of Breast Cancer." Cancer Epidemiology Biomarkers Prev 20 (2011): 912-922.
13. Sharp, Frank, Tony Blackett, Jonathan Berek, and Robert Bast. Ovarian Cancer. Oxford: Isis Medical Media Ltd., 1998.
14. Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Boston: Pearson Education, 2006.
15. Velickov, Slavco, and Dimitri Solomatine. "Predictive Data Mining: Practical Examples." unesco-ihe.org. UNESCO-IHE Institute for Water Education, March 2000. Web. 4 Sep. 2011.
16. "Association rules (in data mining)." Searchcrm.com. SearchCRM, n.d. Web. 30 Aug. 2011.
17. "Cancer of the Uterus Risk Factors." cancer.gov. National Cancer Institute, n.d. Web. 7 Sep. 2011.
18. "Data Warehouse Definition and Concept." Tutorial-computer.com. Tutorial-computer, n.d. Web. 30 Aug. 2011.
19. "Decision Tree: Introduction." ccs.miami.edu. Center for Computational Science, n.d. Web. 30 Aug. 2011.
20. "Gail model." infoacademy.grlmicrocalc. Microcalcification Resource Site, n.d. Web. 20 Oct. 2011.
21. "Lesson 8: Data Mining, Data Warehousing and Data Marts." psu.edu. Penn State University, n.d. Web. 30 Aug. 2011.
22. "Ovarian cancer risk factors." cancerresearchuk.org. Cancer Research UK, n.d. Web. 25 Aug. 2011.
23. "Predictive analytics." Searchcrm.com. SearchCRM, n.d. Web. 30 Aug. 2011.

24. "Predictive modeling." Searchcrm.com. SearchCRM, n.d. Web. 30 Aug. 2011.
25. "Twelve Basic Predictive Analytics Techniques." oacf.org. OACF, 19 May 2011. Web. 6 Sep. 2011.
26. "Uterine Cancer Risk Factors." cdc.gov. Centers for Disease Control and Prevention, n.d. Web. 12 Oct. 2011.
27. "What are the risk factors for breast cancer?" cancer.org. American Cancer Society, 29 Sep. 2011. Web. 21 Oct. 2011.
28. "What are the risk factors for ovarian cancer?" cancer.org. American Cancer Society, 13 Oct. 2011. Web. 21 Oct. 2011.