Cancer prediction in bioinformatics role

Material Information

Cancer prediction in bioinformatics role
Mokharrak, Wafa Abdrab Alhousein
Publication Date:
Physical Description:
v, 67 leaves : color illustrations ; 28 cm

Thesis/Dissertation Information

Master's ( Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer Science
Committee Chair:
Altman, Tom
Committee Members:
Chlebus, Bogdan
Ra, Ilkyeun


Subjects / Keywords:
Cancer -- Genetic aspects ( lcsh )
Bioinformatics ( lcsh )
Bioinformatics ( fast )
Cancer -- Genetic aspects ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 62-67).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Wafa Abdrab Alhousein Mokharrak.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
785826922 ( OCLC )
LD1193.E52 2011M M65 ( lcc )

Full Text

Wafa Abdrab Alhousein Mokharrak
B.S. King Saud University, 2007
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science

2011, Wafa Abdrab Alhousein Mokharrak
All rights reserved.

This thesis for the Master of Science
Degree by
Wafa Mokharrak
has been approved
For recommendation to the Graduate Committee
Thesis Advisor
Professor Tom Altman
MS Committee
Professor Bogdan Chlebus
Professor Ilkyeun Ra
November 18, 2011

Mokharrak, Wafa Abdrab Alhousein (M.S Computer Science)
Cancer Prediction in Bioinformatics Role
Thesis directed by Professor Tom Altman
Improving the genes of a variety of organisms, specifically human
beings, has become an important issue in the current era: it raises gene
quality by enabling better treatments, preventing genetic mutations,
improving the productivity of offspring, and supporting preventative tests.
The bioinformatics field leads us to this improvement by predicting the
genes that are responsible for causing specific genetic diseases. For this
reason, bioinformatics is considered one of the most worthwhile fields in
computer science for gene prediction techniques. This thesis discusses some
of the available literature on gene prediction and then focuses
specifically on gene prediction for cancer. The thesis is also supported by
an implementation that calculates the prediction percentages for two types
of cancer tumors (breast and ovarian). It will be evaluated by
comparing it with two existing algorithms (Gail and Risk of Ovarian Cancer
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Tom Altman

My thanks to...
Dad, Abdrab Alhousein Al-Mokharrak, I am delighted that he followed this
work through to its accomplishment, providing the support and
encouragement that made it possible
Mom, Sameera Al-Daloug, her constant prayers, sacrifice,
encouragement, and motherly care have sustained me throughout my life
My parents...Thanks for giving me love, upbringing, education, and
intellectual development
My brother, Mohammed, over the years you have helped me get to this
point. Thanks for your endless motivation, inspiration, and encouragement
My sisters (Ekram, Alia, and Safa), my pillars, my joy, my guiding
lights, thanks for plenty of friendly encouragement
My friend, Nedhal Al-Khalaf, for being there for me throughout the
entire thesis.

1. Motivation
2. Introduction
3. Definitions
4. Gene Prediction
4.1 Gene Prediction Methods
4.1.1 Searching for the evidence
4.1.2 Combining the evidence to predict gene structure
4.1.3 Strengths and pitfalls
4.2 Some classical gene prediction approaches
4.2.1 Hidden Markov Models
4.2.2 Dynamic programming
4.2.3 Bayesian network
5. Approaches for cancer gene prediction
5.1 Molecular network approach
5.1.1 Network-based disease gene prediction
5.2 DNA Microarray
5.2.1 DNA microarray model and data analysis
5.3 Machine learning methods
5.3.1 Support vector machine
5.3.2 Nearest neighbor
5.4 Decision tree
6. Cancer risk prediction models
6.1 Breast cancer prediction models
6.2 Ovarian cancer prediction models
7. Contribution and discussion
7.1 Results
7.2 Evaluation
7.3 Advantages vs. Disadvantages
7.4 Future Work
7.5 Conclusion

3.1 DNA STRANDS
3.2 TRANSLATION PROCESS
3.3 mRNA CONFIGURATION
3.4 CODON
4.1 cDNA
4.2 IGRs
4.3 BREAST CANCER DISEASE MODEL
5.2 cDNA MICROARRAY
5.3 SVM CLASSIFIER
7.1 CALCULATOR FLOWCHART

4.1 Breast Cancer Model
7.1 Age and Race
7.2 Age and First Birth Age
7.3 Age and Family History Degree
7.4 Age and Menstruation Age

1. Motivation
Since I was a child I have had many dreams that have grown up with me;
some of them have been achieved and others not yet. One of my dreams
was to be a doctor, but that was not achieved. Although I became a
computer scientist, my dream of curing people never went away. For this
reason, I wanted to achieve something that connects both sciences
(computer science and medicine). The existence of the bioinformatics field
attracted me, facilitated my approach, and gave me a strong motivation to
explore its world.
Among the many epidemics and chronic diseases that are spreading at
this time, cancer is one for which genes are one of the main causes.
Predicting the percentage risk that genes will cause cancer and other
genetic diseases permits people to take precautions and avoid their
occurrence. Therefore, I got the idea to create an implementation that
predicts the cancer risk percentage based on public live databases. To
that end, I have been researching and writing this work on the role of
bioinformatics in cancer gene prediction. Having achieved my dream now, I
hope that I have created a valuable and useful solution to this problem
for mankind.

2. Introduction
The greatness of the divine design of an organism's structure has
impressed scientists and doctors, and encouraged them to devote their lives
to understanding the precision of the bodies of living creatures. The most
important of these organisms is the human. The human body consists of a
complex and interconnected system of genes and genetic sequences that have
a gigantic impact on human physical and mental characteristics. Any
disorder or defect in genes or genetic sequences can lead to diseases such
as cancers, heart diseases, diabetes, Alzheimer's, and many others.
The exploitation of advanced technology, sometimes used negatively in
order to satisfy human comfort, has led to environmental defects that hurt
humans. Since the 1980s, the bioinformatics field has been considered a
pioneer in preserving and developing natural genetic and hereditary
resources, which molecular biologists need in order to achieve their
purposes and goals. For example, there is no genetic engineering without
genetic resources. Therefore, biologists search for this wealth of genetic
resources, which is represented in local plants, wild animals, and
microorganisms that are used to solve human problems. These problems
include the lack of food, medicine, and water, as well as other human
concerns. Nowadays, it has become possible to use genetic molecules in
genetic surgeries, as well as in genetic engineering to develop
microorganisms by adding or removing a desirable gene.

The nucleus of a human cell contains a number of chromosomes that
together carry tens of thousands of genes. Each gene is responsible for
producing a different type of protein. Chromosomes come in pairs in every
human being (one copy inherited from the father and another from the
mother), so each pair carries a copy of each parent's genes. Consequently,
there are common diseases, such as cancer, for which hereditary genes are
assumed to be a direct cause. The coming chapters discuss in depth the
role of bioinformatics in cancer gene prediction.

3. Definitions
This chapter presents some biological terminology that will be used in
the thesis.
1. Bioinformatics
In general, bioinformatics is the use of computer techniques to serve
biologists.
More specifically, bioinformatics is defined as "the mathematical,
statistical and computing methods that aim to solve biological problems using
DNA and amino acid sequences, and related information" (4).
For example, the GenBank database holds genotypes and allows users to
use these genes to develop, produce, and introduce new characteristics.
Users can draw on those genetic deposits to solve biodiversity enrichment
and environmental improvement problems in which genes play an important
role. This example shows how a computerized database (GenBank) serves
biologists in solving biological problems.
2. Prediction
Prediction in general means estimating that something will happen
based on available data. Bioinformaticians also have their own definition
of this technique: "the estimation of the amount of data that is needed
to allow reliable predictions" (4).

For example, predicting the percentage risk of breast cancer in a woman
relies on many factors, one of which is her family history. The predicted
risk percentage varies with the degree of relationship between the woman
and the affected relative, because genetics has a strong impact on the
percentage.
3. DNA Sequence
In 1953, James Watson and Francis Crick elucidated the double-helical
structure of the DNA molecule and its composition (21). A DNA molecule
consists of two complementary strands. Each strand is built from four
heterocyclic bases, adenine (A), guanine (G), cytosine (C), and thymine (T),
which are connected together by sugar units and phosphates (2).
Figure 3.1: DNA STRANDS (23)
DNA takes part in two global processes:
1- Transcription: transferring genetic information from DNA to RNA.
This process amplifies the information held in the DNA by constructing
many RNA copies, called mRNA.
2- Translation: converting the information in mRNA into amino acids.
This process occurs in the ribosome.

Figure 3.2: TRANSLATION PROCESS (DNA, transcription, mRNA, translation)
4. RNA Sequence (Ribonucleic acid)
"Linear polymers of ribonucleotides whose biological functions include
transmitting information, catalysis, and regulation of gene expression" (12).
RNA, which is also known as ribonucleic acid, has many functions in an
organism's cells. Besides taking part in protein synthesis, RNA can
function as a carrier of genetic information and as a catalyst of
biochemical reactions. An RNA sequence differs from a DNA sequence in that
thymine (T) is replaced with uracil (U), and only a single strand of mRNA
is needed to synthesize proteins.
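As a minimal sketch of the relationship between the two DNA strands and the
mRNA copy described above, the following Python fragment builds the
complementary strand and replaces T with U. The function names and the toy
sequence are illustrative assumptions, not part of the thesis implementation,
and strand direction is ignored for simplicity.

```python
# Base pairing between the two complementary DNA strands.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand: same bases, with T replaced by U."""
    return coding_strand.replace("T", "U")


coding = "ATGGCCTAA"  # hypothetical toy sequence
print(complement_strand(coding))  # TACCGGATT
print(transcribe(coding))         # AUGGCCUAA
```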
5. mRNA (Messenger RNA)
"The RNA transcription of a DNA genetic locus; acts as the template
that directs a ribosome to synthesize the corresponding polypeptide
sequence" (12).
mRNA is one category of RNA. Its fundamental function is that each
molecule carries directions on how to link amino acids into a peptide
chain, which becomes a protein. This process relies on a genetic code in
which each triplet of nitrogenous bases in the RNA sequence (three RNA
nucleotides) specifies one of the 20 available amino acids.

After the addition of polyadenine (poly A) to the RNA sequence, the
introns are removed and the exons remain in the mRNA (17).
Figure 3.3: mRNA CONFIGURATION (34)
6. Gene
This is the basic unit of inheritance for physical and functional traits.
All genes work together in complicated ways to control the functions of the
human body. For example, each gene carries specific information for
synthesizing a particular protein. Although all human beings have the same
genes, they have different alleles. An allele is one of several variant
forms of a gene, such as the multiple alleles for hair type.
"The smallest unit of inheritance; also, a region of DNA that influences
a trait" (12).
7. Genome
Genome refers to the complete set of chromosomes that each human has in
the nucleus of cells, where the chromosomes and all the genes they carry
are inherited from both parents. The genome has all the necessary
biological information to build a living human being. The coding regions
of genes in the human genome, called exons, are separated by non-coding
regions called introns.
"All of the physical hereditary material contained in an organism" (12).
8. Codon
A codon is the coding unit of a DNA sequence, where each triplet of
bases represents a specific codon. Each codon specifies one of the 20
amino acids (or a stop signal) to be translated into a protein by living
cells, and most amino acids are encoded by more than one codon.
Consequently, the genetic code is composed of 64 triplets of bases.
"A sequence of three nucleotides that specifies an amino acid" (12).
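The count of 64 triplets and the mapping from codons to amino acids can be
illustrated with a short Python sketch. It enumerates all triplets over the
four RNA bases and pairs them with the standard genetic code written in
one-letter amino acid symbols ('*' marks a stop codon). The helper name
translate and the toy mRNA are assumptions made for illustration only.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in UCAG order for each codon position.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))                        # 64
print(len(set(CODON_TABLE.values()) - {"*"}))  # 20
print(translate("AUGGCCUAA"))                  # MA
```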
Figure 3.4: CODON (40). Labels: DNA gene; sequence of coding triplets; stop codon.

4. Gene prediction
Gene prediction is the technique of locating a desired gene that is
encoded in a DNA sequence, within its coding regions, in order to
determine its function; it is not an easy problem. In response to this
challenge, there are a number of approaches and computer programs for gene
prediction, which are designed based on many computational prediction
methods.
Scientists define the gene prediction technique precisely as "the
problem of locating genes in genomic sequences" (14).
This chapter discusses these methods and explains how they contribute to
gene prediction, with reference to their strengths and limitations.
4.1 Gene prediction methods
This chapter considers only the problem of finding coding genes in
human beings. Although many gene prediction programs are used by molecular
biologists, they fall into two basic categories: ab initio and evidence
based.
4.1.1 Searching for the evidence:
To find a gene in a genomic sequence, there are basically two
approaches (18):
A- Content sensors: serve to partition the DNA sequence into two kinds of
regions: coding regions (exonic regions) and non-coding regions (intronic
regions). At present, most of the available gene prediction programs find
only the coding regions of a DNA sequence, which are called exons. The
non-coding regions, called introns, separate the exons. The following
describes each kind of content sensor in detail:
1- Extrinsic content sensors: these determine whether a region is
transcribed or coding through the homology between a genomic sequence and
protein or DNA sequences (18). This process uses sequence alignment, which
is a way to compare two or more sequences (amino acid or base sequences)
in order to find shared characters, by arranging the sequences so that
similar characters fall in the same column and differing characters are
set apart (45). There are two types of sequence alignment: global, which
is often referred to as Needleman-Wunsch, and local, which is often
referred to as Smith-Waterman and is widely used (8).
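To make the idea of local alignment concrete, here is a minimal Python
sketch of Smith-Waterman scoring. It returns only the best local score, not
the alignment itself, and the scoring values (match +2, mismatch -1, gap -1)
are assumptions chosen for illustration rather than values taken from the
cited sources.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # The floor of 0 lets a local alignment restart anywhere.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8
print(smith_waterman_score("AAAA", "TTTT"))  # 0
```

The zero floor is what distinguishes the local method from the global one:
a poorly matching prefix cannot drag down the score of a well-matching
region later in the sequences.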
Sequence alignment in extrinsic content sensors has three types:
- Protein homology: a commonly used type for detecting similarity between
sequences, because around 50% of genes have counterparts available in
public databases, such as the SwissProt database and the PIR database. It
uses local alignment between a DNA sequence and an amino acid sequence to
indicate coding regions. Local alignment is well suited to finding the
same functional subunits (motifs or domains) of proteins even if they come
from different families, because such proteins are built with the same
structure (18).
FASTA and BLAST are examples of computer tools that use this type
of sequence alignment. Although this type is effective, it has a drawback:
it is unable to detect the full gene structure, because not all protein
domains are available (18).

- Transcript: the second type of sequence alignment, which has two kinds
of alignment: alignment with complementary DNA (cDNA) and alignment with
Expressed Sequence Tags (ESTs).
cDNA is derived from mRNA to target individual genes in the classical
way, and is defined as "a DNA molecule synthesized in a laboratory from a
messenger RNA template using the enzyme reverse transcriptase; it contains
only the nucleotides of a coding sequence" (12).

Figure 4.1: cDNA (35)
cDNA uses (22):
1- Analyzing genes and nucleotide sequences
2- Locating chromosomal genes by producing probes
3- Cloning and expression using host-vector systems
4- Establishing a cDNA library, in which each cluster of cDNA molecules
reflects what the source cells transcribe
An EST is an expressed sequence derived from a cDNA library; its length
ranges between 200 and 500 nucleotides. It is used in locating genes
because it reduces the time required (36). Both kinds come from the same
source and reveal the structure of a gene (18).
Sim4 is a well-known computer program that uses this type of
sequence alignment. It helps molecular biologists align a cDNA sequence
with a genomic DNA sequence.
- Genomic DNA: the third type of sequence alignment, which has two
approaches: intra-genomic and inter-genomic. The first approach covers
most of the available genes; it provides information related to multigenic
families. The second approach allows identifying orthologous genes, which
have arisen from a common ancestral gene, without any previous information
about them. However, this type has a disadvantage: it is difficult to
distinguish between coding and non-coding regions, because the similarity
is limited to the most conserved parts of the coding regions while the
other regions remain ambiguous (18).
The OrthoMCL program is one example of the inter-genomic approach. It
uses the Markov Cluster algorithm to create orthologous groups based on
sequence similarity. The ClustalX program is an example of the
intra-genomic approach. It is based on multiple sequence alignment (MSA)
algorithms and is used for aligning nucleotide sequences.

2- Intrinsic content sensors: originally used for prokaryotic genomes.
They work with only two kinds of DNA regions: coding regions, which code
for proteins and will be translated, and intergenic regions. Intergenic
regions (IGRs) are non-coding regions of the DNA sequence that are located
between genes and represent most of the human genome. They have no known
function (41).
Figure 4.2: IGRs (13). Labels: gene cluster; intergenic DNA; gene cluster.
Coding regions should not contain in-frame stop codons. However, because
the translated regions in the human genome are short, searching for
stretches without stop codons would not be worth the expense (18). Within
a coding region, each codon is translated into a specific amino acid
according to the genetic code.
The EuGene software integrates intrinsic and extrinsic evidence to
predict genes in eukaryotic organisms. This software tool uses spliced
alignments with EST and cDNA sequences.
B- Signal sensors: a DNA sequence contains many functional loci called
signals. The methods for discovering them are called signal sensors, which
evaluate fixed-length features in DNA (start codons, stop codons, donor
sites, acceptor sites, promoters, and poly-A signals).
Signal sensors determine specific functional locations in the DNA
where a signal is likely to be found. The basic signal sensor is a
consensus sequence, which summarizes the result of aligning multiple
functional sequences (33). These signals are represented by the weight
matrix method to identify donor and acceptor sites. A weight matrix
assigns a score to every base at every position within a site. For a
candidate sequence, the scores of the bases that appear at each position
are added, and the weight matrix returns the sum as the score of the
possible site (10). The matrix gives positive scores to bases that are
preferred at a position and negative scores to unlikely ones. Each score
is a log-odds ratio between the frequency of the base at that position in
known sites and the expected frequency of that base in the genome (27).
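The weight matrix idea can be sketched in a few lines of Python. The
example below builds log-odds scores from a handful of aligned sites and
sums them for a candidate. The toy donor sites, the uniform background
frequency of 0.25, and the pseudocount are all assumptions made for
illustration; real matrices are estimated from large curated sets of sites.

```python
import math


def build_log_odds_matrix(sites, background=0.25, pseudocount=1.0):
    """Build a weight matrix of log2-odds scores from aligned, equal-length sites."""
    matrix = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        scores = {}
        for base in "ACGT":
            # Pseudocounts avoid log(0) for bases never seen at a position.
            frequency = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(frequency / background)
        matrix.append(scores)
    return matrix


def score_site(matrix, candidate):
    """Sum the per-position scores of the bases in a candidate site."""
    return sum(column[base] for column, base in zip(matrix, candidate))


# Hypothetical aligned donor sites, for illustration only.
donor_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
weight_matrix = build_log_odds_matrix(donor_sites)

print(score_site(weight_matrix, "GTAAGT") > 0)  # True: matches the preferred bases
print(score_site(weight_matrix, "CCCTCC") < 0)  # True: only unlikely bases
```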
There are also other methods, such as the weight array matrix, which
accounts for the dependence between adjacent positions by scoring
dinucleotides. The maximal dependence decomposition method is used for
splice site positions. In addition, a weight matrix can be regarded as one
type of neural network called a perceptron. Neural networks learn the
features that distinguish true splice sites from non-functional ones.
Although these methods help in part with the prediction of splice sites,
the weight matrix method is still the best (27).
There are signals that predict exons, which are classified into four
categories (27):
1- Single exon genes that begin with a start codon and end with a stop
codon
2- Initial exons that begin with a start codon and end with a donor site
3- Terminal exons that begin with an acceptor site and end with a
termination codon
4- Internal exons that begin with an acceptor site and end with a donor
site
Initial and terminal exons are considered the most complicated to identify
because they are short and their signals lack significant information.
4.1.2 Combining the evidence to predict gene structure:
This concept is used to determine the gene structure as well as to find
independent exons. It depends on combining the sequence content evidence
with signal sensors in order to find evidence of where signals appear
(18). Theoretically, each consistent pair of signals creates a gene model,
where these models are hypotheses about the transcript structure that a
gene forms. These models identify the gene regions through their contents.
A gene model involves introns, exons, their locations in the genomic
sequence, and the location of the translated region (39). Like any model,
its results may not necessarily be accurate; it could be correct, partly
correct, or entirely wrong.
Therefore, there are some features that characterize correct gene
structures (18):
1- There are no overlapping exons
2- Coding exons must be frame compatible
3- Merging two successive coding exons will not generate an in-frame
stop at the junction
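The three features above can be checked mechanically. The following Python
sketch is one simplified reading of them: exons are given as (start, end)
intervals on a DNA string, frame compatibility is approximated by requiring
the spliced coding sequence to have a length divisible by three, and a stop
codon is allowed only as the final codon. The function name, the interval
convention, and the toy sequences are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_gene_structure(sequence, exons):
    """Check a candidate gene model.

    exons is a list of (start, end) intervals, 0-based with end exclusive.
    """
    exons = sorted(exons)
    # Feature 1: no overlapping exons.
    for (_, previous_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start < previous_end:
            return False
    coding = "".join(sequence[start:end] for start, end in exons)
    # Feature 2 (simplified): the spliced exons keep a single reading frame.
    if len(coding) % 3 != 0:
        return False
    # Feature 3: no in-frame stop before the last codon, including codons
    # that span an exon junction.
    codons = [coding[i:i + 3] for i in range(0, len(coding), 3)]
    return not any(codon in STOP_CODONS for codon in codons[:-1])


# Toy example: exon ATGGCC, intron GTAAGTTTTAG, exon AAATAA.
print(is_valid_gene_structure("ATGGCCGTAAGTTTTAGAAATAA", [(0, 6), (17, 23)]))  # True
# Splicing ATGTA + AGGC creates the in-frame stop TAA at the junction.
print(is_valid_gene_structure("ATGTAGTCCAGAGGC", [(0, 5), (11, 15)]))          # False
```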
There are three approaches under this concept:
1- Similarity-based or extrinsic approach:
Because genes in different organisms are similar, this approach operates
based on homology, using known genes in one genome to predict unknown
genes in another genome.

For example, in order to learn the function of a gene, it is compared
with known genes in a database whose functions are known. This approach
cannot predict unique genes, nor genes that do not produce protein. It
applies the cDNA or protein of specific genes to indicate the gene
One of the basic disadvantages of this approach is that there is no
precise clarity about the restricted similarities. The computer programs
that apply this approach are called spliced alignment programs. These
programs take the combined information from signal sensors, which
determines the region boundaries of both similarity and
Available computer programs under this approach are classified based
on the type of similarity: genomic DNA/protein, genomic DNA/cDNA, or
genomic DNA/genomic DNA (18).
For example, SpliceView and SplicePredictor are two software packages
that predict potential splice sites based on a statistical model in order
to specify the accurate borders of exons and then improve the gene
structure; they can also, with low probability, find possibilities for
alternative splicing.
2- Ab initio or intrinsic approach:
This approach depends on a statistical model of sequence data, where
the DNA sequence itself is the only source of information. Ab initio
methods are based on gene content measures (GC content, codon bias, and
hexamer frequency) that distinguish exonic regions from intronic regions,
and on signal detection, where a signal indicates that the surrounding
area contains coding regions but does not give the start and end positions
of exons and introns (6).
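Two of the content measures named above, GC content and hexamer frequency,
are simple to compute. The Python sketch below shows one straightforward
way to do so; the function names are illustrative assumptions, and real
gene finders compare such statistics between known coding and non-coding
training sequences rather than using them in isolation.

```python
from collections import Counter


def gc_content(sequence: str) -> float:
    """Return the fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def kmer_frequencies(sequence: str, k: int = 6) -> dict:
    """Return the relative frequency of every overlapping k-mer (hexamers by default)."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


print(gc_content("GGCCAATT"))                  # 0.5
print(kmer_frequencies("ACGACGACG", k=3))
```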
In order to produce statistical estimates of the probability that a
prediction is correct, training sets of known gene structures are applied.
The use of a training set can be described as follows: "the algorithm
learns from samples with known class membership" (37). Such an algorithm
is able to respond to new information or events. The Google Prediction API
is an example of a system that uses a training set. It is a machine
learning service that is trained on historical data and then makes
predictions for new data based on what is already known. For example, the
API can create a spam email filter based on classification and features
(44).
However, since each person has unique genes, this approach does not
predict genes from known structures or from the functions of homologous
genes (28), but from the genomic sequence itself, in which it finds the
gene components, and to a lesser extent from gene structures that exist on
the sequence boundary
The computer programs that apply this approach depend on dynamic
programming, applying the evidence from content and signal sensors. This
is useful because it helps in specifying the possible gene structures (18).
A gene structure is defined by two categories under this approach:
- The exon-based category
Gene structure under this category is an assembly of segments that
represent the coding parts of exons. Coding exons differ from non-coding
exons in that the former direct the production of a peptide sequence while
the latter do not.
The exon-based category can also be expressed in graph language as
finding an optimal path in a directed acyclic graph, where exons are
assigned to vertices and compatibility between exons is assigned to edges,
by applying the Viterbi algorithm.
Andrew J. Viterbi created the Viterbi algorithm in 1967 (45). It is a
dynamic programming algorithm that searches for the most likely sequence
of hidden states: for each state, it keeps the most probable path among
all possible hidden paths that lead to that state (the survivor path),
given the observed events.
- The signal-based category
Gene structure under this category is represented as a sequence of
signals separated by homogeneous regions. It also uses the Viterbi
algorithm in the graph formulation (18).
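The graph formulation of the exon-based category can be sketched as a small
dynamic program in Python. Candidate exons are the vertices, an edge joins
two exons when the first ends before the second starts, and the best path
maximizes the summed exon scores. This is a simplified illustration rather
than the method of any particular gene finder: the scores are hypothetical
positive numbers, and compatibility is reduced to non-overlap.

```python
def best_exon_chain(exons):
    """Find the highest-scoring chain of non-overlapping candidate exons.

    exons is a list of (start, end, score) tuples with end exclusive.
    Returns (total score, list of chosen exons in order).
    """
    # Sorting by end position gives a topological order of the exon graph.
    exons = sorted(exons, key=lambda exon: exon[1])
    best = []  # best[i] = (best score of a chain ending at exon i, predecessor index)
    for i, (start, _end, score) in enumerate(exons):
        best_previous, best_score = None, 0.0
        for j in range(i):
            # An edge j -> i exists when exon j ends before exon i starts.
            if exons[j][1] <= start and best[j][0] > best_score:
                best_previous, best_score = j, best[j][0]
        best.append((best_score + score, best_previous))
    if not best:
        return 0.0, []
    # Trace back from the best final exon along the stored predecessors.
    index = max(range(len(best)), key=lambda i: best[i][0])
    total = best[index][0]
    chain = []
    while index is not None:
        chain.append(exons[index])
        index = best[index][1]
    return total, chain[::-1]


candidates = [(0, 10, 3.0), (5, 15, 4.0), (12, 20, 2.0), (16, 25, 3.5)]
print(best_exon_chain(candidates))  # (7.5, [(5, 15, 4.0), (16, 25, 3.5)])
```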
3- Integrated approach:
Nowadays, this approach is used in modern gene finders, which combine
the abilities of the extrinsic and intrinsic approaches and redevelop
older implementations in order to confirm their prediction accuracy and
get kind of
All inputs under this approach, in all their variety, are weighted either
explicitly or implicitly, depending on the certainty of the input and the
correct use of known gene structures, respectively (14).
4.1.3 Strengths and pitfalls
The following are some of the strengths and pitfalls of each approach
discussed in this chapter.

A- Extrinsic approach:
Strengths:
1- Applying similar sequences (EST, mRNA, and protein) facilitates
deriving gene structures
2- Accuracy, due to the use of homologous sequences
Pitfalls:
1- Runtime is slow
2- No precise clarity about the restricted similarities
3- Prior knowledge of online sequence databases is required
4- Sequencing errors affect the results
B- Intrinsic approach:
Strengths:
1- Runtime is fast
2- Applies the DNA sequence itself to detect coding regions
3- Can find possible genes even if they do not resemble known genes
Pitfalls:
1- Full knowledge of the training set is required
4.2 Some classical gene prediction approaches
The following reviews the three most popular classical approaches for
gene prediction, without addressing the mathematical methods, and
illustrates how to apply them in the Cancer Predictor Calculator.

4.2.1 Hidden Markov Models (HMMs)
This is a well-known statistical tool for modeling. It supposes that
the probability of a base occurring at a specific position depends on the
K previous nucleotides, where K is the order of the Markov model. It can
be visualized as a finite state machine with emitting states. A finite
state machine generally moves through a succession of states and produces
output either when the machine reaches a certain state or while it is
moving from one state to another. The model topology is fixed, while all
of the transition and emission probabilities are estimated values.
The model is expressed statistically as P(X | k), where X is a given
base (A, T, G, or C) and k is the sequence of previous nucleotides (2).
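The conditional probability P(X | k) can be estimated by counting how often
each base follows each context of K bases in a training sequence. The
Python sketch below shows this counting step only; the function name and
the toy sequence are assumptions for illustration, and a practical model
would be trained on large sets of known coding and non-coding sequences.

```python
from collections import Counter, defaultdict


def train_markov_model(sequence: str, k: int) -> dict:
    """Estimate P(X | previous k bases) by counting in a training sequence."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence)):
        context = sequence[i - k:i]
        counts[context][sequence[i]] += 1
    # Normalize the counts for each context into probabilities.
    return {
        context: {base: n / sum(following.values()) for base, n in following.items()}
        for context, following in counts.items()
    }


model = train_markov_model("ACGACGACT", k=2)
print(model["AC"])  # G follows AC twice and T once
print(model["CG"])  # A always follows CG in this toy sequence
```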
An HMM can be applied in the Cancer Predictor Calculator, which is the
implementation for this project, by selecting the breast cancer tumor. For
example, in order to model the expected lifetime of the patient, there are
many parameters, one for each table entry. They are represented in two
tables, one for transitions and one for emissions.


Table 4.1: Breast Cancer Model

Transitions (row: current state, column: next state)
State                Diagnose   Gene mutation   -Ve factors   Psychologist state
Diagnose             92/95      1/95            1/95          1/95
Gene mutation        0          93/95           1/95          1/95
-Ve factors          0          1/95            93/95         1/95
Psychologist state   0          0               0             95/95

Emissions (row: state, column: emitted stage)
State                First stage   Second stage   Third stage   Fourth stage
Diagnose             0.95          0.02           0.02          0.01
Gene mutation        0.06          0.88           0.03          0.03
-Ve factors          0.2           0.04           0.66          0.1
Psychologist state   0.1           0.8            0.03          0.07

Molecular biologists got the benefits of this model at the moment in
terms of its being designed to deal with sequences and because of the
existence of the great number of sequenced genomes, by applying it largely
on analysis process of DNA and protein sequences and identifying genes
HMM through its advancement transmits amino acids over a
succession of states in order to produce a protein sequence, where each
state has a table of amino acid emission probabilities (15). Currently, there is
no available system that uses this method in gene prediction due to the size
and sophisticated structure of this model; for this reason HMM has been
divided into three sub-models to serve genes prediction (2):
1- Exon model, which has 89 states
2- Intron model, which has 10 states
3- Intergenic regions, which have two disconnected ten-state chains
HMM works in cooperation with two other algorithms: expectation
maximization and Viterbi. Expectation maximization (EM) is used
in training the model by adjusting the transition and emission probabilities
toward the closest optimal values, while Viterbi is used in testing the model
by finding the most probable sequence of states through the model for a specific
4.2.2 Dynamic programming (DP)
Dynamic programming is a recursive technique that is applied in
computational biology programs to find the optimal result when
comparing two sequences, by finding the longest or shortest path in
a directed weighted graph. The word "programming" in this method refers to
using a fixed set of rules to find the solution; it does not mean that
the method has to be a computer program.
This algorithm is used in gene prediction to obtain the optimal
combination of exons and introns. It is used to build optimal gene models
from the candidate exons, to infer protein functions by homology to
other proteins of known function, and to cluster sequence data
(nucleic acids and proteins) from the fragments that are produced by
sequencing machines. DP is considered the best-known method in
computational molecular biology (2).
Although the mammogram is considered the first tool for detecting breast
cancer, it is not always accurate. Dynamic programming algorithms
have helped in developing breast cancer detection methods. These algorithms
rely on comparing mutated BRCA genes in
an individual with the original ones to get an estimate of the cancer risk.
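The comparison of two sequences by dynamic programming can be sketched as a global alignment score (the Needleman-Wunsch recurrence). This is a generic illustration in Python, not the method used by any particular BRCA test; the scoring values (match = 1, mismatch = -1, gap = -1) and the short sequences are assumptions chosen for readability.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell is the best of: align the two symbols, or insert a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

reference = "GATTACA"   # hypothetical reference fragment
sample = "GATTCA"       # hypothetical fragment with one deleted base
print(alignment_score(reference, sample))
```

A lower score relative to the score of the reference against itself indicates more differences between the individual's sequence and the reference.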
4.2.3 Bayesian Networks (BNs)
A Bayesian network is a member of the probabilistic graphical model
family. Nodes in the graph represent random variables, and edges represent
probabilistic dependencies between the random variables. Each
dependency is specified by either a conditional probability function or a table.
A Bayesian network consists of a directed acyclic graph (DAG) that
represents causal relationships between random variables, together with its
parameters. Combining the graphical structure with the conditional
probabilities gives the full specification of a Bayesian network
probabilistic model (25).

The concept of BN highlights three aspects, which can be summarized as
follows (11):
1- The input information is frequently subjective
2- Updating information relies on Bayes' rule
3- Causal and evidential patterns of reasoning are treated differently.
BN is used in bioinformatics for the task of integrating a variety of gene
prediction systems. It allows finding the network that corresponds most
closely to the existing training set of independent parameters. This is done
with a statistical function called the scoring function, which
finds the optimal network by evaluating each candidate network against
the training set.
BN serves many purposes, some of which are (25):
1- BN uses a directed acyclic graph to model the causal relations
among gene expression (the mechanism of synthesizing proteins based on
gene information) levels and to produce statistical hypotheses
2- It occasionally uses the previous models (HMM and DP)
3- It shows in detail how data are combined in modeling a process
4- Existing algorithms, created especially for this purpose, search for
BNs that fit observational data.

Applying BN to the Cancer Predictor Calculator by combining the graphical
structure and the conditional probabilities is shown in the following graph:

From the above DAG, genetic mutation and age are independent
variables. Applying the independence formulas to them gives:
1- P(Genetic mutation, Age) = P(Age | Genetic mutation) P(Genetic mutation)
= P(Age) P(Genetic mutation)
2- P(Genetic mutation, Age) = P(Genetic mutation | Age) P(Age) =
P(Genetic mutation) P(Age).
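The two identities above can be checked numerically on a toy joint distribution. In the Python sketch below the prior probabilities are hypothetical values chosen only for illustration (they are not taken from the thesis or from any dataset), and age is simplified to a binary "over 50" variable.

```python
from itertools import product

# Hypothetical priors, for illustration only.
P_MUTATION = {True: 0.05, False: 0.95}
P_AGE_OVER_50 = {True: 0.30, False: 0.70}

# Joint distribution built under the independence read from the DAG:
# P(Genetic mutation, Age) = P(Genetic mutation) P(Age).
joint = {
    (g, a): P_MUTATION[g] * P_AGE_OVER_50[a]
    for g, a in product([True, False], repeat=2)
}

def marginal_mutation(g):
    """P(Genetic mutation = g), obtained by summing the joint over age."""
    return sum(joint[(g, a)] for a in (True, False))

def age_given_mutation(a, g):
    """P(Age = a | Genetic mutation = g) by the definition of conditional probability."""
    return joint[(g, a)] / marginal_mutation(g)

# Under independence, conditioning on the mutation does not change P(Age).
print(age_given_mutation(True, True), P_AGE_OVER_50[True])
```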
Also, the cancer variable depends on medical history, which in turn
depends on the two independent variables genetic mutation and age.
Applying the conditional independence formula:
P(Cancer | Genetic mutation, Age, Medical history) = P(Cancer | Medical
history).
The family history variable is independent of the medical history variable,
but the two become dependent once cancer is given. In contrast,
ovarian tumor and breast tumor are dependent, but they are independent
once cancer is given. The following formula makes this clearer:
P(Breast tumor | Ovarian tumor, Cancer) = P(Breast tumor | Cancer).

5. Approaches for cancer gene prediction
This chapter discusses some of the most important cancer prediction
approaches and, where applicable, how they can be used within the Cancer
Predictor Calculator.
5.1 Molecular network approach
The molecular network is an important approach to
predicting genetic diseases. In a molecular network, vertices refer to
biomolecules (genes and proteins), and edges between nodes refer
to functional interactions between the two corresponding molecules. The
distance (shortest path) between two nodes in the network reflects
gene-gene and gene-disease relationships.
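The distance between two nodes can be computed with a breadth-first search. The Python sketch below uses a small hypothetical network; the node names and edges are invented for illustration and do not describe real interactions.

```python
from collections import deque

def shortest_distance(network, source, target):
    """Number of edges on the shortest path, or None if the nodes are not connected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in network.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Hypothetical undirected network: each edge is listed in both directions.
NETWORK = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "disease_gene"],
    "disease_gene": ["g3"],
    "g4": [],
}

print(shortest_distance(NETWORK, "g1", "disease_gene"))
```

A candidate gene that lies only a few edges from a known disease gene would be considered more closely related to the disease than one that is far away or disconnected.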
To date, molecular biologists have discovered and identified only about
1% of the cancer genes of the human genome that are responsible for
mutations that may lead to cancer tumors. Molecular biologists have stressed
the importance of discovering these genetic mutations early in order to
eradicate the disease, or at least to limit it and extend
the patient's life. The mutations are defined on the basis of the genes'
known functions, which are limited by expert knowledge.
In breast cancer, for instance, mutations in the genes
BRCA1 and BRCA2 account for less than 5% of tumors, while
the rest are caused by other factors. Therefore, the availability
of a molecular network helps considerably in predicting disease genes.

5.1.1 Network-based disease gene prediction
The aim of this method is to find the true gene that causes a
specific disease. It supposes that there are N genes that have
genetic information, each of which is considered a candidate
disease gene for at least one disease. If no genetic information is available,
the entire human genome can serve as the candidate list. Linkage analysis is
then performed, yielding one or more candidate gene intervals in which the
genes are ranked by their association with a disease. Such an
interval contains several hundred genes that cause a specific genetic
In the next step, the candidate genes and all other related factors are
mapped to a human gene/protein network. Then each candidate gene is
scored by a scoring scheme based on other factors and on its relative position
(genes are ordered linearly, which means that homogeneous genes are located
at the same relative position in their respective human gene/protein
network). Finally, the genes are ranked by score from highest to
lowest, where the top gene is the most likely cause of the disease.
This ranking is confirmed by cross-validation (one technique among
many, which assesses how well a scheme built on a training set performs on
unknown genes) against known disease-gene relationships.

[Figure: candidate genes from the linkage interval are mapped to the
gene/protein network and scored. Legend: genes causing different diseases,
candidate genes, other genes.]
When this approach is applied to ovarian cancer in the Cancer Predictor
Calculator, there are many candidate genes that carry genetic information
about ovarian cancer and other diseases. The candidate genes (BRCA1,
BRCA2, HNPCC, MLH1, MSH2, and MSH6), together with all other disease factors
(e.g., personal, reproductive, genetic, hormonal, operation, birth control,
body, nutrition, and lifestyle factors), are mapped to the network. After
that, each candidate gene is scored by a scheme relying on its relative
position in the network and on other factors. Then all candidate genes are
ranked by score: BRCA1 is at the top with 40%, followed by BRCA2 with 20%,
MLH1, MSH2, and MSH6 with 12%, and HNPCC with 1.2%. The highest-scoring
candidate gene, BRCA1, is therefore taken to be the main cause of ovarian cancer.
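The final ranking step can be sketched in Python using the scores quoted above. Reading "12%" as the score of each of MLH1, MSH2, and MSH6 individually is an assumption; the scoring scheme itself is not reproduced here.

```python
# Scores (in percent) as quoted in the text; 12% is assumed to apply to each
# of MLH1, MSH2, and MSH6 separately.
CANDIDATE_SCORES = {
    "BRCA1": 40.0,
    "BRCA2": 20.0,
    "MLH1": 12.0,
    "MSH2": 12.0,
    "MSH6": 12.0,
    "HNPCC": 1.2,
}

def rank_candidates(scores):
    """Return gene names from highest to lowest score; ties are broken by name."""
    return sorted(scores, key=lambda gene: (-scores[gene], gene))

ranking = rank_candidates(CANDIDATE_SCORES)
print(ranking[0])
```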
5.2 DNA microarray analysis
Microarray analysis, or gene expression profiling, can measure the
expression of thousands of genes at once in order to show cellular function.
Microarrays differ in features such as the nature of the probe
(a probe is a fragment of DNA or RNA carrying a radioactive label; this
fragment represents the complementary base sequence), the solid-surface
support used, and the method used for detecting the probe.
The essential concept of each microarray is to arrange, at high density on a
glass slide, DNA probes or oligonucleotides that identify the coding regions
of particular genes. Then the purified RNA is labeled either radioactively or
with a detectable molecule such as fluorescein and hybridized to the probes on
the slide. Next, after thorough washing, the array (raw data) is acquired by
laser scanning or imaging. Finally, the data can be made publicly available in
databases and analyzed by many statistical methods (42). The DNA microarray is
considered an appropriate model at the transcription stage because of its
ability to capture molecular signals in the tumor and its potential to provide
other information that may affect the result. The results of this analysis are
widely used in cancer research, for example in prediction (7).
5.2.1 DNA microarray model and data analysis
A. cDNA microarray: The cDNA microarray is an extremely effective
tool for gene expression, because it has played a leading role in the parallel
analysis of gene expression across many biological experiments since it was
invented in the Brown laboratory at Stanford University (1).
The idea of the cDNA microarray is to array a sample of either mRNA or
cDNA on the surface and hybridize it. First, the mRNA samples are labeled
by incorporating fluorescent nucleotides while cDNA is made by reverse
transcription. Next, the cDNA mixture is applied to a DNA microarray for
hybridization, where each spot on the microscope slide hybridizes to its
complementary target sequence. After that, the excess cDNA is rinsed off and
the microarray is scanned for fluorescence. Each fluorescent spot represents a
gene expressed in the sample. When two samples are hybridized together, this is
often called a two-channel array.
The cDNA microarray has several pros (1):
1- High sensitivity due to the small size of the array
2- Parallel screening of multiple genes
3- Low cost, at $10 to $20 per array
4- No special supplies are required; for example, hybridization
does not need specialized equipment
5- Flexibility in designing the array in accordance with the purposes
of the scientific experiment
6- Direct comparison.
The cons are:
1- It is inaccurate
2- It measures two samples at a time, which means it cannot measure
an individual sample.

Figure 5.2: cDNA MICROARRAY (35)
For example, probe segments are arrayed on a glass slide, and two
different colors are used: one for the test sample (breast tumor) and the
other for the control sample (normal breast tissue). Next, both samples are
hybridized on the array. Finally, the array is washed and scanned, and the
fluorescent intensities of both test and control are measured.
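Once the two intensities of a spot are measured, a common way to compare them (a general convention in two-channel analysis, not something specified in this thesis) is the base-2 logarithm of the test-to-control ratio. The Python sketch below uses hypothetical intensity values.

```python
import math

def log_ratio(test_intensity, control_intensity):
    """log2(test / control): positive means higher expression in the test sample."""
    if test_intensity <= 0 or control_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(test_intensity / control_intensity)

# Hypothetical spot intensities: (test, control).
spots = {"gene_a": (800.0, 200.0), "gene_b": (150.0, 150.0), "gene_c": (100.0, 400.0)}
ratios = {gene: log_ratio(test, control) for gene, (test, control) in spots.items()}
```

A value of 2 means the gene is expressed four times more strongly in the test sample, 0 means no difference, and -2 means four times less.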
B. Oligonucleotide microarray: An oligonucleotide microarray
consists of small DNA fragments (short probes), usually 25-60
nucleotides long. The oligonucleotides are designed based on the DNA target
sequence (7).
The purpose of this microarray is to identify genes that can be
classified under one group, as well as to find treatments for abnormal genes.
It is widely used in predicting colon cancer genes. It is also called a
one-channel array. The probes are grown on the array using a lithographic
process, which ensures equal quantities of probe components. Then each sample
is hybridized with the array individually. Finally, the gene expression value
is given by a single fluorescence reading after scanning each sample.
The pros of this array are:
1- It is commercially available
2- Accurate.
Its pitfalls are:
1- Short probes reduce the sensitivity
2- It is expensive, costing $300 or more per array
3- It is limited to certain species.

5.3 Machine learning methods
Machine learning is an area of artificial intelligence (AI). It combines
elements of statistics, probability, and optimization to find the best
solution for the training sets. Based on the results from the training sets,
the method is applied to new cases in order to classify them. The purpose of
this classification is to create a powerful model for predicting classes.
Machines can learn to apply Boolean logic, conditional statements,
conditional probabilities, and other optimization strategies to
unconventional models.
One advantage sets these methods apart as a technique: they offer a level of
accuracy that is not available in other prediction methods. This is shown
through their ability to deal with independent variables and with linear
combinations of these variables.
Like any other method, machine learning has its own drawbacks, such as:
1- Limited examples relative to the number of features; this is called the
curse of dimensionality
2- Over-training by applying specific training sets every time to
specific features.
Some of the machine learning methods are:
5.3.1 Support vector machine (SVM)
The support vector machine is a well-known classifier. It aims
to find a hyperplane that separates the data belonging to two different
classes and maximizes the margin between the hyperplane and the closest
data points of each class in Euclidean space, even if the data are not
separable. The purpose of maximizing the margin is to select the most robust
hyperplane. When the data are not separable, SVM keeps the total error
(the distance from the hyperplane to the incorrectly classified samples)
below a threshold (3).
SVM can be applied in the Cancer Predictor Calculator to find a linear
separator, using statistical and programming techniques, that discriminates
whether there is a breast cancer risk or not.
[Figure: SVM classifier assigning a patient to "No risk" or "Cancer risk" of
breast cancer]
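The role of the hyperplane and of the total error can be sketched in Python. The weight vector, bias, and samples below are hypothetical, and the hinge loss is used as one standard way of measuring the total error of a linear SVM; no training procedure is shown.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return "Cancer risk" if decision_value(w, b, x) >= 0 else "No risk"

def total_hinge_loss(w, b, samples):
    """Sum of max(0, 1 - y * (w.x + b)) over samples (x, y) with y in {+1, -1}."""
    return sum(max(0.0, 1.0 - y * decision_value(w, b, x)) for x, y in samples)

# Hypothetical separator and two-feature samples (+1 = cancer risk, -1 = no risk).
W, B = [1.0, 1.0], -3.0
SAMPLES = [([2.0, 2.0], +1), ([1.0, 1.0], -1), ([1.5, 1.5], +1)]

print(classify(W, B, [2.0, 2.0]), classify(W, B, [1.0, 1.0]))
print(total_hinge_loss(W, B, SAMPLES))
```

The third sample lies exactly on the hyperplane, inside the margin, so it contributes to the total error even though the other two samples do not.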
5.3.2 Nearest neighbor (KNN)
KNN is a non-parametric classification method, which means
that the predicted result for a test case is based on
ordinal data. The ordinal data refer to the K nearest neighbors of the test
case, where K is chosen by cross-validation and the neighbors are found based
on Euclidean distance. Using KNN requires taking into account that the number
of neighbors is always less than the number of samples in the training set.
A KNN classifier comprises a training set and a distance measure. The
idea of this method is to find the K cases nearest to the test case by
calculating the weighted distances from the test case to all n-dimensional
cases in the training set. The calculation is performed using the
following formula (21):
$d(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$
In the above formula, x and y represent the samples, w_i refers to
the weight for the i-th gene, and n represents the number of dimensions.
To classify an unknown gene, represented by the cross symbol, as belonging
to either the cancer class or the non-cancer class based on its 3 nearest
neighbors, it is necessary to calculate the distances between it and all ten
samples using Euclidean distance. In figure 5.4, assume that there are two
dimensions, that is, each sample is represented by a vector of two genes
(gene1 and gene2). The unknown gene is then assigned to the predominant class
among its three nearest neighbors.
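A minimal Python sketch of the 3-nearest-neighbor vote follows. The ten samples are hypothetical (gene1, gene2) values invented for illustration, and the weights default to 1, which reduces the weighted distance to the ordinary Euclidean distance.

```python
import math
from collections import Counter

def weighted_distance(x, y, w):
    """Weighted Euclidean distance between samples x and y with per-gene weights w."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def knn_classify(training, query, k=3, weights=None):
    """Assign `query` to the predominant class among its k nearest training samples."""
    weights = weights or [1.0] * len(query)
    nearest = sorted(training, key=lambda s: weighted_distance(s[0], query, weights))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Ten hypothetical samples, each a vector of two gene values.
TRAINING = [
    ((1.0, 1.2), "non-cancer"), ((1.5, 0.8), "non-cancer"), ((0.8, 1.9), "non-cancer"),
    ((2.0, 1.5), "non-cancer"), ((1.2, 2.2), "non-cancer"),
    ((4.0, 4.2), "cancer"), ((4.5, 3.8), "cancer"), ((3.8, 4.9), "cancer"),
    ((5.0, 4.5), "cancer"), ((4.2, 5.2), "cancer"),
]

print(knn_classify(TRAINING, (4.1, 4.0)))
```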

5.4 Decision tree (DT)
DT is a standard machine learning technique that is represented by a
structured graph of decisions (nodes) and their possible results (branches
and leaves) that lead to a goal. The goal of the decision tree in this thesis
is to predict which category a gene belongs to, either the disease category or
the normal category, based on one or more explanatory factors.
When this is applied to the Cancer Predictor Calculator, the main
cancer-causing factors serve as the predictors in the decision tree, and the
final predictions are classified into four classes: no risk, low risk, medium
risk, and high risk. The tree looks like the following:


6. Cancer risk prediction models
Cancer risk prediction models are widely used because of their high
accuracy in assessing the cancer risk of average-risk individuals based on
many factors. To judge whether these models operate effectively, they should
be evaluated on characteristics such as (27):
1- Discrimination: the capacity of the model to distinguish between patients
who are at high risk of the disease and those who won't develop it
2- Calibration: the ability of the model to predict probabilities that are
consistent with observed outcomes
3- Accuracy: the power of the model to predict the correct total number of
patients affected by the disease
4- Clinical utility: the ability of the model to assist in making the right
clinical decisions based on its outcome.
To measure the performance of such a model, it should be applied to an
independent dataset. This chapter reviews and discusses the history and
current state of breast and ovarian cancer prediction models.
6.1 Breast cancer prediction models
Overview about breast cancer
Breast cancer mainly affects women, although about 1% of cases occur in men.
It can be defined briefly as abnormal cells that have irregular functions and
originate in breast tissue. It is classified as the number one cancer
affecting women and the second leading cause of cancer death in women (32).
Breast cancer in women has many risk factors that contribute to the
emergence and/or development of this disease. The most important are:

1- Gender: simply the strongest risk factor. Being female is the prime factor
for developing breast cancer because the female hormones (estrogen and
progesterone) continually affect the female's breast cells (31)
2- Age: considered the second most serious factor after gender, with a heavy
effect on the emergence of the disease. Although breast cancer does not
appear at a specific age, the risk increases with age (32)
3- Genetic risk: since human genes are inherited from the parents, any defect
in these genes (gene mutations) can cause cells to become cancerous. The
BRCA genes (BRCA1 and BRCA2) are tumor suppressor genes; mutations in these
genes are either inherited from a parent or
acquired through exposure to ionizing radiation (9), and an
individual who carries this kind of genetic mutation is at a higher risk of
developing breast cancer during her lifetime. Furthermore, if the individual
has any of these mutations at an early age, they tend to affect
both of her breasts and to increase her risk with age more than in
women who are not born with one of these mutations. BRCA
gene mutations are responsible for about 5 percent of breast cancers. Once the
mutation occurs, both genes become abnormal and cancer is expected to
develop. Having this kind of mutation also increases the risk of
developing other cancers, specifically ovarian cancer. In normal
breast cells, by contrast, these genes synthesize proteins that allow the
cells to function regularly and protect them from defects such as the
emergence of cancer (30)
4- Family history: if a first-degree relative (a parent, grandmother, sibling,
or child) has or has had breast cancer, the individual's risk of
developing the disease doubles (9)

5- Race: White women are at higher risk of getting this disease
compared with other racial groups
6- Menstrual periods: a woman is considered at high risk if she starts
menstruating before age 12 and has not stopped menstruating by age 52
7- Having children: a woman is at higher risk if she has not given
birth or first gave birth after age 30
8- Breast-feeding: a woman decreases her risk for every 12 months of
breast-feeding
9- Hormone therapy: a woman who uses hormone therapy after menopause
has an increased risk compared with women who take it five
years after menopause
10- Chemicals and radiation: frequent exposure to radiation and chemicals
influences breast cancer risk
11- Overweight: compared with having a healthy weight, being
obese changes other hormone levels in the body and increases the cancer
risk
12- Personal history: if a woman has or has had any kind of cancer, this
puts her at very high risk

Gail model of breast cancer
The Gail model is one of the most popular statistical breast cancer risk
assessment algorithms. It was developed by Dr. Mitchell Gail. The purpose of
this model is to predict a woman's absolute breast cancer risk over the next
five years and over her lifetime, based on factors related to the disease.
By considering factors such as personal history and
family history, the model shows the relationship between disease
prediction and human genes. Mutations in specific genes such as BRCA1
and BRCA2, which are associated with diagnoses of hereditary breast cancer,
increase the risk. Moreover, if the cancer has occurred in first- and/or
second-degree biological relatives, this significantly increases the
individual's risk of breast cancer (38).
The Gail model is based on data from the Breast Cancer
Detection Demonstration Project (BCDDP), and it classifies women into groups
using risk factors. The model considers the age at menarche, previous breast
biopsies, age at first live birth, and limited information about first-degree
family members with a history of breast cancer. After computing the individual
risk factors, it combines them to obtain a woman's lifetime breast cancer
risk. The calculation proceeds by multiplying the
relative risks for the factors by the individual's age, because a woman's age
plays a significant role in developing breast cancer whether or not the other
four relative risks offered by the model are present (20).
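The multiplicative combination described above can be sketched in Python. This is only an illustration of multiplying relative risks with an age-dependent baseline: the factor names follow the text, but every numeric value is hypothetical and none of them is a published Gail model coefficient.

```python
from math import prod

def combined_relative_risk(relative_risks):
    """Product of the individual relative risks."""
    return prod(relative_risks.values())

def absolute_risk(baseline_risk_for_age, relative_risks):
    """Age-specific baseline risk scaled by the combined relative risk, capped at 1."""
    return min(1.0, baseline_risk_for_age * combined_relative_risk(relative_risks))

# HYPOTHETICAL values, for illustration only.
example_factors = {
    "age_at_menarche": 1.1,
    "previous_biopsies": 1.3,
    "age_at_first_live_birth": 1.2,
    "first_degree_relatives": 2.0,
}
print(absolute_risk(0.02, example_factors))
```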
Gail model limitations
1. Does not consider extensive family history; it may overestimate risk
in individuals whose first-degree relatives have or have had breast cancer and
underestimate risk for individuals who have affected second-degree relatives
or a family history of ovarian cancer
2. Estimates risk for certain racial groups only
3. Does not consider personal history, that is, whether a woman has or has had
other cancers or whether the cancer occurred in both breasts
4. Does not consider abnormal genes that carry a much higher risk of
breast cancer.
6.2 Ovarian cancer prediction models
Overview about ovarian cancer
Although ovarian cancer mostly affects older women, it
occurs in younger ones as well. The ovaries are located on
each side of the uterus in a woman's body; they are responsible for producing
eggs and are the main source of the hormones estrogen, testosterone, and
progesterone. When ovarian cancer starts, the patient does not feel any
symptoms until the tumor reaches the pelvis and abdomen. It can be defined
simply as a disease in which abnormal cells grow in an uncontrolled
way. Cancer tumors can then spread through the bloodstream or lymph
channels to the entire body.

There are many risk factors that cause this disease, some of which
are:
1. Age: the risk increases if the woman is 50 years old or older
2. Family history: having a family member who has this
disease, or breast, uterine, or colon cancer, increases the risk
3. Gene mutations: inherited mutations in the BRCA1 (breast cancer
susceptibility gene 1) and BRCA2 (breast cancer susceptibility gene 2) genes
greatly increase the risk and also carry a very high risk of breast
cancer. Besides these genes, any mutation in the HNPCC gene
(hereditary nonpolyposis colorectal cancer)
contributes to developing this disease.
4. Personal history of cancer: an individual who has or has had
either breast cancer or ovarian cancer in one of the ovaries is at
high risk
5. Pregnancy: never having been pregnant and never having breast-fed
is considered a risk factor compared with individuals who have
6. Obesity: this factor is strongly associated with ovarian cancer
through a hormonal mechanism
7. Endometriosis: individuals with endometriosis are more likely
to develop the disease
8. Infertility: having this condition and taking treatment for it raises
the risk of ovarian cancer.

The risk of ovarian cancer algorithm (ROCA)
Each individual tends to have stable levels of CA-125 (cancer antigen 125),
a protein produced on the cell surface that circulates in the bloodstream.
CA-125 levels are higher in ovarian cancer cells than in normal
cells. Variations in CA-125 levels over time are considered significant for
ovarian cancer; for this reason the biostatistician Steven Skates developed
a computer-based tool called the Risk of Ovarian Cancer Algorithm
(ROCA). ROCA has only been outlined, and its results will be released by
2015 (19).
ROCA is used for postmenopausal women by applying the following
steps (24):
1. A CA-125 test is required to determine the concentration of
CA-125 in the blood
2. Using mathematical calculations, the algorithm integrates the
patient's age and annual CA-125 test results
3. Based on the risk results, women are classified into one of three
categories: normal, intermediate, and elevated risk
4. A woman classified in the normal group returns to step 1 and
repeats the CA-125 test after one year
5. A woman classified in the intermediate group returns to step
1 and repeats the CA-125 test after one or six months, based on her risk level
6. A woman classified in the elevated group is referred for
intensive medical treatment.
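Steps 4 to 6 above amount to a lookup from risk category to follow-up action, which can be sketched in Python as follows. The risk calculation of steps 2 and 3 is not reproduced, since its formula is not given here; the function name is an illustrative assumption.

```python
# Follow-up actions taken from steps 4-6 of the list above.
FOLLOW_UP = {
    "normal": "repeat CA-125 test after one year",
    "intermediate": "repeat CA-125 test after one or six months, based on risk level",
    "elevated": "refer for intensive medical treatment",
}

def next_step(category):
    """Return the follow-up action for a risk category from step 3."""
    try:
        return FOLLOW_UP[category]
    except KeyError:
        raise ValueError(f"unknown risk category: {category!r}") from None

print(next_step("normal"))
```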

7. Contribution and discussion
This review discussed the role of bioinformatics in the prediction of
cancer genes. While this work emphasized the available prediction
algorithms that contribute to analyzing cancer gene prediction
based on multiple factors, I have supported it by creating an implementation.
My implementation is an expert system that predicts the cancer risk percentage
for women for two kinds of cancer tumors: breast and ovarian.
The implementation that I've created is a calculator that computes a
woman's lifetime cancer risk percentage depending on many factors
that are considered causes of the disease. I used C# Express for the
code and the GUI (graphical user interface) and a built-in SQL database for
creating the tables and linking the factors that cause the
disease. The following flowchart represents how my calculator works:


In my calculator I developed some mathematical formulas and used
them in my code. The following algorithm represents my own methodology:
Double Disease_Risk = 0;
Integer I = 1;
While (I <= N)
{
    Double Form_Risk = '1**"*
    Disease_Risk += Form_Risk;
    I++;
} //end while
Double Risk_Percentage = (Disease_Risk * 100) / 10;
In the above algorithm:
I: refers to the form number
N: refers to the last form
RR: refers to the relative risk factors, and
j: refers to the last question in the form.
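The loop above can be sketched in Python. The expression for Form_Risk is not legible in the source, so the sketch ASSUMES that the risk of a form is the plain sum of the relative risks (RR) of its questions 1 to j; this is a guess made for illustration and may differ from the formula actually used in the calculator. The input values are hypothetical.

```python
def risk_percentage(forms):
    """Accumulate per-form risks and convert the total to a percentage.

    `forms` is a list of forms; each form is a list of relative-risk values,
    one per question.
    """
    disease_risk = 0.0
    for form in forms:
        # ASSUMPTION: the source formula is illegible; a plain sum of the
        # form's relative risks is used here as a stand-in.
        form_risk = sum(form)
        disease_risk += form_risk
    return (disease_risk * 100) / 10

# Two hypothetical forms with made-up relative-risk values.
print(risk_percentage([[0.1, 0.2], [0.3]]))
```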
I also developed a formula that decreases the risk
percentage of breast cancer. This formula relates to breastfeeding,
which is considered one of the important factors that help decrease
breast cancer risk, depending on how many months the woman breastfed after
each birth (16).
Breastfeeding = breastfeeding_months + (0.08 * children_no); (7.1)

7.1 Results
To validate my cancer predictor calculator, I used real data values
from the Women's Health Initiative (WHI) and Surveillance, Epidemiology, and
End Results (SEER) databases. The WHI database contains collections of health
studies on breast cancer based on some factors of the disease. The
SEER database contains statistics and information on the race and age of
ovarian cancer patients, factors that affect the occurrence of the disease.
After applying the real values from patient records in the WHI database to my
calculator, I found a difference of approximately 6.5% between the
outcomes. I also applied the two factors available in the SEER database
(race and age) to the cancer predictor calculator and found a difference of
up to 8%.
I will present the results of the cancer predictor calculator's algorithm for
breast tumors by comparing it with a well-known available algorithm, the
Gail breast cancer algorithm, with regard to some factors:

Table 7.1: Age and Race
         Cancer Predictor Calculator   Gail
Ages     Black     White               Black     White
20-29    6.5 %     32.5 %              4.7 %     4.1 %
30-39    11.5 %    37.5 %              4.3 %     6.3 %
40-49    21.9 %    47.9 %              11.2 %    13.1 %
50-59    23.3 %    49.3 %              5.9 %     8 %
60-69    29 %      55 %                4.7 %     7.7 %
70-79    30 %      56 %                3.5 %     6.3 %
80+      19.5 %    45.5 %              N/A       N/A
This table shows the predicted risk for the factors age and race after
applying different age groups for two races to the Gail calculator and the
Cancer Predictor Calculator. It is clear that the Cancer Predictor
Calculator's risks are higher than Gail's predicted risks.

Table 7.2: Age and First Birth Age (all values in %)
                                              Age
First birth   20-29       30-39       40-49       50-59       60-69       70-79       80+
age           CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC
<20           1.5   N/A   6.5   N/A   16.9  N/A   18.3  N/A   24    N/A   25    N/A   14.5
20-24         1.7   2.6   6.7   5     17.1  14.4  18.5  9.2   24.2  9.4   25.2  7.9   14.7
25-29         2.1   4.5   7.1   5.8   17.5  16    18.9  10.6  24.6  11.5  25.6  10.1  15.1
30+           /     /     7.4   8.3   17.8  18    19.2  12.4  24.9  14    25.9  12.6  15.4
Nulliparous   2.1   3.2   7.1   5.8   17.5  16    18.9  10.6  24.6  11.5  25.6  10.1  15.4
*CPC refers to Cancer Predictor Calculator
This table shows the predicted risk for the factors age and first birth age
after applying different age groups combined with different first birth age
groups to the Gail calculator and the Cancer Predictor Calculator. The two
algorithms differ in their predicted risk percentages (e.g., with an average
difference of 2% for age groups 20-49).

Table 7.3: Age and Family History Degree (all values in %)
                                                Age
Family history  20-29       30-39       40-49       50-59       60-69       70-79       80+
degree          CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC
1st degree      2.8   5.3   7.8   9.2   18.2  22.2  19.6  16.3  25.3  19.5  26.3  18.1  15.8
2nd degree      2     N/A   7     N/A   17.4  N/A   18.8  N/A   24.5  N/A   25.5  N/A   15
3rd degree      16.5  N/A   21.5  N/A   31.9  N/A   33.3  N/A   39    N/A   40    N/A   29.5
*CPC refers to Cancer Predictor Calculator
The data show the predicted risk for the factors age and first degree of
family history after applying different age groups combined with first-degree
family history to the Gail calculator and the Cancer Predictor Calculator.
The table shows that the predicted risk percentages from both algorithms are
approximately the same.

Table 7.4: Age and Menstruation Age (all values in %)
                                              Age
Menstruation  20-29       30-39       40-49       50-59       60-69       70-79       80+
age           CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC   Gail  CPC
< 12          20.5  6.4   25.2  9.3   35.9  17.7  37.3  12.2  43    13.7  44    12.3  33.5
*CPC refers to Cancer Predictor Calculator
The data show the predicted risk for the factors age and menstruation age
after applying different age groups combined with menstruation before age 12
to the Gail calculator and the Cancer Predictor Calculator. The calculator
predicts higher risk percentages than the Gail calculator.

With regard to the results for ovarian cancer, it was mentioned in the
previous chapter that the Risk of Ovarian Cancer Algorithm is still under
development and that its software results will be published in 2015. Because
it is the only algorithm for predicting the ovarian cancer risk ratio, it is
not possible to compare the Cancer Predictor Calculator's results with ROCA.
7.2 Evaluation
In order to decide which calculator gives the more accurate result, the Gail
calculator or the Cancer Predictor Calculator, I applied 760 patient records
from the WHI database to both calculators. I then compared their results with
each other, and compared each calculator's result with the real risk
percentages in the patients' records. The comparison gave the following
results:
1. The average difference between my calculator and the Gail
calculator is 8.7%.
2. The difference between the Gail calculator and the WHI database
results is 15%.
3. The difference in outcome between my calculator and the WHI
database is approximately 6.5%.
Consequently, my calculator's predicted risk is closer to the real patients'
records than Gail's predicted risk, making it more accurate.
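
The thesis does not state the exact formula behind these averages. A minimal
sketch, assuming that the "difference" is the mean absolute difference in
percentage points between paired risk values, could look as follows. The
function name and the three sample records are hypothetical placeholders and
are not WHI data.

```python
from statistics import mean


def mean_absolute_difference(predicted, reference):
    """Mean of |predicted - reference| over paired risk percentages."""
    if len(predicted) != len(reference):
        raise ValueError("both sequences must have the same length")
    if not predicted:
        raise ValueError("at least one record is required")
    return mean(abs(p - r) for p, r in zip(predicted, reference))


# Hypothetical placeholder values (percent), for illustration only.
recorded_risk = [10.0, 20.0, 30.0]
cpc_risk = [12.0, 17.0, 31.0]
gail_risk = [4.0, 11.0, 18.0]

cpc_error = mean_absolute_difference(cpc_risk, recorded_risk)
gail_error = mean_absolute_difference(gail_risk, recorded_risk)

# The calculator with the smaller mean difference is closer to the records.
closer = "CPC" if cpc_error < gail_error else "Gail"
print(f"CPC: {cpc_error:.1f}, Gail: {gail_error:.1f}, closer: {closer}")
```

Under this assumption, the calculator with the smaller mean difference from
the recorded values is the one judged more accurate.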
Based on this, I compared the predicted risk percentage for the shared
disease risk factors between my calculator and the WHI database, and between
my calculator and the SEER database, in order to find the factors that are
most indicative for predicting the cancer risk percentage correctly. The
results were:
1. The age at first birth factor gives the same risk percentage in my
calculator and in the WHI database
2. The breastfeeding factor differs in my calculator from the WHI
database by 2%
3. The weight at birth factor differs in my calculator from the WHI
database by 4.8%
4. The age factor for age group (0-19) differs in my calculator from the
SEER database by 0.64%
7.3 Advantages vs. disadvantages
From the tables that describe the differences between the Cancer Predictor
Calculator model and the Gail Model based on the factors they share, it is
noticeable that my model has some strengths and pitfalls:
1- The Cancer Predictor Calculator model integrates some factors that
do not exist in the Gail Model, which I consider advantages of my calculator.
They are classified as:
A. Personal factors: age at first birth, age at last birth, age at
menopause, hormone replacement therapy, race (it includes all
ethnicities), and breastfeeding
B. Medical history factors: abortion, cancer disease, and the drug
diethylstilbestrol
C. Lifestyle factors: alcohol, smoking, exercise, wearing a bra, birth
control pills, and night shift work
D. Genetic factors: BRCA gene mutations, obesity, and extended
family history
E. Body factors: weight, height, and head circumference at birth
F. Environmental factors: radiation, living environment, and
working environment
2- Compared with the Gail Model, my model shows the risk ratio
increasing with the individual's age, in agreement with proven scientific
facts and medical studies in this regard.
In contrast with these strengths, my model has a pitfall: it
overestimates the risk ratio for race compared with the Gail Model and other
public breast cancer calculators.
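
The grouping of the additional factors (categories A to F above) can be made
explicit as a simple mapping from category to factor names. This is only an
illustrative sketch; the identifiers below are hypothetical and do not come
from the calculator's source code.

```python
# Factor names follow the classification given in this section.
# The dictionary keys and the helper function are hypothetical names.
FACTOR_CATEGORIES = {
    "personal": [
        "age at first birth", "age at last birth", "age at menopause",
        "hormone replacement therapy", "race", "breastfeeding",
    ],
    "medical history": ["abortion", "cancer disease", "diethylstilbestrol"],
    "lifestyle": [
        "alcohol", "smoking", "exercise", "wearing a bra",
        "birth control pills", "night shift work",
    ],
    "genetic": ["BRCA gene mutations", "obesity", "extended family history"],
    "body": ["weight at birth", "height at birth", "head circumference at birth"],
    "environmental": ["radiation", "living environment", "working environment"],
}


def category_of(factor):
    """Return the category a factor belongs to, or None if it is not listed."""
    for category, factors in FACTOR_CATEGORIES.items():
        if factor in factors:
            return category
    return None


total_factors = sum(len(factors) for factors in FACTOR_CATEGORIES.values())
print(f"{total_factors} factors in {len(FACTOR_CATEGORIES)} categories")
print(category_of("smoking"))
```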
7.4 Future work
After analyzing my model by comparing it with well-known algorithms
for breast cancer and ovarian cancer prediction, and applying the
decision tree model, I propose adding the following features to the Cancer
Predictor Calculator in order to improve it:
1. Give it the ability to draw a patient's path on the decision tree, in
order to let the patient know the factors that caused the disease
2. Give it the ability to draw the family history tree degrees with the
selected answer
3. Integrate this model with other known algorithms in order to give
the user more than one risk ratio regarding her specific disease
4. Include an additional question in the ovarian cancer questionnaire related
to the CA-125 test, due to its important role in detecting ovarian cancer
5. Include more cancer tumors in the calculator to predict their risk
ratios for each individual
6. Give the patient advice and guidance about what to do
after getting (i.e., predicting) the risk.
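
The first proposed feature, tracing a patient's path through the decision
tree, can be sketched as follows. This is a minimal illustration under stated
assumptions: the tree, its two questions, and the leaf labels are hypothetical
placeholders and do not reproduce the decision tree used by the calculator.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A decision tree node: a question with branches, or a leaf (no branches)."""
    label: str
    branches: dict = field(default_factory=dict)  # answer -> child Node


def trace_path(root, answers):
    """Follow the patient's answers from the root and return the visited steps."""
    path = []
    node = root
    while node.branches:
        answer = answers[node.label]
        if answer not in node.branches:
            raise KeyError(f"no branch for answer {answer!r} at {node.label!r}")
        path.append((node.label, answer))
        node = node.branches[answer]
    path.append((node.label, None))  # leaf reached
    return path


# Hypothetical two-question tree; questions and leaves are placeholders.
tree = Node("family history?", {
    "yes": Node("age over 50?", {
        "yes": Node("leaf: group A"),
        "no": Node("leaf: group B"),
    }),
    "no": Node("leaf: group C"),
})

answers = {"family history?": "yes", "age over 50?": "no"}
for step, answer in trace_path(tree, answers):
    print(step if answer is None else f"{step} -> {answer}")
```

The returned list of (question, answer) pairs is the information a drawing
routine would need in order to highlight the path on the tree.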

7.5 Conclusion
In conclusion, this thesis discussed some cancer prediction methods in
bioinformatics and medicine. It focused particularly on two well-known
algorithms, Gail and ROCA, which perform breast cancer prediction and ovarian
cancer prediction, respectively. This thesis includes an implemented expert
system that predicts the cancer risk percentage for a woman during her
lifetime for two types of cancer tumors: breast and ovarian. This expert
system makes several technological contributions, with enhancements in
bioinformatics, identification of cancer prediction factors, and expert system
design, and it performs favorably when compared with state-of-the-art systems
in the area.

The Cancer Predictor Calculator is an expert system with a graphical user
interface, which aims to predict the risk percentage for a woman in her
lifetime for two types of cancer tumors: breast and ovarian. This calculator
is published online for public use.
The Cancer Predictor Calculator is available for download from the
mediafire site. This version works with Windows OS only. To
download it:
1. Go to mediafire site at:
2. Enter the password to unlock the protected file (FinallydonE14),
then press the unlock button
3. Select the "Click here to start download from MediaFire" link at the
top of the web page
4. Double-click setup.exe to start the installation
5. The Cancer Predictor Calculator Wizard will open. Follow the
instructions in this wizard to complete the installation process.

1. Select one tumor from the choices to predict its risk percentage
2. Answer all questions in each form, then press the Next button. If the
user forgets to answer a question, a popup window will appear as a reminder
of the missed question. In the last form of the questionnaire, press the Risk
Probability button to get the predicted risk percentage

The predicted risk will be displayed. Finally the user has the choice to
go back to the home form or quit.
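
The reminder behaviour described in step 2 can be sketched as a check for
unanswered questions before the form is allowed to advance. The function names
and the sample form contents are hypothetical; they illustrate the idea and
are not taken from the calculator's implementation.

```python
def unanswered_questions(form_answers):
    """Return the questions whose answer is missing or blank, in form order."""
    return [
        question
        for question, answer in form_answers.items()
        if answer is None or str(answer).strip() == ""
    ]


def can_proceed(form_answers):
    """A form may advance only when every question has an answer."""
    return not unanswered_questions(form_answers)


# Hypothetical form contents, for illustration only.
form = {"Age": "45", "Age at first birth": "", "Breastfeeding": None}
missing = unanswered_questions(form)
if missing:
    # In the calculator this message would be shown in the popup window.
    print("Please answer: " + ", ".join(missing))
```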

