Citation
Bioinformatic approaches to study recombination in West Nile virus

Material Information

Title:
Bioinformatic approaches to study recombination in West Nile virus
Creator:
Housley, Roberta M
Publication Date:
Language:
English
Physical Description:
xiv, 184 leaves : ; 28 cm

Subjects

Subjects / Keywords:
West Nile virus ( lcsh )
Recombinant viruses ( lcsh )
Bioinformatics ( lcsh )
Bioinformatics ( fast )
Recombinant viruses ( fast )
West Nile virus ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 131-184).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Roberta M. Housley.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
263685040 ( OCLC )
ocn263685040
Classification:
LD1193.E52 2008d H68 ( lcc )

Full Text

BIOINFORMATIC APPROACHES TO STUDY RECOMBINATION IN
WEST NILE VIRUS
by
Roberta M. Housley
MS, University of Denver, 1999
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Computer Science and Information Systems
2008


This thesis for the Doctor of Philosophy
degree by
Roberta M. Housley
has been approved
by
Date


Housley, Roberta M. (Ph.D., Computer Science and Information Systems)
Bioinformatic Approaches to Study Recombination in West Nile Virus
Thesis directed by Professor Harvey J. Greenberg, Advisor
ABSTRACT
Viral recombination contributes to the genetic diversity found in viruses.
Phylogenetic analysis has been used to determine evolutionary relationships of
viruses, often assuming the absence of recombination. However, viral recombi-
nation results in mosaic genomes which cause conflicting phylogenetic signals
in the data. Therefore, a unique, bifurcating tree topology cannot accurately
describe the evolutionary relationships within a set of sequences. Quantifying
viral recombination is an essential element in the study of evolutionary processes
as well as any subsequent vaccine development.
West Nile virus (WNV) is a mosquito-borne, RNA virus and a member of
the family Flaviviridae. Homologous recombination has been observed within
this family in natural populations and under laboratory conditions. In this
study, a dataset of WNV sequences was analyzed for intraspecies, homologous
recombination. A NeighborNet analysis of the sequences revealed extensive net-
worked evolution, indicative of recombination with additional evidence provided
by nonparametric methods within the RDP2 software package. However, results
from parametric methods within the RDP2 software package did not indicate


recombination. A new parametric method based on the Jensen-Shannon diver-
gence was developed to reconcile these different results as well as analyze all
sequences in the dataset for recombination. The results of this analysis indicate
WNV is not recombinant.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.


DEDICATION
I dedicate this thesis to my grandmother, Lillian Cohen. She gave me the
encouragement I needed at just the right times. I also dedicate this to my
husband, Dan, for his unwavering support and understanding while I completed
this degree and thesis. Finally, I would not have started this degree but for Lady
Sharaya, my beautiful Arabian mare who did not win the fight with West Nile
virus. Watching her struggle with its devastating effects motivated me to learn
about this virus.


ACKNOWLEDGMENT
This thesis would not have been possible without the generous support of my
advisor, Harvey J. Greenberg. I also wish to thank all the members of my
committee for their valuable participation and insights. My deepest thanks to
you.


CONTENTS
Figures ......................................................... xii
Tables............................................................. xiv
Chapter
1. Introduction...................................................... 1
1.1 Recombination in Hepacivirus.................................... 2
1.2 Recombination in Flavivirus..................................... 2
1.2.1 Recombination in West Nile virus............................... 3
1.3 Detecting Recombination......................................... 4
1.4 Dissertation Chapters .......................................... 7
2. Software.......................................................... 8
2.1 SplitsTree4 Software Package .................................. 8
2.2 RDP2 Software Package........................................... 9
2.2.1 The RDP Method ............................................... 9
2.2.2 The Maximum $\chi^2$ Method.................................. 12
2.2.3 The Chimaera Method.......................................... 14
2.2.4 The Sister Scanning Method................................... 15
2.2.5 The Gene Conversion Method................................... 19
2.2.6 The Bootscan Method.......................................... 21
2.2.7 The LARD Method ............................................. 22
2.2.8 The LDHat Method............................................. 24
2.3 Data............................................................ 26
2.3.1 Results using Nonparametric Methods............................ 27
2.4 Results using Parametric Methods ............................... 38
3. Sequence Segmentation Method...................................... 40
3.1 The Issues..................................................... 40
3.2 The Conversion Method........................................... 41
3.3 The Jensen-Shannon Divergence Measure........................... 42
3.3.1 Halting Criteria............................................... 43
3.3.2 Segment Simplification......................................... 45
3.4 Validation Experiment........................................... 45
3.4.1 Recomb1 Results................................................ 46
3.4.1.1 Analysis.................................................... 46
3.4.2 Recomb2 Results................................................ 48
3.4.2.1 Analysis.................................................... 49
3.4.3 Recomb3 Results................................................ 49
3.4.3.1 Analysis.................................................... 50
3.5 Second Experiment............................................... 52
3.6 Third Experiment................................................ 52
3.7 Fourth Experiment............................................... 54
3.8 Conclusion...................................................... 56
4. Future Work....................................................... 60
Appendix
A. Molecular Biology of West Nile Virus.............................. 61
A.1 Taxonomy of Viruses............................................. 62
A.1.1 The Central Dogma of Molecular Biology ........................ 62
A.1.2 The Baltimore System of Classification.......................... 63
A.1.3 The ICTV System of Taxonomy.................................... 65
A.1.3.1 The Taxonomic and Serological Classification of West Nile Virus 66
A.2 The Genomic Structure of West Nile Virus and Kunjin Virus .... 67
A.2.1 The Primary Structure of West Nile and Kunjin Virus Genomes . 67
A.2.1.1 The Nonstructural Proteins of West Nile and Kunjin Viruses . 69
A.2.2 The Secondary and Tertiary Structures of West Nile and Kunjin
Viruses ......................................................... 70
A.3 The Virion Morphology of West Nile Virus.......................... 72
A.4 Viral Life Cycle.................................................... 73
A.4.1 Viral Attachment and Host Cell Penetration...................... 74
A.4.1.1 The Envelope Glycoprotein.................................. 74
A.4.1.2 Domain III of the Envelope Glycoprotein ....................... 74
A.4.2 Viral Uncoating and Release....................................... 75
A.4.3 Expression and Replication........................................ 76
A.4.3.1 Defective Interfering Viruses ................................. 76
A.4.4 Assembly, Maturation and Exiting................................ 77
A.4.4.1 The trans Mode of Maturation............................... 77
A.4.4.2 The cis Mode of Maturation .................................... 78
A.4.4.3 Cell Lysis................................................. 78
A.5 Pathogenesis ....................................................... 81
A.5.1 Viral Molecular Genetics of Virulence .......................... 81
A.5.2 Host Genetic Factors.............................................. 84
A.5.3 Clinical Features of Disease in Humans......................... 85
A.5.3.1 West Nile Fever............................................. 85
A.5.3.2 West Nile Virus Neuroinvasive Disease....................... 85
A.6 Potential Treatments............................................... 86
A.6.1 Long-Term Prognosis............................................. 87
A.7 Vaccines........................................................... 88
A.7.1 Jennerian Vaccines of other Flaviviruses........................ 88
A.7.2 Chimeric Vaccines............................................... 89
A.7.3 DNA Vaccines ................................................... 90
A.8 Transmission Cycle................................................. 90
A.9 The Host Range..................................................... 91
A.9.1 The Mosquito Vector............................................. 91
A.9.1.1 The Bridge Vectors............................................. 94
A.9.2 The Reservoir Host.............................................. 95
A.9.3 Secondary Hosts................................................. 96
A.10 Direct Transmission .............................................. 97
A.10.1 Vertical Transmission .......................................... 97
A.10.2 Transmission via Breast Milk, Blood Transfusions and Organ
Transfers....................................................... 99
A.10.3 Transmission via Oral Route ................................... 100
A.10.4 Nonviremic Transmission ...................................... 101
A.11 Global Spread of West Nile Virus............................... 102
A.12 Early History and Endemic Regions............................. 102
A.13 Early Epidemics ................................................ 103
A.14 Recent Epidemics in Europe, Israel and North Africa........... 105
A.14.1 West Nile virus in the United States and Canada............ 107
A.15 The Further Spread of the Virus .............................. 109
A.15.1 West Nile virus in Mexico, Central America and South America 111
A.16 Factors Contributing to the Spread of West Nile Virus ....... 113
A.16.1 Human Demographics and Agriculture.......................... 113
A.16.2 Natural Environment Changes................................. 115
A.16.3 Migrations of Birds............................................. 117
A.16.4 Reproductive Cycle of Mosquitoes ......................... 117
B. Phylogeny and Evolution of West Nile Virus..................... 119
B.1 Serology and Phylogeny of Flaviviruses............................. 119
B.2 Sequence Analysis and Phylogeny of West Nile virus ............ 121
B.2.1 Additional Lineages of West Nile Virus.......................... 124
B.2.2 Phylogeny of West Nile Virus in the United States and Mexico . 124
B.3 Evolution of West Nile Virus in the United States.............. 127
B.3.1 West Nile Virus Genotype WN-NY99................................ 128
B.3.2 West Nile Virus Genotype WN02................................... 128
B.3.3 Causes of West Nile Virus Evolution............................. 129
References............................................................. 131


LIST OF FIGURES
Figure
2.1 Example of Recombination Events Found by LARD Method .... 23
2.2 NeighborNet Analysis of Lineage 1 West Nile Virus Strains......... 29
2.3 NeighborNet Analysis of Lineage 2, 3, 4, and 5 West Nile Virus Strains 30
2.4 Phylogenetic Analysis of West Nile Virus Strains............... 31
2.5 Phylogenetic Analysis of West Nile Virus Strain from Madagascar . 35
2.6 Phylogenetic Analysis of West Nile Virus Strain from Rabensburg . 36
2.7 Phylogenetic Analysis of West Nile Virus Strain from India..... 37
3.1 A Comparison of the Known and Predicted Breakpoints in RECOMB1 47
3.2 Analysis of Breakpoints in RECOMB1................................ 47
3.3 Alternative Analysis of Breakpoints in RECOMB1 ................... 48
3.4 A Comparison of the Known and Predicted Breakpoints in RECOMB2 49
3.5 Analysis of Breakpoints in RECOMB2................................ 50
3.6 A Comparison of the Known and Predicted Breakpoints in RECOMB3 50
3.7 Analysis of Breakpoints in RECOMB3................................ 51
A.1 Central Dogma of Molecular Biology................................. 63
A.2 RNA schemata....................................................... 63
A.3 Viral Classification............................................... 66
A.4 West Nile Virus Genome Based on the Reference Sequence NC-00153 68
A.5 Replication Cycle of West Nile Virus............................... 80
A.6 Conventional Transmission Cycle of West Nile Virus ................ 92
A.7 Expanded Transmission Cycle of West Nile Virus..................... 97


LIST OF TABLES
Table
2.1 Example of Recombination Events Found by RDP Method.......... 11
2.2 Example of Recombination Events Found by Max $\chi^2$ Method ... 13
2.3 SiScan Classifications Used to Categorize Sequences............ 16
2.4 SiScan Sum Patterns............................................ 17
2.5 Example of Recombination Events Found by SiScan Method .... 18
2.6 Example of Recombination Events Found by GENECONV Method 20
2.7 Example of Recombination Events Found by Bootscan Method . 22
2.8 Sequences Used................................................. 26
2.9 Recombination Events in West Nile Virus Sequences ............. 33
2.10 Recombination Rates in West Nile Virus Sequences............... 39
3.1 Composite Sequences Generated from Results Obtained from RDP2
Software....................................................... 53
3.2 Example of Permuted Sequences Generated from Combination and
Permutation Function........................................... 54


1. Introduction
The extensive genetic and antigenic diversity observed in RNA viruses is
generally attributed to the error-prone nature of their replication machinery.
RNA viruses are subject to rapid evolution because they are copied by
RNA polymerases, which are highly error-prone due to their lack of
proofreading ability. Evolution can also occur when RNA viruses undergo
recombination. Recombination occurs during viral replication when RNA-dependent
RNA polymerase switches from one RNA molecule to another through a
copy-choice mechanism. This template switching creates variation within the viral
genome in which the RNA can be from other viral strains, the same viral strain
or the host. If recombination occurs between viral strains, it is classified in
three ways. Homologous recombination occurs when a portion of the recipient
viral genome is replaced in the same region by the donor genome; the
structure of the resulting genome is unchanged. Aberrant homologous recombination
occurs when the strict alignment between recipient and donor genome regions
is not maintained. Heterogeneous or nonhomologous recombination occurs when
unrelated viruses recombine.
Several conditions must be present for viral recombination to occur. First,
the host must be infected by different virus strains. Second, a single cell must
be infected by multiple divergent viruses, i.e., the cell must be susceptible to
superinfection. Both of these conditions are constrained by host immune
responses and superinfection exclusion properties of the virus. If a cell does
become superinfected, then one virus must replicate in the presence of the RNA of
the other and the template switching mechanism must occur. Finally, selection
plays a role in whether these recombinant viruses are viable. [491] Even with
all of these requirements and constraints, homologous recombination has been
observed within the family Flaviviridae in natural populations and under labo-
ratory conditions. This research examined whether West Nile virus (WNV), an
RNA virus that is a member of the family Flaviviridae, genus Flavivirus, has
undergone intraspecies recombination.
1.1 Recombination in Hepacivirus
Hepatitis C virus (HCV) belongs to the family Flaviviridae, genus
Hepacivirus. Like WNV, HCV is an enveloped virus with a single-stranded, positive
sense RNA genome. The genome has a single open reading frame which encodes
a single polyprotein. The genotypes of HCV are based on a single region within
the genome and are used to classify HCV. Researchers have found evidence of
inter-genotypic recombination in strains of HCV found in St. Petersburg and
intra-genotypic recombination in strains in Peru, based on phylogenetic analy-
ses. Another member of the genus Hepacivirus is the GB virus C or hepatitis G
virus (GBV-C/HGV), which is closely related to HCV. Researchers have found
evidence of recombination in this virus based on the LARD method, which is
discussed below. [227, 226, 103, 104, 492]
1.2 Recombination in Flavivirus
Researchers have found evidence of homologous recombination in members
of the family Flaviviridae, genus Flavivirus, such as the dengue, Japanese
encephalitis and St. Louis encephalitis viruses. Dengue virus is classified into
four serotypes based on its antigenic characteristics. Researchers initially found
evidence of intra-serotypic recombination in strains of type 1 dengue virus from
South America. In [493], Worobey et al. found evidence of intra-serotypic re-
combination in all dengue virus serotypes. Both of these studies obtained results
based on the LARD method. In [461], Tolou et al. also found evidence of
intra-serotypic recombination in type 1 dengue virus strains from Asia using
the RDP method, which is discussed below. [200]
Researchers have also found evidence of recombination in Japanese en-
cephalitis (JE) virus and St. Louis encephalitis (SLE) virus, which belong to the
same serogroup as WNV and share a similar transmission cycle involving Culex
mosquitoes and birds. Phylogenetic analysis of each virus shows co-circulating
strains. Also, mosquitoes of the genus Culex have been shown to feed on multi-
ple hosts. Both of these traits indicate the potential of coinfection of hosts with
more than one strain as required for recombination to occur. [469, 19, 316]
1.2.1 Recombination in West Nile virus
The vast majority of work on WNV, including vaccine studies and phylo-
genetic analysis, assumes that evolution and diversity are derived through the
accumulation of mutational changes. Given the evolutionary advantages of viral
recombination such as the removal of deleterious genes or repair of the genome
through genetic exchange, and how this might affect the development of vac-
cines, it is important to determine the extent to which recombination plays a
role in WNV evolution. [491]
In [469], Twiddy et al. conducted recombination analysis on WNV. Their
results indicate recombination does not occur in WNV. However, their sample
was heavily biased toward strains from Israel and New York from 1998 to 2000,
and these strains show little diversity. They also examined only the envelope
(E) gene. These issues may have influenced the results. [469]
While evidence of recombination has not been found in natural strains of
WNV, researchers conducting vaccine research have created functional hetero-
geneous, recombinant genomes by substituting the prM and E protein genes
from WNV into the Yellow Fever vaccine strain. Heterogeneous recombinant
genomes have also been created with dengue type 4 and WNV. In [58], Bori-
sevich et al. created functional, homogeneous recombinant genomes by substi-
tuting the structural genes between Lineage 1 and Lineage 2 strains of WNV.
All of these recombinant vaccine strains replicate and elicit a host immune
response against WNV infection. [369, 368, 18, 504]
West Nile virus may utilize the template switching mechanism during repli-
cation to produce defective interfering viruses. These viruses have genomes with
large deletions, which may be the result of the polymerase detaching from the
template strand and reattaching at a different position or attaching to a new
incomplete strand. [63, 114]
1.3 Detecting Recombination
The task of detecting recombination events is a difficult one if the sequences
involved in the recombination events are too similar or have undergone evolu-
tionary events after recombination which alter previous events and undermine
the accuracy of analysis. To detect viral recombination in aligned nucleotide
sequences, parametric and nonparametric methods have been developed. The
nonparametric methods can be classified as distance methods, which use a
sliding window to calculate statistics based upon the genetic distance between
sequences; phylogenetic methods, which compare tree topologies at adjacent
sequence regions; compatibility methods, which look for compatibility between
sites; and substitution distribution methods, which measure clustering of
substitutions.
Parametric methods have been developed to determine the underlying evolu-
tionary history of the sequences using population genetics. For example, one
such method uses the coalescent theory framework to determine the population
rate of recombination within a set of sequences. [372]
Given the growing evidence of recombination in the family Flaviviridae, es-
pecially in closely related Flaviviruses such as JE and SLE, and that functional,
recombinant genomes using WNV can be created under laboratory conditions,
this study investigated homologous recombination in naturally occurring WNV
strains. This study used nonparametric and parametric recombination detection
methods to identify and characterize statistically probable recombination events
using the entire protein coding sequences of WNV to answer the following
questions:
1. Is there evidence of homologous recombination in WNV?
2. If so, which strains are involved?
3. Where are the recombinant regions?
4. If recombination has occurred, what is the rate of its occurrence in WNV
lineages?
Preliminary analysis of the WNV genomes generated conflicting results: the
nonparametric methods indicated WNV is recombinant while the parametric
methods indicated it is not recombinant. Several issues make analyzing and
comparing these results difficult. Current methods, as implemented in the RDP2
software package, require many user-defined settings such as the sliding window
size, p-value settings, correction methods, and whether phylogenetic evidence is
used in the analysis. Many of these settings are chosen without a priori knowledge
and can affect the results. For example, the optimal window size depends
upon the length and relatedness of the sequences being analyzed as well as the
method being used. Another factor in selecting the appropriate window size is
the size of the potential recombination regions. Selecting the proper window
size is problematic: too large a window may lessen the sensitivity of
the analysis and increase signal-to-noise ratios but will detect large recombination
regions, while too small a window may increase the sensitivity but will also
increase the possibility of false positive results. Another issue involves comparing
results obtained from parametric methods with results obtained from
nonparametric methods. Nonparametric methods indicate which genomes are potential
parental and daughter genomes as well as where the breakpoints are in
the genome, i.e., where the recombination events occurred. Parametric methods
do not indicate breakpoints.
To resolve this conflict and evaluate the results given by different methods,
a validation method was developed. This method indicates the breakpoint
positions within a parametric method, thereby eliminating the arbitrary
user-defined settings. This method was extended to examine all combinations and
permutations of members within a dataset of WNV sequences.


1.4 Dissertation Chapters
This dissertation has four chapters, which cover the recombination detection
methods used, the validation method developed, and ideas for future work
extending this research, as well as two appendices, which cover the biological
and evolutionary aspects of WNV.
Chapter 2 describes the parametric and nonparametric methods used to
determine if recombination occurs in WNV as well as the results obtained.
Chapter 3 describes the sequence segmentation method developed to rec-
oncile the conflicting results obtained from the parametric and nonparametric
methods, the data used to validate the correctness of the method and the results
obtained.
Chapter 4 describes future work.
Appendix A provides an overview of the molecular biology of WNV, in-
cluding its taxonomic classification, genomic structure, life cycle, pathogenesis,
transmission cycle and global spread, and epidemiology.
Appendix B provides an overview of the evolution and phylogenetic rela-
tionships between the strains of WNV.


2. Software
2.1 SplitsTree4 Software Package
Because recombination creates mosaic genomes in which segments have
different evolutionary histories, the data generate conflicting phylogenetic signals
and a unique tree topology cannot describe the evolutionary history of a set of
sequences. Instead, a network of sequences can provide a graphical
representation of the data. NeighborNet, a method which creates a network between
sequences, was used in the initial stage of the search for recombination in WNV.
The method combines the neighbor-joining and SplitsTree methods. This anal-
ysis was conducted using the SplitsTree4 program (version 4.6) with the Kimura
3-ST substitution model.
The NeighborNet method builds a circular split system. This system is
represented as a planar split graph created by iteratively selecting pairs of pairs
of sequences which share one node in common to cluster together. Splits occur
in a multiple sequence alignment when a column consists of non-constant data.
The taxa are then partitioned into multiple sets. A pair of splits, denoted
$S_1 = \{A, B\}$ and $S_2 = \{C, D\}$, is compatible if one, and only one, of the
four intersections $A \cap C$, $A \cap D$, $B \cap C$ and $B \cap D$ is empty;
otherwise $S_1$ and $S_2$ are incompatible.
A graph is generated which represents the collection of splits in the taxa. In
these graphs, compatible splits are represented by parallel lines, and conflicting
signals or incompatibilities appear as boxes. [210, 66]
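The compatibility test described above can be sketched in a few lines of Python. This is only an illustration, not SplitsTree4 code: the function name `splits_compatible`, the taxon labels, and the representation of a split as a pair of frozensets are all assumptions made for the example.

```python
def splits_compatible(split1, split2):
    """Return True if exactly one of the four pairwise intersections is empty.

    Each split is a pair of disjoint, non-empty taxon sets, e.g. (A, B).
    This follows the "one, and only one" criterion stated in the text.
    """
    (a, b), (c, d) = split1, split2
    intersections = [a & c, a & d, b & c, b & d]
    empty_count = sum(1 for part in intersections if not part)
    return empty_count == 1


# Hypothetical four-taxon example.
s1 = (frozenset({"t1", "t2"}), frozenset({"t3", "t4"}))
s2 = (frozenset({"t1"}), frozenset({"t2", "t3", "t4"}))
s3 = (frozenset({"t1", "t3"}), frozenset({"t2", "t4"}))

print(splits_compatible(s1, s2))  # True: these two splits fit on one tree
print(splits_compatible(s1, s3))  # False: conflicting signal, drawn as a box
```

In a split graph, a pair such as `s1` and `s3` is what produces the box-like structures mentioned above.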


2.2 RDP2 Software Package
To detect evidence of recombination events, pinpoint breakpoints
within specific sequences in the alignment, and identify parental
sequences, the alignment was analyzed with the RDP (a phylogenetic method),
Maximum $\chi^2$ (a substitution method), Sister-scanning (a phylogenetic method),
Geneconv (a substitution method), LARD (a phylogenetic method), and Bootscan
(a phylogenetic method) algorithms as implemented in the RDP2 software
package for Windows XP. [372, 303]
2.2.1 The RDP Method
The RDP method finds recombination using distances between each pair
of sequences as well as phylogenetic techniques to detect different topologies or
branching patterns. The distances are calculated as the percent identity between
the sequences along a sliding window. The method utilizes a three-step process
for every combination of three sequences in the dataset using a user-defined
sliding window size. The default window size is 10.
In step one, all non-informative sites are discarded. These include sites that
are identical in all three sequences, different in all three sequences, or unique
in two sequences but not present in the reference sequence. The reference
sequence is selected based upon its position relative to the selected triples within a
tree constructed using the unweighted pair group method with arithmetic mean
(UPGMA) algorithm. In step two, the window is moved along the manipulated
aligned sequences and the percent identity, i.e., the percentage of columns in
the alignment where the elements are identical, is calculated. In this analysis,
the recombination detection was set for sequences which share between 0% and
100% identity. Recombinant regions are identified as those regions where the
percentage identity is higher for sequences A and C or sequences B and C than
for sequences A and B. The probability that the three sequences share high
identities in the possible recombination regions by chance is calculated in step three
using the binomial distribution. A p-value is calculated from this probability by
multiplying it by the number of unique windows examined and then by the
number of triplets examined. In other words, the probability of sequence A or
B appearing more closely related to C by chance is calculated as
$$P = \frac{G L}{N} \sum_{m=M}^{N} \binom{N}{m} p^{m} (1-p)^{N-m}$$
where G is the number of possible combinations of three sequences, L is the
length of the subsequences, N is the length of the suspected recombinant
region, M is the number of nucleotides in common between sequences A and B,
sequences A and C or sequences B and C in the recombinant region, and p is the
proportion of nucleotides in common between sequences A and B, sequences A
and C or sequences B and C in the entire sequence. If this p-value is less than
the default highest P-value setting, the region is deemed a recombination area.
After a potential recombinant region is detected, the parental sequences are
determined by constructing a neighbor-joining tree from the full alignment and
calculating the greatest number of genome changes between internal nodes of
the tree (parsimony). The branch on the tree with the greatest proportion of
changes within the detected recombination area is assumed to be where the
recombination took place. The RDP method uses the UPGMA clustering method,
which clusters data based on the minimal number of mutations in the dataset
assuming a molecular clock in which all taxa evolve at a constant rate over time.
Table 2.1: Example of Recombination Events Found by RDP Method
        Position in Alignment
        3029  3032  3038  3053  3056  3062  3065  3071  3074
SeqA    G     C     T     C     C     T     A     C     T
SeqB    G     C     A     C     C     A     A     C     T
SeqC    A     T     T     T     G     T     G     T     C
As illustrated in Table 2.1, sequence C, which RDP indicates is recombinant,
has two positions matching sequence A, at positions 3038 and 3062. Sequence A
and sequence B have more identical bases than other potential parents in the
dataset. The RDP method suggests these are the parental sequences for this
particular sequence C.
While RDP uses only informative sites and therefore should be insensitive
to the different evolutionary rates of different regions of a sequence alignment,
UPGMA is very sensitive to unequal evolutionary rates and may not produce
the correct tree for sequences that are evolving at vastly different rates or have
unusual nucleotide compositions. The clustering method works only if the data
generate ultrametric distances, i.e., $d_{AC} \le \max(d_{BC}, d_{AB})$ for every
triple of sequences A, B and C.
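Whether a distance matrix satisfies the ultrametric condition can be checked directly before trusting a UPGMA tree. The sketch below is a minimal illustration; the function name `is_ultrametric`, the nested-dictionary layout, and the example distances are assumptions, not part of RDP2.

```python
from itertools import permutations


def is_ultrametric(dist, tol=1e-9):
    """Check d(A,C) <= max(d(A,B), d(B,C)) for every ordered triple of taxa.

    dist is a symmetric nested dict: dist[x][y] is the distance from x to y.
    """
    for a, b, c in permutations(dist, 3):
        if dist[a][c] > max(dist[a][b], dist[b][c]) + tol:
            return False
    return True


# Clock-like data: the two largest distances in the triple are equal.
clock_like = {
    "A": {"A": 0, "B": 2, "C": 6},
    "B": {"A": 2, "B": 0, "C": 6},
    "C": {"A": 6, "B": 6, "C": 0},
}
# Unequal rates: d(A,C) exceeds both other distances.
unequal_rates = {
    "A": {"A": 0, "B": 2, "C": 6},
    "B": {"A": 2, "B": 0, "C": 4},
    "C": {"A": 6, "B": 4, "C": 0},
}

print(is_ultrametric(clock_like))     # True
print(is_ultrametric(unequal_rates))  # False
```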
The RDP method may also have difficulty detecting parental and daughter
sequences if they are all nearest neighbors in the tree, if only one parental
sequence is in the alignment or if the position of daughter sequences in the tree
cannot be phylogenetically determined because of too few informative sites. In
other words, the RDP method may give faulty results if the sequences are too
similar. [127, 302, 303]
2.2.2 The Maximum y2 Method
The Maximum χ² (Max χ²) method finds recombination using statistical
analysis of triplet sequences within an alignment to test for a relationship
between two halves of a sliding window. The χ² statistic describes the
difference in proportion between the sites that have the same nucleotides
(non-variable) and different nucleotides (variable) on each side of the
partition. The method utilizes a three-step process for every combination of
three sequences in the dataset. In the first step, three sequences are
selected. All sites that are exactly the same in all three sequences
(monomorphic) are discarded and any gaps are stripped from the alignment,
leaving only polymorphic sites. In step two, two sequences are selected from
the altered triplet alignment for analysis. The two sequences are compared
along a sliding window containing 30 variable sites at a time, which is the
default setting. The window is split into two with each half compared to the
other. In step three, a 2 x 2 χ² value is calculated as
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
where observed is the frequency of the variable and non-variable sites on the
left side and the right side of the partition. The expected value is
calculated for each cell of the table as
$$\text{expected} = \frac{\text{row total} \times \text{column total}}{\text{total for table}}$$
The window is then moved to the next nucleotide and the χ² value is
recalculated and plotted along the length of the sequences.
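The window calculation can be sketched as follows. This is an illustrative
reading of the steps described here, not the RDP implementation: it assumes
the two input strings already contain only the polymorphic sites of the
triplet, and the function names are hypothetical.

```python
def chi_square_2x2(table):
    """Pearson chi-square for a 2 x 2 table given as [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty margins to avoid dividing by zero
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

def sliding_chi_square(seq_x, seq_y, window=30):
    """Chi-square at each window start for two equal-length site strings."""
    differs = [a != b for a, b in zip(seq_x, seq_y)]
    half = window // 2
    scores = []
    for start in range(len(differs) - window + 1):
        left = differs[start:start + half]
        right = differs[start + half:start + window]
        # Rows: left and right half; columns: variable and non-variable sites.
        table = [
            [sum(left), len(left) - sum(left)],
            [sum(right), len(right) - sum(right)],
        ]
        scores.append(chi_square_2x2(table))
    return scores

print(sliding_chi_square("AAAAAAAA", "AAAATTTT", window=8))
```

A peak in the returned scores marks a position where the proportion of
differences changes sharply between the two halves of the window.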
To test the significance of the resulting statistics, a test of 1000 permutations
is done and the p-value is calculated to equal the number of times the original
statistic is smaller than the statistic from permuted alignments divided by 1000
(the number of permutations). This value is compared to the default p-value.
[306, 373]
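A permutation p-value of this kind can be sketched as below. The shuffling
scheme (permuting site order) and the toy statistic are assumptions made for
illustration; the text specifies only the number of permutations and how the
count is converted to a p-value.

```python
import random

def permutation_p_value(observed, statistic, sites, n_permutations=1000, seed=0):
    """Fraction of permuted alignments whose statistic exceeds the observed one."""
    rng = random.Random(seed)
    shuffled = list(sites)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(shuffled) > observed:
            exceed += 1
    return exceed / n_permutations

def half_difference(sites):
    """Toy statistic (hypothetical): imbalance of differences between halves."""
    mid = len(sites) // 2
    return abs(sum(sites[:mid]) - sum(sites[mid:]))

# 1 marks a variable site, 0 a non-variable site; all differences sit on the right.
sites = [0] * 10 + [1] * 10
p = permutation_p_value(half_difference(sites), half_difference, sites)
print(p)
```

Because the observed arrangement is the most extreme one possible for this toy
statistic, no permutation can exceed it.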
This method is extended within the RDP software package to determine the
extent of recombinant regions. Placing a window around the calculated maximum
χ² values along the length of the sequences, RDP incrementally increases the
window size by one nucleotide on either side until the χ² values drop. The
window should then contain the entire recombinant region. This process is
repeated for each maximum χ² value. To find the parental and daughter
sequences, RDP uses the same method described in the RDP method above.
Table 2.2: Example of Recombination Events Found by Max χ² Method
      Position in Alignment
      839  998  1127 1145 1178 1247 1283 1586 1817 2093
SeqA  T    C    T    A    G    T    T    C    C    T
SeqB  T    T    T    A    G    C    T    C    C    T
SeqC  C    C    C    G    A    C    T    C    T    C
As illustrated in Table 2.2, sequence C, which the Max χ² method indicates is
recombinant, has one base at position 998 matching sequence A and one base at
position 1247 matching sequence B. Between positions 839 and 2093, sequences
A and B have the most consecutive, nonvariable bases. The Max χ² method
suggests these are the parental sequences.
Like the RDP method, Max χ² analyzes only polymorphic sites. However,
unlike the RDP method, Max χ² does not analyze the polymorphic sites further
to discard sites that are different in all three sequences. Therefore, these
sites are included in the analysis and may produce false positive results.
[303]
2.2.3 The Chimaera Method
The Chimaera or maximum mismatch χ² method is a modification of the maximum
χ² method described above. The Chimaera method finds recombination by
analyzing all triplet sequences within an alignment in which each sequence is
treated as a potential recombinant sequence of the other two sequences in the
triplet.
Like the Max χ² method, the Chimaera method utilizes a three-step process.
In the first step, three sequences are selected. The monomorphic sites and the
sites where the two potential parental sequences do not match the daughter
sequence are discarded. In the second step, the three sequences are compressed
into a string of 1s and 0s in which 1 represents a match between the daughter
and one parental sequence and 0 represents a match between the daughter and
the other parental sequence. In step three, the three sequences are compared
along a sliding window containing 30 variable sites at a time, which is the
default setting. The window is split into two with each half compared to the
other. A 2 x 2 χ² value is calculated to determine the difference in the
proportion of 1s and 0s. The χ² value is plotted along the length of the
sequences.
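The compression step can be sketched as follows. This is a minimal
illustration under the assumption that a site is kept only when the daughter
matches exactly one of the two parents; the function name and the example
sequences are hypothetical.

```python
def chimaera_encode(daughter, parent_a, parent_b):
    """Compress a triplet into 1s (daughter matches parent A) and 0s (parent B)."""
    bits = []
    for d, a, b in zip(daughter, parent_a, parent_b):
        if d == a and d != b:
            bits.append(1)
        elif d == b and d != a:
            bits.append(0)
        # Monomorphic sites and sites matching neither parent are discarded.
    return bits

print(chimaera_encode("ACGTAC", "ACGAAT", "ATGTGC"))
```

A recombinant daughter would tend to show a run of 1s followed by a run of 0s
(or the reverse), which is what the windowed χ² comparison is meant to detect.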
To test the significance of the resulting statistics, a test of 1000
permutations is done and the p-value is calculated to equal the number of
times the original statistic is smaller than the statistic from permuted
alignments divided by 1000 (the number of permutations). This value is
compared to the default p-value. As with the Maximum χ² method, the RDP
package extends this method to determine the extent of recombinant regions and
to identify the parental and daughter sequences, as described above.
[303, 373]
2.2.4 The Sister Scanning Method
The Sister Scanning method (SiScan) finds recombination by assessing phy-
logenetic and compositional signals between four sequences. The SiScan method
examines every triplet in the alignment along with a fourth sequence using a
sliding window of 100 nucleotides and a step size of 20. The fourth sequence
can be an outlier sequence in the alignment or generated by horizontal random-
ization i.e., randomizing the positions of the nucleotides in one (or two) of the
aligned sequences within the sliding window.
The SiScan method utilizes a five-step process. In the first step, the fourth
sequence is selected or generated. In the second step, each column in the
alignment is categorized into one of 15 different categories, as illustrated
in Table 2.3. If two or more taxa have the same nucleotide at a position, they
are represented with an equals (=) sign. If the nucleotide of one taxon
differs from the others, it is represented with a colon (:) sign. For example,
P1 means that all four sequences contain different nucleotides in that column,
e.g., A, C, G, T. P2 means the first two sequences contain the same nucleotide
in that column while the third and fourth sequences have different
nucleotides, e.g., A, A, G, T. P3 means the first and third sequences contain
the same nucleotide in that column while the second and fourth sequences have
different nucleotides, e.g., A, C, A, T.
Table 2.3: SiScan Classifications Used to Categorize Sequences
Pattern   Nucleotide Identity Among Sequences
P1        1 : 2 : 3 : 4
P2        1 = 2 : 3 : 4
P3        1 = 3 : 2 : 4
P4        1 = 4 : 2 : 3
P5        2 = 3 : 1 : 4
P6        2 = 4 : 1 : 3
P7        3 = 4 : 1 : 2
P8        1 = 2 : 3 = 4
P9        1 = 3 : 2 = 4
P10       1 = 4 : 2 = 3
P11       1 = 2 = 3 : 4
P12       1 = 2 = 4 : 3
P13       1 = 3 = 4 : 2
P14       2 = 3 = 4 : 1
P15       1 = 2 = 3 = 4
Third, within each window, the 15 different categories are summed to create
nine different sum patterns for each kind of informative site (patterns with
two pairs of identical sites) as well as quasi-informative sites (patterns
which differ from informative sites by only one nucleotide substitution), as
illustrated in Table 2.4. For example, S1 includes Pattern 8, which represents
informative sites. Pattern 2 and Pattern 7 can be obtained from Pattern 8 by
one substitution, so these patterns are summed; likewise for S2 and S3. S4
represents counts of informative sites only in the first and second sequences,
e.g., Patterns 2, 8, 11 and 12 are all patterns where sequences 1 and 2 are
identical, 1 = 2.
Table 2.4: SiScan Sum Patterns
Sum   Sum of Patterns
S1    Σ 2, 7, 8
S2    Σ 3, 6, 9
S3    Σ 4, 5, 10
S4    Σ 2, 8, 11, 12    (1 = 2)
S5    Σ 3, 9, 11, 13    (1 = 3)
S6    Σ 4, 10, 12, 13   (1 = 4)
S7    Σ 5, 10, 11, 14   (2 = 3)
S8    Σ 6, 9, 12, 14    (2 = 4)
S9    Σ 7, 8, 13, 14    (3 = 4)
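The column classification and the nine sums can be sketched as follows. This
is an illustrative sketch, not the SiScan code. The pattern definitions for
P9, P10, P12, P14 and P15 are inferred here from the sum patterns in Table 2.4
(for example, S2 groups P3, P6 and P9, so P9 is taken to be 1 = 3 : 2 = 4);
the names `PATTERNS`, `SUMS`, `classify_column` and `sum_patterns` are
hypothetical.

```python
from collections import Counter

# Pattern number keyed by the set of identical sequence pairs (numbered 1-4).
PATTERNS = {
    frozenset(): 1,
    frozenset({(1, 2)}): 2,
    frozenset({(1, 3)}): 3,
    frozenset({(1, 4)}): 4,
    frozenset({(2, 3)}): 5,
    frozenset({(2, 4)}): 6,
    frozenset({(3, 4)}): 7,
    frozenset({(1, 2), (3, 4)}): 8,
    frozenset({(1, 3), (2, 4)}): 9,
    frozenset({(1, 4), (2, 3)}): 10,
    frozenset({(1, 2), (1, 3), (2, 3)}): 11,
    frozenset({(1, 2), (1, 4), (2, 4)}): 12,
    frozenset({(1, 3), (1, 4), (3, 4)}): 13,
    frozenset({(2, 3), (2, 4), (3, 4)}): 14,
    frozenset({(i, j) for i in range(1, 5) for j in range(i + 1, 5)}): 15,
}

# Sum patterns as listed in Table 2.4.
SUMS = {
    "S1": (2, 7, 8), "S2": (3, 6, 9), "S3": (4, 5, 10),
    "S4": (2, 8, 11, 12), "S5": (3, 9, 11, 13), "S6": (4, 10, 12, 13),
    "S7": (5, 10, 11, 14), "S8": (6, 9, 12, 14), "S9": (7, 8, 13, 14),
}

def classify_column(column):
    """Return the pattern number (1-15) for one column of four nucleotides."""
    pairs = frozenset(
        (i + 1, j + 1)
        for i in range(4) for j in range(i + 1, 4)
        if column[i] == column[j]
    )
    return PATTERNS[pairs]

def sum_patterns(columns):
    """Count columns per pattern within a window and combine them into S1-S9."""
    counts = Counter(classify_column(c) for c in columns)
    return {name: sum(counts[p] for p in members) for name, members in SUMS.items()}

print(sum_patterns(["AAGT", "AACC", "ACGT"]))
```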
Fourth, the nucleotides in each column are randomized (Monte Carlo sam-
pling) to create four randomized sequence alignments. Each column in the
permuted alignment is also categorized and the sum patterns calculated. This
process is repeated 100 times to create a population of scores. Fifth, within
each window, a z-test is calculated to compare the original alignment and the
permuted alignment for the number of columns corresponding to the 15 cate-
gories. The z-test results are then plotted against the position in the alignment.
Abrupt changes in the z-score indicate a recombination area. In other words,
regardless of compositional similarity, the regions of recombination will have
opposing phylogenetic signals.
Table 2.5: Example of Recombination Events Found by SiScan Method
         Position in Alignment
         4892 4901 4904 4910 4913 4919 4958 4970 4979
SeqA     G    C    G    G    T    C    T    G    C
SeqB     G    C    G    G    T    C    T    G    C
SeqC     A    A    T    A    C    T    A    A    A
outlier  A    T    C    C    C    A    A    A    T
As illustrated in Table 2.5, the outlier sequence is chosen by the method as
the nearest outlier for sequences A, B and C. Sequence C, which the SiScan
method indicates is recombinant from the parental sequences A and B, shares
the same base with the outlier sequence at positions 4892, 4913, 4958, and
4970. Sequences A and B share the same base with each other between positions
4892 and 4979, which is the largest number of shared bases within a short
segment (< 100 nucleotides) in the entire sequence.
The SiScan method uses a fourth sequence which may help identify misleading
phylogenetic signals that occur because sequences are compositionally similar
within the window. However, the original paper does not indicate whether the
outlier sequence should be the nearest, i.e., the sequence that most closely
resembles the three sequences in the analysis; the most divergent, i.e., the
sequence that is the most different from the three sequences in the analysis;
or whether using a randomized sequence is the best option. Also, unlike the
RDP and Max χ² methods, SiScan analyzes all sites in the alignment rather than
discarding the monomorphic sites. Therefore, selecting the outlier sequence
becomes important and could significantly alter the results of the analysis.
[169, 303]
2.2.5 The Gene Conversion Method
The Gene Conversion (GENECONV) method finds recombination by ana-
lyzing alignment regions that are identical or have a high degree of similarity
and are unusually long. This method consists of a three-step process for every
combination of three sequences in the dataset. In step one, all monomorphic
sites are discarded. Gaps in the alignment are treated as a single polymorphism.
In step two, for each triplet of sequences in the dataset, a fragment score S
is calculated as
$$S = I - \frac{D \times N_t \times G}{N_d}$$
where I is the number of identical sites in a fragment, D is the number of
different sites in a fragment, N_t is the total number of polymorphic sites in
the original alignment, N_d is the total number of polymorphic sites in the
sequences, and G is the G-scale value. The user-defined G-scale can be set to
0, which does not allow mismatches within a fragment; 1, which detects older
recombination events; or 5, which detects more recent recombination events. A
fragment denotes a homologous pair of segments in an alignment and in this
analysis was set to a length of 1. The minimum fragment score was 2 (the
default value). In step three, p-values are calculated for the highest scoring
fragments according to the BLAST procedure, where the number of random
fragments with score greater than or equal to 2 is described by a Poisson
distribution. Therefore, the probability of finding at least c fragment(s)
with a score greater than or equal to 2 is calculated as
$$P = 1 - e^{-y} \sum_{i=0}^{c-1} \frac{y^i}{i!}$$
where y is the expected value E of the fragment score
$$E = K m n \, e^{-\lambda S},$$
where m and n are the sequence lengths, S is the fragment score, and K and λ
characterize the fragment scores. Permutation testing was not done in this
analysis.
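The Poisson calculation used in BLAST-style statistics can be sketched as
follows. This is a generic illustration, not the GENECONV code: it assumes the
standard form in which the expected count is K·m·n·exp(−λS) and the
probability of at least c fragments is one minus the Poisson cumulative
probability of fewer than c. The numerical values of K and λ below are
hypothetical placeholders.

```python
import math

def expected_fragments(K, lam, m, n, score):
    """Expected number of chance fragments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)

def prob_at_least(c, y):
    """P(X >= c) for X ~ Poisson(y): 1 - sum_{i=0}^{c-1} e^{-y} y^i / i!."""
    return 1.0 - sum(math.exp(-y) * y ** i / math.factorial(i) for i in range(c))

# Hypothetical parameters chosen only to exercise the functions.
y = expected_fragments(K=0.1, lam=1.0, m=100, n=100, score=8)
print(y, prob_at_least(1, y))
```

For c = 1 the expression reduces to 1 − e^(−y), the familiar form for the
probability of observing at least one fragment by chance.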
Table 2.6: Example of Recombination Events Found by GENECONV Method
      Position in Alignment
      2298 2305 2310 2325 2329 2331 2340 2355 2370
SeqA  C    C    A    A    C    T    A    C    T
SeqB  C    C    -    C    A    A    G    T    C
SeqC  T    T    A    A    C    T    A    C    T
As illustrated in Table 2.6, sequence C, which the GENECONV method indicates
is recombinant, has the same sequence as sequence A from the beginning of the
sequence to position 2310, where it changes to match sequence B.
Since GENECONV looks for regions that are identical or have a high degree
of similarity and are unusually long, a dataset that contains a single, highly di-
verged sequence in the alignment affects the analysis of sequences by increasing
the number of polymorphic sites and reducing the ability of the algorithm to
detect genuine recombinant regions. Alternatively, if the dataset is not very
divergent, GENECONV may indicate a long string of conserved sites as recom-
binant. [351, 303]
2.2.6 The Bootscan Method
The Bootscan method tests for recombination by analyzing the similarity
among sequences using a sliding window on every combination of three sequences
in the dataset. By moving a window of length 100 and step size of 20, the method
calculates a bootstrapped neighbor-joining tree for each window using 100 repli-
cates. Distances, using the Jukes and Cantor substitution model of 1969, and
trees are calculated using PHYLIP methods of DNADIST and NEIGHBOR.
Potentially recombinant sequences should group with two or more different
reference sequences. Non-recombinant sequences should group with only one
reference sequence along the full length of the sequence. The default
bootstrap threshold was set at 70%. As with the RDP method, the p-value for
the identified region is calculated using the same binomial distribution
described above. Also, the parental sequences are determined by constructing a
neighbor-joining tree from the full alignment and calculating the greatest
number of genome changes between internal nodes of the tree (parsimony), as
described above.
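The distance step of the windowed analysis can be sketched as follows. This
covers only the Jukes and Cantor correction applied window by window; it does
not build bootstrapped neighbor-joining trees and does not call PHYLIP. The
function names are hypothetical, and the window and step defaults follow the
values given in the text.

```python
import math

def jukes_cantor_distance(seq_x, seq_y):
    """Jukes-Cantor (1969) distance: d = -3/4 * ln(1 - 4p/3), p = mismatch proportion."""
    pairs = [(a, b) for a, b in zip(seq_x, seq_y) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def window_distances(seq_x, seq_y, window=100, step=20):
    """Jukes-Cantor distance in each sliding window along two aligned sequences."""
    return [
        jukes_cantor_distance(seq_x[s:s + window], seq_y[s:s + window])
        for s in range(0, len(seq_x) - window + 1, step)
    ]

print(jukes_cantor_distance("AAAAAAAAAA", "AAAAAAAAAC"))
```

A daughter sequence whose nearest reference changes from one window to the
next is the kind of signal the bootstrapped trees are meant to confirm.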
As illustrated in Table 2.7, sequence C shares a high concentration of
identical bases with sequence B between positions 744 and 921.
Table 2.7: Example of Recombination Events Found by Bootscan Method
      Position in Alignment
      744  754  783  792  807  828  837  842  852
SeqA  G    T    G    A    C    T    T    C    A
SeqB  A    C    A    T    A    A    C    T    G
SeqC  A    C    A    T    A    T    C    T    G

      861  870  879  885  897  903  918  921
SeqA  G    T    C    G    A    C    A    T
SeqB  A    C    A    A    G    T    G    C
SeqC  A    C    A    A    G    T    G    C
The two main problems with Bootscanning are the window size and the bootstrap
threshold level. If the window size is too large, small areas of high
variability may be missed. Likewise, if the window size is too small, large
areas of low variability will be missed. In each of these cases, the
appropriate bootstrap threshold level is important in determining significance
and, if not set correctly, may give false results. [303]
2.2.7 The LARD Method
The Likelihood Analysis of Recombination in DNA (LARD) method uses
a maximum likelihood ratio test to test different parts of aligned sequences for
changes in the phylogenetic signal, which indicate recombination breakpoints
in an alignment of three sequences. The method utilizes a three-step process.
First, the alignment is partitioned into two parts and maximum likelihood trees
are constructed for each section. Second, the different branch lengths of the
trees are statistically compared to a tree constructed from the full alignment
using a likelihood ratio test. A Monte Carlo simulation is done to determine if
the likelihood scores are greater than would be expected due to chance. Third,
the partition is moved along the sequences and each new partition is tested.
A breakpoint is detected when the partition separates trees with the greatest
difference in branch lengths.
(a) Virus 4a clusters with Virus 4b (b) Virus 4a clusters with Virus 2
Figure 2.1: Example of Recombination Events Found by LARD Method
As illustrated in Figure 2.1, evidence for recombination is shown by the move
of the Virus 4a strain out of the cluster with Virus 4b and into a separate clade;
Virus 4b clusters with Virus 2. The application indicates from which section of
the sequence each tree is generated.
LARD is able to accommodate rate heterogeneity among sites; however,
it is unable to detect recombination when a portion of one of the sequences is
evolving at a different rate relative to the same region in the two other sequences.
However, if a breakpoint is identified, then the trees can be constructed and
statistically analyzed. [200]
2.2.8 The LDHat Method
Since RNA viruses are prone to genetic mutations because of their error-prone
polymerase, an important issue in detecting recombination in viruses is
separating true recombination events from recurrent mutations which may
resemble recombination events. Using informative sites, a finite-sites
mutation model using the two-allele model with reversible, symmetric mutation,
and coalescent theory can estimate the population recombination rate. The
LDHat method estimates the population recombination rate, ρ = 4N_e r, where
N_e is the effective population size and r is the per gene per generation rate
of recombination (crossing over or gene conversion).
The LDHat method utilizes a four-step process. In the first step, the method
estimates the population mutation rate per site, θ = 4N_e μ, where μ is the
rate of mutation per site per generation and is constant across all sites.
This value is obtained from the approximate finite-sites version of the
Watterson estimate, where S is the number of segregating sites, L is the
length of the sequence and n is the number of sampled sequences.
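For orientation, the classical (infinite-sites) Watterson estimator per site
can be sketched as below. This is the standard textbook form, θ_W = S / (L ·
a_n) with a_n = Σ_{k=1}^{n−1} 1/k; it is not the finite-sites approximation
that the method applies, which is not reproduced here. The function name and
the example numbers are hypothetical.

```python
def watterson_theta_per_site(segregating_sites, length, n_sequences):
    """Classical Watterson estimate per site: S / (L * sum_{k=1}^{n-1} 1/k)."""
    a_n = sum(1.0 / k for k in range(1, n_sequences))
    return segregating_sites / (a_n * length)

# Hypothetical example: 15 segregating sites in 100 sites from 3 sequences.
print(watterson_theta_per_site(15, 100, 3))
```

The harmonic term a_n grows slowly with sample size, so adding sequences
raises the expected number of segregating sites only logarithmically.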
In the second step, the method compares each pair of segregating sites and
classifies them into sets. In the third step, the likelihood of each
equivalent set is estimated under the estimated value of θ, the reversible,
symmetric mutation model and a range of recombination rates, usually from 0 to
100, and stored in a lookup table. In the fourth step, the likelihoods of all
pairwise comparisons are combined to provide an estimate of the population
recombination rate for the entire sequence. The composite likelihood (CLE) is
calculated as
$$l(4N_e r) = \sum_{i,j} l(X_{ij} \mid 4N_e r_{ij}),$$
where l(X_ij | 4N_e r_ij) is the log likelihood of the data for segregating
sites i and j given
$$r_{ij} = 2ct\,(1 - e^{-d_{ij}/t}),$$
where c is the per base rate of initiation of gene conversion, d_ij is the
physical distance separating sites i and j and t is the average gene
conversion tract length.
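The gene conversion rate between two sites and the summation over site pairs
can be sketched as follows. This is an illustrative sketch only: the pairwise
log likelihood is passed in as a function because the real method reads it
from a precomputed lookup table, and the names `gene_conversion_rate`,
`composite_log_likelihood` and `pair_log_likelihood` are hypothetical.

```python
import math
from itertools import combinations

def gene_conversion_rate(c, t, d_ij):
    """r_ij = 2ct(1 - exp(-d_ij / t)) for sites separated by d_ij bases."""
    return 2.0 * c * t * (1.0 - math.exp(-d_ij / t))

def composite_log_likelihood(site_positions, pair_log_likelihood, n_e, c, t):
    """Sum the pairwise log likelihoods over all pairs of segregating sites."""
    total = 0.0
    for i, j in combinations(range(len(site_positions)), 2):
        d_ij = abs(site_positions[j] - site_positions[i])
        rho_ij = 4.0 * n_e * gene_conversion_rate(c, t, d_ij)
        total += pair_log_likelihood(i, j, rho_ij)
    return total

# Nearby sites see a rate close to 2 * c * d; distant sites saturate at 2 * c * t.
print(gene_conversion_rate(0.001, 1000.0, 10.0))
print(gene_conversion_rate(0.001, 1000.0, 1e9))
```

The saturation at 2ct reflects that sites much farther apart than the tract
length are separated by almost every conversion event that covers one of them.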
The method also includes a likelihood permutation test to distinguish actual
recombination from recurrent mutations. Recurrent mutations can produce
genetic patterns similar to recombination. In this permutation test, the
maximum composite likelihood for the data set is calculated. Then the
segregating sites are permuted and the 4N_e r and maximum composite likelihood
are calculated. The proportion of permuted data sets with a composite
likelihood equal to or greater than that of the original data is calculated.
Evidence for recombination occurs if this proportion is lower than a chosen
significance level, .05. [311]
2.3 Data
The dataset used in this research comprised a representative sample of
sequences from all lineages. The protein-coding regions of twenty-four WNV
isolates were parsed from whole genomes collected from GenBank and aligned
using the Clustal W method. [88, 416]
The GenBank Accession Numbers, study names and names of the sequences
used in the analysis are listed in Table 2.8.
Table 2.8: Sequences Used
Study Name   GenBank Accession Numbers   Strain
Madagascar   DQ176636                    Madagascar-AnMg798
NY99         AF196835                    NY99-flamingo382-99
Hungary03    DQ118127                    goose-Hungary/03
Hungary04    DQ116961                    goshawk-Hungary/04
Rabensburg   AY765264                    97-103
Ethiopia     AY603654                    EthAn4766
Mor96        AY701412                    Morocco 1996
Mor04        AY701413                    04.05
Mexico       AY660002                    Mex03
China        AY490240                    Chin-01
Tunisia      AY268133                    PaH001
France       AY268132                    PaAn001
Kenya        AY262283                    KN3829
Italy        AF404757                    Italy 1998-equine
Israel98     AF481864                    IS-98 STD
Uganda       M12294                      RNA Uganda
Egypt        AF260968                    Eg101
Kunjin       D00246                      KUNCG
Senegal      DQ318019                    ArD76104
CAR          DQ318020                    ArB3573/82
Israel       AY688948                    Sarafend
Russia       AY277251                    LEIV-Krnd88-190
India        DQ256376                    804994
Romania      AF260969                    RO97-50
2.3.1 Results using Nonparametric Methods
An initial phylogenetic tree assuming no recombination was built using the
UPGMA method for generating trees as implemented in the RDP method and
confirms previous phylogenetic analysis. The WNV strains from North America,
Europe, Israel, Africa, China and Mexico cluster into a single lineage,
Lineage 1. This lineage also contains the Kunjin viruses found in Australia, a
related subtype. Lineage 2 strains have been isolated in sub-Saharan Africa
and Madagascar as well as Hungary and Israel. The Rabensburg virus strains
constitute Lineage 3, the Russian isolate LEIV-Krnd88-190 constitutes Lineage
4, and the Indian strains constitute Lineage 5, as illustrated in Figure 2.4.
[43, 269, 266, 27, 28, 56]
Additional analysis using the NeighborNet method as implemented in the
SplitsTree4 program (version 4.6) using the Kimura 3-ST substitution model to
estimate distances was done for Lineage 1 and Lineage 2, 3, 4 and 5 strains to
test the hypothesis that WNV is clonal. This method produced a graph which
provides evidence for networked evolution among Lineage 1 strains of WNV.
As illustrated in Figure 2.2, these sequences are interconnected via multiple
pathways, suggestive of multiple recombination events. The graph reveals a
conflicting relationship between the strain from Ethiopia, Kunjin and the strains
from China and Egypt as well as between these viral strains and the other
Lineage 1 strains.
This method also produced a graph which provides evidence for networked
evolution among Lineage 2, 3, 4 and 5 strains of WNV. The splits graph depicted
in Figure 2.3 reveals that a conflicting relationship exists between the strains
from India, Russia, Rabensburg and Madagascar. The graph also indicates a
conflicting relationship between strains from Israel, Hungary and the Central
African Republic (CAR).
Figure 2.2: NeighborNet Analysis of Lineage 1 West Nile Virus Strains

Figure 2.3: NeighborNet Analysis of Lineage 2, 3, 4, and 5 West Nile Virus
Strains

Figure 2.4: Phylogenetic Analysis of West Nile Virus Strains

The NeighborNet method produces a useful way to visualize the non-treelike
relationships between sequences. However, this method does not indicate
individual recombination events nor does it provide any statistical analysis
of those events. To detect evidence of recombination events and to pinpoint
breakpoints within specific sequences in the alignment as well as parental
sequences, analysis was done with the RDP, Maximum χ², Sister Scanning,
GENECONV, LARD, and Bootscan algorithms as implemented in the software
package RDP2 for Windows XP. The default p-value was .0001 with correction
turned off; the default window size was 10. This analysis provides evidence of
unique potential recombination signals in WNV, which are listed in Table 2.9.
Breakpoints in the alignment represent the bounds of the recombination
signal. The major and minor parents refer to the parental sequences which
contribute the larger and smaller fractions, respectively, to the daughter
sequence. In the methods column, the numbers represent a collection of methods
which indicate the presence of recombination. Specifically, number 1 includes
RDP, Bootscan and Maximum χ²; number 2 includes RDP, Bootscan and SiScan;
number 3 includes RDP, Maximum χ² and Chimaera; number 4 includes RDP, Maximum
χ² and SiScan; number 5 includes GENECONV and Maximum χ²; and number 6
includes RDP, SiScan and LARD. Some sequences showed phylogenetic evidence of
recombination. These sequences are indicated by P. An asterisk by the sequence
name indicates ambiguity as to which sequence is the actual daughter sequence.
Table 2.9: Recombination Events in West Nile Virus Sequences
Sequence     Breakpoints in Alignment   Major Parent   Minor Parent   Methods
Senegal*     154-985                    Madagascar     Italy          1 P
Senegal*     1920-2379                  Russia         Rabensburg     2 P
Uganda*      1920-2379                  Russia         Rabensburg     2 P
Uganda*      154-985                    Madagascar     Italy          1 P
Hungary04*   782-985                    Madagascar     Italy          1 P
Hungary04*   1975-2314                  Russia         Rabensburg     2 P
CAR*         782-997                    Madagascar     Italy          1 P
CAR*         1940-2332                  Russia         Rabensburg     2 P
Israel*      782-927                    Madagascar     Italy          1 P
Israel*      1882-2391                  Russia         Rabensburg     2 P
Madagascar*  1-601                      Rabensburg     Hungary04      3
Madagascar*  2199-2458                  Russia         Rabensburg     2 P
Mor04*       6161-6913                  Madagascar     CAR            4
France*      6161-6913                  Madagascar     CAR            4
Mor96*       6161-6931                  Madagascar     CAR            4
Kenya*       6161-6913                  Madagascar     CAR            4
China*       6161-6948                  Madagascar     CAR            4
China*       1452-2232                  Ethiopia       Israel98       4
Italy*       6159-6913                  CAR            Madagascar     4
Romania*     6161-6913                  CAR            Madagascar     4
Israel98*    6121-6913                  CAR            Madagascar     4
NY99*        5603-6913                  CAR            Madagascar     4
Mexico*      6065-6913                  CAR            Madagascar     4
Hungary03*   6160-6913                  CAR            Madagascar     4
Tunisia*     6161-6913                  CAR            Madagascar     4
Egypt*       6161-6948                  CAR            Madagascar     4
Egypt*       1452-2380                  Ethiopia       Israel98       4
Ethiopia*    6161-6948                  CAR            Madagascar     4
KUNCG*       6122-6948                  CAR            Madagascar     4
India*       6239-6297                  Russia         Egypt          5 P
Russia*      1137-1778                  Israel         India          6
The analysis of Lineage 2 sequences shows phylogenetic evidence for
recombination, as evidenced by the move of the Madagascar strain out of the
Lineage 2 clade into a separate clade and the move of the Rabensburg strain
into the Lineage 2 clade, as illustrated in Figures 2.5 and 2.6. The Indian
sequence shows phylogenetic evidence for recombination as well. This strain is
clustered as a separate clade, assuming no recombination. However, when the
segment between positions 6239 and 6297 is analyzed, the strain moves into the
Lineage 1 clade and appears closely related to the Egypt strain, its minor
parent, as illustrated in Figure 2.7.
Figure 2.5: Phylogenetic Analysis of West Nile Virus Strain from Madagascar

Figure 2.6: Phylogenetic Analysis of West Nile Virus Strain from Rabensburg

Figure 2.7: Phylogenetic Analysis of West Nile Virus Strain from India

2.4 Results using Parametric Methods
The LDHat method was performed to determine recombination rates using the
coalescent-based method on three sets of WNV strains: the Lineage 2 strains,
the Lineage 1, 3, 4 and 5 strains, and all strains. The Lineage 2 dataset
consisted of six sequences; the Lineage 1, 3, 4 and 5 dataset consisted of 19
sequences. The gene conversion model for recombination with a fixed average
tract length of 1000 nucleotides and a beginning ρ of 30 was used. The
permutation level was set to 1,000,000. These are all default settings.
As illustrated in Table 2.10, the θ and ρ parameters are given per site. The
value of S is the number of segregating sites. The upper and lower bounds of
the confidence interval are given for the 95th percentile (P < .05). The
evidence for recombination in all strains is lacking. The Lineage 2 sequences
show a high level of diversity (θ = .08584) but a low average recombination
rate (Average ρ = .01435). Likewise, Lineage 1, 3, 4 and 5 sequences show a
comparable level of diversity (θ = .07581) and a correspondingly low
recombination rate (Average ρ = .15970). When all sequences are analyzed, the
sequences show a comparable level of diversity (θ = .06603) as well as a low
recombination rate (Average ρ = .08921).
Table 2.10: Recombination Rates in West Nile Virus Sequences
Lineage   S      θ        Avg. ρ   Upper bound   Lower bound
2         2013   .08584   .01435   .0007614      .002824
1,3,4,5   2685   .07581   .15970   .12166        .212800
all       2533   .06603   .08921   .07692        .104603

3. Sequence Segmentation Method
Many algorithms have been developed to detect recombination. These algorithms
are based on two basic concepts: 1) analysis to describe sequence similarity
based on the distance between sequences, with results presented as a summary
of statistics, graphically as changes in phylogenetic trees when compared to
trees generated assuming no recombination, or as some combination of these
(nonparametric methods); and 2) model-based analysis to describe the
underlying evolutionary history, such as finding the rate of recombination
within a coalescent theory framework (parametric method).
In this research, the nonparametric methods and parametric methods generated
conflicting results. The nonparametric methods indicate WNV is recombinant,
with some methods showing phylogenetic evidence, while the parametric method,
based on coalescent theory, indicates WNV is not recombinant.
3.1 The Issues
The first issue in analyzing the results from the methods implemented within
the RDP2 software package is the appropriateness of the default settings, such
as the user-defined window size in the nonparametric methods. The optimal
window size will depend upon the length and relatedness of the sequences being
analyzed as well as the method being used. Another factor in selecting the
appropriate window size is the size of the potential recombination regions,
which is not known a priori for WNV. Selecting the proper window size is
problematic since too large a window size may lessen the sensitivity of the
analysis and increase signal:noise ratios but will detect large recombination
regions, while too small a window size may increase the sensitivity but will
also increase the possibility of false positive results.
The coalescent method has a similar problem in setting a default average
tract length of gene conversion. This value is arbitrary and if the length is
too short, the composite likelihood will increase. When the composite
likelihood estimates are combined, the population recombination rate will be
biased upward.
To address these issues and to answer the question of whether some WNV
sequences are mosaic, a parametric algorithm was developed to analyze a
composite sequence based on the major parent, minor parent and the daughter
sequences. This method takes as input three sequences, which are the same
length and converts them into a composite sequence. This composite sequence
is then segmented using the Jensen-Shannon divergence method. The result-
ing segments are simplified to yield an overall segmentation description of the
genome.
3.2 The Conversion Method
The major and minor parental sequences are denoted Mjp and Mnp, respectively,
and the daughter sequence is denoted d. For sequence length L, the three
sequences are represented as vectors of nucleotides:
$$\mathit{Mjp} = \{\mathit{Mjp}_1, \mathit{Mjp}_2, \ldots, \mathit{Mjp}_L\},$$
$$\mathit{Mnp} = \{\mathit{Mnp}_1, \mathit{Mnp}_2, \ldots, \mathit{Mnp}_L\},$$
$$d = \{d_1, d_2, \ldots, d_L\}.$$
A composite sequence is formed from the sequence triplet and each position is
evaluated and labeled. Informative sites are those where the daughter
nucleotide matches exactly one of the parental nucleotides. If the daughter
sequence matches the major parent, the position is labeled A in the composite
sequence. If the daughter sequence matches the minor parent, the position is
labeled B. Uninformative sites, in which all three sequences are the same or
all three sequences are different, are labeled U. If the two parental
sequences match and are different from the daughter sequence, the position is
labeled M. If the alignment process produces an insert, the position is
labeled I.
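The labeling rules can be sketched as follows. This is a minimal illustration
of the rules as stated, not the thesis code; it assumes that an insert appears
as a gap character "-" in any of the three aligned sequences, and the function
name is hypothetical.

```python
def composite_sequence(major, minor, daughter, gap="-"):
    """Label each aligned position A, B, U, M or I from a parent/daughter triplet."""
    if not (len(major) == len(minor) == len(daughter)):
        raise ValueError("the three sequences must have the same length")
    labels = []
    for mj, mn, d in zip(major, minor, daughter):
        if gap in (mj, mn, d):
            labels.append("I")   # insert produced by the alignment (assumed "-")
        elif mj == mn:
            # Parents agree: all three the same (U) or daughter differs (M).
            labels.append("U" if d == mj else "M")
        elif d == mj:
            labels.append("A")   # daughter matches the major parent only
        elif d == mn:
            labels.append("B")   # daughter matches the minor parent only
        else:
            labels.append("U")   # all three nucleotides differ
    return "".join(labels)

print(composite_sequence("ACGTA-", "ATGCAA", "ACGCTA"))
```

The resulting string over the alphabet {A, B, U, M, I} is the input to the
segmentation step.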
3.3 The Jensen-Shannon Divergence Measure
The Jensen-Shannon divergence measure is used to quantify the difference
between the probability distributions of two subsequences. This measure can be
defined in the following way. Let
\[ S = \{a_1, a_2, \ldots, a_N\} \]
be a sequence composed of N symbols from an alphabet
\[ A = \{A_1, A_2, \ldots, A_k\}. \]
Next, take a position $n$ ($1 \le n < N$) in the middle of S such that S is divided
into two subsequences
\[ S^{(1)} = \{a_1, a_2, \ldots, a_n\}, \]
\[ S^{(2)} = \{a_{n+1}, a_{n+2}, \ldots, a_N\}. \]
The frequency vectors $F^{(j)}$ are the relative nucleotide frequencies for the
subsequences $S^{(j)}$, which satisfy the constraints $\sum_{i=1}^{k} f_i^{(j)} = 1$
and $0 \le f_i^{(j)} \le 1$, for $i = 1, \ldots, k$ and $j = 1, \ldots, m$.
For example,
\[ F^{(1)} = \{f_1^{(1)}, \ldots, f_k^{(1)}\}, \qquad F^{(2)} = \{f_1^{(2)}, \ldots, f_k^{(2)}\} \]
are the relative nucleotide frequencies for $S^{(1)}$ and $S^{(2)}$, respectively. In other
words, $f_1^{(1)}$ is the relative proportion of symbol $A_1$ in $S^{(1)}$ and $f_1^{(2)}$ is the
relative proportion of symbol $A_1$ in $S^{(2)}$.
The distance between m probability distributions is quantified by the
Jensen-Shannon divergence, $D_{JS}$, as
\[ D_{JS}(F^{(1)}, F^{(2)}, \ldots, F^{(m)}) = H\Big[\sum_{j=1}^{m} \pi_j F^{(j)}\Big] - \sum_{j=1}^{m} \pi_j H[F^{(j)}], \]
where $\pi_j$ are the weights for the distributions $F^{(j)}$, which satisfy the constraints
$\sum_{j=1}^{m} \pi_j = 1$ and $0 \le \pi_j \le 1$, and $H[F]$ is Shannon's entropy of the distribution
$F$, defined as
\[ H[F] = -\sum_{i=1}^{k} f_i \log f_i, \]
for $i = 1, \ldots, k$. This method was first described in [40].
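The definition above translates almost directly into code. The following Python sketch computes the divergence for m frequency vectors with given weights. The function names, the use of the natural logarithm, and the convention that a zero frequency contributes nothing to the entropy are assumptions made for this illustration; the source does not fix the base of the logarithm at this point.

```python
import math

def entropy(freqs):
    """Shannon entropy H[F] = -sum_i f_i log f_i (natural log; 0 log 0 taken as 0)."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

def js_divergence(distributions, weights):
    """Entropy of the weighted mixture minus the weighted mean of the entropies."""
    k = len(distributions[0])
    mixture = [
        sum(w * dist[i] for w, dist in zip(weights, distributions))
        for i in range(k)
    ]
    mean_entropy = sum(w * entropy(dist) for w, dist in zip(weights, distributions))
    return entropy(mixture) - mean_entropy
```

With two completely different distributions and equal weights, js_divergence([[1, 0], [0, 1]], [0.5, 0.5]) equals ln 2; with two identical distributions the divergence is zero.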
3.3.1 Halting Criteria
Using the Jensen-Shannon divergence as defined, the sequence is analyzed
at every position n, where n = 1, ..., N, and the entropy of the whole, left,
and right sequences is computed. The position where the divergence reaches its
maximum is accepted as a cutoff point. If the segmentation algorithm is run
recursively and left unchecked, the entire sequence will be parsed into strings
of length one, which would be meaningless. A mechanism for stopping the
segmentation process needs to be applied. Two candidate halting criteria
are:
1. a hypothesis testing framework [40]
2. a model selection framework based on the Bayesian information criterion (BIC)
[280]
In the first case, the maximum value of $D_{JS}$ for the two subsequences,
$D_{max}$, is computed. The statistical significance $s(x)$ is determined by estimating
the probability of getting $D_{max}$ or less from a random sequence and is defined
as
\[ s(x) = \mathrm{Prob}\{D_{max} \le x\}. \]
The segmentation process continues until $s(x)$ is less than some preset
significance level, which is arbitrarily applied.
In the second case, BIC is used to determine the overall performance of a
model on the given data:
\[ BIC = -2\ln(L) + K\ln(N), \]
where L is the maximum likelihood of the model, N is the number of nucleotides,
and K is the number of parameters in the model. The segmentation is acceptable
if BIC is reduced after segmentation, i.e., $\Delta BIC < 0$. The numbers of parameters
before and after segmentation are $K_1$ and $K_2$, respectively. Therefore, a cut
is accepted, and segmentation continues, only while
\[ 2N D_{JS} > 2\ln(N), \]
where the right side of the inequality is $(K_2 - K_1)\ln(N)$. This is the lower bound
of the significance level and is not arbitrarily set. An upper bound is not possible,
but a calculation of segmentation strength can be done and is defined as
\[ s = \frac{2N D_{JS} - 2\ln(N)}{2\ln(N)}. \]
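The recursive procedure with the BIC-based halting rule can be sketched as below. This is an illustrative Python outline, not the code used for the experiments: the function names are hypothetical, the weights of the two subsequences are assumed to be their relative lengths, the natural logarithm is assumed so that the divergence is on the same scale as ln(N), and every cut position is scanned by brute force. A cut is accepted only when the segmentation strength is positive, which is the same condition as 2N D_JS > (K2 - K1) ln(N).

```python
import math
from collections import Counter

def entropy(seq):
    """Shannon entropy (natural log) of the symbol frequencies in seq."""
    n = len(seq)
    return -sum((c / n) * math.log(c / n) for c in Counter(seq).values())

def js_divergence(left, right):
    """Two-part Jensen-Shannon divergence, weights assumed proportional to length."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return entropy(left + right) - (n1 / n) * entropy(left) - (n2 / n) * entropy(right)

def best_cut(seq):
    """Return (cut, D_max) over all cuts that leave both sides non-empty."""
    candidates = ((n, js_divergence(seq[:n], seq[n:])) for n in range(1, len(seq)))
    return max(candidates, key=lambda pair: pair[1])

def strength(n_symbols, d_max, k1=1, k2=3):
    """Segmentation strength: relative excess of 2*N*D_JS over (K2 - K1) ln(N)."""
    penalty = (k2 - k1) * math.log(n_symbols)
    return (2 * n_symbols * d_max - penalty) / penalty

def segment(seq, start=0, k1=1, k2=3):
    """Recursively split seq; return half-open (start, end) segment boundaries."""
    if len(seq) < 2:
        return [(start, start + len(seq))]
    cut, d_max = best_cut(seq)
    if strength(len(seq), d_max, k1, k2) <= 0:
        return [(start, start + len(seq))]  # BIC not reduced: stop here
    return (segment(seq[:cut], start, k1, k2)
            + segment(seq[cut:], start + cut, k1, k2))
```

On a toy binary composite such as "A" * 50 + "B" * 50, the sketch finds the single cut at position 50 and then stops, because neither half can be split with positive strength. The default values k1=1 and k2=3 correspond to the binary alphabet case.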
3.3.2 Segment Simplification
Upon completion of the algorithm, the composite sequence will be divided
into homogeneous subsequences or segments. Each segment may be heterogeneous
to its neighbor and is uniquely labeled. To determine if each segment
is also heterogeneous to its non-neighbors, each segment is recursively concatenated
with its non-neighbors, including any previously tested segments with the
same label. These concatenated segments are resegmented using the Jensen-Shannon
divergence method described above. If a concatenated segment cannot
be resegmented, then the original segments, $S_x$ and $S_y$, are given the same
label. This results in a simplified description of the mosaic sequence [24].
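One way to read the simplification step is sketched below in Python. It is only an interpretation under stated assumptions: the function name simplify_labels is hypothetical, the test for whether a concatenated string can be resegmented is passed in as a predicate can_split (in practice this would be the Jensen-Shannon test with the chosen halting criterion), and a segment is never pooled with the label of its immediate left neighbor, since neighbors are already known to be heterogeneous.

```python
def simplify_labels(segments, can_split):
    """Give two non-neighboring segments the same label when their
    concatenation cannot be split any further."""
    labels = []
    pooled = {}  # label -> concatenation of all segments carrying that label
    for i, seg in enumerate(segments):
        assigned = None
        for label, text in pooled.items():
            if i > 0 and labels[i - 1] == label:
                continue  # skip the immediate neighbor's label
            if not can_split(text + seg):
                assigned = label
                break
        if assigned is None:
            assigned = len(pooled)  # open a new label
            pooled[assigned] = seg
        else:
            pooled[assigned] += seg
        labels.append(assigned)
    return labels
```

With a toy predicate that treats any string containing both symbols as splittable, the segments "AAAA", "BBBB", "AAA", "BB" receive the labels 0, 1, 0, 1, i.e., a mosaic of two alternating sources.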
3.4 Validation Experiment
To show the validity of the method to detect viral recombination, three
daughter sequences were manually created from the protein-coding region of two
parental genome sequences as described in [58]. The parental genomes came from
Lineage 1 (Study Name: NY99-385, Accession No.: DQ211652, Strain: NY99-
385-99) and Lineage 2 (Study Name: W956, Accession No.: AY532665, Strain:
B956). Three daughter sequences (RECOMB1, RECOMB2, RECOMB3)
were created by swapping structural protein genes between the two parental
sequences. After alignment of these sequences using ClustalW, analysis was
conducted on each triplet, i.e., each daughter sequence with its two parental
sequences. A composite sequence was formed from the sequence triplet. While
each position within the sequence was accounted for, only the informative sites
were analyzed using the binary alphabet, A and B. In the model selection
framework, the relevant parameters were $K_1 = 1$, since only 1 of the 2 letters was
independent, and $K_2 = 3$, from the one free parameter from each subsequence
and the cutoff point n.
The results of this experiment indicated this novel approach of segmenting
a composite sequence using a hypothesis testing framework produces reliable
results. The method identifies segments within the composite sequence by their
differences from their neighbors and also by their similarities to each other. Be-
cause recombination events often affect blocks of nucleotides rather than change
specific sites, the method can identify multiple breakpoints.
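The construction of a test recombinant by swapping a block between two parents can be illustrated with a short Python sketch. The function name make_mosaic and the toy sequences are hypothetical, and the breakpoints are treated as half-open cut positions on two sequences of equal length, which is an assumption since the coordinate convention is not spelled out in the text.

```python
def make_mosaic(first, second, breakpoints):
    """Build a daughter that alternates between two equal-length parents,
    switching source at each breakpoint and starting with `first`."""
    if len(first) != len(second):
        raise ValueError("parents must have the same length")
    sources = (first, second)
    pieces = []
    start = 0
    for idx, end in enumerate(list(breakpoints) + [len(first)]):
        pieces.append(sources[idx % 2][start:end])
        start = end
    return "".join(pieces)
```

A single breakpoint gives a two-block daughter in the style of RECOMB1 and RECOMB2, and two breakpoints give a three-block daughter in the style of RECOMB3.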
3.4.1 Recomb1 Results
RECOMB1 contained nucleotides 0 to 2308 from W956 and the rest of
the genome from NY99-385. During the alignment process, ClustalW added
12 insertions into the RECOMB1 sequence, moving the breakpoint to 2320.
The initial segmentation of the composite sequence from RECOMB1, W956
and NY99-385 showed 11 segments. Following the segmentation simplification
procedure, the composite sequence was divided into two main segments from
0 to 2324 and from 2324 to 10317, as illustrated in Figure 3.1. The known
breakpoint is indicated by the solid line and the predicted breakpoint is indicated
by the dashed line.
[Breakpoint diagram; positions marked: 0, 2320, 2324, 10317]
Figure 3.1: A Comparison of the Known and Predicted Breakpoints in
RECOMB1
3.4.1.1 Analysis
During the alignment process, ClustalW added 12 insertions into the RECOMB1
and W956 (the minor parent) sequences at position 1331 to align with
NY99-385 (the major parent). At position 2308 in NY99-385 and at position
2298 in W956, ClustalW added 12 insertions to align with RECOMB1. This
action moved the original breakpoint in RECOMB1 to position 2320 in the
alignment. Prior to position 2308, the composite sequence consisted of uninformative
and B sites, i.e., the daughter sequence matched the minor parent. After
position 2320, the composite sequence consisted of uninformative and A sites,
i.e., the daughter sequence matched the major parent. The first informative
site following the insertions, i.e., position 2324, was an A site. The segmentation
method indicated this position as the breakpoint, as illustrated in Figure 3.2.
Positions marked: 2308, 2320, 2324
NY99-385  TAGGTC- -CATAGCT
W956      CAGGTCA ATTGCT
RECOMB1   CAGGTCAATTGCTATGACCATAGCT
Sites marked: B Site, A Site
Figure 3.2: Analysis of Breakpoints in RECOMB1
However, this informative site was not the correct position for the breakpoint.
When only these three sequences were aligned, ClustalW added 12 insertions
at position 2308 into the two parental sequences, NY99-385 and W956,
to align with RECOMB1. This action shifted the breakpoint to the first informative
site following the insertions, i.e., position 2321, as illustrated in Figure
3.3.
Positions marked: 2308, 2320, 2321
NY99-385  TAGGTC- ... -CATAGCT
W956      CAGGTC- ... -AATTGCT
RECOMB1   CAGGTCAATTGCTATGACCATAGCT
Sites marked: B Site, A Sites
Figure 3.3: Alternative Analysis of Breakpoints in RECOMB1
3.4.2 Recomb2 Results
RECOMB2 was the opposite of RECOMB1; it contained nucleotides 0 to
2308 from NY99-385 and the rest of the genome from W956. The alignment
procedure added insertions into the RECOMB2 sequence. However, the breakpoint
position was unchanged. The initial segmentation of the composite sequence
from RECOMB2, W956 and NY99-385 showed 11 segments. Following the
segmentation simplification procedure, the composite sequence was divided into
two main segments from 0 to 2339 and from 2339 to 10317, as illustrated in Figure
3.4. The known breakpoint is indicated by the solid line and the predicted
breakpoint is indicated by the dashed line.
[Breakpoint diagram; positions marked: 0, 2308, 2339, 10317]
Figure 3.4: A Comparison of the Known and Predicted Breakpoints in
RECOMB2
3.4.2.1 Analysis
During the alignment process, ClustalW added 12 insertions into the W956
(the minor parent) sequence at position 1331 to align with RECOMB2 and
NY99-385 (the major parent). At position 2308, ClustalW added 24 insertions in
RECOMB2 and 12 insertions in NY99-385. At position 2309, ClustalW added 12
insertions in W956. This action did not change the original breakpoint position
in RECOMB2 in the alignment. Prior to position 2308, the composite sequence
consisted of uninformative and A sites, i.e., the daughter sequence matched
the major parent. After position 2332, the composite sequence consisted of
uninformative sites and B sites, i.e., the daughter sequence matched the minor
parent. The first informative position following the insertions, i.e., position
2339, was a B site. The segmentation method indicated this position as the
breakpoint, as illustrated in Figure 3.5.
Positions marked: 2308, 2332, 2339
NY99-385  TAGGTC- ... -CATAGCCTCACGTTTCTC
W956      CAGGTCA- ... -ATTGCTATGACGTTTCTT
RECOMB2   TAGGTC- ... -GTTTCTT
Sites marked: A Site, B Site
Figure 3.5: Analysis of Breakpoints in RECOMB2
3.4.3 Recomb3 Results
RECOMB3 contained nucleotides 0 to 368 from W956, nucleotides 369 to
2308 from NY99-385 and the rest from W956. The alignment procedure added
insertions into the RECOMB3 sequence. However, the two breakpoint positions
were unchanged. The initial segmentation of the composite sequence from RECOMB3,
W956 and NY99-385 showed 13 segments. Following the segmentation
simplification procedure, the composite sequence was divided into three main
segments from 0 to 371, from 371 to 2339 and from 2339 to 10317, as illustrated
in Figure 3.6. The known breakpoints are indicated by the solid line and the
predicted breakpoints are indicated by the dashed line.
[Breakpoint diagram; positions marked: 0, 368, 371, 2309, 2339, 10317]
Figure 3.6: A Comparison of the Known and Predicted Breakpoints in
RECOMB3
3.4.3.1 Analysis
Prior to position 371, the composite sequence consisted of uninformative
and B sites, i.e., the daughter sequence matched the minor parent; after position
371, the composite sequence consisted of uninformative sites and A sites, i.e., the
daughter sequence matched the major parent. At position 371, the segmentation
algorithm indicated a breakpoint.
During the alignment process, ClustalW added 12 insertions into the W956
(the minor parent) sequence at position 1331 to align with RECOMB3 and
NY99-385 (the major parent). At position 2308, ClustalW added 24 insertions
in RECOMB3 and 12 insertions in NY99-385. At position 2309, ClustalW
added 12 insertions in W956. This action did not move the original breakpoint
in RECOMB3 in the alignment. Prior to position 2308, the composite sequence
consisted of uninformative and A sites, i.e., the daughter sequence matched the
major parent. After position 2332 in the alignment, the composite sequence
consisted of uninformative and B sites, i.e., the daughter sequence matched
the minor parent. The first informative position following the insertions, i.e.,
position 2339, was a B site. The segmentation method indicated this position
as the breakpoint, as illustrated in Figure 3.7.
Positions marked: 2308, 2332, 2339
NY99-385  TAGGTC- ... -CATAGCCTCACGTTTCTC
W956      CAGGTCA- ... -ATTGCTATGACGTTTCTT
RECOMB3   TAGGTC- ... -GTTTCTT
Sites marked: A Site, B Site
Figure 3.7: Analysis of Breakpoints in RECOMB3
3.5 Second Experiment
To test the validity of the results obtained from the RDP2 software package,
composite sequences were generated from each set of triplet sequences (two
parental and one daughter sequence) indicated by the software package as
recombinant, as listed in Table 3.1, and analyzed. In the model selection
framework using a binary alphabet, the relevant parameters were $K_1 = 1$, since
only 1 of the 2 letters was independent, and $K_2 = 3$, from the one free parameter
from each subsequence and the cutoff point, n. The analysis of these composite
sequences indicated WNV is not recombinant.
3.6 Third Experiment
To analyze all sequences in the dataset, the segmentation algorithm was
extended to select all combinations of three sequences and then permute each
subset. In this experiment, the combinations are calculated as
$C(n, k) = \frac{n!}{k!(n-k)!}$, where n is 24 and k is 3. Each subset is then
permuted as $P(n, k) = \frac{n!}{(n-k)!}$, where n is 3 and k is 1. Permuted
sequences were generated from each combination to determine the proper
relationship of daughter and parental sequences.
For example, using three WNV sequences Egypt, Italy and Uganda, the
permuted sequences would be generated as illustrated in Table 3.2. Permuted
sequences were generated from the entire dataset. Each composite sequence was
analyzed using the relevant parameters as described for the previous experiments
using a binary alphabet. The analysis of these composite sequences indicated
WNV is not recombinant.
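The enumeration of candidate triplets can be illustrated with Python's standard itertools module. This is a sketch rather than the code used in the experiment, and the function name candidate_triplets is hypothetical. It assumes that every ordering of a three-sequence subset into (daughter, major parent, minor parent) is generated, which matches the six orderings listed for Egypt, Italy and Uganda in Table 3.2.

```python
from itertools import combinations, permutations
from math import comb

def candidate_triplets(names):
    """Yield every (daughter, major parent, minor parent) ordering of
    every 3-element subset of `names`."""
    for trio in combinations(names, 3):
        for daughter, major, minor in permutations(trio):
            yield daughter, major, minor

# Number of unordered triplets for a dataset of 24 sequences.
n_subsets = comb(24, 3)
# Each subset yields 3! = 6 role assignments under the assumption above.
n_ordered = n_subsets * 6
```

For a dataset of 24 sequences this gives 2024 subsets and, under the six-orderings assumption, 12144 composite sequences to analyze.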
Table 3.1: Composite Sequences Generated from Results Obtained from RDP2
Software
Daughter Sequence  Major Parent  Minor Parent   Daughter Sequence  Major Parent  Minor Parent
Senegal            Madagascar    Italy          CAR                Madagascar    Italy
Senegal            Russia        Rabensburg     CAR                Russia        Rabensburg
Uganda             Russia        Rabensburg     Israel             Madagascar    Italy
Uganda             Madagascar    Italy          Israel             Russia        Rabensburg
Hungary04          Madagascar    Italy          Madagascar         Rabensburg    Hungary04
Hungary04          Russia        Rabensburg     Madagascar         Russia        Rabensburg
Mor04              Madagascar    CAR            France             Madagascar    CAR
Mor96              Madagascar    CAR            Kenya              Madagascar    CAR
China              Madagascar    CAR            Italy              CAR           Madagascar
China              Ethiopia      Israel98       Romania            CAR           Madagascar
Israel98           CAR           Madagascar     NY99               CAR           Madagascar
Mexico             CAR           Madagascar     Hungary03          CAR           Madagascar
Tunisia            CAR           Madagascar     Egypt              CAR           Madagascar
Egypt              Ethiopia      Israel98       Ethiopia           CAR           Madagascar
KUNCG              CAR           Madagascar     India              Russia        Egypt
Russia             Israel        India
Table 3.2: Example of Permuted Sequences Generated from Combination and
Permutation Function
Daughter Sequence  Major Parent  Minor Parent
Egypt              Italy         Uganda
Egypt              Uganda        Italy
Uganda             Egypt         Italy
Uganda             Italy         Egypt
Italy              Uganda        Egypt
Italy              Egypt         Uganda
3.7 Fourth Experiment
To improve the accuracy of detecting the border(s) between recombinant
regions of the viral genome, composite sequences were generated from each subset
of triplet sequences as described in the third experiment and analyzed using a
4-symbol alphabet that captures the composition of the three-symbol codon.
In [39], Bernaola-Galvan et al. used a 12-symbol alphabet to take into account
the nucleotide composition within codons in order to differentiate between coding
and noncoding regions of DNA. They defined the phase of position i as
$j = i \bmod 3$, where $j \in \{0, 1, 2\}$. In this experiment, the phase of position i was
defined as $j = i \bmod 3$, where $j \in \{0, 1\}$, to analyze those portions of the codon
which are under selective pressures and thus may have different evolutionary
histories. Therefore, the first two symbols in the codon can be substituted
with one of the following symbols: $\{A_0, A_1, B_0, B_1\}$, where, for example, $A_0$
indicates a symbol A in position 1 of the codon. In the model selection framework
using a 4-symbol alphabet, the relevant parameters were $K_1 = 2$, since only 1 of
the positions for each symbol was independent, and $K_2 = 5$, from the two free
parameters from each subsequence and the cutoff point, n. The analysis of these
composite sequences indicated WNV is not recombinant.
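The phase-aware relabeling can be sketched in Python as follows. Several details are assumptions made for illustration only: the function name phase_labels, zero-based indexing of positions, a frame_start parameter giving the index of the first codon position, and the decision to drop both the third codon position and all sites that are not labeled A or B.

```python
def phase_labels(composite, frame_start=0):
    """Map informative sites at codon phases 0 and 1 to A0, A1, B0 or B1."""
    symbols = []
    for i, label in enumerate(composite):
        phase = (i - frame_start) % 3
        if label in ("A", "B") and phase in (0, 1):
            symbols.append(f"{label}{phase}")
    return symbols
```

For the toy composite "ABUABB", the result is A0, B1, A0, B1: the U site is skipped and the final B, which falls on the third codon position, is dropped.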
3.8 Conclusion
West Nile virus is a mosquito-borne member of the family Flaviviridae, genus
Flavivirus, Japanese encephalitis serogroup. While it is capable of infecting
many different vertebrate and invertebrate species, the primary transmission
cycle of WNV involves birds and mosquitoes. While Flaviviruses in general
have a global distribution, each species has a geographic niche based upon its
particular mosquito vector-avian host relationship. Unlike other members of
this genus, WNV has a worldwide distribution in both the Old World and New
World. [117]
West Nile virus was first isolated from a woman in the West Nile region of
Uganda in 1937. Following its discovery, serological and epidemiological surveys
were conducted and immunity to WNV was found to be widespread in humans,
horses, birds, monkeys and domestic farm animals throughout Africa. The virus
caused a mild childhood disease consisting of fever.
From the 1950s to the 1970s, West Nile virus caused occasional epidemics
and epizootics involving humans, birds, horses and domestic farm animals in
Israel, France, Spain and Portugal as well as South Africa. While both children
and adults suffered with acute febrile disease, WNV caused more serious disease
and patients were diagnosed with meningoencephalitis for the first time in 1957.
Subsequently, WNV has caused major epidemics and epizootics in Israel,
Romania, Russia, Europe and north Africa. During these epidemics, WNV
caused acute aseptic meningitis, meningoencephalitis, and encephalitis in humans
as well as acute febrile disease. Infection in horses caused acute neurologic
symptoms, fever, paresis of the hindquarters, paralysis, or some combination of
these symptoms. The infection sometimes proved fatal for both humans
and horses.
Evidence that WNV had spread to the New World emerged in 1999, when
it caused an outbreak in humans, birds and horses in New York City. The virus
has since spread across the lower 48 states of the United States and into Canada.
In humans, WNV caused acute febrile disease with headache, fatigue, malaise,
muscle pain, weakness, and sometimes gastrointestinal symptoms, as well as more
severe symptoms such as acute flaccid paralysis. In horses, WNV also caused more
serious disease characterized by ataxia, weakness of limbs, recumbency, difficulty
rising, muscle fasciculation, fever, paralyzed or drooping lip, twitching face or
muzzle, teeth grinding, blindness, and death. Birds, particularly American Crows
(Corvus brachyrhynchos), were also especially hard hit. West Nile virus has spread
into the Caribbean, Mexico, Central and South America but has not caused high
numbers of infections in humans or animals there. As a consequence of the spread
and increased virulence of the virus in North America, researchers have turned
their attention to developing treatments and a human vaccine. A vaccine is
available for horses, but not for humans. [349, 414, 196, 380, 501]
All Flaviviruses, especially the Japanese encephalitis group, display antigenic
cross-reactivity, which makes identifying viral agents using conventional
serological testing difficult. Phylogenetic analysis has been used to establish
the identity, origin and intra-species relationships of Flaviviruses. Phylogenetic
analyses of WNV indicate that strains from North America, Europe, Israel, Africa,
Russia and Australia fall into a single lineage (Lineage 1). This lineage contains
some related subtypes clustered into clades: the Kunjin viruses found in
Australia, the Indian WNV and the European and African WN viruses. Lineage 2
strains have been isolated only in sub-Saharan Africa and Madagascar.
Lineage 3 strains were isolated in the Czech Republic in 1997 and 1999. The
Russian isolate LEIV-Krnd88-190 constitutes Lineage 4 and strains from different
geographical regions of India constitute Lineage 5. [266, 27, 28, 56]
RNA viruses are subject to rapid evolution because transcription is carried out
by RNA polymerases, which lack proofreading ability and are therefore highly
error-prone, resulting in mutations. Phylogenetic studies on WNV assume
its genetic diversity is the result of mutations. However, evolution can also
occur when RNA viruses undergo recombination. To detect recombination in
aligned nucleotide sequences, many methods have been developed. However,
these methods can produce false positive results because genomic mutations
can produce patterns similar to recombination.
This paper describes parametric and nonparametric methods within the
RDP2 software and the SplitTrees software packages used to detect recombi-
nation in WNV. The results present conflicting evidence of recombination. To
resolve this conflict, an extended parametric statistical method, the Jensen-
Shannon divergence method, was developed. This method uses a composite
sequence generated from two parental and one daughter sequence. To validate
the method, three recombinant sequences were manually generated and ana-
lyzed. The results indicate the extended Jensen-Shannon divergence method
can detect homogeneous segments with multiple breakpoints within the com-
posite sequence.
The method was applied to WNV sequence triplets identified by the non-
parametric methods within the RDP2 software package as recombinant. The
method was extended to analyze all combinations and permutations of a dataset
of sequences using a binary alphabet as well as a 4-symbol alphabet. The results
from several experiments indicate WNV is not recombinant.
4. Future Work
West Nile virus has a global distribution and shares ecological niches with
other members of the genus Flavivirus as well as the genus Alphavirus. For
example, WNV and JE overlap in Asia. Kunjin virus, JE and Murray Valley
encephalitis virus (MVE) overlap in Australia. West Nile virus and SLE overlap
in North America and WNV and Usutu virus overlap in Africa. Also, WNV,
Usutu virus and Sindbis virus antibodies have been found together in birds in
the United Kingdom. [117, 68]
Given this information and to extend this research, I would like to analyze
members of the Japanese encephalitis serogroup including JE, SLE, MVE, Usutu
virus, and Sindbis virus for heterogeneous recombination.
Appendix A. Molecular Biology of West Nile Virus
If life is defined as being able to metabolize energy and produce waste,
independently reproduce, employ some method of communication, adapt and
evolve to accommodate the environment, then viruses confuse the boundary
between what is alive and what is not.
Viruses cannot metabolize energy; they do not capture and use ATP nor do
they produce waste byproducts. Viruses cannot reproduce independently.
In contrast to living cells, virus particles do not grow or undergo cell
division. Instead, the complete virus is assembled from preformed components
within a living host cell. The host cell is used by the virus to produce the
components and assemble the virus particles (virions); viruses are functionally inactive
outside the host. However, like all living things, the genetic material of viruses
is either ribonucleic acid (RNA) or deoxyribonucleic acid (DNA) and contains
coded information within genes. This genetic material can mutate, which allows
viruses to adapt to their environment.
Viral genetic material can be double- or single-stranded, circular or linear.
The capsid contains the genetic material. The morphology of the capsid can
be described as icosahedral, helical or complex. Viruses may be coated with an
envelope made from their host's cellular membrane or naked. These attributes
can be used to categorize and classify viruses.
A.1 Taxonomy of Viruses
Taxonomy, according to Webster's Encyclopedic Unabridged Dictionary of
the English Language, is defined as the science of identifying, naming and
classifying organisms. Viral taxonomy can be based on any number of viral qualities:
hosts, geographic distribution, vectors, transmission cycles, virus morphology,
tissue tropism, and symptoms of disease caused from viral infection. These
qualities may not be very satisfactory since different viruses can cause similar
symptoms and disease, may look similar, or inhabit the same environmental
niche and yet not be related. However, with the advances of molecular biology,
viral taxonomy is now based on the type of nucleic acid and genomic structure
of the virus, assuming viruses that order their genes similarly are related and
replicate in a similar way.
A.1.1 The Central Dogma of Molecular Biology
All organisms are made of proteins. The central dogma of molecular biology
indicates that DNA indirectly codes for protein through a two-step process of
transcription and translation, as illustrated in Figure A.1.
Transcription is the process of using enzymes (DNA-dependent-RNA polymerase)
to transcribe DNA into a single strand of ribonucleic acid (RNA) in the
cell's nucleus. This RNA crosses the nuclear boundary into the cytoplasm and
acts as a messenger (mRNA) of information to the ribosomes. The mRNA is
used as a template for protein production (translation).
DNA --(transcription)--> RNA --(translation)--> Protein
Figure A.1: Central Dogma of Molecular Biology
The genetic material of viruses is either RNA or DNA. DNA is made up
of four types of nucleotides consisting of a nitrogenous base linked to a sugar,
deoxyribose, that, in turn, is linked to a phosphate group on the other side. Each
nucleotide differs in its nitrogenous base: adenine (A), cytosine (C), guanine (G)
and thymine (T). A polynucleotide chain is created when the phosphate group
of one nucleotide attaches to the sugar of another nucleotide with a phosphate
radical (P) at the 5' end and a hydroxyl (OH) at the 3' end, as illustrated in
Figure A.2. Chemically, RNA is similar but contains the base uracil (U) instead
of thymine (T) and the sugar is ribose. [261]
P - Sugar - Phosphate - Sugar - Phosphate - ... - OH (bases shown: A, C)
Figure A.2: RNA schemata
A.1.2 The Baltimore System of Classification
Dr. David Baltimore developed a classification scheme for viruses based on
genomic structure and replication strategy, specifically how mRNA is produced
from the viral genome. Using these criteria, viruses are organized into seven
groups or classes: [120]
Class I: Double-stranded DNA Genomes. These viruses follow the central
dogma. Some viruses replicate in the nucleus using cellular
DNA-dependent-RNA polymerase; some replicate in the cytoplasm using viral
DNA-dependent-RNA polymerase.
Class II: Single-stranded DNA Genomes (can be written in the 5' to 3'
direction or vice versa). These viruses must be converted to the
double-stranded form. These viruses then follow the central dogma.
Class III: Double-stranded RNA Genomes. These viruses have segmented
genomes. The coding region of each genome segment is a single, open
reading frame, which is transcribed to produce monocistronic (coding for one
protein) mRNAs. Because the host cell does not contain the polymerase
necessary for transcription, the viral genome encodes the
RNA-dependent-RNA polymerase needed. The host cell's ribosomes translate the viral
mRNA into viral proteins.
Class IV: Single-stranded (+)sense RNA Genomes (written in the 5' to 3'
direction). The genome of these viruses acts as mRNA upon entry into the
host cell, which causes the virus to be infectious without any intermediate
stages. The mRNA may have an open reading frame or an overlapping
reading frame. If the genome contains overlapping reading frames, the
virus must be able to make functionally monocistronic mRNAs during two
or more rounds of transcription. This processing is generally designated
as early and late phases. The viral genome encodes the
RNA-dependent-RNA polymerase needed for transcription.
Class V: Single-stranded (-)sense RNA Genomes (written in the 3' to 5'
direction). The genomes of these viruses may be either segmented or
nonsegmented. The viral genome encodes the RNA-dependent-RNA polymerase
needed to produce mRNA from (-)RNA before translation can
take place.
Class VI: Single-stranded (+)sense RNA genome with DNA intermediate
in life-cycle. The viral genome is positive-sense but is not an infectious
mRNA. These viruses are diploid (two complete copies of the genome in
each capsid), and the mRNA is a template for conversion into DNA using
the enzyme reverse transcriptase, also known as RNA-dependent-DNA
polymerase, which is encoded by the viral genome. The viral DNA is
integrated into the host DNA and converted back into mRNA by the host
following the central dogma. These viruses are termed retroviruses.
Class VII: Double-stranded DNA genome with RNA intermediate. This
group of viruses also relies on reverse transcription, but unlike the
retroviruses, the processing occurs inside the virus particle on maturation.
A.1.3 The ICTV System of Taxonomy
The International Committee on Taxonomy of Viruses (ICTV) oversees the
naming of viral species. Viral classification starts at the level of order and
proceeds to species. The names of viral orders and families are usually italicized.
The taxon suffixes are given. [120]
Order (-virales)
Family (-viridae)
Subfamily (-virinae)
Genus (-virus)
Species (-virus)
Figure A.3: Viral Classification
A.1.3.1 The Taxonomic and Serological Classification of West Nile
Virus
West Nile virus (WNV) belongs to the family Flaviviridae. Three genera,
Flavivirus, Pestivirus, and Hepacivirus belong to this family. Members of each
genus share certain basic physical and chemical properties and some are
important human pathogens. The term "flavo" is Latin for yellow. Hence, the genus
Flavivirus, which includes more than 70 identified viruses, includes the virus that
causes yellow fever. West Nile virus also belongs to the genus Flavivirus.
In [74], Calisher defines a serological complex as two or more viruses, dis-
tinct from each other by quantitative serological criteria (fourfold or greater
differences between homologous and heterologous titres of both serum samples)
in one or more tests, but related to each other or to other viruses by some (any)
serological method. Viruses closely related within a serogroup but distinct from
each other are considered to constitute an antigenic complex.
The genus Flavivirus has been divided into nine serological complexes
according to their cross-reactivity in neutralization assays, their pathogenic
properties and whether they are transmitted by mosquito, tick or an unknown
vector. The largest group, with ten members, is the Japanese encephalitis virus
serogroup.
The viruses in the Japanese encephalitis virus serogroup are mosquito-borne
viruses (arboviruses) that may cause human encephalitis and include West Nile
virus as well as the Japanese encephalitis virus, St. Louis encephalitis virus,
Murray Valley encephalitis virus, Alfuy virus, Koutango virus, Usutu virus,
Cacipacore virus and Yaounde virus. Kunjin virus is a WNV subtype. [431,
433, 112, 74, 370, 310, 345]
A.2 The Genomic Structure of West Nile Virus and Kunjin Virus
West Nile virus and Kunjin virus are single-stranded, positive sense RNA
viruses (Class IV). Their genomes have approximately 11,000 nucleotides and
consist of three main parts, as illustrated in Figure A.4: the 5' untranslated
region (UTR) of 96 nucleotides; the protein coding section, consisting of a single
open reading frame with approximately 10,000 nucleotides, which encodes all of the
viral protein genes; and the 3' UTR. The 3' UTR of WNV is of variable length
while the 3' UTR of Kunjin virus has approximately 624 nucleotides.
A.2.1 The Primary Structure of West Nile and Kunjin Virus
Genomes
The WNV and Kunjin virus genomes consist of a 5' UTR, which includes a
type 1 cap (7-methylated Guanine cap) and ends with the conserved dinucleotide
CG. The coding region of both viruses is characterized as a single open reading
frame, which starts with the codon ATG (Methionine). The coding region of
WNV ends with either the codon TAG or TAA, while the coding region of
Kunjin ends with the codon TAA. The 3' UTR of WNV and Kunjin virus is of
variable length and ends with the conserved dinucleotide CT. This region shows
high levels of heterogeneity between the viruses following the stop codon. At the
distal end of the 3' UTR, both viruses contain conserved nucleotide sequences
referred to as CS1, CS2 and CS3, and repeated sequences referred to as RCS2
and RCS3. Neither virus is polyadenylated at the end of the 3' UTR. [239, 376]
Gene       Nucleotides
Capsid     97-465
Membrane   466-966
Envelope   967-2457
NS1        2458-3513
NS2A       3514-4206
NS2B       4207-4599
NS3        4600-6456
NS4A       6457-6903
NS4B       6904-7671
NS5        7672-10386
Figure A.4: West Nile Virus Genome Based on the Reference Sequence NC-00153
The coding region of WNV codes for between 3430 and 3434 amino acids.
The coding region of Kunjin virus codes for 3433 amino acids. In both viruses,
a single polyprotein is produced during replication. The individual proteins
are arranged in the order 5'-C-prM-E-NS1-NS2A-NS2B-NS3-NS4A-NS4B-NS5-3'.
This polyprotein is co-translationally and post-translationally cleaved by
host and viral proteinases to produce ten viral proteins. Cleavage occurs at the
C-prM, prM-E, E-NS1, NS1-NS2A, NS4A-NS4B and NS4B-NS5 junctions. The
host convertase furin is responsible for cleavage in the prM to produce a mature
M protein.
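As a consistency check, the nucleotide coordinates printed in Figure A.4 can be converted into codon counts. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the coordinates are assumed to be 1-based and inclusive.

```python
# Coordinates as printed in Figure A.4 (assumed 1-based, inclusive).
GENES = {
    "Capsid": (97, 465),
    "Membrane": (466, 966),
    "Envelope": (967, 2457),
    "NS1": (2458, 3513),
    "NS2A": (3514, 4206),
    "NS2B": (4207, 4599),
    "NS3": (4600, 6456),
    "NS4A": (6457, 6903),
    "NS4B": (6904, 7671),
    "NS5": (7672, 10386),
}

def codon_counts(genes):
    """Map each region to its length in codons, rejecting partial codons."""
    counts = {}
    for name, (start, end) in genes.items():
        length = end - start + 1
        if length % 3 != 0:
            raise ValueError(f"{name}: {length} nt is not a whole number of codons")
        counts[name] = length // 3
    return counts

def is_contiguous(genes):
    """True if each region starts one nucleotide after the previous one ends."""
    spans = list(genes.values())
    return all(nxt[0] == prev[1] + 1 for prev, nxt in zip(spans, spans[1:]))

counts = codon_counts(GENES)
total_codons = sum(counts.values())
print(is_contiguous(GENES))  # True
print(total_codons)          # 3430
```

The regions tile the open reading frame without gaps, and the total of 3430 codons agrees with the lower end of the 3430 to 3434 amino acid range given for WNV. The first region starts at nucleotide 97, which matches the 96-nucleotide 5' UTR.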
The mature viral proteins consist of three structural proteins, namely the
capsid protein (C), the membrane protein (prM/M), and the major envelope
protein (E), and seven nonstructural proteins: nonstructural protein 1 (NS1);
nonstructural protein 2A (NS2A); the two components of the flavivirin
protease (NS2B and NS3); nonstructural protein 4A (NS4A); nonstructural
protein 4B (NS4B); and the RNA-dependent RNA polymerase (NS5). [102, 439,
495, 496, 117, 65, 424]
A.2.1.1 The Nonstructural Proteins of West Nile and Kunjin
Viruses
The seven nonstructural proteins are involved in RNA synthesis and in
attenuating host antiviral responses. The NS1 protein is a glycoprotein with 12
invariant cysteine residues, which may function as a cofactor during replication.
The NS1 protein is secreted and may correlate with the development of severe
disease. It also is associated with the cell surface membranes but its function in
the cell surface membranes and pathogenesis is unknown. [185, 65, 98]
The NS3 protein contains a serine protease at the N-terminus of the protein.
This protease is a member of the trypsin family and works in conjunction with
the NS2B protein. This NS2B-NS3 complex may be responsible for cleavage
at the NS2A-NS2B, NS2B-NS3, NS3-NS4A and NS4B-NS5 junctions. It may
also be responsible for cleavage within the C, NS3, and NS4A proteins. The
C-terminus of the protein contains an RNA helicase and nucleoside
triphosphatase (NTPase). [484, 496, 65, 424]
The NS4B protein may have a role in the inhibition of the host cell's α/β
interferon antiviral response. The NS2A protein blocks the interferon signaling
pathway by preventing STAT1 and STAT2 phosphorylation. The NS4A protein
in dengue virus, a related virus, can also block interferon signaling and may
have the same function in WNV. The NS4A protein and the NS3 protein in
Kunjin virus elicit cytotoxic T cell responses. Also, the Kunjin polyproteins
NS4A-NS4B, NS2B-NS3-NS4A and NS4A cause cytoplasmic membrane
rearrangements, with NS4A specifically inducing the convoluted membranes and
paracrystalline arrays characteristic of infection. [287, 355, 390]
The NS5 protein contains a methyltransferase (MTase) at the N-terminus
of the protein, which methylates the 5' cap. The C-terminus of the protein
contains the RNA-dependent RNA polymerase necessary for genome replication.
Replication occurs when minus-strand RNA templates are produced from plus-
strand viral RNA via RNA-dependent RNA polymerase. These templates are
used to produce plus-strand viral RNA, which is packaged in the mature virion.
[65, 382]
A.2.2 The Secondary and Tertiary Structures of West Nile and
Kunjin Viruses
Secondary and tertiary structures occur when complementary base pairs
form hydrogen bonds, such as between A-T and C-G, to form stem regions,
hairpin turns, loops, and bulges. A pseudoknot is a tertiary structure in which
a loop pairs with other bases outside the loop. The single-stranded viral RNA
molecule can fold into secondary and tertiary structures in the 5' and 3' UTRs.
These structures are conserved among Flaviviruses even if the specific nucleotide
sequences are not. [127, 261]
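The base-pairing rule behind a stem region can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not a structure-prediction method: sequences are written in the A/C/G/T alphabet used in the text, only the two pairings named above are allowed, and the function names and the toy hairpin are hypothetical.

```python
# Pairings named in the text, using the A/C/G/T alphabet.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def can_form_stem(left_arm, right_arm):
    """True if the two arms pair at every position when aligned antiparallel."""
    return reverse_complement(left_arm) == right_arm.upper()

# Toy hairpin for illustration only: 5-nt arm, 4-nt loop, 5-nt arm.
hairpin = "GGCAC" + "TTCG" + "GTGCC"
print(can_form_stem(hairpin[:5], hairpin[-5:]))  # True
```

Two different arm sequences can pass this check and so produce the same stem shape, which is one way a structure can be conserved while the underlying nucleotide sequence is not.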
One of the secondary structures within the 5' UTR of WNV and Kunjin
genomes is the m7G cap at the beginning of the 5' UTR. This cap consists
of an inverted guanosine, methylated at the N-7 position, which is linked to the
first transcribed RNA nucleotide by a unique 5'-5' triphosphate bridge. The cap
is necessary for correct genome replication. The WNV RNA-dependent RNA
polymerase can transcribe both capped and uncapped plus-strand RNA into
minus-strand templates. However, the capped plus-strand RNA is transcribed
into a minus-strand template of normal length, while the uncapped plus-strand
RNA is transcribed into a minus-strand template twice as long as normal, which
folds into a hairpin structure. [120, 336, 382]
Another secondary structure within the 5' UTR is a small stem-loop
structure. This structure corresponds to a stem-loop structure on the 3' UTR of
the minus-strand. The proteins TIAR/TIA-1 bind to this 3' stem-loop
structure on the minus-strand and may stabilize the stem-loop structure to aid the
polymerase in the recognition of the minus-strand template. [65, 282, 382]
Other secondary and tertiary structures occur within the 3' UTR. The
terminal end of the 3' UTR folds into a large stem-loop followed by a smaller
stem-loop structure, which may interact to form a pseudoknot. The shape of
the large stem-loop structure is highly conserved among Flaviviruses. While
the sequences of the stem region are not conserved, the sequences of the loop
regions are highly conserved. Any mutations or deletions of these secondary
structures affect the transcription of Flaviviruses but not the translation of the
viral mRNA into protein. For example, the conserved sequence 5'-CACAG-3' at
the top of the stem-loop structure is critical for RNA transcription but not for
translation. Also, the stem region of the large stem-loop structure has a
high-activity binding site as well as two low-activity binding sites, which bind
the host cell protein translation elongation factor-1 alpha (EF-1α). These
sites may have a role in RNA translation. [48, 418, 49, 376, 281, 344, 288, 458]
The entire viral genome of WNV and Kunjin may fold into a panhandle-like
structure, a process also known as genome cyclization. This structure is created
when the complementary bases within the conserved nucleotide sequence (CS1)
in the 3' UTR bind with the 5' UTR. This structure may play a role in genome
replication. When researchers deleted as many as 352 nucleotides after the stop
codon in the Kunjin virus genome, which left the complementary bases intact,
replication was partially inhibited but the deletion was not lethal to the virus.
Additional mutational analysis indicates the cyclization elements are necessary
for synthesis of minus-strand RNA from the plus-strand RNA, but mutations in
the minus-strand RNA do not affect synthesis of plus-strand RNA. [236, 457,
419, 336, 458]
A.3 The Virion Morphology of West Nile Virus
The viral capsid is made of protein and, when assembled, surrounds the
viral genome. West Nile virus, like other Flaviviruses, has a rigid icosahedral
capsid which is 35 nm and consists of multiple copies of the C protein shaped
into a 20-sided solid of identical equilateral triangles.
West Nile virus has an envelope which is derived from the host cell. Within
the envelope, the M and E proteins are anchored by C-terminal α-helical
hairpins. The mature virus is 40-50 nm in diameter. [363, 325, 347]
A.4 Viral Life Cycle
Viruses infect any type of host organism, be it bacteria, plants, or animals.
The host is required by the virus to replicate its genetic material and produce
viral proteins in sufficient quantities to produce progeny viruses. The replication
cycle of WNV, as illustrated in Figure A.5, proceeds through the following
stages. [120]
Host cell attachment and penetration: In this stage the virus comes into
contact with the host cell through viral receptors and complementary
receptors on the cell membrane, usually glycoproteins. Penetration occurs
when the virus crosses the cell membrane by endocytosis, transfer or fusion
of the envelope (if an enveloped virus) with the cell membrane.
Viral uncoating: In this stage the viral genome is released from the capsid.
This may be a pH-dependent or pH-independent process.
Expression and replication of viral genomes: In this stage the viral genome
is replicated via the two-step process of transcription and translation by
the host cell mechanisms, which results in the production of large
quantities of viral genomes and proteins.
Assembly and maturation: When enough viral proteins and genomes
are produced, viral particles are assembled into nascent virions.
Viral exiting: The virions are released by lysis of the host cell, which may
or may not result in the death of the host cell.
A.4.1 Viral Attachment and Host Cell Penetration
In order to transport the viral genome into the cytoplasm for replication,
WNV must first attach to receptor sites on the host cell membrane. It gains entry
via receptor-mediated endocytosis in coated pit vesicles, specifically
clathrin-mediated endocytosis, along with an integrin receptor. Integrins are
cell membrane proteins involved in the attachment of a cell to the extracellular
matrix and to other cells, and in signal transduction from the extracellular
matrix to the cell. The specific vertebrate host cell receptor responsible for
WNV attachment may be the β3 integrin receptor along with the attachment
factor, C-type (calcium-dependent) lectin DC-SIGNR, in endothelial cells (the
thin layer of cells that line the interior surface of blood vessels). On the viral
side, the E protein of WNV functions as the viral receptor-binding and fusion
protein. [95, 96, 97, 111]
A.4.1.1 The Envelope Glycoprotein
The structure of the entire E protein has been determined by X-ray
crystallography and shows folds similar to those of dengue and tick-borne
encephalitis viruses. The overall structure consists of three domains: Domain I
is centrally located in the structure with its nine-stranded β-barrel and carries
the N-glycosylation site, which may be significant in the severity of disease;
Domain II is mostly β-strands and contains the fusion loop, which promotes the
merging of the host and viral membranes; and Domain III, with its
immunoglobulin-like β-sandwich topology, contains the receptor binding site
for attaching to host cell receptors. [325, 339, 97, 228]
A.4.1.2 Domain III of the Envelope Glycoprotein
Domain III of the E protein for the New York strain has been determined
by NMR spectroscopy and consists of seven anti-parallel β-strands in two
β-sheets. One anti-parallel β-sheet consists of β-strands β1 (Phe299-Asp307),
β2 (Val313-Tyr319), β4 (Arg354-Leu355), and β5 (Lys370-Glu376) arranged so
that β2 is flanked on either side by β1 and β5. The short β4 flanks the end of the
remaining side of β5. The remaining anti-parallel β-sheet is formed from strands
β3 (Ile340-Val343), β6 (Gly380-Arg388), and β7 (Gln391-Lys399) arranged with
β6 at the center. [473]
Domain III of the E protein shows an affinity for the αvβ3 integrin, and
binding of this section of the E protein to the host cell is enough to activate
the signaling pathway that allows viral internalization via phosphorylation of
focal adhesion kinase. The virus may also attach to the host cell by binding
glycosaminoglycans. In laboratory tests, after 5 passages in human
adenocarcinoma (SW13) cells, WNV and Kunjin viruses have amino acid
substitutions in the E protein (Glu138 → Lys for WNV and Glu390 → Gly for
Kunjin virus). Following this change, WNV and Kunjin virus show an affinity
for glycosaminoglycans during entry into cells in tissue culture. [274, 97, 275]
A.4.2 Viral Uncoating and Release
After the internalization of WNV via clathrin-mediated endocytosis, the
nucleocapsid is uncoated and RNA is released into the cytoplasm following a
low-pH (between pH 5.7 and 6.4) fusion of the E protein in the endosomal
vesicle. The endocytosis process triggers a conformational rearrangement in the
E protein, which produces the energy needed to bend the membranes toward each
other. Specifically, the fusion loop is inserted into the outer layer of the host
cell membrane. The E protein then folds back on itself, pointing its C-terminal
α-helical hairpins toward the fusion loop. This process forces the host cell
membrane and the viral membrane against each other, causing the two membranes
to fuse. [177, 242, 117, 65, 228]
A.4.3 Expression and Replication
The endocytic pathway serves as a transit system within cells to move virus
particles to the proper area for viral replication. Since the WNV genome acts
as an mRNA, the viral genome is translated by ribosomes in the region of the
rough endoplasmic reticulum into a single polyprotein without any intermediate
DNA processing. Once synthesized, the polyprotein is cleaved co-translationally
and post-translationally to produce the three structural proteins necessary for
viral assembly as well as the enzymes used during viral replication, including
the polymerase.
Viral replication occurs in the cytoplasm when the WNV RNA-dependent
RNA polymerase, in a complex with the NS3 protein (NS3/NS5), interacts with
the cyclized genome as well as other elements such as the conserved sequence
5'-CACAG-3' at the top of the 3' stem-loop structure and host cell proteins
such as EF-1α and TIAR/TIA-1. The polymerase then copies complementary
minus strands from the original viral RNA. These minus-strand RNAs serve as
templates for the synthesis of new plus-strand RNA, forming an intermediate
double-stranded RNA. The synthesis of plus- and minus-strand RNAs is
asymmetric, with the plus-strand RNA produced in 10- to 100-fold excess over
the minus-strand RNA. [117, 282, 65, 295]
A.4.3.1 Defective Interfering Viruses
During replication, WNV produces defective interfering viruses which are
the result of erroneous genome synthesis. These defective interfering viruses
have genomes with large deletions that may be the result of the polymerase
detaching from the template strand and reattaching at a different position or
attaching to a new incomplete strand. The resulting viruses cannot replicate
themselves without the presence of a homologous, complete virus. By using the
replication enzymes produced by the homologous, complete virus, the defective
interfering viruses decrease the output of infectious progeny as well as increase
their proportion of the resulting progeny. In mouse cell cultures, the produc-
tion of defective interfering viruses may be the result of genetic resistance to
Flavivirus-induced encephalitis. In addition, the production and interference
by defective interfering particles is host cell dependent and occurs early in the
replication cycle. [120, 114, 63]
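The deletion mechanism described above, in which the polymerase detaches from the template and reattaches further along, can be sketched as a string operation. This is a deliberately simplified, hypothetical model: it ignores strand complementarity and reattachment to a different strand, and the function name, positions, and toy template are illustrative only.

```python
def copy_with_template_switch(template, detach_at, reattach_at):
    """Copy template up to detach_at, then resume copying at reattach_at.

    Positions are 0-based. The interval [detach_at, reattach_at) is never
    copied, so it appears as a deletion in the product.
    """
    if not 0 <= detach_at <= reattach_at <= len(template):
        raise ValueError("need 0 <= detach_at <= reattach_at <= len(template)")
    return template[:detach_at] + template[reattach_at:]

# Toy template for illustration only, not real viral sequence.
full_genome = "AAAACCCCGGGGTTTT"
defective = copy_with_template_switch(full_genome, 4, 12)
print(defective)                          # AAAATTTT
print(len(full_genome) - len(defective))  # 8 nucleotides deleted
```

The product keeps both ends of the template but lacks the interior segment, which is consistent with a defective genome that cannot replicate without a complete helper virus.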
A.4.4 Assembly, Maturation and Exiting
After WNV has infected cells, it matures by either of two modes, termed cis
and trans. In the trans mode of maturation, virions are assembled in the rough
endoplasmic reticulum. Mature virions are transported by the host secretory
channel to the cellular membrane and released by exocytosis. During the cis
mode of maturation, virions are assembled in the cytoplasm. The mature virions
are transported via actin filaments and released into the extracellular matrix by
budding at the host cellular membrane. The trans mode of maturation seems to
be the prevalent mode. However, the cis mode of maturation has been observed
in the Sarafend strain of WNV. [334, 92]
A.4.4.1 The trans Mode of Maturation
In the trans mode of maturation, the prM protein and the E protein co-
localize with calnexin, a molecular chaperone that resides within the lumen of
the endoplasmic reticulum. The prM protein, the E protein, the NS1 protein,
and the NS4B protein are all cleaved in the lumen of the endoplasmic reticulum.
The prM, E and the C proteins accumulate in vesicles within the endoplasmic
reticulum as immature virions. Once these proteins have accumulated to suf-
ficient levels, they assemble with the plus-strand RNA to form nucleocapsids.
Following assembly, these nucleocapsids are transported through the Golgi ap-
paratus where the cellular protease furin cleaves prM, cutting off the single
N-linked glycan in the N-terminal pr fragment, which leaves the mature form of
the M protein in the lipid envelope. These mature viruses are then transported
to the cell surface through the secretory channel and released by exocytosis into
the extracellular matrix. [485, 117, 297, 65, 190, 111]
A.4.4.2 The cis Mode of Maturation
In the cis mode of maturation, after the polyprotein is synthesized at the
rough endoplasmic reticulum, the C and E proteins accumulate in the cytoplasm
and are then transported via microtubules to the cellular membrane for assembly
into nucleocapsids. These nucleocapsids accumulate along the outer rim of large
vacuoles, which are near the cellular membrane. Some of the nucleocapsids
may bud into the lumen of the vacuoles, resulting in mature virions. These
mature virions may be released into the extracellular matrix via exocytosis.
Other nucleocapsids are transported via actin filaments to the budding sites on
the cellular membrane and released via budding into the extracellular matrix.
[334, 92]
A.4.4.3 Cell Lysis
The result of the replication cycle of WNV is cell lysis, which can cause
cell death or necrosis. Cell death is dependent upon the infectious dose and the
mode of viral maturation.
Necrosis occurs when the mature West Nile virions bud from the host cell
causing the loss of cell membrane integrity. This occurs when the multiplicity
of infection (the ratio of infectious virus particles to the number of cells being
infected) is high (m.o.i. > 10). Necrosis causes the release of high mobility
group box 1 (HMGB1) protein, which triggers an inflammatory response. [334, 94]
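The multiplicity of infection is a simple ratio, and the threshold quoted above can be applied directly. The sketch below is illustrative; the function names and the particle and cell counts are hypothetical, and the threshold of 10 is the value given in the text.

```python
def multiplicity_of_infection(infectious_particles, cells):
    """Ratio of infectious virus particles to the number of cells being infected."""
    if cells <= 0:
        raise ValueError("cells must be positive")
    return infectious_particles / cells

def is_high_moi(moi, threshold=10.0):
    """True if the m.o.i. exceeds the threshold the text associates with necrosis."""
    return moi > threshold

# Hypothetical experiment: 5e7 infectious particles added to 2e6 cells.
moi = multiplicity_of_infection(5e7, 2e6)
print(moi)               # 25.0
print(is_high_moi(moi))  # True
```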
Figure A.5: Replication Cycle of West Nile Virus
A.5 Pathogenesis
The result of successful replication of WNV is not only new progeny but
also disease. Infection by WNV is typically asymptomatic; however, it can cause
fever as well as infection of the central nervous system. Pathogenesis begins at
the site of inoculation, where WNV replicates in skin cells. The virus is then
carried to the local lymph nodes and disseminated to other organs, including
the brain stem and spinal cord via the bloodstream. The virus may also infect
the peripheral nervous system and then move to the central nervous system.
The ability of WNV to infect the central nervous system is termed neuroin-
vasiveness; the ability of the virus to cause encephalitis is termed neurovirulence.
The term virulence is used to describe the virus if it is neuroinvasive and/or neu-
rovirulent. The structural and non-structural regions of the genome and proteins
have been examined to determine the molecular basis of virulence. Host genetic
factors may also play a role in the pathogenesis of disease.
A.5.1 Viral Molecular Genetics of Virulence
Some strains of WNV show a difference in neuroinvasiveness. For example,
the WNV strains responsible for the outbreaks in North America and Israel
caused a higher than normal death rate in birds, while epidemics in eastern
Europe were not correlated with bird deaths. This effect has also been confirmed
in the mouse model using the Kunjin and Indian West Nile viruses and viruses
from Madagascar and Cyprus. These strains appear to be nonneuroinvasive.
[269, 204, 380, 34]
The E protein of many strains of WNV and Kunjin viruses contains an
N-linked glycosylation motif (Asn-Tyr-Ser, i.e., N-Y-S) from residues 154 to 156.
Experimental evidence in mice and in cell culture suggests that this motif is a
determinant of neuroinvasiveness. Non-glycosylated strains have an attenuated
neuroinvasion phenotype whereas glycosylated strains have enhanced virulence.
In some strains of WNV, the motif is mutated to either the asparagine mutant
(Asn154 → Ser) or the serine mutant (Ser156 → Pro/Ala). These mutations
cause the loss of the glycosylation motif. Specifically, a mutation in the
fusion loop within Domain II of the E protein (Leu107 → Phe) seems to cause
total attenuation, while a combination of mutations in the receptor-binding
region (Ala316 → Val) and in a stem helix (Lys440 → Arg) seems to cause some
attenuation. [405, 34, 423, 36, 190, 504]
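The effect of the asparagine and serine mutants on the glycosylation motif can be checked mechanically. The following sketch is an illustration under stated assumptions, not an analysis from this thesis: the function names are hypothetical, the toy protein is filler sequence rather than a real E protein, and only the exact N-Y-S motif at residues 154 to 156 named in the text is tested.

```python
MOTIF = "NYS"       # Asn-Tyr-Ser, as given in the text
MOTIF_START = 154   # 1-based residue number of the Asn in the E protein

def has_glycosylation_motif(e_protein, start=MOTIF_START, motif=MOTIF):
    """True if the motif occupies residues start..start+len(motif)-1 (1-based)."""
    i = start - 1
    return e_protein[i:i + len(motif)] == motif

def substitute(protein, position, new_residue):
    """Return a copy of protein with the residue at a 1-based position replaced."""
    i = position - 1
    return protein[:i] + new_residue + protein[i + 1:]

# Toy E protein: filler alanines with N-Y-S placed at residues 154-156.
toy_e = "A" * 153 + "NYS" + "A" * 10

print(has_glycosylation_motif(toy_e))                        # True
print(has_glycosylation_motif(substitute(toy_e, 154, "S")))  # False, Asn154 -> Ser
print(has_glycosylation_motif(substitute(toy_e, 156, "P")))  # False, Ser156 -> Pro
```

A substitution outside residues 154 to 156, such as the one at residue 107, leaves the motif check unchanged, which separates loss of glycosylation from the other attenuating mutations listed above.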
However, this conclusion is not consistent across reported results, and
mutations in other genes may contribute to a change in attenuation. For example,
a WNV strain from Mexico, a serine mutant with mutations in the NS4B
protein (Ile245 → Val) and NS5 protein (Thr898 → Ile), has a virulent phenotype.
In addition, isolates from Texas with the glycosylation motif have a conserved
mutation in the NS4B protein (Glu249 → Gly) and an attenuated phenotype.
These isolates were attenuated for neuroinvasiveness, but not for neurovirulence.
[136, 32, 109, 58]
Other mutations in the E gene may also cause changes. When a
nonglycosylated WNV strain from Israel was passaged through Vero cells, the
derivatives gained the N-linked glycosylation site but were attenuated for
neuroinvasiveness. However, this mutation alone did not correlate directly with
attenuation since another mutation at residue 68 (Leu68 → Pro) of the E gene
was also required for attenuation. [82]
Also, viral mutants, which showed amino acid substitutions in the E
protein at residue 138 (Glu138 → Lys) for WNV and residue 390 (Glu390 → Gly)
for Kunjin virus after 5 passages in human adenocarcinoma (SW13) cells, had
decreased neurovirulence and neuroinvasiveness in vitro and in mice. [274]
Some researchers question the correlation between neuroinvasiveness and
glycosylation status. In laboratory experiments when nonglycosylated Kunjin
viruses were serially passaged through Vero cells, the viruses generally gained
the N-linked glycosylation site (a mutation from N-Y-F to N-Y-S at position
154 to 156) by passage 3 or 4 and remained glycosylated during subsequent pas-
sages. This result was less consistent for viruses grown in C6/36 cells. These
glycosylated viruses could cause infection in mice, but there was no clear cor-
relation between neuroinvasiveness and glycosylation status. In addition, the
glycosylation of the Kunjin virus gives it some growth advantage since the virus
grew more successfully and produced 10- to 100-fold more virus in Vero cells
and in C6/36 cells. [404]
Mutations in other nonstructural genes may also affect virulence. For ex-
ample, after passage in C6/36 cells of the WNV prototype B956 strain, the
resulting virus showed 32 mutations consisting of 14 amino acid changes spread
over the entire genome. Most of the mutations were in the NS4A gene, followed
by NS4B. Only three mutations were found in the structural protein region.
This virus also showed reduced cytopathicity and reduced virulence in mice but
did not show reduced growth in Vero cells. Another example is a mutation in
the NS3 protein at residue 249 (Pro249 > Thr) of the Kenyan-3829 strain which
makes this strain temperature sensitive thereby reducing virulence in American
crows. [498, 243]
A.5.2 Host Genetic Factors
Genetic resistance and susceptibility to Flavivirus infection in mice is
controlled by a locus designated Flv, located on chromosome 5. This gene has
been identified as the 2'-5'-oligoadenylate synthetase 1B (Oas1b) gene.
A single mutation, which truncates the gene that encodes 2'-5'-oligoadenylate
synthetase, is responsible for mice being susceptible to WNV disease. [359, 304,
64, 65]
Humans have a similar gene to the mouse Flv gene. When a human host is
infected by a virus and alpha interferon is released in response, the OAS gene
cluster, located on chromosome 12, is activated and 2'-5'-oligoadenylate
synthetase is produced. This enzyme catalyzes the synthesis of
2'-5'-oligoadenylates from ATP. The 2'-5'-oligoadenylate binds to and activates
RNase L, which destroys all RNA within the cell. A single nucleotide
polymorphism at the exon 7 splice-acceptor site is associated with
virus-stimulated enzyme activity, which suggests this gene mutation controls the
differences seen in susceptibility to viral infections. [57]
Also, humans who are homozygous for the defective CCR5 allele known as
CCR5Δ32 carry a risk factor for symptomatic WNV infection. The wild-type
CCR5 functions as a host defense factor. [174]
A.5.3 Clinical Features of Disease in Humans
Serological testing in endemic areas indicates that about 80% of people
infected with WNV are asymptomatic, about 20% develop symptoms associated
with West Nile fever, and less than 1% develop some type of neuroinvasive
disease such as encephalitis, meningitis, or flaccid paralysis, which can have
mortality rates of 12% to 14%. [322, 361, 298, 294]
A.5.3.1 West Nile Fever
West Nile fever is self-limited with symptoms generally lasting for 3 to 6 days
after an incubation period of 3 to 14 days. It manifests as an acute onset of febrile
illness with weakness, muscle pain, and gastrointestinal pain and symptoms such as
vomiting or diarrhea. In rare cases, it causes pancreatitis, headache, fatigue, and
changes in mental status such as difficulty concentrating. Some patients also
report rash on the trunk, arms, or legs. [451, 358, 332, 483, 367, 361, 276] Other
symptoms may include a change in vision caused by optic neuritis, occlusive
vasculitis, uveitis, vitritis, and chorioretinitis. [262, 225, 26, 15, 197, 235] In some
patients, West Nile fever may cause more severe disease requiring hospitalization,
including mechanical ventilation. The median time to fully recover is about 60
days. [212, 479]
A.5.3.2 West Nile Virus Neuroinvasive Disease
West Nile virus infection can progress to a neuroinvasive disease such as
meningitis, encephalitis or a meningoencephalitis combined with muscle and
limb weakness. Other symptoms such as headache, stiff neck, general weakness,
Parkinson's-like movements, difficulty with balance and gait, and a change in
mental state such as confusion and decreased consciousness may also occur.
[332, 75, 413, 357]
West Nile virus-induced meningoencephalitis occurs in only 1 in 140
infections. However, advanced age, immunosuppression and diabetes have been
identified as risk factors for severe neurologic disease and even death. [322, 332,
91, 361, 430]
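The proportions reported in Section A.5.3 can be turned into expected case counts for a given number of infections. This is a back-of-the-envelope sketch only: the rates are the approximate figures quoted in the text, and the function name and the cohort size of 14,000 infections are hypothetical.

```python
def expected_cases(infections):
    """Expected counts using the approximate rates quoted in the text."""
    return {
        "asymptomatic": infections * 80 / 100,     # about 80%
        "west_nile_fever": infections * 20 / 100,  # about 20%
        "meningoencephalitis": infections / 140,   # 1 in 140 infections
    }

cases = expected_cases(14_000)
print(cases["asymptomatic"])         # 11200.0
print(cases["west_nile_fever"])      # 2800.0
print(cases["meningoencephalitis"])  # 100.0

# 1 in 140 expressed as a percentage, for comparison with "less than 1%".
meningoencephalitis_percent = 100 / 140
print(round(meningoencephalitis_percent, 2))  # 0.71
```

A rate of 1 in 140 is roughly 0.71%, which is consistent with the statement that less than 1% of infections result in neuroinvasive disease.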
A.6 Potential Treatments
Anecdotal evidence suggests administration of intravenous immunoglobulin
might work to control WNV infection in humans. A 70-year-old woman in
Israel was admitted to the hospital because of fever and vomiting attributed to
WNV infection. She soon became comatose. Because her history of chronic
lymphatic leukemia suggested that she was already immunosuppressed,
she was treated with intravenous immunoglobulin from donors in Israel, which
contained high titers of antibodies. She dramatically improved and her level of
consciousness returned to normal. No controlled clinical trials are underway or
have been completed to test this treatment option. [421]
However, research indicates that passive immunization might work as a
treatment for WNV infection if the immunoglobulin is given in a time- and
dose-dependent manner. In [38], WNV-infected mice were given human
immunoglobulin prepared from pooled blood that contained WNV-specific
antibodies obtained from healthy Israeli donors and were treated successfully.
When antibodies are administered to wild-type and T- and B-cell-deficient mice
prior to WNV infection, morbidity in wild-type mice is prevented. [135, 118]