Citation
Investigating transposable element landscapes in snake genomes

Material Information

Title:
Investigating transposable element landscapes in snake genomes
Creator:
Hall, Kathryn Teresa
Publication Date:
Language:
English
Physical Description:
xi, 95 leaves : ; 28 cm

Subjects

Subjects / Keywords:
Transposons ( lcsh )
Copperhead -- Genetics ( lcsh )
Burmese python -- Genetics ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 91-95).
General Note:
Department of Integrative Biology
Statement of Responsibility:
by Kathryn Teresa Hall.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
747033102 ( OCLC )
ocn747033102
Classification:
LD1193.L45 2011m H34 ( lcc )

Full Text
INVESTIGATING TRANSPOSABLE ELEMENT LANDSCAPES IN SNAKE GENOMES
By
Kathryn Teresa Hall
B.S., University of Colorado Colorado Springs, 2002
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Biology
2011


This thesis for the Master of Science
degree by
Kathryn Teresa Hall
has been approved by
Date
Timberley M. Roane


Hall, Kathryn Teresa (M.S., Biology)
Investigating Transposable Element Landscapes in Snake Genomes
Thesis directed by Assistant Professor Michele L. Engel
ABSTRACT
Initially, we conducted a comprehensive assessment of genomic repeat content in two
snake genomes, the venomous copperhead (Agkistrodon contortrix) and the Burmese
python (Python molurus bivittatus). These two genomes are both relatively small (~1.4
Gb), but have surprisingly extensive differences in the abundance and expansion histories
of their repeat elements. In the python, the readily identifiable repeat element content is
low (21%), similar to bird genomes, whereas that of the copperhead is higher (45%),
similar to mammalian genomes. The copperhead's greater repeat content arises from the
recent expansion of many different microsatellites and TE families, and the copperhead
had 23-fold greater levels of TE-related transcripts than the python. This suggests the
possibility that greater TE activity in the copperhead is ongoing. Expansion of CR1 LINEs in
the copperhead genome has resulted in TE-mediated microsatellite expansion at a scale
several orders of magnitude greater than previously observed in vertebrates. Snakes also
appear to be prone to horizontal transfer of TEs, particularly in the copperhead lineage.
The reason that the copperhead has such a small genome in the face of so much recent
expansion of repeat elements remains an open question, although selective pressure
related to extreme metabolic performance is an obvious candidate.


We expanded our transposable element investigation to include 10 other snakes from
diverse lineages. The resulting transposable element landscapes are dynamic, even
among closely-related species. CR1 and Bov-B LINE families appear to have been
particularly important in shaping snake genomes, and we found at least three separate
lineages of CR1 active in key groupings of snakes. Additionally, we are using
amplicon sequencing to target specific regions of transposable elements across a range of
snake species at key points in the expansion history of these two elements.
This abstract accurately represents the content of the candidate's thesis. I recommend its
publication.
Signed
Michele L. Engel


DEDICATION
I dedicate this thesis to people who helped me out, and to the friends who kept me sane
throughout the ordeal. Thanks, Mom and Dad, for the moral (and occasionally financial)
support. Thanks, Beth, for driving me around when I dislocated my elbow. Thanks, Sarah
and Lynn, for pet-sitting when things got crazy. Thanks, Charlotte, Chide, Jill, Ken, Suzi,
and Vijetha for all the coffee runs and lunch-time stories. And thanks, Todd, for all the
guidance from someone who's been there, done that, and lived to tell the tale.


ACKNOWLEDGEMENT
I would like to thank my advisors, Dr. David Pollock, Dr. Michele Engel, and Dr. Timberley
Roane for their support. I would also like to thank the people who aided or collaborated
with me in my research: Todd A. Castoe, Marcel L. Guibotsy Mboulas, Wanjun Gu, A. P.
Jason de Koning, Samuel E. Fox, Alexander W. Poole, Vijetha Vemulapalli, Juan M. Daza,
Todd Mockler, Eric N. Smith, Cedric Feschotte, and David D. Pollock.
Chapter 2 was written as a collaborative paper. The order of authors for chapter 2 is as
follows: Todd A. Castoe, Kathryn T. Hall, Marcel L. Guibotsy Mboulas, Wanjun Gu, A. P.
Jason de Koning, Samuel E. Fox, Alexander W. Poole, Vijetha Vemulapalli, Juan M. Daza,
Todd Mockler, Eric N. Smith, Cedric Feschotte, and David D. Pollock.
We acknowledge the support of the National Institutes of Health (NIH; GM083127) to
DDP, GM77582 to CF, an NIH training grant (LM009451) to TAC, and National Science
Foundation support (DEB-0416160) to ENS. We acknowledge Roche-454 for contributing
the sequencing of the cDNA libraries based on a proof of principle grant to DP and TC. We
thank Carl Franklin from the Amphibian and Reptile Diversity Research Center at the
University of Texas at Arlington for assistance obtaining snake samples.


TABLE OF CONTENTS
Figures
Tables
Chapter
1. Introduction
1.1 Background Information on Repetitive Content and Transposable Elements
2. Discovery of Highly Divergent Repeat Landscapes in Snake Genomes Using High Throughput Sequencing
2.1 Introduction
2.2 Methods
2.2.1 Shotgun Library Creation and Sequencing
2.2.2 Repeat Analyses
2.2.3 cDNA Sequencing and Analysis
2.3 Results
2.3.1 Identification of Interspersed and Tandem Repeats
2.3.2 Transposable Element Landscapes
2.3.3 Evidence for Horizontal Transfer of Transposable Elements
2.3.4 Simple Sequence Repeat (SSR) Structure
2.3.5 Microsatellite Seeding by CR1 LINEs
2.3.6 Evidence of Transposable Element Transcriptional Activity
2.4 Discussion
3. Repeat landscape evolution and LINE dynamics across snake genomes
3.1 Introduction
3.2 Methods
3.2.1 Primer Design
3.2.2 PCR and Sequencing Preparation
3.2.3 Shotgun Sequence
3.2.4 Repeat Analyses on Shotgun Sequence
3.2.5 Consensus Sequences and TE Alignments from Shotgun Sequence
3.2.6 Bayesian Analysis
3.2.7 Coverage
3.3 Results
3.3.1 Repeat Content
3.3.2 Coverage
3.3.3 Origins
3.3.4 Expansion History
3.4 Discussion
4. Conclusion
Appendix
A. Supplementary Methods for Chapter 2
B. Supplementary Tables for Chapter 2
C. Supplementary Figures for Chapter 2
D. Supplementary Tables for Chapter 3
Bibliography


LIST OF FIGURES
Figure
2.1 Comparison of Repeat Analyses
2.2 Comparison of the Family-Specific Repeat Content in Copperhead (Agkistrodon) and Python (Python) Genomes
2.3 Sequence Divergence of Selected Transposable Elements
2.4 Depth of Sequence Coverage
2.5 Sequence Divergence Between SPIN DNA Transposon (MITE) Sequences in the Copperhead Genome
2.6 Expansion of Specific SSR Motifs in the Copperhead Genome
2.7 Relative Frequencies of Transposable Elements in Liver cDNA Transcripts
3.1 PCR Results from Proof-of-Principle Primer Reactions
3.2 Barcoded Primers Designed with 454 Adapters
3.3 Phylogeny of Snakes Used for RepeatMasker/RepeatModeler Analyses
3.4 Overall Transposable Element Content Identified by RepeatMasker/RepeatModeler
3.5 The Most Common LTRs in Snakes, by Percentage of Genome
3.6 Common DNA Transposons in Snakes, by Percent Genomic Content
3.7 BovB and CR1 LINE Content Across the Snake Phylogeny by Percent of Genomic Sequence
3.8 Snake BovB Coverage from Shotgun Sequence
3.9 Log Ratio of BovB 3' to 5' Coverage
3.10 Snake L3 CR1 Coverage from Shotgun Sequence
3.11 L. dulcis Non-L3 CR1 Coverage from Shotgun Sequence
3.12 BovB Tree from Preliminary Bayesian Analysis
3.13 CR1 Tree from Preliminary Bayesian Analysis
3.14 Sequence Divergence of Selected Transposable Elements
3.15 Summary of Truncation in BovB and CR1
C.1 Comparison Between RepeatMasker-RepBase Annotations of Two Different Shotgun Sequence Libraries for the Python
C.2 Comparison of the Repeat Element Annotations of RepeatMasker-RepBase and the De Novo Repeat Identification of RepeatModeler-RepClass for the Copperhead
C.3 Comparison of the Repeat Element Annotations of RepeatMasker-RepBase and the De Novo Repeat Identification of RepeatModeler-RepClass for the Python
C.4 Quality Score Distribution for Shotgun Sequence
C.5 Comparison of Repeat Annotations Between Snakes and the Anolis Lizard
C.6 Inferred Phylogeny of BovB LINE Elements in Tetrapod Genomes, Based on Consensus Sequences Available in RepBase and Including All Bov-B Sequences from Squamate Reptiles Available on Genbank
C.7 Frequencies of SSR Loci in the Two Snakes and the Anolis Lizard
C.8 Linear regressions comparing the frequencies of simple sequence repeat (SSR) loci per Mbp between species, broken down by SSR sequence motif
C.9 Summary of Evidence Demonstrating that the Sequence Identified as a Bov-B Va LINE in Vipera ammodytes is in Fact a Chimeric Sequence Composed of a Bov-B LINE Flanked by Two Snake1 CR1 LINE Fragments


LIST OF TABLES
Table
3.1 Barcoded PCR primers with 454 adapters
3.2 Primer pairs used for each species
3.3 Picomoles of barcoded DNA purified per product and pooled for sequencing
B.1 Summary of sequencing results and repeat identification analyses
B.2 Statistics on Agkistrodon TE superfamilies identified and classified by the RepClass Homology module, and the degree of fragmentation of elements estimated de novo classified by RepClass
B.3 Statistics on Python TE superfamilies identified and classified by the RepClass Homology module, and the degree of fragmentation of elements estimated de novo classified by RepClass
B.4 Numbers of non-redundant families of repeat elements classified by RepClass analyses of de novo identified repeats from RepeatModeler
B.5 Details of consensus sequences (of "families") from RepeatModeler that were interrogated by RepClass, and the numbers that were successfully classified by different modules of the analyses
B.6 Combined library-based annotation of repeats in the Python genome
B.7 Combined library-based annotation of repeats in the Agkistrodon genome
B.8 Preliminary comparison of repeat annotations between Python molurus, Agkistrodon contortrix, and the Anolis lizard
B.9 Linear regression correlation coefficient (r2) comparing SSR loci frequencies (by sequence motif) between squamate species
B.10 Results of BlastN search using the copperhead Viper CR1 LINE consensus sequence as a query against the non-redundant (NR) database constrained to only find sequences originating from snakes
B.11 Summary of cDNA hits to known transposable elements
D.1 Summary of sequencing results for 12 snakes
D.2 Basic summary of repeat identification analyses for 12 snakes


1. Introduction
Snakes are an incredibly diverse group of organisms, ranging from the tiniest blindsnake
to the enormous anaconda. They are model organisms for metabolic change and organ
hypertrophy, capable of increasing and decreasing organ mass in response to feeding
and digestion, and can increase metabolic rates from a low resting rate to a high
metabolic rate in response to the influx of nutrients (Secor and Diamond 1995; Secor and
Diamond 1998). Multiple lineages possess venom glands, but both the venom
components and delivery mechanisms differ across the phylogeny. At the base of the
snake lineage, a rapid burst of evolution occurred in key mitochondrial proteins, whereas
this amino acid sequence is highly conserved in most vertebrate lineages (Castoe et al. 2009b).
Yet for all the variability among snakes, little is known about their genomes. The
squamates, the order that includes all snakes, are represented by the lizard Anolis
carolinensis.
In these two studies, shotgun sequencing was used to sample small fractions of snake
genomes and this sequence was analyzed to better understand their genomic content
and structure. As transposable elements (TEs) compose the majority of genomic content
in vertebrates, the primary focus was on TE content. This investigation began as part of a
larger project to examine in greater detail the genomic content of two snake species,
Agkistrodon contortrix and Python molurus, which diverged from one another about 100
million years ago at approximately the same time as Eutherian mammal radiation
occurred. This initial study was a collaborative effort, with full author list (in order) as
follows: Todd A. Castoe, Kathryn T. Hall, Marcel L. Guibotsy Mboulas, Wanjun Gu, A. P.
Jason de Koning, Samuel E. Fox, Alexander W. Poole, Vijetha Vemulapalli, Juan M. Daza,
Todd Mockler, Eric N. Smith, Cedric Feschotte, and David D. Pollock.
The first study revealed two differing genomic landscapes in A. contortrix and P. molurus,
and thus a broader investigation of snake lineages was warranted. Furthermore, two
high-copy TEs were identified for further study. Repeat analysis was performed on
shotgun sequence for 10 additional snake species to obtain a broader picture of the
varied transposable element landscapes across the phylogeny, and a more in-depth
investigation was launched into the nature of BovB and CR1 LINEs, with the goals of
improving resolution at key evolutionary time points with respect to these elements and
understanding how their expansion relates to snake evolutionary history and genomic
structure.
1.1 Background Information on Repetitive Content and Transposable Elements
Genomes contain multiple different types of repetitive content. Three main types of
repetitive content are microsatellites, segmental duplications, and transposable
elements. The smallest of these, microsatellites or simple sequence repeats (SSRs), are
tandem duplications of short motifs, such as AATAATAATAAT(AAT)n, and the total length
of the microsatellite region can be hundreds of nucleotides long. The largest form of
repetitive content is segmental duplication, wherein an entire region of a chromosome is
duplicated. The third category, transposable elements, range from ~80 bp to ~11,000 bp
in length and are numerous in vertebrate genomes.
Transposable elements are divided into two main categories: DNA transposons and
retrotransposons. DNA transposons move via excision and insertion, remaining DNA
through the entire process, whereas retrotransposons are first transcribed into RNA and
then copied back into the genome. The major subcategories of retrotransposons found in
snakes are: LTRs (Long Terminal Repeats), LINEs (Long Interspersed Nuclear Elements),
SINEs (Short Interspersed Nuclear Elements), and PLEs (Penelope-like Elements). LINEs
and SINEs have characteristics of relevance to further discussion; LINEs insert in the 3' to
5' direction (Cordaux and Batzer 2009), although the details of this process are not well
understood, and SINEs are non-autonomous elements, depending on active LINEs to
provide the machinery for transposition.


2. Discovery of Highly Divergent Repeat Landscapes in Snake Genomes Using High
Throughput Sequencing
2.1 Introduction
Among vertebrates, the snake lineage represents an impressively speciose (~3100 sp.)
and phenotypically diverse radiation, and as a result snakes have become increasingly important
model systems for diverse research areas. Snakes provide a unique model system for
studying extreme physiological remodelling and metabolic cycling (Secor and Diamond
1995; Secor and Diamond 1998) and in venom-related research (Fry et al. 2006; Ikeda et
al. 2010). Snakes have also become important models for developmental biology,
evolutionary ecology, and molecular evolution and adaptation (Cohn and Tickle 1999; Fry
et al. 2006; Castoe et al. 2008; Vonk et al. 2008; Castoe et al. 2009a). Despite the
importance of snakes as models for basic and biomedical research, there is little known
about the genomes of snakes, and about reptile genomes in general (Shedlock et al. 2007;
Janes et al. 2010).
Our aim here was thus to obtain comprehensive sequence-based comparative insight into
snake genomic diversity, particularly the diversity and structure of the repetitive element
landscape. Such insight is important because repetitive elements comprise major portions
of vertebrate genomes and exert major influences over genome evolution. Among
repetitive sequences, transposable elements (TEs) in particular have had a tremendous
impact on the structural and functional evolution of genes and genomes. Numerous
studies have documented how TE activity and ectopic recombination between TE copies
promote small- and large-scale variation in the structure of genomes; such
rearrangements provide a substrate for the emergence of new functional sequences, both
coding and non-coding, including the birth of new protein-coding genes and the rewiring
of regulatory networks (Feschotte 2008; Cordaux and Batzer 2009; Herpin et al. 2010).
Our current understanding of vertebrate TE diversity and evolutionary dynamics is,
however, dominated by perspectives from mammalian, and to a lesser extent, avian
genomes.
The speciose nature and evolutionary age of the snake radiation make it an excellent
amniote lineage for comparisons to mammals. Snakes and mammals share a common
ancestor ~310 MYA, and snakes diverged from other squamate reptiles ~170 MYA
(Castoe et al. 2009b), which slightly predates the estimated split of eutherian (placental)
and metatherian (marsupial) mammals. In this study we chose to focus on two fairly
distantly related snake species, the Burmese python (Python molurus bivittatus), and the
copperhead (Agkistrodon contortrix); these two lineages share a common ancestor at
about the same time as do all eutherian mammals, around 100 MYA (Castoe et al. 2009b).
In comparison to mammalian genomes, snake genomes are generally small (Gregory et al.
2007), ranging from 1.3 gigabases (Gbp) to 3.8 Gbp and averaging 2.1 Gbp, and the two
snakes chosen both have similar sized genomes, on the small side of this range (~1.4
Gbp). These two snakes also represent important lineages for research. The Burmese
python (Python molurus bivittatus) is an important model for physiological and metabolic
adaptation, and the copperhead (Agkistrodon contortrix) is a model for metabolic
adaptation and a viperid model for studies related to venom. Although distantly related,
these two lineages (pythons and viperids) have convergently evolved extremely dynamic
metabolisms to facilitate the infrequent consumption of large prey (Secor and Diamond
2000).
To gain insight into the repeat landscapes of these two genomes and the evolutionary
processes that have shaped them, we obtained low-coverage 454 high-throughput
sequencing data from genomic shotgun libraries, as well as 454 transcriptome sequence
showing evidence of TE transcriptional activity in both species. Using these data, we
analyzed transposable element and simple sequence repeat (SSR) content, diversity, and
origins. The results reveal extraordinary differences between these two snakes. They also
contribute to a broader understanding of vertebrate genome evolution and diversity by
beginning to show how snake genomes compare to one another and to other vertebrates.


2.2 Methods
2.2.1 Shotgun Library Creation and Sequencing
Whole genome random shotgun libraries were made from two snake species, Agkistrodon
contortrix (the copperhead), and Python molurus bivittatus (the Burmese python). Total
DNA was prepared from liquid-nitrogen snap-frozen liver tissue by standard phenol-
chloroform-isoamyl alcohol extraction methods. 454 FLX-LR and 454 Titanium-XLR
genomic shotgun libraries were prepared using the 454 shotgun library preparation kit
and protocol (Roche). Libraries were sequenced on the Roche 454-FLX sequencing
platform. From the Agkistrodon FLX-LR shotgun library, 60.3 Mbp (megabases) from
280,303 sequence reads were collected using Roche/454 FLX-LR sequencing kits,
amounting to about 4.5% of the estimated 1.35 Gbp (Gregory et al. 2007) genome (Table
B.1). Two libraries from the same individual (one FLX and one Titanium) were sequenced
for the Python; from the FLX-LR library, we sequenced 61,256 reads, totaling 13.3 Mbp,
and from the Titanium-XLR library we sequenced 57,717 reads, totaling 15.2 Mbp. In sum,
28.5 Mbp of Python sequence from 118,973 reads were collected, representing ~2.0% of
its estimated 1.42 Gbp genome (based on estimates of the related P. reticulatus genome,
Table B.1). Comparisons of repeat annotations of FLX-LR and Titanium-XLR data for Python
indicated extremely little difference between the two data types (Figure C.1). Sequence
reads with similarity to the mitochondrial genome were filtered out prior to analyses (see
Supplementary Methods in Appendix A).
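As a quick arithmetic check on the sampling fractions reported above, the following minimal Python sketch recomputes them from the figures quoted in this section. The function name `percent_sampled` and the variable names are illustrative assumptions and are not part of the original analysis pipeline.

```python
# Figures quoted in Section 2.2.1; genome sizes are the estimates cited there.
MBP_PER_GBP = 1000.0


def percent_sampled(sequenced_mbp: float, genome_gbp: float) -> float:
    """Return sequenced bases as a percentage of the estimated genome size."""
    return 100.0 * sequenced_mbp / (genome_gbp * MBP_PER_GBP)


# Python: two libraries from the same individual are pooled.
python_mbp = 13.3 + 15.2          # FLX-LR + Titanium-XLR, in Mbp
python_reads = 61_256 + 57_717    # FLX-LR + Titanium-XLR, in reads

copperhead_pct = percent_sampled(60.3, 1.35)
python_pct = percent_sampled(python_mbp, 1.42)

print(f"Python total: {python_mbp:.1f} Mbp from {python_reads:,} reads")
print(f"Copperhead sampled: {copperhead_pct:.1f}% of genome")
print(f"Python sampled: {python_pct:.1f}% of genome")
```

Under these inputs the pooled Python data come to 28.5 Mbp from 118,973 reads, and the sampled fractions round to about 4.5% (copperhead) and 2.0% (python), matching the values stated in the text.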
2.2.2 Repeat Analyses
We used the current release of the Tetrapoda RepBase (version 12.12, 01-17-2008) as the
repeat library for RepeatMasker (Smit, Hubley, and Green 2004) to identify known repeat
elements in the snake genomes. For comparisons, we also ran RepeatMasker on ~50 Mbp
of the Anolis genome (AnoCar1.0) from four combined genome scaffolds. For SSR analysis,
we used a previously written Perl script (Castoe et al. 2010) that was modified to identify
SSR loci with repeated sequence motifs of 2-6 bases in length, and a minimum of 12
bases in length (for 2-4mers) or containing 3 or more tandem repeats (for 5-6mers). We
used the program RepeatModeler (A. Smit, unpublished) to identify de novo repeat
sequences in our snake datasets, based on the run parameters suggested as defaults by
the program. The approach essentially couples two de novo repeat finding methods,
RECON (Bao and Eddy 2002) and RepeatScout (Price, Jones, and Pevzner 2005), together
with Tandem Repeat Finder (Benson 1999). We modified RepeatModeler's RepeatMasker
parameters to specify the Tetrapoda library. For all RepeatModeler analyses, we
combined the new Python and Agkistrodon libraries into a single joint snake library to
recover as many elements as possible, and control for differences in sequencing depth.
Consensus sequences from RepeatModeler were classified using RepClass (Feschotte et
al. 2009). By identifying novel "families" that hit the same known TE family we were able
to reduce the original count of new family consensus sequences by 18.6% and 20.3% in
Python and Agkistrodon respectively (see Supplementary Methods in Appendix A; Table
B.2-B.3). We also used the program P-clouds for identifying de novo repeats, with the
following parameter settings: 2, 3, 6, 12 and 24 for low, core, step-1, step-2 and step-3
cutoffs (Gu et al. 2008). Details are provided in Supplementary Methods (Appendix A).
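The SSR selection criteria described above (motifs of 2-6 bases; at least 12 bases in total for 2-4mers, or at least 3 tandem copies for 5-6mers) can be illustrated with a short sketch. This is not the Perl script of Castoe et al. (2010); it is a minimal Python illustration that assumes perfect (uninterrupted) repeats and skips motifs that are themselves repeats of a shorter unit. All function names are hypothetical.

```python
import re


def min_copies(k: int) -> int:
    # 2-4mers: at least 12 bases in total; 5-6mers: at least 3 tandem copies.
    return -(-12 // k) if k <= 4 else 3


def is_primitive(motif: str) -> bool:
    # False when the motif is a repeat of a shorter unit (e.g. ATAT or AAA).
    return not any(
        len(motif) % d == 0 and motif == motif[:d] * (len(motif) // d)
        for d in range(1, len(motif))
    )


def find_ssrs(seq: str):
    """Yield (start, end, motif, copies) for perfect SSR loci in seq."""
    seq = seq.upper()
    for k in range(2, 7):
        # A k-base motif followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies(k) - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                yield m.start(), m.end(), motif, (m.end() - m.start()) // k


example = "GG" + "AC" * 6 + "GG" + "AAT" * 4 + "G" + "ACGTC" * 3 + "AT" * 3
for locus in find_ssrs(example):
    print(locus)
```

In this toy sequence the (AC)6, (AAT)4 and (ACGTC)3 tracts pass the thresholds, whereas the trailing (AT)3 tract, at 6 bases, does not. Imperfect repeats and motif normalization across strands or rotations are deliberately left out of the sketch.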
2.2.3 cDNA Sequencing and Analysis
RNA was extracted, poly-A enriched and cDNA libraries were prepared from Agkistrodon
contortrix and Python molurus liver tissue samples using standard techniques. Libraries
were bi-directionally sequenced using the FLX-LR reagents on the 454 FLX instrument. All
steps were carried out on both samples, side-by-side, from RNA extraction through
sequencing. Transcript contigs were assembled using the 454 GSAssembler, and we
searched for TE sequences using our snake-specific libraries in RepeatMasker. Details
are provided in Supplementary Methods.
2.3 Results
The 60.3 Megabases (Mbp) sequenced from the A. contortrix (copperhead hereafter, for
readability in the text) random shotgun sequence library amounts to about 4.5% of its
estimated 1.35 Gbp genome, while the 28.5 Mbp sequenced from the P. molurus (python
hereafter) represents about 2.0% of its estimated 1.42 Gbp genome (Table B.1; NCBI
Sequence Read Archive accession SRA029568.1). Distributions of base call quality scores
are very similar for both species, facilitating direct comparisons between the data for
both species (Figure C.4). To obtain a preliminary understanding of repetitive content in
these genomes, we first analyzed the frequencies of 15mers in 28 Mbp samples from
both genomes (consisting of a random 28Mbp sample from the copperhead, and all of the
python data). We chose 15mers because, by chance, any one 15mer should occur only
about once in a genome of this size, making high-copy 15mers extremely infrequent by
chance and thus indicative of repetitive elements (Gu et al. 2008). Both species contained
similar amounts of single copy 15mers (15,364,028 in the python and 17,186,377 in
the copperhead), but the python had more low-to-moderate copy number 15-mers (i.e.,
2-10 copies; Figure 2.1A). In contrast, the copperhead had more high-copy 15-mers
(Figure 2.1A), suggesting that it has more highly similar (recently expanded) high copy
repeat elements, while python repeat elements are older and/or fewer in numbers, with
greater average divergence among elements. Analysis of the anole lizard (Anolis
carolinensis) genome revealed a 15mer profile similar to the copperhead (Figure 2.1),
suggesting that it too has a substantial number of recently expanded repeats, and thus a
repetitive landscape more similar to the copperhead than the python.
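The reasoning behind the choice of 15mers can be made concrete with a small calculation: there are 4^15 (about 1.07 x 10^9) possible 15mers, which is close to the number of positions in a ~1.4 Gbp genome. The sketch below is illustrative only. It assumes a random sequence with equal base frequencies read on a single strand, and the function names are hypothetical rather than taken from the P-clouds software.

```python
from collections import Counter


def expected_copies(genome_bp: float, k: int) -> float:
    """Expected occurrences of one specific k-mer in a random sequence
    with equal base frequencies (single strand, edge effects ignored)."""
    return genome_bp / 4 ** k


def kmer_spectrum(seq: str, k: int) -> Counter:
    """Map copy number -> number of distinct k-mers seen that many times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(counts.values())


# Roughly one expected copy of any given 15mer in a ~1.4 Gbp genome.
print(f"{expected_copies(1.4e9, 15):.2f}")

# Toy spectrum with k = 4: repeated k-mers show up at copy numbers above 1.
toy = "ACGTACGTACGTTTGCA"
print(sorted(kmer_spectrum(toy, 4).items()))
```

A spectrum of this form (number of distinct k-mers at each copy number) is the quantity plotted in Figure 2.1A. In a real analysis the counts would come from the sequence reads rather than a toy string.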


[Figure 2.1, panel B. Bars: Agkistrodon, Python. X-axis: Percent of Genome (0 to 100). Legend: RepeatModeler/Masker Only; Pclouds + RepeatModeler/Masker; Pclouds Only; Unannotated.]
Figure 2.1. Comparison of Repeat Analyses. Equal size samples of the genomes of the two
snakes, copperhead (Agkistrodon) and python (Python), and the lizard (Anolis) were
considered. The data analyzed were first stripped of simple repeat sequences using
RepeatMasker/RepBase. A) The frequency of each different 15mer sequence was
counted, and the number shown is the number of different 15mers having a particular
count. B) The repeat annotation methods (P-clouds and RepeatModeler/RepeatMasker) in
the two snakes were compared to determine the percent of the genome (in nucleotides)
that was masked by either method alone, both methods, or neither method (i.e.,
remained unannotated). C) The size distribution of P-clouds results in the two snakes is
shown.


2.3.1 Identification of Interspersed and Tandem Repeats
The common method of identifying recognizable repeat elements is to scan sequences
using a library of known repeat element consensus sequences (e.g., RepBase (Jurka 2000))
with sequence similarity-based algorithms such as RepeatMasker (Smit, Hubley, and
Green 2004). Such homology-based methods cannot recognize elements not in the
database, and have low power to identify repeat element fragments, even up to 200 bp,
from moderately diverged repeat families (W Gu, APJ de Koning, TA Castoe, DD Pollock,
unpublished data). For this reason, the identification of interspersed and tandem repeats
in snakes is problematic because there are no closely related organisms in which repeat
elements have been studied in-depth, and thus there are no snake-specific repeat
libraries available. Since unassembled next-generation sequence reads likely contain
many (perhaps mostly) TE sequence fragments, we utilized a three-pronged strategy to
evaluate repeat content and maximize detection and classification of repeat elements:
first, we applied the P-clouds method, which can estimate the repeat fraction without
knowing a priori what the repeat elements are, and which is relatively powerful at
identifying short repeat fragments (Gu et al. 2008); second, we utilized the "Tetrapoda"
repeat consensus library in RepBase to detect similar sequences by homology searching;
and third, we used RepeatModeler (Smit, unpublished) to identify new snake-specific
repeat element clusters ("families"). We analyzed the output of RepeatModeler with
RepClass (Feschotte et al. 2009), a tool that automates the classification of newly
discovered TEs (Figures C.2-C.3).
Previous analyses using the P-clouds method have estimated genomic repetitive content
in repeat-poor bird genomes at around 40% (Warren et al. 2010), in more repeat-rich
genomes of the Anolis lizard at 73.7% (Warren et al. 2010), and in the human and panda
genomes at about 70% (Li et al. 2010). In comparison, the P-clouds estimate of the
repetitive content of the python genome is 39.8%, similar to bird genomes, while the
predicted copperhead repetitive content is 55.2%, intermediate between birds and
mammals/lizards (Table B.1).
Only 4.48% of the python, and 11.81% of the copperhead (Table B.1) sequence, were
identified as readily classifiable repeats (SSRs, tandem repeats, low-complexity
sequences, and known TEs from RepBase). An additional 16.73% of the python and
32.77% of the copperhead sequence were identified as repeat elements using the newly-
identified snake-specific repeat library from RepeatModeler (Figures 2.IB, C.2-C.3), and
about one third of these new "families" identified in both the copperhead and python
were classified by RepCIass into known TE classes (655/1996 and 203/571, respectively;
Tables B.2-B.7). It is important to note that the same snake-specific repeat library was
used to annotate both species, and was derived from running RepeatModeler on data
from each species independently and then combining the resulting libraries; this
approach was designed to increase sensitivity and decrease bias due to different amounts
of genomic sampling in the two species.
All methods agree that the copperhead has considerably more detectable repetitive
sequence than the python, and a large part of this repetitive fraction in both species
arises from recognizable transposable elements. Altogether about 45% of the copperhead
was annotated by the homology-based methods compared to only 21% of the python
genome; in contrast, the two estimates from P-clouds are 55% and 40%, respectively (Figure
2.1B; Table B.1). We expect P-clouds to be more sensitive than homology-based methods
for identifying more divergent and fragmented repeat elements, and the observation that
the estimated python repetitive content is nearly twice as high based on P-clouds analysis
indicates that much of its repetitive content may be older and/or more fragmented than
that of the copperhead. There is substantial overlap between the methods (Figure 2.1B),
and most (~67-76%) of the regions uniquely annotated by P-clouds are fairly long (> 50
bp; Figure 2.1C). The P-clouds-unique sequences may also include non-TE derived
sequences such as tandem duplications or large multi-gene families, but given the level of
sampling (< 5% of each genome) and the makeup of known complete genomes, we
expect that these types of repeats make up a fairly small percentage of the P-clouds
annotation. These results support the conclusion from the 15mer profile analysis that the
copperhead has many homogeneous transposable element sequences compared to the
more diverged and/or lower frequency repeats in the python.
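The relationship between the percentages quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the values reported in the text; the variable names are ours, and it assumes the two homology-based categories do not overlap, so that they can simply be summed.

```python
# Percent of each genome sample, as reported in the text above.
readily_classifiable = {"python": 4.48, "copperhead": 11.81}  # RepBase TEs, SSRs, etc.
snake_specific = {"python": 16.73, "copperhead": 32.77}       # RepeatModeler library
pclouds = {"python": 39.8, "copperhead": 55.2}                # P-clouds estimates

homology_total = {}
pclouds_ratio = {}
for species in ("python", "copperhead"):
    # Assumption: the two homology-based categories are non-overlapping.
    homology_total[species] = readily_classifiable[species] + snake_specific[species]
    pclouds_ratio[species] = pclouds[species] / homology_total[species]
    print(f"{species}: homology-based {homology_total[species]:.2f}%, "
          f"P-clouds/homology ratio {pclouds_ratio[species]:.2f}")
```

The sums round to the 21% and 45% quoted above, and the ratio for the python is close to 2 while that for the copperhead is much nearer to 1, which is the contrast the paragraph interprets.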
2.3.2 Transposable Element Landscapes
The joint annotation of previously known RepBase elements, together with newly
identified elements classified using RepClass (Feschotte et al. 2009), revealed a
substantial diversity of TEs in snake genomes (Figure 2.2 and Table B.1). Almost all types
of repeats appear to occur more frequently in the copperhead than the python (Figure
2.2; Tables B.6-B.7), but the breadth of diversity in each of the snakes was similar, with
most subclasses and superfamilies (e.g., Bov-B LINEs, DIRS1) found in the copperhead also
represented in the python (Figure 2.2). Although repetitive elements in the Anolis lizard
genome have not yet been thoroughly annotated, the diversity of repeat types observed
in the snakes was broadly similar to preliminary estimates of the diversity of repeat
elements in the anole lizard genome (Figure C.5, Table B.8). Comparing the two snakes,
there are substantial differences in transposable element abundance (Figure 2.2; Table
B.8). The greatest difference lies in the abundance of CR1 LINEs, which are more than four
times as frequent in the copperhead; Bov-B LINEs, by contrast, are similarly
abundant in both genomes (Figure 2.2).
Figure 2.2. Comparison of the Family-Specific Repeat Content in Copperhead
(Agkistrodon) and Python (Python) Genomes. Transposable element families were
determined based on the combined annotations of RepBase, RepeatScout,
RepeatModeler, and RepClass, and coverage in the genome was annotated using
RepeatMasker.
Analysis of the distribution of sequence divergence (from species-specific consensus
sequences) within these two LINE super-families reveals contrasting expansion histories
both between LINEs and between species (Figure 2.3A-B). Both snake lineages
experienced recent (and likely independent) expansion of Bov-B LINEs, indicated by the
low sequence divergence among these LINEs within each species (Figure 2.3A). In
contrast, CR1 accumulation appears to have occurred over an extended period in both
lineages. The CR1 consensus sequences for the two species are 11% divergent, suggesting
that CR1 expansion in these species began in a common ancestor of these two lineages,
followed by a recent decrease in the rate of accumulation in both taxa (as suggested by
the small numbers of copies with low divergence); this pattern is nearly opposite that of
Bov-B. Despite a similar age distribution of CR1 elements in both snakes, the copperhead
lineage has experienced much greater levels of CR1 accumulation, particularly during the
more recent half of this expansion period (Figure 2.3B).
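The divergence profiles discussed above rest on a per-copy distance from a consensus sequence. The following is a minimal sketch of that calculation, assuming the copies are already aligned to the consensus and using an uncorrected proportion of mismatches (p-distance); the thesis does not state which distance measure was used, and the sequences here are hypothetical.

```python
from collections import Counter

def p_distance(copy, consensus):
    """Proportion of mismatches over aligned columns, ignoring gap columns."""
    compared = 0
    mismatches = 0
    for a, b in zip(copy, consensus):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else float("nan")

# Hypothetical aligned element copies and their consensus.
consensus = "ACGTACGTAC"
copies = ["ACGTACGTAC", "ACGTACGAAC", "AC-TACGTTC", "TCGTAGGTAC"]

distances = [p_distance(c, consensus) for c in copies]
# Divergence profile: number of copies at each whole-percent divergence level.
profile = Counter(round(d * 100) for d in distances)
print(sorted(profile.items()))
```

A profile dominated by low-divergence copies would correspond to a recent expansion (as described for Bov-B), whereas a broad profile with few low-divergence copies would correspond to older, extended accumulation (as described for CR1).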
Evidence from LINEs suggests that there may be substantially different genomic processes
at work on the two snake genomes that result in truncation and possible purging of
longer repeat elements along the copperhead lineage. Whereas about half of Bov-B LINEs
appear near full length in the python genome, a vast majority of copperhead Bov-B LINEs
appear to be truncated, representing predominantly the 3' end, based on mapping reads to
Bov-B LINE consensus sequences of both species (Figure 2.4A). We also find that copperhead
CR1 LINEs are truncated to an even greater degree, although the low copy number of CR1
LINEs in the python prevents meaningful comparison (Figure 2.4B).
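Truncation of this kind shows up as uneven read depth along the element consensus. A minimal sketch of the depth calculation follows, assuming each mapped read is summarized as a 0-based, end-exclusive interval on the consensus; the interval values and the function name are hypothetical, and a real analysis would additionally normalize depth by the megabases of genomic sequence sampled.

```python
def coverage_depth(consensus_length, alignments):
    """Per-position read depth from (start, end) intervals on the consensus."""
    diff = [0] * (consensus_length + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering at `start`
        diff[end] -= 1     # and stops covering at `end` (exclusive)
    depth = []
    running = 0
    for delta in diff[:consensus_length]:
        running += delta
        depth.append(running)
    return depth

# Hypothetical read placements on a 10-bp consensus; most reads hit the 3' end.
alignments = [(6, 10), (7, 10), (4, 10), (0, 10), (8, 10)]
depth = coverage_depth(10, alignments)
print(depth)
```

Depth that rises steadily toward the 3' end, as in this toy example, is the pattern expected when most copies are 5'-truncated.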
[Figure 2.3 image. Panels A and B; panel A title: Bov-B LINEs. Series: Agkistrodon, Python. x-axis: Sequence Divergence.]
Figure 2.3. Sequence Divergence of Selected Transposable Elements. The species-specific
consensus sequences were determined for A) Bov-B LINEs, and B) CR1 LINEs, and the
sequence divergence levels were calculated for all alignable sequences of these types.
In addition to a greater abundance of LINEs, the copperhead also has a much greater
abundance of both Gypsy-like (2.18% vs. 0.21%) and DIRS (0.84% vs. 0.03%)
retrotransposons compared to the python; these families are also both more abundant in
the python than in the anole lizard (Figure 2.2; Table B.8). An increased abundance of
DNA transposons in the copperhead (4.58%, vs. 1.92% in the python) is also observed,
primarily due to increases in hobo-Activator-Tam3 (hAT) transposons (Figure 2.2). The
copperhead also experienced a notable expansion of SSRs and low-complexity regions
relative to the python and lizard, and contains more unclassified elements than the
python (16.45% vs. 9.29%). The rarity of identified SINEs may be an artifact because SINEs
are often difficult to classify due to their short length, rapid evolution and turnover, and
because they do not encode proteins.
[Figure 2.4 image. Panel B title: snake CR1 LINE Coverage.]
Figure 2.4. Depth of Sequence Coverage. The coverage depth per megabase of genomic
sequence is shown for the 3' ends of A) Bov-B and B) snake CR1 LINEs for both the
copperhead (Agkistrodon) and python (Python) genome samples.
At the family level, over three times more new snake-specific families were identified
(using RepeatModeler) in the copperhead (1,996) than the python (571). This de novo
repeat identification method produces many potentially redundant family descriptions,
but after collapsing redundant families (see Supplementary Methods), the result holds:
only 82 new collapsed families from the python were identified by RepClass, compared to
243 in the copperhead (Tables B.2-B.5). Many more element families were identified in
the copperhead compared to the python for numerous element types (Table B.4),
including CR1 and L1 LINEs, Penelope retrotransposons, Gypsy and DIRS retrotransposons,
and hobo-Activator (hAT) and Mariner DNA transposons. The familial diversity
of DNA transposons (16 new python families, 48 new copperhead families), DIRS (0 new
python families, 38 new copperhead families) and Gypsy (8 new python families, 49 new
copperhead families) is particularly skewed (Table B.4). This indicates that the greater
transposable element content in the copperhead compared to the python is based not
only on greater numbers of elements but also on greater element diversity at the more
fine-scale family level. Higher element abundance in the copperhead was not limited to a
particular set of elements, but rather distributed across a diverse set of elements.
Furthermore, the element types that are more common in the copperhead tend to have
more types of new element families identified.
2.3.3 Evidence for Horizontal Transfer of Transposable Elements
Previous studies have inferred horizontal transfer of Bov-B LINEs between mammals and
snakes and/or squamate reptiles to explain the enigmatic distribution of these elements
across amniote vertebrates (Kordis and Gubensek 1997; Kordis and Gubensek 1998a).
Based on phylogenetic analysis of Bov-B sequences from available vertebrate genomes,
we estimate that copperhead and anole Bov-B sequences are more closely related to each
other than either of these is to the python Bov-B sequences. This result implies multiple
episodes of horizontal transfer of Bov-B LINEs to or from squamate reptiles (Figure C.6).
We also found traces of Maverick DNA transposons only in the python (Tables B.6-B.7),
and these are not otherwise known from squamate reptiles, although there is a report of
one from the sister lineage of squamates, the tuatara (Pritham, Putliwala, and Feschotte
2007).
hAT elements are barely detectable in the python but comprise ~2% of the copperhead
genome sample (Figure 2.2; Table B.8). Space invader (SPIN) elements, a type of hAT DNA
transposon, are known to have been independently horizontally transferred into the
genomes of multiple tetrapod lineages within the last 15-46 million years, including that
of the anole lizard (Pace et al. 2008; Novick et al. 2010). We found evidence of numerous
SPINs in the genome of the copperhead (Figure 2.5), but found none in the python
(corroborated by PCR and Southern hybridizations; C. Feschotte, unpublished data). In the
copperhead, most SPIN-related sequences (1,142) found were MITEs (proximal ends) of
SPIN elements representing deletion derivatives of longer and presumably autonomous
elements. An additional 19 reads mapped to (non-MITE) internal regions of the anole
lizard SPIN transposon consensus sequence (Pace et al. 2008). SPIN MITE sequences from
the copperhead display relatively low levels of sequence divergence (from a copperhead
SPIN consensus), averaging around 6% (Figure 2.5). This is consistent with recent activity
and invasion of SPINs into the copperhead genome at a similar time frame (<45 MYA) as
SPINs appear to have invaded other tetrapod lineages, long after the ~100 MYA split
between the python and copperhead (Castoe et al. 2009b), although the lack of a known
neutral substitution rate for squamate genomes precludes precise dating.
Figure 2.5. Sequence Divergence Between SPIN DNA Transposon (MITE) Sequences in the
Copperhead Genome. Divergences from the consensus sequence were calculated as in
Figure 2.3.
2.3.4 Simple Sequence Repeat (SSR) Structure
The frequency pattern of SSRs in snakes is similar to the frequency pattern of repetitive
elements in that the copperhead has about four times the SSR content of the python, and
2 or more times that of the anole lizard, which itself has substantially more than other
reptiles or birds examined (Figure 2.6A). As with the TEs, it is surprising to observe such
an expansion in a genome that is smaller than most other reptile genomes. Although the
number of SSRs identified varies somewhat depending on the identification method (e.g.,
Figures 2.6A, C.6; Table B.8), the approximate proportions remain similar. The relative
abundance of SSR loci in the copperhead, compared to the python and the anole, is also
consistently higher across all repeat motif length classes, from 2mers to 6mers (Figure
C.7). This points to a general expansion of SSR loci. The excessive relative enrichment of
4mers and 5mers in the copperhead, however, indicates a possible role for a motif-
specific mechanism as well.
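Because SSR counts depend on the identification method, it helps to see how much the definition of a locus is a matter of chosen thresholds. The sketch below finds perfect tandem repeats of 2-6 bp units with a regular expression and tallies loci by motif length. The thresholds, function name, and toy sequence are assumptions; the SSR-finding software and settings used in the thesis are not described in this section.

```python
import re
from collections import Counter

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, motif, repeat_count) for perfect tandem repeats in seq."""
    # Lazy unit match so the shortest repeating unit is preferred.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs reported as 2-6mers
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits

# Toy sequence (hypothetical) containing an (ATA)4 and an (AATAG)3 locus.
seq = "GGC" + "ATA" * 4 + "CCGT" + "AATAG" * 3 + "TTC"
hits = find_ssrs(seq)
by_length = Counter(len(motif) for _, motif, _ in hits)
loci_per_mbp = len(hits) / (len(seq) / 1e6)
print(hits, dict(by_length))
```

Raising `min_repeats` or requiring a minimum locus length would lower the absolute counts, which is consistent with the statement above that totals vary by method while the relative proportions between species remain similar.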
[Figure 2.6 image. Panel A taxa: Agkistrodon, Python, Anolis, Alligator, Chicken, Turtle. Panel B x-axis: SSR loci (sorted by motif and length). Panel C: estimated genomic copy number of "snake1" CR1 LINEs: anole lizard, 0; python, ~10,000; copperhead, ~254,000.]
Figure 2.6. Expansion of Specific SSR Motifs in the Copperhead Genome. A) The number of
SSR loci per Mbp for a sampling of amniote genomes is shown along with a phylogenetic
tree of their relationships. B) The 3mer and 5mer SSR loci of python (Python) and
copperhead (Agkistrodon) are shown sorted first by SSR sequence motif and then by SSR
length (in bp). The height of each bar corresponds to the length of each SSR (in bp), and
the width is proportional to the identified number of sequences with a particular motif and
length. The width of the portion of the graph devoted to each motif is proportional to the
motif's relative abundance among SSRs (in terms of number of loci). The regions of the
graph devoted to motifs ATA and AATAG are indicated with double arrows. C) Two
alternative SSR tails at the 5' ends of snake1 CR1 LINEs are shown along with the
estimated copy number of this LINE family in anole lizards and the two snakes.
SSR motif sequence frequencies are quite similar between the python and anole lizard,
and surprisingly different in the copperhead, suggesting accelerated evolution of SSRs
along the lineage leading to the copperhead (Figures C.7-C.8). Due to common descent, the
frequencies of SSR motifs in different species are expected to be correlated, and under
the assumption of a constant rate of evolution (birth and death) of SSRs, the degree of
correlation should decrease with the divergence time between species. The SSR motif-specific
frequency profiles in the anole and python have a linear regression coefficient of
determination of R² = 0.716, compared to R² = 0.405 between the two snakes. The contrasting correlation
strengths were particularly strong for comparisons of 2mers, 4mers, and 5mers (Figure
C.7). These results are consistent with the hypothesis that SSR motifs are typically stable
for long periods of time (e.g., the time separating the python and the anole), but that the
copperhead lineage has undergone an unusual amount of SSR turnover resulting in a
major change in the SSR motif frequencies and overall abundances in the copperhead
genome.
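The comparison of motif frequency profiles amounts to a simple linear regression between two vectors indexed by motif. A minimal sketch of the R² calculation is given below; the motif frequencies are hypothetical placeholders chosen only to show the mechanics, not values from the thesis, and the function names are ours.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x.

    Assumes both vectors have non-zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)

def profile_r_squared(freq_a, freq_b):
    """R^2 between two motif -> frequency profiles; missing motifs count as 0."""
    motifs = sorted(set(freq_a) | set(freq_b))
    x = [freq_a.get(m, 0.0) for m in motifs]
    y = [freq_b.get(m, 0.0) for m in motifs]
    return r_squared(x, y)

# Hypothetical motif frequencies (loci per Mbp), for illustration only.
anole = {"AC": 40.0, "AT": 25.0, "ATA": 5.0, "AATAG": 2.0}
python_snake = {"AC": 38.0, "AT": 27.0, "ATA": 6.0, "AATAG": 2.5}
copperhead = {"AC": 35.0, "AT": 30.0, "ATA": 60.0, "AATAG": 45.0}

print(profile_r_squared(anole, python_snake))
print(profile_r_squared(python_snake, copperhead))
```

In this toy case, expanding two motifs in one species is enough to pull R² well below that of the other pair, which mirrors the qualitative pattern reported above.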
2.3.5 Microsatellite Seeding by CR1 LINEs
Certain SSR sequence motifs were greatly expanded in the copperhead, including ATA,
ATAG, and AATAG (Figures 2.6B and C.7), which are notably similar to one another. Analysis of
the flanking sequences of these highly expanded copperhead SSR loci showed that the
3mer (ATA)n and 5mer (AATAG)n were associated with other non-SSR repetitive
sequences: these SSR motifs are found in the 3' tails of a snake-specific family of CR1
LINEs that we refer to as snake1 CR1 LINEs. These snake1 CR1s are absent from the anole
lizard genome assembly and were detectable at low levels in the python sample (196
elements; ~7 elements/Mbp), but have expanded ~25-30-fold in the copperhead (11,324
elements; ~189 elements/Mbp); we estimate there to be ~9,900 snake1 CR1 LINEs in the
python genome versus ~254,000 in the copperhead. In the copperhead, 41.4% of all ATA
SSR loci and 22.7% of all AATAG SSR loci were flanked by readily identifiable snake1 CR1
LINEs; only a small fraction of ATAG SSRs were flanked by snake1 CR1 LINEs (0.9%), and it
is thus unclear whether LINEs directly seed these ATAG repeats, or if they are mutated
versions of related 3mers or 5mers. The most extreme perturbations in microsatellite
frequencies between the two snakes thus seem to be due to the "microsatellite-seeding"
(Arcot et al. 1995) activity of these snake1 CR1 LINEs.
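The genome-wide copy numbers follow from scaling the sampled element density up to the whole genome. The sketch below reproduces that extrapolation under the assumption that both genomes are ~1.4 Gb (the rounded figure given in the abstract) and using the rounded densities quoted above. The results therefore land near, but not exactly on, the reported ~9,900 and ~254,000, which presumably derive from unrounded densities and species-specific genome sizes.

```python
GENOME_SIZE_MBP = 1400.0  # assumption: ~1.4 Gb for both genomes (abstract, rounded)

# species: (elements found in the sample, elements per Mbp), as quoted in the text
samples = {
    "python": (196, 7.0),
    "copperhead": (11324, 189.0),
}

sampled_mbp = {}
estimates = {}
for species, (found, density) in samples.items():
    sampled_mbp[species] = found / density           # implied Mbp of sequence sampled
    estimates[species] = density * GENOME_SIZE_MBP   # extrapolated genome-wide copies
    print(f"{species}: ~{sampled_mbp[species]:.0f} Mbp sampled, "
          f"~{estimates[species]:,.0f} copies genome-wide")

fold = samples["copperhead"][1] / samples["python"][1]
print(f"density fold difference: {fold:.0f}")
```

The density ratio falls inside the ~25-30-fold range stated above, and the implied sample sizes are consistent with each genome having been sampled at under 5%.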
The elements that we are calling snake1 CR1 LINEs have been identified previously but
were confused with Bov-B LINEs because of an early report of a novel Bov-B LINE that was
actually a Bov-B LINE flanked by two snake1 CR1 LINE fragments (Figure C.9). This
misidentification has hindered recognition and correct identification of LINEs in reptilian
genomes, especially because RepBase currently includes the misidentified LINE as a
reference sequence. Snake1 CR1 LINEs are also notable because our data confirms
previous speculation that they occur at high frequency throughout phospholipase venom
genes in viperid snakes (Ikeda et al. 2010), numerous other venom genes in viperids and
elapids (based on BLAST analysis; Table B.10), and in HOX gene clusters of colubrid snakes
(Di-Poi et al. 2010). Because of the lack of diversity of available sequences for snake genic
plus inter-genic regions, however, it is not possible to conduct meaningful analyses to
assess the level of CR1 enrichment specifically adjacent to venom genes (versus
non-venom genes) at this time.
2.3.6 Evidence of Transposable Element Transcriptional Activity
Transcription levels in liver tissue samples from both species were evaluated to determine
whether TEs are actively transcribed in living snake tissues (NCBI Sequence Read
Archive accession SRA029568.1). Transcripts with sequence homology to every TE class
were more frequent in the copperhead liver transcriptome (Figure 2.7; Table B.11), with
23-fold greater overall levels of transcription of TEs in the copperhead compared to the
python, including many different TE classes that were inferred to be recently active from
the genomic data. For example, LINEs in the copperhead represent ~4.6% of all transcripts
(47-fold more than in the python). In addition, CR1s are particularly frequent in the
copperhead, comprising ~3% of the copperhead transcriptome sample; this frequency is
122-fold greater than in the python transcripts (Figure 2.7; Table B.11). Bov-B LINEs were
also observed in both species, but were 16-fold more abundant in the copperhead (at
0.8% of transcripts; Table B.11). From the genomic data we inferred that Gypsy and DIRS1
LTR retroelements had expanded recently in the copperhead, and transcriptional data
shows moderately high levels of both of these in the copperhead (at 0.54% and 0.18% of
transcript reads, respectively) yet these were either barely detected or not detected at all
in the python transcriptome (Figure 2.7). We also found transcriptional evidence of hAT
DNA transposon activity (which includes SPIN element activity) in the copperhead (0.64%
of reads) at 280-fold greater levels than in the python. The most abundant
presumably TE-related transcripts were "unclassified" TEs; we found 28-fold
greater relative abundance of unclassified repeats in the copperhead (5.6% of reads)
versus the python (0.2% of reads; Figure 2.7). While we cannot interpret exactly what
these unclassified elements represent, we expect this category to contain a substantial
proportion of SINEs. Although the liver is not where TEs need to be expressed to make
new inherited copies, and these transcripts do not necessarily arise from the TE's own
promoters, this data suggests the possibility that TE activity in the copperhead continues
to be high compared to the python.
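The fold differences quoted in this section are ratios of relative transcript frequencies. As a rough check, the sketch below recomputes the one ratio for which both percentages are given (unclassified elements) and back-calculates the python percentages implied by the other reported fold values. Those back-calculated numbers are approximations derived here from rounded figures; they are not values reported in the thesis, and the function names are ours.

```python
def fold_difference(copperhead_pct, python_pct):
    """Ratio of relative transcript frequencies (copperhead over python)."""
    return copperhead_pct / python_pct

def implied_python_pct(copperhead_pct, fold):
    """Python percentage implied by a copperhead percentage and a fold value."""
    return copperhead_pct / fold

# Both percentages are given in the text for unclassified elements.
unclassified_fold = fold_difference(5.6, 0.2)
print(f"unclassified: {unclassified_fold:.0f}-fold")

# class: (copperhead % of transcripts, reported fold difference), from the text
reported = {
    "LINEs (all)": (4.6, 47),
    "CR1 LINEs": (3.0, 122),
    "Bov-B LINEs": (0.8, 16),
    "hAT DNA transposons": (0.64, 280),
}
implied = {name: implied_python_pct(pct, fold) for name, (pct, fold) in reported.items()}
for name, pct in implied.items():
    print(f"{name}: implied python frequency ~{pct:.4f}% of transcripts")
```

The very small implied python percentages show why large fold values here rest on few python reads and should be read as order-of-magnitude indications.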
[Figure 2.7 image: radar graph of relative transcript frequency; legible category label: L2/CR1/RM LINEs.]
Figure 2.7. Relative Frequencies of Transposable Elements in Liver cDNA Transcripts.
Relative transcript frequencies in the two snakes are shown in a radar graph on a
logarithmic scale. Sequences shown had long regions of high similarity to known
transposable elements.
Bov-B LINEs also appear to have contributed to seeding a (CAA)n microsatellite in
snake genomes, but this has not resulted in major expansion of (CAA)n SSRs in the
genomes of either species surveyed here (Figure C.7). It is also notable that even when all
known LINE-associated SSR motifs are excluded from consideration, there are still large
differences in SSR motif abundance between the snakes (e.g., Figure C.7). This suggests
that in addition to the major impact of LINEs, other LINE-independent effects have altered
SSR abundances in the lineage leading to the copperhead.
2.4 Discussion
Comparison of two snake genomes, spanning ~100 MY of snake evolution, revealed
extensive differences in their genomic repeat landscapes. Although both snakes contain
diverse sets of repeat elements distributed across most major element types and super-
families, the copperhead genome contains more of essentially all of these repeats
(occupying 45% of the copperhead genome, versus 21% of the python genome), and
many repeats have expanded recently. In comparison, the largest known difference in
genomic repeat content between placental mammalian genomes occurs between the
human and mouse (46% versus 38%, respectively), which are separated by ~75 MY
(Waterston et al. 2002). Thus, for similar levels of divergence, the difference in repeat
content in these two snakes is exceptional. Furthermore, the greater repetitive content in
the copperhead is not due to just one or a few families, but is distributed among a
diversity of element families and subfamilies (243 collapsed TE families in the
copperhead, versus 82 in the python).
TE-related transcripts also appear to be expressed at much higher levels in copperhead
tissue compared to python tissue. Although we surveyed liver rather than gametic tissues
for TE activity, the 23-fold greater overall levels of TE-related transcripts in the
copperhead (Figure 2.7) suggest that TE transcription may be generally more active in
the copperhead. If transcription levels have also increased in germline tissues, they may
have contributed to increased genomic TE insertion activity. One hypothesis to explain
the observations that TEs have higher transcription levels and have been more active in
the copperhead versus python genomes is that mechanisms known to control TE
proliferation (e.g., CpG methylation and chromatin structural regulation (Yoder, Walsh,
and Bestor 1997; Lippman et al. 2004; Feschotte 2008)) may be differentially effective in
the two snakes. It is also possible that TEs may occur in greater proximity to
transcriptional units in the copperhead genome, driving greater levels of read-through
transcription. It is unclear, however, what mutational or selective force would have made
TEs land and become fixed nearer transcriptional units in the copperhead than in the
python. The increased transcription levels in the copperhead also suggest that TEs are
more likely to influence flanking gene expression in the copperhead than in the python.
Snake genomes appear to be particularly prone to horizontal transfer of TEs compared to
other vertebrates that have been studied. This study provides novel evidence for such
transfers. By adding genomic data from two snake species to those from the anole lizard,
it provides the first comparative view into TE dynamics within squamate reptiles. For example,
the sequence-based and PCR-based evidence for the absence of SPIN
elements in the python (C. Feschotte, unpublished), in contrast to the abundance of
recently inserted SPIN elements in the copperhead, provides compelling evidence that, as
with mammalian genomes (Pace et al. 2008; Gilbert et al. 2010), reptilian genomes have
been differentially invaded by these elements. The complex patterns of horizontal
transfer involving Bov-B LINEs are another good example. While previous studies have
already suggested horizontal transfer of Bov-B LINEs between squamate reptiles and
mammals (Kordis and Gubensek 1997; Kordis and Gubensek 1998b; Kordis and Gubensek
1998a), our analysis suggests multiple transfer events into and/or out of squamate
genomes (Figure C.6). The previous report of a poxvirus-mediated transfer of Squam1
SINE elements from viperid snakes to rodents demonstrates that viruses may sometimes
mediate such horizontal transfer events (Piskurek and Okada 2007). This transfer is
thought to be dependent on the enzymatic machinery of a Bov-B LINE (Piskurek and
Okada 2007), and high transcript levels of Bov-B reverse transcriptase in snake tissues,
such as those found in the copperhead, may thus increase the probability of horizontal
transfer events.
The copperhead lineage also appears to have modified microsatellite evolutionary
dynamics, including microsatellite seeding (Arcot et al. 1995; Tay et al. 2010) by a snake-
specific CR1 LINE family (Figure 2.6). Microsatellite seeding in the copperhead has
occurred at a scale that is several orders of magnitude greater than any other example
that we are aware of (Nadir et al. 1996; Tay et al. 2010). The similar SSR motif frequencies
between the python and anole lizard are consistent with previous suggestions that SSR
evolution and turnover rates in non-avian reptiles are generally lower than in mammals
(Matsubara et al. 2006; Shedlock et al. 2007). In contrast, the increase in SSR content and
radically different motif frequencies in the copperhead indicate that SSR turnover rates in
squamates can evolve even more rapidly than what is known from mammalian genomes.
Despite its substantial and recently expanded repeat content, the copperhead has a
genome size that is among the smallest of snakes. This is surprising, as it is reasonable to
expect that small genomes should have low repetitive content, as is the case in pythons
and birds. We suggest that unidentified processes must be acting to remove genomic
sequence in the copperhead. There is a high degree of LINE element truncation in the
copperhead relative to the python (Figure 2.4), and biased gene conversion leading to
deletion of repetitive element sequence via truncation of LINEs has been proposed to
occur in the anole lizard genome (Novick et al. 2009). This mechanism alone is clearly
insufficient to balance the equation (the total LINE element sequence is still considerably
greater in the copperhead), but the truncated LINEs are suggestive of pressure to limit
genome expansion. Genome-wide deletion mutation rates may also be higher (possibly
mediated by differences in effective population size and higher rates of recombination),
and it is possible that selection on genome size may have played a role.
Selection to maintain a smaller genome size has been hypothesized numerous times in
relation to extreme metabolic demands in flighted birds (Hughes and Hughes 1995),
although there is some controversy (Organ et al. 2007). Previous studies have suggested
that extreme metabolic demand in snakes (Secor and Diamond 1995; Secor and Diamond
1998) has resulted in selection to decrease their mitochondrial genome size (Jiang et al.
2007), extensive evolutionary redesign (Castoe et al. 2008) and unprecedented
molecular convergence in snake metabolic proteins (Castoe et al. 2009a). It is therefore
plausible that selection related to metabolic demands could have shaped snake nuclear
genomes. Broader understanding of genomic repeat landscapes in snakes may shed
greater light on this question. There is a range of alternative theories about the
evolution of genome size and complexity (Lynch and Conery 2003), and thus the role of
selection in snake genome size and structure is a topic of considerable interest.
It is also an open question whether the biology of snake genomes may have contributed
to the evolution of their extreme phenotypes and adaptations (Secor and Diamond 1995;
Cohn and Tickle 1999; Fry et al. 2006; Castoe et al. 2008; Vonk et al. 2008; Castoe et al.
2009a). Among the most conspicuous adaptations in snakes is that some lineages,
including the ancestors of the copperhead, have evolved complex venom repertoires,
largely by duplicating and re-purposing existing genes to produce deadly toxins. Our
evidence (Table B.10), and that of others (Kordis and Gubensek 1997; Kordis and
Gubensek 1998b; Ikeda et al. 2010), shows a tentative association between CR1 LINEs and
venom genes. Our genomic sampling suggests that CR1 LINEs (as well as SSRs and other
TEs) have expanded substantially in the copperhead lineage, and this expansion might be
expected to lead to increased rates of recombination, unequal crossing over and gene
conversion (Witherspoon et al. 2009; Stevison and Noor 2010). These events could have,
at least in part, facilitated the expansion and regulatory rewiring of venom gene families
in venomous snakes.
3. Repeat landscape evolution and LINE dynamics across snake genomes
3.1 Introduction
Comprising approximately 3,100 species, the snakes have become increasingly
important model systems for basic and biomedical research. Snakes diverged from other
squamate reptiles ~170 MYA (Castoe et al. 2009b), and have diversified into many
different phenotypes. In comparison to mammalian genomes, snake genomes are
generally small (Gregory et al. 2007), ranging from 1.3 gigabases (Gbp) to 3.8 Gbp and
averaging 2.1 Gbp. Snake genomes are not well characterized as a whole, nor are reptile
genomes in general (Shedlock et al. 2007; Janes et al. 2010).
The prior investigation into Python molurus and Agkistrodon contortrix genomic content
revealed two different transposable element landscapes in the two snakes. TE content in
A. contortrix was higher than in P. molurus, and the majority of A. contortrix LINEs
were truncated. The striking difference between these two species suggested that a
broader sampling of taxa (as opposed to a more in-depth examination of one or two
representative taxa) was necessary to better understand the dynamics of snake
transposable element landscapes and provide more insight into snake genomic content in
a broad sense. Two LINE categories emerged as warranting further study: BovB, due to
evidence for horizontal transfer, and CR1, discovered to be seeding SSRs in A. contortrix.
These two LINEs have different expansion histories in A. contortrix and P. molurus, and
may have helped shape snake evolutionary history.
Thus the goal in this study was to characterize the transposable element landscapes of a
larger group of snakes, sampling many more major lineages, to better understand
genomic diversity and the structure of snake transposable element landscapes, and to
identify trends of expansion or loss of transposable element families, with a focus on
prominent LINEs. We began by obtaining low-coverage 454 high-throughput shotgun
sequencing data from genomic libraries, and analyzing nuclear reads for transposable
element content.
Several characteristics of LINE elements were exploited for this study. Since LINE
elements insert into the genome starting with the 3' end of the element, and most copies
within the genome become truncated before the 5' end is copied, the majority are
therefore incomplete and incapable of further transposition (Cordaux & Batzer 2009).
This 3' region is thus an ideal target for studying mutation rates, as the forces acting on
dead elements (defective elements that are unable to generate new copies) approximate
neutral evolution, and a large number of 3' end copies exist for BovB and CR1 throughout
the genomes of snakes. In contrast, the expectation is that targeting 5' ends would
amplify only the far fewer elements that are complete or nearly full-length (and
are possibly master elements), and these complete elements would prove suitable for
phylogenetic analyses. The 5' and 3' end sequences were examined for BovB, and the 3'
end sequences were examined for a subfamily of CR1 elements.
3.2 Methods
3.2.1 Primer Design
Initial primers were identified based on the more conserved regions between the
copperhead and python consensus sequences for L3 CR1 and BovB LINEs. Primers were
tested using Python molurus, Cerrophidion godmani, and Elaphe guttata DNA (Figure 3.1).
Primers amplifying from all three genomes were used to order 454 primer sets with 6
barcodes for each TE fragment investigated (Table 3.1). Primers were designed to
sequence inward, and to use the 454 B chemistry for amplification (Figure 3.2). Barcoded
454 primer pairs were tested on P. molurus to verify the functionality of all barcodes
before PCR was performed on the targeted snake species (Table 3.2).
3.2.2 PCR and Sequencing Preparation
PCR was performed in 50 ul reactions using these conditions: 5 ul 10x Taq Hifi buffer, 1 ul
10 mM dNTPs, 1 ul 50 mM MgSO4, 0.2 ul Platinum Hifi Taq, and 1.5 ul of each primer at
10 uM. Denaturation was performed at 95C for 2 minutes initially, followed by 40 cycles
using a denaturing temperature of 95C, an annealing temperature of 60C
decrementing 0.5C each cycle for 15 cycles and then held at a steady 54C for 25
cycles, and an elongation temperature of 68C, using primers specified in Table 3.1. A list
of snake species and corresponding primer pair IDs is provided in Table 3.2.
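As a quick check on the reaction mix above, the final concentration of each stock reagent follows from C1 * V1 = C2 * V2. The sketch below is illustrative only: the function name is hypothetical, and the stock values are read from the recipe as listed (10 mM dNTPs, 50 mM MgSO4, 10 uM primers, 50 ul total volume).

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul=50.0):
    """Dilution arithmetic (C1 * V1 = C2 * V2); result is in the stock's units."""
    return stock_conc * stock_volume_ul / total_volume_ul


# Stock concentrations and volumes as listed in the recipe above.
dntp_mM = final_concentration(10.0, 1.0)    # 10 mM dNTPs, 1 ul
mgso4_mM = final_concentration(50.0, 1.0)   # 50 mM MgSO4, 1 ul
primer_uM = final_concentration(10.0, 1.5)  # 10 uM primer, 1.5 ul each

# Volume left for template and water after the listed reagents (two primers).
listed_ul = 5.0 + 1.0 + 1.0 + 0.2 + 2 * 1.5
remaining_ul = 50.0 - listed_ul

print(dntp_mM, mgso4_mM, primer_uM, round(remaining_ul, 1))
```

Under these assumptions the reaction contains 0.2 mM dNTPs, 1 mM MgSO4, and 0.3 uM of each primer, leaving 39.8 ul for template and water.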
Table 3.1. Barcoded PCR primers with 454 adapters.
Element     Primer ID         Primer sequence
BovB 5'     BovB_100F_RL1B    CTATGCGCCTTGCCAGCCCGCTCAGACGAGTGCGTCAGGTCATRRCAGAGAGTTCTGACAAAATGTG
BovB 5'     BovB_100F_RL2B    CTATGCGCCTTGCCAGCCCGCTCAGACGCTCGACACAGGTCATRRCAGAGAGTTCTGACAAAATGTG
BovB 5'     BovB_100F_RL3B    CTATGCGCCTTGCCAGCCCGCTCAGAGACGCACTCCAGGTCATRRCAGAGAGTTCTGACAAAATGTG
BovB 5'     BovB_100F_RL4B    CTATGCGCCTTGCCAGCCCGCTCAGAGCACTGTAGCAGGTCATRRCAGAGAGTTCTGACAAAATGTG
BovB 5'     BovB_100F_RL5B    CTATGCGCCTTGCCAGCCCGCTCAGATCAGACACGCAGGTCATRRCAGAGAGTTCTGACAAAATGTG
BovB 5'     BovB_100F_RL6B    CTATGCGCCTTGCCAGCCCGCTCAGATATCGCGAGCAGGTCATRRCAGAGAGTTCTGACAAAATGTG
BovB 5'     BovB_750R_454A    CGTATCGCCTCCCTCGCGCCATCAGGAASYGGTCAAYTTCAKCYTCTTCAGC
BovB 3'     BovB_3060R_RL1B   CTATGCGCCTTGCCAGCCCGCTCAGACGAGTGCGTCTTCTYCTWTTGCCYTCAATCTTTCCCAG
BovB 3'     BovB_3060R_RL2B   CTATGCGCCTTGCCAGCCCGCTCAGACGCTCGACACTTCTYCTWTTGCCYTCAATCTTTCCCAG
BovB 3'     BovB_3060R_RL3B   CTATGCGCCTTGCCAGCCCGCTCAGAGACGCACTCCTTCTYCTWTTGCCYTCAATCTTTCCCAG
BovB 3'     BovB_3060R_RL4B   CTATGCGCCTTGCCAGCCCGCTCAGAGCACTGTAGCTTCTYCTWTTGCCYTCAATCTTCCCAG
BovB 3'     BovB_3060R_RL5B   CTATGCGCCTTGCCAGCCCGCTCAGATCAGACACGCTTCTYCTWTTGCCYTCAATCTTTCCCAG
BovB 3'     BovB_3060R_RL6B   CTATGCGCCTTGCCAGCCCGCTCAGATATCGCGAGCTTCTYCTWTTGCCYTCAATCTTTCCCAG
BovB 3'     BovB_2350F_454A   CGTATCGCCTCCCTCGCGCCATCAGGCYTATTTAACYTATATGCAGAGYACATCATGAG
L3 CR1 3'   CR1_1920F_RL1B    CTATGCGCCTTGCCAGCCCGCTCAGACGAGTGCGTCCTTGGAGGTCTTCTAGTCCAACC
L3 CR1 3'   CR1_1920F_RL2B    CTATGCGCCTTGCCAGCCCGCTCAGACGCTCGACACCTTGGAGGTCTTCTAGTCCAACC
L3 CR1 3'   CR1_1920F_RL3B    CTATGCGCCTTGCCAGCCCGCTCAGAGACGCACTCCCTTGGAGGTCTTCTAGTCCAACC
L3 CR1 3'   CR1_1920F_RL4B    CTATGCGCCTTGCCAGCCCGCTCAGAGCACTGTAGCCTTGGAGGTCTTCTAGTCCAACC
L3 CR1 3'   CR1_1920F_RL5B    CTATGCGCCTTGCCAGCCCGCTCAGATCAGACACGCCTTGGAGGTCTTCTAGTCCAACC
L3 CR1 3'   CR1_1920F_RL6B    CTATGCGCCTTGCCAGCCCGCTCAGATATCGCGAGCCTTGGAGGTCTTCTAGTCCAACC
L3 CR1 3'   CR1_1310F_454A    CGTATCGCCTCCCTCGCGCCATCAGGGGATCTYGGAGTCCTAGTGGAC
Table 3.2. Primer pairs used for each species.
Species                   5' Primer         3' Primer
Python molurus            BovB_2350F_454A   BovB_3060R_RL1B
Tropidophis melanurus     BovB_2350F_454A   BovB_3060R_RL2B
Candoia carinata          BovB_2350F_454A   BovB_3060R_RL3B
Uropeltis phillipsi       BovB_2350F_454A   BovB_3060R_RL4B
Xenopeltis unicolor       BovB_2350F_454A   BovB_3060R_RL5B
Cylindrophis sp.          BovB_2350F_454A   BovB_3060R_RL6B
Python molurus            CR1_1310F_454A    CR1_1920F_RL1B
Cerrophidion godmani      CR1_1310F_454A    CR1_1920F_RL2B
Elaphe guttata            CR1_1310F_454A    CR1_1920F_RL3B
Agkistrodon piscivorus    CR1_1310F_454A    CR1_1920F_RL4B
Heterodon platirhinos     CR1_1310F_454A    CR1_1920F_RL5B
Trimeresurus sp.          CR1_1310F_454A    CR1_1920F_RL6B
Size selection was performed via the Pippin Prep electrophoresis platform using 1.5%
agarose cartridges. PCR products were first concentrated and then brought to a total
volume of 30 ul with Invitrogen Tris-EDTA before following standard Pippin Prep
protocols. BovB 5' and CR1 3' fragments were size-selected using 600-800 bp as the
bounds, while BovB 3' fragments were selected using the range of 650-850 bp.
[Figure 3.1: gel images, panel A (BovB 5' and 3' primers) and panel B (L3 CR1 3' primers); ladder bands marked at 500 bp and 1000 bp]
Figure 3.1. PCR Results from Proof-of-Principle Primer Reactions. The DNA ladder is in
100 bp increments. Cgo is Cerrophidion godmani DNA. Pmol is Python molurus DNA. Egut
is Elaphe guttata DNA. The BovB 5' and 3' primers produced bands in the 650-700 bp
range for all species (A), and the L3 CR1 3' primers produced a 600 bp band for all three
species (B). A faint band was observed in the E. guttata BovB 5' lane, likely primer dimers
(A).
[Figure 3.2 schematic: forward construct, 5'-454B primer, barcode, LINE forward primer-3'; reverse construct, 3'-LINE reverse primer, 454A primer-5']
Figure 3.2. Barcoded Primers Designed with 454 Adapters. The barcode is only included
with the 454B adapter and forward primer as 454 sequencing is unidirectional.
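Because each pooled product carries one of six barcodes directly upstream of the LINE-specific primer, reads can be assigned to a species by their leading barcode. The following is a minimal sketch, not the procedure used in this study: it assumes the adapter and key have already been removed from each read, requires an exact barcode match, and uses hypothetical names (BARCODES, demultiplex). The barcode sequences are the ten bases that differ among the RL1B to RL6B primers in Table 3.1.

```python
# Barcodes taken from the RL1B-RL6B primers in Table 3.1.
BARCODES = {
    "RL1": "ACGAGTGCGT",
    "RL2": "ACGCTCGACA",
    "RL3": "AGACGCACTC",
    "RL4": "AGCACTGTAG",
    "RL5": "ATCAGACACG",
    "RL6": "ATATCGCGAG",
}


def demultiplex(reads):
    """Bin reads by exact leading barcode and trim the barcode off."""
    bins = {name: [] for name in BARCODES}
    bins["unassigned"] = []
    for read in reads:
        for name, tag in BARCODES.items():
            if read.startswith(tag):
                bins[name].append(read[len(tag):])
                break
        else:
            # No barcode matched exactly; keep the read for inspection.
            bins["unassigned"].append(read)
    return bins


example_reads = ["ACGAGTGCGTCAGGTCAT", "ATATCGCGAGCTTGGAGG", "TTTTTTTTTT"]
example_bins = demultiplex(example_reads)
print({name: len(seqs) for name, seqs in example_bins.items()})
```

Exact matching is the simplest choice; a production pipeline would usually tolerate sequencing errors within the barcode.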
Table 3.3. Picomoles of barcoded DNA purified per product and pooled for sequencing.
BovB 3'            pmol/ul   ul pooled   total pmols
P. molurus         7         4.7         33
T. melanurus       1.8       9.0         16.2
C. carinata        4.7       7.0         33
U. phillipsi       10.5      3.1         33
X. unicolor        5.9       5.6         33
CR1 3'             pmol/ul   ul pooled   total pmols
P. molurus         15.4      2.1         33
C. godmani         6.4       5.2         33
E. guttata         5.8       5.7         33
A. piscivorus      4.1       8.0         33
H. platirhinos     12.1      2.7         33
Trimeresurus sp.   19.9      1.7         33
pmol = picomoles. ul = microliters.
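The pooled volumes in Table 3.3 are consistent with a simple rule: pool the volume that delivers 33 pmol, limited by the volume available. The sketch below reproduces that arithmetic. The 9.0 ul cap is an assumption inferred from the T. melanurus row, and the function name is hypothetical.

```python
def pooling_volume(conc_pmol_per_ul, target_pmol=33.0, max_ul=9.0):
    """Return (ul to pool, pmol delivered), both rounded to one decimal.

    max_ul is an assumed cap on available volume, inferred from Table 3.3.
    """
    volume = min(target_pmol / conc_pmol_per_ul, max_ul)
    return round(volume, 1), round(volume * conc_pmol_per_ul, 1)


print(pooling_volume(7.0))   # P. molurus, BovB 3'
print(pooling_volume(1.8))   # T. melanurus, limited by the assumed cap
print(pooling_volume(19.9))  # Trimeresurus sp., CR1 3'
```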
Size-selected samples were quantified on the BioAnalyzer using DNA 7500 chips. Samples
below 3 ng/ul were concentrated and then re-quantified. Two successive rounds of
AMPure bead purification were used to remove any remaining small fragments of DNA.
3.2.3 Shotgun Sequence
Whole genome random shotgun libraries were made as in section 2.3.1 and
mitochondrial reads were filtered using the same methodology as described in A.1.
Approximate genome coverage for each species is listed in Table D.1. Genome size data are
approximate and based on the most recent value (Gregory et al. 2007). The genome size of
a member of the same genus is used if the exact species is not available.
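Approximate genome coverage is the ratio of nuclear bases sequenced to the estimated haploid genome size. As an illustration only, the sketch below applies that ratio to the values reported in Table B.1; the function name is hypothetical.

```python
def percent_genome_sampled(bases_sequenced, genome_size_bp):
    """Bases sequenced as a percentage of the estimated haploid genome size."""
    return 100.0 * bases_sequenced / genome_size_bp


# Values from Table B.1 (nucleotides sequenced, estimated genome size).
copperhead_pct = percent_genome_sampled(60_175_941, 1.35e9)
python_pct = percent_genome_sampled(28_077_583, 1.42e9)

print(round(copperhead_pct, 1), round(python_pct, 1))
```

These round to the 4.5% and 2.0% listed in Table B.1.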
3.2.4 Repeat Analyses on Shotgun Sequence
We used the current release of the Tetrapoda RepBase (version 12.12, 01-17-2008) as the
repeat library for RepeatMasker (Smit, Hubley, and Green 2004) to identify known repeat
elements in the snake genomes, and the program RepeatModeler (A. Smit, unpublished)
to identify de novo repeat sequences in our snake datasets, based on the run parameters
suggested as defaults by the program. We modified RepeatModeler's RepeatMasker
parameters to specify the Tetrapoda library. For all RepeatModeler analyses, we
combined the new Anilius scytale, Boa constrictor, Casarea dussermieri, Crotalus atrox,
Leptotyphlops dulcis, Loxocemus bicolor, Micrurus fulvius, Sibon nebulata, Thamnophis
sirtalis, and Typhlops reticulatus libraries with the previously identified Python molurus and
Agkistrodon contortrix libraries into a single joint snake library to recover as many
elements as possible and to control for differences in sequencing depth.
3.2.5 Consensus Sequences and TE Alignments from Shotgun Sequence
Alignments and consensus sequences were created using pairwise alignment information
from BLAST 2.2.20. A subset of reads containing L3 CR1 or BovB sequence was first
obtained using blastn with the L3 CR1 or BovB consensus from the more closely related
species (A. contortrix or P. molurus) as the Subject, and then filtered by length and score
(min. length of 50 bp and min. score of 50; for species with very few candidate reads, the
min. length was reduced to 30 bp). The pairwise alignments from these reads were converted to multiple
alignments using perl scripts, and used to calculate consensus sequences. The consensus at a
particular nucleotide was determined by simple majority, and ties were resolved
randomly. Minimum coverage for a region was 1 read, and regions with no coverage were
marked with N's; the length of these regions is approximate. Consensus sequences for
Leptotyphlops and Typhlops BovB were not compiled due to insufficient data; only 3-4
non-overlapping reads had blast scores over 50 for these two species.
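The consensus rule described above (most common base per column, ties broken at random, N where no read covers a position) can be sketched as follows. This is an illustration in Python rather than the perl scripts used here. It assumes reads are already placed on reference coordinates without insertions, and the names majority_consensus and placed_reads are hypothetical.

```python
import random
from collections import Counter


def majority_consensus(placed_reads, length, rng=None):
    """Build a consensus from (start, sequence) pairs on 0-based coordinates."""
    rng = rng or random.Random(0)  # seeded so the tie-breaking is repeatable
    columns = [Counter() for _ in range(length)]
    for start, seq in placed_reads:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            # Gaps and ambiguous bases do not vote.
            if 0 <= pos < length and base in "ACGT":
                columns[pos][base] += 1
    consensus = []
    for counts in columns:
        if not counts:
            consensus.append("N")  # no coverage at this position
            continue
        best = max(counts.values())
        tied = sorted(base for base, n in counts.items() if n == best)
        consensus.append(rng.choice(tied))  # random choice among tied bases
    return "".join(consensus)


reads = [(0, "ACGT"), (2, "GTAA"), (2, "GAAA")]
print(majority_consensus(reads, 8))
```

For the three example reads, positions 0 to 5 are covered and the last two positions are reported as N.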
Non-L3 CR1 sequence was determined by identifying the RepeatModeler de novo CR1
family with the greatest representation in L. dulcis and extending that sequence 5'-ward
with additional species-specific CR1 families from the RepeatModeler de novo library.
Candidate families were discarded from the assembly if coverage of the resulting
assembled CR1 sequence was discontinuous. A refined L. dulcis non-L3 CR1 consensus
sequence was produced using the same techniques outlined for BovB and L3 CR1.
3.2.6 Bayesian Analysis
Well-aligned regions 2101 bp long for BovB LINEs and 1681 bp long for CR1 LINEs were used
as input for analysis using MrBayes version 3.1.2. Analyses were run for 5 million
generations, with the first 1 million generations discarded as burn-in. Nucleotides were
analyzed using a General Time-Reversible (GTR) plus Gamma and Invariant sites model
(GTRGI). The posterior distribution of trees was summarized to assess nodal support based on
posterior probabilities.
3.2.7 Coverage
Coverage of consensus sequences was determined using the Roche 454 gsMapper with
default settings. The species-specific TE consensus sequence was assigned as the
reference sequence, and the set of all nuclear reads for that species was used as the
query. Mapping was done separately for BovB and L3 CR1 for each species and for L.
dulcis non-L3 CR1. Coverage was then determined from the 454 gsMapper pairwise
alignments using a perl script.
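Per-position coverage of this kind can be tallied from the reference coordinates of each pairwise alignment. The sketch below is a Python illustration, not the perl script used here. It assumes alignments are supplied as 1-based inclusive (start, end) intervals on the consensus, and it normalizes depth per Mbp of nuclear sequence, which is one reading of the per-Mbp normalization mentioned in the figure captions. The function name is hypothetical.

```python
def coverage_per_mbp(intervals, ref_length, total_nuclear_bases):
    """Depth at each reference position, scaled per Mbp of nuclear sequence."""
    depth = [0] * ref_length
    for start, end in intervals:
        # Clip each alignment to the reference and count every covered base.
        for pos in range(max(start, 1), min(end, ref_length) + 1):
            depth[pos - 1] += 1
    scale = 1_000_000 / total_nuclear_bases
    return [d * scale for d in depth]


example = coverage_per_mbp([(1, 3), (2, 5)], ref_length=5,
                           total_nuclear_bases=2_000_000)
print(example)
```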
3.3 Results
Several groupings of snakes are referenced within these results for purposes of clarity:
the blindsnakes, consisting of Leptotyphlops dulcis and Typhlops reticulatus; the
henophidians, Anilius scytale, Casarea dussermieri, Boa constrictor, Loxocemus bicolor,
and Python molurus; and the colubroids, Agkistrodon contortrix, Crotalus atrox, Micrurus
fulvius, Sibon nebulata, and Thamnophis sirtalis (Figure 3.3). Genomic shotgun sequence
was obtained and nuclear sequence was identified and used for repeat analysis (Table
D.1). No trends were observed with respect to GC content; closely related species, such
as C. atrox and A. contortrix, had very different levels of GC content (Table D.1).
Furthermore, GC content and repetitive content do not appear to be related (Tables D.1-D.2,
figure not shown).
3.3.1 Repeat Content
3.3.1.1 Overall Transposable Element Landscapes
Overall repetitive content as identified by RepeatMasker/RepeatModeler ranged from
approximately 24% in the Python to nearly 49% in Typhlops (Table D.2), with the majority
of these as unclassified elements identified de novo by RepeatModeler (Figure 3.4). Of
the elements that could be classified, LINEs (Long Interspersed Nuclear Elements) were
most abundant. DNA transposons and SINEs (Short Interspersed Nuclear Elements) were
also relatively common, and PLEs (Penelope-like Elements) and LTR (Long Terminal
Repeat) elements were identified in small quantities (Figure 3.4). As a whole, fewer
repeats were identified in the henophidians, particularly in B. constrictor and P. molurus.
[Figure 3.3: phylogeny with clades labeled Blindsnakes, Henophidia, and Colubroidea; tips as listed: L. dulcis, T. melanurus, A. scytale, C. dussermieri, B. constrictor, L. bicolor, P. molurus, A. contortrix, C. atrox, M. fulvius, S. nebulata, T. sirtalis; time axis: 150 to 0 million years ago]
Figure 3.3. Phylogeny of Snakes Used for RepeatMasker/RepeatModeler Analyses.
Divergence times are approximate (Castoe 2009b; Castoe 2009c). Dotted lines indicate
unknown branch length, with placement based on taxonomic data only. Dashed lines
indicate that the precise phylogenetic placement and divergence time is unresolved.
3.3.1.2 LTR Elements
Fewer LTRs were identified in the henophidians than in the blindsnakes or colubroids.
Expansion of LTRs is greatest in the colubroids, especially the DIRS1 elements (Figure
3.5). Ty1/Copia is also minimally present, except in S. nebulata, where it has expanded to
0.55% of the genome (Figure 3.5).
[Figure 3.4: stacked bar chart, "Transposable Element Content"; y-axis: 0.00% to 50.00%; legend: SINEs, LINEs, PLEs, LTR elements, DNA transposons, Unclassified]
Figure 3.4. Overall Transposable Element Content Identified by
RepeatMasker/RepeatModeler. Phylogenetic relationships of species are indicated
beneath species names.
3.3.1.3 DNA Transposons
The most common DNA transposons are Tc1-IS630-Pogo (Tc1/Pogo) and hobo-Activator
(hAT) elements. Tc1/Pogos are found in all twelve snake species, comprising ~1.5% of
most snake genomes examined. In contrast, hAT content is more variable. Henophidians
have the lowest abundance of hATs and colubroids the highest (Figure 3.6).
[Figure 3.5: bar chart, "LTR Content"; y-axis: 0.00% to 3.00%; x-axis: species]
Figure 3.5. The Most Common LTRs in Snakes, by Percentage of Genome. Phylogenetic
relationships of species are indicated beneath species names.
3.3.1.4 LINE Elements
LINE elements are the most abundant variety of transposable element detected in snakes,
and while R4 and L1-type LINEs are detectable in all snake genomes (data not shown),
CR1 and BovB families dominate. BovB is present in the blindsnakes at low levels, but
comprises a greater percentage of the genome in both the henophidians and the
colubroids, ranging from ~0.8% to 3.65% (Figure 3.7). CR1 content is high in the
blindsnakes and in the colubroids, but comparatively low in the henophidians (except A.
scytale) (Figure 3.7). L. dulcis has the highest CR1 content at ~13.6% of the genome.
[Figure 3.6: bar chart, "DNA Transposon Content"; y-axis: % of genome, 0.00% to 4.00%; legend includes Tc1-IS630-Pogo]
Figure 3.6. Common DNA Transposons in Snakes, by Percent Genomic Content.
Phylogenetic relationships of species are indicated beneath species names.
[Figure 3.7: bar chart, "BovB and CR1 LINE Content"]
Figure 3.7. BovB and CR1 LINE Content Across the Snake Phylogeny by Percent of
Genomic Sequence. Phylogenetic relationships of species are indicated beneath species
names.
3.3.2 Coverage
3.3.2.1 BovB and L3 CR1 LINE Coverage
In a prior study, evidence from BovB and CR1 LINEs suggested different genomic
processes were at work on A. contortrix and P. molurus genomes, causing truncation and
potential purging of full-length elements in A. contortrix only. This effect is now also
observed in related species; whereas colubroid LINEs are primarily truncated, a greater
proportion of henophidian BovB LINEs are full length (Figure 3.8). C. atrox has the lowest
percentage of truncated BovB LINEs of any colubroid examined, whereas BovB LINEs in its
closest relative, A. contortrix, are frequently truncated (Figure 3.9). L3 CR1 LINEs are
truncated in all five colubroids (Figure 3.10A). L3 CR1 LINE coverage in henophidians and
blindsnakes is poor and aberrant compared to coverage in the colubroids (Figure 3.10B).
[Figure 3.8: line plot, "BovB Coverage"; x-axis: nucleotide position; series: P. molurus, B. constrictor, C. dussermieri, L. bicolor, A. contortrix, C. atrox, M. fulvius, S. nebulata, T. sirtalis]
Figure 3.8. Snake BovB Coverage from Shotgun Sequence. Henophidians are in red,
colubroids are in blue. Coverage in both groups is normalized on a per Mbp basis.
Figure 3.9. Log Ratio of BovB 3' to 5' Coverage. Regions compared are positions 3120-
3199 and 1559-1638. Phylogenetic relationships of species are indicated beneath species
names.
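The ratio in Figure 3.9 compares mean coverage over two 80 bp windows of the BovB consensus. A minimal sketch of that comparison is given below. It assumes a per-position depth list indexed from position 1 and a base-10 logarithm, since the base is not stated here; the function name is hypothetical. A window with zero coverage in the denominator would raise an error and needs separate handling.

```python
import math


def log_ratio_3p_to_5p(depth, region_3p=(3120, 3199), region_5p=(1559, 1638)):
    """log10 of mean depth in the 3' window over mean depth in the 5' window."""

    def mean_depth(region):
        start, end = region  # 1-based inclusive positions
        window = depth[start - 1:end]
        return sum(window) / len(window)

    return math.log10(mean_depth(region_3p) / mean_depth(region_5p))


# Toy profile: uniform depth of 1, rising to 10 from position 3001 onward.
toy_depth = [1] * 3000 + [10] * 300
print(log_ratio_3p_to_5p(toy_depth))
```

A value near zero indicates mostly full-length copies across the two windows, while larger positive values indicate stronger 5' truncation.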
[Figure 3.10: line plots; panel A, "L3 CR1 Coverage (Colubroids)"; panel B, "L3 CR1 Coverage (Blindsnakes/Henophidians)", series: T. melanurus, L. dulcis, A. scytale, P. molurus, C. dussermieri, L. bicolor; x-axis: nucleotide position]
Figure 3.10. Snake L3 CR1 Coverage from Shotgun Sequence. Coverage in both the
colubroids (A) and henophidians and blindsnakes (B) is normalized on a per Mbp basis.
Indexes are negative as the 5' sequence and full length of the element is unknown.
3.3.2.2 Blindsnake Non-L3 CR1 Coverage
A late-breaking development in this research was the discovery of a second family of CR1s
in L. dulcis. Two possible explanations exist for the aberrant coverage of L3 CR1
in the blindsnakes (Figure 3.10B): L3 CR1 in blindsnakes may be highly divergent from L3
CR1 in colubroids; or the most prevalent CR1 in blindsnakes is not an L3. The majority of
CR1 families detected by RepeatMasker/RepeatModeler in the blindsnakes were CR1
families identified de novo by RepeatModeler, supporting the latter hypothesis. As the L.
dulcis genome has a higher CR1 content than that of T. reticulatus, a consensus sequence was
assembled for L. dulcis. The majority of non-L3 CR1 LINEs in L. dulcis are truncated (Figure
3.11).
Figure 3.11. L. dulcis Non-L3 CR1 Coverage from Shotgun Sequence. Indices are negative
as the identity of the 5' end and total length of the element are unknown.
3.3.3 Origins
Horizontal transfer of BovB into the genomes of vipers has been suggested by Kordis and
Gubensek (1998). A Bayesian analysis of TE sequences was used to estimate a
phylogenetic tree of the relationships between BovB LINEs and examine support for
horizontal transfer of BovB. Previously, horizontal transfer of BovB was evaluated in A.
contortrix and P. molurus. Here the investigation is expanded to include additional snakes
for which reasonable BovB consensus sequences were obtained. We also examine CR1
relationships for the L3 CR1 consensus sequences and the L. dulcis non-L3 CR1 consensus
sequence in relation to other vertebrate CR1 elements.
Based on preliminary analysis, henophidian BovBs are estimated to be more closely
related to sea urchin, bovine, and Podarcis muralis (the wall lizard) BovBs than to
colubroid BovBs. Furthermore, while P. muralis BovB is estimated to be closely related to
colubroid BovBs, Anolis carolinensis (the green anole lizard) BovB is estimated to be more
distantly related to both snake groups, contrasting with the previous study's estimation
(Figure 3.12, C.6).
The analysis indicates that three distinct CR1 lineages exist in the snakes. The first is the L3
CR1 LINE, the most common CR1 in the five colubroids, and this element is estimated to
be most closely related to A. carolinensis L3 CR1, followed by human L3. The newly
identified L. dulcis non-L3 CR1 LINE is estimated to be most closely related to several
platypus CR1s, and slightly more distantly related to Xenopus CR1s. The third lineage of
CR1 in snakes is the snake L2 lineage, which is estimated to be most closely related to
coelacanth CR1 and slightly more distantly related to Xenopus L2 (Figure 3.13).
[Figure 3.12: phylogenetic tree; labeled tips and clades: Strongylocentrotus ssp. and Branchiostoma ssp. BovB, equine BovB, platypus BovB, hedgehog BovB, hyrax BovB, Anolis carolinensis BovBs, marsupial BovBs, Casarea dussermieri BovB, henophidian BovBs, Strongylocentrotus BovB, bovine BovB, Podarcis muralis BovBs, colubroid BovBs, outgroups; color key: Snakes, Lizards, Mammals, Sea urchin, Outgroups; scale bar: 0.7]
Figure 3.12. BovB Tree from Preliminary Bayesian Analysis. Key branches are highlighted
in color. Posterior probability for node support is pending a more complete analysis.
[Figure 3.13: phylogenetic tree; labeled tips and clades: outgroups (primarily bird CR1s), human L3, marsupial L3, colubroid L3 CR1, Anolis carolinensis CR1, marsupial and mammal L3s, vertebrate CR1s, snake L2s, coelacanth CR1, Xenopus, outgroups (primarily fish and mammal CR1s); color key: Snakes, Lizards, Mammals, Amphibians, Coelacanth, Outgroups]
Figure 3.13. CR1 Tree from Preliminary Bayesian Analysis. Key branches are highlighted in
color. Posterior probability for node support is pending a more complete analysis.
3.3.4 Expansion History
Analysis of BovB and L3 CR1 sequence divergence distributions (using species-specific
consensus sequences) reveals expansion histories for these LINEs similar to those
observed in the previous study (Figure 3.14). BovB elements expanded more recently in
henophidians and in C. atrox and A. contortrix (Figure 3.14A-B). Henophidian BovB
expansion is greater in C. dussermieri and L. bicolor than in P. molurus and B. constrictor.
Colubroid BovB expansion in M. fulvius, S. nebulata, and T. sirtalis does not appear to be
as recent as BovB expansion in C. atrox and A. contortrix (Figure 3.14B); however, the
expansion is also not particularly ancient, and BovB may have become inactivated in M.
fulvius, S. nebulata, and T. sirtalis. In contrast, L3 CR1 elements appear to have
accumulated over an extended period of time in all five colubroids (Figure 3.14C). Much
greater L3 CR1 accumulation has occurred in S. nebulata, C. atrox, and A. contortrix than
in T. sirtalis or M. fulvius, and the bulk of the L3 CR1 expansion has occurred in the more
recent half of the expansion period (excepting M. fulvius).
[Figure 3.14: three panels, "Henophidian BovB Divergence", "Colubroid BovB Divergence", and "Colubroid L3 CR1 Divergence"]
Figure 3.14. Sequence Divergence of Selected Transposable Elements. The species-
specific consensus sequences were determined for BovB LINEs (A and B), and L3 CR1
LINEs (C). Sequence divergence from the consensus was calculated for all alignable
sequences of these elements. *Fewer than 50 sequences were used.
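A divergence distribution of this kind can be built by scoring each aligned copy against the consensus and binning the results. The sketch below is illustrative and makes two assumptions not stated here: divergence is the uncorrected percentage of mismatched bases over positions where both sequences have A, C, G, or T, and bins are 1% wide. Function names are hypothetical.

```python
from collections import Counter


def percent_divergence(aligned_copy, aligned_consensus):
    """Uncorrected percent mismatch over positions with unambiguous bases."""
    compared = 0
    mismatches = 0
    for copy_base, cons_base in zip(aligned_copy.upper(), aligned_consensus.upper()):
        if copy_base in "ACGT" and cons_base in "ACGT":
            compared += 1
            if copy_base != cons_base:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


def divergence_histogram(divergences, bin_width=1.0):
    """Count copies per divergence bin, keyed by the lower edge of each bin."""
    counts = Counter(int(d // bin_width) for d in divergences)
    return {b * bin_width: counts[b] for b in sorted(counts)}


print(percent_divergence("ACGTACGTAC", "ACGTACGTAA"))
print(divergence_histogram([0.5, 1.2, 1.9, 10.0]))
```

A peak at low divergence indicates recent expansion, whereas a broad distribution indicates accumulation over a longer period.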
3.4 Discussion
The twelve snakes can be separated into three groups: the blindsnakes, the henophidians,
and the colubroids, with group members sharing similar characteristics with respect
to repeat content. Higher repeat content was identified in blindsnakes and colubroids
than in henophidians, and were the henophidians a monophyletic group, this pattern of
repeat content would suggest a single change in the manner in which transposable elements are
regulated. However, the henophidian group is not monophyletic, and thus genomic
regulation of TEs is expected to have changed at least twice.
The repeat content in the colubroids and blindsnakes is considerably higher than
expected based on genome size. This is, however, congruent with previous observations
concerning the A. contortrix genome, and implies unknown mechanisms are acting to
counterbalance transposable element expansion in the genomes of colubroid snakes.
BovB and L3 CR1 LINEs are highly truncated in the colubroids (Figure 3.15) and the
recently discovered L. dulcis non-L3 CR1 LINE is similarly truncated, implying that
genome-size selection pressure may be acting in the blindsnakes in addition to the
colubroids. This selection pressure may have become relaxed in C. atrox, which has both
a higher ratio of full-length BovB LINEs than any of the other colubroids and a larger
genome than A. contortrix (Table D.1), while A. contortrix, its closest relative, has a much
higher degree of LINE truncation.
BovB elements have been implicated in horizontal transfer, and phylogenetic analysis in
conjunction with age distribution continues to support the idea that BovB LINEs have
invaded snake genomes in two separate lineages. Although there are three separate
lineages of CR1, there is no evidence to suggest L3 CR1 is involved in horizontal transfer;
the age distribution and phylogenetic analysis for this particular lineage do not violate
expectations. L. dulcis non-L3 CR1 and LINE2 CR1 elements will require more careful
examination to rule out horizontal transfer as a possible origin.
[Figure 3.15: grid of truncation status for CR1 and BovB in L. dulcis, T. melanurus, A. scytale, C. dussermieri, B. constrictor, L. bicolor, P. molurus, A. contortrix, C. atrox, M. fulvius, S. nebulata, and T. sirtalis; legend: Truncation, Reduced truncation, No truncation, Insufficient data]
Figure 3.15. Summary of Truncation in BovB and CR1. Snakes are grouped as to location
in the phylogeny.
The expansion histories of BovB and L3 CR1 differ both in age and duration. Whereas L3
CR1 expansion is more ancient and occurred over a broader timescale, BovB divergence
suggests a younger element with a far more recent expansion. This is consistent with
earlier results when only P. molurus and A. contortrix data were available. Also, the
variation in colubroid BovB divergence distributions suggests that differential BovB
expansion occurred among colubroid lineages, although the analysis is limited by the use of
species-specific consensus sequences.
Given the vast differences in transposable element content and expansion across the
snake phylogeny, and given that transposable elements can lead to structural changes in
the genome and alterations in gene regulation, it is plausible to consider that a
relationship exists between a snake species' transposable element landscape and its
phenotype (Secor and Diamond 1995; Cohn and Tickle 1999; Fry et al. 2006; Castoe et al.
2008; Vonk et al. 2008; Castoe et al. 2009a). Gross phenotypic differences exist between
blindsnakes, henophidians, and colubroids, corresponding with observed variation in the
characteristics and composition of their transposable element content. Whether
phenotype and genome biology truly correspond is still an open question, but the
evidence so far is suggestive of a relationship between the two factors.
4. Conclusion
Snakes are an under-studied group of organisms, and despite their diversity, it is only in
the last two years that snake genomics have been investigated. In addition to the
analyses performed here, a Python molurus genome has been recently made available in
the form of assembled contigs. While whole genome sequencing is not yet at a state
where it can be performed readily for every species of interest, shotgun sequencing of
only a small portion of a genome can yield insights into the genomic structure of the
organism of interest, and it is this approach that was used to analyze the twelve snake
species.
The investigation into the nuclear genomes of snakes began with two snakes
approximately 100 million years apart in evolution, Python molurus and Agkistrodon
contortrix. The investigation led to the discovery that the two snake genomes were at
two extremes with respect to levels of recent repetitive content. Evidence was found for
horizontal transfer and for SSR seeding in A. contortrix by L3 CR1, a transposable element
that is defunct in avian reptiles. A. contortrix showed increased levels of repetitive
content yet retained a small genome size, suggesting some mechanism is maintaining
small genome size in opposition to transposable element expansion, and may be leading
to the fixation of truncated LINE elements and the purging of full-length versions.
Because the A. contortrix and P. molurus genomes had different transposable element
expansion, different transposable element regulation mechanisms, and different
horizontal transfer events, the investigation was expanded to ten more snake species,
chosen from key branches on the snake phylogeny to better understand the timing of
these changes with respect to snake evolutionary history. The analysis revealed dynamic
TE landscapes, varying from snake to snake, although different trends were observed in
the blindsnakes, henophidians, and colubroids. No one snake could be said to typify the
transposable element landscape of the "average" snake, nor could any particular
transposable element be said to represent the typical expansion pattern across the snake
phylogeny. LINEs were the dominant identifiable category of transposable elements in
snakes, with BovB and CR1 LINEs active at different time points. BovB showed recent
expansion in the henophidians and colubroids, and phylogenetic analysis supported
independent horizontal transfer origins for BovB in both groups. CR1 elements showed
more expansion in the blindsnakes and colubroids than in the henophidians, but a closer
investigation revealed that several different lineages of CR1 were expanding in different
branches of the snake phylogeny. Differences existed even between closely-related
snakes, for example, high levels of BovB truncation were found in A. contortrix, but not in
C. atrox, despite the two species having diverged only ~12 million years ago (Castoe
2009c). All this differential TE expansion may have contributed to genomic diversity
through non-allelic recombination, altered gene regulation, and other effects, and thereby to the rise
of snake adaptations such as venom, extreme fluctuations in metabolism, and
perhaps other innovations.
This study of the dynamic transposable element landscapes of snakes reinforces that
these are a diverse group of animals worthy of further study; however, there are some
broader implications as well. If two species with 100 million years of evolutionary
divergence (Castoe 2009b) can possess such vastly different genomes, one or even a few
species should not be expected to represent a major phylogenetic group. With the advent
of single-DNA molecule sequencing on the horizon, sequencing whole genomes for every
branch of the phylogeny may soon be a tractable goal, but until then, the use of shotgun
sequencing of genomic fractions can still provide insight into the genomic landscapes of
under-characterized organisms, such as the snakes.
APPENDIX A
Supplementary Methods for Chapter 2
A.1 Filtering mitochondrial genome reads
Prior to analyses, shotgun read sets were blasted (blastn) against mitochondrial genomes
from related species available on Genbank: Agkistrodon piscivorus and Python reticulatus.
Reads with a score > 100 and a length > 75 were mapped using the 454 gsMapper
software to the above reference mitochondrial genomes, and resulting contigs were used
as a reference for a second round of blastn. Reads with a score > 50 and length > 50 were
then mapped back to the original reference sequence, and reads which successfully
mapped to the reference sequence from both steps were assembled using the 454
gsAssembler to create new contigs. These new contigs became the reference sequence
for a final round of blastn using all other reads. Any read with a blast score > 50 and a
match length > 50 was iteratively added to the assembly to generate final mitochondrial
contigs. All reads which did not assemble into the final mitochondrial contigs were
considered to be nuclear genome reads in subsequent analyses.
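The first screening step above (score > 100 and length > 75) amounts to a threshold filter over blastn hits. The sketch below illustrates it under assumptions that are not stated here: hits are in the standard 12-column BLAST tabular layout (alignment length in column 4, bit score in column 12), and "score" refers to the bit score. The function name and read identifiers are hypothetical.

```python
def candidate_mito_reads(tabular_lines, min_score=100.0, min_length=75):
    """Return query IDs with at least one hit above both thresholds."""
    keep = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        length = int(fields[3])    # alignment length
        score = float(fields[11])  # bit score (assumed meaning of "score")
        if score > min_score and length > min_length:
            keep.add(query)
    return keep


example_hits = [
    "read1\tmito\t98.5\t120\t2\t0\t1\t120\t500\t619\t1e-50\t210.0",
    "read2\tmito\t90.0\t60\t6\t0\t1\t60\t10\t69\t1e-10\t80.0",
    "read3\tmito\t99.0\t80\t1\t0\t1\t80\t10\t89\t1e-30\t100.0",
]
print(candidate_mito_reads(example_hits))
```

The later rounds would reuse the same filter with both thresholds set to 50.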
A.2 Classification of newly identified elements
From the RepeatModeler results, which provide consensus sequences for newly identified
repeat elements, we removed redundancy by counting as
one all repeats that had the same top-hit Repbase family in the HOM search
in RepClass. Thus, the "total count" output represents the total number of TE families
within each superfamily, including the count of duplicated fragments of each family
member. The "non-redundant count" is the number of TE families after removal of
redundant TE families (as described above), and the "fragmentation rate" is the
non-redundant count expressed as a percentage of the total count.
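The three quantities defined above can be computed directly once each de novo family has been paired with its top Repbase hit. The sketch below is illustrative only; the function name, family identifiers, and hit names are hypothetical placeholders rather than values from this study.

```python
def summarize_families(families):
    """Return (total count, non-redundant count, fragmentation rate in %).

    families is a list of (family_id, top_repbase_hit) pairs.
    """
    total = len(families)
    # Families sharing the same top hit are counted once.
    non_redundant = len({top_hit for _, top_hit in families})
    rate = 100.0 * non_redundant / total
    return total, non_redundant, rate


example_families = [
    ("family-1", "hit-A"),
    ("family-2", "hit-A"),
    ("family-3", "hit-B"),
    ("family-4", "hit-C"),
]
print(summarize_families(example_families))
```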
A.3 cDNA library creation, sequencing, assembly and identification
Total RNA was extracted using Trizol Reagent (Invitrogen), following the manufacturer's
protocol. Extracted RNA was enriched for mature mRNA transcripts using three successive
rounds of purification with Oligo dT25 beads (PureBiotech), precipitated using linearized
acrylamide (Ambion), sodium acetate, and ethanol, and analyzed using a BioAnalyzer pico-RNA
chip (Agilent).
The mRNA was reverse transcribed with random heptamers and modified oligo-dT
primers (5'-/Phos/NNNNNNN-3' and 5'-/Phos/TTTTTVN-3') in a 2:1 ratio, using the
Superscript III reverse transcriptase kit (Invitrogen). The remaining RNA was digested
using RNAse A and RNAse H, and the product was purified using RNA Clean beads (Ambion). Two pairs
of double-stranded (with single stranded overhang) adapter oligos were directionally
ligated onto the existing synthesized first strand using T4 DNA Ligase (Invitrogen). Adapter
oligo sequences were: Adapter-A (5-prime adapter), oligo A-prime 5'-
NNNNNNCTGATGGCGCGAGGGAGG-dideoxyC-3', and oligo A 5'-
GCCTCCCTCGCGCCATGAG-3'; and Adapter-B (3-prime adapter) oligo B 5'-biotin-
GCCTTGCCAGCCCGCTCAGNNNNNN-phosphate-3', and oligo B-prime 5'-phosphate-
CTGAGCGGGCTGCAAGG-dideoxyC-3'. Following adapter ligation, ligation products were
purified three successive times using RNA Clean beads, and then with streptavidin
beads (PureBiotech). Samples were then melted from the streptavidin beads using 0.1M
NaOH and precipitated (as above). Completed libraries were quantified and checked for
appropriate size distribution using the DNA-nano chip on a BioAnalyzer (Agilent).
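
The primers above are written with IUPAC degenerate codes (N = any base, V = A, C, or G).
As a minimal sketch, the Python below expands such a degenerate primer into the concrete
sequences it represents; for the anchored oligo-dT primer TTTTTVN this shows that the
position after the T run can never be T. The function name and lookup table are
illustrative assumptions and are not part of the protocol.

```python
from itertools import product

# Subset of the standard IUPAC nucleotide codes needed for the primers above.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "V": "ACG", "N": "ACGT"}


def expand_degenerate(primer):
    """Return every concrete sequence matched by a degenerate primer."""
    choices = [IUPAC[base] for base in primer]
    return ["".join(bases) for bases in product(*choices)]


anchored = expand_degenerate("TTTTTVN")
print(len(anchored))                       # 3 options for V times 4 for N
print(all(seq[5] != "T" for seq in anchored))
```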

All cDNA libraries were sequenced on the 454 GS FLX sequencing instrument using the
LR70 sequencing kit and 70x75 mm PicoTiterPlate (Roche). Because the cDNA libraries were
directional, Emulsion PCR kits II and III (Roche) were used to obtain sequence from both
ends of transcripts (kit II sequencing from the 5' end, and kit III sequencing from the
3' end). RepeatMasker (using our snake-specific repeat element libraries) was run on
assembled contigs from moderately deep 454 sequencing (python = 174,504 reads,
copperhead = 137,778 reads) of liver cDNA libraries from both species to identify
evidence of transcriptional activity of transposable elements.

APPENDIX B
Supplementary Tables for Chapter 2
Table B.1. Summary of sequencing results and repeat identification analyses.
                                           Agkistrodon contortrix    Python molurus
                                           "Copperhead"              "Burmese python"
Estimated haploid genome size              1.35 Gbp                  1.42 Gbp
Nucleotides sequenced                      60,175,941 bp             28,077,583 bp
Percent of nuclear genome sampled          4.5%                      2.0%
Number of reads                            280,303                   118,973
GC content                                 42.53%                    39.78%
Mitochondrial genome reads identified      784 (0.28%)               1522 (1.28%)
RepeatMasker* masked                       7,104,438 bp (11.81%)     1,258,018 bp (4.48%)
RepeatModeler masked                       19,722,495 bp (32.77%)    4,696,122 bp (16.73%)
RepeatMasker* + RepeatModeler              26,826,933 bp (44.58%)    5,954,140 bp (21.21%)
P-Clouds masked                            33,205,805 bp (55.18%)    11,162,520 bp (39.76%)
Total (all repeat identification methods)  37,860,534 bp (62.92%)    12,935,991 bp (46.07%)
*RepeatMasker numbers include novel snake Bov-B and CR1 consensus sequences added
to the RepBase library.
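
The percentages in Table B.1 follow from simple ratios of base-pair counts. The short
Python sketch below reproduces two of them (percent of the nuclear genome sampled, and
percent of sampled sequence masked by RepeatMasker) from the values in the table. The
variable and function names are illustrative assumptions, and genome sizes are taken as
1 Gbp = 10^9 bp.

```python
def percent(part_bp, whole_bp):
    """Express part_bp as a percentage of whole_bp."""
    return 100.0 * part_bp / whole_bp


# Values copied from Table B.1.
genome_bp = {"copperhead": 1.35e9, "python": 1.42e9}
sampled_bp = {"copperhead": 60_175_941, "python": 28_077_583}
repeatmasker_bp = {"copperhead": 7_104_438, "python": 1_258_018}

for species in genome_bp:
    sampled = percent(sampled_bp[species], genome_bp[species])
    masked = percent(repeatmasker_bp[species], sampled_bp[species])
    print(f"{species}: {sampled:.1f}% of genome sampled, "
          f"{masked:.2f}% masked by RepeatMasker")
```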

Table B.2. Statistics on Agkistrodon TE superfamilies identified and classified by the
RepClass Homology module, and the estimated degree of fragmentation of the de novo
identified elements classified by RepClass.
Class     Sub-Class                Super-Family     Total     Redundant  Non-Redundant  Frag. rate
                                                    Families  Families   Families       (%)
Class I   non-ltr retrotransposon  CR1              123       90         33             73.2
Class I   non-ltr retrotransposon  L1               52        16         36             30.8
Class I   non-ltr retrotransposon  L3               12        11         1              91.7
Class I   non-ltr retrotransposon  penelope/bridge  37        28         9              75.7
Class I   non-ltr retrotransposon  Poseidon         16        13         3              81.3
Class I   non-ltr retrotransposon  r4/dong          18        14         4              77.8
Class I   non-ltr retrotransposon  REX1             12        7          5              58.3
Class I   non-ltr retrotransposon  RTE              82        76         6              92.7
Class I   non-ltr retrotransposon  SINE             13        8          5              61.5
Class I   ltr retrotransposon      Copia            3         1          2              33.3
Class I   ltr retrotransposon      DIRS             74        36         38             48.6
Class I   ltr retrotransposon      ERV1             4         0          4              0.0
Class I   ltr retrotransposon      ERV2             1         0          1              0.0
Class I   ltr retrotransposon      HERVK22I         3         2          1              66.7
Class I   ltr retrotransposon      ty3/gypsy        105       56         49             53.3
Class II  dna transposon           DNA transposon   2         1          1              50.0
Class II  dna transposon           hAT              26        17         9              65.4
Class II  dna transposon           Mariner/Tc1      16        5          11             31.3
Class II  dna transposon           MER1B            4         3          1              75.0
Class II  dna transposon           OposCharlie3b    15        14         1              93.3
Class II  dna transposon           P                1         0          1              0.0
Class II  dna transposon           SPIN             10        5          5              50.0
Class II  dna transposon           URR1_Xt          2         1          1              50.0
Totals                                              631       404        227
Weighted average fragmentation rate                                                     70.2

Table B.3. Statistics on Python TE superfamilies identified and classified by the RepClass
Homology module, and the estimated degree of fragmentation of the de novo identified
elements classified by RepClass.
Class     Sub-Class                Super-Family             Total     Redundant  Non-Redundant  Frag. rate
                                                            Families  Families   Families       (%)
Class I   non-ltr retrotransposon  CR1                      53        32         21             60.4
Class I   non-ltr retrotransposon  L1                       13        3          10             23.1
Class I   non-ltr retrotransposon  L3                       4         3          1              75.0
Class I   non-ltr retrotransposon  penelope/bridge          9         8          1              88.9
Class I   non-ltr retrotransposon  Poseidon                 5         4          1              80.0
Class I   non-ltr retrotransposon  r4/dong                  23        20         3              87.0
Class I   non-ltr retrotransposon  non-ltr retrotransposon  1         0          1              0.0
Class I   non-ltr retrotransposon  RP5S                     1         0          1              0.0
Class I   non-ltr retrotransposon  RTE                      36        25         11             69.4
Class I   non-ltr retrotransposon  SINE                     1         0          1              0.0
Class I   ltr retrotransposon      Gypsy                    9         1          8              11.1
Class I   ltr retrotransposon      ltr retrotransposon      1         0          1              0.0
Class II  dna transposon           hAT                      1         0          1              0.0
Class II  dna transposon           Mariner/Tc1              23        6          7              26.1
Class II  dna transposon           Maverick                 1         0          1              0.0
Totals                                                      181       102        69
Weighted average fragmentation rate                                                             67.6

Table B.4. Numbers of non-redundant families of repeat elements classified by RepClass
analyses of de novo identified repeats from RepeatModeler.
Sub-Class                Super-Family          Python  Agkistrodon
Non-LTR Retrotransposon  CR1                   21      33
                         L1                    10      36
                         L3                    1       1
                         penelope/bridge       1       9
                         Poseidon              1       2
                         r4/dong               3       4
                         retrotransposon       1       0
                         REX1                  0       5
                         RP5S                  1       0
                         RTE                   11      6
                         SINE                  1       5
                         Total                 51      101
LTR Retrotransposon      Copia                 0       2
                         DIRS                  0       38
                         HERVK22I              0       1
                         ERV1                  0       4
                         ERV2                  0       1
                         ty3/gypsy             8       49
                         LTR retrotransposon   1       0
                         Total                 9       72
DNA Transposon           DNA transposon        0       2
                         hAT                   1       9
                         Mariner/Tc1           7       11
                         MER1B                 0       1
                         OposCharlie3b         0       1
                         P                     0       1
                         SPIN                  0       5
                         URR1_Xt               0       1
                         Other DNA transposon  7       17
                         Maverick              1       0
                         Total                 16      48
Interspersed Elements    Grand Total           76      221

Table B.5. Details of consensus sequences (of "families") from RepeatModeler that were
interrogated by RepClass, and the numbers that were successfully classified by different
modules of the analyses.
                                                                       Python  Agkistrodon
Number of consensus sequences                                          571     1,996
Total classified by REPCLASS                                           203     655
Classified by REPCLASS Homology (HOM) module (1)                       181     632
Non-redundant families classified by REPCLASS Homology (HOM) module    75      226
Classified by REPCLASS Structural (STR) module                         7       17
Total classified non-redundant                                         82      243

Table B.6. Combined library-based annotation of repeats in the Python genome. No
elements being found in a particular class is indicated by a dash (--); therefore 0.00%
indicates presence of elements despite being lower than the significant figures reported
in the table. RepeatMasker-RepBase annotation includes novel snake Viper1 CR1 and Bov-B
LINE consensus sequences.
Columns, in order: RepeatMasker & RepBase; RepeatModeler & RepClass; Total.
Python molurus
Retroelements  2.91%  5.86%  8.77%
SINEs  0.10%  1.02%  1.12%
Squam1/Sauria  0.04%  0.06%  0.10%
LINEs  2.68%  3.89%  6.57%
CRE/SLACS  --  --
L2/CR1/Rex  1.14%  1.59%  2.73%
R1/LOA/Jockey  --
R2/R4/NeSL  --  0.80%  0.80%
RTE/Bov-B  1.54%  0.86%  2.40%
L1/CIN4  --  0.42%  0.42%
Penelope  0.08%  0.51%  0.59%
LTR elements  0.05%  0.44%  0.49%
BEL/Pao  --
Ty1/Copia  --  0.00%  0.00%
Gypsy  --  0.20%  0.20%
DIRS1  --  0.03%  0.03%
Retroviral  0.05%  0.04%  0.09%
DNA transposons  0.34%  1.48%  1.82%
hobo-Activator  0.01%  0.09%  0.10%
Tc1-IS630-Pogo  0.30%  1.12%  1.42%
En-Spm  --
MuDR-IS905  --
PiggyBac  --  0.02%  0.02%
Tourist/Harbinger  --
Other (Mirage, P-element, etc)
Rolling-circles  --
Unclassified Elements  9.31%  9.31%
Total interspersed repeats  3.25%  16.64%  19.89%
Small RNA  0.08%  0.08%
Satellites  --  0.04%  0.04%
Simple repeats  0.69%  0.11%  0.80%
Low complexity  0.55%  0.55%

Table B.7. Combined library-based annotation of repeats in the Agkistrodon genome. No
elements being found in a particular class is indicated by a dash (--); therefore 0.00%
indicates presence of elements despite being lower than the significant figures reported
in the table. RepeatMasker-RepBase annotation includes novel snake CR1 and Bov-B LINE
consensus sequences.
Columns, in order: RepeatMasker & RepBase; RepeatModeler & RepClass; Total.
Agkistrodon contortrix
Retroelements  7.45%  11.32%  18.77%
SINEs  0.19%  1.09%  1.28%
Squam1/Sauria  0.18%  0.01%  0.19%
LINEs  7.12%  5.72%  12.84%
CRE/SLACS  --  --  --
L2/CR1/Rex  6.52%  2.81%  9.33%
R1/LOA/Jockey  --  --  --
R2/R4/NeSL  --  0.51%  0.51%
RTE/Bov-B  0.60%  1.15%  1.75%
L1/CIN4  0.00%  0.60%  0.60%
Penelope  0.09%  0.98%  1.07%
LTR elements  0.05%  3.53%  3.58%
BEL/Pao  --  --
Ty1/Copia  --  0.02%  0.02%
Gypsy  2.16%  2.16%
DIRS1  --  0.86%  0.86%
Retroviral  0.05%  0.07%  0.12%
DNA transposons  0.09%  4.39%  4.48%
hobo-Activator  0.01%  2.80%  2.81%
Tc1-IS630-Pogo  0.05%  1.06%  1.11%
En-Spm  --  0.01%  0.01%
MuDR-IS905  --
PiggyBac  0.00%  0.01%  0.01%
Tourist/Harbinger
Other (Mirage, P-element, etc)  0.03%  0.03%
Rolling-circles  --  --
Unclassified Elements  --  16.42%  16.42%
Total interspersed repeats  7.53%  32.13%  39.66%
Small RNA  0.03%  0.03%
Satellites  0.00%  0.13%  0.13%
Simple repeats  3.09%  0.74%  3.83%
Low complexity  1.18%  0.00%  1.18%

Table B.8. Preliminary comparison of repeat annotations between Python molurus,
Agkistrodon contortrix, and the Anolis lizard.
Columns, in order: Python; Agkistrodon; Anolis lizard*.
Retroelements  8.77%  18.77%  6.80%
SINEs  1.12%  1.28%  1.60%
LINEs  6.57%  12.84%  3.74%
L2/CR1/Rex  2.73%  9.33%  3.07%
R1/LOA/Jockey  --  --
R2/R4/NeSL  0.80%  0.59%
RTE/Bov-B  2.40%  1.75%  0.68%
L1/CIN4  0.42%  0.60%  0.00%
Penelope  0.59%  1.07%  1.45%
LTR elements  0.49%  3.58%  0.02%
Ty1/Copia  0.00%  0.02%
Gypsy/DIRS1  0.23%  3.02%
Retroviral  0.09%  0.12%  0.02%
DNA transposons  1.82%  4.48%  0.91%
hobo-Activator  0.10%  2.81%  0.01%
Tc1-IS630-Pogo  1.42%  1.11%  0.03%
Satellites  0.04%  0.13%  0.00%
Simple repeats  0.80%  3.83%  0.00%
Low complexity  0.55%  1.18%  0.62%
*Note that the comparison with the lizard is approximate because the Anolis annotation is
based only on homology searching using the existing RepBase Tetrapoda library for this
species (with the addition of the snake CR1 and Bov-B LINE consensus), whereas the two
snake annotations use this library plus a snake-specific de novo library; thus the Anolis
annotation is likely less complete than those of the two snakes, and may underestimate
repeat abundances. Anolis data are based on annotation of 5 scaffolds of the Anolis
genome release 1.0 (see text for details).

Table B.9. Linear regression correlation coefficient (r²) comparing SSR loci frequencies (by
sequence motif) between squamate species.
Species Compared       2mers  3mers  4mers  5mers  6mers  All
Agkistrodon v. Python  0.495  0.668  0.241  0.032  0.734  0.405
Agkistrodon v. Anolis  0.288  0.781  0.132  0.007  0.055  0.297
Anolis v. Python       0.954  0.639  0.933  0.880  0.047  0.716
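
For reference, an r² value of the kind reported in Table B.9 can be obtained from two
paired vectors of per-motif SSR frequencies, one per species. The sketch below is a
minimal pure-Python illustration of that calculation for a simple linear regression. The
function name and the frequency values are hypothetical placeholders, not data from this
study, and the software actually used for the regressions is not stated in this appendix.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences.

    For a simple linear regression of y on x this equals the r-squared
    of the fit.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r-squared is undefined for a constant vector")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical loci-per-Mbp frequencies for the same four motifs in two species.
species_a = [12.0, 3.5, 8.1, 0.9]
species_b = [10.2, 4.1, 7.7, 1.5]
print(f"r^2 = {r_squared(species_a, species_b):.3f}")
```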
Table B.10. Results of BlastN search using the copperhead Viper CR1 LINE consensus
sequence as a query against the non-redundant (NR) database constrained to only find
sequences originating from snakes (and excluding Agkistrodon contortrix microsatellite
sequences which otherwise consume the results).
Accession Description Max Score E value
dbj| AB440236.1| Trimeresurus flavoviridis phospholipase A2 ge... 1223 0
dbj | D31777.1| TFLGTFTBPA Trimeresurus flavoviridis gTfTBP gene... 978 0
dbj| D31782.1| TRUGTGTBPB Trimeresurus gramineus gTgTBP gene fo... 922 0
dbj | D31779.1| TRUGTGPLAB Trimeresurus gramineus gTgPLA6a gene ... 825 0
dbj|D13384.1|TFLPLABPII Trimeresurus flavoviridis BP-II gene ... 746 0
dbj |D87549.1| Trimeresurus flavoviridis gPLl-A gene for phosp... 517 8.00E-146
dbj |AB003473.1| Trimeresurus flavoviridis DNA for phospholipa... 517 8.00E-146
gb|AF101236.1| Naja sputatrix neutral phospholipase A2 (NPLA2... 491 3.00E-138
gb|AF101235.1| Naja sputatrix acidic phospholipase A2 (APLA2)... 491 3.00E-138
gb|AF332697.1|AF332697 Vipera ammodytes Bov-B LINE, complete ... 457 6.00E-128
dbj |D87550.1| Trimeresurus flavoviridis gPLI-B gene for phosp... 419 2.00E-116
dbj |AB111959.1| Laticauda semifasciata pla2 gene for phosphol... 398 5.00E-110
gb|EU293789.1| Sistrurus catenatus edwardsi three finger toxi... 396 2.00E-109
gb|EU293792.1| Sistrurus catenatus edwardsi three finger toxi... 379 1.00E-104
gb|EU293791.1| Sistrurus catenatus edwardsi three finger toxi... 374 6.00E-103
gb|EU293790.1| Sistrurus catenatus edwardsi three finger toxi... 354 6.00E-97
dbj |AB062444.1| Laticauda laticaudata pla2 gene for phospholi... 325 3.00E-88
gb|HM179517.1| Sistrurus catenatus catenatus microsatellite S... 322 3.00E-87
dbj |AB062441.1| Laticauda laticaudata pla2 gene for phospholi... 320 1.00E-86
dbj|AB062443.1| Laticauda laticaudata pla2 gene for phospholi... 316 1.00E-85
dbj|AB037219.2| Laticauda semifasciata pla2 gene for phosphol... 315 5.00E-85
dbj|AB062440.1| Laticauda semifasciata pla2 gene for phosphol... 315 5.00E-85
dbj|AB062439.1| Laticauda semifasciata pla2 gene for phosphol... 315 5.00E-85
dbj|AB078346.1| Laticauda semifasciata pla2 gene for phosphol... 315 5.00E-85
dbj |AB062448.1| Laticauda colubrina pla2 gene for phospholipa... 311 6.00E-84
dbj|AB062447.1| Laticauda colubrina pla2 gene for phospholipa... 307 7.00E-83
dbj|AB062446.1| Laticauda colubrina pla2 gene for phospholipa... 307 7.00E-83
dbj |AB062445.1| Laticauda colubrina pla2 gene for phospholipa... 307 7.00E-83
dbj| AB062442.1| Laticauda laticaudata pla2 gene for phospholi... 307 7.00E-83
gb |AF223946.1| Crotalus durissus terrificus isolate 9705 crot... 289 2.00E-77
dbj | AB060638.1| Elaphe quadrivirgata mRNA for phospholipase A... 289 2.00E-77
gb|AY714263.1| Rhinoplocephalus nigrescens clone Rnl26 micros... 224 7.00E-58
gb| AY714254.1| Rhinoplocephalus nigrescens clone Rn54 microsa... 224 7.00E-58
gb|AF204971.1|AF204971 Pseudonaja textilis alpha neurotoxin (... 203 2.00E-51
gb|AF204969.1|AF204969 Pseudonaja textilis alpha neurotoxin (... 203 2.00E-51
gb|AF204973.1|AF204973 Pseudonaja textilis alpha neurotoxin (... 197 1.00E-49
gb|AF204970.1|AF204970 Pseudonaja textilis alpha neurotoxin (... 194 1.00E-48
dbj|D13383.1|TFLPLABPI Trimeresurus flavoviridis BP-I gene fo... 188 5.00E-47
gb|AF204972.1|AF204972 Pseudonaja textilis alpha neurotoxin (... 185 6.00E-46
gb|FJ554641.1| Trimeresurus flavoviridis vascular endothelial... 176 3.00E-43
gb|AY714253.1| Rhinoplocephalus nigrescens clone Rn50 microsa... 174 1.00E-42
gb|AY425950.1| Bungarus candidus beta-neurotoxin gene, partia... 167 2.00E-40
emb|AJ431709.1| Bungarus multicinctus a1 chain gene for A1 ch... 167 2.00E-40
emb|AJ431708.1| Bungarus multicinctus a8 chain gene for A8 ch... 167 2.00E-40
emb|AJ251227.1| Bungarus multicinctus partial a11 gene for be... 167 2.00E-40
emb|AJ251221.1| Bungarus multicinctus partial a13 gene for be... 167 2.00E-40
emb|AJ431710.1| Bungarus multicinctus a1a chain gene for A1a ... 167 2.00E-40
emb|AJ431707.1| Bungarus multicinctus a2 chain gene for A2 ch... 167 2.00E-40
gb|AF544660.1| Elaphe obsoleta microsatellite Eobms366 sequence 163 2.00E-39
emb|AJ251222.1| Bungarus multicinctus partial a14 gene for be... 163 2.00E-39
emb|AJ251220.1| Bungarus multicinctus partial a12 gene for be... 163 2.00E-39
emb|AJ251360.1| Bungarus multicinctus partial gene for A1 cha... 163 2.00E-39
gb|AY298758.1| Crotalus tigris clone Crti06 microsatellite se... 158 8.00E-38
emb|AJ251226.1| Bungarus multicinctus a22 gene for beta-bunga... 154 1.00E-36
gb|GU222457.1| Crotalus oreganus concolor clone MFR T1 micros... 145 5.00E-34
gb|HM537138.1| Thermophis baileyi microsatellite TbB03 sequence 143 2.00E-33
gb|EU269460.1| Rhabdophis tigrinus mitochondrial uncoupling p... 131 1.00E-29
gb|AY310916.1| Bothrops jararaca bradykinin-potentiating/C-ty... 131 1.00E-29
gb |FJ641208.1| Crotalus tigris microsatellite Crti23 sequence 118 7.00E-26
dbj|AB072392.1| Agkistrodon blomhoffi mRNA for M-LAO, complet... 109 4.00E-23
gb|EF452300.1| Boiga dendrophila denmotoxin precursor, gene,... 107 1.00E-22
gb|AY027495.1| Pseudonaja textilis class IB phospholipase A2 ... 104 2.00E-21
gb|AY027494.1| Pseudonaja textilis class IB phospholipase A2 ... 104 2.00E-21
gb|AF093248.1|AF093248 Crotalus atrox FAD-containing L-amino ... 104 2.00E-21
gb|AF071564.1|AF071564 Crotalus adamanteus L-amino acid oxida... 104 2.00E-21
gb |EF080833.1| Bungarus fasciatus L-amino acid oxidase mRNA,... 98.7 7.00E-20
gb |FJ660263.1| Agkistrodon piscivorus voucher gpl locus 51 ge... 96.9 2.00E-19
gb|GU320304.1| Pantherophis guttatus HOXD13 (HoxD13), HOXD11... 95.1 8.00E-19
gb |FJ790439.1| Bothrops jararaca toxin 1 (Toxl) gene, partial... 91.5 1.00E-17
gb|FJ660261.1| Sistrurus miliarius streckeri voucher smm2 loc... 91.5 1.00E-17
gb| FJ660260.1| Sistrurus miliarius streckeri voucher smsl loc... 91.5 1.00E-17
gb|FJ660259.1| Sistrurus miliarius miliarius voucher smml loc... 91.5 1.00E-17
gb|FJ660258.1| Sistrurus miliarius barbouri voucher smbl04 lo... 91.5 1.00E-17
gb|FJ660257.1| Sistrurus miliarius barbouri voucher smblOO lo... 91.5 1.00E-17
gb |FJ660256.1| Sistrurus miliarius barbouri voucher smb2 locu... 91.5 1.00E-17
gb| FJ660255.1| Sistrurus catenatus tergeminus voucher sctll5 ... 91.5 1.00E-17
gb|FJ660254.1| Sistrurus catenatus tergeminus voucher sct83 1... 91.5 1.00E-17
gb |FJ660253.1| Sistrurus catenatus tergeminus voucher sct491... 91.5 1.00E-17
gb| FJ660252.1| Sistrurus catenatus tergeminus voucher sct391... 91.5 1.00E-17
gb|FJ660251.1| Sistrurus catenatus tergeminus voucher sctl61... 91.5 1.00E-17
gb| FJ660250.1| Sistrurus catenatus tergeminus voucher sct2 lo... 91.5 1.00E-17
gb| FJ660249.1| Sistrurus catenatus edwardsi voucher scel50 lo... 91.5 1.00E-17
gb| FJ660247.1| Sistrurus catenatus edwardsi voucher sce32 loc... 91.5 1.00E-17
gb |FJ660246.1| Sistrurus catenatus edwardsi voucher sce27 loc... 91.5 1.00E-17
gb| FJ660248.1| Sistrurus catenatus edwardsi voucher scel27 lo... 87.8 1.00E-16
gb|FJ660245.1| Sistrurus catenatus catenatus voucher scc806 1... 87.8 1.00E-16
gb| FJ660244.1| Sistrurus catenatus catenatus voucher scc583 1... 87.8 1.00E-16
gb |FJ660243.1| Sistrurus catenatus catenatus voucher scc348 1... 87.8 1.00E-16
gb |FJ660242.1| Sistrurus catenatus catenatus voucher sccl63 1... 87.8 1.00E-16
gb |FJ660241.1| Sistrurus catenatus catenatus voucher scclS6 I... 87.8 1.00E-16
gb |FJ660240.1| Sistrurus catenatus catenatus voucher sccl511... 87.8 1.00E-16
gb |FJ660239.1| Sistrurus catenatus catenatus voucher sccl421... 87.8 1.00E-16
gb|FJ660238.1| Sistrurus catenatus catenatus voucher scc88 lo... 87.8 1.00E-16
gb |FJ660237.1| Sistrurus catenatus catenatus voucher scc44 lo... 87.8 1.00E-16
gb |FJ660236.1| Sistrurus catenatus catenatus voucher scc39 lo... 87.8 1.00E-16
gb |FJ660235.1| Sistrurus catenatus catenatus voucher scc29 lo... 87.8 1.00E-16
gb |FJ790440.1| Bothrops jararaca toxin 2 (Tox2) gene, partial... 86 4.00E-16
gb |EF080832.1| Bungarus multicinctus L-amino acid oxidase mRN... 84.2 2.00E-15
gb |FJ554640.1| Trimeresurus flavoviridis vascular endothelial... 82.4 5.00E-15
emb |FM177950.1| Echis ocellatus mRNA for L-amino oxidase (lao... 69.8 3.00E-11

Table B.11. Summary of cDNA hits to known transposable elements. Results are based on
running RepeatMasker under stringent settings* using the combination of the Tetrapoda
RepBase library and our snake-specific repeat library (analogous to our analysis of
genomic sequence data). Raw counts refers to reads that had blast hits to the repeat
database at a minimum threshold (*) of a Smith-Waterman alignment score of 500, and
percentage of counts refers to the number of raw counts divided by the number of
reads in the entire cDNA set for each species.
                           Agkistrodon contortrix   Python molurus           Fold Difference
                           Raw Counts  % of Counts  Raw Counts  % of Counts  (Agk. / Python)
Total Sequences            137778                   174504
Retroelements              8633        6.27%        510         0.29%        21.4
SINEs:                     523         0.38%        263         0.15%        2.5
Squam1/Sauria              30          0.02%        1           0.00%        38.0
LINEs:                     6386        4.63%        172         0.10%        47.0
CRE/SLACS                  0           0.00%        0           0.00%        -
L2/CR1/Rex                 4247        3.08%        44          0.03%        122.3
R1/LOA/Jockey              0           0.00%        0           0.00%        -
R2/R4/NeSL                 199         0.14%        17          0.01%        14.8
RTE/Bov-B                  1101        0.80%        87          0.05%        16.0
L1/CIN4                    342         0.25%        18          0.01%        24.1
PLEs:                      564         0.41%        59          0.03%        12.1
LTR elements:              1160        0.84%        16          0.01%        91.8
BEL/Pao                    0           0.00%        0           0.00%        -
Ty1/Copia                  4           0.00%        1           0.00%        5.1
Gypsy                      745         0.54%        5           0.00%        188.7
DIRS1                      254         0.18%        0           0.00%        -
Retroviral                 24          0.02%        6           0.00%        5.1
DNA transposons            1293        0.94%        98          0.06%        16.7
hobo-Activator             887         0.64%        4           0.00%        280.9
Tc1-IS630-Pogo             253         0.18%        79          0.05%        4.1
En-Spm                     0           0.00%        0           0.00%
MuDR-IS905                 0           0.00%        0           0.00%
PiggyBac                   0           0.00%        0           0.00%
Tourist/Harbinger          0           0.00%        0           0.00%
Other                      12          0.01%        0           0.00%
Rolling-circles            0           0.00%        0           0.00%        -
Unclassified:              7687        5.58%        347         0.20%        28.1
All Transposable Elements  17613       12.78%       955         0.55%        23.4
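
The derived columns of Table B.11 can be recomputed from the raw counts. The Python
sketch below shows the percentage of counts and the fold difference for the
Retroelements row, assuming the fold difference is the ratio of the two per-species
proportions (which matches the reported values). Function names are illustrative
assumptions; rows with zero Python counts have no defined ratio and are returned as None.

```python
def percentage_of_counts(raw_count, total_reads):
    """Raw counts as a percentage of all reads in the cDNA set."""
    return 100.0 * raw_count / total_reads


def fold_difference(agk_raw, agk_total, python_raw, python_total):
    """Ratio of the Agkistrodon proportion to the Python proportion.

    Returns None when the Python count is zero (ratio undefined).
    """
    if python_raw == 0:
        return None
    return (agk_raw / agk_total) / (python_raw / python_total)


# Totals and Retroelements row from Table B.11.
AGK_TOTAL, PYTHON_TOTAL = 137_778, 174_504
print(f"{percentage_of_counts(8633, AGK_TOTAL):.2f}%")
print(f"{fold_difference(8633, AGK_TOTAL, 510, PYTHON_TOTAL):.1f}")
```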

APPENDIX C
Supplementary Figures for Chapter 2
[Figure C.1: chart image not reproduced. Repeat classes plotted: Retroelements, SINEs,
Penelope, LINEs, L2/CR1/Rex, RTE/Bov-B, L1/CIN4, LTR elements, Ty1/Copia, Gypsy/DIRS1,
Retroviral, DNA transposons, hobo-Activator, Tc1-IS630-Pogo, PiggyBac, Tourist/Harbinger,
Total interspersed, Satellites, Simple repeats, Low complexity. Value axis: 0.00% to 4.00%.]
Figure C.1. Comparison Between RepeatMasker-RepBase Annotations of Two Different
Shotgun Sequence Libraries for the Python. One shotgun library was sequenced using
the FLX-LR chemistry, and the second was sequenced with the Titanium-XLR chemistry.

[Figure C.2: chart image not reproduced. Panel: Agkistrodon. Series: RepeatModeler &
RepClass; RepeatMasker & RepBase. Value axis: 0.00% to 18.00%.]
Figure C.2. Comparison of the Repeat Element Annotations of RepeatMasker-RepBase
and the De Novo Repeat Identification of RepeatModeler-RepClass for the Copperhead.

[Figure C.3: chart image not reproduced. Panel: Python. Series: RepeatModeler &
RepClass; RepeatMasker & RepBase. Value axis: 0.00% to 10.00%.]
Figure C.3. Comparison of the Repeat Element Annotations of RepeatMasker-RepBase
and the De Novo Repeat Identification of RepeatModeler-RepClass for the Python.

[Figure C.4: chart image not reproduced. Series: Agkistrodon; Python. Horizontal axis:
Quality Score. Vertical axis: 0% to 25%.]
Figure C.4. Quality Score Distribution for Shotgun Sequence.

[Figure C.5: chart image not reproduced. Series: Python; Agkistrodon; Anolis. Horizontal
axis: Repeat Type. Vertical axis: 0% to 20%.]
Figure C.5. Comparison of Repeat Annotations Between Snakes and the Anolis Lizard.
Note that the comparison with the lizard is approximate because the Anolis annotation is
based only on homology searching using the existing RepBase Tetrapoda library for this
species (with the addition of the snake CR1 and Bov-B LINE consensus), whereas the two
snake annotations use this library plus a snake-specific de novo library; thus the Anolis
annotation is likely less complete than those of the two snakes, and may underestimate
repeat abundances. Anolis data are based on annotation of 5 scaffolds of the Anolis
genome release 1.0.

Figure C.6. Inferred Phylogeny of Bov-B LINE Elements in Tetrapod Genomes, Based on
Consensus Sequences Available in RepBase and Including All Bov-B Sequences from
Squamate Reptiles Available on Genbank. Numbers adjacent to nodes indicate posterior
probability support based on Bayesian phylogenetic analyses conducted on nucleotide
sequences in MrBayes.

Figure C.7. Frequencies of SSR Loci in the Two Snakes and the Anolis Lizard. Frequencies
are for (A) 2-6mer SSR loci, (B) 2mer, (C) 3mer, (D) 4mer, (E) 5mer, and (F) 6mer SSR loci,
per Mbp. The top 30 5mers in Agkistrodon are shown in (E) and the top 50 6mers are
shown in (F).

[Figure C.8: plot image not reproduced. Recoverable axis labels: Anolis; Python.]
Figure C.8. Linear regressions comparing the frequencies of simple sequence repeat (SSR)
loci per Mbp between species, broken down by SSR sequence motif.

[Figure C.9: image not reproduced. Panel labels: AF332697: Bov-B Va LINE (Vipera
ammodytes); mapping of copperhead (Agkistrodon) 454 reads, by position; Blast hit
mappings of the Anolis lizard CR1 LINE element (FJ158987).]
Figure C.9. Summary of Evidence Demonstrating that the Sequence Identified as a Bov-B
Va LINE in Vipera ammodytes is in Fact a Chimeric Sequence Composed of a Bov-B LINE
Flanked by Two Snake1 CR1 LINE Fragments. Mapping copperhead 454 reads to this
sequence shows three regions of very high copy number (highly repetitive sequences)
connected by two regions of very low copy number. This and other data lead us to
conclude that, rather than being a single LINE sequence, it is a chimeric sequence
with a Bov-B LINE flanked by two CR1 LINEs. It is also notable that the central Bov-B
sequence matches other known Bov-B and RTE LINEs in length and structure, whereas the
entire fragment is almost 1 kb longer than other known Bov-B/RTE LINEs.

APPENDIX D
Supplementary Tables for Chapter 3
Table D.1. Summary of sequencing results for 12 snakes.
Columns, in order: Species; Common name; Estimated haploid genome size; Nucleotides
sequenced (bp); Number of reads; Nuclear nucleotides (bp); Number of nuclear reads;
Percent of nuclear genome sampled; GC content.
Agkistrodon contortrix  "Copperhead"  1.35 Gbp  60,344,580  280,303  60,175,941  279,519  4.50%  42.53%
Python molurus  "Burmese python"  1.42 Gbp  28,496,896  118,973  28,077,583  117,451  2.00%  39.78%
Typhlops reticulatus  "Blind snake"  1.92 Gbp  6,741,155  50,087  6,720,475  49,989  0.35%  46.13%
Leptotyphlops dulcis  "Texas threadsnake"  Unknown  11,828,885  71,058  11,823,143  71,034  Unknown  43.18%
Anilius scytale  "Pipe snake"  Unknown  7,542,192  50,319  7,508,176  50,164  Unknown  43.52%
Boa constrictor  "Boa constrictor"  1.71 Gbp  11,575,550  38,037  11,472,103  37,717  0.67%  39.83%
Casarea dussumieri  "Round Island boa"  Unknown  76,243,119  470,682  76,218,678  470,585  Unknown  43.43%
Loxocemus bicolor  "Mexican python"  Unknown  6,172,347  40,583  6,163,619  40,557  Unknown  42.87%
Crotalus atrox  "Western diamondback rattlesnake"  1.71 Gbp  19,098,306  63,094  18,965,550  62,709  1.11%  38.77%
Micrurus fulvius  "Coral snake"  1.42 Gbp  7,735,311  26,831  7,703,086  26,728  0.54%  39.35%
Sibon nebulata  "Goo-eating snake"  Unknown  12,772,185  43,542  12,764,026  43,514  Unknown  41.01%
Thamnophis sirtalis  "Garter snake"  1.87 Gbp  49,533,818  176,307  49,363,444  175,674  2.64%  42.59%