COMBINED ADMIXTURE AND ASSOCIATION MAPPING FOR COMPLEX TRAITS

by

DANIEL V. YORGOV

B.A., University of Economics Varna, Bulgaria, 2001
M.A., University of Economics Varna, Bulgaria, 2003
M.S., Michigan Technological University, 2006

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Applied Mathematics

2016
This thesis for the Doctor of Philosophy degree by Daniel V. Yorgov has been approved for the Department of Mathematical and Statistical Sciences by

Stephanie A. Santorico, Advisor
Joshua P. French, Chair
Tasha E. Fingerlin
Audrey E. Hendricks
Burton Simon

July 30, 2016
Yorgov, Daniel V. (Ph.D., Applied Mathematics)
Combined Admixture and Association Mapping for Complex Traits
Thesis directed by Associate Professor Stephanie A. Santorico

ABSTRACT

Recently admixed populations, such as Latinos and African Americans, have higher genetic diversity and present unique opportunities for mapping the genetic basis of a trait. The heterogeneous genetic background of admixed cohorts also presents challenges in accounting for the additional correlation in the samples. Admixture mapping is a group of methods for localizing chromosomal regions associated with a trait, using the long-range correlation generated by the admixture process. Genome-wide association studies (GWAS) are a popular approach to scan the genome for variants associated with a trait. Admixed cohorts allow for gene mapping methods that combine association and admixture information. The main theme of this work is the feasibility and the practical utility of combining admixture and association in the context of contemporary high-resolution genetic data. I implement a combined method, fully stratifying the association effect based on the ancestral origin of the marker, and apply it to a real admixed dataset where a heterogeneous genetic effect at a variant is observed. An explicit test is proposed for heterogeneity among the effects due to genetic association given different ancestral backgrounds in the region. A simulated set of high-resolution genotype data with allele frequencies and linkage disequilibrium patterns similar to those of contemporary Latino samples is
produced. Genotypes are coupled with inferred local ancestry, which allows for simulation of phenotype data from a realistic structure of genetic variation. I investigate different ways of incorporating local ancestry into single-variant association testing at GWAS chip resolution and compare such approaches to imputation followed by a standard association test. I simulate polygenic traits with a single causal common variant per locus, and a causal allele with the same effect regardless of its ancestral origin. For this scenario, my results suggest that there is limited benefit from incorporating local ancestry in admixture or association testing, since higher power to detect the genetic signal can be achieved by imputation. I further show that the standard linear mixed model approach, without local ancestry adjustment through a fixed effect, controls well for Type I error in both the association test and the combined admixture and association tests.

The form and content of this abstract are approved. I recommend its publication.
Approved: Stephanie A. Santorico
To Mariana, the love of my life, for her constant support, encouragement and love
ACKNOWLEDGMENTS

Foremost, I would like to express my deepest gratitude to my advisor, Dr. Stephanie Santorico, for exposing me to the world of statistical genetics and for sharing her great knowledge and skills in the field. This work would not have been possible without her guidance, advice, and continuous support throughout the years. I am also very thankful to Dr. Joshua French, Dr. Tasha Fingerlin, Dr. Audrey Hendricks, and Dr. Burton Simon for serving on my committee and for their numerous suggestions and feedback that helped me improve this dissertation.

The Department of Mathematical and Statistical Sciences at The University of Colorado Denver is a great and supportive place for a student to grow, not only as a researcher, but as a teacher as well. I am thankful to many faculty members for their mentoring and support. In particular, I am grateful to Dr. RaKissa Manzanares and to Gary Olson for helping me improve as a teacher and for the guidance and career advice provided by Dr. Mike Ferrara. I would also like to acknowledge Dr. Jan Mandel for his advice and assistance when using the computational resources at CU.

As part of my research assistantship, I have been involved in applied statistical genetics research at The University of Colorado Anschutz Medical Campus. I am indebted to Dr. Richard Spritz for the opportunity to gain practical knowledge. Working for his lab has been a very rewarding and enjoyable experience.

Finally, I would like to acknowledge the support and encouragement I have received from my family through my years in graduate school and in particular from my wife Mariana, my son Vassil, and my daughter Natalia. My love is with them.
TABLE OF CONTENTS

CHAPTER

I. INTRODUCTION
    Background and Contemporary Genetic Data
        Heritable Traits
        Contemporary Genetic Data
    Genetic Variation
        Allele Frequencies
        Recombinations
    Uncovering the Genetic Basis of a Trait
        Genome-Wide Association Studies
        Linkage Mapping
        Genetic Architecture of Complex Traits
        Population Structure
        Imputation
    Admixed Populations and Admixture
        Admixed Populations
        Admixture Process
        Admixture Mapping
        Local Ancestry Inference
    Statistical Methods
        Regression and Likelihood Ratio Test Statistics
        Linear Mixed Model with Empirical Genetic Relationship Matrix
        Combined Admixture and Association
    Thesis Overview

II. USE OF ADMIXTURE AND ASSOCIATION FOR DETECTION OF QUANTITATIVE TRAIT LOCI IN THE TYPE 2 DIABETES GENETIC EXPLORATION BY NEXT-GENERATION SEQUENCING IN ETHNIC SAMPLES (T2D-GENES) STUDY
    Abstract
    Background
    Methods
        Study Samples
        Ancestry Estimation
        Statistical Models
        Reference Panels for Local Ancestry Estimation
        Data Processing and Merging
        Local Ancestry Estimation Considerations
    Results and Discussion
        Ancestry Estimation
        Statistical Tests
        Heterogeneous Association and/or Admixture Model at rs12639065
    Conclusions

III. A HIGH-RESOLUTION SIMULATED LATINO DATASET WITH INFERRED LOCAL ANCESTRY
    Abstract
    Introduction
    Methods
        Reference Panels for Local Ancestry Inference
        Local Ancestry Inference
        Simulated Genomes
        Description of the Simulated Dataset
    Results
    Discussion

IV. EFFECTS OF IMPUTATION ON COMBINED ADMIXTURE AND ASSOCIATION MAPPING
    Abstract
    Introduction
    Methods
        Simulated Genomes
        GWAS Chip Resolution Dataset
        Polygenic Traits
        Causal Markers
        Statistical Models
        Admixture Mapping
        Genetic Relationship Matrix for Statistical Tests
        Likelihood Ratio Test Statistics
        Power Calculations
        Controlling for Type I Error
        Differential LD Patterns
    Results
        Causal Markers
        Controlling for Type I Error
        Power
    Discussion
    Acknowledgments

V. CONCLUSIONS AND FUTURE WORK

REFERENCES

APPENDIX
    A. Supplementary Table 4.1
    B. Supplementary Table 4.2
    C. Decorrelation of Phenotypes under the Null Model
LIST OF TABLES

TABLE
2.1. LAMP-LD ancestry estimates for different marker sets and parameters
2.2. Wald and likelihood ratio p-values at the significant marker for log(DBP)
2.3. Parameter estimates at the significant marker rs12639065 for log(DBP)
3.1. Mean global ancestral proportions in the simulated dataset
4.1. Genomic control inflation factors at the null-only SNPs
4.2. Empirical power for regions that have stronger signal at Omni resolution for the admixture and/or combined tests compared to the association test
LIST OF FIGURES

FIGURE
1.1. Single nucleotide change in the DNA at a particular chromosomal position
1.2. Recombination during meiosis
1.3. Ancestral recombination graph for two mutations
1.4. Complex phenotype
1.5. Rare and common variants against effect sizes
1.6. Imputation with reference panel
1.7. Admixture process schematic
3.1. Simulated diploid genome construction
3.2. Heat map produced from the GRM
3.3. Principal components analysis
4.1. Quantile-quantile plot of observed and expected p-values
4.2. Average power
LIST OF ABBREVIATIONS

AIM     Ancestry informative markers
AMR     Latino ancestry
CEU     Utah residents with ancestry from Northern and Western Europe
CLM     Colombians in Medellin, Colombia
DBP     Diastolic blood pressure
DF      Degrees of freedom
DNA     Deoxyribonucleic acid
FWER    Family-wise error rate
GAW18   Genetic Analysis Workshop 18
GRM     Genetic relationship matrix
GWAS    Genome-wide association study
HGDP    Human Genome Diversity Project
HMM     Hidden Markov model
LA      Local ancestry
LD      Linkage disequilibrium
LMM     Linear mixed model
LRTS    Likelihood ratio test statistic
MAF     Minor allele frequency
MXL     Mexican sample in Los Angeles
NA      Native American
PCA     Principal components analysis
PEL     Peruvians in Lima, Peru
PUR     Puerto Ricans in Puerto Rico
SBP     Systolic blood pressure
SNP     Single nucleotide polymorphism
WGS     Whole genome sequencing
YRI     Yoruba in Ibadan, Nigeria
CHAPTER I

INTRODUCTION

Background and Contemporary Genetic Data

Heritable Traits

Resemblance between relatives and increased susceptibility to diseases in the presence of a family history of the disease both suggest that genetic factors contribute to biological differences between people. Broadly, geneticists aim to understand the relationship between the genome of an individual and a heritable trait of interest. The narrower goal is often to locate the regions on the chromosomes and the specific genetic variation associated with such a trait. This endeavor inherently requires statistical methods for drawing inferences from genetic data, and many fundamental ideas in statistics have been developed to address questions originating from genetics. Statistical genetics analyses have also been widely applied to search for genes related to economically significant traits in domestic plants and animals, and many of the methods used for humans are shared with, or even originated from, animal and plant breeding and genetics.

Single-gene human diseases follow a simple mode of inheritance as described by Gregor Mendel in his foundational work published 150 years ago [Mendel 1865]. Such diseases are caused by mutations that can be traced to a single chromosomal region. However, many more heritable common diseases and quantitative traits are influenced by numerous genes and by complex interactions between genetic and environmental factors. The
genetic and environmental components of such complex traits, including normal variation between people for traits like height, intelligence, etc., have been studied for almost the same period of time, ever since Sir Francis Galton popularized the phrase "Nature vs. Nurture" to describe the roles of heredity and environment for socioeconomic status. An important concept quantifying the idea of Nature vs. Nurture is heritability, an idea introduced as an estimable population parameter by Sewall Wright and Ronald Fisher a century ago [Fisher 1919; Visscher, et al. 2008]. Heritability is broadly defined as the fraction of the phenotypic variation in a population that can be explained by the genetic variation in the population (as opposed to environmental factors and/or chance variation). Specifically, narrow-sense heritability (often referred to as just heritability) is the proportion of phenotypic variation due to genetic factors acting in an additive way. Broad-sense heritability is the proportion of phenotypic variation due to all genetic factors, including additional gene-gene interactions like dominance (within loci) and epistasis (between loci).

The concept of heritability is not without its caveats, e.g., different populations can have different heritability for the same trait, heritability can vary in time due to changes in environmental factors, etc. For example, heritability estimates for height in Finland vary depending on the birth year of the individual, which is correlated with environmental quality [Visscher, et al. 2008]. Heritability is nevertheless a very useful concept, allowing estimation of the nature and nurture components for a trait.
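The two definitions of heritability can be written as variance ratios. The following is a sketch in standard quantitative-genetics notation; the symbols $\sigma^2_P$, $\sigma^2_G$, $\sigma^2_A$, $\sigma^2_D$, $\sigma^2_I$, and $\sigma^2_E$ are assumptions introduced here for illustration, and the decomposition assumes no covariance or interaction between genetic and environmental factors.

```latex
% Assumed notation (not from the text):
%   \sigma^2_P  total phenotypic variance
%   \sigma^2_G  total genetic variance
%   \sigma^2_A  additive, \sigma^2_D dominance, \sigma^2_I epistatic variance
%   \sigma^2_E  environmental and chance variance
\begin{align*}
\sigma^2_P &= \sigma^2_G + \sigma^2_E, &
\sigma^2_G &= \sigma^2_A + \sigma^2_D + \sigma^2_I, \\
H^2 &= \frac{\sigma^2_G}{\sigma^2_P} \quad \text{(broad sense)}, &
h^2 &= \frac{\sigma^2_A}{\sigma^2_P} \quad \text{(narrow sense)}.
\end{align*}
```

As a hypothetical numerical illustration, with $\sigma^2_A = 4$, $\sigma^2_D + \sigma^2_I = 1$, and $\sigma^2_E = 5$, the total is $\sigma^2_P = 10$, so $h^2 = 0.4$ and $H^2 = 0.5$. Because both ratios depend on $\sigma^2_E$, a change in the environment changes heritability even when the genetic variance is unchanged, which is one of the caveats noted above.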
Contemporary Genetic Data

Deoxyribonucleic acid (DNA) is the molecule that carries the genetic instructions for living organisms. The building block of DNA is the base pair, two nucleobases connected to each other by hydrogen bonds. The DNA nucleobases are cytosine, guanine, adenine, and thymine, abbreviated as C, G, A, and T, respectively. DNA nucleotides are composed of nucleobases, deoxyribose sugars, and at least one phosphate group. Half a century passed from the discovery of the three-dimensional double-helix structure of the DNA molecule [Watson and Crick 1953] to the sequencing of the first human genome, reporting approximately 3 billion base pairs [International Human Genome Sequencing Consortium 2004]. The human genome consists of 23 pairs of chromosomes, of which 22 are autosomal (non-sex) chromosomes. Both chromosomes in an autosomal pair have the same form and genes, with one chromosome originating from the mother and the other from the father.

Variation in the DNA sequence between people with a known location on a chromosome is called a genetic marker. The most common genetic marker is the single nucleotide polymorphism (SNP), a single nucleotide/nucleobase change in the DNA at a particular chromosomal position. The alternative nucleobases at a genetic marker are known as alleles. For illustration, the SNP in Figure 1.1 has two alleles, G and T, and the two alleles taken together are called a genotype. At this hypothetical bi-allelic SNP, different people will have one of the three possible genotypes, namely GG, GT, and TT. The variable of interest is often the number of times one of the alleles is present in the genotype. For the SNP in this example, the
number of Ts defines the additive genotype at the marker, say marker $j$, for each individual $i$:

\[
g_{ij} =
\begin{cases}
0 & \text{if individual } i \text{ has genotype GG at marker } j, \\
1 & \text{if individual } i \text{ has genotype GT at marker } j, \\
2 & \text{if individual } i \text{ has genotype TT at marker } j.
\end{cases}
\]

Figure 1.1. Single nucleotide change in the DNA at a particular chromosomal position. At this SNP, individuals will have either a GG, GT, or TT genotype.

Recent and continuing advances in molecular genetics result in ever-increasing data availability. Contemporary genetic data typically span the genome of an individual and can generally be obtained in two different resolutions. Genotyping array resolution (chip resolution) data contain from several hundred thousand to several million variants. At a higher cost, sequence-resolution data provide essentially the complete DNA sequence of an individual.

Genetic Variation

Allele Frequencies

Minor allele frequency (MAF) is the frequency of the less common allele for a genetic variant with two alleles. Markers with MAF > 5% are typically labeled common variants; those with MAF between 1% and 5%, uncommon variants; and those with MAF < 1%, rare variants. By definition, MAF is a population-specific parameter and is often estimated with the observed allele frequencies in the sample at hand.
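The additive coding and the sample estimate of MAF can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not code from this work: the function names, the toy sample of eight genotypes, and the treatment of the 1% and 5% boundaries are all hypothetical.

```python
from collections import Counter


def additive_genotype(genotype, counted_allele="T"):
    """Return the number of copies (0, 1, or 2) of counted_allele in a genotype."""
    if len(genotype) != 2:
        raise ValueError("expected a two-letter genotype such as 'GT'")
    return genotype.count(counted_allele)


def minor_allele_frequency(genotypes):
    """Estimate the MAF of a bi-allelic marker from observed genotypes."""
    allele_counts = Counter("".join(genotypes))
    if len(allele_counts) > 2:
        raise ValueError("marker is not bi-allelic")
    if len(allele_counts) < 2:
        return 0.0  # only one allele observed in this sample
    return min(allele_counts.values()) / sum(allele_counts.values())


def maf_class(maf):
    """Label a variant by MAF; boundary handling is an assumption of this sketch."""
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "uncommon"
    return "rare"


# Hypothetical sample of eight individuals at the G/T SNP of Figure 1.1.
sample = ["GG", "GT", "TT", "GG", "GT", "GG", "GG", "GT"]
codes = [additive_genotype(g) for g in sample]
maf = minor_allele_frequency(sample)
print(codes, maf, maf_class(maf))
```

In this toy sample T is the minor allele, so the estimated MAF equals the sum of the additive codes divided by twice the number of individuals.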
Genotyping arrays are designed to capture the majority of the common variation in the genome and, as technology has advanced, an increasing share of the uncommon and rare variation. Sequence-resolution data aim to capture all of the DNA sequence, with a possibly varying degree of confidence for the very rare variants depending on the specific technology employed [Sims, et al. 2014].

It is generally assumed that every SNP originated in some common ancestor as a de novo mutation, a chance copying error during the process of sexual reproduction (meiosis) or DNA damage in the resulting sperm or egg cell [Stram 2014]. Such mutations are rare events, and the majority are expected to be neutral with respect to reproductive fitness (although, of course, favorable mutations are the building blocks of the process of evolution). Thus, the assumption is that essentially each variant in the DNA of a person has been inherited from at least one of her or his biological parents, who in turn inherited the variant from their parents, etc., ultimately all the way back to the original de novo mutation in a single common ancestor. Therefore, common SNPs are expected to be older: they are shared by a larger number of individuals, and the original mutation would require more time to propagate in the population through the progeny of the common ancestor.

Recombinations

The process of sexual reproduction in humans involves forming gametes (egg and sperm). Each individual has two copies of each autosomal chromosome, one from the mother and one from the father, containing the same genes but possibly different alleles. During meiosis, the parental chromosomes recombine by pairing up
and exchanging stretches of DNA. Each chromosome in the gamete contains genetic material from both parental chromosomes, and it can then be passed to the offspring (Figure 1.2).

Figure 1.2. Recombination during meiosis. Each individual has two copies of chromosome 1, one from the mother (light green) and one from the father (dark green). During meiosis the two chromosomes are recombined, and the chromosome 1 in the gamete can be passed to the offspring. It contains genetic material from both grandparents. In this hypothetical example, 3 recombinations took place on chromosome 1.

The number of recombinations that occur on a chromosome is random, but it depends on the length of the chromosome. On average, about 36 recombinations are expected over all human autosomal chromosomes per meiosis.¹ Recombination events do not occur uniformly across the chromosome: in some chromosomal regions recombinations are rare, and in other regions the rate at which recombination events occur is higher (commonly referred to as recombination hot spots). In the process of recombination, new allele combinations are produced, which can then be passed to the offspring, i.e., the genetic makeup of offspring is not identical to that of their parents.

¹ Based on the Impute2 recombination map released in October 2014 with the 1000 Genomes Phase 3 reference panel.
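The formation of a recombinant chromosome can be sketched with a small simulation. This is only an illustration under simplifying assumptions that are not taken from the text: breakpoints follow a homogeneous Poisson process along the chromosome (so hot spots are ignored), markers are equally spaced, and the function names and the per-chromosome expected number of crossovers are hypothetical.

```python
import random


def crossover_points(n_markers, expected_crossovers, rng):
    """Draw breakpoints on (0, n_markers - 1) from a homogeneous Poisson process."""
    span = n_markers - 1
    if span <= 0 or expected_crossovers <= 0:
        return []
    rate = expected_crossovers / span
    points = []
    position = rng.expovariate(rate)  # exponential gaps give Poisson counts
    while position < span:
        points.append(position)
        position += rng.expovariate(rate)
    return points


def make_gamete(maternal, paternal, expected_crossovers, rng):
    """Copy alleles from one parental chromosome, switching at each breakpoint."""
    if len(maternal) != len(paternal):
        raise ValueError("parental chromosomes must have the same number of markers")
    points = crossover_points(len(maternal), expected_crossovers, rng)
    templates = [maternal, paternal]
    current = rng.randrange(2)  # the gamete may start on either chromosome
    gamete = []
    next_point = 0
    for index in range(len(maternal)):
        # Switch template once for every breakpoint located before this marker.
        while next_point < len(points) and points[next_point] <= index:
            current = 1 - current
            next_point += 1
        gamete.append(templates[current][index])
    return gamete, len(points)


rng = random.Random(2016)
maternal = ["M"] * 50  # alleles labeled by chromosome of origin for readability
paternal = ["P"] * 50
gamete, n_breaks = make_gamete(maternal, paternal, expected_crossovers=3.0, rng=rng)
print("".join(gamete), n_breaks)
```

Labeling alleles by their chromosome of origin makes the mosaic structure of the gamete visible in the printed string. A more realistic sketch would replace the constant rate with a position-specific recombination map.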
By the nature of the recombination process, the alleles in consecutive blocks of DNA sequence along a chromosome (haplotypes) are often inherited as a unit. For a short chromosome segment, the haplotypes of many individuals in the same population can be traced to a common ancestor in the past. Alleles for markers that are close together on a chromosome are unlikely to be separated during a recombination event and are likely to be inherited together, thus violating Mendel's second law (Law of Independent Assortment). In contrast, alleles that are further away from each other are likely to be separated in the process of recombination.

Linkage disequilibrium (LD) is the non-random association of alleles between two genetic markers, e.g., two SNPs, resulting in correlations between those alleles. LD reflects haplotype blocks that descended from a single, ancestral chromosome [Reich, et al. 2001]. To illustrate the relationship between LD and recombination, consider a hypothetical example tracing two bi-allelic variants in a recombination graph as shown in Figure 1.3. Two mutations, A and B, occurred on the same branch. If no recombination occurs between the two alleles, the two bi-allelic variants are in perfect LD in the population (Figure 1.3a). If recombination occurs between the two alleles, the correlation between the two alleles is reduced (Figure 1.3b). In essence, recombinations result in smaller and smaller chromosomal segments that segregate together with the mutation [Stram 2014].
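The contrast between the two scenarios of Figure 1.3 can be made concrete by computing LD from haplotype counts. The sketch below uses two common summary measures, the coefficient D and the squared correlation r², which are standard but are not defined in the text above; the function name and the two toy samples of ten chromosomes are hypothetical.

```python
def ld_statistics(haplotypes):
    """Return (D, r_squared) for two bi-allelic markers.

    haplotypes is a list of (a, b) pairs, one per chromosome, where a and b
    are 1 if the chromosome carries the mutant allele at marker A or B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # deviation from independence of the two alleles
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


# No recombination: A and B always appear on the same chromosomes.
no_recombination = [(1, 1)] * 3 + [(0, 0)] * 7
# Recombination has produced chromosomes carrying only A or only B.
with_recombination = [(1, 1)] * 2 + [(1, 0)] + [(0, 1)] + [(0, 0)] * 6

d_perfect, r2_perfect = ld_statistics(no_recombination)
d_reduced, r2_reduced = ld_statistics(with_recombination)
print(r2_perfect, r2_reduced)
```

In the first sample the two mutant alleles are never separated, so r² is 1 up to floating-point rounding. In the second sample the recombinant chromosomes lower both D and r², which matches the qualitative statement that recombination reduces the correlation between the alleles.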
Figure 1.3. Ancestral recombination graph for two mutations. Two mutations occurred on the same branch. (a) No recombination. (b) Recombination breaks the correlation between the variants. Adapted from [Stram 2014].

Uncovering the Genetic Basis of a Trait

Genome-Wide Association Studies

In the last decade, the most commonly used approach for searching for genetic variants associated with a trait of interest has been the genome-wide association study (GWAS). The design for GWAS was advocated 20 years ago [Risch and Merikangas 1996], and the first successful GWAS was published in 2005 for age-related macular degeneration [Klein, et al. 2005]. GWAS are typically population based, assuming no close familial relatedness in the samples. The sample sizes are typically in the thousands, recently tens of thousands, and for some studies of traits like height and intelligence, hundreds of thousands of individuals [Okbay, et al. 2016; Wood, et al. 2014].

The GWAS approach tests a particular marker via a direct association test, assuming that the observed marker is causal or in strong LD with the causal variant (with the power to detect association diminishing with diminishing LD). Hundreds of thousands or millions of SNPs are tested across the genome. By virtue of testing hundreds of thousands to millions of variants, there is a severe multiple testing
penalty to avoid false positive results. In populations of European descent, for a significant result to be declared, a very stringent p-value threshold of $5 \times 10^{-8}$ is usually adopted, essentially assuming 1 million independent markers in European populations and a 0.05 genome-wide significance level. Replication in an independent sample is also expected. Genome-wide scans typically assume a uniform prior for each marker tested (as opposed to targeting particular genes or assuming different weights in regions of the DNA based on current knowledge about the function of the region containing the variant).

The GWAS design has been very successful for traits like anthropometric measures, type II diabetes, prostate cancer, Alzheimer's disease, hypertension, blood lipids, and many others. As of November 6, 2013, the National Human Genome Research Institute GWAS Catalog listed 1,751 curated GWAS publications with 11,912 associated SNPs reported for hundreds of traits [Welter, et al. 2014].

Linkage Mapping

Linkage studies [Morton 1955; Thompson 2011] are another major class of study designs in genetic epidemiology. Such studies were common before the GWAS approach was made feasible by advances in genotyping technology. Family designs are necessary for linkage mapping, as the aim is to find specific, but relatively broad, genomic regions that co-segregate with the phenotype over generations within pedigrees. In parametric linkage, the pattern of allele transmission in the pedigrees is modeled to estimate the recombination fraction between an unobserved trait locus and the marker locus. Initially the genetic
markers used for linkage studies were tracts of repetitive DNA motifs called microsatellites. Newer, semi-parametric variance components methods allow for the usage of large pedigrees, modeling the shared genetic material between relatives locally, in the neighborhood of the marker [Almasy and Blangero 1998].

Linkage analyses were successful in identifying the genetic basis of human diseases like some forms of breast cancer, Huntington's disease, and cystic fibrosis [Almasy, et al. 2015]. These diseases are caused by rare variants with high penetrance, following a Mendelian mode of inheritance. It should be noted that even such diseases may not have a trivial genetic architecture at nucleotide-level resolution. For example, several tightly linked variants, each with moderate effect, can have a large effect when combined [Yazbek, et al. 2011]. This phenomenon is often called allelic heterogeneity, and linkage mapping is robust to its effect. Linkage analysis is, however, underpowered for the moderate genetic effects that might be expected for complex phenotypes [Risch and Merikangas 1996].

Genetic Architecture of Complex Traits

Heritable, complex traits and common diseases are believed to be dependent on a variety of genetic and environmental factors, possibly interacting with each other (Figure 1.4) [Terwilliger and Weiss 1998]. For complex traits, the common disease/common variant paradigm was to a large extent the rationale for the GWAS design. The hypothesis is that a few dozen variants, with moderate effects and common allele frequencies, explain the majority of the genetic variation in the phenotype [Reich, et al. 2001]. The higher frequency of those common alleles
implies that GWAS in larger cohorts are expected to be well powered to identify causal alleles (Figure 1.5).

Figure 1.4. Complex phenotype. Complex phenotypes are believed to be dependent on a variety of genetic and environmental factors, possibly interacting with each other. Adapted from [Terwilliger and Weiss 1998].

Figure 1.5. Rare and common variants against effect sizes. Mendelian disorders are due to rare variants of large effect and were a common target for linkage studies. Common variants of large effect are unusual for common diseases, and rare variants of small effect are hard to identify genetically. Complex phenotypes are believed to be influenced by many variants of small to intermediate effects, possibly with lower allele frequency. Adapted from [McCarthy, et al. 2008].
Although GWAS have identified many replicated loci for many common traits, the ability to account for phenotypic variation remains limited: the identified loci account for only a small part of the heritability. This phenomenon was termed "missing heritability" [Maher 2008]. Some of the possible explanations for it are gene-gene and gene-environment interactions, for which the GWAS design does not test, or the high importance of rare variants, for which the GWAS design is underpowered. The common disease/rare variants hypothesis is the assumption that there are rare variants with large effect that influence the genetic component of a common disease. Since direct association tests in a typical GWAS design are underpowered for such rare variation, substantial effort was devoted to the development of methods that aggregate the signal of rare variation in a locus [Lee, et al. 2012; Li and Leal 2008]. Some authors even suggested that most of the associated common variants discovered by GWAS are due to synthetic association, where a common SNP is tagging several rare SNPs in the chromosomal region. However, this last conjecture is inconsistent with the GWAS findings so far [Wray, et al. 2011], as it would imply a right-skewed histogram for the allele frequencies of the significant SNPs.

Another possible explanation for missing heritability is that the strict significance thresholds employed in GWAS to avoid false positive associations come at the cost of an increased number of false negatives, thus missing the signal at common variants with low effects. An argument in support of this conjecture was made by Visscher and colleagues when they reported that common variants accounted for most of the heritability of height [Yang, et al. 2010a]. The authors
developed a method for quantitative traits that estimates the variation accounted for by fitting all GWAS markers simultaneously in an additive (polygenic) mixed model and estimating the variance component parameters of the model. Their results suggest that bigger sample sizes are needed to achieve greater power in order to detect more variants with low and moderate effect.

In summary, a plausible working hypothesis is that most complex human phenotypes are highly polygenic, with both rare variants of large effect and many common variants of modest effect influencing the trait and with the presence of many relevant variants that are of unknown function [Gibson 2011; Wijsman 2012].

Population Structure

Correlation between genetic variants across the genome can arise due to a variety of population structures, e.g., population stratification (the presence of distinct sub-populations), ancestry-related assortative mating, admixture, and also due to familial (including cryptic) relatedness. Genetic correlation can be a confounding factor in the GWAS design. For example, in the presence of population stratification, many alleles across the genome are likely to be informative for an individual's sub-population. For a phenotype that varies across sub-populations, those alleles will be predictive of the phenotype. Although some of the alleles might indeed be causal, the majority of such "signals" will be spurious associations in that they are unrelated to underlying biological mechanisms. Similarly, the presence of environmental confounding factors or technical artifacts that are correlated with the genetic background of the study subjects can
lead to spurious associations. Consider an illustrative example: in a British cohort some genetic markers might be associated with proficiency in the Welsh language, a trait not expected to have any genetic component. The Welsh population has a divergent population history, with some common alleles that have different allele frequencies from non-Welsh British individuals. The associated genetic markers tag Welsh ancestry as opposed to causing people to speak Welsh [Astle and Balding 2009].

The problem of confounding by population structure and cryptic relatedness has been recognized and studied extensively for many years. In a case-control setting, a commonly employed approach is to collect samples matched for ethnic background, and to conduct additional screening for genetic outliers and cryptic relatedness. An alternative approach is to model the genetic correlation present. A brief overview of some statistical methods that address these issues is provided below. More details for two of the methods are provided in the statistical methods subsection of this chapter.

Devlin and Roeder developed the method of genomic control [Devlin and Roeder 1999]. Chi-squared statistics are calculated at all markers across the genome, and their empirical median is divided by the median of the chi-squared distribution to produce an inflation factor, denoted $\lambda$. This factor is then used to adjust the test statistics at all variants for population structure by dividing each test statistic by $\lambda$. Since the majority of the markers tested are expected not to be associated with the trait, i.e., to be null markers, the median of the test statistics should
capture inflation of the test statistics at null markers. The assumption for this method is that any inflation due to genetic correlation is constant across the genome. Genomic control is currently utilized as a means to measure the extent of inflation due to population structure and to assess the calibration of the statistical tests performed with respect to Type I error.

The goal of structured association methods [Pritchard, et al. 2000] is to assign subjects to discrete sub-populations and perform tests within each stratum. Further development of the method [Pritchard and Donnelly 2001] allowed for fractional membership, essentially modeling the proportions of the genome of an individual that are inherited from several distinct ancestral populations. These proportions can then be used as covariates in a generalized linear model.

A very popular approach to address issues with confounding due to correlated genotypes is principal components analysis (PCA) [Chen et al. 2003; Zhang, Zhu, and Zhao 2003; Price et al. 2006]. Orthogonal continuous axes of genetic variation (principal components) are derived from the genomic markers and ordered by the magnitude of the variation. The coordinates for each individual in this coordinate system are included as covariates in a generalized linear model. Typically, no more than 10 axes corresponding to the highest eigenvalues are used.

Epstein and colleagues proposed a stratification score [Epstein, et al. 2007]. Briefly, ancestry informative markers are used to compute individual disease risk scores. Strata are then defined based on the risk scores, and a stratified test for association is performed over each stratum.
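The genomic control adjustment described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [Devlin and Roeder 1999]: it assumes 1-degree-of-freedom chi-squared statistics, and the function name `genomic_control` and the hard-coded median are choices made here for illustration only.

```python
import numpy as np

# Median of the chi-squared distribution with 1 df, i.e. (0.6744897...)**2.
# Assumption: all test statistics are 1-df chi-squared statistics.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chisq_stats):
    """Return (inflation factor, adjusted statistics).

    The inflation factor is the empirical median of the statistics divided by
    the theoretical median under the null; each statistic is then divided by
    this factor, as in the genomic control description above.
    """
    chisq_stats = np.asarray(chisq_stats, dtype=float)
    inflation = np.median(chisq_stats) / CHI2_1DF_MEDIAN
    return inflation, chisq_stats / inflation
```

An inflation factor close to 1 suggests well-calibrated test statistics; values noticeably above 1 suggest inflation, under the method's assumption that the inflation is constant across the genome.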
The methods that were most widely used until recently were genomic control, which adjusts for unknown familial relatedness in the sample (cryptic relatedness) but fails to adjust for population structure due to admixture, and PCA-based methods, which adjust for large-scale population structure like admixture or stratification but do not adjust for confounding due to relatedness [Price, et al. 2010].

Linear mixed model (LMM) methods [Bradbury, et al. 2007; Kang, et al. 2010; Lippert, et al. 2011; Yang, et al. 2011; Zhou and Stephens 2012] are emerging as contemporary methods of choice to adjust for confounding due to correlated genotypes in GWAS. The methods use an empirical genetic relationship matrix (GRM) to model the full covariance structure of the genotypes present in the sample. Including a GRM for a single variance component in a linear model serves as a proxy to account for allele sharing at the associated loci. LMMs are reported to adjust for most of the sources of genetic correlation, including large-scale population structure due to admixture or stratification as well as familial relatedness [Astle and Balding 2009; Price, et al. 2010; Yang, et al. 2014; Yu, et al. 2006].

Imputation

Despite the discovery of an increasing number of variants, many causal variants, especially those with lower MAF, might not be present on the GWAS chip or might not be in high LD with variants on the chip. This reduces the power to detect association in a conventional GWAS approach and, unless sequence-resolution data is used, implies that some variants associated with the trait are likely to be missed.
Through imputation, additional variants can be studied. The imputation approach also relies on the LD that exists between variants. Variants with higher allele frequencies are assumed to be older and often segregate with alleles at nearby markers. Those variants are expected to be shared by many people in a population. If more densely genotyped samples (with DNA relevant to the ancestral history of the samples in the study) are available, the imputation approach is expected to work well for common variants, inferring with high probability the majority of common variants. That is, the additional information present in the form of a relevant genotype reference panel can be used to statistically infer the variants observed in the reference panel but not observed in the genotyped sample. Imputation leads to increased genotype data density in the sample studied (Figure 1.6).

The quality of the imputation depends on the reference panel size, sample ancestral history, and the particular method employed. For most populations, there is good availability of relevant public reference panels, and the vast majority of common genetic variation can be captured via imputation [Howie, et al. 2011]. Larger reference panels are in production and are expected to be made available to the public. The imputation approach can now be considered the standard approach for GWAS.
Figure 1.6. Imputation with reference panel. (a) The GWAS dataset does not contain some of the variation present in the reference panel (represented by dots). (b) Unobserved genotypes in the study individuals are imputed (inferred) using the set of reference DNA sequences.
Admixed Populations and Admixture

Admixed Populations

Human ancestry traces to Africa, and it is estimated that humans first spread to Southwest Asia around 100,000 years ago and elsewhere in the Old World by 60,000 to 40,000 years ago. There is a consensus that the Americas were the last landmasses to be colonized by modern humans, with this process estimated to have started no earlier than 23,000 years ago [Raghavan, et al. 2015]. Differential genetic drift accelerated by population bottlenecks and natural selection [Astle and Balding 2009], as well as new mutations and non-random mating, led to genetic diversity between those human groups, which were geographically isolated for many thousands of years. In the last several centuries those genetically distinct populations intermixed again due to major migrations [Winkler, et al. 2010].

In the United States there are two large admixed populations: African Americans and Hispanic Americans. These populations are often medically underserved [Seldin, et al. 2011]. Hispanic/Latino populations, for example, are the result of admixture of Native American, European, and West African ancestral populations. Other contemporary admixed populations [Shriner 2013] include: African Americans (European and West African ancestry), Ashkenazi Jews (Eastern European and Middle Eastern ancestry), Australian Aboriginals (Aboriginal and European ancestry), Pacific Islanders (European and Polynesian ancestry), Uyghurs (Asian and European ancestry), and South African Coloured (five-way admixture of Bantu-speaking African, European, Indian, Khoisan, and Southeast Asian ancestry).
There is increased genetic variability in admixed populations by virtue of more polymorphic variation and different allele frequency distributions in the ancestral populations. This higher genetic variability can present additional opportunities for mapping the genetic basis of a trait [Brown and Pasaniuc 2014; Liu, et al. 2013b; Zhang and Stram 2014]. For example, SNP rs2814778 has a MAF of 0.08 and is associated with white blood cell count in the WHI SHARe Hispanics cohort [Reiner, et al. 2012]. The functional allele has a frequency of 1 in West Africans (HapMap YRI) and is virtually absent in European populations (HapMap CEU) [Gibbs, et al. 2003]; thus even a meta-analysis of European and African cohorts will not discover the SNP association for this trait.

Disease prevalence can vary by ancestry, e.g., multiple sclerosis is more common in individuals of European ancestry, while hypertension and prostate cancer are more common in those of African ancestry. It should be noted that such differences might be due to environmental factors [Shriner 2013].

Admixture Process

As a result of intermixing between previously isolated ancestral populations, recombinations produce chromosomal blocks of different continental ancestries (Figure 1.7), and the admixed individuals have genomes that are mosaics of segments with distinct continental ancestry [Shriner 2013].
Figure 1.7. Admixture process schematic. Two previously isolated ancestral populations, represented by light green and dark green chromosomes, intermix. After several generations admixed individuals have genomes that are mosaics of segments with distinct continental ancestry. Adapted from [Winkler, et al. 2010].

In admixed populations, blocks of ancestry vary in size due to the random nature of recombination. Additionally, continuous and complex intermixing patterns are observed for many admixed populations, and the ancestral chromosomes in the admixed populations have been subjected to a different number of recombinations [Astle and Balding 2009; Jin, et al. 2014].

Local ancestry (LA) at a particular location is defined as 0, 1, or 2 copies from each ancestral continental population considered. Global ancestry proportions are the proportions of genetic material descending from each ancestral population and can be calculated by averaging the local ancestry proportions across the genome of
each individual. For example, Gravel and colleagues estimated Native American, European, and West African average global ancestry proportions to be 47.6%, 47%, and 5.4%, respectively, in the Mexican American samples from Los Angeles present in the 1000 Genomes Project Phase I dataset [Gravel, et al. 2013]. The global ancestry proportions have been shown to be highly correlated with the first few principal components [Thornton, et al. 2014; Zhang and Stram 2014], and the global ancestry proportions can be used to control for population structure due to admixture. Shriner argues that admixture among continental populations should be reflected by the top principal components if there is no additional structure in the study sample [Shriner 2013].

Admixture Mapping

Instead of adjusting for the structure due to admixture as in structured association, the goal of admixture mapping is to use the long-range correlations introduced by the admixture process as a tool for gene discovery. This is an old idea proposed by Rife [Rife 1954]. The current admixture mapping paradigms were developed theoretically by McKeigue [McKeigue 1998]. Admixture mapping has been applied to discrete traits like prostate cancer, breast cancer, type 2 diabetes, nondiabetic kidney disease, and hypertension as well as quantitative traits like lipid levels, obesity, and white blood cell counts [Winkler, et al. 2010]. Admixture scans continue to be performed for suitable cohorts [Gomez, et al. 2015; Schick, et al. 2016], often as a complement to the standard GWAS approach.

Admixture mapping leverages the correlations generated by the admixture process. Current admixture mapping approaches can be thought of as tests for
association between local ancestry and phenotype. Admixture mapping has power to identify causal variation that is differentially distributed across populations and does not require differential risk by ancestry [Shriner 2013], although traits for which differential risk is observed in the ancestral populations are natural candidates for admixture analysis. Admixture tests can be set up in a generalized linear model framework, using for example likelihood ratio test statistics (LRTS). The LAs for the ancestral population with the majority contribution to the admixed population, or LAs for several ancestral populations, can be used. Global ancestries are typically used as covariates in all models, including the null model.

Similar to linkage mapping, admixture mapping relies on linkage as opposed to LD, i.e., tracking whether the causal variant is segregating together with the variant that is studied [Tang, et al. 2010]. Admixture mapping has intermediate resolution between linkage mapping and GWAS with respect to the length (1-10 megabases) of the peak signal [Winkler, et al. 2010]. For recently admixed populations, the ancestral blocks are long and the number of ancestry informative markers (AIMs) required to infer LAs is low. For African American samples, as few as 1,500-2,500 AIMs are deemed sufficient to reliably infer the majority of the ancestral switches across the genome [Winkler, et al. 2010].

Originally, admixture mapping was proposed as a way to obtain greater precision of gene mapping compared to linkage. Studying a few thousand genotyped AIMs was economically more feasible compared to the greater number of markers
required for GWAS, an association approach relying on LD. With the increased availability of affordable GWAS chip genotyping technology and the resulting high-density data, the economic advantage of using AIMs has largely diminished. Still, admixture mapping can be well powered if ancestral disease risk and/or ancestral allele frequencies for the causal markers differ, and it can be regarded as a complementary approach to the standard GWAS approach [Liu, et al. 2013b; Shriner 2013].

Local Ancestry Inference

Ancestral blocks are not directly observable, i.e., the ancestral origins of the alleles have to be inferred. Obtaining reliable LA estimates is often challenging but is a necessary step for admixture mapping and for most combined admixture and association analyses.

Initially, approaches for LA inference were developed for sparse sets of AIMs in linkage equilibrium. The rapid advances of GWAS chip genotyping resulted in the development of methods that allow for the use of markers in LD. The newer methods use the much denser contemporary genetic data and are expected to have improved performance with respect to the accuracy of the LA estimates produced [Winkler, et al. 2010].

Methods for LA inference suitable for sequence data resolution use contemporary genetic reference panels as proxies for the ancestral continental populations [Brown and Pasaniuc 2014]. Such publicly available panels are easily obtainable for European and West African populations at various resolutions, for example from the 1000 Genomes Project [The 1000 Genomes
Project Consortium 2012]. A further obstacle for estimating LA in Latinos is the scarcity of genetically homogeneous, publicly available Native American (NA) genotype data that can serve as a proxy for the NA ancestral population [Brown and Pasaniuc 2014; da Silva, et al. 2015; Eyheramendy, et al. 2015; Zhang and Stram 2014]. This is especially true at sequence resolution. Local ancestry estimates can be produced using Asian reference panels as a proxy for the NA ancestral population. Such panels are readily available in the 1000 Genomes Project data. It has been suggested, however, that using a NA reference panel is likely to substantially improve the accuracy of the LA estimates in Latino samples [Brown and Pasaniuc 2014; Moura, et al. 2015].

Statistical Methods

Regression and Likelihood Ratio Test Statistics

GWAS apply a direct association test, assuming an observed causal marker or correlation of the tested marker with the causal variant. For a quantitative trait, one approach to implement such a test is linear regression. Consider the following model with $g_j$ being the number of minor alleles for marker $j$, $\mathbf{c}$ a vector of covariates, and $y$ the phenotype:

$$ y = \beta_0 + \beta_j g_j + \boldsymbol{\gamma}^{T}\mathbf{c} + \varepsilon \qquad (1) $$

The standard assumption for this model is that the residuals $\varepsilon$ are independent and identically distributed normal variables with mean 0 and variance $\sigma^2$. The goal for a test of association is to assess if setting the coefficient $\beta_j$ to 0 significantly reduces the fit of the model. The standard approach is to obtain, through ordinary least squares, estimates for $\beta_j$ and its standard error and to test if $\hat{\beta}_j$ is significantly
different from 0.

An alternative approach is to maximize the likelihood function of model (1) and the likelihood function of the model with $\beta_j = 0$. For a sample of size $n$, the likelihood function for model (1) is

$$ L(\beta_0, \beta_j, \boldsymbol{\gamma}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left(y_i - \beta_0 - \beta_j g_{ij} - \boldsymbol{\gamma}^{T}\mathbf{c}_i\right)^2}{2\sigma^2} \right) $$

The likelihood ratio test statistic (LRTS) is

$$ \mathrm{LRTS} = 2\left[ \ln L(\hat{\theta}) - \ln L(\hat{\theta}_0) \right] \qquad (2) $$

where $\hat{\theta}$ and $\hat{\theta}_0$ denote the maximum likelihood estimates under the full model and the null model, respectively, and the LRTS has a limiting $\chi^2_1$ distribution when the null model is true ($\beta_j = 0$).

The LRTS can be generalized to test hypotheses that involve more than one parameter, using two nested models. That is, a null hypothesis can be that the parameter is in a specified subset, $\Theta_0$, of the parameter space, $\Theta$. The test statistic has the same form as above and is approximately $\chi^2$ with degrees of freedom equal to the difference in dimensionality of $\Theta$ and $\Theta_0$.

The Wald and score test statistics are also based on the likelihood function. Instead of producing a test statistic from the likelihoods at the alternative and the null hypotheses, they rely on approximations of the shape of the likelihood function at the null model (score test) and the alternative model (Wald test). All three of the LRT, Wald, and score test statistics are asymptotically equivalent when the null hypothesis is true, but likelihood ratio inference is more reliable [Agresti 2007] due to the fewer assumptions made.
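The single-marker likelihood ratio test described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the software used in this work: the function names `max_loglik` and `lrt_single_marker` are hypothetical, the phenotype is assumed quantitative with Gaussian residuals, and the 1-df chi-squared tail probability is computed with the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math

import numpy as np


def max_loglik(y, X):
    """Maximized Gaussian log-likelihood of the linear model y = X b + e."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares = ML for b
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n  # ML estimate of the residual variance
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)


def lrt_single_marker(y, g, C):
    """LRTS and p-value for H0: marker effect = 0.

    y: phenotype (n,), g: minor allele counts (n,), C: covariates (n, k).
    The null model has an intercept and covariates; the full model adds g.
    """
    n = len(y)
    X0 = np.hstack([np.ones((n, 1)), C])
    X1 = np.hstack([X0, np.asarray(g, dtype=float).reshape(-1, 1)])
    lrts = 2.0 * (max_loglik(y, X1) - max_loglik(y, X0))
    # Tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(lrts, 0.0) / 2.0))
    return lrts, p_value
```

For Gaussian linear models the statistic reduces to $n \ln(\mathrm{RSS}_0 / \mathrm{RSS}_1)$, where $\mathrm{RSS}_0$ and $\mathrm{RSS}_1$ are the residual sums of squares of the null and full models, which offers a quick check of the computation.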
In his landmark work [Fisher 1919], Sir R. A. Fisher assumed that: 1) infinitely many independent mutations are contributing equally, independently, and in an additive manner to the phenotype, 2) the contribution of each marker to the variance of the phenotype is the same irrespective of its allele frequency, and 3) the effects due to each marker are independent and follow a normal distribution. For illustration, I will use only an additive genetic component and contemporary LMM notations in the exposition that follows².

Let $g_{ij}$ be the minor allele count at marker $j$ for individual $i$ and $\mathrm{MAF}_j$ be the population minor allele frequency at marker $j$. Let $\mathbf{Z}$ be an $n \times s$ matrix of standardized genotypes for $s$ causal markers and $n$ samples. Specifically, each of the markers in $\mathbf{Z}$ is standardized assuming $g_{ij} \sim \mathrm{Binomial}[2, \mathrm{MAF}_j]$:

$$ z_{ij} = \frac{g_{ij} - 2\,\mathrm{MAF}_j}{\sqrt{2\,\mathrm{MAF}_j\,(1 - \mathrm{MAF}_j)}} $$

Let $\mathbf{X}$ be a matrix of covariates (including a term for the intercept) for the $n$ samples. The model is:

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\epsilon} \qquad (3) $$

That is, the phenotype vector $\mathbf{y}$ is the sum of fixed effects modeled by $\mathbf{X}\boldsymbol{\beta}$, and random genetic effects $\mathbf{Z}\mathbf{u}$, assuming $\mathbf{u} \sim N(\mathbf{0}, \tfrac{\sigma_g^2}{s}\mathbf{I}_s)$, and Gaussian noise, $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma_e^2\mathbf{I}_n)$. It is assumed that $\mathbf{u}$ and $\boldsymbol{\epsilon}$ are independent.

² Loosely based on the exposition in the Supplementary Note 1 of [Lippert, et al. 2011].
Now

$$ \mathrm{Var}(\mathbf{Z}\mathbf{u}) = \frac{\sigma_g^2}{s}\,\mathbf{Z}\mathbf{Z}^{T} $$

Let $\mathbf{K} = \frac{1}{s}\mathbf{Z}\mathbf{Z}^{T}$, which is the relatedness matrix. Assuming that $\mathbf{Z}\mathbf{u}$ is a multivariate normal vector, the model for the phenotype is:

$$ \mathbf{y} \sim N\left(\mathbf{X}\boldsymbol{\beta},\; \sigma_g^2\mathbf{K} + \sigma_e^2\mathbf{I}_n\right) \qquad (4) $$

Fisher showed that as $s \to \infty$, the $(i,k)$th entry of $\mathbf{K}$ equals the proportion of genetic material that is shared between individuals $i$ and $k$, i.e., the proportion of genetic material that has the same ancestral origin. For known familial relationships, these proportions can be computed relative to the founders in the pedigree.

This model is fundamental in quantitative genetics, plant and animal breeding, non-parametric linkage, some methods that aggregate rare variants, tests of variant causality given a localization signal, methods of shared genetic influences in multivariate analyses, and tests of gene-gene and gene-environment interaction [Almasy and Blangero 2010]. For instance, narrow-sense heritability is estimated in terms of the parameters from model (4):

$$ \hat{h}^2 = \frac{\hat{\sigma}_g^2}{\hat{\sigma}_g^2 + \hat{\sigma}_e^2} $$

That is, heritability is the ratio of estimated additive genetic variation, $\hat{\sigma}_g^2$, over the total variation, $\hat{\sigma}_g^2 + \hat{\sigma}_e^2$.
Recently, explicitly modeling the full covariance structure of the genotypes present in the sample as a single polygenic variance component has emerged as a widely used method that simultaneously adjusts for population structure and kinship in GWAS [Hoffman 2013]. Briefly, these methods use a single empirical genetic relationship matrix (GRM), that is, an additive genetic variance-covariance matrix that contains the average correlation of the genotypes in the sample [Yang, et al. 2014]. More details follow.

Linear Mixed Model with Empirical Genetic Relationship Matrix

Historically the relatedness matrix has been produced from known familial relatedness and was only relevant to family studies. However, population-based designs, with no closely related individuals, have unique advantages, e.g., larger study cohorts can be assembled more easily as there is no need to enroll family members of the already recruited participants [Evangelou, et al. 2006]. Additionally, there are issues with using known pedigree information. For example, the genetic sharing is random for most familial relationships [Weir, et al. 2006], and therefore using the expected proportions of shared genetic material is an approximation. In addition, presumably unrelated samples can be related to some extent, and known relationships might not correspond to the true biological relationships. Further, population structure and/or admixture introduce an additional layer of genetic correlation in many cohorts. To address some of these concerns, a development in animal genetics was the use of an empirical GRM derived from a dense set of genetic markers [Hayes, et al. 2009].
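As a minimal sketch of how an empirical GRM of this kind can be computed from dense marker data, the following assumes a complete matrix of minor allele counts (no missing genotypes), standardizes each marker by its sample allele frequency under a Binomial(2, p) model, and averages the products over markers. The function names `empirical_grm` and `top_principal_components` are hypothetical and chosen here for illustration; this is not the software used in this work.

```python
import numpy as np


def empirical_grm(G):
    """Empirical GRM from an n x m matrix G of minor allele counts (0, 1, 2)."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0          # sample allele frequency per marker
    keep = (p > 0.0) & (p < 1.0)      # monomorphic markers carry no information
    G, p = G[:, keep], p[keep]
    W = (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # standardized genotypes
    return W @ W.T / W.shape[1]       # average over markers


def top_principal_components(grm, k=10):
    """Eigenvalues and eigenvectors of the GRM, largest eigenvalues first."""
    evals, evecs = np.linalg.eigh(grm)        # eigh returns ascending order
    order = np.argsort(evals)[::-1][:k]
    return evals[order], evecs[:, order]
```

Because each marker is centered at its sample mean, every row of the resulting matrix sums to zero, so some off-diagonal entries are necessarily negative: relatedness is measured relative to the sample average rather than to a set of pedigree founders.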
This approach computes a GRM, $\mathbf{A} = \frac{1}{m}\mathbf{W}\mathbf{W}^{T}$, where $\mathbf{W}$ is a matrix with $m$ markers (typically hundreds of thousands or more) with additive coding and standardized. Otherwise, the model is the same as model (4):

$$ \mathbf{y} \sim N\left(\mathbf{X}\boldsymbol{\beta},\; \sigma_g^2\mathbf{A} + \sigma_e^2\mathbf{I}_n\right) $$

The mean model, $\mathbf{X}\boldsymbol{\beta}$, can include covariates as well as fixed effects for association and/or admixture. Alternatively, the GRM can be constructed by computing, for each entry of $\mathbf{A}$:

$$ A_{ik} = \frac{1}{m}\sum_{j=1}^{m} \frac{(g_{ij} - 2\hat{p}_j)(g_{kj} - 2\hat{p}_j)}{2\hat{p}_j(1 - \hat{p}_j)} $$

where $g_{ij}$ and $g_{kj}$ are the number of minor alleles for marker $j$ and individual $i$ and $k$, respectively; $m$ is the number of markers, and $\hat{p}_j$ is the minor allele frequency for marker $j$ estimated from the sample. The GRM measures the relative relatedness between subjects in the sample and can have negative values. A positive off-diagonal entry, $A_{ik}$, can be interpreted as excess allele sharing between individuals $i$ and $k$ in the sample. Negative values for certain pairs of individuals correspond to individuals sharing fewer alleles than expected given the allele frequencies observed in the sample [Astle and Balding 2009].

The PCA approach to adjust for confounding uses the same GRM, $\mathbf{A}$, to obtain the eigendecomposition $\mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{U}^{T}$. The eigenvectors in $\mathbf{U}$ are sorted by the magnitude of the corresponding eigenvalues, and the top few eigenvectors are the principal components used as fixed effect covariates in a regression model. Thus, in the LMM method, the eigenvectors are included in the covariance matrix for the random effect
and adjustment for ancestry with principal components and global ancestry proportions for admixed samples might not be required.

Assume without loss of generality that the phenotype has unit variance, $\sigma_g^2 + \sigma_e^2 = 1$; under this assumption $\sigma_g^2$ is the heritability of the phenotype. The mean model includes an intercept term, covariates, and one or more fixed effects that can be used for hypothesis testing. The likelihood function for such models is:

$$ L(\boldsymbol{\beta}, \sigma_g^2, \sigma_e^2) = (2\pi)^{-n/2}\,|\mathbf{V}|^{-1/2} \exp\left( -\tfrac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{T}\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right) $$

where $\mathbf{V} = \sigma_g^2\mathbf{A} + \sigma_e^2\mathbf{I}_n$, and the log likelihood is:

$$ \ell(\boldsymbol{\beta}, \sigma_g^2, \sigma_e^2) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{V}| - \frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{T}\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) $$

The log likelihoods for the null model and for the full model are maximized numerically and the likelihood ratio test statistic is computed, as in (2), as twice the difference between the log likelihood of the full model and the log likelihood of the null model. When the null model is true, this statistic has a limiting $\chi^2$ distribution with degrees of freedom (df) equal to the difference between the number of parameters in the full model, $p_1$, and the null model, $p_0$.

Combined Admixture and Association

Admixture mapping and association testing have been successfully applied to the detection of genes for complex diseases. The long-range correlations introduced by the admixture process also allow for gene mapping methods that combine genotype and ancestral origin information [Liu, et al. 2013b; Pasaniuc, et al. 2011; Shriner, et al. 2011b; Tang, et al. 2010]. The local ancestry provides information
about the haplotype diversity in the region and can be complementary to the association signal; combining both can result in potentially more powerful statistical tests [Tang, et al. 2010].

Pasaniuc, et al. proposed a combined admixture and association test in a case/control setting. Under simulations, the combined test for a single candidate marker resulted in gains in power, particularly for markers with large allele frequency differences between the ancestral populations [Seldin, et al. 2011]. A second test, MIX, proposed in the same work allows for a heterogeneous association signal based on the ancestral origin of the chromosomal segment. Applied to a real African American dataset, the MIX test produced two SNPs that, when tested for different odds ratios conditioning on African vs. European local ancestry, produced relatively large test statistics. The authors suggested that a source of heterogeneity is different LD patterns in Africans vs. Europeans, or possibly gene-gene interaction with another causal SNP in the same region.

Stratifying the genotype effect by local ancestry was employed in the Bmix method [Shriner, et al. 2011] as a technical remedy to allow for sequential fitting of models for (1) local ancestry effect and (2) association conditional on local ancestry, and for combining the results in a Bayesian framework to take advantage of the reduced multiple hypothesis testing burden of admixture mapping that is due to the large length of the ancestral segments. The authors did not study possible advantages from allowing for heterogeneity, nor scenarios in which this method will gain power by virtue of modeling an association effect over strata.
In a Hispanic case-control study, analysis that included a SNP by local ancestry interaction term in a logistic regression resulted in a suggestive association at a variant for asthma case status [Liu, et al. 2013]. The variant lacked any evidence of association from a conventional GWAS analysis. Different LD patterns were again suggested as a possible source of the heterogeneity of the association effect.

The density of the admixed genotypes in these studies was GWAS chip resolution with several hundred thousand markers, and the authors did not consider imputation to sequence resolution. To the best of our knowledge, no combined admixture and association methods were developed incorporating LMM adjustment for population structure and kinship.

Evaluating methods to address challenges and opportunities presented by admixed populations has been done with simulations. Differences in simulation setup between papers can result in conflicting conclusions, impeding the choice of methods for researchers studying admixed cohorts [Zhang and Stram 2014]. One approach is for admixed genetic data to be simulated one marker at a time, completely ignoring linkage disequilibrium patterns on different ancestral backgrounds. Other methods used to simulate admixed samples start from real genomes from homogeneous populations and implement simplified, forward-time admixture processes. Such approaches have a varying degree of success in mimicking the complex LD patterns in contemporary admixed populations like Latinos or African Americans [Astle and Balding 2009; Jin, et al. 2014; Jin, et al. 2012; Ni, et al. 2016].
Thesis Overview

The main theme of this dissertation is the feasibility and the practical utility of combining admixture and association in the context of contemporary high-resolution genetic data.

In Chapter II, I describe a combined admixture and/or association test that allows for a heterogeneous genetic association by fully stratifying the association effect based on the ancestral origin of the marker. I apply association, admixture, and fully stratified association tests to 132 unrelated Mexican Americans and quantitative traits for systolic and diastolic blood pressures. Nested linear regression models are fit with adjustments for global ancestry proportions and select covariates. Significance thresholds are derived through permutation analysis. An explicit test for heterogeneity among the effects due to genetic association given different ancestral background in the region is also proposed for quantitative traits. An additional goal for this chapter is to investigate the complexities of combining admixture and association mapping in the context of whole genome sequencing data, from selection of reference populations and preprocessing of data through to the testing itself.

In Chapter III, in order to simulate phenotype data from a realistic structure of genetic variation, I produce a sequence-resolution, simulated Latino dataset with allele frequencies and LD patterns similar to those of contemporary Latino samples after one additional meiosis. Simulating only a single generation of random mating preserves the admixture LD that would otherwise be reduced by recombinations. To address the lack of availability of public Native American reference panels at sequence resolution, a
Native American sample without European admixture was obtained, then additionally screened for relatedness, and imputed to high resolution to serve as a proxy in local ancestry estimation. The simulated Latino genotypes are coupled with local ancestry estimates that I infer with a state-of-the-art method and validate with different measures of ancestry. By construction, the Latino dataset has a complex structure: admixture of three ancestral populations, distinct sub-populations, and non-trivial familial-like relatedness within each of the sub-populations.

In Chapter IV, I extend the admixture mapping and the combined admixture and association methods to the LMM framework. Using extensive simulations based on the Latino dataset, I investigate the possible gains in power from different ways of incorporating local ancestry into single-variant association testing at GWAS chip resolution. I compare such approaches to imputation followed by a standard association test. I simulate polygenic traits with a single causal common variant per locus and a causal allele with the same effect regardless of its ancestral origin. The power lost from applying the combined, higher-degrees-of-freedom tests compared to the standard, one-degree-of-freedom association test at the causal markers is also quantified. Lastly, I investigate whether the standard LMM approach, without local ancestry adjustment through a fixed effect, controls Type I error well in both the association test and the combined admixture and association tests.

Finally, Chapter V provides a global summary and directions for future work motivated by the work of this dissertation.
CHAPTER II

USE OF ADMIXTURE AND ASSOCIATION FOR DETECTION OF QUANTITATIVE TRAIT LOCI IN THE TYPE 2 DIABETES GENETIC EXPLORATION BY NEXT GENERATION SEQUENCING IN ETHNIC SAMPLES (T2D-GENES) STUDY³

Abstract

Admixture mapping and association testing have been successfully applied to the detection of genes for complex diseases. Methods have also been developed to combine these approaches. As an initial step to determine the feasibility of combining admixture and association mapping in the context of whole genome sequencing, we have applied several methods to data from the Genetic Analysis Workshop 18. Here, we describe the steps necessary to carry out such a study, from selection of reference populations and preprocessing of data through to the testing itself. We detected one significant result with a Bonferroni-corrected p value of 0.032 at single nucleotide polymorphism rs12639065. Computing local ancestry for Hispanic populations was challenging because there are relatively few methods that handle 3-way admixture, and publicly available Native American reference panels are scarce. However, combining admixture and association is a promising approach for detection of quantitative trait loci because it might be able to elevate the power of detection by combining 2 different sources of genetic signal.

³ Portions of this chapter were previously published in [Yorgov, et al. 2014] and are included with the permission of the copyright holder.
Background

Whole genome sequencing (WGS) is fast becoming a feasible technology for use in genetic studies of complex traits. Such a rich source of data allows for many methodological approaches; however, with the sheer increase in volume of data, alternatives must be evaluated with regard to their validity and power. For example, in the context of marker panels, methods for admixture mapping and association testing have been used for detection of genes for complex traits [Kao, et al. 2008; Kopp, et al. 2008; Reich, et al. 2007]. Combined admixture and association approaches have also been developed [Lettre, et al. 2011; Pasaniuc, et al. 2011; Shriner, et al. 2011b; Tang, et al. 2010; Wang, et al. 2011]. As an initial step to determine the feasibility of a combined approach in the context of WGS, we have applied several methods to data from the Genetic Analysis Workshop 18 (GAW18). We demonstrate the use of estimates of local and global ancestry in the context of an admixed population, and we both combine and compare ancestry information with genetic association testing.

Our goals are to (a) identify the steps and issues in estimation of local and global ancestry proportions for an admixed population that is best represented by more than 2 reference populations, (b) construct a series of models and test statistics that use ancestry and/or genotype data, (c) perform tests on Chromosome 3 data from the GAW18 workshop for both systolic blood pressure (SBP) and diastolic blood pressure (DBP), and (d) compare the tests with respect to findings.
Methods

Study Samples

The GAW18 dataset consists of WGS data that were obtained for Mexican American families sampled from San Antonio, Texas, as a part of the Type 2 Diabetes Genetic Exploration by Next Generation Sequencing in Ethnic Samples (T2D-GENES) consortium [Almasy, et al. 2014]. The genotype data were cleaned of Mendelian errors for 959 individuals, including 464 individuals who had been sequenced, with the genotypes of the remaining individuals imputed on the basis of genome-wide association data. The ancestry estimates in this chapter were produced for all individuals using Chromosome 3 markers. The regression analysis was performed on Chromosome 3 on a set of 132 unrelated individuals with genotype, SBP, and DBP data as measured at their first examination. Simulated Q1 traits not influenced by individual genotypes were used for Type I error rate assessment.

Ancestry Estimation

Local ancestry is defined as the number of copies of chromosomes inherited from a parental population at a given genomic location, resulting in a mosaic of segments of distinct ancestry across the chromosome and, equivalently, ancestry switches between those segments [Shriner, et al. 2011]. Historically, methods for local ancestry estimation have used coarse marker maps, with much work done on constructing ancestry informative marker (AIM) panels with a size of a few thousand markers over the whole genome. AIM panels typically incorporate markers with large frequency differences between the ancestral populations and minimal linkage
disequilibrium (LD) between the AIMs; however, it has been suggested that using a denser set of markers naturally provides more information for local ancestry estimation [Winkler, et al. 2010]. Several newer methods allow for higher density marker panels and background LD, but most do not readily allow for multiple-way admixture. The method that we used, LAMP-LD, combines window-based processing within a hierarchical Hidden Markov Model and takes as input the genotypes of the admixed individuals as well as phased haplotype reference panels representative of the ancestral populations [Baran, et al. 2012]. We produced the most likely pair of local ancestries at each marker with LAMP-LD software release 1.0. Global ancestry proportions were estimated by averaging local ancestry estimates over all Chromosome 3 markers used.

Statistical Models

Multiple linear regression models were fit to individual log-transformed blood pressure measurements with global ancestry proportions of European, Native American, and African descent as explanatory variables. Additional covariates were selected by forward stepwise selection with a 0.05 significance level, resulting in log(SBP) with global ancestry, age, and blood pressure medication use as explanatory variables and log(DBP) with global ancestry and blood pressure medication use as explanatory variables.

We assumed an additive genetic model using counts of minor alleles, $g$, with $0 \le g \le 2$. Because of imputation, $g$ is not always an integer. For each marker, local ancestry was coded into dummy variables representing each unique combination of ancestry: $D_{EE}$, $D_{AA}$, $D_{NN}$, $D_{EA}$, $D_{EN}$, and $D_{NA}$,
where E, N, and A represent European, Native American, and African ancestry, respectively, for the allele. Four tests were conducted over Chromosome 3 for the logarithm of each of the 2 blood pressure measurements, denoted by $y$, based on the following models:

Model 1 (null): $y = \alpha + X\beta + \varepsilon$

Model 2 (association): $y = \alpha + X\beta + \beta_{g}\, g + \varepsilon$

Model 3 (admixture): $y = \alpha + X\beta + [\beta_{AA} D_{AA} + \beta_{NN} D_{NN} + \beta_{EA} D_{EA} + \beta_{EN} D_{EN} + \beta_{NA} D_{NA}] + \varepsilon$

Model 4 (heterogeneous association and admixture): $y = \alpha + X\beta + [\beta_{AA} D_{AA} + \beta_{NN} D_{NN} + \beta_{EA} D_{EA} + \beta_{EN} D_{EN} + \beta_{NA} D_{NA}] + \beta_{g,EE}\, g D_{EE} + \beta_{g,AA}\, g D_{AA} + \beta_{g,NN}\, g D_{NN} + \beta_{g,EA}\, g D_{EA} + \beta_{g,EN}\, g D_{EN} + \beta_{g,NA}\, g D_{NA} + \varepsilon$

Here, $X$ represents baseline covariates. Both Wald and likelihood ratio tests were used to test for admixture (model 3 vs. model 1), association (model 2 vs. model 1), association adjusted for admixture (model 4 vs. model 3), and admixture and/or association (model 4 vs. model 1). Wald and likelihood ratio test statistics (LRTS) are asymptotically chi-squared distributed under the standard assumption of a normally distributed error, $\varepsilon \sim N(0, \sigma^2)$. Local ancestry and genotype at a marker are not independent; hence, genetic effects in the full model (model 4) were stratified by individual local ancestry [Shriner, et al. 2011]. This also allows for heterogeneous genetic association.

To adjust for multiple comparisons, 3 permutation techniques were used. First, an individual's phenotype, global ancestry, and covariates were randomized
relative to his or her genotype and local ancestry vectors. Empirical significance thresholds for a family-wise error rate (FWER) of 0.05 were computed based on 1000 permutations. Second, for a marker significant for one of the tests based on the FWER of 0.05, p values were estimated based on more than $5 \times 10^{7}$ permutations. Last, to assess Type I error rates, the admixture and/or association statistics were computed on all markers used for 200 simulated datasets for a trait not influenced by the genotype (Q1), including sex and age as covariates.

In a follow-up analysis for a single marker, the model below was used to explicitly test for heterogeneity of the genetic association effect by comparison with model 4:

Model 5 (homogeneous association and admixture): $y = \alpha + X\beta + [\beta_{AA} D_{AA} + \beta_{NN} D_{NN} + \beta_{EA} D_{EA} + \beta_{EN} D_{EN} + \beta_{NA} D_{NA}] + \beta_{g}\, g + \varepsilon$

Reference Panels for Local Ancestry Estimation

The current trend in studying local admixture focuses on continental origin as opposed to finer-scale identification of the region of origin within a continent. It is widely accepted that modern Hispanic populations, such as the GAW18 population, are the result of recent admixture of 3 continental-level ancestral populations, namely European, Native American, and African [Johnson, et al. 2011; Manichaikul, et al. 2012]. HapMap phase 3 release 2 genotypes for 112 unrelated CEU (Utah residents with ancestry from Northern and Western Europe) and 113 unrelated YRI (Yoruba in Ibadan, Nigeria) individuals were used as proxy reference panels for the European and African ancestral components, respectively [Altshuler, et al. 2010]. For the Native
American reference panel, a subset of 64 individuals with at most third-degree relatedness was obtained from the Human Genome Diversity Project (HGDP) Native American populations (Colombian, Karitiana, Maya, Pima, and Surui) [Li, et al. 2008; Rosenberg 2006]. Single nucleotide polymorphisms (SNPs) with greater than 0.2 missingness per marker were removed. To avoid possible bias as a result of different pipelines and phasing methods, the HGDP and HapMap datasets were phased using the segmented haplotype estimation and imputation tool (SHAPEIT) method [Delaneau, et al. 2012].

Data Processing and Merging

Whereas the HapMap and HGDP reference datasets are based on NCBI Build 36.3 genomic coordinates, the GAW18 dataset uses Build 37.3 coordinates. To allow for full confidence in the mapping between the two, only markers in the genome-wide dataset with available rs numbers in the VCF GAW18 data files were used. Chromosome 3 markers present in all 3 datasets were extracted. This resulted in a set of 40098 SNPs (denoted by SNP40098) with an average inter-marker distance of 4932 bp.

Local Ancestry Estimation Considerations

Johnson et al. give estimated average global ancestry proportions for the HapMap Mexican sample in Los Angeles (MXL) and for a cohort of 492 parent-offspring trios recruited from Mexico City (MEX1) [Johnson, et al. 2011]. Those proportions are 49% European, 45% Native American, and 5% African for the MXL panel and 31% European, 65% Native American, and 3% African for the MEX1 panel.
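The marker intersection described under Data Processing and Merging can be illustrated with a minimal sketch. This is not the pipeline used for the analysis; the function names and the toy rs numbers and positions are hypothetical. It assumes each dataset is available as a mapping from rs number to base-pair position and, because the datasets use different genome builds, it matches markers on rs number only and takes positions from a single dataset.

```python
# Minimal sketch: intersect marker sets by rs number and compute the
# average inter-marker distance. All names and values are illustrative.

def shared_markers(*panels):
    """Return rs numbers present in every panel, ordered by position in the first panel."""
    common = set(panels[0])
    for panel in panels[1:]:
        common &= set(panel)
    return sorted(common, key=lambda rs: panels[0][rs])


def mean_spacing(positions):
    """Average distance in bp between consecutive markers."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Toy data: rs number -> position. Positions differ between builds,
# so coordinates are never compared across datasets.
gaw18 = {"rs1": 1000, "rs2": 6000, "rs3": 9000, "rs4": 16000}
hapmap = {"rs1": 900, "rs2": 5900, "rs4": 15900, "rs9": 20000}
hgdp = {"rs1": 900, "rs2": 5900, "rs4": 15900}

common = shared_markers(gaw18, hapmap, hgdp)
spacing = mean_spacing([gaw18[rs] for rs in common])
```

Matching on identifiers rather than coordinates avoids any dependence on a coordinate conversion between builds, at the cost of dropping markers without an rs number.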
Assuming a 2-way admixed population, an estimate of the number of ancestry switches in a diploid genome is given by the formula $B = (2 \times 2 \times 0.01)\, T L z (1 - z)$, where $T$ is the number of generations since admixture, $L$ is the total chromosome length in cM (224.6 cM for Chromosome 3 for the genetic map used), and $z$ is the global proportion for one of the ancestral components [Johnson, et al. 2011]. The same authors estimated 10 to 15 generations since admixture for Hispanic populations.

Besides SNP40098, 2 smaller subsets constructed by selecting AIMs were also used. First, marker information content for ancestry was estimated using the f value for all 40098 markers between each pair of the 3 reference populations [McKeigue 1998]. Based on the f value, 2 additional marker sets were constructed. The SNP6884 marker set, which includes 6884 SNPs, contains all SNPs that have f > 0.25 in at least one of the 3 comparisons between the reference populations. The SNP637 marker set, which includes 637 markers, contains all SNPs in SNP40098 that have f > 0.25 in both the CEU-HGDP and YRI-HGDP comparisons. By construction, SNP637 is a proper subset of SNP6884, which is a proper subset of SNP40098.

Results and Discussion

We first present results of ancestry estimation. Then we provide findings from the statistical tests proposed. Last, we interpret the regression model at a marker found to have significant association and/or admixture.

Ancestry Estimation

Assuming 12 generations since admixture and $z = 0.49$, we estimated an average of $B = 26.9$ ancestry switches and 27.9 ancestry blocks in
Chromosome 3 for the GAW18 population. We used this result together with the global ancestry proportions estimated for MXL to evaluate different parameters for the LAMP-LD method and different SNP subsets on which to base the local ancestry estimation.

Ancestry estimates produced with LAMP-LD are summarized in Table 2.1. This software allows for 2 parameters: window size in number of SNPs and number of founders in the virtual reference populations for the implemented Markov chain [Baran, et al. 2012]. Different values for the second parameter gave comparable global ancestry estimates; the runs reported were based on 25 founders. The window size parameter should depend on the resolution of the marker set used. All LAMP-LD runs were stable in terms of the global ancestry proportion estimates produced, apart from the run with a window size of 2. Both the SNP6884 and SNP637 marker sets produced unsatisfactory results with respect to the number of ancestry switches estimated. It seems that the software is optimized for marker densities similar to the SNP40098 set [Baran, et al. 2012]. In particular, the padding around a window of any size seems to be fixed, which might be the reason for the lower number of ancestry switches produced by all runs with the smaller sets.

Local ancestry estimates based on the LAMP-LD run for the SNP40098 marker set with window size 70 were used in the subsequent linear regression models. This run produced close to the desired number of ancestry switches and global ancestry proportions. The slightly higher average African ancestry can be explained by 2 individuals with close to 50% estimated African global ancestry (51%
and 53%). In summary, there were 2462 unique local ancestry vectors for the sample of 132 unrelated individuals.

Table 2.1. LAMP-LD ancestry estimates for different marker sets and parameters.

                Window Size    Average Global Ancestral Proportions    Number of Ancestry Switches
Marker Set      (# of SNPs)    European   Native American   African    Mean     Standard Deviation
SNP40098         50            0.489      0.455             0.057      26.71    6.47
SNP40098 *       70            0.491      0.453             0.056      25.51    6.06
SNP40098        100            0.491      0.453             0.056      24.76    5.86
SNP6884           5            0.486      0.459             0.055      14.61    3.63
SNP6884          10            0.494      0.454             0.052      16.55    4.10
SNP637            2            0.430      0.458             0.112       5.02    1.84
SNP637           10            0.497      0.447             0.057       8.82    2.50

All estimates are based on 959 individuals using Chromosome 3 markers. The row marked with an asterisk denotes the ancestry estimates that were used in the subsequent linear regression models.

The LAMP-LD method ignores family structure; however, using all family members for estimation of local ancestry may improve the estimates because LAMP-LD builds virtual reference populations during the training phase, upon which it bases its estimates. As validation, local ancestry was also estimated with the set of unrelated individuals only. For a marker (rs12639065) found to have significant association and/or admixture, the resulting ancestry vector was identical to that of the full analysis.

Statistical Tests

Permutation-based significance thresholds were computed for a FWER of 0.05. Using these thresholds, no significant results were found for the test of admixture, test of association, and test of association adjusting for admixture;
however, one SNP remained significant for the combined test of admixture and/or association for the log(DBP) trait. All models, fit under both null and alternative hypotheses, included effects for global ancestry proportions and the selected covariates. Table 2.2 summarizes the p values for all tests at this SNP location, rs12639065.

Table 2.2. Wald and likelihood ratio p values at the significant marker for log(DBP). Marker coordinate (Build 37.3): 14390507.

                                                LRTS                                   Wald
Test                                    df      p value (threshold)                    p value (threshold)
Admixture                               4       2.237 × 10^-4 (6.107 × 10^-5)          3.843 × 10^-4 (1.177 × 10^-4)
Association                             1       5.878 × 10^-3 (1.008 × 10^-6)          7.035 × 10^-3 (1.711 × 10^-6)
Association Adjusting for Admixture     5       6.597 × 10^-5 (4.585 × 10^-7)          1.909 × 10^-4 (2.291 × 10^-6)
Admixture and/or Association            9       2.118 × 10^-7 * (6.883 × 10^-7)        9.974 × 10^-7 * (2.685 × 10^-6)

Thresholds are permutation-derived p values required for achieving significance at a family-wise error rate of 0.05. P values marked with an asterisk are below the respective thresholds. LRTS, likelihood ratio test statistics; df, degrees of freedom.

The SNP is located in an intergenic region between the LSM3 and SLC6A genes. The minor allele frequency at this marker is 0.364; a Pearson chi-square goodness-of-fit test for departure from Hardy-Weinberg equilibrium gave a p value of 0.190. Testing the residuals for the full model, the Shapiro-Wilk test for departure from normality gave a p value of 0.479. As expected, local ancestry estimates do not change for any individual in the sample for a region around the significant marker, from 14,317,580 bp to 14,513,695 bp (the marker itself is at position 14,390,507 bp). LRTS and Wald tests resulted in the same rank for the admixture and/or association test, producing a permutation p value of $8.068 \times 10^{-7}$, which corresponds
to a Bonferroni-adjusted p value of 0.032 ($N$ = 40,098). The permutation-estimated p value is closer to the Wald-based p value of $9.974 \times 10^{-7}$ and greater than the LRTS-produced p value of $2.118 \times 10^{-7}$. This is in line with our observations from the permutation threshold runs, where the LRTS produced smaller p values and a not entirely uniform distribution, but the Wald-based permutation p values were uniform, as expected under the null (data not shown).

To further evaluate the FWER, the admixture and/or association statistics were computed on all markers in SNP40098 for all 200 simulated datasets for the Q1 trait. Sex and age were used as covariates, and the minimum p values for each dataset were retained. This resulted in estimated FWERs of 0.05 and 0.04 compared with the thresholds derived for the LRTS and Wald test statistic, respectively (Table 2.2).

Heterogeneous Association and/or Admixture Model at rs12639065

Linear regression parameter estimates for model 4 for the log(DBP) trait at rs12639065 are presented in Table 2.3. All parameter estimates for the indicators for local ancestry with at least one Native American allele are similar in magnitude (0.138, 0.145, and 0.131 for $D_{EN}$, $D_{NN}$, and $D_{NA}$, respectively). Adjusting for the genotype at the marker, a Native American local ancestry at this region is related to a 15% higher DBP on average ($e^{0.14} \approx 1.15$). The estimates for $D_{EN}$ and $D_{NN}$ are significant (p values <0.004), and although the parameter estimate for the indicator for Native American and African local ancestry, $D_{NA}$, is not significant (p value = 0.132), this is likely a result of the small NA sample size.
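Several of the figures quoted in this chapter follow from short calculations, and the sketch below reproduces three of them: the expected number of ancestry switches from the formula $B = (2 \times 2 \times 0.01)\, T L z (1 - z)$, the Bonferroni adjustment of the permutation p value, and the conversion of a log-scale regression coefficient into a percent change in DBP. The function names are illustrative only and are not part of any software used in the analysis.

```python
import math


def expected_switches(generations, length_cm, z):
    """Expected ancestry switches in a diploid genome for 2-way admixture."""
    return 2 * 2 * 0.01 * generations * length_cm * z * (1 - z)


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p_value * n_tests)


def percent_change(beta):
    """Percent change in the outcome per unit of a predictor when the outcome is log-transformed."""
    return 100 * (math.exp(beta) - 1)


# Inputs taken from the text: T = 12 generations, L = 224.6 cM, z = 0.49.
switches = expected_switches(12, 224.6, 0.49)
blocks = switches + 1  # one more block than switches along a chromosome

# Permutation p value of 8.068e-7 adjusted for 40,098 markers.
adjusted_p = bonferroni(8.068e-7, 40098)

# Coefficient of about 0.14 on the log(DBP) scale.
effect = percent_change(0.14)

print(round(switches, 1), round(blocks, 1), round(adjusted_p, 3), round(effect))
```

Under these inputs the values round to 26.9 switches, 27.9 blocks, an adjusted p value of 0.032, and a 15% increase, matching the figures reported above.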
Table 2.3. Parameter estimates for model 4 at the significant marker rs12639065 for log(DBP).

Term       Factor                                                Parameter Estimate   Std. Error   p value (Pr > |t|)   N¹
           Intercept                                              4.162               0.047        <2 × 10^-16 *
           Proportion Native American (NA) Global Ancestry       -0.099               0.063        0.116
           Proportion African Global Ancestry                    -0.325               0.024        0.118
           Indicator for Blood Pressure Medication Use            0.124               0.024        1.32 × 10^-6 *
D_EN       Indicator for European and NA Local Ancestry           0.138               0.047        0.004 *              67
D_EA       European and African Local Ancestry                    0.053               0.083        0.526                5
D_NN       Native American Local Ancestry (LA)                    0.145               0.049        0.004 *              26
D_NA       Native American and African Local Ancestry             0.131               0.086        0.132                4
g D_EE     Stratified Genotype: European Local Ancestry           0.123               0.030        9.06 × 10^-5 *       30
g D_EN     Stratified Genotype: European and NA LA                0.049               0.022        0.026 *              67
g D_EA     Stratified Genotype: European and African LA           0.039               0.103        0.707                5
g D_NN     Stratified Genotype: Native American LA               -0.069               0.037        0.064                26
g D_NA     Stratified Genotype: Native American and African LA    0.122               0.109        0.264                4

An asterisk indicates p values less than 0.05.
¹ Sample size after stratifying for local ancestry at the marker, e.g., 67 of the unrelated individuals had European and Native American ancestral alleles at rs12639065.

Although the model suggests that for entirely European local ancestry at this region, DBP is expected to be lower compared with other local ancestries, a minor allele at the marker seems to have a positive effect in such a case, with DBP expected to increase by 13% per minor allele carried at the SNP ($e^{0.123} \approx 1.13$). This estimate is also highly significant, with a p value of $9 \times 10^{-5}$. Also significant is the parameter estimate for the genotype stratified on Native American and European local ancestry
at the region; DBP is expected to increase by 5% per minor allele carried at the marker ($e^{0.049} \approx 1.05$). Although not significant at the 0.05 level, with a p value of 0.064, given a Native American local ancestry at the region, the minor allele has a negative effect on the trait (DBP is estimated to be 7% lower per minor allele carried at the marker, with $e^{-0.069} \approx 0.93$). The genotype effects stratified on local ancestry with an African local ancestral component were nonsignificant, possibly because of small sample sizes.

To further illustrate the model 4 fit for log(DBP) presented in Table 2.3, a person with average ancestral proportions, no medication use, and no minor alleles at the marker is expected to have a DBP of 60.3 if the local ancestry at the region is entirely European and a DBP of 69.7 if the local ancestry is entirely Native American. This compares with expected DBP levels of 77.1 and 60.7 for a person with 2 minor alleles at the marker and entirely European or entirely Native American local ancestry at the region, respectively.

The test for genetic heterogeneity (model 5 vs. model 4) gave an LRTS p value of $5.39 \times 10^{-4}$, which suggests that heterogeneity exists among the genetic association effects. The association effect acts in different directions given different ancestry in the region.

Conclusions

Combining admixture and association information is a promising approach for detection of quantitative trait loci. Local ancestry estimation must be performed before such combined tests are conducted; however, producing quality local ancestry estimates is challenging in a multiway admixed population such as the one used in GAW18. To the best of our knowledge, all proposed methods use some form
of reference panels, which serve as proxies for the ancestral populations in the admixture. Although good reference panels exist for many populations, we found that obtaining such a proxy for the Native American ancestral component present in Hispanic populations is particularly difficult. Furthermore, using 3 different data sources increases the complexity of the process because it necessarily involves aligning and intersecting the sets used. We did find the use of benchmarks to be helpful in evaluating the ancestry estimates produced. Possible benchmarks for assessing quality are global ancestry proportions for similar populations as well as model-based estimates of the expected number of ancestral blocks in a chromosome. We did not use the family structure in our local ancestry estimates. Methods for local ancestry estimation that exploit the family structure should improve the quality of the estimates.

Although no significant results were detected for tests of admixture, association, and association adjusted for admixture, we did find a significant marker from a combined admixture and/or association test. This may indicate increased power of quantitative trait loci detection when aggregating 2 different sources of genetic signal. Simulation studies are necessary to evaluate the power of a combined approach; however, the results from our analysis demonstrate promise and the need for further studies of such a method.

Finally, the regression model at the significant SNP suggests that the genetic signal at the SNP does not act in the same direction for different local ancestral backgrounds. More work is needed to investigate the source of heterogeneity in the association effect. Although it is tempting to think of potential sources of
heterogeneity as a result of ancestry or environment, another possible explanation may relate to differences in LD between the ancestral populations. A more extensive study would be needed to evaluate such differences and their effects.

Acknowledgements

The Genetic Analysis Workshop is supported by National Institutes of Health (NIH) grant R01 GM031575. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. We thank the anonymous reviewers for their comments and suggestions and in particular for the suggested test for heterogeneity of the genetic association.
CHAPTER III

A HIGH RESOLUTION SIMULATED LATINO DATASET WITH INFERRED LOCAL ANCESTRY

Abstract

Admixed populations, such as African Americans and Latinos, combine genetic variability from different ancestral populations and facilitate discovery of a greater range of allelic architecture of complex traits. However, the heterogeneous genetic background of admixed cohorts presents challenges on how to account for population sub-structure when performing analysis in genome-wide association studies or in estimating cryptic relatedness. Evaluation of methods applied to admixed populations has been done through simulation, with varying success in mimicking the complex linkage disequilibrium patterns in contemporary admixed populations. Estimating the ancestral origins of variation, i.e., local ancestries, is a prerequisite for some methods and requires contemporary genetic data as proxies for the ancestral continental populations. For Latino populations, this step is further complicated by the scarcity of publicly available Native American genotype data. Here, we present a high-resolution, simulated Latino dataset with allele frequencies and linkage disequilibrium patterns similar to those of contemporary Latino samples after one additional generation of random mating. The dataset combines all four 1000 Genomes Phase 3 Latino populations, preserving sub-continental differences. By construction, the dataset includes low-level familial-like relatedness. Further, a Native American sample without European admixture was obtained, additionally screened, and imputed to high resolution to serve as proxy in
local ancestry estimation. The simulated Latino dataset at sequence resolution, the corresponding local ancestry estimates, and the sequence-resolution Native American reference panel are made available to the community to enable simulation of phenotypic data from a set of data that better represents the population structure present in Latinos.

Introduction

During the past several hundred years, continental populations that had been geographically isolated for many thousands of years interacted again due to major migrations. Forces like genetic drift and fixation, population bottlenecks, new mutations, and selective pressure led to divergent allele frequencies in isolated groups. Admixed populations, like those represented by African American and Latino populations, resulted from interactions between continental populations and have genomes that are mosaics of segments with distinct continental ancestry [Shriner, et al. 2011]. For instance, modern Latino populations are believed to result from genetic admixture of Native American, European, and West African ancestral populations [Johnson, et al. 2011; Manichaikul, et al. 2012].

Admixed populations allow additional opportunities for mapping the genetic basis of a trait compared to more homogeneous populations like Europeans [Brown and Pasaniuc 2014; Liu, et al. 2013b; Zhang and Stram 2014]. For example, admixed populations allow for analyses at some markers that are fixed to different alleles in some populations. More importantly, admixed populations naturally combine the genetic variability present in different human populations, thus facilitating the discovery of a greater range of the allelic architecture of complex traits.
The classical admixture mapping approach tests for association of the disease or trait with the ancestral origin at a locus [Winkler, et al. 2010]. The long-range correlations introduced by the admixture process allow for gene mapping methods that combine genotype and ancestral origin information [Liu, et al. 2013b; Pasaniuc, et al. 2011; Shriner, et al. 2011b; Tang, et al. 2010]. The heterogeneous genetic background of admixed cohorts also presents challenges on how to account for population sub-structure when performing analysis in genome-wide association studies or when estimating cryptic relatedness [Conomos, et al. 2015].

Development of methods to address the challenges and opportunities offered by admixed populations is an active area of research. The Type I error and power of such methods have typically been evaluated by simulation. Differences in simulation setup between papers can result in conflicting conclusions, impeding the choice of methods for researchers studying admixed cohorts [Zhang and Stram 2014]. In some papers, admixed genetic data are simulated one marker at a time, completely ignoring linkage disequilibrium (LD) patterns on different ancestral backgrounds. Other methods used to simulate admixed samples start from real samples from homogeneous populations and employ simplified, forward-time admixture processes like hybrid isolation, gradual admixture, or continuous gene flow. These approaches have a varying degree of success in mimicking the complex LD patterns in contemporary admixed populations like Latinos or African Americans [Jin, et al. 2014].

Local ancestry (LA) is defined as the number of copies of chromosomes inherited from the different ancestral populations at a given genomic location.
Obtaining reliable LA estimates is often a challenging but necessary step for admixture mapping and for most combined admixture and association analyses. Methods suitable for sequence data resolution that infer the unobserved ancestral origins of the genotypes use contemporary genetic reference panels as proxies for the ancestral populations [Brown and Pasaniuc 2014]. A further obstacle for estimating LA for Latinos is the scarcity of reliable, publicly available Native American (NA) genotype data that can serve as a proxy for the NA ancestral population [Brown and Pasaniuc 2014; da Silva, et al. 2015; Eyheramendy, et al. 2015; Zhang and Stram 2014]. This is especially true for sequence-resolution NA genotypes. For example, NA populations are not present in the 1000 Genomes Project. Although local ancestry estimates can be produced with Asian reference panels, it has been suggested that using a NA reference panel is likely to substantially improve the accuracy of the LA estimates in Latino samples [Brown and Pasaniuc 2014; Moura, et al. 2015].

To address these issues, we constructed a high-resolution, simulated Latino dataset with allele frequencies and LD patterns similar to those of contemporary Latino samples after one additional generation of mating. The dataset combines all four 1000 Genomes Phase 3 Latino populations, preserving sub-continental differences. By construction, the dataset includes low-level, predominantly third-degree, familial-like relatedness. Further, a NA sample with low-level European admixture was obtained, additionally screened, and imputed up to 1000 Genomes Phase 3 marker resolution to serve as a NA reference panel for LA estimation. Extra care was taken to choose an existing method that produces high quality LA
estimates at high resolution. The construction of the simulated Latino dataset, the corresponding local ancestry estimates, as well as the sequence-resolution Native American reference panel are described in this manuscript.

Methods

The simulated data is based on all Latino individuals from Phase 3 of the 1000 Genomes Project [The 1000 Genomes Project Consortium 2012]. The Phase 3 final variant call set released on May 2, 2013 provides phased genotypes for 2,504 individuals. The genotypes downloaded from the Impute2 [Howie, et al. 2011] website had been created from the variant call format files available from the project. Of the 2,504 samples from the 1000 Genomes dataset, 349 have Latino ancestry (AMR) and are broken out as: 94 Colombians in Medellin, Colombia (CLM), 64 Mexican Ancestry individuals in Los Angeles, California (MXL), 86 Peruvians in Lima, Peru (PEL), and 105 Puerto Ricans in Puerto Rico (PUR).

Reference Panels for Local Ancestry Inference

Reference panels serving as proxies for the ancestral European and African populations consisted of 99 Utah residents with Northern and Western European ancestry (CEU) and 108 Yoruba in Ibadan, Nigeria (YRI) from 1000 Genomes Phase 3, respectively. To construct a Native American reference panel at sequence resolution, NA samples used by the 1000 Genomes consortium to produce Phase 1 consensus LA calls were obtained. Specific geographic locations and indigenous population affiliations of the samples are described elsewhere [Mao, et al. 2007]. The samples were genotyped using the Affymetrix 6.0 SNP arrays and have undergone quality
control screening by the 1000 Genomes team [The 1000 Genomes Project Consortium 2013]. Importantly rred NA ancestry are present in the dataset obtained. To identify cryptically related individuals with estimated second-degree or higher relatedness, we used a subset of 36,980 single nucleotide polymorphisms (SNPs) with minor allele frequency (MAF) greater than or equal to 0.05 and in approximate linkage equilibrium (markers with pairwise squared correlation greater than 0.2 were discarded in overlapping sliding windows of 50 markers using PLINK 1.07 software [Purcell, et al. 2007]). A parent with two children and a pair of siblings were detected; a random individual from each familial set was retained for downstream analysis. Of the 616,568 available markers, 188 SNPs had alleles that could not be unambiguously matched with 1000 Genomes Phase 3 alleles and 1,673 SNPs were not present in the 1000 Genomes Phase 3 dataset; both sets were removed. The resulting NA panel consists of complete data for 40 individuals with genotypes for 614,707 bi-allelic SNPs.

Next, the NA panel was imputed to the resolution of the 1000 Genomes Phase 3 dataset. Impute2 [Howie, et al. 2011] v2.3.2 software was employed without pre-phasing across the genome on chromosomal segments with average length of ~2.5 Mbp. All 2,504 samples in the 1000 Genomes Phase 3 dataset were used as the imputation reference panel. Default algorithm parameters were used except for the following: 20 burn-in iterations, 60 total Markov chain Monte Carlo iterations, 400 Hidden Markov Model (HMM) states for phasing, and buffer regions of 1.25 Mbp included on each side of a chromosomal segment computed. The Impute2 algorithm selects template haplotypes from all available reference haplotypes and uses them
for the HMM states for imputation. The number of template haplotypes used per chromosome segment was set to 1000, since such an approach achieved better leave-one-out cross-validation concordance statistics [Howie, et al. 2011] of Impute2 genotype calls as opposed to using the default, 250 template haplotypes, or using all 5008 template haplotypes available in the reference panel (results not shown). Posterior genotype probabilities inferred by Impute2 were converted to hard genotype calls only if a genotype probability greater than 0.8 was inferred and were otherwise set to missing, resulting in 2.36% average genotype missingness. Only SNPs with at most 10% missing genotypes were retained in downstream analysis. Using a 0.8 posterior probability threshold resulted in average concordance of 99.54% across the genome for the hard-called genotypes based on the same leave-one-out cross-validation concordance statistics.

To achieve consistency across the reference panels, the hard-call NA genotypes were phased with the Shapeit method [Delaneau, et al. 2012] that was also used by the 1000 Genomes Consortium to infer haplotypes for their Phase 3 official data release. Shapeit2 software v2.r790 was run on each chromosome with the following model parameter values: 16 burn-in iterations, 8 iterations for each of the 12 pruning stages, 30 main iterations of the MCMC algorithm, 300 conditioning states on which haplotype estimation is based, and a window size of 0.1 Mb. Additionally, AMR samples from 1000 Genomes were used as a reference panel to improve phasing accuracy [Delaneau, et al. 2012]. Finally, only AMR non-monomorphic bi-allelic SNPs on autosomal chromosomes were extracted from the AMR, CEU, and YRI datasets, and from the
phased and imputed NA dataset, resulting in a resolution of 25,235,037 SNPs for all four datasets.

Local Ancestry Inference

The RFmix method [Maples, et al. 2013] was used to infer LA for all AMR samples. The method directly models the conditional density $P(\text{ancestry} \mid \text{genotype})$. It uses a conditional random field parameterized by random forests and can learn from the admixed samples as well as the reference panels. RFmix requires phased haplotype data for both the analysis dataset and the reference panels and produces LA estimates for each marker across the genome. The RFmix method also employs genetic map coordinates at the common SNP density of the datasets. Such a genetic map was produced by linearly interpolating from the physical coordinates of the markers and a sparser Impute2 genetic map.

Local ancestry estimates were produced for all AMR samples with RFMix software release v1.5.4 using the CEU, YRI, and NA reference haplotypes at the final resolution of 25,235,037 SNPs. The program runs were executed with the minimum number of reference haplotypes per tree node set to 5 (as recommended by the authors for imbalanced sizes of the ancestral reference panels) and default values for all other algorithm parameters. The LA estimates produced by the second of two additional expectation-maximization iterations were used in the downstream analyses, as those are expected to improve the quality of the assignments [Eyheramendy, et al. 2015; Maples, et al. 2013].

To assess the accuracy of the LA estimation procedure, the RFmix method was separately used on Chromosome 21 to infer LA for the AMR haplotypes
together with all the reference haplotypes, thus forcing the algorithm to infer the ancestral origins of the European, African, and Native American proxy ancestral haplotypes as well. The proportion of CEU, YRI, and NA reference haplotypes misclassified as non-CEU, non-YRI, and non-NA, respectively, can serve as a crude lower bound for the LA inference error rate.

Simulated Genomes

Admixed diploid genomes were simulated from the real admixed haplotypes with a resampling approach [Yuan, et al. 2012]. Chromosomes from the phased AMR genomes are randomly labeled paternal or maternal for each individual. Below, haplotype refers to either the 22 paternal or the 22 maternal chromosomes from an individual. Within each of the 188 CLM, 128 MXL, 172 PEL, and 210 PUR haplotypes, two pairs are sampled at random without replacement (together with the corresponding LA assignments). The genomic length of each chromosome is taken in Morgans from the genetic map. For each pair of haplotypes, recombination points are generated uniformly across each chromosome. A pair of haplotypes (and the corresponding ancestral calls) is recombined at those recombination points, and two pairs of haplotypes result in a single simulated diploid genome (Figure 3.1). The process is repeated separately within the CLM, MXL, PEL, and PUR subsets for a total of 948, 646, 857, and 1049 individuals, respectively, preserving the proportions of the sub-populations in the original AMR sample. The resulting dataset (referred to as AMR1 throughout the manuscript) has a combined sample size of 3500 genomes.
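The resampling scheme described above can be sketched for a single chromosome as follows. This is a minimal illustration under stated assumptions, not the code used to build AMR1: the number of recombination points is assumed to follow a Poisson process with rate one per Morgan (the text only states that the points are uniform along the chromosome), the starting haplotype of each pair is chosen at random, and all function names are hypothetical.

```python
import random
from bisect import bisect_right

def draw_breakpoints(length_morgans, rng):
    """Sorted recombination points from a rate-1-per-Morgan Poisson process (assumption)."""
    points, pos = [], rng.expovariate(1.0)
    while pos < length_morgans:
        points.append(pos)
        pos += rng.expovariate(1.0)
    return points

def recombine(hap_a, hap_b, anc_a, anc_b, map_pos, breakpoints, start_with_a=True):
    """Copy alleles and ancestry calls, switching source haplotype at each breakpoint.

    map_pos holds the genetic map position (Morgans) of each marker;
    breakpoints must be sorted in increasing order.
    """
    sources = ((hap_a, anc_a), (hap_b, anc_b))
    first = 0 if start_with_a else 1
    alleles, ancestry = [], []
    for i, pos in enumerate(map_pos):
        crossings = bisect_right(breakpoints, pos)  # breakpoints at or before this marker
        hap, anc = sources[(first + crossings) % 2]
        alleles.append(hap[i])
        ancestry.append(anc[i])
    return alleles, ancestry

def simulate_diploid(haplotypes, ancestries, map_pos, rng):
    """Sample two pairs of haplotypes without replacement and recombine each pair."""
    idx = rng.sample(range(len(haplotypes)), 4)
    length = map_pos[-1]
    genome = []
    for a, b in ((idx[0], idx[1]), (idx[2], idx[3])):
        genome.append(recombine(
            haplotypes[a], haplotypes[b], ancestries[a], ancestries[b],
            map_pos, draw_breakpoints(length, rng),
            start_with_a=rng.random() < 0.5,
        ))
    return genome  # two (alleles, ancestry) haplotypes forming one diploid genome
```

Because the ancestry calls are copied alongside the alleles, each simulated genome inherits local ancestry assignments from its source haplotypes without a second round of inference.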
Figure 3.1. Simulated diploid genome construction. (a) Chromosomes from the phased AMR genomes are arbitrarily labeled paternal or maternal for each individual. (b) Two pairs of haplotypes are sampled at random without replacement together with the corresponding local ancestry assignments. (c) Each pair of haplotypes (and the ancestral calls) is recombined at random recombination points, and (d) two recombined haplotypes result in a single simulated diploid genome.

Description of the Simulated Dataset

If the same one, two, or three haplotypes were sampled for a pair of simulated individuals, those two individuals are expected to share on average 0.125, 0.25, and 0.375 of the genome, respectively, analogous to third-degree or closer relatives. By construction, such non-trivial relationships can only occur within the CLM, MXL, PEL, and PUR sub-populations. The proportion of such pairs in the particular random realization of the simulation scheme is calculated. Global ancestry proportions were estimated by averaging LA estimates over all 25,235,037 markers for all individuals in AMR1. To compare the inferred local
ancestries in the current study with a previously published result for Chromosome 3 [Yorgov, et al. 2014], the number of inferred ancestral switches was computed across Chromosome 3 for all MXL samples in the AMR1 dataset.

The top 10 principal components were derived with the flashPCA [Abraham and Inouye 2014] method for AMR1. Briefly, a genetic similarity variance-covariance matrix is computed from the individual genotypes and then principal components analysis is applied to the matrix to identify population structure. Following the ns, several regions with strong LD and/or known inversions (chr5: 44-51.5 Mb, chr6: 25-33.5 Mb, chr8: 8-12 Mb, chr11: 45-57 Mb) were removed. Further, markers with pairwise squared correlation greater than 0.10 were discarded in overlapping sliding windows of 1000 markers using PLINK 1.07 software [Purcell, et al. 2007], leaving 204,427 markers in approximate linkage equilibrium to inform the principal components analysis. To assess the correlation between the derived principal components and the global ancestry proportions, linear regression models were fit separately for the global European and Native American ancestry proportions using the top two principal components as predictors. Admixture among three continental populations should be reflected by the top two principal components, if there is no additional structure in the sample [Shriner 2013].

Finally, a genetic relationship matrix (GRM) was computed with the GCTA method [Yang, et al. 2011], software version 1.24.4, using all markers in the AMR1 dataset that meet a minimum MAF threshold.
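The regression of a global ancestry proportion on the top two principal components can be sketched as below. This is a plain least-squares illustration rather than the software used in the analysis; the function name and the toy values are hypothetical. The coefficient of determination is obtained from the centered sums of squares and cross-products.

```python
def r_squared_two_predictors(y, x1, x2):
    """R^2 of the least-squares fit y ~ intercept + x1 + x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    yc = [v - my for v in y]
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, yc))
    s2y = sum(b * v for b, v in zip(c2, yc))
    syy = sum(v * v for v in yc)
    # Solve the 2x2 normal equations for the two slopes.
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    # Explained sum of squares divided by total sum of squares.
    return (b1 * s1y + b2 * s2y) / syy

# Toy values (made up): ancestry is an exact linear function of the two PCs.
pc1 = [0.0, 1.0, 2.0, 3.0]
pc2 = [1.0, 0.0, 1.0, 0.0]
ancestry = [0.3 + 2.0 * a - b for a, b in zip(pc1, pc2)]
r2 = r_squared_two_predictors(ancestry, pc1, pc2)
```

An R^2 close to one indicates that the two principal components capture nearly all of the variation in the ancestry proportion, which is the pattern expected for three-way admixture without additional structure.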
Results

Estimated global ancestral proportions of AMR1 samples are reported in Table 3.1 and are consistent with previous results [Gravel, et al. 2013; Moura, et al. 2015]. For example, Gravel et al. estimated Native American ancestry to be 48%, 25%, and 13% in MXL, CLM, and PUR Phase 1 1000 Genomes samples, respectively.

Table 3.1. Mean global ancestral proportions in the simulated dataset.

Population (sample size)   European        Native American   African
AMR1 (3500)                52.7%           39.6%             7.7%
CLM (948)                  64.8% (6.8%)    27.3% (5.2%)      7.9% (4.1%)
MXL (646)                  45.6% (10.0%)   50.2% (10.1%)     4.2% (1.2%)
PEL (857)                  21.0% (6.4%)    75.8% (7.6%)      3.2% (3.1%)
PUR (1049)                 72.0% (5.7%)    14.5% (2.6%)      13.5% (5.4%)

Standard deviation is reported for the continental sub-populations (in parentheses).

The RFmix LA estimates on Chromosome 21 with all reference haplotypes included in the analysis resulted in misclassification rates of 0.0032, 0.0041, and 0.0012 for the EU, NA, and AFR proxy haplotypes, respectively. There were on average 27.01 ancestral switches, computed across Chromosome 3, for the MXL samples in the AMR1 dataset, which is comparable to a previously published result of 26.71 ancestral switches for Chromosome 3 in a Mexican American sample of unrelated individuals from San Antonio, Texas [Yorgov, et al. 2014].

For this particular realization of the described simulation scheme, AMR1, 2.218% of pairs of individuals are analogous to third-degree relatives and 0.060%
have closer than third-degree relatedness. By construction, all such relatedness is within the Latino sub-populations.

Using the top two principal components as predictors in linear models for global ancestries resulted in $R^2 = 0.952$ and $R^2 = 0.994$ for the European and Native American global ancestry proportions, respectively. That is, 95.2% and 99.4% of the variation in the global proportions of European and Native American ancestry are explained by the top two PCs from flashPCA, respectively.

The heat map produced from the empirical GRM shows a clear differentiation between the different Latino sub-populations, with a particularly pronounced differentiation between the PEL and PUR sub-populations (Figure 3.2), consistent with the highly differentiated global ancestry proportions between Peruvians and Puerto Ricans.

Discussion

By construction, the AMR1 dataset has allele frequencies and LD patterns similar to those of the original Latino samples after one additional generation of random mating. The sample size of 3500 for AMR1 is comparable to contemporary admixed cohorts [Coram, et al. 2013]. AMR1 is characterized by different levels of structure, namely: admixture of three ancestral continental populations, distinct sub-populations, reflective of sub-continental variation, and familial-like relatedness within each of the sub-populations. Generated from 349 admixed samples, the dataset is representative of variants with MAF of 0.01 or greater, but is unlikely to capture the complete spectrum of very rare variation that 3500 fully sequenced Latino individuals could reveal.
Figure 3.2. Heat map produced from the GRM. Color represents the estimated proportion of alleles shared identical by descent, ranging from 0.3 (dark pink) to 0.9 (dark cyan). The GCTA-derived GRM is re-scaled using the current population as a base population, with the average off-diagonal value being 0 [Yang, et al. 2011].

It is worth noting that producing LA estimates before generating the expanded AMR1 dataset avoids another layer of complexity, i.e., the same chromosomal segments cannot have different LA assignments due to the probabilistic nature of the LA inference. Once the LA estimates are produced for the original 1000 Genomes haplotypes, we treat them as known and propagate them in the simulated genomes, accounting for recombination.
The RFmix method used has been reported to produce higher accuracy LA estimates for sequencing resolution data compared to GWAS chip marker resolution [Brown and Pasaniuc 2014; Maples, et al. 2013]. The high quality of the estimates produced by the method seems to be confirmed by the LA estimates for the reference haplotypes. The proportions of CEU, YRI, and NA reference haplotypes misclassified as non-CEU, non-YRI, and non-NA were 0.32%, 0.41%, and 0.12%, respectively, and can be considered as a crude lower bound for the LA inference error rate.

Although we ignored the familial-like relatedness present in the dataset when producing principal components for all 3500 individuals, the top two PCs result in a perfect separation of the sub-populations in AMR1, suggesting that they are driven by the population structure in the sample as opposed to the familial-like relatedness present (Figure 3.3). Further, the top two PCs together explain most (95.2%) and virtually all (99.4%) of the variation in the European and Native American global ancestry proportions, respectively. This also illustrates that the distant familial-like relatedness present in the dataset does not strongly influence the PCs, at least for the first two PC dimensions.
Figure 3.3. Principal components analysis. The top two principal components are plotted against each other for the 3500 AMR1 samples. The color of each point in the figure represents its sub-population.

A possible choice for a Native American reference panel for LA inference is to use the 64 NA samples from the Human Genome Diversity Project [Li, et al. 2008] with at most third-degree relatedness [Rosenberg 2006]. Discarding samples with less than 99% NA ancestry (inferred by unsupervised analysis with the Admixture method [Alexander, et al. 2009]) as well as one sample that does not cluster with its reported indigenous population reduced the sample size of this dataset to 34 individuals. Additional screening for cryptic relatedness discovered many pairs of individuals with estimated second-degree or higher relatedness (results not shown).
Producing a set without relatedness would have reduced the dataset to about twenty individuals, and this NA reference panel was not pursued further.

The AMR1 dataset presented can be used to evaluate power and Type I error for methods that incorporate LA information, e.g., in combined admixture and association tests, by simulating phenotype data from a realistic structure of genetic variation. Truly polygenic phenotypes generated from the dataset could be used to compare different approaches to correct for population stratification due to highly structured genetic correlations, e.g., principal components analysis and multidimensional scaling, empirical GRM, and decomposing the empirical GRM into several mixed effects and/or fixed effects. Such a dataset can also be used to evaluate methods that estimate distant pedigree relationships in highly structured samples with familial relatedness and population structure [Conomos, et al. 2016; Manichaikul, et al. 2010].
CHAPTER IV

EFFECTS OF IMPUTATION ON COMBINED ADMIXTURE AND ASSOCIATION MAPPING

Abstract

For populations like African Americans and Latinos, the long-range correlations introduced by the admixture process allow for gene mapping methods that combine genotype with ancestral origin information (local ancestry). In several studies, combining admixture and association information resulted in findings that were not detectable with standard association testing, suggesting that for a genome-wide association study (GWAS) power can be gained from combining two different sources of genetic information. Using extensive simulations based on real Latino genomes, we investigate the possible gains in power from different ways of incorporating local ancestry into single variant association testing at GWAS chip resolution. We compare such approaches to imputation followed by a standard association test. We simulate polygenic traits with a single causal common variant per locus, and a causal allele with the same effect regardless of its ancestral origin. Markers with a sufficient degree of allele frequency differentiation in the ancestral populations that do benefit from the admixture information were (1) those with differential linkage disequilibrium patterns in the ancestral populations (for combined tests) and (2) those with the causal allele originating largely from one of the ancestral populations (for admixture mapping). For this simulation scenario our results suggest that, at GWAS chip resolution, there is limited benefit from incorporating local ancestry in admixture
mapping or in combined admixture and association testing, since higher power can be achieved by the imputation approach. Specifically, imputation of the causal marker followed by association testing was the best approach with respect to power, both on average across all regions containing causal markers and individually at each region. Imputation followed by association yielded increases of average power compared to the best powered admixture and combined tests by factors of 5.3 and 2.13, respectively. We further show that the standard linear mixed model approach, without local ancestry adjustment through a fixed effect, controls well for Type I error in both the association test and in the combined admixture and association tests.

Introduction

Major migrations that started in the 15th century resulted in the intermixing of continental populations that were geographically isolated for many millennia. The resulting recently admixed populations combine the genetic variability present in the ancestral populations. Such populations, e.g., Hispanics and African Americans, allow additional opportunities for mapping the genetic basis of a trait [Brown and Pasaniuc 2014; Liu, et al. 2013b; Zhang and Stram 2014]. Admixture mapping is a group of methods for localizing chromosomal regions associated with a trait, using local ancestry (LA), the number of copies of chromosomes inherited from the different ancestral populations at a given genomic location. The long-range correlations introduced by the admixture process also allow for gene mapping methods that combine genotype and ancestral origin information [Liu, et al. 2013b; Pasaniuc, et al. 2011; Shriner, et al. 2011b; Tang, et al. 2010]. As the ancestry
signal provides information about the haplotype diversity in the region and can be complementary to the association signal, combining both can result in potentially more powerful statistical tests [Tang, et al. 2010].

In several studies [Liu, et al. 2013b; Yorgov, et al. 2014], combining admixture and association while allowing for the effect size to differ based on the ancestral origin resulted in discoveries that were not detectable with the standard GWAS approach, suggesting that genome-wide association studies (GWAS) can potentially gain power by allowing ancestry-specific effect sizes. For instance, fully stratified association at a variant based on the ancestral background of its chromosomal region identified a variant associated with diastolic blood pressure (DBP) in a Hispanic population [Yorgov, et al. 2014]. In the model fit, the effect on DBP of the variant acts in different directions depending on its ancestral origin: for entirely Native American origin, the genetic association effect reduces DBP, as opposed to increasing DBP for entirely European origin. An explicit test for heterogeneity among the effects due to genetic association given different ancestral background in the region was highly significant. In a different Hispanic case-control study [Liu, et al. 2013b], analysis including a SNP by local ancestry interaction term in a logistic regression resulted in a suggestive association at a variant for asthma. The variant lacked any evidence of association from a conventional GWAS analysis. The authors suggested that a combined test can be beneficial and complements the standard GWAS approach in admixed cohorts. A combined admixture and association test in a case-control setting that allows for heterogeneous odds ratios, based on the ancestral origin of
the chromosomal segment, was implemented in work by Pasaniuc and colleagues [Pasaniuc, et al. 2011]. Applied on a real African American dataset, two SNPs had strong signal for this test and those SNPs when explicitly testing for different odds ratios conditioning on African vs. European local ancestry.

Dissimilarities between the allele frequencies in the ancestral populations and/or different linkage disequilibrium (LD) patterns within segments with different ancestral origins are plausible reasons for such discoveries. If the genotype data is at GWAS chip resolution and the causal variant is not genotyped, a possible explanation is that the effects at the tested surrogate marker(s) vary across ancestral populations due to different LD patterns with the unobserved causal variant. An alternative explanation is gene-gene interaction of the tested marker with another causal SNP in the same region that has different allele frequencies in the ancestral populations.

Genetic correlation can arise due to any type of population structure, e.g., sub-populations, admixture, and assortative mating, as well as due to known and unknown familial relatedness, and can result in phenotypic correlation and confounding. To adjust for confounding due to admixture-induced LD, inclusion of local ancestry in the null model in a regression framework has been suggested for the standard association test [Qin, et al. 2010; Shriner, et al. 2011a; Wang, et al. 2011]. Recent works concluded that such adjustment is not required and can result in loss of power for association testing [Liu, et al. 2013b; Zhang and Stram 2014].
Linear mixed model (LMM) methods with a random effect based on an empirical genetic relationship matrix (GRM) [Bradbury, et al. 2007; Kang, et al. 2010; Lippert, et al. 2011; Yang, et al. 2011; Yang, et al. 2014; Yu, et al. 2006; Zhou and Stephens 2012] are emerging as preferred methods to adjust for confounding due to correlated genotypes in GWAS. LMMs are reported to adjust for many sources of such correlation, including large-scale population structure due to admixture or stratification, as well as confounding due to familial relatedness [Astle and Balding 2009; Price, et al. 2010; Yang, et al. 2014; Yu, et al. 2006].

Using extensive simulations based on real Latino genomes, we investigate possible gains in power to detect genetic signal from several approaches that incorporate local ancestry in single variant association testing for a polygenic trait. We compare the power of the proposed tests with respect to the power of the standard association test. We also investigate if, at GWAS chip resolution, similar or greater gains in power can be achieved via imputation followed by a standard association test. Specifically, we simulate polygenic traits with a single causal variant per locus, and a causal allele with the same effect regardless of its ancestral origin. Markers with a sufficient degree of allele frequency differentiation in the ancestral populations are selected to be causal.

All of our models adjust for confounding due to correlated genotypes in the samples by adopting the standard LMM approach, i.e., by including a GRM for a single polygenic variance component in the model [Yang, et al. 2011]. We further study if this standard LMM adjustment, without local
ancestry adjustment through a fixed effect, controls well for Type I error in both the association test and in the combined admixture and association tests. We conclude with a description of characteristics of markers in the 10 ENCODE pilot regions with a sufficient degree of allele frequency differentiation in the ancestral populations and that benefit from analyses incorporating admixture information.

Methods

We first describe the admixed genomes used in our simulations, followed by the generation of polygenic traits, the statistical tests performed, and the additional analyses to characterize the variants that benefit from a testing approach that incorporates LA information.

Simulated Genomes

A sequence-resolution simulated Latino dataset with allele frequencies and LD patterns similar to those of contemporary Latino individuals after one additional generation of random mating was previously described [Chapter III]. Briefly, the simulated dataset contains 25,235,037 non-monomorphic bi-allelic autosomal single nucleotide polymorphisms (SNPs) and a combined size of 3,500 individuals, comparable to the sample size for a contemporary Hispanic American cohort [Coram, et al. 2013]. The simulated dataset is based on all Latino individuals from Phase 3 of the 1000 Genomes Project [The 1000 Genomes Project Consortium 2012]. It is representative of common variants, with minor allele frequency (MAF) 0.01 or greater, but is unlikely to capture the complete spectrum of rare variation that a deeply sequenced sample of the same size would contain. Specifically, the sub-
populations for the simulated individuals are: 943 based on Colombians from Medellin, Colombia (CLM), 642 based on Mexican Ancestry from Los Angeles, California (MXL), 862 based on Peruvians from Lima, Peru (PEL), and 1053 based on Puerto Ricans from Puerto Rico (PUR). The dataset is characterized by different levels of structure, namely: admixture of three ancestral continental populations, distinct sub-populations, reflective of sub-continental variation, and distant familial-like relatedness within each of the sub-populations.

The method to generate the genotype data was carefully designed and validated using different measures for ancestry. Local ancestry estimates are available for each marker in the dataset. The European, Native American, and West African average global ancestry proportions in the Latino dataset are estimated to be 52.7%, 39.6%, and 7.7%, respectively. There is substantial variability within the sub-populations, e.g., Native American average ancestry is as low as 14.5% in the PUR samples and as high as 75.8% in the PEL samples [Chapter III]. For all sub-populations, the two ancestral origins with the greatest contribution to the genotypes are European and Native American.

GWAS Chip Resolution Dataset

A GWAS resolution dataset was created by extracting (from the Latino dataset AMR1) all autosomal SNPs that are also present in the Illumina® OmniExpress-24 v1.0a Array. This resulted in 600,691 SNPs at the GWAS chip resolution for the Latino dataset after additionally screening out markers with MAF less than 1%. The marker density of the resulting GWAS resolution genotypes is approximately 1 marker per 4.8 kilobases (kb) compared to the original resolution of
roughly 1 marker per 110 bases.

Polygenic Traits

To simulate polygenic quantitative traits, a genetic relationship matrix (GRM) was computed with the GCTA method [Yang, et al. 2011], software version 1.24.4, using 8,906,089 bi-allelic autosomal SNPs in the Latino dataset that meet a minimum MAF threshold. Null quantitative traits were generated from a multivariate normal distribution whose covariance is built from the GRM constructed above and the identity matrix. The heritability parameter was set to 0.3, comparable to recently reported estimates for blood lipids in Hispanics [Coram, et al. 2013]. To assess the accuracy of the parameter estimates used in the statistical tests, the heritability of 250 null phenotypes was estimated with the GCTA method.

Causal Markers

Causal SNPs that contribute deterministically to the phenotypes were selected from the pilot HapMap ENCODE resequencing and genotyping project regions. They consist of ten 500 kb autosomal regions from the ENCODE Project pilot phase [Zhang and Dolan 2008], including a range of chromosomes, recombination rates, gene density, and values of non-transcribed conservation with the mouse genome.

The observed reference allele frequency (RAF) at a marker is defined separately for each ancestral origin population, i.e., for European, Native American, and West African allele origin. The reference allele is one of
the two alleles for a bi-allelic SNP in the dataset and can have allele frequency greater than 0.5. Based on the two ancestral origins with the greatest contribution to the Latino population, the absolute difference in the reference allele frequencies, $f = |p_E - p_N|$, was computed for all SNPs within the ten ENCODE regions using the local ancestries in the Latino dataset. $f$ is equivalent to the absolute difference in the minor allele frequencies (see footnote 4). All highly differentiated SNPs, i.e., those with $f$ above the selection threshold, were selected and additionally pruned for LD to have pairwise squared correlation less than 0.65 with PLINK version 1.07 software [Purcell, et al. 2007]. This procedure resulted in 46 SNPs that were labeled causal and used for phenotype simulations.

Each of these 46 causal SNPs contributes deterministically, one at a time, to a phenotype simulation with 250 replicates. Specifically, the reference allele at the SNP contributes to the phenotype additively. For each replicate, random error was added to the deterministically generated signal from a multivariate normal distribution with mean $0$ and covariance matrix $h^2 K + (1 - h^2) I$. The effect size coefficient is chosen such that the proportion of the phenotypic variance explained by the causal SNP is 1.5%. For 3,500 unrelated individuals, 1.5% phenotypic variance explained by the causal SNP would result in 96.78% power for a direct (additive) test for association at the causal SNP at a nominal alpha level of $5 \times 10^{-8}$ [Yang, et al. 2010b]. The power for the genetically correlated samples in the Latino dataset is expected to be lower. Each phenotype vector was scaled to have unit variance.

Footnote 4: If RAF < 0.5 then MAF = RAF, and if RAF ≥ 0.5 then MAF = 1 - RAF. In the second case the ones cancel in the formula for $f$.
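The quoted power for unrelated individuals can be checked with a short calculation. The sketch below is not taken from the thesis: it assumes the non-centrality parameter of the 1 df association test is $n q^2 / (1 - q^2)$, where $q^2$ is the proportion of phenotypic variance explained, and it uses the normal-scale form of the 1 df chi-squared test. The function name `association_power` is hypothetical.

```python
from statistics import NormalDist


def association_power(n, var_explained, alpha=5e-8):
    """Approximate power of a 1 df association test for n unrelated samples.

    Assumption (not stated in the text): the non-centrality parameter is
    n * q^2 / (1 - q^2), with q^2 the variance explained by the SNP.
    """
    std_normal = NormalDist()
    ncp = n * var_explained / (1.0 - var_explained)
    # Two-sided critical value on the z scale; its square is the
    # chi-squared (1 df) cutoff at level alpha.
    z_crit = -std_normal.inv_cdf(alpha / 2.0)
    shift = ncp ** 0.5
    # P(|Z + shift| > z_crit) for a standard normal Z.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Setting described in the text: 3,500 individuals, 1.5% variance explained.
    print(round(association_power(n=3500, var_explained=0.015), 4))
```

Under these assumptions the result should land close to the 96.78% figure quoted above. The calculation applies to independent samples only, so it does not capture the loss of power expected from the genetic correlation in the Latino dataset.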
Statistical Models

For each replicate, a quantitative trait vector, $y$, is modeled as $y \sim N(\mu, h^2 K + (1 - h^2) I)$, where $K$ here is the empirical GRM derived from the Omni resolution Latino dataset markers (for details see the Genetic Relationship Matrix for Statistical Tests sub-section below). Models differ by their mean function, $\mu$:

(Model 0: Null)
(Model 1: Association)
(Model 2: Heterogeneous Association and/or Admixture)
(Model 3: Heterogeneous Association)
(Model 4: Association and/or Admixture)

Here $X$ holds the known covariates, including an intercept term, $g$ is the number of reference alleles at the marker, and $D_E$ is the number of alleles at the marker with European ancestral origin. All tests are constructed by comparing Models 1-4 against the null model using a likelihood ratio test statistic (LRTS). When the null model is true, the LRTS is asymptotically chi-squared distributed with degrees of freedom (df) equal to the difference between the number of parameters in the competing alternative model and the null model. In particular, the following tests were performed: Association (Model 1 versus Model 0, 1 df), Heterogeneous Association and/or Admixture (Model 2 versus Model 0, 3 df), Heterogeneous Association (Model 3 versus Model
0, 2 df), and Association and/or Admixture (Model 4 versus Model 0, 2 df). Models 2 and 3 allow for heterogeneous association effects based on the ancestral origins of the chromosomal segment in which the genetic variant is located.

Admixture Mapping

Additionally, the following models were fit to allow for tests of admixture compared to the same null model as in the combined tests.

(Model 5: Dosage Admixture)
(Model 6: Dosage Admixture Full)
(Model 7: Full Admixture)

Here $D_N$ is the number of alleles at the marker with Native American ancestral origin, and the last model uses an indicator for the ancestral origins of the two alleles at the marker tested, where the origins are again European, Native American, and West African. The local ancestries at the causal marker were used to perform each of the admixture tests since LA does not change for any individual in the sample for a region around the causal marker. This is expected, as the number of ancestral switches (changes of the ancestral origin of the genotypes across the chromosome) is not high in the Latino dataset, averaging 1 switch per 7.4 million base pairs per person. As a result, the local ancestries for markers at the borders of each of the 500 kb ENCODE regions are quite similar, e.g., the average correlation of the local ancestry at the two borders is 0.987.
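The likelihood ratio tests listed above reduce to comparing twice the log-likelihood difference against a chi-squared distribution with the stated degrees of freedom. The following is a minimal sketch of that step only, using closed-form chi-squared tail probabilities for 1, 2, and 3 df. The function names and the example log-likelihood values are hypothetical, and the admixture tests (Models 5-7) are left out because their degrees of freedom are not all stated in the text.

```python
import math

# Degrees of freedom of the four tests against Model 0, as stated in the text.
TEST_DF = {
    "Association": 1,
    "Heterogeneous Association and/or Admixture": 3,
    "Heterogeneous Association": 2,
    "Association and/or Admixture": 2,
}


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable for df in {1, 2, 3}."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("closed form implemented only for 1, 2, or 3 df")


def lrt_p_value(loglik_alt, loglik_null, test_name):
    """p value of the LRTS 2 * (loglik_alt - loglik_null) for a named test."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return chi2_sf(stat, TEST_DF[test_name])


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods for one marker.
    print(lrt_p_value(-4950.0, -4968.0, "Association"))
    print(lrt_p_value(-4950.0, -4968.0, "Heterogeneous Association and/or Admixture"))
```

The same statistic yields a larger p value when more degrees of freedom are spent, which is the cost of the 2 df and 3 df combined tests when the extra parameters carry no signal.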
Genetic Relationship Matrix for Statistical Tests

The use of GRMs calculated by leaving out the chromosome that is undergoing association testing has been shown to have better power in standard association testing, compared to a GRM based on all markers, while properly controlling for Type I error [Widmer, et al. 2014; Yang, et al. 2014]. The goal of such an approach is to exclude the marker to be tested, as well as markers in LD with the marker to be tested, from the GRM computation. Including the tested markers in the GRM computation results in implicit conditioning on those markers. This phenomenon, referred to as proximal contamination [Listgarten, et al. 2012], leads to a slight loss in power when testing the causal marker. For computational convenience, as an alternative to the leave-one-chromosome-out approach, we excluded the ENCODE regions before computing the GRM. Thus GCTA was used to produce a GRM with all Omni resolution SNPs with MAF ≥ 0.01, excluding all variants within 1 Mb regions centered at each 500 kb ENCODE region. This resulted in a GRM based on 598,069 SNPs. Additionally, all 600,691 SNPs with MAF ≥ 0.01 in the Omni resolution Latino dataset were used to compute a second GRM. The correlation between the heritability estimates obtained with the first GRM and those obtained with the second GRM for 1,000 polygenic phenotypes was calculated, and the power gain from excluding the regions was compared for the standard association test.

Likelihood Ratio Test Statistics

For each model, the standardized phenotypes are distributed as $y \sim N(X\beta, h^2 K + (1 - h^2) I)$, where $X$ includes an intercept term and differs between
models in terms of how genotypes and/or ancestry are coded. The likelihood function for such models is:

$$L(\beta, h^2) = (2\pi)^{-n/2} |V|^{-1/2} \exp\left\{-\tfrac{1}{2}(y - X\beta)^T V^{-1} (y - X\beta)\right\},$$

where $V = h^2 K + (1 - h^2) I$ and $n$ is the number of individuals, and the log-likelihood is:

$$\ell(\beta, h^2) = -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|V| - \tfrac{1}{2}(y - X\beta)^T V^{-1} (y - X\beta).$$

The log-likelihoods for the null model and for the subsequent models were all maximized numerically for the respective parameters. The likelihood ratio test statistic for each test considered was computed as twice the difference between the log-likelihood of the current model and the log-likelihood of the null model. When the null model is true, this statistic has a limiting $\chi^2$ distribution with degrees of freedom equal to the difference between the number of parameters in the full model and in the null model.

In this study, all LMMs were fit with a random effect based on the GRM, using the lmekin function from the coxme R package [Therneau 2015]. The lmekin function directly optimizes the log-likelihood function for the model at hand by calling optim, a general-purpose unconstrained nonlinear optimization routine built into the R statistical package [R Core Team 2013]. Specifically, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton optimization method is employed. Fitting $\beta$ and $h^2$ simultaneously is a non-trivial, non-convex optimization problem and is computationally expensive. To avoid repeated estimation of the heritability parameter, $h^2$, the parameter is estimated once under the null model (Model 0), and that value is then used for the parameter in all other models at all SNPs across the genome. Such an approximation is commonly applied in a LMM
setting, e.g., in the TASSEL, EMMAX, and GCTA methods [Bradbury, et al. 2007; Kang, et al. 2010; Yang, et al. 2011]. Estimating the variance component parameter, $h^2$, even once for each of the 250 replicated datasets for all 46 causal SNPs is still computationally expensive, since the optimization must be started from several points in the parameter space to avoid possible convergence to a local maximum of the log-likelihood function. To speed up the computations, a pre-processing step was used in which the GCTA software produced estimates for the heritability. Such an estimate provided a single starting point in the neighborhood of the global maximum, allowing the optimization algorithm in lmekin to converge faster when producing maximum likelihood estimates for the heritability in the null model. Once the maximum likelihood estimate of $h^2$ is determined for a particular phenotype vector, Models 1-7 are fit for each of the candidate markers tested and each replicated dataset.

Power Calculations

For the association and the combined admixture and association tests at GWAS chip resolution, a significant result was defined by the presence of at least one Omni resolution marker that achieves genome-wide significance at a nominal alpha level of $5 \times 10^{-8}$ within 15 kb of the causal SNP. For the ENCODE regions, a padding of 15 kb around each causal marker guarantees that at least one SNP will have $r^2 > 0.25$ with the causal SNP. If the causal marker used to generate the currently tested phenotype was present at the Omni resolution, this marker was masked. Empirical average statistical power [Yu, et al. 2006] was computed for each
test as the fraction of datasets for which there was at least one significant marker in the 30 kb chromosomal segment around the causal variant over all replicates.

For tests of admixture, empirical average power was computed for each test as the fraction of p values achieving genome-wide significance at the causal markers. Admixture mapping has a lower testing burden and hence a less stringent test-wise threshold; a threshold of this kind has been reported to achieve a genome-wide significance level of 0.05 in a Hispanic cohort [Schick, et al. 2016]. The empirical average powers at the causal markers were also computed for the test of association and for the combined admixture and association tests, thus allowing us to consider any power lost to the higher degrees of freedom of the combined association and admixture tests.

In order to assess the empirical average power of the association test (Model 1 versus Model 0) after imputation, we use the following proxy. We assume that all causal markers in our study are unobserved at Omni resolution and can be imputed with a squared correlation coefficient of $r^2 = 0.925$ between the causal and the imputed marker. Since the non-centrality parameter for a test of association is approximately proportional to the phenotypic variance explained by the marker [Yang, et al. 2010b], the association test statistics at the causal markers are scaled by 0.925. This mimics the statistic for an association test at an unobserved causal variant that has been imputed. For power calculations after imputation, we calculate the fraction of replicate tests which result in an imputed causal marker that achieves significance (or an Omni resolution marker in the 30 kb neighborhood region that achieves significance).
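The imputation proxy amounts to shrinking each 1 df association statistic at the causal marker by the assumed imputation $r^2$ before comparing it with the genome-wide threshold. The sketch below illustrates only that scaling step. It does not include the additional rule that a significant Omni resolution marker in the 30 kb neighborhood also counts as a detection. The function names and the example statistics are hypothetical.

```python
import math

GENOME_WIDE_ALPHA = 5e-8
IMPUTATION_R2 = 0.925  # squared correlation assumed in the text


def imputed_p_value(stat_at_causal, r2=IMPUTATION_R2):
    """p value of a 1 df association statistic after scaling by r2."""
    scaled = max(r2 * stat_at_causal, 0.0)
    # Upper tail of a chi-squared distribution with 1 df.
    return math.erfc(math.sqrt(scaled / 2.0))


def empirical_power(stats, r2=IMPUTATION_R2, alpha=GENOME_WIDE_ALPHA):
    """Fraction of replicate statistics still significant after scaling."""
    hits = sum(imputed_p_value(s, r2) < alpha for s in stats)
    return hits / len(stats)


if __name__ == "__main__":
    # Hypothetical statistics at the causal marker for four replicates.
    replicate_stats = [31.0, 35.0, 10.0, 60.0]
    print(empirical_power(replicate_stats))          # with imputation scaling
    print(empirical_power(replicate_stats, r2=1.0))  # directly observed marker
```

A statistic that only just clears the threshold when the causal marker is observed directly (31.0 in the example) can fall below it after scaling, which is how the proxy produces a lower power for the imputed marker than for the observed one.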
Finally, we also report the empirical power of each test separately for each causal SNP, averaging over 250 phenotype replicates. The correlation between the genotype at the causal marker and the number of European-origin chromosomes at the marker is also reported for causal SNPs that benefit from the LA information.

Controlling for Type I Error

We perform analyses of the purely polygenic traits to determine whether the LMM approach results in correctly calibrated tests under the null hypothesis for both the association test and the combined admixture and association tests. Specifically, both the combined admixture and association tests and the standard association test are applied for 7 of the null phenotypes genome-wide at 146,075 Omni resolution markers, filtered on MAF and pruned for LD with a sliding window of 500 SNPs. The genome-wide test statistics provide an empirical null distribution for each test. Under the assumption that the LMM with empirical GRM appropriately controls for Type I errors, the p values have a uniform distribution. Additionally, genomic control inflation factors, $\lambda$, are calculated for the association test and for each of the combined tests [Devlin and Roeder 1999]. The genomic control inflation factor is defined as the ratio of the median of the observed distribution of the test statistic to the median of the $\chi^2$ distribution with the corresponding degrees of freedom (see Statistical Models). For a test that is well calibrated, $\lambda$ should be approximately 1, regardless of the allele frequency differences in the ancestral populations. Deviations from $\lambda = 1$ indicate an elevated ($\lambda > 1$) or depressed ($\lambda < 1$) Type I error rate, i.e., insufficient adjustment or over-adjustment for the genetic correlation present in the sample, respectively. We compute two additional genomic control inflation factors, $\lambda(Q1)$ and $\lambda(Q3)$, similarly to $\lambda$, but using the first and third quartiles of the observed and expected distributions, respectively.

Differential LD Patterns

In order to further study the characteristics of the variants that benefit from combining association and admixture information, we compute LD stratified by the number of chromosomes with European ancestry (zero, one, or two) in the region of the causal marker. The stratified squared correlations are computed between the causal SNP and all non-causal Omni resolution markers in the region considered, and we compute the maximum differential LD over this set of markers. Differential LD is reported for each causal marker region that achieves higher power at the GWAS chip resolution for a combined test compared to the association test.

Results

Causal Markers

By construction, all causal alleles have different frequencies in the European and Native American ancestral populations, and only common variation in the genome is selected to be causal. For the selected markers, the average MAF is 0.387 (SE 0.075), with minimum, first quartile, median, third quartile, and maximum MAFs being 0.254, 0.333, 0.379, 0.467, and 0.500, respectively. The specific locations of the 46 causal markers, their overall allele frequencies, allele frequencies for chromosomes with European ancestral origin, allele frequencies for
chromosomes with Native American ancestral origin, and allelic differentiation, $f$, are listed in Supplementary Table 4.1 (Appendix A). In the 30 kb regions around each causal SNP, there are on average 9.6 (SE 5.11) Omni SNPs, with minimum, first quartile, median, third quartile, and maximum being 3, 6, 8, 11, and 26, respectively.

Controlling for Type I Error

Quantile-quantile (QQ) plots of observed and expected p values from tests under the null hypothesis are produced. Specifically, the QQ plots for tests of Association, Heterogeneous Association and/or Admixture, Heterogeneous Association, and Association and/or Admixture are displayed in Figure 4.1 and suggest that the resulting p values have a close to uniform distribution for each test. Genomic control inflation factors at the first, second, and third quartiles are close to 1 for the same four tests (Table 4.1). The correlation between the variance component parameters, $h^2$, estimated using the GRM that excludes the ENCODE regions and the estimates produced using the GRM based on all markers, for the same 1,000 polygenic phenotypes, is computed to be 0.999944.

Table 4.1. Genomic control inflation factors under the null hypothesis.

Genomic Control   Association   Heterogeneous Association   Heterogeneous   Association
                                and/or Admixture            Association     and/or Admixture
λ(Q1)             1.003         1.002                       1.001           1.005
λ(Q2)             1.006         1.002                       1.002           1.007
λ(Q3)             1.002         1.002                       1.002           1.003

Genomic control inflation factors are computed for 1,022,525 SNPs.
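The inflation factors in Table 4.1 are ratios of observed to expected quartiles of the test statistic. As an illustration, the sketch below computes such ratios for chi-squared tests with 1, 2, or 3 df using closed-form distribution functions and a bisection search for the expected quantile. The function names are hypothetical, the input is simulated null data rather than the thesis results, and the quartile convention of Python's `statistics.quantiles` may differ slightly from the one used to produce the table.

```python
import math
import random
import statistics


def chi2_cdf(x, df):
    """Chi-squared distribution function for df in {1, 2, 3} (closed forms)."""
    if x <= 0:
        return 0.0
    if df == 1:
        return math.erf(math.sqrt(x / 2.0))
    if df == 2:
        return 1.0 - math.exp(-x / 2.0)
    if df == 3:
        return (math.erf(math.sqrt(x / 2.0))
                - math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("closed form implemented only for 1, 2, or 3 df")


def chi2_quantile(q, df):
    """Quantile of the chi-squared distribution, found by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def inflation_factor(stats, df, quartile=2):
    """Observed / expected quartile of the test statistic (quartile in 1..3)."""
    observed = statistics.quantiles(stats, n=4, method="inclusive")[quartile - 1]
    expected = chi2_quantile(quartile / 4.0, df)
    return observed / expected


if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated 1 df statistics from a perfectly calibrated test.
    null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(20000)]
    for quartile in (1, 2, 3):
        print(quartile, round(inflation_factor(null_stats, df=1, quartile=quartile), 3))
```

For a calibrated test all three ratios should be near 1. Values clearly above 1 at the median would point to insufficient adjustment for genetic correlation.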
Figure 4.1. Quantile-quantile plots of observed and expected p values from tests under the null hypothesis. Results for (a) Association, (b) Heterogeneous Association and/or Admixture, (c) Heterogeneous Association, and (d) Association and/or Admixture tests are displayed. The solid red line represents uniformly distributed p values. The 95% confidence intervals around the solid red line are shaded in gray. Results are based on 1,022,525 SNPs.
Power

At the causal markers (Figure 4.2a), the average power for the association test is 0.7206 (SE 0.0156). For the 2 df tests of Heterogeneous Association (test 3) and Association and/or Admixture (test 4), the average power is 0.6295 (SE 0.0133) and 0.6297 (SE 0.0151), respectively. For the 3 df test of Heterogeneous Association and/or Admixture (test 2), the average power is further reduced to 0.5567 (SE 0.0143).

The average power is substantially lower when performing the tests at Omni resolution (Figure 4.2b), with the power of the same four tests reduced by more than 50%. At Omni resolution, the power is 0.3429 (SE 0.0124) for the association test, 0.2917 (SE 0.0123) for test 3, 0.3030 (SE 0.0126) for test 4, and 0.2650 (SE 0.0124) for test 2. Still, the association test is the most powerful on average. Under the simulation scenario in this work, the three admixture tests are underpowered, with average power 0.0937 (SE 0.0082), 0.1212 (SE 0.0096), and 0.0619 (SE 0.0072) for the Dosage Admixture (test 5), the Dosage Admixture Full (test 6), and the Full Admixture (test 7) tests, respectively. Imputation of the causal SNP followed by association testing results in average power 0.6456 (SE 0.0144).
Figure 4.2. Average power (a) at the causal markers and (b) at GWAS chip resolution. For each test, power is calculated as the proportion of tests that achieve significance over 250 replicated sets for 46 causal SNPs. Error bars represent the standard error around the proportion.
The correlation between European and Native American local ancestry at each causal SNP is on average 0.879 (SE = 0.017). Models that implicitly include the second ancestral component, $D_N$, result in lower power for the combined admixture and/or association tests (results not shown). However, including $D_N$ in Model 6 results in close to a 30% increase of the average power for the 2 df full Dosage Admixture test compared to the 1 df Dosage Admixture test. The Full Admixture approach (test 7) has the lowest power (0.0619) among all of the admixture tests.

The tests at individual SNPs are presented in Supplementary Table 4.2 (Appendix B). For most of the causal SNP regions, 31 out of 46, there is no benefit from incorporating LA information in a test. Under the assumptions of a single causal variant per locus and a causal allele with the same effect regardless of its ancestral origin, the differential ancestral allele frequency alone does not result in higher power for tests of admixture or combined admixture and association compared to the standard test for association.

For 15 out of the 46 causal SNP regions, there is a gain in power at GWAS chip resolution for a combined admixture and/or association test or for an admixture mapping test compared to the association test (Table 4.2). For most of these regions, the empirical power is lower for all tests compared to the 31 regions where association is better powered: the average power of the best test (admixture or combined) is 0.212 among the 15 regions, versus 0.470 among the 31 regions.
Table 4.2. Empirical power for regions that have stronger signal at Omni resolution for the admixture and/or combined tests compared to the association test.

Chr. / SNP No.   Imputation+    Association   Combined              Admixture             r(g, D_E)
                 Association    1             2      3      4      5      6      7
2 / 1            0.66           0.15          0.18   0.25   0.15   0.03   0.01   0.01   0.47
2 / 4            0.50           0.04          0.03   0.02   0.02   0.06   0.08   0.03   0.62
2 / 9            0.71           0.08          0.23   0.13   0.24   0.44   0.35   0.20   0.70
4 / 2            0.58           0.04          0.03   0.01   0.04   0.01   0.14   0.06   0.49
4 / 5            0.56           0.03          0.05   0.04   0.06   0.10   0.32   0.20   0.60
4 / 6            0.76           0.12          0.21   0.24   0.14   0.04   0.06   0.02   0.45
7 / 1            0.72           0.25          0.29   0.22   0.30   0.47   0.41   0.22   0.71
7 / 4            0.74           0.02          0.04   0.01   0.06   0.02   0.05   0.02   0.49
7 / 9            0.62           0.19          0.11   0.14   0.15   0.04   0.22   0.11   0.50
7 / 10           0.54           0.03          0.01   0.02   0.02   0.02   0.20   0.09   0.49
8 / 2            0.62           0.01          0.06   0.01   0.07   0.16   0.12   0.06   0.63
8 / 6            0.61           0.01          0.07   0.00   0.06   0.22   0.15   0.06   0.60
12 / 1           0.64           0.17          0.16   0.12   0.18   0.06   0.05   0.02   0.51
12 / 2           0.59           0.08          0.11   0.06   0.08   0.02   0.02   0.01   0.52
18 / 1           0.71           0.01          0.04   0.02   0.02   0.05   0.10   0.04   0.54

For each test, at each causal marker region, power is calculated as the proportion of tests that achieve significance out of all tests at 250 replicated datasets. Bold font denotes maximum power for the association and combined tests (tests 1-4) and maximum power for the admixture tests (tests 5-7). Darker green shade denotes the best overall result (ignoring the imputation result, which always has the highest power). Lighter green shade denotes proportions that are not significantly different. The correlation coefficient between the genotype, $g$, and European local ancestry, $D_E$, at each causal SNP is reported. The tests are: test 1, Association; test 2, Heterogeneous Association and/or Admixture; test 3, Heterogeneous Association; test 4, Association and/or Admixture; test 5, Dosage Admixture; test 6, Dosage Admixture Full; and test 7, Full Admixture.
For 11 of the 15 regions in Table 4.2, an admixture mapping approach is better powered compared to an association test or a combined test of association and admixture. The correlation between the number of alleles with European local ancestry and the genotype at the causal marker is higher, in mean absolute value, for those eleven markers than for the remaining 4 markers. The
strongest admixture signal is observed at two regions (causal SNP 9 on Chromosome 2 and causal SNP 1 on Chromosome 7) where power is 0.44 and 0.47 and the correlation is 0.70 and 0.71, respectively.

For three causal SNP regions, one of the combined approaches is significantly better powered than association (Table 4.2). For those regions, there is a strong differential LD pattern for the tagging SNPs based on the ancestral origin of the chromosomal segment. The differential LD at those 3 markers (causal SNP 1 on Chromosome 2, causal SNP 6 on Chromosome 4, and causal SNP 2 on Chromosome 12) is 0.5587, 0.7899, and 0.5519, respectively.

Imputation of the causal SNP followed by association testing is the best approach with respect to power, both on average across all 46 causal markers and individually at every marker (Figure 4.2, Supplementary Table 4.2). This approach yields increases of the average power compared to the best powered admixture and combined tests by factors of 5.3 and 2.13, respectively.

Discussion

The derived genomic control values and the produced QQ plots (Figure 4.1) suggest that the LMM approach, without using a fixed effect for local ancestry adjustment, controls Type I error for both the association test and the combined tests. It is worth noting that the use of an empirical GRM in a LMM framework implicitly adjusts for global ancestry. In particular, the top principal components derived from an empirical GRM have been shown to be highly correlated with global ancestry proportions [Thornton, et al. 2014; Zhang and Stram 2014]. For our Latino dataset, the top two principal components, derived from the GRM, explain 95.2%
and 99.4% of the European and Native American global ancestry proportions, respectively. Thus the top 2 principal components are driven predominantly by the population structure as opposed to the relatedness present in the sample.

The correlation between the variance component parameters estimated using the GRM that excludes the ENCODE regions and those estimated using the GRM based on all markers is computed to be 0.999944 for the same 1,000 phenotypes. This suggests that the exclusion of 2,622 markers in and around the ENCODE regions does not substantially change the genetic correlation captured by the GRMs. While the LMM with the GRM that excludes the ENCODE regions yields a well calibrated distribution of the test statistics, the average gain in power observed from excluding the ENCODE regions is modest: 2.1% for the association test at the causal markers. It is suggested in a recent work that models with both fixed and random effects can result in increased power to detect association for markers with highly differentiated ancestry while controlling for Type I error [Conomos, et al. 2015]. Methods that decompose the empirical GRM into random effects and/or fixed effects, and the influence of such decompositions on power and Type I error, are open research questions beyond the scope of this work.

The imputation quality in a different Hispanic American cohort [Liu, et al. 2013a], measured by the mean squared correlation coefficient, $r^2$, between the additively coded genotypes at the actual and the imputed markers, is reported to be 92.5% on average for variants with MAF ≥ 0.01. The imputation in Liu and colleagues' study is based on an Affymetrix® 6.0 dataset with comparable resolution to the GWAS chip resolution used in our study. They use the
1000 Genomes Phase I dataset for a reference panel. The imputation performance is suggested to be significantly better (although the authors do not quantify this) for common variants (MAF ≥ 0.05), and the causal variants selected in our analyses are common by construction. Further, imputation quality should also improve if 1000 Genomes Phase 3 is used as a reference panel, due to the larger number of relevant haplotypes in this newer reference. Thus our assumption (see Methods) that all causal markers in our study can be imputed with a squared correlation coefficient of $r^2 = 0.925$ between the causal and the imputed marker is a conservative one.

In this study, we investigate possible gains in power from incorporating local ancestry into single variant association testing for a polygenic trait with a single causal common variant per locus, and a causal allele with the same effect regardless of its ancestral origin. Our results suggest that, at GWAS chip resolution, there is a limited benefit from incorporating local ancestry in admixture mapping or in combined admixture and association testing, since higher power can be achieved by imputation followed by association testing. Specifically, imputation of the causal marker, followed by association testing, was the best approach with respect to power, both on average across all regions containing causal markers and individually at each region. Imputation followed by an association test yielded increases in average power, compared to the best powered admixture and combined tests, by factors of 5.3 and 2.13, respectively. Further, we show that the standard LMM approach, without local ancestry adjustment through a fixed effect, controls well for Type I error and results in well
calibrated test statistics for both the association test and the combined admixture and association tests.

In our study, at GWAS chip resolution, gains in power from incorporating the effects of local ancestry into association tests were limited to markers with differential LD patterns (of the observed markers with the causal marker) in the ancestral populations and, for admixture mapping, to markers where the origin of the causal allele is predominantly one of the ancestral populations. Greater power gains can be achieved via imputation for these markers. If the origin of an unobserved causal allele is predominantly one of the ancestral populations, admixture mapping can be better powered than association testing to capture the genetic signal. A similar phenomenon was observed in recent work, where significant concordance between the number of Native American ancestral alleles and the reference alleles at a variant in the ACTN1 locus was reported [Schick, et al. 2016]. Clearly, utilizing a local ancestry adjustment in a test, e.g., a test of model 4 versus model 6, will result in a reduced power of the test under such a scenario, because a large part of the association signal will be adjusted for by including the local ancestries as confounding variables in the null model.

We note that our results are obtained under the assumptions of a single causal variant per locus and a causal allele with the same effect irrespective of its ancestral origin. While these are possibly reasonable assumptions, allelic heterogeneity may be expected. Local ancestry captures the haplotype diversity in a region, and including it in a test might be able to tag, for example, gene-gene interaction and synthetic association due to
aggregation of several signals with weaker effects. Thus, given enough resources, admixture and combined admixture and association scans might still be informative, particularly at the discovery stage for a study of a complex trait.

Acknowledgments

Parts of the numerical experiments were performed using the Colibri GPU cluster at the Center for Computational Mathematics, University of Colorado Denver. This infrastructure was obtained through the NSF award "GPU Cluster for Computing Research" (CNS 0958354).
CHAPTER V

CONCLUSIONS AND FUTURE WORK

Recently admixed populations, such as Latinos and African Americans, have higher genetic diversity, as they combine genetic variability from different ancestral populations, and present unique opportunities for mapping the genetic basis of a complex trait. In this dissertation, I discussed different ways to implement combined admixture mapping and association testing, as well as the utility of such an approach in the context of high resolution genetic data obtained via sequencing or imputation.

In Chapter II, I utilized real data to apply admixture mapping, association testing, and a combined admixture and/or association test that allows for a heterogeneous genetic association by fully stratifying the association effect based on the ancestral origin of the marker. No signal was detected by tests for association or admixture. The stratified association and/or admixture test identified a variant associated with diastolic blood pressure (DBP), where significance thresholds were derived by permutation analysis. Interestingly, the effect of the variant on DBP acts in different directions depending on its ancestral origin. An explicit test for heterogeneity among the effects due to genetic association, given different ancestral background in the region, was highly significant. One possible explanation for this result is differential LD with the causal SNP in the ancestral populations. An alternative explanation is allelic heterogeneity, for example due to interaction with another causal allele with different allele frequency in the ancestral populations.
Although high resolution genotypes were available for the Hispanic cohort considered, and I did start from the high density genotypes, the final resolution of my analysis was much lower. The reason for this is that contemporary methods for LA inference require the same common density of the genotype data for the study samples and for the proxy reference panels for the ancestral populations. Additionally, one of the main issues that I encountered in my work was the scarcity of good reference panels for the Native American ancestral component present in Hispanic populations. This scarcity has been suggested to be one of the main obstacles to the adoption of methods that incorporate admixture information [Shriner 2013].

To address this issue, in Chapter III, I produced a high resolution Native American reference dataset. Specifically, a Native American sample without European admixture was obtained, additionally screened for cryptic relatedness, and imputed to sequence resolution to serve as a proxy in local ancestry estimation. To allow simulation of phenotype data from a realistic structure of genetic variation, I produced a sequence resolution, simulated Latino dataset with allele frequencies and LD patterns similar to those of contemporary Latino samples after one additional generation of random mating. The resulting dataset has 25,235,037 non-monomorphic bi-allelic SNPs, a combined size of 3,500 diploid genomes, and local ancestry inferred before the simulation of genotypes. By construction, the Latino dataset is highly structured, with admixture of three ancestral populations, distinct sub-populations, reflective of sub-continental variation, and non-trivial familial-like relatedness within each of the sub-populations.
In Chapter IV, I adopted the standard linear mixed model (LMM) approach for admixture and for combined admixture and association tests. Two main developments that aim to address the issue of missing heritability are (1) the assembly of larger and larger cohorts that are better powered to detect weaker effects, and (2) the increased interest in familial data that is naturally enriched for rare variation. A necessary condition for a method to be widely adopted by the community is that it properly controls for confounding due to genetic correlation, including structure and kinship. My results suggest that this is achieved by the LMM approach for the combined admixture and association tests. Specifically, my simulation studies showed that the standard LMM approach, without local ancestry adjustment through a fixed effect, controls Type I error well in both the association test and the combined admixture and association tests.

Next, using extensive simulations based on the Latino dataset, I investigated possible gains in power from incorporating local ancestry into single-variant association testing for a polygenic trait with a single causal common variant per locus and a causal allele with the same effect regardless of its ancestral origin. For this simulation scenario my results suggest that, at GWAS chip resolution, there is limited benefit from incorporating local ancestry in admixture mapping or in combined admixture and association testing, since higher power can be achieved by the imputation approach. Specifically, imputation of the causal marker, followed by association testing, was the best approach with respect to power, both on average across all regions containing causal markers and individually at each region.
Imputation and association yielded increases in average power over the best-powered admixture and combined tests by factors of 5.3 and 2.13, respectively. Recent methods for local ancestry estimation leverage external reference panels with genetic data relevant to the population; however, imputation procedures also do this.

It remains an open question whether, when dense markers are available via sequencing or imputation, admixture and/or combined admixture and association are still analysis paths worth pursuing. In this work I showed, via simulation, that for a common causal variant with the same effect in each ancestral population, imputation of GWAS chip data followed by standard association testing is the approach better powered to detect association. Additional simulation scenarios that allow for allelic heterogeneity are being considered to assess the utility of a combined approach when high-resolution genotypes are available. These scenarios include ancestry-specific aggregation of signal at a locus or non-additive interactions within a locus. I conjecture that such scenarios can result in admixture and/or combined admixture and association tests being better powered than a scan for association.

For the simulated phenotypes in this work, nominal significance thresholds established in the literature were adopted for the admixture and for the combined tests. For real data, significance thresholds are likely to depend on the generally unknown population history of admixture. Empirical significance thresholds at a genome-wide alpha level of 0.05 can be computed with permutation analysis. To achieve these permutations in a computationally tractable way, decorrelation of the phenotypes under the null hypothesis can be implemented (see Appendix C).
Likewise, the eigendecomposition of the GRM, $K$, can be used to allow efficient maximization of the likelihoods involved in the tests and a substantial speedup of the calculations. Such approaches are implemented in available LMM methods [Listgarten, et al. 2012; Zhou and Stephens 2012], but these do not allow for LA effects to be incorporated in the models. A software tool that implements the combined admixture and association methods in the LMM framework, utilizing the eigendecomposition of the GRM, is ultimately required to enable broader use by the research community of the methods considered in this work.
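The speedup from the eigendecomposition can be sketched as follows. After a single decomposition of the GRM, the phenotype and the fixed-effect design (which here includes hypothetical genotype and local ancestry columns) are rotated once, and the likelihood at each candidate variance ratio then costs only weighted sums. This is a minimal maximum likelihood illustration in the spirit of the cited LMM methods, not the implementation of any of those tools; the function names, the grid search, the parameterization by `delta` (residual variance divided by genetic variance), and the toy data are assumptions.

```python
import numpy as np

def rotate_by_grm(y, X, K):
    """Eigendecompose the symmetric GRM once and rotate phenotype and covariates."""
    d, U = np.linalg.eigh(K)          # K = U @ diag(d) @ U.T
    return d, U.T @ y, U.T @ X

def profile_loglik(delta, d, y_rot, X_rot):
    """Profile log-likelihood of y ~ N(X b, s2 * (K + delta * I)) at a fixed delta."""
    n = y_rot.shape[0]
    w = 1.0 / (d + delta)             # inverse of the diagonal rotated covariance, up to s2
    XtW = X_rot.T * w                 # weights each observation; no n x n solve is needed
    beta = np.linalg.solve(XtW @ X_rot, XtW @ y_rot)
    resid = y_rot - X_rot @ beta
    s2 = np.sum(w * resid ** 2) / n   # maximum likelihood estimate of the genetic variance
    loglik = -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(d + delta))
                     + n * np.log(s2) + n)
    return loglik, beta, s2

# Toy data; every value below is hypothetical.
rng = np.random.default_rng(0)
n = 60
A = rng.normal(size=(n, 200))
K = A @ A.T / 200                     # stand-in for a GRM
genotype = rng.integers(0, 3, size=n).astype(float)
local_ancestry = rng.integers(0, 3, size=n).astype(float)
X = np.column_stack([np.ones(n), genotype, local_ancestry])
y = 0.5 * genotype + rng.multivariate_normal(np.zeros(n), 0.5 * K + 0.5 * np.eye(n))

d, y_rot, X_rot = rotate_by_grm(y, X, K)
grid = np.logspace(-2, 2, 41)         # candidate values of delta
fits = [profile_loglik(delta, d, y_rot, X_rot) for delta in grid]
best = int(np.argmax([fit[0] for fit in fits]))
best_delta = grid[best]
best_loglik, best_beta, best_s2 = fits[best]
```

Because the rotation does not depend on the fixed effects, adding a local ancestry column to the design changes only the small weighted least squares step, which is why incorporating LA effects in this framework seems computationally feasible.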
REFERENCES

Abraham G, Inouye M. 2014. Fast principal component analysis of large-scale genome-wide data. PloS one 9(4):e93766.
Agresti A. 2007. An introduction to categorical data analysis. Hoboken, NJ: Wiley.
Alexander DH, Novembre J, Lange K. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome research 19(9):1655-1664.
Almasy L, Blangero J. 1998. Multipoint quantitative-trait linkage analysis in general pedigrees. The American Journal of Human Genetics 62(5):1198-1211.
Almasy L, Blangero J. 2010. Variance component methods for analysis of complex phenotypes. Cold Spring Harb Protoc 2010(5):pdb.top77.
Almasy L, Dyer TD, Peralta JM, Jun G, Wood AR, Fuchsberger C, Almeida MA, Kent JW, Jr., Fowler S, Blackwell TW and others. 2014. Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc 8(Suppl 1):S2.
Almasy L, Kos MZ, Blangero J. 2015. Linkage Mapping: Localizing the Genes That Shape Human Variation. Genome Mapping and Genomics in Human and Non-Human Primates: Springer. p 33-52.
Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de Bakker PI, Deloukas P, Gabriel SB and others. 2010. Integrating common and rare genetic variation in diverse human populations. Nature 467(7311):52-8.
Astle W, Balding DJ. 2009. Population Structure and Cryptic Relatedness in Genetic Association Studies. Statistical Science 24(4):451-471.
Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, Rodriguez-Cintron W, Chapela R, Ford JG, Avila PC and others. 2012. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics 28(10):1359-67.
Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23(19):2633-2635.
Brown R, Pasaniuc B. 2014. Enhanced Methods for Local Ancestry Assignment in Sequenced Admixed Individuals. PLoS Comput Biol.
Conomos MP, Miller MB, Thornton TA. 2015. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genetic epidemiology 39(4):276-293.
Conomos MP, Reiner AP, Weir BS, Thornton TA. 2016. Model-free Estimation of Recent Genetic Relatedness. Am J Hum Genet 98(1):127-48.
Coram MA, Duan Q, Hoffmann TJ, Thornton T, Knowles JW, Johnson NA, Ochs-Balcom HM, Donlon TA, Martin LW, Eaton CB. 2013. Genome-wide characterization of shared and distinct genetic components that influence blood lipid levels in ethnically diverse human populations. The American Journal of Human Genetics 92(6):904-916.
da Silva TM, Rani MS, de Oliveira Costa GN, Barreto ML, Blanton RE. 2015. Reply to Moura et al. European Journal of Human Genetics.
Delaneau O, Zagury J-F, Marchini J. 2012. Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods 10:5-6.
Devlin B, Roeder K. 1999. Genomic control for association studies. Biometrics 55(4):997-1004.
Epstein MP, Allen AS, Satten GA. 2007. A simple and improved correction for population stratification in case-control studies. Am J Hum Genet 80.
Evangelou E, Trikalinos TA, Salanti G, Ioannidis JPA. 2006. Family-Based versus Unrelated Case-Control Designs for Genetic Associations. PLoS Genet 2(8).
Eyheramendy S, Martinez FI, Manevy F, Vial C, Repetto GM. 2015. Genetic structure characterization of Chileans reflects historical immigration patterns. Nat Commun 6:6472.
Fisher RA. 1919. XV. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the royal society of Edinburgh 52(02):399-433.
Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Yang H, Ch'ang L-Y, Huang W, Liu B, Shen Y. 2003. The international HapMap project. Nature 426(6968):789-796.
Gibson G. 2011. Rare and common variants: twenty arguments. Nat Rev Genet 13(2):135-45.
Gomez F, Wang L, Abel H, Zhang Q, Province MA, Borecki IB. 2015. Admixture mapping of coronary artery calcification in African Americans from the NHLBI family heart study. BMC Genetics 16(1):1.
Gravel S, Zakharia F, Moreno-Estrada A, Byrnes JK, Muzzio M, Rodriguez-Flores JL, Kenny EE, Gignoux CR, Maples BK, Guiblet W and others. 2013. Reconstructing Native American Migrations from Whole-Genome and Whole-Exome Data. PLoS Genet 9(12).
Hayes BJ, Visscher PM, Goddard ME. 2009. Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res (Camb) 91(1):47-60.
Hoffman GE. 2013. Correcting for population structure and kinship using the linear mixed model: theory and extensions. PloS one 8(10):e75707.
Howie B, Marchini J, Stephens M. 2011. Genotype Imputation with Thousands of Genomes. G3 (Bethesda). p 457-70.
International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431(7011):931-945.
Jin W, Li R, Zhou Y, Xu S. 2014. Distribution of ancestral chromosomal segments in admixed genomes and its implications for inferring population history and admixture mapping. European Journal of Human Genetics 22(7):930-937.
Jin W, Wang S, Wang H, Jin L, Xu S. 2012. Exploring population admixture dynamics via empirical and simulated genome-wide distribution of ancestral chromosomal segments. Am J Hum Genet 91(5):849-62.
Johnson NA, Coram MA, Shriver MD, Romieu I, Barsh GS, London SJ, Tang H. 2011. Ancestral components of admixed genomes in a Mexican cohort. PLoS Genet 7(12):e1002410.
Kang HM, Sul JH, Service SK, Zaitlen NA, Kong S, Freimer NB, Sabatti C, Eskin E. 2010. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42(4):348-54.
Kao WHL, Klag MJ, Meoni LA, Reich D, Berthier-Schaad Y, Li M, Coresh J, Patterson N, Tandon A, Powe NR and others. 2008. MYH9 is associated with nondiabetic end-stage renal disease in African Americans. Nature Genetics 40(10).
Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST and others. 2005. Complement factor H polymorphism in age-related macular degeneration. Science 308(5720):385-9.
Kopp JB, Smith MW, Nelson GW, Johnson RC, Freedman BI, Bowden DW, Oleksyk T, McKenzie LM, Kajiyama H, Ahuja TS and others. 2008. MYH9 is a major-effect risk gene for focal segmental glomerulosclerosis. Nat Genet 40.
Lee S, Wu MC, Lin X. 2012. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13(4):762-75.
Lettre G, Palmer CD, Young T, Ejebe KG, Allayee H, Benjamin EJ, Bennett F, Bowden DW, Chakravarti A, Dreisbach A and others. 2011. Genome-Wide Association Study of Coronary Heart Disease and Its Risk Factors in 8,090 African Americans: The NHLBI CARe Project. Plos Genetics 7(2).
Li B, Leal SM. 2008. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83(3):311-21.
Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL and others. 2008. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319(5866):1100-4.
Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. 2011. FaST linear mixed models for genome-wide association studies. Nature methods 8(10):833-835.
Listgarten J, Lippert C, Kadie CM, Davidson RI, Eskin E, Heckerman D. 2012. Improved linear mixed models for genome-wide association studies. Nat Methods 9(6):525-6.
Liu EY, Li M, Wang W, Li Y. 2013a. MaCH-Admix: Genotype Imputation for Admixed Populations. Genetic Epidemiology 37(1):25-37.
Liu J, Lewinger JP, Gilliland FD, Gauderman WJ, Conti DV. 2013b. Confounding and heterogeneity in genetic association studies with admixed populations. American journal of epidemiology 177(4):351-360.
Maher B. 2008. Personal genomes: The case of the missing heritability. Nature News 456(7218):18-21.
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen W-M. 2010. Robust relationship inference in genome-wide association studies.
Manichaikul A, Palmas W, Rodriguez CJ, Peralta CA, Divers J, Guo X, Chen WM, Wong Q, Williams K, Kerr KF and others. 2012. Population structure of Hispanics in the United States: the multi-ethnic study of atherosclerosis. PLoS Genet 8(4):e1002640.
Mao X, Bigham A, Mei R, Gutierrez G, Weiss K, Brutsaert T, Leon-Velarde F, Moore L, Vargas E, McKeigue P and others. 2007. A Genomewide Admixture Mapping Panel for Hispanic/Latino Populations. Am J Hum Genet. p 1171-8.
Maples BK, Gravel S, Kenny EE, Bustamante CD. 2013. RFMix: a discriminative modeling approach for rapid and robust local ancestry inference. Am J Hum Genet 93(2):278-88.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN. 2008. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9(5):356-369.
McKeigue PM. 1998. Mapping genes that underlie ethnic differences in disease risk: Methods for detecting linkage in admixed populations, by conditioning on parental admixture. American Journal of Human Genetics 63(1).
Mendel G. 1865. Experiments in plant hybridization. Verhandlungen des naturforschenden Vereins Brünn. English translation (1996) available online: www.mendelweb.org/Mendel.html (accessed on 1 April 2016).
Morton NE. 1955. Sequential tests for the detection of linkage. Am J Hum Genet 7(3):277-318.
Moura RRd, Balbino VdQ, Crovella S, Brandão LAC. 2015. On the use of Chinese population as a proxy of Amerindian ancestors in genetic admixture studies with Latin American populations. European Journal of Human Genetics 24(3):326-327.
Ni X, Yang X, Guo W, Yuan K, Zhou Y, Ma Z, Xu S. 2016. Length Distribution of Ancestral Tracks under a General Admixture Model and Its Applications in Population History Inference. Sci Rep 6:20048.
Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, Turley P, Chen G-B, Emilsson V, Meddens SFW. 2016. Genome-wide association study identifies 74 loci associated with educational attainment. Nature.
Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WHL, Ruczinski I, Fornage M, Siscovick DS, Zhu X and others. 2011. Enhanced Statistical Tests for GWAS in Admixed Populations: Assessment using African Americans from CARe and a Breast Cancer Consortium. Plos Genetics 7(4).
Price AL, Zaitlen NA, Reich D, Patterson N. 2010. New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11.
Pritchard JK, Donnelly P. 2001. Case-control studies of association in structured or admixed populations. Theoretical population biology 60(3):227-237.
Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. 2000. Association mapping in structured populations. Am J Hum Genet 67(1):170-81.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ and others. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559-75.
Qin H, Morris N, Kang SJ, Li M, Tayo B, Lyon H, Hirschhorn J, Cooper RS, Zhu X. 2010. Interrogating local population structure for fine mapping in genome-wide association studies. Bioinformatics 26.
R Core Team. 2013. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2013.
Raghavan M, Steinrücken M, Harris K, Schiffels S, Rasmussen S, DeGiorgio M, Albrechtsen A, Valdiosera C, Ávila-Arcos MC, Malaspinas AS and others. 2015. Genomic evidence for the Pleistocene and recent population history of Native Americans. Science 349(6250):aab3884.
Reich D, Patterson N, Ramesh V, De Jager PL, McDonald GJ, Tandon A, Choy E, Hu D, Tamraz B, Pawlikowska L and others. 2007. Admixture mapping of an allele affecting interleukin 6 soluble receptor and interleukin 6 levels. American Journal of Human Genetics 80(4).
Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, Lavery T, Kouyoumjian R, Farhadian SF, Ward R and others. 2001. Linkage disequilibrium in the human genome. Nature 411(6834):199-204.
Reiner A, Beleza S, Franceschini N, Auer P, Robinson J, Kooperberg C, Peters U, Tang H. 2012. Genome-wide Association and Population Genetic Analysis of C-Reactive Protein in African American and Hispanic American Women. Am J Hum Genet. p 502-12.
Rife DC. 1954. Populations of hybrid origin as source material for the detection of linkage. Am J Hum Genet 6.
Risch N, Merikangas K. 1996. The future of genetic studies of complex human diseases. Science 273(5281):1516-7.
Rosenberg NA. 2006. Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann Hum Genet 70(Pt 6):841-7.
Schick Ursula M, Jain D, Hodonsky Chani J, Morrison Jean V, Davis James P, Brown L, Sofer T, Conomos Matthew P, Schurmann C, McHugh Caitlin P and others. 2016. Genome-wide Association Study of Platelet Count Identifies Ancestry-Specific Loci in Hispanic/Latino Americans. The American Journal of Human Genetics 98(2):229-242.
Seldin MF, Pasaniuc B, Price AL. 2011. New approaches to disease mapping in admixed populations. Nat Rev Genet 12(8):523-528.
Shriner D. 2013. Overview of Admixture Mapping. Curr Protoc Hum Genet CHAPTER:Unit 1.23.
Shriner D, Adeyemo A, Ramos E, Chen G, Rotimi CN. 2011a. Mapping of disease-associated variants in admixed populations. Genome Biology 12(5):1.
Shriner D, Adeyemo A, Rotimi CN. 2011b. Joint Ancestry and Association Testing in Admixed Individuals. Plos Computational Biology 7(12).
Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. 2014. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15(2):121-132.
Stram DO. 2014. Design, Analysis, and Interpretation of Genome-Wide Association Scans: Springer.
Tang H, Siegmund DO, Johnson NA, Romieu I, London SJ. 2010. Joint Testing of Genotype and Ancestry Association in Admixed Families. Genetic Epidemiology 34(8).
Terwilliger JD, Weiss KM. 1998. Linkage disequilibrium mapping of complex disease: fantasy or reality? Current Opinion in Biotechnology 9(6):578-594.
The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65.
The 1000 Genomes Project Consortium. 2013. Native American Samples Used as a Reference Panel for Local Ancestry Inference of the 1000 Genomes Phase 1 Admixed Populations.
Therneau TM. 2015. coxme: Mixed Effects Cox Models. R package version 2.2-5.
Thompson E. 2011. The Structure of Genetic Linkage Data: From LIPED to 1M SNPs. Human Heredity 71(2):86-96.
Thornton T, Conomos MP, Sverdlov S, Blue EM, Cheung CY, Glazner CG, Lewis SM, Wijsman EM. 2014. Estimating and adjusting for ancestry admixture in statistical methods for relatedness inference, heritability estimation, and association testing. BMC Proceedings 8(1):1-7.
Visscher PM, Hill WG, Wray NR. 2008. Heritability in the genomics era - concepts and misconceptions. Nat Rev Genet 9(4):255-266.
Wang XX, Zhu XF, Qin HZ, Cooper RS, Ewens WJ, Li C, Li MY. 2011. Adjustment for local ancestry in genetic association analysis of admixed populations. Bioinformatics 27(5):670-677.
Watson JD, Crick FH. 1953. Molecular structure of nucleic acids. Nature 171(4356):737-738.
Weir BS, Anderson AD, Hepler AB. 2006. Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet 7(10):771-780.
Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L. 2014. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic acids research 42(D1):D1001-D1006.
Widmer C, Lippert C, Weissbrod O, Fusi N, Kadie C, Davidson R, Listgarten J, Heckerman D. 2014. Further Improvements to Linear Mixed Models for Genome-Wide Association Studies. Scientific Reports, Published online: 12 November 2014; doi:10.1038/srep06874.
Wijsman EM. 2012. The role of large pedigrees in an era of high-throughput sequencing. Hum Genet 131(10):1555-63.
Winkler CA, Nelson GW, Smith MW. 2010. Admixture mapping comes of age. Annu Rev Genomics Hum Genet 11.
Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, Chu AY, Estrada K, Luan Ja, Kutalik Z. 2014. Defining the role of common variation in the genomic and biological architecture of adult human height. Nature genetics 46(11):1173-1186.
Wray NR, Purcell SM, Visscher PM. 2011. Synthetic associations created by rare variants do not explain most GWAS results. PLoS Biol 9(1):e1000579.
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW. 2010a. Common SNPs explain a large proportion of the heritability for human height. Nature genetics 42(7):565-569.
Yang J, Lee SH, Goddard ME, Visscher PM. 2011. GCTA: A Tool for Genome-wide Complex Trait Analysis. Am J Hum Genet. p 76-82.
Yang J, Wray NR, Visscher PM. 2010b. Comparing apples and oranges: equating the power of case-control and quantitative trait association studies. Genet Epidemiol 34(3):254-7.
Yang J, Zaitlen NA, Goddard ME, Visscher PM, Price AL. 2014. Advantages and pitfalls in the application of mixed-model association methods. Nat Genet 46(2):100-6.
Yazbek SN, Buchner DA, Geisinger JM, Burrage LC, Spiezio SH, Zentner GE, Hsieh CW, Scacheri PC, Croniger CM, Nadeau JH. 2011. Deep congenic analysis identifies many strong, context-dependent QTLs, one of which, Slc35b4, regulates obesity and glucose homeostasis. Genome Res 21(7):1065-73.
Yorgov D, Edwards KL, Santorico SA. 2014. Use of admixture and association for detection of quantitative trait loci in the Type 2 Diabetes Genetic Exploration by Next-generation Sequencing in Ethnic Samples (T2D-GENES) study. BMC Proc 8(Suppl 1):S6.
Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB and others. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet 38(2):203-208.
Yuan X, Miller DJ, Zhang J, Herrington D, Wang Y. 2012. An Overview of Population Genetic Data Simulation. J Comput Biol. p 42-54.
Zhang J, Stram DO. 2014. The role of local ancestry adjustment in association studies using admixed populations. Genetic epidemiology 38(6):502-515.
Zhang W, Dolan ME. 2008. Beyond the HapMap Genotypic Data: Prospects of Deep Resequencing Projects. Curr Bioinform 3(3):178.
Zhou X, Stephens M. 2012. Genome-wide efficient mixed-model analysis for association studies. Nat Genet 44(7):821-4.
APPENDIX A

Supplementary Table 4.1

Supplementary Table 4.1. Causal markers and their reference allele frequencies.

Chromosome  SNP No.  Position,       Reference Allele Frequency   Allelic
                     NCBI Build 37   Overall   EU       NA        Differentiation, f
2           1        51667059        0.603     0.477    0.783     0.306
2           2        51751984        0.644     0.771    0.464     0.306
2           3        51815287        0.373     0.501    0.196     0.305
2           4        52006583        0.467     0.634    0.213     0.421
2           5        52011225        0.628     0.509    0.823     0.314
2           6        52073146        0.385     0.501    0.185     0.316
2           7        52105119        0.469     0.649    0.334     0.315
2           8        234722065       0.690     0.546    0.889     0.343
2           9        234930311       0.471     0.659    0.216     0.442
2           10       234939481       0.524     0.400    0.733     0.332
2           11       234950997       0.698     0.554    0.887     0.333
4           1        118261890       0.563     0.706    0.305     0.401
4           2        118297655       0.499     0.606    0.288     0.318
4           3        118429408       0.396     0.487    0.184     0.303
4           4        118481405       0.338     0.448    0.109     0.339
4           5        118503679       0.467     0.642    0.143     0.499
4           6        118520728       0.289     0.406    0.096     0.310
4           7        118547708       0.668     0.530    0.892     0.362
7           1        89881462        0.654     0.831    0.437     0.394
7           2        89893945        0.291     0.430    0.128     0.303
7           3        90046923        0.468     0.648    0.348     0.301
7           4        90120460        0.514     0.371    0.695     0.325
7           5        90142005        0.734     0.857    0.551     0.306
7           6        90166702        0.367     0.536    0.174     0.362
7           7        90232723        0.500     0.644    0.308     0.336
7           8        126673974       0.505     0.349    0.777     0.428
7           9        126717992       0.638     0.504    0.868     0.364
7           10       126835957       0.562     0.426    0.808     0.381
7           11       126849879       0.604     0.479    0.837     0.358
7           12       126863316       0.633     0.527    0.842     0.315
7           13       126926891       0.565     0.452    0.789     0.338
8           1        119153790       0.633     0.487    0.809     0.321
8           2        119157032       0.573     0.380    0.800     0.420
8           3        119237504       0.347     0.209    0.543     0.335
8           4        119241305       0.616     0.394    0.861     0.467
8           5        119250122       0.493     0.648    0.337     0.312
8           6        119303457       0.314     0.488    0.137     0.352
9           1        132067577       0.439     0.607    0.220     0.387
12          1        40383902        0.533     0.396    0.717     0.321
12          2        40558661        0.254     0.123    0.436     0.313
18          1        25496945        0.729     0.847    0.547     0.300
18          2        25506665        0.745     0.866    0.552     0.315
18          3        25528530        0.665     0.793    0.486     0.307
18          4        25540968        0.711     0.566    0.895     0.328
18          5        25926175        0.313     0.163    0.533     0.370
18          6        25958880        0.477     0.643    0.334     0.308

The column labeled Overall gives reference allele frequencies in the sample; the EU column lists allele frequencies for chromosomes with European ancestral origin; the NA column lists allele frequencies for chromosomes with Native American ancestral origin. Allelic differentiation, f, is defined as the absolute difference in the reference allele frequencies given European or Native American ancestral origin of the allele.
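The definition of allelic differentiation in the table note can be checked directly against a few tabulated rows. The short sketch below recomputes f as the absolute difference between the EU and NA frequencies for four rows copied from Supplementary Table 4.1; the function and variable names are illustrative only.

```python
def allelic_differentiation(freq_eu, freq_na):
    """Absolute difference between ancestry-specific reference allele frequencies."""
    return abs(freq_eu - freq_na)

# (chromosome, SNP No.) -> (EU frequency, NA frequency, tabulated f)
rows = {
    ("2", 1): (0.477, 0.783, 0.306),
    ("4", 5): (0.642, 0.143, 0.499),
    ("7", 3): (0.648, 0.348, 0.301),
    ("9", 1): (0.607, 0.220, 0.387),
}

recomputed = {
    key: round(allelic_differentiation(eu, na), 3)
    for key, (eu, na, _) in rows.items()
}
# Gap between the recomputed value and the value printed in the table.
gaps = {key: round(abs(recomputed[key] - rows[key][2]), 3) for key in rows}
```

For three of the four rows the recomputed value matches the table exactly. For chromosome 7, SNP 3, it differs by 0.001, presumably because the tabulated frequencies are themselves rounded to three decimals.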
APPENDIX B

Supplementary Table 4.2

Supplementary Table 4.2. Empirical power at each causal marker.

Chr. / SNP No.  Imputation+   Association   Combined             Admixture
                Association   1             2      3      4      5      6      7
2 / 1           0.66          0.15          0.18   0.25   0.15   0.03   0.01   0.01
2 / 2           0.59          0.30          0.20   0.21   0.25   0.06   0.04   0.02
2 / 3           0.62          0.49          0.32   0.40   0.41   0.02   0.01   0.00
2 / 4           0.50          0.04          0.03   0.02   0.02   0.06   0.08   0.03
2 / 5           0.69          0.69          0.52   0.58   0.59   0.03   0.09   0.05
2 / 6           0.60          0.14          0.12   0.11   0.12   0.01   0.04   0.02
2 / 7           0.58          0.23          0.19   0.17   0.23   0.02   0.05   0.02
2 / 8           0.79          0.26          0.23   0.27   0.24   0.19   0.16   0.08
2 / 9           0.71          0.08          0.23   0.13   0.24   0.44   0.35   0.20
2 / 10          0.73          0.19          0.14   0.13   0.18   0.04   0.08   0.05
2 / 11          0.86          0.86          0.70   0.77   0.78   0.21   0.16   0.06
4 / 1           0.47          0.41          0.28   0.30   0.33   0.07   0.19   0.10
4 / 2           0.58          0.04          0.03   0.01   0.04   0.01   0.14   0.06
4 / 3           0.70          0.51          0.33   0.41   0.40   0.00   0.11   0.04
4 / 4           0.72          0.72          0.57   0.65   0.62   0.02   0.15   0.09
4 / 5           0.56          0.03          0.05   0.04   0.06   0.10   0.32   0.20
4 / 6           0.76          0.12          0.21   0.24   0.14   0.04   0.06   0.02
4 / 7           0.65          0.23          0.14   0.16   0.18   0.04   0.05   0.01
7 / 1           0.72          0.25          0.29   0.22   0.30   0.47   0.41   0.22
7 / 2           0.73          0.45          0.34   0.37   0.39   0.08   0.05   0.03
7 / 3           0.74          0.46          0.30   0.36   0.37   0.02   0.04   0.02
7 / 4           0.74          0.02          0.04   0.01   0.06   0.02   0.05   0.02
7 / 5           0.73          0.73          0.55   0.61   0.63   0.07   0.10   0.06
7 / 6           0.66          0.66          0.50   0.56   0.55   0.24   0.20   0.13
7 / 7           0.74          0.55          0.38   0.46   0.46   0.04   0.07   0.03
7 / 8           0.59          0.59          0.42   0.49   0.49   0.06   0.27   0.16
7 / 9           0.62          0.19          0.11   0.14   0.15   0.04   0.22   0.11
7 / 10          0.54          0.03          0.01   0.02   0.02   0.02   0.20   0.09
7 / 11          0.63          0.63          0.44   0.52   0.53   0.01   0.12   0.05
7 / 12          0.51          0.32          0.26   0.28   0.27   0.00   0.07   0.04
7 / 13          0.60          0.29          0.21   0.27   0.25   0.00   0.12   0.05
8 / 1           0.72          0.04          0.02   0.03   0.04   0.04   0.03   0.01
8 / 2           0.62          0.01          0.06   0.01   0.07   0.16   0.12   0.06
8 / 3           0.69          0.69          0.52   0.62   0.62   0.06   0.08   0.04
8 / 4           0.59          0.42          0.29   0.34   0.35   0.38   0.30   0.16
8 / 5           0.69          0.63          0.45   0.52   0.52   0.05   0.02   0.01
8 / 6           0.61          0.01          0.07   0.00   0.06   0.22   0.15   0.06
9 / 1           0.71          0.67          0.49   0.55   0.55   0.30   0.24   0.15
12 / 1          0.64          0.17          0.16   0.12   0.18   0.06   0.05   0.02
12 / 2          0.59          0.08          0.11   0.06   0.08   0.02   0.02   0.01
18 / 1          0.71          0.01          0.04   0.02   0.02   0.05   0.10   0.04
18 / 2          0.69          0.69          0.52   0.59   0.58   0.07   0.10   0.04
18 / 3          0.68          0.24          0.14   0.18   0.20   0.08   0.07   0.04
18 / 4          0.63          0.44          0.32   0.36   0.36   0.13   0.08   0.03
18 / 5          0.63          0.56          0.40   0.44   0.46   0.21   0.21   0.10
18 / 6          0.60          0.46          0.29   0.37   0.38   0.04   0.05   0.01

For each test, at each causal marker region, power is calculated as the proportion of tests that achieve significance out of all tests at 250 replicates. Bold font denotes maximum power for the association and combined tests (tests 1-4) and maximum power for the admixture tests (tests 5-7). The imputation and association approach has the highest power at each region.
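The power calculation described in the table note is a simple proportion, and with 250 replicates its Monte Carlo uncertainty is easy to quantify. The sketch below shows the calculation for a made-up set of replicate p-values; the threshold of 5e-8 and the count of 165 significant replicates are assumptions chosen for illustration, not values taken from the simulations.

```python
import math

def empirical_power(p_values, threshold):
    """Proportion of replicate tests whose p-value falls below the threshold."""
    hits = sum(1 for p in p_values if p < threshold)
    return hits / len(p_values)

def power_standard_error(power, n_replicates):
    """Binomial standard error of an empirical power estimate."""
    return math.sqrt(power * (1.0 - power) / n_replicates)

# Hypothetical replicate results: 165 significant tests out of 250.
p_values = [1e-9] * 165 + [0.5] * 85
power = empirical_power(p_values, threshold=5e-8)
se = power_standard_error(power, len(p_values))
```

Under these assumptions the estimated power is 0.66 with a standard error of about 0.03, which gives a sense of how large a difference between two entries of the table needs to be before it exceeds simulation noise.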
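APPENDIX C

Decorrelation of Phenotypes under the Null Model

The LMM assumes $\mathbf{y} \sim N(X\beta,\ \sigma_g^2 K + \sigma_e^2 I)$. Since the genetic relationship matrix, $K$, is real symmetric by construction, it is orthogonally diagonalizable. That is, $K = U D U^{T}$, where $U$ is an orthogonal matrix and $D$ is a diagonal matrix. Consider $\tilde{\mathbf{y}} = U^{T}\mathbf{y}$. The covariance matrix for $\tilde{\mathbf{y}}$ is $U^{T}(\sigma_g^2 K + \sigma_e^2 I)U = \sigma_g^2 D + \sigma_e^2 I$, a sum of diagonal matrices and hence a diagonal matrix. Thus $\tilde{\mathbf{y}}$ is decorrelated.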
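The argument above can be verified numerically, and it also suggests how permutations might be carried out. The sketch below writes K for the genetic relationship matrix and U, d for its eigenvectors and eigenvalues (notation assumed here), uses a toy matrix with hypothetical variance components, and confirms that the rotated covariance is diagonal. The helper `permute_decorrelated` is an illustrative assumption: it standardizes the rotated, mean-zero phenotype, reorders it, and maps it back, so that the permuted phenotype keeps the original covariance under the null.

```python
import numpy as np

def rotated_covariance(K, sigma_g2, sigma_e2):
    """Return eigenvalues d, eigenvectors U, and the covariance of U.T @ y."""
    d, U = np.linalg.eigh(K)                       # K = U @ diag(d) @ U.T, U orthogonal
    V = sigma_g2 * K + sigma_e2 * np.eye(K.shape[0])
    return d, U, U.T @ V @ U

def permute_decorrelated(y, K, sigma_g2, sigma_e2, order):
    """Decorrelate and standardize a mean-zero phenotype, reorder it, and map it back."""
    d, U = np.linalg.eigh(K)
    scale = np.sqrt(sigma_g2 * d + sigma_e2)       # standard deviations after rotation
    z = (U.T @ y) / scale                          # uncorrelated, unit variance under the null
    return U @ (z[order] * scale)

# Toy example; all values are hypothetical.
rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, 50))
K = A @ A.T / 50                                   # symmetric stand-in for a GRM
sigma_g2, sigma_e2 = 0.6, 0.4

d, U, V_rot = rotated_covariance(K, sigma_g2, sigma_e2)
off_diagonal = V_rot - np.diag(np.diag(V_rot))     # should vanish up to rounding

y = rng.multivariate_normal(np.zeros(n), sigma_g2 * K + sigma_e2 * np.eye(n))
y_perm = permute_decorrelated(y, K, sigma_g2, sigma_e2, rng.permutation(n))
```

In practice the variance components would be estimated under the null model and the fixed-effect mean removed before the rotation; both steps are omitted here to keep the sketch short.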