Literature DB >> 32840564

SPEARS: Standard Performance Evaluation of Ancestral haplotype Reconstruction through Simulation.

Abstract

MOTIVATION: Ancestral haplotype maps provide useful information about genomic variation and insights into biological processes. Reconstructing the descendent haplotype structure of homologous chromosomes, particularly for large numbers of individuals, can help with characterizing the recombination landscape, elucidating genotype-to-phenotype relationships, improving genomic predictions and more. Inferring haplotype maps from sparse genotype data is an efficient approach to whole-genome haplotyping, but this is a non-trivial problem. A standardized approach is needed to validate whether haplotype reconstruction software, conceived population designs and existing data for a given population provides accurate haplotype information for further inference.
RESULTS: We introduce SPEARS, a pipeline for the simulation-based appraisal of genome-wide haplotype maps constructed from sparse genotype data. Using a specified pedigree, the pipeline generates virtual genotypes (known data) with genotyping errors and missing data structure. It then proceeds to mimic analysis in practice, capturing sources of error due to genotyping, imputation and haplotype inference. Standard metrics allow researchers to assess different population designs and which features of haplotype structure or regions of the genome are sufficiently accurate for analysis. Haplotype maps for 1000 outcross progeny from a multi-parent population of maize are used to demonstrate SPEARS. AVAILABILITYAND IMPLEMENTATION: SPEARS, the protocol and suite of scripts, are publicly available under an MIT license at GitHub (https://github.com/maizeatlas/spears). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Species

Year: 2021 PMID： 32840564 PMCID： PMC8097754 DOI： 10.1093/bioinformatics/btaa749

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

The genome is a mosaic of haplotype blocks that capture the evolutionary or breeding history of an individual. Reconstructing ancestral haplotype maps for populations is important for imputing untyped regions of the genome (Davies ), mapping quantitative trait loci (Mott ), investigating the recombination landscape (Morgan ) and inferring the evolutionary history and structure of haplotypes (Aylor ; Gabriel ). Accurately inferring the descendent structure of haplotypes is non-trivial, but several approaches for this have been developed: HAPPY (Mott ); MERLIN (Abecasis ); GAIN (Liu ); DOQTL (Gatti ); R/qtl2 (Broman ). These tools have been shown to perform well, but are mostly limited to specific population types or breeding schemes and can be computationally intensive with complex pedigrees or large numbers of markers. Reconstructing Ancestry Blocks BIT by bit (RABBIT) is a flexible tool that uses a Markovian model to reconstruct ancestral haplotype maps for complex pedigrees involving various mating scenarios (Zheng et al., 2015). Compared to current tools, it has shown the highest accuracy for inbred lines from multi-parent pedigrees [including the mouse CC design (Churchill ) and Arabidopsis thaliana MAGIC design (Kover )]. Here, tailored for RABBIT but extensible for other tools, we present a pipeline for the Standard Performance Evaluation of Ancestral haplotype Reconstruction through Simulation (SPEARS). SPEARS is designed to determine expectations for the accuracy of haplotype reconstruction software applied to user-defined populations. As proof-of-concept and a new demonstration of RABBIT, we develop a detailed picture on the variation in accuracy of genome-wide haplotype maps for progeny from a multi-parent population of maize.

2 Materials and methods

2.1 Simulation data

This study introduces SPEARS (Fig. 1) which incorporates SAEGUS (https://github.com/maizeatlas/saegus) as a genome simulator to create known data. Custom R (R Core Team, 2017) scripts are used for data processing, metric calculations and graphical output. For proof-of-concept, we generated a virtual multi-parent outcross population of 1000 progeny (Supplementary Fig. S1). Real genotype data on inbred line parents of the population were used to initiate the simulation using 47 078 markers combined from genotyping-by-sequencing (GBS; Manching ) and the MaizeSNP50 BeadChip (Ganal ); filtered to remove residual heterozygous sites in the parents for compatibility with MaCH imputation (Li ). We also examined the impact of using fewer markers (23 584 GBS markers).

Fig. 1.

Overview of SPEARS. (A) Steps in the pipeline, file inputs/outputs and standard metrics. This begins with the creation of a simulated population (n virtual genotypes) using SAEGUS based on a user-provided population design, genetic map and parental genotype data. Genotyping errors and missing data are induced in the simulated data, which is then processed via MaCH (imputation) and RABBIT (ancestral haplotype reconstruction). Highlighted in gray are the main steps of the pipeline and files used to compute SPEARS metrics. (B) The figure portrays diplotypes (phased genotypes) for a genomic segment to describe how each summary statistic is calculated. Each of the three colors represents a distinct parent-of-origin. Heterozygous genotypes are shown in white text. The top diplotype represents known data that would be generated by simulation. The bottom diplotype represents inferred data that would be generated by RABBIT as part of the SPEARS pipeline. The formulas and variables used to calculate each metric are shown. CCC is computed as the correlation between known and inferred crossover events for all individuals and is therefore not shown for the example diplotype. The example shows five crossovers for the known individual and four crossovers for the inferred individual In SPEARS, to mimic analysis in practice, a user-defined global error rate is specified to induce genotyping errors and locus-specific missingness rates are specified to induce missing data for the virtual genotypes. SPEARS then uses MaCH for imputation (Li ), prior to reconstructing haplotype maps with RABBIT. For our use case, a global genotyping error rate of 0.006 was used based on trio analysis in prior work (Manching ). Missingness rates were also based on real data for the population that was simulated (column ‘F_MISS’ in Dataset S1). SPEARS reports the frequency of genotyping error and missing data per locus that is realized for the virtual population (e.g., see Supplementary Fig. S2 for the example population). Following imputation of the virtual genotypes, markers with imputation accuracy were filtered, resulting in 46 633 total markers (23 584 GBS markers) retained for RABBIT (Zheng ).

2.2 Reconstruction and evaluation of ancestral haplotype maps

To build haplotype maps for the simulated population the joint model of RABBIT is used to assign an optimal Viterbi path using the ‘origViterbiDecoding’ algorithm (Zheng ).

Based on comparing known and inferred data per individual (Supplementary Fig. S3), SPEARS uses four metrics to assess different features of haplotype maps (Fig. 1B): (i) Ancestral Assignment Accuracy (AAA); (ii) Genotype Assignment Accuracy (GAA); (iii) Phase Assignment Accuracy (PAA); and (iv) Correlation between Crossover Counts (CCC). AAA is calculated as the proportion of markers that have the correct parent assigned on each homologue (given as a percentage). To calculate GAA, genotypes are assigned based on the inferred parent-of-origin and corresponding parental genotype data used as input, from which the proportion of genotype matches are calculated (given as a percentage). To assess phasing accuracy, PAA is calculated as the proportion heterozygous sites whose phase is correctly inferred relative to the previous heterozygous site (Lin ) and is calculated under the assumption that there are no genotyping errors and only among markers with correctly inferred genotype scores. AAA, GAA and PAA are averaged across all samples. The CCC is calculated as the correlation coefficient for the total number of crossovers across both homologues of all chromosomes per individual. Equations for each metric are shown in Figure 1B. For additional analysis, we also calculated parent certainty as the difference between the posterior probabilities of the two most likely parents at each marker (obtained from the ‘origPosteriorDecoding’ algorithm within RABBIT).

3 Results

SPEARS is designed to assess expectations for the accuracy of ancestral haplotype maps reconstructed with multiple software tools (imputation using MaCH and haplotype inference using RABBIT) for user-defined populations and genotype data (simulation using SAEGUS). Comparing known (simulated) and inferred (reconstructed) haplotype maps to compute the accuracy of different metrics (Fig. 1), SPEARS facilitates both genome-wide and regional appraisal of reconstructed haplotypes. Summary metrics reported by SPEARS are semi-independent (Supplementary Table S1) and describe separate features of common interest for haplotype analysis. For demonstration, a use case was processed based on a real multi-parent population with prior estimates of genotyping error and missing data structure. Overall, SPEARS showed that highly accurate genome-wide haplotype maps could be generated from sparse genotype data ( marker per 50 kb) on an admixed non-inbred population (Supplementary Table S2). The average of genome-wide AAA per sample was 97.0%. Genomic regions with lower accuracy (minimum: 79.5%) showed decreases in parent certainty alone or in combination with a lower density of markers (Supplementary Figs S4 and S5), indicating identity-by-state among the parents and low marker density, but neither RABBIT nor the Viterbi algorithm per se, were main sources of error in the inference of parent-of-origin for haplotype blocks. Given the ancestral origin inferred at a marker locus, the corresponding parent genotype data is used to score the genotype for the inferred haplotype, to compute GAA. If the wrong parent-of-origin is inferred at a marker, but that parent shares the same genotype of the correct parent, GAA will be higher than AAA. We observed this for our use case (Supplementary Table S2), indicating that shared parental haplotypes contributed to inferring the incorrect parent-of-origin at 2.3% of the markers on average. RABBIT performed very well in haplotype phasing with an average PAA of 99.4% across all samples. There was a high-positive CCC (r = 0.87, P < 2.2e-16) (Supplementary Fig. S6). However, the number of crossover counts per sample was downward biased with an average of 260 ± 16 versus 227 ± 14 for known and inferred results, respectively. SPEARS can be used to guide decision making. For instance, genotyping platforms vary in cost and result in different marker densities, error rates and missingness structure. For the example population, reducing the marker density by half had essentially no effect on the quality of haplotype reconstruction (Supplementary Table S2). Genotyping error rates of 0.006 and 0.06 also showed little impact on the overall performance of haplotype reconstruction; however, for an error rate of 0.25 substantial reductions in performance were observed for AAA, GAA and CCC but not PAA (Supplementary Table S3).

4. Conclusion

Reconstruction of ancestral haplotypes from genomic data is useful for a number of applications. The use case presented here demonstrates how SPEARS estimates expectations for accuracy of the reconstruction process to guide investigators on the analysis of haplotype structure in multi-parent populations. It enables exploration of study designs before or after creating one. It can also be used to determine if certain features of haplotype data should be included/excluded in a study based on the accuracy of a corresponding metric, and the expectations can be reported. Moreover, one can assess whether specific chromosomes or regions of the genome, but not others, are sufficiently accurate for downstream analysis. Finally, as an extension, the genome simulator in SPEARS can model genetic architectures in order to test downstream analysis based on ancestral haplotype maps. Click here for additional data file.

16 in total

1. Merlin--rapid analysis of dense genetic maps using sparse gene flow trees.

Authors: Gonçalo R Abecasis; Stacey S Cherny; William O Cookson; Lon R Cardon
Journal: Nat Genet Date: 2001-12-03 Impact factor: 38.330

2. The structure of haplotype blocks in the human genome.

Authors: Stacey B Gabriel; Stephen F Schaffner; Huy Nguyen; Jamie M Moore; Jessica Roy; Brendan Blumenstiel; John Higgins; Matthew DeFelice; Amy Lochner; Maura Faggart; Shau Neen Liu-Cordero; Charles Rotimi; Adebowale Adeyemo; Richard Cooper; Ryk Ward; Eric S Lander; Mark J Daly; David Altshuler
Journal: Science Date: 2002-05-23 Impact factor: 47.728

3. The Collaborative Cross, a community resource for the genetic analysis of complex traits.

Authors: Gary A Churchill; David C Airey; Hooman Allayee; Joe M Angel; Alan D Attie; Jackson Beatty; William D Beavis; John K Belknap; Beth Bennett; Wade Berrettini; Andre Bleich; Molly Bogue; Karl W Broman; Kari J Buck; Ed Buckler; Margit Burmeister; Elissa J Chesler; James M Cheverud; Steven Clapcote; Melloni N Cook; Roger D Cox; John C Crabbe; Wim E Crusio; Ariel Darvasi; Christian F Deschepper; R W Doerge; Charles R Farber; Jiri Forejt; Daniel Gaile; Steven J Garlow; Hartmut Geiger; Howard Gershenfeld; Terry Gordon; Jing Gu; Weikuan Gu; Gerald de Haan; Nancy L Hayes; Craig Heller; Heinz Himmelbauer; Robert Hitzemann; Kent Hunter; Hui-Chen Hsu; Fuad A Iraqi; Boris Ivandic; Howard J Jacob; Ritsert C Jansen; Karl J Jepsen; Dabney K Johnson; Thomas E Johnson; Gerd Kempermann; Christina Kendziorski; Malak Kotb; R Frank Kooy; Bastien Llamas; Frank Lammert; Jean-Michel Lassalle; Pedro R Lowenstein; Lu Lu; Aldons Lusis; Kenneth F Manly; Ralph Marcucio; Doug Matthews; Juan F Medrano; Darla R Miller; Guy Mittleman; Beverly A Mock; Jeffrey S Mogil; Xavier Montagutelli; Grant Morahan; David G Morris; Richard Mott; Joseph H Nadeau; Hiroki Nagase; Richard S Nowakowski; Bruce F O'Hara; Alexander V Osadchuk; Grier P Page; Beverly Paigen; Kenneth Paigen; Abraham A Palmer; Huei-Ju Pan; Leena Peltonen-Palotie; Jeremy Peirce; Daniel Pomp; Michal Pravenec; Daniel R Prows; Zhonghua Qi; Roger H Reeves; John Roder; Glenn D Rosen; Eric E Schadt; Leonard C Schalkwyk; Ze'ev Seltzer; Kazuhiro Shimomura; Siming Shou; Mikko J Sillanpää; Linda D Siracusa; Hans-Willem Snoeck; Jimmy L Spearow; Karen Svenson; Lisa M Tarantino; David Threadgill; Linda A Toth; William Valdar; Fernando Pardo-Manuel de Villena; Craig Warden; Steve Whatley; Robert W Williams; Tim Wiltshire; Nengjun Yi; Dabao Zhang; Min Zhang; Fei Zou
Journal: Nat Genet Date: 2004-11 Impact factor: 38.330

4. Genetic analysis of complex traits in the emerging Collaborative Cross.

Authors: David L Aylor; William Valdar; Wendy Foulds-Mathes; Ryan J Buus; Ricardo A Verdugo; Ralph S Baric; Martin T Ferris; Jeff A Frelinger; Mark Heise; Matt B Frieman; Lisa E Gralinski; Timothy A Bell; John D Didion; Kunjie Hua; Derrick L Nehrenberg; Christine L Powell; Jill Steigerwalt; Yuying Xie; Samir N P Kelada; Francis S Collins; Ivana V Yang; David A Schwartz; Lisa A Branstetter; Elissa J Chesler; Darla R Miller; Jason Spence; Eric Yi Liu; Leonard McMillan; Abhishek Sarkar; Jeremy Wang; Wei Wang; Qi Zhang; Karl W Broman; Ron Korstanje; Caroline Durrant; Richard Mott; Fuad A Iraqi; Daniel Pomp; David Threadgill; Fernando Pardo-Manuel de Villena; Gary A Churchill
Journal: Genome Res Date: 2011-03-15 Impact factor: 9.043

5. Structural Variation Shapes the Landscape of Recombination in Mouse.

Authors: Andrew P Morgan; Daniel M Gatti; Maya L Najarian; Thomas M Keane; Raymond J Galante; Allan I Pack; Richard Mott; Gary A Churchill; Fernando Pardo-Manuel de Villena
Journal: Genetics Date: 2017-06 Impact factor: 4.562

6. A large maize (Zea mays L.) SNP genotyping array: development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome.

Authors: Martin W Ganal; Gregor Durstewitz; Andreas Polley; Aurélie Bérard; Edward S Buckler; Alain Charcosset; Joseph D Clarke; Eva-Maria Graner; Mark Hansen; Johann Joets; Marie-Christine Le Paslier; Michael D McMullen; Pierre Montalent; Mark Rose; Chris-Carolin Schön; Qi Sun; Hildrun Walter; Olivier C Martin; Matthieu Falque
Journal: PLoS One Date: 2011-12-08 Impact factor: 3.240

7. A Multiparent Advanced Generation Inter-Cross to fine-map quantitative traits in Arabidopsis thaliana.

Authors: Paula X Kover; William Valdar; Joseph Trakalo; Nora Scarcelli; Ian M Ehrenreich; Michael D Purugganan; Caroline Durrant; Richard Mott
Journal: PLoS Genet Date: 2009-07-10 Impact factor: 5.917

8. Quantitative trait locus mapping methods for diversity outbred mice.

Authors: Daniel M Gatti; Karen L Svenson; Andrey Shabalin; Long-Yang Wu; William Valdar; Petr Simecek; Neal Goodwin; Riyan Cheng; Daniel Pomp; Abraham Palmer; Elissa J Chesler; Karl W Broman; Gary A Churchill
Journal: G3 (Bethesda) Date: 2014-09-18 Impact factor: 3.154

9. Phased Genotyping-by-Sequencing Enhances Analysis of Genetic Diversity and Reveals Divergent Copy Number Variants in Maize.

Authors: Heather Manching; Subhajit Sengupta; Keith R Hopper; Shawn W Polson; Yuan Ji; Randall J Wisser
Journal: G3 (Bethesda) Date: 2017-07-05 Impact factor: 3.154

10. Rapid genotype imputation from sequence without reference panels.

Authors: Simon Myers; Richard Mott; Robert W Davies; Jonathan Flint
Journal: Nat Genet Date: 2016-07-04 Impact factor: 38.330