NgsRelate: a software tool for estimating pairwise relatedness from next-generation sequencing data.

Thorfinn Sand Korneliussen¹, Ida Moltke².

Abstract

MOTIVATION: Pairwise relatedness estimation is important in many contexts such as disease mapping and population genetics. However, all existing estimation methods are based on called genotypes, which is not ideal for next-generation sequencing (NGS) data of low depth from which genotypes cannot be called with high certainty.
RESULTS: We present a software tool, NgsRelate, for estimating pairwise relatedness from NGS data. It provides maximum likelihood estimates that are based on genotype likelihoods instead of genotypes and thereby takes the inherent uncertainty of the genotypes into account. Using both simulated and real data, we show that NgsRelate provides markedly better estimates for low-depth NGS data than two state-of-the-art genotype-based methods.
AVAILABILITY: NgsRelate is implemented in C++ and is available under the GNU license at www.popgen.dk/software.

Year: 2015 PMID: 26323718 PMCID: PMC4673978 DOI: 10.1093/bioinformatics/btv509

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Estimating how related two individuals are from genetic data plays a key role in several research areas, including medical genetics and population genetics. For example, in medical genetics it is used to exclude closely related individuals from association studies and thereby avoid inflated false positive rates. How related two individuals are is usually described through the concept of identity-by-descent (IBD), i.e. genetic identity due to a recent common ancestor. Historically, several summary statistics have been used, such as the kinship coefficient $\theta$; however, almost all of these statistics can be calculated from $R = (k_0, k_1, k_2)$, where $k_m$ is the fraction of the genome in which the two individuals share $m$ alleles IBD. For example, $\theta = \frac{k_1}{4} + \frac{k_2}{2}$. We will therefore focus on $R$ here. Many estimators for $R$ have been proposed, both method of moments estimators (Purcell et al., 2007; Ritland, 1996) and maximum likelihood (ML) estimators (Thompson, 1975). Common to them all is that they are based on genotype data, and it has been shown that they work well on single nucleotide polymorphism (SNP) chip data. However, next-generation sequencing (NGS) is becoming increasingly common and often NGS data are only of low depth, which means that genotypes can only be called with high uncertainty (O’Rawe et al., 2015). For such data it has been shown that it can be an advantage to take the uncertainty of the genotypes into account by basing statistical methods on so-called genotype likelihoods (GLs) instead of genotypes (Skotte et al., 2013). Motivated by this we developed NgsRelate, an ML method for estimating the pairwise relatedness parameter $R$ from NGS data based on GLs. In the following, we present this method and show that for low-depth NGS data it performs markedly better than two state-of-the-art genotype-based methods.
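As a small illustration of how a summary statistic follows from the relatedness parameter R = (k0, k1, k2), the sketch below computes the kinship coefficient as k1/4 + k2/2. The expected R values listed for each relationship are the standard values for outbred pedigrees, not numbers taken from this paper, and the names `kinship` and `EXPECTED_R` are hypothetical.

```python
def kinship(k0, k1, k2):
    """Kinship coefficient theta computed from R = (k0, k1, k2)."""
    # k0 does not enter the formula; it is kept so a full R can be passed in.
    return k1 / 4 + k2 / 2


# Standard expected IBD-sharing fractions for non-inbred pairs
# (textbook values, assumed here for illustration).
EXPECTED_R = {
    "parent-child": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half-siblings": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}

if __name__ == "__main__":
    for relationship, r in EXPECTED_R.items():
        print(f"{relationship}: R = {r}, theta = {kinship(*r)}")
```

Note that parent-child and full-sibling pairs have the same kinship coefficient but different R, which is one reason to estimate R itself rather than a single summary statistic.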

2 Methods

To estimate $R$ for two non-inbred individuals $i$ and $j$ we use the following probabilistic framework: Let $D^i = (D^i_1, \ldots, D^i_L)$ and $D^j = (D^j_1, \ldots, D^j_L)$ denote the observed NGS data for $i$ and $j$ at $L$ diallelic loci and $G^i = (G^i_1, \ldots, G^i_L)$ and $G^j = (G^j_1, \ldots, G^j_L)$ denote the true unobserved genotypes at the $L$ loci. Further, let $X_l \in \{0, 1, 2\}$ denote the unobserved number of alleles $i$ and $j$ share IBD at locus $l$. Finally, let the two alleles at each locus be denoted A and a and the frequencies of the A alleles be denoted $f = (f_1, \ldots, f_L)$. Then, assuming the loci are independent and that $f$ is known, the likelihood function for $R$ can be written:

$$P(D^i, D^j \mid R) = \prod_{l=1}^{L} \sum_{m \in \{0,1,2\}} P(D^i_l, D^j_l \mid X_l = m)\, P(X_l = m \mid R)$$

with

$$P(X_l = m \mid R) = k_m$$

and

$$P(D^i_l, D^j_l \mid X_l = m) = \sum_{G^i_l,\, G^j_l} P(D^i_l \mid G^i_l)\, P(D^j_l \mid G^j_l)\, P(G^i_l, G^j_l \mid X_l = m).$$

Here $P(D^i_l \mid G^i_l)$ and $P(D^j_l \mid G^j_l)$ are GLs, which can be estimated using ANGSD (Korneliussen et al., 2014), and the genotype probabilities $P(G^i_l, G^j_l \mid X_l = m)$, which depend on $f_l$, are given in Supplementary Tables S1–S2. $f$ and the major and minor alleles can be precalculated from NGS data using ANGSD or from SNP chip data. NgsRelate provides ML estimates of $R$ by finding the value of $R$ that maximizes this likelihood function with an Expectation Maximization algorithm (Supplementary Data). Like all other ML estimators, this estimator is consistent, and we note that this is also true if the assumption of independence between loci is violated, since the function that is optimized then becomes a composite likelihood function. We also note that if the genotypes are known with certainty the GLs will be 0 for all but the true genotype, and in that case the method reduces to the ML method in Choi et al. (2009). In all other cases the uncertainty is taken into account by summing over all possible true genotypes and weighting each according to its GL.
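The estimation procedure described above can be sketched as a generic Expectation Maximization loop over the three IBD states. This is a minimal sketch under stated assumptions, not NgsRelate's actual implementation: the joint genotype probabilities are derived here from Hardy-Weinberg proportions for non-inbred pairs (assumed to correspond to the supplementary tables, which are not reproduced in this text), the update is the standard EM step for mixture proportions, and all function names are hypothetical. Genotypes are coded as the number of A alleles (0, 1 or 2).

```python
from itertools import product


def hwe(g, f):
    """Hardy-Weinberg probability of carrying g copies of allele A (frequency f)."""
    q = 1.0 - f
    return (q * q, 2.0 * f * q, f * f)[g]


def pair_prob(g1, g2, m, f):
    """P(G_i = g1, G_j = g2 | X = m) for two non-inbred individuals (assumed model)."""
    if m == 0:  # no alleles IBD: the two genotypes are independent
        return hwe(g1, f) * hwe(g2, f)
    if m == 2:  # both alleles IBD: the two genotypes must be identical
        return hwe(g1, f) if g1 == g2 else 0.0
    # m == 1: one shared allele s plus one independent allele each (a and b)
    total = 0.0
    for s, a, b in product((0, 1), repeat=3):
        if s + a == g1 and s + b == g2:
            p = 1.0
            for allele in (s, a, b):
                p *= f if allele == 1 else 1.0 - f
            total += p
    return total


def locus_terms(gl_i, gl_j, f):
    """P(D_i, D_j | X = m) for m = 0, 1, 2: sum over all genotype pairs, weighted by GLs."""
    return [
        sum(gl_i[g1] * gl_j[g2] * pair_prob(g1, g2, m, f)
            for g1, g2 in product(range(3), repeat=2))
        for m in range(3)
    ]


def estimate_r(gls_i, gls_j, freqs, max_iter=500, tol=1e-10):
    """EM estimate of R = (k0, k1, k2) from per-locus GLs and allele frequencies.

    Assumes every locus has a non-zero likelihood under at least one IBD state.
    """
    terms = [locus_terms(a, b, f) for a, b, f in zip(gls_i, gls_j, freqs)]
    k = [1.0 / 3, 1.0 / 3, 1.0 / 3]
    for _ in range(max_iter):
        new_k = [0.0, 0.0, 0.0]
        for e in terms:
            denom = sum(k[m] * e[m] for m in range(3))
            for m in range(3):
                # E-step: posterior probability of IBD state m at this locus
                new_k[m] += k[m] * e[m] / denom
        # M-step: average the posteriors over loci
        new_k = [x / len(terms) for x in new_k]
        converged = max(abs(x - y) for x, y in zip(new_k, k)) < tol
        k = new_k
        if converged:
            break
    return k
```

With GLs that are 1 for a single genotype and 0 otherwise, `locus_terms` reduces to a lookup of `pair_prob` for the known genotypes, which mirrors the remark above that the method then reduces to a genotype-based ML estimator.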

3 Results and discussion

To test NgsRelate we used both simulated and real data. We first simulated NGS data for 100 000 diallelic loci from 100 pairs of individuals from each of the relationships: parent–child, full siblings, half-siblings, first cousins and unrelated individuals. To make it possible to assess how NgsRelate’s performance depends on average sequencing depth we simulated such data for five different average depths ranging from low (1, 2 and 4×) over medium (8×) to relatively high depth (16×). From the simulated data we calculated GLs, to which we applied NgsRelate. We also called genotypes based on the maximum GLs and applied the genotype-based ML method from Choi et al. (2009) and PLINK (Purcell et al., 2007) to these called genotypes. See Supplementary Data for details. The simulations showed that all three methods perform well on high-depth data, but that the two genotype-based methods did not provide accurate estimates of $R$ for the related pairs based on low- and medium-depth data (Fig. 1). Further inspection of the results revealed that for all the related pairs these two methods tend to overestimate $k_0$ and thereby make the pairs look less related (Supplementary Figs S1–S5). NgsRelate, on the other hand, performs well on medium- and low-depth data down to 4× (Fig. 1). Even for 2× data it is only slightly biased (Supplementary Figs S1–S5), and for 1× it has large variance yet still performs markedly better than the other two methods (Fig. 1). Hence, the simulations suggest that for low-depth NGS data NgsRelate outperforms the two genotype-based methods.
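To make the contrast between GLs and called genotypes concrete, the following sketch computes GLs from allele read counts and calls the genotype with the maximum GL. It assumes a simple binomial model with a fixed per-read error rate, which is an illustrative choice and not necessarily the GL model used by ANGSD or in the simulations; the function names and the default error rate are hypothetical.

```python
def genotype_likelihoods(n_A, n_a, error=0.01):
    """GLs for genotypes with 0, 1 and 2 copies of allele A.

    n_A and n_a are the numbers of reads showing allele A and allele a.
    Assumed model: each read samples one of the two alleles at random and
    is misread with probability `error` (binomial coefficient omitted,
    since it is the same for all three genotypes).
    """
    gls = []
    for g in (0, 1, 2):
        # Probability that a single read shows allele A given genotype g
        p_A = (g / 2) * (1 - error) + (1 - g / 2) * error
        gls.append(p_A ** n_A * (1 - p_A) ** n_a)
    return gls


def call_genotype(gls):
    """Call the genotype (number of A alleles) with the maximum GL."""
    return max(range(3), key=lambda g: gls[g])


if __name__ == "__main__":
    # One read showing A: the call is always homozygous AA,
    # even though the heterozygote retains a substantial likelihood.
    low_depth = genotype_likelihoods(1, 0)
    print(low_depth, call_genotype(low_depth))

    # Eight reads, four of each allele: the heterozygote is called confidently.
    higher_depth = genotype_likelihoods(4, 4)
    print(higher_depth, call_genotype(higher_depth))
```

Under this model a site covered by a single read can never be called heterozygous, so called genotypes at very low depth understate allele sharing, whereas the GLs still carry the information that a heterozygote remains plausible.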

Fig. 1.

Root mean square deviation (RMSD) between estimated and simulated $R$ for 100 pairs of each combination of four relationship types and five average sequencing depths (1, 2, 4, 8 and 16×; see Supplementary Fig. S5 for results for unrelated pairs). For each combination estimates were obtained with NgsRelate (left), genotype-based ML (middle) and PLINK (right). RMSD will be zero if the estimate is equal to the simulated $R$.

To assess whether this holds true for real data we then applied the three methods to low-depth (∼4×) NGS data from six genomes from the 1000 Genomes Project Consortium (2012). These individuals have also been SNP chip genotyped (International HapMap 3 Consortium, 2010), and six of the pairs have been reported to be related. We applied NgsRelate to GLs calculated from the low-depth NGS data using ANGSD and applied the two other methods to genotypes called from these GLs. To limit the number of genotype calling errors, only data from sites with depth above 2 in both genomes and a minor allele frequency above 0.05 were included in the genotype-based analyses. Next, we estimated $R$ from the high-quality SNP chip genotypes using a state-of-the-art genotype-based method to achieve accurate estimates of $R$, which we used as a proxy for the true values when assessing the NGS data-based estimates. For all six related pairs the estimates from NgsRelate differed markedly less from the ‘true’ values (Fig. 2 and Supplementary Fig. S6), e.g. the difference in $k_0$ ranged from 0.002 to 0.031 for NgsRelate, whereas it ranged from 0.081 to 0.31 for the genotype-based ML estimator and from 0.096 to 0.25 for PLINK. In all cases $k_0$ was overestimated. Note, though, that the opposite was observed for PLINK when we changed the quality filtering of the genotypes (Supplementary Data), suggesting that estimates from the genotype-based methods depend highly on filtering choices. Nevertheless, all the real data results supported the conclusion from the simulations: for low-depth NGS data NgsRelate provides more accurate estimates.

Fig. 2.

RMSD between the estimated and the true $R$ for six pairs of ∼4× genomes. RMSD will be zero if the estimate is equal to the true $R$.

References

1. The estimation of pairwise relationships.

Authors: E A Thompson
Journal: Ann Hum Genet Date: 1975-10 Impact factor: 1.670

2. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025

3. Estimating individual admixture proportions from next generation sequencing data.

Authors: Line Skotte; Thorfinn Sand Korneliussen; Anders Albrechtsen
Journal: Genetics Date: 2013-09-11 Impact factor: 4.562

4. Accounting for uncertainty in DNA sequencing data.

Authors: Jason A O'Rawe; Scott Ferson; Gholson J Lyon
Journal: Trends Genet Date: 2015-01-08 Impact factor: 11.639

5. Integrating common and rare genetic variation in diverse human populations.

Authors: International HapMap 3 Consortium (David M Altshuler; Richard A Gibbs; Leena Peltonen; et al.)
Journal: Nature Date: 2010-09-02 Impact factor: 49.962

6. Case-control association testing in the presence of unknown relationships.

Authors: Yoonha Choi; Ellen M Wijsman; Bruce S Weir
Journal: Genet Epidemiol Date: 2009-12 Impact factor: 2.135

7. ANGSD: Analysis of Next Generation Sequencing Data.

Authors: Thorfinn Sand Korneliussen; Anders Albrechtsen; Rasmus Nielsen
Journal: BMC Bioinformatics Date: 2014-11-25 Impact factor: 3.169

8. An integrated map of genetic variation from 1,092 human genomes.

Authors: 1000 Genomes Project Consortium (Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean)
Journal: Nature Date: 2012-11-01 Impact factor: 49.962
