Literature DB >> 34711884

TKGWV2: an ancient DNA relatedness pipeline for ultra-low coverage whole genome shotgun data.

Daniel M Fernandes^1,2, Olivia Cheronet³, Pere Gelabert³, Ron Pinhasi³.

Abstract

Estimation of genetically related individuals is playing an increasingly important role in the ancient DNA field. In recent years, the numbers of sequenced individuals from single sites have been increasing, reflecting a growing interest in understanding the familial and social organisation of ancient populations. Although a few different methods have been specifically developed for ancient DNA, namely to tackle issues such as low-coverage homozygous data, they require a 0.1-1× minimum average genomic coverage per analysed pair of individuals. Here we present an updated version of a method that enables estimates of 1st and 2nd-degrees of relatedness with as little as 0.026× average coverage, or around 18,000 SNPs from 1.3 million aligned reads per sample with average length of 62 bp-four times less data than 0.1× coverage at similar read lengths. By using simulated data to estimate false positive error rates, we further show that a threshold even as low as 0.012×, or around 4000 SNPs from 600,000 reads, will always show 1st-degree relationships as related. Lastly, by applying this method to published data, we are able to identify previously undocumented relationships using individuals that had been excluded from prior kinship analysis due to their very low coverage. This methodological improvement has the potential to enable relatedness estimation on ancient whole genome shotgun data during routine low-coverage screening, and therefore improve project management when decisions need to be made on which individuals are to be further sequenced.

Entities: Chemical

Mesh：

Substances：
DNA, Ancient

Year: 2021 PMID： 34711884 PMCID： PMC8553948 DOI： 10.1038/s41598-021-00581-3

Source DB: PubMed Journal: Sci Rep ISSN： 2045-2322 Impact factor: 4.379

Introduction

The estimation of genetic relatives in ancient DNA (aDNA) research has become an integral part of any study involving individuals from the same site or region. The challenging nature of aDNA has led researchers in recent years to adapt traditional relatedness estimation methods, such as those present in PLINK[1] and KING[2], to datasets of ancient individuals for which diploid and high-quality data is not available. This has resulted in the development of a few different methods, using pseudo-haploid data and/or genotype likelihoods, that have been shown to work with genomic coverages as low as between 0.1 and 1×[3-7]. Inferring these relationships is essential, not only for addressing questions related to social and familial organisation of ancient groups, but also as a quality control step for population-based analyses, where an excess of very close relatives may introduce biases that reflect higher rates of allele sharing among individuals than would be expected among non-relatives, and, therefore, do not represent the real genetic variation of the whole population[8]. Specifically, one individual from a pair of 1st-degree relatives is routinely excluded from analyses requiring the grouping of individuals to avoid these biases[9], which leads to a potential very limited use of otherwise good data from the excluded individual. With this in mind, individual aDNA projects would be able to better structure their workflow and budget if kinship relationships could be estimated at early stages of the research plan—such as during ultra-low-coverage screening, which is a cheap and effective way to evaluate the quantity and quality of data that can be expected from an ancient individual. For projects where the analysis of kinship and related individuals is the main focus, researchers would be able to concentrate their resources on particular individuals targeted for the specific research questions. As aDNA laboratories routinely screen their samples using ultra-low-coverage sequencing data before deciding which samples or individuals are to be further sequenced into the desired depths, having information on kinship during this phase could be an advantage. However, the most widely used relatedness estimation methods/software in aDNA research typically require substantially larger amounts of data than that which originates in routine screening runs, which typically yield only low amounts of data. For example, a 0.1× genomic coverage threshold has been used in the past in software such as NgsRelate[6] and READ[3], in order to confidently estimate 1st and 2nd-degree relationships. Another commonly used software, lcMLkin, in turn requires between 0.5 and 1× coverage to identify 1st and 2nd-degree relatives, or at least 10,000–20,000 common single nucleotide polymorphisms (SNPs), after quality control[5]. Here, we present TKGWV2 (“Thomas Kent Genome-Wide Variants 2”), an update to a method published in 2017[4] that, by using genome-wide variants instead of variant sets commonly used in aDNA research, such as the 1240K Capture or Affymetrix Human Origins arrays, increases the amount of potentially available data for the method’s relatedness estimator from 1,240,000 or 600,000 to over 22,000,000 non-fixed biallelic variants present in the 1000 Genomes Project Phase 3[10]—a 18–37 times gain, respectively. As a result, the amount of whole genome shotgun sequencing data required to estimate at least 1st and 2nd-degree relationships is reduced by at least ~ 4 times, potentially allowing the estimation of close relatedness between ancient individuals during routine screening stages.

Results

Method’s pipeline description

TKGWV2 can be publicly accessed at https://github.com/danimfernandes/tkgwv2. Compared to the method published in 2017, TKGWV2 includes: an embedded relatedness estimator, instead of relying on external software such as SPAGeDI[11] to calculate the coefficients; a substantially optimized, user-friendly, and automated pipeline that also accepts large genome-wide SNP datasets; several helper scripts to help users prepare their input data. The described improvements allow TKGWV2 to handle much larger amounts of data than the 2017 method, at a fraction of the time, and with a completely remodelled pipeline that is much more user-friendly. To run TKGWV2 the user is required to provide three types of files: aligned individual *.BAM files, a list of non-fixed biallelic SNPs for genotype calling, and the population allele frequencies of those same SNPs. These are then processed in three consecutive steps. Genotype calling and conversion of pileup format files into individual pseudo-haploid PLINK text files. Identification of overlapping variants per pairs of individuals, extraction of corresponding allele frequencies from the provided frequencies file, and creation of transposed PLINK text files for the next step. Pairwise relatedness estimation (Rxy). We use the Queller and Goodnight’s[12] estimator as implemented in SPAGeDI[11] and based on the formula described by Oliehoek et al.[13]: The pairwise Rxy is calculated by averaging the result of rxy and ryx (where “x” and “y” are the individuals) for each loci. I represents the identity of the alleles ab (individual x) and cd (individual y), whereas p represents the allele frequencies of a and b. If the input data is already in a genotype dataset format such as PLINK, TKGWV2 can be started from point 2 and will only require the allele frequencies file. The only quality control step required for the selection of SNPs is to include only non-fixed biallelic variants. Exclusion of low frequency variants and correction for linkage disequilibrium are not required, and instead may in fact lead to a higher number of incorrect estimates when modern population frequencies are used on ancient individuals (Supplementary Table 1). This also means that an overall higher number of SNPs is available for the estimator. Three relatedness classes are evaluated: 1st-degree, 2nd-degree, and unrelated. However, due to the use of pseudo-haploid data, as commonly seen in very low coverage ancient shotgun data, the resulting relatedness coefficients (r) are half of the original expected values, and therefore we refer to them as “halved relatedness coefficients” (HRC). The three equidistant relatedness classes used in our method have an average HRC of 0.25 for 1st-degree relatives, 0.125 for 2nd-degree relatives, and 0 for unrelated individuals, and following their normal distribution ranges in real populations, we set hard thresholds for each class’s range of values at the mid-points between classes—0.1875 between 1st- and 2nd-degree relationships, and 0.0625 for the latter and unrelated.

Application to known relatives from the 1000 Genomes Phase 1

To validate this new approach to the method presented in Fernandes et al.[4], we used publicly available data from the 1000 Genomes Phase 1, which includes individuals with known relationships[10]. The downloadable VCF data for ~ 38 million SNPs and indels was converted into PLINK format, and only non-fixed biallelic SNPs were used during analysis. We chose the Southern Han Chinese (CHS) population as it includes ten known 1st- or 2nd-degree relationships shared between 13 individuals. All remaining 4940 pairwise relationships are reported as unrelated. We started by subsampling the 100 CHS individuals at eight fractions between 0.5 and 10% of the complete 38 million variant coverage. We did not specify a minimum number of used SNPs in order to test the total variation of the estimates, and each SNP was converted to random pseudo-haploid genotypes. In the fraction corresponding to a subsampling of 2%, which averaged 4059 SNPs used per test, all ten known relationships described in Phase 1 of the 1000 Genomes Project were correctly identified and assigned the correct degree of relatedness (Fig. 1). However, in this fraction a few unrelated individuals also crossed the threshold for 2nd-degree of relatedness at 0.0625 HRC. These potential false positives seem to be corrected from the next fraction onwards (3% subsampling fraction, average 9137 SNPs), and so all expected related and unrelated relationships are correctly assigned. Only the supposedly unrelated pair of individuals HG00475-HG00542 consistently shows an HRC within the 2nd-degree class, and although they were not detected or declared as relatives in the 1000 Genomes project, previously published research has already described them as 2nd-degree relatives[14]. These results suggest that, for the CHS population, the error rates are likely to be very low starting from around 10,000 shared SNPs.

Figure 1

Top 30 results ordered by HRC of all 4950 pairwise tests for the 100 individuals from the CHS population, for the eight subsampling fractions between 0.5 and 10%. Results are independently ordered for each fraction, so x-axis order is not expected to match. Known relationships are shown as filled symbols, and each different fraction as a specific colour. All triangles (second degree) are expected to fall within the lighter gray area, and all circles (first degree) within the darker gray area. The allele frequencies from the CHS population were used.

Application to published ancient DNA data

We next applied this approach to an ancient shotgun dataset from Schroeder et al.[15] (ENA accession number PRJEB28451), composed of 15 Late Neolithic individuals from a Globular Amphora mass grave from the site of Koszyce, in present-day Poland. The unusually high number of genetic relatives of up to 3rd-degree detected at this site (n = 55) make it a great candidate for testing our new pipeline. However, since 3rd-degree relationships (0.0625 average HRC) are expected to be represented by major overlaps with the unrelated and 2nd-degree relative classes, TKGWV2 only assesses up to 2nd-degree relationships. For this reason, we did not consider the 11 pairs of 3rd-degree relatives identified by Schroeder et al.[15] as part of our pass/fail assessments. We used 1000 Genomes Phase 3 modern day allele frequencies from 503 individuals with European ancestry[10], restricted to ~ 22 million non-fixed biallelic SNPs. This time, instead of randomly subsampling each individual’s genotypes, we investigated genomic coverage thresholds, for the application of this method to shotgun sequenced individuals, by subsampling the BAM files of the 15 Koszyce individuals to specific numbers of aligned and non-duplicated reads: 3,000,000, 2,500,000, 2,000,000, 1,600,000, 1,300,000, 1,000,000, 800,000, 600,000, 400,000, and 200,000 (Supplementary Table 2). The average read length for these individuals’ data was 62 bp. At 1,300,000 aligned reads per sample (corresponding to about 0.026× genomic coverage and an average of 18,364 SNPs shared per pairwise test), all unrelated individuals were assigned to the correct relatedness class (HRC below 0.0625), only two out of 56 2nd-degree relative pairs were incorrectly assigned with a HRC below 0.0625 (individuals 11–9 and 14–4), and only one pair of 1st-degree relative pairs (individuals 10–11) had a HRC under the 0.1875 threshold for 1st-degree relatives (0.1553) (Fig. 2, Supplementary Table 3). In the next higher subset, at 1,600,000 aligned reads (about 0.032× genomic coverage and an average of 27,589 SNPs shared), again all unrelated individuals were correctly assigned, while two 2nd-degree relative pairs had a HRC out of their range (individuals 11–9 with 0.0547 and individuals 10–15 with 0.1909). All 1st-degree relative pairs were correctly identified and had a HRC above 0.1875 (Fig. 2, Supplementary Table 3).

Figure 2

Coefficients of the 44 relationships for 1st- and 2nd-degree relatives from the Neolithic site of Koszyce[15]. Each vertical set of 4 coloured points corresponds to one pair of individuals tested with different numbers of aligned reads, according to the legend. Relationships to the left of the solid vertical line are known to be 1st-degree, as per Schroeder et al.[15], and relationships to the right of the line are expected to be 2nd-degree. Grey areas define 1st- and 2nd-degree range intervals. These results suggest that from around 18,000 SNPs (1,300,000 reads, or 0.026× coverage) the rate of incorrectly assigned relationships is ~ 3.2%, at 3 out of 94 pairwise tests. However, this number was consistent even up to around 93,000 SNPs and almost always involved individuals 5, 10, 11, and 15, whose pairwise HRCs were always in the region between 1st- and 2nd-degrees. In their original publication, these individuals were identified as possible half-siblings who shared a biological father, and possibly different mothers that could be related to each other to varying degrees, due to sharing the same mitochondrial haplogroup, which could explain slightly elevated 2nd-degree HRC values. Since TKGWV2 does not assess 3rd-degree relationships, these will show up as either 2nd-degree or unrelated. On Supplementary Table 3 we show that from 18,000 SNPs used, the 11 3rd-degree relationships identified by Schroeder et al.[15] range between − 0.0095 and 0.0931 HRC (0.0517 average over 55 tests). Conservatively, then, 2nd-degree relationships between 0.0625 and ~ 0.090 may instead be referred to as “2nd- or 3rd-degree” to cover this possibility. Although we argue that 1,300,000 reads (or 0.026× coverage) could be used as a minimum threshold for the application of this method with very low error rates, we also noticed that from around 1,000,000 reads (11,000 SNPs or 0.020× coverage) all true unrelated individuals had an HRC below 0.0625, indicating that this lower threshold can possibly be used to avoid false positive related pairs, as any pair above 0.0625 would necessarily share a 1st- or 2nd-degree relationship. This would agree with the CHS results shown in Fig. 1, where at the 3 and 4% fractions (9137 and 16,244 SNPs) all unrelated individuals also began to have a HRC below 0.0625. On the other hand, false negatives at this 0.020× coverage are possible, as at least four 2nd-degree pairs from Koszyce had a HRC below 0.0625 (Supplementary Table 3). Lastly, by looking at even lower coverages, we noticed that true 1st-degree relatives showed as either 1st- or 2nd-degree from around 4,000 SNPs (600,000 reads or 0.012× coverage), with only a single pair of individuals below 0.1875 (individuals 10 and 11—0.1578 HRC) (Fig. 2; Supplementary Table 3). To extend the method to individuals from different time periods, and therefore different ancestries and allele frequencies, we also applied TKGWV2 at the conservative 1,300,000 reads threshold to other published shotgun sequenced relatives, identifying the published degrees of relatedness with a 100% success rate (Table 1). However, to demonstrate the application of our pipeline to ultra low coverages, we also included some individuals from these publications that were not included in their original kinship analysis because of coverages below the defined thresholds. For the Church burial of the Faroe Islands in Margaryan et al.[16] we included individuals VK239 and VK248 in the analysis, with coverages below the publication’s threshold of 0.1x (0.027× and 0.076×, respectively), and identified a new 2nd-degree relationship for VK239–VK237 (Table 1). In the case of Saag et al.[17], for Toomani (Muuksi) we included individuals X05, X06, and X07, who have coverages below their threshold of 0.03× (0.0290×, 0.0048×, and 0.0059×, respectively). From Lastekangur (Rebala) we included X13 (0.0147×). Overall, we were able to confidently identify a new pair of 1st-degree relatives in Toomani (V14-X05, 0.2595 HRC, 14,374 SNPs), although seven other relationships were also possibly identified (Table 1; in Supplementary Figure 1 we show a putative family tree for X13, X14, and V16, using this new data). Due to the lower coverage of the individuals involved in these relationships, the number of used SNPs ranged from 2854 to 8007, and therefore the degree of uncertainty and error rates are higher. We therefore used simulated data based on the allele frequencies of the population to attach posterior probabilities to each estimate—a process we describe in the section below.

Table 1

Application of the method to other published ancient individuals.

Publication	Individual ID 1	Individual ID 2	Associated period	Published degree estimate	Our method’s HRC	SNPs used	Our degree estimate (posterior probability)
Fernandes et al. (2017)[4]	N44	N45	Chalcolithic	2nd	0.0962	14,457	2nd
Brace et al. (2019)[26]	SB492A3	SB493A2	Neolithic	1st	0.2644	25,805	1st
Saag et al. (2019)[17]	V14	X05	Bronze Age	–	0.2595	14,374	1st
	V14	X06	Bronze Age	–	0.1891	2854	1st (0.546) or 2nd (0.454)
	V14	X07	Bronze Age	–	0.0838	3400	2nd (0.867) or unrelated (0.132)
	V16	X07	Bronze Age	–	0.0675	3202	2nd (0.599) or unrelated (0.401)
	V16	X13	Bronze Age	–	0.1223	7972	Likely 2nd (1.000)
	V16	X14	Bronze Age	2nd	0.1448	12,580	2nd
	X05	X06	Bronze Age	–	0.2739	2836	Likely 1st (0.998)
	X05	X13	Bronze Age	–	0.0667	8007	2nd (0.686) or Unrelated (0.314)
	X13	X14	Bronze Age	–	0.2446	7346	Likely 1st (1.000)
Margaryan et al. (2020)[16]	VK234	VK236	Viking Age	1st	0.2173	22,702	1st
	VK236	VK25	Viking Age	1st	0.2163	22,659	1st
	VK234	VK25	Viking Age	1st	0.2071	22,753	1st
	VK245	VK45	Viking Age	1st	0.1878	17,196	1st
	VK237	VK239	Viking Age	–	0.1265	18,032	2nd
	VK236	VK242	Viking Age	2nd	0.1214	20,349	2nd
	VK236	VK238	Viking Age	2nd	0.1208	21,321	2nd
	VK238	VK242	Viking Age	2nd	0.1188	19,694	2nd
	VK240	VK245	Viking Age	2nd	0.1178	21,463	2nd
	VK240	VK45	Viking Age	2nd	0.1175	17,850	2nd
	VK25	VK44	Viking Age	2nd	0.1155	21,530	2nd
	VK236	VK44	Viking Age	2nd	0.1125	21,293	2nd
	VK234	VK44	Viking Age	2nd	0.1113	21,216	2nd
	VK238	VK44	Viking Age	2nd	0.1108	20,458	2nd
	VK242	VK44	Viking Age	2nd	0.1065	19,489	2nd
	VK242	VK25	Viking Age	2nd	0.0970	20,903	2nd
	VK245	VK46	Viking Age	2nd	0.0895	21,696	2nd
	VK238	VK25	Viking Age	2nd	0.0813	21,947	2nd
	VK45	VK46	Viking Age	2nd	0.0805	19,269	2nd
	VK234	VK238	Viking Age	2nd	0.0759	21,797	2nd
	VK234	VK242	Viking Age	2nd	0.0687	20,617	2nd

BAM files were downloaded from the European Nucleotide Database, subsampled to a maximum of 1,300,000 reads, and then processed through our pipeline. The allele frequencies used were from individuals with European ancestry in the 1000 Genomes Phase 3 dataset. Estimates in bold are based on less than 10,000 SNPs, and therefore include some degree of uncertainty. For these, we present the posterior probabilities of each degree between parentheses, and on Supplementary Figure 2 we show the corresponding simulated range plots for these pairs.

Application of the method to other published ancient individuals. BAM files were downloaded from the European Nucleotide Database, subsampled to a maximum of 1,300,000 reads, and then processed through our pipeline. The allele frequencies used were from individuals with European ancestry in the 1000 Genomes Phase 3 dataset. Estimates in bold are based on less than 10,000 SNPs, and therefore include some degree of uncertainty. For these, we present the posterior probabilities of each degree between parentheses, and on Supplementary Figure 2 we show the corresponding simulated range plots for these pairs. Lastly, we investigated how the application of the same modern day European allele frequencies influenced the relatedness estimates of known Western Hunter-Gatherer relatives, considering the substantial ancestry differences when compared to the individuals just analysed. We used two pairs of 1st-degree relatives from the Iron Gates Mesolithic population[18], and also included one extra individual (I4877) that shared no relationship with the other four. Although the obtained HRC values for the unrelated pairs were well above the expected values (average 0.1149), and the 1st-degree relatives had an average HRC of 0.3131, which is just outside the expected 1st-degree range (0.1875–0.3125), we noticed that by applying a normalization step based on the average HRC of the known unrelated pairs of individuals, all HRC estimates are corrected to their expected ranges (Supplementary Table 4). This is a similar approach to the one applied by READ and Olalde et al.[7], and, on a case-by-case basis, might be a solution for when non-optimal allele frequencies are used.

Simulated distribution ranges and class overlaps

In the tests above we used hard thresholds as a simplified way to bin the different coefficients into each class, and for the CHS population we saw that from around 10,000 SNPs all expected 1st-degree relatives, 2nd-degree relatives, and unrelated relationships were estimated correctly, whereas for the ancient population from around 18,000 SNPs a maximum of three relationships were incorrectly assigned. However, since the number of known relationships in any population is small, including within the 1000 Genomes Project, simulations are useful to better quantify false positive rates from distribution ranges and class overlaps, and to calculate posterior probabilities for each HRC estimate. Moreover, although 1st- and 2nd-degree relatives are theoretically expected to share 50% and 25% of their variation (or 0.25 and 0.125 for HRC), respectively, the observed and estimated coefficients will be influenced by the reference population panel used and the background relatedness between all pairs of individuals within it[19]. Therefore, simulated distribution ranges are useful to represent this variability within the population. As described in Fernandes et al.[4], we first simulated 500 pairs of individuals of the unrelated, 1st-degree, and 2nd-degree classes, for an increasing range of randomly subset SNPs (Fig. 3). For the subsets with the lower numbers of SNPs used (for example, 1000 and 5000), the overlap between the three HRC equidistant classes’ ranges is substantial, however, from between 10,000 and 20,000 SNPs the normal distributions tend towards having no overlaps (Fig. 3a). This lack of overlaps is caused by the larger SNP datasets including more variability that is only captured when larger population sizes are simulated. Considering this, by increasing the number of simulated individuals from 500 to 5000 we induce an expansion of the normal curves, and again see overlapping ranges between all classes (Fig. 3b). As posterior probabilities cannot be assigned to relatedness estimates that fall on areas not covered by any curve, having overlapping ranges is essential, and as seen, the more SNPs used, the larger the number of simulated pairs required to fully cover this range. Assigning posterior probabilities to each class is especially useful when using sub-optimal SNP counts below the thresholds mentioned previously, as the resulting estimates will have higher uncertainty and larger overlaps.

Figure 3

Coefficient distribution ranges for (a) 500 and (b) 5000 pairs of simulated individuals using different numbers of SNPs, based on Phase 1 CHS allele frequencies, demonstrating overlaps and the correction of the curves towards the hard thresholds between classes on the higher SNP numbers. (c) False positive rates, as identified by simulated relationships crossing the thresholds between classes. From 30,000 SNPs no overlap was obtained, even with up to 5000 simulated pairs of individuals, although higher numbers of simulated pairs would eventually produce an overlap with error rates further tending towards 0. Increasing the numbers of simulated pairs to induce an expansion of the normal curves and produce overlaps when using large SNP numbers can quickly become very computationally intensive using our pipeline, but our results show that the rates of incorrect estimates based on overlapping ranges are limited to the number of SNPs used and not the number of simulated pairs. In Fig. 3c we show that these false positive rates stay mostly constant with both 500 and 5000 simulated pairs. By looking at the 2nd-degree curves, which can overlap with both the unrelated and 1st-degree curves, the error rate is on average 47.90% (47.07–49.40%, depending on how many pairs were simulated) with 1000 SNPs, meaning that almost half of the simulated 2nd-degree pairs have their HRC overlapping any of the other classes (Fig. 3a/b, Supplementary Table 5). However, these rates are reduced to averages of 12.83% on 5000 SNPs, to 2.81% on 10,000 SNPs, to 0.49% on 15,000 SNPs, and to 0.06% on 20,000 SNPs. For the latter, only ten out of 5000 simulated 2nd-degree pairs had a HRC overlapping the other classes’ ranges. Overall, these simulation results are in agreement with the estimates presented for the CHS population in Fig. 1, and suggest that using between 15,000 and 25,000 SNPs will likely result in error rates well below 1%. Although different population frequencies are unlikely to produce substantially different error rates on genome-wide shotgun SNP sets, as we report similar error rates using allele frequencies from the 1000 Genomes’ British in England and Scotland (GBR) population (Supplementary Table 6), the user is advised to calculate dataset-specific error rates based on simulations using a provided script. Lastly, the observation in the subsampling experiment of the Koszyce individuals that true 1st-degree relatives were always assigned as 1st- or 2nd-degree relatives based on as few as 4000 SNPs also finds support in the simulated data. Here, at 5000 SNPs, the lowest HRC for 1st-degree relatives was 0.095, among all 7100 pairs of simulated individuals, and only five times the HRC was below 0.125 (Supplementary Table 5).

Comparison with other relatedness assessment methods used for ancient DNA

We compared two relatedness methods commonly used with low-coverage aDNA data to our method, mainly to evaluate if they were able to produce correct results at similar coverages. The first method we compared was lcMLkin[5], which allows the possibility to detect up to 3rd–5th degree relatives[20,21]. For each individual from Koszyce subsampled to the same 1,300,000 and 3,000,000 reads used for testing our method above, we used bcftools to generate genotype likelihoods for ~ 22,000,000 variants of the 1000 Genomes Phase 3. However, the number of SNPs overlapping each pair of individuals was always below 300, which is much lower than the minimum suggested amount of 10,000 SNPs in the software’s manual. The second method we compared was READ, which can detect up to 2nd-degree relatives with coverages as low as 0.1×[3], although it has also been successfully applied to individuals with 0.03× coverage[17]. Again, we used the same 1,300,000 and 3,000,000 read sets, and, following READ’s manual, we excluded non-polymorphic and low frequency variants by using –maf 0.01 in PLINK[1]. At the lowest threshold and with default settings, READ produces a “S-like” curve on its graphical output file for the 105 pairwise relationships, which is indicative of different relationships (Supplementary Figure 3a). However, all individuals are classified as unrelated and the error bars indicate substantial relatedness class overlaps and high uncertainty. Using user-specified normalisation methods did not help produce a larger spread of estimates, and even with the “max” option only two individuals are classified as 2nd-degree (Supplementary Figure 3a). The non-normalised P0 values confirm that there is not a completely clear separation of individuals based on their expected relationship, especially in regards to the unrelated and 2nd-degree classes (Supplementary Table 7). Using the 3,000,000 reads threshold there is an apparent better separation but there is still a lack of enough variation to be able to properly identify relatives, with the “max” normalisation setting detecting some 2nd-degree relatives but still unable to detect any 1st-degree relatives (Supplementary Figure 3b). We tried to understand if these results were caused by lack of data or by the use of the genome-wide SNP dataset. To do so, we extracted the Koszyce individuals genotyped to the variants in the 1240K dataset[18] from the Allen Ancient DNA Resource v44.3 (https://reich.hms.harvard.edu/, accessed on 20.03.2021) and randomly downsampled them to have similar numbers of overlapping SNPs as when we used the 1,300,000 reads and our genome-wide dataset. This corresponded to around 3300 SNPs per pair. By using SNPs from the 1240K dataset READ is in fact able to correctly assigned all 12 1st-degree relationships, as well as 23 of the 32 2nd-degrees (although here 11 had high uncertainty reflected in values of |Z| < 1) (Supplementary Figure 3c, Supplementary Table 7). This suggests that READ can accurately predict a high number of correct relationships with only ~ 3300 SNPs from the 1240K dataset, although mostly for the 1st-degree class, and that the use of a curated dataset based on informative SNPs such as this one (as opposed to a genome-wide SNP set) can have a strong influence on the results and the minimum amount of data needed for relatedness estimation. Indeed, in Supplementary Figure 4 we show that TKGWV2 only requires around 3000 SNPs to achieve average error rates below 1% when the 1240K dataset is used. However, for samples obtained from genome-wide shotgun data, using the 1240K dataset over a genome-wide dataset to assess kinship is disadvantageous, as the amount of data needed to obtain similarly low error rates (1% with around 3000 used 1240K SNPs) would necessarily translate into minimum shotgun coverages closer to the more traditional threshold of 0.1×.

Discussion

We present an updated pipeline for a methodology used to identify the remains of Thomas Kent, an Irish nationalist executed after the Irish Easter Rising, in 1916[4]. The pipeline uses Python, R, and Bash, for a more streamlined and faster process, especially when multiple pairs of individuals are analysed simultaneously. The 105 pairwise relationships between the 15 Koszyce individuals subset to 1,300,000 SNPs ran in 61 min, averaging ~ 35 s per pair, on an Intel Core i7-7700K processor. More importantly, by using genome-wide variants instead of pre-defined SNP sets, this updated pipeline is able to assess kinship relationships with very low error rates using genomic coverages as low as around 0.02×. However, as shown during its application to a set of ancient individuals from Saag et al.[17], specifically for the pair of individuals X05–X06, it can also be applied for a genetic relatives pair where one individual has even lower coverage (as low as 0.005×–X06) and the other individual has a substantially higher coverage (for example, 0.03×–X05). In these situations, although the resulting estimates can have some uncertainty due to the small number of common SNPs used, when interpreted together with simulated population sets and false positive/negative rates, they can be a helpful tool in project planning during screening sequencing runs, by informing decisions on whether to further sequence some individuals within a dataset. With DNA sequencing costs decreasing[22], aDNA data growing exponentially, and research projects increasingly focusing on analysing the social organisation of populations from specific regions and/or local cemeteries[15,20,21,23-25], a method for detecting kinship from whole genome shotgun screening data with ultra low coverage is valuable, and can translate into larger numbers of relationships detected. This is true even for individuals that would not be expected to yield sufficient data for population genetic analysis, or for which the required sequencing power would require substantially more funds. Furthermore, as it is common practice to exclude 1st-degree relatives from population genetic analyses that are not focused on local organisation and structure, the detection of such pairs of individuals during initial screening steps can allow researchers to further sequence only the best of the two individuals, and therefore save sequencing power. Our results suggest that from as low as 600,000 aligned reads (0.012× coverage) it is possible to detect 1st-degree relatives with very low false positive rates. Lastly, although we showed that in non-optimal situations a normalization step based on known unrelated individuals might be able to provide correct results, by default, TKGWV2 relies on population frequencies calculated from an external population, and requires no normalisation steps based on unrelated individuals from the population being tested (or genetically similar)[3,7]. This means that TKGWV2 can solve relationships when a single pair of individuals is given as input, or even simultaneously between individuals belonging to different populations and different time periods, producing results always within the same expected intervals. We show that, for example, using allele frequency data from modern European populations allowed us to correctly estimate all relationships tested from ancient individuals from that region, potentially allowing the method to be used in any other ancient population around the world where there is substantial DNA coverage from the modern populations they contributed substantial amounts of ancestry to, even if only a single pair of ancient individuals exists. Additionally, we show that this modern population data does not necessarily need to fully match the genetic structure of the ancient individuals being tested, which is the case for admixed populations, as we produced accurate relatedness estimation results for populations spanning 5000 years—from Neolithic individuals composed mainly of Early European Farmer and Western Hunter-Gatherer ancestry[4,15,26], to Bronze Age and Medieval individuals with substantial amounts of Steppe-related ancestry, e.g.[16,17]. Analyses such as ancestry clusterization, genetic distances, and f-statistics might provide useful insights into which modern populations are the most appropriate to use for allele frequencies on a case-by-case basis. An overall caveat of using modern population frequencies on ancient individuals, however, is that, when applying this approach to any ancient population from a previously untested region or period, a confirmation analysis on known ancient related individuals needs to be performed using frequencies from the closest modern populations available. This can be challenging for regions or periods for which there is a lack of published ancient relatives. Nevertheless, with the ongoing exponential increase in availability of both modern and aDNA data around the world, our method has the potential to be applied to the great majority of situations encountered by palaeogeneticists. Supplementary Figures. Supplementary Tables.

21 in total

1. Estimating relatedness between individuals in general populations with a focus on their use in conservation programs.

Authors: Pieter A Oliehoek; Jack J Windig; Johan A M van Arendonk; Piter Bijma
Journal: Genetics Date: 2006-03-01 Impact factor: 4.562

2. Effects of sampling close relatives on some elementary population genetics analyses.

Authors: Jinliang Wang
Journal: Mol Ecol Resour Date: 2017-09-18 Impact factor: 7.090

3. ESTIMATING RELATEDNESS USING GENETIC MARKERS.

Authors: David C Queller; Keith F Goodnight
Journal: Evolution Date: 1989-03 Impact factor: 3.694

4. Kinship-based social inequality in Bronze Age Europe.

Authors: Alissa Mittnik; Ken Massy; Corina Knipper; Fabian Wittenborn; Ronny Friedrich; Saskia Pfrengle; Marta Burri; Nadine Carlichi-Witjes; Heidi Deeg; Anja Furtwängler; Michaela Harbeck; Kristin von Heyking; Catharina Kociumaka; Isil Kucukkalipci; Susanne Lindauer; Stephanie Metz; Anja Staskiewicz; Andreas Thiel; Joachim Wahl; Wolfgang Haak; Ernst Pernicka; Stephan Schiffels; Philipp W Stockhammer; Johannes Krause
Journal: Science Date: 2019-10-10 Impact factor: 47.728

5. The genomic history of the Iberian Peninsula over the past 8000 years.

Authors: Iñigo Olalde; Swapan Mallick; Nick Patterson; Nadin Rohland; Vanessa Villalba-Mouco; Marina Silva; Katharina Dulias; Ceiridwen J Edwards; Francesca Gandini; Maria Pala; Pedro Soares; Manuel Ferrando-Bernal; Nicole Adamski; Nasreen Broomandkhoshbacht; Olivia Cheronet; Brendan J Culleton; Daniel Fernandes; Ann Marie Lawson; Matthew Mah; Jonas Oppenheimer; Kristin Stewardson; Zhao Zhang; Juan Manuel Jiménez Arenas; Isidro Jorge Toro Moyano; Domingo C Salazar-García; Pere Castanyer; Marta Santos; Joaquim Tremoleda; Marina Lozano; Pablo García Borja; Javier Fernández-Eraso; José Antonio Mujika-Alustiza; Cecilio Barroso; Francisco J Bermúdez; Enrique Viguera Mínguez; Josep Burch; Neus Coromina; David Vivó; Artur Cebrià; Josep Maria Fullola; Oreto García-Puchol; Juan Ignacio Morales; F Xavier Oms; Tona Majó; Josep Maria Vergès; Antònia Díaz-Carvajal; Imma Ollich-Castanyer; F Javier López-Cachero; Ana Maria Silva; Carmen Alonso-Fernández; Germán Delibes de Castro; Javier Jiménez Echevarría; Adolfo Moreno-Márquez; Guillermo Pascual Berlanga; Pablo Ramos-García; José Ramos-Muñoz; Eduardo Vijande Vila; Gustau Aguilella Arzo; Ángel Esparza Arroyo; Katina T Lillios; Jennifer Mack; Javier Velasco-Vázquez; Anna Waterman; Luis Benítez de Lugo Enrich; María Benito Sánchez; Bibiana Agustí; Ferran Codina; Gabriel de Prado; Almudena Estalrrich; Álvaro Fernández Flores; Clive Finlayson; Geraldine Finlayson; Stewart Finlayson; Francisco Giles-Guzmán; Antonio Rosas; Virginia Barciela González; Gabriel García Atiénzar; Mauro S Hernández Pérez; Armando Llanos; Yolanda Carrión Marco; Isabel Collado Beneyto; David López-Serrano; Mario Sanz Tormo; António C Valera; Concepción Blasco; Corina Liesau; Patricia Ríos; Joan Daura; María Jesús de Pedro Michó; Agustín A Diez-Castillo; Raúl Flores Fernández; Joan Francès Farré; Rafael Garrido-Pena; Victor S Gonçalves; Elisa Guerra-Doce; Ana Mercedes Herrero-Corral; Joaquim Juan-Cabanilles; Daniel López-Reyes; Sarah B McClure; Marta Merino Pérez; Arturo Oliver Foix; Montserrat Sanz Borràs; Ana Catarina Sousa; Julio Manuel Vidal Encinas; Douglas J Kennett; Martin B Richards; Kurt Werner Alt; Wolfgang Haak; Ron Pinhasi; Carles Lalueza-Fox; David Reich
Journal: Science Date: 2019-03-15 Impact factor: 47.728

6. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

7. The Identification of a 1916 Irish Rebel: New Approach for Estimating Relatedness From Low Coverage Homozygous Genomes.

Authors: Daniel Fernandes; Kendra Sirak; Mario Novak; John A Finarelli; John Byrne; Edward Connolly; Jeanette E L Carlsson; Edmondo Ferretti; Ron Pinhasi; Jens Carlsson
Journal: Sci Rep Date: 2017-01-30 Impact factor: 4.379

8. Unraveling ancestry, kinship, and violence in a Late Neolithic mass grave.

Authors: Hannes Schroeder; Ashot Margaryan; Marzena Szmyt; Bertrand Theulot; Piotr Włodarczak; Simon Rasmussen; Shyam Gopalakrishnan; Anita Szczepanek; Tomasz Konopka; Theis Z T Jensen; Barbara Witkowska; Stanisław Wilk; Marcin M Przybyła; Łukasz Pospieszny; Karl-Göran Sjögren; Zdzislaw Belka; Jesper Olsen; Kristian Kristiansen; Eske Willerslev; Karin M Frei; Martin Sikora; Niels N Johannsen; Morten E Allentoft
Journal: Proc Natl Acad Sci U S A Date: 2019-05-06 Impact factor: 11.205

9. A genetic history of the pre-contact Caribbean.

Authors: Daniel M Fernandes; Kendra A Sirak; Harald Ringbauer; Jakob Sedig; Nadin Rohland; Olivia Cheronet; Matthew Mah; Swapan Mallick; Iñigo Olalde; Brendan J Culleton; Nicole Adamski; Rebecca Bernardos; Guillermo Bravo; Nasreen Broomandkhoshbacht; Kimberly Callan; Francesca Candilio; Lea Demetz; Kellie Sara Duffett Carlson; Laurie Eccles; Suzanne Freilich; Richard J George; Ann Marie Lawson; Kirsten Mandl; Fabio Marzaioli; Weston C McCool; Jonas Oppenheimer; Kadir T Özdogan; Constanze Schattke; Ryan Schmidt; Kristin Stewardson; Filippo Terrasi; Fatma Zalzala; Carlos Arredondo Antúnez; Ercilio Vento Canosa; Roger Colten; Andrea Cucina; Francesco Genchi; Claudia Kraan; Francesco La Pastina; Michaela Lucci; Marcio Veloz Maggiolo; Beatriz Marcheco-Teruel; Clenis Tavarez Maria; Christian Martínez; Ingeborg París; Michael Pateman; Tanya M Simms; Carlos Garcia Sivoli; Miguel Vilar; Douglas J Kennett; William F Keegan; Alfredo Coppa; Mark Lipson; Ron Pinhasi; David Reich
Journal: Nature Date: 2020-12-23 Impact factor: 49.962

10. Understanding 6th-century barbarian social organization and migration through paleogenomics.

Authors: Carlos Eduardo G Amorim; Stefania Vai; Cosimo Posth; Alessandra Modi; István Koncz; Susanne Hakenbeck; Maria Cristina La Rocca; Balazs Mende; Dean Bobo; Walter Pohl; Luisella Pejrani Baricco; Elena Bedini; Paolo Francalacci; Caterina Giostra; Tivadar Vida; Daniel Winger; Uta von Freeden; Silvia Ghirotto; Martina Lari; Guido Barbujani; Johannes Krause; David Caramelli; Patrick J Geary; Krishna R Veeramah
Journal: Nat Commun Date: 2018-09-11 Impact factor: 14.919