Tian Ming Lan, Yu Lin, Jacob Njaramba-Ngatia, Xiao Sen Guo, Ren Gui Li, Hai Meng Li, Sunil Kumar-Sahu, Xie Wang, Xiu Juan Yang, Hua Bing Guo, Wen Hao Xu, Karsten Kristiansen, Huan Liu, Yan Chun Xu.
Abstract
Taxonomic identification based on morphology alone is often difficult for ancient remains. Universal or specific PCR amplification followed by sequencing and a BLAST (Basic Local Alignment Search Tool) search has therefore become the most frequently used genetic method for species identification of biological samples, including ancient remains. However, these methods struggle with extremely ancient samples in which the DNA is severely fragmented and contaminated. Here, we used whole-genome sequencing data from 12 ancient samples, with ages ranging from 2.7 to 700 kya, to compare mapping algorithms and to test different reference databases, mapping similarities, and query coverages, in order to find the method and mapping parameters that most improve the accuracy of ancient mammal species identification. The selected method and parameters were then tested on 152 ancient samples, of which 150 were successfully identified. We further screened the BLAST-based mapping results according to the deamination characteristics of ancient DNA to improve species identification. Our findings demonstrate a marked improvement over the standard procedures used for ancient species identification, achieved by defining mapping and filtering guidelines that identify true ancient DNA sequences. The guidelines summarized in this study could be valuable in archaeology, paleontology, evolution, and forensic science. For the convenience of the scientific community, we wrote a Perl script, AncSid, which is available on GitHub.
Keywords: BLAST; ancient DNA; next-generation sequencing; species identification
Year: 2019 PMID: 31284503 PMCID: PMC6679096 DOI: 10.3390/genes10070509
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The investigation of whole-genome-sequenced ancient mammals from 2008 to 2019. (A) The species and age of whole-genome-sequenced ancient mammals from 2008 to 2019. (B) The number of whole-genome-sequenced ancient mammals from 2008 to 2019.
The description of the next-generation sequencing (NGS) data and samples.
| Species | Sample ID | Age (kyr BP) | Data Source | Sequencing Platform | Number of Reads | Number of Bases | Average Length (bp) | Proportion of Endogenous DNA |
|---|---|---|---|---|---|---|---|---|
| | N1 | 26 | Sequencing | BGISEQ-500 | 1.00 × 10^7 | 9.32 × 10^8 | 93.17 | 59.51% |
| | N2 | 28 | Sequencing | BGISEQ-500 | 1.00 × 10^7 | 8.76 × 10^8 | 87.57 | 0.93% |
| | N3 | >43.5 | Sequencing | BGISEQ-500 | 1.00 × 10^7 | 8.70 × 10^8 | 87.03 | 34.50% |
| | N6 | >43.5 | Sequencing | BGISEQ-500 | 1.00 × 10^7 | 9.07 × 10^8 | 90.69 | 1.51% |
| | N9 | >43.5 | Sequencing | BGISEQ-500 | 1.00 × 10^7 | 8.99 × 10^8 | 89.87 | 0.54% |
| | N12 | 17 | Sequencing | BGISEQ-500 | 1.00 × 10^7 | 9.10 × 10^8 | 91.02 | 21.90% |
| | JK2911 | 2.7 | Schuenemann et al. | Illumina HiSeq 2500 | 9.74 × 10^5 | 7.08 × 10^7 | 72.62 | 39.20% |
| | AfontovaGora3 | 17 | Fu et al. | Illumina HiSeq 2500 | 8.88 × 10^5 | 5.17 × 10^7 | 58.18 | 44.64% |
| | Villabruna | 14 | Fu et al. | Illumina HiSeq 2500 | 1.22 × 10^7 | 6.69 × 10^7 | 55.02 | 41.13% |
| | Direkli5 | 11.5 | Daly et al. | Illumina HiSeq 2000 | 3.04 × 10^7 | 1.40 × 10^7 | 45.94 | 5.29% |
| | British aurochs | 6.7 | Park et al. | Illumina Genome Analyzer IIx | 7.51 × 10^7 | 3.48 × 10^9 | 46.29 | 5.91% |
| Ancient horse | Ancient horse | 560-780 | Orlando et al. | Illumina HiSeq 2000 | 6.27 × 10^6 | 3.34 × 10^8 | 53.23 | 0.43% |
BP: before present.
Success rate of ancient species identification under different conditions.
| Conditions | Level | nt database: BLASTall | Animal mtDNA database (whole): BLASTall | Animal mtDNA database (whole): BWA aln | Animal mtDNA database (whole): BWA mem | Animal mtDNA database (partial): BLASTall |
|---|---|---|---|---|---|---|
| Similarity levels ( | 90 ≤ | 4/12 | 11/12 | 9/12 | 9/12 | 5/12 |
| | 92 ≤ | 4/12 | 11/12 | 9/12 | 9/12 | 5/12 |
| | 94 ≤ | 4/12 | 12/12 | 9/12 | 9/12 | 6/12 |
| | 96 ≤ | 4/12 | 12/12 | 9/12 | 9/12 | 6/12 |
| | 98 ≤ | 4/12 | 12/12 | 9/12 | 9/12 | 8/12 |
| | | 4/12 | 12/12 | 9/12 | 9/12 | 5/12 |
| | 90 ≤ | 4/12 | 10/12 | 7/12 | 7/12 | 3/12 |
| | 92 ≤ | 4/12 | 11/12 | 7/12 | 7/12 | 3/12 |
| | 94 ≤ | 4/12 | 12/12 | 8/12 | 9/12 | 4/12 |
| | 96 ≤ | 4/12 | 12/12 | 9/12 | 9/12 | 5/12 |
| | 98 ≤ | 4/12 | 12/12 | 10/12 | 10/12 | 8/12 |
| Query coverage ( | | -- | 7/12 | -- | -- | -- |
| | | -- | 7/12 | -- | -- | -- |
| | | -- | 11/12 | -- | -- | -- |
| | | -- | 11/12 | -- | -- | -- |
| | | -- | 12/12 | -- | -- | -- |
| | | -- | 12/12 | -- | -- | -- |
| | | -- | 12/12 | -- | -- | -- |
| The first and last | | -- | 11/12 | -- | -- | -- |
| | | -- | 11/12 | -- | -- | -- |
| | | -- | 11/12 | -- | -- | -- |
| | | -- | 12/12 | -- | -- | -- |
| | | -- | 12/12 | -- | -- | -- |
We used BLASTall with a similarity of ≥98% to test query coverage and deamination screening. BWA: Burrows-Wheeler Aligner; mtDNA: mitochondrial DNA; nt: nucleotide; aln, mem: BWA functions.
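The similarity and query-coverage filtering described above can be sketched as follows. This is only an illustrative Python sketch, not the AncSid script itself: it assumes 12-column tabular BLAST output, a hypothetical `read_lengths` mapping from read ID to read length, and a hypothetical `subject_to_species` lookup. The 98% identity default follows the table note; the 0.9 coverage default is a placeholder, since the threshold is not stated here.

```python
from collections import Counter

def parse_hits(lines):
    """Yield (query, subject, pident, qstart, qend) from 12-column tabular BLAST rows."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield fields[0], fields[1], float(fields[2]), int(fields[6]), int(fields[7])

def filter_hits(lines, read_lengths, min_identity=98.0, min_coverage=0.9):
    """Keep hits whose identity and query coverage meet the thresholds.

    min_coverage is an assumed placeholder value, not a figure from the study.
    """
    kept = []
    for query, subject, pident, qstart, qend in parse_hits(lines):
        # Fraction of the read that takes part in the alignment.
        coverage = (abs(qend - qstart) + 1) / read_lengths[query]
        if pident >= min_identity and coverage >= min_coverage:
            kept.append((query, subject))
    return kept

def rank_species(kept, subject_to_species):
    """Count retained hits per species, most frequent first."""
    counts = Counter(subject_to_species[subject] for _, subject in kept)
    return counts.most_common()
```

Ranking species by the number of retained hits is one simple way to read off a candidate identification; a real pipeline would also need to decide how to treat reads with several hits.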
Figure 2. The percentage of valid mapping hits (PoVMH) of the top-ranked and second-ranked species in the species ranking (SR) based on the BLAST search results generated using the whole mtDNA database. The range is represented by whiskers, individual data points are shown using dots, the upper and lower quartiles are denoted by the boxes, and the medians are shown using the central lines. The n for each species was seven, and the n for the average PoVMH was 84.
Figure 3. The R value and PoVMH of the top-ranked species under different similarities based on the BLAST search results generated using the whole mtDNA database. 90: 90 ≤ L ≤ 100; 92: 92 ≤ L ≤ 100; 94: 94 ≤ L ≤ 100; 96: 96 ≤ L ≤ 100; 98: 98 ≤ L ≤ 100; 100: L = 100. (a) The R values under different similarities for human samples; (b) the PoVMH under different similarities for human samples; (c) the R values under different similarities for mammal samples; (d) the PoVMH under different similarities for mammal samples.
Figure 4. The comparison of R values before and after screening the reads based on deamination-induced C-to-T and/or G-to-A changes at the ends of DNA fragments. The numbers 5 to 10 on the x-axis denote the first and last X bases used for screening the reads with C-to-T and/or G-to-A changes. (a) and (b) show the average R values of 12 samples after screening, and (c) and (d) show the average R values of 12 samples before the deamination-based screening. The BLAST search and whole mtDNA database were used in this comparison. The error bars denote the standard error.
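The deamination-based screening over the first and last X bases might look like the minimal Python sketch below. It is an assumption-laden illustration rather than the published procedure: the function names are hypothetical, each read is assumed to be paired with an equal-length, gap-free reference segment in the read's orientation, and the rule used (a C-to-T change within the first `window` bases or a G-to-A change within the last `window` bases) follows the conventional ancient-DNA damage pattern.

```python
def has_deamination_signal(read, ref, window=5):
    """Return True if the read shows a C-to-T change near its start
    or a G-to-A change near its end, relative to the reference."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(read) != len(ref):
        raise ValueError("read and reference must be aligned to equal length")
    read, ref = read.upper(), ref.upper()
    # Reference C read as T within the first `window` bases.
    c_to_t = any(r == "C" and q == "T"
                 for r, q in zip(ref[:window], read[:window]))
    # Reference G read as A within the last `window` bases.
    g_to_a = any(r == "G" and q == "A"
                 for r, q in zip(ref[-window:], read[-window:]))
    return c_to_t or g_to_a

def screen_reads(alignments, window=5):
    """Return the names of reads that carry a terminal deamination signal.

    `alignments` is an iterable of (name, read_sequence, reference_segment).
    """
    return [name for name, read, ref in alignments
            if has_deamination_signal(read, ref, window)]
```

Varying `window` from 5 to 10 would correspond to the x-axis values in the figure; reads without a terminal change are discarded, which trades hit count for confidence that the retained reads are authentically ancient.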
Figure 5. The comparison of VMH before and after screening the reads based on deamination-induced C-to-T and/or G-to-A changes at the ends of DNA fragments. The red area in the bars represents the proportion of VMH before the screening and the blue area shows the proportion after the screening. The BLAST search and whole mtDNA database were used in this comparison.