Maja P Greminger, Kai N Stölting, Alexander Nater, Benoit Goossens, Natasha Arora, Rémy Bruggmann, Andrea Patrignani, Beatrice Nussberger, Reeta Sharma, Robert H S Kraus, Laurentius N Ambu, Ian Singleton, Lounes Chikhi, Carel P van Schaik, Michael Krützen.
Abstract
BACKGROUND: High-throughput sequencing has opened up exciting possibilities in population and conservation genetics by enabling the assessment of genetic variation at genome-wide scales. One approach to reducing genome complexity, i.e. investigating only parts of the genome, is reduced-representation library (RRL) sequencing. Like similar approaches, RRL sequencing reduces ascertainment bias because single-nucleotide polymorphisms (SNPs) are discovered and genotyped simultaneously, and it does not require a reference genome. Yet generating such datasets remains challenging because of laboratory and bioinformatic issues. In the laboratory, current protocols need to improve the sequencing of homologous fragments in order to reduce the number of missing genotypes. On the bioinformatic side, most studies rely on a single SNP caller, which disregards the possibility that different algorithms may produce disparate SNP datasets.
Year: 2014 PMID: 24405840 PMCID: PMC3897891 DOI: 10.1186/1471-2164-15-16
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Geographic location of the two orangutan study populations. Brown areas indicate the current distribution of orangutans.
Figure 2. In-silico HaeIII digest of the orangutan reference genome. Panels a, b and c show increasing levels of detail. The x-axis shows the generated fragment length in base pairs, and the y-axis shows the number of fragments multiplied by fragment length. Peaks are due to repetitive sequences. The isolated fragment size range (104–123 bp) is indicated in red.
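The in-silico digest behind this figure can be illustrated with a minimal Python sketch. It assumes that HaeIII recognizes GGCC and cuts bluntly between GG and CC; the function names (`digest`, `size_select`, `weighted_profile`) and the toy sequence are hypothetical and not taken from the study, which digested the full ponAbe2 reference genome.

```python
import re
from collections import Counter

def digest(seq, site="GGCC", cut_offset=2):
    """Return fragment lengths after cutting seq at every occurrence of site.

    Assumption: blunt cut at cut_offset bases into the site (GG^CC for HaeIII).
    """
    seq = seq.upper()
    cuts = [m.start() + cut_offset for m in re.finditer(re.escape(site), seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]

def size_select(lengths, lo=104, hi=123):
    """Keep fragments inside the isolated size range (bounds inclusive)."""
    return [n for n in lengths if lo <= n <= hi]

def weighted_profile(lengths):
    """Map fragment length -> number of fragments * fragment length."""
    counts = Counter(lengths)
    return {length: n * length for length, n in counts.items()}

# Toy sequence (hypothetical): two HaeIII sites flanking a 114 bp fragment.
toy = "A" * 10 + "GGCC" + "T" * 110 + "GGCC" + "A" * 5
fragments = digest(toy)
selected = size_select(fragments)
profile = weighted_profile(fragments)
```

On a real genome the same logic would be applied per chromosome, and the weighted profile corresponds to the quantity plotted on the y-axis.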
Overview of the sequencing of improved reduced-representation libraries (iRRLs) for the West Alas (WA) and South Kinabatangan (SK) orangutan study populations
| | WA | SK |
|---|---|---|
| No. of individuals | 15 | 16 |
| iRRL stacks per individual (predicted) [a] | 305,574 | 305,574 |
| Median iRRL target efficiency [b] | 97% | 86% |
| Total no. of beads per population | 675,295,801 | 762,234,081 |
| Total no. of mapped reads per population | 528,081,935 | 646,922,204 |
| Median no. of mapped reads per individual | 32,345,177 | 43,451,986 |
| % reads mapped F3/F5 (mappability) [c] | 74.9/7.3 | 67.0/17.0 |
| Mean no. of bp_hiqual per individual [d] | 10,930,563 | 18,186,855 |
| Median sequence coverage per individual [e] | 41× | 42× |
[a] Predicted by in-silico digest of the orangutan reference genome ponAbe2 (Sumatran) with HaeIII.
[b] Percentage of sequenced sites that were predicted by the in-silico digest.
[c] F3 and F5 are the read directions in paired-end sequencing mode.
[d] Number of sequenced base pairs passing all high-quality filters (sites used for SNP detection).
[e] GATK estimates based on bp_hiqual.
Overview of SNP discovery and genotype calling using three different callers
| | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| No. of SNPs | 34257 | 40248 | 57396 | 34788 | 55585 | 75364 | 14494 | 14903 | 24103 |
| No. of private SNPs | 17148 | 23139 | 40287 | 19779 | 40576 | 60355 | 9200 | 9609 | 18809 |
| % singletons | 7.68 | 10.83 | 12.18 | 11.53 | 27.47 | 25.59 | 14.63 | 21.66 | 22.19 |
| Median site heterozygosity [a] | 0.267 | 0.250 | / | 0.236 | 0.200 | / | 0.266 | 0.231 | / |
| Median coverage per individual | 93× | 70× | 82× | 66× | 29× | 48× | 66× | 19× | 27× |
| | |||||||||
| | |||||||||
| No. of SNPs | 21475 | 24936 | 37085 | 11325 | 12350 | 18933 | 9861 | 11310 | 17163 |
| No. of private SNPs | 12149 | 15610 | 27759 | 6583 | 7608 | 14191 | 5853 | 7302 | 13155 |
| % singletons | 9.91 | 17.98 | 12.82 | 9.99 | 20.53 | 19.37 | 10.54 | 23.08 | 21.60 |
| Median site heterozygosity [a] | 0.250 | 0.222 | / | 0.286 | 0.231 | / | 0.266 | 0.222 | / |
| Median coverage per individual [b] | 107× (65) | 81× (27) | 96× (37) | 55× (98) | 18× (98) | 20× (99) | 69× (76) | 19× (35) | 26× (46) |
We required all SNPs to have a genotype call passing all stringent quality filters in a minimum of eight individuals per population (population-based filtering). The intersect datasets contain exclusively concordant genotype calls between the designated SNP callers. Pop_SK: South Kinabatangan population, Pop_WA: West Alas population.
[a] Based on the sites that are polymorphic within the population.
[b] Coverage values of intersect datasets are taken from the first-named SNP caller. The coverage values of the second-named caller are given in brackets.
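The two rules stated in the table note (at least eight genotyped individuals per population, and intersect datasets restricted to concordant calls) can be sketched in a few lines of Python. This is only an illustration under assumptions: genotypes are represented as plain strings with `None` for a call that failed the quality filters, the intersection is applied before the population filter, and the names `intersect_calls`, `passes_population_filter`, `clc` and `gatk` are hypothetical.

```python
MIN_CALLED = 8  # minimum number of genotyped individuals per population (from the table note)

def intersect_calls(calls_a, calls_b):
    """Keep only the genotype calls on which both callers agree at one site.

    calls_a, calls_b: {individual: genotype or None} for the same SNP site.
    """
    return {
        ind: gt
        for ind, gt in calls_a.items()
        if gt is not None and calls_b.get(ind) == gt
    }

def passes_population_filter(calls, members, min_called=MIN_CALLED):
    """True if at least min_called members of one population have a call."""
    called = sum(1 for ind in members if calls.get(ind) is not None)
    return called >= min_called

# Toy data (hypothetical): ten individuals from one population at one site.
members = [f"ind{k}" for k in range(10)]
clc = {ind: "0/1" for ind in members}
gatk = dict(clc)
gatk["ind0"] = "0/0"   # discordant call
gatk["ind1"] = None    # call failed the filters in the second caller

shared = intersect_calls(clc, gatk)
kept = passes_population_filter(shared, members)

gatk["ind2"] = "1/1"   # a third individual drops out of the intersection
kept_after = passes_population_filter(intersect_calls(clc, gatk), members)
```

With eight concordant individuals the site is retained; once a third individual drops out, only seven remain and the site is removed.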
Figure 3. Overlap of SNPs among the datasets obtained from three different callers. Percentages specify the proportion of SNPs exclusively present in the particular dataset for each caller.
Median genotype concordance between designated SNP callers for overlapping SNP sites assessed at the individual level
| | | | |
|---|---|---|---|
| % same genotype called Pop_WA | 96.92 | 98.46 | 96.15 |
| % same genotype called Pop_SK | 98.27 | 98.04 | 97.45 |
| % same genotype called overall | 97.51 | 98.32 | 97.24 |
| % same genotype called overall (range) | 93.59–98.38 | 97.08–99.26 | 92.46–97.82 |
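A concordance figure of this kind can be computed as follows: for each individual, count the overlapping SNP sites at which two callers assign the same genotype, then take the median over individuals. The Python sketch below is a minimal illustration that assumes genotypes are stored as sorted allele tuples keyed by site; the function names and toy data are hypothetical and do not reproduce the study's pipeline.

```python
from statistics import median

def individual_concordance(gt_a, gt_b):
    """Percentage of shared sites with identical genotypes for one individual.

    gt_a, gt_b: {site: (allele, allele)} from two callers; alleles are assumed
    to be sorted so that ("A", "G") and ("G", "A") cannot both occur.
    """
    shared = gt_a.keys() & gt_b.keys()
    if not shared:
        return None
    same = sum(gt_a[site] == gt_b[site] for site in shared)
    return 100.0 * same / len(shared)

def median_concordance(calls_a, calls_b):
    """Median of per-individual concordance; calls_x: {individual: {site: genotype}}."""
    values = [
        individual_concordance(calls_a[ind], calls_b[ind])
        for ind in calls_a.keys() & calls_b.keys()
    ]
    return median(v for v in values if v is not None)

# Toy data (hypothetical): two individuals, two callers.
caller_a = {
    "ind1": {101: ("A", "G"), 202: ("A", "A"), 303: ("C", "C"), 404: ("C", "T")},
    "ind2": {101: ("A", "A"), 202: ("A", "G")},
}
caller_b = {
    "ind1": {101: ("A", "G"), 202: ("A", "A"), 303: ("C", "C"), 404: ("C", "C")},
    "ind2": {101: ("A", "A"), 202: ("A", "G"), 303: ("C", "T")},
}
result = median_concordance(caller_a, caller_b)
```

Sites called by only one of the two callers are ignored, which matches the restriction to overlapping SNP sites in the table title.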
Figure 4. Quantitative investigation of discordant genotype calls between pairs of SNP callers. For the vast majority (>99.77%) of discordant genotype calls, one caller assigned a heterozygous genotype while the other assigned a homozygous genotype for one of the two alleles. The y-axis shows the percentage of heterozygous genotype calls in such cases. Values are medians across all study individuals.
Figure 5. Kernel density distributions of minor-allele frequency and site heterozygosity for the different SNP datasets. For each of the six SNP datasets (CLC, GATK, SAMtools, GATK-CLCintersect, SAMtools-GATKintersect and SAMtools-CLCintersect), we computed the minor-allele frequency (MAF) for the Sumatran (WA) and Bornean (SK) individuals (panels a and b, respectively) and the site heterozygosity for WA and SK (panels c and d, respectively).
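The two per-site statistics summarized in this figure can be sketched in Python. The sketch assumes biallelic sites and treats site heterozygosity as the observed proportion of heterozygous individuals among those with a genotype call; the source does not state which definition was used, so this is an assumption, and `site_stats` is a hypothetical helper name.

```python
from collections import Counter

def site_stats(genotypes):
    """Return (minor-allele frequency, observed heterozygosity) for one site.

    genotypes: list of (allele, allele) tuples, or None for missing calls.
    Assumption: the site is biallelic; heterozygosity is the observed
    fraction of heterozygous individuals among called individuals.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    allele_counts = Counter(allele for g in called for allele in g)
    if len(allele_counts) > 2:
        raise ValueError("expected a biallelic site")
    total = sum(allele_counts.values())
    # A monomorphic site has no minor allele, so its MAF is zero.
    maf = min(allele_counts.values()) / total if len(allele_counts) == 2 else 0.0
    het = sum(a != b for a, b in called) / len(called)
    return maf, het

# Toy site (hypothetical): four called individuals and one missing call.
example = site_stats([("A", "A"), ("A", "G"), ("A", "G"), ("A", "A"), None])
```

Applying `site_stats` to every site of a dataset yields the values whose distributions would then be smoothed with a kernel density estimator.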
Figure 6. Overlap of outlier regions among SNP datasets in genome-wide scans for positive selection. For each SNP dataset, we performed sliding-window analyses (100 kb windows, 25 kb step size) of the absolute allele-frequency differential (D) between the SK and WA populations. All windows with a mean D > 0.95 were considered outliers, i.e. candidate regions for selective sweeps. Percentages are given relative to the total number of outlier windows.
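The sliding-window scan described in this caption can be sketched as below, using the stated window size, step size and threshold. Several details are assumptions made for illustration: positions are 0-based on a single chromosome, D is computed per SNP as the absolute difference of the same allele's frequency in SK and WA, windows without SNPs are skipped, and no minimum SNP count per window is enforced. The function name and toy data are hypothetical.

```python
def sliding_window_outliers(snps, chrom_length, window=100_000, step=25_000,
                            threshold=0.95):
    """Return (start, end, mean_D) for windows whose mean D exceeds threshold.

    snps: iterable of (position, freq_sk, freq_wa) on one chromosome, where
    the two frequencies refer to the same allele in each population.
    """
    snps = list(snps)
    outliers = []
    last_start = max(chrom_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        d_values = [abs(p_sk - p_wa) for pos, p_sk, p_wa in snps
                    if start <= pos < end]
        if not d_values:
            continue  # assumption: windows without SNPs are not evaluated
        mean_d = sum(d_values) / len(d_values)
        if mean_d > threshold:
            outliers.append((start, end, mean_d))
    return outliers

# Toy chromosome (hypothetical): two nearly fixed differences and one weak one.
toy_snps = [(110_000, 1.0, 0.0), (120_000, 1.0, 0.02), (10_000, 0.5, 0.4)]
hits = sliding_window_outliers(toy_snps, chrom_length=200_000)
```

Because windows overlap by 75 kb, a single region of strong differentiation is typically reported by several adjacent windows, which may be merged into one candidate region afterwards.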
Overview of genotype validations at overlapping SNP sites
| | ||||||
|---|---|---|---|---|---|---|
| | | | | | | |
| Singleton site determined by GATK/SAMtools [b] | 8 | 8 | 1 | 12.50 | 7 | 87.50 |
| Singleton site determined by CLC [b] | 4 | 4 | 0 | 0.00 | 4 | 100.00 |
| Homozygote with GATK/SAMtools but heterozygote with CLC | 23 | 28 | 3 | 10.71 | 25 | 89.29 |
| Heterozygote with GATK/SAMtools but homozygote with CLC | 23 | 23 | 7 | 30.43 | 16 | 69.57 |
| | | | | | | |
[a] Overlapping SNP sites but discordant genotype assignments.
[b] Loci were counted exclusively in this category and were not considered in the homozygote or heterozygote categories below.
[c] 100 of the 114 genotypes were validated from the same sites used to validate the discordant genotypes. The remaining 14 genotypes were validated from 14 SNPs chosen randomly from the GATK-CLCintersect dataset (exclusively identical genotype calls).