Jeffrey M Kidd, Thomas J Sharpton, Dean Bobo, Paul J Norman, Alicia R Martin, Meredith L Carpenter, Martin Sikora, Christopher R Gignoux, Neda Nemat-Gorgani, Alexandra Adams, Moraima Guadalupe, Xiaosen Guo, Qiang Feng, Yingrui Li, Xiao Liu, Peter Parham, Eileen G Hoal, Marcus W Feldman, Katherine S Pollard, Jeffrey D Wall, Carlos D Bustamante, Brenna M Henn.
Abstract
BACKGROUND: Targeted capture of genomic regions reduces sequencing cost while generating higher coverage by allowing biomedical researchers to focus on specific loci of interest, such as exons. Targeted capture also has the potential to facilitate the generation of genomic data from DNA collected via saliva or buccal cells. DNA samples derived from these cell types tend to have a lower human DNA yield, may be degraded from age, and/or may be contaminated by bacteria or other ambient oral microbiota. However, thousands of samples have previously been collected from these cell types, and saliva collection has the advantage that it is non-invasive and appropriate for a wide variety of research.
Year: 2014 PMID: 24708091 PMCID: PMC4051168 DOI: 10.1186/1471-2164-15-262
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary statistics for KhoeSan exomes
| Pilot 1 | SA006 | 69,272,282 | 9,122,731 | 13.2% | 54.2% | 63.5% | 12 | 94.9% | 25,225 | 657 | 0.9897 |
| | SA008 | 113,888,276 | 2,143,408 | 1.9% | 19.8% | 78.2% | 73 | 99.5% | 26,408 | 955 | 0.9947 |
| | SA011 | 78,006,472 | 1,664,959 | 2.1% | 33.7% | 77.4% | 40 | 99.0% | 26,365 | 67 | NA |
| | SA012 | 67,209,032 | 1,353,187 | 2.0% | 20.5% | 75.7% | 42 | 99.3% | 26,722 | 86 | NA |
| | SA035 | 85,142,498 | 5,812,851 | 6.8% | 78.0% | 79.4% | 10 | 92.1% | 24,692 | 1,726 | 0.9884 |
| | SA051 | 76,076,464 | 3,102,819 | 4.1% | 27.8% | 76.5% | 37 | 98.8% | 27,674 | 1,239 | NA |
| | SA052 | 60,375,472 | 1,247,951 | 2.1% | 12.9% | 78.2% | 41 | 98.8% | 27,779 | 755 | 0.9968 |
| | SA054 | 62,358,148 | 1,959,032 | 3.1% | 27.9% | 73.9% | 31 | 99.3% | 28,024 | 817 | 0.9956 |
| | | | | | | | | | | | |
| Pilot 2 | SA1000 | 77,069,730 | 8,387,491 | 10.9% | 9.5% | 57.3% | 44 | 98.4% | 27,921 | 2,483 | 0.9915 |
| | SA1001 | 85,479,934 | 3,551,500 | 4.2% | 11.4% | 74.2% | 67 | 98.7% | 27,694 | 2,318 | 0.9939 |
| | SA1002 | 92,542,846 | 4,674,919 | 5.1% | 15.5% | 70.1% | 65 | 98.8% | 27,886 | 3,286 | 0.9941 |
| | SA1006 | 83,545,692 | 4,002,665 | 4.8% | 18.1% | 74.5% | 59 | 98.4% | 27,446 | 2,442 | 0.9927 |
| | SA1010 | 87,939,484 | 4,445,502 | 5.1% | 14.5% | 71.0% | 62 | 98.6% | 27,295 | 1,782 | 0.9935 |
| | SA1011 | 82,377,158 | 7,810,714 | 9.5% | 11.6% | 49.2% | 40 | 98.5% | 27,484 | 2,717 | 0.9887 |
| | SA1025 | 81,405,650 | 2,498,412 | 3.1% | 10.0% | 87.8% | 63 | 99.3% | 28,696 | 2,676 | 0.9934 |
a: Total number of DNA fragments, including mapped, unmapped, and duplicate reads.
b: Limited to non-duplicate reads on autosomes, as calculated by the GATK UnifiedGenotyper.
c: Limited to XX autosomal SNPs identified at the 99% VQSR threshold.
d: Concordance at heterozygous and homozygous non-reference positions as compared to Illumina OmniExpress or 550K.v2 SNP arrays.
e: Fewer average singletons as a result of including closely related individuals in Pilot 1. See Additional file 1: Table S1 for individual data.
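The concordance measure described in the footnotes can be illustrated with a small sketch. It assumes genotypes are encoded as the number of non-reference alleles (0, 1, 2, or missing) and that a site is compared only when the array genotype carries a non-reference allele; the excerpt does not state which call set defines the compared positions, so that choice, the function name, and the site identifiers are all assumptions.

```python
from typing import Dict, Optional

# Copies of the non-reference allele: 0, 1, 2, or None if the call is missing.
Genotype = Optional[int]


def nonref_concordance(exome: Dict[str, Genotype], array: Dict[str, Genotype]) -> float:
    """Fraction of compared sites where the exome call equals the array call.

    Assumption: a site is compared only if both calls are present and the
    array genotype is heterozygous (1) or homozygous non-reference (2).
    """
    compared = 0
    matching = 0
    for site, array_gt in array.items():
        exome_gt = exome.get(site)
        if array_gt is None or exome_gt is None:
            continue  # missing in either call set
        if array_gt == 0:
            continue  # homozygous reference on the array is not counted
        compared += 1
        if exome_gt == array_gt:
            matching += 1
    if compared == 0:
        raise ValueError("no comparable non-reference sites")
    return matching / compared


# Toy data with hypothetical site identifiers (not from the study).
exome_calls = {"chr1:100": 1, "chr1:200": 2, "chr1:300": 0, "chr1:400": 1, "chr1:500": None}
array_calls = {"chr1:100": 1, "chr1:200": 2, "chr1:300": 0, "chr1:400": 2, "chr1:500": 1}
print(nonref_concordance(exome_calls, array_calls))
```

In the toy data, three sites are compared and two agree, so the sketch reports a concordance of two thirds.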
Figure 1. Schematic of mapping and calling pipelines. Each box summarizes the data and data format used for each step of the human exome and microbiome mapping/calling pipelines. The pipeline begins with next-generation sequencing raw reads obtained from exome sequencing of saliva-derived DNA and ends in finalized exome variant calls and microbiome taxonomic abundances. Arrows indicate analysis methods used to process the human and saliva microbiome data (see Methods).
Figure 2. Assessment of base substitutions from mapped reads. Each mapped read was compared to the genome reference sequence to assess patterns consistent with DNA degradation. At each of the 75 positions along a read, we plot the frequency of substitution types, for both the forward (left) and reverse (right) reads from each read-pair. Analysis was limited to 1 million reads from chromosome 1; all raw reads are plotted. Three individuals with varying levels of substitution errors are shown: (A) SA006 with overall higher substitution rate and an excess of purines at the start of the first read, (B) SA035 with a slightly elevated substitution rate and excess of purines at the start of the first read, and (C) SA054 with a low substitution rate and no bias at the beginning of the first read. The additional five Pilot 1 individuals tended to resemble SA054 (Additional file 1: Figure S4). Removal of reads with any soft-clipping substantively reduced the mis-incorporation rate for SA006 and SA035.
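A per-position substitution tally of the kind plotted in Figure 2 might be computed roughly as follows. This is a minimal sketch, not the authors' pipeline: it assumes each read has already been paired with the reference bases it aligns to, with no gaps or soft-clipping, and the function name and toy reads are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def substitution_profile(pairs: Iterable[Tuple[str, str]], read_length: int) -> List[Dict[str, float]]:
    """Per-position frequency of each substitution type, keyed as 'ref>read'.

    Each pair is (read_bases, reference_bases), aligned without gaps. That is
    a simplifying assumption; real alignments contain indels and soft-clips.
    """
    counts = [Counter() for _ in range(read_length)]
    depth = [0] * read_length
    for read, ref in pairs:
        for pos, (read_base, ref_base) in enumerate(zip(read[:read_length], ref[:read_length])):
            if read_base == "N" or ref_base == "N":
                continue  # ambiguous bases are not counted
            depth[pos] += 1
            if read_base != ref_base:
                counts[pos][f"{ref_base}>{read_base}"] += 1
    return [
        {sub: n / depth[pos] for sub, n in counts[pos].items()} if depth[pos] else {}
        for pos in range(read_length)
    ]


# Toy reads of length 4 (the study used 75 bp reads).
toy_pairs = [("ACGT", "GCGT"), ("ACGT", "ACGT"), ("TCGA", "TCGT"), ("ACNT", "ACGT")]
profile = substitution_profile(toy_pairs, read_length=4)
print(profile)
```

An excess of substitutions at the first few positions of the profile would correspond to the read-start bias described for SA006 and SA035.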
Figure 3. Novelty compared to the 1000 Genomes Project. We compared the number of nonreference variants in the South African KhoeSan [SSAN] with (A) 1000 Genomes Yoruba samples [YRI], (B) eastern African Luhya [LWK], and (C) Namibian San from HGDP and African-Americans [ASW]. Sites were included in this analysis only if genotype information was present for at least 95% of the individuals in the joint dataset. The exome data from ASW and LWK were derived from the 1000 Genomes Project, Phase 1, March 17, 2012 release, from which 13 individuals were randomly sampled. The Venn diagram illustrates the number of shared and unique nonreference variants among populations. The Vennerable package in R was used for plotting.
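The counts behind a Venn diagram like Figure 3 reduce to a call-rate filter followed by set operations. The sketch below assumes genotypes are stored per site as a list of non-reference allele counts (or `None` when missing) in a fixed sample order; the data layout, function names, and toy sites are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set

Genotypes = Dict[str, List[Optional[int]]]  # site -> one call per individual


def passing_sites(genotypes: Genotypes, min_call_rate: float = 0.95) -> Set[str]:
    """Sites with genotype information for at least min_call_rate of individuals."""
    keep = set()
    for site, calls in genotypes.items():
        called = sum(gt is not None for gt in calls)
        if calls and called / len(calls) >= min_call_rate:
            keep.add(site)
    return keep


def nonref_sites(genotypes: Genotypes, sample_idx: List[int], allowed: Set[str]) -> Set[str]:
    """Allowed sites where any listed individual carries a non-reference allele."""
    return {
        site
        for site in allowed
        if any((genotypes[site][i] or 0) > 0 for i in sample_idx)
    }


# Toy joint dataset: individuals 0-1 form population A, 2-3 form population B.
toy = {
    "s1": [1, 0, 0, 0],
    "s2": [0, 0, 2, 1],
    "s3": [1, 1, 0, 1],
    "s4": [1, None, 1, 0],  # only 75% called, so it is filtered out
    "s5": [0, 0, 0, 0],
}
allowed = passing_sites(toy)
pop_a = nonref_sites(toy, [0, 1], allowed)
pop_b = nonref_sites(toy, [2, 3], allowed)
print(sorted(pop_a - pop_b), sorted(pop_a & pop_b), sorted(pop_b - pop_a))
```

The three printed lists correspond to the variants unique to the first population, shared by both, and unique to the second.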
Figure 4. Principal components analysis of 61,000 exonic SNPs in the ≠Khomani San and other African populations. Exomes from 1000 Genomes Phase 1, Schuster et al. [15], and HGDP San were combined with the ≠Khomani San (related samples from Families 1 and 2 were removed). 5.76% of the variance is explained by PC1, 3.74% by PC2, 1.27% by PC3, and 1.12% by PC4. PC1 and PC2 separate Africans from Europeans, and western Africans from southern Africans, respectively (A). The three KhoeSan populations drive PC3 and PC4 (B and C), supporting prior descriptions of strong differentiation among Kalahari KhoeSan groups [27], and indicating sub-structure even within the ≠Khomani San samples.
HLA and KIR validation
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| KIR (13 genes) | 1469 | 91 | 31 | 955 | 99.99 | 670 | 99.99 |
| HLA class 1 A | 690 | 16 | | 690 | 99.98 | 619 | 100.00 |
| HLA class 1 B | 925 | 12 | | 925 | 99.99 | 745 | 99.99 |
| HLA class 1 C | 986 | 8 | | 986 | 100.00 | 814 | 100.00 |
a: Non-reference single nucleotide polymorphisms.
b: Unique coding sequences.
HLA and KIR validation for SA006 and SA035
| | | | | |
|---|---|---|---|---|
| A*03:01 | 37 | 12 | 19 | 19 |
| B*07:02 | 71 | 58 | 34 | 34 |
| C*07:02 | 40 | 39 | 50 | 50 |
a: It was not possible to obtain the HLA-A and -B genotypes from exome data of SA006.
KhoeSan saliva microbiome abundance by read threshold
| | | | |
|---|---|---|---|
| 0.062 | 0.077 | 0.13 | |
| 0.051 | 0.062 | 0.076 | |
| 0.031 | 0.039 | 0.063 | |
| 0.038 | 0.046 | 0.063 | |
| 0.047 | 0.054 | 0.059 | |
| 0.041 | 0.049 | 0.056 | |
| 0.027 | 0.031 | 0.029 | |
| 0.016 | 0.019 | 0.028 | |
| 0.02 | 0.025 | 0.027 | |
| 0.015 | 0.012 | 0.024 | |
1: The genome-length-corrected relative abundance calculated using a 50% identity fragment recruitment threshold.
2: The genome-length-corrected relative abundance calculated using fragment recruitment thresholds of 80% or 95% identity across at least 75% of the read.
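One plausible way to compute a genome-length-corrected relative abundance under a recruitment threshold is sketched below. The excerpt does not give the exact formula, so dividing recruited read counts by genome length and rescaling to sum to one is an assumption, as are the `Hit` record, the function name, and the genome names.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    genome: str
    identity: float          # percent identity of the alignment, 0-100
    aligned_fraction: float  # fraction of the read covered by the alignment, 0-1


def length_corrected_abundance(
    hits: Iterable[Hit],
    genome_lengths: Dict[str, int],
    min_identity: float,
    min_aligned_fraction: float = 0.0,
) -> Dict[str, float]:
    """Recruited reads per base of genome, rescaled so the values sum to 1."""
    counts = Counter(
        h.genome
        for h in hits
        if h.identity >= min_identity and h.aligned_fraction >= min_aligned_fraction
    )
    density = {genome: n / genome_lengths[genome] for genome, n in counts.items()}
    total = sum(density.values())
    return {genome: d / total for genome, d in density.items()} if total else {}


# Toy data with hypothetical genomes; lengths are in base pairs.
lengths = {"genome_X": 2_000_000, "genome_Y": 4_000_000}
toy_hits = [
    Hit("genome_X", 96, 0.90), Hit("genome_X", 85, 0.80), Hit("genome_X", 60, 0.50),
    Hit("genome_Y", 97, 1.00), Hit("genome_Y", 90, 0.80),
    Hit("genome_Y", 82, 0.76), Hit("genome_Y", 81, 0.90),
]
at_80 = length_corrected_abundance(toy_hits, lengths, min_identity=80, min_aligned_fraction=0.75)
at_95 = length_corrected_abundance(toy_hits, lengths, min_identity=95, min_aligned_fraction=0.75)
print(at_80, at_95)
```

In the toy data the larger genome recruits twice as many reads at 80% identity, yet both genomes receive equal abundance after length correction, which is the effect the correction is meant to capture.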
Figure 5. Comparison of saliva microbiome frequencies from full genome and exome-capture sequencing. Estimates of the relative abundance of saliva microbiota obtained via exome capture (x-axis) strongly correlate with those obtained from shotgun metagenomes produced from the same sample (y-axis). The dot plots illustrate this result for two KhoeSan individuals involved in our study: A) SA1000 and B) SA1025. Each dot represents a genome. A linear model representing the relationship between exome-capture and non-capture estimates of relative abundance is shown in blue; the variance in the predictions from the model is shaded in grey. A Spearman correlation test indicates that this relationship is very strong (rho > 0.65; p < 2.2e-16).
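For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation of the ranks of the two variables. The sketch below computes rho only (no p-value) in plain Python, using average ranks for ties; the abundance values are made up for illustration and are not taken from the study.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-genome relative abundances for one sample.
exome_capture = [0.10, 0.05, 0.30, 0.02, 0.20]
shotgun = [0.12, 0.04, 0.25, 0.03, 0.28]
rho = spearman_rho(exome_capture, shotgun)
print(round(rho, 3))
```

Because the statistic depends only on ranks, it is insensitive to the absolute scale of the two abundance estimates, which suits a comparison between capture and non-capture libraries.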
Figure 6. Differences in taxon ranks between South African samples and the Human Microbiome Project. A, B) Oral microbiome structure varies among the KhoeSan. Each of the stacked bar plots illustrates the relative abundance (y-axis) of the most abundant oral microbiota at the A) genus, and B) species levels for each of the 15 KhoeSan individuals (x-axis). Relative abundance was measured as the fraction of high-quality reads that were recruited to a microbial genome of a particular taxonomic rank using conservative recruitment settings (Methods). Only the nine most abundant groups for each taxonomic level are illustrated for visualization purposes, with the remaining taxa being grouped into the ‘Other’ category. C) KhoeSan (red) and healthy North American (blue) saliva microbiomes differ in their community structure. In this bar plot, the normalized relative abundance, which is a taxon’s median relative abundance detected within a population divided by the maximum relative abundance detected within a population, is shown for bacterial genera that are detected in either of the two populations. Genera are ordered by their median relative abundance across the KhoeSan. Notable differences between the populations are those where the taxon is abundant in the KhoeSan and effectively undetected in the North Americans, especially Rothia.
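The normalized relative abundance in panel C can be sketched as follows. The caption leaves the denominator open to interpretation; this sketch assumes it is the largest per-taxon median within the population, so that the most abundant genus scores 1. That reading, the function name, and the genus labels are assumptions.

```python
from statistics import median
from typing import Dict, List


def normalized_relative_abundance(abundances: Dict[str, List[float]]) -> Dict[str, float]:
    """Median abundance per taxon divided by the largest median in the population.

    `abundances` maps each taxon to its relative abundance in every individual
    of one population. The choice of denominator is an assumed reading of the
    caption, not a confirmed detail of the original analysis.
    """
    medians = {taxon: median(values) for taxon, values in abundances.items()}
    top = max(medians.values())
    if top == 0:
        return {taxon: 0.0 for taxon in medians}
    return {taxon: m / top for taxon, m in medians.items()}


# Toy population of three individuals with hypothetical genera.
toy_population = {
    "genus_A": [0.2, 0.4, 0.3],
    "genus_B": [0.1, 0.0, 0.2],
    "genus_C": [0.0, 0.0, 0.0],
}
normalized = normalized_relative_abundance(toy_population)
print(normalized)
```

Computing this separately for each population puts both on a 0 to 1 scale, so a genus that is prominent in one group and undetected in the other stands out in a side-by-side bar plot.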