Yan Guo, Jirong Long, Jing He, Chung-I Li, Qiuyin Cai, Xiao-Ou Shu, Wei Zheng, Chun Li.
Abstract
BACKGROUND: Exome sequencing using next-generation sequencing technologies is a cost-efficient approach to selectively sequencing the coding regions of the human genome for detection of disease variants. A significant fraction of the DNA fragments from the capture process falls outside target regions, and sequence data for positions outside target regions have mostly been ignored after alignment.

RESULTS: We performed whole exome sequencing on 22 subjects using Agilent SureSelect capture reagent and 6 subjects using Illumina TrueSeq capture reagent. We also downloaded sequencing data for 6 subjects from the 1000 Genomes Project Pilot 3 study. Using these data, we examined the quality of SNPs detected outside target regions by computing the consistency rate with genotypes obtained from SNP chips or the HapMap database, the transition-transversion (Ti/Tv) ratio, and the percentage of SNPs present in dbSNP. For all three platforms, we obtained high-quality SNPs outside target regions, including some far from target regions. In our Agilent SureSelect data, we obtained 84,049 high-quality SNPs outside target regions compared to 65,231 SNPs inside target regions (a 129% increase). For our Illumina TrueSeq data, we obtained 222,171 high-quality SNPs outside target regions compared to 95,818 SNPs inside target regions (a 232% increase). For the data from the 1000 Genomes Project, we obtained 7,139 high-quality SNPs outside target regions compared to 1,548 SNPs inside target regions (a 461% increase).
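The "increase" percentages quoted in the abstract can be reproduced by expressing the number of SNPs found outside target regions as a percentage of the number found inside. The short sketch below only redoes that arithmetic with the counts given in the abstract; the function name `percent_increase` is introduced here for illustration and is not from the paper.

```python
def percent_increase(outside: int, inside: int) -> float:
    """SNPs gained outside target regions, as a percentage of the inside count."""
    return 100.0 * outside / inside


# (outside, inside) high-quality SNP counts as quoted in the abstract.
counts = {
    "Agilent SureSelect": (84_049, 65_231),
    "Illumina TrueSeq": (222_171, 95_818),
    "1000 Genomes": (7_139, 1_548),
}

for platform, (outside, inside) in counts.items():
    print(f"{platform}: {percent_increase(outside, inside):.0f}% increase")
```

Under this reading, the rounded values match the 129%, 232%, and 461% figures reported above.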
Year: 2012 PMID: 22607156 PMCID: PMC3416685 DOI: 10.1186/1471-2164-13-194
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Distribution of reads aligned to the human reference genome: average (standard deviation) in millions, and percentage
| | Total | Inside TR | | Outside ≤200 bp | | Outside > 200 bp | |
| | # Reads | # Reads | % | # Reads | % | # Reads | % |
| SureSelect (n = 22) | |||||||
| Before filter | 63.4 (9.6) | 34.3 (4.9) | 54.1 | 6.4 (1.2) | 10.1 | 22.7 (3.9) | 35.8 |
| After filter | 57.2 (8.6) | 32.0 (4.5) | 56.1 | 6.0 (1.1) | 10.4 | 19.2 (3.3) | 33.5 |
| TrueSeq (n = 6) | |||||||
| Before filter | 86.2 (2.5) | 37.0 (0.7) | 42.9 | 17.1 (0.3) | 19.8 | 32.1 (1.3) | 37.2 |
| After filter | 75.1 (2.1) | 34.8 (0.7) | 46.3 | 16.3 (0.3) | 21.7 | 24.0 (1.1) | 32.0 |
| 1000 G (n = 6) | |||||||
| Before filter | 33.9 (8.1) | 7.6 (1.2) | 22.3 | 4.7 (1.7) | 14.0 | 21.6 (8.4) | 63.7 |
| After filter | 25.6 (3.8) | 7.3 (1.1) | 28.7 | 4.7 (1.7) | 18.6 | 13.6 (3.6) | 52.7 |
The filter was MAPQ ≥ 20.
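As an illustration of the MAPQ ≥ 20 filter, here is a minimal sketch that works on SAM-format text lines, where MAPQ is the fifth tab-separated column. The record does not say which tool the authors used, so the helper names (`passes_mapq`, `filter_sam`) and the toy reads are assumptions made for this example only.

```python
from typing import Iterable, Iterator

MIN_MAPQ = 20  # threshold stated in the table note


def passes_mapq(sam_line: str, min_mapq: int = MIN_MAPQ) -> bool:
    """True if an alignment record's MAPQ (5th SAM column) meets the threshold."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4]) >= min_mapq


def filter_sam(lines: Iterable[str], min_mapq: int = MIN_MAPQ) -> Iterator[str]:
    """Yield header lines unchanged and alignment lines that pass the filter."""
    for line in lines:
        if line.startswith("@") or passes_mapq(line, min_mapq):
            yield line


# Toy records (hypothetical reads, not data from the study).
records = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr1\t200\t19\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t16\tchr1\t300\t20\t4M\t*\t0\t0\tACGT\tIIII",
]

kept = [r.split("\t")[0] for r in filter_sam(records) if not r.startswith("@")]
print(kept)
```

Only the read with MAPQ 19 is dropped; the threshold is inclusive, matching "MAPQ ≥ 20".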
Figure 1. Average depth around boundaries of target regions (1-50 bp inside and 1-200 bp outside boundaries). Negative distance means inside a target region, and positive distance means outside a target region.
Figure 2. Distributions of depth, mapping quality score, and base quality score for “Inside TR”, “Outside ≤200 bp”, and “Outside > 200 bp”.
Figure 3. Distribution of sites with a minimum depth of 5 to 10.
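The signed-distance convention of Figure 1 and the three bins of Figure 2 can be sketched as follows. This is one possible implementation under stated assumptions: target regions are 1-based inclusive intervals on a single chromosome, the first base inside a boundary counts as distance -1, and the first base outside counts as +1. The function names are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end), 1-based inclusive; assumed convention


def signed_distance(pos: int, start: int, end: int) -> int:
    """Distance from pos to the nearest boundary of one target region.

    Negative inside the region, positive outside it.
    """
    if start <= pos <= end:
        return -(min(pos - start, end - pos) + 1)
    if pos < start:
        return start - pos
    return pos - end


def classify(pos: int, regions: List[Region], flank: int = 200) -> str:
    """Assign a position to one of the three bins used in Figure 2."""
    distances = [signed_distance(pos, s, e) for s, e in regions]
    if any(d < 0 for d in distances):
        return "Inside TR"
    nearest = min(distances)
    return "Outside ≤200 bp" if nearest <= flank else "Outside >200 bp"


# Hypothetical target regions for illustration.
regions = [(1000, 1100), (5000, 5200)]
for pos in (1000, 999, 1300, 1301, 5100):
    print(pos, classify(pos, regions))
```

A linear scan over regions is fine for a sketch; for a real exome target list, an interval index or a sorted list with binary search would be the more practical choice.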
GC content (base pairs in millions)
| SureSelect (a) | total BP | 37.8 | 61.0 | 2.4 | 3.4 | 5.3 |
| (n = 22) | GC | 19.1 | 25.6 | 1.1 | 1.6 | 2.5 |
| | GC% | 50.6% | 42.0% | 45.6% | 45.9% | 46.4% |
| TrueSeq (b) | total BP | 62.1 | 76.9 | 11.3 | 13.7 | 18.5 |
| (n = 6) | GC | 30.4 | 32.5 | 5.1 | 6.1 | 8.4 |
| | GC% | 49.0% | 42.3% | 44.9% | 44.8% | 45.4% |
| 1000 G (c) | total BP | 1.4 | 3.3 | 0.5 | 1.3 | 2.5 |
| (n = 6) | GC | 0.7 | 1.4 | 0.2 | 0.6 | 1.1 |
| | GC% | 51.6% | 41.6% | 46.2% | 45.4% | 45.2% |
a Based on hg18 and Agilent SureSelect v1.
b Based on hg19 and Illumina TrueSeq.
c Based on hg18 and NimbleGen 1.4 M cap kit.
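The GC% rows in the table are the GC base count divided by the total base count. A minimal sketch of that calculation for a single sequence is given below; excluding ambiguous bases such as N from the denominator is an assumption of this example, since the record does not state how such bases were handled.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C among the called (A/C/G/T) bases of a sequence."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / called


print(gc_percent("GGCCAATT"))

# The same ratio applied to one table cell pair (GC = 25.6, total BP = 61.0,
# both in millions of base pairs).
print(round(100.0 * 25.6 / 61.0, 1))
```

Because the table values are rounded to one decimal place, recomputing GC% from the rounded counts can differ slightly from the printed percentages.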
Figure 4. Average SNP count per sample, heterozygote consistency, and Ti/Tv ratio.
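The Ti/Tv ratio used as a quality measure is the number of transitions (A↔G or C↔T) divided by the number of transversions (all other single-base substitutions). The sketch below computes it from a list of (reference, alternate) allele pairs; the function names and the toy SNP list are illustrative assumptions, not the authors' pipeline.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ti_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transitions divided by transversions over biallelic SNPs."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in "ACGT" or alt not in "ACGT" or len(ref) != 1 or len(alt) != 1:
            continue  # skip anything that is not a single-base substitution
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


# Toy SNP calls (hypothetical): three transitions and one transversion.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ti_tv_ratio(snps))
```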
Distributions of depth for SNPs outside capture regions that had GQ ≥ 20 and depth ≥ 5
| 5 | 1156 | 3463 | 1225 | 3963 | 38 | 194 |
| 6 | 1000 | 1681 | 1278 | 2950 | 46 | 142 |
| 7 | 4036 | 3630 | 3200 | 4702 | 135 | 290 |
| 8 | 3483 | 2570 | 4005 | 4582 | 123 | 246 |
| 9 | 3040 | 2020 | 4222 | 3995 | 103 | 199 |
| 10 | 2647 | 1658 | 4231 | 3447 | 98 | 174 |
| 11 | 2349 | 1420 | 4261 | 2972 | 95 | 160 |
| 12 | 2099 | 1251 | 4223 | 2628 | 85 | 142 |
| 13 | 1866 | 1096 | 4188 | 2366 | 91 | 122 |
| 14 | 1694 | 995 | 4110 | 2136 | 84 | 115 |
| 15 | 1535 | 891 | 4072 | 1906 | 84 | 105 |
| 16 | 1376 | 823 | 3949 | 1749 | 71 | 101 |
| 17 | 1270 | 766 | 3887 | 1649 | 67 | 91 |
| 18 | 1159 | 721 | 3743 | 1490 | 67 | 84 |
| 19 | 1072 | 672 | 3654 | 1421 | 64 | 83 |
| ≥ 20 | 15074 | 15539 | 63620 | 62350 | 1762 | 1884 |
The numbers are average numbers of SNPs over the samples from the same capture assay.
SNPs > 200 bp outside target regions
| SureSelect | ≥ 5 | 6194 | 35 | 4873 | 33569 | 981 | 25 | 4 |
| (n = 22) | ≥ 10 | 4297 | 34 | 3790 | 21713 | 792 | 17 | 3 |
| TrueSeq | ≥ 5 | 29372 | 8 | 7537 | 74431 | 211 | 4 | 0 |
| (n = 6) | ≥ 10 | 16252 | 7 | 10153 | 60889 | 199 | 4 | 0 |
| 1000 G | ≥ 5 | 2619 | 11 | 309 | 1951 | 307 | 6 | 1 |
| (n = 6) | ≥ 10 | 1352 | 8 | 168 | 1183 | 213 | 5 | 1 |
a Variant is within 2-bp of a splicing junction.
b Variant overlaps a transcript without coding annotation in the gene definition.