Pınar Kavak1, Bayram Yüksel2, Soner Aksu2, M Oguzhan Kulekci3, Tunga Güngör4, Faraz Hach5, S Cenk Şahinalp5, Can Alkan6, Mahmut Şamil Sağıroğlu3.
Abstract
Improvements in high-throughput sequencing (HTS) technologies have made clinical sequencing projects such as ClinSeq and Genomics England feasible. Although the accuracy and reproducibility of HTS-based analyses have improved significantly, using these data for diagnostic and prognostic applications requires near-perfect data generation. To assess the robustness of a widely used HTS platform for accurate and reproducible clinical applications, we generated whole genome shotgun (WGS) sequence data from the genomes of two human individuals in two different genome sequencing centers. After analyzing the data to characterize SNPs and indels using the same tools (BWA, SAMtools, and GATK), we observed a significant number of discrepancies in the call sets. As expected, most of the disagreements between the call sets were found within genomic regions containing common repeats and segmental duplications, and only a small fraction of the discordant variants were within exons and other functionally relevant regions such as promoters. We conclude that although HTS platforms are sufficiently powerful for providing data for first-pass clinical tests, the variant predictions still need to be confirmed using orthogonal methods before being used in clinical applications.
Year: 2015 PMID: 26382624 PMCID: PMC4575192 DOI: 10.1371/journal.pone.0138259
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of the sequence datasets.
| Dataset | Number of reads | Read length | Expected Coverage | Number of mapped reads | Effective Coverage | GC% |
|---|---|---|---|---|---|---|
| | 1,401,819,290 | 104 | 45.6X | 1,366,858,600 | 42.3X | 42% |
| | 1,394,524,622 | 90 | 41.5X | 1,272,512,132 | 37.6X | 39% |
| | 934,050,130 | 104 | 31.3X | 914,763,337 | 29.56X | 43% |
| | 1,793,560,406 | 90 | 53.4X | 1,688,991,592 | 49.2X | 41% |
Basic statistics of the two samples (S1, S2) sequenced at two different centers. Each sample was sequenced at both TÜBİTAK and BGI, giving two datasets per sample.
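The read counts in the table can be turned into a per-dataset mapping rate (mapped reads divided by total reads). The following is a minimal sketch of that arithmetic, not a step taken from the study itself; the function name `mapping_rate` is hypothetical, and rows are referred to by their position in the table.

```python
# Read counts copied from the table above, in row order:
# (number of reads, number of mapped reads).
rows = [
    (1_401_819_290, 1_366_858_600),
    (1_394_524_622, 1_272_512_132),
    (934_050_130, 914_763_337),
    (1_793_560_406, 1_688_991_592),
]


def mapping_rate(total_reads: int, mapped_reads: int) -> float:
    """Return the fraction of sequenced reads that were mapped."""
    return mapped_reads / total_reads


# One rate per table row, in the same order as the table.
rates = [mapping_rate(total, mapped) for total, mapped in rows]

for index, rate in enumerate(rates, start=1):
    print(f"row {index}: {rate:.1%} of reads mapped")
```

The 104 bp datasets map at a visibly higher rate than the 90 bp datasets, which is consistent with the gap between expected and effective coverage in the same rows.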
SNPs and indels discovered using UnifiedGenotyper.
| Call set | SNPs (total) | SNPs (novel) | Indels (total) | Indels (novel) |
|---|---|---|---|---|
| | 3,320,545 | 40,936 | 34,407 | 430 |
| | 3,356,829 | 60,596 | 132,144 | 2,076 |
| | 3,340,498 | 55,408 | 80,950 | 1,227 |
| | 3,277,433 | 46,448 | 56,189 | 756 |
| | 3,346,221 | 55,753 | 54,229 | 529 |
| | 3,393,037 | 98,383 | 32,743 | 502 |
Novel: compared to dbSNP138.
SNPs and indels discovered using HaplotypeCaller.
| Call set | SNPs (total) | SNPs (novel) | Indels (total) | Indels (novel) |
|---|---|---|---|---|
| | 3,540,735 | 57,905 | 614,241 | 35,624 |
| | 3,504,854 | 58,578 | 668,779 | 41,558 |
| | 3,569,295 | 59,510 | 739,347 | 50,617 |
| | 3,463,094 | 60,344 | 589,891 | 34,249 |
| | 3,539,933 | 79,869 | 718,734 | 44,571 |
| | 3,613,663 | 72,099 | 217,365 | 57,056 |
Novel: compared to dbSNP138.
Comparisons of total and novel SNP and indel call sets generated from the genomes of S1 and S2 using UnifiedGenotyper. For each sample, the call sets from the BGI, TÜBİTAK, and pooled datasets are compared.
| Call set | SNPs (total) | SNPs (novel) | Indels (total) | Indels (novel) |
|---|---|---|---|---|
| | 3,167,254 | 36,273 | 23,293 | 232 |
| | 75,839 | 16,073 | 67,478 | 1,239 |
| | 56,906 | 1,444 | 3,525 | 56 |
| | 22,737 | 8,896 | 11,647 | 300 |
| ( | 29,807 | 615 | 1,476 | 26 |
| ( | 83,929 | 7,635 | 39,897 | 579 |
| ( | 66,578 | 2,604 | 6,113 | 116 |
| | 3,164,900 | 42,518 | 12,823 | 93 |
| | 40,492 | 4,899 | 22,599 | 258 |
| | 62,748 | 46,415 | 34,980 | 581 |
| | 62,029 | 2,314 | 3,567 | 219 |
| ( | 12,972 | 251 | 5,420 | 35 |
| ( | 127,857 | 8,085 | 13,387 | 143 |
| ( | 66,578 | 2,604 | 6,113 | 116 |
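Comparisons of this kind reduce to set operations on variant calls: the intersection gives calls shared by two datasets, and the set differences give calls unique to each. The sketch below illustrates the idea on toy data. It assumes two calls are the same variant when chromosome, position, reference allele, and alternate allele all match exactly; the matching rule actually used in the study is not stated here, and the names `Variant` and `compare_call_sets` are hypothetical.

```python
from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    """A variant call identified by its site and alleles (assumed key)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def compare_call_sets(a: Iterable[Variant], b: Iterable[Variant]) -> dict:
    """Count shared and dataset-specific calls between two call sets."""
    set_a, set_b = set(a), set(b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "shared": len(shared),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        # Jaccard index: shared calls over all distinct calls.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy call sets standing in for the same sample sequenced at two centers.
center_a = [
    Variant("chr1", 1000, "A", "G"),
    Variant("chr1", 2000, "C", "T"),
    Variant("chr2", 500, "G", "GA"),
]
center_b = [
    Variant("chr1", 1000, "A", "G"),
    Variant("chr2", 500, "G", "GA"),
    Variant("chr3", 42, "T", "C"),
    Variant("chr3", 99, "AT", "A"),
]

result = compare_call_sets(center_a, center_b)
print(result)
```

Exact matching is the strictest choice: indels in repeats can be represented at different positions by different callers, so a real comparison would usually normalize calls first.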
Comparisons of total and novel SNP and indel call sets generated from the genomes of S1 and S2 using HaplotypeCaller. For each sample, the call sets from the BGI, TÜBİTAK, and pooled datasets are compared.
| Call set | SNPs (total) | SNPs (novel) | Indels (total) | Indels (novel) |
|---|---|---|---|---|
| | 3,373,868 | 43,693 | 552,114 | 22,090 |
| | 36,182 | 7,005 | 7,863 | 6,189 |
| | 55,145 | 6,663 | 9,729 | 3,735 |
| | 25,347 | 2,418 | 27,621 | 5,919 |
| ( | 18,223 | 1,015 | 794 | 235 |
| ( | 76,581 | 6,865 | 108,008 | 13,044 |
| ( | 93,499 | 6,534 | 51,604 | 9,564 |
| | 3,334,025 | 46,783 | 543,893 | 22,332 |
| | 35,153 | 18,073 | 4,807 | 1,762 |
| | 52,188 | 8,034 | 16,981 | 6,611 |
| | 43,596 | 10,903 | 54,639 | 9,291 |
| ( | 5,797 | 600 | 687 | 175 |
| ( | 164,958 | 14,413 | 169,347 | 20,302 |
| ( | 71,084 | 4,927 | 28,330 | 5,131 |
Fig 1. Underlying sequence content of novel SNP and indel calls.
A) SNPs and B) indels in the genome of S1. C) SNPs and D) indels in the genome of S2.
Detailed view of novel SNP and indel distributions of S1 that map to common repeats.
| Repeat class | SNPs (all) | SNPs | SNPs | Indels (all) | Indels | Indels |
|---|---|---|---|---|---|---|
| Total | 31,226 | 13,279 | 1,840 | 1,081 | 897 | 89 |
| SINE/Alu | 8,911 | 4,175 | 706 | 204 | 196 | 5 |
| LINE/L1 | 8,779 | 3,581 | 332 | 415 | 330 | 33 |
| LTR/ERV | 5,370 | 2,022 | 263 | 84 | 74 | 4 |
| Low compl. | 429 | 196 | 55 | 63 | 41 | 11 |
| Satellite | 237 | 89 | 14 | 9 | 7 | 0 |
| Simple rep. | 1,605 | 1,011 | 312 | 151 | 118 | 27 |
| Other | 5,895 | 2,205 | 158 | 155 | 131 | 9 |
Detailed view of novel SNP and indel distributions of S2 that map to common repeats.
| Repeat class | SNPs (all) | SNPs | SNPs | Indels (all) | Indels | Indels |
|---|---|---|---|---|---|---|
| Total | 28,483 | 7,597 | 1,907 | 517 | 204 | 265 |
| SINE/Alu | 9,499 | 4,048 | 507 | 71 | 45 | 24 |
| LINE/L1 | 7,396 | 1,331 | 511 | 208 | 71 | 112 |
| LTR/ERV | 4,360 | 434 | 221 | 66 | 20 | 38 |
| Low compl. | 653 | 399 | 59 | 32 | 17 | 12 |
| Satellite | 260 | 61 | 29 | 0 | 0 | 0 |
| Simple rep. | 1,489 | 784 | 410 | 54 | 26 | 27 |
| Other | 4,826 | 540 | 170 | 86 | 25 | 52 |
Distribution of discrepant novel SNP-indels of S1 and S2 over gene regions.
| | Novel discrepant SNP-Indels of S1 | | | | Novel discrepant SNP-Indels of S2 | | | |
|---|---|---|---|---|---|---|---|---|
| Region | SNP | Indel | SNP | Indel | SNP | Indel | SNP | Indel |
| Total | 4,048 | 172 | 23,708 | 1,818 | 3,679 | 628 | 12,984 | 401 |
| intergenic | 2,191 | 107 | 13,451 | 1,029 | 2,261 | 358 | 6,470 | 249 |
| intronic | 1,506 | 50 | 8,899 | 694 | 1,196 | 233 | 5,016 | 126 |
| upstream | 62 | 2 | 139 | 10 | 34 | 2 | 467 | 4 |
| downstream | 44 | 1 | 144 | 8 | 28 | 2 | 89 | 3 |
| UTR5 | 33 | 0 | 36 | 1 | 5 | 1 | 228 | 1 |
| UTR3 | 29 | 3 | 199 | 17 | 21 | 5 | 96 | 5 |
| exonic nonsyn | 26 | 0 | 129 | 0 | 5 | 0 | 131 | 0 |
| exonic syn | 24 | 0 | 47 | 0 | 7 | 0 | 42 | 0 |
| exonic stopgain | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| exonic unknown | 0 | 0 | 1 | 0 | 0 | 0 | 4 | 0 |
| exonic | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ex. frmshift del | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| ex. nonfrmshift del | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ex. nonfrmshift ins | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| splicing | 1 | 0 | 13 | 1 | 2 | 0 | 31 | 0 |
| ncRNA intronic | 114 | 9 | 609 | 55 | 116 | 26 | 357 | 12 |
| ncRNA exonic | 17 | 0 | 33 | 0 | 4 | 0 | 39 | 1 |
| ncRNA UTR5 | 1 | 0 | 1 | 0 | 0 | 0 | 8 | 0 |
| ncRNA UTR3 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 |
| ncRNA splicing | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
ex. frmshift del: exonic frameshift deletion
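The table supports the statement that only a small fraction of discordant variants fall within exons. As a rough illustration, the sketch below takes the first SNP column of the table, checks that the per-region counts add up to the reported total of 4,048, and computes the share annotated as protein-coding exonic. Grouping the rows whose labels start with "exonic" or "ex." as protein-coding exonic is an assumption made for this example, and the helper name `is_coding_exonic` is hypothetical.

```python
# Counts from the first SNP column of the table above, keyed by region.
first_snp_column = {
    "intergenic": 2191,
    "intronic": 1506,
    "upstream": 62,
    "downstream": 44,
    "UTR5": 33,
    "UTR3": 29,
    "exonic nonsyn": 26,
    "exonic syn": 24,
    "exonic stopgain": 0,
    "exonic unknown": 0,
    "exonic": 0,
    "ex. frmshift del": 0,
    "ex. nonfrmshift del": 0,
    "ex. nonfrmshift ins": 0,
    "splicing": 1,
    "ncRNA intronic": 114,
    "ncRNA exonic": 17,
    "ncRNA UTR5": 1,
    "ncRNA UTR3": 0,
    "ncRNA splicing": 0,
}


def is_coding_exonic(region: str) -> bool:
    """Assumed grouping: labels starting with 'exonic' or 'ex.'."""
    return region.startswith(("exonic", "ex."))


# Sum of all regions; this matches the "Total" row for the column.
total = sum(first_snp_column.values())

# Calls in protein-coding exons only (ncRNA rows are excluded).
exonic = sum(n for region, n in first_snp_column.items() if is_coding_exonic(region))
exonic_fraction = exonic / total

print(f"{exonic} of {total} discrepant novel SNPs are exonic ({exonic_fraction:.2%})")
```

Intergenic and intronic calls together account for the large majority of this column, which matches the pattern described in the abstract.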
Comparisons of total and novel SNP and indel intersections of B1 vs. T1 and B2 vs. T2. B1, T1: pooled S1 calls from BGI and TÜBİTAK datasets; B2, T2: pooled S2 calls from BGI and TÜBİTAK datasets, respectively.
| Call set | SNPs (total) | SNPs (novel) | Indels (total) | Indels (novel) |
|---|---|---|---|---|
| | 3,308,870 | 41,289 | 79,948 | 1,195 |
| | 25,857 | 13,536 | 651 | 17 |
| | 5,771 | 483 | 351 | 15 |
| | 3,321,318 | 51,526 | 32,391 | 468 |
| | 70,068 | 46,592 | 121 | 11 |
| | 1,651 | 265 | 231 | 23 |
Comparisons of total and novel SNP and indel intersections of B1 vs. T1 and B2 vs. T2. B1, T1: pooled S1 calls from BGI and TÜBİTAK datasets using HaplotypeCaller; B2, T2: pooled S2 calls from BGI and TÜBİTAK datasets, respectively.
| Call set | SNPs (total) | SNPs (novel) | Indels (total) | Indels (novel) |
|---|---|---|---|---|
| | 3,551,861 | 57,010 | 735,208 | 49,637 |
| | 5,653 | 1,164 | 1,396 | 346 |
| | 11,781 | 1,336 | 2,743 | 634 |
| | 3,595,114 | 69,416 | 789,834 | 55,740 |
| | 11,140 | 1,722 | 3,687 | 719 |
| | 7,409 | 961 | 2,688 | 597 |