Mark J P Chaisson, Ashley D Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J Gardner, Oscar L Rodriguez, Li Guo, Ryan L Collins, Xian Fan, Jia Wen, Robert E Handsaker, Susan Fairley, Zev N Kronenberg, Xiangmeng Kong, Fereydoun Hormozdiari, Dillon Lee, Aaron M Wenger, Alex R Hastie, Danny Antaki, Thomas Anantharaman, Peter A Audano, Harrison Brand, Stuart Cantsilieris, Han Cao, Eliza Cerveira, Chong Chen, Xintong Chen, Chen-Shan Chin, Zechen Chong, Nelson T Chuang, Christine C Lambert, Deanna M Church, Laura Clarke, Andrew Farrell, Joey Flores, Timur Galeev, David U Gorkin, Madhusudan Gujral, Victor Guryev, William Haynes Heaton, Jonas Korlach, Sushant Kumar, Jee Young Kwon, Ernest T Lam, Jong Eun Lee, Joyce Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li, Patrick Marks, Karine Viaud-Martinez, Sascha Meiers, Katherine M Munson, Fabio C P Navarro, Bradley J Nelson, Conor Nodzak, Amina Noor, Sofia Kyriazopoulou-Panagiotopoulou, Andy W C Pang, Yunjiang Qiu, Gabriel Rosanio, Mallory Ryan, Adrian Stütz, Diana C J Spierings, Alistair Ward, AnneMarie E Welch, Ming Xiao, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley, Ernesto Lowy, Sergei Yakneen, Steven McCarroll, Goo Jun, Li Ding, Chong Lek Koh, Bing Ren, Paul Flicek, Ken Chen, Mark B Gerstein, Pui-Yan Kwok, Peter M Lansdorp, Gabor T Marth, Jonathan Sebat, Xinghua Shi, Ali Bashir, Kai Ye, Scott E Devine, Michael E Talkowski, Ryan E Mills, Tobias Marschall, Jan O Korbel, Evan E Eichler, Charles Lee.
Abstract
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios and define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome, and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three- to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community, allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.
Year: 2019 PMID: 30992455 PMCID: PMC6467913 DOI: 10.1038/s41467-018-08148-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Summary of sequencing statistics
| Technology | Avg. seq. coverage (×) | Avg. frag. length (bp) | Physical coverage (×) |
|---|---|---|---|
| Pacific Biosciences | 39.6 (child) | 8165 (child) | 39.6 |
| Oxford Nanopore | 18.9 (HG00733) | 11,993 | 18.9 |
| Illumina short insert | 74.5 | 694 | 171 |
| Illumina liWGS | 3 | 3475 | 159 |
| Illumina 7 kb JMP | 1.1 | 6973.2 | 39.2 |
| 10X Chromium | 82.4 | 90,098 | 53.9 |
| Bionano Genomics | N/A | 2.81E+05 | 116.7 |
| Tru-Seq SLR | 3.47 | 4900 | 3.47 |
| Strand-seq | N/A | N/A | 5.87 |
| Hi-C | 19.49 | 1.03E+07 | N/A |
| Total | 223.56 | | 607.08 |
Physical coverage is given for the Illumina short insert, liWGS, and 7 kb JMP libraries. For 10X Chromium, physical coverage is the estimated read cloud coverage.
For Hi-C, fragment length is the distance between the two read ends of intra-chromosomal read pairs.
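The distinction between sequence coverage and physical coverage in the table can be illustrated with a small calculation. The sketch below assumes paired-end libraries in which each fragment contributes two reads; the function name and the 150 bp read length are assumptions for illustration and are not stated in the table.

```python
def physical_coverage(seq_coverage, fragment_length, read_length, reads_per_fragment=2):
    """Estimate physical coverage from sequence coverage.

    Sequence coverage counts only sequenced bases, while physical coverage
    counts every base spanned by a fragment, so the two differ by the ratio
    of fragment length to sequenced bases per fragment.
    """
    sequenced_bases_per_fragment = read_length * reads_per_fragment
    return seq_coverage * fragment_length / sequenced_bases_per_fragment


# Illumina short insert row: 74.5x sequence coverage, 694 bp fragments.
# The 150 bp read length is an ASSUMPTION, not a value from the table.
estimate = physical_coverage(seq_coverage=74.5, fragment_length=694, read_length=150)
print(round(estimate, 1))
```

Under that assumed read length the estimate lands close to the 171× reported for the short-insert library, which is why long-insert libraries with low sequence coverage can still reach high physical coverage.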
Fig. 1 Characteristics of SNV-based haplotypes obtained from different data sources. a Distribution of phased block lengths for the YRI child NA19240. Note that Strand-seq haplotypes span whole chromosomes and therefore one block per chromosome is shown. Vertical bars highlight N50 haplotype length: the minimum length haplotype block at which at least half of the phased bases are contained. For Illumina (IL) paired-end data, phased blocks cover <50% of the genome and hence the N50 cannot be computed. b Fraction of phase connections, i.e., pairs of consecutive heterozygous variants, provided by each technology (averaged over all proband samples). c Pairwise comparisons of different phasings; colors encode switch error rates (averaged over all proband samples). For each row, a green box indicates the phasing of an independent technology with best agreement, with corresponding switch error rates given in green. d Each phased block is shown in a different color. The largest block is shown in cyan, i.e., all cyan regions belong to one block, even though interspaced by white areas (genomic regions where no variants are phased) or disconnected small blocks (different colors). e Fraction of heterozygous SNVs in the largest block shown in d. f Mismatch error rate of largest block compared to trio-based phasing, averaged over all chromosomes of all proband genomes (i.e., the empirical probability that any two heterozygous variants on a chromosome are phased correctly with respect to each other, in contrast to the switch error rate, which relays the probability that any two adjacent heterozygous variants are phased correctly). (*) Not available because trio phasing is used as reference for comparisons. (**) Not shown as population-based phasing does not output block boundaries; refer to Supplementary Material for an illustration of errors in population-based phasing.
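The N50 block length and the two error rates named in the caption can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: haplotypes are represented as strings of 0/1 alleles at heterozygous sites, all function names are hypothetical, and the mismatch rate is computed as a Hamming-style rate, which is one common formulation and an assumption here.

```python
def n50(block_lengths):
    """Length of the block at which the running total, taken from the
    longest block downward, first reaches half of all phased bases."""
    total = sum(block_lengths)
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no blocks given")


def switch_error_rate(truth, pred):
    """Fraction of ADJACENT heterozygous pairs whose relative phase differs."""
    if len(truth) != len(pred) or len(truth) < 2:
        raise ValueError("need two equal-length haplotypes with >= 2 sites")
    differs = [t != p for t, p in zip(truth, pred)]
    switches = sum(a != b for a, b in zip(differs, differs[1:]))
    return switches / (len(truth) - 1)


def mismatch_error_rate(truth, pred):
    """Hamming-style rate: sites on the wrong haplotype, allowing for the
    arbitrary labeling of the two haplotypes (assumed formulation)."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("need two equal-length, non-empty haplotypes")
    mismatches = sum(t != p for t, p in zip(truth, pred))
    return min(mismatches, len(truth) - mismatches) / len(truth)


print(n50([10, 20, 30, 40]))                      # 30
print(switch_error_rate("000000", "000111"))      # 0.2
print(mismatch_error_rate("000000", "000111"))    # 0.5
```

The toy haplotypes show why the two measures are reported separately: a single switch in the middle of a block is one error among five adjacent pairs, yet it places half of the sites on the wrong haplotype.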
Fig. 2 Comparison and integration of indel and SV callsets on HG00733, HG00514, and NA19240. a Length distribution of deletions and insertions identified by PB (blue), IL (red) and BNG (brown), respectively, together with averaged length distribution of SVs discovered in the maternal genomes by the 1KG-P3 report (silver). b Number of SVs discovered by one or multiple genome platforms in the YRI child NA19240. c Overlap of IL indel discovery algorithms, with total number of indels found by each combination of IL algorithms (gray) and those that overlapped with a PB indel (blue) in the CHS child HG00514.
Fig. 3 Characterization of simple and complex inversions. a Integration of inversions across platforms based on reciprocal overlap. Shown is an example of five orthogonal platforms intersecting at a homozygous inversion, with breakpoint ranges and supporting Strand-seq signature illustrated in bottom panels. b Size distribution of inversions included in the unified inversion list, subdivided by technology, with the total inversions (N) contributed by each listed. c Classification of Strand-seq inversions based on orthogonal phase support. Illustrative examples of simple (homozygous and heterozygous) and complex (inverted duplication) events are shown. Strand-seq inversions were identified based on read directionality (read count; reference reads in gray, inverted reads in purple), the relative ratio of reference to inverted reads within the locus (read ratio), and the haplotype structure of the inversion, with phased read data considered in terms of directionality (Ph; H1 alleles in red, H2 alleles in blue; alleles from reference reads are displayed above the ideogram and alleles from inverted reads are displayed below). ILL Illumina. liWGS long-insert whole-genome sequencing libraries. PB Pacific Biosciences. StS Strand-seq. BNG Bionano Genomics. SD segmental duplication. Ph phase data.
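Reciprocal overlap, the criterion mentioned in panel a for integrating inversion calls, can be sketched as follows. The half-open coordinate convention, the function names, and the 50% threshold are assumptions for illustration; the caption does not state the threshold that was used.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals a and b.

    Intervals are (start, end) tuples, half-open, on the same chromosome.
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end <= a_start or b_end <= b_start:
        raise ValueError("intervals must have positive length")
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    return min(shared / (a_end - a_start), shared / (b_end - b_start))


def same_event(a, b, threshold=0.5):
    """Treat two calls as one event if BOTH overlap fractions reach the
    threshold (0.5 is an assumed, commonly used value)."""
    return reciprocal_overlap(a, b) >= threshold


print(reciprocal_overlap((100, 200), (150, 250)))   # 0.5
print(same_event((100, 200), (100, 1000)))          # False
```

Requiring the overlap fraction in both directions keeps a small call nested inside a much larger one from being merged with it, as the second example shows.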
Unified technology callset for copy number gain and loss structural variation
| Technology | HG00514 count | HG00514 avg. length (bp) | HG00733 count | HG00733 avg. length (bp) | NA19240 count | NA19240 avg. length (bp) |
|---|---|---|---|---|---|---|
| Deletions | | | | | | |
| PacBio | 4662 | 195.7 | 5074 | 193.1 | 5586 | 205.0 |
| Illumina | 1387 | 792.0 | 1251 | 563.1 | 1760 | 664.9 |
| Bionano | 109 | 117,901.6 | 88 | 86,440.9 | 113 | 115,099.1 |
| PacBio,Illumina | 3403 | 298.0 | 3482 | 294.2 | 4132 | 308.3 |
| PacBio,Bionano | 128 | 7569.5 | 128 | 4523.8 | 119 | 4633.7 |
| Illumina,Bionano | 50 | 8516.0 | 42 | 9680.3 | 54 | 9767.3 |
| All | 552 | 3997.0 | 542 | 4076.8 | 657 | 3647.4 |
| Total | 10,291 | 1892.6 | 10,607 | 1273.7 | 12,421 | 1615.9 |
| Insertions | | | | | | |
| PacBio | 11,314 | 294.2 | 12,272 | 302.7 | 12,080 | 285.0 |
| Illumina | 533 | 501.4 | 483 | 1610.6 | 663 | 2163.7 |
| Bionano | 473 | 21,452.6 | 418 | 18,346.5 | 497 | 16,700.6 |
| PacBio,Illumina | 2146 | 239.5 | 2236 | 262.5 | 2631 | 260.6 |
| PacBio,Bionano | 984 | 2539.3 | 1052 | 2510.8 | 1035 | 2501.2 |
| Illumina,Bionano | 33 | 14,733.4 | 22 | 5905.0 | 26 | 6541.4 |
| All | 83 | 3500.9 | 83 | 3808.6 | 94 | 3751.8 |
| Total | 15,566 | 1126.3 | 16,566 | 955.9 | 17,026 | 997.0 |
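The Total rows appear to be the sum of the counts and the count-weighted mean of the average lengths. The short check below reproduces the HG00514 values for the first block of the table; the variable names are illustrative, and the weighted-mean interpretation is inferred from the numbers rather than stated in the source.

```python
# (count, average length in bp) for HG00514, first block of the table,
# in row order: PacBio, Illumina, Bionano, PacBio+Illumina,
# PacBio+Bionano, Illumina+Bionano, All.
hg00514_rows = [
    (4662, 195.7),
    (1387, 792.0),
    (109, 117901.6),
    (3403, 298.0),
    (128, 7569.5),
    (50, 8516.0),
    (552, 3997.0),
]

total_count = sum(count for count, _ in hg00514_rows)
total_bases = sum(count * avg_len for count, avg_len in hg00514_rows)
weighted_mean_length = total_bases / total_count

print(total_count)                      # 10291
print(round(weighted_mean_length, 1))   # agrees with the reported 1892.6 up to rounding
```

The small residual difference is expected because the per-row averages in the table are themselves rounded to one decimal place.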
Fig. 4 Concordance of IL methods compared against total IL callset and PB callset using orthogonal technologies. Results by algorithm shown for a the deletion concordance for individual methods, b the union of all pairs of methods, and c the requirement that more than one caller agree on any call. Individual callers are shown as red points for comparison. Pairs and triples of combinations are in black points. The solid and dashed lines represent the 5% and 10% non-concordance rates (NCR), respectively. The top five combinations of methods in each plot below the 10% NCR, along with the individual plots, are labeled. d Overlap of IL-SV discovery algorithms, with total number of SVs found by each combination of IL algorithms (gray) and those that overlapped with the PB-SV calls (blue) in the YRI child NA19240. e PCA of the genotypes of concordant calls of each method: PC 1 versus 2 (left), PC 2 versus 3 (right). VH VariationHunter.
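The two combination strategies compared in panels b and c (the union of caller pairs versus calls supported by more than one caller) can be illustrated with set operations. Everything in this sketch is hypothetical: the caller names, the call identifiers, and the validated set are toy data, and NCR is assumed here to mean the fraction of calls in a set that lack orthogonal support, which the caption does not define explicitly.

```python
from collections import Counter
from itertools import combinations


def non_concordance_rate(calls, validated):
    """Fraction of calls without orthogonal support (assumed NCR definition)."""
    if not calls:
        return 0.0
    return len(calls - validated) / len(calls)


# Hypothetical toy data: integer call IDs per caller and a validated set.
callers = {
    "caller_a": {1, 2, 3, 4},
    "caller_b": {3, 4, 5, 6},
    "caller_c": {4, 6, 7},
}
validated = {1, 3, 4, 6}

# Strategy 1: union of each pair of callers (more calls, more noise).
pair_union_ncr = {
    (a, b): non_concordance_rate(callers[a] | callers[b], validated)
    for a, b in combinations(sorted(callers), 2)
}

# Strategy 2: keep only calls reported by at least two callers.
support = Counter(call for calls in callers.values() for call in calls)
consensus = {call for call, n in support.items() if n >= 2}
consensus_ncr = non_concordance_rate(consensus, validated)

print(pair_union_ncr)
print(sorted(consensus), consensus_ncr)
```

In this toy data the consensus set is smaller but fully supported, while each pairwise union gains calls at the cost of a higher non-concordant fraction, the same trade-off that the panels plot for real callers.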