Zhichao Liu, Ruth Roberts, Timothy R Mercer, Joshua Xu, Fritz J Sedlazeck, Weida Tong.
Abstract
Structural variants (SVs) are a major source of human genetic diversity and have been associated with different diseases and phenotypes. The detection of SVs is difficult, and a diverse range of detection methods and data analysis protocols has been developed. This difficulty and diversity make the detection of SVs for clinical applications challenging and call for a framework that ensures accuracy and reproducibility. Here, we discuss current developments in the diagnosis of SVs and propose a roadmap for the accurate and reproducible detection of SVs that includes case studies from the FDA-led SEquencing Quality Control Phase II (SEQC-II) and other consortium efforts.
Year: 2022 PMID: 35241127 PMCID: PMC8892125 DOI: 10.1186/s13059-022-02636-8
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1 Roadmap towards accurate and reproducible SV detection
Fig. 2 An insight into the reference samples and efforts for SV detection by the SEQC-II consortium.

Great strides have been made in advancing SV detection by the SEQC-II consortium (Fig. 2). First, the SEQC-II consortium established high-quality SV calling sets based on multi-platform sequencing of tumor-normal reference samples and partially verified them using orthogonal methods, including PCR-based validation, cytogenetic array, BioNano optical mapping, and fusion genes detected from RNA-seq [35]. Meanwhile, SEQC-II systematically evaluated the reproducibility of somatic SV detection across platforms and benchmarked the performance of various software tools. Leveraging the high-quality SV calling sets, the consortium developed a deep learning-based calling algorithm for SV detection using a convolutional neural network (CNN). The proposed deep learning models were highly robust across multiple sequencing technologies for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages. They were significantly superior to conventional detection approaches in general, as well as in challenging situations such as low coverage, low variant allele frequency, DNA damage, and complex genomic regions. Furthermore, a CNV inference method was developed based on the generated single-cell RNA-seq data.

Second, the SEQC-II consortium comprehensively investigated the performance and confounding factors (i.e., long- or short-read sequencing, capture panels, and bioinformatics pipelines) of gene fusion detection [28]. Long-read sequencing achieved higher precision and discovered more novel fusion genes. Short-read sequencing achieved greater sensitivity for detecting known fusion genes correlated with the endogenous expression of targeted genes.

Third, the SEQC-II consortium ranked the sources of divergence in SV detection using multiple Illumina-based short-read sequencing datasets of the Chinese Quartet reference samples. Interestingly, mapping methods were a major source of calling variability, followed by sequencing centers and replicates. Surprisingly, SVs supported by only one site or technical replicate were often true positives, as defined by long-read PacBio sequencing, consistent with an overall higher false-negative rate for SV calling [36].
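The cross-site and cross-replicate comparisons described above depend on deciding when two SV calls count as the same event. The following is a minimal sketch of one common convention, reciprocal overlap, and is not the procedure used by SEQC-II. The `SV` record, the 50% threshold, and the example coordinates are all illustrative assumptions, and the approach only suits interval-type SVs (it does not handle insertions or breakends).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical minimal SV record (half-open interval [start, end))."""
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL", "DUP", "INV"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if no match is possible."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    if a.end <= a.start or b.end <= b.start:
        return 0.0  # zero-length events are out of scope for this sketch
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def concordance(calls_a, calls_b, min_ro=0.5):
    """Split two call sets into shared and set-specific calls."""
    def matched(call, others):
        return any(reciprocal_overlap(call, o) >= min_ro for o in others)

    return {
        "shared": [a for a in calls_a if matched(a, calls_b)],
        "only_a": [a for a in calls_a if not matched(a, calls_b)],
        "only_b": [b for b in calls_b if not matched(b, calls_a)],
    }


# Made-up calls from two technical replicates.
rep1 = [SV("chr1", 1000, 2000, "DEL"), SV("chr1", 5000, 5600, "DUP"), SV("chr2", 300, 900, "DEL")]
rep2 = [SV("chr1", 1100, 2050, "DEL"), SV("chr1", 5000, 5200, "DUP")]

result = concordance(rep1, rep2)
print({name: len(calls) for name, calls in result.items()})
```

As the text notes, calls that land in the replicate-specific sets are not necessarily false positives, so counts like these describe reproducibility, not accuracy.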
A comparison among different sequencing technologies for structural variant detection
| Platform | Read length | Cost | Comments | Throughput / run time |
|---|---|---|---|---|
| Short reads (Illumina) | NovaSeq: up to 250 bp | $ | Short-read NGS performs well for >1 kb regions. It struggles with detection of shorter CNVs (50–500 bp) and in complex genomic regions | NovaSeq: 0.15 Tb/day |
| 10X Genomics Chromium | Up to ~100 kb | $$ | Sparse sequencing rather than true long reads; more complicated to align, with poorer resolution of locally repetitive sequences. However, 10X Genomics Chromium is currently discontinued | - |
| PacBio SMRT sequencing | 10–15 kb (average) and up to 100 kb | $$$ | HiFi: long reads (10–20 kbp) of high fidelity with an error rate similar to Illumina. CLR: longer raw reads with high error rates dominated by false insertions; requires new alignment and error-correction algorithms | 20 Gb/day |
| Oxford Nanopore | Averaging ~10 kb and up to 2 Mb | $$$ | Raw reads have ~5% error rates dominated by false deletions and homopolymer errors; often requires new alignment and error-correction algorithms | MinION Flow Cell: ~25 Gb/day |
| Hi-C-based analysis | <100 bp | $$ | Sparse sequencing with highly variable genomic distance between pairs (1 kb to 1 Mb or longer). Detection may result from random chromosomal collisions. Less than 1% of DNA fragments actually yield ligation products. Because of its multiple steps, the method requires large amounts of starting material | Whole analysis within 28 hours |
| BioNano Genomics optical mapping | ~250 kb or longer | $ | Limited algorithms to discover high-confidence alignment between an optical map and a sequence assembly | 100x coverage of 3 human genomes is collected in less than 6 hours |
Benchmark datasets generated by SEQC-II consortium efforts and their potential applications in SV detection
| Working group | Reference samples | Benchmark data | Potential benefit for SV detection | Link |
|---|---|---|---|---|
| Somatic mutation [ | | | • Somatic SV benchmark establishment • Low allelic frequency (LOF) somatic SV detection in liquid biopsy or FFPE samples • Deep learning-based somatic SV detection • Reproducibility and repeatability assessment of somatic SV detection based on multiple sample and design | |
| Oncopanel [ | | | • Reproducibility and repeatability assessment of actionable somatic SV assessment • Benefit of gene fusion detection by integrating DNAseq and RNAseq | |
| Germline mutation [ | Chinese Quartet samples (B-lymphocyte cell line and blood samples) | | • Influential factors on reproducibility assessment for germline SV detection • Germline SV detection concordance between B-lymphocyte cell line and blood samples • Deep learning-based somatic SV detection • Cross check the best practice of germline SV detection with NIST efforts | NODE OEP001896 (Chinese Quartet Samples) |
| | HapMap samples (HG001) | | | |
| | HapMap Ashkenazi Trio | | | |
| | Bacterial genomes (ATCC MSA-3001) | MiSeq, Ion PGM, Ion S5, MinION, Flongle, and GenapSys | | |
Fig. 3 The role of AI in promoting SV detection.

AI-powered natural language processing (NLP) for SV calling. Given the suboptimal completeness and accuracy of current SV calling algorithms, we suggest that deep learning may be an alternative worth further exploration. CNNs, which treat BAM files as images, are the primary deep learning algorithm investigated for SV detection. The rapid development of deep learning algorithms such as AI-powered NLP has not only provided unprecedented innovation in information retrieval from free-text documents but has also been repurposed for other types of biological information, such as chemical structures and protein sequences [104]. Here, we developed a hypothesis that likens chromosomes to paragraphs, sequence reads to sentences, and different A, T, G, C combinations (e.g., tandem repeats and microsatellites) to vocabulary. AI-powered language models such as transformers [105–107] could then be used to digest a genome sequence much as a human reads a book. The differences between the sample genome and the reference genome (i.e., variants) could be extracted by comparing transformer-based genome embeddings, which is very similar to the rationale behind de novo assembly (A).

Reinforcement learning for optimizing meta-caller combination. Multiple callers could potentially be integrated using more sophisticated approaches than simple heuristic union/intersection rules to improve SV detection. Artificial intelligence (AI) may be a solution. The rapid evolution of emerging genomics technologies suggests that improved SV detection should be taking place. The ideal strategy for combining different SV callers is to exploit the strengths of each caller while eliminating false positives, which fits well with the concept of reinforcement learning.
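For reference, the "simple heuristic union/intersection rules" mentioned above can be written as a single support threshold: a call is kept if at least `min_support` callers report it. This is only a sketch of that baseline. It assumes calls from different callers have already been harmonized to shared identifiers (for example by breakpoint or overlap matching), and the caller names and identifiers are made up.

```python
from collections import Counter
from typing import Dict, Set


def combine_calls(callsets: Dict[str, Set[str]], min_support: int) -> Set[str]:
    """Keep every SV reported by at least `min_support` callers.

    min_support=1 gives the union; min_support=len(callsets) gives the intersection.
    """
    support = Counter(sv for calls in callsets.values() for sv in calls)
    return {sv for sv, n in support.items() if n >= min_support}


# Hypothetical, already-harmonized call identifiers from three callers.
callsets = {
    "caller_a": {"DEL:chr1:1000-2000", "DUP:chr1:5000-5600"},
    "caller_b": {"DEL:chr1:1000-2000", "INV:chr3:700-9000"},
    "caller_c": {"DEL:chr1:1000-2000", "DUP:chr1:5000-5600"},
}

union = combine_calls(callsets, 1)
majority = combine_calls(callsets, 2)
intersection = combine_calls(callsets, len(callsets))
print(len(union), len(majority), len(intersection))
```

A fixed threshold treats every caller as equally trustworthy for every SV type, which is the limitation the learned combination strategies discussed here aim to address.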
Reinforcement learning is a branch of machine learning concerned with how intelligent agents ought to take actions in an environment to maximize cumulative reward [108]. AlphaGo is an excellent example of a reinforcement learning application [109]. Reinforcement learning could be used to develop intelligent ensemble SV callers that maximize SV detection performance (B). For each type of SV, the combination of SV callers could be learned by minimizing a loss function that measures the divergence between the called SVs and the ground truth. Ultimately, reinforcement learning-based ensemble SV callers would allow the integration of any individual caller, incorporate different SV types, and incorporate the advantages of newly developed technologies.

Generative adversarial network (GAN)-based SV simulation. A truth set is key to investigating the accuracy and reproducibility of SV detection. Unfortunately, the complexity of SV events and the associated genome properties are central to the whole picture of SV events in a sample, hampering objective evaluation. Many reports suggest that simulated ground truth cannot recapitulate genome and SV characteristics or mimic the actual patient situation. Therefore, simulated SV truth sets with high commutability are urgently needed. The GAN is a deep neural network framework that integrates a generative model and a discriminative model to generate new data similar to the statistical distribution of the training set [110]. GANs have been widely applied to image generation in fashion, art, and advertising and have attracted much attention in the scientific community. For example, one type of GAN model, named DeepFake, has been used to predict cell type-specific transcriptional states induced by drug treatment [111]. Here, we envision a GAN that simulates binary alignment map (BAM) files spiked with different SV types based on actual data (C).
The proposed GAN model would collect high-quality BAM files with varying SV types and lengths from real data, such as those from SEQC-II and other consortium efforts, as a training set to generate the target SV-spiked BAM file. The potential benefit of the proposed GAN model is that the simulated SV-spiked BAM file could maximally preserve the original matrix effects of real data, such as VAF level and tumor purity.
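Returning to the meta-caller idea in panel B, the suggestion that a caller combination "could be learned by minimizing the loss function" against ground truth can be illustrated without a full reinforcement learning setup. The sketch below deliberately substitutes a much simpler supervised stand-in: per-caller weights fitted by gradient descent on a logistic loss. It is not the reinforcement learning approach the authors envision, and the votes, labels, and function names are made up for illustration.

```python
import math


def _score(x, weights, bias):
    """Weighted sum of caller votes plus a bias term."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def fit_caller_weights(votes, truth, lr=0.5, epochs=2000):
    """Fit one weight per caller by batch gradient descent on the mean logistic loss.

    votes: list of 0/1 tuples, one entry per caller, for each candidate SV.
    truth: list of 0/1 labels from a truth set (1 = real SV).
    """
    n = len(votes)
    weights = [0.0] * len(votes[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(votes, truth):
            p = 1.0 / (1.0 + math.exp(-_score(x, weights, bias)))
            err = p - y  # gradient of the logistic loss w.r.t. the score
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def ensemble_call(x, weights, bias):
    """Accept a candidate SV when its weighted score is positive."""
    return _score(x, weights, bias) > 0.0


# Toy data: caller 0 always agrees with the truth set; callers 1 and 2 are uninformative.
votes = [(1, 1, 1), (1, 0, 1), (1, 0, 0), (1, 1, 0),
         (0, 1, 0), (0, 1, 1), (0, 0, 1), (0, 0, 0)]
truth = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_caller_weights(votes, truth)
print([round(w, 2) for w in weights], round(bias, 2))
```

In this toy case the fitted weights favor the reliable caller, which a fixed union or intersection rule cannot express. Fitting a separate set of weights per SV type would mirror the "for each type of SV" framing in the caption.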