Rute Pereira, Jorge Oliveira, Mário Sousa.
Abstract
Clinical genetics plays an important role in the healthcare system by providing a definitive diagnosis for many rare syndromes. It can also influence genetic prevention and disease prognosis, and assist in selecting the best care and treatment options for patients. Next-generation sequencing (NGS) has transformed clinical genetics, making it possible to analyze hundreds of genes at unprecedented speed and at a lower price than conventional Sanger sequencing. Despite the growing literature on NGS in a clinical setting, a gap remains among (bio)informaticians, molecular geneticists and clinicians; this review aims to fill it by presenting a general overview of NGS technology and workflow. First, we review the current NGS platforms, focusing on the two main platforms, Illumina and Ion Torrent, and discussing the major strengths and weaknesses intrinsic to each. Next, the NGS analytical bioinformatic pipelines are dissected, with emphasis on the algorithms commonly used to generate and process data and to analyze sequence variants. Finally, the main challenges around NGS bioinformatics are placed in perspective for future developments. Even with the huge achievements made in NGS technology and bioinformatics, further improvements in bioinformatic algorithms are still required to deal with complex and genetically heterogeneous disorders.
Keywords: NGS pipeline; NGS platforms; bioinformatics; clinical genetics; high throughput data
Year: 2020 PMID: 31947757 PMCID: PMC7019349 DOI: 10.3390/jcm9010132
Source DB: PubMed Journal: J Clin Med ISSN: 2077-0383 Impact factor: 4.241
Figure 1. DNA sequencing timeline, showing some of the most revolutionary and remarkable events in DNA sequencing. NG, next generation; PCR, polymerase chain reaction; SMS, single molecule sequencing; SeqLL, sequence the lower limit.
Figure 2. An overview of the next generation sequencing (NGS) bioinformatics workflow. NGS bioinformatics is subdivided into primary (blue), secondary (orange) and tertiary (green) analysis. The primary analysis consists of the detection and analysis of raw data. In the secondary analysis, the reads are aligned against the human reference genome (or assembled de novo) and variant calling is performed. The last step is the tertiary analysis, which includes variant annotation, variant filtering, prioritization, data visualization and reporting. CNV, copy number variation; ROH, runs of homozygosity; VCF, variant call format.
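The three-stage workflow summarized in the Figure 2 caption can be written down as a small lookup structure. This is only an illustrative sketch: the step labels are paraphrased from the caption, and the names `NGS_WORKFLOW`, `stage_of` and `ordered_steps` are hypothetical, not taken from the paper or from any NGS tool.

```python
from typing import Dict, List

# Stage -> steps, paraphrased from the workflow caption.
# The step labels are illustrative names, not identifiers from any tool.
NGS_WORKFLOW: Dict[str, List[str]] = {
    "primary": ["raw data detection", "raw data analysis"],
    "secondary": ["read alignment or de novo assembly", "variant calling"],
    "tertiary": [
        "variant annotation",
        "variant filtering",
        "prioritization",
        "data visualization",
        "reporting",
    ],
}


def stage_of(step: str) -> str:
    """Return the analysis stage that a given step belongs to."""
    for stage, steps in NGS_WORKFLOW.items():
        if step in steps:
            return stage
    raise KeyError(f"unknown workflow step: {step!r}")


def ordered_steps() -> List[str]:
    """Flatten the workflow into the order in which the steps are run."""
    return [step for steps in NGS_WORKFLOW.values() for step in steps]
```

For example, `stage_of("variant calling")` returns `"secondary"`, matching the caption's placement of variant calling after alignment.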
Figure 3. Schematic representation of the primary analysis workflow in Ion Torrent. Briefly, the signal emitted from nucleotide incorporation is detected by the sensor, which converts the raw voltage data into a DAT file. This file serves as input to the server, which converts it into a WELLS file. The WELLS file is in turn used as input to the Ion Torrent Basecaller module, which produces a final BAM file, ready for the secondary analysis.
Figure 4. Summary of some widely used base-calling software available for the Illumina platform. The software is grouped according to the input file: INT (intermediate executable code) text format for the older tools and CIF (cluster intensity files) for the most recent platforms.
Figure 5. Schematic representation of the main steps involved in the post-alignment process.
Figure 6. Summary of the main methods for calling structural variants (SV) and copy number variation (CNV) from next generation sequencing (NGS) data.
Examples of widely used tools for functional filtering of NGS variants.
| Software | Short Description | Ref. |
|---|---|---|
| PhyloP | Based on a model of neutral evolution, the patterns of conservation (positive scores)/acceleration (negative scores) are analyzed for various annotation classes and clades of interest. | [ |
| SIFT | Predicts, based on sequence homology, whether an AA substitution will affect protein function and potentially alter the phenotype. Scores below 0.05 indicate a deleterious variant. | [ |
| PolyPhen-2 | Predicts the functional impact of an AA replacement from its individual features using a naive Bayes classifier. Includes two tools: HumDiv (designed for complex phenotypes) and HumVar (designed for the diagnosis of Mendelian diseases). Higher scores (>0.85) predict damaging variants with more confidence. | [ |
| CADD | Integrates diverse genome annotations and scores all human SNVs and indels. It prioritizes functional, deleterious, and disease-causal variants according to functional categories, effect sizes and genetic architectures. A score above 10 is suggested as the cut-off for identifying pathogenic variants. | [ |
| MutationTaster | Analyzes evolutionary conservation, splice-site changes, loss of protein features and changes that might affect the amount of mRNA. Variants are classified as polymorphisms or as disease-causing. | [ |
| Human Splice Finder | Predicts the effects of mutations on splicing signals and identifies splicing motifs in any human sequence. | [ |
| nsSNPAnalyzer | Extracts structural and evolutionary information from a query nsSNP and uses a machine learning method (Random Forest) to predict its phenotypic effect. Classifies the variant as neutral or disease-associated. | [ |
| TopoSNP | Analyzes SNPs based on their geometric location and conservation information, and produces an interactive visualization of disease-associated and non-disease-associated SNPs. | [ |
| Condel | Integrates the output of different methods to predict the impact of nsSNPs on protein function. The algorithm, based on the weighted average of the normalized scores, classifies variants as neutral or deleterious. | [ |
| ANNOVAR * | Annotates variants based on several parameters: identifying whether SNPs or CNVs affect the protein (gene-based), identifying variants in specific genomic regions outside protein-coding regions (region-based) and identifying known variants documented in public and licensed databases (filter-based). | [ |
| VEP * | Determines the effect of multiple variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts and protein sequence, as well as regulatory regions. | [ |
| snpEff * | Annotates and classifies SNVs based on their effects on annotated genes, such as synonymous/nsSNP, start or stop codon gains or losses, and their genomic locations, among others. Considered a structure-based annotation tool. | [ |
| SeattleSeq * | Annotates SNVs and small indels, providing for each the dbSNP rs IDs, gene names and accession numbers, variation functions, protein positions and AA changes, conservation scores, HapMap frequencies, PolyPhen predictions and clinical associations. | [ |
AA, amino acid; SNV, single nucleotide variant; Indel, small insertion/deletion variant; SNP, single nucleotide polymorphism; nsSNP, nonsynonymous SNP; CNV, copy number variation; * these tools, although also able to filter variants, are primarily responsible for variant annotation.
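The score cut-offs quoted in the table (SIFT below 0.05, PolyPhen-2 above 0.85, CADD above 10) can be combined into a simple functional filter. The sketch below is a hedged illustration only: the `VariantScores` record, the variant identifiers, the example scores, and the rule of requiring agreement between at least two predictors (`min_votes`) are all assumptions made here for demonstration, not a method described in the review.

```python
from dataclasses import dataclass
from typing import List, Optional

# Cut-offs as quoted in the table above.
SIFT_MAX = 0.05       # SIFT: scores below this indicate a deleterious variant
POLYPHEN_MIN = 0.85   # PolyPhen-2: scores above this predict damaging variants
CADD_MIN = 10.0       # CADD: scores above this are used as the cut-off


@dataclass
class VariantScores:
    """Hypothetical record holding per-variant predictor scores."""
    variant_id: str
    sift: Optional[float] = None
    polyphen2: Optional[float] = None
    cadd: Optional[float] = None


def damaging_votes(v: VariantScores) -> int:
    """Count how many available predictors flag the variant as damaging."""
    votes = 0
    if v.sift is not None and v.sift < SIFT_MAX:
        votes += 1
    if v.polyphen2 is not None and v.polyphen2 > POLYPHEN_MIN:
        votes += 1
    if v.cadd is not None and v.cadd > CADD_MIN:
        votes += 1
    return votes


def functional_filter(variants: List[VariantScores],
                      min_votes: int = 2) -> List[VariantScores]:
    """Keep variants flagged by at least `min_votes` predictors.

    The voting rule is an assumption of this sketch, not a published standard.
    """
    return [v for v in variants if damaging_votes(v) >= min_votes]


# Made-up scores, for illustration only.
variants = [
    VariantScores("var_A", sift=0.01, polyphen2=0.97, cadd=24.0),
    VariantScores("var_B", sift=0.40, polyphen2=0.10, cadd=3.2),
    VariantScores("var_C", sift=0.03, cadd=8.0),
]
kept = functional_filter(variants)
print([v.variant_id for v in kept])
```

Missing scores are treated as abstentions rather than as benign calls, so a variant scored by a single tool cannot pass a two-vote filter; whether that is appropriate depends on the laboratory's own filtering policy.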
Figure 7. Visual inspection of a BAM (binary alignment map) file, with two examples of situations that may be observed. (A) A true-positive indel, confirmed by Sanger sequencing. (B) A clear false-positive result, in which the variant is present only in reverse reads; Sanger sequencing later demonstrated that it is a technical artifact, and it should be excluded from further analysis.
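The pattern in panel (B), a variant supported by reads on only one strand, can also be flagged programmatically before visual inspection. The following is a minimal sketch under stated assumptions: the per-strand read counts are taken as already extracted from the BAM file by some upstream step, and the `StrandCounts` record, the thresholds and the example counts are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StrandCounts:
    """Hypothetical per-strand read counts at one variant position."""
    ref_fwd: int  # forward reads supporting the reference allele
    ref_rev: int  # reverse reads supporting the reference allele
    alt_fwd: int  # forward reads supporting the variant allele
    alt_rev: int  # reverse reads supporting the variant allele


def looks_strand_specific(c: StrandCounts,
                          min_depth_per_strand: int = 5,
                          min_alt_per_strand: int = 1) -> bool:
    """Flag a variant whose supporting reads come from one strand only.

    The variant is flagged only when both strands are adequately covered,
    so that a missing strand is not confused with low coverage.
    Both thresholds are illustrative assumptions.
    """
    if c.alt_fwd + c.alt_rev == 0:
        return False  # no variant-supporting reads at all
    fwd_depth = c.ref_fwd + c.alt_fwd
    rev_depth = c.ref_rev + c.alt_rev
    both_covered = (fwd_depth >= min_depth_per_strand
                    and rev_depth >= min_depth_per_strand)
    one_sided = (c.alt_fwd < min_alt_per_strand
                 or c.alt_rev < min_alt_per_strand)
    return both_covered and one_sided


# Made-up counts mimicking the two situations described in the caption.
balanced = StrandCounts(ref_fwd=20, ref_rev=22, alt_fwd=18, alt_rev=19)
reverse_only = StrandCounts(ref_fwd=40, ref_rev=21, alt_fwd=0, alt_rev=17)
print(looks_strand_specific(balanced), looks_strand_specific(reverse_only))
```

A flag from such a check is a prompt for inspection or orthogonal confirmation (Sanger sequencing, as in the caption), not proof that the call is an artifact.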