Charlie F Rowlands, Diana Baralle, Jamie M Ellingford.
Abstract
Defects in pre-mRNA splicing are frequently a cause of Mendelian disease. Despite the advent of next-generation sequencing, allowing a deeper insight into a patient's variant landscape, the ability to characterize variants causing splicing defects has not progressed with the same speed. To address this, recent years have seen a sharp spike in the number of splice prediction tools leveraging machine learning approaches, leaving clinical geneticists with a plethora of choices for in silico analysis. In this review, some basic principles of machine learning are introduced in the context of genomics and splicing analysis. A critical comparative approach is then used to describe seven recent machine learning-based splice prediction tools, revealing highly diverse approaches and common caveats. We find that, although great progress has been made in producing specific and sensitive tools, there is still much scope for personalized approaches to prediction of variant impact on splicing. Such approaches may increase diagnostic yields and underpin improvements to patient care.
Keywords: Mendelian disease; RNA splicing; bioinformatics; diagnostics; effect prediction; genomic medicine; machine learning; variant interpretation; variant prioritization
Year: 2019 PMID: 31779139 PMCID: PMC6953098 DOI: 10.3390/cells8121513
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Figure 1. Diverse mechanisms of splicing dysfunction may be pathogenic. (a) Wild-type splicing. Schematic of a three-exon region of a gene (exons in blue, green and orange) with corresponding wild-type splicing activity. (b) Exon skipping. Mutations in or around an exon may lead to it being skipped from a final transcript. (c) Cryptic intronic splice donor/acceptor. Mutations in the intron may lead to generation of cryptic splice sites that outcompete canonical sites, leading to inclusion of intronic sequences. (d) Cryptic exonic splice donor/acceptor. Exonic mutations that activate cryptic sites may also outcompete canonical sites, causing exclusion of exonic sequences. (e) Intron retention. Splicing of a particular intron may be abrogated, leading to complete inclusion of the length of an intron. (f) Pseudoexon inclusion. Deeply intronic mutations may activate cryptic sites that aberrantly define lengths of intron as exonic, leading to inclusion of short segments of intronic sequence (pseudoexons). Solid lines, introns; black dashed lines, wild-type splicing; red dashed lines, mis-splicing events; hashed boxes, intronic regions aberrantly included as a result of a mis-splicing event; empty boxes, exonic regions that are usually retained after splicing, but which are erroneously excluded from the final transcript.
Glossary of machine learning terms. SVM, support vector machine.
| Term | Definition |
|---|---|
| Backpropagation | The computational process by which a neural network adjusts the weights and biases of the network in such a way as to reduce the loss of the model. |
| Bagging | Abbreviation for bootstrap aggregation. The training of a model on random subsets of data entries and features to improve generalizability of a model (usually a decision tree-based model). |
| Bias | A (usually negative) value that represents a neuron’s inherent tendency towards inactivity. Usually randomized for each neuron before the training of a network. |
| Classification | A type of machine learning system in which the output is assignment of a data point to a discrete group. Usually contrasted with regression. |
| Feature | One of a set of variables in a dataset that are input to a machine learning model. Machine learning models classify data according to the values of features in the dataset. |
| Hidden layer | One of any number of layers of neurons lying between the input and output layers of a deep neural network. |
| Hyperplane | A surface with one fewer dimensions than the space it occupies. SVMs separate datasets with |
| Kernel trick | The use of a mathematical function allowing inference of relational qualities of data without explicitly carrying out computationally expensive mathematical calculations. |
| Loss function | A mathematical function measuring the degree to which a model’s predictions deviate from the true classifications of data. |
| Machine learning | The use of computer systems to detect patterns in and make inferences from data without explicit instruction. |
| Multiclass SVM | A subtype of SVM used when data may be classified into more than two classes. |
| Neuron | The basic unit of a neural network, taking in input from previous neurons and propagating a weighted response to subsequent ones. |
| Regression | A type of machine learning system in which the output is the prediction of a continuous or ordered value. Usually contrasted with classification. |
| Support vectors | Data points that lie along the margins between classifications in an SVM model. |
| Training set | A dataset containing the data that is presented to a machine learning system and then used to make inferences and learn patterns present within the data. |
| Test set | The dataset used to evaluate performance of the model. The test set is generally taken from the same source as the training set, but may come from elsewhere. |
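The glossary entry for bagging can be made concrete with a minimal sketch. The function name `bootstrap_sample`, the `feature_fraction` parameter, and the toy rows are all illustrative assumptions, not taken from any tool discussed here. The sketch draws one bagging replicate: data entries are sampled with replacement and a random subset of feature columns is kept.

```python
import random

def bootstrap_sample(rows, n_features, feature_fraction=0.5, seed=None):
    """Draw one bagging replicate (hypothetical helper).

    Rows are sampled with replacement; a random subset of feature
    columns is chosen without replacement.
    """
    rng = random.Random(seed)
    # Sample as many rows as the original set, with replacement.
    row_idx = [rng.randrange(len(rows)) for _ in rows]
    # Keep at least one feature column.
    k = max(1, int(n_features * feature_fraction))
    feat_idx = sorted(rng.sample(range(n_features), k))
    sample = [[rows[i][j] for j in feat_idx] for i in row_idx]
    return sample, row_idx, feat_idx

# Toy data: three entries with four features each (made-up values).
rows = [[0.1, 0.9, 3.0, 1.0], [0.4, 0.2, 1.0, 0.0], [0.8, 0.5, 2.0, 1.0]]
sample, row_idx, feat_idx = bootstrap_sample(rows, n_features=4, seed=0)
```

In a random forest, each tree would be trained on a different replicate of this kind, and the trees' outputs would then be aggregated.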
Figure 2. Basic machine learning models. (a) Support vector machines (SVMs) classify linearly separable data using a single hyperplane (solid line), with points classified according to the side of the hyperplane on which they lie. Construction of the hyperplane is done using support vectors (indicated by arrows), data points that mark boundaries (dotted) within which the hyperplane must lie. (b) Where data are not linearly separable, they may be transformed using kernel functions (radial basis function, or RBF, kernel shown here) which infer relational qualities of data in a computationally inexpensive manner. (c) Decision trees use a series of binary choices (orange) to most effectively separate data into different categories (red and blue). (d) Random forest models consist of large numbers (often hundreds or thousands) of trees each derived from bootstrap aggregating (bagging) of both input features and data entries in the original training set. (e) To mitigate overfitting problems common to decision trees, gradient tree boosting generates successive trees of fixed structure that each contribute a small amount to the final classification, with each tree scaled by a learning rate between 0 and 1. (f) In a neural network, a single neuron receives quantitative inputs (xi) from neurons in the preceding layer and scales them according to the weights of its connections to them (wi). Each neuron also has a “bias” (b), representing a tendency for inactivity. The output (or activation) of a neuron is the sum of each input multiplied by its respective weight, plus this bias value. (g) A deep neural network has an initial layer of input neurons (orange), which are coded representations of data features. These are connected to one or more layers of “hidden neurons” (green), which are, in turn, connected to an output layer of neurons (red and blue) corresponding to the possible classifications of the data. Predictions may be categorical or continuous and are based on the relative activation of the output neurons. Biases for each neuron and weights for each connection are randomized before the network is trained. After a set of training data is presented, the loss function of the model is calculated (i.e., how accurately or inaccurately the model has classified the known data) and an approach termed backpropagation is used to modulate each weight and bias so as to reduce this loss. More data are then presented and this process is repeated iteratively to refine the model.
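The neuron and training loop described in panels (f) and (g) can be sketched for a single neuron. This is a simplified illustration under stated assumptions: the squared-error loss, the learning rate, and all numeric values are chosen for the example, and real networks typically apply a nonlinear activation and propagate gradients through many layers.

```python
def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, as described in panel (f).
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def squared_error(prediction, target):
    # Assumed loss function for this sketch; other losses are common.
    return (prediction - target) ** 2

def gradient_step(inputs, weights, bias, target, learning_rate=0.1):
    # For loss = (output - target)^2, the gradient with respect to each
    # weight is 2 * error * input, and with respect to the bias 2 * error.
    error = neuron_output(inputs, weights, bias) - target
    new_weights = [w - learning_rate * 2 * error * x
                   for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * 2 * error
    return new_weights, new_bias

# Made-up starting values standing in for randomized weights and bias.
inputs, weights, bias, target = [1.0, 2.0], [0.5, -0.25], -0.5, 1.0

loss_before = squared_error(neuron_output(inputs, weights, bias), target)
weights, bias = gradient_step(inputs, weights, bias, target)
loss_after = squared_error(neuron_output(inputs, weights, bias), target)
```

One update moves the weights and bias in the direction that reduces the loss on this example; repeating the step over many training examples is the iterative refinement the caption refers to.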
Summary of splice prediction bioinformatics tools. Citation column denotes references to articles describing tools themselves. SVM, support vector machine; RBF, radial basis function; MPRA, massively parallel reporter assay; HGMD, Human Gene Mutation Database; PSSM, position-specific scoring matrix; pLI, probability of loss-of-function intolerant; RVIS, residual variation intolerance score; AUC, area under receiver operating characteristic (ROC) curve; PR-AUC, area under precision-recall curve.
| Tool Name | Function | ML Model | Training/Testing Data | Features | Efficacy | Citation |
|---|---|---|---|---|---|---|
| CADD | General purpose pathogenicity scoring | v1.0: linear SVM | Benign training: evolutionarily neutral variants; pathogenic training: simulated de novo pathogenic variants | 60, covering conservation scores, epigenetic modifications, functional analyses, and genetic context | AUC = 0.916, across all variant types | [ |
| TraP | Quantification of variant impact on transcripts | Random forest of 1000 individual decision trees | Benign: De novo mutations in healthy individuals | 20, including several PSSM-based splice site scores, GERP++ conservation scores, and models of feature interactions | AUC = 0.88, all ClinVar variants | [ |
| SPANR | Cassette exon skipping prediction | Group of neural networks modeled on Bayesian framework | ψ values for all human exons across 16 tissues, based on the Illumina Human Body Map project | 1393, including exon/intron lengths, distances to nearest alternative splice sites, conservation and RNA secondary structure | AUC = 0.955, when distinguishing between high (≥67%) and low (≤33%) ψ values | [ |
| CryptSplice | Effect of variants on existing splice sites and cryptic splice site prediction | SVM with RBF kernel | True and false splice sites from GenBank-derived datasets | 3 types, all sequence-based, relating to the probability of finding given nucleotide sequences at certain points in splice region | Sensitivity = 97.8% and 88.9% in correctly labeling canonical donors and acceptors, respectively | [ |
| MMSplice | Prediction of exon skipping, competitive interactions, changes in splicing efficiency and pathogenicity | Modular neural networks, and linear and logistic regression | Donor/acceptor modules: GENCODE v24 true and false splice sites | Direct encoding of the sequence | R = 0.87 and 0.81, correlation between predicted and actual Δψ values for acceptor and donor mutations, respectively | [ |
| S-CAP | Variant pathogenicity scoring with the compartmentalization of genomic space | Gradient boosting tree | Pathogenic variants curated from HGMD and ClinVar; benign variants curated from gnomAD | Features across chromosomal, gene, exon and variant levels, e.g., pLI [ | AUC: 0.828–0.959, across 6 regions | [ |
| SpliceAI | Prediction of variant impact on acceptor/donor loss or gain | 32-layer deep neural network | GENCODE v24 pre-mRNA transcript sequence for human protein-coding genes | Direct encoding of the sequence | PR-AUC = 0.98 in correct prediction of splice site location from raw sequence | [ |
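Several efficacy values in the table are reported as AUC. As a rough illustration of what that number measures, the sketch below computes ROC AUC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the labels and scores are made up for the example and do not come from any of the tools listed.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = pathogenic, 0 = benign (hypothetical labels and scores).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. Because the values in the table were obtained on different datasets and tasks, they are not directly comparable between tools.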
Tabulated list of loci on pre-mRNA transcripts amenable to predictive analysis by each of 7 splice prediction tools.
| Tool | Loci Covered |
|---|---|
| SPANR | Any exon that is not first or terminal, plus 300 bp flanking intronic sequence |
| CryptSplice | Within 60 bp of a canonical splice junction; >100 bp into intron if novel donor/acceptor is created |
| MMSplice | Any exon, plus 50 bp upstream or 13 bp downstream |
| S-CAP | Any exon, plus 50 bp flanking intronic sequence |
| CADD | All loci |
| TraP | All loci |
| SpliceAI | All loci |
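The windows in the table above can be turned into a simple lookup. The sketch below is an interpretation, not any tool's actual logic. It assumes plus-strand coordinates, exons given as 1-based inclusive `(start, end)` pairs, "upstream" meaning the 5' side of an exon, and a CryptSplice window of 60 bp on either side of a junction. The CryptSplice case of a newly created deep-intronic donor or acceptor is not modeled. The function name `covered_by` and the exon coordinates are hypothetical.

```python
def covered_by(pos, exons):
    """Return the set of tools whose stated window contains `pos`.

    exons: sorted list of (start, end), 1-based inclusive, plus strand.
    """
    tools = {"CADD", "TraP", "SpliceAI"}  # listed as covering all loci
    last = len(exons) - 1
    for i, (start, end) in enumerate(exons):
        # S-CAP: any exon plus 50 bp of flanking intron on each side.
        if start - 50 <= pos <= end + 50:
            tools.add("S-CAP")
        # MMSplice: any exon, plus 50 bp upstream or 13 bp downstream.
        if start - 50 <= pos <= end + 13:
            tools.add("MMSplice")
        # SPANR: internal exons only, plus 300 bp of flanking intron.
        if 0 < i < last and start - 300 <= pos <= end + 300:
            tools.add("SPANR")
        # CryptSplice (simplified): within 60 bp of a canonical junction.
        junctions = []
        if i > 0:
            junctions.append(start)  # acceptor side of this exon
        if i < last:
            junctions.append(end)    # donor side of this exon
        if any(abs(pos - j) <= 60 for j in junctions):
            tools.add("CryptSplice")
    return tools

# Hypothetical three-exon transcript.
exons = [(1, 100), (1001, 1100), (2001, 2100)]
near_donor = covered_by(1120, exons)    # 20 bp into the second intron
deep_intron = covered_by(1500, exons)   # 400 bp into the second intron
```

With these assumptions, the position 20 bp past the second exon falls outside the 13 bp downstream MMSplice window but inside the S-CAP, SPANR and CryptSplice windows, while the deep-intronic position is covered only by the tools listed as covering all loci.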
Figure 3. Location of variants amenable to analysis by splice prediction software. With diverse underlying training sets and purposes, different splice prediction tools are only able to analyze variants at particular sites in a pre-mRNA transcript. To-scale representation of the loci amenable to analysis by each of 7 tools for the pre-mRNA transcript of the human APOC3 gene (RefSeq accession NM_000040.3). Dotted lines signify canonical exon-intron boundaries. Hashed bars represent loci where the variant effect can be modeled only if a novel splice donor or acceptor is created. Italicized numbers show exon/intron length in nucleotides. UTR, untranslated region.
Figure 4. Compartmentalization of the splice region by S-CAP and MMSplice. Both MMSplice and S-CAP divide the splice region into six sub-regions, although the length and location of these divisions are different between the two tools. MMSplice (a) consists of six initial deep neural network modules corresponding to each region, with exonic and intronic modules both trained on the results of a massively parallel reporter assay (MPRA) experiment [64] and the acceptor and donor modules trained to predict functional acceptors and donors based on the real and decoy sites in the GENCODE v24 annotation. The scores from all modules are then passed to linear and logistic regression models to predict downstream effects, such as exon skipping, alteration of splicing efficiency, and competitive splice site interactions. S-CAP (b) consists of six separate models trained on pathogenic and benign variants curated for each region. The most significant consequence is returned for a given variant. Length of bars not to scale.