In silico tools for splicing defect prediction: a survey from the viewpoint of end users.

Xueqiu Jian¹, Eric Boerwinkle², Xiaoming Liu¹.

Abstract

RNA splicing is the process during which introns are excised and exons are spliced. The precise recognition of splicing signals is critical to this process, and mutations affecting splicing comprise a considerable proportion of genetic disease etiology. Analysis of RNA samples from the patient is the most straightforward and reliable method to detect splicing defects. However, currently, the technical limitation prohibits its use in routine clinical practice. In silico tools that predict potential consequences of splicing mutations may be useful in daily diagnostic activities. In this review, we provide medical geneticists with some basic insights into some of the most popular in silico tools for splicing defect prediction, from the viewpoint of end users. Bioinformaticians in relevant areas who are working on huge data sets may also benefit from this review. Specifically, we focus on those tools whose primary goal is to predict the impact of mutations within the 5' and 3' splicing consensus regions: the algorithms used by different tools as well as their major advantages and disadvantages are briefly introduced; the formats of their input and output are summarized; and the interpretation, evaluation, and prospection are also discussed.

Substances: RNA Precursors

Year: 2013 PMID: 24263461 PMCID: PMC4029872 DOI: 10.1038/gim.2013.176

Source DB: PubMed Journal: Genet Med ISSN: 1098-3600 Impact factor: 8.822

1. Introduction to pre-mRNA splicing and mutations affecting splicing

Sixty years ago, the milestone discovery of the double-helix structure of the DNA molecule opened a door for scientists to uncover the secret of life. For long periods after this discovery it was widely accepted that similar to prokaryotes, the genetic information manifested by proteins in eukaryotes was also carried by continuous DNA sequences. This specious assumption was proven wrong by a comparison between an mRNA sequence of adenovirus and the DNA from which it was transcribed, leading to the discovery of split genes and RNA splicing.[1,2] Generally speaking, DNA sequences coding for proteins (exons) are interrupted by non-coding sequences (introns); both exons and introns are transcribed to pre-mRNAs; before they are translated to proteins, introns are excised and discrete exons are spliced, resulting in mature mRNAs (Figure 1). Based on this new discovery, the molecular basis of RNA splicing was gradually revealed.

Figure 1

Schematic illustration of pre-mRNA splicing. 5′ ss and 3′ ss are recognized by the spliceosome and the intron is excised and exons are spliced. The whole process is regulated by trans-acting elements such as SR proteins, hnRNPs, and the regulatory complex.

The completion and regulation of splicing rely on complex biochemical interactions between nucleotide sequences (cis-acting elements) and the proteins that bind to them (trans-acting elements). Cis-acting elements include the 5′ splice site (junction between an exon and an intron), the 3′ splice site (junction between an intron and an exon), the branch point (tens of nucleotides upstream of the 3′ splice site), exonic splicing enhancers (ESEs), intronic splicing enhancers (ISEs), exonic splicing silencers (ESSs), and intronic splicing silencers (ISSs). Trans-acting elements include the spliceosome, which is made up of five small nuclear ribonucleoproteins (snRNPs) and more than 150 proteins, as well as SR proteins, hnRNPs, and the regulatory complex (Figure 1). The key step in this process is for the spliceosome to localize the exon-intron boundaries by capturing the splicing signals embedded in the pre-mRNA sequence. Extensive comparisons of sequences at different exon-intron boundaries suggested not only the presence of almost invariant GT-AG sites (the first and last two sites of an intron, respectively) but also weaker conservation in the vicinity of these boundaries, named 5′ and 3′ splicing consensus sequences, respectively, which function as key splicing signals.[3] However, the so-called consensus sequence does not yet have a consensus definition. For example, one study used more than 1,400 5′ and 3′ splice sites from a variety of organisms to derive the consensus sequence from positions −3 to +6 at the 5′ splice site and from positions −14 to +1 at the 3′ splice site,[4] while another study used an alignment of conserved sequences from 1,683 human introns to derive the 5′ consensus sequence from positions −3 to +8 and the 3′ consensus sequence from positions −12 to +2.[5] Throughout the paper, we use ‘splicing consensus region’ loosely to mean the few to tens of nucleotides in the vicinity of a 5′ or 3′ splice site.
For a certain gene, the final product of splicing may vary in different conditions as a result of alternative splicing that produces different protein sequences without deleterious effects on its functions. The consequence of alternative splicing can be the skipping of an exon (exon skipping), use of different 5′ or 3′ splice sites (alternative 5′ or 3′ splice sites, respectively), retaining one of the two exons but not both (mutually exclusive exons), or the retention of an intron (intron retention).[6] Alternative splicing diversifies gene expression (e.g., in different tissues or in different developmental stages) and is very common in human genes. Mutations at splice sites may also modify patterns of splicing in a deleterious way. For instance, a mutated splice site may disrupt an authentic exon-intron boundary and thus change the binding site of the spliceosome, which results in an aberrant splicing. For example, a G to T substitution at position 1 in intron 25 of the DFNA1 gene can disrupt the canonical splice donor sequence and lead to a four-base insertion in the transcript, which further results in a frameshift and a premature termination that truncates 32 amino acids of the protein. 
This splicing mutation has been found to cause nonsyndromic deafness in humans.[7] Another well-known example is a C to T point mutation at position 6 in exon 7 of the SMN2 gene in individuals who already have deletions of the SMN1 gene: it does not change the amino acid encoded by the codon, but 80% of the time it inactivates an ESE and creates an ESS, leading to exon 7 skipping and a truncated protein, and thus causes spinal muscular atrophy.[8-10] In addition to disrupting the primary linear sequence at splice sites, mutations may also affect other aspects of splicing, e.g., by modifying the secondary structure of the region in a way that hinders the binding of trans-acting elements.[11] Besides the causal role shown in the above examples, splicing mutations can also act as modifiers of disease susceptibility and severity, which has been extensively reviewed elsewhere.[12]

Mutations affecting RNA splicing are not negligible in the population. For example, the Human Gene Mutation Database (HGMD, http://www.hgmd.cf.ac.uk/ac/index.php) collects known human mutations responsible for inherited diseases. As of its Professional Version 2013.2, a total of 13,030 out of 141,161 (9.2%) disease-causing mutations have consequences for mRNA splicing. A widely cited paper estimated that among all human genetic diseases caused by point mutations, up to 15% are the result of splicing defects.[13] However, this estimate still seems conservative because mutations in coding regions are usually classified as missense, nonsense, or silent, which may have led to misclassification of splicing mutations and underestimation of their number.[6] Recently, using independent methods, Lim et al.[14] and Sterne-Weiler et al.[15] provided similar estimates (22% and 25%, respectively) of the proportion of exonic variants in the HGMD that affect splicing but were originally classified as missense mutations (missense or nonsense mutations in Sterne-Weiler et al.[15]).
These statistics indicate that mutations affecting splicing comprise a considerable proportion of genetic disease etiology.

2. A problem of splicing mutation detection in medical genetics

Disease diagnosis and treatment owe a great deal to our understanding of disease etiology and relevant laboratory techniques. With the importance of mutations affecting splicing being unraveled, their potential role in genetic diseases is increasingly attracting the attention of medical geneticists in their clinical practice. Analysis of RNA from the patient is the most straightforward and reliable method to detect splicing defects. Some other widely used laboratory techniques include in vitro splicing assay and minigene splicing assay.[16] However, our current knowledge of splicing is yet to be implemented in clinical practice on a routine basis due to RNA sample availability (especially specific tissue samples) and limitations in the use of these laboratory techniques.[16] Clinical genetic testing still relies largely on DNA extracted from blood samples. Moreover, searching for particular splicing variants responsible for particular diseases in the genome may be akin to looking for a needle in a haystack. Since laboratory testing for all splicing variants is expensive and time-consuming, medical geneticists are seeking a more economical and quicker way of screening thousands of variants without losing much accuracy so that limited medical resource can be used to serve as many patients as possible. One alternative is to use in silico prediction tools to filter out those variants with little odds of being deleterious and thus to narrow down the search to fewer candidate variants for further experimental validation. After decades of efforts, a number of in silico prediction tools have been developed to assess the effect of DNA sequence variations on splicing. Even so, medical geneticists may be uncertain as to which of the many prediction tools to choose when they have their patients’ DNA sequences in hand. Most of the tools were initially designed and developed primarily for research purposes, making them much less useful in clinical practice. 
Therefore, in this review we try to provide medical geneticists with some basic insights into some of the most popular in silico tools for splicing defect prediction. Currently available prediction tools cover almost all cis-acting elements; for example, ESEfinder, a program that identifies putative ESEs responsive to SR proteins,[17] successfully predicted the loss of a putative ESE motif in the SMN2 gene in the previous example.[10] Nevertheless, we restrict our review to tools whose primary goal is to predict the impact of mutations within the 5′ and 3′ consensus regions, for the following reasons: (1) the consensus regions are the prominent cis-acting elements; (2) they have been understood and modeled much better than other elements; and (3) prediction tools for mutations within consensus regions are better developed, with more potential to be utilized in medical genetics. We focus on the application aspect of these tools and organize the text in a user-oriented way. This review may also be useful for bioinformaticians in relevant areas who are working on huge datasets such as whole genome sequencing data. We anticipate that the information presented here will produce an intuitive picture of current progress in this field, from which readers may benefit when using these tools in their daily practice.

3. Overview of in silico prediction tools for 5′ and 3′ splice site mutation

The main purpose for using splice site prediction tools has shifted from the identification of possible exon-intron boundaries before the Human Genome Project (HGP) was completed in 2003 to the prediction of the transcriptional impact of mutations at known splice sites and their vicinity regions in the post-HGP era. This transition reflects the need to understand human variation on splicing and its effect on human diseases, which is of most interest for medical geneticists. From the viewpoint of end-users, the first interface presented when most of the tools are opened will be the input page, which asks the user to type or load the data they want to predict. Most tools require the input of one or more sequences with or without specifying exon-intron boundaries. In the former condition users need to fix the length of the sequence while in the latter condition the computer program automatically searches for potential splice sites through the whole length of the input sequence. The shared feature of both formats is to provide the tool with a sequence around the splice site (either manually or automatically by the program), indicating that the prediction of a given mutation relies only on the sequence context itself, regardless of which tool is chosen. In fact, the major differences among tools are the consensus sequences they used for the comparison with the input sequences, the statistical models applied to this comparison, and the training methods implemented in machine learning approaches, which will be introduced later in this section. Although a number of in silico tools have been developed, the ideas behind them are not so diverse. Tools with the same backbone mainly differ in the extent to which the local sequence context is taken into account. Oftentimes a new tool was introduced when certain components of the algorithm it stemmed from were improved. 
Although medical geneticists are usually more concerned with the application aspect of these tools, which will be discussed later, a brief description of the principles helps in understanding the advantages and disadvantages of different tools. The basic Position Weight Matrix (PWM) model proposed by Shapiro and Senapathy[4] scores and ranks a sequence using appropriate weights for each nucleotide position based on the information from its aligned consensus sequence (Table 1), and it was used by the web interface Splice-Site Analyzer Tool (http://ibis.tau.ac.il/ssat/SpliceSiteFrame.htm). The PWM model is simple, easy to understand, and widely used for representing different patterns of sequences; however, it is oversimplified because it assumes independence (i.e., no correlation) among all positions. That is, the PWM score of a sequence is the sum of position-specific scores for each of its bases (A, T, C, and G), and a change of the score at one position has no impact on the score at other positions. SpliceView (http://zeus2.itb.cnr.it/~webgene/wwwspliceview.html) improved the PWM model by considering mutual dependency between nucleotides in different positions.[18] A more general probabilistic model called the Maximal Dependence Decomposition (MDD) model, which is a decision tree method, captures potential strong dependencies between signal positions (adjacent and non-adjacent) by dividing the dataset into subsets based on pairwise dependency between positions and modeling each subset separately.[19] The MDD model was incorporated in the computer program GENSCAN (http://genes.mit.edu/GENSCAN.html). Pertea et al.[20] further enhanced the MDD model by adding Markov models (MM), which capture additional dependencies among adjacent positions. The source code for this method, called GeneSplicer, is downloadable at http://ccb.jhu.edu/software/genesplicer/.

Table 1

A hypothetical example of a position weight matrix (PWM).*

Nucleotide	Site 1	Site 2	Site 3	Site 4	Site 5	Site 6
A	0.73	0.17	0.00	0.00	0.05	0.62
T	0.05	0.26	0.00	1.00	0.49	0.00
C	0.15	0.57	0.00	0.00	0.16	0.22
G	0.07	0.00	1.00	0.00	0.30	0.16

For each site the frequencies of different nucleotides observed in a set of aligned sequences are calculated to construct the PWM which is used to score and rank a sequence. For example, a sequence ACGTTA is most likely to be observed in the population and has the highest score, while a sequence TGACAT is one of the most unlikely and has the lowest score. The formula used to calculate the score varies between 5′ and 3′ splice sites and between different PWM algorithms.
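
As a minimal sketch of how a matrix such as Table 1 can be turned into a score, the Python snippet below sums position-specific log-frequencies, so that each site contributes independently. The log transform, the floor of 0.01 used for zero frequencies, and the function names are assumptions made for illustration only; as noted above, the actual formula differs between 5′ and 3′ splice sites and between PWM algorithms.

```python
import math

# Nucleotide frequencies copied from Table 1 (one list entry per site 1-6).
PWM = {
    "A": [0.73, 0.17, 0.00, 0.00, 0.05, 0.62],
    "T": [0.05, 0.26, 0.00, 1.00, 0.49, 0.00],
    "C": [0.15, 0.57, 0.00, 0.00, 0.16, 0.22],
    "G": [0.07, 0.00, 1.00, 0.00, 0.30, 0.16],
}

# Assumed floor for zero frequencies so that log() is defined (illustrative only).
FLOOR = 0.01


def pwm_score(seq, pwm=PWM, floor=FLOOR):
    """Sum of position-specific log-frequencies; sites are treated as independent."""
    seq = seq.upper()
    n_sites = len(pwm["A"])
    if len(seq) != n_sites:
        raise ValueError(f"expected {n_sites} bases, got {len(seq)}")
    return sum(math.log(max(pwm[base][i], floor)) for i, base in enumerate(seq))


def consensus(pwm=PWM):
    """Most frequent nucleotide at each site of the matrix."""
    n_sites = len(pwm["A"])
    return "".join(max(pwm, key=lambda b: pwm[b][i]) for i in range(n_sites))


if __name__ == "__main__":
    # Consensus, a single-base variant of it, and the least likely sequence.
    for s in ("ACGTTA", "ACGTGA", "TGACAT"):
        print(s, round(pwm_score(s), 2))
```

Because the score is a plain sum over sites, replacing T with G at site 5 changes only that site's term; this is exactly the independence assumption that the dependency-aware models described earlier try to relax.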

In the previous examples, the features used for distinguishing true splice sites from decoy ones are selected by hand, e.g., using appropriate weights in the PWM model, which might not be optimized and might introduce bias. To overcome this problem, machine learning techniques such as artificial neural networks (NNs) have been applied to the classification of splice sites. By training on true positive and true negative datasets, an NN automatically optimizes a criterion (e.g., a hyperplane) that separates the two classes. For instance, NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/) was developed using an NN in which the threshold is also controlled by the exon signal,[21] and NNSplice (http://www.fruitfly.org/seq_tools/splice.html) was trained only on examples with consensus splice sites and also accounts for strong correlations in neighboring positions.[22] The support vector machine (SVM) is another type of machine learning technique. SplicePort (http://spliceport.cbcb.umd.edu/) used the Feature Generation Algorithm (FGA), which automatically identifies sequence-based features important for sequence classification, to generate the input for the SVM.[23] Although the machine learning approach is highly automatic, its drawback is as obvious as its advantage: over-fitting. If a classifier fits the training data ‘too well’, e.g., because it has too many parameters relative to the number of examples, its generalizability will likely be poor. That is, when an over-fitted NN or SVM is used to predict unknown splice sites, the optimized criterion might no longer be appropriate. One of the common ways to minimize over-fitting is to use Bayesian models. One attempt by Brendel et al.[24] used three variables for splice site prediction, and the model was implemented by the web server SplicePredictor (http://bioservices.usd.edu/splicepredictor/). To date the most unbiased approximation for modeling short sequence motifs is to use the Maximum Entropy Distribution (MED).
Compared with other methods, the only assumption of MED is consistency with the features of the empirical distribution estimated from available data.[25] MED also considers dependencies between both non-adjacent and adjacent positions. Rather than a single model, MED is a flexible framework that generates different models simply by changing the sets of constraints. The approach has been utilized by the tool MaxEntScan (http://genes.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html). Users can either use the default models or build their own. The model has been successfully applied to the prediction of splicing mutations in the ATM gene responsible for the neurological disorder ataxia-telangiectasia: using MED, three apparently nonsense, missense, or silent exonic mutations were correctly interpreted as disrupting normal splice sites and creating new ones, and these predictions were confirmed by cDNA analysis.[26] MaxEntScan can also output results using other algorithms such as the PWM, MDD, and MM models for easy comparison. In addition to the approaches described above, various other methods used for splicing defect prediction have been proposed.
Examples with user-friendly web interfaces include HBond (http://www.uni-duesseldorf.de/rna/html/hbond_score.php): hydrogen bond model describing the interaction of U1 snRNA and its binding sites;[27] Automated Splice Site Analyses (ASSA, http://splice.uwo.ca/, free registration required): information theory-based models by which changes in the affinity of potential splice and regulatory sites caused by mutation are calculated;[28] CRYP-SKIP (http://cryp-skip.img.cas.cz/): multiple logistic regression model which distinguishes exons that are skipped and that activate cryptic splice sites as a result of splicing mutations;[29] and Spliceman (http://fairbrother.biomed.brown.edu/spliceman/index.cgi): prediction of how likely distant mutations around annotated splice sites disrupt splicing by clustering hexamers into distinct groups based on positional distributions.[30] Some tools also incorporate multiple algorithms for the sake of user convenience. Human Splicing Finder (HSF, http://www.umd.be/HSF/) outputs splicing defect predictions based on the PWM and MED models as well as the predictions of branch points, ESEs, and ESSs.[31] SROOGLE (http://sroogle.tau.ac.il/) is a comprehensive platform that combines nine different prediction algorithms to score four main splicing signals, in which 5′ and 3′ splice sites are predicted by both the PWM and MED models.[32] Automatic Analysis of SNP sites (AASsites, http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar/) is a new analysis pipeline that predicts splicing pattern change caused by SNPs using outputs from five gene prediction programs.[33] Information about the input and output of these tools is listed in Table 2. Unfortunately, at least one sequence is required as the input for almost all tools, thereby making the application of these tools in clinical practice much less convenient. 
One exception is ASSA which has the option not to input sequence information because ASSA can also localize the variant based on the user-provided gene name, mRNA accession number, or dbSNP rs number. This may be more useful for medical geneticists, as it is better to have a simple format of input that prevents their focus from being distracted by technical concerns. Another common drawback of these tools is their limitation on the length of the sequence being analyzed (Table 2).

Table 2

Summary of input, output, and interpretation of prediction scores for selected currently available in silico tools for 5′ and 3′ splice site prediction with user-friendly web interface.

Tool	Input	Output	Interpretation
Splice-Site Analyzer Tool	Single/multiple sequences (5′: 9 bp (−3~+6); 3′: 15 bp (−14~+1))	S & S score (0~100)	Higher score implies an ss sequence more similar to the consensus sequence
NetGene2	Single sequence (200 bp < length < 80,000 bp)	Confidence score (0~1)	Higher score implies a higher confidence of true site
NNSplice	Single/multiple sequences	Score (0~1)	Higher score implies a more likely splice site
GENSCAN	Single sequence ≤ 1 million bp	Probability score (0~1)	Higher score implies a higher probability of correct exon
SpliceView	Single sequence ≤ 31,000 bp	S & S score (0~100)	Higher score implies an ss sequence more similar to the consensus sequence
HBond	Single/multiple 11 bp sequences (−3~+8) containing GT in +1/+2 or one genomic sequence	Hbond score	Higher score implies a stronger capability of forming H-bonds with U1 snRNA
MaxEntScan	Single/multiple sequences (5′: 9 bp (−3~+6); 3′: 23 bp (−20~+3))	Maximum entropy score (log-odds ratio)	Higher score implies a higher probability of the sequence being a true splice site
SplicePredictor	Single/multiple sequences	*-value (3~15) determined by p, rho and gamma values	Higher value implies a more reliable predicted splice site
ASSA	Mutation to be analyzed and the reference sequence	Information contents Ri	Color-coded by direction and type of change in Ri
SplicePort	Single/multiple sequences ≤ 30,000 bp	FGA score	Higher score implies a more precise prediction of splice site
HSF	Single sequence ≤ 5,000 bp	S & S score (0~100)	Higher score implies a more likely splice site
CRYP-SKIP	Single/multiple sequences ≤ 4,000 bp containing one exon in upper case and flanking intronic sequence ≥ 4 bp in lower case	Probability of cryptic ss activation (0~1)	Higher value implies a higher probability of cryptic ss activation as opposed to exon skipping
SROOGLE	Target exon along with two flanking introns	Different scores with their percentile scores (0~1)	Higher percentile score implies a higher ranking of the ss within pre-calculated distributions
AASsites	Single sequence containing the SNP(s) and the Ensembl gene ID to which the SNP(s) belong(s)	Classification of the probability for a change in splicing	Probable, likely, or unlikely
Spliceman	Single/multiple sequences with one mutation and ≥ 5 bp in each side of the mutation	L1 distance and percentile rank	Higher percentile rank implies a higher likelihood that the point mutation disrupts splicing
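
Several tools in Table 2 expect fixed-length windows around a splice site rather than a whole gene sequence. The sketch below shows one way to cut the MaxEntScan windows listed in the table (9 bp from −3 to +6 at the 5′ splice site, 23 bp from −20 to +3 at the 3′ splice site) out of a longer sequence. The function names, the 0-based coordinate convention, and the toy sequence are hypothetical and serve only to illustrate the window arithmetic.

```python
def donor_window(seq, intron_start):
    """9-base 5' splice-site window: exon positions -3..-1 plus intron positions +1..+6.

    intron_start is the 0-based index of the first intronic base (the G of GT).
    """
    start, end = intron_start - 3, intron_start + 6
    if start < 0 or end > len(seq):
        raise ValueError("sequence too short for a 9-base donor window")
    return seq[start:end]


def acceptor_window(seq, exon_start):
    """23-base 3' splice-site window: intron positions -20..-1 plus exon positions +1..+3.

    exon_start is the 0-based index of the first base of the downstream exon.
    """
    start, end = exon_start - 20, exon_start + 3
    if start < 0 or end > len(seq):
        raise ValueError("sequence too short for a 23-base acceptor window")
    return seq[start:end]


# Hypothetical toy pre-mRNA fragment (DNA alphabet): exon, intron (GT...AG), exon.
EXON1 = "ATGGCCCAG"
INTRON = "GTAAGT" + "TC" * 9 + "AG"
EXON2 = "GTTCAA"
SEQ = EXON1 + INTRON + EXON2

if __name__ == "__main__":
    print(donor_window(SEQ, len(EXON1)))
    print(acceptor_window(SEQ, len(EXON1) + len(INTRON)))
```

To score a variant, the same window would be extracted from both the reference and the mutant sequence before the two are submitted to the tool.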

4. Interpretation, evaluation, and prospection

A simple, clear but informative interpretation of the output of prediction tools is extremely important for their application in clinical practice. Most tools output a score as a numerical measure of the strength of the splicing signal. Although the range varies, a higher score always indicates a higher degree of similarity to the consensus sequence or a higher probability or confidence of a site being a true splice site. A common misinterpretation of the score by end-users is to treat the score as a measure of the effect size. Since the score is a reflection of how likely the variant is deleterious, it is by no means appropriate to consider a variant with a lower score as more deleterious than that with a higher score. Furthermore, the score itself is meaningless because there is no recognized threshold distinguishing positive sites from negative ones. This might be partially due to the fact that other factors besides splicing signals have an impact on splicing. A common way to interpret the scores and facilitate the comparison between different methods is to use score variation by comparing the mutant score with the reference score.[34,35] Users should use a criterion, usually a cutoff value, to determine whether the mutation is causing splicing defects. However, setting this value is usually arbitrary across different tools in different studies. Since the choice of the threshold might not be optimized, the apparently poor performance of a tool is probably due to human errors rather than the algorithm itself, and this will lead to the incomparability of different tools, thus impeding the development of interpretation guidelines. Lack of interpretation guidelines for splicing defect prediction is also attributable to the small-scale nature of published studies (Table 3). As a recent example, Houdayer et al.[35] systematically evaluated several in silico prediction tools using 272 variants of unknown significance (VUS) in BRCA1 and BRCA2 genes. 
These VUSs were analyzed in vitro and in silico; the receiver operating characteristic (ROC) curve was used to identify the optimized cutoff value for each tool and to compare their predictive performance for variants in 5′ and 3′ consensus regions (excluding GT-AG sites because all mutations at GT-AG sites affect splicing and were successfully predicted by all tools). They found that the combination of MaxEntScan with a 15% cutoff value and the PWM model with a 5% cutoff value led to an optimized sensitivity of 96% and specificity of 83%. Although the number of variants investigated is still relatively small and the variants come from only two genes, we consider it an encouraging step in the right direction. More interesting than the result itself are the opportunities for improvement that this study reveals. (1) The currently available prediction tools perform perfectly for mutations at invariant GT-AG sites. The real difficulty is to predict the effect of mutations in their vicinity regions (consensus sequences) and at more distant sites. (2) If the consensus sequence has a higher score, the prediction is more reliable. Ideally, a good algorithm should give the maximum score to the consensus sequence, but this relies on accurate and representative population sequence information to build consensus sequences. High-throughput next-generation sequencing technologies make it possible to rapidly re-sequence the whole genome and transcriptome, so a better definition of the splicing consensus sequence is possible. (3) Guidelines proposed on the basis of a single study might apply only to a specific dataset. Their generalizability to other genes and populations is still unknown. As more and more whole genome and transcriptome sequences become available, truly large-scale splicing analyses are expected, which are not limited to certain genes in certain populations. Guidelines for splicing defect prediction based on more general data will be more reliable and generic.
(4) From the epidemiological point of view, all existing evaluations are retrospective and thus inevitably suffer from a series of selection biases. Though expensive and time-consuming, establishing large cohorts is still preferred, because the biases introduced in a retrospective study cannot be fully controlled by any analytical method. For example, to associate a disease with its possible causal mutations, investigators often choose to retrospectively compare the nucleotide difference between the cases and the controls because the comparison can be accomplished quickly and economically. However, it might be more convincing to use a random cohort sequenced at baseline (exposure) and observe whether the disease emerges (outcome) prospectively to eliminate the impact of nonrandom selection of cases and controls. This can probably be achieved by re-sequencing DNA samples collected at baseline from existing well-established cohorts, such as the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium.[36] In addition, the National Institutes of Health recently funded a five-year research program that will explore the use of genomic sequencing in newborn screening.[37] This provides the opportunity to establish new large cohorts from the very beginning of life and has the potential for studying germline mutations and Mendelian diseases, especially those with an early age of onset.
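
The score-variation approach discussed in this section can be sketched in a few lines of Python. The snippet computes the relative change of the mutant score with respect to the reference score, flags a variant when the score drops by at least a chosen cutoff (15% is the MaxEntScan cutoff quoted above), and evaluates the calls against experimental results as sensitivity and specificity. The exact formula used in the cited evaluation is not given here, so the relative-decrease definition, the function names, and the toy scores are all assumptions for illustration.

```python
def percent_change(ref_score, mut_score):
    """Relative change of the mutant score versus the reference score, in percent.

    abs() in the denominator keeps the sign meaningful for negative reference
    scores, which can occur with log-odds outputs (an assumption of this sketch).
    """
    if ref_score == 0:
        raise ValueError("reference score must be non-zero")
    return 100.0 * (mut_score - ref_score) / abs(ref_score)


def is_flagged(ref_score, mut_score, cutoff_pct):
    """True if the score decreases by at least cutoff_pct percent."""
    return percent_change(ref_score, mut_score) <= -cutoff_pct


def sensitivity_specificity(predicted, observed):
    """Sensitivity and specificity of boolean calls against observed defects."""
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)
    fn = sum(1 for p, o in pairs if not p and o)
    tn = sum(1 for p, o in pairs if not p and not o)
    fp = sum(1 for p, o in pairs if p and not o)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative variant")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical (reference score, mutant score, defect observed in vitro) triples.
TOY_VARIANTS = [
    (10.0, 4.0, True),
    (8.0, 7.5, False),
    (9.0, 7.2, True),
    (6.0, 6.3, False),
    (7.0, 6.5, True),
]

if __name__ == "__main__":
    calls = [is_flagged(ref, mut, 15) for ref, mut, _ in TOY_VARIANTS]
    truth = [obs for _, _, obs in TOY_VARIANTS]
    print(sensitivity_specificity(calls, truth))
```

Repeating the last step over a range of cutoffs yields the points of an ROC curve, which is how an optimized cutoff can be chosen for each tool.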

Table 3

Selected recent publications whose primary goal (or one of the goals) was to evaluate in silico tools for splicing defect prediction.

Number of variants	Gene(s)	Prediction tools evaluated	Year [Reference]
39	RB1	NNSplice, PWM, MaxEntScan, ASSA, ESEfinder, RESCUE-ESE*	2008[38]
18	LDLR	MaxEntScan, NNSplice, NetGene2	2009[39]
29	BRCA1/BRCA2	NNSplice, NetGene2, PWM, ASSA, MaxEntScan, HSF	2009[40]
623	Multiple	GENSCAN, GeneSplicer, HSF, MaxEntScan, NNSplice, SplicePort, SplicePredictor, SpliceView, SROOGLE	2010[34]
53	BRCA1/BRCA2	PWM, GeneSplicer, NNSplice, MaxEntScan, HSF	2011[41]
272	BRCA1/BRCA2	NNSplice, PWM, MaxEntScan, ESEfinder, RESCUE-ESE, HSF	2012[35]
24	BRCA1/BRCA2	PWM, MaxEntScan, NNSplice, GeneSplicer, HSF, NetGene2, SpliceView, SplicePredictor, ASSA	2013[42]

*ESEfinder and RESCUE-ESE are web tools that predict ESEs.

Besides a standard interpretation guideline, ease of use is another important concern for medical geneticists. As previously mentioned, almost all currently available web tools are inconvenient to use in clinical practice. In addition, computational efficiency determines how long end users must wait, and whether a tool has a local standalone version determines whether it remains usable during an internet outage. A commercial software package called Alamut (Interactive Biosoftware, Rouen, France) integrates multiple reliable, regularly updated data sources and multiple prediction algorithms (for splicing signal detection, PWM, MaxEntScan, NNSplice, GeneSplicer, and HSF are included). By entering only the variant and specifying its coordinates, users can obtain all results at once without worrying about the sequence context. This should be the future of in silico prediction tools, and it is expected that more such software with user-friendly interfaces will be developed and launched. For bioinformaticians, who usually have large numbers of variants to annotate and predict, a tool that can conduct ‘batch’ analysis is preferable (e.g., NetGene2, GENSCAN, GeneSplicer, MaxEntScan, and SplicePredictor have this option). The high-throughput version of the Alamut software, Alamut-HT, can handle thousands of variants using its server option (Windows and Linux) or standalone option (Linux only).

In summary, in silico tools for splicing defect prediction (especially for 5′ and 3′ splice sites) have potential value in disease diagnosis, given that laboratory testing of large numbers of variants is infeasible in daily clinical practice. Until we have a more in-depth understanding of the splicing mechanism, there seems to be no simpler way than relying on the currently available prediction algorithms. Reliable and straightforward interpretation guidelines for the results and an easy-to-use interface will accelerate the popularization of in silico tools among medical geneticists.
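
As a rough illustration of the sequence-context bookkeeping that integrated packages spare the user, the following Python sketch builds paired wild-type and mutant sequence windows for a batch of single-nucleotide substitutions, which is the kind of input most sequence-based predictors expect. The names `Variant`, `context_pair`, and `batch_contexts`, the 0-based coordinate convention, and the default flank length are assumptions made for this example; they are not part of Alamut or of any tool surveyed here, and no scoring algorithm is implemented.

```python
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    """A single-nucleotide substitution (hypothetical helper type).

    pos is 0-based within the supplied reference sequence.
    """
    pos: int
    ref: str
    alt: str


def context_pair(seq: str, var: Variant, flank: int = 10) -> Tuple[str, str]:
    """Return (wild-type, mutant) windows with up to `flank` bases on each side."""
    seq = seq.upper()
    if seq[var.pos] != var.ref.upper():
        # Guard against coordinate or strand mix-ups before any prediction is run.
        raise ValueError(
            f"reference mismatch at {var.pos}: "
            f"expected {var.ref}, found {seq[var.pos]}"
        )
    start = max(0, var.pos - flank)
    end = min(len(seq), var.pos + flank + 1)
    wild = seq[start:end]
    offset = var.pos - start  # position of the variant inside the window
    mutant = wild[:offset] + var.alt.upper() + wild[offset + 1:]
    return wild, mutant


def batch_contexts(
    seq: str, variants: List[Variant], flank: int = 10
) -> List[Tuple[str, str]]:
    """Apply context_pair to every variant, preserving input order."""
    return [context_pair(seq, v, flank) for v in variants]


if __name__ == "__main__":
    reference = "ACGTACGTAC"  # toy sequence, not a real gene
    for wild, mutant in batch_contexts(reference, [Variant(4, "A", "G")], flank=2):
        print(wild, mutant)
```

Each resulting pair could then be submitted to a predictor of choice, and the wild-type and mutant scores compared. The window length each tool requires differs, so `flank` would need to be set per tool.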

