Literature DB >> 21498402

MAISTAS: a tool for automatic structural evaluation of alternative splicing products.

Matteo Floris¹, Domenico Raimondo, Guido Leoni, Massimiliano Orsini, Paolo Marcatili, Anna Tramontano.

Abstract

MOTIVATION: Analysis of the human genome revealed that the amount of transcribed sequence is an order of magnitude greater than the number of predicted and well-characterized genes. A sizeable fraction of these transcripts is related to alternatively spliced forms of known protein coding genes. Inspection of the alternatively spliced transcripts identified in the pilot phase of the ENCODE project has clearly shown that often their structure might substantially differ from that of other isoforms of the same gene, and therefore that they might perform unrelated functions, or that they might even not correspond to a functional protein. Identifying these cases is obviously relevant for the functional assignment of gene products and for the interpretation of the effect of variations in the corresponding proteins.
RESULTS: Here we describe a publicly available tool that, given a gene or a protein, retrieves and analyses all its annotated isoforms, provides users with three-dimensional models of the isoform(s) of his/her interest whenever possible and automatically assesses whether homology derived structural models correspond to plausible structures. This information is clearly relevant. When the homology model of some isoforms of a gene does not seem structurally plausible, the implications are that either they assume a structure unrelated to that of the other isoforms of the same gene with presumably significant functional differences, or do not correspond to functional products. We provide indications that the second hypothesis is likely to be true for a substantial fraction of the cases. AVAILABILITY: http://maistas.bioinformatica.crs4.it/.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2011 PMID： 21498402 PMCID： PMC3106191 DOI： 10.1093/bioinformatics/btr198

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Determining the identity and function of all the sequence elements in human DNA is a daunting challenge. The large scale pilot phase of the ENCODE project (Birney ) provided an exhaustive identification and verification of functional sequence elements in a limited region of 1% of the human genome. The computational analysis of the data revealed several unexpected features of the genome (Tress ). Perhaps the most surprising one was that many transcribed elements could be neutral elements that serve as a reservoir for natural selection. Many of these transcripts derive from alternative splicing events. Their putative products were manually analysed by the BioSapiens European Consortium (Tress ). The analysis led to the striking conclusion that more than 50% of them might not give rise to proteins structurally and/or functionally related to the other isoforms of the same genes or be the result of aberrant splicing events giving rise to non-functional proteins (Tress ). Indeed, comparison of the putative proteins encoded by the alternatively spliced transcripts with the main isoform showed that most of them lacked an active site, key trans-membrane segments, essential signalling regions and post-transcriptionally modified sites. Most importantly, models of their putative three-dimensional structures did not seem to correspond to plausible folds (Tress ). This observation was confirmed by Moult and co-workers (Melamud and Moult, 2009a, b) who, using a completely different dataset of alternative splicing variants, found that the vast majority of them resulted in putatively unstable protein conformations. Recently, some of us manually analysed the putative structures of isoforms of the human genome, the existence of which had been confirmed by mass-spectrometry and of isoforms of the same genes for which no evidence exists in proteomic databases reaching essentially the same conclusions (Leoni ). Altogether these observations suggest that we might be observing the effects of noisy selection of splice sites by the splicing machinery and/or that alternatively spliced products of a gene might assume unrelated conformations. These findings raise several interesting questions, but also a few practical issues. First of all, the careful manual analysis performed by the BioSapiens consortium on 1% of the genome needs to be scaled up to the whole genome and therefore automated. Secondly, analysis tools should be available to biologists performing experiments in a user-friendly manner. At present, there are a few systems that partially satisfy this need. For example, the ProSas database (Birzele ) (http://www.bio.ifi.lmu.de//forschung/structural-bioinformatics/prosas) stores structures and models (provided the target proteins shares at least 40% sequence identity with a known template) for the alternative isoforms annotated in Ensembl (Hubbard ) and Swiss-Prot (Bairoch ) and allows the visualization of the exon boundaries in the context of the three-dimensional structures, but there is no provision for automatic analysis of the plausibility or completeness of the resulting structures and models. The same is true for AS-ALPS (Shionyu ) (http://as-alps.nagahama-i-bio.ac.jp/), a server that provides information about the putative effect of alternative splicing on human and mouse proteins, provided that at least one of the isoforms has an experimentally solved structure. Here, we describe a system named Modelling and Assessment of ISoforms Through Automated Server (MAISTAS) that, given the accession codes of one or more genes or proteins, collects all their putative spliced isoforms annotated in the Ensembl genome database (Hubbard ), builds, whenever possible, comparative models for their structures, analyses their features and provides an estimate of the likelihood that the isoforms correspond to potentially stable and structurally plausible proteins in the absence of major conformational rearrangements. Alternative splicing isoforms can also be uploaded in the FASTA format in order to allow the user to analyse data from more comprehensive and specialized databases such as Aceview (http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/) (Thierry-Mieg and Thierry-Mieg, 2006) or ASPicDB (http://t.caspur.it/ASPicDB/) (Martelli ). Model assessment is performed by analysing the quality of the packing in the core of the structure and/or model, the extent of exposed hydrophobic surface and the putative effect of deletions and insertions. These properties are compared to those observed in known protein structures and in the closest homologs of the known structure. The system is freely available as a Web server.

2 METHODS

The input data can be a set of sequences in the FASTA format or one or more of the following codes: Ensembl Gene ID(s), Ensembl Transcript ID(s), Ensembl protein ID(s) (Flicek et al.), EMBL ID(s) (Leinonen ), EntrezGene ID(s) (Maglott ), GO ID(s) (Ashburner ), HGNC automatic gene name, HGNC curated gene name (Seal ), UniProt/TrEMBL Accession(s), UniProt/Swissprot ID(s), UniProt/Swissprot Accession(s) (The Uniprot Consortium, 2008), VEGA transcript ID(s), HAVANA transcript ID(s) (Wilming ). The collection of all putative splicing isoforms corresponding to the input gene (or to the gene encoding for the protein when a protein accession code is used) is achieved by taking advantage of a locally stored version of the Ensembl database (release 58) (Flicek ). Users can select accession codes for more than 30 different organisms. The HHsearch 1.1.5 (Söding, 2005) is used to search for possible structural templates (E-value lower than 10 , sequence coverage of at least 90%, global alignment mode, all other parameters set at their default values) and for obtaining the sequence alignment between the target and its templates. Model building is performed using a local version of Modeller9v8 (Sali and Blundell, 1993) (default parameters). The selected parameters ensure that the quality of the produced models is sufficiently high to be able to reliably measure properties described below as demonstrated by the last CASP experiment (http://predictioncenter.org/CASP9). POPS (Cavallo ) is used to calculate the accessibility to the solvent of each residue of the models. The OS software (Fleming and Richards, 2000; Pattabiraman ) is used for computing infrequent environment of residues. Finally, the ‘packing-eff’ method from the NUCPROT package (Voss and Gerstein, 2005) is used for estimating how well packed the protein is. The thresholds for POPS, Packing-eff and OS tools were derived by running the programs on 7908 monomeric proteins solved by X-ray crystallography at a resolution better than 2.5 Å. The chosen thresholds, 20.1 for POPS values, 17.8% for Packing-eff values and 0.54 for OS values, correspond to two standard deviations from the average (data not shown). Residues are considered exposed if their mean solvent accessibility—calculated considering three residues on each side of them—is larger than 5 Å2. The average response time for a typical request (three to four isoforms, a few hundreds amino acid long) is <1 h, the time limiting factor being the construction of the HMMs and of the corresponding models. The entire pipeline was built using python scripts and the interface is PHP based. In order to verify that the system can be applied to a substantial fraction of cases and that is able to recognize translated proteins, we ran it on protein isoforms whose existence is unambiguously identified by mass spectrometry. We used the May 2010 human build (http://www.peptideatlas.org/builds/human/201005/APD_Hs_all.fasta) containing 72 396 different peptides ranging in size from 6 to 66 (mean 17) (Deutsch ). Of these, 19 513 could be unambiguously mapped to 2972 isoform products annotated in Ensembl (release 58). We also compared the results of MAISTAS with those obtained by a manual analysis of human transcript products described in Leoni .

3 RESULTS

The automatic analysis performed by MAISTAS requires that the user inputs one or more protein/gene accession codes from the common public databases (see Section 2) or a set of sequences in the FASTA format. In all but the last case, the sequence(s) corresponding to the user query is retrieved and mapped back to the appropriate genome database by using a local installation of the BioMart database (Durinck ). The peptide sequences of all isoforms of the target gene, as annotated in Ensembl, are retrieved. If the input is a set of amino acid sequences in the FASTA format, they are assumed to be different isoforms of the same gene. The user can supply an email address (optional) to which the results will be sent or bookmark the result page. The initial query page of MAISTAS provides a link to an example result page, which allows the user to inspect a typical output (Fig. 1).

Fig. 1.

Snapshots of the MAISTAS output page. (A) Summary table for the modelled isoforms. The following data are shown: gene ID (gene identification code), isoform ID (isoform identification code), isoform length (number of residues of each isoform), first aa, last aa (the first and last modelled or solved amino acid), template ID (the PDB code of the template protein used for modelling or the PDB code of the known isoform structure), isoform/template % seq. ID (sequence identity between the splicing isoform and the sequence of the selected template), fraction of isoform modelled (percentage of the splicing isoform sequence modelled), summary (assessment of the plausibility of the structure). (B) Snapshot of the isoform section showing results of the analysis for each isoform, its final assessment and the modelled structure in a small Jmol window. Different links in the section allow the user to download the coordinates of the model, view their 3D structure with regions corresponding to exons in different colours, view the amino acid sequence and the isoform/template alignment generated by the HHsearch. (C) Alternative spliced isoform three-dimensional structures are displayed in separate windows allowing their simultaneous analysis and comparison. On the right side of each Jmol window, the user can choose which exons should be displayed and select different representation modes. By default, all exons are mapped on the protein structure, each in a different colour. (D) Multiple sequence alignment of the isoforms displayed via the JALVIEW applet. In the first step, the tool evaluates whether a structure exists for any of the isoforms or, lacking this, whether a comparative model can be built. In the latter case, the template is identified using the HHsearch program, which builds a Hidden Markov Model (HMM) of the target protein family and compares it to the HMMs representing a set of non-redundant families of proteins of known structure (sequence identity between any pair below 70%). This strategy has been shown in blind tests to be one of the most sensitive for finding structural templates (Battey ). The target sequence, the template(s) and the alignment obtained by the HHsearch are automatically analysed. Only models based on template structures solved by X-ray crystallography or an NMR are considered. They are inspected to detect any possible gaps in the coordinate set (for example, because of the absence of electron density in X-ray structures). If these regions are present at the N- or C-terminus of the protein they are trimmed, otherwise a warning is issued. A warning is also issued if the alignment includes insertions larger than 50 residues that might correspond to an inserted domain or deletions larger than 20 residues. The alignment is used to build the model using a local installation of Modeller (Sali and Blundell, 1993). Once the model has been built, the system computes the model hydrophobic solvent accessible area and packing efficiency. If the modelled isoform presents deletions with respect to the template, the Euclidean distance between the Cα residues before and after the deletion(s) is recorded. If insertions are present, the surface exposed to the solvent of the amino acids surrounding them and the number of inserted amino acids is computed. The tool informs the user that the model might not correspond to a complete or plausible structure if the distance between the two residues on either side of a deletion is >15 Å and/or if there are more than three residues inserted in the core of the protein and/or if the hydrophobic solvent accessible area of the model is larger than a set threshold (see Section 2). In assessing the results, the system takes into account the corresponding values for the template used for modelling. The output of MAISTAS is shown in Figure 1 and includes a summary table, where all the data regarding the modelled isoforms are reported. These can also be downloaded as a csv file. The user can download the coordinates of all the models and, if desired, all the intermediate data used in the procedure. The next section of the output page describes the detailed results for each modelled isoform and reports (see Section 2 for details): The sequence identity and coverage of the template and its PDB code. The packing efficiency of the model and of its template together with their comparison with the expected value. The extent of the exposed hydrophobic area of the model and of its template together with their comparison with the expected value. The packing environment of residues in the model and the template together with their comparison with the expected value. The assessment of whether insertions and deletions (if any) can be easily accommodated into the model. The modelled or experimental structure in a Jmol window. The option to inspect the multiple sequence alignment via a JALVIEW applet (Waterhouse ). The option to visualize and analyse the models via a Jmol applet (http://www.jmol.org/). A final remark about the plausibility/completeness of the predicted structure. MAISTAS depends on the availability of structural templates to predict the three-dimensional structure of the isoforms by comparative modelling. If no structural templates are available, a ‘No template satisfying all parameters’ warning is issued. When MAISTAS is unable to provide a reasonable structural model (e.g. when very large insertions are present) the system will return the message ‘Maistas is having trouble modelling or assessing this isoform’. The online result pages are accessible via the URL sent either by e-mail or via the ‘Retrieve results by job identifier or by email’ window, using the provided job identification code or the e-mail address. Produced models and the results of their analysis are stored in a local database unless the user requests them to be kept private. This implies that a user might be able to immediately retrieve the results on the gene(s) of interest if they were already been produced in a previous run of the system. The entries of the database are time stamped and presented to the user together with an option to repeat the analysis, which is advisable if major updates of the genome or structure database have taken place since the previous analysis was performed. We ran the system on all human alternatively spliced isoform whose existence at the protein level could be unambiguously verified by mass spectrometry, i.e. of those protein isoforms for which a peptide that unambiguously identifies them has been detected with high reliability by mass spectrometry. The server was able to produce and analyse models in 30% of the cases (890 out of 2972). In 2082 of them (70%), the model could not be built because there is no template satisfying all parameters. This had to be expected since we use rather stringent parameters to select the template (E-value better than 10 , template coverage >90%, X-ray resolution <2.5 Å or solved by the NMR). Out of the modelled isoforms, 712 (80%) were assessed as structurally plausible (see http://www.bioinformatica.crs4.org/maistas/pub/dataset.xls). In the majority of the remaining cases, (160 out of 178) the model showed a large hydrophobic surface exposed to the solvent. In these cases, the protein might indeed represent an incomplete and therefore not plausible structure, but also simply be a subunit of a larger complex. We compared the results obtained by MAISTAS with those derived from a manual analysis of the isoforms of genes for which at least one isoform had been detected in mass-spectrometry experiments [and unambiguously identified by the presence of a peptide in the PeptideAtlas database (Deutsch ) and at least one had not (Leoni )]. The results obtained automatically using MAISTAS are consistent with those reported in Leoni . In particular, MAISTAS was able to model 30% of the 555 proteins for which there is an evidence of translation (to be compared with the 26.4% obtained in the manual analysis), 85% of which were assessed as structurally plausible. The difference in coverage between the manual and automatic analyses is due to the increased size of the protein sequence and structure databases. Models were also produced for 181 out of 555 isoforms for which there is no evidence of translation in PeptideAtlas. Only 44% of these isoforms were reported as complete and plausible by the automatic pipeline. The corresponding numbers for manual analysis are 145 isoforms (26%) modelled and 48% classified as structurally consistent.

3.1 Application example

As an example of the use of MAISTAS, we describe the results obtained using the gene coding as input for the voltage-dependent anion channel 3 (VDAC3) (Ensembl gene identification code: ENSG00000078668), a protein that forms a channel through the mitochondrial outer membrane allowing diffusion of small hydrophilic molecules. Six splice variants are present in the Ensembl database for the gene encoding the protein, identified by the following Ensembl peptide codes: ENSP00000428845, ENSP00000022615, ENSP00000428519, ENSP00000428977, ENSP00000429006 and ENSP00000428029. The UniProt database entry of VDAC3 (Q9Y277) describes only two of these isoforms (ENSP00000388732 and ENSP00000022615). Although four peptides mapping to the putative products are present in the PeptideAtlas database (PeptideAtlas IDs: PAp00006999; PAp00007806; PAp00077146; and PAp00423732), they cannot be used to unambiguous identify specific isoforms of the gene since they fall in the exons present in all of them. Decker et al. (Decker and Craigen, 2000) used specific anti-VDAC3 antibody and demonstrated the existence of the ENSP00000428845 and ENSP00000022615 isoforms. The only difference between these two alternatively spliced isoforms is the insertion of a single methionine at position 39 of the ENSP00000428845 sequence. ENSP00000022615 is also annotated in the CCDS database, a resource that centralizes the identification of well-supported, consistently annotated, protein-coding regions (Pruitt ). MAISTAS was able to provide a plausible structural model for isoforms ENSP00000428845 and ENSP00000022615 (Fig. 2A and F), while models of ENSP00000428519, ENSP00000428977, ENSP00000429006 and ENSP00000428029 were considered unlikely or incomplete (Fig. 2B–E). Inspection of the HHpred alignment used for building the ENSP00000428519, ENSP00000428977, ENSP00000429006 and ENSP00000428029 isoform models does not highlight any specific problem with the alignment (data not shown); however, the VDAC3 beta-barrel domain architecture is completely disrupted in the models of ENSP00000428519, ENSP00000428977, ENSP00000429006 and ENSP00000428029 (Fig. 2B–E). All these isoforms show a large exposed hydrophobic surface, (around 22 Å2, compared with the expected value of 15.6 Å2 and with the value observed for the template of 15.9 Å2). This dramatic architecture variation might imply that the isoforms are non-functional or that they perform a completely different function.

Fig. 2.

Three-dimensional models of the VDAC3 protein isoforms. (A) ENSP00000428845. (B) ENSP00000428977. (C) ENSP00000428519. (D) ENSP00000428029. (E) ENSP00000429006. (F) ENSP00000422615.

4 CONCLUSION

The more detailed is the analysis of the genomes of higher eukaryotes, the more complex they are revealed to be. For example, it is becoming clear that alternative splicing events do not simply result in a modulation of the function of the gene products, for example, by removing or adding structurally compact domains, or by modifying the sequence of specific regions of the encoded protein, but that they can either have a profound effect on the structure and function of the products of the same gene or give raise to non-functional products (Melamud and Moult, 2009a, b; Tress ). The latter can nevertheless have a relevant biological function. For example, Poliseno et al. demonstrated that transcripts may also function by competing for microRNA binding, a biological activity independent of the translation of the protein they encode (Poliseno ). It is impossible for any currently available method, including ours, to assess which is the case. The method described here is able to correctly classify as plausible a large fraction of the experimentally characterized isoforms, and to highlight dubious cases. Our aim is to provide easy access to a computational tool able to draw the attention of the life science community to them. Consequently, we took special care to convey the results of the analysis, although based on rather sophisticated tools, in an easy and understandable fashion. MAISTAS provides access to all the intermediate data used to generate the results, but it describes them in a human readable form. We believe that MAISTAS represents a step in the direction of using the knowledge accumulated in structural bioinformatics as well as the maturity of the tools available for applications related to the interpretation of genomic data and that it can be effectively used as a first step in characterizing novel proteins as well as a support for selecting interesting and intriguing cases for structural and functional studies.

31 in total

1. Protein packing: dependence on protein size, secondary structure and amino acid composition.

Authors: P J Fleming; F M Richards
Journal: J Mol Biol Date: 2000-06-02 Impact factor: 5.469

2. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

3. The Ensembl genome database project.

Authors: T Hubbard; D Barker; E Birney; G Cameron; Y Chen; L Clark; T Cox; J Cuff; V Curwen; T Down; R Durbin; E Eyras; J Gilbert; M Hammond; L Huminiecki; A Kasprzyk; H Lehvaslaiho; P Lijnzaad; C Melsopp; E Mongin; R Pettett; M Pocock; S Potter; A Rust; E Schmidt; S Searle; G Slater; J Smith; W Spooner; A Stabenau; J Stalker; E Stupka; A Ureta-Vidal; I Vastrik; M Clamp
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

4. POPS: A fast algorithm for solvent accessible surface areas at atomic and residue level.

Authors: Luigi Cavallo; Jens Kleinjung; Franca Fraternali
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

5. Swiss-Prot: juggling between evolution and stability.

Authors: Amos Bairoch; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger
Journal: Brief Bioinform Date: 2004-03 Impact factor: 11.622

6. Protein homology detection by HMM-HMM comparison.

Authors: Johannes Söding
Journal: Bioinformatics Date: 2004-11-05 Impact factor: 6.937

7. Calculation of standard atomic volumes for RNA and comparison with proteins: RNA is packed more tightly.

Authors: N R Voss; M Gerstein
Journal: J Mol Biol Date: 2005-01-05 Impact factor: 5.469

8. Occluded molecular surface: analysis of protein packing.

Authors: N Pattabiraman; K B Ward; P J Fleming
Journal: J Mol Recognit Date: 1995 Nov-Dec Impact factor: 2.137

9. Comparative protein modelling by satisfaction of spatial restraints.

Authors: A Sali; T L Blundell
Journal: J Mol Biol Date: 1993-12-05 Impact factor: 5.469

10. Coding potential of the products of alternative splicing in human.

Authors: Guido Leoni; Loredana Le Pera; Fabrizio Ferrè; Domenico Raimondo; Anna Tramontano
Journal: Genome Biol Date: 2011-01-20 Impact factor: 13.583

7 in total

1. Exploring the functional impact of alternative splicing on human protein isoforms using available annotation sources.

Authors: Dinanath Sulakhe; Mark D'Souza; Sheng Wang; Sandhya Balasubramanian; Prashanth Athri; Bingqing Xie; Stefan Canzar; Gady Agam; T Conrad Gilliam; Natalia Maltsev
Journal: Brief Bioinform Date: 2019-09-27 Impact factor: 11.622

2. Characterization of ABL exon 7 deletion by molecular genetic and bioinformatic methods reveals no association with imatinib resistance in chronic myeloid leukemia.

Authors: Nóra Meggyesi; Lajos Kalmár; Sándor Fekete; Tamás Masszi; Attila Tordai; Hajnalka Andrikovics
Journal: Med Oncol Date: 2011-10-30 Impact factor: 3.064