ExoLocator--an online view into genetic makeup of vertebrate proteins.

Aik Aun Khoo, Mario Ogrizek-Tomas, Ana Bulovic, Matija Korpar, Ece Gürler, Ivan Slijepcevic, Mile Šikic, Ivana Mihalek.

Abstract

ExoLocator (http://exolocator.eopsf.org) collects in a single place information needed for comparative analysis of protein-coding exons from vertebrate species. The main source of data--the genomic sequences, and the existing exon and homology annotation--is the ENSEMBL database of completed vertebrate genomes. To these, ExoLocator adds the search for ostensibly missing exons in orthologous protein pairs across species, using an extensive computational pipeline to narrow down the search region for the candidate exons and find a suitable template in the other species, as well as state-of-the-art implementations of pairwise alignment algorithms. The resulting complements of exons are organized in a way currently unique to ExoLocator: multiple sequence alignments, both on the nucleotide and on the peptide levels, clearly indicating the exon boundaries. The alignments can be inspected in the web-embedded viewer, downloaded or used on the spot to produce an estimate of conservation within orthologous sets, or functional divergence across paralogues.

Year: 2013 PMID: 24271393 PMCID: PMC3965120 DOI: 10.1093/nar/gkt1164

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The whole-genome sequencing projects, completed (1) or still under way (2), are bringing the comparative analysis of genomic and protein sequences (3) to a new level of insight and reliability, and giving fresh impetus to the field. However, assembling an exhaustive set of recognizably related sequences, be they protein or nucleotide, such that they are complete and their source is clear, remains a painstaking task. ExoLocator aims to alleviate the problem for the case in which this is currently feasible: protein-coding sequences from the fully sequenced vertebrate genomes. Protein-coding sequences tend to be easier to locate on the genome than the rest of the functional material therein, and the homologues from different species are also easier to align faithfully once translated to the amino acid alphabet. Working with the completed genomes enables us to establish the cognate sequences in various species that are the closest mutual homologues. These can then be used as templates for the similarity search to locate the full complement of exons for each studied gene. Finally, to organize the database in searchable chunks, we use the human genome as the organizing point, a type of organization that commonly agrees with the search criteria used in biomedically motivated analysis. The need for such a data compilation, available in a single place, has been recognized in the community by earlier servers of a related nature (4–7), though these are smaller in scope than the effort presented here.

RESULTS AND THEIR PRESENTATION

Information collected in ExoLocator

ExoLocator takes the ENSEMBL (1) database as its primary source of information. The data set is organized using the human genome as the orientation map. All human genes annotated in ENSEMBL as ‘known’ and ‘protein coding’ are collected, and their identifiers are used as a reference for the whole group of vertebrate genes annotated as orthologues (one-to-one or one-to-many) by the ENSEMBL annotation pipeline. According to ENSEMBL, some exons do not seem to have a counterpart in closely related orthologous genes, and ExoLocator is in part an investigation into the possibility that they were overlooked in the annotation process. As an estimate of the amount of information lost, we take all canonical exons from human protein-coding genes, and align them with the exons from the genes annotated as orthologous in other species. In these alignments, some 15% of the expected orthologues of human exons appear absent. In our pipeline, 85% of the regions where we expect to find the exons (see ‘Materials and Methods’ section) contain explicitly indicated un-sequenced stretches of genomic sequence. Two to three percent of the missing exons are still recoverable, in full or in partial length, by homology search. The exons found by the pipeline are added to the overall collection, and organized in several different ways for display and downstream analysis.

ExoLocator’s web interface

The database offers for download the set of protein-coding exons compiled from ENSEMBL, complemented with a straightforward homology search. It also provides the most complete reconstruction of full protein sequences we can achieve with this approach. ExoLocator’s interface provides several ways to inspect the data: as lists of exons corresponding to a gene in each species, as an alignment of orthologous proteins with the exon boundaries indicated, as an alignment of within-species paralogues, or as an alignment of alternative splices. The last option is available only for the cases that have Consensus CDS annotation (8). The alignments are available at the nucleotide and amino acid levels. The visualization of the alignment is provided by the browser-embedded JalView alignment viewer (9). The orthologue alignment comes with a set of notes detailing the exons it contains: their position in the gene and their source, which is either ENSEMBL itself, the Havana annotation project (10) or the similarity search using the closest detectable homologue (see ‘Materials and Methods’ section). The search in ExoLocator can be done by providing the ENSEMBL identifier, by pasting in the sequence on the protein level or through a limited name resolution search.

MATERIALS AND METHODS

The original exon set available at ENSEMBL, release e73, our main source of raw genomic data, was assembled through a combination of de novo gene detection and heuristic search (BLAST) by similarity. To that arsenal of methods we have added a pairwise alignment (or similarity search) algorithm by Edgar (11), and our in-house implementation of a hardware-accelerated version of Smith–Waterman search (12). In addition to being an extensive exercise in mining the ENSEMBL core data, the database also provides an insight into the extent to which the number of known exons can be extended by optimal sequence alignment (applied to detection of homologous sequences across species). To establish the search pipeline, implemented in Python 2.6, we had to make several decisions and develop appropriate software.

Pipeline description

The first decision we make is to select a canonical set of exons for each human gene in ENSEMBL to use as the reference points in our search. Where exons overlap or disagree, exons annotated as ‘known’ with the greatest length and coverage are chosen over the others. We do this by modeling exons as nodes in a directed acyclic graph, with edges going from each exon to the overlapping exons of lesser quality. Quality is measured by the strength of the annotation (Havana over ENSEMBL; strongly supported splice signal over none), the length and the similarity to a known template in existing species. The set of nodes with no incoming edges is then taken as our model set of exons for the gene. Next, to each human exon we attach a map to ‘master’ exons from the corresponding genes in other species. The maps are further reconciled in a full-length protein alignment, to detect and accommodate the cases of different intron positioning across species. An ad hoc pairwise aligner that respects exon boundaries is used for the purpose. For the final alignment on the multiple sequence level we use the MAFFT alignment utility (13).
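The graph-based selection described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Exon` fields, the `quality` ordering (Havana first, then splice signal, then length, then template similarity) and the function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    # All field names are hypothetical; they mirror the quality criteria in the text.
    exon_id: str
    start: int                  # inclusive genomic coordinate
    end: int                    # inclusive genomic coordinate
    is_havana: bool             # Havana annotation outranks ENSEMBL
    has_splice_signal: bool     # strongly supported splice signal outranks none
    template_similarity: float  # similarity to a known template

    @property
    def length(self):
        return self.end - self.start + 1


def quality(exon):
    """Comparison key; the relative order of the criteria is an assumption."""
    return (exon.is_havana, exon.has_splice_signal,
            exon.length, exon.template_similarity)


def overlaps(a, b):
    """True if the two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def canonical_exons(exons):
    """Return the exons that have no incoming edge.

    An edge a -> b exists when a and b overlap and a has strictly higher
    quality. Strict comparison keeps the graph acyclic; overlapping exons
    of identical quality are both retained.
    """
    has_incoming = set()
    for a in exons:
        for b in exons:
            if a is not b and overlaps(a, b) and quality(a) > quality(b):
                has_incoming.add(b.exon_id)
    kept = [e for e in exons if e.exon_id not in has_incoming]
    return sorted(kept, key=lambda e: e.start)
```

With a Havana exon spanning 100-200, an overlapping ENSEMBL exon spanning 150-220 and an isolated exon spanning 300-400, the function keeps the first and the third. The quadratic pairwise scan is adequate for the handful of exons in a single gene.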

Search for missing exons

To detect a missing exon we align a target vertebrate set of exons corresponding to a single gene to the most convincingly homologous set in human. To relate exons to their parent gene we again rely on the annotation provided by ENSEMBL. The alignment provides the boundaries on the target gene for the search for the missing exon. Next, we need to choose a template from a species that has the exon annotated and is, in some sense, the nearest to the species in which the exon annotation is missing. For that purpose we use the taxonomy tree available at the NCBI’s Taxonomy Web site (14). We traverse the tree to look for the taxonomically closest species that has an exon mapping to the human one at the expected place, and use it as a template for the sequence similarity search. Finally, to detect the region of homology, we use an advanced CPU implementation of a heuristic search, and a GPU-enabled Smith–Waterman algorithm. The implementations of the latter available in the public domain are not capable of handling the sizes of the input sequences we have at hand, and therefore we use our own implementation, in which the problem is divided into smaller chunks in a way amenable to graphics card acceleration (https://github.com/mkorpar/swSharp). The exons that ExoLocator reports satisfy two criteria: their translation must be longer than three residues, and their similarity to the template must exceed a fixed cutoff. The similarity is a Tanimoto-like measure computed from L1 and L2, the lengths of the template and the candidate exon, and S12, the similarity-weighted length of the common aligned positions.
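Two steps of this search, choosing the template species and filtering the candidates, can be sketched as below. The sketch rests on stated assumptions: the taxonomy is given as a child-to-parent dictionary, the similarity uses the standard Tanimoto form S12 / (L1 + L2 - S12), and the cutoff is a caller-supplied parameter because the text gives no value. All function and parameter names are hypothetical.

```python
def lineage(species, parent):
    """Path from a species up to the root of the taxonomy tree, inclusive."""
    path = [species]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def nearest_template_species(target, candidates, parent):
    """Pick the candidate whose common ancestor with `target` is closest to it.

    `candidates` are species that have the exon annotated; `parent` maps each
    taxonomy node to its parent. Returns None if no candidate shares an ancestor.
    """
    steps_up = {node: i for i, node in enumerate(lineage(target, parent))}
    best, best_steps = None, None
    for cand in candidates:
        if cand == target:
            continue
        # Steps from target up to the first ancestor shared with the candidate.
        shared = next((steps_up[n] for n in lineage(cand, parent)
                       if n in steps_up), None)
        if shared is not None and (best_steps is None or shared < best_steps):
            best, best_steps = cand, shared
    return best


def tanimoto_like(len_template, len_candidate, weighted_overlap):
    """Assumed form S12 / (L1 + L2 - S12); equals 1.0 for a perfect full-length match."""
    denom = len_template + len_candidate - weighted_overlap
    return weighted_overlap / denom if denom > 0 else 0.0


def accept_exon(peptide, len_template, len_candidate, weighted_overlap, cutoff):
    """Apply both reporting criteria; `cutoff` has no default on purpose."""
    long_enough = len(peptide) > 3  # translation longer than three residues
    score = tanimoto_like(len_template, len_candidate, weighted_overlap)
    return long_enough and score > cutoff
```

For example, with a toy tree in which chimp and human share a primate parent while mouse joins only at the mammal level, a search from chimp over the candidates mouse and zebrafish returns mouse. Ties between equally close candidates are resolved by input order here; a real pipeline would need an explicit rule.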

Known problems and caveats

Some otherwise interesting peculiarities of animal genomes complicate systematic analysis of the sort we undertook here. Thus we do not attempt to resolve the cases of overlapping genes on the same strand, of which we detected several hundred (some of these, though, might be duplicate entries in the source database we are using). As we rely on the ENSEMBL pipeline for the annotation of at least the approximate exon location, when the whole gene is unannotated in a species, it will be missing in our search too. Also, the detection by similarity that we use does not allow us to decide on the precise location of the gene boundary. We use MaxEntScan (15), a lightweight implementation of a strong statistical tool, to try to estimate the likelihood that the predicted exon has a proper splice signal. MaxEntScan in its currently available parametrization works best for mammalian sequences of introns being spliced out by the major spliceosome. In these cases we use MaxEntScan to decide on the boundary and the possible phase of the exon, and provide the score MaxEntScan assigns in the accompanying notes.

Database implementation

The database that is accessible via the Internet is a relatively straightforward MySQL application with a small number of tables storing the information about the exon coordinates, sequences and the source of the annotation. The processing pipeline used to fill the database, however, is a much more complex Python implementation sourcing the data from the local versions of the core ENSEMBL databases. The Web interface for the database was implemented in Play! Framework (http://www.playframework.org). With the full cycle of data processing being rather time-consuming, the database will be on a semiannual update schedule.
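To make the storage layout concrete, here is a purely illustrative schema holding the three kinds of information mentioned above (coordinates, sequences and annotation source). It is not the actual ExoLocator schema: the table and column names are invented for the example, and Python's built-in `sqlite3` module stands in for MySQL so that the sketch runs without a server.

```python
import sqlite3

# Hypothetical two-table layout; every name below is an assumption.
SCHEMA = """
CREATE TABLE gene (
    gene_id           TEXT PRIMARY KEY,
    species           TEXT NOT NULL,
    human_ortholog_id TEXT              -- human gene used as the reference
);
CREATE TABLE exon (
    exon_id      INTEGER PRIMARY KEY,
    gene_id      TEXT NOT NULL REFERENCES gene(gene_id),
    start_pos    INTEGER NOT NULL,
    end_pos      INTEGER NOT NULL,
    strand       INTEGER NOT NULL,
    dna_sequence TEXT,
    source       TEXT NOT NULL
        CHECK (source IN ('ensembl', 'havana', 'similarity_search'))
);
"""


def build_demo_db():
    """Create an empty in-memory database with the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def exons_for_gene(conn, gene_id):
    """List (exon_id, start, end, source) for one gene, ordered by position."""
    cur = conn.execute(
        "SELECT exon_id, start_pos, end_pos, source FROM exon "
        "WHERE gene_id = ? ORDER BY start_pos",
        (gene_id,),
    )
    return cur.fetchall()
```

Recording the source on every exon row is what lets an interface flag which exons come from the original annotation and which were recovered by similarity search.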

CONCLUSION AND OUTLOOK

At its current stage, ExoLocator aims to balance the goal of giving the complete picture of possible coding exons with the need to stay grounded in terms of the verifiability of the actual function of the sequences it collects. Thus it relies on ENSEMBL’s annotation of ‘known’ (‘known’ here being the actual annotation term used, hence the quotes) human exons as the anchor for the search and for the presentation of the results. In certain cases, it seems that the data from species other than human argue for a different annotation, but we deliberately choose to stay away from any reinterpretation of the existing data. Rather, we envision the database functioning as a shortcut to quick retrieval of the established sequences of well-documented human exons and their closest counterparts in the other species, complemented by the putative exon set collected by a reliable search utility. The hope is that, rather than as an interpretative tool, ExoLocator will be understood and used as a resource, ultimately leading to a fuller understanding of the complex mechanism of gene function, alternative splicing and translation.

FUNDING

Biomedical Research Council of Agency for Science, Technology and Research Singapore. Funding for open access charge: Agency for Science, Technology and Research, Singapore.

Conflict of interest statement. None declared.

15 in total

1. EID: the Exon-Intron Database-an exhaustive database of protein-coding intron-containing genes.

Authors: S Saxonov; I Daizadeh; A Fedorov; W Gilbert
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Search and clustering orders of magnitude faster than BLAST.

Authors: Robert C Edgar
Journal: Bioinformatics Date: 2010-08-12 Impact factor: 6.937

3. Global analysis of exon creation versus loss and the role of alternative splicing in 17 vertebrate genomes.

Authors: Alexander V Alekseyenko; Namshin Kim; Christopher J Lee
Journal: RNA Date: 2007-03-16 Impact factor: 4.942

4. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Authors: Kim D Pruitt; Jennifer Harrow; Rachel A Harte; Craig Wallin; Mark Diekhans; Donna R Maglott; Steve Searle; Catherine M Farrell; Jane E Loveland; Barbara J Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L Cherry; Val Curwen; Michael Dicuccio; Manolis Kellis; Jennifer Lee; Michael F Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart; Bonnie L Maidak; Jonathan Mudge; Michael R Murphy; Terence Murphy; Jeena Rajan; Bhanu Rajput; Lillian D Riddick; Catherine Snow; Charles Steward; David Webb; Janet A Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David Lipman
Journal: Genome Res Date: 2009-06-04 Impact factor: 9.043

5. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species.

Authors:
Journal: J Hered Date: 2009-11-05 Impact factor: 2.645

6. Jalview Version 2--a multiple sequence alignment editor and analysis workbench.

Authors: Andrew M Waterhouse; James B Procter; David M A Martin; Michèle Clamp; Geoffrey J Barton
Journal: Bioinformatics Date: 2009-01-16 Impact factor: 6.937

7. Unconstrained mining of transcript data reveals increased alternative splicing complexity in the human transcriptome.

Authors: I G Mollet; Claudia Ben-Dov; Daniel Felício-Silva; A R Grosso; Pedro Eleutério; Ruben Alves; Ray Staller; Tito Santos Silva; Maria Carmo-Fonseca
Journal: Nucleic Acids Res Date: 2010-04-12 Impact factor: 16.971

8. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals.

Authors: Gene Yeo; Christopher B Burge
Journal: J Comput Biol Date: 2004 Impact factor: 1.479

9. MAFFT version 5: improvement in accuracy of multiple sequence alignment.

Authors: Kazutaka Katoh; Kei-ichi Kuma; Hiroyuki Toh; Takashi Miyata
Journal: Nucleic Acids Res Date: 2005-01-20 Impact factor: 16.971

10. The vertebrate genome annotation (Vega) database.

Authors: L G Wilming; J G R Gilbert; K Howe; S Trevanion; T Hubbard; J L Harrow
Journal: Nucleic Acids Res Date: 2007-11-14 Impact factor: 16.971
