
Multiple whole genome alignments and novel biomedical applications at the VISTA portal.

Michael Brudno, Alexander Poliakov, Simon Minovitsky, Igor Ratnere, Inna Dubchak.

Abstract

The VISTA portal for comparative genomics is designed to give biomedical scientists a unified set of tools to lead them from the raw DNA sequences through the alignment and annotation to the visualization of the results. The VISTA portal also hosts the alignments of a number of genomes computed by our group, allowing users to study the regions of their interest without having to manually download the individual sequences. Here we describe various algorithmic and functional improvements implemented in the VISTA portal over the last 2 years. The VISTA Portal is accessible at http://genome.lbl.gov/vista.

Year: 2007 PMID: 17488840 PMCID: PMC1933192 DOI: 10.1093/nar/gkm279

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Comparing genomic sequences across related species is a fruitful source of biological insight. Functional elements such as exons tend to exhibit significant sequence similarity due to purifying selection, whereas regions that are not functional tend to be neutrally evolving and thus less conserved. The first step in comparing genomic sequences is to align them—to map the letters of the sequences to each other. After an alignment is computed, visualization frameworks become essential to enable users to interact with the sequence and conservation data, especially in the context of longer DNA sequences or whole genomes. Visualization frameworks should be easy to understand by a biologist and provide insight into the mutations that a particular genomic locus has undergone. The VISTA portal is a comprehensive comparative genomics resource that provides biomedical scientists with a single unified framework to generate and download multiple sequence alignments, visualize the results in the context of existing annotations and analyze comparative results in search for important sequence signals in alignments. The VISTA suite of programs has been in development and continued use since 2000 (1–4). It was originally developed for the alignment and comparative analysis of long genomic sequences and later was expanded to pair-wise and multiple alignment of vertebrate genomes. VISTA has popularized the visualization of the level of conservation in the format of a continuous curve based on the conservation in a sliding window. This concept proved to be extremely successful due to the easy interpretation of the resulting plots. VISTA was built through a close collaboration between computational and biological scientists, resulting in a product that is robust, efficient and powerful, yet simple to use for a person without extensive computer experience, as is illustrated by more than 1000 citations to the various VISTA-associated tools (according to http://scholar.google.com). 
In the last 2 years the VISTA portal has seen many significant improvements. In addition to updating the whole genome alignments, computed using recent assemblies of vertebrate, insect, plant and microbial genomes, we have added significant new functionality and resources to the Genome Browser and other tools, including: (i) a novel multiple whole genome alignment algorithm; (ii) a new server for whole-genome alignment of bacterial genomes; (iii) base-pair level visualization within the VISTA browser; (iv) visual access to the results of the prediction of potential deleteriousness of non-synonymous Single Nucleotide Polymorphisms (SNPs) by the algorithm PolyPhen (5); (v) a novel conservation track, Rank-VISTA, to show the statistical significance of conserved regions computed by the Gumby algorithm (6); and (vi) Whole-genome rVISTA, which evaluates which conserved transcription factor binding sites (TFBS) are over-represented in a group of genes.

VISTA PORTAL

The suite of VISTA tools is accessible through the website http://genome.lbl.gov/vista. Currently it includes five servers for the analysis of user-submitted sequences: mVISTA, which computes alignments of user-submitted sequences; wgVISTA, which aligns whole genomes (up to 10 megabases in length); GenomeVISTA, which aligns a user-submitted sequence to a selected genome assembly; rVISTA, which searches for conserved TFBS; and Phylo-VISTA, which analyzes multiple DNA sequence alignments of different species while considering their phylogenetic relationship. In addition, multiple whole-genome alignments of vertebrate, insect and plant species have been built using in-house algorithms and are publicly available for browsing and analysis. The portal provides access to the VISTA Genome Browser, the main visual interface for both the pre-computed whole genome alignments and alignments of user-submitted sequences. In the sections below we discuss various algorithmic and functional improvements to the VISTA portal in the last 2 years.

Algorithmic improvements

Most of the algorithms for multiple alignment either rely on a reference genome, against which all of the other sequences are laid out, or require a one-to-one mapping, where each nucleotide of one genome is constrained to align to at most one place on the other genome (4,7–9). Both approaches have drawbacks for whole-genome comparisons. The first approach requires computation, storage and analysis of several alignments, one for each base species. The second approach fails to align any gene that has undergone duplications since the divergence of the species being compared. Additionally, ‘referenced’ alignments commonly fail to include elements conserved among some genomes, but missing in the base genome. We have developed and implemented a novel alignment algorithm that treats all genomes symmetrically.

Improved alignment pipeline

Initially our whole genome alignment pipeline used an alignment strategy where one genome was split up into contigs of about 250 kilobases (kb) (3,4). The potential orthologs for each contig were found in the second genome with the BLAT local aligner (12). This step was followed by a global alignment of two orthologous sequences. Although this approach produces a map that is more accurate within large syntenic blocks than an all-by-all local alignment, it has two main weaknesses: (i) small syntenic blocks, resulting from rearrangements within a larger region, may be missed; (ii) the initial arbitrary division of one genome into segments can split a syntenic region, making it difficult to map the region to its true ortholog. To address these issues we have developed a ‘glocal’ alignment method, which treats the rearrangement events explicitly. There have been several algorithms that decide whether to accept or reject a local alignment based on other alignments near it, and thus allow for the direct treatment of the various rearrangement events. These include Shuffle-LAGAN (S-LAGAN, (10)), which currently serves as the underlying engine for our whole genome alignments. The pair-wise algorithm described below is based on a novel chaining tool, called SuperMap. The multiple alignment algorithm is a progressive extension of the pair-wise one, where at every internal node we pick an ordering of the alignments that simplifies the next alignment that we will conduct. These algorithms will be discussed in detail in a separate publication.
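Weakness (ii) of the original pipeline can be illustrated with a minimal sketch. Only the 250 kb contig size comes from the text; the function names, the toy genome length and the example region are hypothetical and are not part of the VISTA pipeline.

```python
def split_into_contigs(genome_length, contig_size=250_000):
    """Cut a genome into consecutive half-open (start, end) contigs."""
    return [(start, min(start + contig_size, genome_length))
            for start in range(0, genome_length, contig_size)]


def contigs_spanned(region, contigs):
    """Return the contigs that overlap a half-open (start, end) region."""
    start, end = region
    return [c for c in contigs if c[0] < end and start < c[1]]


# Hypothetical 600 kb genome and a 20 kb syntenic region near a boundary.
contigs = split_into_contigs(600_000)
hit = contigs_spanned((240_000, 260_000), contigs)
print(contigs)  # three contigs, the last one shorter
print(hit)      # the region is split across two contigs
```

A region that straddles an arbitrary boundary ends up in two separately aligned contigs, which is the situation the glocal method is meant to avoid.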

Pair-wise alignment

To align two genomes our algorithm uses a novel approach based on a reimplementation of the original S-LAGAN chaining algorithm (10,11) combined with a novel post-processing stage called SuperMap. The S-LAGAN chaining takes as input a set of local alignments between the two sequences generated by BLAT (12) or any other local aligner and returns the maximal scoring subset of these under certain gap criteria. This subset is called a 1-monotonic conservation map. In order to allow S-LAGAN to catch rearrangements, the map is allowed to be non-decreasing (monotonic) in only one sequence, without putting any restrictions on the second sequence. The 1-monotonic chain can capture all rearrangement events besides duplications in the second genome. In order to allow our alignments to incorporate these events we have introduced the novel SuperMap algorithm that takes two S-LAGAN outputs to make our algorithm symmetric. We run S-LAGAN twice, using each sequence as the base. This gives us three pieces of data: the original local alignments, which were common to the two runs of S-LAGAN, and two chains of these alignments, each corresponding to the S-LAGAN 1-monotonic maps. We then classify all local alignments as belonging to both chains, and consequently orthologous (best bi-directional hits) or being in only one chain, and hence a duplication (see Figure 1 for a graphical overview of the algorithm).

Figure 1.

SuperMap Algorithm. (a) Local alignment hits: regions A and B correspond to duplications in Organism 1; regions C and D correspond to duplications in Organism 2; (b) S-LAGAN chain for Organism 1 as a base: the chain increases in the direction of the X-axis, but can jump up and down in the Y-direction (Organism 2); region D is left out; (c) S-LAGAN chain for Organism 2 as a base: the chain increases in the direction of the Y-axis; region B is left out; (d) SuperMap output, which combines the regions of panels (b) and (c).

SuperMap has two advantages over regular S-LAGAN. One advantage is that it locates both regions of one-to-one similarity (those that were in both 1-monotonic chains) and likely duplications in both sequences (those in only one chain). Additionally, in case of transpositions, two of the pieces are no longer arbitrarily joined together.
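The classification step of SuperMap can be sketched as follows. This is a simplified illustration, not the authors' implementation: the S-LAGAN gap criteria are replaced by a plain requirement that chained hits do not overlap on the base sequence, and the `Hit` type, the function names and the four toy hits (laid out like regions A to D of Figure 1) are assumptions made for the example.

```python
from bisect import bisect_right
from typing import NamedTuple


class Hit(NamedTuple):
    """A local alignment: half-open intervals on sequence 1 (x) and 2 (y)."""
    name: str
    x_start: int
    x_end: int
    y_start: int
    y_end: int
    score: float


def one_monotonic_chain(hits, base):
    """Names of the best-scoring hits that are ordered and non-overlapping
    on the base sequence ("x" or "y"); the other sequence is unconstrained.
    Simplification: no gap penalties are applied."""
    lo, hi = (1, 2) if base == "x" else (3, 4)
    ordered = sorted(hits, key=lambda h: h[hi])
    ends = [h[hi] for h in ordered]
    best = [0.0] * (len(ordered) + 1)  # best[i]: top score using ordered[:i]
    taken = [False] * len(ordered)
    prev = [0] * len(ordered)
    for i, h in enumerate(ordered):
        # Number of earlier hits that end at or before this hit starts.
        prev[i] = bisect_right(ends, h[lo], 0, i)
        with_hit = best[prev[i]] + h.score
        taken[i] = with_hit > best[i]
        best[i + 1] = max(with_hit, best[i])
    chain, i = set(), len(ordered)
    while i > 0:  # trace back the chosen hits
        if taken[i - 1]:
            chain.add(ordered[i - 1].name)
            i = prev[i - 1]
        else:
            i -= 1
    return chain


def supermap(hits):
    """Label hits found in both chains as orthologous, in one as duplication."""
    chain_x = one_monotonic_chain(hits, "x")
    chain_y = one_monotonic_chain(hits, "y")
    labels = {}
    for h in hits:
        in_x, in_y = h.name in chain_x, h.name in chain_y
        if in_x and in_y:
            labels[h.name] = "orthologous"
        elif in_x or in_y:
            labels[h.name] = "duplication"
    return labels


hits = [
    Hit("A", 0, 100, 0, 100, 100.0),
    Hit("B", 200, 300, 0, 100, 80.0),    # second copy in Organism 1
    Hit("C", 400, 500, 200, 300, 100.0),
    Hit("D", 400, 500, 400, 500, 80.0),  # second copy in Organism 2
]
labels = supermap(hits)
print(labels)
```

With these toy hits the chain based on Organism 1 drops D and the chain based on Organism 2 drops B, so A and C are reported as orthologous and B and D as duplications.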

Progressive multiple alignment

After the SuperMap algorithm is used to align the two pairs of sister taxa we use a progressive generalization of the pair-wise SuperMap algorithm to align all of the genomes, by following the species’ phylogenetic tree. After aligning two genomes, our algorithm joins together syntenic blocks (regions of genomes without rearrangements) based on their order in the outgroups (those sequences that will be aligned at a later stage: for example if we have aligned mouse with rat, then human, dog and chicken are all outgroups). We use an algorithm based on finding a maximum weighted matching in a graph, with the weights specified by the outgroup genomes, to order the individual alignment blocks in the order that will create the simplest alignment problem when we align the result to the outgroup. We then use the SuperMap based pair-wise alignment algorithm to align the alignments to each other using the regular LAGAN aligner (13). This algorithm is summarized in the flowchart in Figure 2.
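The order of the progressive steps, and the outgroups available at each step, can be sketched with a post-order walk of the tree. The nested-tuple tree encoding, the function names and the particular five-species topology are assumptions for illustration; the source only gives the mouse/rat example.

```python
def leaves(tree):
    """List the species under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def progressive_steps(tree, all_species=None):
    """Post-order list of (left species, right species, outgroups)."""
    if all_species is None:
        all_species = leaves(tree)
    if isinstance(tree, str):
        return []
    left, right = tree
    steps = (progressive_steps(left, all_species)
             + progressive_steps(right, all_species))
    inside = leaves(tree)
    outgroups = [s for s in all_species if s not in inside]
    steps.append((leaves(left), leaves(right), outgroups))
    return steps


# Assumed topology for the five species named in Figure 3.
tree = (((("mouse", "rat"), "human"), "dog"), "chicken")
steps = progressive_steps(tree)
for left, right, outgroups in steps:
    print(left, "x", right, "| outgroups:", outgroups)
```

The first step aligns mouse with rat while human, dog and chicken act as outgroups, matching the example in the text; the final step has no outgroup left.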

Figure 2.

Multiple alignment with LAGAN in the VISTA Genome Pipeline (VGP). After running a local alignment program, SuperMap Chaining is used to identify all rearrangements. The resulting regions are aligned with LAGAN, and finally a maximum matching algorithm is used to predict ancestral contigs. These ancestral contigs are then used to align to outgroup genomes in the higher levels of the phylogenetic tree.

By picking an order of the syntenic blocks which is closest to the outgroup we facilitate the alignment of the more distant genomes. Our approach has several advantages over previous algorithms: (i) it does not assume a base genome, to which all other genomes are aligned, but creates a symmetric alignment equally valid for all genomes; (ii) it penalizes various rearrangement events based on an evolutionary tree, creating a set of alignments that mirror the evolutionary history of the sequences; and (iii) it is able to align short, low similarity syntenic blocks based on their adjacency to higher similarity areas even when there has been a rearrangement event between them.
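The block-ordering step can be illustrated with a small maximum weighted matching over block ends. This is a sketch under stated assumptions: the exhaustive search only suits toy inputs, the block-end labels are hypothetical, and the edge weights (here imagined as the number of outgroups in which two block ends are adjacent) are one plausible reading of "weights specified by the outgroup genomes", not the published scoring.

```python
def max_weight_matching(edges):
    """Exhaustive maximum weighted matching.

    edges maps (u, v) node pairs to weights. Returns (total weight,
    chosen edges). Exponential time; intended for small examples only.
    """
    items = sorted(edges.items())

    def solve(i, used):
        if i == len(items):
            return 0, []
        (u, v), weight = items[i]
        best = solve(i + 1, used)  # option 1: skip this edge
        if u not in used and v not in used:
            sub_weight, sub_edges = solve(i + 1, used | {u, v})
            if sub_weight + weight > best[0]:  # option 2: take it
                best = (sub_weight + weight, [(u, v)] + sub_edges)
        return best

    return solve(0, frozenset())


# Hypothetical adjacency support between the ends of blocks A, B and C.
edges = {
    ("A.right", "B.left"): 3,
    ("A.right", "C.left"): 1,
    ("B.right", "C.left"): 2,
    ("B.left", "C.right"): 1,
}
total, joins = max_weight_matching(edges)
print(total, joins)
```

Each block end is joined to at most one other end, so the selected edges chain the blocks into an order (here A, B, C) that agrees best with the outgroup evidence.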

wgVISTA: whole genome alignment for user's genomes

In order to allow our users to compare whole genomes using the whole genome alignment algorithms described above we have developed whole genome VISTA (wgVISTA), a tool which accepts sequences up to 10 megabasepairs in length, aligns them using our alignment pipeline and visualizes the results through the VISTA browser.

Visualization improvements

The VISTA Browser allows for the exploration of alignments and annotations of DNA sequences. It shows any number of alignments on a particular base genome and is scalable to the size of whole mammalian chromosomes. At the larger scale, a visual presentation of rearrangements, inversions and gaps in the alignment is also available through the browser. Because all of our alignments are built in a symmetric fashion (see above section) the user may select any sequence or genome as the reference, and display the level of conservation between this reference and the sequences of other organisms. The browser has a number of options, such as zoom, extraction of a region to be displayed, user-defined parameters for conservation level and the selection of sequence elements for study. The VISTA Browser also gives access to the Text Browser that provides a user with all data related to alignments, analysis of conservation, and access to other resources. We have recently introduced two significant features into the VISTA browser that allow for a more detailed analysis of the areas of conservation detected in alignments. First, we have added a scrollable nucleotide-level alignment window. This window displays not only the details of the underlying pair-wise or multiple alignment, but also additional nucleotide-level annotation such as SNPs (Figure 3). Unlike the main VISTA window, the base-pair window does not have a selected continuous base sequence, but rather shows a real pair-wise or multiple alignment where a user can analyze gaps and substitutions in any sequence.

Figure 3.

VISTA Browser display of a 9.6 kb fragment of the NR1H3 gene on Chr. 11 of the human genome (hg17). VISTA plots for the five-way Human–Dog–Mouse–Rat–Chicken alignment are shown. Conserved sequences in VISTA (70%/100 bp cutoff) are colored according to the annotation (exons: dark blue, UTRs: turquoise, non-coding: pink). Rank-VISTA peaks identified by Gumby (P < 0.5) are shown as vertical bars following the same coloring convention. At the bottom of the window one can see the base-pair browser and SNP data [dbSNP annotation and PolyPhen (5) prediction of functionality for coding SNPs].

Second, we have introduced a plot to show the statistically significant conserved segments. The RankVISTA plot is a histogram-like plot where block width is proportional to the median conserved element length in human, and block area is proportional to the median of −log(P-value) (6). Block height thus represents the degree of evolutionary constraint at the base-pair level. Finally, we have integrated the mVISTA server for user-submitted sequences with the VISTA Browser: when a user submits sequences to mVISTA, instead of just being e-mailed the VISTA results in a PDF document, we now make the alignment into a track on the VISTA browser, allowing the user to zoom in on a region of interest and view the detailed alignment in the nucleotide window.
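The conservation curve behind the VISTA plot, and a cutoff such as the 70%/100 bp one used in Figure 3, can be sketched as a sliding-window percent identity over a pair-wise alignment. The function name, the gap handling and the toy sequences are assumptions; the example uses a 4-column window only to keep the numbers readable.

```python
def window_identity(seq_a, seq_b, window=100):
    """Percent identity in every window of aligned columns.

    seq_a and seq_b are rows of a pair-wise alignment ('-' marks a gap).
    A column counts as a match only if both rows carry the same base.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned rows must have equal length")
    matches = [a == b and a != "-" for a, b in zip(seq_a, seq_b)]
    return [100.0 * sum(matches[i:i + window]) / window
            for i in range(len(matches) - window + 1)]


# Toy alignment; real use would keep window=100 and a 70% cutoff.
curve = window_identity("ACGTACGT", "ACGTTCGA", window=4)
conserved = [i for i, pct in enumerate(curve) if pct >= 70.0]
print(curve)      # [100.0, 75.0, 75.0, 75.0, 50.0]
print(conserved)  # [0, 1, 2, 3]
```

Plotting `curve` against window position gives the continuous conservation profile, and the window starts in `conserved` mark the stretches that would pass the cutoff.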

New applications

One of the main emphases of our development in the last 2 years has been on better integration of the existing VISTA tools and of novel biological data into the VISTA portal. Several new applications developed in our group and by our collaborators have been integrated into the VISTA portal, allowing biologists to easily access and visualize the results of these analyses.

Whole-genome rVISTA

Gene expression studies generate extensive lists of co-expressed genes, which can share regulatory factors that control their synchronous expression. We have developed a computational tool, called Whole-Genome rVISTA, designed to discover which TFBS conserved between pairs of species are over-represented in the upstream regions of a group of genes. This tool uses whole-genome alignments computed in our group, and TRANSFAC Professional from Biobase (14) with the MATCH program (15) to predict TFBS. The effectiveness of Whole-Genome rVISTA was recently illustrated in a study of responsiveness to cAMP regulation (16). That expression study indicated that several circadian rhythm clock genes are induced by cAMP. We used Whole-Genome rVISTA to scan 5 kb upstream of the transcription start site of the cAMP-regulated genes, and found that up-regulated genes contained more cAMP Response Elements (CRE) than all other genes on the array.
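One common way to quantify over-representation of a binding site in a gene group is a one-sided hypergeometric test. The text does not state which statistic Whole-Genome rVISTA uses, so the test itself, the function name and the toy counts below are all assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(total_genes, genes_with_site, group_size, group_with_site):
    """P(X >= group_with_site) when group_size genes are drawn at random
    from total_genes, of which genes_with_site carry the conserved site."""
    upper = min(genes_with_site, group_size)
    favourable = sum(
        comb(genes_with_site, k)
        * comb(total_genes - genes_with_site, group_size - k)
        for k in range(group_with_site, upper + 1)
    )
    return favourable / comb(total_genes, group_size)


# Hypothetical array: 10 genes, 3 with a conserved site upstream;
# a co-expressed group of 3 genes in which all 3 carry the site.
p = enrichment_p_value(10, 3, 3, 3)
print(p)
```

A small value indicates that the site occurs in the group more often than expected from its frequency among all genes on the array.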

Gumby in RankVISTA

With more genomes available it has become essential to introduce new statistically motivated methods for conservation analysis that take into consideration neutral rates and phylogeny of the species. Gumby (6) makes no prior assumptions about evolutionary rates and requires no adjustment of parameters as the phylogenetic scope is varied from primates to vertebrates. Gumby uses a dynamically generated phylogenetic log-odds scoring scheme to identify local segments of any length that evolve slower than the background neutral rate, and ranks these conserved segments by P-value using the Karlin–Altschul statistic. This scoring technique demonstrated its efficiency in analyzing conservation in both evolutionarily distant (17) and very closely related (6,18,19) species. Rank-VISTA plots of Gumby analysis allow the users to judge the statistical significance of any conserved regions and are available through the VISTA Browser for genome-wide alignment of a number of genomes as well as for user-submitted mVISTA queries (Figure 3).
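The general idea of scoring a local segment and attaching a Karlin–Altschul P-value can be sketched as below. This is not Gumby itself: the per-column log-odds scores are made up, the maximal-segment search is a plain Kadane scan, and the parameters `lam` and `k` are hypothetical placeholders for values that would have to be estimated from the scoring scheme.

```python
import math


def best_segment(scores):
    """Highest-scoring contiguous run of per-column log-odds scores.

    Returns (score, start, end) with a half-open [start, end) range.
    """
    best_score, best_start, best_end = 0.0, 0, 0
    run, start = 0.0, 0
    for i, s in enumerate(scores):
        if run <= 0.0:  # a non-positive prefix never helps; restart here
            run, start = 0.0, i
        run += s
        if run > best_score:
            best_score, best_start, best_end = run, start, i + 1
    return best_score, best_start, best_end


def karlin_altschul_p(score, length, lam, k):
    """P-value 1 - exp(-k * length * exp(-lam * score)) for a segment score."""
    expected = k * length * math.exp(-lam * score)
    return -math.expm1(-expected)


# Positive scores mark columns evolving slower than the neutral background.
column_scores = [-1.0, 2.0, 1.5, -0.5, 2.0, -3.0, 0.5]
segment = best_segment(column_scores)
p_value = karlin_altschul_p(segment[0], len(column_scores), lam=1.0, k=0.1)
print(segment, p_value)
```

Ranking segments by this P-value rather than by raw score is what allows segments of different lengths to be compared on one scale.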

PolyPhen on the nucleotide alignment track

PolyPhen (Polymorphism Phenotyping) (5) is a tool which predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. For each non-synonymous SNP in the dbSNP database (20) the VISTA Browser provides access to the results of the PolyPhen analysis of its deleteriousness.

FUTURE DIRECTIONS

The main emphasis of our future work within the VISTA portal will be on integration of additional data that is necessary for biological and medical researchers to carry out their analyses. We plan to integrate into our portal information about human variation, especially where it is known that some variation has a correlation with medical disorders. We will also continue to work on providing our users with simple-to-use interfaces for browsing genomic data: we have been developing methodologies to display various evolutionary events in the context of the underlying phylogenetic trees (21,22) and expect to make similar improvements for visualizing rearrangements between the various genomes. Finally, the new alignment pipeline implemented within the VISTA portal should be both flexible and powerful enough to analyze many of the genomes that are currently being sequenced. Consequently the majority of our alignment-related work in the near future will be on maintaining up-to-date versions of novel genomes, including low-coverage genomes that are currently being sequenced.

21 in total

1. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

Authors: Michael Brudno; Chuong B Do; Gregory M Cooper; Michael F Kim; Eugene Davydov; Eric D Green; Arend Sidow; Serafim Batzoglou
Journal: Genome Res Date: 2003-03-12 Impact factor: 9.043

2. MATCH: A tool for searching transcription factor binding sites in DNA sequences.

Authors: A E Kel; E Gössling; I Reuter; E Cheremushkin; O V Kel-Margoulis; E Wingender
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

3. Glocal alignment: finding rearrangements during alignment.

Authors: Michael Brudno; Sanket Malde; Alexander Poliakov; Chuong B Do; Olivier Couronne; Inna Dubchak; Serafim Batzoglou
Journal: Bioinformatics Date: 2003 Impact factor: 6.937

4. Phylo-VISTA: interactive visualization of multiple DNA sequence alignments.

Authors: Nameeta Shah; Olivier Couronne; Len A Pennacchio; Michael Brudno; Serafim Batzoglou; E Wes Bethel; Edward M Rubin; Bernd Hamann; Inna Dubchak
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937

5. TreeQ-VISTA: an interactive tree visualization tool with functional annotation query capabilities.

Authors: Shengyin Gu; Iain Anderson; Victor Kunin; Michael Cipriano; Simon Minovitsky; Gunther Weber; Nina Amenta; Bernd Hamann; Inna Dubchak
Journal: Bioinformatics Date: 2007-01-17 Impact factor: 6.937

6. VISTA: visualizing global DNA sequence alignments of arbitrary length.

Authors: C Mayor; M Brudno; J R Schwartz; A Poliakov; E M Rubin; K A Frazer; L S Pachter; I Dubchak
Journal: Bioinformatics Date: 2000-11 Impact factor: 6.937

7. Human non-synonymous SNPs: server and survey.

Authors: Vasily Ramensky; Peer Bork; Shamil Sunyaev
Journal: Nucleic Acids Res Date: 2002-09-01 Impact factor: 16.971

8. Automated whole-genome multiple alignment of rat, mouse, and human.

Authors: Michael Brudno; Alexander Poliakov; Asaf Salamov; Gregory M Cooper; Arend Sidow; Edward M Rubin; Victor Solovyev; Serafim Batzoglou; Inna Dubchak
Journal: Genome Res Date: 2004-04 Impact factor: 9.043

9. Human-mouse alignments with BLASTZ.

Authors: Scott Schwartz; W James Kent; Arian Smit; Zheng Zhang; Robert Baertsch; Ross C Hardison; David Haussler; Webb Miller
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

10. Strategies and tools for whole-genome alignments.

Authors: Olivier Couronne; Alexander Poliakov; Nicolas Bray; Tigran Ishkhanov; Dmitriy Ryaboy; Edward Rubin; Lior Pachter; Inna Dubchak
Journal: Genome Res Date: 2003-01 Impact factor: 9.043
