Roland Schwarz, Philipp N Seibel, Sven Rahmann, Christoph Schoen, Mirja Huenerberg, Clemens Müller-Reible, Thomas Dandekar, Rachel Karchin, Jörg Schultz, Tobias Müller.
Abstract
Multiple sequence alignments (MSAs) are one of the most important sources of information in sequence analysis. Many methods have been proposed to detect, extract and visualize their most significant properties. Site-specific methods such as sequence logos successfully visualize site conservation, and sequence-based methods such as clustering approaches detect relationships between sequences. Both types of methods, however, fail to reveal informational elements of MSAs at the level of sequence-site interactions, i.e. to find clusters of sequences together with the sites responsible for their clustering, which jointly account for a high fraction of the overall information in the MSA. To fill this gap, we present a method that combines the Fisher score-based embedding of sequences from a profile hidden Markov model (pHMM) with correspondence analysis. This method can detect and visualize group-specific or conflicting signals in an MSA and allows a detailed exploratory investigation of alignments of any size tractable by pHMMs. We illustrate the method on an alignment of the Neisseria surface antigen LP2086, where it is used to detect sites of recombinatory horizontal gene transfer, and on the vitamin K epoxide reductase family, where it is used to distinguish between evolutionary and functional signals.
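To make the correspondence analysis step more concrete, the following is a minimal sketch of classical correspondence analysis on a small non-negative table, computed with a singular value decomposition. It is not the authors' implementation: the function name `correspondence_analysis`, the toy table, and the choice of a non-negative count-like input are assumptions for illustration only, and the abstract does not specify how the pHMM Fisher scores are prepared before this step. Rows stand in for sequences and columns for sites.

```python
import numpy as np

def correspondence_analysis(table):
    """Classical CA of a non-negative two-way table (hypothetical helper)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses (sequences)
    c = P.sum(axis=0)                     # column masses (sites)
    expected = np.outer(r, c)             # independence model
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                     # drop numerically empty axes
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    # Principal coordinates: rows and columns share the same axes,
    # so both can be drawn in one scatterplot.
    row_coords = (U / np.sqrt(r)[:, None]) * sv
    col_coords = (Vt.T / np.sqrt(c)[:, None]) * sv
    inertia = sv ** 2                     # inertia per component axis
    return row_coords, col_coords, inertia / inertia.sum()

# Made-up toy table: two groups of rows, each associated with two columns.
toy_table = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
]
row_coords, col_coords, explained = correspondence_analysis(toy_table)
print("fraction of inertia per axis:", np.round(explained, 3))
print("row coordinates on axis 1:", np.round(row_coords[:, 0], 3))
print("column coordinates on axis 1:", np.round(col_coords[:, 0], 3))
```

In this toy table the first axis separates the two row groups, and the columns that drive each group fall on the same side of the axis as their rows, which is the kind of joint sequence-site display the abstract describes.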
Year: 2009 PMID: 19661281 PMCID: PMC2764451 DOI: 10.1093/nar/gkp634
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Artificial example: the MSA (a) and its cluster tree (b) as used in our toy example. Panels (c) and (d) are scatterplots of the first three component axes, which together account for 100% of the inertia in the data. The CA plots present sequences (black circles) and sites (red crosses) in an integrated manner. For easier interpretation, the most important sites are explicitly shown in the plots with their nucleotide letters and alignment positions. Roman numerals indicate the splits in the cluster tree and the component axes resolving them.
Figure 2. Analysis of the LP2086 sequence family. (a) Evolutionary network reconstructed from a distance matrix on 47 unique sequences. Fletcher subfamilies A and B are clearly separated. The further sub-clusters 1 and 2 are marked in color. (b) Schematic representation of the complete alignment of 114 LP2086 sequences. Major parts of the alignment (from position 100 onward) have a block structure corresponding to Fletcher subfamilies A and B, whereas a 30 amino acid region at the beginning supports a different grouping. (c) CA plot of component axes 1 and 3. The method groups the relevant clusters, isolating each from the rest, and identifies the relevant sites. The groups are colored as in the evolutionary network.
Figure 3. Analysis of the VKOR sequence family. (a) Phylogenetic tree of the VKOR protein family. (b) Sequence logo of the MSA including the proposed membrane topology of VKORC1 with conserved positions for VKORC1L1 (44). The conserved VKORC1L1-specific amino acids are marked in yellow. Pink-labeled amino acids are specific to VKORC1L1 and to the VKORC1 protein of fish. In the third transmembrane domain, the blue circles symbolize the redox center (CIVC motif), and the supposed warfarin binding site with the TYA motif is highlighted in red. (c) Scatterplot of the second and third principal factors. Sequences are depicted as black circles, sites as red crosses. Closeness of sequences and sites in the plot indicates the strength of their association.