Julie D Thompson, Arnaud Muller, Andrew Waterhouse, Jim Procter, Geoffrey J Barton, Frédéric Plewniak, Olivier Poch.
Abstract
BACKGROUND: In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family.
Year: 2006 PMID: 16792820 PMCID: PMC1539025 DOI: 10.1186/1471-2105-7-318
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Distribution of the number of sequences/alignment in the well characterised data set used for validation.
Figure 2. Schematic overview of the MACSIMS algorithm.
Figure 3. Decision rules for feature validation.
Figure 4. Decision rules for feature propagation.
Feature propagation criteria
| Feature type | Description | Data source | Feature category | Core block coverage |
| --- | --- | --- | --- | --- |
| DOMAIN | Structural/functional domain | Uniprot (predicted) | Domain | 40% |
| PFAM-A | Pfam database domain | Pfam (reliable) | Domain | 40% |
| PROSITE | Prosite motif or domain pattern | Prosite (predicted) | Single residue | 100% |
| STRUCT | Secondary structure element | Uniprot/PDB (reliable) | SSE | 70% |
| MODRES | Modified residue | Uniprot (predicted) | Single residue | 100% |
| SITE | Active site | Uniprot (predicted) | Single residue | 100% |
| VARSPLIC | Splicing variant | Uniprot (not propagated) | N/A | N/A |
| VARIANT | Residue variants or mutations | Uniprot (not propagated) | N/A | N/A |
| BLOCK | Conserved core block | Calculated in MACSIMS | N/A | N/A |
| REGION | Conserved region | Calculated in MACSIMS | N/A | N/A |
| LOWCOMP | Low complexity segment | Calculated in MACSIMS | N/A | N/A |
| TRANSMEM | Potential transmembrane helix | Calculated in MACSIMS | N/A | N/A |
| COIL | Potential coiled coil | Calculated in MACSIMS | N/A | N/A |
Data source indicates the original database from which the feature type is retrieved ("predicted" indicates a feature type that may contain predicted/unreliable information; "reliable" indicates a feature type that is assumed to be manually verified/reliable). Feature category refers to the three categories (Domain, SSE and Single residue) used to determine the criteria for feature propagation. Core block coverage indicates the percentage of the feature that should be covered by core blocks for the feature to be propagated.
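The coverage criterion described above can be sketched in a few lines of Python. This is only an illustration of the threshold check, not the MACSIMS implementation: the interval representation, the function names `covered_fraction` and `may_propagate`, and the use of inclusive alignment-column ranges are all assumptions made for the example. The full decision rules (Figure 4) are not reproduced here; only the thresholds from the table are used.

```python
from typing import List, Tuple

# Thresholds copied from the "Core block coverage" column of the table above,
# keyed by feature category.
COVERAGE_THRESHOLD = {
    "Domain": 0.40,
    "SSE": 0.70,
    "Single residue": 1.00,
}

# Hypothetical representation: an inclusive (start, end) range of alignment columns.
Interval = Tuple[int, int]


def covered_fraction(feature: Interval, core_blocks: List[Interval]) -> float:
    """Return the fraction of the feature's columns that lie inside any core block."""
    start, end = feature
    covered = set()
    for block_start, block_end in core_blocks:
        lo, hi = max(start, block_start), min(end, block_end)
        if lo <= hi:
            # Using a set avoids double counting when core blocks overlap.
            covered.update(range(lo, hi + 1))
    return len(covered) / (end - start + 1)


def may_propagate(feature: Interval, category: str, core_blocks: List[Interval]) -> bool:
    """Apply only the coverage threshold; other decision rules are out of scope."""
    return covered_fraction(feature, core_blocks) >= COVERAGE_THRESHOLD[category]


# A 10-column feature with 4 columns inside a core block.
print(may_propagate((10, 19), "Domain", [(5, 13)]))
print(may_propagate((10, 19), "SSE", [(5, 13)]))
```

Under these assumptions, the same feature position can pass as a domain and fail as a secondary structure element, because the required coverage differs by category.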
Benchmark test results
| Feature type | Percent identity | Query features | Homolog features | True positive | False positive | New features |
| --- | --- | --- | --- | --- | --- | --- |
| PFAM-A domain | <90% | 161 | 161 | 160 | 0 | 4 |
| | <50% | 161 | 150 | 149 | 0 | 3 |
| PROSITE pattern | <90% | 166 | 165 | 160 | 0 | 5 |
| | <50% | 166 | 153 | 148 | 0 | 4 |
| Uniprot site | <90% | 360 | 305 | 260 | 0 | 64 |
| | <50% | 360 | 288 | 235 | 0 | 56 |
| Secondary structure | <90% | 1486 | 1265 | 1150 | 1 | 987 |
| | <50% | 1486 | 1197 | 1009 | 1 | 802 |
| Total | <90% | 2283 | 1896 | 1730 | 1 | 1060 |
| | <50% | 2283 | 1788 | 1541 | 1 | 865 |
Percent identity indicates the maximum similarity of the sequences in the alignment with the query. 'Query features' is the number of sequence features for the query available in the public databases. 'Homolog features' is the number of features found in the other sequences in the alignment that correspond to a feature in the query. True (or false) positives indicate the number of features propagated by MACSIMS that match (or do not match) known query features. 'New features' is the number of features predicted by MACSIMS that are not currently found in the public databases.
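As a reading aid, the counts in the table can be turned into a simple ratio of true positives to homolog features. This ratio is an illustrative summary chosen for this example, not a metric defined in the source, and the names `BenchmarkRow`, `ROWS` and `recovered_fraction` are hypothetical. The numbers are copied from the table as given.

```python
from typing import List, NamedTuple


class BenchmarkRow(NamedTuple):
    feature_type: str
    max_identity: str
    query_features: int
    homolog_features: int
    true_positive: int
    false_positive: int
    new_features: int


# Values copied verbatim from the benchmark table above.
ROWS: List[BenchmarkRow] = [
    BenchmarkRow("PFAM-A domain", "<90%", 161, 161, 160, 0, 4),
    BenchmarkRow("PFAM-A domain", "<50%", 161, 150, 149, 0, 3),
    BenchmarkRow("PROSITE pattern", "<90%", 166, 165, 160, 0, 5),
    BenchmarkRow("PROSITE pattern", "<50%", 166, 153, 148, 0, 4),
    BenchmarkRow("Uniprot site", "<90%", 360, 305, 260, 0, 64),
    BenchmarkRow("Uniprot site", "<50%", 360, 288, 235, 0, 56),
    BenchmarkRow("Secondary structure", "<90%", 1486, 1265, 1150, 1, 987),
    BenchmarkRow("Secondary structure", "<50%", 1486, 1197, 1009, 1, 802),
    BenchmarkRow("Total", "<90%", 2283, 1896, 1730, 1, 1060),
    BenchmarkRow("Total", "<50%", 2283, 1788, 1541, 1, 865),
]


def recovered_fraction(row: BenchmarkRow) -> float:
    """Illustrative ratio: propagated true positives per available homolog feature."""
    return row.true_positive / row.homolog_features


for row in ROWS:
    print(f"{row.feature_type:20s} {row.max_identity:5s} {recovered_fraction(row):.3f}")
```

Comparing the `<90%` and `<50%` rows for the same feature type shows how the ratio changes when only more distant homologs are available in the alignment.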
Figure 5. Example MACSIMS alignment analysis presented in the JalView applet. A. Overview of complete alignment. Regions calculated by MACSIMS are coloured according to their phylogenetic distribution. The red box indicates the section of the alignment shown in B and C. A conservation score [50] for each alignment column is shown below the alignment. B. Detailed view of one part of the alignment. Metal binding and active site residues are indicated, together with Prosite motifs. Mutated residues are shown with a pink background. C. The same part of the alignment as in B, with secondary structure elements highlighted.