Markus Seiler, Alexander Mehrle, Annemarie Poustka, Stefan Wiemann.
Abstract
BACKGROUND: The identification of patterns in biological sequences is a key challenge in genome analysis and in proteomics. Such patterns are frequently complex and highly variable, especially in protein sequences. They are often described with regular expressions (RegEx) because of the user-friendly terminology. As patterns grow more complex, however, RegEx queries reach their limits and enhanced capabilities are required. This is especially true for patterns containing ambiguous characters and positions and/or length ambiguities.
Year: 2006 PMID: 16542452 PMCID: PMC1523217 DOI: 10.1186/1471-2105-7-144
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of terms. Sequences, patterns, and solutions can be broken down into their elemental parts. The sliding window is the part of the sequence that is to be searched; its size is defined by the maximal size of the total pattern. The total pattern is split into subpatterns that are suited for computation in the 3of5 algorithm. A matching subpattern yields a subsolution. Every branch of a solution tree becomes a total solution once the final subpattern has also matched.
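The relation between the total pattern and the sliding window can be illustrated with a minimal sketch. The `Subpattern` tuple and the function names are hypothetical and chosen only for illustration; they are not taken from the 3of5 implementation. The sketch assumes that the window advances one residue at a time and that shorter windows at the sequence end are still inspected.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical representation of a subpattern:
# (allowed residues, or None for "any residue"; minimal length; maximal length)
Subpattern = Tuple[Optional[str], int, int]


def max_pattern_size(subpatterns: List[Subpattern]) -> int:
    """Maximal size of the total pattern: the sum of the maximal subpattern lengths."""
    return sum(max_len for _, _, max_len in subpatterns)


def sliding_windows(sequence: str, subpatterns: List[Subpattern]) -> Iterator[Tuple[int, str]]:
    """Yield (start position k, window) pairs; the window size is the maximal pattern size."""
    size = max_pattern_size(subpatterns)
    for k in range(len(sequence)):
        # Windows near the sequence end are shorter than `size`.
        yield k, sequence[k:k + size]


# Total pattern K{1,3}[KRH], split into two subpatterns.
pattern = [("K", 1, 3), ("KRH", 1, 1)]
for k, window in sliding_windows("AKKRHG", pattern):
    print(k, window)
```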
Figure 2. The subpattern attempts. The sequence in the sliding window is investigated for matches of the total pattern, individually for every start position k. The total pattern is first split into subpatterns, which are analyzed in consecutive subpattern attempts. Adjacent subpatterns may not overlap but must be consecutive. A successful subpattern attempt leads to a subsolution (not displayed) and initiates a subpattern attempt with the adjacent subpattern. A total solution is obtained when the last subpattern has led to a subsolution.
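The chain of subpattern attempts described in this caption can be sketched as follows. This is a simplified illustration rather than the 3of5 code: subpatterns are assumed to have a fixed length, and the names `attempt_subpattern` and `total_solutions` are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical fixed-length subpattern: (allowed residues or None for any, length)
Subpattern = Tuple[Optional[str], int]


def attempt_subpattern(sequence: str, pos: int, sub: Subpattern) -> Optional[str]:
    """Return the subsolution starting at pos, or None if the subpattern does not match."""
    residues, length = sub
    segment = sequence[pos:pos + length]
    if len(segment) < length:
        return None  # ran past the end of the sequence
    if residues is not None and any(c not in residues for c in segment):
        return None
    return segment


def total_solutions(sequence: str, subpatterns: List[Subpattern]) -> List[Tuple[int, List[str]]]:
    """Try every start position k; chain the subpattern attempts without gaps or overlaps."""
    solutions = []
    for k in range(len(sequence)):
        pos, subsolutions = k, []
        for sub in subpatterns:
            segment = attempt_subpattern(sequence, pos, sub)
            if segment is None:
                break  # a failed attempt ends the chain for this start position
            subsolutions.append(segment)
            pos += len(segment)  # the next subpattern begins immediately downstream
        else:
            # The last subpattern also matched: a total solution.
            solutions.append((k, subsolutions))
    return solutions


# Total pattern K{2}.[KRH] as three consecutive subpatterns.
print(total_solutions("AKKARKKGH", [("K", 2), (None, 1), ("KRH", 1)]))
```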
Figure 3. The multivalence loop within the subpattern attempt. Length-ambiguous subpatterns may lead to different subsolutions. A loop of subpattern attempts, the so-called multivalence loop, is initiated to iteratively find all subsolutions that share the same start position. In the schema shown, subpattern 2 is length-ambiguous. Initially, a match of the subpattern to the target sequence is attempted with its maximal size (a). The stretch is then shortened by one position at a time ("-1"), measured from the end of the previous subsolution (shaded stretches), to investigate whether smaller subsolutions can also be found (b, c). Note: The indicated start of subpattern 3 is only valid for subsolution 2(a). Since subsolutions are required to be directly adjacent, subsolutions 2(b) or 2(c) would require a subsolution 3 to begin immediately downstream.
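A minimal sketch of the multivalence loop, assuming a subpattern is described by its allowed residues and a length range. The recursive generator below is one possible way to enumerate the branches of the solution tree and is not claimed to be how 3of5 implements it; all names are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical subpattern: (allowed residues or None for any, minimal length, maximal length)
Subpattern = Tuple[Optional[str], int, int]


def subsolutions(sequence: str, pos: int, sub: Subpattern) -> Iterator[str]:
    """Multivalence loop: try the maximal size first, then shorten by one position at a time."""
    residues, min_len, max_len = sub
    for length in range(max_len, min_len - 1, -1):
        segment = sequence[pos:pos + length]
        if len(segment) < length:
            continue  # not enough sequence left for this size
        if residues is None or all(c in residues for c in segment):
            yield segment


def solutions_from(sequence: str, pos: int, subpatterns: List[Subpattern]) -> Iterator[List[str]]:
    """Yield every total solution that starts at pos (one per branch of the solution tree)."""
    if not subpatterns:
        yield []
        return
    first, rest = subpatterns[0], subpatterns[1:]
    for segment in subsolutions(sequence, pos, first):
        # The next subpattern must begin directly after this subsolution.
        for tail in solutions_from(sequence, pos + len(segment), rest):
            yield [segment] + tail


# Total pattern A.{1,3}C; the middle subpattern is length-ambiguous.
pattern = [("A", 1, 1), (None, 1, 3), ("C", 1, 1)]
print(list(solutions_from("ABCCC", 0, pattern)))
```

In this toy input all three sizes of the middle subpattern lead to a total solution, because a `C` follows immediately downstream in each case.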
Common regular expressions and the n-of-m pattern type in the 3of5 application. Individual common RegEx terms are displayed as they can be applied in 3of5. Types of allowed ambiguities in the individual RegEx terms are listed. "no": no ambiguity; "yes": ambiguity can be expressed with that particular term; "any": ambiguity with any residue allowed. Notes: (1) The general term "ambiguity" used in the text is extended here to "content ambiguity" to distinguish it from "length ambiguity".
| Pattern description | RegEx term | Content ambiguity (1) | Length ambiguity |
| --- | --- | --- | --- |
| Discrete character in one position | K | no | no |
| Subset of characters for one position | [KRH] | yes | no |
| Arbitrary character in one position | | any | no |
| Stretch of identical characters, with fixed length | K{3} | no | no |
| Stretch composed of a subset of characters, with fixed length | [KRH]{3} | yes | no |
| Stretch of identical characters, with variable length | K{1,3} | no | yes |
| Stretch composed of a subset of characters, with variable length | [KRH]{1,3} | yes | yes |
| Stretch with arbitrary characters, with variable length | | any | yes |
| Stretch composed of a subset of characters that need to be present with a defined number of matches within sequence of otherwise arbitrary composition, with fixed length | (3of5) (KRH) | yes | no |
| Stretch composed of different subsets of characters that need to be present with defined numbers of matches within sequence of otherwise arbitrary composition, with fixed length | (nof5) ((min3) (KRH) (max1) (P)) | yes | no |
| Any stretch describable by a pattern which should not contain the characters defined in the [^ ] brackets | [AGC]{2,5}[^KRH] | no | no |
| Pattern begins at sequence start | ^KKK | no | no |
| Pattern ends at sequence end | KKK$ | no | no |
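The n-of-m pattern type listed in the table has no direct counterpart in common RegEx syntax, so a small sketch may help. It assumes that "(3of5) (KRH)" means at least three of five consecutive positions carry K, R, or H, and that "(min3)" and "(max1)" are lower and upper bounds on the number of matching positions; this reading, and the function name `n_of_m_matches`, are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical constraint: (residue subset, minimal count, maximal count or None for no limit)
Constraint = Tuple[str, int, Optional[int]]


def n_of_m_matches(sequence: str, m: int, constraints: List[Constraint]) -> List[Tuple[int, str]]:
    """Return (start, stretch) for every stretch of length m that satisfies all constraints."""
    hits = []
    for start in range(len(sequence) - m + 1):
        stretch = sequence[start:start + m]
        ok = True
        for residues, min_count, max_count in constraints:
            count = sum(1 for c in stretch if c in residues)
            if count < min_count or (max_count is not None and count > max_count):
                ok = False
                break
        if ok:
            hits.append((start, stretch))
    return hits


# Assumed reading of "(3of5) (KRH)": at least three of five positions are K, R, or H.
print(n_of_m_matches("AKRKPAAA", 5, [("KRH", 3, None)]))

# Assumed reading of "(nof5) ((min3) (KRH) (max1) (P))": additionally at most one P.
print(n_of_m_matches("KRKPPA", 5, [("KRH", 3, None), ("P", 0, 1)]))
```

Positions that are not counted by any constraint are treated as arbitrary, which mirrors the "otherwise arbitrary composition" wording of the table.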
Figure 4. 3of5 web interface. Three different patterns were entered to be searched for in the sequence of the nucleoplasmin protein of Xenopus laevis [Swiss-Prot:P05221]. Header lines starting with ">>" indicate grouped patterns, a feature of the "FASTA grouped" mode. Two posttranslational modification patterns (PKC and Amidation) are thus combined into the group "Posttransl. motif". A second group, "Localization motif", contains one pattern (nucleoplasmin NLS [Prosite:PS00015]) in the example. The pattern format is selected by activating the appropriate check box at the top of the pattern window. The sequence to be investigated is copied into the sequence window in FASTA, multiple FASTA, or simple text format. Output in XML is optional.
Figure 5. 3of5 result page for grouped patterns. The nucleoplasmin protein of Xenopus laevis was analyzed for a set of posttranslational and localization motifs as shown in Figure 4. Matches are ordered by their sequence position, separately for every grouped pattern. A link on the right-hand side opens a popup window with a detailed description of the respective pattern parts. Matches are color-coded. Red: matching discrete characters; blue: matching characters from a subset of characters possible in one position; green: matching subpatterns of the n-of-m pattern type; black and lowercase letters: arbitrary characters. The activated popup window in the figure displays the total pattern and four pattern parts of the nucleoplasmin NLS pattern [Prosite:PS00015].
Figure 6. A length-ambiguous pattern and the derived solution cohort. The length ambiguity ".{4,8}" within the EGF-like domain signature 2 [Prosite:PS01186] "C.C.{2}[GP][FYW].{4,8}C" may lead to more than one match per sequence position. For example, the sequence of the tumor necrosis factor receptor [Swiss-Prot:Q9Y6Q6] has three solutions (a-c), which thus form a solution cohort. The sequence parts of arbitrary content are displayed as numbers in the solutions.
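The idea of a solution cohort can be reproduced with Python's standard `re` module by fixing the length of the ambiguous stretch to each allowed value in turn. This is only an illustration of the concept, not the 3of5 algorithm; the helper `solution_cohort` is hypothetical, and the test sequence is an invented toy string, not the Q9Y6Q6 sequence.

```python
import re
from typing import List, Tuple


def solution_cohort(sequence: str, prefix: str, gap_min: int, gap_max: int,
                    suffix: str) -> List[Tuple[int, int, int]]:
    """Return sorted (start, end, gap length) for every match of prefix + .{n} + suffix,
    with n running over all allowed lengths of the arbitrary stretch."""
    hits = []
    for n in range(gap_min, gap_max + 1):
        # Fix the length of the arbitrary stretch to exactly n residues.
        regex = re.compile(f"{prefix}.{{{n}}}{suffix}")
        for start in range(len(sequence)):
            match = regex.match(sequence, start)  # anchored at position `start`
            if match:
                hits.append((start, match.end(), n))
    return sorted(hits)


toy = "CACAAGFAAAACACAC"  # invented sequence for illustration
cohort = solution_cohort(toy, "C.C.{2}[GP][FYW]", 4, 8, "C")
print(cohort)

# For comparison, a single regex match at position 0 reports only one member of the cohort.
print(re.match("C.C.{2}[GP][FYW].{4,8}C", toy).end())
```

On this toy sequence, three gap lengths produce a match at the same start position, whereas the single greedy regex match reports only the longest one.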