Kirill S Antonets, Anton A Nizhnikov.
Abstract
The composition of a defined set of subunits (nucleotides, amino acids) is one of the key features of biological sequences. Compositional biases are local shifts in amino acid or nucleotide frequencies that can occur as an adaptation of an organism to an extreme ecological niche, or as the signature of a specific function or localization of the corresponding protein. Calculating the probability of a subsequence is one method for annotating compositional bias, and it allows biased subsequences to be detected accurately. Here, we present Sequence Analysis based on the Ranking of Probabilities (SARP), a novel algorithm that annotates compositional biases by ranking subsequences by their probabilities. SARP provides the same accuracy as the previously published Lower Probability Subsequences (LPS) algorithm but runs approximately 230-fold faster. It can be recommended for work with large datasets to reduce the time and resources required.
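The idea of ranking subsequences by probability can be illustrated with a minimal sketch. It is not the SARP or LPS implementation: it assumes a binomial model for the count of one residue in a fixed-length window, and the names `binomial_pmf` and `rank_windows`, the background frequency, and the window length are all hypothetical choices made for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of seeing exactly k target residues in n positions,
    given a background frequency p (assumed binomial model)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def rank_windows(seq, residue, background, window):
    """Score every fixed-length window by the probability of its residue
    count and return (probability, start, count) tuples, lowest first."""
    scored = []
    for start in range(len(seq) - window + 1):
        k = seq[start:start + window].count(residue)
        scored.append((binomial_pmf(k, window, background), start, k))
    # Sorting tuples orders by probability, then by start position.
    return sorted(scored)


# Toy sequence with an alanine-rich stretch in the middle.
ranked = rank_windows("GGGGAAAAAGGGG", "A", background=0.1, window=5)
best_probability, best_start, best_count = ranked[0]
```

The lowest-probability window is the one that covers the alanine run. A real annotation tool would also vary the window length and merge overlapping hits, which this sketch does not attempt.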
Keywords: algorithm; composition; probability; protein; sequence analysis
Year: 2013 PMID: 23919085 PMCID: PMC3728207 DOI: 10.4137/EBO.S12299
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. A workflow of SARP. The steps of the algorithm are shown, with its beginning and end indicated. Steps that branch into alternative solutions are drawn as rhombuses.
Figure 2. The scheme of the search for LPSs in an example sequence with SARP for the amino acid alanine (A). The groups of fragments with equal probability and their subgroups are shown. The process of extension is demonstrated for the first group.
Figure 3. The scheme represents all LPSs found with the original LPS algorithm and with SARP in a set of 1000 yeast proteins. The numbers are the GI numbers of proteins in the NCBI database. The horizontal black line represents a protein sequence. Different green regions represent overlapping LPSs found with the original algorithm and SARP. Blue regions denote parts of an LPS that were not identified by the original algorithm. Vertical red dashes denote an exact match of the LPS boundaries found with the original LPS algorithm and SARP.
A comparison of the efficiency between SARP and the original LPS algorithm.
| Parameter | LPS algorithm | SARP |
|---|---|---|
| Total protein length, aa | 489217 | 489217 |
| Total time, ms | 341914177 | 1464619 |
| Average protein length, aa | 489.2 | 489.2 |
| Average time per protein, ms | 341914.2 | 1464.619 |
| Times faster | 1 | 233.45 |
Note: The "Times faster" row gives the ratio of the running time of the LPS algorithm to that of SARP, with the running time of the LPS algorithm set to 1.
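The derived rows of the table follow directly from the totals, and the arithmetic can be checked with a short sketch. The variable names are hypothetical. The protein count of 1000 is taken from the figure captions, and the other values are copied from the table.

```python
# Totals reported in the table.
total_length_aa = 489217
lps_total_ms = 341914177
sarp_total_ms = 1464619
n_proteins = 1000  # size of the protein set, per the figure captions

# Per-protein averages.
avg_length_aa = total_length_aa / n_proteins
lps_avg_ms = lps_total_ms / n_proteins
sarp_avg_ms = sarp_total_ms / n_proteins

# Speed-up of SARP relative to the LPS algorithm (LPS = 1).
times_faster = lps_total_ms / sarp_total_ms

print(f"average length: {avg_length_aa:.1f} aa")
print(f"LPS average:    {lps_avg_ms:.1f} ms")
print(f"SARP average:   {sarp_avg_ms:.3f} ms")
print(f"times faster:   {times_faster:.2f}")
```

The ratio computed this way agrees with the approximately 230-fold figure quoted in the abstract.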
Figure 4. (A) A distribution of computation times for separate proteins of different lengths using SARP. Lengths of protein sequences (aa) and times of computation (ms) are shown. (B) The same as A, for the original LPS algorithm. (C) A histogram of the relative numbers of proteins from the analyzed set of 1000 sequences, grouped by length. (D) A comparison of CPU running times for the original LPS algorithm and SARP as a function of protein length. The SARP columns are nearly invisible because its computation time is very short relative to that of the original algorithm. The results are indicated as the mean ± the confidence interval (P ≥ 0.95). (E) A separate histogram of SARP computation times. The results are indicated as the mean ± the confidence interval (P ≥ 0.95). (F) The ratio between the CPU times for the original LPS algorithm and SARP in the groups of proteins arranged by length (aa).
Figure 5. (A) The frequencies of amino acids for the sets of 250 proteins for each of the five species analyzed. The means of the frequencies are indicated as percentages. (B) The average protein length (aa) is indicated for the set of sequences from each species. (C) Computation time using the original LPS algorithm for the sets of 250 proteins from five different species. Computation time is indicated in ms. The results are shown as the mean ± the confidence interval (P ≥ 0.95). (D) The same as C, for SARP.