Olivier Bastien, Eric Maréchal.
Abstract
BACKGROUND: Confidence in pairwise alignments of biological sequences, obtained by methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data. Two statistical models have been proposed. In the asymptotic limit of long sequences, the Karlin-Altschul model is based on the computation of a P-value, assuming that the number of high-scoring matching regions above a threshold is Poisson distributed. Alternatively, the Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by Monte Carlo simulation. Following the TULIP theorem, Z-values allow the deduction of an upper bound of the P-value (1/Z-value²). Simulated Z-value distributions are known to fit a Gumbel law. This remarkable property had not been demonstrated and had no obvious biological support.
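The Z-value idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `z_value`, `toy_local_score`, the shuffle count and the match/mismatch scores are all hypothetical names and settings chosen here. In particular, `toy_local_score` is a simplified ungapped stand-in for Smith-Waterman with a real substitution matrix.

```python
import random
import statistics


def toy_local_score(a, b, match=2, mismatch=-1):
    """Hypothetical stand-in for Smith-Waterman: best ungapped local score over all diagonals."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        running = 0
        for i in range(max(0, offset), min(len(a), len(b) + offset)):
            j = i - offset
            running = max(0, running + (match if a[i] == b[j] else mismatch))
            best = max(best, running)
    return best


def z_value(seq_a, seq_b, score_fn=toy_local_score, n_shuffles=200, seed=0):
    """Monte Carlo Z-value: (observed score - mean of shuffled scores) / their standard deviation."""
    rng = random.Random(seed)
    observed = score_fn(seq_a, seq_b)
    residues = list(seq_b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps length and composition, destroys residue order
        shuffled_scores.append(score_fn(seq_a, "".join(residues)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.stdev(shuffled_scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance; Z-value is undefined")
    return (observed - mu) / sigma
```

Shuffling only one of the two sequences is one common design choice; shuffling both, or using more shuffles, changes the estimate of the mean and standard deviation but not the definition of the Z-value.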
Year: 2008 PMID: 18687111 PMCID: PMC2529321 DOI: 10.1186/1471-2105-9-332
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Aging properties of amino acids. Protein sequences are considered as systems whose components are amino acids. Over time, amino acids are either conserved (the similarity of a residue with its descendant is that of identity, the diagonal term of a substitution matrix) or modified by random DNA mutations. Similarity therefore decreases with time, since no similarity is higher than that of identity. When the similarity falls below the threshold needed for the residue to operate according to a standard (functional conservation), the component is damaged. (A) Score distribution corresponding to valine substitution. In this case, the score distribution is exponential, suggesting that valine (V) is a non-aging component. Based on BLOSUM62, residues of this type are V, F, P, W, Y, E, G, H, I, L, K, R, N, D and C. (B) Score distribution corresponding to threonine substitution. The score distribution shows a peak, indicating a probable accelerated process of aging (functional damage) when the residue is substituted by random mutation with some other amino acids. Based on BLOSUM62, residues of this type are T, S, M, A and Q. (C) Score distribution in the BLOSUM62 similarity matrix. The complete distribution in the BLOSUM62 matrix is exponential (0.287·exp(−0.287·(s+4))), supporting a general model of amino acids as non-aging components. The exponential law for positive scores is characterized by the same parameter (λ' = 0.287). The original residue is termed i; its descendant is termed j.
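The link between an exponential score distribution and "non-aging" can be illustrated numerically. In reliability terms, a component does not age when its hazard rate (density divided by survival probability) is constant, which is the defining property of the exponential law. The sketch below uses the rate 0.287 and the shift of 4 from the caption; treating the integer scores as a continuous variable, and the function names, are assumptions made here for illustration.

```python
import math

LAMBDA = 0.287  # rate reported in the caption
S_MIN = -4      # lowest score; the caption's formula is shifted by +4


def score_density(s):
    """Density 0.287 * exp(-0.287 * (s + 4)) for s >= -4, zero below."""
    return LAMBDA * math.exp(-LAMBDA * (s - S_MIN)) if s >= S_MIN else 0.0


def survival(s):
    """P(score >= s) under the continuous exponential model (an assumption)."""
    return math.exp(-LAMBDA * (s - S_MIN)) if s >= S_MIN else 1.0


def hazard(s):
    """Hazard rate: density / survival. Constant for an exponential law."""
    return score_density(s) / survival(s)


for s in (-4, 0, 4, 11):
    print(f"s={s:>3}: density={score_density(s):.4f}, hazard={hazard(s):.3f}")
```

The hazard stays at 0.287 for every score, whereas a peaked distribution such as the one described for threonine would give a hazard that varies with s.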
Figure 2. Computing of the probability that the amount of information shared by two sequences, . Given an initial sequence a, we can envisage different scenarios for its evolution into another sequence b. In a first step (Step 1), an elementary probability is computed by taking into account the evolution of just one residue (here a1 into b1). Considering one possible evolutionary scenario (Step 2), residues are treated as independent and the probability is the product of the elementary probabilities for each position aligned in this scenario, with approximations in the asymptotic limit of long sequences. The final probability (Step 3) is then estimated by taking into account all the possible evolutionary scenarios.
Alignment statistics of the homologous Transcription initiation factor IIA (TFIIA) gamma chain sequences from Plasmodium falciparum and Arabidopsis thaliana.
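| Alignment method | Blastp | Smith-Waterman | Smith-Waterman |
| Substitution matrix | BLOSUM62 | BLOSUM62 | DirAtPf100 |
| Statistics | | | |
| Karlin-Altschul model (Blastp) | 0.008 | NA | NA |
| Z-value | 10 | 11 | 12 |
| T-value (1/Z-value²) | 0.01 | 8 × 10⁻³ | 7 × 10⁻³ |
| P-value (Z-value Gumbel distribution) | 1.5 × 10⁻⁶ | 3.7 × 10⁻⁷ | 1 × 10⁻⁷ |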
TFIIA gamma sequences from Plasmodium (UniProtKB Q8I4S7_PLAF7) and Arabidopsis (UniProtKB T2AG_ARATH) were aligned with the Blastp and Smith-Waterman methods. Statistics were computed following the Karlin-Altschul model (as implemented in the Blastp algorithm) or the Lipman-Pearson Z-value model. The upper bound for the P-value based on the TULIP theorem is given by the formula T-value = 1/Z-value². The P-value deduced from the Z-value Gumbel distribution was computed following the model presented here. Substitution matrices were either BLOSUM62 or the asymmetric DirAtPf100 matrix specified for Plasmodium vs. Arabidopsis comparisons. NA: not applicable.
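The two Z-value-based statistics in the table can be reproduced approximately with a short Python sketch. The T-value follows the formula given in the caption. The Gumbel P-value formula is not stated in this record, so the code assumes a Gumbel law standardized to mean 0 and variance 1, i.e. P(Z ≥ z) = 1 − exp(−exp(−zπ/√6 − γ)) with γ the Euler-Mascheroni constant. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def t_value(z):
    """TULIP upper bound on the P-value: 1 / Z-value^2 (formula from the caption)."""
    return 1.0 / z**2


def gumbel_p_value(z):
    """P(Z >= z) for a Gumbel variable standardized to mean 0, variance 1 (assumed form)."""
    x = z * math.pi / math.sqrt(6) + EULER_GAMMA
    # 1 - exp(-exp(-x)), written with expm1 to stay accurate for tiny probabilities
    return -math.expm1(-math.exp(-x))


for z in (10, 11, 12):
    print(f"Z={z}: T-value={t_value(z):.1e}, Gumbel P-value={gumbel_p_value(z):.1e}")
```

Under this assumption, Z = 10 gives a Gumbel P-value of about 1.5 × 10⁻⁶, in line with the first column. The values for Z = 11 and Z = 12 fall in the same order of magnitude as the table. Small differences are plausible if the tabulated Z-values are rounded. In every case the Gumbel P-value is several orders of magnitude below the T-value bound.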